Dirk van Dooren

How to create a web scraper/monitoring solution in AWS

16/08/2020

This tutorial will teach you how to monitor webpages for changes, send out notifications for these changes, and subscribe to the notifications by using AWS Lambdas. Additionally, you can use our COVID-19 monitoring solution to stay up to date in the Netherlands.

Intro

This blog was written with two purposes in mind: helping people keep up to date with the latest news on COVID-19 put out by the RIVM (the Dutch authority on health), and providing instructions on how to build a simple solution for custom monitoring of changes on a webpage.

The technical specifics


Lambda 1, Monitoring

The beating heart of this infrastructure resides in an AWS Lambda triggered by a CloudWatch event.

This lambda does four things: it GETs the COVID-19 page from the RIVM website, compares it with the previously stored version of the page, sends an update to the SNS topic if something changed, and stores the new version of the page for the next run.
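Before going into the details, the control flow of these four steps can be sketched with the I/O passed in as plain functions. This is only an illustration, not the Lambda itself: the name `check_for_update` and its four parameters are hypothetical, and skipping the notification when nothing has been stored yet is an assumption.

```python
from typing import Callable, Optional


def check_for_update(
    fetch: Callable[[], bytes],
    load_previous: Callable[[], Optional[bytes]],
    store: Callable[[bytes], None],
    notify: Callable[[str], None],
) -> bool:
    """Run one monitoring cycle and report whether a change was detected."""
    new_page = fetch()            # GET the current page
    old_page = load_previous()    # load the version saved by the previous run
    # Assumption: on the very first run (nothing stored) we stay silent.
    changed = old_page is not None and new_page != old_page
    if changed:
        notify("page changed")    # in the Lambda: publish to the SNS topic
    store(new_page)               # save this version for the next comparison
    return changed


# Usage with in-memory stand-ins instead of HTTP, S3 and SNS.
saved = {"page": b"<div>old</div>"}
sent = []
result = check_for_update(
    fetch=lambda: b"<div>new</div>",
    load_previous=lambda: saved["page"],
    store=lambda page: saved.update(page=page),
    notify=sent.append,
)
```

Keeping the I/O behind parameters like this makes the comparison logic easy to try out locally without any AWS resources.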

For this lambda we need four imports: boto3, urllib3, os, and BeautifulSoup (from bs4 import BeautifulSoup).

 
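First we get the page:

http = urllib3.PoolManager()
url = 'https://www.rivm.nl/nieuws/actuele-informatie-over-coronavirus'
resp = http.request('GET', url)  # the page is retrieved with GET

Then we get the previous version of the page from the S3 bucket and extract the latest post from both versions: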


# The previously stored version of the page
object_s3 = boto3.resource('s3') \
    .Bucket(BUCKET_NAME) \
    .Object(file_name)
old_page = object_s3.get().get('Body').read()

def find_latest_post(page):
    # Return the element that holds the latest news item
    soup = BeautifulSoup(page, features="html.parser")
    latest = soup.find("div", class_="LatestNews")
    return latest

new_page_news = find_latest_post(resp.data)
old_page_news = find_latest_post(old_page)

Make use of the find function in BeautifulSoup to get the HTML element you want to compare. For documentation purposes I will use a simple one.

Then we simply compare the news item just retrieved from the website with the news item that we had stored in the S3 bucket, both extracted with the function we just declared.

If we find a difference, we know the website has been updated and we publish a message to the SNS topic, which then notifies everyone who has subscribed.

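topic = 'arn of the topic'  # replace with the ARN of your SNS topic
sns_client = boto3.client('sns')
if new_page_news == old_page_news:
    print("nothing new")
else:
    # Assumption: the title is the text of the extracted element;
    # the original does not show how new_post_title is obtained.
    new_post_title = new_page_news.get_text(strip=True)
    sns_client.publish(
        TopicArn=topic,
        # Dutch for: New RIVM COVID19 update available: "<title>". View the post at <url>
        Message=f'Nieuwe RIVM COVID19 update beschikbaar: "{new_post_title}". Bekijk het bericht op {url}',
        Subject="RIVM COVID19 Update",
        MessageAttributes={
            'AWS.SNS.SMS.SenderID': {
                'DataType': 'String',
                'StringValue': 'CD19Update'
            }
        }
    )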

Lastly, we write the new page to the S3 bucket for the next comparison.

object_s3.put(Body=resp.data)  # resp.data is the page we just fetched
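One possible refinement of the comparison step is to compare a fingerprint of the visible text rather than the markup, so that a changed attribute or extra whitespace does not trigger a notification. The sketch below is an illustration only: `content_fingerprint` is a hypothetical helper, and the regular expression is a crude stand-in for calling `get_text()` on the BeautifulSoup element in the actual Lambda.

```python
import hashlib
import re


def content_fingerprint(html_fragment: str) -> str:
    """Hash the visible text of an HTML fragment, ignoring markup and spacing."""
    text = re.sub(r"<[^>]+>", " ", html_fragment)   # crude tag removal
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Same text, different markup and spacing: the fingerprints match.
same = content_fingerprint('<div class="LatestNews">  Update 1 </div>') == \
    content_fingerprint('<div class="LatestNews" id="x">Update 1</div>')
# Different text: the fingerprints differ.
different = content_fingerprint("<div>Update 1</div>") != \
    content_fingerprint("<div>Update 2</div>")
```

With this approach it would also be enough to store only the fingerprint in the bucket instead of the whole page.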

Now this lambda is done and we can have a look at the second lambda which is used to subscribe.

Lambda 2, Subscription

On the other side of the solution, a subscription form submits a POST to an API Gateway with Lambda integration. The lambda needs to do three things: check authorization, retrieve the contents of the body, and lastly subscribe these contents to the SNS topic.

Check the provided password like this. Preferably you would retrieve the expected password from Secrets Manager first, but this will also work:

# Inside the Lambda handler; requires "import json"
pwd = event['queryStringParameters']['pwd']
if pwd != password:  # password: the expected value, preferably retrieved from Secrets Manager
    return {
        'statusCode': 403,
        'body': json.dumps("Forbidden")
    }
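A slightly more defensive version of the check, together with the body retrieval step, might look like the sketch below. It rests on assumptions: the helper names `is_authorized` and `parse_body` are hypothetical, and the form is assumed to send a JSON body with a hypothetical `email` field. With the Lambda proxy integration, `queryStringParameters` is `None` when the request has no query string, which the plain dictionary lookup would not survive.

```python
import base64
import hmac
import json


def is_authorized(event: dict, expected_password: str) -> bool:
    """Compare the 'pwd' query parameter with the expected password."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    supplied = params.get("pwd") or ""
    # compare_digest takes the same time wherever the first mismatch is.
    return hmac.compare_digest(supplied.encode(), expected_password.encode())


def parse_body(event: dict) -> dict:
    """Decode the JSON request body of an API Gateway proxy event."""
    raw = event.get("body") or "{}"
    if event.get("isBase64Encoded"):
        raw = base64.b64decode(raw).decode("utf-8")
    return json.loads(raw)


# Usage with a hand-written event instead of a real API Gateway request.
event = {
    "queryStringParameters": {"pwd": "s3cret"},
    "body": json.dumps({"email": "someone@example.com"}),
    "isBase64Encoded": False,
}
authorized = is_authorized(event, "s3cret")
contents = parse_body(event)
```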

If a phone number is provided, rewrite it so that it uses your country code. For us (the Netherlands, +31) that is:

if phone[0:2] == "06" and len(phone) == 10:
    phone = "+31{}".format(phone[1:])   # 06xxxxxxxx -> +316xxxxxxxx
if phone[0:4] == "0031" and len(phone) == 13:
    phone = "+{}".format(phone[2:])     # 00316xxxxxxxx -> +316xxxxxxxx
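The same rewrite can be wrapped in a small function, which makes it easy to try out on a few inputs. This is a sketch under assumptions: the name `normalize_dutch_mobile` is hypothetical, and stripping spaces and dashes first is an addition that the original snippet does not make.

```python
def normalize_dutch_mobile(phone: str) -> str:
    """Return a Dutch mobile number with a +31 prefix where recognised.

    Numbers in any other format are returned with separators stripped.
    """
    # Drop spaces and dashes that people commonly type in phone numbers.
    digits = phone.replace(" ", "").replace("-", "")
    if digits.startswith("06") and len(digits) == 10:
        return "+31" + digits[1:]    # national format: drop the leading 0
    if digits.startswith("0031") and len(digits) == 13:
        return "+" + digits[2:]      # international format: replace 00 by +
    return digits


national = normalize_dutch_mobile("0612345678")
international = normalize_dutch_mobile("0031612345678")
with_separators = normalize_dutch_mobile("06 1234-5678")
```

All three calls above produce `+31612345678`.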

Then simply subscribe to the topic.

client = boto3.client('sns')
topic = 'arn of the topic'  # replace with the ARN of your SNS topic
response_email = client.subscribe(
    TopicArn=topic,
    Protocol='email',         # or 'sms'
    Endpoint=email_address,   # or phone
    ReturnSubscriptionArn=False
)

With an email subscription this automatically triggers a confirmation email. For phone subscriptions it does not, so we send a single SMS message ourselves to finish this Lambda.

client.publish(
    PhoneNumber=phone,
    Message="Bedankt voor het subscriben",  # Dutch for "Thanks for subscribing"
    MessageAttributes={
        'AWS.SNS.SMS.SenderID': {
            'DataType': 'String',
            'StringValue': 'CVD19Update'
        }
    }
)

Conclusion

You have now learned how a custom monitoring solution for changes on a webpage can be implemented fairly easily. This is a basic solution which can easily be expanded upon, for example with an additional check on whether a phone number or email address has already been subscribed. As of right now there is no unsubscribe function for phone numbers; this is also something that can be added. We could also look at adjusting the subscription confirmation e-mail.
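As an illustration of the first extension, a check for an existing subscription could be built on the SNS `list_subscriptions_by_topic` call, which returns results in pages linked by a `NextToken`. The sketch below is not part of the published Lambdas: `is_already_subscribed` is a hypothetical helper, and the `FakeSNS` class only exists so the example runs without AWS credentials. In the Lambda you would pass `boto3.client('sns')` instead.

```python
def is_already_subscribed(sns_client, topic_arn: str, endpoint: str) -> bool:
    """Return True if `endpoint` already has a subscription on the topic."""
    kwargs = {"TopicArn": topic_arn}
    while True:
        page = sns_client.list_subscriptions_by_topic(**kwargs)
        for subscription in page.get("Subscriptions", []):
            if subscription.get("Endpoint") == endpoint:
                return True
        token = page.get("NextToken")
        if not token:
            return False              # no more pages to look at
        kwargs["NextToken"] = token   # continue with the next page


class FakeSNS:
    """Stand-in for the SNS client that serves two pages of results."""

    def list_subscriptions_by_topic(self, TopicArn, NextToken=None):
        if NextToken is None:
            return {
                "Subscriptions": [{"Endpoint": "a@example.com"}],
                "NextToken": "page2",
            }
        return {"Subscriptions": [{"Endpoint": "+31612345678"}]}


example_arn = "arn:aws:sns:eu-west-1:123456789012:example"  # made-up ARN
known = is_already_subscribed(FakeSNS(), example_arn, "+31612345678")
unknown = is_already_subscribed(FakeSNS(), example_arn, "b@example.com")
```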

 



We could also think of custom messages, or filters on specifics in the message (maybe you only want to know about guidelines, maybe only numbers). You could also extend this solution to incorporate more data sources and add monitoring on those pages automatically.
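One way such filters might be approached is with SNS subscription filter policies, which match on message attributes. The sketch below only builds the arguments for `subscribe()` and `publish()`; the attribute name `category`, its values, and both helper names are hypothetical, and deciding which category a news item belongs to is left open.

```python
import json


def filter_subscription_kwargs(topic_arn, protocol, endpoint, categories):
    """Build arguments for sns_client.subscribe() with an attribute filter."""
    return {
        "TopicArn": topic_arn,
        "Protocol": protocol,
        "Endpoint": endpoint,
        # SNS expects the filter policy as a JSON string.
        "Attributes": {"FilterPolicy": json.dumps({"category": categories})},
    }


def category_attribute(category):
    """Build the MessageAttributes entry that the filter policy matches on."""
    return {"category": {"DataType": "String", "StringValue": category}}


subscribe_kwargs = filter_subscription_kwargs(
    "arn:aws:sns:eu-west-1:123456789012:example",  # made-up ARN
    "email",
    "someone@example.com",
    ["guidelines"],
)
publish_attributes = category_attribute("guidelines")
```

The monitoring Lambda would then pass `publish_attributes` as `MessageAttributes` when publishing, and the subscription Lambda would call `client.subscribe(**subscribe_kwargs)`.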

However, for now, this is it. We hope you find this useful and now more than ever: be well!

You can find the source code for these AWS Lambdas in Cloudnation's Publications Repository