A simple proxy service for scrapers running on Morph
The low-hanging fruit has been picked – NSW and Victoria publish a page/URL per notice, WA publishes a PDF per notice – now it’s time for the harder data sources.
South Australia Health publishes a register of food prosecutions. The data is not structured. Every single entry is formatted differently. Business names, addresses, dates, and even field names are different for each entry.
The clue to what’s going on is in the class name on the content:
This is a pretty common thing about public data: often the only reason it has been published is that legislation requires it.
If the scale of the data being published is small, folk-systems spring up to handle the demand (in SA Health’s case, a WYSIWYG field on a CMS to handle 6 notices). When the scale of reporting is big, you get a more structured, consistent approach (like the NSW Food Authority’s ASP.NET application that handles ~1500 notices).
The challenge becomes: how do you build a scraper that handles the variations in data from artisanal, hand-crafted data sources?
But it turns out that’s not even the most challenging problem with writing a scraper for this dataset. Sure, there are some annoying inconsistencies that require handling a few special cases, but nothing impossible.
The problem lies with how the scraper runs.
The scraper scrapes the food prosecutions register from the SA Health website. The SA Health website sits behind some sort of Web Application Firewall. It’s assumed this WAF is meant to block nasty requests to the website.
Unfortunately, the WAF blocks legitimate requests from Morph, which means the scraper fails to run. The WAF sometimes returns an HTTP status code of 200 but with an error message in the body. Sometimes it just silently drops the TCP connection altogether. This behaviour only shows up on Morph, not when running from within Australia.
Bugs that only show up in production? The best.
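The two failure modes described above (a 200 response carrying an error page, and a dropped connection) can both be caught in the fetch step. This is a minimal sketch in Python rather than the scraper's actual code; the marker strings in `BLOCK_MARKERS` are hypothetical, since the real text of the WAF error page isn't shown here.

```python
import socket
import urllib.error
import urllib.request

# Hypothetical phrases a WAF error page might contain. The real page text is
# not given in this post, so adjust these after inspecting a blocked response.
BLOCK_MARKERS = ("request rejected", "access denied")


def looks_blocked(status, body):
    """Return True when a response looks like a WAF block page.

    A 200 status alone is not enough, because the WAF can return 200 with an
    error message in the body.
    """
    text = body.lower()
    return status != 200 or any(marker in text for marker in BLOCK_MARKERS)


def fetch(url, timeout=30):
    """Fetch a page, raising RuntimeError for either failure mode."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, socket.timeout, ConnectionError) as exc:
        # Covers dropped or reset connections and timeouts.
        raise RuntimeError(f"connection failed for {url}: {exc}") from exc
    if looks_blocked(status, body):
        raise RuntimeError(f"response from {url} looks like a WAF block page")
    return body
```

Failing loudly in both cases means a blocked run shows up as an error instead of quietly saving an error page as data.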
To make the scraper work on Morph, we can build a simple Tinyproxy-based proxy service running in AWS to proxy requests from Morph to SA Health’s website. The proxy is locked down to only accept requests originating from Morph.
Designed to be cheap, resilient, and open
The proxy service must be:
- low cost
- resilient to failure
- open source and reproducible
The last point is key.
When I originally tested this proxying approach, I did it with a Digital Ocean droplet in Singapore. I forgot about it for a couple of weeks, then accidentally killed the droplet when I was cleaning up something else in my DO account. Aside from the fact that the proxy’s existence and behaviour were opaque to anyone but me, I wanted other people to be able to use this proxying approach. More selfishly, I didn’t want future Lindsay to have to remember how this house of cards was stacked.
To keep costs low and the service resilient, the proxy service uses the AWS free tier and autoscaling groups. The Terraform config:
- Sets up a single VPC, with a single public subnet, routing tables, and a single internet gateway.
- Sets up an ELB to publicly terminate requests, locked down with a security group to only accept requests from Morph (don’t want to be running an open proxy).
- Sets up an autoscaling group of a single t2.micro (free tier) instance, with a launch config that boots the latest Ubuntu Xenial AMI, and links the ELB to the ASG.
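The security group rule on the ELB amounts to a source-address allowlist. The sketch below shows that check in Python purely as an illustration; the real rule lives in the security group, and the address used here is a placeholder from the RFC 5737 documentation range, not Morph's actual address.

```python
import ipaddress

# Placeholder from the RFC 5737 documentation range, NOT Morph's real address.
ALLOWED_SOURCES = [ipaddress.ip_network("203.0.113.10/32")]


def is_allowed(source_ip, allowed=ALLOWED_SOURCES):
    """Return True if source_ip falls inside any allowed network."""
    address = ipaddress.ip_address(source_ip)
    return any(address in network for network in allowed)
```

Anything outside the allowlist never reaches Tinyproxy, which is what stops the service from being an open proxy.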
When the scraper runs on Morph with the MORPH_PROXY environment variable set, it connects through the ELB to the Tinyproxy instance, which then proxies the request on to SA Health’s website.
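On the scraper side, honouring MORPH_PROXY could look something like the following. This is a sketch in Python, not the scraper's actual code, and it assumes the variable holds either a `host:port` pair or a full URL; the exact format isn't stated here.

```python
import os
import urllib.request


def proxy_settings(environ):
    """Build a proxies mapping from MORPH_PROXY, or an empty one if unset."""
    proxy = environ.get("MORPH_PROXY", "").strip()
    if not proxy:
        return {}
    # Assumption: the value is "host:port" or a full URL; add a scheme if missing.
    if "://" not in proxy:
        proxy = "http://" + proxy
    return {"http": proxy, "https": proxy}


def make_opener(environ=os.environ):
    """Return a urllib opener that routes through the proxy when configured."""
    # An empty mapping tells ProxyHandler to use no proxy at all.
    handler = urllib.request.ProxyHandler(proxy_settings(environ))
    return urllib.request.build_opener(handler)
```

With the variable unset, the same scraper code connects directly, so it still runs unchanged on a local machine.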
Drive changes with make and environment variables
Once you clone the repo and set some environment variables, you can start planning your changes:
To apply the plan:
To destroy the environment:
This Makefile approach was borrowed from hectcastro/terraform-aws-vpc, from which this Terraform config was forked.
Wrap it with a Continuous Deployment pipeline
To keep Terraform changes consistent, all changes to the proxy service are run through a Continuous Deployment pipeline on Travis. This means no changes to the “production” service are made locally. This matters because it shows new contributors how the service runs and changes.
Terraform relies on .tfstate files to track state and changes between Terraform runs. Because Travis starts with a clean git clone every build (and thus no .tfstate files), terraform config is used to push/pull persistent state across builds.
The pipeline is very simple – it just runs
These environment variables must be exported for proxy/cideploy.sh to work:
- BUCKET, the name of the S3 bucket the config will be sync’d with by
- AWS_ACCESS_KEY_ID, access key for the IAM user, used by
- AWS_SECRET_ACCESS_KEY, access key secret for the IAM user, used by
- TF_VAR_aws_access_key, access key for the IAM user, used by
- TF_VAR_aws_secret_key, access key secret for the IAM user, used by
In the sa_health_food_prosecutions_register proxy service case, these environment variables are exported as encrypted environment variables in .travis.yml. This keeps the config and most of the data open, and easily reproducible.
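Because a deploy with a missing variable tends to fail in confusing ways, it can help to check for all five up front. The helper below is a hypothetical sketch in Python, not part of proxy/cideploy.sh; only the variable names come from the list above.

```python
# Variable names taken from the list above; the helper itself is hypothetical.
REQUIRED_VARS = (
    "BUCKET",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "TF_VAR_aws_access_key",
    "TF_VAR_aws_secret_key",
)


def missing_vars(environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]
```

Calling `missing_vars(os.environ)` at the start of a build and aborting on a non-empty result gives a clear error before anything touches AWS.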
Civic hacking for government shortfalls
This was a huge amount of work for a very small data set (6 notices!), but I believe it was worth it.
The approach allows the scraper to reliably run on Morph, and behave in a way that’s consistent with other scrapers. The costs are minimal, which is important if I’m picking up the tab for poor government IT.
(Side note: if you were a member of the public with an urgent enquiry for SA Health, but you were being silently dropped by their WAF, how would you contact them to let them know? Their contact numbers are on their website, after all.)
Most importantly, the service is open source and reproducible. When I asked on the Open Australia Slack about other cases of Morph scrapers failing because of active blocking of requests, nobody could think of any.
I hope nobody ever has to do anything like this to make their scrapers run, but if they do, there’s now a Terraform project to set up a proxy that costs less than $5/month.
Happy civic hacking!