This post is a step-by-step walkthrough of how to start using Neal Richter's ads.txt web crawler, a Python script posted in the IAB Tech Lab's official GitHub repository. You might know Neal as the former CTO of Rubicon Project and current CTO of Rakuten, but he was also a key contributor to the ads.txt working group.
Getting this script working turned out to be a great way to learn more about Python and web scraping, but my main goal was to compile the data I needed to analyze publisher adoption of ads.txt since the project's release in June of this year. For more on ads.txt, or to read my analysis and download my data on the state of publisher adoption, head over to my State of Ads.txt post.
What the ads.txt web crawler does
The script takes two inputs: a text file of domains to crawl and a database to which it writes the parsed output. The script appends '/ads.txt' to each domain and writes the resulting URLs to a temporary CSV file. It then loops through each record in that CSV file, formatting it into a request and using Python's requests library to execute the call.
Next, the script does some basic error handling. It times out a request if the host doesn't respond within a few seconds, and it logs an error if the page doesn't look like an ads.txt file (such as a 404 page), if the request gets redirected like crazy, or if anything else unexpected happens.
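The fetch-and-check step described above can be sketched roughly like this. The function name, timeout value, and content-type check are my own illustrative choices, not the actual script's code; it assumes the requests library is installed.

```python
import requests

def fetch_ads_txt(domain, timeout=5):
    """Request a domain's ads.txt and return the text, or None on error.

    This is a hypothetical sketch of the crawler's fetch step, not the
    script's actual implementation.
    """
    url = "http://" + domain + "/ads.txt"
    try:
        # Time out if the host doesn't respond within a few seconds;
        # requests also raises here on DNS failures or too many redirects.
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.exceptions.RequestException:
        return None
    # Skip 404s and pages that don't look like a plain-text ads.txt file.
    if resp.status_code != 200:
        return None
    if "text/plain" not in resp.headers.get("Content-Type", ""):
        return None
    return resp.text
```

A domain that can't be resolved simply comes back as None rather than crashing the crawl, which is the kind of defensiveness the script is aiming for.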
If the page looks like an ads.txt file, the script then parses the file for the expected values – domain, exchange, account ID, type, tag ID, and comment – and logs those to the specified database.
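The parsing step can be sketched as below. This is my own simplified illustration, not the script's code; the field layout (exchange domain, account ID, account type, optional certification authority / tag ID, trailing comment) follows the ads.txt format the post describes.

```python
def parse_ads_txt(domain, body):
    """Parse ads.txt text into (domain, exchange, account_id, type, tag_id, comment) tuples.

    A hedged sketch of the parsing step, not the actual script.
    """
    records = []
    for raw_line in body.splitlines():
        # Split off any trailing comment after '#'.
        line, _, comment = raw_line.partition("#")
        line = line.strip()
        if not line or "=" in line:
            continue  # skip blanks and variable lines like CONTACT=...
        fields = [f.strip() for f in line.split(",")]
        if len(fields) < 3:
            continue  # not a valid data record
        exchange, account_id, account_type = fields[0], fields[1], fields[2]
        tag_id = fields[3] if len(fields) > 3 else ""
        records.append((domain, exchange, account_id, account_type, tag_id, comment.strip()))
    return records
```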
What the ads.txt web crawler doesn’t do
Neal is pretty clear that this script is intended more as an example than a full-fledged crawler. For one, the script runs pretty slowly because it processes one domain at a time rather than many in parallel. It also does the work on your laptop rather than a production server, which would offer more bandwidth and speed. And its error handling leaves something to be desired, both when crawling domains and when writing the output to the database.
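Since the work is mostly waiting on network I/O, a parallel version could use a thread pool. The script itself doesn't do this; the sketch below is a hypothetical illustration of the approach.

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_all(domains, fetch, max_workers=20):
    """Fetch many domains' ads.txt files concurrently.

    Hypothetical sketch; `fetch` is any function that takes a domain and
    returns its ads.txt text (or None on error). Threads work well here
    because each call spends most of its time blocked on the network.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(fetch, domains)
    return dict(zip(domains, results))
```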
I came closer to chucking my laptop out the window than I'd care to admit while working around UnicodeErrors, newline characters, null bytes, and other annoying, technically detailed nuances that made the script puke on my domain files. And finally, the database just sits on your laptop, so it won't scale forever, even though CSV files are typically small, even with tens of thousands of records.
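One way to sidestep those UnicodeErrors and null bytes is to pre-clean the domain file before feeding it to the crawler. The helper below is my own illustration of that idea, not part of the script.

```python
def clean_domains(raw_bytes):
    """Decode a domain list leniently and strip the junk that trips up the crawler.

    Hypothetical helper: replaces undecodable bytes instead of raising
    UnicodeDecodeError, removes null bytes, and drops blank lines.
    """
    text = raw_bytes.decode("utf-8", errors="replace")
    domains = []
    for line in text.splitlines():
        domain = line.replace("\x00", "").strip()
        if domain:
            domains.append(domain)
    return domains
```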
All that said, I’m not a developer by trade and I was able to figure it out, even if it was a bit painful at times. Hopefully this post will help others do the same.
How to get your ads.txt web crawler running
First things first – before you try to run this script, you need to have a few things already in place on your machine. If you don't have these pieces yet it'll be a bit of a chore, but it's a great excuse to get it done if you want to do more technical projects in the future.