iab tech lab

State of Ads.txt Adoption

What is ads.txt?

Ads.txt is an IAB Tech Lab project created to fight inventory fraud in the digital advertising industry. The idea is simple: publishers put a file on their server that says exactly which companies they sell their inventory through. The file lists each partner by name and also includes the publisher’s account ID with that partner. This is the same ID buyers see in a bid request, which they can use as a key for campaign targeting.

Buyers use a web crawler to download all the ads.txt files on a regular basis and use the information in them to target their campaigns. This means buyers know that if they bid on a request that comes from an authorized ID, it’s coming from a source the publisher trusts or has control over. Buyers seem to be taking the idea seriously, too. Just a week ago Digitas published an open letter on Digiday saying they won’t buy from any publisher without an ads.txt file.
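To make that concrete, here is a minimal Python sketch of the kind of check a buyer might run before bidding. Everything in it is made up for illustration: the domains, the account IDs, and the `AUTHORIZED_SELLERS` and `is_authorized` names are hypothetical and not taken from any real bidder.

```python
# Hypothetical (exchange domain, account ID) pairs, as a buyer's crawler
# might have collected them from one publisher's ads.txt file.
AUTHORIZED_SELLERS = {
    "example-publisher.test": {
        ("example-exchange.test", "12345"),
        ("other-exchange.test", "pub-678"),
    },
}


def is_authorized(publisher_domain, exchange_domain, account_id):
    """True if the publisher's ads.txt data lists this exchange and account ID."""
    sellers = AUTHORIZED_SELLERS.get(publisher_domain.lower(), set())
    return (exchange_domain.lower(), account_id) in sellers


# A bid request with a listed account ID, then one with an unknown ID.
print(is_authorized("example-publisher.test", "example-exchange.test", "12345"))
print(is_authorized("example-publisher.test", "example-exchange.test", "99999"))
```

The lookup is keyed on the exchange and account ID together, because an ID only means something in the context of the exchange that issued it.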

Ads.txt isn’t a silver bullet for all inventory quality woes, but it is a dead simple solution.  You’d be stupid not to lock the door to your house, even if it’s not a guarantee of safety, right?  The important bit is that for the first time publishers have a tool against inventory fraud instead of relying on the programmatic tech alone.

Are you a developer or a patient person? Then try the ads.txt crawler yourself

As part of the program’s release, Neal Richter, a longtime ad technology veteran and one of the authors of the ads.txt spec, wrote a simple web crawler in Python. The script takes a list of domains and parses the expected ads.txt format into a database, with some basic error handling along the way.

Developers will probably find it a piece of cake to use, and non-developers will struggle a bit, like I did. That said, I got it running after pushing through some initial frustration and researching how to get a small database running on my computer. For anyone interested, I wrote a detailed tutorial and overview of how to get it working in a separate post.

12.8% of publishers have an ads.txt file

At least, among the Alexa 10K global domains that sell advertising. To get this stat, I took the Alexa top 10,000 domains, removed everything owned by Google, Amazon, and Facebook, which don’t sell their inventory through third parties and therefore don’t need an ads.txt file, and removed the obvious pornography sites. After filtering, I had 9,572 domains to crawl. I sent all those through Neal’s crawler and found 1,930 domains selling ads, and 248 with an ads.txt file. 248 / 1,930 = 12.8%, voilà!
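For anyone who wants to check the arithmetic, the calculation is just the ratio of the two counts above. The variable names are mine.

```python
# Counts reported above from the crawl of the filtered Alexa list.
domains_selling_ads = 1930
domains_with_ads_txt = 248

adoption_rate = domains_with_ads_txt / domains_selling_ads
print(f"{adoption_rate:.1%}")  # prints 12.8%
```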

Set Up an Ads.txt Web Crawler

This post is a step-by-step walkthrough of how to start using Neal Richter’s ads.txt web crawler, a Python script posted in the IAB Tech Lab’s official Git repository. You might know him as the former CTO of Rubicon Project or current CTO of Rakuten, but he also served as a key contributor to the ads.txt working group.

Getting this script working ended up being a great way to learn more about Python and web scraping, but my main goal was to compile the data I needed to analyze publisher adoption of ads.txt since the project was released in June of this year. For more on ads.txt, or to read my analysis and download my data on the state of publisher adoption, head over to my State of Ads.txt post.

What the ads.txt web crawler does

The script takes two inputs: first, a txt file of domains, and second, a database to write the parsed output to. Once you specify a list of domains, the script appends ‘/ads.txt’ to each and writes them to a temporary CSV file. The script then loops through each record in the CSV file, formatting it into a request and using Python’s requests library to execute the call.
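Here is a rough sketch of the URL-building step, not the crawler’s actual code. The function name `build_ads_txt_urls`, the `http://` scheme, and the sample domains are my own assumptions, and I skip the temporary CSV file to keep the example short.

```python
def build_ads_txt_urls(domain_lines):
    """Turn raw lines from a domain list into ads.txt URLs, skipping blanks."""
    urls = []
    for line in domain_lines:
        domain = line.strip().lower()
        if not domain:
            continue  # ignore empty lines in the input file
        urls.append(f"http://{domain}/ads.txt")
    return urls


# Made-up domains standing in for the contents of a domain list file.
example_lines = ["example-publisher.test\n", "\n", "  News.Example.test  \n"]
for url in build_ads_txt_urls(example_lines):
    print(url)
```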

Next, the script does some basic error handling. It will time out a request if the host doesn’t respond within a few seconds, and it will log an error if the page doesn’t look like an ads.txt file (such as a 404 page), if the request gets redirected like crazy, or if other unexpected things happen.
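A minimal sketch of that kind of error handling might look like the following. To keep it runnable without extra installs, I use the standard library’s `urllib` here instead of the requests library the script uses. The five-second timeout, the function names, and the “looks like ads.txt” heuristic are all my own assumptions rather than the crawler’s actual logic.

```python
import socket
import urllib.error
import urllib.request


def looks_like_ads_txt(content_type, body):
    """Heuristic: plain text, no HTML, and at least one comma-separated record."""
    if "text/plain" not in content_type.lower():
        return False
    lowered = body.lower()
    if "<html" in lowered or "<!doctype" in lowered:
        return False
    return any(
        "," in line
        for line in body.splitlines()
        if not line.lstrip().startswith("#")
    )


def fetch_ads_txt(url, timeout_seconds=5):
    """Return (body, error); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            content_type = response.headers.get("Content-Type", "")
            body = response.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:  # 404s, too many redirects, etc.
        return None, f"http error {exc.code}"
    except (urllib.error.URLError, socket.timeout, TimeoutError) as exc:
        return None, f"request failed: {exc}"
    if not looks_like_ads_txt(content_type, body):
        return None, "response does not look like an ads.txt file"
    return body, None


# Offline check of the heuristic with made-up responses.
print(looks_like_ads_txt("text/plain", "example-exchange.test, 12345, DIRECT\n"))
print(looks_like_ads_txt("text/html", "<html>Not Found</html>"))
```

Returning an error string instead of raising lets the calling loop log the problem and move on to the next domain.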

If the page looks like an ads.txt file, the script then parses the file for the expected values – domain, exchange, account ID, type, tag ID, and comment – and logs those to the specified database.
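The parsing and logging step can be sketched like this. I’m assuming SQLite because it ships with Python. The table layout, the `parse_ads_txt` name, and the sample record are hypothetical, so the real crawler’s schema may well differ.

```python
import sqlite3


def parse_ads_txt(site_domain, text):
    """Yield (site, exchange, account ID, type, tag ID, comment) per record."""
    for line in text.splitlines():
        record, _, comment = line.partition("#")
        fields = [field.strip() for field in record.split(",")]
        if len(fields) < 3:
            continue  # blank lines, comment-only lines, and variables
        exchange, account_id, account_type = fields[:3]
        tag_id = fields[3] if len(fields) > 3 else ""
        yield (site_domain, exchange.lower(), account_id,
               account_type.upper(), tag_id, comment.strip())


# Hypothetical table layout; an in-memory database keeps the sketch tidy.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE adstxt (site_domain TEXT, exchange_domain TEXT, "
    "account_id TEXT, account_type TEXT, tag_id TEXT, comment TEXT)"
)

sample = "example-exchange.test, 12345, direct, abc123 # display\n# note only\n"
rows = list(parse_ads_txt("example-publisher.test", sample))
connection.executemany("INSERT INTO adstxt VALUES (?, ?, ?, ?, ?, ?)", rows)
connection.commit()
print(connection.execute("SELECT * FROM adstxt").fetchall())
```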

What the ads.txt web crawler doesn’t do

Neal is pretty clear that this script is intended more as an example than a full-fledged crawler. For one, the script runs pretty slowly because it can only process one domain at a time, rather than a bunch in parallel. It also uses a laptop to do the work instead of a production server, which would add more bandwidth and speed. And it leaves something to be desired on error handling, both when crawling domains and when writing the output to the database.
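For a sense of what “a bunch in parallel” could look like, here is a small sketch using a thread pool from Python’s standard library. This is not how the crawler works today. The `crawl_in_parallel` name, the worker count, and the `fake_fetch` stand-in are all assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_in_parallel(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input URLs.
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))


def fake_fetch(url):
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return f"fetched {url}"


urls = ["http://a.example.test/ads.txt", "http://b.example.test/ads.txt"]
print(crawl_in_parallel(urls, fake_fetch))
```

Threads suit this job because the crawler spends most of its time waiting on the network, not doing computation.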

I was closer to chucking my laptop out the window than I’d care to admit while trying to get around UnicodeErrors, newline characters, null bytes, and other annoying technical details that made the script puke on my domain files. And finally, the database is just sitting on your laptop, so it won’t scale forever, although the CSV files are typically small, even with tens of thousands of records.
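One way to sidestep those problems is to clean the domain file before the crawler ever sees it. The sketch below is my own workaround, not something from Neal’s script, and the `clean_domain_lines` name and sample bytes are made up.

```python
def clean_domain_lines(raw_bytes):
    """Decode a domain list defensively and return tidy domain names."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    text = raw_bytes.decode("utf-8", errors="replace")
    text = text.replace("\x00", "")  # strip null bytes
    domains = []
    for line in text.splitlines():  # handles \n, \r\n, and \r endings
        domain = line.strip().lower()
        if domain and "\ufffd" not in domain:
            domains.append(domain)
    return domains


# Made-up file contents with mixed newlines, a null byte, and bad bytes.
messy = b"example-publisher.test\r\nnews.exa\x00mple.test\n\n\xff\xfebad\n"
print(clean_domain_lines(messy))
```

Dropping any line that still contains a replacement character is a blunt choice, but a domain with mangled bytes wouldn’t resolve anyway.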

All that said, I’m not a developer by trade and I was able to figure it out, even if it was a bit painful at times.  Hopefully this post will help others do the same.

How to get your ads.txt web crawler running

First things first: before you try to run this script, you need to have a few things already in place on your machine. If you don’t have these pieces yet, it’ll be a bit of a chore, but it’s a great excuse to get it done if you want to do more technical projects in the future.