How to Scrape Government Data with JavaScript

Using the Network Inspector, jQuery, querySelector, and async/await to get structured data out of messy websites

Neel Mehta
10 min read · Nov 24, 2020

When you’re analyzing political data, whether you’re building a list of voters in your district or tracking electoral trends, you’ll often have to grab government data off the internet.

The trouble is that government websites rarely let you download an Excel sheet or CSV of all the underlying data. More often, you’ll have to turn to the time-honored technique of scraping structured data off a clunky website.

If you want to get good at capturing and analyzing government data, you’ll have to add many scraping tools to your toolbelt, from parsing dirty PDFs to becoming a Greasemonkey. I’ve had to master a fair few of these in my day, so I wanted to use this post to introduce some of them.

In this post, we’ll focus on the easier methods: grabbing structured data from more modern websites using JavaScript. As a case study, we’ll be using my recent project to create a list of all the vote-by-mail “dropboxes” that were offered in the 2020 election.

(The techniques I’ll show you should work for any kind of data, not just government data, though I’ve found that you’ll need them most on government websites!)

The easy way: find underlying data in the console

Many modern websites store structured data in a CSV or JSON blob, then have a fancy frontend that loads that data and presents it in an interactive user interface. These are the easiest sites to scrape: you don’t need to grapple with the UI as long as you can find the structured data.

Finding that blob, however, is a bit of a treasure hunt.

For instance, Michigan lets you find dropboxes through MichiganDropbox.com. Unfortunately, it takes many clicks to get to the dropboxes, which would make it a royal pain to scrape:

Finding vote-by-mail dropboxes on MichiganDropbox.com.

Our first step is to see if this website loads data from a CSV or JSON blob. To do that, we open the web console (which you can do using Inspect Element) and go to the Network tab. This tab shows all the files loaded by the webpage. Most of them are JavaScript scripts or CSS stylesheets… but check out what else we find!

Finding the JSON file that powers MichiganDropbox.com.

Logistical note: readers have told me that Medium shrinks some of the images in this article, making smaller text appear blurry. To help, I’ve uploaded all the original-quality pictures from this article into a GitHub repository.

The page loads a file called dropboxLocations.json, which looks like it’s exactly what we need! When we go to the Response tab to see what’s inside that file, we see dropbox data hidden a few layers deep in the object’s structure.

This JSON file contains a list of counties; each county contains jurisdictions; each jurisdiction contains a list of dropboxes!
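
In other words, the blob is shaped something like this (the field names below are my illustration, not necessarily the exact keys in the file):

{
  "counties": [
    {
      "name": "Example County",
      "jurisdictions": [
        {
          "name": "Example Jurisdiction",
          "dropboxes": [
            { "address": "...", "hours": "..." }
          ]
        }
      ]
    }
  ]
}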

Boom! We’ve found the gold mine. Once we have this JSON blob, we’ll be able to get a list of all the dropboxes, no scraping needed. I downloaded the file by right-clicking on the file name and hitting “Copy Response”; the process might differ in other browsers. Check out the JSON file for yourself:

The JSON blob that contains info on all the dropboxes in Michigan.

Now, we just need to feed this JSON blob into whatever script or tool we’re using for analysis. In our case, we wanted to make a giant CSV of all the dropboxes, so I first ran the blob through this excellent JSON to CSV converter.
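
If you’d rather script the conversion yourself, a few nested loops will flatten the blob into CSV rows. The sketch below assumes you’ve parsed the JSON into a variable called blob and that the field names match the illustration above; check the real file’s keys before running it.

// Flatten counties -> jurisdictions -> dropboxes into CSV rows
const rows = [["county", "jurisdiction", "address"]];
for (const county of blob.counties) {
  for (const jur of county.jurisdictions) {
    for (const box of jur.dropboxes) {
      rows.push([county.name, jur.name, box.address]);
    }
  }
}

// Quote each field and double any embedded quotes, per CSV convention
const csv = rows
  .map(row => row.map(f => `"${String(f).replace(/"/g, '""')}"`).join(","))
  .join("\n");
console.log(csv);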

The hard way: muck around in the console

But you won’t always be able to find an underlying CSV or JSON file. Consider Georgia’s dropbox finder at GaBallotDropbox.org. Like in Michigan, the website lets you browse through the dropboxes in every county, but I, at least, couldn’t find the underlying data source. (I think they “minified” the dataset to reduce its file size; this has the side effect of making it much harder to extract.)

The Georgia dropbox finder at GaBallotDropbox.org.

Because we can’t find the blob, we’ll have to scrape it the old-fashioned way. Fortunately, the website is fairly simple: just choose a county from the dropdown menu, and all the dropboxes will be immediately printed below.

There are over 100 counties in Georgia, so clicking every single county in the dropdown menu would be a huge pain. Thus, our approach will be something like this:

  1. Open the dropdown menu.
  2. Click a county.
  3. Grab the list of dropboxes that gets printed.
  4. Repeat with the next county.

A close-up of the dropbox data that gets printed for each county.

Opening the dropdown

Let’s get our hands dirty by using Inspect Element to figure out how to programmatically open the dropdown menu. I think that upside-down arrow should do it, so let’s click it to find its HTML:

Using Inspect Element to find the button that opens the dropdown.

Now we need to find a unique way to identify the button so that we can trigger it with JavaScript. Unfortunately, this item doesn’t have a unique id, but it has a class that seems somewhat unusual: MuiAutocomplete-popupIndicator.

So, let’s head to the Console tab and put in some JavaScript that will click that button. It looks like the JavaScript already on the page has assigned something like jQuery to $, so we can use the typical $("selector") syntax from jQuery.

// Click the dropdown
$('.MuiAutocomplete-popupIndicator').click();

Seems like it works: running this code opens the menu.

Programmatically opening the dropdown menu.

Clicking a county

Now we need to find a way to programmatically click one entry in the dropdown. Fortunately, when we hover over an item, we find that each item has a unique ID:

The tooltip on the “Clinch County” item tells us that its ID is `countySelector-option-31`.

Even better, the IDs are pretty systematic: it’s countySelector-option-###, where the numbers count up from 0 for the first county alphabetically (Appling) to 159 for the last county alphabetically (Worth). (Fun fact: Georgia has more counties than any state besides Texas.) We’ll use this handy property later on.

But first we need to write code to click the list item. This is fairly straightforward, but to make it more interesting we’ll introduce JavaScript’s template strings:

// Click a dropdown item
// As an example, choose Clinch County (#31)
let i = 31;
$(`#countySelector-option-${i}`).click();

Chaining this with the earlier command, we can navigate to any county we like based on the value of i.

Programmatically opening the dropdown, and then clicking on a list item, lets us pull up any county we desire. We just need to set `i` accordingly.

Grabbing the list of dropboxes

Now we need to dig into the HTML and grab the dropboxes. Unfortunately, the elements containing the dropbox info don’t have unique class names, so we’ll have to find some roundabout way to access them.

Inspect Element shows us how the dropbox info is nested in the HTML.

Through trial and error, I found that the dropbox info is stored within MuiBox-root elements nested deep within a jss6 element. This code lets us get the list of dropbox containers:

// Whatever jQuery-like thing $ is doesn't work consistently,
// so I'm making a wrapper around the native alternative.
let $$ = x => document.querySelectorAll(x);
// get NodeList of Divs containing dropbox data
let dropboxDivs = $$(".jss6 div.MuiBox-root div.MuiBox-root");

As I hinted, whatever $ is, it isn’t normal jQuery, so it can’t pull off my complex selector string. Instead, I made a wrapper around document.querySelectorAll, the browser’s native answer to jQuery’s selectors. Most selectors that work with jQuery also work with querySelectorAll, but the native function’s name is so obnoxiously long that I just had to give it a catchy shorthand. (I used arrow-function syntax to define the wrapper in minimal code.)

Anyway, this code works as intended, letting us grab data on all the dropboxes in the selected county.

Grabbing the <div>s that contain dropbox information. The result is a NodeList of “Node” objects.

It appears that each dropbox gets a handful of lines of text: there’s the location name, then the address in blue, then the hours in black, then an optional “Note” section in orange. It looks like the address lives in an <a> tag inside the box, while the other lines each live in a <p>.

The HTML inside each of the <div>s containing dropbox information.
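
In sketch form, each dropbox container looks something like this (element order inferred from the screenshot; attributes omitted):

<div class="MuiBox-root">
  <p>Location name</p>
  <a href="...">Street address</a>
  <p>Hours</p>
  <p>Note (optional)</p>
</div>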

To gather this data, we just look at the innerHTML of each <p> or <a> tag here. Then we stash this as an object inside an array called data.

dropboxDivs.forEach((div) => {
  data.push({
    location: arrayGet(div.children, 0, "innerHTML"),
    address: arrayGet(div.children, 1, "innerHTML"),
    hours: arrayGet(div.children, 2, "innerHTML"),
    notes: arrayGet(div.children, 3, "innerHTML"),
  });
});

arrayGet is just a convenience function I made: it looks up innerHTML only if the desired index actually exists in the array. It’s a simple way to guard against those “index out of bounds” errors. (Don’t worry, I’ll provide all the code I used at the end of this example.)
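
Here’s roughly what that helper looks like, as a minimal sketch (the empty-string fallback is my choice):

// Return the given property of children[index] if that index exists;
// otherwise fall back to an empty string
const arrayGet = (children, index, prop) =>
  index < children.length ? children[index][prop] : "";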

Looping through all the counties

Finally, we just need to step through every county (from #0 to #159). The high-level flow of the algorithm ends up looking like this:

let data = [];

for (let i = 0; i < 160; i++) {
  // Click dropdown
  // Click dropdown item #i
  // Grab the <div>s containing each dropbox
  // Extract data from each <div> and add to `data`
}

Fixing timing problems with async/await

One problem I noticed with my algorithm was that the dropdown-item-clicking code would sometimes fail, as would the <div> scraping code: those snippets couldn’t always find the elements they were supposed to be clicking or scraping. This was odd, since each bit worked in isolation.

I realized the problem: the webpage took a few milliseconds to update after I clicked each button, but my next line of code would run before the necessary changes had happened. Specifically, the code would try to find and click the dropdown items before the dropdown had even finished opening.

This timing problem happens a lot when you’re programmatically altering the webpage, and it was the bane of my existence until I realized that I could just tell the script to wait for a bit until the animation was done. Specifically, I found this wonderful function:

// When awaited, this pauses the script for `ms` milliseconds
const wait = ms => new Promise(res => setTimeout(res, ms));

Once we define this function, we drop it between lines of code to force the script to wait:

for (let i = 0; i < 160; i++) {
  // Click the dropdown
  $('.MuiAutocomplete-popupIndicator').click();

  // Kill time until the animation finishes
  await wait(500);

  // Click a dropdown item
  $(`#countySelector-option-${i}`).click();

  // Kill time until the animation finishes
  await wait(500);

  // Other code...
}

We put this code immediately after we take any action that changes the webpage, including clicking buttons. By waiting, we ensure that the next line of code runs after the animation has finished. I chose to wait half a second (500ms) to be safe, but you could probably get away with less.

The await keyword, by the way, is what makes this waiting feel “synchronous”: it pauses the rest of our function until the timer finishes, rather than letting the next line run immediately while the Promise resolves in the background.

But await only works when you’re inside an async function, per the rules of the async/await construct. So we just need to wrap this code in an async function and call it, like so:

async function go() {
  for (let i = 0; i < 160; i++) {
    // The usual code
  }
}
go();

Putting it all together

When we glue together the page-changing code, the scraping code, the helper functions, the loops, and the async/await magic, we get a nifty script that grabs all the dropboxes in Georgia and stores them in the data array. With a few additional tweaks to the code to address edge cases, we’re all set.

Here’s the complete Georgia scraping script! I hope it still works by the time you read this, but regardless, it should be a useful educational tool.
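
Putting the snippets together, the assembled script looks roughly like this. Treat it as a sketch: my final version included the extra edge-case tweaks mentioned above.

// A sketch assembling the snippets from this walkthrough
// (define our own helpers, shadowing whatever the page assigned to $)
const $ = x => document.querySelector(x);
const $$ = x => document.querySelectorAll(x);

// When awaited, pauses the script for `ms` milliseconds
const wait = ms => new Promise(res => setTimeout(res, ms));

// Look up a child element's property only if that child exists
const arrayGet = (children, i, prop) =>
  i < children.length ? children[i][prop] : "";

let data = [];

async function go() {
  for (let i = 0; i < 160; i++) {
    // Click the dropdown, then wait for the animation
    $('.MuiAutocomplete-popupIndicator').click();
    await wait(500);

    // Click dropdown item #i, then wait again
    // (the ?. guards against a missing option)
    $(`#countySelector-option-${i}`)?.click();
    await wait(500);

    // Grab the <div>s containing each dropbox and extract their text
    $$(".jss6 div.MuiBox-root div.MuiBox-root").forEach((div) => {
      data.push({
        location: arrayGet(div.children, 0, "innerHTML"),
        address: arrayGet(div.children, 1, "innerHTML"),
        hours: arrayGet(div.children, 2, "innerHTML"),
        notes: arrayGet(div.children, 3, "innerHTML"),
      });
    });
  }
  console.log(data);
}
go();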

We visit the website, copy-and-paste that script into the developer console, and run the code to get our dropbox data.

After running the script, information for all 360 dropboxes is in the `data` variable.

That data is stashed in the data array, so we just need to export it as JSON or some other format we can analyze elsewhere. I right-clicked on the object and hit “Copy Object,” but if your browser doesn’t have that, you can use an alternate approach. Either way, you should get a JSON blob that you can paste into a text editor.
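
For instance, most browsers’ developer consoles expose a copy() helper that puts a string on your clipboard. (This is a sketch; copy() is a devtools convenience, not standard JavaScript, so check that your browser supports it.)

// Serialize the scraped data and copy it to the clipboard
// (copy() is a console-only utility in Chrome and Firefox)
copy(JSON.stringify(data, null, 2));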

Here’s our final JSON file:

The JSON blob that contains info on all the dropboxes in Georgia.

That’s a wrap! Advanced topics?

With these two techniques — grabbing the source data from the Network analyzer and writing custom JavaScript for the console — I was able to get dropbox data from most state websites, not to mention a handful of other election-tech tools I used when working on my most recent campaign. I hope you’ll find them useful additions to your arsenal, too.

There are some more advanced topics we can touch on later: parsing raw HTML the old-fashioned way, scraping PDFs, using OCR to digitize scanned documents, etc. Let me know in the comments if you’d like to learn more about those!
