Bulk Download NYS Election District Enrollments
For some of my projects, I need access to the full set of enrollments from the State Board of Elections. While you can download them one at a time, it sure is nice to have all of them for statewide coverage to easy processing. You can get them using Beautiful Soup.
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
yr = '12'
mon = 'nov'
ext = '.pdf' #.xlsx
url = "https://www.elections.ny.gov/20"+yr+"EnrollmentED.html"
#If there is no such folder, the script will create one automatically
folder_location = r''+mon+yr+'-Enrollment'
if not os.path.exists(folder_location):os.mkdir(folder_location)
response = requests.get(url, { 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103197 Safari/537.36' })
soup= BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='"+mon+yr+ext+"']"):
#Name the pdf files using the last portion of each link which are unique in this case
filename = os.path.join(folder_location,link['href'].split('/')[-1])
with open(filename, 'wb') as f:
f.write(requests.get(urljoin(url,link['href'])).content)
What can you do with them? If you are looking at the numbers from 2018 and later, you can load them directly in PANDAS.
import glob as g
import pandas as pd
df = pd.concat((pd.read_excel(f, header=4) for f in g.glob('/home/andy/2021-Enrollment/*xlsx')))
df = df[(df['STATUS']=='Active')]
df.insert(1,'Municipality',df['ELECTION DIST'].str[:-7].str.title())
df.insert(2,'Ward',df['ELECTION DIST'].str[-6:-3])
df.insert(3,'ED', df['ELECTION DIST'].str[-3:])
df=df.dropna()
df=df.sort_values(by=['COUNTY','ELECTION DIST'])
df
COUNTY | Municipality | Ward | ED | ELECTION DIST | STATUS | DEM | REP | CON | WOR | OTH | BLANK | TOTAL | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Albany | Albany | 001 | 001 | ALBANY 001001 | Active | 74.0 | 9.0 | 0.0 | 0.0 | 0.0 | 16.0 | 99.0 |
5 | Albany | Albany | 001 | 002 | ALBANY 001002 | Active | 311.0 | 16.0 | 2.0 | 1.0 | 10.0 | 47.0 | 387.0 |
9 | Albany | Albany | 001 | 003 | ALBANY 001003 | Active | 472.0 | 26.0 | 5.0 | 1.0 | 27.0 | 121.0 | 652.0 |
13 | Albany | Albany | 001 | 004 | ALBANY 001004 | Active | 437.0 | 30.0 | 2.0 | 3.0 | 12.0 | 92.0 | 576.0 |
17 | Albany | Albany | 001 | 005 | ALBANY 001005 | Active | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 13.0 |
… | … | … | … | … | … | … | … | … | … | … | … | … | … |
53 | Yates | Milo | 000 | 006 | Milo 000006 | Active | 204.0 | 409.0 | 13.0 | 3.0 | 50.0 | 182.0 | 861.0 |
57 | Yates | Potter | 000 | 001 | Potter 000001 | Active | 144.0 | 460.0 | 25.0 | 1.0 | 51.0 | 226.0 | 907.0 |
61 | Yates | Starkey | 000 | 001 | Starkey 000001 | Active | 187.0 | 370.0 | 15.0 | 5.0 | 57.0 | 209.0 | 843.0 |
65 | Yates | Starkey | 000 | 002 | Starkey 000002 | Active | 189.0 | 433.0 | 18.0 | 4.0 | 46.0 | 201.0 | 891.0 |
69 | Yates | Torrey | 000 | 001 | Torrey 000001 | Active | 185.0 | 313.0 | 9.0 | 3.0 | 39.0 | 159.0 | 708.0 |
And then do all kinds of typical PANDAS processing. For example to categorize the counties into three pots — based on a comparison of the number of Democrats to Republicans, you could run this command:
tdf=df.groupby(by=['COUNTY']).sum()
pd.cut((tdf.div(tdf.iloc[:,:-1].sum(axis=1),axis=0)*100).sort_values(by='BLANK',ascending=False)['DEM'],3,labels=['Rep','Mix','Dem']).sort_values()
COUNTY Rensselaer Rep Genesee Rep Oswego Rep Livingston Rep Schuyler Rep Orleans Rep Seneca Rep Otsego Rep Warren Rep St.Lawrence Rep Cattaraugus Rep Chemung Rep Cayuga Rep Tioga Rep Yates Rep Broome Rep Franklin Rep Oneida Rep Fulton Rep Allegany Rep Essex Rep Niagara Rep Steuben Rep Herkimer Rep Lewis Rep Delaware Rep Wyoming Rep Chenango Rep Hamilton Rep Greene Rep Sullivan Rep Wayne Rep Jefferson Rep Ontario Rep Chautauqua Rep Putnam Rep Madison Rep Washington Rep Schoharie Rep Dutchess Rep Montgomery Rep Clinton Rep Orange Rep Cortland Rep Suffolk Rep Onondaga Rep Saratoga Rep Erie Mix Schenectady Mix Ulster Mix Rockland Mix Columbia Mix Monroe Mix Westchester Mix Richmond Mix Albany Mix Tompkins Mix Nassau Mix Queens Dem Bronx Dem New York Dem Kings Dem Name: DEM, dtype: category Categories (3, object): ['Rep' < 'Mix' < 'Dem']
If you are using the pre-2018 data, I suggest converting the PDF to text documents using the pdftotext -layout which is available on most Linux distributions as part of the poppler-utils package. This converts the PDF tables to text files, which you can process like this:
import pandas as pd
df = pd.read_csv('/home/andy/Desktop/AlbanyED_nov16.txt',
skiprows=3,
sep='\s{2,}',
engine='python',
error_bad_lines=False)
df=df[df['STATUS']=='Active']
df=df.dropna()