For some of my projects, I need access to the full set of enrollment files from the State Board of Elections. While you can download them one at a time, it is much nicer to grab all of them at once for statewide coverage and easy processing. You can fetch them using Beautiful Soup.
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

yr = '12'
mon = 'nov'
ext = '.pdf'  # use '.xlsx' for the newer spreadsheet releases
url = 'https://www.elections.ny.gov/20' + yr + 'EnrollmentED.html'

# If there is no such folder, the script will create one automatically
folder_location = mon + yr + '-Enrollment'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

# Fetch the enrollment index page with a browser-like User-Agent header
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103197 Safari/537.36'})
soup = BeautifulSoup(response.text, 'html.parser')

# Grab every link ending in the month, year, and extension we want
for link in soup.select("a[href$='" + mon + yr + ext + "']"):
    # Name each file using the last portion of its link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
What can you do with them? If you are looking at the numbers from 2018 and later, you can load them directly into pandas.
import glob as g
import pandas as pd

# Combine every enrollment spreadsheet into one dataframe;
# the data starts after a four-row header in each file
df = pd.concat((pd.read_excel(f, header=4) for f in g.glob('/home/andy/2021-Enrollment/*xlsx')))

# Keep only active voters
df = df[df['STATUS'] == 'Active']

# Split the ELECTION DIST string into municipality, ward, and election district
df.insert(1, 'Municipality', df['ELECTION DIST'].str[:-7].str.title())
df.insert(2, 'Ward', df['ELECTION DIST'].str[-6:-3])
df.insert(3, 'ED', df['ELECTION DIST'].str[-3:])

df = df.dropna()
df = df.sort_values(by=['COUNTY', 'ELECTION DIST'])
df
      COUNTY Municipality  Ward   ED   ELECTION DIST  STATUS    DEM    REP   CON  WOR   OTH  BLANK  TOTAL
  1   Albany       Albany   001  001   ALBANY 001001  Active   74.0    9.0   0.0  0.0   0.0   16.0   99.0
  5   Albany       Albany   001  002   ALBANY 001002  Active  311.0   16.0   2.0  1.0  10.0   47.0  387.0
  9   Albany       Albany   001  003   ALBANY 001003  Active  472.0   26.0   5.0  1.0  27.0  121.0  652.0
 13   Albany       Albany   001  004   ALBANY 001004  Active  437.0   30.0   2.0  3.0  12.0   92.0  576.0
 17   Albany       Albany   001  005   ALBANY 001005  Active   13.0    0.0   0.0  0.0   0.0    0.0   13.0
...      ...          ...   ...  ...             ...     ...    ...    ...   ...  ...   ...    ...    ...
 53    Yates         Milo   000  006     Milo 000006  Active  204.0  409.0  13.0  3.0  50.0  182.0  861.0
 57    Yates       Potter   000  001   Potter 000001  Active  144.0  460.0  25.0  1.0  51.0  226.0  907.0
 61    Yates      Starkey   000  001  Starkey 000001  Active  187.0  370.0  15.0  5.0  57.0  209.0  843.0
 65    Yates      Starkey   000  002  Starkey 000002  Active  189.0  433.0  18.0  4.0  46.0  201.0  891.0
 69    Yates       Torrey   000  001   Torrey 000001  Active  185.0  313.0   9.0  3.0  39.0  159.0  708.0
And then you can do all kinds of typical pandas processing. For example, to sort the counties into three pots based on a comparison of the number of enrolled Democrats to Republicans, you could run something along the lines of the sketch below:
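The exact command isn't reproduced here, so this is only a minimal sketch of one way to build such a categorization: it totals the active enrollment by county, computes the Democratic share of the two-party (DEM plus REP) enrollment, and bins that share with pd.cut. The 45% and 55% cutoffs are my own assumption, so the county groupings it produces may not match the listing below exactly.

# Total active enrollment per county
county = df.groupby('COUNTY')[['DEM', 'REP']].sum()

# Democratic share of the two-party enrollment, written back into DEM
# so the resulting Series keeps that name
county['DEM'] = county['DEM'] / (county['DEM'] + county['REP'])

# Bin the share into three ordered categories; the 0.45/0.55 cutoffs
# are assumed thresholds, not the original ones
pots = pd.cut(county['DEM'], bins=[0, 0.45, 0.55, 1],
              labels=['Rep', 'Mix', 'Dem'])
pots.sort_values()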
COUNTY
Rensselaer Rep
Genesee Rep
Oswego Rep
Livingston Rep
Schuyler Rep
Orleans Rep
Seneca Rep
Otsego Rep
Warren Rep
St.Lawrence Rep
Cattaraugus Rep
Chemung Rep
Cayuga Rep
Tioga Rep
Yates Rep
Broome Rep
Franklin Rep
Oneida Rep
Fulton Rep
Allegany Rep
Essex Rep
Niagara Rep
Steuben Rep
Herkimer Rep
Lewis Rep
Delaware Rep
Wyoming Rep
Chenango Rep
Hamilton Rep
Greene Rep
Sullivan Rep
Wayne Rep
Jefferson Rep
Ontario Rep
Chautauqua Rep
Putnam Rep
Madison Rep
Washington Rep
Schoharie Rep
Dutchess Rep
Montgomery Rep
Clinton Rep
Orange Rep
Cortland Rep
Suffolk Rep
Onondaga Rep
Saratoga Rep
Erie Mix
Schenectady Mix
Ulster Mix
Rockland Mix
Columbia Mix
Monroe Mix
Westchester Mix
Richmond Mix
Albany Mix
Tompkins Mix
Nassau Mix
Queens Dem
Bronx Dem
New York Dem
Kings Dem
Name: DEM, dtype: category
Categories (3, object): ['Rep' < 'Mix' < 'Dem']
If you are using the pre-2018 data, I suggest converting the PDFs to text documents with pdftotext -layout, which is available on most Linux distributions as part of the poppler-utils package.
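As a rough sketch, assuming the PDFs were downloaded into a folder such as nov12-Enrollment by the script above (the folder names here are just placeholders), the conversion can be scripted from Python with subprocess:

import glob
import os
import subprocess

pdf_dir = 'nov12-Enrollment'      # wherever the PDFs were saved
txt_dir = 'nov12-Enrollment-txt'  # destination for the text versions
os.makedirs(txt_dir, exist_ok=True)

for pdf in glob.glob(os.path.join(pdf_dir, '*.pdf')):
    txt = os.path.join(txt_dir, os.path.basename(pdf).replace('.pdf', '.txt'))
    # -layout keeps the table columns lined up in the text output
    subprocess.run(['pdftotext', '-layout', pdf, txt], check=True)

This converts the PDF tables into text files, which you can process like this: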