
PDFs and R

I generally use pdftotext -layout to convert a PDF to text on the Linux command line, then import the text file into R. For working with tables I use Tabula, which extracts the tables in the browser, though there are also native Python and R bindings (tabula-py and tabulizer). You might also use pdftools on CRAN for importing PDFs, but in my experience that requires more wrangling than creating a text file with pdftotext or extracting the tables with Tabula externally (see the sketch after the links below).

https://cran.r-project.org/web/packages/pdftools/index.html

https://datascienceplus.com/extracting-tables-from-pdfs-in-r-using-the-tabulizer-package/

https://ropensci.org/blog/2018/12/14/pdftools-20/
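For what it’s worth, the R side of it looks roughly like this (just a sketch; the file path is a placeholder, and tabulizer needs Java installed):

library(pdftools)
library(tabulizer)

pdf <- "AlbanyED_nov16.pdf"   # placeholder path

# pdf_text() returns one character string per page
pages <- pdf_text(pdf)
cat(pages[1])

# extract_tables() returns a list of matrices, one per table it detects
tables <- extract_tables(pdf, pages = 1)
head(tables[[1]])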


PANDAS and PDF Tables

I discovered the tabula-py library, which is an alternative to converting PDFs to text externally with pdftotext -layout from the poppler set of utilities before importing them into Python. This might be a good alternative to having to install and run a separate command line program.

import tabula
import pandas as pd

# read every table on every page of the enrollment PDF and stack them into one data frame
pdf = "/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"
df = pd.concat(tabula.read_pdf(pdf, pages='all', stream=True))

# just the active-voter rows
df.query("STATUS=='Active'")
    COUNTY  ELECTION DIST  STATUS     DEM     REP    CON  GRE  WOR    IND  WEP  REF  OTH   BLANK    TOTAL  COUNTY ELECTION DIST      Unnamed: 0
0   Albany  ALBANY 001001  Active      67      19      1    0    0      1    0    0    0      13      101                       NaN         NaN
3   Albany  ALBANY 001002  Active     286      20      3    0    1     19    0    0    0      45      374                       NaN         NaN
6   Albany  ALBANY 001003  Active     468      32      9    3    1     19    0    0    0      81      613                       NaN         NaN
9   Albany  ALBANY 001004  Active     449      25      6    1    4     15    1    0    1      74      576                       NaN         NaN
12  Albany  ALBANY 001005  Active       5       0      0    0    0      1    0    0    0       0        6                       NaN         NaN
7   NaN     NaN            Active     291      59      4    0    0     31    0    1    1      89      476  Albany WATERVLIET 004004         NaN
10  NaN     NaN            Active     339     213     29    4    3     57    0    0    0     202      847    Albany WESTERLO 000001         NaN
13  NaN     NaN            Active     348     119     17    1    5     43    0    0    0     160      693    Albany WESTERLO 000002         NaN
16  NaN     NaN            Active     314     137     26    2    2     35    0    0    0     196      712    Albany WESTERLO 000003         NaN
19  NaN     NaN            Active  91,658  35,821  2,991  525  608  9,612   49   14  198  41,926  183,402       Albany County Total         NaN

There are also poppler bindings for Python, which you can use after installing the poppler-cpp development library. This should work, but currently it’s not parsing things correctly.

from poppler import load_from_file, PageRenderer

# same enrollment PDF as above
pdfFile = "/home/andy/nov16-Enrollment/AlbanyED_nov16.pdf"

pdf_document = load_from_file(pdfFile)
page_1 = pdf_document.create_page(0)
page_1_text = page_1.text()

import pandas as pd
from io import StringIO

# split columns on runs of two or more spaces; error_bad_lines is
# deprecated in newer pandas (use on_bad_lines='skip' there instead)
df = pd.read_csv(StringIO(page_1_text),
   skiprows=3,
   sep=r'\s{2,}',
   engine='python',
   error_bad_lines=False)

df
COUNTY ELECTION DIST   STATUS       DEM   REP  CON  GRE  WOR   IND  WEP  REF  OTH  BLANK  TOTAL
Albany ALBANY 001001   Active      67.0  19.0  1.0  0.0  0.0   1.0  0.0  0.0  0.0   13.0  101.0
                       Inactive     5.0   2.0  0.0  0.0  0.0   0.0  0.0  0.0  0.0    2.0    9.0
                       Total       72.0  21.0  1.0  0.0  0.0   1.0  0.0  0.0  0.0   15.0  110.0
Albany ALBANY 001002   Active     286.0  20.0  3.0  0.0  1.0  19.0  0.0  0.0  0.0   45.0  374.0
                       Inactive    27.0   8.0  0.0  0.0  0.0   3.0  0.0  0.0  0.0   12.0   50.0
                       Total      313.0  28.0  3.0  0.0  1.0  22.0  0.0  0.0  0.0   57.0  424.0
Albany ALBANY 001003   Active     468.0  32.0  9.0  3.0  1.0  19.0  0.0  0.0  0.0   81.0  613.0
                       Inactive    56.0   5.0  0.0  0.0  0.0   2.0  0.0  0.0  0.0   19.0   82.0
                       Total      524.0  37.0  9.0  3.0  1.0  21.0  0.0  0.0  0.0  100.0  695.0
Albany ALBANY 001004   Active     449.0  25.0  6.0  1.0  4.0  15.0  1.0  0.0  1.0   74.0  576.0
                       Inactive    64.0   0.0  1.0  0.0  0.0   4.0  0.0  0.0  0.0   13.0   82.0
                       Total      513.0  25.0  7.0  1.0  4.0  19.0  1.0  0.0  1.0   87.0  658.0
Albany ALBANY 001005   Active       5.0   0.0  0.0  0.0  0.0   1.0  0.0  0.0  0.0    0.0    6.0
                       Total        5.0   0.0  0.0  0.0  0.0   1.0  0.0  0.0  0.0    0.0    6.0
Albany ALBANY 001006   Active     462.0  11.0  5.0  2.0  4.0  14.0  0.0  0.0  1.0   57.0  556.0
                       Inactive    63.0   3.0  0.0  0.0  0.0   5.0  0.0  0.0  1.0   12.0   84.0
                       Total      525.0  14.0  5.0  2.0  4.0  19.0  0.0  0.0  2.0   69.0  640.0
Albany ALBANY 001007   Active     435.0  19.0  1.0  0.0  2.0  19.0  0.0  0.0  0.0   62.0  538.0
                       Inactive    60.0   3.0  0.0  0.0  2.0   3.0  0.0  0.0  0.0    7.0   75.0
                       Total      495.0  22.0  1.0  0.0  4.0  22.0  0.0  0.0  0.0   69.0  613.0
Albany ALBANY 001008   Active     160.0   7.0  1.0  0.0  2.0   2.0  0.0  0.0  1.0   37.0  210.0
Page 1 of 46           NaN          NaN   NaN  NaN  NaN  NaN   NaN  NaN  NaN  NaN    NaN    NaN

Making progress on the blog πŸ“

I’m making some progress on redoing the data structure to trim down the blog size and make it more efficient, with years’ worth of space to grow. Rather than keep PDFs of all the maps, I will have the blog seamlessly generate them on the fly. It doesn’t seem to be really CPU intensive to do that, and if I see performance problems I can cache the fifty or hundred most frequently requested maps. Most of the PDFs on the blog are image based; I don’t do vector PDFs due to the size of the maps.

Taking out some obsolete code from the blog, some of which hasn’t been actively used or maintained in over a decade. I’m not going to keep the interactive 3D models, as they’re kind of big and the Q3D2JS plugin keeps breaking their format, requiring changes to the blog code. Also, the side-by-side history format is getting dumped, as I don’t use it much, in favor of the mapwarper.net website.

I’ve decided to monetize the blog with web advertisements, but not the home page. This way people who are here primarily for maps or info about camping can get their information while my blog homepage stays free of ads. I hate ads, but with the cost of hosting going up, advertising should cover the cost of hosting and maybe give me a few bucks for my time creating maps.

I’ve come up with a framework in my head to make the switchover while not breaking too many things as I delete and upload the new versions of the data structure. The thing is, I want to delete part of the old structure while uploading the new structure without taking all of the blog offline in the process. I should be able to do this, but it takes a bit of planning, as it’s a major undertaking requiring a large amount of data to be re-uploaded. mod_rewrite should get me most of the way there in the transition – my goal is to break or change around as few links as possible.

Should I pull the plug on the blog? πŸ”Œ

I’m back to thinking about that again. I decided not to auto-renew my blog, but I still have until October 23rd to make up my mind and pay the $530 fee for the next three years. Indeed, the cost is the big sticking point, but so is a recent email from the hosting company suggesting an upsell to an even more expensive contract due to the size of my blog – 36 gigabytes – as they’re enforcing a 40-gigabyte quota come November 1st, when unlimited storage with their plus plan comes to an end. Somehow that rubs me the wrong way, with the higher costs all around.

I’ve been considering my alternatives. One is to pull the plug on the blog come November. Another is to change web hosts. Or I could improve the efficiency of my data management and/or purge the oldest photos and maps, which rarely get served and don’t add much value. The PDFs for maps take up a lot of space but get few downloads compared to the JPEG versions. Lossless compression could be increased, and duplicated maps and scaled data deleted. I’ve also thought about embracing advertising, especially on the maps and archives pages, which are mostly search engine traffic. Indeed, if I were to embrace advertising, keeping the blog could end up being close to a revenue-neutral proposition.

Updated Albany County Election Results Crunched

I discovered I accidentally uploaded the wrong file for the Albany County 2022 Election Results Reapportioned to the current EDs, but it has been fixed. https://docs.google.com/spreadsheets/d/1f3p5vNnWtCnkZ38DZwt5GR74-aPauWSaoFSkspPqEiE/edit#gid=1481996295

Also, I added the 2023 Albany County Primary crunched numbers to my 2019-2023 archive, which includes all of those old results in their original version and “adjusted based on block-level Voting Age Population” to the currently in-effect election districts.

https://andyarthur.org/data/albany.county.election.results.2019.2022.zip

I’ve thought about going back to 2018 and before, but I’m not sure how useful those numbers are as demographics shift over the decade, and prior to November 2018 the State Board published enrollments as PDFs, which can be trickier to parse than Excel files.

Some Useful Investigative Open Data Resources

Usually when I want to research a New York organization, I start at NY Open Government, a website put out by the NY Attorney General that brings together several open datasets.

NY Open Government: nyopengovernment.com

One way to find people’s addresses is to find them in the voter file. Voter Ref contains the voter files for several states, which can be handy for looking up people’s addresses, dates of birth, and party registration.

Voter Ref: voteref.com

You can confirm the latest information on people’s voter registration and address by using the State Board of Elections’ voter lookup. You will need their county, date of birth, and ZIP code, which you can get from Voter Ref.

NY Voter Lookup: voterlookup.elections.ny.gov

See Through NY has listings of many, though not all, government employees, which can be useful when you are trying to find information on government workers. No addresses here, but you can find salaries and who people work for in government. If you need bulk data, I wrote a scraping script.

See Through NY: seethroughny.net/payrolls

Another good way to glean people’s addresses and the candidates they support is the NYS Campaign Finance Search. If you think somebody might work for a candidate or campaign, you can search the campaign expenses section.

NYS Campaign Finance Search: publicreporting.elections.ny.gov

The FEC Campaign Search includes a contributor’s address and reported employer, which can provide useful information.

FEC Campaign Search: https://www.fec.gov/data/receipts/individual-contributions/

Every county in New York State is required to post its tax rolls to its website. Tax rolls can usually be found by searching on Google: “XXXX County Tax Rolls” without the quotes. Not only can you find all of the properties owned by a person that way, you can also find their address, assessed value, and other information. Often the county tax rolls include information on tax exemptions, such as the Guilderland Solar Exemption and Veterans STAR Reduction, which can help you find people who have solar on their homes or are veterans. I wrote a script to convert the PDFs into Excel spreadsheets; a rough sketch of the idea is below.
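The gist of it in R with tabulizer looks something like this (a rough sketch, not my exact script: the file name and page range are placeholders, and every county lays out its rolls a little differently):

library(tabulizer)
library(writexl)

roll <- "county_tax_roll.pdf"   # placeholder file name

# pull the tables off each page; extract_tables() returns a list of matrices
tables <- extract_tables(roll, pages = 1:10)

# assumes every page has the same columns; adjust for your county's layout
df <- as.data.frame(do.call(rbind, tables), stringsAsFactors = FALSE)
write_xlsx(df, "county_tax_roll.xlsx")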

If you need to search a whole county or even the whole state, you will want to get the full roll from NYS GIS. You don’t need mapping or GIS software to use the Shapefiles — Microsoft Excel and OpenOffice can natively open the .DBF files, which contain the data tables (see the sketch after the links below). NYS GIS offers selected counties’ tax maps as a Shapefile or GeoPackage too.

NYS GIS Parcels Program: https://gis.ny.gov/parcels/
NYS GIS Tax Parcels Centroid Points:
gis.ny.gov/gisdata/inventories/details.cfm?DSID=1300
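If you would rather pull the .DBF into R than Excel, the foreign package reads it directly (quick sketch; the file name is a placeholder for whatever the parcel download unzips to):

library(foreign)

# read the attribute table straight out of the Shapefile's .DBF
parcels <- read.dbf("NYS_Tax_Parcels.dbf", as.is = TRUE)   # placeholder file name
head(parcels)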

In addition, most counties offer their tax maps as ArcGIS REST/Services that can be used in a GIS Program like QGIS. You can find them by searching on Google: “XXX County “REST/Services” parcel”.

How to download ArcGIS Rest/Services as KML Google Maps File: mappingsupport.com/p2/kmz_demo/How_to_download_arcgis_data_as_kmz.pdf

arcpullr is a great way to get this data in R: andyarthur.org/youll-like-arcpullr.html
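A rough sketch of what that looks like (the service URL here is made up; substitute the county REST endpoint you find with the search above):

library(arcpullr)
library(sf)

# placeholder URL for a county parcel layer
url <- "https://gis.example-county.gov/arcgis/rest/services/Parcels/MapServer/0"

# get_spatial_layer() queries the REST service and returns an sf object
parcels <- get_spatial_layer(url)
plot(st_geometry(parcels))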

In addition, Joseph Elfelt maintains a list of many open government REST/Services:

PDF: https://mappingsupport.com/p/surf_gis/list-federal-state-county-city-GIS-servers.pdf
txt file: https://mappingsupport.com/p/surf_gis/list-federal-state-county-city-GIS-servers.txt
csv file: https://mappingsupport.com/p/surf_gis/list-federal-state-county-city-GIS-servers.csv

An easier to read version can be found here: https://servers.cartobin.com/state/New%20York

Sometimes it’s useful to find what state contracts people have:

NY Open Book Contracts Search: wwe2.osc.state.ny.us/transparency/contracts/contractsearch.cfm

See how local governments like counties and cities spend their money:

Local Government Reports: wwe2.osc.state.ny.us/transparency/LocalGov/LocalGovIntro.cfm

Many different data sets can be found on data.ny.gov

Good Morning – October 21, 2021

Good morning! Happy Thursday. Should be a nice one. ☺

I thought about taking today and Friday off, but with the front coming through, I decided it would be an awfully cold and gray weekend. Plus, going to work is just the default option; it doesn’t require me to plan or spend money that I don’t have.

Partly cloudy and 54 degrees at the Elm Ave Park and Ride. β›… Calm wind. The dew point is 51 degrees.

I was surprised how many were on the Express Bus 🚍 downtown – six. This is particularly surprising as the later bus I take is less popular, and many of the regular customers stopped riding during the pandemic and after Voorheesville service ended. Maybe this will help make the service more sustainable.

Found an old broken USB cable to use as a belt cable 🔌 to keep my wallet tied to my pants 👖. Seems silly, but if it keeps me from losing my wallet, all the better. I worried a lot about that yesterday.

Been working on building a PANDAS 🐼 dataset with party enrollment for making graphs and other data analysis. The hardest part was downloading and parsing the PDFs, but even that isn’t rocket 🚀 science. Python does all the hard work.