Launched in 2003, New York City’s non-emergency complaint hotline (311) generates call records that are available as a dataset on the NYC OpenData portal. The portal provides useful intelligence and the ability to detect patterns that would be difficult to discern without geospatial data. For the Carto/mapping lab I explored this dataset to gain insight into the volume and types of complaints reported across neighborhoods in all five boroughs.
The following images exemplify maps that highlight key insights from the data in easy-to-understand visualizations.
Wired Magazine mapped a week of complaint data in September 2010 in a piece called “What’s Your Problem” (Fig 1). This fun map communicates the number of complaints by zip code using colorful graphics and a simple key.
Fig 2 is from Carto.com’s documentation pages and shows analysis from a multi-layer map combining polygon and point data. Again, it is visually interesting with clear analysis.
FiveThirtyEight presented a choropleth map (Fig 3) depicting mortality rates for leading causes of death in every U.S. county from 1980 to 2014. I like the map’s interactive feature, which animates the changes over time.
- I used NYC Open Data 311 dataset records from 2010 to present, initially made available to the public in 2015 as part of the Open Data For All initiative. The complete dataset contains 15M rows and 53 variables (columns) describing the service request, such as complaint type, date received, incident address, and resolution description.
- To map the data I used Carto Builder, formerly CartoDB, a web-based analysis tool that enables users to gain key insights from location data through visualization and analysis.
- Tableau Public is a business intelligence and analysis tool for visualizing quantitative and geospatial data, used in this lab as a comparison to Carto Builder.
- MS Excel pivot tables were employed to review and validate map results.
To limit computational overhead, I first filtered the table to eight columns, including lat/lon, reducing it to 2M rows. I sorted the table to find the top 10 complaint types across all years, then attempted to “clean” the data using OpenRefine and R, with limited success given the dataset’s size. Subsequent cleaning iterations involved grouping and filtering directly in the portal. By filtering to 2016 and grouping by month, type, zip code, and count of unique IDs, I created datasets ranging from 160k rows down to 50k rows. I used Excel pivot tables at each step to validate the numbers.
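The filter-and-group step above can be sketched in pandas. This is a minimal illustration on toy data, assuming the 311 schema’s column names (“Unique Key”, “Created Date”, “Complaint Type”, “Incident Zip”); in practice the full export would be loaded with `pd.read_csv(..., usecols=...)`.

```python
import pandas as pd

# Toy stand-in for the 311 export; column names are assumed from the
# NYC 311 schema.
df = pd.DataFrame({
    "Unique Key": [1, 2, 3, 4],
    "Created Date": pd.to_datetime(
        ["2016-01-05", "2016-01-17", "2016-02-03", "2015-12-30"]),
    "Complaint Type": ["Noise", "Noise", "Heating", "Noise"],
    "Incident Zip": ["10001", "10001", "10002", "10001"],
})

# Filter to 2016, then group by month, complaint type, and zip code,
# counting unique service-request IDs — the same aggregation done in
# the portal.
df_2016 = df[df["Created Date"].dt.year == 2016]
grouped = (df_2016
           .groupby([df_2016["Created Date"].dt.month.rename("Month"),
                     "Complaint Type", "Incident Zip"])["Unique Key"]
           .nunique()
           .reset_index(name="Complaints"))
print(grouped)
```

The resulting table (month × type × zip with a complaint count) is the shape of data the Excel pivot tables validated at each step.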
The FiveThirtyEight map looked like a Tableau choropleth, so I attempted to recreate it in Tableau for practice and as a comparison. I imported the dataset and dragged zip codes, then measures, onto the view, which created a filled map (Fig 4). I then created filters based on the dimensions and added date to the Pages shelf, which generated a time slider. The map looked and behaved as expected, so I moved on to Carto Builder.
Carto created point data from the dataset using the lat/lon upon import. I realized I needed polygon geometry to create a filled map. Referring back to Tableau, I realized I had not noticed that it used generated polygons rather than the lat/lon in the dataset. I tried running the Georeference analysis in Carto but received an error message: too many rows. I then imported a shapefile of the zip codes and created a join, which appeared to work but still had difficulty processing the 50k rows.
At an impasse, I finished the lab with a good-looking map but incorrect measures. Later I had the idea of sampling the dataset and researched some tools; fortuitously, I discovered sampling is built into Carto. I ran the sampling analysis and georeferenced the result set, which quickly generated the correct geometry, but the numbers still looked “off”.
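Carto’s sampling analysis draws a random subset of rows; an equivalent offline step in pandas might look like the following (toy data, hypothetical 5% fraction):

```python
import pandas as pd

# Toy stand-in for the filtered 311 table.
df = pd.DataFrame({"Unique Key": range(1000),
                   "Incident Zip": ["10001"] * 1000})

# Draw a reproducible random sample — keep 5% of rows.
sample = df.sample(frac=0.05, random_state=42)
print(len(sample))  # 50
```

The trade-off is the same one I hit in Carto: sampling makes the georeference step tractable, but any counts computed from the sample no longer match the full table.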
I revisited the dataset in the portal and, through further filtering, reduced the table to 20k rows and a 1MB file size. I ran the georeference analysis again and again got the “too many rows” error. I resorted to joining this table to the shapefile and generated a filled map (Fig 5).
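The join to the shapefile is an attribute join on the zip code. A minimal pandas sketch, with toy data and assumed key names (“ZIPCODE” for the shapefile’s attribute table, “Incident Zip” for the complaint counts):

```python
import pandas as pd

# Complaint counts per zip, as produced by the portal grouping.
counts = pd.DataFrame({"Incident Zip": ["10001", "10002"],
                       "Complaints": [120, 45]})

# Zip-code polygon attribute table, e.g. read from the shapefile's .dbf;
# "ZIPCODE" is an assumed key name.
zips = pd.DataFrame({"ZIPCODE": ["10001", "10002", "10003"]})

# Left join keeps every polygon; zips with no complaints get 0 so the
# filled map has no holes.
joined = zips.merge(counts, left_on="ZIPCODE", right_on="Incident Zip",
                    how="left")
joined["Complaints"] = joined["Complaints"].fillna(0).astype(int)
```

A left join from the polygons is the safer direction here: an inner join would silently drop zip codes with no complaints, which is itself a signal worth mapping.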
The widgets for complaints by type and by zip reported accurately, but the legend and map represented the count of the complaints rather than the sum. The time-series widget inaccurately reflected the date range. Given the amount of time spent on the data, analysis, and widgets, I had to forgo time animation. I subsequently discovered that only point data can be animated; there is likely a workaround for polygons, but I did not find it.
Update July 3:
While doing research I found the Carto analysis to intersect a second layer, which creates an aggregate column of the complaint counts. I joined the zip code shapefile to the complete (not sampled) dataset, ran the intersect-second-layer analysis to generate aggregates, and styled by that value. The map below correctly reflects the table data.
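Conceptually, the intersect analysis counts how many complaint points fall inside each zip polygon. A minimal pure-Python sketch of that idea, using ray-casting point-in-polygon on toy rectangular “zips” (a real workflow would use PostGIS `ST_Intersects` under Carto’s hood):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon (list of vertices)?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Edge crosses the horizontal ray through y; check the crossing side.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Toy zip polygons (unit squares) and complaint points (lon, lat).
zips = {
    "10001": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "10002": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
complaints = [(0.5, 0.5), (0.2, 0.8), (1.5, 0.5)]

# Aggregate: complaint count per polygon — the column Carto's
# intersect analysis adds and that the map is styled by.
counts = {z: sum(point_in_polygon(x, y, poly) for x, y in complaints)
          for z, poly in zips.items()}
print(counts)  # {'10001': 2, '10002': 1}
```

Because the aggregation runs against the full point layer rather than a sample, the styled values match the source table, which is why this approach finally produced a correct map.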
The Carto lab has been the most challenging and time consuming due to issues with the dataset and a non-intuitive application. In hindsight, and with more experience, understanding the data up front can save time and effort in the cleaning and manipulation needed to use it. Going forward I would work with smaller datasets or use sampling options from the beginning.
Carto Builder is deceptively complex. Though it presents as “intuitive, logical and effortless”, a non-geospatial analyst would need some command of statistics. Knowledge of CSS and HTML is very useful, but familiarity with SQL is necessary to create complex visualizations. By comparison, Tableau automatically georeferenced the zip codes into polygons and had no issue with the larger datasets. I was able to animate polygons and duplicate the FiveThirtyEight map functionality.
Despite less-than-desired results, the mapping exercise still revealed key insights. The zip code shapefile increased my awareness of the neighborhood boundaries within the boroughs and their relative sizes. Mapping the highest volume of complaints by zip through the year gave clarity to the disparity of issues. To make the most of this rich dataset, I would like to get the time series working, keep addresses in the dataset to show point data, and drill down into specific issues.