Analysis of Legally Operating Businesses in NYC, plus GeoPandas!

In this post, I will demonstrate how to perform some simple analyses on the Legally Operating Businesses in NYC dataset. I’ll be reading in the dataset from the json format, so it will require some manipulation to read and pull the data we want.

I’ll also be showing how to use the GeoPandas library to read a geospatial map of NYC and plot the businesses by latitude and longitude coordinates over the map. Let’s get right into it!

First, load the necessary libraries:

import numpy as np
import pandas as pd
import json
import ijson
import matplotlib.pyplot as plt
import seaborn as sns

%%bash
head LegallyOperatingBusinessesNYC.json

{
  "meta" : {
    "view" : {
      "id" : "w7w3-xahh",
      "name" : "Legally Operating Businesses",
      "attribution" : "Department of Consumer Affairs (DCA)",
      "averageRating" : 0,
      "category" : "Business",
      "createdAt" : 1425675688,
      "description" : "This data set features businesses/individuals holding a DCA license so that they may legally operate in New York City. Note: Sightseeing guides and temporary street fair vendors are not included in this data set.",

# Read json file
with open('LegallyOperatingBusinessesNYC.json', 'r') as myfile:
    data=myfile.read()

# Parse file and load into dictionary
obj = json.loads(data)

# View json keys
obj.keys()

dict_keys(['meta', 'data'])

This shows us that meta and data are top level keys in the LegallyOperatingBusinessessNYC.json data

# Iteratively read and parse json file while specifiying path meta.view.columns
filename = "LegallyOperatingBusinessesNYC.json"
with open(filename, 'r') as f:
    objects = ijson.items(f, 'meta.view.columns.item')
    columns = list(objects)

# View first item in columns list
print(columns[0])

{'id': -1, 'name': 'sid', 'dataTypeName': 'meta_data', 'fieldName': ':sid', 'position': 0, 'renderTypeName': 'meta_data', 'format': {}, 'flags': ['hidden']}

Each item in columns is a dictionary that contains information about each column. In order to get our header, it looks like fieldName is the relevant key to extract. To get our column names, we just have to extract the fieldName key from each item in columns:

column_names = [col["fieldName"] for col in columns]

# View column_names
column_names

[':sid',
 ':id',
 ':position',
 ':created_at',
 ':created_meta',
 ':updated_at',
 ':updated_meta',
 ':meta',
 'license_nbr',
 'license_type',
 'lic_expir_dd',
 'license_status',
 'license_creation_date',
 'industry',
 'business_name',
 'business_name_2',
 'address_building',
 'address_street_name',
 'address_street_name_2',
 'address_city',
 'address_state',
 'address_zip',
 'contact_phone',
 'address_borough',
 'detail',
 'community_board',
 'council_district',
 'bin',
 'bbl',
 'nta',
 'census_tract',
 'detail_2',
 'longitude',
 'latitude',
 'location',
 ':@computed_region_efsh_h5xi',
 ':@computed_region_f5dn_yrer',
 ':@computed_region_yeji_bk3q',
 ':@computed_region_92fq_4b7q',
 ':@computed_region_sbqj_enih']

Extract the data by making a list of desired useful columns. NOTE: ‘location’ column was excluded because latitude and longitude were used only for later analysis

good_columns = [
    'license_nbr',
    'license_type',
    'lic_expir_dd',
    'license_status',
    'license_creation_date',
    'industry',
    'business_name',
    'business_name_2',
    'address_building',
    'address_street_name',
    'address_street_name_2',
    'address_city',
    'address_state',
    'address_zip',
    'contact_phone',
    'address_borough',
    'detail',
    'community_board',
    'council_district',
    'bin',
    'bbl',
    'nta',
    'census_tract',
    'detail_2',
    'longitude',
    'latitude']

# Extract desired data
data = []
with open(filename, 'r') as f:
    objects = ijson.items(f, 'data.item')
    for row in objects:
        selected_row = []
        for item in good_columns:
            selected_row.append(row[column_names.index(item)])
            data.append(selected_row)

# View first item in data
data[0]

['2080385-DCA',
 'Individual',
 '2020-10-31T00:00:00',
 'Active',
 '2018-12-03T00:00:00',
 'Tow Truck Driver',
 'DELACRUZ VARGAS, CARLOS',
 None,
 None,
 None,
 None,
 'NEW YORK',
 'NY',
 '10027',
 '9173731141',
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 None]

# Create pandas df to analyze data
businesses = pd.DataFrame(data, columns=good_columns)

# View businesses dataframe
businesses.head()

	license_nbr	license_type	lic_expir_dd	license_status	license_creation_date	industry	business_name	business_name_2	address_building	address_street_name	...	detail	community_board	council_district	bin	bbl	nta	census_tract	detail_2	longitude	latitude
0	2080385-DCA	Individual	2020-10-31T00:00:00	Active	2018-12-03T00:00:00	Tow Truck Driver	DELACRUZ VARGAS, CARLOS	None	None	None	...	None	None	None	None	None	None	None	None	None	None
1	2080385-DCA	Individual	2020-10-31T00:00:00	Active	2018-12-03T00:00:00	Tow Truck Driver	DELACRUZ VARGAS, CARLOS	None	None	None	...	None	None	None	None	None	None	None	None	None	None
2	2080385-DCA	Individual	2020-10-31T00:00:00	Active	2018-12-03T00:00:00	Tow Truck Driver	DELACRUZ VARGAS, CARLOS	None	None	None	...	None	None	None	None	None	None	None	None	None	None
3	2080385-DCA	Individual	2020-10-31T00:00:00	Active	2018-12-03T00:00:00	Tow Truck Driver	DELACRUZ VARGAS, CARLOS	None	None	None	...	None	None	None	None	None	None	None	None	None	None
4	2080385-DCA	Individual	2020-10-31T00:00:00	Active	2018-12-03T00:00:00	Tow Truck Driver	DELACRUZ VARGAS, CARLOS	None	None	None	...	None	None	None	None	None	None	None	None	None	None

5 rows × 26 columns

businesses.tail()

	license_nbr	license_type	lic_expir_dd	license_status	license_creation_date	industry	business_name	business_name_2	address_building	address_street_name	...	detail	community_board	council_district	bin	bbl	nta	census_tract	detail_2	longitude	latitude
5182445	1054840-DCA	Business	2020-12-31T00:00:00	Active	2000-12-05T00:00:00	Tobacco Retail Dealer	K P DELI CORP.	None	255	HYLAN BLVD	...	5	501	49	5090837	5028530001	SI37	8	None	-74.07037351766044	40.61193872381227
5182446	1054840-DCA	Business	2020-12-31T00:00:00	Active	2000-12-05T00:00:00	Tobacco Retail Dealer	K P DELI CORP.	None	255	HYLAN BLVD	...	5	501	49	5090837	5028530001	SI37	8	None	-74.07037351766044	40.61193872381227
5182447	1054840-DCA	Business	2020-12-31T00:00:00	Active	2000-12-05T00:00:00	Tobacco Retail Dealer	K P DELI CORP.	None	255	HYLAN BLVD	...	5	501	49	5090837	5028530001	SI37	8	None	-74.07037351766044	40.61193872381227
5182448	1054840-DCA	Business	2020-12-31T00:00:00	Active	2000-12-05T00:00:00	Tobacco Retail Dealer	K P DELI CORP.	None	255	HYLAN BLVD	...	5	501	49	5090837	5028530001	SI37	8	None	-74.07037351766044	40.61193872381227
5182449	1054840-DCA	Business	2020-12-31T00:00:00	Active	2000-12-05T00:00:00	Tobacco Retail Dealer	K P DELI CORP.	None	255	HYLAN BLVD	...	5	501	49	5090837	5028530001	SI37	8	None	-74.07037351766044	40.61193872381227

5 rows × 26 columns

# View basic summary statistical details of data
businesses.describe()

	license_nbr	license_type	lic_expir_dd	license_status	license_creation_date	industry	business_name	business_name_2	address_building	address_street_name	...	detail	community_board	council_district	bin	bbl	nta	census_tract	detail_2	longitude	latitude
count	5182450	5182450	5181566	5182450	5182450	5182450	5181124	847626	3299790	3319212	...	3012100	2932982	2941432	2895776	2895802	2930668	2930668	730002	3110926	3110926
unique	199323	2	1497	2	6845	59	156282	24583	13099	10970	...	5	69	60	48315	48594	193	1308	6446	58954	58954
top	1374839-DCA	Business	2021-02-28T00:00:00	Inactive	2018-02-23T00:00:00	Home Improvement Salesperson	T-MOBILE NORTHEAST LLC	T-MOBILE	1	BROADWAY	...	3	105	3	0	4000000000	MN17	96	General Vendor Type: General Vendor (White)	-77.51958437167269	40.11238533659802
freq	52	3319212	755924	3214406	7618	846092	15834	7254	12376	96434	...	878228	127062	84266	43550	19526	92612	27586	116480	168428	168428

4 rows × 26 columns

From the details, it can be seen that there are 5182450 rows. There are 2 license types, with license type “Business” being the most common, with a frequency of 3319212. The most common industry is “Home Improvement Salesperson”, with a frequency of 846092. The most common business_name is “T-MOBILE NORTHEAST LLC”, with a frequency of 15834. THe most common address street name is “BROADWAY”, with a frequency of 96434.

How many businesses are there sorted by license_type?

businesses["license_type"].value_counts()

Business      3319212
Individual    1863238
Name: license_type, dtype: int64

n, m = businesses.shape
print("Rows:", n)
print("Columns:", m)

Rows: 5182450
Columns: 26

# Plot a bar graph of the number of licenses for each license_type
license_count = businesses['license_type'].value_counts()
plt.figure(figsize=(10,5))
sns.barplot(license_count.index, license_count.values, alpha=0.8)
plt.title('Number of Licenses for each License Type')
plt.ylabel('Count', fontsize=12)
plt.xlabel('License Type', fontsize=12)
plt.show()

png

What are the top industries?

businesses["industry"].value_counts()

Home Improvement Salesperson      846092
Home Improvement Contractor       771394
Tobacco Retail Dealer             633074
Secondhand Dealer - General       321776
Electronics Store                 273312
Tow Truck Driver                  197756
Stoop Line Stand                  167674
General Vendor                    161408
Sightseeing Guide                 160368
Pedicab Driver                    155194
Electronic & Appliance Service    135356
Laundries                         114842
Laundry                           112060
Locksmith                         105014
Debt Collection Agency            103610
Process Server Individual         101556
Ticket Seller                      86034
Electronic Cigarette Dealer        80730
Sidewalk Cafe                      79924
Laundry Jobber                     79378
Garage                             64974
Dealer In Products                 49218
Secondhand Dealer - Auto           43680
Amusement Device Portable          39650
Tow Truck Company                  30732
Parking Lot                        28106
Employment Agency                  24492
Pawnbroker                         24440
Amusement Device Temporary         20332
Auctioneer                         19526
Motion Picture Projectionist       16952
Pedicab Business                   16562
Newsstand                          13754
Special Sale                       13078
Horse Drawn Driver                 11960
Process Serving Agency              8970
Cabaret                             8658
Amusement Device Permanent          7436
Garage and Parking Lot              6240
Games of Chance                     6058
Car Wash                            4316
Gaming Cafe                         3822
Scrap Metal Processor               3666
Pool or Billiard Room               3588
Horse Drawn Cab Owner               3562
Catering Establishment              3224
Bingo Game Operator                 3120
Tow Truck Exemption                 3042
Storage Warehouse                   2704
Auction House Premises              2548
Scale Dealer Repairer               1794
Amusement Arcade                    1378
Locksmith Apprentice                1378
Sightseeing Bus                     1352
General Vendor Distributor           442
Commercial Lessor                    364
Secondhand Dealer - Firearms         312
Booting Company                      312
Ticket Seller Business               156
Name: industry, dtype: int64

# Plot a bar graph of the top 10 industries
industry_count = businesses['industry'].value_counts()
industry_count = industry_count[:10,]
plt.figure(figsize=(10,5))
chart = sns.barplot(industry_count.index, industry_count.values, alpha=0.8)
plt.title('Top 10 Businesses by Industry')
plt.ylabel('Number of Businesses', fontsize=12)
plt.xlabel('Industry', fontsize=12)
chart.set_xticklabels(chart.get_xticklabels(), rotation=70)
plt.show()

png

How many licenses are active vs. inactive?

businesses["license_status"].value_counts()

Inactive    3214406
Active      1968044
Name: license_status, dtype: int64

# Plot a bar graph of Inactive vs. Active Licenses
status_count = businesses['license_status'].value_counts()
plt.figure(figsize=(10,5))
sns.barplot(status_count.index, status_count.values, alpha=0.8)
plt.title('Number of Inactive vs. Active Licenses')
plt.ylabel('Count', fontsize=12)
plt.xlabel('License Status', fontsize=12)
plt.show()

png

# Create subset df of individual active licenses where latitude and longitude is given
df = businesses[(businesses['license_type'] == 'Individual') & (businesses['license_status'] == 'Active') & (businesses['latitude'].notnull()) & (businesses['longitude'].notnull())]

df

	license_nbr	license_type	lic_expir_dd	license_status	license_creation_date	industry	business_name	business_name_2	address_building	address_street_name	...	detail	community_board	council_district	bin	bbl	nta	census_tract	detail_2	longitude	latitude

0 rows × 26 columns

Interestingly, all individual active licenses do not have latitude and longitude coordinates listed!

Geospatial Analysis

To load the map, we need to download the shapefile from Borough Boundaries NYC Open Data. I renamed the shapefile to NYC_geospatial.shp for simplification.

# Plot map
import descartes
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Read geospatial map file
street_map = gpd.read_file('NYC_geospatial.shp')

fix, ax = plt.subplots(figsize = (15, 15))
street_map.plot(ax=ax)

<matplotlib.axes._subplots.AxesSubplot at 0x1bbb11a610>

png

# Create subset df of active businesses where latitude and longitude is given
df = businesses[(businesses['license_status'] == 'Active') & (businesses['latitude'].notnull()) & (businesses['longitude'].notnull())]

# Keep only industry, latitude and longitude coordinates of df
df = df[['industry', 'latitude', 'longitude']]

# Inspect df
df

	industry	latitude	longitude
12090	Stoop Line Stand	40.718663472948315	-73.99761907513695
12091	Stoop Line Stand	40.718663472948315	-73.99761907513695
12092	Stoop Line Stand	40.718663472948315	-73.99761907513695
12093	Stoop Line Stand	40.718663472948315	-73.99761907513695
12094	Stoop Line Stand	40.718663472948315	-73.99761907513695
...	...	...	...
5182445	Tobacco Retail Dealer	40.61193872381227	-74.07037351766044
5182446	Tobacco Retail Dealer	40.61193872381227	-74.07037351766044
5182447	Tobacco Retail Dealer	40.61193872381227	-74.07037351766044
5182448	Tobacco Retail Dealer	40.61193872381227	-74.07037351766044
5182449	Tobacco Retail Dealer	40.61193872381227	-74.07037351766044

1073644 rows × 3 columns

# Convert latitude and longitude to float
df['latitude'] = df['latitude'].astype(float)
df['longitude'] = df['longitude'].astype(float)

df.describe()

	latitude	longitude
count	1.073644e+06	1.073644e+06
mean	4.071494e+01	-7.396611e+01
std	9.905580e-02	3.541175e-01
min	4.011239e+01	-7.751958e+01
25%	4.066054e+01	-7.398479e+01
50%	4.072179e+01	-7.394054e+01
75%	4.076629e+01	-7.387909e+01
max	4.091206e+01	-7.370155e+01

# Since some businesses are out of map range, filter to only include businesses within range of map
df = df[(df['latitude'] > 40.45) & (df['latitude'] < 40.95) & (df['longitude'] > -74.3) & (df['longitude'] < -73.6)]

# Convert latitude and longitude coordinates to geometry points
geometry = [Point(xy) for xy in zip(df["longitude"], df["latitude"])]

# Inspect geometry
geometry[:3]

[<shapely.geometry.point.Point at 0x1ba7961a90>,
 <shapely.geometry.point.Point at 0x1ba7961510>,
 <shapely.geometry.point.Point at 0x1ba7961c50>]

# Specify coordinate system
crs = {'init': 'EPSG:4326'}

# Create GeoDataFrame
geo_df = gpd.GeoDataFrame(df, crs = crs, geometry = geometry)

/Users/tonysong/opt/anaconda3/lib/python3.7/site-packages/pyproj/crs.py:77: FutureWarning: '+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.
  return _prepare_from_string(" ".join(pjargs))

# Inspect geo_df
geo_df.head()

	industry	latitude	longitude	geometry
12090	Stoop Line Stand	40.718663	-73.997619	POINT (-73.99762 40.71866)
12091	Stoop Line Stand	40.718663	-73.997619	POINT (-73.99762 40.71866)
12092	Stoop Line Stand	40.718663	-73.997619	POINT (-73.99762 40.71866)
12093	Stoop Line Stand	40.718663	-73.997619	POINT (-73.99762 40.71866)
12094	Stoop Line Stand	40.718663	-73.997619	POINT (-73.99762 40.71866)

What are the top industries with active licenses within the map boundaries?

df["industry"].value_counts()

Home Improvement Contractor       259038
Tobacco Retail Dealer             168636
Secondhand Dealer - General       102700
Laundries                         102440
Electronics Store                  77350
Electronic Cigarette Dealer        74776
Stoop Line Stand                   54704
Electronic & Appliance Service     43784
Sidewalk Cafe                      32968
Garage                             30212
Dealer In Products                 18330
Secondhand Dealer - Auto           14716
Tow Truck Company                  12142
Newsstand                           8944
Employment Agency                   8190
Parking Lot                         8008
Pawnbroker                          7826
Amusement Device Portable           7280
Pedicab Business                    4810
Car Wash                            4082
Debt Collection Agency              3380
Amusement Device Permanent          3146
Garage and Parking Lot              2886
Horse Drawn Cab Owner               1716
Scrap Metal Processor               1664
Process Serving Agency              1560
Storage Warehouse                   1274
Pool or Billiard Room               1092
Games of Chance                     1014
Auction House Premises               910
Bingo Game Operator                  832
Gaming Cafe                          806
Tow Truck Exemption                  546
Scale Dealer Repairer                546
Amusement Arcade                     494
Sightseeing Bus                      286
Commercial Lessor                    156
Booting Company                      130
Amusement Device Temporary           130
Special Sale                         104
General Vendor Distributor            52
Ticket Seller Business                52
Name: industry, dtype: int64

Plot top 3 industries with active licenses over map:

fig, ax = plt.subplots(figsize = (15, 15))
street_map.plot(ax = ax, alpha = 0.4, color = "grey")
geo_df[geo_df['industry'] == 'Home Improvement Contractor'].plot(ax = ax, markersize = 10, color = "blue", marker = "o", label = "Home Improvement Contractor")
geo_df[geo_df['industry'] == 'Tobacco Retail Dealer'].plot(ax = ax, markersize = 10, color = "red", marker = "^", label = "Tobacco Retail Dealer")
geo_df[geo_df['industry'] == 'Secondhand Dealer - General'].plot(ax = ax, markersize = 10, color = "yellow", marker = ".", label = "Secondhand Dealer - General")
plt.title("Top 3 Industries with Active Licenses") 
plt.legend(prop = {'size': 15})

<matplotlib.legend.Legend at 0x1b8b84a950>

png

From the map, it can be observed that out of all businesses with active licenses, there are a lot more Home Improvement Contractor businesses in Staten Island compared to Tobacco Retail Dealers and Secondhand Dealers. Also, there is a lot more concentration of Tobacco Retail Dealers and Secondhand Dealers in Manhattan compared to other boroughs.