In this blog post, we will explore crime data for the city of San Francisco. The dataset contains all reported incidents in the city from 2018 to 2020. You can download the data here. Since the dataset is very large (more than 330,000 incidents), we will work with only a small subset of it in this post.
# import libraries
import numpy as np
import pandas as pd
import folium
# read the data into a pandas dataframe
df = pd.read_csv('SF_Crime_data.csv')
df.head()
| | Incident Datetime | Incident Date | Incident Time | Incident Year | Incident Day of Week | Report Datetime | Row ID | Incident ID | Incident Number | CAD Number | ... | Current Supervisor Districts | Analysis Neighborhoods | HSOC Zones as of 2018-06-05 | OWED Public Spaces | Central Market/Tenderloin Boundary Polygon - Updated | Parks Alliance CPSI (27+TL sites) | ESNCAG - Boundary File | Areas of Vulnerability, 2016 | Unnamed: 36 | Unnamed: 37 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2/3/2020 14:45 | 2/3/2020 | 14:45 | 2020 | Monday | 2/3/2020 17:50 | 89881675000 | 898816 | 200085557 | 200342870.0 | ... | 8.0 | 16.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
1 | 2/3/2020 3:45 | 2/3/2020 | 3:45 | 2020 | Monday | 2/3/2020 3:45 | 89860711012 | 898607 | 200083749 | 200340316.0 | ... | 2.0 | 20.0 | 3.0 | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN |
2 | 2/3/2020 10:00 | 2/3/2020 | 10:00 | 2020 | Monday | 2/3/2020 10:06 | 89867264015 | 898672 | 200084060 | 200340808.0 | ... | 3.0 | 8.0 | NaN | 35.0 | NaN | NaN | NaN | 2.0 | NaN | NaN |
3 | 1/19/2020 17:12 | 1/19/2020 | 17:12 | 2020 | Sunday | 2/1/2020 13:01 | 89863571000 | 898635 | 206024187 | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | 1/5/2020 0:00 | 1/5/2020 | 0:00 | 2020 | Sunday | 2/3/2020 16:09 | 89877368020 | 898773 | 200085193 | 200342341.0 | ... | 6.0 | 30.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN |
5 rows × 38 columns
# dimensions of the data
df.shape
(330054, 38)
More than 330,000 crimes were recorded in San Francisco over this period.
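We can also check how these incidents are distributed over time by counting them per year (a quick sanity check using the Incident Year column from the raw data):
# number of recorded incidents per year
df['Incident Year'].value_counts()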
# list all column names
df.columns
Index(['Incident Datetime', 'Incident Date', 'Incident Time', 'Incident Year',
'Incident Day of Week', 'Report Datetime', 'Row ID', 'Incident ID',
'Incident Number', 'CAD Number', 'Report Type Code',
'Report Type Description', 'Filed Online', 'Incident Code',
'Incident Category', 'Incident Subcategory', 'Incident Description',
'Resolution', 'Intersection', 'CNN', 'Police District',
'Analysis Neighborhood', 'Supervisor District', 'Latitude', 'Longitude',
'point', 'SF Find Neighborhoods', 'Current Police Districts',
'Current Supervisor Districts', 'Analysis Neighborhoods',
'HSOC Zones as of 2018-06-05', 'OWED Public Spaces',
'Central Market/Tenderloin Boundary Polygon - Updated',
'Parks Alliance CPSI (27+TL sites)', 'ESNCAG - Boundary File',
'Areas of Vulnerability, 2016', 'Unnamed: 36', 'Unnamed: 37'],
dtype='object')
We do not need all of these columns for our analysis, so we will keep only the ones relevant to this post.
df = df[['Incident Datetime', 'Incident Day of Week', 'Incident Number', 'Incident Category', 'Incident Description',
'Police District', 'Analysis Neighborhood', 'Resolution', 'Latitude', 'Longitude', 'point']]
df.head()
| | Incident Datetime | Incident Day of Week | Incident Number | Incident Category | Incident Description | Police District | Analysis Neighborhood | Resolution | Latitude | Longitude | point |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2/3/2020 14:45 | Monday | 200085557 | Missing Person | Found Person | Taraval | Lakeshore | Open or Active | 37.726950 | -122.476039 | (37.72694991292525, -122.47603947349434) |
1 | 2/3/2020 3:45 | Monday | 200083749 | Stolen Property | Stolen Property, Possession with Knowledge, Re... | Mission | Mission | Cite or Arrest Adult | 37.752440 | -122.415172 | (37.752439644389675, -122.41517229045435) |
2 | 2/3/2020 10:00 | Monday | 200084060 | Non-Criminal | Aided Case, Injured or Sick Person | Tenderloin | Financial District/South Beach | Open or Active | 37.784560 | -122.407337 | (37.784560141211806, -122.40733704162238) |
3 | 1/19/2020 17:12 | Sunday | 206024187 | Lost Property | Lost Property | Taraval | NaN | Open or Active | NaN | NaN | NaN |
4 | 1/5/2020 0:00 | Sunday | 200085193 | Miscellaneous Investigation | Miscellaneous Investigation | Richmond | Pacific Heights | Open or Active | 37.787112 | -122.440250 | (37.78711245591735, -122.44024995765258) |
Now, each row consists of the following 11 features:
- Incident Datetime: The date and time when the incident occurred
- Incident Day of Week: The day of the week on which the incident occurred
- Incident Number: The incident or crime number
- Incident Category: The category of the incident or crime
- Incident Description: The description of the incident or crime
- Police District: The police department district in which the incident occurred
- Analysis Neighborhood: The neighborhood where the incident took place
- Resolution: The resolution of the crime, i.e., whether the perpetrator was arrested or not
- Latitude: The latitude of the crime location
- Longitude: The longitude of the crime location
- point: A tuple of the latitude and longitude values
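Before dropping anything, it is worth checking how many values are actually missing in each column:
# count missing values per column
df.isna().sum()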
Let’s drop the rows with missing values in the Latitude and Longitude columns, since they would cause an error when we plot the incidents on a map.
# drop rows with missing coordinates
df.dropna(subset=['Latitude', 'Longitude'], inplace=True)
Let’s also rename the Incident Category column to Category for simplicity.
df.rename(columns={'Incident Category':'Category'}, inplace=True)
df.head()
| | Incident Datetime | Incident Day of Week | Incident Number | Category | Incident Description | Police District | Analysis Neighborhood | Resolution | Latitude | Longitude | point |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2/3/2020 14:45 | Monday | 200085557 | Missing Person | Found Person | Taraval | Lakeshore | Open or Active | 37.726950 | -122.476039 | (37.72694991292525, -122.47603947349434) |
1 | 2/3/2020 3:45 | Monday | 200083749 | Stolen Property | Stolen Property, Possession with Knowledge, Re... | Mission | Mission | Cite or Arrest Adult | 37.752440 | -122.415172 | (37.752439644389675, -122.41517229045435) |
2 | 2/3/2020 10:00 | Monday | 200084060 | Non-Criminal | Aided Case, Injured or Sick Person | Tenderloin | Financial District/South Beach | Open or Active | 37.784560 | -122.407337 | (37.784560141211806, -122.40733704162238) |
4 | 1/5/2020 0:00 | Sunday | 200085193 | Miscellaneous Investigation | Miscellaneous Investigation | Richmond | Pacific Heights | Open or Active | 37.787112 | -122.440250 | (37.78711245591735, -122.44024995765258) |
5 | 2/3/2020 8:36 | Monday | 200083909 | Miscellaneous Investigation | Miscellaneous Investigation | Central | Financial District/South Beach | Open or Active | 37.796926 | -122.399507 | (37.796926429317054, -122.39950750040278) |
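One optional cleanup step: the Incident Datetime column is read in as plain strings. If you plan to filter or group the incidents by time, it helps to convert the column to a proper datetime dtype (a small sketch; pandas can infer the M/D/YYYY H:MM format shown above):
# convert the string timestamps to pandas datetime objects (optional)
df['Incident Datetime'] = pd.to_datetime(df['Incident Datetime'])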
As mentioned at the start, we will work with only a small slice of the data, so let’s keep the first 100 incidents.
limit = 100
# keep only the first `limit` rows
df = df.iloc[0:limit, :]
# new dimensions of the data
df.shape
(100, 11)
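Note that taking the first 100 rows keeps whatever order the CSV happens to be in. If you would rather avoid that bias, drawing a random sample is one alternative (a sketch; random_state fixes the seed so the result is reproducible):
# alternative to the iloc slice above: a random sample of `limit` incidents
df = df.sample(n=limit, random_state=42)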
Visualization
Let’s visualize where these crimes took place in the city of San Francisco.
# San Francisco latitude and longitude values
latitude = 37.7749
longitude = -122.4194
# create a map
sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)
sanfran_map
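In a Jupyter notebook the map renders inline. If you are running this as a plain script instead, you can save the map to a standalone HTML file and open it in a browser (the filename here is arbitrary):
# save the map to an HTML file that can be opened in any browser
sanfran_map.save('sanfran_map.html')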
Let’s add a marker for each crime and group the markers into clusters, where each cluster is labeled with the number of crimes it contains. These clusters can be thought of as pockets of San Francisco that you can analyze separately.
We instantiate a MarkerCluster object and add all the data points in the dataframe to this object.
from folium import plugins
# let's start again with a clean copy of the map of San Francisco
sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)
# instantiate a marker cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(sanfran_map)
# loop through the dataframe and add each incident to the marker cluster
for lat, lng, label in zip(df.Latitude, df.Longitude, df.Category):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)
# display map
sanfran_map
If you run the above code in a Jupyter notebook, you can zoom all the way out and see every marker grouped into a single global cluster of 100 crimes (the total number of incidents we kept from the dataset). As you zoom in, the global cluster breaks up into smaller clusters, and zooming in all the way reveals the individual markers.
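As a variation on clustering, the same coordinates can drive a heat map, which emphasizes crime density rather than individual incidents. Here is a minimal sketch using folium’s HeatMap plugin (it reuses the plugins import from above; heat_map and heat_data are new names introduced here):
# create a fresh map centered on San Francisco
heat_map = folium.Map(location=[latitude, longitude], zoom_start=12)
# HeatMap expects a list of [lat, lng] pairs
heat_data = df[['Latitude', 'Longitude']].values.tolist()
plugins.HeatMap(heat_data).add_to(heat_map)
# display map
heat_map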