In this blog, we will be looking at the crime data in the city of San Francisco. The data we will be using contains all crimes in San Francisco from the year 2018 to 2020. You can download the data here. Since this dataset is very large (more than 330,000 crimes), we will be considering only a small part of the data for this post.

# import libraries
import numpy as np
import pandas as pd
import folium
# read the data into a pandas dataframe
df = pd.read_csv('SF_Crime_data.csv')
df.head()
Incident Datetime Incident Date Incident Time Incident Year Incident Day of Week Report Datetime Row ID Incident ID Incident Number CAD Number ... Current Supervisor Districts Analysis Neighborhoods HSOC Zones as of 2018-06-05 OWED Public Spaces Central Market/Tenderloin Boundary Polygon - Updated Parks Alliance CPSI (27+TL sites) ESNCAG - Boundary File Areas of Vulnerability, 2016 Unnamed: 36 Unnamed: 37
0 2/3/2020 14:45 2/3/2020 14:45 2020 Monday 2/3/2020 17:50 89881675000 898816 200085557 200342870.0 ... 8.0 16.0 NaN NaN NaN NaN NaN 2.0 NaN NaN
1 2/3/2020 3:45 2/3/2020 3:45 2020 Monday 2/3/2020 3:45 89860711012 898607 200083749 200340316.0 ... 2.0 20.0 3.0 NaN NaN NaN NaN 2.0 NaN NaN
2 2/3/2020 10:00 2/3/2020 10:00 2020 Monday 2/3/2020 10:06 89867264015 898672 200084060 200340808.0 ... 3.0 8.0 NaN 35.0 NaN NaN NaN 2.0 NaN NaN
3 1/19/2020 17:12 1/19/2020 17:12 2020 Sunday 2/1/2020 13:01 89863571000 898635 206024187 NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 1/5/2020 0:00 1/5/2020 0:00 2020 Sunday 2/3/2020 16:09 89877368020 898773 200085193 200342341.0 ... 6.0 30.0 NaN NaN NaN NaN NaN 1.0 NaN NaN

5 rows × 38 columns

# dimensions of the data
df.shape
(330054, 38)

There have been more than 330,000 crimes in San Francisco in the past two years.

df.columns
Index(['Incident Datetime', 'Incident Date', 'Incident Time', 'Incident Year',
       'Incident Day of Week', 'Report Datetime', 'Row ID', 'Incident ID',
       'Incident Number', 'CAD Number', 'Report Type Code',
       'Report Type Description', 'Filed Online', 'Incident Code',
       'Incident Category', 'Incident Subcategory', 'Incident Description',
       'Resolution', 'Intersection', 'CNN', 'Police District',
       'Analysis Neighborhood', 'Supervisor District', 'Latitude', 'Longitude',
       'point', 'SF Find Neighborhoods', 'Current Police Districts',
       'Current Supervisor Districts', 'Analysis Neighborhoods',
       'HSOC Zones as of 2018-06-05', 'OWED Public Spaces',
       'Central Market/Tenderloin Boundary Polygon - Updated',
       'Parks Alliance CPSI (27+TL sites)', 'ESNCAG - Boundary File',
       'Areas of Vulnerability, 2016', 'Unnamed: 36', 'Unnamed: 37'],
      dtype='object')

We do not need all these columns for our analysis. So we will consider only the necessary columns.

df = df[['Incident Datetime', 'Incident Day of Week', 'Incident Number', 'Incident Category', 'Incident Description', 
         'Police District', 'Analysis Neighborhood', 'Resolution', 'Latitude', 'Longitude', 'point']]
df.head()
Incident Datetime Incident Day of Week Incident Number Incident Category Incident Description Police District Analysis Neighborhood Resolution Latitude Longitude point
0 2/3/2020 14:45 Monday 200085557 Missing Person Found Person Taraval Lakeshore Open or Active 37.726950 -122.476039 (37.72694991292525, -122.47603947349434)
1 2/3/2020 3:45 Monday 200083749 Stolen Property Stolen Property, Possession with Knowledge, Re... Mission Mission Cite or Arrest Adult 37.752440 -122.415172 (37.752439644389675, -122.41517229045435)
2 2/3/2020 10:00 Monday 200084060 Non-Criminal Aided Case, Injured or Sick Person Tenderloin Financial District/South Beach Open or Active 37.784560 -122.407337 (37.784560141211806, -122.40733704162238)
3 1/19/2020 17:12 Sunday 206024187 Lost Property Lost Property Taraval NaN Open or Active NaN NaN NaN
4 1/5/2020 0:00 Sunday 200085193 Miscellaneous Investigation Miscellaneous Investigation Richmond Pacific Heights Open or Active 37.787112 -122.440250 (37.78711245591735, -122.44024995765258)

Now, each row consists of the following 11 features:

  • Incident Datetime: The date and time when the incident occurred
  • Incident Day of Week: The day of week on which the incident occurred
  • Incident Number: The incident or crime number
  • Incident Category: The category of the incident or crime
  • Incident Desccription: The description of the incident or crime
  • Police: The police department district
  • Resolution: The resolution of the crime in terms of whether the perpertrator was arrested or not
  • Analysis Neighborhoods: The neighborhood where the incident took place
  • Latitude: The latitude value of the crime location
  • Longitude: The longitude value of the crime location
  • point: A tuple of the latitude and logitude values

Let’s drop the missing values from the Latitude and Longitude columns as they will result in an error when creating a map.

df.dropna(subset=['Latitude', 'Longitude'], inplace=True)

Rename the Incident Category column for the sake of simplicity.

df.rename(columns={'Incident Category':'Category'}, inplace=True)
df.head()
Incident Datetime Incident Day of Week Incident Number Category Incident Description Police District Analysis Neighborhood Resolution Latitude Longitude point
0 2/3/2020 14:45 Monday 200085557 Missing Person Found Person Taraval Lakeshore Open or Active 37.726950 -122.476039 (37.72694991292525, -122.47603947349434)
1 2/3/2020 3:45 Monday 200083749 Stolen Property Stolen Property, Possession with Knowledge, Re... Mission Mission Cite or Arrest Adult 37.752440 -122.415172 (37.752439644389675, -122.41517229045435)
2 2/3/2020 10:00 Monday 200084060 Non-Criminal Aided Case, Injured or Sick Person Tenderloin Financial District/South Beach Open or Active 37.784560 -122.407337 (37.784560141211806, -122.40733704162238)
4 1/5/2020 0:00 Sunday 200085193 Miscellaneous Investigation Miscellaneous Investigation Richmond Pacific Heights Open or Active 37.787112 -122.440250 (37.78711245591735, -122.44024995765258)
5 2/3/2020 8:36 Monday 200083909 Miscellaneous Investigation Miscellaneous Investigation Central Financial District/South Beach Open or Active 37.796926 -122.399507 (37.796926429317054, -122.39950750040278)
limit = 100
df = df.iloc[0:limit, :]
# new dimensions of the data
df.shape
(100, 11)

Visualization

Let’s visualize where these crimes took place in the city of San Francisco.

# San Francisco latitude and longitude values
latitude = 37.7749
longitude = -122.4194
# create a map
sanfran_map = folium.Map(location=[latitude, longitude], zoom_start=12)
sanfran_map

Let’s group the markers into different clusters. Each cluster is then represented by the number of crimes in each neighborhood. These clusters can be thought of as pockets of San Francisco which you can then analyze separately.
We instantiate a MarkerCluster object and add all the data points in the dataframe to this object.

from folium import plugins

# let's start again with a clean copy of the map of San Francisco
sanfran_map = folium.Map(location = [latitude, longitude], zoom_start = 12)

# instantiate a mark cluster object for the incidents in the dataframe
incidents = plugins.MarkerCluster().add_to(sanfran_map)

# loop through the dataframe and add each data point to the mark cluster
for lat, lng, label, in zip(df.Latitude, df.Longitude, df.Category):
    folium.Marker(
        location=[lat, lng],
        icon=None,
        popup=label,
    ).add_to(incidents)

# display map
sanfran_map

If run the above code in a Jupyter notebook, you can zoom out all the way, and you will see that all markers are grouped into one cluster - the global cluster of 100 crimes (which is the total number of crimes we have considered in our dataset). Once you start zooming in, the global cluster will start breaking up into smaller clusters. Zooming in all the way will result in individual markers.