In this blog post, we will look at crime data for the city of San Francisco. The dataset contains all crimes in San Francisco from 2018 to 2020. You can download the data here. Since the dataset is very large (more than 330,000 crimes), we will work with only a small part of it in this post.
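As a minimal sketch of how such a subset could be drawn, assuming the downloaded file is a CSV (the file name below is a hypothetical placeholder), one could load the data with pandas and take a reproducible random sample:

```python
import pandas as pd

# Load the full crime dataset (hypothetical file name; adjust to the downloaded file).
df = pd.read_csv("sf_crime_2018_2020.csv")

# Keep a reproducible random subset so the analysis stays fast.
df_small = df.sample(n=500, random_state=0)
print(df_small.shape)
```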
Continue reading
In this blog post, we will explore neighborhoods in New York City using the Foursquare API. We will retrieve the most common venue categories in each neighborhood and then group the neighborhoods into clusters using the k-means clustering algorithm.
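A rough sketch of the clustering step, assuming the venue data has already been retrieved and one-hot encoded into a table of venue-category frequencies per neighborhood (the neighborhood names and category columns below are hypothetical):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical per-neighborhood venue-category frequencies.
venue_freq = pd.DataFrame(
    {"Cafe": [0.4, 0.1, 0.5], "Park": [0.1, 0.6, 0.2], "Gym": [0.5, 0.3, 0.3]},
    index=["Astoria", "Harlem", "Chelsea"],
)

# Group neighborhoods into k clusters based on their venue profiles.
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(venue_freq)
venue_freq["Cluster"] = kmeans.labels_
print(venue_freq)
```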
Continue reading
Imagine that you have a customer dataset, and you are interested in exploring the behavior of your customers using their historical data.
Customer segmentation is the practice of partitioning a customer base into groups of individuals that have similar characteristics. It is a significant strategy because it lets a business target these specific groups of customers and allocate marketing resources effectively.
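A minimal sketch of how such a segmentation could be done with k-means (the customer features below are hypothetical; the values are scaled first so that no single feature dominates the distance calculation):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical customer features: [age, annual_income, years_as_customer]
X = np.array([
    [25, 35000, 1],
    [40, 80000, 8],
    [31, 42000, 2],
    [55, 95000, 15],
    [22, 30000, 1],
    [47, 88000, 10],
])

# Standardize so that income does not dominate the distance metric.
X_scaled = StandardScaler().fit_transform(X)

# Partition customers into two segments.
segments = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X_scaled)
print(segments)
```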
Continue reading
k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
Although it is one of the simplest clustering models, it is widely used in many data science applications.
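A small illustration of that idea with scikit-learn on synthetic data; each observation is assigned to the cluster whose mean (centroid) is nearest:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Generate synthetic 2-D data with three well-separated groups.
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Fit k-means with k = 3; labels_ holds each point's nearest-mean cluster.
km = KMeans(n_clusters=3, random_state=42, n_init=10).fit(X)
print(km.cluster_centers_)   # the k cluster means (prototypes)
print(km.labels_[:10])       # cluster assignment of the first 10 points
```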
Continue reading
Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X, or Drug Y.
Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are each patient's Age, Sex, Blood Pressure, and Cholesterol, and the target is the drug that each patient responded to.
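A hedged sketch of what such a model could look like, using a decision tree as one possible classifier (the encoded values below are hypothetical; categorical features such as Sex, Blood Pressure, and Cholesterol would first need to be encoded numerically):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, already-encoded patient records:
# [age, sex (0/1), blood pressure (0=low, 1=normal, 2=high), cholesterol (0=normal, 1=high)]
X = np.array([
    [23, 0, 2, 1],
    [47, 1, 0, 1],
    [61, 0, 1, 0],
    [35, 1, 2, 0],
    [52, 0, 1, 1],
])
y = np.array(["Drug A", "Drug C", "Drug Y", "Drug A", "Drug X"])

# Fit a decision tree and predict the drug for a new patient.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4).fit(X, y)
print(clf.predict([[40, 1, 1, 0]]))
```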
Continue reading
Imagine a telecommunications provider has segmented its customer base by service usage patterns, categorizing the customers into four groups. If demographic data can be used to predict group membership, the company can customize offers for individual prospective customers. This is a classification problem: given a dataset with predefined labels, we need to build a model that can predict the class of a new or unknown case.
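A minimal sketch of such a classifier, using k-nearest neighbors as one possible choice (the demographic features and group labels below are hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical demographic features: [age, income, years_at_address]
X = np.array([
    [33, 45000, 2],
    [51, 72000, 9],
    [28, 39000, 1],
    [60, 88000, 20],
    [45, 64000, 6],
    [37, 50000, 3],
])
y = np.array([1, 3, 1, 4, 3, 2])   # customer service-usage group (1-4)

# Scale the features, then fit a k-nearest-neighbors classifier.
scaler = StandardScaler().fit(X)
knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X), y)

# Predict the group of a new prospective customer.
new_customer = [[42, 58000, 4]]
print(knn.predict(scaler.transform(new_customer)))
```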
Continue reading
In this blog post, we will use scikit-learn to implement different types of linear regression on our dataset. We will split the data into training and testing sets, create a model using the training set, evaluate the model on the test set, and finally use the model to predict an unknown value.
The dataset contains fuel consumption and carbon dioxide emission figures for all cars offered for retail sale in Canada in 2019. You can download the dataset here.
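A compressed sketch of that workflow, assuming the dataset has been loaded into a pandas DataFrame with columns along the lines of `ENGINESIZE` and `CO2EMISSIONS` (the file and column names below are hypothetical; adjust them to the actual file):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Load the fuel-consumption dataset (hypothetical file and column names).
df = pd.read_csv("FuelConsumption2019.csv")
X = df[["ENGINESIZE"]].values
y = df["CO2EMISSIONS"].values

# Split into training and testing sets, then fit on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
model = LinearRegression().fit(X_train, y_train)

# Evaluate on the test set and predict an unseen value.
print("R^2 on test set:", r2_score(y_test, model.predict(X_test)))
print("Predicted CO2 for a 3.5 L engine:", model.predict([[3.5]]))
```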
Continue reading
In this blog post, we will analyze China's GDP growth from 1960 to 2019. If the data shows a curved trend, then linear regression will not produce very accurate results compared to non-linear regression.
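As an illustration of the kind of non-linear fit this calls for, here is a sketch using SciPy's `curve_fit` with a logistic (sigmoid) function on synthetic data standing in for the GDP series (the actual post would fit the real GDP values by year):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, beta1, beta2):
    """Logistic curve; a common choice for growth that starts slowly and then accelerates."""
    return 1.0 / (1.0 + np.exp(-beta1 * (x - beta2)))

# Synthetic stand-in for normalized year/GDP values (use the real data instead).
x = np.linspace(0, 1, 60)                      # years 1960-2019, scaled to [0, 1]
y = sigmoid(x, 10.0, 0.7) + np.random.normal(0, 0.02, x.size)

# Fit the sigmoid parameters to the (noisy) data.
params, _ = curve_fit(sigmoid, x, y, p0=[5.0, 0.5])
print("Fitted beta1, beta2:", params)
```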
Continue reading
A choropleth map is a thematic map in which areas are shaded or patterned in proportion to the measurement of the statistical variable being displayed on the map. It provides an easy way to visualize how a measurement varies across a geographic area, or to show the level of variability within a region.
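A hedged sketch of building such a map with Folium (the GeoJSON file, region codes, and values below are hypothetical placeholders):

```python
import folium
import pandas as pd

# Hypothetical data: one statistical value per region, keyed by a region code.
data = pd.DataFrame({"region": ["A", "B", "C"], "value": [12.5, 48.0, 30.2]})

# Base map centered on an arbitrary location.
m = folium.Map(location=[37.77, -122.42], zoom_start=6)

# Shade each region in proportion to its value (requires a matching GeoJSON boundaries file).
folium.Choropleth(
    geo_data="regions.geojson",          # hypothetical boundaries file
    data=data,
    columns=["region", "value"],
    key_on="feature.properties.region",  # must match a property in the GeoJSON
    fill_color="YlOrRd",
    legend_name="Statistical variable",
).add_to(m)

m.save("choropleth.html")
```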
Continue reading
Folium is a powerful Python library that helps you create several types of Leaflet maps. Folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. These maps are also interactive, so you can zoom into any region of interest regardless of the initial zoom level.
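A minimal example of creating an interactive Leaflet map with Folium (the coordinates are just an arbitrary illustration):

```python
import folium

# Create a map centered on San Francisco with an initial zoom level;
# the rendered map can still be zoomed and panned interactively.
m = folium.Map(location=[37.7749, -122.4194], zoom_start=12)

# Add a marker with a popup label.
folium.Marker([37.7749, -122.4194], popup="San Francisco").add_to(m)

# Save to an HTML file that can be opened in a browser.
m.save("san_francisco_map.html")
```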
Continue reading