Airbnb, Inc. is an American online marketplace and hospitality service brokerage company based in San Francisco, California, United States. Members can use the service to arrange or offer lodging, primarily homestays, or tourism experiences.
The booking price of a property varies according to various attributes such as its location, property type, size, number of rooms, amenities, day of the week, and month of the year. Airbnb made its dataset of property listings in Seattle in 2016 publicly available on Kaggle. Our goal is to study this dataset to understand the major trends in pricing over the year and across the neighbourhoods in Seattle. We also want to understand, through statistical modelling, how each of the attributes related to a property might influence its booking price on Airbnb.
Importing necessary libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
%matplotlib inline
Reading datasets.
s_calendar = pd.read_csv("./data/seattle/calendar.csv")
s_reviews = pd.read_csv("./data/seattle/reviews.csv")
s_listings = pd.read_csv("./data/seattle/listings.csv")
Understanding the information present in the datasets by inspecting their columns.
s_calendar.head()
s_calendar.shape
s_listings.head()
for column in s_listings.columns.tolist():
print(column)
s_listings.shape
s_reviews.head()
s_reviews.shape
s_listings["price"].head()
s_calendar["price"].head()
As we can see, the price attribute should hold numerical values, but its data type is object because it stores strings with '$' signs. We need to convert this column to a numeric data type.
"""
Converts price in the string format like "$1,125.00" into numeric value 1125.00
INPUT:
- string price in string format
OUTPUT:
- float value corresponding to the price or None if the input is not parseable to float
"""
def str_to_num (string):
if string is not None:
if type(string) is str and string.startswith('$'):
return float(string.replace('$', '').replace(',', ''))
else:
return None
else:
return None
s_listings_cleaned = pd.concat([s_listings.drop('price', axis=1), s_listings["price"].apply(str_to_num)], axis=1)
s_listings_cleaned['price'].head()
s_calendar_cleaned = pd.concat([s_calendar.drop('price', axis=1), s_calendar["price"].apply(str_to_num)], axis=1)
s_calendar_cleaned['price'].head()
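As an aside, the same cleanup could also be done with vectorized pandas string methods instead of the element-wise helper above; a minimal sketch, applied to the raw price column of the listings data:
# Aside (a sketch, not the approach used above): strip '$' and ',' in one vectorized pass.
s_listings["price"].str.replace("[$,]", "", regex=True).astype(float).head()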
s_calendar_cleaned['available'] = s_calendar_cleaned['available'] == 't'
s_calendar_cleaned['available'].head()
s_calendar_cleaned["date"].dtype
s_calendar_cleaned["date"] = pd.to_datetime(s_calendar_cleaned["date"], format="%Y-%m-%d")
s_calendar_cleaned["date"].dtype
s_calendar_cleaned["date"].head()
s_listings_cleaned["location_categorical"] = s_listings_cleaned.apply(lambda row: str((format(row.latitude, '.2f'), format(row.longitude, '.2f'))), axis=1)
s_listings_cleaned.location_categorical.head()
list(map(lambda amenity : amenity.replace("\"","").replace("{","").replace("}", ""), s_listings_cleaned.amenities.iloc[0].split(",")))
"""
Separates the string value of `amenities` attribute in the row, into a list of individual amenities.
INPUT:
- row : from dataset having amenities attribute
OUTPUT:
- list of amenities derived from the value of `amenities` attribute in row
"""
def separate_amenities(row):
amenities = row.amenities
list_to_return = []
if (amenities is not None and type(amenities) == str):
list_to_return = list(map(lambda amenity : amenity.replace("\"","").replace("{","").replace("}", ""), amenities.split(",")))
if '' in list_to_return:
list_to_return.remove('')
return list_to_return
s_listings_cleaned["amenities_list"] = s_listings_cleaned.apply(separate_amenities, axis=1)
possible_amenities = s_listings_cleaned['amenities_list'].apply(pd.Series).stack().unique()
possible_amenities
"""
Assigns new boolean attribute to the row based on the presence of that amenity in the list.
Returns updated row with additional attributes corresponding to the amenities added to it.
INPUT:
- row containing attributes related to the property, including `amenities_list`
OUTPUT:
- row containing newly added boolean attributes indicating the presence of each possible type of amenity in the property
"""
def add_amenities_columns (row):
amenities = set(row.amenities_list)
for possible_amenity in possible_amenities:
row["amenity_" + possible_amenity] = possible_amenity in amenities
return row
s_listings_cleaned = s_listings_cleaned.apply(add_amenities_columns, axis=1)
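As an aside, the same one-hot encoding of the amenities could be obtained with vectorized string operations instead of a row-wise apply; a rough sketch, not the approach used here (empty amenity strings may need additional cleanup):
# Sketch: strip the braces and quotes, then let pandas expand the comma-separated tags.
amenity_dummies = (
    s_listings_cleaned["amenities"]
    .str.strip("{}")
    .str.replace('"', "", regex=False)
    .str.get_dummies(sep=",")
    .add_prefix("amenity_")
)
amenity_dummies.head()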
s_listings_cleaned["instant_bookable"] = s_listings_cleaned["instant_bookable"] == "t"
df = s_calendar_cleaned
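# For each listing, keep only the rows where the availability flag differs from the
# previous date (each listing's first calendar row is kept as well), i.e. the dates
# on which its availability status changed.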
df = df.groupby('listing_id', group_keys=False)\
.apply(lambda x: x[x.available.ne(x.available.shift())])\
.reset_index(drop=True)
df.head(20)
listing_activation_dates = df[df.available == True].groupby("listing_id")[["listing_id","date"]].min()
lower_date = min(s_calendar_cleaned.date)
upper_date = max(s_calendar_cleaned.date)
days = pd.date_range(lower_date, upper_date, freq="D")
number_of_properties_activated_on_date = pd.DataFrame({"date":days, "activated_props": 0}).sort_values(by="date")
number_of_properties_activated_on_date.set_index("date", inplace=True, drop=False)
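# Count how many listings had their first available date on each calendar date, fill those
# counts into the full date range, and accumulate them into a running total over time.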
number_of_properties_activated_on_date_partial = listing_activation_dates.groupby("date")[["listing_id"]].agg(["count"])
number_of_properties_activated_on_date.loc[number_of_properties_activated_on_date_partial.index, "activated_props"] = number_of_properties_activated_on_date_partial.values
number_of_properties_activated_on_date["activations_till_date"] = number_of_properties_activated_on_date["activated_props"].cumsum(axis=0)
# number_of_properties_activated_on_date.head(100)
active_in_2016 = number_of_properties_activated_on_date["activations_till_date"]\
[number_of_properties_activated_on_date["date"] >= pd.to_datetime("2016-01-01")]\
[number_of_properties_activated_on_date["date"] <= pd.to_datetime("2016-12-31")]
dates = active_in_2016.index
months = mdates.MonthLocator()
month_format = mdates.DateFormatter("%b")
fig, ax = plt.subplots()
fig.set_size_inches(12,5)
ax.plot(dates, active_in_2016)
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(month_format)
yticks = ax.get_yticks()
ax.set_xlabel("Month in 2016")
ax.set_ylabel("# of properties that became first time available")
ax.set_title("Number of properties that became first time available in 2016")
plt.show()
It is evident from this plot that the number of properties that became available for the first time grew rapidly between January and March 2016. After that, the rate of growth slowed down, but the cumulative number of properties becoming available for the first time continued to grow throughout the year.
s_calendar_cleaned.groupby("date")[["date", "available"]].agg(["sum"]).head()
availability_by_date = s_calendar_cleaned.groupby("date")[["date","available"]].agg(["sum"]).sort_values(by="date")
availability_in_2016 = availability_by_date[pd.to_datetime("2016-01-01") <= availability_by_date.index][availability_by_date.index <= pd.to_datetime("2016-12-31")][("available","sum")]
active_in_2016 = number_of_properties_activated_on_date[pd.to_datetime("2016-01-01") <= number_of_properties_activated_on_date.index][number_of_properties_activated_on_date.index <= pd.to_datetime("2016-12-31")]["activations_till_date"]
dates = availability_in_2016.index
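# Occupancy rate: the share of listings active by a given date that are NOT marked as
# available on that date, expressed as a percentage.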
occupancy_rate = (1 - availability_in_2016/active_in_2016) * 100
months = mdates.MonthLocator()
month_format = mdates.DateFormatter("%b")
fig, ax = plt.subplots()
fig.set_size_inches(12,5)
ax.plot(dates, occupancy_rate)
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(month_format)
yticks = ax.get_yticks()
ax.set_yticklabels(['{}%'.format(y) for y in yticks])
ax.set_xlabel("Month in 2016")
ax.set_ylabel("Occupancy rate")
ax.set_title("Occupancy rate of properties listed on Airbnb in Seattle in 2016")
plt.show()
number_of_occupied_props = (occupancy_rate/100.0) * availability_in_2016
months = mdates.MonthLocator()
month_format = mdates.DateFormatter("%b")
fig, ax = plt.subplots()
fig.set_size_inches(12,5)
ax.plot(dates, number_of_occupied_props)
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(month_format)
yticks = ax.get_yticks()
# ax.set_yticklabels(['{}%'.format(y) for y in yticks])
ax.set_xlabel("Month in 2016")
ax.set_ylabel("Number of occupied properties")
ax.set_title("Number of occupied properties listed on Airbnb in Seattle in 2016")
plt.show()
It is evident from the plot that the number of occupied properties initially remained between 400 and 500. Occupancy jumped significantly in April to a new level of just above 600 and stayed between 600 and 650 from April to July. In July there was another surge, and occupancy rose well above 700. After July it decreased gradually, but still remained well above 650 by the end of the year. Overall, we can claim that an increasing number of people booked properties through Airbnb in Seattle during 2016.
avg_price_by_date = s_calendar_cleaned.groupby("date")[["date", "price"]].agg(["mean"])
avg_price_by_date.head(20)
prices_2016 = avg_price_by_date[avg_price_by_date.index >= pd.to_datetime("2016-01-01")][avg_price_by_date.index <= pd.to_datetime("2016-12-31")]
dates = prices_2016.index
months = mdates.MonthLocator()
month_format = mdates.DateFormatter("%b")
fig, ax = plt.subplots()
fig.set_size_inches(12,5)
ax.plot(dates, prices_2016)
ax.xaxis.set_major_locator(months)
ax.xaxis.set_major_formatter(month_format)
yticks = ax.get_yticks()
# ax.set_yticklabels(['{}%'.format(y) for y in yticks])
ax.set_xlabel("Month in 2016")
ax.set_ylabel("Average booking price in $ of property per night")
ax.set_title("Average booking price of properties listed on Airbnb in Seattle in 2016")
plt.show()
The average price per night of properties listed on Airbnb in Seattle grew steadily from around 125 to 155 USD between January and July 2016. It remained high, between 150 and 155 USD, during the summer from July to September. After September, prices dropped gradually until they stabilized at around 140 USD. Toward the end of December, prices rose slightly during the Christmas holidays.
prices_in_summer = avg_price_by_date.loc[pd.date_range('2016-07-01', '2016-07-31')]
dates = prices_in_summer.index
days = mdates.DayLocator()
day_format = mdates.DateFormatter("%a")
fig, ax = plt.subplots()
fig.set_size_inches(14,5)
ax.plot(dates, prices_in_summer)
ax.xaxis.set_major_locator(days)
ax.xaxis.set_major_formatter(day_format)
yticks = ax.get_yticks()
ax.set_xlabel("Day in July 2016")
ax.set_ylabel("Average booking price in $ of property per night")
ax.set_title("Average booking price of properties listed on Airbnb in Seattle in July 2016")
plt.show()
Zooming in on the average prices for July 2016, we can see that prices spike on Friday and Saturday nights.
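As a quick check that this weekend premium is not specific to July, we can average the calendar prices by day of week over the whole dataset (a rough sketch, not part of the original plots):
# Sketch: mean nightly price per day of week across all cleaned calendar entries.
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
(s_calendar_cleaned
 .assign(day_of_week=s_calendar_cleaned["date"].dt.day_name())
 .groupby("day_of_week")["price"]
 .mean()
 .reindex(weekday_order))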
props_by_type = s_listings_cleaned.dropna(subset=["property_type"]).groupby("property_type")["id"].agg(["count"])
plt.figure(figsize=(14,5))
x = props_by_type.index
y = props_by_type["count"]
plt.bar(x, y)
plt.xticks(x, x, rotation="vertical")
plt.ylabel("Number of properties")
plt.xlabel("Type of property")
plt.title("Distribution of properties based in its type")
plt.show()
As we can see, most of the properties listed on Airbnb in Seattle in 2016 are of type Apartment or House. There is also a significant number of Condominium and Townhouse properties.
s_listings_cleaned.dropna(subset=["square_feet"]).groupby("neighbourhood_cleansed")[["neighbourhood_cleansed", "square_feet"]].agg("mean").sort_values(by="square_feet", ascending=False).rename(index=str, columns={"square_feet": "average square feet"}).head(10)
North Queen Anne, Fairmount Park, Wallingford, and Alki have properties with an average size of more than 1,400 square feet.
s_listings_cleaned.dropna(subset=["price"]).groupby("neighbourhood_cleansed")[["neighbourhood_cleansed", "price"]].agg("mean").sort_values(by="price", ascending=False).rename(index=str, columns={"price": "average price per night"}).head(10)
s_listings_cleaned.groupby("neighbourhood_cleansed")[["neighbourhood_cleansed", "id"]].agg(["count"])["id"].sort_values(by="count", ascending=False).rename(index=str, columns={"count": "listings count"}).head(10)
df = s_listings_cleaned.dropna(subset=["square_feet", "price"])
x = df.square_feet
y = df.price
plt.scatter(x, y)
plt.xlabel("Property size in square feet")
plt.ylabel("Booking price in USD")
plt.title("Booking price vs size in square feet for property")
plt.show()
It can be seen that for smaller properties of up to 800 square feet, prices track size fairly closely, i.e. similarly sized properties have similar prices. However, as property size increases above 800 square feet, the variation in price among similarly sized properties increases significantly.
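As a rough check of this observation, we can compare the size-price correlation for smaller and larger properties (a sketch, using the same df of listings that have both size and price):
# Sketch: Pearson correlation between size and price, split at 800 square feet.
small = df[df.square_feet <= 800]
large = df[df.square_feet > 800]
print("corr, <= 800 sq ft:", small["square_feet"].corr(small["price"]))
print("corr, > 800 sq ft:", large["square_feet"].corr(large["price"]))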
df = s_listings_cleaned.dropna(subset=["bedrooms", "price"])
x = df.bedrooms
y = df.price
plt.scatter(x, y)
plt.xlabel("Number of bedrooms in property")
plt.ylabel("Booking price in USD")
plt.title("Booking price against number of bedrooms in property")
plt.show()
df = s_listings_cleaned.dropna(subset=["bathrooms", "price"])
x = df.bathrooms
y = df.price
plt.scatter(x, y)
plt.xlabel("Number of bathrooms in property")
plt.ylabel("Booking price in USD")
plt.title("Booking price against number of bathrooms in property")
plt.show()
In this case, price is the target variable. We can drop the rows with missing values in this target variable for the purpose of training and evaluating our model.
input_vars_from_listings = ["neighbourhood_cleansed",
"neighbourhood_group_cleansed",
"city",
"state",
"zipcode",
"market",
"location_categorical",
"property_type",
"room_type",
"accommodates",
"bathrooms",
"bedrooms",
"beds",
"bed_type",
"square_feet",
"guests_included",
"minimum_nights",
"maximum_nights",
"review_scores_rating",
"review_scores_accuracy",
"review_scores_cleanliness",
"review_scores_checkin",
"review_scores_communication",
"review_scores_location",
"review_scores_value",
"jurisdiction_names",
"cancellation_policy",
"reviews_per_month",
"number_of_reviews"
]
amenity_variables = list(map(lambda amenity : "amenity_" + amenity, possible_amenities))
input_vars_from_listings.extend(amenity_variables)
s_calendar_cleaned["month"] = s_calendar_cleaned['date'].dt.month_name()
s_calendar_cleaned["day_of_week"] = s_calendar_cleaned['date'].dt.weekday_name
df = pd.merge(s_listings_cleaned, s_calendar_cleaned, left_on="id", right_on="listing_id", how="inner", suffixes=("_listings", "_calendar"))
df = df.dropna(subset=["price_calendar"])
sns.heatmap(df[["price_calendar",
"square_feet",
"bedrooms",
"bathrooms",
"accommodates",
"beds",
"guests_included"
]].rename(columns={"price_calendar": "price"}).corr(), annot=True, fmt=".2f");
input_vars_from_calendar = ["month", "day_of_week"]
all_input_vars = input_vars_from_listings + input_vars_from_calendar;
X = df[all_input_vars]
y = df["price_calendar"]
The categorical variables are converted into dummy numerical variables by adding one column per categorical level of each categorical attribute, using the pandas get_dummies function.
We are not replacing the missing values in the categorical attributes, as they are handled during the creation of the dummy variables: with dummy_na=True, a separate indicator column is created for the NaN value of each attribute.
cat_vars = X.select_dtypes(include=['object']).copy().columns
for var in cat_vars:
# for each cat add dummy var, drop original column
X = pd.concat([X.drop(var, axis=1), pd.get_dummies(X[var], prefix=var, prefix_sep='_', drop_first=False, dummy_na=True)], axis=1)
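For illustration, a toy example (not from the dataset) of how dummy_na=True gives missing values their own indicator column:
# Toy illustration: NaN becomes its own dummy column.
demo = pd.Series(["House", "Apartment", np.nan], name="property_type")
pd.get_dummies(demo, prefix="property_type", dummy_na=True)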
We also need to handle missing values in the input variables. For numerical attributes, we replace missing values with the mean of the respective column.
def fill_mean (col):
return col.fillna(col.mean())
X = X.apply(fill_mean, axis=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)
lm_model = Ridge(normalize=True,fit_intercept=True)
lm_model.fit(X_train, y_train)
y_test_preds = lm_model.predict(X_test)
"The r-squared score for the model {} on {} values.".format(r2_score(y_test, y_test_preds), len(y_test))
plt.figure(figsize=(12,5))
plt.scatter(y_test, y_test_preds)
plt.xlabel("Y test")
plt.ylabel("Y test preds")
plt.show()
def coef_weights(coefficients, X_train):
    '''
    INPUT:
    coefficients - the coefficients of the linear model
    X_train - the training data, so the column names can be used
    OUTPUT:
    coefs_df - a dataframe holding the variable name, coefficient, and abs(coefficient)

    Provides a dataframe that can be used to understand the most influential coefficients
    in a linear model by providing the coefficient estimates along with the name of the
    variable attached to the coefficient.
    '''
    coefs_df = pd.DataFrame()
    coefs_df['input_variable'] = X_train.columns
    coefs_df['coef'] = coefficients
    coefs_df['abs_coef'] = np.abs(coefficients)
    coefs_df = coefs_df.sort_values('abs_coef', ascending=False)
    return coefs_df
coef_df = coef_weights(lm_model.coef_, X_train)
coef_df.head(50)
coef_df[coef_df["input_variable"].str.startswith("neighbourhood")]
coef_df[~coef_df["input_variable"].str.startswith("location") & ~coef_df["input_variable"].str.startswith("neighbourhood") & ~coef_df["input_variable"].str.startswith("zipcode")]
coef_df[coef_df["input_variable"].str.startswith("month")]
coef_df[coef_df["input_variable"].str.startswith("day_of_week")]
coef_df[coef_df["input_variable"].str.startswith("property_type")]
coef_df[coef_df["input_variable"].str.startswith("amenity")]