Assignment - Blog Post 1
In this blog, we’ll demonstrate the basic utilization of the plotly pacakge, and some basic database usage with sql(Structured Query Language). Databases increases the efficiency for treating large datasets. We will be using the National Oceanic and Atmospheric Admistration’s (NOAA) climate data set, together with temperature stations and country codes, to create some interactive geographically representative plots corresponding to temperatures.
First, we would like to import some needed packages, make sure you have installed them properly. Despite pandas and numpy that we are aready familiarized with, sqlite3 is new package that enables us to work with databases using Python commands.
import sqlite3
import pandas as pd
import numpy as np
Part 1 Introducing database & data prehandling
Inspecting the temperature dataset here, you might discover that it contains more than ten thousands of rows. Therefore, when only a subset of the complete dataset is need, it is possible to first store the complete dataset into a database and query out specific rows or columns in need.
Creating a database
The following line would create a database in the current directory. For future usage of the database, the same line could be used.
conn = sqlite3.connect("temps.db") # create a database in current directory called temps.db
Then, we would like to read in the dataset files to prepare to populate the temps database that we’ve just created. The lastline just creates a new column in the stations dataset that extracts the country codes in the original station id. The abbreviation would be used for future merging.
#temperatures = pd.read_csv('temps.csv')
stations = pd.read_csv('station-metadata.csv')
countries = pd.read_csv('countries.csv')
stations['ID_'] = stations['ID'].str[0:2]
Take a look at the countries and stations table. The countries table contains corresponding station codes for countries, while stations table has the stations’ ID, name, their latitude and longitde, their elevation, and new abbreviation code column we’ve just created.
countries.head(5)
| FIPS 10-4 | ISO 3166 | Name | |
|---|---|---|---|
| 0 | AF | AF | Afghanistan |
| 1 | AX | - | Akrotiri |
| 2 | AL | AL | Albania |
| 3 | AG | DZ | Algeria |
| 4 | AQ | AS | American Samoa |
stations.head(5)
| ID | LATITUDE | LONGITUDE | STNELEV | NAME | ID_ | |
|---|---|---|---|---|---|---|
| 0 | ACW00011604 | 57.7667 | 11.8667 | 18.0 | SAVE | AC |
| 1 | AE000041196 | 25.3330 | 55.5170 | 34.0 | SHARJAH_INTER_AIRP | AE |
| 2 | AEM00041184 | 25.6170 | 55.9330 | 31.0 | RAS_AL_KHAIMAH_INTE | AE |
| 3 | AEM00041194 | 25.2550 | 55.3640 | 10.4 | DUBAI_INTL | AE |
| 4 | AEM00041216 | 24.4300 | 54.4700 | 3.0 | ABU_DHABI_BATEEN_AIR | AE |
Have a look of the original temperature dataset. Currently, it contains the monthly average temperature values at specific stations for certain years.
temperatures = pd.read_csv('temps.csv')
#temperatures = prepare_df(temperatures)
temperatures.head(5)
| ID | Year | VALUE1 | VALUE2 | VALUE3 | VALUE4 | VALUE5 | VALUE6 | VALUE7 | VALUE8 | VALUE9 | VALUE10 | VALUE11 | VALUE12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ACW00011604 | 1961 | -89.0 | 236.0 | 472.0 | 773.0 | 1128.0 | 1599.0 | 1570.0 | 1481.0 | 1413.0 | 1174.0 | 510.0 | -39.0 |
| 1 | ACW00011604 | 1962 | 113.0 | 85.0 | -154.0 | 635.0 | 908.0 | 1381.0 | 1510.0 | 1393.0 | 1163.0 | 994.0 | 323.0 | -126.0 |
| 2 | ACW00011604 | 1963 | -713.0 | -553.0 | -99.0 | 541.0 | 1224.0 | 1627.0 | 1620.0 | 1596.0 | 1332.0 | 940.0 | 566.0 | -108.0 |
| 3 | ACW00011604 | 1964 | 62.0 | -85.0 | 55.0 | 738.0 | 1219.0 | 1442.0 | 1506.0 | 1557.0 | 1221.0 | 788.0 | 546.0 | 112.0 |
| 4 | ACW00011604 | 1965 | 44.0 | -105.0 | 38.0 | 590.0 | 987.0 | 1500.0 | 1487.0 | 1477.0 | 1377.0 | 974.0 | 31.0 | -178.0 |
Then, we create a function to prepare the temperature dataset into a form we want. The function mainly conducts stacking to expand the original dataset by month. The last line in the function adds a new column, “FIPS 10-4”, which will be used later for merging with the country code dataset. The merge is expected to give out what country each temperature measurement station is located in.
def prepare_df(df):
df = df.set_index(keys=["ID", "Year"])#convert all the columns we don't want to stack into multi-indices
df = df.stack()#separate out values in each row into single entries
df = df.reset_index()#giving each entry ID, Year, and Name columns
df = df.rename(columns = {"level_2" : "Month" , 0 : "Temp"})#rename to create proper column names
df["Month"] = df["Month"].str[5:].astype(int)
df["Temp"] = df["Temp"] / 100
df["FIPS 10-4"] = df["ID"].str[0:2]
return(df)
Take a look at the prepared temperatures dataset, it will contain monthly temperature averages at specific stations, as well as their country codes.
temperatures = prepare_df(temperatures)
temperatures.head(5)
| ID | Year | Month | Temp | FIPS 10-4 | |
|---|---|---|---|---|---|
| 0 | ACW00011604 | 1961 | 1 | -0.89 | AC |
| 1 | ACW00011604 | 1961 | 2 | 2.36 | AC |
| 2 | ACW00011604 | 1961 | 3 | 4.72 | AC |
| 3 | ACW00011604 | 1961 | 4 | 7.73 | AC |
| 4 | ACW00011604 | 1961 | 5 | 11.28 | AC |
Now, we are finally ready to populate our database. We’ll be creating three tables: temperature, countries, and stations. For countries and stations, we would populate them into the database directly. While for the temperature table, we would first give it to the prepare_df function. Three tables would be saved in the form as demonstrated above.
The df_iter object makes a dataframe iterable by specified chunksize. if_exists checks if that some chunk is already added to sql, and execute the following block, making sure we won’t overwrite each time.
df_iter = pd.read_csv("temps.csv", chunksize = 100000)
for df in df_iter:
df = prepare_df(df)
df.to_sql("temperatures", conn, if_exists = "append", index = False)
stations.to_sql("stations", conn, if_exists = "replace", index = False)
countries.to_sql("countries", conn, if_exists = "replace", index = False)
/Users/susanzhang/opt/anaconda3/envs/PIC16B/lib/python3.7/site-packages/pandas/core/generic.py:2789: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
method=method,
Now, we’re supposed to have three dataframes in the database, let’s check!
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]
Don’t forget to close your connections!
conn.close()
PART 2 SQL Query
In this part, we would create a demonstration of how to query specific data out from sql, and generating a plot from the queried dataset. We will be implementing two functions, one for data query and one for plot generation.
We will be creating a query_climate_database() function. The function is expected to return a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year.
def query_climate_database(country, year_begin, year_end, month):
'''param country: a string of the country for which data should be returned;
year_begin, year_end: two integers specifying the time interval of data be returned;
month: an integer giving the month of data to be returned.
@return a Pandas dataframe of temperature readings for the specified country,
in the specified date range, in the specified month of the year
'''
#first create a connection with the database
conn = sqlite3.connect("temps.db")
#basic queries
#containing SELECT, specifies column(s) to be returned;
#FROM, from which table(s)
#LEFT JOIN, merge datasets for specific outcomes
#WHERE, row specification
cmd = \
"""
SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
LEFT JOIN countries C ON S.id_ = C.`fips 10-4`
WHERE C.name = ? AND (T.year BETWEEN ? AND ?) AND T.month = ?
"""
#the '?' syntax in WHERE enables parameterization in a query,
#and we'll supply the parameters sequentially here
param = (country, year_begin, year_end, month,)
#the following line reads a query and returns the dataset from the database
df = pd.read_sql_query(cmd, conn, params=param)
conn.close()#make sure to close the connection after use
return df
df = query_climate_database(country = "India",
year_begin = 1980,
year_end = 2020,
month = 1)
Now, have a look of the dataset that we’ve queried out! It should contain the temperatures in specific month, year, and station, as well as the latitude and longitude of the station, for a specific country.
df
| NAME | LATITUDE | LONGITUDE | Name | Year | Month | Temp | |
|---|---|---|---|---|---|---|---|
| 0 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1980 | 1 | 23.48 |
| 1 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1981 | 1 | 24.57 |
| 2 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1982 | 1 | 24.19 |
| 3 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1983 | 1 | 23.51 |
| 4 | PBO_ANANTAPUR | 14.583 | 77.633 | India | 1984 | 1 | 24.81 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3147 | DARJEELING | 27.050 | 88.270 | India | 1983 | 1 | 5.10 |
| 3148 | DARJEELING | 27.050 | 88.270 | India | 1986 | 1 | 6.90 |
| 3149 | DARJEELING | 27.050 | 88.270 | India | 1994 | 1 | 8.10 |
| 3150 | DARJEELING | 27.050 | 88.270 | India | 1995 | 1 | 5.60 |
| 3151 | DARJEELING | 27.050 | 88.270 | India | 1997 | 1 | 5.70 |
3152 rows × 7 columns
PART 3 Plot with Plotly
In this part, we wish to create a plot using plotly to answer the following question:
How does the average yearly change in temperature vary within a given country?
To answer this question, we’ll need to first calculate a yearly increase of stations in a specific country for a specific month.
One method great for calculating extimate yearly increase is the LinearRegression model in sklearn. So we’ll import that, along with the plotly package.
from sklearn.linear_model import LinearRegression
from plotly import express as px
We’ll define a function that estimates the yearly increase for specific stations.
def coef(data_group):
x = data_group[["Year"]] # 2 brackets because X should be a df
y = data_group["Temp"] # 1 bracket because y should be a series
LR = LinearRegression()
LR.fit(x, y)
return LR.coef_[0]
Then, we’ll define a temperature_coefficient_plot function that plots a point for each station in a specific country. With plotly, the color of the point will reflect an estimate of the yearly change in temperature during the specified month and time period at that station.
def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
'''country: string of the country for which data should be returned;
year_begin, year_end: two integers specifying the time interval of data be returned;
min_obs: minimum required number of years of data for any given station.
(only data for stations with at least min_obs years worth of data would be considered);
**kwargs: additional keyword arguments passed to px.scatter_mapbox().
'''
#first obtain the dataframe from above function:
df = query_climate_database(country = country,
year_begin = year_begin,
year_end = year_end,
month = month)
#add a new column to df with number of years of data there are for that station
df["size"] = df.groupby(["NAME"])['NAME'].transform(lambda x: x.shape[0])
#filter out stations without enough yearly data, inclusive
df_ = df[df['size'] >= min_obs]
#estimate yearly increase for each station
coefs = df_.groupby(["NAME", "Month"]).apply(coef)#
coefs = coefs.reset_index()
#merge the estimates with stations,
#therefore each estimate would have there corresponding latitude and longitude
coefs_ = pd.merge(coefs, stations, on = ["NAME"])
#rename coef column to "Estimated Yearly Increase(Celsius)" and round the coefs
coefs_ = coefs_.rename(columns = {0:"Estimated Yearly Increase(Celsius)"})
coefs_ = coefs_[["Estimated Yearly Increase(Celsius)", 'NAME', 'LATITUDE','LONGITUDE']]
coefs_['Estimated Yearly Increase(Celsius)'] = coefs_['Estimated Yearly Increase(Celsius)'].round(4)
#create plotly plot! add titles and hover overs
fig = px.scatter_mapbox(coefs_,
lat = "LATITUDE",
lon = "LONGITUDE",
hover_name = "NAME",
color = "Estimated Yearly Increase(Celsius)",
title="Estimates of yearly increase in Temperature in January for stations in "
+ country +", years " + str(year_begin) + " - " + str(year_end),
**kwargs)
#fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
return fig
color_map = px.colors.diverging.RdGy_r # choose a colormap
#run the function to generate plotly plot
fig = temperature_coefficient_plot("India", 1980, 2020, 1,
min_obs = 10,
zoom = 2,
mapbox_style="carto-positron",
color_continuous_scale=color_map)
fig.show()
From the plot, we could see that the estimate yearly change of temperature in India various geographically, it might correspond to specific factors, such as distance to the ocean, etc.
Save it as a html form!
from plotly.io import write_html
write_html(fig, "india_temp_esti.html")
Part 4 Continental temperature Change
In this part, I would like to visualize how temperature alteration differs in the northern and southern hemisphere. First, it might be helpful to create a function that selects temperature change measured by stations at the same range of latitude from two hemispheres during a specific year. lat_low specifies the lower bound for northern hemisphere, lat_high specifies the upper bound for the northern hemisphere, therefore -lat_low and -lat_high correspondly specifies that for the southern hemisphere.
#conn = sqlite3.connect("temps.db")
def query_compare_latitude(lat_low, lat_high, year):
'''returns a dataframe containing monthly temps at stations
for the specified latitude from two hemispheres
'''
#neg_lat = np.negative(lat)
conn = sqlite3.connect("temps.db")
cmd = \
"""
SELECT T.id, T.month, T.temp, S.latitude, S.longitude, S.name, T.year
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
WHERE ((S.latitude < ? AND S.latitude > ?) OR (S.latitude < ? AND S.latitude > ?) )AND T.year = ?
"""
params = (lat_high, lat_low, -lat_low, -lat_high , year )
df = pd.read_sql_query(cmd,conn, params = params)
#df_south = pd.read_sql_query(cmd_s, conn)
conn.close()
return df
#select temp measurements for points in year 1954
#latitude in both hemispheres between 20 - 40 degrees
df = query_compare_latitude(20, 25, 1954)
Take a look at our dataframe, it contains monthly temperature measurements at specified latitude range for a specific year.
df
| ID | Month | Temp | LATITUDE | LONGITUDE | NAME | Year | |
|---|---|---|---|---|---|---|---|
| 0 | AG000060680 | 1 | 12.84 | 22.800 | 5.4331 | TAMANRASSET | 1954 |
| 1 | AG000060680 | 2 | 14.20 | 22.800 | 5.4331 | TAMANRASSET | 1954 |
| 2 | AG000060680 | 3 | 17.81 | 22.800 | 5.4331 | TAMANRASSET | 1954 |
| 3 | AG000060680 | 4 | 20.73 | 22.800 | 5.4331 | TAMANRASSET | 1954 |
| 4 | AG000060680 | 5 | 25.96 | 22.800 | 5.4331 | TAMANRASSET | 1954 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 3664 | ZI000067991 | 8 | 18.41 | -22.217 | 30.0000 | BEITBRIDGE | 1954 |
| 3665 | ZI000067991 | 9 | 22.79 | -22.217 | 30.0000 | BEITBRIDGE | 1954 |
| 3666 | ZI000067991 | 10 | 26.78 | -22.217 | 30.0000 | BEITBRIDGE | 1954 |
| 3667 | ZI000067991 | 11 | 25.85 | -22.217 | 30.0000 | BEITBRIDGE | 1954 |
| 3668 | ZI000067991 | 12 | 26.02 | -22.217 | 30.0000 | BEITBRIDGE | 1954 |
3669 rows × 7 columns
from matplotlib import pyplot as plt
#import seaborn as sns
Write a plot_line function! The function would generate a line plot for monthly temperature change for each station, color separated by northern or southern hemispheres.
def plot_line(df):
#return sns.lineplot(x="Month", y="Temp",
#style="NAME", hue = 'LATITUDE', palette = 'vlag',
#data=df)
df['LATITUDE'] = df['LATITUDE'].astype(int)# cast to int for approximation, for ease to generate plot
df['hemi'] = np.where( (df['LATITUDE'] >= 0), 'northern hemisphere', 'southern hemisphere')
return px.line(df, x="Month", y="Temp", color = 'hemi',
title="Monthly temperature(celsius) in both hemispheres between 20 - 40 latitudes"
,line_group="NAME",
hover_name="NAME")
#fig.show()
fig = plot_line(df)
write_html(fig, "p4_1_real.html")
Trying the query and plot functions with latitude range of 25-30, we were able to observe some patterns: for southern hemisphere countries, their mid-year temperatures tends to be cooler, and that is the opposite for northern hemisphere countries. This is intuitively correct, but we might observe that the slope for change for each station, even in the same hemisphere, might be different. Therefore, it might be helpful to plot the monthly temperature change for stations as grouped by continents.
The url below provides a dataset containing country names, their codes, and their corresponding continents.
CCcode = pd.read_csv('ContinentsCountries.csv')
#https://coronadatasource.org/open-data/geo-data/world-regions-continents-countries/
We first select the needed columns.
CCcode = CCcode[['Country Name (usual)', 'Continent' , 'Continent Code']]
CCcode
| Country Name (usual) | Continent | Continent Code | |
|---|---|---|---|
| 0 | Afghanistan | Asia | AS |
| 1 | ?land Islands | Europe | EU |
| 2 | Albania | Europe | EU |
| 3 | Algeria | Africa | AF |
| 4 | American Samoa | Oceania | OC |
| ... | ... | ... | ... |
| 248 | Zimbabwe | Africa | AF |
| 249 | Kosovo | Europe | EU |
| 250 | Netherlands Antilles | North America | NaN |
| 251 | Others | Others | XX |
| 252 | # Undefined | Others | XX |
253 rows × 3 columns
CCcode['Continent'].unique()
array(['Asia', 'Europe', 'Africa', 'Oceania', 'North America',
'Antarctica', 'South America', 'Others'], dtype=object)
As usual, we add the country-continent code dataset to the temps database
conn = sqlite3.connect("temps.db")
CCcode.to_sql("CCcode", conn, if_exists = "replace", index = False)
conn.close()
/Users/feishu/opt/anaconda3/envs/PIC16B/lib/python3.7/site-packages/pandas/core/generic.py:2789: UserWarning:
The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
Now, we create a query to return a dataframe with countries’ yearly temperature, with specified continents, in a specific year
def query_temp_continent(year):
'''return a table with countries' yearly temperature, marked by continent,
in a specific year
'''
conn = sqlite3.connect("temps.db")
cmd = \
"""
SELECT T.id, T.month, T.temp, C.name, T.year, O.continent
FROM temperatures T
LEFT JOIN stations S ON T.id = S.id
LEFT JOIN countries C ON S.id_ = C.`fips 10-4`
LEFT JOIN cccode O ON O.`country name (usual)` = C.name
WHERE T.year =?
"""
params = (year , )
df = pd.read_sql_query(cmd,conn, params = params )
#df_south = pd.read_sql_query(cmd_s, conn)
conn.close()
return df
df3 = query_temp_continent(1969)
df3
| ID | Month | Temp | Name | Year | Continent | |
|---|---|---|---|---|---|---|
| 0 | ACW00011604 | 1 | -0.81 | Antigua and Barbuda | 1969 | North America |
| 1 | ACW00011604 | 2 | -4.41 | Antigua and Barbuda | 1969 | North America |
| 2 | ACW00011604 | 3 | -1.32 | Antigua and Barbuda | 1969 | North America |
| 3 | ACW00011604 | 4 | 5.50 | Antigua and Barbuda | 1969 | North America |
| 4 | ACW00011604 | 5 | 10.15 | Antigua and Barbuda | 1969 | North America |
| ... | ... | ... | ... | ... | ... | ... |
| 164000 | ZIXLT622116 | 8 | 14.60 | Zimbabwe | 1969 | Africa |
| 164001 | ZIXLT622116 | 9 | 17.80 | Zimbabwe | 1969 | Africa |
| 164002 | ZIXLT622116 | 10 | 21.00 | Zimbabwe | 1969 | Africa |
| 164003 | ZIXLT622116 | 11 | 20.40 | Zimbabwe | 1969 | Africa |
| 164004 | ZIXLT622116 | 12 | 19.10 | Zimbabwe | 1969 | Africa |
164005 rows × 6 columns
Now, we create a plot function that plots the monthly change of temperature colorized by countries. The function will be supplied to the future groupby and apply method.
def plot_line_cont2(df):
cont = df['Continent'].unique()
year = df['Year'].unique()
fig = px.scatter(df, x="Month", y="Temp", color="Name",
hover_name="Name",
title = 'Monthly avg Temperature of countries(& regions) in ' + cont[0] + ' in ' + str(year))
fig.show()
write_html(fig, cont[0] + ".html")
df3.groupby(['Continent']).apply(plot_line_cont2)
