Assignment - Blog Post 1

In this blog, we’ll demonstrate the basic utilization of the plotly pacakge, and some basic database usage with sql(Structured Query Language). Databases increases the efficiency for treating large datasets. We will be using the National Oceanic and Atmospheric Admistration’s (NOAA) climate data set, together with temperature stations and country codes, to create some interactive geographically representative plots corresponding to temperatures.

First, we would like to import some needed packages, make sure you have installed them properly. Despite pandas and numpy that we are aready familiarized with, sqlite3 is new package that enables us to work with databases using Python commands.

import sqlite3
import pandas as pd
import numpy as np

Part 1 Introducing database & data prehandling

Inspecting the temperature dataset here, you might discover that it contains more than ten thousands of rows. Therefore, when only a subset of the complete dataset is need, it is possible to first store the complete dataset into a database and query out specific rows or columns in need.

Creating a database

The following line would create a database in the current directory. For future usage of the database, the same line could be used.

conn = sqlite3.connect("temps.db") # create a database in current directory called temps.db

Then, we would like to read in the dataset files to prepare to populate the temps database that we’ve just created. The lastline just creates a new column in the stations dataset that extracts the country codes in the original station id. The abbreviation would be used for future merging.

#temperatures = pd.read_csv('temps.csv')
stations = pd.read_csv('station-metadata.csv')
countries = pd.read_csv('countries.csv')
stations['ID_'] = stations['ID'].str[0:2]

Take a look at the countries and stations table. The countries table contains corresponding station codes for countries, while stations table has the stations’ ID, name, their latitude and longitde, their elevation, and new abbreviation code column we’ve just created.

countries.head(5)

	FIPS 10-4	ISO 3166	Name
0	AF	AF	Afghanistan
1	AX	-	Akrotiri
2	AL	AL	Albania
3	AG	DZ	Algeria
4	AQ	AS	American Samoa

stations.head(5)

	ID	LATITUDE	LONGITUDE	STNELEV	NAME	ID_
0	ACW00011604	57.7667	11.8667	18.0	SAVE	AC
1	AE000041196	25.3330	55.5170	34.0	SHARJAH_INTER_AIRP	AE
2	AEM00041184	25.6170	55.9330	31.0	RAS_AL_KHAIMAH_INTE	AE
3	AEM00041194	25.2550	55.3640	10.4	DUBAI_INTL	AE
4	AEM00041216	24.4300	54.4700	3.0	ABU_DHABI_BATEEN_AIR	AE

Have a look of the original temperature dataset. Currently, it contains the monthly average temperature values at specific stations for certain years.

temperatures = pd.read_csv('temps.csv')
#temperatures = prepare_df(temperatures)

temperatures.head(5)

	ID	Year	VALUE1	VALUE2	VALUE3	VALUE4	VALUE5	VALUE6	VALUE7	VALUE8	VALUE9	VALUE10	VALUE11	VALUE12
0	ACW00011604	1961	-89.0	236.0	472.0	773.0	1128.0	1599.0	1570.0	1481.0	1413.0	1174.0	510.0	-39.0
1	ACW00011604	1962	113.0	85.0	-154.0	635.0	908.0	1381.0	1510.0	1393.0	1163.0	994.0	323.0	-126.0
2	ACW00011604	1963	-713.0	-553.0	-99.0	541.0	1224.0	1627.0	1620.0	1596.0	1332.0	940.0	566.0	-108.0
3	ACW00011604	1964	62.0	-85.0	55.0	738.0	1219.0	1442.0	1506.0	1557.0	1221.0	788.0	546.0	112.0
4	ACW00011604	1965	44.0	-105.0	38.0	590.0	987.0	1500.0	1487.0	1477.0	1377.0	974.0	31.0	-178.0

Then, we create a function to prepare the temperature dataset into a form we want. The function mainly conducts stacking to expand the original dataset by month. The last line in the function adds a new column, “FIPS 10-4”, which will be used later for merging with the country code dataset. The merge is expected to give out what country each temperature measurement station is located in.

def prepare_df(df):
    df = df.set_index(keys=["ID", "Year"])#convert all the columns we don't want to stack into multi-indices
    df = df.stack()#separate out values in each row into single entries
    df = df.reset_index()#giving each entry ID, Year, and Name columns
    df = df.rename(columns = {"level_2"  : "Month" , 0 : "Temp"})#rename to create proper column names
    df["Month"] = df["Month"].str[5:].astype(int)
    df["Temp"]  = df["Temp"] / 100
    df["FIPS 10-4"] = df["ID"].str[0:2]
    return(df)

Take a look at the prepared temperatures dataset, it will contain monthly temperature averages at specific stations, as well as their country codes.

temperatures = prepare_df(temperatures)
temperatures.head(5)

	ID	Year	Month	Temp	FIPS 10-4
0	ACW00011604	1961	1	-0.89	AC
1	ACW00011604	1961	2	2.36	AC
2	ACW00011604	1961	3	4.72	AC
3	ACW00011604	1961	4	7.73	AC
4	ACW00011604	1961	5	11.28	AC

Now, we are finally ready to populate our database. We’ll be creating three tables: temperature, countries, and stations. For countries and stations, we would populate them into the database directly. While for the temperature table, we would first give it to the prepare_df function. Three tables would be saved in the form as demonstrated above.

The df_iter object makes a dataframe iterable by specified chunksize. if_exists checks if that some chunk is already added to sql, and execute the following block, making sure we won’t overwrite each time.

df_iter = pd.read_csv("temps.csv", chunksize = 100000)

for df in df_iter:
    df = prepare_df(df)
    
    df.to_sql("temperatures", conn, if_exists = "append", index = False)

stations.to_sql("stations", conn, if_exists = "replace", index = False)
countries.to_sql("countries", conn, if_exists = "replace", index = False)

/Users/susanzhang/opt/anaconda3/envs/PIC16B/lib/python3.7/site-packages/pandas/core/generic.py:2789: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  method=method,

Now, we’re supposed to have three dataframes in the database, let’s check!

cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

[('temperatures',), ('stations',), ('countries',)]

Don’t forget to close your connections!

conn.close()

PART 2 SQL Query

In this part, we would create a demonstration of how to query specific data out from sql, and generating a plot from the queried dataset. We will be implementing two functions, one for data query and one for plot generation.

We will be creating a query_climate_database() function. The function is expected to return a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year.

def query_climate_database(country, year_begin, year_end, month):
    '''param country: a string of the country for which data should be returned;
       year_begin, year_end: two integers specifying the time interval of data be returned;
       month: an integer giving the month of data to be returned.
       @return a Pandas dataframe of temperature readings for the specified country, 
       in the specified date range, in the specified month of the year
    '''
    #first create a connection with the database
    conn = sqlite3.connect("temps.db")
    #basic queries
    #containing SELECT, specifies column(s) to be returned;
    #FROM, from which table(s)
    #LEFT JOIN, merge datasets for specific outcomes
    #WHERE, row specification
    cmd = \
    """
    SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON S.id_ = C.`fips 10-4`
    WHERE C.name = ? AND (T.year BETWEEN ? AND ?) AND T.month = ?
    """
    
    #the '?' syntax in WHERE enables parameterization in a query,
    #and we'll supply the parameters sequentially here
    param = (country, year_begin, year_end, month,)
    
    #the following line reads a query and returns the dataset from the database 
    df = pd.read_sql_query(cmd, conn, params=param)
    conn.close()#make sure to close the connection after use
    return df

df = query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)

Now, have a look of the dataset that we’ve queried out! It should contain the temperatures in specific month, year, and station, as well as the latitude and longitude of the station, for a specific country.

df

	NAME	LATITUDE	LONGITUDE	Name	Year	Month	Temp
0	PBO_ANANTAPUR	14.583	77.633	India	1980	1	23.48
1	PBO_ANANTAPUR	14.583	77.633	India	1981	1	24.57
2	PBO_ANANTAPUR	14.583	77.633	India	1982	1	24.19
3	PBO_ANANTAPUR	14.583	77.633	India	1983	1	23.51
4	PBO_ANANTAPUR	14.583	77.633	India	1984	1	24.81
...	...	...	...	...	...	...	...
3147	DARJEELING	27.050	88.270	India	1983	1	5.10
3148	DARJEELING	27.050	88.270	India	1986	1	6.90
3149	DARJEELING	27.050	88.270	India	1994	1	8.10
3150	DARJEELING	27.050	88.270	India	1995	1	5.60
3151	DARJEELING	27.050	88.270	India	1997	1	5.70

3152 rows × 7 columns

PART 3 Plot with Plotly

In this part, we wish to create a plot using plotly to answer the following question:

How does the average yearly change in temperature vary within a given country?

To answer this question, we’ll need to first calculate a yearly increase of stations in a specific country for a specific month.

One method great for calculating extimate yearly increase is the LinearRegression model in sklearn. So we’ll import that, along with the plotly package.

from sklearn.linear_model import LinearRegression
from plotly import express as px

We’ll define a function that estimates the yearly increase for specific stations.

def coef(data_group):
    x = data_group[["Year"]] # 2 brackets because X should be a df
    y = data_group["Temp"]   # 1 bracket because y should be a series
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]

Then, we’ll define a temperature_coefficient_plot function that plots a point for each station in a specific country. With plotly, the color of the point will reflect an estimate of the yearly change in temperature during the specified month and time period at that station.

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    '''country:  string of the country for which data should be returned;
        year_begin, year_end: two integers specifying the time interval of data be returned;
        min_obs: minimum required number of years of data for any given station. 
        (only data for stations with at least min_obs years worth of data would be considered);
        **kwargs: additional keyword arguments passed to px.scatter_mapbox(). 
    '''
    #first obtain the dataframe from above function:
    df = query_climate_database(country = country, 
                       year_begin = year_begin, 
                       year_end = year_end,
                       month = month)
    #add a new column to df with number of years of data there are for that station
    df["size"] = df.groupby(["NAME"])['NAME'].transform(lambda x: x.shape[0])
    #filter out stations without enough yearly data, inclusive
    df_ = df[df['size'] >= min_obs]
    
    #estimate yearly increase for each station
    coefs = df_.groupby(["NAME", "Month"]).apply(coef)#
    coefs = coefs.reset_index()
    
    #merge the estimates with stations,
    #therefore each estimate would have there corresponding latitude and longitude
    coefs_ = pd.merge(coefs, stations, on = ["NAME"])
    #rename coef column to "Estimated Yearly Increase(Celsius)" and round the coefs
    coefs_ = coefs_.rename(columns = {0:"Estimated Yearly Increase(Celsius)"})
    coefs_ = coefs_[["Estimated Yearly Increase(Celsius)", 'NAME', 'LATITUDE','LONGITUDE']]
    coefs_['Estimated Yearly Increase(Celsius)'] = coefs_['Estimated Yearly Increase(Celsius)'].round(4)
    
    #create plotly plot! add titles and hover overs
    fig = px.scatter_mapbox(coefs_, 
                        lat = "LATITUDE",
                        lon = "LONGITUDE", 
                        hover_name = "NAME", 
                        color = "Estimated Yearly Increase(Celsius)", 
                        title="Estimates of yearly increase in Temperature in January for stations in "
                            + country +", years " + str(year_begin) + " - " + str(year_end),
                        **kwargs)

    #fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    return fig
    

color_map = px.colors.diverging.RdGy_r # choose a colormap

#run the function to generate plotly plot
fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

From the plot, we could see that the estimate yearly change of temperature in India various geographically, it might correspond to specific factors, such as distance to the ocean, etc.

Save it as a html form!

from plotly.io import write_html

write_html(fig, "india_temp_esti.html")

Part 4 Continental temperature Change

In this part, I would like to visualize how temperature alteration differs in the northern and southern hemisphere. First, it might be helpful to create a function that selects temperature change measured by stations at the same range of latitude from two hemispheres during a specific year. lat_low specifies the lower bound for northern hemisphere, lat_high specifies the upper bound for the northern hemisphere, therefore -lat_low and -lat_high correspondly specifies that for the southern hemisphere.

#conn = sqlite3.connect("temps.db")
def query_compare_latitude(lat_low, lat_high, year):
    '''returns a dataframe containing monthly temps at stations
    for the specified latitude from two hemispheres
    '''
    #neg_lat = np.negative(lat)
    conn = sqlite3.connect("temps.db")
    cmd = \
    """
    SELECT T.id, T.month, T.temp, S.latitude, S.longitude, S.name, T.year
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    WHERE ((S.latitude < ? AND S.latitude > ?) OR (S.latitude < ? AND S.latitude > ?) )AND T.year = ?
    """
    params = (lat_high, lat_low, -lat_low, -lat_high , year )

    df = pd.read_sql_query(cmd,conn, params = params)
    #df_south = pd.read_sql_query(cmd_s, conn)
    conn.close()
    return df

#select temp measurements for points in year 1954
#latitude in both hemispheres between 20 - 40 degrees
df = query_compare_latitude(20, 25, 1954)

Take a look at our dataframe, it contains monthly temperature measurements at specified latitude range for a specific year.

df

	ID	Month	Temp	LATITUDE	LONGITUDE	NAME	Year
0	AG000060680	1	12.84	22.800	5.4331	TAMANRASSET	1954
1	AG000060680	2	14.20	22.800	5.4331	TAMANRASSET	1954
2	AG000060680	3	17.81	22.800	5.4331	TAMANRASSET	1954
3	AG000060680	4	20.73	22.800	5.4331	TAMANRASSET	1954
4	AG000060680	5	25.96	22.800	5.4331	TAMANRASSET	1954
...	...	...	...	...	...	...	...
3664	ZI000067991	8	18.41	-22.217	30.0000	BEITBRIDGE	1954
3665	ZI000067991	9	22.79	-22.217	30.0000	BEITBRIDGE	1954
3666	ZI000067991	10	26.78	-22.217	30.0000	BEITBRIDGE	1954
3667	ZI000067991	11	25.85	-22.217	30.0000	BEITBRIDGE	1954
3668	ZI000067991	12	26.02	-22.217	30.0000	BEITBRIDGE	1954

3669 rows × 7 columns

from matplotlib import pyplot as plt
#import seaborn as sns 

Write a plot_line function! The function would generate a line plot for monthly temperature change for each station, color separated by northern or southern hemispheres.

def plot_line(df):
    #return sns.lineplot(x="Month", y="Temp",
             #style="NAME", hue = 'LATITUDE', palette = 'vlag',
             #data=df)
    df['LATITUDE'] = df['LATITUDE'].astype(int)# cast to int for approximation, for ease to generate plot
    df['hemi'] = np.where( (df['LATITUDE'] >= 0), 'northern hemisphere', 'southern hemisphere')
    return px.line(df, x="Month", y="Temp", color = 'hemi',
                  title="Monthly temperature(celsius) in both hemispheres between 20 - 40 latitudes"
                  ,line_group="NAME",
                 hover_name="NAME")
    #fig.show()

fig = plot_line(df)

write_html(fig, "p4_1_real.html")

Trying the query and plot functions with latitude range of 25-30, we were able to observe some patterns: for southern hemisphere countries, their mid-year temperatures tends to be cooler, and that is the opposite for northern hemisphere countries. This is intuitively correct, but we might observe that the slope for change for each station, even in the same hemisphere, might be different. Therefore, it might be helpful to plot the monthly temperature change for stations as grouped by continents.

The url below provides a dataset containing country names, their codes, and their corresponding continents.

CCcode = pd.read_csv('ContinentsCountries.csv')
#https://coronadatasource.org/open-data/geo-data/world-regions-continents-countries/

We first select the needed columns.

CCcode = CCcode[['Country Name (usual)', 'Continent' , 'Continent Code']]

CCcode

	Country Name (usual)	Continent	Continent Code
0	Afghanistan	Asia	AS
1	?land Islands	Europe	EU
2	Albania	Europe	EU
3	Algeria	Africa	AF
4	American Samoa	Oceania	OC
...	...	...	...
248	Zimbabwe	Africa	AF
249	Kosovo	Europe	EU
250	Netherlands Antilles	North America	NaN
251	Others	Others	XX
252	# Undefined	Others	XX

253 rows × 3 columns

CCcode['Continent'].unique()

array(['Asia', 'Europe', 'Africa', 'Oceania', 'North America',
       'Antarctica', 'South America', 'Others'], dtype=object)

As usual, we add the country-continent code dataset to the temps database

conn = sqlite3.connect("temps.db")
CCcode.to_sql("CCcode", conn, if_exists = "replace", index = False)
conn.close()

/Users/feishu/opt/anaconda3/envs/PIC16B/lib/python3.7/site-packages/pandas/core/generic.py:2789: UserWarning:

The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.

Now, we create a query to return a dataframe with countries’ yearly temperature, with specified continents, in a specific year

def query_temp_continent(year):
    '''return a table with countries' yearly temperature, marked by continent,
    in a specific year
    '''
    conn = sqlite3.connect("temps.db")
    cmd = \
    """
    SELECT T.id, T.month, T.temp, C.name, T.year, O.continent
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON S.id_ = C.`fips 10-4`
    LEFT JOIN cccode O ON O.`country name (usual)` = C.name
    WHERE T.year =?
    """
    params = (year , )

    df = pd.read_sql_query(cmd,conn, params = params )
    #df_south = pd.read_sql_query(cmd_s, conn)
    conn.close()
    return df

df3 = query_temp_continent(1969)

df3

	ID	Month	Temp	Name	Year	Continent
0	ACW00011604	1	-0.81	Antigua and Barbuda	1969	North America
1	ACW00011604	2	-4.41	Antigua and Barbuda	1969	North America
2	ACW00011604	3	-1.32	Antigua and Barbuda	1969	North America
3	ACW00011604	4	5.50	Antigua and Barbuda	1969	North America
4	ACW00011604	5	10.15	Antigua and Barbuda	1969	North America
...	...	...	...	...	...	...
164000	ZIXLT622116	8	14.60	Zimbabwe	1969	Africa
164001	ZIXLT622116	9	17.80	Zimbabwe	1969	Africa
164002	ZIXLT622116	10	21.00	Zimbabwe	1969	Africa
164003	ZIXLT622116	11	20.40	Zimbabwe	1969	Africa
164004	ZIXLT622116	12	19.10	Zimbabwe	1969	Africa

164005 rows × 6 columns

Now, we create a plot function that plots the monthly change of temperature colorized by countries. The function will be supplied to the future groupby and apply method.

def plot_line_cont2(df):
    cont = df['Continent'].unique()
    year = df['Year'].unique()
    fig = px.scatter(df, x="Month", y="Temp", color="Name",
               hover_name="Name",
        title = 'Monthly avg Temperature of countries(& regions) in ' + cont[0] + ' in ' + str(year))
    fig.show()
    write_html(fig, cont[0] + ".html")

df3.groupby(['Continent']).apply(plot_line_cont2)

From the above graph, we could observe that monthly temperature change also has some patterns corresponding to different continents. In Asia, Europe, and North America, mid-year temperature tends to be the highest. While for South America and Antarctica, things are the opposite. Perhaps that’s why Santa Claus wears T-shirts in these countries!

-added url hyperlinks -changed the logistic for plot_line function, from plotting for specific latitude to for a latitude range

Written on April 21, 2021