Assignment - Blog Post 1

In this blog, we’ll demonstrate the basic utilization of the plotly pacakge, and some basic database usage with sql(Structured Query Language). Databases increases the efficiency for treating large datasets. We will be using the National Oceanic and Atmospheric Admistration’s (NOAA) climate data set, together with temperature stations and country codes, to create some interactive geographically representative plots corresponding to temperatures.

First, we would like to import some needed packages, make sure you have installed them properly. Despite pandas and numpy that we are aready familiarized with, sqlite3 is new package that enables us to work with databases using Python commands.

import sqlite3
import pandas as pd
import numpy as np

Part 1 Introducing database & data prehandling

Inspecting the temperature dataset here, you might discover that it contains more than ten thousands of rows. Therefore, when only a subset of the complete dataset is need, it is possible to first store the complete dataset into a database and query out specific rows or columns in need.

Creating a database

The following line would create a database in the current directory. For future usage of the database, the same line could be used.

conn = sqlite3.connect("temps.db") # create a database in current directory called temps.db

Then, we would like to read in the dataset files to prepare to populate the temps database that we’ve just created. The lastline just creates a new column in the stations dataset that extracts the country codes in the original station id. The abbreviation would be used for future merging.

#temperatures = pd.read_csv('temps.csv')
stations = pd.read_csv('station-metadata.csv')
countries = pd.read_csv('countries.csv')
stations['ID_'] = stations['ID'].str[0:2]

Take a look at the countries and stations table. The countries table contains corresponding station codes for countries, while stations table has the stations’ ID, name, their latitude and longitde, their elevation, and new abbreviation code column we’ve just created.

countries.head(5)
FIPS 10-4 ISO 3166 Name
0 AF AF Afghanistan
1 AX - Akrotiri
2 AL AL Albania
3 AG DZ Algeria
4 AQ AS American Samoa
stations.head(5)
ID LATITUDE LONGITUDE STNELEV NAME ID_
0 ACW00011604 57.7667 11.8667 18.0 SAVE AC
1 AE000041196 25.3330 55.5170 34.0 SHARJAH_INTER_AIRP AE
2 AEM00041184 25.6170 55.9330 31.0 RAS_AL_KHAIMAH_INTE AE
3 AEM00041194 25.2550 55.3640 10.4 DUBAI_INTL AE
4 AEM00041216 24.4300 54.4700 3.0 ABU_DHABI_BATEEN_AIR AE

Have a look of the original temperature dataset. Currently, it contains the monthly average temperature values at specific stations for certain years.

temperatures = pd.read_csv('temps.csv')
#temperatures = prepare_df(temperatures)
temperatures.head(5)
ID Year VALUE1 VALUE2 VALUE3 VALUE4 VALUE5 VALUE6 VALUE7 VALUE8 VALUE9 VALUE10 VALUE11 VALUE12
0 ACW00011604 1961 -89.0 236.0 472.0 773.0 1128.0 1599.0 1570.0 1481.0 1413.0 1174.0 510.0 -39.0
1 ACW00011604 1962 113.0 85.0 -154.0 635.0 908.0 1381.0 1510.0 1393.0 1163.0 994.0 323.0 -126.0
2 ACW00011604 1963 -713.0 -553.0 -99.0 541.0 1224.0 1627.0 1620.0 1596.0 1332.0 940.0 566.0 -108.0
3 ACW00011604 1964 62.0 -85.0 55.0 738.0 1219.0 1442.0 1506.0 1557.0 1221.0 788.0 546.0 112.0
4 ACW00011604 1965 44.0 -105.0 38.0 590.0 987.0 1500.0 1487.0 1477.0 1377.0 974.0 31.0 -178.0

Then, we create a function to prepare the temperature dataset into a form we want. The function mainly conducts stacking to expand the original dataset by month. The last line in the function adds a new column, “FIPS 10-4”, which will be used later for merging with the country code dataset. The merge is expected to give out what country each temperature measurement station is located in.

def prepare_df(df):
    df = df.set_index(keys=["ID", "Year"])#convert all the columns we don't want to stack into multi-indices
    df = df.stack()#separate out values in each row into single entries
    df = df.reset_index()#giving each entry ID, Year, and Name columns
    df = df.rename(columns = {"level_2"  : "Month" , 0 : "Temp"})#rename to create proper column names
    df["Month"] = df["Month"].str[5:].astype(int)
    df["Temp"]  = df["Temp"] / 100
    df["FIPS 10-4"] = df["ID"].str[0:2]
    return(df)

Take a look at the prepared temperatures dataset, it will contain monthly temperature averages at specific stations, as well as their country codes.

temperatures = prepare_df(temperatures)
temperatures.head(5)
ID Year Month Temp FIPS 10-4
0 ACW00011604 1961 1 -0.89 AC
1 ACW00011604 1961 2 2.36 AC
2 ACW00011604 1961 3 4.72 AC
3 ACW00011604 1961 4 7.73 AC
4 ACW00011604 1961 5 11.28 AC

Now, we are finally ready to populate our database. We’ll be creating three tables: temperature, countries, and stations. For countries and stations, we would populate them into the database directly. While for the temperature table, we would first give it to the prepare_df function. Three tables would be saved in the form as demonstrated above.

The df_iter object makes a dataframe iterable by specified chunksize. if_exists checks if that some chunk is already added to sql, and execute the following block, making sure we won’t overwrite each time.

df_iter = pd.read_csv("temps.csv", chunksize = 100000)

for df in df_iter:
    df = prepare_df(df)
    
    df.to_sql("temperatures", conn, if_exists = "append", index = False)

stations.to_sql("stations", conn, if_exists = "replace", index = False)
countries.to_sql("countries", conn, if_exists = "replace", index = False)
/Users/susanzhang/opt/anaconda3/envs/PIC16B/lib/python3.7/site-packages/pandas/core/generic.py:2789: UserWarning: The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.
  method=method,

Now, we’re supposed to have three dataframes in the database, let’s check!

cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())
[('temperatures',), ('stations',), ('countries',)]

Don’t forget to close your connections!

conn.close()

PART 2 SQL Query

In this part, we would create a demonstration of how to query specific data out from sql, and generating a plot from the queried dataset. We will be implementing two functions, one for data query and one for plot generation.

We will be creating a query_climate_database() function. The function is expected to return a Pandas dataframe of temperature readings for the specified country, in the specified date range, in the specified month of the year.

def query_climate_database(country, year_begin, year_end, month):
    '''param country: a string of the country for which data should be returned;
       year_begin, year_end: two integers specifying the time interval of data be returned;
       month: an integer giving the month of data to be returned.
       @return a Pandas dataframe of temperature readings for the specified country, 
       in the specified date range, in the specified month of the year
    '''
    #first create a connection with the database
    conn = sqlite3.connect("temps.db")
    #basic queries
    #containing SELECT, specifies column(s) to be returned;
    #FROM, from which table(s)
    #LEFT JOIN, merge datasets for specific outcomes
    #WHERE, row specification
    cmd = \
    """
    SELECT S.name, S.latitude, S.longitude, C.name, T.year, T.month, T.temp
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON S.id_ = C.`fips 10-4`
    WHERE C.name = ? AND (T.year BETWEEN ? AND ?) AND T.month = ?
    """
    
    #the '?' syntax in WHERE enables parameterization in a query,
    #and we'll supply the parameters sequentially here
    param = (country, year_begin, year_end, month,)
    
    #the following line reads a query and returns the dataset from the database 
    df = pd.read_sql_query(cmd, conn, params=param)
    conn.close()#make sure to close the connection after use
    return df
df = query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)

Now, have a look of the dataset that we’ve queried out! It should contain the temperatures in specific month, year, and station, as well as the latitude and longitude of the station, for a specific country.

df
NAME LATITUDE LONGITUDE Name Year Month Temp
0 PBO_ANANTAPUR 14.583 77.633 India 1980 1 23.48
1 PBO_ANANTAPUR 14.583 77.633 India 1981 1 24.57
2 PBO_ANANTAPUR 14.583 77.633 India 1982 1 24.19
3 PBO_ANANTAPUR 14.583 77.633 India 1983 1 23.51
4 PBO_ANANTAPUR 14.583 77.633 India 1984 1 24.81
... ... ... ... ... ... ... ...
3147 DARJEELING 27.050 88.270 India 1983 1 5.10
3148 DARJEELING 27.050 88.270 India 1986 1 6.90
3149 DARJEELING 27.050 88.270 India 1994 1 8.10
3150 DARJEELING 27.050 88.270 India 1995 1 5.60
3151 DARJEELING 27.050 88.270 India 1997 1 5.70

3152 rows × 7 columns

PART 3 Plot with Plotly

In this part, we wish to create a plot using plotly to answer the following question:

How does the average yearly change in temperature vary within a given country?

To answer this question, we’ll need to first calculate a yearly increase of stations in a specific country for a specific month.

One method great for calculating extimate yearly increase is the LinearRegression model in sklearn. So we’ll import that, along with the plotly package.

from sklearn.linear_model import LinearRegression
from plotly import express as px

We’ll define a function that estimates the yearly increase for specific stations.

def coef(data_group):
    x = data_group[["Year"]] # 2 brackets because X should be a df
    y = data_group["Temp"]   # 1 bracket because y should be a series
    LR = LinearRegression()
    LR.fit(x, y)
    return LR.coef_[0]

Then, we’ll define a temperature_coefficient_plot function that plots a point for each station in a specific country. With plotly, the color of the point will reflect an estimate of the yearly change in temperature during the specified month and time period at that station.

def temperature_coefficient_plot(country, year_begin, year_end, month, min_obs, **kwargs):
    '''country:  string of the country for which data should be returned;
        year_begin, year_end: two integers specifying the time interval of data be returned;
        min_obs: minimum required number of years of data for any given station. 
        (only data for stations with at least min_obs years worth of data would be considered);
        **kwargs: additional keyword arguments passed to px.scatter_mapbox(). 
    '''
    #first obtain the dataframe from above function:
    df = query_climate_database(country = country, 
                       year_begin = year_begin, 
                       year_end = year_end,
                       month = month)
    #add a new column to df with number of years of data there are for that station
    df["size"] = df.groupby(["NAME"])['NAME'].transform(lambda x: x.shape[0])
    #filter out stations without enough yearly data, inclusive
    df_ = df[df['size'] >= min_obs]
    
    #estimate yearly increase for each station
    coefs = df_.groupby(["NAME", "Month"]).apply(coef)#
    coefs = coefs.reset_index()
    
    #merge the estimates with stations,
    #therefore each estimate would have there corresponding latitude and longitude
    coefs_ = pd.merge(coefs, stations, on = ["NAME"])
    #rename coef column to "Estimated Yearly Increase(Celsius)" and round the coefs
    coefs_ = coefs_.rename(columns = {0:"Estimated Yearly Increase(Celsius)"})
    coefs_ = coefs_[["Estimated Yearly Increase(Celsius)", 'NAME', 'LATITUDE','LONGITUDE']]
    coefs_['Estimated Yearly Increase(Celsius)'] = coefs_['Estimated Yearly Increase(Celsius)'].round(4)
    
    #create plotly plot! add titles and hover overs
    fig = px.scatter_mapbox(coefs_, 
                        lat = "LATITUDE",
                        lon = "LONGITUDE", 
                        hover_name = "NAME", 
                        color = "Estimated Yearly Increase(Celsius)", 
                        title="Estimates of yearly increase in Temperature in January for stations in "
                            + country +", years " + str(year_begin) + " - " + str(year_end),
                        **kwargs)

    #fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
    return fig
    
color_map = px.colors.diverging.RdGy_r # choose a colormap

#run the function to generate plotly plot
fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)

fig.show()

From the plot, we could see that the estimate yearly change of temperature in India various geographically, it might correspond to specific factors, such as distance to the ocean, etc.

Save it as a html form!

from plotly.io import write_html
write_html(fig, "india_temp_esti.html")

Part 4 Continental temperature Change

In this part, I would like to visualize how temperature alteration differs in the northern and southern hemisphere. First, it might be helpful to create a function that selects temperature change measured by stations at the same range of latitude from two hemispheres during a specific year. lat_low specifies the lower bound for northern hemisphere, lat_high specifies the upper bound for the northern hemisphere, therefore -lat_low and -lat_high correspondly specifies that for the southern hemisphere.

#conn = sqlite3.connect("temps.db")
def query_compare_latitude(lat_low, lat_high, year):
    '''returns a dataframe containing monthly temps at stations
    for the specified latitude from two hemispheres
    '''
    #neg_lat = np.negative(lat)
    conn = sqlite3.connect("temps.db")
    cmd = \
    """
    SELECT T.id, T.month, T.temp, S.latitude, S.longitude, S.name, T.year
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    WHERE ((S.latitude < ? AND S.latitude > ?) OR (S.latitude < ? AND S.latitude > ?) )AND T.year = ?
    """
    params = (lat_high, lat_low, -lat_low, -lat_high , year )

    df = pd.read_sql_query(cmd,conn, params = params)
    #df_south = pd.read_sql_query(cmd_s, conn)
    conn.close()
    return df
#select temp measurements for points in year 1954
#latitude in both hemispheres between 20 - 40 degrees
df = query_compare_latitude(20, 25, 1954)

Take a look at our dataframe, it contains monthly temperature measurements at specified latitude range for a specific year.

df
ID Month Temp LATITUDE LONGITUDE NAME Year
0 AG000060680 1 12.84 22.800 5.4331 TAMANRASSET 1954
1 AG000060680 2 14.20 22.800 5.4331 TAMANRASSET 1954
2 AG000060680 3 17.81 22.800 5.4331 TAMANRASSET 1954
3 AG000060680 4 20.73 22.800 5.4331 TAMANRASSET 1954
4 AG000060680 5 25.96 22.800 5.4331 TAMANRASSET 1954
... ... ... ... ... ... ... ...
3664 ZI000067991 8 18.41 -22.217 30.0000 BEITBRIDGE 1954
3665 ZI000067991 9 22.79 -22.217 30.0000 BEITBRIDGE 1954
3666 ZI000067991 10 26.78 -22.217 30.0000 BEITBRIDGE 1954
3667 ZI000067991 11 25.85 -22.217 30.0000 BEITBRIDGE 1954
3668 ZI000067991 12 26.02 -22.217 30.0000 BEITBRIDGE 1954

3669 rows × 7 columns

from matplotlib import pyplot as plt
#import seaborn as sns 

Write a plot_line function! The function would generate a line plot for monthly temperature change for each station, color separated by northern or southern hemispheres.

def plot_line(df):
    #return sns.lineplot(x="Month", y="Temp",
             #style="NAME", hue = 'LATITUDE', palette = 'vlag',
             #data=df)
    df['LATITUDE'] = df['LATITUDE'].astype(int)# cast to int for approximation, for ease to generate plot
    df['hemi'] = np.where( (df['LATITUDE'] >= 0), 'northern hemisphere', 'southern hemisphere')
    return px.line(df, x="Month", y="Temp", color = 'hemi',
                  title="Monthly temperature(celsius) in both hemispheres between 20 - 40 latitudes"
                  ,line_group="NAME",
                 hover_name="NAME")
    #fig.show()
fig = plot_line(df)
write_html(fig, "p4_1_real.html")

Trying the query and plot functions with latitude range of 25-30, we were able to observe some patterns: for southern hemisphere countries, their mid-year temperatures tends to be cooler, and that is the opposite for northern hemisphere countries. This is intuitively correct, but we might observe that the slope for change for each station, even in the same hemisphere, might be different. Therefore, it might be helpful to plot the monthly temperature change for stations as grouped by continents.

The url below provides a dataset containing country names, their codes, and their corresponding continents.

CCcode = pd.read_csv('ContinentsCountries.csv')
#https://coronadatasource.org/open-data/geo-data/world-regions-continents-countries/

We first select the needed columns.

CCcode = CCcode[['Country Name (usual)', 'Continent' , 'Continent Code']]
CCcode
Country Name (usual) Continent Continent Code
0 Afghanistan Asia AS
1 ?land Islands Europe EU
2 Albania Europe EU
3 Algeria Africa AF
4 American Samoa Oceania OC
... ... ... ...
248 Zimbabwe Africa AF
249 Kosovo Europe EU
250 Netherlands Antilles North America NaN
251 Others Others XX
252 # Undefined Others XX

253 rows × 3 columns

CCcode['Continent'].unique()
array(['Asia', 'Europe', 'Africa', 'Oceania', 'North America',
       'Antarctica', 'South America', 'Others'], dtype=object)

As usual, we add the country-continent code dataset to the temps database

conn = sqlite3.connect("temps.db")
CCcode.to_sql("CCcode", conn, if_exists = "replace", index = False)
conn.close()
/Users/feishu/opt/anaconda3/envs/PIC16B/lib/python3.7/site-packages/pandas/core/generic.py:2789: UserWarning:

The spaces in these column names will not be changed. In pandas versions < 0.14, spaces were converted to underscores.

Now, we create a query to return a dataframe with countries’ yearly temperature, with specified continents, in a specific year

def query_temp_continent(year):
    '''return a table with countries' yearly temperature, marked by continent,
    in a specific year
    '''
    conn = sqlite3.connect("temps.db")
    cmd = \
    """
    SELECT T.id, T.month, T.temp, C.name, T.year, O.continent
    FROM temperatures T
    LEFT JOIN stations S ON T.id = S.id
    LEFT JOIN countries C ON S.id_ = C.`fips 10-4`
    LEFT JOIN cccode O ON O.`country name (usual)` = C.name
    WHERE T.year =?
    """
    params = (year , )

    df = pd.read_sql_query(cmd,conn, params = params )
    #df_south = pd.read_sql_query(cmd_s, conn)
    conn.close()
    return df
df3 = query_temp_continent(1969)
df3
ID Month Temp Name Year Continent
0 ACW00011604 1 -0.81 Antigua and Barbuda 1969 North America
1 ACW00011604 2 -4.41 Antigua and Barbuda 1969 North America
2 ACW00011604 3 -1.32 Antigua and Barbuda 1969 North America
3 ACW00011604 4 5.50 Antigua and Barbuda 1969 North America
4 ACW00011604 5 10.15 Antigua and Barbuda 1969 North America
... ... ... ... ... ... ...
164000 ZIXLT622116 8 14.60 Zimbabwe 1969 Africa
164001 ZIXLT622116 9 17.80 Zimbabwe 1969 Africa
164002 ZIXLT622116 10 21.00 Zimbabwe 1969 Africa
164003 ZIXLT622116 11 20.40 Zimbabwe 1969 Africa
164004 ZIXLT622116 12 19.10 Zimbabwe 1969 Africa

164005 rows × 6 columns

Now, we create a plot function that plots the monthly change of temperature colorized by countries. The function will be supplied to the future groupby and apply method.

def plot_line_cont2(df):
    cont = df['Continent'].unique()
    year = df['Year'].unique()
    fig = px.scatter(df, x="Month", y="Temp", color="Name",
               hover_name="Name",
        title = 'Monthly avg Temperature of countries(& regions) in ' + cont[0] + ' in ' + str(year))
    fig.show()
    write_html(fig, cont[0] + ".html")
df3.groupby(['Continent']).apply(plot_line_cont2)

From the above graph, we could observe that monthly temperature change also has some patterns corresponding to different continents. In Asia, Europe, and North America, mid-year temperature tends to be the highest. While for South America and Antarctica, things are the opposite. Perhaps that’s why Santa Claus wears T-shirts in these countries!


-added url hyperlinks -changed the logistic for plot_line function, from plotting for specific latitude to for a latitude range

Written on April 21, 2021