Weather and Climate Analysis

AWS has provided public Data for ML for NOAA Historical Weather Conditions.

AWS Open Data Repository: https://registry.opendata.aws/

AWS Blog for this Demo: https://aws.amazon.com/blogs/big-data/visualize-over-200-years-of-global-climate-data-using-amazon-athena-and-amazon-quicksight/

NOAA Source Data: https://docs.opendata.aws/noaa-ghcn-pds/readme.html

NOAA Global Forecast Dataset: https://www.ncdc.noaa.gov/data-access/model-data/model-datasets/global-forcast-system-gfs

Import Weather Data from NOAA

As part of the import of the public data, we had to create our own version of it stored in the local dlzp account.

I created a Glue database called: ghcnblog

The initial import of the data was into a table called: tblallyears

S3 Bucket Name: dlzp-weather-data

The data was copied into the DLZP Database using the following script.

create table ghcnblog.tblallyears_qa 
 with (
  format='PARQUET',
  external_location='s3://dlzp-weather-data/ghcnblog/allyearsqa/'
) as
select * from ghcnblog.tblallyears
  where q_flag = '';

In addition, we had to download and import the weather station data

create table ghcnblog.tblghcnd_stations_qa 
 with (
  format='PARQUET',
  external_location='s3://dlzp-weather-data/ghcnblog/stations/'
) as
select * from ghcnblog.tblghcnd_stations;

Exploratory data analysis and visualization

With our two tables created, we are now ready to query and visualize to gain insight.

Exploratory data analysis In the Athena query window, run the following queries to get an idea of the size of the dataset.

Query #1: the total number of observations since 1763:

select count(*) as Total_Number_of_Observations
from ghcnblog.tblallyears_qa;

Query #2: the number of stations since 1763:

select count(*) as Total_Number_of_Stations
from ghcnblog.tblghcnd_stations_qa;

Average weather parameters for the Earth

The following shows a query that calculates the average maximum temperature (Celsius), average minimum temperature (Celsius), and average rainfall (mm) for the Earth since 1763.

In the query, we must convert the data_value from a String variable to a Real variable. We also must divide by 10, because the temperature and precipitation measurements are in tenths of their respective units. For more information about these details and the element codes (TMIB, TMAX and PRCP), see the readme file.

select element, round(avg(CAST(data_value AS real)/10),2) as value
  from ghcnblog.tblallyears_qa
    where element in ('TMIN','TMAX','PRCP')
  group by element; 
  
element value
TMAX 16.35
PRCP 2.34
TMIN 4.42