lemniskett.moe/content/writeups/google-cloudskillsboost/GSP787/index.md

453 lines
14 KiB
Markdown
Raw Normal View History

---
title: "[GSP787] Insights from Data with BigQuery: Challenge Lab"
description: ""
summary: "Quest: Insights from Data with BigQuery"
date: 2023-05-20T21:01:15+07:00
draft: false
author: "Hiiruki" # ["Me", "You"] # multiple authors
tags: ["writeups", "challenge", "google-cloudskillsboost", "gsp787", "google-cloud", "cloudskillsboost", "juaragcp", "google-cloud-platform", "gcp", "cloud-computing", "bigquery", "sql"]
canonicalURL: ""
showToc: true
TocOpen: false
TocSide: 'right' # or 'left'
weight: 14
# aliases: ["/first"]
hidemeta: false
comments: false
disableHLJS: true # to disable highlightjs
disableShare: true
hideSummary: false
searchHidden: false
ShowReadingTime: true
ShowBreadCrumbs: true
ShowPostNavLinks: true
ShowWordCount: true
ShowRssButtonInSectionTermList: true
# UseHugoToc: true
cover:
image: "<image path/url>" # image path/url
alt: "<alt text>" # alt text
caption: "<text>" # display caption under cover
relative: false # when using page bundles set this to true
hidden: true # only hide on current single page
# editPost:
# URL: "https://github.com/hiiruki/hiiruki.dev/blob/main/content/writeups/google-cloudskillsboost/GSP787/index.md"
# Text: "Suggest Changes" # edit text
# appendFilePath: true # to append file path to Edit link
---
### GSP787
![Lab Banner](https://cdn.qwiklabs.com/GMOHykaqmlTHiqEeQXTySaMXYPHeIvaqa2qHEzw6Occ%3D#center)
- Time: 1 hour<br>
- Difficulty: Intermediate<br>
- Price: 5 Credits
Lab: [GSP787](https://www.cloudskillsboost.google/focuses/14294?parent=catalog)<br>
Quest: [Insights from Data with BigQuery](https://www.cloudskillsboost.google/quests/123)<br>
## Challenge lab scenario
You're part of a public health organization which is tasked with identifying answers to queries related to the Covid-19 pandemic. Obtaining the right answers will help the organization in planning and focusing healthcare efforts and awareness programs appropriately.
The dataset and table that will be used for this analysis will be : `bigquery-public-data.covid19_open_data.covid19_open_data`. This repository contains country-level datasets of daily time-series data related to COVID-19 globally. It includes data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, and weather.
### Task 1. Total confirmed cases
- Build a query that will answer "What was the total count of confirmed cases on `Date`?" The query needs to return a single row containing the sum of confirmed cases across all countries. The name of the column should be **total_cases_worldwide**.
Columns to reference:
- cumulative_confirmed
- date
Go to BigQuery and run the following query:
Change the `date` based on the lab instructions.
![Date Variable](./images/date%20variable.webp#center)
```sql
SELECT sum(cumulative_confirmed) as total_cases_worldwide
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date=<****change date eg '2020-05-15'****>
```
Mine is `May, 15 2020`. So, I will change the date to `2020-05-15`.
example:
```sql
SELECT sum(cumulative_confirmed) as total_cases_worldwide
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE date='2020-05-15'
```
### Task 2. Worst affected areas
- Build a query for answering "How many states in the US had more than `Death Count` deaths on `Date`?" The query needs to list the output in the field **count_of_states**.
> **Note**: Don't include NULL values.
Columns to reference:
- country_name
- subregion1_name (for state information)
- cumulative_deceased
Go to BigQuery and run the following query:
Change the `date` and `death_count` based on the lab instructions.
```sql
with deaths_by_states as (
SELECT subregion1_name as state, sum(cumulative_deceased) as death_count
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="United States of America" and date=<****change date eg '2020-05-15'****> and subregion1_name is NOT NULL
group by subregion1_name
)
select count(*) as count_of_states
from deaths_by_states
where death_count > <****change death count here****>
```
Mine is `250` deaths. So, I will change the `death_count` to `250`.
![Date and Death Count Variable](./images/deaths.webp#center)
example:
```sql
with deaths_by_states as (
SELECT subregion1_name as state, sum(cumulative_deceased) as death_count
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="United States of America" and date='2020-05-15' and subregion1_name is NOT NULL
group by subregion1_name
)
select count(*) as count_of_states
from deaths_by_states
where death_count > 250
```
### Task 3. Identifying hotspots
- Build a query that will answer "List all the states in the United States of America that had more than `Confirmed Cases` confirmed cases on `Date`?" The query needs to return the State Name and the corresponding confirmed cases arranged in descending order. Name of the fields to return state and **total_confirmed_cases**.
Columns to reference:
- country_code
- subregion1_name (for state information)
- cumulative_confirmed
Go to BigQuery and run the following query:
```sql
SELECT * FROM (
SELECT subregion1_name as state, sum(cumulative_confirmed) as total_confirmed_cases
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE country_code="US" AND date=<****change date eg '2020-05-15'****> AND subregion1_name is NOT NULL
GROUP BY subregion1_name
ORDER BY total_confirmed_cases DESC
)
WHERE total_confirmed_cases > <****change confirmed case here****>
```
### Task 4. Fatality ratio
1. Build a query that will answer "What was the case-fatality ratio in Italy for the month of Month 2020?" Case-fatality ratio here is defined as (total deaths / total confirmed cases) * 100.
2. Write a query to return the ratio for the month of Month 2020 and contain the following fields in the output: total_confirmed_cases, total_deaths, case_fatality_ratio.
Columns to reference:
- country_name
- cumulative_confirmed
- cumulative_deceased
Go to BigQuery and run the following query:
```sql
SELECT sum(cumulative_confirmed) as total_confirmed_cases, sum(cumulative_deceased) as total_deaths, (sum(cumulative_deceased)/sum(cumulative_confirmed))*100 as case_fatality_ratio
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" AND date BETWEEN <****change month here '2020-06-01'****> and <****change month here '2020-06-30'****>
```
Change the `month` based on the lab instructions.
![Month Variable](./images/month.webp#center)
Mine is `June, 2020`. So, I will change the month to `2020-06-01` and `2020-06-30`.
example:
```sql
SELECT sum(cumulative_confirmed) as total_confirmed_cases, sum(cumulative_deceased) as total_deaths, (sum(cumulative_deceased)/sum(cumulative_confirmed))*100 as case_fatality_ratio
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" AND date BETWEEN '2020-06-01' and '2020-06-30'
```
### Task 5. Identifying specific day
- Build a query that will answer: "On what day did the total number of deaths cross `Death count in Italy` in Italy?" The query should return the date in the format **yyyy-mm-dd**.
Columns to reference:
- country_name
- cumulative_deceased
Go to BigQuery and run the following query:
```sql
SELECT date
FROM `bigquery-public-data.covid19_open_data.covid19_open_data`
where country_name="Italy" and cumulative_deceased> <****change the value of death cross****>
order by date asc
limit 1
```
### Task 6. Finding days with zero net new cases
The following query is to identify the number of days in India between `Start date in India` and `Close date in India` when there were zero increases in the number of confirmed cases.
Go to BigQuery and run the following query:
```sql
WITH india_cases_by_date AS (
SELECT
date,
SUM( cumulative_confirmed ) AS cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name ="India"
AND date between < ****change the date here'2020-02-21'****> and <****change the date here'2020-03-15'****>
GROUP BY
date
ORDER BY
date ASC
)
, india_previous_day_comparison AS
(SELECT
date,
cases,
LAG(cases) OVER(ORDER BY date) AS previous_day,
cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)
select count(*)
from india_previous_day_comparison
where net_new_cases=0
```
Change the `start date` in India and `close date` in India based on the lab instructions.
![Start Date and Close Date](./images/start_close_date.webp#center)
Mine is `25, Feb 2020` and `10, March 2020`. So, I will change the date to `2020-02-25` and `2020-03-10`.
example:
```sql
WITH india_cases_by_date AS (
SELECT
date,
SUM( cumulative_confirmed ) AS cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name ="India"
AND date between '2020-02-25' and '2020-03-10'
GROUP BY
date
ORDER BY
date ASC
)
, india_previous_day_comparison AS
(SELECT
date,
cases,
LAG(cases) OVER(ORDER BY date) AS previous_day,
cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases
FROM india_cases_by_date
)
select count(*)
from india_previous_day_comparison
where net_new_cases=0
```
### Task 7. Doubling rate
- Using the previous query as a template, write a query to find out the dates on which the confirmed cases increased by more than `Limit Value`% compared to the previous day (indicating doubling rate of ~ 7 days) in the US between the dates March 22, 2020 and April 20, 2020. The query needs to return the list of dates, the confirmed cases on that day, the confirmed cases the previous day, and the percentage increase in cases between the days.
- Use the following names for the returned fields: **Date**, **Confirmed_Cases_On_Day**, **Confirmed_Cases_Previous_Day**, and **Percentage_Increase_In_Cases**.
Go to BigQuery and run the following query:
Change the `Limit Value` based on the lab instructions.
![Limit Value](./images/percentage.webp#center)
Mine is `5`% so, I will change the value to `5`.
```sql
WITH us_cases_by_date AS (
SELECT
date,
SUM(cumulative_confirmed) AS cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="United States of America"
AND date between '2020-03-22' and '2020-04-20'
GROUP BY
date
ORDER BY
date ASC
)
, us_previous_day_comparison AS
(SELECT
date,
cases,
LAG(cases) OVER(ORDER BY date) AS previous_day,
cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases,
(cases - LAG(cases) OVER(ORDER BY date))*100/LAG(cases) OVER(ORDER BY date) AS percentage_increase
FROM us_cases_by_date
)
select Date, cases as Confirmed_Cases_On_Day, previous_day as Confirmed_Cases_Previous_Day, percentage_increase as Percentage_Increase_In_Cases
from us_previous_day_comparison
where percentage_increase > <****change percentage value here****>
```
### Task 8. Recovery rate
1. Build a query to list the recovery rates of countries arranged in descending order (limit to `Limit Value`) upto the date May 10, 2020.
2. Restrict the query to only those countries having more than 50K confirmed cases.
- The query needs to return the following fields: `country`, `recovered_cases`, `confirmed_cases`, `recovery_rate`.
Columns to reference:
- country_name
- cumulative_confirmed
- cumulative_recovered
Go to BigQuery and run the following query:
Change the `limit` based on the lab instructions.
![Limit](./images/limit.webp#center)
Mine is `5` so, I will change the value to `5`.
```sql
WITH cases_by_country AS (
SELECT
country_name AS country,
sum(cumulative_confirmed) AS cases,
sum(cumulative_recovered) AS recovered_cases
FROM
bigquery-public-data.covid19_open_data.covid19_open_data
WHERE
date = '2020-05-10'
GROUP BY
country_name
)
, recovered_rate AS
(SELECT
country, cases, recovered_cases,
(recovered_cases * 100)/cases AS recovery_rate
FROM cases_by_country
)
SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate
FROM recovered_rate
WHERE cases > 50000
ORDER BY recovery_rate desc
LIMIT <****change limit here****>
```
### Task 9. CDGR - Cumulative daily growth rate
- The following query is trying to calculate the CDGR on `Date` (Cumulative Daily Growth Rate) for France since the day the first case was reported.The first case was reported on Jan 24, 2020.
- The CDGR is calculated as:
`((last_day_cases/first_day_cases)^1/days_diff)-1)`
Where :
- `last_day_cases` is the number of confirmed cases on May 10, 2020
- `first_day_cases` is the number of confirmed cases on Jan 24, 2020
- `days_diff` is the number of days between Jan 24 - May 10, 2020
Go to BigQuery and run the following query:
```sql
WITH
france_cases AS (
SELECT
date,
SUM(cumulative_confirmed) AS total_cases
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
country_name="France"
AND date IN ('2020-01-24',
<****change the date value here'2020-05-10'****>)
GROUP BY
date
ORDER BY
date)
, summary as (
SELECT
total_cases AS first_day_cases,
LEAD(total_cases) OVER(ORDER BY date) AS last_day_cases,
DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff
FROM
france_cases
LIMIT 1
)
select first_day_cases, last_day_cases, days_diff, POW((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr
from summary
```
### Task 10. Create a Looker Studio report
- Create a [Looker Studio](https://datastudio.google.com/) report that plots the following for the United States:
- Number of Confirmed Cases
- Number of Deaths
- Date range : `Date Range`
Change the `Date Range` based on the lab instructions.
![Date Range](./images/looker_date.webp#center)
```sql
SELECT
date, SUM(cumulative_confirmed) AS country_cases,
SUM(cumulative_deceased) AS country_deaths
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
date BETWEEN <****change the date value here'2020-03-19'****>
AND <****change the date value here'2020-04-22'****>
AND country_name ="United States of America"
GROUP BY date
```
Mine is `2020-03-19` to `2020-04-22`. It should look like this:
```sql
SELECT
date, SUM(cumulative_confirmed) AS country_cases,
SUM(cumulative_deceased) AS country_deaths
FROM
`bigquery-public-data.covid19_open_data.covid19_open_data`
WHERE
date BETWEEN '2020-03-19'
AND '2020-04-22'
AND country_name ="United States of America"
GROUP BY date
```
## Congratulations!
![Congratulations Badge](https://cdn.qwiklabs.com/GfiFidoAd%2BrgYQRFgZggxgzMWJsGgFxnfA6bOWScimw%3D#center)