--- title: "[GSP787] Insights from Data with BigQuery: Challenge Lab" description: "" summary: "Quest: Insights from Data with BigQuery" date: 2023-05-20T21:01:15+07:00 draft: false author: "Hiiruki" # ["Me", "You"] # multiple authors tags: ["writeups", "challenge", "google-cloudskillsboost", "gsp787", "google-cloud", "cloudskillsboost", "juaragcp", "google-cloud-platform", "gcp", "cloud-computing", "bigquery", "sql"] canonicalURL: "" showToc: true TocOpen: false TocSide: 'right' # or 'left' weight: 14 # aliases: ["/first"] hidemeta: false comments: false disableHLJS: true # to disable highlightjs disableShare: true hideSummary: false searchHidden: false ShowReadingTime: true ShowBreadCrumbs: true ShowPostNavLinks: true ShowWordCount: true ShowRssButtonInSectionTermList: true # UseHugoToc: true cover: image: "" # image path/url alt: "" # alt text caption: "" # display caption under cover relative: false # when using page bundles set this to true hidden: true # only hide on current single page # editPost: # URL: "https://github.com/hiiruki/hiiruki.dev/blob/main/content/writeups/google-cloudskillsboost/GSP787/index.md" # Text: "Suggest Changes" # edit text # appendFilePath: true # to append file path to Edit link --- ### GSP787 ![Lab Banner](https://cdn.qwiklabs.com/GMOHykaqmlTHiqEeQXTySaMXYPHeIvaqa2qHEzw6Occ%3D#center) - Time: 1 hour
- Difficulty: Intermediate
- Price: 5 Credits Lab: [GSP787](https://www.cloudskillsboost.google/focuses/14294?parent=catalog)
Quest: [Insights from Data with BigQuery](https://www.cloudskillsboost.google/quests/123)
## Challenge lab scenario You're part of a public health organization which is tasked with identifying answers to queries related to the Covid-19 pandemic. Obtaining the right answers will help the organization in planning and focusing healthcare efforts and awareness programs appropriately. The dataset and table that will be used for this analysis will be : `bigquery-public-data.covid19_open_data.covid19_open_data`. This repository contains country-level datasets of daily time-series data related to COVID-19 globally. It includes data relating to demographics, economy, epidemiology, geography, health, hospitalizations, mobility, government response, and weather. ### Task 1. Total confirmed cases - Build a query that will answer "What was the total count of confirmed cases on `Date`?" The query needs to return a single row containing the sum of confirmed cases across all countries. The name of the column should be **total_cases_worldwide**. Columns to reference: - cumulative_confirmed - date Go to BigQuery and run the following query: Change the `date` based on the lab instructions. ![Date Variable](./images/date%20variable.webp#center) ```sql SELECT sum(cumulative_confirmed) as total_cases_worldwide FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE date=<****change date eg '2020-05-15'****> ``` Mine is `May, 15 2020`. So, I will change the date to `2020-05-15`. example: ```sql SELECT sum(cumulative_confirmed) as total_cases_worldwide FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE date='2020-05-15' ``` ### Task 2. Worst affected areas - Build a query for answering "How many states in the US had more than `Death Count` deaths on `Date`?" The query needs to list the output in the field **count_of_states**. > **Note**: Don't include NULL values. Columns to reference: - country_name - subregion1_name (for state information) - cumulative_deceased Go to BigQuery and run the following query: Change the `date` and `death_count` based on the lab instructions. ```sql with deaths_by_states as ( SELECT subregion1_name as state, sum(cumulative_deceased) as death_count FROM `bigquery-public-data.covid19_open_data.covid19_open_data` where country_name="United States of America" and date=<****change date eg '2020-05-15'****> and subregion1_name is NOT NULL group by subregion1_name ) select count(*) as count_of_states from deaths_by_states where death_count > <****change death count here****> ``` Mine is `250` deaths. So, I will change the `death_count` to `250`. ![Date and Death Count Variable](./images/deaths.webp#center) example: ```sql with deaths_by_states as ( SELECT subregion1_name as state, sum(cumulative_deceased) as death_count FROM `bigquery-public-data.covid19_open_data.covid19_open_data` where country_name="United States of America" and date='2020-05-15' and subregion1_name is NOT NULL group by subregion1_name ) select count(*) as count_of_states from deaths_by_states where death_count > 250 ``` ### Task 3. Identifying hotspots - Build a query that will answer "List all the states in the United States of America that had more than `Confirmed Cases` confirmed cases on `Date`?" The query needs to return the State Name and the corresponding confirmed cases arranged in descending order. Name of the fields to return state and **total_confirmed_cases**. Columns to reference: - country_code - subregion1_name (for state information) - cumulative_confirmed Go to BigQuery and run the following query: ```sql SELECT * FROM ( SELECT subregion1_name as state, sum(cumulative_confirmed) as total_confirmed_cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_code="US" AND date=<****change date eg '2020-05-15'****> AND subregion1_name is NOT NULL GROUP BY subregion1_name ORDER BY total_confirmed_cases DESC ) WHERE total_confirmed_cases > <****change confirmed case here****> ``` ### Task 4. Fatality ratio 1. Build a query that will answer "What was the case-fatality ratio in Italy for the month of Month 2020?" Case-fatality ratio here is defined as (total deaths / total confirmed cases) * 100. 2. Write a query to return the ratio for the month of Month 2020 and contain the following fields in the output: total_confirmed_cases, total_deaths, case_fatality_ratio. Columns to reference: - country_name - cumulative_confirmed - cumulative_deceased Go to BigQuery and run the following query: ```sql SELECT sum(cumulative_confirmed) as total_confirmed_cases, sum(cumulative_deceased) as total_deaths, (sum(cumulative_deceased)/sum(cumulative_confirmed))*100 as case_fatality_ratio FROM `bigquery-public-data.covid19_open_data.covid19_open_data` where country_name="Italy" AND date BETWEEN <****change month here '2020-06-01'****> and <****change month here '2020-06-30'****> ``` Change the `month` based on the lab instructions. ![Month Variable](./images/month.webp#center) Mine is `June, 2020`. So, I will change the month to `2020-06-01` and `2020-06-30`. example: ```sql SELECT sum(cumulative_confirmed) as total_confirmed_cases, sum(cumulative_deceased) as total_deaths, (sum(cumulative_deceased)/sum(cumulative_confirmed))*100 as case_fatality_ratio FROM `bigquery-public-data.covid19_open_data.covid19_open_data` where country_name="Italy" AND date BETWEEN '2020-06-01' and '2020-06-30' ``` ### Task 5. Identifying specific day - Build a query that will answer: "On what day did the total number of deaths cross `Death count in Italy` in Italy?" The query should return the date in the format **yyyy-mm-dd**. Columns to reference: - country_name - cumulative_deceased Go to BigQuery and run the following query: ```sql SELECT date FROM `bigquery-public-data.covid19_open_data.covid19_open_data` where country_name="Italy" and cumulative_deceased> <****change the value of death cross****> order by date asc limit 1 ``` ### Task 6. Finding days with zero net new cases The following query is to identify the number of days in India between `Start date in India` and `Close date in India` when there were zero increases in the number of confirmed cases. Go to BigQuery and run the following query: ```sql WITH india_cases_by_date AS ( SELECT date, SUM( cumulative_confirmed ) AS cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name ="India" AND date between < ****change the date here'2020-02-21'****> and <****change the date here'2020-03-15'****> GROUP BY date ORDER BY date ASC ) , india_previous_day_comparison AS (SELECT date, cases, LAG(cases) OVER(ORDER BY date) AS previous_day, cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases FROM india_cases_by_date ) select count(*) from india_previous_day_comparison where net_new_cases=0 ``` Change the `start date` in India and `close date` in India based on the lab instructions. ![Start Date and Close Date](./images/start_close_date.webp#center) Mine is `25, Feb 2020` and `10, March 2020`. So, I will change the date to `2020-02-25` and `2020-03-10`. example: ```sql WITH india_cases_by_date AS ( SELECT date, SUM( cumulative_confirmed ) AS cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name ="India" AND date between '2020-02-25' and '2020-03-10' GROUP BY date ORDER BY date ASC ) , india_previous_day_comparison AS (SELECT date, cases, LAG(cases) OVER(ORDER BY date) AS previous_day, cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases FROM india_cases_by_date ) select count(*) from india_previous_day_comparison where net_new_cases=0 ``` ### Task 7. Doubling rate - Using the previous query as a template, write a query to find out the dates on which the confirmed cases increased by more than `Limit Value`% compared to the previous day (indicating doubling rate of ~ 7 days) in the US between the dates March 22, 2020 and April 20, 2020. The query needs to return the list of dates, the confirmed cases on that day, the confirmed cases the previous day, and the percentage increase in cases between the days. - Use the following names for the returned fields: **Date**, **Confirmed_Cases_On_Day**, **Confirmed_Cases_Previous_Day**, and **Percentage_Increase_In_Cases**. Go to BigQuery and run the following query: Change the `Limit Value` based on the lab instructions. ![Limit Value](./images/percentage.webp#center) Mine is `5`% so, I will change the value to `5`. ```sql WITH us_cases_by_date AS ( SELECT date, SUM(cumulative_confirmed) AS cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name="United States of America" AND date between '2020-03-22' and '2020-04-20' GROUP BY date ORDER BY date ASC ) , us_previous_day_comparison AS (SELECT date, cases, LAG(cases) OVER(ORDER BY date) AS previous_day, cases - LAG(cases) OVER(ORDER BY date) AS net_new_cases, (cases - LAG(cases) OVER(ORDER BY date))*100/LAG(cases) OVER(ORDER BY date) AS percentage_increase FROM us_cases_by_date ) select Date, cases as Confirmed_Cases_On_Day, previous_day as Confirmed_Cases_Previous_Day, percentage_increase as Percentage_Increase_In_Cases from us_previous_day_comparison where percentage_increase > <****change percentage value here****> ``` ### Task 8. Recovery rate 1. Build a query to list the recovery rates of countries arranged in descending order (limit to `Limit Value`) upto the date May 10, 2020. 2. Restrict the query to only those countries having more than 50K confirmed cases. - The query needs to return the following fields: `country`, `recovered_cases`, `confirmed_cases`, `recovery_rate`. Columns to reference: - country_name - cumulative_confirmed - cumulative_recovered Go to BigQuery and run the following query: Change the `limit` based on the lab instructions. ![Limit](./images/limit.webp#center) Mine is `5` so, I will change the value to `5`. ```sql WITH cases_by_country AS ( SELECT country_name AS country, sum(cumulative_confirmed) AS cases, sum(cumulative_recovered) AS recovered_cases FROM bigquery-public-data.covid19_open_data.covid19_open_data WHERE date = '2020-05-10' GROUP BY country_name ) , recovered_rate AS (SELECT country, cases, recovered_cases, (recovered_cases * 100)/cases AS recovery_rate FROM cases_by_country ) SELECT country, cases AS confirmed_cases, recovered_cases, recovery_rate FROM recovered_rate WHERE cases > 50000 ORDER BY recovery_rate desc LIMIT <****change limit here****> ``` ### Task 9. CDGR - Cumulative daily growth rate - The following query is trying to calculate the CDGR on `Date` (Cumulative Daily Growth Rate) for France since the day the first case was reported.The first case was reported on Jan 24, 2020. - The CDGR is calculated as: `((last_day_cases/first_day_cases)^1/days_diff)-1)` Where : - `last_day_cases` is the number of confirmed cases on May 10, 2020 - `first_day_cases` is the number of confirmed cases on Jan 24, 2020 - `days_diff` is the number of days between Jan 24 - May 10, 2020 Go to BigQuery and run the following query: ```sql WITH france_cases AS ( SELECT date, SUM(cumulative_confirmed) AS total_cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE country_name="France" AND date IN ('2020-01-24', <****change the date value here'2020-05-10'****>) GROUP BY date ORDER BY date) , summary as ( SELECT total_cases AS first_day_cases, LEAD(total_cases) OVER(ORDER BY date) AS last_day_cases, DATE_DIFF(LEAD(date) OVER(ORDER BY date),date, day) AS days_diff FROM france_cases LIMIT 1 ) select first_day_cases, last_day_cases, days_diff, POW((last_day_cases/first_day_cases),(1/days_diff))-1 as cdgr from summary ``` ### Task 10. Create a Looker Studio report - Create a [Looker Studio](https://datastudio.google.com/) report that plots the following for the United States: - Number of Confirmed Cases - Number of Deaths - Date range : `Date Range` Change the `Date Range` based on the lab instructions. ![Date Range](./images/looker_date.webp#center) ```sql SELECT date, SUM(cumulative_confirmed) AS country_cases, SUM(cumulative_deceased) AS country_deaths FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE date BETWEEN <****change the date value here'2020-03-19'****> AND <****change the date value here'2020-04-22'****> AND country_name ="United States of America" GROUP BY date ``` Mine is `2020-03-19` to `2020-04-22`. It should look like this: ```sql SELECT date, SUM(cumulative_confirmed) AS country_cases, SUM(cumulative_deceased) AS country_deaths FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE date BETWEEN '2020-03-19' AND '2020-04-22' AND country_name ="United States of America" GROUP BY date ``` ## Congratulations! ![Congratulations Badge](https://cdn.qwiklabs.com/GfiFidoAd%2BrgYQRFgZggxgzMWJsGgFxnfA6bOWScimw%3D#center)