From 22e2bc60e2bd2f1ad7882bb3b174ed991a0763b0 Mon Sep 17 00:00:00 2001 From: hiiruki Date: Sun, 10 Sep 2023 15:31:48 +0700 Subject: [PATCH] writeups/google-cloudskillsboost: [GSP341] Create ML Models with BigQuery ML: Challenge Lab --- .../GSP341/images/year.webp | Bin 0 -> 2158 bytes .../google-cloudskillsboost/GSP341/index.md | 234 ++++++++++++++++++ 2 files changed, 234 insertions(+) create mode 100644 content/writeups/google-cloudskillsboost/GSP341/images/year.webp create mode 100644 content/writeups/google-cloudskillsboost/GSP341/index.md diff --git a/content/writeups/google-cloudskillsboost/GSP341/images/year.webp b/content/writeups/google-cloudskillsboost/GSP341/images/year.webp new file mode 100644 index 0000000000000000000000000000000000000000..85be8e002c10c3993ce30ee37e8bd8db8e86c568 GIT binary patch literal 2158 zcmV-!2$AZ!^1(P%lt_i0?D*5&(aX z{~Z3;{YUo~{J*5{>t2NSj{ZIVL({(c4>9%t{z3g~{tx%RSda9-z4`k5+xutxzww^{ z-^%}||GfU|?N#gp(S!E9Gwo&UGkqGs-BwRzZdk+%gTuX^>?b-G7)Sa(fH*89@ zaH&W&PG=geX&nm0J$)WWVi$i>-WS9*TFEgcB*eYB^&MB*qjInh$Jo^lA8#}os}vcM zKhbu_U!W|>umwrC75zXF(?e)z)^b3`7{eH73y-zk5Q{|ra`+JGNS-dn9wo%NM{4zy zc2&ZF-MVK9dF17*$PUmk%@U)+2(}5&neN!@^{6#c=VJj(;V5UZ1l6Aun~{-Zvzf*WDWhcoW8>g;M8m}@BNBs5z(Ox32S63VvRL7rzG4}Nn+^@@FcEW zCscJGv7odo82+>iqpK6X3Yy%LuzcGAosjJa-6A|y&r5MLLeg@V@B^rQHW zIXwFu-L3zsCBGGwFsVI$)}qS;ya_nRC;;Fx&7Ci!$3jtI%J1i29LL{xm5VAgNYH>? 
zoTJ1L8i?uJ{wK|+X&~kUB709DSy-%+Jge8T*&>v-8_7X$hDQZNFlL+zS5JHLHiwn} zOVyF{p|4eC`a^@})c9Z0Y@Dya1`s+}Cvh%1UMC}~lWl$g9VlN|+p3mV2Y){3&CwSA z5pUrZ{t;o(r2YYf8|?J$N$6t!_t4t^fuQw?r`X~O`h6gLM~g)f4z?euHl6$ zz?&X#?ItD|EGcpWNUIx*Xs8Y0KT;V+>uG)fB)CskPmDsD4bPQNeCFOhy?A?;-@ZNk zFGxNR5nuz_Ij*T6ObAJ`l=yme4z-OvQa0k$`_cpms&^6JMp*0}9F}kydZ4a(JOZiE z#+}D=`w5%cby|4MFvI&O03wu+Q`vas#%11D_==uvZ{}^8y`NYMmW6;i*i&ZRO1j#( zaLg~+>nmO;MMyK)T>c#-TAP4i)Y_9~((_M)aVQc?U;-NkHCs{8mZo)@EU@vW!uUHt z00>zvyv+`bUlHG{ex;Cfay%sJ%m$3LTAWqG+D%tXMOSR`nzE5;C9e!m-ZC$PK6#K*6@ zE%J!IirNf~z5G2BC$ka#!{{{XhDCA0wcb7qlY$eI8?e*yU+}>dqxY`8Ot*Oa^oc}c zHteDfW#w=6D;ofC;spD zf;HRpsqGO0;vmzfyto5@B>@l{gX2Kn=+qU*dj6qWUR)zmzgXqT=x~NwU6HNjTo6%X zzT95MYZL|!!*t9(iVtDJGP&(X_vb!@EC@baaG*GL$WI0L!-^nIdqIgBKQ_XaOJn7tc57lL~2kw1q#Ykbv|6X#WJ4 zOs*s8zb*IB0kU#~<5dMBT|U0u56NF$5p#W;<72VSH8bG%Q&T<th#OK5cfTf7L^BBf z^uIorLdTeW>=qy6vk}d`$j_t|+U$ug41EjLOzrRp!Ib$7j)gX3Q_U|gz?WaXWUgxh zQ*-hmb1MXgSbH~H*7Z{)Mz)Wc`NBDxchWQo6sob^g{IB&g`eSn6N-lm|AK_5)?yhiJ?`)yRf&;lfg zd?jUqK}_QBJ-x&>xXpWe>$P}K3D=o*vc4_^K|bBb{MLW1$u!)5<5ADY+02&L$#!-9 k!wvQ@Lxz`?rNP_mVEtpj0XMB1{eT)E0d@Lqft" # image path/url + alt: "" # alt text + caption: "" # display caption under cover + relative: false # when using page bundles set this to true + hidden: true # only hide on current single page +# editPost: +# URL: "https://github.com/hiiruki/hiiruki.dev/blob/main/content/writeups/google-cloudskillsboost/GSP341/index.md" +# Text: "Suggest Changes" # edit text +# appendFilePath: true # to append file path to Edit link +--- + +### GSP341 + +![Lab Banner](https://cdn.qwiklabs.com/GMOHykaqmlTHiqEeQXTySaMXYPHeIvaqa2qHEzw6Occ%3D#center) + +- Time: 1 hour 30 minutes
+- Difficulty: Intermediate
+- Price: 7 Credits
+
+Lab: [GSP341](https://www.cloudskillsboost.google/focuses/14294?parent=catalog)
+Quest: [Create ML Models with BigQuery ML](https://www.cloudskillsboost.google/quests/146)
+
+## Challenge lab scenario
+
+You have started a new role as a junior member of the Data Science department at Jooli Inc. Your team is working on a number of machine learning initiatives related to urban mobility services. You are expected to help with the development and assessment of data sets and machine learning models to help provide insights based on real-world data sets.
+
+You are expected to have the skills and knowledge for these tasks, so don't expect step-by-step guides to be provided.
+
+## Your challenge
+
+One of the projects you are working on needs to provide analysis based on real-world data that will help in the selection of new bicycle models for public bike share systems. Your role in this project is to develop and evaluate machine learning models that can predict average trip durations for bike schemes using the public data from Austin's public bike share scheme to train and evaluate your models.
+
+Two of the senior data scientists in your team have different theories on what factors are important in determining the duration of a bike share trip and you have been asked to prioritise these to start. The first data scientist maintains that the key factors are the start station, the location of the start station, the day of the week and the hour the trip started, while the second data scientist argues that this is an overcomplication and the key factors are simply start station, subscriber type, and the hour the trip started.
+
+You have been asked to develop a machine learning model based on each of these sets of input features. Given that stay-at-home orders were in place for Austin during parts of 2021 as a result of COVID-19, you will be working on data from previous years. You have been instructed to train your models on data from `Training Year` and then evaluate them against data from `Evaluation Year` on the basis of Mean Absolute Error and the square root of Mean Squared Error.
+
+You can access the public data for the Austin bike share scheme in your project by opening [this link to the Austin bike share dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=austin_bikeshare&page=dataset) in the browser tab for your lab.
+
+As a final step you must create and run a query that uses the model that includes subscriber type as a feature, to predict the average trip duration for all trips from the busiest bike sharing station in `Evaluation Year` (based on the number of trips per station in `Evaluation Year`) where the subscriber type is 'Single Trip'.
+
+## Setup
+
+```bash
+gcloud auth list
+
+gcloud config list project
+```
+
+### Task 1. Create a dataset to store your machine learning models
+
+- Create a new dataset in which you can store your machine learning models.
+
+Go to Cloud Shell and run the following command to create the dataset:
+
+```bash
+bq mk austin
+```
+
+### Task 2. Create a forecasting BigQuery machine learning model
+
+- Create the first machine learning model to predict the trip duration for bike trips.
+
+The features of this model must incorporate the starting station name, the hour the trip started, the weekday of the trip, and the address of the start station labeled as `location`. You must use `Training Year` data only to train this model.
+
+Go to BigQuery to make the first model and run the following query:
+
+Replace `<****Training_Year****>` with the year you are using for training.
+
+The year in your lab variable looks like this:
+
+![year](./images/year.webp#center)
+
+```sql
+CREATE OR REPLACE MODEL austin.location_model
+OPTIONS
+  (model_type='linear_reg', labels=['duration_minutes']) AS
+SELECT
+  start_station_name,
+  EXTRACT(HOUR FROM start_time) AS start_hour,
+  EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
+  duration_minutes,
+  address AS location
+FROM
+  `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
+JOIN
+  `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
+ON
+  trips.start_station_name = stations.name
+WHERE
+  EXTRACT(YEAR FROM start_time) = <****Training_Year****>
+  AND duration_minutes > 0
+```
+
+### Task 3. Create the second machine learning model
+
+- Create the second machine learning model to predict the trip duration for bike trips.
+
+The features of this model must incorporate the starting station name, the bike share subscriber type and the start time for the trip. You must also use `Training Year` data only to train this model.
+
+Go to BigQuery to make the second model and run the following query:
+
+Replace `<****Training_Year****>` with the year you are using for training.
+
+```sql
+CREATE OR REPLACE MODEL austin.subscriber_model
+OPTIONS
+  (model_type='linear_reg', labels=['duration_minutes']) AS
+SELECT
+  start_station_name,
+  EXTRACT(HOUR FROM start_time) AS start_hour,
+  subscriber_type,
+  duration_minutes
+FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
+WHERE EXTRACT(YEAR FROM start_time) = <****Training_Year****>
+```
+
+### Task 4. Evaluate the two machine learning models
+
+- Evaluate each of the machine learning models against `Evaluation Year` data only using separate queries.
+
+Your queries must report both the Mean Absolute Error and the Root Mean Square Error.
+
+Go to BigQuery and run the following query:
+
+Replace `<****Evaluation_Year****>` with the year you are using for evaluating.
+
+```sql
+SELECT
+  SQRT(mean_squared_error) AS rmse,
+  mean_absolute_error
+FROM
+  ML.EVALUATE(MODEL austin.location_model, (
+  SELECT
+    start_station_name,
+    EXTRACT(HOUR FROM start_time) AS start_hour,
+    EXTRACT(DAYOFWEEK FROM start_time) AS day_of_week,
+    duration_minutes,
+    address AS location
+  FROM
+    `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
+  JOIN
+    `bigquery-public-data.austin_bikeshare.bikeshare_stations` AS stations
+  ON
+    trips.start_station_name = stations.name
+  WHERE EXTRACT(YEAR FROM start_time) = <****Evaluation_Year****> )
+)
+```
+
+```sql
+SELECT
+  SQRT(mean_squared_error) AS rmse,
+  mean_absolute_error
+FROM
+  ML.EVALUATE(MODEL austin.subscriber_model, (
+  SELECT
+    start_station_name,
+    EXTRACT(HOUR FROM start_time) AS start_hour,
+    subscriber_type,
+    duration_minutes
+  FROM
+    `bigquery-public-data.austin_bikeshare.bikeshare_trips` AS trips
+  WHERE
+    EXTRACT(YEAR FROM start_time) = <****Evaluation_Year****>)
+)
+```
+
+### Task 5. Use the subscriber type machine learning model to predict average trip durations
+
+- When both models have been created and evaluated, use the second model, which uses `subscriber_type` as a feature, to predict average trip length for trips from the busiest bike sharing station in `Evaluation Year` where the subscriber type is `Single Trip`.
+
+Go to BigQuery and run the following query:
+
+Replace `<****Evaluation_Year****>` with the year you are using for evaluating.
+
+```sql
+SELECT
+  start_station_name,
+  COUNT(*) AS trips
+FROM
+  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
+WHERE
+  EXTRACT(YEAR FROM start_time) = <****Evaluation_Year****>
+GROUP BY
+  start_station_name
+ORDER BY
+  trips DESC
+```
+
+The first row of that result is the busiest station. Use its name in the prediction query below; in this write-up it was `21st & Speedway @PCL`, which may differ for your evaluation year.
+
+```sql
+SELECT AVG(predicted_duration_minutes) AS average_predicted_trip_length
+FROM ML.PREDICT(MODEL austin.subscriber_model, (
+SELECT
+  start_station_name,
+  EXTRACT(HOUR FROM start_time) AS start_hour,
+  subscriber_type,
+  duration_minutes
+FROM
+  `bigquery-public-data.austin_bikeshare.bikeshare_trips`
+WHERE
+  EXTRACT(YEAR FROM start_time) = <****Evaluation_Year****>
+  AND subscriber_type = 'Single Trip'
+  AND start_station_name = '21st & Speedway @PCL'))
+```
+
+## Congratulations!
+
+![Congratulations Badge](https://cdn.qwiklabs.com/XHgD9wRAAlXktQmoNrUOvbg38ZBrazddtSoYHS55d8o%3D#center)
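+
+As a variation on Task 5, the busiest station can be looked up inside the prediction query instead of being copied in by hand. The following is only a sketch, not part of the lab instructions: it assumes the same public tables and the `austin.subscriber_model` model from Task 3, reuses the `<****Evaluation_Year****>` placeholder in both places, and has not been checked against the lab's grader.
+
+```sql
+SELECT
+  AVG(predicted_duration_minutes) AS average_predicted_trip_length
+FROM
+  ML.PREDICT(MODEL austin.subscriber_model, (
+  SELECT
+    start_station_name,
+    EXTRACT(HOUR FROM start_time) AS start_hour,
+    subscriber_type,
+    duration_minutes
+  FROM
+    `bigquery-public-data.austin_bikeshare.bikeshare_trips`
+  WHERE
+    EXTRACT(YEAR FROM start_time) = <****Evaluation_Year****>
+    AND subscriber_type = 'Single Trip'
+    -- Assumption: the scalar subquery returns the station with the most trips in the evaluation year
+    AND start_station_name = (
+      SELECT start_station_name
+      FROM `bigquery-public-data.austin_bikeshare.bikeshare_trips`
+      WHERE EXTRACT(YEAR FROM start_time) = <****Evaluation_Year****>
+      GROUP BY start_station_name
+      ORDER BY COUNT(*) DESC
+      LIMIT 1)))
+```
+
+This avoids hard-coding a station name that depends on the evaluation year, at the cost of scanning the trips table twice.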