
*日本語版は下記にあります。
Introduction
Hi, I’m Pranav, a Machine Learning Engineer at UPSIDER. In this blog, I’ll show you how to build a simple machine learning workflow directly in BigQuery to predict which fruits might go out of stock soon. The goal is to demonstrate how real-time data can be transformed into actionable predictions with SQL only, without the need for complex ML frameworks or data exports.
Even if you haven’t seen my earlier article Building a Real-Time Data Pipeline with GCP (In About 15 Minutes), you can follow along with this blog. To summarize quickly: we have a table in BigQuery called fruit_inventory_events where each row represents a fruit inventory event captured from Firestore.
Quick Look: Fruit Inventory Table Structure
| Column | Type | Description |
|---|---|---|
fruit_id |
STRING | Unique ID for each fruit |
name |
STRING | Fruit name (Apple, Banana, etc.) |
color |
STRING | Color of the fruit (Red, Green, Yellow, etc.) |
taste |
STRING | Taste profile (Sweet, Sour, Bitter, etc.) |
price |
NUMERIC | Price of the fruit |
origin |
STRING | Country or region of origin |
in_stock |
BOOLEAN | Whether the fruit is in stock |
event_type |
STRING | Type of event (CREATE, UPDATE, DELETE) |
timestamp |
TIMESTAMP | When the event occurred |
processed_at |
TIMESTAMP | When the event was recorded in BigQuery |
1. Introduction: From Pipeline to Predictions
With data streaming from Firestore into BigQuery, the natural next step is to extract insights. We’ll use BigQuery ML to predict trends in fruit inventory, such as which fruits might go out of stock soon. BigQuery ML allows us to do machine learning with SQL directly in BigQuery – no separate ML servers required.
2. Why BigQuery ML for Real-Time Data?
BigQuery ML is perfect for this scenario because:
- No data movement – your pipeline already feeds BigQuery; ML happens in-place.
- Speed and simplicity – train and predict using SQL.
- Scalability – BigQuery handles millions of records without extra infra.
- Real-time readiness – as new data arrives, you can retrain or predict anytime.
3. Defining the ML Problem
We want to train a model to predict current inventory risk, with the ultimate goal of forecasting future stock-out events. This is a binary classification problem: given features like color, taste, origin, and price, predict if in_stock will become FALSE. We aggregate inventory events per day to build a training dataset:
SELECT DATE(timestamp) AS date, COUNTIF(event_type = 'UPDATE' AND NOT in_stock) + COUNTIF(event_type = 'DELETE') AS daily_sold FROM `my_dataset.fruit_inventory_events` GROUP BY date ORDER BY date;
4. BigQuery ML Architecture Overview
Explanation:
- Data flows from Firestore to BigQuery via the pipeline.
- BigQuery ML trains a model on the fruit inventory table.
- Predictions are stored in BigQuery.
- In the future blog, we’ll explore exporting models to Vertex AI.

4.1. SQL Query Flowchart
The following flowchart illustrates the steps and relationships between the SQL queries used in the following sections from 5 to 8 for machine learning workflow:

5. Data Preparation
Handling Missing Values
SELECT fruit_id, name, IFNULL(color, "Unknown") AS color, IFNULL(taste, "Unknown") AS taste, IFNULL(origin, "Unknown") AS origin, IFNULL(price, 0.0) AS price, in_stock FROM `my_dataset.fruit_inventory_events`;
Optional: Encoding Categorical Fields
BigQuery ML automatically one-hot encodes string features like color, taste, and origin.
Manual one-hot example (optional):
SELECT fruit_id, name, price, origin, color, IF(taste = "Sweet", 1, 0) AS taste_Sweet, IF(taste = "Sour", 1, 0) AS taste_Sour, IF(taste = "Bitter", 1, 0) AS taste_Bitter, in_stock FROM `my_dataset.fruit_inventory_events`;
6. Training a Logistic Regression Model
Run this SQL in the BigQuery Console. After completion, a model fruit_stock_model will appear in your dataset.
CREATE OR REPLACE MODEL `my_dataset.fruit_stock_model` OPTIONS ( model_type = 'logistic_reg', input_label_cols = ['in_stock'] ) AS SELECT IFNULL(color, "Unknown") AS color, IFNULL(taste, "Unknown") AS taste, IFNULL(origin, "Unknown") AS origin, IFNULL(price, 0.0) AS price, in_stock FROM `my_dataset.fruit_inventory_events`;
7. Evaluating the Model
SELECT * FROM ML.EVALUATE(MODEL `my_dataset.fruit_stock_model`, ( SELECT IFNULL(color, "Unknown") AS color, IFNULL(taste, "Unknown") AS taste, IFNULL(origin, "Unknown") AS origin, IFNULL(price, 0.0) AS price, in_stock FROM `my_dataset.fruit_inventory_events` ));
Metrics include:
- Accuracy
- Precision
- Recall
- F1 Score
- Log Loss
- ROC AUC
8. Using the Model for Predictions
Batch Prediction
SELECT fruit_id, name, predicted_in_stock AS predicted_in_stock, predicted_in_stock_probs[OFFSET(0)] AS probability_out_stock FROM ML.PREDICT(MODEL `my_dataset.fruit_stock_model`, ( SELECT fruit_id, name, IFNULL(color, "Unknown") AS color, IFNULL(taste, "Unknown") AS taste, IFNULL(origin, "Unknown") AS origin, IFNULL(price, 0.0) AS price, in_stock FROM `my_dataset.fruit_inventory_events` WHERE in_stock = TRUE ));
Single Prediction
SELECT * FROM ML.PREDICT(MODEL `my_dataset.fruit_stock_model`, ( SELECT "mangosteen-999" AS fruit_id, "Mangosteen" AS name, "Purple" AS color, "Sweet" AS taste, "Thailand" AS origin, 5.0 AS price, TRUE AS in_stock ));
9. Next Steps
- Schedule continuous retraining as new data arrives.
- Integrate predictions into dashboards or alerts.
- Explore Vertex AI export and online inference.
10. Final Thoughts
With a few SQL statements, we went from streaming data to actionable ML predictions entirely in BigQuery. The combination of real-time pipelines and in-database ML allows rapid iteration and actionable insights without moving data or managing infrastructure.
References
- Building a Real-Time Data Pipeline with GCP (In About 15 Minutes) – UPSIDER Techblog
previous blog on creating a Firestore → BigQuery real-time pipeline. - BigQuery ML Documentation – Google Cloud
Official guide on BigQuery ML, model types, and SQL-based machine learning. - Time Series Forecasting with BigQuery ML
Reference for creating and using ARIMA_PLUS models for forecasting in BigQuery.
(日本語版)
BigQuery MLによるリアルタイム予測 (約15分で)
はじめに
こんにちは、私はUPSIDERのMachine Learning EngineerであるPranavです。このブログでは、どの果物が間もなく在庫切れになるかを予測するために、BigQuery内で直接シンプルな機械学習ワークフローを構築する方法を紹介します。この目的は、複雑なMLフレームワークやデータのエクスポートを必要とせず、SQLのみを使用してリアルタイムデータを実用的な予測に変換できることを示すことです。
以前の記事「Building a Real-Time Data Pipeline with GCP (In About 15 Minutes)」をご覧になっていない方でも、このブログの手順に従うことができます。手短にまとめると、BigQueryにはfruit_inventory_eventsというテーブルがあり、各行がFirestoreから取得された果物の在庫イベントを表しています。
クイックルック: Fruit Inventory Table Structure
| Column | Type | Description |
|---|---|---|
fruit_id |
STRING | Unique ID for each fruit |
name |
STRING | Fruit name (Apple, Banana, etc.) |
color |
STRING | Color of the fruit (Red, Green, Yellow, etc.) |
taste |
STRING | Taste profile (Sweet, Sour, Bitter, etc.) |
price |
NUMERIC | Price of the fruit |
origin |
STRING | Country or region of origin |
in_stock |
BOOLEAN | Whether the fruit is in stock |
event_type |
STRING | Type of event (CREATE, UPDATE, DELETE) |
timestamp |
TIMESTAMP | When the event occurred |
processed_at |
TIMESTAMP | When the event was recorded in BigQuery |
1. Introduction: パイプラインから予測へ
FirestoreからBigQueryへデータがストリーミングされるようになったら、次の自然なステップはインサイトを抽出することです。BigQuery MLを使用して、どの果物が間もなく在庫切れになるかなど、果物在庫の傾向を予測します。BigQuery MLを使用すると、別のMLサーバーを必要とせず、SQLをBigQuery内で直接使って機械学習を実行できます。
2. なぜリアルタイムデータにBigQuery MLを使うの?
BigQuery MLがこのシナリオに最適である理由は次のとおりです。
No data movement – パイプラインはすでにBigQueryにデータを供給しており、MLはインプレース(その場)で実行されます。
Speed and simplicity – SQLを使用してトレーニングと予測を行います。
Scalability – BigQueryは追加のインフラなしで何百万ものレコードを処理します。
Real-time readiness – 新しいデータが到着するたびに、いつでも再トレーニングまたは予測を実行できます。
3. ML問題の定義
果物がいつ在庫切れになるかを予測したいと考えています。これはbinary classification problemです。色、味、原産地、価格などの特徴が与えられたとき、in_stockがFALSEになるかどうかを予測します。トレーニングデータセットを構築するために、在庫イベントを日ごとに集計します。
SELECT DATE(timestamp) AS date, COUNTIF(event_type = 'UPDATE' AND NOT in_stock) + COUNTIF(event_type = 'DELETE') AS daily_sold FROM `my_dataset.fruit_inventory_events` GROUP BY date ORDER BY date;
4. BigQuery MLアーキテクチャ概要
解説:
データはパイプラインを介してFirestoreからBigQueryに流れます。
BigQuery MLは果物在庫テーブルでモデルをトレーニングします。
予測はBigQueryに保存されます。
今後のブログで、モデルをVertex AIにエクスポートすることを検討します。

4.1. SQLクエリのフローチャート
次のフローチャートは、機械学習ワークフローのためにセクション5から8で使用されるSQLクエリ間のステップと関係を示しています。

5. データ準備
Handling Missing Values (欠損値の処理)
SELECT fruit_id, name, IFNULL(color, "Unknown") AS color, IFNULL(taste, "Unknown") AS taste, IFNULL(origin, "Unknown") AS origin, IFNULL(price, 0.0) AS price, in_stock FROM `my_dataset.fruit_inventory_events`;
Optional: Encoding Categorical Fields (オプション: カテゴリカルフィールドのエンコーディング)
BigQuery MLは、color、taste、originなどの文字列特徴量を自動的にワンホットエンコーディングします。
Manual one-hot example (optional):
SELECT fruit_id, name, price, origin, color, IF(taste = "Sweet", 1, 0) AS taste_Sweet, IF(taste = "Sour", 1, 0) AS taste_Sour, IF(taste = "Bitter", 1, 0) AS taste_Bitter, in_stock FROM `my_dataset.fruit_inventory_events`;
6. ロジスティック回帰モデルのトレーニング
このSQLをBigQuery Consoleで実行します。完了後、データセットにfruit_stock_modelというモデルが表示されます。
CREATE OR REPLACE MODEL `my_dataset.fruit_stock_model` OPTIONS ( model_type = 'logistic_reg', input_label_cols = ['in_stock'] ) AS SELECT IFNULL(color, "Unknown") AS color, IFNULL(taste, "Unknown") AS taste, IFNULL(origin, "Unknown") AS origin, IFNULL(price, 0.0) AS price, in_stock FROM `my_dataset.fruit_inventory_events`;
7. モデルの評価
SELECT * FROM ML.EVALUATE(MODEL `my_dataset.fruit_stock_model`, ( SELECT IFNULL(color, "Unknown") AS color, IFNULL(taste, "Unknown") AS taste, IFNULL(origin, "Unknown") AS origin, IFNULL(price, 0.0) AS price, in_stock FROM `my_dataset.fruit_inventory_events` ));
メトリクスには以下が含まれます:
Accuracy
Precision
Recall
F1 Score
Log Loss
ROC AUC
8. 予測のためのモデルの使用
Batch Prediction (バッチ予測)
SELECT fruit_id, name, predicted_in_stock AS predicted_in_stock, predicted_in_stock_probs[OFFSET(0)] AS probability_out_stock FROM ML.PREDICT(MODEL `my_dataset.fruit_stock_model`, ( SELECT fruit_id, name, IFNULL(color, "Unknown") AS color, IFNULL(taste, "Unknown") AS taste, IFNULL(origin, "Unknown") AS origin, IFNULL(price, 0.0) AS price, in_stock FROM `my_dataset.fruit_inventory_events` WHERE in_stock = TRUE ));
Single Prediction (単一予測)
SELECT * FROM ML.PREDICT(MODEL `my_dataset.fruit_stock_model`, ( SELECT "mangosteen-999" AS fruit_id, "Mangosteen" AS name, "Purple" AS color, "Sweet" AS taste, "Thailand" AS origin, 5.0 AS price, TRUE AS in_stock ));
9. 次のステップ
新しいデータが到着するたびに、継続的な再トレーニングをスケジュールします。
予測をダッシュボードやアラートに統合します。
Vertex AIへのエクスポートとオンライン推論を探求します。
10. 最後に
いくつかのSQL文だけで、ストリーミングデータから実用的なML予測までをBigQuery内だけで実現しました。リアルタイムパイプラインとデータベース内MLの組み合わせにより、データの移動やインフラ管理なしで、迅速なイテレーションと実用的なインサイトが可能になります。
参考文献
Building a Real-Time Data Pipeline with GCP (In About 15 Minutes) – UPSIDER Techblog
Firestore → BigQueryのリアルタイムパイプライン作成に関する以前のブログ。BigQuery ML Documentation – Google Cloud
BigQuery ML、モデルタイプ、SQLベースの機械学習に関する公式ガイド。Time Series Forecasting with BigQuery ML
BigQueryでの予測のためのARIMA_PLUSモデルの作成と使用に関するリファレンス。
We Are Hiring
UPSIDER Engineering Deckはこちら📣