UPSIDER Tech Blog

Real-Time Predictions with BigQuery ML (In About 15 Minutes)

*日本語版は下記にあります。

Introduction

Hi, I’m Pranav, a Machine Learning Engineer at UPSIDER. In this blog, I’ll show you how to build a simple machine learning workflow directly in BigQuery to predict which fruits might go out of stock soon. The goal is to demonstrate how real-time data can be transformed into actionable predictions with SQL only, without the need for complex ML frameworks or data exports.

Even if you haven’t seen my earlier article Building a Real-Time Data Pipeline with GCP (In About 15 Minutes), you can follow along with this blog. To summarize quickly: we have a table in BigQuery called fruit_inventory_events where each row represents a fruit inventory event captured from Firestore.

Quick Look: Fruit Inventory Table Structure

Column Type Description
fruit_id STRING Unique ID for each fruit
name STRING Fruit name (Apple, Banana, etc.)
color STRING Color of the fruit (Red, Green, Yellow, etc.)
taste STRING Taste profile (Sweet, Sour, Bitter, etc.)
price NUMERIC Price of the fruit
origin STRING Country or region of origin
in_stock BOOLEAN Whether the fruit is in stock
event_type STRING Type of event (CREATE, UPDATE, DELETE)
timestamp TIMESTAMP When the event occurred
processed_at TIMESTAMP When the event was recorded in BigQuery

1. Introduction: From Pipeline to Predictions

With data streaming from Firestore into BigQuery, the natural next step is to extract insights. We’ll use BigQuery ML to predict trends in fruit inventory, such as which fruits might go out of stock soon. BigQuery ML allows us to do machine learning with SQL directly in BigQuery – no separate ML servers required.


2. Why BigQuery ML for Real-Time Data?

BigQuery ML is perfect for this scenario because:

  • No data movement – your pipeline already feeds BigQuery; ML happens in-place.
  • Speed and simplicity – train and predict using SQL.
  • Scalability – BigQuery handles millions of records without extra infra.
  • Real-time readiness – as new data arrives, you can retrain or predict anytime.

Learn more about BigQuery ML


3. Defining the ML Problem

We want to train a model to predict current inventory risk, with the ultimate goal of forecasting future stock-out events. This is a binary classification problem: given features like color, taste, origin, and price, predict if in_stock will become FALSE. We aggregate inventory events per day to build a training dataset:

SELECT
   DATE(timestamp) AS date,
   COUNTIF(event_type = 'UPDATE' AND NOT in_stock)
   + COUNTIF(event_type = 'DELETE') AS daily_sold
FROM `my_dataset.fruit_inventory_events`
GROUP BY date
ORDER BY date;

4. BigQuery ML Architecture Overview

Explanation:

  • Data flows from Firestore to BigQuery via the pipeline.
  • BigQuery ML trains a model on the fruit inventory table.
  • Predictions are stored in BigQuery.
  • In the future blog, we’ll explore exporting models to Vertex AI.

4.1. SQL Query Flowchart

The following flowchart illustrates the steps and relationships between the SQL queries used in the following sections from 5 to 8 for machine learning workflow:


5. Data Preparation

Handling Missing Values

SELECT
   fruit_id,
   name,
   IFNULL(color, "Unknown") AS color,
   IFNULL(taste, "Unknown") AS taste,
   IFNULL(origin, "Unknown") AS origin,
   IFNULL(price, 0.0) AS price,
   in_stock
FROM `my_dataset.fruit_inventory_events`;

Optional: Encoding Categorical Fields

BigQuery ML automatically one-hot encodes string features like color, taste, and origin.

Manual one-hot example (optional):

SELECT
   fruit_id,
   name,
   price,
   origin,
   color,
   IF(taste = "Sweet", 1, 0) AS taste_Sweet,
   IF(taste = "Sour", 1, 0) AS taste_Sour,
   IF(taste = "Bitter", 1, 0) AS taste_Bitter,
   in_stock
FROM `my_dataset.fruit_inventory_events`;

6. Training a Logistic Regression Model

Run this SQL in the BigQuery Console. After completion, a model fruit_stock_model will appear in your dataset.

CREATE OR REPLACE MODEL `my_dataset.fruit_stock_model` OPTIONS (
   model_type = 'logistic_reg',
   input_label_cols = ['in_stock']
) AS
SELECT
   IFNULL(color, "Unknown") AS color,
   IFNULL(taste, "Unknown") AS taste,
   IFNULL(origin, "Unknown") AS origin,
   IFNULL(price, 0.0) AS price,
   in_stock
FROM `my_dataset.fruit_inventory_events`;

7. Evaluating the Model

SELECT *
FROM ML.EVALUATE(MODEL `my_dataset.fruit_stock_model`, (
   SELECT
      IFNULL(color, "Unknown") AS color,
      IFNULL(taste, "Unknown") AS taste,
      IFNULL(origin, "Unknown") AS origin,
      IFNULL(price, 0.0) AS price,
      in_stock
   FROM `my_dataset.fruit_inventory_events`
));

Metrics include:

  • Accuracy
  • Precision
  • Recall
  • F1 Score
  • Log Loss
  • ROC AUC

8. Using the Model for Predictions

Batch Prediction

SELECT
   fruit_id,
   name,
   predicted_in_stock AS predicted_in_stock,
   predicted_in_stock_probs[OFFSET(0)] AS probability_out_stock
FROM ML.PREDICT(MODEL `my_dataset.fruit_stock_model`, (
   SELECT
      fruit_id,
      name,
      IFNULL(color, "Unknown") AS color,
      IFNULL(taste, "Unknown") AS taste,
      IFNULL(origin, "Unknown") AS origin,
      IFNULL(price, 0.0) AS price,
      in_stock
   FROM `my_dataset.fruit_inventory_events`
   WHERE in_stock = TRUE
));

Single Prediction

SELECT *
FROM ML.PREDICT(MODEL `my_dataset.fruit_stock_model`, (
   SELECT
      "mangosteen-999" AS fruit_id,
      "Mangosteen" AS name,
      "Purple" AS color,
      "Sweet" AS taste,
      "Thailand" AS origin,
      5.0 AS price,
      TRUE AS in_stock
));

9. Next Steps

  • Schedule continuous retraining as new data arrives.
  • Integrate predictions into dashboards or alerts.
  • Explore Vertex AI export and online inference.

10. Final Thoughts

With a few SQL statements, we went from streaming data to actionable ML predictions entirely in BigQuery. The combination of real-time pipelines and in-database ML allows rapid iteration and actionable insights without moving data or managing infrastructure.


References


(日本語版)


BigQuery MLによるリアルタイム予測 (約15分で)

はじめに

こんにちは、私はUPSIDERのMachine Learning EngineerであるPranavです。このブログでは、どの果物が間もなく在庫切れになるかを予測するために、BigQuery内で直接シンプルな機械学習ワークフローを構築する方法を紹介します。この目的は、複雑なMLフレームワークやデータのエクスポートを必要とせず、SQLのみを使用してリアルタイムデータを実用的な予測に変換できることを示すことです。

以前の記事「Building a Real-Time Data Pipeline with GCP (In About 15 Minutes)」をご覧になっていない方でも、このブログの手順に従うことができます。手短にまとめると、BigQueryにはfruit_inventory_eventsというテーブルがあり、各行がFirestoreから取得された果物の在庫イベントを表しています。

クイックルック: Fruit Inventory Table Structure

Column Type Description
fruit_id STRING Unique ID for each fruit
name STRING Fruit name (Apple, Banana, etc.)
color STRING Color of the fruit (Red, Green, Yellow, etc.)
taste STRING Taste profile (Sweet, Sour, Bitter, etc.)
price NUMERIC Price of the fruit
origin STRING Country or region of origin
in_stock BOOLEAN Whether the fruit is in stock
event_type STRING Type of event (CREATE, UPDATE, DELETE)
timestamp TIMESTAMP When the event occurred
processed_at TIMESTAMP When the event was recorded in BigQuery

1. Introduction: パイプラインから予測へ

FirestoreからBigQueryへデータがストリーミングされるようになったら、次の自然なステップはインサイトを抽出することです。BigQuery MLを使用して、どの果物が間もなく在庫切れになるかなど、果物在庫の傾向を予測します。BigQuery MLを使用すると、別のMLサーバーを必要とせず、SQLをBigQuery内で直接使って機械学習を実行できます。


2. なぜリアルタイムデータにBigQuery MLを使うの?

BigQuery MLがこのシナリオに最適である理由は次のとおりです。

  • No data movement – パイプラインはすでにBigQueryにデータを供給しており、MLはインプレース(その場)で実行されます。

  • Speed and simplicity – SQLを使用してトレーニングと予測を行います。

  • Scalability – BigQueryは追加のインフラなしで何百万ものレコードを処理します。

  • Real-time readiness – 新しいデータが到着するたびに、いつでも再トレーニングまたは予測を実行できます。

Learn more about BigQuery ML


3. ML問題の定義

果物がいつ在庫切れになるかを予測したいと考えています。これはbinary classification problemです。色、味、原産地、価格などの特徴が与えられたとき、in_stockがFALSEになるかどうかを予測します。トレーニングデータセットを構築するために、在庫イベントを日ごとに集計します。

SELECT
   DATE(timestamp) AS date,
   COUNTIF(event_type = 'UPDATE' AND NOT in_stock)
   + COUNTIF(event_type = 'DELETE') AS daily_sold
FROM `my_dataset.fruit_inventory_events`
GROUP BY date
ORDER BY date;

4. BigQuery MLアーキテクチャ概要

解説:

  • データはパイプラインを介してFirestoreからBigQueryに流れます。

  • BigQuery MLは果物在庫テーブルでモデルをトレーニングします。

  • 予測はBigQueryに保存されます。

  • 今後のブログで、モデルをVertex AIにエクスポートすることを検討します。


4.1. SQLクエリのフローチャート

次のフローチャートは、機械学習ワークフローのためにセクション5から8で使用されるSQLクエリ間のステップと関係を示しています。


5. データ準備

Handling Missing Values (欠損値の処理)

SELECT
   fruit_id,
   name,
   IFNULL(color, "Unknown") AS color,
   IFNULL(taste, "Unknown") AS taste,
   IFNULL(origin, "Unknown") AS origin,
   IFNULL(price, 0.0) AS price,
   in_stock
FROM `my_dataset.fruit_inventory_events`;

Optional: Encoding Categorical Fields (オプション: カテゴリカルフィールドのエンコーディング)

BigQuery MLは、colortasteoriginなどの文字列特徴量を自動的にワンホットエンコーディングします。

Manual one-hot example (optional):

SELECT
   fruit_id,
   name,
   price,
   origin,
   color,
   IF(taste = "Sweet", 1, 0) AS taste_Sweet,
   IF(taste = "Sour", 1, 0) AS taste_Sour,
   IF(taste = "Bitter", 1, 0) AS taste_Bitter,
   in_stock
FROM `my_dataset.fruit_inventory_events`;

6. ロジスティック回帰モデルのトレーニング

このSQLをBigQuery Consoleで実行します。完了後、データセットにfruit_stock_modelというモデルが表示されます。

CREATE OR REPLACE MODEL `my_dataset.fruit_stock_model` OPTIONS (
   model_type = 'logistic_reg',
   input_label_cols = ['in_stock']
) AS
SELECT
   IFNULL(color, "Unknown") AS color,
   IFNULL(taste, "Unknown") AS taste,
   IFNULL(origin, "Unknown") AS origin,
   IFNULL(price, 0.0) AS price,
   in_stock
FROM `my_dataset.fruit_inventory_events`;

7. モデルの評価

SELECT *
FROM ML.EVALUATE(MODEL `my_dataset.fruit_stock_model`, (
   SELECT
      IFNULL(color, "Unknown") AS color,
      IFNULL(taste, "Unknown") AS taste,
      IFNULL(origin, "Unknown") AS origin,
      IFNULL(price, 0.0) AS price,
      in_stock
   FROM `my_dataset.fruit_inventory_events`
));

メトリクスには以下が含まれます:

  • Accuracy

  • Precision

  • Recall

  • F1 Score

  • Log Loss

  • ROC AUC


8. 予測のためのモデルの使用

Batch Prediction (バッチ予測)

SELECT
   fruit_id,
   name,
   predicted_in_stock AS predicted_in_stock,
   predicted_in_stock_probs[OFFSET(0)] AS probability_out_stock
FROM ML.PREDICT(MODEL `my_dataset.fruit_stock_model`, (
   SELECT
      fruit_id,
      name,
      IFNULL(color, "Unknown") AS color,
      IFNULL(taste, "Unknown") AS taste,
      IFNULL(origin, "Unknown") AS origin,
      IFNULL(price, 0.0) AS price,
      in_stock
   FROM `my_dataset.fruit_inventory_events`
   WHERE in_stock = TRUE
));

Single Prediction (単一予測)

SELECT *
FROM ML.PREDICT(MODEL `my_dataset.fruit_stock_model`, (
   SELECT
      "mangosteen-999" AS fruit_id,
      "Mangosteen" AS name,
      "Purple" AS color,
      "Sweet" AS taste,
      "Thailand" AS origin,
      5.0 AS price,
      TRUE AS in_stock
));

9. 次のステップ

  • 新しいデータが到着するたびに、継続的な再トレーニングをスケジュールします。

  • 予測をダッシュボードやアラートに統合します。

  • Vertex AIへのエクスポートとオンライン推論を探求します。


10. 最後に

いくつかのSQL文だけで、ストリーミングデータから実用的なML予測までをBigQuery内だけで実現しました。リアルタイムパイプラインとデータベース内MLの組み合わせにより、データの移動やインフラ管理なしで、迅速なイテレーションと実用的なインサイトが可能になります。


参考文献

We Are Hiring

herp.careers

herp.careers

UPSIDER Engineering Deckはこちら📣

speakerdeck.com