Deploying Serverless Machine Learning Models with GPUs on Google Cloud Run
Deploying machine learning models at scale can be challenging. However, Google Cloud provides a suite of powerful tools to help you deploy your model in a fully managed, serverless environment, complete with GPU support! In this guide, we’ll go step by step through installing the dependencies, requesting the necessary quotas, and finally deploying your model. Then, we’ll see how to call your newly deployed model from both the command line and JavaScript. Let’s dive right in!
1. Overview: Why Serverless?
Serverless means you focus on your code while the cloud provider automatically scales your application up or down depending on demand. With Google Cloud Run, you can run stateless containers in a fully managed environment. The addition of GPU support (in preview at the time of writing) enables you to handle heavier machine learning workloads.
Key advantages include:
- Automatic scaling.
- Pay only for the time your container is actually handling requests.
- Simplified deployment using container images.
- Support for GPU workloads (currently in preview).
2. Project Setup and Quotas
1. Enable the required APIs:
- Make sure the Cloud Run, Cloud Build, and Artifact Registry APIs are enabled for your project (see the command sketch after this list).
2. Set your Google Cloud project (requires gcloud installed):
- If you haven’t already, log in with:
gcloud auth login
- Then choose the project you want to work with:
gcloud config set project <YOUR-PROJECT-ID>
- Replace <YOUR-PROJECT-ID> with the actual ID you see in your Google Cloud Console.
3. GPU quota:
- If you want GPU support, you must request GPU quota for the specific region you plan to deploy in. For example, to deploy in us-central1, you need GPU quota in that region: open the Quotas page, look for “GPUs (all regions)” or “GPU in us-central1,” and request an increase if needed.
- For more information, the official documentation is useful: https://cloud.google.com/run/docs/configuring/services/gpu
4. Service account permissions:
- By default, your Cloud Build service account should have permission to deploy to Cloud Run. If it doesn’t, add the Cloud Run Admin role to the service account used by Cloud Build.
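If you prefer the command line for steps 1 and 4, here is a minimal sketch (the service account address shown is the default Cloud Build account; <YOUR-PROJECT-ID> and <PROJECT-NUMBER> are placeholders for your own values):
# Enable the APIs used in this guide
gcloud services enable run.googleapis.com cloudbuild.googleapis.com artifactregistry.googleapis.com
# Grant the Cloud Build service account the Cloud Run Admin role (only needed if deployments fail with a permission error)
gcloud projects add-iam-policy-binding <YOUR-PROJECT-ID> \
  --member="serviceAccount:<PROJECT-NUMBER>@cloudbuild.gserviceaccount.com" \
  --role="roles/run.admin"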
3. Files in Your Project
We’ll use a Dockerfile to define our container environment and a cloudbuild.yaml file to instruct Cloud Build on how to build and deploy it. Additionally, we have our application code (main.py) and a requirements.txt specifying dependencies.
Let’s assume you have the following structure in your project folder:
├── cloudbuild.yaml
├── Dockerfile
├── main.py
├── requirements.txt
└── ambulance.jpg (example image for testing)
Below, we’ll briefly explain the purpose of each file before showing its contents.
3.1. cloudbuild.yaml
Purpose: Tells Cloud Build which steps to execute. Here, we run two Docker steps (one to build the image and one to push it to Artifact Registry), and then run the gcloud beta run deploy command directly from within Cloud Build.
steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', 'us-central1-docker.pkg.dev/$PROJECT_ID/<IMAGENAME>', '.']
    # Environment variables for dynamic tagging
    env:
      - 'PROJECT_ID=$PROJECT_ID'

  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', 'us-central1-docker.pkg.dev/$PROJECT_ID/<IMAGENAME>']

  # Deploy to Cloud Run with GPU (nvidia-l4) configuration
  - name: 'gcr.io/cloud-builders/gcloud'
    args:
      [
        'beta',
        'run',
        'deploy',
        'nvidia-l4-model',
        '--image',
        'us-central1-docker.pkg.dev/$PROJECT_ID/<IMAGENAME>',
        '--concurrency',
        '4',
        '--cpu',
        '8',
        '--gpu',
        '1',
        '--gpu-type',
        'nvidia-l4',
        '--max-instances',
        '1',
        '--memory',
        '16Gi',
        '--no-allow-unauthenticated',
        '--no-cpu-throttling',
        '--timeout=600',
        '--region',
        'us-central1',
      ]

images:
  - 'us-central1-docker.pkg.dev/$PROJECT_ID/<IMAGENAME>'

options:
  machineType: 'E2_HIGHCPU_32'
In this cloudbuild.yaml:
- Build the Docker image and tag it (us-central1-docker.pkg.dev/$PROJECT_ID/<IMAGENAME>).
- Push the image to your Artifact Registry (see the repository sketch after this list).
- Deploy the container to Cloud Run using GPU parameters:
- --gpu 1 --gpu-type nvidia-l4 ensures you’re requesting an L4 GPU.
- --cpu 8, --memory 16Gi, and --concurrency 4 customize resources and concurrency.
- --max-instances 1 limits the service to a single instance (common when keeping GPU usage cost-managed).
- --no-allow-unauthenticated requires an ID token or IAM-based auth to invoke the service.
- images: lists the Docker image we’re building.
- options: sets the Cloud Build machine type (e.g., E2_HIGHCPU_32) for faster builds.
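Pushing to Artifact Registry assumes a Docker repository already exists in us-central1. If yours doesn’t, here is a minimal sketch to create one (<REPOSITORY> is a placeholder; note that full Artifact Registry image paths normally take the form us-central1-docker.pkg.dev/$PROJECT_ID/<REPOSITORY>/<IMAGENAME>, so fill in the <IMAGENAME> placeholder accordingly):
gcloud artifacts repositories create <REPOSITORY> \
  --repository-format=docker \
  --location=us-central1 \
  --description="Images for the Cloud Run GPU service"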
3.2. requirements.txt
This file lists the Python dependencies required by your application. It allows Cloud Build (and you, locally) to install everything with a single command:
torch
pillow
numpy
flask
3.3. main.py
This is where your machine learning logic lives. We use a Flask app that listens on /predict, accepts an image file, and returns the predicted class index. You can replace resnet18 with your own model. The code also shows how to preprocess images before feeding them to the model.
from flask import Flask, request, jsonify
import torch
import io
import os
from PIL import Image
import numpy as np

app = Flask(__name__)

# Use the GPU when available (Cloud Run GPU instances expose CUDA); fall back to CPU otherwise
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load a pretrained ResNet-18 model
model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=True)
model.to(device)
model.eval()

@app.route('/predict', methods=['POST'])
def predict():
    if 'file' not in request.files:
        return jsonify({'error': 'No file part in the request'}), 400
    file = request.files['file']
    if file.filename == '':
        return jsonify({'error': 'No selected file'}), 400

    img_bytes = file.read()
    img = Image.open(io.BytesIO(img_bytes))

    # Preprocess the image and move it to the same device as the model
    img_tensor = preprocess_image(img).to(device)

    with torch.no_grad():
        outputs = model(img_tensor)
        _, predicted = torch.max(outputs, 1)

    # 'predicted' is the class index
    return jsonify({'prediction': int(predicted.item())})

def preprocess_image(image):
    # Convert to RGB, resize to 224x224, scale pixel values to [0, 1]
    image = image.convert('RGB')
    image = image.resize((224, 224))
    image = np.array(image) / 255.0
    image = np.transpose(image, (2, 0, 1))  # (C, H, W)
    image_tensor = torch.from_numpy(image).float()
    image_tensor = image_tensor.unsqueeze(0)  # Add batch dimension
    return image_tensor

if __name__ == '__main__':
    port = int(os.environ.get("PORT", 8080))  # Cloud Run provides PORT; default to 8080
    app.run(host='0.0.0.0', port=port)
3.4. Dockerfile
Purpose: Defines our container environment. In this case, we start from a CUDA-enabled Ubuntu base image, install the Python dependencies, and run main.py.
FROM nvidia/cuda:12.0.1-base-ubuntu20.04
RUN apt-get update && apt-get install -y python3-pip
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . /app
WORKDIR /app
CMD ["python3", "main.py"]
4. How to Build and Deploy
With the files above in place, you can trigger the entire build-and-deploy process by running:
gcloud builds submit --config cloudbuild.yaml .
What happens:
- Cloud Build uses Docker to build the container image in the cloud, so you don’t strictly need Docker installed locally (though Docker Desktop or another Docker distribution is handy for testing images before submitting).
- The image is pushed to us-central1-docker.pkg.dev/$PROJECT_ID/<image_name>.
- A gcloud beta run deploy command is executed within Cloud Build to deploy your service to Cloud Run with all the specified GPU parameters.
After the build and deployment succeed, you will have a private Cloud Run endpoint (because we used --no-allow-unauthenticated). You can then manage IAM or use an ID token to invoke the endpoint.
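For example, to let a specific user or service account invoke the service, you can grant the Cloud Run Invoker role (the member shown is a placeholder; adjust it to your own user or service account):
gcloud run services add-iam-policy-binding nvidia-l4-model \
  --region us-central1 \
  --member="user:<YOUR-EMAIL>" \
  --role="roles/run.invoker"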
5. Testing the Deployed Model
5.1. Getting the Service URL
Once deployed, the service URL is printed in the build output. It generally looks like:
https://nvidia-l4-model-xxxxxx-uc.a.run.app
(Exact URL depends on the project and service name.)
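If you need the URL again later, you can look it up with gcloud (the service name and region match the deploy step above):
gcloud run services describe nvidia-l4-model \
  --region us-central1 \
  --format='value(status.url)'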
5.2. Making an Authenticated Request
Because you used --no-allow-unauthenticated, the service requires a valid ID token for every call. You can get an ID token via the gcloud command:
gcloud auth print-identity-token
Then call the prediction endpoint, passing the token in the Authorization header:
curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-identity-token)" \
  -F "file=@ambulance.jpg" \
  https://nvidia-l4-model-xxxxxx-uc.a.run.app/predict
Ensure you replace the URL with your actual Cloud Run service URL. The output will be something like:
{"prediction": 407}
(where 407 is the predicted ImageNet-1k class index; for the example ambulance.jpg, class 407 corresponds to “ambulance”).
6. Calling the Model from JavaScript
You can similarly call the model from Node.js or any other environment. You just need to provide the correct bearer token in the Authorization header.
For example, using google-auth-library:
const { GoogleAuth } = require('google-auth-library');
const fetch = require('node-fetch');
const FormData = require('form-data');
const fs = require('fs');
const path = require('path');

// Service URL
const MODEL_SERVICE_URL = 'https://nvidia-l4-model-xxxxxx-uc.a.run.app/predict';

async function getIdToken() {
  const auth = new GoogleAuth();
  const client = await auth.getIdTokenClient(MODEL_SERVICE_URL);
  const tokenHeaders = await client.getRequestHeaders();
  return tokenHeaders['Authorization'].split(' ')[1];
}

async function callModelLocal(imagePath) {
  try {
    const imageData = fs.readFileSync(path.resolve(imagePath));
    const formData = new FormData();
    formData.append('file', imageData, path.basename(imagePath));

    const idToken = await getIdToken();
    const response = await fetch(MODEL_SERVICE_URL, {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${idToken}`,
        ...formData.getHeaders(),
      },
      body: formData,
    });

    if (!response.ok) {
      const errorText = await response.text();
      throw new Error(`Model service error: ${errorText}`);
    }

    const jsonResponse = await response.json();
    console.log('Prediction:', jsonResponse.prediction);
  } catch (error) {
    console.error('Error:', error);
  }
}

// Example usage: node script.js sweater.png
const imagePath = process.argv[2] || 'ambulance.jpg';
callModelLocal(imagePath);
Be sure you’ve configured your local environment with either Application Default Credentials (e.g., gcloud auth application-default login) or GOOGLE_APPLICATION_CREDENTIALS pointing to a service account JSON key file.
7. Conclusion
You’ve just configured a GPU-equipped Docker container for machine learning inference, built it with Cloud Build, and deployed it to Cloud Run with GPU settings, all entirely serverless. The --no-allow-unauthenticated flag ensures that only authorized requests reach your model, enhancing security.
Key Takeaways:
- Cloud Build automates container image building and artifact pushing.
- Cloud Run serves your ML models in a fully managed, highly scalable environment that frees you to focus on new features rather than infrastructure.
- GPU support (via --gpu) powers more complex and accurate model predictions. It is currently in preview, so quotas and limitations may apply.
- Robust authentication lets you confidently manage who accesses your model.
For more tips, tutorials, and discussions on ML and cloud computing, be sure to follow me on Twitter! Feel free to reach out if you have questions, and happy coding!