At the Google Hackathon, we were building an app that uses generative AI (specifically, a multimodal large language model) to analyze images and make real-time driving decisions in the CARLA simulator through the Gemini API.
LLMs excel at image description, but Gemini takes it a step further, generating driving decisions based on the scenario. To leverage the model effectively, I experimented with Vertex AI Studio. Integrating the gemini-pro-vision-001 model with CARLA presented several challenges, including inconsistent outputs and response lag. Through persistent prompt fine-tuning, and by running the simulation multiple times, we achieved the desired results.
In this article I explain how I used the Vertex AI framework and the gemini-pro-vision-001 model for our stage 0 (Driver Assistance using GenAI). The Gemini Pro Vision model is multimodal, meaning it handles both text and images: it excels at analyzing visuals, summarizing scenes, and even creating captions based on what it sees. Vertex AI is a fully managed, unified AI development platform for building and using generative AI on Google Cloud.
Let's get into the code.
Import the Google Cloud client library.
//Import the Google Cloud client library
const {
VertexAI,
HarmCategory,
HarmBlockThreshold,
} = require('@google-cloud/vertexai');
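If the library is not already installed, it can be added with npm install @google-cloud/vertexai.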
HarmCategory: This constant represents the different categories of potential harm that generated text or image content could contain (e.g., hate speech, violence, dangerous content).
HarmBlockThreshold: This constant specifies the threshold at which content in a given harm category gets blocked, with values such as BLOCK_LOW_AND_ABOVE, BLOCK_MEDIUM_AND_ABOVE, BLOCK_ONLY_HIGH, and BLOCK_NONE.
Overall, the HarmCategory and HarmBlockThreshold settings are there to implement responsible AI.
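For reference, here is a sketch of what a fuller safety configuration could look like; the categories and thresholds below are standard members of the enums imported above, but the exact combination you need depends on your use case.
// Sketch: safety settings covering several harm categories.
// These enum members come from @google-cloud/vertexai; adjust the
// thresholds to suit your own application.
const safetySettings = [
  {
    category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
    threshold: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
  },
  {
    category: HarmCategory.HARM_CATEGORY_HATE_SPEECH,
    threshold: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
  },
  {
    category: HarmCategory.HARM_CATEGORY_HARASSMENT,
    threshold: HarmBlockThreshold.BLOCK_ONLY_HIGH,
  },
];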
Let's initialize Vertex AI.
// Initialize Vertex AI with the project and location
const vertexAI = new VertexAI({project: project, location: location});

// Load the Gemini Pro Vision model with safety settings,
// generation limits, and a system instruction
const generativeVisionModel = vertexAI.getGenerativeModel({
  model: visionModel,
  safety_settings: [
    {
      category: HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
      threshold: HarmBlockThreshold.BLOCK_LOW_AND_ABOVE,
    },
  ],
  // Keep responses short and near-deterministic for driving decisions
  generation_config: {max_output_tokens: 10, temperature: 0.1},
  system_instruction: "You are an ego vehicle. See the images from a third person perspective."
});
"project" and "location" information you can get from your google cloud account setup. They look like below
const project = 'xxx-xxx-1xxxx';
const location = 'us-central1';
const visionModel = 'gemini-1.0-pro-vision';
Let's prep the prompt to query the Gemini vision model.
const prompt = `Input: Image data (from front mounted camera on the ego vehicle)
Output: STOP (if a vehicle is within 7 meters and obstructing my way)
GO (if my way is clear for 10 meters)
Restrictions: No additional explanation needed.`;
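The base64Image used in the next snippet is the front-camera frame coming from CARLA. As a minimal sketch, assuming the frame has already been written to disk as a JPEG (the file name below is just a placeholder), it could be produced like this:
// Sketch: load a CARLA camera frame from disk and base64-encode it.
// In our pipeline the frames arrive from the simulator in real time.
const fs = require('fs');
const base64Image = fs.readFileSync('front_camera.jpg').toString('base64');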
Now, let's create a multimodal request for the Gemini Pro Vision model.
// Image part: the CARLA camera frame as a base64-encoded JPEG
const filePart = {inlineData: {data: base64Image, mimeType: 'image/jpeg'}};
// Text part: the driving-decision prompt defined above
const textPart = {text: prompt};
// Combine both parts into a single user turn
const request = {
  contents: [{role: 'user', parts: [textPart, filePart]}],
};
Finally, make the request.
// Send the multimodal request to the model
const apiResult = await generativeVisionModel.generateContent(request);
const contentResponse = await apiResult.response;
// Extract the text of the first candidate ("STOP" or "GO")
const resText = contentResponse.candidates[0].content.parts[0].text;
return res.send(resText);
It will respond with STOP or GO after analyzing the images generated by the CARLA simulation in real time.
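For completeness, here is a minimal sketch of how the snippets above fit together inside a request handler, assuming an Express server (since the code returns res.send); the /drive route, the image field, and the port are placeholders rather than the exact names from our Hackathon code.
// Sketch: an Express endpoint tying the pieces together.
// Assumes the Vertex AI client, visionModel, and prompt defined above.
const express = require('express');
const app = express();
app.use(express.json({limit: '10mb'})); // CARLA frames can be large

app.post('/drive', async (req, res) => {
  try {
    const base64Image = req.body.image; // base64-encoded JPEG frame from the simulator
    const filePart = {inlineData: {data: base64Image, mimeType: 'image/jpeg'}};
    const textPart = {text: prompt};
    const request = {contents: [{role: 'user', parts: [textPart, filePart]}]};

    const apiResult = await generativeVisionModel.generateContent(request);
    const contentResponse = await apiResult.response;
    const resText = contentResponse.candidates[0].content.parts[0].text;
    return res.send(resText); // "STOP" or "GO"
  } catch (err) {
    console.error(err);
    return res.status(500).send('ERROR');
  }
});

app.listen(3000, () => console.log('Driver-assistance endpoint listening on port 3000'));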
Designing prompts specific to autonomous vehicles was an intensive learning experience. By the end of the Hackathon, we had learned that context plays a greater role, i.e. which lane we are in, the direction of travel, the traffic light state, distances to obstacles, and so on, so we need to pass these environment variables to the API endpoint to make Gemini work for us (see the sketch below).
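As an illustration of that idea (not the exact prompt we shipped), the environment context could be folded into the prompt like this; the field names below are hypothetical placeholders.
// Sketch: enriching the prompt with environment context from CARLA.
// lane, direction, trafficLight, and distanceToObstacle are
// illustrative field names, not the exact variables from our code.
function buildPrompt(env) {
  return `Input: Image data (from front mounted camera on the ego vehicle)
Context: lane=${env.lane}, direction=${env.direction}, traffic light=${env.trafficLight}, distance to nearest obstacle=${env.distanceToObstacle} m
Output: STOP (if a vehicle is within 7 meters and obstructing my way)
GO (if my way is clear for 10 meters)
Restrictions: No additional explanation needed.`;
}

const contextualPrompt = buildPrompt({
  lane: 'left',
  direction: 'north',
  trafficLight: 'green',
  distanceToObstacle: 12,
});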
Please visit our submission entry for the Hackathon. I would love it if you could support us by liking it on Devpost.
The working code can be found in gemini_api, along with a demo video.
To run the script, please install the gcloud CLI on your machine and follow How to run gcloud command line using a service account.