Project Jarvis Deep Dive: The End of "Search" and the Rise of "Action"; How Google's New Chrome AI Controls Your Browser, Credit Card, and Digital Life

مجید قربانی نژاد

Hello Tekin Army and survivors of the Legacy Web! 🫡 Mark the date: Thursday, January 8, 2026. This might be the day the way we interact with computers changed forever. Let's be honest; the current internet is "broken." It is cluttered with cookie consent pop-ups, intrusive video ads, endless CAPTCHAs asking us to identify traffic lights, and registration forms that feel like interrogations. We have become unpaid data-entry clerks for websites.

But Google has finally dragged its ultra-secret project, **"Jarvis,"** out of the shadows. The promise is as terrifying as it is seductive: "Give me the browser, and I will do the clicking." This technology, technically known as a **CUA (Computer-Using Agent)**, is no longer a Large Language Model (LLM) that just talks; it has eyes and hands. It sees, scrolls, clicks, and swipes.

In this mega-analysis from TekinGame, we are going to surgically dismantle Jarvis's "Silicon Brain." We will cover the Gemini 2.0 architecture powering it, the nightmares it is causing for cybersecurity experts (see the <a href="/blog/nightly-news-wrap-up-dec-13-2025-xbox-handheld-oled-leak-clair-obscur-backlash-ai-security-crisis">DevOps security crisis we covered last month</a>), and the existential question keeping webmasters awake: "If Jarvis buys the shoes, who clicks on the ads?" Brew your strongest coffee; this is the deepest dive you will read today.

1. The Agent Revolution: Why "Chatbots" Are Dead

Until 2025, our interaction with AI was limited to a text box. We typed a prompt, and the AI generated text. This paradigm is called Generative AI. However, Jarvis belongs to a new evolutionary branch called Agentic AI. The distinction lies in the word "Agency." ChatGPT (in its legacy forms) was like a knowledgeable librarian with no hands. Jarvis is like an employee sitting at your desk. It possesses three key traits that chatbots lack:

Perception: It understands context. It knows it is currently on the checkout page of Amazon.ae or the login portal of Emirates ID.

Planning: It breaks down a goal ("Book a flight to London") into steps: check dates, compare prices, select a seat, enter passport info, pay.

Action: It can hijack the mouse cursor and keyboard input stream to execute those steps.

This paradigm shift is the most significant leap since the Graphical User Interface (GUI) went mainstream in the 1980s.

2. Technical Anatomy: Vision vs. DOM (How Jarvis Sees)

This section is for the tech-savvy soldiers of the Tekin Army. Google faced a massive fork in the road when building Jarvis: Should the AI read the website's code (HTML/DOM), or should it "see" the website like a human?

The "Vision-Based" Approach

Jarvis relies heavily on multimodal models like Gemini 2.0 Flash, which are fed a continuous stream of screenshots of your browser. The technical reasons for this choice are fascinating:

Messy Modern Web: Sites built with modern frameworks like React and Vue often ship dense, machine-generated HTML that is hard for a bot to parse, while the visual rendering stays clear to the human eye (and to Jarvis).
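For the tinkerers in the Tekin Army, the perceive, plan, act cycle described above can be sketched as a simple loop. This is a toy illustration only, not Google's implementation: `FakeBrowser`, `stub_model`, `TRANSITIONS`, and the page and button names are all hypothetical stand-ins. A real agent would capture actual pixels and send them to a multimodal model instead of looking up a hard-coded table.

```python
from dataclasses import dataclass


@dataclass
class Action:
    kind: str          # "click" or "done"
    target: str = ""   # label of the on-screen element to click


# Hypothetical page flow: (current page, clicked element) -> next page.
TRANSITIONS = {
    ("search", "Search flights"): "results",
    ("results", "Cheapest fare"): "checkout",
    ("checkout", "Pay"): "confirmation",
}


class FakeBrowser:
    """Hypothetical stand-in for a browser the agent controls."""

    def __init__(self) -> None:
        self.page = "search"
        self.history: list[Action] = []

    def screenshot(self) -> str:
        # Assumption: a real agent would return image data here.
        return self.page

    def execute(self, action: Action) -> None:
        self.history.append(action)
        if action.kind == "click":
            # Unknown clicks leave the page unchanged.
            self.page = TRANSITIONS.get((self.page, action.target), self.page)


def stub_model(goal: str, screenshot: str) -> Action:
    """Hypothetical policy standing in for a multimodal model call."""
    next_click = {
        "search": "Search flights",
        "results": "Cheapest fare",
        "checkout": "Pay",
    }
    if screenshot in next_click:
        return Action("click", next_click[screenshot])
    return Action("done")


def run_agent(goal: str, browser: FakeBrowser, model, max_steps: int = 10) -> list[Action]:
    for _ in range(max_steps):              # hard cap so the loop always ends
        observation = browser.screenshot()  # perception
        action = model(goal, observation)   # planning
        if action.kind == "done":
            break
        browser.execute(action)             # action
    return browser.history


if __name__ == "__main__":
    browser = FakeBrowser()
    for step in run_agent("Book a flight to London", browser, stub_model):
        print(step.kind, step.target)
    print("final page:", browser.page)
```

Note the `max_steps` cap: an agent that acts on its own needs a hard stop, otherwise a confusing page could keep it clicking forever.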