I Gave an AI Agent Access to the Web. Here’s What Happened.
Testing Proxie-Lite 3B on real-world tasks, from search queries to image description, without any fine-tuning
Lightweight AI Agents like Proxie-Lite 3B promise useful web automation with minimal compute and no fine-tuning. This post walks through real-world experiments to assess how this compact Agent performs, where it shines, and where it breaks down.
What is Proxie-Lite 3B?
Proxie-Lite 3B is a compact, open-weight AI Agent developed by Convergence AI. It combines visual understanding with language processing capabilities to navigate and interact with web interfaces effectively.
Built on the Qwen2.5-VL-3B-Instruct architecture, it uses a smart context-window management system that maintains task awareness while minimizing image token overhead.
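To make the idea of limiting image token overhead concrete, here is a minimal sketch of one common approach: keep only the most recent screenshot in the conversation history and replace older ones with a short text placeholder. This is an assumption for illustration, not a description of Proxie-Lite's actual implementation; the message format and the `prune_screenshots` helper are hypothetical.

```python
def prune_screenshots(messages: list[dict], keep_last: int = 1) -> list[dict]:
    """Return a copy of the history in which all but the newest
    `keep_last` screenshots are replaced by a cheap text placeholder."""
    image_indices = [i for i, m in enumerate(messages) if m["type"] == "image"]
    if keep_last > 0:
        to_drop = set(image_indices[:-keep_last])
    else:
        to_drop = set(image_indices)
    pruned = []
    for i, message in enumerate(messages):
        if i in to_drop:
            # Older screenshots cost many tokens; keep only a marker.
            pruned.append({"type": "text", "content": "[screenshot omitted]"})
        else:
            pruned.append(message)
    return pruned


# Hypothetical history: the task, then alternating screenshots and actions.
history = [
    {"type": "text", "content": "Task: find the latest AI paper"},
    {"type": "image", "content": "screenshot_1.png"},
    {"type": "text", "content": "Action: click search box"},
    {"type": "image", "content": "screenshot_2.png"},
    {"type": "text", "content": "Action: type query"},
    {"type": "image", "content": "screenshot_3.png"},
]
pruned_history = prune_screenshots(history, keep_last=1)
```

The text messages (task and past actions) are kept intact, which is what preserves task awareness while the image count stays constant.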
Despite its modest 3-billion-parameter size, it can execute complex web-based tasks using a headless browser and a custom agent loop, without any further fine-tuning. It can:
- Navigate and click through websites
- Fill out forms and scrape content
- Summarize long documents
- Handle location-based queries
Best of all, it runs locally or via Hugging Face Spaces, making it ideal for experimentation without heavy computing requirements.
How Well Does It Work? Real-World Web Tasks
Below are some experiments I ran with Proxie-Lite 3B, without fine-tuning, that show its versatility, its limitations, and some surprising behaviors. For each task category (e.g., performing queries), I show one of the prompts I used and the corresponding result. The colored marker next to each observation means:
🟢 Fully accomplished | 🟠 Partial success | 🔴 Failed
🔍 Perform Queries
The Agent handles fact-based searches effectively, even when multi-step reasoning is required.
PROMPT
“How old was Messi when he won his first Copa America tournament?”
🟢 OBSERVATION
To accomplish the task, the Agent entered the query into Google, clicked the first result, and extracted the correct age. This differs from tools like Perplexity, which typically search for the year Messi won, find his birth year, and then compute his age.
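The decomposed approach (find the tournament date, find the birth date, compute the age) comes down to a date subtraction. Here is a minimal sketch, assuming both dates have already been retrieved: Messi was born on 24 June 1987, and the 2021 Copa América final was played on 10 July 2021. The helper name `age_on` is my own.

```python
from datetime import date


def age_on(birth: date, event: date) -> int:
    """Return the number of completed years between birth and event."""
    years = event.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the event year.
    if (event.month, event.day) < (birth.month, birth.day):
        years -= 1
    return years


messi_birth = date(1987, 6, 24)
copa_final = date(2021, 7, 10)
print(age_on(messi_birth, copa_final))  # 34
```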
🛒 Find and Compare Items
Proxie-Lite handles direct search-based lookups with ease, often taking the shortest navigation path available. Comparison tasks that span multiple websites, however, pushed its limits.
PROMPT
“I’m looking for the best price on the book ‘Leviatan’ by ‘Hobbes’ from online bookstores in Canada. Find at least three different options, compare their prices, and identify the most affordable one.”
🟠 OBSERVATION
The Agent was able to access several online bookstores, but struggled to aggregate and compare price data from different sites. After several runs, it finally reached the objective by searching and displaying all results on a single Google results page.
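For context, the aggregation step itself is simple once the prices have been extracted; the hard part for the Agent was collecting them across sites. Here is a minimal sketch of that final comparison. The store names and prices are placeholders for illustration, not results from my runs, and the `Offer` and `cheapest` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    store: str        # placeholder store name
    price_cad: float  # placeholder price in Canadian dollars


def cheapest(offers: list[Offer], minimum: int = 3) -> Offer:
    """Return the lowest-priced offer, requiring at least `minimum` options."""
    if len(offers) < minimum:
        raise ValueError(f"need at least {minimum} offers, got {len(offers)}")
    return min(offers, key=lambda offer: offer.price_cad)


# Placeholder data standing in for what the Agent would scrape.
offers = [
    Offer("Store A", 18.99),
    Offer("Store B", 15.50),
    Offer("Store C", 21.00),
]
best = cheapest(offers)
print(f"{best.store}: ${best.price_cad:.2f} CAD")
```

The `minimum` check mirrors the prompt's requirement of at least three options, so an incomplete scrape fails loudly instead of returning a misleading "best" price.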
📍 Geolocation Tasks
The Agent can complete location-based lookups and interact with mapping tools like Google Maps. Results were promising, though extracting structured data proved tricky.
PROMPT
“List restaurants near the Ricardo Bochini Stadium, including their specialties (e.g., type of cuisine or signature dishes) and their contact phone numbers.”
🟠 OBSERVATION
The Agent successfully used Google Maps and retrieved structured data about nearby restaurants, including phone numbers, but failed to determine the types of cuisine. Additionally, the last result (#6) was not in the specified location, because the Agent accidentally “jumped” to a different location before finalizing the list.
📚 Search and Summarize Content
Proxie-Lite can consume and digest long-form text or videos, offering short, coherent summaries. In this example, summarization was a standout, achieved with no extra guidance.
PROMPT
“Go to ArXiv and access the latest paper on the topic of Artificial Intelligence. Summarize it in 2 sentences.”
🟢 OBSERVATION
The Agent successfully navigated arXiv, selected a recent AI paper, and produced a surprisingly coherent 2-sentence summary.
➗ Basic Math
Without access to external tools such as a calculator, Proxie-Lite handles basic math by typing the expression into a search engine. Outsourcing arithmetic to Google is simple but effective.
PROMPT
“What’s the result of 2 + 3.14 ?”
🟢 OBSERVATION
The Agent solved the task by entering “2 + 3.14” into Google’s search bar and directly retrieving the result from the search engine.
🖼️ Search and Analyze Images
Proxie-Lite can retrieve and describe images, drawing on the same visual-textual understanding it uses for web interfaces. Basic image descriptions were accurate.
PROMPT
“Search for a random image of a beach. Can you describe what you see?”
🟢 OBSERVATION
After searching for an image and accessing the corresponding website, the Agent generated a short description (“A tropical beach with palm trees, white sand, and clear blue water.”). Vision capabilities were basic but accurate for generic scenes.
▶️ Run Specific Online Videos
Proxie-Lite can identify online video content, but interaction with embedded media players remains challenging.
PROMPT
“Search YouTube for the latest episode of the Lex Fridman podcast. Once you find it, play it.”
🔴 OBSERVATION
Although it reached the target YouTube channel and identified the latest podcast episode, the Agent was unable to click through and play the video.
Similar issues arose when attempting to play specific music videos — while the Agent could locate them correctly, it often failed to trigger playback or interact with embedded media controls reliably.
🎲 What Happens When We Remove Constraints?
What does the Agent do when given fully open-ended instructions? Open-ended autonomy highlights randomness and inconsistency: fascinating, but unreliable when predictable results are needed.
PROMPT
“You are a free and autonomous AI agent with full access to the internet. Explore, create, and pursue any objective you choose — your possibilities are limitless. Do anything you want.”
OBSERVATION
The Agent defined the objective of searching for Python programming content. But when the prompt was repeated, it instead searched for a ranking of the top 5 websites by market share in the UK.
This reveals the stochastic nature of the Agent: different runs, even with identical prompts, can result in dramatically different behaviors.
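One way to quantify this variability is to repeat the same prompt several times and tally what the Agent chose to do. The sketch below assumes a hypothetical `run_once(prompt)` function that launches the Agent and returns a short label for the objective it picked; here it is replaced by a stub with made-up outcomes so the example runs on its own.

```python
from collections import Counter
from typing import Callable


def tally_objectives(run_once: Callable[[str], str], prompt: str, runs: int) -> Counter:
    """Run the same prompt `runs` times and count the objectives chosen."""
    return Counter(run_once(prompt) for _ in range(runs))


def consistency(tally: Counter) -> float:
    """Share of runs that ended in the most common objective (0.0 to 1.0)."""
    total = sum(tally.values())
    return max(tally.values()) / total if total else 0.0


# Stub standing in for the real Agent; the outcomes are illustrative only.
fake_outcomes = iter([
    "python programming content",
    "uk website ranking",
    "python programming content",
    "python programming content",
])


def run_once(prompt: str) -> str:
    return next(fake_outcomes)


tally = tally_objectives(run_once, "Do anything you want.", runs=4)
print(tally.most_common())
print(consistency(tally))  # 0.75 for the stubbed outcomes above
```

A score near 1.0 means the Agent behaves the same way on every run; lower values signal the kind of divergence described above.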
Limitations
While evaluating the quality of final outputs is crucial, understanding how the Agent reached those outputs is equally important. Proxie-Lite (like many other solutions) operates using a clear pattern: Observe → Think → Act, and this trace is key for debugging and evaluating the reasoning process.
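The Observe → Think → Act pattern can be sketched as a small loop that records a trace at every step. This is a generic illustration, not Proxie-Lite's code: the `run_agent` function and the toy `observe`, `think`, and `act` callables are assumptions, with a simple counter standing in for the browser.

```python
from typing import Callable


def run_agent(
    observe: Callable[[], str],
    think: Callable[[str], tuple[str, str]],
    act: Callable[[str], None],
    max_steps: int = 10,
) -> list[dict]:
    """Run an Observe -> Think -> Act loop and return the full trace."""
    trace = []
    for step in range(max_steps):
        observation = observe()               # Observe: read the current state
        thought, action = think(observation)  # Think: decide what to do next
        trace.append({
            "step": step,
            "observation": observation,
            "thought": thought,
            "action": action,
        })
        if action == "done":
            break
        act(action)                           # Act: change the environment
    return trace


# Toy environment: the "page" is just a counter.
state = {"page": 0}


def observe() -> str:
    return f"page {state['page']}"


def think(observation: str) -> tuple[str, str]:
    if state["page"] >= 2:
        return ("target page reached", "done")
    return ("need to go further", "click_next")


def act(action: str) -> None:
    state["page"] += 1


trace = run_agent(observe, think, act)
for entry in trace:
    print(entry)
```

Because every step is logged with its observation, thought, and action, the trace shows where a run went wrong, not just that it did.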
The Agent shows potential but struggles with consistency, CAPTCHA barriers, and looping behaviors.
CAPTCHA Barriers
When encountering a CAPTCHA, the Agent prints:
The current page requires solving a CAPTCHA to proceed. The next step is to wait for the CAPTCHA solver to automatically solve the CAPTCHA.
Unfortunately, this frequently leads to stalled execution: CAPTCHA-solving is not yet integrated, so the solver the Agent waits for never runs. This is a major blocker for full autonomy.
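Until a solver is available, one pragmatic workaround is to watch the Agent's output for that message and stop the run early instead of letting it wait. Here is a minimal sketch, assuming the Agent's messages are available as a list of strings; the helper names and the first log line are hypothetical, while the second log line is the message quoted above.

```python
from typing import Optional

# Substring of the message the Agent printed in my runs (lowercased).
CAPTCHA_MARKER = "requires solving a captcha"


def is_captcha_wait(message: str) -> bool:
    """True if the message says the Agent is waiting on a CAPTCHA."""
    return CAPTCHA_MARKER in message.lower()


def first_captcha_step(messages: list[str]) -> Optional[int]:
    """Return the index of the first CAPTCHA message, or None if there is none."""
    for index, message in enumerate(messages):
        if is_captcha_wait(message):
            return index
    return None


log = [
    "Typing the query into the search bar.",
    "The current page requires solving a CAPTCHA to proceed. The next step is "
    "to wait for the CAPTCHA solver to automatically solve the CAPTCHA.",
]
step = first_captcha_step(log)
if step is not None:
    print(f"CAPTCHA hit at step {step}; stopping instead of waiting.")
```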
Looping Behavior
The Agent sometimes gets stuck in loops — especially when its chosen action fails. Without a fallback mechanism or memory reset, the Agent often retries the same failed strategy indefinitely.
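A simple guard against this is to track the most recent actions and flag the run when the same one repeats too many times in a row. The sketch below is an illustration of that idea rather than something Proxie-Lite ships with; the `LoopGuard` class and the action strings are hypothetical.

```python
from collections import deque


class LoopGuard:
    """Flags a loop when the last `max_repeats` actions are all identical."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=max_repeats)

    def record(self, action: str) -> bool:
        """Record an action; return True if a loop is detected."""
        self.recent.append(action)
        window_full = len(self.recent) == self.max_repeats
        return window_full and len(set(self.recent)) == 1


guard = LoopGuard(max_repeats=3)
actions = ["click #play", "click #play", "click #play"]
flags = [guard.record(action) for action in actions]
print(flags)  # [False, False, True]
```

When `record` returns True, the surrounding loop can stop the run, reset the context, or try a different strategy instead of retrying the failed action again.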
Stochastic Initial Actions
Even structured prompts lead to unpredictable first steps, making the Agent’s process hard to control. This unpredictability can be insightful for observing its internal “thought process”, but poses challenges for reproducibility.
✅ In Summary
Proxie-Lite 3B is a surprisingly capable Agent for lightweight web-based automation. It mimics human interaction with a mix of reasoning, planning, and tool use — delivering solid performance on many routine tasks.
Its open architecture makes it ideal for prototyping and experimentation, especially in use cases like market research, content aggregation, or testing web interfaces.
But don’t expect it to replace human workflows yet. Think of it as a digital intern: curious, sometimes brilliant, occasionally lost. While it’s exciting to see what these Agents can do, they’re not ready for prime time in high-stakes situations, and definitely not suited for tasks involving sensitive data, compliance requirements, or critical business decisions.