Build AI agents to automate web browsing with human level reasoning using source code of webpages?

Apurv Agarwal
10 min readJun 19, 2024

--

Source: LangGraph (Overview of plan-and-execute agents)

Welcome to this article about how we can build AI agents that can interact with web and perform tasks.

In my previous article I introduced the basic concept and idea on how we combine browser automation tools and LLMs to achieve this.

In this article we will look at a practical example.

Problem Statement: Here is a sample problem that we will work on in this tutorial.

Go to https://www.sec.gov/edgar/searchedgar/companysearch. Search for Nvidia. You will see a list of links for documents. We need to click at each and another page will open. Copy the link for all documents. I need the output which will be all the links on that page. After getting links for one page go back do same for next link.


## We will put the problem in string in our python code.

task = """Go to https://www.sec.gov/edgar/searchedgar/companysearch. Search for Nvidia. You will see a list of links for documents.
We need to click at each and another page will open. Copy the link for all documents. I need the output which will be all the links on that page.
After getting links for one page go back do same for next link."""

In this tutorial we will build a system that can automatically perform the above task. It will be possible because of OpenAI’s large context window which is 128,000. This will allow us to send the entire source code of the web page as the chat prompt to OpenAI.

# Let's import the basic requirements
from playwright.sync_api import sync_playwright
from openai import OpenAI


# Defining variables needed to interact with OpenAI's LLM
client = OpenAI(api_key='YOUR-OPENAI API KEY')
model = "gpt-4o"


# Initalizing our playwright instance
playwright = sync_playwright().start()
browser = playwright.chromium.launch(headless=False)
page = browser.new_page()

Primarily we will try to output executable code from OpenAI for each step. Towards that pursuit we will write a function that can take the main task and split in into subtasks that can be executed by Playwright one step at a time.

def get_task_list(task):
task_list_creation_prompt = """
You are an AI assistant that, given a high-level task, will break it down into smaller subtasks. Each subtask should be simple enough to be executed sequentially by the automation tool Playwright without using loops. Write the subtasks as a numbered list. Each step should be actionable and straightforward, ensuring the automation progresses sequentially as it communicates continuously with an external processor for updates.

Example:
1. Go to page XYZ.
2. Check for the presence of element ABC.
3. Fill out form DEF if it is present.
4. Click submit button GHI.

"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": task_list_creation_prompt},
{"role": "user", "content": f"Write the task list for this task: {task}"},
]
)
return completion.choices[0].message.content

We will define a list named commands. In that we will store each executable python command inferred from the LLM. It’s actually a string as we need to pass it in the prompt.

commands = ""

The very first step will be to go to the url where we have to perform the action. So to create the very first command, we will write a function that can extract the url that we have to first go to.

def write_goto_command(task):
goto_command_prompt = """
I am giving you the task that needs to be executed by automation software playwright.
Your task is just to tell given the task, which URL should playwright go to first.
Just output url and nothing else
"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": goto_command_prompt},
{"role": "user", "content": f"What is the URL for this task: {task}"},
]
)
return completion.choices[0].message.content

For our task, the the output of this function will come out to be https://www.sec.gov/edgar/searchedgar/companysearch

Hence we will get the first command.

url = write_goto_command(task)
next_command = f"page.goto('{url}')"

We will also define a list called local_storage_list. Consider this string as a database where our required output will be stored. In this case our output is link to the documents.

local_storage_list = []

Now at this point I would like to explain how the whole process will go.

  1. We will generate the sub_task_list from our task.
  2. We will create the first command.

Now In a Loop we will perform following actions :-

  1. Execute the current command in playwright.
  2. Add this executed command in command list.
  3. Collect the source code of current page.
  4. Update the sub_task_list based on commands executed and current page source.
  5. Generate the next command based on executed commands, current sub_task_list and current page source.
  6. Repeat

We will keep performing these tasks until the LLM tells all the tasks are complete on the sub_task_list.

Now I will explain how we will perform each of this step.

To execute the executable code we can use exec command of python. Once executed we will add that command to list of executed commands.

exec(next_command)
commands = commands + "\n" + next_command

So the very first execution will be of going to the target page. Once we reach there we will get the source code of the page. We are doing this to see if we need to update our sub_task_list based on changes on the page.

page_content = page.content()

Now we have all the required ingredients after the first execution let’s write prompt to update the original sub_task_list.



def update_task_list(task, task_list, commands, page_content):
task_list_updation_prompt = f"""
You are an AI assistant that is helping update a task list based on the current state of execution. Given the original high-level task, the existing task list, the commands that have been executed, and the current page content, update the task list to reflect what still needs to be done or adjust the approach based on the page's current state.
Original Task:
{task}
\n
Initial Task List:
{task_list}
\n
Executed Commands:
{commands}
\n
Current Page Content:
{page_content}
\n
Based on this information, provide a revised list of subtasks, ensuring each can be executed by Playwright sequentially without loops. Adjust the tasks as needed based on the changes observed in the page content.
\n
"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": task_list_updation_prompt},
{"role": "user", "content": f"Write an updated task list based on the current page content. Make the changes only if updates are required?"},

]
)
return completion.choices[0].message.content

We this above prompt we will get updated sub_task_list based on current contents of the page. We can now produce the next command that needs to be executed.


def get_next_command(tasks, commands, page_content):
playwright_command_prompt = f"""
You are an assistant tasked with generating executable code snippets for Playwright and Python. Output Playwright commands to perform web operations or Python commands to store web content in a list named `local_storage_list`.
Avoid using JavaScript decorators, backticks, and the `await` keyword.
\n
Task List:
{tasks}
\n
Executed Command Till Now:
{commands}
\n
Current Page Content:
{page_content}
\n
Based on this context, output the necessary code to complete the next step in the task list. Ensure the code is suitable for sequential execution and directly interacts with `local_storage_list` for storing data as needed.
Just output the code, nothing else. Also output will be just one-line code that needs to be executed next.
\n
"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": playwright_command_prompt},
{"role": "user", "content": f"What should be the next playwright command.?"},

]
)
return completion.choices[0].message.content

The above prompt will either produce a playwright command that does something on browser, or it will create a python command that stores the required output in our local_storage_list.

Sometimes the code produced might have a python flag on it. So just to be extra cautious we’ll make the code go through a sanitation.

import re
def extract_code(output: str) -> str:
# Define a regex pattern to match the code block
pattern = r"```python\n(.*?)\n```"
# Search for the pattern in the output string
match = re.search(pattern, output, re.DOTALL)
# If a match is found, return the captured group
if match:
return match.group(1)
else:
return output

next_command = extract_code(next_command)

Finally to break the loop we can write a prompt checking every-time if sub_task_list is completed.

Depending on the task this might also need to keep looking at the source code. In our sample problem we don’t know how many documents are there so LLM will need to keep looking at source code to check if something is remaining.


def check_if_task_list_completed(task, task_list, commands, page_content):
completed_prompt = """
You are an assistant tasked with determining whether all sub-tasks required for a given high-level task have been completed based on the Playwright automation tool. Using the provided list of sub-tasks, the list of executed commands, and the current page source, decide if further actions are required or if the task is complete.
\n
Original Task:
{task}
\n
Sub-Tasks List:
{task_list}
\n
Executed Commands:
{commands}
\n
Current Page Content:
{page_content}
\n
Return 'True' if more commands need to be generated to complete the remaining sub-tasks, or 'False' if all sub-tasks are completed and the automation can be stopped.
"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": completed_prompt},
{"role": "user", "content": f"Are all the tasks completed?"},

]
)
return completion.choices[0].message.content

Finally let’s put it all together.



import re



task = """Go to https://www.sec.gov/edgar/searchedgar/companysearch. Search for Nvidia. You will see a list of links for documents.
We need to click at each and another page will open. Copy the link for all documents. I need the output which will be all the links on that page.
After getting links for one page go back do same for next link."""


def get_task_list(task):
task_list_creation_prompt = """
You are an AI assistant that, given a high-level task, will break it down into smaller subtasks. Each subtask should be simple enough to be executed sequentially by the automation tool Playwright without using loops. Write the subtasks as a numbered list. Each step should be actionable and straightforward, ensuring the automation progresses sequentially as it communicates continuously with an external processor for updates.

Example:
1. Go to page XYZ.
2. Check for the presence of element ABC.
3. Fill out form DEF if it is present.
4. Click submit button GHI.

"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": task_list_creation_prompt},
{"role": "user", "content": f"Write the task list for this task: {task}"},
]
)
return completion.choices[0].message.content





def write_goto_command(task):
goto_command_prompt = """
I am giving you the task that needs to be executed by automation software playwright.
Your task is just to tell given the task, which URL should playwright go to first.
Just output url and nothing else
"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": goto_command_prompt},
{"role": "user", "content": f"What is the URL for this task: {task}"},
]
)
return completion.choices[0].message.content





def update_task_list(task, task_list, commands, page_content):
task_list_updation_prompt = f"""
You are an AI assistant that is helping update a task list based on the current state of execution. Given the original high-level task, the existing task list, the commands that have been executed, and the current page content, update the task list to reflect what still needs to be done or adjust the approach based on the page's current state.
Original Task:
{task}
\n
Initial Task List:
{task_list}
\n
Executed Commands:
{commands}
\n
Current Page Content:
{page_content}
\n
Based on this information, provide a revised list of subtasks, ensuring each can be executed by Playwright sequentially without loops. Adjust the tasks as needed based on the changes observed in the page content.
\n
"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": task_list_updation_prompt},
{"role": "user", "content": f"Write an updated task list based on the current page content. Make the changes only if updates are required?"},

]
)
return completion.choices[0].message.content





def get_next_command(tasks, commands, page_content):
playwright_command_prompt = f"""
You are an assistant tasked with generating executable code snippets for Playwright and Python. Output Playwright commands to perform web operations or Python commands to store web content in a list named `local_storage_list`.
Avoid using JavaScript decorators, backticks, and the `await` keyword.
\n
Task List:
{tasks}
\n
Executed Command Till Now:
{commands}
\n
Current Page Content:
{page_content}
\n
Based on this context, output the necessary code to complete the next step in the task list. Ensure the code is suitable for sequential execution and directly interacts with `local_storage_list` for storing data as needed.
Just output the code, nothing else. Also output will be just one-line code that needs to be executed next.
\n
"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": playwright_command_prompt},
{"role": "user", "content": f"What should be the next playwright command.?"},

]
)
return completion.choices[0].message.content






def check_if_task_list_completed(task, task_list, commands, page_content):
completed_prompt = """
You are an assistant tasked with determining whether all sub-tasks required for a given high-level task have been completed based on the Playwright automation tool. Using the provided list of sub-tasks, the list of executed commands, and the current page source, decide if further actions are required or if the task is complete.
\n
Original Task:
{task}
\n
Sub-Tasks List:
{task_list}
\n
Executed Commands:
{commands}
\n
Current Page Content:
{page_content}
\n
Return 'True' if more commands need to be generated to complete the remaining sub-tasks, or 'False' if all sub-tasks are completed and the automation can be stopped.
"""
completion = client.chat.completions.create(
model=model,
messages=[
{"role": "system", "content": completed_prompt},
{"role": "user", "content": f"Are all the tasks completed?"},

]
)
return completion.choices[0].message.content




def extract_code(output: str) -> str:
# Define a regex pattern to match the code block
pattern = r"```python\n(.*?)\n```"
# Search for the pattern in the output string
match = re.search(pattern, output, re.DOTALL)
# If a match is found, return the captured group
if match:
return match.group(1)
else:
return output




# Let's import the basic requirements
from playwright.sync_api import sync_playwright
from openai import OpenAI

# Defining variables needed to interact with OpenAI's LLM
client = OpenAI(api_key='YOUR-OPENAI-API-KEY')
model = "gpt-4o"

# Initalizing our playwright instance
playwright = sync_playwright().start()
browser = playwright.chromium.launch(headless=False)
page = browser.new_page()


commands = ""
local_storage_list = []


url = write_goto_command(task)
next_command = f"page.goto('{url}')"
next_command = extract_code(next_command)



exec(next_command)
commands = commands + "\n" + next_command

page_content = page.content()



while check_if_task_list_completed(task, task_list, commands, page_content).lower() == "true":
task_list = update_task_list(task, task_list, commands, page_content)
next_command = get_next_command(tasks, commands, page_content)
next_command = extract_code(next_command)
exec(next_command)
page_content = page.content()
commands = commands + "\n" + next_command


While this approach works, it will be extremely expensive and redundant to send whole page source to OpenAI APIs, it will create a robust agent which can perform the task every time.

Also this approach won’t directly work if page source code is really large.

In next article we will explore how can we reduce this cost, by several techniques. One of the way we can reduce the number of tokens is sending only the relevant part of webpage.

--

--