Text to SQL Natural Language Query on TPC-H — A quick LangChain Agent example
Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.
I keep a close check on BIRD-Bench , which hosts a leader board of Foundation Models on text-to-sql tasks.
Here’s a snippet :
Satisfying to see Human performance still reigns supreme :-) Only, to be broken by a FM soon ! !
I got inspired by the wonderful blog by Pradip Nichite and decided to take a stab at using LangChain Agents and OpenAI GPT 3.5 Turbo on the famous TPC-H schema.
To make the experiment fast, I decided to download the TPC-H SQLite database, which stood at 1.17GB, a decent number of rows to test our model.
Here’s the link for the DB : TPC-H.db , ever thankful to https://github.com/lovasoa/ for keeping the file.
Now, that we have the TPC-H database, let’s quickly inspect the schema, from TPC.org :
As usual, let’s create a virtual environment and install dependencies :
virtualenv gpttext2sql
pip install langchain openai notebook
Open a fresh notebook (My notebook hyperlink is attached in the end) and let’s run through the experiment :
import os
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"
Ideally, these API keys should be kept in a “.env” , but, let me run through the experiment with hardcoded API keys.
from langchain.agents.agent_toolkits import SQLDatabaseToolkit
from langchain.sql_database import SQLDatabase
from langchain.agents import create_sql_agent
from langchain.agents import AgentExecutor
from langchain.chat_models import ChatOpenAI
This imports the necessary LangChain SQL agent classes.
db = SQLDatabase.from_uri("sqlite:///D:/sw/sqlite/TPC-H.db")
Set the SQLite TPC-H database URI
llm = ChatOpenAI(model_name="gpt-3.5-turbo")
Instantiate “GPT-3.5-Turbo”
toolkit = SQLDatabaseToolkit(db=db, llm=llm)
agent_executor = create_sql_agent(
llm = llm,
toolkit = toolkit,
verbose = True
)
Prepare the LangChain agent with the db and LLM references
Now, we can run these agents and find out how accurate the Natural Language Queries(NLQ) are :
agent_executor.run("Describe the orders table")
Let’s analyze the response :
> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Observation: CUSTOMER, LINEITEM, NATION, ORDERS, PART, PARTSUPP, REGION, SUPPLIER
Thought:The "orders" table is present in the database. I should query the schema of the "orders" table to get more information about its columns.
Action: sql_db_schema
Action Input: "ORDERS"
Observation:
CREATE TABLE "ORDERS" (
"O_ORDERKEY" INTEGER NOT NULL,
"O_CUSTKEY" INTEGER NOT NULL,
"O_ORDERSTATUS" TEXT NOT NULL,
"O_TOTALPRICE" INTEGER NOT NULL,
"O_ORDERDATE" DATE NOT NULL,
"O_ORDERPRIORITY" TEXT NOT NULL,
"O_CLERK" TEXT NOT NULL,
"O_SHIPPRIORITY" INTEGER NOT NULL,
"O_COMMENT" TEXT NOT NULL,
PRIMARY KEY ("O_ORDERKEY"),
FOREIGN KEY("O_CUSTKEY") REFERENCES "CUSTOMER" ("C_CUSTKEY")
)
/*
3 rows from ORDERS table:
O_ORDERKEY O_CUSTKEY O_ORDERSTATUS O_TOTALPRICE O_ORDERDATE O_ORDERPRIORITY O_CLERK O_SHIPPRIORITY O_COMMENT
1 36901 O 173665.47 1996-01-02 5-LOW Clerk#000000951 0 nstructions sleep furiously among
2 78002 O 46929.18 1996-12-01 1-URGENT Clerk#000000880 0 foxes. pending accounts at the pending, silent asymptot
3 123314 F 193846.25 1993-10-14 5-LOW Clerk#000000955 0 sly final accounts boost. carefully regular ideas cajole carefully. depos
*/
Thought:The "orders" table has the following columns:
- O_ORDERKEY
- O_CUSTKEY
- O_ORDERSTATUS
- O_TOTALPRICE
- O_ORDERDATE
- O_ORDERPRIORITY
- O_CLERK
- O_SHIPPRIORITY
- O_COMMENT
Final Answer: The "orders" table has the following columns: O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT.
> Finished chain.
'The "orders" table has the following columns: O_ORDERKEY, O_CUSTKEY, O_ORDERSTATUS, O_TOTALPRICE, O_ORDERDATE, O_ORDERPRIORITY, O_CLERK, O_SHIPPRIORITY, O_COMMENT.'
This was a perfect DESCRIBE command NLQ !
Let’s “Find the top unshipped orders with the highest value” and check how good the SQL is :
agent_executor.run("find the top unshipped orders with the highest value")
Here’s the output :
> Entering new AgentExecutor chain...
Action: sql_db_list_tables
Action Input: ""
Observation: CUSTOMER, LINEITEM, NATION, ORDERS, PART, PARTSUPP, REGION, SUPPLIER
Thought:I can query the ORDERS table to find unshipped orders with their values. I also need to check the schema of the ORDERS table to determine which columns to include in my query.
Action: sql_db_schema
Action Input: "ORDERS"
Observation:
CREATE TABLE "ORDERS" (
"O_ORDERKEY" INTEGER NOT NULL,
"O_CUSTKEY" INTEGER NOT NULL,
"O_ORDERSTATUS" TEXT NOT NULL,
"O_TOTALPRICE" INTEGER NOT NULL,
"O_ORDERDATE" DATE NOT NULL,
"O_ORDERPRIORITY" TEXT NOT NULL,
"O_CLERK" TEXT NOT NULL,
"O_SHIPPRIORITY" INTEGER NOT NULL,
"O_COMMENT" TEXT NOT NULL,
PRIMARY KEY ("O_ORDERKEY"),
FOREIGN KEY("O_CUSTKEY") REFERENCES "CUSTOMER" ("C_CUSTKEY")
)
/*
3 rows from ORDERS table:
O_ORDERKEY O_CUSTKEY O_ORDERSTATUS O_TOTALPRICE O_ORDERDATE O_ORDERPRIORITY O_CLERK O_SHIPPRIORITY O_COMMENT
1 36901 O 173665.47 1996-01-02 5-LOW Clerk#000000951 0 nstructions sleep furiously among
2 78002 O 46929.18 1996-12-01 1-URGENT Clerk#000000880 0 foxes. pending accounts at the pending, silent asymptot
3 123314 F 193846.25 1993-10-14 5-LOW Clerk#000000955 0 sly final accounts boost. carefully regular ideas cajole carefully. depos
*/
Thought:I can query the ORDERS table to find unshipped orders with their values. The relevant columns to include in my query are O_ORDERKEY, O_TOTALPRICE, and O_SHIPPRIORITY. I should order the results by O_TOTALPRICE in descending order.
Action: sql_db_query
Action Input: "SELECT O_ORDERKEY, O_TOTALPRICE, O_SHIPPRIORITY FROM ORDERS WHERE O_ORDERSTATUS = 'O' ORDER BY O_TOTALPRICE DESC LIMIT 10"
Observation: [(3043270, 530604.44, 0), (4576548, 525590.57, 0), (2232932, 522720.61, 0), (2199712, 515531.82, 0), (4745607, 508047.99, 0), (3967937, 502906.33, 0), (1395745, 502742.76, 0), (5709632, 499753.01, 0), (2152359, 498599.91, 0), (1672039, 495148.48, 0)]
Thought:I have found the top unshipped orders with the highest value. Here are the order keys, total prices, and shipping priorities for the top 10 orders:
1. Order Key: 3043270, Total Price: 530604.44, Shipping Priority: 0
2. Order Key: 4576548, Total Price: 525590.57, Shipping Priority: 0
3. Order Key: 2232932, Total Price: 522720.61, Shipping Priority: 0
4. Order Key: 2199712, Total Price: 515531.82, Shipping Priority: 0
5. Order Key: 4745607, Total Price: 508047.99, Shipping Priority: 0
6. Order Key: 3967937, Total Price: 502906.33, Shipping Priority: 0
7. Order Key: 1395745, Total Price: 502742.76, Shipping Priority: 0
8. Order Key: 5709632, Total Price: 499753.01, Shipping Priority: 0
9. Order Key: 2152359, Total Price: 498599.91, Shipping Priority: 0
10. Order Key: 1672039, Total Price: 495148.48, Shipping Priority: 0
Final Answer: The top unshipped orders with the highest value are the orders with the order keys 3043270, 4576548, 2232932, 2199712, 4745607, 3967937, 1395745, 5709632, 2152359, and 1672039.
> Finished chain.
Interesting to see how the “Thought” breaks down these tasks into smaller mini tasks and validates with the “Observation”.
Needless to say, with an increase in the complexity of the schema and query, these agents will not be able to to generate a “Perfect 100” query all the while. But, SQL tasks, like chess, can be broken down into smaller sub logic tasks and can be solved.
Challenge is the wide variety of business problems which are modelled using SQL schema which needs to be solved.
Keep an eye on the BIRD-Bench !
In another article, I will try a similar SQL LangChain agent NLQ on a mysql database.
Here’s the notebook attached.
So long !