Easy text analysis with SQL and LLMs
Comparing LLM functions in Snowflake, BigQuery, Databricks... and spreadsheets
LLMs have made previously complex natural language processing tasks accessible to basically anyone (e.g. classifying support tickets, extracting addresses, summarizing text). But while pasting data into ChatGPT is great, what do you do if you want to scale out to hundreds or thousands of entries?
LLM functions in SQL present a compelling solution that scales to thousands of rows while being accessible to any analyst.
In this series of posts we’ll deep dive into solving a realistic use case (determining the intent of customer support queries) using LLM functions. All the major data warehouses recently added SQL functions for LLM completions:
Snowflake Cortex: CORTEX.COMPLETE (announced November 2023)
BigQuery ML: ML.GENERATE_TEXT (Gemini support launched in March 2024)
Databricks: AI functions (announced April 2023)
For new users, all of the platforms above offer $300-400 in free trial credits if you want to follow along with our prompts below.
We’ll go through the basics of how to get good results and explore the different performance and cost characteristics of each provider (and finally, for non-SQL users — our approach for exposing these capabilities through spreadsheets). In our next post we will improve on these results with a few techniques that are a bit more advanced.
The data: ticket categorization
As a realistic use case, we’ll look at the open-source “Banking77” dataset, which consists of 13,000 customer support queries for a hypothetical online bank with 77 distinct categories.
The sheer number of categories (and the amount of nuance between them) poses a significant challenge for humans and AI alike. (Randomly guessing a category only gets you ~1.3% accuracy 😉). For reference, specifically trained NLP models can achieve ~94% accuracy.
Few-shot prompting for LLM classification
Although each vendor has a slightly different set of models available, we’ll start with a fairly generic system prompt for generating classification responses. We came up with the following (granted, there’s probably still some room to improve it!):
You are an expert at triaging customer support queries.
Categorize user messages into one of the following banking-related intents:
Intent: activate_my_card
Examples: Can you help me activate my card; I need assistance activating my card?
Intent: age_limit
Examples: Is there a minimum age for having an account?; Would I be able to open an account for my daughter?
[...repeat for all 77 intents]
Respond to each user message with the most appropriate intent and nothing else.
The full system prompt (with all 77 intents listed) is available here for reference.
Because of the subtlety of the questions, it’s very important to list out both the intents and some examples for each one (we grabbed two examples from the training set to ensure that none of the examples are accidentally used for evaluation).
For many lighter models (e.g. Llama 3 8B) we also found it necessary to provide 1-2 additional “few-shot examples” in the message log to correctly induce the model to respond with a clean and verifiable intent (i.e. without additional explanations or formatting). This will be shown in the Snowflake examples below.
Getting the Banking77 dataset
To follow along with the examples below, you can download the train.csv and test.csv files from the PolyAI GitHub repository and upload them as tables into each data warehouse. We consistently use test.csv below (3,080 rows) to report evaluation results (train.csv is used later for fine-tuning and retrieval testing).
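For example, a minimal sketch of loading test.csv into Snowflake (the table name banking77_test is our own choice and is what the queries below assume; the other warehouses have equivalent CSV upload flows, or you can simply use each web UI's file upload):

CREATE OR REPLACE TABLE banking77_test (text STRING, category STRING);

CREATE OR REPLACE FILE FORMAT banking77_csv
  TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"';

-- From SnowSQL (or another client that supports PUT), stage the local file:
-- PUT file:///path/to/test.csv @%banking77_test;

COPY INTO banking77_test
  FROM @%banking77_test
  FILE_FORMAT = (FORMAT_NAME = 'banking77_csv');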
Snowflake Cortex (CORTEX.COMPLETE)
We’ll start with the SNOWFLAKE.CORTEX.COMPLETE function (Note: availability is still limited to select AWS and Azure regions as of June 2024.)
New users can create a trial Snowflake account and get access to $400 in credits.
Using it is as simple as writing:
SELECT SNOWFLAKE.CORTEX.COMPLETE(
'snowflake-arctic', -- model
'What is a large language model?' -- prompt_or_history
);
For a more complete evaluation, we found the full ‘object form’ (i.e. with an array prompt_or_history and an object as options) to be most convenient:
SELECT
text,
category,
SNOWFLAKE.CORTEX.COMPLETE(
'llama3-70b',
[
{'role': 'system', 'content': 'You are an expert at triaging customer support queries.
Categorize user messages into one of the following banking-related intents:
...
Respond to each user message with the most appropriate intent and nothing else.'},
{'role': 'user', 'content': 'I am still waiting on my card?'},
{'role': 'assistant', 'content': 'card_arrival'},
{'role': 'user', 'content': 'Someone stole my card!'},
{'role': 'assistant', 'content': 'lost_or_stolen_card'},
{'role': 'user', 'content': text}
],
{'max_tokens': 20} -- intents never exceed 20 tokens
) AS complete_result,
complete_result:choices[0].messages::string AS predicted_category,
complete_result:usage.prompt_tokens::int AS prompt_tokens,
complete_result:usage.completion_tokens::int AS completion_tokens,
-- handle slight errors in LLM output
LOWER(TRIM(predicted_category)) = category AS is_correct
FROM
banking77_test; -- uploaded test.csv
This evaluation query illustrates some of the main appeals of running LLMs from SQL:
It’s incredibly easy to do things in batch - and you can then chain the results with normal SQL aggregations to compute the overall accuracy, total prompt_tokens, or accuracy for each intent (see the sketch below).
It also greatly speeds up iteration when adjusting the prompt - you can quickly preview results with a LIMIT clause before evaluating on the full dataset.
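For instance, a minimal sketch of that aggregation step (assuming the evaluation query above has been saved as a view, which we've hypothetically named banking77_eval):

SELECT
  category,
  COUNT(*) AS num_queries,
  AVG(IFF(is_correct, 1, 0)) AS accuracy,             -- per-intent accuracy
  SUM(prompt_tokens) AS total_prompt_tokens,
  SUM(completion_tokens) AS total_completion_tokens
FROM banking77_eval
GROUP BY category
ORDER BY accuracy ASC;  -- surface the intents the model struggles with most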
Results
model accuracy tokens_in tokens_out cost latency
mistral-large 80.26% 10,632,895 30,873 $108.77 347s
snowflake-arctic 56.88% 11,071,311 23,203 $26.63† 556s
llama3-70b 80.23% 9,358,853 15,084 $22.68 314s
mixtral-8x7b 74.19% 10,642,135 24,173 $4.69 145s
llama3-8b 68.60% 9,358,853 16,138 $3.56 70s
Llama3 70B strikes a good balance between cost and accuracy here, scoring just about as well as the significantly more expensive mistral-large while being 4x cheaper ($2.42/million tokens¹ vs. $10.20/million tokens).
mixtral-8x7b is a close second - at $0.44/million tokens it’s 5x cheaper than Llama3 70B with comparable accuracy. (This could likely be improved with some Mistral model-specific prompt adjustments; one observation was that the Mistral models may have some tokenization issues with spurious backslashes (\) appearing in the output.)
† snowflake-arctic is currently still free during its preview period; we estimated the pricing via Together.ai ($2.40/1M tokens). However, we found that it struggled with the prompt shown above - possibly an issue with the length of the system prompt? In particular, it didn’t adhere to the list of intents and often responded with full commentary (despite the examples in the prompt).
Note that the Llama3-based models use a newer, more efficient tokenizer (about 12% better than the Mistral models here) which improves the actual pricing slightly in their favor.
Snowflake’s batch inference throughput is extremely impressive (~134K input tokens/second with Llama3 8B, ~30K input tokens/second with Llama3 70B). If you tried to do this with a normal cloud-hosted inference platform, you’d likely hit rate limit restrictions.
For example, Azure OpenAI has a default maximum of 450K tokens/minute, which means that you’d have to wait at least 20 minutes for this dataset.
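(As a back-of-the-envelope check: the throughput figures above are just the ~9.36M prompt tokens from the results table divided by the 70s and 314s latencies, and 9.36M tokens / 450K tokens per minute ≈ 21 minutes.)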
You do pay a premium for the throughput, though: Llama3 70B only costs $0.64/1M tokens on Groq and $0.90/1M tokens from Together.ai/Fireworks (roughly 3-4x cheaper than Cortex).
Importantly, the speed/throughput of CORTEX.COMPLETE seems to be largely independent of the Snowflake warehouse type. Digging into the query profile shows an ExternalFunction named COMPLETE$V2, which seems to be invoked in parallel with batches of ~24 rows.
Also check out Cybersyn’s AI Utilities, which expose similar UDFs for calling OpenAI and Anthropic models! In our testing, we noted that it does make one API call per row, which may lead to rate limit errors depending on your account limits and query size.
BigQuery ML (ML.GENERATE_TEXT)
ML.GENERATE_TEXT works in a pretty similar way and supports both Google’s flagship Gemini 1.5 Pro and Gemini 1.5 Flash models. (You do also have to do some setup work to create a ‘Cloud resource connection’ to bridge between BigQuery and Vertex AI).
Google Cloud offers $300 in free credit for new users.
A basic SQL generation query looks like this:
-- replace your_project and your_dataset accordingly
-- (create a dataset if you don't have one already)
CREATE MODEL IF NOT EXISTS `your_project.your_dataset.gemini-pro`
-- name of your Vertex AI "Cloud resource connection"
-- (assuming your region is `us`)
REMOTE WITH CONNECTION `your_project.us.vertex_ai`
-- or gemini-1.5-flash, etc
OPTIONS(ENDPOINT = 'gemini-1.5-pro');
SELECT
prompt,
ml_generate_text_llm_result,
ml_generate_text_status
FROM
ML.GENERATE_TEXT(
MODEL `your_dataset.gemini-pro`,
(SELECT 'What is a large language model?' AS prompt),
STRUCT(
0.2 AS temperature,
75 AS max_output_tokens,
TRUE AS flatten_json_output));
Costs are billed according to Vertex AI pricing (note that Gemini uses non-whitespace characters, not tokens, as its unit of billing). As of June 2024:
Gemini 1.5 Pro: $1.25/1M chars in, $3.75/1M chars out
Gemini 1.5 Flash: $0.125/1M chars in, $0.375/1M chars out (10x cheaper)
The fine print notes that “Gemini models are available in batch mode at 50% discount”. We confirmed that BigQuery ML usage was in fact billed as batch predictions at the 50% discounted rate.
For this evaluation set, although ML.GENERATE_TEXT doesn’t support more nuanced system/user/assistant prompting, we found that the Gemini 1.5 models have no problem adhering to a simpler prompt/query of this form:
SELECT
*,
LOWER(TRIM(ml_generate_text_llm_result)) = category AS is_correct
FROM ML.GENERATE_TEXT(
MODEL `banking77.gemini-pro`,
(SELECT
text,
category,
REPLACE('''Given the following list of banking-related intents for user queries:
Intent: activate_my_card
Examples: Can you help me activate my card; I need assistance activating my card?
...(other examples)
Please classify this user query with the most appropriate intent: {query}
Respond with only the classified intent with no other formatting.''',
'{query}',
text
) AS prompt
FROM `banking77.test`),
STRUCT(
0 AS temperature,
20 AS max_output_tokens,
TRUE AS flatten_json_output,
-- the default (medium) safety thresholds are too aggressive
[STRUCT('HARM_CATEGORY_HATE_SPEECH' AS category,
'BLOCK_ONLY_HIGH' AS threshold),
STRUCT('HARM_CATEGORY_DANGEROUS_CONTENT' AS category,
'BLOCK_ONLY_HIGH' AS threshold),
STRUCT('HARM_CATEGORY_HARASSMENT' AS category,
'BLOCK_ONLY_HIGH' AS threshold),
STRUCT('HARM_CATEGORY_SEXUALLY_EXPLICIT' AS category,
'BLOCK_ONLY_HIGH' AS threshold)] AS safety_settings));
Results
model accuracy chars_in chars_out cost latency
gemini-1.5-pro 82.40% 33,862,246 62,840 $21.28 3264s
gemini-1.5-flash 81.66% 33,862,246 62,840 $2.13 1216s
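(As a sanity check on the batch discount: ~33.86M input chars × $0.125/1M chars × 50% ≈ $2.12 for Flash, which lines up with the billed cost above.)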
Unsurprisingly, the Gemini models produce better results than any of the open-source models above (matching their public benchmark standings).
Gemini 1.5 Flash in particular hits an excellent price/performance ratio for this kind of text processing use case.
While the overall batch throughput is still a good deal faster than issuing individual requests through the API, it’s significantly slower than the equivalent model classes in Snowflake. Still, ~20 minutes for ~3,000 rows (with Flash) is pretty tolerable for many use cases.
Gemini’s default safety filters seem to be pretty aggressive. This dataset is about as SFW as it gets, but a good ~10% of queries tripped the medium content filters - even innocuous queries like “Where can I view my PIN?”. Fortunately, increasing the safety thresholds to BLOCK_ONLY_HIGH in the query above resolves this.
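A quick way to quantify how many rows were blocked is to group by the status column, which is empty on success - a small sketch (assuming the raw ML.GENERATE_TEXT output has been saved to a table, hypothetically named banking77.predictions):

SELECT
  ml_generate_text_status,   -- empty string when generation succeeded
  COUNT(*) AS num_rows
FROM `banking77.predictions`
GROUP BY ml_generate_text_status
ORDER BY num_rows DESC;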
Databricks (AI_QUERY)
On Databricks, the AI_QUERY function can be used with any of the models in the Foundation Model API page (including their own DBRX Instruct model, but also the Mixtral-8x7b and Llama3 models.)
Databricks also offers $400 in trial credits for new users.
It’s just as easy to use as the others (though, at this point, it doesn’t offer the same kind of options, e.g. temperature, max tokens, etc.):
SELECT ai_query(
'databricks-meta-llama-3-70b-instruct',
'What is a large language model?'
);
The rest of the evaluation query is basically the same as the ones above.
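For completeness, here's a minimal sketch of that query (assuming the same banking77_test table as before, and abbreviating the intent list from the system prompt above):

SELECT
  *,
  LOWER(TRIM(predicted_category)) = category AS is_correct
FROM (
  SELECT
    text,
    category,
    ai_query(
      'databricks-meta-llama-3-70b-instruct',
      CONCAT(
        'You are an expert at triaging customer support queries. ',
        'Categorize user messages into one of the following banking-related intents:\n',
        '...\n',   -- the full list of 77 intents and examples, as above
        'Respond to each user message with the most appropriate intent and nothing else.\n\n',
        'User message: ', text
      )
    ) AS predicted_category
  FROM banking77_test
) AS predictions;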
Results
We tried both the databricks-meta-llama-3-70b-instruct and databricks-dbrx-instruct models; the results with Llama 3 70B were, not surprisingly, comparable to the results we got on Snowflake.
However, while the cost of inference through Databricks is significantly cheaper ($1.00/1M tokens² vs. $2.42/1M), the overall latency was the slowest out of all the providers we tested - clocking in at just about 1h 30m to process the full test dataset. So it may not be the best option if you’re looking for faster responses (unless there’s a Spark flag we’re missing - at ~1.73s per row it seems it might be processing each row in sequence).
Arcwise (AI.CLASSIFY)
What about users who can’t write SQL? We’ve got you covered!
In Arcwise AI, we recently added the ability to batch process text through LLM-powered AI functions in Google Sheets ($AI.CLASSIFY, $AI.EXTRACT, $AI.TRANSFORM).
Though we don’t use SQL-based LLM functions (yet), we internally apply a bag of tricks to efficiently perform batch transformations like classification and extraction by processing hundreds or even thousands of function invocations with one GPT or Claude request. This reduces token costs by up to 60x compared to the SQL examples above (though it can be done in SQL too — an exercise for the reader!)
An example system prompt for processing lists is pretty simple (we do have some additional tweaks to curb issues like outputs not lining up with inputs; another exercise for the reader 😉):
You are a data processing expert.
The user will provide a JSON list of values, along with instructions for how to process the values.
Output a JSON list of processed values without any extra whitespace or formatting. Output null if the value could not be processed.
# Example
Given the following values:
["zebra","apple","test","banana"]
Classify the values into categories: animal, fruit
# Expected output
["animal","fruit",null,"fruit"]
# Example
Given the following values:
["John Doe","Jane Smith","Alice Johnson","2020-01-01"]
Extract the following fields from the values: first name
# Expected output
["John","Jane","Alice",null]
User prompt:
Given the following values:
{list}
Classify the values into one of the categories:
# activate_my_card
Examples: Can you help me activate my card; I need assistance activating my card?
# ... (repeat for 77 intents)
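As a rough, untested sketch of the “exercise for the reader” above, here's one way the same batching trick could look in Snowflake SQL (the batch size of 100, the abbreviated prompts, and the view names are all our own placeholders):

WITH numbered AS (
  SELECT
    text,
    FLOOR((ROW_NUMBER() OVER (ORDER BY text) - 1) / 100) AS batch_id  -- ~100 rows per request
  FROM banking77_test
),
batches AS (
  SELECT batch_id, ARRAY_AGG(text) AS texts
  FROM numbered
  GROUP BY batch_id
),
completions AS (
  SELECT
    texts,
    SNOWFLAKE.CORTEX.COMPLETE(
      'llama3-70b',
      [
        {'role': 'system', 'content': 'You are a data processing expert. ...'},   -- the list-processing prompt above
        {'role': 'user', 'content': 'Given the following values:\n' || TO_JSON(texts) ||
          '\nClassify the values into one of the categories:\n...'}               -- the intent list as above
      ],
      {'max_tokens': 4000}
    ) AS complete_result,
    complete_result:choices[0].messages::string AS raw_output
  FROM batches
)
-- flatten the returned JSON list back into one row per input value
SELECT
  GET(c.texts, f.index)::string AS text,
  f.value::string AS predicted_category
FROM completions c,
     LATERAL FLATTEN(input => TRY_PARSE_JSON(c.raw_output)) f;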
This drastically improves token usage and latency for batch use cases like classification, because all the context of each category and its description (in this case, around 3000 tokens) no longer has to be repeated for each individual row.
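As a back-of-the-envelope: prompting row-by-row costs roughly 3,080 rows × ~3,000 tokens ≈ 9.2M prompt tokens (which matches the Snowflake numbers above), versus the ~140-150K tokens reported below when the intent list is only sent once per batch - roughly the 60x reduction mentioned earlier.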
Here’s how we fare on this test set using GPT-4o (via Azure OpenAI) and Claude 3.5 Sonnet (via AWS Bedrock), using the exact list of intents and examples shown in the prompts above:
model accuracy tokens_in tokens_out cost latency
gpt-4o 79.84% 142,626 21,618 $1.04 59s
claude-3.5-sonnet 82.08% 147,876 36,021 $0.98 223s
We were a bit surprised to see that both models performed basically on par with the other leading foundation models (this task might be a little too simple). Claude slightly edges out GPT-4o here but is quite a bit slower (at least on AWS Bedrock), though this is something that is likely to improve over time.
Nevertheless, this approach lets us deliver comparable results at a very reasonable cost/latency (we can also decrease latency by querying our batches with greater parallelism, at the risk of hitting API rate limits). You can test out how it fares with your own spreadsheets through our free Chrome extension!
Conclusions
As of June 2024, our recommendations would be:
Gemini 1.5 Flash (via BigQuery) offers an extremely compelling price-to-performance ratio for text processing tasks, though you’ll have to wait a while for larger datasets (it’s still easier and faster than making requests through the API!)
Snowflake Cortex offers the fastest batch inference latencies and a wide variety of open-source models, albeit at a slight price premium. It also offers easy-to-use finetuning of Llama and Mistral models, which can significantly change the price-to-performance frontier for specific use cases (we’ll go into this in the next article!)
Part 2: Using fine-tuning and RAG
In part 2 of this guide, we’ll go over how to use more advanced methods including fine-tuning and RAG (both through only SQL queries!) to improve the quality of responses - in the end reaching ~95% accuracy at fractions of the cost. (Spoiler alert: for this task, simpler methods still win over LLMs).
Subscribe for updates, and please reach out to us in the meantime if you have any questions about this post!
¹ Prices as of June 2024. Snowflake-quoted prices are in terms of Snowflake credits, which are priced at USD $2/credit on the Standard plan ($3/credit on Enterprise).
² Actual prices are in DBU, which have a variable conversion rate to USD.