Home Up PDF Prof. Dr. Ingo Claßen

Spider 2 Lite Questions

Example 1

instance_id

bq011

db

ga4

question

How many distinct pseudo users had positive engagement time in the 7-day period ending on January 7, 2021 at 23:59:59, but had no positive engagement time in the 2-day period ending on the same date (January 7, 2021 at 23:59:59) ?

external_knowledge

ga4_obfuscated_sample_ecommerce.events.md

Example 2

instance_id

bq010

db

ga360

question

Find the top-selling product among customers who bought ‘Youtube Men’s Vintage Henley’ in July 2017, excluding itself.

external_knowledge

google_analytics_sample.ga_sessions.md

Example 3

instance_id

bq009

db

ga360

question

Which traffic source has the highest total transaction revenue for the year 2017, and what is the difference in millions (rounded to two decimal places) between the highest and lowest monthly total transaction revenue for that traffic source?

external_knowledge

google_analytics_sample.ga_sessions.md

Example 4

instance_id

bq001

db

ga360

question

For each visitor who made at least one transaction in February 2017, how many days elapsed between the date of their first visit in February and the date of their first transaction in February, and on what type of device did they make that first transaction?

external_knowledge

google_analytics_sample.ga_sessions.md

Example 5

instance_id

bq002

db

ga360

question

During the first half of 2017, focusing on hits product revenue, which traffic source generated the highest total product revenue, and what were the maximum daily, weekly, and monthly product revenues (in millions) for that top-performing source over this period?

external_knowledge

google_analytics_sample.ga_sessions.md

Example 6

instance_id

bq003

db

ga360

question

Between April 1 and July 31 of 2017, using the hits product revenue data along with the totals transactions to classify sessions as purchase (transactions ≥ 1 and productRevenue not null) or non- purchase (transactions null and productRevenue null), compare the average pageviews per visitor for each group by month

external_knowledge

google_analytics_sample.ga_sessions.md

Example 7

instance_id

bq004

db

ga360

question

In July 2017, among all visitors who bought any YouTube-related product, which distinct product—excluding those containing ‘YouTube’ in the product name—had the highest total quantity purchased?

external_knowledge

google_analytics_sample.ga_sessions.md

Example 8

instance_id

bq008

db

ga360

question

In January 2017, among visitors whose campaign name contains ‘Data Share’ and who accessed any page starting with ‘/home’, which page did they most commonly visit next, and what is the maximum time (in seconds) they spent on the ‘/home’ page before moving on?

external_knowledge

google_analytics_sample.ga_sessions.md

Example 9

instance_id

bq269

db

ga360

question

Between June 1, 2017, and July 31, 2017, consider only sessions that have non-null pageviews. Classify each session as ‘purchase’ if it has at least one transaction, or ‘non_purchase’ otherwise. For each month, sum each visitor’s total pageviews under each classification, then compute the average pageviews per visitor for both purchase and non- purchase groups in each month, and present the results side by side.

external_knowledge

None

Example 10

instance_id

bq268

db

ga360

question

Identify the longest number of days between the first visit and the last recorded event (either the last visit or the first transaction) for a user, where the last recorded event is associated with a mobile device. The last recorded event could either be the last visit or the first transaction, and you should focus on users whose last recorded event occurred on a mobile device.

external_knowledge

None

Example 11

instance_id

bq270

db

ga360

question

What were the monthly add-to-cart and purchase conversion rates, calculated as a percentage of pageviews on product details, from January to March 2017?

external_knowledge

ga360_hits.eCommerceAction.action_type.md

Example 12

instance_id

bq275

db

ga360

question

Which visitor IDs belong to users whose first transaction occurred on a device explicitly labeled as ‘mobile’ on a later date than their first visit?

external_knowledge

None

Example 13

instance_id

bq374

db

ga360

question

Calculates the percentage of new users who, between August 1, 2016, and April 30, 2017, both stayed on the site for more than 5 minutes during their initial visit and made a purchase on a subsequent visit at any later time, relative to the total number of new users in the same period.

external_knowledge

None

Example 14

instance_id

sf_bq029

db

PATENTS

question

Get the average number of inventors per patent and the total count of patent publications in Canada (CA) for each 5-year period from 1960 to 2020, based on publication dates. Only include patents that have at least one inventor listed, and group results by 5-year intervals (1960-1964, 1965-1969, etc.).

external_knowledge

None

Example 15

instance_id

sf_bq026

db

PATENTS

question

For the assignee who has been the most active in the patent category ‘A61’, I’d like to know the five patent jurisdictions code where they filed the most patents during their busiest year, separated by commas.

external_knowledge

None

Example 16

instance_id

sf_bq091

db

PATENTS

question

In which year did the assignee with the most applications in the patent category ‘A61’ file the most?

external_knowledge

None

Example 17

instance_id

sf_bq099

db

PATENTS

question

For patent class A01B3, I want to analyze the information of the top 3 assignees based on the total number of applications. Please provide the following five pieces of information: the name of this assignee, total number of applications, the year with the most applications, the number of applications in that year, and the country code with the most applications during that year.

external_knowledge

None

Example 18

instance_id

sf_bq033

db

PATENTS

question

How many U.S. publications related to IoT (where the abstract includes the phrase ‘internet of things’) were filed each month from 2008 to 2022, including months with no filings?

external_knowledge

None

Example 19

instance_id

sf_bq209

db

PATENTS

question

Can you calculate the number of utility patents that were granted in 2010 and have exactly one forward citation within a 10-year window following their application/filing date? For this analysis, forward citations should be counted as distinct citing application numbers that cited the patent within 10 years after the patent’s own filing date.

external_knowledge

None

Example 20

instance_id

sf_bq027

db

PATENTS

question

For patents granted between 2010 and 2018, provide the publication number of each patent and the number of backward citations it has received in the SEA category.

external_knowledge

None

Example 21

instance_id

sf_bq210

db

PATENTS

question

How many US B2 patents granted between 2008 and 2018 contain claims that do not include the word ‘claim’?

external_knowledge

None

Example 22

instance_id

sf_bq211

db

PATENTS

question

Among patents granted between 2010 and 2023 in CN, how many of them belong to families that have a total of over one distinct applications?

external_knowledge

None

Example 23

instance_id

sf_bq213

db

PATENTS

question

What is the most common 4-digit IPC code among US B2 utility patents granted from June to August in 2022?

external_knowledge

patents_info.md

Example 24

instance_id

sf_bq212

db

PATENTS

question

For United States utility patents under the B2 classification granted between June and September of 2022, identify the most frequent 4-digit IPC code for each patent. Then, list the publication numbers and IPC4 codes of patents where this code appears 10 or more times.

external_knowledge

patents_info.md

Example 25

instance_id

sf_bq214

db

PATENTS_GOOGLE

question

For United States utility patents under the B2 classification granted between 2010 and 2014, find the one with the most forward citations within a month of its filing date, and identify the most similar patent from the same filing year, regardless of its type.

external_knowledge

patents_info.md

Example 26

instance_id

sf_bq216

db

PATENTS_GOOGLE

question

Identify the top five patents filed in the same year as US-9741766-B2 that are most similar to it based on technological similarities. Please provide the publication numbers.

external_knowledge

patents_info.md

Example 27

instance_id

sf_bq247

db

PATENTS_GOOGLE

question

From the publications dataset, first identify the top six families with the most publications whose family_id is not ‘-1’. Then, using the abs_and_emb table (joined on publication_number), provide each of those families’ IDs alongside every non-empty abstract associated with their publications.

external_knowledge

None

Example 28

instance_id

sf_bq127

db

PATENTS_GOOGLE

question

For each publication family whose earliest publication was first published in January 2015, please provide the earliest publication date, the distinct publication numbers, their country codes, the distinct CPC and IPC codes, distinct families (namely, the ids) that cite and are cited by this publication family. Please present all lists as comma-separated values, sorted alphabetically

external_knowledge

None

Example 29

instance_id

sf_bq215

db

PATENTS

question

Which US patent (with a B2 kind code and a grant date between 2015 and 2018) has the highest originality score calculated as 1 - (the sum of squared occurrences of distinct 4-digit IPC codes in its backward citations divided by the square of the total occurrences of these 4-digit IPC codes)?

external_knowledge

patents_info.md

Example 30

instance_id

sf_bq222

db

PATENTS

question

Find the CPC technology areas in Germany that had the highest exponential moving average (smoothing factor 0.1) of patent filings per year, specifically for patents granted in December 2016. For each CPC group at level 4, show the full title, CPC group, and the year with the highest exponential moving average of patent filings.

external_knowledge

sliding_windows_calculation_cpc.md

Example 31

instance_id

sf_bq221

db

PATENTS

question

Identify the CPC technology areas with the highest exponential moving average of patent filings each year (with a smoothing factor of 0.2), considering only the first CPC code for each patent that has a valid filing date and a non-empty application number, and report the full CPC title along with the best year associated with the highest exponential moving average for each CPC group at level 5.

external_knowledge

sliding_windows_calculation_cpc.md

Example 32

instance_id

sf_bq223

db

PATENTS

question

Which assignees, excluding DENSO CORP itself, have cited patents assigned to DENSO CORP, and what are the titles of the primary CPC subclasses associated with these citations? Provide the name of each citing assignee (excluding DENSO CORP), the full title of the primary CPC subclass (based on the first CPC code), and the count of citations grouped by the citing assignee and the CPC subclass title. Ensure that only citations of patents with valid filing dates are considered, and focus on the first CPC code for each citing patent. The results should specifically exclude DENSO CORP as a citing assignee.

external_knowledge

patents_info.md

Example 33

instance_id

sf_bq420

db

PATENTS_USPTO

question

Can you identify the top 5 patents that were initially rejected under section 101 with no allowed claims, based on the length of their granted claims? The patents should have been granted in the US between 2010 and 2023. Additionally, ensure to select the first office action date for each application. Please include their first publication numbers, along with their first publication dates, length of the filed claims and grant dates.

external_knowledge

None

Example 34

instance_id

sf_bq207

db

PATENTS_USPTO

question

Could you provide the earliest publication numbers, corresponding application numbers, claim numbers, and word counts for the top 100 independent patent claims, based on the highest word count, retrieved from claims stats within uspto_oce_claims (filtered by ind_flg=’1’), matched with their publication numbers from uspto_oce_claims match, and further joined with patents publications to ensure only the earliest publication for each application is included, ordered by descending word count, and limited to the top 100 results?

external_knowledge

None

Example 35

instance_id

sf_bq128

db

PATENTSVIEW

question

Retrieve the following information for U.S. patents filed between January 1, 2014, and February 1, 2014. The patent title and abstract. The publication date of the patent. The number of backward citations for each patent (i.e., the number of patents cited by the current patent before its filing date). The number of forward citations for each patent within the first 5 years of its publication (i.e., the number of patents that cited the current patent within 5 years after its publication). For each patent, ensure the forward citations are counted only for citations within 5 years after the publication date, and backward citations are counted for citations before the filing date.

external_knowledge

forward_backward_citation.md

Example 36

instance_id

sf_bq246

db

PATENTSVIEW

question

Retrieve U.S. patents with the number of forward citations within the first 3 years after the patent application date (i.e., patents citing the current patent within 3 years). Only include patents with both backward citations within 1 year before the application date and forward citations within 1 year after the application date. The query should focus on specific CPC categories, sort results by backward citations in descending order, and return the patent with the most backward citations, limiting to one result.

external_knowledge

None

Example 37

instance_id

sf_bq052

db

PATENTSVIEW

question

Retrieve the following information for U.S. patents: The patent ID, title, and application date. The number of backward citations within 1 month before the application date (i.e., patents that cited the current patent before its application). The number of forward citations within 1 month after the application date (i.e., patents that cited the current patent after its application). The abstract text of the patent. Only include patents that belong to specific CPC categories, such as subsection ‘C05’ or group ‘A01G’. The query should filter patents to include only those that have at least one backward citation or one forward citation in the 1-month period specified. Sort the results by application date and return all matching records.

external_knowledge

None

Example 38

instance_id

sf_bq036

db

GITHUB_REPOS

question

What was the average number of GitHub commits made per month in 2016 for repositories containing Python code?

external_knowledge

None

Example 39

instance_id

sf_bq100

db

GITHUB_REPOS

question

How can we identify the top 10 most frequently used packages in GitHub repository contents by looking for import statements enclosed in parentheses, splitting any multi-line imports by newlines, extracting package names that appear within double quotes, counting how often these packages appear, ignoring any null results, and finally ordering them in descending order of their frequency? The final answer should remove the quotation marks.

external_knowledge

None

Example 40

instance_id

sf_bq101

db

GITHUB_REPOS

question

From GitHub Repos contents, how can we identify the top 10 most frequently imported package names in Java source files by splitting each file’s content into lines, filtering for valid import statements, extracting only the package portion using a suitable regex, grouping by these extracted package names, counting their occurrences, and finally returning the 10 packages that appear most often in descending order of frequency?

external_knowledge

None

Example 41

instance_id

sf_bq182

db

GITHUB_REPOS_DATE

question

Which primary programming languages, determined by the highest number of bytes in each repository, had at least 5 PullRequestEvents on January 18, 2023 across all their repositories?

external_knowledge

None

Example 42

instance_id

sf_bq217

db

GITHUB_REPOS_DATE

question

On January 18, 2023, how many pull request creation events occurred in GitHub repositories that include JavaScript as one of their programming languages? Use data from the githubarchive table for the events and the languages table for repository language information.

external_knowledge

None

Example 43

instance_id

sf_bq191

db

GITHUB_REPOS_DATE

question

From the 2017 GitHub WatchEvent data, find the top two repositories that have more than 300 distinct watchers, ensuring the results are joined with the ‘sample_files’ table so that we return each repository’s name along with its distinct watcher count, and limit the output to the two repositories with the highest watcher counts.

external_knowledge

None

Example 44

instance_id

sf_bq224

db

GITHUB_REPOS_DATE

question

Which repository with an approved license in licenses.md had the highest combined total of forks, issues, and watches in April 2022?

external_knowledge

None

Example 45

instance_id

sf_bq192

db

GITHUB_REPOS_DATE

question

Find the most active Python repository on GitHub based on watcher count, issues, and forks. The query should select repositories with specific open-source licenses (artistic-2.0, isc, mit, apache-2.0), count distinct watchers, issue events, and forks for each repository in April 2022, and include only those with .py files on the master branch. Join the license data with watch counts, issue events, and fork counts, then sort by a combined metric of forks, issues, and watches, returning the name and count of the most active repository.

external_knowledge

None

Example 46

instance_id

sf_bq225

db

GITHUB_REPOS

question

From the GitHub repository files in ‘github_repos.sample_files’ joined with ‘github_repos.sample_contents’, which 10 programming languages occur most frequently (based on recognized file extensions) in files that have non-empty content, ordered by their file counts in descending order?

external_knowledge

lang_and_ext.md

Example 47

instance_id

sf_bq180

db

GITHUB_REPOS

question

Get the top 5 most frequently used module names from Python (.py) and R (.r) scripts, counting occurrences of modules in import and from statements for Python, and library() calls for R. The query should consider only Python and R files, group by module name, and return the top 5 modules ordered by frequency.

external_knowledge

None

Example 48

instance_id

sf_bq233

db

GITHUB_REPOS

question

Can you analyze the joined data from github repos files and github_repos contents, focusing only on files ending with ‘.py’ or ‘.r’, then extract Python modules from ‘import’ or ‘from … import’ lines and R libraries from ‘library(…)’ lines, count their occurrences, and finally list the results sorted by language and by the number of occurrences in descending order?

external_knowledge

None

Example 49

instance_id

sf_bq248

db

GITHUB_REPOS

question

Among all repositories that do not use any programming language whose name (case-insensitively) includes the substring “python,” what is the proportion of files whose paths include “readme.md” and whose contents contain the phrase “Copyright (c)”?

external_knowledge

None

Example 50

instance_id

sf_bq193

db

GITHUB_REPOS

question

Retrieve all non-empty, non-commented lines from README.md files in GitHub repositories, excluding lines that are comments (either starting with # for Markdown or // for code comments). For each line, calculate how often each unique line appears across all repositories and return a comma-separated list of the programming languages used in each repository containing that line, sorted alphabetically, with the results ordered by the frequency of occurrence in descending order.

external_knowledge

None

Example 51

instance_id

sf_bq295

db

GITHUB_REPOS_DATE

question

Using the 2017 GitHub Archive data for watch events, which three repositories that include at least one Python file (with a .py extension) smaller than 15,000 bytes and containing the substring “def “ in its content have the highest total number of watch events for that year?

external_knowledge

None

Example 52

instance_id

sf_bq249

db

GITHUB_REPOS

question

Please provide a report on the number of occurrences of specific line types across files from the GitHub repository. Categorize a line as ‘trailing’ if it ends with a blank character, as ‘Space’ if it starts with a space, and as ‘Other’ if it meets neither condition. The report should include the total number of occurrences for each category, considering all lines across all files.

external_knowledge

None

Example 53

instance_id

sf_bq375

db

GITHUB_REPOS

question

Determine which file type among Python (.py), C (.c), Jupyter Notebook (.ipynb), Java (.java), and JavaScript (.js) in the GitHub codebase has the most files with a directory depth greater than 10, and provide the file count.

external_knowledge

None

Example 54

instance_id

sf_bq255

db

GITHUB_REPOS

question

How many commit messages are there in repositories that use the ‘Shell’ programming language and ‘apache-2.0’ license, where the length of the commit message is more than 5 characters but less than 10,000 characters, and the messages do not start with the word ‘merge’, ‘update’ or ‘test’?

external_knowledge

None

Example 55

instance_id

sf_bq194

db

GITHUB_REPOS

question

Among all Python (.py), R (.r, .R, *.Rmd, *.rmd), and IPython notebook (.ipynb) files in the GitHub sample dataset, which library or module is identified as the second most frequently imported or loaded based on the extracted import statements?

external_knowledge

None

Example 56

instance_id

sf_bq377

db

GITHUB_REPOS

question

Extract and count the frequency of all package names listed in the require section of JSON-formatted content

external_knowledge

None

Example 57

instance_id

sf_bq359

db

GITHUB_REPOS

question

List the repository names and commit counts for the top two GitHub repositories with JavaScript as the primary language and the highest number of commits.

external_knowledge

None

Example 58

instance_id

sf_bq252

db

GITHUB_REPOS

question

Could you please find the name of the repository that contains the most copied non-binary Swift file in the dataset, ensuring each file is uniquely identified by its ID?

external_knowledge

None

Example 59

instance_id

sf_bq251

db

PYPI

question

I want to know the GitHub project URLs for the top 3 most downloaded PyPI packages based on download count. First, extract PyPI package metadata including name, version, and project URLs. Filter these URLs to only include those that link to GitHub repositories. Use a regular expression to clean the GitHub URLs by removing unnecessary parts like ‘issues’, ‘pull’, ‘blob’, and ‘tree’ paths, keeping only the main repository URL. For packages with multiple versions, use only the most recent version based on upload time. Join this data with download metrics to determine the most downloaded packages. Return only the cleaned GitHub repository URLs (without quotation marks) for the top 3 packages by total download count, ensuring that only packages with valid GitHub URLs are included in the results.

external_knowledge

None

Example 60

instance_id

bq019

db

cms_data

question

In the 2014 CMS Medicare inpatient charges data, which DRG definition has the highest total number of discharges, and among the top three cities with the most discharges for that DRG definition, what are their respective weighted average total payments (weighted by total discharges)

external_knowledge

None

Example 61

instance_id

bq234

db

cms_data

question

What is the most prescribed medication in each state in 2014?

external_knowledge

None

Example 62

instance_id

bq235

db

cms_data

question

Can you tell me which healthcare provider incurs the highest combined average costs for both outpatient and inpatient services in 2014?

external_knowledge

None

Example 63

instance_id

bq172

db

cms_data

question

For the drug with the highest total number of prescriptions in New York State during 2014, could you list the top five states with the highest total claim counts for this drug? Please also include their total claim counts and total drug costs.

external_knowledge

None

Example 64

instance_id

bq177

db

cms_data

question

For the provider whose total inpatient Medicare cost from 2011 through 2015 is the highest (computed as the sum of average_medicare_payments multiplied by total_discharges), please list that provider’s yearly average inpatient cost and yearly average outpatient cost for each calendar year in this period, where the inpatient cost is calculated as the average of (average_medicare_payments × total_discharges) and the outpatient cost is calculated as the average of (average_total_payments × outpatient_services).

external_knowledge

None

Example 65

instance_id

bq354

db

cms_data

question

Could you provide the percentage of participants for standard acne, atopic dermatitis, psoriasis, and vitiligo as defined by the International Classification of Diseases 10-CM (ICD-10-CM), including their subcategories? Please include all related concepts mapped to the standard ICD-10-CM codes (L70 for acne, L20 for atopic dermatitis, L40 for psoriasis, and L80 for vitiligo) by utilizing concept relationships, including descendant concepts. The percentage should be calculated based on the total number of participants, considering only the standard concepts and their related descendants.

external_knowledge

None

Example 66

instance_id

bq355

db

cms_data

question

Please tell me the percentage of participants not using quinapril and related medications(Quinapril RxCUI: 35208).

external_knowledge

None

Example 67

instance_id

bq032

db

noaa_data

question

Can you provide the latitude of the final coordinates for the hurricane that traveled the second longest distance in the North Atlantic during 2020?

external_knowledge

functions_st_distance.md

Example 68

instance_id

bq119

db

noaa_data

question

Please show information about the hurricane with the third longest total travel distance in the North Atlantic during 2020, including its travel coordinates, the cumulative travel distance (in kilometers) at each point, and the maximum sustained wind speed at those times.

external_knowledge

functions_st_distance.md

Example 69

instance_id

sf_bq117

db

NOAA_DATA

question

What is the total number of severe storm events that occurred in the most affected month over the past 15 years according to NOAA records, considering only the top 100 storm events with the highest property damage?

external_knowledge

None

Example 70

instance_id

bq419

db

noaa_data

question

Which 5 states had the most storm events from 1980 to 1995, considering only the top 1000 states with the highest event counts each year? Please use state abbreviations.

external_knowledge

None

Example 71

instance_id

sf_bq071

db

NOAA_DATA_PLUS

question

Can you provide the count of hurricanes and a list of hurricane names (sorted alphabetically and separated by commas) for each city and its associated zip code, where the hurricanes fall within the boundaries of the zip codes? Please exclude any unnamed hurricanes, and sort the results by the count of hurricanes in descending order. The output should include the following columns: city, zip code, state, count of hurricanes, and the list of hurricanes.

external_knowledge

functions_st_within.md

Example 72

instance_id

sf_bq236

db

NOAA_DATA_PLUS

question

What are the top 5 zip codes of the areas in the United States that have experienced the most hail storm events in the past 10 years? Don’t use data from hail reports table.

external_knowledge

functions_st_within.md

Example 73

instance_id

bq356

db

noaa_data

question

Among all NOAA GSOD weather stations that recorded valid daily temperature data (non-missing temp, max, min) in 2019 and whose period of record began on or before January 1, 2000, and continued through at least June 30, 2019, how many of these stations achieved 90% or more of the maximum possible number of valid temperature-record days in 2019?

external_knowledge

None

Example 74

instance_id

bq042

db

noaa_data

question

Can you help me retrieve the average temperature, average wind speed, and precipitation for LaGuardia Airport in NYC on June 12 for each year from 2011 through 2020, specifically using the station ID 725030?

external_knowledge

None

Example 75

instance_id

bq394

db

noaa_data

question

What are the top 3 months between 2010 and 2014 with the smallest sum of absolute differences between the average air temperature, wet bulb temperature, dew point temperature, and sea surface temperature, including respective years and sum of differences? Please present the year and month in numerical format.

external_knowledge

None

Example 76

instance_id

bq357

db

noaa_data

question

What are the latitude and longitude coordinates and dates between 2005 and 2015 with the top 5 highest daily average wind speeds, excluding records with missing wind speed values? Using data from tables start with prefix “icoads_core”.

external_knowledge

None

Example 77

instance_id

bq181

db

noaa_data

question

What percentage of weather stations recorded valid temperature data (with no missing or invalid values) for at least 90% of the days in 2022, where the temperature, maximum, and minimum values are neither NULL nor equal to 9999.9, and the station has a valid identifier (USAF code not equal to ‘999999’), out of all available stations in the NOAA GSOD database?

external_knowledge

None

Example 78

instance_id

bq045

db

noaa_data

question

Which weather stations in Washington State recorded more than 150 rainy days in 2023 but fewer rainy days compared to 2022? Defining a “rainy day” as one having precipitation greater than zero millimeters and not equal to 99.99. Only include stations with valid precipitation data.

external_knowledge

None

Example 79

instance_id

sf_bq358

db

NEW_YORK_CITIBIKE_1

question

Can you tell me which bike trip in New York City on July 15, 2015, started and ended in ZIP Code areas with the highest average temperature for that day, as recorded by the Central Park weather station (WBAN ‘94728’)? If there’s more than one trip that meets these criteria, I’d like to know about the one that starts in the smallest ZIP Code and ends in the largest ZIP Code. Please return the starting and ending ZIP Codes of this trip.

external_knowledge

functions_st_within.md

Example 80

instance_id

bq290

db

noaa_data

question

Can you calculate the difference in maximum temperature, minimum temperature, and average temperature between US and UK weather stations for each day in October 2023, using the date field, and excluding records with missing or invalid temperature values?

external_knowledge

None

Example 81

instance_id

bq031

db

noaa_data

question

Provide the daily weather data for Rochester from January 1 to March 31, 2019, including temperature (in Celsius), precipitation (in centimeters), and wind speed (in meters per second). For each variable, calculate the 8-day moving average (including the current day and the previous 7 days). Also, calculate the difference between the moving average on each day and the moving averages for the previous 1 to 8 days (i.e., lag1 to lag8). The result should include: The daily values for temperature, precipitation, and wind speed.The 8-day moving averages for each variable. The differences between the moving averages for each of the previous 1 to 8 days (e.g., the difference between today’s moving average and the moving average from 1 day ago, from 2 days ago, and so on). Round all values to one decimal place. The data should be ordered by date, starting from January 9, 2019.

external_knowledge

None

Example 82

instance_id

bq392

db

noaa_gsod

question

What are the top 3 dates in October 2009 with the highest average temperature for station number 723758, in the format YYYY-MM-DD?

external_knowledge

None

Example 83

instance_id

sf_bq050

db

NEW_YORK_CITIBIKE_1

question

I want to analyze bike trips in New York City for 2014 by linking trip data with weather information to understand how weather conditions (temperature, wind speed, and precipitation) affect bike trips between neighborhoods. For each combination of starting and ending neighborhoods, I need the following: 1. Total number of bike trips between the neighborhoods. 2. Average trip duration in minutes (rounded to 1 decimal). 3. Average temperature at the start of the trip (rounded to 1 decimal). 4. Average wind speed at the start (in meters per second, rounded to 1 decimal). 5. Average precipitation at the start (in centimeters, rounded to 1 decimal). 6. The month with the most trips (e.g., 4 for April). The data should be grouped by the starting and ending neighborhoods, with:zip_codes in geo_us_boundaries used to map the bike trip locations based on latitude and longitude. zip_codes in cyclistic used to obtain the borough and neighborhood names. Using weather data from the Central Park station for the trip date, covering all trips in 2014.

external_knowledge

functions_st_within.md

Example 84

instance_id

sf_bq426

db

NEW_YORK_CITIBIKE_1

question

What user type recorded the highest average temperature for trips starting and ending in New York City’s zip code 10019 during 2018? Include average precipitation, wind speed, and temperature for that user type based on weather data from the New York Central Park station.

external_knowledge

functions_st_within.md

Example 85

instance_id

sf_bq291

db

NOAA_GLOBAL_FORECAST_SYSTEM

question

Can you provide a daily weather summary for July 2019 within a 5 km radius of latitude 26.75 and longitude 51.5? I need the maximum, minimum, and average temperatures; total precipitation; average cloud cover between 10 AM and 5 PM; total snowfall (when average temperature is below 32°F); and total rainfall (when average temperature is 32°F or above) for each forecast date. The data should correspond to forecasts created in July 2019 for the following day.

external_knowledge

functions_st_within.md

Example 86

instance_id

bq208

db

new_york_noaa

question

Can you provide weather stations within a 20-mile radius of Chappaqua, New York (Latitude: 41.197, Longitude: -73.764), and tell me the number of valid temperature observations they have recorded from 2011 to 2020, excluding any invalid or missing temperature data?

external_knowledge

functions_st_dwithin.md

Example 87

instance_id

bq047

db

new_york_noaa

question

Could you analyze the relationship between each complaint type and daily temperature in New York City, specifically using temperature data from LaGuardia (STN=725030) and JFK (STN=744860) airports for the 10 years starting in 2008, and then determine, for each complaint type that has more than 5000 total occurrences and shows a strong correlation (absolute value > 0.5) with temperature, the total number of complaints, the total number of days with valid temperature records, and the Pearson correlation coefficients (rounded to four decimals) between temperature and both the daily complaint count as well as the daily percentage of total complaints, excluding any days with missing or invalid temperature data (such as 9999.9)?

external_knowledge

None

Example 88

instance_id

bq048

db

new_york_noaa

question

Which complaint types with more than 3000 total requests from 2011 to 2020 show the strongest positive and negative Pearson correlations with the daily average wind speed measured at station 744860 (JFK Airport), based on daily complaint proportions (the ratio of type- specific complaints to total daily complaints)? Please provide the complaint types and their correlation coefficients, rounded to four decimal places.

external_knowledge

None

Example 89

instance_id

bq293

db

new_york_geo

question

I want to analyze New York City yellow taxi trip data specifically for January 1, 2015, using the bigquery-public- data.new_york.tlc_yellow_trips_2015 dataset. I need to join this with the geo_us_boundaries.zip_codes table to map each trip to its respective NYC zip code based on the pickup coordinates. The analysis should create a complete hour-by-hour breakdown for all zip codes by cross-joining distinct zip codes with distinct hours, including hours with zero trips. For each zip code and hour combination, calculate the total number of trips and the following time-based metrics: count of trips from 1 hour ago, 1 day (24 hours) ago, 7 days (168 hours) ago, and 14 days (336 hours) ago. Additionally, compute the 14-day and 21-day moving averages and standard deviations of trip counts, excluding the current hour. Only include trips with valid latitude and longitude coordinates. The final results should be sorted by the highest trip counts, showing only the top 5 groups with the most trips.

external_knowledge

functions_st_contains.md

Example 90

instance_id

sf_bq017

db

GEO_OPENSTREETMAP

question

What are the five longest types of highways within the multipolygon boundary of Denmark (as defined by Wikidata ID ‘Q35’) by total length, analyzed through planet features?

external_knowledge

functions_st_dwithin.md

Example 91

instance_id

sf_bq131

db

GEO_OPENSTREETMAP

question

What is the number of bus stops for the bus network with the most stops within the multipolygon boundary of Denmark (as defined by Wikidata ID ‘Q35’), analyzed through planet features?

external_knowledge

functions_st_dwithin.md

Example 92

instance_id

sf_bq349

db

GEO_OPENSTREETMAP

question

Which OpenStreetMap ID from the planet features table corresponds to an administrative boundary, represented as multipolygons, whose total number of ‘amenity’-tagged Points of Interest (POIs), as derived from the planet nodes table, is closest to the median count among all such boundaries?

external_knowledge

functions_st_dwithin.md

Example 93

instance_id

sf_bq007

db

CENSUS_BUREAU_ACS_2

question

Identify the top 10 U.S. states with the highest vulnerable population, calculated based on a weighted sum of employment sectors using 2017 ACS 5-Year data, and determine their average median income change from 2015 to 2018 using zip code data.

external_knowledge

total_vulnerable_weights.md

Example 94

instance_id

sf_bq429

db

CENSUS_BUREAU_ACS_2

question

Which are the top five states with the greatest average difference in median income between 2015 and 2018 at the ZIP code level, and what is the corresponding average number of vulnerable employees across wholesale trade, natural resources and construction, arts and entertainment, information, and retail trade industries in 2017 according to the ACS Five-Year Estimates and ZIP code boundaries data?

external_knowledge

avg_vulnerable_weights.md

Example 95

instance_id

sf_bq073

db

CENSUS_BUREAU_ACS_2

question

Using data on ZIP-level median income differences between 2015 and 2018, along with the 2017 ACS employment figures, list each state in descending order of total vulnerable workers, where “vulnerable” is defined as 38% of wholesale trade employees and 41% of manufacturing employees in 2017. Your results should include the state name, the number of vulnerable wholesale trade workers, the number of vulnerable manufacturing workers, and the combined total of these vulnerable workers.

external_knowledge

None

Example 96

instance_id

sf_bq410

db

CENSUS_BUREAU_ACS_2

question

Find the top 3 states with the smallest adjusted non-labor force population using 2017 ACS tract-level data. Calculate the adjusted non-labor force population as (unemployed_pop + not_in_labor_force - group_quarters), clamping any negative values to zero. For each of these states, display the state abbreviation, the total median income change between 2015 and 2018 across all tracts, the total adjusted non-labor force population, and the average population-adjusted proportion (calculated as the ratio of adjusted non-labor force to total population at the tract level). Join tract data to states using FIPS code prefixes, exclude any tracts with null income differences, and sort results by the adjusted non-labor force population in ascending order.

external_knowledge

None

Example 97

instance_id

sf_bq348

db

GEO_OPENSTREETMAP

question

Within the rectangular area defined by the geogpoints (31.1798246, 18.4519921), (54.3798246, 18.4519921), (54.3798246, 33.6519921), and (31.1798246, 33.6519921), which are the top three usernames responsible for the highest number of historical nodes, originally tagged with the amenities ‘hospital’, ‘clinic’, or ‘doctors’, that do not appear anymore in the current planet_nodes dataset?

external_knowledge

functions_st_intersects_polygon_line.md

Example 98

instance_id

sf_bq253

db

GEO_OPENSTREETMAP

question

Find the name of the OpenStreetMap relation that encompasses the most features within the same geographic area as the multipolygon tagged with the Wikidata item ‘Q1095’. The relation should have a specified name and no ‘wikidata’ tag, and at least one of its included features must have a ‘wikidata’ tag. The analysis should be conducted using the planet_features table. Return the name of this relation.

external_knowledge

functions_st_dwithin.md

Example 99

instance_id

sf_bq254

db

GEO_OPENSTREETMAP

question

Among all multipolygons located within the same geographic area as the multipolygon associated with Wikidata item Q191, but lacking a ‘wikidata’ tag themselves, which two rank highest by the number of points that lie within their boundaries, and what are their names?

external_knowledge

functions_st_dwithin.md

Example 100

instance_id

sf_bq056

db

GEO_OPENSTREETMAP_BOUNDARIES

question

How many different pairs of roads classified as motorway, trunk, primary, secondary, or residential in California overlap each other without sharing nodes and do not have a bridge tag, where these roads are tagged with ‘highway’, analyzed through planet ways

external_knowledge

functions_st_intersects.md

Example 101

instance_id

sf_bq289

db

GEO_OPENSTREETMAP_CENSUS_PLACES

question

Can you find the shortest distance between any two amenities (either a library, place of worship, or community center) located within Philadelphia, analyzed through pennsylvania table and planet features points?

external_knowledge

functions_st_contains.md

Example 102

instance_id

sf_bq250

db

GEO_OPENSTREETMAP_WORLDPOP

question

Based on the most recent 1km population grid data in Singapore before January 2023, using ST_CONVEXHULL to aggregate all population grid centroids into a bounding region and ST_INTERSECTS to identify hospitals from OpenStreetMap’s planet layer (layer_code in (2110, 2120)) that fall within this region, then calculating the distance from each grid cell to its nearest hospital, what is the total population of the grid cell that is farthest from any hospital?

external_knowledge

OpenStreetMap_data_in_layered_GIS_format.md

Example 103

instance_id

sf_bq083

db

CRYPTO

question

Can you calculate the daily change in the market value of USDC tokens (address 0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48) for 2023, based on Ethereum transactions? The change should be computed from minting (input pattern 0x40c10f19%) and burning (input pattern 0x42966c68%) operations. For each transaction, minting should be positive and burning negative. Extract the relevant amount from the ‘input’ field as a hexadecimal, convert it to millions, express it in USD format. Group the results by date and order them in descending order.

external_knowledge

Total_Market_Value_Change.md

Example 104

instance_id

sf_bq184

db

CRYPTO

question

Using only the traces, can you calculate daily cumulative counts of smart contracts created by external addresses (where the trace_address is NULL) versus those created by other contracts (where the trace_address is NOT NULL) for each date from 2017-01-01 through 2021-12-31, ensuring that all dates in this range are included even if no new contracts were created on some days, and showing monotonically increasing cumulative totals for both categories?

external_knowledge

None

Example 105

instance_id

sf_bq195

db

CRYPTO

question

What are the top 10 Ethereum addresses by balance, considering both value transactions and gas fees, before September 1, 2021? Only keep successful transactions with no call type or where the call type is ‘call’.

external_knowledge

None

Example 106

instance_id

sf_bq256

db

CRYPTO

question

Determine the final Ether balance of the Ethereum address that initiated the highest number of successful transactions prior to September 1, 2021 (UTC), excluding calls of type delegatecall, callcode, or staticcall and including all relevant incoming and outgoing transfers, miner rewards, and gas fee deductions, with the final balance presented in Ether after converting from the native unit.

external_knowledge

None

Example 107

instance_id

sf_bq080

db

CRYPTO

question

Using only the Ethereum traces table, can you provide a daily cumulative count of smart contracts created by external users (where trace_address is null) versus contracts created by other contracts (where trace_address is not null) between August 30, 2018, and September 30, 2018? Ensure results include every date in this range, even if no new contracts were created, and show strictly increasing cumulative totals.

external_knowledge

None

Example 108

instance_id

sf_bq342

db

CRYPTO

question

What is the difference between the average hourly changes in transaction values for the Ethereum token 0x68e54af74b22acaccffa04ccaad13be16ed14eac, specifically considering only transactions where the address 0x8babf0ba311aab914c00e8fda7e8558a8b66de5d was the sender or the address 0xfbd6c6b112214d949dcdfb1217153bc0a742862f was the receiver, between January 1, 2019, and December 31, 2020, when comparing 2019 to 2020?

external_knowledge

None

Example 109

instance_id

sf_bq341

db

CRYPTO

question

Which Ethereum address has the top 3 smallest positive balance from transactions involving the token at address “0xa92a861fc11b99b24296af880011b47f9cafb5ab”?

external_knowledge

None

Example 110

instance_id

sf_bq444

db

CRYPTO

question

Can you pull the blockchain timestamp, block number, and transaction hash for the first five mint and burn events from Ethereum logs for the address ‘0x8ad599c3a0ff1de082011efddc58f1908eb6e6d8’? Please include mint events identified by the topic ‘0x7a53080ba414158be7ec69b987b5fb7d07dee101fe85488f0853ae16239d0bde’ and burn events by ‘0x0c396cd989a39f4459b5fa1aed6a9a8dcdbc45908acfd67e028cd568da98982c’, and order them by block timestamp from the oldest to the newest.

external_knowledge

ethereum_logs_and_events_overview.md

Example 111

instance_id

sf_bq340

db

CRYPTO

question

Which six Ethereum addresses, excluding ‘0x0000000000000000000000000000000000000000’, have the largest absolute differences between their previous and current balances from the tokens at addresses ‘0x0d8775f648430679a709e98d2b0cb6250d2887ef0’ and ‘0x1e15c05cbad367f044cbfbafda3d9a1510db5513’?

external_knowledge

None

Example 112

instance_id

sf_bq005

db

CRYPTO

question

Calculate the daily average Bitcoin block interval (in seconds) for 2023 by joining consecutive blocks via row-numbered self-joins (including cross-day intervals), excluding the genesis block, and list the first 10 dates with their unadjusted averages.

external_knowledge

None

Example 113

instance_id

sf_bq334

db

CRYPTO

question

Calculate the annual differences in Bitcoin output value averages between two methods: Merged input/output records: Combine the inputs and outputs tables, filter to only output records, and calculate yearly averages. Transactions table: Directly use the output_value field from the transactions table for yearly averages. Show the difference (merged outputs average minus transactions average) only for years with data in both methods.

external_knowledge

None

Example 114

instance_id

sf_bq335

db

CRYPTO

question

Among all Bitcoin addresses that have at least one transaction in October 2017 (combining both inputs and outputs), which address conducted its final transaction on the latest date in that month, and, among any addresses sharing that same latest date, which one has the highest sum of transaction values?

external_knowledge

None

Example 115

instance_id

sf_bq057

db

CRYPTO

question

Which month (e.g., 3 for March) in 2021 witnessed the highest percentage of Bitcoin transaction volume occurring in CoinJoin transactions (defined as transactions with >2 outputs, output value ≤ input value, and having multiple equal-value outputs)? Also provide the percentage of all Bitcoin transactions that were CoinJoins, the percentage of UTXOs involved in CoinJoin transactions (average of input and output percentages), and the percentage of total Bitcoin volume that occurred in CoinJoin transactions for that month. Round all percentages to 1 decimal place.

external_knowledge

None

Example 116

instance_id

sf_bq068

db

CRYPTO

question

Using double-entry bookkeeping principles by treating transaction inputs as debits (negative values) and outputs as credits (positive values) for all Bitcoin Cash transactions between 2014-03-01 and 2014-04-01, how can we calculate the maximum and minimum final balances grouped by address type from these transactions?

external_knowledge

None

Example 117

instance_id

sf_bq092

db

CRYPTO

question

In April 2023, what are the highest and lowest balances across all Dash addresses when calculating the net balance for each address using double-entry bookkeeping (where inputs are treated as debits/negative values and outputs as credits/positive values)? Consider all transactions filtered by block_timestamp_month=’2023-04-01’, and when an address appears as an array in the data, concatenate the array elements into a comma-separated string. For each address and type combination, sum all the values to determine the balance.

external_knowledge

None

Example 118

instance_id

sf_bq093

db

CRYPTO

question

What were the maximum and minimum net balance changes for Ethereum Classic addresses on October 14, 2016? Calculate these by summing all transactions where addresses received funds (debits), sent funds (credits), and paid or received gas fees. Only include successful status transactions and exclude internal calls of types. For gas fees, consider both the fees paid by transaction senders and received by miners, calculated as multiplied by the gas price for both miners and senders

external_knowledge

None

Example 119

instance_id

sf_bq292

db

CRYPTO

question

Analyze Bitcoin transactions since July 2023 to determine monthly percentages of: (1)Transactions classified as CoinJoins (defined by >2 outputs, output value ≤ input value, and multiple identical-value outputs), (2) UTXOs involved in CoinJoins (calculated as the average of CoinJoin input/output ratios against total network UTXOs), (3) Transaction volume (based on input value) attributed to CoinJoins. Provide results in a table with monthly metrics for transactions, UTXOs, and volume.

external_knowledge

None

Example 120

instance_id

sf_bq135

db

CRYPTO

question

Which date before 2022 had the highest total transaction amount in the Zilliqa blockchain data?

external_knowledge

None

Example 121

instance_id

sf_bq136

db

CRYPTO

question

Find all exactly 2-hop transaction paths on Zilliqa blockchain between the source address ‘zil1jrpjd8pjuv50cfkfr7eu6yrm3rn5u8rulqhqpz’ and destination address ‘zil19nmxkh020jnequql9kvqkf3pkwm0j0spqtd26e’, considering both regular transactions and contract transitions. A 2-hop path means there must be an intermediate address between source and destination. Exclude paths where any intermediate address has more than 50 outgoing transactions to filter out exchanges and high- activity wallets. Ensure transactions in each path follow chronological order (earlier transaction timestamps first). Display results in the format: ‘ –(tx ABCDE..)–> --(tx FGHIJ..)--> ' where the transaction IDs are truncated to the first 5 characters. Include only confirmed on-chain transactions in both steps of the path.

external_knowledge

None

Example 122

instance_id

sf_bq065

db

CRYPTO

question

From the oracle requests table, retrieve the 10 most recent oracle requests with script ID 3. For each request, extract all symbol-rate pairs by matching each symbol in the “symbols” array with its corresponding rate at the same position in the “rates” array from the decoded result. Adjust each rate by dividing it by the request’s multiplier value. Return the block timestamp, oracle request ID, symbol, and the adjusted rate for each symbol-rate pair. Sort the results in chronological order with the newest records first.

external_knowledge

None

Example 123

instance_id

sf_bq037

db

HUMAN_GENOME_VARIANTS

question

About the refined human genetic variations collected in phase 3 on 2015-02-20, I want to know the minimum and maximum start positions as well as the proportions of these two respectively for reference bases ‘AT’ and ‘TA’.

external_knowledge

None

Example 124

instance_id

sf_bq012

db

ETHEREUM_BLOCKCHAIN

question

Calculate the average balance (in quadrillions, 10^15) of the top 10 Ethereum addresses by net balance, including incoming and outgoing transfers from traces (only successful transactions and excluding call types like delegatecall, callcode, and staticcall), miner rewards (sum of gas fees per block), and sender gas fee deductions. Exclude null addresses and round the result to two decimal places.

external_knowledge

None

Example 125

instance_id

sf_bq187

db

ETHEREUM_BLOCKCHAIN

question

Calculate the total circulating supply of ‘BNB’ tokens (in units divided by 10^18) by summing balances of all non-zero addresses, where each address’s balance equals its total received BNB minus sent BNB. Exclude transactions involving the zero address (0x000…) for both senders and receivers.

external_knowledge

None

Example 126

instance_id

sf_bq450

db

ETHEREUM_BLOCKCHAIN

question

Generate a comprehensive report of all Ethereum addresses active before January 1, 2017, calculating their net balances (adjusted for transaction fees and excluding delegatecall/callcode/staticcall transactions), hourly activity patterns, active days, incoming/outgoing transaction metrics (counts, unique counterparties, average ETH transfers), ERC20 token interactions (in/out counts, unique tokens, counterparties), mining rewards, contract creation frequency, failed transaction counts, and contract bytecode sizes, with all ETH values converted to standard units (divided by 10^18) and excluding addresses with no transaction history.

external_knowledge

ethereum_data_transformation.md

Example 127

instance_id

bq034

db

ghcn_d

question

I want to know the IDs, names of weather stations within a 50 km straight-line distance from the center of Chicago (41.8319°N, 87.6847°W)

external_knowledge

None

Example 128

instance_id

bq383

db

ghcn_d

question

Could you provide the highest recorded precipitation, minimum temperature, and maximum temperature from the last 15 days of each year from 2013 to 2016 at weather station USW00094846? Ensure each value represents the peak measurement for that period, with precipitation in millimeters and temperatures in degrees Celsius, using only validated data (non-null values and no quality flags)

external_knowledge

None

Example 129

instance_id

bq051

db

new_york_ghcn

question

Calculate the average daily number of Citibike trips in New York City during 2016, categorizing days as rainy if the total precipitation exceeds 5 millimeters (obtained by dividing the raw precipitation value by 10), and non-rainy otherwise. Use data from the nearest GHCN station located within 50 km of (40.7128, -74.0060) that has valid, unflagged measurements, then compare the resulting average Citibike trips on rainy days versus non-rainy days.

external_knowledge

None

Example 130

instance_id

bq038

db

new_york

question

Identify the top 10 Citibike stations by highest proportion of group rides, defined as trips starting and ending at the same station where multiple riders departed/arrived within the same 2-minute time window. Calculate the proportion as the number of trips that are part of a group divided by the total number of trips ending at that station.

external_knowledge

None

Example 131

instance_id

bq053

db

new_york

question

Calculate the change in the number of living trees of each fall color in New York City from 1995 to 2015 by computing, for each tree species, the difference between the number of trees not marked as dead in 1995 and the number of trees alive in 2015, matching species by the uppercase form of their scientific names from the tree_species table. Then, group the species by their fall color and sum these differences to determine the total change in the number of trees for each fall color.

external_knowledge

None

Example 132

instance_id

bq054

db

new_york

question

Please provide the top 10 tree species in New York, using their uppercase Latin names where the Latin name is not empty and including their common names, showing the total number of trees, the counts of alive and dead trees for each year, and the corresponding growth in these counts from 1995 to 2015, then order by the difference in total tree counts between these years.

external_knowledge

None

Example 133

instance_id

bq021

db

new_york

question

For the top 20 Citi Bike routes in 2016, which route is faster than yellow taxis and among those, which one has the longest average bike duration? Please provide the start station name of this route. The coordinates are rounded to three decimals.

external_knowledge

None

Example 134

instance_id

bq202

db

new_york_plus

question

For the station that had the highest number of Citibike trips starting there in 2018, which numeric day of the week and which hour of the day had the greatest number of trips based on the start time of those trips?

external_knowledge

None

Example 135

instance_id

bq185

db

new_york_plus

question

What is the average trip duration in minutes for all valid Yellow taxi trips that took place between February 1, 2016, and February 7, 2016 (inclusive), with a positive trip duration, more than three passengers, and a trip distance of at least ten miles, where both the pickup and dropoff locations are in Brooklyn?

external_knowledge

None

Example 136

instance_id

bq040

db

new_york_plus

question

For NYC yellow taxi trips between January 1 and January 7, 2016, excluding any trips picked up in ‘EWR’ or ‘Staten Island,’ determine the proportion of rides that fall into each tip category in each pickup borough. Only include trips where the dropoff time is after the pickup time, the passenger count is greater than zero, and trip_distance, tip_amount, tolls_amount, mta_tax, fare_amount, and total_amount are all non-negative. Classify the tip percentage as follows: 0% (no tip), up to 5%, 5% to 10%, 10% to 15%, 15% to 20%, 20% to 25%, and more than 25%.

external_knowledge

taxi_tip_rate.md

Example 137

instance_id

bq098

db

new_york_plus

question

For NYC yellow taxi trips where both the pickup and dropoff occurred between January 1 and 7, 2016, inclusive, calculate the percentage of trips with no tip in each pickup borough, ensuring that only trips where the dropoff occurs after the pickup are included, the passenger count is greater than zero, and the trip distance, tip amount, tolls amount, MTA tax, fare amount, and total amount are non-negative; define “no tip” trips as those where the tip rate is zero, with the tip rate calculated as (tip_amount × 100) divided by total_amount (and considered zero when total_amount is zero).

external_knowledge

taxi_tip_rate.md

Example 138

instance_id

bq039

db

new_york_plus

question

Find the top 10 taxi trips in New York City between July 1 and July 7, 2016 (ensuring both pickup and dropoff times fall within these dates) where the passenger count is greater than five, the trip distance is at least ten miles, and there are no negative fare-related amounts (including tip, tolls, mta tax, fare, and total costs). Exclude any trips where the dropoff time is not strictly after the pickup time, then sort the results by total fare amount in descending order. Finally, display each trip’s pickup zone, dropoff zone, trip duration in seconds, driving speed in miles per hour, and tip rate as a percentage of the total fare amount.

external_knowledge

None

Example 139

instance_id

bq203

db

new_york_plus

question

For each New York City borough, how many subway stations are there in total, how many have at least one entrance that is marked both as an actual entry and as ADA-compliant, and what percentage of the total stations in each borough does this represent, listing boroughs from the highest to the lowest percentage?

external_knowledge

None

Example 140

instance_id

bq035

db

san_francisco

question

What is the total distance traveled by each bike in the San Francisco Bikeshare program, measured in meters? Use data from bikeshare trips and stations to calculate this.

external_knowledge

None

Example 141

instance_id

bq186

db

san_francisco

question

Please find, for each year-month combination (in the format YYYYMM) derived from the start date of bike share trips in San Francisco, the first trip duration in minutes, the last trip duration in minutes, the highest trip duration in minutes, and the lowest trip duration in minutes, where ‘first’ and ‘last’ are determined by the chronological order of the trip start date, then group your results by this year- month and sort them by the same year-month key.

external_knowledge

None

Example 142

instance_id

bq081

db

san_francisco_plus

question

Find the latest ride data for each region between 2014 and 2017. I want to know the name of each region, the trip ID of this ride, the ride duration, the start time, the starting station, and the gender of the rider.

external_knowledge

None

Example 143

instance_id

sf_bq294

db

SAN_FRANCISCO_PLUS

question

Could you provide the details of the top 5 longest bike share trips that started between July 1, 2017, and December 31, 2017, including the trip ID, duration in seconds, start date, start station name, route (derived from start station name to end station name), bike number, subscriber type, member’s birth year, the member’s current age (calculated using the current year), an age classification based on whether the member is younger than 40, between 40 and 60, or older than 60, the member’s gender, and the name of the region of the start station? Please exclude any trips where the start station name, member’s birth year, or member’s gender is not specified.

external_knowledge

trip_info.md

Example 144

instance_id

bq339

db

san_francisco_plus

question

Which month in 2017 had the largest absolute difference between cumulative bike usage minutes for customers and subscribers? Which month (in number) in 2017 had the largest absolute difference between cumulative bike usage minutes (in thousands) for customers and subscribers, based on the trip end dates in the San Francisco bikeshare data?

external_knowledge

None

Example 145

instance_id

bq400

db

san_francisco_plus

question

For trips where ‘Clay St & Drumm St’ occurs before ‘Sacramento St & Davis St’ in the stop sequence (one direction only), what are the earliest departure times from ‘Clay St & Drumm St’ and the latest arrival times at ‘Sacramento St & Davis St’ in the format HH:MM:SS? Please provide the trip headsign for each route.

external_knowledge

None

Example 146

instance_id

bq059

db

san_francisco_plus

question

What is the highest average speed (rounded to 1 decimal, in metric m/s) for bike trips in Berkeley with trip distance greater than 1000 meters?

external_knowledge

None

Example 147

instance_id

bq376

db

san_francisco_plus

question

For each neighborhood in San Francisco where at least one bike share station and at least one crime incident are located, provide the neighborhood name along with the total count of bike share stations and the total number of crime incidents in that neighborhood.

external_knowledge

None

Example 148

instance_id

sf_bq014

db

THELOOK_ECOMMERCE

question

Can you help me figure out the revenue for the product category that has the highest number of customers making a purchase in their first non-cancelled and non-returned order?

external_knowledge

None

Example 149

instance_id

sf_bq188

db

THELOOK_ECOMMERCE

question

Among all product categories in the dataset, identify the category with the highest total purchase quantity (based on order_items table), and for that specific category, what is the average time in minutes that users spend on each product page visit? The average time should be calculated as the difference between the timestamp when a user views a product page and the timestamp of the next event within the same session

external_knowledge

None

Example 150

instance_id

sf_bq258

db

THELOOK_ECOMMERCE

question

Generate a monthly report for each product category , where each row corresponds to orders that have a status of ‘Complete’ and were delivered before the year 2022, grouping by the month and year of delivery. For each category, calculate the total revenue (the sum of sale_price), the total number of completed orders, and compute the month-over-month percentage growth for both revenue and orders by comparing each month’s totals to the previous month’s. Then, for the same orders, aggregate and show the total cost (from product costs), total profit (revenue minus total cost), and finally the profit-to- cost ratio for each month.

external_knowledge

None

Example 151

instance_id

sf_bq259

db

THELOOK_ECOMMERCE

question

Using data up to the end of 2022 and organized by the month of each user’s first purchase, can you provide the percentage of users who made a purchase in each of the first, second, third, and fourth months since their initial purchase, where the “first month” refers to the month of their initial purchase?

external_knowledge

None

Example 152

instance_id

sf_bq189

db

THELOOK_ECOMMERCE

question

Based solely on completed orders, calculate the average monthly percentage growth rate in the number of unique orders (counting distinct order IDs) for each product category by comparing each month’s count to the previous month within the same category. Identify the product category with the highest average of these monthly order growth rates. Then, for that specific product category, compute the average monthly revenue growth rate by calculating the percentage change in total revenue (sum of sale prices) from month to month and averaging these values over the entire period.

external_knowledge

None

Example 153

instance_id

sf_bq260

db

THELOOK_ECOMMERCE

question

From January 1, 2019, to April 30, 2022, how many users are at the youngest age and how many users are at the oldest age for each gender in the e-commerce platform, counting both youngest and oldest users separately for each gender?

external_knowledge

None

Example 154

instance_id

sf_bq261

db

THELOOK_ECOMMERCE

question

For each month prior to January 2024, identify the product that achieved the highest total profit (calculated as the sum of sale_price minus the product’s cost) across all order items, then report the total cost and total profit for that top product per month, including all order items regardless of their status, and present the results chronologically by month.

external_knowledge

None

Example 155

instance_id

sf_bq262

db

THELOOK_ECOMMERCE

question

Generate a monthly analysis report for e-commerce sales from June 2019 to December 2019 that includes, for each product category and each month, the total number of orders, total revenue, and total profit, along with their month-over-month growth rates using the data from June 2019 as the basis for calculating growth starting from July 2019. Ensure that all orders are included regardless of their status, and present the results sorted in ascending order by month (formatted as “2019-07”) and then by product category. Omitting June 2019 from the final output but using it for the growth calculations.

external_knowledge

None

Example 156

instance_id

sf_bq190

db

THELOOK_ECOMMERCE

question

Determine the number of users who are the youngest and oldest for each gender (male and female) separately, among those who signed up between January 1, 2019, and April 30, 2022. For each gender, identify the minimum and maximum ages within this date range, and count how many users fall into these respective age groups.

external_knowledge

None

Example 157

instance_id

sf_bq263

db

THELOOK_ECOMMERCE

question

Please create a month-by-month report for the year 2023 that focuses on the ‘Sleep & Lounge’ category, showing for each month the total sales, total cost, number of complete orders, total profit, and the profit-to-cost ratio, ensuring that the order is marked as ‘Complete,’ the creation date is between January 1, 2023, and December 31, 2023, and the cost data is accurately associated with the corresponding product through the order items.

external_knowledge

None

Example 158

instance_id

sf_bq264

db

THELOOK_ECOMMERCE

question

Identify the difference in the number of the oldest and youngest users registered between January 1, 2019, and April 30, 2022, from our e-commerce platform data.

external_knowledge

None

Example 159

instance_id

sf_bq197

db

THELOOK_ECOMMERCE

question

For each month prior to July 2024, identify the single best-selling product (determined by highest sales volume, with total revenue as a tiebreaker) among all orders with a ‘Complete’ status and products with non-null brands. Return a report showing the month, product name, brand, category, total sales, rounded total revenue, and order status for these monthly top performers.

external_knowledge

None

Example 160

instance_id

sf_bq265

db

THELOOK_ECOMMERCE

question

Can you list the email addresses of the top 10 users who registered in 2019 and made purchases in 2019, ranking them by their highest average order value, where average order value is calculated by multiplying the number of items in each order by the sale price, summing this total across all orders for each user, and then dividing by the total number of orders?

external_knowledge

None

Example 161

instance_id

sf_bq266

db

THELOOK_ECOMMERCE

question

Please provide the names of the products that had sales in each month of 2020 and had the lowest profit, calculated as the difference between their retail price and cost from the products data. Exclude any months where this data isn’t available. Please list the products in chronological order based on the month.

external_knowledge

None

Example 162

instance_id

sf_bq333

db

THELOOK_ECOMMERCE

question

Which three browsers have the shortest average session duration—calculated by the difference in seconds between the earliest and latest timestamps for each user’s session—while only including browsers that have more than 10 total sessions, and what are their respective average session durations?

external_knowledge

None

Example 163

instance_id

sf_bq361

db

THELOOK_ECOMMERCE

question

For the user cohort with a first purchase date in January 2020, what proportion of users returned in the subsequent months of 2020?

external_knowledge

None

Example 164

instance_id

sf_bq271

db

THELOOK_ECOMMERCE

question

Please generate a report that, for each month in 2021, provides the number of orders, the number of unique purchasers, and the profit (calculated as the sum of product retail prices minus the sum of product costs), where the orders were placed during 2021 by users who registered in 2021 for inventory items created in 2021, and group the results by the users’ country, product department, and product category.

external_knowledge

None

Example 165

instance_id

sf_bq272

db

THELOOK_ECOMMERCE

question

Please provide the names of the top three most profitable products for each month from January 2019 through August 2022, excluding any products associated with orders that were canceled or returned. For each product in each month, the profit should be calculated as the sum of the sale prices of all order items minus the sum of the costs of those sold items in that month.

external_knowledge

None

Example 166

instance_id

sf_bq273

db

THELOOK_ECOMMERCE

question

Can you list the top 5 months from August 2022 to November 2023 where the profit from Facebook-sourced completed orders showed the largest month-over-month increase? Calculate profit as sales minus costs, group by delivery month, and include only orders created between August 2022 and November 2023. Compare each month’s profit to its previous month to find the largest increases.

external_knowledge

None

Example 167

instance_id

sf_bq020

db

GENOMICS_CANNABIS

question

What is the name of the reference sequence with the highest variant density in the given cannabis genome dataset?

external_knowledge

None

Example 168

instance_id

sf_bq107

db

GENOMICS_CANNABIS

question

What is the variant density of the cannabis reference with the longest reference length? Pay attention that a variant is present if there is at least one variant call with a genotype greater than 0.

external_knowledge

None

Example 169

instance_id

bq025

db

census_bureau_international

question

Provide a list of the top 10 countries for the year 2020, ordered by the highest percentage of their population under 20 years old. For each country, include the total population under 20 years old, the total midyear population, and the percentage of the population that is under 20 years old.

external_knowledge

None

Example 170

instance_id

bq115

db

census_bureau_international

question

Which country has the highest percentage of population under the age of 25 in 2017?

external_knowledge

None

Example 171

instance_id

bq030

db

covid19_open_data

question

As of May 10, 2020, among all countries that had more than 50,000 confirmed COVID-19 cases, which three countries had the highest recovery rates based on the total number of recovered cases relative to their total confirmed cases, and what were their respective recovery rates expressed as percentages?

external_knowledge

None

Example 172

instance_id

bq018

db

covid19_open_data

question

Which day in March and April had the highest COVID-19 confirmed case growth rate in the United States? The format is MM-DD.

external_knowledge

None

Example 173

instance_id

bq086

db

covid19_open_world_bank

question

You need to calculate the percentage of each country’s population that had been confirmed with COVID-19 by June 30, 2020. The population data for 2018 can be found in the World Bank dataset, and the cumulative COVID-19 confirmed cases data is available in the COVID-19 Open Data dataset. Calculate the percentage of each country’s population, that was cumulatively confirmed to have COVID-19

external_knowledge

None

Example 174

instance_id

bq085

db

covid19_jhu_world_bank

question

Could you provide, for the United States, France, China, Italy, Spain, Germany, and Iran, the total number of confirmed COVID-19 cases as of April 20, 2020, along with the number of cases per 100,000 people based on their total 2020 populations calculated by summing all relevant population entries from the World Bank data

external_knowledge

None

Example 175

instance_id

bq130

db

covid19_nyt

question

Analyze daily new COVID-19 case counts from March to May 2020, identifying the top five states by daily increases. Please compile a ranking based on how often each state appears in these daily top fives. Then, examine the state that ranks fourth overall and identify its top five counties based on their frequency of appearing in the daily top five new case counts.

external_knowledge

None

Example 176

instance_id

bq087

db

covid19_symptom_search

question

Please calculate the overall percentage change in the average weekly search frequency for the symptom ‘Anosmia’ across the five New York City counties—Bronx County, Queens County, Kings County, New York County, and Richmond County—by comparing the combined data from January 1, 2019, through December 31, 2019, with the combined data from January 1, 2020, through December 31, 2020.

external_knowledge

None

Example 177

instance_id

bq088

db

covid19_symptom_search

question

Please calculate the average levels of anxiety and depression symptoms from the weekly country data for the United States during the periods from January 1, 2019, to January 1, 2020, and from January 1, 2020, to January 1, 2021. Then, compute the percentage increase in these average symptom levels from the 2019 period to the 2020 period.

external_knowledge

None

Example 178

instance_id

bq089

db

covid19_usa

question

Given the latest population estimates from the 2018 five-year American Community Survey, what is the number of vaccine sites per 1000 people for counties in California?

external_knowledge

None

Example 179

instance_id

bq407

db

covid19_usa

question

Find the top three counties with populations over 50,000, using the 2020 5-year census data, that had the highest COVID-19 case fatality rates on August 27, 2020. For these counties, provide the name, state, median age, total population, number of confirmed COVID-19 cases per 100,000 people, number of deaths per 100,000 people, and the case fatality rate as a percentage

external_knowledge

None

Example 180

instance_id

bq137

db

census_bureau_usa

question

Please find all zip code areas located within 10 kilometers of the coordinates (-122.3321, 47.6062) by joining the 2010 census population data (summing only male and female populations with no age constraints) and the zip code area information, and return each area’s polygon, land and water area in meters, latitude and longitude, state code, state name, city, county, and total population.

external_knowledge

functions_st_dwithin.md

Example 181

instance_id

bq060

db

census_bureau_international

question

Which top 3 countries had the highest net migration in 2017 among those with an area greater than 500 square kilometers? And what are their migration rates?

external_knowledge

None

Example 182

instance_id

bq338

db

census_bureau_acs_1

question

Can you find the census tracts in the 36047 area that are among the top 20 for the largest percentage increases in population from 2011 to 2018, are also among the top 20 for the largest absolute increases in median income during the same period, and had over 1,000 residents in each of those years?

external_knowledge

None

Example 183

instance_id

bq061

db

census_bureau_acs_1

question

Which census tract has witnessed the largest increase in median income between 2015 and 2018 in California? Tell me the tract code.

external_knowledge

None

Example 184

instance_id

bq064

db

census_bureau_acs_1

question

Using the 2017 U.S. Census Tract data from the BigQuery public datasets, you need to proportionally allocate each tract’s population and income to the zip codes based on the overlapping area between their geographic boundaries. Then, filter the results to include only those zip codes located within a 5-mile radius of a specific point in Washington State, with coordinates at latitude 47.685833°N and longitude -122.191667°W. Finally, calculate the total population and the average individual income for each zip code (rounded to one decimal place) and sort the results by the average individual income in descending order.

external_knowledge

functions_st_intersects_area.md

Example 185

instance_id

bq461

db

ncaa_basketball

question

Please provide a chronological summary of all scoring plays from the 2014 season game where the Wildcats were the home team and the Fighting Irish were the away team. Include for each scoring event the game clock, cumulative scores for both teams (Wildcats and Fighting Irish), the team that scored, and a description of the event.

external_knowledge

None

Example 186

instance_id

bq198

db

ncaa_basketball

question

List the top 5 universities with the most seasons where they achieved the maximum wins in their respective NCAA basketball seasons between 1900-2000, showing each team’s total number of such peak-performance seasons, while excluding entries with missing team names.

external_knowledge

None

Example 187

instance_id

bq462

db

ncaa_basketball

question

Please generate a table from the NCAA basketball dataset that lists the top five records in each of these four categories: (1) Top Venues

the largest venues by seating capacity with Date shown as ‘N/A’; (2) Biggest Championship Margins - National Championship games since the 2016 season (season > 2015) with the biggest point margin victories; (3) Highest Scoring Games - games since the 2011 season (season > 2010) with the highest total points scored by both teams combined; and (4) Total Threes - games since the 2011 season (season > 2010) with the highest total three-pointers made by both teams combined. The final table should be organized with columns for Category, Date, Matchup or Venue, and Key Metric, with each category’s 5 records presented in descending order of their key metric.

external_knowledge

None

Example 188

instance_id

bq427

db

ncaa_basketball

question

Could you determine, for each shot type, the average x and y coordinates (adjusted to ensure consistency regarding the left or right basket), the average number of shot attempts, and the average number of successful shots, considering only shots taken before March 15, 2018, excluding those with null shot types or coordinates, ensuring the shots are on the correct side of the court based on the team’s basket.

external_knowledge

basketball.md

Example 189

instance_id

bq428

db

ncaa_basketball

question

For the top five team markets with the highest number of distinct players who scored at least 15 points during the second period of games between 2010 and 2018, provide details of each game they played in NCAA basketball historical tournament matches during the same period, as specified in the data model document.

external_knowledge

ncaa_data_model.md

Example 190

instance_id

bq144

db

ncaa_insights

question

Create a dataset by combining NCAA men’s basketball tournament game outcomes from the 2014 season onwards, including both the historical tournament games and the 2018 tournament results, with the corresponding pace and efficiency performance metrics for each team and their opponents from the feature_engineering data. The dataset should include the season, game outcome labels (win or loss), team and opponent seeds, school names, pace and efficiency rankings, statistical values, and the differences between the team’s and the opponent’s metrics to enable a comprehensive analysis of team and opponent dynamics.

external_knowledge

NCAA_Basketball_Tournament_SQL_Query_Variable_Guide.md

Example 191

instance_id

bq113

db

bls

question

Which county in Utah experienced the greatest percentage increase in construction employment from 2000 to 2018, calculated by averaging the employment levels during the third month of each quarter in those years? What is the corresponding percentage increase?

external_knowledge

None

Example 192

instance_id

bq112

db

bls

question

Between 1998 and 2017, for Allegheny County in the Pittsburgh area, did the average annual wages for all industries keep pace with the inflation of all consumer items, and what were the respective percentage growth rates (to two decimal places) for wages and the CPI over that period?

external_knowledge

None

Example 193

instance_id

bq055

db

google_dei

question

Can you provide the top three races with the largest percentage differences between Google’s 2021 overall hiring data from dar non intersectional hiring and the average percentages in the 2021 BLS data for the technology sectors specifically defined as ‘Internet publishing and broadcasting and web search portals,’ ‘Software publishers,’ ‘Data processing, hosting, and related services,’ or the industry group ‘Computer systems design and related services,’ along with their respective differences?

external_knowledge

None

Example 194

instance_id

bq075

db

google_dei

question

Could you provide a combined 2021 report comparing racial (Asian, Black, Hispanic/Latinx, White) and gender (U.S. Women, U.S. Men) distributions across Google’s overall workforce hiring, Google’s overall workforce representation, and the BLS data specifically for the technology sectors defined as Internet publishing and broadcasting and web search portals or Computer systems design and related services?

external_knowledge

None

Example 195

instance_id

bq406

db

google_dei

question

Please calculate the growth rates for Asians, Black people, Latinx people, Native Americans, White people, US women, US men, global women, and global men from 2014 to 2024 concerning the overall workforce.

external_knowledge

None

Example 196

instance_id

sf_bq084

db

GOOG_BLOCKCHAIN

question

For each month in the year 2023, how many total transactions occurred (counting all transaction records without removing duplicates of transaction hashes), and how many transactions per second were processed each month, where the transactions-per-second value is calculated by dividing the monthly total count by the exact number of seconds in that month, including the correct leap-year logic if applicable based on the extracted year from the transaction timestamp? Show the monthly transaction count, the computed transactions per second, the year, and the month, and present the rows in descending order of the monthly transaction count.

external_knowledge

None

Example 197

instance_id

sf_bq058

db

GOOG_BLOCKCHAIN

question

Retrieve all finalized deposits into Optimism at block 29815485 using the Optimism Standard Bridge, including transaction hash, an Etherscan link (the complete URL), L1 and L2 token addresses, sender and receiver addresses (with leading zeroes stripped), and the deposited amount (converted from hex to decimal). Ensure data is properly formatted and parsed according to Optimism’s address and token standards, and remove the prefix ‘0x’ except transaction hash. Note that, the keccak-256 hash of the Ethereum event signature for DepositFinalized is “0x3303facd24627943a92e9dc87cfbb34b15c49b726eec3ad3487c16be9ab8efe8”.

external_knowledge

optimism_standard_bridge_contract.md

Example 198

instance_id

sf_bq416

db

GOOG_BLOCKCHAIN

question

Could you retrieve the top three largest USDT transfers on the TRON blockchain by listing the block numbers, source addresses, destination addresses (in TronLink format), and transfer amounts, using the USDT contract address ‘0xa614f803b6fd780986a42c78ec9c7f77e6ded13c’ and the transfer event signature ‘0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef’, dividing the raw transfer value by 1,000,000 to convert it into the final USDT amount, and then ordering the results by the largest transferred amounts first?

external_knowledge

blockchain_data_transformations.md

Example 199

instance_id

sf_bq226

db

GOOG_BLOCKCHAIN

question

Which sender address, represented as a complete URL on https://cronoscan.com, has been used most frequently on the Cronos blockchain in transactions to non-null ‘to_address’ fields, within blocks larger than 4096 bytes, since January 1, 2023?

external_knowledge

None

Example 200

instance_id

sf_bq016

db

DEPS_DEV_V1

question

Considering only the highest release versions of NPM packages, which dependency (package and its version) appears most frequently among the dependencies of these packages?

external_knowledge

None

Example 201

instance_id

sf_bq062

db

DEPS_DEV_V1

question

What is the most frequently used license by packages in each system?

external_knowledge

None

Example 202

instance_id

sf_bq063

db

DEPS_DEV_V1

question

Find the GitHub URL (with link label ‘SOURCE_REPO’) of the latest released version of the NPM package that has the highest number of dependencies in its latest released version, excluding packages whose names contain the character ‘@’ and only considering URLs where the link label is ‘SOURCE_REPO’ and the URL contains ‘github.com’.

external_knowledge

None

Example 203

instance_id

sf_bq028

db

DEPS_DEV_V1

question

Considering only the latest release versions of NPM package, which packages are the top 8 most popular based on the Github star number, as well as their versions?

external_knowledge

None

Example 204

instance_id

bq022

db

chicago

question

Calculate the minimum and maximum trip duration in minutes (rounded to the nearest whole number), total number of trips, and average fare for each of six equal quantile groups based on trip duration, considering only trips between 0 and 60 minutes.

external_knowledge

None

Example 205

instance_id

bq362

db

chicago

question

Which three companies had the largest increase in trip numbers between two consecutive months in 2018?

external_knowledge

None

Example 206

instance_id

bq363

db

chicago

question

Calculate the total number of trips and average fare (formatted to two decimal places) for ten equal-sized quantile groups. Create ten quantile groups by partitioning the trip duration dimension (from 1-50 minutes) into equal sets. Each group should represent a similar number of distinct minute values. Display each group’s time range formatted as “XXm to XXm” (where the numbers are zero-padded to two digits), the total trips count, and the average fare. The time ranges should represent the minimum and maximum duration values within each quantile. Sort the results chronologically by time range. Use NTILE(10) to create the quantiles from the ordered trip durations.

external_knowledge

None

Example 207

instance_id

bq076

db

chicago

question

What is the highest number of motor vehicle theft incidents that occurred in any single month during 2016?

external_knowledge

None

Example 208

instance_id

bq077

db

chicago

question

For each year from 2010 to 2016, what is the highest number of motor thefts in one month?

external_knowledge

None

Example 209

instance_id

sf_bq350

db

OPEN_TARGETS_PLATFORM_1

question

For the detailed molecule data, Please display the drug id, drug type and withdrawal status for approved drugs with a black box warning and known drug type among ‘Keytruda’, ‘Vioxx’, ‘Premarin’, and ‘Humira’

external_knowledge

None

Example 210

instance_id

sf_bq379

db

OPEN_TARGETS_PLATFORM_1

question

Which target approved symbol has the overall association score closest to the mean score for psoriasis?

external_knowledge

None

Example 211

instance_id

sf_bq078

db

OPEN_TARGETS_PLATFORM_2

question

Retrieve the approved symbol of target genes with the highest overall score that are associated with the disease ‘EFO_0000676’ from the data source ‘IMPC’.

external_knowledge

None

Example 212

instance_id

sf_bq095

db

OPEN_TARGETS_PLATFORM_1

question

Generate a list of drugs from the table containing molecular details that have completed clinical trials for pancreatic endocrine carcinoma, disease ID EFO_0007416. Please include each drug’s name, the target approved symbol, and links to the relevant clinical trials.

external_knowledge

None

Example 213

instance_id

bq109

db

open_targets_genetics_1

question

Find the average, variance, max-min difference, and the QTL source(right study) of the maximum log2(h4/h3) for data where right gene id is “ENSG00000169174”, h4 > 0.8, h3 < 0.02, reported trait includes “lesterol levels”, right biological feature is “IPSC”, and the variant is ‘1_55029009_C_T’.

external_knowledge

None

Example 214

instance_id

sf_bq325

db

OPEN_TARGETS_GENETICS_2

question

Please identify the top 10 genes with the strongest associations across all studies by first selecting, for each gene within each study, the variant with the lowest p-value, and then ranking all such gene–variant pairs to return the 10 genes with the smallest p-values overall.

external_knowledge

None

Example 215

instance_id

bq090

db

CYMBAL_INVESTMENTS

question

How much higher the average intrinsic value is for trades using the feeling-lucky strategy compared to those using the momentum strategy under long-side trades?

external_knowledge

None

Example 216

instance_id

bq442

db

CYMBAL_INVESTMENTS

question

Please collect the information of the top 6 trade report with the highest closing prices. Refer to the document for all the information I want.

external_knowledge

Trade_Capture_Report_Data_List.md

Example 217

instance_id

bq079

db

usfs_fia

question

Considering only the latest evaluation group per state for the ‘EXPCURR’ evaluation type, determine which state has the highest total acreage of timberland and which has the highest total acreage of forestland. For timberland, include plots where the condition status code is 1, the reserved status code is 0, and the site productivity class code is between 1 and 6. For forestland, include plots where the condition status code is 1. Calculate the total acres by summing the adjusted expansion factors for macroplots and subplots, using their respective proportion bases (‘MACR’ for macroplots and ‘SUBP’ for subplots) and adjustment factors when greater than zero. For each category (timberland and forestland), provide the state code, evaluation group, state name, and the total acres for the state with the highest total acreage, considering only the latest evaluation group per state.

external_knowledge

None

Example 218

instance_id

bq024

db

usfs_fia

question

For the year 2012, which top 10 evaluation groups have the largest subplot acres when considering only the condition with the largest subplot acres within each group? Please include the evaluation group, evaluation type, condition status code, evaluation description, state code, macroplot acres, and subplot acres.

external_knowledge

None

Example 219

instance_id

bq220

db

usfs_fia

question

Based on the condition, plot_tree, and population tables in bigquery- public-data.usfs_fia, for the evaluation_type set to ‘EXPCURR’ and condition_status_code equal to 1, which states had the largest average subplot size and the largest average macroplot size, respectively, for each of the years 2015, 2016, and 2017? Please include the type of plot (subplot or macroplot), the specific year, the state, and the corresponding average size in your results.

external_knowledge

subplot_macroplot_size.md

Example 220

instance_id

bq096

db

gbif

question

Determine which year had the earliest date after January on which more than 10 sightings of Sterna paradisaea were recorded north of 40 degrees latitude. For each year, find the first day after January with over 10 sightings of this species in that region, and identify the year whose earliest such date is the earliest among all years.

external_knowledge

None

Example 221

instance_id

sf_bq276

db

NOAA_PORTS

question

Can you provide a comprehensive list of all ports in region number 6585 that lie within U.S. state boundaries and have been affected by named storms in the North Atlantic basin with wind speeds of at least 35 knots and a Saffir-Simpson classification of at least minimal tropical storm strength, including for each port its name, the state name, the distinct years in which storms occurred, the total count of distinct storms, the distinct storm names, the average storm category, the average wind speed, and the respective geometries for both the port and the tropical storm areas?

external_knowledge

persistent_udfs_routines.md

Example 222

instance_id

bq277

db

noaa_ports

question

Which single port, listed under region number ‘6585’, is located within a U.S. state boundary and appears most frequently inside the geographic areas of named tropical storms with wind speeds of at least 35 knots in the North Atlantic basin, excluding those labeled ‘NOT_NAMED’?

external_knowledge

persistent_udfs_routines.md

Example 223

instance_id

bq278

db

sunroof_solar

question

Please provide a detailed comparison of the solar potential for each state, distinguishing between postal code and census tract levels. For each state, include the total number of buildings available for solar installations, the average percentage of Google Maps area covered by Project Sunroof, the average percentage of that coverage which is suitable for solar, the total potential panel count, the total kilowatt capacity, the energy generation potential, the carbon dioxide offset, the current number of buildings with solar panels, and the gap in potential installations calculated by adjusting the total qualified buildings with the coverage and suitability percentages and subtracting the current installations.

external_knowledge

None

Example 224

instance_id

bq102

db

gnomAD

question

Identify which start positions are associated with missense variants in the BRCA1 gene on chromosome 17, where the reference base is ‘C’ and the alternate base is ‘T’. Using data from the gnomAD v2.1.1 version.

external_knowledge

None

Example 225

instance_id

bq445

db

gnomAD

question

Using the gnomAD v2.1.1 genomes data for chromosome 17, determine the smallest start position and largest end position of any variant whose nested VEP annotations contain the symbol ‘BRCA1’. Then, for all variants whose positions fall within that gene region, retrieve the ‘Protein_position’ values only if the ‘Consequence’ includes ‘missense_variant’, sort them in ascending order by ‘Protein_position’, and finally output the first such result.

external_knowledge

None

Example 226

instance_id

bq103

db

gnomAD

question

Generate summary statistics on genetic variants in the region between positions 55039447 and 55064852 on chromosome 1. This includes the number of variants, the total allele count, the total number of alleles, and distinct gene symbols (using Variant Effect Predictor, VEP, for gene annotation). Additionally, compute the density of mutations by dividing the length of the region by the number of variants. Using data from the gnomAD v3 version.

external_knowledge

None

Example 227

instance_id

sf_bq104

db

GOOGLE_TRENDS

question

Based on the most recent refresh date, identify the top-ranked rising search term for the week that is exactly one year prior to the latest available week in the dataset.

external_knowledge

None

Example 228

instance_id

sf_bq411

db

GOOGLE_TRENDS

question

Please retrieve the top three Google Trends search terms (ranks 1, 2, and 3) from top_terms for each weekday (Monday through Friday) between September 1, 2024, and September 14, 2024, grouped by the refresh_date column and ordered in descending order of refresh_date.

external_knowledge

None

Example 229

instance_id

bq105

db

nhtsa_traffic_fatalities_plus

question

According to the 2015 and 2016 accident and driver distraction, and excluding cases where the driver’s distraction status is recorded as ‘Not Distracted,’ ‘Unknown if Distracted,’ or ‘Not Reported,’ how many traffic accidents per 100,000 people were caused by driver distraction in each U.S. state for those two years, based on 2010 census population data, and which five states each year had the highest rates?

external_knowledge

None

Example 230

instance_id

bq108

db

nhtsa_traffic_fatalities

question

Within the 2015 dataset for accidents that occurred from January through August and involved more than one distinct person, what percentage of these accidents had more than one individual with a severe injury (injury severity = 4)

external_knowledge

None

Example 231

instance_id

bq067

db

nhtsa_traffic_fatalities

question

I want to create a labeled dataset from the National Highway Traffic Safety Administration traffic fatality data that predicts whether a traffic accident involving more than one distinct person results in more than one fatality, where the label is 1 if an accident has more than one person with an injury severity code of 4 (fatal injury) and 0 otherwise. For each accident, include the numeric predictors: state_number, the vehicle body_type, the number_of_drunk_drivers, the day_of_week, the hour_of_crash, and a binary indicator for whether the accident occurred in a work zone (1 if it is not “None,” otherwise 0). Also, engineer a feature for the average absolute difference between travel_speed and speed_limit per accident, only considering travel speeds up to 151 mph (excluding codes 997, 998, 999) and speed limits up to 80 mph (excluding codes 98, 99), and categorize this average speed difference into levels from 0 to 4 in 20 mph increments with lower bounds inclusive and upper bounds exclusive. Finally, only include accidents that involve more than one distinct person.

external_knowledge

nhtsa_traffic_fatalities.md

Example 232

instance_id

bq396

db

nhtsa_traffic_fatalities

question

Which top 3 states had the largest differences in the number of traffic accidents between rainy and clear weather during weekends in 2016? Please also provide the respective differences for each state.

external_knowledge

None

Example 233

instance_id

bq441

db

nhtsa_traffic_fatalities

question

Please help me compile the critical details on traffic accidents in 2015, as listed in the info document.

external_knowledge

Traffic_Fatalities_Info_List_2015.md

Example 234

instance_id

bq097

db

sdoh

question

What is the increasing amount of the average earnings per job between the years 2012 and 2017 for each geographic region in Massachusetts (indicated by “MA” at the end of GeoName)?

external_knowledge

None

Example 235

instance_id

bq120

db

sdoh

question

Identify the top 10 regions (counties) with the highest total number of SNAP-participating households, using the 2017 5-year ACS county- level data and SNAP enrollment data from January 1, 2017, excluding regions where the total SNAP participation is zero. For each of these regions, calculate the ratio of households earning under $20,000 to the total number of SNAP-participating households.

external_knowledge

None

Example 236

instance_id

bq110

db

sdoh

question

What is the change in the number of homeless veterans between 2012 and 2018 for each CoC region in New York that has data available in both years?

external_knowledge

None

Example 237

instance_id

bq395

db

sdoh

question

Calculate the percentage change in the total number of unsheltered homeless people from 2015 to 2018 for each state by summing the counts over all Continuums of Care (CoCs) within each state. Then, determine the national average of these state percentage changes. Identify the five states whose percentage change is closest to this national average percentage change. Please provide the state abbreviations.

external_knowledge

None

Example 238

instance_id

bq352

db

sdoh

question

Please list the average number of prenatal weeks in 2018 for counties in Wisconsin where more than 5% of the employed population had commutes of 45-59 minutes in 2017.

external_knowledge

None

Example 239

instance_id

bq074

db

sdoh

question

Count the number of counties that experienced an increase in unemployment from 2015 to 2018, using 5-year ACS data, and a decrease in dual-eligible enrollee counts between December 1, 2015, and December 1, 2018.

external_knowledge

None

Example 240

instance_id

bq066

db

sdoh

question

Could you assess the relationship between the poverty rates from the previous year’s census data and the percentage of births without maternal morbidity for the years 2016 to 2018? Use only data for births where no maternal morbidity was reported and for each year, use the 5-year census data from the year before to compute the Pearson correlation coefficient

external_knowledge

None

Example 241

instance_id

bq114

db

openaq

question

Which three cities have the largest difference between their 1990 EPA PM2.5 measurements (using units_of_measure = ‘Micrograms/cubic meter (LC)’ and parameter_name = ‘Acceptable PM2.5 AQI & Speciation Mass’) and their 2020 OpenAQ PM2.5 measurements (where pollutant = ‘pm25’ based on the year extracted from the timestamp), with both datasets matched by latitude and longitude rounded to two decimals, and the difference ordered from greatest to least?

external_knowledge

None

Example 242

instance_id

bq116

db

sec_quarterly_financials

question

Which U.S. state reported the highest total annual revenue in billions of dollars during fiscal year 2016, considering companies that provided four quarters of data and reported measure tags in (‘Revenues’,’SalesRevenueNet’,’SalesRevenueGoodsNet’), excluding any entries where the state field (stprba) is null or empty?

external_knowledge

None

Example 243

instance_id

sf_bq015

db

STACKOVERFLOW_PLUS

question

Identify and rank the top 10 tags from Stack Overflow questions that were referenced in Hacker News comments on or after 2014 by counting how many times each question was mentioned, then splitting the questions’ tag strings by the ‘|’ delimiter, grouping by tag

external_knowledge

None

Example 244

instance_id

bq041

db

stackoverflow

question

Compute the monthly statistics for new StackOverflow users created in

For each month, report the total number of new users, the percentage of these new users who asked at least one question within 30 days of signing up, and among those who asked a question within 30 days, the percentage who then answered at least one question after their first question and within 30 days following their first question.

external_knowledge

None

Example 245

instance_id

sf_bq121

db

STACKOVERFLOW

question

How do the average reputation and number of badges vary among Stack Overflow users based on the number of complete years they have been members, considering only those who joined on or before October 1, 2021?

external_knowledge

None

Example 246

instance_id

bq123

db

stackoverflow

question

You need to determine which day of the week has the third highest percentage of questions on Stack Overflow that receive an answer within an hour. To do this, use the question creation date from the posts_questions table and the earliest answer creation date from the posts_answers table. Once you’ve calculated the percentage of questions that get answered within an hour for each day, identify the day with the third highest percentage and report that percentage.

external_knowledge

None

Example 247

instance_id

bq280

db

stackoverflow

question

Please provide the display name of the user who has answered the most questions on Stack Overflow, considering only users with a reputation greater than 10.

external_knowledge

None

Example 248

instance_id

bq300

db

stackoverflow

question

What is the highest number of answers received for a single Python 2 specific question on Stack Overflow, excluding any discussions that involve Python 3?

external_knowledge

None

Example 249

instance_id

bq301

db

stackoverflow

question

Retrieve details of accepted answers to Stack Overflow questions posted in January 2016 that have tags including “javascript” and at least one of “xss”, “cross-site”, “exploit”, or “cybersecurity”; the answers themselves must also have been posted in January 2016. For each accepted answer, include the answer’s ID, the answerer’s reputation, score, and comment count, along with the associated question’s tags, score, answer count, the asker’s reputation, view count, and comment count.

external_knowledge

None

Example 250

instance_id

bq302

db

stackoverflow

question

What is the monthly proportion of Stack Overflow questions tagged with ‘python’ in the year 2022?

external_knowledge

None

Example 251

instance_id

bq303

db

stackoverflow

question

From July 1, 2019 through December 31, 2019, for all users with IDs between 16712208 and 18712208 on Stack Overflow, retrieve the user ID and the tags of the relevant question for each of their contributions, including comments on both questions and answers, any answers they posted, and any questions they authored, making sure to correctly associate the comment or answer with its parent question’s tags.

external_knowledge

None

Example 252

instance_id

bq304

db

stackoverflow

question

Retrieve the top 50 most viewed questions for each of the following Android-related tags on StackOverflow: ‘android-layout’, ‘android- activity’, ‘android-intent’, ‘android-edittext’, ‘android-fragments’, ‘android-recyclerview’, ‘listview’, ‘android-actionbar’, ‘google- maps’, and ‘android-asynctask’. Each question must contain the word ‘how’ in either its title or body and must not contain any of the following troubleshooting terms in either its title or body: ‘fail’, ‘problem’, ‘error’, ‘wrong’, ‘fix’, ‘bug’, ‘issue’, ‘solve’, or ‘trouble’. Only include tags that have at least 50 questions meeting these criteria, and for each such tag, select the top 50 questions ranked by view count.

external_knowledge

None

Example 253

instance_id

bq310

db

stackoverflow

question

What is the title of the most viewed “how” question related to Android development on StackOverflow, across specified tags such as ‘android- layout’, ‘android-activity’, ‘android-intent’, and others

external_knowledge

None

Example 254

instance_id

bq305

db

stackoverflow

question

Which 10 users have the highest combined view counts for questions they are associated with, where a user is considered associated if they own the question, or their answer is the accepted answer, or their answer’s score is greater than 5, or their answer’s score exceeds 20% of the total answer scores for that question (and is above 0), or their answer is among the top three highest-scoring answers for that question?

external_knowledge

None

Example 255

instance_id

bq306

db

stackoverflow

question

Identify the top 10 tags for user 1908967, based only on answers posted before June 7, 2018, where each tag’s score is 10 times the number of upvotes (vote_type_id=2) and 15 times the number of accepted answers (vote_type_id=1). Derive tags from the questions associated with those answers, and consider only the upvotes and accepted answers for those answers. Return the tags with the highest total scores in descending order, limited to 10 tags.

external_knowledge

None

Example 256

instance_id

sf_bq307

db

STACKOVERFLOW

question

Find the top 10 gold badges that users most commonly earn as their first gold badge on Stack Overflow. For each of these badges, display the badge name, the number of users who earned it as their first gold badge, and the average number of days from the user’s account creation date to the date they earned the badge, calculated in days without any adjustments for date formats.

external_knowledge

None

Example 257

instance_id

bq308

db

stackoverflow

question

Show the number of Stack Overflow questions asked each day of the week in 2021, and find out how many and what percentage of those were answered within one hour.

external_knowledge

None

Example 258

instance_id

bq309

db

stackoverflow

question

Retrieve the top 10 longest questions on Stack Overflow, measured by the length of their body text, where each question either has an accepted answer or has no accepted answer but has at least one answer with a score-to-view ratio exceeding 0.01. For each of these questions, include the reputation of the user who asked the question, the user’s net votes (calculated as their total up_votes minus down_votes), and the total number of badges the user has earned.

external_knowledge

None

Example 259

instance_id

bq124

db

fhir_synthea

question

Among all patients, how many individuals remain alive (i.e., with no recorded deceased.dateTime), have a diagnosis of either Diabetes or Hypertension, and are prescribed at least seven distinct active medications?

external_knowledge

None

Example 260

instance_id

bq391

db

fhir_synthea

question

Among living patients whose last names begin with “A” and who each have exactly one distinct condition, which eight conditions have the highest number of different active medications prescribed to any single patient, and what are their corresponding codes?

external_knowledge

None

Example 261

instance_id

bq126

db

the_met

question

What are the titles, artist names, mediums, and original image URLs of objects with ‘Photograph’ in their names from the ‘Photographs’ department, created not by an unknown artist, with an object end date of 1839 or earlier?

external_knowledge

None

Example 262

instance_id

bq366

db

the_met

question

What are the top three most frequently associated labels with artworks from each historical period in The Met’s collection, only considering labels linked to 500 or more artworks? Provide me with the period, label, and the associated count.

external_knowledge

None

Example 263

instance_id

bq414

db

the_met

question

Retrieve the object id, title, and the formatted metadata date (as a string in ‘YYYY-MM-DD’ format) for objects in the “The Libraries” department where the cropConfidence is greater than 0.5, the object’s title contains the word “book”.

external_knowledge

None

Example 264

instance_id

bq200

db

mlb

question

Using data from both the regular season and the post-season, identify the pitcher who achieved the highest non-zero pitch speed for each team by confirming whether the pitcher’s ID appears in the relevant home or away player lists for that game, then retrieve that pitcher’s full name along with the maximum valid pitch speed they achieved while playing for that specific team.

external_knowledge

None

Example 265

instance_id

sf_bq458

db

WORD_VECTORS_US

question

Tokenize the body text of each article into words, excluding stop words, and obtain the corresponding word vectors for these words from the glove vector. For each word, weight its word vector by dividing each component by the 0.4th power of the word’s frequency from the word frequencies. Then, for each article, aggregate these weighted word vectors by summing their components to form an article vector. Normalize each article vector to unit length by dividing by its magnitude. Finally, retrieve the ID, date, title, and the normalized article vector for each article.

external_knowledge

tokenize_func.md

Example 266

instance_id

sf_bq459

db

WORD_VECTORS_US

question

Please find the top 10 most relevant articles by only processing each article’s ‘body’ field, where each body is tokenized with no stopwords, each remaining token is turned into a GloVe-based word vector and weighted by dividing each dimension by the 0.4th power of its word frequency, then these weighted vectors are summed and normalized to get a unit vector for each article. Perform the same weighting and normalization on the query phrase ‘Epigenetics and cerebral organoids: promising directions in autism spectrum disorders’ and compute the cosine similarity between the query vector and each article vector. Finally, return the id, date, title, and the cosine similarity score for the top 10 articles with the highest similarity.

external_knowledge

tokenize_func.md

Example 267

instance_id

sf_bq460

db

WORD_VECTORS_US

question

Please process the articles from the ‘nature’ dataset by first tokenizing the body text into words and removing stopwords. For each remaining word, retrieve its word vector from the glove_vectors table and its frequency from the word_frequencies table, then divide each word vector by the 0.4th power of the word’s frequency to weight it. Sum the weighted vectors to obtain an aggregate vector for each article, normalize this aggregate vector to unit length, and then compute the cosine similarity scores between these normalized vectors. Finally, return the IDs, dates, titles, and cosine similarity scores of the top 10 articles most similar to the article with the ID ‘8a78ef2d-d5f7-4d2d-9b47-5adb25cbd373’.

external_knowledge

tokenize_func.md

Example 268

instance_id

bq204

db

eclipse_megamovie

question

Find the user with the highest total clicks across all records from all available photo collections.

external_knowledge

None

Example 269

instance_id

bq389

db

epa_historical_air_quality

question

Please calculate the monthly average levels of PM10, PM2.5 FRM, PM2.5 non-FRM, volatile organic emissions, SO2 (scaled by a factor of 10), and Lead (scaled by a factor of 100) air pollutants in California for the year 2020.

external_knowledge

None

Example 270

instance_id

sf_bq345

db

IDC

question

How large are the DICOM image files with SEG or RTSTRUCT modalities and the SOP Class UID “1.2.840.10008.5.1.4.1.1.66.4”, when grouped by collection, study, and series IDs, if they have no references to other series, images, or sources? Can you also provide a viewer URL formatted as “https://viewer.imaging.datacommons.cancer.gov/viewer/” followed by the study ID, and list these sizes in kilobytes, sorted from largest to smallest?

external_knowledge

None

Example 271

instance_id

sf_bq346

db

IDC

question

In publicly accessible DICOM data where the Modality is ‘SEG’ and the SOPClassUID is ‘1.2.840.10008.5.1.4.1.1.66.4’, and each segmentation references its original SOPInstanceUID, which five segmentation categories (by ‘SegmentedPropertyCategory.CodeMeaning’) occur most frequently?

external_knowledge

None

Example 272

instance_id

sf_bq347

db

IDC

question

From the union of the specified MR series with SeriesInstanceUID 1.3.6.1.4.1.14519.5.2.1.3671.4754.105976129314091491952445656147 and all associated segmentation instances, which modality has the greatest number of SOP instances in total, and how many are there?

external_knowledge

None

Example 273

instance_id

sf_bq390

db

IDC

question

In the “qin_prostate_repeatability” collection, please provide the distinct StudyInstanceUIDs for studies that include T2-weighted axial MR imaging and also contain anatomical structure segmentations labeled as “Peripheral zone.”

external_knowledge

None

Example 274

instance_id

sf_bq421

db

IDC

question

Can you list all unique pairs of embedding medium and staining substance code meanings, along with the number of occurrences for each pair, based on distinct embedding medium and staining substance codes from the ‘SM’ modality in the DICOM dataset’s un-nested specimen preparation sequences, ensuring that the codes are from the SCT coding scheme?

external_knowledge

None

Example 275

instance_id

sf_bq422

db

IDC

question

Using the ‘nlst’ collection’s CT images, calculate and compare two separate metrics: 1) The average series size in MiB for the top 3 patients with the highest slice interval difference tolerance (defined as the difference between the maximum and minimum unique slice intervals across all their series), and 2) The average series size in MiB for the top 3 patients with the highest exposure difference (defined as the difference between the maximum and minimum unique exposure values across all their series). For each patient, calculate the series size by summing the instance sizes of all images in that series and converting to MiB. Return the results as two separate groups labeled “Top 3 by Slice Interval” and “Top 3 by Max Exposure” with their respective average series sizes.

external_knowledge

None

Example 276

instance_id

sf_bq069

db

IDC

question

Could you help me generate a report of CT image series from the dicom_all table such that all series from the NLST collection are excluded, any localizers or JPEG-compressed series (transfer syntaxes 1.2.840.10008.1.2.4.70 or 1.2.840.10008.1.2.4.51) are skipped, and only those passing certain geometry checks—namely a single orientation, identical pixel spacing, matching SOP instance and position counts, uniform pixel rows and columns, and a near-unity dot product of image orientation vectors—are included, while also computing slice interval differences, exposure differences, and approximate series size in MB for each qualified series?

external_knowledge

nonNlstCohort.md

Example 277

instance_id

sf_bq219

db

IOWA_LIQUOR_SALES

question

In the Iowa Liquor Sales dataset, starting from January 1, 2022 through the last fully completed month, which two liquor categories, each contributing an average of at least 1% to the monthly sales volume over at least 24 months of available data, have the lowest Pearson correlation coefficient when comparing their monthly percentages of total liquor sales across those months, and what are their names?

external_knowledge

None

Example 278

instance_id

bq199

db

iowa_liquor_sales

question

Identify the top 10 liquor categories in Iowa in 2021 by calculating, for each category, the average of the per-liter retail prices across all sales transactions in that category during 2021. For these top categories, provide their average per-liter retail prices calculated in the same manner for the years 2019, 2020, and 2021.

external_knowledge

None

Example 279

instance_id

bq218

db

iowa_liquor_sales

question

What are the top 5 items with the highest year-over-year growth percentage in total sales revenue for the year 2023?

external_knowledge

None

Example 280

instance_id

bq049

db

iowa_liquor_sales_plus

question

Please show the monthly per capita Bourbon Whiskey sales during 2022 in Dubuque County for the zip code that ranks third in total Bourbon Whiskey sales, using only the population aged 21 and older.

external_knowledge

None

Example 281

instance_id

bq360

db

nppes

question

Among healthcare providers whose practice location is in Mountain View, CA, and who have a specified specialization in the field healthcare provider taxonomy, identify the top 10 most common specializations based on the count of distinct NPIs. Then determine which of those top 10 has a count of distinct NPIs closest to the average count across those 10 specializations.

external_knowledge

None

Example 282

instance_id

bq286

db

usa_names

question

Can you tell me the name of the most popular female baby in Wyoming for the year 2021, based on the proportion of female babies given that name compared to the total number of female babies given the same name across all states?

external_knowledge

None

Example 283

instance_id

sf_bq044

db

TCGA

question

For bladder cancer patients who have mutations in the CDKN2A (cyclin- dependent kinase inhibitor 2A) gene, using clinical data from the Genomic Data Commons Release 39, what types of mutations are they, what is their gender, vital status, and days to death - and for four downstream genes (MDM2 (MDM2 proto-oncogene), TP53 (tumor protein p53), CDKN1A (cyclin-dependent kinase inhibitor 1A), and CCNE1 (Cyclin E1)), what are the gene expression levels for each patient?

external_knowledge

TCGA_Study_Abbreviations.md

Example 284

instance_id

sf_bq043

db

TCGA

question

What are the RNA expression levels of the genes MDM2, TP53, CDKN1A, and CCNE1, along with associated clinical information, in bladder cancer patients with CDKN2A mutations in the ‘TCGA-BLCA’ project? Use clinical data from the Genomic Data Commons Release 39, data about somatic mutations derived from the hg19 human genome reference in Feb 2017.

external_knowledge

None

Example 285

instance_id

bq143

db

CPTAC_PDC

question

Use CPTAC proteomics and RNAseq data for Clear Cell Renal Cell Carcinoma to select ‘Primary Tumor’ and ‘Solid Tissue Normal’ samples. Join the datasets on sample submitter IDs and gene symbols. Calculate the correlation between protein abundance (log2 ratio) and gene expression levels (log-transformed+1 FPKM) for each gene and sample type. Filter out correlations with an absolute value greater than 0.5, and compute the average correlation for each sample type.

external_knowledge

None

Example 286

instance_id

sf_bq147

db

TCGA

question

Can you identify the TCGA breast cancer cases from the RNA sequencing hg38 r35` where the protein_coding gene and the project TCGA-BRCA, and which have RNA sequencing samples of multiple tissue types—including “Solid Tissue Normal”—within the same case?

external_knowledge

None

Example 287

instance_id

sf_bq148

db

TCGA

question

Could you identify the top five protein-coding genes that exhibit the highest variance in their expression levels (measured as fpkm_uq_unstranded) specifically within ‘Solid Tissue Normal’ samples? Please limit the analysis to TCGA-BRCA project cases that include at least one ‘Solid Tissue Normal’ sample type.

external_knowledge

None

Example 288

instance_id

sf_bq175

db

TCGA_MITELMAN

question

Identify cytoband names on chromosome 1 in the TCGA-KIRC segment allelic dataset where the frequency of amplifications, gains, and heterozygous deletions each rank within the top 11. Calculate these rankings based on the maximum copy number observed across various genomic studies of kidney cancer, reflecting the severity of genetic alterations.

external_knowledge

Comprehensive_Guide_to_Copy_Number_Variations_in_Cancer_Genomics.md

Example 289

instance_id

sf_bq176

db

TCGA_MITELMAN

question

Identify the case barcodes from the TCGA-LAML study with the highest weighted average copy number in cytoband 15q11 on chromosome 15, using segment data and cytoband overlaps from TCGA’s genomic and Mitelman databases.

external_knowledge

None

Example 290

instance_id

sf_bq170

db

TCGA_MITELMAN

question

For breast cancer cases (TCGA-BRCA) from Release 23 of the active GDC archive, identify and categorize copy number variations (CNVs) across all cytobands on every chromosome. For each cytoband and each case, determine the overlap between the cytoband region and the case’s copy number segments, and compute the overlap-weighted average copy number for that cytoband in the case, rounding to the nearest whole number. Classify the rounded copy number into CNV types as follows: homozygous deletions (0), heterozygous deletions (1), normal diploid state (2), gains (3), and amplifications (greater than 3). For each cytoband, provide its name and start/end positions, and calculate the frequency of each CNV type across all cases as a percentage of the total number of cases, rounded to two decimal places.

external_knowledge

copy_number_variations.md

Example 291

instance_id

sf_bq150

db

TCGA_HG19_DATA_V0

question

Assess whether different genetic variants affect the log10-transformed TP53 expression levels in TCGA-BRCA samples using sequencing and mutation data. Provide the total number of samples, the number of mutation types, the mean square between groups, the mean square within groups, and the F-statistic.

external_knowledge

TCGA_F_Score.md

Example 292

instance_id

sf_bq152

db

TCGA_HG38_DATA_V0

question

For breast cancer cases (TCGA-BRCA) from Release 23 of the active GDC archive, identify and categorize copy number variations (CNVs) across all cytobands on every chromosome. For each cytoband and each case, determine the overlap between the cytoband region and the case’s copy number segments, and compute the overlap-weighted average copy number for that cytoband in the case, rounding to the nearest whole number. Classify the rounded copy number into CNV types as follows: homozygous deletions (0), heterozygous deletions (1), normal diploid state (2), gains (3), and amplifications (greater than 3). For each cytoband, provide its name and start/end positions, and calculate the frequency of each CNV type across all cases as a percentage of the total number of cases, rounded to two decimal places.

external_knowledge

None

Example 293

instance_id

sf_bq155

db

TCGA_HG38_DATA_V0

question

In the TCGA-BRCA cohort of patients who are 80 years old or younger at diagnosis and have a pathological stage of Stage I, Stage II, or Stage IIA, calculate the t-statistic derived from the Pearson correlation between the log10-transformed average RNA-Seq expression levels (using HTSeq__Counts + 1) of the gene SNORA31 and the average microRNA-Seq expression levels of all unique microRNAs, only considering pairs with more than 25 samples and where the absolute Pearson correlation coefficient is between 0.3 and 1.0

external_knowledge

None

Example 294

instance_id

sf_bq141

db

TCGA_HG38_DATA_V0

question

Using the TCGA-KIRP dataset, select patients from the ‘TCGA_bioclin_v0.Clinical’ table who have a non-null clinical_stage and a disease_code of ‘KIRP.’ Retrieve their gene expression data from the ‘TCGA_hg38_data_v0.RNAseq_Gene_Expression’ table for the genes ‘MT-CO3,’ ‘MT-CO1,’ and ‘MT-CO2,’ and randomly split the patients into a training set (90%) and a test set (10%) based on their case_barcode via the FARM_FINGERPRINT method. For each clinical stage in the training set, calculate the average HTSeq__FPKM_UQ expression of the three genes. For each patient in the test set, compute the Euclidean distance between the patient’s expression values and the stage- specific averages, and assign that patient to the clinical stage whose average is closest. Finally, output the case_barcode and the predicted clinical stage.

external_knowledge

None

Example 295

instance_id

bq046

db

TCGA_bioclin_v0

question

Find case barcodes and their corresponding GDC file URLs for female patients aged 30 or younger diagnosed with breast cancer (BRCA) in TCGA database. The query should first identify patients from the Annotations table where entity_type is “Patient” and either category is “History of unacceptable prior treatment related to a prior/other malignancy” or classification is “Redaction”. Also identify patients from the Clinical table with disease_code “BRCA”, age_at_diagnosis less than or equal to 30, and gender “FEMALE”. Perform a FULL JOIN between these two sets, but only keep patients where both categoryName and classificationName are NULL in the final result. Use GDC metadata from archive release 14 (specifically rel14_caseData, rel14_fileData_current, and rel14_GDCfileID_to_GCSurl_NEW tables) to get the corresponding file URLs.

external_knowledge

None

Example 296

instance_id

sf_bq153

db

PANCANCER_ATLAS_1

question

Calculate, for each histology type specified in the ‘icd_o_3_histology’ field (excluding those enclosed in square brackets), the average of the per-patient average log10(normalized_count + 1) expression levels of the IGF2 gene among LGG patients with valid IGF2 expression data. Match gene expression and clinical data using the ParticipantBarcode field.

external_knowledge

None

Example 297

instance_id

sf_bq154

db

PANCANCER_ATLAS_1

question

Calculate the Kruskal-Wallis H-score among groups of LGG patients for IGF2 gene expression, where each patient’s IGF2 expression is determined by applying log10(normalized_count + 1) and then averaging across samples. Group the patients by ICD-O-3 histology codes, exclude any codes fully enclosed in square brackets, only include groups with more than one patient, and ensure that normalized count is not null. Finally, return the total number of groups, the total number of samples, and the Kruskal-Wallis H-score in descending order.

external_knowledge

Regulome_Explorer_Kruskal-Wallis_test_for_numerical_and_categorical_data.md

Example 298

instance_id

sf_bq156

db

PANCANCER_ATLAS_1

question

Compute the t-score (rounded to 2 decimals) to compare the difference in mean expression levels of the gene DRG2 between two groups (TP53 mutated vs. non-mutated) in the Lower Grade Glioma (LGG) study, where the expression levels are calculated as the average of log10(normalized_count + 1) for each participant, only considering samples with TP53 mutations that have a ‘FILTER’ status of ‘PASS’ in the mutation data, and ignoring any groups with fewer than 10 samples or with zero variance; refer to t_score.md for the method of computing the t-score.

external_knowledge

t_score.md

Example 299

instance_id

sf_bq157

db

PANCANCER_ATLAS_1

question

Please compute the T-score to determine the statistical difference in the expression of the DRG2 gene between LGG patients with and without TP53 mutation: for each patient, calculate the average of log10(normalized_count + 1) of DRG2 expression across all their samples, using only samples present in the MC3_MAF_V5_one_per_tumor_sample table for the LGG study; identify patients with TP53 mutations from this table where Hugo_Symbol is ‘TP53’ and FILTER is ‘PASS’; then perform a T-test comparing the mean averaged log-transformed DRG2 expression between patients with and without TP53 mutation.

external_knowledge

Regulome_Explorer_T_test_for_numerical_and_binary_data.md

Example 300

instance_id

sf_bq158

db

PANCANCER_ATLAS_1

question

Which top five histological types of breast cancer (BRCA) in the PanCancer Atlas exhibit the highest percentage of CDH1 gene mutations?

external_knowledge

None

Example 301

instance_id

sf_bq159

db

PANCANCER_ATLAS_1

question

Calculate the chi-square value to assess the association between histological types and the presence of CDH1 gene mutations in BRCA patients using data from the PanCancer Atlas. Focus on patients with known histological types and consider only reliable mutation entries. Exclude any histological types or mutation statuses with marginal totals less than or equal to 10. Match clinical and mutation data using ParticipantBarcode

external_knowledge

None

Example 302

instance_id

bq161

db

pancancer_atlas_2

question

Calculate the net difference between the number of pancreatic adenocarcinoma (PAAD) patients in TCGA’s dataset who are confirmed to have mutations in both KRAS and TP53 genes, and those without mutations in either gene. Utilize patient clinical and follow-up data alongside genomic mutation details from TCGA’s cancer genomics database, focusing specifically on PAAD studies where the mutations have passed quality filters.

external_knowledge

None

Example 303

instance_id

bq151

db

pancancer_atlas_2

question

Using TCGA dataset, calculate the chi-squared statistic to evaluate the association between KRAS and TP53 gene mutations in patients diagnosed with pancreatic adenocarcinoma (PAAD). Incorporate clinical follow-up data and high-quality mutation annotations to accurately determine the frequency of patients with co-occurring KRAS and TP53 mutations compared to those with each mutation occurring independently. Ensure that patient records are meticulously matched based on unique identifiers to maintain data integrity. This analysis aims to identify and quantify potential correlations between KRAS and TP53 genetic alterations within the PAAD patient population.

external_knowledge

None

Example 304

instance_id

bq162

db

HTAN_1

question

Based on the 5th revision (r5) of the HTAN data, list the imaging assay types available at the HTAN WUSTL center that have Level2 data and any associated higher-level data (Level3, Level4) derived from them through ‘entityId’ relationships in the ‘id_provenance_r5’ table; exclude any records where the ‘Component’ is NULL or contains ‘Auxiliary’ or ‘OtherAssay’; for each imaging assay type, provide the available data levels (Level2, Level3, Level4), and do not include Level1 data or Electron Microscopy assay types.

external_knowledge

None

Example 305

instance_id

sf_bq163

db

HTAN_2

question

Which 20 genes in the HTAN scRNAseq MSK-SCLC combined samples dataset show the greatest difference in average X_value expression between female and male epithelial cells specifically in cluster 41 of 74-year-old human stage patients? Please calculate the difference by subtracting male average X_value from female average X_value for each gene

external_knowledge

None

Example 306

instance_id

sf_bq164

db

HTAN_2

question

Consolidate metadata from spatial transcriptomics and scRNAseq datasets—including levels 1 through 4 and auxiliary files—for the run ID ‘HT264P1-S1H2Fc2U1Z1Bs1-H2Bs2-Test’. Include Filename, HTAN Parent Biospecimen ID, Component, File Format, Entity ID, and Run ID.

external_knowledge

None

Example 307

instance_id

sf_bq166

db

TCGA_MITELMAN

question

Using segment-level copy number data from the copy_number_segment_allelic_hg38_gdc_r23 dataset restricted to ‘TCGA- KIRC’ samples, merge these segments with the cytogenetic band definitions in ‘CytoBands_hg38’ to identify each sample’s maximum copy number per cytoband. Classify these maximum copy numbers into amplifications (>3), gains (=3), homozygous deletions (=0), heterozygous deletions (=1), or normal (=2), then calculate the frequency of each subtype out of the total number of distinct cases, and finally present these frequencies as percentages sorted by chromosome and cytoband.

external_knowledge

Comprehensive_Guide_to_Copy_Number_Variations_in_Cancer_Genomics.md

Example 308

instance_id

bq165

db

mitelman

question

Can you use CytoConverter genomic coordinates to calculate the frequency of chromosomal gains and losses across a cohort of breast cancer (morphology=’3111’) and adenocarcinoma (topology=’0401’) samples? Concretely, please include the number and frequency (2 decimals in percentage) of amplifications (gains of more than 1 copy), gains (1 extra copy), losses (1 copy) and homozygous deletions (loss of 2 copies) for each chromosomal band. And sort the result by the ordinal of each chromosome and the starting-ending base-pair position of each band in ascending order.

external_knowledge

None

Example 309

instance_id

bq169

db

mitelman

question

Retrieve distinct case references, case numbers, investigation numbers, and clone information where a single clone simultaneously exhibits all three of the following genetic alterations: (1) a loss on chromosome 13 between positions 48,303,751 and 48,481,890, (2) a loss on chromosome 17 between positions 7,668,421 and 7,687,490, and (3) a gain on chromosome 11 between positions 108,223,067 and 108,369,102. For each matching clone, display the chromosomal details for each of these three regions (including chromosome number represented by ChrOrd, start position, and end position) and the corresponding karyotype short description from the KaryClone table. Use the CytoConverted and KaryClone.

external_knowledge

None

Example 310

instance_id

bq111

db

mitelman

question

Could you compute, by chromosome, the Pearson correlation between the frequency of copy number aberrations (including amplifications, gains, losses, and deletions) from the Mitelman database for cases with morph = 3111 and topo = 0401, and those computed from TCGA data, returning correlation coefficients and corresponding p-values for each aberration type, ensuring only results with at least five matching records are shown.

external_knowledge

Correlations_between_Mitelman_and_TCGA_datasets.md

Example 311

instance_id

sf_bq451

db

_1000_GENOMES

question

Extract genotype data for single nucleotide polymorphisms (SNPs) on chromosome X, excluding positions where the start value is between 59999 and 2699519 or between 154931042 and 155260559. For each sample, identify genotype calls where the genotype array has at least one allele. Classify each genotype call into one of the following categories: homozygous reference alleles (both alleles are 0), homozygous alternate alleles (both alleles are the same and greater than 0), or heterozygous alleles (alleles are different, or any allele is null, and at least one allele is greater than 0). Compute the total number of callable sites (the sum of all three genotype categories), the number of homozygous reference, homozygous alternate, and heterozygous genotype calls, the total number of single nucleotide variants (SNVs) as the sum of homozygous alternate and heterozygous genotype calls, the percentage of heterozygous genotype calls among all SNVs, and the percentage of homozygous alternate genotype calls among all SNVs. Output the sample ID along with these computed counts and percentages, and order the results by the percentage of heterozygous genotype calls among SNVs in descending order, then by sample ID.

external_knowledge

1000_genomes_alleles_type.md

Example 312

instance_id

sf_bq452

db

_1000_GENOMES

question

Identify variants on chromosome 12 and, for each variant, calculate the chi-squared score using allele counts in cases and controls, where cases are individuals from the ‘EAS’ super population and controls are individuals from all other super populations. Apply Yates’s correction for continuity in the chi-squared calculation, ensuring that the expected counts for each allele in both groups are at least 5. Return the start position, end position, and chi-squared score of the top variants where the chi-squared score is no less than 29.71679.

external_knowledge

None

Example 313

instance_id

sf_bq453

db

_1000_GENOMES

question

In chromosome 17 between positions 41196311 and 41277499, what are the reference names, start and end positions, reference bases, distinct alternate bases, variant types, and the chi-squared scores (calculated from Hardy-Weinberg equilibrium) along with the total number of genotypes, their observed and expected counts for homozygous reference, heterozygous, and homozygous alternate genotypes, as well as allele frequencies (including those from 1KG), for each variant?

external_knowledge

None

Example 314

instance_id

sf_bq454

db

_1000_GENOMES

question

For the 1000 Genomes dataset, analyze common autosomal variants (those with an allele frequency of at least 0.05) across different super populations. For each super population, count how many variants are shared by each specific number of samples within that super population. Include in your results the total population size of each super population, whether the variant is common (allele frequency ≥ 0.05), the number of samples having each variant, and the total count of variants shared by that many samples. Only include autosomal variants by explicitly excluding sex chromosomes (X, Y) and mitochondrial DNA (MT) from the analysis. Consider only samples that have at least one alternate allele (non-reference) for the variant.

external_knowledge

None

Example 315

instance_id

sf_bq415

db

HUMAN_GENOME_VARIANTS

question

List the top 10 samples in the genome data that have the highest number of positions where there is exactly one alternate allele and the sample’s genotype is homozygous for the reference allele (both alleles are 0). Order the results in descending order of these counts.

external_knowledge

Homozygous_Reference_Genotype.md

Example 316

instance_id

bq279

db

austin

question

Can you provide the number of distinct active and closed bike share stations for each year 2013 and 2014?

external_knowledge

None

Example 317

instance_id

bq281

db

austin

question

What is the highest number of electric bike rides lasting more than 10 minutes taken by subscribers with ‘Student Membership’ in a single day, excluding rides starting or ending at ‘Mobile Station’ or ‘Repair Shop’?

external_knowledge

None

Example 318

instance_id

bq282

db

austin

question

Can you tell me the numeric value of the active council district in Austin which has the highest number of bike trips that start and end within the same district, but not at the same station?

external_knowledge

None

Example 319

instance_id

bq006

db

austin

question

What is the date with the second highest Z-score for daily counts of ‘PUBLIC INTOXICATION’ incidents in Austin for the year 2016? List the date in the format of ‘2016-xx-xx’.

external_knowledge

None

Example 320

instance_id

sf_bq283

db

AUSTIN

question

Among all stations that are currently active, identify those that rank in the top 15 (including ties) based on the total number of trips that start at each station. For each of these stations, return the station ID, the total number of starting trips, the percentage of those trips out of the overall starting trips from active stations, and the average trip duration in minutes. Order the results by the station’s rank.

external_knowledge

None

Example 321

instance_id

bq284

db

bbc

question

Can you provide a breakdown of the total number of articles into different categories and the percentage of those articles that mention “education” within each category from the BBC News?

external_knowledge

None

Example 322

instance_id

bq413

db

dimensions_ai_covid19

question

Retrieve the venue titles of publications that have a date_inserted from the year 2021 onwards and are associated with a grid whose address city is ‘Qianjiang’. For each publication, prioritize the venue title by selecting the journal title first if it exists; if not, then the proceedings title; if that’s also unavailable, then the book title; and finally, if none of those are available, the book series title.

external_knowledge

None

Example 323

instance_id

bq425

db

ebi_chembl

question

Using data from ChEMBL Release 23, retrieve all distinct molecules associated with the company ‘SanofiAventis,’ listing the trade name and the most recent approval date for each molecule. Make sure to keep only the latest approval date per molecule and ensure the company field precisely matches ‘SanofiAventis’ without relying on other fields.

external_knowledge

None

Example 324

instance_id

bq430

db

ebi_chembl

question

Find pairs of different molecules tested in the same assay and standard type, where both have 10–15 heavy atoms, fewer than 5 activities in that assay, fewer than 2 duplicate activities, non-null standard values, and pChEMBL values over 10. For each pair, report the maximum heavy atom count, the latest publication date (calculated based on the document’s rank within the same journal and year, and map it to a synthetic month and day), the highest document ID, classify the change in standard values as ‘increase’, ‘decrease’, or ‘no- change’ based on their values and relations, and generate UUIDs from their activity IDs and canonical SMILES.

external_knowledge

chembl.md

Example 325

instance_id

bq023

db

fec

question

Using the 2018 5-Year American Community Survey (ACS) for median incomes at the census tract level and the 2020 Federal Election Commission (FEC) individual contributions dataset filtered for donors in New York, matched to census tract geographies via a ZIP code to census tract crosswalk, calculate and list the average political donation amount and the median income for each census tract located in Kings County (Brooklyn), New York.

external_knowledge

None

Example 326

instance_id

bq094

db

fec

question

Please provide a list of all 2016 committees that supported at least one candidate and received a total amount of individual contributions between $0 and $200 (inclusive of more than $0 and less than $200) where these small-dollar contributions sum to more than $0 overall. For each qualifying committee, include its name, the number of unique candidates it supported, the candidates’ names in alphabetical order (separated by commas), and the total sum of these small-dollar donations received by the committee.

external_knowledge

None

Example 327

instance_id

bq287

db

fda

question

Among all Utah ZIP codes, what is the 2017 American Community Survey employment rate for the population aged 16 or older in the ZIP code that has the fewest FDIC-insured bank locations?

external_knowledge

None

Example 328

instance_id

bq432

db

fda

question

Please provide the food events data where both "date_created" and "date_started" are between January 1 and January 31, 2015, apply the following data cleansing steps: split reactions and outcomes fields into arrays by commas, handle special numeric patterns in the products_brand_name field (where a digit is followed by comma and another digit) by preserving those numeric patterns while replacing other “, “ with “ – “, replace “, “ with “ – “ in products_industry_code, products_role, and products_industry_name fields, and calculate industry_code_length and brand_name_length as the array lengths after splitting.

external_knowledge

Food_Event_Cleansing_Logic.md

Example 329

instance_id

bq285

db

fda

question

Could you provide me with the zip code of the location that has the highest number of bank institutions in Florida?

external_knowledge

None

Example 330

instance_id

bq288

db

fda

question

What is the total number of all banking institutions in the state that has the highest sum of assets from banks established between January 1, 1900, and December 31, 2000, with institution names starting with ‘Bank’?

external_knowledge

None

Example 331

instance_id

sf_bq412

db

GOOGLE_ADS

question

Please retrieve the page URLs, first shown time, last shown time, removal reason, violation category, and the lower and upper bounds of times shown for the five most recently removed ads in the Croatia region (region code ‘HR’), where the times shown availability date is null, the times shown lower bound exceeds 10,000, the times shown upper bound is below 25,000, and the ads used at least one non-unused audience selection approach among demographics, geographic location, contextual signals, customer lists, or topics of interest, ordering the resulting ads by their last shown time in descending order.

external_knowledge

None

Example 332

instance_id

sf_bq423

db

GOOGLE_ADS

question

Between January 1, 2023, and January 1, 2024, which image-type advertisement on the topic of Health, published by a verified advertiser located in Cyprus, was shown in Croatia, has times_shown_availability_date as NULL (meaning the times shown data is available), utilized demographic information, geo-location targeting, contextual signals, customer lists, and topics of interest without any of these selection methods being unused, and additionally had its first shown date strictly after January 1, 2023, and last shown date strictly before January 1, 2024? Among such ads, provide the page URL of the one with the highest upper bound of times shown.

external_knowledge

None

Example 333

instance_id

sf_bq070

db

IDC

question

Could you provide a clean, structured dataset from dicom_all table that only includes SM images marked as VOLUME from the TCGA-LUAD and TCGA-LUSC collections, excluding any slides with compression type “other,” where the specimen preparation step explicitly has “Embedding medium” set to “Tissue freezing medium,” and ensuring that the tissue type is only “normal” or “tumor” and the cancer subtype is reported accordingly?

external_knowledge

dicom_dataset_selection.md

Example 334

instance_id

sf_bq320

db

IDC

question

In the dicom_pivot table, how many unique StudyInstanceUID values exactly match the SegmentedPropertyTypeCodeSequence of “15825003” (case-insensitive) and also have a collection_id of either “Community” or “nsclc_radiomics”?

external_knowledge

None

Example 335

instance_id

sf_bq321

db

IDC

question

How many unique StudyInstanceUIDs are there from the DWI, T2 Weighted Axial, Apparent Diffusion Coefficient series, and T2 Weighted Axial Segmentations in the ‘qin_prostate_repeatability’ collection?

external_knowledge

None

Example 336

instance_id

sf_bq323

db

IDC

question

Within the ‘prostatex’ collection, for MRI sequences where the Modality is ‘MR’, assign the label ‘t2w_prostateX’ to sequences whose SeriesDescription contains ‘t2_tse_tra’ and ‘adc_prostateX’ to sequences whose SeriesDescription contains ‘ADC’. For all sequences labeled as ‘t2w_prostateX’ or ‘adc_prostateX’, calculate the average Repetition Time, the average Echo Time, and the average Slice Thickness, and then compute the sum of these averages to obtain the combined overall average.

external_knowledge

None

Example 337

instance_id

sf_bq417

db

IDC

question

Please provide identification details, study and series information, storage location, and total size in MB for the medical images belonging to male patients who are exactly 18 years old based on the numeric portion of the PatientAge field, where the BodyPartExamined is set to ‘MEDIASTINUM’ and the study date is strictly after September 1, 2014.

external_knowledge

IDC_data_model.md

Example 338

instance_id

sf_bq455

db

IDC

question

Identify the top five CT scan series by size (in MiB), including their SeriesInstanceUID, series number, patient ID, and series size. These series must be from the CT modality and not part of the ‘nlst’ collection. Exclude any series where the ImageType is classified as ‘LOCALIZER’ or where the TransferSyntaxUID is either ‘1.2.840.10008.1.2.4.70’ or ‘1.2.840.10008.1.2.4.51’ (i.e., JPEG compressed). The selected series must have consistent slice intervals, exposure levels, image orientation (with only one unique ImageOrientationPatient value), pixel spacing, image positions (both z-axis and xy positions), and pixel dimensions (rows and columns). Ensure that the number of images matches the number of unique z-axis positions, indicating no duplicate slices. Additionally, the z-axis component of the cross product of the x and y direction cosines from ImageOrientationPatient must have an absolute value between 0.99 and 1.01, ensuring alignment with the expected imaging plane. Finally, order the results by series size in descending order and limit the output to the top five series satisfying these conditions.

external_knowledge

None

Example 339

instance_id

sf_bq456

db

IDC

question

Please retrieve from the dicom_all table each PatientID, StudyInstanceUID, StudyDate, and the CodeMeaning of the FindingSite for patients whose StudyDate is in the year 2001, along with the maximum values of each of the following measurements identified by their CodeMeaning (Elongation, Flatness, Least Axis in 3D Length, Major Axis in 3D Length, Maximum 3D Diameter of a Mesh, Minor Axis in 3D Length, Sphericity, Surface Area of Mesh, Surface to Volume Ratio, Volume from Voxel Summation, and Volume of Mesh), ensuring that the quantitative_measurements table is joined on segmentationInstanceUID matching the SOPInstanceUID in dicom_all, and grouping by PatientID, StudyInstanceUID, StudyDate, and FindingSite CodeMeaning.

external_knowledge

None

Example 340

instance_id

sf_bq324

db

IDC

question

How many frames in total are present across all whole slide microscopy images from the TCGA-BRCA collection that use the SM modality and include an eosin-based staining step in their SpecimenPreparationSequence?

external_knowledge

None

Example 341

instance_id

bq418

db

targetome_reactome

question

Determine which three lowest-level Reactome pathways (with TAS evidence) have the highest chi-squared statistics, considering only Homo sapiens targets associated with sorafenib under the conditions that the median assay value is ≤ 100 and both low and high assay values are ≤ 100 or null. For each of these three pathways, how many of these targets and non-targets lie within the pathway and outside it?

external_knowledge

None

Example 342

instance_id

bq330

db

fda

question

Which Colorado zip code has the highest concentration of bank locations per block group, based on the overlap between zip codes and block groups?

external_knowledge

overlap_ratio.md

Example 343

instance_id

bq398

db

world_bank

question

What are the top three debt indicators for Russia based on the highest debt values?

external_knowledge

None

Example 344

instance_id

bq399

db

world_bank

question

Which high-income country had the highest average crude birth rate respectively in each region, and what are their corresponding average birth rate, during the 1980s?

external_knowledge

None

Example 345

instance_id

bq457

db

libraries_io

question

Get details of repositories that use specific feature toggle libraries. For each repository, include the full name with owner, hosting platform type, size in bytes, primary programming language, fork source name (if any), last update timestamp, the artifact and library names of the feature toggle used, and the library’s programming languages. Include repositories that depend on the specified feature toggle libraries, defined by their artifact names, library names, platforms, and languages.

external_knowledge

feature_toggle_libraries.md

Example 346

instance_id

bq227

db

london

question

Could you provide the annual percentage shares, rounded to two decimal places, of the top 5 minor crime categories from 2008 in London’s total crimes, with each year displayed in one row?

external_knowledge

None

Example 347

instance_id

bq232

db

london

question

Could you provide the total number of ‘Other Theft’ incidents within the ‘Theft and Handling’ category for each year in the Westminster borough?

external_knowledge

None

Example 348

instance_id

bq228

db

london

question

Please provide a list of the top three major crime categories in the borough of Barking and Dagenham, along with the number of incidents in each category.

external_knowledge

None

Example 349

instance_id

bq229

db

open_images

question

Using the bigquery-public-data.open_images dataset, can you provide a count of how many distinct image URLs are categorized as ‘cat’ (where the image has label ‘/m/01yrx’ with confidence=1) and how many distinct image URLs are categorized as ‘other’ (meaning they have no cat label ‘/m/01yrx’ at all)?

external_knowledge

None

Example 350

instance_id

bq230

db

usda_nass_agriculture

question

Using the crops dataset, find the total 2022 production figures, measured in bushels, for corn from the ‘FIELD CROPS’ category and mushrooms from the ‘HORTICULTURE’ group for each U.S. state. Only include data rows where ‘statisticcat_desc’ is ‘PRODUCTION’, ‘agg_level_desc’ is ‘STATE’, ‘value’ is not null, and ensure that for corn the ‘unit_desc’ is ‘BU’. Combine both results so that each state’s 2022 corn and mushroom totals are presented.

external_knowledge

None

Example 351

instance_id

bq326

db

world_bank

question

Based on the World Bank global population dataset and the World Bank health nutrition population dataset, how many countries experienced an increase of more than 1% from the previous year to 2018 in both their total population and per capita current health expenditure (PPP)?

external_knowledge

None

Example 352

instance_id

bq424

db

world_bank

question

List the top 10 countries with respect to the total amount of long- term external debt in descending order, excluding those without a specified region.

external_knowledge

None

Example 353

instance_id

bq327

db

world_bank

question

How many debt indicators for Russia have a value of 0, excluding NULL values?

external_knowledge

None

Example 354

instance_id

bq328

db

world_bank

question

Which region has the highest median GDP (constant 2015 US$) value?

external_knowledge

None

Example 355

instance_id

sf_bq331

db

META_KAGGLE

question

Find the top three users who have authored the first message in forum topics, ranked in descending order by their message scores, where a message score is defined as the number of distinct users who voted on that message. For each of these users, provide their username and the absolute difference between their message score and the average message score across all first messages in forum topics.

external_knowledge

None

Example 356

instance_id

sf_bq380

db

META_KAGGLE

question

Using the data from Forum Message Votes and Users tables, find the top three users who have received the most distinct upvotes on the Kaggle forum. For each of these users, list their username, the total number of distinct upvotes they have received (based on the ToUserId field), and the total number of distinct upvotes they have given (based on the FromUserId field), sorted by the number of upvotes received in descending order and restricted to only the top three results.

external_knowledge

None

Example 357

instance_id

sf_bq370

db

WIDE_WORLD_IMPORTERS

question

How many customers have orders and invoices that match at the line- item level and, when aggregated, result in each customer having an equal count of orders and invoices as well as an identical total value for the orders and invoices?

external_knowledge

None

Example 358

instance_id

sf_bq371

db

WIDE_WORLD_IMPORTERS

question

In the year 2013, considering each invoice’s total value as the product of unit price and quantity and grouping by the quarter (Q1, Q2, Q3, Q4) in which the invoice date occurs, what is the difference between the maximum and minimum average invoice values across these quarters?

external_knowledge

None

Example 359

instance_id

sf_bq372

db

WIDE_WORLD_IMPORTERS

question

Among all orders that do not appear in the invoice table, for each customer category calculate the maximum lost order value, then determine which customer category’s maximum lost order value is closest to the overall average of these maximum lost order values across all categories?

external_knowledge

None

Example 360

instance_id

sf_bq373

db

WIDE_WORLD_IMPORTERS

question

Using the invoice date to determine each month of the year 2014, and summing the total invoice line amounts for each customer across these months, what is the median of the resulting average monthly spending across all customers?

external_knowledge

None

Example 361

instance_id

bq393

db

hacker_news

question

I want to identify users who had activity followed by inactivity. Specifically, find the user ID and their corresponding month number (counting from their first activity month) for the user with the highest month number who became inactive (no activity recorded) after their last recorded activity month. For this analysis, only consider data up until September 10, 2024, and ensure the month number represents the count of months since the user’s first activity. The user should have at least one month where they were expected to be active (within their activity span) but actually had no records.

external_knowledge

None

Example 362

instance_id

bq403

db

irs_990

question

Which three years in 2012-2017 have the smallest absolute difference between median revenue and median functional expenses for organizations filing IRS 990 forms? Please output three years and respective differences.

external_knowledge

None

Example 363

instance_id

bq397

db

ecommerce

question

After removing any duplicate records from the rev_transactions dataset, identify each channel grouping that has transactions from more than one country. For each such channel grouping, find the country with the highest total number of transactions and report both the country name and the sum of transactions for that channel grouping.

external_knowledge

None

Example 364

instance_id

bq402

db

ecommerce

question

Calculate the conversion rate from unique visitors to purchasers by using data exclusively from the web_analytics table in the data-to- insights.ecommerce dataset. A visitor is defined as a unique fullVisitorId present in the table, while a purchaser is a visitor who has at least one transaction recorded (totals.transactions is not null). The conversion rate is computed by dividing the number of unique purchasers by the total number of unique visitors. Additionally, calculate the average number of transactions per purchaser, considering only those visitors who have made at least one transaction.

external_knowledge

None

Example 365

instance_id

sf_bq160

db

META_KAGGLE

question

Please provide the creation date, title, parent forum title, reply count, distinct user replies count, total upvotes, and total views for the earliest five forum topics that belong to any sub-forum under the parent forum titled “General”. The reply count should be calculated as the total number of messages associated with each topic, and the distinct user replies count should be the number of unique users who have posted messages in the topic. The upvotes should be calculated as the total number of upvotes on all messages within each topic. If any values are missing or None, please treat them as zero

external_knowledge

None

Example 366

instance_id

sf_bq167

db

META_KAGGLE

question

Identify the pair of Kaggle users involved in ForumMessageVotes such that one user has given the other the greatest distinct number of upvotes, then also display how many upvotes that recipient returned. Present the usernames of both users, the total distinct upvotes one received from the other, and the upvotes they gave back, sorting by the highest received count and then by the highest given count, and show only the top result.

external_knowledge

None

Example 367

instance_id

sf_bq171

db

META_KAGGLE

question

Whose Forum message upvotes are closest to the average in 2019? If there’s a tie, tell me the one with the alphabetically first username.

external_knowledge

None

Example 368

instance_id

sf_bq118

db

DEATH

question

Among individuals identified as white, how much higher is the average number of deaths from ICD-10 codes whose descriptions contain the word “discharge” (specifically excluding “Urethral discharge,” “Discharge of firework,” and “Legal intervention involving firearm discharge”) compared to the average number of deaths from ICD-10 codes whose descriptions contain the word “vehicle,” when aggregated by age groups?

external_knowledge

None

Example 369

instance_id

sf_bq072

db

DEATH

question

Please provide, for each age from 12 through 18 (inclusive), the total number of deaths and the number of deaths among individuals identified as Black (based on race descriptions containing the word ‘black’), specifically for deaths associated with ICD-10 codes whose descriptions include the word ‘vehicle’ and for deaths associated with ICD-10 codes whose descriptions include the word ‘firearm.’ Use the EntityAxisConditions table to determine which ICD-10 codes were involved in each death, rather than joining ICD-10 code information directly on the death records.

external_knowledge

None

Example 370

instance_id

ga001

db

ga4

question

I want to know the preferences of customers who purchased the Google Navy Speckled Tee in December 2020. What other product was purchased with the highest total quantity alongside this item?

external_knowledge

None

Example 371

instance_id

ga002

db

ga4

question

Tell me the most purchased other products and their quantities by customers who bought the Google Red Speckled Tee each month for the three months starting from November 2020.

external_knowledge

None

Example 372

instance_id

ga003

db

firebase

question

I’m trying to evaluate which board types were most effective on September 15, 2018. Can you find out the average scores for each board type from the quick play mode completions on that day?

external_knowledge

None

Example 373

instance_id

ga004

db

ga4

question

Can you figure out the average difference in pageviews between users who bought something and those who didn’t in December 2020? Just label anyone who was involved in purchase events as a purchaser.

external_knowledge

None

Example 374

instance_id

ga008

db

ga4

question

Could you provide the total number of page views for each day in November 2020 as well as the average number of page views per user on those days, restricted to users who made at least one purchase in November 2020?

external_knowledge

None

Example 375

instance_id

ga017

db

ga4

question

How many distinct users viewed the most frequently visited page during January 2021?

external_knowledge

None

Example 376

instance_id

ga007

db

ga4

question

Please find out what percentage of the page views on January 2, 2021, were for PDP type pages.

external_knowledge

ga4_page_category.md

Example 377

instance_id

ga013

db

ga4

question

I want to know all the pages visited by user 1402138.5184246691 on January 2, 2021. Please show the names of these pages and adjust the names to PDP or PLP where necessary.

external_knowledge

ga4_page_category.md

Example 378

instance_id

ga018

db

ga4

question

On January 2nd, 2021, I want to determine the percentage of times users transition from a product list page (PLP) view to a product detail page (PDP) view within the same session, using only page_view events. Could you calculate how many PLP views eventually led to a PDP view in the same session on that date, and then provide the resulting percentage of PLP-to-PDP transitions?

external_knowledge

ga4_page_category.md

Example 379

instance_id

ga032

db

ga4

question

Can you generate the navigation flow for user with pseudo_id ‘1362228.4966015575’ on January 28th 2021, showing only the page_view events? Please connect the page titles with ‘»’ between each step, convert product detail pages to ‘PDP’ and product listing pages to ‘PLP’ based on the URL structure, and merge adjacent identical pages so they only appear once in the sequence. I need to understand how this specific user navigated through our website on that day.

external_knowledge

ga4_page_category.md

Example 380

instance_id

ga031

db

ga4

question

I want to know the user session conversion rate on January 2nd, 2021, using only ‘page_view’ events. The conversion rate should be calculated as the percentage of user visits that reached both the Home and Checkout Confirmation pages in one session, relative to those that landed on the Home page.

external_knowledge

None

Example 381

instance_id

ga006

db

ga4

question

For the date range November 1–30, 2020, can you retrieve each user_pseudo_id and its average purchase revenue in USD per session for users who had more than one purchase session, considering only events with event_name=’purchase’ and a non-null ecommerce.purchase_revenue_in_usd, grouping sessions by the ga_session_id from event_params

external_knowledge

None

Example 382

instance_id

ga009

db

ga4

question

Could you tell me the average number of engaged sessions per user for December 2020, counting only those sessions where the event parameter ‘session_engaged’ is equal to ‘1’ and using ‘user_pseudo_id’ combined with the ‘ga_session_id’ to identify distinct sessions?

external_knowledge

None

Example 383

instance_id

ga010

db

ga4

question

Can you give me an overview of our website traffic for December 2020? I’m particularly interested in the channel with the fourth highest number of sessions.

external_knowledge

ga4_dimensions_and_metrics.md

Example 384

instance_id

ga014

db

ga4

question

Can you provide the total number of sessions for each traffic channel in December 2020, using the information from the ‘event_params’

external_knowledge

ga4_dimensions_and_metrics.md

Example 385

instance_id

ga011

db

ga4

question

What is the page with the second highest total page views, after cleaning up its URL (removing extra slashes) and extracting the correct page path, on the website ‘shop.googlemerchandisestore.com’ during December 2020?

external_knowledge

None

Example 386

instance_id

ga012

db

ga4

question

On November 30, 2020, identify the item category with the highest tax rate by dividing tax value in usd by purchase revenue in usd for purchase events, and then retrieve the transaction IDs, total item quantities, and both purchase revenue in usd and purchase revenue for those purchase events in that top-tax-rate category.

external_knowledge

None

Example 387

instance_id

ga019

db

firebase

question

Could you determine what percentage of users either did not uninstall our app within seven days or never uninstalled it after installing during August and September 2018?

external_knowledge

None

Example 388

instance_id

ga030

db

firebase

question

Can you group users by the week of their first session start, starting from July 2, 2018? For each group, calculate the retention rate in the fourth week (i.e., the percentage of users from the original group who returned in the fourth week after their first session). Please identify the cohort with the highest retention rate in the fourth week, and name the group by the Monday date of the cohort’s first session week. Return the result in the format “YYYY-MM-DD”.

external_knowledge

retention_rate.md

Example 389

instance_id

ga005

db

firebase

question

Conduct a weekly cohort analysis for user retention, starting from July 9, 2018, and ending on October 2, 2018. Group users by the week of their first session_start event (with weeks starting on Monday), and identify new users as those where the event_date matches the date of their user_first_touch_timestamp. Calculate the Week 2 retention rate for each weekly cohort, defined as the percentage of users who had a session_start event exactly 2 weeks after their first week. Only include cohorts from July 9, 2018 through September 17, 2018 (the last cohort that can be analyzed for 2-week retention given the available data). Present the results with each weekly cohort and its corresponding Week 2 retention rate, ordered by cohort date.

external_knowledge

retention_rate.md

Example 390

instance_id

ga028

db

firebase

question

Please perform a 7-day retention analysis for users who first session start the app during the week starting on July 2, 2018. For each week from Week 0 (the week of their first session) to Week 4, provide the total number of new users in Week 0 and the number of retained users for each subsequent week. Ensuring that you only count events up to October 2, 2018, and group dates by Monday-based weeks

external_knowledge

retention_rate.md

Example 391

instance_id

ga020

db

firebase

question

Which quickplay event type had the lowest user retention rate during the second week after their initial engagement, for users who first engaged between August 1 and August 15, 2018, as measured by the presence of session_start events??

external_knowledge

retention_rate.md

Example 392

instance_id

ga021

db

firebase

question

What is the retention rate for users two weeks after their initial quickplay event, calculated separately for each quickplay event type, within the period from July 2, 2018, to July 16, 2018? Please focus on users who started a session (session_start) during this period.

external_knowledge

retention_rate.md

Example 393

instance_id

ga022

db

firebase

question

Could you please help me get the weekly customer retention rate in September 2018 for new customers who first used our app (first_open event) within the first week starting from September 1st, 2018 (timezone in Shanghai)? The retention rates should cover the following weeks 1, 2, and 3 period after the initial use and display them in column format.

external_knowledge

retention_rate.md

Example 394

instance_id

ga025

db

firebase

question

For all users who first opened the app in September 2018 and then uninstalled within seven days, I want to know what percentage of them experienced an app crash (app_exception). The calculation should be done by converting the timestamps to dates first, and then calculating the days to uninstall based on the dates. Only users who uninstalled within 7 days and experienced a crash should be considered in the final percentage.

external_knowledge

None

Example 395

instance_id

local002

db

E_commerce

question

Can you calculate the 5-day symmetric moving average of predicted toy sales for December 5 to 8, 2018, using daily sales data from January 1, 2017, to August 29, 2018, with a simple linear regression model? Finally provide the sum of those four 5-day moving averages?

external_knowledge

None

Example 396

instance_id

local003

db

E_commerce

question

According to the RFM definition document, calculate the average sales per order for each customer within distinct RFM segments, considering only ‘delivered’ orders. Use the customer unique identifier. Clearly define how to calculate Recency based on the latest purchase timestamp and specify the criteria for classifying RFM segments. The average sales should be computed as the total spend divided by the total number of orders. Please analyze and report the differences in average sales across the RFM segments

external_knowledge

RFM.md

Example 397

instance_id

local004

db

E_commerce

question

Could you tell me the number of orders, average payment per order and customer lifespan in weeks of the 3 custumers with the highest average payment per order, where the lifespan is calculated by subtracting the earliest purchase date from the latest purchase date in days, dividing by seven, and if the result is less than seven days, setting it to 1.0?

external_knowledge

None

Example 398

instance_id

local007

db

Baseball

question

Could you help me calculate the average single career span value in years for all baseball players? Please precise the result as a float number. First, calculate the difference in years, months, and days between the debut and final game dates. For each player, the career span is computed as the sum of the absolute number of years, plus the absolute number of months divided by 12, plus the absolute number of days divided by 365. Round each part to two decimal places before summing. Finally, average the career spans and round the result to a float number.

external_knowledge

None

Example 399

instance_id

local008

db

Baseball

question

I would like to know the given names of baseball players who have achieved the highest value of games played, runs, hits, and home runs, with their corresponding score values.

external_knowledge

None

Example 400

instance_id

local009

db

Airlines

question

What is the distance of the longest route where Abakan is either the departure or destination city (in kilometers)?

external_knowledge

haversine_formula.md

Example 401

instance_id

local010

db

Airlines

question

Distribute all the unique city pairs into the distance ranges 0, 1000, 2000, 3000, 4000, 5000, and 6000+, based on their average distance of all routes between them. Then how many pairs are there in the distance range with the fewest unique city paires?

external_knowledge

haversine_formula.md

Example 402

instance_id

local015

db

California_Traffic_Collision

question

Please calculate the fatality rate for motorcycle collisions, separated by helmet usage. Specifically, calculate two percentages: 1) the percentage of motorcyclist fatalities in collisions where parties (drivers or passengers) were wearing helmets, and 2) the percentage of motorcyclist fatalities in collisions where parties were not wearing helmets. For each group, compute this by dividing the total number of motorcyclist fatalities by the total number of collisions involving that group. Use the parties table to determine helmet usage (from party_safety_equipment fields).

external_knowledge

None

Example 403

instance_id

local017

db

California_Traffic_Collision

question

In which year were the two most common causes of traffic accidents different from those in other years?

external_knowledge

None

Example 404

instance_id

local018

db

California_Traffic_Collision

question

For the primary collision factor violation category that was the most common cause of traffic accidents in 2021, how many percentage points did its share of annual road incidents in 2021 decrease compared to its share in 2011?

external_knowledge

None

Example 405

instance_id

local019

db

WWE

question

For the NXT title that had the shortest match (excluding titles with “title change”), what were the names of the two wrestlers involved?

external_knowledge

None

Example 406

instance_id

local026

db

IPL

question

Please help me identify the top 3 bowlers who, in the overs where the maximum runs were conceded in each match, gave up the highest number of runs in a single over across all matches. For each of these bowlers, provide the match in which they conceded these maximum runs. Only consider overs that had the most runs conceded within their respective matches, and among these, determine which bowlers conceded the most runs in a single over overall.

external_knowledge

None

Example 407

instance_id

local020

db

IPL

question

Which bowler has the lowest bowling average per wicket taken?

external_knowledge

None

Example 408

instance_id

local021

db

IPL

question

Could you calculate the average of the total runs scored by all strikers who have scored more than 50 runs in any single match?

external_knowledge

None

Example 409

instance_id

local022

db

IPL

question

Retrieve the names of players who scored no less than 100 runs in a match while playing for the team that lost that match.

external_knowledge

None

Example 410

instance_id

local023

db

IPL

question

Please help me find the names of top 5 players with the highest average runs per match in season 5, along with their batting averages.

external_knowledge

None

Example 411

instance_id

local024

db

IPL

question

Can you help me find the top 5 countries whose players have the highest average of their individual average runs per match across all seasons? Specifically, for each player, calculate their average runs per match over all matches they played, then compute the average of these player averages for each country, and include these country batting averages in the result.

external_knowledge

None

Example 412

instance_id

local025

db

IPL

question

For each match, considering every innings, please combine runs from both batsman scored and extra runs for each over, then identify the single over with the highest total runs, retrieve the bowler for that over from the ball by ball table, and calculate the average of these highest over totals across all matches, ensuring that all runs and bowler details are accurately reflected.

external_knowledge

None

Example 413

instance_id

local028

db

Brazilian_E_Commerce

question

Could you generate a report that shows the number of delivered orders for each month in the years 2016, 2017, and 2018? Each column represents a year, and each row represents a month

external_knowledge

None

Example 414

instance_id

local031

db

Brazilian_E_Commerce

question

What is the highest monthly delivered orders volume in the year with the lowest annual delivered orders volume among 2016, 2017, and 2018?

external_knowledge

None

Example 415

instance_id

local029

db

Brazilian_E_Commerce

question

Please identify the top three customers, based on their customer_unique_id, who have the highest number of delivered orders, and provide the average payment value, city, and state for each of these customers.

external_knowledge

None

Example 416

instance_id

local030

db

Brazilian_E_Commerce

question

Among all cities with delivered orders, find the five cities whose summed payments are the lowest, then calculate the average of their total payments and the average of their total delivered order counts.

external_knowledge

None

Example 417

instance_id

local032

db

Brazilian_E_Commerce

question

Could you help me find the sellers who excel in the following categories, considering only delivered orders: the seller with the highest number of distinct customer unique IDs, the seller with the highest profit (calculated as price minus freight value), the seller with the highest number of distinct orders, and the seller with the most 5-star ratings? For each category, please provide the seller ID and the corresponding value, labeling each row with a description of the achievement.

external_knowledge

None

Example 418

instance_id

local034

db

Brazilian_E_Commerce

question

Could you help me calculate the average of the total number of payments made using the most preferred payment method for each product category, where the most preferred payment method in a category is the one with the highest number of payments?

external_knowledge

None

Example 419

instance_id

local037

db

Brazilian_E_Commerce

question

Identify the top three product categories whose most commonly used payment type has the highest number of payments across all categories, and specify the number of payments made in each category using that payment type.

external_knowledge

None

Example 420

instance_id

local035

db

Brazilian_E_Commerce

question

In the “olist_geolocation” table, please identify which two consecutive cities, when sorted by geolocation_state, geolocation_city, geolocation_zip_code_prefix, geolocation_lat, and geolocation_lng, have the greatest distance between them based on the difference in distance computed between each city and its immediate predecessor in that ordering.

external_knowledge

spherical_law.md

Example 421

instance_id

local038

db

Pagila

question

Could you help me determine which actor starred most frequently in English-language children’s category films that were rated either G or PG, had a running time of 120 minutes or less, and were released between 2000 and 2010? Please provide the actor’s full name.

external_knowledge

None

Example 422

instance_id

local039

db

Pagila

question

Please help me find the film category with the highest total rental hours in cities where the city’s name either starts with “A” or contains a hyphen.

external_knowledge

None

Example 423

instance_id

local040

db

modern_data

question

In the combined dataset that unifies the trees data with the income data by ZIP code, filling missing ZIP values where necessary, which three boroughs, restricted to records with median and mean income both greater than zero and a valid borough name, contain the highest number of trees, and what is the average mean income for each of these three boroughs?

external_knowledge

None

Example 424

instance_id

local041

db

modern_data

question

What percentage of trees in the Bronx have a health status of Good?

external_knowledge

None

Example 425

instance_id

local049

db

modern_data

question

Can you help me calculate the average number of new unicorn companies per year in the top industry from 2019 to 2021?

external_knowledge

None

Example 426

instance_id

local054

db

chinook

question

Could you tell me the first names of customers who spent less than $1 on albums by the best-selling artist, along with the amounts they spent?

external_knowledge

None

Example 427

instance_id

local055

db

chinook

question

Identify the artist with the highest overall sales of albums (tie broken by alphabetical order) and the artist with the lowest overall sales of albums (tie broken by alphabetical order), then calculate the amount each customer spent specifically on those two artists’ albums. Next, compute the average spending for the customers who purchased from the top-selling artist and the average spending for the customers who purchased from the lowest-selling artist, and finally return the absolute difference between these two averages.

external_knowledge

None

Example 428

instance_id

local198

db

chinook

question

Using the sales data, what is the median value of total sales made in countries where the number of customers is greater than 4?

external_knowledge

None

Example 429

instance_id

local056

db

sqlite-sakila

question

Which customer has the highest average monthly change in payment amounts? Provide the customer’s full name.

external_knowledge

None

Example 430

instance_id

local058

db

education_business

question

Can you provide a list of hardware product segments along with their unique product counts for 2020 in the output, ordered by the highest percentage increase in unique fact sales products from 2020 to 2021?

external_knowledge

None

Example 431

instance_id

local059

db

education_business

question

For the calendar year 2021, what is the overall average quantity sold of the top three best-selling hardware products (by total quantity sold) in each division?

external_knowledge

None

Example 432

instance_id

local060

db

complex_oracle

question

In the United States, for Q4 2019 and Q4 2020, first select only those cities where total sales (with no promotions) rose by at least 20% from Q4 2019 to Q4 2020. Among these cities, rank products by their overall sales (still excluding promotions) in those quarters and take the top 20%. Then compute each top product’s share of total sales in Q4 2019 and Q4 2020 and calculate the difference in share from Q4 2019 to Q4 2020, returning the results in descending order of that share change.

external_knowledge

None

Example 433

instance_id

local063

db

complex_oracle

question

Among all products sold in the United States with promo_id=999, considering only those cities whose sales increased by at least 20% from Q4 2019 (calendar_quarter_id=1772) to Q4 2020 (calendar_quarter_id=1776), which product that ranks in the top 20% of total sales has the smallest percentage-point change in its share of total sales between these two quarters?

external_knowledge

None

Example 434

instance_id

local061

db

complex_oracle

question

What is the average projected monthly sales in USD for France in 2021, considering only product sales with promotions where promo_total_id = 1 and channels where channel_total_id = 1, by taking each product’s monthly sales from 2019 and 2020, calculating the growth rate from 2019 to 2020 for that same product and month, applying this growth rate to project 2021 monthly sales, converting all projected 2021 amounts to USD with the 2021 exchange rates, and finally averaging and listing them by month?

external_knowledge

projection_calculation.md

Example 435

instance_id

local050

db

complex_oracle

question

What is the median of the average monthly projected sales in USD for France in 2021, calculated by using the monthly sales data from 2019 and 2020 (filtered by promo_total_id=1 and channel_total_id=1), applying the growth rate from 2019 to 2020 to project 2021, converting to USD based on the currency table, and then determining the monthly averages before finding their median?

external_knowledge

projection_calculation.md

Example 436

instance_id

local062

db

complex_oracle

question

Please group all Italian customers into ten buckets for December 2021 by summing their profits from all products purchased (where profit is calculated as quantity_sold multiplied by the difference between unit_price and unit_cost), then divide the overall range of total monthly profits into ten equal intervals. For each bucket, provide the number of customers, and identify the minimum and maximum total profits within that bucket.

external_knowledge

None

Example 437

instance_id

local067

db

complex_oracle

question

Can you provide the highest and lowest profits for Italian customers segmented into ten evenly divided tiers based on their December 2021 sales profits?

external_knowledge

None

Example 438

instance_id

local070

db

city_legislation

question

Please examine our database records for Chinese cities (country_code_2 = ‘cn’) during July 2021 and identify both the shortest and longest streaks of consecutive date entries. For each date in these streaks, return exactly one record per date along with the corresponding city name. In your output, please ensure the first letter of each city name is capitalized and the rest are lowercase. Display the dates and city names for both the shortest and longest consecutive date streaks, ordered by date.

external_knowledge

None

Example 439

instance_id

local071

db

city_legislation

question

Could you review our records in June 2022 and identify which countries have the longest streak of consecutive inserted city dates? Please list the 2-letter length country codes of these countries.

external_knowledge

None

Example 440

instance_id

local072

db

city_legislation

question

Identify the country with data inserted on nine different days in January 2022. Then, find the longest consecutive period with data insertions for this country during January 2022, and calculate the proportion of entries that are from its capital city within this longest consecutive insertion period.

external_knowledge

None

Example 441

instance_id

local068

db

city_legislation

question

Calculate the number of new cities inserted in April, May, and June for each year from 2021 to 2023. For each month, compute the cumulative running total of cities added for that specific month across the years up to and including the given year (i.e., sum the counts of that month over the years). Additionally, calculate the year-over-year growth percentages for both the monthly total and the running total for each month, comparing each year to the previous year. Present the results only for 2022 and 2023, listing the year, the month, the total number of cities added in that month, the cumulative running total for that month, and the year-over-year growth percentages for both the monthly total and the running total. Use the data from 2021 solely as a baseline for calculating growth rates, and exclude it from the final output.

external_knowledge

None

Example 442

instance_id

local073

db

modern_data

question

For each pizza order, provide a single result row with the row ID, order ID, customer ID, pizza name, and final set of ingredients. The final ingredients are determined by starting with the standard toppings from the pizza’s recipe, removing any excluded toppings, and adding any extra toppings. Present the ingredients in a string starting with the pizza name followed by ‘: ‘, with ingredients listed in alphabetical order. Ingredients appearing multiple times (e.g., from standard and extra toppings) should be prefixed with ‘2x’ and listed first, followed by single-occurrence ingredients, both in alphabetical order. Group by row ID, order ID, pizza name, and order time to ensure each order appears once. Sort results by row ID in ascending order. Assign pizza_id 1 to ‘Meatlovers’ pizzas and pizza_id 2 to all others.

external_knowledge

None

Example 443

instance_id

local066

db

modern_data

question

Based on our customer pizza order information, summarize the total quantity of each ingredient used in the pizzas we delivered. Output the name and quantity for each ingredient.

external_knowledge

None

Example 444

instance_id

local065

db

modern_data

question

Calculate the total income from Meat Lovers pizzas priced at $12 and Vegetarian pizzas at $10. Include any extra toppings charged at $1 each. Ensure that canceled orders are filtered out. How much money has Pizza Runner earned in total?

external_knowledge

None

Example 445

instance_id

local074

db

bank_sales_trading

question

Please generate a summary of the closing balances at the end of each month for each customer transactions, show the monthly changes and monthly cumulative bank account balances. Ensure that even if a customer has no account activity in a given month, the balance for that month is still included in the output.

external_knowledge

None

Example 446

instance_id

local064

db

bank_sales_trading

question

For each customer and each month of 2020, first calculate the month- end balance by adding all deposit amounts and subtracting all withdrawal amounts that occurred during that specific month. Then determine which month in 2020 has the highest count of customers with a positive month-end balance and which month has the lowest count. For each of these two months, compute the average month-end balance across all customers and provide the difference between these two averages

external_knowledge

None

Example 447

instance_id

local297

db

bank_sales_trading

question

For each customer, group all deposits and withdrawals by the first day of each month to obtain a monthly net amount, then calculate each month’s closing balance by cumulatively summing these monthly nets. Next, determine the most recent month’s growth rate by comparing its closing balance to the prior month’s balance, treating deposits as positive and withdrawals as negative, and if the previous month’s balance is zero, the growth rate should be the current month’s balance multiplied by 100. Finally, compute the percentage of customers whose most recent month shows a growth rate of more than 5%.

external_knowledge

None

Example 448

instance_id

local298

db

bank_sales_trading

question

For each month, calculate the total balance from all users for the previous month (measured as of the 1st of each month), replacing any negative balances with zero. Ensure that data from the first month is used only as a baseline for calculating previous total balance, and exclude it from the final output. Sort the results in ascending order by month.

external_knowledge

None

Example 449

instance_id

local299

db

bank_sales_trading

question

For a bank database with customer transactions, calculate each customer’s daily running balance (where deposits add to the balance and other transaction types subtract). For each customer and each day, compute the 30-day rolling average balance (only after having 30 days of data, and treating negative averages as zero). Then group these daily averages by month and find each customer’s maximum 30-day average balance within each month. Sum these maximum values across all customers for each month. Consider the first month of each customer’s transaction history as the baseline period and exclude it from the final results, presenting monthly totals of these summed maximum 30-day average balances.

external_knowledge

None

Example 450

instance_id

local300

db

bank_sales_trading

question

For each customer, calculate their daily balances for every day between their earliest and latest transaction dates, including days without transactions by carrying forward the previous day’s balance. Treat any negative daily balances as zero. Then, for each month, determine the highest daily balance each customer had during that month. Finally, for each month, sum these maximum daily balances across all customers to obtain a monthly total.

external_knowledge

None

Example 451

instance_id

local075

db

bank_sales_trading

question

Can you provide a breakdown of how many times each product was viewed, how many times they were added to the shopping cart, and how many times they were left in the cart without being purchased? Also, give me the count of actual purchases for each product. Ensure that products with a page id in (1, 2, 12, 13) are filtered out.

external_knowledge

None

Example 452

instance_id

local077

db

bank_sales_trading

question

Please analyze our interest data from September 2018 to August 2019. For each month, calculate the average composition for each interest by dividing the composition by the index value. Identify the interest with the highest average composition value each month and report its average composition as the max index composition for that month. Compute the three-month rolling average of these monthly max index compositions. Ensure the output includes the date, the interest name, the max index composition for that month, the rolling average, and the names and max index compositions of the top interests from one month ago and two months ago.

external_knowledge

None

Example 453

instance_id

local078

db

bank_sales_trading

question

Identify the top 10 and bottom 10 interest categories based on their highest composition values across all months. For each category, display the time(MM-YYYY), interest name, and the composition value

external_knowledge

None

Example 454

instance_id

local081

db

northwind

question

Considering only the customers who placed orders in 1998, calculate the total amount each customer spent by summing the unit price multiplied by the quantity of all products in their orders, excluding any discounts. Assign each customer to a spending group based on the customer group thresholds, and determine how many customers are in each spending group and what percentage of the total number of customers who placed orders in 1998 each group represents.

external_knowledge

None

Example 455

instance_id

local085

db

northwind

question

Among employees who have more than 50 total orders, which three have the highest percentage of late orders, where an order is considered late if the shipped date is on or after its required date? Please list each employee’s ID, the number of late orders, and the corresponding late-order percentage.

external_knowledge

None

Example 456

instance_id

local096

db

Db-IMDB

question

For each year, calculate the percentage of films that had exclusively female actors (meaning no male actors and no actors with unknown/unspecified gender). Consider actors with gender marked as ‘Male’ or ‘None’ as non-female. For the results, display the year, the total number of movies in that year, and the percentage of movies with exclusively female actors. Extract the year from the Movie.year field by taking the last 4 characters and converting to a number.

external_knowledge

None

Example 457

instance_id

local097

db

Db-IMDB

question

Could you analyze our data and identify which ten-year period starting from any movie release year present in the data had the largest number of films, considering consecutive ten-year periods beginning at each unique year? Only output the start year and the total count for that specific period.

external_knowledge

None

Example 458

instance_id

local098

db

Db-IMDB

question

From the first year each actor appeared in a film to the last, how many actors in the database never had a gap longer than three consecutive years without at least one new movie appearance, meaning there is no four-year span anywhere in their active career without at least a single film credit?

external_knowledge

None

Example 459

instance_id

local099

db

Db-IMDB

question

I need you to look into the actor collaborations and tell me how many actors have made more films with Yash Chopra than with any other director. This will help us understand his influence on the industry better.

external_knowledge

None

Example 460

instance_id

local100

db

Db-IMDB

question

Find out how many actors have a ‘Shahrukh number’ of 2? This means they acted in a film with someone who acted with Shahrukh Khan, but not directly with him.

external_knowledge

None

Example 461

instance_id

local114

db

education_business

question

Provide a detailed web sales report for each region, including the number of orders, total sales amount, and the name and sales amount of all sales representatives who achieved the highest total sales amount in that region (include all representatives in case of a tie).

external_knowledge

None

Example 462

instance_id

local128

db

BowlingLeague

question

List the bowlers (including their ID, first name, and last name), match number, game number, handicap score, tournament date, and location for only those bowlers who have won games with a handicap score of 190 or less at all three venues: Thunderbird Lanes, Totem Lanes, and Bolero Lanes. Only include the specific game records where they won with a handicap score of 190 or less at these three locations.

external_knowledge

None

Example 463

instance_id

local130

db

school_scheduling

question

Could you provide a list of last names for all students who have completed English courses (where completion is defined as having a ClassStatus of 2), along with their quintile ranks based on their individual grades in those courses? The quintile should be determined by calculating how many students have grades greater than or equal to each student’s grade, then dividing this ranking by the total number of students who completed English courses. The quintiles should be labeled as “First” (top 20%), “Second” (top 21-40%), “Third” (top 41-60%), “Fourth” (top 61-80%), and “Fifth” (bottom 20%). Please sort the results from highest performing quintile to lowest (First to Fifth).

external_knowledge

None

Example 464

instance_id

local131

db

EntertainmentAgency

question

Could you list each musical style with the number of times it appears as a 1st, 2nd, or 3rd preference in a single row per style?

external_knowledge

None

Example 465

instance_id

local133

db

EntertainmentAgency

question

Given a database of musical styles and user preferences, where Musical_Preferences contains user rankings of musical styles (PreferenceSeq=1 for first choice, PreferenceSeq=2 for second choice, PreferenceSeq=3 for third choice): Calculate a weighted score for each musical style by assigning 3 points for each time it was ranked as first choice, 2 points for each second choice, and 1 point for each third choice ranking. Calculate the total weighted score for each musical style that has been ranked by at least one user. Then, compute the absolute difference between each style’s total weighted score and the average total weighted score across all such styles.

external_knowledge

None

Example 466

instance_id

local132

db

EntertainmentAgency

question

Show all pairs of entertainers and customers who each have up to three style strengths or preferences, where the first and second style preferences of the customers match the first and second style strengths of the entertainers (or in reverse order). Only return the entertainer’s stage name and the customer’s last name

external_knowledge

None

Example 467

instance_id

local141

db

AdventureWorks

question

How did each salesperson’s annual total sales compare to their annual sales quota? Provide the difference between their total sales and the quota for each year, organized by salesperson and year.

external_knowledge

None

Example 468

instance_id

local152

db

imdb_movies

question

Can you provide the top 9 directors by movie count, including their ID, name, number of movies, average inter-movie duration (rounded to the nearest integer), average rating (rounded to 2 decimals), total votes, minimum and maximum ratings, and total movie duration? Sort the output first by movie count in descending order and then by total movie duration in descending order.

external_knowledge

None

Example 469

instance_id

local230

db

imdb_movies

question

Determine the top three genres with the most movies rated above 8, and then identify the top four directors who have directed the most films rated above 8 within those genres. List these directors and their respective movie counts.

external_knowledge

None

Example 470

instance_id

local156

db

bank_sales_trading

question

Analyze the annual average purchase price per Bitcoin by region, computed as the total dollar amount spent divided by the total quantity purchased each year, excluding the first year’s data for each region. Then, for each year, rank the regions based on these average purchase prices, and calculate the annual percentage change in cost for each region compared to the previous year.

external_knowledge

None

Example 471

instance_id

local157

db

bank_sales_trading

question

Using the “bitcoin_prices” table, please calculate the daily percentage change in trading volume for each ticker from August 1 to August 10, 2021, ensuring that any volume ending in “K” or “M” is accurately converted to thousands or millions, any “-“ volume is treated as zero, only non-zero volumes are used to determine the previous day’s volume, and the results are ordered by ticker and date.

external_knowledge

None

Example 472

instance_id

local163

db

education_business

question

Which university faculty members’ salaries are closest to the average salary for their respective ranks? Please provide the ranks, first names, last names, and salaries.university

external_knowledge

None

Example 473

instance_id

local168

db

city_legislation

question

Among job postings that specifically have the Data Analyst, require a non-null annual average salary, and are remote, what is the overall average salary when considering only the top three most frequently demanded skills for these positions?

external_knowledge

None

Example 474

instance_id

local169

db

city_legislation

question

What is the annual retention rate of legislators who began their first term between January 1, 1917 and December 31, 1999, measured as the proportion of this cohort still in office on December 31st for each of the first 20 years following their initial term start? The results should show all 20 periods in sequence regardless of whether any legislators were retained in a particular year.

external_knowledge

None

Example 475

instance_id

local171

db

city_legislation

question

For male legislators from Louisiana, how many distinct legislators were actively serving on December 31 of each year from more than 30 years since their first term up to less than 50 years, grouping the results by the exact number of years elapsed since their first term?

external_knowledge

None

Example 476

instance_id

local167

db

city_legislation

question

Based on the state each female legislator first represented, which state has the highest number of female legislators whose terms included December 31st at any point, and what is that count? Please provide the state’s abbreviation.

external_knowledge

None

Example 477

instance_id

local170

db

city_legislation

question

Identify the state abbreviations where, for both male and female legislators, the retention rate remains greater than zero at specific intervals of 0, 2, 4, 6, 8, and 10 years after their first term start date. A legislator is considered retained if they are serving on December 31 of the respective year. Only include states where both gender cohorts maintain non-zero retention rates at all six of these time points during the first decade of service.

external_knowledge

None

Example 478

instance_id

local193

db

sqlite-sakila

question

Could you find out the average percentage of the total lifetime sales (LTV) that occur in the first 7 and 30 days after a customer’s initial purchase? Also, include the average total lifetime sales (LTV). Please exclude customers with zero lifetime sales. The 7- and 30-day periods should be based on the exact number of hours-minutes-seconds, not calendar days.

external_knowledge

None

Example 479

instance_id

local194

db

sqlite-sakila

question

Please provide a list of the top three revenue-generating films for each actor, along with the average revenue per actor in those films, calculated by dividing the total film revenue equally among the actors for each film.

external_knowledge

None

Example 480

instance_id

local195

db

sqlite-sakila

question

Please find out how widespread the appeal of our top five actors is. What percentage of our customers have rented films featuring these actors?

external_knowledge

None

Example 481

instance_id

local196

db

sqlite-sakila

question

For each rating category of the first movie rented by customers—where the first movie is identified based on the earliest payment date per customer—please provide the average total amount spent per customer and the average number of subsequent rentals (calculated as the total number of rentals minus one) for customers whose first rented movie falls into that rating category.

external_knowledge

None

Example 482

instance_id

local197

db

sqlite-sakila

question

Among our top 10 paying customers, can you identify the largest change in payment amounts from one month to the immediately following month? Specifically, please determine for which customer and during which month this maximum month-over-month difference occurred, and provide the difference rounded to two decimal places.

external_knowledge

None

Example 483

instance_id

local199

db

sqlite-sakila

question

Can you identify the year and month with the highest rental orders created by the store’s staff for each store? Please list the store ID, the year, the month, and the total rentals for those dates.

external_knowledge

None

Example 484

instance_id

local201

db

modern_data

question

Identify the first 10 words, sorted alphabetically, that are 4 to 5 characters long, start with ‘r’, and have at least one anagram of the same length, considering case-sensitive letters. Provide the count of such anagrams for each word.

external_knowledge

None

Example 485

instance_id

local202

db

city_legislation

question

For alien data, how many of the top 10 states by alien population have a higher percentage of friendly aliens than hostile aliens, with an average alien age exceeding 200?

external_knowledge

None

Example 486

instance_id

local209

db

delivery_center

question

In the dataset of orders joined with store information, which store has the highest total number of orders, and among that store’s orders, what is the ratio of orders that appear in the deliveries table with a ‘DELIVERED’ status to the total orders for that store?

external_knowledge

None

Example 487

instance_id

local210

db

delivery_center

question

Can you identify the hubs that saw more than a 20% increase in finished orders from February to March?

external_knowledge

None

Example 488

instance_id

local212

db

delivery_center

question

Can you find 5 delivery drivers with the highest average number of daily deliveries?

external_knowledge

None

Example 489

instance_id

local218

db

EU_soccer

question

Can you calculate the median from the highest season goals of each team?

external_knowledge

None

Example 490

instance_id

local219

db

EU_soccer

question

In each league, considering all seasons, which single team has the fewest total match wins based on comparing home and away goals, including teams with zero wins, ensuring that if multiple teams tie for the fewest wins, only one team is returned for each league?

external_knowledge

None

Example 491

instance_id

local221

db

EU_soccer

question

Tell me top10 teams with the most wins across the league

external_knowledge

None

Example 492

instance_id

local220

db

EU_soccer

question

Which player has participated in the highest number of winning matches and which player has participated in the highest number of losing matches, considering only matches where they actually played (excluding null entries) and where their team won or lost (excluding draws)?

external_knowledge

None

Example 493

instance_id

local228

db

IPL

question

For each IPL season, identify the top three batsmen with the highest total runs scored and the top three bowlers with the most wickets taken, excluding ‘run out’, ‘hit wicket’, and ‘retired hurt’ dismissals. In the event of ties in runs or wickets, break the tie using the smaller player ID. Then output these six players in matched positions—batsman 1 with bowler 1, batsman 2 with bowler 2, and batsman 3 with bowler 3—in ascending order of the season ID, along with each player’s total runs or wickets.

external_knowledge

None

Example 494

instance_id

local229

db

IPL

question

Find the IDs of players who scored the highest number of partnership runs for each match. The output should include the IDs of two players, each with their individual scores and the total partnership score. For each pair, the player with the higher individual score should be listed as player 1, and the player with the lower score as player 2. In cases where both players have the same score, the player with the higher ID should be player 1, and the player with the lower ID should be player 2. There can be multiple rows for a single match.

external_knowledge

None

Example 495

instance_id

local244

db

music

question

Calculate the duration of each track, classify them as short, medium, or long, output the minimum and maximum time for each kind (in minutes) and the total revenue for each category, group by the category.

external_knowledge

music_length_type.md

Example 496

instance_id

local253

db

education_business

question

Using a Salary Dataset where the salary values need to be cleaned by removing non-numeric characters and converting them to a numeric type, write a detailed SQL query that identifies the top 5 companies by average salary in each of Mumbai, Pune, New Delhi, and Hyderabad, then compares each company’s average salary in those cities to the overall national average salary. The final result should display four columns: Location, Company Name, Average Salary in State, and Average Salary in Country, listing only the top 5 companies in each of the specified locations.

external_knowledge

None

Example 497

instance_id

local258

db

IPL

question

Calculate the total number of wickets taken by each bowler (excluding run-outs and other dismissals not attributed to the bowler), their economy rate (total runs conceded divided by total overs bowled, considering only runs scored off the bat and ignoring any extra runs like wides and no-balls), their strike rate (average number of balls bowled per wicket taken), and their best bowling performance in a single match (the match with the most wickets taken by the bowler, formatted as “wickets-runs” where runs are the runs conceded excluding extras).

external_knowledge

baseball_game_special_words_definition.md

Example 498

instance_id

local259

db

IPL

question

For each player, list their ID, name, their most frequent role across all matches, batting hand, bowling skill, total runs scored, total matches played, total times they were dismissed, batting average (total runs divided by total dismissals), highest score in a single match, the number of matches in which they scored at least 30 runs, at least 50 runs, and at least 100 runs, total balls faced in their career, strike rate (total runs divided by total balls faced, multiplied by 100), total wickets taken, economy rate (average runs conceded per over), and their best bowling performance in a single match (most wickets taken in a match, formatted as “wickets taken-runs given”, where the best performance is the one with the most wickets, and if tied, the fewest runs conceded). Ignore the extra runs data.

external_knowledge

baseball_game_special_words_definition.md

Example 499

instance_id

local262

db

stacking

question

Which problems exceed the total number of times they appear in the solution table when counting all occurrences, across steps 1, 2, and 3, where any non-“Stack” model’s maximum test score is lower than the “Stack” model’s test score for the same step and version?

external_knowledge

None

Example 500

instance_id

local263

db

stacking

question

Identify the L1_model associated with each model (specified by name and version) that occurs most frequently for each status (‘strong’ or ‘soft’), along with the number of times it occurs. A model has a ‘strong’ status if, for any of its steps, the maximum test score among non-‘Stack’ models is less than the ‘Stack’ model’s test score. It has a ‘soft’ status if the maximum test score among non-‘Stack’ models equals the ‘Stack’ model’s test score. Count how many times each L1_model is associated with a ‘strong’ or ‘soft’ status across all models, and determine which L1_model has the highest occurrence for each status.

external_knowledge

None

Example 501

instance_id

local264

db

stacking

question

Which model category (L1_model) appears the most frequently across all steps and versions when comparing traditional models to the Stack model, and what is the total count of its occurrences?

external_knowledge

None

Example 502

instance_id

local269

db

oracle_sql

question

What is the average total quantity across all final packaging combinations, considering only the leaf-level items within each combination after fully expanding any nested packaging relationships?

external_knowledge

None

Example 503

instance_id

local270

db

oracle_sql

question

Which top-level packaging containers, meaning those not contained within any other packaging, have any item for which the total quantity accumulated across all nested levels in the hierarchy exceeds 500, and what are the names of both these containers and the corresponding items?

external_knowledge

None

Example 504

instance_id

local272

db

oracle_sql

question

For order 423, identify the product IDs, aisles, and positions from which to pick the exact quantities needed for each order line, ensuring that the total picked quantity for each product matches the cumulative quantities ordered without exceeding the available inventory in warehouse 1. Calculate the quantities to be picked from each location by prioritizing inventory with earlier purchased dates and smaller quantities, and ensure that picking respects the sequence and cumulative quantities of the order lines for products with multiple entries.

external_knowledge

None

Example 505

instance_id

local273

db

oracle_sql

question

Calculate the average pick percentage for each product name, using a first-in-first-out approach that selects from inventory locations based on the earliest purchase date and smallest available quantity, ensuring that the picked quantity reflects only the overlapping range between each order’s required quantity and the inventory’s available quantity, and then grouping and ordering the results by product name?

external_knowledge

None

Example 506

instance_id

local274

db

oracle_sql

question

Which products were picked for order 421, and what is the average number of units picked for each product, using FIFO (First-In, First- Out) method?

external_knowledge

None

Example 507

instance_id

local275

db

oracle_sql

question

Based on monthly sales data starting in January 2016 and using a centered moving average to adjust for seasonality, which products had a seasonality-adjusted sales ratio that stayed consistently above 2 for every month in the year 2017?

external_knowledge

calculation_method.md

Example 508

instance_id

local277

db

oracle_sql

question

What is the average forecasted annual sales for products 4160 and 7790 during 2018, using monthly sales data starting from January 2016 for the first 36 months, applying seasonality adjustments from time steps 7 through 30, and employing a weighted regression method to estimate sales?

external_knowledge

calculation_method.md

Example 509

instance_id

local279

db

oracle_sql

question

Using a recursive monthly inventory adjustment model starting from December 2018 inventory levels, where we restock a product if its ending inventory drops below the minimum required level, determine for each product the month in 2019 where the absolute difference between its ending inventory and the minimum required level is the smallest, and return the product_id, that month, and the absolute difference.

external_knowledge

None

Example 510

instance_id

local283

db

EU_soccer

question

Analyze the soccer match dataset to determine the champion team for each season across all countries and leagues, awarding 3 points for every win, 1 point for every tie, and 0 points for every loss. For each season, return the champion’s team name, the league, the country, and the total points accumulated.

external_knowledge

None

Example 511

instance_id

local284

db

bank_sales_trading

question

For veg whsle data, can you generate a summary of our items’ loss rates? Include the average loss rate, and also break down the count of items that are below, above, and within one standard deviation from this average.

external_knowledge

None

Example 512

instance_id

local285

db

bank_sales_trading

question

For veg whsle data, can you analyze our financial performance over the years 2020 to 2023? I need insights into the average wholesale price, maximum wholesale price, minimum wholesale price, wholesale price difference, total wholesale price, total selling price, average loss rate, total loss, and profit for each category within each year. Round all calculated values to two decimal places.

external_knowledge

None

Example 513

instance_id

local286

db

electronic_sales

question

Prepare a comprehensive performance report on our sellers, focusing on total sales, average item price, average review scores, and packing times. Ensure that the report includes only those sellers who have sold a quantity of more than 100 products and highlight the product category names in English with the highest sales volume.

external_knowledge

None

Example 514

instance_id

local301

db

bank_sales_trading

question

For weekly-sales data, I need an analysis of our sales performance around mid-June for the years 2018, 2019, and 2020. Specifically, calculate the percentage change in sales between the four weeks leading up to June 15 and the four weeks following June 15 for each year.

external_knowledge

None

Example 515

instance_id

local302

db

bank_sales_trading

question

Analyze the average percentage change in sales between the 12 weeks before and after June 15, 2020, for each attribute type: region, platform, age band, demographic, and customer type. For each attribute type, calculate the average percentage change in sales across all its attribute values. Identify the attribute type with the highest negative impact on sales and provide its average percentage change in sales.

external_knowledge

None

Example 516

instance_id

local329

db

log

question

How many unique sessions visited the /regist/input page and then the /regist/confirm page, in that order?

external_knowledge

None

Example 517

instance_id

local330

db

log

question

Using the activity log table, compute the total number of unique user sessions where each web page appears as either a landing page (the first page visited in a session based on timestamp) or an exit page (the last page visited in a session based on timestamp), or both. Count each session only once per page even if the page serves as both landing and exit for that session.

external_knowledge

None

Example 518

instance_id

local331

db

log

question

Which three distinct third-page visits are most frequently observed immediately after two consecutive visits to the ‘/detail’ page, and how many times does each third-page visit occur?

external_knowledge

None

Example 519

instance_id

local358

db

log

question

How many users are there in each age category (20s, 30s, 40s, 50s, and others)?

external_knowledge

None

Example 520

instance_id

local360

db

log

question

For each user session in the activity log table, identify the number of events that occurred before the first ‘/detail’ click or ‘/complete’ conversion, counting only events that have a non-empty search type. Find the sessions with the minimum count of such pre- click/pre-conversion events. If multiple sessions share this minimum count, include all of them in the results. Return each qualifying session along with the corresponding path and search type.

external_knowledge

None

Example 521

instance_id

local344

db

f1

question

Considering all races where pit stop data is available, and focusing on instances when a driver was not behind another car on the previous lap but is behind on the current lap (accounting for retirements, pit- stop entries, pit-stop exits, and race starts), how many times has each type of overtake occurred in Formula 1?

external_knowledge

f1_overtake.md

Example 522

instance_id

local336

db

f1

question

In the first five laps of the race, how many overtakes occurred in each category—retirements, pit stops, start-related overtakes, and standard on-track passes?

external_knowledge

f1_overtake.md

Example 523

instance_id

local335

db

f1

question

In Formula 1 seasons since 2001, considering only drivers who scored points in a season, which five constructors have had the most seasons where their drivers scored the fewest total points among all point- scoring drivers in that season?

external_knowledge

None

Example 524

instance_id

local309

db

f1

question

For each year, which driver and which constructor scored the most points? I want the full name of each driver.

external_knowledge

None

Example 525

instance_id

local310

db

f1

question

Using only the data from the ‘results’ table, find the three years in which the sum of the highest total points earned by any driver and the highest total points earned by any constructor in that year (both calculated by summing up points from the ‘results’ table) is smallest, and list those three years in order of ascending total.

external_knowledge

None

Example 526

instance_id

local311

db

f1

question

Which constructors had the top 3 combined points from their best driver and team, and in which years did they achieve them?

external_knowledge

None

Example 527

instance_id

local354

db

f1

question

Among Formula 1 drivers who raced during the 1950s, which drivers completed a season in that decade with the same constructor in both the first and the last race they participated in, while also taking part in at least two distinct race rounds during that season?

external_knowledge

None

Example 528

instance_id

local355

db

f1

question

Calculate the overall average first round and average last round of races missed by Formula 1 drivers across all years. Include only drivers who missed fewer than three races in a given year and who switched teams between their appearances immediately before and after their hiatus (i.e., the constructor ID for the race right before their first missed race must be different from the constructor ID for the race right after their last missed race in that year). Do not group results by year; return just the overall averages across the entire dataset.

external_knowledge

None

Example 529

instance_id

local356

db

f1

question

Provide the full names of drivers who have been overtaken on track more times than they have overtaken others on track during race laps, excluding position changes due to pit stops (both at pit entry and exit), retirements, or position changes that occurred during the first lap of a race (considered as start movements).

external_knowledge

None

Example 530

instance_id

sf001

db

GLOBAL_WEATHER__CLIMATE_DATA_FOR_BI

question

Assuming today is April 1, 2024, I would like to know the daily snowfall amounts greater than 6 inches for each U.S. postal code during the week ending after the first two full weeks of the previous year. Show the postal code, date, and snowfall amount.

external_knowledge

None

Example 531

instance_id

sf003

db

GLOBAL_GOVERNMENT

question

For each year from 2015 to 2020, which Census Zip Code Tabulation Area had the second-highest annual population growth rate? Include only areas with a population estimate of at least 25,000 people in that year (based on 5-Year American Community Survey estimates). For each year, provide the zip code, state abbreviation, and the annual growth rate percentage.

external_knowledge

None

Example 532

instance_id

sf002

db

FINANCE__ECONOMICS

question

As of December 31, 2022, list the top 10 active banks with assets exceeding $10 billion, ranked by the highest percentage of uninsured assets, where the percentage is calculated as one minus the value of the ‘% Insured (Estimated)’ variable from quarterly estimates. Provide the names of these banks and their respective percentages of uninsured assets.

external_knowledge

None

Example 533

instance_id

sf044

db

FINANCE__ECONOMICS

question

What was the percentage change in post-market close prices for the Magnificent 7 tech companies from January 1 to June 30, 2024?

external_knowledge

None

Example 534

instance_id

sf006

db

FINANCE__ECONOMICS

question

For each U.S. state, find how the number of active financial branch entities has changed from March 1, 2020, to December 31, 2021. An entity is considered active on a specific date if its start date is on or before that date and its end date is either null or on or after that date. For each state, calculate the number of entities active on March 1, 2020, the number of entities active on December 31, 2021, and the percentage change in these counts

external_knowledge

None

Example 535

instance_id

sf008

db

US_REAL_ESTATE

question

Determine the percentage change in gross income inflow and the seasonally-adjusted purchase-only home price index for the Phoenix- Mesa-Scottsdale, AZ Metro Area from January 1, 2023, to December 31,

Gross income inflow refers to the total adjusted gross income from all financial entities within the specified metro area

external_knowledge

None

Example 536

instance_id

sf010

db

US_REAL_ESTATE

question

What are the cumulative ratios of mortgages near default in California for each recorded date in 2023, including those that are 90 to 180 days past due, in forbearance, or undergoing foreclosure, bankruptcy, or deed-in-lieu processes?

external_knowledge

None

Example 537

instance_id

sf037

db

US_REAL_ESTATE

question

How can we find the shortest straight-line distance in miles between each ‘The Home Depot’ store and its nearest ‘Lowe’s Home Improvement’ location? Using the US_REAL_ESTATE.CYBERSYN database, join the point_of_interest_index table with point_of_interest_addresses_relationships and us_addresses tables to get geographic coordinates. For each ‘The Home Depot’ location (identified by its poi_id), calculate its distance to all ‘Lowe’s Home Improvement’ stores using ST_DISTANCE and ST_MAKEPOINT functions, convert the distance from meters to miles (dividing by 1609), and return only the record with the minimum distance for each ‘The Home Depot’ store using QUALIFY with ROW_NUMBER().

external_knowledge

None

Example 538

instance_id

sf012

db

WEATHER__ENVIRONMENT

question

Using data from the FEMA National Flood Insurance Program Claim Index, for each year from 2010 through 2019, what were the total building damage amounts and total contents damage amounts reported under the National Flood Insurance Program for the NFIP community named ‘City Of New York,’ grouped by each year of loss?

external_knowledge

None

Example 539

instance_id

sf018

db

BRAZE_USER_EVENT_DEMO_DATASET

question

Examine user engagement with push notifications within a specified one-hour window on June 1, 2023.

external_knowledge

PushNotificationAnalysis.md

Example 540

instance_id

sf035

db

BRAZE_USER_EVENT_DEMO_DATASET

question

How many unique users started sessions each day within each app group between June 1, 2023, and June 7, 2023? Also show the app group ID and the start day of the session.

external_knowledge

None

Example 541

instance_id

sf029

db

AMAZON_VENDOR_ANALYTICS__SAMPLE_DATASET

question

Generate a daily detailed sales report for each product under the ‘Manufacturing’ distributor view, covering the 30 days leading up to February 6, 2022, by joining the sales, traffic, inventory, and net PPM data on date, ASIN, program, period, and distributor_view. The report must include total ordered units, ordered revenue, average selling price, glance views, conversion rate, shipped units, shipped revenue, average net PPM, average procurable product OOS, total on- hand units and value, net received units and value, open purchase order quantities, unfilled customer ordered units, and average vendor confirmation rate, receive fill rate, sell-through rate, and vendor lead time.

external_knowledge

None

Example 542

instance_id

sf040

db

US_ADDRESSES__POI

question

Find the top 10 northernmost addresses in Florida’s largest zip code area. What are their address numbers, street names, and types?

external_knowledge

None

Example 543

instance_id

sf009

db

NETHERLANDS_OPEN_MAP_DATA

question

A real estate company needs a detailed side-by-side comparison of buildings in Amsterdam and Rotterdam. They require a report showing each building class and subclass, with the total surface area (in square meters) and the total number of buildings for each classification category in both cities. The data should be organized by building class and subclass in ascending order, with Amsterdam and Rotterdam statistics presented in parallel columns to facilitate direct comparison. Can you generate this comprehensive building classification comparison report?

external_knowledge

None

Example 544

instance_id

sf013

db

NETHERLANDS_OPEN_MAP_DATA

question

Compare the total road lengths in Amsterdam and Rotterdam by creating a side-by-side analysis of both cities. For each combination of road class and subclass, calculate the total length of roads (in meters) specifically for QUADKEY segments ‘12020210’ and ‘12020211’. Present the results with columns for class, subclass, Amsterdam’s road lengths, and Rotterdam’s road lengths.

external_knowledge

None

Example 545

instance_id

sf041

db

YES_ENERGY__SAMPLE_DATA

question

Produce a report for ERCOT on October 1, 2022, that combines hourly data on day-ahead and real-time prices from node ID 10000697078, load forecasts (datatypeid 19060) and actual loads, plus wind (forecast datatypeid 9285, actual datatypeid 16) and solar (forecast datatypeid 662, actual datatypeid 650) generation forecasts and actuals from object ID 10000712973. This report should include time zone alignments, peak classifications, and net load calculations, providing insights into daily operational dynamics and efficiency.

external_knowledge

ERCOT_Daily_Market_Dynamics_Report.md

Example 546

instance_id

sf011

db

CENSUS_GALAXY__ZIP_CODE_TO_BLOCK_GROUP_SAMPLE

question

Determine the population distribution within each block group relative to its census tract in New York State using 2021 ACS data. Include block group ID, census value, state county tract ID, total tract population, and the population ratio of each block group.

external_knowledge

None

Example 547

instance_id

sf014

db

CENSUS_GALAXY__AIML_MODEL_DATA_ENRICHMENT_SAMPLE

question

What is the New York State ZIP code with the highest number of commuters traveling over one hour, according to 2021 ACS data? Include the zip code, the total commuters, state benchmark for this duration, and state population.

external_knowledge

None