• Optimizing Search with Parallel Queries in TypeScript

    Recently, I encountered a situation where I built a search interface for a backend process. Users could search using various criteria—different identifiers or other metadata tagged to the records. However, the query I initially wrote to handle these searches didn’t perform well. It relied heavily on OR conditions or IN clauses (which are logically equivalent to OR), making indexing difficult and query performance sluggish.[1]

    In this case, all the different search options were mutually exclusive — if results were found using one criterion, no other criterion would yield results. All of our identifiers are UUIDs or other globally unique formats, so only one search method could return valid results. Additionally, some queries were fast and efficient, while others were slower and more resource-intensive. Combining all of these into a single query meant the user would always have to wait for the slowest part of the search, even if they were searching for something that could have returned results immediately.

    A Different Approach: Parallel Query Execution

    To improve performance, I decided to run multiple searches in parallel and return results as soon as one of them succeeded. If none of the queries returned results, the system would fall back to showing the latest records. This solution is somewhat similar to Promise.race(), but with a twist. Promise.race() resolves or rejects with the first settled promise, regardless of whether it yields useful data. In my case, I wanted to inspect each resolved value and only return it if it actually contained records.

    After working with ChatGPT, Claude, and some friends, I developed the following solution. Below are two versions of the function—one that uses key-value pairs for better traceability and another that works with an array of queries. I have more to say about it below.

    Implementation

    /**
     * Runs multiple queries in parallel and resolves with the first query that returns records.
     * If no query returns records, it resolves with `noResultsReturn` (when provided) or rejects with an error.
     *
     * @param queriesAsRecord A record of keyed promises representing database queries.
     * @param resultTest A predicate that decides whether a resolved value counts as having records.
     * @param noResultsReturn An optional fallback value to resolve with when nothing passes the test.
     * @returns A promise that resolves with the first valid result and its key, or rejects if no results are found.
     */
    export async function firstQueryWithResultsWithKey<T>({
      queriesAsRecord,
      resultTest = ({ result }) => Array.isArray(result) && result.length > 0,
      noResultsReturn = null,
    }: {
      queriesAsRecord: Record<string, Promise<T>>;
      resultTest?: ({ result, key }: { result: T; key: string | null }) => boolean;
      noResultsReturn?: T | null;
    }): Promise<{ result: T; key: string | null }> {
      return new Promise((resolve, reject) => {
        let completed = false; // Track if a query has returned results
        let pendingCount = Object.keys(queriesAsRecord).length; // Track the number of pending queries
    
        if (pendingCount === 0) {
          if (noResultsReturn != null) {
            resolve({ result: noResultsReturn, key: null });
          } else {
            reject(new Error("No queries returned any records"));
          }
          return; // Nothing to run, so stop here
        }
    
        // Define a function to handle each query independently
        const handleQuery = async (query: Promise<T>, key: string) => {
          try {
            if (!completed) {
              const result = await query;
    
              // If the result has records and no other query has resolved
              if (resultTest({ result, key }) && !completed) {
                completed = true; 
                resolve({ result, key }); // Resolve with the first valid result
              }
            }
          } catch (error) {
            console.error("Query error:", error); // Log query errors
          } finally {
            // Decrement pending count and check if all queries are exhausted
            pendingCount--;
            if (pendingCount === 0 && !completed) {
              if (noResultsReturn != null) {
                resolve({ result: noResultsReturn, key: null });
              } else {
                reject(new Error("No queries returned any records"));
              }
            }
          }
        };
    
        // Start all queries in parallel
        Object.entries(queriesAsRecord).forEach(([key, query]) =>
          handleQuery(query, key),
        );
      });
    }
    
    export async function firstQueryWithResults<T>({
      queries,
      resultTest = (result) => Array.isArray(result) && result.length > 0,
      noResultsReturn = null,
    }: {
      queries: Promise<T>[];
      resultTest?: (result: T) => boolean;
      noResultsReturn?: T | null;
    }): Promise<T> {
      const queriesAsRecord = Object.fromEntries(
        queries.map((query, index) => [index.toString(), query]),
      );
    
      const { result } = await firstQueryWithResultsWithKey({
        queriesAsRecord,
        resultTest: ({ result }) => resultTest(result),
        noResultsReturn,
      });
      return result;
    }
    
    

    Both versions of the function execute a collection of promises in parallel, returning the first valid result. If no queries pass the resultTest, the function returns the provided fallback value (if any) or throws an error.

    Example Usage

    Here’s a unit test to demonstrate how the function works:

    const wait = (interval: number) =>
        new Promise((resolve) => setTimeout(resolve, interval));
    
    it("should return the noResultsReturn value if no queries return results", async () => {
        const query1 = async ({ id }: { id: string }): Promise<QueryResult> => {
          await wait(100);
          return [];
        };
        const query2 = async ({ id }: { id: string }): Promise<QueryResult> => {
          await wait(200);
          return [];
        };
        const query3 = async ({ id }: { id: string }): Promise<QueryResult> => {
          await wait(300);
          return []; // all three queries come back empty for this test
        };
    
        let result: QueryResult | null = null;
    
        result = await firstQueryWithResults({
          queries: [query1({ id: "1" }), query2({ id: "2" }), query3({ id: "3" })],
          resultTest: (result) => result.length > 0,
          noResultsReturn: [],
        });
        expect(result).toEqual([]);
      });
    
    

    In this test, three promises simulate query operations that all resolve with empty arrays after 100ms, 200ms, and 300ms. Since the resultTest only passes non-empty arrays, none of the queries qualify, and the function falls back to the provided noResultsReturn value (an empty array).


    Reflections and Considerations

    This approach improves user experience by returning results as quickly as possible, but it also increases the load on the database since multiple queries run in parallel. For my use case — an internal back-office tool — I find the trade-off acceptable. However, I plan to monitor performance to make sure the extra load doesn’t affect the system as a whole.

    An alternative approach would be to let users specify the type of identifier they are searching for. This way, the system would only execute the relevant query, reducing database load. However, the current solution offers a seamless experience since users don’t need to know or care about the type of identifier—they just paste it and search.

    Although I used the term “query” throughout the article because of my use case, the function is generic and can be applied to any collection of promises, not just database operations.

    I welcome any feedback or suggestions on this approach. While I’m not fully convinced this is the optimal solution, I found the challenge of reimagining Promise.race() to be an interesting exercise.

    Here is a gist with the function and more unit tests so you can see how it works: https://gist.github.com/ryanguill/e89931f9d223e74dabcb070879a58298

    [1] There are ongoing improvements in PostgreSQL to optimize queries with OR conditions. For more, see: https://www.crunchydata.com/blog/real-world-performance-gains-with-postgres-17-btree-bulk-scans

    comments

  • safe_cast() function for postgresql

    I love using JSON in relational databases. When support for JSON types and functionality first started coming out in SQL, I thought it was neat, but not something I would ever want to use in production. I could not have been more wrong.

    I could talk at length (and have) about all the ways that it is useful, but if you use it you will find that the main way you pull information out of JSON brings it out as TEXT. And frequently, when you’re using JSON, you can’t be sure the data is in exactly the format you expect anyway. Lately I store a lot of JSON that comes back from LLMs, and while they get the format right most of the time, you can never really be sure - you need to trust but verify.

    So I have been using this function for a long time to safely convert from text to a given data type in PostgreSQL. If the cast can be made successfully, it is; otherwise the function returns the second argument as a default - most of the time I use null, but it can be anything.

    /* 
      utility function to convert from text to various data types
        , such as when you are pulling values out of json 
      The function will cast the value of the first argument 
        to the type of the second argument.
      If the first argument cannot be converted to the target type
        the value of the second argument will be returned as the default.
      If you want to return null as the default, cast the null to the target
        type like `null::integer` or `null::date`
    */
    DROP FUNCTION IF EXISTS safe_cast(text, anyelement);
    CREATE OR REPLACE FUNCTION safe_cast(text, anyelement)
    RETURNS anyelement
    LANGUAGE plpgsql as $$
    BEGIN
        $0 := $1;
        RETURN $0;
        EXCEPTION WHEN OTHERS THEN
            RETURN $2;
    END;
    $$;
    

    The function itself looks a little terse and strange, but if you look at and play with the examples you’ll get a better understanding of how it works. I have been using it for a long time and can’t remember if I actually wrote it or got it from somewhere else - I believe I may have adapted it from this answer on Stack Overflow: https://stackoverflow.com/a/2095676/7186

    Below are examples, but you can play with the function and these examples here: https://dbfiddle.uk/3kAop9-Z

    
    select
    	  safe_cast('true', false::boolean) = true
    	, safe_cast('false'::text, false::boolean) = false
    	, safe_cast('yes'::text, false::boolean) = true
    	, safe_cast('no'::text, false::boolean) = false
    	, safe_cast('on'::text, false::boolean) = true
    	, safe_cast('off'::text, false::boolean) = false
    	, safe_cast('1'::text, false::boolean) = true
    	, safe_cast('0'::text, false::boolean) = false
    	, safe_cast(1::text, false::boolean) = true
    	, safe_cast(0::text, false::boolean) = false
    	, safe_cast('foo'::text, false::boolean) = false
    	, safe_cast('3'::text, false::boolean) = false
    	, safe_cast(3::text, false::boolean) = false
    	, safe_cast('', false::boolean) = false
    	, safe_cast(null::text, false::boolean) is null
    	;
    
    select
    	  safe_cast('123', null::numeric) = 123::numeric
    	, safe_cast('123.45', null::numeric) = 123.45::numeric
    	, safe_cast('0', null::numeric) = 0::numeric
    	, safe_cast('-1', null::numeric) = -1::numeric
    	, safe_cast('-1.2', null::numeric) = -1.2::numeric
    	, safe_cast('123x', null::numeric) is null
    	, safe_cast('', null::numeric) is null
    	, safe_cast('foobar', null::numeric) is null
    	, safe_cast('123', null::numeric) > 1
        , safe_cast('123', 0::integer) = 123::integer
    	, safe_cast('', 0::integer) = 0::integer
    	, safe_cast('foo', 0::integer) = 0::integer
    	;
    
    select
    	  safe_cast('2024-01-02', null::date) = '2024-01-02'::date
    	, safe_cast('01-02-2024', null::date) = '2024-01-02'::date
    	, safe_cast('2024-01-023', null::date) = '2024-01-23'::date --possibly surprising
    	, safe_cast('2024-01-', null::date) is null
    	, safe_cast('2024-01-123', null::date) is null
    	, safe_cast('2024', null::date) is null
    	, safe_cast('foobar', null::date) is null
        , safe_cast('2024-01-02', null::timestamptz) = '2024-01-02 00:00:00'::timestamptz
    	;
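
    And here is a sketch of the JSON scenario that motivated the function for me - the payload shape and keys are made up for illustration, but ->> always hands you text, which is exactly what safe_cast expects:

    select
        safe_cast(payload->>'total', null::numeric)       as total
      , safe_cast(payload->>'completed_at', null::date)   as completed_at
      , safe_cast(payload->>'is_valid', false::boolean)   as is_valid
    from (
      values ('{"total": "123.45", "completed_at": "2024-01-02", "is_valid": "yes"}'::jsonb)
    ) as t(payload);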
    

    Other databases have similar functionality built in, but most do not let you bake a default into the function for when the value cannot be safely cast - they just return null, and you can use coalesce or similar to supply a different default if you need one. (The following list is just the ones I know off the top of my head and is not meant to be exhaustive.)

    MSSQL has TRY_CAST: https://learn.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-ver16

    Snowflake has TRY_CAST: https://docs.snowflake.com/en/sql-reference/functions/try_cast

    DuckDB has TRY_CAST: https://duckdb.org/docs/sql/expressions/cast.html#try_cast

    comments

  • Partitioning using UUIDs

    Recently I had a situation where I needed to process quite a few records, let’s say around 100k, and the processing of each record took around 2 seconds. The math on that wasn’t pretty, so I needed to find a way to partition the set of data so that I could run multiple instances of the processing at the same time - but where each instance would have its own data and no record would be processed by more than one instance.

    Normally you would want to look for some natural key if you can, but in this case there wasn’t anything I could use, especially not anything that would be well distributed. I also wanted the flexibility to partition for as few or as many instances as I needed - I might start with 5 instances, but that might not be enough.

    The one thing I did have is UUIDs. We use UUIDs for virtually all of our keys - an explanation of all of the benefits of that will have to be its own blog post someday.

    So I had the thought to use UUIDs to do the partitioning. The easiest thing you could do is look at the last character; that would quickly give you 16 partitions without much work, and for most versions of UUID the last character is fairly well distributed.
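
    For example, here is a quick sketch of that simpler approach - grab the final hex character of a bunch of random UUIDs and count how the 16 buckets fill up:

    -- 16 buckets from the last hex character of a random UUID
    select right(gen_random_uuid()::text, 1) as bucket, count(*)
    from generate_series(1, 10000)
    group by 1
    order by 1;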

    But then I remembered - a UUID is really just a 128 bit number, or two 64 bit integers (bigints) in a trench coat - and assuming they are randomly generated (we use v4 for most things right now so I know they are), I can use modulus math to partition into however many buckets I need.

    I was able to come up with the following, which, given a UUID, extracts the high bigint and then takes it mod 5:

    SELECT abs(('x' || translate(gen_random_uuid()::text, '-', ''))::bit(64)::bigint % 5)
    

    This might look complicated at first, but while it has a few steps it’s fairly straightforward. A few things to notice first:

    A UUID is a hex encoded number that has hyphens in particular places to break it up into pieces that make it a little easier to read. So to break it apart and pull out just the higher piece:

    a) convert the uuid to text

    b) remove the hyphens

    c) prepend an ‘x’ to the beginning of the string - this is how the bit type input knows to treat the string as hex encoded rather than binary digits

    d) convert that to a bit string using ::bit(64) - this relies on the fact that if the value is bigger than the bit type can hold, the cast keeps as much as fits and throws away the rest, so this takes only the first 64 bits

    e) take that 64 bits and convert it to a bigint using ::bigint

    f) lastly, take its absolute value, because the number is signed and you might get a negative, but that doesn’t matter for our purpose

    With these steps done you only need to use % to take the modulus of that number and get the remainder. You can change 5 to whatever number you want to get the number of buckets - just remember that the answer will be zero based. If you use % 5 you will get 0 through 4 as possible answers.
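
    If you want to see each of those steps individually, here is a sketch that spells out the intermediate values for a single random UUID:

    with example as (
      select gen_random_uuid() as u
    )
    select
        u                                                               as original_uuid
      , translate(u::text, '-', '')                                     as hex_no_hyphens   -- steps a and b
      , ('x' || translate(u::text, '-', ''))::bit(64)                   as high_64_bits     -- steps c and d
      , ('x' || translate(u::text, '-', ''))::bit(64)::bigint           as high_bigint      -- step e
      , abs(('x' || translate(u::text, '-', ''))::bit(64)::bigint % 5)  as bucket           -- step f plus the modulus
    from example;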

    Now when I create my instances, I only have to put in the number of partitions and which partition this instance is working with. A simple where clause on the query that gets the population it is working on restricts it to only the records that belong to that instance, and no two instances will try to do the same work.

    For clarity, here is what this might look like in use:

    SELECT
      id
      , ...
    FROM foo
    WHERE abs(('x' || translate(id::text, '-', ''))::bit(64)::bigint % 5) = 2
    

    If you’re interested in exploring more about how UUIDs split into numbers and back again or proving to yourself that a random uuid is evenly distributed, you can check out this fiddle: https://dbfiddle.uk/yX-y7Ath

    And one more bonus fact related to UUIDs - MD5 hashes are also 128 bit numbers stored as hex, so you can convert MD5 hashes to UUIDs and store them using the UUID type in your database for space savings compared to storing them as text.
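
    Here is a minimal sketch of that conversion - md5() gives you back 32 hex characters, which is exactly what the uuid input format will accept:

    select
        md5('hello world')        as md5_text    -- 32 hex characters stored as text
      , md5('hello world')::uuid  as md5_as_uuid -- the same 128 bits stored in 16 bytes
      ;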

    comments

  • Aggregating event data into sessions using SQL

    Let’s say you have a table of events, each representing a point in time when an event occurred. Maybe it’s clicks in your application, or pages that a user visited. You might like to see how long users are spending in your application, or what they are doing in a given session.

    So the task at hand is to use SQL to analyze this event data and combine these points in time into groups of events - I will call them sessions for this post.

    The first thing you must do is decide how much time you would allow someone to go before another event happens for it to still be considered the same session. Anything less than this duration between events, and the events are connected into the same session; anything more, and we will call it a new session. The value you use here is up to you and your needs - if the actions you are looking at take a lot of time to perform, or maybe someone is spending time reading a lot of content before continuing to the next thing, you might want to make this threshold higher, maybe tens of minutes or even more. On the other hand, if you expect they should be clicking around fairly quickly, you might want to go much shorter, maybe only seconds. For this example I am going to use 5 minutes between events as my threshold.

    The next thing you need to decide is, if the session only lasted for one event, how long would you want to say it lasted? For this example, I am going to use 1 minute.

    So there are several SQL features we can use to our advantage - this is all PostgreSQL; everything here works at least as far back as version 9.5, possibly earlier.

    The first feature is Postgres’ interval type. If you haven’t played with intervals, you are really missing out. It is a native data type that defines a period of time - perfect for our use cases.
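
    Here is a tiny illustration of the interval behavior we will lean on below - an interval literal, timestamp subtraction producing an interval, and comparing the two:

    select
        '5 minutes'::interval                                                          as threshold
      , timestamptz '2024-07-01 12:12:00+00' - timestamptz '2024-07-01 12:05:00+00'    as gap               -- 00:07:00
      , (timestamptz '2024-07-01 12:12:00+00' - timestamptz '2024-07-01 12:05:00+00')
          <= '5 minutes'::interval                                                     as within_threshold  -- false
      ;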

    The next thing to understand is window functions. I have been writing SQL long enough that these still feel “new”, but they have been around for a long time now! Window functions are great; they let you aggregate against different partitions (groups, or in this case windows) in the same query, which is perfect for our needs where we want to consider each user independently.

    Let’s look at some example data to get our bearings:

    DROP TABLE IF EXISTS event_log CASCADE;
    CREATE TABLE event_log (
    	  ts timestamptz
    	, name varchar
    );
    
    INSERT INTO event_log ( ts, name ) VALUES
      ('2024-07-01T12:00:00.000Z', 'Alice')
    , ('2024-07-01T12:01:00.000Z', 'Alice')
    , ('2024-07-01T12:02:00.000Z', 'Alice')
    , ('2024-07-01T12:03:00.000Z', 'Alice')
    , ('2024-07-01T12:04:00.000Z', 'Alice')
    , ('2024-07-01T12:05:00.000Z', 'Alice')
    , ('2024-07-01T12:12:00.000Z', 'Alice')
    , ('2024-07-01T12:14:00.000Z', 'Alice')
    , ('2024-07-01T12:17:00.000Z', 'Alice')
    , ('2024-07-01T12:21:00.000Z', 'Alice')
    , ('2024-07-01T12:26:00.000Z', 'Alice')
    , ('2024-07-01T12:30:00.000Z', 'Alice')
    , ('2024-07-01T12:00:00.000Z', 'Bob')
    , ('2024-07-01T12:02:00.000Z', 'Bob')
    , ('2024-07-01T12:05:00.000Z', 'Bob')
    , ('2024-07-01T12:09:00.000Z', 'Bob')
    , ('2024-07-01T12:14:00.000Z', 'Bob')
    , ('2024-07-01T12:20:00.000Z', 'Bob')
    , ('2024-07-01T12:25:00.000Z', 'Bob')
    , ('2024-07-01T12:35:00.000Z', 'Bob')
    , ('2024-07-01T12:44:00.000Z', 'Bob')
    , ('2024-07-01T12:48:00.000Z', 'Bob')
    , ('2024-07-01T13:05:00.000Z', 'Bob')
    ;
    

    Here you can see we have two users, Alice and Bob, and they have several events. Of course, in a real world scenario, these names would probably be IDs to a users table, you would have an event type and probably lots of other attributes for the event, but none of that is necessary for this demo. If you have trouble extending this example to a more real world scenario, reach out in the comments or on Twitter.

    Here is the final query; we will walk through the different parts of it to see how it works:

    with inputs as (
      select
          '5 minutes'::interval threshold -- change this to however long you want your session timeouts to be
        , '1 minute'::interval min_session_duration -- if there is only one event in the session, how long do you want to say it lasted?
    )
    , prep as (
      select
          ts
        , name
        , row_number() 
          over (partition by name order by ts) row_num
        , coalesce(
          ts - lag(ts) 
            over (partition by name order by ts asc) <= inputs.threshold
          , false) is_connected
      from
        event_log
      cross
        join inputs
    )
    , session_detail as (
      select
          ts
        , name
        , max(row_num) 
          filter (where NOT is_connected) 
          over (partition by name order by ts asc) session_seq
      from prep
    )
    , sessions as (
      select
          row_number() 
          over (partition by name order by min(ts), session_seq) session_seq
        , name
        , count(*) event_count
        , min(ts) start_ts
        , max(ts) end_ts
        , greatest(inputs.min_session_duration, max(ts) - min(ts)) session_duration
      from session_detail
      cross
        join inputs
      group by
          session_seq
        , name
        , inputs.min_session_duration
    )
    select
        session_seq
      , concat_ws('-', name, session_seq) unique_id
      , name
      , event_count
      , start_ts
      , end_ts
      , session_duration
    from sessions
    ;
    

    If you aren’t comfortable with SQL that might look like a lot, but trust me, it’s really not bad - and we will walk through all the different parts.

    Here is what the query returns (I have removed the dates, seconds, and milliseconds from the timestamps for brevity; the dates are all the same for this example):

    session_seq  unique_id  name   event_count  start_ts  end_ts  session_duration
    1            Alice-1    Alice  6            12:00     12:05   00:05:00
    2            Alice-2    Alice  6            12:12     12:30   00:18:00
    1            Bob-1      Bob    5            12:00     12:14   00:14:00
    2            Bob-2      Bob    2            12:20     12:25   00:05:00
    3            Bob-3      Bob    1            12:35     12:35   00:01:00
    4            Bob-4      Bob    2            12:44     12:48   00:04:00
    5            Bob-5      Bob    1            13:05     13:05   00:01:00

    Let’s start with the two pieces of SQL that make this work.

    First, there’s this line:

    , coalesce(
        ts - 
        lag(ts) over (partition by name order by ts asc) 
            <= inputs.threshold
        , false) is_connected
    

    Forgive the formatting - there is no great way to format it that will make everyone happy, but hopefully I can walk you through the pieces.

    The goal with this line is to get true or false: is the row we are looking at part of the same session as the previous event for this user?

    Starting from the inside out - for the current row, we are taking its ts (timestamp), and subtracting the timestamp of the row before it. This is a window function - you know that because of the over () clause. The window is how we decide what “before it” means.

    We are using partition by name, meaning that our window will always only consider rows that have the same values for name as our current row - and we are using order by ts asc, meaning “before” will be a row earlier than the current row.

    partition by works just like group by: you are defining the set of columns whose combination of values defines your different groups. In our case, we only want to consider rows from the same user.

    lag() is just one option; there’s also its dual, lead(), which would work just the same if you were to order by ts desc instead.

    We do our subtraction and get an interval - a duration of time since that previous event. Now we just need to know if that interval is less than our threshold - if so, we are still part of the same session; if not, this is the beginning of a new session.

    But our lag(ts) ... could return null - if we are looking at the first event for a user, there is no previous value. And so if we take our current row’s timestamp and subtract null, the result will of course be null. And when we do our comparison with the threshold, we will also get null. So we coalesce() our value to false in that case - if it’s the first event for a user it certainly cannot be part of a previous session.
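
    If you want to see just that piece in isolation, you can run the lag() subtraction by itself against the sample data above and eyeball the gaps:

    select
        name
      , ts
      , ts - lag(ts) over (partition by name order by ts asc) as gap_from_previous -- null for each user's first event
    from event_log
    order by name, ts;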

    So now let’s look at the other important part of this approach:

    , max(row_num) 
          filter (where NOT is_connected) 
          over (partition by name order by ts asc) session_seq
    

    This part is a little more complicated, so bear with me:

    Earlier in the query we assigned a row number to each row with the row_number() window function, using the same partitions as we previously discussed.

    Our goal is to get the highest row number that was the beginning of the current session - so we are only interested in rows where is_connected is false.

    The most important thing to understand is that max() as a window function with an order by clause will not consider the entire partition, but only the window frame up to the current row. I’m going to quote the postgres documentation here:

    There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Some window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition.

    So our max() is only considering rows where is_connected is false, meaning the events that were the beginning of sessions. Then we want to pick the highest row number that came before the current row - so if we are in the second session we want to pick the beginning of the second session.
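
    Here is a tiny standalone illustration of that frame behavior, separate from the session data - with an order by, max() only sees rows up to the current one; without one, it sees the whole partition:

    select
        x
      , max(x) over (order by x asc) as running_max -- frame: start of the partition up to the current row
      , max(x) over ()               as overall_max -- no order by: the whole partition
    from (values (1), (3), (2)) as t(x)
    order by x;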

    These are the two most important parts of this solution - how to figure out if the current row is a continuation of a session or the start of a new one, and then how to find the start of the session for the current row, so we know which session we are a part of.

    The one other thing I want to call out here is the CTE at the top I call inputs - this is a pattern I use a lot - it pulls the “variables” of a query to the top, and just selects them as literal values - then I can cross join them where I need to in the rest of the query, making it dynamic to those inputs.
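
    Stripped of everything else, the pattern looks like this (some_table and created_at are just placeholders for your own query):

    with inputs as (
      select '5 minutes'::interval as threshold
    )
    select some_table.*
    from some_table
    cross join inputs
    where some_table.created_at >= now() - inputs.threshold;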

    Here is a database fiddle if you want to play with this query to understand it better. I really recommend pulling out parts of the CTEs and understanding how each part works: https://dbfiddle.uk/M_iWoTG-

    One last thing to mention: this query is useful if you want to slice and dice your event data in a few different ways to explore it. You might get different results and gain more insights into your data if you tweak the thresholds. However, if these thresholds are already set in stone, this would be an easy thing to calculate as your data comes in, or in an intermediate table that ingests the event data on some schedule and calculates the duration from the previous event and which session each event belongs to.

    comments

  • My PostgreSQL wishlist

    If you know me you’re aware that I love SQL and relational databases. Of all of the databases I have used, PostgreSQL is by far my favorite. SQL is the one technology I have seen in common across every job I have had over the past two decades. I would bet you could say the same - even if you aren’t writing it directly (which would be a shame in my opinion), it is still there, behind the scenes, handling all of your data. But if you’re fortunate enough to work directly with SQL and a database, it’s hard to find a better tool.

    But while I have never seen a better database than PostgreSQL (and you can’t beat the price!), that doesn’t mean that there aren’t things that could be improved. My goal with this post is to start a conversation and throw out some ideas that I have gathered over the years of things I yearned for. Best case scenario, someone tells me that these things already exist and I learn something new! Second best, these ideas resonate with others and we find ways to get some of these in future versions. But I would also love any conversation about why any of these are bad ideas or technically infeasible - I’ll learn something then too. And if anyone has ideas about how I can contribute to these things directly, please let me know!

    Read more

    comments