• Bulk insert into postgres using a single parameter

    motivation

    At my job recently we had a need to bulk insert lots of records quickly and efficiently, but we had some limitations we needed to work around. We have been using Prisma for a while, which has a createMany function to insert multiple rows, but something we have changed or upgraded recently (we aren’t sure what exactly) was causing it to actually insert rows one at a time instead. This isn’t too big of a deal if you are only inserting a few rows at a time, but we needed to insert hundreds to tens of thousands of rows at a time and doing them one by one wasn’t going to cut it.

    We have also very recently started using Prisma's TypedSQL functionality, which lets us write real SQL and still get type safety - which is great, except it doesn't handle a variable number of parameters.

    So I had the idea to instead send a single parameter into the statement as JSON, an array of objects, then parse that into a recordset, and use that to insert into the database.

    example

    Here is what that looks like:

    WITH raw_input(json_blob) AS (
        -- In production you would pass $1 here;
        VALUES (
            $$[
                {
                  "id":           "9f2ddc93-0db3-46d2-a2cb-e6df4418faad",
                  "key":          "status",
                  "value":        "active",
                  "created_at":   "2025-06-24T12:34:56Z",
                  "updated_at":   "2025-06-24T13:45:00Z"
                },
                {
                  "id":           "b8f9152e-a203-4d0a-b530-0a4e45c9b0a9",
                  "key":          "priority",
                  "value":        "high",
                  "created_at":   "2025-06-24T12:34:56Z",
                  "updated_at":   "2025-06-24T13:45:00Z"
                }
            ]$$::json
        )
    )
    -- Expand the JSON array to a proper recordset
    , parsed AS (
        SELECT *
        FROM raw_input
        CROSS JOIN LATERAL json_to_recordset(json_blob) AS r(
            id          uuid,
            key         text,
            value       text,
            created_at  timestamptz,
            updated_at  timestamptz
        )
    )
    -- Bulk-insert the rows in a single SQL statement
    INSERT INTO target_table (
          id, key, value, created_at, updated_at
    )
    SELECT
          id, key, value, created_at, updated_at
    FROM   parsed
    RETURNING *;
    

    You can try it for yourself here: https://dbfiddle.uk/Am4xRgbz

    There are only two parts to really understand here:

    json_to_recordset takes a JSON array and a column definition list for the resulting recordset and parses your data into rows. You don't have to reference every key in the JSON, and if you define a column that doesn't have a matching key you will get null for those values.
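
    For example, here is a minimal standalone sketch (with made-up keys, and the function called directly since the JSON is a literal) showing that a column with no matching key simply comes back as null:

    SELECT *
    FROM json_to_recordset('[{"id": 1, "key": "status"}]'::json)
        AS r(id int, key text, missing_column text);
    -- id | key    | missing_column
    --  1 | status | (null)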

    The only other thing to know is that you need CROSS JOIN LATERAL: it lets json_to_recordset reference the json_blob column from the previous CTE, and you get one row per object in the array.

    And that’s all there is to it. Now we have a statement that takes a single input, but that input can contain thousands of rows. The limit for a single parameter into a statement is an entire gigabyte of data, so you can use this to insert a lot of data at once.

    The difference for us was significant. When inserting 1k records individually it was timing out after 60s. Even inserting in batches of 250 at a time using the previous createMany functionality it was taking tens of seconds. This was able to do it in less than a second every time, usually less than a quarter of a second.

    Of course you could use this with ON CONFLICT DO UPDATE for upserts, or just as a way to get data into a statement for joining with other data. And you can enrich the parsed CTE with computed or default values, add checks or a WHERE clause to filter out data early before you insert, or even join it with other data in the database.
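
    For instance, assuming target_table had a unique constraint on id (the example table above doesn't declare one), the tail of the statement could become an upsert along these lines:

    -- ... same raw_input and parsed CTEs as above ...
    INSERT INTO target_table (id, key, value, created_at, updated_at)
    SELECT id, key, value, created_at, updated_at
    FROM parsed
    ON CONFLICT (id) DO UPDATE
        SET key        = EXCLUDED.key
          , value      = EXCLUDED.value
          , updated_at = EXCLUDED.updated_at
    RETURNING *;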

    full SQL example

    Adding the full SQL example here in case the dbfiddle link ever stops working:

    
    create table target_table (id uuid, key text, value text, created_at timestamptz, updated_at timestamptz);
    
    WITH raw_input(json_blob) AS (
        -- In production you would pass $1 here;
        VALUES (
            $$[
                {
                  "id":           "9f2ddc93-0db3-46d2-a2cb-e6df4418faad",
                  "key":          "status",
                  "value":        "active",
                  "created_at":   "2025-06-24T12:34:56Z",
                  "updated_at":   "2025-06-24T13:45:00Z"
                },
                {
                  "id":           "b8f9152e-a203-4d0a-b530-0a4e45c9b0a9",
                  "key":          "priority",
                  "value":        "high",
                  "created_at":   "2025-06-24T12:34:56Z",
                  "updated_at":   "2025-06-24T13:45:00Z"
                }
            ]$$::json
        )
    )
    -- Expand the JSON array to a proper recordset
    , parsed AS (
        SELECT *
        FROM raw_input
        CROSS JOIN LATERAL json_to_recordset(json_blob) AS r(
            id          uuid,
            key         text,
            value       text,
            created_at  timestamptz,
            updated_at  timestamptz
        )
    )
    -- Bulk-insert the rows in a single SQL statement
    INSERT INTO target_table (
          id, key, value, created_at, updated_at
    )
    SELECT
          id, key, value, created_at, updated_at
    FROM   parsed
    RETURNING *;
    
    
    -- example so you can see what the parsed data looks like
    WITH raw_input(json_blob) AS (
        VALUES (
            $$[
                {
                  "id":           "9f2ddc93-0db3-46d2-a2cb-e6df4418faad",
                  "key":          "status",
                  "value":        "active",
                  "created_at":   "2025-06-24T12:34:56Z",
                  "updated_at":   "2025-06-24T13:45:00Z"
                },
                {
                  "id":           "b8f9152e-a203-4d0a-b530-0a4e45c9b0a9",
                  "key":          "priority",
                  "value":        "high",
                  "created_at":   "2025-06-24T12:34:56Z",
                  "updated_at":   "2025-06-24T13:45:00Z"
                }
            ]$$::json
        )
    )
    , parsed AS (
        SELECT *
        FROM raw_input
        CROSS JOIN LATERAL json_to_recordset(json_blob) AS r(
            id          uuid,
            key         text,
            value       text,
            created_at  timestamptz,
            updated_at  timestamptz
        )
    )  
    select * from parsed;
    
    select * from target_table;
    
    


  • Precedence Level ordering with Resets using window functions in SQL

    Motivation

    One of my larger projects at Vendr in the last couple of years has been a document extraction system. It uses LLMs and other technologies to extract structured data from contracts and other documents, allowing us to build a sizable dataset of the actual pricing and negotiation details for SaaS software across thousands of companies.

    But as we develop our extractions to capture new details, or to improve the accuracy of the information, we may extract a document multiple times. We also have human review and correction, and people adding more information from context that may not be in the document itself. We may be extracting dozens of fields across those extractions, so we needed a way to determine the precedence of different extractions on a per-field, per-document basis.

    The problem at hand: Precedence Level

    There is more that goes into choosing which extraction is best for an individual document than I am going to go into here - the whole process is actually quite involved - but I wanted to talk about one particular challenge: letting us set a precedence level on each extraction (or change it after the fact) to help guide that process.

    To start with, we needed two precedence levels - DEFAULT and IMPORTANT. We would typically set an LLM extraction as DEFAULT, and human involved extractions as IMPORTANT. An IMPORTANT extraction beats a DEFAULT one, even if the DEFAULT comes after an IMPORTANT, but within the same precedence level the last one wins.

    But then sometimes we may have an advancement in the LLM extractions that makes us want to let a new extraction take precedence over any previous extraction, even human-guided IMPORTANT ones - but from then on we want any future extraction to take precedence again, whether a newer DEFAULT or a new human-corrected IMPORTANT extraction.

    So I wanted to support another precedence level called RESET. An extraction with RESET should beat anything that came before it, but be treated like DEFAULT after that point, with a new DEFAULT or IMPORTANT taking precedence.

    We store all extractions in the database and never get rid of them - we want to keep track of every value that was ever extracted for a field in a document and retain that history, and also we want to be able to adjust the precedence (and other attributes that affect which extraction is the best) at a later date.

    To calculate the best extraction I use a SQL function in postgres - so I needed a way to implement this precedence logic using nothing but SQL.

    The SQL challenge

    Ordering by precedence level when there is just a priority (IMPORTANT > DEFAULT) is easy in SQL - you assign a sort value to each level using a join against a lookup table or a case expression, plus a timestamp as a tiebreaker. But implementing something like RESET means a normal sort won't work - the ordering depends on the timestamp and on whether a RESET comes before the row.
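
    For contrast, the simple two-level case might look something like this sketch (extractions is a hypothetical table standing in for our real one):

    -- pick the winning extraction per document when only IMPORTANT > DEFAULT matters
    select distinct on (document_id) *
    from extractions
    order by document_id
    	, case precedence_level when 'IMPORTANT' then 1 else 0 end desc
    	, updated_at desc;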

    So to handle RESET you need window functions, to look at the records before the current row and figure out whether a RESET precedes it - and not just the record immediately before it (which LEAD or LAG could handle), but all of them.

    Here is an example of the query (much simplified and reworked to try and focus on just the precedence level for clarity):

    with data (id, document_id, value, precedence_level, updated_at) as (
    	values
    		( 1, 1, 'Z', 'DEFAULT', '2025-01-01'::date)
    		, ( 2, 1, 'Y', 'DEFAULT', '2025-01-02') -- later DEFAULT beats previous DEFAULT
    		, ( 3, 1, 'X', 'DEFAULT', '2025-01-03') -- later DEFAULT beats previous DEFAULT
    		, ( 4, 1, 'E', 'IMPORTANT', '2025-01-04') -- IMPORTANT beats DEFAULT
    		, ( 5, 1, 'F', 'DEFAULT', '2025-01-05') -- previous IMPORTANT beats this record
    		, ( 6, 1, 'D', 'IMPORTANT', '2025-01-06') -- later IMPORTANT beats previous IMPORTANT
    		, ( 7, 1, 'C', 'RESET', '2025-01-07') -- RESET beats everything else
    		, ( 8, 1, 'B', 'DEFAULT', '2025-01-08') -- bc last record was RESET this DEFAULT now takes precedence
    		, ( 9, 1, 'A', 'IMPORTANT', '2025-01-09') -- IMPORTANT overrides DEFAULT
    )
    , data_with_resets as (
    	select *
    		, count(*)
    			filter (where data.precedence_level = 'RESET')
    			over (
    				partition by document_id
    				order by updated_at asc
    			) prev_reset_count
    	from data
    )
    , calculate_ranking as (
    	select
    		dense_rank() over (
    			partition by document_id
    			order by
    				(prev_reset_count * 10) +
    				(data.precedence_level = 'IMPORTANT')::int * 5
    					desc
    				, updated_at desc
    		) as rnk
    		, *
    	from data_with_resets data
    	order by 1
    )
    , data_with_prev_value as (
    	select *
    		, lead(value) over (partition by document_id order by rnk) prev_value
    	from calculate_ranking
    )
    , final as (
    	select
    		dense_rank() over (ORDER BY rnk, updated_at desc ) rnk_final
    		, *
    	from data_with_prev_value
    	order by 1 asc
    )
    select *
    from final
    ;
    

    You can play with this query and the parts of it here: https://dbfiddle.uk/HBKDOaR3?highlight=16

    The key parts that make this work start with the data_with_resets CTE. Here we count how many rows so far have the RESET precedence level, using a window over the same document ordered by the timestamp. It isn't necessarily obvious from this code unless you are familiar with window functions, but when you use count with a window (the over clause) that has an order by, it is a running count up to and including the current row.
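
    A tiny standalone example of that running-count behavior:

    select g
    	, count(*) over (order by g) as running_count
    from generate_series(1, 5) g;
    -- running_count comes back as 1, 2, 3, 4, 5 because the default frame,
    -- once there is an order by, runs from the start of the partition through the current row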

    Then the actual logic happens in the calculate_ranking CTE. We use dense_rank, which gives us a consecutive ranking without gaps. We partition by the document and order by a score derived from the precedence level - 10 for each RESET up to and including the current row, 0 for DEFAULT, 5 for IMPORTANT - then by the timestamp descending to break ties.
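
    Working through the sample data above, that sort value comes out to:

    -- ids 1, 2, 3, 5 (DEFAULT, no prior RESET)   ->  0
    -- ids 4, 6       (IMPORTANT, no prior RESET) ->  5
    -- id 7           (the RESET itself)          -> 10
    -- id 8           (DEFAULT after the RESET)   -> 10
    -- id 9           (IMPORTANT after the RESET) -> 15
    -- ordering by that value desc, then updated_at desc, ranks the values A, B, C, D, E, F, X, Y, Z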

    The next CTE isn’t required for our purpose here, but allows us to figure out what the previous value was, based on the same precedence ordering, so we can tell if something changed.

    The last final CTE isn't strictly necessary either, but it gives us a single ranking field to order the final recordset by.

    Conclusion

    The main takeaway I hope to convey with this post is that SQL traditionally has you think in terms of sets of data, which is very powerful, but sometimes you need to work iteratively and consider the previous or next row, or a window of rows rather than the entire set. Window functions extend what you can achieve in a single SQL query using only your database, keeping you from having to do this work in the application layer - increasing performance and lowering your memory and network utilization.


  • Optimizing Search with Parallel Queries in TypeScript

    Recently, I encountered a situation where I built a search interface for a backend process. Users could search using various criteria—different identifiers or other metadata tagged to the records. However, the query I initially wrote to handle these searches didn’t perform well. It relied heavily on OR conditions or IN clauses (which are logically equivalent to OR), making indexing difficult and query performance sluggish.[1]

    In this case, all the different search options were mutually exclusive — if results were found using one criterion, no other criterion would yield results. All our identifiers are UUIDs or globally unique formats, ensuring that only one search method could return valid results. Additionally, some queries were fast and efficient, while others were slower and more resource-intensive. Combining all these into a single query meant the user would always have to wait for the slowest part of the search, even if they were searching for something that could have returned results immediately.

    A Different Approach: Parallel Query Execution

    To improve performance, I decided to run multiple searches in parallel and return results as soon as one of them succeeded. If none of the queries returned results, the system would fall back to showing the latest records. This solution is somewhat similar to Promise.race(), but with a twist. Promise.race() resolves or rejects with the first settled promise, regardless of whether it yields useful data. In my case, I wanted to inspect each resolved value and only settle once one of the queries actually returned results.

    After working with ChatGPT, Claude, and some friends, I developed the following solution. Below are two versions of the function—one that uses key-value pairs for better traceability and another that works with an array of queries. I have more to say about it below.

    Implementation

    /**
     * Runs multiple queries in parallel and resolves with the first query that returns records.
     * If no query returns records, it rejects with an error.
     *
     * @param queriesAsRecord A record of keyed promises representing database queries.
     * @param resultTest Predicate that decides whether a resolved value counts as having results.
     * @param noResultsReturn Optional fallback to resolve with when no query passes the test.
     * @returns A promise that resolves with the first valid result and its key, or rejects if no results are found.
     */
    export async function firstQueryWithResultsWithKey<T>({
      queriesAsRecord,
      resultTest = ({ result }) => Array.isArray(result) && result.length > 0,
      noResultsReturn = null,
    }: {
      queriesAsRecord: Record<string, Promise<T>>;
      resultTest?: ({ result, key }: { result: T; key: string | null }) => boolean;
      noResultsReturn?: T | null;
    }): Promise<{ result: T; key: string | null }> {
      return new Promise((resolve, reject) => {
        let completed = false; // Track if a query has returned results
        let pendingCount = Object.keys(queriesAsRecord).length; // Track the number of pending queries
    
        if (pendingCount === 0) {
          if (noResultsReturn !== null) {
            resolve({ result: noResultsReturn, key: null });
          } else {
            reject(new Error("No queries returned any records"));
          }
        }
    
        // Define a function to handle each query independently
        const handleQuery = async (query: Promise<T>, key: string) => {
          try {
            if (!completed) {
              const result = await query;
    
              // If the result has records and no other query has resolved
              if (resultTest({ result, key }) && !completed) {
                completed = true; 
                resolve({ result, key }); // Resolve with the first valid result
              }
            }
          } catch (error) {
            console.error("Query error:", error); // Log query errors
          } finally {
            // Decrement pending count and check if all queries are exhausted
            pendingCount--;
            if (pendingCount === 0 && !completed) {
              if (noResultsReturn !== null) {
                resolve({ result: noResultsReturn, key: null });
              } else {
                reject(new Error("No queries returned any records"));
              }
            }
          }
        };
    
        // Start all queries in parallel
        Object.entries(queriesAsRecord).forEach(([key, query]) =>
          handleQuery(query, key),
        );
      });
    }
    
    export async function firstQueryWithResults<T>({
      queries,
      resultTest = (result) => Array.isArray(result) && result.length > 0,
      noResultsReturn = null,
    }: {
      queries: Promise<T>[];
      resultTest?: (result: T) => boolean;
      noResultsReturn?: T | null;
    }): Promise<T> {
      let queriesAsRecord = Object.fromEntries(
        queries.map((query, index) => [index.toString(), query]),
      );
    
      const { result } = await firstQueryWithResultsWithKey({
        queriesAsRecord,
        resultTest: ({ result }) => resultTest(result),
        noResultsReturn,
      });
      return result;
    }
    
    

    Both versions of the function execute a collection of promises in parallel, returning the first valid result. If no queries pass the resultTest, the function returns the provided fallback value (if any) or throws an error.

    Example Usage

    Here’s a unit test to demonstrate how the function works:

    // result shape assumed for this excerpt - the real type lives alongside the actual queries
    type QueryResult = Record<string, string>[];
    
    const wait = (interval: number) =>
        new Promise((resolve) => setTimeout(resolve, interval));
    
    it("should return the noResultsReturn value if no queries return results", async () => {
        const query1 = async ({ id }: { id: string }): Promise<QueryResult> => {
          await wait(100);
          return [];
        };
        const query2 = async ({ id }: { id: string }): Promise<QueryResult> => {
          await wait(200);
          return [];
        };
        const query3 = async ({ id }: { id: string }): Promise<QueryResult> => {
          await wait(300);
          return [{a: id}];
        };
    
        let result: QueryResult | null = null;
    
        result = await firstQueryWithResults({
          queries: [query1({ id: "1" }), query2({ id: "2" }), query3({ id: "3" })],
          resultTest: (result) => result.length > 0,
          noResultsReturn: [],
        });
        expect(result).toEqual([{ a: "3" }]);
      });
    
    

    In this test, three promises simulate query operations. The first two return empty arrays after 100ms and 200ms, while the third returns a valid result after 300ms. The resultTest function ensures that only non-empty arrays pass the validation, so the function resolves with the third query’s result.


    Reflections and Considerations

    This approach improves user experience by returning results as quickly as possible, but it also increases the load on the database since multiple queries run in parallel. For my use case — an internal back-office tool — I find the trade-off acceptable. However, I plan to monitor system performance to ensure this doesn’t impact the overall system.

    An alternative approach would be to let users specify the type of identifier they are searching for. This way, the system would only execute the relevant query, reducing database load. However, the current solution offers a seamless experience since users don’t need to know or care about the type of identifier—they just paste it and search.

    Although I used the term “query” throughout the article because of my use case, the function is generic and can be applied to any collection of promises, not just database operations.

    I welcome any feedback or suggestions on this approach. While I’m not fully convinced this is the optimal solution, I found the challenge of reimagining Promise.race() to be an interesting exercise.

    Here is a gist with the function and more unit tests so you can see how it works: https://gist.github.com/ryanguill/e89931f9d223e74dabcb070879a58298

    [1] There are ongoing improvements in PostgreSQL to optimize queries with OR conditions. For more, see: https://www.crunchydata.com/blog/real-world-performance-gains-with-postgres-17-btree-bulk-scans


  • safe_cast() function for postgresql

    I love using JSON in relational databases. When support for JSON types and functionality first started coming out in SQL I generally thought it was neat but that it wouldn’t be something I would ever want to use in production. I could not have been more wrong.

    I could talk at length (and have) about all the ways that it is useful, but if you use JSON in the database you will find that the main way you pull information out of it gives you TEXT. And frequently when you're using JSON you can't be sure the data is in exactly the format you expect anyway. Lately I store a lot of JSON that comes back from LLMs, and while they get it right most of the time, you can never really be sure - you need to trust but verify.

    So I have been using this function for a long time to safely cast from text to a given data type in postgresql. If the cast can be made successfully it will be, otherwise it will return the second argument as a default - most of the time I use null, but it can be anything.

    /* 
      utility function to convert from text to various data types
        , such as when you are pulling values out of json 
      The function will cast the value of the first argument 
        to the type of the second argument.
      If the first argument cannot be converted to the target type
        the value of the second argument will be returned as the default.
      If you want to return null as the default, cast the null to the target
        type like `null::integer` or `null::date`
    */
    DROP FUNCTION IF EXISTS safe_cast(text, anyelement);
    CREATE OR REPLACE FUNCTION safe_cast(text, anyelement)
    RETURNS anyelement
    LANGUAGE plpgsql as $$
    BEGIN
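        -- $0 is a variable with the function's actual return type (available because the
        -- declared return type is polymorphic); assigning the text argument to it performs the cast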
        $0 := $1;
        RETURN $0;
        EXCEPTION WHEN OTHERS THEN
            RETURN $2;
    END;
    $$;
    

    The function itself looks a little terse and strange, but if you look at and play with the examples you’ll get a better understanding of how it works. I have been using it for a long time and I can’t remember if I actually wrote it or got it from somewhere else - I believe I may have adapted it from this answer on stackoverflow: https://stackoverflow.com/a/2095676/7186

    Below are examples, but you can play with the function and these examples here: https://dbfiddle.uk/3kAop9-Z

    
    select
    	  safe_cast('true', false::boolean) = true
    	, safe_cast('false'::text, false::boolean) = false
    	, safe_cast('yes'::text, false::boolean) = true
    	, safe_cast('no'::text, false::boolean) = false
    	, safe_cast('on'::text, false::boolean) = true
    	, safe_cast('off'::text, false::boolean) = false
    	, safe_cast('1'::text, false::boolean) = true
    	, safe_cast('0'::text, false::boolean) = false
    	, safe_cast(1::text, false::boolean) = true
    	, safe_cast(0::text, false::boolean) = false
    	, safe_cast('foo'::text, false::boolean) = false
    	, safe_cast('3'::text, false::boolean) = false
    	, safe_cast(3::text, false::boolean) = false
    	, safe_cast('', false::boolean) = false
    	, safe_cast(null::text, false::boolean) is null
    	;
    
    select
    	  safe_cast('123', null::numeric) = 123::numeric
    	, safe_cast('123.45', null::numeric) = 123.45::numeric
    	, safe_cast('0', null::numeric) = 0::numeric
    	, safe_cast('-1', null::numeric) = -1::numeric
    	, safe_cast('-1.2', null::numeric) = -1.2::numeric
    	, safe_cast('123x', null::numeric) is null
    	, safe_cast('', null::numeric) is null
    	, safe_cast('foobar', null::numeric) is null
    	, safe_cast('123', null::numeric) > 1
        , safe_cast('123', 0::integer) = 123::integer
    	, safe_cast('', 0::integer) = 0::integer
    	, safe_cast('foo', 0::integer) = 0::integer
    	;
    
    select
    	  safe_cast('2024-01-02', null::date) = '2024-01-02'::date
    	, safe_cast('01-02-2024', null::date) = '2024-01-02'::date
    	, safe_cast('2024-01-023', null::date) = '2024-01-23'::date --possibly surprising
    	, safe_cast('2024-01-', null::date) is null
    	, safe_cast('2024-01-123', null::date) is null
    	, safe_cast('2024', null::date) is null
    	, safe_cast('foobar', null::date) is null
        , safe_cast('2024-01-02', null::timestamptz) = '2024-01-02 00:00:00'::timestamptz
    	;
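
    And here is the kind of place it earns its keep for me - pulling values out of JSON. This is only a sketch with made-up table and column names:

    -- hypothetical table/columns, just to show the shape of the call
    select
    	  safe_cast(payload->>'amount', null::numeric)       as amount
    	, safe_cast(payload->>'invoice_date', null::date)    as invoice_date
    	, safe_cast(payload->>'is_paid', false::boolean)     as is_paid
    from llm_responses;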
    

    Other databases have similar functionality built in, but most do not have the ability to set a default when the value cannot be safely cast baked into the function - they just return null, and you can use coalesce or similar to supply a different default if you need one. (The following list is just the ones I know off the top of my head, and is not meant to be exhaustive.)

    MSSQL has TRY_CAST: https://learn.microsoft.com/en-us/sql/t-sql/functions/try-cast-transact-sql?view=sql-server-ver16

    Snowflake has TRY_CAST: https://docs.snowflake.com/en/sql-reference/functions/try_cast

    DuckDB has TRY_CAST: https://duckdb.org/docs/sql/expressions/cast.html#try_cast
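
    In those dialects you would typically combine TRY_CAST with COALESCE to get the same default behavior; a rough sketch:

    -- e.g. in DuckDB (same idea in MSSQL and Snowflake)
    select coalesce(try_cast('foo' as integer), 0);  -- returns 0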


  • Partitioning using UUIDs

    Recently I had a situation where I needed to process quite a few records, let’s say around 100k, and the processing of each record took around 2 seconds. The math on that wasn’t pretty, so I needed to find a way to partition the set of data so that I could run multiple instances of the processing at the same time - but where each instance would have its own data and no record would be processed by more than one instance.

    Normally you would want to look for some natural key if you can, but in this case there wasn’t anything I could use - especially not anything that would be well distributed. I also wanted the flexibility to partition for as few or as many instances as I needed - I might start with 5 instances, but that might not be enough.

    The one thing I did have is UUIDs. We use UUIDs for virtually all of our keys - an explanation of all of the benefits of that will have to be its own blog post someday.

    So I had the thought to use UUIDs to do the partitioning. The easiest thing you could do is look at the last character - that would quickly give you 16 partitions without much work, and for most versions of UUID the last character will be fairly well distributed.

    But then I remembered - a UUID is really just a 128 bit number, or two 64 bit integers (bigints) in a trench coat - and assuming they are randomly generated (we use v4 for most things right now so I know they are), I can use modulus math to partition into however many buckets I need.

    I was able to come up with the following, which given a UUID will extract the high bigint and then take it mod 5:

    SELECT abs(('x' || translate(gen_random_uuid()::text, '-', ''))::bit(64)::bigint % 5)
    

    This might look complicated at first, but while it has a few steps it’s fairly straightforward. A few things to notice first:

    A UUID is a hex encoded number that has hyphens in particular places to break it up into pieces that make it a little easier to read. So to break it apart and pull out just the higher piece:

    a) convert the uuid to text

    b) remove the hyphens

    c) prepend an ‘x’ to the beginning of the string - the leading ‘x’ tells postgres to interpret the rest of the string as hex when it is cast to a bit string

    d) convert that to a bit string using ::bit(64) - this relies on the fact that if the value is bigger than the bit type can hold, postgres keeps as much as fits and throws away the rest, so this takes only the first 64 bits

    e) take that 64 bits and convert it to a bigint using ::bigint

    f) lastly, take its absolute value, because the number is signed and you might get a negative - that doesn’t matter for our purpose

    With these steps done you only need % to take the modulus of that number and get the remainder. You can change 5 to however many buckets you want - just remember that the answer is zero based: with % 5 you will get 0 through 4 as possible answers.

    Now when I create my instances I only have to specify the number of partitions and which partition that instance is responsible for, and a simple where clause in the query that fetches its population restricts it to only the records that belong to that instance - no two instances will ever try to do the same work.

    For clarity, here is what this might look like in use:

    SELECT
      id
      , ...
    FROM foo
    WHERE abs(('x' || translate(id::text, '-', ''))::bit(64)::bigint % 5) = 2
    

    If you’re interested in exploring more about how UUIDs split into numbers and back again or proving to yourself that a random uuid is evenly distributed, you can check out this fiddle: https://dbfiddle.uk/yX-y7Ath
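
    If you just want a quick sanity check, something like this works too (each bucket count should hover around 20,000):

    SELECT abs(('x' || translate(gen_random_uuid()::text, '-', ''))::bit(64)::bigint % 5) AS bucket
      , count(*)
    FROM generate_series(1, 100000)
    GROUP BY 1
    ORDER BY 1;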

    And one more bonus fact related to UUIDs - MD5 hashes are also 128 bit numbers stored as hex, and since UUIDs are also just 128 bit numbers stored as hex, you can convert MD5 hashes to UUIDs and store them using the UUID type in your database for space savings compared to storing them as text.
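
    A quick sketch of that conversion (md5() returns 32 hex characters, which the uuid type accepts directly):

    SELECT md5('some value')::uuid;  -- 16 bytes stored instead of a 32-character string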
