TIL
My Thoughts, Opinions, Learnings and Discoveries
-
Partitioning using UUIDs
Recently I had a situation where I needed to process quite a few records - let's say around 100k - and the processing of each record took around 2 seconds. The math on that wasn't pretty, so I needed to find a way to partition the set of data so that I could run multiple instances of the processing at the same time, but where each instance would have its own data and no record would be processed by more than one instance.
Normally you would want to look for some natural key if you can, but in this case there wasn't anything I could use, especially not anything that would be well distributed. I also wanted the flexibility to partition for as few or as many instances as I needed - I might start with 5 instances, but that might not be enough.
The one thing I did have was UUIDs. We use UUIDs for virtually all of our keys - an explanation of all of the benefits of that will have to be its own blog post someday.
So I had the thought to use UUIDs to do the partitioning. The easiest thing you could do is look at the last character; that would quickly give you 16 partitions without much work, and in most versions of UUIDs the last character would be fairly well distributed.
But then I remembered - a UUID is really just a 128 bit number, or two 64 bit integers (bigints) in a trench coat - and assuming they are randomly generated (we use v4 for most things right now so I know they are), I can use modulus math to partition into however many buckets I need.
I was able to come up with the following, which, given a UUID, will extract the high bigint and then take it mod 5:
SELECT abs(('x' || translate(gen_random_uuid()::text, '-', ''))::bit(64)::bigint % 5)
This might look complicated at first, but while it has a few steps it's fairly straightforward. A few things to notice first:
A UUID is a hex encoded number that has hyphens in particular places to break it up into pieces that make it a little easier to read. So to break it apart and pull out just the higher piece:
a) convert the uuid to text
b) remove the hyphens
c) prepend an 'x' to the beginning of the string - this marks the string as hex, so the cast in the next step reads it as hex digits rather than binary digits
d) convert that to a bit string using ::bit(64) - this relies on the fact that if you give the bit type more bits than it can hold, it will keep as many as fit and throw away the rest, so this takes only the first 64 bits
e) take those 64 bits and convert them to a bigint using ::bigint
f) lastly, take its absolute value, because the number is signed and you might get a negative - but that doesn't matter for our purpose
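If it helps, here is a small breakdown query - just a sketch - that runs the steps above side by side for a single random UUID so you can see each intermediate value:
-- each column adds one more step from the list above
SELECT id                                                          AS original_uuid
     , translate(id::text, '-', '')                                AS hex_without_hyphens
     , ('x' || translate(id::text, '-', ''))::bit(64)              AS high_64_bits
     , ('x' || translate(id::text, '-', ''))::bit(64)::bigint      AS as_bigint
     , abs(('x' || translate(id::text, '-', ''))::bit(64)::bigint) AS as_positive_bigint
FROM (SELECT gen_random_uuid() AS id) AS t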
With these steps done you only need to use % to take the modulus of that number and get the remainder. You can change 5 to whatever number of buckets you want - just remember that the answer will be zero based. If you use % 5 you will get 0 through 4 as possible answers.
Now when I create my instances I only have to put in the number of partitions and which partition this instance is working on, and a simple where clause in the query that fetches its population will restrict it to only the records that belong to that instance - no two instances will try to do the same work.
For clarity, here is what this might look like in use:
SELECT id , ... FROM foo WHERE abs(('x' || translate(id::text, '-', ''))::bit(64)::bigint % 5) = 2
If you’re interested in exploring more about how UUIDs split into numbers and back again or proving to yourself that a random uuid is evenly distributed, you can check out this fiddle: https://dbfiddle.uk/yX-y7Ath
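If you would rather convince yourself without leaving psql, a quick sketch like this generates 100k random UUIDs and counts how many land in each bucket - the counts should come out roughly even:
SELECT abs(('x' || translate(gen_random_uuid()::text, '-', ''))::bit(64)::bigint % 5) AS bucket
     , count(*)                                                                       AS uuids_in_bucket
FROM generate_series(1, 100000)
GROUP BY bucket
ORDER BY bucket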
And one more bonus fact related to UUIDs - MD5 hashes are also 128 bit numbers stored as hex, and since UUIDs are also just 128 bit numbers stored as hex, you can convert MD5 hashes to UUID and store them using the UUID type in your database for space savings compared to storing it as text.
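A minimal sketch of that trick - Postgres accepts UUID input without hyphens, so the 32 hex characters of an MD5 cast straight across:
SELECT md5('some value')       AS md5_as_text -- 32 hex characters stored as text
     , md5('some value')::uuid AS md5_as_uuid -- the same 128 bits in a 16 byte uuid value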
-
Aggregating event data into sessions using SQL
Let's say you have a table of events, each representing a point in time when an event occurred. Maybe it's clicks in your application, or pages that a user visited. You might like to see how long users are spending in your application, or what they are doing in a given session.
So the task at hand is to use SQL to analyze this event data and combine these points in time into groups of events - I will call them sessions for this post.
The first thing you must do is decide how much time can pass between events before they are no longer considered part of the same session. Anything less than this duration between events, and the events are connected into the same session; anything more, and we will call it a new session. The value you use here is up to you and your needs - if the actions you are looking at take a lot of time to perform, or if someone might spend a while reading content before continuing to the next thing, you might want to make this threshold higher, maybe tens of minutes or even more. On the other hand, if you expect they should be clicking around fairly quickly, you might want to go much shorter, maybe only seconds. For this example I am going to use 5 minutes between events as my threshold.
The next thing you need to decide is, if the session only lasted for one event, how long would you want to say it lasted? For this example, I am going to use 1 minute.
So there are several SQL features we can use to our advantage - this is all PostgreSQL; everything here works at least as far back as version 9.5, possibly earlier.
The first feature is Postgres’ interval type. If you haven’t played with intervals, you are really missing out. It is a native data type that defines a period of time - perfect for our use cases.
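If you haven't used intervals before, here is a quick taste (the timestamps are arbitrary) - subtracting two timestamps gives you an interval, and intervals compare naturally against a threshold:
select '2024-07-01T12:14:00Z'::timestamptz - '2024-07-01T12:12:00Z'::timestamptz as gap -- 00:02:00
     , ('2024-07-01T12:14:00Z'::timestamptz - '2024-07-01T12:12:00Z'::timestamptz) <= '5 minutes'::interval as within_threshold -- true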
The next thing to understand is window functions. I have been writing SQL long enough that these still feel “new”, but they have been around for a long time now! Window functions are great; they let you aggregate against different partitions (groups, or in this case windows) in the same query, which is perfect for our needs where we want to consider each user independently.
Let’s look at some example data to get our bearings:
DROP TABLE IF EXISTS event_log CASCADE;

CREATE TABLE event_log (
    ts timestamptz
  , name varchar
);

INSERT INTO event_log ( ts, name )
VALUES ('2024-07-01T12:00:00.000Z', 'Alice')
     , ('2024-07-01T12:01:00.000Z', 'Alice')
     , ('2024-07-01T12:02:00.000Z', 'Alice')
     , ('2024-07-01T12:03:00.000Z', 'Alice')
     , ('2024-07-01T12:04:00.000Z', 'Alice')
     , ('2024-07-01T12:05:00.000Z', 'Alice')
     , ('2024-07-01T12:12:00.000Z', 'Alice')
     , ('2024-07-01T12:14:00.000Z', 'Alice')
     , ('2024-07-01T12:17:00.000Z', 'Alice')
     , ('2024-07-01T12:21:00.000Z', 'Alice')
     , ('2024-07-01T12:26:00.000Z', 'Alice')
     , ('2024-07-01T12:30:00.000Z', 'Alice')
     , ('2024-07-01T12:00:00.000Z', 'Bob')
     , ('2024-07-01T12:02:00.000Z', 'Bob')
     , ('2024-07-01T12:05:00.000Z', 'Bob')
     , ('2024-07-01T12:09:00.000Z', 'Bob')
     , ('2024-07-01T12:14:00.000Z', 'Bob')
     , ('2024-07-01T12:20:00.000Z', 'Bob')
     , ('2024-07-01T12:25:00.000Z', 'Bob')
     , ('2024-07-01T12:35:00.000Z', 'Bob')
     , ('2024-07-01T12:44:00.000Z', 'Bob')
     , ('2024-07-01T12:48:00.000Z', 'Bob')
     , ('2024-07-01T13:05:00.000Z', 'Bob')
;
Here you can see we have two users, Alice and Bob, and they have several events. Of course, in a real world scenario, these names would probably be IDs pointing to a users table, and you would have an event type and probably lots of other attributes for each event, but none of that is necessary for this demo. If you have trouble extending this example to a more real world scenario, reach out in the comments or on Twitter.
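Before the full query, here is the core idea in isolation - a small sketch that just measures the gap since each user's previous event:
select name
     , ts
     , ts - lag(ts) over (partition by name order by ts) as gap_from_previous -- null for each user's first event
from event_log
order by name, ts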
Here is the final query; we will walk through the different parts of it to see how it works:
with inputs as (
    select '5 minutes'::interval threshold -- change this to however long you want your session timeouts to be
         , '1 minute'::interval min_session_duration -- if there is only one event in the session, how long do you want to say it lasted?
)
, prep as (
    select ts
         , name
         , row_number() over (partition by name order by ts) row_num
         , coalesce(
               ts - lag(ts) over (partition by name order by ts asc) <= inputs.threshold
             , false) is_connected
    from event_log
    cross join inputs
)
, session_detail as (
    select ts
         , name
         , max(row_num) filter (where NOT is_connected) over (partition by name order by ts asc) session_seq
    from prep
)
, sessions as (
    select row_number() over (partition by name order by min(ts), session_seq) session_seq
         , name
         , count(*) event_count
         , min(ts) start_ts
         , max(ts) end_ts
         , greatest(inputs.min_session_duration, max(ts) - min(ts)) session_duration
    from session_detail
    cross join inputs
    group by session_seq
           , name
           , inputs.min_session_duration
)
select session_seq
     , concat_ws('-', name, session_seq) unique_id
     , name
     , event_count
     , start_ts
     , end_ts
     , session_duration
from sessions
;
If you aren't comfortable with SQL that might look like a lot, but trust me, it's really not bad - and we will walk through all the different parts.
Here is what the query returns (I have removed the date, seconds, and milliseconds from the timestamps for brevity; they are all the same in this example):
session_seq  unique_id  name   event_count  start_ts  end_ts  session_duration
1            Alice-1    Alice  6            12:00     12:05   00:05:00
2            Alice-2    Alice  6            12:12     12:30   00:18:00
1            Bob-1      Bob    5            12:00     12:14   00:14:00
2            Bob-2      Bob    2            12:20     12:25   00:05:00
3            Bob-3      Bob    1            12:35     12:35   00:01:00
4            Bob-4      Bob    2            12:44     12:48   00:04:00
5            Bob-5      Bob    1            13:05     13:05   00:01:00

Let's start with the two pieces of SQL that make this work.
First there's this line:
, coalesce( ts - lag(ts) over (partition by name order by ts asc) <= inputs.threshold , false) is_connected
Forgive the formatting - there is no great way to format it that will make everyone happy, but hopefully I can walk you through the pieces.
The goal with this line is to get true or false: is the row we are looking at part of the same session as the previous event for this user?
Starting from the inside out - for the current row, we are taking its ts (timestamp), and subtracting the timestamp of the row before it. This is a window function - you know that because of the over () clause. The window is how we decide what "before it" means.
We are using partition by name, meaning that our window will only ever consider rows that have the same value for name as our current row - and we are using order by ts asc, meaning "before" will be a row earlier than the current row. partition by works just like group by: you are defining the individual groupings of columns whose combination of values are considered unique to define your different groups. In our case, we only want to consider rows from the same user. lag() is just one option; there's its dual lead(), which would work just the same if you were to order by desc instead.
We do our subtraction and get an interval - a duration of time since that previous event. Now we just need to know if that interval is less than our threshold - if so, we are still part of the same session; if not, then this is the beginning of a new session.
But our lag(ts) ... could return null - if we are looking at the first event for a user, there is no previous value. And if we take our current row's timestamp and subtract null, the result will of course be null. And when we do our comparison with the threshold, we will also get null. So we coalesce() our value to false in that case - if it's the first event for a user, it certainly cannot be part of a previous session.
So now let's look at the other important part of this approach:
, max(row_num) filter (where NOT is_connected) over (partition by name order by ts asc) session_seq
This part is a little more complicated, so bear with me:
Earlier in the query we assigned a row number to each row with the row_number() window function, using the same partitions as we previously discussed.
Our goal is to get the highest row number that was the beginning of the current session - so we are only interested in rows where is_connected is false.
The most important thing to understand is that max() as a window function that includes an order by clause will not consider the entire window, but only the window frame up to the current row. I'm going to quote the postgres documentation here:
"There is another important concept associated with window functions: for each row, there is a set of rows within its partition called its window frame. Some window functions act only on the rows of the window frame, rather than of the whole partition. By default, if ORDER BY is supplied then the frame consists of all rows from the start of the partition up through the current row, plus any following rows that are equal to the current row according to the ORDER BY clause. When ORDER BY is omitted the default frame consists of all rows in the partition."
So our max() is only considering rows where is_connected is false, meaning the events that were the beginning of sessions. Then we want to pick the highest row number that came before the current row - so if we are in the second session, we want to pick the beginning of the second session.
These are the two most important parts of this solution - how to figure out whether the current row is a continuation of a session or the start of a new one, and then how to find the start of the session for the current row, so we know which session we are a part of.
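If that running, filtered max() still feels abstract, here is a tiny self-contained sketch (made-up rows, not the event data above) showing how each row picks up the most recent "not connected" row number at or before it:
select ts
     , row_num
     , is_connected
     , max(row_num) filter (where not is_connected) over (order by ts) as session_start_row
from (values (1, 1, false), (2, 2, true), (3, 3, false), (4, 4, true)) as t(ts, row_num, is_connected)
-- session_start_row comes out as 1, 1, 3, 3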
The one other thing I want to call out here is the CTE at the top that I call inputs - this is a pattern I use a lot. It pulls the "variables" of a query to the top and just selects them as literal values - then I can cross join them wherever I need to in the rest of the query, making it dynamic to those inputs.
Here is a database fiddle if you want to play with this query to understand it better. I really recommend pulling out parts of the CTEs and understanding how each part works: https://dbfiddle.uk/M_iWoTG-
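As a minimal sketch of that inputs pattern on its own (reusing the event_log table from above, with a made-up cutoff as the only "variable"):
with inputs as (
    select '2024-07-01T12:10:00Z'::timestamptz as cutoff -- the single "variable" for this query
)
select name
     , count(*) as events_before_cutoff
from event_log
cross join inputs
where ts < inputs.cutoff
group by name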
One last thing to mention is that this query is useful if you want to slice and dice your event data in a few different ways to explore it. You might get different results and gain more insight into your data if you tweak the thresholds. However, if you already have these thresholds set in stone, this would be an easy thing to calculate as your data comes in, or in an intermediate table that ingests the event data on some schedule and calculates the duration from the previous event and which session each event belongs to.
-
My PostgreSQL wishlist
If you know me you’re aware that I love SQL and relational databases. Of all of the databases I have used, PostgreSQL is by far my favorite. SQL is the one technology I have seen in common across every job I have had over the past two decades. I would bet you could say the same - even if you aren’t writing it directly (which would be a shame in my opinion), it is still there, behind the scenes, handling all of your data. But if you’re fortunate enough to work directly with SQL and a database, it’s hard to find a better tool.
But while I have never seen a better database than PostgreSQL (and you can’t beat the price!), that doesn’t mean that there aren’t things that could be improved. My goal with this post is to start a conversation and throw out some ideas that I have gathered over the years of things I yearned for. Best case scenario, someone tells me that these things already exist and I learn something new! Second best, these ideas resonate with others and we find ways to get some of these in future versions. But I would also love any conversation about why any of these are bad ideas or technically infeasible - I’ll learn something then too. And if anyone has ideas about how I can contribute to these things directly, please let me know!
-
Notes on PostgreSQL Explain Analyze
One thing you'll want to learn if you use PostgreSQL for any length of time is how to use EXPLAIN. At my job at Vendr, as in my previous roles, we are no exception. The good and bad thing is that in many cases you can go pretty far before you start having issues. But that means that not everyone has had the opportunity to learn the dark art of reading explain output. And as great as PG is, it is not as user friendly in this area as other RDBMSs like MSSQL. But with this primer, a few free online tools, and a little bit of time, you can quickly learn how to think about the performance of your queries and where to start when optimizing.
-
Postgres pivot table using JSON
Something I need to do often, but that can be difficult to do in SQL, is create a pivot table. As an example, imagine wanting to see customers and their revenue by month. It is straightforward to create a normal data set where the dates are the rows and you have a revenue amount for each. Something like this:
dte         total
2022-01-01  22030
2022-02-01  22753
2022-03-01  0
2022-04-01  9456
2022-05-01  7798
2022-06-01  38278
2022-07-01  18736
2022-08-01  6794
2022-09-01  21033
2022-10-01  28576
2022-11-01  10172
2022-12-01  41901

But you quickly run into two obstacles as you try to take it further - you either want to have the months as columns, like this:
jan    feb    mar  apr   may   jun    jul    aug   sep    oct    nov    dec
22030  22753  0    9456  7798  38278  18736  6794  21033  28576  10172  41901

or you want to see multiple customers, which can be difficult as a column - or, even harder, you want the months as columns and the customers as rows:
cus_id  jan    feb    mar  apr   may   jun    jul    aug   sep    oct    nov   dec
1       0      10170  0    5399  0     14821  7927   0     14     15466  3675  14447
2       22030  12583  0    4057  7798  23457  10809  6794  21019  13110  6497  27454

The term for this is a pivot table - something you may have done many times in Excel or other spreadsheet applications.
But this is difficult in SQL, because SQL requires you to have a static column list. You can't ask SQL to give you whatever columns are necessary; you must declare them in your query. (SELECT * may seem like an exception to this rule, but in this case the SQL engine still knows what the columns are going to be before the query is executed.)
Luckily Postgres gives you a way around this. If you want actual columns you still have to specify them, but the hard part - aggregating into these pivoted columns and rows - is made much easier. The key is the JSON functionality, which allows you to represent complex values in a single cell. It lets you aggregate the values into what really represents multiple values, and then pull them back apart after the fact.
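Here is a minimal sketch of just that JSON step, using a couple of the numbers from the first table above - fold the rows into one JSONB value, then pull individual keys back out as columns:
SELECT JSONB_OBJECT_AGG(month_key, amount)                    AS pivot_data -- {"2022-01": 22030, "2022-02": 22753}
     , (JSONB_OBJECT_AGG(month_key, amount)->>'2022-01')::int AS jan
     , (JSONB_OBJECT_AGG(month_key, amount)->>'2022-02')::int AS feb
FROM (VALUES ('2022-01', 22030), ('2022-02', 22753)) AS t(month_key, amount)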
Putting it all together, here is an example of what this looks like:
WITH columns AS (
    SELECT generate_series dte
         , customer_id
    FROM generate_series('2022-01-01'::date, '2022-12-31', '1 month')
    CROSS JOIN (
        SELECT DISTINCT customer_id
        FROM invoice
    ) AS customers
), data AS (
    SELECT columns.dte
         , columns.customer_id
         , SUM(COALESCE(invoice.amount, 0)) total
    FROM columns
    LEFT OUTER JOIN invoice
        ON DATE_PART('year', invoice_date) = DATE_PART('year', columns.dte)
        AND DATE_PART('month', invoice_date) = DATE_PART('month', columns.dte)
        AND columns.customer_id = invoice.customer_id
    GROUP BY columns.dte
           , columns.customer_id
), result AS (
    SELECT customer_id
         , JSONB_OBJECT_AGG(TO_CHAR(dte, 'YYYY-MM'), total) pivotData
    FROM data
    GROUP BY customer_id
)
SELECT customer_id
     , (pivotData->>'2022-01') "jan"
     , (pivotData->>'2022-02') "feb"
     , (pivotData->>'2022-03') "mar"
     , (pivotData->>'2022-04') "apr"
     , (pivotData->>'2022-05') "may"
     , (pivotData->>'2022-06') "jun"
     , (pivotData->>'2022-07') "jul"
     , (pivotData->>'2022-08') "aug"
     , (pivotData->>'2022-09') "sep"
     , (pivotData->>'2022-10') "oct"
     , (pivotData->>'2022-11') "nov"
     , (pivotData->>'2022-12') "dec"
FROM result
ORDER BY customer_id
Which gives results like we were after above.
If you would like to see how this is done, I have an interactive fiddle you can play with that shows you step by step how each of these parts work:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=39e115cb8afd6e62c0101286ecd08a3f
This example is using PG 15, but this functionality works all the way back to PG 9.5.
This query is also a great example of using generate_series to generate a set of data to join against, so that you can find any holes and represent all the data points you need (months in this case), even if there is no actual data for that point.
In conclusion, the JSON functionality built into today's relational databases is about more than schemaless data stores and complex values in cells. It can also be a powerful intermediate step to help you manipulate and transform your data in useful ways.