Something I have a need to do often but can be difficult to do at times in SQL is to create a pivot table. As an example imagine wanting to see customers and their revenue by month. It is straightforward to create a normal data set where the dates are the rows and you have a revenue amount for each. Something like this:
dte total 2022-01-01 22030 2022-02-01 22753 2022-03-01 0 2022-04-01 9456 2022-05-01 7798 2022-06-01 38278 2022-07-01 18736 2022-08-01 6794 2022-09-01 21033 2022-10-01 28576 2022-11-01 10172 2022-12-01 41901
But you quickly come up to two obstacles as you try to take it further - you either want to have the months as columns like this:
jan feb mar apr may jun jul aug sep oct nov dec 22030 22753 0 9456 7798 38278 18736 6794 21033 28576 10172 41901
or you want to see multiple customers, which as a column can be difficult, or even harder is having the months as columns and the customers as rows:
cus_id jan feb mar apr may jun jul aug sep oct nov dec 1 0 10170 0 5399 0 14821 7927 0 14 15466 3675 14447 2 22030 12583 0 4057 7798 23457 10809 6794 21019 13110 6497 27454
The term for this is pivot table - which is something you may have done many times in Excel or other spreadsheet application.
But this is difficult in SQL, because SQL requires you to have a static column list. You can’t ask SQL to give you whatever columns are necessary, you must declare them in your query. (
SELECT *may seem like an exception to this rule, but in this case the SQL engine still knows what the columns are going to be before the query is executed).
Luckily Postgres gives you a way around this. If you want actual columns you still have to specify them, but the hard part of aggregation into these pivoted columns and rows are made much easier. The key is using the JSON functionality, which allows you to represent complex values in a single cell. It allows you to aggregate the values into what really represents multiple values, and then pull them back apart after the fact.
Here is an example of what this looks like:
WITH columns AS ( SELECT generate_series dte , customer_id FROM generate_series('2022-01-01'::date, '2022-12-31', '1 month') CROSS JOIN ( SELECT DISTINCT customer_id FROM invoice ) AS customers ), data AS ( SELECT columns.dte , columns.customer_id , SUM(COALESCE(invoice.amount, 0)) total FROM columns LEFT OUTER JOIN invoice ON DATE_PART('year', invoice_date) = DATE_PART('year', columns.dte) AND DATE_PART('month', invoice_date) = DATE_PART('month', columns.dte) AND columns.customer_id = invoice.customer_id GROUP BY columns.dte , columns.customer_id ), result AS ( SELECT customer_id , JSONB_OBJECT_AGG(TO_CHAR(dte, 'YYYY-MM'), total) pivotData FROM data GROUP BY customer_id ) SELECT customer_id , (pivotData->>'2022-01') "jan" , (pivotData->>'2022-02') "feb" , (pivotData->>'2022-03') "mar" , (pivotData->>'2022-04') "apr" , (pivotData->>'2022-05') "may" , (pivotData->>'2022-06') "jun" , (pivotData->>'2022-07') "jul" , (pivotData->>'2022-08') "aug" , (pivotData->>'2022-09') "sep" , (pivotData->>'2022-10') "oct" , (pivotData->>'2022-11') "nov" , (pivotData->>'2022-12') "dec" FROM result ORDER BY customer_id
Which gives results like we were after above.
If you would like to see how this is done, I have an interactive fiddle you can play with that shows you step by step how each of these parts work:
This example is using PG 15, but this functionality works all the way back to PG 9.5.
This query is also a great example of using
generate_seriesto generate a set of data to join against, so that you can find any holes and represent all the data points you need (months in this case), even if there is no actual data for that point.
In conclusion, the JSON functionality built into todays relational databases are more than just schemaless data stores and complex values in cells. They can also be powerful as intermediate steps to help you manipulate and transform your data in useful ways.comments
An alternate title for this post might be
Processing API Requests with a Queue. We recently had a project where we were expecting a burst of high traffic and heavy load on some API endpoints. We wanted to make sure that we could handle all of the traffic, even if the processing time was affected - dropped requests were not an option. After doing quite a bit of research this post is what we came up with. In the end this strategy worked well for our purposes, but we did identify some ways that we would improve it in the future.
Also this same strategy will work for any long-running api request. Things like reporting for example, where you need to be able to make a request but it may take a very long (and possibly indeterminate) amount of time to complete. Please forgive the tone of the document, it is being adapted from my notes and it isn’t in conversational form. It’s likely that there may be some things that I should expand upon, so if anything needs clarification please feel free to ask in the comments.comments
Recently I came across an error using postgres that stumped me for a while so I wanted to document it for next time. I was issuing an update statement to a table that had no numeric columns, but received the error:
ERROR: value overflows numeric format. Not only did my statement not affect any numeric columns, the table itself didn’t have any numeric columns. The actual problem ended up being some malformed json that I was trying to insert into a
jsonbcolumn. The json had a value like
300e715100which was actually part of a hashed string, but the json serializer I was using incorreclty identified it as a very large scientific notiation number, and so did not quote it. Because postgres cannot deal with a number that large it throws the error. Quoting the value properly fixed the problem. I also want to note that the error would not happen with
json, only with
jsonbbecause postgres is actually parsing the document.
You can try the code and see the error for yourself here: http://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=1584330f148ab0e9ed72529dfb466a12comments
Today I learned that postgres allows you to use a subselect in an update statement using a special syntax. This allows you to update a record from other data in the system easily (without remembering the weird update-with-join syntax) or to have an update syntax that more closely resembles an insert statement. For example:
The subselect can be any query, just return your columns in the same order as the column list you provide. Here is the relevant documentation if you want to read further: https://www.postgresql.org/docs/current/static/sql-update.htmlcomments
There are few things I have learned in my programming career that have paid off like higher order functions.comments