This post is mainly a notebook I use to record some practical HiveSQL tricks during daily work. Hopefully it will also make your life easier.

Concatenate strings from several rows by SQL

We use the same logic as we did in Pandas to perform the task.

SELECT  t.user_id
        ,concat_ws(',',collect_set(tag)) AS tags
        ,concat_ws(',',collect_set(product)) AS products
FROM    table_name
GROUP BY user_id
all tags and products will grouped by each user name, seperated by comma

Remove duplicates in a more efficient way

We use row_number() function to identify the replicated rows and then simply select the rows we need.

SELECT	user_id
        ,product_id
        ,order_time
FROM (
        SELECT	user_id
                ,product_id
                ,order_time
                ,row_number() OVER (PARTITION BY user_id, product_id ORDER BY order_time ASC) AS rn
        FROM    orders
) t
WHERE t.rn = 1
We select the first record for any products purchased by all users

First we select all rows like we'd normally do, then we add a new column row_number() with partition by user_id and product_id. This is similar to .groupby(['user_id','product_id']) in pandas. Then the ordering is by order_time ascendingly, meaning the earliest order will be numbered as 1.

Due to the row_number() cannot in WHERE clause, we need to embed the query to select the rows we need. For dropping duplicates, WHERE t.rn = 1 will do the trick. We can also keep the earliest 5 orders for all users:

SELECT	user_id
        ,product_id
        ,order_time
FROM (
        SELECT	user_id
                ,product_id
                ,order_time
                ,row_number() OVER (PARTITION BY user_id ORDER BY order_time ASC) AS rn
        FROM    orders
) t
WHERE t.rn <= 5
here only partition by user_id, and rn <= 5

Fill string with certain characters to a fixed length

LPAD( string str, int len, string pad ) and RPAD( string str, int len, string pad )

For example, lpad('9:00',5,'0') can be used to add a "0"  in front of "9:00"  so it becomes "09:00", but it won't add 0 to "13:00" since "13:00" has 5 characters.

Get Object from JSON Embedded Rows

SPLIT_PART(get_json_object(sentence,'$.redirect_url'),'-',2,2)

Calculate between Rows

# get the previous row's create_time
LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time)

# calculate the time difference between current row's create_time and previous
# one, group by user_id
DATEDIFF(create_time,LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time),'SS' ) AS time_span

Calculate the timespan between every two rows

Similarly, LEAD() is doing the opposite of LAG(), it will get the value of next n row(s).