This post is mainly a notebook I use to record some practical HiveSQL tricks during daily work. Hopefully it will also make your life easier.
Concatenate strings from several rows by SQL
We use the same logic as we did in Pandas to perform the task.
Remove duplicates in a more efficient way
row_number() function to identify the replicated rows and then simply select the rows we need.
First we select all rows like we'd normally do, then we add a new column
row_number() with partition by
product_id. This is similar to
.groupby(['user_id','product_id']) in pandas. Then the ordering is by
order_time ascendingly, meaning the earliest order will be numbered as 1.
Due to the
row_number() cannot in WHERE clause, we need to embed the query to select the rows we need. For dropping duplicates,
WHERE t.rn = 1 will do the trick. We can also keep the earliest 5 orders for all users:
Fill string with certain characters to a fixed length
LPAD( string str, int len, string pad ) and
RPAD( string str, int len, string pad )
lpad('9:00',5,'0') can be used to add a "0" in front of "9:00" so it becomes "09:00", but it won't add 0 to "13:00" since "13:00" has 5 characters.
Get Object from JSON Embedded Rows
Calculate between Rows
LEAD() is doing the opposite of
LAG(), it will get the value of next n row(s).