This post is mainly a notebook I use to record some practical HiveSQL tricks during daily work. Hopefully it will also make your life easier.
Concatenate strings from several rows by SQL
We use the same logic as we did in Pandas to perform the task.
Remove duplicates in a more efficient way
We use ROW_NUMBER()
function to identify the replicated rows and then simply select the rows we need.
First we select all rows like we'd normally do, then we add a new column ROW_NUMBER()
with partition by user_id
and product_id
. This is similar to .groupby(['user_id','product_id'])
in pandas
. Then the ordering is by order_time
ascendingly, meaning the earliest order will be numbered as 1.
Due to the ROW_NUMBER()
cannot in WHERE
clause, we need to embed the query to select the rows we need. For dropping duplicates, WHERE t.rn = 1
will do the trick. We can also keep the earliest 5 orders for all users:
Fill string with certain characters to a fixed length
LPAD( string str, int len, string pad)
and RPAD( string str, int len, string pad)
For example, lpad('9:00:00',8,'0')
can be used to add a "0" in front of "9:00:00" so it becomes "09:00:00", but it won't add 0 to "13:00:00" since "13:00:00" has 8 characters.
Get Object from JSON Embedded Rows
Concat Distinct Values by Columns
Get Percentile
Alter Table: add columns
Calculate between Rows
Similarly, LEAD()
is doing the opposite of LAG()
, it will get the value of next n row(s).
Rows to Columns, Columns to Rows
name | subject | score |
---|---|---|
小明 | 数学 | 98 |
小明 | 英语 | 95 |
小明 | 语文 | 96 |
杰仔 | 科学 | 88 |
阿花 | 数学 | 91 |
阿花 | 英语 | 93 |