This post is mainly a notebook I use to record some practical HiveSQL tricks during daily work. Hopefully it will also make your life easier.

## Concatenate strings from several rows by SQL

We use the same logic as we did in Pandas to perform the task.

## Remove duplicates in a more efficient way

We use row_number() function to identify the replicated rows and then simply select the rows we need.

First we select all rows like we'd normally do, then we add a new column row_number() with partition by user_id and product_id. This is similar to .groupby(['user_id','product_id']) in pandas. Then the ordering is by order_time ascendingly, meaning the earliest order will be numbered as 1.

Due to the row_number() cannot in WHERE clause, we need to embed the query to select the rows we need. For dropping duplicates, WHERE t.rn = 1 will do the trick. We can also keep the earliest 5 orders for all users:

## Fill string with certain characters to a fixed length

LPAD( string str, int len, string pad ) and RPAD( string str, int len, string pad )

For example, lpad('9:00',5,'0') can be used to add a "0"  in front of "9:00"  so it becomes "09:00", but it won't add 0 to "13:00" since "13:00" has 5 characters.

## Get Object from JSON Embedded Rows

SPLIT_PART(get_json_object(sentence,'\$.redirect_url'),'-',2,2)

## Calculate between Rows

Similarly, LEAD() is doing the opposite of LAG(), it will get the value of next n row(s).