This post is mainly a notebook I use to record some practical HiveSQL tricks during daily work. Hopefully it will also make your life easier.
Concatenate strings from several rows by SQL
We use the same logic as we did in Pandas to perform the task.
SELECT t.user_id
,concat_ws(',',collect_set(tag)) AS tags
,concat_ws(',',collect_set(product)) AS products
FROM table_name
GROUP BY user_id
Remove duplicates in a more efficient way
We use ROW_NUMBER()
function to identify the replicated rows and then simply select the rows we need.
SELECT user_id
,product_id
,order_time
FROM (
SELECT user_id
,product_id
,order_time
,ROW_NUMBER() OVER (PARTITION BY user_id, product_id ORDER BY order_time ASC) AS rn
FROM orders
) t
WHERE t.rn = 1
First we select all rows like we'd normally do, then we add a new column ROW_NUMBER()
with partition by user_id
and product_id
. This is similar to .groupby(['user_id','product_id'])
in pandas
. Then the ordering is by order_time
ascendingly, meaning the earliest order will be numbered as 1.
Due to the ROW_NUMBER()
cannot in WHERE
clause, we need to embed the query to select the rows we need. For dropping duplicates, WHERE t.rn = 1
will do the trick. We can also keep the earliest 5 orders for all users:
SELECT user_id
,product_id
,order_time
FROM (
SELECT user_id
,product_id
,order_time
,ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY order_time ASC) AS rn
FROM orders
) t
WHERE t.rn <= 5
Fill string with certain characters to a fixed length
LPAD( string str, int len, string pad)
and RPAD( string str, int len, string pad)
For example, lpad('9:00:00',8,'0')
can be used to add a "0" in front of "9:00:00" so it becomes "09:00:00", but it won't add 0 to "13:00:00" since "13:00:00" has 8 characters.
Get Object from JSON Embedded Rows
SPLIT_PART(get_json_object(sentence,'$.redirect_url'),'-',2,2)
Concat Distinct Values by Columns
SELECT user_id
,CONCAT_WS(',',COLLECT_SET(type)) AS types
,CONCAT_WS(',',COLLECT_SET(product_name)) AS product_names
FROM user_purchase
WHERE ds = '${bizdate}'
GROUP BY user_id
Get Percentile
WITH score AS (
SELECT month
,user_id
,score
FROM user_score
)
SELECT month
,percentile_approx(score,0) AS score_0 -- min
,percentile_approx(score,0.05) AS score_5 -- 5%
,percentile_approx(score,0.25) AS score_25 -- 1st quarter
,percentile_approx(score,0.5) AS score_50 -- median
,percentile_approx(score,0.75) AS score_75 -- 3rd quarter
,percentile_approx(score,0.95) AS score_95 -- 95%
,percentile_approx(score,1) AS score_100 -- max
FROM score
WHERE month >= '2020-09'
GROUP BY month
ORDER BY month DESC
;
Alter Table: add columns
ALTER TABLE `user_info` ADD COLUMNS (
`user_age` BIGINT COMMENT 'User Age'
,`user_location` STRING COMMENT 'User Location'
);
Calculate between Rows
# get the previous row's create_time
LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time)
# calculate the time difference between current row's create_time and previous
# one, group by user_id
DATEDIFF(create_time,LAG(create_time,1) OVER (PARTITION BY user_id ORDER BY create_time),'SS' ) AS time_span
Similarly, LEAD()
is doing the opposite of LAG()
, it will get the value of next n row(s).
Rows to Columns, Columns to Rows
name | subject | score |
---|---|---|
小明 | 数学 | 98 |
小明 | 英语 | 95 |
小明 | 语文 | 96 |
杰仔 | 科学 | 88 |
阿花 | 数学 | 91 |
阿花 | 英语 | 93 |
-- create table
WITH df_student AS (
SELECT '小明' AS name, '语文' AS subject, 96 AS score
UNION
SELECT '小明' AS name, '数学' AS subject, 98 AS score
UNION
SELECT '小明' AS name, '英语' AS subject, 95 AS score
UNION
SELECT '阿花' AS name, '数学' AS subject, 91 AS score
UNION
SELECT '阿花' AS name, '英语' AS subject, 93 AS score
UNION
SELECT '杰仔' AS name, '科学' AS subject, 88 AS score
)
-- rows to columns
, df_student_wide AS (
SELECT name
,MAX(CASE WHEN subject = '语文' THEN score ELSE '-' END) 语文
,MAX(CASE WHEN subject = '数学' THEN score ELSE '-' END) 数学
,MAX(CASE WHEN subject = '英语' THEN score ELSE '-' END) 英语
,MAX(CASE WHEN subject = '科学' THEN score ELSE '-' END) 科学
FROM df_student
GROUP BY name
)
-- columns to rows
SELECT *
FROM (
SELECT name,'语文' AS subject,语文 AS score FROM df_student_wide
UNION
SELECT name,'数学' AS subject,数学 AS score FROM df_student_wide
UNION
SELECT name,'英语' AS subject,英语 AS score FROM df_student_wide
UNION
SELECT name,'科学' AS subject,科学 AS score FROM df_student_wide
) t
WHERE t.score != '-'
;