ETL 작업, DB에 저장하기

데이터 엔지니어링 관련 2021. 8. 3. 16:26

아래와 같은 json파일을 전처리 후 postgresql 에 저장하기

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}

{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}

ETL Pipeline

Connect to the database.
Process song files.
1. Insert song data into songs table.
2. Insert artist data into artists table.
Process log_files.
1. Insert ts (unix timestamp) in time table.
  1. from the field ts we can extract year, day, hour, week, month and day of the week.
2. Insert user info in users table.
3. Insert songpplay records into songplays table. In this case we need an additional select to get the artist_id and the artist_id. This is very important for the star schema will successful. I've improved this query using an additional INDEX in song table for the artist_id field to make the JOIN with artists table.
Disconnect and finish.

저작자표시 비영리 변경금지

'데이터 엔지니어링 관련' 카테고리의 다른 글

빅데이터 - 구자환 교수님 (0)	2021.08.11
song, log dataset 으로 데이터 모델링 (0)	2021.07.29
정규화 (0)	2021.07.28
PostgreSQL , MySQL 비교 (0)	2021.07.28

hjc_

୧( “̮ )୨

ETL 작업, DB에 저장하기

'데이터 엔지니어링 관련' 카테고리의 다른 글

공지사항

카테고리

글 보관함

달력

링크

hjc_

LATEST FROM OUR BLOG

LATEST COMMENTS

BLOG VISITORS

티스토리툴바

« 2025/05 »
일	월	화	수	목	금	토
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31