아래와 같은 json파일을 전처리 후 postgresql 에 저장하기
{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}
ETL Pipeline
- Connect to the database.
- Process song files.
- Insert song data into songs table.
- Insert artist data into artists table.
- Process log_files.
- Insert ts (unix timestamp) in time table.
- from the field ts we can extract year, day, hour, week, month and day of the week.
- Insert user info in users table.
- Insert songpplay records into songplays table. In this case we need an additional select to get the artist_id and the artist_id. This is very important for the star schema will successful. I've improved this query using an additional INDEX in song table for the artist_id field to make the JOIN with artists table.
- Insert ts (unix timestamp) in time table.
- Disconnect and finish.
'데이터 엔지니어링 관련' 카테고리의 다른 글
빅데이터 - 구자환 교수님 (0) | 2021.08.11 |
---|---|
song, log dataset 으로 데이터 모델링 (0) | 2021.07.29 |
정규화 (0) | 2021.07.28 |
PostgreSQL , MySQL 비교 (0) | 2021.07.28 |