아래와 같은 json파일을 전처리 후 postgresql 에 저장하기  

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
{"artist": null, "auth": "Logged In", "firstName": "Walter", "gender": "M", "itemInSession": 0, "lastName": "Frye", "length": null, "level": "free", "location": "San Francisco-Oakland-Hayward, CA", "method": "GET","page": "Home", "registration": 1540919166796.0, "sessionId": 38, "song": null, "status": 200, "ts": 1541105830796, "userAgent": "\"Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/36.0.1985.143 Safari\/537.36\"", "userId": "39"}

ETL Pipeline

  1. Connect to the database.
  2. Process song files.
    1. Insert song data into songs table.
    2. Insert artist data into artists table.
  3. Process log_files.
    1. Insert ts (unix timestamp) in time table.
      1. from the field ts we can extract year, day, hour, week, month and day of the week.
    2. Insert user info in users table.
    3. Insert songpplay records into songplays table. In this case we need an additional select to get the artist_id and the artist_id. This is very important for the star schema will successful. I've improved this query using an additional INDEX in song table for the artist_id field to make the JOIN with artists table.
  4. Disconnect and finish.

 

 

 
 

 

'데이터 엔지니어링 관련' 카테고리의 다른 글

빅데이터 - 구자환 교수님  (0) 2021.08.11
song, log dataset 으로 데이터 모델링  (0) 2021.07.29
정규화  (0) 2021.07.28
PostgreSQL , MySQL 비교  (0) 2021.07.28
블로그 이미지

hjc_

୧( “̮ )୨

,