How do I load this file into Hive (Serde) [duplicate]
I'm struggling to create a schema for a comma-delimited file that I need to load into Hive. The content looks something like this. The first few columns have clean, well-behaved values:
2021-09-13,11111111,111111,2244,2186,xxxxx,xxxxx,2000106,xxx,2018-06-25 10:54:54,2018-06-25 07:24:00,2021-09-13 01:28:00,0,CVSS:3.0/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:N,,,false,
Then there is a column holding a large chunk of text that contains end-of-line characters, commas, and so on. It is wrapped in quotes:
"1. Navigate to the following URL:
https://sample.com/home.html
2. Review the server HTTP response headers:
HTTP/1.1 200 OK
Server: xxxxxxxxxxxxxxx
Pragma: No-cache
Cache-Control: no-cache, public
Expires: xxxxxxxxxxxxxxxxxxx
Content-Length: 2911
X-Cnection: close
Content-Type: text/html;charset=UTF-8
Vary: Accept-Encoding
Date: xxxxxxxxxxxxxxxxx
Connection: close
Set-Cookie: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx; xxxxxxxxxxxxxxxxxxxxxx
Set-Cookie: bm_sv=xxxxxxxxxxxxxxxxxxxxxxxx=; Domain=.xxxxxx.com; Path=/; Max-Age=4737; HttpOnly
3. Note that neither the ""X-Frame-Options"" nor ""frame-ancestors"" headers appear to be present","xxxxxxxxxxxxxxxxxxxxxxxxx.",xxxxxxxxxxxxxxxxxx,"Missing ""X-Frame-Options""",443,6,http,"xxxm.
xxxxxxxxxxxxxxxx.",Information,false,"To remediate this issue, (re)configure the web application to use xxxxxx ""self"" :
Content-Security-Policy: frame-ancestors 'self' xxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxx."
And then the rest of it:
,12.0,2021-09-13 13:03:49,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Could anyone please advise how to set the DDL (LazySimpleSerDe, OpenCSVSerDe, RegexSerDe)?
Thanks in advance, Gal
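One possible direction, offered as a sketch rather than a confirmed fix: the hard part of the record above is that quoted fields span several physical lines and contain doubled quotes (""), while Hive text tables are generally read one line at a time before any SerDe sees the data. Assuming that holds for your setup, the file could be pre-processed so that every record sits on a single line, after which a CSV-aware SerDe such as OpenCSVSerde would have a chance of parsing it. The function name `flatten_multiline_csv`, the `newline_token` parameter, and the miniature sample record are hypothetical and only illustrate the idea.

```python
import csv
import io


def flatten_multiline_csv(text: str, newline_token: str = " ") -> str:
    """Rewrite CSV text so that every record occupies one physical line.

    Hypothetical helper: embedded line breaks inside quoted fields are
    replaced by newline_token; quoting and doubled quotes are preserved.
    """
    # newline="" keeps embedded line breaks intact so csv.reader can
    # recognise them as part of a quoted field.
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO(newline="")
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    for row in reader:
        cleaned = [
            field.replace("\r\n", newline_token)
                 .replace("\n", newline_token)
                 .replace("\r", newline_token)
            for field in row
        ]
        writer.writerow(cleaned)
    return out.getvalue()


# Hypothetical miniature record modelled on the sample in the question.
sample = (
    '2021-09-13,11111111,"1. Navigate to the following URL:\n'
    'https://sample.com/home.html\n'
    '2. Note that ""X-Frame-Options"" is missing",443,false\n'
)
flat = flatten_multiline_csv(sample)
print(flat, end="")
```

Replacing the line breaks with a space loses the original layout of the text column; if that matters, a different `newline_token` (for example a literal `\n` marker) could be used and translated back after loading. Whether the resulting file then loads cleanly depends on the SerDe's quote and escape settings, which this sketch does not cover.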
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
