Amazon Athena not parsing CloudFront logs
I'm following the Athena getting-started guide and trying to parse my own CloudFront logs. However, the fields are not being parsed.
I used a small test file, as follows:
#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type
2016-02-02 07:57:45 LHR5 5001 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /foo 404 - Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Error -tHYQ3YpojqpR8yFHCUg5YW4OC_yw7X0VWvqwsegPwDqDFkIqhZ_gA== d3g47gpj5mj0b.cloudfront.net https 421 0.076 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Error
2016-02-02 07:57:45 LHR5 1158241 86.177.253.38 GET d3g47gpj5mj0b.cloudfront.net /images/posts/cover/404.jpg 200 https://d3g47gpj5mj0b.cloudfront.net/foo Mozilla/5.0%2520(Macintosh;%2520Intel%2520Mac%2520OS%2520X%252010_10_5)%2520AppleWebKit/537.36%2520(KHTML,%2520like%2520Gecko)%2520Chrome/47.0.2526.111%2520Safari/537.36 - - Miss oUdDIjmA1ON1GjWmFEKlrbNzZx60w6EHxzmaUdWEwGMbq8V536O4WA== d3g47gpj5mj0b.cloudfront.net https 419 0.440 - TLSv1.2 ECDHE-RSA-AES128-GCM-SHA256 Miss
And created the table with this SQL:
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
`Date` DATE,
Time STRING,
Location STRING,
Bytes INT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
os STRING,
Browser STRING,
BrowserVersion STRING
) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
) LOCATION 's3://test/athena-csv/'
But no data comes back: the query returns rows, but every field is empty.
I can see it returns 4 rows, but the first 2 should be excluded because they start with a #, so it seems the regex isn't being applied correctly.
Am I doing something wrong? Or is the regex wrong (seems unlikely, as it's in the docs, and looks fine to me)?
Solution 1:[1]
This is what I ended up with:
CREATE EXTERNAL TABLE logs (
`date` date,
`time` string,
`location` string,
`bytes` int,
`request_ip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referer` string,
`useragent` string,
`uri_query` string,
`cookie` string,
`edge_type` string,
`edge_request_id` string,
`host_header` string,
`cs_protocol` string,
`cs_bytes` int,
`time_taken` string,
`x_forwarded_for` string,
`ssl_protocol` string,
`ssl_cipher` string,
`result_type` string,
`protocol` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'input.regex' = '^(?!#.*)(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+(\\S+)\\s*(\\S*)'
) LOCATION 's3://logs'
Note that the double backslashes are intentional.
The format of the CloudFront logs changed at some point to add the protocol; the trailing optional capture group handles both older and newer files.
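To sanity-check the result, a simple aggregation works well (a minimal sketch; the table and column names are the ones from the DDL above):

SELECT status, count(*) AS requests
FROM logs
-- header lines starting with # don't match the regex and come back as all-NULL rows
WHERE "date" IS NOT NULL
GROUP BY status
ORDER BY requests DESC;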
Solution 2:[2]
Actually, all the answers here have a small mistake: the 4th field must be a BIGINT, not an INT, otherwise requests for files larger than 2 GB are not parsed correctly (the byte count overflows a 32-bit INT). After a long discussion with AWS Business Support, it appears that the correct format would be:
CREATE EXTERNAL TABLE your_table_name (
`Date` DATE,
Time STRING,
Location STRING,
SCBytes BIGINT,
RequestIP STRING,
Method STRING,
Host STRING,
Uri STRING,
Status INT,
Referrer STRING,
UserAgent STRING,
UriQS STRING,
Cookie STRING,
ResultType STRING,
RequestId STRING,
HostHeader STRING,
Protocol STRING,
CSBytes BIGINT,
TimeTaken FLOAT,
XForwardFor STRING,
SSLProtocol STRING,
SSLCipher STRING,
ResponseResultType STRING,
CSProtocolVersion STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://path_to_your_data_directory'
TBLPROPERTIES ('skip.header.line.count' = '2')
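To see why the type matters: a 32-bit INT tops out at 2,147,483,647, so byte counts above roughly 2 GB overflow and the column comes back NULL. A quick check against the table above (a sketch; your_table_name is the placeholder from the DDL):

-- With SCBytes declared BIGINT these rows parse; as INT they would be NULL
SELECT "Date", Uri, SCBytes
FROM your_table_name
WHERE SCBytes > 2147483647
ORDER BY SCBytes DESC
LIMIT 10;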
Solution 3:[3]
After pulling my hair out with this, and improving on @CoderDan's answer:
The secret is to use \t for field separation in the regex instead of \s.
CREATE EXTERNAL TABLE IF NOT EXISTS mytablename (
`date` date,
`time` string,
`location` string,
`bytes` int,
`request_ip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referer` string,
`useragent` string,
`uri_query` string,
`cookie` string,
`edge_type` string,
`edge_request_id` string,
`host_header` string,
`cs_protocol` string,
`cs_bytes` int,
`time_taken` float,
`x_forwarded_for` string,
`ssl_protocol` string,
`ssl_cipher` string,
`result_type` string,
`protocol_version` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(?!#)([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)\t+([^\t]+)$'
) LOCATION 's3://mybucket/myprefix/';
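A quick way to confirm the tab-based regex actually matched (a sketch using the table name from the DDL above):

-- count(*) includes the all-NULL rows from unmatched # header lines;
-- count("date") counts only rows the regex parsed
SELECT count(*) AS total_rows, count("date") AS parsed_rows
FROM mytablename;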
Solution 4:[4]
Athena is case-insensitive and treats every column name as lowercase. Try defining your Athena table and writing your query with lowercase field names instead.
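For example, against the table from the question (a minimal sketch; "date" is double-quoted because it is a reserved word):

SELECT "date", requestip, status
FROM cloudfront_logs
LIMIT 10;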
Solution 5:[5]
The demo doesn't work for me either. After playing with it for a bit I got the following to work:
CREATE EXTERNAL TABLE IF NOT EXISTS DBNAME.TABLENAME (
`date` date,
`time` string,
`location` string,
`bytes` int,
`request_ip` string,
`method` string,
`host` string,
`uri` string,
`status` int,
`referer` string,
`useragent` string,
`uri_query` string,
`cookie` string,
`edge_type` string,
`edge_request_id` string,
`host_header` string,
`cs_protocol` string,
`cs_bytes` int,
`time_taken` string,
`x_forwarded_for` string,
`ssl_protocol` string,
`ssl_cipher` string,
`result_type` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
'input.regex' = '^(?!#)([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)$'
) LOCATION 's3://bucket/logs/';
Replace the bucket/prefix and DBNAME.TABLENAME with your own values. For some reason it still produces empty rows for the lines starting with #, but the rest of the data comes through fine.
I think the next step is to try and make one for the user-agents or cookies.
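One way to do that without changing the table (a sketch, assuming the useragent column from the DDL above): CloudFront double-URL-encodes the user-agent, so applying Presto's url_decode twice recovers a readable string:

SELECT url_decode(url_decode(useragent)) AS user_agent,
       count(*) AS requests
FROM DBNAME.TABLENAME
WHERE useragent IS NOT NULL
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10;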
Solution 6:[6]
This one worked for me. I started from an existing example, but I had to add the "protocol" column.
CREATE EXTERNAL TABLE IF NOT EXISTS default.cloudfront_logs (
`date` DATE,
time STRING,
location STRING,
bytes BIGINT,
request_ip STRING,
method STRING,
host STRING,
uri STRING,
status INT,
referrer STRING,
user_agent STRING,
query_string STRING,
cookie STRING,
result_type STRING,
request_id STRING,
host_header STRING,
request_protocol STRING,
request_bytes BIGINT,
time_taken FLOAT,
xforwarded_for STRING,
ssl_protocol STRING,
ssl_cipher STRING,
response_result_type STRING,
http_version STRING,
fle_status STRING,
fle_encrypted_fields INT,
protocol string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://bucketname/prefix/'
TBLPROPERTIES ( 'skip.header.line.count'='2' )
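With this table in place, the result_type column makes a rough cache-hit breakdown easy (a sketch using the table name from the DDL above):

SELECT result_type,
       count(*) AS requests,
       round(100.0 * count(*) / sum(count(*)) OVER (), 1) AS pct
FROM default.cloudfront_logs
GROUP BY result_type
ORDER BY requests DESC;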
Solution 7:[7]
I would like to share a strange behavior I experienced with Athena on a fresh AWS account regarding this issue.
Neither the example from the AWS reference docs nor the regex-based methods worked well for me, so I tried the Athena console wizard again (Create -> Create a table from data source -> S3 bucket data), and surprisingly it worked.
I checked the wizard-generated table's DDL (SHOW CREATE TABLE table_name;) and the only noticeable difference was TBLPROPERTIES.has_encrypted_data set to false.
At first I assumed has_encrypted_data might default to true, since the bucket I was querying had no encryption settings, but after testing multiple times with the property set to both true and false, both variants worked fine.
Then, still confused, I removed the property entirely (using the AWS reference docs' examples again) and, guess what? It worked.
So I am assuming this issue might be related to a bug on AWS's end.
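For anyone who wants to reproduce this, the property can be inspected and toggled without recreating the table (a sketch; cloudfront_logs is a placeholder table name):

-- Show the wizard-generated DDL, TBLPROPERTIES included
SHOW CREATE TABLE cloudfront_logs;

-- Set or change the property on the existing table;
-- in my tests both 'true' and 'false' behaved the same
ALTER TABLE cloudfront_logs SET TBLPROPERTIES ('has_encrypted_data' = 'false');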
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Eran Medan |
| Solution 2 | |
| Solution 3 | Gregor Melhorn |
| Solution 4 | Jens Roland |
| Solution 5 | |
| Solution 6 | tom |
| Solution 7 | hoonoh |

