AWK remove query params from URL

I have an access.log file with more than 1 million lines. An example line:

113.10.154.38 - - [27/May/2016:03:36:26 +0200] "POST /index.php?option=com_jce&task=plugin&plugin=imgmanager&file=imgmanager&method=form&cid=20&6bc427c8a7981f4fe1f5ac65c1246b5f=cf6dd3cf1923c950586d0dd595c8e20b HTTP/1.1" 200 22 "-" "BOT/0.1 (BOT for JCE)" "-"

I need to parse the log lines to count the 10 most common URLs, but I need to remove the query params from each URL first. For the version that keeps the query params, I wrote this code:

awk '{print $7}' test.log | sort | uniq -c | sort -rn | \
head | awk '{print NR,"\b. URL:", $2,"\n   Requests:", $1}'

But I don't know how to remove the query params so that the top 10 counts URLs without their params and gives a clean ranking of requests.



Solution 1:[1]

Use the sub() function to remove a pattern from a string.

Do this at the point where you extract the field, before sorting and counting unique values, so that URLs that differ only in their query string are counted as the same URL.

awk '{sub(/\?.*/, "", $7); print $7}' test.log | sort | uniq -c | sort -rn | ...
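For completeness, here is a minimal sketch of the whole pipeline with the sub() call in place. The function name top_urls is hypothetical, and it assumes the request URL is field 7, as in the sample log line from the question. It uses printf for the final formatting instead of the "\b" trick from the question, which is a substitution on my part rather than something the original code does.

```shell
# Hypothetical helper: print the 10 most requested URLs from an access log,
# ignoring query strings. Assumes the URL is whitespace-separated field 7.
top_urls() {
    # Strip everything from the first "?" onward, then print the bare path.
    awk '{ sub(/\?.*/, "", $7); print $7 }' "$1" |
        # Count identical paths and order them by descending count.
        sort | uniq -c | sort -rn | head -n 10 |
        # uniq -c emits "count path"; print rank, path, and count.
        awk '{ printf "%d. URL: %s\n   Requests: %d\n", NR, $2, $1 }'
}
```

Call it as top_urls test.log. URLs with the same request count may come out in either order, since the final sort only ranks by count.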

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1