How to deduplicate GCP logs from Logs Explorer?

I am using GCP Logs Explorer to view logging messages from my pipeline. I need to debug an issue by looking at logs from a specific event. The error message is identical across entries except for an event ID at the end.

So for example, the error message is

event ID does not exist: foo

I know that I can use the following syntax to construct a query that will return the logs with this particular message structure

resource.type="some_resource"
resource.labels.project_id="some_project"
resource.labels.job_id="some_id"
severity=WARNING
jsonPayload.message:"Event ID does not exist:"

The last line of that query returns every log entry whose message contains that string.

I end up with a result like this

Event ID does not exist: 1A
Event ID does not exist: 2A
Event ID does not exist: 2A
Event ID does not exist: 3A

So I wish to deduplicate that and end up with only

Event ID does not exist: 1A
Event ID does not exist: 2A
Event ID does not exist: 3A

But I don't see support for this type of deduplication in the query language docs.

Due to the number of rows, I also cannot download a delimited log file. Is it possible to deduplicate the rows?



Solution 1:

To deduplicate records with BigQuery, follow these steps:

  • Identify whether your dataset contains duplicates.
  • Create a SELECT query that groups the rows on the column(s) that define a duplicate using a GROUP BY clause.
  • Materialize the result to a new table using CREATE OR REPLACE TABLE [tablename] AS [SELECT STATEMENT].

You can review the full tutorial in this link.
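As a hedged sketch of those steps using the bq command-line tool, assuming the duplicated rows live in a table named my_dataset.raw_logs with a message column (both names are placeholders):

# Step 1: check whether the table actually contains duplicates.
bq query --use_legacy_sql=false '
  SELECT message, COUNT(*) AS occurrences
  FROM my_dataset.raw_logs
  GROUP BY message
  HAVING COUNT(*) > 1'

# Steps 2 and 3: group on the column that defines a duplicate and
# materialize the result into a new table.
bq query --use_legacy_sql=false '
  CREATE OR REPLACE TABLE my_dataset.deduped_logs AS
  SELECT message
  FROM my_dataset.raw_logs
  GROUP BY message'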

To analyze a large volume of logs, you can route them to BigQuery, for example by collecting and loading them with Fluentd.

Fluentd has an output plugin that can use BigQuery as a destination for storing the collected logs. Using the plugin, you can directly load logs into BigQuery in near real time from many servers.

In this link, you can find a complete tutorial on how to Analyze logs using Fluentd and BigQuery.
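As a rough, hedged sketch, the plugin in question is fluent-plugin-bigquery; a minimal setup might look like the following, where the tag pattern, project, dataset, table, and configuration path are all assumptions to adapt:

# Install the BigQuery output plugin (use td-agent-gem instead if you run td-agent).
gem install fluent-plugin-bigquery

# Append a minimal match section to the Fluentd configuration.
# The tag pattern, project, dataset, and table below are placeholders.
cat >> /etc/fluent/fluent.conf <<'EOF'
<match pipeline.**>
  @type bigquery_insert
  auth_method application_default
  project some_project
  dataset pipeline_logs
  table warnings
  fetch_schema true
</match>
EOF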

To route your logs to BigQuery, you first need to create a sink with BigQuery as its destination.

Sinks control how Cloud Logging routes logs. Using sinks, you can route some or all of your logs to supported destinations.

Sinks belong to a given Google Cloud resource: Cloud projects, billing accounts, folders, and organizations. When the resource receives a log entry, it routes the log entry according to the sinks contained by that resource. The log entry is sent to the destination associated with each matching sink.

You can route log entries from Cloud Logging to BigQuery using sinks. When you create a sink, you define a BigQuery dataset as the destination. Logging sends log entries that match the sink's rules to partitioned tables that are created for you in that BigQuery dataset.
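As a hedged command-line alternative to the console steps that follow, the same kind of sink can be created with gcloud; the sink name, PROJECT_ID, DATASET_ID, and the filter (based on the question) are placeholders:

# Create a sink that routes matching log entries to a BigQuery dataset.
# The sink name, PROJECT_ID, and DATASET_ID are placeholders.
gcloud logging sinks create my-warning-sink \
  bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID \
  --log-filter='resource.type="some_resource" AND severity=WARNING AND jsonPayload.message:"Event ID does not exist:"'

The sink's writer identity then needs write access on the destination dataset; a sketch of granting that follows after the console steps.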

1) In the Cloud console, go to the Logs Router page.

2) Select an existing Cloud project.

3) Select Create sink.

4) In the Sink details panel, enter the following details:

Sink name: Provide an identifier for the sink; note that after you create the sink, you can't rename the sink but you can delete it and create a new sink.

Sink description (optional): Describe the purpose or use case for the sink.

5) In the Sink destination panel, select the sink service and destination:

Select sink service: Select the service where you want your logs routed. Based on the service that you select, you can select from the following destinations:

BigQuery table: Select or create the particular dataset to receive the routed logs. You also have the option to use partitioned tables.

For example, if your sink destination is a BigQuery dataset, the sink destination would be the following:

bigquery.googleapis.com/projects/PROJECT_ID/datasets/DATASET_ID

Note that if you are routing logs between Cloud projects, you still need the appropriate destination permissions (a command-line sketch of granting them follows after these steps).

6) In the Choose logs to include in sink panel, do the following:

In the Build inclusion filter field, enter a filter expression that matches the log entries you want to include. If you don't set a filter, all logs from your selected resource are routed to the destination.

To verify you entered the correct filter, select Preview logs. This opens the Logs Explorer in a new tab with the filter prepopulated.

7) (Optional) In the Choose logs to filter out of sink panel, do the following:

In the Exclusion filter name field, enter a name.

In the Build an exclusion filter field, enter a filter expression that matches the log entries you want to exclude. You can also use the sample function to select a portion of the log entries to exclude. You can create up to 50 exclusion filters per sink. Note that the length of a filter can't exceed 20,000 characters.

8) Select Create sink.
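Regarding the destination permissions mentioned in step 5, here is a hedged sketch of granting the sink's writer identity access; the sink name and PROJECT_ID are placeholders, and the member value must be the one that describe actually returns:

# Look up the sink's writer identity (a service account managed by Cloud Logging).
gcloud logging sinks describe my-warning-sink --format='value(writerIdentity)'

# Grant that identity permission to write into BigQuery in the destination project.
# The member below is a placeholder; use the writerIdentity value returned above.
# A dataset-level grant in the BigQuery UI is a tighter alternative to a project-level grant.
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member='serviceAccount:service-123456789@gcp-sa-logging.iam.gserviceaccount.com' \
  --role='roles/bigquery.dataEditor'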

More information about Configuring and managing sinks is available here.

To review the details, formatting, and rules that apply when routing log entries from Cloud Logging to BigQuery, please follow this link.
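Once entries start arriving in the dataset, the deduplication asked about in the question reduces to a single query. The dataset and table names below are placeholders (Logging names the exported tables after the log that produced the entries), and the jsonPayload.message path assumes the export schema mirrors the original payload:

# Dataset and table names are placeholders; check the dataset for the actual table name.
bq query --use_legacy_sql=false '
  SELECT DISTINCT jsonPayload.message
  FROM my_dataset.pipeline_warnings
  WHERE severity = "WARNING"
    AND jsonPayload.message LIKE "Event ID does not exist:%"'

This returns each distinct message once, i.e. one row per event ID.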
