What is tpch data and how can I create it with Kafka?

I am trying to get Trino/Presto to work with Kafka. Both the Trino and Presto tutorials talk about loading topics "with tpch data", and they illustrate this with a client called kafka-tpch. There is no description of kafka-tpch on its GitHub page or on Maven. However, the tutorial shows the parameters --prefix tpch. --tpch-type tiny. Does anyone know what these mean?

I am using Trino's Helm charts, and I think I was able to register Kafka as a catalog with

 "kafka": | 
          connector.name=kafka
          kafka.table-names=newtopic
          kafka.nodes=0.kafka.io:31120,1.kafka.io:31120
          kafka.default-schema=public
          kafka.hide-internal-columns=false

As a result, I can run SHOW TABLES, which gives me [['newtopic']]. I can also run DESCRIBE newtopic, which gives me [['_partition_id', 'bigint', '', 'Partition Id'], ['_partition_offset', 'bigint', '', 'Offset for the message within the partition'], ['_message_corrupt', 'boolean', '', 'Message data is corrupt'], ['_message', 'varchar', '', 'Message text'], ['_headers', 'map(varchar, array(varbinary))', '', 'Headers of the message as map'], ['_message_length', 'bigint', '', 'Total number of message bytes'], ['_key_corrupt', 'boolean', '', 'Key data is corrupt'], ['_key', 'varchar', '', 'Key text'], ['_key_length', 'bigint', '', 'Total number of key bytes'], ['_timestamp', 'timestamp(3)', '', 'Message timestamp']]

That's as far as I get. SELECT _message FROM newtopic LIMIT 10 gives TrinoExternalError(type=EXTERNAL, name=KAFKA_SPLIT_ERROR, message="Cannot list splits for table 'newtopic' reading topic 'newtopic'", query_id=20220219_110406_00002_9jm62)

My test data for newtopic is pretty simple and it was produced with the confluent-kafka python client and the following snippet:

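# p is a confluent_kafka Producer and kafka_delivery_report is the
# delivery callback; both are defined elsewhere in the script.
for e in 'abcdefghijklmnopqrstuvwxyz':
    data = {'test-string': e}
    p.poll(0)  # serve delivery callbacks from earlier produce() calls
    p.produce('newtopic', json.dumps(data).encode('utf-8'), callback=kafka_delivery_report)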


Solution 1:[1]

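TPC-H is a decision-support benchmark specification published by the Transaction Processing Performance Council. "tpch data" is the sample dataset that the specification defines, made up of tables such as customer, orders and lineitem.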

It's not required for using Kafka or Presto/Trino. As you can see in the source code, kafka-tpch just generates a bunch of data and writes it to Kafka: https://github.com/hgschmie/kafka-tpch/blob/master/src/main/java/de/softwareforge/kafka/LoadCommand.java#L131

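The --prefix flag is a string literal that the kafka-tpch utility prepends to the Kafka topic names it writes to, so --prefix tpch. gives topics such as tpch.customer and tpch.orders. As far as I can tell, --tpch-type tiny selects how much data is generated, with tiny being the smallest dataset. Like I said though, none of this is required; the documentation is just showing one way to get a bunch of data into Kafka at once.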
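If you would rather stay with your Python client, here is a minimal sketch of the same idea: prefix the table name to get the topic name, then produce one JSON message per row. This is not the kafka-tpch code. The helper names (topic_name, encode_row, load_rows) are made up for illustration, JSON values are assumed because that is what your own snippet sends, and the producer argument is only assumed to behave like confluent_kafka.Producer (poll, produce, flush).

```python
import json

# The eight table names defined by the TPC-H specification.
TPCH_TABLES = ["customer", "lineitem", "nation", "orders",
               "part", "partsupp", "region", "supplier"]


def topic_name(prefix, table):
    """Prepend the prefix verbatim, e.g. 'tpch.' + 'orders' gives 'tpch.orders'."""
    return prefix + table


def encode_row(row):
    """Serialize one row (a dict) as a UTF-8 encoded JSON message value."""
    return json.dumps(row).encode("utf-8")


def load_rows(producer, prefix, table, rows):
    """Queue every row on the prefixed topic and return how many were sent.

    `producer` is assumed to offer poll(), produce() and flush() like
    confluent_kafka.Producer; this helper is hypothetical, not part of
    kafka-tpch or Trino.
    """
    topic = topic_name(prefix, table)
    count = 0
    for row in rows:
        producer.poll(0)  # serve any pending delivery callbacks
        producer.produce(topic, encode_row(row))
        count += 1
    producer.flush()  # wait until queued messages are delivered
    return count
```

With the real client you would pass something like Producer({"bootstrap.servers": "0.kafka.io:31120"}) as the producer, along with whatever rows you want to load. The topic names you end up with are what you would then list in kafka.table-names.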

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
