What is tpch data and how can I create it with Kafka?
I am trying to get Trino/Presto to work with Kafka. Both the Trino and Presto tutorials talk about loading topics "with tpch data", and they illustrate this with a client called kafka-tpch. There is no description of kafka-tpch on its GitHub page or on Maven. However, the tutorial shows the parameters --prefix tpch. --tpch-type tiny. Does anyone know what these mean?
I am using Trino's helm charts and I think I'm able to register Kafka as a catalog with
"kafka": |
  connector.name=kafka
  kafka.table-names=newtopic
  kafka.nodes=0.kafka.io:31120,1.kafka.io:31120
  kafka.default-schema=public
  kafka.hide-internal-columns=false
Consequently, I can do SHOW tables which gives me [['newtopic']]. I can also do DESCRIBE newtopic which gives me [['_partition_id', 'bigint', '', 'Partition Id'], ['_partition_offset', 'bigint', '', 'Offset for the message within the partition'], ['_message_corrupt', 'boolean', '', 'Message data is corrupt'], ['_message', 'varchar', '', 'Message text'], ['_headers', 'map(varchar, array(varbinary))', '', 'Headers of the message as map'], ['_message_length', 'bigint', '', 'Total number of message bytes'], ['_key_corrupt', 'boolean', '', 'Key data is corrupt'], ['_key', 'varchar', '', 'Key text'], ['_key_length', 'bigint', '', 'Total number of key bytes'], ['_timestamp', 'timestamp(3)', '', 'Message timestamp']]
That's as far as I get. SELECT _message FROM newtopic LIMIT 10 gives TrinoExternalError(type=EXTERNAL, name=KAFKA_SPLIT_ERROR, message="Cannot list splits for table 'newtopic' reading topic 'newtopic'", query_id=20220219_110406_00002_9jm62)
My test data for newtopic is pretty simple and it was produced with the confluent-kafka python client and the following snippet:
for e in 'abcdefghijklmnopqrstuvwxyz':
    data = {'test-string' : e}
    p.poll(0)
    p.produce('newtopic', json.dumps(data).encode('utf-8'), callback=kafka_delivery_report)
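For reference, a self-contained version of that snippet might look like the sketch below. The names `delivery_report`, `produce_letters`, and `main` are illustrative and not from the original code. The `bootstrap.servers` value is an assumption copied from the catalog config above. The helper takes any object with `poll`, `produce`, and `flush` methods, so it can be tried without a broker.

```python
import json
import string


def delivery_report(err, msg):
    """Delivery callback in the (err, msg) form that confluent-kafka invokes."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] @ {msg.offset()}")


def produce_letters(producer, topic="newtopic"):
    """Send one JSON message per lowercase letter and return the payloads sent."""
    payloads = []
    for letter in string.ascii_lowercase:
        payload = json.dumps({"test-string": letter}).encode("utf-8")
        producer.poll(0)  # serve delivery callbacks from earlier sends
        producer.produce(topic, payload, callback=delivery_report)
        payloads.append(payload)
    producer.flush()  # block until all queued messages are delivered
    return payloads


def main():
    # Imported here so the helpers above can be exercised without the library.
    from confluent_kafka import Producer

    # Assumption: broker addresses taken from the catalog config in the question.
    producer = Producer({"bootstrap.servers": "0.kafka.io:31120,1.kafka.io:31120"})
    produce_letters(producer)
```

Calling `main()` sends the 26 messages to `newtopic`. The final `flush()` matters because `produce()` only queues messages, so a script that exits right after the loop may not have delivered them all.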
Solution 1:[1]
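TPC-H is a benchmarking specification: a decision-support benchmark from the Transaction Processing Performance Council that defines a fixed set of tables (customer, orders, lineitem, and so on) along with queries over them.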
It's not required for using Kafka or Presto/Trino. As you can see in the source code, the utility just produces a bunch of data: https://github.com/hgschmie/kafka-tpch/blob/master/src/main/java/de/softwareforge/kafka/LoadCommand.java#L131
The --prefix flag shown is a string literal that is prepended to the Kafka topic names when using the kafka-tpch utility. Like I said though, that's not required; the documentation is just showing one way to get a bunch of data into Kafka at once.
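To make the effect of the prefix concrete, here is a small sketch of how topic names could be derived from it. It assumes the utility writes one topic per TPC-H table and names it prefix plus table name. The helpers `topic_names` and `table_names_property` are hypothetical and are not part of kafka-tpch or Trino.

```python
# The eight tables defined by the TPC-H specification.
TPCH_TABLES = [
    "customer", "lineitem", "nation", "orders",
    "part", "partsupp", "region", "supplier",
]


def topic_names(prefix):
    """Prepend the prefix to every TPC-H table name (assumed naming scheme)."""
    return [prefix + table for table in TPCH_TABLES]


def table_names_property(prefix):
    """Build a kafka.table-names line for the catalog properties."""
    return "kafka.table-names=" + ",".join(topic_names(prefix))


print(table_names_property("tpch."))
```

With `--prefix tpch.` the trailing dot is part of the prefix, so the topics end up named like `tpch.customer`. The same names would then need to be listed in `kafka.table-names` for the connector to expose them as tables.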
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
