Parquet Read for Column Comparison in C#?
I have a Parquet file with 20 million entries and 3 columns:
file-path
create-date
mod-date
For example:
file-path              create-date  mod-date
/bad                   1649890805   1649890805
/bad/10/1/one.json     1649890806   1649890806
/good/4/32/two.json    1649890805   1649890805
/good/5/0/three.json   1649890812   1649890813
I do not want any file-path that begins with "/bad"; any other prefix is fine.
I need every file-path and mod-date pair where mod-date is greater than create-date. If retrieving every file path is too slow, a count would be enough. In the sample above, only "/good/5/0/three.json" with mod-date 1649890813 should be returned.
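To make the requirement concrete, here is a minimal sketch of the filter over plain in-memory arrays, independent of any Parquet library. The class name `ModifiedFilter`, the method `Select`, and the array parameters are hypothetical. It also assumes the two date columns come back as non-null 64-bit integers and the paths as non-null strings, which may not match the actual file schema.

```csharp
using System;
using System.Collections.Generic;

public static class ModifiedFilter
{
    // Returns (path, modDate) for rows whose path does not start with "/bad"
    // and whose mod-date is strictly greater than its create-date.
    // Assumption: the three arrays are parallel (same length, same row order).
    public static List<(string Path, long ModDate)> Select(
        string[] paths, long[] createDates, long[] modDates)
    {
        var hits = new List<(string Path, long ModDate)>();
        for (int i = 0; i < paths.Length; i++)
        {
            // Ordinal comparison avoids culture-specific string rules.
            if (paths[i].StartsWith("/bad", StringComparison.Ordinal))
                continue;

            if (modDates[i] > createDates[i])
                hits.Add((paths[i], modDates[i]));
        }
        return hits;
    }
}
```

Run against the four sample rows, this should keep only "/good/5/0/three.json" with 1649890813. If only a count is needed, the list can be replaced with a simple counter, which avoids storing the matching paths.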
I went through the code at https://github.com/elastacloud/parquet-dotnet/blob/master/src/Parquet.Test/Reader/ParquetCsvComparison.cs and the other unit tests to understand how to do a read, but I still don't understand enough to do what I want. This is what I have so far, and I am stuck:
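// Assumes the synchronous Parquet.Net 3.x API.
using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;

using (var fileStream = File.OpenRead(ParquetFilePath))
{
    using (var prr = new ParquetReader(fileStream, new ParquetOptions { TreatByteArrayAsString = true }))
    {
        // The data fields (file-path, create-date, mod-date) come from the file schema.
        DataField[] dataFields = prr.Schema.GetDataFields();
        for (var prC = 0; prC < prr.RowGroupCount; prC++)
        {
            var lg = prr.ReadEntireRowGroup(prC); // debug viewing
            using (ParquetRowGroupReader grr = prr.OpenRowGroupReader(prC))
            {
                // Read every column of this row group, in schema order.
                DataColumn[] columns = dataFields.Select(grr.ReadColumn).ToArray();
                // How to compare all column data from columns 1 and 2?
            }
        }
    }
}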
Thanks.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow