R {arrow}: read out data.frame is "identical" to original but generates different hash
I want to ensure that a data.frame obtained from a write (to format x) -> read round trip is always the same as the original, regardless of the intermediary format used.
I'm trying different tests, including hashing each column and hashing the data.frame objects themselves, then comparing against the original. This works out of the box for CSV files. With Parquet files, however, the objects do not produce the same hash, even though all the columns do.
Code to reproduce:
library(arrow)
library(digest)  # provides digest(), used for the hash comparisons below
df <- data.frame(foo = c(1,2,3), bar=c(4.5, 4.6, 4.7))
write.csv(df, "test_csv.csv", row.names=FALSE)
write_parquet(df, "test_parquet.parquet")
df_csv <- read.csv("test_csv.csv", colClasses=sapply(df, class))
df_parquet <- read_parquet("test_parquet.parquet")
# size is the same
print(object.size(df) == object.size(df_csv))
# [1] TRUE
print(object.size(df) == object.size(df_parquet))
# [1] TRUE
# objects are 'identical'
print(identical(df, df_csv))
# [1] TRUE
print(identical(df, df_parquet))
# [1] TRUE
# all columns generate same hash
print(all(sapply(df, digest)==sapply(df_csv, digest)))
# [1] TRUE
print(all(sapply(df, digest)==sapply(df_parquet, digest)))
# [1] TRUE
# objects generate the same hash
print(digest(df)==digest(df_csv))
# [1] TRUE
print(digest(df)==digest(df_parquet))
# [1] FALSE
I assume this has to do with some metadata being different, but I can't find out what. This does not seem to be an issue with the names or column classes:
# it's not the names or element classes
print(all(colnames(df_csv) == colnames(df_parquet)))
# [1] TRUE
print(all(row.names(df_csv) == row.names(df_parquet)))
# [1] TRUE
print(identical(lapply(df_csv,class), lapply(df_parquet,class)))
# [1] TRUE
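A minimal base-R sketch of one way such a mismatch can arise. This is an assumption to test, not a confirmed diagnosis of what read_parquet does. By default digest() hashes the serialized object, and serialization records attributes in their stored order and internal form, whereas identical() by default compares attributes as a set. The df_copy object below is a hypothetical stand-in for a round-tripped data.frame, built by reassigning the same attributes in reverse order; it is not produced by arrow.

```r
# Base R only; df_copy is a hypothetical stand-in, not output from arrow.
df <- data.frame(foo = c(1, 2, 3), bar = c(4.5, 4.6, 4.7))

# Same attributes (names, class, row.names), reassigned in reverse order.
df_copy <- df
attributes(df_copy) <- rev(attributes(df))

# Default comparison treats attributes as an unordered set.
same_default <- identical(df, df_copy)

# Order-sensitive comparison of the attribute lists.
same_ordered <- identical(df, df_copy, attrib.as.set = FALSE)

# Serialized bytes are what digest() hashes by default.
same_bytes <- identical(serialize(df, NULL), serialize(df_copy, NULL))

print(c(default = same_default, ordered = same_ordered, bytes = same_bytes))

# Two places worth inspecting when the hashes differ:
print(names(attributes(df)))       # attribute order of the original
print(names(attributes(df_copy)))  # attribute order of the copy
print(.row_names_info(df))         # internal row-name representation
print(.row_names_info(df_copy))
```

Running the same checks on df and df_parquet, and comparing names(attributes(...)) and .row_names_info(...) for both, should show whether attribute order or the internal row-name form is what changes the hash.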
Any help would be greatly appreciated.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
