Convert a character column in a data.table to a bigz integer column
I am working with a data.table that has been read in from a .txt file with fread. The data.table contains several integer columns as well as a column of very large integers that I intend to store as bigz. However, fread can only read such large integers as character if I want to keep all of the digits (and I do).
#Something to the effect of (run not needed):
#fread(file = "FILENAME.txt", header=TRUE, colClasses = c(rep("integer", 10), "character"), data.table = TRUE)
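A minimal, self-contained sketch of that read step, assuming a small tab-separated file with hypothetical column names id and big (not the real dataset). It only shows that forcing the column to character keeps every digit:

```r
library(data.table)

# Write a tiny illustrative file; the column names and values are made up.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id\tbig",
             "1\t1208925819614629174706176",
             "2\t1208925819614629174706177"), tmp)

# Force the large-integer column to character so no digits are lost.
dt <- fread(tmp, header = TRUE,
            colClasses = c(id = "integer", big = "character"))

str(dt)
```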
Additionally, I am working with a fairly large dataset. My primary problem is converting a character column in a data.table to a bigz column without creating a new object.
Here's a toy example which demonstrates my issue. First, I know that data.tables can have bigz columns, IF they are introduced in a new object.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa) #The same number in character form
(good = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(exa, 3)))
str(good) #Notice "bigs" is type bigz (and raw?)
However, if a character column is to be converted to a bigz column on the fly, an error results. The syntax in these conversion methods "works" w.r.t. the numeric nums column if as.bigz is replaced with as.character.
(bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3)))
str(bad)
#Method 1
bad[,bigs:=as.bigz(bigs)]
#Method 2 (re-create data.table first)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
set(bad, j="bigs", value = as.bigz(bad$bigs))
The warnings are shown below. It appears that the issue stems from bigz integers being stored as raw, although I am not sure where '64' is coming from - exa has 25 digits.
Warning messages:
1: In `[.data.table`(bad, , `:=`(bigs, as.bigz(bigs))) :
Supplied 64 items to be assigned to 3 items of column 'bigs' (61 unused)
2: In `[.data.table`(bad, , `:=`(bigs, as.bigz(bigs))) :
Coerced 'raw' RHS to 'character' to match the column's type. Either change the target column ['bigs'] to 'raw' first (by creating a new 'raw' vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
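One possible reading of the '64', offered as an assumption about gmp's internal layout rather than documented behaviour: a bigz vector is a classed raw vector, so data.table sees the raw bytes instead of the three numbers. If the layout is a 4-byte element count followed, for each number, by a 4-byte size, a 4-byte sign and the magnitude in 4-byte words (2^80 needs 81 bits, so 3 words), the total is 4 + 3 * (4 + 4 + 12) = 64. The sketch below only compares the two lengths; it does not assert the byte count, which may differ between gmp versions.

```r
library(gmp)

# Three copies of 2^80, built from the character form as in the question.
x <- as.bigz(rep("1208925819614629174706176", 3))

n_values  <- length(x)           # length as reported by gmp's bigz method
n_storage <- length(unclass(x))  # length of the underlying storage, class stripped

print(typeof(x))
print(c(values = n_values, storage = n_storage))
```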
I have a work-around for now, but it requires creating a new object (and deleting the old one).
(bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3)))
meh = data.table(as.data.frame(bad)[,-3], bigs = as.bigz(bad$bigs))
rm(bad)
str(meh)
identical(good, meh) #Well, at least this works
I think this situation could be resolved if:
- fread could read in bigz integers, or
- there is a way to change the column type without creating a new object.
Admittedly, I am a data.table novice. Thanks in advance!
Solution 1:[1]
These bigz numbers seem to be a pain to work with. Additionally, it seems they cannot be held as the only column in a data.table.
The only workaround I can find is to declare a new data.table, which is what you have already done, only it can be done more succinctly by reassigning to the same name instead of introducing a second variable.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
bad = data.table(bad,bigsN = as.bigz(bad$bigs))
str(bad)
However, these columns cannot be manipulated inside the data.table without the same problems.
bad$bigsN = bad$bigsN*2
## Error in `[<-.data.table`(x, j = name, value = value) :
## Unsupported type 'raw'
## In addition: Warning message:
## In `[<-.data.table`(x, j = name, value = value) :
## Supplied 64 items to be assigned to 3 items of column 'bigsN' (61 unused)
The best solution I can think of is simply to keep these objects as vectors separate from your data.table.
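A minimal sketch of that separate-vector approach, assuming the vector is kept in the same row order as the table. The names bigs_vec, doubled and idx are illustrative only:

```r
library(gmp)
library(data.table)

cha <- as.character(as.bigz(2)^80)
dt  <- data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))

# Pull the big integers out into a standalone bigz vector, in row order.
bigs_vec <- as.bigz(dt$bigs)

# Drop the character column by reference.
dt[, bigs := NULL]

# Vectorised arithmetic works on the standalone vector.
doubled <- bigs_vec * 2

# Row numbers selected in the table can index the vector.
idx <- dt[nums >= 2, which = TRUE]
selected <- doubled[idx]
```

The obvious cost is that any reordering or filtering of the table has to be mirrored on the vector by hand, for example through row numbers as above.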
as.list
Another solution would be to embed the bigz values in a list.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
bad = bad[,bigs := as.list(as.bigz(bad$bigs))]
This gives R a better handle on the location of elements, and is more memory efficient at the creation stage. The downside is that each element is a length-1 bigz vector and as such holds 4 redundant bytes of data per element. It also still cannot be used for arithmetic in a vectorised fashion.
bad$bigs = bad$bigs * 2
## Error in bad$bigs * 2 : non-numeric argument to binary operator
bad$bigs[[2]] = bad$bigs[[2]] * 2
bad$bigs
## [[1]]
## Big Integer ('bigz') :
## [1] 1208925819614629174706176
##
## [[2]]
## Big Integer ('bigz') :
## [1] 2417851639229258349412352
##
## [[3]]
## Big Integer ('bigz') :
## [1] 1208925819614629174706176
In fact, it would seem very little can be done with it in a vectorised fashion, including sorting or even converting it back into a bigz vector.
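A rough sketch of what can still be done with a list of length-1 bigz values: element-wise arithmetic with lapply, and a round trip back to a bigz vector through the character form. It works on a plain list built by hand (the values 5 and 42 are made up for illustration) and is a workaround under those assumptions, not a vectorised solution:

```r
library(gmp)

v <- as.bigz(c("1208925819614629174706176", "5", "42"))

# Build a list of length-1 bigz values, one per element.
lst <- lapply(seq_len(length(v)), function(i) v[i])

# Arithmetic has to be applied element by element.
doubled <- lapply(lst, function(x) x * 2)

# Round trip: list -> character vector -> bigz vector.
back <- as.bigz(vapply(doubled, as.character, character(1)))

print(back)
```

Going through character costs a conversion per element, so this is likely only reasonable for occasional use.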
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 |
