Convert a character column in a data.table to bigz integer

I am working with a data.table that has been read in from a .txt file with fread. The data.table contains several integer columns as well as a column of very large integers that I intend to store as bigz. However, fread will only read in large integers as character if I plan on keeping all of the digits (and I do).

#Something to the effect of (run not needed):
#fread(file = "FILENAME.txt", header=TRUE, colClasses = c(rep("integer", 10), "character"), data.table = TRUE)
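To make the read step reproducible, here is a minimal sketch that writes a tiny file and reads the large-integer column as character before converting it. The temporary file, the column names id and big, and the two sample values are assumptions made up for illustration; they are not from the real dataset.

```r
library(gmp)
library(data.table)

# Hypothetical two-row file standing in for the real .txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("id,big",
             "1,1208925819614629174706176",
             "2,2417851639229258349412352"), tmp)

# Naming the columns in colClasses keeps every digit of 'big' as text
dt <- fread(tmp, header = TRUE,
            colClasses = c(id = "integer", big = "character"))
str(dt)

# The conversion itself is fine on a plain character vector
big <- as.bigz(dt$big)
unlink(tmp)
```

The difficulty described next only appears when the bigz result is assigned back into the existing table.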

Additionally, I am working with a fairly large dataset. My primary problem is converting a character column in a data.table to a bigz column without creating a new object.

Here's a toy example which demonstrates my issue. First, I know that data.tables can have bigz columns, IF they are introduced when the object is created.

library(gmp)
library(data.table)
exa = as.bigz(2)^80          #A very large number          
cha = as.character(exa)      #The same number in character form
(good = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(exa, 3)))   
str(good)                    #Notice "bigs" is type bigz (and raw?)

However, if a character column is converted to a bigz column on the fly, the assignment fails with warnings. The same syntax works on the numeric nums column if as.bigz is replaced with as.character.

(bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3)))
str(bad)
#Method 1
bad[,bigs:=as.bigz(bigs)]
#Method 2 (re-create data.table first)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
set(bad, j="bigs", value = as.bigz(bad$bigs))

The warnings are below. It appears that the issue stems from bigz integers being stored as raw, although I am not sure where '64' is coming from: exa has 25 digits.

Warning messages:
1: In `[.data.table`(bad, , `:=`(bigs, as.bigz(bigs))) :
Supplied 64 items to be assigned to 3 items of column 'bigs' (61 unused)
2: In `[.data.table`(bad, , `:=`(bigs, as.bigz(bigs))) :
Coerced 'raw' RHS to 'character' to match the column's type. Either change the target column ['bigs'] to 'raw' first (by creating a new 'raw' vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
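One possible reading of the 64 is that it is a byte count rather than a digit count. The sketch below works through the arithmetic under an assumed storage layout (a 4-byte element count, then for each number a 4-byte word count, a 4-byte sign, and the magnitude in 32-bit words). That layout is an assumption that happens to match the warning; it is not taken from the gmp documentation.

```r
# Assumed layout, for illustration only:
#   4-byte header holding the number of elements, then per element
#   4 bytes (word count) + 4 bytes (sign) + 4 bytes per 32-bit word.
bits_needed   <- 80L + 1L                    # 2^80 occupies 81 bits
words_per_num <- ceiling(bits_needed / 32)   # 3 words of 32 bits each
bytes_per_num <- 4 + 4 + 4 * words_per_num   # 20 bytes per number
n             <- 3L                          # three rows in the toy table
total_bytes   <- 4 + n * bytes_per_num
total_bytes                                  # 64

# Optional comparison against the installed gmp, if it is available.
# Only meaningful on versions that store bigz as a raw vector.
if (requireNamespace("gmp", quietly = TRUE)) {
  x <- gmp::as.bigz(rep("1208925819614629174706176", 3))
  print(length(unclass(x)))
}
```

If that reading is right, data.table is seeing the underlying raw vector (64 items) instead of three bigz values, which would explain both warnings.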

I have a work-around for now, but it requires creating a new object (and deleting the old one).

(bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3)))
meh = data.table(as.data.frame(bad)[,-3], bigs = as.bigz(bad$bigs))
rm(bad)
str(meh)
identical(good, meh)          #Well, at least this works

I think this situation could be resolved if:

  1. fread could read in bigz integers, or
  2. there is a way to change the column type without creating a new object.

Admittedly, I am a data.table novice. Thanks in advance!



Solution 1:[1]

These bigz numbers seem to be a pain to work with. Additionally, it seems they cannot be held as the only column in a data.table.

The only workaround I can find is to declare a new data.table, which is what you have already done. It can be done more succinctly, though, by reassigning to the same name instead of introducing a second one.

library(gmp)
library(data.table)

exa = as.bigz(2)^80          #A very large number          
cha = as.character(exa)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
bad = data.table(bad,bigsN = as.bigz(bad$bigs))
str(bad)

However, these columns cannot be manipulated inside the data.table without the same problems.

bad$bigsN = bad$bigsN*2
## Error in `[<-.data.table`(x, j = name, value = value) : 
##   Unsupported type 'raw'
## In addition: Warning message:
## In `[<-.data.table`(x, j = name, value = value) :
##   Supplied 64 items to be assigned to 3 items of column 'bigsN' (61 unused)

The best solution I can think of is simply to keep these objects as separate vectors alongside your data.table.
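A minimal sketch of the separate-vector approach, assuming the table keeps its row order so that position i in the vector matches row i of the table. The names dt, bigs and idx are illustrative only.

```r
library(gmp)
library(data.table)

cha <- as.character(as.bigz(2)^80)
dt  <- data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))

# Move the big integers into a standalone bigz vector, aligned by row position
bigs <- as.bigz(dt$bigs)
dt[, bigs := NULL]           # drop the character copy by reference

# Vectorised arithmetic works on the standalone vector
bigs <- bigs * 2

# Select rows in the table, then use the row numbers to index the vector
idx <- dt[nums >= 2, which = TRUE]
bigs[idx]
```

The alignment is purely positional, so anything that reorders or filters the table (setkey, setorder, subsetting) has to be applied to the vector with the same indices.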

as.list

Another solution would be to embed the bigz in a list.

library(gmp)
library(data.table)

exa = as.bigz(2)^80          #A very large number          
cha = as.character(exa)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
bad = bad[,bigs := as.list(as.bigz(bad$bigs))]

This gives R a better handle on the location of elements, and is more memory efficient at the creation stage. The downside is that each element is a length 1 bigz vector and as such holds 4 redundant bytes of data per element. It also still cannot be used for arithmetic in a vectorised fashion.

bad$bigs = bad$bigs * 2
## Error in bad$bigs * 2 : non-numeric argument to binary operator
bad$bigs[[2]] = bad$bigs[[2]] * 2
bad$bigs
## [[1]]
## Big Integer ('bigz') :
## [1] 1208925819614629174706176
## 
## [[2]]
## Big Integer ('bigz') :
## [1] 2417851639229258349412352
## 
## [[3]]
## Big Integer ('bigz') :
## [1] 1208925819614629174706176

In fact, it would seem very little can be done with it in a vectorised fashion, including sorting or even converting it back into a bigz vector.
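If the list form is used anyway, element-wise work has to go through lapply, and one way to get back to a single bigz vector is to round-trip through character. This is a sketch of that workaround, not a claim that it is efficient: every value is printed to text and parsed again. The list is built directly here instead of being taken from a data.table column.

```r
library(gmp)

# A list of length-1 bigz values, shaped like the list column above
lst <- lapply(1:3, function(i) as.bigz(2)^80)

# Arithmetic has to be applied element by element
lst <- lapply(lst, function(z) z * 2)

# Round trip: each element to character, then one as.bigz() call on the vector
chr  <- vapply(lst, as.character, character(1))
back <- as.bigz(chr)
length(back)
```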

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

[1] Solution 1: Stack Overflow