Convert a character Column in a data.table to bigz Integer - r

I am working with a data.table that has been read in from a .txt file with fread. The data.table contains some amount of integer columns as well as a column of very large integers that I intend to store as bigz. However, fread will only read in large integers as character if I plan on keeping all of the digits (and I do).
#Something to the effect of (run not needed):
#fread(file = FILENAME.txt, header=TRUE, colClasses = c(rep("integer", 10), "character"), data.table = TRUE)
Additionally, I am working with a fairly large dataset. My primary problem is converting a character column in a data.table to a bigz column without creating a new object.
Here's a toy example which demonstrates my issue. First, I know that data.tables can have bigz columns, if they are introduced in a new object.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa) #The same number in character form
(good = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(exa, 3)))
str(good) #Notice "bigs" is type bigz (and raw?)
However, if a character column is to be converted to a bigz column on the fly, an error results. The syntax in these conversion methods "works" w.r.t. the numeric nums column if as.bigz is replaced with as.character.
(bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3)))
str(bad)
#Method 1
bad[,bigs:=as.bigz(bigs)]
#Method 2 (re-create data.table first)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
set(bad, j="bigs", value = as.bigz(bad$bigs))
Error below. It appears that the issue stems from bigz integers being stored as raw, although I am not sure where '64' is coming from: exa has 25 digits.
Warning messages:
1: In `[.data.table`(bad, , `:=`(bigs, as.bigz(bigs))) :
Supplied 64 items to be assigned to 3 items of column 'bigs' (61 unused)
2: In `[.data.table`(bad, , `:=`(bigs, as.bigz(bigs))) :
Coerced 'raw' RHS to 'character' to match the column's type. Either change the target column ['bigs'] to 'raw' first (by creating a new 'raw' vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
I have a work-around for now, but it requires creating a new object (and deleting the old one).
(bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3)))
meh = data.table(as.data.frame(bad)[,-3], bigs = as.bigz(bad$bigs))
rm(bad)
str(meh)
identical(good, meh) #Well, at least this works
I think this situation could be resolved if:
fread could read in bigz integers, or
there is a way to change the column type without creating a new object.
Admittedly, I am a data.table novice. Thanks in advance!

These bigz numbers seem to be a pain to work with. Additionally, it seems they cannot be held as the only column in a data.table.
The only workaround I can find is to declare a new data.table, which is what you have already done, only it can be done more succinctly by assigning the result back to the same name instead of introducing a second object.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
bad = data.table(bad,bigsN = as.bigz(bad$bigs))
str(bad)
However, these columns cannot be manipulated inside the data.table without the same problems.
bad$bigsN = bad$bigsN*2
## Error in `[<-.data.table`(x, j = name, value = value) :
## Unsupported type 'raw'
## In addition: Warning message:
## In `[<-.data.table`(x, j = name, value = value) :
## Supplied 64 items to be assigned to 3 items of column 'bigsN' (61 unused)
The best solution I can think of is simply to keep these objects as separate vectors to your data.table.
as.list
Another solution would be to embed the bigz in a list.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
bad = bad[,bigs := as.list(as.bigz(bad$bigs))]
This gives R a better handle on the location of elements, and is more memory efficient at the creation stage. The down side is each element is a length 1 bigz vector and as such holds 4 redundant bytes of data per element. It also still cannot be used for arithmetic in a vectorised fashion.
bad$bigs = bad$bigs * 2
## Error in bad$bigs * 2 : non-numeric argument to binary operator
bad$bigs[[2]] = bad$bigs[[2]] * 2
bad$bigs
## [[1]]
## Big Integer ('bigz') :
## [1] 1208925819614629174706176
##
## [[2]]
## Big Integer ('bigz') :
## [1] 2417851639229258349412352
##
## [[3]]
## Big Integer ('bigz') :
## [1] 1208925819614629174706176
In fact, it would seem very little can be done with it in a vectorised fashion, including sorting or even converting it back into a bigz vector.


Disable partial name identification of function arguments

I am trying to make a function in R that outputs a data frame in a standard way, but that also lets the user add whatever personalized columns they deem necessary (the goal is to make a data format for paleomagnetic data, for which there is common information that everybody uses, and some more unusual information that the user might like to keep in the format).
However, I realized that if the user wants the header of their data to be a prefix of one of the defined arguments of the data formatting function (e.g. 'sheep', which is a prefix of the 'sheepc' argument, see example below), the function interprets it as the defined argument (through partial name identification, see http://adv-r.had.co.nz/Functions.html#lexical-scoping for more details).
Is there a way to prevent this, or to at least give a warning to the user saying that they cannot use this name?
PS I realize this question is similar to Disabling partial variable names in subsetting data frames, but I would like to avoid toying with the options of the future users of my function.
fun <- function(sheeta = 1, sheetb = 2, sheepc = 3, ...)
{
# I use the sheeta, sheetb and sheepc arguments for computations
# (more complex than shown below, but here they are just there to give an example)
a <- sum(sheeta, sheetb)
df1 <- data.frame(standard = rep(a, sheepc))
df2 <- as.data.frame(list(...))
if(nrow(df1) == nrow(df2)){
res <- cbind(df1, df2)
return(res)
} else {
stop("Extra elements should be of length ", sheepc)
}
}
fun(ball = rep(1,3))
#> standard ball
#> 1 3 1
#> 2 3 1
#> 3 3 1
fun(sheep = rep(1,3))
#> Error in rep(a, sheepc): argument 'times' incorrect
fun(sheet = rep(1,3))
#> Error in fun(sheet = rep(1, 3)) :
#> argument 1 matches multiple formal arguments
From the language definition:
If the formal arguments contain ‘...’ then partial matching is only
applied to arguments that precede it.
fun <- function(..., sheeta = 1, sheetb = 2, sheepc = 3)
{<your function body>}
fun(sheep = rep(1,3))
# standard sheep
#1 3 1
#2 3 1
#3 3 1
Of course, your function should have assertion checks for the non-... parameters (see help("stopifnot")). You could also consider adding a . or _ to their tags to make name collisions less likely.
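As a rough sketch of that advice, the version below puts `...` first and adds a few `stopifnot()` checks on the named parameters. The specific checks (numeric inputs, a single non-negative `sheepc`) and the early return when no extra columns are given are assumptions made for illustration, not part of the original function.

```r
# Sketch: "..." comes first, so the named parameters can only be matched exactly.
fun <- function(..., sheeta = 1, sheetb = 2, sheepc = 3) {
  # Assumed sanity checks on the non-... parameters
  stopifnot(
    is.numeric(sheeta), is.numeric(sheetb),
    is.numeric(sheepc), length(sheepc) == 1, sheepc >= 0
  )
  a <- sum(sheeta, sheetb)
  df1 <- data.frame(standard = rep(a, sheepc))
  extras <- list(...)
  # Assumption: with no extra columns, return the standard part only
  if (length(extras) == 0) return(df1)
  df2 <- as.data.frame(extras)
  if (nrow(df1) != nrow(df2)) {
    stop("Extra elements should be of length ", sheepc)
  }
  cbind(df1, df2)
}

# 'sheep' is now kept as a user column instead of being matched to 'sheepc'
res <- fun(sheep = rep(1, 3))
```

The trade-off is that `sheeta`, `sheetb` and `sheepc` can no longer be passed positionally; callers have to spell them out in full.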
Edit:
"would it be possible to achieve the same effect without having the ... at the beginning ?"
Yes, here is a quick example with one parameter:
fun <- function(sheepc = 3, ...)
{
stopifnot("partial matching detected" = identical(sys.call(), match.call()))
list(...)
}
fun(sheep = rep(1,3))
# Error in fun(sheep = rep(1, 3)) : partial matching detected
fun(ball = rep(1,3))
#$ball
#[1] 1 1 1
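Note that comparing `sys.call()` with `match.call()` also rejects positional calls such as `fun(5)`, because `match.call()` fills in the argument name. One possible variation, sketched below, compares only the names the caller actually typed against the formal argument names. The prefix test with `startsWith()` is an assumption about what should count as a partial match here, and the returned list is just a placeholder body.

```r
fun <- function(sheeta = 1, sheetb = 2, sheepc = 3, ...) {
  # Names exactly as the caller typed them ("" for positional arguments)
  supplied <- names(sys.call())[-1]
  # Formal argument names of this function, excluding "..."
  formal_names <- setdiff(names(formals(sys.function())), "...")
  # A typed name is flagged if it is a strict prefix of a formal name
  is_partial <- vapply(
    supplied,
    function(s) {
      nzchar(s) && !(s %in% formal_names) && any(startsWith(formal_names, s))
    },
    logical(1)
  )
  if (any(is_partial)) {
    stop("partial matching detected for: ",
         paste(supplied[is_partial], collapse = ", "))
  }
  # Placeholder body: return what was received
  list(sheepc = sheepc, extras = list(...))
}

# Positional and unrelated names pass; a prefix of a formal name is rejected
ok  <- fun(5, ball = rep(1, 3))
msg <- tryCatch(fun(sheep = rep(1, 3)), error = function(e) conditionMessage(e))
```

An ambiguous prefix such as `sheet` never reaches this check, since R itself stops with "matches multiple formal arguments" before the body runs.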

Converting factor to numeric, with dots, thousands(K) and millions(M) abbreviation [duplicate]

I'm trying to convert a column of money amount to numeric values. A very simplified version of my database would be:
SoccerPlayer = c("A","B","C","D","E")
Value = c("10K","25.5K","1M","1.2M","0")
database = data.frame(SoccerPlayer,Value)
I'm facing the following issues. If there were no dots, and all money amounts were in the same unit, such as only K (thousands) or only M (millions), this would work perfectly:
library(stringi)
database$Value = as.numeric(gsub("K","000",database$Value))
But since there are K and M values in my data I'm trying to write it like this:
library(stringi)
if(stri_sub(database$Value,-1,-1) == 'M'){
database$Value = gsub("M","000000",database$Value)
}
if(stri_sub(database$Value,-1,-1) == 'K'){
database$Value = gsub("K","000",database$Value)
}
as.numeric(database$Value)
Which reports the following warnings messages
Warning message:
In if (stri_sub(database$Value, -1, -1) == "M") { :
the condition has length > 1 and only the first element will be used
Warning message:
In if (stri_sub(database$Value, -1, -1) == "K") { :
the condition has length > 1 and only the first element will be used
Warning message:
NAs introduced by coercion
Looking at the data after the procedure, it looks like this:
> print(database$Value)
[1] "10000" "25.5000" "1M" "1.2M" "0"
Only the K(thousands) values were converted and I also have a problem on how to solve the dot issue like in "25.5000" or "1.2000000" (if the M conversion would have worked).
I'm new to programming and any help or thoughts on how to solve this would be much appreciated.
You can build a vector holding the multiplier for each value, 1e6 for M and 1e3 for K (I use str_detect() for this, but there are several ways to do it), use str_remove() to strip the M or K from your initial vector, and then convert Value to numeric and multiply it by the multiplier vector.
library(stringr)
Value_unity <- ifelse(str_detect(Value, 'M'), 1e6, ifelse(str_detect(Value, 'K'), 1e3, 1))
Value_new <- Value_unity * as.numeric(str_remove(Value, 'K|M'))
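The same idea can be sketched in base R, without stringr. The helper name `to_number` is hypothetical, and the sketch assumes the unit is always a single trailing K or M; anything else is passed to `as.numeric()` unchanged.

```r
# Hypothetical helper: convert strings like "25.5K" or "1.2M" to numbers
to_number <- function(x) {
  x <- as.character(x)  # also handles factor columns
  # Multiplier per element: 1e6 for a trailing M, 1e3 for a trailing K, else 1
  multiplier <- ifelse(grepl("M$", x), 1e6, ifelse(grepl("K$", x), 1e3, 1))
  # Strip the unit letter, convert the remainder, then scale
  multiplier * as.numeric(sub("[KM]$", "", x))
}

SoccerPlayer <- c("A", "B", "C", "D", "E")
Value <- c("10K", "25.5K", "1M", "1.2M", "0")
database <- data.frame(SoccerPlayer, Value)
database$Value <- to_number(database$Value)
```

Because the decimal part is parsed before scaling, "25.5K" becomes 25500 rather than the malformed "25.5000" produced by the gsub() attempt.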

Apply over xts with multiple columns

I'm having a weird error which I can not understand. Let me explain the variables and their meaning:
ts <- a xts object
range.matrix <- matrix with two columns and n rows (only known at execution time)
So range.matrix contains ranges of dates: the first column is the start of the range and the second column is the end of it. The goal is to slice the ts time series by the ranges in range.matrix to get a list with all the slices.
It fails with some ranges but not in others, and fails with 1 row matrices... The error message is:
Error in array(ans, c(len.a%/%d2, d.ans), if (!is.null(names(dn.ans))
length of 'dimnames' 1 not equal to array extent
Check yourself with this toy example (range.matrix contains numbers which are cast as.Date)
library(xts)
ts <- xts(cbind('a'= c(1,2,3,4,5,6,7,8),'b' =c(1,2,3,4,5,6,7,8),'c'= c(1,2,3,4,5,6,7,8))
,order.by = as.Date(as.Date('2017-01-01'):(as.Date('2017-01-01')+7)) )
range.matrix <- matrix(c(16314,17286), ncol = 2,byrow = TRUE) # Fails. Range: "2014-09-01/2017-04-30"
range.matrix <- matrix(c(16314,17236,16314,17286), ncol = 2,byrow = TRUE) # Fails. Range: "2014-09-01/2017-03-11" and "2014-09-01/2017-04-30"
range.matrix <- matrix(c(16314,17236,17237,17286), ncol = 2,byrow = TRUE) # does not fail. "2014-09-01/2017-03-11" and "2017-03-12/2017-04-30"
apply(range.matrix,
1,
function(r) {
ts[paste0(as.Date(r[1]), '/', as.Date(r[2]))]
})
Any clue? It has to do with dimnames, but I cannot find the solution.
Try this instead, and you won't have issues:
lapply(split(range.matrix, row(range.matrix)), function(r) {
ts[paste0(as.Date(r[1]), '/', as.Date(r[2]))]})
Personally I would not use apply on xts objects in the way you want to do it (I'd do the above; lapply is much more natural).
apply is used on arrays, and an xts object is not just a matrix (array), but also supports a time index and other attributes that give xts its power. You could use something like coredata on the xts object to just return the underlying matrix to the apply call, and then you won't get errors, but the results don't make much sense.
apply(range.matrix,
1,
function(r) {
res <- ts[paste0(as.Date(r[1]), '/', as.Date(r[2]))]
coredata(res)
})
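To see what the lapply version iterates over, the base R sketch below builds only the range strings, without any xts object. It assumes the numbers are days since 1970-01-01 and passes `origin` explicitly, since without zoo/xts loaded older versions of R require it for numeric input.

```r
range.matrix <- matrix(c(16314, 17236, 17237, 17286), ncol = 2, byrow = TRUE)

# split() by row index gives one numeric vector (start, end) per row,
# and still returns a list when the matrix has a single row
rows <- split(range.matrix, row(range.matrix))

# Build the "start/end" strings that would be used to subset the series
ranges <- lapply(rows, function(r) {
  paste0(as.Date(r[1], origin = "1970-01-01"), "/",
         as.Date(r[2], origin = "1970-01-01"))
})
```

Since lapply always returns a list and never tries to simplify its results into an array, slices of different lengths cause no dimnames problem.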

R: create a named vector containing the means of multiple numeric vectors

I have over 20 numeric vectors which consist of a series of values. each vector is distinguished by a letter, e.g. val_a, val_b, val_c etc...
I would like to put the means from each of these vectors into a single named vector. I could of course do this in a laborious manner like so:
obs <- c("val_a" = round(mean(val_a),3),
"val_b" = round(mean(val_b),3),
"val_c" = round(mean(val_c),3))
But with 20 vectors this then becomes tedious to write out, and not to mention an inelegant solution. How can I create the named vector in a more succinct way? I have made an attempt using a for loop, as so:
obs <- c(for (j in 1:20) {
assign(paste("val",letters[j], sep = "_"),
mean(as.name(paste('val',letters[j], sep = '_'))),)
})
In the second argument passed to assign, as.name is used to remove the quotation marks from the output of paste, so that argument refers to a symbol with the exact same name as the numeric vector whose mean I want, e.g. val_a. But I get the following message:
Warning messages:
1: In mean.default(as.name(paste("val", letters[j], sep = "_"))) :
argument is not numeric or logical: returning NA
Does anyone know how to accomplish this?
Solution
To build on bouncyball's comment so you have a full answer, you can do this:
sapply(paste('val', letters[1:20], sep='_'), function(x) round(mean(get(x)), 3))
Explanation
For an object in your environment called x, get("x") will return x. See help("get"). Then we can do this for every element of paste('val', letters[1:20], sep='_') using sapply(), or if you like, a loop.
Example
val_a <- rnorm(100)
val_b <- rnorm(100)
val_c <- rnorm(100)
sapply(paste('val', letters[1:3], sep='_'), function(x) round(mean(get(x)), 3))
val_a val_b val_c
-0.09328504 -0.15632654 -0.09759111
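A closely related sketch uses mget(), which fetches several objects by name at once and returns them as a named list. The fixed input values below are made up so the result is reproducible, and the sketch assumes the vectors live in the environment the call is made from.

```r
# Made-up example vectors
val_a <- c(1, 2, 4)
val_b <- c(10, 20, 30)
val_c <- c(0.5, 0.25)

vec_names <- paste("val", letters[1:3], sep = "_")

# mget() returns a list named after the objects; sapply() keeps those names
obs <- sapply(mget(vec_names), function(v) round(mean(v), 3))
```

If the vectors can be created as elements of one named list in the first place, the get()/mget() lookup becomes unnecessary and sapply() can run on that list directly.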

Vector-version / Vectorizing a for which equals loop in R

I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] which match each of the elements of X.
This is the very inefficient for loop I have below. Notice that there are many observations in X and each member of "match.ind" can have zero or more matches. Also, dat.fram has over 1 million observations. Is there any way to use a vector function in R to make this process more efficient?
Ultimately, I need a list since I will pass the list to another function that will retrieve the appropriate values from dat.fram .
Code:
match.ind=list()
for(i in 1:150000){
match.ind[[i]]=which(dat.fram[,3]==X[i])
}
UPDATE:
Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!
### define v as a sample column of data - you should define v to be
### the column in the data frame you mentioned (dat.fram[,3])
v = sample(1:150000, 1500000, rep=TRUE)
### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the rownames of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points
mybiglist = tapply(seq_along(v),v,c)
### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to
X = sample(1:200000, 150000)
mylist = mybiglist[which(names(mybiglist)%in%X)]
And that's it! As a check, let's look at the first 3 rows of mylist:
> mylist[1:3]
$`1`
[1] 401143 494448 703954 757808 1364904 1485811
$`2`
[1] 230769 332970 389601 582724 804046 997184 1080412 1169588 1310105
$`4`
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the
numbers listed against 4 are the index points in v where 4 appears:
> which(X==3)
integer(0)
> which(v==3)
[1] 102194 424873 468660 593570 713547 769309 786156 828021 870796
883932 1036943 1246745 1381907 1437148
> which(v==4)
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
Finally, it's worth noting that values that appear in X but not in v won't have an entry in the list, but this is presumably what you want anyway as they're NULL!
Extra note: You can use the code below to create an NA entry for each member of X not in v...
blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]
Fairly self-explanatory: mylist_extras is a list with all the additional list stuff you need (the names are the values of X not featuring in names(mylist), and the actual entries in the list are simply NA). The final two lines firstly merge mylist and mylist_extras, and then perform a reordering so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
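A more compact variation on the same idea, sketched here with tiny made-up data, uses split() to group the indices and then looks up each element of X by name. It assumes v and X have the same type, so that as.character() formats their values identically; values of X that are absent from v come back as NULL.

```r
# Tiny made-up data standing in for dat.fram[,3] and X
v <- c(3, 5, 3, 7, 5, 3)
X <- c(3, 7, 11)

# One pass over v: group its positions by value
index_by_value <- split(seq_along(v), v)

# Look up each element of X by name, in the order of X
match.ind <- unname(index_by_value[as.character(X)])
```

Here match.ind has one entry per element of X, in the same order as the original for loop, which makes it easy to pass on to another function.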
Cheers! :)
ORIGINAL POST BELOW... superseded by the above, obviously!
Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:
X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,rep=TRUE), b = sample(1:10,n,rep=TRUE),
c = sample(1:10,n,rep=TRUE), stringsAsFactors = FALSE)
tapply(X,X,function(x) {which(d[,3]==x)})
