I'm working with an ffdf object that has NAs in some of its columns. The NAs are the result of a left outer merge using merge.ffdf. I would like to replace the NAs with 0s, but I have not managed to do it.
Here is the code I am running:
library(ffbase)
deals <- merge(deals,rk,by.x=c("DEALID","STICHTAG"),by.y=c("ID","STICHTAG"),all.x=TRUE)
attributes(deals)
$names
[1] "virtual" "physical" "row.names"
$class
[1] "ffdf"
vmode(deals$CREDIT_R)
[1] "double"
idx <- ffwhich(deals,is.na(CREDIT_R)) # CREDIT_R is one of the columns with NAs
deals.strom[idx,"CREDIT_R"]<-0
error in `[<-.ffdf`(`*tmp*`, idx, "CREDIT_R", value = 0) :
ff/ffdf-iness of value and selected columns don't match
Any idea what I am doing wrong? In general I would like to learn more about replacing methods for class ff and ffdf. Any suggestion where I can find some examples about the topic?
The manual of package ff indicates a function called ffindexset.
idx <- is.na(deals$CREDIT_R) ## This uses is.na.ff_vector from ffbase
idx <- ffwhich(idx, idx == TRUE) ## Is part of ffbase
deals$CREDIT_R <- ffindexset(x=deals$CREDIT_R, index=idx, value=ff(0, length=length(idx), vmode = "double")) ## Is part of ff
deals$CREDIT_R[idx] <- ff(0, length=length(idx), vmode = "double") ## this one will probably also work
Also have a look at ?Extract.ff
I have two dataframes of ~150 rows of X and Y where identical(X, Y) is TRUE but identical(digest(X), digest(Y)) is FALSE. I'm looking into why this is the case.
I did look at this answer and re-ran what they tested, with similar results, but unlike their problem, the attributes for my dataframes are the same. Testing results:
> names(attributes(X))
[1] "names" "row.names" "class"
> names(attributes(Y))
[1] "names" "row.names" "class"
> digest(X)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y)
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> digest(X[1:nrow(X),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> digest(Y[1:nrow(Y),])
[1] "2f338de9972529bd2bc9c39c3c5762ea"
> identical(X, Y, attrib.as.set=FALSE)
[1] TRUE
I also saved the dataframes as .RDS files, and re-read them in.
> X_rds <- read_rds("cache_vars/X.rds")
> Y_rds <- read_rds("cache_vars/Y.rds")
> identical(X_rds , Y_rds )
[1] TRUE
> digest(X_rds)
[1] "07b7ef11ce6eaae01ddd79e4facef581"
> digest(Y_rds)
[1] "09d8abcab0af0a72265a9b690f4eacc3"
> identical(X_rds, Y_rds, attrib.as.set=FALSE)
[1] TRUE
And like the other poster, converting to matrices and back to dataframe yielded identical digests, so it's probably some structural problem.
> X_Mat <- as.matrix(X_rds)
> Y_Mat <- as.matrix(Y_rds)
> identical(digest(X_Mat), digest(Y_Mat))
[1] TRUE
> X_DF <- as.data.frame(X_Mat)
> Y_DF <- as.data.frame(Y_Mat)
> identical(digest(X_DF), digest(Y_DF))
[1] TRUE
Dataframe X was produced from a parallel-designed loop (but with the %do% flag so no actual parallelism was done) and Y was produced from a sequential loop.
The .RDS files for X and Y can be found at this link.
Update:
MrFlick has it right. As it turns out, the serialization during parallel's rbind function was also adding the gp=0x20 flag, similar to what they described occurs when writing to RDS.
When you write to RDS, the objects are serialized. The serialization contains some information in addition to the values the vectors contain. Note that if we compare the columns one by one, they produce different digests:
sapply(seq_along(X_rds), function(i)
digest::digest(X_rds[[i]])==digest::digest(Y_rds[[i]])
)
So the vectors that are being stored in the data.frame are different. We can use the internal inspect function to get some of the meta-data for the vectors
.Internal(inspect(X_rds[[1]]))
# #135305a00 14 REALSXP g0c7 [REF(4),gp=0x20] (len=150, tl=0)
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
.Internal(inspect(Y_rds[[1]]))
# #115dbfc00 14 REALSXP g0c7 [REF(29)] (len=150, tl=0)
# 1.009e+06,1.009e+06,1.009e+06,1.009e+06,1.009e+06,...
So we see they differ in the [] parts. I believe the REF() number represents the reference count to that object for memory clearing purposes. I do not believe that this number is used in the serialization. But the X_rds also has gp=0x20 set. The "gp" stands for "general purpose" bits/flags. I believe in this case it means the GROWABLE_MASK was set on that object. These values are preserved when the object is serialized which is the default behavior for digest. Thus these vectors do not have the exact same serialization due to this flag difference.
Another way to see the difference is to look at the serialization itself:
substring(rawToChar(serialize(X_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n131086\n150\n1009002\n"
substring(rawToChar(serialize(Y_rds[[1]], connection = NULL, ascii = TRUE)), 1, 45)
# [1] "A\n3\n262657\n197888\n5\nUTF-8\n14\n150\n1009002\n1009"
We have a bit of a header, then we start to see the values being output. There is one value where there is a difference: X has 131086 (0x2000E) and Y has 14 (0xE). That difference is due to the flags, which are written here in the R source code.
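To make the arithmetic concrete, here is a small base R sketch that splits the two flag words into their parts. It assumes the packing used in R's serialize.c (object type in the low 8 bits, the gp bits shifted left by 12); that layout is an implementation detail, and the helper name decode_flags is made up for illustration.

```r
# Hypothetical helper: split a serialized flag word into type and gp bits.
# Assumes the layout in R's serialize.c: type = low 8 bits, gp = bits 12 and up.
decode_flags <- function(flags) {
  list(type = bitwAnd(flags, 255L),
       gp   = bitwShiftR(flags, 12L))
}

x_flags <- decode_flags(131086L)  # flag word seen for X_rds[[1]]
y_flags <- decode_flags(14L)      # flag word seen for Y_rds[[1]]

x_flags$type  # 14, the type code shown by inspect() for REALSXP
x_flags$gp    # 32, i.e. 0x20
y_flags$gp    # 0
```

Both vectors have the same type code; only the gp part differs, which matches the gp=0x20 shown by inspect().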
When you use identical, only the values in the data.frame are compared, not the additional metadata.
If you wanted to get around this, you could write your own wrapper around digest that avoids the serialization. For example
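dfdigest <- function(x) {
  # Convert a character vector to one flat raw vector
  charsToRaw <- function(x) unlist(lapply(x, charToRaw))
  # Collect the bytes of the column names, followed by the bytes of each column
  bytes <- unlist(c(list(charsToRaw(names(x))),
    lapply(x, function(col) {
      if (typeof(col)=="double") writeBin(col, raw())
      else if (typeof(col)=="character") charsToRaw(col)
      else stop(paste("unconfigured data type:", typeof(col)))
    })))
  # Hash the raw bytes directly, skipping serialize()
  digest::digest(bytes, serialize = FALSE)
}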
dfdigest(X_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"
dfdigest(Y_rds)
# [1] "2488505e3ad1a370d030b539a287b7ca"
I need both the original and a rounded-to-the-day version of a datetime in a data.table. When I use the base round function to do this (as recommended here), I start getting errors regarding the number of items when I try to add it back into my data.table - even though the length looks right.
Example:
temp <- data.table(ID=1:3,dates_unrounded=rep(as.POSIXct(NA),3),dates_rounded=rep(as.POSIXct(NA),3))
dates_form1 <- c("2021-04-01","2021-06-30","2021-05-22")
dates_form2 <- as.POSIXct(dates_form1,format="%Y-%m-%d")
temp$dates_unrounded <- dates_form2
dates_form3 <- round(dates_form2,"days")
temp$dates_rounded <- dates_form3
length(dates_form3)
length(temp$dates_unrounded)
When run, produces:
> temp <- data.table(ID=1:3,dates_unrounded=rep(as.POSIXct(NA),3),dates_rounded=rep(as.POSIXct(NA),3))
> dates_form1 <- c("2021-04-01","2021-06-30","2021-05-22")
> dates_form2 <- as.POSIXct(dates_form1,format="%Y-%m-%d")
> temp$dates_unrounded <- dates_form2
> dates_form3 <- round(dates_form2,"days")
> temp$dates_rounded <- dates_form3
Error in set(x, j = name, value = value) :
Supplied 11 items to be assigned to 3 items of column 'dates_rounded'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
> length(dates_form3)
[1] 3
> length(temp$dates_unrounded)
[1] 3
What's going wrong and how do I fix it?
?round.POSIXt reveals that in this case, round() returns a POSIXlt object. But data.table doesn't work with those. So just do
dates_form3 <- round(dates_form2,"days")
dates_form3 <- as.POSIXct(dates_form3)
temp$dates_rounded <- dates_form3
length(dates_form3)
length(temp$dates_unrounded)
and you're fine.
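A minimal base R sketch of what is going on, using an arbitrary example timestamp. A POSIXlt object is a list of components (seconds, minutes, hours and so on), which is presumably why data.table counts more items than there are dates.

```r
# Base R only; the timestamp is an arbitrary example value.
x <- as.POSIXct("2021-04-01 13:00:00", tz = "UTC")

r_lt <- round(x, "days")     # round.POSIXt returns a POSIXlt object
class(r_lt)                  # "POSIXlt" "POSIXt"
length(unclass(r_lt))        # number of list components, not number of dates

r_ct <- as.POSIXct(r_lt)     # convert back before storing in a data.table
class(r_ct)                  # "POSIXct" "POSIXt"
```

The exact number of components can vary with the R version and time zone, so the sketch does not rely on it being 11.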
I got the error when I wanted to set the first column as the row names:
dt <- fread('../data/data_logTMP.csv', header = T)
rownames(dt) <- dt$GENE
I used duplicated() to check the values:
> which(duplicated(dt$GENE) == TRUE)
[1] 20209 21919
Therefore, I compared these values:
> dt$GENE[20209] == dt$GENE[21919]
[1] FALSE
> dt$GENE[20209]
[1] "1-Mar"
> dt$GENE[21919]
[1] "2-Mar"
Why were these two values recognized as duplicated? And how can I fix this problem?
Because you are using fread to read the file, the default class of your object dt is data.table. By design, data.table does not support row names. You therefore need to pass an additional argument to fread, as shown below, so that the object you read in is a plain data.frame rather than a data.table.
data.table::fread(input = "file name",sep = ",",header = T,data.table = FALSE)
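As a self-contained sketch of the same idea, the snippet below reads a tiny inline CSV instead of a file. The gene names and the expr column are hypothetical stand-ins for the real data.

```r
library(data.table)

# Hypothetical two-row input standing in for the real CSV file.
csv_text <- "GENE,expr\nA1BG,1.5\nA2M,2.5"

df <- fread(text = csv_text, header = TRUE, data.table = FALSE)
class(df)                 # "data.frame"

rownames(df) <- df$GENE   # works because df is a plain data.frame
df$GENE <- NULL           # optional: drop the now-redundant column
```

Row names of a data.frame must be unique, so the assignment will still fail if GENE really contains duplicated values.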
I am working with a data.table that has been read in from a .txt file with fread. The data.table contains some amount of integer columns as well as a column of very large integers that I intend to store as bigz. However, fread will only read in large integers as character if I plan on keeping all of the digits (and I do).
#Something to the effect of (run not needed):
#fread(file = FILENAME.txt, header=TRUE, colClasses = c(rep("integer", 10), "character"), data.table = TRUE)
Additionally, I am working with a fairly large dataset. My primary problem is converting a character column in a data.table to a bigz column without creating a new object.
Here's a toy example which demonstrates my issue. First, I know that data.tables can have bigz columns - IF they are introduced in a new object.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa) #The same number in character form
(good = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(exa, 3)))
str(good) #Notice "bigs" is type bigz (and raw?)
However, if a character column is to be converted to a bigz column on the fly, an error results. The syntax in these conversion methods "works" w.r.t. the numeric nums column if as.bigz is replaced with as.character.
(bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3)))
str(bad)
#Method 1
bad[,bigs:=as.bigz(bigs)]
#Method 2 (re-create data.table first)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
set(bad, j="bigs", value = as.bigz(bad$bigs))
Error below. It appears that the issue stems from bigz integers being stored as raw, although I am not sure where '64' is coming from - exa has 25 digits.
Warning messages:
1: In `[.data.table`(bad, , `:=`(bigs, as.bigz(bigs))) :
Supplied 64 items to be assigned to 3 items of column 'bigs' (61 unused)
2: In `[.data.table`(bad, , `:=`(bigs, as.bigz(bigs))) :
Coerced 'raw' RHS to 'character' to match the column's type. Either change the target column ['bigs'] to 'raw' first (by creating a new 'raw' vector length 3 (nrows of entire table) and assign that; i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for speed. Or, set the column type correctly up front when you create the table and stick to it, please.
I have a work-around for now, but it requires creating a new object (and deleting the old one).
(bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3)))
meh = data.table(as.data.frame(bad)[,-3], bigs = as.bigz(bad$bigs))
rm(bad)
str(meh)
identical(good, meh) #Well, at least this works
I think this situation could be resolved if:
fread could read in bigz integers, or
there is a way to change the column type without creating a new object.
Admittedly, I am a data.table novice. Thanks in advance!
These bigz numbers seem to be a pain to work with. Additionally, it seems they cannot be held as the only column in a data.table.
The only workaround I can find is to declare a new data.table, which is what you have already done, but it can be written more succinctly and without creating a second object.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
bad = data.table(bad,bigsN = as.bigz(bad$bigs))
str(bad)
However, these columns cannot be manipulated inside the data.table without the same problems.
bad$bigsN = bad$bigsN*2
## Error in `[<-.data.table`(x, j = name, value = value) :
## Unsupported type 'raw'
## In addition: Warning message:
## In `[<-.data.table`(x, j = name, value = value) :
## Supplied 64 items to be assigned to 3 items of column 'bigsN' (61 unused)
The best solution I can think of is simply to keep these objects as separate vectors to your data.table.
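A rough sketch of that approach, assuming the bigz vector is kept in the same row order as the table. The names dt, bigs and sel are only for illustration.

```r
library(gmp)
library(data.table)

cha <- as.character(as.bigz(2)^80)

# Ordinary columns live in the data.table ...
dt <- data.table(nums = 1:3, lets = letters[1:3])
# ... while the big integers live in a separate bigz vector, same row order.
bigs <- as.bigz(rep(cha, 3))

bigs <- bigs * 2                 # vectorised arithmetic works on the bigz vector

sel <- dt[, which(nums >= 2)]    # row numbers selected in the data.table
as.character(bigs[sel])          # matching big integers, as text
```

The cost is that the two objects have to be kept in sync by hand: any reordering or filtering of dt needs the same row numbers applied to bigs.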
as.list
Another solution would be to embed the bigz values in a list.
library(gmp)
library(data.table)
exa = as.bigz(2)^80 #A very large number
cha = as.character(exa)
bad = data.table(nums = 1:3, lets = letters[1:3], bigs = rep(cha, 3))
bad = bad[,bigs := as.list(as.bigz(bad$bigs))]
This gives R a better handle on the location of elements, and is more memory efficient at the creation stage. The down side is each element is a length 1 bigz vector and as such holds 4 redundant bytes of data per element. It also still cannot be used for arithmetic in a vectorised fashion.
bad$bigs = bad$bigs * 2
## Error in bad$bigs * 2 : non-numeric argument to binary operator
bad$bigs[[2]] = bad$bigs[[2]] * 2
bad$bigs
## [[1]]
## Big Integer ('bigz') :
## [1] 1208925819614629174706176
##
## [[2]]
## Big Integer ('bigz') :
## [1] 2417851639229258349412352
##
## [[3]]
## Big Integer ('bigz') :
## [1] 1208925819614629174706176
In fact, it would seem very little can be done with it in a vectorised fashion, including sorting or even converting it back into a bigz vector.
I have run into an error in a script I am writing that only occurs when I have dplyr running. I first encountered it when I found a function from dplyr that I wanted to use, after which I installed and ran the package. Here is an example of my error:
First I read in a table from Excel that has column values I am going to use as indices:
library(readxl)
examplelist <- read_excel("example.xlsx")
The contents of the file are:
1 2 3 4
1 1 4 1
2 3 2 1
4 4 1 4
And then I build a data frame:
testdf = data.frame(1:12, 13:24, 25:36, 37:48)
And then I have a loop that calls a function that uses the values of examplelist as indices.
testfun <- function(df, a, b, c, d){
value1 <- df[[a]]
value2 <- df[[b]]
value3 <- df[[c]]
value4 <- df[[d]]
}
for (i in 1:nrow(examplelist)){
testfun(testdf, examplelist[i, 1], examplelist[i, 2],
examplelist[i, 3], examplelist[i, 4])
}
When I run this script without dplyr, everything is fine, but with dplyr it gives me the error:
Error in .subset2(x, i, exact = exact) : invalid subscript type 'list'
Why would having dplyr cause this error, and how can I fix it?
I think MKR's answer is a valid solution, I will elaborate a bit more on the why with some alternatives.
The readxl library is part of the tidyverse and returns a tibble (tbl_df) with the function read_excel. This is a special type of data frame and there are differences from base behaviour, notably printing and subsetting (read here).
Tibbles also clearly delineate [ and [[: [ always returns another tibble, [[ always returns a vector. No more drop = FALSE
So you can see now that your examplelist[i, n] will return a tibble and not a vector of length 1, which is why using as.numeric works.
library(readxl)
examplelist <- read_excel("example.xlsx")
class(examplelist[1, 1])
# [1] "tbl_df" "tbl" "data.frame"
class(examplelist[[1, 1]])
# [1] "numeric"
class(as.numeric(examplelist[1, 1]))
# [1] "numeric"
class(as.data.frame(examplelist)[1, 1])
# [1] "numeric"
My workflow tends towards using the tidyverse so you could use [[ to subset or as.data.frame if you don't want tibbles.
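The same failure can be reproduced without readxl. The sketch below uses a base data.frame with drop = FALSE to mimic the way a tibble's [ keeps a 1x1 table. The object idx is a hypothetical stand-in for examplelist.

```r
# Base R stand-ins: idx plays the role of examplelist, testdf is as in the question.
idx <- data.frame(a = c(1, 2), b = c(2, 3))
testdf <- data.frame(1:12, 13:24, 25:36, 37:48)

cell <- idx[1, 2, drop = FALSE]   # 1x1 data.frame, like a tibble's `[`
msg <- tryCatch(testdf[[cell]], error = function(e) conditionMessage(e))
msg                               # the subscript error from the question

scalar <- idx[[1, 2]]             # plain number 2
col <- testdf[[scalar]]           # second column of testdf
```

Using [[ keeps the index a bare number, so the lookup inside testfun behaves the same for tibbles and data frames.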
I can see this issue even without loading dplyr. The culprit seems to be how the examplelist items are used. If you print the value of examplelist[1, 2], it is a 1x1 data.frame, but the values of a, b, c and d are expected to be simple numbers. Hence, if you wrap examplelist[i, 1] etc. in as.numeric, the error is avoided. Change the call of testfun to:
testfun(testdf, as.numeric(examplelist[i, 1]), as.numeric(examplelist[i, 2]),
as.numeric(examplelist[i, 3]), as.numeric(examplelist[i, 4]))