I'm able to use the following to generate/replace a numeric column (var) with each cell's quartile rank (1-4) for the column:
df <- as.data.table(df)
df[, var:=ntile(var, 4)]
I want to apply this conversion to every column in the data table. When I try the following, every cell in the table becomes 1. Any guidance on why I'm not getting the expected output? I'm sure there's also a simpler approach, so alternatives are welcome too. Thanks!
for (i in 1:ncol(df))
{
df[, colnames(df)[i]:=ntile(df[i], 4)]
}
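One possible approach, sketched under the assumption that every column is numeric and that ntile() comes from dplyr (the sample columns a, b and c are made up for illustration). In the loop above, df[i] on a data.table selects row i rather than column i, so ntile() never receives a column vector. Passing each column through lapply(.SD, ...) avoids the loop entirely:

```r
library(data.table)

set.seed(1)
# Made-up sample data; any all-numeric data.table should behave the same way
df <- data.table(a = rnorm(8), b = runif(8), c = 8:1)

cols <- names(df)
# .SD holds the columns listed in .SDcols; lapply() hands each one to ntile()
# as a plain vector, and (cols) := overwrites the columns by reference
df[, (cols) := lapply(.SD, dplyr::ntile, 4), .SDcols = cols]
```

To convert only some columns, cols can be narrowed to the relevant names.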
Related
I am working through practice exercises based on problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data table. The solution code uses [....., with=F][[1]]. I don't understand this part of the code and would appreciate an explanation.
for(i in 1:NROW(DT)){
for(j in 1:NCOL(DT)){
curr_value <- DT[i, j,with=F][[1]]
I understand the first two lines, but not the with=F or the [[1]] that follows it.
What does with=F mean, and why is [[1]] used after it? Why is the 1 in double brackets?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
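As for the [[1]] part of the question, here is a small sketch using the same mtcars table (the variable names r, k, cell_dt and cell are only illustrative). Selecting with with = FALSE still returns a data.table, even for a single cell, and [[1]] then pulls out its first column as a plain vector:

```r
library(data.table)

dt <- data.table(mtcars)
r <- 1  # row position
k <- 2  # column position (the cyl column)

cell_dt <- dt[r, k, with = FALSE]  # still a data.table: 1 row x 1 column
cell    <- cell_dt[[1]]            # first column extracted as a plain vector
```

A plain value is what a test such as is.na(curr_value) in the loop would normally expect.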
I work mostly with data table but a data frame solution would work as well.
I have the result of an apply which returns this data structure
applyres=structure(c(0.0260, 3.6775, 0.92
), .Names = c("a.1", "a.2", "a.3"))
Then I have a data table
coltoadd=c('a.1','a.2','a.3')
dt <- data.table(variable1 = factor(c("what","when","where")))
dt[,coltoadd]=as.numeric(NA)
Now I would like to add the elements of applyres to the corresponding columns, just one row at a time, because applyres is calculated from another function. I have tried different assignments but nothing seems to work. Ideally I would like to assign based on column name, just in case the columns change order in one of the two structures.
This doesn't work
dt[1,coltoadd]=applyres
I also tried
dt[1,coltoadd := applyres]
I also tried converting applyres to a matrix or a data table and transposing it.
I would like to do something like this
dt[1,coltoadd[i]]=applyres[coltoadd[i]]
But not sure if it should go in a loop, doesn't seem the best way to do it.
How do I avoid doing single assignments if I have a large number of columns?
1) data.frame Convert to data.frame, perform the assignments and convert back.
DF <- as.data.frame(dt)
DF[1, -1] <- applyres
# perform the remaining assignments
dt <- as.data.table(DF)
2) loop Another possibility is a for loop:
for(i in 2:ncol(dt)) dt[1, i] <- applyres[i-1]
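3) := with a list A further option, sketched here with made-up values because the original applyres is not fully shown. With :=, wrapping the column names in parentheses and supplying the right-hand side as a list assigns one value per column, and indexing applyres by name keeps values aligned with columns regardless of their order:

```r
library(data.table)

# Illustrative values only; not the asker's actual applyres
applyres <- c(a.1 = 0.5, a.2 = 3.5, a.3 = 1.5)
coltoadd <- c("a.1", "a.2", "a.3")

dt <- data.table(variable1 = factor(c("what", "when", "where")))
dt[, (coltoadd) := NA_real_]  # add empty numeric columns by reference

# Fill row 1 only: one list element per target column, matched by name
dt[1, (coltoadd) := as.list(applyres[coltoadd])]
```

The same call can be repeated with a different row index each time a new applyres is computed.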
This is similar to Update values in data.table with values from another data.table and R data.table replacing an index of values from another data.table, except in my situation the number of variables is very large so I do not want to list them explicitly.
What I have is a large data.table (let's call it dt_original) and a smaller data.table (let's call it dt_newdata) whose IDs are a subset of the first and it has only some of the variables of the first. I would like to update the values in dt_original with the values from dt_newdata. For an added twist, I only want to update the values conditionally - in this case, only if the values in dt_newdata are larger than the corresponding values in dt_original.
For a reproducible example, here are the data. In the real world the tables are much larger:
library(data.table)
set.seed(0)
## This data.table with 20 rows and many variables is the existing data set
dt_original <- data.table(id = 1:20)
setkey(dt_original, id)
for(i in 2015:2017) {
varA <- paste0('varA_', i)
varB <- paste0('varB_', i)
varC <- paste0('varC_', i)
dt_original[, (varA) := rnorm(20)]
dt_original[, (varB) := rnorm(20)]
dt_original[, (varC) := rnorm(20)]
}
## This table with a strict subset of IDs from dt_original and only a part of
## the variables is our potential replacement data
dt_newdata <- data.table(id = sample(1:20, 3))
setkey(dt_newdata, id)
newdata_vars <- sample(names(dt_original)[-1], 4)
for(var in newdata_vars) {
dt_newdata[, (var) := rnorm(3)]
}
Here is a way of doing it using a loop and pmax, but there has to be a better way, right?
for(var in newdata_vars) {
k <- pmax(dt_newdata[, (var), with = FALSE], dt_original[id %in% dt_newdata$id, (var), with = FALSE])
dt_original[id %in% dt_newdata$id, (var) := k, with = FALSE]
}
It seems like there should be a way using join syntax, and maybe the prefix i. and/or .SD or something like that, but nothing I've tried comes close enough to warrant repeating here.
This code should work in the current format given your criteria.
dt_original[dt_newdata, names(dt_newdata) := Map(pmax, mget(names(dt_newdata)), dt_newdata)]
It joins on the IDs that match between the data.tables and then performs an assignment using :=. Because the right-hand side must be a list, I use Map to run pmax over the columns of both data.tables, matched by the names in dt_newdata. Note that all names in dt_newdata must also be present in dt_original.
Following Frank's comment, you can remove the first column of the Map list items and the column names using [-1] because they are IDs, which don't need to be computed. Removing the first column from Map avoids one pass of pmax and also preserves the key on id. Thanks to #brian-stamper for pointing out the key preservation in the comments.
dt_original[dt_newdata,
names(dt_newdata)[-1] := Map(pmax,
mget(names(dt_newdata)[-1]),
dt_newdata[, .SD, .SDcols=-1])]
Note that the use of [-1] assumes that the ID variable is in the first position of dt_newdata. If it is elsewhere, you could change the index manually or use grep.
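For readers new to update joins, here is a minimal single-column sketch of the same mechanism (the tables orig and new and the column x are made up). Inside the join, a bare column name refers to the matched rows of the outer table, while the i. prefix refers to the table being joined in:

```r
library(data.table)

# Made-up tables: 'new' covers a subset of the ids in 'orig'
orig <- data.table(id = 1:4, x = c(1, 5, 3, 7))
new  <- data.table(id = c(2L, 3L), x = c(4, 9))

# For matched ids only, keep the larger of the existing and the new value
orig[new, on = "id", x := pmax(x, i.x)]
```

Only id 3 changes here, because that is the only row where the new value is larger. The Map version above generalises this to many columns at once.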
I am interested in optimizing some code using data.table. I feel I should be able to do better than my current solution, which does not scale well as the number of rows increases.
Consider I have a matrix of values, with ID denoting person and the remaining values are traits (lineage in my case). I want to create a logical matrix which reflects if two ID's (rows) share any values amongst their row (including ID). I have been using data.table lately, but I cannot figure out how to do this more efficiently. I have tried (and failed) at nesting apply statements, or somehow using the .SD function of data.table to accomplish this.
The working code is below.
m <- matrix(rep(1:10,2),nrow=5,byrow=T)
m[c(1,3),3:4] <- NA
dt <- data.table(m)
setnames(dt,c("id","v1","v2","v3"))
res <- matrix(data=NA,nrow=5,ncol=5)
dimnames(res) <- list(dt[,id],dt[,id])
for (i in 1:nrow(dt)){
for (j in i:nrow(dt)){
res[j,i] <- res[i,j] <-length(na.omit(intersect(as.numeric(dt[i]),as.numeric(dt[j])))) > 0
}
}
res
I had a similar problem a while ago and somebody helped me out. Here's that help converted to your problem...
tm<-t(m) #transpose the matrix
dtt<-data.table(tm[2:4,]) #take values of matrix into data.table
setnames(dtt,as.character(tm[1,])) #make data.table column names
comblist<-combn(names(dtt),2,FUN=list) #create list of all possible column combinations
preresults<-dtt[,lapply(comblist, function(x) length(na.omit(intersect(as.numeric(get(x[1])),as.numeric(get(x[2]))))) > 0)] #recreate your double for loop
preresults<-melt(preresults,measure.vars=names(preresults)) #change columns to rows
preresults[,c("LHS","RHS"):=lapply(1:2,function(i)sapply(comblist,"[",i))] #add column labels
preresults[,variable:=NULL] #kill unneeded column
I'm drawing a blank on how to get my preresults to be in the same format as your res but this should give you the performance boost you're looking for.
I have the following data frame:
df <- data.frame(
Target=rep(LETTERS[1:3],each=8),
Prov=rep(letters[1:4],each=2),
B=rep("5MB"),
S=rep("1MB"),
BUF=rep("8kB"),
M=rep(c('g','p')),
Thr.mean=1:24)
whose column Thr.mean I would like to normalize by the values where Target=='C' (I don't mind attaching a new column).
To clarify, I would like to end up with:
Thr.mean <- c(1/17,2/18,3/19,4/20,5/21,6/22,7/23,8/24,9/17,10/18,11/19,12/20,13/21,14/22,15/23,16/24,1,1,1,1,1,1,1,1)
Now, it may happen that there are rows in this data frame, where Target!='C', and they have values in S or B that are not present in rows where Target=='C', and for these I would also like to calculate the overhead. The most important column for matching is M, then BUF, B, and S.
Any ideas how to do it? I could write several loops and ifs, but I'm looking for a more elegant solution.
For posterity, I solved my problem using data.table:
DT <- data.table(df)
DT[, Thr.Norm.C := .SD[Target=='C', Thr.mean], by = 'B,BUF,Prov']
DT[, over.thr := Thr.Norm.C/Thr.mean]
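An alternative sketch that reproduces the target vector from the question directly, assuming each (Prov, M) pair occurs exactly once among the Target == 'C' rows, as it does in the sample data. The names ref and Thr.norm are introduced here for illustration only:

```r
library(data.table)

df <- data.frame(
  Target = rep(LETTERS[1:3], each = 8),
  Prov = rep(letters[1:4], each = 2),
  B = rep("5MB"), S = rep("1MB"), BUF = rep("8kB"),
  M = rep(c("g", "p")),
  Thr.mean = 1:24)
DT <- data.table(df)

# Reference values taken from the Target == 'C' rows
ref <- DT[Target == "C", .(Prov, M, ref = Thr.mean)]

# Update join: divide each row by the reference value with the same Prov and M
DT[ref, on = .(Prov, M), Thr.norm := Thr.mean / i.ref]
```

More matching columns (BUF, B, S) can be added to both ref and on = if they vary in the real data.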