Firstly, let us generate data like this:
library(data.table)
data <- data.table(date = as.Date("2015-05-01")+0:299)
set.seed(123)
data[,":="(
a = round(30*cumprod(1+rnorm(300,0.001,0.05)),2),
b = rbinom(300,5000,0.8)
)]
Then I want to apply my custom function to multiple columns multiple times without typing each call out by hand. For example, my custom function is add <- function(x,n) (x+n).
I provide my for-loop code below:
add <- function(x,n) (x+n)
n <- 3
freture_old <- c("a","b")
for(i in 1:n ){
data[,(paste0(freture_old,"_",i)) := add(.SD,i),.SDcols =freture_old ]
}
Could you please show me an lapply version to replace the for loop?
If all you want is to use lapply instead of a for loop, you really do not need to change much. For a data.table object it is even easier, since := modifies the data.table by reference on every iteration, so there is nothing to save back to the global environment. The basic call is:
lapply(1:n,function(i) data[,paste0(freture_old,"_",i):=lapply(.SD,add,i),.SDcols =freture_old])
Note that if you assign this lapply to an object you will get a list with one data.table per iteration, in this case 3. That is wasteful, because you are really only interested in the final state of data. Therefore just run the code without assigning it to a variable. If you do not assign it to anything, however, every iteration is printed to the console. So what I would suggest is to wrap the call in invisible(), like this:
invisible(lapply(1:n,function(i) data[,paste0(freture_old,"_",i):=lapply(.SD,add,i),.SDcols =freture_old]))
Hope this helps and let me know if you need me to add anything else to this answer. Good luck!
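To make the by-reference behaviour concrete, here is a minimal sketch on a toy table. The names dt_demo, cols and k are made up for illustration and are not from the question; it assumes the data.table package is installed.

```r
library(data.table)

# Toy stand-in for the question's data (hypothetical values)
dt_demo <- data.table(a = c(1, 2, 3), b = c(10, 20, 30))
add <- function(x, n) x + n
cols <- c("a", "b")

# Each iteration adds two columns to dt_demo by reference;
# invisible() only hides the list that lapply returns.
invisible(lapply(1:2, function(k) {
  dt_demo[, (paste0(cols, "_", k)) := lapply(.SD, add, k), .SDcols = cols]
}))

names(dt_demo)
```

Nothing is assigned back to dt_demo, yet the new columns are there afterwards, which is why the lapply result itself can be discarded.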
An option without an R "loop" (quoted since ultimately it's a loop at some level somewhere):
data[,
c(outer(freture_old, seq_len(n), paste, sep="_")) :=
as.data.table(matrix(outer(as.matrix(.SD), seq_len(n), add), .N)),
.SDcols=freture_old]
Or equivalently in base R:
setDF(data)
cbind(data, matrix(outer(as.matrix(data[, freture_old]), seq_len(n), add),
nrow(data)))
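The outer() trick is easier to follow on a tiny matrix. The sketch below uses made-up values (m and wide are illustrative names) and also attaches column names, which the base R version above leaves unset.

```r
add <- function(x, n) x + n

# Toy stand-in for the two numeric columns (hypothetical values)
m <- cbind(a = c(1, 2), b = c(10, 20))
n <- 3

# outer() on a matrix and a vector gives a 3-D array: rows x columns x n
arr <- outer(m, seq_len(n), add)

# Flatten to rows x (columns * n); the column order is a, b for n = 1, then n = 2, ...
wide <- matrix(arr, nrow(m))
colnames(wide) <- c(outer(colnames(m), seq_len(n), paste, sep = "_"))
wide
```

The names are generated with the same outer() call pattern as the values, so the two stay in the same order.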
I have a list of data frames. I want to use lapply on a specific column of each of those data frames, but I keep getting errors when I try methods from similar answers:
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")})
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can return values from function(a[[1]]$DIM) all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place you are better off using a for loop, since inside lapply you would need the <<- assignment operator (a plain <- inside the function only changes a local copy). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
})
)
You have to use invisible(), otherwise lapply would print its output to the console. The <<- assigns the vector runif(...) to the newly created column.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
return(aList[[x]])
})
Also, may I suggest the use of seq_along(list) in lapply and for loops as opposed to 1:length(list) since it avoids unexpected behavior such as:
# zero-length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0, so a loop over it would run twice
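On the original get() question: get() looks up an object by its name, and "a[[1]]$DIM" is an expression rather than a name, so no object with that name exists. A minimal sketch of a more direct route, passing the data frames themselves to lapply, follows. The toy list a and the use of mean() are assumptions made only for illustration.

```r
# Toy list of data frames that each have a DIM column (hypothetical values)
a <- list(data.frame(DIM = 1:3, x = 4:6),
          data.frame(DIM = 7:9, x = 1:3))

# There is no variable literally called "a[[1]]$DIM", which is why get() fails
found <- exists("a[[1]]$DIM")

# Pull the column out of each data frame directly
dims <- lapply(a, function(df) df$DIM)

# Or apply some function to each DIM column in one step
results <- lapply(a, function(df) mean(df$DIM))
```

This avoids building strings of code altogether, so neither get() nor as.name() is needed.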
I was hoping somebody could help. I'm trying to speed up an apply function; I've tried a few tricks but it is still quite slow, and I was wondering if anybody had any more suggestions.
I have data as follows:
myData= data.frame(ident=c(3,3,4,4,4,4,4,4,4,4,4,7,7,7,7,7,7,7),
group=c(7,7,7,7,7,7,7,7,7,7,7,8,8,8,8,8,8,8),
significant=c(1,1,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0),
year=c(2003,2002,2001,2008,2010,2007,2007,2008,2006,2012,2008,
2012,2006,2001,2014,2012,2004,2007),
month=c(1,1,9,12,3,2,4,3,9,5,12,8,11,3,1,6,3,1),
subReport=c(0,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0),
prevReport=c(1,1,0,1,1,1,0,1,1,1,0,1,1,1,1,1,1,1))
and I want to end up with a dataframe like this:
results=data.frame(ident=c(3,4,7),
significant=c(1,0,1),
prevReports=c(2,6,7),
subReport=c(0,1,0),
group=c(7,7,8))
To do this I wrote the code below. To make it quick I've tried converting to data tables and using rbindlist instead of rbind, as suggested in a few threads. I've also tried parLapply. The process is still quite slow, however (I'm trying to do this on about 250,000 data points).
dt<-data.table(myData)
results<-NULL
ApplyModel <- function (id,data) {
dtTemp<-dt[dt$ident== id,]
if(nrow(dtTemp)>=1){
prevReport = if(sum(dtTemp$prevReport)>=1) sum(dtTemp$prevReport) else 0
subsequentReport = if(sum(dtTemp$subReport)>=1) 1 else 0
significant = as.numeric(head(dtTemp$significant,1))
group = head(dtTemp$group,1)
id= as.numeric(head(dtTemp$ident,1))
output<-cbind(id, significant ,prevReport,subsequentReport ,group)
output<-output[!duplicated(output[,1]),]
print(output)
results <- rbindlist(list(as.list(output)))
}
}
results<-lapply(unique(dt$ident), ApplyModel)
results<-as.data.frame(do.call(rbind, results))
Any suggestions on how this might be sped up would be most welcome! I think it may have to do with the subsetting: I want to apply the function to a subset based on a unique value, but I think lapply is really meant for applying a function to each value, so subsetting defeats the purpose somewhat...
Here, your code produces an error:
results<-lapply(unique(dt$ident), ApplyModel)
Error in dt$ident : object of type 'closure' is not subsettable
It appears to me, that you are looking for tapply instead of lapply. Using tapply you could express roughly the above in much more concise ways:
results2 <- data.frame(significant = tapply(myData$significant, myData$ident, function(x) return(x[1])),
prevreports = tapply(myData$prevReport, myData$ident, sum),
subReports = tapply(myData$subReport, myData$ident, function(x) as.numeric(any(x==1))),
group = tapply(myData$group, myData$ident, function(x) return(x[1])))
This should do about the same job but be much more readable. It should be fast enough for all but huge datasets; in most cases it is quicker to wait for R to finish than to spend more time programming. One way to make it even faster is to use the data.table package, but just converting to a data.table doesn't do the trick. You'll need to learn its rather special syntax. Please check first that the code above really is too slow.
If it is really too slow, check this:
library(data.table)
first <- function(x) x[1]
myAny <- function(x) as.numeric(any(x==1))
myData <- data.table(myData)
myData[, .(significant=first(significant),
prevReports=sum(prevReport),
subReports=myAny(subReport),
group=first(group)), ident]
You could use dplyr:
library(dplyr)
new <- myData %>% group_by(ident) %>%
summarise(significant = first(significant), prevReports = sum(prevReport),
subReport = as.numeric(any(subReport == 1)), group = first(group)) %>%
data.frame()
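For comparison with the question's original lapply approach, here is a minimal base R sketch that splits the data once by ident and then summarises each piece, instead of re-subsetting the full table for every id. The toy data and the helper name summarise_one are made up for illustration; on very large inputs the data.table or dplyr versions above are likely to be faster.

```r
# Small stand-in for myData (hypothetical values)
toy <- data.frame(ident       = c(3, 3, 4, 4, 4),
                  group       = c(7, 7, 7, 7, 7),
                  significant = c(1, 1, 0, 0, 0),
                  subReport   = c(0, 0, 1, 0, 0),
                  prevReport  = c(1, 1, 0, 1, 1))

# Split once into one data frame per ident
pieces <- split(toy, toy$ident)

# Summarise a single group into one row
summarise_one <- function(d) {
  data.frame(ident       = d$ident[1],
             significant = d$significant[1],
             prevReports = sum(d$prevReport),
             subReport   = as.numeric(any(d$subReport == 1)),
             group       = d$group[1])
}

res <- do.call(rbind, lapply(pieces, summarise_one))
rownames(res) <- NULL
res
```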
I have a data.frame mapping which contains path and map.
I also have another data.frame DATA which contains the raw path and value.
EDIT: Path might have two components or more: e.g. "A>C" or "A>C>B"
set.seed(24);
DATA <- data.frame(
path=paste0(sample(LETTERS[1:3], 25, replace=TRUE), ">", sample(LETTERS[1:3], 25, replace=TRUE)),
value=rnorm(25)
)
mapping <- data.frame(path=c("A","B","C"), map=c("X","Y","Z"))
lapply(mapping, function (x) {
for (i in 1:nrow(DATA)) {
DATA$path[i] <- gsub(as.character(x["path"]),as.character(x["map"]),as.character(DATA$path[i]))
}
})
I'm trying to replace the path in DATA with the map value in mapping but this doesn't seem to be working for me.
"A>C" will be converted to "X>Z".
I understand that for loops are discouraged in R, but I can't think of another way to code it. The data I'm working with has 6m rows in DATA and 16k rows in mapping.
Clarification on Data: While the paths consist of letters (A, B, C) here, the real path elements are actually domain names. The number of steps in a path is also not fixed at 2 and can be any number.
You can use chartr
DATA$path <- chartr('ABC', 'XYZ', DATA$path)
Or if we are using the data from 'mapping'
DATA$path <- chartr(paste(mapping$path, collapse=''),
paste(mapping$map, collapse=''), DATA$path)
Or using gsubfn
library(gsubfn)
pat <- paste0('[', paste(mapping$path, collapse=''),']')
indx <- setNames(as.character(mapping$map), mapping$path)
gsubfn(pat, as.list(indx), as.character(DATA$path))
Or a base R option based on @smci's comment
vapply(strsplit(as.character(DATA$path), '>'), function(x)
paste(indx[x], collapse=">"), character(1L))
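Because chartr only translates single characters, the split-and-lookup route is the one that carries over to multi-character path elements and to paths of any length. A minimal sketch, assuming invented domain names (a.com, b.org, c.net) purely for illustration:

```r
# Hypothetical mapping with multi-character path elements
mapping <- data.frame(path = c("a.com", "b.org", "c.net"),
                      map  = c("X", "Y", "Z"),
                      stringsAsFactors = FALSE)
paths <- c("a.com>c.net", "a.com>c.net>b.org")

# Named vector used as a lookup table: names are the old values
lookup <- setNames(mapping$map, mapping$path)

# Split each path into steps, translate every step, and glue it back together
mapped <- vapply(strsplit(paths, ">", fixed = TRUE),
                 function(steps) paste(lookup[steps], collapse = ">"),
                 character(1))
mapped
```

Steps that have no entry in the lookup come back as NA and end up as the text "NA" in the result, so real data may need a check for unmapped elements first.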
Using data.table (1.9.5+), especially advisable b/c of the size of your data.
library(data.table)
setDT(DATA); setDT(mapping)
DATA[,paste0("path",1:2):=tstrsplit(path,split=">")]
setkey(DATA,path1)[mapping,new.path1:=i.map]
setkey(DATA,path2)[mapping,new.path2:=i.map]
DATA[,new.path:=paste0(new.path1,">",new.path2)]
If you want to get rid of the extra columns:
DATA[,paste0(c("","","new.","new."),"path",rep(1:2,2)):=NULL]
If you just want to overwrite path, use path on the LHS of the last line instead of new.path.
This could also be written more concisely:
library(data.table)
setDT(mapping)
setkey(setkey(setDT(DATA)[,paste0("path",1:2):=tstrsplit(path,split=">")
],path1)[mapping,new.path1:=i.map],path2
)[mapping,new.path:=paste0(new.path1,">",i.map)]
I think you're using the wrong apply.
mapply lets you pass two arguments to the function, here the path and the map. Note that in mapply the FUN argument comes first. You also do not need to go row by row; gsub is vectorised, so you can process the entire column at once. Finally, inside an apply-style call, assignments made in the function are local and do not update the outer variables as they would in a for loop, so you need to assign into .GlobalEnv. You can do this with an explicit call to assign() or with <<-, which assigns to the first existing binding it finds in the enclosing environments. In this case, that will be back in .GlobalEnv.
After defining mapping and DATA as you do above, try this.
head(DATA)
invisible(mapply( function (x,y) {
DATA$path <<- gsub(x,y,DATA$path)
},mapping$path, mapping$map))
head(DATA)
Note that the call to invisible() suppresses the output from mapply.
If you really want to use lapply, you can. But you need to transpose mapping. You can do that but it will be converted to a matrix, so you have to convert it back. Then, you can just use the same tricks with <<- and not using a for loop as above to get this code:
invisible(lapply(as.data.frame(t(mapping)), function (x) {
DATA$path <<- gsub(x[1],x[2],DATA$path)
}))
head(DATA)
Thanks for sharing, I learned a lot answering this question.
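If you would rather avoid <<- altogether, one possible alternative is to fold the replacements over the mapping rows with Reduce, carrying the partially replaced vector along as the accumulator. This is only a sketch on toy values; the names paths and step are illustrative and not from the question.

```r
mapping <- data.frame(path = c("A", "B", "C"),
                      map  = c("X", "Y", "Z"),
                      stringsAsFactors = FALSE)
paths <- c("A>C", "B>B", "C>A>B")

# Apply the k-th replacement to the whole vector p
step <- function(p, k) gsub(mapping$path[k], mapping$map[k], p, fixed = TRUE)

# Start from the original paths and apply each mapping row in turn
replaced <- Reduce(step, seq_len(nrow(mapping)), paths)
replaced
```

As with the mapply version, the replacements run one after another, so a map value that contains a later pattern would be replaced again; that does not happen with these toy values.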
I am new to R and it seems like this shouldn't be a difficult task but I cannot seem to find the answer I am looking for. I am trying to add multiple vectors to a data frame using a for loop. This is what I have so far and it works as far as adding the correct columns but the variable names are not right. I was able to fix them by using rename.vars but was wondering if there was a way without doing that.
for (i in 1:5) {
if (i==1) {
alldata<-data.frame(IA, rand1) }
else {
alldata<-data.frame(alldata, rand[[i]]) }
}
Instead of the variable names being rand2, rand3, rand4, rand5, they show up as rand..i.., rand..i...1, rand..i...2, and rand..i...3.
Any Suggestions?
You can set variable names using the colnames function. Therefore, your code would look something like:
newdat <- cbind(IA, rand1, rand[2:5])
colnames(newdat) <- c(colnames(IA), paste0("rand", 1:5))
If you're creating your variables in a loop, you can assign the names during the loop
alldata <- data.frame(IA)
for (i in 1:5) {alldata[, paste0('rand', i)] <- rand[[i]]}
However, growing a data frame one column at a time in a loop is slow in R, so if you are trying to do this with tens of thousands of columns, the cbind and rename approach will be much faster.
Just do cbind(IA, rand1, rand[2:5]).
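The question does not show how IA and rand are defined, so the sketch below assumes IA is a data frame and rand is a list of equal-length vectors. Under that assumption, both naming strategies look like this:

```r
# Assumed shapes (hypothetical values): IA is a data frame, rand a list of vectors
IA <- data.frame(id = 1:3)
rand <- list(c(1, 2, 3), c(4, 5, 6), c(7, 8, 9))

# Option 1: name each column as it is added in the loop
alldata <- IA
for (i in seq_along(rand)) {
  alldata[[paste0("rand", i)]] <- rand[[i]]
}

# Option 2: name the list elements first, then bind everything in one go
names(rand) <- paste0("rand", seq_along(rand))
alldata2 <- cbind(IA, as.data.frame(rand))

names(alldata)
names(alldata2)
```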