How to save multiple variables in global environment into a dataframe in R? I can do it one by one using the $. But is there a command that moves all variables at once?
You can do:
do.call(data.frame, lapply(ls(), get))
But let me just say this sounds like a horrible idea.
If there is a named list of objects, then:
cbind(dfrm, list.obj)
E.g.
dados <- data.frame(x=1:10, v1=rnorm(10), v2=rnorm(10))
dat1 <- rnorm(10)
dat2 <- letters[1:5]
cbind(dados, list(new1=dat1, new2=dat2))
Can also use this form:
cbind(dados, new1=dat1, new2=dat2)
If you had a bunch of variables in the global environment with the character string "zzz" in their names and you wanted to append them columnwise to an existing dataframe, you could use this:
dados[, ls(patt="zzz") ] <- sapply( ls(patt="zzz"), get)
You might notice that I was objecting to using data.frame(cbind(...)) and yet I am saying to use cbind. Well, I am assuming that the first argument in my case is already a dataframe, so that cbind does not get called, but rather cbind.data.frame' gets called, and it will allow the usual list-behavior of data.frame objects to persist. If the first argument is only a vector, then.Internal(cbind(...)` will do coercion to a matrix which will remove attributes of all the arguments and coerce then into a common storage mode.
Related
This is probably a very simple problem but I have been struggling to search for this issue. Basically, I am using lapply to convert the column names to upper in a list of dataframes. My first attempt did not work, however adding ;x works. What exactly is going on?
This does not work:
df.list <- lapply(df.list,function(x) colnames(x) <- toupper(colnames(x)))
This does:
df.list <- lapply(df.list,function(x) {colnames(x) <- toupper(colnames(x));x})
Since you are modifying the object x (or in this case only the colnames of x) inside the function definition, you have to return the modified object x. This is happening by using ;x which can be read as a new line only returning the object x
I am currently trying to create a table from a list of variable names (something I feel should be relatively simple) and I can't for the life of me, figure out how to do it correctly.
I have a data table that I've named 'file' and there are a list of 3 variable names within this file. What I want to do is create a table of each variable and then rbind them together. For further context, these few lines of code will be worked into a much larger function. The list of variable names must be able to accommodate the number of variables the user defines.
I have tried the following:
file<-as.data.table(dt)
variable_list<-list("outcome", "type")
for (variable in variable_list){
var_table<-as.data.table(table(file$variable_list))
na_table<-as.data.table(table(is.na(file$variable)))
}
When I run the above code, R returns empty tables of var_table and na_table. What am I doing wrong?
An option is to loop over the 'variable_list, extract the column, apply tableandrbindwithindo.call`
do.call(rbind, lapply(variable_list, function(nm) table(file[[nm]])))
NOTE: assuming that the levels of the columns are the same
If the levels are not the same, make it same by converting the columns to factor with levels specified
lvls <- na.omit(sort(unique(unlist(file[, unlist(variable_list), with = FALSE]))))
do.call(rbind, lapply(variable_list, function(nm)
table(factor(file[[nm]], levels = lvls))))
Or if we have a data.table, use the data.table methods
rbindlist(lapply(variable_list, function(nm) file[, .N,by = c(nm)]), fill = TRUE)
The problem (at least one of the problems) might be that you are attempting to use the $ operator incorrectly. You cannot substitute text values into the second argument. You can use its syntactic equivalent [[ instead of $, however. So this would be a possible improvement. (I've not tested it since you provided no test material.)
file<-as.data.table(dt)
variable_list<-list("outcome", "type")
for (variable in variable_list){
var_table<-as.data.table(table(file[[variable]])) # clearly not variable_list
na_table<-as.data.table(table(is.na(file[[variable]] )))
}
I'm guessing you might have done something like, ...
var_table <- file[, table(variable ) ]
... since data.table syntax evaluates text values in the environment of the file (which in this case is confusing named "file". It's better not to use such names, since in this case there's also an R function by that name.
I have a list of data frames. I want to use lapply on a specific column for each of those data frames, but I keep throwing errors when I tried methods from similar answers:
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")}
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can return values from function(a[[1]]$DIM) all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place you are better off using a for loop since lapply would require the <<- assignment symbol (<- doesn't work on lapply`). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
})
)
You have to use invisible() otherwise lapply would print the output on the console. The <<- assigns the vector runif(...) to the new created column.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
return(aList[[x]])
})
Also, may I suggest the use of seq_along(list) in lapply and for loops as opposed to 1:length(list) since it avoids unexpected behavior such as:
# no length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0.
I have a data.frame mapping which contains path and map.
I also have another data.frame DATA which contains the raw path and value.
EDIT: Path might have two components or more: e.g. "A>C" or "A>C>B"
set.seed(24);
DATA <- data.frame(
path=paste0(sample(LETTERS[1:3], 25, replace=TRUE), ">", sample(LETTERS[1:3], 25, replace=TRUE)),
value=rnorm(25)
)
mapping <- data.frame(path=c("A","B","C"), map=c("X","Y","Z"))
lapply(mapping, function (x) {
for (i in 1:nrow(DATA)) {
DATA$path[i] <- gsub(as.character(x["path"]),as.character(x["map"]),as.character(DATA$path[i]))
}
})
I'm trying to replace the path in DATA with the map value in mapping but this doesn't seem to be working for me.
"A>C" will be converted to "X>Z".
I understand that for loops are not good in R, but I can't think of another way to code it. Data size I'm working with is 6m row in DATA and 16k rows in mapping.
Clarification on Data: While the path consists of alphabets (ABC) now, the real path are actually domain names. Number of steps in a path is also not fixed at 2 and can be any number.
You can use chartr
DATA$path <- chartr('ABC', 'XYZ', DATA$path)
Or if we are using the data from 'mapping'
DATA$path <- chartr(paste(mapping$path, collapse=''),
paste(mapping$map, collapse=''), DATA$path)
Or using gsubfn
library(gsubfn)
pat <- paste0('[', paste(mapping$path, collapse=''),']')
indx <- setNames(as.character(mapping$map), mapping$path)
gsubfn(pat, as.list(indx), as.character(DATA$path))
Or a base R option based on #smci's comment
vapply(strsplit(as.character(DATA$path), '>'), function(x)
paste(indx[x], collapse=">"), character(1L))
Using data.table (1.9.5+), especially advisable b/c of the size of your data.
library(data.table)
setDT(DATA); setDT(mapping)
DATA[,paste0("path",1:2):=tstrsplit(path,split=">")]
setkey(DATA,path1)[mapping,new.path1:=i.map]
setkey(DATA,path2)[mapping,new.path2:=i.map]
DATA[,new.path:=paste0(new.path1,">",new.path2)]
If you want to get rid of the extra columns:
DATA[,paste0(c("","","new.","new."),"path",rep(1:2,2)):=NULL]
If you just want to overwrite path, use path on the LHS of the last line instead of new.path.
This could also be written more concisely:
library(data.table)
setDT(mapping)
setkey(setkey(setDT(DATA)[,paste0("path",1:2):=tstrsplit(path,split=">")
],path1)[mapping,new.path1:=i.map],path2
)[mapping,new.path:=paste0(new.path1,">",i.map)]
I think you're using the wrong apply.
mapply allows you to use two arguments to the function, here the path and the map. Note that in mapply, the argument FUN comes first. You also do not need to do this row by row, you can just do the entire column at once. Finally, in an apply the variables do not get updated as they do in a for loop, so you need to assign them in the .GlobalEnv. You can do this with an explicit call to assign() or using <<- which assigns them in the first place it finds them in the stack. In this case, that will be back in .GlobalEnv.
After defining mapping and DATA as you do above, try this.
head(DATA)
invisible(mapply( function (x,y) {
DATA$path <<- gsub(x,y,DATA$path)
},mapping$path, mapping$map))
head(DATA)
note that the call to invisible suppresses output from mapply.
If you really want to use lapply, you can. But you need to transpose mapping. You can do that but it will be converted to a matrix, so you have to convert it back. Then, you can just use the same tricks with <<- and not using a for loop as above to get this code:
invisible(lapply(as.data.frame(t(mapping)), function (x) {
DATA$path <<- gsub(x[1],x[2],DATA$path)
}))
head(DATA)
Thanks for sharing, I learned a lot answering this question.
Sometimes I have code which references a specific dataset based on some variable ID. I have then been creating lines of code using paste0, and then eval(parse(...)) that line to execute the code. This seems to be getting sloppy as the length of the code increases. Are there any cleaner ways to have dynamic data reference?
Example:
dataset <- "dataRef"
execute <- paste0("data.frame(", dataset, "$column1, ", dataset, "$column2)")
eval(parse(execute))
But now imagine a scenario where dataRef would be called for 1000 lines of code, and sometimes needs to be changed to dataRef2 or dataRefX.
Combining the comments of Jack Maney and G.Grothendieck:
It is better to store your data frames that you want to access by a variable in a list. The list can be created from a vector of names using get:
mynames <- c('dataRef','dataRef2','dataRefX')
# or mynames <- paste0( 'dataRef', 1:10 )
mydfs <- lapply( mynames, get )
Then your example becomes:
dataset <- 'dataRef'
mydfs[[dataset]][,c('column1','column2')]
Or you can process them all at once using lapply, sapply, or a loop:
mydfs2 <- lapply( mydfs, function(x) x[,c('column1','column2')] )
#G.Grothendieck has shown you how to use get and [ to elevate a character value and return the value of a named object and then reference named elements within that object. I don't know what your code was intended to accomplish since the result of executing htat code would be to deliver values to the console, but they would not have been assigned to a name and would have been garbage collected. If you wanted to use three character values: objname, colname1 and colname2 and those columns equal to an object named after a fourth character value.
newname <- "newdf"
assign( newname, get(dataset)[ c(colname1, colname2) ]
The lesson to learn is assign and get are capable of taking character character values and and accessing or creating named objects which can be either data objects or functions. Carl_Witthoft mentions do.call which can construct function calls from character values.
do.call("data.frame", setNames(list( dfrm$x, dfrm$y), c('x2','y2') )
do.call("mean", dfrm[1])
# second argument must be a list of arguments to `mean`