An R novice is once again seeking help.
General situation: I am currently writing a script, and I have several data frames per experiment.
The experiments vary in the time-steps of the measurements and in the number of reactors, so my script needs
flexibility in two dimensions to "massage" the data into the right shape for the desired tests and to draw the necessary data from multiple data frames.
Unfortunately I chose to use for loops to account for this, which I now see is bad practice in R,
but I have gotten too far to change direction now.
The problem: inside a for loop, I want each one-column matrix to be named after its object's name. I need them to stay in matrix format because of further functions I want to apply.
# A simple but non-flexible example of what I want to do:
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
# names the column header after the object's name
colnames(a1) <- "a1"
colnames(a2) <- "a2"
This works, but I need it to work within a for loop...
# here are my two flexible but non-working approaches
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
# should name object according to progress in loop
for(i in 1:2)
{
assign(colnames(paste("a",i,sep="",collapse="")),do.call("c",list(paste("a",i,sep=""))))
}
This isn't the proper use of assign and it throws an error.
The second attempt doesn't throw an error, but it doesn't work either; it creates empty objects.
# creates two matrix objects
a1 <- matrix(c(1,2,3,4,5))
a2 <- matrix(c(1,2,3,4,5))
# should name object according to progress in loop
for(i in 1:2)
{
assign(paste("a",i,sep="", colapse=""),do.call("colnames",list(paste("a",i,sep="", colapse=""))))
}
My conclusion: I do not understand the proper way of combining assign and colnames.
If anyone has a suggestion for how I could get this up and running, that would be awesome.
So far I have searched for: R combining assign and colnames inside for loop, R using assign and colnames, R naming data with for loops, ...
but unfortunately I didn't manage to extrapolate a solution to my problem.
The following is a function that, by default, will take any objects in the parent environment that have a name that starts with a followed by numbers, check that they are one column matrices, and if they are, name the columns with the name of the object.
a1 <- matrix(c(1:5))
a2 <- matrix(c(1:5))
name_cols()
a1
# a1
# [1,] 1
# [2,] 2
# ...
a2
# a2
# [1,] 1
# [2,] 2
# ...
And here is the code:
name_cols <- function(pattern = "^a[0-9]+", env = parent.frame()) {
  lapply(
    ls(pattern = pattern, envir = env),
    function(x) {
      var <- get(x, envir = env)                      # fetch the object by name
      if (is.matrix(var) && identical(ncol(var), 1L)) {
        colnames(var) <- x                            # name the column after the object
        assign(x, var, envir = env)                   # write the modified object back
      }
    })
  invisible(NULL)
}
Note I chose specification by pattern, but you can easily change this to specify the names of the variables instead (instead of using ls, just pass the names and lapply over those), or potentially even the objects themselves (though you have to use substitute for that, and the wisdom of doing so becomes questionable).
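For reference, here is a sketch of the name-based variant (name_cols_by_name is just an illustrative name, not part of the answer above):
# same idea, but taking an explicit vector of variable names instead of a pattern
name_cols_by_name <- function(var_names, env = parent.frame()) {
  lapply(var_names, function(x) {
    var <- get(x, envir = env)
    if (is.matrix(var) && identical(ncol(var), 1L)) {
      colnames(var) <- x
      assign(x, var, envir = env)
    }
  })
  invisible(NULL)
}
a1 <- matrix(1:5); a2 <- matrix(1:5)
name_cols_by_name(c("a1", "a2"))
colnames(a1)
# [1] "a1"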
More generally, if you have several related objects on which you will be performing related analyses (e.g. in this case modifying columns), you should really consider storing them in lists rather than at the top level. If you do this, then you can easily use the built-in *apply functions to operate on all your objects at once. For example:
a.lst <- list(a1=matrix(1:5), a2=matrix(1:5))
# wrapping the call in setNames keeps the a1/a2 element names, which lapply over a character vector would otherwise drop
a.lst <- setNames(lapply(names(a.lst), function(x) {colnames(a.lst[[x]]) <- x; a.lst[[x]]}),
                  names(a.lst))
Related
I have a model object m1. I need to create 100 distinctly named copies so that I can adjust and plot each one. To create the copies, I currently do this:
m1recip1 <- m1
m1recip2 <- m1
m1recip3 <- m1
m1recip4 <- m1
m1recip5 <- m1
m1recip6 <- m1
m1recip7 <- m1
...
m1recip100 <- m1
I planned to create these through a loop, but this is less efficient because I only know how to do so by initializing all 100 objects before looping through them. I'm effectively looking for something similar to the macro facility in other languages (where m1recip&i would produce the names iteratively). I'm sure R can do this - how?
As mentioned above, reconsider saving many similarly structured objects in the global environment. Instead, use a named list, so that you maintain a single indexed object, and R has many handlers (i.e., the apply family) to run operations across all of its elements.
Specifically, consider replicate (a wrapper around sapply) to build the 100 m1 elements and setNames to name them accordingly. You lose no functionality of the object by saving it within a list.
model_list <- setNames(replicate(100, m1, simplify = FALSE),
                       paste0("m1recip", 1:100))
model_list$m1recip1
model_list$m1recip2
model_list$m1recip3
...
Instead of assigning m1 to 100 objects, we can create a list with 100 elements like the following:
m1recip_list <- lapply(1:100, function(x) m1)
We can then reference each element by element number m1recip_list[[10]] or apply a function to every element of the list using lapply:
lapply(m1recip_list, some_function)
You can dynamically create object names using the paste function in a loop, and you can assign them values using the assign function as opposed to the "<-" operator.
for (i in 1:100) {
  assign(paste("m1recip", i, sep = ""), m1)
}
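If you do go the assign route, the copies can later be gathered back into one named list with mget, for example (a sketch, assuming the objects were created in the global environment):
# collect the 100 separately assigned copies back into a single named list
model_list <- mget(paste0("m1recip", 1:100), envir = .GlobalEnv)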
I am trying to get to grips with R, and as an experiment I thought I would play around with some cricket data. In its rawest form it is a YAML file, which I turned into an R object using the yaml R package.
However, I now have a number of nested lists of uneven length that I want to turn into a data frame in R. I have tried a few approaches, such as writing loops to parse the data and using some of the functions in the tidyr package, but I can't seem to get it to work nicely.
I wondered if people knew of the best way to tackle this? Replicating the data structure here would be difficult, because the complexity comes from the multiple nested lists and the unevenness of their length (which would make for a very long code block). However, you can find the raw yaml data here: http://cricsheet.org/downloads/ (I was using the ODI internationals).
Thanks in advance!
Update
I have tried this:
1) Using tidyr - separate
d <- unnest(balls)
Name <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal","WicketFielder","WicketKind","PlayerOut")
a <- separate(d, x, Name, sep = ",",extra = "drop")
This basically uses the tidyr package to return a single-column data frame that I then try to separate. However, the problem is that there are sometimes extras variables in the middle that appear in some rows and not others, thereby throwing off the separation.
2) Creating vectors
ballsVector <- unlist(balls[[2]],use.names = FALSE)
names_vector <- c("Batsman","Bowler","NonStriker","RunsBatsman","RunsExtras","RunsTotal")
names(ballsVector) <- c(names_vector)
ballsMatrix <- matrix(ballsVector, nrow = 1, byrow = TRUE)
colnames(ballsMatrix) <- names_vector
The problem here is that the resulting vectors are uneven in length and therefore can't be combined into a data frame. It also suffers from the issue that there are sporadic variables in the middle of the dataset (as above).
Caveat: not a complete answer; an attempt to arrange the innings data.
plyr::rbind.fill may offer a solution to binding rows with a different number of columns.
I don't use tidyr, but below is some rough code to get the innings data into a data.frame. You could then loop this over all the yaml files in the directory.
# Download and unzip data
download.file("http://cricsheet.org/downloads/odis.zip", temp<- tempfile())
tmp <- unzip(temp)
# Create lists - use first game
library(yaml)
raw_dat <- yaml.load_file(tmp[[2]])
#names(raw_dat)
# Function to process list into dataframe
p_fun <- function(X) {
  team = X[[1]][["team"]]
  # function to process each list subelement that represents each throw
  fn <- function(...) {
    tmp = unlist(...)
    tmp = data.frame(ball = gsub("[^0-9]", "", names(tmp))[1], t(tmp))
    colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
    tmp
  }
  # loop over all throws
  lst = lapply(X[[1]][["deliveries"]], fn)
  cbind(team, plyr::rbind.fill(lst))
}
# Loop over each innings
dat <- plyr::rbind.fill(lapply(raw_dat$innings, p_fun))
Some explanation
The list structure and subsetting it. To get an idea of the structure of the list use
str(raw_dat) # but this gives a really long list of data
You can truncate this, to make it a bit more useful
str(raw_dat, 3)
length(raw_dat)
So there are three main list elements - meta, info, and innings. You can also see this with
names(raw_dat)
To access the meta data, you can use
raw_dat$meta
#or using `[[1]]` to access the first element of the list (see ?'[[')
raw_dat[[1]]
#and get sub-elements by either
raw_dat$meta$data_version
raw_dat[[1]][[1]] # you can also use the names of the list elements, e.g. [["data_version"]]
The main data is in the innings element.
str(raw_dat$innings, 3)
Look at the names in the list element
lapply(raw_dat$innings, names)
lapply(raw_dat$innings[[1]], names)
There are two list elements, each with sub-elements. You can access these as
raw_dat$innings[[1]][[1]][["team"]] # raw_dat$innings[[1]][["1st innings"]][["team"]]
raw_dat$innings[[2]][[1]][["team"]] # raw_dat$innings[[2]][["2nd innings"]][["team"]]
The above function parsed the deliveries data in raw_dat$innings. To see what it does, work through it from the inside.
Use one record to see how it works.
(Note that the lapply with p_fun loops over raw_dat$innings[[1]] and raw_dat$innings[[2]], so this is the outer loop, while the lapply with fn loops through the deliveries within an innings: the inner loop.)
X <- raw_dat$innings[[1]]
tmp <- X[[1]][["deliveries"]][[1]]
tmp
#create a named vector
tmp <- unlist(tmp)
tmp
# 0.1.batsman 0.1.bowler 0.1.non_striker 0.1.runs.batsman 0.1.runs.extras 0.1.runs.total
# "IR Bell" "DW Steyn" "MJ Prior" "0" "0" "0"
To use rbind.fill, the elements to bind together need to be data.frames. We also want to remove the leading numbers /
deliveries from the names, as otherwise we will end up with lots of uniquely named columns.
# this regex removes all non-numeric characters from the string
# you could then split this number into over and delivery
gsub("[^0-9]", "", names(tmp))
# this regex removes all numeric characters from the string -
# allowing consistent names across all the balls / deliveries
# (if I were better at regex I would have also removed the leading dots; see the sketch below)
gsub("[0-9]", "", names(tmp))
So for the first delivery in the first innings we have
tmp = data.frame(ball=gsub("[^0-9]", "", names(tmp))[1], t(tmp))
colnames(tmp) = gsub("[0-9]", "", colnames(tmp))
tmp
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
To see how the lapply works, use the first three deliveries (you will need to run the function fn in your workspace)
lst = lapply(X[[1]][["deliveries"]][1:3], fn )
lst
# [[1]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 01 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[2]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 02 IR Bell DW Steyn MJ Prior 0 0 0
#
# [[3]]
# ball X..batsman X..bowler X..non_striker X..runs.batsman X..runs.extras X..runs.total
# 1 03 IR Bell DW Steyn MJ Prior 3 0 3
So we end up with a list element for every delivery within an innings. We then use rbind.fill to create one data.frame.
If I were going to parse every yaml file I would use a loop.
Use the first three records as an example, and also add the match date.
tmp <- unzip(temp)[2:4]
all_raw_dat <- vector("list", length=length(tmp))
for (i in seq_along(tmp)) {
  d = yaml.load_file(tmp[i])
  all_raw_dat[[i]] <- cbind(date = d$info$date, plyr::rbind.fill(lapply(d$innings, p_fun)))
}
Then use rbind.fill.
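That is, something along these lines (all_dat is just an illustrative name):
# stack the per-file data.frames into one
all_dat <- plyr::rbind.fill(all_raw_dat)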
Q1. from comments
A small example with rbind.fill
a <- data.frame(x=1, y=2)
b <- data.frame(x=2, z=1)
rbind(a, b) # error as the names don't match
plyr::rbind.fill(a, b)
rbind.fill doesn't go back and add or update the inputs with the extra columns (a still doesn't have a column z). Think of it as creating an empty data frame with a column for each unique column name found across the input data frames, i.e. unique(c(names(a), names(b))). The values are then filled in for each row where possible, and left missing (NA) otherwise.
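For the small example above, the filled result should look roughly like this:
plyr::rbind.fill(a, b)
#   x  y  z
# 1 1  2 NA
# 2 2 NA  1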
Overall situation:
The interface of my measuring devices can't save any information beyond the name of the csv file it generates while measuring. So I used a systematic set of abbreviations to encode the changing parameters (concentrations, enzymes, feed stocks, buffers, etc.). Combined, these form the titles of my csv files, which in turn become the names of my data.frames. I am now trying to read out those names and combine them with the rest of the data to build tables that I can use for regressions.
The Issue:
I just noticed that I lose the names of my data.frames inside the list.
I could rename them after each call to lapply, but this doesn't seem like a proper solution.
I found a suggestion to use llply, but I can't teach it to keep the names either.
# loads plyr package
library(plyr)
# generates a showcase list of dataframes,
data <- list(data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)),data.frame(c(1,2),c(3,3)))
# assigns names to the data frames in the list
names(data) <- list("one","two", "tree", "four")
The following uses the data frame's name to pass "o" to a column; this part works fine,
but after running it the names are lost:
data <- lapply(X = seq_along(data),
FUN = function(i){
x <- data[[i]]
if (gsub("([(a-z)]).*","\\1", names(data)[i]) == "o") {x$enz <- "o"}
return(x)},
USE.NAMES = TRUE)
The same thing happens with llply: it operates as expected but doesn't keep the names either, although I thought it would solve that particular problem (quote: "llply is equivalent to lapply except that it will preserve labels and can display a progress bar.").
data <- llply(seq_along(data), function(i){
x <- data[[i]]
if (gsub("([(a-z)]).*","\\1", names(data)[i]) == "o") {x$enz <- "o"}
return(x)})
I would very much appreciate a hint on how to solve this without something like
names(data) <- list.with.the.names
after each llply or lapply call.
Do something like this:
for (i in seq_along(data)) data[[i]]$name <- names(data)[i]
do.call(rbind, data)
# c.1..2. c.3..3. name
#one.1 1 3 one
#one.2 2 3 one
#two.1 1 3 two
#two.2 2 3 two
#tree.1 1 3 tree
#tree.2 2 3 tree
#four.1 1 3 four
#four.2 2 3 four
And continue from there.
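If you do want the element names preserved directly (without re-assigning them afterwards), one option, sketched here against the original named data list, is to loop over the data and its names together with Map, which keeps the names of its first argument:
data <- Map(function(x, nm) {
  if (gsub("([(a-z)]).*", "\\1", nm) == "o") x$enz <- "o"
  x
}, data, names(data))
names(data)
# [1] "one"  "two"  "tree" "four"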
(edited to reflect help...I'm not doing great with formatting, but appreciate the feedback)
I'm a bit stuck on what I suspect is an easy enough problem. I have multiple different data sets loaded into R, all of which have different numbers of observations, but all of which have three variables named "A1", "A2", and "A3". I want to create a new variable in each of the data frames that contains the value held in "A1" if A3 contains a value greater than zero, and the value held in "A2" if A3 contains a value less than zero. Seems simple enough, right?
My attempt at this code uses this faux-data:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=cbind(A1,A2,A3)
A3=runif(100,-1,1)
df2=cbind(A1,A2,A3)
I'm about a thousand percent sure that R has some functionality for creating the same named variable in multiple data frames, but I have tried doing this with lapply:
mylist=list(df1,df2)
lapply(mylist,function(x){
x$newVar=x$A1
x$newVar[x$A3>0]=x$A2[x$A3>0]
return(x)
})
But the newVar is not available for me once I leave the lapply loop. For example, if I ask for the mean of the new variable:
mean(df1$newVar)
[1] NA
Warning message:
In mean.default(df1$newVar) :
argument is not numeric or logical: returning NA
Any help would be appreciated.
Thank you.
Well first of all, df1 and df2 are not data.frames but matrices (the dollar syntax doesn't work on matrices).
In fact, if you do:
set.seed(1)
A1=seq(1,100,length=100)
A2=seq(-100,-1,length=100)
A3=runif(100,-1,1)
df1=as.data.frame(cbind(A1,A2,A3))
A3=runif(100,-1,1)
df2=as.data.frame(cbind(A1,A2,A3))
mylist=list(df1,df2)
lapply(mylist, function(x) {
  x$newVar = x$A1
  x$newVar[x$A3 > 0] = x$A2
})
the code almost works but gives some warnings. In fact, there's still an error in the last line of the function called by lapply. If you change it like this, it works as expected:
lapply(mylist, function(x) {
  x$newVar = x$A1
  x$newVar[x$A3 > 0] = x$A2[x$A3 > 0] # you need to subset x$A2 otherwise it's too long
  return(x) # better to state explicitly what's the return value
})
EDIT (as per comment):
As basically always happens in R, functions do not mutate existing objects but return brand-new objects.
So in this case df1 and df2 are still the same, but lapply returns a list with the expected two new data.frames, i.e.:
resultList <- lapply(mylist, function(x) {
  x$newVar = x$A1
  x$newVar[x$A3 > 0] = x$A2[x$A3 > 0]
  return(x)
})
newDf1 <- resultList[[1]]
newDf2 <- resultList[[2]]
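If you then want df1 and df2 in the workspace replaced by the modified versions, one way (a sketch) is to name the result list and push it back with list2env:
# overwrite df1 and df2 in the global environment with the new versions
list2env(setNames(resultList, c("df1", "df2")), envir = .GlobalEnv)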
I have lots of variables in R, all of type list
a100 = list()
a200 = list()
# ...
p700 = list()
Each variable is a complicated data structure:
a200$time$data # returns 1000 x 1000 matrix
Now, I want to apply code to each variable in turn. However, since R doesn't support pass-by-reference, I'm not sure what to do.
One idea I had was to create a big list of all these lists, i.e.,
biglist = list()
biglist[[1]] = a100
...
And then I could iterate over biglist:
for (i in 1:length(biglist)){
biglist[[i]]$newstuff = "profit"
# more code here
}
And finally, after the loop, go backwards so that existing code (that uses variable names) still works:
a100 = biglist[[1]]
# ...
The question is: is there a better way to iterate over a set of named lists? I have a feeling that I'm doing things horribly wrong. Is there something easier, like:
# FAKE, Idealized code:
foreach x in (a100, a200, ....){
x$newstuff = "profit"
}
a100$newstuff # "profit"
To walk over lists in parallel you can use mapply, which takes parallel lists and walks over them in lock-step. Furthermore, in a functional language you should emit the object that you want rather than modify the data structure within the function call.
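For instance, here is a minimal sketch under the question's setup (a100 and a200 as lists; the extra values are made up), using Map, the non-simplifying cousin of mapply:
a100 <- list(); a200 <- list()                   # placeholders from the question
lists  <- list(a100 = a100, a200 = a200)         # gather the variables into one named list
extras <- list(a100 = "profit", a200 = "loss")   # made-up parallel values to walk in lock-step
lists  <- Map(function(x, v) {x$newstuff <- v; x}, lists, extras)  # emit the modified copies
lists$a100$newstuff
# [1] "profit"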
You should use the sapply, apply, lapply, ... family of functions.
jim
jimmyb is quite right. lapply and sapply are specifically designed to work on lists, so they would work with your biglist as well. You shouldn't forget to return the object in the nested function though. An example:
X <- list(A=list(A1=1:2,A2=3:4),B=list(B1=5:6,B2=7:8))
lapply(X, function(i) {
  i$newstuff = "profit"
  return(i)
})
Now, as you said, R passes by value, so you have multiple copies of the data roaming around. If you work with really big lists, you might want to tone the memory usage down by working on each variable separately, using assign and get. The following is considered bad coding, but can sometimes be necessary to avoid memory trouble:
A <- X[[1]] ; B <- X[[2]] #make the data
list.names <- c("A","B")
for (i in list.names) {
  tmp <- get(i)
  tmp$newstuff <- "profit"
  assign(i, tmp)
  rm(tmp)
}
Make sure you are well aware of the implications this code has, as you're working within the global environment. If you need to do this more often, you might want to work with environments instead:
my.env <- new.env() # make the environment
my.env$A <- X[[1]];my.env$B <- X[[2]] # put vars in environment
for (i in list.names) {
  tmp <- get(i, envir = my.env)
  tmp$newstuff <- "profit"
  assign(i, tmp, envir = my.env)
  rm(tmp)
}
my.env$A
my.env$B
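And if you later need the contents as an ordinary list again, as.list works on environments (a sketch; note that the element order is not guaranteed):
back.to.list <- as.list(my.env)   # convert the environment back to a named list
names(back.to.list)
# [1] "A" "B"  (order may differ)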