Confusion about unpacking lists in R - r

I'm befuddled by how R is dealing with lists and data frames. For example:
agg = function() {
df1 = data.frame(a=1:5,b=1:5)
df2 = data.frame(a=11:15,b=11:15)
return(list(df1, df2))
}
res = agg()
# returns NULL
res[1]$a
# returns 1:5
res[[1]]$a
I don't understand why the first element of res is not a data frame; rather, I need double-referencing to get at the elements. I read Hadley Wickham's excellent Data Structures chapter in his Advanced R website, but still can't figure out what's up with this example. Can anyone explain what I'm missing?

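Single square brackets [] return a subset of the same type as the object they are applied to: used on a list, they return a (shorter) list. Double square brackets [[]] extract a single element. res is a list, so res[1] is a list of length one that contains the data frame, not the data frame itself, and res[1]$a therefore returns NULL: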
is.list(res)
# [1] TRUE
str(res)
# List of 2
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ a: int [1:5] 1 2 3 4 5
# ..$ b: int [1:5] 1 2 3 4 5
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ a: int [1:5] 11 12 13 14 15
# ..$ b: int [1:5] 11 12 13 14 15
See ?"[" and the R documentation on vectors and lists for more information. The following SO posts might also help:
What are the differences between R vector and R list data types
Learning R for someone used to MATLAB, and confusion with R data types
how to understand list(list(object)) in r?

res[1] is a list of length one that contains the first data frame, so res[1] returns a list.
You are looking for the first component of the list itself, which is res[[1]]. Thus res[[1]]$a works.
E.g., take a look at the following
res[[1]]$a
res[1][[1]]$a
res[1][1][[1]]$a
res[1][1][1][[1]]$a
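They all return column a of the first component of the list. Subsetting a one-element list with [1] gives back the same one-element list, so res[1], res[1][1] and res[1][1][1] are identical, and the final [[1]] extracts the first element of res from each of them.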
Hope that makes sense.
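To make the difference concrete, here is a small sketch that rebuilds res directly (without the agg() wrapper) and checks what each form of indexing returns. The object named is only introduced for illustration.

```r
# Rebuild the list from the question
res <- list(data.frame(a = 1:5, b = 1:5),
            data.frame(a = 11:15, b = 11:15))

class(res[1])    # "list": a sub-list of length one
class(res[[1]])  # "data.frame": the element itself

res[1]$a         # NULL: the sub-list has no element called "a"
res[[1]]$a       # 1 2 3 4 5

# Naming the elements (illustrative) lets you use $ on the outer list too
named <- list(first = res[[1]], second = res[[2]])
named$first$a    # same values as res[[1]]$a
```

A rule of thumb that may help: [ keeps the container, [[ opens it.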

Related

How do I matricise a column/vector (applying a function like sum/diff/boolean)? [duplicate]

I am trying to create a data.frame from which to create a graph. I have a function and two vectors that I want to use as the two inputs. This is a bit simplified, but basically all I have is:
relGPA <- seq(-1.5,1.5,.2)
avgGPA <- c(-2,0,2)
f <- function(relGPA, avgGPA) 1/(1+exp(sum(relGPA*pred.model$coef[1],avgGPA*pred.model$coef[2])))
and all I want is a data.frame with 3 columns for the avgGPA values, and 16 rows for the relGPA values with the resulting values in the cells.
I apologize for how basic this is, but I assure you I have tried to make this happen without your assistance. I have tried following the examples on the sapply and mapply man pages, but I'm just a little too new to R to see what I'm trying to do.
Thanks!
Cannot be tested with the information offered, but this should work:
expGPA <- outer(relGPA, avgGPA, FUN=f) # See below for way to make this "work"
Another useful function when you want to generate combinations is expand.grid and this would get you the "long form":
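expGPA2 <- expand.grid(relGPA, avgGPA)
# apply() passes each row to the function as a single vector, so unpack it for f
expGPA2$fn <- apply(expGPA2, 1, function(x) f(x[1], x[2]))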
The long form is what lattice and ggplot will expect as input format for higher level plotting.
EDIT: It may be necessary to construct a more specific method for passing column references to the function, as pointed out by djhurio and solved by Sam Swift with the Vectorize strategy. In the case of apply, the sum function would work out of the box as described above, but the division operator would not, so here is a further example that can be generalized to more complex functions with multiple arguments. Each row arrives in the apply()-ed function as a single vector x, so the columns are referenced by position (x[1], x[2]) or by name (x["Var1"]) rather than with $:
> expGPA2$fn <- apply(expGPA2, 1, function(x) x[1]/x[2])
> str(expGPA2)
'data.frame': 48 obs. of 3 variables:
$ Var1: num -1.5 -1.3 -1.1 -0.9 -0.7 ...
$ Var2: num -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
$ fn : num 0.75 0.65 0.55 0.45 0.35 ...
- attr(*, "out.attrs")=List of 2
..$ dim : int 16 3
..$ dimnames:List of 2
.. ..$ Var1: chr "Var1=-1.5" "Var1=-1.3" "Var1=-1.1" "Var1=-0.9" ...
.. ..$ Var2: chr "Var2=-2" "Var2= 0" "Var2= 2"
Edit2: (2013-01-05) Looking at this a year later, I realized that Sam Swift's function could be vectorized by making its body use "+" instead of sum:
1/(1+exp( relGPA*pred.model$coef[1] + avgGPA*pred.model$coef[2])) # all vectorized fns
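Since pred.model is not available in the question, the following sketch uses made-up coefficients (coefs is a hypothetical stand-in for pred.model$coef) to show both routes: the vectorized "+" version passed straight to outer, and the original sum-based version wrapped in Vectorize. The column names are also just an illustrative choice.

```r
# Hypothetical coefficients standing in for pred.model$coef
coefs <- c(0.8, -0.3)

relGPA <- seq(-1.5, 1.5, 0.2)
avgGPA <- c(-2, 0, 2)

# Vectorized version: "+" works element-wise, so outer() can call it once
f_vec <- function(relGPA, avgGPA) {
  1 / (1 + exp(relGPA * coefs[1] + avgGPA * coefs[2]))
}
m <- outer(relGPA, avgGPA, FUN = f_vec)

# Original form: sum() collapses its inputs, so wrap it with Vectorize()
f_sum <- function(relGPA, avgGPA) {
  1 / (1 + exp(sum(relGPA * coefs[1], avgGPA * coefs[2])))
}
m2 <- outer(relGPA, avgGPA, FUN = Vectorize(f_sum))

# One row per relGPA value, one column per avgGPA value
expGPA <- as.data.frame(m)
names(expGPA) <- paste0("avgGPA_", avgGPA)
dim(expGPA)
```

Both matrices should agree; the vectorized form is usually the faster of the two because Vectorize calls the function once per cell.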

Setting variable attributes via subsetting a dataframe

I want to set an attribute ("full.name") of certain variables in a data frame by subsetting the dataframe and iterating over a character vector. I tried two solutions but neither works (varsToPrint is a character vector containing the variables, questionLabels is a character vector containing the labels of questions):
Sample data:
jtiPrint <- data.frame(question1 = seq(5), question2 = seq(5), question3=seq(5))
questionLabels <- c("question1Label", "question2Label")
varsToPrint <- c("question1", "question2")
Solution 1:
attrApply <- function(var, label) {
`<-`(attr(var, "full.name"), label)
}
mapply(attrApply, jtiPrint[varsToPrint], questionLabels)
Solution 2:
i <- 1
for (var in jtiPrint[varsToPrint]) {
attr(var, "full.name") <- questionLabels[i]
i <- i + 1
}
Desired output (for e.g. variable 1):
attr(jtiPrint$question1, "full.name")
[1] "question1Label"
The problem in solution 2 seems to be that R sets the attribute on a new data frame containing only one variable (the indexed variable). However, I don't understand why solution 1 does not work. Any ideas how to fix either of these two approaches?
Solution 1 :
The function is `attr<-`, not `<-`(attr(...)). You also need to set SIMPLIFY = FALSE (otherwise a matrix is returned instead of a list) and then call as.data.frame:
attrApply <- function(var, label) {
`attr<-`(var, "full.name", label)
}
df <- as.data.frame(mapply(attrApply,jtiPrint[varsToPrint],questionLabels,SIMPLIFY = FALSE))
> str(df)
'data.frame': 5 obs. of 2 variables:
$ question1: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question1Label"
$ question2: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question2Label"
Solution 2 :
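You need to set the attribute on the column of the data.frame itself; your loop sets the attribute on copies of the columns. Index by the variable name so the loop also works when the selected variables are not the first columns:
for (i in seq_along(varsToPrint)) {
attr(jtiPrint[[varsToPrint[i]]], "full.name") <- questionLabels[i]
}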
> str(jtiPrint)
'data.frame': 5 obs. of 3 variables:
$ question1: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question1Label"
$ question2: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question2Label"
$ question3: int 1 2 3 4 5
Note that the two approaches lead to different results: the mapply solution returns a new data.frame containing only the selected columns (so no question3), while the loop modifies the existing jtiPrint data.frame in place.
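As a self-contained sketch of the in-place approach, the snippet below pairs each variable name with its label and then checks the result. The helper vector labels is introduced here for illustration and is not part of the original question.

```r
jtiPrint <- data.frame(question1 = seq(5), question2 = seq(5), question3 = seq(5))
questionLabels <- c("question1Label", "question2Label")
varsToPrint <- c("question1", "question2")

# Pair each variable name with its label so the two cannot drift apart
labels <- setNames(questionLabels, varsToPrint)

for (v in names(labels)) {
  # Assign through jtiPrint[[v]] so the data frame itself is modified
  attr(jtiPrint[[v]], "full.name") <- labels[[v]]
}

attr(jtiPrint$question1, "full.name")  # label set on the real column
attr(jtiPrint$question3, "full.name")  # NULL: this column was not touched
```

Keep in mind that custom attributes on a vector are generally dropped when the vector is subset, so it may be worth re-checking them after filtering rows.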

Converting an R list with NULL sub-elements to a data frame

Say I have a list below
> str(lll)
List of 2
$ :List of 3
..$ Name : chr "Sghokbt"
..$ Title: NULL
..$ Value: int 7
$ :List of 3
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
How can I convert this list to a data frame as below?
> df
Name Title Value
1 Sghokbt <NA> 7
2 Sgnglio Mr 5
as.data.frame doesn't work, I suspect due to the NULL in the first list element. EDIT: I have also tried do.call(rbind, list) as suggested in another question, but the result is a matrix of lists, not a data frame.
To reproduce the list:
list(structure(list(Name = "Sghokbt", Title = NULL, Value = 7L), .Names = c("Name",
"Title", "Value")), structure(list(Name = "Sgnglio", Title = "Mr",
Value = 5), .Names = c("Name", "Title", "Value")))
I think I've found a solution myself.
My approach is to first convert all the sub-lists into dataframes, so I have a list of dataframes instead of list of lists. These dataframes will just drop the NULL variables.
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
The resultant list of dataframes:
> str(ldf)
List of 2
$ :'data.frame': 1 obs. of 2 variables:
..$ Name : chr "Sghokbt"
..$ Value: int 7
$ :'data.frame': 1 obs. of 3 variables:
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
From here I get a little help from plyr.
require(plyr)
df <- ldply(ldf)
The result has the columns out of order, but I'm happy enough with it.
> str(df)
'data.frame': 2 obs. of 3 variables:
$ Name : chr "Sghokbt" "Sgnglio"
$ Value: num 7 5
$ Title: chr NA "Mr"
I won't accept this as an answer yet for now in case there is a better solution.
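For comparison, here is a base-R sketch of a slightly different idea: instead of dropping the NULL entries, replace them with NA before building the one-row data frames, so every row has the same columns and rbind can stack them. It assumes every sub-list has the same names and that each non-NULL entry has length one.

```r
lll <- list(
  list(Name = "Sghokbt", Title = NULL, Value = 7L),
  list(Name = "Sgnglio", Title = "Mr", Value = 5)
)

rows <- lapply(lll, function(x) {
  # Replace NULL entries with NA so the column is kept
  x[vapply(x, is.null, logical(1))] <- NA
  as.data.frame(x, stringsAsFactors = FALSE)
})

# All rows now share the same columns, so rbind keeps the original order
df <- do.call(rbind, rows)
str(df)
```

This keeps Name, Title, Value in their original order and avoids the plyr dependency, at the cost of relying on rbind to reconcile the column types.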
Tidyverse solution
Here's a solution with the tidyverse which might be more readable or at least more intuitive to read for those who are familiar with dplyr and purrr.
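library(dplyr)   # %>%, mutate(), across(), na_if()
library(purrr)   # map_df(), set_names()
library(tibble)  # as_tibble_row()
lll %>%
# apply to the whole list, and then convert into a tibble
map_df(~
# convert every list element to a char vector
as.character(.x) %>%
# convert the char vector to a tibble row
as_tibble_row(.name_repair = "unique")) %>%
# convert all "NULL" entries to NA, column by column
# (na_if() on a whole data frame no longer works in dplyr >= 1.1.0)
mutate(across(everything(), ~ na_if(.x, "NULL"))) %>%
# set tibble names assuming all list entries contain the same names
set_names(lll[[1]] %>% names())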
There are several tricks to note:
map_df cannot merge bare character vectors into a data frame, so each one is first converted into a one-row tibble with as_tibble_row(). In principle you could name these vectors, but as.character() drops the names attribute, so the names are restored at the end with set_names(). Note that every column ends up as character, including Value.
For as_tibble_row(), you need to specify a .name_repair argument so that map_df can merge the tibble rows even though they have no names.
I'm truly grateful for the dplyr::na_if() function, and you should be too!
lll[[1]] %>% names() is just one way to get the names of the first list entry, and it assumes the other list entries have the same names in the same order. You should check that beforehand.
Details:
when you use na_if(), you so elegantly replace this code by Ricky (which is totally fine but hard to remember):
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
data.frame(do.call(rbind, lll))
Name Title Value
1 Sghokbt NULL 7
2 Sgnglio Mr 5
do.call is useful in that it accepts a list of arguments: do.call(rbind, lll) calls rbind on the sub-lists and combines them observation by observation, and data.frame then structures the output. The weakness is that the resulting columns are list columns rather than atomic vectors (note the NULL still showing in Title), which makes it difficult to perform calculations on the elements. Below is another option, which is also potentially problematic.
By removing the NULL value first:
null.remove <- function(lst) {
lapply(lst, function(x) {x <- paste(x, ""); x})
}
newlist <- lapply(lll, null.remove)
asvec <- unlist(newlist)
col.length <- length(newlist[[1]])
data.frame(rbind(asvec[1:col.length],
asvec[(col.length+1):length(asvec)]))
Name Title Value
1 Sghokbt 7
2 Sgnglio Mr 5
'data.frame': 2 obs. of 3 variables:
$ Name : Factor w/ 2 levels "Sghokbt ","Sgnglio ": 1 2
$ Title: Factor w/ 2 levels " ","Mr ": 1 2
$ Value: Factor w/ 2 levels "5 ","7 ": 2 1
This method coerces a value onto the NULL elements in the list by pasting a space onto the existing object. Next unlist allows the list elements to be treated as a named vector. col.length takes note of how many variables there are for use in the new data frame. The last function call creates the data frame by using the col.length value to split the vector.
This is still an intermediate result. Before regular data frame operations can be done, the extra space will have to be trimmed off of the factors. The digits must also be coerced to the class numeric.
I can continue the process when I have another chance to update.

Write a list with multiple dataframes

I'm trying to create a list containing multiple data frames (more than 200). For that purpose I use the following syntax:
> list.name <- list(ls(pattern="dfname*"))
This lets me create list.name with multiple objects, and when I print the contents of the list there is no problem:
> list.name
[[1]]
[1] "dfname1" "dfname2" "dfname3" "dfname4" "dfname5"
.....
[200] "dfname200" "dfname201" "dfname202" ....
But when I try to look at a specific data frame in list.name, I can't see its values. For example:
> list.name[[5]]
Error in list.name[[5]] : subscript out of bounds
or
> list.name[2]
[[1]]
NULL
I need to build a list that lets me do other operations, such as viewing the str of each data frame and exporting each data frame to CSV.
Could you please give me some suggestion? Thanks in advance!
ls() will just return a character vector with the names of the matching data.frames. If you actually want to create a list of the data.frames you should use
dflist <- mget(ls(pattern="dfname*"))
or, if you do just want the names, why not keep them in vector form rather than converting to a list
list.name <- ls(pattern="dfname*")
then you can extract each name with
list.name[1]
rather than using the double-bracket syntax. There's really no need for a list in that case.
You have simply created a list of length 1, that contains a vector of 200 names. You haven't stored the dataframes themselves. To actually create a list of all the data frames, you can try
lists <- lapply(ls(pattern="dfname*"), get)
You are making a list out of the character vector of object names. You want to get() the object behind each name and make a list from those. In this case, you want the mget() function to get lots of objects at once from their names. This returns a list
## dummy data
dfname1 <- dfname2 <- dfname3 <- data.frame(A = 1:3, B = LETTERS[1:3])
dfnames <- ls(pattern="dfname*")
list.name <- mget(dfnames)
> str(list.name)
List of 3
$ dfname1:'data.frame': 3 obs. of 2 variables:
..$ A: int [1:3] 1 2 3
..$ B: Factor w/ 3 levels "A","B","C": 1 2 3
$ dfname2:'data.frame': 3 obs. of 2 variables:
..$ A: int [1:3] 1 2 3
..$ B: Factor w/ 3 levels "A","B","C": 1 2 3
$ dfname3:'data.frame': 3 obs. of 2 variables:
..$ A: int [1:3] 1 2 3
..$ B: Factor w/ 3 levels "A","B","C": 1 2 3
> list.name[[ dfnames[1] ]]
A B
1 1 A
2 2 B
3 3 C
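Once the data frames are in a named list, the follow-up tasks from the question (inspecting each one and exporting each to CSV) become simple loops. The sketch below uses two dummy data frames and writes into tempdir(); the output location and the stricter pattern "^dfname[0-9]+$" are assumptions made for the example. The pattern argument of ls() is a regular expression, so "dfname*" matches more than it appears to.

```r
# Dummy data standing in for the 200+ data frames
dfname1 <- data.frame(A = 1:3, B = LETTERS[1:3])
dfname2 <- data.frame(A = 4:6, B = LETTERS[4:6])

# Anchored regular expression: only names of the form dfname<digits>
df_names <- ls(pattern = "^dfname[0-9]+$")
dflist <- mget(df_names)

# Inspect the structure of every data frame
invisible(lapply(dflist, str))

# Export each data frame to its own CSV file, named after the object
out_dir <- tempdir()
paths <- file.path(out_dir, paste0(names(dflist), ".csv"))
for (i in seq_along(dflist)) {
  write.csv(dflist[[i]], paths[i], row.names = FALSE)
}
paths
```

Replacing tempdir() with a real directory would write the files somewhere permanent.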

