Setting variable attributes via subsetting a dataframe - r

I want to set an attribute ("full.name") of certain variables in a data frame by subsetting the dataframe and iterating over a character vector. I tried two solutions but neither works (varsToPrint is a character vector containing the variables, questionLabels is a character vector containing the labels of questions):
Sample data:
jtiPrint <- data.frame(question1 = seq(5), question2 = seq(5), question3=seq(5))
questionLabels <- c("question1Label", "question2Label")
varsToPrint <- c("question1", "question2")
Solution 1:
attrApply <- function(var, label) {
`<-`(attr(var, "full.name"), label)
}
mapply(attrApply, jtiPrint[varsToPrint], questionLabels)
Solution 2:
i <- 1
for (var in jtiPrint[varsToPrint]) {
attr(var, "full.name") <- questionLabels[i]
i <- i + 1
}
Desired output (for e.g. variable 1):
attr(jtiPrint$question1, "full.name")
[1] "question1Label"
The problem in solution 2 seems to be that R sets the attribute on a new data frame containing only one variable (the indexed variable). However, I don't understand why solution 1 does not work. Any ideas how to fix either of these two approaches?

Solution 1 :
The replacement function is 'attr<-', not '<-'(attr(...)). You also need to set SIMPLIFY=FALSE (otherwise a matrix is returned instead of a list) and then call as.data.frame:
attrApply <- function(var, label) {
`attr<-`(var, "full.name", label)
}
df <- as.data.frame(mapply(attrApply,jtiPrint[varsToPrint],questionLabels,SIMPLIFY = FALSE))
> str(df)
'data.frame': 5 obs. of 2 variables:
$ question1: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question1Label"
$ question2: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question2Label"
Solution 2 :
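You need to set the attribute on the column of the data.frame itself; in your loop you are setting it on copies of the columns. Indexing by name (rather than by position) makes sure the right columns are modified even when varsToPrint does not refer to the first columns:
for(i in seq_along(varsToPrint)){
attr(jtiPrint[[varsToPrint[i]]],"full.name") <- questionLabels[i]
}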
> str(jtiPrint)
'data.frame': 5 obs. of 3 variables:
$ question1: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question1Label"
$ question2: atomic 1 2 3 4 5
..- attr(*, "full.name")= chr "question2Label"
$ question3: int 1 2 3 4 5
Note that the two approaches lead to different results: the mapply solution returns a subset of the original data.frame (so no column 3), while the second approach modifies the existing jtiPrint data.frame.
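If you like the functional style of solution 1 but want to keep the full data frame as in solution 2, the two can be combined. This is only a minimal sketch using the sample data from the question: Map() builds the labelled columns and the result is assigned back into jtiPrint by name, so question3 is left alone.

```r
jtiPrint <- data.frame(question1 = seq(5), question2 = seq(5), question3 = seq(5))
questionLabels <- c("question1Label", "question2Label")
varsToPrint <- c("question1", "question2")

# Map() returns a list of modified column copies; assigning that list back
# with jtiPrint[varsToPrint] <- ... replaces the original columns in place.
jtiPrint[varsToPrint] <- Map(
  function(column, label) {
    attr(column, "full.name") <- label
    column
  },
  jtiPrint[varsToPrint],
  questionLabels
)

attr(jtiPrint$question1, "full.name")
# [1] "question1Label"
```

The function has to return the modified column explicitly, because the attribute is set on a local copy inside the function.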

Related

R- expand.grid given a data.frame of parameter names and sequence definitions

I have a data.frame that arbitrarily defines parameter names and sequence boundaries:
dfParameterValues <- data.frame(ParameterName = character(), seqFrom = integer(), seqTo = integer(), seqBy = integer())
row1 <- data.frame(ParameterName = "parameterA", seqFrom = 1, seqTo = 2, seqBy = 1)
row2 <- data.frame(ParameterName = "parameterB", seqFrom = 5, seqTo = 7, seqBy = 1)
row3 <- data.frame(ParameterName = "parameterC", seqFrom = 10, seqTo = 11, seqBy = 1)
dfParameterValues <- rbind(dfParameterValues, row1)
dfParameterValues <- rbind(dfParameterValues, row2)
dfParameterValues <- rbind(dfParameterValues, row3)
I would like to use this approach to create a grid of c parameter columns based on the number of unique ParameterNames that contain r rows of all possible combinations of the sequences given by seqFrom, seqTo, and seqBy. The result would therefore look somewhat like this or should have a content like the following:
ParameterA ParameterB ParameterC
1 5 10
1 5 11
1 6 10
1 6 11
1 7 10
1 7 11
2 5 10
2 5 11
2 6 10
2 6 11
2 7 10
2 7 11
Edit: Note that the parameter names and their numbers are not known in advance. The data.frame comes from elsewhere so I cannot use the standard static expand.grid approach and need something like a flexible function that creates the expanded grid based on any dataframe with the columns ParameterName, seqFrom, seqTo, seqBy.
I've been playing around with for loops (which is bad to begin with) and it hasn't led me to any elegant ideas. I can't seem to find a way to come up with the result by using tidyr without constructing the sequences separately first, either. Do you have any elegant approaches?
Bonus kudos for extending this to include not only numerical sequences, but vectors/sets of characters / other factors, too.
Many thanks!
Going off CPak's answer, you could use
my_table <- expand.grid(apply(dfParameterValues, 1, function(x) seq(as.numeric(x['seqFrom']), as.numeric(x['seqTo']), as.numeric(x['seqBy']))))
names(my_table) <- c("ParameterA", "ParameterB", "ParameterC")
my_table <- my_table[order(my_table$ParameterA, my_table$ParameterB), ]
@smanski's answer is technically correct (and should arguably be accepted since it motivated this), but it is also a good example of why you should be careful when using apply with data.frames. In this case, the frame contains at least one character column, so all columns are converted to character, hence the need for as.numeric. The safer alternative is to pull only the columns needed, with either of:
expand.grid(apply(dfParameterValues[,-1], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) ))
expand.grid(apply(dfParameterValues[,c("seqFrom","seqTo","seqBy")], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) ))
I prefer the second, because it only pulls what it needs and therefore what it "knows" should be numeric. (I find explicit is often safer.)
The reason this is happening is that apply silently converts the data to a matrix, so to see the effects, try:
str(as.matrix(dfParameterValues))
# chr [1:3, 1:4] "parameterA" "parameterB" "parameterC" " 1" " 5" ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "1" "2" "3"
# ..$ : chr [1:4] "ParameterName" "seqFrom" "seqTo" "seqBy"
str(as.matrix(dfParameterValues[c("seqFrom","seqTo","seqBy")]))
# num [1:3, 1:3] 1 5 10 2 7 11 1 1 1
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:3] "1" "2" "3"
# ..$ : chr [1:3] "seqFrom" "seqTo" "seqBy"
(Note the chr on the first and the num on the second.)
Neither one preserves the parameter names. To do that, just sandwich the call with setNames:
setNames(
expand.grid(apply(dfParameterValues[,c("seqFrom","seqTo","seqBy")], 1,
function(x) seq(x['seqFrom'], x['seqTo'], x['seqBy']) )),
dfParameterValues$ParameterName)
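One more caveat with apply: it simplifies its result to a matrix when every sequence happens to have the same length, in which case expand.grid no longer receives one vector per parameter. A possible way around this, sketched here under the assumption that the frame has exactly the columns from the question, is Map(), which always returns a list. The extra character entry (parameterD) is a hypothetical illustration of the "bonus" request, not part of the original data.

```r
dfParameterValues <- data.frame(
  ParameterName = c("parameterA", "parameterB", "parameterC"),
  seqFrom = c(1, 5, 10),
  seqTo   = c(2, 7, 11),
  seqBy   = c(1, 1, 1),
  stringsAsFactors = FALSE
)

# Map() calls seq(from, to, by) once per row and always returns a list,
# regardless of whether the sequences have equal lengths.
sequences <- Map(seq,
                 dfParameterValues$seqFrom,
                 dfParameterValues$seqTo,
                 dfParameterValues$seqBy)
names(sequences) <- dfParameterValues$ParameterName

grid <- expand.grid(sequences)
nrow(grid)
# [1] 12

# Hypothetical extension: a non-numeric set is just another list element.
sequences[["parameterD"]] <- c("low", "high")
grid2 <- expand.grid(sequences, stringsAsFactors = FALSE)
nrow(grid2)
# [1] 24
```

Because the list is named before it is passed to expand.grid, the parameter names carry over to the columns without a separate setNames call.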

List assignment for list with greater than three nesting

I have not been able to find a fix for this error. I have implemented work-arounds before, but I wonder if anyone here knows why it occurs.
The following returns no error, as expected:
q <- list()
q[["a"]][["b"]] <- 3
q[["a"]][["c"]] <- 4
However, when I add another level of nesting I get:
q <- list()
q[["a"]][["b"]][["c"]]<- 3
q[["a"]][["b"]][["d"]] <- 4
Error in q[["a"]][["b"]][["d"]] <- 4 : more elements supplied than there are to replace
To make this even more confusing if I add a fourth nested list I get:
q <- list()
q[["a"]][["b"]][["c"]][["d"]] <- 3
q[["a"]][["b"]][["c"]][["e"]] <- 4
Error in *tmp*[["c"]] : subscript out of bounds
I would have expected R to return the same error message for the triple nested list as for the quadruple nested list.
I first came across this a few months ago. I am running R 3.4.3.
If we check str(q) after the first pair of assignments, q is a list with a single element 'a'. The assignments have created a named vector in 'a' rather than a list.
q <- list()
q[["a"]][["b"]] <- 3
q[["a"]][["c"]] <- 4
str(q)
#List of 1
# $ a: Named num [1:2] 3 4
# ..- attr(*, "names")= chr [1:2] "b" "c"
is.vector(q$a)
#[1] TRUE
If we try to assign at the next level, R indexes that atomic vector by name ('b'), finds nothing there yet, and assigns the value under 'c'; the second assignment then tries to store more than one value in that single slot. One option is to create list elements by wrapping the value in list():
q <- list()
q[["a"]][["b"]][["c"]]<- list(3)
q[["a"]][["b"]][["d"]] <- list(4)
This returns 'q' as a list of 1 element ('a'), which is again a list of length 1 ('b'). As we assigned two values, '3' and '4', to 'c' and 'd', 'b' is a list of 2 elements:
str(q)
#List of 1
# $ a:List of 1
# ..$ b:List of 2
# .. ..$ c:List of 1
# .. .. ..$ : num 3
# .. ..$ d:List of 1
# .. .. ..$ : num 4
In this way, we can nest any number of lists:
q <- list()
q[["a"]][["b"]][["c"]][["d"]] <- list(3)
q[["a"]][["b"]][["c"]][["e"]] <- list(4)
Note: the expected output structure is not clear from the question.
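If the leaves should stay plain values rather than one-element lists, another option is to create the intermediate levels as lists before assigning into them. This is just a sketch of that idea, assuming the nesting depth is known up front; it avoids relying on how R fills in missing levels.

```r
# Create every intermediate level explicitly as a list
q <- list(a = list(b = list()))

# Assigning into an existing (empty) list keeps it a list
q[["a"]][["b"]][["c"]] <- 3
q[["a"]][["b"]][["d"]] <- 4

str(q)
#List of 1
# $ a:List of 1
#  ..$ b:List of 2
#  .. ..$ c: num 3
#  .. ..$ d: num 4
```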

Converting an R list with NULL sub-elements to a data frame

Say I have a list below
> str(lll)
List of 2
$ :List of 3
..$ Name : chr "Sghokbt"
..$ Title: NULL
..$ Value: int 7
$ :List of 3
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
How can I convert this list to a data frame as below?
> df
Name Title Value
1 Sghokbt <NA> 7
2 Sgnglio Mr 5
as.data.frame doesn't work, I suspect due to the NULL in the first list element. EDIT: I have also tried do.call(rbind, list) as suggested in another question, but the result is a matrix of lists, not a data frame.
To reproduce the list:
list(structure(list(Name = "Sghokbt", Title = NULL, Value = 7L), .Names = c("Name",
"Title", "Value")), structure(list(Name = "Sgnglio", Title = "Mr",
Value = 5), .Names = c("Name", "Title", "Value")))
I think I've found a solution myself.
My approach is to first convert all the sub-lists into dataframes, so I have a list of dataframes instead of list of lists. These dataframes will just drop the NULL variables.
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
The resultant list of dataframes:
> str(ldf)
List of 2
$ :'data.frame': 1 obs. of 2 variables:
..$ Name : chr "Sghokbt"
..$ Value: int 7
$ :'data.frame': 1 obs. of 3 variables:
..$ Name : chr "Sgnglio"
..$ Title: chr "Mr"
..$ Value: num 5
From here I get a little help from plyr.
require(plyr)
df <- ldply(ldf)
The result has the columns out of order, but I'm happy enough with it.
> str(df)
'data.frame': 2 obs. of 3 variables:
$ Name : chr "Sghokbt" "Sgnglio"
$ Value: num 7 5
$ Title: chr NA "Mr"
I won't accept this as an answer yet for now in case there is a better solution.
Tidyverse solution
Here's a solution with the tidyverse which might be more readable or at least more intuitive to read for those who are familiar with dplyr and purrr.
lll %>%
# apply to the whole list, and then convert into a tibble
map_df(~
# convert every list element to a char vector
as.character(.x) %>%
# convert the char vector to a tibble row
as_tibble_row(.name_repair = "unique")) %>%
# convert all "NULL" entries to NA
na_if("NULL") %>%
# set tibble names assuming all list entries contain the same names
set_names(lll[[1]] %>% names())
There are several tricks to note:
map_df cannot merge bare character vectors into a data frame, so each one is converted into a tibble row with as_tibble_row(). In theory you could name the vectors first, but as.character() drops the names, so that would need an extra conversion into a named vector.
For as_tibble_row(), you need to specify a .name_repair argument so that map_df can merge the tibble rows even though they have no names.
I'm truly grateful for the dplyr::na_if() function; you should be too!
lll[[1]] %>% names() is just one way to get the names of the first list entry, and it assumes the other list entries are named the same and in the same order. You should check that beforehand.
Details:
When you use na_if(), you elegantly replace this code by Ricky (which is totally fine but hard to remember):
ldf <- lapply(lll, function(x) {
nonnull <- sapply(x, typeof)!="NULL" # find all NULLs to omit
do.call(data.frame, c(x[nonnull], stringsAsFactors=FALSE))
})
data.frame(do.call(rbind, lll))
Name Title Value
1 Sghokbt NULL 7
2 Sgnglio Mr 5
do.call is useful in that it accepts a list of arguments. It executes rbind, which combines the lists observation by observation, and data.frame structures the output as needed. The weakness is that, because data frames also accept lists, the columns of the new object remain lists, which makes calculations on the elements difficult. Below is another option, though it is also potentially problematic.
By removing the NULL value first:
null.remove <- function(lst) {
lapply(lst, function(x) {x <- paste(x, ""); x})
}
newlist <- lapply(lll, null.remove)
asvec <- unlist(newlist)
col.length <- length(newlist[[1]])
data.frame(rbind(asvec[1:col.length],
asvec[(col.length+1):length(asvec)]))
Name Title Value
1 Sghokbt 7
2 Sgnglio Mr 5
'data.frame': 2 obs. of 3 variables:
$ Name : Factor w/ 2 levels "Sghokbt ","Sgnglio ": 1 2
$ Title: Factor w/ 2 levels " ","Mr ": 1 2
$ Value: Factor w/ 2 levels "5 ","7 ": 2 1
This method coerces a value onto the NULL elements in the list by pasting a space onto the existing object. Next unlist allows the list elements to be treated as a named vector. col.length takes note of how many variables there are for use in the new data frame. The last function call creates the data frame by using the col.length value to split the vector.
This is still an intermediate result. Before regular data frame operations can be done, the extra space will have to be trimmed off of the factors. The digits must also be coerced to the class numeric.
I can continue the process when I have another chance to update.
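For completeness, here is a base R sketch that avoids both the list columns and the string padding. It builds each column separately and substitutes NA for NULL along the way. Like the tidyverse answer, it assumes every sub-list has the same field names as the first one.

```r
lll <- list(
  list(Name = "Sghokbt", Title = NULL, Value = 7L),
  list(Name = "Sgnglio", Title = "Mr", Value = 5)
)

# Field names are taken from the first entry (assumed identical for all)
fields <- names(lll[[1]])

# For each field, collect the value from every sub-list, using NA for NULL;
# unlist() then coerces each column to a common atomic type.
df <- as.data.frame(
  lapply(setNames(fields, fields), function(f) {
    unlist(lapply(lll, function(x) if (is.null(x[[f]])) NA else x[[f]]))
  }),
  stringsAsFactors = FALSE
)

str(df)
```

The columns come out in the original order, with Title as a character column containing NA for the first row.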

Confusion about unpacking lists in R

I'm befuddled by how R is dealing with lists and data frames. For example:
agg = function() {
df1 = data.frame(a=1:5,b=1:5)
df2 = data.frame(a=11:15,b=11:15)
return(list(df1, df2))
}
res = agg()
# returns NULL
res[1]$a
# returns 1:5
res[[1]]$a
I don't understand why the first element of res is not a data frame; rather, I need double-referencing to get at the elements. I read Hadley Wickham's excellent Data Structures chapter in his Advanced R website, but still can't figure out what's up with this example. Can anyone explain what I'm missing?
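Single square brackets [] return a subset of the same type as the object being indexed, so on a list they return a (shorter) list. Double square brackets [[]] extract a single element from a list. You have a list, so res[1] is still a list rather than the data frame it contains: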
is.list(res)
# [1] TRUE
str(res)
# List of 2
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ a: int [1:5] 1 2 3 4 5
# ..$ b: int [1:5] 1 2 3 4 5
# $ :'data.frame': 5 obs. of 2 variables:
# ..$ a: int [1:5] 11 12 13 14 15
# ..$ b: int [1:5] 11 12 13 14 15
See ?[, vectors, and lists for more information. The following SO posts might also help:
What are the differences between R vector and R list data types
Learning R for someone used to MATLAB, and confusion with R data types
how to understand list(list(object)) in r?
Subsetting a list with single brackets gives a list, thus res[1] returns a list (of length one) rather than a data frame.
You are looking for the first component of the list, which is stored in res[[1]]. Thus res[[1]]$a works.
E.g., take a look at the following
res[[1]]$a
res[1][[1]]$a
res[1][1][[1]]$a
res[1][1][1][[1]]$a
They all return column a of the first component of the list. In each case the repeated [1] just returns the same length-one list again, and the final [[1]] extracts the first element of res.
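A quick way to see the difference for yourself is to compare the classes of the two results. This is a small sketch that rebuilds res from the question without the wrapper function:

```r
res <- list(data.frame(a = 1:5, b = 1:5),
            data.frame(a = 11:15, b = 11:15))

class(res[1])       # "list": a length-one list holding the data frame
class(res[[1]])     # "data.frame": the element itself

is.null(res[1]$a)   # TRUE: the length-one list has no element named "a"
res[[1]]$a          # the column of the first data frame
```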
Hope that makes sense.

Write a list with multiple dataframes

I'm trying to build a list containing multiple dataframes (more than 200 dataframes). For that purpose I use the following syntax:
> list.name <- list(ls(pattern="dfname*"))
This lets me create list.name with multiple objects, and when I try to print the content of the list there is no problem:
> list.name
[[1]]
[1] "dfname1" "dfname2" "dfname3" "dfname4" "dfname5"
.....
[200] "dfname200" "dfname201" "dfname202" ....
But when I try to look at a specific dataframe in list.name, I can't see the dataframe's values. For example:
> list.name[[5]]
Error in list.name[[5]] : subscript out of bounds
or
> list.name[2]
[[1]]
NULL
I need to build a list that lets me do other operations, like viewing the str of each dataframe, exporting each dataframe to csv, etc.
Could you please give me some suggestion? Thanks in advance!
ls() will just return a character vector with the names of the matching data.frames. If you actually want to create a list of the data.frames you should use
dflist <- mget(ls(pattern="dfname*"))
or, if you do just want the names, why not keep them in vector form rather than converting to a list:
list.name <- ls(pattern="dfname*")
Then you can extract each name with
list.name[1]
rather than using the double-bracket syntax. There's really no need for a list in that case.
You have simply created a list of length 1, that contains a vector of 200 names. You haven't stored the dataframes themselves. To actually create a list of all the data frames, you can try
lists <- lapply(ls(pattern="dfname*"), get)
You are making a list out of the character vector of object names. You want to get() the object behind each name and make a list from those. In this case, you want the mget() function to get lots of objects at once from their names. This returns a list
## dummy data
dfname1 <- dfname2 <- dfname3 <- data.frame(A = 1:3, B = LETTERS[1:3])
dfnames <- ls(pattern="dfname*")
list.name <- mget(dfnames)
str(list.name)
> str(list.name)
List of 3
$ dfname1:'data.frame': 3 obs. of 2 variables:
..$ A: int [1:3] 1 2 3
..$ B: Factor w/ 3 levels "A","B","C": 1 2 3
$ dfname2:'data.frame': 3 obs. of 2 variables:
..$ A: int [1:3] 1 2 3
..$ B: Factor w/ 3 levels "A","B","C": 1 2 3
$ dfname3:'data.frame': 3 obs. of 2 variables:
..$ A: int [1:3] 1 2 3
..$ B: Factor w/ 3 levels "A","B","C": 1 2 3
list.name[[ dfnames[1] ]]
> list.name[[ dfnames[1] ]]
A B
1 1 A
2 2 B
3 3 C
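Once the data frames are in a named list, the follow-up tasks from the question (str of each data frame, export to csv) become simple loops. The following is only a sketch: the two dummy data frames, the anchored pattern "^dfname[0-9]+$" and the temporary output directory are assumptions for illustration, so adjust them to your own objects and paths.

```r
## dummy data (stand-ins for the real data frames)
dfname1 <- data.frame(A = 1:3, B = LETTERS[1:3])
dfname2 <- data.frame(A = 4:6, B = LETTERS[4:6])

# pattern is a regular expression; anchoring it avoids accidental matches
dflist <- mget(ls(pattern = "^dfname[0-9]+$"))

# Show the structure of every data frame
invisible(lapply(dflist, str))

# Write each data frame to <name>.csv in a temporary directory
outDir <- tempdir()
for (nm in names(dflist)) {
  write.csv(dflist[[nm]], file.path(outDir, paste0(nm, ".csv")),
            row.names = FALSE)
}
```

Since mget() names the list elements after the objects, the names can be reused directly as file names.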
