collapse data frame with embedded matrices [duplicate] - r

Under certain conditions, R generates data frames that contain matrices as elements. This requires some determination to do by hand, but happens e.g. with the results of an aggregate() call where the aggregation function returns multiple values:
set.seed(101)
d0 <- data.frame(g=factor(rep(1:2,each=20)), x=rnorm(20))
d1 <- aggregate(x~g, data=d0, FUN=function(x) c(m=mean(x), s=sd(x)))
str(d1)
## 'data.frame': 2 obs. of 2 variables:
## $ g: Factor w/ 2 levels "1","2": 1 2
## $ x: num [1:2, 1:2] -0.0973 -0.0973 0.8668 0.8668
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr "m" "s"
This makes a certain amount of sense, but can make trouble for downstream processing code (for example, ggplot2 doesn't like it). The printed representation can also be confusing if you don't know what you're looking at:
d1
## g x.m x.s
## 1 1 -0.09731741 0.86678436
## 2 2 -0.09731741 0.86678436
I'm looking for a relatively simple way to collapse this object to a regular three-column data frame (either with names g, m, s, or with names g, x.m, x.s ...).
I know this problem won't arise with tidyverse (group_by + summarise), but am looking for a base-R solution.
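One base-R way to do this, as a minimal sketch: a data frame is a list, so passing its components back through data.frame() with do.call() splits the embedded matrix into ordinary columns (named g, x.m, x.s). Binding the grouping column to the matrix with cbind() gives the shorter names (g, m, s). The object names flat1 and flat2 are only illustrative.

```r
set.seed(101)
d0 <- data.frame(g = factor(rep(1:2, each = 20)), x = rnorm(20))
d1 <- aggregate(x ~ g, data = d0, FUN = function(x) c(m = mean(x), s = sd(x)))

# Option 1: rebuild the data frame; the matrix column is split into x.m and x.s
flat1 <- do.call(data.frame, d1)
names(flat1)

# Option 2: bind the grouping column to the matrix; columns are named m and s
flat2 <- cbind(d1["g"], d1$x)
names(flat2)
```

Both results are plain three-column data frames, so they can be passed to code that expects only atomic columns.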

Related

R - Multi-level list indexing

What is the convention to assign an object to a multi-level list?
So far I thought the convention for indexing is to use [[]] instead of $.
Hence, when saving results in loops I usually used the following approach:
> result <- matrix(2,2,2)
> result_list <- list()
> result_list[["A"]][["B"]][["C"]] <- result
> print(result_list)
$A
$A$B
$A$B$C
[,1] [,2]
[1,] 2 2
[2,] 2 2
Which works as intended with this matrix.
But when assigning a single number the list seems to skip the last level.
> result <- 2
> result_list <- list()
> result_list[["A"]][["B"]][["C"]] <- result
> print(result_list)
$A
B
2
At the same time, if I use $ instead of [[]] the list again is as intended.
> result_list$A$B$C <- result
> print(result_list)
$A
$A$B
$A$B$C
[1] 2
As mentioned here you can also use list("A" = list("B" = list("C" = 2))).
Which of these methods should be used for indexing a multi-level list in R?
Although the title of the question refers to multi-level list indexing, and the syntax mylist[['a']][['b']][['c']] is the same that one would use to retrieve an element of a multi-level list, the differences that you're observing actually arise from using the same syntax for creation (or not) of multi-level lists.
To show this, we can first explicitly create the multi-level (nested) lists, and then check that the indexing works as expected both for matrices and for single numbers.
mymatrix=matrix(1:4,nrow=2)
list_b=list(c=mymatrix)
list_a=list(b=list_b)
mynestedlist1=list(a=list_a)
str( mynestedlist1 )
# List of 1
# $ a:List of 1
# ..$ b:List of 1
# .. ..$ c: int [1:2, 1:2] 1 2 3 4
mynumber=2
list_e=list(f=mynumber)
list_d=list(e=list_e)
mynestedlist2=list(d=list_d)
str( mynestedlist2 )
# List of 1
# $ d:List of 1
# ..$ e:List of 1
# .. ..$ f: num 2
( Note that I've created the lists in sequential steps for clarity; they could all have been rolled together in a single line, like: mynestedlist2=list(d=list(e=list(f=mynumber))) )
Anyway, now we'll check that indexing works OK:
str(mynestedlist1[['a']][['b']][['c']])
# int [1:2, 1:2] 1 2 3 4
str(mynestedlist1$a$b$c)
# int [1:2, 1:2] 1 2 3 4
str(mynestedlist2[['d']][['e']][['f']])
# num 2
str(mynestedlist2$d$e$f)
# num 2
# and, just to check that we don't 'skip the last level':
str(mynestedlist2[['d']][['e']])
# List of 1
# $ f: num 2
So the direct answer to the question 'which of these methods should be used for indexing a multi-level list in R' is: 'any of them - they're all ok'.
So what's going on with the examples in the question, then?
Here, the same syntax is being used to try to implicitly create lists, and since the structure of the nested list is not specified explicitly, this relies on whether R can infer the structure that you want.
In the first and third examples, there's no ambiguity, but each for a different reason:
First example:
mynestedlist1=list()
mynestedlist1[['a']][['b']][['c']]=mymatrix
We've specified that mynestedlist1 is a list. But its elements could be any kind of object, until we assign them. In this case, we put into the element named 'a' an object with an element 'b' that contains an object with an element 'c' that is a matrix. Since there's no R object that can contain a matrix in a single element except a list, the only way to achieve this assignment is by creating a nested list.
Third example:
mynestedlist3=list()
mynestedlist3$g$h$i=mynumber
In this case, we've used the $ notation, which only applies to lists (or to data types that are similar/equivalent to lists, like dataframes). So, again, the only way to follow the instructions of this assignment is by creating a nested list.
Finally, the pesky second example, but starting with a simpler variant of it:
mylist2=list()
mylist2[['c']][['d']]=mynumber
Here there's an ambiguity. We've specified that mylist2 is a list, and we've put into the element named 'c' an object with an element 'd' that contains a single number. This element could have been a list, but it can also be a simple vector, and in this case R chooses this as the simpler option:
str(mylist2)
# List of 1
# $ c: Named num 2
# ..- attr(*, "names")= chr "d"
Contrast this to the behaviour when trying to assign a matrix using exactly the same syntax: in that case, the only way to follow the syntax would be by creating another, nested, list inside the first one.
What about the full second example mylist2[['c']][['d']][['e']]=mynumber, where we try to assign a number named 'e' to the just-created but still-empty object 'd'?
This seems rather unclear, and this may be the reason for the different behaviours of different versions of R (as reported in the comments to the question). In the question, the action taken by R has been to assign the number while dropping its name, similarly to:
myvec=vector(); myvec2=vector()
myvec[['a']]=1
myvec2[['b']]=2
myvec[['a']]=myvec2
str(myvec)
# Named num 2
# - attr(*, "names")= chr "a"
However, the syntax alone doesn't seem to force this behaviour, so it would be sensible to avoid relying on this behaviour when trying to create nested lists, or lists of vectors.
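If the goal is simply a nested list whose shape does not depend on the value being stored, one option is to create the intermediate levels explicitly before assigning into them. A minimal sketch, reusing the names from the question:

```r
result <- 2

# Build the intermediate levels explicitly so R does not have to infer them
result_list <- list()
result_list[["A"]] <- list(B = list())

# The innermost assignment now targets an existing (empty) list
result_list[["A"]][["B"]][["C"]] <- result

str(result_list)
```

Because the element "B" already exists as a list, the final assignment adds "C" to that list whether the value is a matrix or a single number.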

R: A column in a dataframe from numeric to factor with paste0 (and vice versa)

Preface:
I have seen this post: How to convert a factor to an integer\numeric without a loss of information?, but it does not really apply to the issue I am having. It addresses converting a factor vector to numeric, but the issue I am having is larger than that.
Problem:
I am trying to convert a column in a dataframe from a factor to a numeric, while representing the dataframe using paste0. Here is an example:
aa=1:10
bb=rnorm(10)
dd=data.frame(aa,bb)
get(paste0("d","d"))[,2]=as.factor(get(paste0("d","d"))[,2])
(The actual code I am using requires me to use the paste0 function)
I get the error: target of assignment expands to non-language object
I am not sure how to do this, I think what is messing it up is the paste0 function.
First, this is not really a natural way to think about things or to code things in R. It can be done, but if you rephrase your question to give the bigger picture, someone can probably provide more natural ways of doing this in R. (Like the named lists @joran mentioned in the comment.)
With that said, to do this in R, you need to split apart the three steps you're trying to do in one line: get the data frame with the specified variable, make the desired column a factor, and then assign back to the variable name. Here I've wrapped this in a function, so the assignment needs to be made in pos=1 instead of the default, which would name it only within the function.
tof <- function(dfname, colnum) {
d <- get(dfname)
d[, colnum] <- factor(d[, colnum])
assign(dfname, d, pos=1)
}
dd <- data.frame(aa=1:10, bb=rnorm(10))
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: num -1.4824 0.7904 0.0258 1.2075 0.2455 ...
tof("dd", 2)
str(dd)
## 'data.frame': 10 obs. of 2 variables:
## $ aa: int 1 2 3 4 5 6 7 8 9 10
## $ bb: Factor w/ 10 levels "-1.48237228248052",..: 1 8 4 9 5 10 2 7 3 6
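The named-list approach mentioned above can be sketched as follows. The container name frames and the helper variable nm are hypothetical; the point is only that a name built with paste0 can index a list directly, so neither get nor assign is needed.

```r
# Hypothetical container: keep the data frames in a named list
frames <- list(dd = data.frame(aa = 1:10, bb = rnorm(10)))

nm <- paste0("d", "d")  # name built at run time

# numeric -> factor, assigning straight into the list element
frames[[nm]][, 2] <- as.factor(frames[[nm]][, 2])

# factor -> numeric: go through character to get the level labels
# rather than the integer codes
bb_numeric <- as.numeric(as.character(frames[[nm]][, 2]))

str(frames[[nm]])
```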

Build column of data frame with character vectors of different length?

I want to create a data frame in R.
To make an easy 2x2 example of my problem:
Assume the first column is a simple vector:
first <- c(1:2)
The second column is for every row a character vector (but of different length), for example:
c('A') for the first row and c('B','C') for the second.
How can I build this data frame?
If you want to store vectors of different lengths in each row of a column, you will need to use a list. The problem is that (from ?data.frame):
If a list or data frame or matrix is passed to data.frame it is as if
each component or column had been passed as a separate argument
Thus you will need to wrap it in I() in order to protect your desired structure, e.g.
df <- data.frame(first = 1:2, Second = I(list("A", c("B", "C"))))
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ first : int 1 2
# $ Second:List of 2
# ..$ : chr "A"
# ..$ : chr "B" "C"
# ..- attr(*, "class")= chr "AsIs"
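As an alternative sketch, the list column can also be added after the data frame has been created, which avoids the special handling of list arguments in data.frame():

```r
df <- data.frame(first = 1:2)

# Assigning a list with one element per row creates a list column
df$Second <- list("A", c("B", "C"))

lengths(df$Second)   # number of strings stored in each row
df$Second[[2]]       # the character vector for the second row
```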

Organization of data with metadata

I have a dataframe that contains two columns X-data and Y-data.
This represents some experimental data.
Now I have a lot of additional information that I want to associate with this data, such as the temperatures, flow rates and so on at which the sample was recorded. I have this metadata in a second dataframe.
The data and metadata should always stay together, but I also want to be able to do calculations with the data.
As I have many of those data-metadata pairs (>100), I was wondering what people think is an efficient way to organize the data?
For now, I have the two dataframes in a list, but I find accessing the individual values or data-columns tedious (= a lot of code and brackets to write).
You can use an attribute:
dfr <- data.frame(x=1:3,y=rnorm(3))
meta <- list(temp="30C",date=as.Date("2013-02-27"))
attr(dfr,"meta") <- meta
dfr
x y
1 1 -1.3580532
2 2 -0.9873850
3 3 0.3809447
attr(dfr,"meta")
$temp
[1] "30C"
$date
[1] "2013-02-27"
str(dfr)
'data.frame': 3 obs. of 2 variables:
$ x: int 1 2 3
$ y: num -1.358 -0.987 0.381
- attr(*, "meta")=List of 2
..$ temp: chr "30C"
..$ date: Date, format: "2013-02-27"
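With many data-metadata pairs, the attribute approach can be combined with a named list of data frames. The sketch below is only an illustration: the helper make_sample, the sample names s1 and s2, and the values are made up.

```r
# Hypothetical helper: build one data frame and attach its metadata
make_sample <- function(temp) {
  d <- data.frame(x = 1:3, y = c(0.5, 1.5, 2.5))
  attr(d, "meta") <- list(temp = temp)
  d
}

samples <- list(s1 = make_sample("30C"), s2 = make_sample("45C"))

# Pull one metadata field out of every sample
temps <- sapply(samples, function(d) attr(d, "meta")$temp)

# Calculations use the data columns as usual
means <- sapply(samples, function(d) mean(d$y))
```

Each data frame carries its own metadata, and a single sapply() call collects a field or a summary across all samples.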

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
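Applied to the CSV itself, the same split might look like the sketch below. It reads the third column as character (so no factors are created) and then derives a numeric and a character column from it. The column names V3_num and V3_chr are illustrative.

```r
csv_text <- '1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5'

# Read the mixed column as character rather than factor
dat <- read.csv(text = csv_text, header = FALSE,
                colClasses = c("integer", "integer", "character"))

# Entries that are not numbers become NA (the coercion warning is expected)
dat$V3_num <- suppressWarnings(as.numeric(dat$V3))

# Keep the original string only where the numeric conversion failed
dat$V3_chr <- ifelse(is.na(dat$V3_num), dat$V3, NA_character_)
```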
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
No. A dataframe is a series of pasted together vectors (a list of vectors or matrices). Because each column is a vector, it cannot be classified as both integer and factor. It must be one or the other. You could split the vector apart into numeric and factor (a column for each), but I don't believe this is what you want.
