Add variable to a list in R

I have a list of 28 data frames nested inside an outer list, and I am trying to add another variable called ID to each individual data frame. I found the question "Dataframes in a list; adding a new variable with name of dataframe" very helpful, but when I tried its code, it didn't work in my case. I think that's because my list doesn't have clear labels ([1], [2], [3], etc.) that the code can recognize.
all$id <- rep(names(mylist), sapply(mylist, nrow))
>List of 1
$ :List of 28
..$ :'data.frame': 271 obs. of 12 variables:
.. ..$ Sample_ID : Factor w/ 271 levels "MC25",..: 19 27 2
.. ..$ Reported_Analyte : Factor w/ 10 levels "2-Butoxyethanol",..: 7 7 7
.. ..$ Date_Collected : Factor w/ 71 levels "2010-05-08","2010-05-09",..: 8 9 1
.. ..$ Result2 : num [1:271] 0.11 0.11 0.11 0.11
..$ :'data.frame': 6 obs. of 12 variables:
.. ..$ Sample_ID : Factor w/ 271 levels "MC25",..: 19 27 2
.. ..$ Reported_Analyte : Factor w/ 10 levels "2-Butoxyethanol",..: 7 7 7
.. ..$ Date_Collected : Factor w/ 71 levels "2010-05-08","2010-05-09",..: 8 9 1
.. ..$ Result2 : num [1:271] 0.11 0.11 0.11 0.11

It really isn't very clear what you want to achieve (the post you linked to was about collapsing over the list of data frames and adding into the collapsed version an ID variable indicating which original data frame each row in the collapsed data frame came from).
I see a complication with your data: you have a list of 28 data frames within a list. You can see that in the output from str() that is given in your question. It is easier to see with this example data set (here all the data frames are the same, but that is just for expedience):
set.seed(42)
dat <- data.frame(Sample_ID = factor(sample(10)),
Reported_Analyte = factor(sample(LETTERS, 10)),
Date_Collected = Sys.Date() + 0:9,
Result2 = rnorm(10))
mylist <- list(lapply(1:28, function(x) dat))
If we look at mylist using str() we see the nature of the complication I mentioned:
R> str(mylist, max = 2)
List of 1
$ :List of 28
..$ :'data.frame': 10 obs. of 4 variables:
..$ :'data.frame': 10 obs. of 4 variables:
..$ :'data.frame': 10 obs. of 4 variables:
..$ :'data.frame': 10 obs. of 4 variables:
..$ :'data.frame': 10 obs. of 4 variables:
..$ :'data.frame': 10 obs. of 4 variables:
..$ :'data.frame': 10 obs. of 4 variables:
....<etc>
The post you linked to started from the equivalent of the list inside your outer list, and that list had named components. If you don't need the outer list, it is perhaps best to throw it away at this stage:
mylist2 <- mylist[[1]]
## the `[[` are important as we want the 1st component *inside* the list
## using `[` would just give us a list within a list again.
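The difference between `[` and `[[` can be checked on a small toy list. This is only a minimal sketch; the object names (nested, kept_wrapped, unwrapped) are made up for illustration and are not part of the original data.

```r
# A toy list within a list: one outer component holding two inner components
nested <- list(list(a = 1:3, b = 4:6))

kept_wrapped <- nested[1]    # `[` returns a sub-list: still a list of length 1
unwrapped    <- nested[[1]]  # `[[` returns the component itself: the inner list

length(kept_wrapped)  # number of components in the outer wrapper
length(unwrapped)     # number of components in the inner list
names(unwrapped)      # names of the inner components
```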
Names can then be added to this list
names(mylist2) <- paste("Data_frame_", seq_along(mylist2), sep = "")
which would result in
R> str(mylist2)
List of 28
$ Data_frame_1 :'data.frame': 10 obs. of 4 variables:
..$ Sample_ID : Factor w/ 10 levels "1","2","3","4",..: 10 9 3 6 4 8 5 1 2 7
..$ Reported_Analyte: Factor w/ 10 levels "C","F","I","J",..: 6 7 10 2 5 8 9 1 3 4
..$ Date_Collected : Date[1:10], format: "2012-05-02" "2012-05-03" ...
..$ Result2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
$ Data_frame_2 :'data.frame': 10 obs. of 4 variables:
..$ Sample_ID : Factor w/ 10 levels "1","2","3","4",..: 10 9 3 6 4 8 5 1 2 7
..$ Reported_Analyte: Factor w/ 10 levels "C","F","I","J",..: 6 7 10 2 5 8 9 1 3 4
..$ Date_Collected : Date[1:10], format: "2012-05-02" "2012-05-03" ...
..$ Result2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
....<etc>
Notice the List of 1 is no longer reported.
If the list of data frames within a list is important to you (not sure why it would be, but OK), then you can assign the names to the [[1]]st component directly.
names(mylist[[1]]) <- paste("Data_frame_", seq_along(mylist[[1]]), sep = "")
(Notice I'm using the original mylist and on both occasions I index that list with [[1]].)
The result is similar to the above though the list within a list structure is retained:
R> str(mylist)
List of 1
$ :List of 28
..$ Data_frame_1 :'data.frame': 10 obs. of 4 variables:
.. ..$ Sample_ID : Factor w/ 10 levels "1","2","3","4",..: 10 9 3 6 4 8 5 1 2 7
.. ..$ Reported_Analyte: Factor w/ 10 levels "C","F","I","J",..: 6 7 10 2 5 8 9 1 3 4
.. ..$ Date_Collected : Date[1:10], format: "2012-05-02" "2012-05-03" ...
.. ..$ Result2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
..$ Data_frame_2 :'data.frame': 10 obs. of 4 variables:
.. ..$ Sample_ID : Factor w/ 10 levels "1","2","3","4",..: 10 9 3 6 4 8 5 1 2 7
.. ..$ Reported_Analyte: Factor w/ 10 levels "C","F","I","J",..: 6 7 10 2 5 8 9 1 3 4
.. ..$ Date_Collected : Date[1:10], format: "2012-05-02" "2012-05-03" ...
.. ..$ Result2 : num [1:10] 1.305 2.287 -1.389 -0.279 -0.133 ...
....<etc>
If you now wish to proceed with collapsing the individual data frames into a single data frame, but retaining the information about which data frame they came from, we would do this for mylist2:
all2 <- do.call("rbind", mylist2)
all2 <- transform(all2, id = rep(names(mylist2), sapply(mylist2, nrow)))
rownames(all2) <- seq_len(nrow(all2)) ## reset rownames for compactness
which gives:
R> head(all2)
Sample_ID Reported_Analyte Date_Collected Result2 id
1 10 L 2012-05-02 1.3048697 Data_frame_1
2 9 R 2012-05-03 2.2866454 Data_frame_1
3 3 W 2012-05-04 -1.3888607 Data_frame_1
4 6 F 2012-05-05 -0.2787888 Data_frame_1
5 4 K 2012-05-06 -0.1333213 Data_frame_1
6 8 T 2012-05-07 0.6359504 Data_frame_1
For mylist we use something very similar, but just index into mylist using [[1]]:
all1 <- do.call("rbind", mylist[[1]])
all1 <- transform(all1, id = rep(names(mylist[[1]]), sapply(mylist[[1]], nrow)))
rownames(all1) <- seq_len(nrow(all1)) ## reset rownames for compactness
R> head(all1)
Sample_ID Reported_Analyte Date_Collected Result2 id
1 10 L 2012-05-02 1.3048697 Data_frame_1
2 9 R 2012-05-03 2.2866454 Data_frame_1
3 3 W 2012-05-04 -1.3888607 Data_frame_1
4 6 F 2012-05-05 -0.2787888 Data_frame_1
5 4 K 2012-05-06 -0.1333213 Data_frame_1
6 8 T 2012-05-07 0.6359504 Data_frame_1
As you can see, repeatedly having to refer to your list of data frames as mylist[[1]] is a pain if you don't need the outer list.
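Because every data frame in the example above has the same number of rows, it may help to see the same rep(names(), sapply(, nrow)) idiom on data frames of different sizes. The following is a small sketch with made-up data (frames, first and second are hypothetical names):

```r
# Two toy data frames with different numbers of rows, held in a named list
frames <- list(first  = data.frame(v = 1:2),
               second = data.frame(v = 1:3))

# Stack them, then repeat each list name once per row of its data frame
combined <- do.call("rbind", frames)
combined$id <- rep(names(frames), sapply(frames, nrow))
rownames(combined) <- NULL  # reset row names for compactness

combined
```

Each name is repeated as many times as its data frame has rows, so the id column lines up with the stacked rows even when the sizes differ.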
Update:
If you don't want to collapse the list into a single data frame, see @Andrie's answer, but modify it to read:
ml2 <- ml ## here ml is a list within a list, like mylist above
ml2[[1]] <- lapply(seq_along(ml[[1]]), function(x) cbind(ml[[1]][[x]], id = x))
so that you account for the list-within-list structure.
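If the inner list has names, the name (rather than the position) can be used as the id. The sketch below swaps lapply() for Map() so that each data frame is paired with its name; the toy data and the object name nested are assumptions for illustration only.

```r
# Toy list within a list: four copies of a small data frame
dat <- data.frame(x = 1:3, y = c("a", "b", "c"))
nested <- list(lapply(1:4, function(i) dat))

# Name the inner components, then bind each name on as an id column
names(nested[[1]]) <- paste0("Data_frame_", seq_along(nested[[1]]))
nested[[1]] <- Map(function(d, nm) cbind(d, id = nm),
                   nested[[1]], names(nested[[1]]))

str(nested, max.level = 2)
```

Map() keeps the names of its first list argument, so the inner list stays named and the outer list is left in place.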

I answer this using a constructed example of a list with samples from mtcars.
First, create a list of data frames. Do this by sampling 10 rows from mtcars for each element of the list:
ml <- lapply(1:3, function(x)mtcars[sample(1:32, 10), 1:3])
So, now you have an unnamed list of 3 data frames. Next you want to add an id column. The trick is to use lapply over a sequence of list items using seq_along(ml), and then to cbind your id to each data frame:
ml2 <- lapply(seq_along(ml), function(x)cbind(ml[[x]], id=x))
The results are what you required:
str(ml2)
List of 3
$ :'data.frame': 10 obs. of 4 variables:
..$ mpg : num [1:10] 15 24.4 26 15.8 22.8 21 32.4 17.3 17.8 30.4
..$ cyl : num [1:10] 8 4 4 8 4 6 4 8 6 4
..$ disp: num [1:10] 301 147 120 351 108 ...
..$ id : int [1:10] 1 1 1 1 1 1 1 1 1 1
$ :'data.frame': 10 obs. of 4 variables:
..$ mpg : num [1:10] 33.9 19.2 24.4 10.4 30.4 22.8 16.4 21.4 15.5 21.5
..$ cyl : num [1:10] 4 6 4 8 4 4 8 6 8 4
..$ disp: num [1:10] 71.1 167.6 146.7 460 75.7 ...
..$ id : int [1:10] 2 2 2 2 2 2 2 2 2 2
$ :'data.frame': 10 obs. of 4 variables:
..$ mpg : num [1:10] 15.5 21 13.3 21.5 21.4 30.4 21 18.1 30.4 15.2
..$ cyl : num [1:10] 8 6 8 4 4 4 6 6 4 8
..$ disp: num [1:10] 318 160 350 120 121 ...
..$ id : int [1:10] 3 3 3 3 3 3 3 3 3 3

Related

Change df columns from lists to vectors

I've been using R for a while, but lists perplex me.
For some reason in some cases my function outputs a data frame of lists:
str() returns something like:
'data.frame': 4683 obs. of 6 variables:
$ f1:List of 4683
..$ : num -0.196
..$ : num -0.205
..$ : num -0.209
..$ : num -0.218
..$ : num -0.197
..$ : num -0.136
..$ : num -0.22
instead of
'data.frame': 4683 obs. of 6 variables:
$ f1: num -0.197 -0.205 -0.208 -0.218 -0.197 ...
$ f2: num -0.13 -0.139 -0.136 -0.137 -0.126 ...
$ f3: num -0.216 -0.221 -0.214 -0.209 -0.203 ...
$ f4: num 0.00625 -0.04806 -0.04888 -0.02979 -0.03813 ...
$ f5: num -0.15 -0.178 -0.173 -0.207 -0.154 ...
$ f6: num -0.191 -0.224 -0.25 -0.183 -0.209 ...
...
like I'd expect. Is there some simple way to convert df from the first case to the second?
I have tried manually casting columns as vectors, which not only doesn't work, but also would not be very effective.
When we have a data frame like this
df
# 1 1, 2, 3 1, 2, 3
# 2 4, 5, 6 4, 5, 6
where
df |> str()
# 'data.frame': 2 obs. of 2 variables:
# $ :List of 2
# ..$ : int 1 2 3
# ..$ : int 4 5 6
# $ :List of 2
# ..$ : int 1 2 3
# ..$ : int 4 5 6
we can do
r <- do.call(data.frame, df)
r
# X1.3 X4.6 X1.3.1 X4.6.1
# 1 1 4 1 4
# 2 2 5 2 5
# 3 3 6 3 6
where
str(r)
# 'data.frame': 3 obs. of 4 variables:
# $ X1.3 : int 1 2 3
# $ X4.6 : int 4 5 6
# $ X1.3.1: int 1 2 3
# $ X4.6.1: int 4 5 6
Explanation: do.call constructs a data.frame() call with the columns of df (which is a "data.frame" as well as a "list") passed as the ... arguments. Our df has two list columns of length 2, so each column expands into a data frame with 2 columns, and the result is a data frame with 4 columns.
By the way, you can use Reduce(.) just as well.
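For the situation in the question, where every list element holds a single number, unlisting each column may be closer to what was asked. This is a minimal sketch that assumes every element has length one; df_list and its values are made up to mimic the str() output in the question.

```r
# Build a toy data frame whose columns are lists of length-one numerics
df_list <- data.frame(row = 1:3)
df_list$f1 <- list(-0.196, -0.205, -0.209)
df_list$f2 <- list(-0.130, -0.139, -0.136)
df_list$row <- NULL  # drop the helper column, leaving only list columns

# Flatten every list column to an atomic vector, keeping the data frame shape
df_flat <- df_list
df_flat[] <- lapply(df_list, unlist)

str(df_flat)
```

If any element had a length other than one, unlist() would return a vector of the wrong length and the assignment would fail, which is a useful check in itself.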
Data:
df <- structure(list(list(1:3, 4:6), list(1:3, 4:6)), names = c("",
""), class = "data.frame", row.names = c(NA, -2L))

How can I extract and name all of these data frames from this list in R?

How can I get these data frames out of this unholy list? Using an 'apply' function, I would also like to assign each one to a variable if possible.
List of 452
$ :'data.frame': 11 obs. of 2 variables:
..$ X[[i]]: int [1:11] 39 12 51 6 14 3 5 1 5 5 ...
..$ time : chr [1:11] "2018-08-29T19:00:00-06:00" "2018-08-30T11:00:00-06:00" "2018-08-30T12:00:00-06:00" "2018-08-30T13:00:00-06:00" ...
$ :'data.frame': 4 obs. of 2 variables:
..$ X[[i]]: int [1:4] 2 1 8 25
..$ time : chr [1:4] "2018-08-28T13:00:00-06:00" "2018-08-28T23:00:00-06:00" "2018-08-29T21:00:00-06:00" "2018-08-29T22:00:00-06:00"
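One possible approach, sketched here under assumptions, is to name the list elements and then copy them into an environment with list2env(). The toy list lst, the column name count, and the df_ prefix are all hypothetical; only the time strings are taken from the output above.

```r
# A toy unnamed list of data frames standing in for the list of 452
lst <- list(
  data.frame(count = c(39L, 12L),
             time = c("2018-08-29T19:00:00-06:00", "2018-08-30T11:00:00-06:00")),
  data.frame(count = c(2L, 1L, 8L),
             time = c("2018-08-28T13:00:00-06:00", "2018-08-28T23:00:00-06:00",
                      "2018-08-29T21:00:00-06:00"))
)

# Give every element a name, then create one variable per element
names(lst) <- paste0("df_", seq_along(lst))
env <- new.env()
invisible(list2env(lst, envir = env))

ls(env)  # the variables that were created
```

Passing envir = globalenv() instead would create the variables in the workspace, although keeping the data frames in a named list is often easier to work with.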

Extract values from nested list of summary(aov()) into a dataframe

I am running a simple one-way ANOVA across multiple groups within a single data frame.
Dataframe available here: https://www.dropbox.com/s/6nsjk4l1pgiwal3/cut1.csv?dl=0
>download.file('https://www.dropbox.com/s/6nsjk4l1pgiwal3/cut1.csv?raw=1', destfile = "cut1.csv", method = "auto")
> data <- read.csv("cut1.csv")
> cut1 <- data %>% mutate(Plot = as.factor(Plot), Block = as.factor(Block), Cut = as.factor(Cut))
> str(cut1)
'data.frame': 160 obs. of 6 variables:
$ Plot : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Block : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 2 2 2 2 3 3 ...
$ Treatment : Factor w/ 4 levels "AN","C","IU",..: 4 2 3 1 1 3 4 2 3 1 ...
$ Cut : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
$ Measurement: Factor w/ 10 levels "ADF","Ash","Crude_Protein",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Value : num 956 965 961 963 955 ...
I used some code from this SO question to enable the aov function to be applied to every level of the Measurement factor:
anova_1<- sapply(unique(as.character(cut1$Measurement)),
function(meas)aov(Value~Treatment+Block,cut1,subset=(Measurement==meas)),
simplify=FALSE,USE.NAMES=TRUE)
summary_1 <- lapply(anova_1, summary)
I can look manually through summary_1 but ideally what I would like to do is extract the p values for each level of the Measurement factor into a dataframe which I could then filter so that I only see which ones are <0.5. I would then run TukeyHSD on these.
summary_1 looks like this (only first 2 lists shown):
> str(summary_1)
List of 10
$ Dry_matter :List of 1
..$ :Classes ‘anova’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ Df : num [1:3] 3 3 9
.. ..$ Sum Sq : num [1:3] 359 167 612
.. ..$ Mean Sq: num [1:3] 119.8 55.5 68
.. ..$ F value: num [1:3] 1.761 0.816 NA
.. ..$ Pr(>F) : num [1:3] 0.224 0.517 NA
..- attr(*, "class")= chr [1:2] "summary.aov" "listof"
$ Crude_Protein:List of 1
..$ :Classes ‘anova’ and 'data.frame': 3 obs. of 5 variables:
.. ..$ Df : num [1:3] 3 3 9
.. ..$ Sum Sq : num [1:3] 306 721 1606
.. ..$ Mean Sq: num [1:3] 102 240 178
.. ..$ F value: num [1:3] 0.572 1.347 NA
.. ..$ Pr(>F) : num [1:3] 0.647 0.319 NA
..- attr(*, "class")= chr [1:2] "summary.aov" "listof"
I can extract the p value from one of the lists in summary_1 like this:
> summary_1$OAH[[1]][,5][1]
[1] 0.4734992
However, I don't know how to extract from all the nested lists and place the results in a data frame.
Much obliged for any help.
You can use the package broom in combination with dplyr to apply Anova by Measurement, and assign the output to a data.frame in a tidy format.
library(broom)
library(dplyr)
summaries <- cut1 %>% group_by(Measurement) %>%
do(tidy(aov(Value ~ Treatment + Block, data = .)))
head(summaries)
# Measurement term df sumsq meansq statistic p.value
# (fctr) (chr) (dbl) (dbl) (dbl) (dbl) (dbl)
#1 ADF Treatment 3 41.416875 13.805625 3.097871 0.07138437
#2 ADF Block 1 8.001125 8.001125 1.795388 0.20729351
#3 ADF Residuals 11 49.021375 4.456489 NA NA
#4 Ash Treatment 3 38.511875 12.837292 1.051787 0.40840601
#5 Ash Block 1 34.980125 34.980125 2.865998 0.11856463
#6 Ash Residuals 11 134.257375 12.205216 NA NA
Here's a solution in vanilla R:
# you can shorten your example -- download.file not necessary
cut1 <- read.csv('https://www.dropbox.com/s/6nsjk4l1pgiwal3/cut1.csv?raw=1')
# convert the grouping columns to factors without needing dplyr
cut1[c("Plot", "Block", "Cut")] <- lapply(cut1[c("Plot", "Block", "Cut")], as.factor)
# split-apply-combine strategy
do.call(rbind, lapply(split(cut1,cut1$Measurement),
function(x) with(x, summary(aov(Value ~ Treatment + Block)))[[1]]
)
)
returns:
Df Sum Sq Mean Sq F value Pr(>F)
ADF.Treatment 3 41.42 13.81 6.7088 0.01133 *
ADF.Block 3 38.50 12.83 6.2366 0.01405 *
ADF.Residuals 9 18.52 2.06
Ash.Treatment 3 38.51 12.84 0.9162 0.47115
Ash.Block 3 43.13 14.38 1.0261 0.42602
Ash.Residuals 9 126.11 14.01
Crude_Protein.Treatment 3 306.42 102.14 0.5723 0.64733
Crude_Protein.Block 3 721.42 240.47 1.3473 0.31946
Crude_Protein.Residuals 9 1606.39 178.49
D.Treatment 3 9.47 3.16 4.5530 0.03331 *
D.Block 3 7.57 2.52 3.6383 0.05751 .
D.Residuals 9 6.24 0.69
Dry_matter.Treatment 3 359.39 119.80 1.7609 0.22432
Dry_matter.Block 3 166.62 55.54 0.8164 0.51656
Dry_matter.Residuals 9 612.27 68.03
ME.Treatment 3 0.24 0.08 4.5530 0.03331 *
ME.Block 3 0.19 0.06 3.6383 0.05751 .
ME.Residuals 9 0.16 0.02
NCGD.Treatment 3 2777.57 925.86 4.5530 0.03331 *
NCGD.Block 3 2219.55 739.85 3.6383 0.05751 .
NCGD.Residuals 9 1830.17 203.35
NDF.Treatment 3 355.91 118.64 6.8809 0.01050 *
NDF.Block 3 336.70 112.23 6.5095 0.01239 *
NDF.Residuals 9 155.17 17.24
OAH.Treatment 3 1.41 0.47 0.9108 0.47350
OAH.Block 3 1.37 0.46 0.8850 0.48488
OAH.Residuals 9 4.65 0.52
Sugar.Treatment 3 86.18 28.73 5.0212 0.02577 *
Sugar.Block 3 51.64 17.21 3.0085 0.08720 .
Sugar.Residuals 9 51.49 5.72
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
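The p values can also be pulled straight out of a list shaped like summary_1 with sapply(). Since the CSV above may not be available, this sketch uses the built-in warpbreaks data as a stand-in (one ANOVA per level of wool); the object names and the 0.05 threshold are assumptions, not part of the original question.

```r
# One summary(aov()) per group, giving a named list like summary_1
summaries <- lapply(split(warpbreaks, warpbreaks$wool),
                    function(d) summary(aov(breaks ~ tension, data = d)))

# First entry of the fifth column (Pr(>F)) is the p value of the first term
p_first_term <- sapply(summaries, function(s) s[[1]][, 5][1])

# Collect into a data frame that can be filtered
p_table <- data.frame(group = names(p_first_term),
                      p_value = unname(p_first_term))
subset(p_table, p_value < 0.05)
```

The same sapply() call should work on summary_1 directly, because each of its elements has the same list-of-one structure shown in the str() output.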

Error in ncol(xj) : object 'xj' not found when using R matplot()

Using matplot, I'm trying to plot the 2nd, 3rd and 4th columns of airquality data.frame after dividing these 3 columns by the first column of airquality.
However I'm getting an error
Error in ncol(xj) : object 'xj' not found
Why are we getting this error? The code below will reproduce this problem.
attach(airquality)
airquality[2:4] <- apply(airquality[2:4], 2, function(x) x /airquality[1])
matplot(x= airquality[,1], y= as.matrix(airquality[-1]))
You have managed to mangle your data in an interesting way. Starting with airquality before you mess with it. (And please don't attach() - it's unnecessary and sometimes dangerous/confusing.)
str(airquality)
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R: int 190 118 149 313 NA NA 299 99 19 194 ...
$ Wind : num 7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
$ Temp : int 67 72 74 62 56 66 65 59 61 69 ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
After you do
airquality[2:4] <- apply(airquality[2:4], 2,
function(x) x /airquality[1])
you get
'data.frame': 153 obs. of 6 variables:
$ Ozone : int 41 36 12 18 NA 28 23 19 8 NA ...
$ Solar.R:'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 4.63 3.28 12.42 17.39 NA ...
$ Wind :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 0.18 0.222 1.05 0.639 NA ...
$ Temp :'data.frame': 153 obs. of 1 variable:
..$ Ozone: num 1.63 2 6.17 3.44 NA ...
$ Month : int 5 5 5 5 5 5 5 5 5 5 ...
$ Day : int 1 2 3 4 5 6 7 8 9 10 ...
or
sapply(airquality,class)
## Ozone Solar.R Wind Temp Month Day
## "integer" "data.frame" "data.frame" "data.frame" "integer" "integer"
that is, you have data frames embedded within your data frame!
rm(airquality) ## clean up
Now change one character and divide by the column airquality[,1] rather than airquality[1] (divide by a vector, not a list of length one ...)
airquality[,2:4] <- apply(airquality[,2:4], 2,
function(x) x/airquality[,1])
matplot(x= airquality[,1], y= as.matrix(airquality[,-1]))
In general it's safer to use [, ...] indexing rather than [] indexing to refer to columns of a data frame unless you really know what you're doing ...
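As an alternative sketch, the apply() call can be skipped altogether by dividing the three columns by the first column extracted as a vector with [[1]]. The copy aq is a made-up name used so that the built-in airquality is left untouched.

```r
# Work on a copy of the built-in data
aq <- datasets::airquality

# aq[[1]] is a plain vector, so each of columns 2 to 4 is divided row by row
aq[2:4] <- aq[2:4] / aq[[1]]

sapply(aq, class)  # every column is still an atomic vector
head(aq[2:4])
```

The result can then be passed to matplot() in the same way as in the corrected code above.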

R write.table read.table change the format of some columns in dataframes

I am experiencing a problem when saving data using write.table and reading data using read.table.
I wrote some code that collects data from thousands of files, does some calculations, and creates a data frame. This data frame has 8 columns and more than 11000 rows. The columns contain the 8 variables, 3 of which are ordered factors; the other variables are numeric.
When I look at the structure of my data before using the command write.table I got exactly what I expect which is:
str(data)
'data.frame': 11424 obs. of 8 variables:
$ a_KN : num 8.56e-09 1.11e-08 1.45e-08 1.88e-08 2.45e-08 ...
$ a_DTM : num 5.05e-08 5.12e-08 5.19e-08 5.26e-08 5.33e-08 ...
$ SF : num 5.89 4.6 3.58 2.79 2.18 ...
$ Energy : Ord.factor w/ 6 levels "160"<"800"<"1.4"<..: 1 1 1 1 1 1 1 1 1 1 ...
$ EnergyUnit: Ord.factor w/ 3 levels "MeV"<"GeV"<"TeV": 1 1 1 1 1 1 1 1 1 1 ...
$ Location : Ord.factor w/ 7 levels "BeamImpact"<"WithinBulky"<..: 5 5 5 5 5 5 5 5 5 5 ...
$ Ti : num 0.25 0.25 0.25 0.25 0.25 0.25 1 0.25 1 0.25 ...
$ Tc : num 30 28 26 24 22 20 30 18 28 16 ...
After that I use the usual write.table command to save my file:
write.table(data, file = "filename.txt")
Now, when I read again this file into R, and I look at the structure, I get this:
mydata <- read.table("filename.txt", header=TRUE)
> str(mydata)
'data.frame': 11424 obs. of 8 variables:
$ a_KN : num 8.56e-09 1.11e-08 1.45e-08 1.88e-08 2.45e-08 ...
$ a_DTM : num 5.05e-08 5.12e-08 5.19e-08 5.26e-08 5.33e-08 ...
$ SF : num 5.89 4.6 3.58 2.79 2.18 ...
$ Energy : num 160 160 160 160 160 160 160 160 160 160 ...
$ EnergyUnit: Factor w/ 3 levels "GeV","MeV","TeV": 2 2 2 2 2 2 2 2 2 2 ...
$ Location : Factor w/ 7 levels "10cmTarget","AdjBulky",..: 4 4 4 4 4 4 4 4 4 4 ...
$ Ti : num 0.25 0.25 0.25 0.25 0.25 0.25 1 0.25 1 0.25 ...
$ Tc : int 30 28 26 24 22 20 30 18 28 16 ...
Do you know how to solve this problem? This also bothers me because I am creating a Shiny app and the changed classes don't fit my purpose.
Thanks!
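A plain text file written by write.table stores only the values, so column classes such as ordered factors are guessed again by read.table. One option, sketched here with made-up data, is to save the object in R's own format with saveRDS() and read it back with readRDS(); the data frame below is a hypothetical three-row stand-in for the real one.

```r
# Toy data frame with a numeric column and an ordered factor
data <- data.frame(
  SF = c(5.89, 4.60, 3.58),
  Energy = factor(c("160", "800", "1.4"),
                  levels = c("160", "800", "1.4"), ordered = TRUE)
)

# Save and restore the whole object, including classes and level order
path <- tempfile(fileext = ".rds")
saveRDS(data, path)
restored <- readRDS(path)

str(restored)
identical(data, restored)  # compares values, classes and attributes
```

If a text file is required, the ordered factors would have to be rebuilt after reading, for example with factor(..., levels = ..., ordered = TRUE).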
