I have a dataset in which we measured well-being with 18 items, along with political orientation (let's assume for the moment that political orientation is measured with a single item).
A person's well-being score can be computed by averaging all 18 items, but also by averaging any non-empty combination of items (all combinations of one item, of two items, and so on), resulting in sum(choose(18, 1:18)) = 262,143 possible combinations.
I am interested in how the correlation coefficient between well-being and political orientation changes depending on how well-being is computed. That is, I want the 18 (choose(18, 1) = 18) correlation coefficients obtained when well-being is assessed with each single item and then correlated with political orientation, the 153 (choose(18, 2) = 153) coefficients obtained when well-being is computed from every possible 2-item combination and then correlated with political orientation, and so on. In the end I'd be looking for 262,143 correlation coefficients.
The dataset looks something like this (just with >10,000 participants), where V19 is political orientation and V1 to V18 are the well-being items.
df <- as.data.frame(matrix(rnorm(190), ncol = 19))
In essence, I am asking how to compute the average of all combinations of 2, 3, …, 17 well-being items. I came across tidyr's expand() function, but it seems to do something else.
Here are some steps to (1) calculate the average across the combinations of the 18 items; and then (2) correlate each of those combined averages with the 19th column (political orientation).
set.seed(42)
df <- as.data.frame(matrix(rnorm(190), ncol = 19))
df[,1:3]
# V1 V2 V3
# 1 1.37096 1.3049 -0.3066
# 2 -0.56470 2.2866 -1.7813
# 3 0.36313 -1.3889 -0.1719
# 4 0.63286 -0.2788 1.2147
# 5 0.40427 -0.1333 1.8952
# 6 -0.10612 0.6360 -0.4305
# 7 1.51152 -0.2843 -0.2573
# 8 -0.09466 -2.6565 -1.7632
# 9 2.01842 -2.4405 0.4601
# 10 -0.06271 1.3201 -0.6400
rowMeans(df[,c(1,2)])
# [1] 1.3379 0.8610 -0.5129 0.1770 0.1355 0.2649 0.6136 -1.3756 -0.2110 0.6287
rowMeans(df[,c(1,3)])
# [1] 0.53216 -1.17300 0.09561 0.92377 1.14973 -0.26830 0.62713 -0.92891 1.23926 -0.35135
rowMeans(df[,c(2,3)])
# [1] 0.4991 0.2527 -0.7804 0.4679 0.8809 0.1027 -0.2708 -2.2098 -0.9902 0.3401
I show the row-means for three combinations because I want to verify where in the next step those values are found.
means <- lapply(1:3, function(N) {
  do.call(cbind,
          lapply(asplit(combn(18, N), 2),
                 function(ind) rowMeans(df[, ind, drop = FALSE])))
})
str(means)
# List of 3
# $ : num [1:10, 1:18] 1.371 -0.565 0.363 0.633 0.404 ...
# $ : num [1:10, 1:153] 1.338 0.861 -0.513 0.177 0.135 ...
# $ : num [1:10, 1:816] 0.7897 -0.0198 -0.3992 0.5229 0.722 ...
That last step produces a means object that contains the "1" (singular columns), "2" (pairwise row-averages), and "3"-deep combination-averages. Note that choose(18,2) is 153 (number of columns in means[[2]]) and choose(18,3) is 816 (means[[3]]). Each column represents the average of the respective columns combined.
I included 1 here (choose(18,1)) simply to keep all the data in the same structure, since we do want to test the correlation of the single columns; other methods could achieve this, but I leaned towards consistency and simplicity.
To verify we have what we think, I'll pull out three columns from means[[2]] which correspond to the three rowMeans calculations I showed above based on direct access to df (inspection will reveal they are a match):
means[[2]][,c(1,2,18)]
# [,1] [,2] [,3]
# [1,] 1.3379 0.53216 0.4991
# [2,] 0.8610 -1.17300 0.2527
# [3,] -0.5129 0.09561 -0.7804
# [4,] 0.1770 0.92377 0.4679
# [5,] 0.1355 1.14973 0.8809
# [6,] 0.2649 -0.26830 0.1027
# [7,] 0.6136 0.62713 -0.2708
# [8,] -1.3756 -0.92891 -2.2098
# [9,] -0.2110 1.23926 -0.9902
# [10,] 0.6287 -0.35135 0.3401
This means that the columns are ordered as (1,2), (1,3), (1,4), ..., (1,18), then (2,3) (column 18), (2,4), etc., through (17,18) (column 153).
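As an extra sanity check (not part of the original steps), the index pairs behind the verified columns can be read straight off combn, since the column order of means[[2]] mirrors the column order of combn(18, 2):

```r
# Columns 1, 2 and 18 of combn(18, 2) should be the pairs (1,2), (1,3)
# and (2,3) -- the same combinations verified with rowMeans above:
combn(18, 2)[, c(1, 2, 18)]
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3
```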
From here, correlating each of those columns with V19 is not difficult:
cors <- lapply(means, function(mn) apply(mn, 2, cor, df$V19))
str(cors)
# List of 3
# $ : num [1:18] 0.2819 -0.3977 0.0426 0.2501 -0.063 ...
# $ : num [1:153] -0.27 0.168 0.472 0.192 0.6 ...
# $ : num [1:816] -0.1831 -0.063 -0.3355 0.0358 -0.3829 ...
cor(df$V1, df$V19)
# [1] 0.2819
cor(rowMeans(df[,c(1,2)]), df$V19)
# [1] -0.2702
cor(rowMeans(df[,c(1,3)]), df$V19)
# [1] 0.1677
cor(rowMeans(df[,c(1,2,3)]), df$V19)
# [1] -0.1831
cor(rowMeans(df[,c(1,2,4)]), df$V19)
# [1] -0.06303
Because of the way that was done, it should be straightforward to change the N of 3 to whatever you may need. Bear in mind that choose(18, 9) is 48,620, so generating those combination-averages is not instantaneous, but it is still quite manageable:
system.time({
  means18 <- lapply(1:18, function(N) {
    do.call(cbind,
            lapply(asplit(combn(18, N), 2),
                   function(ind) rowMeans(df[, ind, drop = FALSE])))
  })
})
# user system elapsed
# 41.65 0.58 50.35
str(means18)
# List of 18
# $ : num [1:10, 1:18] 1.371 -0.565 0.363 0.633 0.404 ...
# $ : num [1:10, 1:153] 1.338 0.861 -0.513 0.177 0.135 ...
# $ : num [1:10, 1:816] 0.7897 -0.0198 -0.3992 0.5229 0.722 ...
# $ : num [1:10, 1:3060] 0.7062 0.1614 -0.0406 0.24 0.6678 ...
# $ : num [1:10, 1:8568] 0.6061 0.0569 0.1191 0.0466 0.2606 ...
# $ : num [1:10, 1:18564] 0.5588 -0.0832 0.3619 0.146 0.2321 ...
# $ : num [1:10, 1:31824] 0.4265 -0.0449 0.3933 0.3251 0.095 ...
# $ : num [1:10, 1:43758] 0.2428 -0.0505 0.4221 0.1653 0.0153 ...
# $ : num [1:10, 1:48620] 0.3839 -0.0163 0.385 0.1335 -0.1191 ...
# $ : num [1:10, 1:43758] 0.4847 -0.0623 0.4115 0.2592 -0.2183 ...
# $ : num [1:10, 1:31824] 0.5498 0.0384 0.2829 0.4037 -0.259 ...
# $ : num [1:10, 1:18564] 0.5019 0.0442 0.2189 0.3281 -0.3759 ...
# $ : num [1:10, 1:8568] 0.3484 -0.0723 0.2117 0.2262 -0.3471 ...
# $ : num [1:10, 1:3060] 0.364 -0.102 0.197 0.29 -0.219 ...
# $ : num [1:10, 1:816] 0.334 -0.155 0.154 0.269 -0.232 ...
# $ : num [1:10, 1:153] 0.311 -0.242 0.217 0.235 -0.247 ...
# $ : num [1:10, 1:18] 0.282 -0.291 0.214 0.2 -0.198 ...
# $ : num [1:10, 1] 0.254 -0.228 0.105 0.283 -0.139 ...
and the rest of the process can be completed in a similar manner.
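To make that concrete, here is a self-contained sketch of the remaining step for N = 1:3 (the same pattern scales to 1:18, at the cost of the runtime shown above): correlate every column of every element with V19 and collapse the results into a single vector.

```r
set.seed(42)
df <- as.data.frame(matrix(rnorm(190), ncol = 19))

# Combination-averages for subsets of size 1 to 3, exactly as above:
means <- lapply(1:3, function(N) {
  do.call(cbind,
          lapply(asplit(combn(18, N), 2),
                 function(ind) rowMeans(df[, ind, drop = FALSE])))
})

# Correlate every column with political orientation (V19) and flatten:
cors <- lapply(means, function(mn) apply(mn, 2, cor, df$V19))
all_cors <- unlist(cors)
length(all_cors)   # 18 + 153 + 816 = 987; with 1:18 this becomes 262,143
```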
I have a large list of lists: "output" contains 46 elements, each a tibble with a differing number of rows and columns. My immediate goal is to subset a specific column from each element.
This is str(output) of the first two lists to give you an idea of the data.
> str(output)
List of 46
$ Brain :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6108 obs. of 8 variables:
..$ p_val : chr [1:6108] "0" "1.60383253411205E-274" "0" "0" ...
..$ avg_diff : num [1:6108] 1.71 1.7 1.68 1.6 1.58 ...
..$ pct.1 : num [1:6108] 0.998 0.808 0.879 0.885 0.923 0.905 0.951 0.957 0.619 0.985 ...
..$ pct.2 : num [1:6108] 0.677 0.227 0.273 0.323 0.36 0.384 0.401 0.444 0.152 0.539 ...
..$ cluster : num [1:6108] 1 1 1 1 1 1 1 1 1 1 ...
..$ gene : chr [1:6108] "Plp1" "Mal" "Ermn" "Stmn4" ...
..$ X__1 : logi [1:6108] NA NA NA NA NA NA ...
..$ Cell Type: chr [1:6108] "Myelinating oligodendrocyte" NA NA NA ...
$ Bladder :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4656 obs. of 8 variables:
..$ p_val : num [1:4656] 0.00 1.17e-233 2.85e-276 0.00 0.00 ...
..$ avg_diff : num [1:4656] 2.41 2.23 2.04 2.01 1.98 ...
..$ pct.1 : num [1:4656] 0.833 0.612 0.855 0.987 1 0.951 0.711 0.544 0.683 0.516 ...
..$ pct.2 : num [1:4656] 0.074 0.048 0.191 0.373 0.906 0.217 0.105 0.044 0.177 0.106 ...
..$ cluster : num [1:4656] 1 1 1 1 1 1 1 1 1 1 ...
..$ gene : chr [1:4656] "Dpt" "Gas1" "Cxcl12" "Lum" ...
..$ X__1 : logi [1:4656] NA NA NA NA NA NA ...
..$ Cell Type: chr [1:4656] "Stromal cell_Dpt high" NA NA NA ...
Since the overall list has so many elements, I have been trying to write iterative code to perform these tasks. I can achieve what I want manually, element by element, but I haven't found an iterative way of doing it.
x <- data.frame(output$Brain, stringsAsFactors = FALSE)
tmp.list <- x$Cell.Type
tmp.output <- purrr::discard(tmp.list, is.na)
x <- subset(x, Cell.Type %in% tmp.output)
This gives me the output that I want, which are the rows in the column "Cell.Type" with non-NA values.
I got as far as the code below to get the 8th column of each list, which is the "Cell.Type" column.
lapply(output, "[", , 8)
But here I found that the naming and position of the "Cell.Type" column is not consistent across the lists. This means I cannot use lapply to subset the 8th column, as some lists have it in, for example, the 9th column.
I tried the code below, but it does not work and gets an error.
lapply(output, "[", , c('Cell.Type', 'celltyppe'))
#Error: Column `celltyppe` not found
#Call `rlang::last_error()` to see a backtrace
Essentially, from my "output" list, I want to subset either columns "Cell.Type" or "celltyppe" from each of the 46 lists to create a new list with 46 lists of just a single column of values. Then I want to drop all rows with NA.
I would like to perform this using some sort of loop.
At the moment I have not had much success. lapply seems able to extract columns from lists iteratively, but I am having difficulty subsetting named columns.
Once I can do this, I then want to create a loop that can subset only rows without NA.
FINAL CODE
This is the final code I used to create exactly what I had hoped for. The first line specifies the loop over each element of the large list. The second line selects the columns of each element whose name contains "ell" (Cell type, Cell Type, or celltyppe). The last removes any rows with NA.
purrr::map(output, ~ .x %>%
dplyr::select(matches("ell")) %>%
na.omit)
We can use an anonymous function call:
lapply(output, function(x) na.omit(x[grep("(?i)cell\\.?typp?e", names(x))]))
#[[1]]
# Cell.Type
#1 1
#2 2
#3 3
#4 4
#5 5
#[[2]]
# celltyppe
#1 7
#2 8
#3 9
#4 10
#5 11
Also with purrr
library(tidyverse)
map(output, ~ .x %>%
      select(matches("(?i)Cell\\.?Typp?e")) %>%
      na.omit)
data
output <- list(data.frame(Cell.Type = 1:5, col1 = 6:10, col2 = 11:15),
               data.frame(coln = 1:5, celltyppe = 7:11))
I have two lists of 48 elements. Each element in the list has one variable (DiffINT or DiffEXT below), with differing numbers of observations. The names of all of the elements are the same in both lists.
What I would like to do is merge the two lists element-wise based on the element name, ending up with two variables per element.
Bonus question: I have two lists of 48 elements, both lists have the same elements. One list has one variable with one observation in it, the other list as six variables per element with varied numbers of observations. Can I somehow merge these to accomplish the same as above?
I have reviewed other questions and tried append() and cbind() and other functions, but none of them accomplish what I want. Example of what I am looking for is below.
> str(DiffsMerged)
List of 48
$ Element1:List of 2
..$ DiffINT : num 1 0.642 0.27 -0.102 -0.123 ...
..$ DiffEXT : num 1 0.1397 -0.1045 -0.0751 -0.1414 ...
$ Element 2:List of 2
..$ DiffINT : num 1 0.5842 0.3453 0.158 -0.0259 ...
..$ DiffEXT : num 1 -0.0312 -0.0321 -0.033 -0.0339 ...
$ Element 3:List of 2
..$ DiffINT : num 1 0.908 0.816 0.724 0.632 ...
..$ DiffEXT : num 1 0.584 0.21 -0.163 -0.406
Many thanks in advance.
Edit to add: Whenever I want to view the individual lists (DiffINT and DiffEXT), I get the following error. Thoughts?
> View(DiffEXT)
Error in if (more || nchar(output) > 80) { :
missing value where TRUE/FALSE needed
You can get a simple "merge" with a lapply loop:
all_names <- union(names(DiffINT), names(DiffEXT))
DiffsMerged <- lapply(
  X = all_names,
  FUN = function(name) {
    list(DiffINT[[name]], DiffEXT[[name]])
  }
)
names(DiffsMerged) <- all_names
str(DiffsMerged)
# List of 3
# $ Element1:List of 2
# ..$ : num [1:5] 1 0.642 0.27 -0.102 -0.123
# ..$ : num [1:5] 1 0.1397 -0.1045 -0.0751 -0.1414
# $ Element2:List of 2
# ..$ : num [1:5] 1 0.1397 -0.1045 -0.0751 -0.1414
# ..$ : num [1:5] 1 -0.0312 -0.0321 -0.033 -0.0339
# $ Element3:List of 2
# ..$ : num [1:5] 1 0.908 0.816 0.724 0.632
# ..$ : num [1:5] 1 0.584 0.21 -0.163 -0.406
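One caveat: this keeps the outer names but drops the inner ones (..$ : rather than the ..$ DiffINT / ..$ DiffEXT shown in the question). A small variant preserves them; the toy lists below are hypothetical stand-ins for the real data:

```r
# Hypothetical stand-ins for the asker's two lists:
DiffINT <- list(Element1 = c(1, 0.642, 0.27),
                Element2 = c(1, 0.5842, 0.3453))
DiffEXT <- list(Element1 = c(1, 0.1397, -0.1045),
                Element2 = c(1, -0.0312, -0.0321))

# setNames keeps the outer names; the named inner list keeps DiffINT/DiffEXT:
all_names <- union(names(DiffINT), names(DiffEXT))
DiffsMerged <- lapply(setNames(all_names, all_names), function(name) {
  list(DiffINT = DiffINT[[name]], DiffEXT = DiffEXT[[name]])
})
str(DiffsMerged)
```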
I don't know what you plan to use this data for, but it could help to keep it tidy. Only do this if both lists have the same names and all elements have the same length.
int_df <- data.frame(DiffINT)
int_df[["source"]] <- "int"
ext_df <- data.frame(DiffEXT)
ext_df[["source"]] <- "ext"
merged_df <- rbind(int_df, ext_df)
merged_df
# Element1 Element2 Element3 source
# 1 1.0000 1.0000 1.000 int
# 2 0.6420 0.1397 0.908 int
# 3 0.2700 -0.1045 0.816 int
# 4 -0.1020 -0.0751 0.724 int
# 5 -0.1230 -0.1414 0.632 int
# 6 1.0000 1.0000 1.000 ext
# 7 0.1397 -0.0312 0.584 ext
# 8 -0.1045 -0.0321 0.210 ext
# 9 -0.0751 -0.0330 -0.163 ext
# 10 -0.1414 -0.0339 -0.406 ext
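If a fully long ("tidy") layout is preferred instead, with one row per observation, base R's stack() gets there too; unlike the rbind of wide data frames above, it does not require every element to have the same length. Toy data here, standing in for the real lists:

```r
DiffINT <- list(Element1 = c(1, 0.642, 0.27),
                Element2 = c(1, 0.5842))          # unequal lengths are fine
DiffEXT <- list(Element1 = c(1, 0.1397, -0.1045),
                Element2 = c(1, -0.0312))

# stack() turns a named list into a two-column data frame (values, ind);
# tag each block with its source before binding them together:
long_df <- rbind(cbind(stack(DiffINT), source = "int"),
                 cbind(stack(DiffEXT), source = "ext"))
head(long_df, 3)
```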
I am new to using apply and functions together, and I am stuck and frustrated. I have two lists of data frames, and I need to add a certain number of columns to the first one when a condition related to the second one is fulfilled. Below is the structure of the first list, which has one data frame per station; every data frame has two or more columns, one per pressure:
> str(KDzlambdaEG)
List of 3
$ 176:'data.frame': 301 obs. of 3 variables:
..$ 0 : num [1:301] 0.186 0.182 0.18 0.181 0.177 ...
..$ 5 : num [1:301] 0.127 0.127 0.127 0.127 0.127 ...
..$ 20: num [1:301] 0.245 0.241 0.239 0.236 0.236 ...
$ 177:'data.frame': 301 obs. of 2 variables:
..$ 0 : num [1:301] 0.132 0.132 0.132 0.13 0.13 ...
..$ 25: num [1:301] 0.09 0.092 0.0902 0.0896 0.0896 ...
$ 199:'data.frame': 301 obs. of 2 variables:
..$ 0 : num [1:301] 0.181 0.182 0.181 0.182 0.179 ...
..$ 10: num [1:301] 0.186 0.186 0.185 0.183 0.184 ...
On the other hand, I have the second list, which holds the number of columns I need to add after every column of each data frame in the first list:
> str(dif)
List of 3
 $ 176: num [1:3] 4 15 28
 $ 177: num [1:2] 24 67
 $ 199: num [1:2] 9 53
I've tried tonnes of things, including the append_col function that appears in How to add a new column between other dataframe columns?:
for (i in 1:length(dif)) {
  A <- lapply(KDzlambdaEG, append_col, rep(list(NA), dif[[i]][1]), after = 1)
}
but nothing seems to work so far... I have searched for answers here, but being a newcomer I find it difficult to find specific ones.
Try:
indxlst <- lapply(dif, function(x) c(1, x[-length(x)] + 1, x[length(x)]))
newdflist <- lapply(indxlst, function(x) data.frame(matrix(0, 2, sum(x))))
for (i in 1:length(newdflist)) {
  newdflist[[i]][indxlst[[i]]] <- KDzlambdaEG[[i]]
}
Reproducible Data Test
df1 <- data.frame(x=1:2, y=c("Jan", "Feb"), z=c("A", "B"))
df3 <- df2 <- df1[,-3]
KDzlambdaEG <- list(df1,df2,df3)
x1 <- c(4,15,28)
x2 <- c(24,67)
x3 <- c(9, 53)
dif <- list(x1,x2,x3)
indxlst <- lapply(dif, function(x) c(1, x[-length(x)] + 1, x[length(x)]))
newdflist <- lapply(indxlst, function(x) data.frame(matrix(0, 2, sum(x))))
for (i in 1:length(newdflist)) {
  newdflist[[i]][indxlst[[i]]] <- KDzlambdaEG[[i]]
}
newdflist
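To see what the index bookkeeping is doing, here is the index vector indxlst computes for the first station (my reading of the code above, shown as a quick check): the original columns are placed at these positions of the zero-filled frame, leaving the gaps requested in dif.

```r
# Index positions derived from x1 = c(4, 15, 28): first original column at
# position 1, the next ones just past each requested offset:
x1 <- c(4, 15, 28)
c(1, x1[-length(x1)] + 1, x1[length(x1)])
# [1]  1  5 16 28
```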
I'm trying to use dplyr to group and summarize a dataframe, but keep getting the following error:
Error: cannot modify grouping variable
Here's the code that generates it:
data_summary <- labeled_dataset %>%
  group_by("Activity") %>%
  summarise_each(funs(mean))
Here's the structure of the data frame that I'm applying this to:
> str(labeled_dataset)
'data.frame': 10299 obs. of 88 variables:
$ Subject : int 1 1 1 1 1 1 1 1 1 1 ...
$ Activity : Factor w/ 6 levels "LAYING","SITTING",..: 3 3 3 3 3 3 3 3 3 3 ...
$ tBodyAccmeanX : num 0.289 0.278 0.28 0.279 0.277 ...
$ tBodyAccmeanY : num -0.0203 -0.0164 -0.0195 -0.0262 -0.0166 ...
$ tBodyAccmeanZ : num -0.133 -0.124 -0.113 -0.123 -0.115 ...
$ tGravityAccmeanX : num 0.963 0.967 0.967 0.968 0.968 ...
$ tGravityAccmeanY : num -0.141 -0.142 -0.142 -0.144 -0.149 ...
$ tGravityAccmeanZ : num 0.1154 0.1094 0.1019 0.0999 0.0945 ...
...
The only reference I've found to this error is another post that suggests ungrouping first to make sure the data isn't already grouped. I've tried that without success.
Thanks,
Luke
Don't put the name of the grouping variable in quotes:
data_summary <- labeled_dataset %>%
  group_by(Activity) %>%
  summarise_each(funs(mean))
Looks like there were two problems:
1. The grouping variable name was in quotes ("Activity" instead of Activity) - thanks, Richard!
2. By not specifying which columns to summarise, dplyr was trying to take the mean of every column, including the identifier columns (Subject and the grouping variable Activity).
I fixed the code, specifying all columns except the grouping ones, as follows:
data_summary <- labeled_dataset %>%
  group_by(Activity) %>%
  summarise_each(funs(mean), tBodyAccmeanX:tGravityAccmeanX)
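For what it's worth, summarise_each() and funs() have since been deprecated in dplyr; in current versions the same summary is written with across(). A sketch on a toy frame mimicking the question's structure (column names from the question, values made up):

```r
library(dplyr)

# Toy stand-in for labeled_dataset (two of the 86 measurement columns):
labeled_dataset <- data.frame(
  Subject       = c(1, 1, 2, 2),
  Activity      = factor(c("LAYING", "SITTING", "LAYING", "SITTING")),
  tBodyAccmeanX = c(0.289, 0.278, 0.280, 0.279),
  tBodyAccmeanY = c(-0.020, -0.016, -0.019, -0.026)
)

# across() replaces summarise_each()/funs(); same column-range selection:
data_summary <- labeled_dataset %>%
  group_by(Activity) %>%
  summarise(across(tBodyAccmeanX:tBodyAccmeanY, mean))
```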