R dplyr summarize_each --> "Error: cannot modify grouping variable" - r

I'm trying to use dplyr to group and summarize a dataframe, but keep getting the following error:
Error: cannot modify grouping variable
Here's the code that generates it:
data_summary <- labeled_dataset %>%
group_by("Activity") %>%
summarise_each(funs(mean))
Here's the structure of the data frame that I'm applying this to:
> str(labeled_dataset)
'data.frame': 10299 obs. of 88 variables:
$ Subject : int 1 1 1 1 1 1 1 1 1 1 ...
$ Activity : Factor w/ 6 levels "LAYING","SITTING",..: 3 3 3 3 3 3 3 3 3 3 ...
$ tBodyAccmeanX : num 0.289 0.278 0.28 0.279 0.277 ...
$ tBodyAccmeanY : num -0.0203 -0.0164 -0.0195 -0.0262 -0.0166 ...
$ tBodyAccmeanZ : num -0.133 -0.124 -0.113 -0.123 -0.115 ...
$ tGravityAccmeanX : num 0.963 0.967 0.967 0.968 0.968 ...
$ tGravityAccmeanY : num -0.141 -0.142 -0.142 -0.144 -0.149 ...
$ tGravityAccmeanZ : num 0.1154 0.1094 0.1019 0.0999 0.0945 ...
...
The only reference I've found to this error is another post that suggests ungrouping first to make sure the data isn't already grouped. I've tried that without success.
Thanks,
Luke

Don't put the name of the grouping variable in quotes:
data_summary <- labeled_dataset %>%
group_by(Activity) %>%
summarise_each(funs(mean))

It looks like there were two problems:
1. The grouping variable name was in quotes ("Activity" instead of Activity). Thanks, Richard!
2. Because I didn't specify which columns to summarise, dplyr tried to take the mean of every column, including the first two columns (Subject and Activity), which hold the identifying and grouping variables.
I fixed the code by naming the columns to summarise explicitly, as a range that leaves out Subject and Activity:
data_summary <- labeled_dataset %>%
group_by(Activity) %>%
summarise_each(funs(mean), tBodyAccmeanX:tGravityAccmeanX)
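For readers on a current dplyr release: summarise_each() and funs() are deprecated, and across() (dplyr 1.0.0 and later) is the documented replacement. The following is only a sketch of the same group-and-average step, assuming a small made-up data frame called toy in place of labeled_dataset. across() skips grouping columns on its own, so only Subject has to be excluded by hand.

```r
library(dplyr)

# Hypothetical stand-in for labeled_dataset (not the real data)
toy <- data.frame(
  Subject       = c(1L, 1L, 2L, 2L),
  Activity      = factor(c("WALKING", "SITTING", "WALKING", "SITTING")),
  tBodyAccmeanX = c(0.2, 0.4, 0.6, 0.8),
  tBodyAccmeanY = c(-0.1, -0.3, -0.5, -0.7)
)

toy_summary <- toy %>%
  group_by(Activity) %>%                              # unquoted column name
  summarise(across(-Subject, mean), .groups = "drop") # mean of every non-grouping column except Subject

toy_summary
```

The result has one row per Activity level and one mean column per measurement.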

Related

How do you get structure of data frame with limited length for variable names?

I have a data frame for a raw data set where the variable names are extremely long. I would like to display the structure of the data frame using the str function, and impose a character limit on the displayed variable names, so that it is easier to read.
Here is a reproducible example of the kind of thing I am talking about.
#Data frame with long names
set.seed(1);
DATA <- data.frame(ID = 1:50,
Value = rnorm(50),
This_variable_has_a_really_long_and_annoying_name_to_illustrate_the_problem_of_a_data_frame_with_a_long_and_annoying_name = runif(50));
#Show structure of DATA
str(DATA);
> str(DATA)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has_a_really_long_and_annoying_name_to_illustrate_the_problem_of_a_data_frame_with_a_long_and_annoying_name: num 0.655 0.353 0.27 0.993 0.633 ...
I would like to use the str function but impose an upper limit on the number of characters to display in the variable names, so that I get output that is something like the one below. I have read the documentation, but I have not been able to identify if there is an option to do this. (There seem to be options to impose upper limits on the lengths of strings in the data, but I cannot see an option to impose a limit on the length of the variable name.)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_has... : num 0.655 0.353 0.27 0.993 0.633 ...
Question: Is there a simple way to get the structure of the data frame, but imposing a limitation on the length of the variable names (to get output something like the above)?
As far as I can see you're right, there doesn't seem to be a built in means to control this. You also can't do it after the fact because str() doesn't return anything. So the easiest option seems to be renaming beforehand. Relying on setNames(), you could create a simple function to accomplish this:
short_str <- function(data, n = 20, ...) {
name_vec <- names(data)
str(setNames(data, ifelse(
nchar(name_vec) > n, paste0(substring(name_vec, 1, n - 4), "... "), name_vec
)), ...)
}
short_str(DATA)
'data.frame': 50 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ Value : num -0.626 0.184 -0.836 1.595 0.33 ...
$ This_variable_ha... : num 0.655 0.353 0.27 0.993 0.633 ...
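If it helps to check the truncation on its own, the renaming step can be pulled out into a separate helper. This is a sketch rather than part of the answer above: truncate_names and wide_df are hypothetical names, and this version keeps n - 3 characters plus "..." so that shortened names are exactly n characters long.

```r
# Hypothetical helper: shorten any name longer than n characters
truncate_names <- function(x, n = 20) {
  ifelse(nchar(x) > n, paste0(substring(x, 1, n - 3), "..."), x)
}

short_names <- truncate_names(c("ID", "abcdefghijklmnop"), n = 10)
short_names

# Made-up data frame with one long column name
wide_df <- data.frame(ID = 1:3, a_very_long_column_name_that_is_hard_to_read = c(0.1, 0.2, 0.3))
str(setNames(wide_df, truncate_names(names(wide_df))))
```

Note that two long names sharing the same prefix would collapse to the same shortened name, which is harmless for display with str() but worth keeping in mind.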

Subsetting columns in different positions and with different names in a large list of lists with purrr

I have a large list of lists. There are 46 lists in "output". Each list is a tibble with differing number of rows and columns. My immediate goal is to subset a specific column from each list.
This is str(output) of the first two lists to give you an idea of the data.
> str(output)
List of 46
$ Brain :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 6108 obs. of 8 variables:
..$ p_val : chr [1:6108] "0" "1.60383253411205E-274" "0" "0" ...
..$ avg_diff : num [1:6108] 1.71 1.7 1.68 1.6 1.58 ...
..$ pct.1 : num [1:6108] 0.998 0.808 0.879 0.885 0.923 0.905 0.951 0.957 0.619 0.985 ...
..$ pct.2 : num [1:6108] 0.677 0.227 0.273 0.323 0.36 0.384 0.401 0.444 0.152 0.539 ...
..$ cluster : num [1:6108] 1 1 1 1 1 1 1 1 1 1 ...
..$ gene : chr [1:6108] "Plp1" "Mal" "Ermn" "Stmn4" ...
..$ X__1 : logi [1:6108] NA NA NA NA NA NA ...
..$ Cell Type: chr [1:6108] "Myelinating oligodendrocyte" NA NA NA ...
$ Bladder :Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4656 obs. of 8 variables:
..$ p_val : num [1:4656] 0.00 1.17e-233 2.85e-276 0.00 0.00 ...
..$ avg_diff : num [1:4656] 2.41 2.23 2.04 2.01 1.98 ...
..$ pct.1 : num [1:4656] 0.833 0.612 0.855 0.987 1 0.951 0.711 0.544 0.683 0.516 ...
..$ pct.2 : num [1:4656] 0.074 0.048 0.191 0.373 0.906 0.217 0.105 0.044 0.177 0.106 ...
..$ cluster : num [1:4656] 1 1 1 1 1 1 1 1 1 1 ...
..$ gene : chr [1:4656] "Dpt" "Gas1" "Cxcl12" "Lum" ...
..$ X__1 : logi [1:4656] NA NA NA NA NA NA ...
..$ Cell Type: chr [1:4656] "Stromal cell_Dpt high" NA NA NA ...
Since I have a large number of lists that make up the list, I have been trying to create an iterative code to perform tasks. This hasn't been successful.
I can achieve this manually, or list by list, but I haven't been successful in finding an iterative way of doing this.
x <- data.frame(output$Brain, stringsAsFactors = FALSE)
tmp.list <- x$Cell.Type
tmp.output <- purrr::discard(tmp.list, is.na)
x <- subset(x, Cell.Type %in% tmp.output)
This gives me the output that I want, which are the rows in the column "Cell.Type" with non-NA values.
I got as far as the code below to get the 8th column of each list, which is the "Cell.Type" column.
lapply(output, "[", , 8)
But here I found that the naming and positioning of the "Cell.Type" column in each list is not consistent. This means I cannot use the lapply function to subset the 8th columns, as some lists have this on for example the 9th column.
I tried the code below, but it does not work and gets an error.
lapply(output, "[", , c('Cell.Type', 'celltyppe'))
#Error: Column `celltyppe` not found
#Call `rlang::last_error()` to see a backtrace
Essentially, from my "output" list, I want to subset either columns "Cell.Type" or "celltyppe" from each of the 46 lists to create a new list with 46 lists of just a single column of values. Then I want to drop all rows with NA.
I would like to perform this using some sort of loop.
At the moment I have not had much success. lapply seems to be able to extract columns from each list iteratively, but I am having difficulty subsetting columns by name.
Once I can do this, I then want to create a loop that can subset only rows without NA.
FINAL CODE
This is the final code I used to get exactly what I had hoped for. The first line loops over each list in the large list. The second line selects the columns of each list whose name contains "ell" (Cell type, Cell Type, or celltyppe). The last line removes any rows with NA.
purrr::map(output, ~ .x %>%
dplyr::select(matches("ell")) %>%
na.omit)
We can use an anonymous function call:
lapply(output, function(x) na.omit(x[grep("(?i)Cell\\.?(?i)Typp?e", names(x))]))
#[[1]]
# Cell.Type
#1 1
#2 2
#3 3
#4 4
#5 5
#[[2]]
# celltyppe
#1 7
#2 8
#3 9
#4 10
#5 11
Also with purrr
library(tidyverse)
map(output, ~ .x %>%
select(matches("(?i)Cell\\.?(?i)Typp?e")) %>%
na.omit)
data
output <- list(data.frame(Cell.Type = 1:5, col1 = 6:10, col2 = 11:15),
data.frame(coln = 1:5, celltyppe = 7:11))
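When the possible column names are known in advance, an exact-name lookup is an alternative to pattern matching. The sketch below assumes dplyr 1.0.0 or later for any_of(), which selects whichever of the listed names exist and silently ignores the rest. output_demo and wanted are made-up names for illustration, not the asker's data.

```r
library(purrr)
library(dplyr)

# Hypothetical miniature of the "output" list
output_demo <- list(
  Brain   = data.frame(gene = c("Plp1", "Mal", "Ermn"),
                       Cell.Type = c("Myelinating oligodendrocyte", NA, NA)),
  Bladder = data.frame(celltyppe = c("Stromal cell", NA),
                       gene = c("Dpt", "Gas1"))
)

wanted <- c("Cell.Type", "celltyppe")  # candidate spellings of the column

cell_types <- map(output_demo, ~ .x %>%
  select(any_of(wanted)) %>%  # keep whichever spelling this element has
  na.omit())                  # drop rows where the cell type is NA

cell_types
```

Each element of cell_types is a one-column data frame holding only the non-NA cell types, and the list keeps the original element names.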

Accessing dataframes after splitting a dataframe

I'm splitting a dataframe in multiple dataframes using the command
data <- apply(data, 2, function(x) data.frame(sort(x, decreasing=F)))
I don't know how to access them all at once. I know I can access each one by name with data$`1`, but then I have to do that for every dataframe:
df1<- head(data$`1`,k)
df2<- head(data$`2`,k)
Can I get these dataframes in one go (for example, by storing them in some structure)? The indexes of these dataframes shouldn't change.
str(data) gives
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
str(data[1:2])
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
Thanks to @r2evans I got it done. Here is his code from the comments:
Yes. Two short demos: lapply(data, head, n=2), or more generically
sapply(data, function(df) mean(df$x)). – r2evans
and after that fetching the indexes
df<-lapply(df, rownames)
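Putting the pieces together, the whole workflow might look like the sketch below. It is not the asker's exact code: it swaps apply() and sort() for lapply() with order(), so the original row labels are carried along explicitly, and scores, sorted, top_k and top_ids are made-up names.

```r
# Made-up input with numeric column names and labelled rows
scores <- data.frame(`7` = c(0.52, 0.27, 0.46),
                     `8` = c(0.41, 0.17, 0.50),
                     check.names = FALSE)
rownames(scores) <- c("a", "b", "c")

# One sorted data frame per column, keeping the original row labels
sorted <- lapply(scores, function(col) {
  ord <- order(col)  # increasing order
  data.frame(value = col[ord], row.names = rownames(scores)[ord])
})

k <- 2
top_k   <- lapply(sorted, head, n = k)  # first k rows of every data frame in one go
top_ids <- lapply(top_k, rownames)      # original row labels of those rows

top_ids
```

Because everything stays in a named list, individual results are still reachable as top_k[["7"]] or top_k$`7`.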

Columns Named as Numbers

I have a data frame that already has columns named by numbers:
> str(df)
'data.frame': 8305 obs. of 5 variables:
$ 1 : num 0.652 0.526 0.504 0.628 0.744 ...
$ 2 : num 0.498 0.476 0.454 0.454 0.498 ...
$ 3 : num 0.3537 0.0368 0.3421 0.3421 0.3537 ...
$ 4 : num 0.298 0.031 0.309 0.305 0.313 ...
$ 5 : num 0.2808 0.0292 0.2781 0.2811 0.2808 ...
I know that commands such as df$1 or df$as.character(1) do not work, but is there a way to subset without using index numbers (so NO df[,1])?
You can do
df$`1`
Any name that cannot be treated as an R variable needs to be wrapped with backticks. Of course you could also just do
df["1"]
Yes, use quotes in the case of [ and backticks in the case of $.
> x <- data.frame(`2`=1, `1`=2, check.names=FALSE)
> x
2 1
1 1 2
> x[,"2"]
[1] 1
> x$`2`
[1] 1
> x$`1`
[1] 2
> x[,"1"]
[1] 2
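The df$as.character(1) attempt fails because $ takes a literal name and does not evaluate what follows it. When the column name has to be computed, [[ accepts a character string built at run time. A short sketch, reusing the same toy data frame x (col_name is a made-up variable):

```r
x <- data.frame(`2` = 1, `1` = 2, check.names = FALSE)

col_name <- as.character(1)  # name computed at run time
x[[col_name]]                # same column as x$`1`
```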

how to assign colour to subset of variables ggplot2

I have a data frame of 379838 rows and 13 variables in columns( 13 clinical samples) :
> str( df)
'data.frame': 379838 obs. of 13 variables:
$ V1 : num 0.8146 0.7433 0.0174 0.177 0 ...
$ V2 : num 0.7465 0.5833 0.0848 0.5899 0.0161 ...
$ V3 : num 0.788 0.843 0.333 0.801 0.156 ...
$ V4 : num 0.601 0.958 0.319 0.807 0.429 ...
$ V5 : num 0.792 0.49 0.341 0.865 1 ...
$ V6 : num 0.676 0.801 0.229 0.822 0.282 ...
$ V7 : num 0.783 0.732 0.223 0.653 0.507 ...
$ V8 : num 0.69 0.773 0.108 0.69 0.16 ...
$ V9 : num 0.4014 0.5959 0.0551 0.7578 0.2784 ...
$ V10: num 0.703 0.784 0.131 0.698 0.204 ...
$ V11: num 0.6731 0.8224 0.125 0.6021 0.0772 ...
$ V12: num 0.7889 0.7907 0.0881 0.7175 0.2392 ...
$ V13: num 0.6731 0.8221 0.0341 0.4059 0 ...
and I am trying to make a ggplot2 box plot grouping variables into three groups: V1-V5 , V6-V9 and V10-V13 and assigning different color to variables of each group.
I am trying the following code:
df1= as.vector(df[, c("V1", "V2", "V3","V4", "V5")])
df2= as.vector(df[, c("V6","V7", "V8","V9")])
df3=as.vector(df[, c( "V10","V11", "V12","V13")])
sample= c(df1,df2,df3)
library(reshape2)
meltData1 <- melt(df, varnames="sample")
str(meltData1)
'data.frame': 4937894 obs. of 2 variables:
$ variable: Factor w/ 13 levels "V1","V2","V3",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 0.8146 0.7433 0.0174 0.177 0 ...
p=ggplot(data=meltData1,aes(variable,value, fill=x$sample))
p+geom_boxplot()
That gives me white box plots. How can I assign a colour to three groups of variables? Many thanks in advance!
As sample data were not provided, I made a new data frame containing 13 columns named V1 to V13.
df<-as.data.frame(matrix(rnorm(1300),ncol=13))
The function melt() from the reshape2 library transforms the data from wide to long format. The data frame now has two columns: variable and value.
library(reshape2)
dflong<-melt(df)
A new column, sample, is added to the long-format data. Here I repeated the names group1, group2 and group3 according to the number of rows in the original data frame and the number of original columns in each group.
dflong$sample<-c(rep("group1",nrow(df)*5),rep("group2",nrow(df)*4),rep("group3",nrow(df)*4))
The new column is passed to the fill= argument to set colors according to the grouping.
library(ggplot2)
ggplot(data=dflong,aes(variable,value, fill=sample))+geom_boxplot()
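The rep() approach relies on melt() stacking the columns in V1 to V13 order. A variant that does not depend on row order is to look up each row's group from its variable name. This is a sketch under the same assumptions as the answer (random data, three groups of 5, 4 and 4 columns); group_of and p are made-up names.

```r
library(reshape2)
library(ggplot2)

df <- as.data.frame(matrix(rnorm(1300), ncol = 13))  # columns V1..V13
dflong <- melt(df, measure.vars = names(df))         # long format: variable, value

# Named lookup vector: column name -> group label
group_of <- setNames(
  c(rep("group1", 5), rep("group2", 4), rep("group3", 4)),
  paste0("V", 1:13)
)

# Assign the group by name, independent of how the rows are ordered
dflong$sample <- unname(group_of[as.character(dflong$variable)])

p <- ggplot(dflong, aes(variable, value, fill = sample)) + geom_boxplot()
p
```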
This is a follow-up to Didzis Elferts's answer.
Objective: Split the sample into 3 colour groups with a difference in shade within the colour group.
The first part of the code is the same:
df<-as.data.frame(matrix(rnorm(1300),ncol=13))
library(reshape2)
dflong<-melt(df)
dflong$sample<-c(rep("group1",nrow(df)*5),rep("group2",nrow(df)*4),rep("group3",nrow(df)*4))
library(ggplot2)
Now, use the package RColorBrewer to select color shades
library(RColorBrewer)
Create a list of colors by color class
col.g <- c(brewer.pal(9,"Greens"))[5:9] # select 5 colors from class Greens
col.r <- c(brewer.pal(9,"Reds"))[6:9] # select 4 colors from class Reds
col.b <- c(brewer.pal(9,"Blues"))[6:9] # select 4 colors from class Blues
my.cols <- c(col.g,col.r,col.b)
Take a look at the colors selected:
image(1:13,1,as.matrix(1:13), col=my.cols, xlab="my palette", ylab="", xaxt="n", yaxt="n", bty="n")
And now plot with the colors we have created
ggplot(data=dflong,aes(variable,value,colour=variable))+geom_boxplot()+scale_colour_manual(values = my.cols)
In the above, with the colour and scale_colour_manual commands, only the lines are colored. Below, we use fill and scale_fill_manual:
ggplot(data=dflong,aes(variable,value,fill=variable))+geom_boxplot()+scale_fill_manual(values = my.cols)
P.S. I'm a total newbie and learning R myself. I saw this question as an opportunity to apply something I just learned.
