I would like to get the values from multiple columns and rows of a data frame, do something with those values (e.g., calculate a mean), then add the results to a new column in a modified version of the original data frame. The columns and rows are selected based on values in other columns. I've gotten this working in dplyr, but only when I "hard code" the column names that are used by the function.
Example code:
'''
# create a test data frame
tdf <- data.frame(A=c('a','a','b','b','b'), B=c('d','d','e','e','f'),
  # use NA directly instead of coercing the string 'na' (which only works with a warning)
  L1=c(1,2,3,4,5), L2=c(11,12,13,NA,15),
  L3=c(NA,22,23,NA,25), stringsAsFactors=FALSE)
'''
which gives
A B L1 L2 L3
1 a d 1 11 NA
2 a d 2 12 22
3 b e 3 13 23
4 b e 4 NA NA
5 b f 5 15 25
In this case, I would like to calculate the mean of the values in the L* columns (L1,L2,L3) that have the same values in columns A and B. For example, using the values in A and B, I select rows 1 & 2 and calculate a mean using (1, 11, 2, 12, 22), then rows 3 & 4 (3, 13, 23, 4), and finally row 5 (5, 15, 25).
I can do this with either plyr's ddply() or dplyr (both work):
'''
# plyr
ddply(tdf, .(A, B), summarize, mean_L=mean(c(L1, L2, L3), na.rm=TRUE))
# or dplyr
tdf %>% group_by(A,B) %>% summarize(mean_L=mean(c(L1,L2,L3), na.rm=TRUE))
'''
which gives what I want:
A B mean_L
1 a d 9.60
2 b e 10.75
3 b f 15.00
However, my issue is that the number of "L" columns is dynamic among different data sets. In some cases I may have 10 total columns (L1, L2, ... L10) or 100+ columns. The columns I use for the selection criteria (in this case A and B), will always be the same, so I can "hard code" those, but I'm having difficulty specifying the columns in the "mean" function.
dplyr has a way of dynamically generating the "group by" variables, but that does not seem to work within the function component of the summarize. For example, I can do this:
'''
b <- names(tdf)[1:2]
dots <- lapply(b, as.symbol)
tdf %>% group_by(.dots=dots) %>% summarize(mean_L=mean(c(L1,L2,L3), na.rm=TRUE))
'''
but I can't do the same inside the mean function. The closest I have come to working is:
'''
b='L1'
tdf %>% group_by(A,B) %>% summarize(mean_L=mean(.data[[b]], na.rm=TRUE))
'''
but this only works for a single column. If I try b='L1,L2,L3', dplyr seems to treat the literal "L1,L2,L3" as one column name rather than as a list of columns.
This doesn't seem to be a complicated problem, but I would like help finding the solution, either in dplyr or some other way.
Many thanks!
tdf %>%
group_by_at(1:2) %>%
summarise(mean_L=mean(c_across(starts_with("L")),
na.rm=TRUE)) %>%
ungroup()
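If the L columns are easier to handle as a character vector of names (for instance, one built from names(tdf)), a minimal sketch along the same lines wraps that vector in all_of(). The l_cols variable and the ^L[0-9]+$ pattern are assumptions for illustration; the data frame is the tdf from the question.

```r
library(dplyr)

# Test data from the question
tdf <- data.frame(A = c("a", "a", "b", "b", "b"),
                  B = c("d", "d", "e", "e", "f"),
                  L1 = c(1, 2, 3, 4, 5),
                  L2 = c(11, 12, 13, NA, 15),
                  L3 = c(NA, 22, 23, NA, 25),
                  stringsAsFactors = FALSE)

# Hypothetical helper vector: every column named "L" followed by digits
l_cols <- grep("^L[0-9]+$", names(tdf), value = TRUE)

res <- tdf %>%
  group_by(A, B) %>%
  # c_across() pools the values of the selected columns within each group
  summarise(mean_L = mean(c_across(all_of(l_cols)), na.rm = TRUE),
            .groups = "drop")
res
```

Because l_cols is computed from the column names, the same pipeline should work unchanged whether there are 3 or 100+ L columns.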
No matter how many L columns you have, you can always transform your data set into long format and then group by your variables:
library(tidyverse)
tdf %>%
pivot_longer(!c(A, B)) %>%
group_by(A, B) %>%
summarise(L_mean = mean(value, na.rm = TRUE))
# A tibble: 3 × 3
# Groups: A [2]
A B L_mean
<chr> <chr> <dbl>
1 a d 9.6
2 b e 10.8
3 b f 15
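The same pooling can also be sketched in base R without reshaping: split the rows by A and B, then flatten the L columns of each piece. The l_cols vector and the use of split() are assumptions for illustration, not part of the answers above.

```r
# Test data from the question
tdf <- data.frame(A = c("a", "a", "b", "b", "b"),
                  B = c("d", "d", "e", "e", "f"),
                  L1 = c(1, 2, 3, 4, 5),
                  L2 = c(11, 12, 13, NA, 15),
                  L3 = c(NA, 22, 23, NA, 25),
                  stringsAsFactors = FALSE)

# Hypothetical helper vector: every column named "L" followed by digits
l_cols <- grep("^L[0-9]+$", names(tdf), value = TRUE)

# One data frame per observed (A, B) combination;
# drop = TRUE skips combinations that never occur
pieces <- split(tdf, list(tdf$A, tdf$B), drop = TRUE)

# Flatten all L values of each piece into one vector and average them
mean_L <- sapply(pieces, function(p) mean(unlist(p[l_cols]), na.rm = TRUE))
mean_L
```

The result is a named numeric vector, with names built from the A and B values of each group.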
Assuming the following data set:
df <- data.frame(...1 = c(1, 2, 3),
...2 = c(1, 2, 3),
n_column = c(1, 1, 2))
I now want to rename all vars that start with "...". My real data sets could have different numbers of "..." vars. The information about how many such vars I have is in the n_column column, more precisely, it is the maximum of that column.
So I tried:
df %>%
rename_with(.cols = starts_with("..."),
.fn = paste0("new_name", 1:max(n_column)))
which gives an error:
# Error in paste0("new_name", 1:max(n_column)) :
# object 'n_column' not found
So I guess the problem is that paste0() does not look for n_column within the current data set. I'm not sure how to make it do so, though. Any ideas?
I know I could bypass the whole thing by just creating an external scalar that contains the max. of n_column, but ideally I'd like to do everything in one pipeline.
You don't need the information from n_column: .cols passes only the columns that satisfy the condition (starts_with("...")) to the function.
library(dplyr)
df %>% rename_with(~paste0("new_name", seq_along(.)), starts_with("..."))
# new_name1 new_name2 n_column
#1 1 1 1
#2 2 2 1
#3 3 3 2
This is also safer than using max(n_column): if the data in n_column gets corrupted, or the number of columns starting with ... changes, this will still work.
A way to refer to column values in rename_with would be to use an anonymous function, so that you can use .$n_column.
df %>%
rename_with(function(x) paste0("new_name", 1:max(.$n_column)),
starts_with("..."))
I am assuming this is part of a longer chain, so you don't want to use max(df$n_column).
We can use str_c from stringr:
library(dplyr)
library(stringr)
df %>%
rename_with(~str_c("new_name", seq_along(.)), starts_with("..."))
Or using base R
i1 <- startsWith(names(df), "...")
names(df)[i1] <- sub("...", "new_name", names(df)[i1], fixed = TRUE)
df
new_name1 new_name2 n_column
1 1 1 1
2 2 2 1
3 3 3 2
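Note that sub() keeps whatever suffix follows the dots. If the new names should instead be numbered by position, as in the rename_with() answers, a base R sketch could look like this (the example names ...3 and ...7 are made up for illustration):

```r
# Hypothetical data whose "..." columns are not numbered 1, 2
df <- setNames(data.frame(1:3, 1:3, c(1, 1, 2)),
               c("...3", "...7", "n_column"))

# Logical index of the columns whose names start with "..."
i1 <- startsWith(names(df), "...")

# Number the matched columns by position, ignoring their old suffixes
names(df)[i1] <- paste0("new_name", seq_len(sum(i1)))
names(df)
```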
A completely different approach would be
df %>% janitor::clean_names()
x1 x2 n_column
1 1 1 1
2 2 2 1
3 3 3 2
Using R and tidyr, I want to count the number of rows of a dataframe between two pipes; is there an elegant way to do this?
I ran into the problem when removing rows with only NA, then having to count the number of rows of the dataframe so far. Can I do this without storing the dataframe between the pipes?
Here is a reproducible example. I essentially need XXX to refer to the dataframe after drop_na().
library(dplyr)
library(tidyr) # drop_na() comes from tidyr
scrap <- as.data.frame(matrix(1:16, ncol = 4))
scrap[4,] <- rep(NA, 4)
scrap %>%
drop_na() %>%
mutate(index=c(1:nrow(XXX)))
I thought it would guess what I am doing if I do not refer to anything as below, but no.
scrap %>%
drop_na() %>%
mutate(index=c(1:nrow()))
Error in nrow() : argument "x" is missing, with no default
Is there an elegant solution I am missing?
Not sure if I understood your question, but using dplyr::row_number :
library(dplyr)
library(tidyr) # drop_na() comes from tidyr
scrap %>%
drop_na() %>%
mutate(index=row_number())
Returns:
V1 V2 V3 V4 index
1 1 5 9 13 1
2 2 6 10 14 2
3 3 7 11 15 3
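Inside dplyr verbs, n() gives the number of rows of the data as it stands at that step of the pipe (per group, if the data is grouped), so the count itself is available without storing an intermediate data frame. The sketch below also assumes that "rows with only NA" means rows where every column is NA and filters those with if_all(), which needs a reasonably recent dplyr; drop_na() would instead drop rows containing any NA.

```r
library(dplyr)

scrap <- as.data.frame(matrix(1:16, ncol = 4))
scrap[4, ] <- rep(NA, 4)

res <- scrap %>%
  # keep rows unless every column is NA
  filter(!if_all(everything(), is.na)) %>%
  # n() is the row count at this point in the pipe
  mutate(index = seq_len(n()), n_rows = n())
res
```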
I want to create a table from existing data. I have 5 varieties and 3 clusters. In the expected table, I want to show the number and the names of the varieties in each cluster, but I can't get it to work. This is my data:
variety<-c("a","b","c","d","e")
cluster<-c(1,2,2,3,1)
x <- cbind(variety, cluster)
data <- data.frame(x)
data
variety cluster
1 a 1
2 b 2
3 c 2
4 d 3
5 e 1
My desired table looks like this:
cluster  number  variety name
1        2       a, e
2        2       b, c
3        1       d
I would be grateful if anyone could help me.
The following can give the results you're looking for:
library(plyr)
variety<-c("a","b","c","d","e")
cluster<-c(1,2,2,3,1)
x <- cbind(variety, cluster)
data <- data.frame(x)
data
ddply(data,.(cluster),summarise,n=length(variety),group=paste(variety,collapse=','))
Here is one option with tidyverse. Grouped by 'cluster', get the number of rows (n()) and paste the 'variety' values into a single string (toString):
library(tidyverse)
data %>%
group_by(cluster) %>%
summarise(number = n(), variety_name = toString(variety))
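For comparison, the same summary can be sketched in base R with tapply(). The data frame is built directly here rather than through cbind(), and the result and column names are chosen for illustration.

```r
data <- data.frame(variety = c("a", "b", "c", "d", "e"),
                   cluster = c(1, 2, 2, 3, 1))

# Count the varieties in each cluster
number <- tapply(data$variety, data$cluster, length)

# Collapse the variety names of each cluster into one string
variety_name <- tapply(data$variety, data$cluster, toString)

result <- data.frame(cluster = names(number),
                     number = as.vector(number),
                     variety_name = as.vector(variety_name))
result
```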
I have an R data frame with 6 columns, and I want to create a new dataframe that only has three of the columns.
Assuming my data frame is df, and I want to extract columns A, B, and E, this is the only command I can figure out:
data.frame(df$A,df$B,df$E)
Is there a more compact way of doing this?
You can subset using a vector of column names. I strongly prefer this approach over those that treat column names as if they are object names (e.g. subset()), especially when programming in functions, packages, or applications.
# data for reproducible example
# (and to avoid confusion from trying to subset `stats::df`)
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])
# subset
df[c("A","B","E")]
Note there's no comma (i.e. it's not df[,c("A","B","E")]). That's because df[,"A"] returns a vector, not a data frame, whereas df["A"] always returns a data frame.
str(df["A"])
## 'data.frame': 1 obs. of 1 variable:
## $ A: int 1
str(df[,"A"]) # vector
## int 1
Thanks to David Dorchies for pointing out that df[,"A"] returns a vector instead of a data.frame, and to Antoine Fabri for suggesting a better alternative (above) to my original solution (below).
# subset (original solution--not recommended)
df[,c("A","B","E")] # returns a data.frame
df[,"A"] # returns a vector
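If you do prefer the comma form, base R's drop argument keeps the result a data frame even for a single column. A small sketch using the same example data:

```r
df <- setNames(data.frame(as.list(1:5)), LETTERS[1:5])

# drop = FALSE stops `[` from simplifying a single column to a vector
one_col <- df[, "A", drop = FALSE]

is.data.frame(one_col)    # TRUE
is.data.frame(df[, "A"])  # FALSE
```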
Using the dplyr package, if your data.frame is called df1:
library(dplyr)
df1 %>%
select(A, B, E)
This can also be written without the %>% pipe as:
select(df1, A, B, E)
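When the wanted column names are held in a character vector, for example inside a function, select() can take them through all_of(). A minimal sketch, with a made-up df1 and a hypothetical cols vector:

```r
library(dplyr)

# Made-up data for illustration
df1 <- data.frame(A = 1:2, B = 3:4, C = 5:6, D = 7:8, E = 9:10)

# Hypothetical vector of wanted column names
cols <- c("A", "B", "E")

# all_of() selects exactly these names and errors if one is missing
df2 <- select(df1, all_of(cols))
names(df2)
```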
This is the role of the subset() function:
> dat <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> subset(dat, select=c("A", "B"))
A B
1 1 3
2 2 4
There are two obvious choices: Joshua Ulrich's df[,c("A","B","E")] or
df[,c(1,2,5)]
as in
> df <- data.frame(A=c(1,2),B=c(3,4),C=c(5,6),D=c(7,7),E=c(8,8),F=c(9,9))
> df
A B C D E F
1 1 3 5 7 8 9
2 2 4 6 7 8 9
> df[,c(1,2,5)]
A B E
1 1 3 8
2 2 4 8
> df[,c("A","B","E")]
A B E
1 1 3 8
2 2 4 8
For some reason only
df[, (names(df) %in% c("A","B","E"))]
worked for me. All of the above syntaxes yielded "undefined columns selected".
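One possible explanation, offered as a guess since the original data is not shown: the name-based forms fail with "undefined columns selected" when at least one requested name is missing from the data frame, whereas the %in% form silently keeps whichever names do exist. A sketch with a made-up data frame that lacks column E:

```r
# Made-up data: note there is no column E
df <- data.frame(A = 1:2, B = 3:4)
wanted <- c("A", "B", "E")

# Names that were requested but are not present
missing_cols <- setdiff(wanted, names(df))

# Indexing by name fails because E does not exist; capture the message
msg <- tryCatch(df[, wanted], error = function(e) conditionMessage(e))

# The %in% form just keeps the columns that do exist
kept <- df[, names(df) %in% wanted, drop = FALSE]
names(kept)
```

Checking setdiff(wanted, names(df)) first is a quick way to see whether a misspelled or absent column name is the cause.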
Where df1 is your original data frame:
df2 <- subset(df1, select = c(1, 2, 5))
You can also use the sqldf package, which runs SQL selects on R data frames:
library(sqldf)
df1 <- sqldf("select A, B, E from df")
The output is a data frame df1 with columns A, B, and E.
You can use with():
with(df, data.frame(A, B, E))
df <- dplyr::select(df, A, B, E)
You can also assign a different name to the newly created data frame:
data <- dplyr::select(df, A, B, E)
[ and subset() are not interchangeable:
[ returns a vector if only one column is selected, whereas subset() returns a data frame.
df = data.frame(a="a",b="b")
identical(
df[,c("a")],
subset(df,select="a")
)
identical(
df[,c("a","b")],
subset(df,select=c("a","b"))
)