I am stuck with this one:
I have a dataframe with the following properties:
a variable type (values: "P", "T", "I")
a variable id (subject id)
a variable RT (reaction time)
It looks like this:
id type rt
1 T 333
1 P 912
1 P 467
1 I 773
1 I 123
...
2 P 125
2 I 843
2 T 121
2 P 982
...
The order of the variable type is random for each subject, but each subject has the same number of each type. What I want is to select the first 2 RT values where type=="P" for each participant and then average over occurrences, so that I get a mean RT across all participants for the first occurrence of P, and a mean for the second occurrence of P.
So for, say, 20 participants, I want to extract a total of 20 RTs for the first occurrence and 20 RTs for the second occurrence.
I tried tapply, aggregate, a for loop and simple subsetting, but these either average "too early" or fail since the order is random for each subject.
Try
devtools::install_github("hadley/dplyr")
library(dplyr)
df%>%
group_by(id) %>%
filter(type=="P") %>%
slice(1:2)%>%
mutate(N=row_number()) %>%
group_by(N) %>%
summarise(rt=mean(rt))
#Source: local data frame [2 x 2]
# N rt
#1 1 518.5
#2 2 724.5
Or using data.table
library(data.table)
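# head() and min(.N, 2L) keep rt and N the same length even when a subject has more (or fewer) than two P rows
setDT(df)[type=="P", list(rt=head(rt, 2), N=seq_len(min(.N, 2L))), by=id][,
list(Meanrt=mean(rt)), by=N]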
# N Meanrt
#1: 1 518.5
#2: 2 724.5
Or using aggregate from base R
df1 <- subset(df, type=="P")
df1$indx <- with(df1, ave(rt, id, FUN=seq_along))
aggregate(rt~indx, df1[df1$indx %in% 1:2,], FUN=mean)
# indx rt
#1 1 518.5
#2 2 724.5
data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), type = c("T",
"P", "P", "I", "I", "P", "I", "T", "P"), rt = c(333L, 912L, 467L,
773L, 123L, 125L, 843L, 121L, 982L)), .Names = c("id", "type",
"rt"), class = "data.frame", row.names = c(NA, -9L))
I hope I got it right, using dplyr:
df %>%
group_by(id, type) %>%
mutate(occ=1:n()) %>%
group_by(type, occ) %>%
summarise(av=mean(rt)) %>%
filter(type=="P")
Source: local data frame [2 x 3]
Groups: type
type occ av
1 P 1 518.5
2 P 2 724.5
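For anyone who wants to see the per-subject step spelled out, here is a minimal base R sketch using the same example data as the answers above. The names p, first_two and occ_means are mine, and it assumes every subject has at least two "P" rows (a subject with fewer would contribute NA).

```r
# Example data from the answers above
df <- data.frame(
  id   = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  type = c("T", "P", "P", "I", "I", "P", "I", "T", "P"),
  rt   = c(333L, 912L, 467L, 773L, 123L, 125L, 843L, 121L, 982L),
  stringsAsFactors = FALSE
)

# Keep only the P trials, in their original order
p <- df[df$type == "P", ]

# One column per subject, rows = 1st and 2nd occurrence of P
first_two <- sapply(split(p$rt, p$id), function(x) x[1:2])

# Average across subjects for each occurrence
occ_means <- rowMeans(first_two)
occ_means
```

Because the averaging happens only after the first two values have been picked per subject, the random trial order within a subject does not matter.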
It's a bit tricky to explain, so I'll try my best; query below. I have a df as below. I need to filter rows by group based on the maximum pop, but only among countries that have not already occurred in the groups above. (As per the output image, the reason why A didn't feature in Group 2 is that it had already occurred in Group 1.)
In short, I need to get unique values in the country column and at the same time get the maximum value of pop (on a group level). I hope the picture can convey what I could not. (Tidyverse solution preferred.)
[Image: expected output]
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L,
100L)), class = "data.frame", row.names = c(NA, -9L))
I think this will do. Explanation of the syntax:
Split the data into a list with one element per group.
Set aside the first group; filtered for its max pop value, it is used as .init in the next step.
Use purrr::reduce to reduce the list of tibbles to a single tibble.
In each iteration of reduce:
.init is the filtered first group
the countries already picked in previous groups are removed through anti_join
the remaining rows are filtered for max pop again
the previously picked rows are added back with bind_rows()
Thus, in the end we have the desired tibble.
library(dplyr)
library(purrr)
df %>% group_split(Group) %>% .[-1] %>%
reduce(.init =df %>% group_split(Group) %>% .[[1]] %>%
filter(pop == max(pop)),
~ .y %>%
anti_join(.x, by = c("country" = "country")) %>%
filter(pop == max(pop)) %>%
bind_rows(.x) %>% arrange(Group))
# A tibble: 3 x 3
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 G 100
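The reduce steps described above can also be written as a plain loop, which some may find easier to follow. This is only a sketch of the same logic in base R, not the answer's code: seen, picked and result are names I made up, and it assumes the groups should be processed in the order they first appear in the data.

```r
df <- data.frame(
  Group   = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
  country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
  pop     = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L, 100L),
  stringsAsFactors = FALSE
)

seen   <- character(0)  # countries already picked in earlier groups
picked <- list()        # one data frame of winners per group

for (g in unique(df$Group)) {
  # Candidates: rows of this group whose country has not been picked yet
  cand <- df[df$Group == g & !(df$country %in% seen), ]
  if (nrow(cand) == 0) next
  # Keep the row(s) with the maximum pop among the candidates
  best <- cand[cand$pop == max(cand$pop), ]
  picked[[length(picked) + 1]] <- best
  seen <- c(seen, best$country)
}

result <- do.call(rbind, picked)
result
```

Here seen plays the role of the accumulated .x in reduce, and the %in% filter does the job of anti_join.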
You can create a helper function that writes the maximum pop from each group in a vector and use it to filter the dataframe.
library(tidyverse)
max_values <- c()
helper <- function(dat, ...){
dat <- dat[!(dat %in% max_values)] # exclude maximum values from previous groups
max_value <- max(dat) # get current max. value
max_values <<- c(max_values, max_value) # append
return(max_value)
}
df %>%
group_by(Group) %>%
filter(pop == helper(pop))
which gives you:
# A tibble: 3 x 3
# Groups: Group [3]
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 H 120
Data used:
> df
Group country pop
1 1 A 200
2 1 B 100
3 1 C 50
4 2 A 200
5 2 E 150
6 2 F 120
7 3 A 200
8 3 E 150
9 3 G 100
10 3 H 120
Here is another possibility, but it is overly simplified in that it does not take into account the possibility of a country having a higher population in a Group where it does not win.
library(dplyr)
df<- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"), pop = c(200L, 100L, 50L, 200L, 150L, 120L, 200L, 150L,
100L)), class = "data.frame", row.names = c(NA, -9L))
df %>%
group_by(country) %>%
summarize(popmax = max(pop)) %>%
inner_join(df, by = c("popmax" = 'pop')) %>%
rename(country = country.y) %>%
select(-country.x) %>%
group_by(country) %>%
arrange(Group) %>%
slice(1) %>%
ungroup() %>%
group_by(Group) %>%
arrange(country) %>%
slice(1) %>%
select(Group, country, popmax) %>%
rename(pop = popmax)
My answer fails (while other answers don't) with this data set:
df <- tribble(
~Group, ~ country, ~pop,
1 , 'A', 200,
1 , 'B', 100,
1 , 'C', 50,
1 , 'G', 150,
2 , 'A', 200,
2 , 'E', 150,
2 , 'F', 120,
3 , 'A', 200,
3 , 'E', 150,
3 , 'G', 100
)
Update
@Crestor, who is claiming that my answer is not correct:
My answer is correct because my code gives the desired output as requested by OP.
Your objection that my code does not work on another scenario may be correct, but in this setting it is irrelevant, as my answer was only intended to solve the task at hand.
Here is the answer to your raised scenario with this dataset:
df1 <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
country = c("A", "B", "C", "A", "E", "F", "A", "E", "G"),
pop = c(200L, 100L, 250L, 220L, 150L, 120L, 200L, 150L, 100L
)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"
))
expected output by Crestor:
# A tibble: 3 x 3
Group country pop
<int> <chr> <int>
1 1 C 250
2 2 A 220
3 3 E 150
My code for your scenario, @Crestor:
library(dplyr)
df1 %>%
group_by(country) %>%
arrange(Group) %>%
filter(pop == max(pop)) %>%
group_by(Group) %>%
filter(pop == max(pop))
Output:
Group country pop
<int> <chr> <int>
1 1 C 250
2 2 A 220
3 3 E 150
Original answer to the question by OP
To keep it simple: first arrange to bring your dataset into position. Then group_by country and keep the first row in each group with slice. Then group_by Group and filter for the max pop.
library(dplyr)
df %>%
arrange(country, pop) %>%
group_by(country) %>%
slice(1) %>%
group_by(Group) %>%
filter(pop==max(pop))
Output:
Group country pop
<int> <chr> <int>
1 1 A 200
2 2 E 150
3 3 G 100
3 doctors diagnose each patient.
Question 1: how do I filter the patients whom all 3 doctors diagnose with disease B (no matter whether B.1, B.2 or B.3)?
Question 2: how do I filter the patients whom any of the 3 doctors diagnoses with disease A?
set.seed(20200107)
df <- data.frame(id = rep(1:5,each =3),
disease = sample(c('A','B'), 15, replace = T))
df$disease <- as.character(df$disease)
df[1,2] <- 'A'
df[4,2] <- 'B.1'
df[5,2] <- 'B.2'
df[6,2] <- 'B.3'
df
I have a method in mind but I don't know how to write the code. I think the any() or all() function should be used.
First, I want to group patients by id.
Second, check whether all the diseases in each group are A or B.
The code would be something like this:
df %>% group_by(id) %>% filter_all(all_vars(disease == B))
You can use all assuming every patient is checked by 3 doctors only.
library(dplyr)
df %>% group_by(id) %>% summarise(disease_B = all(grepl('B', disease)))
# id disease_B
# <int> <lgl>
#1 1 FALSE
#2 2 TRUE
#3 3 FALSE
#4 4 FALSE
#5 5 FALSE
If you want to subset the rows of the patient, we can use filter
df %>% group_by(id) %>% filter(all(grepl('B', disease)))
For question 2: similarly, we can use any
df %>% group_by(id) %>% summarise(disease_A = any(grepl('A', disease)))
data
df <- structure(list(id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L), disease = c("A", "A", "A", "B.1", "B.2",
"B.3", "B", "A", "A", "B", "A", "A", "B", "A", "B")), row.names = c(NA,
-15L), class = "data.frame")
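One caveat: grepl('B', disease) is TRUE for any label that merely contains a B. If that matters for your real labels, an anchored pattern is safer. A minimal sketch with the data above; the helper is_B and its pattern are my own assumption about what counts as "disease B" (a bare B, or B followed by a dot and digits).

```r
library(dplyr)

df <- data.frame(
  id = rep(1:5, each = 3),
  disease = c("A", "A", "A", "B.1", "B.2", "B.3", "B", "A", "A",
              "B", "A", "A", "B", "A", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE only for "B" or "B.<digits>"
is_B <- function(x) grepl("^B(\\.[0-9]+)?$", x)

# Question 1: patients for whom every diagnosis is some form of B
all_b <- df %>% group_by(id) %>% filter(all(is_B(disease))) %>% ungroup()

# Question 2: patients for whom at least one diagnosis is A
any_a <- df %>% group_by(id) %>% filter(any(disease == "A")) %>% ungroup()

unique(all_b$id)
unique(any_a$id)
```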
For question 1, you can replace B.1, B.2, ... with B, then count the occurrences of each "Disease" per patient and keep only the rows where the count is 3 and the Disease is B:
library(tidyverse)
df %>% group_by(id) %>%
mutate(Disease = gsub("\\.[0-9]+","",disease)) %>% # escape the dot so it matches a literal "."
count(Disease) %>%
filter(n == 3 & Disease == "B")
# A tibble: 2 x 3
# Groups: id [2]
id Disease n
<int> <chr> <int>
1 2 B 3
2 4 B 3
For question 2, similarly, you can replace B.1, ... with B, then keep all rows where Disease is A and count the rows per patient; the output is the patient id and the number of doctors that diagnosed disease A:
df %>% group_by(id) %>%
mutate(Disease = gsub("\\.[0-9]+","",disease))%>% # escape the dot so it matches a literal "."
filter(Disease == "A") %>%
count(id)
# A tibble: 3 x 2
# Groups: id [3]
id n
<int> <int>
1 1 1
2 3 3
3 5 2
Looking to concatenate a column of strings (separated by a ",") when grouped by another column. Example of raw data:
Column1 Column2
1 a
1 b
1 c
1 d
2 e
2 f
2 g
2 h
3 i
3 j
3 k
3 l
Results Needed:
Column1 Grouped_Value
1 "a,b,c,d"
2 "e,f,g,h"
3 "i,j,k,l"
I've tried using dplyr, but I seem to be getting the below as a result
Column1 Grouped_Value
1 "a,b,c,d,e,f,g,h,i,j,k,l"
2 "a,b,c,d,e,f,g,h,i,j,k,l"
3 "a,b,c,d,e,f,g,h,i,j,k,l"
summ_data <-
df_columns %>%
group_by(df_columns$Column1) %>%
summarise(Grouped_Value = paste(df_columns$Column2, collapse =","))
We can do this with aggregate
aggregate(Column2 ~ Column1, df1, toString)
Or with dplyr
library(dplyr)
df1 %>%
group_by(Column1) %>%
summarise(Grouped_value =toString(Column2))
# A tibble: 3 x 2
# Column1 Grouped_value
# <int> <chr>
#1 1 a, b, c, d
#2 2 e, f, g, h
#3 3 i, j, k, l
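NOTE: toString is a wrapper for paste(., collapse=', '), so the separator includes a space; use paste(Column2, collapse=',') if you need exactly the format shown in the question.
The issue in the OP's solution is that it pastes the whole column: df1$Column2 (or df1[['Column2']]) breaks the grouping and selects the entire column instead of the grouped elements.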
data
df1 <- structure(list(Column1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L), Column2 = c("a", "b", "c", "d", "e", "f", "g", "h",
"i", "j", "k", "l")), class = "data.frame", row.names = c(NA,
-12L))
First commandment of dplyr
Don't use dollar signs in dplyr commands!
Use
group_by(Column1)
and
summarise(Grouped_Value = paste(Column2, collapse =","))
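To see the difference side by side, here is a small sketch with the example data from the question. The column names whole_column and per_group are made up for illustration.

```r
library(dplyr)

df1 <- data.frame(
  Column1 = rep(1:3, each = 4),
  Column2 = letters[1:12],
  stringsAsFactors = FALSE
)

res <- df1 %>%
  group_by(Column1) %>%
  summarise(
    # df1$Column2 reaches outside the grouped data: all 12 values every time
    whole_column = paste(df1$Column2, collapse = ","),
    # the bare column name is evaluated within the current group
    per_group = paste(Column2, collapse = ",")
  )

res
```

The whole_column value is identical in all three rows, which is exactly the symptom described in the question.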
I have a dataset that has two columns. One is userid, the other is company type, like below:
userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A
I want to know how many unique userid's there are that have company.type of A and B or A and C, (but not B and C).
I'm assuming it's some sort of aggregate function, but I'm not sure how to place the qualifier that company.type has to be A and B or A and C only.
We can do this with base R using table
tbl <- table(df1) > 0
sum(((tbl[, 1] & tbl[,2]) | (tbl[,1] & tbl[,3])) & (!(tbl[,2] & tbl[,3])))
#[1] 2
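A variant of the same idea that indexes the table by company type name instead of column position, which may be easier to read. This is a sketch: I am assuming df1 above is the question's data, and the helper has() is my own.

```r
df <- data.frame(
  userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  company.type = c("A", "A", "C", "B", "B", "B", "A"),
  stringsAsFactors = FALSE
)

# Rows = users, columns = company types, TRUE if the user has that type
tbl <- table(df$userid, df$company.type) > 0

# Hypothetical helper: logical vector, one entry per user
has <- function(type) tbl[, type]

# (A and B) or (A and C), but not (B and C)
n_users <- sum(((has("A") & has("B")) | (has("A") & has("C"))) &
               !(has("B") & has("C")))
n_users
```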
Here's an idea with dplyr. setequal checks if two vectors are composed of the same elements, regardless of ordering:
library(dplyr)
df %>%
group_by(userid) %>%
summarize(temp = setequal(company.type, c("A", "B")) |
setequal(company.type, c("A", "C"))) %>%
pull(temp) %>%
sum()
# [1] 2
Data:
df <- structure(list(userid = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), company.type = c("A",
"A", "C", "B", "B", "B", "A")), .Names = c("userid", "company.type"
), class = "data.frame", row.names = c(NA, -7L))
See: Check whether two vectors contain the same (unordered) elements in R
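A few quick checks of how setequal behaves, in case it helps. The inputs are made-up examples, not rows from the question.

```r
# Order does not matter
same_unordered <- setequal(c("B", "A"), c("A", "B"))

# Duplicates are ignored, since both sides are treated as sets
same_with_dupes <- setequal(c("A", "A", "B"), c("A", "B"))

# An extra element on either side makes the sets unequal
extra_element <- setequal(c("A", "B", "C"), c("A", "B"))

# A user with only "A" matches neither c("A", "B") nor c("A", "C")
only_a <- setequal("A", c("A", "B"))

c(same_unordered, same_with_dupes, extra_element, only_a)
```

Note the third case: a user with all of A, B and C is not counted by the setequal approach, which is consistent with the "but not B and C" condition.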
Sort DF and reduce it to one row per userid with a types column consisting of a comma-separated string of company types. Then filter it using the indicated condition. Finally use tally to get the number of rows left after filtering. To get the details omit the tally line.
library(dplyr)
DF %>%
arrange(userid, company.type) %>%
group_by(userid) %>%
summarize(types = toString(company.type)) %>%
ungroup %>%
filter(grepl("A.*B|A.*C", types) & ! grepl("B.*C", types)) %>%
tally
giving:
# A tibble: 1 x 1
n
<int>
1 2
Note
The input used, in reproducible form, is:
Lines <- "userid company.type
1 A
2 A
3 C
1 B
2 B
3 B
4 A"
DF <- read.table(text = Lines, header = TRUE)
I'm currently working on a dataframe that looks something like this:
Site Spp1 Spp2 Spp3 LOC TYPE
S01 2 4 0 A FLOOD
S02 4 0 0 A REG
....
S10 0 1 0 B FLOOD
S11 1 0 0 B REG
What I'm trying to do is subset the dataframe so I can run some indicator species analysis in R.
The following code works in that I create two subsets of the data, merge them into one frame and then drop the unused factor levels
A.flood <- filter(data, TYPE == "FLOOD", LOC == "A")
B.flood <- filter(data, TYPE == "FLOOD", LOC == "B")
A.B.flood <- rbind(A.flood, B.flood) %>% droplevels.data.frame(A.B.flood, except = c("A", "B"))
What I was also hoping/need to do is to drop all Spp columns (in my real dataset there are ~ 60) that sum to zero. Is there a way to achieve this with dplyr, and if there is, is it possible to pipe that code onto the existing A.B.flood dataframe code?
Thanks!
EDIT
I managed to remove all the columns that summed to zero, by selecting only the columns that summed to > zero:
A.B.flood.subset <- A.B.flood[, apply(A.B.flood[1:(ncol(A.B.flood))], 2, sum)!=0]
I realize this question is now quite old, but I came across it and found another solution using dplyr's "select" and "which", which might seem clearer to dplyr enthusiasts:
A.B.flood.subset <- A.B.flood %>% select(which(!colSums(A.B.flood, na.rm=TRUE) %in% 0))
Without using any package, we can use rowSums of the 'Spp' columns (subset the columns using grep) and double negate so that rows with sum>0 will be TRUE and others FALSE. Use this index to subset the rows.
data[!!rowSums(data[grep('Spp', names(data))]),]
Or using dplyr/magrittr, we select the 'Spp' columns, get the sum of each row with Reduce, double negate and use extract from magrittr to subset the original dataset with the index derived.
library(dplyr)
library(magrittr)
data %>%
select(matches('^Spp')) %>%
Reduce(`+`, .) %>%
`!` %>%
`!` %>%
extract(data,.,)
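Note that both snippets above drop rows whose species counts sum to zero. If the goal is to drop columns, as the question asks, a column-wise variant of the same idea might look like this. It is a sketch in base R; spp, zero_cols and result are names I introduced.

```r
data <- data.frame(
  Site = c("S01", "S02", "S03", "S04"),
  Spp1 = c(2L, 4L, 0L, 4L),
  Spp2 = c(4L, 0L, 0L, 0L),
  Spp3 = c(0L, 0L, 0L, 0L),
  LOC  = c("A", "A", "A", "A"),
  TYPE = c("FLOOD", "REG", "FLOOD", "REG"),
  stringsAsFactors = FALSE
)

# Names of the species columns
spp <- grep("^Spp", names(data), value = TRUE)

# Species columns whose counts sum to zero
zero_cols <- spp[colSums(data[spp]) == 0]

# Keep every column except the all-zero species columns
result <- data[, setdiff(names(data), zero_cols), drop = FALSE]
names(result)
```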
data
data <- structure(list(Site = c("S01", "S02", "S03", "S04"),
Spp1 = c(2L,
4L, 0L, 4L), Spp2 = c(4L, 0L, 0L, 0L), Spp3 = c(0L, 0L, 0L, 0L
), LOC = c("A", "A", "A", "A"), TYPE = c("FLOOD", "REG",
"FLOOD",
"REG")), .Names = c("Site", "Spp1", "Spp2", "Spp3", "LOC",
"TYPE"), class = "data.frame", row.names = c(NA, -4L))
You should convert to tidy data with tidyr::gather() and the data frame will be much easier to manipulate.
library(tidyr)
library(dplyr)
A.B.Flood %>% gather(Species, Sp.Count, -Site, -LOC, -TYPE) %>%
group_by(Species) %>%
filter(Sp.Count > 0)
Voila, your tidy data minus the zero counts.
# Site LOC TYPE Species Sp.Count
# <fctr> <fctr> <fctr> <chr> <int>
#1 S01 A FLOOD Spp1 2
#2 S02 A REG Spp1 4
#3 S11 B REG Spp1 1
#4 S01 A FLOOD Spp2 4
#5 S10 B FLOOD Spp2 1
Personally I'd keep it like this. If you want your original format back with the zero counts for the non-discarded species, just add %>% spread(Species, Sp.Count, fill = 0) to the pipeline.
# Site LOC TYPE Spp1 Spp2
#* <fctr> <fctr> <fctr> <dbl> <dbl>
#1 S01 A FLOOD 2 4
#2 S02 A REG 4 0
#3 S10 B FLOOD 0 1
#4 S11 B REG 1 0
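In newer tidyr versions, pivot_longer() and pivot_wider() are the recommended replacements for gather() and spread(). A sketch of the same pipeline with them, assuming tidyr 1.0.0 or later; the data frame below contains only the four rows shown in the question, and long and wide are my own names.

```r
library(dplyr)
library(tidyr)

# Only the rows visible in the question (the elided rows are not reconstructed)
A.B.flood <- data.frame(
  Site = c("S01", "S02", "S10", "S11"),
  Spp1 = c(2L, 4L, 0L, 1L),
  Spp2 = c(4L, 0L, 1L, 0L),
  Spp3 = c(0L, 0L, 0L, 0L),
  LOC  = c("A", "A", "B", "B"),
  TYPE = c("FLOOD", "REG", "FLOOD", "REG"),
  stringsAsFactors = FALSE
)

# Long format, keeping only non-zero counts
long <- A.B.flood %>%
  pivot_longer(starts_with("Spp"), names_to = "Species", values_to = "Sp.Count") %>%
  filter(Sp.Count > 0)

# Back to wide format; species that never occur are gone, missing cells become 0
wide <- long %>%
  pivot_wider(names_from = Species, values_from = Sp.Count,
              values_fill = list(Sp.Count = 0L))

wide
```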
There is an even easier and quicker way to do this, using base R subsetting rather than dplyr:
A.B.flood.subset <- A.B.flood[, colSums(A.B.flood != 0) > 0]
or with a MWE:
df <- data.frame (x = rnorm(100), y = rnorm(100), z = rep(0, 100))
df[, colSums(df != 0) > 0]
For those who want to use dplyr 1.0.0 with the where keyword, you can do:
A.B.flood %>%
select(where( ~ is.numeric(.x) && sum(.x) != 0))
returns:
Spp1 Spp2
1 2 4
2 4 0
3 0 0
4 4 0
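As the output shows, that predicate also drops the non-numeric columns (Site, LOC, TYPE). If you want to keep them and only remove the all-zero numeric columns, flipping the condition seems to work. A sketch, assuming dplyr 1.0.0 or later; kept is my own name.

```r
library(dplyr)

A.B.flood <- data.frame(
  Site = c("S01", "S02", "S03", "S04"),
  Spp1 = c(2L, 4L, 0L, 4L),
  Spp2 = c(4L, 0L, 0L, 0L),
  Spp3 = c(0L, 0L, 0L, 0L),
  LOC  = c("A", "A", "A", "A"),
  TYPE = c("FLOOD", "REG", "FLOOD", "REG"),
  stringsAsFactors = FALSE
)

# Keep a column if it is not numeric, or if it is numeric with a non-zero sum
kept <- A.B.flood %>%
  select(where(~ !is.numeric(.x) || sum(.x, na.rm = TRUE) != 0))

names(kept)
```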
using the same data given by @akrun:
A.B.flood <- structure(
list(
Site = c("S01", "S02", "S03", "S04"),
Spp1 = c(2L,
4L, 0L, 4L),
Spp2 = c(4L, 0L, 0L, 0L),
Spp3 = c(0L, 0L, 0L, 0L),
LOC = c("A", "A", "A", "A"),
TYPE = c("FLOOD", "REG",
"FLOOD",
"REG")
),
.Names = c("Site", "Spp1", "Spp2", "Spp3", "LOC",
"TYPE"), class = "data.frame", row.names = c(NA, -4L))