I have the following data frame:
Group 1 ID A Value
Group 1 ID B Value
Group 1 ID C Value
Group 2 ID B Value
Group 2 ID C Value
Group 3 ID B Value
… … …
I am trying to use dplyr to get the mean value for each ID across groups (e.g. the mean of the value of ID B across group 1, group 2, and group 3). However, not every group has all of the IDs, so I want to subset so that means are only computed for IDs that appear in all groups. I know I can do something like group_by(dataFrame, group) %>% filter(...) %>% group_by(id) %>% mutate(mean), but I don't know what code to place in the filter step.
How about
library(dplyr)
ngroups <- n_distinct(df$group) # total number of distinct groups (assuming the grouping column is called group)
df %>%
group_by(id) %>%
mutate(count = n()) %>%
filter(count == ngroups) %>% # keep only IDs that appear in every group, then ...
So basically remove all the rows in the dataframe that correspond to an ID that doesn't appear in all groups, then perform the computation.
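A minimal end-to-end sketch, assuming the columns are named group, id, and value (adjust to your actual column names):
library(dplyr)
df %>%
group_by(id) %>%
filter(n_distinct(group) == n_distinct(df$group)) %>% # keep only IDs present in every group
summarise(mean_value = mean(value, na.rm = TRUE)) # one mean per remaining ID
summarise() collapses to one row per ID; use mutate() instead if you want to keep every original row with its ID's mean attached.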
I have a dataframe of patients who underwent one or more surgical procedures and am interested in grouping them by procedure type for analysis of outcomes. The procedures are represented by numbers (1-5). Rather than creating a new column in the dataframe for each procedure type to indicate whether the patient had that procedure, I'm looking for a way to group and summarize by each unique value inside a list column.
A representative df would look like this...
id <- c(1,2,3,4,5,6,7,8,9,10)
procedures <- list(2, 3, c(1,5), 1, c(3,4), c(1,3), 5, 2, c(1,2,5), 4)
df <- as.data.frame(cbind(id, procedures))
Say I wanted to count the number of patients who had each type of procedure. The following would obviously count each unique list as a separate object.
df %>%
group_by(procedures) %>%
summarise(n = n())
What I'm trying to accomplish is a count of the number of times each unique procedure appears across the lists. The below is oversimplified, but it illustrates the idea.
df %>%
group_by(unique(procedures)) %>%
summarise(n = n())
We may unnest the list column and use that in group_by
library(dplyr)
library(tidyr)
df %>%
unnest(everything()) %>%
group_by(procedures) %>%
summarise(n = n())
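If only the procedures column should be expanded (leaving id alone), unnest_longer() is an alternative; this is a sketch, not part of the original answer:
library(dplyr)
library(tidyr)
df %>%
unnest_longer(procedures) %>% # one row per (id, procedure) pair
count(procedures)
With the sample df above, either version should give the same counts as the separate_rows output shown further down.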
We could use separate_rows with count:
library(dplyr)
library(tidyr)
df %>%
separate_rows("procedures", sep = " ,") %>%
count(procedures)
procedures n
<dbl> <int>
1 1 4
2 2 3
3 3 3
4 4 2
5 5 3
I need to find the unique entries in my dataframe using the columns ID and Genus. I do not need to find unique values in the Count column. My dataframe is structured like this:
ID Genus Count
A Genus1 4
A Genus18 265
A Genus28 1
A Genus2 900
B Genus1 85
B Genus18 9
B Genus28 24
B Genus2 6
B Genus3000 152
The resulting dataframe would contain only:
ID Genus Count
B Genus3000 152
because it is the only row whose Genus does not also appear under another ID.
I have tidyverse loaded but have had trouble trying to get the result I need. I tried using distinct() but continue to get back all data from the input as output.
I have tried the following:
uniquedata <- mydata %>% distinct(.keep_all = TRUE)
uniquedata <- mydata %>% group_by(ID, Genus) %>% distinct(.keep_all = TRUE)
uniquedata <- mydata %>% distinct(ID, Genus, .keep_all = TRUE)
uniquedata <- mydata %>% distinct()
What should I use to achieve my desired output?
We could use add_count in combination with filter:
library(dplyr)
df %>%
add_count(Genus) %>%
filter(n == 1) %>%
select(ID, Genus, Count)
Output:
ID Genus Count
<chr> <chr> <dbl>
1 B Genus3000 152
For the given data set, it is enough to check the column "Genus" for values appearing twice and then to remove the corresponding rows from the dataframe.
countGenus <- df %>% count(Genus)
filter(df, Genus %in% filter(countGenus, n == 1)$Genus)
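The same idea can also be written as a single pipeline; this is a sketch equivalent to the two-step version above:
library(dplyr)
df %>%
group_by(Genus) %>%
filter(n() == 1) %>% # keep Genus values that occur exactly once
ungroup()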
I have a data set called test with multiple observations per participant. Every participant has a unique id but several observations (1 row in the data = 1 observation). I need to reduce the data set to 1 row per participant and add two new variables: the number of observations per participant and the sum of the points he or she received across those observations.
I already get these values with the code below, but how can I create these two variables and add them to my data set?
test %>%
group_by(id) %>%
summarize(sum_communities = sum(id/id, na.rm = TRUE))
test %>%
group_by(id) %>%
summarize(sum_points = sum(points, na.rm = TRUE))
I created demo data in a test data frame; test_reduced has the desired output.
library(dplyr)
test = data.frame("Participent" =c("A","A","A","B","B","C","C","C", "C"),
"Observation" = c(4,5,6,4,7,4,6,6,3))
test_reduced = test %>% group_by(Participent) %>%
summarise(count = n(), sum = sum(Observation))
Output:
# A tibble: 3 x 3
Participent count sum
<fct> <int> <dbl>
1 A 3 15
2 B 2 11
3 C 4 19
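If the goal is to keep every row and just attach the two new variables (as the question describes), here is a sketch using mutate() instead of summarise(), assuming the original columns are named id and points:
library(dplyr)
test %>%
group_by(id) %>%
mutate(sum_communities = n(), # number of observations per participant
       sum_points = sum(points, na.rm = TRUE)) %>% # total points per participant
ungroup()
summarise() collapses to one row per participant; mutate() keeps all rows, so you could follow it with distinct(id, .keep_all = TRUE) to end up with one row per participant while retaining the new columns.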
I have data with User_Name and Group.
User_Name Group
MustafE A
fischeta A
LosperS1 A
MustafE B
fischeta B
jose B
MustafE c
fischeta c
I want to flag those customers who do not appear in every group. For example, 'LosperS1' is in group A but not in group B; in the same way, 'jose' is in group B but not in group C. In a new column they would be marked as "Not in group B" / "Not in group C".
Any help will be appreciated.
Here is a way to get the output using tidyverse: get the distinct elements of 'User_Name', loop through them (map), filter the rows of the dataset for each element, paste into a 'Flag' column the groups that element is missing from (setdiff of all groups against the filtered 'Group'), keep the first row (slice), and right_join with the original dataset. map_df is used so the end result is a single data.frame instead of a list of data.frames.
library(tidyverse)
df1 %>%
distinct(User_Name) %>%
pull(User_Name) %>%
map_df(~ df1 %>%
filter(User_Name == .x) %>%
mutate(Flag = toString(setdiff(unique(df1$Group),
unique(Group)))) %>%
slice(1) %>%
select(-Group)) %>%
right_join(df1, "User_Name")
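A shorter variant of the same idea (not from the original answer) groups by User_Name and builds the flag directly; assuming the data frame is called df1:
library(dplyr)
df1 %>%
group_by(User_Name) %>%
mutate(Flag = toString(setdiff(unique(df1$Group), Group))) %>% # groups this user is missing from
ungroup()
Users present in every group get an empty Flag; LosperS1 would get "B, c" and jose would get "A, c".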
I have a dataframe with >100 columns, and I would to find the unique rows by comparing only two of the columns. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.
In the below, I would like to unique only using id and id2:
data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
id id2 somevalue
1 1 x
1 1 y
3 4 z
I would like to obtain either:
id id2 somevalue
1 1 x
3 4 z
or:
id id2 somevalue
1 1 y
3 4 z
(I have no preference which of the unique rows is kept)
Ok, if it doesn't matter which value in the non-duplicated column you select, this should be pretty easy:
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
> dat[!duplicated(dat[,c('id','id2')]),]
id id2 somevalue
1 1 1 x
3 3 4 z
Inside the duplicated call, I'm simply passing only those columns from dat that I don't want duplicates of. This code always keeps the first of any ambiguous rows (in this case, x).
Here are a couple dplyr options that keep non-duplicate rows based on columns id and id2:
library(dplyr)
df %>% distinct(id, id2, .keep_all = TRUE)
df %>% group_by(id, id2) %>% filter(row_number() == 1)
df %>% group_by(id, id2) %>% slice(1)
Using unique():
dat <- data.frame(id=c(1,1,3),id2=c(1,1,4),somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])),]
Minor update to @Joran's code.
Using the code below, you can avoid the ambiguity and return only the unique combinations of the two columns:
dat <- data.frame(id=c(1,1,3), id2=c(1,1,4) ,somevalue=c("x","y","z"))
dat[row.names(unique(dat[,c("id", "id2")])), c("id", "id2")]