dplyr distinct over two columns - r

I have a table where the first two columns are sample identifiers and the third is a measure of distance, e.g.:
df<-data.table(H1=c(1,2,3,4,5),H2=c(7,3,2,8,9), D=c(100,4,55,66,35))
I want to find only the unique pairs across both columns, i.e. 1-7, 2-3, 4-8, 5-9. That means removing the duplicate 2-3 and 3-2 pairing, which appears with the columns swapped, but keeping the third column (which, being a distance, is identical for 2-3 and 3-2).

# example data
df<-data.frame(H1=c(1,2,3,4,5),
H2=c(7,3,2,8,9),
D=c(100,4,55,66,35), stringsAsFactors = F)
library(dplyr)
df %>%
rowwise() %>% # for each row
mutate(HH = paste0(sort(c(H1,H2)), collapse = ",")) %>% # create a new variable that orders and combines H1 and H2
group_by(HH) %>% # group by that variable
filter(D == max(D)) %>% # keep the row where D is the maximum (assumed logic*)
ungroup() %>% # forget the grouping
select(-HH) # remove unnecessary variable
# # A tibble: 4 x 3
# H1 H2 D
# <dbl> <dbl> <dbl>
# 1 1 7 100
# 2 3 2 55
# 3 4 8 66
# 4 5 9 35
*Note: No idea what your logic is to keep 1 row from the duplicates. I had to use something as an example and here I'm keeping the row with the highest D value. This logic can change if needed.
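If any one row per pair is acceptable, the rowwise sort can be avoided. The following is a minimal base R sketch, not the answer's method: it builds an order-independent key with pmin() and pmax() and keeps the first occurrence of each pair (so it keeps D = 4 for the 2-3 pair, unlike the max(D) logic above). The names key and res are illustrative.

```r
# Example data from the question
df <- data.frame(H1 = c(1, 2, 3, 4, 5),
                 H2 = c(7, 3, 2, 8, 9),
                 D  = c(100, 4, 55, 66, 35))

# Order-independent key: smaller identifier first, larger second
key <- cbind(pmin(df$H1, df$H2), pmax(df$H1, df$H2))

# duplicated() on a matrix compares whole rows; keep the first of each pair
res <- df[!duplicated(key), ]
res
```

Because pmin() and pmax() are vectorised, this works on all rows at once rather than one row at a time.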

Related

Show rows that appear only once in R dataframe

I know that we can use unique() to effectively show a dataframe without duplicate values, but is there an elegant way to show only those rows that appear once in a dataframe?
E.g.,
a = c(10,20,10,10)
b = c(10,30,10,20)
ab = data.frame(a,b)
should return the second and final row only, and not the first and third (since this row exists more than once).
Thanks
We can use duplicated
subset(ab, !(duplicated(ab)|duplicated(ab, fromLast = TRUE)))
-output
a b
2 20 30
4 10 20
dplyr option:
library(dplyr)
ab %>%
group_by(across(everything())) %>%
filter(n() == 1)
Output:
# A tibble: 2 × 2
# Groups: a, b [2]
a b
<dbl> <dbl>
1 20 30
2 10 20
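To see why the base R answer needs both calls to duplicated(), it may help to look at the two logical vectors separately. This is a small sketch on the question's data; dup_fwd, dup_rev and once are illustrative names.

```r
ab <- data.frame(a = c(10, 20, 10, 10), b = c(10, 30, 10, 20))

# Scanning top to bottom flags only the later copies of a repeated row
dup_fwd <- duplicated(ab)
# Scanning bottom to top flags only the earlier copies
dup_rev <- duplicated(ab, fromLast = TRUE)

# A row that is flagged in neither direction occurs exactly once
once <- ab[!(dup_fwd | dup_rev), ]
once
```

Using only one direction would still leave one copy of the repeated (10, 10) row in the result.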

Mutate by group based on a conditional

I am trying to add a summary column to a dataframe. Although the summary statistic should be applied to every row, the statistic itself should only be calculated from the rows that meet a condition.
As an example, given this dataframe:
x <- data.frame(usernum=rep(c(1,2,3,4),each=3),
final=rep(c(TRUE,TRUE,FALSE,FALSE)),
time=1:12)
I would like to add a user.mean column, but where the mean is only calculated from rows where final=TRUE. I have tried:
library(tidyverse)
x %>%
group_by(usernum) %>%
mutate(user.mean = mean(x$time[x$final==TRUE]))
but this gives an overall mean, rather than by user. I have also tried:
x %>%
group_by(usernum) %>%
filter(final==TRUE) %>%
mutate(user.mean = mean(time))
but this only returns the filtered dataframe:
# A tibble: 6 x 4
# Groups: usernum [4]
usernum final time user.mean
<dbl> <lgl> <int> <dbl>
1 1 TRUE 1 1.5
2 1 TRUE 2 1.5
3 2 TRUE 5 5.5
4 2 TRUE 6 5.5
5 3 TRUE 9 9
6 4 TRUE 10 10
How can I apply those means to every original row?
If we use x$ after the group_by, it returns the entire column instead of only the values in that particular group. Second, final is already a logical vector, so we don't need == TRUE.
library(dplyr)
x %>%
group_by(usernum) %>%
mutate(user.mean = mean(time[final]))
The one option where we can use $ is with .data
x %>%
group_by(usernum) %>%
mutate(user.mean = mean(.data$time[.data$final]))
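For comparison, here is a base R sketch of the same idea, assuming every user has at least one row with final = TRUE (as in the example data). It computes the per-user means from the TRUE rows only and then looks them up for every original row; means is an illustrative name.

```r
x <- data.frame(usernum = rep(c(1, 2, 3, 4), each = 3),
                final   = rep(c(TRUE, TRUE, FALSE, FALSE), 3),
                time    = 1:12)

# Per-user mean of time, using only the rows where final is TRUE
means <- tapply(x$time[x$final], x$usernum[x$final], mean)

# Look up each row's user in the named result so all 12 rows get a value
x$user.mean <- as.numeric(means[as.character(x$usernum)])
x
```

The result has the same number of rows as the input, which is the difference from the filter() attempt in the question.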

Adding new, combined values to existing dataframe in R

This is an approximation of the original dataframe. In the original, there are many more columns than are shown here.
id init_cont family description value
1 K S impacteach 1
1 K S impactover 3
1 K S read 2
2 I S impacteach 2
2 I S impactover 4
2 I S read 1
3 K D impacteach 3
3 K D impactover 5
3 K D read 3
I want to combine the values for impacteach and impactover to generate an average value that is just called impact. I would like the final table to look like the following:
id init_cont family description value
1 K S impact 2
1 K S read 2
2 I S impact 3
2 I S read 1
3 K D impact 4
3 K D read 3
I have not been able to figure out how to generate this table. However, I have been able to create a dataframe that looks like this:
id description value
1 impact 2
1 read 2
2 impact 3
2 read 1
3 impact 4
3 read 3
What is the best way for me to take these new values and add them to the original dataframe? I also need to remove the original values (like impacteach and impactover) in the original dataframe. I would prefer to modify the original dataframe as opposed to creating an entirely new dataframe because the original dataframe has many columns.
In case it is useful, this is a summary of the code I used to create the shorter dataframe with impact as a combination of impacteach and impactover:
df %>%
mutate(newdescription = case_when(description %in% c("impacteach", "impactoverall") ~ "impact", TRUE ~ description)) %>%
group_by(id, newdescription) %>%
summarise(value = mean(as.numeric(value)))
What if you changed the description column first so that it could be included in the grouping:
df %>%
mutate(description = substr(description, 1, 6)) %>%
group_by(id, init_cont, family, description) %>%
summarise(value = mean(value))
# A tibble: 6 x 5
# Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.
# 2 1 K S read 2.
# 3 2 I S impact 3.
# 4 2 I S read 1.
# 5 3 K D impact 4.
# 6 3 K D read 3.
You just need to modify your group_by statement. Try group_by(id, init_cont, family)
Because your id seems to be mapped to init_cont and family already, adding in these values won't change your summarization result. Then you have all the columns you want with no extra work.
If you have a lot of columns you could try something like the code below. Essentially, do a left_join of your summarised data onto your original data, using the . placeholder so that no new dataframe has to be stored. Once joined (by id and description, which we modified in place) you'll have two value columns, suffixed with .x and .y. Drop the original (value.x) and then use distinct to get rid of the duplicate 'impact' rows.
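df %>%
mutate(description = case_when(description %in% c("impacteach", "impactoverall") ~ "impact", TRUE ~ description)) %>%
left_join(., # . is the data with the recoded description
group_by(., id, description) %>%
summarise(value = mean(as.numeric(value))), # one mean per id/description
by = c('id', 'description')) %>%
select(-value.x) %>% # drop the original values, keep the means (value.y)
distinct()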
gsub can be used to replace any description starting with impact by impact, and then group_by from the dplyr package will help in summarising the value.
df %>% group_by(id, init_cont, family,
description = gsub("^(impact).*","\\1", description)) %>%
summarise(value = mean(value))
# # A tibble: 6 x 5
# # Groups: id, init_cont, family [?]
# id init_cont family description value
# <int> <chr> <chr> <chr> <dbl>
# 1 1 K S impact 2.00
# 2 1 K S read 2.00
# 3 2 I S impact 3.00
# 4 2 I S read 1.00
# 5 3 K D impact 4.00
# 6 3 K D read 3.00
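The same collapse-then-average idea can be sketched in base R. The data frame below is rebuilt by hand from the approximate table in the question, so the column types are assumptions; sub() plays the role of gsub() above and aggregate() the role of group_by() plus summarise().

```r
# Reconstruction of the question's example table (types assumed)
df <- data.frame(id          = rep(1:3, each = 3),
                 init_cont   = rep(c("K", "I", "K"), each = 3),
                 family      = rep(c("S", "S", "D"), each = 3),
                 description = rep(c("impacteach", "impactover", "read"), 3),
                 value       = c(1, 3, 2, 2, 4, 1, 3, 5, 3))

# Collapse every description starting with "impact" to plain "impact"
df$description <- sub("^impact.*", "impact", df$description)

# Average value within each id/init_cont/family/description combination
res <- aggregate(value ~ id + init_cont + family + description,
                 data = df, FUN = mean)
res <- res[order(res$id, res$description), ]
res
```

Including init_cont and family in the grouping keeps those columns in the result, as the answers above point out.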

Sample from groups and only maintain unique observations in the data

I want to take a sample per group, all the while making sure that no participant appears twice across the samples (I need this for a between-subjects ANOVA). I have a dataframe in which some participants (not all) appear twice, each time in a different group, i.e. Peter can appear in group v1=A and v2=1 but theoretically also in group v1=B and v2=3. A group is defined by the two variables v1 and v2, so according to the below code, there are 8 groups.
Now, I want to avoid the double appearance of any participant in the data by taking samples per group and randomly eliminating one observation from any participant who appears twice, all the while maintaining similarly sized samples. I constructed the following ugly code to showcase my problem.
How do I get the last step done, so that no participant appears twice across the samples and I only have unique cases across all samples?
df1 <- data.frame(ID=c("peter","peter","chris","john","george","george","norman","josef","jan","jan","richard","richard","paul","christian","felix","felix","nick","julius","julius","moritz"),
v1=rep(c("A","B"),10),
v2=rep(c(1:4),5))
library(dplyr)
df2 <- df1 %>% group_by(v1,v2) %>% sample_n(2)
You could first take a sample of size 1 as per 'ID', then group_by 'v1' and 'v2' and take another sample of size 2.
library(dplyr)
set.seed(1)
df2 <- df1 %>%
group_by(ID) %>%
sample_n(1) %>%
group_by(v1, v2) %>%
sample_n(2)
df2
# Groups: v1, v2 [4]
# ID v1 v2
# <fct> <fct> <int>
# 1 paul A 1
# 2 jan A 1
# 3 norman A 3
# 4 richard A 3
# 5 george B 2
# 6 peter B 2
# 7 moritz B 4
# 8 felix B 4
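The first step of that answer (one random row per ID) can also be sketched in base R, which makes it easy to check that no participant survives twice. This is only an illustration of the deduplication step, not of the per-group sampling; keep and dedup are illustrative names.

```r
set.seed(1)
df1 <- data.frame(ID = c("peter","peter","chris","john","george","george",
                         "norman","josef","jan","jan","richard","richard",
                         "paul","christian","felix","felix","nick","julius",
                         "julius","moritz"),
                  v1 = rep(c("A", "B"), 10),
                  v2 = rep(1:4, 5))

# For each ID, pick one of its row indices at random.
# sample.int(length(i), 1) avoids the sample(i, 1) pitfall when i has length 1.
keep <- vapply(split(seq_len(nrow(df1)), df1$ID),
               function(i) i[sample.int(length(i), 1)],
               integer(1))
dedup <- df1[keep, ]

# Every participant now appears exactly once
anyDuplicated(dedup$ID) == 0
```

The per-group sample can then be drawn from dedup, as in the dplyr pipeline above.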

How to repeat empty rows so that each split has the same number

My goal is to get the same number of rows for each split (based on column Initials). I am trying to basically pad the number of rows so that each person has the same amount, while retaining the Initials column so I can tell them apart. My attempt failed completely. Anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
toadd<-maxrows-dim(x)[1]
replicate(toadd,x<-rbind(x,rep(NA,1)))#colnames -1 because col 1 should be the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq_len(max(table(df$Initials))) # table() works whether Initials is a factor or character
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq_len(max(table(df$Initials))) # table() works whether Initials is a factor or character
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by initials, add row_numbers, ungroup, complete row numbers/Initials combinations, then remove our row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra initials needed then combine the extras with NA values then rbind to the data frame.
max(table(df$Initials)) calculates the number of rows for the initial with the most repeats, in this case 2 (for a). By subtracting the counts of each initial, table(df$Initials), from that maximum we get a vector with the necessary additions. There's an added bonus to this method: by using table we also automatically have a named vector.
We use the names of the new vector to know 1) what initials to repeat, and 2) how many times should they be repeated.
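Because the new row is built with c(), it is a character vector, so the data column is coerced to character. To restore the class, assign the rbind() result to a variable, say newdf, and then run newdf$data <- as.numeric(newdf$data).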
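A base R sketch that pads every group and keeps the column classes intact is shown below. It relies on the fact that indexing a data frame with row numbers past its last row yields rows of NA, the same trick the data.table answer uses with .SD[my_rows]. The names n_max and padded are illustrative.

```r
df <- data.frame(Initials = c("a", "a", "b"), data = c(2, 3, 4))

# Size of the largest group
n_max <- max(table(df$Initials))

padded <- do.call(rbind, lapply(split(df, df$Initials), function(g) {
  out <- g[seq_len(n_max), ]       # rows beyond nrow(g) come back as NA
  out$Initials <- g$Initials[1]    # restore the group label on padded rows
  out
}))
rownames(padded) <- NULL
padded
```

Since no row is ever built with c(), the data column stays numeric and no as.numeric() fix-up is needed.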
