Sample from groups and only maintain unique observations in the data - r

I want to take a sample per group while making sure that no participant appears twice across the samples (I need this for a between-subjects ANOVA). I have a dataframe in which some participants (not all) appear twice, each time in a different group, i.e. Peter can appear in group v1=A and v2=1 but theoretically also in group v1=B and v2=3. A group is defined by the two variables v1 and v2, so with two levels of v1 and four levels of v2 there can be up to 8 groups (the example code below happens to produce only 4 of these combinations).
Now, I want to avoid the double appearance of any participant in the data by taking samples per group and randomly eliminating one observation from any participant who appears twice, while keeping the samples similarly sized. I constructed the following ugly code to showcase my problem.
How do I get the last step done, so that no participant appears twice across the samples and I only have unique cases across all samples?
df1 <- data.frame(ID=c("peter","peter","chris","john","george","george","norman","josef","jan","jan","richard","richard","paul","christian","felix","felix","nick","julius","julius","moritz"),
v1=rep(c("A","B"),10),
v2=rep(c(1:4),5))
library(dplyr)
df2 <- df1 %>% group_by(v1,v2) %>% sample_n(2)

You could first take a sample of size 1 as per 'ID', then group_by 'v1' and 'v2' and take another sample of size 2.
library(dplyr)
set.seed(1)
df2 <- df1 %>%
group_by(ID) %>%
sample_n(1) %>%
group_by(v1, v2) %>%
sample_n(2)
df2
# Groups: v1, v2 [4]
# ID v1 v2
# <fct> <fct> <int>
# 1 paul A 1
# 2 jan A 1
# 3 norman A 3
# 4 richard A 3
# 5 george B 2
# 6 peter B 2
# 7 moritz B 4
# 8 felix B 4
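
One caveat with the approach above: after one row per ID has been kept, a group can end up with fewer than 2 rows, in which case sample_n(2) stops with an error. The following is a minimal sketch of the same two-step idea using slice_sample(), which (assuming dplyr 1.0 or later) silently caps n at the group size. The checks at the end are my addition, not part of the original answer.

```r
library(dplyr)

df1 <- data.frame(
  ID = c("peter", "peter", "chris", "john", "george", "george", "norman",
         "josef", "jan", "jan", "richard", "richard", "paul", "christian",
         "felix", "felix", "nick", "julius", "julius", "moritz"),
  v1 = rep(c("A", "B"), 10),
  v2 = rep(1:4, 5)
)

set.seed(1)
df2 <- df1 %>%
  group_by(ID) %>%
  slice_sample(n = 1) %>%   # step 1: keep one random row per participant
  group_by(v1, v2) %>%      # regrouping replaces the ID grouping
  slice_sample(n = 2) %>%   # step 2: up to 2 rows per group
  ungroup()

anyDuplicated(df2$ID)       # 0 means no participant appears twice
count(df2, v1, v2)          # group sizes actually obtained
```

If equal group sizes are required, the count() output shows whether any group came out short for a given seed.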

Related

Joining data in R by first row, then second and so on

I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use a left join because the first (left) file needs to stay as it is: I don't want the join to return all combinations and add rows. I also don't want it to link data like VLOOKUP in Excel, which finds the first match and returns it, so that duplicate ID numbers only ever get the first match. I need it to return the first match, then the second, then the third, and so on (the dates are sorted so that the newest date is always first for every ID number), BUT I can't have added rows. Is there any way to do this? Since I don't know how else to show you, I have included an example picture of what I need. Not sure if I made myself clear, but thank you in advance!
You can add a second column of sub-IDs that number the rows within each ID in order of appearance. Then you can use an inner_join on both columns to join everything together.
Since you didn't provide example data sets, I created two (listed under "data:" below) to show the principle.
library(dplyr)
df1 <- df1 %>%
group_by(ID) %>%
mutate(follow_id = row_number())
df2 <- df2 %>% group_by(ID) %>%
mutate(follow_id = row_number())
outcome <- df1 %>% inner_join(df2, by = c("ID", "follow_id"))
outcome
# A tibble: 7 x 3
# Groups: ID [?]
# ID follow_id var1
# <dbl> <int> <fct>
# 1 1 1 a
# 2 1 2 b
# 3 2 1 e
# 4 3 1 f
# 5 4 1 h
# 6 4 2 i
# 7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, group by the id, create an auto-incrementing id within each group, then join on both columns as usual:
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
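
The two answers above differ in one detail: one uses inner_join and the other left_join. A minimal sketch of when that matters, using made-up data (the names left, right, value and the helper add_seq are hypothetical): if the left table has more rows for an ID than the right table, only left_join keeps every left row.

```r
library(dplyr)

# Hypothetical data: ID 2 appears twice on the left but only once on the right
left  <- data.frame(id = c(1, 1, 2, 2, 3))
right <- data.frame(id = c(1, 1, 1, 2, 3, 3), value = letters[1:6])

# Number the rows within each id in order of appearance
add_seq <- function(d) {
  d %>%
    group_by(id) %>%
    mutate(id2 = row_number()) %>%
    ungroup()
}

kept_all  <- left_join(add_seq(left), add_seq(right), by = c("id", "id2"))
kept_some <- inner_join(add_seq(left), add_seq(right), by = c("id", "id2"))

nrow(kept_all) == nrow(left)  # left_join preserves the left row count
nrow(kept_some)               # inner_join drops the unmatched (2, 2) row
```

With left_join, the unmatched row gets NA in value instead of disappearing, which fits the requirement that the left file stays as it is.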

dplyr distinct over two columns

I have a table where the first two columns are sample identifiers and the third is a measure of distance, e.g.:
df<-data.table(H1=c(1,2,3,4,5),H2=c(7,3,2,8,9), D=c(100,4,55,66,35))
I want to find only the unique pairs across both columns, i.e. 1-7, 2-3, 4-8, 5-9, removing the duplicate 2-3 and 3-2 pairing, which appears with the columns swapped, but keeping the third column (which, being a distance, should be identical for 2-3 and 3-2).
# example data
df<-data.frame(H1=c(1,2,3,4,5),
H2=c(7,3,2,8,9),
D=c(100,4,55,66,35), stringsAsFactors = F)
library(dplyr)
df %>%
rowwise() %>% # for each row
mutate(HH = paste0(sort(c(H1,H2)), collapse = ",")) %>% # create a new variable that orders and combines H1 and H2
group_by(HH) %>% # group by that variable
filter(D == max(D)) %>% # keep the row where D is the maximum (assumed logic*)
ungroup() %>% # forget the grouping
select(-HH) # remove unnecessary variable
# # A tibble: 4 x 3
# H1 H2 D
# <dbl> <dbl> <dbl>
# 1 1 7 100
# 2 3 2 55
# 3 4 8 66
# 4 5 9 35
*Note: No idea what your logic is to keep 1 row from the duplicates. I had to use something as an example and here I'm keeping the row with the highest D value. This logic can change if needed.
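
As an alternative to rowwise(), the unordered pair can be built with the vectorised pmin() and pmax(). This is a sketch under the same assumed rule as above (keep the row with the largest D per pair); the helper columns lo and hi are names I introduced for illustration.

```r
library(dplyr)

df <- data.frame(H1 = c(1, 2, 3, 4, 5),
                 H2 = c(7, 3, 2, 8, 9),
                 D  = c(100, 4, 55, 66, 35))

res <- df %>%
  mutate(lo = pmin(H1, H2),              # smaller identifier of the pair
         hi = pmax(H1, H2)) %>%          # larger identifier of the pair
  arrange(desc(D)) %>%                   # largest D first, so distinct() keeps it
  distinct(lo, hi, .keep_all = TRUE) %>% # one row per unordered pair
  select(-lo, -hi)

res
```

Dropping the arrange() step would instead keep whichever duplicate appears first in the data.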

r: Summarise for rowSums after group_by

I've tried searching a number of posts on SO but I'm not sure what I'm doing wrong here, and I imagine the solution is quite simple. I'm trying to group a dataframe by one variable and figure the mean of several variables within that group.
Here is what I am trying:
head(airquality)
target_vars = c("Ozone","Temp","Solar.R")
airquality %>% group_by(Month) %>% select(target_vars) %>% summarise(rowSums(.))
But I get the error that my lengths don't match. I've tried variations using mutate to create the column or summarise_all, but neither of these seems to work. I need the row sums within group, and then to compute the mean within group (yes, it's nonsensical here).
Also, I want to use select because I'm trying to do this over just certain variables.
I'm sure this could be a duplicate, but I can't find the right one.
EDIT FOR CLARITY
Sorry, my original question was not clear. Imagine the grouping variable is the calendar month, and we have v1, v2, and v3. I'd like to know, within month, what was the average of the sums of v1, v2, and v3. So if we have 12 months, the result would be a 12x1 dataframe. Here is an example if we just had 1 month:
Month v1 v2 v3 Sum
1     1  1  0  2
1     1  1  1  3
1     1  0  0  1
Then the result would be:
Month Average
1     6/3 = 2
You can try:
library(tidyverse)
airquality %>%
select(Month, target_vars) %>%
gather(key, value, -Month) %>%
group_by(Month) %>%
summarise(n=length(unique(key)),
Sum=sum(value, na.rm = T)) %>%
mutate(Average=Sum/n)
# A tibble: 5 x 4
Month n Sum Average
<int> <int> <int> <dbl>
1 5 3 7541 2513.667
2 6 3 8343 2781.000
3 7 3 10849 3616.333
4 8 3 8974 2991.333
5 9 3 8242 2747.333
The idea is to convert the data from wide to long using tidyr::gather(), then group by Month and calculate the sum and the average.
This seems to deliver what you want, using only base R. sapply keeps the months separate by name. Note that sum applied to each data frame adds up all of its values, so the column sums are not kept separate. (Correction #2: used only target_vars.)
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)
5 6 7 8 9
7541 8343 10849 8974 8242
If you wanted the result per variable, you would divide by the number of variables:
sapply( split( airquality[target_vars], airquality$Month), sum, na.rm=TRUE)/
(length(target_vars))
5 6 7 8 9
2513.667 2781.000 3616.333 2991.333 2747.333
Perhaps this is what you're looking for
library(dplyr)
library(purrr)
library(tidyr) # forgot this in original post
airquality %>%
group_by(Month) %>%
nest(Ozone, Temp, Solar.R, .key=newcol) %>%
mutate(newcol = map_dbl(newcol, ~mean(rowSums(.x, na.rm=TRUE))))
# A tibble: 5 x 2
# Month newcol
# <int> <dbl>
# 1 5 243.2581
# 2 6 278.1000
# 3 7 349.9677
# 4 8 289.4839
# 5 9 274.7333
I've never encountered a situation where all the answers disagreed. Here's some validation (at least I think) for the 5th month
airquality %>%
filter(Month == 5) %>%
select(Ozone, Temp, Solar.R) %>%
mutate(newcol = rowSums(., na.rm=TRUE)) %>%
summarise(sum5 = sum(newcol), mean5 = mean(newcol))
# sum5 mean5
# 1 7541 243.2581
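
For reference, the quantity described in the question's edit (row sums first, then their mean within each month) can also be written without nesting. This is a sketch, assuming dplyr 1.0 or later for all_of(); row_total is a column name I introduced.

```r
library(dplyr)

target_vars <- c("Ozone", "Temp", "Solar.R")

avg_by_month <- airquality %>%
  # row sum over the target variables only; NAs are treated as 0
  mutate(row_total = rowSums(select(., all_of(target_vars)), na.rm = TRUE)) %>%
  group_by(Month) %>%
  summarise(Average = mean(row_total))

avg_by_month
```

For month 5 this reproduces the 243.2581 from the validation above, since it is the same sum (7541) divided by the number of rows in that month rather than by the number of variables.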

Using dplyr to summarise a running total of distinct factors

I'm trying to generate a species saturation curve for a camera trapping survey. I have thousands of observations and do most of my manipulations in dplyr.
I have three field sites, with observation records of different animal species from a number of weeks of trapping. In some weeks there are no animals; in other weeks there may be more than one species. I want to generate a separate figure for each site to compare how quickly new species are encountered over the sequential weeks of the study. These observations of new species should eventually saturate once the total species diversity in the area has been captured. Some field sites are likely to saturate faster than others.
The problem is that I have not come across a way of counting the number of distinct species to provide a running total by time. A simple dummy dataset is below.
field_site<-c(rep("A",4),rep("B",4),rep("C",4))
week<-c(1,2,2,3,2,3,4,4,1,2,3,4)
animal<-c("dog","dog","cat","rabbit","dog","dog","dog","rabbit","cat","cat","rabbit","dog")
df<-as.data.frame(cbind(field_site,week,animal),head=TRUE)
I can easily generate the number of unique species within each week grouping, e.g.
tbl_df(df)%>%
group_by(field_site,week) %>%
summarise(no_of_sp=n_distinct(animal))
But this is not sensitive to the fact that some species are encountered again in subsequent weeks. What I really need is a running count of the different species that counts the unique species per site from week 1 going down through the rows, assuming that the data is sorted by increasing time from the start of the survey.
The cumulative total of species encountered over the course of the study by week in the example for field Site A would be: week 1 = 1 species, week 2 = 2 species, week 3 = 3 species, week 4 = still 3 species.
For site B cumulative total of species would be: week 1 = 0 species, week 2 = 1 species, week 3 = 1 species,week 4 = 1 species, etc...
Any advice would be greatly appreciated.
cheers in advance!
I'm making two assumptions:
Site B, week 4 = 2 species, both "dog" and "rabbit"; and
All sites share the same weeks, so if at least one site has week 4, then all sites should include it. This only drives the mt (empty) variable, so feel free to update this variable.
I first suggest an "empty" data.frame to ensure sites have the requisite week numbers populated:
mt <- expand.grid(field_site = unique(df$field_site),
week = unique(df$week))
The use of tidyr (alongside dplyr) helps:
library(dplyr)
library(tidyr)
df %>%
mutate(fake = TRUE) %>%
# ensure all species are "represented" on each row
spread(animal, fake) %>%
# ensure all weeks are shown, even if no species
full_join(mt, by = c("field_site", "week")) %>%
# ensure the presence of a species persists at a site
arrange(week) %>%
group_by(field_site) %>%
mutate_if(is.logical, funs(cummax(!is.na(.)))) %>%
ungroup() %>%
# helps to contain variable number of species columns in one place
nest(-field_site, -week, .key = "species") %>%
group_by(field_site, week) %>%
# could also use purrr::map in place of sapply
mutate(n = sapply(species, sum)) %>%
ungroup() %>%
select(-species) %>%
arrange(field_site, week)
# # A tibble: 12 × 3
# field_site week n
# <fctr> <fctr> <int>
# 1 A 1 1
# 2 A 2 2
# 3 A 3 3
# 4 A 4 3
# 5 B 1 0
# 6 B 2 1
# 7 B 3 1
# 8 B 4 2
# 9 C 1 1
# 10 C 2 1
# 11 C 3 2
# 12 C 4 3
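
The same running total can be sketched more compactly by flagging the first sighting of each species per site and taking a cumulative sum. This is not the original answer's method; it assumes dplyr 1.0 or later and tidyr's complete(), keeps week numeric, and introduces the helper columns first_seen and new_species for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  field_site = c(rep("A", 4), rep("B", 4), rep("C", 4)),
  week       = c(1, 2, 2, 3, 2, 3, 4, 4, 1, 2, 3, 4),
  animal     = c("dog", "dog", "cat", "rabbit", "dog", "dog", "dog",
                 "rabbit", "cat", "cat", "rabbit", "dog")
)

running <- df %>%
  arrange(field_site, week) %>%
  group_by(field_site) %>%
  mutate(first_seen = !duplicated(animal)) %>%      # TRUE on a species' first record at a site
  group_by(field_site, week) %>%
  summarise(new_species = sum(first_seen), .groups = "drop") %>%
  complete(field_site, week, fill = list(new_species = 0)) %>%  # add weeks with no records
  arrange(field_site, week) %>%
  group_by(field_site) %>%
  mutate(n = cumsum(new_species)) %>%               # running count of distinct species
  ungroup()

running
```

Like the answer above, this assumes all sites share the same set of weeks; the n column can then be plotted against week for each site.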

How to repeat empty rows so that each split has the same number

My goal is to get the same number of rows for each split (based on the column Initials). I am basically trying to pad the number of rows so that each person has the same number, while retaining the Initials column so I can tell them apart. My attempt failed completely. Does anybody have suggestions?
df<-data.frame(Initials=c("a","a","b"),data=c(2,3,4))
attach(df)
maxrows=max(table(Initials))+1
arr<-split(df,Initials)
lapply(arr,function(x){
toadd<-maxrows-dim(x)[1]
replicate(toadd,x<-rbind(x,rep(NA,1)))#colnames -1 because col 1 should the the same Initial
})
Goal:
a 2
a 3
b 4
b NA
Using data.table...
my_rows <- seq_len(max(table(df$Initials))) # table() works for both factor and character Initials
library(data.table)
setDT(df)[ , .SD[my_rows], by=Initials]
# Initials data
# 1: a 2
# 2: a 3
# 3: b 4
# 4: b NA
.SD is the Subset of Data associated with each by= group. We can subset its rows like .SD[row_numbers], unlike a data.frame which requires an additional comma DF[row_numbers,].
The analogue in dplyr is
my_rows <- seq_len(max(table(df$Initials))) # table() works for both factor and character Initials
library(dplyr)
setDT(df) %>% group_by(Initials) %>% slice(my_rows)
# Initials data
# (fctr) (dbl)
# 1 a 2
# 2 a 3
# 3 b 4
# 4 b NA
Strangely, this only works if df is a data.table. I've filed a report/query with dplyr. There's a good chance that the dplyr devs will prevent this usage in a future version.
Here's a dplyr/tidyr method. We group_by initials, add row_numbers, ungroup, complete row numbers/Initials combinations, then remove our row numbers:
library(dplyr)
library(tidyr)
df %>% group_by(Initials) %>%
mutate(row = row_number()) %>%
ungroup() %>%
complete(Initials, row) %>%
select(-row)
Source: local data frame [4 x 2]
Initials data
(fctr) (dbl)
1 a 2
2 a 3
3 b 4
4 b NA
Interesting problem. Try:
to.add <- max(table(df$Initials)) - table(df$Initials)
rbind(df, c(rep(names(to.add), to.add), rep(NA, ncol(df)-1)))
# Initials data
#1 a 2
#2 a 3
#3 b 4
#4 b <NA>
We calculate the number of extra rows each initial needs, combine those initials with NA values, and then rbind the result to the data frame.
max(table(df$Initials)) gives the count for the initial with the most repeats, in this case 2 (for a). Subtracting table(df$Initials) from that maximum gives a vector with the number of rows to add for each initial. As an added bonus, because we use table, the vector is automatically named.
We use the names of the new vector to know 1) which initials to repeat, and 2) how many times each should be repeated.
Note that c(...) builds a single row, so as written this handles the example, where exactly one padding row is needed. The added row can also change the class of the data column (note the <NA> in the output). To preserve the class of the data, assign the result first (newdf <- rbind(...)) and then run newdf$data <- as.numeric(newdf$data).
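
For cases where more than one padding row is needed, here is a minimal base R sketch. It is my generalisation rather than any answerer's code, and the example data are made up. It relies on the fact that indexing a data frame with out-of-range row numbers returns NA rows.

```r
# Hypothetical data: "a" has 3 rows, "b" has 1, "c" has 2
df <- data.frame(Initials = c("a", "a", "a", "b", "c", "c"),
                 data     = 1:6)

n_max <- max(table(df$Initials))  # size of the largest group

padded <- do.call(rbind, lapply(split(df, df$Initials), function(g) {
  g <- g[seq_len(n_max), ]        # rows beyond nrow(g) come back as NA rows
  g$Initials <- g$Initials[1]     # restore the group label on the padded rows
  g
}))
rownames(padded) <- NULL

padded
```

Because the padding is done by indexing rather than by binding a character vector, the data column keeps its original class.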
