Comparison across unique readers in R

Reprex
dat <- data.frame(id = c(1, 1, 2, 2, 3, 3, 4, 4),
                  reader = c(1, 4, 2, 3, 3, 4, 2, 5),
                  response = c("CR", "PR", "SD", "SD", "PR", "PR", "CR", "SD"))
Problem: I wish to compare response across the unique readers for each id. There are 5 unique readers in total, but each id is read by only 2 of them.
The resulting dataset would look something like this:
# A tibble: 4 x 4
     id read1 read2 matchflag
  <dbl> <chr> <chr>     <dbl>
1     1 CR    PR            0
2     2 SD    SD            1
3     3 PR    PR            1
4     4 CR    SD            0

A data.table option (note the data object from the reprex is dat):
library(data.table)

dcast(
  setDT(dat),
  id ~ paste0("reader", rowid(id)),
  value.var = "response"
)[
  ,
  match_flag := +(reader1 == reader2)
][]
gives
   id reader1 reader2 match_flag
1:  1      CR      PR          0
2:  2      SD      SD          1
3:  3      PR      PR          1
4:  4      CR      SD          0
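As an aside, the unary + in match_flag := +(reader1 == reader2) is just a compact way to coerce a logical comparison to integer; a minimal illustration:
+(c(TRUE, FALSE, NA))
#> [1]  1  0 NA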

This should work (reusing dat from the reprex):
library(dplyr)
library(tidyr)

dat %>%
  select(-reader) %>%
  group_by(id) %>%
  mutate(obs = seq_along(id)) %>%
  pivot_wider(names_from = "obs", values_from = "response", names_prefix = "read") %>%
  mutate(match_flag = as.numeric(read1 == read2))
# # A tibble: 4 x 4
# # Groups:   id [4]
#      id read1 read2 match_flag
#   <dbl> <chr> <chr>      <dbl>
# 1     1 CR    PR             0
# 2     2 SD    SD             1
# 3     3 PR    PR             1
# 4     4 CR    SD             0

A slight variation on @DaveArmstrong's solution: create the row sequence with rowid() (from data.table), then pivot to wide format and create the new column with a relational operator, coercing to binary with unary +.
library(dplyr)
library(tidyr)
library(data.table)

dat %>%
  transmute(id, obs = rowid(id), response) %>%
  pivot_wider(names_from = obs, values_from = response, names_prefix = 'read') %>%
  mutate(match_flag = +(read1 == read2))
# # A tibble: 4 x 4
#      id read1 read2 match_flag
#   <dbl> <chr> <chr>      <int>
# 1     1 CR    PR             0
# 2     2 SD    SD             1
# 3     3 PR    PR             1
# 4     4 CR    SD             0
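All of the approaches above assume exactly two readers per id; if that might not hold in the real data, a quick sanity check before pivoting (a sketch) avoids silently producing NA in read2:
dat %>% count(id) %>% filter(n != 2)   # should return zero rows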


Function over tidyverse code results in issue with quotes

Here is an example of the problem I'm having when applying a function that includes tidyverse code. I want to repeat it for different variable names, but I'm not sure how to 'unquote'.
Example data:
df <- data.frame(grp = c(1, 2, 1, 2, 1),
                 one = c(rep('a', 3), rep('b', 2)),
                 two = c(rep('a', 1), rep('d', 4)))
cn <- colnames(df)[2:ncol(df)]
for (i in cn) {
  i <- enquo(i)
  print(df %>% group_by(grp) %>% count(!!i))
}
# A tibble: 2 x 3
# Groups:   grp [2]
    grp `"one"`     n
  <dbl> <chr>   <int>
1     1 one         3
2     2 one         2
# A tibble: 2 x 3
# Groups:   grp [2]
    grp `"two"`     n
  <dbl> <chr>   <int>
1     1 two         3
2     2 two         2
Doing it manually for the single variable one gives the correct output:
df %>% group_by(grp) %>% count(one)
# A tibble: 4 x 3
# Groups:   grp [2]
    grp one       n
  <dbl> <fct> <int>
1     1 a         2
2     1 b         1
3     2 a         1
4     2 b         1
You can use map(); you can also avoid group_by() by including grp in count():
library(dplyr)
library(purrr)

map(cn, ~df %>% count(grp, .data[[.x]]))
#[[1]]
#  grp one n
#1   1   a 2
#2   1   b 1
#3   2   a 1
#4   2   b 1
#[[2]]
#  grp two n
#1   1   a 1
#2   1   d 2
#3   2   d 2
You can also use NSE with sym():
map(cn, ~df %>% count(grp, !!sym(.x)))
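If you would rather keep the original for loop, the same fix applies there: the string has to become a symbol with sym() before unquoting, whereas enquo(i) quotes the string value itself, which is why the loop above counted a constant. A sketch:
library(rlang)

for (i in cn) {
  i <- sym(i)   # a symbol, not a quoted string
  print(df %>% group_by(grp) %>% count(!!i))
}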

How to combine multiple summary tables at once

Consider the following data frame:
set.seed(123)
dat <- data.frame(Region = rep(c("a", "b"), each = 100),
                  State = rep(c("NY", "MA", "FL", "GA"), each = 50),
                  Loc = rep(letters[1:20], each = 5),
                  ID = 1:200,
                  count1 = sample(4, 200, replace = TRUE),
                  count2 = sample(4, 200, replace = TRUE))
Region, State, and Loc are grouping variables for individual measurements, each of which has a unique ID number. For each grouping variable, I want to summarize the number of observations in each level of count1 and count2. Normally I would do one of the following for each pair:
# example for count1 and Region:
library(tidyverse)
dat %>%
  dplyr::select(Region, count1) %>%
  group_by(count1, Region) %>%
  count()
## or
with(dat, table(Region, count1))
How can I do this for all combinations and wrap them into a single table (or at least a few tables grouped by equivalent lengths, since the lengths will differ depending on which grouping variable is used)?
Try something like this:
Region1 <- dat %>% group_by(Region, count1) %>%
  summarise(TotalRegion1 = n())
State1 <- dat %>% group_by(State, count1) %>%
  summarise(TotalState1 = n())
Loc1 <- dat %>% group_by(Loc, count1) %>%
  summarise(TotalLoc1 = n())
You can get them "all at once" (for count1) with:
out <- dat %>%
  select(-ID, -count2) %>%
  pivot_longer(Region:Loc, names_to = "k", values_to = "v") %>%
  group_by(k, v, count1) %>%
  tally() %>%
  ungroup()
out %>%
  filter(k == "Region")
# # A tibble: 8 x 4
#   k      v     count1     n
#   <chr>  <fct>  <int> <int>
# 1 Region a          1    26
# 2 Region a          2    27
# 3 Region a          3    20
# 4 Region a          4    27
# 5 Region b          1    20
# 6 Region b          2    30
# 7 Region b          3    30
# 8 Region b          4    20
out
# # A tibble: 101 x 4
#    k     v     count1     n
#    <chr> <fct>  <int> <int>
#  1 Loc   a          2     5
#  2 Loc   a          3     1
#  3 Loc   a          4     4
#  4 Loc   b          1     2
#  5 Loc   b          2     2
#  6 Loc   b          3     3
#  7 Loc   b          4     3
#  8 Loc   c          1     2
#  9 Loc   c          2     2
# 10 Loc   c          3     3
# # ... with 91 more rows
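The question also asks about count2. One hedged extension of the same idea (a sketch, reusing dat from above) pivots the count columns first, then the grouping columns, so every grouping-variable/count-variable combination lands in one table:
out_all <- dat %>%
  select(-ID) %>%
  pivot_longer(c(count1, count2), names_to = "count_var", values_to = "level") %>%
  pivot_longer(Region:Loc, names_to = "k", values_to = "v") %>%
  count(count_var, k, v, level)
out_all %>% filter(count_var == "count1", k == "Region")   # should match the table above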

R dplyr: filter a dataframe by group and a numeric vector?

I have a dataframe df1 containing data and groups, and df2, which stores the same groups with one value per group.
I want to filter the rows of df1, keeping those where the lagged difference within each group is higher than the value indicated for that group in df2.
Dummy example:
# identify the first year of disturbance by lag by group
df1 <- data.frame(year = c(1:4, 1:4),
                  mort = c(5, 16, 40, 4, 5, 6, 10, 108),
                  distance = rep(c("a", "b"), each = 4))
df2 <- data.frame(distance = c("a", "b"),
                  my.median = c(12, 1))
Now calculate the lag between values (creating a new column) and filter df1 based on the column values of df2:
# calculate lag between years
df1 %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  filter(yearLag > df2$my.median)
This, however, does not produce the expected results: the length-2 vector df2$my.median is recycled over the rows within each group rather than matched to the groups by distance.
# A tibble: 3 x 4
# Groups:   distance [2]
   year  mort distance yearLag
  <int> <dbl> <fct>      <dbl>
1     2    16 a             11
2     3    40 a             24
3     4   108 b             98
Instead, I expect to get:
# A tibble: 3 x 4
# Groups:   distance [2]
   year  mort distance yearLag
  <int> <dbl> <fct>      <dbl>
1     3    40 a             24
2     1     5 b              5
3     3    10 b              4
The filter works well when applied to a single value, but how can I adapt it to a vector, and especially to a vector of per-group values (as the order of elements can potentially change)?
Is this what you're trying to do?
df1 %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  left_join(df2) %>%
  filter(yearLag > my.median)
Result:
# A tibble: 4 x 5
# Groups:   distance [2]
   year  mort distance yearLag my.median
  <int> <dbl> <fct>      <dbl>     <dbl>
1     3    40 a             24        12
2     1     5 b              5         1
3     3    10 b              4         1
4     4   108 b             98         1
Here is a data.table approach:
library(data.table)
# create data.tables
setDT(df1); setDT(df2)
# create the yearLag variable
df1[, yearLag := mort - shift(mort, type = "lag", fill = 0), by = .(distance)]
# update join, then filter the wanted rows
df1[df2, median.value := i.my.median, on = .(distance)][yearLag > median.value, ][]
#    year mort distance yearLag median.value
# 1:    3   40        a      24           12
# 2:    1    5        b       5            1
# 3:    3   10        b       4            1
# 4:    4  108        b      98            1
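If the helper column is not wanted in the result, it can be dropped in the same chain (a sketch):
df1[df2, median.value := i.my.median, on = .(distance)
    ][yearLag > median.value
    ][, median.value := NULL][]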
I came to the same conclusion: you should left_join the data frames.
df1 %>%
  left_join(df2, by = "distance") %>%
  group_by(distance) %>%
  dplyr::mutate(yearLag = mort - lag(mort, default = 0)) %>%
  filter(yearLag > my.median)
# A tibble: 4 x 5
# Groups:   distance [2]
   year  mort distance my.median yearLag
  <int> <dbl> <fct>        <dbl>   <dbl>
1     3    40 a               12      24
2     1     5 b                1       5
3     3    10 b                1       4
4     4   108 b                1      98

Summing values in R based on column value with dplyr

I have a data set that has the following information:
Subject Value1 Value2 Value3 UniqueNumber
    001      1      0      1            3
    002      0      1      1            2
    003      1      1      1            1
If the value of UniqueNumber > 0, I would like to use dplyr to sum, for each subject, the values from Value1 through the column given by UniqueNumber, and to calculate their mean. So for Subject 001, sum = 2 and mean = .67.
total <- 0
average <- 0
for (i in 1:length(Data$Subject)) {
  for (j in 1:ncol(Data)) {
    if (Data$UniqueNumber[i] > 0) {
      total[i] <- sum(Data[i, 1:j])
      average[i] <- mean(Data[i, 1:j])
    }
  }
}
Edit: I am only looking to sum through the number of columns listed in the 'UniqueNumber' column. So this is looping through every row and stopping at the column listed in 'UniqueNumber'.
Example: Row 2 with Subject 002 should sum up the values in columns 'Value1' and 'Value2', while Row 3 with Subject 003 should only sum the value in column 'Value1'.
Not a tidyverse fan/expert, but I would try this using long format. Then just filter by row index per group and run any functions you want on a single column (much easier this way).
library(tidyr)
library(dplyr)

Data %>%
  gather(variable, value, -Subject, -UniqueNumber) %>%   # long format
  group_by(Subject) %>%                                  # group by Subject in order to get row counts
  filter(row_number() <= UniqueNumber) %>%               # filter by row index
  summarise(Mean = mean(value), Total = sum(value)) %>%  # do the calculations
  ungroup()
## A tibble: 3 x 3
#   Subject  Mean Total
#     <int> <dbl> <int>
# 1       1 0.667     2
# 2       2 0.5       1
# 3       3 1         1
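As an aside, gather() is superseded in current tidyr; assuming tidyr >= 1.0, the same logic with pivot_longer() would be (a sketch):
Data %>%
  pivot_longer(starts_with("Value"), names_to = "variable", values_to = "value") %>%
  group_by(Subject) %>%
  filter(row_number() <= UniqueNumber) %>%
  summarise(Mean = mean(value), Total = sum(value))
Within each Subject, the pivoted rows come out in column order (Value1, Value2, Value3), so row_number() still indexes the Value columns correctly.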
A very similar way to achieve this is filtering by the integers in the column names. The filter step comes before the group_by, so it could potentially increase performance (or not?), but it is less robust, as it assumes the columns of interest are called "Value#".
Data %>%
  gather(variable, value, -Subject, -UniqueNumber) %>%  # long format
  filter(as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber) %>%  # filter
  group_by(Subject) %>%  # group by Subject
  summarise(Mean = mean(value), Total = sum(value)) %>%  # do the calculations
  ungroup()
## A tibble: 3 x 3
#   Subject  Mean Total
#     <int> <dbl> <int>
# 1       1 0.667     2
# 2       2 0.5       1
# 3       3 1         1
Just for fun, adding a data.table solution:
library(data.table)
library(magrittr)   # for %>% with data.table

data.table(Data) %>%
  melt(id = c("Subject", "UniqueNumber")) %>%
  .[as.numeric(gsub("Value", "", variable, fixed = TRUE)) <= UniqueNumber,
    .(Mean = round(mean(value), 3), Total = sum(value)),
    by = Subject]
#    Subject  Mean Total
# 1:       1 0.667     2
# 2:       2 0.500     1
# 3:       3 1.000     1
Here is another method that uses tidyr::nest to collect the Value columns into a list-column, so that we can iterate through the table with map2. In each row, we select the correct values from the Values list-column and take the sum or mean, respectively.
library(tidyverse)

tbl <- read_table2(
  "Subject Value1 Value2 Value3 UniqueNumber
  001 1 0 1 3
  002 0 1 1 2
  003 1 1 1 1"
)
tbl %>%
  filter(UniqueNumber > 0) %>%
  nest(starts_with("Value"), .key = "Values") %>%
  mutate(
    sum = map2_dbl(UniqueNumber, Values, ~ sum(.y[1:.x], na.rm = TRUE)),
    mean = map2_dbl(UniqueNumber, Values, ~ mean(as.numeric(.y[1:.x]), na.rm = TRUE))
  )
#> # A tibble: 3 x 5
#>   Subject UniqueNumber Values             sum  mean
#>   <chr>          <dbl> <list>           <dbl> <dbl>
#> 1 001                3 <tibble [1 × 3]>     2 0.667
#> 2 002                2 <tibble [1 × 3]>     1 0.5
#> 3 003                1 <tibble [1 × 3]>     1 1
Created on 2019-02-14 by the reprex package (v0.2.1)
Check this solution:
df %>%
  gather(key, val, Value1:Value3) %>%
  group_by(Subject) %>%
  mutate(
    Sum = sum(val[c(1:(UniqueNumber[1]))]),
    Mean = mean(val[c(1:(UniqueNumber[1]))])
  ) %>%
  spread(key, val)
Output:
  Subject UniqueNumber   Sum  Mean Value1 Value2 Value3
  <chr>          <int> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
1 001                3     2 0.667      1      0      1
2 002                2     1 0.5        0      1      1
3 003                1     1 1          1      1      1
The OP might be interested only in a dplyr solution, but for comparison purposes and for future readers, here is a base R option using mapply:
cols <- grep("^Value", names(df))
cbind(df, t(mapply(function(x, y) {
  if (y > 0) {
    vals <- as.numeric(df[x, cols[1:y]])
    c(Sum = sum(vals, na.rm = TRUE), Mean = mean(vals, na.rm = TRUE))
  } else {
    c(Sum = 0, Mean = 0)
  }
}, 1:nrow(df), df$UniqueNumber)))
#  Subject Value1 Value2 Value3 UniqueNumber Sum  Mean
#1       1      1      0      1            3   2 0.667
#2       2      0      1      1            2   1 0.500
#3       3      1      1      1            1   1 1.000
Here we subset each row based on its respective UniqueNumber and then calculate its sum and mean if the UniqueNumber value is greater than 0; otherwise we return only 0.
A solution that uses purrr::map_df() (which is from the same authors as dplyr).
library(dplyr)
library(purrr)

l_dat <- split(dat, dat$Subject)      # first we need to split into a list
map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber            # finds the number of columns
  x <- as.numeric(x[2:(n_cols + 1)])  # subsets x and converts to numeric
  mean(x, na.rm = TRUE)               # mean to be returned
})
# output:
# # A tibble: 1 x 3
#     `1`   `2`   `3`
#   <dbl> <dbl> <dbl>
# 1 0.667   0.5     1
Another option (output format closer to a dplyr solution):
map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber
  id <- x$Subject
  x <- as.numeric(x[2:(n_cols + 1)])
  tibble(id = id, mean_values = mean(x, na.rm = TRUE))
})
# # A tibble: 3 x 2
#      id mean_values
#   <int>       <dbl>
# 1     1       0.667
# 2     2       0.5
# 3     3       1
Just as an example, I added a sum() and then divided by length(x) - 1:
map_df(l_dat, function(x) {
  n_cols <- x$UniqueNumber
  id <- x$Subject
  x <- as.numeric(x[2:(n_cols + 1)])
  tibble(id = id,
         mean_values = sum(x, na.rm = TRUE) / (length(x) - 1))  # change here
})
# # A tibble: 3 x 2
#      id mean_values
#   <int>       <dbl>
# 1     1          1.
# 2     2          1.
# 3     3        Inf   # beware of this case, where you end up dividing by 0
Data:
tt <- "Subject Value1 Value2 Value3 UniqueNumber
       001 1 0 1 3
       002 0 1 1 2
       003 1 1 1 1"
dat <- read.table(text = tt, header = TRUE)
I think the easiest way is to set to NA the values that really should be NA, then use rowSums and rowMeans on the appropriate subset of columns.
Data[2:4][col(Data[2:4]) > Data[[5]]] <- NA
Data
#   Subject Value1 Value2 Value3 UniqueNumber
# 1       1      1      0      1            3
# 2       2      0      1     NA            2
# 3       3      1     NA     NA            1
library(dplyr)
Data %>%
  mutate(sum = rowSums(.[2:4], na.rm = TRUE),
         mean = rowMeans(.[2:4], na.rm = TRUE))
#   Subject Value1 Value2 Value3 UniqueNumber sum      mean
# 1       1      1      0      1            3   2 0.6666667
# 2       2      0      1     NA            2   1 0.5000000
# 3       3      1     NA     NA            1   1 1.0000000
Or use transform(Data, sum = rowSums(Data[2:4], na.rm = TRUE), mean = rowMeans(Data[2:4], na.rm = TRUE)) to stay in base R.
data
Data <- structure(
  list(Subject = 1:3,
       Value1 = c(1L, 0L, 1L),
       Value2 = c(0L, 1L, NA),
       Value3 = c(1L, NA, NA),
       UniqueNumber = c(3L, 2L, 1L)),
  .Names = c("Subject", "Value1", "Value2", "Value3", "UniqueNumber"),
  row.names = c(NA, 3L), class = "data.frame")

Performing within-group operations in R

I have a table sf with 2 fields: Customer id and Buy_date. Buy_date is unique within each customer, and a customer can have more than 3 different Buy_date values. I want to calculate the difference between consecutive Buy_dates for each Customer, and the mean of those differences. How can I do this?
Example
Customer  Buy_date
1         2018/03/01
1         2018/03/19
1         2018/04/3
1         2018/05/10
2         2018/01/02
2         2018/02/10
2         2018/04/13
I want the results for each customer in the format:
Customer mean
Here's a dplyr solution.
Your data:
df <- data.frame(Customer = c(1, 1, 1, 1, 2, 2, 2),
                 Buy_date = c("2018/03/01", "2018/03/19", "2018/04/3", "2018/05/10",
                              "2018/01/02", "2018/02/10", "2018/04/13"))
Grouping, mean Buy_date calculation and summarising:
library(dplyr)
df %>%
  group_by(Customer) %>%
  mutate(mean = mean(as.POSIXct(Buy_date))) %>%
  group_by(Customer, mean) %>%
  summarise()
Output:
# A tibble: 2 x 2
# Groups:   Customer [?]
  Customer mean
     <dbl> <dttm>
1        1 2018-03-31 06:30:00
2        2 2018-02-17 15:40:00
Or, as @r2evans points out in his comment, for the consecutive days between Buy_dates:
df %>%
  group_by(Customer) %>%
  mutate(mean = mean(diff(as.POSIXct(Buy_date)))) %>%
  group_by(Customer, mean) %>%
  summarise()
Output:
# A tibble: 2 x 2
# Groups:   Customer [?]
  Customer mean
     <dbl> <time>
1        1 23.3194444444444
2        2 50.4791666666667
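The mutate() plus second group_by() is only there to deduplicate rows; assuming the rows are already ordered by Buy_date within each customer, a more direct sketch collapses each group with a single summarise():
df %>%
  group_by(Customer) %>%
  summarise(mean = mean(diff(as.POSIXct(Buy_date))))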
I am not exactly sure of the desired output, but this is what I think you want.
library(dplyr)
library(zoo)

dat <- read.table(text = "Customer Buy_date
1 2018/03/01
1 2018/03/19
1 2018/04/3
1 2018/05/10
2 2018/01/02
2 2018/02/10
2 2018/04/13", header = TRUE, stringsAsFactors = FALSE)
dat$Buy_date <- as.Date(dat$Buy_date)
dat %>%
  group_by(Customer) %>%
  mutate(diff_between = as.vector(diff(zoo(Buy_date), na.pad = TRUE)),
         mean_days = mean(diff_between, na.rm = TRUE))
This produces:
  Customer Buy_date   diff_between mean_days
     <int> <date>            <dbl>     <dbl>
1        1 2018-03-01           NA      23.3
2        1 2018-03-19           18      23.3
3        1 2018-04-03           15      23.3
4        1 2018-05-10           37      23.3
5        2 2018-01-02           NA      50.5
6        2 2018-02-10           39      50.5
7        2 2018-04-13           62      50.5
EDITED BASED ON USER COMMENTS:
Because you said that you have factors and not characters, just convert them by doing the following:
dat$Buy_date <- as.Date(as.character(dat$Buy_date))
dat$Customer <- as.character(dat$Customer)
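The zoo dependency can also be avoided with dplyr::lag(), assuming rows are sorted by Buy_date within each customer (a sketch that should give the same columns as above):
dat %>%
  group_by(Customer) %>%
  mutate(diff_between = as.numeric(Buy_date - lag(Buy_date)),
         mean_days = mean(diff_between, na.rm = TRUE))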
