Find overlapping rows between dataframes using dplyr?

df1 <- data_frame(time1 = c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9),
                  time2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                  id = c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j"))
df2 <- data_frame(time = sort(runif(100, 0, 10)),
                  C = rbinom(100, 1, 0.5))
For every row in df1, I want to find the rows in df2 that overlap for time, then assign the median C value for this group of df2 rows to a new column in df1. I'm sure there's some simple way to do this with dplyr's between function, but I'm new to R and haven't been able to figure it out. Thanks!

Here's a way, using merge() to do a SQL-style cross join, then using between():
library(tidyverse)

merge(df1, df2, all = TRUE) %>%
  rowwise() %>%
  mutate(time_between = between(time, time1, time2)) %>%
  filter(time_between) %>%
  group_by(time1, time2, id) %>%
  summarise(med_C = median(C))
Using filter() drops any row of df1 that has no overlapping rows in df2, so an alternative that keeps every row of df1 is:
merge(df1, df2, all = TRUE) %>%
  rowwise() %>%
  mutate(time_between = between(time, time1, time2)) %>%
  group_by(time1, time2, id) %>%
  summarise(med_C = median(ifelse(time_between, C, NA), na.rm = TRUE))

You can do this in base R with sapply:
df1$median_c <- sapply(seq_along(df1$id), function(i) {
  # note: > and < are strict bounds, unlike dplyr::between(), which includes the endpoints
  median(df2$C[df2$time > df1$time1[i] & df2$time < df1$time2[i]])
})
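Another option, if you have dplyr 1.1.0 or later: a non-equi join with join_by() avoids building the full cross join and keeps every row of df1. A minimal sketch of that approach:

library(dplyr)  # >= 1.1.0 for join_by()

df1 %>%
  left_join(df2, by = join_by(between(y$time, x$time1, x$time2))) %>%
  group_by(time1, time2, id) %>%
  summarise(med_C = median(C), .groups = "drop")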

Related

Remove Columns from a table that are 90% one value

Example Data:
A <- c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2)
B <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
C <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 0)
D <- c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)
df1 <- data.frame(A, B, C, D)
df1 %>%
  select_if(
    ### column is <90% one value
  )
So I have a table with a few columns that are predominantly one value, like C and D in the example above. I need to get rid of any column where 90% or more of the values are a single value. How can I remove the columns that meet this criterion?
We may use select() with where(): get the frequency count with table(), convert it to proportions, take the maximum, and keep a column only if that maximum is less than 0.90.
library(dplyr)

df1 <- df1 %>%
  select(where(~ max(proportions(table(.))) < .90))
data
df1 <- structure(list(A = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2), B = c("A",
"B", "C", "D", "E", "F", "G", "H", "I", "J"), C = c(1, 1, 1,
1, 1, 1, 1, 1, 1, 0), D = c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE, TRUE, FALSE)), class = "data.frame", row.names = c(NA,
-10L))
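To see why C (and, by the same logic, D) gets dropped: with nine 1s and one 0 the largest proportion is exactly 0.9, which is not strictly less than .90:

proportions(table(df1$C))
#>   0   1
#> 0.1 0.9
max(proportions(table(df1$C))) < .90
#> [1] FALSE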

Using dplyr to create an ICC table

I am trying to create a table of ICCs for multiple raters and multiple variables using a function and dplyr, but it is not working as I expected.
This is the structure of the data frame and the expected ICC table:
# Create data frame
ID <- c("r1", "r1", "r1", "r1", "r1", "r2", "r2", "r2", "r2", "r2", "r3", "r3", "r3", "r3", "r3")
V1.1 <- c(3, 3, 3, 3, 3, 3, 2, 3, 3, 1, 2, 2, 1, 1, 2)
V2.1 <- c(1, 1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 2, 1, 1, 3)
V3.1 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
V4.1 <- c(2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 1, 2)
V1.2 <- c(3, 3, 3, 3, 3, 3, 2, 3, 2, 2, 3, 2, 1, 2, 1)
V2.2 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2)
V3.2 <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
V4.2 <- c(2, 4, 2, 1, 3, 2, 1, 3, 2, 2, 3, 2, 1, 2, 1)
df <- data.frame(ID, V1.1, V2.1, V3.1, V4.1, V1.2, V2.2, V3.2, V4.2)
# Empty data frame for ICCs
ids <- c("r1", "r2", "r3")
vars <- c("V1", "V2", "V3", "V4")
icc_table <- data.frame(ID = ids)
icc_table <- cbind(icc_table, matrix(NA, nrow = length(ids), ncol = length(vars)))
names(icc_table)[2:ncol(icc_table)] <- vars
Here is the attempt to create the ICCs table with a function and dplyr:
# ICC function
icc.fun <- function(data, x1, x2){
  result <- irr::icc(subset(data, select = c(x1, x2)),
                     model = "twoway",
                     type = "agreement",
                     unit = "single")
  result$value
}
# Table attempt
icc_table <- df %>%
  pivot_longer(cols = -ID, names_to = c("criteria", ".value"),
               names_pattern = "(V\\d)\\.(\\d)") %>%
  group_by(ID, criteria) %>%
  rename("val1" = `1`, "val2" = `2`) %>%
  summarise(icc = icc.fun(df, val1, val2), .groups = "drop") %>%
  pivot_wider(id_cols = ID, names_from = criteria, values_from = icc)
However, it is not working: it returns a table full of NAs. When I tried the function on its own it seemed to work fine, so I guess it is a problem with the dplyr code. Any solution apart from dplyr is also welcome!
Thanks!
I think the issue is how the subset() call inside your icc.fun interacts with summarise(); try passing the vectors directly:
# ICC function
icc.fun <- function(x1, x2){
  result <- irr::icc(data.frame(x1, x2),
                     model = "twoway",
                     type = "agreement",
                     unit = "single")
  result$value
}
# Table attempt
icc_table <- df %>%
  pivot_longer(cols = -ID, names_to = c("criteria", ".value"),
               names_pattern = "(V\\d)\\.(\\d)") %>%
  group_by(ID, criteria) %>%
  rename("val1" = `1`, "val2" = `2`) %>%
  summarise(icc = icc.fun(val1, val2), .groups = "drop") %>%
  pivot_wider(id_cols = ID, names_from = criteria, values_from = icc)
In case it is useful for someone, here is the solution that I found:
I simplified the function by subsetting the data with base R:
# ICC function
icc.fun <- function(data, x1, x2){
  result <- irr::icc(data[, c(x1, x2)],
                     model = "twoway",
                     type = "agreement",
                     unit = "single")
  result$value
}
and I used group_modify() instead of summarise(), plus tibble::enframe():
# Create ICC table
icc_table <- df %>%
  pivot_longer(cols = -ID, names_to = c("criteria", ".value"),
               names_pattern = "(V\\d)\\.(\\d)") %>%
  group_by(ID, criteria) %>%
  rename("val1" = `1`, "val2" = `2`) %>%
  group_modify(~ {
    icc.fun(.x, "val1", "val2") %>%
      tibble::enframe(name = "variable", value = "icc")
  }) %>%
  pivot_wider(id_cols = ID, names_from = criteria, values_from = icc)
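As a quick check, the simplified function can also be called directly for a single rater and variable, for example rater r1 on V1 (columns V1.1 and V1.2):

# ICC for rater r1, variable V1
icc.fun(df[df$ID == "r1", ], "V1.1", "V1.2")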

How do you create a grouped barplot in R from only certain columns?

I have a data frame that looks like
Role <- letters[1:3]
df <- data.frame(Role,
                 Female1 = c(1, 4, 2),
                 Male1 = c(3, 0, 0),
                 Female2 = c(3, 5, 3),
                 Male2 = c(1, 3, 0))
df$FemaleTotal <- df$Female1 + df$Female2
df$MaleTotal <- df$Male1 + df$Male2
I want to create a barplot grouped by Male/Female for each column category (1 and 2 in this example), stacked by Role, and also another plot with just the totals. For just the totals I could use melt() and subset the data frame to only those columns, but that seems messy and doesn't help with the main plot I want to make.
An option would be to reshape to 'long' format
library(dplyr)
library(tidyr)
library(ggplot2)
df %>%
  pivot_longer(cols = -Role, names_to = c("group", ".value"),
               names_sep = "(?<=[a-z])(?=(\\d+|Total))") %>%
  pivot_longer(-c(Role, group)) %>%
  ggplot(aes(x = Role, y = value, fill = group)) +
  geom_col() +
  facet_wrap(~ name)
data
df <- structure(list(Role = c("a", "b", "c"), Female1 = c(1, 4, 2),
Male1 = c(3, 0, 0), Female2 = c(3, 5, 3), Male2 = c(1, 3,
0), FemaleTotal = c(4, 9, 5), MaleTotal = c(4, 3, 0)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
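For the separate totals-only plot mentioned in the question, the same reshape can be reused and the Total column plotted directly; a minimal sketch building on the pipeline above:

df %>%
  pivot_longer(cols = -Role, names_to = c("group", ".value"),
               names_sep = "(?<=[a-z])(?=(\\d+|Total))") %>%
  ggplot(aes(x = Role, y = Total, fill = group)) +
  geom_col()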

How to apply a function to a data.table subset by multiple columns in R?

I have a data table with counts for changes for multiple groups. For example:
library(data.table)

input <- data.table(from = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"),
                    to = c(letters[1:6], letters[1:6]),
                    from_N = c(100, 100, 100, 50, 50, 50, 60, 60, 60, 80, 80, 80),
                    to_N = c(10, 20, 40, 5, 5, 15, 10, 5, 10, 20, 5, 10),
                    group = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2))
How can I calculate the total for each change across groups? I can do this using a for loop, for example:
out <- list()
for (i in 1:length(unique(input$from))){
  sub <- input[from == unique(input$from)[i]]
  out2 <- list()
  for (j in 1:length(unique(sub$to))){
    sub2 <- sub[to == unique(sub$to)[j]]
    out2[[j]] <- data.table(from = sub2$from[1],
                            to = sub2$to[1],
                            from_N = sum(sub2$from_N),
                            to_N = sum(sub2$to_N))
    print(unique(sub$to)[j])
  }
  out[[i]] <- do.call("rbind", out2)
  print(unique(input$from)[i])
}
output <- do.call("rbind", out)
However, the data table I need to apply this to is very large, and I therefore need to maximise performance. Is there a data.table method? Any help will be greatly appreciated!
Perhaps I've overlooked something, but it seems you're just after:
library(data.table)
setDT(input)[, .(from_N = sum(from_N), to_N = sum(to_N)), by = .(from, to)]
Output:
   from to from_N to_N
1:    A  a    160   20
2:    A  b    160   25
3:    A  c    160   50
4:    B  d    130   25
5:    B  e    130   10
6:    B  f    130   25
An option with dplyr
library(dplyr)
input %>%
  group_by(from, to) %>%
  summarise_at(vars(ends_with('_N')), sum)
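summarise_at() is superseded in current dplyr; an equivalent with across() (assuming dplyr >= 1.0.0) would be

input %>%
  group_by(from, to) %>%
  summarise(across(ends_with('_N'), sum), .groups = 'drop')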
Or in data.table
library(data.table)
setDT(input)[, lapply(.SD, sum), by = .(from, to), .SDcols = patterns('_N$')]

Plotting based on occurrence in group

I would like to make a bar chart where each bar shows the proportion of groups the value occurs in, rather than the usual count or percentage. For a var to "count", it only needs to occur at least once in a group. For example, in this df where id is the grouping variable:
df <-
  tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
         vars = c("a", NA, "b", "c", "d", "e", "a", "a", "a"))
The bars would be:
a = 2/3 # since a occurs in 2 out of 3 groups
b = 1/3
c = 1/3
d = 1/3
e = 1/3
If I understand you correctly, a one-liner would suffice:
ggplot(distinct(df)) + geom_bar(aes(vars, stat(count) / n_distinct(df$id)))
Working answer:
tibble(id = c(rep(1, 3), rep(2, 3), rep(3, 3)),
       vars = c("a", "a", "b", "c", "d", "e", "a", "a", "a")) %>%
  group_by(id) %>%
  distinct(vars) %>%
  ungroup() %>%
  add_count(vars) %>%
  mutate(prop = n / n_distinct(id)) %>%
  distinct(vars, .keep_all = TRUE) %>%
  ggplot(aes(vars, prop)) +
  geom_col()
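The original df has an NA in vars (the working answer replaced it with "a"). If missing values should not get their own bar, they can be dropped first; a small tweak, assuming that is the intent:

df %>%
  filter(!is.na(vars)) %>%
  distinct(id, vars) %>%
  add_count(vars) %>%
  mutate(prop = n / n_distinct(df$id)) %>%
  distinct(vars, .keep_all = TRUE) %>%
  ggplot(aes(vars, prop)) +
  geom_col()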
