Pivot and transform df in dplyr - r

my input:
df<-data.frame("frame"=c(1,2,3,4,5,6,7,8,9,10),
"label_x"=c("AO","Other","AO","GS","GS","RF","RF","TI",NA,"Other"),
"label_y"=c("AO","RF","RF", "GS","GS","Other","Other","TI","AO","RF"),
"cross"=c("Matched","Mismatched", "Mismatched","Matcehed","Matched"
,"Mismatched", "Mismatched","Mismatched","Mismatched","Mismatched") )
I want to count all "Matched"/"Mismatched" values from column cross per label, across both label_x and label_y. I tried this code for each of the label_ columns:
df %>% filter(!is.na(label_y)) %>% group_by(label_y) %>% count(cross)
but it doesn't answer my question; after that I would still need to sum the counts across the two columns.
So I expect something like this...:
label  Mismatching  Matching  Total
AO     5            7         13
RF     3            4         7

One way to do it:
library(dplyr)
library(tidyr)

df %>%
  pivot_longer(cols = c(label_x, label_y), values_to = "label") %>%
  group_by(label) %>%
  count(cross) %>%
  pivot_wider(values_from = n, names_from = cross, values_fill = 0) %>%
  mutate(total = Matched + Mismatched)
Result tibble:
# A tibble: 6 x 4
# Groups:   label [6]
  label Matched Mismatched total
  <chr>   <int>      <int> <int>
1 AO          2          2     4
2 GS          4          0     4
3 Other       0          4     4
4 RF          0          5     5
5 TI          0          2     2
6 NA          0          1     1
However, keep in mind that these counts are overestimated because both label_x and label_y have been used. Could you show a result table with the real labels and numbers you expect?
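If the intention is for each frame to contribute a given label only once (an assumption; the question does not say), dropping duplicate frame/label pairs before counting avoids that double count. A minimal sketch:
df %>%
  pivot_longer(cols = c(label_x, label_y), values_to = "label") %>%
  filter(!is.na(label)) %>%
  distinct(frame, label, cross) %>%   # each frame counts a label at most once
  count(label, cross) %>%
  pivot_wider(names_from = cross, values_from = n, values_fill = 0) %>%
  mutate(total = Matched + Mismatched)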

Using table:
table(data.frame(label = unlist(df[, c("label_x", "label_y")]),
                 cross = df$cross))
#       cross
#label   Matched Mismatched
#  AO          2          2
#  GS          4          0
#  Other       0          4
#  RF          0          5
#  TI          0          2
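If a per-label total column like the one in the expected output is also needed, base R's addmargins() can append it (a small sketch building on the table above):
tab <- table(data.frame(label = unlist(df[, c("label_x", "label_y")]),
                        cross = df$cross))
addmargins(tab, margin = 2)   # adds a "Sum" column holding the per-label totals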

Related

Average across rows and sum across columns if condition is met in R dataframe

I have an R dataframe that looks like this:
chr        bp instances_1 instances_2 instances_sum
  1 143926410           0           1             1
  1 144075771           1           0             1
  1 187762696           0           2             2
  1 187783844           2           0             2
  2 121596288           0           1             1
  2 122042325           3           0             3
  2 259939985           1           0             1
  2 259991389           0           1             1
What I would like to do is group by 'chr', determine whether two rows are within 1e7 base pairs ('bp') of one another, and if they are, keep the rounded average of 'bp' and the sum of all other columns for the rows that met the condition. So, the final product would look like:
chr        bp instances_1 instances_2 instances_sum
  1 144001091           1           1             2
  1 187773270           2           2             4
  2 121819307           3           1             4
  2 259965687           1           1             2
I tried to adapt the following code (using tidyverse), which I had used for a similar task across multiple columns:
df_Pruned <- df |>
  group_by(chr_snp1, chr_snp2) |>
  mutate(grp = (abs(loc_snp1 - lag(loc_snp1, default = first(loc_snp1))) < 1e7) &
               (abs(loc_snp2 - lag(loc_snp2, default = first(loc_snp2))) < 1e7)) |>
  group_by(grp, .add = TRUE) |>
  filter(pval == min(pval)) |>
  ungroup() |>
  select(-grp)
into this by trying to do the same over one grouping variable ('chr') and by trying to average and sum at the same time:
df_Pruned <- df |>
  group_by(chr) |>
  mutate(grp = abs(bp - lag(bp, default = first(bp))) < 1e7) |>
  group_by(grp, .add = TRUE) |>
  filter(bp == mean(bp) & instances_sum == sum(instances_sum)) |>
  ungroup() |>
  select(-grp)
But I can't get it to work. I think I'm close but could use some help.
Using cumsum with the lag condition produces your expected output:
library(dplyr)

df |>
  mutate(grp = cumsum(abs(bp - lag(bp, default = first(bp))) > 1e7)) |>
  group_by(chr, grp) |>
  summarise(bp = mean(bp),
            across(starts_with("instance"), sum),
            .groups = "drop")
# A tibble: 4 × 6
    chr   grp         bp instances_1 instances_2 instances_sum
  <int> <int>      <dbl>       <int>       <int>         <int>
1     1     0 144001090.           1           1             2
2     1     1  187773270           2           2             4
3     2     2 121819306.           3           1             4
4     2     3  259965687           1           1             2
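Since the question also asks for the averaged bp to be rounded, one small tweak of the same pipeline (assuming a plain round() to the nearest integer is what is wanted):
df |>
  mutate(grp = cumsum(abs(bp - lag(bp, default = first(bp))) > 1e7)) |>
  group_by(chr, grp) |>
  summarise(bp = round(mean(bp)),           # round the averaged position
            across(starts_with("instance"), sum),
            .groups = "drop")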

How to run Excel-like formulas using dplyr?

In the reproducible R code below, I'd like to add a column "adjust" that results from a series of calculations that in Excel would use cumulative COUNTIFS, MAX, and MATCH formulas, as shown in the illustration. (Strictly, the adjust column should also use MATCH, since there could be more than one Element in the list starting in row 15, but I think what I'm doing is clear without it.) The yellow shading shows what the reproducible code generates, and the blue shading shows my series of Excel calculations that derive the desired values in the "adjust" column. Any suggestions for doing this, in dplyr if possible?
I am a long-time Excel user trying to migrate all of my work to R.
Reproducible code:
library(dplyr)

myData <- data.frame(
  Element = c("A","B","B","B","B","B","B","B"),
  Group   = c(0,1,1,1,2,2,3,3)
)

myDataGroups <- myData %>%
  mutate(origOrder = row_number()) %>%
  group_by(Element) %>%
  mutate(ElementCnt = row_number()) %>%
  ungroup() %>%
  mutate(Group = factor(Group, unique(Group))) %>%
  arrange(Group) %>%
  mutate(groupCt = cumsum(Group != lag(Group, 1, Group[[1]])) - 1L) %>%
  as.data.frame()
myDataGroups
We may use data.table::rowid() to get the within-(Element, Group) sequence and overwrite 'Group' with it, then turn Group > 2 into a binary 'excessOver2' column, and finally take the lag of its cumulative sum as 'adjust'.
library(dplyr)
library(data.table)

myDataGroups %>%
  mutate(Group = rowid(Element, Group),
         excessOver2 = +(Group > 2),
         adjust = lag(cumsum(excessOver2), default = 0))
Output:
  Element Group origOrder ElementCnt groupCt excessOver2 adjust
1       A     1         1          1      -1           0      0
2       B     1         2          1       0           0      0
3       B     2         3          2       0           0      0
4       B     3         4          3       0           1      0
5       B     1         5          4       1           0      1
6       B     2         6          5       1           0      1
7       B     1         7          6       2           0      1
8       B     2         8          7       2           0      1
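For reference, the rowid() step can also be reproduced with dplyr alone, in case data.table is not available; a sketch (GroupSeq is just an illustrative column name):
myDataGroups %>%
  group_by(Element, Group) %>%
  mutate(GroupSeq = row_number()) %>%   # same sequence as data.table::rowid(Element, Group)
  ungroup() %>%
  mutate(excessOver2 = +(GroupSeq > 2),
         adjust = lag(cumsum(excessOver2), default = 0))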
A different approach that computes the running count within summarize():
library(dplyr)

myData %>%
  group_by(Element, Group) %>%
  summarize(ElementCnt = row_number(),
            over2 = 1 * (ElementCnt > 2),
            .groups = "drop_last") %>%
  mutate(adjust = cumsum(lag(over2, default = 0))) %>%
  ungroup()
Result:
# A tibble: 8 × 5
  Element Group ElementCnt over2 adjust
  <chr>   <dbl>      <int> <dbl>  <dbl>
1 A           0          1     0      0
2 B           1          1     0      0
3 B           1          2     0      0
4 B           1          3     1      0
5 B           2          1     0      1
6 B           2          2     0      1
7 B           3          1     0      1
8 B           3          2     0      1

Efficient way to add sample information as new column to data set

I know how I can subset a data frame by sampling certain rows. However, I'm struggling to find an easy (preferably tidyverse) way to simply ADD the sampling information as a new column to my data set, i.e. I want to populate a new column with 1 if the row was sampled and 0 if not.
I currently have this, but it feels overly complicated. Note that in the example I want to sample 3 rows per group.
df <- data.frame(group = c(1,2,1,2,1,1,1,1,2,2,2,2,2,1,1),
                 var   = 1:15)

library(tidyverse)
df <- df %>%
  group_by(group) %>%
  mutate(sampling_info = sample.int(n(), size = n(), replace = FALSE),
         sampling_info = if_else(sampling_info <= 3, 1, 0))
You can try:
library(dplyr)

set.seed(123)
df %>%
  arrange(group) %>%
  group_by(group) %>%
  mutate(sampling_info = as.integer(row_number() %in% sample(n(), size = 3))) %>%
  ungroup()
#    group   var sampling_info
#    <dbl> <int>         <int>
# 1      1     1             0
# 2      1     3             0
# 3      1     5             1
# 4      1     6             0
# 5      1     7             0
# 6      1     8             0
# 7      1    14             1
# 8      1    15             1
# 9      2     2             0
#10      2     4             1
#11      2     9             1
#12      2    10             0
#13      2    11             0
#14      2    12             1
#15      2    13             0
sample(n(), size = 3) generates 3 random row numbers for each group, and we assign 1 to those row numbers.
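As a quick sanity check (assuming the result above is stored back in df), summing the flag per group should give 3 for every group:
df %>% count(group, wt = sampling_info)
# every group should show n = 3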

Get number of ties ego and alter have in common in R

I have a directed network dataset of adolescent friendships. I'd like to make an edgelist that includes the number of friends ego has in common with alter (someone ego and alter both nominated as a friend). Below is some sample data:
HAVE DATA:
id alter
1 3
1 5
1 9
2 3
2 5
3 2
3 5
3 9
3 6
WANT DATA:
id alter num_common
1 3 2
1 5 0
1 9 0
2 3 1
2 5 0
3 2 1
3 5 0
3 9 0
3 6 0
A solution could be to transform the edgelist into an adjacency matrix (using the igraph package) and multiply it by its transpose to count the number of shared neighbors:
library(igraph)

el <- read.table(text = "id alter
1 3
1 5
1 9
2 3
2 5
3 2
3 5
3 9
3 6", header = TRUE)

g  <- graph_from_edgelist(as.matrix(el), directed = TRUE)
m  <- get.adjacency(g, sparse = FALSE)
m2 <- m %*% t(m)
Afterwards transform the resulting matrix back to an edgelist and merge it with the original data set:
el2 <- reshape2::melt(m2)
dplyr::left_join(el, el2, by = c("id" = "Var1", "alter" = "Var2"))
  id alter value
1  1     3     2
2  1     5     0
3  1     9     0
4  2     3     1
5  2     5     0
6  3     2     1
7  3     5     0
8  3     9     0
9  3     6     0
To see how often ego and alter were both nominated by the same friend, change the direction of the relation by using t(m) %*% m instead of m %*% t(m). To ignore direction, set the directed argument to FALSE in the graph_from_edgelist function.
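A small sketch of that in-nomination variant, reusing m and el from above (only the matrix product changes; the melt-and-join step stays the same):
m2_in  <- t(m) %*% m                    # common in-neighbours: both nominated by the same friend
el2_in <- reshape2::melt(m2_in)
dplyr::left_join(el, el2_in, by = c("id" = "Var1", "alter" = "Var2"))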
This is a possible, though not very simple, solution:
# your dummy data
df <- data.table::fread("id alter
1 3
1 5
1 9
2 3
2 5
3 2
3 5
3 9
3 6")
library(dplyr)
library(tidyr)

# all pairs vertically with pair ID
pairs_v <- combn(unique(c(df$id, df$alter)), 2) %>%
  dplyr::as_tibble() %>%
  tidyr::pivot_longer(cols = everything()) %>%
  dplyr::arrange(name)

# number of common friends per pair ID
pairs_comp <- pairs_v %>%
  dplyr::left_join(df, by = c("value" = "id")) %>%
  dplyr::count(name, alter) %>%
  dplyr::filter(n > 1 & !is.na(alter)) %>%
  dplyr::count(name)

# all pairs horizontally with pair ID
pairs_h <- pairs_v %>%
  dplyr::group_by(name) %>%
  dplyr::mutate(G_ID = dplyr::row_number()) %>%
  tidyr::pivot_wider(names_from = G_ID, values_from = "value")

# multiple left joins to get the common-friend count for each direction of the combination
df %>%
  dplyr::left_join(pairs_h, by = c("id" = "1", "alter" = "2")) %>%
  dplyr::left_join(pairs_comp) %>%
  dplyr::left_join(pairs_h, by = c("id" = "2", "alter" = "1")) %>%
  dplyr::left_join(pairs_comp, by = c("name.y" = "name")) %>%
  dplyr::mutate(num_common = case_when(!is.na(n.x) ~ as.numeric(n.x),
                                       !is.na(n.y) ~ as.numeric(n.y),
                                       TRUE ~ 0)) %>%
  dplyr::select(id, alter, num_common)
   id alter num_common
1:  1     3          2
2:  1     5          0
3:  1     9          0
4:  2     3          1
5:  2     5          0
6:  3     2          1
7:  3     5          0
8:  3     9          0
9:  3     6          0

Computing minimum distance between a row and all previous rows in R

I want to compute the minimum distance between the current row and every row before it within each group. My data frame has several groups, and each group has multiple dates with longitude and latitude. I use a Haversine function to compute distance, and I need to apply this function as described above. The data frame looks like the following:
  grp       date long lat rowid
1   1 1995-07-01   11  12     1
2   1 1995-07-05    3   0     2
3   1 1995-07-09   13   4     3
4   1 1995-07-13    4  25     4
5   2 1995-03-07   12   6     1
6   2 1995-03-10    3  27     2
7   2 1995-03-13   34   8     3
8   2 1995-03-16   25   9     4
My current attempt uses purrrlyr::by_row, but the method is too slow. In practice, each group has thousands of dates and geographic positions. Here is part of my current attempt:
calc_min_distance <- function(df, grp.name, row){
  df %>%
    filter(group_name == grp.name) %>%
    filter(row_number() <= row) %>%
    mutate(last.lat  = last(lat),
           last.long = last(long),
           rowid     = 1:n()) %>%
    group_by(rowid) %>%
    purrrlyr::by_row(~haversinedistance.fnct(.$last.long, .$last.lat, .$long, .$lat),
                     .collate = 'rows',
                     .to = 'min.distance') %>%
    filter(row_number() < n()) %>%
    summarise(min = min(min.distance)) %>%
    .$min
}
df_dist <- df %>%
  group_by(grp_name) %>%
  mutate(rowid = 1:n()) %>%
  group_by(grp_name, rowid) %>%
  purrrlyr::by_row(~calc_min_distance(df, .$grp_name, .$rowid),
                   .collate = 'rows',
                   .to = 'min.distance') %>%
  ungroup() %>%
  select(-rowid)
For illustration, suppose that distance is defined as (lat + long) of the reference row minus (lat + long) of each row before the reference row. My expected output for grp 1 is the following:
  grp       date long lat rowid min.distance
1   1 1995-07-01   11  12     1            0
2   1 1995-07-05    3   0     2          -20
3   1 1995-07-09   13   4     3           -6
4   1 1995-07-13    4  25     4            6
How can I quickly compute the minimum distance between the current rowid and all rowids before it?
Here's how I would go about it. You need to calculate all the within-group pairwise distances anyway, so we'll use geosphere::distm, which is designed to do just that. I'd suggest stepping through the function line by line and looking at what it does; I think it will make sense.
library(geosphere)

find_min_dist_above = function(long, lat, fun = distHaversine) {
  # all pairwise distances within the group
  d = distm(x = cbind(long, lat), fun = fun)
  # keep only distances to earlier rows (upper triangle, read column-wise)
  d[lower.tri(d, diag = TRUE)] = NA
  # the first row has no earlier rows; define its minimum as 0
  d[1, 1] = 0
  # column-wise minimum = minimum distance to any row above
  return(apply(d, MAR = 2, min, na.rm = TRUE))
}
library(dplyr)
df %>%
  group_by(grp) %>%
  mutate(min.distance = find_min_dist_above(long, lat))
# # A tibble: 8 x 6
# # Groups:   grp [2]
#     grp date        long   lat rowid min.distance
#   <int> <fct>      <int> <int> <int>        <dbl>
# 1     1 1995-07-01    11    12     1           0
# 2     1 1995-07-05     3     0     2     1601842.
# 3     1 1995-07-09    13     4     3      917395.
# 4     1 1995-07-13     4    25     4     1623922.
# 5     2 1995-03-07    12     6     1           0
# 6     2 1995-03-10     3    27     2     2524759.
# 7     2 1995-03-13    34     8     3     2440596.
# 8     2 1995-03-16    25     9     4      997069.
Using this data:
df = read.table(text = ' grp date long lat rowid
1 1 1995-07-01 11 12 1
2 1 1995-07-05 3 0 2
3 1 1995-07-09 13 4 3
4 1 1995-07-13 4 25 4
5 2 1995-03-07 12 6 1
6 2 1995-03-10 3 27 2
7 2 1995-03-13 34 8 3
8 2 1995-03-16 25 9 4', h = TRUE)
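If instead the simplified (lat + long) metric from the question is wanted, the same "minimum over all previous rows" idea carries over; a sketch using outer() (find_min_simple_above is just an illustrative name):
find_min_simple_above <- function(long, lat) {
  s <- lat + long
  # d[i, j] = s[j] - s[i]: reference row j minus each earlier row i
  d <- outer(s, s, function(x, y) y - x)
  d[lower.tri(d, diag = TRUE)] <- NA
  d[1, 1] <- 0                     # first row has no previous rows
  apply(d, 2, min, na.rm = TRUE)
}
df %>%
  group_by(grp) %>%
  mutate(min.distance = find_min_simple_above(long, lat))
# grp 1 gives 0, -20, -6, 6, matching the expected output above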
