Pairwise count data from long format


Example data
I have the following data:
df <- data.frame(
id = c('X1','X1','X1','X1','X2','X2','X2','X2'),
pos = c(1,2,3,4,1,2,3,4),
group = c(100,200,100,300,100,200,100,200)
)
Which thus looks like:
id pos group
1 X1 1 100
2 X1 2 200
3 X1 3 100
4 X1 4 300
5 X2 1 100
6 X2 2 200
7 X2 3 100
8 X2 4 200
What I try to achieve
I want to plot this data using geom_segment(), where pos will be on the x-axis and group on the y-axis. Then, for each of these segments, I want to count how often it is present in the dataset (based on the id column). For the example dataset the result would be:
pos1 pos2 group1 group2 id.count
1 2 100 200 2
2 3 200 100 2
3 4 100 300 1
3 4 100 200 1
I have no clue how to start with this. While I'm familiar with group_by from dplyr, I cannot figure out how to build the initial four columns.

If the ordering in your data set is as in your example you can try this:
library(dplyr)
df %>% group_by(id) %>%
transmute(pos1 = pos, pos2 = lead(pos),
group1 = group, group2 = lead(group)) %>%
na.omit() %>% ungroup()%>%
count(pos1, pos2, group1, group2, name = "id.count")
# A tibble: 4 x 5
# pos1 pos2 group1 group2 id.count
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 2 100 200 2
# 2 3 200 100 2
# 3 4 100 200 1
# 3 4 100 300 1
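If the rows are not guaranteed to be sorted by pos within each id, lead() would pair the wrong neighbours. A minimal sketch that sorts first, assuming the same df as in the question (seg is just an illustrative name):

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id = c('X1','X1','X1','X1','X2','X2','X2','X2'),
  pos = c(1,2,3,4,1,2,3,4),
  group = c(100,200,100,300,100,200,100,200)
)

seg <- df %>%
  arrange(id, pos) %>%                 # make the order explicit
  group_by(id) %>%
  mutate(pos2 = lead(pos),             # next position within the same id
         group2 = lead(group)) %>%     # group at that next position
  ungroup() %>%
  filter(!is.na(pos2)) %>%             # drop the last row of each id
  count(pos1 = pos, pos2, group1 = group, group2, name = "id.count")

seg
```

The result should be usable directly for plotting, for example with geom_segment(aes(x = pos1, xend = pos2, y = group1, yend = group2)).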

I tried the following that works, but wonder if there is a more elegant solution for this:
# Simple stats
vals <- unique(df$pos)
min.val = min(vals)
max.val = max(vals)
# Combination
comb.df <- data.frame(
pos1 = min.val:(max.val - 1),
pos2 = (min.val + 1): max.val
)
# Combine
comb.df <- comb.df %>%
left_join(df %>% select(pos1 = pos, group1 = group, id )) %>%
left_join(df %>% select(pos2 = pos, group2 = group, id ))
# Count
comb.df <- comb.df %>%
group_by(pos1, pos2, group1, group2) %>%
summarise(n.ids = n_distinct(id))

Related

R dplyr left join multiple tables without two separate columns with suffix

Suppose I have a main table x
x <- tibble(id = c(1,2,3,4,5), score = c(100,200,300,100,200))
x
# A tibble: 5 x 2
id score
<dbl> <dbl>
1 1 100
2 2 200
3 3 300
4 4 100
5 5 200
and two other tables
y = tibble(id = c(1,2), score_new=c(200,300))
y
# A tibble: 2 x 2
id score_new
<dbl> <dbl>
1 1 200
2 2 300
z = tibble(id = c(3,4), score_new = c(300,400))
z
# A tibble: 2 x 2
id score_new
<dbl> <dbl>
1 3 300
2 4 400
If I join them together it will be like this:
x %>% left_join(y, by =c("id" = "id")) %>% left_join(z, by =c("id" = "id"))
# A tibble: 5 x 4
id score score_new.x score_new.y
<dbl> <dbl> <dbl> <dbl>
1 1 100 200 NA
2 2 200 300 NA
3 3 300 NA 300
4 4 100 NA 400
5 5 200 NA NA
But I need score_new to be only one column. How do I do that? Sorry if there are already other similar questions but I really couldn't find them.

You can do that by appending y and z and then joining them.
# Loading required libraries
library(dplyr)
# Create sample df
x <- tibble(id = c(1,2,3,4,5), score = c(100,200,300,100,200))
y = tibble(id = c(1,2), score_new=c(200,300))
z = tibble(id = c(3,4), score_new = c(300,400))
x %>%
# union y and z and join on x to get new scores
left_join(union_all(y,z), by = "id")
Similarly, you can use bind_rows instead of union_all; both give the same result in this scenario.
x %>%
# union y and z and join on x to get new scores
left_join(bind_rows(y,z), by = "id")

I'm a bit late to the party. But I would opt for this tidyverse-solution,
bind_rows(
y,z
) %>% left_join(x = x)
Which gives the following output,
# A tibble: 5 x 3
id score score_new
<dbl> <dbl> <dbl>
1 1 100 200
2 2 200 300
3 3 300 300
4 4 100 400
5 5 200 NA
Note: left_join() has x and y arguments, and here I've specified x = x, where the right-hand side is your data. The piped result of bind_rows() is therefore used as y.

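You can try this approach, which merges the two suffixed columns after the joins:
x %>% left_join(y, by = "id") %>% left_join(z, by = "id") %>%
mutate(score_new = if_else(is.na(score_new.x), score_new.y, score_new.x)) %>%
select(-score_new.x, -score_new.y)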
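The same idea can also be written with dplyr's coalesce(), which takes the first non-missing value across its arguments. A small sketch, assuming the x, y and z tables from the question; the suffixes ".y" and ".z" and the name res are only chosen here for readability:

```r
library(dplyr)

x <- tibble(id = c(1,2,3,4,5), score = c(100,200,300,100,200))
y <- tibble(id = c(1,2), score_new = c(200,300))
z <- tibble(id = c(3,4), score_new = c(300,400))

res <- x %>%
  left_join(y, by = "id") %>%
  # the second join creates a name clash on score_new; label both sides
  left_join(z, by = "id", suffix = c(".y", ".z")) %>%
  # take score_new from y where present, otherwise from z
  mutate(score_new = coalesce(score_new.y, score_new.z)) %>%
  select(id, score, score_new)

res
```

This keeps working if more lookup tables are added, as long as each id appears in at most one of them.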

How to combine multiple summary tables at once

Consider the following data frame:
set.seed(123)
dat <- data.frame(Region = rep(c("a","b"), each=100),
State =rep(c("NY","MA","FL","GA"), each = 50),
Loc = rep(letters[1:20], each = 5),
ID = 1:200,
count1 = sample(4, 200, replace=T),
count2 = sample(4, 200, replace=T))
Region, State, and Loc are grouping variables for individual measurements, each of which has a unique ID number. For each grouping variable, I want to summarize the number of observations in each level of count1 and count2. Normally I would do one of the following for each pair:
#example for count1 and region:
library(tidyverse)
dat%>%
dplyr::select(Region,count1)%>%
group_by(count1,Region)%>%
count()
##or
with(dat, table(Region, count1))
How can I do this for all combinations and wrap them into a single table (or at least a few tables that are grouped by equivalent lengths, since they will differ depending on which grouping variable is being used)?

Try something like this:
Region1 <- dat %>% group_by(Region, count1) %>%
summarise(TotalRegion1 = n())
State1 <- dat %>% group_by(State, count1) %>%
summarise(TotalState1 = n())
Loc1 <- dat %>% group_by(Loc, count1) %>%
summarise(TotalLoc1 = n())

You can try to get "all at once" (for count1) with
out <- dat %>%
select(-ID, -count2) %>%
pivot_longer(Region:Loc, names_to = "k", values_to = "v") %>%
group_by(k, v, count1) %>%
tally() %>%
ungroup()
out %>%
filter(k == "Region")
# # A tibble: 8 x 4
# k v count1 n
# <chr> <fct> <int> <int>
# 1 Region a 1 26
# 2 Region a 2 27
# 3 Region a 3 20
# 4 Region a 4 27
# 5 Region b 1 20
# 6 Region b 2 30
# 7 Region b 3 30
# 8 Region b 4 20
out
# # A tibble: 101 x 4
# k v count1 n
# <chr> <fct> <int> <int>
# 1 Loc a 2 5
# 2 Loc a 3 1
# 3 Loc a 4 4
# 4 Loc b 1 2
# 5 Loc b 2 2
# 6 Loc b 3 3
# 7 Loc b 4 3
# 8 Loc c 1 2
# 9 Loc c 2 2
# 10 Loc c 3 3
# # ... with 91 more rows
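To cover count2 as well, the count columns can be pivoted in a second step so that everything ends up in one long table. This is only a sketch under the assumption that dat is built as in the question; the column names grouping, level, count_var and count_value are made up for illustration:

```r
library(dplyr)
library(tidyr)

set.seed(123)
dat <- data.frame(Region = rep(c("a","b"), each = 100),
                  State = rep(c("NY","MA","FL","GA"), each = 50),
                  Loc = rep(letters[1:20], each = 5),
                  ID = 1:200,
                  count1 = sample(4, 200, replace = TRUE),
                  count2 = sample(4, 200, replace = TRUE))

out_all <- dat %>%
  # make sure the grouping columns share one type before stacking them
  mutate(across(c(Region, State, Loc), as.character)) %>%
  pivot_longer(c(Region, State, Loc),
               names_to = "grouping", values_to = "level") %>%
  pivot_longer(c(count1, count2),
               names_to = "count_var", values_to = "count_value") %>%
  # one row per grouping variable, level, count column and count value
  count(grouping, level, count_var, count_value)

# Table for a single pair, e.g. Region by count1
filter(out_all, grouping == "Region", count_var == "count1")
```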

how to create a variable based on lm in a regular mutate in dplyr?

Consider this simple example:
library(dplyr)
library(broom)
dataframe <- tibble(id = c(1,2,3,4,5,6), # tibble() replaces the deprecated data_frame()
group = c(1,1,1,2,2,2),
value = c(200,400,120,300,100,100))
# A tibble: 6 x 3
id group value
<dbl> <dbl> <dbl>
1 1 1 200
2 2 1 400
3 3 1 120
4 4 2 300
5 5 2 100
6 6 2 100
Here I want to group by group and create two columns.
One is the number of distinct values in value (I can use dplyr::n_distinct), the other is the constant term from a regression of value on the vector 1. That is, the output of
tidy(lm(data = dataframe, value ~ 1)) %>% select(estimate)
estimate
1 203.3333
The difficulty here is combining these two simple outputs into a single mutate statement that preserves the grouping.
I tried something like:
formula1 <- function(data, myvar){
tidy(lm(data = data, myvar ~ 1)) %>% select(estimate)
}
dataframe %>% group_by(group) %>%
mutate(distinct = n_distinct(value),
mean = formula1(., value))
but this does not work. What am I missing here?
Thanks!

This approach will work if you use pull in place of select. This extracts the single estimate value from the tidy output.
formula1 <- function(data, myvar){
tidy(lm(data = data, myvar ~ 1)) %>% pull(estimate)
}
dataframe %>%
group_by(group) %>%
mutate(distinct = n_distinct(value),
mean = formula1(., value))
# A tibble: 6 x 5
# Groups: group [2]
id group value distinct mean
<dbl> <dbl> <dbl> <int> <dbl>
1 1 1 200 3 240.0000
2 2 1 400 3 240.0000
3 3 1 120 3 240.0000
4 4 2 300 2 166.6667
5 5 2 100 2 166.6667
6 6 2 100 2 166.6667
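Note that the data argument is not really used here: lm() picks up myvar, which already holds the values of the current group. A slightly simplified sketch, with the hypothetical helper intercept_only() taking just the vector, and a check that an intercept-only regression reproduces the group mean:

```r
library(dplyr)

dataframe <- tibble(id = c(1,2,3,4,5,6),
                    group = c(1,1,1,2,2,2),
                    value = c(200,400,120,300,100,100))

# Intercept of a regression of v on a constant (hypothetical helper)
intercept_only <- function(v) unname(coef(lm(v ~ 1))[1])

res <- dataframe %>%
  group_by(group) %>%
  mutate(distinct = n_distinct(value),
         intercept = intercept_only(value),  # fitted per group
         group_mean = mean(value)) %>%       # should agree with intercept
  ungroup()

res
```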

Count number of values in row using dplyr

This question should have a simple, elegant solution but I can't figure it out, so here it goes:
Let's say I have the following dataset and I want to count the number of 2s present in each row using dplyr.
set.seed(1)
ID <- LETTERS[1:5]
X1 <- sample(1:5, 5,T)
X2 <- sample(1:5, 5,T)
X3 <- sample(1:5, 5,T)
df <- data.frame(ID,X1,X2,X3)
library(dplyr)
Now, the following works:
df %>%
rowwise %>%
mutate(numtwos = sum(c(X1,X2,X3) == 2))
But how do I avoid typing out all of the column names?
I know this is probably easier to do without dplyr, but more generally I want to know how I can use dplyr's mutate with multiple columns without typing out all the column names.

Try rowSums:
> set.seed(1)
> ID <- LETTERS[1:5]
> X1 <- sample(1:5, 5,T)
> X2 <- sample(1:5, 5,T)
> X3 <- sample(1:5, 5,T)
> df <- data.frame(ID,X1,X2,X3)
> df
ID X1 X2 X3
1 A 2 5 2
2 B 2 5 1
3 C 3 4 4
4 D 5 4 2
5 E 2 1 4
> rowSums(df == 2)
[1] 2 1 0 1 1
Alternatively, with dplyr:
> df %>% mutate(numtwos = rowSums(. == 2))
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1
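With newer dplyr versions (1.0 or later is assumed here), the columns can also be selected inside mutate() with across(), which avoids both typing every name and comparing the ID column. A sketch using the values printed above rather than sample(), so the result does not depend on the random number generator:

```r
library(dplyr)

# Same values as the printed df above, entered by hand
df <- data.frame(ID = LETTERS[1:5],
                 X1 = c(2, 2, 3, 5, 2),
                 X2 = c(5, 5, 4, 4, 1),
                 X3 = c(2, 1, 4, 2, 4))

res <- df %>%
  # across() returns the selected columns; == 2 gives a logical matrix
  mutate(numtwos = rowSums(across(X1:X3) == 2))

res
```

Any tidyselect helper should work in place of X1:X3, for example starts_with("X").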

Here's another alternative using purrr:
library(purrrlyr) # by_row() has since been moved from purrr to purrrlyr
df %>%
by_row(function(x) {
sum(x[-1] == 2) },
.to = "numtwos",
.collate = "cols"
)
Which gives:
#Source: local data frame [5 x 5]
#
# ID X1 X2 X3 numtwos
# <fctr> <int> <int> <int> <int>
#1 A 2 5 2 2
#2 B 2 5 1 1
#3 C 3 4 4 0
#4 D 5 4 2 1
#5 E 2 1 4 1
As mentioned in the NEWS, row-based functionals are still maturing in dplyr:
We are still figuring out what belongs in dplyr and what belongs in
purrr. Expect much experimentation and many changes with these
functions.
Benchmark
We can see how rowwise() and do() compare to purrr::by_row() for this type of problem and how they "perform" against rowSums() and the tidy data way:
largedf <- df[rep(seq_len(nrow(df)), 10e3), ]
library(microbenchmark)
microbenchmark(
steven = largedf %>%
by_row(function(x) {
sum(x[-1] == 2) },
.to = "numtwos",
.collate = "cols"),
psidom = largedf %>%
rowwise %>%
do(data_frame(numtwos = sum(.[-1] == 2))) %>%
cbind(largedf, .),
gopala = largedf %>%
gather(key, value, -ID) %>%
group_by(ID) %>%
summarise(numtwos = sum(value == 2)) %>%
inner_join(largedf, .),
evan = largedf %>%
mutate(numtwos = rowSums(. == 2)),
times = 10L,
unit = "relative"
)
Results:
#Unit: relative
# expr min lq mean median uq max neval cld
# steven 1225.190659 1261.466936 1267.737126 1227.762573 1276.07977 1339.841636 10 b
# psidom 3677.603240 3759.402212 3726.891458 3678.717170 3728.78828 3777.425492 10 c
# gopala 2.715005 2.684599 2.638425 2.612631 2.59827 2.572972 10 a
# evan 1.000000 1.000000 1.000000 1.000000 1.00000 1.000000 10 a

Just wanted to add to the answer of @evan.oman in case you only want to sum rows for specific columns, not all of them. You can use the regular select and/or select_helpers functions. In this example, we don't want to include X1 in rowSums:
df %>%
mutate(numtwos = rowSums(select(., -X1) == 2))
ID X1 X2 X3 numtwos
1 A 2 5 2 1
2 B 2 5 1 0
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 0

One approach is to use a combination of dplyr and tidyr to convert data into long format, and do the computation:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -ID) %>%
group_by(ID) %>%
summarise(numtwos = sum(value == 2)) %>%
inner_join(df, .)
Output is as follows:
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1

You can use do, but it returns only the new column rather than adding it to your original data frame, so you need to bind it back yourself.
df %>%
rowwise %>%
do(numtwos = sum(.[-1] == 2)) %>%
data.frame
numtwos
1 2
2 1
3 0
4 1
5 1
Add a cbind to bind the new column to the original data frame:
df %>%
rowwise %>%
do(numtwos = sum(.[-1] == 2)) %>%
data.frame %>% cbind(df, .)
ID X1 X2 X3 numtwos
1 A 2 5 2 2
2 B 2 5 1 1
3 C 3 4 4 0
4 D 5 4 2 1
5 E 2 1 4 1

Operations between groups with dplyr

I have a data frame as follows, where I would like to group the data by grp and index and use group a as a reference for some simple calculations. I would like to subtract the value of group a from the value of each other group.
df <- data.frame(grp = rep(letters[1:3], each = 2),
index = rep(1:2, times = 3),
value = seq(10, 60, length.out = 6))
df
## grp index value
## 1 a 1 10
## 2 a 2 20
## 3 b 1 30
## 4 b 2 40
## 5 c 1 50
## 6 c 2 60
The desired output would be like:
## grp index value
## 1 b 1 20
## 2 b 2 20
## 3 c 1 40
## 4 c 2 40
My guess is it will be something close to:
group_by(df, grp, index) %>%
mutate(diff = value - value[grp == "a"])
Ideally I would like to do it using dplyr.
Regards, Philippe

We can filter for 'grp' that are not 'a' and then do the difference within mutate.
df %>%
filter(grp!="a") %>%
mutate(value = value- df$value[df$grp=="a"])
Or another option would be join
df %>%
filter(grp!="a") %>%
left_join(., subset(df, grp=="a", select=-1), by = "index") %>%
mutate(value = value.x- value.y) %>%
select(1, 2, 5)
# grp index value
#1 b 1 20
#2 b 2 20
#3 c 1 40
#4 c 2 40
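A variant that stays close to the attempt in the question is to group by index only, so that the reference row of group a sits in the same group as the rows it is subtracted from. A minimal sketch, assuming every index has exactly one row with grp == "a":

```r
library(dplyr)

df <- data.frame(grp = rep(letters[1:3], each = 2),
                 index = rep(1:2, times = 3),
                 value = seq(10, 60, length.out = 6))

res <- df %>%
  group_by(index) %>%
  # within each index, subtract the value of the reference group a
  mutate(value = value - value[grp == "a"]) %>%
  ungroup() %>%
  filter(grp != "a")   # drop the reference rows (all zero now)

res
```

Unlike the first option above, this does not depend on the row order of df.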
