Visualization for Association of 2 binary variables - r

What would be a good visualization in R to show the association between 2 binary variables?
I understand that the phi coefficient would be the best statistic to use, but how can I show it graphically? A scatterplot would be very condensed, since there are only 4 possible combinations of values.

One idea would be to create a mosaic plot from the contingency table of the two binary variables.
Let's assume our data looks like this:
   var1 var2
1     1    1
2     0    0
3     1    1
4     0    0
5     1    1
6     1    1
7     0    0
8     0    1
9     0    1
10    1    0
We could visualize it in the following way:
mosaicplot(table(df))
Data
df <- structure(list(var1 = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1), var2 = c(1,
0, 1, 0, 1, 1, 0, 1, 1, 0)), .Names = c("var1", "var2"), row.names = c(NA,
-10L), class = "data.frame")
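To tie the plot back to the phi coefficient mentioned in the question, here is a minimal base R sketch that computes phi from the 2x2 table and puts it in the plot title. The variable names `tab` and `phi` and the title format are my own choices, not part of the answer above; for two 0/1 variables phi coincides with the Pearson correlation, which the last line uses as a sanity check.

```r
# Sample data from the answer above
df <- data.frame(
  var1 = c(1, 0, 1, 0, 1, 1, 0, 0, 0, 1),
  var2 = c(1, 0, 1, 0, 1, 1, 0, 1, 1, 0)
)

# 2x2 contingency table (rows: var1, columns: var2)
tab <- table(df)

# Cell counts, converted to plain numbers
n11 <- as.numeric(tab["1", "1"])
n00 <- as.numeric(tab["0", "0"])
n10 <- as.numeric(tab["1", "0"])
n01 <- as.numeric(tab["0", "1"])

# phi = (n11 * n00 - n10 * n01) / sqrt(product of all four margins)
phi <- (n11 * n00 - n10 * n01) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))

# Mosaic plot with the statistic in the title
mosaicplot(tab, main = sprintf("phi = %.2f", phi), color = TRUE)

# Sanity check: for binary variables phi equals the Pearson correlation
all.equal(phi, cor(df$var1, df$var2))
```

With only 10 rows the tile areas are coarse, but the same code should work unchanged on a larger data frame with two 0/1 columns.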

Related

How to get the total sum up until condition, including the row where the condition changes

I'm trying to compute the total sum of a column (y) based on another column (x), including the row where the condition changes. Let's say I have this sample:
Test = structure(list(x = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1),
y = c(3, 4, 5, 6, 2, 4, 8, 9, 11, 57, 14, 21, 1)), row.names = c(NA,
-13L), class = c("tbl_df", "tbl", "data.frame"))
My end goal is something similar to this :
I found this alternative solution, but it works for cumulative sums, which is not exactly what I'm looking for:
R how to cumulative sums up until condition, including the row where the condition changes
Thanks in advance.
library(dplyr)
Test %>%
group_by(cu = cumsum(lag(x, default = 0) == 1)) %>%
mutate(z = ifelse(x == 1, sum(y), NA)) %>%
ungroup() %>%
select(-cu)
# A tibble: 13 × 3
x y z
<dbl> <dbl> <dbl>
1 0 3 NA
2 0 4 NA
3 0 5 NA
4 0 6 NA
5 0 2 NA
6 1 4 24
7 0 8 NA
8 0 9 NA
9 1 11 28
10 1 57 57
11 0 14 NA
12 0 21 NA
13 1 1 36
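If you would rather avoid dplyr, the same grouping idea can be sketched in base R. This is only an illustration of the approach above, not a different method: the helper name `grp` is my own, and `ave()` stands in for `group_by()` plus `sum()`.

```r
# Sample data from the question, as a plain data frame
Test <- data.frame(
  x = c(0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  y = c(3, 4, 5, 6, 2, 4, 8, 9, 11, 57, 14, 21, 1)
)

# A new group starts on the row after each x == 1
# (same logic as cumsum(lag(x, default = 0) == 1))
grp <- cumsum(c(0, head(Test$x, -1)) == 1)

# Group totals of y, kept only on the rows where x == 1
Test$z <- ifelse(Test$x == 1, ave(Test$y, grp, FUN = sum), NA)

Test
```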

How do you make a new factor column based on other columns in r?

I have a data set that looks like this
ID Group 1 Group 2 Group 3 Group 4
1 1 0 1 0
2 0 1 1 1
3 1 1 0 0
.
.
.
100 0 1 0 1
I want to make another column, let's say Group 5, where if Group 1 is 1 then Group 5 would be 1. If Group 2 = 1, then Group 5 = 2. If Group 3 = 1, then Group 5 = 3, and if Group 4 = 1, then Group 5 = 4. How do I do this?
I tried these lines of code, but I seem to be missing something.
Group5 <- data.frame(Group1, Group2, Group3, Group4, stringsAsFactors=FALSE)
df$Group5 <- with(finalmerge, ifelse(Group1 %in% c("1", "0"),
"1", ""))
Any advice would be helpful, thanks in advance.
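You could use which.max() and apply it to each row. Note that which.max() returns the position of the first maximum, so a row with several 1s gets the lowest-numbered group.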
df["Group_5"] <- apply(df[, -1], 1, which.max)
Output:
  ID Group_1 Group_2 Group_3 Group_4 Group_5
1  1       0       0       0       1       4
2  2       0       1       0       0       2
3  3       0       0       1       0       3
4  4       1       0       0       0       1
Input:
df = structure(list(ID = c(1, 2, 3, 4), Group_1 = c(0, 0, 0, 1), Group_2 = c(0,
1, 0, 0), Group_3 = c(0, 0, 1, 0), Group_4 = c(1, 0, 0, 0)), class = "data.frame", row.names = c(NA,
-4L))
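As a possible alternative, base R's max.col() does the row-wise lookup without apply(). This is a sketch on the input above; choosing ties.method = "first" is my assumption about how rows with several 1s should be resolved, and wrapping the result in factor() reflects the question title rather than the answer.

```r
# Input data from the answer above
df <- data.frame(
  ID      = c(1, 2, 3, 4),
  Group_1 = c(0, 0, 0, 1),
  Group_2 = c(0, 1, 0, 0),
  Group_3 = c(0, 0, 1, 0),
  Group_4 = c(1, 0, 0, 0)
)

# Column index of the row maximum, ignoring the ID column;
# ties are broken in favour of the first (lowest-numbered) group
idx <- max.col(as.matrix(df[, -1]), ties.method = "first")

# Store as a factor, since the question asks for a factor column
df$Group_5 <- factor(idx, levels = 1:4)

df
```

Without ties.method = "first", max.col() breaks ties at random, which would make the result non-reproducible for rows with more than one 1.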

Subtracting each column from its previous one in a data frame

I have a very simple case here in which I would like to subtract from each column the one before it. In other words, I am looking for a sliding subtraction: the first column stays as is, then the first column is subtracted from the second, the second from the third, and so on until the last column.
here is my sample data set:
structure(list(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0,
1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))
and my desired output:
structure(list(x = c(1, 0, 0, 0), y = c(0, 0, 1, 1), z = c(-1,
1, 0, 0)), class = "data.frame", row.names = c(NA, -4L))
I am personally looking for a solution with purrr family of functions. I also thought about slider but I'm not quite familiar with the latter one. So I would appreciate any help and idea with these two packages in advance. Thank you very much.
A simple dplyr-only solution:
cur_data() inside mutate()/summarise() returns a copy of the current data, so
just subtract cur_data()[-ncol(.)] from cur_data()[-1]. (Note that cur_data() is deprecated since dplyr 1.1.0 in favour of pick().)
With pmap_dfr() you can do something similar.
df <- structure(list(x = c(1, 0, 0, 0), y = c(1, 0, 1, 1), z = c(0,
1, 1, 1)), class = "data.frame", row.names = c(NA, -4L))
library(dplyr)
df %>%
mutate(cur_data()[-1] - cur_data()[-ncol(.)])
#> x y z
#> 1 1 0 -1
#> 2 0 0 1
#> 3 0 1 0
#> 4 0 1 0
Similarly, with purrr:
library(purrr)
pmap_dfr(df, ~c(c(...)[1], c(...)[-1] - c(...)[-ncol(df)]))
I think you are looking for pmap_df with lag to subtract the previous value.
library(purrr)
library(dplyr)
pmap_df(df, ~{x <- c(...);x - lag(x, default = 0)})
# A tibble: 4 x 3
# x y z
# <dbl> <dbl> <dbl>
#1 1 0 -1
#2 0 0 1
#3 0 1 0
#4 0 1 0
Verbose, but simple:
df %>%
select(x) %>%
bind_cols(df %>%
select(-1) %>%
map2_dfc(df %>%
select(-ncol(df)), ~.x -.y))
# x y z
#1 1 0 -1
#2 0 0 1
#3 0 1 0
#4 0 1 0
We can just do this in base R (no packages needed), with df1 holding the sample data:
cbind(df1[1], df1[-1] - df1[-ncol(df1)])
Output:
x y z
1 1 0 -1
2 0 0 1
3 0 1 0
4 0 1 0
Or using dplyr
library(dplyr)
df1 %>%
mutate(.[-1] - .[-ncol(.)])
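Another way to read the "sliding subtraction" is as a row-wise diff(). The following base R sketch is an assumption-level illustration rather than one of the answers above; the name `res` is hypothetical.

```r
# Sample data from the question
df <- data.frame(
  x = c(1, 0, 0, 0),
  y = c(1, 0, 1, 1),
  z = c(0, 1, 1, 1)
)

# For each row: keep the first value, then take successive differences
# (y - x, z - y). apply() returns columns per row, so transpose back.
res <- as.data.frame(t(apply(df, 1, function(r) c(r[1], diff(r)))))

res
```

This converts the data frame to a matrix internally, so it is only appropriate when all columns are numeric.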

Is there an R function for combining two replicate site columns in a table to show presence absence of species?

I have the following DF (example data, my actual data set is 96 columns):
class X1A X1B X2A X2B X3A X3B X4A X4B X5A X5B X6A X6B
1 A 0 1 0 0 0 0 0 1 1 1 1 1
2 B 1 1 1 1 0 0 0 1 1 1 0 1
3 C 0 0 0 1 1 0 0 0 1 1 0 0
4 D 0 0 0 1 1 0 1 0 1 0 0 0
5 A 0 1 1 1 0 0 0 1 1 1 1 1
6 B 0 0 1 1 0 0 0 1 1 1 0 1
7 C 0 0 0 1 1 0 0 0 1 1 0 0
8 D 0 0 0 1 1 0 1 0 1 0 0 0
9 A 0 1 1 1 0 0 0 1 1 1 1 1
10 B 1 1 1 1 0 0 0 1 1 1 0 1
11 C 0 0 0 1 1 0 0 0 1 1 0 0
12 D 0 1 0 1 1 0 1 0 1 0 0 0
Class denotes the phylogenetic class of the organism (each repeat of a letter is a different species, but a member of the same class). 1A and 1B are samples from the same site. I want to combine the presence/absence data (1/0 respectively) from the two samples at each site and add up the number of "presences" per class across that site, so that my df looks something like this:
Sample Class Number of Species Present
1 A 3
1 B 2
1 C 0
1 D 1
2 A 2
2 B 3
2 C 3
2 D 3
For example, in the original df you see that Class C species are not present in sample 2A at all, but each species of Class C is present in sample 2B. So the output df records Class C as present 3 times in sample 2. Furthermore, Class B has 3 different species occurring in both 2A and 2B, but because these are replicates, the output df records sample 2 as having 3 Class B species present.
Any help would be appreciated as I'm stumped!
Cheers!!
You just need to format your initial df a bit (since your colnames actually contain more information than just being a "name").
library(tidyverse)
d <- data %>% pivot_longer(-class, names_to = 'site', values_to = 'presence') %>%
mutate(sample=substr(site,1,1)) %>%
mutate(site = substr(site, 2,2))
d %>% group_by(class,sample) %>%
summarise(presence = sum(presence)) %>% arrange(sample)
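which results in the following (note that this sums both replicates, so a species present in both A and B at a site is counted twice):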
# A tibble: 24 x 3
# Groups: class [4]
class sample presence
<chr> <chr> <dbl>
1 A 1 3
2 B 1 4
3 C 1 0
4 D 1 1
5 A 2 4
6 B 2 6
7 C 2 3
8 D 2 3
9 A 3 0
10 B 3 0
Here is the data with dput():
structure(list(class = c("A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D"), `1A` = c(0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0), `1B` = c(1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1), `2A` = c(0,
1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0), `2B` = c(0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1), `3A` = c(0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1,
1), `3B` = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), `4A` = c(0,
0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1), `4B` = c(1, 1, 0, 0, 1, 1,
0, 0, 1, 1, 0, 0), `5A` = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1), `5B` = c(1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), `6A` = c(1,
0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0), `6B` = c(1, 1, 0, 0, 1, 1,
0, 0, 1, 1, 0, 0)), class = c("spec_tbl_df", "tbl_df", "tbl",
"data.frame"), row.names = c(NA, -12L), spec = structure(list(
cols = list(class = structure(list(), class = c("collector_character",
"collector")), `1A` = structure(list(), class = c("collector_double",
"collector")), `1B` = structure(list(), class = c("collector_double",
"collector")), `2A` = structure(list(), class = c("collector_double",
"collector")), `2B` = structure(list(), class = c("collector_double",
"collector")), `3A` = structure(list(), class = c("collector_double",
"collector")), `3B` = structure(list(), class = c("collector_double",
"collector")), `4A` = structure(list(), class = c("collector_double",
"collector")), `4B` = structure(list(), class = c("collector_double",
"collector")), `5A` = structure(list(), class = c("collector_double",
"collector")), `5B` = structure(list(), class = c("collector_double",
"collector")), `6A` = structure(list(), class = c("collector_double",
"collector")), `6B` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
You can try this:
Code
library(tidyverse)
df %>%
  # long format: one column for the sample (site number), one for the replicate letter
  pivot_longer(-class,
               names_pattern = "(\\d*)([A-Z]*)",
               names_to = c("sample", "species")) %>%
  # one column per replicate (A and B); id_cols is named explicitly
  pivot_wider(id_cols = c(class, sample),
              names_from = species,
              values_from = value,
              values_fn = list) %>%
  unnest(c(A, B)) %>%
  # presence is 1 when the species occurs in either replicate (column A or B)
  mutate(presence = ifelse(A == 1 | B == 1, 1, 0)) %>%
  # sum presence by sample and class
  group_by(sample, class) %>%
  summarise(Number = sum(presence))
Output
# A tibble: 24 x 3
# Groups: sample [6]
sample class Number
<chr> <chr> <dbl>
1 1 A 3
2 1 B 2
3 1 C 0
4 1 D 1
5 2 A 2
6 2 B 3
7 2 C 3
8 2 D 3
9 3 A 0
10 3 B 0
# ... with 14 more rows
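The core step, "present if seen in either replicate", can also be sketched in base R with pmax() and rowsum(). The data below is a small made-up subset with two sites and two classes, not the full data set from the question, and the names `sites`, `present` and `counts` are hypothetical.

```r
# Hypothetical subset: two sites (1, 2), replicates A and B, two classes
df <- data.frame(
  class = c("A", "B", "A", "B"),
  `1A` = c(0, 1, 0, 0),
  `1B` = c(1, 1, 1, 0),
  `2A` = c(0, 1, 1, 1),
  `2B` = c(0, 1, 1, 1),
  check.names = FALSE
)

# Site labels: column names with the trailing replicate letter removed
sites <- unique(sub("[A-Z]$", "", names(df)[-1]))

# For each site, a species counts as present if either replicate has a 1
present <- sapply(sites, function(s) {
  pmax(df[[paste0(s, "A")]], df[[paste0(s, "B")]])
})

# Number of species present per class (rows) and site (columns)
counts <- rowsum(present, df$class)

counts
```

The result is a class-by-site matrix; `as.data.frame(as.table(counts))` would give a long layout similar to the desired output in the question.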

How to plot a "matrix" in a fixed grid pattern in R

I have a large data frame. A sample of the first 6 rows is below:
> temp
M1 M2 M3 M4 M5 M6
1 1 1 1 1 1 1
2 1 1 1 1 1 1
3 0 1 0 -1 1 0
4 1 1 1 1 1 1
5 0 0 0 -1 0 1
6 0 0 0 0 0 0
> dput(temp)
structure(list(M1 = c(1, 1, 0, 1, 0, 0), M2 = c(1, 1, 1, 1, 0,
0), M3 = c(1, 1, 0, 1, 0, 0), M4 = c(1, 1, -1, 1, -1, 0), M5 = c(1,
1, 1, 1, 0, 0), M6 = c(1, 1, 0, 1, 1, 0)), .Names = c("M1", "M2",
"M3", "M4", "M5", "M6"), row.names = c(NA, -6L), class = "data.frame")
The data frame only has the values -1, 0 and 1. The total number of rows is 2,156. What I would like to do is plot it in a "grid" format where each row consists of 6 squares (one for each column). Each of the three values is then assigned a color (say, red, green, blue).
I've tried to do this with heatmap.2 (but I can't get the distinct squares to show up).
I've tried to do this using ggplot2 with geom_point() but can't figure out the best way to do it.
Any help on how to efficiently do this would be much appreciated!
Thanks!
Another option:
library(lattice)
temp <- as.matrix(temp)
levelplot(temp, col.regions= colorRampPalette(c("red","green","blue")))
This will produce the following plot:
I think geom_tile() is a better bet, in combination with reshaping to long.
library(ggplot2)
library(reshape2)
#assign an id to plot rows to y-axis
temp$id <- 1:nrow(temp)
#reshape to long
m_temp <- melt(temp, id.vars="id")
p1 <- ggplot(m_temp, aes(x=variable,
y=id,fill=factor(value))) +
geom_tile()
p1
You could use ggplot to do the following:
library(ggplot2)
dd <- expand.grid(x = 1:ncol(temp), y = 1:nrow(temp))
dd$col <- unlist(c(temp))
ggplot(dd, aes(x = x, y = y, fill = factor(col))) + geom_tile()
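If you want full control over which colour goes with which value and no extra packages, base graphics image() is another option. This is a sketch, assuming the red/green/blue mapping for -1/0/1 suggested in the question; the names `m` and `z` are my own.

```r
# Sample data from the question
temp <- data.frame(
  M1 = c(1, 1, 0, 1, 0, 0),
  M2 = c(1, 1, 1, 1, 0, 0),
  M3 = c(1, 1, 0, 1, 0, 0),
  M4 = c(1, 1, -1, 1, -1, 0),
  M5 = c(1, 1, 1, 1, 0, 0),
  M6 = c(1, 1, 0, 1, 1, 0)
)
m <- as.matrix(temp)

# image() draws z[i, j] at (x[i], y[j]), so transpose to put the data
# columns on the x-axis, and reverse the rows so row 1 ends up on top
z <- t(m[nrow(m):1, ])

# One colour per value: breaks bracket -1, 0 and 1
image(x = seq_len(ncol(m)), y = seq_len(nrow(m)), z = z,
      col = c("red", "green", "blue"),
      breaks = c(-1.5, -0.5, 0.5, 1.5),
      axes = FALSE, xlab = "", ylab = "")
axis(1, at = seq_len(ncol(m)), labels = colnames(m))
axis(2, at = seq_len(nrow(m)), labels = rev(seq_len(nrow(m))), las = 1)
```

With 2,156 rows the y-axis labels will overlap, so you would probably want to drop the second axis() call or label only every n-th row.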
