Creating a matrix of co-occurrence from many variables, and plotting it - r

I have a data set of individuals with a number of health conditions. Individuals either do (1) or do not (0) have each condition (my real data set has 14). What I want to do is summarise the data so I know how often pairs of conditions occur. Note that some individuals may have three or four of the conditions, but what I'm interested in is the pairwise co-occurrence. I would then like to plot this as a heatmap.
I suspect that the solution involves the 'gather' function from tidyr, but I haven't been able to work it out. This is an example of what my input looks like and what I'd like to achieve:
Here's some data on individuals and whether or not they have conditions "a", "b" or "c":
library(tidyverse)
library(viridis)
dat <- tibble(
  id = c(1:15),
  a = c(1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1),
  b = c(1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1),
  c = c(0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0))
I want to summarise how often each of the conditions occurs, and how often they co-occur. In this case, it's evident that conditions "a" and "b" co-occur more often than either does with "c", which usually occurs on its own. Below is my imagined idea of what the data will look like in a plottable format. The first column is 'variable 1', the second is 'variable 2', and the third is the count of how often these occur together. Below that is the plot I have in mind.
plotdat <- tibble(
  var1 = c("a", "a", "a", "b", "b", "c"),
  var2 = c("a", "b", "c", "b", "c", "c"),
  count = c(7, 6, 2, 8, 3, 8))

ggplot(plotdat) +
  geom_tile(aes(var1, var2, fill = count)) +
  scale_fill_viridis()
Perhaps this is not the right approach at all and I actually need to convert the data into a 3x3 matrix. Any possible solutions would be gratefully received!

Here is a way
library(tidyverse)
as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname) %>%
  filter(!is.na(value))
# rowname key value
#1 a a 7
#2 b a 6
#3 c a 2
#4 b b 8
#5 c b 3
#6 c c 8
The most important piece is crossprod, I think. But let's go through it step by step.
You don't need the id column, so we exclude it and convert dat[-1] to a matrix, because that is what crossprod expects.
as.matrix(dat[-1]) %>%
  crossprod()
# a b c
#a 7 6 2
#b 6 8 3
#c 2 3 8
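For reference, crossprod(X) computes t(X) %*% X. With 0/1 indicator columns, entry (i, j) is the number of rows where both condition i and condition j are 1, and the diagonal is how often each condition occurs at all. A quick sketch to convince yourself:

# crossprod(m) is t(m) %*% m; for 0/1 columns this counts pairwise co-occurrence
m <- as.matrix(dat[-1])
all(crossprod(m) == t(m) %*% m)   # TRUE
# e.g. the a-b entry equals the number of individuals with both a and b
sum(dat$a == 1 & dat$b == 1)      # 6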
Then we replace the upper triangle of this matrix with NA, because you don't want both a-b and b-a, etc.
The next step is to convert to a data frame, turn the row names into a column, and reshape from wide to long:
as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname)
# rowname key value
#1 a a 7
#2 b a 6
#3 c a 2
#4 a b NA
#5 b b 8
#6 c b 3
#7 a c NA
#8 b c NA
#9 c c 8
Finally, remove the NAs to get the desired output.
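If you want to go straight from there to the heatmap, you can pipe the cleaned result into the OP's ggplot call (the columns are named rowname/key/value here rather than var1/var2/count):

library(viridis)

as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname) %>%
  filter(!is.na(value)) %>%
  ggplot() +
  geom_tile(aes(rowname, key, fill = value)) +
  scale_fill_viridis()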

Related

Move subgroup under repeated main group while keeping main group once in data.frame R

I'm aware that the question is awkward. If I could phrase it better, I'd probably find the solution in another thread.
I have this data structure...
df <- data.frame(group = c("X", "F", "F", "F", "F", "C", "C"),
                 subgroup = c(NA, "camel", "horse", "dog", "cat", "orange", "banana"))
... and would like to turn it into this...
data.frame(group = c("X", "F", "camel", "horse", "dog", "cat", "C", "orange", "banana"))
... which is surprisingly confusing. Also, I would prefer not using a loop.
EDIT: I updated the example to clarify that solutions that depend on sorting unfortunately do not do the trick.
Here is an (edited) answer for the new data.
Using data.table is going to help a lot. The idea is to split df into groups and lapply() to each group what we need. We have to take care of a few things along the way.
library(data.table)

# set as data.table
setDT(df)

# to maintain the ordering, make group a factor:
# its levels carry the ordering information into split()
df[, ':='(group = factor(group, levels = unique(df$group)))]

# split df into a list, one element per group
df_list <- split(df, df$group, sorted = FALSE)

# lapply to each element what we need: the group name followed by its subgroups
df_list <- lapply(df_list, function(x) data.frame(group = unique(c(as.character(x$group), x$subgroup))))

# bind back into one data.table and remove the NAs
rbindlist(df_list)[!is.na(group)]
group
1: X
2: F
3: camel
4: horse
5: dog
6: cat
7: C
8: orange
9: banana
With the edited data we need to add another column (here row_number) to sort by:
library(tidyverse)

df %>%
  pivot_longer(cols = everything()) %>%
  mutate(r_n = row_number()) %>%
  group_by(value) %>%
  slice(1) %>%
  arrange(r_n) %>%
  filter(!is.na(value))
#output
# A tibble: 9 × 3
# Groups: value [9]
  name     value   r_n
  <chr>    <chr>  <int>
1 group    X          1
2 group    F          3
3 subgroup camel      4
4 subgroup horse      6
5 subgroup dog        8
6 subgroup cat       10
7 group    C         11
8 subgroup orange    12
9 subgroup banana    14
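To match the requested single-column output, you can drop the helper columns at the end. A minimal continuation, assuming you start from the original character df (not the factor version created above):

df %>%
  pivot_longer(cols = everything()) %>%
  mutate(r_n = row_number()) %>%
  group_by(value) %>%
  slice(1) %>%
  arrange(r_n) %>%
  filter(!is.na(value)) %>%
  ungroup() %>%
  select(group = value)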

Selecting most frequent combinations while removing the parts of that combination R

I have a dataset with combinations and their frequency as shown below in an example. The idea is to find all combinations (every name has to be used) to have the highest possible value for count (frequency).
Person 1   Person 2   Count
A          B          4
A          D          4
A          C          3
B          C          2
C          D          1
B          D          0
A, B, C and D are names of people and count is the frequency of a combination of two people.
In this example the highest count can be reached by having an AD and BC combination, which sums to 6 (4+2). If we take AB and CD as a combination the total sum of count will be lower (5, 4+1).
I would like to have a dataset looking like this as an answer:
Person 1   Person 2   Count
A          D          4
B          C          2
How can I create this dataset from the original without duplicate names and with the highest possible count? So if there is an AD combination, there cannot be another combination including A or D.
I tried the following code, but it does not give me the desired dataset:
dat <- data %>%
  arrange(desc(count))

while (nrow(dat) > 0) {
  print(dat[1, ])
  dat <- dat %>%
    filter(!(X1 == X1[1] | X1 == X2[1] | X2 == X1[1] | X2 == X2[1]))
}
dat is the arranged dataset shown in the first table. I print the first row with the highest count and delete all combinations that has one of the names in their combination (because I can use a name only once). This is looped until there are no more people left.
This code gives the following dataset:

Person 1   Person 2   Count
A          B          4
C          D          1
Thank you in advance.
There is probably a more elegant solution with igraph, but here is my approach:
Using your data
your_data <- tibble::tribble(
  ~Person.1, ~Person.2, ~Count,
  "A", "B", 4L,
  "A", "D", 4L,
  "A", "C", 3L,
  "B", "C", 2L,
  "C", "D", 1L,
  "B", "D", 0L)
and assuming Person.1 and Person.2 are in alphabetical order, you can do
library(dplyr)
library(purrr)

with(your_data, unique(c(Person.1, Person.2))) %>%
  combinat::permn(\(x) split(x, (seq_along(x) + 1) %/% 2) %>%
                    map(sort) %>%
                    map_dfr(set_names, c("Person.1", "Person.2"))) %>%
  map(~ arrange(.x, Person.1)) %>%
  unique() %>%
  imap(~ dplyr::left_join(.x, your_data)) %>%
  rlist::list.sort((sum(Count))) %>%
  first()
returning the desired
# A tibble: 2 x 3
  Person.1 Person.2 Count
  <chr>    <chr>    <int>
1 A        D            4
2 B        C            2
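If you prefer something more transparent than the permutation pipeline, here is a base-R sketch of the same idea (the helper names are mine, and it assumes every pair appears in your_data with Person.1 alphabetically before Person.2): enumerate all perfect pairings, score each one, keep the best.

# enumerate all ways to split `people` into unordered pairs
pairings <- function(people) {
  if (length(people) == 0) return(list(list()))
  first <- people[1]
  rest <- people[-1]
  out <- list()
  for (i in seq_along(rest)) {
    pair <- sort(c(first, rest[i]))
    for (sub in pairings(rest[-i])) out <- c(out, list(c(list(pair), sub)))
  }
  out
}

# total Count of one pairing, looked up in your_data
score <- function(pairing) {
  sum(sapply(pairing, function(p)
    your_data$Count[your_data$Person.1 == p[1] & your_data$Person.2 == p[2]]))
}

all_p <- pairings(sort(unique(c(your_data$Person.1, your_data$Person.2))))
all_p[[which.max(sapply(all_p, score))]]
# list(c("A", "D"), c("B", "C")), with total Count 6

With 4 people there are only 3 perfect pairings, so exhaustive search is trivial; it grows factorially, so this is only for small groups.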

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this I'm missing here, but I have two data frames, where:
1. The merging should happen in the first data frame, hence left_join.
2. I not only want to add columns, but also update existing columns in the first data frame; more specifically, replace NAs in the first data frame with values from the second data frame.
3. The second data frame contains more rows than the first one.
Conditions #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
                a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
                a = c("A", "B", "C", "D"),
                q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
  left_join(y %>% select(id, q), by = "id") %>%
  rows_update(y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
  left_join(y, by = 'id') %>%
  transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w
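As an aside, if you are on dplyr >= 1.1.0, rows_patch() (which only replaces NAs) accepts unmatched = "ignore", so a variant close to the OP's attempt works without the intermediate filter:

x %>%
  left_join(select(y, id, q), by = "id") %>%                 # add the new column
  rows_patch(select(y, id, a), by = "id", unmatched = "ignore")  # fill NAs in a; extra rows of y are ignored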

Take data weights into account when using dplyr::count [R]

I am working with purely factorial data (survey), and I need to aggregate the data in order to visualise it. I am currently using the count() function from dplyr, but there is no option to take data weights into account. In particular, I want count() to count each row as its given weight.
Currently count(data, var1, var2, var3) returns an aggregated dataframe where each row from data is counted as 1. I want to be able to specify a numeric weight column within my data so that each row is counted as the value in data$weight instead of simply 1.
You could repeat the rows data$weight times and then count. This can be done with the splitstackshape package:
library(splitstackshape)
library(dplyr)

mydf <- data.frame(x = c("a", "b", "q", "a", "b"),
                   y = c("c", "d", "r", "c", "r"),
                   count = c(2, 5, 3, 4, 4))
mydf
x y count
1 a c 2
2 b d 5
3 q r 3
4 a c 4
5 b r 4
mydf %>%
  expandRows("count") %>%
  count(x, y)
x y n
1 a c 6
2 b d 5
3 b r 4
4 q r 3
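For completeness: depending on your dplyr version, count() itself takes a wt argument that sums a weight column instead of counting rows, which avoids expanding the data at all:

mydf %>%
  count(x, y, wt = count)
#   x y n
# 1 a c 6
# 2 b d 5
# 3 b r 4
# 4 q r 3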

Conditional sums in R

I created a data frame of the following type
Name Date Value
A 01.01.01 10
B 02.01.01 2
A 04.01.01 4
...
I would like to obtain a list that ranks the elements in the name column by the total sum, provided that the dates are within a certain range.
Welcome to Stack Overflow (SO). It is very important for anybody asking a question to provide reproducible data, which you can generate with dput(). Please read this link. If you have tried something, include your code and describe what your challenge is. In this way, you help SO users save time and you are likely to receive more support. Here, I did my best to read your question, created some sample data, and did the following using the dplyr package.
# Sample data
foo <- data.frame(id = c("A", "B", "A", "C", "D", "B", "D", "E", "A", "S", "B"),
                  date = c("01.01.01", "02.01.01", "04.01.01", "05.01.01",
                           "11.01.01", "09.03.01", "12.15.01", "08.08.01",
                           "03.27.01", "11.16.01", "04.07.01"),
                  value = c(-10, -2, -4, 8, 5, 2, 10, 5, 11, 7, 8),
                  stringsAsFactors = FALSE)
# id date value
#1 A 01.01.01 -10
#2 B 02.01.01 -2
#3 A 04.01.01 -4
#4 C 05.01.01 8
#5 D 11.01.01 5
#6 B 09.03.01 2
#7 D 12.15.01 10
#8 E 08.08.01 5
#9 A 03.27.01 11
#10 S 11.16.01 7
#11 B 04.07.01 8
library(dplyr)

foo %>%
  # Create date objects
  mutate(date = as.Date(date, format = "%m.%d.%y")) %>%
  # Keep data points between 2001-01-01 and 2001-08-31
  filter(between(date, as.Date("2001-01-01"), as.Date("2001-08-31"))) %>%
  # For each id group
  group_by(id) %>%
  # Get the sum of value
  summarise(Total = sum(value)) %>%
  # Arrange rows in descending order of Total
  arrange(desc(Total))
# id Total
#1 C 8
#2 B 6
#3 E 5
#4 A -3
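If you need this for several date ranges, you can wrap the pipeline in a small helper (a sketch; the function name is mine):

# rank ids by total value within a date range
sum_by_id <- function(data, from, to) {
  data %>%
    mutate(date = as.Date(date, format = "%m.%d.%y")) %>%
    filter(between(date, as.Date(from), as.Date(to))) %>%
    group_by(id) %>%
    summarise(Total = sum(value)) %>%
    arrange(desc(Total))
}

sum_by_id(foo, "2001-01-01", "2001-08-31")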
