Take data weights into account when using dplyr::count [R]

I am working with purely categorical (factor) survey data, and I need to aggregate it in order to visualise it. I am currently using the count() function from dplyr, but there is no obvious option to take data weights into account. In particular, I want count() to count each row as its given weight.
Currently count(data, var1, var2, var3) returns an aggregated data frame in which each row of data counts as 1. I want to be able to specify a numeric weight column within my data so that each row is counted as the value in data$weight instead of simply 1.

You could repeat the rows data$weight times and then count. This can be done with the splitstackshape package:
library(splitstackshape)
library(dplyr)
mydf <- data.frame(x = c("a", "b", "q", "a", "b"),
                   y = c("c", "d", "r", "c", "r"),
                   count = c(2, 5, 3, 4, 4))
mydf
  x y count
1 a c     2
2 b d     5
3 q r     3
4 a c     4
5 b r     4
mydf %>%
  expandRows("count") %>%
  count(x, y)
  x y n
1 a c 6
2 b d 5
3 b r 4
4 q r 3
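For what it's worth, dplyr's count() also accepts a wt argument, which sums a weight column instead of counting rows, so the row expansion can be skipped entirely (using the same mydf):
# `wt` makes count() sum the weight column per group:
count(mydf, x, y, wt = count)
#   x y n
# 1 a c 6
# 2 b d 5
# 3 b r 4
# 4 q r 3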

Related

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join variant for this that I'm missing, but I have two data frames, where:
1. The merging should happen in the first data frame, hence left_join.
2. I not only want to add columns, but also update existing columns in the first data frame; more specifically, replace NAs in the first data frame with values from the second data frame.
3. The second data frame contains more rows than the first one.
Conditions #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between, and I am wondering if there is an easier way to get the desired output.
x <- data.frame(id = c(1, 2, 3),
                a = c("A", "B", NA))
  id    a
1  1    A
2  2    B
3  3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
                a = c("A", "B", "C", "D"),
                q = c("u", "v", "w", "x"))
  id a q
1  1 A u
2  2 B v
3  3 C w
4  4 D x
and the desired output would be:
  id a q
1  1 A u
2  2 B v
3  3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
  left_join(., y %>% select(id, q), by = c("id")) %>%
  rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
  left_join(y, by = 'id') %>%
  transmute(id, a = coalesce(a.x, a.y), q)
#  id a q
#1  1 A u
#2  2 B v
#3  3 C w
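As a side note, coalesce() works element-wise: at each position it returns the first non-missing value across its arguments, which is what fills the NA in column a above. A minimal illustration:
# The NA in the first vector is replaced by the value from the second:
coalesce(c("A", "B", NA), c("A", "B", "C"))
#[1] "A" "B" "C"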

Creating a matrix of co-occurrence from many variables, and plotting it

I have a data set of individuals with a number of health conditions. Individuals either do (1) or do not (0) have each condition (my real data set has 14). What I want to do is summarise the data so I know how often pairs of conditions occur together. Note that some individuals may have three or four of the conditions, but what I'm interested in is the pairwise co-occurrence. I would then like to plot this as a heatmap.
I suspect that the solution involves the 'gather' function from tidyr, but I haven't been able to work it out. This is an example of what my input looks like and what I'd like to achieve:
Here's some data on individuals and whether or not they have conditions "a", "b" or "c":
library(tidyverse)
library(viridis)
dat <- tibble(
  id = c(1:15),
  a = c(1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1),
  b = c(1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1),
  c = c(0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0))
I want to summarise how often each of the conditions occurs, and how often they co-occur. In this case, it's evident that conditions "a" and "b" co-occur more often than either of them does with "c", which usually occurs on its own. Below is my imagined idea of what the data will look like in a plottable format. The first column is 'variable 1', the second is 'variable 2', and the third is the count of how often these occur together. Below that is the plot I have in mind.
plotdat <- tibble(
  var1 = c("a", "a", "a", "b", "b", "c"),
  var2 = c("a", "b", "c", "b", "c", "c"),
  count = c(7, 6, 2, 8, 3, 8))

ggplot(plotdat) +
  geom_tile(aes(var1, var2, fill = count)) +
  scale_fill_viridis()
Perhaps this is not the right approach at all and I actually need to convert the data into a 3x3 matrix. Any possible solutions would be gratefully received!
Here is a way:
library(tidyverse)
as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname) %>%
  filter(!is.na(value))
#  rowname key value
#1       a   a     7
#2       b   a     6
#3       c   a     2
#4       b   b     8
#5       c   b     3
#6       c   c     8
The most important piece is crossprod, I think. For a 0/1 matrix X, crossprod(X) computes t(X) %*% X, so entry (i, j) counts the rows in which conditions i and j are both 1. But let's go through it step by step.
We don't need the id column, so we exclude it and convert dat[-1] to a matrix, because that is what crossprod expects.
as.matrix(dat[-1]) %>%
  crossprod()
#  a b c
#a 7 6 2
#b 6 8 3
#c 2 3 8
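As a quick sanity check (still using dat from the question), the ("a", "b") entry of the cross product is simply the number of rows where both conditions are 1:
# Count rows where both a and b equal 1 — matches the matrix entry above:
sum(dat$a * dat$b)
#[1] 6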
Then we replace the upper triangle of this matrix with NA, because we don't want to count both a-b and b-a, etc.
The next step is to convert to a data frame, turn the row names into a column, and reshape from wide to long:
as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname)
#  rowname key value
#1       a   a     7
#2       b   a     6
#3       c   a     2
#4       a   b    NA
#5       b   b     8
#6       c   b     3
#7       a   c    NA
#8       b   c    NA
#9       c   c     8
Finally, remove the NAs to get the desired output.
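From there, the long data frame can be piped straight into a slightly adapted version of the plotting code from the question; a sketch, with rowname and key standing in for var1 and var2:
library(viridis)
as.matrix(dat[-1]) %>%
  crossprod() %>%
  `[<-`(upper.tri(.), NA) %>%
  as.data.frame() %>%
  rownames_to_column() %>%
  gather(key, value, -rowname) %>%
  filter(!is.na(value)) %>%
  ggplot() +
  geom_tile(aes(rowname, key, fill = value)) +
  scale_fill_viridis()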

Grouping R data multiple times before summing

I'm trying to group my data by a number of variables before providing a summary table showing the sum of the values within each group.
I have created the below data as an example.
Value <- c(21000, 10000, 50000, 60000, 2000, 4000, 5500, 10000, 35000, 40000)
Group <- c("A", "A", "B", "B", "C", "C", "A", "A", "B", "C")
Type <- c(1, 2, 1, 2, 1, 1, 1, 2, 2, 1)
Matrix <- cbind(Value, Group, Type)
I want to group the above data first by the 'Group' variable and then by the 'Type' variable, and then sum the values, to get an output similar to an example I worked up in Excel. I would usually use the aggregate function if I just wanted to group by one variable, but I'm not sure how to translate this to multiple variables.
Further to this, I then need to provide an identical table but with the values calculated by a "count" rather than a "sum".
Many thanks in advance!
You can supply multiple groupings to aggregate:
> df <- data.frame(Value, Group, Type)
> aggregate(df$Value, list(Type = df$Type, Group = df$Group), sum)
  Type Group     x
1    1     A 26500
2    2     A 20000
3    1     B 50000
4    2     B 95000
5    1     C 46000
> aggregate(df$Value, list(Type = df$Type, Group = df$Group), length)
  Type Group x
1    1     A 2
2    2     A 2
3    1     B 1
4    2     B 2
5    1     C 3
There are other packages which may be easier to use, such as data.table:
> library(data.table)
> dt <- as.data.table(df)
> dt[, .(Count = length(Value), Sum = sum(Value)), by = .(Type, Group)]
   Type Group Count   Sum
1:    1     A     2 26500
2:    2     A     2 20000
3:    1     B     1 50000
4:    2     B     2 95000
5:    1     C     3 46000
dplyr is another option, and @waskuf has a good example of that below.
Using dplyr (note that "Matrix" needs to be a data.frame):
library(dplyr)
Matrix <- data.frame(Value, Group, Type)
Matrix %>%
  group_by(Group, Type) %>%
  summarise(Sum = sum(Value), Count = n()) %>%
  ungroup()
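If the goal is an Excel-style cross table with one row per Group and one column per Type, the summarised result can be reshaped to wide format; a sketch, assuming the tidyr package is available:
library(dplyr)
library(tidyr)
# Summarise per Group/Type, then spread Type into columns:
Matrix %>%
  group_by(Group, Type) %>%
  summarise(Sum = sum(Value), .groups = "drop") %>%
  pivot_wider(names_from = Type, values_from = Sum)
#  Group   `1`   `2`
#1     A 26500 20000
#2     B 50000 95000
#3     C 46000    NA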

Select rows based on non-directed combinations of columns

I am trying to select the maximum value in a data frame's third column based on the combinations of the values in the first two columns.
My problem is similar to this one, but I can't find a way to implement what I need.
EDIT: Sample data changed to make the column names more obvious.
Here is some sample data:
library(tidyr)
set.seed(1234)
df <- data.frame(group1 = letters[1:4], group2 = letters[1:4])
df <- df %>% expand(group1, group2)
df <- subset(df, subset = group1 != group2)
df$score <- runif(n = 12, min = 0, max = 1)
df
# A tibble: 12 × 3
   group1 group2       score
   <fctr> <fctr>       <dbl>
 1      a      b 0.113703411
 2      a      c 0.622299405
 3      a      d 0.609274733
 4      b      a 0.623379442
 5      b      c 0.860915384
 6      b      d 0.640310605
 7      c      a 0.009495756
 8      c      b 0.232550506
 9      c      d 0.666083758
10      d      a 0.514251141
11      d      b 0.693591292
12      d      c 0.544974836
In this example rows 1 and 4 are 'duplicates'. I would like to select row 4 as the value in the score column is larger than in row 1. Ultimately I would like a dataframe to be returned with the group1 and group2 columns and the maximum value in the score column. So in this example, I expect there to be 6 rows returned.
How can I do this in R?
I'd prefer dealing with this problem in two steps:
library(dplyr)

# Create a function for computing group IDs from a data frame of groups (per column)
get_group_id <- function(groups) {
  apply(groups, 1, function(row) {
    paste0(sort(row), collapse = "_")
  })
}

group_id <- get_group_id(select(df, -score))

# Perform the computation
df %>%
  mutate(groupId = group_id) %>%
  group_by(groupId) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-groupId)
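A more compact variant of the same idea builds the undirected key inline with pmin()/pmax(); a sketch, assuming the two group columns can be compared as character:
# pmin/pmax order each pair alphabetically, so a-b and b-a share a key:
df %>%
  group_by(key = paste(pmin(as.character(group1), as.character(group2)),
                       pmax(as.character(group1), as.character(group2)),
                       sep = "_")) %>%
  slice(which.max(score)) %>%
  ungroup() %>%
  select(-key)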

How to compute the correlation of all subsets of two vectors in R?

I work in R. I have two vectors of length n, say a and b. I want to compute the correlation of each consecutive subset of length m, like this:
cor(a[1:m], b[1:m])
cor(a[(m+1):(2*m)], b[(m+1):(2*m)])
...
cor(a[(k*m+1):n], b[(k*m+1):n])
At the moment I'm using a for loop, but it's too slow. How can I do this in a faster way?
First create a grouping variable (index) and then calculate the correlations groupwise:
# Some fake data:
set.seed(123)
df <- data.frame(a = rnorm(100), b = rnorm(100), index = rep(1:10, each = 10))
# Loading the plyr package:
library(plyr)
ddply(df, .(index), summarise, "corr" = cor(a, b))
   index        corr
1      1  0.26831285
2      2  0.14373593
3      3  0.21555988
4      4 -0.27461416
5      5 -0.08825786
6      6 -0.58680476
7      7 -0.02613450
8      8 -0.29408586
9      9  0.12030810
10    10 -0.04391428
Or with dplyr:
library(dplyr)
df %>% group_by(index) %>% summarise(cor(a, b))
Or with data.table:
library(data.table)
setDT(df)[, cor(a, b), by = index]
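The same groupwise correlations can also be computed in base R without any extra packages, using split() and sapply() on the df built above:
# Split the rows by index, then correlate a and b within each chunk:
sapply(split(df, df$index), function(d) cor(d$a, d$b))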
