R grouped frequency table - r

I'm an R noob and I feel this should be simple but I cannot work it out. I have a survey dataset, with columns for ID, employer, practice_area and then a number of columns where the survey takers had to indicate which tools they use with a 'check all that apply' instruction. The data set now has a column for each tool option with either 1 or 0.
Sample df:
np1 <- data.frame(ID = c(1:10),
practice_area = c("A", "B", "C", "A", "A", "C", "B", "D", "C", "A"),
tool_1 = sample(0:1,10, replace = TRUE),
tool_2 = sample(0:1,10, replace = TRUE),
tool_3 = sample(0:1,10, replace = TRUE),
tool_4 = sample(0:1,10, replace = TRUE),
tool_5 = sample(0:1,10, replace = TRUE))
I'd like a frequency table grouped by practice_area, so that I can see, for example, that in practice_area A, x people use tool_1, x people use tool_2, etc.

# data
df <- data.frame(ID = c(1, 2, 3, 4 ,5),
employer = c("A", "B", "C", "D", "E"),
practice_area = c("X", "Y", "X", "X", "X"),
tool_1 = c(1, 0, 0, 1, 1),
tool_2 = c(1, 0, 0, 1, 0),
tool_3 = c(1, 1, 1, 1, 1),
tool_4 = c(0, 1, 1, 0, 1))
# code
library(dplyr)
df %>%
  group_by(practice_area) %>%
  summarise(tool_1 = sum(tool_1), tool_2 = sum(tool_2),
            tool_3 = sum(tool_3), tool_4 = sum(tool_4))
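With many tool columns, naming each one in summarise() gets tedious. Here is a minimal base R sketch of the same grouped sum, assuming every tool column shares the `tool_` prefix as in the sample data (the names `tool_cols` and `freq` are only illustrative):

```r
# Same toy data as in the answer above
df <- data.frame(ID = 1:5,
                 employer = c("A", "B", "C", "D", "E"),
                 practice_area = c("X", "Y", "X", "X", "X"),
                 tool_1 = c(1, 0, 0, 1, 1),
                 tool_2 = c(1, 0, 0, 1, 0),
                 tool_3 = c(1, 1, 1, 1, 1),
                 tool_4 = c(0, 1, 1, 0, 1))

# Pick every column whose name starts with "tool_" (assumed naming scheme)
tool_cols <- grep("^tool_", names(df), value = TRUE)

# Sum each 0/1 tool column within each practice area
freq <- aggregate(df[tool_cols],
                  by = list(practice_area = df$practice_area),
                  FUN = sum)
freq
```

In dplyr 1.0 or later, the same idea can be written as summarise(across(starts_with("tool_"), sum)) after group_by(practice_area).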

Ok, so let's start off with creating a dataset, to reproduce this problem.
library(tidyverse)
df <- data.frame(
ID = 1:50,
employer = rep(
c("employer.1","employer.2"),
25
),
practice_area = rep(
1:5,
10
),
tool.1 = sample(0:1, 50, replace = TRUE),
tool.2 = sample(0:1, 50, replace = TRUE)
)
So, if I want a table like this:
# A tibble: 10 x 3
# Groups: practice_area [5]
practice_area tool n
<int> <chr> <int>
1 1 tool.1 7
2 1 tool.2 2
3 2 tool.1 4
4 2 tool.2 2
5 3 tool.1 2
6 3 tool.2 4
7 4 tool.1 4
8 4 tool.2 6
9 5 tool.1 6
10 5 tool.2 5
I would do
df %>%
pivot_longer(
starts_with("tool"),
names_to = "tool",
values_to = "uses_tool"
) %>%
filter(uses_tool != 0) %>%
group_by(practice_area) %>%
count(tool)
In this piece of code, I make a long table (instead of a wide one) in which I have a column for the tools (selected with starts_with, see https://dplyr.tidyverse.org/reference/select.html). After that, I keep only the rows where the tool is used (uses_tool != 0) and group them by practice area. The only thing left to do then is to count the occurrences by group.
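The same wide-to-long-then-count idea can be sketched without the tidyverse. This is only an illustration on a small hand-made data set (not the random one above), and it assumes the tool columns all start with "tool":

```r
# Small deterministic example data (hypothetical values)
df <- data.frame(
  practice_area = c(1, 1, 2, 2, 2),
  tool.1 = c(1, 0, 1, 1, 0),
  tool.2 = c(0, 0, 1, 0, 1)
)
tool_cols <- grep("^tool", names(df), value = TRUE)

# Wide -> long: one row per (respondent, tool) pair
long <- data.frame(
  practice_area = rep(df$practice_area, times = length(tool_cols)),
  tool = rep(tool_cols, each = nrow(df)),
  uses_tool = unlist(df[tool_cols], use.names = FALSE)
)

# Keep only the ticked boxes, then count per practice area and tool
used <- long[long$uses_tool != 0, ]
counts <- as.data.frame(
  table(practice_area = used$practice_area, tool = used$tool),
  responseName = "n"
)
counts
```

Unlike count(), table() also reports combinations with zero users, which may or may not be what you want in the final table.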

Related

Divide one table by another, with matching index

I have two tables with a shared index, and I want to divide one by the other. This could be done by dividing the two data frames directly, but that seems arbitrary (how would I know I am dividing the right numbers?) and does not preserve the index, so I want to do the division by matching rows with the same index. What's the best way to do this? Is there a best practice for table division in this case?
tb1 <- data.frame(index = c(1, 2, 3), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
tb2 <- data.frame(index = c(1, 2, 3), unit_1 = c(4, 2, 3), unit_2 = c(2, 3, 6))
tb1[,-1]/tb2[,-1]
total_1 total_2
1 25 10
2 225 13
3 100 10
Another case, where two index columns must both match:
tb2 <- data.frame(index_1 = c("a", "b", "b"), index_2 = c("c", "d", "b"), unit_1 = c(4, 2, 3), unit_2 = c(2, 3, 6))
tb1 <- data.frame(index_1 = c("a", "b", "b"), index_2 = c("c", "d", "b"), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
If both data frames have the same index values and the same number of rows, one way is to order both by 'index' so the rows line up, then do the division
tb1new <- tb1[order(tb1$index),]
tb2new <- tb2[order(tb2$index),]
tb1new[-1] <- tb1new[-1]/tb2new[-1]
Or we can first check that both 'index' columns agree and do the division only if they do
i1 <- isTRUE(all.equal(tb1$index, tb2$index))
if(i1) tb1[-1]/tb2[-1]
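A third base R possibility, sketched here under the assumption that every index in tb1 occurs exactly once in tb2, is to look up the matching row with match() so that row order stops mattering. tb2 is deliberately shuffled in this example, and `pos` and `ratio` are illustrative names:

```r
tb1 <- data.frame(index = c(1, 2, 3), total_1 = c(100, 450, 300), total_2 = c(20, 39, 60))
# tb2 is shuffled on purpose to show that row order does not matter
tb2 <- data.frame(index = c(3, 1, 2), unit_1 = c(3, 4, 2), unit_2 = c(6, 2, 3))

# For each row of tb1, find the row of tb2 that has the same index
pos <- match(tb1$index, tb2$index)
stopifnot(!anyNA(pos))  # fail loudly if some index has no partner

# Divide aligned rows and keep the index column in the result
ratio <- data.frame(index = tb1$index,
                    tb1[-1] / tb2[pos, -1])
ratio
```

This answers the "am I dividing the right numbers?" worry directly: the pairing is decided by the index values, not by position.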
Or another option in a join
library(data.table)
nm1 <- c('total_1', 'total_2')
nm2 <- c('unit_1', 'unit_2')
setDT(tb1)[tb2, (nm1) := .SD/mget(nm2), on = .(index), .SDcols = nm1]
You can perform a join and divide the columns. In base R :
result <- merge(tb1, tb2, by = c('index_1', 'index_2'))
result
# index_1 index_2 total_1 total_2 unit_1 unit_2
#1 a c 100 20 4 2
#2 b b 300 60 3 6
#3 b d 450 39 2 3
total_cols <- grep('total', names(result), value = TRUE)
unit_cols <- grep('unit', names(result), value = TRUE)
result[total_cols]/result[unit_cols]
# total_1 total_2
#1 25 10
#2 100 10
#3 225 13
Maybe this is not the most efficient solution but here is another way:
library(dplyr)
library(tidyr)
# For one index matching
tb1 %>%
left_join(tb2, by = "index") %>%
mutate(result_1 = get(paste("total", 1, sep = "_")) / get(paste("unit", 1, sep = "_")),
result_2 = get(paste("total", 2, sep = "_")) / get(paste("unit", 2, sep = "_")))
index result_1 result_2
1 1 25 10
2 2 225 13
3 3 100 10
# For two indices matching
tb1 %>%
left_join(tb2, by = c("index_1", "index_2")) %>%
mutate(result_1 = get(paste("total", 1, sep = "_")) / get(paste("unit", 1, sep = "_")),
result_2 = get(paste("total", 2, sep = "_")) / get(paste("unit", 2, sep = "_"))) %>%
select(!starts_with(c("total", "unit")))
index_1 index_2 result_1 result_2
1 a c 25 10
2 b d 225 13
3 b b 100 10

How to pivot_wider only a single condition using a single command in R

Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
This can also be a strategy (it works even when the groups have unequal numbers of conditions):
library(dplyr)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Revised df adopted
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of above syntax
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill cond afterwards if desired.
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
mutate(result0 = result[cond == 0][match(drug, drug[cond == 0])]) %>%  # look up each drug's cond == 0 result explicitly
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data :
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
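For completeness, the "attach the reference value to every other row" idea can also be sketched in base R with merge(). This assumes each drug has exactly one cond == 0 row; `ref` is an illustrative name:

```r
df <- data.frame(
  drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
  result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)

# Reference rows: one per drug, with result renamed to result0
ref <- df[df$cond == 0, c("drug", "result")]
names(ref)[names(ref) == "result"] <- "result0"

# Attach the reference value to every non-reference row of the same drug
df_wider0 <- merge(df[df$cond != 0, ], ref, by = "drug")
df_wider0
```

The resulting data frame has the columns drug, cond, result and result0, so it can be passed to the ggplot call unchanged.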

2-group heterogeneity index

I have a dataset with two distinct groups (A and B) belonging to 3 different categories (1, 2, 3):
library(tidyverse)
set.seed(100)
df <- tibble(Group = sample(c(1, 2, 3), 20, replace = TRUE), Company = sample(c('A', 'B'), 20, replace = TRUE))
I want to come up with a metric that characterizes group composition across the timespan.
Thus far, I have used an index based on Shannon's Index, which gives a measure of heterogeneity varying between 0 and 1, with 1 being perfectly heterogeneous (equal representation of each group) and 0 being completely homogeneous (only 1 group is represented):
df %>%
group_by(Group, Company) %>%
summarise(n=n()) %>%
mutate(p = n / sum(n)) %>%
mutate(Shannon = -(p*log2(p) + (1-p)))
Yielding:
Group Company n p Shannon
<dbl> <chr> <int> <dbl> <dbl>
1 A 2 0.6666667 0.05664167
1 B 1 0.3333333 -0.13834583
2 A 4 0.5000000 0.00000000
2 B 4 0.5000000 0.00000000
3 A 1 0.1111111 -0.53667500
3 B 8 0.8888889 0.03993333
However, I am looking for an index between [-1, +1]. Where the index yields -1 when only group A is present at a time point, +1 when only group B is present at a time point, 0 being an equal representation.
How can I create such an index? I have looked at measures such as Moran's I as inspiration, but they do not seem to suit the need.
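As a side note, the `Shannon` column in the question's code is not the usual two-group entropy: the second term lacks its logarithm, which is why some values come out negative. A minimal sketch of the standard binary entropy in bits, where p is the share of one of the two companies within a group (the function name `binary_entropy` is made up for this illustration):

```r
# Binary Shannon entropy in bits; p is the share of one of the two groups.
# By convention, 0 * log2(0) is treated as 0.
binary_entropy <- function(p) {
  term <- function(q) ifelse(q > 0, q * log2(q), 0)
  -(term(p) + term(1 - p))
}

# 0 when only one group is present, 1 when both are equally represented
binary_entropy(c(0, 0.5, 1))
```

This measure stays within [0, 1] but is symmetric in the two companies, so it cannot tell an all-A group from an all-B group, which is what the signed index asked for here has to do.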
A simple solution might be to calculate the mean.
I transformed Company into value with A = -1 and B = 1 and calculated the mean by Group.
The result will be an index for each Group, with -1 when Company has just "A"s or 1 when there are just "B"s.
Data
df <- structure(list(Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1,
2, 2, 3, 2, 2, 1, 1, 3), Company = c("A", "A", "A", "A", "B",
"B", "B", "B", "A", "B", "B", "B", "A", "A", "B", "A", "B", "B",
"A", "B")), row.names = c(NA, -20L), class = c("tbl_df", "tbl",
"data.frame"))
Code
df %>%
mutate(value = ifelse(Company == "A", -1, 1)) %>%
group_by(Group) %>%
summarise(index = mean(value))
Output
# A tibble: 3 x 2
Group index
<dbl> <dbl>
1 1 0.333
2 2 -0.429
3 3 0.429
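Because A is coded as -1 and B as +1, the group mean equals (number of B minus number of A) divided by the group size, that is, the share of B minus the share of A. A short base R sketch that computes the index from the counts, using the same data as above (`counts` and `index` are illustrative names):

```r
df <- data.frame(
  Group = c(2, 2, 3, 3, 1, 2, 3, 1, 1, 3, 3, 1, 2, 2, 3, 2, 2, 1, 1, 3),
  Company = c("A", "A", "A", "A", "B", "B", "B", "B", "A", "B",
              "B", "B", "A", "A", "B", "A", "B", "B", "A", "B")
)

# Rows are groups, columns are companies
counts <- table(df$Group, df$Company)

# (n_B - n_A) / (n_A + n_B): -1 if only A, +1 if only B, 0 if balanced
index <- (counts[, "B"] - counts[, "A"]) / rowSums(counts)
round(index, 3)
```

Written this way, the bounds are easy to see: the numerator can never exceed the group size in absolute value.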

Insert specified values in R grouped df and fill up missing values using another df (R)

I have 2 dfs : df & xdf.
df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
x = c(2, 3.5))
In df, for each group in the "id" column (a & b), I would like to insert only the row of xdf that has the same id. How can I do this? I have tried the following commands, but all of the values of xdf$x are inserted for each group.
ndf <- df %>%
group_by(id) %>%
do(add_row(., id = .$id[1], x = xdf$x))
> ndf
# A tibble: 12 x 3
# Groups: id [2]
id x y
<chr> <dbl> <dbl>
1 a 1 0.2
2 a 2 0
3 a 3 0.9
4 a 4 7
5 a 2 NA
6 a 3.5 NA
7 b 1 1
8 b 2 0.3
9 b 3 5
10 b 4 5.1
11 b 2 NA
12 b 3.5 NA
# expected result should be : ndf <- ndf[c(-6,-11),]
My end goal is to fill these newly created NAs in ndf with the approx() function. But my issue remains, because I'm using xout = xdf$x, which brings in the surplus values. How can I overcome this? Can you help me write a function that makes xout vary by group?
library(plyr)  # provides ddply() and .()
f <- function(z)
{
fdf <- approx(z$x, z$y, xout = xdf$x, method = "linear")
return(data.frame(nx= fdf$x, y.out = fdf$y, id = unique(z$id)))
}
jdf <- as.data.frame(ddply(ndf, .(id), f))
zdf <- subset(jdf, select = c(id, nx, y.out))
> zdf
id nx y.out
1 a 2.0 0.00
2 a 3.5 3.95
3 b 2.0 0.30
4 b 3.5 5.05
# expected results
id nx y.out
1 a 2.0 0.00
2 b 3.5 5.05
Any helpful tips to this is welcome. Many thanks!
library(dplyr)
df <- tibble(id = c("a", "a", "a", "a", "b", "b", "b", "b"),
x = c(1, 2, 3, 4, 1, 2, 3, 4),
y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- tibble(id = c("a", "b"),
x = c(2, 3.5))
ndf <- df %>%
bind_rows(xdf) %>%
arrange(id)
zdf <- ndf %>%
group_by(id) %>%
group_modify(~mutate(., y_approx = approx(.$x, .$y, .$x, method = "linear")[["y"]])) %>%
ungroup() %>%
filter(is.na(y)) %>%
select(id, y_approx)
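If you would rather keep the original idea of a function whose xout varies by group, a base R sketch could look like the following. It interpolates directly on df (no rows need to be inserted first) and assumes xdf lists the wanted x values per id; `interp_group` is a hypothetical helper name:

```r
df <- data.frame(id = rep(c("a", "b"), each = 4),
                 x = rep(1:4, times = 2),
                 y = c(0.2, 0, 0.9, 7, 1, 0.3, 5, 5.1))
xdf <- data.frame(id = c("a", "b"), x = c(2, 3.5))

# Interpolate one group, but only at the x values that xdf lists for that id
interp_group <- function(z) {
  xout <- xdf$x[xdf$id == z$id[1]]
  fit <- approx(z$x, z$y, xout = xout, method = "linear")
  data.frame(id = z$id[1], nx = fit$x, y.out = fit$y)
}

# Split by id, interpolate each piece, and stack the results
zdf <- do.call(rbind, lapply(split(df, df$id), interp_group))
rownames(zdf) <- NULL
zdf
```

The key change from the question's function is the first line inside it: xout is subset by the current group's id instead of using all of xdf$x.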

Find rows in a data frame where certain columns are duplicated, then combine the elements in other columns [duplicate]

I have one data frame. I want to find the rows where both columns A and B are duplicated, and then combine those rows by combining the elements in column C.
My example:
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
My expected result:
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
Thanks a lot
Without packages:
DF <- aggregate(C ~ A + B, FUN = function(x) paste(x, collapse = "; "), data = DF)
Output:
A B C
1 1 a M
2 2 a X
3 1 b N
4 3 c M; N
Or with data.table:
library(data.table)
setDT(DF)[, .(C = paste(C, collapse = "; ")), by = .(A, B)]
This is a tidyverse based solution where you can use paste with collapse after grouping it.
library(dplyr)
DF = cbind.data.frame(A = c(1, 1, 2, 3, 3),
B = c("a", "b", "a", "c", "c"),
C = c("M", "N", "X", "M", "N"))
DFE = cbind.data.frame(A = c(1, 1, 2, 3),
B = c("a", "b", "a", "c"),
C = c("M", "N", "X", "M; N"))
DF %>%
group_by(A,B) %>%
summarise(C = paste(C, collapse = ";"))  # use collapse = "; " to get "M; N" as in the expected result
#> # A tibble: 4 x 3
#> # Groups: A [3]
#> A B C
#> <dbl> <fct> <chr>
#> 1 1 a M
#> 2 1 b N
#> 3 2 a X
#> 4 3 c M;N
Created on 2019-03-19 by the reprex package (v0.2.1)
