Relative frequencies/proportions with dplyr create new columns instead of rows - r

This question is inspired by this and this question.
I am trying to calculate the proportion of different values within each group, but instead of creating "new" rows for the groups I want new columns.
Taking the example from the second question above, suppose I have the following data:
data <- structure(list(value = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), class = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("A",
"B"), class = "factor")), .Names = c("value", "class"), class = "data.frame", row.names = c(NA,
-16L))
I can calculate the proportion of each value (1,2,3) in each class (A,B):
data %>%
group_by(value, class) %>%
summarise(n = n()) %>%
complete(class, fill = list(n = 0)) %>%
group_by(class) %>%
mutate(freq = n / sum(n))
# A tibble: 6 x 4
value class n freq
<int> <fctr> <dbl> <dbl>
1 1 A 3 0.2727273
2 1 B 3 0.6000000
3 2 A 4 0.3636364
4 2 B 2 0.4000000
5 3 A 4 0.3636364
6 3 B 0 0.0000000
However, I end up with a line for each value/class pair. Instead, I want something like this:
# some code
# A tibble: 2 x 5
class n 1 2 3
<fctr> <dbl> <dbl> <dbl> <dbl>
1 A 11 0.2727273 0.3636364 0.3636364
2 B 5 0.6000000 0.4000000 0.0000000
That is, with a column for each group. I could write a for loop to construct a new data frame from the old one, but I am certain there is a better way. Any suggestions?
Thank you

We can use pivot_wider at the end
library(dplyr)
library(tidyr)
data %>%
group_by(value, class) %>%
summarise(n = n()) %>%
complete(class, fill = list(n = 0)) %>%
group_by(class) %>%
mutate(freq = n / sum(n), n = sum(n)) %>%
pivot_wider(names_from = value, values_from = freq)
# A tibble: 2 x 5
# Groups: class [2]
# class n `1` `2` `3`
# <fct> <dbl> <dbl> <dbl> <dbl>
#1 A 11 0.273 0.364 0.364
#2 B 5 0.6 0.4 0
Or, as @IcecreamToucan mentioned, the complete step is not needed, since pivot_wider has an option to fill missing combinations with a custom value (the default is NA):
data %>%
group_by(value, class) %>%
summarise(n = n()) %>%
group_by(class) %>%
mutate(freq = n / sum(n), n = sum(n)) %>%
pivot_wider(names_from = value, values_from = freq, values_fill = list(freq = 0))
If we are using an older version of tidyr, then use spread:
data %>%
group_by(value, class) %>%
summarise(n = n()) %>%
complete(class, fill = list(n = 0)) %>%
group_by(class) %>%
mutate(freq = n / sum(n), n = sum(n)) %>%
spread(value, freq)

A method using data.table::dcast instead of pivot_wider:
Line 1: Get a count (.N) for each (value, class) group, and call it n
Line 2: Make new variables within each class group:
N, the sum of the previous counts
pct, the percent of N each n makes up
Line 3: Cast to wide with class and N as the rows, value as the column names, and pct as the column elements, with empty elements set to 0.
library(magrittr) # For %>%. Not necessary if dplyr is loaded already
library(data.table)
setDT(data)
data[, .(n = .N), by = .(value, class)] %>%
.[, `:=`(N = sum(n), pct = n/sum(n)), by = class] %>%
dcast(class + N ~ value, value.var = 'pct', fill = 0)
# class N 1 2 3
# 1: A 11 0.2727273 0.3636364 0.3636364
# 2: B 5 0.6000000 0.4000000 0.0000000

We can use count to count occurrences of value and class, group by class, calculate the frequency, and reshape the data to wide format.
library(dplyr)
library(tidyr)
data %>%
count(value, class) %>%
group_by(class) %>%
mutate(freq = n/sum(n), n = sum(n)) %>%
pivot_wider(names_from = value, values_from = freq, values_fill = list(freq = 0))
# class n `1` `2` `3`
# <fct> <int> <dbl> <dbl> <dbl>
#1 A 11 0.273 0.364 0.364
#2 B 5 0.6 0.4 0
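For comparison, here is a base R sketch (not from the original answers) that produces the same wide table of class-wise proportions using table() and prop.table():
# Counts per class (rows) and value (columns)
tab <- table(data$class, data$value)
# Row-wise proportions, i.e. within each class
prop <- prop.table(tab, margin = 1)
# Attach the class totals as an extra column
cbind(n = rowSums(tab), prop)
#    n         1         2         3
# A 11 0.2727273 0.3636364 0.3636364
# B  5 0.6000000 0.4000000 0.0000000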

Related

R: How to identify first date of positive observation by ID for multiple columns?

I would like to identify the first date of a positive observation by ID for multiple columns.
Example dataframe:
ID date Observ1 Observ2 Observ3
1 1 1 0 0
1 2 0 1 0
1 3 1 0 1
2 1 1 1 0
Desired result:
ID FirstObserv1 FirstObserv2 FirstObserv3
1 1 2 3
2 1 1 NA
For single column of observation, I can solve it with dplyr:
df %>% group_by(ID) %>% filter( Observ1 > 0) %>% summarize( FirstObserv1 = min(date) ) %>% as.data.frame()
I have no idea how to do it for multiple columns at once, though.
Try reshaping your data like this using tidyverse functions. The key of the code is filtering for rows with a value of 1 and then using filter() to keep the minimum date. After that you reshape to wide and you get the expected output. Here is the code:
library(tidyverse)
#Code
dfnew <- df %>% pivot_longer(-c(ID,date)) %>%
group_by(ID) %>%
filter(value==1) %>% select(-value) %>% ungroup() %>%
group_by(ID,name) %>%
filter(date==min(date)) %>%
pivot_wider(names_from = name,values_from=date)
Output:
# A tibble: 2 x 4
# Groups: ID [2]
ID Observ1 Observ2 Observ3
<int> <int> <int> <int>
1 1 1 2 3
2 2 1 1 NA
Some data used:
#Data
df <- structure(list(ID = c(1L, 1L, 1L, 2L), date = c(1L, 2L, 3L, 1L
), Observ1 = c(1L, 0L, 1L, 1L), Observ2 = c(0L, 1L, 0L, 1L),
Observ3 = c(0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-4L))
Here's a method which replaces each observation with the date if the observation is positive and NA otherwise. Taking the min of each observation column per ID then yields the desired result.
df %>%
mutate_at(vars(starts_with("Observ")), ~ifelse(. > 0, date, NA)) %>%
group_by(ID) %>%
summarise_at(vars(starts_with("Observ")), min, na.rm = TRUE)
#> # A tibble: 2 x 4
#> ID Observ1 Observ2 Observ3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3
#> 2 2 1 1 Inf
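If you prefer NA over Inf for groups that never have a positive observation (as in the desired output), one possible follow-up, shown here as a sketch rather than part of the original answer, is to convert the Inf back to NA afterwards:
library(dplyr)

df %>%
  mutate_at(vars(starts_with("Observ")), ~ ifelse(. > 0, date, NA)) %>%
  group_by(ID) %>%
  summarise_at(vars(starts_with("Observ")), min, na.rm = TRUE) %>%
  # min() over an all-NA group returns Inf; turn those back into NA
  mutate_at(vars(starts_with("Observ")), ~ na_if(., Inf))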
Another alternative:
df %>%
group_by(ID) %>%
summarise(across(
-date,
list(First = ~{x <- which(. > 0); if (length(x) > 0L) date[[x[[1L]]]] else NA_real_}),
.names = "{.fn}{.col}"
))
Output
ID FirstObserv1 FirstObserv2 FirstObserv3
<dbl> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 1 NA
We can use data.table
library(data.table)
setDT(df)[, lapply(.SD, function(x) which(x > 0)[1]),
ID, .SDcols = patterns('^Observ')]
# ID Observ1 Observ2 Observ3
#1: 1 1 2 3
#2: 2 1 1 NA
Or using tidyverse
library(dplyr)
df %>%
group_by(ID) %>%
summarise(across(starts_with('Obser'), ~ which(. > 0)[1],
.names = 'First{col}'), .groups = 'drop')
# A tibble: 2 x 4
# ID FirstObserv1 FirstObserv2 FirstObserv3
# <int> <int> <int> <int>
#1 1 1 2 3
#2 2 1 1 NA
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L), date = c(1L, 2L, 3L, 1L
), Observ1 = c(1L, 0L, 1L, 1L), Observ2 = c(0L, 1L, 0L, 1L),
Observ3 = c(0L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-4L))

R using dplyr group_by/ sum in for loop, output as concatenated list

I am using the dplyr package to group by a week variable and get the sum for three variables. The outputs should be stacked on top of each other.
Here is my data frame df:
week var1 var2 var3
1 1 2 3
1 2 2 3
2 4 4 5
2 2 2 6
3 6 6 6
3 4 4 4
My code is:
library(dplyr)

calculate <- function(vars) {
  # sum() needs the column itself, not just its name, so look it up with .data[[vars]]
  df %>%
    group_by(week) %>%
    summarise(summe = sum(.data[[vars]])) %>%
    mutate(group = vars)
}

cols <- c("var1", "var2", "var3")
total <- NULL  # accumulator for the per-variable results
for (i in seq_along(cols)) {
  var <- cols[i]
  cal <- calculate(var)
  total <- rbind(total, cal)
}
The expected output should be
week summe group
1 3 var1
2 6 var1
3 10 var1
1 4 var2
2 6 var2
3 10 var2
1 6 var3
2 11 var3
3 10 var3
My question is: Is there a better way than using a for loop?
Cheers,
Andi
We could pivot to 'long' format and then do a grouped sum:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('var'), names_to = 'group') %>%
group_by(week, group) %>%
summarise(summe = sum(value)) %>%
ungroup %>%
arrange(group) %>%
select(week, summe, group)
# A tibble: 9 x 3
# week summe group
# <int> <int> <chr>
#1 1 3 var1
#2 2 6 var1
#3 3 10 var1
#4 1 4 var2
#5 2 6 var2
#6 3 10 var2
#7 1 6 var3
#8 2 11 var3
#9 3 10 var3
We can also do the sum grouped by 'week' first and then pivot to 'long' format:
df %>%
group_by(week) %>%
summarise_at(vars(-group_cols()), sum) %>%
pivot_longer(cols = starts_with('var'), names_to = 'group', values_to = 'summe') %>%
select(week, summe, group)
data
df <- structure(list(week = c(1L, 1L, 2L, 2L, 3L, 3L), var1 = c(1L,
2L, 4L, 2L, 6L, 4L), var2 = c(2L, 2L, 4L, 2L, 6L, 4L), var3 = c(3L,
3L, 5L, 6L, 6L, 4L)), class = "data.frame", row.names = c(NA,
-6L))
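As a further alternative (a sketch, not part of the original answers), the for loop itself can be replaced by iterating over the column names with purrr::map_dfr, which row-binds the per-variable results automatically:
library(dplyr)
library(purrr)

cols <- c("var1", "var2", "var3")
total <- map_dfr(cols, function(v) {
  df %>%
    group_by(week) %>%
    summarise(summe = sum(.data[[v]]), .groups = "drop") %>%
    mutate(group = v)
})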

Formatting a data.frame with binary values

I have a dataframe with 4 columns and 4 rows. For simplicity, I changed it to numeric format. The schema is as follows:
df <- structure(list(a = c(1,2,2,0),
b = c(2,1,2,2),
c = c(2,0,1,0),
d = c(0,2,1,1)),row.names=c(NA,-4L) ,class = "data.frame")
a b c d
1 1 2 2 0
2 2 1 2 2
3 2 0 1 0
4 0 2 1 1
I would like to change this data frame and obtain the following:
1 2
1 a b/c
2 b a/c/d
3 c a
4 c/d b
Is there a function or package I should look into? I have been doing lots of text processing in R recently. I'd appreciate your assistance!
The tapply function with some row and column indexes (stealing df from Ronak's answer):
tapply(
colnames(df)[col(df)],
list(row(df), unlist(df)),
FUN=paste, collapse="/"
)[,-1]
# 1 2
#1 "a" "b/c"
#2 "b" "a/c/d"
#3 "c" "a"
#4 "c/d" "b"
Basically, I'm taking one long vector holding the column name of each cell in df and tabulating it by the combination of that cell's row number and its original value in df.
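To make that concrete, here is a small illustrative sketch (using a df like the one in the data block of the next answer) showing the three pieces that feed into tapply:
colnames(df)[col(df)]  # each cell's column name: "a" "a" "a" "a" "b" "b" ...
as.vector(row(df))     # each cell's row number:   1   2   3   4   1   2  ...
unlist(df)             # each cell's value:        1   2   2   0   2   1  ...
# tapply() collects the column names within each (row number, value) combination
# and pastes them together with "/"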
One way with dplyr and tidyr is to get the data in long format, remove 0 values, and paste the column names together for each row and value combination. Finally, reshape the data back to wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
filter(value != 0) %>%
group_by(row, value) %>%
summarise(val = paste(name, collapse = "/")) %>%
pivot_wider(names_from = value, values_from = val)
# row `1` `2`
# <int> <chr> <chr>
#1 1 a b/c
#2 2 b a/c/d
#3 3 c a
#4 4 c/d b
data
df <- structure(list(a = c(1L, 2L, 2L, 0L), b = c(2L, 1L, 0L, 2L),
c = c(2L, 2L, 1L, 1L), d = c(0L, 2L, 0L, 1L)), class = "data.frame",
row.names = c("1", "2", "3", "4"))

Compute relative frequencies with group totals using dplyr

I have the following toy data:
data <- structure(list(value = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), class = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("A",
"B"), class = "factor")), .Names = c("value", "class"), class = "data.frame", row.names = c(NA,
-16L))
Using the commands:
data <- table(data$class, data$value)
data <- as.data.frame(data)
data$rel_freq <- data$Freq / aggregate(Freq ~ Var1, FUN = sum, data = data)$Freq
I calculate appropriate relative frequencies for each value in each of the classes:
> data
Var1 Var2 Freq rel_freq
1 A 1 3 0.2727273
2 B 1 3 0.6000000
3 A 2 4 0.3636364
4 B 2 2 0.4000000
5 A 3 4 0.3636364
6 B 3 0 0.0000000
I wonder how to construct an equivalent dplyr pipeline. Pasted below is my attempt:
library(dplyr)
data %>%
group_by(value, class) %>%
summarise(n = n()) %>%
complete(class, fill = list(n = 0)) %>%
mutate(freq = n / sum(n))
It computes relative frequencies for each value, but, unfortunately, they are calculated within each value group instead of relative to the class totals:
Source: local data frame [6 x 4]
Groups: value [3]
value class n freq
<int> <fctr> <dbl> <dbl>
1 1 A 3 0.5000000
2 1 B 3 0.5000000
3 2 A 4 0.6666667
4 2 B 2 0.3333333
5 3 A 4 1.0000000
6 3 B 0 0.0000000
You only need to group by class for computing the frequencies, so remove the value grouping:
data %>%
group_by(value, class) %>%
summarise(n = n()) %>%
complete(class, fill = list(n = 0)) %>%
group_by(class) %>%
mutate(freq = n / sum(n))
# A tibble: 6 x 4
value class n freq
<int> <fctr> <dbl> <dbl>
1 1 A 3 0.2727273
2 1 B 3 0.6000000
3 2 A 4 0.3636364
4 2 B 2 0.4000000
5 3 A 4 0.3636364
6 3 B 0 0.0000000

Summary multiple columns with dplyr - categorical version

Following this question and this one, I wondered what the best option is to summarise categorical variables in one dataset.
I have a dataset such as
# A tibble: 10 × 4
empstat_couple nssec7_couple3 nchild07 age_couple
<chr> <fctr> <fctr> <dbl>
1 Neo-Trad Lower Managerial 1child 39
2 Neo-Trad Higher Managerial 1child 31
3 Neo-Trad Manual and Routine 1child 33
4 Trad Higher Managerial 1child 43
The first 3 variables are categorical (character or factor) and the last one is numerical.
What I would like is something like this (output):
var n p
1: Neo-Trad 6 0.6
2: OtherArrangment 2 0.2
3: Trad 2 0.2
4: Higher Managerial 4 0.4
5: Lower Managerial 5 0.5
6: Manual and Routine 1 0.1
7: 1child 9 0.9
8: 2children 1 0.1
Well for the numerical variable, I am unsure how to add it meaningfully to the summary.
I guess the most basic way to go is
library(dplyr)
library(data.table)
a = count(dt, empstat_couple) %>% mutate(p = n / sum(n))
b = count(dt, nssec7_couple3) %>% mutate(p = n / sum(n))
c = count(dt, nchild07) %>% mutate(p = n / sum(n))
rbindlist(list(a,b,c))
I wondered if a summarise_each solution exists?
This doesn't work
dt %>% summarise_each(funs(count))
Using apply I could come up with this
apply(dt, 2, function(x) as.data.frame(table(x))) %>% rbindlist()
But it's not great.
Any suggestions?
data
dt = structure(list(empstat_couple = c("Neo-Trad", "Neo-Trad", "Neo-Trad",
"Trad", "OtherArrangment", "Neo-Trad", "Trad", "OtherArrangment",
"Neo-Trad", "Neo-Trad"), nssec7_couple3 = structure(c(2L, 1L,
4L, 1L, 2L, 2L, 1L, 2L, 1L, 2L), .Label = c("Higher Managerial",
"Lower Managerial", "Intermediate", "Manual and Routine"), class = "factor"),
nchild07 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
1L), .Label = c("1child", "2children", ">2children"), class = "factor"),
age_couple = c(39, 31, 33, 43, 32, 28, 28, 40, 33, 26), hldid = 1:10), .Names = c("empstat_couple",
"nssec7_couple3", "nchild07", "age_couple", "hldid"), row.names = c(NA,
-10L), class = "data.frame")
We can melt with data.table and get the count (.N) and the proportion:
library(data.table)
unique(melt(setDT(dt), id.var = "age_couple")[, n := .N , value],
by = c("variable", "value", "n"))[, p := n/sum(n), variable
][, c("age_couple", "variable" ) := NULL][]
Or using dplyr/tidyr
library(dplyr)
library(tidyr)
gather(dt, var1, var, -age_couple) %>%
group_by(var) %>%
mutate(n = n()) %>%
select(-age_couple) %>%
unique() %>%
group_by(var1) %>%
mutate(p= n/sum(n)) %>%
ungroup() %>%
select(-var1)
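With current versions of the packages, roughly the same result can be written with pivot_longer and count. This is a sketch under the assumption that tidyr >= 1.0 and dplyr >= 1.0 are available, not part of the original answers:
library(dplyr)
library(tidyr)

dt %>%
  select(empstat_couple, nssec7_couple3, nchild07) %>%
  # bring factors and characters to a common type before reshaping
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "var") %>%
  count(variable, var) %>%
  group_by(variable) %>%
  mutate(p = n / sum(n)) %>%
  ungroup() %>%
  select(var, n, p)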
