I have a unique problem where I would like to add a column of percentiles for each group in a data frame. Here is how my data look like:
library(tidyverse)
set.seed(123)
df <- tibble(id = 1:100,
group = rep(letters[1:4], 25),
x = c(sample(1:100, 25, replace = T),
sample(101:200, 25, replace = T),
sample(201:300, 25, replace = T),
sample(301:400, 25, replace = T)))
> df
# A tibble: 100 x 3
id group x
<int> <chr> <int>
1 1 a 78
2 2 b 80
3 3 c 7
4 4 d 100
5 5 a 45
6 6 b 76
7 7 c 25
8 8 d 91
9 9 a 13
10 10 b 84
# ... with 90 more rows
# Function to create a table ten percentiles for a numeric vector
percentiles_table <- function(x) {
res <- round(quantile(x, probs = seq(from=.1, to=1, by=0.1)), 0)
res <- data.frame(percentile = names(res), to = res )
res <- res %>%
mutate(from = lag(to, default = 0)) %>%
select(from,to,percentile)
}
# Table of percentiles
percentiles <- df %>%
group_by(group) %>%
summarise(percentiles_table(x)) %>%
ungroup()
> percentiles
# A tibble: 40 x 4
group from to percentile
<chr> <dbl> <dbl> <chr>
1 a 0 25 10%
2 a 25 71 20%
3 a 71 106 30%
4 a 106 125 40%
5 a 125 198 50%
6 a 198 236 60%
7 a 236 278 70%
8 a 278 325 80%
9 a 325 379 90%
10 a 379 389 100%
I would like to add the percentile column to df for each group where the value of x falls between from and to.
There might be some way to calculate the percentile column directly without having it calculated in a separated data.frame and then appending it back to df.
A one-liner with my santoku package:
library(santoku)
df |>
group_by(group) |>
mutate(
percentile = chop_quantiles(x, 0:100/100,
labels = lbl_endpoint())
)
# A tibble: 100 × 4
# Groups: group [4]
id group x percentile
<int> <chr> <int> <fct>
1 1 a 35 8%
2 2 b 97 20%
3 3 c 39 4%
4 4 d 20 8%
5 5 a 89 16%
...
Using data.table:
setDT(df)[
,
percentile := cut(
x,
quantile(x, seq(0, 1, 0.1)),
include.lowest = TRUE,
labels = paste0(seq(10, 100, 10), "%")
),
by = group
]
install.packages("zoo")
library(zoo)
y=as.data.frame(c(0:max(percentiles$to)))
y=merge(y,unique(percentiles[,c(1)]))
y=merge(y,percentiles[,c(1,2,4)], by.x = c("group","c(0:max(percentiles$to))"), by.y = c("group","from"), all.x = TRUE)
y=na.locf(y)
df=merge(df,y, all.x = TRUE, by.x = c("group","x"), by.y = c("group","c(0:max(percentiles$to))"))
I got this working solution.
percentile_ranks <- function(x) {
res <- trunc(rank(x))/length(x) * 100
res <- floor(res/10) }
df <- df %>%
group_by(group) %>%
arrange(x) %>%
mutate(percentile = percentile_ranks(x)) %>%
mutate(percentile_pct = paste0(percentile*10,"%")) %>%
ungroup() %>%
arrange(id) # original data.frame order
Related
This question already has answers here:
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 3 years ago.
I have the following data:
set.seed(789)
df_1 = data.frame(a = 22, b = 24, c = rnorm(10))
df_2 = data.frame(a = 44, b = 24, c = rnorm(10))
df_3 = data.frame(a = 33, b = 99, c = rnorm(10))
df_all = rbind(df_1, df_2, df_3)
I need to group df_all by column a and b, and then find the 50th quantile based on column c.
This can be done singularly, for each df, as follows:
df_1_q = quantile(df_1$c, probs = 0.50)
df_2_q = quantile(df_2$c, probs = 0.50)
df_3_q = quantile(df_3$c, probs = 0.50)
However my real df_all is larger than this.
And more generally, how can I group a data.frame by rows and apply a given function?
thanks
You could use dplyr for that
library(dplyr)
df_all %>%
group_by(a, b) %>%
summarise(quantile = quantile(c, probs = 0.5))
# A tibble: 3 x 3
# Groups: a [?]
a b quantile
<dbl> <dbl> <dbl>
1 22 24 -0.268
2 33 99 -0.234
3 44 24 -0.445
Or using data.table as:
library(data.table)
dt <- data.table(df_all)
dt[,list(quantile=quantile(c, probs = 0.5)),by=c("a", "b")]
a b quantile
1: 22 24 -0.2679104
2: 44 24 -0.4450979
3: 33 99 -0.2336712
I would like to use a category from one data frame and apply it to another based on a similar column (merge). But, the merge needs to consider a range of data points that are found between two columns. I have an example below.
set.seed(123)
df_1 <- tibble(
x = c(0, 500, 1000, 1500, 2000),
y = c(499, 999, 1499, 1999, 99999),
desc = LETTERS[1:5]
)
> df_1
# A tibble: 5 x 3
x y desc
<dbl> <dbl> <chr>
1 0 499 A
2 500 999 B
3 1000 1499 C
4 1500 1999 D
5 2000 99999 E
df_2 <- tibble(
code = sample(1:2500,5,F)
)
>df_2
# A tibble: 5 x 1
code
<int>
1 719
2 1970
3 1022
4 2205
5 2348
## desired output
df_2 %>%
mutate(desc = c('B', 'D', 'C', 'E', 'E'))
# A tibble: 5 x 2
code desc
<int> <chr>
1 719 B
2 1970 D
3 1022 C
4 2205 E
5 2348 E
My first thought was to split df_1 and merge somehow, but I'm stuck on how to deal with the range of values found in x and y. Any ideas?
This is an easy problem to handle in SQL, so one option would be to use the sqldf package, with this query:
SELECT t2.code, COALESCE(t1.desc, '') AS desc
FROM df_2 t2
LEFT JOIN df_1 t1
ON t2.code BETWEEN t1.x AND t1.y;
R code:
library(sqldf)
sql <- paste0("SELECT t2.code, COALESCE(t1.desc, '') AS desc ",
"FROM df_2 t2 LEFT JOIN df_1 t1 ON t2.code BETWEEN t1.x AND t1.y")
result <- sqldf(sql)
library(tidyverse)
set.seed(123)
df_1 <- tibble(
x = c(0, 500, 1000, 1500, 2000),
y = c(499, 999, 1499, 1999, 99999),
desc = LETTERS[1:5]
)
df_2 <- tibble(
code = sample(1:2500,5,F)
)
df_1 %>%
mutate(code = map2(x, y, ~seq(.x, .y, 1))) %>% # create a sequence of numbers with step = 1
unnest() %>% # unnest data
inner_join(df_2, by="code") %>% # join df_2
select(-x, -y) # remove columns
# # A tibble: 5 x 2
# desc code
# <chr> <dbl>
# 1 B 719
# 2 C 1022
# 3 D 1970
# 4 E 2205
# 5 E 2348
This seems to work, but is not very tidyverse-ish:
df_2 %>% mutate(v = with(df_1, desc[ findInterval(code, x) ]))
code v
1 719 B
2 1970 D
3 1022 C
4 2205 E
5 2348 E
This only uses the x column, so the assumption is that there are no gaps in the ranges (y is always one below the next x).
I have the following data frame:
set.seed(42)
df <- data_frame(x = sample(0:100, 50, replace = T),
y = sample(c(T, F), 50, replace = T))
I would like to create a third column z that will be the sum of column x, but only if there are more than 3 trues in a row in column y.
Is there a vectorized way to do it with dplyr? I don't even know how to approach this.
We create a grouping variable with rleid from data.table and get the sum of 'x' if there are more than 3 elements (n() >3) and if all the elements in 'y' are TRUE or else return NA
library(dplyr)
library(data.table)
df %>%
group_by(grp = rleid(y)) %>%
mutate(Sum = if(n() > 3 & all(y)) sum(x) else NA_integer_) %>%
ungroup %>%
select(-grp)
It can be also done with data.table
library(data.table)
setDT(df)[, Sum := sum(x) * NA^(!((.N > 3) & all(y))), .(grp = rleid(y))]
The question did not specify what values to use if there are not 3 TRUE values so we will use 0.
library(dplyr)
library(zoo)
sum3 <- function(z) all(z[, "y"]) * sum(z[, "x"])
df %>% mutate(sum = rollapplyr(df, 3, sum3, by.column = FALSE, fill = 0))
giving:
# A tibble: 50 x 3
x y sum
<int> <lgl> <int>
1 92 TRUE 0
2 94 TRUE 0
3 28 TRUE 214
4 83 FALSE 0
5 64 TRUE 0
6 52 FALSE 0
7 74 FALSE 0
8 13 TRUE 0
9 66 TRUE 0
10 71 FALSE 0
# ... with 40 more rows
Say that I have this data.frame, data:
data <- data.frame(val=c(rep(6,10), rep(7, 15), rep(8, 20), rep(9, 25), rep(10, 100), rep(11, 20), rep(12, 15), rep(13, 10)))
data$plus <- data$val + 100
My goal is to create a new data.frame that has the frequencies of each val, and the associated plus value.
My current strategy is to create a table (called table), then merge the frequencies. Then to keep only the first observation within each group:
table <- table(data$val)
df1 <- data.frame(val = as.integer(names(table)[1:length(table)]), N = table[1:length(table)])
df2 <- merge(data, df1)
df3 <- do.call(rbind, by(df2, list(df2$val), FUN=function(x) head(x, 1)))
This works, but it seems clunky.
In Stata, for example, it would be less and simpler code. Something like:
bys val plus: egen max = _N
bys val plus: gen first = _n==1
keep if first==1
Is there a way to simplify or make more elegant the R code?
Here's an approach using "data.table":
library(data.table)
as.data.table(data)[, N := .N, by = val][, .SD[1], by = val]
# val plus N
# 1: 6 106 10
# 2: 7 107 15
# 3: 8 108 20
# 4: 9 109 25
# 5: 10 110 100
# 6: 11 111 20
# 7: 12 112 15
# 8: 13 113 10
## Or (#RicardoSaporta)
as.data.table(data)[, list(.N, plus=plus[1]), by = val]
## Or (#DavidArenburg)
unique(as.data.table(data)[, N := .N, by = val], by = "val")
With "dplyr", you can try:
library(dplyr)
data %>%
group_by(val) %>%
mutate(N = n()) %>%
slice(1)
In base R, I guess you can try something like:
do.call(rbind, lapply(split(data, data$val),
function(x) cbind(x, N = nrow(x))[1, ]))
Edited
Or you can use aggregate()
data$N = 0
out = aggregate(N ~ val + plus, data = data, length)
or else
out = aggregate(plus ~val, data = data,function(x) c(unique(x), N = length(x)))
do.call(data.frame, out)
or using ddply
library(plyr)
out = ddply(data, .(val,plus), summarize, N = length(plus))
#> out
# val plus N
#1 6 106 10
#2 7 107 15
#3 8 108 20
#4 9 109 25
#5 10 110 100
#6 11 111 20
#7 12 112 15
#8 13 113 10
df1 <-
data.frame(c("male", "female", "male"),
c("1", "2", "3", "4", "5", "6"),
seq(141, 170))
names(df1) = c("gender", "age", "height")
df1$age <- factor(
df1$age,
levels = c(1, 2, 3, 4, 5, 6),
labels = c("16-24", "25-34", "35-44", "45-54", "55-64", "65+")
)
q1a = c(1, 0, 1, 0, 0, 1)
q1b = c(0, 0, 2, 2, 2, 0)
q1c = c(0, 0, 3, 3, 0, 3)
# 1,2 and 3 used to be compatible with existing datasets.
# Could change all to 1 if necessary.
df2 <- data.frame(q1a = q1a, q1b = q1b, q1c = q1c)
df1 <- cbind(df1, df2)
rm(q1a, q1b, q1c, df2)
I am looking to replicate the analysis of multiple response questions from SPSS in R.
At the moment I am using this code:
#creating function for analysing questions with grouped data
multfreqtable <- function(a, b, c) {
# number of respondents (for percent of cases)
totrep = sum(a == 1 | b == 2 | c == 3)
#creating frequency table
table_a = data.frame("a", sum(a == 1))
names(table_a) = c("question", "freq")
table_b = data.frame("b", sum(b == 2))
names(table_b) = c("question", "freq")
table_c = data.frame("c", sum(c == 3))
names(table_c) = c("question", "freq")
table_question <- rbind(table_a, table_b, table_c)
#remove individual question tables
rm(table_a, table_b, table_c)
#adding total
total = as.data.frame("Total")
totalsum = (sum(table_question$freq, na.rm = TRUE))
totalrow = cbind(total, totalsum)
names(totalrow) = c("question", "freq")
table_question = rbind(table_question, totalrow)
#adding percentage column to frequency table
percentcalc = as.numeric(table_question$freq)
percent = (percentcalc / totalsum) * 100
table_question <- cbind(table_question, percent)
#adding percent of cases column to frequency table
poccalc = as.numeric(table_question$freq)
percentofcases = (poccalc / totrep) * 100
table_question <- cbind(table_question, percentofcases)
#print percent of cases value
total_respondents <<- data.frame(totrep)
#remove all unnecessary data and values
rm(
total,
totalsum,
totalrow,
b,
c,
percent,
percentcalc,
percentofcases,
totrep,
poccalc
)
return(table_question)
}
#calling function - must tie to data.frame using $ !!!
q1_frequency <- multfreqtable(df1$q1a, df1$q1b, df1$q1c)
#renaming percent of cases - This is very important while using current method
total_respondents_q1 <- total_respondents
rm(total_respondents)
Producing this table as a result:
I am looking for a more efficient method of doing this that ideally would not require the function to be edited if there were more or less multiple choice questions.
Your function is actually far too complicated for what you need to do. I think a function like this should work and be more flexible.
multfreqtable = function(data, question.prefix) {
# Find the columns with the questions
a = grep(question.prefix, names(data))
# Find the total number of responses
b = sum(data[, a] != 0)
# Find the totals for each question
d = colSums(data[, a] != 0)
# Find the number of respondents
e = sum(rowSums(data[,a]) !=0)
# d + b as a vector. This is your overfall frequency
f = as.numeric(c(d, b))
data.frame(question = c(names(d), "Total"),
freq = f,
percent = (f/b)*100,
percentofcases = (f/e)*100 )
}
Add another question to your example dataset:
set.seed(1); df1$q2a = sample(c(0, 1), 30, replace=T)
set.seed(2); df1$q2b = sample(c(0, 2), 30, replace=T)
set.seed(3); df1$q2c = sample(c(0, 3), 30, replace=T)
Make a table for "q1" responses:
> multfreqtable(df1, "q1")
question freq percent percentofcases
1 q1a 15 33.33333 60
2 q1b 15 33.33333 60
3 q1c 15 33.33333 60
4 Total 45 100.00000 180
Make a table for "q2" responses:
> multfreqtable(df1, "q2")
question freq percent percentofcases
1 q2a 14 31.11111 53.84615
2 q2b 13 28.88889 50.00000
3 q2c 18 40.00000 69.23077
4 Total 45 100.00000 173.07692
Tables for multiple questions
Here's a modified version of the function that allows you to create a list of tables for multiple questions at once:
multfreqtable = function(data, question.prefix) {
z = length(question.prefix)
temp = vector("list", z)
for (i in 1:z) {
a = grep(question.prefix[i], names(data))
b = sum(data[, a] != 0)
d = colSums(data[, a] != 0)
e = sum(rowSums(data[,a]) !=0)
f = as.numeric(c(d, b))
temp[[i]] = data.frame(question = c(sub(question.prefix[i],
"", names(d)), "Total"),
freq = f,
percent = (f/b)*100,
percentofcases = (f/e)*100 )
names(temp)[i] = question.prefix[i]
}
temp
}
Examples:
> multfreqtable(df1, "q1")
$q1
question freq percent percentofcases
1 a 15 33.33333 60
2 b 15 33.33333 60
3 c 15 33.33333 60
4 Total 45 100.00000 180
> test1 = multfreqtable(df1, c("q1", "q2"))
> test1
$q1
question freq percent percentofcases
1 a 15 33.33333 60
2 b 15 33.33333 60
3 c 15 33.33333 60
4 Total 45 100.00000 180
$q2
question freq percent percentofcases
1 a 14 31.11111 53.84615
2 b 13 28.88889 50.00000
3 c 18 40.00000 69.23077
4 Total 45 100.00000 173.07692
> test1$q1
question freq percent percentofcases
1 a 15 33.33333 60
2 b 15 33.33333 60
3 c 15 33.33333 60
4 Total 45 100.00000 180
I've noticed that this is post is quite old, however I couldn’t find a more up to date solution. Here's my version based on dplyr/tidyverse approach.
mult_resp = function(df1, mv_q = c("q1a", "q1b", "q1c")){
df2 = df1 %>%
mutate(id = rownames(.)) %>% #row id for counting n_cases
select(id, everything()) %>%
mutate_at(mv_q, ~ ifelse(. != 0, 1, 0)) %>%
gather(question, resp,-id, -gender,-age,-height)
#count number of cases excluding "all zeros" cases
n_cases = df2 %>% group_by(id) %>%
summarise(n = sum(resp)) %>%
summarise(sum(n > 0))
#output table
res = df2 %>%
group_by(question) %>%
summarise(freq = sum(resp)) %>%
mutate(
percent = freq/sum(freq) *100,
percent_of_cases = freq/as.numeric(n_cases)*100
) %>%
rbind(.,
data.frame(question ="Total",
freq =sum(.$freq, na.rm=TRUE),
percent =sum(.$percent, na.rm=TRUE),
percent_of_cases = sum(.$percent_of_cases, na.rm=TRUE)
)
)
res
}
Example:
> mult_resp(df1, mv_q = c("q1a", "q1b", "q1c"))
# A tibble: 4 x 4
question freq percent percent_of_cases
<chr> <dbl> <dbl> <dbl>
1 q1a 15 33.3 60
2 q1b 15 33.3 60
3 q1c 15 33.3 60
4 Total 45 100. 180
It's an old question. However, you can use userfriendlyscience package to analyze multiple responses survey data very easily.
library(userfriendlyscience)
multiResponse (data, c('v1', 'v2', 'v3'))