(R): Calculate quantile by unique row value unification

I have a df like this:
> df <- data.frame(Client.code = c(100451,100451,100523,100523,100523,100525),
+                  dayref = c(24,30,15,13,17,5))
> df
  Client.code dayref
1      100451     24
2      100451     30
3      100523     15
4      100523     13
5      100523     17
6      100525      5
It is a one-year distribution of payment periods (days from issue).
Using the data above, and given a df2 like this:
  Client.Code Days
1      100451   16
2      100523   16
3      100460   35
Since I have enough data for reasonable quantile probability calculations, I would like to know how to build a loop that assigns to every row of days in df2 a quantile according to the first df.

We can use data.table
library(data.table)
setDT(df)[, .(Quantile = quantile(dayref)), Client.code]
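On the sample df this returns five rows per client, one for each of quantile()'s default probabilities (0%, 25%, 50%, 75%, 100%); for client 100523, for example, they are 13, 14, 15, 16, 17.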
Or with tidyverse
library(dplyr)
library(tidyr)
df %>%
  group_by(Client.code) %>%
  summarise(Quantile = list(quantile(dayref))) %>%
  unnest(cols = c(Quantile))

tapply(df$dayref, df$Client.code, quantile)
You can request specific percentiles by passing a vector of probabilities:
tapply(df$dayref, df$Client.code, quantile, 1:19/20)
You may need to name the argument explicitly:
tapply(df$dayref, df$Client.code, quantile, probs = 1:19/20)
And you can add na.rm = TRUE as another argument if you might have NAs
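Neither snippet assigns a quantile to each row of df2 yet. A minimal sketch of that last step, using ecdf() to turn each client's dayref values into an empirical percentile function (this assumes df2's Client.Code is meant to match df's Client.code, and that clients absent from df, such as 100460, should get NA):
df2$Quantile <- mapply(function(code, days) {
  x <- df$dayref[df$Client.code == code]      # that client's payment history
  if (length(x)) ecdf(x)(days) else NA_real_  # empirical percentile of 'days'
}, df2$Client.Code, df2$Days)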

Related

Count rows by group in dplyr: Evaluation error

I'm trying to count the number of rows using dplyr after using group_by. I have the following data:
scenario pertubation population
       A           1         20
       B           1         30
       C           1         40
       D           1         50
       A           2         15
       B           2         25
And I'm using the following code to group_by and mutate:
test <- all_scenarios %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population)),
         exceedance_probability = rank / count(pertubation)) %>%
  select(scenario, pertubation, population, rank, exceedance_probability)
But I keep encountering this error message, and I am unsure of what it means or why I keep getting it:
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "c('integer', 'numeric')".
I would like my output data to look something like this:
scenario pertubation population rank exceedance_probability
       A           1         20   12                   0.06
       B           1         30    7                  0.035
       C           1         40    2                   0.01
       D           1         50    1                  0.005
       A           2         15   34                   0.17
       B           2         25   28                   0.14
To calculate the exceedance probability I just need to divide the rank by the number of observations, but I've found it hard to do this in dplyr after a group_by statement. Am I ordering the dplyr statements incorrectly?
We can get the count separately and join with the original dataset
all_scenarios %>%
  count(pertubation) %>%
  left_join(all_scenarios, ., by = 'pertubation') %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population)),
         exceedance_probability = rank / n)
Or instead of using count, we can do a second group_by and get the n()
all_scenarios %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population))) %>%
  group_by(pertubation) %>%
  mutate(exceedance_probability = rank / n())
Your issue comes from the
count(pertubation)
part of the code. count() expects a data frame as its first argument, so inside mutate() it receives the bare pertubation vector and fails with the "no applicable method for 'groups'" error above. Just use
n()
in its place. Since you're grouping by scenario, and each scenario-pertubation pair is unique in your dataset, counting the number of rows in each scenario effectively counts the number of values of pertubation for each scenario.
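Applied to the original pipeline, the fix is a minimal sketch like this (keeping the asker's column names):
test <- all_scenarios %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population)),
         exceedance_probability = rank / n())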

How to group identical instances in R into one and, at the same time, generate frequency and average stats?

I'm at the last stage of cleaning/organizing data and would appreciate suggestions for this step. I'm new to R and don't understand fully how dataframes or other data types work. (I'm trying to learn but have a project due so need a quick solution). I've imported the data from a CSV file.
I want to group instances with the same (date, ID1, ID2, ID3). I want the average of all stats in the output and also a new column with the number of instances grouped.
Note: ID3 contains <NA> values. I'd like to rename these before grouping.
I've tried these solutions:
tdata$ID3[is.na(tdata$ID3)] <- "NA"
tdata[["ID3"]][is.na(tdata[["ID3"]])] <- "NA"
But I get this warning (the assignment generates NA instead):
In `[<-.factor`(`*tmp*`, is.na(tdata[["ID3"]]), value = c(3L, 3L, :
invalid factor level, NA generated
The data is:
      date    ID1    ID2    ID3 stat1 stat2 stat.3
1 12-03-07 abc123 wxy456 pqr123    10    20     30
2 12-03-07 abc123 wxy456 pqr123    20    40     60
3 10-04-07 bcd456 wxy456 hgf356    10    20     40
4 12-03-07 abc123 wxy456 pqr123    30    60     90
5  5-09-07 spa234 int345   <NA>    40    50     70
Desired Output
    date    ID1    ID2    ID3 n stat1 stat2 stat.3
12-03-07 abc123 wxy456 pqr123 3    20    40     60
10-04-07 bcd456 wxy456 hgf356 1    10    20     40
05-09-07 spa234 int345 big234 1    40    50     70
I tried this solution: How to merge multiple data.frames and sum and average columns at the same time in R
But I was not successful merging the columns which have to be grouped and tested for similarity.
DF <- merge(tdata$date, tdata$ID1, tdata$ID2, tdata$ID3, by = "Name", all = T)
Error in fix.by(by.x, x) : 'by' must specify uniquely valid columns
Finally, to generate the n column: perhaps insert a column of 1s and take its sum while summarizing?
We can do this with dplyr. After grouping by the 'ID' columns, add 'date' and 'n' to the grouping variables as well, and get the mean of the 'stat' columns:
library(dplyr)
df1 %>%
  group_by(ID1, ID2, ID3) %>%
  group_by(date = first(date), n = n(), add = TRUE) %>%
  summarise_at(vars(matches("stat")), mean)
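In current dplyr, group_by()'s add argument has become .add, and across() supersedes summarise_at(); a sketch of the equivalent modern pipeline:
df1 %>%
  group_by(ID1, ID2, ID3) %>%
  group_by(date = first(date), n = n(), .add = TRUE) %>%
  summarise(across(matches("stat"), mean), .groups = "drop")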
NOTE: Regarding changing the NA to 'big234', we can convert ID3 to character class and change it before doing the above operation:
df1$ID3 <- as.character(df1$ID3)
df1$ID3[is.na(df1$ID3)] <- "big234"
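Alternatively (a sketch), ID3 can stay a factor if the new level is added first; the missing level is exactly why the original assignment produced the "invalid factor level" warning:
levels(df1$ID3) <- c(levels(df1$ID3), "big234")  # register the new level
df1$ID3[is.na(df1$ID3)] <- "big234"              # now the assignment is valid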
While I find the dplyr solution proposed by akrun very intuitive, there is also a nice data.table solution. Like akrun, I assume that the NA value has been converted to 'big234' to get the desired result.
library(data.table)
# convert data.frame to data.table
data <- data.table(df1)
# return the desired output
data[, c(.N, lapply(.SD, mean)), by = list(date, ID1, ID2, ID3)]
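For completeness, a base-R sketch of the same idea with aggregate() (assuming ID3 has already been recoded as above, since the formula interface drops rows containing NA):
# group means of the stat columns
agg <- aggregate(cbind(stat1, stat2, stat.3) ~ date + ID1 + ID2 + ID3,
                 data = df1, FUN = mean)
# group sizes, produced in the same group order
agg$n <- aggregate(stat1 ~ date + ID1 + ID2 + ID3, data = df1,
                   FUN = length)$stat1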

How to subset data for a specific column with ddply?

I would like to know if there is a simple way to achieve what I describe below using ddply. My data frame describes an experiment with two conditions. Participants had to select between options A and B, and we recorded how long they took to decide, and whether their responses were accurate or not.
I use ddply to create averages by condition. The column nAccurate summarizes the number of accurate responses in each condition. I also want to know how much time they took to decide and express it in the column RT. However, I want to calculate average response times only when participants got the response right (i.e. Accuracy==1). Currently, the code below can only calculate average reaction times for all responses (accurate and inaccurate ones). Is there a simple way to modify it to get average response times computed only in accurate trials?
See sample code below and thanks!
library(plyr)
# Create sample data frame.
Condition = c(rep(1,6), rep(2,6)) #two conditions
Response = c("A","A","A","A","B","A","B","B","B","B","A","A") #whether option "A" or "B" was selected
Accuracy = rep(c(1,1,0),4) #whether the response was accurate or not
RT = c(110,133,121,122,145,166,178,433,300,340,250,674) #response times
df = data.frame(Condition,Response, Accuracy,RT)
head(df)
  Condition Response Accuracy  RT
1         1        A        1 110
2         1        A        1 133
3         1        A        0 121
4         1        A        1 122
5         1        B        1 145
6         1        A        0 166
# Calculate averages.
avg <- ddply(df, .(Condition), summarise,
             N = length(Response),
             nAccurate = sum(Accuracy),
             RT = mean(RT))
# The problem: response times are calculated over all trials. I would like
# to calculate mean response times *for accurate responses only*.
avg
  Condition N nAccurate       RT
1         1 6         4 132.8333
2         2 6         4 362.5000
With plyr, you can do it as follows:
ddply(df, .(Condition), summarise,
      N = length(Response),
      nAccurate = sum(Accuracy),
      RT = mean(RT[Accuracy==1]))
this gives:
   Condition N nAccurate     RT
1:         1 6         4 127.50
2:         2 6         4 300.25
If you use data.table, then this is an alternative way:
library(data.table)
setDT(df)[, .(N = .N,
              nAccurate = sum(Accuracy),
              RT = mean(RT[Accuracy==1])),
          by = Condition]
Using dplyr package:
library(dplyr)
df %>%
  group_by(Condition) %>%
  summarise(N = n(),
            nAccurate = sum(Accuracy),
            RT = mean(RT[Accuracy == 1]))
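As a quick base-R cross-check of just the conditional means (a sketch using the sample data above):
with(df, tapply(RT[Accuracy == 1], Condition[Accuracy == 1], mean))
#      1      2
# 127.50 300.25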

Converting a for loop into a dplyr statement - deviation from mean

My aim is to obtain the deviations of a measure from the mean of that measure, per group.
My data look like this:
  Cluster Media_Name count
1       1  20minutes     9
2       1        AFP     7
3       1        BFM     5
4       1      BFMTV     6
5       2        AFP    12
6       2        BFM     4
7       2      BFMTV     5
As a reproducible example:
data <- data.frame(Cluster = c("1","1","1","1","2","2","2"),
                   Media_Name = c("20Minutes", "AFP", "BFM", "BFMTV", "AFP", "BFM", "BFMTV"),
                   count = c(9,7,5,6,12,4,5))
So I have got two categorical variables (Cluster and Media_Name) and the count of observations for each pairing.
In order to get a new variable called deviationFromClusterMean I work in two steps:
1- I calculate the mean number of occurrences (count) for the variable Cluster
clusterMean <- data %>% group_by(Cluster) %>% summarise(clusterMean = mean(count))
2- I use a for loop to obtain, for each Media_Name (the second categorical variable), the deviation from the cluster mean:
for (i in 1:nrow(data)) {
  cluster <- data$Cluster[i]
  moyenneducluster <- clusterMean$clusterMean[clusterMean$Cluster == cluster]
  data$deviationFromClusterMean[i] <- data$count[i] / moyenneducluster
}
It looks pretty ugly to me, and I am sure that I can apply the split-apply-combine strategy here. However, the best I can do is not working:
data %>% group_by(Media_Name, Cluster) %>% do(mutate(deviationFromClusterMean = count/clusterMean[clusterMean$Cluster == .$Cluster,]$clusterMean))
Any idea?
You don't need to define clusterMean separately. The following should work:
data %>%
  group_by(Cluster) %>%
  mutate(deviationFromClusterMean = count/mean(count))
You can also use ave from base R
with(data, count/ave(count, Cluster, FUN=mean))
#[1] 1.3333333 1.0370370 0.7407407 0.8888889 1.7142857 0.5714286 0.7142857
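To store the result back in the data frame, a one-line equivalent of the loop above:
data$deviationFromClusterMean <- with(data, count/ave(count, Cluster, FUN = mean))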

dplyr idiom for summarize() a filtered-group-by, and also replace any NAs due to missing rows

I am computing a dplyr::summarize across a dataframe of sales data.
I do a group-by on (S,D,Y), then within each group compute medians and means of X for weeks 5..43, then merge those back into the parent df. The variable X is sales. X is never NA (i.e. there are no explicit NAs anywhere in df), but if there is no data (as in, no sales) for a given S,D,Y and set of weeks, there will simply be no row with those values in df (take that to mean zero sales for that particular set of parameters). In other words, I want to impute X=0 in any structurally missing rows (but I hope I don't need to melt/cast the original df, to avoid bloat; similar to cast(fill..., add.missing=T) or caret::preProcess()).
Two questions about my code idiom:
a) Is it better to subset repeatedly inside summarize than to use dplyr::filter? filter physically drops rows, so I have to assign the results to df.tmp and then left-join it back to the original df (as below). On the other hand, big subsetting expressions repeated on every single line of the summarize computations make the code harder to read. Should I worry (or not) about caching the rows or logical indices of the subsetting operation, in the general case where I might be computing say n=20 new summary variables?
b) Not all combinations of S,D,Y-groups and the filter (for those weeks) have rows, so how do I get the summarize to replace NA on any missing rows? Currently I do as below.
Sorry, both the code and dataset are proprietary, but here's the code idiom; further down is code you should run first to generate the sample data:
# Compute median, mean of X across wks 5..43, for that set of S,D,Y-values
# Issue a) filter() or repeatedly use subset() within each calculation?
df.tmp <- df %>% group_by(S,D,Y) %>% filter(Week>=5 & Week<=43) %>%
  summarize(ysd_med543_X = median(X),
            ysd_mean543_X = mean(X)) %>% ungroup()
# Issue b) how to replace NAs in groups where the group_by-and-filter gave empty output?
# can you merge this code with the summarize above?
df <- left_join(df, df.tmp, copy = FALSE)
newcols <- match(c('ysd_mean543_X','ysd_med543_X'), names(df))
df[!complete.cases(df[,newcols]), newcols] <- c(0.0, 0.0)
and run this first to generate sample-data:
set.seed(1234)
rep_vector <- function(vv, n) {
  unlist(as.vector(lapply(vv, function(...) rep(..., n))))
}
n <- 7
m <- 3
df <- data.frame(S = rep_vector(10:12, n), D = 20:26,
                 Y = rep_vector(2005:2007, n),
                 Week = round(52*runif(m*n)),
                 X = 4e4*runif(m*n) + 1e4)
# Now drop some rows, to model structurally missing rows
I <- sort(sample(1:nrow(df), 0.6*nrow(df)))
df <- df[I,]
require(dplyr)
I don't think this has anything to do with the feature you've linked in the comments (because, IIUC, that feature has to do with unused factor levels). Once you filter your data, IMO summarise should not (or rather can't?) include those groups in the results (with the exception of factors). You should clarify this with the developers on their project page.
I'm by no means a dplyr expert, but I think, firstly, it'd be better to filter first and then do group_by + summarise; otherwise you'd be filtering within each group, which is unnecessary. That is:
df.tmp <- df %>% filter(Week>=5 & Week<=43) %>% group_by(S,D,Y) %>% ...
This is just so that you're aware of it for any future cases.
IMO, it's better to use mutate here instead of summarise, as it removes the need for left_join, IIUC. That is:
df.tmp <- df %>% group_by(S,D,Y) %>%
  mutate(md_X = median(X[Week >= 5 & Week <= 43]),
         mn_X = mean(X[Week >= 5 & Week <= 43]))
Here we still have the issue of replacing the NA/NaN. There's no easy/direct way to sub-assign here, so you'll have to use ifelse, once again IIUC. It would be a little nicer if mutate supported multi-statement expressions. What I have in mind is something like:
df.tmp <- df %>% group_by(S,D,Y) %>% mutate(
  { tmp = Week >= 5 & Week <= 43;
    md_X = ifelse(length(tmp), median(X[tmp]), 0),
    mn_X = ifelse(length(tmp), mean(X[tmp]), 0)
  })
So we'll probably have to work around it in this manner:
df.tmp <- df %>% group_by(S,D,Y) %>% mutate(tmp = Week >= 5 & Week <= 43)
df.tmp %>% mutate(md_X = ifelse(tmp[1L], median(X), 0),
                  mn_X = ifelse(tmp[1L], mean(X), 0))
Or to put things together:
df %>% group_by(S,D,Y) %>% mutate(tmp = Week >= 5 & Week <= 43,
                                  md_X = ifelse(tmp[1L], median(X), 0),
                                  mn_X = ifelse(tmp[1L], mean(X), 0))
# S D Y Week X tmp md_X mn_X
# 1 10 20 2005 6 22107.73 TRUE 22107.73 22107.73
# 2 10 23 2005 32 18751.98 TRUE 18751.98 18751.98
# 3 10 25 2005 33 31027.90 TRUE 31027.90 31027.90
# 4 10 26 2005 0 46586.33 FALSE 0.00 0.00
# 5 11 20 2006 12 43253.80 TRUE 43253.80 43253.80
# 6 11 22 2006 27 28243.66 TRUE 28243.66 28243.66
# 7 11 23 2006 36 20607.47 TRUE 20607.47 20607.47
# 8 11 24 2006 28 22186.89 TRUE 22186.89 22186.89
# 9 11 25 2006 15 30292.27 TRUE 30292.27 30292.27
# 10 12 20 2007 15 40386.83 TRUE 40386.83 40386.83
# 11 12 21 2007 44 18049.92 FALSE 0.00 0.00
# 12 12 26 2007 16 35856.24 TRUE 35856.24 35856.24
which doesn't require df.tmp.
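For what it's worth, a sketch of the same idea in current dplyr (assuming, as the question does, that 0 is the right fill for groups with no rows in weeks 5..43). Note that any(tmp) is safer than tmp[1L] whenever a group mixes weeks inside and outside the window:
df %>% group_by(S,D,Y) %>%
  mutate(tmp  = Week >= 5 & Week <= 43,
         md_X = if (any(tmp)) median(X[tmp]) else 0,
         mn_X = if (any(tmp)) mean(X[tmp]) else 0) %>%
  ungroup()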
HTH
