R dplyr: summarising based on a condition

I have a data set of items downloaded from a website, based on reports we generate. The idea is to remove reports that are no longer needed, based on the number of downloads. The logic is basically: count all the reports downloaded in the last year, check whether they fall outside two absolute deviations around the median for the current year, then check whether the report has been downloaded within the last 4 weeks and, if so, how many times.
I have the code below, which doesn't work; I was wondering if anyone can help.
It gives me the following error for the n_recent_downloads section:
Error in FUN(X[[1L]], ...) : only defined on a data frame with all numeric variables
reports <- c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D")
Week_no <- c(36,36,33,32,20,18,36,30,29,27)
New.Downloads <- data.frame(Report1 = reports, DL.Week = Week_no)

test <- New.Downloads %>%
  group_by(report1) %>%
  summarise(n_downloads = n(),
            n_recent_downloads = ifelse(sum((as.integer(DL.Week) >= (as.integer(max(DL.Week))) - 4), value, 0)))

Providing a reproducible example would make life a lot easier. Nonetheless, I have modified your code to do what I think you were trying to achieve.
I've split it into two steps so you can see what is going on. I moved the ifelse statement into a mutate() call, which gives:
library(dplyr)

New.Downloads <- data.frame(
  Report1 = c("Report_A","Report_B","Report_C","Report_D","Report_A","Report_A","Report_A","Report_D","Report_D","Report_D"),
  DL.Week = as.numeric(c(36,36,33,32,20,18,36,30,29,27))
)

test <- New.Downloads %>%
  group_by(Report1) %>%
  mutate(
    median = median(DL.Week),
    mad    = 2 * mad(DL.Week),
    check  = ifelse(DL.Week > median + mad | DL.Week < median - mad, 0, DL.Week)
  )
test
Source: local data frame [10 x 5]
Groups: Report1

    Report1 DL.Week median     mad check
1  Report_A      36   28.0 23.7216    36
2  Report_B      36   36.0  0.0000    36
3  Report_C      33   33.0  0.0000    33
4  Report_D      32   29.5  4.4478    32
5  Report_A      20   28.0 23.7216    20
6  Report_A      18   28.0 23.7216    18
7  Report_A      36   28.0 23.7216    36
8  Report_D      30   29.5  4.4478    30
9  Report_D      29   29.5  4.4478    29
10 Report_D      27   29.5  4.4478    27
Note that in your example none of the values are classed as extreme relative to the median + 2 * mad criterion, so the check values are identical to DL.Week.
You can then chain a summarise onto the end of this to give you the sums.
test %>%
  summarise(
    n_recent_downloads = sum(check)
  )
Source: local data frame [4 x 2]

   Report1 n_recent_downloads
1 Report_A                110
2 Report_B                 36
3 Report_C                 33
4 Report_D                118
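The original question also asked for the number of downloads in the last 4 weeks, which the answer above doesn't cover. A minimal sketch of that piece, assuming the "current" week can be taken as the latest week present in the data (an assumption; substitute the real current week number if you have it):

# Sketch only: treats max(DL.Week) as the current week
current_week <- max(New.Downloads$DL.Week)

New.Downloads %>%
  group_by(Report1) %>%
  summarise(n_downloads        = n(),
            n_recent_downloads = sum(DL.Week >= current_week - 4))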

Related

Looking for advice to analyse this particular objective and data in R

Thank you in advance for any assistance.
Aim: I have a 5-day food intake survey dataset that I am trying to analyse in R. I am interested in calculating the mean, SE, min and max of the weight of a specific food consumed per day.
I could do this more easily in Excel, but because of the scale of the data I need to do it in R.
Example question: What is a person's daily intake (g) of lettuce? [mean, standard deviation, standard error, min, and max]
Example extraction dataset: please note that the actual dataset includes a number of foods and a large number of participants.
participant  day  code  foodname  weight
        132    1    62   lettuce      53
         84    3    62   lettuce      23
        132    3    62   lettuce      32
        153    4    62   lettuce      26
        142    2    62   lettuce      23
        123    3    62   lettuce      23
        131    3    62   lettuce      30
        153    5    62   lettuce      16
At present:
# import dataset
library(foreign)   # for read.spss()
library(dplyr)     # for filter()
foodsurvey <- read.spss("foodsurvey.sav", to.data.frame = T, use.value.labels = T)
summary(foodsurvey)

# keep my relevant columns
myvariables = subset(foodsurvey, select = c(1, 2, 3, 4, 5))

# rename columns
colnames(myvariables) <- c('participant', 'day', 'code', 'foodname', 'foodweight')

# create values
day <- myvariables$day
participant <- myvariables$participant
foodcode <- myvariables$foodcode
foodname <- myvariables$foodname
foodweight <- myvariables$foodweight

# extract lettuce by ID code to be analysed
lettuce <- filter(myvariables, foodcode == "62")
dim(lettuce)
str(lettuce)

# errors arise (Ops.factor) when attempting to analyse the weight of lettuce consumed per day
summary(lettuce/days)
quantile(lettuce/foodweight)
max(lettuce)
min(lettuce)
median(lettuce)
mean(lettuce)
This should give you the mean, standard deviation, standard error, min, and max food weight for each participant and food-type combination across the days:

library(dplyr)

myvariables %>%
  filter(foodname == "lettuce") %>%
  group_by(participant) %>%
  summarise(mean    = mean(foodweight, na.rm = T),
            max_val = max(foodweight),
            min_val = min(foodweight),
            sd      = sd(foodweight, na.rm = T),
            se      = sqrt(var(foodweight, na.rm = T) / length(foodweight)))
Here's a method that groups by participant and food itself to give summaries across everything.
dplyr
library(dplyr)

dat %>%
  group_by(participant, foodname) %>%
  summarize(
    across(weight, list(min = min, mean = mean, max = max,
                        sigma = sd, se = ~ sd(.)/n()))
  ) %>%
  ungroup()
# # A tibble: 6 x 7
# participant foodname weight_min weight_mean weight_max weight_sigma weight_se
# <int> <chr> <int> <dbl> <int> <dbl> <dbl>
# 1 84 lettuce 23 23 23 NA NA
# 2 123 lettuce 23 23 23 NA NA
# 3 131 lettuce 30 30 30 NA NA
# 4 132 lettuce 32 42.5 53 14.8 7.42
# 5 142 lettuce 23 23 23 NA NA
# 6 153 lettuce 16 21 26 7.07 3.54
Once you have those summaries, you can easily filter for one participant, a specific food, etc. If you need to also group by code, just add it to the group_by.
The premise of using summarise(across(...)) is that the first argument includes whichever variables you want to summarize (just weight here, but you can add others if it makes sense), and the second argument is a list of functions in various forms. It accepts a bare function symbol (e.g., mean), a tilde-function facilitated by rlang (e.g., ~ sd(.) / n(), where n() is a dplyr-special function), or a regular anonymous function (e.g., function(z) sd(z)/length(z), not shown here). The "name" on the LHS of each listed function is used in the resulting column name.
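As a small illustration of those points (a sketch only: summarising day alongside weight is an assumption and may not be meaningful for your data):

dat %>%
  group_by(participant, foodname) %>%
  summarize(
    across(c(weight, day),                             # more than one column at once
           list(mean  = mean,                          # bare function symbol
                n_na  = ~ sum(is.na(.)),               # tilde-function (rlang lambda)
                range = function(z) max(z) - min(z)))  # regular anonymous function
  ) %>%
  ungroup()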

Writing a function to summarize the results of dunn.test::dunn.test

In R, I perform Dunn's test. The function I use has no option to group the input variables by their statistically significant differences. However, this is what I am genuinely interested in, so I tried to write my own function. Unfortunately, I am not able to wrap my head around it. Perhaps someone can help.
I use the airquality dataset that comes with R as an example. The result that I need could look somewhat like this:
> library (tidyverse)
> ozone_summary <- airquality %>% group_by(Month) %>% dplyr::summarize(Mean = mean(Ozone, na.rm=TRUE))
# A tibble: 5 x 2
Month Mean
<int> <dbl>
1 5 23.6
2 6 29.4
3 7 59.1
4 8 60.0
5 9 31.4
When I run the dunn.test, I get the following:
> dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
Kruskal-Wallis rank sum test
data: x and group
Kruskal-Wallis chi-squared = 29.2666, df = 4, p-value = 0
Comparison of x by group
(Benjamini-Hochberg)
Col Mean-|
Row Mean | 5 6 7 8
---------+--------------------------------------------
6 | -0.925158
| 0.4436
|
7 | -4.419470 -2.244208
| 0.0001* 0.0496*
|
8 | -4.132813 -2.038635 0.286657
| 0.0002* 0.0691 0.8604
|
9 | -1.321202 0.002538 3.217199 2.922827
| 0.2663 0.9980 0.0043* 0.0087*
alpha = 0.05
Reject Ho if p <= alpha
From this result, I deduce that May differs from July and August, June differs from July (but not from August) and so on. So I'd like to append significantly differing groups to my results table:
# A tibble: 5 x 3
Month Mean Group
<int> <dbl> <chr>
1 5 23.6 a
2 6 29.4 ac
3 7 59.1 b
4 8 60.0 bc
5 9 31.4 a
While I did this by hand, I suppose it must be possible to automate this process. However, I can't find a good starting point. I created a dataframe containing all comparisons:
> ozone_differences <- dunn.test::dunn.test (airquality$Ozone, airquality$Month, method = "bh", altp = T)
> ozone_differences <- data.frame ("P" = ozone_differences$altP.adjusted, "Compare" = ozone_differences$comparisons)
P Compare
1 4.436043e-01 5 - 6
2 9.894296e-05 5 - 7
3 4.963804e-02 6 - 7
4 1.791748e-04 5 - 8
5 6.914403e-02 6 - 8
6 8.604164e-01 7 - 8
7 2.663342e-01 5 - 9
8 9.979745e-01 6 - 9
9 4.314957e-03 7 - 9
10 8.671708e-03 8 - 9
I thought that a function iterating through this data frame and using a selection variable to choose the right letter from letters() might work. However, I cannot even think of a starting point, because varying numbers of rows have to be considered at the same time...
Perhaps someone has a good idea?
Perhaps you could look into the cldList() function from the rcompanion library. You can pipe the $res results from the output of dunnTest() into it and create a table that gives the compact letter display comparison per group.
Following the advice of @TylerRuddenfort, the following code will work. The first cld is created with rcompanion::cldList, and the second directly uses multcompView::multcompLetters. Note that to use multcompLetters, the spaces have to be removed from the names of the comparisons.
Here, I have used FSA::dunnTest for the Dunn test (1964).
In general, I recommend ordering groups by e.g. median or mean before running dunnTest if you plan on using a cld, so that the cld comes out in a sensible order (a sketch of this ordering step follows the code below).
library(tidyverse)

ozone_summary <- airquality %>%
  group_by(Month) %>%
  dplyr::summarize(Mean = mean(Ozone, na.rm = TRUE))

library(FSA)
Result = dunnTest(airquality$Ozone, airquality$Month, method = "bh")$res

### Use cldList()
library(rcompanion)
cldList(P.adj ~ Comparison, data = Result)

### Use multcompView
library(multcompView)
X = Result$P.adj <= 0.05
names(X) = gsub(" ", "", Result$Comparison)
multcompLetters(X)
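The code above doesn't include the ordering step recommended earlier. A minimal sketch of one way to do it (my suggestion, not part of the original answer; it assumes FSA::dunnTest's formula interface): reorder Month by its mean Ozone before calling dunnTest, so the cld letters follow the order of the group means.

### Optional: order groups by mean Ozone first so the cld reads in a sensible order
airquality2 <- airquality
airquality2$Month <- reorder(factor(airquality2$Month), airquality2$Ozone,
                             FUN = function(x) mean(x, na.rm = TRUE))

Result2 <- dunnTest(Ozone ~ Month, data = airquality2, method = "bh")$res
cldList(P.adj ~ Comparison, data = Result2)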

How to efficiently replace ranges with their median in a dataframe

Suppose I have a dataframe named score.master that looks like this:
school  perc.prof  num.tested
A       8                 482
B       6-9                34
C       40-49              49
D       GE50               81
E       80-89              26
Here, school A's percent proficient is 8%, and the number of students tested is 482. However, suppose that when num.tested falls below a certain number (in this case arbitrarily 100), data suppression is introduced. In most cases, ranges of perc.prof are given but in other cases a value such as "GE50" is given, indicating greater than or equal to 50.
My question is, in a much larger dataset, what is the best way to replace a range with its median? So for example I want the final dataset to look like this:
school  perc.prof  num.tested
A       8                 482
B       8                  34
C       44                 49
D       75                 81
E       85                 26
I know this can be done manually like this:
score.master$perc.prof[score.master$perc.prof == "6-9"] <- round(median(6:9), 0)
But the actual dataset has many more range combinations. One way I thought of selecting the correct values is by length; all provided values are 1-2 characters long (no more than 99 percent proficient) whereas the range values are 3 or more characters long.
You can use stringr::str_split() to get the lower and upper bounds and then calculate the median. Values such as "GE50" don't generalise to this approach, but you could use ifelse() to handle those special cases (see the sketch after the output below).
df <- data.frame(perc.prof = c('8', '6-9', '40-49', 'GE50', '80-89'))
df$lower.upper <- sapply(stringr::str_split(df$perc.prof, '-'), as.integer)
df$perc.prof.median <- sapply(df$lower.upper, median)
df$lower.upper <- NULL
> df
perc.prof perc.prof.median
1 8 8.0
2 6-9 7.5
3 40-49 44.5
4 GE50 NA
5 80-89 84.5
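One possible way to handle the "GE50" special case with ifelse(), as suggested above (a sketch; it assumes "GExx" means the range xx to 100, so its midpoint is used):

# Treat "GExx" as the range xx-100 and take its midpoint; leave other values alone
df$perc.prof.median <- ifelse(grepl("^GE", df$perc.prof),
                              (as.integer(sub("^GE", "", df$perc.prof)) + 100) / 2,
                              df$perc.prof.median)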
You could do the following to replace your ranges with their median. However, I did not handle the "GExx" or "LExx" situations, since they aren't well-defined enough.
Note that you need the stringr package for my solution.
score.master$perc.prof <- sapply(score.master$perc.prof, function(x){
  sep <- stringr::str_locate(x, "-")[, 1]
  if (is.na(sep)) {
    x
  } else {
    as.character(round(median(as.integer(stringr::str_sub(x, c(1L, sep + 1), c(sep - 1, -1L))))))
  }
})
Here's a tidyverse approach. First I replace "GE50" with its expected output, then use tidyr::separate to split perc.prof where possible. The last step either uses the given perc.prof for large schools or the range midpoint for small schools.
library(tidyverse)

df %>%
  mutate(perc.prof = if_else(perc.prof == "GE50", "75", perc.prof)) %>%
  separate(perc.prof, c("low", "high"), remove = F, convert = T) %>%
  mutate(perc.prof.adj = if_else(num.tested > 100,
                                 as.numeric(perc.prof),
                                 rowSums(select(., low, high), na.rm = T)/2))
  school perc.prof low high num.tested perc.prof.adj
1      A         8   8   NA        482           8.0
2      B       6-9   6    9         34           7.5
3      C     40-49  40   49         49          44.5
4      D        75  75   NA         81          37.5
5      E     80-89  80   89         26          84.5
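Note that in this output school D comes out as 37.5 rather than 75, because after the "GE50" → "75" substitution there is no high value and rowSums(...)/2 still divides by 2. One possible tweak (my suggestion, not part of the original answer) is to average only the non-missing bounds with rowMeans():

# Same approach, but averaging only the available bounds
df %>%
  mutate(perc.prof = if_else(perc.prof == "GE50", "75", perc.prof)) %>%
  separate(perc.prof, c("low", "high"), remove = F, convert = T) %>%
  mutate(perc.prof.adj = if_else(num.tested > 100,
                                 as.numeric(perc.prof),
                                 rowMeans(select(., low, high), na.rm = TRUE)))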

Take the mean of three variables containing NAs to create new variable using dplyr [duplicate]

This question already has answers here:
R: How to calculate mean for each row with missing values using dplyr
(3 answers)
Closed 3 years ago.
I have three measures in my dataset that I am trying to combine into one new variable that represents the mean value across those three variables for each row in turn (each row represents a participant). Each of the original three variables contains NA values.
I've tried the code below that I've applied here to a sample dataset from R that contains NA values (airquality):
airquality %>% mutate(New = mean(airquality$Solar.R,airquality$Ozone,airquality$Wind))
But I keep getting the error message:
Error in mean.default(airquality$Solar.R, airquality$Ozone, airquality$Wind) :
  'trim' must be numeric of length one
In addition: Warning message:
In if (na.rm) x <- x[!is.na(x)] :
  the condition has length > 1 and only the first element will be used
I have also tried :
airquality %>% filter(!is.na(airquality$Solar.R,airquality$Ozone,airquality$Wind)) %>% mutate(New = mean(airquality$Solar.R,airquality$Ozone,airquality$Wind))
But this gives me the same error.
Can anyone advise on how to solve this problem?
Thanks so much in advance!
You can use row_mean_() from hablar, which takes the mean by row while ignoring missing values.
library(hablar)

airquality %>%
  mutate(New = row_mean_(Solar.R, Ozone, Wind))
Result
  Ozone Solar.R Wind Temp Month Day        New
1    41     190  7.4   67     5   1  79.466667
2    36     118  8.0   72     5   2  54.000000
3    12     149 12.6   74     5   3  57.866667
4    18     313 11.5   62     5   4 114.166667
5    NA      NA 14.3   56     5   5  14.300000
6    28      NA 14.9   66     5   6  21.450000
7    23     299  8.6   65     5   7 110.200000
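If you would rather not add another package, the same row-wise mean can be written with base rowMeans() inside mutate() (a sketch; rowMeans() drops the NAs with na.rm = TRUE, so it matches the result above):

library(dplyr)

airquality %>%
  mutate(New = rowMeans(cbind(Solar.R, Ozone, Wind), na.rm = TRUE))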

rolling percentile for conditional selections in r

I have a data.frame with daily maximum and minimum temperatures for 40 years and need to select all days that have maximum temperature above 90th percentile of maximum temperature and minimum temperatures above the 85th percentile of minimum temperature.
I was able to do that
> head(df)
YEAR MONTH DAY Date MEAN MAX MIN
1 1965 1 1 1/1/1965 NA 27.0 17.0
2 1965 1 2 1/2/1965 24.0 28.0 20.7
3 1965 1 3 1/3/1965 19.9 23.7 16.2
4 1965 1 4 1/4/1965 18.0 23.4 12.0
5 1965 1 5 1/5/1965 19.7 24.0 14.0
6 1965 1 6 1/6/1965 18.6 24.0 13.0
df[, hotday := +(df$MAX >= quantile(df$MAX, .90, na.rm = T, type = 6) &
                 df$MIN >= quantile(df$MIN, .85, na.rm = T, type = 6))
   ][, length := with(rle(hotday), rep(lengths, lengths))   # calculate run length so I can select consecutive days only
   ][hotday == 0, length := 0
   ][!!hotday, Highest_Mean := max(MEAN), rleid(length)][]  # find the highest mean temp for each consecutive group
But I need to do the same thing using centered rolling percentiles over 15 days (i.e., for a given day, the 90th percentile of maximum temperature is the 90th percentile of the historical data for a 15-day window centered on that day).
I mean that the percentile should be calculated from the historical data for each calendar day using a 15-day calendar window. That is, there are 365 days, so for day 118 I would use the historical data for days 111, 112, ... through to day 125. In my case I have data for 40 years, so the 15-day window yields a total sample size of 40 years × 15 days = 600 for each calendar day. The moving window is based on the calendar day, not the time series.
Any thoughts, please?
What about something like this to select the rows you want ?
Since you want a sliding window of 15 days centered on the day of interest, you will always have windows of 7 preceding days + the day of interest + 7 following days. Because of this constraint, the first 7 and the last 7 days (rows) of the dataset are excluded and forced to FALSE (the rep(FALSE, 7) calls).
The code inside the sapply() call tests each day (starting from day 8, i.e. 7 + 1) against its 15-day sliding window: it checks whether the maximum temperature is above the 90th percentile of that window (test1), and a similar test (test2) looks at the MIN temperature. If either test is TRUE, TRUE is returned; otherwise FALSE is returned. You can easily adapt this to your needs.
The resulting vector (stored in KEEP) contains TRUE/FALSE values that can be used to subset the initial dataframe.
set.seed(111)
df <- data.frame(MIN = sample(50:70, size = 50, replace = T),
                 MAX = sample(70:90, size = 50, replace = T))
head(df)

KEEP <- c(rep(FALSE, 7),
          sapply(8:(length(df$MAX) - 7), function(i){
            test1 <- df$MAX[i] >= as.numeric(quantile(df$MAX[(i-7):(i+7)], 0.9, na.rm = TRUE))
            test2 <- df$MIN[i] <= as.numeric(quantile(df$MIN[(i-7):(i+7)], 0.15, na.rm = TRUE))
            test1 | test2
          }),
          rep(FALSE, 7))
head(KEEP)

df <- df[KEEP, ]
df
This should return
MIN MAX
10 51 86
13 51 73
14 50 75
15 53 89
22 55 83
28 55 90
31 51 72
32 60 88
37 52 84
42 56 87
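Note that the answer above slides over rows, while the question describes pooling the same 15 calendar days across all 40 years. A rough sketch of that calendar-day variant (my suggestion, not part of the original answer), assuming df contains the Date, MAX and MIN columns shown in the question and that wrapping the window across year boundaries with modular arithmetic is acceptable:

library(dplyr)

# Day of year for every observation (assumes Date is stored as m/d/Y, as in head(df))
df$doy <- as.integer(format(as.Date(df$Date, format = "%m/%d/%Y"), "%j"))

# For each calendar day, pool that day +/- 7 days across all years
# (wrapping around the ends of the year) and compute the two thresholds
thresholds <- bind_rows(lapply(1:366, function(d) {
  window <- ((d - 7):(d + 7) - 1) %% 366 + 1
  sub <- df[df$doy %in% window, ]
  data.frame(doy   = d,
             max90 = as.numeric(quantile(sub$MAX, 0.90, na.rm = TRUE, type = 6)),
             min85 = as.numeric(quantile(sub$MIN, 0.85, na.rm = TRUE, type = 6)))
}))

# Flag days that exceed the thresholds for their own calendar day
df <- df %>%
  left_join(thresholds, by = "doy") %>%
  mutate(hotday = +(MAX >= max90 & MIN >= min85))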
