dplyr: mean of a group count

I am trying to find the mean length of a variable over a dataframe using dplyr:
x <- data %>% group_by(Date, `% Bucket`) %>% summarise(count = n())
Date % Bucket count
(date) (fctr) (int)
1 2015-01-05 <=1 1566
2 2015-01-05 (1-25] 421
3 2015-01-05 (25-50] 461
4 2015-01-05 (50-75] 485
5 2015-01-05 (75-100] 662
6 2015-01-05 (100-150] 1693
7 2015-01-05 >150 12359
8 2015-01-13 <=1 1608
9 2015-01-13 (1-25] 441
10 2015-01-13 (25-50] 425
How do I aggregate to find the average count for each % Bucket over the year with dplyr?
in base:
x <- as.data.frame(x)
aggregate(count ~ `% Bucket`, data = x, FUN=mean)
% Bucket count
1 <=1 2609.5294
2 (1-25] 449.0000
3 (25-50] 528.7059
4 (50-75] 593.2157
5 (75-100] 763.0000
6 (100-150] 1758.6667
7 >150 12457.9216
The aggregate() function takes the counts found by dplyr for each bucket above and sums them, dividing by the number of rows that contain that % Bucket value, giving the answer above. How can I accomplish this with dplyr, though? This is not about completing the problem but about understanding how the dplyr package would be used in such a scenario.
Another example of this type of thing would be to summarise the n() of each group_by variable and also list the minimum "count" of that variable across the 52 weeks.
I am struggling because dplyr seems to be built to find a mean of a value in a column, but here I am counting the number of row occurrences given a variable in a column and trying to find the mean, min, max, etc. of it.

We can use dplyr methods
library(dplyr)
x %>%
  group_by(`% Bucket`) %>%
  summarise(count = mean(count))
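The question also asks about min and max; summarise() can compute several statistics in one call. A minimal sketch along the same lines (the result column names are illustrative):
x %>%
  group_by(`% Bucket`) %>%
  summarise(mean_count = mean(count), # average weekly count per bucket
            min_count = min(count),   # smallest weekly count per bucket
            max_count = max(count))   # largest weekly count per bucket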

Related

How to add rows to dataframe R with rbind

I know this is a classic question and there are also similar ones in the archive, but I feel like the answers did not really apply to this case. Basically, I want to take one dataframe (covid cases in Berlin per district), calculate the sum of each column, and create a new dataframe with one column for the name of the district and another for the total number. So I wrote:
covid_bln <- read.csv('https://www.berlin.de/lageso/gesundheit/infektionsepidemiologie-infektionsschutz/corona/tabelle-bezirke-gesamtuebersicht/index.php/index/all.csv?q=', sep=';')
c_tot <- data.frame('district' = c(), 'number' = c())

for (n in colnames(covid_bln[3:14])){
  x <- data.frame('district' = c(n), 'number' = c(sum(covid_bln$n)))
  c_tot <- rbind(c_tot, x)
  next
}
print(c_tot)
This works properly for the names, but it returns the number of cases for the 8th district for all of the districts. If you have any suggestion, even one involving other functions, that would be great. Thank you.
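The likely culprit is covid_bln$n: the $ operator looks up a column literally named "n" (with partial matching), not the column whose name is stored in the loop variable n, which would explain why one district's total repeats for every row. Assuming that diagnosis, indexing with [[ fixes the loop:
for (n in colnames(covid_bln[3:14])){
  x <- data.frame('district' = n, 'number' = sum(covid_bln[[n]])) # [[n]] uses the value of n
  c_tot <- rbind(c_tot, x)
}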
Here's a base R solution:
number <- colSums(covid_bln[3:14])
district <- names(covid_bln[3:14])
c_tot <- cbind.data.frame(district, number)
# If you don't want rownames:
rownames(c_tot) <- NULL
This gives us:
district number
1 mitte 16030
2 friedrichshain_kreuzberg 10679
3 pankow 10849
4 charlottenburg_wilmersdorf 10664
5 spandau 9450
6 steglitz_zehlendorf 9218
7 tempelhof_schoeneberg 12624
8 neukoelln 14922
9 treptow_koepenick 6760
10 marzahn_hellersdorf 6960
11 lichtenberg 7601
12 reinickendorf 9752
I want to provide a solution using the tidyverse.
The final result is ordered alphabetically by district:
c_tot <- covid_bln %>%
  select(mitte:reinickendorf) %>%
  gather(district, number, mitte:reinickendorf) %>%
  group_by(district) %>%
  summarise(number = sum(number))
The result is:
# A tibble: 12 x 2
district number
* <chr> <int>
1 charlottenburg_wilmersdorf 10736
2 friedrichshain_kreuzberg 10698
3 lichtenberg 7644
4 marzahn_hellersdorf 7000
5 mitte 16064
6 neukoelln 14982
7 pankow 10885
8 reinickendorf 9784
9 spandau 9486
10 steglitz_zehlendorf 9236
11 tempelhof_schoeneberg 12656
12 treptow_koepenick 6788
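As a side note, gather() is superseded in current tidyr; assuming a recent tidyr, an equivalent sketch with pivot_longer() would be:
library(dplyr)
library(tidyr)
c_tot <- covid_bln %>%
  pivot_longer(mitte:reinickendorf, names_to = "district", values_to = "number") %>%
  group_by(district) %>%
  summarise(number = sum(number))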

How to calculate duration of time between two dates

I'm working with a large data set in RStudio that includes multiple test scores for the same individuals. I've filtered my data set to display the same individual's scores in two consecutive rows with the test date for each test administration in one column. My data appears as follows:
id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline
I would like to calculate the total duration of time between baseline 1 and baseline 2 administration and store that value in a new column. Therefore, my first question is what is the best way to calculate the duration of time between two dates? And two, what is the best way to condense each individual's data into one row to make calculating the difference between test scores easier and to be stored in a new column?
Thank you for any assistance!
This is a solution within the tidyverse; the packages we are going to use are dplyr and tidyr.
First, we create the dataset (you would read it from a file instead) and convert the strings to date format:
library(dplyr)
library(tidyr)
dataset <- read.table(text = "id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline", header = TRUE)
dataset$test_date <- as.Date(dataset$test_date, format = "%m/%d/%Y")
# id test_date score baseline_number_1 baseline_number_2
# 1 1 2017-08-15 21.18 Baseline <NA>
# 2 1 2019-08-28 28.55 <NA> Baseline
# 3 2 2017-11-22 33.38 Baseline <NA>
# 4 2 2019-11-06 35.30 <NA> Baseline
# 5 3 2018-07-25 30.77 Baseline <NA>
# 6 3 2019-07-31 33.42 <NA> Baseline
Condensing each individual's data into one row and computing the difference between the two baselines can be achieved as follows:
dataset %>%
  group_by(id) %>%
  mutate(number = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = id,
    names_from = number,
    values_from = c(test_date, score),
    names_glue = "{.value}_{number}"
  ) %>%
  mutate(
    time_between = test_date_2 - test_date_1
  )
Brief explanation: first we create the variable number, which indicates the baseline number in each row; then we use pivot_wider to make the dataset "wider" indeed, i.e. one row for each id along with its features; finally we create the variable time_between, which contains the difference in days between the two baselines. If you are not familiar with some of these functions, I suggest you break the pipeline after each operation and analyse it step by step.
Final output
# A tibble: 3 x 6
# id test_date_1 test_date_2 score_1 score_2 time_between
# <int> <date> <date> <dbl> <dbl> <drtn>
# 1 1 2017-08-15 2019-08-28 21.2 28.6 743 days
# 2 2 2017-11-22 2019-11-06 33.4 35.3 714 days
# 3 3 2018-07-25 2019-07-31 30.8 33.4 371 days
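If you would rather have time_between as a plain number of days instead of the <drtn> difftime shown above, one option is to wrap the subtraction in as.numeric() inside the final mutate():
mutate(
  time_between = as.numeric(test_date_2 - test_date_1) # days as a plain numeric
)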

R - Finding Mean with multiple subsets

I have a dataset with 4 columns as shown below. I want to create a 5th column (Mean) which has the mean of the 4th column based on the first 3 columns.
For example: the mean of the values in the first hour (Hour = 1) on the date 1/1/2018 for the Id 5000 is the mean of the first 3 rows, (1+2+2)/3 = 1.67
head(read_df[, 1:5])
Id Date Hour Value Mean
5000 1/1/2018 1 1 1.67
5000 1/1/2018 1 2 1.67
5000 1/1/2018 1 2 1.67
5100 1/1/2018 4 2 2
5100 2/1/2018 6 2 3
5100 2/1/2018 6 4 3
5100 3/1/2018 2 7 7
5200 3/1/2018 3 3 4.5
5200 3/1/2018 3 6 4.5
I tried using a for loop for each of Id and Date and Hour. But ended up with NAs in some rows. Kindly let me know an efficient way to achieve this.
I would recommend using the dplyr package.
library(dplyr)
read_df %>%
  group_by(Id, Date, Hour) %>%   # Specify your by-variables (the first 3 columns)
  mutate(Mean = mean(Value)) %>% # Calculate the mean per group
  ungroup()
ddply from plyr does exactly this for any function.
plyr::ddply(read_df, c("Id", "Date", "Hour"), plyr::numcolwise(mean))
Though in your example I notice the 3rd row has a different date, so that contradicts your example.
There are simpler functions that can do similar things, such as aggregate(), but I like ddply as it's a good all-rounder.
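For completeness, a sketch of the aggregate() route; note that, unlike the mutate() approach above, it collapses the data to one row per group instead of adding a column:
aggregate(Value ~ Id + Date + Hour, data = read_df, FUN = mean)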

How to check if any row has negative values by leaving out selected rows?

Below is the dataframe I get by running a query. Please note that df1 is a dynamic dataframe and might return either an empty df or a partial df without all the quarters seen below:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 299111.86
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
I want to check the values of all the rows in the Revenue column and see if any value is 0 or negative, excluding the 2014-Q1 row.
Also, df1 is dynamic and will contain only 12 quarters of data, i.e. when I reach the next quarter, 2017-Q2, the Revenue associated with 2014-Q2 becomes 0 and the data will look like this:
df1
FISC_QTR_VAL Revenue
1 2014-Q1 0.00
2 2014-Q2 0.00
3 2014-Q3 174071.98
4 2014-Q4 257655.30
5 2015-Q1 0.00
6 2015-Q2 317118.63
7 2015-Q3 145461.88
8 2015-Q4 162972.41
9 2016-Q1 96896.04
10 2016-Q2 135058.78
11 2016-Q3 111773.77
12 2016-Q4 138479.28
13 2017-Q1 169276.04
14 2017-Q2 146253.64
In the above case, I would need to check all rows of the Revenue column, excluding 2014-Q1 and 2014-Q2.
And this goes on as the quarters progress.
I need your help generating code that would dynamically do all of the above steps: excluding the row(s) and checking only the rows that matter for a particular quarter.
Currently, I am using the below code:
#Taking the first df1 into consideration which has 2017-Q1 as the last quarter
startQtr <- "2014-Q2" # This value is dynamically achieved and will change as we move ahead. Next quarter, the value changes to 2014-Q3 and so on

if(length(df1[["FISC_QTR_VAL"]][nrow(df1)-11] == startQtr) == 1){
  if(nrow(df1[df1$Revenue < 0,]) == 0 & nrow(df1[df1$Revenue == 0,]) == 0){
    df1 <- df1 %>% slice((nrow(df1)-11):(nrow(df1)))
  }
}
The first if condition checks whether there is data in df1.
If the df is empty, df1[["FISC_QTR_VAL"]][nrow(df1)-11] == startQtr returns numeric(0), whose length is 0, and hence the condition fails.
If not, it moves to the second if condition and checks for negative and 0 values in the Revenue column. But it checks all of the rows; I want 2014-Q1 excluded in this case and, going forward into future quarters, the condition to stay dynamic as explained above.
Also, I do not want to slice the dataset before the if condition, as the code would throw an error if the initial dataframe df1 returns only 1 or 2 rows and we try to slice those further.
Thanks
Here's a solution using a few functions from the dplyr and tidyr packages.
Here's a toy data set to work with:
d <- data.frame(
  FISC_QTR_VAL = c("2015-Q1", "2014-Q2", "2014-Q1", "2015-Q2"),
  Revenue = c(100, 200, 0, 0)
)
d
#> FISC_QTR_VAL Revenue
#> 1 2015-Q1 100
#> 2 2014-Q2 200
#> 3 2014-Q1 0
#> 4 2015-Q2 0
Notice that FISC_QTR_VAL is intentionally out of order (as a precaution).
Next, set variables for the current year and quarter (you'll see why they are separate in a moment):
current_year <- 2014
current_quarter <- 2
Then run the following:
d %>%
  separate(FISC_QTR_VAL, c("year", "quarter"), sep = "-Q") %>%
  arrange(year, quarter) %>%
  slice(which(year == current_year & quarter == current_quarter):n()) %>%
  filter(Revenue <= 0)
#> year quarter Revenue
#> 1 2015 2 0
First, we separate() the FISC_QTR_VAL into separate year and quarter variables for (a) a tidy data set and (b) a way to order the data in case it's out of order (as in the toy used here). We then arrange() the data so that it's ordered by year and quarter. Then, we slice() away any quarters prior to the current one, and then filter() to return all rows where Revenue <= 0.
To alternatively get, for example, a count of the number of rows that are returned, you can pipe on something like nrow().
Is the subset function an option for you?
exclude.qr <- c("2014-Q1", "2014-Q2")
df <- data.frame(
  FISC_QTR_VAL = c("2014-Q1", "2014-Q2", "2014-Q3", "2014-Q4"),
  Revenue = c(0.00, 299111.86, 174071.98, 257655.30))
subset(
  df,
  !(FISC_QTR_VAL %in% exclude.qr) & Revenue > 0) # %in% handles the vector of quarters; both conditions go in subset's second argument
You can easily create exclude.qr dynamically, e.g. via paste() and a vector of years such as years <- 2010:END.
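For instance, a minimal sketch (the year and the number of quarters to exclude are illustrative):
exclude.qr <- paste0(2014, "-Q", 1:2)
exclude.qr
# [1] "2014-Q1" "2014-Q2"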
I hope this is helpful!

Creating Mean Function for Subset in R

I'm trying to create a function that will take a few parameters and return the total average hourly return. My data set looks like this:
Location Time units
1 Columbus 3:35 12
2 Columbus 3:58 199
3 Chicago 6:10 -45
4 Chicago 6:19 87
5 Detroit 12:05 -200
6 Detroit 0:32 11
What I would like returned would be
Location Time units unitsph
Columbus 7:33 211 27.9
Chicago 12:29 42 3.4
Detroit 12:37 -189 -15.1
while also retaining the other items: basically, total units produced and units per hour.
I tried out
thing <- time %>% group_by(Location) %>% summarize(sum(units))
which returned locations and total units but not units per hour. Then I moved to
thing <- time %>% group_by(Location) %>% summarize(sum(units)) %>% summarize(sum(Time))
which returned
Error in eval(expr, envir, enclos) : object 'Time' not found
I also tried mutate but to no effect:
fin <- mutate(time, as.numeric(sum(Time))/as.numeric(sum(units)))
Error in Summary.factor(c(118L, 131L, 174L, 178L, 57L), na.rm = FALSE) :
‘sum’ not meaningful for factors
Any help here much appreciated. I also have a few other columns that I'd like to retain (they're geocodes for the locations etc), but didn't list those here. If that's important I can add back in.
Your time is a string object. You can use:
library(dplyr)

data <- data.frame(loc = c("C","C","D","D"),
                   time = c("1:22","1:23","1:24","1:25"),
                   u = c(1,2,3,4))
basetime <- strptime("00:00", "%H:%M")
data$in.hours <- as.double(strptime(data$time, "%H:%M") - basetime) # hours since midnight
thing <- data %>% group_by(loc) %>% summarize(sum(u), sum(in.hours))
The conversion into hours is not exactly beautiful: it first turns the time into a POSIXlt object only to convert it, in turn, to a double. But I guess it's ok.
The converted data
loc time u in.hours
1 C 1:22 1 1.366667
2 C 1:23 2 1.383333
3 D 1:24 3 1.400000
4 D 1:25 4 1.416667
so 1.366 means 1h + 1/3h.
The final result is then
loc sum(u) sum(in.hours)
(fctr) (dbl) (dbl)
1 C 3 2.750000
2 D 7 2.816667
hence for C you have 2 hours and 0.75*60 minutes
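A slightly tidier alternative, if you want to skip the intermediate POSIX objects, is base R's as.difftime(), which parses the "H:MM" strings directly (a sketch, assuming the same data as above):
data$in.hours <- as.numeric(as.difftime(data$time, format = "%H:%M", units = "hours"))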
I ended up taking part of what @CAFEBABE recommended and modifying it. I used
mutated_time <- time %>%
  group_by(Location) %>%
  summarize(play = sum(as.numeric(Time)/60),
            unitsph = sum(units))
and that plus
selektor <- as.data.frame(select(distinct(mutated_time), Location, unitsph))
got me where I wanted to go. Thank you all for the many helpful comments.
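For reference, a self-contained sketch of the whole computation, assuming Time holds "H:MM" values (possibly as a factor, hence the as.character()) and using the column names from the question:
library(dplyr)

time %>%
  mutate(hours = as.numeric(as.difftime(as.character(Time), format = "%H:%M", units = "hours"))) %>%
  group_by(Location) %>%
  summarize(total_units = sum(units),           # total units produced
            unitsph = total_units / sum(hours)) # units per hour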
