R - Finding mean with multiple subsets

I have a dataset with 4 columns as shown below. I want to create a 5th column (Mean) which has the mean of the 4th column based on the first 3 columns.
For e.g: The mean of the value in the first hour (hour=1) on the date (1/1/2018) for the Id (5000) is the mean of first 3 rows (2+2+1)/3 = 1.67
> head(read_df[,1:5])
Id Date Hour Value Mean
5000 1/1/2018 1 1 1.67
5000 1/1/2018 1 2 1.67
5000 1/1/2018 1 2 1.67
5100 1/1/2018 4 2 2
5100 2/1/2018 6 2 3
5100 2/1/2018 6 4 3
5100 3/1/2018 2 7 7
5200 3/1/2018 3 3 4.5
5200 3/1/2018 3 6 4.5
I tried using a for loop over each Id, Date and Hour, but ended up with NAs in some rows. Kindly let me know an efficient way to achieve this.

I would recommend using the dplyr package.
library(dplyr)
read_df %>%
  group_by(Id, Date, Hour) %>%      # Specify your by-variables
  mutate(Mean = mean(Value)) %>%    # Calculate the mean within each group
  ungroup()

ddply from plyr does exactly this for any function.
plyr::ddply(read_df, c("Id", "Date", "Hour"), plyr::numcolwise(mean))
Though in your example I notice the 3rd row has a different date, so that contradicts your example.
There are simpler functions that can do similar things, such as aggregate, but I like ddply as it's a good all-rounder.
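As a rough sketch of the base R route mentioned above: aggregate gives one mean per group, while ave returns a value for every row, which is what a new Mean column needs. The data frame is retyped from the question's sample rows; nothing else is assumed.

```r
# Sample rows retyped from the question (the Mean column is what we compute)
read_df <- data.frame(
  Id    = c(5000, 5000, 5000, 5100, 5100, 5100, 5100, 5200, 5200),
  Date  = c("1/1/2018", "1/1/2018", "1/1/2018", "1/1/2018", "2/1/2018",
            "2/1/2018", "3/1/2018", "3/1/2018", "3/1/2018"),
  Hour  = c(1, 1, 1, 4, 6, 6, 2, 3, 3),
  Value = c(1, 2, 2, 2, 2, 4, 7, 3, 6)
)

# One row per Id/Date/Hour combination
group_means <- aggregate(Value ~ Id + Date + Hour, data = read_df, FUN = mean)

# Group mean repeated on every row of its group (FUN must be named here)
read_df$Mean <- ave(read_df$Value, read_df$Id, read_df$Date, read_df$Hour,
                    FUN = mean)
round(read_df$Mean, 2)
# 1.67 1.67 1.67 2.00 3.00 3.00 7.00 4.50 4.50
```

Because ave keeps the original row order, the result can be assigned straight back as a column, with no loop and no merge step.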

Related

How to calculate duration of time between two dates

I'm working with a large data set in RStudio that includes multiple test scores for the same individuals. I've filtered my data set to display the same individual's scores in two consecutive rows with the test date for each test administration in one column. My data appears as follows:
id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline
I would like to calculate the total duration of time between baseline 1 and baseline 2 administration and store that value in a new column. Therefore, my first question is what is the best way to calculate the duration of time between two dates? And two, what is the best way to condense each individual's data into one row to make calculating the difference between test scores easier and to be stored in a new column?
Thank you for any assistance!
This is a solution inside the tidyverse universe. The packages we are going to use are dplyr and tidyr.
First, we create the dataset (you read it from a file instead) and convert strings to date format:
library(dplyr)
library(tidyr)
dataset <- read.table(text = "id test_date score baseline_number_1 baseline_number_2
1 08/15/2017 21.18 Baseline N/A
1 08/28/2019 28.55 N/A Baseline
2 11/22/2017 33.38 Baseline N/A
2 11/06/2019 35.3 N/A Baseline
3 07/25/2018 30.77 Baseline N/A
3 07/31/2019 33.42 N/A Baseline", header = TRUE, na.strings = "N/A")
dataset$test_date <- as.Date(dataset$test_date, format = "%m/%d/%Y")
# id test_date score baseline_number_1 baseline_number_2
# 1 1 2017-08-15 21.18 Baseline <NA>
# 2 1 2019-08-28 28.55 <NA> Baseline
# 3 2 2017-11-22 33.38 Baseline <NA>
# 4 2 2019-11-06 35.30 <NA> Baseline
# 5 3 2018-07-25 30.77 Baseline <NA>
# 6 3 2019-07-31 33.42 <NA> Baseline
One way to condense each individual's data into one row and compute the difference between the two baselines is:
dataset %>%
  group_by(id) %>%
  mutate(number = row_number()) %>%
  ungroup() %>%
  pivot_wider(
    id_cols = id,
    names_from = number,
    values_from = c(test_date, score),
    names_glue = "{.value}_{number}"
  ) %>%
  mutate(
    time_between = test_date_2 - test_date_1
  )
Brief explanation: first we create the variable number, which indicates the baseline number in each row; then we use pivot_wider to make the dataset wider, i.e. one row for each id with its test dates and scores as columns; finally we create the variable time_between, which contains the difference in days between the two baselines. If you are not familiar with some of these functions, I suggest you break the pipeline after each operation and analyse it step by step.
Final output
# A tibble: 3 x 6
# id test_date_1 test_date_2 score_1 score_2 time_between
# <int> <date> <date> <dbl> <dbl> <drtn>
# 1 1 2017-08-15 2019-08-28 21.2 28.6 743 days
# 2 2 2017-11-22 2019-11-06 33.4 35.3 714 days
# 3 3 2018-07-25 2019-07-31 30.8 33.4 371 days
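For the first part of the question (the duration between two dates), here is a minimal base R sketch, assuming the dates are parsed with as.Date as above. Subtracting two Date values gives a difftime in days, and difftime() lets you pick other units. The vectors id and test_date are retyped from the question.

```r
# Ids and test dates retyped from the question
id <- c(1, 1, 2, 2, 3, 3)
test_date <- as.Date(c("08/15/2017", "08/28/2019", "11/22/2017",
                       "11/06/2019", "07/25/2018", "07/31/2019"),
                     format = "%m/%d/%Y")

# Subtracting Dates returns a difftime measured in days
first_gap <- test_date[2] - test_date[1]

# The same gap expressed in weeks, as a plain number
gap_weeks <- as.numeric(difftime(test_date[2], test_date[1], units = "weeks"))

# Days between the earliest and latest test for each id
days_between <- tapply(as.numeric(test_date), id,
                       function(d) max(d) - min(d))
days_between
# id 1: 743, id 2: 714, id 3: 371
```

These per-id values agree with the time_between column in the tibble above.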

How to select random rows from R data frame to include all distinct values of two columns

I want to select a random sample of rows from a large R data frame df (around 10 million rows) in such a way that all distinct values of two columns are included in the resulting sample. df looks like:
StoreID WEEK Units Value ProdID
2001 1 1 3.5 20702
2001 2 2 3 20705
2002 32 3 6 23568
2002 35 5 15 24025
2003 1 2 10 21253
I have the following unique values in the respective columns: StoreID: 1433 and WEEK: 52. When I generate a random sample of rows from df, I must have at least one row each for each StoreID and each WEEK value.
I used the function sample_frac in dplyr in various trials but that does not ensure that all distinct values of StoreID and WEEK are included at least once in the resulting sample. How can I achieve what I want?
It sounds like you need to group by the desired columns before sampling rows. The last line returns one random row for each unique storeid-week pairing.
library(dplyr)
df <- data.frame(storeid = sample(2000:2010, 1000, TRUE),
                 week = sample(1:52, 1000, TRUE),
                 value = runif(1000))
# count number of duplicated storeid-week pairs
df %>% count(storeid, week) %>% filter(n > 1)
# one random row per storeid-week pair
df %>% group_by(storeid, week) %>% sample_n(1)
# A tibble: 468 x 3
# Groups: storeid, week [468]
storeid week value
<int> <int> <dbl>
1 2000 1 0.824
2 2000 2 0.0987
3 2000 6 0.916
4 2000 8 0.289
5 2000 9 0.610
6 2000 11 0.0807
7 2000 12 0.592
8 2000 13 0.849
9 2000 14 0.0181
10 2000 16 0.182
# ... with 458 more rows
Not sure if I have read the problem correctly. I would have tried the following using sample function.
Assuming your dataframe is called MyDataFrame and is two dimensional, I would have done it like this.
RandomizedDF <- MyDataFrame[sample(dim(MyDataFrame)[1],dim(MyDataFrame)[1],replace=FALSE),]
Let me know if this is what you wanted or something else?
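If the requirement is that every StoreID and every WEEK appears at least once (rather than every StoreID-WEEK pair), one possible approach is to first pick one random row per StoreID and one per WEEK, then top up with random rows until the sample reaches the desired size. This is only a sketch in base R: the simulated df, the helper pick_one and the sample size target_n are assumptions for illustration, not part of the original answers.

```r
set.seed(1)

# Simulated stand-in for the real data (hypothetical values)
df <- data.frame(StoreID = sample(2001:2010, 1000, TRUE),
                 WEEK    = sample(1:52, 1000, TRUE),
                 Units   = sample(1:5, 1000, TRUE))

# Pick one element of idx at random; safe even when idx has length 1
pick_one <- function(idx) idx[sample.int(length(idx), 1)]

# One random row index per StoreID and one per WEEK
store_rows <- vapply(split(seq_len(nrow(df)), df$StoreID), pick_one, integer(1))
week_rows  <- vapply(split(seq_len(nrow(df)), df$WEEK), pick_one, integer(1))
must_have  <- unique(c(store_rows, week_rows))

# Top up with other random rows until the sample has target_n rows
target_n  <- 200
remaining <- setdiff(seq_len(nrow(df)), must_have)
n_extra   <- max(0, target_n - length(must_have))
extra     <- remaining[sample.int(length(remaining), n_extra)]

smp <- df[c(must_have, extra), ]
```

The sample is no longer a simple random sample, since the guaranteed rows are chosen first; whether that matters depends on what the sample is used for.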

I need to find the mean for the data with cells without values

I need to find the average prices for all the different weeks, and then make a ggplot to show how the price develops during the year.
When you compute the mean, how do the empty cells affect it?
I have tried several things, including using the melt() function so that I only have 3 variables. The variables I want to find the mean of are factors.
Company variable value
ns Price week 24 1749
ns Price week 24
ns Price week 24 1599
ns Price week 24
ns Price week 24
ns Price week 24 359
ns Price week 24 460
I have more than 300K observations and would love to have a small data.frame containing only the Company and the mean price for each week. Right now I have all observations for each week, and I need the means for plotting with ggplot.
When I use following code
dat %in% mutate(means=mean(value), na.rm=TRUE)
I got a warning message saying the argument is not numeric or logical: returning NA.
I am looking forward to getting your help!
Clean code from PavoDive's comment
dt[!is.na(value), mean(value), by = .(Price, Week)]
and even better
dt[ , mean(value, na.rm = TRUE), by = .(Price, Week)]
Original:
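This works using data.table. The first part filters out rows that don't have a number in value: the condition is TRUE for every number and NA for a missing value, and data.table drops rows where the condition is NA. Next we say that we want the average of the value column. Finally, by defines how to group the rows.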
Code:
dt[value >0 | value<1, .(MeanValues = mean(`value`)), by = c("Price", "Week")][]
Input:
library(data.table)
dt <- data.table(`Price` = c("A","B","B","A","A","B","B","A"),
                 `Week` = c(1,2,1,1,2,2,1,2),
                 `value` = c(3,7,2,NA,1,46,1,NA))
Price Week value
1: A 1 3
2: B 2 7
3: B 1 2
4: A 1 NA
5: A 2 1
6: B 2 46
7: B 1 1
8: A 2 NA
Output:
   Price Week MeanValues
1:     A    1        3.0
2:     B    2       26.5
3:     B    1        1.5
4:     A    2        1.0
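On the question of how empty cells affect the mean, here is a small base R sketch using the prices shown in the question. It assumes the value column is a factor in which empty cells are empty strings, as the question suggests. Calling mean() directly on a factor returns NA with the "argument is not numeric or logical" warning, so the column is converted first.

```r
# Values retyped from the question; empty cells are empty strings in a factor
value <- factor(c("1749", "", "1599", "", "", "359", "460"))

# Convert via character: as.numeric() on a factor would return level codes
num <- as.numeric(as.character(value))      # "" becomes NA

mean_with_na    <- mean(num)                # NA: any missing value makes the mean NA
mean_without_na <- mean(num, na.rm = TRUE)  # 1041.75, mean of the 4 non-missing prices
```

Note that na.rm = TRUE belongs inside mean(); in the attempt from the question it was passed to mutate() instead.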

dplyr: mean of a group count

I am trying to find the mean length of a variable over a dataframe using dplyr:
x <- data %>% group_by(Date, `% Bucket`) %>% summarise(count = n())
Date % Bucket count
(date) (fctr) (int)
1 2015-01-05 <=1 1566
2 2015-01-05 (1-25] 421
3 2015-01-05 (25-50] 461
4 2015-01-05 (50-75] 485
5 2015-01-05 (75-100] 662
6 2015-01-05 (100-150] 1693
7 2015-01-05 >150 12359
8 2015-01-13 <=1 1608
9 2015-01-13 (1-25] 441
10 2015-01-13 (25-50] 425
How to aggregate to find average across each % Bucket over the year with dplyr?
in base:
x <- as.data.frame(x)
aggregate(count ~ `% Bucket`, data = x, FUN=mean)
% Bucket count
1 <=1 2609.5294
2 (1-25] 449.0000
3 (25-50] 528.7059
4 (50-75] 593.2157
5 (75-100] 763.0000
6 (100-150] 1758.6667
7 >150 12457.9216
Aggregate function will take the count found by dplyr across each bucket above and sum them, dividing by the number of rows that contain that % Bucket variable and give the answer above. How can I accomplish this with dplyr though? This is not about completing the problem but understanding how the dplyr package would be used in such a scenario.
Another example of this type of thing would be summarise the n() of each group_by variable and also listing the minimum length "count" of that variable across the 52 weeks.
I am struggling because dplyr seems to be built to find a mean of a value in a column, but here I am counting the number of row occurrences given a variable in a column and trying to find the mean, min, max, etc. of it.
We can use dplyr methods
library(dplyr)
x %>%
group_by(`% Bucket`) %>%
summarise(count= mean(count))
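The same pattern extends to the min and max mentioned in the question, since summarise() accepts several summaries at once. A minimal sketch, assuming dplyr is installed; the data frame holds only the buckets that appear for both dates in the printed rows of the question, and the column is renamed to bucket here purely for convenience.

```r
library(dplyr)

# Counts retyped from the first rows shown in the question
x <- data.frame(
  Date   = as.Date(rep(c("2015-01-05", "2015-01-13"), each = 3)),
  bucket = rep(c("<=1", "(1-25]", "(25-50]"), times = 2),
  count  = c(1566, 421, 461, 1608, 441, 425)
)

# Several summaries of the per-date counts, one row per bucket
bucket_stats <- x %>%
  group_by(bucket) %>%
  summarise(mean_count = mean(count),
            min_count  = min(count),
            max_count  = max(count))
```

The key point is that the counts produced by the first summarise() are just an ordinary column, so a second group_by() and summarise() can treat them like any other value.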

Creating a vector containing total quantities sold per delivery term

Have a look at the simplified table below. For each product, I want a vector containing the quantities sold within each delivery term. A delivery term is defined as 4 days. So if we look at product A, we see that it starts at 03/12/15 and within the first delivery term (until 07/12/15) it has sold a quantity of 4. The second delivery term starts at 08/12/15 and ends at 12/12/15, so for this period 1 quantity was sold. The following delivery term starts at 13/12/15 and ends at 17/12/15. During this period no quantities were sold, so the vector must have a value of 0 for it. In the last period, finally, 2 products are sold. So basically the problem is that information about the periods where no products are sold is missing.
Any ideas on how the vector I want can be created using R? I've been thinking of for or while loops, but these do not seem to give the requested results. Note that the code must be applicable to a real dataset containing over 1000 product categories, so it has to be automated in some way.
I would be very grateful if somebody could point me in the right direction.
Product Quantity Date
A 1 03/12/15
A 2 04/12/15
A 1 05/12/15
A 1 08/12/15
A 1 17/12/16
A 1 18/12/16
B 1 19/12/15
B 2 10/05/15
B 2 11/05/15
C 1 01/06/15
C 1 02/06/15
C 1 12/06/15
Assume that dt is the dataset you provided. You'll get a better understanding of the process if you run it step by step (and maybe with an even simpler dataset).
library(lubridate)
library(dplyr)
# create date time columns
dt$Date = dmy(dt$Date)
dt %>%
  group_by(Product) %>%
  do(data.frame(days = seq(min(.$Date), max(.$Date), by = "1 day"))) %>%   # every day between first and last sale, per product
  mutate(dist = as.numeric(difftime(days, min(days), units = "days"))) %>%  # days elapsed since the product's first sale
  ungroup() %>%
  left_join(dt, by = c("Product" = "Product", "days" = "Date")) %>%         # join info to get quantities for each day
  mutate(Quantity = ifelse(is.na(Quantity), 0, Quantity),                    # replace NAs with 0s
         id = floor(dist / 5 + 1)) %>%                                        # period id; each period spans 5 calendar days (e.g. 03/12 to 07/12)
  group_by(Product, id) %>%
  summarise(Sum = sum(Quantity),
            min_date = min(days),
            max_date = max(days)) %>%
  ungroup()
# Product id Sum min_date max_date
# 1 A 1 4 2015-12-03 2015-12-07
# 2 A 2 1 2015-12-08 2015-12-12
# 3 A 3 0 2015-12-13 2015-12-17
# 4 A 4 0 2015-12-18 2015-12-22
# 5 A 5 0 2015-12-23 2015-12-27
# 6 A 6 0 2015-12-28 2016-01-01
# 7 A 7 0 2016-01-02 2016-01-06
# 8 A 8 0 2016-01-07 2016-01-11
# 9 A 9 0 2016-01-12 2016-01-16
# 10 A 10 0 2016-01-17 2016-01-21
# .. ... .. ... ... ...
The first row of the output tells you that for product A, in the first period (id = 1), a total quantity of 4 was sold, and that the period runs from 3/12 to 7/12.
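The period arithmetic can also be sketched in base R for a single product. The dates below are hypothetical: they follow the narrative in the question (sales on 18/12 and 19/12 in the last period), not the table as typed, and term_length = 5 mirrors the 03/12 to 07/12 window used in the answer above. Turning the period id into a factor with all levels is what keeps the empty period as a 0.

```r
# Hypothetical sales for one product, chosen to match the question's narrative
sales <- data.frame(
  Date = as.Date(c("2015-12-03", "2015-12-04", "2015-12-05",
                   "2015-12-08", "2015-12-18", "2015-12-19")),
  Quantity = c(1, 2, 1, 1, 1, 1)
)

term_length <- 5  # calendar days per delivery term (assumption, see lead-in)

# Days since the first sale, integer-divided into consecutive terms
term <- as.integer(sales$Date - min(sales$Date)) %/% term_length + 1L

# Keep every term between 1 and the last one as a factor level,
# so terms without sales still get a slot
term_f <- factor(term, levels = seq_len(max(term)))

totals <- tapply(sales$Quantity, term_f, sum)
totals[is.na(totals)] <- 0   # empty terms come back as NA

per_term <- as.vector(totals)
per_term
# 4 1 0 2
```

For many products, the same steps could be wrapped in a function and applied per product, for example with split() and lapply().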
I would suggest dplyr's group_by(), mutate() and summarise() functions. group_by() groups your data by the desired variables (in your case, product and delivery term), mutate() allows operations on grouped columns, and summarise() applies a summarising function over these groups (in your case sum(Quantity)).
So this is how it will look.
Convert the date into a proper format:
library(dplyr)
df <- as_tibble(df)
df$Date <- as.Date(df$Date, format = "%d/%m/%y")
Calculate the delivery terms:
df <- group_by(df, Product) %>% arrange(Date)
df <- mutate(df, term = 1 + as.numeric(Date - min(Date)) %/% 4)
Group by product and term and calculate the sum of the quantity:
df <- group_by(df, Product, term)
summarise(df, sum = sum(Quantity))
Here's a base R way:
df$groups <- ave(as.numeric(df$Date), df$Product, FUN=function(x) {
intrvl <- findInterval(x, seq(min(x), max(x),4))
as.numeric(factor(intrvl))
})
df
# Product Quantity Date groups
# 1 A 1 2015-12-03 1
# 2 A 2 2015-12-04 1
# 3 A 1 2015-12-05 1
# 4 A 1 2015-12-08 2
# 5 A 1 2016-12-17 3
# 6 A 1 2016-12-18 3
# 7 B 1 2015-12-19 2
# 8 B 2 2015-05-10 1
# 9 B 2 2015-05-11 1
# 10 C 1 2015-06-01 1
# 11 C 1 2015-06-02 1
# 12 C 1 2015-06-12 2
The dates should be converted to one of the date classes. I chose as.Date. When it converts to numeric, the output will be the number of days from a specified date. From there, we are able to group by 4 day increments.
Data
df$Date <- as.Date(df$Date, format="%d/%m/%y")
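To illustrate the two ideas in this answer (a Date as a day count, and findInterval for binning), here is a small sketch. The dates are hypothetical and kept within one month so the bins are easy to follow. It also shows a side effect worth knowing: relabelling the bins through factor() renumbers them consecutively, so a window with no sales disappears from the numbering.

```r
# A Date is stored as the number of days since 1970-01-01
origin_days <- as.numeric(as.Date("1970-01-01"))   # 0
ten_days    <- as.numeric(as.Date("1970-01-11"))   # 10

# Hypothetical sale dates for one product
d <- as.Date(c("2015-12-03", "2015-12-04", "2015-12-05",
               "2015-12-08", "2015-12-17", "2015-12-18"))
x <- as.numeric(d)

# Window index for each date, with windows starting every 4 days
raw_bins <- findInterval(x, seq(min(x), max(x), 4))
raw_bins
# 1 1 1 2 4 4

# Relabelling through factor() closes the gap left by the empty window 3
relabelled <- as.numeric(factor(raw_bins))
relabelled
# 1 1 1 2 3 3
```

If the empty windows need to show up as zeros, raw_bins is the safer grouping variable.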
