Using geom_bar to represent averages

I'm looking at daily bookings for a hotel room based on the days before arrival.
I think booking speed varies by day of week and by hotel (A vs. B), so I'd like to facet by these categories. However, when I facet (7 days x 2 hotels = 14 facets), the divisor is the total number of dates rather than the number of dates in each facet. That is, I have 1400 unique Date-Hotels, so everything is divided by 1400 instead of by roughly 100 per facet. I'd like my code to divide by 97, 103, 101, and so on, depending on how many Hotel-Dates fall in each facet, so that each panel represents a "typical" booking pattern.
Here is my current data and code:
DaysBeforeArrival = rep(1:5, 8)
Hotel = rep(LETTERS[1:2], 20)
DayOfWeek = c(rep(1, 10), rep(2, 10), rep(1, 10), rep(2, 10))
Dates = c(rep("Jan-1", 10), rep("Jan-2", 10), rep("Jan-8", 10), rep("Jan-9", 10))
bookings = sample(1:40)
Date_HotelID = paste(Hotel, Dates, sep = "-")
mydf = data.frame(DaysBeforeArrival, Hotel, DayOfWeek, Dates, bookings, Date_HotelID)

library(ggplot2)
ggplot(mydf, aes(DaysBeforeArrival, bookings/length(unique(Date_HotelID)))) +
  geom_bar(stat = "identity") +
  facet_grid(DayOfWeek ~ Hotel)
Thanks!

Is this what you wanted to achieve?
library(ggplot2)
ggplot(mydf, aes(DaysBeforeArrival, bookings/length(unique(Date_HotelID)))) +
  geom_bar(stat = "identity") +
  facet_wrap(~ Hotel + DayOfWeek)

One approach is simply to calculate what you want to plot before making the graph. In your case, you'd just need to count the unique Date_HotelID values for each DayOfWeek/Hotel combination, and then divide bookings by that count for each row.
For example, I might do this with functions from dplyr. Note that I use n_distinct, the dplyr equivalent of length(unique(...)).
library(dplyr)
mydf3 <- mydf %>%
  group_by(DayOfWeek, Hotel) %>%
  mutate(book.speed = bookings / n_distinct(Date_HotelID))
mydf3
Source: local data frame [40 x 7]
Groups: DayOfWeek, Hotel [4]
DaysBeforeArrival Hotel DayOfWeek Dates bookings Date_HotelID book.speed
(int) (fctr) (dbl) (fctr) (int) (fctr) (dbl)
1 1 A 1 Jan-1 5 A-Jan-1 2.5
2 2 B 1 Jan-1 34 B-Jan-1 17.0
3 3 A 1 Jan-1 20 A-Jan-1 10.0
4 4 B 1 Jan-1 11 B-Jan-1 5.5
5 5 A 1 Jan-1 13 A-Jan-1 6.5
6 1 B 1 Jan-1 38 B-Jan-1 19.0
7 2 A 1 Jan-1 7 A-Jan-1 3.5
8 3 B 1 Jan-1 15 B-Jan-1 7.5
9 4 A 1 Jan-1 22 A-Jan-1 11.0
10 5 B 1 Jan-1 14 B-Jan-1 7.0
.. ... ... ... ... ... ... ...
Then just make your graph with the calculated data.
ggplot(mydf3, aes(DaysBeforeArrival, book.speed)) +
geom_bar(stat="identity") +
facet_grid(DayOfWeek ~ Hotel)
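If you prefer to collapse the data before plotting, here is a minimal sketch of the same idea (assuming the same mydf; geom_col() is shorthand for geom_bar(stat = "identity")):

library(dplyr)
library(ggplot2)

# average total bookings per Date-Hotel within each facet/x combination
plot_df <- mydf %>%
  group_by(DayOfWeek, Hotel, DaysBeforeArrival) %>%
  summarize(book.speed = sum(bookings) / n_distinct(Date_HotelID))

ggplot(plot_df, aes(DaysBeforeArrival, book.speed)) +
  geom_col() +
  facet_grid(DayOfWeek ~ Hotel)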

Related

How to select consecutive measurement cycles

I am working with a dataset that contains variables measured on permanent plots. These plots are remeasured every couple of years. The data looks roughly like the table at the bottom. I used the following code to slice out the initial measurement at t1. Now I want to slice t2, the remeasurement one step after the minimum Cycle (or minimum Measured_year). This is particularly a problem for plots that have more than two remeasurements (num_obs > 2), where the Measured_year intervals and Cycle intervals differ.
I would really appreciate the help. I have been stuck on this for quite some time now.
df_Time1 <- df %>% group_by(State, County, Plot) %>% slice(which.min(Cycle))
State County Plot Measured_year basal_area tph Cycle num_obs
1 1 1 2006 10 10 8 2
2 1 2 2002 20 20 7 3
1 1 1 2009 30 30 9 2
2 1 1 2005 40 40 6 3
2 1 1 2010 50 50 8 3
2 1 2 2013 60 60 10 2
2 1 2 2021 70 70 12 3
2 1 1 2019 80 80 13 3
Create a t variable for yourself based on the Cycle order (row_number() gives the rank of Cycle within each group; note this should run on the full df, not the already-sliced df_Time1):
df %>%
  group_by(State, County, Plot) %>%
  mutate(t = row_number(Cycle))
You can then filter on t == 1 or t == 2, etc.
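For example, a minimal sketch that pulls the second measurement (t2) for each plot, assuming the full df from the question:

library(dplyr)

df_Time2 <- df %>%
  group_by(State, County, Plot) %>%
  mutate(t = row_number(Cycle)) %>%  # 1 = earliest Cycle, 2 = next remeasurement, ...
  filter(t == 2)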

How to select random rows from R data frame to include all distinct values of two columns

I want to select a random sample of rows from a large R data frame df (around 10 million rows) in such a way that all distinct values of two columns are included in the resulting sample. df looks like:
StoreID WEEK Units Value ProdID
2001 1 1 3.5 20702
2001 2 2 3 20705
2002 32 3 6 23568
2002 35 5 15 24025
2003 1 2 10 21253
I have the following unique values in the respective columns: StoreID: 1433 and WEEK: 52. When I generate a random sample of rows from df, I must have at least one row each for each StoreID and each WEEK value.
I used the sample_frac function from dplyr in various trials, but it does not ensure that all distinct values of StoreID and WEEK are included at least once in the resulting sample. How can I achieve what I want?
It sounds like you need to group by the desired columns before sampling rows. The last line below returns one random row for each unique storeid-week pairing.
library(dplyr)

df <- data.frame(storeid = sample(2000:2010, 1000, replace = TRUE),
                 week = sample(1:52, 1000, replace = TRUE),
                 value = runif(1000))

# count duplicated storeid-week pairs
df %>% count(storeid, week) %>% filter(n > 1)

# one random row per unique storeid-week pairing
df %>% group_by(storeid, week) %>% sample_n(1)
# A tibble: 468 x 3
# Groups: storeid, week [468]
storeid week value
<int> <int> <dbl>
1 2000 1 0.824
2 2000 2 0.0987
3 2000 6 0.916
4 2000 8 0.289
5 2000 9 0.610
6 2000 11 0.0807
7 2000 12 0.592
8 2000 13 0.849
9 2000 14 0.0181
10 2000 16 0.182
# ... with 458 more rows
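The grouped draw above guarantees each observed storeid-week pairing appears exactly once. If you also need a larger sample, one possible extension (my own sketch, not part of the answer) is to keep those guaranteed rows and top up with extra random rows:

library(dplyr)

df$row_id <- seq_len(nrow(df))  # temporary key so chosen rows can be excluded
guaranteed <- df %>% group_by(storeid, week) %>% sample_n(1) %>% ungroup()
extra <- df %>% filter(!row_id %in% guaranteed$row_id) %>% sample_frac(0.01)
sampled <- bind_rows(guaranteed, extra) %>% select(-row_id)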
I'm not sure I have read the problem correctly, but I would have tried the following, using the sample function.
Assuming your data frame is called MyDataFrame and is two-dimensional, I would have done it like this.
RandomizedDF <- MyDataFrame[sample(nrow(MyDataFrame)), ]
Let me know if this is what you wanted, or if you were after something else.
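Whichever approach you use, it is worth verifying the coverage constraint afterwards. A quick check, where sampled and df stand for your sample and the full data (placeholder names):

stopifnot(
  setequal(unique(sampled$storeid), unique(df$storeid)),
  setequal(unique(sampled$week), unique(df$week))
)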

Performing the column sum based on row values [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
Hi, I have 3 data sets containing items and counts. I need to combine all the data sets and sum the counts based on the item names. Here is my input.
Df1 <- data.frame(items = c("Cookies", "Candys", "Toys", "Games"), Counts = c(10, 20, 30, 5))
Df2 <- data.frame(items = c("Candys", "Cookies", "Toys"), Counts = c(5, 21, 20))
Df3 <- data.frame(items = c("Playdows", "Gummies", "Candys"), Counts = c(10, 15, 20))
Df_all <- rbind(Df1,Df2,Df3)
Df_all
items Counts
1 Cookies 10
2 Candys 20
3 Toys 30
4 Games 5
5 Candys 5
6 Cookies 21
7 Toys 20
8 Playdows 10
9 Gummies 15
10 Candys 20
I need to combine the rows based on the item values and drop the duplicate rows after adding the counts. My output should be
items Counts
1 Cookies 31
2 Candys 45
3 Toys 50
4 Games 5
5 Playdows 10
6 Gummies 15
Could you help me get this output in R?
Use dplyr:
library(dplyr)
result <- Df_all %>% group_by(items) %>% summarize(sum(Counts))
> result
# A tibble: 6 x 2
items `sum(Counts)`
<fct> <dbl>
1 Candys 45.0
2 Cookies 31.0
3 Games 5.00
4 Toys 50.0
5 Gummies 15.0
6 Playdows 10.0
You can use tapply
tapply(Df_all$Counts, Df_all$items, FUN=sum)
which returns
Candys Cookies Games Toys Gummies Playdows
45 31 5 50 15 10
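If you want the result back as a data frame rather than a named vector, aggregate is another base R option (my addition, not from the original answers):

aggregate(Counts ~ items, data = Df_all, FUN = sum)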

How to calculate moving average for different starting date?

I would like to calculate moving averages for each participant in the dataset.
A participant may have more than one visit date, and I would like to calculate the average value over the past 3 days and over the past 2 days before each visit (not including the day of the visit).
For example, let id=1, date=6/6/2017.
Average value in the past 2 days should be an average of value on 6/5/2017 and 6/4/2017.
Sample datasets are generated as below.
I am working on a much larger dataset, with more participants, more visits, and more days of value. I want to find an efficient way to calculate these averages.
timeseries <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3), date=c("6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017",
"6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017",
"6/1/2017","6/2/2017","6/3/2017","6/4/2017","6/5/2017","6/6/2017"),
value=c(2,3,4,NA,6,7,
NA,9,5,NA,3,2,
5,7,3,8,3,5))
> timeseries
id date value
1 1 6/1/2017 2
2 1 6/2/2017 3
3 1 6/3/2017 4
4 1 6/4/2017 NA
5 1 6/5/2017 6
6 1 6/6/2017 7
7 2 6/1/2017 NA
8 2 6/2/2017 9
9 2 6/3/2017 5
10 2 6/4/2017 NA
...
visit <- data.frame(id=c(1,1,2,3,3,3),
date=c("6/6/2017","6/5/2017",
"6/6/2017",
"6/6/2017","6/5/2017","6/4/2017"))
> visit
id date
1 1 6/6/2017
2 1 6/5/2017
3 2 6/6/2017
4 3 6/6/2017
5 3 6/5/2017
6 3 6/4/2017
The result table should be something like this, where mean3 is the average value in the past 3 days, and mean2 is the average value in the past 2 days
> result
id date mean3 mean2
1 1 6/6/2017
2 1 6/5/2017
3 2 6/6/2017
4 3 6/6/2017
5 3 6/5/2017
6 3 6/4/2017
For each id in visit, I subset the corresponding data from timeseries and then calculate the mean of value within n_days before the visit.
library(lubridate)
n_days = 2
sapply(1:NROW(visit), function(i)
with(subset(x = timeseries,
subset = timeseries$id == visit$id[i]),
mean(x = value[difftime(time1 = mdy(visit$date[i]),
time2 = mdy(date),
units = "days") <= n_days &
difftime(time1 = mdy(visit$date[i]),
time2 = mdy(date),
units = "days") > 0],
na.rm = TRUE)))
#[1] 6.0 4.0 3.0 5.5 5.5 5.0
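Building on that, a small sketch that wraps the same logic in a helper and fills in the asker's result table for both windows (the helper name mean_past is my own):

library(lubridate)

mean_past <- function(n_days) {
  sapply(seq_len(nrow(visit)), function(i) {
    ts_i <- subset(timeseries, id == visit$id[i])
    gap <- as.numeric(mdy(visit$date[i]) - mdy(ts_i$date))  # days before the visit
    mean(ts_i$value[gap > 0 & gap <= n_days], na.rm = TRUE)
  })
}

result <- data.frame(visit, mean3 = mean_past(3), mean2 = mean_past(2))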

How to calculate top rows from a large data set

I have a dataset with the following columns: Flavor, Flavorid, and unitsoled.
Flavor Flavorid unitsoled
beans 350 6
creamy 460 2
.
.
.
I want to find the top ten flavors and then calculate the market share for each flavor. My logic: market share for each flavor = units sold for that flavor divided by total units sold.
How do I implement this? For output I just want two columns, Flavorid and the corresponding market share. Do I need to save the top ten flavors in some table first?
One way is with the dplyr package:
An example data set:
flavor <- rep(letters[1:15],each=5)
flavorid <- rep(1:15,each=5)
unitsold <- 1:75
df <- data.frame(flavor,flavorid,unitsold)
> df
flavor flavorid unitsold
1 a 1 1
2 a 1 2
3 a 1 3
4 a 1 4
5 a 1 5
6 b 2 6
7 b 2 7
8 b 2 8
9 b 2 9
...
...
Solution:
library(dplyr)
df %>%
select(flavorid,unitsold) %>% #select the columns you want
group_by(flavorid) %>% #group by flavorid
summarise(total=sum(unitsold)) %>% #sum the total units sold per id
mutate(marketshare=total/sum(total)) %>% #calculate the market share per id
arrange( desc(marketshare)) %>% #order by marketshare descending
head(10) #pick the 10 first
#and you can add another select(flavorid,marketshare) if you only want those two
Output:
Source: local data frame [10 x 3]
flavorid total marketshare
1 15 365 0.12807018
2 14 340 0.11929825
3 13 315 0.11052632
4 12 290 0.10175439
5 11 265 0.09298246
6 10 240 0.08421053
7 9 215 0.07543860
8 8 190 0.06666667
9 7 165 0.05789474
10 6 140 0.04912281
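As a side note, newer dplyr versions can fold the select/group_by/summarise steps into count() with a weight (this assumes dplyr >= 0.8.1 for the name argument; not part of the original answer):

library(dplyr)

df %>%
  count(flavorid, wt = unitsold, name = "total") %>%  # total units sold per id
  mutate(marketshare = total / sum(total)) %>%
  arrange(desc(marketshare)) %>%
  head(10)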
