I currently have the following data frame:
> head(Coyote_reports_garbage)
# A tibble: 6 x 4
name_1 Date Day Collection
<chr> <date> <chr> <chr>
1 PLEASANTVIEW 2013-02-20 Wednesday Friday
2 MCCONACHIE AREA 2012-11-20 Tuesday Friday
3 MAYLIEWAN 2013-11-28 Thursday Friday
4 BROOKSIDE 2013-12-18 Wednesday Thursday
5 KIRKNESS 2012-11-14 Wednesday Friday
6 RIDEAU PARK 2013-11-15 Friday Friday
Where "name_1" is the name of a neighbourhood, "Date" is the date when a report was made, "Day" is the day of the week on which that report was made (derived from the date), and "Collection" is the garbage collection day in that neighbourhood. "Collection" therefore varies by neighbourhood and year.
I am trying to add a column (Day_in_relation_to_collection) where the day would be related to Collection day. If the day of the week is the same as the garbage collection day, Day_in_relation_to_collection = 0. If the day of the week is a day after collection day, Day_in_relation_to_collection = 1, etc.
name_1 Date Day Collection Day_in_relation_to_collection
<chr> <date> <chr> <chr>
1 PLEASANTVIEW 2013-02-20 Wednesday Friday 5
2 MCCONACHIE AREA 2012-11-20 Tuesday Friday 4
3 MAYLIEWAN 2013-11-28 Thursday Friday 6
4 BROOKSIDE 2013-12-18 Wednesday Thursday 6
5 KIRKNESS 2012-11-14 Wednesday Friday 5
6 RIDEAU PARK 2013-11-15 Friday Friday 0
I'm not quite sure how to do this, so any help would be appreciated.
I'm assuming here that Day will always be after Collection, and it will always be the next instance of that day. If so, a simple way to do that would be to make a reference matrix setting up the number of days between a combination of 2 days of the week and then using that to fill in this value:
dnames <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
day_table <- matrix(c(0:6,6,0:5,5:6,0:4,4:6,0:3,3:6,0:2,2:6,0:1,1:6,0),
nrow=7, ncol=7, byrow=T,
dimnames = list(dnames, dnames))
day_table
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
Sunday 0 1 2 3 4 5 6
Monday 6 0 1 2 3 4 5
Tuesday 5 6 0 1 2 3 4
Wednesday 4 5 6 0 1 2 3
Thursday 3 4 5 6 0 1 2
Friday 2 3 4 5 6 0 1
Saturday 1 2 3 4 5 6 0
Now we can use the values of Coyote_reports_garbage$Collection and Coyote_reports_garbage$Day to index into that table and get the appropriate value. We can either run this as a tidyverse mutate statement, or insert it using base R.
Either way, we need to use diag here, as subsetting a matrix with 2 vectors gives a matrix with all combinations of the selected values. The diagonal of that matrix will give the result you want here:
library(tidyverse)
Coyote_reports_garbage %>%
mutate(Day_in_relation_to_collection = diag(day_table[Collection,Day]))
name_1 Date Day Collection Day_in_relation_to_collection
1 PLEASANTVIEW 2013-02-20 Wednesday Friday 5
2 MCCONACHIE AREA 2012-11-20 Tuesday Friday 4
3 MAYLIEWAN 2013-11-28 Thursday Friday 6
4 BROOKSIDE 2013-12-18 Wednesday Thursday 6
5 KIRKNESS 2012-11-14 Wednesday Friday 5
6 RIDEAU PARK 2013-11-15 Friday Friday 0
Or in base R
Coyote_reports_garbage$dr_collect <- diag(day_table[Coyote_reports_garbage$Collection,
Coyote_reports_garbage$Day])
Coyote_reports_garbage
name_1 Date Day Collection dr_collect
1 PLEASANTVIEW 2013-02-20 Wednesday Friday 5
2 MCCONACHIE AREA 2012-11-20 Tuesday Friday 4
3 MAYLIEWAN 2013-11-28 Thursday Friday 6
4 BROOKSIDE 2013-12-18 Wednesday Thursday 6
5 KIRKNESS 2012-11-14 Wednesday Friday 5
6 RIDEAU PARK 2013-11-15 Friday Friday 0
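One thing to keep in mind with larger data: day_table[Collection, Day] builds an n-by-n matrix before diag() pulls out the n values, which can get expensive with many thousands of rows. A minimal sketch of an alternative, assuming the same wrap-around logic as the table above, is to take the difference of the weekday positions modulo 7. The `reports` data frame here is a hypothetical stand-in for the first six rows of Coyote_reports_garbage.

```r
dnames <- c("Sunday", "Monday", "Tuesday", "Wednesday",
            "Thursday", "Friday", "Saturday")

# Hypothetical stand-in for the first six rows of the question's data
reports <- data.frame(
  Day        = c("Wednesday", "Tuesday", "Thursday",
                 "Wednesday", "Wednesday", "Friday"),
  Collection = c("Friday", "Friday", "Friday",
                 "Thursday", "Friday", "Friday")
)

# Days elapsed since the most recent collection day, wrapping around the week
reports$Day_in_relation_to_collection <-
  (match(reports$Day, dnames) - match(reports$Collection, dnames)) %% 7

reports$Day_in_relation_to_collection
#[1] 5 4 6 6 5 0
```

This only needs one pass over each column, so memory use stays proportional to the number of rows.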
Related
I have a dataset in long format. Every subject in the dataset was observed a number of times (from 1 to 3) during the week. The observations occurred once a day from Monday to Friday based on the subject's availability. If a case was observed on fewer than 3 days, I need to generate empty rows for the missing observation(s). If a case was observed on Monday and Friday, the third observation should randomly occur on Tuesday, Wednesday, or Thursday. The code below works well except for one fundamental flaw: when days are randomly generated, they might overlap with days on which subjects were already observed. This should not happen. Feel free to suggest completely different code, should my solution be too complex to edit. Thanks!
#Toy dataset
dataset_long <- data.frame(
id = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5),
observation = c(1, 2, 1, 2, 3, 1, 2, 1, 1, 2),
day_name = c("Monday", "Tuesday", "Monday", "Wednesday", "Thursday", "Tuesday", "Thursday", "Wednesday", "Monday", "Friday"),
scores = sample(20:60, 10)
)
# id observation day_name scores
#1 1 1 Monday 32
#2 1 2 Tuesday 31
#3 2 1 Monday 29
#4 2 2 Wednesday 28
#5 2 3 Thursday 22
#6 3 1 Tuesday 45
#7 3 2 Thursday 30
#8 4 1 Wednesday 36
#9 5 1 Monday 58
#10 5 2 Friday 37
library(tidyverse) # dplyr, tidyr and purrr are all used below
#Identify cases with fewer than 3 observations
cases_to_fill <- dataset_long %>%
  group_by(id) %>%
  summarize(days_observed = n()) %>%
  filter(days_observed < 3) %>%
  ungroup()
#Create vector with day names
all_days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
#Fill the missing observation day randomly
#(this is the flawed step: it samples from all days, including ones already observed)
cases_to_fill_long <- cases_to_fill %>%
  mutate(missing_day = map(days_observed, ~sample(all_days, 3 - .x))) %>%
  unnest(missing_day) %>%
  mutate(observation = 3)
#Join the filled cases with the original dataset
dataset_long_filled <- dataset_long %>%
  full_join(cases_to_fill_long, by = c("id", "observation")) %>%
  arrange(id, observation)
#Coalesce the two columns of day into one
dataset_long_filled |>
  mutate(day_name = coalesce(day_name, missing_day)) |>
  select(-days_observed, -missing_day)
# id observation day_name scores
#1 1 1 Monday 32
#2 1 2 Tuesday 31
#3 1 3 Friday NA
#4 2 1 Monday 29
#5 2 2 Wednesday 28
#6 2 3 Thursday 22
#7 3 1 Tuesday 45
#8 3 2 Thursday 30
#9 3 3 Tuesday NA Tuesday is repeated for this subject
#10 4 1 Wednesday 36
#11 4 3 Wednesday NA Wednesday is repeated for this subject
#12 4 3 Monday NA
#13 5 1 Monday 58
#14 5 2 Friday 37
#15 5 3 Friday NA Friday is repeated for this subject
You can try:
library(tidyr)
library(dplyr)
dataset_long %>%
group_by(id) %>%
complete(day_name = sample(all_days[!(all_days %in% day_name)], 3-n())) %>%
arrange(id, match(day_name, all_days)) %>%
mutate(observation = row_number()) %>%
ungroup()
# A tibble: 15 × 4
id day_name observation scores
<dbl> <chr> <int> <int>
1 1 Monday 1 51
2 1 Tuesday 2 30
3 1 Thursday 3 NA
4 2 Monday 1 42
5 2 Wednesday 2 53
6 2 Thursday 3 35
7 3 Tuesday 1 25
8 3 Thursday 2 27
9 3 Friday 3 NA
10 4 Monday 1 NA
11 4 Wednesday 2 59
12 4 Thursday 3 NA
13 5 Monday 1 50
14 5 Thursday 2 NA
15 5 Friday 3 54
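The key idea in the complete() call is that the random draw only ever sees days the subject has not been observed on. A minimal base R sketch of that one step for a single subject, assuming a hypothetical `observed` vector of the days already recorded:

```r
all_days <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Hypothetical subject observed on two of the five days
observed <- c("Monday", "Friday")

# Only days not yet observed are eligible for the random draw
candidates <- setdiff(all_days, observed)

# Draw as many days as are needed to reach three observations
n_missing <- 3 - length(observed)
extra <- candidates[sample.int(length(candidates), n_missing)]

# Put the combined days back into weekday order
filled <- all_days[all_days %in% c(observed, extra)]
```

Because `extra` is drawn from `candidates` rather than `all_days`, it cannot repeat a day that is already in `observed`.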
I have data that includes a date and day of week.
I would like to identify all instances of a particular weekday that match the given year/month/weekday
in the original data.
For instance, if the first record has the date "2010-07-15", which is a Thursday, I want to row-bind all Thursdays
that occur in July of 2010 to my original dataset.
While adding those new rows, I also want to fill in those new rows with values from the original data for all columns, except one. The exception is a variable which indicates whether or not that row
was in the original dataset or not.
Example data:
(1) alldays -- this data includes all dates and weekdays for the appropriate years.
(2) dt1 -- this is the example dataset that includes the date Adate and the day of week dow, which are used to identify the year/month/weekday and then look up all dates within that same month for the given weekday. For example, all Thursdays in July of 2010 will need to be row-bound to the original data.
library(data.table)
library(tidyverse)
library(lubridate)
alldays <- data.table (date = seq(as.Date("2010-01-01"),
as.Date("2011-12-31"), by="days"))
alldays <- alldays %>%
dplyr::mutate(year = lubridate::year(date),
month = lubridate::month(date),
day = lubridate::day(date),
dow = weekdays(date))
setDT(alldays)
head(alldays)
date year month day dow
1 2010-01-01 2010 1 1 Friday
2 2010-01-02 2010 1 2 Saturday
3 2010-01-03 2010 1 3 Sunday
4 2010-01-04 2010 1 4 Monday
5 2010-01-05 2010 1 5 Tuesday
6 2010-01-06 2010 1 6 Wednesday
Here is an example of the primary dataset
id <- seq(1:2)
admit <- rep(1,2)
zip <- c(54123, 54789)
Adate <- as.Date(c("2010-07-15","2011-03-14"))
year <- c(2010, 2011)
month <- c(7,3)
day <- c(15,14)
dow <- c("Thursday","Monday")
dt1 <- data.table(id, admit, zip, Adate, year, month, day, dow)
dt1
#> id admit zip Adate year month day dow
#> 1: 1 1 54123 2010-07-15 2010 7 15 Thursday
#> 2: 2 1 54789 2011-03-14 2011 3 14 Monday
The resulting dataset should be:
id admit zip Adate year month day dow
1: 1 0 54123 2010-07-01 2010 7 1 Thursday
2: 1 0 54123 2010-07-08 2010 7 8 Thursday
3: 1 1 54123 2010-07-15 2010 7 15 Thursday
4: 1 0 54123 2010-07-22 2010 7 22 Thursday
5: 1 0 54123 2010-07-29 2010 7 29 Thursday
6: 2 0 54789 2011-03-07 2011 3 7 Monday
7: 2 1 54789 2011-03-14 2011 3 14 Monday
8: 2 0 54789 2011-03-21 2011 3 21 Monday
9: 2 0 54789 2011-03-28 2011 3 28 Monday
So we can see that the first date in dt1, 2010-07-15 (associated with id=1), was a Thursday and fell within a month with 4 additional Thursdays, which were added to the dataset. The variable admit indicates whether that row was in the original data (1) or was added afterwards by virtue of being matched (0).
I have tried first selecting the additional dates from alldays with matching weekdays but I am running into issues on how to rowbind those back into the original dataset while filling in the other values appropriately. Eventually I will be running this on a dataset with about 300,000 rows.
Here is an option:
alldays[dt1[, .(id, zip, admit=0L, year, month, dow)],
on=.(year, month, dow), allow.cartesian=TRUE][
dt1, on=.(id, date=Adate), admit := i.admit][]
output:
date year month day dow id zip admit
1: 2010-07-01 2010 7 1 Thursday 1 54123 0
2: 2010-07-08 2010 7 8 Thursday 1 54123 0
3: 2010-07-15 2010 7 15 Thursday 1 54123 1
4: 2010-07-22 2010 7 22 Thursday 1 54123 0
5: 2010-07-29 2010 7 29 Thursday 1 54123 0
6: 2011-03-07 2011 3 7 Monday 2 54789 0
7: 2011-03-14 2011 3 14 Monday 2 54789 1
8: 2011-03-21 2011 3 21 Monday 2 54789 0
9: 2011-03-28 2011 3 28 Monday 2 54789 0
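For readers less familiar with data.table's join syntax, here is a rough base R sketch of the same two steps: join the calendar to the original rows on year/month/weekday, then flag which of the resulting dates was the original one. It rebuilds small copies of alldays and dt1 as plain data frames, and `res` is a name introduced only for this illustration. It is meant to show the logic, not to match the speed of the data.table version on 300,000 rows.

```r
# Calendar of all dates, as in the question
alldays <- data.frame(date = seq(as.Date("2010-01-01"),
                                 as.Date("2011-12-31"), by = "day"))
alldays$year  <- as.integer(format(alldays$date, "%Y"))
alldays$month <- as.integer(format(alldays$date, "%m"))
alldays$day   <- as.integer(format(alldays$date, "%d"))
alldays$dow   <- weekdays(alldays$date)

# Original records; year/month/dow are derived from Adate so that they
# are guaranteed to match the calendar (weekdays() is locale-dependent)
dt1 <- data.frame(id = 1:2, zip = c(54123, 54789),
                  Adate = as.Date(c("2010-07-15", "2011-03-14")))
dt1$year  <- as.integer(format(dt1$Adate, "%Y"))
dt1$month <- as.integer(format(dt1$Adate, "%m"))
dt1$dow   <- weekdays(dt1$Adate)

# Step 1: every calendar date sharing year, month and weekday with a record
res <- merge(dt1, alldays, by = c("year", "month", "dow"))

# Step 2: admit is 1 only for the date that was in the original data
res$admit <- as.integer(res$date == res$Adate)
res <- res[order(res$id, res$date), ]
```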
I am finding it difficult to wrap my head around this:
In the dataframe below I want to delete the first row that follows each block of rows for a given day: the row right after the Thursday block, the row right after the Friday block, and so on. I would prefer not using a loop since the data is big.
mydata<- read.table(header=TRUE, text="
Date AAPL.ret Weekday Thursday
1 2001-01-04 0.000000000 Thursday 1
2 2001-01-04 0.000000000 Thursday 1
3 2001-01-04 -0.025317808 Thursday 1
4 2001-01-04 0.014545711 Thursday 1
5 2001-01-04 0.007194276 Thursday 1
6 2001-01-04 -0.007194276 Thursday 1
7 2001-01-05 -0.0278569545 Friday 0
8 2001-01-05 0.0056338177 Friday 0
9 2001-01-05 0.0037383221 Friday 0
10 2001-01-05 0.0000000000 Friday 0
11 2002-02-25 3.511856e-03 Monday 0
12 2002-02-25 -3.511856e-03 Monday 0
13 2002-02-25 -4.398505e-04 Monday 0
14 2002-02-25 -2.643173e-03 Monday 0
15 2002-02-25 4.401416e-03 Monday 0
16 2002-02-26 9.189066e-03 Tuesday 0
17 2002-02-26 -8.243166e-04 Tuesday 0
18 2002-02-26 9.533751e-03 Tuesday 0
19 2002-02-26 4.527688e-03 Tuesday 0
20 2002-02-26 4.105933e-04 Tuesday 0
.............
100 2002-03-01 8.717651e-03 Friday 0
101 2002-03-01 1.990115e-02 Friday 0
102 2002-03-01 -1.344387e-03 Friday 0
103 2002-03-01 -1.445373e-02 Friday 0
")
The output I need should be like this:
Date AAPL.ret Weekday Thursday
1 2001-01-04 0.000000000 Thursday 1
2 2001-01-04 0.000000000 Thursday 1
3 2001-01-04 -0.025317808 Thursday 1
4 2001-01-04 0.014545711 Thursday 1
5 2001-01-04 0.007194276 Thursday 1
6 2001-01-04 -0.007194276 Thursday 1
7 2001-01-05 0.0056338177 Friday 0
8 2001-01-05 0.0037383221 Friday 0
9 2001-01-05 0.0000000000 Friday 0
11 2002-02-25 -3.511856e-03 Monday 0
12 2002-02-25 -4.398505e-04 Monday 0
13 2002-02-25 -2.643173e-03 Monday 0
14 2002-02-25 4.401416e-03 Monday 0
15 2002-02-26 -8.243166e-04 Tuesday 0
16 2002-02-26 9.533751e-03 Tuesday 0
17 2002-02-26 4.527688e-03 Tuesday 0
18 2002-02-26 4.105933e-04 Tuesday 0
.............
100 2002-03-01 1.990115e-02 Friday 0
101 2002-03-01 -1.344387e-03 Friday 0
102 2002-03-01 -1.445373e-02 Friday 0
Thank you in advance. Sorry if I have wrongfully asked the question. This is my first time of asking a question here; I have tried to follow the rules as best as I can; especially how the table should appear.
The code I have tried is, I believe, really far from the answer I want. It just counts and subsets; see below.
table(ret.df$Weekday=="Thursday")
r1<-ret.df[!(ret.df$Weekday=="Thursday"),]
I hope my question is less vague now.
A follow up from the previous answer:
removing rows based on condition in ret_1ON
ret_1ON<- ret.df[duplicated(ret.df$Date)|1:nrow(ret.df)==1,]
dim(ret_1ON)
[1] 98734 4
head(ret_1ON)
Date AAPL.ret Weekday Thursday
1 2001-01-04 0.000000000 Thursday 1
2 2001-01-04 0.000000000 Thursday 1
3 2001-01-04 -0.025317808 Thursday 1
4 2001-01-04 0.014545711 Thursday 1
5 2001-01-04 0.007194276 Thursday 1
6 2001-01-04 -0.007194276 Thursday 1
tail(ret_1ON)
Date AAPL.ret Weekday Thursday
99994 2006-01-19 0.0013771520 Thursday 1
99995 2006-01-19 -0.0007321584 Thursday 1
99996 2006-01-19 -0.0029026141 Thursday 1
99997 2006-01-19 -0.0002511616 Thursday 1
99998 2006-01-19 0.0011297309 Thursday 1
99999 2006-01-19 -0.0002509410 Thursday 1
I'm wondering why the last item in tail is not 98734 but rather 99999?
dim(ret.df)
[1] 99999 4
which means the condition was applied, though.
We can do this with data.table
library(data.table)
setDT(mydata)[, .SD[(seq_len(.N) != 1)], Date]
If we want to keep the first row of the dataset
setDT(mydata)[, .SD[(seq_len(.N) != 1)|seq_len(.N)==.I[1]], Date]
Or with dplyr
library(dplyr)
mydata %>%
group_by(Date) %>%
filter(row_number() != 1)
Or using base R, if the 'Date' column is ordered
mydata[duplicated(mydata$Date),]
or, including the first row
mydata[duplicated(mydata$Date)|1:nrow(mydata)==1,]
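On the follow-up about tail() showing 99999 rather than 98734: subsetting a data frame keeps the original row names, so the labels printed on the left are the old row positions, not a fresh count. A small sketch with a hypothetical toy data frame `toy` shows the effect and how to renumber:

```r
# Hypothetical toy data: two dates, five rows
toy <- data.frame(Date = c("a", "a", "b", "b", "b"), x = 1:5)

# Same filter as above: drop the first row of each date, but keep row 1
kept <- toy[duplicated(toy$Date) | seq_len(nrow(toy)) == 1, ]

nrow(kept)      # 4 rows remain
rownames(kept)  # labels are still the original positions: "1" "2" "4" "5"

# Reset the labels to 1..n if consecutive numbering is wanted
rownames(kept) <- NULL
```

So dim() reports the true number of rows, while the printed labels only show where each row came from.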
I have recently encountered a problem in R dealing with the dates. The last day of 2015 (2015-12-31) falls on Thursday, meaning last week of the year only contains 5 days if I consider Sunday as the start day of my week. Now, I would like 2016-01-01 and 2016-01-02, which fall on Friday and Saturday, to be associated with week 53, and start week 1 on 2016-01-03, which falls on Sunday.
require(lubridate)
range <- seq(as.Date('2015-12-26'), by = 1, len = 10)
df <- data.frame(range)
df$WKN <- as.numeric(strftime(df$range, format = "%U")) + 1
df$weekday <- weekdays(df$range)
df$weeknum <- wday(df$range)
This would give me the following result:
df:
range WKN weekday weeknum
2015-12-26 52 Saturday 7
2015-12-27 53 Sunday 1
2015-12-28 53 Monday 2
2015-12-29 53 Tuesday 3
2015-12-30 53 Wednesday 4
2015-12-31 53 Thursday 5
2016-01-01 1 Friday 6
2016-01-02 1 Saturday 7
2016-01-03 2 Sunday 1
2016-01-04 2 Monday 2
Now I would like to have my dataframe as follows:
df:
range WKN weekday weeknum
2015-12-26 52 Saturday 7
2015-12-27 53 Sunday 1
2015-12-28 53 Monday 2
2015-12-29 53 Tuesday 3
2015-12-30 53 Wednesday 4
2015-12-31 53 Thursday 5
2016-01-01 53 Friday 6
2016-01-02 53 Saturday 7
2016-01-03 1 Sunday 1
2016-01-04 1 Monday 2
Could anyone point me to a direction to automate that so that I don't have to change the code every year?
If you check out ?strptime, there are a few different week number tokens available for use with format. Here %V almost works, except it starts the week on Monday, so add one to adjust:
df$WKN <- as.integer(format(df$range + 1, '%V'))
df
## range WKN weekday weeknum
## 1 2015-12-26 52 Saturday 7
## 2 2015-12-27 53 Sunday 1
## 3 2015-12-28 53 Monday 2
## 4 2015-12-29 53 Tuesday 3
## 5 2015-12-30 53 Wednesday 4
## 6 2015-12-31 53 Thursday 5
## 7 2016-01-01 53 Friday 6
## 8 2016-01-02 53 Saturday 7
## 9 2016-01-03 1 Sunday 1
## 10 2016-01-04 1 Monday 2
Or if you're using dplyr like the tag suggests,
library(dplyr)
df %>% mutate(WKN = as.integer(format(range + 1, '%V')))
which returns the same thing. The isoweek function of lubridate is equivalent, so you could also do
library(lubridate)
df$WKN <- isoweek(df$range + 1)
or
df %>% mutate(WKN = isoweek(range + 1))
both of which return identical results to the as.integer(format(...)) versions.
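To reuse this across years, the shift can be wrapped in a small helper. This is only a sketch; `sunday_week` is a name made up here, not a function from base R or lubridate, and it relies on the platform's strftime supporting %V on output.

```r
# Hypothetical helper: ISO week number of the following day, which moves
# the start of each week from Monday to Sunday
sunday_week <- function(d) {
  as.integer(format(as.Date(d) + 1, "%V"))
}

sunday_week(c("2015-12-26", "2016-01-02", "2016-01-03"))
#[1] 52 53  1
```

One caveat: this keeps the ISO rule for where week 1 falls, just shifted by a day. In a year such as 2014, where January 1 is a Wednesday, the days before the first Sunday are numbered week 1 rather than 52 or 53, so check a few year boundaries against what you expect before relying on it.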
We can use cumsum on a logical vector
df$WKN <- unique(df$WKN)[cumsum(df$weeknum==1) +1]
df$WKN
#[1] 52 53 53 53 53 53 53 53 1 1
Considering that you are using lubridate, I also wanted to give you a lubridate solution. You also asked for a solution that works with other years. Here goes:
adjust_first_week<- function(year){
first <- floor_date(dmy(paste0("1-1-", year)), "year")
two_weeks <- c(first - days(7:1), first + days(0:6))
df <- data.frame(date = two_weeks,
day_of_week = weekdays(two_weeks),
day_of_year = yday(two_weeks),
week_of_year = week(two_weeks))
last_weekend <- which(df$day_of_week == "Sunday")[2] -1
df$adjust_week <- df$week_of_year
if(last_weekend ==7) return(df)
else{
df$adjust_week[8:last_weekend] <- rep(53,length(8:last_weekend))
}
return(df)
}
Takes a numeric year, and takes the first day of that year.
Creates a two week period by appending a week on either side of 1/1/year.
Calculates various summary statistics for that year for your edification.
Picks out the second Sunday. By design 1/1/year is always the 8th entry.
If Sunday is the first day of the year, it doesn't do anything.
Otherwise it overwrites the week of the year so that the first week of the year starts on the second Sunday.
Here are the results for
adjust_first_week(2016)
date day_of_week day_of_year week_of_year adjust_week
1 2015-12-25 Friday 359 52 52
2 2015-12-26 Saturday 360 52 52
3 2015-12-27 Sunday 361 52 52
4 2015-12-28 Monday 362 52 52
5 2015-12-29 Tuesday 363 52 52
6 2015-12-30 Wednesday 364 52 52
7 2015-12-31 Thursday 365 53 53
8 2016-01-01 Friday 1 1 53
9 2016-01-02 Saturday 2 1 53
10 2016-01-03 Sunday 3 1 1
11 2016-01-04 Monday 4 1 1
12 2016-01-05 Tuesday 5 1 1
13 2016-01-06 Wednesday 6 1 1
14 2016-01-07 Thursday 7 1 1
I'm looking to see if there is a quicker way to replace the days of the week in a R dataframe with a number. Essentially, the question I'm wondering is given one vector and a corresponding vector is there a quick way to apply a replacement to a dataframe.
Here is my dataframe:
month day_of_week skies
1 APR Tuesday Clear
2 APR Wednesday Cloudy
3 APR Thursday Cloudy
4 APR Friday Cloudy
5 APR Saturday Cloudy
6 APR Sunday Clear
The days of the week are in the following vector:
daysweek <- unique(df$day_of_week)
daysweek
[1] Tuesday Wednesday Thursday Friday Saturday Sunday Monday
The corresponding vector is:
days_num <- c(2,3,4,5,6,7,1)
The long way I would do it is without the corresponding vector and using gsub individually. I was wondering if there was a quick way to do it. I couldn't figure it out with a for loop.
for (i in c(1:7)) {
df$result <- gsub(daysweek[i], days_num[i], df$day_of_week)
}
Desired dataframe output I would want would be:
month day_of_week skies
1 APR 2 Clear
2 APR 3 Cloudy
3 APR 4 Cloudy
4 APR 5 Cloudy
5 APR 6 Cloudy
6 APR 7 Clear
Create an index of weekdays and match it with the day_of_week column.
Date <- as.Date('2014-12-29') #Monday
Wdays <- weekdays(seq(Date, length.out=7, by= '1 day'))
df[,2] <- match(df[,2],Wdays)
df[,2]
#[1] 2 3 4 5 6 7
Or you can convert the column to factor with levels from Monday to Sunday and convert it to numeric
as.numeric(factor(df$day_of_week, levels=c("Monday", "Tuesday",
"Wednesday", "Thursday", "Friday", "Saturday", "Sunday")))
#[1] 2 3 4 5 6 7
Update
If you have a vector of numeric indices that correspond to the unique values in the day_of_week column
Un <- c('Tuesday', 'Wednesday', 'Thursday', 'Friday',
'Saturday', 'Sunday', 'Monday')
days_num <- c(2,3,4,5,6,7,1)
set.seed(24)
day_of_week <- sample(Un, 20, replace=TRUE)
unname(setNames(days_num, Un)[day_of_week])
#[1] 4 3 6 5 6 1 3 7 7 3 6 4 6 6 4 1 3 2 5 2
Because you used gsub, another option would be mgsub from qdap
library(qdap)
as.numeric(mgsub(Un, days_num, day_of_week))
#[1] 4 3 6 5 6 1 3 7 7 3 6 4 6 6 4 1 3 2 5 2
or
library(qdapTools)
day_of_week %l% data.frame(Un, days_num)
#[1] 4 3 6 5 6 1 3 7 7 3 6 4 6 6 4 1 3 2 5 2
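As for why the for loop in the question did not work: each pass recomputes df$result from the untouched df$day_of_week and overwrites the previous pass, so only the last replacement survives. A sketch of a loop that accumulates the replacements instead, assuming the six rows shown in the question (the vectorised match() and setNames() versions above are still preferable):

```r
daysweek <- c("Tuesday", "Wednesday", "Thursday", "Friday",
              "Saturday", "Sunday", "Monday")
days_num <- c(2, 3, 4, 5, 6, 7, 1)

# The six rows shown in the question
df <- data.frame(
  month = "APR",
  day_of_week = c("Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday"),
  skies = c("Clear", "Cloudy", "Cloudy", "Cloudy", "Cloudy", "Clear"),
  stringsAsFactors = FALSE
)

# Start from a copy and update it in place on every pass
result <- df$day_of_week
for (i in seq_along(daysweek)) {
  result[df$day_of_week == daysweek[i]] <- days_num[i]
}
df$day_of_week <- as.numeric(result)

df$day_of_week
#[1] 2 3 4 5 6 7
```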