How can I convert a column of integers that encode dates:
DATE PRCP
1: 19490101 25
2: 19490102 5
3: 19490118 18
4: 19490119 386
5: 19490202 38
to a table like this:
days month years PRCP
We can use extract from tidyr:
library(tidyr)
extract(df, DATE, into=c('YEAR', 'MONTH', 'DAY'),
'(.{4})(.{2})(.{2})', remove=FALSE)
# DATE YEAR MONTH DAY PRCP
#1 19490101 1949 01 01 25
#2 19490102 1949 01 02 5
#3 19490118 1949 01 18 18
#4 19490119 1949 01 19 386
#5 19490202 1949 02 02 38
Here's another way using regular expressions:
df <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
DATE PRCP
19490101 25
19490102 5
19490118 18
19490119 386
19490202 38")
dates <- as.character(df$DATE)
res <- t(sapply(regmatches(dates, regexec("(\\d{4})(\\d{2})(\\d{2})", dates)), "[", -1))
res <- structure(as.integer(res), .Dim=dim(res)) # make them integer values
cbind(df, setNames(as.data.frame(res), c("Y", "M", "D"))) # combine with original data frame
# DATE PRCP Y M D
# 1 19490101 25 1949 1 1
# 2 19490102 5 1949 1 2
# 3 19490118 18 1949 1 18
# 4 19490119 386 1949 1 19
# 5 19490202 38 1949 2 2
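I would advise you to use the lubridate package (the := syntax below is data.table's, so df must be a data.table):
library(data.table)
library(lubridate)
setDT(df) # convert df to a data.table by reference
df[, DATE := ymd(DATE)] # parse the YYYYMMDD integers as Dates
df[, c("Day", "Month", "Year") := list(day(DATE), month(DATE), year(DATE))]
df[, DATE := NULL] # drop the original column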
Another option would be to use separate from the tidyr package:
library(tidyr)
separate(df, DATE, c('year','month','day'), sep = c(4,6), remove = FALSE)
which results in:
DATE year month day PRCP
1: 19490101 1949 01 01 25
2: 19490102 1949 01 02 5
3: 19490118 1949 01 18 18
4: 19490119 1949 01 19 386
5: 19490202 1949 02 02 38
Two options in base R:
1) with substr, as suggested by @coffeinjunky in the comments:
df$year <- substr(df$DATE,1,4)
df$month <- substr(df$DATE,5,6)
df$day <- substr(df$DATE,7,8)
2) with as.Date and format:
df$DATE <- as.Date(as.character(df$DATE),'%Y%m%d')
df$year <- format(df$DATE, '%Y')
df$month <- format(df$DATE, '%m')
df$day <- format(df$DATE, '%d')
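Since DATE is stored as an integer, a third base R possibility is plain integer arithmetic, which avoids the round trip through character strings. This is only a sketch under the assumption that every value really has the form YYYYMMDD; the data frame below is rebuilt from the question's sample, and the column names follow the table the question asks for.

```r
# Sample data copied from the question
df <- data.frame(DATE = c(19490101L, 19490102L, 19490118L, 19490119L, 19490202L),
                 PRCP = c(25L, 5L, 18L, 386L, 38L))

# Peel the fields off the integer with integer division and modulo
out <- data.frame(
  days  = df$DATE %% 100L,             # last two digits
  month = (df$DATE %/% 100L) %% 100L,  # middle two digits
  years = df$DATE %/% 10000L,          # leading four digits
  PRCP  = df$PRCP
)
out
```

The result columns are integers, so there are no leading zeros to strip, but nothing here checks that the values are valid calendar dates.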
First I would convert the DATE column to Date type using as.Date(), then build the new data.frame using calls to format():
df <- data.frame(DATE = c(19490101, 19490102, 19490118, 19490119, 19490202), PRCP = c(25, 5, 18, 386, 38), stringsAsFactors = FALSE)
df$DATE <- as.Date(as.character(df$DATE), '%Y%m%d')
data.frame(day = as.integer(format(df$DATE, '%d')),
           month = as.integer(format(df$DATE, '%m')),
           year = as.integer(format(df$DATE, '%Y')),
           PRCP = df$PRCP)
## day month year PRCP
## 1 1 1 1949 25
## 2 2 1 1949 5
## 3 18 1 1949 18
## 4 19 1 1949 386
## 5 2 2 1949 38
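One practical reason to go through as.Date() rather than cutting the string by position: values that are not real calendar dates come back as NA instead of being split silently. A small sketch, where the second and third inputs are made-up bad values that do not appear in the question's data:

```r
# First value is from the question; the other two are hypothetical invalid dates
dates_int <- c(19490101L, 19490231L, 19491301L)

parsed <- as.Date(as.character(dates_int), format = "%Y%m%d")

# TRUE marks the entries that could not be parsed as a calendar date
bad <- is.na(parsed)
bad
```

With substr() or a regex, 19490231 would happily yield month 02 and day 31, so this check may be worth running first if the source data is not trusted.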
Related
So I have values like
Mon 162 Tue 123 Wed 29
and so on. I need to find the average for all weekdays in R. I have tried filter and group_by but cannot get an answer.
Time Day Count Speed
1 00:00 Sun 169 60.2
2 00:00 Mon 71 58.5
3 00:00 Tue 70 57.2
4 00:00 Wed 68 58.5
5 00:00 Thu 91 58.8
6 00:00 Fri 94 58.7
7 00:00 Sat 135 58.5
8 01:00 Sun 111 60.0
9 01:00 Mon 45 59.2
10 01:00 Tue 50 57.6
I need the outcome to be Weekday Average = ####
Let's say your df is
> df
# A tibble: 14 x 2
Day Count
<chr> <dbl>
1 Sun 31
2 Mon 51
3 Tue 21
4 Wed 61
5 Thu 31
6 Fri 51
7 Sat 65
8 Sun 31
9 Mon 13
10 Tue 61
11 Wed 72
12 Thu 46
13 Fri 62
14 Sat 13
You can use
library(dplyr)
df %>%
filter(!Day %in% c('Sun', 'Sat')) %>%
group_by(Day) %>%
summarize(mean(Count))
To get
# A tibble: 5 x 2
Day `mean(Count)`
<chr> <dbl>
1 Fri 56.5
2 Mon 32
3 Thu 38.5
4 Tue 41
5 Wed 66.5
For the average of all filtered values
df %>%
filter(!Day %in% c("Sun", "Sat")) %>%
summarize("Average of all Weekday counts" = mean(Count))
Output
# A tibble: 1 x 1
`Average of all Weekday counts`
<dbl>
1 46.9
To get a numeric value instead of a tibble
df %>%
filter(!Day %in% c("Sun", "Sat")) %>%
summarize("Average of all Weekday counts" = mean(Count)) %>%
as.numeric()
Output
[1] 46.9
This might do the trick:
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
d.f <- data.frame(Day = rep(days, 3), Speed = rnorm(21))
# split the data frame by its Day column, then take the mean of Speed within each day
lapply(split(d.f, f = d.f$Day), function(d) mean(d$Speed))
If you're looking for the single mean for just the weekdays, you could do something like this:
dat = data.frame(Time = rep(c("00:00","01:00"),c(7,3)),
Day = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat","Sun","Mon","Tue"),
Count = c(169,71,70,68,91,94,135,111,45,50),
Speed = c(60.2,58.5,57.2,58.5,58.8,58.7,58.5,60.0,59.2,57.6))
mean(dat$Count[dat$Day %in% c("Mon","Tue","Wed","Thu","Fri")])
# [1] 69.85714
If, on the other hand, you're looking for the mean across each individual day then you could do this using base R:
aggregate(dat$Count, by=list(dat$Day), FUN = mean)
# Group.1 x
# 1 Fri 94
# 2 Mon 58
# 3 Sat 135
# 4 Sun 140
# 5 Thu 91
# 6 Tue 60
# 7 Wed 68
It looks like you've tried dplyr, so the syntax for that same operation in dplyr would be:
library(dplyr)
dat %>% group_by(Day) %>% summarize(mean_count = mean(Count))
# Day mean_count
# <chr> <dbl>
# 1 Fri 94
# 2 Mon 58
# 3 Sat 135
# 4 Sun 140
# 5 Thu 91
# 6 Tue 60
# 7 Wed 68
And if you want to do the same thing in data.table you would do this:
library(data.table)
as.data.table(dat)[,.(mean_count = mean(Count)), by = Day]
# Day mean_count
# 1: Sun 140
# 2: Mon 58
# 3: Tue 60
# 4: Wed 68
# 5: Thu 91
# 6: Fri 94
# 7: Sat 135
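The aggregate() and dplyr outputs above come back in alphabetical order (Fri, Mon, ...), because Day is a plain character column. If calendar order matters, one option is to turn Day into a factor with explicit levels before summarising. A minimal base R sketch, reusing the Day and Count values from the question; the names weekdays_only, wk and by_day are just illustrative:

```r
dat <- data.frame(
  Day   = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun", "Mon", "Tue"),
  Count = c(169, 71, 70, 68, 91, 94, 135, 111, 45, 50)
)

weekdays_only <- c("Mon", "Tue", "Wed", "Thu", "Fri")

# Keep weekday rows only, then fix the level order so results sort Mon..Fri
wk <- dat[dat$Day %in% weekdays_only, ]
wk$Day <- factor(wk$Day, levels = weekdays_only)

# Mean Count per weekday, returned in level order
by_day <- tapply(wk$Count, wk$Day, mean)
by_day
```

The same factor trick should carry over to the dplyr and data.table versions, since both group on the factor levels.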
With format() I can extract year, month and day as follows:
date day month year
<date> <fctr> <fctr> <fctr>
2005-01-01 01 01 2005
2005-01-01 01 01 2005
2005-01-02 02 01 2005
2005-01-02 02 01 2005
2005-01-03 03 01 2005
2005-01-03 03 01 2005
...
2010-12-31 31 12 2010
2010-12-31 31 12 2010
2010-12-31 31 12 2010
2010-12-31 31 12 2010
2010-12-31 31 12 2010
2010-12-31 31 12 2010
However, I also want to count how many days, weeks and months have elapsed from the start to the end. That is, I want to create day, week and month numbers as follows:
date day month year day_num week_num month_num
<date> <fctr> <fctr> <fctr> <double> <double> <double>
2005-01-01 01 01 2005 1 1 1
2005-01-01 01 01 2005 1 1 1
2005-01-02 02 01 2005 2 1 1
2005-01-02 02 01 2005 2 1 1
2005-01-03 03 01 2005 3 1 1
2005-01-03 03 01 2005 3 1 1
...
2005-02-28 28 02 2005 59 9 2
2005-03-01 01 03 2005 60 9 3
2005-03-02 02 03 2005 61 9 3
...
How can I do that without miscounting?
You can use difftime to get the number of days and weeks, but you need a workaround for the number of months. Because difftime counts from zero, add 1 to each counter to get the 1-based numbering shown in the question:
library(lubridate)
library(dplyr)
df %>%
  mutate(
    day_num = as.numeric(difftime(date, min(date), units = "days")) + 1,
    week_num = floor(as.numeric(difftime(date, min(date), units = "weeks"))) + 1,
    tmp = year(date) * 12 + month(date), # running month index, so the count survives year boundaries
    month_num = tmp - min(tmp) + 1
  ) %>%
  select(-tmp)
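For comparison, the same counters can be sketched in base R and checked against the rows the question lists. This assumes the numbering should start at 1 on the earliest date and that a "week" means a 7-day block counted from that date (not a calendar week); the six dates are taken from the question's expected output.

```r
dates <- as.Date(c("2005-01-01", "2005-01-02", "2005-01-03",
                   "2005-02-28", "2005-03-01", "2005-03-02"))
start <- min(dates)

# Days elapsed since the first date, shifted to start at 1
day_num <- as.integer(dates - start) + 1L

# 7-day blocks since the first date, shifted to start at 1
week_num <- as.integer(dates - start) %/% 7L + 1L

# Running month index (year * 12 + month), shifted to start at 1
ym <- as.integer(format(dates, "%Y")) * 12L + as.integer(format(dates, "%m"))
month_num <- ym - min(ym) + 1L

data.frame(date = dates, day_num, week_num, month_num)
```

For 2005-02-28 this gives day 59, week 9 and month 2, matching the table in the question.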
Use format() with the following codes:
date = strptime('2005-02-28', format='%Y-%m-%d')
format(date, '%j') # Decimal day of the year
format(date, '%U') # Decimal week of the year (starting on Sunday)
format(date, '%W') # Decimal week of the year (starting on Monday)
format(date, '%m') # Decimal month
Output:
[1] "059"
[1] "09"
[1] "09"
[1] "02"
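One caveat with these codes, given that the question's data runs from 2005 to 2010: %j, %U, %W and %m all restart every January, so on their own they do not give a running count across years. A short sketch of the difference, using two dates around the first year boundary:

```r
d <- as.Date(c("2005-12-31", "2006-01-01"))

# Day of year restarts at "001" on 1 January
doy <- format(d, "%j")

# A running count from the first date of the data keeps increasing
running <- as.integer(d - as.Date("2005-01-01")) + 1L

data.frame(date = d, doy, running)
```

So the format() codes are fine for within-year positions, while a count from the start of the data needs a date difference.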
Let's say I have the following data.frame:
library(dplyr)
library(lubridate)
Dates <- seq(as.Date('2017/01/01'), by = 'day', length.out = 365)
A <- data.frame(date = Dates, month = month(Dates), week = week(Dates))
B <- A %>% dplyr::mutate(day = lubridate::wday(date, label = TRUE))
B[350:365,]
date month week day
350 2017-12-16 12 50 Sat
351 2017-12-17 12 51 Sun
352 2017-12-18 12 51 Mon
353 2017-12-19 12 51 Tue
354 2017-12-20 12 51 Wed
355 2017-12-21 12 51 Thu
356 2017-12-22 12 51 Fri
357 2017-12-23 12 51 Sat
358 2017-12-24 12 52 Sun
359 2017-12-25 12 52 Mon
360 2017-12-26 12 52 Tue
361 2017-12-27 12 52 Wed
362 2017-12-28 12 52 Thu
363 2017-12-29 12 52 Fri
364 2017-12-30 12 52 Sat
365 2017-12-31 12 53 Sun
I need to append another ten dates after the end date, from 2018-01-01 to 2018-01-10. The week sequence should continue from the existing data rather than restart. For example:
date month week day
365 2017-12-31 12 53 Sun
366 2018-01-01 1 53 Mon
367 2018-01-02 1 53 Tue
368 2018-01-03 1 53 Wed
369 2018-01-04 1 53 Thu
370 2018-01-05 1 53 Fri
371 2018-01-06 1 53 Sat
372 2018-01-07 1 54 Sun
373 2018-01-08 1 54 Mon
374 2018-01-09 1 54 Tue
375 2018-01-10 1 54 Wed
library(dplyr)
library(lubridate)
Dates<-seq(as.Date('2017/01/01'), by = 'day', length.out = 365)
A <- data.frame(date=(Dates), month=month(Dates), week=week(Dates))
B <- A %>% dplyr::mutate(day = lubridate::wday(date, label = TRUE))
B[350:365,]
B %>%
rbind( # bind rows with the following dataset
data.frame(date = seq(max(B$date)+1, by = 'day', length.out = 10)) %>% # create sequence of new dates
mutate(month = month(date), # add month
day = wday(date, label = TRUE), # add day
week = cumsum(day=="Sun") + max(A$week)) ) %>% # add week: continuous from last week of A and gets updated every Sunday
as_tibble() # only for visualisation purposes
# # A tibble: 375 x 4
# date month week day
# <date> <dbl> <dbl> <ord>
# 1 2017-01-01 1 1 Sun
# 2 2017-01-02 1 1 Mon
# 3 2017-01-03 1 1 Tue
# 4 2017-01-04 1 1 Wed
# 5 2017-01-05 1 1 Thu
# 6 2017-01-06 1 1 Fri
# 7 2017-01-07 1 1 Sat
# 8 2017-01-08 1 2 Sun
# 9 2017-01-09 1 2 Mon
#10 2017-01-10 1 2 Tue
# # ... with 365 more rows
A small tweak to @antoniosk's code: I added the maximum week number from the original data frame, which gives the continuous week numbers as desired.
library(dplyr)
library(lubridate)
Dates<-seq(as.Date('2017/01/01'), by = 'day', length.out = 365)
A <- data.frame(date=(Dates), month=month(Dates), week=week(Dates))
B <- A %>% dplyr::mutate(day = lubridate::wday(date, label = TRUE))
B[350:365,]
C <- B %>%
  rbind( # bind rows with the following dataset
    data.frame(date = seq(max(B$date) + 1, by = 'day', length.out = 10)) %>% # get 10 extra sequential dates after the last date in B
      mutate(month = month(date),
             week = as.numeric(strftime(date, format = "%U")) + max(A$week), # %U restarts at 0 in the new year, so offset it by the last week of A
             day = wday(date, label = TRUE))
  ) %>%
  as_tibble()
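Both answers patch the week number on after the year boundary. An alternative sketch is to compute the week directly as the number of 7-day blocks since the first date, which never restarts. This leans on an assumption specific to this data: the first date, 2017-01-01, is a Sunday, so the 7-day blocks line up with Sunday-started weeks and with the week values shown in the question.

```r
origin <- as.Date("2017-01-01")  # first date in the question's data, a Sunday
dates  <- seq(origin, by = "day", length.out = 375)  # 365 original days + 10 extra

# Continuous week number: 7-day blocks elapsed since the origin, starting at 1
week <- as.integer(dates - origin) %/% 7L + 1L

data.frame(date = dates, week = week)[365:375, ]
```

With a different start date the blocks would no longer begin on Sundays, so the origin would need to be moved back to the preceding Sunday first.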
This is what my data frame looks like. It's data from a song portal (like iTunes or Raaga):
datf <- read.csv(text =
"albumid,date_transaction,listened_time_secs,userid,songid
6263,3/28/2017,59,3747,6263
3691,4/24/2017,53,2417,3691
2222,3/24/2017,34,2417,9856
1924,3/16/2017,19,8514,1924
6691,1/1/2017,50,2186,6691
5195,1/1/2017,64,2186,5195
2179,1/1/2017,37,2186,2179
6652,1/11/2017,33,1145,6652")
My aim is to pick out the rare users. A 'rare' user is one who visits the portal no more than once in each calendar month.
For example, 2186 is not rare. 2417 is rare because it occurred only once in each of 2 different months, and so are 3747, 1145 and 8514.
I've been trying something like this :
DuplicateUsers <- duplicated(songsdata[,2:4])
DuplicateUsers <- songsdata[DuplicateUsers,]
DistinctSongs <- songsdata %>%
distinct(sessionid, date_transaction, .keep_all = TRUE)
RareUsers <- anti_join(DistinctSongs, DuplicateUsers, by='sessionid')
but it doesn't seem to work.
Using library(dplyr) you could do this:
library(dplyr)
# make a new month_id variable to group_by() with
datf$month_id <- gsub("\\/.*", "", datf$date_transaction)
RareUsers <- group_by(datf, userid, month_id) %>%
  filter(n() == 1)
RareUsers
# A tibble: 5 x 6
# Groups: userid, month_id [5]
albumid date_transaction listened_time_secs userid songid month_id
<int> <chr> <int> <int> <int> <chr>
1 6263 3/28/2017 59 3747 6263 3
2 3691 4/24/2017 53 2417 3691 4
3 2222 3/24/2017 34 2417 9856 3
4 1924 3/16/2017 19 8514 1924 3
5 6652 1/11/2017 33 1145 6652 1
You can try something like:
library(dplyr)
datf %>%
  mutate(mth = lubridate::month(lubridate::mdy(date_transaction))) %>%
  group_by(mth, userid) %>%
  filter(n() == 1)
which gives:
albumid date_transaction listened_time_secs userid songid mth
<int> <fctr> <int> <int> <int> <dbl>
1 6263 3/28/2017 59 3747 6263 3
2 3691 4/24/2017 53 2417 3691 4
3 2222 3/24/2017 34 2417 9856 3
4 1924 3/16/2017 19 8514 1924 3
5 6652 1/11/2017 33 1145 6652 1
You can do it with base R:
# parse date and extract month
datf$date_transaction <- as.Date(datf$date_transaction, "%m/%d/%Y")
datf$month <- format(datf$date_transaction, "%m")
# find pairs of userid and month that occur exactly once
aux <- datf[, c("userid", "month")]
dup <- duplicated(aux) | duplicated(aux, fromLast = TRUE) # flag every copy of a repeated pair
RareUsers <- aux[!dup, ]
RareUsers
# userid month
# 1 3747 03
# 2 2417 04
# 3 2417 03
# 4 8514 03
# 8 1145 01
If you need the other columns:
merge(RareUsers, datf)
# userid month albumid date_transaction listened_time_secs songid
# 1 1145 01 6652 2017-01-11 33 6652
# 2 2417 03 2222 2017-03-24 34 9856
# 3 2417 04 3691 2017-04-24 53 3691
# 4 3747 03 6263 2017-03-28 59 6263
# 5 8514 03 1924 2017-03-16 19 1924
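The answers above return the rows whose (userid, month) pair occurs once. If I read the question's definition correctly, a user should only count as rare when none of their months has more than one visit, so a user-level list may be closer to what is wanted. A minimal base R sketch of that reading, using the question's data; month_id, visits and rare_ids are illustrative names, and the month key includes the year on the assumption that the data may span several years.

```r
datf <- read.csv(text =
"albumid,date_transaction,listened_time_secs,userid,songid
6263,3/28/2017,59,3747,6263
3691,4/24/2017,53,2417,3691
2222,3/24/2017,34,2417,9856
1924,3/16/2017,19,8514,1924
6691,1/1/2017,50,2186,6691
5195,1/1/2017,64,2186,5195
2179,1/1/2017,37,2186,2179
6652,1/11/2017,33,1145,6652")

# Year-month key for every transaction
month_id <- format(as.Date(datf$date_transaction, "%m/%d/%Y"), "%Y-%m")

# Rows = users, columns = months, cells = number of visits
visits <- table(datf$userid, month_id)

# A user is rare when their busiest month has at most one visit
rare_ids <- as.integer(rownames(visits)[apply(visits, 1, max) <= 1])
rare_ids
```

On this sample the two readings agree (2186 is the only user dropped), but they would differ for a user with one visit in one month and several in another.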
Hello everybody, I am pretty much completely new to R and any help is greatly appreciated. I have the following data (called "depressionaggregate") from 2004 until 2013 for each month:
Month Year DepressionCount
1 01 2004 285
2 02 2004 323
3 03 2004 267
4 04 2004 276
5 05 2004 312
6 06 2004 232
7 07 2004 228
8 08 2004 280
9 09 2004 277
10 10 2004 335
11 11 2004 273
I am trying to create a new column with the aggregated values for each year for each quarter (i.e. 2004 Q1, 2004 Q2 etc.). I have tried using the function aggregate but have not been successful. Hope you can help me! Regards
1) If DF is the input data.frame convert it to a zoo object z with a "yearmon" index and then aggregate that to "yearqtr":
library(zoo)
toYearmon <- function(y, m) as.yearmon(paste(y, m, sep = "-"))
z <- read.zoo(DF, index = 2:1, FUN = toYearmon)
ag <- aggregate(z, as.yearqtr, sum)
giving:
> ag
2004 Q1 2004 Q2 2004 Q3 2004 Q4
875 820 785 608
2) This would also work:
library(zoo)
yq <- as.yearqtr(as.yearmon(paste(DF$Year, DF$Month), "%Y %m"))
ta <- tapply(DF$DepressionCount, yq, sum)
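3) If you would rather not depend on zoo, the quarter can be derived from the month number and passed to aggregate(), which is the function the question tried. This is a sketch, assuming Month holds values "01" to "12" as in the question; DF below is rebuilt from the 2004 rows shown there, and Quarter is a helper column added for illustration.

```r
DF <- data.frame(
  Month = sprintf("%02d", 1:11),  # "01" .. "11", as printed in the question
  Year  = 2004,
  DepressionCount = c(285, 323, 267, 276, 312, 232, 228, 280, 277, 335, 273)
)

# Months 1-3 -> quarter 1, 4-6 -> quarter 2, and so on
DF$Quarter <- (as.integer(DF$Month) - 1L) %/% 3L + 1L

# Sum the counts within each year and quarter
ag <- aggregate(DepressionCount ~ Year + Quarter, data = DF, FUN = sum)
ag
```

The sums agree with the zoo output above (875, 820, 785, 608); note that Q4 only covers October and November in this sample.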