I have the following data set. I am trying to split the date_1 field into month and days. Then converting the month number to a month name.
date_1,no_of_births_1
1/1,1482
2/2,1213
3/23,1220
4/4,1319
5/11,1262
6/18,1271
I am using month.abb[] for converting the month number to name. But instead of providing month name for each value of month number, the result is generating wrong array.
for example: month.abb[2] is generating Apr instead of Feb.
date_1 no_of_births_1 V1 V2 month
1 1/1 1482 1 1 Jan
2 2/2 1213 2 2 Apr
3 3/23 1220 3 23 May
4 4/4 1319 4 4 Jun
5 5/11 1262 5 11 Jul
6 6/18 1271 6 18 Aug
below is the code i am using,
birthday<-read.csv("Birthday_s.csv",header = TRUE)
birthday$date_1<-as.character(birthday$date_1)
#split the data
listx<-sapply(birthday$date_1,function(x) strsplit(x,"/"))
library(base)
#convert to data frame
mat<-as.data.frame(matrix(unlist(listx),ncol = 2, byrow = TRUE))
#combine birthday and mat
birthday2<-cbind(birthday,mat)
#convert month number to month name
birthday2$month<-sapply(birthday2$V1, function(x) month.abb[as.numeric(x)])
When I run your code, I get the correct months. However, your code is more complicated than necessary. Here are two ways to extract month and day from date_1:
First, when you read the data, use stringsAsFactors=FALSE, which prevents strings from getting converted to factors.
birthday <- read.csv("Birthday_s.csv",header = TRUE, stringsAsFactors=FALSE)
Extract month and days using date functions:
library(lubridate)
birthday$month = month(as.POSIXct(birthday$date_1, format="%m/%d"), abbr=TRUE, label=TRUE)
birthday$day = day(as.POSIXct(birthday$date_1, format="%m/%d"))
Extract month and days using Regular Expressions:
birthday$month = month.abb[as.numeric(gsub("([0-9]{1,2}).*", "\\1", birthday$date_1))]
birthday$day = as.numeric(gsub(".*/([0-9]{1,2}$)", "\\1", birthday$date_1))
Related
In my data frame table, the days of the week are integer type. I want to change the days of the week, for example Monday, Tuesday.
I tried stringr::str_replace()
Please try to capture sample data and desired output next time. I will try based on provided info.
Simple Approach :
Lets say your data has a column weekday with integer values, similar to this- added data just for sample:
df <- data.frame(
days_n =c(1,2,3,4,5,6,7)
,SomeData = c('A','AA','BB','CC','BB','AAA','CCC'))
Then just mutate using wonderful lubridate
df%>%mutate(Weekday = wday(x = days_n,label = T,abbr = T))
will give you :
days_n SomeData Weekday
1 1 A Sun
2 2 AA Mon
3 3 BB Tue
4 4 CC Wed
5 5 BB Thu
6 6 AAA Fri
7 7 CCC Sat
Check wday() in lubridate for more.
I have the below dataset which is a dataframe. But I would like to convert it into time series so that I can do ARIMA forecasting.
Have searched various topics in SO but could not find anything similar which is at YEARMONTH grain. Everyone talked about date field. But here I don't have date.
I am using the below code but this gives error
dataset <- data.frame(year =c(2017), YearMonth = c(201701,201702,201703,201704), sales = c(100,200,300,400))
library(zoo)
newdataset <- as.ts(read.zoo(dataset, FUN = as.yearmon))
# Error:
#
# In zoo(coredata(x), tt) :
# some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I know it gives error because I have year column as 1st column which does not have unique values but not really sure how to fix it.
Any help would be really appreciated.
Regards,
Akash
One option is to convert YearMonth to 1st date of a month and generate ts.
library(zoo)
dataset$YearMonth = as.Date(as.yearmon(as.character(dataset$YearMonth),"%Y%m"), frac = 0)
dataset
# year YearMonth sales
# 1 2017 2017-01-01 100
# 2 2017 2017-02-01 200
# 3 2017 2017-03-01 300
# 4 2017 2017-04-01 400
Just for ts another option is as:
dataset$YearMonth = as.yearmon(as.character(dataset$YearMonth),"%Y%m")
as.ts(dataset[-1])
# Time Series:
# Start = 1
# End = 4
# Frequency = 1
# YearMonth sales
# 1 2017.000 100
# 2 2017.083 200
# 3 2017.167 300
# 4 2017.250 400
I want to do time series analysis on the following data
but I can't convert it into time series data,
Data can be downloaded from the given link https://datamarket.com/data/set/22ox/monthly-milk-production-pounds-per-cow-jan-62-dec-75#!ds=22ox&display=line
I have tried str_split_fixed function to separate it into two columns but putting back together as a time series after splitting is a problem
This is what I have tried:
#Convert it into Time series
#Train Data
ds.ts<-ts(ds$V2,start = c(1962,1),end = c(1974,12),frequency = 12)
ds.ts
plot(ds.ts)
plot(decompose(ds.ts))
#Test Data
ts.1975<-ts(ds$V2,start=c(1975,1),end=c(1975,12),frequency = 12)
I exported as .tsv (tab separated), but .csv should work as well. Then read to data.table and use substr to extract the first 4 digits as year (and convert to integer) and the last 2 digits as month:
library(data.table)
dt <- fread("~/Downloads/monthly-milk-production-pounds-p.tsv")
dt[, ":="(
year = as.integer(substr(Month, start = 1, stop = 4)),
month = as.integer(substr(Month, start = 6, stop = 7)))]
>dt
Month Monthly milk production: pounds per cow. Jan 62 ? Dec 75 year month
1: 1962-01 589 1962 1
2: 1962-02 561 1962 2
3: 1962-03 640 1962 3
4: 1962-04 656 1962 4
5: 1962-05 727 1962 5
Updated based on comment by AkselA:
To obtain a time-series use as.Date as suggested by AkselA:
library(data.table)
dt <- fread("~/Downloads/monthly-milk-production-pounds-p.tsv")
dt[, date_time := as.Date(paste0(Month, "-01"), format="%Y-%m-%d")]
>dt
Month Monthly milk production: pounds per cow. Jan 62 ? Dec 75 date_time
1: 1962-01 589 1962-01-01
2: 1962-02 561 1962-02-01
3: 1962-03 640 1962-03-01
4: 1962-04 656 1962-04-01
5: 1962-05 727 1962-05-01
I must be missing something simple.
I have a data.frame of various date formats and I'm using lubridate which works great with everything except month names by themselves. I can't get the month names to convert to date time objects.
> head(dates)
From To
1 June August
2 January December
3 05/01/2013 10/30/2013
4 July November
5 06/17/2013 10/14/2013
6 05/04/2013 11/23/2013
Trying to change June into date time object:
> as_date(dates[1,1])
Error in charToDate(x) :
character string is not in a standard unambiguous format
> as_date("June")
Error in charToDate(x) :
character string is not in a standard unambiguous format
The actual year and day do not matter. I only need the month. zx8754 suggested using dummy day and year.
lubridate can handle converting the name or abbreviation of a month to its number when it's paired with the rest of the information needed to make a proper date, i.e. a day and year. For example:
lubridate::mdy("August/01/2013", "08/01/2013", "Aug/01/2013")
#> [1] "2013-08-01" "2013-08-01" "2013-08-01"
You can utilize that to write a function that appends "/01/2013" to any month names (I threw in abbreviations as well to be safe). Then apply that to all your date columns (dplyr::mutate_all is just one way to do that).
name_to_date <- function(x) {
lubridate::mdy(ifelse(x %in% c(month.name, month.abb), paste0(x, "/01/2013"), x))
}
dplyr::mutate_all(dates, name_to_date)
#> From To
#> 1 2013-06-01 2013-08-01
#> 2 2013-01-01 2013-12-01
#> 3 2013-05-01 2013-10-30
#> 4 2013-07-01 2013-11-01
#> 5 2013-06-17 2013-10-14
#> 6 2013-05-04 2013-11-23
The following is a crude example of how you could achieve that.
Given that dummy values are fine:
match(dates[1, 1], month.abb)
The above would return you, given that we had Dec in dates[1. 1]:
12
To generate the returned value above along with dummy number in a date format, I tried:
tmp = paste(match(dates[1, 1], month.abb), "2013", sep="/")
which gives us:
12/2013
and then lastly:
result = paste("01", tmp, sep="/")
which returns:
01/12/2013
I am sure there are more flexible approaches than this; but this is just an idea, which I just tried.
Using a custom function:
# dummy data
df1 <- read.table(text = "
From To
1 June August
2 January December
3 05/01/2013 10/30/2013
4 July November
5 06/17/2013 10/14/2013
6 05/04/2013 11/23/2013", header = TRUE, as.is = TRUE)
# custom function
myFun <- function(x, dummyDay = "01", dummyYear = "2013"){
require(lubridate)
x <- ifelse(substr(x, 1, 3) %in% month.abb,
paste(match(substr(x, 1, 3), month.abb),
dummyDay,
dummyYear, sep = "/"), x)
#return date
mdy(x)
}
res <- data.frame(lapply(df1, myFun))
res
# From To
# 1 2013-06-01 2013-08-01
# 2 2013-01-01 2013-12-01
# 3 2013-05-01 2013-10-30
# 4 2013-07-01 2013-11-01
# 5 2013-06-17 2013-10-14
# 6 2013-05-04 2013-11-23
I've compiled a corpus of tweets sent over the past few months or so, which looks something like this (the actual corpus has a lot more columns and obviously a lot more rows, but you get the idea)
id when time day month year handle what
UK1.1 Sat Feb 20 2016 12:34:02 20 2 2016 dave Great goal by #lfc
UK1.2 Sat Feb 20 2016 15:12:42 20 2 2016 john Can't wait for the weekend
UK1.3 Sat Mar 01 2016 12:09:21 1 3 2016 smith Generic boring tweet
Now what I'd like to do in R is, using grep for string matching, plot the frequency of certain words/hashtags over time, ideally normalised by the number of tweets from that month/day/hour/whatever. But I have no idea how to do this.
I know how to use grep to create subsets of this dataframe, e.g. for all tweets including the #lfc hashtag, but I don't really know where to go from there.
The other issue is that whatever time scale is on my x-axis (hour/day/month etc.) needs to be numerical, and the 'when' column isn't. I've tried concatenating the 'day' and 'month' columns into something like '2.13' for February 13th, but this leads to the issue of R treating 2.13 as being 'earlier', so to speak, than 2.7 (February 7th) on mathematical grounds.
So basically, I'd like to make plots like these, where frequency of string x is plotted against time
Thanks!
Here's one way to count up tweets by day. I've illustrated with a simplified fake data set:
library(dplyr)
library(lubridate)
# Fake data
set.seed(485)
dat = data.frame(time = seq(as.POSIXct("2016-01-01"),as.POSIXct("2016-12-31"), length.out=10000),
what = sample(LETTERS, 10000, replace=TRUE))
tweet.summary = dat %>% group_by(day = date(time)) %>% # To summarise by month: group_by(month = month(time, label=TRUE))
summarise(total.tweets = n(),
A.tweets = sum(grepl("A", what)),
pct.A = A.tweets/total.tweets,
B.tweets = sum(grepl("B", what)),
pct.B = B.tweets/total.tweets)
tweet.summary
day total.tweets A.tweets pct.A B.tweets pct.B
1 2016-01-01 28 3 0.10714286 0 0.00000000
2 2016-01-02 27 0 0.00000000 1 0.03703704
3 2016-01-03 28 4 0.14285714 1 0.03571429
4 2016-01-04 27 2 0.07407407 2 0.07407407
...
Here's a way to plot the data using ggplot2. I've also summarized the data frame on the fly within ggplot, using the dplyr and reshape2 packages:
library(ggplot2)
library(reshape2)
library(scales)
ggplot(dat %>% group_by(Month = month(time, label=TRUE)) %>%
summarise(A = sum(grepl("A", what))/n(),
B = sum(grepl("B", what))/n()) %>%
melt(id.var="Month"),
aes(Month, value, colour=variable, group=variable)) +
geom_line() +
theme_bw() +
scale_y_continuous(limits=c(0,0.06), labels=percent_format()) +
labs(colour="", y="")
Regarding your date formatting issue, here's how to get numeric dates: You can turn the day month and year columns into a date using as.Date and/or turn the day, month, year, and time columns into a date-time column using as.POSIXct. Both will have underlying numeric values with a date class attached, so that R treats them as dates in plotting functions and other functions. Once you've done this conversion, you can run the code above to count up tweets by day, month, etc.
# Fake time data
dat2 = data.frame(day=sample(1:28, 10), month=sample(1:12,10), year=2016,
time = paste0(sample(c(paste0(0,0:9),10:12),10),":",sample(10:50,10)))
# Create date-time format column from existing day/month/year/time columns
dat2$posix.date = with(dat2, as.POSIXct(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day)," ",
time)))
# Create date format column
dat2$date = with(dat2, as.Date(paste0(year,"-",
sprintf("%02d",month),"-",
sprintf("%02d", day))))
dat2
day month year time posix.date date
1 28 10 2016 01:44 2016-10-28 01:44:00 2016-10-28
2 22 6 2016 12:28 2016-06-22 12:28:00 2016-06-22
3 3 4 2016 11:46 2016-04-03 11:46:00 2016-04-03
4 15 8 2016 10:13 2016-08-15 10:13:00 2016-08-15
5 6 2 2016 06:32 2016-02-06 06:32:00 2016-02-06
6 2 12 2016 02:38 2016-12-02 02:38:00 2016-12-02
7 4 11 2016 00:27 2016-11-04 00:27:00 2016-11-04
8 12 3 2016 07:20 2016-03-12 07:20:00 2016-03-12
9 24 5 2016 08:47 2016-05-24 08:47:00 2016-05-24
10 27 1 2016 04:22 2016-01-27 04:22:00 2016-01-27
You can see that the underlying values of a POSIXct date are numeric (number of seconds elapsed since midnight on Jan 1, 1970), by doing as.numeric(dat2$posix.date). Likewise for a Date object (number of days elapsed since Jan 1, 1970): as.numeric(dat2$date).