I'm just starting out in R and have got my data into a data frame. R created an index column (the row labels), but I think I need the date column to be the row label column instead, for ease of use in the forecast and plot methods. Functions such as forecast sometimes pick up the row label column, and I want them to use the dates.
> fullmatrix
Date Unit Sales Average Selling Price Median Selling Price Average Days on Market
161 2000-05-01 3041 114093 99554 138
160 2000-06-01 3079 114730 99931 138
159 2000-07-01 2455 122074 97737 145
So how do I 1) drop the index (row label), and 2) declare the date as the index (row label)?
The question is not entirely clear, but I think you want to create a time series object. Using the xts package, for example, you can do the following:
dat <- read.table(text=' Date Unit_Sales Average_Selling_Price Median_Selling_Price Average_Days_on_Market
161 2000-05-01 3041 114093 99554 138
160 2000-06-01 3079 114730 99931 138
159 2000-07-01 2455 122074 97737 145',header=TRUE)
library(xts)
dat.xts <- xts(x=dat[,-1],order.by= as.POSIXct(dat$Date))
dat.xts
Unit_Sales Average_Selling_Price Median_Selling_Price Average_Days_on_Market
2000-05-01 3041 114093 99554 138
2000-06-01 3079 114730 99931 138
2000-07-01 2455 122074 97737 145
Now you have index:
index(dat.xts)
[1] "2000-05-01 CEST" "2000-06-01 CEST" "2000-07-01 CEST"
This xts object can be used within forecast.
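If a plain data frame is all that is needed, the literal request (drop the numeric row labels and use the dates instead) can be sketched in base R. This is a minimal sketch, not part of the answer above: the data frame name sales and its single Unit_Sales column are assumptions made for illustration, and it only works when every date is unique, because row names must be unique.

```r
# Hypothetical sample mirroring the first two columns of the question's data
sales <- data.frame(
  Date = as.Date(c("2000-05-01", "2000-06-01", "2000-07-01")),
  Unit_Sales = c(3041, 3079, 2455)
)

# Use the dates as row labels (row names are stored as character strings)
rownames(sales) <- as.character(sales$Date)

# Drop the now redundant Date column
sales$Date <- NULL

# Rows can now be looked up by date string
sales["2000-06-01", "Unit_Sales"]
```

Note that row names are plain strings, so date arithmetic is lost; that is one reason a time series class such as xts is usually the better fit for forecasting.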
I'm sorry, I know this question has been asked many times, but I'm having trouble converting my data frame into a time series.
This is my data frame (after dropping some columns):
head(New_DF):
ï..date qty
1 2017-07-05 61
2 2018-01-20 73
3 2017-07-10 145
4 2017-07-01 255
5 2017-05-23 267
6 2017-06-24 242
And this is what I did:
library(zoo)
as.ts(read.zoo(New_DF, FUN = as.yearmon))
And I get this error:
Error in seq.default(head(tt, 1), tail(tt, 1), deltat) :
'from' must be a finite number
In addition: Warning message:
In zoo(rval3, ix) :
some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I think I understand why: there are many duplicates in my ï..date column. Unfortunately, I don't want to drop them, since time-series ML models differ from other routine ML models: a time-series model is based on the sequence of previous values, so dropping a date may affect my solution.
Any suggestions would be much appreciated, thank you.
1) yearmon Assuming New_DF is as shown reproducibly in the Note at the end, use read.zoo with the argument aggregate = sum.
library(zoo)
read.zoo(New_DF, FUN = as.yearmon, aggregate = sum)
giving:
May 2017 Jun 2017 Jul 2017 Jan 2018
267 242 461 73
2) Date If you want to keep the individual rows then use Date class instead of yearmon (assuming that the dates are unique).
read.zoo(New_DF)
## 2017-05-23 2017-06-24 2017-07-01 2017-07-05 2017-07-10 2018-01-20
## 267 242 255 61 145 73
3) sequence number Another possibility is to just ignore the dates and use 1, 2, 3, ... as the index, which keeps the rows in their original order.
zoo(New_DF$qty)
## 1 2 3 4 5 6
## 61 73 145 255 267 242
Note
Lines <- " ï..date qty
1 2017-07-05 61
2 2018-01-20 73
3 2017-07-10 145
4 2017-07-01 255
5 2017-05-23 267
6 2017-06-24 242 "
New_DF <- read.table(text = Lines)
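As a cross-check on option 1 above, the same monthly totals can be sketched in base R without zoo. This is only an illustrative sketch: the column is named date here (an assumption, instead of the ï..date name produced by the original file), and the data frame is rebuilt inline so the block runs on its own.

```r
# Same six rows as in the Note, with a plain "date" column name (assumed)
New_DF <- data.frame(
  date = as.Date(c("2017-07-05", "2018-01-20", "2017-07-10",
                   "2017-07-01", "2017-05-23", "2017-06-24")),
  qty = c(61, 73, 145, 255, 267, 242)
)

# 0 means no date occurs twice; the duplicates only appear once the
# dates are collapsed to months
anyDuplicated(New_DF$date)

# Collapse each date to its year-month and sum qty within each month
New_DF$month <- format(New_DF$date, "%Y-%m")
monthly <- aggregate(qty ~ month, data = New_DF, FUN = sum)
monthly
```

The July total combines three rows (255 + 61 + 145), which is why the index is not unique when the raw rows are read with FUN = as.yearmon and no aggregate function.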
Could you share some background about your data? Also, if there are duplicates in the data, can you just sum them up so that the above error won't occur?
Date Sales
3/11/2017 1
3/12/2017 0
3/13/2017 40
3/14/2017 47
3/15/2017 83
3/16/2017 62
3/17/2017 13
3/18/2017 58
3/19/2017 27
3/20/2017 17
3/21/2017 71
3/22/2017 76
3/23/2017 8
3/24/2017 13
3/25/2017 97
3/26/2017 58
3/27/2017 80
3/28/2017 77
3/29/2017 31
3/30/2017 78
3/31/2017 0
4/1/2017 40
4/2/2017 58
4/3/2017 32
4/4/2017 31
4/5/2017 90
4/6/2017 35
4/7/2017 88
4/8/2017 16
4/9/2017 72
4/10/2017 39
4/11/2017 8
4/12/2017 88
4/13/2017 93
4/14/2017 57
4/15/2017 23
4/16/2017 15
4/17/2017 6
4/18/2017 91
4/19/2017 87
4/20/2017 44
Here the current date is 20/04/2017. My question is: how can I group the data from 11/03/2017 to 19/04/2017 into 4 equal parts and sum the sales for each part in R?
E.g.:
library("xts")
ep <- endpoints(data, on = 'days', k = 4)
period.apply(data,ep,sum)
It's not working: it takes the data from the start date up to the current date, but I need to gather the data from the start date up to yesterday (19/4/2017) and split it into 4 equal parts.
Could anyone kindly guide me?
Thank you
Base R has a function cut.Date() which is built for this purpose.
However, the question is not fully clear on what the OP intends. My understanding of the requirements given in the question and the additional comment is:
1) Take the sales data per day in Book1 but leave out the current day, i.e., use only completed days.
2) Group the data in four equal parts, i.e., four periods containing an equal number of days. (Note that the title of the question and the attempt to use xts::endpoints() with k = 4 indicate that the OP might instead intend to group the data in periods of four days each.)
3) Summarize the sales figures by period.
For the sake of brevity, data.table is used here for data manipulation and aggregation, and lubridate for date manipulation.
library(data.table)
library(lubridate)
# coerce to data.table, convert Date column from character to class Date,
# exclude the current date
temp <- setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()]
# cut the date range in four parts
temp[, start_date_of_period := cut.Date(Date, 4)]
temp
# Date Sales start_date_of_period
# 1: 2017-03-11 1 2017-03-11
# 2: 2017-03-12 0 2017-03-11
# 3: 2017-03-13 40 2017-03-11
# ...
#38: 2017-04-17 6 2017-04-10
#39: 2017-04-18 91 2017-04-10
#40: 2017-04-19 87 2017-04-10
# Date Sales start_date_of_period
# aggregate sales by period
temp[, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
# start_date_of_period n_days total_sales
#1: 2017-03-11 10 348
#2: 2017-03-21 10 589
#3: 2017-03-31 10 462
#4: 2017-04-10 10 507
Thanks to chaining, this can be put together in one statement without using a temporary variable:
setDT(Book1)[, Date := mdy(Book1$Date)][Date != today()][
, start_date_of_period := cut.Date(Date, 4)][
, .(n_days = .N, total_sales = sum(Sales)), by = start_date_of_period]
Note If you want to reproduce the result in the future, you will have to replace the call to today() which excludes the current day by mdy("4/20/2017") which is the last day in the sample data set supplied by the OP.
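For the alternative reading mentioned above (periods of four days each rather than four equal parts), a base R sketch could look like the following. It is an illustration under stated assumptions, not the answer's method: it uses only the first eight days of the sample data, and the names sales, block and block_totals are made up for this example.

```r
# First eight days of the sample data (subset chosen for brevity)
sales <- data.frame(
  Date  = seq(as.Date("2017-03-11"), by = "day", length.out = 8),
  Sales = c(1, 0, 40, 47, 83, 62, 13, 58)
)

# Rows 1-4 fall in block 1, rows 5-8 in block 2, and so on;
# this assumes one row per day with no gaps, sorted by date
sales$block <- (seq_len(nrow(sales)) - 1) %/% 4 + 1

# Sum the sales within each four-day block
block_totals <- aggregate(Sales ~ block, data = sales, FUN = sum)

# Attach the first date of each block for readability
block_totals$start <- sales$Date[match(block_totals$block, sales$block)]
block_totals
```

If the number of rows is not a multiple of four, the last block simply contains fewer days.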
Please read my question carefully before marking it as a duplicate!
I am new to R and I am trying to figure out how to calculate the date difference, in weeks, between each row and the following rows of a date column, and store it in a new column so that I can plot it.
There are a couple of answers here (Q1, Q2, Q3), but none specifically covers taking differences within one column sequentially between rows, say from top to bottom.
Below are the example data and the expected result:
Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234
Expected
Date Var1 week
2/6/2017 493 0
2/20/2017 558 2
3/6/2017 595 4
3/20/2017 636 6
4/6/2017 697 8
4/20/2017 566 10
5/5/2017 234 12
You can use a similar approach to that in your first linked answer by saving the difftime result as a new column in your data frame.
# Set up data
df <- read.table(text = "Date Var1
2/6/2017 493
2/20/2017 558
3/6/2017 595
3/20/2017 636
4/6/2017 697
4/20/2017 566
5/5/2017 234", header = TRUE)
df$Date <- as.Date(as.character(df$Date), format = "%m/%d/%Y")
# Create exact week variable (weeks elapsed since the first row's date)
df$week <- difftime(df$Date, df$Date[1], units = "weeks")
# Create rounded week variable
df$week2 <- floor(difftime(df$Date, df$Date[1], units = "weeks"))
df
# Date Var1 week week2
# 2017-02-06 493 0.000000 weeks 0 weeks
# 2017-02-20 558 2.000000 weeks 2 weeks
# 2017-03-06 595 4.000000 weeks 4 weeks
# 2017-03-20 636 6.000000 weeks 6 weeks
# 2017-04-06 697 8.428571 weeks 8 weeks
# 2017-04-20 566 10.428571 weeks 10 weeks
# 2017-05-05 234 12.571429 weeks 12 weeks
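The output above measures every row against the first date. If the gap between each row and the row directly before it is wanted instead, base R's diff() is one option. The sketch below uses only the first three dates from the example; the variable names are assumptions made for illustration.

```r
# First three dates from the example, parsed from month/day/year text
dates <- as.Date(c("2/6/2017", "2/20/2017", "3/6/2017"), format = "%m/%d/%Y")

# Days between each row and the previous one; the first row has no
# predecessor, so its gap is set to 0
gap_days <- c(0, as.numeric(diff(dates)))

# Weeks elapsed since the first row, as plain numbers for plotting
weeks_since_start <- as.numeric(difftime(dates, dates[1], units = "weeks"))

gap_days
weeks_since_start
```

Wrapping the results in as.numeric() drops the difftime class, which can be convenient when the column is passed to plotting functions.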
I have a massive Excel spreadsheet full of dates in %m/%d/%Y format. In R, I convert them to Date format using as.Date. The problem is that some of the dates in Excel were manually entered incorrectly, as in the section below, where 214 was entered instead of 2014.
...
235 2014-01-20
236 2014-03-03
237 2014-01-24
238 2014-03-07
239 214-05-23
240 2014-01-31
241 2014-02-19
242 2014-03-27
...
For an individual column, I can use which(dataframe$colname_X<1900), which gives me the row number. This is easy because I already know which column it is.
My question is: how can I do the same for the entire data frame, so that I get both the row and the column number of the faulty cells?
Starting with:
# read.table on inline text; header = TRUE turns the first row into column names, matching the output below
dat <- read.table(text = "235 2014-01-20
236 2014-03-03
237 2014-01-24
238 2014-03-07
239 214-05-23
240 2014-01-31
241 2014-02-19
242 2014-03-27", header = TRUE)
dat <- cbind(dat,dat)
dat[] <- lapply(dat, as.Date, origin="1970-01-01")
> dat
X235 X2014.01.20 X235 X2014.01.20
1 1970-08-25 2014-03-03 1970-08-25 2014-03-03
2 1970-08-26 2014-01-24 1970-08-26 2014-01-24
3 1970-08-27 2014-03-07 1970-08-27 2014-03-07
4 1970-08-28 0214-05-23 1970-08-28 0214-05-23
5 1970-08-29 2014-01-31 1970-08-29 2014-01-31
6 1970-08-30 2014-02-19 1970-08-30 2014-02-19
7 1970-08-31 2014-03-27 1970-08-31 2014-03-27
Now use which with arr.ind=TRUE (you need to convert to a numeric matrix first):
which( sapply(dat,as.numeric) < (as.numeric(as.Date("1900-01-01") ) ), arr.ind=TRUE)
row col
[1,] 4 2
[2,] 4 4
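To turn the row and col indices into something easier to act on, the column numbers can be mapped back to column names. The sketch below is a small illustration on made-up data: the data frame checked, its columns a and b, and the zero-padded bad date are all assumptions chosen so the block runs on its own.

```r
# Two hypothetical date columns; column b holds one implausible year
checked <- data.frame(
  a = as.Date(c("2014-01-20", "2014-03-03")),
  b = as.Date(c("2014-01-24", "0214-05-23"))
)

# Dates are stored as days since 1970-01-01, so compare on that scale
cutoff <- as.numeric(as.Date("1900-01-01"))
hits <- which(sapply(checked, as.numeric) < cutoff, arr.ind = TRUE)

# Report each faulty cell by row number and column name
bad_cells <- data.frame(
  row    = unname(hits[, "row"]),
  column = names(checked)[hits[, "col"]]
)
bad_cells
```

The same pattern should carry over to a wider data frame as long as every column is of class Date.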
One potential solution is to identify all errors using apply:
results <- apply(df, 2, function(x) which(x<1900))
This returns a list with one element per column. Since the empty elements (columns with no errors) are of no interest, you can shrink the list to keep only the columns with errors:
results[sapply(results, length) > 0]
This is what my data looks like. I would like to convert date and time columns to a time stamp and put it in a single column.
Any help appreciated. Thanks
DATE TIME CLOSE HIGH LOW OPEN VOLUME
1 20150216 1520 2283.85 2284 2275.6 2275.6 48309
2 20150216 1530 2282 2284 2273.15 2283.85 108856
3 20150218 920 2276.1 2280.1 2260.6 2280.1 94279
4 20150218 930 2271.6 2277.95 2271 2276.1 65932
5 20150218 940 2270.35 2275 2268.2 2271.6 53595
6 20150218 950 2270.65 2271.2 2265.55 2270.5 34546
7 20150218 1000 2274.15 2274.25 2268.65 2270.6 35414
8 20150218 1010 2270.1 2274.9 2267.1 2274.25 37334
You can try
df$DateTime <- as.POSIXct(sprintf('%08d %04d', df$DATE, df$TIME),
format ='%Y%m%d %H%M')
df1 <- df[-(1:2)]
head(df1,2)
# CLOSE HIGH LOW OPEN VOLUME DateTime
#1 2283.85 2284 2275.60 2275.60 48309 2015-02-16 15:20:00
#2 2282.00 2284 2273.15 2283.85 108856 2015-02-16 15:30:00
Update
If you need to convert to xts, instead of creating a new column, we can remove the columns that are not needed (df[-(1:2)]) and specify order.by as the datetime vector ('indx')
library(xts)
indx <- as.POSIXct(sprintf('%08d %04d', df$DATE, df$TIME),
format ='%Y%m%d %H%M')
xt1 <- xts(df[-(1:2)], order.by=indx)
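The reason sprintf is used with '%04d' is that times before 10:00 are stored with only three digits (for example 920), which '%H%M' cannot parse reliably without the leading zero. A minimal base R sketch of that step follows; the fixed tz = "UTC" is an assumption added here so the result does not depend on the local time zone.

```r
# Two sample DATE/TIME values from the question, as integers
date_int <- c(20150216L, 20150218L)
time_int <- c(1520L, 920L)

# Zero-pad to fixed widths so "920" becomes "0920"
stamp_text <- sprintf("%08d %04d", date_int, time_int)
stamp_text

# Parse the padded text; tz is fixed to UTC for reproducibility (assumed)
stamps <- as.POSIXct(stamp_text, format = "%Y%m%d %H%M", tz = "UTC")
format(stamps, "%Y-%m-%d %H:%M")
```

With real exchange data, the time zone of the market would normally replace "UTC".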