Identifying incorrectly transformed data cells - r

I have a massive Excel spreadsheet full of dates in %m/%d/%Y format. In R, I convert them to Date format using as.Date. The problem is that some of the dates in Excel were entered incorrectly by hand, as in the section below, where 214 was entered instead of 2014.
...
235 2014-01-20
236 2014-03-03
237 2014-01-24
238 2014-03-07
239 214-05-23
240 2014-01-31
241 2014-02-19
242 2014-03-27
...
For individual columns, I can use which(dataframe$colname_X<1900), which gives me the row number. This is easy because I already know which column it is.
My question is: how can I do the same for the entire dataframe, so that I get both the row and column numbers of the faulty cells?

Starting with:
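# rd.txt in the original answer was a personal helper that wraps read.table on text;
# header = TRUE matches the column names shown in the output below
dat <- read.table(text = "235 2014-01-20
236 2014-03-03
237 2014-01-24
238 2014-03-07
239 214-05-23
240 2014-01-31
241 2014-02-19
242 2014-03-27", header = TRUE)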
dat <- cbind(dat,dat)
dat[] <- lapply(dat, as.Date, origin="1970-01-01")
> dat
X235 X2014.01.20 X235 X2014.01.20
1 1970-08-25 2014-03-03 1970-08-25 2014-03-03
2 1970-08-26 2014-01-24 1970-08-26 2014-01-24
3 1970-08-27 2014-03-07 1970-08-27 2014-03-07
4 1970-08-28 0214-05-23 1970-08-28 0214-05-23
5 1970-08-29 2014-01-31 1970-08-29 2014-01-31
6 1970-08-30 2014-02-19 1970-08-30 2014-02-19
7 1970-08-31 2014-03-27 1970-08-31 2014-03-27
Now use which with arr.ind=TRUE (you do need to convert to a numeric matrix first):
which( sapply(dat,as.numeric) < (as.numeric(as.Date("1900-01-01") ) ), arr.ind=TRUE)
row col
[1,] 4 2
[2,] 4 4
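The index matrix can be joined back to the column names to make the result easier to act on. A minimal sketch, assuming every column is of class Date; the data frame `dates`, its column names and the cutoff date are made up for illustration:

```r
# Hypothetical example data: two Date columns with one bad entry each
dates <- data.frame(
  start = as.Date(c("2014-01-20", "0214-05-23", "2014-03-07")),
  end   = as.Date(c("2014-02-19", "2014-03-27", "0214-01-31"))
)

# Dates are stored as days since 1970-01-01, so compare on that scale
cutoff <- as.numeric(as.Date("1900-01-01"))

# sapply() over equal-length columns yields a numeric matrix
bad <- which(sapply(dates, as.numeric) < cutoff, arr.ind = TRUE)

# Translate the column index into a column name
report <- data.frame(
  row    = bad[, "row"],
  column = names(dates)[bad[, "col"]]
)
print(report)
```

Each row of `report` points at one faulty cell, so the corresponding entries can be looked up and corrected in the spreadsheet.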

One potential solution
Identify all errors with lapply, assuming every column of df is of class Date:
results <- lapply(df, function(x) which(x < as.Date("1900-01-01")))
This returns a list with one element per column, holding the row numbers of the faulty cells. As you don't care about the elements that are empty (i.e. no errors), you can shorten the list to keep only the columns with errors:
results[lengths(results) > 0]

Related

How to transform a dataframe into time series?

I'm sorry, I know this question has been asked many times, but I'm having trouble converting my dataframe into a time series.
This is my dataframe (after dropping some columns):
head(New_DF)
ï..date qty
1 2017-07-05 61
2 2018-01-20 73
3 2017-07-10 145
4 2017-07-01 255
5 2017-05-23 267
6 2017-06-24 242
And this is what I did:
library(zoo)
as.ts(read.zoo(New_DF, FUN = as.yearmon))
And I get this error:
Error in seq.default(head(tt, 1), tail(tt, 1), deltat) :
'from' must be a finite number
In addition: Warning message:
In zoo(rval3, ix) :
some methods for “zoo” objects do not work if the index entries in ‘order.by’ are not unique
I think I know why: I have a lot of duplicates in my ï..date column. Unfortunately, I don't want to drop them, since time-series ML models differ from other routine ML models. Because a time-series model is based on the sequence of previous values, dropping a date may affect my solution.
Any suggestions would be much appreciated, thank you.
1) yearmon Assuming New_DF is as shown reproducibly in the Note at the end, use read.zoo with the argument aggregate = sum.
library(zoo)
read.zoo(New_DF, FUN = as.yearmon, aggregate = sum)
giving:
May 2017 Jun 2017 Jul 2017 Jan 2018
267 242 461 73
2) Date If you want to keep the individual rows then use Date class instead of yearmon (assuming that the dates are unique).
read.zoo(New_DF)
## 2017-05-23 2017-06-24 2017-07-01 2017-07-05 2017-07-10 2018-01-20
## 267 242 255 61 145 73
3) sequence number Another possibility is to just ignore the dates and use 1, 2, 3, ... as the index. The values then stay in the original row order.
zoo(New_DF$qty)
## 1 2 3 4 5 6
## 61 73 145 255 267 242
Note
Lines <- " ï..date qty
1 2017-07-05 61
2 2018-01-20 73
3 2017-07-10 145
4 2017-07-01 255
5 2017-05-23 267
6 2017-06-24 242 "
New_DF <- read.table(text = Lines)
Could you share some background about your data? Also, if there are duplicates in the data, can you just sum them up so that the above error does not occur?
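Summing duplicates within a month can also be sketched in base R, without zoo. This is only an illustration: the column name `date` is a hypothetical stand-in for the mangled ï..date column, and the data are the six rows shown in the question.

```r
# Hypothetical copy of the question's data with a clean column name
New_DF <- data.frame(
  date = as.Date(c("2017-07-05", "2018-01-20", "2017-07-10",
                   "2017-07-01", "2017-05-23", "2017-06-24")),
  qty  = c(61, 73, 145, 255, 267, 242)
)

# Derive a year-month key, then sum qty within each key
New_DF$month <- format(New_DF$date, "%Y-%m")
monthly <- aggregate(qty ~ month, data = New_DF, FUN = sum)
print(monthly)
```

The three July 2017 rows collapse into a single value (61 + 145 + 255 = 461), which agrees with the aggregate = sum result from read.zoo above.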

How to apply a function to every nth month in data frame?

I have a data frame like this:
Month Amount
1/31/2014 793
2/28/2014 363
3/31/2014 857
4/30/2014 621
5/31/2014 948
6/30/2014 385
I would like to apply a function (x*0.5) to the third and sixth rows in this data frame. The results will overwrite the data currently in the data frame. So the end result would look like this:
Month Amount
1/31/2014 793
2/28/2014 363
3/31/2014 428.5
4/30/2014 621
5/31/2014 948
6/30/2014 192.5
I've tried the rollapply() function, but that function seems to start at the first row only, without an option to force it to start at the third.
I really appreciate any help around this. Thanks in advance.
Suppose your data.frame is named DT:
DT$Amount[c(3,6)] <- 0.5 * DT$Amount[c(3,6)]
If you have a lot of data, use data.table:
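library(data.table)
setDT(DT)  # convert to a data.table by reference
# If Amount is stored as integer, convert it with as.numeric() first
# so that the halved values are not truncated
DT[
  month(as.Date(Month, format = "%m/%d/%Y")) %% 3 == 0,
  Amount := 0.5 * Amount
]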
If the rows follow a pattern then %% can be used to select every x rows
df1$Amount[seq_len(nrow(df1)) %% 3 == 0] <- df1$Amount[seq_len(nrow(df1)) %% 3 == 0] * 0.5
Month Amount
1 1/31/2014 793.0
2 2/28/2014 363.0
3 3/31/2014 428.5
4 4/30/2014 621.0
5 5/31/2014 948.0
6 6/30/2014 192.5
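If the rows to change are defined by calendar month rather than by row position, the same %% idea can be applied to the month number. A minimal base R sketch, assuming Month is stored as text in %m/%d/%Y format as in the question:

```r
df1 <- data.frame(
  Month  = c("1/31/2014", "2/28/2014", "3/31/2014",
             "4/30/2014", "5/31/2014", "6/30/2014"),
  Amount = c(793, 363, 857, 621, 948, 385)
)

# Extract the month number (1-12) from the parsed dates
mon <- as.integer(format(as.Date(df1$Month, format = "%m/%d/%Y"), "%m"))

# Halve Amount for every third calendar month (March, June, ...)
hit <- mon %% 3 == 0
df1$Amount[hit] <- df1$Amount[hit] * 0.5
print(df1)
```

Unlike the row-position version, this still selects the right rows when months are missing or the data do not start in January.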
An alternative for detecting particular months in bigger datasets is month() from the lubridate package:
month ammount
1 1/31/2014 793
2 2/28/2014 363
3 3/31/2014 857
4 4/30/2014 621
5 5/31/2014 948
6 6/30/2014 385
library(dplyr)      # provides %>% and mutate()
library(lubridate)  # provides month()
df %>% mutate(month = as.Date(month, "%m/%d/%Y"),
              date_month = month(month),
              new_ammount = ifelse(date_month %in% c(3,6), ammount*0.5, ammount))
Which provides
month ammount date_month new_ammount
1 2014-01-31 793 1 793.0
2 2014-02-28 363 2 363.0
3 2014-03-31 857 3 428.5
4 2014-04-30 621 4 621.0
5 2014-05-31 948 5 948.0
6 2014-06-30 385 6 192.5

ggplot sort order treatment of NA values

My goal is to create a scatter plot of requests for service.
The X axis will be the date the request was made.
X values will show dates from oldest to newest, left to right.
The Y axis will show the priority assigned to the request.
I wish to order the Y values from highest priority at the top (i.e., 1) to lowest.
Requests which haven't been prioritized have NA in that column.
Here is a sample data set (NOTE: the original data file is tab-separated, with no values in the positions where "NA" is shown below for clarity's sake):
ID Priority DateCreated
549 NA 2018-02-15
548 NA 2018-02-15
547 3 2018-02-13
537 1 2018-01-17
536 5 2018-01-17
518 NA 2017-12-21
509 3 2017-11-27
500 2 2017-11-16
486 NA 2017-10-04
477 3 2017-08-08
475 1 2017-09-14
448 2 2017-07-21
444 5 2017-07-14
431 5 2017-06-30
425 1 2017-06-21
407 2 2017-05-26
395 4 2017-05-09
394 4 2017-05-09
374 4 2017-04-27
368 2 2017-04-21
352 NA 2017-04-03
328 4 2017-02-28
308 NA 2017-02-28
272 2 2016-10-05
213 4 2016-05-19
212 5 2016-05-19
200 2 2016-04-26
188 NA 2016-03-17
After loading ggplot2 and data.table, I create the plot with this code:
bl <- fread("backlog.txt")
bl$DateCreated <- as.Date(bl$DateCreated, "%Y-%m-%d")
bl$Priority <- as.integer(bl$Priority)
ggplot(bl, aes(x = DateCreated, y = reorder(Priority, -Priority))) +
geom_text((aes(label = ID)))
If you reproduce this plot, you will see that the items with a priority of NA appear at the top. For presentation to my customer, it is much clearer if they appear at the bottom.
I suppose I could replace the NAs with a "magic number" (e.g., 11), but I'd prefer a less kludgey solution.
Anyone dealt with a similar issue already?
Thanks.
This is a bit of a workaround as well, but I think it is more acceptable than setting a 'magic number':
bl$DateCreated <- as.Date(bl$DateCreated, "%Y-%m-%d")
bl$Priority[is.na(bl$Priority)] <- "No Data Available"
bl$Priority <- factor(bl$Priority,levels=c("No Data Available","1","2","3","4","5"))
ggplot(bl, aes(x = DateCreated, y = Priority)) + geom_text((aes(label = ID)))
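On a discrete axis ggplot2 draws the first factor level at the bottom, so the level order decides where each priority lands. To get priority 1 at the top with the unprioritized requests at the bottom, the numeric levels can be listed in descending order after the placeholder. A sketch on a four-row subset of the sample data; the label "Not prioritized" and the column name PriorityF are illustrative choices, and the plot is only built if ggplot2 is installed:

```r
# Small subset of the question's data
bl <- data.frame(
  ID          = c(549, 547, 537, 536),
  Priority    = c(NA, 3, 1, 5),
  DateCreated = as.Date(c("2018-02-15", "2018-02-13", "2018-01-17", "2018-01-17"))
)

# Replace NA with a placeholder label
lab <- ifelse(is.na(bl$Priority), "Not prioritized", as.character(bl$Priority))

# First level is drawn lowest: placeholder, then 5, ..., 1 (1 ends up on top)
lev <- c("Not prioritized",
         as.character(sort(unique(bl$Priority), decreasing = TRUE)))
bl$PriorityF <- factor(lab, levels = lev)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(bl, aes(x = DateCreated, y = PriorityF)) +
    geom_text(aes(label = ID))
  # print(p) to draw the plot
}
```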

How do I convert a character field to date/time format in R?

This is what my data looks like. I would like to convert date and time columns to a time stamp and put it in a single column.
Any help appreciated. Thanks
DATE TIME CLOSE HIGH LOW OPEN VOLUME
1 20150216 1520 2283.85 2284 2275.6 2275.6 48309
2 20150216 1530 2282 2284 2273.15 2283.85 108856
3 20150218 920 2276.1 2280.1 2260.6 2280.1 94279
4 20150218 930 2271.6 2277.95 2271 2276.1 65932
5 20150218 940 2270.35 2275 2268.2 2271.6 53595
6 20150218 950 2270.65 2271.2 2265.55 2270.5 34546
7 20150218 1000 2274.15 2274.25 2268.65 2270.6 35414
8 20150218 1010 2270.1 2274.9 2267.1 2274.25 37334
You can try
df$DateTime <- as.POSIXct(sprintf('%08d %04d', df$DATE, df$TIME),
format ='%Y%m%d %H%M')
df1 <- df[-(1:2)]
head(df1,2)
# CLOSE HIGH LOW OPEN VOLUME DateTime
#1 2283.85 2284 2275.60 2275.60 48309 2015-02-16 15:20:00
#2 2282.00 2284 2273.15 2283.85 108856 2015-02-16 15:30:00
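The zero padding in sprintf matters because times such as 920 have only three digits and would not match %H%M otherwise. A small self-contained sketch of that step, assuming DATE and TIME are integer columns; tz = "UTC" is an assumption added here so the result does not depend on the local time zone:

```r
df <- data.frame(DATE = c(20150216L, 20150218L),
                 TIME = c(1520L, 920L))

# %04d pads 920 to "0920" so that %H%M can parse it
stamp <- sprintf("%08d %04d", df$DATE, df$TIME)

df$DateTime <- as.POSIXct(stamp, format = "%Y%m%d %H%M", tz = "UTC")
print(format(df$DateTime, "%Y-%m-%d %H:%M"))
```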
Update
If you need to convert to xts, instead of creating a new column, we can remove the columns that are not needed (df[-(1:2)]) and specify order.by as the datetime vector ('indx')
library(xts)
indx <- as.POSIXct(sprintf('%08d %04d', df$DATE, df$TIME),
format ='%Y%m%d %H%M')
xt1 <- xts(df[-(1:2)], order.by=indx)

Make my Date column (row label) index in R Data Frame

I'm just starting out in R and got my data into a dataframe. R created an index column (row label), but I think I want the Date column to be the row label column instead, for ease of use in forecast and plot methods. Functions such as forecast sometimes pick up the row label column, and I want them to use the dates.
> fullmatrix
Date Unit Sales Average Selling Price Median Selling Price Average Days on Market
161 2000-05-01 3041 114093 99554 138
160 2000-06-01 3079 114730 99931 138
159 2000-07-01 2455 122074 97737 145
So how do I 1) drop the index (row label), and 2) declare the date as the index (row label)?
The question is not entirely clear, but I think you want to create a time series object. Using the xts package, for example, you can do the following:
dat <- read.table(text=' Date Unit_Sales Average_Selling_Price Median_Selling_Price Average_Days_on_Market
161 2000-05-01 3041 114093 99554 138
160 2000-06-01 3079 114730 99931 138
159 2000-07-01 2455 122074 97737 145',header=TRUE)
library(xts)
dat.xts <- xts(x=dat[,-1],order.by= as.POSIXct(dat$Date))
dat.xts
           Unit_Sales Average_Selling_Price Median_Selling_Price Average_Days_on_Market
2000-05-01       3041                114093                99554                    138
2000-06-01       3079                114730                99931                    138
2000-07-01       2455                122074                97737                    145
Now you have index:
index(dat.xts)
[1] "2000-05-01 CEST" "2000-06-01 CEST" "2000-07-01 CEST"
This xts object can be used within forecast.
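If only the row labels need to change and a plain data frame is enough, the dates can be assigned as row names in base R. A minimal sketch on a reduced, hypothetical version of fullmatrix with a single value column; note that row names must be unique, so this only works when no date is repeated:

```r
# Hypothetical reduced version of the question's data frame
fullmatrix <- data.frame(
  Date       = as.Date(c("2000-05-01", "2000-06-01", "2000-07-01")),
  Unit_Sales = c(3041, 3079, 2455),
  row.names  = c("161", "160", "159")
)

# 2) use the dates as row labels ...
rownames(fullmatrix) <- format(fullmatrix$Date)
# 1) ... and drop the now redundant Date column
fullmatrix$Date <- NULL
print(fullmatrix)
```

Row names are stored as text, so functions that need real date arithmetic are still better served by a time series class such as xts.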
