Reshape data using colsplit in R - r

I tried to reshape my data using this code, but I get NA values.
require(reshape2)
dates=data.frame(dates=seq(as.Date("1988-01-01"),as.Date("2011-12-31"),by="day"))
first=dates[,1]
dates1=cbind(dates[,1],colsplit(first,pattern="\\-",names=c("Year","Month","Day")))###split by y/m/day
head(dates1)
dates[, 1] Year Month Day
1 1988-01-01 6574 NA NA
2 1988-01-02 6575 NA NA
3 1988-01-03 6576 NA NA
4 1988-01-04 6577 NA NA
5 1988-01-05 6578 NA NA
6 1988-01-06 6579 NA NA

We can use cSplit from splitstackshape to split the 'dates' column by the delimiter -.
library(splitstackshape)
cSplit(dates, 'dates', '-', drop=FALSE)
Or extract to create additional columns
library(tidyr)
extract(dates, dates, into=c('Year', 'Month', 'Day'),
'([^-]+)-([^-]+)-([^-]+)', remove=FALSE)
Or another option from tidyr (suggested by @Ananda Mahto)
separate(dates, dates, into = c("Year", "Month", "Day"), remove=FALSE)
Or using read.table from base R. We specify the sep and the column names, and cbind with the original column.
cbind(dates[1],read.table(text=as.character(dates$dates),
sep='-', col.names=c('Year', 'Month', 'Day')))
By using reshape2_1.4.1, I could reproduce the error
head(cbind(dates[,1],colsplit(first,pattern="-",
names=c("Year","Month","Day"))),2)
# dates[, 1] Year Month Day
#1 1988-01-01 6574 NA NA
#2 1988-01-02 6575 NA NA
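The 6574 in the Year column is the number of days from 1970-01-01 to 1988-01-01, which suggests the Date vector was treated as its underlying number rather than as text. A minimal base R sketch that avoids string splitting altogether, assuming only that the column is of class Date (the name `parts` is illustrative, not from the original post):

```r
# Small version of the question's data
dates <- data.frame(dates = seq(as.Date("1988-01-01"), as.Date("1988-01-06"), by = "day"))

# A Date is stored as the number of days since 1970-01-01
as.numeric(dates$dates[1])

# Pull the components out with format() instead of splitting on "-"
parts <- data.frame(
  dates,
  Year  = as.integer(format(dates$dates, "%Y")),
  Month = as.integer(format(dates$dates, "%m")),
  Day   = as.integer(format(dates$dates, "%d"))
)
head(parts)
```

Because format() works on the Date directly, no intermediate character column is needed.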

Related

Reference the previous non-zero row, find the difference and divide by nrows

I must be asking the question terribly because I can't find what I'm looking for!
I have a large excel file that looks like this for every day of the month:
Date     Well1
1/1/16   10
1/2/16   NA
1/3/16   NA
1/4/16   NA
1/5/16   20
1/6/16   NA
1/7/16   25
1/8/16   NA
1/9/16   NA
1/10/16  35
etc      NA
I want to make a new column that has the difference between the non-zero rows and divide that by the number of rows between each non zero row. Aiming for something like this:
Date     Well1  Adjusted
1/1/16   10     =(20-10)/4 = 2.5
1/2/16   NA     1.25
1/3/16   NA     1.25
1/4/16   NA     1.25
1/5/16   20     =(25-20)/2= 2.5
1/6/16   NA     2.5
1/7/16   25     =(35-25)/3 = 3.3
1/8/16   NA     3.3
1/9/16   NA     3.3
1/10/16  35     etc
etc      NA     etc
I'm thinking I should use lead or lag, but the number of steps between nonzero rows varies (so I'm not sure how to set n in the lead/lag function). I've used group_by so that each month stands alone, and I've attempted case_when and ifelse. Mostly I need ideas on translating the Excel layout into a workable R format.
With some diff-ing and repeating of values, you should be able to get there.
dat$Date <- as.Date(dat$Date, format="%m/%d/%y")
nas <- is.na(dat$Well1)
dat$adj <- with(dat[!nas,],
diff(Well1) / as.numeric(diff(Date), units="days")
)[cumsum(!nas)]
# Date Well1 adj
#1 2016-01-01 10 2.5
#2 2016-01-02 NA 2.5
#3 2016-01-03 NA 2.5
#4 2016-01-04 NA 2.5
#5 2016-01-05 20 2.5
#6 2016-01-06 NA 2.5
#7 2016-01-07 25 5.0
#8 2016-01-08 NA 5.0
#9 2016-01-09 NA 5.0
#10 2016-01-10 40 NA
dat being used is:
dat <- read.table(text="Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 40", header=TRUE, stringsAsFactors=FALSE)
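The key step in the answer above is indexing the vector of per-interval rates with `cumsum(!nas)`. A small sketch of just that step, using the Well1 values from the sample data and, for simplicity, ignoring the division by the number of days (the names `well`, `idx` and `steps` are illustrative):

```r
well <- c(10, NA, NA, NA, 20, NA, 25, NA, NA, 40)
nas  <- is.na(well)

# Running count of observed values: every row gets the number of the
# most recent non-NA reading at or before it
idx <- cumsum(!nas)
idx

# One difference per pair of consecutive readings: 10, 5, 15
steps <- diff(well[!nas])

# Repeat each difference over the rows of its interval; the last reading
# has no following value, so indexing past the end yields NA
steps[idx]
```

Dividing `steps` by the day gaps before indexing, as the answer does, turns these differences into daily rates.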
Base R in the same vein as @thelatemail but with transformations all in one expression:
nas <- is.na(dat$Well1)
res <- within(dat, {
Date <- as.Date(Date, "%m/%d/%y")
Adjusted <- (diff(Well1[!nas]) /
as.numeric(diff(Date[!nas]), units = "days"))[cumsum(!nas)]
}
)
Data:
dat <- read.table(text="Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 40", header=TRUE, stringsAsFactors=FALSE)
Maybe this should work
library(dplyr)
df1 %>%
#// remove the rows with NA
na.omit %>%
# // create a new column with the lead values of Well1
transmute(Date, Well2 = lead(Well1)) %>%
# // join with original data
right_join(df1 %>%
mutate(rn = row_number())) %>%
# // order by the original order
arrange(rn) %>%
# // create a grouping column based on the NA values
group_by(grp = cumsum(!is.na(Well1))) %>%
# // subtract the first element of Well2 with Well1 and divide
# // by number of rows - n() in the group
mutate(Adjusted = (first(Well2) - first(Well1))/n()) %>%
ungroup %>%
select(-grp, - Well2)

Reshape long to wide time-series data (Unequal number of time-series within each day)

My question continues a previous one: Reshape data.frame from long to wide on time series
But with variations: 1) the number of hours within each day is unequal (minor), and 2) the start hour varies by date (major).
For example, let's create a data frame with 1) dates, 2) hours, and 3) some measurement, m. Here I use reshape() to transform the long format to wide format, as suggested by the previous post. But in the output (dt2), instead of starting from "m.1" (or "m.2", the earliest hour in our data set), the columns start from "m.3", the hour in the first row of the dt data set.
dt <- data.frame(Date=as.Date(character()),
File=character(),
User=character(),
stringsAsFactors=FALSE)
date<-as.Date(c("2018-10-1","2018-10-2","2018-10-4"))
date2<-c(rep(date[1],11),rep(date[2],14),rep(date[3],16))
hour<-c(c(3:13),c(2:15),c(4:19))
m<-rnorm(41)
dt<-data.frame(date2,hour, m)
head(dt)
date2 hour m
1 2018-10-01 3 -0.9259174
2 2018-10-01 4 0.4172615
3 2018-10-01 5 0.3981876
4 2018-10-01 6 -0.1894735
5 2018-10-01 7 0.7387315
6 2018-10-01 8 1.2337722
If I use reshape():
library(reshape)
dt2<-reshape(dt, idvar =c("date2"), timevar ="hour", direction = "wide")
dt2
date2 m.3 m.4 m.5 m.6 m.7 m.8 m.9 m.10 m.11 m.12 m.13
1 2018-10-01 -0.9259174 0.4172615 0.3981876 -0.1894735 0.7387315 1.23377223 -0.3740326 -0.007818602 -1.7822049 -1.304608 -1.2114172
12 2018-10-02 0.4961225 -0.6928343 0.5495917 1.9807136 -2.6065999 -1.78083806 1.3777553 -0.543557423 -0.6153750 1.223758 -1.0254392
26 2018-10-04 NA 0.6442237 0.5019143 0.5550032 -1.5300680 -0.08084971 0.5487069 0.618540806 0.3787519 0.219644 0.6488434
m.2 m.14 m.15 m.16 m.17 m.18 m.19
1 NA NA NA NA NA NA NA
12 -1.54329 0.1743518 -2.3307605 NA NA NA NA
26 NA -1.3063877 -0.6920828 -0.194381 -1.144777 1.585792 -1.320353
One solution I have is to pad the missing values first. But it takes a long time when there are thousands of days.
dt_full <- data.frame(Date=as.Date(character()),
File=character(),
User=character(),
stringsAsFactors=FALSE)
library(dplyr) # needed for full_join()
udate<-unique(dt$date2)
ludate<-length(udate)
for(j in 1:ludate){
dt_sub<-subset(dt,date2==udate[j])
full_minute<-data.frame(2:19,udate[j])
colnames(full_minute)<-c("hour","date2")
dt_sub_full <-full_join(dt_sub,full_minute,by=c("hour"="hour","date2"="date2"))
o<-order(dt_sub_full["hour"])
dt_sub_full <-dt_sub_full[o,]
dt_full <-rbind(dt_sub_full,dt_full)
}
dt_full2<-reshape(dt_full, idvar =c("date2"), timevar ="hour", direction = "wide")
dt_full2
date2 m.2 m.3 m.4 m.5 m.6 m.7 m.8 m.9 m.10 m.11 m.12
17 2018-10-04 NA NA 0.6442237 0.5019143 0.5550032 -1.5300680 -0.08084971 0.5487069 0.618540806 0.3787519 0.219644
110 2018-10-02 -1.54329 0.4961225 -0.6928343 0.5495917 1.9807136 -2.6065999 -1.78083806 1.3777553 -0.543557423 -0.6153750 1.223758
121 2018-10-01 NA -0.9259174 0.4172615 0.3981876 -0.1894735 0.7387315 1.23377223 -0.3740326 -0.007818602 -1.7822049 -1.304608
m.13 m.14 m.15 m.16 m.17 m.18 m.19
17 0.6488434 -1.3063877 -0.6920828 -0.194381 -1.144777 1.585792 -1.320353
110 -1.0254392 0.1743518 -2.3307605 NA NA NA NA
121 -1.2114172 NA NA NA NA NA NA
Could you please help me think of a faster way to overcome this issue? Thank you so much in advance!
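One way the padding step might be done without a loop is to build every date/hour combination once and merge the measurements onto it. This is a base R sketch under the assumption that hours are whole numbers and that the full range runs from the smallest to the largest hour seen; `grid` and `dt_wide` are illustrative names, and the data is regenerated here so the block runs on its own:

```r
# Data shaped like the question's: three days with different hour ranges
set.seed(1)
dt <- data.frame(
  date2 = rep(as.Date(c("2018-10-01", "2018-10-02", "2018-10-04")),
              times = c(11, 14, 16)),
  hour  = c(3:13, 2:15, 4:19),
  m     = rnorm(41)
)

# Every date/hour combination, built in one call instead of per date
grid <- expand.grid(hour = min(dt$hour):max(dt$hour), date2 = unique(dt$date2))

# Left join: hours missing from dt get NA for m
dt_full <- merge(grid, dt, by = c("date2", "hour"), all.x = TRUE)
dt_full <- dt_full[order(dt_full$date2, dt_full$hour), ]

# Columns now come out in hour order because the first date has every hour
dt_wide <- reshape(dt_full, idvar = "date2", timevar = "hour", direction = "wide")
names(dt_wide)
```

If only the column order matters, reordering the columns of the original dt2 after reshaping would be cheaper still, since reshape() already fills absent hours with NA.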

Replace values in dataframe by matching dates of different lengths

I have 52 time series files covering date ranges of differing lengths. All have the same end date - 31-01-2017, but all 52 dataframes have different start dates.
'data': nRows
Date FLOW Modelled
01-01-1992 1.856 NA
02-01-1992 1.523 NA
03-01-1992 2.623 NA
04-01-1992 3.679 NA
...
31-12-2017
I also have a file with simulated FLOW values for each of the datasets in columns.
'Simulated': 20819 rows, 53 columns (including Date).
Date 1 2 3 ..52
01-01-1961 1.856 2.889 2.365
02-01-1961 1.523 3.536 4.624
03-01-1961 2.536 2.452 6.352
04-01-1961 3.486 4.267 3.685
...
31-12-2017
I want to select each column from the Simulated data (e.g. column 1 corresponds to 'data' file 1) and fill the Modelled column of 'data' with the simulated values. Ideally this would loop through the 52 files based on a list of their names.
The problem I am facing is that when using left_join I get an error such as
e.g. replacement has 20819 rows, data has 9657
when 'data' is shorter than 'Simulated', and
e.g. replacement has 20819 rows, data has 22821
when 'data' is longer than 'Simulated'.
I have tried to use left_join of the dplyr package with no luck as dates are not matching up across 'data' and 'Simulated' dataframes.
library(dplyr)
df <-left_join(data, Simulated, by = c("Date"),all.x=TRUE)
I have formatted the dates in both 'data' and 'Simulated' using something similar to Simulated$Date <- as.Date(with(Simulated, paste(Year, Month, Day, sep="-")), "%Y-%m-%d"). But I still get the error below when using left_join:
cannot join a Date object with an object that is not a Date object
A solution can be achieved using tidyverse and read.table. First read all data frames from all files in a list and then use dplyr::bind_rows to merge them in one dataframe.
# Get the data files (File1.txt, File2.txt, ...); the anchored pattern keeps simulated.txt out of the list
filelist = list.files(path = ".", pattern = "^File\\d+\\.txt$", full.names = TRUE)
# Read all files in a list
ll <- lapply(filelist, FUN=read.table, header=TRUE, stringsAsFactors = FALSE)
# Read data from the file containing simulated data
simulated <- read.table(file = "simulated.txt", header=TRUE, stringsAsFactors = FALSE)
library(tidyverse)
#Convert simulated data to long format and then join with other dataframes
simulated %>% mutate(Date = as.Date(Date, format = "%d-%m-%Y")) %>%
gather(df_num, SIM_FLOW, -Date) %>%
mutate(df_num = gsub("X(\\d+)", "\\1", df_num)) %>%
right_join(bind_rows(ll, .id="df_num") %>% mutate(Date = as.Date(Date, format = "%d-%m-%Y")),
by=c("df_num", "Date"))
# Date df_num SIM_FLOW FLOW Modelled
# 1 1992-01-01 1 1.86 1.86 NA
# 2 1992-01-02 1 NA 1.52 NA
# 3 1992-01-03 1 NA 2.62 NA
# 4 1992-01-04 1 NA 3.68 NA
# 5 1993-01-01 2 NA 11.86 NA
# 6 1993-01-02 2 3.54 11.52 NA
# 7 1993-01-03 2 NA 12.62 NA
# 8 1993-01-04 2 NA 13.68 NA
# 9 1994-01-01 3 NA 111.86 NA
# 10 1994-01-02 3 NA 111.52 NA
# 11 1994-01-03 3 6.35 112.62 NA
# 12 1994-01-04 3 NA 113.68 NA
Data:
simulated.txt
Date 1 2 3
01-01-1992 1.856 2.889 2.365
02-01-1993 1.523 3.536 4.624
03-01-1994 2.536 2.452 6.352
04-01-1902 3.486 4.267 3.685
File1.txt
Date FLOW Modelled
01-01-1992 1.856 NA
02-01-1992 1.523 NA
03-01-1992 2.623 NA
04-01-1992 3.679 NA
File2.txt
Date FLOW Modelled
01-01-1993 11.856 NA
02-01-1993 11.523 NA
03-01-1993 12.623 NA
04-01-1993 13.679 NA
File3.txt
Date FLOW Modelled
01-01-1994 111.856 NA
02-01-1994 111.523 NA
03-01-1994 112.623 NA
04-01-1994 113.679 NA
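If the goal is only to fill the Modelled column of a single file, a base R sketch using match() may be enough. It assumes both Date columns are text in day-month-year order and that column X1 of the simulated table belongs to this file; the simulated values below are made up for illustration, and `data1` is a hypothetical name for one of the 52 data frames:

```r
# One data file and a simulated table with a different date range
data1 <- data.frame(
  Date = c("01-01-1992", "02-01-1992", "03-01-1992"),
  FLOW = c(1.856, 1.523, 2.623),
  Modelled = NA_real_,
  stringsAsFactors = FALSE
)
simulated <- data.frame(
  Date = c("31-12-1991", "01-01-1992", "02-01-1992"),
  X1 = c(0.9, 1.1, 1.2),
  stringsAsFactors = FALSE
)

# Convert both sides to Date with the same format so they are comparable
data1$Date     <- as.Date(data1$Date, format = "%d-%m-%Y")
simulated$Date <- as.Date(simulated$Date, format = "%d-%m-%Y")

# For each row of data1, find the row of simulated with the same date
# (NA where there is no such date), then copy the values across
idx <- match(data1$Date, simulated$Date)
data1$Modelled <- simulated$X1[idx]
data1
```

Because the lookup is by date rather than by position, the two tables can have different lengths without triggering a "replacement has ... rows" error.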

Create new column with most recent date

I have some trouble with a dataset I have in data.table. Basically, I have
2 columns: scheduled delivery date and rescheduled delivery date. However,
some values are left blank. An example:
Scheduled Rescheduled
NA NA
2016-11-14 2016-11-17
2016-11-15 NA
2016-11-13 2016-11-11
NA 2016-11-15
I want to create a new column, which indicates the most recent
date of both columns, for instance named max_scheduled_date.
Therefore, if Rescheduled is NA, the max_scheduled_date should
take the value of Scheduled, whilst max_scheduled_date should
take the value of Rescheduled if Scheduled is NA. When both
columns are NA, max_scheduled_date should obviously take NA.
When both columns have a date, it should take the most recent one.
I have a lot of problems creating this and do not get the results I want.
The dates are POSIXct, which gives me some trouble unfortunately.
Can someone help me out?
Thank you in advance,
Kind regards,
Amanda
As the question is tagged with data.table, here is also a data.table solution.
pmax() seems to work sufficiently well with POSIXct. Therefore, I see no reason to coerce the date columns from POSIXct to Date class.
setDT(DF)[, max_scheduled_date := pmax(Scheduled, Rescheduled, na.rm = TRUE)]
DF
Scheduled Rescheduled max_scheduled_date
1: <NA> <NA> <NA>
2: 2016-11-14 2016-11-17 2016-11-17
3: 2016-11-15 <NA> 2016-11-15
4: 2016-11-13 2016-11-11 2016-11-13
5: <NA> 2016-11-15 2016-11-15
Note that the new column is appended by reference, i.e., without copying the whole object.
Data
DF <- setDF(fread(
"Scheduled Rescheduled
NA NA
2016-11-14 2016-11-17
2016-11-15 NA
2016-11-13 2016-11-11
NA 2016-11-15"
)[, lapply(.SD, as.POSIXct)])
str(DF)
'data.frame': 5 obs. of 2 variables:
$ Scheduled : POSIXct, format: NA "2016-11-14" "2016-11-15" ...
$ Rescheduled: POSIXct, format: NA "2016-11-17" NA ...
Assuming that both columns are Date class, we can use pmax to create the max of the dates for each row
df1[] <- lapply(df1, as.Date) #change to Date class initially
df1$max_scheduled_date <- do.call(pmax, c(df1, na.rm = TRUE))
df1$max_scheduled_date
#[1] NA "2016-11-17" "2016-11-15" "2016-11-13" "2016-11-15"
It can also be done with the tidyverse
library(dplyr)
df1 %>%
mutate_all(as.Date) %>%
mutate(max_scheduled_date = pmax(Scheduled, Rescheduled, na.rm = TRUE))
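For reference, the same row-wise maximum can be sketched in base R on POSIXct vectors directly, without any package and without coercing to Date. The vector names below are illustrative, and the time zone is fixed to UTC only to keep the example reproducible:

```r
sched   <- as.POSIXct(c(NA, "2016-11-14", "2016-11-15", "2016-11-13", NA),
                      tz = "UTC")
resched <- as.POSIXct(c(NA, "2016-11-17", NA, "2016-11-11", "2016-11-15"),
                      tz = "UTC")

# na.rm = TRUE ignores an NA when the other value is present;
# a row where both are NA stays NA
latest <- pmax(sched, resched, na.rm = TRUE)
format(latest, "%Y-%m-%d")
```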

R: creating xts changes dataset, losing data

When creating an xts object from a data.frame I seem to lose some data (approximately 3,000 of 33,000 observations).
My dataset is as follows (with the time in day-month-year, EU format):
> head(mesdonnees)
time value
1 05-03-2006 04:07 NA
2 05-03-2006 04:17 NA
3 05-03-2006 04:27 NA
4 05-03-2006 04:37 NA
5 05-03-2006 04:47 NA
6 05-03-2006 04:57 NA
Due to the format I had to extract the different parts of the date (at least I couldn't get as.POSIXct to work with this format).
Here is how I did it:
# Extract characters and define as S....
Syear <- substr(mesdonnees$time, 7,10)
Smonth <- substr(mesdonnees$time, 4,5)
Sday <- substr(mesdonnees$time, 1, 2)
#Gather all parts and use "-" as sep
datetext <- paste(Syear, Smonth, Sday, sep="-")
#define format of each part of the string
formatdate<-as.POSIXct(datetext, format="%Y-%m-%d", tz = "GMT")
I then try to create my xts with...
xtsdata <- xts(mesdonnees$value, order.by = formatdate, tz = "GMT")
... but when doing this I get some quite weird results: the first value is in 1900
> head(xtsdata)
[,1]
1900-01-04 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
and many (3000) dates are not kept:
> xtsdata[30225:30233,]
[,1]
2006-12-31 0
2006-12-31 0
2006-12-31 0
2006-12-31 0
<NA> NA
<NA> NA
<NA> NA
<NA> NA
<NA> NA
When looking at what should be the same row in both my data.frame and my xts, I can see that the rows are offset (the date format changed when the xts object was created):
> mesdonnees[25617,]
time value
25617 08-11-2006 23:51 0
> xtsdata[25617,]
[,1]
2006-11-25 0.27
How is it that my data are offset? I tried changing the tz, but it has no effect. I removed all duplicates using the dplyr package, but that doesn't affect the xts results either. Thank you for your help!
After changing my xts code to the one suggested by Joshua:
xtsdata <- xts(mesdonnees$value, order.by = as.POSIXct(mesdonnees$time, tz = "GMT", format = "%d-%m-%Y %H:%M"))
... my data show properly for the "last" part, but I now have a different problem. The first 2300 rows show the following when viewed (xtsdata[1500,], or any row below 2300, shows the same kind of result):
> view(xtsdata):
0206-06-30 23:08:00 NA
0206-06-30 23:18:00 NA
0206-06-30 23:28:00 NA
1900-01-04 12:00:00 NA
2006-03-05 04:07:00 NA
2006-03-05 04:17:00 NA
I noticed this error before and thought it was due to the date format; maybe it is not? Also, when looking at xtsdata I do not get the same results for the same row (the last rows are correct, though):
> mesdonnees[2360,]
time value
2360 23-03-2006 03:09 NA
> xtsdata[2360,]
[,1]
2006-03-05 09:07:00 NA
As requested:
> str(mesdonnees)
'data.frame': 32556 obs. of 2 variables:
$ time : chr "05-03-2006 04:07" "05-03-2006 04:17" "05-03-2006 04:27" "05-03-2006 04:37" ...
$ value: num NA NA NA NA NA NA NA NA NA NA ...
And if needed:
An ‘xts’ object on 0206-06-01 00:09:00/2006-12-31 23:29:00 containing:
Data: num [1:32556, 1] NA NA NA NA NA NA NA NA NA NA ...
Indexed by objects of class: [POSIXct,POSIXt] TZ: GMT
xts Attributes:
NULL
The problem is that you only include the date portion of the timestamp in datetext and formatdate, but your data have dates and times.
You also do not need to do all the string subsetting. You can achieve the same result by specifying the format argument in your as.POSIXct call.
xtsdata <- xts(mesdonnees$value,
as.POSIXct(mesdonnees$time, tz = "GMT", format = "%d-%m-%Y %H:%M"))
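Before building the xts object, it may help to check how the strings parse. The sketch below uses three time strings taken from the question's output and flags any that fail to parse or land outside the expected year; treating 2006 as the only valid year is an assumption, and `time_chr`, `parsed` and `suspicious` are illustrative names. Index values such as 0206 or 1900 would be consistent with a few malformed strings in the time column, though that is only a guess from the output shown.

```r
time_chr <- c("05-03-2006 04:07", "05-03-2006 04:17", "23-03-2006 03:09")

# Parse date and time in one call; day-month-year order is given by format
parsed <- as.POSIXct(time_chr, tz = "GMT", format = "%d-%m-%Y %H:%M")
format(parsed, "%Y-%m-%d %H:%M")

# Flag strings that did not parse or whose year is not the expected one
suspicious <- is.na(parsed) | format(parsed, "%Y") != "2006"
time_chr[suspicious]
```

Note also that an xts object keeps its rows ordered by the index, so row i of the data.frame need not be row i of the xts object when the input is not already sorted by time.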
