Replace values in dataframe by matching dates of different lengths - r

I have 52 time series files with differing lengths for date. All have the same end date - 31-01-2017, but all 52 dataframes have different start dates.
'data': nRows
Date FLOW Modelled
01-01-1992 1.856 NA
02-01-1992 1.523 NA
03-01-1992 2.623 NA
04-01-1992 3.679 NA
...
31-12-2017
I also have a file with simulated FLOW values for each of the datasets in columns.
'Simulated': 20819 rows, 53 columns (including Date).
Date 1 2 3 ..52
01-01-1961 1.856 2.889 2.365
02-01-1961 1.523 3.536 4.624
03-01-1961 2.536 2.452 6.352
04-01-1961 3.486 4.267 3.685
...
31-12-2017
I want to take each column of Simulated (e.g. column 1 corresponds to 'data' file 1) and use it to fill the Modelled column of the matching 'data' file, matching rows by date. Ideally this would loop through the 52 files based on a list of their names.
The problem I am facing is when using left_join the error I get is
e.g. replacement has 20819 rows, data has 9657
when 'data' is shorter than 'Simulated', and
e.g. replacement has 20819 rows, data has 22821
when 'data' is longer than 'Simulated'.
I have tried to use left_join of the dplyr package with no luck as dates are not matching up across 'data' and 'Simulated' dataframes.
library(dplyr)
df <-left_join(data, Simulated, by = c("Date"),all.x=TRUE)
I have formatted the dates in both 'data' and 'Simulated' using something similar to Simulated$Date <- as.Date(with(Simulated, paste(Year, Month, Day, sep="-")), "%Y-%m-%d"). But I still get the error below when using left_join:
cannot join a Date object with an object that is not a Date object

A solution can be achieved using tidyverse and read.table. First read all data frames from all files in a list and then use dplyr::bind_rows to merge them in one dataframe.
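# Get the list of data files. The pattern assumes they are named File1.txt, File2.txt, ...
# so that simulated.txt in the same folder is not picked up as well.
filelist = list.files(path = ".", pattern = "^File.*\\.txt$", full.names = TRUE)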
# Read all files in a list
ll <- lapply(filelist, FUN=read.table, header=TRUE, stringsAsFactors = FALSE)
# Read data from file containing simulate data
simulated <- read.table(file = "simulated.txt", header=TRUE, stringsAsFactors = FALSE)
library(tidyverse)
#Convert simulated data to long format and then join with other dataframes
simulated %>% mutate(Date = as.Date(Date, format = "%d-%m-%Y")) %>%
gather(df_num, SIM_FLOW, -Date) %>%
mutate(df_num = gsub("X(\\d+)", "\\1", df_num)) %>%
right_join(bind_rows(ll, .id="df_num") %>% mutate(Date = as.Date(Date, format = "%d-%m-%Y")),
by=c("df_num", "Date"))
# Date df_num SIM_FLOW FLOW Modelled
# 1 1992-01-01 1 1.86 1.86 NA
# 2 1992-01-02 1 NA 1.52 NA
# 3 1992-01-03 1 NA 2.62 NA
# 4 1992-01-04 1 NA 3.68 NA
# 5 1993-01-01 2 NA 11.86 NA
# 6 1993-01-02 2 3.54 11.52 NA
# 7 1993-01-03 2 NA 12.62 NA
# 8 1993-01-04 2 NA 13.68 NA
# 9 1994-01-01 3 NA 111.86 NA
# 10 1994-01-02 3 NA 111.52 NA
# 11 1994-01-03 3 6.35 112.62 NA
# 12 1994-01-04 3 NA 113.68 NA
Data:
simulated.txt
Date 1 2 3
01-01-1992 1.856 2.889 2.365
02-01-1993 1.523 3.536 4.624
03-01-1994 2.536 2.452 6.352
04-01-1902 3.486 4.267 3.685
File1.txt
Date FLOW Modelled
01-01-1992 1.856 NA
02-01-1992 1.523 NA
03-01-1992 2.623 NA
04-01-1992 3.679 NA
File2.txt
Date FLOW Modelled
01-01-1993 11.856 NA
02-01-1993 11.523 NA
03-01-1993 12.623 NA
04-01-1993 13.679 NA
File3.txt
Date FLOW Modelled
01-01-1994 111.856 NA
02-01-1994 111.523 NA
03-01-1994 112.623 NA
04-01-1994 113.679 NA
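If the goal is to write the simulated values straight into each file's Modelled column rather than build one long table, matching on parsed dates with match() is one option. The following is a minimal base R sketch under stated assumptions: the data frames are typed in by hand from the sample data above, the helper fill_modelled is a hypothetical name, and the simulated columns are assumed to be called X1, X2, ... as read.table would name them.

```r
# Toy stand-ins for one 'data' file and the wide simulated table
# (values copied from the sample data above).
file1 <- data.frame(
  Date = c("01-01-1992", "02-01-1992", "03-01-1992", "04-01-1992"),
  FLOW = c(1.856, 1.523, 2.623, 3.679),
  Modelled = NA_real_,
  stringsAsFactors = FALSE
)
simulated <- data.frame(
  Date = c("01-01-1992", "02-01-1993", "03-01-1994", "04-01-1902"),
  X1 = c(1.856, 1.523, 2.536, 3.486),
  X2 = c(2.889, 3.536, 2.452, 4.267),
  X3 = c(2.365, 4.624, 6.352, 3.685),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fill Modelled for one file from one simulated column.
fill_modelled <- function(df, sim, sim_col) {
  df_dates  <- as.Date(df$Date,  format = "%d-%m-%Y")
  sim_dates <- as.Date(sim$Date, format = "%d-%m-%Y")
  # match() gives, for each date in df, its row position in sim (NA if absent),
  # so the result always has exactly nrow(df) values.
  df$Modelled <- sim[[sim_col]][match(df_dates, sim_dates)]
  df
}

# In practice 'files' would hold one data frame per file, in column order.
files <- list(file1)
filled <- lapply(seq_along(files), function(i) {
  fill_modelled(files[[i]], simulated, paste0("X", i))
})
filled[[1]]
```

Because the lookup is done per row of the shorter or longer data frame, the "replacement has N rows" error from the question cannot occur; dates missing from the simulated table simply stay NA.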

Related

do.call() isn't appending list of dataframes correctly. Any idea why?

I'm downloading a bunch of PiT datasets and trying to automate combining them into a single time series dataframe (master_df).
temp <- tempfile()
testing <- download.file("https://data.sa.gov.au/data/dataset/3ba1c4dd-e52f-4c28-858a-21284c3ee458/resource/c78fc6da-baa4-47cc-b4df-a97f452bbf9a/download/ken01_p.zip",temp)
filenames<-unzip(temp,list=TRUE)[,1]
#only want csvs
filenames<-filenames[str_detect(filenames,".csv")]
dfnames = list()
for (i in 1:length(filenames)){
conn<-unz(temp, filenames[i])
#name files in loop
filename <- sprintf("df_%s",filenames[i] %>%
str_replace("KEN01_p/KEN01p_1hr","") %>%
str_replace(".csv",""))
# list of filenames
dfnames[[i]] <- filename
assign(filename, read.csv(conn))
}
master_df <- do.call(rbind, dfnames)
unlink(temp)
class(master_df)
class(df_201912)
class(master_df)
[1] "matrix"
class(df_201912)
[1] "data.frame"
The loop is successfully reading all the datasets and renaming them as df_yyyymm, but do.call(rbind, ...) is just producing a list of data names.
What am I doing wrong?
Thanks!!
There is no need to use assign, since it writes every dataframe to the global environment, which is not required. You can read and combine all the dataframes in one step. Some of the dataframes have different column names, so it may be better to use map_df, which combines them into one dataframe and fills the missing columns with NA.
purrr::map_df(filenames, function(x) {
read.csv(unz(temp, x))
}) -> master_df
master_df
The issue in the code is that each list element is assigned the filename instead of the data read from that file:
for (i in 1:length(filenames)){
conn<-unz(temp, filenames[i])
#name files in loop
filename <- sprintf("df_%s",filenames[i] %>%
str_replace("KEN01_p/KEN01p_1hr","") %>%
str_replace(".csv",""))
# list of filenames
dfnames[[i]] <- read.csv(conn) ###
#assign(filename, read.csv(conn))
}
Also, some list elements have different column names, so rbind wouldn't work; we can use rbindlist from data.table with fill = TRUE instead.
library(data.table)
out <- rbindlist(dfnames, fill = TRUE)
dim(out)
[1] 44544 6
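To see why the original loop produced a matrix, it may help to compare binding a list of names with binding a list of data frames on toy data. This is a small base R sketch; df_a and df_b are made-up objects, not the downloaded files.

```r
# Two toy data frames standing in for the files read in the loop.
df_a <- data.frame(x = 1:2, y = c("a", "b"))
df_b <- data.frame(x = 3:4, y = c("c", "d"))

# What the question's loop collected: the *names* of the data frames.
name_list <- list("df_a", "df_b")
by_name <- do.call(rbind, name_list)
class(by_name)   # a character matrix, one row per name

# What rbind needs: the data frames themselves.
df_list <- list(df_a, df_b)
by_value <- do.call(rbind, df_list)
class(by_value)  # a data.frame with the rows of both inputs

# If the objects already exist under those names, mget() can fetch them.
from_names <- do.call(rbind, unname(mget(unlist(name_list), envir = environment())))
```

Plain rbind only works here because both toy frames share the same columns; with differing columns a filling variant such as rbindlist(fill = TRUE) is still needed.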
This is what I would do to download a zip file, unpack it and read all csv files into one large dataset:
temp <- tempfile()
testing <- download.file(
"https://data.sa.gov.au/data/dataset/3ba1c4dd-e52f-4c28-858a-21284c3ee458/resource/c78fc6da-baa4-47cc-b4df-a97f452bbf9a/download/ken01_p.zip",
temp)
filenames <- unzip(temp, list = FALSE)
library(data.table)
library(magrittr) # piping used to improve readability
master_df <- lapply(filenames, fread) %>%
set_names(filenames %>% basename() %>% stringr::str_remove_all("^KEN01p_1hr|\\.csv$")) %>%
rbindlist(fill = TRUE, idcol = TRUE)
master_df
.id Date/Time PM10 TEOM ug/m3 PM2.5 TEOM ug/m3 Temperature Deg C Barometric Pressure atm PM10 BAM ug/m3
1: 201501 1/01/2015 1:00 18.2 NA 16.8 0.986 NA
2: 201501 1/01/2015 2:00 20.3 NA 15.9 0.985 NA
3: 201501 1/01/2015 3:00 27.9 NA 15.1 0.985 NA
4: 201501 1/01/2015 4:00 23.6 NA 16.9 0.984 NA
5: 201501 1/01/2015 5:00 15.8 NA 19.7 0.984 NA
---
44540: 201912 31/12/2019 20:00 NA NA 19.4 NA 14
44541: 201912 31/12/2019 21:00 NA NA 18.0 NA 14
44542: 201912 31/12/2019 22:00 NA NA 16.7 NA 19
44543: 201912 31/12/2019 23:00 NA NA 15.8 NA 11
44544: 201912 1/01/2020 0:00 NA NA 15.3 NA 12
Note that I have changed the list parameter in
filenames <- unzip(temp, list = FALSE)
to FALSE. This unpacks the zip file into a subdirectory named KEN01_p. After unpacking, the subdirectory contains 61 csv files with 1.5 MBytes in total.
Also note that the .id column in master_df indicates the source of each row.

Reference the previous non-zero row, find the difference and divide by nrows

I must be asking the question terribly because I can't find what I'm looking for!
I have a large excel file that looks like this for every day of the month:
Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 35
etc NA
I want to make a new column that holds the difference between consecutive non-zero (non-NA) rows, divided by the number of rows between them. Aiming for something like this:
Date Well1 Adjusted
1/1/16 10 =(20-10)/4 = 2.5
1/2/16 NA 1.25
1/3/16 NA 1.25
1/4/16 NA 1.25
1/5/16 20 =(25-20)/2= 2.5
1/6/16 NA 2.5
1/7/16 25 =(35-25)/3 = 3.3
1/8/16 NA 3.3
1/9/16 NA 3.3
1/10/16 35 etc
etc NA etc
I'm thinking I should use lead or lag, but the number of steps between each pair of non-zero rows differs (so I'm not sure how to set n in the lead/lag function). I've used group_by so that each month stands alone, and I have also attempted case_when and ifelse. Mostly I need ideas on translating the Excel logic into a workable R format.
With some diff-ing and repeating of values, you should be able to get there.
dat$Date <- as.Date(dat$Date, format="%m/%d/%y")
nas <- is.na(dat$Well1)
dat$adj <- with(dat[!nas,],
diff(Well1) / as.numeric(diff(Date), units="days")
)[cumsum(!nas)]
# Date Well1 adj
#1 2016-01-01 10 2.5
#2 2016-01-02 NA 2.5
#3 2016-01-03 NA 2.5
#4 2016-01-04 NA 2.5
#5 2016-01-05 20 2.5
#6 2016-01-06 NA 2.5
#7 2016-01-07 25 5.0
#8 2016-01-08 NA 5.0
#9 2016-01-09 NA 5.0
#10 2016-01-10 40 NA
dat being used is:
dat <- read.table(text="Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 40", header=TRUE, stringsAsFactors=FALSE)
Base R in the same vein as @thelatemail but with transformations all in one expression:
nas <- is.na(dat$Well1)
res <- within(dat, {
Date <- as.Date(Date, "%m/%d/%y")
Adjusted <- (diff(Well1[!nas]) /
as.numeric(diff(Date[!nas]), units = "days"))[cumsum(!nas)]
}
)
Data:
dat <- read.table(text="Date Well1
1/1/16 10
1/2/16 NA
1/3/16 NA
1/4/16 NA
1/5/16 20
1/6/16 NA
1/7/16 25
1/8/16 NA
1/9/16 NA
1/10/16 40", header=TRUE, stringsAsFactors=FALSE)
Maybe this should work
library(dplyr)
df1 %>%
#// remove the rows with NA
na.omit %>%
# // create a new column with the lead values of Well1
transmute(Date, Well2 = lead(Well1)) %>%
# // join with original data
right_join(df1 %>%
mutate(rn = row_number())) %>%
# // order by the original order
arrange(rn) %>%
# // create a grouping column based on the NA values
group_by(grp = cumsum(!is.na(Well1))) %>%
# // subtract the first element of Well2 with Well1 and divide
# // by number of rows - n() in the group
mutate(Adjusted = (first(Well2) - first(Well1))/n()) %>%
ungroup %>%
select(-grp, - Well2)

R: creating xts changes dataset, losing data

When creating an xts object from a data.frame I seem to lose some data (approximately 3,000 of about 33,000 rows).
My dataset is as follow: (with the time being day-month-year, EU format)
> head(mesdonnees)
time value
1 05-03-2006 04:07 NA
2 05-03-2006 04:17 NA
3 05-03-2006 04:27 NA
4 05-03-2006 04:37 NA
5 05-03-2006 04:47 NA
6 05-03-2006 04:57 NA
Due to the format I had to extract the different parts of the date (at least I couldn't get as.POSIXct to work with this format).
Here is how I did it:
# Extract characters and define as S....
Syear <- substr(mesdonnees$time, 7,10)
Smonth <- substr(mesdonnees$time, 4,5)
Sday <- substr(mesdonnees$time, 1, 2)
#Gather all parts and use "-" as sep
datetext <- paste(Syear, Smonth, Sday, sep="-")
#define format of each part of the string
formatdate<-as.POSIXct(datetext, format="%Y-%m-%d", tz = "GMT")
I then try to create my xts with...
xtsdata <- xts(mesdonnees$value, order.by = formatdate, tz = "GMT")
... but when doing this I get some quite weird results: the first value is in 1900
> head(xtsdata)
[,1]
1900-01-04 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
2006-03-05 NA
and many (3000) dates are not kept:
> xtsdata[30225:30233,]
[,1]
2006-12-31 0
2006-12-31 0
2006-12-31 0
2006-12-31 0
<NA> NA
<NA> NA
<NA> NA
<NA> NA
<NA> NA
When looking at what should be the same line in both my data.frame and my xts, I can see that the lines are offset (I had the date format changed in the xts object creation):
> mesdonnees[25617,]
time value
25617 08-11-2006 23:51 0
> xtsdata[25617,]
[,1]
2006-11-25 0.27
How is it that my data are offset? I tried changing the tz but it doesn't affect it. I removed all duplicates using the dplyr package, but that doesn't affect the xts results either. Thank you for your help!
After changing my xts code to the one suggested by Joshua:
xtsdata <- xts(mesdonnees$value, order.by = as.POSIXct(mesdonnees$time, tz = "GMT", format = "%d-%m-%Y %H:%M"))
... my data show properly for the "last" part, but I now have a different problem. The first 2300 rows show the following when viewed (xtsdata[1500,], or any row below 2300, shows the same kind of result):
> view(xtsdata):
0206-06-30 23:08:00 NA
0206-06-30 23:18:00 NA
0206-06-30 23:28:00 NA
1900-01-04 12:00:00 NA
2006-03-05 04:07:00 NA
2006-03-05 04:17:00 NA
I noticed this error before and thought it was due to the date format; maybe it is not? Also, when looking at the xtsdata I do not get the same results for the same row (the last rows are correct though):
> mesdonnees[2360,]
time value
2360 23-03-2006 03:09 NA
> xtsdata[2360,]
[,1]
2006-03-05 09:07:00 NA
As requested:
> str(mesdonnees)
'data.frame': 32556 obs. of 2 variables:
$ time : chr "05-03-2006 04:07" "05-03-2006 04:17" "05-03-2006 04:27" "05-03-2006 04:37" ...
$ value: num NA NA NA NA NA NA NA NA NA NA ...
And if needed:
An ‘xts’ object on 0206-06-01 00:09:00/2006-12-31 23:29:00 containing:
Data: num [1:32556, 1] NA NA NA NA NA NA NA NA NA NA ...
Indexed by objects of class: [POSIXct,POSIXt] TZ: GMT
xts Attributes:
NULL
The problem is that you only include the date portion of the timestamp in datetext and formatdate, but your data have dates and times.
You also do not need to do all the string subsetting. You can achieve the same result by specifying the format argument in your as.POSIXct call.
xtsdata <- xts(mesdonnees$value,
as.POSIXct(mesdonnees$time, "GMT", format = "%d-%m-%Y %H:%M"))
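Before building the xts object it can be useful to check which timestamps parse at all. The sketch below uses only base R and a few strings in the question's format plus one deliberately malformed value; the vector names are illustrative, and whether the 1900 and 0206 index values in the question come from malformed input rows is an assumption that this check would confirm or rule out.

```r
# A few timestamps in the question's day-month-year format,
# plus one made-up malformed entry.
times <- c("05-03-2006 04:07", "23-03-2006 03:09",
           "08-11-2006 23:51", "not a date")

# Parse date and time in one step; strings that do not fit the format become NA.
parsed <- as.POSIXct(times, tz = "GMT", format = "%d-%m-%Y %H:%M")

# Rows that failed to parse and need a look in the raw data.
bad <- is.na(parsed)
times[bad]

# The rows that parsed, printed back in ISO order for inspection.
format(parsed[!bad], "%Y-%m-%d %H:%M")

# Years far outside the expected range are another sign of typos in the input.
range(as.integer(format(parsed[!bad], "%Y")))
```

Note also that xts stores observations sorted by their time index, so row i of the xts object matches row i of the data.frame only if the data.frame was already in time order.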

Reshape data using colsplit in R

I am trying to reshape my data using this code, but I get NA values.
require(reshape2)
dates=data.frame(dates=seq(as.Date("1988-01-01"),as.Date("2011-12-31"),by="day"))
first=dates[,1]
dates1=cbind(dates[,1],colsplit(first,pattern="\\-",names=c("Year","Month","Day")))###split by y/m/day
head(dates1)
dates[, 1] Year Month Day
1 1988-01-01 6574 NA NA
2 1988-01-02 6575 NA NA
3 1988-01-03 6576 NA NA
4 1988-01-04 6577 NA NA
5 1988-01-05 6578 NA NA
6 1988-01-06 6579 NA NA
We can use cSplit from splitstackshape to split the 'dates' column by the delimiter -.
library(splitstackshape)
cSplit(dates, 'dates', '-', drop=FALSE)
Or extract to create additional columns
library(tidyr)
extract(dates, dates, into=c('Year', 'Month', 'Day'),
'([^-]+)-([^-]+)-([^-]+)', remove=FALSE)
Or another option from tidyr (suggested by @Ananda Mahto)
separate(dates, dates, into = c("Year", "Month", "Day"), remove=FALSE)
Or using read.table from base R. We specify the sep and the column names, and cbind with the original column.
cbind(dates[1],read.table(text=as.character(dates$dates),
sep='-', col.names=c('Year', 'Month', 'Day')))
By using reshape2_1.4.1, I could reproduce the error
head(cbind(dates[,1],colsplit(first,pattern="-",
names=c("Year","Month","Day"))),2)
# dates[, 1] Year Month Day
#1 1988-01-01 6574 NA NA
#2 1988-01-02 6575 NA NA
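The 6574 in the Year column equals the internal numeric value of the Date 1988-01-01 (days since 1970-01-01), which suggests the dates reached the splitter as numbers rather than as text; that reading of colsplit's behaviour is an assumption. A base R sketch that avoids string splitting altogether by formatting each component directly:

```r
d <- as.Date(c("1988-01-01", "1988-01-02"))

# A Date is stored as a day count from 1970-01-01, which is where 6574 comes from.
as.numeric(d)

# format() extracts each component as text; as.integer() makes it numeric.
parts <- data.frame(
  dates = d,
  Year  = as.integer(format(d, "%Y")),
  Month = as.integer(format(d, "%m")),
  Day   = as.integer(format(d, "%d"))
)
parts
```

If the split-based approach is preferred, converting first with as.character(first) before splitting should give the splitter the "1988-01-01" text it expects.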

Combination of merge and aggregate in R

I have created the following 2 dummy datasets as follows:
id<-c(8,8,50,87,141,161,192,216,257,282)
date<-c("2011-03-03","2011-12-12","2010-08-18","2009-04-28","2010-11-29","2012-04-02","2013-01-08","2007-01-22","2009-06-03","2009-12-02")
data<-data.frame(cbind(id,date))
id<-c(3,8,11,11,11,11,11,11,19,19,19,19,19,50,50,50,50,50,87,87,87,87,87,87,282,282,282,282,282,282,282,282,282,282,288,288,288,288,288,288,288,288,288,288,288,288,288)
date<-c("2010-11-04","2011-02-25","2009-07-26","2009-07-27","2009-08-09","2009-08-10","2009-08-30","2004-01-20","2006-02-13","2006-07-18","2007-04-20","2008-05-12","2008-05-29","2009-06-10","2010-08-17","2010-08-15","2011-05-13","2011-06-08","2007-08-09","2008-01-19","2008-02-19","2009-04-28","2009-05-16","2009-05-20","2005-05-14","2007-04-15","2007-07-25","2007-10-12","2007-10-23","2007-10-27","2007-11-20","2009-11-28","2012-08-16","2012-08-16","2008-11-17","2009-10-23","2009-10-27","2009-10-27","2009-10-27","2009-10-27","2009-10-28","2010-06-15","2010-06-17","2010-06-23","2010-07-27","2010-07-27","2010-07-28")
ns<-data.frame(cbind(id,date))
Note that only some of the id values in data are included in ns and vice versa.
For each of the values in data$id I am trying to find if there is a ns$date that is 14 days before the data$date where data$id==ns$id and report the number of days difference.
The output I need is a vector/column ("received") with the same number of rows as data, holding TRUE/FALSE where ns$date[ns$id==data$id] is less than 14 days before the respective data$date, and a similar vector with the actual number of days where "received" is TRUE. I hope this makes sense now.
This is where I got so far
# convert dates
library(lubridate)
data$date <- ymd(data$date)
ns$date <- ymd(ns$date)
# left join datasets
tmp <- merge(data, ns, by="id", all.x=TRUE)
# NOTE: this automatically renames data$date to date.x and ns$date to date.y
# create variable to say when there is a date difference less than 14 days
tmp$received <- with(tmp, difftime(date.x, date.y, units="days")<14&difftime(date.x, date.y, units="days")>0)
#create a variable that reports the days difference
tmp$dif<-ifelse(tmp$received==TRUE,difftime(tmp$date.x,tmp$date.y, units="days"),NA)
The related question "Find if date is within 14 days if id matches between datasets in R" provides an idea, but its result does not include the number of days difference in tmp$dif.
In the result table I need only the lowest difference for each data$id for those cases where tmp$received was TRUE.
Hope this makes more sense now? If not please let me know what needs further clarification.
M
PS: as requested I added what the desired output should look like (same number of rows of data = 10 - no rows for data in ns not in data). Should have thought this might help earlier.
id date received dif
1 8 2011-03-03 TRUE 6
2 8 2011-12-12 FALSE NA
3 50 2010-08-18 TRUE 1
4 87 2009-04-28 TRUE 0
5 141 2010-11-29 NA NA
6 161 2012-04-02 NA NA
7 192 2013-01-08 NA NA
8 216 2007-01-22 NA NA
9 257 2009-06-03 NA NA
10 282 2009-12-02 TRUE 4
Here's a data.table approach
Converting to data.table objects
library(data.table)
setkey(setDT(data), id)
setkey(setDT(ns), id)
Merging
ns <- ns[data]
Converting to Date class
ns[, c("date", "date.1") := lapply(.SD, as.Date), .SDcols = c("date", "date.1")]
Computing days differences and TRUE/FALSE
ns[, `:=`(timediff = date.1 - date,
Logical = (date.1 - date) < 14)]
Taking only the rows we are interested in
res <- ns[is.na(timediff) | timediff >= 0, list(received = any(Logical), dif = timediff[Logical]), by = list(id, date.1)]
Sorting by id and date
res[, id := as.numeric(as.character(id))]
setkey(res, id, date.1)
Subsetting by minimum distance
res[, list(diff = min(dif)), by = list(id, date.1, received)]
# id date.1 received diff
# 1: 8 2011-03-03 TRUE 6 days
# 2: 8 2011-12-12 FALSE NA days
# 3: 50 2010-08-18 TRUE 1 days
# 4: 87 2009-04-28 TRUE 0 days
# 5: 141 2010-11-29 NA NA days
# 6: 161 2012-04-02 NA NA days
# 7: 192 2013-01-08 NA NA days
# 8: 216 2007-01-22 NA NA days
# 9: 257 2009-06-03 NA NA days
# 10: 282 2009-12-02 TRUE 4 days
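For comparison, the same result can be sketched in base R with merge followed by a per-group collapse. This is a minimal sketch on a four-row subset of the question's data, not a drop-in replacement: it uses split and lapply rather than aggregate for the collapse step, the helper name collapse is hypothetical, and it assumes each (id, date) pair appears only once in data.

```r
# Small subset of the question's data, with dates already converted.
data <- data.frame(
  id   = c(8, 8, 50, 141),
  date = as.Date(c("2011-03-03", "2011-12-12", "2010-08-18", "2010-11-29"))
)
ns <- data.frame(
  id   = c(8, 50, 50, 50),
  date = as.Date(c("2011-02-25", "2009-06-10", "2010-08-17", "2010-08-15"))
)

# Left join: every row of data, paired with each ns date for the same id.
tmp <- merge(data, ns, by = "id", all.x = TRUE, suffixes = c("", ".ns"))
tmp$days <- as.numeric(tmp$date - tmp$date.ns)
# TRUE when the ns date falls 0 to 13 days before the data date; NA if no ns row.
tmp$hit <- tmp$days >= 0 & tmp$days < 14

# Hypothetical helper: reduce all pairings of one (id, date) to a single row.
collapse <- function(rows) {
  hits <- rows$days[which(rows$hit)]
  data.frame(
    id = rows$id[1],
    date = rows$date[1],
    received = if (all(is.na(rows$hit))) NA else length(hits) > 0,
    dif = if (length(hits) > 0) min(hits) else NA_real_
  )
}

groups <- split(tmp, list(tmp$id, tmp$date), drop = TRUE)
res <- do.call(rbind, lapply(groups, collapse))
res <- res[order(res$id, res$date), ]
rownames(res) <- NULL
res
```

Here received is NA when the id never occurs in ns, FALSE when it occurs but no date falls inside the window, and dif is the smallest qualifying difference in days, which matches the layout of the desired output in the question.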
