I have a problem joining time-series dataframes with a map function. I have 25 dataframes with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe representing a cryptocurrency priced in USD, and every dataframe has 2 columns: Date and Close (closing price). For example, the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to join them all into one dataframe by Date with a map-function:
lst1 <- mget(ls(pattern = "USD"))
df <- map(.x = lst1,.f = full_join(by="Date"))
But it doesn't work:
Error in UseMethod("full_join") :
no applicable method for 'full_join' applied to an object of class "character"
Can somebody help me?
The result of mget is a list of characters; that's why full_join fails with an error.
Try this:
map(lst1, function(x) {full_join(tibble(x),head(BTC.USD),by="Date")}) # Full join might fail because lst1 has no column called Date.
Also, the result of mget that you have in lst1 contains no column called Date.
Creating a lst1 tibble with a Date column:
DateVec=c("2015-12-31")
map(lst1, function(x) {full_join(tibble(x,Date=DateVec),head(BTC.USD),by="Date")})
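For reference, here is a minimal base R sketch of joining every data frame in the list on Date. It assumes, as the question describes, that each object is a data frame with Date and Close columns, in which case mget() returns a named list of those data frames. The three small data frames and their values are made up for illustration. In map(.x = lst1, .f = full_join(by = "Date")) the call full_join(by = "Date") is evaluated immediately with the string "Date" as its only argument, which would explain the "character" in the error message. Joining a whole list is a fold rather than a map, so the sketch uses Reduce() with merge(all = TRUE), base R's full outer join. With purrr and dplyr loaded, reduce(lst1, full_join, by = "Date") should be the tidyverse equivalent.

```r
# Toy stand-ins for the per-coin data frames (values are made up)
BTC.USD <- data.frame(Date = as.Date(c("2016-01-01", "2016-01-02")),
                      Close = c(434, 433))
ETH.USD <- data.frame(Date = as.Date(c("2016-01-02", "2016-01-03")),
                      Close = c(0.94, 0.97))
LTC.USD <- data.frame(Date = as.Date("2016-01-01"), Close = 3.5)

# Named list of data frames, collected by name as in the question
lst1 <- mget(c("BTC.USD", "ETH.USD", "LTC.USD"), envir = environment())

# Rename each Close column to the coin name so the joined columns stay distinct
renamed <- Map(function(d, nm) {
  names(d)[names(d) == "Close"] <- nm
  d
}, lst1, names(lst1))

# Fold the list with a full outer join on Date
joined <- Reduce(function(a, b) merge(a, b, by = "Date", all = TRUE), renamed)
joined
```

Dates missing from one coin's series show up as NA in that coin's column.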
I have created an 'xts' object from a data frame; the data frame was loaded from a 'csv' file.
The 'xts' object looks like this:
entitycode,usage
2016-01-01 1,16521
2016-01-01 2,6589
2016-01-02 1,16540
2016-01-02 2,6687
2016-01-03 1,16269
2016-01-03 2,6642
There are a total of 1462 records in it, 731 for each of the entitycodes 1 and 2, from 01/01/2016 through to 31/12/2017 with a frequency of 1 day.
Entitycodes 1 and 2 refer to different regions, say 'region1' and 'region2'.
Is there a way to create separate 'xts' objects (variables) for entitycodes 1 and 2 (or 'region1' and 'region2'), each with 731 rows, with names like 'region1_xts' and 'region2_xts'?
Best regards
Deepak
I would recommend splitting the xts object, resulting in a list of xts objects:
split(xts, xts$entitycode)
#$`1`
# entitycode usage
#2016-01-01 1 16521
#2016-01-02 1 16540
#2016-01-03 1 16269
#
#$`2`
# entitycode usage
#2016-01-01 2 6589
#2016-01-02 2 6687
#2016-01-03 2 6642
You can then use functions of the *apply family to easily operate on the different list elements (i.e. the xts objects).
Sample data
df <- read.csv(text =
" date,entitycode,usage
2016-01-01, 1,16521
2016-01-01, 2,6589
2016-01-02, 1,16540
2016-01-02, 2,6687
2016-01-03, 1,16269
2016-01-03, 2,6642", header = T)
mat <- as.matrix(df[, -1])
rownames(mat) <- df[, 1]
colnames(mat) <- colnames(df)[-1]
library(xts)  # provides as.xts()
xts <- as.xts(mat)
I am trying to import an Excel spreadsheet into R (via read.xlsx2()). The Excel data has a date column. That date column contains mixed types of date formats, e.g. some rows are 42669 and some are in date format, e.g. 26/10/2016.
read.xlsx2() reads it in as a factor, so I converted it to a Date using the code below. This works for all the dates in numeric form (e.g. 42669), but R warns me that it introduced some NAs (for the ones in format 26/10/2016). My question is: how can I import the Excel data with proper dates for all the rows, i.e. tell R that the column contains mixed data?
library(xlsx)
#Import excel file
df <- read.xlsx2(mydata, 1, header=TRUE)
#Output = recd_date : Factor w/ 590 levels "", "26/10/2016", "42669" ...
levels(df$recd_date)
#Output = [1] "" "26/10/2016" "42669" ...
#This works for numeric dates:
df$recd_date <- as.Date( as.numeric (as.character(df$recd_date) ),origin="1899-12-30")
#Output = recd_date : Date, format "2016-10-26" ...
#but it doesn't work for dd/mm/yyyy dates, R just replaces these with NA
Try convert_to_date from the janitor package, specifying the character-to-date function from the lubridate package that matches your date format:
library(janitor)
x <- c("26/10/2016", "42669")
convert_to_date(x, character_fun = lubridate::dmy)
#> [1] "2016-10-26" "2016-10-26"
Self-promotion disclaimer: I maintain this package. I'm adding this answer as this function was created to address this exact problem of a mix of Excel date numbers and formatted dates in the same variable.
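We could apply a function to clean the dates where necessary, basically like this:
cleanDate <- function(x) {
  # Convert element-wise: strings shorter than 10 characters are treated as
  # numeric day counts, anything else as a "YYYY-MM-DD" date string.
  # The origin "1970-01-01" matches the test data below (R date numbers);
  # Excel serial numbers would need origin="1899-12-30" instead.
  cd <- do.call(c,
                lapply(x, function(i)
                  if (nchar(i) < 10)
                    as.Date(as.numeric(i), origin="1970-01-01")
                  else as.Date(i)))
  return(cd)
}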
Example
# generate test df
df1 <- data.frame(date.chr=as.character(as.Date(1:3, origin=Sys.Date())),
date.num=as.numeric(as.Date(1:3, origin=Sys.Date())),
date.mix=as.character(as.Date(1:3, origin=Sys.Date())),
stringsAsFactors=FALSE)
df1[2, 3] <- as.character(as.numeric(as.Date(df1[2, 1])))
> df1
date.chr date.num date.mix
1 2019-02-01 17928 2019-02-01
2 2019-02-02 17929 17929
3 2019-02-03 17930 2019-02-03
# write it to working directory
library(xlsx)
write.xlsx2(df1, "df1.xlsx")
# read it
# we use opt. `stringsAsFactors=FALSE` to prevent generation of factors
df2 <- read.xlsx2("df1.xlsx", 1, stringsAsFactors=FALSE)
> df2
X. date.chr date.num date.mix
1 1 2019-02-01 17928 2019-02-01
2 2 2019-02-02 17929 17929
3 3 2019-02-03 17930 2019-02-03
Now we apply the function using lapply().
date.cols <- c("date.chr", "date.num", "date.mix") # select date columns
df2[date.cols] <- lapply(df2[date.cols], cleanDate)
Result
> df2
X. date.chr date.num date.mix
1 1 2019-02-01 2019-02-01 2019-02-01
2 2 2019-02-02 2019-02-02 2019-02-02
3 3 2019-02-03 2019-02-03 2019-02-03
Here is a way to do this.
Once we read in the data, we convert the date column (df$recd_date) to class character and then parse it twice: once as dd/mm/YYYY dates and once as numeric dates. Each parse yields NA where the other format applies, so we then merge the two results to get the final product.
#Test Data, read in anyway you want
data<-c("26/10/2016","27/10/2016","42669","52673","28/10/2016")
Index<-c(1:5)
df<-data.frame(Index, date=data)
#Put entire date column into character format
df$date<-as.character(df$date)
#Create Date from Numeric Date, Create Date from Character Date
Date_N<-as.Date(as.numeric(df$date),origin="1899-12-30")
Date_C<-as.Date(as.character(df$date),format="%d/%m/%Y")
#Create DF from list
Date_N_df<-as.data.frame(Date_N)
Date_C_df<-as.data.frame(Date_C)
#Replace NA from Date_C_df with index from Date_N_df
Date_C_df[is.na(Date_C_df)] <- Date_N_df[is.na(Date_C_df)]
Final<-Date_C_df
names(Final)<-"Date"
> Final
Date
1 2016-10-26
2 2016-10-27
3 2016-10-26
4 2044-03-17
5 2016-10-28
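The same two-pass idea can be sketched as one vectorised base R helper. The function name parse_mixed_dates is made up for illustration, and the sketch assumes that the text dates are dd/mm/YYYY and that purely numeric entries are Excel serial numbers (origin 1899-12-30), as in the question.

```r
# Hypothetical helper: parse a column mixing dd/mm/YYYY strings and
# Excel serial numbers stored as text.
parse_mixed_dates <- function(x) {
  x <- as.character(x)
  # First pass: text dates; anything else (serials, blanks) becomes NA
  out <- as.Date(x, format = "%d/%m/%Y")
  # Second pass: fill in entries that are purely digits, treated as Excel serials
  is_serial <- is.na(out) & grepl("^[0-9]+$", x)
  out[is_serial] <- as.Date(as.numeric(x[is_serial]), origin = "1899-12-30")
  out
}

res <- parse_mixed_dates(c("26/10/2016", "42669", ""))
res
```

Only the digit-only entries are passed to as.numeric(), so no coercion warnings are raised, and blank cells stay NA.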
Here is my time period range:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
df = as.data.frame(seq(from = start_day, to = end_day, by = 'day'))
colnames(df) = 'date'
I need to create 10,000 data.frames, each with different fake years of 365 days. This means that each of the 10,000 data.frames needs to have a different start and end of year.
In total df has 14,965 days which, divided by 365 days, gives 41 years. In other words, df needs to be grouped 10,000 times, each time differently, into 41 years (of 365 days each).
The start of each year has to be random, so it can be 1974-10-03, 1974-08-30, 1976-01-03, etc., and the remaining dates at the end of df need to be recycled from the start.
The grouped fake years need to appear in a 3rd column of the data.frames.
I would put all the data.frames into a list, but I don't know how to create the function which generates 10,000 different year start dates and subsequently groups each data.frame with a 365-day window 41 times.
Can anyone help me?
@gringer gave a good answer, but it solved only 90% of the problem:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
What I need is 10,000 columns with 14,965 rows made by dates taken from df which need to be eventually recycled when reaching the end of df.
I tried to change length.out = 14965 but R does not recycle the dates.
Another option could be to change length.out = 1 and eventually add the remaining df rows for each column by maintaining the same order:
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=1, by="day"),
simplify=FALSE))
colnames(dates.df) <- 1:10000
How can I add the remaining df rows to each col?
The seq method also works if the to argument is unspecified, so it can be used to generate a specific number of days starting at a particular date:
> seq(from=df$date[20], length.out=10, by="day")
[1] "1974-01-20" "1974-01-21" "1974-01-22" "1974-01-23" "1974-01-24"
[6] "1974-01-25" "1974-01-26" "1974-01-27" "1974-01-28" "1974-01-29"
When used in combination with replicate and sample, I think this will give what you want in a list:
> replicate(2,seq(sample(df$date, 1), length.out=10, by="day"), simplify=FALSE)
[[1]]
[1] "1985-07-24" "1985-07-25" "1985-07-26" "1985-07-27" "1985-07-28"
[6] "1985-07-29" "1985-07-30" "1985-07-31" "1985-08-01" "1985-08-02"
[[2]]
[1] "2012-10-13" "2012-10-14" "2012-10-15" "2012-10-16" "2012-10-17"
[6] "2012-10-18" "2012-10-19" "2012-10-20" "2012-10-21" "2012-10-22"
Without the simplify=FALSE argument, it produces an array of integers (i.e. R's internal representation of dates), which is a bit trickier to convert back to dates. A slightly more convoluted way to do this and still produce Date output is to use data.frame on the unsimplified replicate result. Here's an example that will produce a 10,000-column data frame with 365 dates in each column (takes about 5s to generate on my computer):
dates.df <- data.frame(replicate(10000, seq(sample(df$date, 1),
length.out=365, by="day"),
simplify=FALSE));
colnames(dates.df) <- 1:10000;
> dates.df[1:5,1:5];
1 2 3 4 5
1 1988-09-06 1996-05-30 1987-07-09 1974-01-15 1992-03-07
2 1988-09-07 1996-05-31 1987-07-10 1974-01-16 1992-03-08
3 1988-09-08 1996-06-01 1987-07-11 1974-01-17 1992-03-09
4 1988-09-09 1996-06-02 1987-07-12 1974-01-18 1992-03-10
5 1988-09-10 1996-06-03 1987-07-13 1974-01-19 1992-03-11
To get the date wraparound working, a slight modification can be made to the original data frame, pasting a copy of itself on the end:
df <- as.data.frame(c(seq(from = start_day, to = end_day, by = 'day'),
seq(from = start_day, to = end_day, by = 'day')));
colnames(df) <- "date";
This is easier to code for downstream; the alternative being a double seq for each result column with additional calculations for the start/end and if statements to deal with boundary cases.
Now, instead of doing date arithmetic, the result columns are subsets of the original data frame (where the arithmetic is already done): each column starts at one date in the first half of the frame and takes the next 14965 values. I'm using nrow(df)/2 rather than a hard-coded 14965 to keep the code more generic:
dates.df <-
as.data.frame(lapply(sample.int(nrow(df)/2, 10000),
function(startPos){
df$date[startPos:(startPos+nrow(df)/2-1)];
}));
colnames(dates.df) <- 1:10000;
>dates.df[c(1:5,(nrow(dates.df)-5):nrow(dates.df)),1:5];
1 2 3 4 5
1 1988-10-21 1999-10-18 2009-04-06 2009-01-08 1988-12-28
2 1988-10-22 1999-10-19 2009-04-07 2009-01-09 1988-12-29
3 1988-10-23 1999-10-20 2009-04-08 2009-01-10 1988-12-30
4 1988-10-24 1999-10-21 2009-04-09 2009-01-11 1988-12-31
5 1988-10-25 1999-10-22 2009-04-10 2009-01-12 1989-01-01
14960 1988-10-15 1999-10-12 2009-03-31 2009-01-02 1988-12-22
14961 1988-10-16 1999-10-13 2009-04-01 2009-01-03 1988-12-23
14962 1988-10-17 1999-10-14 2009-04-02 2009-01-04 1988-12-24
14963 1988-10-18 1999-10-15 2009-04-03 2009-01-05 1988-12-25
14964 1988-10-19 1999-10-16 2009-04-04 2009-01-06 1988-12-26
14965 1988-10-20 1999-10-17 2009-04-05 2009-01-07 1988-12-27
This takes a bit less time now, presumably because the date values have been pre-calculated.
Try this one, using subsetting instead:
start_day = as.Date('1974-01-01', format = '%Y-%m-%d')
end_day = as.Date('2014-12-21', format = '%Y-%m-%d')
date_vec <- seq.Date(from=start_day, to=end_day, by="day")
Now, I create a vector long enough so that I can use easy subsetting later on:
date_vec2 <- rep(date_vec,2)
Now, create the random start dates for 100 instances (replace this with 10000 for your application):
random_starts <- sample(1:14965, 100)
Now, create a list of dates by simply subsetting date_vec2 with your desired length:
dates <- lapply(random_starts, function(x) date_vec2[x:(x+14964)])
date_df <- data.frame(dates)
names(date_df) <- 1:100
date_df[1:5,1:5]
1 2 3 4 5
1 1997-05-05 2011-12-10 1978-11-11 1980-09-16 1989-07-24
2 1997-05-06 2011-12-11 1978-11-12 1980-09-17 1989-07-25
3 1997-05-07 2011-12-12 1978-11-13 1980-09-18 1989-07-26
4 1997-05-08 2011-12-13 1978-11-14 1980-09-19 1989-07-27
5 1997-05-09 2011-12-14 1978-11-15 1980-09-20 1989-07-28
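Neither answer adds the fake-year grouping column the question asks for, so here is a minimal sketch of one way to do it. The helper name make_fake_years and the column name fake_year are made up for illustration, and the sketch assumes the number of dates is an exact multiple of the year length (14965 = 41 x 365 here). It wraps around with modular indexing instead of duplicating the date vector, and uses 3 random starts rather than 10,000 to keep the example small.

```r
start_day <- as.Date("1974-01-01")
end_day   <- as.Date("2014-12-21")
date_vec  <- seq(start_day, end_day, by = "day")   # 14965 dates

# Hypothetical helper: one data frame of recycled dates plus a fake-year id
make_fake_years <- function(start_pos, dates, year_len = 365) {
  n <- length(dates)
  # Positions start_pos, start_pos + 1, ..., wrapping back to 1 after n
  idx <- ((start_pos - 1 + seq_len(n) - 1) %% n) + 1
  data.frame(date      = dates[idx],
             fake_year = rep(seq_len(n %/% year_len), each = year_len))
}

set.seed(42)
starts    <- sample.int(length(date_vec), 3)   # use 10000 for the real task
fake_list <- lapply(starts, make_fake_years, dates = date_vec)
head(fake_list[[1]])
```

Each list element then holds every date exactly once, starting at a random position, with fake_year running from 1 to 41 in blocks of 365 rows.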
In R I have data
USER BIRTH
11 "2013-01-11 22:31:11"
121 "2014-12-26 04:07:35"
...
I want to create a new data set data_new that contains all USERs whose BIRTH time is between 10 o'clock and 11 o'clock.
The types of USER and BIRTH are strings/characters. I tried this:
data_new= data$BIRTH > as.POSIXct("10:00:00", format="%H:%M:%S")
& data$BIRTH < as.POSIXct("11:00:00", format="%H:%M:%S")
but here R gives me FALSE for all entries, so this doesn't work.
How can I solve this?
Update
Say I want to find the number of users for all hours. I use the answer and try this:
u=c()
for(j in 1:24) {
data_new=data[times > "00:00:00"+(j-1) & times < "01:00:00"+j ,]
#saving the number of users in vector u
u[j]=dim(data_new)[1]
}
but R can't figure out the term "00:00:00"+(j-1).
If df is your data frame:
df <- read.table(text = 'USER BIRTH
11 "2013-01-11 22:31:11"
121 "2014-12-26 04:07:35"
121 "2014-12-26 10:07:35"
121 "2014-12-26 11:07:35"
121 "2014-12-26 10:38:35"', header = T)
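library(lubridate)  # for ymd_hms()
df$BIRTH <- ymd_hms(df$BIRTH)
# format() keeps the UTC time zone set by ymd_hms(); strftime() would convert to local time
times <- format(df$BIRTH, format = "%H:%M:%S")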
df[times > "10:00:00" & times < "11:00:00",]
Output:
USER BIRTH
3 121 2014-12-26 10:07:35
5 121 2014-12-26 10:38:35
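For the per-hour counts asked about in the update, a loop over time strings is not needed: the hour can be extracted as a number and tabulated. This is a base R sketch using the example timestamps above. The fixed "UTC" time zone is an assumption made so the result does not depend on the local machine, and the name hour_of_day is made up. Using a factor with levels 0:23 keeps hours with no users in the result as zeros.

```r
birth <- as.POSIXct(c("2013-01-11 22:31:11", "2014-12-26 04:07:35",
                      "2014-12-26 10:07:35", "2014-12-26 11:07:35",
                      "2014-12-26 10:38:35"), tz = "UTC")

# Hour of day as an integer 0..23
hour_of_day <- as.integer(format(birth, "%H"))

# Count per hour, keeping hours without any users as 0
u <- table(factor(hour_of_day, levels = 0:23))
u
```

Here u[["10"]] gives the number of users between 10:00:00 and 10:59:59.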
One way to do something to each subset of your data is to use the split-lapply paradigm. In this case, you would convert data$BIRTH to POSIXlt and split by the hour component of the POSIXlt object. That will give you a list where each list element contains all the data for a specific hour.
data <- read.csv(text = "USER,BIRTH
11,2013-01-11 22:31:11
12,2014-12-26 04:07:35
21,2014-12-26 10:07:35
121,2014-12-26 11:07:35
112,2014-12-26 10:38:35")
data_by_hour <- split(data, as.POSIXlt(data$BIRTH)$hour)
Then you can use lapply (or sapply) to do whatever you want to each of those subsets. To count the number of observations per hour:
# number of observations for each hour
sapply(data_by_hour, nrow)
4 10 11 22
1 2 1 1
You can also do this with xts.
library(xts)
# Create xts object from 'data' data.frame
# Note: xts objects are based on a matrix, so you cannot have columns with
# mixed types like you can with a data.frame.
x <- xts(data["USER"], as.POSIXct(data$BIRTH))
period.apply(x, endpoints(x, "hours"), nrow)
# USER
# 2013-01-11 22:31:11 1
# 2014-12-26 04:07:35 1
# 2014-12-26 10:38:35 2
# 2014-12-26 11:07:35 1
Note that you can do time-of-day subsetting with xts. It avoids potential locale-related collation order issues caused by using logical operators on character strings.
x["T10:00/T11:00"]
# USER
# 2014-12-26 10:07:35 21
# 2014-12-26 10:38:35 112
I have two data frames. One containing time periods marked with character unique IDs and another containing events with another set of unique IDs associated with them
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UDI_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF in order to match the ID of the event to the ID of the period. Essentially, the result table I want to see should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF into intervals and then using some form of apply with an instant %within% interval combination, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- does not have to be solved with lubridate, in fact a solution with the base R will be even more welcome.
I actually managed to come up with code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base R, or simply a better way than the one suggested below, sharing it will be greatly appreciated!
First off, the start and end times in the period DF should be converted to lubridate intervals:
library(lubridate)
intervalsP <- interval(periodStart, periodEnd)  # one interval per start/end pair
Step 2: A function should be created for checking whether an instant is located within a list of intervals. The only reason I have created a separate function is to be able to use it with apply:
PeriodAssign <- function(x, y){
# x - instants
# y - intervals
variable1 <- mapply(`%within%`, x, y)
if (length(y[variable1]) != 0) {
as.character(y[variable1])
} else {
NA
}
}
NOTE: I had to coerce the intervals to character because otherwise the apply function coerced them to their length in seconds, which is not really useful for matching purposes, i.e. all four intervals in this example have the same length.
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched Periods and Events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The eventual DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03
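Since a base R solution was asked for, here is a minimal sketch without lubridate or explicit loops. It rebuilds the two data frames from the question; the fixed "UTC" time zone and the names inside, idx and result are assumptions made for illustration. It relies on the stated condition that periods never overlap, so each event falls into at most one period.

```r
periodDF <- data.frame(
  periodID    = c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03"),
  periodStart = as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
                             "2016/02/12 19:00", "2016/02/13 19:00"), tz = "UTC"),
  periodEnd   = as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
                             "2016/02/12 21:00", "2016/02/13 21:00"), tz = "UTC"),
  stringsAsFactors = FALSE)
eventDF <- data.frame(
  eventID   = c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03"),
  eventTime = as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
                           "2016/02/11 20:22:01", "2016/02/15 19:00:01"), tz = "UTC"),
  stringsAsFactors = FALSE)

ev <- as.numeric(eventDF$eventTime)
# Events x periods logical matrix: TRUE where the event lies inside the period
inside <- outer(ev, as.numeric(periodDF$periodStart), ">=") &
          outer(ev, as.numeric(periodDF$periodEnd), "<=")

# At most one TRUE per row, so this product is the matching period index (0 = none)
idx <- drop((inside + 0) %*% seq_len(ncol(inside)))
idx[idx == 0] <- NA
eventDF$periodID <- periodDF$periodID[idx]

# Full outer join keeps unmatched events and unmatched periods
result <- merge(eventDF[c("eventID", "periodID")], periodDF["periodID"],
                by = "periodID", all = TRUE)
result
```

The outer() comparison builds an events-by-periods matrix, so memory grows with the product of the two row counts; for very large inputs a sorted lookup such as findInterval() may be a better fit.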