I have created an 'xts' object from a data frame that was loaded from a CSV file.
The 'xts' object looks like this:
entitycode,usage
2016-01-01 1,16521
2016-01-01 2,6589
2016-01-02 1,16540
2016-01-02 2,6687
2016-01-03 1,16269
2016-01-03 2,6642
There are a total of 1462 records in it - 731 for each of the entitycodes 1 and 2, from 01/01/2016 through 31/12/2017, with a frequency of one day.
Entitycode 1 & 2 refer to different regions say 'region1' and 'region2'.
Is there a way to create separate 'xts' objects (variables) for entitycodes 1 & 2 (or 'region1' and 'region2'), each with 731 rows, with names like 'region1_xts' and 'region2_xts'?
Best regards
Deepak
I would recommend splitting the xts object, which results in a list of xts objects:
split(xts, xts$entitycode)
#$`1`
# entitycode usage
#2016-01-01 1 16521
#2016-01-02 1 16540
#2016-01-03 1 16269
#
#$`2`
# entitycode usage
#2016-01-01 2 6589
#2016-01-02 2 6687
#2016-01-03 2 6642
You can then use functions of the *apply family to easily operate on the different list elements (i.e. the xts objects).
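For example, here is a minimal sketch (assuming the split list from above) that computes the mean usage per entitycode:
lst <- split(xts, xts$entitycode)
# apply a summary function to each xts object in the list;
# coredata() extracts the plain numeric values from the usage column
sapply(lst, function(x) mean(coredata(x$usage)))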
Sample data
df <- read.csv(text =
"date,entitycode,usage
2016-01-01,1,16521
2016-01-01,2,6589
2016-01-02,1,16540
2016-01-02,2,6687
2016-01-03,1,16269
2016-01-03,2,6642", header = TRUE)
mat <- as.matrix(df[, -1])
rownames(mat) <- df[, 1]
colnames(mat) <- colnames(df)[-1]
library(xts)   # provides as.xts()
xts <- as.xts(mat)
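If you specifically want standalone variables named region1_xts and region2_xts, one way (a sketch building on the split() call above) is to name the list elements and copy them into the global environment:
lst <- split(xts, xts$entitycode)
names(lst) <- c("region1_xts", "region2_xts")
# create one variable per list element in the global environment
list2env(lst, envir = .GlobalEnv)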
I am working in R and trying to understand the best way to join data frames when one of them is very large.
I have a data frame which is not excruciatingly large but also not small (~80K observations of 8 variables, 144 MB). I need to match observations from this data frame to observations from another smaller data frame on the basis of a date range. Specifically, I have:
events.df <- data.frame(individual=c('A','B','C','A','B','C'),
                        event=c(1,1,1,2,2,2),
                        time=as.POSIXct(c('2014-01-01 08:00:00','2014-01-05 13:00:00','2014-01-10 07:00:00','2014-05-01 01:00:00','2014-06-01 12:00:00','2014-08-01 10:00:00'),format="%Y-%m-%d %H:%M:%S"))
trips.df <- data.frame(individual=c('A','B','C'),trip=c('x1A','CA1B','XX78'),
                       trip_start=as.POSIXct(c('2014-01-01 06:00:00','2014-01-04 03:00:00','2014-01-08 12:00:00'),format="%Y-%m-%d %H:%M:%S"),
                       trip_end=as.POSIXct(c('2014-01-03 06:00:00','2014-01-06 03:00:00','2014-01-11 12:00:00'),format="%Y-%m-%d %H:%M:%S"))
In my case events.df contains around 80,000 unique events and I am looking to match them to events from the trips.df data frame, which has around 200 unique trips. Each trip has a unique trip identifier ('trip'). I would like to match based on whether the event took place during the date range defining a trip.
First, I have tried fuzzy_inner_join from the fuzzyjoin library. It works great in principle:
library(fuzzyjoin)
fuzzy_inner_join(events.df,trips.df,by=c('individual'='individual','time'='trip_start','time'='trip_end'),match_fun=list(`==`,`>=`,`<=`))
individual.x event time individual.y trip trip_start trip_end
1 A 1 2014-01-01 08:00:00 A x1A 2014-01-01 06:00:00 2014-01-03 06:00:00
2 B 1 2014-01-05 13:00:00 B CA1B 2014-01-04 03:00:00 2014-01-06 03:00:00
3 C 1 2014-01-10 07:00:00 C XX78 2014-01-08 12:00:00 2014-01-11 12:00:00
but runs out of memory when I try to apply it to the larger data frames.
Here is a second solution I cobbled together:
trip.match <- function(tripid){
  individual <- trips.df$individual[trips.df$trip==tripid]
  start <- trips.df$trip_start[trips.df$trip==tripid]
  end <- trips.df$trip_end[trips.df$trip==tripid]
  tmp <- events.df[events.df$individual==individual &
                     events.df$time >= start &
                     events.df$time <= end,]
  tmp$trip <- tripid
  return(tmp)
}
library(data.table)   # for rbindlist()
result <- data.frame(rbindlist(lapply(unique(trips.df$trip),trip.match)))
This solution also breaks down because the list object returned by lapply is 25GB and the attempt to cast this list to a data frame also exhausts the available memory.
I have been able to do what I need to do using a for loop. Basically, I append a column onto events.df and loop through the unique trip identifiers and populate the new column in events.df accordingly:
events.df$trip <- NA
for(i in unique(trips.df$trip)){
individual <- trips.df$individual[trips.df$trip==i]
start <- min(trips.df$trip_start[trips.df$trip==i])
end <- max(trips.df$trip_end[trips.df$trip==i])
events.df$trip[events.df$individual==individual & events.df$time >= start & events.df$time <= end] <- i
}
> events.df
individual event time trip
1 A 1 2014-01-01 08:00:00 x1A
2 B 1 2014-01-05 13:00:00 CA1B
3 C 1 2014-01-10 07:00:00 XX78
4 A 2 2014-05-01 01:00:00 <NA>
5 B 2 2014-06-01 12:00:00 <NA>
6 C 2 2014-08-01 10:00:00 <NA>
My question is this: I'm not a very advanced R programmer, so I expect there is a more memory-efficient way to accomplish what I'm trying to do. Is there?
Try creating a table that expands the trip ranges by hour and then merge it with the events. Here is an example (using data.table, which outperforms data.frame for larger datasets):
library('data.table')
tripsV <- unique(trips.df$trip)
tripExpand <- function(t){
dateV <- seq(trips.df$trip_start[trips.df$trip == t],
trips.df$trip_end[trips.df$trip == t],
by = 'hour')
data.table(trip = t, time = dateV)
}
trips.dt <- rbindlist(
  lapply(tripsV, tripExpand)
)
merge(events.df,
trips.dt,
by = 'time')
Output:
time individual event trip
1 2014-01-01 08:00:00 A 1 x1A
2 2014-01-05 13:00:00 B 1 CA1B
3 2014-01-10 07:00:00 C 1 XX78
So you are basically translating the trip table into a trip-hour long-form panel dataset, which makes for easy merging with the event dataset. I haven't benchmarked it against your current method, but my hunch is that it will be more memory- and CPU-efficient.
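An alternative worth mentioning (a sketch, not benchmarked here) is a data.table non-equi join, which matches each event to the trip whose interval contains it without expanding trips into hourly rows:
library(data.table)
ev <- as.data.table(events.df)
tr <- as.data.table(trips.df)
# for each event, find trips of the same individual whose
# [trip_start, trip_end] interval contains the event time
tr[ev,
   on = .(individual, trip_start <= time, trip_end >= time),
   .(individual, event, time = i.time, trip),
   nomatch = NULL]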
Consider splitting your data with data.table's split method, running fuzzy_inner_join on each subset, and then calling rbindlist to bind all data frame elements together into a single output.
library(data.table)
library(fuzzyjoin)
df_list <- split(as.data.table(events.df), by="individual")
fuzzy_list <- lapply(df_list, function(sub.df) {
fuzzy_inner_join(sub.df, trips.df,
by = c('individual'='individual', 'time'='trip_start', 'time'='trip_end'),
match_fun = list(`==`,`>=`,`<=`)
)
})
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(df_list); gc()
final_df <- rbindlist(fuzzy_list)
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(fuzzy_list); gc()
I am trying to import an Excel spreadsheet into R (via read.xlsx2()). The Excel data has a date column that contains mixed date formats: some rows are numbers, e.g. 42669, and some are in date format, e.g. 26/10/2016.
read.xlsx2() reads it in as a factor, so I converted it to as.Date using the code below. This works for all the dates in numeric form (e.g. 42669), but R warns me that it added some NAs (for the ones in the format 26/10/2016). My question is: how can I import the Excel data with proper dates for the whole variable, i.e. tell R that there is mixed data?
library(xlsx)
#Import excel file
df <- read.xlsx2(mydata, 1, header=TRUE)
#Output = recd_date : Factor w/ 590 levels "", "26/10/2016", "42669" ...
levels(df$recd_date)
#Output = [1] "" "26/10/2016" "42669" ...
#This works for numeric dates:
df$recd_date <- as.Date( as.numeric (as.character(df$recd_date) ),origin="1899-12-30")
#Output = recd_date : Date, format "2016-10-26" ...
#but it doesn't work for dd/mm/yyyy dates, R just replaces these with NA
Try convert_to_date from the janitor package, specifying the character-to-date function from the lubridate package that matches your date format:
library(janitor)
x <- c("26/10/2016", "42669")
convert_to_date(x, character_fun = lubridate::dmy)
#> [1] "2016-10-26" "2016-10-26"
Self-promotion disclaimer: I maintain this package. I'm adding this answer as this function was created to address this exact problem of a mix of Excel date numbers and formatted dates in the same variable.
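Applied to the question's column, a sketch (assuming recd_date is still a factor after read.xlsx2) would be:
library(janitor)
df$recd_date <- convert_to_date(as.character(df$recd_date),
                                character_fun = lubridate::dmy)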
We could apply a function to clean the dates if necessary, basically like this:
cleanDate <- function(x) {
  if (all(nchar(x) < 10)) {
    # every value is a numeric day count (days since the R epoch)
    cd <- as.Date(as.numeric(x), origin="1970-01-01")
  } else {
    cd <- do.call(c,
                  lapply(x, function(i)
                    if (nchar(i) < 10)
                      as.Date(as.numeric(i), origin="1970-01-01")
                    else as.Date(i)))
  }
  return(cd)
}
Example
# generate test df
df1 <- data.frame(date.chr=as.character(as.Date(1:3, origin=Sys.Date())),
date.num=as.numeric(as.Date(1:3, origin=Sys.Date())),
date.mix=as.character(as.Date(1:3, origin=Sys.Date())),
stringsAsFactors=FALSE)
df1[2, 3] <- as.character(as.numeric(as.Date(df1[2, 1])))
> df1
date.chr date.num date.mix
1 2019-02-01 17928 2019-02-01
2 2019-02-02 17929 17929
3 2019-02-03 17930 2019-02-03
# write it to working directory
library(xlsx)
write.xlsx2(df1, "df1.xlsx")
# read it
# we use the option stringsAsFactors=FALSE to prevent generation of factors
df2 <- read.xlsx2("df1.xlsx", 1, stringsAsFactors=FALSE)
> df2
X. date.chr date.num date.mix
1 1 2019-02-01 17928 2019-02-01
2 2 2019-02-02 17929 17929
3 3 2019-02-03 17930 2019-02-03
Now we apply the function using lapply().
date.cols <- c("date.chr", "date.num", "date.mix") # select date columns
df2[date.cols] <- lapply(df2[date.cols], cleanDate)
Result
> df2
X. date.chr date.num date.mix
1 1 2019-02-01 2019-02-01 2019-02-01
2 2 2019-02-02 2019-02-02 2019-02-02
3 3 2019-02-03 2019-02-03 2019-02-03
Here is a way to do this:
Once we read in the data, we convert the date column (df$recd_date) to character and then create two vectors: one from the dd/mm/YYYY dates and one from the numeric dates. We independently convert each to the Date class and then merge the two to get the final product.
#Test Data, read in anyway you want
data<-c("26/10/2016","27/10/2016","42669","52673","28/10/2016")
Index<-c(1:5)
df<-data.frame(Index, date=data)
#Put entire date column into character format
df$date<-as.character(df$date)
#Create Date from Numeric Date, Create Date from Character Date
Date_N<-as.Date(as.numeric(df$date),origin="1899-12-30")
Date_C<-as.Date(as.character(df$date),format="%d/%m/%Y")
#Create data frames from the two vectors
Date_N_df<-as.data.frame(Date_N)
Date_C_df<-as.data.frame(Date_C)
#Replace NAs in Date_C_df with the corresponding values from Date_N_df
Date_C_df[is.na(Date_C_df)] <- Date_N_df[is.na(Date_C_df)]
Final<-Date_C_df
names(Final)<-"Date"
> Final
Date
1 2016-10-26
2 2016-10-27
3 2016-10-26
4 2044-03-17
5 2016-10-28
I have a problem joining time-series data frames with a map function. I have 25 data frames with cryptocurrency time series data.
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a data frame representing one cryptocurrency expressed in USD. And every data frame has 2 columns: Date and Close (closing price). For example, the data frame "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to join them all into one dataframe by Date with a map-function:
lst1 <- mget(ls(pattern = "USD"))
df <- map(.x = lst1,.f = full_join(by="Date"))
But it doesn't work:
Error in UseMethod("full_join") :
no applicable method for 'full_join' applied to an object of class "character"
Can somebody help me?
The result of mget is a list of characters; that's why full_join fails with this error.
Try this:
map(lst1, function(x) {full_join(tibble(x), head(BTC.USD), by="Date")}) # the full join might fail because lst1 has no column called Date
Also, in the lst1 that results from mget, there is no column called Date.
Creating a lst1 tibble with a Date column:
DateVec=c("2015-12-31")
map(lst1, function(x) {full_join(tibble(x,Date=DateVec),head(BTC.USD),by="Date")})
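For the underlying goal of joining all 25 data frames by Date, a common pattern is purrr::reduce() rather than map(); here is a sketch, assuming every element of lst1 is a tibble with Date and Close columns:
library(dplyr)
library(purrr)
lst1 <- mget(ls(pattern = "USD"))
# rename each Close column after its currency so the joined
# columns stay distinguishable, then full_join the list pairwise
lst1 <- imap(lst1, ~ rename(.x, !!.y := Close))
df <- reduce(lst1, full_join, by = "Date")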
Consider this dataset:
mydf <- data.frame(churn_indicator = c(0,0,1,0,1),
resign_date = c(NA,NA,"2011-01-01",NA,"2012-02-01"),
join_date = c("2001-01-01","2001-03-01","2002-04-02",
"2003-09-01","2005-05-10"))
The task is to calculate a vector 'length' which is resign_date - join_date for churn_indicator=1 and Sys.Date() - join_date for churn_indicator=0.
I have already figured out how to do this using a for loop, but I want to use something more efficient (the apply family, maybe). Also, is it possible to do this using dplyr's mutate function?
A possible solution:
# convert column from factor/characters to Date (if not already done)
mydf$resign_date <- as.Date(mydf$resign_date)
mydf$join_date <- as.Date(mydf$join_date)
# compute the date differences
days_churn1 <- as.numeric(difftime(mydf$resign_date,mydf$join_date,units='days'))
days_churn0 <- as.numeric(difftime(Sys.Date(),mydf$join_date,units='days'))
# set to zero the values where churn indicator is not what we want
days_churn1[mydf$churn_indicator==0]<-0
days_churn0[mydf$churn_indicator==1]<-0
# sum the two vectors
mydf$length <- days_churn1+days_churn0
> mydf
churn_indicator resign_date join_date length
1 0 <NA> 2001-01-01 5997
2 0 <NA> 2001-03-01 5938
3 1 2011-01-01 2002-04-02 3196
4 0 <NA> 2003-09-01 5024
5 1 2012-02-01 2005-05-10 2458
Alternatively, you can combine some operations using ifelse:
# convert column from factor/characters to Date (if not already done)
mydf$resign_date <- as.Date(mydf$resign_date)
mydf$join_date <- as.Date(mydf$join_date)
mydf$length <-
as.numeric(
ifelse(mydf$churn_indicator==1,
difftime(mydf$resign_date,mydf$join_date,units='days'),
difftime(Sys.Date(),mydf$join_date,units='days')
))
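Since the question also asks about dplyr, here is a sketch of the same logic with mutate() (assuming the date columns may still need conversion):
library(dplyr)
mydf %>%
  mutate(resign_date = as.Date(resign_date),
         join_date = as.Date(join_date),
         # if_else() requires both branches to have the same type,
         # hence the as.numeric() wrappers around the difftimes
         length = if_else(churn_indicator == 1,
                          as.numeric(resign_date - join_date),
                          as.numeric(Sys.Date() - join_date)))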
In R I have data
USER BIRTH
11 "2013-01-11 22:31:11"
121 "2014-12-26 04:07:35"
...
I want to create a new data set data_new that contains all USERs with a BIRTH time between 10 o'clock and 11 o'clock.
The types of USER and BIRTH are strings/characters. I tried this:
data_new= data$BIRTH > as.POSIXct("10:00:00", format="%H:%M:%S")
& data$BIRTH < as.POSIXct("11:00:00", format="%H:%M:%S")
but here R gives me FALSE for all entries, so this doesn't work.
How can I solve this?
Update
Say I want to find the number of users for every hour of the day. I used the answer and tried this:
u=c()
for(j in 1:24) {
data_new=data[times > "00:00:00"+(j-1) & times < "01:00:00"+j ,]
#saving the number of users in vector u
u[j]=dim(data_new)[1]
}
but R can't figure out the term "00:00:00"+(j-1).
If df is your data frame:
df <- read.table(text = 'USER BIRTH
11 "2013-01-11 22:31:11"
121 "2014-12-26 04:07:35"
121 "2014-12-26 10:07:35"
121 "2014-12-26 11:07:35"
121 "2014-12-26 10:38:35"', header = T)
library(lubridate)   # for ymd_hms()
df$BIRTH <- ymd_hms(df$BIRTH)
times <- strftime(df$BIRTH, format = "%H:%M:%S")
df[times > "10:00:00" & times < "11:00:00",]
Output:
USER BIRTH
3 121 2014-12-26 10:07:35
5 121 2014-12-26 10:38:35
One way to do something to each subset of your data is to use the split-lapply paradigm. In this case, you would convert data$BIRTH to POSIXlt and split by the hour component of the POSIXlt object. That will give you a list where each list element contains all the data for a specific hour.
data <- read.csv(text = "USER,BIRTH
11,2013-01-11 22:31:11
12,2014-12-26 04:07:35
21,2014-12-26 10:07:35
121,2014-12-26 11:07:35
112,2014-12-26 10:38:35")
data_by_hour <- split(data, as.POSIXlt(data$BIRTH)$hour)
Then you can use lapply (or sapply) to do whatever you want to each of those subsets. To count the number of observations per hour:
# number of observations for each hour
sapply(data_by_hour, nrow)
4 10 11 22
1 2 1 1
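If you need a count for every hour of the day, including hours with no events (as in the question's update), one sketch is to tabulate over a factor with all 24 levels:
# split() drops hours that never occur; a factor with explicit
# levels 0:23 keeps them as zero counts
hours <- factor(as.POSIXlt(data$BIRTH)$hour, levels = 0:23)
table(hours)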
You can also do this with xts.
library(xts)
# Create xts object from 'data' data.frame
# Note: xts objects are based on a matrix, so you cannot have columns with
# mixed types like you can with a data.frame.
x <- xts(data["USER"], as.POSIXct(data$BIRTH))
period.apply(x, endpoints(x, "hours"), nrow)
# USER
# 2013-01-11 22:31:11 1
# 2014-12-26 04:07:35 1
# 2014-12-26 10:38:35 2
# 2014-12-26 11:07:35 1
Note that you can do time-of-day subsetting with xts. It avoids potential locale-related collation order issues caused by using logical operators on character strings.
x["T10:00/T11:00"]
# USER
# 2014-12-26 10:07:35 21
# 2014-12-26 10:38:35 112