How to merge two data frames with non-overlapping dates? - r

I have a data set with the following variables:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken (288 intervals per day)
The main data set:
> head(activityData, 3)
steps date interval
1 1.7169811 2012-10-01 0
2 0.3396226 2012-10-01 5
3 0.1320755 2012-10-01 10
> str(activityData)
'data.frame': 17568 obs. of 3 variables:
$ steps : num 1.717 0.3396 0.1321 0.1509 0.0755 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: num 0 5 10 15 20 25 30 35 40 45 ...
The data set has a range of two months.
I had to divide it into weekdays and weekend days. I did it with the following code:
> dataAs.xtsWeekday <- dataAs.xts[.indexwday(dataAs.xts) %in% 1:5]
> dataAs.xtsWeekend <- dataAs.xts[.indexwday(dataAs.xts) %in% c(0, 6)]
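(As an aside, the same weekday/weekend split can be made on the plain data.frame without xts; a minimal sketch, assuming an English locale so that weekdays() returns "Saturday"/"Sunday":)
activityData$daytype <- ifelse(weekdays(as.Date(activityData$date)) %in% c("Saturday", "Sunday"),
                               "weekend", "weekday")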
After doing this I had to make some calculations, which failed, so I decided to export the files and read them back in.
Once I had imported the data again, I made the calculations I wanted and then tried to merge the two data sets, without success.
First data set:
> head(weekdays, 3)
X steps date interval daytype
1 1 37.3826 2012-10-01 0 weekday
2 2 37.3826 2012-10-01 5 weekday
3 3 37.3826 2012-10-01 10 weekday
> str(weekdays)
'data.frame': 12960 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 37.4 37.4 37.4 37.4 37.4 ...
$ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekday" "weekday" "weekday" "weekday" ...
Second data set:
> head(weekend, 3)
X steps date interval daytype
1 1 0 2012-10-06 0 weekend
2 2 0 2012-10-06 5 weekend
3 3 0 2012-10-06 10 weekend
> str(weekend)
'data.frame': 4608 obs. of 5 variables:
$ X : int 1 2 3 4 5 6 7 8 9 10 ...
$ steps : num 0 0 0 0 0 0 0 0 0 0 ...
$ date : chr "2012-10-06" "2012-10-06" "2012-10-06" "2012-10-06" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
$ daytype : chr "weekend" "weekend" "weekend" "weekend" ...
Now I would like to merge the two data sets (weekdays, weekend) by date, but the problem is that they share no dates or any other common values.
The final data set should have 4 columns and 17568 observations.
The columns should be:
steps: Number of steps taken in a 5-minute interval
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
daytype: weekend day or normal weekday.
I tried with:
merge
join(plyr)
union
In every example I found, the two data sets shared a common ID or column, which is not the case for mine.
I also looked here, but I did not understand much, and at many other questions, but they had nothing in common with my data set.
The other option I thought of was to add a column called "ID" to the original data set and redo everything I have done so far; something I will have to do if I don't find a way around this problem.
I would like some advice on how to proceed or what to try next.

Since you mentioned that your final data set should have 17568 (=4608+12960) observations/rows, I assume you want to stack the two data.frames on top of each other (and possibly order them by date afterwards). This is done with rbind():
finaldata <- rbind(weekdays, weekend)
If you want to remove column X:
finaldata$X <- NULL
To convert your date column to actual dates:
finaldata$date <- as.Date(finaldata$date, format="%Y-%m-%d")
To order the whole data by date:
finaldata <- finaldata[order(finaldata$date),]
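For completeness, the same stack-and-sort can be written with dplyr (a sketch, assuming only that both data frames have the column layout shown above):
library(dplyr)
finaldata <- bind_rows(weekdays, weekend) %>% # stack the two data frames
  select(-X) %>%                              # drop the exported row-number column
  mutate(date = as.Date(date)) %>%            # "YYYY-MM-DD" is the as.Date() default
  arrange(date)                               # order chronologically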

Related

How do I convert factor variables into presence-absence data in r?

I've got a dataset with an abundance variable where the data is ordinal (0, 1-5, 6-10, 10+), but I need to convert it into presence/absence data (0 or 1). How do I go about doing this?
Here is the data:
'data.frame': 100 obs. of 3 variables:
$ date : Date, format: "2021-02-11" "2021-02-15" "2021-02-16" "2021-02-15" ...
$ abund : Factor w/ 4 levels "0","1-5","6-10",..: 4 1 3 3 4 1 4 2 1 3 ...
$ postcode: chr "EH12 7ET" "NW1 1HP" "TA21 0AS" "LE7 3SY" ...
your_data$abund <- ifelse(your_data$abund == "0", 0, 1)
or
your_data$abund <- as.numeric(your_data$abund!="0")
The latter works because as.numeric() converts FALSE to 0 and TRUE to 1.
or use transform(your_data, abund = ...) (base R), or your_data %>% mutate(across(abund, ~ 1 - as.numeric(. == "0"))) (dplyr), or ...
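For concreteness, a minimal self-contained illustration of the conversion (the sample values below are invented for demonstration):
# toy data with the same ordinal levels as in the question
your_data <- data.frame(abund = factor(c("0", "1-5", "6-10", "10+", "0"),
                                       levels = c("0", "1-5", "6-10", "10+")))
your_data$abund <- as.numeric(your_data$abund != "0") # 0 stays 0, everything else becomes 1
your_data$abund
# [1] 0 1 1 1 0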

Why is the R aggregation dropping data rows?

I have a data frame with 2 columns: date & observations. The data consists of multiple observations for each date.
str(observations)
tibble [2,599 × 2] (S3: tbl_df/tbl/data.frame)
$ date : chr [1:2599] "1/22/20" "1/22/20" "1/22/20" "1/22/20" ...
$ observation : num [1:2599] 0 0 0 0 0 0 0 0 0 0 ...
> tail(observations)
# A tibble: 6 x 2
date observation
<chr> <dbl>
1 5/13/20 4127
2 5/13/20 1042
3 5/13/20 14306
4 5/13/20 1066
5 5/13/20 0
6 5/13/20 89
I want to subtotal these observations to produce a single row for each date so I used this function:
subs <- aggregate(cbind(observation) ~ date, data = observations, FUN = sum, na.rm = TRUE)
But the output is missing any rows for the last 4 days of the original:
> tail(subs)
date observation
108 5/4/20 128269
109 5/5/20 130593
110 5/6/20 131890
111 5/7/20 133991
112 5/8/20 135840
113 5/9/20 137397
I apologize; on further investigation, it appears that the aggregate function returned the data out of order. I re-ordered the data frame and confirmed that all dates were accounted for.
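The cause is worth spelling out: date is a character column, so aggregate() orders the groups lexically, and "5/13/20" sorts before "5/4/20" (character "1" < "4"). Converting the column to Date restores chronological order; a sketch assuming the month/day/year format shown above:
subs$date <- as.Date(subs$date, format = "%m/%d/%y") # parse e.g. "5/13/20" as 2020-05-13
subs <- subs[order(subs$date), ]                     # now in true date order
tail(subs)                                           # ends at 2020-05-13 as expected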

loop for list element with datetime in r

I have a df named mistake. I split the mistake df by ID, so now I have over 300 different objects in the list.
library(dplyr)
df <- split.data.frame(mistake, mistake$ID)
Every list object has two different datetime stamps. First I need the minutes between these two datetime stamps. Then I duplicate the rows of the object according to the variable stay (this is the difftime between the start and end time too). Then I overwrite the test variable with the increment n_minutes.
library(lubridate)
start_date <- df[[1]]$datetime
end_date <- df[[1]]$gehtzeit
n_minutes <- interval(start_date,end_date)/minutes(1)
see <- start_date + minutes(0:n_minutes) # the diff time in minutes I need
df[[1]]$test <- Sys.time() # a new variable
df[[1]] <- data.frame(df[[1]][rep(seq_len(nrow(df[[1]])), df[[1]]$stay + 1), 1:17, drop = FALSE], row.names = NULL)
df[[1]]$test <- format(start_date + minutes(0:n_minutes), format = "%d.%m.%Y %H:%M:%S")
I want to do this with every object of the list, and then 'rbind' or 'unsplit' my list. I know I need a loop, but I don't know how to do this with the list elements.
Any help would be great!
Here is a small df example:
mistake
Baureihe Verbund Fahrzeug Code Codetext Subsystem Kommt.Zeit
71 411 ICE1166 93805411866-7 1A50 Querfederdruck 1 ungleich Sollwert Neigetechnik 29.07.2018 23:00:07
72 411 ICE1166 93805411866-7 1A50 Querfederdruck 1 ungleich Sollwert Neigetechnik 04.08.2018 11:16:41
Geht.Zeit Anstehdauer Jahr Monat KW Tag Wartung.geht datetime gehtzeit
71 29.07.2018 23:02:56 00 Std 02 Min 49 Sek 2018 7 KW30 29 0 2018-07-29 23:00:00 2018-07-29 23:02:00
72 04.08.2018 11:19:20 00 Std 02 Min 39 Sek 2018 8 KW31 4 0 2018-08-04 11:16:00 2018-08-04 11:19:00
bleiben ID
71 2 secs 2018-07-29 23:00:00 2018-07-29 23:02:00 1A50
72 3 secs 2018-08-04 11:16:00 2018-08-04 11:19:00 1A50
And here is the structure:
str(mistake)
'data.frame': 2 obs. of 18 variables:
$ Baureihe : int 411 411
$ Verbund : Factor w/ 1 level "ICE1166": 1 1
$ Fahrzeug : Factor w/ 7 levels "93805411066-4",..: 7 7
$ Code : Factor w/ 6 levels "1A07","1A0E",..: 3 3
$ Codetext : Factor w/ 6 levels "ITD Karte gestört",..: 5 5
$ Subsystem : Factor w/ 1 level "Neigetechnik": 1 1
$ Kommt.Zeit : Factor w/ 70 levels "02.08.2018 00:07:23",..: 68 6
$ Geht.Zeit : Factor w/ 68 levels "01.08.2018 01:30:25",..: 68 8
$ Anstehdauer : Factor w/ 46 levels "00 Std 00 Min 01 Sek ",..: 12 4
$ Jahr : int 2018 2018
$ Monat : int 7 8
$ KW : Factor w/ 5 levels "KW27","KW28",..: 4 5
$ Tag : int 29 4
$ Wartung.geht: int 0 0
$ datetime : POSIXlt, format: "2018-07-29 23:00:00" "2018-08-04 11:16:00"
$ gehtzeit : POSIXlt, format: "2018-07-29 23:02:00" "2018-08-04 11:19:00"
$ bleiben :Class 'difftime' atomic [1:2] 2 3
.. ..- attr(*, "units")= chr "secs"
$ ID : chr "2018-07-29 23:00:00 2018-07-29 23:02:00 1A50" "2018-08-04 11:16:00 2018-08-04 11:19:00 1A50"
Consider building a generalized user-defined function that receives a data frame as an input parameter. Then call the function with by. Like split, by also subsets a data frame by one or more factors such as ID, but unlike split, by can then pass each subset into a function. To row-bind everything together, run do.call(rbind, ...) at the end.
The version below removes the redundant df$test <- Sys.time(), which is overwritten later, and reuses the see object inside the final format() call to avoid re-calculation and repetition.
library(lubridate) # for interval() and minutes()
calc_datetime <- function(df) {
# INITIAL CALCS
start_date <- df$datetime
end_date <- df$gehtzeit
n_minutes <- interval(start_date, end_date)/minutes(1)
see <- start_date + minutes(0:n_minutes) # the diff time in minutes I need
# BUILD OUTPUT DF
df <- data.frame(df[rep(seq_len(nrow(df)), df$stay + 1), 1:17, drop = FALSE], row.names = NULL)
df$test <- format(see, format = "%d.%m.%Y %H:%M:%S")
return(df)
}
# BUILD LIST OF SUBSETTED DFs
df_list <- by(mistake, mistake$ID, calc_datetime)
# APPEND ALL RESULT DFs TO SINGLE FINAL DF
final_df <- do.call(rbind, df_list)
Along the same lines as Parfait's answer, and using the same user-defined function calc_datetime, I would use map_dfr from the purrr package:
library(purrr) # for map_dfr()
df_list <- split(mistake, mistake$ID)
final_df <- map_dfr(df_list, calc_datetime)
If you update the question with data I can use, I can give a working demonstration.

How to optimize this process?

I have a somewhat broad question, but I will try to make my intent as clear as possible so that people can make suggestions. I am trying to optimize a process: feeding a function a data frame of values and generating a prediction from operations on specific columns, basically a custom function used with sapply (code below). What I'm doing is much too large to provide a meaningful example, so instead I will describe the inputs to the process. I know this restricts how helpful answers can be, but I am interested in any ideas for reducing the time it takes to compute a prediction. Currently it takes about 10 seconds to generate one prediction (i.e. to run the sapply for one line of a data frame).
mean_rating <- function(df) {
  user <- df$user
  movie <- df$movie
  u_row <- which(U_lookup == user)[1]  # row of dfm holding this user
  m_row <- which(M_lookup == movie)[1] # column of dfm holding this movie
  knn_match <- knn_txt[u_row, 1:100]   # the user's 100 nearest neighbors
  knn_match1 <- as.numeric(unlist(knn_match))
  dfm_test <- dfm[knn_match1, ]        # ratings of those neighbors
  dfm_mov <- dfm_test[, m_row]         # column of dfm associated with the query movie
  C <- mean(dfm_mov)                   # mean neighbor rating is the prediction
}
test<-sapply(1:nrow(probe_test),function(x) mean_rating(probe_test[x,]))
Inputs:
dfm is my main data matrix, users in the rows and movies in the columns. Very sparse.
> str(dfm)
Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
..# i : int [1:99072112] 378 1137 1755 1893 2359 3156 3423 4380 5103 6762 ...
..# j : int [1:99072112] 0 0 0 0 0 0 0 0 0 0 ...
..# Dim : int [1:2] 480189 17770
..# Dimnames:List of 2
.. ..$ : NULL
.. ..$ : NULL
..# x : num [1:99072112] 4 5 4 1 4 5 4 5 3 3 ...
..# factors : list()
probe_test is my test set, the set I'm trying to predict for. The actual probe set contains approximately 1.4 million rows, but I am trying a subset first while optimizing the runtime. It is what gets fed into my function.
> str(probe_test)
'data.frame': 6 obs. of 6 variables:
$ X : int 1 2 3 4 5 6
$ movie : int 1 1 1 1 1 1
$ user : int 1027056 1059319 1149588 1283744 1394012 1406595
$ Rating : int 3 3 4 3 5 4
$ Rating_Date: Factor w/ 1929 levels "2000-01-06","2000-01-08",..: 1901 1847 1911 1312 1917 1803
$ Indicator : int 1 1 1 1 1 1
U_lookup is the lookup table I use to convert between user id and the row of the matrix a user is in, since user ids are lost when the data is converted to a sparse matrix.
> str(U_lookup)
'data.frame': 480189 obs. of 1 variable:
$ x: int 10 100000 1000004 1000027 1000033 1000035 1000038 1000051 1000053 1000057 ...
M_lookup is the lookup table I use to convert between movie id and the column of the matrix a movie is in, for the same reason as above.
> str(M_lookup)
'data.frame': 17770 obs. of 1 variable:
$ x: int 1 10 100 1000 10000 10001 10002 10003 10004 10005 ...
knn_txt contains the 100 nearest neighbors for every row of dfm:
> str(knn_txt)
'data.frame': 480189 obs. of 200 variables:
Thank you for any advice you can provide.
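One generic direction, sketched under the assumption that U_lookup$x and M_lookup$x hold the ids in the same order as the rows/columns of dfm: each call to mean_rating() scans a 480,189-element lookup with which(), so the per-row scans can be replaced by a single vectorized match() computed up front:
u_rows <- match(probe_test$user, U_lookup$x)  # all user rows in one pass
m_cols <- match(probe_test$movie, M_lookup$x) # all movie columns in one pass
test <- vapply(seq_len(nrow(probe_test)), function(i) {
  knn <- as.numeric(unlist(knn_txt[u_rows[i], 1:100])) # the user's 100 neighbors
  mean(dfm[knn, m_cols[i]])                            # mean neighbor rating for the movie
}, numeric(1))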

Interval classification in r

I have two data frames, as follows:
str(daily)
Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 15264 obs. of 3 variables:
$ steps : int 0 0 0 0 0 0 0 0 0 0 ...
$ date : Date, format: "2012-10-02" "2012-10-02" "2012-10-02" ...
$ interval: int 0 5 10 15 20 25 30 35 40 45 ...
interval <- data.frame(unique(daily$interval))
str(interval)
'data.frame': 288 obs. of 1 variable:
$ unique.daily.interval.: int 0 5 10 15 20 25 30 35 40 45 50 55 100 ... 2350 2355
Using dplyr, what I intended to do was find the mean of daily$steps for each interval across daily$date, using the following:
mutate(daily, class = cut(daily$steps, c(0, interval$unique.daily.interval.),
                          include.lowest = TRUE)) %>%
  group_by(class) %>%
  summarise(Mean = mean(steps))
The code fails giving the following error
Error: 'breaks' are not unique
which I have isolated to the class = cut() call. I have checked the interval df for uniqueness; it contains only 288 values. Can someone point out what I am doing wrong? Here is a reference I used: Create class intervals in r and sum values
Here is a link to the data in question: Activity monitoring data
Thanks.
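As a closing note, the error itself comes from c(0, interval$unique.daily.interval.): it prepends a 0 to breaks that already start at 0, so the breaks are no longer unique. And for the stated goal, the mean of steps for each interval, cut() is not needed at all; a minimal dplyr sketch on the daily data frame above:
library(dplyr)
interval_means <- daily %>%
  ungroup() %>%                 # drop the existing grouping first
  group_by(interval) %>%        # one group per 5-minute interval
  summarise(Mean = mean(steps)) # mean steps for that interval across all dates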
