Finding discrepancy between two data sets when setdiff is not working

I have spot-price and day-ahead-price data for hour 2 and hour 3, shown below. Each table runs from 2015-12-31 back to 2011-01-01.
> head(da2)
Date Price Hour
43802 2015-12-31 12.56 2
43778 2015-12-30 23.59 2
43754 2015-12-29 17.07 2
> head(sp2)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 2 17.15
2 2015-12-30 2 26.23
3 2015-12-29 2 23.01
> head(da3)
Date Price Hour
43803 2015-12-31 10.46 3
43779 2015-12-30 23.55 3
43755 2015-12-29 16.52 3
> head(sp3)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 3 12.96
2 2015-12-30 3 25.65
3 2015-12-29 3 23.59
I tried to put da2$Price and sp2$Price together in one data frame, and did the same for hour 3.
Unfortunately, I get these errors:
> rpdf2<-data.frame(da2$Date,da2$Price,sp2$Price)
Error in data.frame(da2$Date, da2$Price, sp2$Price) :
arguments imply differing number of rows: 1826, 1822
> rpdf3<-data.frame(da3$Date,da3$Price,sp3$Price)
Error in data.frame(da3$Date, da3$Price, sp3$Price) :
arguments imply differing number of rows: 1821, 1825
So I ran
> setdiff(paste(da2$Date),paste(sp2$Date))
and found
[1] "2014-03-30" "2013-03-31" "2012-03-25" "2011-03-27"
That was fine. But when I ran setdiff(paste(da3$Date),paste(sp3$Date)), it returned character(0).
There must be a difference of 4 observations, but I cannot find those four. Can anyone help me with this situation? Thank you.
When I run setdiff(da3$Date,sp3$Date) instead, the result is
[1] 16800.04 16799.04 16798.04 16797.04 16796.04 16795.04 16794.04 16793.04 16792.04 16791.04 16790.04 16789.04 16788.04 16787.04 16786.04 16785.04 16784.04
[18] 16783.04 16782.04 16781.04 16780.04 16779.04 16778.04 16777.04 16776.04 16775.04 16774.04 16773.04 16772.04 16771.04 16770.04 16769.04 16768.04 16767.04
[35] 16766.04 16765.04 16764.04 16763.04 16762.04 16761.04 16760.04 16759.04 16758.04 16757.04 16756.04 16755.04 16754.04 16753.04 16752.04 16751.04 16750.04
[52] 16749.04 16748.04 16747.04 16746.04 16745.04 16744.04 16743.04 16742.04 16741.04 16740.04 16739.04 16738.04 16737.04 16736.04 16735.04 16734.04 16733.04
[69] 16732.04 16731.04 16730.04 16729.04 16728.04 16727.04 16726.04 16725.04 16724.04 16723.04 16722.04 16721.04 16720.04 16719.04 16718.04 16717.04 16716.04
[86] 16715.04 16714.04 16713.04 16712.04 16711.04 16710.04 16709.04 16708.04 16707.04 16706.04 16705.04 16704.04 16703.04 16702.04 16701.04 16700.04 16699.04
and so on.

One way (of many) to tackle this is, instead of looking directly for the differences, to join your tables in a way that works regardless of missing rows. To do so, generate a complete sequence of all dates from the first date in your data to the last, then left-join each of your day-ahead and spot price data frames to it in turn. Dates missing from a table will show up as NA in that table's columns in the joined result.
Example sequence, shortened to one month only for this example. You'd start it at 2011-01-01 instead.
somedates = seq(as.Date("2015-12-01"), as.Date("2015-12-31"), by = "day")
Generate some test data each with four randomly missed dates to simulate your da2, da3, sp2 and sp3 tables:
library(dplyr)
set.seed(0)
da2 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 2, price = 20)
set.seed(1)
da3 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 3, price = 21)
set.seed(2)
sp2 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 2, price = 19)
set.seed(3)
sp3 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 3, price = 18)
Joining the da2, da3, sp2 and sp3 tables
With the test data generated, joining the tables to the complete sequence of dates (as a data frame) is straightforward. (NB I haven't replaced the joined column names with more meaningful versions in the result below).
all =
  left_join(data.frame(Date = somedates), da2, by = "Date") %>%
  left_join(da3, by = "Date") %>%
  left_join(sp2, by = "Date") %>%
  left_join(sp3, by = "Date")
Results from the test data joined
>all
Date hour.x price.x hour.y price.y hour.x.x price.x.x hour.y.y price.y.y
1 2015-12-01 2 20 3 21 2 19 3 18
2 2015-12-02 2 20 3 21 2 19 3 18
3 2015-12-03 NA NA 3 21 2 19 3 18
4 2015-12-04 2 20 3 21 2 19 3 18
5 2015-12-05 2 20 3 21 2 19 3 18
6 2015-12-06 2 20 3 21 2 19 3 18
7 2015-12-07 2 20 3 21 2 19 NA NA
8 2015-12-08 2 20 3 21 2 19 3 18
9 2015-12-09 2 20 3 21 NA NA 3 18
10 2015-12-10 2 20 3 21 NA NA 3 18
11 2015-12-11 2 20 3 21 2 19 3 18
12 2015-12-12 NA NA 3 21 2 19 3 18
13 2015-12-13 2 20 NA NA 2 19 NA NA
14 2015-12-14 2 20 3 21 2 19 3 18
15 2015-12-15 2 20 3 21 2 19 3 18
16 2015-12-16 2 20 3 21 2 19 3 18
17 2015-12-17 2 20 3 21 2 19 3 18
18 2015-12-18 2 20 NA NA 2 19 3 18
19 2015-12-19 NA NA 3 21 2 19 3 18
20 2015-12-20 2 20 NA NA NA NA 3 18
21 2015-12-21 2 20 3 21 2 19 3 18
22 2015-12-22 2 20 3 21 2 19 3 18
23 2015-12-23 2 20 3 21 2 19 3 18
24 2015-12-24 2 20 3 21 2 19 NA NA
25 2015-12-25 2 20 3 21 2 19 3 18
26 2015-12-26 2 20 3 21 2 19 3 18
27 2015-12-27 2 20 3 21 2 19 3 18
28 2015-12-28 2 20 3 21 2 19 3 18
29 2015-12-29 2 20 3 21 2 19 3 18
30 2015-12-30 2 20 3 21 NA NA 3 18
31 2015-12-31 NA NA NA NA 2 19 NA NA
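To pull out just the dates that are missing from one table, one option is to filter the joined result on NA. The sketch below is only an illustration under stated assumptions: it uses base R merge() with all.x = TRUE in place of left_join, and the names all_dates, da, sp, price_da and price_sp are made up for the example.

```r
# Complete date sequence (shortened to ten days for the example)
all_dates <- data.frame(
  Date = seq(as.Date("2015-12-01"), as.Date("2015-12-10"), by = "day")
)

# Hypothetical test tables: da lacks Dec 3 and Dec 7, sp lacks Dec 5
da <- data.frame(Date = all_dates$Date[-c(3, 7)], price_da = 20)
sp <- data.frame(Date = all_dates$Date[-5], price_sp = 19)

# Left-join both tables onto the full sequence (base R equivalent of left_join)
joined <- merge(all_dates, da, by = "Date", all.x = TRUE)
joined <- merge(joined, sp, by = "Date", all.x = TRUE)

# Dates where a table contributed no row show up as NA in its price column
missing_da <- joined$Date[is.na(joined$price_da)]
missing_sp <- joined$Date[is.na(joined$price_sp)]
```

Here missing_da holds the dates absent from da and missing_sp those absent from sp, which is the list the original setdiff calls were meant to produce.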
Edit: I note that the numeric dates returned by your setdiff call have a 0.04 fractional (time-of-day) component in addition to the whole-number day. The join will only match if the dates on both sides carry the same component. I have now tested this: rather than adding the time component to the date sequence, it is simpler to truncate each date to a whole day:
da2$Date = trunc.Date(da2$Date, "days")
da3$Date = trunc.Date(da3$Date, "days")
sp2$Date = trunc.Date(sp2$Date, "days")
sp3$Date = trunc.Date(sp3$Date, "days")
You'd do this before the joins.
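One more thing that may explain the character(0) for hour 3: setdiff(x, y) only returns elements of x that are missing from y. The error message in the question reports 1821 rows for da3 and 1825 for sp3, so the extra dates are probably in sp3, and the call would need its arguments swapped to show them. A minimal sketch with made-up dates (da_dates and sp_dates are hypothetical stand-ins, not your data):

```r
# Hypothetical stand-ins for da3$Date and sp3$Date
da_dates <- as.Date(c("2015-12-31", "2015-12-30", "2015-12-29"))
sp_dates <- as.Date(c("2015-12-31", "2015-12-30", "2015-12-29", "2015-12-28"))

# setdiff(x, y) keeps the elements of x that are absent from y,
# so the order of the arguments matters
only_in_da <- setdiff(format(da_dates), format(sp_dates))
only_in_sp <- setdiff(format(sp_dates), format(da_dates))

# Checking both directions at once
either_side <- union(only_in_da, only_in_sp)
```

With these inputs only_in_da is empty while only_in_sp contains the extra date, which mirrors the hour-3 situation if sp3 is indeed the longer table.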

Related

r group by date difference with respect to first date

I have a dataset that looks like this.
Id Date1 Cars
1 2007-04-05 72
2 2014-01-07 12
2 2018-07-09 10
2 2018-07-09 13
3 2005-11-19 22
3 2005-11-23 13
4 2010-06-17 38
4 2010-09-23 57
4 2010-09-23 41
4 2010-10-04 17
What I would like to do is, for each Id, get the date difference relative to the earliest date for that Id: (2nd earliest date - earliest date), (3rd earliest date - earliest date), (4th earliest date - earliest date), and so on.
I would end up with a dataset like this
Id Date1 Cars Diff
1 2007-04-05 72 NA
2 2014-01-07 12 NA
2 2018-07-09 10 1644 = (2018-07-09 - 2014-01-07)
2 2018-07-09 13 1644 = (2018-07-09 - 2014-01-07)
3 2005-11-19 22 NA
3 2005-11-23 13 4 = (2005-11-23 - 2005-11-19)
4 2010-06-17 38 NA
4 2010-09-23 57 98 = (2010-09-23 - 2010-06-17)
4 2010-09-23 41 98 = (2010-09-23 - 2010-06-17)
4 2010-10-04 17 109 = (2010-10-04 - 2010-06-17)
I am unclear on how to accomplish this. Any help would be much appreciated. Thanks

Change Date1 to date class.
df$Date1 = as.Date(df$Date1)
You can subtract with the first value in each Id. This can be done using dplyr.
library(dplyr)
df %>% group_by(Id) %>% mutate(Diff = as.integer(Date1 - first(Date1)))
# Id Date1 Cars Diff
# <int> <date> <int> <int>
# 1 1 2007-04-05 72 0
# 2 2 2014-01-07 12 0
# 3 2 2018-07-09 10 1644
# 4 2 2018-07-09 13 1644
# 5 3 2005-11-19 22 0
# 6 3 2005-11-23 13 4
# 7 4 2010-06-17 38 0
# 8 4 2010-09-23 57 98
# 9 4 2010-09-23 41 98
#10 4 2010-10-04 17 109
Using data.table:
library(data.table)
setDT(df)[, Diff := as.integer(Date1 - first(Date1)), Id]
Or base R:
df$Diff <- with(df, ave(as.integer(Date1), Id, FUN = function(x) x - x[1]))
Replace the 0s with NA if you want the output exactly as shown in the question.
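For the NA step, a small base R sketch follows. It assumes the rows are already sorted by date within each Id, and it blanks only the first row of each Id rather than every 0, so a later row that happens to share the earliest date would keep its 0. The data frame here is a cut-down copy of the question's data.

```r
# Cut-down version of the question's data
df <- data.frame(
  Id    = c(1, 2, 2, 3, 3),
  Date1 = as.Date(c("2007-04-05", "2014-01-07", "2018-07-09",
                    "2005-11-19", "2005-11-23"))
)

# Difference in days from the first date within each Id
df$Diff <- with(df, ave(as.integer(Date1), Id, FUN = function(x) x - x[1]))

# Set the first row of each Id to NA (duplicated() is FALSE on first occurrence)
df$Diff[!duplicated(df$Id)] <- NA
```

If you would rather blank every zero difference, df$Diff[df$Diff == 0] <- NA does that instead.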

Merging Data frames and creating columns based on conditions

I have 2 data frames
Data Frame A:
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
Data Frame B:
TimeStart TimeEnd Alarm
2 5 556
7 9 556
I would like to create the following joined dataframe:
Time Reading Alarmtime Alarm alarmno
1 20 n/a n/a n/a
2 23 2 556 1
3 25 556 1
4 22 556 1
5 24 5 556 1
6 23 n/a n/a n/a
7 24 7 556 2
8 23 556 2
9 23 9 556 2
10 22 n/a n/a n/a
I can do the join easily enough; however, I'm struggling to fill the following rows with the alarm until the time the alarm ended. I also want to number each individual alarm, so that even if they are the same alarm they are counted separately. Any thoughts on how I can do this would be great.
Thanks

library(sqldf)
df_b$AlarmNo <- seq_len(nrow(df_b))
sqldf('
select a.Time
, a.Reading
, case when a.Time in (b.TimeStart, b.TimeEnd)
then a.Time
else NULL
end as AlarmTime
, b.Alarm
, b.AlarmNo
from df_a a
left join df_b b
on a.Time between b.TimeStart and b.TimeEnd
')
# Time Reading AlarmTime Alarm AlarmNo
# 1 1 20 NA NA NA
# 2 2 23 2 556 1
# 3 3 25 NA 556 1
# 4 4 22 NA 556 1
# 5 5 24 5 556 1
# 6 6 23 NA NA NA
# 7 7 24 7 556 2
# 8 8 23 NA 556 2
# 9 9 23 9 556 2
# 10 10 22 NA NA NA
Or, with data.table:
library(data.table)
setDT(df_b)
df_c <-
  df_b[, .(Time = seq(TimeStart, TimeEnd), Alarm, AlarmNo = .GRP)
       , by = TimeStart]
merge(df_a, df_c, by = 'Time', all.x = TRUE)
# Time Reading TimeStart Alarm AlarmNo
# 1: 1 20 NA NA NA
# 2: 2 23 2 556 1
# 3: 3 25 2 556 1
# 4: 4 22 2 556 1
# 5: 5 24 2 556 1
# 6: 6 23 NA NA NA
# 7: 7 24 7 556 2
# 8: 8 23 7 556 2
# 9: 9 23 7 556 2
# 10: 10 22 NA NA NA
Data used:
df_a <- fread('
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
')
df_b <- fread('
TimeStart TimeEnd Alarm
2 5 556
7 9 556
')
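The data.table result above carries TimeStart rather than the Alarmtime column from the question. As a rough base R sketch of the same expand-then-merge idea that also marks only the start and end times, something like the following could work. The helper data frame expanded and the final column selection are assumptions made for illustration, not part of either approach above.

```r
df_a <- data.frame(Time = 1:10,
                   Reading = c(20, 23, 25, 22, 24, 23, 24, 23, 23, 22))
df_b <- data.frame(TimeStart = c(2, 7), TimeEnd = c(5, 9), Alarm = c(556, 556))
df_b$AlarmNo <- seq_len(nrow(df_b))  # number each alarm occurrence separately

# Expand each alarm into one row per time step it covers
expanded <- do.call(rbind, lapply(seq_len(nrow(df_b)), function(i) {
  data.frame(Time      = seq(df_b$TimeStart[i], df_b$TimeEnd[i]),
             TimeStart = df_b$TimeStart[i],
             TimeEnd   = df_b$TimeEnd[i],
             Alarm     = df_b$Alarm[i],
             AlarmNo   = df_b$AlarmNo[i])
}))

# Left join onto the readings; times outside any alarm get NA
out <- merge(df_a, expanded, by = "Time", all.x = TRUE)

# Keep the time only where it is the start or the end of an alarm
out$AlarmTime <- ifelse(out$Time == out$TimeStart | out$Time == out$TimeEnd,
                        out$Time, NA)
out <- out[, c("Time", "Reading", "AlarmTime", "Alarm", "AlarmNo")]
```

This assumes the alarm intervals do not overlap; overlapping intervals would produce more than one row for the same Time.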

R - Detect end of observations in groups and remove redundant rows

I have a data.frame of about 300k rows, with 24 rows for each ID, each row representing an hourly observation of that ID. My problem is that for some IDs the observation ends before the 24 hours have passed, yet those IDs still have 24 rows, with the remaining rows holding NA in their 3 observation variables.
A simplified table would look something like this:
ID HOUR OBS_1 OBS_2 OBS_3 MISC MISC_2
1 0 29 32 34 19 21
1 1 21 12 NA 19 21
1 2 NA 24 NA 19 21
1 3 NA NA NA 19 21
1 4 NA NA NA 19 21
2 0 41 16 21 13 24
2 1 NA NA NA 13 24
2 2 11 30 41 13 24
2 3 21 NA NA 13 24
2 4 24 35 21 13 24
2 5 NA NA NA 13 24
2 6 NA NA NA 13 24
3 0 NA NA NA 35 46
3 1 23 34 24 35 46
3 2 NA 26 NA 35 46
3 3 NA NA 24 35 46
3 4 12 29 42 35 46
3 5 NA NA NA 35 46
3 6 NA NA NA 35 46
In the table, each ID would represent a scenario that should be handled appropriately:
ID 1: The ordinary case, with observations starting at hour 0 and ending before hour 3, so the rows for hours 3 and 4 in that group should be removed.
ID 2: Has an hour (1) where all three observation variables are NA, but observation resumes and ends before hour 5. The row for hour 1 should therefore be kept (it reflects a faulty registration, not the end of observation), and the rows for hours 5 and 6 should be removed.
ID 3: Starts with a row that has NA in all three observation variables, but observation begins the next hour and ends before hour 5. This is akin to the scenario for ID 2, except that it occurs at the very start instead of in the middle of the observations. It still represents a faulty registration and should be kept, and the rows for hours 5 and 6 in this group should be removed.
Conceptually, I would think a possible solution would be to group_by ID and have R go through the rows of each group in reverse (from the bottom up) until it encounters a row where "OBS_1", "OBS_2" and "OBS_3" are not all NA, remove the rows examined before reaching that row, and then move on to the next group.
Any help would be greatly appreciated!

If your MISC and MISC_2 values are constant within each ID, you could
filter out every row in which all three observation variables are NA, then restore the interior gaps with complete and fill.
library(dplyr)
library(tidyr)
df %>% filter(!(is.na(OBS_1)&is.na(OBS_2)&is.na(OBS_3))) %>%
group_by(ID) %>%
complete(HOUR=0:max(HOUR)) %>%
fill(MISC,MISC_2) %>% fill(MISC,MISC_2,.direction = "up")
# A tibble: 13 x 7
# Groups: ID [3]
# ID HOUR OBS_1 OBS_2 OBS_3 MISC MISC_2
# <int> <int> <int> <int> <int> <int> <int>
# 1 1 0 29 32 34 19 21
# 2 1 1 21 12 NA 19 21
# 3 1 2 NA 24 NA 19 21
# 4 2 0 41 16 21 13 24
# 5 2 1 NA NA NA 13 24
# 6 2 2 11 30 41 13 24
# 7 2 3 21 NA NA 13 24
# 8 2 4 24 35 21 13 24
# 9 3 0 NA NA NA 35 46
# 10 3 1 23 34 24 35 46
# 11 3 2 NA 26 NA 35 46
# 12 3 3 NA NA 24 35 46
# 13 3 4 12 29 42 35 46

This removes all-NA rows only when no further observations exist after them for that ID, and keeps the all-NA rows that do not mark the end of the observations. It also allows your other variables to vary within an ID, because the trailing rows are simply dropped once the end of the observations is reached.
df %>% arrange(rev(as.numeric(rownames(.)))) %>%
group_by(ID) %>%
mutate(rowNum = 1:n(),
naObs = cumsum((is.na(OBS_1) & is.na(OBS_2) & is.na(OBS_3))),
missingBlock = naObs != rowNum) %>%
slice(min(which(missingBlock)):n()) %>%
ungroup() %>%
arrange(rev(as.numeric(rownames(.)))) %>%
select(-rowNum, -naObs, -missingBlock)
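The same "walk back from the bottom" idea can also be sketched in base R, without reversing the data. This is only an illustration under assumptions: the small data frame below is invented, it uses two observation columns instead of three, and it relies on the rows being ordered by HOUR within each ID.

```r
# Invented example: ID 1 ends after hour 1, ID 2 has a leading all-NA row
df <- data.frame(
  ID    = c(1, 1, 1, 1, 2, 2, 2, 2),
  HOUR  = c(0, 1, 2, 3, 0, 1, 2, 3),
  OBS_1 = c(29, NA, NA, NA, NA, 11, NA, NA),
  OBS_2 = c(32, 24, NA, NA, NA, 30, NA, NA)
)

# TRUE where at least one observation variable is present
has_obs <- rowSums(!is.na(df[, c("OBS_1", "OBS_2")])) > 0

# Row index of the last observed row within each ID (0 if none observed)
last_obs <- ave(seq_along(has_obs) * has_obs, df$ID, FUN = max)

# Keep every row up to and including the last observed row of its ID
trimmed <- df[seq_len(nrow(df)) <= last_obs, ]
```

All-NA rows before or between observations are kept, and an ID with no observations at all is dropped entirely, which may or may not be what you want.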

How to insert a row which calculates the average of the rows above it?

I am looking to separate rows of data by Cue and add a row after each Cue group that holds the averages of the rows above it. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!

You can use base R to do something like:
Reduce(rbind,by(data,data[1],function(x)rbind(x,c(NA,NA,colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835

Here is one idea. Split the data frame, perform the analysis, and then combine them together.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
df_temp[, c("Cue", "ITI")] <- NA
df <- rbind(x, df_temp)
return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)

A data.table approach, but if someone can offer some improvements I'd be keen to hear.
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
> data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to leave the Cue values as-is to confirm group, just drop the [is.na(ITI), Cue := NA] from the last line.

I would use group_by and summarise from the dplyr package to get a data frame with the average values, then rbind the new data frame to the old one and sort by Cue:
library(dplyr)
df_averages <- df_orig %>%
  group_by(Cue) %>%
  summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) %>%
  ungroup()
df_all <- rbind(df_orig, df_averages)
# order() is stable, so each average row stays after the rows of its Cue group
df_all <- df_all[order(df_all$Cue), ]

Fastest way to remove same number of NA from each column and realign data

This extends an earlier post, which gives the following result:
x y z
1: 1 NA NA
2: 2 NA 22
3: 3 13 23
4: 4 14 24
5: 5 15 25
6: 6 16 26
7: 7 17 27
8: NA 18 28
9: NA 19 NA
10: NA NA NA
If the NAs in each column are removed, we obtain the following data.table:
x y z
1: 1 13 22
2: 2 14 23
3: 3 15 24
4: 4 16 25
5: 5 17 26
6: 6 18 27
7: 7 19 28
I came up with this code to obtain the above result:
mat.temp <- na.omit(mat[, 1, with = FALSE])
for (i in 2:3) {
  temp <- na.omit(mat[, i, with = FALSE])
  mat.temp <- cbind(mat.temp, temp)
}
However, I am not sure it is efficient.
Could you please give me suggestions?
Thank you

It sounds like you are just trying to do:
DT[, lapply(.SD, function(x) x[!is.na(x)])]
# x y z
# 1: 1 13 22
# 2: 2 14 23
# 3: 3 15 24
# 4: 4 16 25
# 5: 5 17 26
# 6: 6 18 27
# 7: 7 19 28
However, I'm not sure how well this would hold up if you have a different number of NA values in each column.
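If the columns do end up with different numbers of non-NA values, one possible workaround is to pad the shorter ones with trailing NA so they line up again. The base R sketch below is only an illustration: it works on a plain list rather than a data.table, and the values are made up.

```r
# Made-up columns with unequal numbers of NA values
cols <- list(x = c(1, 2, NA), y = c(NA, NA, 13), z = c(NA, 22, 23))

# Drop the NAs from each column
kept <- lapply(cols, function(v) v[!is.na(v)])

# Pad every column to the longest length; `length<-` fills with NA at the end
n <- max(lengths(kept))
padded <- as.data.frame(lapply(kept, `length<-`, n))
```

Here y has a single value left, so it is padded to length 2 with one NA; whether that padding is acceptable depends on what the realigned rows are supposed to mean.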
