I have a data frame and applied the changepoint.np package to it. Now I want to calculate the median, or display a trendline, for the data between the changepoint locations (red lines).
Any ideas how to do this?
My dataframe df1:
date amount
2012-07-01 20.0000000
2012-08-01 11.1111111
2012-09-01 0.0000000
2012-10-01 0.0000000
2012-11-01 4.7619048
2012-12-01 4.7619048
2013-01-01 7.8947368
2013-02-01 0.0000000
2013-03-01 0.0000000
2013-04-01 1.8181818
2013-05-01 0.0000000
2013-06-01 0.0000000
2013-07-01 0.0000000
2013-08-01 0.0000000
2013-09-01 1.7543860
2013-10-01 0.6410256
2013-11-01 3.0534351
2013-12-01 2.6143791
2014-01-01 7.6023392
2014-02-01 2.7777778
2014-03-01 5.2884615
2014-04-01 2.7237354
2014-05-01 2.3255814
2014-06-01 2.6627219
2014-07-01 2.0710059
2014-08-01 2.7522936
2014-09-01 4.6413502
2014-10-01 4.4077135
2014-11-01 3.4759358
2014-12-01 4.3333333
2015-01-01 8.0128205
2015-02-01 9.3632959
2015-03-01 4.3771044
2015-04-01 4.0650407
2015-05-01 3.7500000
2015-06-01 4.6189376
2015-07-01 3.6764706
2015-08-01 2.4561404
2015-09-01 2.9090909
2015-10-01 2.1084337
And my code for the changepoint:
library(changepoint.np)
out <- cpt.np(df1$amount, method = 'PELT')
plot(out)
median(df1$amount) gives the median of the whole series; for the trendline you would first have to tell us the actual values. Lines can be added to a plot with the lines() function, i.e. with coordinates for the first and last point of each line.
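For per-segment medians, a minimal sketch (assuming out from the cpt.np() call above; cpts() from the underlying changepoint package returns the changepoint indices):

# Split the series at the changepoints, then draw each segment's
# median as a horizontal line over the base plot
ends <- c(cpts(out), length(df1$amount))   # last index of each segment
starts <- c(1, head(ends, -1) + 1)         # first index of each segment
plot(out)
for (i in seq_along(starts)) {
  seg <- starts[i]:ends[i]
  m <- median(df1$amount[seg])
  segments(starts[i], m, ends[i], m, col = "blue", lwd = 2)
}

For a per-segment trendline instead, lines(seg, fitted(lm(df1$amount[seg] ~ seg))) inside the loop would draw a fitted line in place of the median.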
I can't understand why my code is producing an undesired output, since I've tried this in the past with similar datasets and got good results.
Below are the two dataframes I would like to left_join():
> head(datagps)
Date & Time [Local] Latitude Longitude DateTime meters
1: 06/11/2018 08:44 -2.434986 34.85387 2018-11-06 08:44:00 1.920190
2: 06/11/2018 08:48 -2.434993 34.85386 2018-11-06 08:48:00 3.543173
3: 06/11/2018 08:52 -2.435014 34.85388 2018-11-06 08:52:00 1.002979
4: 06/11/2018 08:56 -2.435011 34.85389 2018-11-06 08:56:00 3.788024
5: 06/11/2018 09:00 -2.434986 34.85387 2018-11-06 09:00:00 1.262584
6: 06/11/2018 09:04 -2.434994 34.85386 2018-11-06 09:04:00 3.012679
> head(datasensorraw)
# A tibble: 6 x 4
TimeGroup x y z
<dttm> <int> <int> <dbl>
1 2000-01-01 00:04:00 0 0 0
2 2000-01-01 00:08:00 1 0 1
3 2000-01-01 00:12:00 0 0 0
4 2000-01-01 00:20:00 0 0 0
5 2000-01-01 00:24:00 0 0 0
6 2018-06-09 05:04:00 4 14 14.6
And below is my code. There are no errors, but for some reason I get NAs under x, y and z. This should not happen, since there are registered values in the datasensorraw dataframe for those time stamps:
> library(dplyr)
> dataresults<-datagps %>%
+ mutate(`Date & Time [Local]` = as.POSIXct(`Date & Time [Local]`,
+ format = "%d/%m/%Y %H:%M")) %>%
+ left_join(datasensorraw, by = c("Date & Time [Local]" = "TimeGroup"))
> #Left join the data frames
> head(dataresults)
Date & Time [Local] Latitude Longitude DateTime meters x y z
1 2018-11-06 07:44:00 -2.434986 34.85387 2018-11-06 08:44:00 1.920190 NA NA NA
2 2018-11-06 07:48:00 -2.434993 34.85386 2018-11-06 08:48:00 3.543173 NA NA NA
3 2018-11-06 07:52:00 -2.435014 34.85388 2018-11-06 08:52:00 1.002979 NA NA NA
4 2018-11-06 07:56:00 -2.435011 34.85389 2018-11-06 08:56:00 3.788024 NA NA NA
5 2018-11-06 08:00:00 -2.434986 34.85387 2018-11-06 09:00:00 1.262584 NA NA NA
6 2018-11-06 08:04:00 -2.434994 34.85386 2018-11-06 09:04:00 3.012679 NA NA NA
I can also upload a small dput() sample of datagps and datasensorraw.
I am learning R, so I'm wondering if I'm doing something wrong. I shouldn't get NAs under those columns, as you can see in the dput() samples provided. Any input is appreciated!
Looks like a mix-up in your date formats. Try switching format = "%d/%m/%Y %H:%M" to format = "%m/%d/%Y %H:%M", or switch the other dataset to d/m/y.
dataresults<- datagps_sample %>%
mutate(`Date & Time [Local]` = as.POSIXct(`Date & Time [Local]`, format = "%m/%d/%Y %H:%M")) %>%
left_join(datasensorraw_sample, by = c("Date & Time [Local]" = "TimeGroup"))
> head(dataresults)
Date & Time [Local] Latitude Longitude DateTime meters x y z
1 2018-06-11 12:44:00 -2.434986 34.85387 2018-11-06 08:44:00 1.920190 17 12 21.59363
2 2018-06-11 12:48:00 -2.434993 34.85386 2018-11-06 08:48:00 3.543173 6 0 6.00000
3 2018-06-11 12:52:00 -2.435014 34.85388 2018-11-06 08:52:00 1.002979 47 25 53.24351
4 2018-06-11 12:56:00 -2.435011 34.85389 2018-11-06 08:56:00 3.788024 0 0 0.00000
5 2018-06-11 13:00:00 -2.434986 34.85387 2018-11-06 09:00:00 1.262584 48 53 72.23108
6 2018-06-11 13:04:00 -2.434994 34.85386 2018-11-06 09:04:00 3.012679 139 113 179.24589
EDIT: basically, left_join was not finding any matches, so it was returning the rows from your original dataframe with the new columns as NA. If you format your column before left-joining, you can check whether there are common IDs with something simple like datagps$`Date & Time [Local]` %in% datasensorraw$TimeGroup.
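A sketch of that check using the question's objects (note the backticks around the non-syntactic column name):

# Count timestamps that match after parsing; 0 means the join returns all NAs
sum(as.POSIXct(datagps$`Date & Time [Local]`,
               format = "%m/%d/%Y %H:%M") %in% datasensorraw$TimeGroup)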
I have a dataframe df, and I set interval points which are saved in a vector pts. Now I want to label my data into these intervals. I tried using the cut() function, but I always get the error that 'x' must be numeric, even though I converted it to numeric.
My dataframe df:
date amount
1 2012-07-01 2.3498695
2 2012-08-01 0.6984866
3 2012-09-01 0.9079118
4 2012-10-01 2.8858218
5 2012-11-01 1.2406948
6 2012-12-01 2.3140496
7 2013-01-01 1.5904573
8 2013-02-01 3.2531825
9 2013-03-01 4.2962963
10 2013-04-01 3.3287101
11 2013-05-01 3.7698413
12 2013-06-01 1.4376997
13 2013-07-01 5.0687285
14 2013-08-01 4.4520548
15 2013-09-01 5.5063913
16 2013-10-01 5.5676856
17 2013-11-01 6.2686567
18 2013-12-01 11.021069
My vector pts, whose column Min holds the interval points:
pts$Min
[1] 3 6 11
My new dataframe should look like this:
date amount IntervalRange
1 2012-07-01 2.3498695 1
2 2012-08-01 0.6984866 1
3 2012-09-01 0.9079118 1
4 2012-10-01 2.8858218 2
5 2012-11-01 1.2406948 2
6 2012-12-01 2.3140496 2
7 2013-01-01 1.5904573 3
8 2013-02-01 3.2531825 3
9 2013-03-01 4.2962963 3
10 2013-04-01 3.3287101 3
11 2013-05-01 3.7698413 3
12 2013-06-01 1.4376997 4
13 2013-07-01 5.0687285 4
14 2013-08-01 4.4520548 4
15 2013-09-01 5.5063913 4
16 2013-10-01 5.5676856 4
17 2013-11-01 6.2686567 4
18 2013-12-01 11.021069 4
So, I tried this:
df_cut <- data.frame(as.numeric(df$date),
                     "IntervalRange" = cut(df, breaks = pts$Min))
Which results in this error message:
Error in cut.default(df, breaks = pts$Min) : 'x' must be numeric
My questions now are:
Why do I get this error message? I already changed it to numeric...
Can I achieve my desired output with the cut() and findInterval() functions when using other datasets with other interval points?
You are missing the value (that is, the column) in the cut() call. Your command should be:
data.frame(as.numeric(df$date), "IntervalRange" = cut(df$amount, breaks=pts$Min))
Hope this helps!
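On the second question in the post: findInterval() can reproduce the desired output directly, assuming (as that output suggests) that pts$Min holds row indices rather than amount values. A sketch:

# Label each row by the interval it falls into; the +1 on the breaks starts
# a new interval after each point in pts$Min, the +1 on the result makes the labels 1-based
df$IntervalRange <- findInterval(seq_len(nrow(df)), pts$Min + 1) + 1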
I am trying to fill in the gaps in one of my time series by merging a full-day time series into my original one. But for some reason I get duplicate entries, and all the rest of my data is NA.
My data looks like this:
> head(data)
TIME Water_Temperature
1 2016-08-22 00:00:00 81.000
2 2016-08-22 00:01:00 80.625
3 2016-08-22 00:02:00 85.000
4 2016-08-22 00:03:00 80.437
5 2016-08-22 00:04:00 85.000
6 2016-08-22 00:05:00 80.375
> tail(data)
TIME Water_Temperature
1398 2016-08-22 23:54:00 19.5
1399 2016-08-22 23:55:00 19.5
1400 2016-08-22 23:56:00 19.5
1401 2016-08-22 23:57:00 19.5
1402 2016-08-22 23:58:00 19.5
1403 2016-08-22 23:59:00 19.5
Some minutes in between are missing (1403 rows instead of 1440). I tried to fill them in using:
data.length <- length(data$TIME)
time.min <- data$TIME[1]
time.max <- data$TIME[data.length]
all.dates <- seq(time.min, time.max, by = "min")       # regular 1-minute sequence
all.dates.frame <- data.frame(list(TIME = all.dates))
merged.data <- merge(all.dates.frame, data, all = T)   # outer join on TIME
But that gives me a result of 1449 rows instead of 1440. The first eight minutes are duplicated in the time stamp column, and all other values in Water_Temperature are NA. Looks like this:
> merged.data[1:25,]
TIME Water_Temperature
1 2016-08-22 00:00:00 NA
2 2016-08-22 00:00:00 81.000
3 2016-08-22 00:01:00 NA
4 2016-08-22 00:01:00 80.625
5 2016-08-22 00:02:00 NA
6 2016-08-22 00:02:00 85.000
7 2016-08-22 00:03:00 NA
8 2016-08-22 00:03:00 80.437
9 2016-08-22 00:04:00 NA
10 2016-08-22 00:04:00 85.000
11 2016-08-22 00:05:00 NA
12 2016-08-22 00:05:00 80.375
13 2016-08-22 00:06:00 NA
14 2016-08-22 00:06:00 80.812
15 2016-08-22 00:07:00 NA
16 2016-08-22 00:07:00 80.812
17 2016-08-22 00:08:00 NA
18 2016-08-22 00:08:00 80.937
19 2016-08-22 00:09:00 NA
20 2016-08-22 00:10:00 NA
21 2016-08-22 00:11:00 NA
22 2016-08-22 00:12:00 NA
23 2016-08-22 00:13:00 NA
24 2016-08-22 00:14:00 NA
25 2016-08-22 00:15:00 NA
> tail(merged.data)
TIME Water_Temperature
1444 2016-08-22 23:54:00 NA
1445 2016-08-22 23:55:00 NA
1446 2016-08-22 23:56:00 NA
1447 2016-08-22 23:57:00 NA
1448 2016-08-22 23:58:00 NA
1449 2016-08-22 23:59:00 NA
Does anyone have an idea what's going wrong?
EDIT:
Now using the xts and zoo packages to do the job:
library(xts)
library(zoo)
df1.zoo <- zoo(data[, -1], data[, 1])   # zoo series indexed by TIME
# Merge with an empty xts on a regular 1-minute sequence; missing minutes become NA
df2 <- as.data.frame(as.zoo(merge(as.xts(df1.zoo),
                                  as.xts(zoo(, seq(start(df1.zoo), end(df1.zoo), by = "min"))))))
Very easy and effective!
Instead of merge, use rbind, which gives you an irregular time series without NAs to start with. If you really want a regular time series with a frequency of, say, 1 minute, you can build a time-based sequence as an index, merge it with your data (after using rbind), and fill the resulting NAs with na.locf(). Hope this helps.
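One way to read that suggestion, as a sketch using the question's objects (not the answerer's exact code):

library(zoo)
# Regular 1-minute index, outer join, then carry the last observation forward
all.min <- data.frame(TIME = seq(min(data$TIME), max(data$TIME), by = "min"))
filled <- merge(all.min, data, all.x = TRUE)
filled$Water_Temperature <- na.locf(filled$Water_Temperature, na.rm = FALSE)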
You can try merging with full_join() from the tidyverse.
This works for me with two dataframes (daily values) sharing a column named Date.
library(tidyverse)  # full_join() comes from dplyr, reduce() from purrr
big_data <- my_data %>%             # my_data: a list of data frames
  reduce(full_join, by = "Date")
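Applied to the question's objects, a sketch (a single full_join() rather than reduce(), since there are only two frames here):

library(dplyr)
# Unmatched minutes get NA in Water_Temperature
merged.data <- full_join(all.dates.frame, data, by = "TIME")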
This question already has answers here: How do I select a subset of rows after group by a specific column in R Data table [duplicate] (2 answers). Closed 7 years ago.
How to drop groups when there are not enough observations?
In the following reproducible example, each person (identified by name) has 10 observations:
install.packages('randomNames') # install package if required
install.packages('data.table')  # install package if required
lapply(c('data.table', 'randomNames'), require, character.only = TRUE) # load packages
set.seed(1)
testDT <- data.table(date = rep(seq(as.Date("2010/1/1"), as.Date("2019/1/1"), "years"), 10),
                     name = rep(randomNames(10, which.names = 'first'), times = 1, each = 10),
                     Y = runif(100, 5, 15),
                     X = rnorm(100, 2, 9))
testDT <- testDT[X > 0]
Now I want to keep only the persons with at least 6 observations, so Gracelline, Anna, Aesha and Michael must be removed, because they have only 3, 2, 4 and 5 observations respectively.
testDT[, length(X), by=name]
name V1
1: Blake 6
2: Alexander 6
3: Leigha 8
4: Gracelline 3
5: Epifanio 7
6: Keasha 6
7: Robyn 6
8: Anna 2
9: Aesha 4
10: Michael 5
How do I do this in an automatic way (real dataset is much larger)?
Edit:
Yes it's a duplicate. :(
The last proposed method was the fastest one.
> system.time(testDT[, .SD[.N>=6], by = name])
user system elapsed
0.293 0.227 0.517
> system.time(testDT[testDT[, .I[.N>=6], by = name]$V1])
user system elapsed
0.163 0.243 0.415
> system.time(testDT[,if(.N>=6) .SD , by = name])
user system elapsed
0.073 0.323 0.399
We group by 'name', get the number of rows (.N), and if it is at least 6, we subset the data.table (.SD):
testDT[,if(.N>=6) .SD , by = name]
# name date Y X
# 1: Blake 2010-01-01 9.820801 3.69913070
# 2: Blake 2012-01-01 9.935413 15.18999375
# 3: Blake 2013-01-01 6.862176 3.37928004
# 4: Blake 2014-01-01 13.273733 21.55350503
# 5: Blake 2015-01-01 11.684667 6.27958576
# 6: Blake 2017-01-01 6.079436 7.49653718
# 7: Alexander 2010-01-01 13.209463 4.62301612
# 8: Alexander 2012-01-01 12.829328 2.00994816
# 9: Alexander 2013-01-01 10.530363 2.66907192
#10: Alexander 2016-01-01 5.233312 0.78339246
#11: Alexander 2017-01-01 9.772301 12.60278297
#12: Alexander 2019-01-01 11.927316 7.34551569
#13: Leigha 2010-01-01 9.776196 4.99655334
#14: Leigha 2011-01-01 13.612095 11.56789854
#15: Leigha 2013-01-01 7.447973 5.33016929
#16: Leigha 2014-01-01 5.706790 4.40388912
#17: Leigha 2016-01-01 8.162717 12.87081025
#18: Leigha 2017-01-01 10.186343 12.44362354
#19: Leigha 2018-01-01 11.620051 8.30192285
#20: Leigha 2019-01-01 9.068302 16.28150109
#21: Epifanio 2010-01-01 8.390729 17.90558542
#22: Epifanio 2011-01-01 13.394404 8.45036728
#23: Epifanio 2012-01-01 8.466835 10.19156807
#24: Epifanio 2013-01-01 8.337749 5.45766822
#25: Epifanio 2014-01-01 9.763512 17.13958472
#26: Epifanio 2017-01-01 8.899895 14.89054015
#27: Epifanio 2019-01-01 14.606180 0.13357331
#28: Keasha 2013-01-01 8.253522 6.44769498
#29: Keasha 2014-01-01 12.570871 0.40402566
#30: Keasha 2016-01-01 12.111212 14.08734943
#31: Keasha 2017-01-01 6.216919 0.06878532
#32: Keasha 2018-01-01 7.454885 0.38399123
#33: Keasha 2019-01-01 6.433044 1.09828333
#34: Robyn 2010-01-01 7.396294 8.41399676
#35: Robyn 2011-01-01 5.589344 1.33792036
#36: Robyn 2012-01-01 11.422883 1.66129246
#37: Robyn 2015-01-01 12.973088 2.54144396
#38: Robyn 2017-01-01 9.100841 6.78346573
#39: Robyn 2019-01-01 11.049333 4.75902075
Or, instead of if, we can use the .N >= 6 condition directly to subset .SD:
testDT[, .SD[.N>=6], by = name]
It could be a little slow, so another option would be to use .I to get the row indices and then subset:
testDT[testDT[, .I[.N>=6], by = name]$V1]
I have two data.tables:
original <- data.frame(id = c(rep("RE01", 5), rep("RE02", 5)),
                       date.time = head(seq.POSIXt(as.POSIXct("2015-11-01 01:00:00"),
                                                   as.POSIXct("2015-11-05 01:00:00"),
                                                   60 * 60 * 10), 10))
compare <- data.frame(id = c("RE01", "RE02"),
                      seq = c(1, 2),
                      start = as.POSIXct(c("2015-11-01 20:00:00", "2015-11-04 08:00:00")),
                      end = as.POSIXct(c("2015-11-02 08:00:00", "2015-11-04 20:00:00")))
setDT(original)
setDT(compare)
I would like to check the date in each row of original and see if it lies between the start and end dates of compare, while respecting the id. If it does, the corresponding value of compare$seq should be passed to original as a new column diff.seq. The output should look like this:
original
id date.time diff.seq
1 RE01 2015-11-01 01:00:00 NA
2 RE01 2015-11-01 11:00:00 NA
3 RE01 2015-11-01 21:00:00 1
4 RE01 2015-11-02 07:00:00 1
5 RE01 2015-11-02 17:00:00 NA
6 RE02 2015-11-03 03:00:00 NA
7 RE02 2015-11-03 13:00:00 NA
8 RE02 2015-11-03 23:00:00 NA
9 RE02 2015-11-04 09:00:00 2
10 RE02 2015-11-04 19:00:00 2
I've been reading the manual and SO for hours, trying "on", "by" and so on, without any success. Can anybody point me in the right direction?
As said in the comments, this is very straightforward using data.table::foverlaps.
You basically have to create an additional column in the original data set to set the join boundaries, then key the two data sets by the columns you want to join on, and then simply run foverlaps and select the desired columns:
original[, end := date.time]   # zero-width interval: each observation is its own start and end
setkey(original, id, date.time, end)
setkey(compare, id, start, end)
foverlaps(original, compare)[, .(id, date.time, seq)]
# id date.time seq
# 1: RE01 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2
Alternatively, you can run foverlaps the other way around and then just update the original data set by reference, selecting the correct rows to update:
indx <- foverlaps(compare, original, which = TRUE)  # xid indexes compare rows, yid indexes original rows
original[indx$yid, diff.seq := indx$xid]
original
# id date.time end diff.seq
# 1: RE01 2015-11-01 01:00:00 2015-11-01 01:00:00 NA
# 2: RE01 2015-11-01 11:00:00 2015-11-01 11:00:00 NA
# 3: RE01 2015-11-01 21:00:00 2015-11-01 21:00:00 1
# 4: RE01 2015-11-02 07:00:00 2015-11-02 07:00:00 1
# 5: RE01 2015-11-02 17:00:00 2015-11-02 17:00:00 NA
# 6: RE02 2015-11-03 03:00:00 2015-11-03 03:00:00 NA
# 7: RE02 2015-11-03 13:00:00 2015-11-03 13:00:00 NA
# 8: RE02 2015-11-03 23:00:00 2015-11-03 23:00:00 NA
# 9: RE02 2015-11-04 09:00:00 2015-11-04 09:00:00 2
# 10: RE02 2015-11-04 19:00:00 2015-11-04 19:00:00 2