I'm using the timeSeries package, in particular the align function. My data contain gaps, and I want to fill the NAs by carrying the last available value forward. However, align() does not seem to fill through to the end of the sample when a series ends with an NA.
An example: I have a non-aligned time series
> notAligned
GMT
TS.1 TS.2 TS.3 TS.4
2011-02-03 NA 1 4 8
2011-02-04 1 NA 2 NA
2011-02-07 5 6 NA NA
2011-02-08 NA 2 NA 9
If I use the align function, it returns this
> align(notAligned)
GMT
TS.1 TS.2 TS.3 TS.4
2011-02-03 NA 1 4 8
2011-02-04 1 1 2 8
2011-02-07 5 6 NA 8
2011-02-08 NA 2 NA 9
It correctly fills TS.2 on the 4th and TS.4 on the 4th and 7th, but doesn't fill TS.1 on the 8th with 5, or TS.3 on the 7th and 8th with 2. I would expect align to fill them...
Did I misunderstand the function? Is there a way to work around this?
Thanks for your help
I have no idea why timeSeries::align doesn't work, but I would just use zoo::na.locf:
na.locf(notAligned, na.rm=FALSE)
# GMT
# TS.1 TS.2 TS.3 TS.4
# 2011-02-03 NA 1 4 8
# 2011-02-04 1 1 2 8
# 2011-02-07 5 6 2 8
# 2011-02-08 5 2 2 9
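If you would rather not depend on zoo, last-observation-carried-forward can also be sketched in base R. This is only an illustration: `locf` is a hypothetical helper name, and the matrix `m` simply re-types the values shown above instead of using a real timeSeries object.

```r
# Hypothetical helper: carry the last non-NA value forward (leading NAs stay NA)
locf <- function(x) {
  # index of the most recent non-NA position seen so far (0 if none yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  idx[idx == 0] <- NA
  x[idx]
}

# Values re-typed from the example above (plain matrix, not a timeSeries)
m <- cbind(TS.1 = c(NA, 1, 5, NA),
           TS.2 = c(1, NA, 6, 2),
           TS.3 = c(4, 2, NA, NA),
           TS.4 = c(8, NA, NA, 9))

filled <- apply(m, 2, locf)
filled
```

Each column is filled independently, which should give the same values as the na.locf output above.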
I've got the following data frame df
time <- c("01/01/1951", "02/01/1951", "03/01/1951", "04/01/1951", "05/01/1951", "06/01/1951", "07/01/1951", "08/01/1951", "09/01/1951", "10/01/1951", "11/01/1951", "12/01/1951", "13/01/1951", "14/01/1951", "15/01/1951", "16/01/1951", "17/01/1951", "18/01/1951", "19/01/1951", "20/01/1951", "21/01/1951", "22/01/1951", "23/01/1951")
member <- c(1,NA,NA,3,NA,NA,NA,NA,NA,1,1,NA,2,NA,NA,NA,NA,NA,1,NA,NA,NA,NA)
df <- data.frame(time, member)
df$time = as.Date(df$time,format="%d/%m/%Y")
I would like any day with an NA value for "member" that comes directly before a day where member is 1 to become a 0. If the day before a 1 is itself a 1 (two consecutive ones), it should stay a 1; only the NA values before a 1 should change.
the desired data frame would be:
df
time member
1 01/01/1951 1
2 02/01/1951 NA
3 03/01/1951 NA
4 04/01/1951 3
5 05/01/1951 NA
6 06/01/1951 NA
7 07/01/1951 NA
8 08/01/1951 NA
9 09/01/1951 0
10 10/01/1951 1
11 11/01/1951 1
12 12/01/1951 NA
13 13/01/1951 2
14 14/01/1951 NA
15 15/01/1951 NA
16 16/01/1951 NA
17 17/01/1951 NA
18 18/01/1951 0
19 19/01/1951 1
20 20/01/1951 NA
21 21/01/1951 NA
22 22/01/1951 NA
23 23/01/1951 NA
ideas?
So we need to check if df$member is NA and the next value is 1. When both of those are true, we set df$member equal to 0:
df$member[is.na(df$member) & c(df$member[-1] == 1, FALSE)] = 0
df
# time member
# 1 1951-01-01 1
# 2 1951-01-02 NA
# 3 1951-01-03 NA
# 4 1951-01-04 3
# 5 1951-01-05 NA
# 6 1951-01-06 NA
# 7 1951-01-07 NA
# 8 1951-01-08 NA
# 9 1951-01-09 0
# 10 1951-01-10 1
# 11 1951-01-11 1
# 12 1951-01-12 NA
# 13 1951-01-13 2
# 14 1951-01-14 NA
# 15 1951-01-15 NA
# 16 1951-01-16 NA
# 17 1951-01-17 NA
# 18 1951-01-18 0
# 19 1951-01-19 1
# 20 1951-01-20 NA
# 21 1951-01-21 NA
# 22 1951-01-22 NA
# 23 1951-01-23 NA
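The comparison `df$member[-1] == 1` returns NA wherever the next value is NA, and the assignment only works because R ignores NA subscripts when a single value is assigned. A variant that avoids NAs in the index altogether, assuming the same `member` vector as in the question, could look like this (`next_is_one` is just an illustrative name):

```r
member <- c(1,NA,NA,3,NA,NA,NA,NA,NA,1,1,NA,2,NA,NA,NA,NA,NA,1,NA,NA,NA,NA)

# %in% never returns NA, so the index is strictly TRUE/FALSE
next_is_one <- c(member[-1], NA) %in% 1

member[is.na(member) & next_is_one] <- 0
which(member == 0)
```

Only the NA entries directly before a 1 (positions 9 and 18) are changed; the 1 in position 10 is left alone because it is not NA.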
I only have basic knowledge of R, so I hope you can help me with my problem and that it is not too stupid a question ;-)
I have a dataset called "rope". It looks like the following:
head(rope)
X...Sound Time.real. Time.in.Video. Observations
1 5_min_blank 10:18 03:59 (2) 2
2 5_min_blank NA
3 Fisch1 10:23 08:59 6
4 Fisch1 NA
5 Fisch1 NA
6 Fisch1 NA
Observation.total.time Time.of.the.shark.in.the.video
1 60 23
2 37
3 157 17
4 46
5 37
6 28
Time.of.the.shark.entering.the.video
1 04:03
2 04:20
3 08:49
4 09:06
5 09:23
6 10:21
Time.of.the.shark.leaving.the.video
1 04:26
2 04:57
3 09:05
4 09:52
5 10:00
6 10:49
times.the.shark.turns.to.the.speaker directional.change
1 1 5
2 2 11
3 1 1
4 4 6
5 3 6
6 2 7
flap.of.the.fins..fotf. flap.of.the.fins..second corrected.fotf.s
1 14 0,608695652 0.7777778
2 14 0,378378378 0.5600000
3 0 NA
4 30 0,652173913 0.6818182
5 0 0 NA
6 15 0,535714286 0.6521739
Notes complete.cyrcles swims.below.b..above.a..speaker
1 1 NA
2 NA
3 NA
4 2 NA
5 NA
6 NA
Swimming.patterns date X
1 3 21.07.17 NA
2 9 21.07.17 NA
3 NA 21.07.17 NA
4 9 21.07.17 NA
5 4 21.07.17 NA
6 4 21.07.17 NA
I have several different sounds. The first sound is "Fish1", but I also have "Fish2" and "Diving", for example. Between the sounds there are the corresponding pauses, which are called "Fish1_pause", "Fish2_pause", "Diving_pause", etc.
Now I would like to split my data into the sound data points and the "pause" data points.
I tried:
sound<-subset(rope, rope$X...Sound=="Fish1"& rope$X...Sound=="Fish2")
but I got no data points at all. If I only type:
sound<-subset(rope, rope$X...Sound=="Fish1")
I get all data points where I have the Fish1 sound.
My question is: how can I get all sound points?
With "&" it didn't work. I hope you understand my problem and can help me.
Thank you very much and all the best
Jessi
sound<-subset(rope, rope$X...Sound=="Fish1"& rope$X...Sound=="Fish2")
should be replaced by either
sound<-subset(rope, rope$X...Sound == "Fish1" | rope$X...Sound == "Fish2")
or
sound<-subset(rope, rope$X...Sound %in% c("Fish1","Fish2"))
As it is, you are asking for observations where X...Sound is simultaneously "Fish1" and "Fish2" -- which is impossible.
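If the goal is to separate every sound from every pause without listing the names by hand, matching on the "_pause" suffix is one option. This is only a sketch: `rope_demo` and its `Sound` column are made-up stand-ins for the real `rope$X...Sound`, and it assumes every pause label ends in "_pause".

```r
# Made-up stand-in for the real data
rope_demo <- data.frame(
  Sound = c("Fish1", "Fish1_pause", "Fish2", "Diving", "Diving_pause"),
  n     = 1:5,
  stringsAsFactors = FALSE
)

# TRUE for labels that end in "_pause"
is_pause <- grepl("_pause$", rope_demo$Sound)

pauses <- rope_demo[is_pause, ]
sounds <- rope_demo[!is_pause, ]
```

Anything that does not end in "_pause" lands in `sounds`, so a label such as "5_min_blank" would need to be excluded separately if it should not count as a sound.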
Here is a simplified version of what my data set looks like:
> df
ID total_sleep sleep_end_date
1 1 9 2017-09-03
2 1 8 2017-09-04
3 1 7 2017-09-05
4 1 10 2017-09-06
5 1 11 2017-09-07
6 2 5 2017-09-03
7 2 12 2017-09-04
8 2 4 2017-09-05
9 2 3 2017-09-06
10 2 6 2017-09-07
Where total_sleep is expressed in hours.
What I am trying to find is the absolute difference in hours of sleep between every two consecutive dates, for a given user ID. The desired output should look something like this:
> df_answer
ID total_sleep sleep_end_date diff_hours_of_sleep
1 1 9 2017-09-03 NA
2 1 8 2017-09-04 1
3 1 7 2017-09-05 1
4 1 10 2017-09-06 3
5 1 11 2017-09-07 1
6 2 5 2017-09-03 NA
7 2 12 2017-09-04 7
8 2 4 2017-09-05 8
9 2 3 2017-09-06 1
10 2 6 2017-09-08 NA
NA appears in rows 1 and 6 because it doesn't have any data concerning the day before.
Most importantly, NA appears in row 10 because I don't have any data concerning the previous day (2017-09-07). And this has been the trickiest part to code for me.
I've googled (meaning: "stackoverflowed") this and tried to find a solution using the "data wrangling cheatsheet" for dplyr, but I haven't been able to find a function that does what I want while taking both variables into account: the date and the user ID.
I am a beginner in R, so I might indeed be missing something simple. Any input or suggestion would be very welcome!
## Order data.frame by IDs, then by increasing sleep_end_dates (if not already sorted)
df <- df[order(df$ID, df$sleep_end_date),]
## Calculate difference in total_sleep with previous entry
df$diff_hours_of_sleep <- c(NA,abs(diff(df$total_sleep)))
## If previous ID is not equal, replace diff_hours_of_sleep with NA
ind <- c(NA, diff(df$ID))
df$diff_hours_of_sleep[ind != 0] <- NA
## And if previous day wasn't yesterday, replace diff_hours_of_sleep with NA
day_ind <- c(NA, diff(df$sleep_end_date))
df$diff_hours_of_sleep[day_ind != 1] <- NA
Maybe the following will do it.
df <- lapply(split(df, df$ID), function(x){
y <- ifelse(diff(x$sleep_end_date) == 1, abs(diff(x$total_sleep)), NA)
x$diff_hours_of_sleep <- c(NA, y)
x
})
df <- do.call(rbind, df)
df
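The same idea can also be written without splitting the data frame, by using ave() to lag values within each ID. This is a sketch under two assumptions: sleep_end_date is a Date column, and the last row is dated 2017-09-08 as in the desired output. `lag1`, `prev_sleep`, `prev_date` and `gap` are illustrative names.

```r
df <- data.frame(
  ID = rep(1:2, each = 5),
  total_sleep = c(9, 8, 7, 10, 11, 5, 12, 4, 3, 6),
  sleep_end_date = as.Date(c("2017-09-03", "2017-09-04", "2017-09-05",
                             "2017-09-06", "2017-09-07", "2017-09-03",
                             "2017-09-04", "2017-09-05", "2017-09-06",
                             "2017-09-08"))
)
df <- df[order(df$ID, df$sleep_end_date), ]

# Shift a vector down by one position, padding with NA at the start
lag1 <- function(v) c(NA, head(v, -1))

# Previous row's values within each ID (NA for the first row of an ID)
prev_sleep <- ave(df$total_sleep, df$ID, FUN = lag1)
prev_date  <- ave(as.numeric(df$sleep_end_date), df$ID, FUN = lag1)

# Days elapsed since the previous record of the same ID
gap <- as.numeric(df$sleep_end_date) - prev_date

# Keep the difference only when the previous record is exactly one day earlier
df$diff_hours_of_sleep <- ifelse(gap == 1, abs(df$total_sleep - prev_sleep), NA)
df
```

Rows 1 and 6 are NA because they are the first record of their ID, and row 10 is NA because its previous record is two days earlier.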
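Here is a solution using data.table (it assumes the ID column is named id, in lower case):
library(data.table)
dt1 <- data.table(df, key=c('id', 'sleep_end_date'))
## Join each row (i = .I - 1) to the row before it (i = .I) within the same id
## delta is signed (previous minus current); wrap it in abs() for the absolute difference
merge(
dt1[,.(id, total_sleep, sleep_end_date, i=.I - 1)],
dt1[,.(id, total_sleep, i=.I)], by=c('id','i'), all.x=TRUE)[,.(id, sleep_end_date,
total_sleep.x, delta=total_sleep.y-total_sleep.x)]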
id sleep_end_date total_sleep.x delta
1: 1 2017-09-03 9 NA
2: 1 2017-09-04 8 1
3: 1 2017-09-05 7 1
4: 1 2017-09-06 10 -3
5: 1 2017-09-07 11 -1
6: 2 2017-09-03 5 NA
7: 2 2017-09-04 12 -7
8: 2 2017-09-05 4 8
9: 2 2017-09-06 3 1
10: 2 2017-09-07 6 -3
I'm not sure how the performance compares to the pure data.frame approach, but it does appear to scale well: with the input extended to 20,000 rows, this took under one second on my system.
Say I have a data frame as follows
rsi5 rsi10
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 44.96650 NA
7 39.68831 NA
8 28.35625 NA
9 37.77910 NA
10 53.54822 NA
11 52.05308 46.01867
12 80.44368 66.09973
13 60.88418 56.04507
14 53.59851 52.10633
15 46.45874 48.23648
I simply wish to add 1 (i.e. 9 becomes 10) to each non-NA element of this data frame. There is probably a very simple solution to this, but simple arithmetic on data frames does not seem to work in R and gives very strange results.
Just use + 1 as you would expect. Below is a mock example, as it wasn't worth copying your data for this.
Step One: Create a data.frame
R> df <- data.frame(A=c(NA, 1, 2, 3), B=c(NA, NA, 12, 13))
R> df
A B
1 NA NA
2 1 NA
3 2 12
4 3 13
R>
Step Two: Add one
R> df + 1
A B
1 NA NA
2 2 NA
3 3 13
4 4 14
R>
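One possible cause of "strange results" is a data frame that also contains non-numeric columns, which is only a guess since the question does not show the failing code. In that case the arithmetic can be restricted to the numeric columns. The `label` column below is made up for illustration.

```r
df <- data.frame(
  rsi5  = c(NA, 44.9665, 39.68831),
  label = c("a", "b", "c"),   # made-up non-numeric column
  stringsAsFactors = FALSE
)

# Pick out the numeric columns and add 1 only to those
num <- vapply(df, is.numeric, logical(1))
df[num] <- df[num] + 1
df
```

NA values stay NA, and the non-numeric column is left untouched.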
I'm trying to create a 3d scatter plot using the following script:
d <- read.table(file='myfile.dat', header=F)
plot3d(
d,
xlim=c(0,20),
ylim=c(0,20),
zlim=c(0,10000),
xlab='Frequency',
ylab='Size',
zlab='Number of subgraphs',
box=F,
type='s',
size=0.5,
col=d[,1]
)
lines3d(
d,
xlim=c(2,20),
ylim=c(0,20),
zlim=c(0,10000),
lwd=2,
col=d[,1]
)
grid3d(side=c('x', 'y+', 'z'))
Now for some reason, R is ignoring the range limits I've specified and is using arbitrary values, messing up my plot. I get no error when I run the script. Does anybody have any idea what's wrong? If required, I can also post an image of the plot that is created. The data file is given below:
myfile.dat
11 2 2
NA NA NA
10 2 2
NA NA NA
13 2 1
NA NA NA
15 2 1
NA NA NA
5 2 11
5 3 10
5 4 16
5 5 34
5 6 102
5 7 294
5 8 682
5 9 1439
5 10 2646
5 11 3615
5 12 2844
5 13 1394
NA NA NA
4 2 10
4 3 4
4 4 4
4 5 10
4 6 38
4 7 132
4 8 396
4 9 976
4 10 2121
4 11 4085
4 12 6261
4 13 6459
4 14 4238
4 15 1394
NA NA NA
7 2 3
NA NA NA
6 2 2
NA NA NA
9 2 8
9 3 6
9 4 4
9 5 5
NA NA NA
8 2 4
8 3 10
8 4 22
8 5 52
8 6 126
8 7 264
8 8 478
8 9 729
8 10 943
8 11 754
8 12 382
NA NA NA
The help page, ?plot3d says "Note that since rgl does not currently support clipping, all points will be plotted, and 'xlim', 'ylim', and 'zlim' will only be used to increase the respective ranges." So you need to restrict the data in the input stage. (And you will need to use segments3d instead of lines3d if you only want particular ranges that are inside the plotted volume.)
d2 <- subset(d, d[,1]>0 & d[,1]<20 & d[,2]>0 & d[,2]<20 & d[,3]>0 & d[,3]<10000)
plot3d(
d2[, 1:3], # You can probably use something more meaningful,
xlim=c(0,20),
ylim=c(0,20),
zlim=c(0,10000),
xlab='Frequency',
ylab='Size',
zlab='Number of subgraphs',
box=F,
type='s',
size=0.5,
col=d2[,1]
)
(I did notice that with a z range of c(0,10000) the points were so small as to be pretty much invisible, and further experimentation suggests that the great disparity in ranges is going to cause further difficulties in keeping the ranges at 0 on the low side once you increase the size to the point where the points are visible. If you make the points really big, they expand the range to accommodate the overlap beyond the x=0 or y=0 planes.)
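One thing to watch with subset() is that it drops the all-NA rows as well, because the comparisons evaluate to NA there. If those rows in myfile.dat are meant as breaks between separate lines (an assumption on my part), a filter that keeps them might look like the sketch below. The small `d` here is made-up demo data with the default V1..V3 column names, and `inside`, `separator` and `d_lines` are illustrative names.

```r
# Made-up demo data in the same shape as myfile.dat
d <- data.frame(V1 = c(11, NA, 5, 5, NA, 25),
                V2 = c(2, NA, 2, 3, NA, 2),
                V3 = c(2, NA, 11, 20000, NA, 4))

# TRUE when a row lies strictly inside all three ranges (NA for the NA rows)
inside <- d$V1 > 0 & d$V1 < 20 &
          d$V2 > 0 & d$V2 < 20 &
          d$V3 > 0 & d$V3 < 10000

# Rows containing NA act as separators and are kept on purpose
separator <- !complete.cases(d)

d_lines <- d[separator | (!is.na(inside) & inside), ]
d_lines
```

Rows 4 and 6 fall outside the ranges and are dropped, while the NA rows survive so they can still separate the groups.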
As DWin said, lines3d does not handle *lim arguments. From the help page, "... Material properties (see rgl.material), normals and texture coordinates (see rgl.primitive)."
So use some other function, or perhaps you could retrieve the existing limits from your plot3d call and use those to scale your data prior to plotting?