fill gaps in a timeseries with averages - r

I have a dataframe like so:
day sum_flux samples mean
2005-10-26 0.02 48 0.02
2005-10-27 0.12 12 0.50
It's a series of daily readings spanning 5 years, however some of the days are missing. I want to fill these days with the average of that month from other years.
i.e if 26-10-2005 was missing I'd want to use the average of all Octobers in the data set.
if all of October was missing I'd want to apply this average to each missing day.
I think I need to build a function (possibly using plyr) to evaluate the days. However I'm very inexperienced with using the various timeseries objects in R, and conditionally subsetting data and would like some advice. Especially regarding which type of timeseries I should be using.
Many Thanks

Some sample data. I'm assuming that sum_flux is the column that has missing values, and that you want to calculate values for.
library(lubridate)
days <- seq.POSIXt(ymd("2005-10-26"), ymd("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.frame(
day = days,
sum_flux = runif(n_days),
samples = sample(100, n_days, replace = TRUE),
mean = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
Add a month column.
readings$month <- month(readings$day, label = TRUE)
Use tapply to get the monthly mean flux.
monthly_avg_flux <- with(readings, tapply(sum_flux, month, mean, na.rm = TRUE))
Use this value whenever the flux is missing, or keep the flux if not.
readings$sum_flux2 <- with(readings, ifelse(
is.na(sum_flux),
monthly_avg_flux[month],
sum_flux
))

This is one (very fast) way in data.table.
Using the nice example data from Richie :
require(data.table)
days <- seq(as.IDate("2005-10-26"), as.IDate("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.table(
day = days,
sum_flux = runif(n_days),
samples = sample(100, n_days, replace = TRUE),
mean = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
readings
day sum_flux samples mean
[1,] 2005-10-26 0.32838686 94 0.09647325
[2,] 2005-10-27 0.14686591 88 0.48728321
[3,] 2005-10-28 0.25800913 51 0.72776002
[4,] 2005-10-29 0.09628937 81 0.80954124
[5,] 2005-10-30 0.70721591 23 0.60165240
[6,] 2005-10-31 0.59555079 2 0.96849533
[7,] 2005-11-01 NA 42 0.37566491
[8,] 2005-11-02 0.01649860 89 0.48866220
[9,] 2005-11-03 0.46802818 49 0.28920807
[10,] 2005-11-04 0.13024856 30 0.29051080
First 10 rows of 1827 printed.
Create the average for each month, in appearance order of each group :
> avg = readings[,mean(sum_flux,na.rm=TRUE),by=list(mnth = month(day))]
> avg
mnth V1
[1,] 10 0.4915999
[2,] 11 0.5107873
[3,] 12 0.4451787
[4,] 1 0.4966040
[5,] 2 0.4972244
[6,] 3 0.4952821
[7,] 4 0.5106539
[8,] 5 0.4717122
[9,] 6 0.5110490
[10,] 7 0.4507383
[11,] 8 0.4680827
[12,] 9 0.5150618
Next reorder avg to start in January :
avg = avg[order(mnth)]
avg
mnth V1
[1,] 1 0.4966040
[2,] 2 0.4972244
[3,] 3 0.4952821
[4,] 4 0.5106539
[5,] 5 0.4717122
[6,] 6 0.5110490
[7,] 7 0.4507383
[8,] 8 0.4680827
[9,] 9 0.5150618
[10,] 10 0.4915999
[11,] 11 0.5107873
[12,] 12 0.4451787
Now update by reference (:=) the sum_flux column, where sum_flux is NA, with the value from avg for that month.
readings[is.na(sum_flux), sum_flux:=avg$V1[month(day)]]
day sum_flux samples mean
[1,] 2005-10-26 0.32838686 94 0.09647325
[2,] 2005-10-27 0.14686591 88 0.48728321
[3,] 2005-10-28 0.25800913 51 0.72776002
[4,] 2005-10-29 0.09628937 81 0.80954124
[5,] 2005-10-30 0.70721591 23 0.60165240
[6,] 2005-10-31 0.59555079 2 0.96849533
[7,] 2005-11-01 0.51078729** 42 0.37566491 # ** updated with the Nov avg
[8,] 2005-11-02 0.01649860 89 0.48866220
[9,] 2005-11-03 0.46802818 49 0.28920807
[10,] 2005-11-04 0.13024856 30 0.29051080
First 10 rows of 1827 printed.
Done.

Related

Accounting with apply not working

I am trying to use accounting from the formattable package within apply, and it does not seem to working -
library(formattable)
set.seed(4226)
temp = data.frame(a = sample(1000:50000, 10), b = sample(1000:50000, 10),
c = sample(1000:50000, 10), d = sample(1000:50000, 10))
temp
a b c d
1 45186 17792 43363 17080
2 26982 25410 2327 17982
3 45204 39757 29883 4283
4 27069 21334 10497 28776
5 47895 46241 22743 36257
6 30161 45254 21382 42275
7 18278 28936 27036 23620
8 31199 30182 10235 7355
9 10664 40312 28324 20864
10 45225 45545 44394 13364
apply(temp, 2, function(x){x = accounting(x, digits = 0)})
a b c d
[1,] 45186 17792 43363 17080
[2,] 26982 25410 2327 17982
[3,] 45204 39757 29883 4283
[4,] 27069 21334 10497 28776
[5,] 47895 46241 22743 36257
[6,] 30161 45254 21382 42275
[7,] 18278 28936 27036 23620
[8,] 31199 30182 10235 7355
[9,] 10664 40312 28324 20864
[10,] 45225 45545 44394 13364
What I want is -
a b c d
[1,] 45,186 17,792 43,363 17,080
[2,] 26,982 25,410 2,327 17,982
[3,] 45,204 39,757 29,883 4,283
[4,] 27,069 21,334 10,497 28,776
[5,] 47,895 46,241 22,743 36,257
[6,] 30,161 45,254 21,382 42,275
[7,] 18,278 28,936 27,036 23,620
[8,] 31,199 30,182 10,235 7,355
[9,] 10,664 40,312 28,324 20,864
[10,] 45,225 45,545 44,394 13,364
You probably want to keep things as a data frame, in which case apply is not the right tool. It will always give you a matrix back.
You might want one of the following options:
temp[cols] <- lapply(temp[cols], function(x){accounting(x, digits = 0)})
or
as.data.frame(lapply(temp[cols], function(x){accounting(x, digits = 0)}))
or using dplyr something like:
temp %>%
mutate_at(.vars = cols,.funs = accounting,digits = 0)

Calculate average of lowest values of matrix rows

I have a large matrix, e.g.
> mat = matrix(runif(100), ncol = 5)
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0.264442954 0.6408534 0.76472904 0.2437074 0.08019882
[2,] 0.575443586 0.6428957 0.44188123 0.0230842 0.07502289
[3,] 0.894885901 0.5926238 0.55431966 0.7717503 0.52806173
[4,] 0.231978411 0.1192595 0.08170498 0.4264405 0.97486053
[5,] 0.344765840 0.5349323 0.85523617 0.2257759 0.20549035
[6,] 0.499130844 0.9882825 0.99417390 0.8070708 0.29963075
[7,] 0.613479990 0.8877605 0.34282782 0.9525512 0.91488004
[8,] 0.967166001 0.6115709 0.68169111 0.3067973 0.30094691
[9,] 0.957612804 0.5565989 0.88180650 0.3359184 0.17980137
[10,] 0.342177768 0.7735620 0.48154937 0.3692096 0.31299886
[11,] 0.871928110 0.3397143 0.57596030 0.4749349 0.47800019
[12,] 0.387563040 0.1656725 0.47796646 0.8956274 0.68345302
[13,] 0.628535870 0.3418692 0.86513964 0.8052477 0.01850535
[14,] 0.379472842 0.9176644 0.08829197 0.8548662 0.42151935
[15,] 0.071958980 0.6644800 0.90061596 0.4484674 0.32649345
[16,] 0.229463192 0.9995178 0.63995121 0.8369698 0.35091430
[17,] 0.291761976 0.5014815 0.35260028 0.6188047 0.68192891
[18,] 0.077610797 0.2747788 0.07084273 0.5977530 0.37134566
[19,] 0.675912490 0.6059304 0.29321852 0.5638336 0.73866322
[20,] 0.006010715 0.7697045 0.43627939 0.1723969 0.88665973
I want to extract the lowest and highest 2 values of each row and calculate their average.
Eventually, I'd like to generate a new matrix where the first column in the average of the lowest values, and the second column is the average of the highest values.
Thanks in advance!
I believe this does what you want:
do.call(rbind, apply(mat,1, function(x) {sorted = sort(x);
return(data.frame(min=mean(head(sorted,2)), max=mean(tail(sorted,2))))}))
Output:
min max
1 0.14333229 0.8877635
2 0.12311651 0.5283049
3 0.09367614 0.5433373
4 0.39926848 0.6361645
5 0.05196898 0.5473783
6 0.12876148 0.6153546
7 0.29893684 0.8436462
8 0.14254481 0.7023039
9 0.20889814 0.8863141
10 0.44838327 0.8641790
11 0.14859312 0.5533045
12 0.19728414 0.8619284
13 0.37049481 0.7448965
14 0.30070570 0.9320575
15 0.30333510 0.6774024
16 0.21908982 0.7077274
17 0.61804571 0.9239816
18 0.36525615 0.8531795
19 0.22751108 0.4993744
20 0.14251095 0.6353147
Hope this helps!

How to order data based on repeated dates in R

Suppose that the following dataset is available:
date V1 V2
[1,] 1996-01-01 1.4628995 12
[2,] 1996-01-01 0.1972603 11
..............
[3,] 1996-02-01 0.1205479 11
[4,] 1996-02-01 0.9643836 9
..............
[5,] 1996-03-01 0.1972603 14
[6,] 1996-03-01 0.1205479 8
How can i order V1 in ascending order, for example, for each specific date, with the remaining variables, V2,V3... and so on, to follow the ordering. Like this:
date V1 V2
[1,] 1996-01-01 0.1972603 11
[2,] 1996-01-01 1.4628995 12
..............
[3,] 1996-02-01 0.1205479 11
[4,] 1996-02-01 0.9643836 9
..............
[5,] 1996-03-01 0.1205479 8
[6,] 1996-03-01 0.1972603 14
Thank you.
To sort by date and then by V1...
data <- data[order(as.Date(data$date),data$V1),]
In response to follow-up question in comment below, the rows with the two smallest values of V1 for each date can easily be selected using dplyr...
library(dplyr)
data2 <- data %>% group_by(date) %>% filter(rank(V1,ties.method = "min")<3)
Or, rather less intuitively, using base-R...
data2 <- data[as.logical(ave(data$V1,data$date,FUN=function(v) rank(v,ties.method = "min")<3)),]
You might need to fiddle with the parameters of rank to adjust the treatment of NA and the way it handles ties. See ?rank

R: calling a matrix value of column 2 dependent on the value of column 1

I admit that I am totally new to R and have a few beginner's problems;
my problem is the following:
I have quite a long matrix TEST of length 5000 with 2 columns (column 1 = time; column 2 = concentration of a species).
I want to use the right concentration values for calculation of propensities in stochastic simulations.
I already have an alogrithm that gives me the simulation time t_sim; what I would need is a line of code that gives the respective concentration value at t= t_sim;
also: the time vector might have a big step size so that t_sim would have to be rounded to a bigger value in order to call the respective concentration value.
I know this probably quite an easy problem but I really do not see the solution in R.
Best wishes and many thanks,
Arne
Without sample data this answer is kind of a shot in the dark, but I think that this might work:
t_conc <- TEST[which.min(abs(t_sim-TEST[,1])),2]
where TEST is the matrix with two columns as described in the OP and the output t_conc is the concentration that corresponds to the value of time in the matrix that is closest to the input value t_sim.
Here's another shot in the dark:
set.seed(1);
N <- 20; test <- matrix(c(sort(sample(100,N)),rnorm(N,0.5,0.2)),N,dimnames=list(NULL,c('time','concentration')));
test;
## time concentration
## [1,] 6 0.80235623
## [2,] 16 0.57796865
## [3,] 19 0.37575188
## [4,] 20 0.05706002
## [5,] 27 0.72498618
## [6,] 32 0.49101328
## [7,] 34 0.49676195
## [8,] 37 0.68876724
## [9,] 43 0.66424424
## [10,] 57 0.61878026
## [11,] 58 0.68379547
## [12,] 61 0.65642726
## [13,] 62 0.51491300
## [14,] 63 0.10212966
## [15,] 67 0.62396515
## [16,] 83 0.48877425
## [17,] 86 0.46884090
## [18,] 88 0.20584952
## [19,] 89 0.40436999
## [20,] 97 0.58358831
t_sim <- 39;
test[findInterval(t_sim,test[,'time']),'concentration'];
## concentration
## 0.6887672
Note that findInterval() returns the index of the lesser time value if t_sim falls between two time values, as my example shows. If you want the greater, you need a bit more work:
i <- findInterval(t_sim,test[,'time']);
if (test[i,'time'] != t_sim && i < nrow(test)) i <- i+1;
test[i,'concentration'];
## concentration
## 0.6642442
If you want the nearest, see R: find nearest index.

Reduce the large dataset into smaller data set using R

I want to reduce a very large dataset with two variables into a smaller file. What I want to do is I need to find the data points with the same values and then I want to keep only the starting and ending values and then remove all the data points in between them. For example
the sample dataset looks like following :
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.75 126.16542
363.79167 126.16542
363.83333 126.16542
363.875 126.16542
363.91667 0
363.95833 0
364 0
364.04167 0
364.08333 0
364.125 0
364.16667 0
364.20833 0
364.25 127.79872
364.29167 127.79872
364.33333 127.79872
364.375 127.79872
364.41667 127.79872
364.45833 127.79872
364.5 0
364.54167 0
364.58333 0
364.625 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
Here, the first two points have same values i.e 26.369 so I will keep them as it is. I need to write a condition i.e if two or more data points have same values then keep only starting and ending data points. Then the next two values also have same value i.e. 0 and i will keep these two. However, after that there are 5 data points with the same values. I need to write a program such that I want to write just two data points i.e 363.708 & 363.875 and remove data points in between them. After that I will keep only two data points with zero values i.e 363.91667 and 364.20833.
The sample output I am looking for is as follows:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.875 126.16542
363.91667 0
364.20833 0
364.25 127.79872
364.45833 127.79872
364.5 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
If your data is in a dataframe DF with column names a and b, then
runs <- rle(DF$b)
firsts <- cumsum(c(0,runs$length[-length(runs$length)]))+1
lasts <- cumsum(runs$length)
edges <- unique(sort(c(firsts, lasts)))
DF[edges,]
gives
> DF[edges,]
a b
1 363.5417 23.36990
2 363.5833 23.36990
3 363.6250 0.00000
4 363.6667 0.00000
5 363.7083 126.16542
9 363.8750 126.16542
10 363.9167 0.00000
17 364.2083 0.00000
18 364.2500 127.79872
23 364.4583 127.79872
24 364.5000 0.00000
28 364.6667 0.00000
29 364.7083 127.43202
30 364.7500 135.44052
31 364.7917 135.25522
32 364.8333 135.12892
33 364.8750 20.32986
34 364.9167 0.00000
35 364.9583 0.00000
rle gives the lengths of the groups that have the same value (floating point precision may be an issue if you have more decimal places). firsts and lasts give the row index of the first row of a group and the last row of a group, respectively. Put the indexes together, sort them, and get rid of duplicates (since a group of size one will list the same row as the first and last) and then index DF by the row numbers.
I'd use rle here (no surprise to those who know me :-) . Keeping in mind that you will want to check for approximate equality to avoid floating-point rounding problems, here's the concept. rle will return two sequences, one of which tells you how many times a value is repeated and the other tells you the value itself. Since you want to keep only single or double values, we'll essentially "shrink" all sequence values which are longer.
Edit: I recognize that this is relatively clunky code and a gentle touch with melt/cast should be far more efficient. I just liked doing this.
df<-cbind(1:20, sample(1:3,rep=T,20))
rdf<-rle(df[,2])
lenfoo<-rdf$lengths
cfoo<-cumsum(lenfoo)
repfoo<-ifelse(lenfoo==1,1,2)
outfoo<-matrix(nc=2)
for(j in 1:length(cfoo)) outfoo <- rbind( outfoo, matrix(rep(df[cfoo[j],],times=repfoo[j] ), nc=2,byrow=TRUE ) )
Rgames> df
[,1] [,2]
[1,] 1 2
[2,] 2 2
[3,] 3 3
[4,] 4 3
[5,] 5 3
[6,] 6 3
[7,] 7 3
[8,] 8 2
[9,] 9 2
[10,] 10 3
[11,] 11 1
[12,] 12 2
[13,] 13 2
[14,] 14 3
[15,] 15 1
[16,] 16 2
[17,] 17 1
[18,] 18 2
[19,] 19 3
[20,] 20 1
Rgames> outfoo
[,1] [,2]
[1,] NA NA
[2,] 2 2
[3,] 2 2
[4,] 7 3
[5,] 7 3
[6,] 9 2
[7,] 9 2
[8,] 10 3
[9,] 11 1
[10,] 13 2
[11,] 13 2
[12,] 14 3
[13,] 15 1
[14,] 16 2
[15,] 17 1
[16,] 18 2
[17,] 19 3
[18,] 20 1
x = tapply(df[[1]], df[[2]], range)
gives the values
cbind(unlist(x, use.names=FALSE), as.numeric(rep(names(x), each=2)))
gets a matrix. More explicitly, and avoiding coercion to / from character vectors
u = unique(df[[2]])
rng = sapply(split(df[[1]], match(df[[2]], u)), range)
cbind(as.vector(rng), rep(u, each=2))
If the data is very large then sort by df[[1]] and find the first (min) and last (max) values of each element of df[[2]]; combine these
df = df[order(df[[1]]),]
res = rbind(df[!duplicated(df[[2]]),], df[!duplicated(df[[2]], fromLast=TRUE),])
res[order(res[[2]]),]
perhaps setting the row names of the subset to NULL.

Resources