Calculate average of lowest values of matrix rows - r

I have a large matrix, e.g.
> mat = matrix(runif(100), ncol = 5)
> mat
[,1] [,2] [,3] [,4] [,5]
[1,] 0.264442954 0.6408534 0.76472904 0.2437074 0.08019882
[2,] 0.575443586 0.6428957 0.44188123 0.0230842 0.07502289
[3,] 0.894885901 0.5926238 0.55431966 0.7717503 0.52806173
[4,] 0.231978411 0.1192595 0.08170498 0.4264405 0.97486053
[5,] 0.344765840 0.5349323 0.85523617 0.2257759 0.20549035
[6,] 0.499130844 0.9882825 0.99417390 0.8070708 0.29963075
[7,] 0.613479990 0.8877605 0.34282782 0.9525512 0.91488004
[8,] 0.967166001 0.6115709 0.68169111 0.3067973 0.30094691
[9,] 0.957612804 0.5565989 0.88180650 0.3359184 0.17980137
[10,] 0.342177768 0.7735620 0.48154937 0.3692096 0.31299886
[11,] 0.871928110 0.3397143 0.57596030 0.4749349 0.47800019
[12,] 0.387563040 0.1656725 0.47796646 0.8956274 0.68345302
[13,] 0.628535870 0.3418692 0.86513964 0.8052477 0.01850535
[14,] 0.379472842 0.9176644 0.08829197 0.8548662 0.42151935
[15,] 0.071958980 0.6644800 0.90061596 0.4484674 0.32649345
[16,] 0.229463192 0.9995178 0.63995121 0.8369698 0.35091430
[17,] 0.291761976 0.5014815 0.35260028 0.6188047 0.68192891
[18,] 0.077610797 0.2747788 0.07084273 0.5977530 0.37134566
[19,] 0.675912490 0.6059304 0.29321852 0.5638336 0.73866322
[20,] 0.006010715 0.7697045 0.43627939 0.1723969 0.88665973
I want to extract the 2 lowest and the 2 highest values of each row and calculate the average of each pair.
Eventually, I'd like to generate a new matrix where the first column is the average of the lowest values and the second column is the average of the highest values.
Thanks in advance!

I believe this does what you want:
do.call(rbind, apply(mat,1, function(x) {sorted = sort(x);
return(data.frame(min=mean(head(sorted,2)), max=mean(tail(sorted,2))))}))
Output:
min max
1 0.14333229 0.8877635
2 0.12311651 0.5283049
3 0.09367614 0.5433373
4 0.39926848 0.6361645
5 0.05196898 0.5473783
6 0.12876148 0.6153546
7 0.29893684 0.8436462
8 0.14254481 0.7023039
9 0.20889814 0.8863141
10 0.44838327 0.8641790
11 0.14859312 0.5533045
12 0.19728414 0.8619284
13 0.37049481 0.7448965
14 0.30070570 0.9320575
15 0.30333510 0.6774024
16 0.21908982 0.7077274
17 0.61804571 0.9239816
18 0.36525615 0.8531795
19 0.22751108 0.4993744
20 0.14251095 0.6353147
Hope this helps!
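Since the question asks for a matrix rather than a data frame, a variant that skips the per-row data frames may be worth considering. This is only a sketch: the seed and the names res, min and max are illustrative, and the numbers will not match the output printed above because the input is random.

```r
set.seed(1)                      # illustrative seed, not from the original post
mat <- matrix(runif(100), ncol = 5)

# For each row: sort once, then average the two smallest and two largest values.
res <- t(apply(mat, 1, function(x) {
  s <- sort(x)
  n <- length(s)
  c(min = mean(s[1:2]), max = mean(s[(n - 1):n]))
}))

head(res)
```

apply() returns one column per row of mat here, so t() is needed to get back to one row per original row.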

Related

Accounting with apply not working

I am trying to use accounting from the formattable package within apply, and it does not seem to be working:
library(formattable)
set.seed(4226)
temp = data.frame(a = sample(1000:50000, 10), b = sample(1000:50000, 10),
c = sample(1000:50000, 10), d = sample(1000:50000, 10))
temp
a b c d
1 45186 17792 43363 17080
2 26982 25410 2327 17982
3 45204 39757 29883 4283
4 27069 21334 10497 28776
5 47895 46241 22743 36257
6 30161 45254 21382 42275
7 18278 28936 27036 23620
8 31199 30182 10235 7355
9 10664 40312 28324 20864
10 45225 45545 44394 13364
apply(temp, 2, function(x){x = accounting(x, digits = 0)})
a b c d
[1,] 45186 17792 43363 17080
[2,] 26982 25410 2327 17982
[3,] 45204 39757 29883 4283
[4,] 27069 21334 10497 28776
[5,] 47895 46241 22743 36257
[6,] 30161 45254 21382 42275
[7,] 18278 28936 27036 23620
[8,] 31199 30182 10235 7355
[9,] 10664 40312 28324 20864
[10,] 45225 45545 44394 13364
What I want is -
a b c d
[1,] 45,186 17,792 43,363 17,080
[2,] 26,982 25,410 2,327 17,982
[3,] 45,204 39,757 29,883 4,283
[4,] 27,069 21,334 10,497 28,776
[5,] 47,895 46,241 22,743 36,257
[6,] 30,161 45,254 21,382 42,275
[7,] 18,278 28,936 27,036 23,620
[8,] 31,199 30,182 10,235 7,355
[9,] 10,664 40,312 28,324 20,864
[10,] 45,225 45,545 44,394 13,364
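You probably want to keep things as a data frame, in which case apply is not the right tool. It coerces its input to a matrix and always gives you a matrix back, so the formattable class is lost.
You might want one of the following options, where cols is a vector of the columns to format (for example, cols <- names(temp)):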
temp[cols] <- lapply(temp[cols], function(x){accounting(x, digits = 0)})
or
as.data.frame(lapply(temp[cols], function(x){accounting(x, digits = 0)}))
or using dplyr something like:
temp %>%
mutate_at(.vars = cols,.funs = accounting,digits = 0)
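The difference between apply and lapply can be seen without formattable installed. The sketch below swaps in base R's format(x, big.mark = ",") as a stand-in for accounting(); that substitution, the two-row data frame and the name as_matrix are assumptions made for illustration only. Note that format() yields character columns, whereas accounting() keeps the values numeric.

```r
temp <- data.frame(a = c(45186, 26982), b = c(17792, 25410))
cols <- names(temp)   # columns to format (assumed: all of them)

# apply() converts the data frame to a matrix before calling the function,
# so the result is a plain matrix and per-column classes cannot survive.
as_matrix <- apply(temp, 2, function(x) x)
class(as_matrix)

# lapply() visits each column in turn; assigning the list back into
# temp[cols] keeps the data frame and each column's own class.
temp[cols] <- lapply(temp[cols], function(x) format(x, big.mark = ","))
temp
```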

Reduce the large dataset into smaller data set using R

I want to reduce a very large dataset with two variables into a smaller file. For each run of consecutive data points that share the same value, I want to keep only the first and last points and remove all the points in between. For example,
the sample dataset looks like the following:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.75 126.16542
363.79167 126.16542
363.83333 126.16542
363.875 126.16542
363.91667 0
363.95833 0
364 0
364.04167 0
364.08333 0
364.125 0
364.16667 0
364.20833 0
364.25 127.79872
364.29167 127.79872
364.33333 127.79872
364.375 127.79872
364.41667 127.79872
364.45833 127.79872
364.5 0
364.54167 0
364.58333 0
364.625 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
Here, the first two points have the same value (23.3699), so I will keep them as they are. The condition I need is: if two or more consecutive data points have the same value, keep only the first and last of them. The next two points also share a value (0), so I keep both. After that, however, there are 5 data points with the same value; I want to keep just two of them, 363.70833 and 363.875, and remove the points in between. Next, I keep only the first and last of the zero-valued points, 363.91667 and 364.20833.
The sample output I am looking for is as follows:
363.54167 23.3699
363.58333 23.3699
363.625 0
363.66667 0
363.70833 126.16542
363.875 126.16542
363.91667 0
364.20833 0
364.25 127.79872
364.45833 127.79872
364.5 0
364.66667 0
364.70833 127.43202
364.75 135.44052
364.79167 135.25522
364.83333 135.12892
364.875 20.32986
364.91667 0
364.95833 0
If your data is in a dataframe DF with column names a and b, then
runs <- rle(DF$b)
firsts <- cumsum(c(0, runs$lengths[-length(runs$lengths)])) + 1
lasts <- cumsum(runs$lengths)
edges <- unique(sort(c(firsts, lasts)))
DF[edges,]
gives
> DF[edges,]
a b
1 363.5417 23.36990
2 363.5833 23.36990
3 363.6250 0.00000
4 363.6667 0.00000
5 363.7083 126.16542
9 363.8750 126.16542
10 363.9167 0.00000
17 364.2083 0.00000
18 364.2500 127.79872
23 364.4583 127.79872
24 364.5000 0.00000
28 364.6667 0.00000
29 364.7083 127.43202
30 364.7500 135.44052
31 364.7917 135.25522
32 364.8333 135.12892
33 364.8750 20.32986
34 364.9167 0.00000
35 364.9583 0.00000
rle gives the lengths of the groups that have the same value (floating point precision may be an issue if you have more decimal places). firsts and lasts give the row index of the first row of a group and the last row of a group, respectively. Put the indexes together, sort them, and get rid of duplicates (since a group of size one will list the same row as the first and last) and then index DF by the row numbers.
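To make the index arithmetic easier to follow, here is the same idea on a tiny made-up data frame (the values are illustrative, not taken from the question). The first row of each run is computed here as lasts minus the run length plus one, which should give the same result as the cumsum form above.

```r
DF <- data.frame(a = 1:9, b = c(5, 5, 0, 0, 0, 0, 7, 7, 7))

runs   <- rle(DF$b)                 # run lengths: 2, 4, 3
lasts  <- cumsum(runs$lengths)      # last row of each run: 2, 6, 9
firsts <- lasts - runs$lengths + 1  # first row of each run: 1, 3, 7
edges  <- sort(unique(c(firsts, lasts)))

DF[edges, ]                         # rows 1, 2, 3, 6, 7, 9

# If floating-point noise is a concern, one option (an assumption about
# the data, not something the question states) is to round before rle():
# runs <- rle(round(DF$b, 5))
```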
I'd use rle here (no surprise to those who know me :-) . Keeping in mind that you will want to check for approximate equality to avoid floating-point rounding problems, here's the concept. rle will return two sequences, one of which tells you how many times a value is repeated and the other tells you the value itself. Since you want to keep only single or double values, we'll essentially "shrink" all sequence values which are longer.
Edit: I recognize that this is relatively clunky code and a gentle touch with melt/cast should be far more efficient. I just liked doing this.
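df <- cbind(1:20, sample(1:3, 20, replace = TRUE))
rdf <- rle(df[, 2])
lenfoo <- rdf$lengths
cfoo <- cumsum(lenfoo)                # row index of the last element of each run
repfoo <- ifelse(lenfoo == 1, 1, 2)   # 1 row for singletons, 2 rows for longer runs
outfoo <- matrix(ncol = 2)            # starts as a single NA row, visible in the output below
for (j in seq_along(cfoo)) outfoo <- rbind(outfoo, matrix(rep(df[cfoo[j], ], times = repfoo[j]), ncol = 2, byrow = TRUE))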
Rgames> df
[,1] [,2]
[1,] 1 2
[2,] 2 2
[3,] 3 3
[4,] 4 3
[5,] 5 3
[6,] 6 3
[7,] 7 3
[8,] 8 2
[9,] 9 2
[10,] 10 3
[11,] 11 1
[12,] 12 2
[13,] 13 2
[14,] 14 3
[15,] 15 1
[16,] 16 2
[17,] 17 1
[18,] 18 2
[19,] 19 3
[20,] 20 1
Rgames> outfoo
[,1] [,2]
[1,] NA NA
[2,] 2 2
[3,] 2 2
[4,] 7 3
[5,] 7 3
[6,] 9 2
[7,] 9 2
[8,] 10 3
[9,] 11 1
[10,] 13 2
[11,] 13 2
[12,] 14 3
[13,] 15 1
[14,] 16 2
[15,] 17 1
[16,] 18 2
[17,] 19 3
[18,] 20 1
x = tapply(df[[1]], df[[2]], range)
gives the values
cbind(unlist(x, use.names=FALSE), as.numeric(rep(names(x), each=2)))
gets a matrix. More explicitly, and avoiding coercion to / from character vectors
u = unique(df[[2]])
rng = sapply(split(df[[1]], match(df[[2]], u)), range)
cbind(as.vector(rng), rep(u, each=2))
If the data is very large then sort by df[[1]] and find the first (min) and last (max) values of each element of df[[2]]; combine these
df = df[order(df[[1]]),]
res = rbind(df[!duplicated(df[[2]]),], df[!duplicated(df[[2]], fromLast=TRUE),])
res[order(res[[2]]),]
perhaps setting the row names of the subset to NULL.
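One thing to check before using the tapply approach: it groups by value, so all rows sharing a value fall into one group even when they are not adjacent, whereas rle groups by consecutive run. The small sketch below contrasts the two on made-up data; the names by_value, run_id and by_run are illustrative.

```r
df <- data.frame(a = 1:6, b = c(0, 0, 5, 5, 0, 0))

# Grouping by value: both blocks of zeros are merged into one group.
by_value <- tapply(df$a, df$b, range)
by_value[["0"]]        # 1 6

# Grouping by run: label each consecutive run, then take the range per label.
run_id <- with(rle(df$b), rep(seq_along(lengths), lengths))
by_run <- tapply(df$a, run_id, range)
by_run                 # three runs: 1-2, 3-4, 5-6
```

Which of the two is wanted depends on whether repeated values separated by other values should count as one group.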

Modified rollapply mean

I have a data file which consists of daily xy locations and a logical vector denoting whether or not the location is an outlier. Here is some (poorly created, I know) sample data:
x=seq(3,10,length.out=30)
y=seq(42,45,length.out=30)
outlier=c(F,F,F,F,F,F,F,F,T,T,T,F,F,F,F,F,F,F,F,F,F,T,F,T,F,F,F,F,F,F)
data=cbind(x,y,outlier)
> data
x y outlier
[1,] 3.000000000 42.00000000 0
[2,] 3.241379310 42.10344828 0
[3,] 3.482758621 42.20689655 0
[4,] 3.724137931 42.31034483 0
[5,] 3.965517241 42.41379310 0
[6,] 4.206896552 42.51724138 0
[7,] 4.448275862 42.62068966 0
[8,] 4.689655172 42.72413793 0
[9,] 4.931034483 42.82758621 1
[10,] 5.172413793 42.93103448 1
[11,] 5.413793103 43.03448276 1
[12,] 5.655172414 43.13793103 0
[13,] 5.896551724 43.24137931 0
[14,] 6.137931034 43.34482759 0
[15,] 6.379310345 43.44827586 0
[16,] 6.620689655 43.55172414 0
[17,] 6.862068966 43.65517241 0
[18,] 7.103448276 43.75862069 0
[19,] 7.344827586 43.86206897 0
[20,] 7.586206897 43.96551724 0
[21,] 7.827586207 44.06896552 0
[22,] 8.068965517 44.17241379 1
[23,] 8.310344828 44.27586207 0
[24,] 8.551724138 44.37931034 1
[25,] 8.793103448 44.48275862 0
[26,] 9.034482759 44.58620690 0
[27,] 9.275862069 44.68965517 0
[28,] 9.517241379 44.79310345 0
[29,] 9.758620690 44.89655172 0
[30,] 10.000000000 45.00000000 0
What I need is to take a non-overlapping 6-day mean of the x and y columns. This is easy enough with rollapply(). However, I do not want outlier=1 values to be included in the 6-day mean; nor do I want the 6-day window to 'span' the gap left behind by removing all rows where outlier=T. Instead, I want to make an exception to the 'non-overlapping rule'.
I think this is best explained using the sample data above: the first value should be the mean of rows 1:6, but rather than the second value being the mean of rows 7:12 (including outlier=1 values) or of rows c(7:8,12:15) (skipping over outlier=1 values) I want it to overlap with the first window and take the mean of rows 3:8.
So for the length 30 sample data above, the end result should be of length 5, showing the mean values of rows 1:6, 3:8, 12:17, 16:21 & 25:30 (ideally all values which result from overlapping windows should be labelled as such; i.e. values 1:4 overlap, whereas the final value is unique)
Here is a function that will give you the indices of the endpoints of the averages that you want:
findIndices<-function(outlier,window=6){
r<-rle(outlier)
rends<-cumsum(r$lengths)
segs<-cbind(rends-r$lengths+1,rends)
segs<-segs[with(r,lengths>=window & values==0),,drop=FALSE]
indices<-unlist(apply(segs,1,function(x) seq(x[1]+window-1,x[2],by=window)))
sort(unique(c(indices,segs[,2])))
}
findIndices(data[,3])
## [1] 6 8 17 21 30
You can then get the averages you want like this:
id<-findIndices(data[,3])
require(zoo)
cbind(index=id,rollmean(data[,1:2],6)[id-5,])
## index x y
## [1,] 6 3.603448 42.25862
## [2,] 8 4.086207 42.46552
## [3,] 17 6.258621 43.39655
## [4,] 21 7.224138 43.81034
## [5,] 30 9.396552 44.74138
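The question also asks for overlapping windows to be labelled. A rough base R sketch of that step follows; it takes the end indices printed above as given, recomputes the window means with colMeans instead of zoo, and flags a window when it shares rows with a neighbour. The names starts, overlaps_next, overlaps_prev and overlap are illustrative.

```r
x <- seq(3, 10, length.out = 30)
y <- seq(42, 45, length.out = 30)
data <- cbind(x, y)

id <- c(6, 8, 17, 21, 30)    # window end points, as returned by findIndices() above
window <- 6
starts <- id - window + 1    # 1, 3, 12, 16, 25

# Mean of x and y over each window.
means <- t(sapply(seq_along(id), function(i) {
  colMeans(data[starts[i]:id[i], , drop = FALSE])
}))

# A window overlaps its successor when it ends at or after the successor's start.
overlaps_next <- c(id[-length(id)] >= starts[-1], FALSE)
overlaps_prev <- c(FALSE, overlaps_next[-length(id)])
overlap <- overlaps_next | overlaps_prev

cbind(index = id, means, overlap = overlap)
```

For the sample data this marks the first four windows as overlapping and the last one as unique, which is the labelling described in the question.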
You can put it all together in a single function like this:
maWithOutliers<-function(x,outlier,window){
id<-findIndices(outlier,window)
cbind(index=id,rollmean(x,window)[id-window+1,])
}
> maWithOutliers(data[,1:2],data[,3],6)
index x y
[1,] 6 3.603448 42.25862
[2,] 8 4.086207 42.46552
[3,] 17 6.258621 43.39655
[4,] 21 7.224138 43.81034
[5,] 30 9.396552 44.74138
> maWithOutliers(data[,1:2],data[,3],4)
index x y
[1,] 4 3.362069 42.15517
[2,] 8 4.327586 42.56897
[3,] 15 6.017241 43.29310
[4,] 19 6.982759 43.70690
[5,] 21 7.465517 43.91379
[6,] 28 9.155172 44.63793
[7,] 30 9.637931 44.84483
>

R select entire columns where at least one value meets a condition

I have a large matrix, ~300 rows and 200000 cols. I want to shrink this down by selecting the entire columns that have at least one value that is > 0.5 or less than -0.5 (not just that particular value). I would like to keep the row and column names. I was able to get a matrix of TRUE/FALSE by doing tmp <- mymat > 0.5 | mymat < -0.5. I want to extract all columns that have at least one TRUE in them. I tried simply mymat[tmp] but this just returns a vector of the values that meet that condition. How can I get the actual columns of the original matrix? Thanks.
Try this:
> set.seed(007) # for the example being reproducible
> X <- matrix(rnorm(100), 20) # generating some data
> X <- cbind(X, runif(20, max=.48)) # generating a column with all values < 0.5
> colnames(X) <- paste('col', 1:ncol(X), sep='') # some column names
> X # this is what the matrix looks like
col1 col2 col3 col4 col5 col6
[1,] 2.287247161 0.83975036 1.218550535 0.07637147 0.342585350 0.335107187
[2,] -1.196771682 0.70534183 -0.699317079 0.15915528 0.004248236 0.419502015
[3,] -0.694292510 1.30596472 -0.285432752 0.54367418 0.029219842 0.346358090
[4,] -0.412292951 -1.38799622 -1.311552673 0.70480735 -0.393423429 0.212185020
[5,] -0.970673341 1.27291686 -0.391012431 0.31896914 -0.792704563 0.224824248
[6,] -0.947279945 0.18419277 -0.401526613 1.10924979 -0.311701865 0.415837389
[7,] 0.748139340 0.75227990 1.350517581 0.76915419 -0.346068592 0.057660111
[8,] -0.116955226 0.59174505 0.591190027 1.15347367 -0.304607588 0.007812921
[9,] 0.152657626 -0.98305260 0.100525456 1.26068350 -1.785893487 0.298192099
[10,] 2.189978107 -0.27606396 0.931071996 0.70062351 0.587274672 0.216225091
[11,] 0.356986230 -0.87085102 -0.262742349 0.43262716 1.635794434 0.026097800
[12,] 2.716751783 0.71871055 -0.007668105 -0.92260172 -0.645423474 0.190567072
[13,] 2.281451926 0.11065288 0.367153007 -0.61558421 0.618992169 0.402829397
[14,] 0.324020540 -0.07846677 1.707162545 -0.86665969 0.236393598 0.248196976
[15,] 1.896067067 -0.42049046 0.723740263 -1.63951709 0.846500899 0.406511129
[16,] 0.467680511 -0.56212588 0.481036049 -1.32583924 -0.573645739 0.162457572
[17,] -0.893800723 0.99751344 -1.567868244 -0.88903673 1.117993204 0.383801555
[18,] -0.307328300 -1.10513006 0.318250283 -0.55760233 -1.540001132 0.347037954
[19,] -0.004822422 -0.14228783 0.165991451 -0.06240231 -0.438123899 0.262938992
[20,] 0.988164149 0.31499490 -0.899907630 2.42269298 -0.150672971 0.139233120
>
> # defining a index for selecting if the condition is met
> ind <- apply(X, 2, function(X) any(abs(X)>0.5))
> X[,ind] # since col6 only has values less than 0.5 it is not taken
col1 col2 col3 col4 col5
[1,] 2.287247161 0.83975036 1.218550535 0.07637147 0.342585350
[2,] -1.196771682 0.70534183 -0.699317079 0.15915528 0.004248236
[3,] -0.694292510 1.30596472 -0.285432752 0.54367418 0.029219842
[4,] -0.412292951 -1.38799622 -1.311552673 0.70480735 -0.393423429
[5,] -0.970673341 1.27291686 -0.391012431 0.31896914 -0.792704563
[6,] -0.947279945 0.18419277 -0.401526613 1.10924979 -0.311701865
[7,] 0.748139340 0.75227990 1.350517581 0.76915419 -0.346068592
[8,] -0.116955226 0.59174505 0.591190027 1.15347367 -0.304607588
[9,] 0.152657626 -0.98305260 0.100525456 1.26068350 -1.785893487
[10,] 2.189978107 -0.27606396 0.931071996 0.70062351 0.587274672
[11,] 0.356986230 -0.87085102 -0.262742349 0.43262716 1.635794434
[12,] 2.716751783 0.71871055 -0.007668105 -0.92260172 -0.645423474
[13,] 2.281451926 0.11065288 0.367153007 -0.61558421 0.618992169
[14,] 0.324020540 -0.07846677 1.707162545 -0.86665969 0.236393598
[15,] 1.896067067 -0.42049046 0.723740263 -1.63951709 0.846500899
[16,] 0.467680511 -0.56212588 0.481036049 -1.32583924 -0.573645739
[17,] -0.893800723 0.99751344 -1.567868244 -0.88903673 1.117993204
[18,] -0.307328300 -1.10513006 0.318250283 -0.55760233 -1.540001132
[19,] -0.004822422 -0.14228783 0.165991451 -0.06240231 -0.438123899
[20,] 0.988164149 0.31499490 -0.899907630 2.42269298 -0.150672971
# It could be done just in one step avoiding 'ind'
X[, apply(X, 2, function(X) any(abs(X)>0.5))]
An addition to Jilber's answer for the case when only one column remains after filtering:
X[, apply(X, 2, function(X) any(abs(X)>0.5)), drop=FALSE]
Without the drop=FALSE argument the remaining column will be converted to a vector and you will lose the column name information.
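For a matrix with 200000 columns, a vectorised column count may be worth trying in place of apply. The sketch below is an alternative I am suggesting rather than part of the answers above; the small matrix and the names mymat, keep and sub are made up for illustration.

```r
mymat <- matrix(c(0.1, -0.2,   0.7, 0.3,   -0.9, 0.0,   0.2, 0.4),
                nrow = 2,
                dimnames = list(c("r1", "r2"), c("a", "b", "c", "d")))

# Count, per column, how many entries exceed 0.5 in absolute value,
# and keep the columns where that count is non-zero.
keep <- colSums(abs(mymat) > 0.5) > 0
sub  <- mymat[, keep, drop = FALSE]   # drop = FALSE keeps the matrix shape and names

sub
```

Note that colSums propagates NA, so a matrix with missing values would need na.rm = TRUE or separate handling.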

fill gaps in a timeseries with averages

I have a dataframe like so:
day sum_flux samples mean
2005-10-26 0.02 48 0.02
2005-10-27 0.12 12 0.50
It's a series of daily readings spanning 5 years, however some of the days are missing. I want to fill these days with the average of that month from other years.
For example, if 26-10-2005 were missing, I'd want to use the average of all Octobers in the data set.
If all of October were missing, I'd want to apply this average to each missing day.
I think I need to build a function (possibly using plyr) to evaluate the days. However I'm very inexperienced with using the various timeseries objects in R, and conditionally subsetting data and would like some advice. Especially regarding which type of timeseries I should be using.
Many Thanks
Some sample data. I'm assuming that sum_flux is the column that has missing values, and that you want to calculate values for.
library(lubridate)
days <- seq(ymd("2005-10-26"), ymd("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.frame(
day = days,
sum_flux = runif(n_days),
samples = sample(100, n_days, replace = TRUE),
mean = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
Add a month column.
readings$month <- month(readings$day, label = TRUE)
Use tapply to get the monthly mean flux.
monthly_avg_flux <- with(readings, tapply(sum_flux, month, mean, na.rm = TRUE))
Use this value whenever the flux is missing, or keep the flux if not.
readings$sum_flux2 <- with(readings, ifelse(
is.na(sum_flux),
monthly_avg_flux[month],
sum_flux
))
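The same fill can be sketched in base R alone, which may help if lubridate is not available. This version uses ave() to compute the per-month mean aligned with each row; the seed, the number of missing values and the names month_id, month_mean and filled are assumptions for illustration.

```r
set.seed(42)   # illustrative seed
day <- seq(as.Date("2005-10-26"), as.Date("2010-10-26"), by = "1 day")
readings <- data.frame(day = day, sum_flux = runif(length(day)))
readings$sum_flux[sample(nrow(readings), 100)] <- NA

# Calendar month ("01" to "12"), pooled across years.
month_id <- format(readings$day, "%m")

# ave() returns a vector as long as the input, holding each row's group mean.
month_mean <- ave(readings$sum_flux, month_id,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Use the monthly mean only where the original reading is missing.
readings$filled <- ifelse(is.na(readings$sum_flux), month_mean, readings$sum_flux)
```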
This is one (very fast) way in data.table.
Using the nice example data from Richie :
require(data.table)
days <- seq(as.IDate("2005-10-26"), as.IDate("2010-10-26"), by = "1 day")
n_days <- length(days)
readings <- data.table(
day = days,
sum_flux = runif(n_days),
samples = sample(100, n_days, replace = TRUE),
mean = runif(n_days)
)
readings$sum_flux[sample(n_days, floor(n_days / 10))] <- NA
readings
day sum_flux samples mean
[1,] 2005-10-26 0.32838686 94 0.09647325
[2,] 2005-10-27 0.14686591 88 0.48728321
[3,] 2005-10-28 0.25800913 51 0.72776002
[4,] 2005-10-29 0.09628937 81 0.80954124
[5,] 2005-10-30 0.70721591 23 0.60165240
[6,] 2005-10-31 0.59555079 2 0.96849533
[7,] 2005-11-01 NA 42 0.37566491
[8,] 2005-11-02 0.01649860 89 0.48866220
[9,] 2005-11-03 0.46802818 49 0.28920807
[10,] 2005-11-04 0.13024856 30 0.29051080
First 10 rows of 1827 printed.
Create the average for each month, in appearance order of each group :
> avg = readings[,mean(sum_flux,na.rm=TRUE),by=list(mnth = month(day))]
> avg
mnth V1
[1,] 10 0.4915999
[2,] 11 0.5107873
[3,] 12 0.4451787
[4,] 1 0.4966040
[5,] 2 0.4972244
[6,] 3 0.4952821
[7,] 4 0.5106539
[8,] 5 0.4717122
[9,] 6 0.5110490
[10,] 7 0.4507383
[11,] 8 0.4680827
[12,] 9 0.5150618
Next reorder avg to start in January :
avg = avg[order(mnth)]
avg
mnth V1
[1,] 1 0.4966040
[2,] 2 0.4972244
[3,] 3 0.4952821
[4,] 4 0.5106539
[5,] 5 0.4717122
[6,] 6 0.5110490
[7,] 7 0.4507383
[8,] 8 0.4680827
[9,] 9 0.5150618
[10,] 10 0.4915999
[11,] 11 0.5107873
[12,] 12 0.4451787
Now update by reference (:=) the sum_flux column, where sum_flux is NA, with the value from avg for that month.
readings[is.na(sum_flux), sum_flux:=avg$V1[month(day)]]
day sum_flux samples mean
[1,] 2005-10-26 0.32838686 94 0.09647325
[2,] 2005-10-27 0.14686591 88 0.48728321
[3,] 2005-10-28 0.25800913 51 0.72776002
[4,] 2005-10-29 0.09628937 81 0.80954124
[5,] 2005-10-30 0.70721591 23 0.60165240
[6,] 2005-10-31 0.59555079 2 0.96849533
[7,] 2005-11-01 0.51078729** 42 0.37566491 # ** updated with the Nov avg
[8,] 2005-11-02 0.01649860 89 0.48866220
[9,] 2005-11-03 0.46802818 49 0.28920807
[10,] 2005-11-04 0.13024856 30 0.29051080
First 10 rows of 1827 printed.
Done.
