I have a matrix with n rows and n columns and I would like to average it in bins of 10 rows at a time, so that I end up with a matrix of size n/10-by-n. I loaded the matlab package and tried the following code:
nRemove = rem(size(a,1),10);
a = a(1:end-nRemove,:)
Avg = mean(reshape(a,10,[],n));
AvgF = squeeze(Avg);
but it didn't work. What code should I use?
Thanks!
Here is another way to do it:
set.seed(5)
x = matrix(runif(1000), ncol = 10)
nr = nrow(x)
gr = rep(1:floor(nr/10), each = 10)
aggregate(x ~ gr, FUN=mean)[,-1]
which results in
NA NA.1 NA.2 NA.3 NA.4 NA.5 NA.6 NA.7
1 0.5295264 0.5957229 0.4502069 0.5168083 0.3398190 0.4075922 0.6059122 0.5127865
2 0.4778341 0.3967321 0.4069635 0.4514742 0.6172677 0.2486085 0.6340686 0.4052600
3 0.5168132 0.5117207 0.5202261 0.5068593 0.5218041 0.4925462 0.5169584 0.4919296
4 0.3299557 0.3314723 0.4503393 0.3965103 0.6166598 0.5525628 0.4943880 0.6048207
5 0.6145423 0.5853235 0.4822182 0.3377771 0.3540784 0.5974846 0.5202577 0.5769518
6 0.5009249 0.5203701 0.3940540 0.4237508 0.3199265 0.4817713 0.4655320 0.6124400
7 0.7335082 0.5856578 0.3929621 0.6403662 0.5347719 0.5658542 0.4226456 0.7196593
8 0.4976663 0.5205538 0.4529273 0.4757352 0.6980300 0.5694570 0.4384924 0.5481236
9 0.5275932 0.5014861 0.5363340 0.5664576 0.5006055 0.5611069 0.3803889 0.4680865
10 0.4560031 0.5527328 0.4419076 0.6893043 0.5161281 0.5895931 0.3965911 0.3842419
NA.8 NA.9
1 0.3711607 0.5541607
2 0.4379255 0.4159131
3 0.5048523 0.5884052
4 0.4642687 0.4572388
5 0.6054209 0.5174784
6 0.4659952 0.5332438
7 0.4568273 0.3943798
8 0.6978356 0.5087778
9 0.4897584 0.4710949
10 0.6310546 0.4775762
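The NA headers in that output appear because the matrix x has no column names. Converting it to a data frame first (which supplies default V1..V10 names) gives readable headers; a small sketch using the same x and gr:

aggregate(as.data.frame(x), by = list(gr = gr), FUN = mean)[, -1]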
t(sapply(seq(1, NROW(A), by = 10), function(x) colMeans(A[x:(x + 9), ])))
You need the transpose operation to re-orient the result. One often needs to do so after an 'apply' operation.
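A vectorized base-R alternative (a sketch, assuming A has already been trimmed to a multiple of 10 rows, as in the question): rowsum adds up the rows within each 10-row block, and dividing by 10 turns the sums into means.

# one group label per block of 10 rows: 1,1,...,1, 2,2,...,2, ...
grp <- rep(seq_len(nrow(A) / 10), each = 10)
# block sums divided by the block size give the binned averages,
# an (nrow(A)/10)-by-ncol(A) matrix
AvgF <- rowsum(A, grp) / 10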
I have a data frame (9000 x 304) that looks like this:
date         a         b
1997-01-01   8.720551  10.61597
1997-01-02   NA        NA
1997-01-03   8.774251  NA
1997-01-04   8.808079  11.09641
I want to calculate differences such as:
first <- data[i-1,] - data[i-2,]
second <- data[i,] - data[i-1,]
third <- data[i,] - data[i-2,]
I want to ignore the NA values: if a value is NA, I want to use the last non-NA value in that column instead.
For example, for the second difference with i = 4 in column b:
11.09641 - 10.61597 is the value of b_diff on 1997-01-04
This is what I did, but it keeps generating data with NAs:
first <- NULL
for (i in 3:nrow(data)){
first <-rbind(first, data[i-1,] - data[i-2,])
}
second <- NULL
for (i in 3:nrow(data)){
second <- rbind(second, data[i,] - data[i-1,])
}
third <- NULL
for (i in 3:nrow(data)){
third <- rbind(third, data[i,] - data[i-2,])
}
There may be a way to solve this with the aggregate function, but I need a solution that scales to large data and does not require naming each column separately (my column names are in a foreign language, too).
Thank you very much! I hope I have given you all the information you need; otherwise, please let me know.
You can use fill to replace NAs with the last non-missing value, and then use across and lag to compute the new variables. It is not entirely clear what your expected output should be, but you can also replace the default value lag returns when there is no previous value (e.g. for the first row), using lag(.x, default = ...).
library(dplyr)
library(tidyr)
data %>%
fill(a, b) %>%
mutate(across(a:b, ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
across(a:b, ~ .x - lag(.x), .names = "second_{.col}"),
across(a:b, ~ .x - lag(.x, n = 2), .names = "third_{.col}"))
date a b first_a first_b second_a second_b third_a third_b
1 1997-01-01 8.720551 10.61597 NA NA NA NA NA NA
2 1997-01-02 8.720551 10.61597 NA NA 0.000000 0.00000 NA NA
3 1997-01-03 8.774251 10.61597 0.0000 0 0.053700 0.00000 0.053700 0.00000
4 1997-01-04 8.808079 11.09641 0.0537 0 0.033828 0.48044 0.087528 0.48044
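Since the real data has 304 columns, spelling out a:b will not scale. A sketch of the same idea applied to every non-date column; this assumes all columns other than date are numeric, and it captures the original column names first so the later across() calls do not pick up the newly created first_*/second_* columns:

library(dplyr)
library(tidyr)

# names of the original value columns, captured before any new columns exist
num_cols <- setdiff(names(data), "date")

data %>%
  fill(all_of(num_cols)) %>%
  mutate(across(all_of(num_cols), ~ lag(.x) - lag(.x, n = 2), .names = "first_{.col}"),
         across(all_of(num_cols), ~ .x - lag(.x), .names = "second_{.col}"),
         across(all_of(num_cols), ~ .x - lag(.x, n = 2), .names = "third_{.col}"))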
My loop knowledge is very minimal, but I currently have a loop written that takes values from three vectors (small.dens, med.dens, and large.dens), each of which has 17 values. I have the loop set up to randomly select 2 values, then 3, then 4, all the way up to 17. Using these values, it calculates the mean and standard error (using the plotrix package). It then places these calculated means and standard errors into new vectors (small.density, small.stanerr, medium.density, medium.stanerr, large.density, and large.stanerr). Then, separately from the loop, I combine these vectors into a data frame.
library(plotrix)
small.density = rep(NA,16)
small.stanerr = rep(NA,16)
medium.density = rep(NA,16)
medium.stanerr = rep(NA,16)
large.density = rep(NA,16)
large.stanerr = rep(NA,16)
for(i in 2:17){
xx=sample(small.dens,i,replace=TRUE)
small.density[[i]] = mean(xx)
small.stanerr[[i]] = std.error(xx)
yy = sample(med.dens, i, replace=TRUE)
medium.density[[i]] = mean(yy)
medium.stanerr[[i]] = std.error(yy)
zz = sample(large.dens, i, replace=TRUE)
large.density[[i]] = mean(zz)
large.stanerr[[i]] = std.error(zz)
}
I then want to run this loop 100 times and ultimately take the mean of the results. For example, I would like it to select 2, 3, 4, ..., 17 values 100 times, take the mean and standard error each time, and then take the mean over all 100 runs. Does this make sense? Would I make another for loop, turning this into a nested loop?
How would I go about doing this?
Thanks!
There are other ways to achieve what you want, but if you do not want to change your code, then just wrap it in something like this
res <- do.call(rbind, lapply(1:100, function(x) {
within(data.frame(
n = x,
size = 2:17,
small.density = rep(NA,16),
small.stanerr = rep(NA,16),
medium.density = rep(NA,16),
medium.stanerr = rep(NA,16),
large.density = rep(NA,16),
large.stanerr = rep(NA,16)
), {
for(i in 2:17){
xx = sample(small.dens,i,replace=TRUE)
small.density[[i - 1L]] = mean(xx)
small.stanerr[[i - 1L]] = std.error(xx)
yy = sample(med.dens, i, replace=TRUE)
medium.density[[i - 1L]] = mean(yy)
medium.stanerr[[i - 1L]] = std.error(yy)
zz = sample(large.dens, i, replace=TRUE)
large.density[[i - 1L]] = mean(zz)
large.stanerr[[i - 1L]] = std.error(zz)
}
rm(xx, yy, zz, i)
})
}))
res looks like this
> head(res, 20)
n size small.density small.stanerr medium.density medium.stanerr large.density large.stanerr
1 1 2 -0.04716195 0.35754422 13.1014925 4.374055 -42.089591 30.87786
2 1 3 -0.15893367 0.34557922 -0.2680632 6.206081 52.984076 36.85058
3 1 4 0.10013995 0.62374467 -0.1944930 5.784211 -112.684774 30.50707
4 1 5 0.40654132 0.40815013 1.6096970 5.026714 45.810098 46.58469
5 1 6 0.13310242 0.32104512 -6.9989844 4.232091 -22.312165 48.14705
6 1 7 0.21283027 0.53633472 -5.0702365 3.829677 -43.266482 41.74286
7 1 8 0.13870439 0.27161346 4.1629469 3.214053 -9.045643 48.49930
8 1 9 0.06495734 0.36738163 3.9742069 3.540913 -43.954345 38.23816
9 1 10 -0.01882762 0.37570468 -3.1764203 3.740403 -43.156792 38.47531
10 1 11 -0.02115580 0.26239465 -2.2026077 2.702412 7.343837 30.58314
11 1 12 0.09967753 0.27360125 3.9603382 3.214921 -13.461632 29.39910
12 1 13 0.53121414 0.27561862 4.3593802 1.872685 -38.572491 25.37029
13 1 14 0.21547909 0.36345292 -0.3377787 2.732968 17.305232 26.08317
14 1 15 0.33957964 0.23029520 0.4832063 2.886160 8.145410 18.23901
15 1 16 0.26871985 0.26846012 -6.7634873 3.436742 -4.011269 20.33814
16 1 17 0.24927792 0.20534048 -0.7481315 1.899348 9.993280 24.49623
17 2 2 -1.10840346 0.07123407 -3.4317644 6.966096 -30.384945 121.00972
18 2 3 1.73947551 0.35986535 -2.1415966 5.628115 -57.857871 10.47413
19 2 4 0.40033834 0.41963615 -4.2156733 1.206414 27.891021 13.84453
20 2 5 -0.08704736 0.52872770 0.3137693 2.974888 -3.100414 57.89126
If you want to calculate the mean of the 100 simulated values for each size, then just
aggregate(. ~ size, res[-1L], mean)
which gives you
size small.density small.stanerr medium.density medium.stanerr large.density large.stanerr
1 2 0.02872578 0.6341294 1.0938287 5.518797 3.141204 53.20675
2 3 0.16985732 0.5388110 -0.1627867 5.185643 -6.660756 49.83607
3 4 0.20543404 0.4815581 0.1385016 4.519419 -8.093673 46.64984
4 5 0.13019280 0.4546794 0.1299331 4.166335 -10.300542 41.40444
5 6 0.10675158 0.4307113 0.2191516 4.033863 -12.068151 38.95312
6 7 0.19326831 0.3834507 0.8784275 3.513812 -6.920378 36.17856
7 8 0.09020638 0.3580780 0.4388388 3.443349 -5.335405 30.49615
8 9 0.13956838 0.3558005 0.3740251 3.313501 -15.290834 31.64833
9 10 0.18368962 0.3397191 0.4600761 3.051425 -5.505220 29.46165
10 11 0.20653866 0.3116104 0.9913534 2.804659 -8.809398 28.79097
11 12 0.14653661 0.2988422 0.3337274 2.624418 -5.128882 26.78074
12 13 0.12255652 0.2864998 0.2085829 2.719396 -11.548064 27.08497
13 14 0.13102809 0.2830709 0.6448798 2.586491 -4.676053 25.21800
14 15 0.14536840 0.2749606 0.3415879 2.522826 -11.968496 24.44427
15 16 0.14871831 0.2571571 0.2218365 2.463486 -10.335511 23.64304
16 17 0.13664397 0.2461108 0.3387764 2.348594 -9.969407 22.84736
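If you prefer dplyr for the summary step, the same per-size means can be computed with group_by() and summarise(); a sketch assuming res from above:

library(dplyr)

res %>%
  select(-n) %>%                     # drop the simulation index
  group_by(size) %>%
  summarise(across(everything(), mean))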
I am trying to subset a dataframe by two variables ('site' and 'year') and apply a function (dismo::biovars) to each subset. Biovars requires monthly inputs (12 values) and outputs 19 variables per year. I'd like to store the outputs for each subset and combine them.
Example data:
data1<-data.frame(Meteostation=c(rep("OBERHOF",12),rep("SOELL",12)),
Year=c(rep(1:12),rep(1:12)),
tasmin=runif(24, min=-20, max=5),
tasmax=runif(24, min=-1, max=30),
pr=runif(24, min=0, max=300))
The full dataset contains 900 stations and 200 years.
I'm currently attempting a nested loop, which I realised isn't the most efficient, and which I'm struggling to make work - code below:
sitesList <- as.character(unique(data1$Meteostation))
#yearsList<- unique(data1$Year)
bvList<-list()
for (i in c(1:length(unique(sitesList)))) {
site<-filter(data1, Meteostation==sitesList[i])
yearsList[i]<-unique(site$Year)
for (j in c(1:length(yearsList))){
timestep<-filter(site,Year==yearsList[j])
tmin<-timestep$tasmin
tmax<-timestep$tasmax
pr<-timestep$pr
bv<-biovars(pr,tmin,tmax)
bvList[[j]]<- bv
}}
bv_all <- do.call(rbind, bvList)
I'm aware there are much better ways to go about this, and I have been looking at variations of apply and at dplyr solutions, but I am struggling to get my head around them. Any advice is much appreciated.
You could use the dplyr package, as follows perhaps?
library(dplyr)
data1 %>%
group_by(Meteostation, Year) %>%
do(data.frame(biovars(.$pr, .$tasmin, .$tasmax)))
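Note that do() is superseded in recent versions of dplyr; the same idea can be written with group_modify(), which expects a function that returns a data frame. A sketch under that assumption:

library(dplyr)
library(dismo)

data1 %>%
  group_by(Meteostation, Year) %>%
  group_modify(~ as.data.frame(biovars(.x$pr, .x$tasmin, .x$tasmax)))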
Use by and rbind the result.
library("dismo")
res <- do.call(rbind, by(data1, data1[c("Year", "Meteostation")], function(x) {
cbind(x[c("Year", "Meteostation")], biovars(x$pr, x$tasmin, x$tasmax))
}))
Produces
head(res[, 1:10])
# Meteostation Year bio1 bio2 bio3 bio4 bio5 bio6 bio7 bio8
# 1 OBERHOF 1 12.932403 18.59525 100 NA 22.2300284 3.634777 18.59525 NA
# 2 OBERHOF 2 5.620587 7.66064 100 NA 9.4509069 1.790267 7.66064 NA
# 3 OBERHOF 3 0.245540 12.88662 100 NA 6.6888506 -6.197771 12.88662 NA
# 4 OBERHOF 4 5.680438 45.33159 100 NA 28.3462326 -16.985357 45.33159 NA
# 5 OBERHOF 5 -6.971906 16.83037 100 NA 1.4432801 -15.387092 16.83037 NA
# 6 OBERHOF 6 -7.915709 14.63323 100 NA -0.5990945 -15.232324 14.63323 NA
Hey, I need to fill in the missing values of a data frame before I can use it in a Shiny app. The rule is to fill a missing value in column K with the value from column K-1, across the whole data frame.
I have actually figured out how to do this, but I think my way is too complicated; I believe there should be a much easier way. I attach the data, code and output here. Please let me know if you have an easier way to do this.
Thanks a lot.
data2 = data.frame('population by age' = seq(3, 24, by = 1),
'2008' = c(145391,
140621,
136150,
131944,
127968,
124209,
120650,
117163,
113674,
110207,
106871,
103659,
100398,
97017,
93584,
90240,
86957,
83783,
80756,
77850,
75003,
72226
),
'2009' = c(148566,
143943,
139367,
135083,
131052,
NA,
123628,
120213,
116826,
113381,
109915,
106574,
103346,
100058,
96644,
93175,
NA,
86455,
NA,
80192,
77279,
74422
),
'2010' = c(152330,
147261,
142555,
138172,
134071,
130214,
126559,
123099,
119825,
116538,
113134,
109669,
106320,
103075,
99760,
96312,
92805,
NA,
NA,
82733,
79661,
76739
),
'2011' = c(156630,
151387,
146491,
141905,
137593,
133545,
129737,
126124,
122678,
NA,
116093,
112666,
109174,
105791,
102505,
99159,
95699,
92193,
88759,
85373,
82123,
79065
))
library(dplyr)
library(tidyr)

data7 <- data2 %>%
  gather(key = year, value = value, -population.by.age) %>%
  group_by(population.by.age) %>%
  nest()
library(imputeTS)
impute_nas <- function(df, var, fun, ...) {
df[[var]] <- fun(df[[var]], ...)
return(df)
}
imputed <- data7 %>%
mutate(
interpolation = purrr::map(data, impute_nas, var = 'value', fun = imputeTS::na.locf)
) %>%
select(-data) %>%
  unnest()
imputed <- imputed %>% spread(key = 'year', value = 'value')
as.data.frame(imputed)
Best,
An option is to use the power of zoo::na.locf to fill NAs with the last available value. apply can pass the data row-wise, so zoo::na.locf fills each missing value from the column to its left.
library(zoo)
cbind(data2[1], t(apply(data2[2:5], 1, zoo::na.locf)))
# population.by.age X2008 X2009 X2010 X2011
# 1 3 145391 148566 152330 156630
# 2 4 140621 143943 147261 151387
# 3 5 136150 139367 142555 146491
# 4 6 131944 135083 138172 141905
# 5 7 127968 131052 134071 137593
# 6 8 124209 124209 130214 133545
# 7 9 120650 123628 126559 129737
# 8 10 117163 120213 123099 126124
# 9 11 113674 116826 119825 122678
# 10 12 110207 113381 116538 116538
# 11 13 106871 109915 113134 116093
# 12 14 103659 106574 109669 112666
# 13 15 100398 103346 106320 109174
# 14 16 97017 100058 103075 105791
# 15 17 93584 96644 99760 102505
# 16 18 90240 93175 96312 99159
# 17 19 86957 86957 92805 95699
# 18 20 83783 86455 86455 92193
# 19 21 80756 80756 80756 88759
# 20 22 77850 80192 82733 85373
# 21 23 75003 77279 79661 82123
# 22 24 72226 74422 76739 79065
One quick solution I could come up with using for loops is shown below. Of course, the first column cannot be imputed.
impute_from_previous <- function(ds) {
  # start at column 3: column 1 holds the age groups and column 2 (the first
  # year) has no earlier year to copy from
  for (i in 3:ncol(ds)) {
    rows_missing <- which(is.na(ds[[i]]))
    ds[rows_missing, i] <- ds[rows_missing, i - 1]
  }
  return(ds)
}
data3 <- impute_from_previous(data2)
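If you would rather stay close to the original tidyverse approach, the gather/nest/map/unnest/spread chain can be shortened: in long format, "take the value from the previous year's column" becomes an ordinary last-observation-carried-forward fill within each age group. A sketch (assuming, as here, that the year columns sort in chronological order):

library(dplyr)
library(tidyr)

data2 %>%
  pivot_longer(-population.by.age, names_to = "year", values_to = "value") %>%
  group_by(population.by.age) %>%
  fill(value, .direction = "down") %>%   # carry the previous year's value forward
  ungroup() %>%
  pivot_wider(names_from = year, values_from = value)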
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude the values that fall within it. I would then like to rewrite the data without those values and use the new column in the correlation analysis (i.e. threshold the data at the lower quartile). If there is a way I can write this so that the threshold is easy to change by passing arguments from Java (as I do with the input file name), that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from CSV):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one does not meet my quartile threshold (the 0.25 quantile). For example, if the quartile for O were 45000, the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read both columns in as a data frame, can I check each column separately? Or should I find the quantiles first and then use those values to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
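Applied to the follow-up with the paired Abundance columns, the two quantile tests can be combined so that the whole row is dropped whenever either value falls in its bottom quartile; a sketch assuming the CSV has been read into Values as in the question:

Values <- read.csv(inputFile, header = TRUE)

keep <- Values$Abundance_O > quantile(Values$Abundance_O, 0.25) &
        Values$Abundance_S > quantile(Values$Abundance_S, 0.25)

Values_trimmed <- Values[keep, ]   # pairs stay together
cor(Values_trimmed$Abundance_O, Values_trimmed$Abundance_S)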
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function that converts a numeric vector into its quantile groups. The parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)

dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
> head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we keep only the rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
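Tying this back to the original question's columns, the same qgroup() helper can be applied to Abundance_O and Abundance_S and the rows kept only when both sit above the bottom quartile. A sketch; the Q_O/Q_S column names are just illustrative:

library(data.table)

Values <- as.data.table(read.csv(inputFile, header = TRUE))
Values[, c("Q_O", "Q_S") := lapply(.SD, qgroup), .SDcols = c("Abundance_O", "Abundance_S")]
Values <- Values[Q_O > 1 & Q_S > 1]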