I set up 2 problems as follows:
I have two matrices (Mat1 and Mat2) of equal size. I also have four output matrices (Output1, Output2, Output3, Output4), each the same size as Mat1 and Mat2.
Problem 1:
In Mat2, identify the row that contains the maximum value in column 1. Let's assume this is row 1.
Go to row 1 of Mat1 and extract the first 3 columns of that row into Output1. Store all other rows of Mat1 for the first 3 columns in Output2. At this stage Output1 is 1x3 and Output2 is (n-1)x3.
Move to column 4 of Mat2. Identify the row that contains the maximum value. Let's say this is row 5.
Go to row 5 of Mat1, column 4. Store row 5, columns 4, 5, 6 in Output1. Store all other rows of Mat1 for columns 4, 5, 6 in Output2.
... Repeat this process for all columns in Mat2, following the sequence 1, 4, 7, 9, etc. In this case I have 25 columns in Mat1 and Mat2, so the sequence will end at 24.
I need to be able to change the sequence from 1, 4, 7, 9, etc. to 1, 13, 25, etc.
Problem 2:
is equivalent to Problem 1, except this time I identify the rows that contain the top two values at every stage.
In Mat2, identify the rows that contain the top two values in column 1. Let's say these rows are 2 and 5. Store the first 3 columns of rows 2 and 5 of Mat1 in Output3. Store all remaining rows (columns 1-3) of Mat1 in Output4.
Move to column 4 of Mat2. Identify the rows that contain the top two values in column 4 of Mat2. Let's say rows 1 and 2.
Move to column 4 of Mat1. Store columns 4, 5, 6 of rows 1 and 2 in Output3. Store all remaining rows in Output4.
Sidenote: this process must be easy to extend to matrices with 1000x1000 dimensions, so I would prefer not to do this manually.
Mat1 <- data.frame(matrix(nrow = 10, ncol =25, data = rnorm(250,0,1)))
Mat2 <- data.frame(matrix(nrow = 10, ncol =25, data = rnorm(250,0,1)))
> Mat1
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1 -2.22415466 0.98712728 1.0084356 0.58447183 0.2608830 -0.341029099 -0.71693894 -0.61653058 -0.24790470 0.10777970 -1.68562271 -1.6638535 -0.5538468
2 1.11444365 -0.34865955 0.7518822 -0.07573724 0.1336811 -0.831275643 -0.15564822 -0.68849375 -0.05094047 0.21990082 -0.69879135 -0.6348292 1.0172304
3 -0.05367747 0.08654206 -0.3023270 -0.67335942 -1.1173279 0.004670625 0.52482501 0.78330982 1.18795853 -0.06513613 0.42353439 -0.4152209 1.7174158
4 0.42118984 -0.43257583 -1.3368036 1.64849798 0.8294276 1.256987496 -0.50440892 1.07686292 0.94196135 2.90916270 -0.08714083 0.1094395 1.1715895
5 -0.13720451 -0.94864452 1.9751962 -0.70523555 0.1431405 0.569928767 0.54877505 -0.44571903 -1.16282161 -1.65590032 -0.17710859 -0.8904316 0.3252576
6 0.64336424 -0.38277541 -1.6512377 -0.06542054 -0.1195322 0.666255832 0.60826054 1.88822842 -0.52952627 -0.44776682 0.04321836 -0.6190585 -0.9529690
7 -1.04160098 1.10952094 -0.9186759 0.77437293 -0.2284926 -0.113106151 -0.32092624 1.34157301 2.33813068 1.21812714 0.13165646 0.5532299 -1.3470645
8 1.22940987 -1.26271164 -1.2483658 -2.00578793 -0.6773794 -0.228135998 -0.06223206 -1.97606848 1.67339247 -0.47268196 -0.83544561 -0.3313278 -0.2373613
9 0.08485706 -1.60594589 0.8549923 -0.23394708 -0.5978692 -0.321839877 -0.55298452 -0.08387815 -0.99196489 0.83364114 -0.19579612 -0.8017648 -0.2238073
10 -1.71702699 0.39086484 -0.9974210 0.86232862 -0.2755329 -0.160656438 0.49669949 0.73763073 -0.42380390 1.91208332 -0.27778479 0.7866471 0.1813511
> Mat2
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1 0.11053732 -0.5750170 2.58105259 -1.6895285 0.0508918 -0.54188929 -0.92292169 1.4972970 0.009239807 -0.1706461 -0.8942262 -1.6351505 0.2029262
2 -0.83802776 -0.9322157 -0.34753884 0.8164819 0.7318198 0.09162218 0.15971493 -2.6731067 1.554323641 -0.3161967 0.4622101 -1.9521229 -1.3229961
3 0.61368153 -1.3650360 0.95674229 0.4582117 -0.6959545 -0.59627428 1.94172156 1.6784237 -0.482524695 -0.0514944 -0.4608930 -0.5456863 -0.1340540
4 -1.03156503 -0.2516495 0.76770177 -0.7841354 -3.2404904 -1.76276859 1.57421914 0.9782458 -1.364451438 -0.6437429 0.7485424 -0.8778284 1.7587504
5 0.01183232 0.6825633 1.39634308 1.4136879 0.5166420 0.76930390 0.67210932 1.3007904 -0.284451411 0.5163457 0.3198626 0.8030497 -1.4320064
6 -0.06110883 -0.6762991 0.56105196 0.9767543 -1.0016294 -0.84811626 -0.83319744 -1.1777865 -1.185631394 -0.5673733 0.2956725 0.5425602 -1.0510479
7 -0.56195630 1.3883881 0.09995573 0.6722959 -1.6205290 0.32085867 -0.94243554 -0.2340429 -1.299085265 -0.4433517 0.4424583 -2.8887970 0.1679859
8 1.04612102 0.8360530 0.07005306 0.4818317 1.1857504 0.13649605 1.35261983 0.8008935 -0.101922164 0.6773003 -1.0265770 0.1859912 0.2678461
9 0.88419676 -1.7012899 -1.09656000 -0.4360276 0.6238451 -2.03256276 -1.12575579 1.8407234 0.522372401 -0.6229582 0.6727720 -0.5695190 0.6298388
10 -0.68648649 -0.6689894 -0.56849261 -1.9012760 1.1418180 0.46377789 -0.08107475 1.4378120 -1.489367198 -0.7682887 -0.2858680 0.9584056 1.3178700
So for example:
which.max(Mat2[,1]) # 8
so go to row 8 of Mat1 and store the first 3 cols in Output1.
Output1[1, 1:3] # 1.22940987 -1.26271164 -1.2483658
Store all other rows of Mat1 for cols 1 to 3 in Output2.
which.max(Mat2[, 4]) # 5
implies
Output1[1, 4:6] # -0.70523555 0.1431405 0.569928767
And so on and so forth.
Does this do what you need?
Juggle <- function(m1, m2, col, step, top_n = 1) {
  if (length(col) == 1) {
    # Order the rows of m2 by this column, largest first
    output_rows <- sort(m2[, col], index.return = TRUE, decreasing = TRUE)$ix
    col_idx <- seq(col, col + step - 1)
    top_idx <- seq(top_n)
    bottom_idx <- seq(top_n + 1, nrow(m2))
    # Take the corresponding rows of m1 for this block of columns
    max_out <- m1[output_rows[top_idx], col_idx]
    rest_out <- m1[output_rows[bottom_idx], col_idx]
    return(list(max_out = max_out, rest_out = rest_out))
  } else {
    # Recurse: handle the first start column, then the rest, and bind the blocks
    left <- Juggle(m1, m2, col[1], step, top_n)
    right <- Juggle(m1, m2, col[seq(2, length(col))], step, top_n)
    return(list(
      max_out = cbind(left$max_out, right$max_out),
      rest_out = cbind(left$rest_out, right$rest_out)
    ))
  }
}
m1 and m2 are the matrices or data frames.
col is a single column or a sequence of starting columns, e.g. 1, 4, 7, 9 or 1, 13, 25.
step is the number of columns to extract at each starting column.
top_n is the number of rows to extract.
This returns a list with components max_out for the top_n row extracts and rest_out for the remaining N - top_n rows.
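For example, a quick usage sketch against the 10x25 Mat1/Mat2 above (the starting columns and step are assumptions you can adjust; this only covers blocks that fit inside the 25 columns):

# Problem 1: single best row per block of 3 columns, starting at columns 1, 4, 7
res1 <- Juggle(Mat1, Mat2, col = c(1, 4, 7), step = 3, top_n = 1)
res1$max_out   # 1 x 9: the extracted rows for columns 1:9
res1$rest_out  # 9 x 9: all remaining rows for columns 1:9

# Problem 2: top two rows per block of 12 columns, starting at columns 1 and 13
res2 <- Juggle(Mat1, Mat2, col = c(1, 13), step = 12, top_n = 2)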
data:
test <- structure(list(fgu_clms = c(14621389.763697, 145818119.352026,
21565415.2337476, 20120830.8221406, 12999772.0950838), loss_to_layer = c(0,
125818119.352026, 1565415.23374765, 120830.822140567, 0)), row.names = c(NA,
5L), class = "data.frame")
> test
fgu_clms loss_to_layer
1 14621390 0.0
2 145818119 125818119.4
3 21565415 1565415.2
4 20120831 120830.8
5 12999772 0.0
I want to create a new column whose value depends on a cumulative sum over the rows above it. It's easier if I show how the calculation of the new column works row by row:
Row 1: first calculate the sum of the new column's values in the rows above. As this is row 1 there are no rows above, so this value is 0; call it cumsum_1. Then take the minimum of the row 1 value in column "loss_to_layer" and the quantity x2 - cumsum_1.
Row 2: calculate the cumulative sum of the new column's values above it, i.e. min(x2 - cumsum_1, loss_to_layer value of row 1). Call this cumsum_2. Then repeat as above, i.e. take the minimum of the row 2 value in the loss_to_layer column and x2 - cumsum_2.
And so on.
In Excel, this would be done with MIN(B2, x2 - SUM(C$1:C1)) and dragging the formula down.
The results with x2 = 127,000,000 should be:
fgu_clms loss_to_layer new_col
1 14621390 0.0 0
2 145818119 125818119.4 125818119
3 21565415 1565415.2 1181881
4 20120831 120830.8 0
5 12999772 0.0 0
As you can see the sum of the "new_col" always sums back up to "x2", in this case 127,000,000.
I have tried:
test <- test %>% mutate(new_col = pmin(loss_to_layer,127e6-cumsum(lag(new_col,1,default=0))))
But I get an error, as the column new_col cannot be found inside the lag() call (it does not exist yet when mutate evaluates it).
library(dplyr)

test %>%
mutate(
cumsum_1 = cumsum(lag(loss_to_layer, default = 0)),
new_col = pmin(loss_to_layer, 127000000 - cumsum_1),
new_col = ifelse(new_col < 0, 0, new_col)
) %>%
select(-cumsum_1)
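Note that this uses the cumulative sum of loss_to_layer itself. If the running remainder should instead be driven by the capped new_col values (the recursion described above), a sketch of that row-by-row logic with Reduce(), assuming x2 = 127e6, is:

x2 <- 127e6
# Accumulate the running total of new_col; each step spends
# min(loss_to_layer, remaining capacity) and carries the total forward
running <- Reduce(
  function(spent, loss) spent + min(loss, x2 - spent),
  test$loss_to_layer,
  init = 0, accumulate = TRUE
)
test$new_col <- diff(running)  # per-row amounts: 0, 125818119.4, 1181880.6, 0, 0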
I would like to create a for loop to count if the values in each row are larger than a cutoff value that changes from row to row in another matrix. Currently, my code looks like this:
for (i in 1:100) {
  count_Q4_l2 <- 0                         # reset the counter for every column
  for (j in 1:10000){
    if (ACT_Allquant2[1, i] > cc[j, 1]){   # if this value is larger than the cutoff, count it
      count_Q4_l2 <- count_Q4_l2 + 1       # +1 to count the values
    }
  }
  countALL[1, i] <- count_Q4_l2            # save the values into another data.frame
}
The cutoff values are in the ACT_Allquant2 table and they should move forward together with the for loop.
Hope I explained myself clearly and I thank you very much in advance for your help!!
EDIT:
ACT_Allquant2 looks like this:
X91. X92. X93. X94. X95. X96. X97. X98.
Qfourfac_netlg2 0.7685364 0.8995720 0.9896079 1.014982 1.066362 1.229381
X99. X100.
Qfourfac_netlg2 1.727864 2.318737
While cc is a single row spread across a series of columns:
X1. X2. X3. X4. X5. X6. X7. X8. X9.
2 -2.504816 -2.433826 -2.305134 -2.261871 -2.110741 -1.894405 -1.344805 -1.256876 -1.211396
X10. X11. X12. X13. X14. X15. X16. X17.
2 -1.199943 -1.13323 -1.031908 -1.019844 -1.007079 -0.9932806 -0.9232708 -0.8316696
X18. X19. X20. X21. X22. X23. X24. X25.
2 -0.8052391 -0.7738284 -0.7334976 -0.7126213 -0.6950152 -0.6272749 -0.584775 -0.5540359
X26. X27. X28. X29. X30. X31. X32. X33.
2 -0.5307423 -0.5105184 -0.4107709 -0.4001571 -0.3959766 -0.3607601 -0.329242 -0.2746449
X34. X35. X36. X37. X38. X39. X40. X41.
2 -0.2231775 -0.1799284 -0.1684765 -0.1568755 -0.1446923 -0.1403811 -0.1387818 -0.126637
X42. X43. X44. X45. X46. X47. X48. X49.
2 -0.1082471 -0.08882241 -0.053299 -0.04695731 0.002623936 0.05961787 0.07482258 0.0868524
X50. X51. X52. X53. X54. X55. X56. X57. X58.
2 0.09455113 0.1003998 0.1077676 0.1574778 0.1810591 0.1832488 0.1874931 0.1893803 0.1955026
X59. X60. X61. X62. X63. X64. X65. X66. X67.
2 0.2035948 0.2321749 0.2453042 0.2604033 0.2739561 0.3018942 0.3835822 0.5748584 0.603411
X68. X69. X70. X71. X72. X73. X74. X75. X76.
2 0.6580565 0.6882143 0.7104922 0.7568134 0.7769822 0.7932305 0.8550466 0.876781 1.084851
X77. X78. X79. X80. X81. X82. X83. X84. X85. X86.
2 1.117067 1.196249 1.261902 1.310987 1.423575 1.485869 1.606687 1.678782 1.950923 1.995428
X87. X88. X89. X90. X91. X92. X93. X94. X95. X96.
2 1.99818 2.04422 2.080644 2.205811 2.21738 2.356354 2.469436 2.484198 2.52253 2.564173
X97. X98. X99.
2 2.638286 2.675248 2.768761
I'm not sure I understand, but let's try a simple example:
set.seed(41)
ACT <- data.frame(matrix(rnorm(100), 25, 4))
cc <- rnorm(4, 0, .5)
cc
# [1] 0.03641331 0.59785494 -1.05581599 0.33569523
In each column of ACT you want to count the values that exceed the value in cc, e.g. for column 1 the number that exceed 0.03641331, for column 2 the number that exceed 0.59785494? If that is so, you do not need any loops:
Comp <- sweep(ACT, 2, cc, ">")
Count <- colSums(Comp)
Count
# X1 X2 X3 X4
# 16 8 22 10
You can extract the values that exceed the cc value for each column, but you cannot put them into a data frame since the number of values in each column is different. You can create a data frame with the coordinates of the larger values or a list with the values for each column:
Larger <- data.frame(which(Comp, arr.ind=TRUE), ACT[Comp])
head(Larger)
# row col ACT.Comp.
# 1 2 1 0.1972575
# 2 3 1 1.0017043
# 3 4 1 1.2888254
# 4 5 1 0.9057534
# 5 6 1 0.4936675
# 6 7 1 0.5992858
LargerByCol <- split(Larger$ACT.Comp, Larger$col)
LargerByCol[[1]]
# [1] 0.1972575 1.0017043 1.2888254 0.9057534 0.4936675 0.5992858 . . . 16 values
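If, as in the EDIT, ACT_Allquant2 is a single row of quantile values and cc is a single row of cutoffs, the same no-loop idea can be sketched by flattening both to numeric vectors (a sketch under that assumption, using the objects shown in the EDIT):

quants  <- as.numeric(ACT_Allquant2[1, ])                 # e.g. X91. ... X100.
cutoffs <- as.numeric(cc[1, ])                            # e.g. X1. ... X99.
countALL <- sapply(quants, function(q) sum(cutoffs < q))  # count of cutoffs below each quantile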
I have a dataframe as follows:
X1 X2
6.134811 49.58038
6.135331 49.58127
6.135670 49.58170
6.134905 49.58199
I would like to create a new variable X3 containing the distance in meters between each successive pair of GPS points. The 1st row can be 0, the 2nd row will hold the distance between the 1st and 2nd rows, the 3rd row the distance between the 2nd and 3rd rows, and so on, as follows:
X1 X2 X3
6.134811 49.58038 0
6.135331 49.58127 114
6.135670 49.58170 61
6.134905 49.58199 90
Any ideas on how to solve this would be greatly appreciated!
Use head and tail to isolate the "all but last" and "all but first" parts of your data.frame, then pass those to your desired distance function. Prepend the result with 0 and assign it to a new column in your data:
X <- data.frame( X1 = c(6.134811, 6.135331, 6.135670, 6.134905),
X2 = c(49.58038, 49.58127, 49.58170, 49.58199) )
X$X3 <- c( 0, geosphere::distCosine( head(X,-1), tail(X,-1) ) )
# X1 X2 X3
# 1 6.134811 49.58038 0.00000
# 2 6.135331 49.58127 105.94514
# 3 6.135670 49.58170 53.75830
# 4 6.134905 49.58199 63.95907
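If an ellipsoidal model is preferred, geosphere::distGeo (or distHaversine for a plain sphere) can be swapped in the same way; this is only a sketch and makes no claim about how the expected 114/61/90 figures above were derived:

X$X3_geo <- c( 0, geosphere::distGeo( head(X,-1), tail(X,-1) ) )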
I have a large data set and I want to loop over it, subtract columns as paired in a lookup table (ref1), and output the result for each row in a new column.
ref1 <-
    samples    Controls
  E_2334188   E_2334207
  E_2334202   E_2334221
df1 <-
Chr Start End Feature E_2334188 E_2334202 E_2334207 E_2334221
1 740001 760000 1:740001-760000 1.6832013 0.8346011 -0.23045394 1.5974912
1 760001 780000 1:760001-780000 -0.3231613 -1.8504905 0.13668752 -0.38662600
1 780001 800000 1:780001-800000 -0.3936060 -2.2163153 -0.15266541 -0.60706691
ind <- which(names(df1) %in% ref1$samples)
rnd <- which(names(df1) %in% ref1$controls)
df2 <- df1[,c(1:4)]
df2$newcol <- 0
for (i in 1:nrow(ref1)){
  n <- df1[ind] - df1[rnd]
  df2$newcol[i] <- n
}
expected outcome
df2 <-
Chr Start End Feature E_2334188 E_2334202
1 740001 760000 1:740001-760000 1.913655 -0.7628901
1 760001 780000 1:760001-780000 -0.4598488 -1.463865
1 780001 800000 1:780001-800000 -0.2409406 -1.609248
We can subset the 'df1' based on the elements in 'samples' and 'Controls', subtract them, and cbind with the first 4 columns of 'df1'.
cbind(df1[1:4],df1[ref1$samples]- df1[ref1$Controls])
# Chr Start End Feature E_2334188 E_2334202
#1 1 740001 760000 1:740001-760000 1.9136552 -0.7628901
#2 1 760001 780000 1:760001-780000 -0.4598488 -1.4638645
#3 1 780001 800000 1:780001-800000 -0.2409406 -1.6092484
NOTE: If the 'samples' and 'Controls' columns are factor class, convert to character and use the same approach.
cbind(df1[1:4],df1[as.character(ref1$samples)]- df1[as.character(ref1$Controls)])
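For completeness, a sketch that rebuilds ref1 and df1 from the values shown above, so the one-liner can be run end to end (values transcribed from the question):

ref1 <- data.frame(samples  = c("E_2334188", "E_2334202"),
                   Controls = c("E_2334207", "E_2334221"),
                   stringsAsFactors = FALSE)

df1 <- data.frame(
  Chr   = 1,
  Start = c(740001, 760001, 780001),
  End   = c(760000, 780000, 800000),
  Feature = c("1:740001-760000", "1:760001-780000", "1:780001-800000"),
  E_2334188 = c(1.6832013, -0.3231613, -0.3936060),
  E_2334202 = c(0.8346011, -1.8504905, -2.2163153),
  E_2334207 = c(-0.23045394, 0.13668752, -0.15266541),
  E_2334221 = c(1.5974912, -0.38662600, -0.60706691)
)

df2 <- cbind(df1[1:4], df1[ref1$samples] - df1[ref1$Controls])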
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude the data that falls within the lower quartile. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way to write this so that it is easy to change the threshold by passing arguments from Java (as I have with the input file name), that's even better!
Thank you so much.
I have now implemented the answer below and that is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from the csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one does not meet my quartile threshold (0.25 quantile). For example, if the lower quartile for O were 45000, the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read in both columns as a data frame, can I search each column separately? Or should I find the quartiles first and then feed those values into code that removes the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
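For the paired-column case described in the question (keeping a row only when both abundance values clear their own lower quartile), a sketch along the same lines, assuming the csv has been read into Values as above:

Values <- read.csv(inputFile, header = TRUE)
keep <- Values$Abundance_O > quantile(Values$Abundance_O, 0.25) &
        Values$Abundance_S > quantile(Values$Abundance_S, 0.25)
Values_trimmed <- Values[keep, ]            # both columns stay paired, row-wise
cor(Values_trimmed$Abundance_O, Values_trimmed$Abundance_S)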
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. Parameter n determines the quantile length (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = rep(1:20)
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)

dt = data.table(
A0 = runif(100),
A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
> A0 A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we only include rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
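Applied to the abundance data from the question, a sketch (assuming the csv has been read into a data.table) would be:

library(data.table)
vdt <- as.data.table(read.csv(inputFile, header = TRUE))
# Quartile-group both abundance columns, then keep rows where neither is in the bottom quartile
vdt[, c("Q_O", "Q_S") := lapply(.SD, qgroup), .SDcols = c("Abundance_O", "Abundance_S")]
vdt_trimmed <- vdt[Q_O > 1 & Q_S > 1]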