Transform 2D matrix to 3D matrix in R

After a simulation I have data like this:
capt2[1,1] capt2[2,1] capt2[3,1] capt2[4,1] capt2[5,1] capt2[6,1] capt2[1,2] capt2[2,2] capt2[3,2] capt2[4,2]
1 4.582288e-05 5.115372e-05 6.409558e-05 7.132340e-05 6.927382e-05 5.727399e-05 2.753242e-05 3.106131e-05 3.832073e-05 4.270945e-05
2 4.675470e-05 5.045181e-05 6.467788e-05 7.112534e-05 6.809241e-05 5.885455e-05 2.789134e-05 3.097479e-05 3.790915e-05 4.176663e-05
3 4.586335e-05 5.127838e-05 6.344857e-05 6.934458e-05 6.622970e-05 5.651329e-05 2.795094e-05 3.120102e-05 3.790188e-05 4.172773e-05
4 4.572750e-05 5.150407e-05 6.333068e-05 7.145439e-05 6.624694e-05 5.836059e-05 2.795106e-05 3.055858e-05 3.826570e-05 4.172327e-05
5 4.740812e-05 5.113890e-05 6.397921e-05 7.163161e-05 6.838507e-05 5.620327e-05 2.790780e-05 3.083819e-05 3.821806e-05 4.198080e-05
6 4.583460e-05 5.106634e-05 6.340507e-05 7.030548e-05 6.886533e-05 5.901374e-05 2.792663e-05 3.136544e-05 3.862876e-05 4.177590e-05
with a length of 40000 lines.
In the column names, the [1:6, ] index refers to months and the [, 1:x] index refers to territories. So I would like an object with [, 1:x] columns (28 in my dataset) and [1:6, ] rows, with the 40000 simulations along the third dimension.
Then, with my 3D table of 6 rows and 28 columns, I would like to do simple operations, such as a histogram of the values along the third dimension for row 1 / column 1, and so on.
Edit: "capt2[3,1]" is just the name of the column, stored as a character string.

Just transform it into an array.
I'll simulate some data to show you how to do this.
set.seed(42)
n <- 10 # `n` in your data would be 40,000
# your rownames
v <- c("capt2[1,1]", "capt2[2,1]", "capt2[3,1]", "capt2[4,1]", "capt2[5,1]", "capt2[6,1]",
       "capt2[1,2]", "capt2[2,2]", "capt2[3,2]", "capt2[4,2]", "capt2[5,2]", "capt2[6,2]",
       "capt2[1,3]", "capt2[2,3]", "capt2[3,3]", "capt2[4,3]", "capt2[5,3]", "capt2[6,3]")
M <- matrix(rnorm(3*6*n), n, dimnames=list(NULL, v)) # stands in for your data
M[1:2, 1:6]
# capt2[1,1] capt2[2,1] capt2[3,1] capt2[4,1] capt2[5,1] capt2[6,1]
# [1,] -0.132088 0.5156677 1.3487070 1.01687283 -0.73844075 0.8131950
# [2,] 1.476787 -0.2343653 -0.0227647 -0.02671746 0.04656394 -0.1908165
Now apply array with the right dimensions and dimnames.
A <- array(as.vector(t(M)), dim=c(6, 3, n),
           dimnames=list(paste0("month.", 1:6), paste0("territory.", 1:3), NULL))
A
# , , 1
#
# territory.1 territory.2 territory.3
# month.1 -0.1320880 0.4703934 -1.3870266
# month.2 0.5156677 2.4595935 1.1573471
# month.3 1.3487070 -0.1662615 -0.2901453
# month.4 1.0168728 0.4823695 1.8922020
# month.5 -0.7384408 -0.7848878 -0.2764311
# month.6 0.8131950 1.1454705 -0.3047780
#
# , , 2
#
# territory.1 territory.2 territory.3
# month.1 1.47678742 -1.24267027 -1.3066759
# month.2 -0.23436528 -0.81838032 -1.6824809
# month.3 -0.02276470 0.86256338 0.8285461
# month.4 -0.02671746 0.99294364 -1.3859983
# month.5 0.04656394 0.16341632 -1.1094188
# month.6 -0.19081647 0.03157319 0.5978327
#
# , , 3
#
# territory.1 territory.2 territory.3
# month.1 -0.2170302 1.38157546 -0.76839533
# month.2 -0.6585034 -2.11320011 0.08731909
# month.3 0.2442259 0.09734049 -0.29122771
# month.4 0.7036078 -1.24639550 -0.41482430
# month.5 -1.0175961 -1.23671424 0.13386932
# month.6 -2.6999298 -0.83520581 1.39742941
[...]
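For the follow-up about simple operations: once the data is in array form, the simulations for one month/territory cell are just `A[row, col, ]`. A minimal self-contained sketch (rebuilding a small `A` as above):

```r
set.seed(42)
n <- 10                                     # would be 40000 in the real data
M <- matrix(rnorm(3 * 6 * n), n)            # stands in for the simulation output
A <- array(as.vector(t(M)), dim = c(6, 3, n),
           dimnames = list(paste0("month.", 1:6), paste0("territory.", 1:3), NULL))

# histogram over the n simulated values for month 1 / territory 1
hist(A[1, 1, ], main = "month.1 / territory.1", xlab = "value")

mean(A[1, 1, ])          # mean over simulations for a single cell
apply(A, c(1, 2), mean)  # 6 x 3 matrix of per-cell means over all simulations
```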

Finding closest point by comparing two data frames in R

I have a set of two data frames in R
First:
site_no <- c("02110500","02110550", "02110701" , "02110704", "02110760", "02110777", "021108044", "02110815")
lat_coor <- c(33.91267, 33.85083, 33.86100, 33.83295, 33.74073, 33.85156, 33.65017, 33.44461)
long_coor <- c(-78.71502, -78.89722, -79.04115, -79.04365, -78.86669, -78.65585, -79.12310, -79.17393)
AllStations <- data.frame(site_no, lat_coor, long_coor)
Second:
station <- c("USGS-02146110","USGS-02146110","USGS-02146110","USGS-02146110","USGS-02146110","USGS-021473426","USGS-021473426","USGS-021473426")
latitude <- c(34.88928, 34.85651, 34.85651, 34.85651, 34.71679, 34.24320, 34.80012, 34.80012)
longitude <- c(-81.06869, -82.22622, -82.22622, -82.22622, -82.17372, -81.31954, -82.36512, -82.36512)
ContaminantStations <- data.frame(station, latitude, longitude)
My data sets are a lot longer but for the purpose of this question I think this should be enough.
What I would like is to find all the stations in the first data frame (AllStations) that are inside a radius around the points in the second data frame (ContaminantStations), and append them to a new data frame (only the ones from AllStations); I need to extract each station with all its information. I've tried some logical comparisons, but none of them work or make sense. I also tried RANN::nn2, but that only gives me the count.
Any help would be appreciated
I think you just need to iterate over each station in AllStations and return the nearest of the ContaminantStations that is within the radius.
func <- function(stations, constations, radius = 250000) {
  if (!NROW(stations) || !NROW(constations)) return()
  if (length(radius) == 1 && NROW(constations) > 1) {
    radius <- rep(radius, NROW(constations))
  } else if (length(radius) != NROW(constations)) {
    stop("'radius' must be length 1 or the same as the number of rows in 'constations'")
  }
  out <- integer(NROW(stations))
  for (i in seq_len(NROW(stations))) {
    dists <- geosphere::distHaversine(stations[i,], constations)
    out[i] <- if (any(dists <= radius)) which.min(dists) else 0L
  }
  return(out)
}
This returns an integer vector indicating which contaminant station is closest. If none is within the radius, it returns 0, which is safe to use as a row index on the original frame.
Each argument must include only two columns, with the first column being longitude. (I make no assumptions about the column names inside the function.) radius is in meters, consistent with the geosphere package's defaults.
ind <- func(AllStations[,c("long_coor", "lat_coor")], ContaminantStations[,c("longitude", "latitude")],
            radius = 230000)
ind
# [1] 0 6 6 6 0 0 6 6
These are indices on the ContaminantStations rows, where non-zero means that that contaminant station is the closest to the specific row of AllStations.
We can identify which contaminant station is closest with this (there are many ways to do this, including tidyverse and other techniques ... this is just a start).
AllStations$ClosestContaminantStation <- NA_character_
AllStations$ClosestContaminantStation[ind > 0] <- ContaminantStations$station[ind]
AllStations
# site_no lat_coor long_coor ClosestContaminantStation
# 1 02110500 33.91267 -78.71502 <NA>
# 2 02110550 33.85083 -78.89722 USGS-021473426
# 3 02110701 33.86100 -79.04115 USGS-021473426
# 4 02110704 33.83295 -79.04365 USGS-021473426
# 5 02110760 33.74073 -78.86669 <NA>
# 6 02110777 33.85156 -78.65585 <NA>
# 7 021108044 33.65017 -79.12310 USGS-021473426
# 8 02110815 33.44461 -79.17393 USGS-021473426
An alternative to this approach would be to return the distance and index of the closest contaminant station, regardless of the radius, allowing you to filter later.
func2 <- function(stations, constations, radius = 250000) {
  if (!NROW(stations) || !NROW(constations)) return()
  if (length(radius) == 1 && NROW(constations) > 1) {
    radius <- rep(radius, NROW(constations))
  } else if (length(radius) != NROW(constations)) {
    stop("'radius' must be length 1 or the same as the number of rows in 'constations'")
  }
  out <- data.frame(ind = integer(NROW(stations)), dist = numeric(NROW(stations)))
  for (i in seq_len(NROW(stations))) {
    dists <- geosphere::distHaversine(stations[i,], constations)
    out$ind[i] <- which.min(dists)
    out$dist[i] <- min(dists)
  }
  return(out)
}
Demonstration, including bringing the contaminant station into the same frame.
AllStations2 <- cbind(
  AllStations,
  func2(AllStations[,c("long_coor", "lat_coor")], ContaminantStations[,c("longitude", "latitude")])
)
AllStations2
# site_no lat_coor long_coor ind dist
# 1 02110500 33.91267 -78.71502 1 241971.5
# 2 02110550 33.85083 -78.89722 6 227650.6
# 3 02110701 33.86100 -79.04115 6 214397.8
# 4 02110704 33.83295 -79.04365 6 214847.7
# 5 02110760 33.74073 -78.86669 6 233190.8
# 6 02110777 33.85156 -78.65585 6 249519.7
# 7 021108044 33.65017 -79.12310 6 213299.3
# 8 02110815 33.44461 -79.17393 6 217378.9
AllStations3 <- cbind(
  AllStations2,
  ContaminantStations[AllStations2$ind,]
)
AllStations3
# site_no lat_coor long_coor ind dist station latitude longitude
# 1 02110500 33.91267 -78.71502 1 241971.5 USGS-02146110 34.88928 -81.06869
# 6 02110550 33.85083 -78.89722 6 227650.6 USGS-021473426 34.24320 -81.31954
# 6.1 02110701 33.86100 -79.04115 6 214397.8 USGS-021473426 34.24320 -81.31954
# 6.2 02110704 33.83295 -79.04365 6 214847.7 USGS-021473426 34.24320 -81.31954
# 6.3 02110760 33.74073 -78.86669 6 233190.8 USGS-021473426 34.24320 -81.31954
# 6.4 02110777 33.85156 -78.65585 6 249519.7 USGS-021473426 34.24320 -81.31954
# 6.5 021108044 33.65017 -79.12310 6 213299.3 USGS-021473426 34.24320 -81.31954
# 6.6 02110815 33.44461 -79.17393 6 217378.9 USGS-021473426 34.24320 -81.31954
From here, you can choose your radius at will:
subset(AllStations3, dist < 230000)
# site_no lat_coor long_coor ind dist station latitude longitude
# 6 02110550 33.85083 -78.89722 6 227650.6 USGS-021473426 34.2432 -81.31954
# 6.1 02110701 33.86100 -79.04115 6 214397.8 USGS-021473426 34.2432 -81.31954
# 6.2 02110704 33.83295 -79.04365 6 214847.7 USGS-021473426 34.2432 -81.31954
# 6.5 021108044 33.65017 -79.12310 6 213299.3 USGS-021473426 34.2432 -81.31954
# 6.6 02110815 33.44461 -79.17393 6 217378.9 USGS-021473426 34.2432 -81.31954
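If you'd rather not depend on geosphere, the same nearest-station step can be sketched in base R with a hand-rolled haversine (`haversine` below is a hypothetical helper written for this sketch, not a package function; Earth radius in meters):

```r
# two stations and two contaminant stations from the question, longitude first
AllStations <- data.frame(long = c(-78.71502, -78.89722),
                          lat  = c(33.91267, 33.85083))
ContaminantStations <- data.frame(long = c(-81.06869, -81.31954),
                                  lat  = c(34.88928, 34.24320))

# hypothetical helper: p is one c(lon, lat) point, q a matrix of lon/lat rows
haversine <- function(p, q, r = 6378137) {
  toRad <- pi / 180
  dlat <- (q[, 2] - p[2]) * toRad
  dlon <- (q[, 1] - p[1]) * toRad
  a <- sin(dlat / 2)^2 + cos(p[2] * toRad) * cos(q[, 2] * toRad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))   # distance in meters
}

# index of the closest contaminant station for each station
nearest <- sapply(seq_len(nrow(AllStations)), function(i) {
  d <- haversine(unlist(AllStations[i, ]), as.matrix(ContaminantStations))
  which.min(d)
})
```

As with func2 above, thresholding by a radius is then just a comparison on the resulting distances.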

For Loop and Countif Function R

I would like to create a for loop to count if the values in each row are larger than a cutoff value that changes from row to row in another matrix. Currently, my code looks like this:
for (i in 1:100) {
  count_Q4_l2 <- 0 # reset the counter for every column
  for (j in 1:10000) {
    if (ACT_Allquant2[1,i] > cc[j,1]) { # if this cutoff is larger than the value, count it
      count_Q4_l2 <- count_Q4_l2 + 1
    }
  }
  countALL[1,i] <- count_Q4_l2 # save the count into another data.frame
}
The cutoff values are in the ACT_Allquant2 table and should advance together with the loop index.
Hope I explained myself clearly and I thank you very much in advance for your help!!
EDIT:
ACT_Allquant2 looks the following way:
X91. X92. X93. X94. X95. X96. X97. X98.
Qfourfac_netlg2 0.7685364 0.8995720 0.9896079 1.014982 1.066362 1.229381
X99. X100.
Qfourfac_netlg2 1.727864 2.318737
while cc is a single row spanning many columns:
X1. X2. X3. X4. X5. X6. X7. X8. X9.
2 -2.504816 -2.433826 -2.305134 -2.261871 -2.110741 -1.894405 -1.344805 -1.256876 -1.211396
X10. X11. X12. X13. X14. X15. X16. X17.
2 -1.199943 -1.13323 -1.031908 -1.019844 -1.007079 -0.9932806 -0.9232708 -0.8316696
X18. X19. X20. X21. X22. X23. X24. X25.
2 -0.8052391 -0.7738284 -0.7334976 -0.7126213 -0.6950152 -0.6272749 -0.584775 -0.5540359
X26. X27. X28. X29. X30. X31. X32. X33.
2 -0.5307423 -0.5105184 -0.4107709 -0.4001571 -0.3959766 -0.3607601 -0.329242 -0.2746449
X34. X35. X36. X37. X38. X39. X40. X41.
2 -0.2231775 -0.1799284 -0.1684765 -0.1568755 -0.1446923 -0.1403811 -0.1387818 -0.126637
X42. X43. X44. X45. X46. X47. X48. X49.
2 -0.1082471 -0.08882241 -0.053299 -0.04695731 0.002623936 0.05961787 0.07482258 0.0868524
X50. X51. X52. X53. X54. X55. X56. X57. X58.
2 0.09455113 0.1003998 0.1077676 0.1574778 0.1810591 0.1832488 0.1874931 0.1893803 0.1955026
X59. X60. X61. X62. X63. X64. X65. X66. X67.
2 0.2035948 0.2321749 0.2453042 0.2604033 0.2739561 0.3018942 0.3835822 0.5748584 0.603411
X68. X69. X70. X71. X72. X73. X74. X75. X76.
2 0.6580565 0.6882143 0.7104922 0.7568134 0.7769822 0.7932305 0.8550466 0.876781 1.084851
X77. X78. X79. X80. X81. X82. X83. X84. X85. X86.
2 1.117067 1.196249 1.261902 1.310987 1.423575 1.485869 1.606687 1.678782 1.950923 1.995428
X87. X88. X89. X90. X91. X92. X93. X94. X95. X96.
2 1.99818 2.04422 2.080644 2.205811 2.21738 2.356354 2.469436 2.484198 2.52253 2.564173
X97. X98. X99.
2 2.638286 2.675248 2.768761
I'm not sure I understand, but let's try a simple example:
set.seed(41)
ACT <- data.frame(matrix(rnorm(100), 25, 4))
cc <- rnorm(4, 0, .5)
cc
# [1] 0.03641331 0.59785494 -1.05581599 0.33569523
In each column of ACT you want to count the values that exceed the value in cc, e.g. for column 1 the number that exceed 0.03641331, for column 2 the number that exceed 0.59785494? If that is so, you do not need any loops:
Comp <- sweep(ACT, 2, cc, ">")
Count <- colSums(Comp)
Count
# X1 X2 X3 X4
# 16 8 22 10
You can extract the values that exceed the cc value for each column, but you cannot put them into a data frame since the number of values in each column is different. You can create a data frame with the coordinates of the larger values or a list with the values for each column:
Larger <- data.frame(which(Comp, arr.ind=TRUE), ACT[Comp])
head(Larger)
# row col ACT.Comp.
# 1 2 1 0.1972575
# 2 3 1 1.0017043
# 3 4 1 1.2888254
# 4 5 1 0.9057534
# 5 6 1 0.4936675
# 6 7 1 0.5992858
LargerByCol <- split(Larger$ACT.Comp, Larger$col)
LargerByCol[[1]]
# [1] 0.1972575 1.0017043 1.2888254 0.9057534 0.4936675 0.5992858 . . . 16 values

Broken R code to select specific rows and cells in text file and put into data frame

This is an extension of this question, which needs to be altered to accommodate more Band rows in the text file. What I want is to select the "Basic Stats" rows from a text file that looks like the one below and then organize them in a data frame, like the one at the bottom of the question. Here's a link to the file if you want to use it directly.
Filename: /blah/blah/blah.txt
ROI: red_2 [Red] 12 points
Basic Stats Min Max Mean Stdev
Band 1 0.032262 0.124425 0.078073 0.028031
Band 2 0.021072 0.064156 0.037923 0.012178
Band 3 0.013404 0.066043 0.036316 0.014787
Band 4 0.005162 0.055781 0.015526 0.013255
Histogram DN Npts Total Percent Acc Pct
Band 1 0.032262 1 1 8.3333 8.3333
Bin=0.00036 0.032624 0 1 0.0000 8.3333
0.032985 0 1 0.0000 8.3333
0.033346 0 1 0.0000 8.3333
This is the code I'm using:
dat <- readLines('/blah/blah/blah.txt')
# create an index for the lines that are needed: Basic stats and Bands
ti <- rep(which(grepl('ROI:', dat)), each = 8) + 1:8
# create a grouping vector of the same length
grp <- rep(1:203, each = 8)
# filter the text with the index 'ti'
# and split into a list with grouping variable 'grp'
lst <- split(dat[ti], grp)
# loop over the list a read the text parts in as dataframes
lst <- lapply(lst, function(x) read.table(text = x, sep = '\t', header = TRUE, blank.lines.skip = TRUE))
# bind the dataframes in the list together in one data.frame
DF <- do.call(rbind, lst)
# change the name of the first column
names(DF)[1] <- 'ROI'
# get the correct ROI's for the ROI-column
DF$ROI <- sub('.*: (\\w+).*$', '\\1', dat[grepl('ROI: ', dat)])
DF
The output looks something like this:
$ROI
[1] "red_2" "red_3" "red_4" "red_5" "red_6" "red_7" "red_8" "red_9" "red_10" "bcs_1" "bcs_2"
[12] "bcs_3" "bcs_4" "bcs_5" "bcs_6" "bcs_7" "bcs_8" "bcs_9" "bcs_10" "red_11" "red_12" "red_12"
[23] "red_13" "red_14" "red_15" "red_16" "red_17" "red_18" "red_19" "red_20" "red_21" "red_22" "red_23"
[34] "red_24" "red_25" "red_24" "red_25" "red_26" "red_27" "red_28" "red_29" "red_30" "red_31" "red_33"
$<NA>
[1] "Basic Stats\t Min\t Max\t Mean\t Stdev"
$<NA>
[1] "Basic Stats\t Min\t Max\t Mean\t Stdev"
etc...
When it should look this this:
ROI Band Min Max Mean Stdev
red_2 Band 1 0.032262 0.124425 0.078073 0.028031
red_2 Band 2 0.021072 0.064156 0.037923 0.012178
red_2 Band 3 0.013404 0.066043 0.036316 0.014787
red_2 Band 4 0.005162 0.055781 0.015526 0.013255
red_3 Band 1 values...
red_4 Band 2
red_4 Band 3
red_4 Band 4
I would like some help.
For this file you will have to adapt the approach I proposed here. For the linked text-file (test2.txt) I propose the following approach:
dat <- readLines('test2.txt')
len <- sum(grepl('ROI:', dat))
ti <- rep(which(grepl('ROI:', dat)), each = 7) + 0:6
grp <- rep(1:len, each = 7)
lst <- split(dat[ti], grp)
lst <- lapply(lst, function(x) read.table(text = x, sep = '\t', skip = 1, header = TRUE, blank.lines.skip = TRUE))
names(lst) <- sub('.*: (\\w+).*$', '\\1', dat[grepl('ROI: ', dat)])
library(data.table)
DT <- rbindlist(lst, idcol = 'ROI')
setnames(DT, 2, 'Band')
which give the desired result:
> DT
ROI Band Min Max Mean Stdev
1: red_1 Band 1 0.013282 0.133982 0.061581 0.034069
2: red_1 Band 2 0.009866 0.112935 0.042688 0.026618
3: red_1 Band 3 0.008304 0.037059 0.018434 0.007515
4: red_1 Band 4 0.004726 0.040089 0.018490 0.009605
5: red_2 Band 1 0.032262 0.124425 0.078073 0.028031
---
1220: bcs_49 Band 4 0.002578 0.010578 0.006191 0.002285
1221: bcs_50 Band 1 0.032775 0.072881 0.051152 0.012593
1222: bcs_50 Band 2 0.020029 0.085993 0.042864 0.018628
1223: bcs_50 Band 3 0.012770 0.034367 0.023056 0.006581
1224: bcs_50 Band 4 0.005804 0.024798 0.014049 0.005744
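If you want to avoid the data.table dependency, the bind-with-id step can also be done in base R; a sketch using a toy `lst` (the real one comes from the split/read.table steps above):

```r
# toy stand-in for the list of per-ROI data frames
lst <- list(
  red_1 = data.frame(Band = paste("Band", 1:2), Min = c(0.013282, 0.009866)),
  red_2 = data.frame(Band = paste("Band", 1:2), Min = c(0.032262, 0.021072))
)

# prepend each list name as an ROI column, then row-bind everything
DF <- do.call(rbind, Map(cbind, ROI = names(lst), lst))
rownames(DF) <- NULL
DF
```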

cbind subsets into one column in R

I have created subsets of a dataframe, which I used for calculations. I am now left with numerous subsets which I want to combine into one column. The subsets look like this:
> E
$`1`
[1] "AAAaaa" "TTTaaa" "CCCaaa" "GGGaaa" "AAAttt" "TTTttt" "CCCttt" "GGGttt"
[9] "AAAccc" "TTTccc" "CCCccc" "GGGccc" "AAAggg" "TTTggg" "CCCggg" "GGGggg"
$`2`
[1] "ATAata" "TATata" "CGCata" "GCGata" "BBBata" "ATAtat" "TATtat" "CGCtat"
[9] "GCGtat" "BBBtat" "ATAcgc" "TATcgc" "CGCcgc" "GCGcgc" "BBBcgc" "ATAgcg"
[17] "TATgcg" "CGCgcg" "GCGgcg" "BBBgcg" "ATAbbb" "TATbbb" "CGCbbb" "GCGbbb"
[25] "BBBbbb"
I have tried:
A=vector()
cbind(A,ExonJunction,deparse.level = 1)
A
But that leaves me with
E
1 Character,16
2 Character,25
I want the list of characters in one column. How do I do this?
You could also try the recursive argument of the c() function, something like:
c(E, recursive = TRUE, use.names = FALSE)
# [1] "AAAaaa" "TTTaaa" "CCCaaa" "GGGaaa" "AAAttt" "TTTttt" "CCCttt" "GGGttt" "AAAccc" "TTTccc" "CCCccc" "GGGccc" "AAAggg" "TTTggg" "CCCggg" "GGGggg" "ATAata"
# [18] "TATata" "CGCata" "GCGata" "BBBata" "ATAtat" "TATtat" "CGCtat" "GCGtat" "BBBtat" "ATAcgc" "TATcgc" "CGCcgc" "GCGcgc" "BBBcgc" "ATAgcg" "TATgcg" "CGCgcg"
# [35] "GCGgcg" "BBBgcg" "ATAbbb" "TATbbb" "CGCbbb" "GCGbbb" "BBBbbb"
Or if you want it as a column within a data frame, could try
df <- data.frame(Res = c(E, recursive = TRUE))
You can unlist the list and create a single-column dataframe with data.frame:
dat <- data.frame(Col1=unlist(E, use.names=FALSE), stringsAsFactors=FALSE)
data
E <- structure(list(`1` = c("AAAaaa", "TTTaaa", "CCCaaa", "GGGaaa",
"AAAttt", "TTTttt", "CCCttt", "GGGttt", "AAAccc", "TTTccc", "CCCccc",
"GGGccc", "AAAggg", "TTTggg", "CCCggg", "GGGggg"), `2` = c("ATAata",
"TATata", "CGCata", "GCGata", "BBBata", "ATAtat", "TATtat", "CGCtat",
"GCGtat", "BBBtat", "ATAcgc", "TATcgc", "CGCcgc", "GCGcgc", "BBBcgc",
"ATAgcg", "TATgcg", "CGCgcg", "GCGgcg", "BBBgcg", "ATAbbb", "TATbbb",
"CGCbbb", "GCGbbb", "BBBbbb")), .Names = c("1", "2"))
You can also use stack, like this (provided you are dealing with a named list, like you are):
stack(E)
A nice feature is that the names become the "ind" column, so the process is easily reversible.
head(stack(E))
# values ind
# 1 AAAaaa 1
# 2 TTTaaa 1
# 3 CCCaaa 1
# 4 GGGaaa 1
# 5 AAAttt 1
# 6 TTTttt 1
tail(stack(E))
# values ind
# 36 BBBgcg 2
# 37 ATAbbb 2
# 38 TATbbb 2
# 39 CGCbbb 2
# 40 GCGbbb 2
# 41 BBBbbb 2
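The reversibility mentioned above can be sketched with `unstack()`, which rebuilds the list from the `ind` column (small toy list here):

```r
E <- list(`1` = c("AAAaaa", "TTTaaa"),
          `2` = c("ATAata", "TATata", "CGCata"))
s <- stack(E)        # long form: a 'values' column plus an 'ind' grouping factor
back <- unstack(s)   # default formula values ~ ind rebuilds the list of vectors
```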

Using R to remove data which is below a quartile threshold

I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude the values that fall within it. I would then like to rewrite the data without those values and use the new column in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way to write this so that the threshold is easy to change by passing arguments from Java (as I have done with the input file name), that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in a row if either one does not meet my quartile threshold (the 0.25 quantile). For example, if the quartile for O were 45000, then the row "42046.61549, 152.1321255" would be removed. Is this possible? If I read both columns in as a dataframe, can I test each column separately? Or find the quartiles and then use those values in code that removes the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
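For the pairwise requirement added in the edit (drop a whole row when either column falls in its own bottom quartile), the two quantile tests can simply be combined with &. A sketch on toy data shaped like the question's csv (columns Abundance_O and Abundance_S):

```r
set.seed(1)
Values <- data.frame(Abundance_O = runif(20, 0, 4e6),
                     Abundance_S = runif(20, 0, 2e4))

# each column is compared against its own lower quartile;
# a row survives only if BOTH values clear their cutoffs
qO <- quantile(Values$Abundance_O, 0.25)
qS <- quantile(Values$Abundance_S, 0.25)
kept <- Values[Values$Abundance_O > qO & Values$Abundance_S > qS, ]

cor(kept$Abundance_O, kept$Abundance_S)  # correlation on the thresholded pairs
```

The 0.25 could be passed in from the command line (e.g. via commandArgs()) the same way the input file name is.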
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which will convert a numerical vector into its quantile groups. Parameter n determines the quantile length (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
  qtile = quantile(numvec, probs = seq(0, 1, 1/n))
  out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
  return(out)
}
Function example:
v = 1:20
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we keep only the rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]
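A base-R sketch of the same grouping idea, using cut() with quantile breaks (equivalent to qgroup above when the breaks are distinct; note the boundary handling differs slightly, since cut() closes intervals on the right while qgroup uses >=):

```r
v <- c(5, 1, 9, 3, 7)
qtile <- quantile(v, probs = seq(0, 1, 0.25))   # quartile breakpoints
grp <- cut(v, breaks = qtile, include.lowest = TRUE, labels = FALSE)
grp  # quartile group (1-4) for each element of v
```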
