I would like to create a for loop that counts how many values in each row are larger than a cutoff value that changes from row to row and is stored in another matrix. Currently, my code looks like this:
for (i in 1:100) {
  count_Q4_l2 <- 0 # reset the counter before every inner loop
  for (j in 1:10000) {
    if (ACT_Allquant2[1, i] > cc[j, 1]) { # if the value exceeds this cutoff, count it
      count_Q4_l2 <- count_Q4_l2 + 1
    }
  }
  countALL[1, i] <- count_Q4_l2 # save the count into another data.frame
}
The cutoff values are in the ACT_Allquant2 table, and the relevant cutoff should advance together with the loop index.
I hope I explained myself clearly, and I thank you very much in advance for your help!
EDIT:
ACT_Allquant2 looks like this (console output, wrapped):
X91. X92. X93. X94. X95. X96. X97. X98.
Qfourfac_netlg2 0.7685364 0.8995720 0.9896079 1.014982 1.066362 1.229381
X99. X100.
Qfourfac_netlg2 1.727864 2.318737
cc, meanwhile, is a single row spanning columns X1. through X99.:
X1. X2. X3. X4. X5. X6. X7. X8. X9.
2 -2.504816 -2.433826 -2.305134 -2.261871 -2.110741 -1.894405 -1.344805 -1.256876 -1.211396
X10. X11. X12. X13. X14. X15. X16. X17.
2 -1.199943 -1.13323 -1.031908 -1.019844 -1.007079 -0.9932806 -0.9232708 -0.8316696
X18. X19. X20. X21. X22. X23. X24. X25.
2 -0.8052391 -0.7738284 -0.7334976 -0.7126213 -0.6950152 -0.6272749 -0.584775 -0.5540359
X26. X27. X28. X29. X30. X31. X32. X33.
2 -0.5307423 -0.5105184 -0.4107709 -0.4001571 -0.3959766 -0.3607601 -0.329242 -0.2746449
X34. X35. X36. X37. X38. X39. X40. X41.
2 -0.2231775 -0.1799284 -0.1684765 -0.1568755 -0.1446923 -0.1403811 -0.1387818 -0.126637
X42. X43. X44. X45. X46. X47. X48. X49.
2 -0.1082471 -0.08882241 -0.053299 -0.04695731 0.002623936 0.05961787 0.07482258 0.0868524
X50. X51. X52. X53. X54. X55. X56. X57. X58.
2 0.09455113 0.1003998 0.1077676 0.1574778 0.1810591 0.1832488 0.1874931 0.1893803 0.1955026
X59. X60. X61. X62. X63. X64. X65. X66. X67.
2 0.2035948 0.2321749 0.2453042 0.2604033 0.2739561 0.3018942 0.3835822 0.5748584 0.603411
X68. X69. X70. X71. X72. X73. X74. X75. X76.
2 0.6580565 0.6882143 0.7104922 0.7568134 0.7769822 0.7932305 0.8550466 0.876781 1.084851
X77. X78. X79. X80. X81. X82. X83. X84. X85. X86.
2 1.117067 1.196249 1.261902 1.310987 1.423575 1.485869 1.606687 1.678782 1.950923 1.995428
X87. X88. X89. X90. X91. X92. X93. X94. X95. X96.
2 1.99818 2.04422 2.080644 2.205811 2.21738 2.356354 2.469436 2.484198 2.52253 2.564173
X97. X98. X99.
2 2.638286 2.675248 2.768761
I'm not sure I understand, but let's try a simple example:
set.seed(41)
ACT <- data.frame(matrix(rnorm(100), 25, 4))
cc <- rnorm(4, 0, .5)
cc
# [1] 0.03641331 0.59785494 -1.05581599 0.33569523
In each column of ACT you want to count the values that exceed the value in cc, e.g. for column 1 the number that exceed 0.03641331, for column 2 the number that exceed 0.59785494? If that is so, you do not need any loops:
Comp <- sweep(ACT, 2, cc, ">")
Count <- colSums(Comp)
Count
# X1 X2 X3 X4
# 16 8 22 10
You can extract the values that exceed the cc value for each column, but you cannot put them into a data frame since the number of values in each column is different. You can create a data frame with the coordinates of the larger values or a list with the values for each column:
Larger <- data.frame(which(Comp, arr.ind=TRUE), ACT[Comp])
head(Larger)
# row col ACT.Comp.
# 1 2 1 0.1972575
# 2 3 1 1.0017043
# 3 4 1 1.2888254
# 4 5 1 0.9057534
# 5 6 1 0.4936675
# 6 7 1 0.5992858
LargerByCol <- split(Larger$ACT.Comp, Larger$col)
LargerByCol[[1]]
# [1] 0.1972575 1.0017043 1.2888254 0.9057534 0.4936675 0.5992858 . . . 16 values
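Mapping this back to the objects in the question (hedged: the printed orientation of ACT_Allquant2 and cc is ambiguous, so this follows the loop's own indexing, with the cutoffs in row 1 of ACT_Allquant2 and the values in column 1 of cc):
countALL <- sapply(seq_len(ncol(ACT_Allquant2)), function(i)
  sum(ACT_Allquant2[1, i] > cc[, 1], na.rm = TRUE)) # how many cc values fall below each cutoff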
I set up 2 problems as follows:
I have two matrices (Mat1 and Mat2). Both matrices are of equal size. I have four output matrices (Output1, Output2, Output3, Output4 respectively) both the same size as Mat1 and Mat2.
Problem 1:
In Mat2, identify the row that contains the maximum value in column 1. Let's assume this is row 1.
Go to row 1 of Mat1, extract the first 3 columns of Mat1 row 1, and store them in Output1. Store all other rows of Mat1 for the first 3 columns in Output2. At this stage Output1 is 1x3 and Output2 is (n-1)x3.
Move to column 4 of Mat2. Identify the row that contains the maximum value. Let's say this is row 5.
Go to row 5, column 4 of Mat1. Store row 5, columns 4, 5, 6 in Output1. Store all other rows of Mat1 for columns 4, 5, 6 in Output2.
... Repeat this process for all columns in Mat2 following the sequence 1, 4, 7, 10, etc. In this case, I have 25 columns for Mat1 and Mat2, so the sequence will end at 24.
I need to be able to change the sequence from 1, 4, 7, 10, etc. to 1, 13, 25, etc.
Problem 2:
is equivalent to Problem 1, except this time I identify the rows that contain the top-two values at every stage.
In Mat2, identify the rows that contain the top-two values in column 1. Let's say these rows are 2 and 5. Store the first 3 columns of rows 2 and 5 of Mat1 in Output3. Store all remaining rows (columns 1-3) of Mat1 in Output4.
Move to column 4 of Mat2. Identify the rows which contain the top-two values in column 4 of Mat2. Let's say rows 1 and 2.
Move to column 4 of Mat1. Store columns 4, 5, 6 of rows 1 and 2 in Output3. Store all remaining rows in Output4.
Sidenote: this process must be easily extendable to matrices of 1000x1000 dimensions, so I would prefer not to do this manually.
Mat1 <- data.frame(matrix(nrow = 10, ncol =25, data = rnorm(250,0,1)))
Mat2 <- data.frame(matrix(nrow = 10, ncol =25, data = rnorm(250,0,1)))
> Mat1
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1 -2.22415466 0.98712728 1.0084356 0.58447183 0.2608830 -0.341029099 -0.71693894 -0.61653058 -0.24790470 0.10777970 -1.68562271 -1.6638535 -0.5538468
2 1.11444365 -0.34865955 0.7518822 -0.07573724 0.1336811 -0.831275643 -0.15564822 -0.68849375 -0.05094047 0.21990082 -0.69879135 -0.6348292 1.0172304
3 -0.05367747 0.08654206 -0.3023270 -0.67335942 -1.1173279 0.004670625 0.52482501 0.78330982 1.18795853 -0.06513613 0.42353439 -0.4152209 1.7174158
4 0.42118984 -0.43257583 -1.3368036 1.64849798 0.8294276 1.256987496 -0.50440892 1.07686292 0.94196135 2.90916270 -0.08714083 0.1094395 1.1715895
5 -0.13720451 -0.94864452 1.9751962 -0.70523555 0.1431405 0.569928767 0.54877505 -0.44571903 -1.16282161 -1.65590032 -0.17710859 -0.8904316 0.3252576
6 0.64336424 -0.38277541 -1.6512377 -0.06542054 -0.1195322 0.666255832 0.60826054 1.88822842 -0.52952627 -0.44776682 0.04321836 -0.6190585 -0.9529690
7 -1.04160098 1.10952094 -0.9186759 0.77437293 -0.2284926 -0.113106151 -0.32092624 1.34157301 2.33813068 1.21812714 0.13165646 0.5532299 -1.3470645
8 1.22940987 -1.26271164 -1.2483658 -2.00578793 -0.6773794 -0.228135998 -0.06223206 -1.97606848 1.67339247 -0.47268196 -0.83544561 -0.3313278 -0.2373613
9 0.08485706 -1.60594589 0.8549923 -0.23394708 -0.5978692 -0.321839877 -0.55298452 -0.08387815 -0.99196489 0.83364114 -0.19579612 -0.8017648 -0.2238073
10 -1.71702699 0.39086484 -0.9974210 0.86232862 -0.2755329 -0.160656438 0.49669949 0.73763073 -0.42380390 1.91208332 -0.27778479 0.7866471 0.1813511
> Mat2
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1 0.11053732 -0.5750170 2.58105259 -1.6895285 0.0508918 -0.54188929 -0.92292169 1.4972970 0.009239807 -0.1706461 -0.8942262 -1.6351505 0.2029262
2 -0.83802776 -0.9322157 -0.34753884 0.8164819 0.7318198 0.09162218 0.15971493 -2.6731067 1.554323641 -0.3161967 0.4622101 -1.9521229 -1.3229961
3 0.61368153 -1.3650360 0.95674229 0.4582117 -0.6959545 -0.59627428 1.94172156 1.6784237 -0.482524695 -0.0514944 -0.4608930 -0.5456863 -0.1340540
4 -1.03156503 -0.2516495 0.76770177 -0.7841354 -3.2404904 -1.76276859 1.57421914 0.9782458 -1.364451438 -0.6437429 0.7485424 -0.8778284 1.7587504
5 0.01183232 0.6825633 1.39634308 1.4136879 0.5166420 0.76930390 0.67210932 1.3007904 -0.284451411 0.5163457 0.3198626 0.8030497 -1.4320064
6 -0.06110883 -0.6762991 0.56105196 0.9767543 -1.0016294 -0.84811626 -0.83319744 -1.1777865 -1.185631394 -0.5673733 0.2956725 0.5425602 -1.0510479
7 -0.56195630 1.3883881 0.09995573 0.6722959 -1.6205290 0.32085867 -0.94243554 -0.2340429 -1.299085265 -0.4433517 0.4424583 -2.8887970 0.1679859
8 1.04612102 0.8360530 0.07005306 0.4818317 1.1857504 0.13649605 1.35261983 0.8008935 -0.101922164 0.6773003 -1.0265770 0.1859912 0.2678461
9 0.88419676 -1.7012899 -1.09656000 -0.4360276 0.6238451 -2.03256276 -1.12575579 1.8407234 0.522372401 -0.6229582 0.6727720 -0.5695190 0.6298388
10 -0.68648649 -0.6689894 -0.56849261 -1.9012760 1.1418180 0.46377789 -0.08107475 1.4378120 -1.489367198 -0.7682887 -0.2858680 0.9584056 1.3178700
So for example:
which.max(Mat2[,1]) # 8
so go to row 8 of Mat1 and store the first 3 cols in Output1.
Output1[1, 1:3] # 1.22940987 -1.26271164 -1.2483658
Store all other rows of Mat1 for cols 1 to 3 in Output2.
which.max(Mat2[, 4]) # 5
implies
Output1[1, 4:6] # -0.70523555 0.1431405 0.569928767
And so on and so forth.
Does this do what you need?
Juggle <- function(m1, m2, col, step, top_n = 1) {
  if (length(col) == 1) {
    # order the rows of m2 by this column, descending
    output_rows <- sort(m2[, col], index.return = TRUE, decreasing = TRUE)$ix
    col_idx <- seq(col, col + step - 1)
    top_idx <- seq(top_n)
    bottom_idx <- seq(top_n + 1, nrow(m2))
    max_out <- m1[output_rows[top_idx], col_idx]
    rest_out <- m1[output_rows[bottom_idx], col_idx]
    return(list(max_out = max_out, rest_out = rest_out))
  } else {
    # recurse: handle the first column block, then the remaining blocks
    left <- Juggle(m1, m2, col[1], step, top_n)
    right <- Juggle(m1, m2, col[seq(2, length(col))], step, top_n)
    return(list(
      max_out = cbind(left$max_out, right$max_out),
      rest_out = cbind(left$rest_out, right$rest_out)
    ))
  }
}
m1 and m2 are the matrices or data frames.
col is a single starting column or a sequence of starting columns like 1, 4, 7, 10 or 1, 13, 25.
step is the number of columns to extract per block.
top_n is the number of rows to extract.
This returns a list with components max_out for the top_n row extracts and rest_out for the remaining N - top_n row extracts.
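A quick usage sketch under the question's setup; the exact column sequence is my assumption (with 25 columns and 3-wide blocks, starts 1, 4, ..., 22 cover columns 1 through 24):
res1 <- Juggle(Mat1, Mat2, col = seq(1, 22, by = 3), step = 3, top_n = 1)
res1$max_out  # Problem 1's Output1: one row per 3-column block
res1$rest_out # Problem 1's Output2: the remaining rows per block
res2 <- Juggle(Mat1, Mat2, col = seq(1, 22, by = 3), step = 3, top_n = 2) # Problem 2: top-two rows per stage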
I have this accelerometer dataset and, let's say, some n observations of body acceleration over time for each subject (30 subjects total).
I want to plot these body-acceleration points for each subject in a different color, with acceleration on the y axis and just an index on the x axis. I tried this:
ggplot(data = filtered_data_walk, aes(x = seq_along(filtered_data_walk$'body-acceleration-mean-y-time'), y = filtered_data_walk$'body-acceleration-mean-y-time')) +
geom_line(aes(color = filtered_data_walk$subject))
But the problem is that it doesn't superimpose the 30 lines; instead, they run alongside each other. In other words, I end up with n1 + n2 + n3 + ... + n30 x-index points, instead of max{n1, n2, ..., n30}. This is my first time posting, so I hope this makes sense (I know my formatting is bad).
One solution I thought of was to create a new variable which gives a value of 1 to n for all the observations of each subject. So, for example, if I had 6 observations for subject1, 4 observations for subject2, and 9 observations for subject3, this new variable would be sequenced like:
1 2 3 4 5 6 1 2 3 4 1 2 3 4 5 6 7 8 9
Is there an easy way to do this? Please help, ty.
Assuming your data is formatted as a data.frame or matrix, for a toy dataset like
x <- data.frame(replicate(5, rnorm(10)))
x
# X1 X2 X3 X4 X5
# 1 -1.36452272 -1.46446475 2.0444381 0.001585876 -1.1085990
# 2 -1.41303046 -0.14690269 1.6179084 -0.310162018 -1.5528733
# 3 -0.15319554 -0.18779791 -0.3005058 0.351619212 1.6282955
# 4 -0.38712167 -0.14867239 -1.0776359 0.106694311 -0.7065382
# 5 -0.50711166 -0.95992916 1.3522922 1.437085757 -0.7921355
# 6 -0.82377208 0.50423328 -0.5366513 -1.315263679 1.0604499
# 7 -0.01462037 -1.15213287 0.9910678 0.372623508 1.9002438
# 8 1.49721113 -0.84914197 0.2422053 0.337141898 1.2405208
# 9 1.95914245 -1.43041783 0.2190829 -1.797396822 0.4970690
# 10 -1.75726827 -0.04123615 -0.1660454 -1.071688768 -0.3331887
...you might be able to get there with something like
plot(x[,1], type='l', xlim=c(1, nrow(x)), ylim=c(min(x), max(x)))
for(i in 2:ncol(x)) lines(x[,i], col=i)
You could play with formatting some more, of course, do things with lty= and lwd= and maybe a color ramp of your own choosing, etc.
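For instance, a hedged variant of the same loop with a color ramp and thicker lines (hcl.colors() needs R >= 3.6; rainbow() is an older substitute):
cols <- hcl.colors(ncol(x), "Dark 2") # one color per column of x
plot(x[, 1], type = 'l', col = cols[1], lwd = 2,
     xlim = c(1, nrow(x)), ylim = c(min(x), max(x)))
for (i in 2:ncol(x)) lines(x[, i], col = cols[i], lwd = 2)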
If your data is in the format below...
x <- data.frame(id=c("A","A","A","B","B","B","B","C","C"), acc=rnorm(9))
x
# id acc
# 1 A 0.1796964
# 2 A 0.8770237
# 3 A -2.4413527
# 4 B 0.9379746
# 5 B -0.3416141
# 6 B -0.2921062
# 7 B 0.1440221
# 8 C -0.3248310
# 9 C -0.1058267
...you could get there with
maxn <- max(with(x, tapply(acc, id, length)))
ids <- sort(unique(x$id))
plot(x$acc[x$id==ids[1]], type='l', xlim=c(1,maxn), ylim=c(min(x$acc),max(x$acc)))
for(i in 2:length(ids)) lines(x$acc[x$id==ids[i]], col=i)
Hope this helps, and that I interpreted your problem right--
That's pretty quick to do if you are OK with using dplyr. group_by to enforce a separate counter for each subject, mutate to add the actual counter, and your ggplot should work. Example with iris dataset:
library(dplyr)
library(ggplot2)
group_by(iris, Species) %>%
  mutate(index = seq_along(Petal.Length)) %>%
  ggplot() + geom_line(aes(x = index, y = Petal.Length, color = Species))
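And if you'd rather build the 1-to-n-per-subject index from the question in base R, a minimal sketch (assuming your data frame and its subject column are named as in your ggplot call):
filtered_data_walk$index <- ave(seq_len(nrow(filtered_data_walk)),
                                filtered_data_walk$subject,
                                FUN = seq_along) # restarts at 1 for each subject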
I have a large data set and I want to loop over the results, subtracting columns paired up in a lookup table and writing the result for each row into a new column.
ref1 <-
    samples  Controls
  E_2334188 E_2334207
  E_2334202 E_2334221
df1 <-
Chr Start End Feature E_2334188 E_2334202 E_2334207 E_2334221
1 740001 760000 1:740001-760000 1.6832013 0.8346011 -0.23045394 1.5974912
1 760001 780000 1:760001-780000 -0.3231613 -1.8504905 0.13668752 -0.38662600
1 780001 800000 1:780001-800000 -0.3936060 -2.2163153 -0.15266541 -0.60706691
ind <- which(names(df1) %in% ref1$samples)
rnd <- which(names(df1) %in% ref1$Controls)
df2 <- df1[,c(1:4)]
df2$newcol <- 0
for (i in 1:nrow(ref1)){
n <- df1[ind]-df1[rnd]
df2$newcol[i] <- n
}
expected outcome
df2 <-
Chr Start End Feature E_2334188 E_2334202
1 740001 760000 1:740001-760000 1.913655 -0.7628901
1 760001 780000 1:760001-780000 -0.4598488 -1.463865
1 780001 800000 1:780001-800000 -0.2409406 -1.609248
We can subset the 'df1' based on the elements in 'samples' and 'Controls', subtract them, and cbind with the first 4 columns of 'df1'.
cbind(df1[1:4],df1[ref1$samples]- df1[ref1$Controls])
# Chr Start End Feature E_2334188 E_2334202
#1 1 740001 760000 1:740001-760000 1.9136552 -0.7628901
#2 1 760001 780000 1:760001-780000 -0.4598488 -1.4638645
#3 1 780001 800000 1:780001-800000 -0.2409406 -1.6092484
NOTE: If the 'samples' and 'Controls' columns are factor class, convert to character and use the same approach.
cbind(df1[1:4],df1[as.character(ref1$samples)]- df1[as.character(ref1$Controls)])
GIVEN DATA
I have 6 columns of data of vehicle trajectories (observations of vehicles' change in position, velocity, etc. over time), a part of which is shown below:
Vehicle ID Frame ID Global X Vehicle class Vehicle velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5
Vehicle ID is the identification of individual vehicles, e.g. vehicle 1, vehicle 2, etc. It is repeated in the column for each frame in which the vehicle was observed. Please note that each frame is 0.1 seconds long, so 10 frames make 1 second. The frame IDs are in the Frame ID column. Vehicle class is the type of vehicle (1 = motorcycle, 2 = car, 3 = truck). The Vehicle velocity column is the instantaneous speed of the vehicle in that frame. Lane is the number or ID of the lane the vehicle occupies in a particular frame.
WHAT I NEED TO FIND
The data I have is for 15 minutes period. The minimum frame ID is 5 and maximum frame ID is 9952. I need to find the total number of vehicles in every 30 seconds time period. This means that starting from the first 30 seconds (frame ID 5 to frame ID 305), I need to know the unique vehicle IDs observed. Also, for these 30 seconds period, I need to find the average velocity of each vehicle class. This means that e.g. for cars I need to find the average of all velocities of those vehicles whose vehicle class is 2.
I need to find this for all 30-second time periods, i.e. 5-305, 305-605, 605-905, ..., 9605-9905. The output should be tables for cars, trucks and motorcycles like this:
Time Slots Total Cars Average Velocity
5-305 xx xx
305-605 xx xx
. . .
. . .
9605-9905 xx xx
WHAT I HAVE TRIED SO FAR
# Finding the minimum and maximum Frame ID for creating 30-seconds time slots
minfid <- min(data$'Frame ID') # this was 5
maxfid <- max(data$'Frame ID') # this was 9952
for (i in 'Frame ID'==5:'Frame ID'==305) {
table ('Vehicle ID')
mean('Vehicle Velocity', 'Vehicle class'==2)
} #For cars in first 30 seconds
I can't generate the required output and I don't know how can I do this for all 30 second periods. Please help.
It's a bit tough to make sure code is completely correct with your data since there is only one vehicle in the sample you show. That said, this is a typical split-apply-combine type analysis you can execute easily with the data.table package:
library(data.table)
dt <- data.table(df) # I just did a `read.table` on the text you posted
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
Here, I just converted your data into a data.table (df was a direct import of your data posted above), and then created 300 frame buckets using cut. Then, you just let data.table do the work. In the first expression we calculate total unique vehicles per frame.group
dt[, list(tot.vehic=length(unique(Vehicle_ID))), by=frame.group]
# frame.group tot.vehic
# 1: [5,305] 1
Now we group by frame.group and Vehicle_class to get average speed and count for those combinations:
dt[, list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 2 1 24.965
Again, a bit silly when we only have one vehicle, but this should work for your data set.
EDIT: to show that it works:
library(data.table)
set.seed(101)
dt <- data.table(
Frame_ID=sample(5:9905, 50000, rep=T),
Vehicle_ID=sample(1:400, 50000, rep=T),
Vehicle_velocity=runif(50000, 25, 100)
)
dt[, frame.group:=cut(Frame_ID, seq(5, 9905, by=300), include.lowest=T)]
dt[, Vehicle_class:=Vehicle_ID %% 3]
head(
dt[order(frame.group, Vehicle_class), list(tot.vehic=length(unique(Vehicle_ID)), mean.speed=mean(Vehicle_velocity)), by=list(frame.group, Vehicle_class)]
)
# frame.group Vehicle_class tot.vehic mean.speed
# 1: [5,305] 0 130 63.34589
# 2: [5,305] 1 131 61.84366
# 3: [5,305] 2 129 64.13968
# 4: (305,605] 0 132 61.85548
# 5: (305,605] 1 132 64.76820
# 6: (305,605] 2 133 61.57129
Maybe it's your data?
Here is a plyr version:
data$timeSlot <- cut(data$FrameID,
breaks = seq(5, 9905, by=300),
dig.lab=5,
include.lowest=TRUE)
# split & combine
library(plyr)
data.sum1 <- ddply(.data = data,
.variables = c("timeSlot"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
# include VehicleClass
data.sum2 <- ddply(.data = data,
.variables = c("timeSlot", "VehicleClass"),
.fun = summarise,
totalCars = length(unique(VehicleID)),
AverageVelocity = mean(velocity)
)
The column names like FrameID would have to be edited to match the ones you use:
data <- read.table(sep = "", header = TRUE, text = "
VehicleID FrameID GlobalX VehicleClass velocity Lane
1 177 6451181 2 24.99 5
1 178 6451182 2 24.95 5
1 179 6451184 2 24.91 5
1 180 6451186 2 24.90 5
1 181 6451187 2 24.96 5
1 182 6451189 2 25.08 5")
data.sum1
# timeSlot totalCars AverageVelocity
# 1 [5,305] 1 24.965
data.sum2
# timeSlot VehicleClass totalCars AverageVelocity
# 1 [5,305] 2 1 24.965
I am creating correlations using R, with the following code:
Values<-read.csv(inputFile, header = TRUE)
O<-Values$Abundance_O
S<-Values$Abundance_S
cor(O,S)
pear_cor<-round(cor(O,S),4)
outfile<-paste(inputFile, ".jpg", sep = "")
jpeg(filename = outfile, width = 15, height = 10, units = "in", pointsize = 10, quality = 75, bg = "white", res = 300, restoreConsole = TRUE)
rx<-range(0,20000000)
ry<-range(0,200000)
plot(rx,ry, ylab="S", xlab="O", main="O vs S", type="n")
points(O,S, col="black", pch=3, lwd=1)
mtext(sprintf("%s %.4f", "pearson: ", pear_cor), adj=1, padj=0, side = 1, line = 4)
dev.off()
pear_cor
I now need to find the lower quartile for each set of data and exclude the values that fall within it. I would then like to rewrite the data without those values and use the new column of data in the correlation analysis (because I want to threshold the data by the lower quartile). If there is a way I can write this so that the threshold is easy to change by passing arguments from Java (as I have with the input file name), that's even better!
Thank you so much.
I have now implemented the answer below and it is working; however, I need to keep the pairs of data together for the correlation. Here is an example of my data (from csv):
Abundance_O Abundance_S
3635900.752 1390.883073
463299.4622 1470.92626
359101.0482 989.1609251
284966.6421 3248.832403
415283.663 2492.231265
2076456.856 10175.48946
620286.6206 5074.268802
3709754.717 269.6856808
803321.0892 118.2935093
411553.0203 4772.499758
50626.83554 17.29893001
337428.8939 203.3536852
42046.61549 152.1321255
1372013.047 5436.783169
939106.3275 7080.770535
96618.01393 1967.834701
229045.6983 948.3087208
4419414.018 23735.19352
So I need to exclude both values in the row if one does not meet my quartile threshold (0.25 quartile). So if the quartile for O was 45000 then the row "42046.61549,152.1321255" would be removed. Is this possible? If I read in both columns as a dataframe can I search each column separately? Or find the quartiles and then input that value into code to remove the appropriate rows?
Thanks again, and sorry for the evolution of the question!
Please try to provide a reproducible example, but if you have data in a data.frame, you can subset it using the quantile function as the logical test. For instance, in the following data we want to select only rows from the dataframe where the value of the measured variable 'Val' is above the bottom quartile:
# set.seed so you can reproduce these values exactly on your system
set.seed(39856)
df <- data.frame( ID = 1:10 , Val = runif(10) )
df
ID Val
1 1 0.76487516
2 2 0.59755578
3 3 0.94584374
4 4 0.72179297
5 5 0.04513418
6 6 0.95772248
7 7 0.14566118
8 8 0.84898704
9 9 0.07246594
10 10 0.14136138
# Now to select only rows where the value of our measured variable 'Val' is above the bottom 25% quartile
df[ df$Val > quantile(df$Val , 0.25 ) , ]
ID Val
1 1 0.7648752
2 2 0.5975558
3 3 0.9458437
4 4 0.7217930
6 6 0.9577225
7 7 0.1456612
8 8 0.8489870
# And check the value of the bottom 25% quantile...
quantile(df$Val , 0.25 )
25%
0.1424363
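For the edited, paired-column requirement, a minimal sketch assuming the CSV layout shown in the question; reading the threshold as a trailing command-line argument is an assumption about how you invoke R from Java:
args <- commandArgs(trailingOnly = TRUE)
inputFile <- args[1]
threshold <- as.numeric(args[2]) # e.g. 0.25 for the lower quartile (hypothetical second argument)
Values <- read.csv(inputFile, header = TRUE)
# keep a row only if BOTH columns clear their own lower quantile, so the O/S pairs stay together
keep <- Values$Abundance_O > quantile(Values$Abundance_O, threshold) &
        Values$Abundance_S > quantile(Values$Abundance_S, threshold)
Values <- Values[keep, ]
cor(Values$Abundance_O, Values$Abundance_S)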
Although this is an old question, I came across it during research of my own and I arrived at a solution that someone may be interested in.
I first defined a function which converts a numerical vector into its quantile groups. Parameter n determines the number of quantile groups (n = 4 for quartiles, n = 10 for deciles).
qgroup = function(numvec, n = 4){
qtile = quantile(numvec, probs = seq(0, 1, 1/n))
out = sapply(numvec, function(x) sum(x >= qtile[-(n+1)]))
return(out)
}
Function example:
v = 1:20
> qgroup(v)
[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
Consider now the following data:
library(data.table)
dt = data.table(
  A0 = runif(100),
  A1 = runif(100)
)
We apply qgroup() across the data to obtain two quartile group columns:
cols = colnames(dt)
qcols = c('Q0', 'Q1')
dt[, (qcols) := lapply(.SD, qgroup), .SDcols = cols]
head(dt)
           A0        A1 Q0 Q1
1: 0.72121846 0.1908863 3 1
2: 0.70373594 0.4389152 3 2
3: 0.04604934 0.5301261 1 3
4: 0.10476643 0.1108709 1 1
5: 0.76907762 0.4913463 4 2
6: 0.38265848 0.9291649 2 4
Lastly, we keep only the rows for which both quartile groups are above the first quartile:
dt = dt[Q0 > 1 & Q1 > 1]