Get location of row with median value in R data frame

I am a bit stuck with this basic problem, but I cannot find a solution.
I have two data frames (dummies below):
x<- data.frame("Col1"=c(1,2,3,4), "Col2"=c(3,3,6,3))
y<- data.frame("ColA"=c(0,0,9,4), "ColB"=c(5,3,20,3))
I need to use the location of the median value of one column in df x to then retrieve a value from df y. For this, I am trying to get the row number of the median value in e.g. x$Col1 to then retrieve the value using something like y[,"ColB"][row.number].
Is there an elegant way/function for doing this? Solutions might need to account for two cases: when the sample has an odd number of values and when it is even (with an even count, the median might not be a value found in the sample at all, since it is calculated as the mean of the two middle values).

The problem is a little underspecified.
What should happen when the median isn't in the data?
What should happen if the median appears in the data multiple times?
Here's a solution which takes the (absolute) difference between each value and the median, then returns the index of the first row for which that difference vector achieves its minimum.
with(x, which.min(abs(Col1 - median(Col1))))
# [1] 2
The quantile function with type = 1 (i.e. no averaging) may also be of interest, depending on your desired behavior. It returns the lower of the two "sides" of the median, while the which.min method above can depend on the ordering of your data.
quantile(x$Col1, .5, type = 1)
# 50%
# 2
An option using quantile is
with(x, which(Col1 == quantile(Col1, .5, type = 1)))
# [1] 2
This could possibly return multiple row-numbers.
Edit:
If you want it to only return the first match, you could modify it as shown below
with(x, which.min(Col1 != quantile(Col1, .5, type = 1)))

Here, something like y$ColB[which(x$Col1 == round(median(x$Col1)))] would do the trick.
The problem is that x has an even number of rows, so the median 2.5 is not a value found in the data. In this case you have to choose between rows 2 and 3.
Note: the above works for your example, not for general cases (e.g. c(-2L, 2L), or with rational numbers). For the more general case, see @IceCreamToucan's solution.
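Applied to the dummy data from the question, a quick check (note that round() in R rounds halves to even, so round(2.5) gives 2):
x <- data.frame("Col1"=c(1,2,3,4), "Col2"=c(3,3,6,3))
y <- data.frame("ColA"=c(0,0,9,4), "ColB"=c(5,3,20,3))
# median(x$Col1) is 2.5; round(2.5) gives 2, which matches row 2 of x
y$ColB[which(x$Col1 == round(median(x$Col1)))]
# [1] 3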

Related

regarding the usage of runif function

I once saw the following R code,
x<-runif(3,max=c(10,20,30))
If min is not set, what is the lower bound for the generated random variable? Besides, when max is set up this way, my understanding is that it will iterate over the three values given in c() for each generated value. Is that right?
If you look at the ?runif help page, you'll see the default for min= is 0.
If you specify multiple values for max, the values are recycled, so it's as if the first value comes from Unif(0,10), the second from Unif(0,20), and the third from Unif(0,30), and that pattern repeats for as many values as you request. If you only request one value,
runif(1, max=c(10,20,30))
would be the same as
runif(1, max=10)
This is noted in the help page under the Value section
The numerical arguments other than n are recycled to the length of the result.
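A minimal sketch of that recycling (the seed is only for reproducibility; the comparison vector spells out which max applies to each draw):
set.seed(42)
x <- runif(6, max=c(10,20,30))
# max recycles as 10, 20, 30, 10, 20, 30, so each draw stays below its own cap
x <= rep(c(10,20,30), 2)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE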
Per the documentation for this function (https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/Uniform), min takes on the value 0 unless explicitly passed.
And yes, that is correct - the function will iterate over the values given in c() for each value. If max has fewer elements than the number of values you are generating (e.g. you generate 3 random variables with max=c(1,2)), the max vector is recycled, so the third draw reuses the first element of max. An example showing how it iterates over c():
x<-runif(3,max=c(1,20, 7000000))
x
[1] 0.622216 7.463306 809194.417205

changing class and getting numbers

I am working with the golub dataset in R (separated into the AML and ALL groups) and I am attempting to do a hypothesis test in relation to two genes. For the AML patient group, I want to find the proportion of patients who have a higher expression of gene 900 compared to gene 1000, and then I want to test whether, out of those who have a higher expression value for gene 900, the number is less than half. I have a general idea of how to do the other half, and I had something like the code below for the first part. Seeing as the result is TRUE/FALSE, I tried to convert it to numeric, which gave 0 and 1, but I want the actual expression values and not the logical form.
gol.fac <- factor(golub.cl, levels=0:1, labels=c("ALL","AML"))
x <- golub[900, gol.fac=="AML"]
y <- golub[1000, gol.fac=="AML"]
z <- golub[900, gol.fac=="AML"] > golub[1000, gol.fac=="AML"]
k <- as.numeric(z)
Use max
max(golub[900,gol.fac=="AML"], golub[1000,gol.fac=="AML"])
Or, if you want the elementwise maximum across the two genes (one value per patient), use pmax
pmax(golub[900,gol.fac=="AML"], golub[1000,gol.fac=="AML"])
Instead of subsetting twice, you can also get the overall max with a single subset of both rows
max(golub[c(900,1000), gol.fac=="AML"])
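To see the difference between max and pmax on toy vectors (an illustrative sketch, not the golub data):
a <- c(1, 5, 3)
b <- c(4, 2, 6)
max(a, b)   # one overall maximum: 6
pmax(a, b)  # elementwise maxima: 4 5 6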

Median Values in R - Returns Rounded Number

I have a table of data, where I've labeled each row based on the cluster it falls into, and have calculated the average of each row's column values. I would like to select the median row for each cluster.
For example's sake, just looking at one cluster, I would like to use:
median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
I can see that
> as.numeric(as.vector(subset(df,df$cluster == i )$avg))
[1] 48.11111111 47.77777778 49.44444444 49.33333333 47.55555556 46.55555556 47.44444444 47.11111111 45.66666667 45.44444444
And yet, the median is
> median(as.numeric(as.vector(subset(df,df$cluster == i )$avg)))
[1] 47.5
I would like to find the median record by matching the returned median against the avg column, but that isn't possible with this return value.
I've found some documentation and questions on rounding with the mean function, but that doesn't seem to apply here, unfortunately.
I could also limit the number of decimal places in the data, but some records are so close together that duplicates would be common if I rounded to one decimal.
When the input has an even number of values (like the 10 values you have) then there is not a value directly in the middle. The standard definition of a median (which R implements) averages the two middle values in the case of an even number of inputs. You could rank the data, and in the case of an even-length input select either the n/2 or n/2 + 1 record.
So, if your data was x = c(8, 6, 7, 5), the median is 6.5. You seem to want the index of "the median", that is either 2 or 3.
If we assume there are no ties, then we can get these answers with
which(rank(x) == length(x) / 2)
# [1] 2
which(rank(x) == length(x) / 2 + 1)
# [1] 3
If ties are a possibility, then rank's default tie-breaking method will cause you some problems. Have a look at ?rank and figure out which option you'd like to use.
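A quick illustration of how the tie-breaking method changes the ranks (hypothetical values with a tie):
x2 <- c(5, 7, 7, 8)
rank(x2)                        # [1] 1.0 2.5 2.5 4.0 -- tied values share an averaged rank
rank(x2, ties.method = "first") # [1] 1 2 3 4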
We can, of course, turn this into a little utility function:
median_index = function(x) {
  lx = length(x)
  if (lx %% 2 == 1) {
    return(match(median(x), x))
  }
  which(rank(x, ties.method = "first") == lx/2 + 1)
}
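A quick check against the example values from above:
median_index(c(8, 6, 7, 5))  # even length: returns 3, the upper of the two middle values
median_index(c(8, 6, 7))     # odd length: returns 3, since 7 is the median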
There is an easier way to do that: use dplyr
library(dplyr)
df %>%
  group_by(cluster) %>%
  summarise(Median = median(avg))
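Note this returns the median value per cluster, not the row. If you need the position of the median row within each cluster, the which.min(abs(x - median(x))) approach from the first question on this page drops into the same grouped pipeline (a sketch, assuming df has cluster and avg columns as in the question):
df %>%
  group_by(cluster) %>%
  summarise(median_row = which.min(abs(avg - median(avg))))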

Difference between mean(c(1,2,21)) and mean(1,2,21)

What's the difference between these two?
mean(c(1,2,21))
and
mean(1,2,21)
The answers are different, but what's the meaning of each one?
mean(c(1,2,21))
#[1] 8
This passes a vector of three elements to the mean function and the mean value of these three elements is calculated.
mean(1,2,21)
#[1] 1
This passes 1 as the first argument, 2 as the second argument and 21 as the third argument to the mean function. mean passes these arguments to mean.default. In help("mean.default") you can find the arguments of this function:
x: the object you want the mean for.
trim: the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed. Values of trim outside that range are taken as the nearest endpoint.
na.rm: a logical value indicating whether NA values should be stripped before the computation proceeds. (Since you pass a numeric value, it is coerced to logical automatically.)
So you calculate this:
mean.default(1, 0.5, TRUE)
[1] 1
When you use mean(c(1,2,21)), R takes the mean of the vector consisting of 1, 2, and 21. In the second case, mean(1,2,21) is equivalent to mean(1, trim=2, na.rm=21): R takes the mean of the single number 1, while 2 is passed to trim (which controls the fraction, 0 to 0.5, of observations to be trimmed from each end of the vector before the mean is computed) and 21 is passed to na.rm (which should be TRUE or FALSE). As you can see, 2 and 21 without c() are effectively useless here.
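To see trim doing something on its own (a small sketch; per mean.default, a trim of 0.5 or more reduces the result to the median):
v <- c(1, 2, 21)
mean(v)             # [1] 8
mean(v, trim = 0.5) # [1] 2 -- trimming half from each end leaves the median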

Find a block of steady column values

Can anyone give me a hint to speed up the following program?
Situation: I have a huge amount of measurement data. I need to extract data for "10 minutes stable operation conditions" of 5 parameters i.e. column values.
Here is my (working, but really slow) solution:
- Take the first 10 rows from the dataframe
- Compare the min and max of each column to the first value of the column
- If at least one column min or max is not within tolerance, delete the first row, repeat
- If they are within tolerance, calculate the mean of the results, store them, delete 10 rows, repeat.
- break when the dataframe has less than 10 rows
Since I am using a repeat loop, this takes 30 minutes to extract 610 operation points from 86,220 minutes of data.
Any help is appreciated. Thanks!
Edit: I created some code to explain. Please note that I deleted the checking routines for NA values and standby operation (values around 0):
n_cons <- 5 # Number of consistent minutes?
### Function to check whether a value is within tolerance
f_cons <- function(min, max, value, tol){
  z <- max > (value + tol) | min < (value - tol)
  return(z)
}
# Define the +/- tolerances
Vu_1_tol<-5 # F_HT
Vu_2_tol<-5 # F_LT
# Create empty result map
map<-c(rep(NA,3))
dim(map)<- c(1,3)
colnames(map)<-list("F_HT","F_LT","Result")
system.time(
  repeat{
    # Criterion to break
    if(nrow(t6) < n_cons){break}
    # Subset of the data to check
    t_check <- NULL
    t_check <- cbind(t6$F_HT[1:n_cons],
                     t6$F_LT[1:n_cons])
    # Check for consistency
    if(f_cons(min(t_check[,1]), max(t_check[,1]), t_check[1,1], Vu_1_tol)){
      t6 <- t6[-1,]
      next
    }
    if(f_cons(min(t_check[,2]), max(t_check[,2]), t_check[1,2], Vu_2_tol)){
      t6 <- t6[-1,]
      next
    }
    # If the repeat loop passes the consistency check, store the means
    attach(t6[1:n_cons,])
    # create a new row with means of the steady block
    new_row <- c(mean(F_HT), mean(F_LT), mean(Result))
    new_row[-1] <- round(as.numeric(new_row[-1]), 2)
    map <- rbind(map, new_row) # attach new steady point to the map
    detach(t6[1:n_cons,])
    t6 <- t6[-(1:n_cons),] # delete the evaluated lines from the data
  }
)
The data I am using looks like this
t6<-structure(list(F_HT = c(1499.71, 1500.68, 1500.44, 1500.19, 1500.31,
1501.76, 1501, 1551.22, 1500.01, 1500.52, 1499.53, 1500.78, 1500.65,
1500.96, 1500.25, 1500.76, 1499.49, 1500.24, 1500.47, 1500.25,
1735.32, 2170.53, 2236.08, 2247.48, 2250.71, 2249.59, 2246.68,
2246.69, 2248.27, 2247.79), F_LT = c(2498.96, 2499.93, 2499.73,
2494.57, 2496.94, 2507.71, 2495.67, 2497.88, 2499.63, 2506.18,
2495.57, 2504.28, 2497.38, 2498.66, 2502.17, 2497.78, 2498.38,
2501.06, 2497.75, 2501.32, 2500.79, 2498.17, 2494.82, 2499.96,
2498.5, 2503.47, 2500.57, 2501.27, 2501.17, 2502.33), Result = c(9125.5,
8891.5, 8624, 8987, 9057.5, 8840.5, 9182, 8755.5, 9222.5, 9079,
9175.5, 9458.5, 9058, 9043, 9045, 9309, 9085.5, 9230, 9346, 9234,
9636.5, 9217.5, 9732.5, 9452, 9358, 9071.5, 9063.5, 9016.5, 8591,
8447.5)), .Names = c("F_HT", "F_LT", "Result"), row.names = 85777:85806, class = "data.frame")
With this code and data, I get 3 steady operation points, which is what I want, but which is very slow.
Hopefully, this helps to better explain my problem.
Eureka!
Thanks to Carl Witthoft's comment, I was able to speed up the process by a factor of 15!
I used rollapply a lot, because rollmean and rollmax had some problems with NAs that did not occur when using rollapply.
Thanks for your help!
Here is what I did, using the same data as before:
# Use only the values needed to check for stability
library(zoo) # rollapply() comes from the zoo package
t7 <- as.data.frame(cbind(t6$F_HT, t6$F_LT))
n_cons <- 5 # Number of consistent minutes?
# Calculate the mean values for each column over 5 rows
t7_rm <- rollapply(t7, n_cons, mean, align = "left")
colnames(t7_rm) <- c("mean_F_HT","mean_F_LT")
# idem with maximum
t7_max <- rollapply(t7, width = n_cons, FUN = max, na.rm = FALSE, align = "left")
colnames(t7_max) <- c("max_F_HT","max_F_LT")
# idem with minimum
t7_min <- rollapply(t7, width = n_cons, FUN = min, na.rm = FALSE, align = "left")
colnames(t7_min) <- c("min_F_HT","min_F_LT")
# create table with the maximum absolute deviation from the mean values
t7_dif <- pmax((t7_max - t7_rm[1:nrow(t7_max),]), (t7_rm[1:nrow(t7_min),] - t7_min))
colnames(t7_dif) <- c("diff_F_HT","diff_F_LT")
# Enter tolerance limits
V1_tol <- 50 # F_HT
V2_tol <- 50 # F_LT
# Create a tolerance table
t7_tol <- cbind(rep(V1_tol, nrow(t7_dif)), rep(V2_tol, nrow(t7_dif)))
# Create a logical table: TRUE or FALSE depending on whether the max deviation is within tolerance
t7_check <- (t7_dif < t7_tol)
# Replace all FALSE with NA (in order to use complete.cases below)
t7_check_NA <- apply(t7_check, c(1,2), function(x) {ifelse(x == FALSE, NA, x)})
# Create rolling means over the complete data
t6_rm <- rollapply(t6, n_cons, mean, na.rm = TRUE, align = "left")
# Create a map of stable operation points with means of parameters and result
t6_map <- t6_rm[complete.cases(t7_check_NA),]
The result differs from my original one, because no lines are omitted. But this works for me.
