How to select rows after calculating each row's quantiles? - r

I have a big dataframe with numerical values (12579 rows and 21 columns) from which I would like to extract the values that fall in the first and the fourth quartile of each row (every row has independent values).
That is why I have calculated each row's quantiles in order to obtain two cutoffs by row.
library(matrixStats)
d_q1 <- rowQuantiles(delta, probs = c(0.25, 0.75))
delta2 <- as.data.frame(cbind(delta,d_q1))
dim(delta2) # 12579 23
library(dplyr)
delta2 <- filter(delta2, delta2[,1:21] <= `25%` & delta2[,1:21] >= delta2$`75%`)
I expected to get the values in Q1 and Q4. However, when I try to filter the values, I always obtain an error message:
Error: Result must have length 12579, not 264159
Can somebody help me?
Thank you!

I'm not entirely sure what you are trying to do here, but my guess is that you want, for each row, the values smaller than Q1 and larger than Q3. In that case this line should work for you.
t(apply(delta, 1, sort))[,c(1:6, 16:21)]
Regarding your code, dplyr::filter() doesn't work that way; it is meant to give you a subset of the rows in your dataframe, so its arguments need to be logical vectors with the same length as the number of rows in your dataframe.
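For completeness, a hedged sketch of my own (not part of the original answer): keep, for each row, only the values at or below that row's Q1 or at or above its Q3, and set everything in between to NA so the 21-column shape is preserved.
library(matrixStats)
m <- as.matrix(delta)                        # 12579 x 21 numeric matrix
q <- rowQuantiles(m, probs = c(0.25, 0.75))  # per-row cutoffs, columns "25%" and "75%"
keep <- m <= q[, "25%"] | m >= q[, "75%"]    # cutoff vectors recycle column-major, i.e. once per row
delta_q14 <- replace(m, !keep, NA)           # values strictly between Q1 and Q3 become NA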

Related

Proportion and Averaging Data

I am brand new to R and I am trying to calculate the proportion of 'i' values for each timepoint and then average them. I do not know the command for this, but I have the script to find the total number of 'i' in the timepoints.
C1imask <- C16.3[, 2:8] == 'i' & !is.na(C16.3[, 2:8])
C16.3[, 2:8][C1imask]
C1inactive <- C16.3[, 2:8][C1imask]
length(C1inactive)
C1bcmask <- C16.3[, 8] == 'bc' & !is.na(C16.3[, 8])
C16.3[, 8][C1bcmask]
C1broodcare <- C16.3[, 8][C1bcmask]
length(C1broodcare)
C1amask <- C16.3[, 12] == 'bc' & !is.na(C16.3[, 12])
C16.3[, 12][C1amask]
C1after <- C16.3[, 12][C1amask]
length(C1after)
C1 <- length(C1after) - length(C1broodcare)
C1
I'd try taking the mean of a logical vector created with the test, passing na.rm = TRUE as an argument to mean. You will get the proportion of non-NA values that meet the test, rather than the proportion with the number of rows as the denominator.
test <- sample( c(0,1,NA), 100, replace=TRUE)
mean( test==0, na.rm=TRUE)
#[1] 0.5072464
If you need a proportion of the total number of rows, you would use sum and divide by nrow(dframe_name). You can then use sapply or lapply to iterate across a group of columns.
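A hedged sketch along those lines (assuming, as in your code, that columns 2:8 of C16.3 hold the timepoints):
# proportion of 'i' per timepoint column, ignoring NAs, then the average across timepoints
props <- sapply(C16.3[, 2:8], function(col) mean(col == 'i', na.rm = TRUE))
mean(props)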

Use a vector in filter() with group_by() data

I would like to filter (subset) some observations from a dataframe (A). However, I have to group the observations within the df, which I did with group_by. This gives me a total of 6660 groups. Now I would like to subset those observations which meet a certain range within each group. Therefore I created a dataframe (B), which holds the lower and upper boundary condition. Like the groups, this dataframe consists of 6660 observations.
NewDataFrame <-filter(A %>% group_by(a,b,c),which(d >= B$x && d <= B$y ))
A is the original df,
B holds the lower and upper boundary condition
Variations of the code with other functions I tried did not work either, except if I used fixed values instead of B$x and B$y. Otherwise I typically end up with:
"longer object length is not a multiple of shorter object length"
Unfortunately, I found nothing in old questions regarding this topic. I am thankful for any help!
Here I tried to create some test data. I did not create A$b and A$c as they are only grouping conditions; the NewDataFrame at the end is the result I would expect.
a<-c("A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","C","C","C","C","C","C","C","C")
d<-rep(1:6,4)
A<-data.frame(a,d)
x<-c(3,5,1)
y<-c(6,6,3)
B<-data.frame(x,y)
a_new<-c("A","A","A","A","B","B","C","C","C")
d_new<-c(3,4,5,6,5,6,1,2,3)
NewDataFrame<-data.frame(a_new,d_new)
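One possible approach, as a minimal sketch of my own (not from the thread), assuming each row of B corresponds to exactly one group of A, here encoded by a hypothetical key column B$a: join the boundaries onto A by group, then filter row-wise.
library(dplyr)
B$a <- c("A", "B", "C")          # hypothetical key matching the groups of A, in order
NewDataFrame <- A %>%
  left_join(B, by = "a") %>%     # each row of A gets its group's x and y
  filter(d >= x, d <= y) %>%     # row-wise comparison against the group's boundaries
  select(a, d)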

In R: ordering values from 2 DF columns for use in ratio for each row

I want to calculate ratios for each row in a data frame using values from two columns of that row. The data are anatomical measurements from paired muscles, and I need to calculate the ratio of the measurement of one muscle to the measurement of the other. Each row is an individual specimen, and each of the 2 columns in question holds the measurements for one of the 2 muscles. Which of the two muscles is larger varies among individuals (rows), so I need a script that always picks the smaller value (which may be in either column) for the numerator and the larger value (which can also be in either column) for the denominator, rather than simply dividing all values of one column by the values of the other. This might be simple, but I'm not so good with coding yet.
This doesn't work:
ratio <- DF$1/DF$2
I assume that what I need would loop through each row doing something like this:
ratio <- which.min(c(DF$1, DF$2))/which.max(c(DF$1, DF$2))
Any help would be greatly appreciated!
Assuming that you are only dealing with positive values, you could consider something like this:
# example data:
df <- data.frame(x = abs(rnorm(100)), y = abs(rnorm(100)))
# sorting the two columns so that the smaller always appears in the first
# column:
df_sorted <- t(apply(df,1, sort))
# dividing the first col. by the second col.
ratio <- df_sorted[,1]/df_sorted[,2]
Or, alternatively:
ifelse(df[,1] > df[,2], df[,2]/df[,1], df[,1]/df[,2])
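As a further variant of the same idea (my own addition), pmin() and pmax() pick the element-wise smaller and larger of the two columns:
ratio <- pmin(df$x, df$y) / pmax(df$x, df$y)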

Fill an empty dataframe by rows with paired ratios of rows of another dataframe

I have a dataframe for which, across all columns, I want to calculate paired ratios of rows (for example, row1/row2, row3/row4, row5/row6, etc.) and write the result of the calculation to a new dataframe. I decided to wrap it in a function with 3 arguments:
paired_row_rat = function(dataframe, rows, columns) {
  # new dataframe: same number of columns as the input dataframe,
  # half as many rows because each pair of rows gives one ratio row
  ratio_df = data.frame(matrix(nrow = rows / 2, ncol = columns))
  cln = colnames(dataframe)    # column names should be equal in both dataframes
  colnames(ratio_df) = cln
  i = seq(1, rows, by = 2)     # sequence for choosing the first row of each pair
  j = i + 1                    # for choosing the second row of each pair
  for (k in 1:nrow(ratio_df)) {                       # here, as I am trying to fill the new
    ratio_df[k, ] = dataframe[i, ] / dataframe[j, ]   # dataframe with ratios, the error appears
  }
  return(ratio_df)
}
pmap(list(tula3,24,98),paired_row_rat)
#runs the function for my dataframe with 24 rows and 98 columns
In the resulting dataframe, each column has the same value in all rows, and I get warnings from R:
warnings()
Warning messages:
1: In `[<-.data.frame`(`*tmp*`, k, , value = structure(list( ... :
replacement element 1 has 12 rows to replace 1 rows
I've searched a lot for possible solutions but still can't fix this problem. Something is wrong with the for loop, but I don't understand where the problem is.
Dataframe used for the calculation (the result of head(df)):
Assuming that the requirement is to calculate the ratio for the pairs row1/row2, row3/row4 and so on....
Try this:
as.data.frame(t(sapply(seq(1,(nrow(df)-1),2),function(x,df){df[x,]/df[x+1,]},df)))
where df is your data.frame
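A hedged alternative (my own sketch, assuming an even number of rows): data frames divide element-wise, so dividing the odd-numbered rows by the even-numbered rows gives the same paired ratios without an explicit loop.
odd  <- seq(1, nrow(df), by = 2)   # rows 1, 3, 5, ...
even <- seq(2, nrow(df), by = 2)   # rows 2, 4, 6, ...
ratio_df <- df[odd, ] / df[even, ]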

Extract elements 10x greater than the last values for multiple columns

I am a new R user.
I have a dataframe consisting of 50 columns and 300 rows. The first column indicates the ID, while the 2nd to the last columns are standard deviations (sd) of traits. The pooled sd for each column is given in the last row. For each column, I want to remove all values ten times greater than the pooled sd, and I want to do this in one run. So far, the script below is what I have come up with for checking whether a value is greater than the pooled sd. However, even the ID column (character) is being processed (resulting in all FALSE). If I use raw_sd_summary[-1], I have no way of knowing which ID on which trait meets the criteria I'm looking for.
logic_sd <- lapply(raw_sd_summary, function(x) x>tail(x,1) )
logic_sd_df <- as.data.frame(logic_sd)
What shall I do? And how can I extract all the values labeled as TRUE, i.e. those ten times greater than the pooled sd, along with their corresponding IDs?
I think your code won't work since lapply will run on a data.frame's columns, not its rows as you want. Change it to
logic_sd <- apply(raw_sd_summary, 2, function(x) x>10*tail(x,1) )
This will give you a logical array indicating which values are more than 10 times the last row. You could recover the IDs by replacing the first column:
logic_sd[,1] <- raw_sd_summary[,1]
You could remove/replace the unwanted values in the original table directly by
raw_sd_summary[-300,-1][logic_sd[-300,-1]]<-NA # or new value
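And a hedged sketch of my own (not part of the answer above) for pulling out the flagged values together with their IDs and trait names, building the flags on the numeric columns only:
sd_vals <- as.matrix(raw_sd_summary[-nrow(raw_sd_summary), -1])   # trait values; pooled-sd row and ID column dropped
pooled  <- as.numeric(raw_sd_summary[nrow(raw_sd_summary), -1])   # pooled sd per trait (last row)
flag    <- sweep(sd_vals, 2, 10 * pooled, FUN = ">")              # TRUE where a value exceeds 10x the pooled sd
hits    <- which(flag, arr.ind = TRUE)                            # row/column positions of the TRUEs
data.frame(ID    = raw_sd_summary[hits[, "row"], 1],
           trait = colnames(sd_vals)[hits[, "col"]],
           value = sd_vals[hits])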
