Extracting minor allele counts in each row in R

I am trying to extract the minor allele count from a set of three columns. The counts are the number of times each allele is seen in each row. I need to extract the lowest number without reporting 0. Some rows have 0 in one of the columns, which is not wanted in the final minor count. Where the lowest values in a row are equal, the minor count should be that shared value.
I have tried having multiple lines of if (true) statements but this is cumbersome and does not solve the issues fully because of the combination of different scenarios.
set.seed(100)
df <- data.frame((sample(0:100,50)),(sample(0:100,50)),(sample(0:100,50)))
names(df) <-c("nAA", "nAa", "naa")
# Expected minor count output for the first three rows:
# row 1: 31
# row 2: 19
# row 3: 4
I expect a fourth column with the minor count for each row.

You can use apply and, for each row, take the minimum of the counts larger than 0 with min(x[x > 0]); with which you get the column where that minimum sits:
apply(df, 1, function(x) min(x[x > 0])) # gives you the minimum non-zero count per row
apply(df, 1, function(x) which(x == min(x[x > 0]))) # gives you the column(s) holding that minimum
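Since the question asks for a fourth column, here is a minimal sketch of storing the result in the data frame. The `counts` data frame and the `minor` column name are made up for illustration, and the sketch assumes every row has at least one non-zero count (an all-zero row would give `Inf` with a warning).

```r
# Hypothetical example data with the same column names as in the question
counts <- data.frame(nAA = c(31, 0, 4, 7),
                     nAa = c(50, 19, 4, 0),
                     naa = c(42, 60, 9, 7))

# Smallest non-zero count per row, stored as a fourth column
counts$minor <- apply(counts[, c("nAA", "nAa", "naa")], 1,
                      function(x) min(x[x > 0]))

counts$minor  # 31 19 4 7
```

Selecting the three count columns explicitly keeps the call safe to re-run after `minor` has been added.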

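You can do it with this code. Here the function pmin gives you the parallel minimum of a set of vectors (in this case, the 3 variables in your data frame). Note that pmin on the raw columns returns 0 for any row that contains a zero, so zeros have to be masked first if they should be ignored.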
library(dplyr)
mutate(df, min = pmin(nAA, nAa, naa))
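One possible way to make the pmin approach skip zeros is to turn them into NA first and then use `na.rm = TRUE`. This is only a sketch in base R; the `counts` data and the `minor` column name are hypothetical.

```r
# Hypothetical example data with the same column names as in the question
counts <- data.frame(nAA = c(31, 0, 4),
                     nAa = c(50, 19, 4),
                     naa = c(42, 60, 0))

# Copy the counts and mask zeros as NA so pmin can ignore them
nz <- counts
nz[nz == 0] <- NA

# Parallel minimum across the three columns, skipping the masked zeros
counts$minor <- pmin(nz$nAA, nz$nAa, nz$naa, na.rm = TRUE)

counts$minor  # 31 19 4
```

A row that is zero in all three columns would come out as NA here rather than raising a warning.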

Related

How to select rows after calculating each row's quantiles?

I have a big dataframe with numerical values (12579 rows and 21 columns) from which I would like to extract those columns that fit in the first and the fourth quartile of each row (every row has independent values).
That is why I have calculated each row's quantiles in order to obtain two cutoffs by row.
library(matrixStats)
d_q1 <- rowQuantiles(delta, probs = c(0.25, 0.75))
delta2 <- as.data.frame(cbind(delta,d_q1))
dim(delta2) # 12579 23
library(dplyr)
delta2 <- filter(delta2, delta2[,1:21] <= `25%` & delta2[,1:21] >= delta2$`75%`)
I expected to get those values in Q1 and Q4. However, when I tried to filter the values, I always got an error message:
Error: Result must have length 12579, not 264159
Can somebody help me?
Thank you!
I'm not entirely sure what you are trying here, but my guess is that you want for each row the values smaller than Q1 and larger than Q3. In that case this line should work for you.
t(apply(delta, 1, sort))[,c(1:6, 16:21)]
Regarding your code: dplyr::filter() doesn't work that way. It is meant to give you a subset of the rows in your dataframe, so its arguments need to be logical vectors with one element per row. Your condition compares a block of 21 columns, which produces 12579 x 21 = 264159 values, hence the error message.
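If the goal is to keep the original column positions rather than sorted values, one option is to mask everything between the two cutoffs with NA. This is a base R sketch, not the asker's or answerer's method; `delta` here is a small random stand-in for the real data, and `cutoffs`, `in_tails` and `tails` are names chosen for illustration.

```r
set.seed(1)
# Hypothetical stand-in for the real 12579 x 21 data
delta <- matrix(rnorm(5 * 21), nrow = 5)

# Per-row 25% and 75% cutoffs: one row per row of delta, two columns
cutoffs <- t(apply(delta, 1, quantile, probs = c(0.25, 0.75)))

# A matrix is compared column by column, so a vector with one cutoff per row
# lines up with the rows of delta
in_tails <- delta <= cutoffs[, 1] | delta >= cutoffs[, 2]

# Keep values in the lower and upper quarter, blank out the rest
tails <- delta
tails[!in_tails] <- NA
```

The result has the same shape as `delta`, so row and column identities are preserved.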

Given large data.table, use binary search to find the correct row based on the first two columns and then add 1 to third column

I have a dataframe with 3 columns. First two columns are IDs (ID1 and ID2) referring to the same item and the third column is a count of how many times items with these two IDs appear. The dataframe has many rows so I want to use binary search to first find the appropriate row where both IDs match and then add 1 to the cell under the count column in that row.
I have used the which() function to find the index of the correct row and then using the index added 1 to the count column.
For example:
index <- which(DF$ID1 == x & DF$ID2 == y)
DF$Count[index] <- DF$Count[index] + 1
While this works, the which function is very inefficient. Because I have to do this within a for loop for more than a trillion times, it takes a lot of time. Also, there is only one row in the data frame with this ID combination. While the which function goes through all the rows, a function that stops once it finds the correct row should suffice. I have looked into using data.table and setkey for this purpose but do not know how to implement that for my purpose. Thank you in advance.
Indeed, you can use data.table with a key on both ID columns. Either setkey(DF, ID1, ID2) or setkeyv(DF, c("ID1", "ID2")) works; setkeyv simply takes the column names as a character vector.
library(data.table)
# example data: one row per (ID1, ID2) combination, with Count starting at 0
DF <- data.frame(expand.grid(ID1 = 1:100, ID2 = 1:100), Count = 0L)
# convert DF to a data.table
DF <- as.data.table(DF)
# key on ID1 and ID2, in that order (sorts the table and enables binary search)
setkeyv(DF, c("ID1", "ID2"))
# example x and y values
x <- 10L
y <- 18L
# binary-search the row with ID1 == x and ID2 == y and add 1 to Count by reference
DF[.(x, y), Count := Count + 1L]
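If the ID pairs to be counted are available up front rather than one at a time, it may be cheaper to tally them in a single pass instead of updating one row per loop iteration. The following is a base R sketch of that different technique, not the keyed data.table update; `xs`, `ys`, `updates` and `pair_counts` are hypothetical names.

```r
# Hypothetical stream of ID pairs that the loop would otherwise handle one by one
xs <- c(10, 10, 3, 10)
ys <- c(18, 18, 7, 2)

# One row per occurrence, each contributing a count of 1
updates <- data.frame(ID1 = xs, ID2 = ys, n = 1L)

# Sum the occurrences per (ID1, ID2) pair in one pass
pair_counts <- aggregate(n ~ ID1 + ID2, data = updates, FUN = sum)

pair_counts
```

The resulting per-pair totals could then be added to the Count column in one step instead of once per occurrence.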

In R: ordering values from 2 DF columns for use in ratio for each row

I want to calculate ratios for each row in a data frame using values from two columns for each row. The data are anatomical measurements from paired muscles, and I need to calculate a ratio of the measurement of one muscle to the measurement of the other. Each row is an individual specimen, and each of the 2 columns in question has measurements for one of the 2 muscles. Which of the two muscles is largest varies among individuals (rows), so I need to figure out how to write a script that always picks the smaller value, which may be in either column, for the numerator, and that always picks the larger values, which also can be in either column, for the denominator, rather than simply dividing all values of one column by values of the other. This might be simple, but I'm not so good with coding yet.
This doesn't work:
ratio <- DF$1/DF$2
I assume that what I need would loop through each row doing something like this:
ratio <- which.min(c(DF$1, DF$2))/which.max(c(DF$1, DF$2))
Any help would be greatly appreciated!
Assuming that you are only dealing with positive values, you could consider something like this:
# example data:
df <- data.frame(x = abs(rnorm(100)), y = abs(rnorm(100)))
# sorting the two columns so that the smaller always appears in the first
# column:
df_sorted <- t(apply(df,1, sort))
# dividing the first col. by the second col.
ratio <- df_sorted[,1]/df_sorted[,2]
Or, alternatively:
ifelse(df[,1] > df[,2], df[,2]/df[,1], df[,1]/df[,2])
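Another vectorised option, sketched here under the same assumption of positive values, is to take the row-wise smaller and larger value with pmin and pmax. The example data and column names are hypothetical.

```r
# Hypothetical measurements for the two muscles
df <- data.frame(x = c(2, 5, 3), y = c(4, 1, 3))

# Smaller value over larger value, row by row
ratio <- pmin(df$x, df$y) / pmax(df$x, df$y)

ratio  # 0.5 0.2 1.0
```

This avoids sorting each row and always yields a ratio of at most 1 for positive inputs.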

efficiently match values and average column where TRUE

Having trouble just matching values and taking an average of a column when those values match up efficiently in R. Essentially I have a chess table that I have pulled data out of and want to get the average for each player's pre-chess rating based on who they played against.
If I have a dataframe:
number <- c(1:10) #number assigned to each player
rating <- c(1000,1200,1210,980,1000,1001,1100,1300,1100,1250) #rating of the player
df <- data.frame(number= number, rating = rating)
p1_games <- c(1,2,3,4,5) # player 1 played against players 2,3,4,5
What I essentially want to do is check whether the values in p1_games match a number in the table, and when they match, average the corresponding values in the rating column.
I just want to return one value and so I've had trouble trying to make ifelse() work:
avg_rate <- ifelse(p1_games %in% df$number, sum(df$rating)/length(p1_games)) #not working
I would like to avoid looping if possible, but if there's no other efficient way that's fine. Just can't figure out what's up here. Ideally I'd like to apply this logic over many p*_games vectors.
If p1_games is in df$number, sum each corresponding rating and divide by the number of ratings. So the output for p1_games in this case would be 1078. I feel like this is really simple but can't quite make this work.
%in% is great at this kind of thing
> mean(df[df$number %in% p1_games, "rating"])
[1] 1078
An alternate answer using data.table, which may be of use with larger data sets (although since p1_games isn't a column, I'm not sure):
> library(data.table)
> setDT(df)
> df[number %in% p1_games, mean(rating)]
[1] 1078
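To apply the same logic over many p*_games vectors without an explicit loop, one option is to keep them in a named list and use sapply. This is a sketch; the list `games` and the second vector `p2` are invented for illustration.

```r
number <- 1:10
rating <- c(1000, 1200, 1210, 980, 1000, 1001, 1100, 1300, 1100, 1250)
df <- data.frame(number = number, rating = rating)

# Hypothetical list of opponent vectors, one element per player
games <- list(p1 = c(1, 2, 3, 4, 5),
              p2 = c(1, 3, 6))

# Mean rating of the matched players for each vector
avg_rating <- sapply(games, function(g) mean(df$rating[df$number %in% g]))

avg_rating[["p1"]]  # 1078
```

The result is a named numeric vector with one average per element of the list.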

In R: help using rle() function in dataframe

I am trying to find the number of consecutive runs of '1' values from a dataframe of over 1M obs. of 11 binary variables. I have looked at a number of similar questions on here, but none deal with lengthy dataframes like mine.
I can find the consecutive runs of '1's individually row-by-row, but I'm looking for a solution that can deal with my entire dataframe a bit more elegantly.
Simple example data:
test <- data.frame(v1=c(1,0,1),v2=c(1,1,1),v3=c(0,1,1),v4=c(1,1,0),v5=c(1,1,1))
test
vtest <- unlist(test[1,])
vtest
r <- rle(vtest)
r$lengths[r$values == 1]
row1_max <- max(r$lengths[r$values == 1])
row1_max
What's the best way for me to find the max consecutive runs of '1' for each row of my dataframe without having to find each one individually by row?
My real dataset also contains an ID# variable that identifies each record uniquely, and I ultimately want to know the max consecutive runs by ID#, so any additional help there would be much appreciated.
Thanks in advance!
You can use apply to apply a function to each row of your data frame:
apply(test, 1, function(x) {
  r <- rle(x)
  max(r$lengths[as.logical(r$values)])
})
This returns the maximum number of consecutive 1s per row:
[1] 2 4 3
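For the follow-up about reporting the result per ID, here is a sketch that keeps an identifier column next to the maximum run length and returns 0 for rows with no 1s at all. The `id` column, the extra all-zero row and the helper name `max_run_of_ones` are assumptions made for this example.

```r
# Example data from the question plus a hypothetical id column and an all-zero row
test <- data.frame(id = c("a", "b", "c", "d"),
                   v1 = c(1, 0, 1, 0), v2 = c(1, 1, 1, 0), v3 = c(0, 1, 1, 0),
                   v4 = c(1, 1, 0, 0), v5 = c(1, 1, 1, 0))

# Longest run of 1s in a vector; 0 when there is no 1 at all
max_run_of_ones <- function(x) {
  r <- rle(x)
  ones <- r$lengths[r$values == 1]
  if (length(ones) == 0) 0L else max(ones)
}

# Apply to the binary columns only and keep the id alongside the result
bin_cols <- setdiff(names(test), "id")
result <- data.frame(id = test$id,
                     max_run = apply(test[bin_cols], 1, max_run_of_ones))

result$max_run  # 2 4 3 0
```

Excluding the id column before calling apply matters, because apply converts its input to a matrix and a character column would turn every value into a string.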
I would use a combination of the apply family, keeping only the runs whose value is 1:
library(dplyr)
apply(test, 1, rle) %>% lapply(function(x) x$lengths[x$values == 1]) %>% vapply(max, numeric(1))
[1] 2 4 3
I'm assuming your df is tidy and that the binaries are in columns
set.seed(1)
event <- sample(1:3,365*3,replace=TRUE) # proxy for one of your columns
runs <- rle(event)
sum(runs$lengths >= 6 & runs$values == 1)
[1] 2
I'm currently working on finding the row numbers where the 6 or longer sequences begin
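One way to get the positions where those long runs begin is to take the cumulative sum of the run lengths, since each run starts one element after the previous runs end. This is a sketch on a small hand-made vector rather than the sampled `event` above.

```r
# Hypothetical event vector: two runs of 1s with length >= 6
event <- c(1, 1, 1, 1, 1, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1)
runs <- rle(event)

# Start index of every run: 1, then 1 plus the total length of all earlier runs
starts <- cumsum(c(1, head(runs$lengths, -1)))

# Keep only the runs of 1s that are at least 6 long
long_ones <- runs$lengths >= 6 & runs$values == 1
starts[long_ones]  # 1 9
```

In a data frame column these indices are the row numbers at which each qualifying run begins.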
