R - Select Rows Where Number of Values Satisfies Condition

I have a data frame called df. I want to select all rows in which at least n values satisfy some condition c.
For example, I want the rows of df in which at least 50% of the values (that is, columns) are greater than 0.75.
Here is what I came up with to accomplish this:
test <- df[apply(df, 1, function(x) (length(x[x > 0.75]) / length(x) > 0.5))]
Unfortunately I am getting this error message:
Error in `[.data.frame`(df, apply(df, :
undefined columns selected
I am very new to R, so I'm pretty stuck at this point. What's the problem here?

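You are getting that error message because you haven't told R which columns you want to include in your subset: with a single index, [ ] on a data frame selects columns, so your logical vector (one value per row) is treated as a column selector.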
You have:
df[your_apply_function]
Which doesn't specify which columns. Instead, you should try
df[your_apply_function, ]
That means 'subset df to all rows for which the apply function returns TRUE, and keep all columns'. This works as long as every column of df is numeric, because apply() converts the data frame to a matrix first.
However, I would approach it by using dplyr:
library(dplyr)
n_cols <- ncol(df)                 # count the columns before adding a helper column
rowcounts <- rowSums(df > 0.75)    # number of values above 0.75 in each row
df <- mutate(df, rowcounts = rowcounts)
df <- filter(df, rowcounts > n_cols / 2)
I didn't get to test this yet (code still running on my machine), but it looks right to my eye. When I get a chance I will test it.

This can be accomplished with a cellwise comparison against 0.75, rowSums(), and then a vectorized comparison against 0.5:
set.seed(3L)
NR <- 5L
NC <- 4L
df <- as.data.frame(matrix(rnorm(NR * NC, 0.75, 0.1), NR))
df
## V1 V2 V3 V4
## 1 0.6538067 0.7530124 0.6755218 0.7192344
## 2 0.7207474 0.7585418 0.6368781 0.6546983
## 3 0.7758788 0.8616610 0.6783642 0.6851757
## 4 0.6347868 0.6281143 0.7752652 0.8724314
## 5 0.7695783 0.8767369 0.7652046 0.7699812
df[rowSums(df > 0.75) / ncol(df) >= 0.5, ]
## V1 V2 V3 V4
## 3 0.7758788 0.8616610 0.6783642 0.6851757
## 4 0.6347868 0.6281143 0.7752652 0.8724314
## 5 0.7695783 0.8767369 0.7652046 0.7699812
This can work on both matrices and data.frames.
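The same idea can be wrapped in a small helper when the condition or the threshold changes. The following is only a sketch: keep_rows and its arguments are hypothetical names (not part of base R or of the answers above), the toy data is made up, and all columns are assumed to be numeric.

```r
# Hypothetical helper: keep rows in which at least `min_frac` of the values
# satisfy the predicate `cond`.
keep_rows <- function(data, cond, min_frac = 0.5) {
  hits <- cond(as.matrix(data))          # logical matrix, one cell per value
  frac <- rowMeans(hits, na.rm = TRUE)   # share of TRUE values per row, ignoring NA
  # rows consisting only of NA give NaN and are dropped
  data[!is.na(frac) & frac >= min_frac, , drop = FALSE]
}

# Made-up toy data: rows 1 and 3 have two of four values above 0.75.
toy <- data.frame(a = c(0.9, 0.1, 0.8),
                  b = c(0.8, 0.2, 0.1),
                  c = c(0.1, 0.9, 0.9),
                  d = c(0.2, 0.3, 0.7))
res <- keep_rows(toy, function(v) v > 0.75)
res
```

For a count-based rule ("at least n values") the fraction can be replaced by rowSums(hits, na.rm = TRUE) >= n.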

Related

Split a group of integers into two subgroups of approximately the same sums

I have a group of integers, as in this R data.frame:
set.seed(1)
df <- data.frame(id = paste0("id",1:100), length = as.integer(runif(100,10000,1000000)), stringsAsFactors = FALSE)
So each element has an id and a length.
I'd like to split df into two data.frames with approximately equal sums of length.
Any idea of an R function to achieve that?
I thought that Hmisc's cut2 might do it but I don't think that's its intended use:
library(Hmisc) # cut2
ll <- split(df, cut2(df$length, g=2))
> sum(ll[[1]]$length)
[1] 14702139
> sum(ll[[2]]$length)
[1] 37564671
This is known as the bin packing problem; https://en.wikipedia.org/wiki/Bin_packing_problem may be helpful.
Using the binPack function from the BBmisc package,
library(BBmisc) # binPack
df$bins <- binPack(df$length, sum(df$length)/2 + 1)
tapply(df$length, df$bins, sum)
gives
1 2 3
25019106 24994566 26346
Since you want only two groups, merge bin 3 into bin 2:
df$bins[df$bins == 3] <- 2 # bin 2 has the smaller sum
tapply(df$length, df$bins, sum)
The result is
1 2
25019106 25020912
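If installing a package is not an option, a simple greedy heuristic in base R often gets close. This is a sketch under stated assumptions, not the algorithm used by BBmisc: split_two is a hypothetical name, and the rule (visit items from largest to smallest, always add to the group with the smaller running sum) does not guarantee the optimal split.

```r
# Hypothetical greedy helper: returns a group label (1 or 2) for each element.
split_two <- function(x) {
  group <- integer(length(x))
  sums <- c(0, 0)                      # running sum of each group
  for (i in order(x, decreasing = TRUE)) {
    g <- which.min(sums)               # group 1 wins ties
    group[i] <- g
    sums[g] <- sums[g] + x[i]
  }
  group
}

set.seed(1)
len <- as.integer(runif(100, 10000, 1000000))
grp <- split_two(len)
tapply(len, grp, sum)                  # the two group sums
```

With this rule the two sums can never differ by more than the largest single element.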

How do I subset a vector while retaining row names?

I am looking for differentially expressed genes in a data set. After using my function to determine fold change, I am given a vector that returns the gene names and fold change which looks like this:
df1
[,1]
gene1074 1.1135131
gene22491 1.0668137
gene15416 0.9840414
gene18645 1.1101060
gene4068 1.0055899
gene19043 1.1463878
I want to look for anything that has a greater than 2 fold change, so to do this I execute:
df2 <- subset(df1 >= 2)
which returns the following:
head(df2)
[,1]
gene1074 FALSE
gene22491 FALSE
gene15416 FALSE
gene18645 FALSE
gene4068 FALSE
gene19043 FALSE
and that is not what I'm looking for.
I've tried another subsetting method:
df2 <- df1[df1 >= 2]
which returns:
head(df2)
[1] 4.191129 127.309557 2.788121 2.090916 11.382345 2.186330
Now that is the values that are over 2, but I've lost the gene names that came along with them.
How would I go about subsetting my data so that it returns in the following format:
head(df2)
[,1]
genex 4.191129
geney 127.309557
genez 2.788121
genea 2.090916
geneb 11.382345
Or something at least approximating that format, where I am given the gene and its corresponding fold change value.
You are looking for subsetting like so:
df2 <- df1[df1[, 1] >= 2, ]
To show you on some data:
# Create some toy data
df1 <- data.frame(val = rexp(100))
rownames(df1) <- paste0("gene", 1:100)
head(df1)
# val
#gene1 0.9295632
#gene2 1.2090513
#gene3 0.1550578
#gene4 1.7934942
#gene5 0.7286462
#gene6 1.8424025
Now we take the first column of df1 and compare to 2 (df1[,1] > 2). The output of that (a logical vector) is used to select the rows which fulfill the criteria:
df2 <- df1[df1[,1] > 2, ]
head(df2)
#[1] 2.705683 3.410672 3.544905 3.695313 2.523586 2.229879
Adding drop = FALSE keeps the output as a data.frame, so the row names are retained:
df3 <- df1[df1[,1] > 2, ,drop = FALSE]
head(df3)
# val
#gene8 2.705683
#gene9 3.410672
#gene22 3.544905
#gene23 3.695313
#gene38 2.523586
#gene42 2.229879
The same can be achieved by
subset(df1, subset = val > 2)
or
subset(df1, subset = df1[, 1] > 2)
The former of these two expressions does not work in your case as it appears you have not named the columns.
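The [,1] header in the question suggests that df1 may actually be a one-column matrix rather than a data.frame. The same indexing applies there; the sketch below uses made-up values and hypothetical names (fc, up) to show that drop = FALSE keeps both the matrix shape and the row names.

```r
# Toy one-column matrix with gene names as row names (values are made up).
fc <- matrix(c(1.1, 4.2, 0.9, 2.8),
             ncol = 1,
             dimnames = list(c("geneA", "geneB", "geneC", "geneD"), NULL))

# Logical row index on the first column; drop = FALSE prevents the result
# from collapsing to a plain vector.
up <- fc[fc[, 1] >= 2, , drop = FALSE]
up
```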
You can also compute the positions in the data that correspond to your predicate, and use them for indexing:
# create some test data
df <- read.csv(
textConnection(
"g, v
gene1074, 1.1135131
gene22491, 1.0668137
gene15416, 0.9840414
gene18645, 1.1101060
gene4068, 1.0055899
gene19043, 1.1463878"
))
# positions that match a given predicate
idx <- which(df$v > 1)
# indexing "as usual"
df[idx, ]
Output:
g v
1 gene1074 1.113513
2 gene22491 1.066814
4 gene18645 1.110106
5 gene4068 1.005590
6 gene19043 1.146388
I find this code reads quite nicely and is pretty intuitive, but that might just be my opinion.

How to check whether a variable is numeric for a vector in R?

I have two questions.
for (k in 1:iterations) {
corr <- cor(df2_prod[,k], df2_qa[,k])
ifelse(is.numeric(corr), next,
ifelse((all(df2_prod[,k] == df2_qa[,k])) ), (corr <- 1), (corr <- 0))
correlation[k,] <- rbind(names(df2_prod[k]), corr)
}
This is my requirement: I want to calculate the correlation for variables in a loop using the code corr <- cor(df2_prod[,k], df2_qa[,k]). If I receive a numeric correlation value, I have to keep the value as it is.
Sometimes, when two columns have the same values, I receive "NA" as output for the vector "corr".
x y
1 1
1 1
1 1
1 1
1 1
corr
[,1]
[1,] NA
I am trying to handle this in such a way that if "NA" is received, I replace the value with "1" or "0".
My questions are:
When I check the class of "corr" vector, I am getting it as "matrix". I want to check whether that is a number or not. Is there any other way other than checking is.numeric(corr)
> class(corr)
[1] "matrix"
I want to check whether two columns have the same values. Something like the code below. If it returns true, I want to proceed. But the way I have put the code in the loop is wrong. Could you please help me with how this can be improved:
all(df2_prod[,k] == df2_qa[,k])
Is there any effective way to do this?
I sincerely apologize to the readers for the poorly framed question / logic. If you can give me pointers that would improve the code, I would be really thankful.
1. You basically want to avoid NAs, right? So you could check the result with is.na().
a <- rep(1, 5)
b <- rep(1, 5)
if(is.na(cor(a, b))) cor.value <- 1
2. You could count how many elements of a are equal to the corresponding elements of b with sum(a==b), and check whether this count is equal to the number of elements in a (or b), i.e. length(a):
if(sum(a==b) == length(a)) cor.value <- 1
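Putting both checks into the loop from the question might look like the sketch below. The data frames are made-up stand-ins for df2_prod and df2_qa, the result layout (a data.frame with columns variable and corr) is an assumption, and the fallback rule (1 if the columns are identical, otherwise 0) is taken from the question. Note that cor() returns NA, with a warning, whenever one of the columns has zero variance.

```r
# Toy stand-ins for df2_prod and df2_qa (values are made up).
df2_prod <- data.frame(x = c(1, 1, 1, 1, 1), y = c(1, 2, 3, 4, 5),  z = c(2, 2, 2, 2, 2))
df2_qa   <- data.frame(x = c(1, 1, 1, 1, 1), y = c(2, 4, 6, 8, 11), z = c(3, 3, 3, 3, 3))

correlation <- data.frame(variable = names(df2_prod), corr = NA_real_)
for (k in seq_along(df2_prod)) {
  a <- df2_prod[[k]]
  b <- df2_qa[[k]]
  # constant columns make cor() warn and return NA
  corr <- suppressWarnings(cor(a, b))
  if (is.na(corr)) {
    # fallback rule from the question: 1 if identical, otherwise 0
    corr <- if (isTRUE(all(a == b))) 1 else 0
  }
  correlation$corr[k] <- corr
}
correlation
```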
An example to explain how the cor function works:
set.seed(123)
df1 <- data.frame(v1=1:10, v2=rnorm(10), v3=rnorm(10), v4=rnorm(10))
df2 <- data.frame(w1=rnorm(10), w2=1:10, w3=rnorm(10))
Here, the first variable of df1 is equal to the second variable of df2. Function cor directly applied on the first 3 variables of each data.frame gives:
cor(df1[, 1:3], df2[, 1:3])
# w1 w2 w3
#v1 -0.4603659 1.0000000 0.1078796
#v2 0.6730196 -0.2602059 -0.3486367
#v3 0.2713188 -0.3749826 -0.2520174
As you can notice, the correlation coefficient between w2 and v1 is 1, not NA.
So, in your case, cor(df2_prod[, 1:k], df2_qa[, 1:k]) should provide you the desired output.

Attach/detach in R behaving very strangely

I want to subset a dataframe by applying two conditions to it. When I attach the dataframe, apply the first condition, detach the dataframe, attach it again, apply the second condition, and detach again, I get the expected result, a dataframe with 9 observations.
Of course, you wouldn't normally detach/attach before applying the second condition. So I attach, apply the two conditions after one another, and then detach. But the result is different now: It's a dataframe with 24 observations. All but 5 of these observations consist exclusively of NA-values.
I know there's lots of advice against using attach, and I understand the point that it's dangerous, because it's easy to lose track of an attach statement still being active. My point here is a different one; I see a behaviour in attach that I just can't understand. I'm using RStudio 0.99.465 with 64-bit R 3.2.1.
So here's the code, first the version that is clumsy, but produces the correct result (df with 9 observations, all non-NA):
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
detach(df)
attach(df)
df <- df[early_vvl <= late_no_reaction,]
detach(df)
Now the one that produces the dataframe with 24 observations, most of which consist only of NA values:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
df <- df[early_vvl <= late_no_reaction,]
detach(df)
I'm puzzled. Does anybody understand why the second version produces a different result?
Have a look at what happens here:
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
length(early_vvl <= late_no_reaction)
## [1] 32
df <- df[early_vvl <= late_no_reaction,]
detach(df)
So your logical vector early_vvl <= late_no_reaction still uses the original df, the one that you attached. When you subset the data.frame the second time, the logical vector is longer than the data.frame has rows, so every TRUE past the last row produces a row of NAs, like this:
df <- data.frame(x=1:5, y = letters[1:5])
df[rep(c(TRUE, FALSE), 5), ]
## x y
## 1 1 a
## 3 3 c
## 5 5 e
## NA NA <NA>
## NA.1 NA <NA>
You could just use & to avoid the problem:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl & early_vvl <= late_no_reaction,]
detach(df)
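Another way to sidestep the stale attached copy is to skip attach altogether and evaluate each condition against the current df with with(). A sketch using the data from the question (keep1 and keep2 are hypothetical names):

```r
df <- expand.grid(early_vvl = c(0, 1), inter_churn = c(0, 1),
                  inter_new_contract = c(0, 1), late_vvl = c(0, 1),
                  late_no_reaction = c(0, 1))

# with() looks the column names up in the data frame as it is right now,
# so the second condition is evaluated on the already reduced df.
keep1 <- with(df, (1 - early_vvl) >= inter_churn + inter_new_contract + late_vvl)
df <- df[keep1, ]
keep2 <- with(df, early_vvl <= late_no_reaction)
df <- df[keep2, ]
nrow(df)
## [1] 9
```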

How to use `cor.test` for correlation of specific columns?

I have the following data example:
A<-rnorm(100)
B<-rnorm(100)
C<-rnorm(100)
v1<-as.numeric(c(1:100))
v2<-as.numeric(c(2:101))
v3<-as.numeric(c(3:102))
v2[50]<-NA
v3[60]<-NA
v3[61]<-NA
df<-data.frame(A,B,C,v1,v2,v3)
As you can see df has 1 NA in column 5, and 2 NA's in column 6.
Now I would like to make a correlation matrix of col1 and 3 on the one hand, and col2,4,5,6 on the other. Using the cor function in R:
cor(df[ , c(1,3)], df[ , c(2,4,5,6)], use="complete.obs")
# B v1 v2 v3
# A -0.007565203 -0.2985090 -0.2985090 -0.2985090
# C 0.032485874 0.1043763 0.1043763 0.1043763
This works. I however wanted to have both estimate and p.value and therefore I switch to cor.test.
cor.test(df[ ,c(1,3)], df[ , c(2,4,5,6)], na.action = "na.exclude")$estimate
This does not work as 'x' and 'y' must have the same length.
This error actually occurs with or without NA's in the data. It seems that cor.test does not understand (unlike cor) the request to correlate specific columns. Is there any solution to this problem?
You can use outer to perform the test between all pairs of columns. Inside the function, X and Y are the two column sets from df, recycled by outer to 8 columns each (one per pair).
outer(df[, c(1,3)], df[, c(2,4,5,6)], function(X, Y){
mapply(function(...) cor.test(..., na.action = "na.exclude")$estimate,
X, Y)
})
You even get output in the same form as cor:
B v1 v2 v3
A 0.07844426 0.01829566 0.01931412 0.01528329
C 0.11487140 -0.14827859 -0.14900301 -0.15534569
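Since the question asks for p-values as well as estimates, here is a sketch that builds one matrix per component with nested sapply calls. The helper name pairwise and the seed are assumptions made for illustration. cor.test drops incomplete observations separately for each pair, so the estimates correspond to cor(..., use = "pairwise.complete.obs") rather than "complete.obs".

```r
set.seed(42)  # arbitrary seed so the toy data is reproducible
df <- data.frame(A = rnorm(100), B = rnorm(100), C = rnorm(100),
                 v1 = as.numeric(1:100), v2 = as.numeric(2:101),
                 v3 = as.numeric(3:102))
df$v2[50] <- NA
df$v3[c(60, 61)] <- NA

xs <- c("A", "C")               # row variables
ys <- c("B", "v1", "v2", "v3")  # column variables

# Hypothetical helper: run cor.test() once per (x, y) pair and extract one
# component of the result, e.g. "estimate" or "p.value".
pairwise <- function(component) {
  sapply(ys, function(y) sapply(xs, function(x) {
    unname(cor.test(df[[x]], df[[y]])[[component]])
  }))
}

estimates <- pairwise("estimate")
p_values  <- pairwise("p.value")
estimates
p_values
```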
