How to count missing values from two columns in R - r

I have a data frame which looks like this
**Contig_A** **Contig_B**
Contig_0 Contig_1
Contig_3 Contig_5
Contig_4 Contig_1
Contig_9 Contig_0
I want to count how many contig ids (from Contig_0 to Contig_1193) are not present in either the Contig_A or the Contig_B column.
For example: if we consider there are total 10 contigs here for this data frame (Contig_0 to Contig_9), then the answer would be 4 (Contig_2, Contig_6, Contig_7, Contig_8)

Create a vector of all the values that you want to check (all_contig), which is Contig_0 to Contig_9 here. Use setdiff to find the absent values and length to get the count of missing values.
df <- data.frame(
Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
cols <- c('Contig_A', 'Contig_B')
#If there are a lot of 'Contig' columns that you want to consider
#cols <- grep('Contig', names(df), value = TRUE)
all_contig <- paste0('Contig_', 0:9)
missing_contig <- setdiff(all_contig, unlist(df[cols]))
#[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
count_missing <- length(missing_contig)
#[1] 4

By match (paste0 is vectorized, so no sapply is needed):
x <- 0:9
contigs <- paste0("Contig_", x)
df1 <- data.frame(
Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
xx <- c(df1$Contig_A,df1$Contig_B)
contigs[is.na(match(contigs, xx))]
[1] "Contig_2" "Contig_6" "Contig_7" "Contig_8"
In your case, just change x to x <- 0:1193
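A base-R variant of the same idea uses %in% instead of setdiff; a minimal sketch on the example data (for the real data, swap 0:9 for 0:1193):

```r
# Count contig ids absent from both columns using %in%
df <- data.frame(
  Contig_A = c("Contig_0", "Contig_3", "Contig_4", "Contig_9"),
  Contig_B = c("Contig_1", "Contig_5", "Contig_1", "Contig_0")
)
all_contig <- paste0("Contig_", 0:9)
count_missing <- sum(!all_contig %in% unlist(df[c("Contig_A", "Contig_B")]))
count_missing
# [1] 4
```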

Related

Split a group of integers into two subgroups with approximately the same sums

I have a group of integers, as in this R data.frame:
set.seed(1)
df <- data.frame(id = paste0("id",1:100), length = as.integer(runif(100,10000,1000000)), stringsAsFactors = F)
So each element has an id and a length.
I'd like to split df into two data.frames with approximately equal sums of length.
Any idea of an R function to achieve that?
I thought that Hmisc's cut2 might do it but I don't think that's its intended use:
library(Hmisc) # cut2
ll <- split(df, cut2(df$length, g=2))
> sum(ll[[1]]$length)
[1] 14702139
> sum(ll[[2]]$length)
[1] 37564671
This is the bin packing problem; https://en.wikipedia.org/wiki/Bin_packing_problem may be helpful.
Using the BBmisc::binPack function,
library(BBmisc)
df$bins <- binPack(df$length, sum(df$length)/2 + 1)
tapply(df$length, df$bins, sum)
results like
1 2 3
25019106 24994566 26346
Now since you want two groups,
df$bins[df$bins == 3] <- 2 #merge bin 3 into bin 2, whose sum is the smaller of the two
result is
1 2
25019106 25020912
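If you'd rather not pull in an extra package, a simple greedy heuristic (take items largest-first, always add to the lighter group) also gets the two sums close. This is a sketch of a heuristic, not an exact partition:

```r
set.seed(1)
df <- data.frame(id = paste0("id", 1:100),
                 length = as.integer(runif(100, 10000, 1000000)),
                 stringsAsFactors = FALSE)

# Greedy heuristic: visit items in decreasing order of length and
# assign each one to the group whose running sum is currently smaller.
ord <- order(df$length, decreasing = TRUE)
group <- integer(nrow(df))
s <- c(0, 0)                     # running sums of the two groups
for (i in ord) {
  g <- which.min(s)
  group[i] <- g
  s[g] <- s[g] + df$length[i]
}
ll <- split(df, group)
sapply(ll, function(d) sum(d$length))
```

With sorted-descending assignment the final difference between the two sums is bounded by the largest single length, which is small relative to the ~50M total here.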

Iterate a data frame containing lists of column numbers, of different lengths, with a function in R

I have a data frame (df) of survey responses about human values with 57 columns/variables of numerical/scale responses. Each column belongs to one of ten categories, and they're not in contiguous groups.
I have a second dataframe (scoretable) that associates the categories with the column numbers for the variables; the lists of column numbers are all different lengths:
scoretable <- data.frame(
valuename =
c("Conformity", "Tradition", "Benevolence", "Universalism", "Self-Direction",
"Stimulation", "Hedonism", "Achievement", "Power", "Security"),
valuevars = I(list(c(11,20,40,47), # Conformity
c(18,32,36,44,51), # Tradition
c(33,45,49,52,54), # Benevolence
c(1,17,24,26,29,30,35,38), # Universalism
c(5,16,31,41,53), # Self-Direction
c(9,25,37), # Stimulation
c(4,50,57), # Hedonism
c(34,39,43,55), # Achievement
c(3,12,27,46), # Power
c(8,13,15,22,56))), # Security
stringsAsFactors=FALSE)
I would like to iterate through scoretable with a function, valuescore, that calculates the mean and sd of all responses in that group of columns in data frame df and write the result to a third table of results:
valuescore = function(df,scoretable,valueresults){
valuename = scoretable[,1]
set <- df[,scoretable[,2]]
setmeans <- colMeans(set,na.rm=TRUE)
valuemean <- mean(setmeans)
setvars <- apply(set, 2, var)
valuesd <-sqrt(mean(setvars))
rbind(valueresults,c(valuename, valuemean, valuesd))
}
a <- nrow(scoretable)
for(i in 1:a){
valuescore(df,scoretable[i,],valueresults)
}
I am very new to R and programming in general (this is my first question here), and I'm struggling to determine how to pass list variables to functions and/or as address ranges for data frames.
Let's create an example data.frame:
df <- as.data.frame(replicate(57, rnorm(10, 50, 20)))
Let's prepare the table result format:
valueresults <- data.frame(
name = scoretable$valuename,
mean = 0
)
Now, a loop over the rows of scoretable: take the mean of each relevant column, then the mean of those means. It is brutal (the Map-based answer is more elegant), but maybe it is easier to understand for an R beginner.
for(v in 1:nrow(scoretable)){
# let's suppose v = 1 "Conformity"
columns_id <- scoretable$valuevars[[v]]
# isolate columns that correspond to 'Conformity'
temp_df <- df[, columns_id]
# mean of the values of these columns
temp_means <- apply(temp_df, 2, mean)
mean <- mean(temp_means)
# save result in the prepared table
valueresults$mean[v] <- mean
}
> (valueresults)
name mean
1 Conformity 45.75407
2 Tradition 52.76935
3 Benevolence 50.81724
4 Universalism 51.04970
5 Self-Direction 55.43723
6 Stimulation 52.15962
7 Hedonism 53.17395
8 Achievement 47.77570
9 Power 52.61731
10 Security 54.07066
Here is a way using Map to apply a function to the list scoretable[, 2].
First I will create a test df.
set.seed(1234)
m <- 100
n <- 57
df <- matrix(sample(10, m*n, TRUE), nrow = m, ncol = n)
df <- as.data.frame(df)
And now the function valuescore.
valuescore <- function(DF, scores){
f <- function(inx) mean(as.matrix(DF[, inx]), na.rm = TRUE)
res <- Map(f, scores[, 2])
names(res) <- scores[[1]]
res
}
valuescore(df, scoretable)
#$Conformity
#[1] 5.5225
#
#$Tradition
#[1] 5.626
#
#$Benevolence
#[1] 5.548
#
#$Universalism
#[1] 5.36125
#
#$`Self-Direction`
#[1] 5.494
#
#$Stimulation
#[1] 5.643333
#
#$Hedonism
#[1] 5.546667
#
#$Achievement
#[1] 5.3175
#
#$Power
#[1] 5.41
#
#$Security
#[1] 5.54
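The question also asked for the sd, which neither answer computes. A sketch extending the Map() idea to return both statistics, using the same sd definition as the question's valuescore (square root of the mean per-column variance); the scoretable here is shortened to two categories so the sketch stays compact:

```r
set.seed(1234)
df <- as.data.frame(matrix(rnorm(10 * 57, 50, 20), nrow = 10, ncol = 57))

# Shortened scoretable for the sketch (two of the ten categories)
scoretable2 <- data.frame(
  valuename = c("Conformity", "Stimulation"),
  valuevars = I(list(c(11, 20, 40, 47), c(9, 25, 37))),
  stringsAsFactors = FALSE
)

# mean = mean of the per-column means; sd = sqrt of the mean of the
# per-column variances, matching the question's valuescore()
valuescore2 <- function(DF, scores) {
  do.call(rbind, Map(function(name, inx) {
    set <- DF[, inx]
    data.frame(name = name,
               mean = mean(colMeans(set, na.rm = TRUE)),
               sd = sqrt(mean(apply(set, 2, var, na.rm = TRUE))))
  }, scores$valuename, scores$valuevars))
}

valueresults2 <- valuescore2(df, scoretable2)
valueresults2
```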

How do I subset a vector while retaining row names?

I am looking for differentially expressed genes in a data set. After using my function to determine fold change, I am given a vector that returns the gene names and fold change which looks like this:
df1
[,1]
gene1074 1.1135131
gene22491 1.0668137
gene15416 0.9840414
gene18645 1.1101060
gene4068 1.0055899
gene19043 1.1463878
I want to look for anything that has a greater than 2 fold change, so to do this I execute:
df2 <- subset(df1 >= 2)
which returns the following:
head(df2)
[,1]
gene1074 FALSE
gene22491 FALSE
gene15416 FALSE
gene18645 FALSE
gene4068 FALSE
gene19043 FALSE
and that is not what I'm looking for.
I've tried another subsetting method:
df2 <- df1[df1 >= 2]
which returns:
head(df2)
[1] 4.191129 127.309557 2.788121 2.090916 11.382345 2.186330
Now that is the values that are over 2, but I've lost the gene names that came along with them.
How would I go about subsetting my data so that it returns in the following format:
head(df2)
[,1]
genex 4.191129
geney 127.309557
genez 2.788121
genea 2.090916
geneb 11.382345
Or something at least approximating that format where I am given the gene and its corresponding fold change value.
You are looking for subsetting like so:
df2 <- df1[df1[, 1] > 2, ]
To show you on some data:
# Create some toy data
df1 <- data.frame(val = rexp(100))
rownames(df1) <- paste0("gene", 1:100)
head(df1)
# val
#gene1 0.9295632
#gene2 1.2090513
#gene3 0.1550578
#gene4 1.7934942
#gene5 0.7286462
#gene6 1.8424025
Now we take the first column of df1 and compare to 2 (df1[,1] > 2). The output of that (a logical vector) is used to select the rows which fulfill the criteria:
df2 <- df1[df1[,1] > 2, ]
head(df2)
#[1] 2.705683 3.410672 3.544905 3.695313 2.523586 2.229879
Using drop = FALSE keeps the output as a data.frame:
df3 <- df1[df1[,1] > 2, , drop = FALSE]
head(df3)
# val
#gene8 2.705683
#gene9 3.410672
#gene22 3.544905
#gene23 3.695313
#gene38 2.523586
#gene42 2.229879
The same can be achieved by
subset(df1, subset = val > 2)
or
subset(df1, subset = df1[,1] > 2)
The former of these two expressions does not work in your case as it appears you have not named the columns.
You can also compute the positions in the data that correspond to your predicate, and use them for indexing:
# create some test data
df <- read.csv(
textConnection(
"g, v
gene1074, 1.1135131
gene22491, 1.0668137
gene15416, 0.9840414
gene18645, 1.1101060
gene4068, 1.0055899
gene19043, 1.1463878"
))
# positions that match a given predicate
idx <- which(df$v > 1)
# indexing "as usual"
df[idx, ]
Output:
g v
1 gene1074 1.113513
2 gene22491 1.066814
4 gene18645 1.110106
5 gene4068 1.005590
6 gene19043 1.146388
I find this code reads quite nicely and is pretty intuitive, but that might just be my opinion.
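Worth noting: if the object is really a named numeric vector rather than a one-column matrix or data.frame, plain [ subsetting already keeps the names. A sketch with invented toy values:

```r
# A named numeric vector of fold changes (values invented for illustration)
fold_change <- c(gene1074 = 1.11, gene22491 = 4.19, gene15416 = 0.98,
                 gene18645 = 2.79)
# Logical subsetting of a named vector preserves the names
big <- fold_change[fold_change > 2]
big
# gene22491 gene18645
#      4.19      2.79
```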

Creating a function to read a data set and columns and displaying nrow

I am struggling a bit with a probably fairly simple task. I wanted to create a function that takes a dataframe (df), two column names of that dataframe (T and R), and a value for each selected column (a and b). I know that the function reads the dataframe, but I don't know how the columns are selected, and I'm getting an error.
fun <- function(df,T,a,R,b)
{
col <- ds[c("x","y")]
omit <- na.omit(col)
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
nrow(data2)/nrow(data1)
}
fun(jugs,Place,UK,Price,10)
I'm new to r language. So, please help me.
There are several errors you're making.
col <- ds[c("x","y")]
What are x and y? Presumably they're arguments that you're passing, but you specify T and R in your function, not x and y.
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
Again, presumably, you want a and b to be arguments you passed to the function, but you specified 'a' and 'b' which are specific, not general arguments. Also, I assume that second "omit$x" should be "omit$y" (or vice versa). And actually, since you just made this into a new data frame with two columns, you can just use the column index.
nrow(data2)/nrow(data1)
You should print this line, or return it. Either one should suffice.
fun(jugs,Place,UK,Price,10)
Finally, you should use quotes on Place, UK, and Price, at least the way I've done it.
fun <- function(df, col1, val1, col2, val2){
new_cols <- df[,c(col1, col2)]
omit <- na.omit(new_cols)
data1 <- omit[omit[,1] == val1,]
data2 <- omit[omit[,2] == val2,]
print(nrow(data2)/nrow(data1))
}
fun(jugs, "Place", "UK", "Price", 10)
And if I understand what you're trying to do, it may be easier to avoid creating multiple dataframes that you don't need and just use counts instead.
fun <- function(df, col1, val1, col2, val2){
new_cols <- df[,c(col1, col2)]
omit <- na.omit(new_cols)
n1 <- sum(omit[,1] == val1)
n2 <- sum(omit[,2] == val2)
print(n2/n1)
}
fun(jugs, "Place", "UK", "Price", 10)
I would write this function as follows:
fun <- function(df,T,a,R,b) {
data <- na.omit(df[c(T,R)]);
sum(data[[R]]==b)/sum(data[[T]]==a);
};
As you can see, you can combine the first two lines into one, because in your code col was not reused anywhere. Secondly, since you only care about the number of rows of the two subsets of the intermediate data.frame, you don't actually need to construct those two data.frames; instead, you can just compute the logical vectors that result from the two comparisons, and then call sum() on those logical vectors, which naturally treats FALSE as 0 and TRUE as 1.
Demo:
fun <- function(df,T,a,R,b) { data <- na.omit(df[c(T,R)]); sum(data[[R]]==b)/sum(data[[T]]==a); };
df <- data.frame(place=c(rep(c('p1','p2'),each=4),NA,NA), price=c(10,10,20,NA,20,20,20,NA,20,20), stringsAsFactors=F );
df;
## place price
## 1 p1 10
## 2 p1 10
## 3 p1 20
## 4 p1 NA
## 5 p2 20
## 6 p2 20
## 7 p2 20
## 8 p2 NA
## 9 <NA> 20
## 10 <NA> 20
fun(df,'place','p1','price',20);
## [1] 1.333333
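The same ratio can also be computed without defining a function at all; a sketch using with() on the demo data frame (reconstructed here so the sketch runs on its own):

```r
# Demo data, as in the answer above
df <- data.frame(place = c(rep(c("p1", "p2"), each = 4), NA, NA),
                 price = c(10, 10, 20, NA, 20, 20, 20, NA, 20, 20),
                 stringsAsFactors = FALSE)

# Drop incomplete rows, then take the ratio of the two counts directly
ratio <- with(na.omit(df[c("place", "price")]),
              sum(price == 20) / sum(place == "p1"))
ratio
# [1] 1.333333
```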

Subset one element of a row based on vector of column number

I have a dataset
data <- cbind(c(1,2,3),c(1,11,21))
I want to extract one element from each row based on the column number given by a vector
selectcol <- c(1,2,2)
In that particular case the result should be
result
1
11
21
I have tried
resul<-apply(data, 1, [,selectcol])
but it does not work
You can use col to match the values with selectcol and subset data with it.
data[col(data) == selectcol]
# [1] 1 11 21
what if you try
selection <- cbind(1:3, selectcol)
result <- data[selection]
This worked for me using a function:
data <- data.frame(cbind(c(1,2,3),c(1,11,21)))
selectcol <- c(1,2,2)
extract_elems <- function(data, selectcol) {
elems <- vector()
for (i in 1:length(selectcol)) {
elems <- append(elems, data[i, selectcol[i]])
}
return(elems)
}
output <- extract_elems(data,selectcol)
> output
[1] 1 11 21
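All of the answers above exploit the same idea; the most direct base-R form indexes the matrix with a two-column (row, column) index matrix:

```r
data <- cbind(c(1, 2, 3), c(1, 11, 21))
selectcol <- c(1, 2, 2)

# A two-column matrix of (row, column) pairs picks one element per pair
result <- data[cbind(seq_len(nrow(data)), selectcol)]
result
# [1]  1 11 21
```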
