How can I make rowSums ignore a specific value? - R

If I have these data (dat1)
L1 L2 L3 L4 L5 L6
1 0 1 0 4 0
4 1 0 1 1 0
1 0 0 1 4 1
All the values in the matrix are 0, 1, or 4.
I need to sum every row, ignoring the value 4.
The result should look like this:
2
3
3
I have more than 900 rows.
I tried these, but something is missing:
rowSums(subset(dat1,L1!=4)
rowSums(which[dat1!=4])
n4=dat1[dat1==4]<-0

You could simply do:
dat1[dat1==4] <- 0
rowSums(dat1)
If modifying the object is not acceptable just copy it first:
dat1Zero <- dat1
dat1Zero[dat1Zero==4] <- 0
rowSums(dat1Zero)
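The copy can also be avoided altogether. A minimal sketch, assuming `dat1` is rebuilt here from the three example rows in the question (the name `row_totals` is only illustrative):

```r
# Example data from the question, rebuilt as a data frame
dat1 <- data.frame(
  L1 = c(1, 4, 1), L2 = c(0, 1, 0), L3 = c(1, 0, 0),
  L4 = c(0, 1, 1), L5 = c(4, 1, 4), L6 = c(0, 0, 1)
)

# replace() returns a modified copy, so dat1 itself keeps its 4s
row_totals <- rowSums(replace(dat1, dat1 == 4, 0))
row_totals
```

Because nothing is assigned back to `dat1`, the original values remain available for later steps.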

If all the values are 0, 1, or 4, then you just want to count the 1s, right?
apply(dat1,1,function(x)sum(x==1))
# [1] 2 3 3
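The same count can be written as a vectorised `rowSums()` over a logical matrix, which is usually quicker than `apply()` on many rows. This sketch assumes the example rows from the question and first checks that only 0, 1 and 4 occur, since the shortcut gives wrong sums otherwise (`ones_per_row` is an illustrative name):

```r
# Example data from the question, rebuilt as a data frame
dat1 <- data.frame(
  L1 = c(1, 4, 1), L2 = c(0, 1, 0), L3 = c(1, 0, 0),
  L4 = c(0, 1, 1), L5 = c(4, 1, 4), L6 = c(0, 0, 1)
)

# The shortcut only holds if every value is 0, 1 or 4
stopifnot(all(as.matrix(dat1) %in% c(0, 1, 4)))

# dat1 == 1 is a logical matrix; TRUE counts as 1 when summed
ones_per_row <- rowSums(dat1 == 1)
ones_per_row
```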

Related

Compute combination of a pair variables for a given operation in R

From a given dataframe:
# Create dataframe with 4 variables and 10 obs
set.seed(1)
df<-data.frame(replicate(4,sample(0:1,10,rep=TRUE)))
I would like to compute a subtraction between all pairwise combinations of columns, keeping only one subtraction per pair, i.e. column A - column B but not column B - column A, and so on.
What I have is very manual, and it gets unwieldy when there are lots of variables.
# Result
df_result <- as.data.frame(list(df$X1-df$X2,
df$X1-df$X3,
df$X1-df$X4,
df$X2-df$X3,
df$X2-df$X4,
df$X3-df$X4))
Also, each column name should describe the operation, e.g. x1_x2 for x1 - x2.
You can use combn:
COMBI = combn(colnames(df),2)
res = data.frame(apply(COMBI,2,function(i)df[,i[1]]-df[,i[2]]))
colnames(res) = apply(COMBI,2,paste0,collapse="minus")
head(res)
X1minusX2 X1minusX3 X1minusX4 X2minusX3 X2minusX4 X3minusX4
1 0 0 -1 0 -1 -1
2 1 1 0 0 -1 -1
3 0 0 0 0 0 0
4 0 0 -1 0 -1 -1
5 1 1 1 0 0 0
6 -1 0 0 1 1 0
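`combn()` can also apply a function to each pair directly through its `FUN` argument, which removes the separate `apply()` step and makes it easy to use the `x1_x2` naming the question asked for. A small sketch, assuming a fixed toy data frame rather than the seeded one above (the values and the names `diffs` and `res` are illustrative):

```r
# Small fixed data frame (illustrative values, not the seeded data above)
df <- data.frame(X1 = c(1, 0, 1), X2 = c(0, 0, 1),
                 X3 = c(1, 1, 0), X4 = c(0, 1, 1))

# FUN receives each pair of column names; the results become matrix columns
diffs <- combn(names(df), 2, FUN = function(p) df[[p[1]]] - df[[p[2]]])

res <- as.data.frame(diffs)
# Name each column "first_second", so X1_X2 holds X1 - X2
names(res) <- combn(names(df), 2, FUN = paste, collapse = "_")
res
```

Each pair appears once, in the order the columns occur in `df`, so `X2_X1` is never produced.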

Sub-setting or arrange the data in R

As I am new to R, this question may seem like a piece of cake to you.
I have data in txt format. The first column has the cluster number and the second column has the names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data into the following format.
Cluster No. Org 1 Org 2 org3 org4
0 0 0 1
1 0 0 0
I could not figure out how to do it in R.
Thanks
We could use table
out <- cbind(ClusterNo = seq_len(nrow(df1)), as.data.frame.matrix(table(seq_len(nrow(df1)),
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
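To try the second variant on the sample from the question, here is a minimal sketch. It assumes `df1` holds the cluster number in column 1 and the `org|gene` label in column 2; the column names `cluster` and `id` are made up for the example:

```r
df1 <- data.frame(
  cluster = c(0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 10),
  id = c("org4|gene759", "org1|gene992", "org1|gene1101", "org4|gene757",
         "org1|gene1702", "org1|gene989", "org1|gene990", "org1|gene1699",
         "org1|gene1102", "org4|gene2439", "org1|gene1374"),
  stringsAsFactors = FALSE
)

# Strip everything from "|" onwards; fixed levels keep org2 and org3 as columns
org <- factor(sub("\\|.*", "", df1$id), levels = paste0("org", 1:4))

# One row per cluster number, one column per organism
out1 <- as.data.frame.matrix(table(df1$cluster, org))
out1["10", ]
```

Cluster 10 appears twice in the input, so its row has a 1 under both org1 and org4, which the row-index variant would have split over two rows.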
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
input[, 3] <- as.integer(gsub('^org(.+)\\|.*', '\\1', input[, 2]))
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(input[, 3])) # = c(1, 2, 3, 4)
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function (x)
possibleOrgs %in% input[input[, 1] == x, 3], logical(length(possibleOrgs)))
We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames (result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1

Extract columns from df by subset of column id characters

I am working on a gene expression dataset with hundreds of samples. Each sample in the data frame has a unique column ID (example: OHC_112 or IHC_123). I want to make a new data frame containing only the columns whose names contain "OHC". How can I do this?
I am struggling to make a workable example data frame, but this is the best I was able to do.
Data frame "DF"
OHC_1 OHC_2 OHC_3 IHC_4 IHC_5 OHC_6
Gene1 1 1 0 1 1 0
Gene2 0 0 0 1 1 0
Gene3 1 1 1 0 0 1
Gene4 1 1 1 0 0 0
I got close by using the following subset command
newDF <- subset(DF, ,select = OHC_1:OHC_3)
This allows me to subset the dataframe by a range of the columns but does not allow me to choose all the columns containing "OHC" in the header.
Thanks for your help!
Just subset the columns with names that match using grepl?
> DF[, grepl("OHC",names(DF))]
OHC_1 OHC_2 OHC_3 OHC_6
1 1 1 0 0
2 0 0 0 0
3 1 1 1 1
4 1 1 1 0
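When the match is a plain prefix, as it is here, base R's `startsWith()` avoids regular expressions entirely. A minimal sketch, assuming `DF` is rebuilt from the example table in the question:

```r
DF <- data.frame(
  OHC_1 = c(1, 0, 1, 1), OHC_2 = c(1, 0, 1, 1), OHC_3 = c(0, 0, 1, 1),
  IHC_4 = c(1, 1, 0, 0), IHC_5 = c(1, 1, 0, 0), OHC_6 = c(0, 0, 1, 0),
  row.names = paste0("Gene", 1:4)
)

# startsWith() is a literal prefix test, no regular expression involved;
# drop = FALSE keeps a data frame even if only one column matches
newDF <- DF[, startsWith(names(DF), "OHC"), drop = FALSE]
newDF
```

If "OHC" could appear in the middle of a name rather than at the start, `grepl()` as above is the better fit.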
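You can also use negative indexing with grep to drop columns by pattern, which generalizes to more complex patterns:
df.2 <- DF[, -grep("^OHC_[1-3]$", names(DF))]
Since grep returns numeric positions, the negated vector removes the matching columns (here OHC_1 to OHC_3). Note that the range must be written [1-3]; [1:3] is a character class that matches the characters 1, : and 3. You could add further numbers or more complex patterns.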
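One caveat with negative `grep()` indexing is worth checking before relying on it: when the pattern matches nothing, `grep()` returns an empty vector, and negating an empty index selects no columns instead of all of them. A small sketch with a made-up three-column data frame and a pattern chosen so that nothing matches:

```r
DF <- data.frame(OHC_1 = 1:2, OHC_2 = 3:4, IHC_4 = 5:6)

# No column name matches "XYZ", so grep() returns integer(0) ...
no_match <- grep("XYZ", names(DF))

# ... and a negated empty index is still empty: every column is dropped
lost <- DF[, -no_match, drop = FALSE]

# A logical mask keeps every column when nothing matches
kept <- DF[, !grepl("XYZ", names(DF)), drop = FALSE]

ncol(lost)
ncol(kept)
```

Using `!grepl()` for exclusion sidesteps the problem, since a logical mask of all `TRUE` keeps everything.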
We can use select with matches from tidyverse
library(tidyverse)
DF %>%
select(matches("^OHC"))
# OHC_1 OHC_2 OHC_3 OHC_6
#Gene1 1 1 0 0
#Gene2 0 0 0 0
#Gene3 1 1 1 1
#Gene4 1 1 1 0

Split a string in column and count occurrence of characters

I have a very large file with dimensions 47,685 x 10,541. In that file, there are no spaces between the characters in the second column of each row, as follows:
File # 1
Row1 01205201207502102102…..
Row2 20101020100210201022…..
Row3 21050210210001120120…..
I want to do some statistics on that file and maybe delete some columns or rows. So, using R, I want to add one space between every two characters in the second column to get something like this:
File # 2
Row1 0 1 2 0 5 2 0 1 2 0 7 5 0 2 1 0 2 1 0 2…..
Row2 2 0 1 0 1 0 2 0 1 0 0 2 1 0 2 0 1 0 2 2…..
Row3 2 1 0 0 0 2 1 0 2 1 0 0 0 1 1 2 0 1 2 0…..
And then, after I finish editing, remove the spaces between the characters in the second column, so the final format will be just like File # 1.
What is the best and fastest way to do that?
Updated to address the column count as well (from your comments).
Here is a solution using tidyr and stringr. It assumes that all strings in column 2 have the same length. The solution gives you both row-wise and column-wise counts. It is written in a basic, step-by-step manner; the same could be achieved with fewer lines of code.
library(stringr)
library(tidyr)
data<-data.frame( Column.1 = c("01205", "20705", "27057"),
stringsAsFactors = FALSE)
count<-str_count(data$Column.1) # Get the length of the string in column 2
index<-1:count[1] # Generate an index based on the length
# Count the number of 5 and 7 in each string by row and add it as new column
data$Row.count_5 <- str_count(data$Column.1, "5")
data$Row.count_7 <- str_count(data$Column.1, "7")
new.data <- separate(data, Column.1, into = paste("V", 1:count[1], sep = ""), sep = index)
new.data$'NA' <- NULL
new.data
Column_count_5 <- apply(new.data[1:5],2,FUN=function(x) sum(x == 5))
Column_count_7 <- apply(new.data[1:5],2,FUN=function(x) sum(x == 7))
column_count <- as.data.frame(t(data.frame(Column_count_5,Column_count_7)))
library(plyr)
Final.df<- rbind.fill(new.data,column_count)
rownames(Final.df)<-c("Row1","Row2","Row3", "Column.count_5","Column.count_7")
Final.df
output
V1 V2 V3 V4 V5 Row.count_5 Row.count_7
Row1 0 1 2 0 5 1 0
Row2 2 0 7 0 5 1 1
Row3 2 7 0 5 7 1 2
Column.count_5 0 0 0 1 2 NA NA
Column.count_7 0 1 1 0 1 NA NA
Sample data
data<-data.frame( Column.1 = c("01205", "20705", "27057"),
stringsAsFactors = FALSE)
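For the round trip the question describes (split into single characters, edit, then join back), base R alone may be enough. This is a sketch under the same equal-length assumption, using the three sample strings above; the names `chars`, `row_count_5`, `col_count_5` and `rejoined` are illustrative:

```r
x <- c(Row1 = "01205", Row2 = "20705", Row3 = "27057")

# Split every string into single characters; one matrix row per string
# (assumes all strings have the same length)
chars <- do.call(rbind, strsplit(x, ""))

# Counts of the character "5" per row and per column position
row_count_5 <- rowSums(chars == "5")
col_count_5 <- colSums(chars == "5")

# Paste the characters back together to recover the original format
rejoined <- apply(chars, 1, paste, collapse = "")
rejoined
```

Rows or columns can be dropped from `chars` with ordinary matrix indexing before rejoining. For a file of the size mentioned in the question, a full character matrix may need a lot of memory, so processing it in chunks of rows could be worth considering.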

Filter data frame based on duplicates in columns [duplicate]

I have a very large data frame that looks something like this:
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
B 1 1
C 0 1
C 0 0
I want to only keep rows where there is a duplicate in the Gene column.
So the table would become:
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
C 0 1
C 0 0
I've tried using subset(df, duplicated(df$Genes)) in R, but I think it left some non-duplicates behind, as the naming is more involved than A/B/C, e.g. WASH11, KANSL-1, etc.
Can this be done in R or Linux shell?
In R, you could double-up on duplicated(), going from both directions.
df[with(df, duplicated(Gene) | duplicated(Gene, fromLast = TRUE)), ]
# Gene Sample1 Sample2
# 1 A 1 0
# 2 A 0 1
# 3 A 1 1
# 5 C 0 1
# 6 C 0 0
You could also use a table of the first column.
tbl <- table(df$Gene)
df[df$Gene %in% names(tbl)[tbl > 1], ]
# Gene Sample1 Sample2
# 1 A 1 0
# 2 A 0 1
# 3 A 1 1
# 5 C 0 1
# 6 C 0 0
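Both approaches compare whole strings, so longer gene names behave the same as A/B/C. A small sketch to check that they agree, assuming a character `Gene` column; the gene names are borrowed from the question, and the sample values are copied from its example table:

```r
df <- data.frame(
  Gene = c("WASH11", "WASH11", "WASH11", "KANSL-1", "C", "C"),
  Sample1 = c(1, 0, 1, 1, 0, 0),
  Sample2 = c(0, 1, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# Approach 1: flag a row if the gene repeats in either direction
by_dup <- df[duplicated(df$Gene) | duplicated(df$Gene, fromLast = TRUE), ]

# Approach 2: keep genes whose frequency in the table exceeds 1
tbl <- table(df$Gene)
by_tbl <- df[df$Gene %in% names(tbl)[tbl > 1], ]

identical(by_dup, by_tbl)
```

If real data still leaves unexpected rows, stray whitespace or inconsistent capitalisation in the gene names is a common cause and worth ruling out first.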
Other options, which may or may not work depending on the real data are ...
df[(table(df$Gene) > 1)[df$Gene],] ## credit to Pierre LaFortune
## or
df[with(df, (tabulate(Gene) > 1)[Gene]), ]
You can find the number of each by applying ave and counting the entries:
ave(as.numeric(x$Gene), x$Gene, FUN=length)
## [1] 3 3 3 1 2 2
In this expression, the first argument to ave need only be a numeric vector whose length equals the number of rows in the data frame.
Use this to select rows:
count <- ave(as.numeric(x$Gene), x$Gene, FUN=length)
x[count>1,]
## Gene Sample1 Sample2
## 1 A 1 0
## 2 A 0 1
## 3 A 1 1
## 5 C 0 1
## 6 C 0 0
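Since only the length of the first argument matters, `seq_along()` can stand in for `as.numeric()`. That avoids the coercion warning and `NA`s that `as.numeric()` produces when `Gene` is a character column rather than a factor. A minimal sketch, assuming the example data from the question stored in `x`:

```r
x <- data.frame(
  Gene = c("A", "A", "A", "B", "C", "C"),
  Sample1 = c(1, 0, 1, 1, 0, 0),
  Sample2 = c(0, 1, 1, 1, 1, 0),
  stringsAsFactors = FALSE
)

# seq_along() supplies a numeric vector of the right length;
# ave() replaces each entry with the size of its Gene group
count <- ave(seq_along(x$Gene), x$Gene, FUN = length)

x[count > 1, ]
```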
From command line using Perl
cat counts.txt
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
B 1 1
C 0 1
C 0 0
perl -ne '$cg{ (split /\t/,$_)[0] }++; push (@lines, $_); END { print shift @lines; foreach (@lines) { print if ($cg{ (split /\t/,$_)[0] } >= 2) }}' counts.txt
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
C 0 1
C 0 0
The %cg hash keeps count of the number of occurrences of each gene. Genes are extracted by taking only the first element [0] of the split operation on each line. @lines holds the entire contents of the file in memory, line by line. The END block then prints the header line, followed by only those lines whose gene appeared >= 2 times.
