Split a string in a column and count occurrences of characters in R

I have a very large file with dimensions 47,685 x 10,541. In that file, there are no spaces between the characters in each row of the second column, as follows:
File # 1
Row1 01205201207502102102…..
Row2 20101020100210201022…..
Row3 21050210210001120120…..
I want to do some statistics on that file and maybe delete some columns or rows. So, using R, I want to add one space between every two characters in the second column to get something like this:
File # 2
Row1 0 1 2 0 5 2 0 1 2 0 7 5 0 2 1 0 2 1 0 2…..
Row2 2 0 1 0 1 0 2 0 1 0 0 2 1 0 2 0 1 0 2 2…..
Row3 2 1 0 0 0 2 1 0 2 1 0 0 0 1 1 2 0 1 2 0…..
And then, after I finish editing, remove the spaces between the characters in the second column, so the final format will be just like File # 1.
What is the best and fastest way to do that?

Updated to address the column count as well (from your comments).
Here is a solution using tidyr and stringr. Note that it assumes the strings in column 2 all have the same length. It gives you both row-wise and column-wise counts, and is written in a very basic step-by-step manner; the same could be achieved in fewer lines of code.
library(stringr)
library(tidyr)
data<-data.frame( Column.1 = c("01205", "20705", "27057"),
stringsAsFactors = FALSE)
count <- str_length(data$Column.1) # Length of the string in each row of column 2
index <- 1:(count[1] - 1)          # Positions at which to split, between every pair of characters
# Count the number of 5 and 7 in each string by row and add it as new column
data$Row.count_5 <- str_count(data$Column.1, "5")
data$Row.count_7 <- str_count(data$Column.1, "7")
# Split the string into one column per character (count[1] - 1 split positions give count[1] pieces)
new.data <- separate(data, Column.1, into = paste("V", 1:count[1], sep = ""), sep = index)
new.data
Column_count_5 <- apply(new.data[1:5], 2, FUN = function(x) sum(x == "5")) # columns are character after separate
Column_count_7 <- apply(new.data[1:5], 2, FUN = function(x) sum(x == "7"))
column_count <- as.data.frame(t(data.frame(Column_count_5,Column_count_7)))
library(plyr)
Final.df<- rbind.fill(new.data,column_count)
rownames(Final.df)<-c("Row1","Row2","Row3", "Column.count_5","Column.count_7")
Final.df
Output:
V1 V2 V3 V4 V5 Row.count_5 Row.count_7
Row1 0 1 2 0 5 1 0
Row2 2 0 7 0 5 1 1
Row3 2 7 0 5 7 1 2
Column.count_5 0 0 0 1 2 NA NA
Column.count_7 0 1 1 0 1 NA NA
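As noted above, the same can be done in fewer lines. A compact base-R sketch on the same data (my addition, using strsplit instead of separate):
# Split each string into single characters, one matrix row per string
mat <- do.call(rbind, strsplit(data$Column.1, ""))
rowSums(mat == "5")   # row-wise counts of 5
colSums(mat == "5")   # column-wise counts of 5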
Sample data
data<-data.frame( Column.1 = c("01205", "20705", "27057"),
stringsAsFactors = FALSE)
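As for the original request of adding and later removing the spaces themselves, a minimal base-R sketch using gsub (here Column.1 stands in for the file's second column):
# Insert a space after every character, then drop the trailing one
data$Spaced <- trimws(gsub("(.)", "\\1 ", data$Column.1))
data$Spaced
# [1] "0 1 2 0 5" "2 0 7 0 5" "2 7 0 5 7"
# ... edit as needed ...
# Collapse the spaces again to restore the File # 1 format
data$Column.1 <- gsub(" ", "", data$Spaced, fixed = TRUE)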

Related

Sub-setting or arranging data in R

As I am new to R, this question may seem like a piece of cake to you.
I have data in txt format. The first column has the cluster number and the second column has names of different organisms.
For example:
0 org4|gene759
1 org1|gene992
2 org1|gene1101
3 org4|gene757
4 org1|gene1702
5 org1|gene989
6 org1|gene990
7 org1|gene1699
9 org1|gene1102
10 org4|gene2439
10 org1|gene1374
I need to re-arrange/reshape the data into the following format (1 marking the org that a cluster's gene belongs to):
Cluster No. org1 org2 org3 org4
0           0    0    0    1
1           1    0    0    0
I could not figure out how to do it in R.
Thanks
We could use table:
out <- cbind(ClusterNo = seq_len(nrow(df1)),
             as.data.frame.matrix(table(seq_len(nrow(df1)),
                 factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4)))))
head(out, 2)
# ClusterNo org1 org2 org3 org4
#1 1 0 0 0 1
#2 2 1 0 0 0
It is also possible that we need to use the first column to get the frequency:
out1 <- as.data.frame.matrix(table(df1[[1]],
factor(sub("\\|.*", "", df1[[2]]), levels = paste0("org", 1:4))))
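In both snippets, df1 is assumed to be the posted two-column data read in without factors, e.g.:
# A sketch; 'clusters.txt' is a hypothetical file holding the posted data
df1 <- read.table("clusters.txt", stringsAsFactors = FALSE)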
Reading the table into R can be done with
input <- read.table('filename.txt')
Then we can extract the relevant number from the org4|gene759 string using a regular expression, and set this to a third column of our input:
input[, 3] <- gsub('^org(.+)\\|.*', '\\1', input[, 2])
Our input data now looks like this:
> input
V1 V2 V3
1 0 org4|gene759 4
2 1 org1|gene992 1
3 2 org1|gene1101 1
4 3 org4|gene757 4
5 4 org1|gene1702 1
6 5 org1|gene989 1
7 6 org1|gene990 1
8 7 org1|gene1699 1
9 9 org1|gene1102 1
10 10 org4|gene2439 4
11 10 org1|gene1374 1
Then we need to list the possible values of org:
possibleOrgs <- seq_len(max(as.numeric(input[, 3]))) # = c(1, 2, 3, 4); V3 is character, so convert before taking the max
Now for the tricky part. The following function takes each unique cluster number in turn (I notice that 10 appears twice in your example data), takes all the rows relating to that cluster, and looks at the org value for those rows.
result <- vapply(unique(input[, 1]), function(x)
    possibleOrgs %in% input[input[, 1] == x, 3], logical(4))
We can then format this result as we like, perhaps using t to transform its orientation, * 1 to convert from TRUEs and FALSEs to 1s and 0s, and colnames to title its columns:
result <- t(result) * 1
colnames (result) <- paste0('org', possibleOrgs)
rownames(result) <- unique(input[, 1])
I hope that this is what you were looking for -- it wasn't quite clear from your question!
Output:
> result
org1 org2 org3 org4
0 0 0 0 1
1 1 0 0 0
2 1 0 0 0
3 0 0 0 1
4 1 0 0 0
5 1 0 0 0
6 1 0 0 0
7 1 0 0 0
9 1 0 0 0
10 1 0 0 1

Filter data frame based on duplicates in columns [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts" (9 answers)
I have a very large data frame that looks something like this:
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
B 1 1
C 0 1
C 0 0
I want to only keep rows where there is a duplicate in the Gene column.
So the table would become:
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
C 0 1
C 0 0
I've tried using subset(df, duplicated(df$Gene)) in R, but I think it left over some non-duplicates, as the naming is more involved than A/B/C, e.g. WASH11, KANSL-1, etc.
Can this be done in R or Linux shell?
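For a reproducible example, the posted table can be rebuilt like this (a sketch; column classes follow the read.table defaults):
df <- read.table(header = TRUE, text = "
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
B 1 1
C 0 1
C 0 0")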
In R, you could double-up on duplicated(), going from both directions.
df[with(df, duplicated(Gene) | duplicated(Gene, fromLast = TRUE)), ]
# Gene Sample1 Sample2
# 1 A 1 0
# 2 A 0 1
# 3 A 1 1
# 5 C 0 1
# 6 C 0 0
You could also use a table of the first column.
tbl <- table(df$Gene)
df[df$Gene %in% names(tbl)[tbl > 1], ]
# Gene Sample1 Sample2
# 1 A 1 0
# 2 A 0 1
# 3 A 1 1
# 5 C 0 1
# 6 C 0 0
Other options, which may or may not work depending on the real data are ...
df[(table(df$Gene) > 1)[df$Gene],] ## credit to Pierre LaFortune
## or
df[with(df, (tabulate(Gene) > 1)[Gene]), ]
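If the dplyr package is an option, the same filter can also be written as (my sketch, not from the original answers):
library(dplyr)
df %>%
  group_by(Gene) %>%   # group rows by gene name
  filter(n() > 1) %>%  # keep only groups with more than one row
  ungroup()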
You can find the number of each by applying ave and counting the entries:
ave(as.numeric(x$Gene), x$Gene, FUN=length)
## [1] 3 3 3 1 2 2
In this expression, the first argument to ave need only be a numeric vector whose length equals the number of rows in the data frame.
Use this to select rows:
count <- ave(as.numeric(x$Gene), x$Gene, FUN=length)
x[count>1,]
## Gene Sample1 Sample2
## 1 A 1 0
## 2 A 0 1
## 3 A 1 1
## 5 C 0 1
## 6 C 0 0
From command line using Perl
cat counts.txt
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
B 1 1
C 0 1
C 0 0
perl -ne '$cg{ (split /\t/,$_)[0] }++; push(@lines, $_); END { print shift @lines; foreach (@lines) { print if ($cg{ (split /\t/,$_)[0] } >= 2) } }' counts.txt
Gene Sample1 Sample2
A 1 0
A 0 1
A 1 1
C 0 1
C 0 0
The %cg hash keeps count of the number of occurrences of each gene. Genes are extracted by selecting only the first element ([0]) of the split operation on each line. @lines holds the entire contents of the file in memory, line by line. The END block then outputs only those lines whose gene appeared >= 2 times.

Splitting one column into multiple columns in R and giving a logical value if true

I am trying to split one column in a data frame into multiple columns, which take the values from the original column as the new column names. Then, if there was an occurrence for that respective value in the original, give it a 1 in the new column, or 0 if no match. I realize this is not the best way to explain, so, for example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and would like to expand it to wide format with 1's and 0's (or T and F), something like this:
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr with the separate function, and reshape2 with the cast function, but seem to be getting hung up on assigning the logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
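If the Location_ prefix is unwanted, it can be stripped afterwards to match the desired output exactly (a small sketch):
out <- cSplit_e(df, split.col = "Location", sep = "/", type = "character",
                drop = TRUE, fill = 0)
names(out) <- sub("^Location_", "", names(out))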
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'; the vapply template needs one entry per row
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
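Yet another route (my sketch, not part of the answers above) is to go long first with tidyr::separate_rows and then cross-tabulate back to wide; this assumes Location is a character column (the default in R >= 4.0):
library(tidyr)
# One row per subject/location pair
long <- separate_rows(df, Location, sep = "/")
# Cross-tabulate into the 0/1 wide format
wide <- as.data.frame.matrix(table(long$subject, long$Location))
cbind(subject = as.integer(rownames(wide)), wide)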

How can I sum rows while ignoring a specific value

If I have these data (dat1):
L1 L2 L3 L4 L5 L6
1 0 1 0 4 0
4 1 0 1 1 0
1 0 0 1 4 1
All the values in the matrix are 0, 1, or 4.
I need to sum every row, ignoring the number 4.
The result would look like this:
2
3
3
I have more than 900 rows.
I tried these, but there is something missing:
rowSums(subset(dat1,L1!=4)
rowSums(which[dat1!=4])
n4=dat1[dat1==4]<-0
You could simply do:
dat1[dat1==4] <- 0
rowSums(dat1)
If modifying the object is not acceptable, just copy it first:
dat1Zero <- dat1
dat1Zero[dat1Zero == 4] <- 0
rowSums(dat1Zero)
If all the values are in (0, 1, 4), then you just want to count the 1's, no??
apply(dat1, 1, function(x) sum(x == 1))
# [1] 2 3 3
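Equivalently, since only the 1's contribute to the sum, a vectorized one-liner (again assuming the values are only 0, 1 and 4):
rowSums(dat1 == 1)
# [1] 2 3 3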

Data Manipulation in R Project: compare rows

I'm looking to compare values within a dataset.
Every row starts with a unique ID followed by a couple of binary variables.
The data looks like this:
row.name v1 v2 v3 ...
1 0 0 0
2 1 1 1
3 1 0 1
I want to know which values are the same (if equal, assign a value of 1) and which are different (if not equal, assign a value of 0) for all unique pairings.
For example in column v1: row1 == 0 and row2 == 1, which should result in an assignment of 0.
So, the output should look like this
id1 id2 v1 v2 v3 ...
1 2 0 0 0 ...
1 3 0 1 0 ...
2 3 1 0 1 ...
I'm looking for an efficient way of doing this for more than 1000 rows...
There's no way to do this without expanding each combination of rows, so with 1000 rows (choose(1000, 2) = 499,500 pairs) it is going to take a bit of time. But here is a solution:
dat <- read.table(header=T, text="row.name v1 v2 v3
1 0 0 0
2 1 1 1
3 1 0 1")
Create the index rows:
indices <- t(combn(dat$row.name, 2))
colnames(indices) <- c('id1', 'id2')
Loop through the index rows, and collect the comparisons (row.name coincides with the row position here, so it can be used directly as an index):
res1 <- t(apply(indices, 1, function(x) as.numeric(dat[x[1],-1] == dat[x[2],-1])))
colnames(res1) <- names(dat[-1])
Put them together:
result <- cbind(indices, res1)
result
## id1 id2 v1 v2 v3
## [1,] 1 2 0 0 0
## [2,] 1 3 0 1 0
## [3,] 2 3 1 0 1
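For larger inputs, the row-by-row apply can be avoided. A vectorized sketch of mine (same dat as above): for each variable, outer builds the n x n equality matrix, and its lower triangle, read column by column, lines up with the pair order that combn produces.
idx <- t(combn(nrow(dat), 2))
res2 <- sapply(dat[-1], function(v) {
  m <- outer(v, v, "==")        # n x n matrix of pairwise equality
  as.numeric(m[lower.tri(m)])   # one value per row pair, in combn order
})
cbind(id1 = dat$row.name[idx[, 1]], id2 = dat$row.name[idx[, 2]], res2)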
