Concatenating positions into genomic segments - r

I would like to concatenate all consecutive rows which have a similarity score of at least 0.955. The Abo and Bel columns represent the similarity score with the rows above and below, respectively. In the following input df I have 10 genomic probes (NAME column) which are concatenated into just 4 genomic segments (dfout).
df <- " NAME Abo Bel Chr GD Position
BovineHD0100009217 NA 1.0000000 1 0 31691781
BovineHD0100009218 1.0000000 0.6185430 1 0 31695808
BovineHD0100019600 0.6185430 0.9973510 1 0 69211537
BovineHD0100019601 0.9973510 1.0000000 1 0 69213650
BovineHD0100019602 1.0000000 1.0000000 1 0 69214650
BovineHD0100019603 1.0000000 0.6600000 1 0 69217942
BovineHD0100047112 0.6600000 1.0000000 1 0 93797691
BovineHD0100026604 1.0000000 1.0000000 1 0 93815774
BovineHD0100026605 1.0000000 0.4649007 1 0 93819471
BovineHD0100029861 0.4649007 NA 1 0 105042452"
df <- read.table(text=df, header=T)
My expected output dfout:
dfout <- "Chr start end startp endp nprob
1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
1 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
1 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
1 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1"
dfout <- read.table(text=dfout, header=T)
Any ideas?

I couldn't think of any pretty solution using basic dataframe manipulation, so here's a bad-looking one that works:
First, add stringsAsFactors to df creation:
df <- read.table(text=df, header=T, stringsAsFactors = FALSE)
start <- df$Position[1]
end <- integer()
output <- NULL
count <- 1
for (i in 1:(nrow(df)-1)) {
if(df$Bel[i] < 0.955) {
end <- df$Position[i]
output <- rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count))
start <- df$Position[i+1]
count <- 0
}
count <- count + 1
}
end <- df$Position[nrow(df)]
output <- as.data.frame(rbind(output, c(start, end, df$NAME[which(df$Position == start)], df$NAME[which(df$Position == end)], count)))
colnames(output) <- c("start", "end", "startp", "endp", "nprob")
The basic idea here is to loop through the rows and check whether the next row should be added to the current segment (Bel >= 0.955) or whether a new segment should start (Bel < 0.955). When a new segment has to be started, the end row is defined, the finished segment is added to the output, and the start of the next segment is set. A counter tracks the number of rows used to create each segment (nprob).
Finally, the last segment is added outside the for loop, and the output is converted to a data frame and receives its column names. I did not use Chr because (1) the values are all equal, and (2) if they weren't, you didn't give any way to choose or summarize them.
Result:
> output
start end startp endp nprob
1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
2 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
3 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
4 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1
I'm pretty sure that you or someone else can work on this to make it shorter and more concise.
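A shorter variant of the same idea, as a minimal base R sketch rather than a definitive implementation: mark the rows that open a new segment, turn those marks into segment ids with cumsum(), and pick the first and last row of each id. The names probes, cutoff, new_seg and seg are made up for this illustration, and breaking a segment when Chr changes is an assumption that the question does not state.

```r
# Same probe data as in the question, read straight into a data frame.
probes <- read.table(text = "NAME Abo Bel Chr GD Position
BovineHD0100009217 NA 1.0000000 1 0 31691781
BovineHD0100009218 1.0000000 0.6185430 1 0 31695808
BovineHD0100019600 0.6185430 0.9973510 1 0 69211537
BovineHD0100019601 0.9973510 1.0000000 1 0 69213650
BovineHD0100019602 1.0000000 1.0000000 1 0 69214650
BovineHD0100019603 1.0000000 0.6600000 1 0 69217942
BovineHD0100047112 0.6600000 1.0000000 1 0 93797691
BovineHD0100026604 1.0000000 1.0000000 1 0 93815774
BovineHD0100026605 1.0000000 0.4649007 1 0 93819471
BovineHD0100029861 0.4649007 NA 1 0 105042452
", header = TRUE, stringsAsFactors = FALSE)

cutoff <- 0.955  # similarity threshold from the question

# A row opens a new segment if it is the first row, if the previous row's
# Bel score is below the cutoff, or (assumption) if the chromosome changes.
new_seg <- c(TRUE, head(probes$Bel, -1) < cutoff | diff(probes$Chr) != 0)
seg <- cumsum(new_seg)  # running segment id per row

first <- !duplicated(seg)                   # first row of each segment
last  <- !duplicated(seg, fromLast = TRUE)  # last row of each segment

segments <- data.frame(
  Chr    = probes$Chr[first],
  start  = probes$Position[first],
  end    = probes$Position[last],
  startp = probes$NAME[first],
  endp   = probes$NAME[last],
  nprob  = as.vector(table(seg)),
  stringsAsFactors = FALSE
)
segments
```

This assumes the rows are already sorted by chromosome and position and that Bel is only NA on the last row, as in the example data.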

Here is a dplyr version. First we need to define the groups, which is what the mutate step does; then a simple summarise within each group does the rest.
library(dplyr)
df %>%
mutate(
Abo955=ifelse(Abo<0.955,NA,Abo),
myGroup=cumsum(is.na(Abo955)*1)) %>%
group_by(myGroup) %>%
summarise(
Chr=min(Chr),
start=min(Position),
end=max(Position),
startp=first(NAME),
endp=last(NAME),
nprob=n()) %>%
select(-myGroup)

This solution is purely based on logical vectors and works with the provided example.
As Molx said, let's add stringsAsFactors = FALSE
df <- read.table(text=df, header=T, stringsAsFactors = F)
And, just so that the logical evaluations work, let's change the NAs to 0s
df[is.na(df)] <- 0
Now, for the consecutive rows that will be concatenated, let's find the "start" and "end" rows using logical evaluations
starts <- df$Bel >= 0.955 & df$Abo < 0.955
ends <- df$Bel < 0.955 & df$Abo >= 0.955
With this we can already construct a data.frame concatenating rows that need to be concatenated
concatenated <- data.frame(Chr = df[starts, "Chr"],
start = df[starts, "Position"],
end = df[ends, "Position"],
startp = df[starts, "NAME"],
endp = df[ends, "NAME"],
nprob = which(ends) - which(starts) + 1 # rows from each start to its end, inclusive
)
And let's also construct a data.frame with the rows that are not concatenated, i.e. the ones that do not have the desired similarity score with either the row above or the row below
notConcatenate <- df$Abo < 0.955 & df$Bel < 0.955
non_concatenated <- data.frame(Chr = df[notConcatenate, "Chr"],
start = df[notConcatenate, "Position"],
end = df[notConcatenate, "Position"],
startp = df[notConcatenate, "NAME"],
endp = df[notConcatenate, "NAME"],
nprob = 1
)
And finally bind the two data.frames
dfout <- rbind(concatenated,non_concatenated)
Resulting in
> dfout
Chr start end startp endp nprob
1 1 31691781 31695808 BovineHD0100009217 BovineHD0100009218 2
2 1 69211537 69217942 BovineHD0100019600 BovineHD0100019603 4
3 1 93797691 93819471 BovineHD0100047112 BovineHD0100026605 3
4 1 105042452 105042452 BovineHD0100029861 BovineHD0100029861 1
NOTE: This code assumes that correlated probes are within the same chromosome
Cheers!

Related

looping within a variable in panel data using loop in R

I have panel data like id <- c(1,1,1,1,1,1,1,2,2,2,2,2), intm <- c(1,1,0,0,1,0,0,0,0,0,1,1). The data frame is built with
dta <- data.frame(cbind(id,intm)) which gives:
id intm
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 1 0
7 1 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 1
I would like to replace the subsequent values of "intm" variable by the first value within the ID variable. That is for ID=1, the first value is 1, so the intm should have all values as 1 and for ID=2, intm should have all values as 0. The data should be like
id <- c(1,1,1,1,1,1,1,2,2,2,2,2),intm <- c(1,1,1,1,1,1,1,0,0,0,0,0) with data frame
dta <- data.frame(cbind(id,intm))
How can I do this in R by looping or any other means? I have a big data set.
Consider the following code:
new_column <- c(); i <- 1; # new column to be created
# loop
for (j in unique(dta$id)){ # let's separate the unique values of ID
index <- which(dta$id==j) # which row index satisfy id==1, or id==2, ...?
value <- dta$intm[index[1]] # which value of intm corresponds to the first value of the index?
new_column[i:tail(index,n = 1)] <- rep(value,nrow(dta[dta$id==j,])) # repeat this value once for every row that has this ID
i <- tail(index,n = 1)+1 # the new_column component must start with its last value + 1
}
dta <- cbind(dta,new_column)
Alternatively, you can use the subset() function, i.e
rep(value,nrow(subset(dta,dta$id==j)))
You can use dplyr to do this.
There are other fancier ways to do this but I think dplyr is more graceful.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
df = data.frame(id, intm)
library(dplyr)
df2 = df %>% group_by(id) %>% do({
.$intm = .$intm[1, drop = TRUE]
.
})
You can also try the data.table library, which should be faster.
BTW: you do not need cbind to make a data.frame.
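The data.table suggestion could look roughly like this. It is a minimal sketch and assumes the data.table package is installed; the object name dt is only for illustration.

```r
library(data.table)

id   <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
dt <- data.table(id = id, intm = intm)

# Within each id, overwrite intm with that group's first value.
# := modifies dt by reference, so no reassignment is needed.
dt[, intm := intm[1], by = id]
dt
```

Because the update happens by reference, no copy of the table is made, which is what tends to help on large data sets.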
Consider ave + head
dta$intm <- with(dta, ave(intm, id, FUN=function(x) head(x, 1)))

R - Lag not working within function [objective: match similar adjacent rows]

I recently tried to match adjacent identical rows in a dataframe based on two variables (Condition1 and Outcome1 below). I have seen people doing this with all rows but not with adjacent rows, which is why I developed the following three-step work-around (which I hope did not overthink things):
-I lagged the variables based on which I wanted the matching to be done.
-I compared the variables and lagged-variables
-I deleted all rows in which both were identical (and removed the remaining unnecessary columns).
Case <- c("Case 1", "Case 2", "Case 3", "Case 4", "Case 5")
Condition1 <- c(0, 1, 0, 0, 1)
Outcome1 <- c(0, 0, 0, 0, 1)
mwa.df <- data.frame(Case, Condition1, Outcome1)
new.df <- mwa.df
Condition_lag <- c(new.df$Condition1[-1],0)
Outcome_lag <- c(new.df$Outcome1[-1],0)
new.df <- cbind(new.df, Condition_lag, Outcome_lag)
new.df$Comp <- 0
new.df$Comp[new.df$Outcome1 == new.df$Outcome_lag & new.df$Condition1 == new.df$Condition_lag] <- 1
new.df <- subset(new.df, Comp == 0)
new.df <- subset(new.df, select = -c(Condition_lag, Outcome_lag, Comp))
This worked just fine. But when I tried to create a function for this because I had to do this operation with a large number of data frames, I encountered the problem that the lag did not work (i.e. the condition_lag <- c(new.df$condition[-1],0) and outcome_lag <- c(new.df$outcome[-1],0) operations were not carried out). The function code was:
FLC.Dframe <- function(old.df, condition, outcome){
new.df <- old.df
condition_lag <- c(new.df$condition[-1],0)
outcome_lag <- c(new.df$outcome[-1],0)
new.df <- cbind(new.df, condition_lag, outcome_lag)
new.df$comp <- 0
new.df$comp[new.df$outcome == new.df$outcome_lag & new.df$condition == new.df$condition_lag] <- 1
new.df <- subset(new.df, comp == 0)
new.df <- subset(new.df, select = -c(condition_lag, outcome_lag, comp))
return(new.df)
}
As for using the function, I wrote new.df <- FLC.Dframe(mwa.df, Condition1, Outcome1).
Could someone help me with this? Many thanks in advance.
Just generate run-length ids and remove the duplicates.
with(mwa.df, mwa.df[!duplicated(data.table::rleid(Condition1, Outcome1)), ])
Output
Case Condition1 Outcome1
1 Case 1 0 0
2 Case 2 1 0
3 Case 3 0 0
5 Case 5 1 1
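To see what the run-length ids are doing, here is a base R sketch that builds equivalent ids by hand. It is an illustration under the assumption that pasting the two columns together gives a usable comparison key; it is not how data.table implements rleid.

```r
Case <- c("Case 1", "Case 2", "Case 3", "Case 4", "Case 5")
Condition1 <- c(0, 1, 0, 0, 1)
Outcome1 <- c(0, 0, 0, 0, 1)
mwa.df <- data.frame(Case, Condition1, Outcome1)

# One key per row; adjacent rows with the same key belong to the same run.
key <- paste(mwa.df$Condition1, mwa.df$Outcome1)

# TRUE where a row differs from the row directly above it (first row is always TRUE).
changed <- c(TRUE, key[-1] != key[-length(key)])

# Cumulative count of changes gives one id per run of identical adjacent rows.
run_id <- cumsum(changed)
run_id
# [1] 1 2 3 3 4

# Keep only the first row of each run.
mwa.df[!duplicated(run_id), ]
```

Rows 3 and 4 share run id 3, so only row 3 survives, which matches the output above.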
If you want a function, then
FLC.Dframe <- function(df, cols) df[!duplicated(data.table::rleidv(df[, cols])), ]
Call this function like this
> FLC.Dframe(mwa.df, c("Condition1", "Outcome1"))
Case Condition1 Outcome1
1 Case 1 0 0
2 Case 2 1 0
3 Case 3 0 0
5 Case 5 1 1
The main problem with your function is the incorrect use of $. This operator takes its right-hand side literally instead of evaluating it. For example, in new.df$condition the $ operator looks in new.df for a column literally named "condition", not "Condition1", which is the value you meant to pass as condition. If you rewrite your function as follows, then it should work.
FLC.Dframe <- function(old.df, condition, outcome){
new.df <- old.df
condition_lag <- c(new.df[[condition]][-1],0)
outcome_lag <- c(new.df[[outcome]][-1],0)
new.df <- cbind(new.df, condition_lag, outcome_lag)
new.df$comp <- 0
new.df$comp[new.df[[outcome]] == new.df[["outcome_lag"]] & new.df[[condition]] == new.df[["condition_lag"]]] <- 1
new.df <- subset(new.df, comp == 0)
new.df <- subset(new.df, select = -c(condition_lag, outcome_lag, comp))
return(new.df)
}
You also need to call it like this (note that you need to use characters as inputs)
> FLC.Dframe(mwa.df, "Condition1", "Outcome1")
Case Condition1 Outcome1
1 Case 1 0 0
2 Case 2 1 0
4 Case 4 0 0
5 Case 5 1 1
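The difference between $ and [[ described above can be seen in isolation. In this small sketch the names d and col are made up for the illustration.

```r
d <- data.frame(Condition1 = c(0, 1, 0))
col <- "Condition1"

# $ looks for a column literally called "col"; there is none, so this is NULL.
by_dollar <- d$col

# [[ evaluates col first and then looks up the column "Condition1".
by_brackets <- d[[col]]

is.null(by_dollar)
# [1] TRUE
by_brackets
# [1] 0 1 0
```

This is also why the rewritten function has to be called with quoted column names.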

Counting the frequency of differing patterns in a character string

I currently have a string in R that looks like this:
a <- "BMMBMMMMBMMMBMMBBMMM"
First, I need to determine the frequency of different patterns of "M" that appear in the string.
In this example it would be:
MM = 2
MMM = 2
MMMM = 1
Secondly, I then need to designate a numerical value/score for each different pattern.
i.e:
MM = 1
MMM = 2
MMMM = 3
This would mean that the total value/score of M's in a would equal 9.
If anyone knows a script that would allow me to do this for multiple strings like this in a dataframe, that would be great.
Thank you.
a <- "BMMBMMMMBMMMBMMBBMMM"
tbl <- table(strsplit(a, "B"), exclude="")
tbl
# MM MMM MMMM
# 2 2 1
score <- sum(tbl * 1:3)
score
# 9
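For the multiple-strings case mentioned in the question, a possible sketch follows. It assumes the scoring rule is "run length minus one" (which matches MM = 1, MMM = 2, MMMM = 3) and that a single M therefore scores 0; the question does not say this. The names score_m_runs, strings and seq are hypothetical.

```r
# Score one string: each run of n consecutive "M"s contributes n - 1.
score_m_runs <- function(s) {
  runs <- regmatches(s, gregexpr("M+", s))[[1]]  # all maximal runs of "M"
  sum(nchar(runs) - 1)
}

# Hypothetical data frame with one string per row.
strings <- data.frame(
  seq = c("BMMBMMMMBMMMBMMBBMMM", "MMBM", "BBB"),
  stringsAsFactors = FALSE
)

# Apply the scoring function to every row.
strings$score <- vapply(strings$seq, score_m_runs, numeric(1), USE.NAMES = FALSE)
strings$score
# [1] 9 1 0
```

A string without any M yields no runs, so its score is 0.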
You could also use the table function.
a_list<-unlist(strsplit(a,"B"))
a_list<-a_list[!a_list==""] #remove cases when 2 B are together
a_list<-table(a_list)
# MM MMM MMMM
# 2 2 1
Here's a solution that uses the dplyr package. First, I load the library and define my string.
library(dplyr)
a <- "BMMBMMMMBMMMBMMBBMMM"
Next, I define a function that counts the runs of character x in string y, grouped by run length.
char_count <- function(x, y){
# Get runs of same character
tmp <- rle(strsplit(y, split = "")[[1]])
# Count runs of character stored in `x`
tmp <- data.frame(table(tmp$lengths[tmp$values == x]))
# Return strings and frequencies
tmp %>%
mutate(String = strrep(x, as.integer(as.character(Var1)))) %>% # Var1 is a factor: use its labels, not its integer codes
select(String, Freq)
}
Then, I run the function.
# Run the function
res <- char_count("M", a)
# String Freq
# 1 MM 2
# 2 MMM 2
# 3 MMMM 1
Finally, I define my value vector and calculate the total value of string a.
# My value vector
value_vec <- c(MM = 1, MMM = 2, MMMM = 3)
# Total `value` of vector `a`
sum(value_vec * res$Freq)
#[1] 9
If it's acceptable to skip the first step you could do:
nchar(gsub("(B+M)|(^M)","",a))
# [1] 9
First compute all different patterns that appear in your string:
a <- "BMMBMMMMBMMMBMMBBMMM"
chars = unlist(strsplit(a, ""))
pat = c()
for ( i in 1:length(chars)){
for (j in 1:(length(chars) - i+1)){ pat = c(pat, paste(chars[j:(j+i-1)], collapse = ""))}}
pat =sort(unique(pat))
pat[1:5]
# [1] "B" "BB" "BBM" "BBMM" "BBMMM"
Next, count the occurrences of each pattern:
counts = sapply(pat, function(w) length(gregexpr(w, a, fixed = TRUE)[[1]]))
Finally, build a data frame to summarize everything:
df = data.frame(counts = counts, num = 1:length(pat))
head(df, 10)
counts num
B 6 1
BB 1 2
BBM 1 3
BBMM 1 4
BBMMM 1 5
BM 5 6
BMM 5 7
BMMB 2 8
BMMBB 1 9
BMMBBM 1 10
library(stringr)
str_count(a, "MMMM")
gives 1
str_count(gsub("MMMM", "", a), "MMM") # now count how many times "MMM" occurs, but first delete the "MMMM"
gives 2
str_count(gsub("MMM", "", a), "MM") #now count how many times "MM" occurs, but first delete the "MMM"'s
gives 2

R: How to automatically create flag variables for sequences of values?

Suppose, you're given the following dataframe:
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"))
What I am looking for is to create the variables flag_1, ..., flag_5 in a containing the information of how many times the respective values occur. For a, I would expect the following result:
var flag_1 flag_2 flag_3 flag_4 flag_5
",1,2,3," 1. 1. 1. 0. 0.
",2,3,5," 0. 1. 1. 0. 1.
",1,3,5,5," 1. 0. 1. 0. 2.
I managed to get the result using a nested for-loop and an if-condition but there must be a nicer (more aesthetic and better performing) solution.
One option would be to do strsplit, get the table and then cbind with original data
cbind(a, do.call(rbind, lapply(strsplit(as.character(a$var), ","),
function(x) table(factor(x[nzchar(x)], levels = 1:5, labels = paste0("flag_", 1:5))))))
# var flag_1 flag_2 flag_3 flag_4 flag_5
#1 ,1,2,3, 1 1 1 0 0
#2 ,2,3,5, 0 1 1 0 1
#3 ,1,3,5,5, 1 0 1 0 2
Another option is with tidyverse
library(tidyverse)
str_extract_all(a$var, "[0-9]") %>%
map(~ as.integer(.x) %>%
as_tibble) %>%
bind_rows(.id = 'grp') %>%
count(grp, value = factor(value, levels = min(value):max(value))) %>%
spread(value, n, drop = FALSE, fill = 0) %>%
select(-grp) %>%
bind_cols(a, .) %>%
rename_at(vars(matches("^[0-9]+$")), ~ paste0("flag_", .))
# var flag_1 flag_2 flag_3 flag_4 flag_5
#1 ,1,2,3, 1 1 1 0 0
#2 ,2,3,5, 0 1 1 0 1
#3 ,1,3,5,5, 1 0 1 0 2
First, don't make the strings into factors. Nothing good comes from that.
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"),
stringsAsFactors = FALSE)
To get from strings to your table is simple enough if we take it in small steps. Here, I've written (or renamed) a function per step and then gone through the steps using lapply one at a time. You can string it all together in a pipeline if you like, but it would be roughly these steps.
First, I extract the numbers from the strings. That involves splitting on commas and getting rid of empty strings. You have empty strings because your strings begin and end with a comma; otherwise that step wouldn't be necessary. Then we need to translate the strings into numbers and count how often we see each (we can do that with the as.numeric and table functions, respectively). After that it is just a question of mapping the observed counts into a vector that also includes the values we haven't observed.
pick_indices <- function(str) unlist(strsplit(str, split = ","))
remove_empty <- function(chrs) chrs[nchar(chrs) > 0]
get_indices <- as.numeric
to_counts <- table
to_flag_vect <- function(counts, len) {
vec <- rep(0, len)
names(vec) <- 1:len
vec[names(counts)] <- counts
vec
}
strings <- lapply(a$var, pick_indices)
cleaned <- lapply(strings, remove_empty)
indices <- lapply(cleaned, get_indices)
counts <- lapply(indices, to_counts)
flags <- lapply(counts, to_flag_vect, len = 5)
We now have the flag-counts in a list, so to make it into the table you want, with the column names you want, we simply do this:
tbl <- do.call(rbind, flags)
colnames(tbl) <- paste0("flag_", 1:5)
tbl
Done.
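If you prefer fewer intermediate objects, the steps above can be folded into one helper. This is only a sketch: it swaps in tabulate() for the table/to_flag_vect pair, and count_flags is a made-up name.

```r
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"),
                stringsAsFactors = FALSE)

# Split on commas, drop the empty pieces, and count how often each of 1..len occurs.
count_flags <- function(str, len) {
  parts <- unlist(strsplit(str, split = ","))
  nums <- as.integer(parts[nzchar(parts)])
  tabulate(nums, nbins = len)  # integer vector of length len, zeros included
}

# One column per input string from vapply, so transpose to get one row per string.
tbl <- t(vapply(a$var, count_flags, integer(5), len = 5, USE.NAMES = FALSE))
colnames(tbl) <- paste0("flag_", 1:5)
tbl
```

tabulate() ignores values above nbins, so anything larger than 5 would silently be dropped here.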
Split and unlist the values into a factor with appropriate levels
x = strsplit(a$var, ",")
xp = factor(unlist(x), levels = seq_len(5))
Create an index that maps the values of xp to the rows they came from
i = rep(seq_along(x), lengths(x))
Use xtabs() to cross-tabulate the entries by row
xt = xtabs(~ i + xp)
and cbind() the matrix representation of the result to the original
> cbind(a, unclass(xt))
var 1 2 3 4 5
1 ,1,2,3, 1 1 1 0 0
2 ,2,3,5, 0 1 1 0 1
3 ,1,3,5,5, 1 0 1 0 2

R Matrix process with conditional additions

I have to pre-process a big matrix. To make my example easier to understand I will use the following matrix:
Raw data
Where col = people and row = skills
In R my matrix is:
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
Aim
In my case I need to process row by row, so there are 3 steps. For each row I have to:
Put 0 if i=j (so all diagonals equal zero)
Put 0 if either value of the pair is 0
Otherwise add the two values of the pair
I will show the 3 steps to be more clear.
Step 1 (row1)
The data are the row 1
The result is:
Step 2 (row2)
The data are the row 2
The result is:
Step 3 (row3)
The data are the row 3
The result is:
Create a maximum matrix
Then the maximum matching are :
So my final matrix should be:
Question
Can someone tell me how to achieve this in R?
And of course the same process should work if my matrix has more row and columns...
Thanks a lot :)
Here is my implementation in R. The code doesn't execute the steps exactly in the way you specified them. I focused on your final matrix and assumed that this is the main result you're interested in.
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
rownames(test) <- paste("Skill", 1:dim(test)[1], sep="")
colnames(test) <- paste("People", 1:dim(test)[2], sep="")
test
# Pairwise combinations
comb.mat <- combn(1:dim(test)[2], 2)
pairwise.mat <- data.frame(matrix(t(comb.mat), ncol=2))
pairwise.mat$max.score <- 0
names(pairwise.mat) <- c("Person1", "Person2", "Max.Score")
for ( i in 1:dim(comb.mat)[2] ) { # Loop over the pairs (columns of comb.mat)
first.person <- comb.mat[1,i]
second.person <- comb.mat[2,i]
temp.mat <- test[, c(first.person, second.person)]
temp.mat[temp.mat == 0] <- NA
temp.rowSums <- rowSums(temp.mat, na.rm=FALSE)
temp.rowSums[is.na(temp.rowSums)] <- 0
max.sum <- max(temp.rowSums)
previous.val <- pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person]
pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person] <- max.sum*(max.sum > previous.val)
}
pairwise.mat
Person1 Person2 Max.Score
1 1 2 25
2 1 3 32
3 2 3 0
person.mat <- matrix(NA, nrow=dim(test)[2], ncol=dim(test)[2])
rownames(person.mat) <- colnames(person.mat) <- paste("People", 1:dim(test)[2], sep="")
diag(person.mat) <- 0
person.mat[cbind(pairwise.mat[,1], pairwise.mat[,2])] <- pairwise.mat$Max.Score
person.mat[lower.tri(person.mat, diag=F)] <- t(person.mat)[lower.tri(person.mat, diag=F)]
person.mat
People1 People2 People3
People1 0 25 32
People2 25 0 0
People3 32 0 0
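For comparison, here is a sketch that follows the row-by-row steps from the question more literally: build the pairwise sums for one skill with outer(), zero out the diagonal and every pair that involves a 0, and keep the element-wise maximum across skills. It assumes all scores are non-negative, and the names result, x and pair_sum are made up for this illustration.

```r
test <- matrix(c(18,12,15,0,13,0,14,0,12), ncol=3, nrow=3)

n <- ncol(test)
result <- matrix(0, n, n)  # running element-wise maximum over all skills

for (r in seq_len(nrow(test))) {
  x <- test[r, ]                               # scores of every person for skill r
  pair_sum <- outer(x, x, "+")                 # sum for every pair of people
  pair_sum[outer(x == 0, x == 0, "|")] <- 0    # a pair scores 0 if either person has 0
  diag(pair_sum) <- 0                          # a person is not paired with themselves
  result <- pmax(result, pair_sum)             # keep the best skill for each pair
}
result
```

For the example matrix this gives 25 for persons 1 and 2, 32 for persons 1 and 3, and 0 for persons 2 and 3, the same values as person.mat above.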

Resources