Create biosequence from a list of sequences with sample()

I have a list of biosequences that are very similar. First, I want to create an N x N distance matrix; then I want to generate another sequence with sample() that is also quite similar to the others but not identical. I tried to create the sequence with the sample function, but it returns a copy of my initial list. Lastly, I want to update the matrix so that it contains the distances of the generated sequence. Is there a way to simply bind the new distances, or do I need to create a new matrix?
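library(seqinr) # read.fasta
library(Biostrings) # pairwiseAlignment (provided by the pwalign package in recent Bioconductor releases)
sequences <- read.fasta('/media/losve/Νέος τόμος/Scripts/bnp54/ergasia 2/histone4.fa')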
seqnames <- c("human", "mouse", "fly", "plant", "cow", "worm", "chick", "rat","yeast", "frog")
myseqs <- list()
for(i in 1:length(sequences))
{
myseqs[i] = toupper(paste(sequences[[i]], collapse=''))
}
names(myseqs) <- seqnames
dist_matrix <- matrix(NA_real_, nrow = length(sequences), ncol = length(sequences))
for(i in 1:length(sequences))
{
for(j in 1:length(sequences))
{
# index the matrix with [i, j]; scoreOnly = TRUE returns the numeric alignment score
dist_matrix[i, j] <- pairwiseAlignment(myseqs[[i]], myseqs[[j]], substitutionMatrix = "BLOSUM50", scoreOnly = TRUE)
}
}
new_sequence <- sample(myseqs, replace = TRUE )

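sample() only draws elements from a vector (without replacement by default); it is not intended to modify the elements in any way. If size is smaller than the number of elements, it returns a random subset; if size equals the number of elements, it returns the same elements in a different order. With replace = TRUE, as in your call, whole sequences are resampled, so some may appear twice and others not at all, but each one is still an unmodified copy of an input sequence.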
Introducing sequencing errors
You should consider shell tools to introduce random mutation errors in genomic sequences, e.g. Grinder, Mason or SiMSeq.
Getting a permutation for each sequence
library(tidyverse)
set.seed(1337)
list("AATTGG", "GGCCGG") %>%
map_chr(~ {
.x %>%
str_split(pattern = "") %>%
simplify() %>%
sample() %>%
paste0(collapse = "")
})
#> [1] "ATGATG" "CCGGGG"
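Generating a similar sequence and extending the matrix
If a permutation scrambles the sequence too much, another option is to copy one existing sequence and substitute a few random positions. The following is only a minimal base R sketch under stated assumptions: mutate_sequence() and hamming() are hypothetical helpers, the toy sequences use a DNA alphabet (pass an amino-acid alphabet for protein data), and the Hamming distance stands in for whatever score you actually compute with pairwiseAlignment(). It also shows that the existing matrix can be extended with cbind() and rbind() instead of being rebuilt.

```r
set.seed(42)

# Hypothetical helper: replace each position with a different letter
# from `alphabet` with probability `rate`.
mutate_sequence <- function(seq, rate = 0.05, alphabet = c("A", "C", "G", "T")) {
  chars <- strsplit(seq, "")[[1]]
  hit <- which(runif(length(chars)) < rate)
  for (p in hit) {
    chars[p] <- sample(setdiff(alphabet, chars[p]), 1)
  }
  paste(chars, collapse = "")
}

# Hypothetical stand-in distance: number of differing positions
# (assumes sequences of equal length).
hamming <- function(x, y) {
  sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])
}

seqs <- c(a = "AATTGGCC", b = "AATTGGCA", c = "AATTGACC")

# Existing N x N distance matrix
d <- sapply(seqs, function(x) sapply(seqs, function(y) hamming(x, y)))

# New sequence derived from an existing one
new_seq <- mutate_sequence(seqs[["a"]], rate = 0.25)

# Distances from the new sequence to every existing one
new_d <- sapply(seqs, hamming, y = new_seq)

# Bind a new column and a new row onto the old matrix
d2 <- rbind(cbind(d, new = new_d), new = c(new_d, 0))
```

Only the N new distances have to be computed; the old N x N block is reused unchanged.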

Related

store values of a function into a vector

I want to create a function which first cuts a tree (from a hclust) into 2:13 groups (12 different cutree values), then calculates the adjusted rand index (randIndex) between these 12 cutree values and a stored vector I already have and finally store these adjusted rand index values into a vector so I can compare the answers. All I've got is
for(i in 2:13){
a <- cutree(hclust1, k=i)
randIndex(stored_vector, a)
}
where hclust1 is just the hierarchical clustering output and stored_vector is just the stored vector value I mentioned. I am completely new to programming and would appreciate some help. Thank you.
Does this work for you?
library(tidyverse)
library(fossil) # rand.index function
# get a dataset for cutree, change this to your dataset
hc <- hclust(dist(USArrests))
# change k to your desired vector
k <- 2:12
vec <- cutree(hc, k = k)
# create an empty dataframe
df <- tibble(i=as.numeric(),j=as.numeric(),result=as.numeric())
# create nested for loops to get result
for (i in k) {
for (j in k) {
result <- rand.index(vec[,i-1],vec[,j-1])
df <- df %>%
add_row(i=i,j=j,result=result)
}
}
# view result
df %>%
filter(result != 1) %>%
distinct(result, .keep_all = TRUE) %>%
view()
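If the goal is only to store one adjusted Rand index per cutree value in a vector, vapply() can collect the results directly. This is a sketch under assumptions: adj_rand() is a hypothetical base R helper that implements the usual adjusted Rand index formula (in practice you would call your randIndex function in its place), hclust1 is built from USArrests, and stored_vector is a stand-in created with cutree() at k = 4.

```r
# Hypothetical base R helper: adjusted Rand index of two partitions
adj_rand <- function(x, y) {
  tab <- table(x, y)
  n <- sum(tab)
  sum_ij <- sum(choose(tab, 2))
  sum_a <- sum(choose(rowSums(tab), 2))
  sum_b <- sum(choose(colSums(tab), 2))
  expected <- sum_a * sum_b / choose(n, 2)
  (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)
}

# Stand-in data: replace with your own clustering and stored vector
hclust1 <- hclust(dist(USArrests))
stored_vector <- cutree(hclust1, k = 4)

ks <- 2:13
# One index per value of k, returned as a numeric vector
ari <- vapply(ks, function(k) {
  adj_rand(stored_vector, cutree(hclust1, k = k))
}, numeric(1))
names(ari) <- ks
```

Because stored_vector here is itself the k = 4 cut, the entry named "4" is 1; with a real stored vector the values simply show which k agrees best.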

How to write function to return plot but also an object (data.frame)?

Description and goal: In RStudio, I would like to define a function that drops the columns of a given data.frame that contain too high a share of missing values, defined by a cutoff value in percent. The function should report information about the subsetted data.frame (number of remaining columns and remaining share of missing cases) and return the subsetted data.frame itself for further analyses. Additionally, there should be an option to visualize the remaining missing cases using the function vis_miss() from the visdat package.
Packages used:
library(tidyverse)
library(visdat) # provides vis_miss()
Data:
my.data <- tibble(col_1 = c(1:5),
col_2 = c(1,2,NA,NA,NA))
My function:
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
if (vis_miss==TRUE) {return(vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F))}
df
}
Test:
cut_cols(my.data, 0.5, vis_miss = F) # without visualization
cut_cols(my.data, 0.5, vis_miss = T) # with visualization
Problem:
As you might have already seen in the example above, only the first line, where vis_miss = F, actually returns the data.frame, but not the second line, where vis_miss = T. I assume that this is because of the extra if () {} clause, which returns a plot and then ends the function without returning df. Is there a way to prevent this from happening so that the second line also returns the new data.frame?
You were correct in your suspicion that the if(){} clause was stopping the df from printing. return() exits the function immediately, so nothing after it is evaluated. For that reason it is usually best to put it at the end of a function, unless you deliberately want to leave early.
Further, use print(df) to make sure your function outputs your data frame. Here are a few changes to your code
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
print(df)
if (vis_miss==TRUE) {return(vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F))}
}
cut_cols(my.data, 0.5, vis_miss = T)
Here's another option if it interests you. You can assign both the df and the plot to a list then call the list.
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
# empty list
list_ <- list()
# assign df to first index of list
list_[[1]] <- df
if (vis_miss==TRUE){
plot <- vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F)
# assign plot to second index in list
list_[[2]] <- plot
}
return(list_)
}
output <- cut_cols(my.data, 0.5, vis_miss = T)
Calling output will print both the df and plot. output[[1]] will print just the df. output[[2]] will print just the plot.
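A third pattern is to treat the plot as a side effect and always return the data frame, so the result can be assigned whether or not a plot was requested. The sketch below is an assumption-laden variant, not the original function: cut_cols_sketch() and its show_plot argument are hypothetical names, and a base R barplot() of the missing share per column stands in for vis_miss(). With a ggplot-based plot such as the one vis_miss() returns, you would call print() on the plot at the same place.

```r
# Hypothetical variant: draw the plot (if requested) first, then return df
cut_cols_sketch <- function(df, na.perc.cutoff, show_plot = FALSE) {
  # share of missing values per column
  na_share <- vapply(df, function(x) mean(is.na(x)), numeric(1))
  df <- df[, na_share < na.perc.cutoff, drop = FALSE]
  cat("Remaining cols:", ncol(df), "\n")
  if (show_plot) {
    # side effect only; the function's value is unaffected
    barplot(colMeans(is.na(df)), ylim = c(0, 1), ylab = "Share missing")
  }
  df
}

my.data <- data.frame(col_1 = 1:5, col_2 = c(1, 2, NA, NA, NA))
kept <- cut_cols_sketch(my.data, 0.5)
```

Calling it with show_plot = TRUE draws the chart and still returns the same data frame.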

for loop runs only for last row in R

I have a simple issue with a for loop in R - I am trying to make it run for the entire dataset and it only runs for the last row. This is done with quite complex datasets which are both shapefiles and I am testing the intersection of the geometries. That is why I can't quite make a reproducible example here.
Nevertheless, this is my code:
for(i in 1:nrow(data1)){
#get intersections between data2 and data1 for specific years
output = st_join(
x = data1[i, ],
y = data2[which(data2$year %in% data1$lag.year[i]:data1$year[i]), ],
join = st_intersects
)
#Get area of intersections
output = transform(output,
inter_area = mapply(function(x, y) {
as.numeric(sf::st_area(
sf::st_intersection(x, y)
))}, x = geometry, y = geom_2))
## obtaining the proportion of area in data1 intersected by data2
output = transform(output, prop_inter = inter_area/area)
#get cycle-specific values
output <- output%>%
group_by(code, year.x)%>%
dplyr::summarise(prop_inter = sum(prop_inter),
end_date= max(end_date),
start_date= max(start_date))%>%
ungroup()
return(output)
}
As you can see, I am testing the intersections of data2 on data1 to see which percentage of data1 is intersected, depending on the values of year and lag.year. The issue is that when I run this it only returns the desired outcome for the last row, instead of for the entire data1 object. I've tested all the different bits of code inside the loop separately and they all do what I want, but once I try to scale it up to the entire dataframe it only does it for the last row.
So I assume this must be some simple stupid mistake I am making for the loop.
Thanks!
You keep overwriting the output object on every iteration; instead, create a container of length nrow(data1) before the loop and assign each result to its i-th element. I don't think this relates to {sf} or GIS in general; it is more about how for loops and vectors work in R. Consider this example:
for (i in 1:50) {
output <- i # rewriting output object 50 times
}
print(output) # this will be a single element for last row (50)
output <- numeric(50)
for (i in 1:50) {
output[i] <- i # storing result in a new element of output for each i
}
print(output) # this will be 1:50 as expected
You may want to consider something along these lines (hard to make certain without access to your data, but it should get you started).
result <- vector("list", nrow(data1)) # init a list: each iteration produces a data frame, not a single number
for(i in 1:nrow(data1)){
#get intersections between data2 and data1 for specific years
output = st_join(
x = data1[i, ],
y = data2[which(data2$year %in% data1$lag.year[i]:data1$year[i]), ],
join = st_intersects
)
#Get area of intersections
output = transform(output,
inter_area = mapply(function(x, y) {
as.numeric(sf::st_area(
sf::st_intersection(x, y)
))}, x = geometry, y = geom_2))
## obtaining the proportion of area in data1 intersected by data2
output = transform(output, prop_inter = inter_area/area)
#get cycle-specific values
result[[i]] <- output%>% # store in i-th element of result instead
group_by(code, year.x)%>%
dplyr::summarise(prop_inter = sum(prop_inter),
end_date= max(end_date),
start_date= max(start_date))%>%
ungroup()
# return(output) # no need for return unless you are in a function
}
result <- dplyr::bind_rows(result) # combine the per-row summaries into one table
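When each iteration yields a small data frame rather than a single number, the same idea works with a list that is combined after the loop. A minimal base R sketch with made-up data (the id and value columns are purely illustrative):

```r
# One list slot per iteration
results <- vector("list", 3)

for (i in 1:3) {
  # each iteration produces a data frame with two rows
  results[[i]] <- data.frame(id = i, value = c(i, i^2))
}

# Stack all per-iteration data frames into one
combined <- do.call(rbind, results)
```

Here combined has six rows, two for each value of i.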

Loop for doing multiple correlations on dataset in R

I have a dataset with x number of columns, consisting of groups of test results, for example test1_1, test1_2 etc. Each set of tests has a different number of test results associated with it so the actual numbers aren't the same across each test. The final column is my target variable. I'm looking to establish which tests are correlated with the target variable, but I also want to create datasets for each set of tests. I'm also going to be plotting correlation plots of each test against the target variable. I suspect I could probably achieve all of this in a few lines of code within a for/while loop, however, I'm not sure where to begin.
Using lapply this could be achieved like so:
library(dplyr)
library(corrplot)
set.seed(42)
dataset <- data.frame(
test1_1 = runif(20),
test1_2 = runif(20),
test2_1 = runif(20),
test2_2 = runif(20),
Target = runif(20)
)
test_cols <- gsub("_\\d+$", "", names(dataset))
test_cols <- test_cols[grepl("^test", test_cols)]
test_cols <- unique(test_cols)
test_cols <- setNames(test_cols, test_cols)
test_fun <- function(x, test) {
x <- x %>%
select((starts_with(test)) | matches("Target"))
cor(x)
}
cor_test <- lapply(test_cols, test_fun, x = dataset)
cplot <- lapply(cor_test, corrplot)
This is similar to @stefan's answer, but uses split.default to split the columns by a pattern in the column names.
tmp <- dplyr::select(dataset, -Target)
list_plot <- lapply(split.default(tmp, sub('_.*', '', names(tmp))), function(x) {
corrplot::corrplot(cor(cbind(x, Target = dataset$Target)))
})
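If only the correlation of each test column with the target is needed, rather than full correlation matrices, a base R sketch along these lines may be enough. It assumes the same simulated dataset as above and the column-name pattern test<k>_<n>; target_cor and by_test are illustrative names.

```r
set.seed(42)
dataset <- data.frame(
  test1_1 = runif(20),
  test1_2 = runif(20),
  test2_1 = runif(20),
  test2_2 = runif(20),
  Target = runif(20)
)

# Correlation of every test column with the target
test_cols <- setdiff(names(dataset), "Target")
target_cor <- sapply(dataset[test_cols], cor, y = dataset$Target)

# Group the correlations by test set (test1, test2, ...)
by_test <- split(target_cor, sub("_\\d+$", "", names(target_cor)))
```

by_test is a named list with one numeric vector per test set, which can be inspected or plotted separately.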

How to Efficiently work with Sparse / "Long format" data matrix in R

EDIT: I found out that the Matrix package does everything I need. Super fast and flexible. Specifically, the related functions are
Data <- sparseMatrix(i=Data[,1], j=Data[,2], x=Data[,3])
or simply
Data <- Matrix(data=Data,sparse=T)
Once you have your matrix in this Matrix class, everything should work smoothly like a regular matrix (for the most part, anyway).
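For the row-by-row cosine similarity itself, one possible approach with the Matrix package is sketched below on the small example data from the question. The names long, m, dots, norms and cos_sim are illustrative, and the final dense conversion is only sensible for a small number of rows; for the full data you would likely keep the result sparse.

```r
library(Matrix)

# Long format: row index, column index, value (same toy data as in the question)
long <- data.frame(
  row = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  col = c(1, 2, 3, 1, 2, 4, 1, 4, 5),
  value = c(1, 2, 2, 1, 1, 1, 1, 3, 1)
)

m <- sparseMatrix(i = long$row, j = long$col, x = long$value)

# All pairwise dot products between rows: m %*% t(m)
dots <- tcrossprod(m)

# Row lengths are the square roots of the diagonal
norms <- sqrt(diag(dots))

# Cosine similarity = dot product divided by the product of the lengths
cos_sim <- as.matrix(dots) / outer(norms, norms)
```

This replaces the double loop with a single sparse matrix product.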
======================================================
I have a dataset in "Long format" right now, meaning that it has 3 columns: row name, column name, and value. All of the "missing" row-column pairs are equal to zero.
I need to come up with an efficient way to calculate the cosine similarity (or even just the regular dot product) between all possible pairs of rows. The full data matrix is 19000 x 62000, which is why I need to work with the Long format instead.
I came up with the following method, but it's WAY too slow. Any tips on maximizing efficiency, or any suggestions of a better method overall, would be GREATLY appreciated. Thanks!
Data <- matrix(c(1,1,1,2,2,2,3,3,3,1,2,3,1,2,4,1,4,5,1,2,2,1,1,1,1,3,1),
ncol = 3, byrow = FALSE)
Data <- data.frame(Data)
cosine.sparse <- function(data) {
a <- Sys.time()
colnames(data) <- c('V1', 'V2', 'V3')
nvars <- length(unique(data[,2]))
nrows <- length(unique(data[,1]))
sim <- matrix(nrow=nrows, ncol=nrows)
for (i in 1:nrows) {
data.i <- data[data$V1==i,]
length.i.sq <- sum(data.i$V3^2)
for (j in i:nrows) {
data.j <- data[data$V1==j,]
length.j.sq <- sum(data.j$V3^2)
common.vars <- intersect(data.i$V2, data.j$V2)
row1 <- data.i[data.i$V2 %in% common.vars,3]
row2 <- data.j[data.j$V2 %in% common.vars,3]
cos.sim <- sum(row1*row2)/sqrt(length.i.sq*length.j.sq)
sim[i,j] <- sim[j,i] <- cos.sim
}
if (i %% 500 == 0) {cat(i, " rows have been calculated.")}
}
b <- Sys.time()
time.elapsed <- b - a
print(time.elapsed)
return(sim)
}
cosine.sparse(Data)
