store values of a function into a vector - r

I want to create a function which first cuts a tree (from an hclust) into 2:13 groups (12 different cutree results), then calculates the adjusted Rand index (randIndex) between each of these 12 cutree results and a stored vector I already have, and finally stores these adjusted Rand index values into a vector so I can compare them. All I've got is
for(i in 2:13){
  a <- cutree(hclust1, k = i)
  randIndex(stored_vector, a)
}
where hclust1 is just the hierarchical clustering output and stored_vector is the stored vector I mentioned. I am completely new to programming and would appreciate some help. Thank you.

Does this work for you?
library(tidyverse)
library(fossil) # rand.index(); fossil also has adj.rand.index() for the adjusted version

# get a dataset for cutree, change this to your dataset
hc <- hclust(dist(USArrests))

# change k to your desired vector
k <- 2:12
vec <- cutree(hc, k = k)

# create an empty dataframe
df <- tibble(i = as.numeric(), j = as.numeric(), result = as.numeric())

# create nested for loops to get the result
for (i in k) {
  for (j in k) {
    result <- rand.index(vec[, i - 1], vec[, j - 1])
    df <- df %>%
      add_row(i = i, j = j, result = result)
  }
}

# view result
df %>%
  filter(result != 1) %>%
  distinct(result, .keep_all = TRUE) %>%
  view()
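
If the goal is specifically to compare each cutree solution against the stored vector, as in the original question, a minimal sketch is shown below. It assumes randIndex() comes from the flexclust package and that hclust1 and stored_vector already exist in your session.
library(flexclust) # assumed source of randIndex(); adjust if yours comes from elsewhere

ks  <- 2:13
ari <- numeric(length(ks))                # preallocate one slot per k
for (i in seq_along(ks)) {
  grp    <- cutree(hclust1, k = ks[i])    # cut the tree into ks[i] groups
  ari[i] <- randIndex(stored_vector, grp) # adjusted Rand index vs the stored vector
}
names(ari) <- ks                          # label each result with its k
ari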

Related

R - Create biosequence from a list of sequences with sample()

I have a list of biosequences that are very similar. First I want to create an NxN distance matrix, and then I want to generate another sequence with sample() that will also be quite similar to the others but not identical. I tried to create the sequence by using the sample function, but it returns a copy of my initial list. Lastly, I want to update the matrix to contain the distances of the generated sequence. Is there a way to simply bind the new distances, or do I need to create a new matrix?
sequences <- read.fasta('/media/losve/Νέος τόμος/Scripts/bnp54/ergasia 2/histone4.fa')
seqnames <- c("human", "mouse", "fly", "plant", "cow", "worm", "chick", "rat", "yeast", "frog")
myseqs <- list()
for(i in 1:length(sequences))
{
  myseqs[i] = toupper(paste(sequences[[i]], collapse=''))
}
names(myseqs) <- seqnames
dist_matrix <- matrix(, nrow = length(sequences), ncol = length(sequences))
for(i in 1:length(sequences))
{
  for(j in 1:length(sequences))
  {
    dist_matrix[i][j] <- pairwiseAlignment(myseqs[i], myseqs[j], substitutionMatrix = "BLOSUM50")
  }
}
new_sequence <- sample(myseqs, replace = TRUE)
sample() just draws elements from a vector, without replacement by default. It is not intended to modify the elements in any way. It takes a random subset of your data if n is less than the total number of observations; for n equal to the total number of observations, it returns exactly the input elements, just in a different order.
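A quick illustration of that behaviour (a small addition, not part of the original answer):
set.seed(42)
x <- c("A", "C", "G", "T", "T")
sample(x)    # same elements, different order (a permutation)
sample(x, 3) # a random subset of 3 elements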
Introducing sequencing errors
You should consider shell tools to introduce random mutation errors in genomic sequences, e.g. Grinder, Mason or SiMSeq.
Getting a permutation for each sequence
library(tidyverse)
set.seed(1337)
list("AATTGG", "GGCCGG") %>%
  map_chr(~ {
    .x %>%
      str_split(pattern = "") %>%
      simplify() %>%
      sample() %>%
      paste0(collapse = "")
  })
#> [1] "ATGATG" "CCGGGG"
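
If you want a sequence that is similar but not identical, rather than a full permutation, one rough sketch (not from the original answer) is to substitute a few random positions. The DNA alphabet below is an assumption; swap in the amino-acid alphabet for protein sequences.
set.seed(1337)
mutate_seq <- function(seq, n_mut = 3, alphabet = c("A", "C", "G", "T")) {
  chars <- strsplit(seq, "")[[1]]                        # split into single characters
  pos <- sample(seq_along(chars), n_mut)                 # positions to change
  chars[pos] <- sample(alphabet, n_mut, replace = TRUE)  # may occasionally re-draw the same letter
  paste0(chars, collapse = "")
}
mutate_seq("AATTGGCCAATT")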

for loop only runs for the last row in R

I have a simple issue with a for loop in R: I am trying to make it run for the entire dataset, but it only runs for the last row. This is done with quite complex datasets, which are both shapefiles, and I am testing the intersection of the geometries. That is why I can't quite make a reproducible example here.
Nevertheless, this is my code:
for(i in 1:nrow(data1)){
  # get intersections between data2 and data1 for specific years
  output = st_join(
    x = data1[i, ],
    y = data2[which(data2$year %in% data1$lag.year[i]:data1$year[i]), ],
    join = st_intersects
  )
  # get area of intersections
  output = transform(output,
                     inter_area = mapply(function(x, y) {
                       as.numeric(sf::st_area(
                         sf::st_intersection(x, y)
                       ))}, x = geometry, y = geom_2))
  ## obtaining the proportion of area in data1 intersected by data2
  output = transform(output, prop_inter = inter_area/area)
  # get cycle-specific values
  output <- output %>%
    group_by(code, year.x) %>%
    dplyr::summarise(prop_inter = sum(prop_inter),
                     end_date = max(end_date),
                     start_date = max(start_date)) %>%
    ungroup()
  return(output)
}
As you can see, I am testing the intersections of data2 on data1 to see what percentage of data1 is intersected, depending on the values of year and lag.year. The issue is that when I run this it only returns the desired outcome for the last row, instead of for the entire data1 object. I've tested all the different bits of code inside the loop separately and they all do as I want, but once I try to scale it up to the entire dataframe it only does it for the last row.
So I assume this must be some simple stupid mistake I am making for the loop.
Thanks!
You keep overwriting the output object; you may want to create a list (or vector) of length nrow(data1) and assign each iteration's result to its i-th element. I don't think this relates to {sf} or GIS in general, it is more about how for loops and vectors work in R - consider this example:
for (i in 1:50) {
  output <- i # rewriting output object 50 times
}
print(output) # this will be a single element for the last iteration (50)

output <- numeric(50)
for (i in 1:50) {
  output[i] <- i # storing the result in a new element of output for each i
}
print(output) # this will be 1:50 as expected
You may want to consider something along these lines (hard to be certain without access to your data, but it should get you started). Because each iteration produces a summarised data frame rather than a single number, the results are collected in a list rather than a numeric vector.
result <- vector("list", nrow(data1)) # init a list, one element per row of data1
for(i in 1:nrow(data1)){
  # get intersections between data2 and data1 for specific years
  output = st_join(
    x = data1[i, ],
    y = data2[which(data2$year %in% data1$lag.year[i]:data1$year[i]), ],
    join = st_intersects
  )
  # get area of intersections
  output = transform(output,
                     inter_area = mapply(function(x, y) {
                       as.numeric(sf::st_area(
                         sf::st_intersection(x, y)
                       ))}, x = geometry, y = geom_2))
  ## obtaining the proportion of area in data1 intersected by data2
  output = transform(output, prop_inter = inter_area/area)
  # get cycle-specific values
  result[[i]] <- output %>% # store in the i-th element of result instead
    group_by(code, year.x) %>%
    dplyr::summarise(prop_inter = sum(prop_inter),
                     end_date = max(end_date),
                     start_date = max(start_date)) %>%
    ungroup()
  # return(output) # no need for return unless you are in a function
}
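After the loop you can stack the per-row results into one data frame, for example (a small addition to the sketch above):
final <- do.call(rbind, result) # or dplyr::bind_rows(result)
final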

Any efficient way to filter out multi-dim dataframe by measuring its correlation coefficient in R?

I intend to compute the Pearson correlation coefficient between multi-dimensional data and one numeric vector in R. Basically, I am expecting to get a correlation matrix using the Pearson method, and I want to keep the rows (i.e., the features of each column) of the multi-dimensional data that pass a certain correlation-coefficient threshold. However, I tentatively tried an R implementation to do that but didn't get a correct correlation matrix. How can I get this? Can anyone point out how to make this happen easily in R?
reproducible example
persons_df <- data.frame(person1 = sample(1:20, 10, replace = FALSE),
                         person2 = as.factor(sample(10)),
                         person3 = sample(1:25, 10, replace = FALSE),
                         person4 = sample(1:30, 10, replace = FALSE),
                         person5 = as.factor(sample(10)),
                         person6 = as.factor(sample(10)))
row.names(persons_df) <- letters[1:10]
In persons_df, the features are given row-wise and the persons column-wise.
I also have age_df, which contains the age of each person.
age_df <- data.frame(personID = colnames(persons_df),
                     age = sample(1:50, 6, replace = FALSE))
My initial attempt:
pearson_corr <- function(df1, df2, verbose=FALSE){
stopifnot(ncol(df1)==nrow(df2))
res <- as.data.frame()
lapply(colnames(df1), function(x){
lapply(x, rownames(y){
if(colnames(x) %in% rownames(df2)){
cor_mat <- stats::cor(y, df2$age, method = "pearson")
ncor <- ncol(cor_mat)
cmatt <- col(cor_mat)
ord <- order(-cmat, cor_mat, decreasing = TRUE)- (ncor*cmatt - ncor)
colnames(ord) <- colnames(cor_mat)
res <- cbind(ID=c(cold(ord), ID2=c(ord)))
res <- as.data.frame(cbind(out, cor=cor_mat[res]))
res <- cbind(res, cor=cor_mat[out])
}
})
})
return(final_df)
}
But the above code didn't return a correct correlation matrix. What I want to know is how each feature of a given person is correlated with that person's age. Is there an efficient way to make this happen?
Goal:
Basically, I want to keep the features which show a high correlation with age. I don't have a better idea of how to do this in R. Can anyone point out how to get this done easily and efficiently in R? Thanks.
mylist = do.call(rbind,
                 apply(persons_df, 1, function(x){
                   temp = cor.test(age_df$age, as.numeric(x))
                   data.frame(t = temp$statistic, p = temp$p.value)
                 }))
mylist
# t p
#a -1.060264 3.488012e-01
#b -2.292612 8.361623e-02
#c -16.785311 7.382895e-05
#d -1.362776 2.446304e-01
#e -1.922296 1.269356e-01
#f -4.671259 9.509393e-03
#g -3.719296 2.048710e-02
#h -2.684663 5.496171e-02
#i -15.814635 9.341701e-05
#j -2.423014 7.252635e-02
Then use mylist to filter out what values you don't want.
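If you specifically want the Pearson correlation of each feature with age, rather than the t-statistics above, a minimal sketch is shown below; the 0.5 cutoff is just an assumed example threshold.
# correlation of each row (feature) with age
cors <- apply(persons_df, 1, function(x) cor(age_df$age, as.numeric(x), method = "pearson"))
cors
# keep only the features whose absolute correlation exceeds the threshold
persons_df[abs(cors) > 0.5, ]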

How do I reference the number of rows in a group in dplyr?

I'm trying to write a function to use with dplyr that uses the number of rows in the group. Is there any way to reference the number of rows in the group in dplyr, other than just creating a new column? This would be equivalent to the .N variable in data.table.
Here's an example of what I'm trying to do:
library(dplyr)
library(RcppRoll)
# Function I'm trying to create
rollingMean <- function(x, n = 4) {
  if (.N < n) { # I want to test whether we have fewer than n rows in the group
    out <- mean(x) # if so, return the overall mean
  } else {
    out <- roll_meanr(x, n)
  }
  return(out)
}
# Fake data
tmp <- data.frame(X = 1:21, grouping = c(rep(letters[1:2], 10), letters[3]))

tmp %>%
  group_by(grouping) %>%
  mutate(ma = rollingMean(X)) %>%
  tail # Of course, this doesn't work, but the value of ma for the last row should be 21
This seems like it would be fairly simple to do. Does anyone know how to do it?
I think the test in rollingMean just needs to be
if (length(x) < n)
There is an n() function in dplyr (see ?n), but it's special --
... can only be used from within ‘summarise’, ‘mutate’ and ‘filter’ ...
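Putting the suggested fix into the question's function, a sketch of the whole thing (using the question's own tmp data) would look like this:
library(dplyr)
library(RcppRoll)

rollingMean <- function(x, n = 4) {
  if (length(x) < n) { # the group has fewer rows than the window
    mean(x)            # so return the overall mean instead
  } else {
    roll_meanr(x, n)   # right-aligned rolling mean (NA for the first n - 1 rows)
  }
}

tmp %>%
  group_by(grouping) %>%
  mutate(ma = rollingMean(X)) %>%
  tail() # ma for the last row (group c, a single observation) is now 21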

Compute p-values across all columns of (possibly large) matrices in R

Is there any more efficient/faster way to compare two matrices (column by column) and compute p-values using a t-test for no difference in means (eventually switching to chisq.test when necessary)?
Here is my solution:
## generate fake data (e.g., from treatment and control data)
z0 <- matrix(rnorm(100),10,10)
z1 <- matrix(rnorm(100, mean=1.1, sd=2),10,10)
## function to compare columns (bloody for loop)
compare.matrix <- function(z0, z1){
pval <- numeric(ncol(z0)) ## initialize
for(i in 1:ncol(z0)){ ## compare columns
pval[i] <- t.test(z1[, i], z0[, i])$p.value
## if var is categorical, switch test type
if ( length(unique(z1[,i]))==2){
index <- c(rep(0, nrow(z0)), rep(1, nrow(z1)))
xx <- c(z0[,i], z1[,i])
pval[i] <- chisq.test(table(xx, index), simulate.p.value=TRUE)$p.value
}
}
return(pval)
}
compare.matrix(z0, z1)
Here's one way using reshape2 and dplyr. It would probably be better to combine the first three lines into a single step if you've got large matrices, but I separated them for clarity. I think the chi-squared case would be a fairly simple extension.
library(reshape2) # provides melt()
library(dplyr)

z0_melt = melt(z0, value.name = 'z0')[, c('Var2', 'z0')]
z1_melt = melt(z1, value.name = 'z1')[, c('Var2', 'z1')]
all_df = merge(z0_melt, z1_melt)

all_df %>%
  group_by(Var2) %>%
  summarize(p = t.test(z0, z1)$p.value)
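For comparison, a plain base-R alternative (not benchmarked here) that produces the same kind of per-column p-value vector is a vapply() over the columns:
pvals <- vapply(seq_len(ncol(z0)),
                function(i) t.test(z1[, i], z0[, i])$p.value,
                numeric(1))
pvals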
