for loop only runs only for last row in R - r

I have a simple issue with a for loop in R - I am trying to make it run for the entire dataset and it only runs for the last row. This is done with quite complex datasets which are both shapefiles and I am testing the intersection of the geometries. That is why I can't quite make a reproducible example here.
Nevertheless, this is my code:
for(i in 1:nrow(data1)){
#get intersections between data2 and data1 for specific years
output = st_join(
x = data1[i, ],
y = data2[which(data2$year %in% data1$lag.year[i]:data1$year[i]), ],
join = st_intersects
)
#Get area of intersections
output = transform(output,
inter_area = mapply(function(x, y) {
as.numeric(sf::st_area(
sf::st_intersection(x, y)
))}, x = geometry, y = geom_2))
## obtaining the proportion of area in data1 intersected by data2
output = transform(output, prop_inter = inter_area/area)
#get cycle-specific values
output <- output%>%
group_by(code, year.x)%>%
dplyr::summarise(prop_inter = sum(prop_inter),
end_date= max(end_date),
start_date= max(start_date))%>%
ungroup()
return(output)
}
As you can see I am testing the intersections of data2 on data1 and see which percentage of data1 is intersected dependent on the values they take on year and lag.year. The issue is that when I run this it only returns the desidered outcome for the last row, instead of the entire data1 object. I've tested all the different bits of code inside the loop separately and they all do as I want, but once I try to scale all of it up to the entire dataframe it just does it for the last row.
So I assume this must be some simple stupid mistake I am making for the loop.
Thanks!

You keep rewriting the output object; you may want to create a vector of length nrow(data) and assign the result to its i-th element. I don't think this relates to {sf} or GIS in general, it is more about how for loops and vectors work in R - consider this example:
for (i in 1:50) {
output <- i # rewriting output object 50 times
}
print(output) # this will be a single element for last row (50)
output <- numeric(50)
for (i in 1:50) {
output[i] <- i # storing result in a new element of output for each i
}
print(output) # this will be 1:50 as expected
You may want to consider something along these lines (hard to make certain without access to your data, but it should get you started).
result <- numeric(nrow(data1)) # init the vector
for(i in 1:nrow(data1)){
#get intersections between data2 and data1 for specific years
output = st_join(
x = data1[i, ],
y = data2[which(data2$year %in% data1$lag.year[i]:data1$year[i]), ],
join = st_intersects
)
#Get area of intersections
output = transform(output,
inter_area = mapply(function(x, y) {
as.numeric(sf::st_area(
sf::st_intersection(x, y)
))}, x = geometry, y = geom_2))
## obtaining the proportion of area in data1 intersected by data2
output = transform(output, prop_inter = inter_area/area)
#get cycle-specific values
result[i] <- output%>% # store in i-th element of result instead
group_by(code, year.x)%>%
dplyr::summarise(prop_inter = sum(prop_inter),
end_date= max(end_date),
start_date= max(start_date))%>%
ungroup()
# return(output) # no need for return unless you are in a function
}

Related

store values of a function into a vector

I want to create a function which first cuts a tree (from a hclust) into 2:13 groups (12 different cutree values), then calculates the adjusted rand index (randIndex) between these 12 cutree values and a stored vector I already have and finally store these adjusted rand index values into a vector so I can compare the answers. All I've got is
for(i in 2:13){
a <- cutree(hclust1, k=i)
randIndex(stored_vector, a)
}
where hclust1 is just the hierarchical clustering output and stored_vector is just the stored vector value I mentioned. I am completely new to programming and would appreciate some help. Thank you.
Does this work for you?
library(tidyverse)
library(fossil) # rand.index function
# get a dataset for cutree, change this to your dataset
hc <- hclust(dist(USArrests))
# change k to your desired vector
k <- 2:12
vec <- cutree(hc, k = k)
# create an empty dataframe
df <- tibble(i=as.numeric(),j=as.numeric(),result=as.numeric())
# create nested for loops to get result
for (i in k) {
for (j in k) {
result <- rand.index(vec[,i-1],vec[,j-1])
df <- df %>%
add_row(i=i,j=j,result=result)
}
}
# view result
df %>%
filter(result != 1) %>%
distinct(result, .keep_all = TRUE) %>%
view()

R - Create biosequence from a list of sequences with sample()

I have a list of biosequences that are very similar. Firstly I want to create a distance matrix NxN and then I want to generate another sequence with sample() that will be also quite similar to the others but not identical. I try to create the sequence by using the sample fucntion but it returns a copy of my initial list. Lastly, I want to update the matrix to contain the distances of the generated sequence. Is there a way to simply bind the new distances or I need to create a new one?
sequences <- read.fasta('/media/losve/Νέος τόμος/Scripts/bnp54/ergasia 2/histone4.fa')
seqnames <- c("human", "mouse", "fly", "plant", "cow", "worm", "chick", "rat","yeast", "frog")
myseqs <- list()
for(i in 1:length(sequences))
{
myseqs[i] = toupper(paste(sequences[[i]], collapse=''))
}
names(myseqs) <- seqnames
dist_matrix <- matrix(, nrow = length(sequences), ncol = length(sequences))
for(i in 1:length(sequences))
{
for(j in 1:length(sequences))
{
dist_matrix[i][j] <- pairwiseAlignment(myseqs[i], myseqs[j], substitutionMatrix = "BLOSUM50")
}
}
new_sequence <- sample(myseqs, replace = TRUE )
sample just draws elements from a vector without replacement by default. It is not intended to modify the elements in any way. It just takes a random subset of your data if n < total observations. For n = total observations, it has to return the exact input but just in a different order.
Introducing sequencing errors
You should consider shell tools to introduce random mutation errors in genomic sequences, e.g. Grinder, Mason or SiMSeq.
Getting a permutation for each sequence
library(tidyverse)
set.seed(1337)
list("AATTGG", "GGCCGG") %>%
map_chr(~ {
.x %>%
str_split(pattern = "") %>%
simplify() %>%
sample() %>%
paste0(collapse = "")
})
#> [1] "ATGATG" "CCGGGG"

How to plot histogram from the R output?

I am trying to plot histogram from the R output which is not a data frame. Below are my codes and the output.
x <- replicate(1000,
{y <- rpois(200, 1)
{lambda0 <- 1
for(i in 1:1)
{
if( i == 1 ) cat( sprintf("%15s %15s %15s %15s\n", "LogL", "Score", "Information", "New Estimate"))
logL <- sum((-lambda0) + y*(log(lambda0)))
score <- sum((y/lambda0)-1)
information <- sum(y/(lambda0)^2)
lambda1 <- lambda0 + score/information
cat( sprintf("%15.4f %15.4f %15.4f %15.5f\n", logL, score, information, lambda1))
lambda0 <- lambda1
}
}
})
Below is my output
I'm trying to take the new estimate from the output and create histogram. Can you please help?
Thank you.
You need to store the value for New Estimate during your loop. This way you can retrieve your results after the loop is finished. Normally when using a loop, you specify a variable in advance in which you can save the result for each iteration. E.g.:
numbers <- 1:3
result <- list(length = length(numbers)
for (i in seq_along(numbers){
result[[i]] <- numbers[[i]] + 1
}
In this example there is a vector of three numbers, you want to add one to each number and save the result. You can do this by creating a list of length 3 (adding length is better, but not necessary) and for each ith iteration, you save the result in the ith element of the list.
After finishing the loop you can retrieve the results from the result variable. And you can retrieve the ith result by using square brackets: result[[i]].

How to write function to return plot but also an object (data.frame)?

Description and goal: In R Studio, I would like to define a function that drops columns of a given data.frame if it contains a too high share of missing values, defined by a cutoff value in percent. This function should return information about the subsetted data.frame (number of remaining columns and remaining share of missing cases) together with the subsetted data.frame itself for further analyses. Additionally, there should be an option to visualize remaining missing cases using the function vis_miss() of the identically named package.
Packages used:
library(tidyverse)
library(vismiss)
Data:
my.data <- tibble(col_1 = c(1:5),
col_2 = c(1,2,NA,NA,NA))
My function:
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
if (vis_miss==TRUE) {return(vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F))}
df
}
Test:
cut_cols(my.data, 0.5, vis_miss = F) # without visualization
cut_cols(my.data, 0.5, vis_miss = T) # with visualization
Problem:
As you might have already seen in the example above, only the first line, where vis_miss = F actually returns the data.frame but not the second line, where vis_miss = T. I assume that this is because of the extra if () {} clause, which returns a plot and then ends the process without printing df. Is there a way to prevent this from happening so that the first line also returns the new data.frame?
You were correct in your suspicion that the if(){} clause was stopping the df from printing. I think return() stops any function from running further. If that's the case then it's best practice to put it at the end of any function.
Further, use print(df) to make sure your function outputs your data frame. Here are a few changes to your code
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
print(df)
if (vis_miss==TRUE) {return(vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F))}
}
cut_cols(my.data, 0.5, vis_miss = T)
Here's another option if it interests you. You can assign both the df and the plot to a list then call the list.
cut_cols <- function(df, na.perc.cutoff, vis_miss=FALSE) {
df <- df[lapply(df, function(x) sum(is.na(x)) / length(x)) < na.perc.cutoff]
cat(paste0("Remaining cols: ", ncol(df)),
paste0("\nRemaining miss: ", paste0(round(sum(is.na(df)) / prod(dim(df)) * 100, 2), "%\n")))
# empty list
list_ <- c()
# assign df to first index of list
list_[[1]] <- df
if (vis_miss==TRUE){
plot <- vis_miss(df[1:nrow(df),c(1:ncol(df))], warn_large_data=F)
# assign plot to second index in list
list_[[2]] <- plot
}
return(list_)
}
output <- cut_cols(my.data, 0.5, vis_miss = T)
Calling output will print both the df and plot. output[[1]] will print just the df. output[[2]] will print just the plot.

How to apply a function that gets data from differents data frames and has conditions in it?

I have a customized function (psup2) that gets data from a data frame and returns a result. The problem is that it takes a while since I am using a "for" loop that runs for every row and column.
Input:
I have a table that contains the ages (table_costumers), an n*m matrix of different terms, and two different mortality tables (for males and females).
The mortality tables i´m using contains one column for ages and another one for its corresponding survival probabilities.
Output:
I want to create a separate dataframe with the same size as that of the term table. The function will take the data from the different mortality tables (depending on the gender) and then apply the function above (psup2) taking the ages from the table X and the terms from the matrix terms.
Up to now I managed to create a very inefficient way to do this...but hopefully by using one of the functions from the apply family this could get faster.
The following code shows the idea of what I am trying to do:
#Function
psup2 <- function(x, age, term) {
P1 = 1
for (i in 1:term) {
P <- x[age + i, 2]
P1 <- P1*P
}
return(P1)
}
#Inputs
terms <- data.frame(V1 = c(1,2,3), V2 = c(1,3,4), V2 = c(2,3,4))
male<- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9981,0.9979,0.9978,.994,.992,.99))
female <- data.frame(age = c(0,1,2,3,4,5), probability = c(0.9983,0.998,0.9979,.9970,.9964,.9950))
table_customers <- data.frame(id = c(1,2,3), age = c(0,0,0), gender = c(1,2,1))
#Loop
output <- data.frame(matrix(NA, nrow = 3, ncol = 0))
for (i in 1:3) {
for (j in 1:3) {
prob <- ifelse(table_customers[j, 3] == 1,
psup2(male, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])),
psup2(female, as.numeric(table_customers[j, 2]), as.numeric(terms[j,i])))
output[j, i] <- prob
}
}
your psup function can be simplified into:
psup2 <- function(x, age, term) { prod(x$probability[age+(1:term)]) }
So actually, we won't use it, we'll use the formula directly.
We'll put your male and female df next to each other, so we can use the value of the gender column to choose one or another.
mf <- merge(male,female,by="age") # assuming you have the same ages on both sides
input_df <- cbind(table_customers,terms)
output <- t(apply(input_df,1,function(x){sapply(1:3,function(i){prod(mf[x["age"]+(1:x[3+i]),x["gender"]+1])})}))
And that's it :)
The sapply function is used to loop on the columns of terms.
x["age"]+(1:x[3+i]) are the indices of the rows you want to multiply
x["gender"]+1 is the relevant column of the mf data.frame

Resources