Working with conditionals in R - r

I have 3 rasters and I want to use them in an expression, but the NA values can fall in different cells in each of the 3 rasters. For example, a cell can have a value in 2 rasters but NA in the third, and in that case I cannot apply my expression.
Here is my code:
for(i in 1:length(name_BSA)){
i <- 1
if(days_BSA[i] == days_WSA[i] & days_WSA[i] == days_FDS[i]){
BSA <- raster(list_BSA[i])
WSA <- raster(list_WSA[i])
FDS <- raster(list_FDS[i])
brick <- brick(BSA, WSA, FDS)
if(!is.na(BSA[,]) & !is.na(WSA[,]) & !is.na(FDS[,])){
BLSA <- ((1-FDS[i])*BSA[i]) + (FDS[i] * WSA[i])
}
name_BLSA <- paste0("BLSA_",days_BSA[i])
writeRaster(BLSA, file.path(main,output_folder, name_BLSA), format = "GTiff", overwrite = T)
}
}
My problem is this part: !is.na(BSA[,]) & !is.na(WSA[,]) & !is.na(FDS[,])
This part does not work.
Can someone help me?

It would be easier to help if you supplied some code-generated example data, and omitted irrelevant detail such as the for-loop and if-clause.
As far as I can see, there is no need to use !is.na. If one of the values is NA, the result will also be NA. I assume that is what you want, although you do not have an else clause. You should not use indexing on the RasterLayers in arithmetic operations. You also create a RasterBrick without using it.
library(raster)
# example data
output_folder = "."
f <- system.file("external/rlogo.grd", package="raster")
BSA <- raster(f, 1)
WSA <- raster(f, 2)
FDS <- raster(f, 3)
# improved code
BLSA <- (1-FDS)*BSA + FDS * WSA
name_BLSA <- file.path(output_folder, paste0("BLSA_", ".tif"))
writeRaster(BLSA, name_BLSA, overwrite = TRUE)
Alternatively, you could do
BLSA <- overlay(FDS, BSA, WSA, fun=function(x,y,z) { (1-x)*y + x*z }, filename=name_BLSA )
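The point that NA propagates through arithmetic can be checked without any raster files. The following is a minimal sketch in base R in which plain numeric vectors stand in for the cell values of the three layers; the numbers are made up for illustration. It also hints at why the original if fails: !is.na(...) on a whole layer yields one logical value per cell, while if expects a single TRUE or FALSE (a condition of length greater than one is an error in R 4.2.0 and later).

```r
# Plain numeric vectors stand in for raster cell values (hypothetical data).
BSA <- c(0.10, 0.20, NA,   0.40)
WSA <- c(0.15, NA,   0.35, 0.45)
FDS <- c(0.50, 0.50, 0.50, NA)

# Same expression as in the answer: any NA input gives an NA result,
# so no explicit is.na() test is needed.
BLSA <- (1 - FDS) * BSA + FDS * WSA
print(BLSA)

# The NA test is vectorised: one logical value per cell, not a single one.
complete <- !is.na(BSA) & !is.na(WSA) & !is.na(FDS)
print(complete)
```

Only the first cell has values in all three vectors, so it is the only one with a non-NA result.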

Related

R if in for loop error - how can I save selected model?

I am new to R and have difficulties using "if" and for loops. Sorry if this is a duplicate.
As you can see in the chunk of code below, I try to create 100 lm models and save each one when its R is more than 0.7.
However, the code saves all 100 lm models.
I suspect the statement (!is.na(lm.cv.r[i]) > 0.70) is wrong, but I cannot figure it out.
# lets use USArrests data as an example
data("USArrests")
head(USArrests)
df.norm <- USArrests
set.seed(100)
lm.cv.mse <- NULL
lm.cv.r <- NULL
k <- 100
for(i in 1:k){
index.cv <- sample(1:nrow(df.norm),round(0.8*nrow(df.norm)))
df.cv.train <- df.norm[index.cv, ]
df.cv.test <- df.norm[-index.cv, ]
lm.cv <- glm(Rape~., data = df.cv.train)
lm.cv.predicted <- predict(lm.cv, df.cv.test)
lm.cv.mse[i] <- sum((df.cv.test$Rape - lm.cv.predicted)^2)/nrow(df.cv.test)
lm.cv.r[i] <- as.numeric(round(cor(lm.cv.predicted, df.cv.test$Rape, method = "pearson"), digits = 3))
if (!is.na(lm.cv.r[i]) > 0.70){
saveRDS(lm.cv, file = paste("lm.cv", lm.cv.r[i], ".rds", sep = ''))
}
}
I'm not familiar with lm, so I will assume your code is working and that the problem is, as you said, the if statement.
Try this out:
if (!is.na(lm.cv.r[i]) && lm.cv.r[i] > 0.7){
saveRDS(lm.cv, file = paste("lm.cv", lm.cv.r[i], ".rds", sep = ''))
}
So in your code
(!is.na(lm.cv.r[i]) > 0.70)
take a look at how R evaluates it. The comparison operator binds more tightly than !, so the expression is parsed as !(is.na(lm.cv.r[i]) > 0.70). Assuming that lm.cv.r[i] is not NA, is.na returns FALSE, which is coerced to 0; 0 > 0.7 is FALSE, and negating that gives TRUE. Either way, the correlation itself is never compared with 0.7: the condition only tests whether the value is NA.
In conclusion, you are saving every model, since the condition is TRUE for every non-NA correlation.
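A small sketch of the difference, using a made-up correlation value below the threshold. In R, comparison operators take precedence over !, so the original condition never compares the value itself with 0.70:

```r
r <- 0.35  # hypothetical correlation, below the 0.70 threshold

# Parsed as !(is.na(r) > 0.70): only the NA check has any effect.
wrong <- !is.na(r) > 0.70

# NA check first, then the actual threshold comparison.
right <- !is.na(r) && r > 0.70

print(c(wrong = wrong, right = right))
```

Using && here also short-circuits, so the comparison is skipped entirely when the value is NA.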

Search for specific line in R function body

I wish to "copy and modify" a function at a specific point in its body. Currently, what I have is
nearest_psd <- function(mat) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals<0] <- 0
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
nearest_pd <- nearest_psd
formals(nearest_pd)$pdeps <- 1e-08
body(nearest_pd)[[c(5,3)]] <- quote(pdeps)
so that nearest_pd is a copy of nearest_psd, except that the fifth element of its body becomes eigvals[eigvals<0] <- pdeps.
However, the index of that line within the body (5, in this case, because the opening { counts as the first element) is hard-coded, and I would prefer to have a robust way to determine it. How can I search for the line that contains the expression eigvals[eigvals<0] <- 0?
You can use identical to compare two expressions; that way, you can identify and replace the expression in question:
to_replace = vapply(body(nearest_pd), function (e) identical(e, quote(eigvals[eigvals < 0] <- 0)), logical(1L))
body(nearest_pd)[to_replace] = list(quote(eigvals[eigvals < pdeps] <- pdeps))
However, this is no more readable, nor more robust, than your code: in both cases you’re forced to hard-code the relevant information. In your code it is the indices; in mine, the expression. For that reason I wouldn’t recommend using this.
… of course you could instead use an AST walker to replace all occurrences of 0 in the function’s body with pdeps. But is that better? No, since 0 could be used for other purposes. It currently isn’t, but who knows, once the original function changes. And if the original function can’t be assumed to change, why not hard-code the new function entirely? That is, write this:
nearest_pd <- function (mat, pdeps = 1e-08) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals < pdeps] <- pdeps
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
… no need to use metaprogramming just for the sake of it.
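For completeness, the identical-based search from this answer can be run end to end to recover the index instead of hard-coding it. This is only a sketch: target, is_target, line_index and b are names introduced here for illustration, and the caveat above still applies, since the expression itself is now what is hard-coded.

```r
nearest_psd <- function(mat) {
  ed <- eigen(mat)
  eigvecs <- ed$vectors
  eigvals <- ed$values
  eigvals[eigvals < 0] <- 0
  eigvecs %*% diag(eigvals) %*% t(eigvecs)
}

# Compare every top-level element of the body with the expression we want.
target <- quote(eigvals[eigvals < 0] <- 0)
is_target <- vapply(as.list(body(nearest_psd)),
                    function(e) identical(e, target), logical(1L))
line_index <- which(is_target)  # position within the body, counting `{` as 1
print(line_index)

# Copy the function, add the new argument, and patch the located line.
nearest_pd <- nearest_psd
formals(nearest_pd)$pdeps <- 1e-08
b <- body(nearest_pd)
b[[line_index]][[3]] <- quote(pdeps)  # right-hand side of the assignment
body(nearest_pd) <- b
print(body(nearest_pd)[[line_index]])
```

If the expression is not found, line_index is empty, so real code would want to check length(line_index) == 1 before patching.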
The following might do what you want.
nearest_psd <- function(mat) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals<0] <- 0
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
nearest_pd <- nearest_psd
formals(nearest_pd)$pdeps <- 1e-08
nearest_psd_body <- body(nearest_psd)
# Find the string we are looking for and replace it ...
new.code <- gsub("eigvals[eigvals < 0] <- 0",
"MY_NEW_CODE",
nearest_psd_body, fixed = TRUE)
# Building the function body as a string.
new.code <- new.code[-1] # delete first { such that ...
new.code <- paste(new.code, collapse = ";") # we can collapse the remaining here ....
new.code <- paste("{", new.code, "}", sep = "", collapse = "") # and then wrap the remaining in { }
# parse returns an expression.
body(nearest_pd) <- parse(text = new.code)
See "At a basic level, what does eval-parse do in R?" for an explanation of parse, or "In programming, what is an expression?" for what an expression is.

How do I repeat codes with names changing at every block? (with R)

I'm dealing with several outputs from QIIME: text files that I want to manipulate to obtain boxplots. Every input is formatted in the same way, so the manipulation is always the same; only the source name changes. For each input, I want to extract the last 5 rows, compute the mean of each column/sample, associate the values with the experimental sample labels (Group) taken from the mapfile, and put them in the order I use for a boxplot of all 6 data sets.
In bash, I do something like "for i in GG97 GG100 SILVA97 SILVA100 NCBI RDP; do cp ${i}/alpha/collated_alpha/chao1.txt alpha_tot/${i}_chao1.txt; done" to do a command various times changing the names in the code in an automatic way through ${i}.
I'm struggling to find a way to do the same in R. I thought of creating a vector containing the names and then using a for loop that steps through it with [1], [2], etc., but it doesn't work: it stops at the read.delim line because it cannot find the file in the working directory.
Here's the manipulation code I wrote. After the comment, it will repeat itself 6 times with the 6 databases I'm using (GG97 GG100 SILVA97 SILVA100 NCBI RDP).
PLUS, I repeat this process 4 times because I have 4 metrics to use (here I'm showing shannon, but I also have a copy of the code for chao1, observed_species and PD_whole_tree).
library(tidyverse)
library(labelled)
mapfile <- read.delim(file="mapfile_HC+BV.txt", check.names=FALSE);
mapfile <- mapfile[,c(1,4)]
colnames(mapfile) <- c("SampleID","Pathology_group")
#GG97
collated <- read.delim(file="alpha_diversity/GG97_shannon.txt", check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
GG97_shannon <- as.data.frame(rbind(labels[2,],mean))
GG97_shannon <- t(GG97_shannon);
DB_type <- list(DB = "GG97"); DB_type <- rep(DB_type, 41)
GG97_shannon <- as.data.frame(cbind(DB_type,GG97_shannon))
colnames(GG97_shannon) <- c("DB","Group","value")
rm(collated,collated_reorder,DB_type,labels,mean)
Here I paste all the outputs together, freeze the order and make the boxplot.
alpha_shannon <- as.data.frame(rbind(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon))
rownames(alpha_shannon) <- NULL
rm(GG97_shannon,GG100_shannon,SILVA97_shannon,SILVA100_shannon,NCBI_shannon,RDP_shannon)
alpha_shannon$Group = factor(alpha_shannon$Group, unique(alpha_shannon$Group))
alpha_shannon$DB = factor(alpha_shannon$DB, unique(alpha_shannon$DB))
library(ggplot2)
ggplot(data = alpha_shannon) +
aes(x = DB, y = value, colour = Group) +
geom_boxplot()+
labs(title = 'Shannon',
x = 'Database',
y = 'Diversity') +
theme(legend.position = 'bottom')+
theme_grey(base_size = 16)
How do I keep this code "DRY" so that I don't need 146 rows of code to repeat the same things over and over? Thank you!!
You didn't provide a Minimal reproducible example, so this answer cannot guarantee correctness.
An important point to note is that you use rm(...), so this means some variables are only relevant within a certain scope. Therefore, encapsulate this scope into a function. This makes your code reusable and spares you the manual variable removal:
process <- function(file, DB){
# -> Use the function parameter `file` instead of a hardcoded filename
collated <- read.delim(file=file, check.names=FALSE);
collated <- tail(collated,5); collated <- collated[,-c(1:3)]
collated_reorder <- collated[,match(mapfile[,1], colnames(collated))]
labels <- t(mapfile)
colnames(collated_reorder) <- labels[2,]
mean <- colMeans(collated_reorder, na.rm = FALSE, dims = 1)
mean = as.matrix(mean); mean <- t(mean)
# -> rename this variable to a more general name, e.g. `result`
result <- as.data.frame(rbind(labels[2,],mean))
result <- t(result);
# -> Use the function parameter `DB` instead of a hardcoded string
DB_type <- list(DB = DB); DB_type <- rep(DB_type, 41)
result <- as.data.frame(cbind(DB_type,result))
colnames(result) <- c("DB","Group","value")
# -> After the end of this function, the variables defined in this function
# vanish automatically, you just need to specify the result
return(result)
}
Now you can reuse that block:
GG97_shannon <- process(file = "alpha_diversity/GG97_shannon.txt", DB = "GG97")
GG100_shannon <- process(file =...., DB = ....)
SILVA97_shannon <- ...
SILVA100_shannon <- ...
NCBI_shannon <- ...
RDP_shannon <- ...
Alternatively, you can use looping structures:
General-purpose for:
datasets <- c("GG97_shannon", "GG100_shannon", "SILVA97_shannon",
"SILVA100_shannon", "NCBI_shannon", "RDP_shannon")
files <- c("alpha_diversity/GG97_shannon.txt", .....)
DBs <- c("GG97", ....)
result <- list()
for(i in seq_along(datasets)){
result[[datasets[i]]] <- process(files[i], DBs[i])
}
mapply, a "specialized for" for looping over several vectors in parallel:
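# the first argument is the function from above; the other ones are passed
# element by element as arguments to our process(.) function.
# SIMPLIFY = FALSE keeps the results as a list of data frames.
results <- mapply(process, files, DBs, SIMPLIFY = FALSE)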
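As a rough R counterpart of the bash ${i} substitution, the file names can be built from a vector of database names with paste0 and file.path. The sketch below assumes the directory layout from the question; process_stub is a hypothetical stand-in for process() that does not read any file, so the example runs without the QIIME outputs.

```r
dbs <- c("GG97", "GG100", "SILVA97", "SILVA100", "NCBI", "RDP")
metric <- "shannon"  # swap for "chao1", "observed_species", "PD_whole_tree"

# One path per database, built the way ${i} is expanded in bash.
files <- file.path("alpha_diversity", paste0(dbs, "_", metric, ".txt"))

# Hypothetical stand-in for process(): returns a one-row data frame
# instead of reading the file.
process_stub <- function(file, DB) {
  data.frame(DB = DB, file = file, stringsAsFactors = FALSE)
}

# Map() is mapply() with SIMPLIFY = FALSE, so the result is a list.
results <- Map(process_stub, files, dbs)

# Stack the per-database data frames into one, ready for plotting.
alpha <- do.call(rbind, unname(results))
rownames(alpha) <- NULL
print(alpha)
```

Wrapping the last few lines in a function of metric would cover the four metrics with the same code.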

How do I use $ for output components in R?

First, my code works perfectly. I simply need to be able to call the year and seasonal components out of BestSolarData using $ with:
BestSolarData$year
BestSolarData$seasonal
I have these written at the end of my code. I know the year comes from BestYear and the seasonal component comes from BestData in the ForLoopSine function.
How can I access these components using $?
SineFit <- function (ToBeFitted)
{
msvector <- as.vector(ToBeFitted)
y <- length(ToBeFitted)
x <- 1:y
MS.nls <- nls(msvector ~ a*sin(((2*pi)/12)*x+b)+c, start=list(a=300, b=0, c=600))
summary(MS.nls)
MScoef <- coef(MS.nls)
a <- MScoef[1]
b <- MScoef[2]
c <- MScoef[3]
x <- 1:12
FittedCurve <- a*sin(((2*pi)/12)*x+b)+c
#dev.new()
#layout(1:2)
#plot(ToBeFitted)
#plot(FittedCurve)
return (FittedCurve)
}
ForLoopSine <- function(PastData, ComparisonData)
{
w<-start(PastData)[1]
t<-end(PastData)[1]
BestDiff <- 9999
for(i in w:t)
{
DataWindow <- window(PastData, start=c(i,1), end=c(t,12))
Datapredict <- SineFit(DataWindow)
CurrDiff <- norm1diff(Datapredict, ComparisonData)
if (CurrDiff < BestDiff)
{
BestDiff <- CurrDiff
BestYear <- i
BestData <- Datapredict
}
}
print(BestDiff)
print(BestYear)
return(BestData)
}
RandomFunction <- function(PastData, SeasonalData)
{
w <- start(PastData)[1]
t <- end(PastData)[1]
Seasonal.ts <- ts(SeasonalData, st = c(w,1), end = c(t,12), fr = 12)
Random <- PastData-Seasonal.ts
layout(1:3)
plot(SeasonalData)
plot(Seasonal.ts)
plot(Random)
return(Random)
}
BestSolarData <- ForLoopSine(MonthlySolarPre2015, MonthlySolar2015)
RandomComp <- RandomFunction (MonthlySolarPre2015, BestSolarData)
acf(RandomComp)
BestSolarData$year
BestSolarData$seasonal
As far as I understand your problem, you would like to retrieve the year component of BestSolarData with BestSolarData$year. But BestSolarData is the value returned by ForLoopSine, namely BestData, which is in turn Datapredict, the value returned by the SineFit function. It seems to be a plain vector and not a list or data.frame, so $ cannot work here.
Your example is not reproducible; making it reproducible may help you find a solution. See this post for more details.
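One way to make $ usable is to have the function return a named list holding both pieces. The sketch below uses a hypothetical pick_best function and made-up numbers in place of ForLoopSine and the solar data; only the return pattern is the point.

```r
# Hypothetical stand-in for ForLoopSine(): keeps the best candidate and
# returns the year and the fitted values together in a named list.
pick_best <- function(candidates) {
  best_diff <- Inf
  best <- NULL
  for (year in names(candidates)) {
    diff <- sum(abs(candidates[[year]]))  # placeholder for norm1diff()
    if (diff < best_diff) {
      best_diff <- diff
      best <- list(year = as.integer(year), seasonal = candidates[[year]])
    }
  }
  best
}

candidates <- list("2012" = c(3, -2, 1),
                   "2013" = c(1, 0, -1),
                   "2014" = c(2, 2, 2))
best <- pick_best(candidates)
print(best$year)
print(best$seasonal)
```

Applied to the question, ForLoopSine would end with something like return(list(year = BestYear, seasonal = BestData)), and RandomFunction would then be given BestSolarData$seasonal.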

R: create vector from nested for loop

I have a "hit list" of genes in a matrix. Each row is a hit, and the format is "chromosome(character) start(a number) stop(a number)." I would like to see which of these hits overlap with genes in the fly genome, which is a matrix with the format "chromosome start stop gene"
I have the following function that works (prints a list of genes from column 4 of dmelGenome):
geneListBuild <- function(dmelGenome='', hitList='', binSize='', saveGeneList='')
{
genomeColumns <- c('chr', 'start', 'stop', 'gene')
genome <- read.table(dmelGenome, header=FALSE, col.names = genomeColumns)
chr <- genome[,1]
startAdjust <- genome[,2] - binSize
stopAdjust <- genome[,3] + binSize
gene <- genome[,4]
genome <- data.frame(chr, startAdjust, stopAdjust, gene)
hits <- read.table(hitList, header=TRUE)
chrHits <- hits[hits$chr == "chr3R",]
chrGenome <- genome[genome$chr == "chr3R",]
genes <- c()
for(i in 1:length(chrHits[,1]))
{
for(j in 1:length(chrGenome[,1]))
{
if( chrHits[i,2] >= chrGenome[j,2] && chrHits[i,3] <= chrGenome[j,3] )
{
print(chrGenome[j,4])
}
}
}
genes <- unique(genes[is.finite(genes)])
print(genes)
fileConn<-file(saveGeneList)
write(genes, fileConn)
close(fileConn)
}
however, when I substitute print() with:
genes[j] <- chrGenome[j,4]
R returns a vector that has some values that are present in chrGenome[,1]. I don't know how it chooses these values, because they aren't in rows that seem to fulfill the if statement. I think it's an indexing issue?
Also I'm sure that there is a more efficient way of doing this. I'm new to R, so my code isn't very efficient.
This is similar to the "writing the results from a nested loop into another vector in R," but I couldn't fix it with the information in that thread.
Thanks.
I believe the inner loop could be replaced with:
gene.in <- chrHits[i,2] >= chrGenome[,2] & chrHits[i,3] <= chrGenome[,3]
Then you can use that logical vector to select what you want. Doing
which(gene.in)
might also be of use to you.
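Put together, the logical vector can replace the inner loop and feed the gene list directly. The following is a sketch on made-up toy data in the same column layout as the question. Appending with c() avoids the problem with genes[j] <- ..., where the position j is reused for every hit and leaves NA gaps. It also assumes the gene column is character; if read.table produced a factor, assigning its elements into a vector would store integer codes rather than names, which might explain the unexpected values.

```r
# Hypothetical toy data in the same column layout as the question.
chrGenome <- data.frame(
  chr   = "chr3R",
  start = c(100, 500, 900),
  stop  = c(400, 800, 1200),
  gene  = c("geneA", "geneB", "geneC"),
  stringsAsFactors = FALSE
)
chrHits <- data.frame(
  chr   = "chr3R",
  start = c(150, 950, 2000),
  stop  = c(300, 1000, 2100)
)

genes <- character(0)
for (i in seq_len(nrow(chrHits))) {
  # One vectorised comparison replaces the inner loop over j.
  gene.in <- chrHits[i, 2] >= chrGenome[, 2] & chrHits[i, 3] <= chrGenome[, 3]
  # Append the matching gene names instead of writing to position j.
  genes <- c(genes, chrGenome[gene.in, 4])
}
genes <- unique(genes)
print(genes)
```

The first hit falls inside geneA and the second inside geneC; the third overlaps nothing and contributes no entry.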
