function with FOR and IF loops - r

I am writing a function that will go through a list of files in a directory, count the number of complete cases in each, and, if the number of complete cases is above a given threshold, calculate a correlation. The output must be a numeric vector of correlations for all files that meet the threshold requirement. This is what I have so far (and it gives me Error: unexpected '}' in "}"). Full disclosure: I am a complete newbie, as in I wrote my first code 2 weeks ago. What am I doing wrong?
correlation <- function (directory, threshhold = 0) {
all_files <- list.files(path = getwd())
correlations_list <- numeric()
for (i in seq_along(all_files)) {
dataFR2 <- read.csv(all_files[i])
c <- c(sum(complete.cases(dataFR2)))
if c >= threshhold {
d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
correlations_list <- c(correlations_list, d)
}
}
correlations_list
}

"Unexpected *" errors are a syntax error. Often a missing parenthesis, comma, or curly bracket. In this case, you need to change if c >= threshhold { to if (c >= threshhold) {. if() is a function and it requires parentheses.
I'd also strongly recommend that you not use c as a variable name. c() is the most commonly used R function, and giving an object the same name will make your code look very strange to anyone else reading it.
Lastly, I'd recommend that you make your output the same length as the number of files. As you have it, there won't be any way to know which files met the threshold to have their correlations calculated. I'd make correlations_list have the same length as the number of files, and add names to it so you know which correlation belongs to which file. This has the side benefit of not "growing an object in a loop", which is an anti-pattern known for its inefficiency. A rewritten function would look something like this:
correlation <- function (directory, threshhold = 0) {
all_files <- list.files(path = directory, full.names = TRUE) ## use the directory argument rather than getwd()
correlations_list <- numeric(length(all_files)) ## initialize to full length
for (i in seq_along(all_files)) {
dataFR2 <- read.csv(all_files[i])
n_complete <- sum(complete.cases(dataFR2))
if(n_complete >= threshhold) {
d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
} else {
d <- NA
}
correlations_list[i] <- d
}
names(correlations_list) <- all_files
correlations_list
}
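Once the result is named and padded with NA, picking out the files that met the threshold is a one-liner. The sketch below is only an illustration: correlation_demo, the temporary directory, and the two tiny CSV files are made up for the example, and it reads from the directory argument instead of getwd().

```r
# Hypothetical demo data: two small CSV files in a temporary directory.
demo_dir <- file.path(tempdir(), "corr_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, 3, 4), nitrate = c(2, 4, 6, 9)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, NA, 3), nitrate = c(NA, 2, 1)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

# Illustrative variant of the function above (the name is an assumption).
correlation_demo <- function(directory, threshold = 0) {
  all_files <- list.files(path = directory, pattern = "\\.csv$", full.names = TRUE)
  result <- rep(NA_real_, length(all_files))   # one slot per file, NA by default
  names(result) <- basename(all_files)
  for (i in seq_along(all_files)) {
    dat <- read.csv(all_files[i])
    if (sum(complete.cases(dat)) >= threshold) {
      result[i] <- cor(dat$sulfate, dat$nitrate, use = "complete.obs")
    }
  }
  result
}

res <- correlation_demo(demo_dir, threshold = 3)
met <- res[!is.na(res)]   # only the files that met the threshold
names(met)
```

Here a.csv has four complete cases and b.csv has one, so with threshold = 3 only a.csv keeps a correlation and b.csv stays NA.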

Search for specific line in R function body

I wish to "copy and modify" a function at a specific point in its body. Currently, what I have is
nearest_psd <- function(mat) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals<0] <- 0
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
nearest_pd <- nearest_psd
formals(nearest_pd)$pdeps <- 1e-08
body(nearest_pd)[[c(7,3)]] <- quote(pdeps)
, so that nearest_pd is a copy of nearest_psd, except for the line eigvals[eigvals<0] <- pdeps.
However, the line number (7, in this case) is hard-coded, and I would prefer to have a robust way to determine this line number. How can I search for the line that contains the expression eigvals[eigvals<0] <- 0?
You can use identical to compare two expressions; that way, you can identify and replace the expression in question:
to_replace = vapply(body(nearest_pd), function (e) identical(e, quote(eigvals[eigvals < 0] <- 0)), logical(1L))
body(nearest_pd)[to_replace] = list(quote(eigvals[eigvals < pdeps] <- pdeps))
However, this is no more readable, nor more robust, than your code: in both cases you’re forced to hard-code the relevant information (in your code, the indices; in mine, the expression). For that reason I wouldn’t recommend using this.
… of course you could instead use an AST walker to replace all occurrences of 0 in the function’s body with pdeps. But is that better? No, since 0 could be used for other purposes. It currently isn’t, but who knows, once the original function changes. And if the original function can’t be assumed to change, why not hard-code the new function entirely? That is, write this:
nearest_pd <- function (mat, pdeps = 1e-08) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals < pdeps] <- pdeps
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
… no need to use metaprogramming just for the sake of it.
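A quick sanity check of the hand-written version, as a sketch: the 2x2 matrix m and the pdeps value used here are made up for illustration. m has one negative eigenvalue, so the result should have only positive eigenvalues afterwards.

```r
nearest_pd <- function(mat, pdeps = 1e-08) {
  ed <- eigen(mat)
  eigvecs <- ed$vectors
  eigvals <- ed$values
  eigvals[eigvals < pdeps] <- pdeps   # floor small and negative eigenvalues
  eigvecs %*% diag(eigvals) %*% t(eigvecs)
}

m <- matrix(c(1, 2, 2, 1), nrow = 2)          # symmetric, eigenvalues 3 and -1
fixed <- nearest_pd(m, pdeps = 1e-4)
vals <- eigen(fixed, symmetric = TRUE)$values
all(vals > 0)
```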
The following might do what you want.
nearest_psd <- function(mat) {
ed <- eigen(mat)
eigvecs <- ed$vectors
eigvals <- ed$values
eigvals[eigvals<0] <- 0
eigvecs %*% diag(eigvals) %*% t(eigvecs)
}
nearest_pd <- nearest_psd
formals(nearest_pd)$pdeps <- 1e-08
nearest_psd_body <- body(nearest_psd)
# Find the string we are looking for and replace it ...
new.code <- gsub("eigvals[eigvals < 0] <- 0",
"MY_NEW_CODE",
nearest_psd_body, fixed = TRUE)
# Building the function body as a string.
new.code <- new.code[-1] # delete first { such that ...
new.code <- paste(new.code, collapse = ";") # we can collapse the remaining here ....
new.code <- paste("{", new.code, "}", sep = "", collapse = "") # and then wrap the remaining in { }
# parse returns an expression.
body(nearest_pd) <- parse(text = new.code)
See "At a basic level, what does eval-parse do in R?" for an explanation of parse, or "In programming, what is an expression?" for what an expression is.
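To see why string replacement on a function body works at all, it may help to look at a toy case. In this sketch the function f and the replacement are made up for illustration: as.character() on a body deparses each statement to a string, and parse(text = ...) turns the edited text back into code.

```r
f <- function(x) {
  y <- x + 1
  y * 2
}

parts <- as.character(body(f))   # "{"  "y <- x + 1"  "y * 2"
parts <- sub("x + 1", "x + 10", parts, fixed = TRUE)

# Drop the leading "{", join the statements with ";" and wrap them in braces again.
new_body <- paste0("{", paste(parts[-1], collapse = ";"), "}")
body(f) <- parse(text = new_body)[[1]]

f(1)   # (1 + 10) * 2 = 22
```

Note that the match depends on how R deparses the code (spacing included), which is one reason this approach is fragile.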

User defined function - issue with return values

I regularly come up against the issue of how to categorise dataframes from a list of dataframes according to certain values within them (E.g. numeric, factor strings, etc). I am using a simplified version using vectors here.
After writing messy for loops for this task a bunch of times, I am trying to write a function to repeatedly solve the problem. The code below returns a subscripting error (given at the bottom), however I don't think this is a subscripting problem, but to do with my use of return.
As well as fixing this, I would be very grateful for any pointers on whether there are any cleaner / better ways to code this function.
library(plyr)
library(dplyr)
#dummy data
segmentvalues <- c('1_P', '2_B', '3_R', '4_M', '5_D', '6_L')
trialvec <- vector()
for (i in 1:length(segmentvalues)){
for (j in 1:20) {
trialvec[i*j] <- segmentvalues[i]
}
}
#vector categorisation
vcategorise <- function(categories, data) {
#categorises a vector into a list of vectors
#requires plyr and dplyr
assignment <- list()
catlength <- length(categories)
for (i in 1:length(catlength)){
for (j in 1:length(data)) {
if (any(contains(categories[i], ignore.case = TRUE,
as.vector(data[j])))) {
assignment[[i]][j] <- data[j]
}
}
}
return (assignment)
}
result <- vcategorise(categories = segmentvalues, data = trialvec)
Error in *tmp*[[i]] : subscript out of bounds
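The error comes from assignment[[i]][j] <- data[j]. R first has to evaluate assignment[[i]], and because assignment is an empty list there is no element i to fetch, so [[ fails with "subscript out of bounds". Once the list has been allocated to the right size, assignment[[i]] is simply NULL for an unfilled slot and the nested assignment works. Note also that 1:length(catlength) is always just 1, since catlength is a single number; you want 1:catlength (or seq_along(categories)).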
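A small sketch of how [[ behaves on an empty list versus a preallocated one (the variable names are made up for illustration):

```r
empty <- list()
res <- try(empty[[1]], silent = TRUE)   # fails: subscript out of bounds
inherits(res, "try-error")

alloc <- vector("list", 2)   # two slots, both NULL
alloc[[1]][3] <- "a"         # NULL is extended to a vector, padded with NA
alloc[[1]]
```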
In any case, I don't think it is necessary for you to allocate a table. You are already using a flat indexing structure in your test data generation, so why not do the same with assignment and then set its dimensions afterwards?
Something like this, perhaps?
vcategorise <- function(categories, data) {
assignment <- vector("list", length = length(data) * length(categories))
n <- length(data)
for (i in 1:length(categories)){
for (j in 1:length(data)) {
assignment[(i-1)*n + j] <-
if (any(contains(categories[i],
ignore.case = TRUE,
as.vector(data[j])))) {
data[j]
} else {
NA
}
}
}
dim(assignment) <- c(length(data), length(categories))
assignment
}
It is not the prettiest code, but without fully understanding what you want to achieve, I don't know how to go further.
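If the goal is simply "one vector of matching values per category", a loop-free base R sketch might look like the following. This is a guess at the intent rather than a drop-in replacement: vcategorise_sketch is a made-up name, and matching is done as a case-insensitive substring test with grepl() instead of dplyr's contains().

```r
vcategorise_sketch <- function(categories, data) {
  out <- lapply(categories, function(category) {
    # keep the elements of data that contain the category, ignoring case
    data[grepl(tolower(category), tolower(data), fixed = TRUE)]
  })
  names(out) <- categories
  out
}

segmentvalues <- c("1_P", "2_B", "3_R")
trialvec <- rep(segmentvalues, each = 2)   # simpler dummy data than the nested loop
result <- vcategorise_sketch(segmentvalues, trialvec)
result[["2_B"]]
```

The result is a named list, so result[["2_B"]] gives the values that matched that category.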

R - Getting warnings: "only the first element is used as variable name"

I'm trying to use "assign" function, inside a for loop, to assign the value of a file to one variable.
When I call the function, it brings the correct answer but in the end it gives me the following warning messages:
In assign(fileList, read.csv(fileList[i])) :
only the first element is used as variable name
If I run > corr("specdata", 129) I can see the correct answer, it can print all the right values, but If I assign the values to any variable for example, it says that this variable is "NULL". An example:
cr <- corr("specdata", 150)
head(cr)
NULL
It will give me all the values that fit in this criteria but it seems that it can't pass the values to "cr". Other examples:
> class(cr)
[1] "NULL"
> cr
NULL
The code that I'm currently using, in case it is helpful:
corr <- function(directory, threshold){
if (directory == "specdata"){
setwd("C:/Users/User/Downloads/specdata")
fileList <- list.files(pattern="*.csv")
for (i in 1:length(fileList)){
fileValues <- assign(fileList, read.csv(fileList[i]))
okValues <- na.omit(fileValues)
completeCases <- sum(complete.cases(okValues))
if (completeCases > threshold) {
sulfate <- okValues["sulfate"]
nitrate <- okValues["nitrate"]
correlation <- cor(sulfate, nitrate, use="complete.obs", method=c("pearson", "kendall", "spearman"))
#print(correlation)
}
else if (completeCases <= threshold){
#print("0")
}
i = i+1
}
}
else {
print("There's no such directory")
}
}
I'm a beginner in R, so if there's any way to fix this issue, or to read every single file from a folder and manipulate each one separately, I'd be glad to hear it.
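assign() assigns a value to a variable whose name is given as a single string. You are passing it the whole fileList vector as the name, which is why only the first element is used. You don't need assign() here at all; I suspect you want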
fileValues <- read.csv(fileList[i])
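The NULL result is a separate matter: a for loop evaluates to NULL, and the function never returns the correlations it computes, so cr ends up NULL. A minimal sketch of one way to collect and return them is below; corr_sketch, the temporary directory, and the demo files are assumptions made for the example, and setwd() is replaced by passing the directory to list.files().

```r
corr_sketch <- function(directory, threshold = 0) {
  fileList <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  correlations <- numeric(0)
  for (f in fileList) {
    fileValues <- read.csv(f)
    okValues <- na.omit(fileValues)            # keep complete rows only
    if (nrow(okValues) > threshold) {
      correlations <- c(correlations, cor(okValues$sulfate, okValues$nitrate))
    }
  }
  correlations                                 # last expression is the return value
}

# Hypothetical demo data
demo_dir <- file.path(tempdir(), "assign_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = 1:4, nitrate = c(2, 4, 6, 8)),
          file.path(demo_dir, "good.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(1, NA), nitrate = c(2, 3)),
          file.path(demo_dir, "small.csv"), row.names = FALSE)

cr <- corr_sketch(demo_dir, threshold = 2)
class(cr)
```

With these files only good.csv has more than two complete rows, so cr holds a single correlation and is numeric rather than NULL.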

R: why am I not getting the function to return the vector as intended?

corr <- function(directory, threshold) {
files <- list.files(directory, full.names = TRUE)
nu <- numeric()
for(i in length(files)) {
my_data <- read.csv(files[i])
if (sum(complete.cases(my_data)) >= threshold) {
vec_sul <- my_data[complete.cases(my_data),]$sulfate
vec_nit <- my_data[complete.cases(my_data),]$nitrate
nu <- c(nu, cor(vec_sul, vec_nit))
}
}
nu
}
I've a list of .csv files sitting inside the directory I wish to pass as an argument to the function illustrated above. I also pass threshold value as the second argument. The objective is to read through all the files in the directory parameter and check if the files have complete cases more than the threshold value passed as the second arg.
For the files that meet this criterion, I compute the correlation between two of their variables, sulfate and nitrate. The correlation values for all files that have more complete cases than the threshold should be concatenated into a numeric vector. At the end of the loop, I want the function to return the vector of correlation values computed inside the "if" block.
cr <- corr("specdata", 150)
When I run the above line of code in console, I get a numerical variable which is null. Could someone help me fix the code?
Though this kind of error has been seen many times, it still happens. You want
i in 1:length(files)
You get numeric(0) (the "numeric null" you talk about), because your loop only reads in the final file. I guess the final file does not satisfy sum(complete.cases(my_data)) >= threshold so nothing is added to nu, initialized as numeric(0).
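The difference is easy to see on a toy vector. In this sketch the file names are made up; seq_along(files) is used as an alternative to 1:length(files) because it also behaves sensibly when files is empty (1:0 would iterate over 1 and 0).

```r
files <- c("a.csv", "b.csv", "c.csv")

visited_wrong <- integer(0)
for (i in length(files)) visited_wrong <- c(visited_wrong, i)    # runs once, with i = 3

visited_right <- integer(0)
for (i in seq_along(files)) visited_right <- c(visited_right, i) # runs for 1, 2, 3

visited_wrong
visited_right
```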
Also, I would like to point out that
vec_sul <- my_data[complete.cases(my_data),]$sulfate
vec_nit <- my_data[complete.cases(my_data),]$nitrate
nu <- c(nu, cor(vec_sul, vec_nit))
can be replaced by
nu <- c(nu, with(my_data, cor(sulfate, nitrate, use = "complete.obs")))
Consider lapply() across the list of files, which avoids growing a vector inside a loop. The only adjustment is that lapply() returns a list as long as its input, files, hence an else branch is added to fill in NA for data frames that do not meet the threshold. After the call, these NAs are removed from nu.
corr <- function(directory, threshold) {
files <- list.files(directory, full.names = TRUE)
nu <- lapply(files, function(i) {
my_data <- read.csv(i)
if (sum(complete.cases(my_data)) >= threshold) {
vec_sul <- my_data[complete.cases(my_data),]$sulfate
vec_nit <- my_data[complete.cases(my_data),]$nitrate
temp <- cor(vec_sul, vec_nit)
} else {
temp <- NA # SET NAs
}
return(temp)
})
nu <- as.numeric(unlist(nu[!is.na(nu)])) # REMOVE NAs, flatten the list to a numeric vector
return(nu)
}
Alternatively, try vapply() (arguably slightly faster), which lets you specify that each result is a single numeric value and returns a numeric vector directly:
corr <- function(directory, threshold) {
files <- list.files(directory, full.names = TRUE)
nu <- vapply(files, function(i) {
my_data <- read.csv(i)
if (sum(complete.cases(my_data)) >= threshold) {
vec_sul <- my_data[complete.cases(my_data),]$sulfate
vec_nit <- my_data[complete.cases(my_data),]$nitrate
temp <- cor(vec_sul, vec_nit)
} else {
temp <- NA # SET NAs
}
return(temp)
}, numeric(1))
nu <- nu[!is.na(nu)] # REMOVE NAs
return(nu)
}

How to put characters

I have a FASTA format file in which I have to keep only those nodes whose length is less than 100. The problem I am currently facing is that I am able to separate the nodes, but I am not able to put the characters of each node into a separate variable whose length I can then check, so that I can separate the requisite nodes from the longer ones.
So what I mean is: I am able to read the headings and separate the nodes, but how do I put the characters within each node into a variable?
This is a sample of my data
>NODE_1
GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCAGGTCCGGGCTCCACTGCAC
GTAGTCCTCGTTGGACAGCAGCGGGGCGTACGAGGCCAGCTTGACCACGTCGGCGTTGCG
CTCGAGCCGGTCATGAACGCGGCCTCGGCGAGGGCGTTCTTCCAGGCGTTGCCCTGGGAA
>NODE_2
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTAC
>NODE_3
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTACATCATTCCTTAATCTTCC
my code:
x <- readLines("1.fa", n = -1L, ok = TRUE, warn = TRUE)
for (i in 1:length(x)) {
if (substr(x[i],1,1)=='>') {
head <- c(head,x[i])
q <- x[i+1]
if (q=!0) {
contig <- c(contig,q)
print(contig)
contig.length <- c(contig.length, nchar(q))
} else {
break
}
} else {
z <- paste(z,x[i], sep=" ")
}
}
You should use Bioconductor for that. You're actually trying to parse a FASTA file into some kind of list. Bioconductor's Biostrings package has a function readDNAStringSet() that does just that, and returns an object from which you can get the lengths and so on (read.fasta() comes from the CRAN package seqinr). Learning Bioconductor is definitely worth the hassle if you work with sequences.
To do it in base R, you'll need to work with lists, something like :
Split.Fasta <- function(x){
out <- list()
for(i in x){
if(substr(i,1,1)==">") {
name <- gsub(">","",i)
out[[name]] <- character(0)
} else if (grepl("\\w",i)){
out[[name]] <- paste(out[[name]],gsub("\\W","",i),sep="")
}
}
out
}
Which works like :
zz <- textConnection(">NODE_1
GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCAGGTCCGGGCTCCACTGCAC
GTAGTCCTCGTTGGACAGCAGCGGGGCGTACGAGGCCAGCTTGACCACGTCGGCGTTGCG
CTCGAGCCGGTCATGAACGCGGCCTCGGCGAGGGCGTTCTTCCAGGCGTTGCCCTGGGAA
>NODE_2
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTAC
>NODE_3
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTACATCATTCCTTAATCTTCC")
X <- readLines(zz,n=-1L,ok=TRUE,warn=TRUE)
close(zz)
Y <- Split.Fasta(X)
$`NODE_1 `
[1] "GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCA...
$`NODE_2 `
[1] "CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGC...
$`NODE_3 `
[1] "CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCAC...
It returns a list which you can use later on to check lengths and so on :
sapply(Y,nchar)
NODE_1 NODE_2 NODE_3
180 162 180
Still, learn to use BioConductor, you'll thank yourself for that.
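To get from the list of sequences to "only the nodes shorter than 100", index the list with a logical vector of lengths. The sketch below is self-contained and uses a compact variant of the splitting function; split_fasta and the two toy records are made up for illustration.

```r
split_fasta <- function(lines) {
  out <- list()
  name <- NULL
  for (line in lines) {
    if (startsWith(line, ">")) {
      name <- trimws(sub("^>", "", line))   # header line starts a new record
      out[[name]] <- ""
    } else if (!is.null(name) && nzchar(trimws(line))) {
      out[[name]] <- paste0(out[[name]], trimws(line))   # append sequence line
    }
  }
  out
}

fasta_lines <- c(">NODE_A", "ACGTACGT", "ACGT",
                 ">NODE_B", strrep("ACGT", 30))
seqs <- split_fasta(fasta_lines)
lens <- vapply(seqs, nchar, integer(1))   # NODE_A: 12, NODE_B: 120
short <- seqs[lens < 100]                 # keep only nodes shorter than 100
names(short)
```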
You could install the seqinr package, which has lots of methods for analysing sequence data.
install.packages("seqinr")
library(seqinr)
Next, read in your fasta file:
seqs <- read.fasta("myfile.fa")
And then, extract sequences from the list with length < 100:
seqs.small <- seqs[sapply(seqs, function(x) getLength(x) < 100)]
maybe assign would be helpful?
assign('NODE_1', 'GTTGG...')
