Assign string to object based on filename with multiple conditions - r

I'm using some switches in my R script based on the provided data and I would like to automatize the recognition of said data. The files I'm using contain the required information in their name, I'm looking for a good way to match and assign this parts.
file names:
# Folder1:
T090_V4_plate1_S90_L001_R1_001.fastq.gz
T090_V4_plate1_S90_L001_R2_001.fastq.gz
# Folder2:
T091_V4_plate2_S1_L001_R1_001.fastq.gz
T091_V4_plate2_S1_L001_R2_001.fastq.gz
# Folder3:
TNT_2017_13_V34_plate4_S13_L001_R1_001.fastq.gz
TNT_2017_13_V34_plate4_S13_L001_R2_001.fastq.gz
TNT_2017_14_V34_plate4_S14_L001_R1_001.fastq.gz
TNT_2017_14_V34_plate4_S14_L001_R2_001.fastq.gz
the two values I would like to assign to objects are V3 or V34 to the object primerset and plate[1-4] to plate. I tried it like this:
if (length(list.files(pattern = "plate1")) > 1) {
plate <<- "plate1"
} else if (length(list.files(pattern = "plate2")) > 1) {
plate <<- "plate2"
} else if (length(list.files(pattern = "plate3")) > 1) {
plate <<- "plate3"
} else if (length(list.files(pattern = "plate4")) > 1) {
plate <<- "plate4"}
if (length(list.files(pattern = "V4")) > 1) {
primerset <<- "V4"
} else if (length(list.files(pattern = "V34")) > 1) {
primerset <<- "V34"
}
# print message based on detected values from file names
if (primerset == "V34"){
cat("sequence length is 301 bp")
} else if (primerset == "V4"){
cat("sequence length is 250 bp")
}
It works fine, but it looks complicated and easy to fail. Is there a more elegant solution? I would prefer not to load a package just for this task.
Additionally I don't know how to add a break if more than one condition is met, e.g. plate1 and plate2 in the same folder (I have the data sets separated, but just to be on the safe side).
Solution:
Based on the answers below these two versions also test if only one instance of primerset or plate is present:
filenames <- list.files()
if (length(unique(sub(".*_(plate\\d)_.*", "\\1", filenames))) == 1) {
plate <- unique(sub(".*_(plate\\d)_.*", "\\1", filenames))
}
matches = stringr::str_match(filenames, '_(V\\d+)_(plate\\d)')
if (length(unique(matches[, 2])) == 1) {
primerset = unique(matches[, 2])
}

In base R, we can use sub to extract specific part of the string.
primerset <- sub(".*_(V4|V34)_.*", "\\1", x)
#Or more generally
#primerset <- sub(".*_(V\\d+)_.*", "\\1", x)
plate <- sub(".*_(plate\\d)_.*", "\\1", x)
where x is vector of all the filenames
x <- c("T090_V4_plate1_S90_L001_R1_001.fastq.gz",
"T090_V4_plate1_S90_L001_R2_001.fastq.gz",
"T091_V4_plate2_S1_L001_R1_001.fastq.gz",
"T091_V4_plate2_S1_L001_R2_001.fastq.gz",
"TNT_2017_13_V34_plate4_S13_L001_R1_001.fastq.gz",
"TNT_2017_13_V34_plate4_S13_L001_R2_001.fastq.gz",
"TNT_2017_14_V34_plate4_S14_L001_R1_001.fastq.gz",
"TNT_2017_14_V34_plate4_S14_L001_R2_001.fastq.gz")

This calls for a regular expression. Using the {stringr} package, you would write:
matches = stringr::str_match(x, '_(V\\d+)_(plate\\d)')
primerset = matches[, 2]
plate = matches[, 3]
That is: match an underscore, followed by 'V' and a single digit, followed by underscore, followed by 'plate' and a single digit. You can extend the expression to also match the lane, mate and replicate.
Best of all, the above is vectorised so it works correctly with a vector of filenames.
Note that, either way, you should not be using <<- here (this performs global rather than local assignment and is very rarely appropriate).

Related

function with FOR and IF loops

I am writing a function that will go through a list of files in a directory, count number of complete cases, and if the sum of complete cases is above a given threshhold, a correlation should be calculated. The output must be a numeric vector of correlations for all files that meet the threshhold requirement. This is what I have so far (and it gives me an Error: unexpected '}' in "}" Full disclosure - I am a complete newbie, as in wrote my first code 2 weeks ago. What am I doing wrong?
correlation <- function (directory, threshhold = 0) {
all_files <- list.files(path = getwd())
correlations_list <- numeric()
for (i in seq_along(all_files)) {
dataFR2 <- read.csv(all_files[i])
c <- c(sum(complete.cases(dataFR2)))
if c >= threshhold {
d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
correlations_list <- c(correlations_list, d)
}
}
correlations_list
}
"Unexpected *" errors are a syntax error. Often a missing parenthesis, comma, or curly bracket. In this case, you need to change if c >= threshhold { to if (c >= threshhold) {. if() is a function and it requires parentheses.
I'd also strongly recommend that you not use c as a variable name. c() is the most commonly used R function, and giving an object the same name will make your code look very strange to anyone else reading it.
Lastly, I'd recommend that you make your output the same length as the the number of files. As you have it, there won't be any way to know which files met the threshold to have their correlations calculated. I'd make correlations_list have the same length as the number of files, and add names to it so you know which correlation belongs to which file. This has the side benefit of not "growing an object in a loop", which is an anti-pattern known for its inefficiency. A rewritten function would look something like this:
correlation <- function (directory, threshhold = 0) {
all_files <- list.files(path = getwd())
correlations_list <- numeric(length(all_files)) ## initialize to full length
for (i in seq_along(all_files)) {
dataFR2 <- read.csv(all_files[i])
n_complete <- sum(complete.cases(dataFR2))
if(n_complete >= threshhold) {
d <- cor(dataFR2$sulfate, dataFR2$nitrate, use = "complete.obs", method = c("pearson"))
} else {
d <- NA
}
correlations_list[i] <- d
}
names(correlations_list) <- all_files
correlations_list
}

find_element in dataframe in R

I am new to R. I wanted to define a R function, find_element, that takes as its inputs a list and a value of any type, and returns the value of the matched element in the input list that matches the value. thanks for your help
find_element <- function(arr, val){
count = 0
for(i in arr){
if (i == val){
print(count)
} else
count = count + 1
print ("No Match")
}
}
e.g.
arr <- 1:10
find_element(arr, 10)
# 10
find_element(arr, 12)
# NULL
Just for educational purposes, please, try (although this is not recommended practice in R!):
find_element <- function(arr, val) {
count = 1
for (i in arr) {
if (i == val) {
return(count)
} else
count = count + 1
}
return("No Match")
}
This will yield
arr <- 1:10
find_element(arr, 10)
#[1] 10
find_element(arr, 12)
#[1] "No Match"
Please, note
In R, elements of vectors, etc are numbered starting with 1
You have to use return instead of print to indicate the return value of a function (well, I know there's a short cut - but it's for the purpose of education, here)
The final return must come after the for loop.
Built-in function
Also for educational purposes, please, note that Sotos already has shown the R way in his comment:
which(arr == 10)
#[1] 10
which(arr == 12)
#integer(0)
In R, it's almost always better to use the well-documented built-in functions or those from packages. And, yes, try to avoid for loops in R.
Learnig R online
As pointed out in the (now deleted) answer of engAnt there are several ressources to learn R. https://www.rstudio.com/online-learning/#R lists a number of resources.

Dynamic variable names in plots, files and compatibility with loop

I am trying to write a function that makes a plot and saves it into a file automatically.
The trick I struggle with it to do both dynamically [plotname=varname & filename=varname &],
and to make it compatible with calling it from a loop.
# Create data
my_df = cbind(uni=runif (100),norm=rnorm (100),bino=rbinom(100,20, 0.5)); head (my_df)
my_vec = my_df[,'uni'];
# How to make plot and file-name meaningful if you call the variable in a loop?
# if you call by name, the plotname is telling. It is similar what I would like to see.
hist(my_df[,'bino'])
for (plotit in colnames(my_df)) {
hist(my_df[,plotit])
print (plotit)
# this is already not meaningful
}
# step 2 write it into files
hist_auto <- function(variable, col ="gold1", ...) {
if ( length (variable) > 0 ) {
plotname = paste(substitute(variable), sep="", collapse = "_"); print (plotname); is (plotname)
# I would like to define plotname, and later tune it according to my needs
FnP = paste (getwd(),'/',plotname, '.hist.pdf', collapse = "", sep=""); print (FnP)
hist (variable, main = plotname)
#this is apparently not working: I do not get my_df[, "bino"] or anything similar
dev.copy2pdf (file=FnP )
} else { print ("var empty") }
}
hist_auto (my_vec)
# name works, and is meaningful [as much as the var name ... ]
hist_auto (my_df[,'bino'])
# name sort of works, but falls apart
assign (plotit, my_df[,'bino'])
hist_auto (get(plotit))
# name works, but meaningless
# Now in a loop
for (plotit in colnames(my_df)) {
my_df[,plotit]
hist(my_df[,plotit])
## name works, but meaningless and NOT UNIQUE > overwritten by next
}
for (plotit in colnames(my_df)) {
hist_auto(my_df[,plotit])
## name works, but meaningless and NOT UNIQUE > overwritten by next
}
for (plotit in colnames(my_df)) {
assign (plotit, my_df[,plotit])
hist_auto (get(plotit))
## name works, but meaningless and NOT UNIQUE > overwritten by next
}
My aim is to have a function that iterates over eg. columns of a matrix, plots and saves each with a unique and meaningful name.
The solution will probably involve a smart combination of substitute() parse() eval() and paste (), but lacking solid understanding I failed to figure out.
My basis of experimentation was:
how to dynamically call a variable?
How about something like this? You may need to install.packages("ggplot2")
library(ggplot2)
my_df <- data.frame(uni=runif(100),
norm=rnorm(100),
bino=rbinom(100, 20, 0.5))
get_histogram <- function(df, varname, binwidth=1, save=T) {
stopifnot(varname %in% names(df))
title <- sprintf("Histogram of %s", varname)
p <- (ggplot(df, aes_string(x=varname)) +
geom_histogram(binwidth=binwidth) +
ggtitle(title))
if(save) {
filename <- sprintf("histogram_%s.png", gsub(" ", "_", varname))
ggsave(filename, p, width=10, height=8)
}
return(p)
}
for(var in names(my_df))
get_histogram(my_df, var, binwidth=0.5) # If you want to save them
get_histogram(my_df, "uni", binwidth=0.1, save=F) # If you want to look at a specific one
So I ended up with 2 functions, one that can iterate over data frames, and another that takes a single vectors. Using parts of Adrian's [thanks!] solution:
hist_dataframe <- function(variable, col ="gold1", ...) {
stopifnot(colName %in% colnames(df))
variable = df[,colName]
stopifnot(length (variable) >1 )
plotname = paste(substitute(df),'__', colName, sep="")
FnP = paste (getwd(),'/',plotname, '.hist.pdf', collapse = "", sep=""); print (FnP)
hist (variable, main = plotname)
dev.copy2pdf (file=FnP )
}
And the one for simple vectors stays as in Q.

Improve R script efficency

I am trying to match two very big data (nsar & crsp) sets. My code works quite well but needs a lot of time. My procedure works the following way:
Try match via ticker (thereby controlling that NAV (just a number) & date
is the same)
Try match via exact fund name (controlling for NAV & date)
Try match by closest match: search first for same NAV & date --> take list and consider only those companies that are the closest match for both match measures --> take remaining entries and find closest match (but match distance is restricted).
Any suggestions how I could improve the efficiency of the code:
#Go through each nsar entry and try to match with crsp
trackchanges = sapply(seq_along(nsar$fund),function(x){
#Define vars
ticker = nsar$ticker[x]
r_date = format(nsar$r_date[x], "%m%Y")
nav1 = nsar$NAV_share[x]
nav2 = nsar$NAV_sshare[x]
searchbyname = 0
if(nav1 == 0) nav1 = -99
if(nav2 == 0) nav2 = -99
########## If ticker is available --> Merge via ticker and NAV
if(is.na(ticker) == F)
{
#Look for same NAV, date and ticker
found = which(crsp$nasdaq == ticker & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
#If nothing found
if(length(found) == 0)
{
#Mark that you should search by names
searchbyname = 1
} else { #ticker found
#Record crsp_fundno and that match is found
nsar$match[x] = 1
nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
assign("nsar",nsar,envir=.GlobalEnv)
#Return: 1 --> Merged by ticker
return(1)
}
}
###########
########### No Ticker available or found --> Exact name matching
if(is.na(ticker) == T | searchbyname == 1)
{
#Define vars
name = tolower(nsar$fund[x])
company = tolower(nsar$company[x])
#Exact name, date and same NAV
found = which(crsp$fund_name2 == name & crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
#If nothing found
if(length(found) == 0)
{
#####Continue searching by closest match
#First search for nav and date to get list of funds
allfunds = which(crsp$caldt2 == r_date & (round(crsp$mnav,1) == round(nav1,1) | round(crsp$mnav,1) == round(nav2,1)))
allfunds_companies = crsp$company[allfunds]
#Check if anything found
if(length(allfunds) == 0)
{
#Return: 0 --> nothing found
return(0)
}
#Get best match by lev and substring measure for company
levmatch = levenstheinMatch(company, allfunds_companies)
submatch = substringMatch(company, allfunds_companies)
allfunds = levmatch[levmatch %in% submatch]
allfunds_names = crsp$fund_name2[allfunds]
#Check if now anything found
if(length(allfunds) == 0)
{
#Mark match (5=Company not found)
nsar$match[x] = 5
#Save globally
assign("nsar",nsar,envir=.GlobalEnv)
#Return: 5 --> Company not found
return(5)
}
#Get best match by all measures
levmatch = levenstheinMatch(name, allfunds_names)
submatch = substringMatch(name, allfunds_names)
#Only accept if identical
allfunds = levmatch[levmatch %in% submatch]
allfunds_names = crsp$fund_name2[allfunds]
if(length(allfunds) > 0)
{
#Mark match (3=closest name matching)
nsar$match[x] = 3
#Add crsp_fundno to nsar data
nsar$crsp_fundno[x] = crsp$crsp_fundno[allfunds[1]]
#Save globally
assign("nsar",nsar,envir=.GlobalEnv)
#Return 3=closest name matching
return(3)
} else {
#return 0 -> no match
return(0)
}
#####
} else { #If exact name,date,nav found
#Mark match (2=exact name matching)
nsar$match[x] = 2
#Add crsp_fundno to nsar data
nsar$crsp_fundno[x] = crsp$crsp_fundno[found[1]]
#Return 2=exact name matching
return(2)
}
}
})#End sapply
Thank you very much for any help!
Laurenz
The script is too complicated to provide a complete answer, but the basic problem is in the first line
#Go through each nsar entry...
where you set out the problem in an iterative way. R works best with vectors.
Hoist the vectorizable components from the sapply that you start your calculations with. For instance, format the r_date column.
nsar$r_date_f <- format(nsar$r_date, "%m%Y")
This advice applies to lines buried deeper in your code, too, for example calculating the rounded crsp$mnav should be done just once on the entire column
crsp$mnav_r <- round(crsp$mnav, 1)
Use R idioms where appropriate, if "-99" represents a missing value, then use NA
nav1 <- nsar$NAV_share
nav1[nav1 == -99] <- NA
nasr$nav1 <- nav1
Code from other packages that you might use is more likely to treat NA correctly.
Use well-established R functions for more complex queries. This is tricky, but if I'm reading your code correctly your query about "same NAV, date, and ticker" could use merge to do the joins, assuming the columns have been created by vectorized operations earlier in the code, as
nasr1 <- nasr[!is.na(nasr$ticker), , drop=FALSE]
df0 <- merge(nasr1, crsp,
by.x = c("ticker", rdate_r", "nav1_r"),
by.y = c("nasdaq", "caldt2", "mnav_r"))
This does not cover the "|" condition, so additional work would be needed. The plyr, data.table, and sqldf packages (among others) were developed in part to simplify these types of operations, so might be worth investigating as you get more comfortable with vectorized calculations.
It's hard to tell, but I think these three steps address the major challenges in your code.

How to put characters

I have a fasta format file where in i have to only keep those nodes whose length is less than 100. however, the problem i am currently facing is that i am able to separate the nodes but am not able to put the characters of each node in separate variable whose length i can then check and subsequently separate the requisite nodes from longer ones.
So what i mean is i am able to read the headings and separate nodes but how do i put the characters within each node in a variable.
This is a sample of my data
>NODE_1
GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCAGGTCCGGGCTCCACTGCAC
GTAGTCCTCGTTGGACAGCAGCGGGGCGTACGAGGCCAGCTTGACCACGTCGGCGTTGCG
CTCGAGCCGGTCATGAACGCGGCCTCGGCGAGGGCGTTCTTCCAGGCGTTGCCCTGGGAA
>NODE_2
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTAC
>NODE_3
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTACATCATTCCTTAATCTTCC
my code:
x <- readLines("1.fa", n = -1L, ok = TRUE, warn = TRUE)
for (i in 1:length(x)) {
if (substr(x[i],1,1)=='>') {
head <- c(head,x[i])
q <- x[i+1]
if (q=!0) {
contig <- c(contig,q)
print(contig)
contig.length <- c(contig.length, nchar(q))
} else {
break
}
} else {
z <- paste(z,x[i], sep=" ")
}
}
You should use BioConductor for that. You're actually trying to parse a FASTA-file to some kind of a list. Bioconductor has a simple function read.fasta() that does just that, and returns an object where you can get the lengths and so on. Learning bioconductor is definitely worth the hassle if you work with sequences.
To do it in base R, you'll need to work with lists, something like :
Split.Fasta <- function(x){
out <- list()
for(i in x){
if(substr(i,1,1)==">") {
name <- gsub(">","",i)
out[[name]] <- character(0)
} else if (grepl("\\w",i)){
out[[name]] <- paste(out[[name]],gsub("\\W","",i),sep="")
}
}
out
}
Which works like :
zz <- textConnection(">NODE_1
GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCAGGTCCGGGCTCCACTGCAC
GTAGTCCTCGTTGGACAGCAGCGGGGCGTACGAGGCCAGCTTGACCACGTCGGCGTTGCG
CTCGAGCCGGTCATGAACGCGGCCTCGGCGAGGGCGTTCTTCCAGGCGTTGCCCTGGGAA
>NODE_2
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTAC
>NODE_3
CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCACCGACCTGTCCA
CCGCGGACAAGGCCGGTTACCTGCACCGCTACATCGAGGCCAGCCGCATCGCGTTCGCGG
ACCGCGGGCGCTGGGTCGGCGACCCCGCCTTCGAGGACGTACATCATTCCTTAATCTTCC")
X <- readLines(zz,n=-1L,ok=TRUE,warn=TRUE)
close(zz)
Y <- Split.Fasta(X)
$`NODE_1 `
[1] "GTTGGCCGAGCCCCAGGACGCGTGGTTGTTGAACCAGATCA...
$`NODE_2 `
[1] "CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGC...
$`NODE_3 `
[1] "CCTCCGGCGGCACCACGGTCGGCGAGGCCCTCAACATCCTGGAGCGCAC...
It returns a list which you can use later on to check lengths and so on :
sapply(Y,nchar)
NODE_1 NODE_2 NODE_3
180 162 180
Still, learn to use BioConductor, you'll thank yourself for that.
You could install the seqinr package, which has lots of methods for analysing sequence data.
install.packages("seqinr")
Next, read in your fasta file:
seqs <- read.fasta("myfile.fa")
And then, extract sequences from the list with length < 100:
seqs.small <- seqs[sapply(seqs, function(x) getLength(x) < 100)]
maybe assign would be helpful?
assign('NODE_1', 'GTTGG...')

Resources