Rename Multi-FASTA sequence headers using Master File of similar names - r

I have multiple FASTA files that use very simple headers which only identify the specimen. I would like to make the headers more detailed by adding the geographic location, source and culture date.
My first thought is to use the stringr package in R to read in each FASTA and replace any matching sequence ID with the appropriate string.
From an .xlsx containing the specimen IDs and the extra metadata I can create a .txt of strings holding the new names I want. Using this "master text" file I would like to rename each sequence in each FASTA by matching on the specimen ID.
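For reference, a minimal sketch of how the rename file could be generated from the spreadsheet (this assumes the readxl package and a workbook whose columns match the fields below; the file name specimens.xlsx is just a placeholder):
library(readxl)
# Hypothetical file and column names; substitute the real ones from the workbook
meta <- read_excel("specimens.xlsx")
new_names <- paste(meta$SpecimenID, meta$ST, meta$`Geographic Location`,
                   meta$Source, meta$CultDate, sep = "|")
writeLines(c("SpecimenID|ST|Geographic Location|Source|CultDate", new_names),
           "rename.txt")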
So I created rename.txt in the following format:
SpecimenID|ST|Geographic Location|Source|CultDate
VRE32491|736|PUH - 10C|Blood|2016-12-07
VRE32493|1471|PUH - 10N|Tissue/Surgical|2016-12-08
VRE32503|1471|PUH - 11N|Wound|2017-01-05
VRE32504|1471|PUH - EMEP|Blood|2017-01-10
VRE32514|1471|PUH - 6F|Wound|2017-01-20
Using Biostrings::readDNAStringSet() on each FASTA, I can obtain the names of the sequences with names() on the resulting object. I want to create a matching character vector from rename.txt so that I can simply rename the DNAStringSet object with names({DNAStringSet object}) <- {matching vector}.
My problem is that I can't seem to extract that character vector from rename.txt.
Below is some code anyone can use for a reprex to simulate my issue:
cat(
  ">VRE32493", "AGCT",
  ">VRE32503", "CAGT",
  ">VRE32504", "TCAA",
  file = "example.fasta", sep = "\n"
)
cat(
  "SpecimenID|ST|Geographic Location|Source|CultDate",
  "VRE32491|736|PUH - 10C|Blood|2016-12-07",
  "VRE32493|1471|PUH - 10N|Tissue/Surgical|2016-12-08",
  "VRE32503|1471|PUH - 11N|Wound|2017-01-05",
  "VRE32504|1471|PUH - EMEP|Blood|2017-01-10",
  "VRE32514|1471|PUH - 6F|Wound|2017-01-20",
  file = "example.txt", sep = "\n"
)
origMult <- Biostrings::readDNAStringSet("example.fasta")
fasta_rename <- read.delim("example.txt", skip = 1, header = F)
Expected output of example.fasta:
>VRE32493|1471|PUH - 10N|Tissue/Surgical|2016-12-08
AGCT
>VRE32503|1471|PUH - 11N|Wound|2017-01-05
CAGT
>VRE32504|1471|PUH - EMEP|Blood|2017-01-10
TCAA

Thanks to assistance elsewhere, I worked out the following solution.
# Recreate the example files from the question
cat(
  ">VRE32493", "AGCT",
  ">VRE32503", "CAGT",
  ">VRE32504", "TCAA",
  file = "example.fasta", sep = "\n"
)
cat(
  "SpecimenID|ST|Geographic Location|Source|CultDate",
  "VRE32491|736|PUH - 10C|Blood|2016-12-07",
  "VRE32493|1471|PUH - 10N|Tissue/Surgical|2016-12-08",
  "VRE32503|1471|PUH - 11N|Wound|2017-01-05",
  "VRE32504|1471|PUH - EMEP|Blood|2017-01-10",
  "VRE32514|1471|PUH - 6F|Wound|2017-01-20",
  file = "example.txt", sep = "\n"
)
# Read in from example.txt
# (read_file() returns the whole file as a single string; splitting on "\r"
#  assumes Windows CRLF line endings)
fasta_rename <- readr::read_file("example.txt")
fasta_rename <- unlist(strsplit(fasta_rename, "\\r"))
fasta_rename <- stringr::str_remove(fasta_rename[-1], "\\n")  # drop the header row
fasta_rename <- fasta_rename[-length(fasta_rename)]           # drop the trailing empty element
# Remove everything after first | to get the pattern to match off of
patterns <- stringr::str_remove(fasta_rename, "\\|(.+)$")
# Make fasta_rename a named character vector in the form of patterns = fasta_rename
names(fasta_rename) <- patterns
fasta_rename # print to verify
# Read in and split example.fasta the same way
example_fasta <- readr::read_file("example.fasta")
example_fasta <- unlist(strsplit(example_fasta, "\\r"))
example_fasta <- stringr::str_remove(example_fasta, "\\n")
example_fasta <- example_fasta[-length(example_fasta)]
example_fasta # print to verify
cat(stringr::str_replace_all(example_fasta, fasta_rename),
file = "example.fasta",
sep = "\n")
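For comparison, a more direct sketch that avoids the line-ending juggling by letting read.delim() and match() do the work (it assumes every header in the FASTA has a matching SpecimenID row in rename.txt):
library(Biostrings)

origMult <- readDNAStringSet("example.fasta")

# Parse the rename file as pipe-delimited fields, then rebuild the full header strings
ren <- read.delim("example.txt", sep = "|", stringsAsFactors = FALSE)
new_names <- do.call(paste, c(ren, sep = "|"))

# Each current header is just the specimen ID, so look it up in the SpecimenID column
idx <- match(names(origMult), ren$SpecimenID)
names(origMult) <- new_names[idx]

writeXStringSet(origMult, "example_renamed.fasta")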

Related

How can I read a table in a loosely structured text file into a data frame in R?

Take a look at the "Estimated Global Trend daily values" file on this NOAA web page. It is a .txt file with something like 50 header lines (identified with leading #s) followed by several thousand lines of tabular data. The link to download the file is embedded in the code below.
How can I read this file so that I end up with a data frame (or tibble) with the appropriate column names and data?
All the text-to-data functions I know get stymied by those header lines. Here's what I just tried, riffing off of this SO Q&A. My thought was to read the file into a list of lines, then drop the lines that start with # from the list, then do.call(rbind, ...) the rest. The downloading part at the top works fine, but when I run the function, I'm getting back an empty list.
temp <- paste0(tempfile(), ".txt")
download.file("ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_trend_gl.txt",
              destfile = temp, mode = "wb")

processFile = function(filepath) {
  dat_list <- list()
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    append(dat_list, line)
  }
  close(con)
  return(dat_list)
}

dat_list <- processFile(temp)
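(As a side note, the list comes back empty because append() returns a new list rather than modifying dat_list in place, so the result of append(dat_list, line) is discarded; it would need to be assigned back with dat_list <- append(dat_list, line). The answers below avoid the line-by-line loop entirely.)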
Here's a possible alternative
processFile = function(filepath, header = TRUE, ...) {
  lines <- readLines(filepath)
  comments <- which(grepl("^#", lines))
  header_row <- gsub("^#", "", lines[tail(comments, 1)])
  data <- read.table(text = c(header_row, lines[-comments]), header = header, ...)
  return(data)
}
processFile(temp)
The idea is that we read in all the lines, find the ones that start with "#" and ignore them except for the last one which will be used as the header. We remove the "#" from the header (otherwise it's usually treated as a comment) and then pass it off to read.table to parse the data.
Here are a few options that bypass your function and that you can mix & match.
In the easiest (albeit unlikely) scenario where you know the column names already, you can use read.table and enter the column names manually. The default option of comment.char = "#" means those comment lines will be omitted.
read.table(temp, col.names = c("year", "month", "day", "cycle", "trend"))
More likely is that you don't know those column names, but can get them by figuring out how many comment lines there are, then reading just the last of those lines. That saves you having to read more of the file than you need; this is a small enough file that it shouldn't make a huge difference, but in a larger file it might. I'm doing the counting by accessing the command line, only because that's the way I know how. Note also that I saved the file at an easier path; you could instead paste the command together with the temp variable.
Again, the comments are omitted by default.
n_comments <- as.numeric(system("grep '^# ' co2.txt | wc -l", intern = TRUE))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)
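If grep isn't available, or you don't want to hard-code the saved file's name, the count can also be done in R itself; this variant does read the whole file once, which the shortcut above was trying to avoid, but for a file this size it hardly matters:
# Count the leading comment lines without shelling out
n_comments <- sum(grepl("^# ", readLines(temp)))
hdrs <- scan(temp, skip = n_comments - 1, nlines = 1, what = "character")[-1]
read.table(temp, col.names = hdrs)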
Or with dplyr and stringr, read all the lines, separate out the comments to extract column names, then filter to remove the comment lines and separate into fields, assigning the column names you've just pulled out. Again, with a bigger file, this could become burdensome.
library(dplyr)

lines <- data.frame(text = readLines(temp), stringsAsFactors = FALSE)

comments <- lines %>%
  filter(stringr::str_detect(text, "^#"))

hdrs <- strsplit(comments[nrow(comments), 1], "\\s+")[[1]][-1]

lines %>%
  filter(!stringr::str_detect(text, "^#")) %>%
  mutate(text = trimws(text)) %>%
  tidyr::separate(text, into = hdrs, sep = "\\s+") %>%
  mutate_all(as.numeric)

How can I read a ".da" file directly into R?

I want to work with the Health and Retirement Study in R. Their website provides ".da" files and a SAS extract program. The SAS program reads the ".da" files like a fixed width file:
libname EXTRACT 'c:\hrs1994\sas\' ;
DATA EXTRACT.W2H;
INFILE 'c:\hrs1994\data\W2H.DA' LRECL=358;
INPUT
HHID $ 1-6
PN $ 7-9
CSUBHH $ 10-10
ETC ETC
;
LABEL
HHID ="HOUSEHOLD IDENTIFIER"
PN ="PERSON NUMBER"
CSUBHH ="1994 SUB-HOUSEHOLD IDENTIFIER"
ASUBHH ="1992 SUB-HOUSEHOLD IDENTIFIER"
ETC ETC
;
1) What type of file is this? I can't find anything about this file type.
2) Is there an easy way to read this into R without the intermediate step of exporting a .csv from SAS? Is there a way for read.fwf() to work without explicitly stating hundreds of variable names?
Thank you!
After a little more research, it appears that you can use the Stata dictionary files (*.DCT) to retrieve the formatting for the data files (*.DA). For this to work you will need to download both the "Data files" .zip file and the "Stata data descriptors" .zip file from the HRS website. Just remember to use the correct dictionary file for each data file, i.e. use the "W2FA.DCT" file to define "W2FA.DA".
library(readr)

# Set path to the data file "*.DA"
data.file <- "C:/h94da/W2FA.DA"
# Set path to the dictionary file "*.DCT"
dict.file <- "C:/h94sta/W2FA.DCT"

# Read the dictionary file
df.dict <- read.table(dict.file, skip = 1, fill = TRUE, stringsAsFactors = FALSE)

# Set column names for dictionary dataframe
colnames(df.dict) <- c("col.num", "col.type", "col.name", "col.width", "col.lbl")

# Remove last row, which only contains a closing }
df.dict <- df.dict[-nrow(df.dict), ]

# Extract the numeric value from the column width field
df.dict$col.width <- as.integer(sapply(df.dict$col.width, gsub,
                                       pattern = "[^0-9\\.]", replacement = ""))

# Convert Stata column types to the single-letter codes used by read_fwf
# ("i" = integer, "n" = number, "d" = double, "c" = character)
df.dict$col.type <- sapply(df.dict$col.type,
                           function(x) ifelse(x %in% c("int", "byte", "long"), "i",
                                       ifelse(x == "float", "n",
                                       ifelse(x == "double", "d", "c"))))

# Read the data file into a dataframe using the widths and names from the dictionary
df <- read_fwf(file = data.file,
               fwf_widths(widths = df.dict$col.width, col_names = df.dict$col.name),
               col_types = paste(df.dict$col.type, collapse = ""))

# Add column labels to headers
attributes(df)$variable.labels <- df.dict$col.lbl
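Since you will typically repeat this for every .DA/.DCT pair in a wave, the same steps can be wrapped in a small helper. This is only a sketch (the function name is mine, and it assumes every dictionary file follows the same layout as W2FA.DCT):
# Hypothetical helper: read one HRS .DA file using its matching .DCT dictionary
read_hrs_da <- function(data.file, dict.file) {
  df.dict <- read.table(dict.file, skip = 1, fill = TRUE, stringsAsFactors = FALSE)
  colnames(df.dict) <- c("col.num", "col.type", "col.name", "col.width", "col.lbl")
  df.dict <- df.dict[-nrow(df.dict), ]
  df.dict$col.width <- as.integer(gsub("[^0-9\\.]", "", df.dict$col.width))
  df.dict$col.type <- ifelse(df.dict$col.type %in% c("int", "byte", "long"), "i",
                      ifelse(df.dict$col.type == "float", "n",
                      ifelse(df.dict$col.type == "double", "d", "c")))
  df <- readr::read_fwf(data.file,
                        readr::fwf_widths(df.dict$col.width, df.dict$col.name),
                        col_types = paste(df.dict$col.type, collapse = ""))
  attributes(df)$variable.labels <- df.dict$col.lbl
  df
}

# e.g. df <- read_hrs_da("C:/h94da/W2FA.DA", "C:/h94sta/W2FA.DCT")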

Transform dput(remove) data.frame from txt file into R object with commas

I have a txt file (remove.txt) with this kind of data (RGB hex colors):
"#DDDEE0", "#D8D9DB", "#F5F6F8", "#C9CBCA"...
These are colors I don't want in my analysis.
I also have an R object (nacreHEX) with similar data, containing both the colors I want to keep and the colors I want to remove. I use this code to remove them:
nacreHEX <- nacreHEX[!nacreHEX %in% remove]
This works when remove is an R object like remove <- c("#DDDEE0", "#D8D9DB", ...), but it doesn't work when remove comes from the txt file and I turn it into a data.frame, nor when I try remove2 <- as.vector(t(remove)).
So here is my code:
remove <- read.table("remove.txt", sep=",")
remove2 <-as.vector(t(remove))
nacreHEX <- nacreHEX [! nacreHEX %in% remove2]
head(nacreHEX)
With this, there are no commas in the as.vector output, so maybe that's why it doesn't work.
How can I make an R vector out of this kind of data?
What step did I forget?
The problem is that your txt file is separated by ", " (comma plus space), not just ",". The spaces end up in your strings:
rr = read.table(text = '"#DDDEE0", "#D8D9DB", "#F5F6F8", "#C9CBCA"', sep = ",")
(rr = as.vector(t(rr)))
# [1] "#DDDEE0" " #D8D9DB" " #F5F6F8" " #C9CBCA"
You can see the leading spaces before the #. We can trim these spaces with trimws().
trimws(rr)
# [1] "#DDDEE0" "#D8D9DB" "#F5F6F8" "#C9CBCA"
Even better, you can use the argument strip.white to have read.table do it for you:
rr = read.table(text = '"#DDDEE0", "#D8D9DB", "#F5F6F8", "#C9CBCA"',
sep = ",", strip.white = TRUE)

rbind txt files from online directory (R)

I am trying to concatenate text files from a URL, but I don't know how to handle the HTML and the different folders.
This is the code I tried, but it only lists the text files, and the result has a lot of HTML mixed in. How do I fix this so that I can combine the text files into one csv file?
library(RCurl)
library(readr)

url <- "http://weather.ggy.uga.edu/data/daily/"
dir <- getURL(url, dirlistonly = T)
filenames <- unlist(strsplit(dir, "\n"))  # split into filenames

# append the files one after another
for (i in 1:length(filenames)) {
  file <- paste0(url, filenames[i])  # build the full url
  if (i == 1) {
    cp <- read_delim(file, col_names = FALSE, delim = ",")
  } else {
    temp <- read_delim(file, col_names = FALSE, delim = ",")
    cp <- rbind(cp, temp)  # append to existing file
    rm(temp)  # remove the temporary file
  }
}
Here is a code snippet that I got to work. I like to use rvest over RCurl, just because that's what I've learned. In this case, I was able to use the html_nodes function to isolate each link ending in .txt. The resulting table has the times saved as character strings, but you could fix that later. Let me know if you have any questions.
library(rvest)
library(readr)

url <- "http://weather.ggy.uga.edu/data/daily/"
doc <- xml2::read_html(url)
text <- rvest::html_text(rvest::html_nodes(doc, "tr td a:contains('.txt')"))

# define column types of fwf data ("c" = character, "n" = number)
ctypes <- paste0("c", paste0(rep("n", 11), collapse = ""))

data <- data.frame()
# loop over the first two files as a demonstration; use seq_along(text) for all of them
for (i in 1:2) {
  file <- paste0(url, text[i])
  date <- as.Date(read_lines(file, n_max = 1), "%m/%d/%y")
  # Read file to determine widths
  columns <- fwf_empty(file, skip = 3)
  # Manually expand `solar` column to be 3 spaces wider
  columns$begin[8] <- columns$begin[8] - 3
  data <- rbind(data, cbind(date, read_fwf(file, columns, skip = 3, col_types = ctypes)))
}
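Since the goal was a single csv file, the combined data frame can then be written out (the file name here is just a placeholder):
readr::write_csv(data, "daily_combined.csv")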

Extracting file numbers from file names in r and looping through files

I have a folder full of .txt files that I want to loop through and combine into one data frame, but each .txt file contains data for one subject, and there are no columns in the text files that indicate the subject number or the time point in the study (e.g. 1-5). Each file is labeled something like "4325.5_ERN_No_Startle", so I need to add a line or two of code to my loop that looks for the string of four numbers in the file name and creates one column with 4325 and another column with 5, repeated for every data point for that subject until the loop gets to the next file. I have been looking for a while but am still coming up empty; any suggestions?
I also have not quite gotten the loop to work:
path = "/Users/me/Desktop/Event Codes/ERN task/ERN text files transferred"
out.file <- ""
file <- ""
file.names <- dir(path, pattern = ".txt")
for (i in 1:length(file.names)) {
  file <- read.table(file.names[i], header = FALSE, fill = TRUE)
  out.file <- rbind(out.file, file)
}
which runs okay until I get this error message part way through:
Error in read.table(file.names[i], header = FALSE, fill = TRUE) :
no lines available in input
Consider using regex to parse each file name for the study period and subject, both of which are then bound to the data inside an lapply over list.files:
path = "path/to/text/files"

# ANY TXT FILE WITH PATTERN OF 4 DIGITS FOLLOWED BY A PERIOD AND ONE DIGIT
file.names <- list.files(path, pattern = "*[0-9]{4}\\.[0-9]{1}.*txt", full.names = TRUE)

# IMPORT ALL FILES INTO A LIST OF DATAFRAMES AND BIND THE REGEX EXTRACTS
dfList <- lapply(file.names, function(x) {
  if (file.exists(x)) {
    data.frame(period = regmatches(x, gregexpr('[0-9]{4}', x))[[1]],
               subject = regmatches(x, gregexpr('\\.[0-9]{1}', x))[[1]],
               read.table(x, header = FALSE, fill = TRUE),
               stringsAsFactors = FALSE)
  }
})

# COMBINE EACH DATA FRAME INTO ONE
df <- do.call(rbind, dfList)

# REMOVE PERIOD IN SUBJECT (NEEDED EARLIER FOR SPECIAL DIGIT)
df['subject'] <- sapply(df['subject'], function(x) gsub("\\.", "", x))
You can try to use tryCatch, which basically would give you a NULL instead of an error:
file <- tryCatch(read.table(file.names[i], header = FALSE, fill = TRUE), error = function(e) NULL)
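With that in place, the NULL just needs to be skipped before the rbind; a minimal sketch of the adjusted loop:
out.file <- data.frame()  # start from an empty data frame rather than ""
for (i in 1:length(file.names)) {
  file <- tryCatch(read.table(file.names[i], header = FALSE, fill = TRUE),
                   error = function(e) NULL)
  if (!is.null(file)) {  # skip files that could not be read
    out.file <- rbind(out.file, file)
  }
}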
