I'm just starting out with R and trying to get a grasp of some of the built-in functions. I'm trying to organize a basic FASTA text file that looks like this:
>ID1
AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
>ID2
TCCAATTAAGTCCCTATCCAGGCGCTCCG
>ID3
GAACCGGAGAACGCTTCAGACCAGCCCGGAC
Into a table that'd look something like this:
ID Sequence
ID1 AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
ID2 TCCAATTAAGTCCCTATCCAGGCGCTCCG
ID3 GAACCGGAGAACGCTTCAGACCAGCCCGGAC
Or at least something organized in a similar manner. Unfortunately, whenever I try to use read.table, I'm forced to set fill = TRUE to avoid the following error:
> read.table("ReadingText.txt", header=F, fill=F, sep=">")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 2 did not have 2 elements
Setting fill = TRUE doesn't solve the problem as it just introduces unwanted blank fields. I feel like my problem is that R wants to treat each new line from the input as a new row in the output, whereas I'm expecting it to start a new row only at each ">" and move to the next column of the same row at each new line of the input.
So, how would you get this to work? Is read.table just the wrong function to be trying to do this with or is there something else? Also, I'd really like to accomplish this without using any packages! I want to get a good grasp of the built-in functions in R.
Thanks for taking the time to read this and apologies if I've done anything wrong posting this here. This is the first time I've asked anything.
It would take some tricky post-processing to do this with read.table() or readLines(). There is a function read.fasta() in the seqinr package that can get you most of the way there. Then we just turn the resulting list into a data frame.
library(seqinr)
(fasta <- read.fasta("so.fasta", set.attributes = FALSE, as.string = TRUE, forceDNAtolower = FALSE))
# $ID1
# [1] "AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC"
#
# $ID2
# [1] "TCCAATTAAGTCCCTATCCAGGCGCTCCG"
#
# $ID3
# [1] "GAACCGGAGAACGCTTCAGACCAGCCCGGAC"
setNames(rev(stack(fasta)), c("ID", "Sequence"))
# ID Sequence
# 1 ID1 AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
# 2 ID2 TCCAATTAAGTCCCTATCCAGGCGCTCCG
# 3 ID3 GAACCGGAGAACGCTTCAGACCAGCCCGGAC
where the file so.fasta is
writeLines(">ID1
AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
>ID2
TCCAATTAAGTCCCTATCCAGGCGCTCCG
>ID3
GAACCGGAGAACGCTTCAGACCAGCCCGGAC", "so.fasta")
Note: Pascal makes a good point in the comments. When a tool already exists for your specific task, take advantage of it. There's no need to spend time forcing functions that aren't right for the job when someone has already gone to the trouble of packaging a solution to the same problem.
Update: Actually, it's not that difficult using readLines(), so long as you have a nice clean file. Here is a possible solution using only base functions.
x <- readLines("so.fasta")
ids <- grepl("^>", x)
data.frame(ID = sub(">", "", x[ids]), Sequence = x[!ids])
# ID Sequence
# 1 ID1 AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
# 2 ID2 TCCAATTAAGTCCCTATCCAGGCGCTCCG
# 3 ID3 GAACCGGAGAACGCTTCAGACCAGCCCGGAC
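The two-line solution above assumes each sequence sits on a single line. Real FASTA files often wrap a sequence across several lines; here is a sketch of a base-R extension (still no packages) that concatenates the lines belonging to each record, using a small made-up example file:

```r
# a small FASTA file where ID2's sequence wraps onto two lines
writeLines(c(">ID1", "AGAATAGCC", ">ID2", "TCCAATT", "AAGTCCC"), "so_wrapped.fasta")

x   <- readLines("so_wrapped.fasta")
ids <- grepl("^>", x)
rec <- cumsum(ids)  # record number for every line: each sequence line inherits the last header's count
seqs <- tapply(x[!ids], factor(rec[!ids], levels = seq_len(sum(ids))),
               paste, collapse = "")
data.frame(ID = sub("^>", "", x[ids]), Sequence = unname(seqs))
#    ID       Sequence
# 1 ID1      AGAATAGCC
# 2 ID2 TCCAATTAAGTCCC
```

The explicit factor levels keep the records in file order even when there are more than nine of them (otherwise the grouping would sort "10" before "2").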
Related
I have this VCF format file that I want to read into R. However, the file contains some redundant lines that I want to skip, so that the result starts at the row matching #CHROM.
This is what I have tried:
chromo1<-try(scan(myfile.vcf,what=character(),n=5000,sep="\n",skip=0,fill=TRUE,na.strings="",quote="\"")) ## find the start of the vcf file
skip.lines<-grep("^#CHROM",chromo1)
column.labels<-read.delim(myfile.vcf,header=F,nrows=1,skip=(skip.lines-1),sep="\t",fill=TRUE,stringsAsFactors=FALSE,na.strings="",quote="\"")
num.vars<-dim(column.labels)[2]
myfile.vcf
#not wanted line
#unnecessary line
#junk line
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
result
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
Maybe this could be good for you:
# read the vcf file twice: once for the column names, once for the data
# (read.table drops the "#"-prefixed header lines automatically, since comment.char = "#" by default)
tmp_vcf<-readLines("test.vcf")
tmp_vcf_data<-read.table("test.vcf", stringsAsFactors = FALSE)
# keep everything up to the #CHROM line; the last remaining line holds the column names
tmp_vcf<-tmp_vcf[-(grep("#CHROM",tmp_vcf)+1):-(length(tmp_vcf))]
vcf_names<-unlist(strsplit(tmp_vcf[length(tmp_vcf)],"\t"))
names(tmp_vcf_data)<-vcf_names
p.s.: If you have several vcf files then you should use the lapply function.
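For the several-files case, a hedged sketch of that lapply approach: wrap the steps above in a function and map it over the paths (the file names here are hypothetical placeholders):

```r
read_vcf <- function(path) {
  lines <- readLines(path)
  # the #CHROM line carries the column names, separated by tabs
  header <- lines[grep("^#CHROM", lines)[1]]
  vcf_names <- unlist(strsplit(header, "\t"))
  # read.table skips the "#"-prefixed lines by default (comment.char = "#")
  vcf_data <- read.table(path, stringsAsFactors = FALSE)
  names(vcf_data) <- vcf_names
  vcf_data
}

vcf_list <- lapply(c("a.vcf", "b.vcf"), read_vcf)  # hypothetical file names
```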
data.table::fread reads it as intended, see example:
library(data.table)
#try this example vcf from GitHub
vcf <- fread("https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf")
#or if the file is local:
vcf <- fread("path/to/my/vcf/sample.vcf")
We can also use vcfR package, see the manuals in the link.
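As a minimal sketch of that vcfR route (assuming a local file called sample.vcf; the file name is made up here):

```r
library(vcfR)
vcf <- read.vcfR("sample.vcf")  # hypothetical local file; returns a vcfR object
head(getFIX(vcf))               # fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER
```

vcfR also offers helpers such as extract.gt() and vcfR2tidy() if you need the genotype fields rather than just the fixed columns.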
I don't know how fread read the VCF correctly in the answer above, but in general use skip to give the string that marks the header row (or, if an integer, the number of rows to skip).
library(data.table)
df = fread(file='some.vcf', sep='\t', header = TRUE, skip = '#CHROM')
I am looking for a method, not code, as a solution. Any suggestions are welcome.
Here is a sample data that is corrupted (commas should not have been there). By the way I don't have any control over the csv files I receive.
A B C
1.1 1,859.3 52.1
0 12.2 123
In csv format it looks like:
A,B,C
1.1 ,1,859.3,52.1
0,12.2,123
But then, when I read it in using R, row 1 has an extra column, and that is an error. Is there any comfortable way to identify whether a csv file has an error like this extra column? I could write a bunch of nested loops that check the length of each row, but I am talking about 1000 csvs with 100000 rows each. It would take forever. Please help. Any method is appreciated.
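One fast base-R way to detect this kind of corruption without parsing whole files into data frames is count.fields(), which scans a file once and reports the number of fields on every line; rows whose count differs from the header's are suspect. A small self-contained sketch using a made-up file:

```r
# recreate the corrupted sample: the thousands comma in "1,859.3" adds a fourth field
writeLines(c("A,B,C", "1.1,1,859.3,52.1", "0,12.2,123"), "bad.csv")

n <- count.fields("bad.csv", sep = ",")
which(n != n[1])  # lines whose field count differs from the header's
# [1] 2
```

Looping this check over 1000 file names with sapply() stays cheap, because count.fields() never builds a data frame.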
Save to csv using a different separator, e.g. ;
Then you would have something like
A;B;C
1.1;1,859.3;52.1
0;12.2;123
The code is simple, but note that write.csv() ignores the sep argument, so use write.table() (or write.csv2(), whose default separator is ";"):
write.table(..., sep = ";", row.names = FALSE)
read.csv(..., sep = ";")
Say I have the first test.csv that looks like this
,a,b,c,d,e
If I try to read it using read.csv, it works fine.
read.csv("test.csv",header=FALSE)
# V1 V2 V3 V4 V5 V6
#1 NA a b c d e
#Warning message:
#In read.table(file = file, header = header, sep = sep, quote = quote, :
# incomplete final line found by readTableHeader on 'test.csv'
However, if I attempt to read this file using fread, i get an error instead.
require(data.table)
fread("test.csv",header=FALSE)
#Error in fread("test.csv", header = FALSE) :
# Not positioned correctly after testing format of header row. ch=','
Why does this happen and what can I do to correct this?
As for me, my problem was only that the first ? rows of my file had a missing ID value.
So I was able to solve the problem by specifying autostart to be sufficiently far into the file that a nonmissing value popped up:
fread("test.csv", autostart = 100L, skip = "A")
This guarantees that when fread attempts to automatically identify sep and sep2, it does so at a well-formatted place in the file.
Specifying skip also makes sure fread finds the correct row in which to base the names of the columns.
If indeed there are no nonmissing values for the first field, you're better off just deleting that field from the .csv with Richard Scriven's approach or a find-and-replace in your favorite text editor.
I think you could use the skip/select/drop arguments of the fread function for this purpose.
fread("myfile.csv",sep=",",header=FALSE,skip="A") # skip rows until the first line containing "A"
fread("myfile.csv",sep=",",header=FALSE,select=c(2,3,4,5)) # read only columns 2 to 5
fread("myfile.csv",sep=",",header=FALSE,drop=1) # drop the first column by position
I've tried making that csv file and running the code. It seems to work now - same for other people? I thought it might be an issue with not having a new line at the end (hence the warning from read.csv), but fread copes fine whether there's a new line at the end or not.
I know that in R I can read in a csv file using read.csv. I also know that by setting header = TRUE I can indicate to R that there is a header with variable names on the first row.
However, I am trying to read in a csv that places a timestamp on the first row and the header / variable names on the second. I can obviously manually strip off the first line before loading it into R, but it’s a pain to do this each time. Is there an elegant solution to this in R?
Use the skip argument to read.csv
read.csv(.... , skip=1)
For the subjective "elegant", you may want to look at fread from "data.table" which generally does a good job of figuring out where the data actually start.
An example:
Create a fake CSV file in our workspace
The first line has "something" and the actual data starts on the second line with the headers "V1", "V2", and "V3".
x <- tempfile()
cat("something",
"V1,V2,V3",
"1,2,3", "4,5,6", "7,8,9", sep = "\n", file = x)
Load "data.table" and try fread
Seems to work out of the box! Obviously replace x with the name of your actual CSV file.
library(data.table)
fread(x)
# V1 V2 V3
# 1: 1 2 3
# 2: 4 5 6
# 3: 7 8 9
I imported a dataset with no column headings, and I'm trying to label the columns for convenience. I've used R quite a bit before, so I'm confused as to why this code isn't working:
library(mosaic)
`0605WindData` <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Station = 0605WindData[,1]
Error: unexpected symbol in "Station = 0605WindData"
I swear I have experience with R (albeit I'm a bit out of practice), but I seem to be stuck on something pretty simple. I know I've used this select column command before. Suggestions?
You forgot to quote the object name when subsetting:
> `0605WindData` <- data.frame(A = 1:10, B = 1:10)
> `0605WindData`[,1]
[1] 1 2 3 4 5 6 7 8 9 10
As Roman points out, object names are not supposed to start with a digit. Your read.csv() line only worked because you back-tick quoted the object name. You have to continue to quote the object in every line of code now because you used a non-standard name for that object. Save yourself some trouble and change the name of the object you assign the output from read.csv() to.
`0605WindData` <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Station = 0605WindData[,1]
Instead of quoting the variable name, start it with a letter, e.g.
winddata060 <- read.csv("~/pathnamehere/0605WindData.txt", header=F)
Now select the required variable name
Station = winddata060[,1]
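Since the original goal was labelling the columns for convenience, you can also assign all the names at once instead of extracting columns one by one. A sketch (the column names here are made-up placeholders; substitute your real ones):

```r
winddata060 <- read.csv("~/pathnamehere/0605WindData.txt", header = FALSE)
# hypothetical labels -- replace with whatever the columns actually hold
names(winddata060) <- c("Station", "WindSpeed", "Direction")
Station <- winddata060$Station
```

Equivalently, read.csv() accepts col.names = c("Station", "WindSpeed", "Direction") so the labels are set at read time.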