I know that in R I can read a CSV file with read.csv, and that setting header = TRUE tells R that the first row holds the variable names.
However, I'm trying to read a CSV that has a timestamp on the first row and the header / variable names on the second. I could obviously strip off the first line manually before loading the file into R, but it's a pain to do this each time. Is there an elegant solution to this in R?
Use the skip argument to read.csv:
read.csv(..., skip = 1)
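For example, with a hypothetical file whose first line is a timestamp (the file and its contents below are invented for illustration):

```r
# hypothetical example: the first line is a timestamp, the real header is line 2
f <- tempfile(fileext = ".csv")
writeLines(c("2014-06-20 09:31:22", "a,b,c", "1,2,3", "4,5,6"), f)

# skip = 1 drops the timestamp line; read.csv then treats the next line as the header
dat <- read.csv(f, skip = 1)
dat
#   a b c
# 1 1 2 3
# 2 4 5 6
```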
For the subjective "elegant", you may want to look at fread from "data.table" which generally does a good job of figuring out where the data actually start.
An example:
First, create a fake CSV file in our workspace. The first line contains "something" and the actual data start on the second line, with the headers "V1", "V2", and "V3".
x <- tempfile()
cat("something",
    "V1,V2,V3",
    "1,2,3", "4,5,6", "7,8,9", sep = "\n", file = x)
Load "data.table" and try fread. It seems to work out of the box! Obviously, replace x with the name of your actual CSV file.
library(data.table)
fread(x)
# V1 V2 V3
# 1: 1 2 3
# 2: 4 5 6
# 3: 7 8 9
I have this file in VCF format that I want to read into R. However, the file contains some redundant lines that I want to skip; I want the result to start at the row matching #CHROM, as shown below.
This is what I have tried:
chromo1 <- try(scan("myfile.vcf", what = character(), n = 5000, sep = "\n",
                    skip = 0, fill = TRUE, na.strings = "", quote = "\"")) ## find the start of the vcf file
skip.lines <- grep("^#CHROM", chromo1)
column.labels <- read.delim("myfile.vcf", header = FALSE, nrows = 1, skip = (skip.lines - 1),
                            sep = "\t", fill = TRUE, stringsAsFactors = FALSE,
                            na.strings = "", quote = "\"")
num.vars <- dim(column.labels)[2]
myfile.vcf
#not wanted line
#unnecessary line
#junk line
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
result
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
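Update: for completeness, here is the attempt above finished end-to-end on a toy file that reproduces this layout (the temp file and its contents are invented for illustration):

```r
# toy file reproducing the layout above (contents invented for illustration)
f <- tempfile(fileext = ".vcf")
writeLines(c("#not wanted line",
             "#unnecessary line",
             "#junk line",
             "#CHROM\tPOS\tID\tREF\tALT",
             "11\t33443\t3\tA\tT",
             "12\t33445\t5\tA\tG"), f)

# find the header line, read it for the column labels,
# then read everything after it as the data block
chromo1 <- scan(f, what = character(), n = 5000, sep = "\n",
                fill = TRUE, na.strings = "", quote = "\"")
skip.lines <- grep("^#CHROM", chromo1)
column.labels <- read.delim(f, header = FALSE, nrows = 1, skip = skip.lines - 1,
                            sep = "\t", stringsAsFactors = FALSE)
vcf.data <- read.delim(f, header = FALSE, skip = skip.lines,
                       sep = "\t", stringsAsFactors = FALSE)
names(vcf.data) <- unlist(column.labels)
```

After this, vcf.data holds the two data rows with "#CHROM", "POS", "ID", "REF", "ALT" as column names.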
Maybe this could be good for you:
# read the vcf file twice: first for the column names, then for the data
tmp_vcf <- readLines("test.vcf")
tmp_vcf_data <- read.table("test.vcf", stringsAsFactors = FALSE)
# filter for the column names
tmp_vcf <- tmp_vcf[-(grep("#CHROM", tmp_vcf) + 1):-(length(tmp_vcf))]
vcf_names <- unlist(strsplit(tmp_vcf[length(tmp_vcf)], "\t"))
names(tmp_vcf_data) <- vcf_names
P.S.: If you have several VCF files, you can use lapply to apply this to each of them.
data.table::fread reads it as intended; see the example:
library(data.table)
#try this example vcf from GitHub
vcf <- fread("https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf")
#or if the file is local:
vcf <- fread("path/to/my/vcf/sample.vcf")
We can also use the vcfR package; see its manual for details.
I don't know how fread reads the VCF correctly in the answers above, but you can use skip to define the string the first row starts with (or, if given an integer, the number of rows to skip).
library(data.table)
df = fread(file='some.vcf', sep='\t', header = TRUE, skip = '#CHROM')
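For example, on a toy file invented here to mimic a VCF header (meta lines first, real header on the #CHROM line):

```r
library(data.table)

# toy file (contents invented): meta lines first, real header on the #CHROM line
f <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.2",
             "##some meta line",
             "#CHROM\tPOS\tID\tREF\tALT",
             "11\t33443\t3\tA\tT"), f)

# skip = "#CHROM" makes fread start at the first line containing that substring
df <- fread(f, sep = "\t", header = TRUE, skip = "#CHROM")
```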
I have a series of txt files formatted in the same way.
The first few rows all contain file information, and there are no variable names. As you can see, the spacing between fields is inconsistent, but the columns are left- or right-aligned. I know SAS can read data in this format directly, and I wonder whether R provides anything similar.
I tried the read.csv function to load the data, hoping to save it in a data.frame with 3 columns, but it turns out the sep option (e.g. sep = "\s" for multiple spaces) cannot take a regular expression.
So I tried reading the data into a single variable first and using the substr function to split it, as follows.
step1
Factor <- data.frame(substr(Share$V1, 1, 9), substr(Share$V1, 9, 14),
                     as.numeric(substr(Share$V1, 15, 30)))
step2
But this is quite clumsy, and it requires counting the spaces in between.
I wonder if there is a method to load the data directly as three columns.
> Factor
F T S
1 +B2P A 1005757219
2 +BETA A 826083789
We can use read.table to read it as 3 columns
read.table(text = as.character(Share$V1), sep = "", header = FALSE,
           stringsAsFactors = FALSE, col.names = c("FactorName", "Type", "Share"))
# FactorName Type Share
#1 +B2P A 1005757219
#2 +BETA A 826083789
#3 +E2P A 499237181
#4 +EF2P A 38647147
#5 +EFCHG A 866171133
#6 +IL1QNS A 945726018
#7 +INDMOM A 862690708
Another option would be to read it directly from the file, skipping the header line and changing the column names:
read.table("yourfile.txt", header=FALSE, skip=1, stringsAsFactors=FALSE,
col.names = c("FactorName", "Type", "Share"))
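If the columns really are fixed-width in the SAS sense rather than merely whitespace-separated, base R's read.fwf is another option. A sketch; the file contents and the widths (9, 6, and the rest) are invented for illustration:

```r
# toy fixed-width file: 9 chars for the factor name, 6 for the type, rest for the share
f <- tempfile()
writeLines(c("+B2P     A     1005757219",
             "+BETA    A     826083789"), f)

# widths give the character count of each column; strip.white trims the padding
fwf <- read.fwf(f, widths = c(9, 6, 16), strip.white = TRUE,
                col.names = c("FactorName", "Type", "Share"))
```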
I'm just starting out with R and trying to get a grasp of some of the built in functions. I'm trying to organize a basic FASTA text file that looks like this:
>ID1
AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
>ID2
TCCAATTAAGTCCCTATCCAGGCGCTCCG
>ID3
GAACCGGAGAACGCTTCAGACCAGCCCGGAC
Into a table that'd look something like this:
ID Sequence
ID1 AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
ID2 TCCAATTAAGTCCCTATCCAGGCGCTCCG
ID3 GAACCGGAGAACGCTTCAGACCAGCCCGGAC
Or at least something organized in a similar manner. Unfortunately, whenever I try to use read.table, I'm forced to set fill = TRUE to avoid the following error:
> read.table("ReadingText.txt", header=F, fill=F, sep=">")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 2 did not have 2 elements
Setting fill = TRUE doesn't solve the problem as it just introduces unwanted blank fields. I feel like my problem is that R wants to treat each new line from the input as a new row in the output, whereas I'm expecting it to start a new row only at each ">" and move to the next column of the same row at each new line of the input.
So, how would you get this to work? Is read.table just the wrong function to be trying to do this with or is there something else? Also, I'd really like to accomplish this without using any packages! I want to get a good grasp of the built-in functions in R.
Thanks for taking the time to read this and apologies if I've done anything wrong posting this here. This is the first time I've asked anything.
It would take some tricky post-processing to do this with read.table() or readLines(). There is a function read.fasta() in the seqinr package that can get you most of the way there. Then we just turn the resulting list into a data frame.
library(seqinr)
(fasta <- read.fasta("so.fasta", set.attributes = FALSE, as.string = TRUE, forceDNAtolower = FALSE))
# $ID1
# [1] "AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC"
#
# $ID2
# [1] "TCCAATTAAGTCCCTATCCAGGCGCTCCG"
#
# $ID3
# [1] "GAACCGGAGAACGCTTCAGACCAGCCCGGAC"
setNames(rev(stack(fasta)), c("ID", "Sequence"))
# ID Sequence
# 1 ID1 AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
# 2 ID2 TCCAATTAAGTCCCTATCCAGGCGCTCCG
# 3 ID3 GAACCGGAGAACGCTTCAGACCAGCCCGGAC
where the file so.fasta is
writeLines(">ID1
AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
>ID2
TCCAATTAAGTCCCTATCCAGGCGCTCCG
>ID3
GAACCGGAGAACGCTTCAGACCAGCCCGGAC", "so.fasta")
Note: Pascal makes a good point in the comments. When a tool already exists for your specific task, take advantage of it. There's no need to spend time wrestling with functions that aren't right for the job when someone has already gone to the trouble of packaging a solution and sharing it with other users facing the same problem.
Update: Actually, it's not that difficult using readLines(), so long as you have a nice clean file. Here is a possible solution using only base functions.
x <- readLines("so.fasta")
ids <- grepl("^>", x)
data.frame(ID = sub(">", "", x[ids]), Sequence = x[!ids])
# ID Sequence
# 1 ID1 AGAATAGCCAGAACCGTTTCTCTGAGGCTTCC
# 2 ID2 TCCAATTAAGTCCCTATCCAGGCGCTCCG
# 3 ID3 GAACCGGAGAACGCTTCAGACCAGCCCGGAC
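If a record's sequence wraps across several lines, the same readLines() idea still works by grouping each line under the most recent header. A base-R sketch, where the vector x stands in for the output of readLines() on such a file:

```r
# lines of a FASTA file where ID1's sequence wraps over two lines
x <- c(">ID1", "AGAATAGCCAGAACCG", "TTTCTCTGAGGCTTCC",
       ">ID2", "TCCAATTAAGTCCCTATCCAGGCGCTCCG")
ids <- grepl("^>", x)
grp <- cumsum(ids)    # record number for every line (headers increment the counter)
# paste together all sequence lines belonging to the same record
seqs <- tapply(x[!ids], grp[!ids], paste, collapse = "")
fa <- data.frame(ID = sub("^>", "", x[ids]), Sequence = unname(seqs))
```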
Say I have the first test.csv that looks like this
,a,b,c,d,e
If I try to read it using read.csv, it works fine.
read.csv("test.csv",header=FALSE)
# V1 V2 V3 V4 V5 V6
#1 NA a b c d e
#Warning message:
#In read.table(file = file, header = header, sep = sep, quote = quote, :
# incomplete final line found by readTableHeader on 'test.csv'
However, if I attempt to read this file using fread, I get an error instead.
require(data.table)
fread("test.csv",header=FALSE)
#Error in fread("test.csv", header = FALSE) :
# Not positioned correctly after testing format of header row. ch=','
Why does this happen and what can I do to correct this?
In my case, the problem was just that the first several rows of my file had a missing ID value, so I was able to solve it by specifying autostart far enough into the file that a non-missing value appeared:
fread("test.csv", autostart = 100L, skip = "A")
This guarantees that when fread attempts to identify sep and sep2 automatically, it does so at a well-formatted place in the file.
Specifying skip also makes sure fread finds the correct row on which to base the column names.
If indeed there are no nonmissing values for the first field, you're better off just deleting that field from the .csv with Richard Scriven's approach or a find-and-replace in your favorite text editor.
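That deletion can also be done in R itself. A small sketch; the file here is constructed just for illustration:

```r
# recreate the problematic file: a leading empty field on every line
f <- tempfile(fileext = ".csv")
writeLines(",a,b,c,d,e", f)

# strip the leading comma from every line, rewrite the file, then read normally
writeLines(sub("^,", "", readLines(f)), f)
d <- read.csv(f, header = FALSE)
```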
I think you could use the skip/select/drop arguments of the fread function for this purpose.
fread("myfile.csv", sep = ",", header = FALSE, skip = "A")             # start reading at the first line containing "A"
fread("myfile.csv", sep = ",", header = FALSE, select = c(2, 3, 4, 5)) # read only columns 2 to 5
fread("myfile.csv", sep = ",", header = FALSE, drop = 1)               # drop the first column
I've tried making that CSV file and running the code. It seems to work now (same for other people?). I thought it might be an issue with not having a newline at the end (hence the warning from read.csv), but fread copes fine whether or not there's a newline at the end.
This question comes from an impatient person who has only just started working with R.
I have a file containing lines like this:
simulation_time:386300;real_time:365;agents:300
simulation_time:386800;real_time:368;agents:300
simulation_time:386900;real_time:383;agents:300
simulation_time:387000;real_time:451;agents:300
simulation_time:387100;real_time:345;agents:300
simulation_time:387200;real_time:327;agents:300
simulation_time:387300;real_time:411;agents:300
simulation_time:387400;real_time:405;agents:300
simulation_time:387500;real_time:476;agents:300
simulation_time:387600;real_time:349;agents:300
....
I need to plot a graph from this file. This link shows how to plot a file read in tabular format, but the lines above are not in a tabular or neat CSV format.
Could you please tell me how to parse such a file?
Also, if you have a reference for impatient people like me, please let me know.
Thanks.
If the structure of the file is strict, then you can customise your reading to get the data you want.
See the code below.
# read the file
strvec <- readLines(con = "File.txt", n = -1)
# split each line on ";" or ":"
strlist <- strsplit(strvec, ":|;")
# convert to a matrix (works only if every line has the same structure)
strmat <- do.call(rbind, strlist)
# keep only the numbers
df <- strmat[, c(2, 4, 6)]
# set the column names
colnames(df) <- strmat[1, c(1, 3, 5)]
# convert the strings to numerics (there might be better methods; suggestions welcome)
df <- apply(df, 2, as.numeric)
# convert to a data.frame
df <- as.data.frame(df)
# now you can do whatever you want
plot(df$simulation_time, type = "l")
For data in that exact format:
d <- read.csv(textConnection(gsub(";", ":", readLines("data.csv"))),
              sep = ":", header = FALSE)[, c(2, 4, 6)]
produces:
V2 V4 V6
1 386300 365 300
2 386800 368 300
3 386900 383 300
4 387000 451 300
you can then assign names to the data frame with names(d) <- c("sim", "real", "agents").
It works by reading the file into a character vector, replacing the ";" with ":" so everything is separated by ":", then using read.csv to read that text into a data frame, and then taking only the data columns and not the repeated text columns.
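Base R's strcapture() is another compact route for this exact pattern; the proto argument declares the column names and types. A sketch on two of the lines above:

```r
# two sample lines copied from the question
lines <- c("simulation_time:386300;real_time:365;agents:300",
           "simulation_time:386800;real_time:368;agents:300")

# each (\d+) capture group becomes one column, typed according to proto
d <- strcapture("simulation_time:(\\d+);real_time:(\\d+);agents:(\\d+)", lines,
                proto = data.frame(simulation_time = integer(),
                                   real_time = integer(),
                                   agents = integer()))
```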