How to parse a custom formatted file in R

This is a question from an impatient person who has just started working with R.
I have a file containing lines like this:
simulation_time:386300;real_time:365;agents:300
simulation_time:386800;real_time:368;agents:300
simulation_time:386900;real_time:383;agents:300
simulation_time:387000;real_time:451;agents:300
simulation_time:387100;real_time:345;agents:300
simulation_time:387200;real_time:327;agents:300
simulation_time:387300;real_time:411;agents:300
simulation_time:387400;real_time:405;agents:300
simulation_time:387500;real_time:476;agents:300
simulation_time:387600;real_time:349;agents:300
....
I need to plot a graph from this file. This link teaches how to plot data by reading a file in tabular format, but the lines above are not in a tabular or neat CSV format.
Could you please tell me how to parse such a file?
Besides, if you have a reference for the impatient like me, please let me know.
Thanks!

If the structure of the file is strict, then you can customise your reading to get the data you want.
See the code below.
# reading the file
strvec = readLines(con = "File.txt", n = -1)
# strsplit by ";" or ":"
strlist = strsplit(strvec,":|;")
# changing to matrix (works only if the structure of each line is the same)
strmat = do.call(rbind, strlist)
# keep only the number columns
df = strmat[ ,c(2,4,6)]
# defining the names
colnames(df) = strmat[1 ,c(1,3,5)]
# changing strings to numerics (there may be better methods; suggestions welcome)
df = apply(df, 2, as.numeric)
# changing to data.frame
df = as.data.frame(df)
# now you can do whatever you want
plot(df$simulation_time, type="l")

For data in that exact format:
d = read.csv(textConnection(gsub(";",":",readLines("data.csv"))),sep=":",header=FALSE)[,c(2,4,6)]
produces:
V2 V4 V6
1 386300 365 300
2 386800 368 300
3 386900 383 300
4 387000 451 300
you can then assign names to the data frame with names(d)=c("sim","real","agents").
It works by reading the file into a character vector, replacing the ";" with ":" so everything is separated by ":", then using read.csv to read that text into a data frame, and then taking only the data columns and not the repeated text columns.
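The same one-liner can be unpacked into separate steps, which may be easier to follow (a sketch, assuming the file is named "data.csv" as above):

```r
# Read the raw lines and normalise both separators to ":"
lines <- readLines("data.csv")
lines <- gsub(";", ":", lines)

# Parse the ":"-separated text into a data frame
d <- read.csv(textConnection(lines), sep = ":", header = FALSE)

# Keep only the numeric columns; columns 1, 3 and 5 hold the repeated labels
d <- d[, c(2, 4, 6)]
names(d) <- c("sim", "real", "agents")
```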

Related

How to read text file with page break character in R

I am quite new to R. I have a few text (.txt) files in a folder that were converted from PDF with a page break character (#12). I need to produce a data frame by reading these text files into R with the condition that one row represents one PDF page: only when there is a page break (\f) should a new row be created.
The problem is that when the text file is loaded into R, every new line becomes a new row, and I do not want this.
Please assist me with this. Thanks!
Some methods that I have tried are read.table and readLines.
As you can see at lines 273 & 293, there is a \f, so I need whatever comes after \f to be in one row (which represents a page).
Base R:
vec <- c("a","b","\fd","e","\ff","g")
# vec <- readLines("file.txt")
out <- data.frame(page = sapply(split(vec, cumsum(grepl("^\f", vec))), paste, collapse = "\n"))
out
# page
# 0 a\nb
# 1 \fd\ne
# 2 \ff\ng
If you need the leading \f removed, that is easily done with:
out$page <- sub("^\f", "", out$page)
Does something like this work?
library(tidyverse)
read_file("mytxt.txt") %>%
str_split("␌") %>%
unlist() %>%
as_tibble_col("data")
It just reads the file as raw text then splits afterwards. You may have to replace the splitting character with something else.
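If the page break in your file is the standard form feed, splitting on "\f" directly should work; the same pipeline would look like this (a sketch, using the same assumed file name):

```r
library(tidyverse)

# Read the whole file as one string, split on the form-feed page break,
# and collect the pages into a one-column tibble
pages <- read_file("mytxt.txt") %>%
  str_split("\f") %>%
  unlist() %>%
  as_tibble_col("data")
```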

How can I import a vcf file onto R ? or load it? [duplicate]

I have this VCF format file, and I want to read it in R. However, it contains some redundant lines which I want to skip: I want a result whose first row is the line matching #CHROM.
This is what I have tried:
chromo1<-try(scan("myfile.vcf",what=character(),n=5000,sep="\n",skip=0,fill=TRUE,na.strings="",quote="\"")) ## find the start of the vcf file
skip.lines<-grep("^#CHROM",chromo1)
column.labels<-read.delim("myfile.vcf",header=F,nrows=1,skip=(skip.lines-1),sep="\t",fill=TRUE,stringsAsFactors=FALSE,na.strings="",quote="\"")
num.vars<-dim(column.labels)[2]
myfile.vcf
#not wanted line
#unnecessary line
#junk line
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
result
#CHROM POS ID REF ALT
11 33443 3 A T
12 33445 5 A G
Maybe this could be good for you:
# read two times the vcf file, first for the columns names, second for the data
tmp_vcf<-readLines("test.vcf")
tmp_vcf_data<-read.table("test.vcf", stringsAsFactors = FALSE)
# filter for the columns names
tmp_vcf<-tmp_vcf[-(grep("#CHROM",tmp_vcf)+1):-(length(tmp_vcf))]
vcf_names<-unlist(strsplit(tmp_vcf[length(tmp_vcf)],"\t"))
names(tmp_vcf_data)<-vcf_names
p.s.: If you have several VCF files, then you should use the lapply function.
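That suggestion could be sketched like this, wrapping the steps above in a helper (read_vcf is a made-up name, and the files are assumed to live in the working directory):

```r
# Read one VCF: data rows plus column names taken from the "#CHROM" line
read_vcf <- function(path) {
  tmp_vcf <- readLines(path)
  # read.table skips the "#"-prefixed header lines by default
  tmp_vcf_data <- read.table(path, stringsAsFactors = FALSE)

  header_line <- tmp_vcf[grep("#CHROM", tmp_vcf)]
  names(tmp_vcf_data) <- unlist(strsplit(header_line, "\t"))
  tmp_vcf_data
}

# Apply it to every .vcf file in the working directory
vcf_files <- list.files(pattern = "\\.vcf$")
vcf_list <- lapply(vcf_files, read_vcf)
```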
data.table::fread reads it as intended, see example:
library(data.table)
#try this example vcf from GitHub
vcf <- fread("https://raw.githubusercontent.com/vcflib/vcflib/master/samples/sample.vcf")
#or if the file is local:
vcf <- fread("path/to/my/vcf/sample.vcf")
We can also use the vcfR package; see the manuals in the link.
I don't know how fread read the VCF correctly in the example above, but you can use 'skip' to mark where the data start: given a string, it skips to the first row containing it (or, if an integer, it skips that number of rows).
library(data.table)
df = fread(file='some.vcf', sep='\t', header = TRUE, skip = '#CHROM')

read a csv file with quotation marks and regex R

ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first will serve as the column names. Everything is separated by commas, and the values are in quotation marks except for the first one, which I think is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later; that's no problem.
The data comes in a .csv file that was given to me.
If I open this file in Excel, the columns are all in one.
I work in France, but I don't know which locale or encoding the file was created with (by the way, I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values are all in the first column.
then with
library(readr)
df <- read_delim('file.csv',
                 delim = ",",
                 quote = "",
                 escape_double = FALSE,
                 escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines, so it would be hard to inspect.
Both read.delim and fread give warning messages; I can include them if they might be useful.
update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried read.csv with the input you provided and had no problems; when subsetting, each column is accessible. As for your other options, you're getting the quote option wrong: it needs to be "\"", because the double quote character must be escaped, i.e. df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
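Once read.csv parses the file cleanly, the two columns of interest can be selected directly (a sketch, using the same assumed file name as above):

```r
df <- read.csv("file.csv", stringsAsFactors = FALSE)

# Keep only the class and msg columns asked for in the question
wanted <- df[, c("class", "msg")]
```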

How to convert rows

I have uploaded a data set called "Obtained Dataset". It usually has 16 rows of numeric and character variables; some other files of a similar nature have fewer than 16. Each variable is the header of the data, which starts from the 17th row onwards in this specific file.
Obtained dataset & Required Dataset
In the data itself, the 1st column is the x-axis, the 2nd column is the y-axis and the 3rd column is depth (these are standard for all the files in the database); the 4th column is GR 1 LIN, the 5th column is CAL 1 LIN, and so forth, as given in the first 16 rows of the data.
Now I want R code that can convert it into the format shown in the required data set. Also, if a different data set has fewer than 16 lines of names (say GR 1 LIN and RHOB 1 LIN are missing), I still want it to create those columns with NA entries for all rows.
Currently I have managed to export each file to Excel, manually clean the data, rename the columns, save it as CSV, and then read it with read.csv("filename"), but it is simply not possible to do this for 400 files.
Any advice how to proceed will be of great help.
I have noticed that you have probably posted this question before, in a different format. This is a public forum, and people are happy to help. However, it's your job to make life easier for others, so please put in some effort. Here is some advice on that.
Having said that, here is some code I have written to help you out.
Step0: Creating your first data set:
sink("test.txt") # This will `sink` all the output to the file "test.txt"
# Lets start with some dummy data
cat("1\n")
cat("DOO\n")
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
cat(c(sample(letters,10),"\n"))
# Now a 10 x 16 dummy data matrix:
cat(paste(apply(matrix(sample(160),10),1,paste,collapse = "\t"),collapse = "\n"))
cat("\n")
sink() # This will stop `sink`ing.
I have created some dummy data in first 6 lines, and followed by a 10 x 16 data matrix.
Note: In principle you should have provided something like this, or a copy of your dataset. This would help other people help you.
Step1: Now we need to read the file, and we want to skip the first 6 rows with undesired info:
(temp <- read.table(file="test.txt", sep ="\t", skip = 6))
Step2: Data clean up:
We need a vector with names of the 16 columns in our data:
namesVec <- letters[1:16]
Now we assign these names to our data.frame:
names(temp) <- namesVec
temp
Looks good!
Step3: Save the data:
write.table(temp,file="test-clean.txt",row.names = FALSE,sep = "\t",quote = FALSE)
Check whether the solution is working. If it is, then move on to the next step; otherwise make the necessary changes.
Step4: Automating:
First we need to create a list of all the 400 files.
The easiest way (also the easiest to explain) is to copy the 400 files into a directory and then set that as the working directory (using setwd).
Now first we'll create a vector with all file names:
fileNameList <- dir()
Once this is done, we'll need a function to repeat steps 1 through 3:
convertFiles <- function(fileName) {
  temp <- read.table(file = fileName, sep = "\t", skip = 6)
  names(temp) <- namesVec
  # name the output after the input file, so each file gets its own result
  write.table(temp, file = paste("clean", fileName, sep = "-"), row.names = FALSE, sep = "\t", quote = FALSE)
}
Now we simply need to apply this function on all the files we have:
sapply(fileNameList,convertFiles)
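The question also asked about files where some of the 16 variables are absent. One way to sketch that (assuming you know the full set of expected column names, and with a made-up two-column example standing in for a file that yielded fewer columns) is to add the missing columns filled with NA:

```r
# Expected full set of column names
namesVec <- letters[1:16]

# Suppose a particular file only yielded these two columns
temp <- data.frame(a = 1:3, c = 4:6)

# Add any missing expected columns, filled with NA
missing_cols <- setdiff(namesVec, names(temp))
temp[missing_cols] <- NA

# Restore the standard column order
temp <- temp[namesVec]
```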
Hope this helps!

Reading CSV using R where header is on the second line

I know that in R I can read in a csv file using read.csv. I also know that by setting header = TRUE I can indicate to R that there is a header with variable names on the first row.
However, I am trying to read in a csv that places a timestamp on the first row and the header / variable names on the second. I can obviously manually strip off the first line before loading it into R, but it’s a pain to do this each time. Is there an elegant solution to this in R?
Use the skip argument to read.csv
read.csv(.... , skip=1)
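As a concrete sketch (file name made up), skipping the timestamp line so that the second line supplies the header:

```r
# Line 1 holds the timestamp; skip it so line 2 is read as the header
df <- read.csv("data_with_timestamp.csv", skip = 1, header = TRUE)
```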
For the subjective "elegant", you may want to look at fread from "data.table", which generally does a good job of figuring out where the data actually start.
An example:
Create a fake CSV file in our workspace
The first line has "something" and the actual data starts on the second line with the headers "V1", "V2", and "V3".
x <- tempfile()
cat("something",
"V1,V2,V3",
"1,2,3", "4,5,6", "7,8,9", sep = "\n", file = x)
Load "data.table" and try fread
Seems to work out of the box! Obviously replace x with the name of your actual CSV file.
library(data.table)
fread(x)
# V1 V2 V3
# 1: 1 2 3
# 2: 4 5 6
# 3: 7 8 9
