ne,class,regex,match,event,msg
BOU2-P-2,"tengigabitethernet","tengigabitethernet(?'connector'\d{1,2}\/\d{1,2})","4/2","lineproto-5-updown","%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
These are the first two lines; the first one will serve as the column names. The fields are comma-separated, with the values in quotation marks except for the first one, and I think that unquoted value is what creates the trouble.
I am interested in the columns class and msg, so this output will suffice:
class msg
tengigabitethernet %lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down
but I can also import all the columns and drop the ones I don't want later, that's no problem.
The data comes in a .csv file that was given to me.
If I open this file in excel the columns are all in one.
I work in France, but I don't know in which locale or encoding the file was created (btw I'm not French, so I am not really familiar with those).
I tried with
df <- read.csv("file.csv", stringsAsFactors = FALSE)
and the data frame has the column names nicely separated, but the values all end up in the first column
then with
library(readr)
df <- read_delim('file.csv',
delim = ",",
quote = "",
escape_double = FALSE,
escape_backslash = TRUE)
but this way the regex column gets split into two columns, so I lose the msg variable altogether.
With
library(data.table)
df <- fread("file.csv")
I get the msg variable present but empty, as the ne variable contains both ne and class, separated by a comma.
This is the best output so far, as I can manipulate it to get the desired one.
Another option is to load the file as a character vector with readLines and fix it there, but I am not an expert with regexes, so I would be clueless.
The file is also 300k lines long, so it would be hard to inspect by hand.
Both read.delim and fread give warning messages; I can include them if they might be useful.
Update:
using
library(data.table)
df <- fread("file.csv", quote = "")
gives me an output that is easier to manipulate: it splits the regex and msg columns in two, but ne and class are distinct.
I tried with the input you provided using read.csv and had no problems; when subsetting, each column is accessible. As for your other options, you're getting the quote option wrong: it needs to be "\"", i.e. the double-quote character has to be escaped: df <- fread("file.csv", quote = "\"").
When using read.csv with your example I definitely get a data frame with 1 line and 6 columns:
df <- read.csv("file.csv")
nrow(df)
# Output result for number of rows
# > 1
ncol(df)
# Output result for number of columns
# > 6
df$ne
# > "BOU2-P-2"
df$class
# > "tengigabitethernet"
df$regex
# > "tengigabitethernet(?'connector'\\d{1,2}\\/\\d{1,2})"
df$match
# > "4/2"
df$event
# > "lineproto-5-updown"
df$msg
# > "%lineproto-5-updown: line protocol on interface tengigabitethernet4/2, changed state to down"
My test file is formatted very oddly.
The first row starts with:
If I ignore the first row and import the data using read.table, it works well, but then I do not have the column names. If I try to import the data using col.names=TRUE, it says "more columns than column names". I guess I could separately import the first row (which holds the column names) and the rest of the data, and then attach the names to the final output. But when I import the first row, it completely ignores the column names and jumps to the row with 0 0 0 0.... Is it because the first row has a # character? And is the extra empty column in the data also caused by the # character?
Here are a few possibilities:
1) Process twice. Read the file in as a character vector of lines, L, using readLines. Then remove the # and read L using read.table:
L <- sub("#", "", readLines("myfile.dat"))
read.table(text = L, header = TRUE)
2) Read the header separately. For smaller files the prior approach is short and should be fine, but if the file is large you may not want to process it twice. In that case, use readLines to read in only the header line, fix it up, and then read in the rest, applying the column names:
File <- "myfile.dat"
col.names <- scan(text = readLines(File, 1), what = "", quiet = TRUE)[-1]
read.table(File, col.names = col.names)
3) Pipe. Another approach is to make use of external commands:
File <- "myfile.dat"
cmd <- paste("sed -e 1s/.//", File)
read.table(pipe(cmd), header = TRUE)
On UNIX-like systems sed should be available. On Windows you will need to install Rtools and either ensure sed is on the PATH or else use the path to the file:
cmd <- paste("C:/Rtools/bin/sed -e 1s/.//", File)
read.table(pipe(cmd), header = TRUE)
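A minimal, self-contained run of approach 1, using a temporary file with made-up contents in place of myfile.dat:

```r
# Build a small example file whose header line is prefixed with '#'
File <- tempfile(fileext = ".dat")
writeLines(c("# a b c",
             "1 2 3",
             "4 5 6"), File)

# Strip the '#' from every line, then parse the cleaned text
L <- sub("#", "", readLines(File))
dat <- read.table(text = L, header = TRUE)
names(dat)  # "a" "b" "c"
```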
One approach would be to just do a single separate read of the first line to sniff out the column names. Then, do a read.table as you were already doing, and skip the first line.
f <- "path/to/yourfile.csv"
con <- file(f, "r")
header <- readLines(con, n=1)
close(con)
df <- read.table(f, header=FALSE, sep = " ", skip=1) # skip the first line
names(df) <- strsplit(header, "\\s+")[[1]][-1] # assign column names
But I don't like this approach and would prefer that you fix the source of your flat files so they don't include that troublesome # symbol. Also, if you only need this as a one-time thing, you could simply edit the flat file by hand to remove the # symbol.
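For completeness, a runnable version of this header-sniffing approach against a hypothetical sample file; readLines(f, n = 1) is used here as a compact equivalent of the explicit connection:

```r
# Hypothetical example file: '#'-prefixed header, space-separated values
f <- tempfile(fileext = ".csv")
writeLines(c("# x y z",
             "1 2 3",
             "4 5 6"), f)

header <- readLines(f, n = 1)                    # sniff the header line only
df <- read.table(f, header = FALSE, sep = " ", skip = 1)
names(df) <- strsplit(header, "\\s+")[[1]][-1]   # drop the leading '#'
names(df)  # "x" "y" "z"
```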
I have got a comma delimited csv document with predefined headers and a few rows. I just want to exchange the comma delimiter to a pipe delimiter. So my naive approach is:
myData <- read.csv(file="C:/test.CSV", header=TRUE, sep=",", check.names = FALSE)
Viewing myData gives me results without X subscripts in header columns. If I set check.names = TRUE, the column headers have a X subscript.
Now I am trying to write a new csv with pipe-delimiter.
write.table(myData, file = "C:/test_pipe.CSV", row.names = FALSE, na = "", col.names = TRUE, sep = "|")
In the next step I am going to test my results:
mydata.test <- read.csv(file="C:/test_pipe.CSV", header=TRUE, sep="|")
The import seems fine, but unfortunately the X subscripts in the column headers appear again. Now my question is:
Is there something wrong with the original file or is there an error in my naive approach?
The original csv test.csv was created with Excel, of course without X subscripts in column headers.
Thanks in advance
You have to keep using check.names = FALSE the second time as well, when you read the file back in.
Otherwise your header will be modified, because it apparently contains variable names that would not be considered syntactically valid names for the columns of a data.frame. For example, special characters are replaced by dots, and names starting with a number are prefixed with X.
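A short sketch of the effect, using a made-up header:

```r
# A header with names that are not syntactically valid in R
f <- tempfile(fileext = ".csv")
writeLines(c("first name,2nd value",
             "a,1"), f)

bad  <- read.csv(f)                       # default check.names = TRUE mangles them
good <- read.csv(f, check.names = FALSE)  # names survive untouched
names(bad)   # "first.name" "X2nd.value"
names(good)  # "first name" "2nd value"
```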
I have data files that contain the following:
the first 10 columns are numbers, the last column is text. They are separated by space. The problem is that the text in the last column may also contain space. So when I used read.table() I got the following error:
Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
line 21 did not have 11 elements
what's the easiest way of reading the first 10 columns into a data matrix, and the last column into a string vector? Should I use readLines() first then process it?
If you cannot re-export or recreate your data files with different, non-whitespace separators or with quotation marks around the last column to avoid that problem, you can use read.table(..., fill = TRUE) to read in a file with unequal columns, then combine columns 11 onwards with dat$col11 <- do.call(paste, c(dat[11:ncol(dat)], sep = " ")) (or something like that), and then drop the now-unwanted columns with dat[11:(ncol(dat) - 1)] <- NULL. Finally, you may need to trim the trailing whitespace from the combined column with trimws(dat$col11).
Note that fill only considers the first five lines of your file, so you may need to determine the number of 'pseudo-columns' in the longest line manually and specify an appropriate number of col.names in read.table (see the linked answer).
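A runnable sketch of the fill = TRUE approach on a made-up two-line sample; here the numeric part is only two columns, so the text starts at column 3 rather than 11:

```r
# Hypothetical sample: two numeric columns, then free text containing spaces
f <- tempfile()
writeLines(c("1 2 some text here",
             "3 4 more words"), f)

dat <- read.table(f, fill = TRUE)  # ragged rows are padded with blanks
# Paste the trailing columns back together, then keep only what we need
dat$txt <- trimws(do.call(paste, c(dat[3:ncol(dat)], sep = " ")))
dat <- dat[c(1, 2, ncol(dat))]
dat$txt  # "some text here" "more words"
```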
Hinted by the useful fill = TRUE option of read.table() function, I used the following to solve my problem:
dat <- read.table(fname, fill = T)
dat <- dat[subset(1:nrow(dat),!((1:nrow(dat)) %in% (which(dat[,11]=="No") + 1))),]
The fill = TRUE option puts everything after the first space of the 11th column to a new row (redundant rows that the original data do not have). The code above removes the redundant rows based on three assumptions: (1) the number of space separators in the 11th column is no more than 11 such that we know there is only one more row of text after a line whose 11th column contains space (that's what the +1 does); (2) we know the line whose 11th column starts with a certain word (in my case it is "No") (3) Keeping only the first word in the 11th column would be sufficient (without ambiguity).
The following solved my problem:
nc <- max(count.fields(fname, sep = " "))
data <- read.table(fname, fill = T, col.names = paste0("V", seq_len(nc)), sep = " ", header = F)
Then the first 10 columns will be the numeric results I want and the remaining nc-10 columns can be combined into one string vector.
The most helpful post is:
How can you read a CSV file in R with different number of columns
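Put together as a runnable sketch on a made-up sample file:

```r
# Hypothetical sample: 10 numbers per line, then trailing text of varying length
f <- tempfile()
writeLines(c("1 2 3 4 5 6 7 8 9 10 short note",
             "11 12 13 14 15 16 17 18 19 20 a much longer comment"), f)

nc  <- max(count.fields(f, sep = " "))  # the widest line decides the column count
dat <- read.table(f, fill = TRUE, header = FALSE,
                  col.names = paste0("V", seq_len(nc)), sep = " ")
ncol(dat)  # nc columns; V11 onward hold the split-up text
```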
You could reformat your file before reading it into R.
For example, using perl in a terminal:
perl -pe 's/(?<=[0-9]) /,/g' myfile.txt > myfile.csv
This replaces every space preceded by a number by a comma.
Then read it into R using read.csv:
df = read.csv("myfile.csv")
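If you would rather stay inside R, the same substitution can be done with gsub and a perl-style lookbehind, shown here on a hypothetical two-line sample:

```r
# Replace every space that is preceded by a digit with a comma
lines <- c("1.5 2.3 some text", "4 5 other words")
fixed <- gsub("(?<=[0-9]) ", ",", lines, perl = TRUE)
fixed  # "1.5,2.3,some text" "4,5,other words"
```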
I am trying to read in a tab-separated file (available online, using ftp) using read.table() in R.
The issue seems to be that, for the third column, the character string in some of the rows contains characters such as the apostrophe ' and the percent sign %, for example Capital gains 15.0% and Bob's earnings:
810| 13 | Capital gains 15.0% |
170| -20 | Bob’s earnings|
100| 80 | Income|
To handle the apostrophe character, I was using the following syntax
df <- read.table(url('ftp://location'), sep="|", quote="\"", as.is=T)
Yet this syntax shown above does not seem to handle the lines where the problematic third column contains a string with the percentage character, so that the entire remainder of the table is jammed into one field by the function.
I also tried to ignore the third column altogether by using colClasses df <- read.table(url('ftp://location'), sep="|", quote="\"", as.is=T, colClasses=c("numeric", "numeric", "NULL")), but the function still fails over on the lines with the percentage character is present in that third column.
Any ideas on how to address the issue?
Instead of the read.table wrapper, use the scan function directly. You can control everything, and it is faster, especially if you use the what option to tell R what you are scanning.
dat <- scan(file = "data.tsv", # substitute your url here
            what = list(numeric(), numeric(), character()), # one template per column
            skip = 0, # lines to skip
            sep = "|") # separator
Instead of file, use your url.
In this case scan expects two numeric columns followed by a character column.
Type
?scan
to get the complete list of options.
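A runnable sketch on a simplified version of the sample rows (no trailing delimiter); note quote = "", which stops scan from treating the apostrophe in Bob's as an opening quote:

```r
# Simplified stand-in for the pipe-delimited data
f <- tempfile(fileext = ".txt")
writeLines(c("810|13|Capital gains 15.0%",
             "170|-20|Bob's earnings"), f)

dat <- scan(f,
            what = list(numeric(), numeric(), character()),
            sep  = "|",
            quote = "")  # apostrophes are plain text, not quoting characters
dat[[3]]  # "Capital gains 15.0%" "Bob's earnings"
```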
The problem
I have a tab delimited input file that looks like so:
Variable [1] Variable [2]
111 Something
Nothing 222
The first row holds the column names and the next two rows hold the column values. As you can see, the column names include both spaces and some tricky characters.
Now, what I want to do is to import this file into R and then output it again to a new text file, making it look exactly the same as the input. For this purpose I have created the following script (assuming that the input file is called "Test.txt"):
file <- "Test.txt"
x <- read.table(file, header = TRUE, sep = "\t")
write.table(x, file = "TestOutput.txt", sep = "\t", col.names = TRUE, row.names = FALSE)
From this, I get an output that looks like this:
"Variable..1." "Variable..2."
"1" "111" "Something"
"2" "Nothing" "222"
Now, there are a couple of problems with this output:
1) The "[" and "]" signs have been converted to dots.
2) The spaces have been converted to dots.
3) Quote signs have appeared everywhere.
How can I make the output file look exactly the same as the input file?
What I've tried so far
Regarding problems one and two, I've tried specifying the column names by creating a vector, c("Variable [1]", "Variable [2]"), and passing it via the col.names option of read.table(). This gives me the exact same output. I've also tried different encodings through the encoding option of read.table(). If I print the vector mentioned above, the variable names appear as they should, so I guess the problem lies in the conversion between the "text -> R" and "R -> text" phases of the process. That is, if I look at the data frame created by read.table() without any manually created vectors, the column names are wrong.
As for problem number three, I'm pretty much lost and haven't been able to figure out what I should try.
Given the following input file as test.txt:
Variable [1] Variable [2]
111 Something
Nothing 222
Where the columns are tab-separated you can use the following code to create an exact copy:
a <- read.table(file='test.txt', check.names=F, sep='\t', header=T,
stringsAsFactors=F)
write.table(x=a, file='test_copy.txt', quote=F, row.names=F,
col.names=T, sep='\t')