Reading a PSV (pipe-separated) file or string in R

I have just received a data file whose extension is "*.psv". After doing a bit of research, I still don't know how to open it in R.

We could use read.table to read a *.psv file:
read.table("myfile.psv", sep = "|", header = FALSE, stringsAsFactors = FALSE)
There may be different interpretations of the .psv extension, but in a data-mining context it usually means a "pipe-separated values" file: the fields in each row are separated by "|".
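If you prefer the tidyverse, readr::read_delim() handles pipe-separated files as well. A minimal sketch, assuming the same file name as above (col_names = FALSE mirrors header = FALSE):
library(readr)
# delim = "|" tells readr the fields are pipe-separated
read_delim("myfile.psv", delim = "|", col_names = FALSE)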

Related

How to create a dataframe from a csv file with text separated by pipe (|)? [duplicate]

Why can't R read this table while Excel can?

I am trying to read a specific file that I have copied from an SFTP location. The file is pipe-delimited. I can read the file in Excel, but R reads it as null values and the column names are duplicated. I don't understand whether this is an encoding issue. I am trying to create a bash script to automate this process. Any help? Below is the link for the data.
Here's the file!
I have tried changing the encoding, but without knowing which encoding is used I am struggling. I have tried read_delim, read_table, read.table, read_csv and read.csv, with no luck.
This is the code I have used to read the file:
read_delim("./Engagement_Level.txt", delim = "|")
I would like to read it as a data frame.
The issue is that the file encoding is UTF-16LE, which read_delim cannot read at present.
You could use the base read.delim and file() to specify the encoding:
read.delim(file("Engagement_Level.txt", encoding = "UTF-16LE"), sep = "|")
That will convert all the quoted numbers to numeric. If you'd rather they were type character, to deal with later:
read.delim(file("Engagement_Level.txt", encoding = "UTF-16LE"), sep = "|",
           colClasses = "character")
I really recommend using Excel to build a CSV file via Data > Text to Columns; it is not ideal in this context, but it is nearly infallible and quick.
Then use read.csv(file, sep = ",").
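If you are not sure which encoding a file uses in the first place, readr::guess_encoding() can suggest likely candidates. A hedged sketch, using the same file name as above:
library(readr)
# Returns a table of probable encodings with confidence scores;
# pass the most likely one to file(..., encoding = ...) as shown above.
guess_encoding("Engagement_Level.txt")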

Delimiters while writing csv files in R

How can I use |(pipe) as a delimiter while writing csv files in R?
When I try writing a data set to a file with write.csv with sep = "|", it ignores the separator and simply writes a comma-separated file.
write.csv2 also doesn't cover the other characters that could be used as a separator.
Is there a way to use other characters such as ^, $, ~, ¬ or | as a delimiter while writing a csv file in R?
Thanks.
You have to understand that .csv means "comma-separated values" (https://en.wikipedia.org/wiki/Comma-separated_values).
If you want to export with a different separator, you need another function.
For example, use write.table, and you'll be able to load the file with R, Excel, etc.:
write.table(data, "data.txt", sep = "|")
data_load <- read.table("data.txt", sep = "|")
Feel free to use any character as the separator.
Or you could force this plain-text file to have a .csv extension:
write.table(data, "data.csv", sep = "|")
data_load <- read.csv("data.csv", sep = "|")
This answer is just a variation of the one I gave for this question. The two are similar, but I don't think this question is an exact duplicate; both are part of a bigger question (not yet asked).
In the help for write.table, it states:
write.csv and write.csv2 provide convenience wrappers for writing CSV files.
...
These wrappers are deliberately inflexible: they are designed to ensure that the correct conventions are used to write a valid file.
Attempts to change append, col.names, sep, dec or qmethod are ignored, with a warning.
To set sep or another of these parameters you need to use write.table instead of write.csv.
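For illustration, a minimal sketch using the built-in iris data (file names are placeholders): write.csv() ignores the sep argument, while write.table() honours it.
write.csv(head(iris), "iris.csv", sep = "|")       # 'sep' is ignored here, with a warning
write.table(head(iris), "iris.psv", sep = "|", row.names = FALSE)
read.table("iris.psv", sep = "|", header = TRUE)   # reads the pipe-separated file back in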

read.csv in R skipping non-standard rows [duplicate]

I was reading in a csv file.
Code is:
mydata = read.csv("mycsv.csv", header = TRUE, sep = ",", quote = "\"")
Get the following warning:
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
Now some cells in my CSV have missing values that are represented by "".
How do I write this code so that I do not get the above warning?
Your CSV might be encoded in UTF-16. This isn't uncommon when working with some Windows-based tools.
You can try loading a UTF-16 CSV like this:
read.csv("mycsv.csv", ..., fileEncoding="UTF-16LE")
You can try using the skipNul = TRUE option.
mydata = read.csv("mycsv.csv", quote = "\"", skipNul = TRUE)
From ?read.csv
Embedded nuls in the input stream will terminate the field currently being read, with a warning once per call to scan. Setting skipNul = TRUE causes them to be ignored.
It worked for me.
This has nothing to do with the encoding; it is a problem with reading the nulls in the file. To handle that, you need to pass the skipNul = TRUE parameter.
For example:
neg = scan('F:/Natural_Language_Processing/negative-words.txt', what = 'character', comment.char = '', encoding = "UTF-8", skipNul = TRUE)
The file might not have CRLF line endings, only LF. Try checking the hex output of the file.
If so, try running the file through awk:
awk '{printf "%s\r\n", $0}' file > new_log_file
I had the same error message and figured out that although my files had a .csv extension and opened with no problems in a spreadsheet, they were actually saved as "All Formats" rather than "Text CSV (.csv)".
Another quick solution:
Double check that you are, in fact, reading a .csv file!
I was accidentally reading a .rds file instead of .csv and got this "embedded null" error.
In those cases, make sure the data you are importing does not contain "#" characters; if it does, try using the option comment.char = "". It worked for me.

Is there a sed type package in R for removing embedded NULs?

I am processing the US Weather service Storm Data, which has one large CSV data file for each year from 1950 onwards. The 1999 year file contains several rows with very large freeform text fields which contain embedded NUL characters, in an otherwise vanilla ascii database. (The offending file is at ftp://ftp.ncdc.noaa.gov/pub/data/swdi/stormevents/csvfiles/StormEvents_details-ftp_v1.0_d1999_c20140915.csv.gz).
R cannot handle corrupted string data without errors, and this includes data.frame, data.table, stringr, and stringi package functions (all tried).
I can clean the files of NULs with sed, but I would prefer not to use external programs, as this is for an R markdown type report with embedded code.
Suggestions?
Maybe this could be of help:
in.file <- file(description = "StormEvents_details-ftp_v1.0_d1999_c20140915.csv",
                open = "r")
writeLines(iconv(readLines(in.file), to = "ASCII"),
           con = "StormEvents_ascii.csv")
I was able to read the csv file without errors with this call to read.table:
options(stringsAsFactors = FALSE)
StormEvents <- read.table("StormEvents_ascii.csv", header = TRUE,
                          sep = ",", fill = TRUE, quote = '"')
Obviously you'd need to change the class of several columns, since all are considered character as it is.
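If you want to convert those character columns afterwards, base type.convert() is one option. A sketch, assuming the data frame read above:
# Re-guess a sensible class (numeric, logical, character, ...) for each column
StormEvents[] <- lapply(StormEvents, type.convert, as.is = TRUE)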
Just for posterity - you can use binary reads (readBin()) and replace the NULs with anything else - see
Removing "NUL" characters (within R)
An update for May 2020: the tidyverse and data.table both still choke on null characters within files; however, the base read.*() family and readLines() will gracefully skip them with the skipNul = TRUE option. You can read a file in while skipping over the null characters and then write it back out again.
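A minimal sketch of that read-then-rewrite approach (file names are placeholders):
# readLines() drops embedded NULs when skipNul = TRUE
clean <- readLines("StormEvents_raw.csv", skipNul = TRUE)
writeLines(clean, "StormEvents_clean.csv")
# The rewritten file can now be read by read.csv, data.table::fread, readr, etc.
StormEvents <- read.csv("StormEvents_clean.csv", stringsAsFactors = FALSE)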

Resources