Not being able to properly read CSV with R

I run the following:
data <- read.csv("C:/Users/user/Desktop/directory/file.csv")
These are my warning messages:
Warning messages:
1: In read.table("C:/Users/user/Desktop/directory/file.csv", :
line 1 appears to contain embedded nulls
2: In read.table("C:/Users/user/Desktop/directory/file.csv", :
line 2 appears to contain embedded nulls
....
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
The resulting data frame is just a column filled with NAs.
A problem might be that all my data is in one column, separated by commas like this (the first row is the header, the second is example data):
stat_breed,tran_disp,train_key,t_type,trainer_Id,horsedata_Id
QH,"Tavares Joe","214801","E",0,0
What can I do to accurately read my data?
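"Embedded nulls" warnings usually mean the file is UTF-16 encoded (for example, saved as "Unicode Text" from Excel) rather than plain ASCII/UTF-8. One likely fix is to tell read.csv the encoding. The sketch below builds a small UTF-16LE file so it is self-contained; for a real file, check its actual encoding first, since "UTF-16LE" here is an assumption.

```r
# Build a tiny UTF-16LE CSV (stands in for file.csv) to demonstrate.
path <- tempfile(fileext = ".csv")
con <- file(path, open = "w", encoding = "UTF-16LE")
writeLines(c("stat_breed,tran_disp", 'QH,"Tavares Joe"'), con)
close(con)

# Reading without fileEncoding would trigger the embedded-nul warnings;
# declaring the encoding reads it cleanly.
data <- read.csv(path, fileEncoding = "UTF-16LE")

# Fallback if the encoding is unknown: skipNul = TRUE discards the nul
# bytes, but can silently mangle any non-ASCII text.
```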


Read a .DAT file that looks like a sparse matrix in r

I have a .DAT file that contains several thousand rows of data. Each row is a case with a fixed number of variables, but not every case has a value for every variable; where a value is missing, that space is left blank, so the whole file looks like a sparse matrix. A sample of the data looks like this:
10101010 100 10000FM
001 100 100 1000000 F
I want to read this data in r as a data frame. I've tried read.table but failed.
My code is
m <- read.table("C:/Users/Desktop/testdata.dat", header = FALSE)
R gives me an error message like:
"Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 6 elements"
How do I fix this?
Generally a .dat file has some lines of extra information before the actual data.
Skip them with the skip argument, as follows:
df <- read.table("C:/Users/Desktop/testdata.dat", header = FALSE, skip = 3)
Alternatively, you can inspect the start of the file with the readLines function, which reads the specified number of lines (pass the n parameter as below):
readLines("C:/Users/Desktop/testdata.dat", n = 5)
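Since the sample rows put values at fixed character positions with blanks for missing entries, read.fwf may fit this layout better than read.table once any extra header lines are skipped. The toy file and column widths below are illustrative guesses, not taken from the actual data; measure the widths against your own file.

```r
# read.fwf parses fixed-width records; blank runs become NA.
path <- tempfile(fileext = ".dat")
writeLines(c("101 20 FM",
             "    15 F ",
             "303    M "), path)

# Widths describe this toy file only: 3-char id, 1 space,
# 2-char score, 1 space, 2-char code.
m <- read.fwf(path, widths = c(3, 1, 2, 1, 2),
              col.names = c("id", "sep1", "score", "sep2", "code"),
              strip.white = TRUE)
m <- m[, c("id", "score", "code")]   # drop the separator columns
```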

unexpected result from skip in read.csv

I am trying to read a CSV into R, skipping the first 2 rows. The CSV is a mixture of blanks, text data and numeric data, where the thousands separator is ",".
The file reads into R fine (giving a 31 x 27 data frame), but when I change the call to include skip = 2 it returns a single column with 282 observations.
I have tried it using the readr package's read_csv function and it works fine.
testdf <- read.csv("test.csv")
works fine - gives a dataframe of 31 obs of 27 variables
I get the following warning when using the skip argument:
testdf <- read.csv("test.csv", skip = 2)
Warning message:
In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
EOF within quoted string
which results in a single variable with 282 observations
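No accepted fix is shown here, but a common workaround is to drop the two leading lines yourself with readLines and parse the remainder, sidestepping however scan() pairs quote characters when skip is combined with quoted fields. The file below is a made-up reconstruction (the asker's test.csv is not shared), with a quoted "1,000"-style field to mimic the thousands-separator data:

```r
# Throwaway file: two metadata lines, then a CSV with a comma
# inside a quoted field.
path <- tempfile(fileext = ".csv")
writeLines(c("report title", "generated 2016-01-01",
             "item,value", '"1,000",2'), path)

# Drop the first 2 lines before read.csv ever sees them.
testdf <- read.csv(text = readLines(path)[-(1:2)])
```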

Importing data from an Excel file online

I am trying to download an Excel file from the web and read only the lines that contain the word "ORD".
fileUrl <- "http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls"
x <- getURLContent(fileUrl)  # getURLContent is from the RCurl package
out <- read.table(fileUrl, x)
I am using getURLContent but receive errors at an early stage of the process:
Warning messages:
1: In read.table(fileUrl, x) : line 1 appears to contain embedded nulls
2: In read.table(fileUrl, x) : line 2 appears to contain embedded nulls
3: In read.table(fileUrl, x) : line 3 appears to contain embedded nulls
4: In read.table(fileUrl, x) : line 4 appears to contain embedded nulls
5: In read.table(fileUrl, x) : line 5 appears to contain embedded nulls
6: In if (!header) rlabp <- FALSE :
the condition has length > 1 and only the first element will be used
7: In if (header) { :
the condition has length > 1 and only the first element will be used
8: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : embedded nul(s) found in input
The table "out" comes out almost unreadable. Does anyone know how to read only the specific lines I need rather than importing the whole file and risking the error lines?
One of the answers to this SO question recommends using the gdata library to download the Excel file from the web and then using read.xls() to read it into a data frame. Something like this:
library(gdata)
download.file("http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls", destfile="file.xls")
out <- read.xls("file.xls", header=TRUE, pattern="Some Pattern")
The pattern argument tells read.xls() to ignore everything until the first line in which Some Pattern appears. You can change the value to something that lets you skip the preliminary material before the actual data you want in your data frame.
I just found a solution, thank you Tim for putting me in the right direction:
library(gdata)
DownloadURL <- "http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls"
out <- read.xls(DownloadURL, pattern="ORD", perl = "C:\\Perl64\\bin\\perl.exe")
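Once the sheet is in a data frame, the "only lines containing ORD" step can also be done in base R as a post-filter. The data frame below is made up for illustration (the real file's column names differ):

```r
# Hypothetical stand-in for the imported spreadsheet.
out <- data.frame(code = c("ORD-1", "PRF-2", "ORD-3"),
                  qty  = c(10, 20, 30))

# Keep rows where any column matches "ORD"; apply() coerces each
# row to character before grepl() scans it.
keep <- apply(out, 1, function(row) any(grepl("ORD", row)))
ord_rows <- out[keep, ]
```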

In Scan EOF error while reading CSV file

I am trying to read a CSV file into R. I tried:
data <- read.csv(file="train.csv")
Warning message:
In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
EOF within quoted string
But, this reads in only a small percentage of the total observations. Then I tried removing quotes:
data <- read.csv(file = "train.csv", quote = "", sep = ",", header = TRUE)
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
Since the data is text, it seems there is some issue with the delimiter.
It is difficult to share the entire data set as it is huge. I tried going to the line where the error comes, but there seems to be no non printable character. I also tried other readers like fread(), but to no avail.
I have encountered this before; it can be very tricky. Try a specialized CSV reader:
library(readr)
data <- read_csv(file="train.csv")
This should do it.
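If you need to stay in base R, count.fields can help locate where the quoting breaks: it reports NA for a line whose quoted field spills over onto the following line. The file below is a minimal reconstruction, not the asker's train.csv:

```r
# Line 3 opens a quote that only closes on line 4, the same shape
# of problem that produces "EOF within quoted string" at file end.
path <- tempfile(fileext = ".csv")
writeLines(c("id,text",
             '1,"fine"',
             '2,"spans',
             'lines",3'), path)

n   <- count.fields(path, sep = ",")
bad <- which(is.na(n))   # lines caught inside a multi-line quoted field
```

Inspecting the lines around `bad` with readLines usually reveals the stray quote character.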

read.csv - skip rows with a different number of columns

There are 5 rows at the top of my csv file which serve as information about the file, which I do not need.
These information rows have only 2 columns, while the header and data rows (row 6 onwards) have 8. This appears to be the cause of the issue.
I have tried using the skip argument of read.csv to skip these lines, and the same with read.table:
df = read.csv("myfile.csv", skip=5)
df = read.table("myfile.csv", skip=5)
but this still gives me the same error message, which is:
Error in read.table("myfile.csv", : empty beginning of file
In addition: Warning messages:
1: In readLines(file, skip) : line 1 appears to contain an embedded nul
2: In readLines(file, skip) : line 2 appears to contain an embedded nul
...
5: In readLines(file, skip) : line 5 appears to contain an embedded nul
How can I get this .csv to be read into R without the embedded nuls in the first 5 rows causing this issue?
You could try:
read.csv(text=readLines('myfile.csv')[-(1:5)])
This will initially store each line in its own vector element, then drop the first five and treat the rest as a csv.
You can get rid of the warning messages with the skipNul parameter:
read.csv(text=readLines('myfile.csv', skipNul=TRUE)[-(1:5)])
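Putting the pieces together, here is a self-contained demo of the approach, with a throwaway file standing in for myfile.csv:

```r
# Five metadata lines, then a normal header + data CSV.
path <- tempfile(fileext = ".csv")
writeLines(c(paste("info line", 1:5),
             "x,y", "1,2", "3,4"), path)

# readLines reads everything (skipNul = TRUE silences nul warnings),
# [-(1:5)] drops the metadata, and read.csv parses the rest.
df <- read.csv(text = readLines(path, skipNul = TRUE)[-(1:5)])
```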
