Importing data from an Excel file online - r

I am trying to download an Excel file from the web and read only the lines that contain the word "ORD".
library(RCurl)
fileUrl <- "http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls"
x <- getURLContent(fileUrl)
out <- read.table(fileUrl, x)
I am using getURLContent() but receive errors at an early stage of the process:
Warning messages:
1: In read.table(fileUrl, x) : line 1 appears to contain embedded nulls
2: In read.table(fileUrl, x) : line 2 appears to contain embedded nulls
3: In read.table(fileUrl, x) : line 3 appears to contain embedded nulls
4: In read.table(fileUrl, x) : line 4 appears to contain embedded nulls
5: In read.table(fileUrl, x) : line 5 appears to contain embedded nulls
6: In if (!header) rlabp <- FALSE :
the condition has length > 1 and only the first element will be used
7: In if (header) { :
the condition has length > 1 and only the first element will be used
8: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : embedded nul(s) found in input
The table "out" comes out almost unreadable. Does anyone knows how to read exactly the specific line rather than importing the whole file at the risk of getting the error lines?

One of the answers to this SO question recommends using the gdata package to download the Excel file from the web and then using read.xls() to read it into a data frame. Something like this:
library(gdata)
download.file("http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls", destfile="file.xls")
out <- read.xls("file.xls", header=TRUE, pattern="Some Pattern")
The pattern argument tells read.xls() to ignore everything before the first line in which Some Pattern appears. You can set it to a value that skips the preliminary material ahead of the actual data you want in your data frame.

I just found a solution, thank you Tim for pointing me in the right direction:
library(gdata)
DownloadURL <- "http://www.hkexnews.hk/reports/sharerepur/documents/SRRPT20151211.xls"
out <- read.xls(DownloadURL, pattern="ORD", perl = "C:\\Perl64\\bin\\perl.exe")
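Note that pattern="ORD" only tells read.xls() where to start reading; it does not filter the remaining rows. If you then want to keep only the rows that actually contain "ORD", a minimal post-filter sketch in base R (checking every column, since the relevant column name is not known here):
# keep only the rows where some cell contains "ORD" (post-filter sketch)
keep <- apply(out, 1, function(row) any(grepl("ORD", row, fixed = TRUE)))
out_ord <- out[keep, ]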

Related

R error handling using read.table and defined colClasses having corrupt lines in CSV file

I have a big .csv file to read in. Unfortunately, some lines are corrupt, meaning that something is wrong in the formatting, like a number being 0.-02 instead of -0.02. Sometimes even the line break (\n) is missing, so that two lines merge into one.
I want to read the .csv file with read.table and define all colClasses to the format that I expect the file to have (except, of course, for the corrupt lines). This is a minimal example:
colNames <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")
inputText <- "2015-01-01;123;-0.01\n
2015-01-02;421;-0.022015-01-03;433;-0.04\n
2015-01-04;321;-0.03\n
2015-01-05;230;-0.05\n
2015-01-06;313;0.-02"
con <- textConnection(inputText, "r")
mydata <- read.table(con, sep=";", fill = T, colClasses = colClasses)
At the first corrupt line, read.table stops with the error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '-0.022015-01-03'
With this error message, I have no idea in which line of the input the error occurred. Hence my only option is to copy the string -0.022015-01-03 and search for it in the file. But this is really annoying if you have to do it for a lot of lines and always have to re-execute read.table until it detects the next corrupt line.
So my questions are:
Is there a way to get read.table to tell me the line in which the error occurred (and maybe save it for further processing)?
Is there a way to get read.table to simply skip lines with improper formatting instead of stopping at an error?
Did anyone figure out a way to display such lines for manual correction during the read process? I mean, maybe display the whole corrupt line in plain CSV format for manual correction (maybe including the line before and after it) and then continue the read-in process, including the corrected lines.
What I have tried so far is to read everything with colClasses="character" to avoid format checking in the first place, then do the format checking while converting every column to the right format, and finally which() all lines where the conversion failed or produced NA and delete them.
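For reference, a minimal sketch of that character-first approach on the example input above (the V1..V5 names are read.table's own defaults, since fill = TRUE widens the table at the merged line):
con <- textConnection(inputText, "r")
raw <- read.table(con, sep = ";", fill = TRUE, colClasses = "character")
close(con)
# attempt the conversions; failures become NA instead of stopping the read
parA <- suppressWarnings(as.numeric(raw$V2))
parB <- suppressWarnings(as.numeric(raw$V3))
# keep only the rows where every conversion succeeded
good <- !is.na(parA) & !is.na(parB)
mydata <- data.frame(date = raw$V1[good], parA = parA[good], parB = parB[good])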
I have a solution, but it is very slow.
With ideas I got from some of the comments, the next thing I tried was to read the input line by line with readLines and pass each line to read.table via the text argument. If read.table fails, the line is presented to the user via edit() for correction and re-submission. Here is my code:
con <- textConnection(inputText, "r")
mydata <- data.frame()
while (length(text <- readLines(con, n = 1)) > 0) {
  correction <- TRUE
  while (correction) {
    err <- tryCatch(part <- read.table(text = text, sep = ";", fill = TRUE,
                                       col.names = colNames,
                                       colClasses = colClasses),
                    error = function(e) e)
    if (inherits(err, "error")) {
      # try to correct this line
      message(err, "\n")
      text <- edit(text)
    } else {
      correction <- FALSE
    }
  }
  mydata <- rbind(mydata, part)
}
If the user makes the corrections correctly, this returns:
> mydata
date parA parB
1 2015-01-01 123 -0.01
2 2015-01-02 421 -0.02
3 2015-01-03 433 -0.04
4 2015-01-04 321 -0.03
5 2015-01-05 230 -0.05
6 2015-01-06 313 -0.02
The input text had 5 lines, since one linefeed was missing. The corrected output has 6 lines and the 0.-02 is corrected to -0.02.
What I would still change in this solution is to collect all corrupt lines and present them together for correction after everything has been read in. That way the user can run the script and make all corrections at once after it finishes. But for a minimal example this should be enough.
The really bad thing about this solution is that it is really slow, too slow to handle big datasets. Hence I would still like another solution using more standard methods or perhaps a special package.
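One faster direction using standard methods (a sketch; the validity check is an assumption you would adapt to your real format): validate all lines up front with a vectorized regular expression, hand only the failures to edit(), and parse the cleaned text with a single read.table call.
con <- textConnection(inputText, "r")
allLines <- readLines(con)
close(con)
# a well-formed line matches date;number;number (adjust to your format)
ok <- grepl("^\\d{4}-\\d{2}-\\d{2};-?\\d+(\\.\\d+)?;-?\\d+(\\.\\d+)?$", allLines)
# present only the bad lines for correction (assumes lines are fixed, not added or removed)
if (any(!ok)) allLines[!ok] <- edit(allLines[!ok])
mydata <- read.table(text = allLines, sep = ";",
                     col.names = colNames, colClasses = colClasses)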

Not being able to properly read csv with R

I run the following:
data <- read.csv("C:/Users/user/Desktop/directory/file.csv")
These are my warning messages:
Warning messages:
1: In read.table("C:/Users/user/Desktop/directory/file.csv", :
line 1 appears to contain embedded nulls
2: In read.table("C:/Users/user/Desktop/directory/file.csv", :
line 2 appears to contain embedded nulls
....
6: In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
embedded nul(s) found in input
The resulting data frame is just a column filled with NAs.
A problem might be that all my data is in one column, separated by commas, like this (first row is the header, second row is example data):
stat_breed,tran_disp,train_key,t_type,trainer_Id,horsedata_Id
QH,"Tavares Joe","214801","E",0,0
What can I do to accurately read my data?
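A common cause of embedded nulls is a file that is actually UTF-16 encoded. A minimal sketch, assuming that is the case here (the skipNul route mentioned in the next question can also silence the warnings):
# assumption: the nulls come from UTF-16 encoding, so declare it when reading
data <- read.csv("C:/Users/user/Desktop/directory/file.csv",
                 fileEncoding = "UTF-16LE")
# alternatively, drop the nul bytes while reading lines, then parse the text
data <- read.csv(text = readLines("C:/Users/user/Desktop/directory/file.csv",
                                  skipNul = TRUE))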

r - read.csv - skip rows with different number of columns

There are 5 rows at the top of my csv file which serve as information about the file, which I do not need.
These information rows have only 2 columns, while the header and the data rows (from row 6 onwards) have 8. This appears to be the cause of the issue.
I have tried using the skip argument of read.csv to skip these lines, and the same with read.table:
df = read.csv("myfile.csv", skip=5)
df = read.table("myfile.csv", skip=5)
but this still gives me the same error message, which is:
Error in read.table("myfile.csv", : empty beginning of file
In addition: Warning messages:
1: In readLines(file, skip) : line 1 appears to contain an embedded nul
2: In readLines(file, skip) : line 2 appears to contain an embedded nul
...
5: In readLines(file, skip) : line 5 appears to contain an embedded nul
How can I get this .csv to be read into r without the null values in the first 5 rows causing this issue?
You could try:
read.csv(text=readLines('myfile.csv')[-(1:5)])
This will initially store each line in its own vector element, then drop the first five and treat the rest as a csv.
You can get rid of the warning messages with the skipNul parameter:
read.csv(text=readLines('myfile.csv', skipNul=TRUE)[-(1:5)])

In read.table(): incomplete final line found by readTableHeader

I have a CSV file; when I try to read.csv() that file, I get the warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
incomplete final line found by readTableHeader on ...
And I cannot isolate the problem, despite scouring StackOverflow and R-help for solutions.
This is the Dropbox link for the data: https://www.dropbox.com/s/h0fp0hmnjaca9ff/PING%20CONCOURS%20DONNES.csv
As explained by Hendrik Pon, the message indicates that the last line of the file doesn't end with an End Of Line (EOL) character (a linefeed (\n) or carriage return + linefeed (\r\n)).
The remedy is simple:
Open the file
Navigate to the very last line of the file
Place the cursor at the end of that line
Press return/enter
Save the file
So here is your file, read without the warning:
df=read.table("C:\\Users\\Administrator\\Desktop\\tp.csv",header=F,sep=";")
df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 Date 20/12/2013 09:04 20/12/2013 09:08 20/12/2013 09:12 20/12/2013 09:16 20/12/2013 09:20 20/12/2013 09:24 20/12/2013 09:28 20/12/2013 09:32 20/12/2013 09:36
2 1 1,3631 1,3632 1,3634 1,3633 1,363 1,3632 1,3632 1,3632 1,3629
3 2 0,83407 0,83408 0,83415 0,83416 0,83404 0,83386 0,83407 0,83438 0,83472
4 3 142,35 142,38 142,41 142,4 142,41 142,42 142,39 142,42 142,4
5 4 1,2263 1,22635 1,22628 1,22618 1,22614 1,22609 1,22624 1,22643 1,2265
But I think you should not read it in this way, because you will then have to reshape the data frame again. Thanks.
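If you do read it this way, one possible reshape (a sketch, assuming the transposed layout and comma decimal separators shown above):
# turn each numbered row into a column; row 1 holds the timestamps
vals <- as.data.frame(t(df[-1, -1]), stringsAsFactors = FALSE)
names(vals) <- paste0("series", df[-1, 1])  # hypothetical names based on column 1
# the file uses "," as the decimal separator
vals[] <- lapply(vals, function(x) as.numeric(sub(",", ".", x, fixed = TRUE)))
vals$date <- vapply(df[1, -1], as.character, character(1))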
I faced the same problem while creating a data matrix in Notepad.
So I went to the last row of the data matrix and pressed Enter. Now I have an n-line data matrix and a new blank line with the cursor at the start of line n+1.
Problem solved.
This is not a CSV file: each line is a column. You can parse it manually, e.g.:
file <- '~/Downloads/PING CONCOURS DONNES.csv'
lines <- readLines(file)
# split each line on ";": the first field is a header, the rest are values
columns <- strsplit(lines, ';')
headers <- sapply(columns, '[[', 1)
data <- lapply(columns, '[', -1)
# bind the value vectors together as columns of a matrix
df <- do.call(cbind, data)
colnames(df) <- headers
print(head(df))
Note that you can ignore the warning; it is due to the missing end-of-line on the last line.
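Also note that do.call(cbind, data) yields a character matrix. If you want typed columns, one possible follow-up is base R's type.convert (a sketch; dec = "," because this file uses comma decimal separators):
# convert the character matrix to a data frame and guess column types
df <- as.data.frame(df, stringsAsFactors = FALSE)
df[] <- lapply(df, type.convert, as.is = TRUE, dec = ",")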
I had the same problem with .xls files.
My solution is to save the file as a tab-delimited .txt. You can then manually change the .txt extension back to .xls and open the data frame with read.delim.
This is a very crude way to overcome the issue, though.
Having a "proper" CSV file depends on the software that was used to generate it in the first place.
Consider Google Sheets. The warning will be issued every time the CSV file, downloaded via utils::download.file, contains fewer than five lines. This is likely related to the fact that (utils::read.table):
The number of data columns is determined by looking at the first five lines of input (or the whole input if it has less than five lines), or from the length of col.names if it is specified and is longer.
In my short experience, if the data in the CSV file is rectangular, then the warning can be ignored.
Now consider LibreOffice Calc. There won't be any warnings, irrespective of the number of lines in the CSV file.
I had a similar issue that didn't get resolved by the "enter method". After the mentioned error, I noticed the row count of the data frame was lower than that of the CSV. Some non-alphanumeric values were hindering the import to R.
I followed Aurezio's comment (https://stackoverflow.com/a/29150226) to remove non-alphanumeric values (I included the space character).
Here is the snippet:
Function CleanCode(Rng As Range)
    Dim strTemp As String
    Dim n As Long
    For n = 1 To Len(Rng)
        Select Case Asc(Mid(UCase(Rng), n, 1))
            Case 32, 48 To 57, 65 To 90
                strTemp = strTemp & Mid(UCase(Rng), n, 1)
        End Select
    Next
    CleanCode = strTemp
End Function
I then used CleanCode as a function for the final result
Another option: sending an extra linefeed from R (instead of opening the file)
From Getting Data from Excel to R
cat("\n", file = file.choose(), append = TRUE)
Or you can simply open that Excel file and save it as a .csv file, and voilà, the warning is gone.

fread unable to read .csv files with first column empty

Say I have a test.csv file that looks like this:
,a,b,c,d,e
If I try to read it using read.csv, it works fine.
read.csv("test.csv",header=FALSE)
# V1 V2 V3 V4 V5 V6
#1 NA a b c d e
#Warning message:
#In read.table(file = file, header = header, sep = sep, quote = quote, :
# incomplete final line found by readTableHeader on 'test.csv'
However, if I attempt to read this file using fread, I get an error instead.
require(data.table)
fread("test.csv",header=FALSE)
#Error in fread("test.csv", header = FALSE) :
# Not positioned correctly after testing format of header row. ch=','
Why does this happen and what can I do to correct this?
As for me, my problem was only that the first ? rows of my file had a missing ID value.
So I was able to solve the problem by specifying autostart to be sufficiently far into the file that a nonmissing value popped up:
fread("test.csv", autostart = 100L, skip = "A")
This guarantees that when fread attempts to automatically identify sep and sep2, it does so at a well-formatted place in the file.
Specifying skip also makes sure fread finds the correct row on which to base the names of the columns.
If indeed there are no nonmissing values for the first field, you're better off just deleting that field from the .csv with Richard Scriven's approach or a find-and-replace in your favorite text editor.
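For completeness, a sketch of that delete-the-field route done from R rather than a text editor (it assumes the empty field is always the very first one on each line):
# strip the leading empty field from every line, then re-read with fread
lines <- readLines("test.csv")
writeLines(sub("^,", "", lines), "test_fixed.csv")
fread("test_fixed.csv", header = FALSE)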
I think you could use the skip/select/drop arguments of the fread function for this purpose:
fread("myfile.csv", sep=",", header=FALSE, skip="A")          # skip lines until the first one containing "A"
fread("myfile.csv", sep=",", header=FALSE, select=c(2,3,4,5)) # read only columns 2 to 5
fread("myfile.csv", sep=",", header=FALSE, drop="A")          # drop the column named "A"
I've tried making that csv file and running the code. It seems to work now - same for other people? I thought it might be an issue with not having a newline at the end (hence the warning from read.csv), but fread copes fine whether there's a newline at the end or not.
