R: read a csv file without carriage returns

I'm trying to read a .csv downloaded from iShares' website containing all the holdings of an ETF (plus some additional info).
Here is the csv.
My issue is that I can't get a neat data frame containing tickers, company names, asset classes, etc. using the read.csv function.
When I opened the csv file in Notepad I saw that all the rows are on a single line, with no special character separating the rows from each other.
Do you have any clues on how I should deal with this file?

That csv file uses the "LF" line-break convention found on Linux and other Unixes. Windows usually expects text files to use the "CRLF" convention, which is why the line breaks are not showing in Notepad. Another issue with that file is that there are 10 header lines before the csv content. Does this work in R?
read.csv(<fname>, skip=10)
If the line breaks are still an issue, there must be a Windows tool to convert LF line breaks to CRLF. On Linux it's "unix2dos" - I'm not sure what it would be on Windows.
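For what it's worth, R's read.csv handles LF-only files without any conversion, so skipping the metadata lines may be all that's needed. A minimal sketch, assuming the download is saved locally as "holdings.csv" (a hypothetical name) and the 10 metadata lines precede the real header row:

# "holdings.csv" is a hypothetical local file name
holdings <- read.csv("holdings.csv", skip = 10, stringsAsFactors = FALSE)
head(holdings)  # tickers, names, asset classes, etc. as columns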

Related

Missing delimiter error when importing html text

I'm playing with Azure Machine Learning using the Designer and am getting a "Delimiter not found" error when importing my data.
I originally started with a few hundred html files stored as Azure blobs. Each file would be considered a single row of text; however, I had no luck importing these files for further text analytics.
I created a Data Factory job that imported each file, stripped all the tabs, quotes, and CR/LFs from the text, added a column for the file name, and stored it all as a combined tab-delimited file. In Notepad++ I can confirm that the format is FileName tab HtmlText. This is the file I'm trying to import into ML, and I get the missing-delimiter message as I try to define the import module.
Here is the error when I try and create a dataset:
{
"message": "'Delimiter' is not specified or invalid."
}
Question 1: Is there a better way to do text analytics on a large collection of html files?
Question 2: Is there a format I need to use in my combined .tsv file that works?
Question 3: Is there perhaps a maximum length for the string column? My html can be tens of thousands of characters long.
You're right that it might be line length, but my guess is that there are still some special characters (e.g. anything starting with \) that aren't properly escaped or removed. How did you scrape and strip the text data? Have you tried using BeautifulSoup?
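If it helps, here is a minimal R sketch of the kind of cleanup being suggested: strip the characters that commonly break TSV parsing before writing the combined file. The folder and file names are hypothetical.

# "html_blobs" is a hypothetical folder of downloaded html files
files <- list.files("html_blobs", pattern = "\\.html$", full.names = TRUE)
clean <- vapply(files, function(f) {
  txt <- paste(readLines(f, warn = FALSE), collapse = " ")
  gsub("[\t\"\r\n\\\\]", " ", txt)  # remove tabs, quotes, CR/LF, backslashes
}, character(1))
out <- data.frame(FileName = basename(files), HtmlText = clean)
write.table(out, "combined.tsv", sep = "\t", quote = FALSE, row.names = FALSE)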

Scilab unable to correctly read text and csv file

I wish to open and read the following text file in Scilab (version 6.0.2).
The original file is an .xlsx that I have converted to both .txt and .csv through Excel to facilitate opening & working with it in Scilab.
Using both fscanfMat and csvRead, Scilab reads only the first column, and it comes back as NaN. I understand why the first column is considered NaN, but I do not see why the rest of the document isn't read. Columns 2 and 3 are of particular interest to me.
For csvRead, I used:
M=csvRead(chemin+filename," ",",",[],[],[],[],7);
to skip the 7-row header.
Could it be something to do with the way in which the file has been formatted?
For anyone able to help, I will try to upload an example of a .txt file and also the original .xlsx file.
Files available for download, here: Excel and Text files
If you convert your xlsx file into an xls one with Excel, you can read it with the readxls function.
Your separator is a tabulation character (ASCII code 9). Use the following command:
M=csvRead("Probe1_350N_2S.txt",ascii(9),",",[],[],[],[],7);

Exporting Chinese characters from Excel to R

I have a file in Excel which has a column with simplified Chinese characters. When I open the corresponding CSV file in R I only get ?'s.
I'm afraid the problem occurs when exporting from Excel to CSV, because when I open the CSV file in a text editor I also get ?'s.
How can I get around this?
The best way to preserve your Chinese/Unicode characters is to read the file from .xlsx directly:
library(readxl)
read_xlsx("yourfilepath.xlsx", col_types = "text")
If your file is too big to read from .xlsx, then the best way is to open Excel and split it manually into multiple files.
(My experience on a laptop with 8 GB RAM is to split files into 250,000 rows x 106 columns.)
If you need to read from .csv, all your Windows locale settings need to match your file's encoding, and even that does not guarantee the integrity of all your Unicode characters (e.g. emojis).
(If you also need a .csv for something else, you can use the R function write.csv after you read the data from .xlsx into R.)
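A minimal sketch of the CSV route, assuming the file was saved from Excel as "CSV UTF-8" (the file names are hypothetical); declaring the encoding explicitly is what usually rescues the characters:

# works only if the csv really is UTF-8; "yourfile.csv" is hypothetical
df <- read.csv("yourfile.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
# to produce a UTF-8 csv from data already read via read_xlsx:
write.csv(df, "yourfile_utf8.csv", fileEncoding = "UTF-8", row.names = FALSE)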

read an MSWord file into R

Is it possible to read an MSWord 2010 file into R? I have Windows 7 and a Dell PC.
I am using the line:
my.data <- readLines('c:/users/mark w miller/simple R programs/test_for_r.docx')
to try to read an MSWord file containing the following text:
A 20 1000 AA
B 30 1001 BB
C 10 1500 CC
I get a warning message that says:
Warning message:
In readLines("c:/users/mark w miller/simple R programs/test_for_r.docx") :
incomplete final line found on 'c:/users/mark w miller/simple R programs/test_for_r.docx'
and my.data appears to be gibberish:
# [1] "PK\003\004\024" "¤l" "ÈFÃË‹Átí"
I know that with this simple example I could easily convert the MSWord file to a different format. However, my actual data files consist of complex tables that were typed decades ago and scanned into pdf documents later. The age of the original paper documents, and perhaps imperfections in the paper, typing and/or scanning process, have resulted in some letters and numbers not being very clear. So far, converting the pdf files to MSWord seems to be the most successful way of correctly translating the tables. Converting the MSWord files to Excel or rich text, etc., has not been very successful. Even after conversion to MSWord the resulting files are very complex and contain numerous errors. I thought that if I could read the MSWord files into R, that might be the most efficient way to edit and correct them.
I am aware of the tm package, which I gather can read MSWord files into R, but I am a little concerned about using it because it seems to require installing third-party software.
Thank you for any suggestions.
First, readLines() is not the correct solution, since a Word file is not a text (that is, plain ASCII text) file. In fact, the "PK" at the start of your gibberish is the signature of a ZIP archive, which is what a .docx file actually is under the hood.
The Word-related function in the tm package is called readDOC() but both it and the required third-party tool (Antiword) are for older Word files (up to Word 2003) and will not work using newer .docx files.
The best I can suggest is that you try readPDF(), also found in the tm package. Note that it requires the tool pdftotext to be installed on your system - easy on Linux, no idea about Windows. Alternatively, find a Windows tool which converts PDF to plain ASCII text files (not Word files) - they should open and display correctly in Notepad - then try readLines() again. However, given that your PDF files are old and come from a scanner, conversion to text might be difficult.
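For reference, a minimal sketch of the readPDF() route, assuming pdftotext is on the PATH (the file name is hypothetical):

library(tm)
# "-layout" asks pdftotext to preserve the table's column layout
reader <- readPDF(control = list(text = "-layout"))
doc <- reader(elem = list(uri = "scanned_table.pdf"), language = "en", id = "doc1")
head(content(doc))  # one character string per line of extracted text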
Finally: I realise that you did not make the original decision in this instance, but for anybody else - Word and PDF are not appropriate formats for storing data that you want to parse.
In case it helps anyone else: it appears there's a newer package, readtext (https://cran.r-project.org/web/packages/readtext/vignettes/readtext_vignette.html), dedicated specifically to reading text data, including Word files (also the newer .docx format).
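A minimal sketch of what that would look like here, assuming readtext is installed (path taken from the question above):

library(readtext)
doc <- readtext("c:/users/mark w miller/simple R programs/test_for_r.docx")
doc$text  # the document's text as a single string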
I have not figured out how to read the MSWord file into R, but I have gotten the contents into a format that R can read.
1. Convert the pdf to MSWord with Acrobat X Pro.
2. Delete the solid vertical lines separating the columns in the MSWord file. It turns out these vertical lines were disrupting the format of the data when converting the MSWord file to a text file.
3. Convert the MSWord file to a text file, after deleting the vertical lines in Step 2.
Resulting text files still require extensive editing, but at least the data are largely present in a format R can read and I will not have to re-enter all data in the pdfs by hand, saving many hours of work.
You can do this with RDCOMClient very easily.
That said, some characters will not read in correctly.
require(RDCOMClient)
# Create the connection
wordApp <- COMCreate("Word.Application")
# Let's set visible to true so you can see it run
wordApp[["Visible"]] <- TRUE
# Define the file we want to open
wordFileName <- "c:/path/to/word/doc.docx"
# Open the file
doc <- wordApp[["Documents"]]$Open(wordFileName)
# Print the text
print(doc$range()$text())
# Close the document and quit Word when finished
doc$Close()
wordApp$Quit()

Merging EBCDIC converted files and pdf files into a single file and pushing to mainframes

I have two pdf files and two text files which have been converted into EBCDIC format. The two text files act as cover files for the pdf files, containing details like the pdf name, number of pages, etc. in a fixed format.
Cover1.det, Firstpdf.pdf, Cover2.det, Secondpdf.pdf
Format of the cover file could be:
Firstpdf.pdf|22|03/31/2012
that is
pdfname|page num|date generated
which is then converted into ebcdic format.
I want to merge all these files into a single file, in the order: first text file, first pdf file, second text file, second pdf file.
The idea is then to push this single merged file to the mainframe using scp.
1) How do I merge the above-mentioned four files into a single file?
2) Do I need to convert the pdf files into EBCDIC format as well? If yes, how?
3) As far as I know, mainframe files also need record-length details during transit. How do I find out the record length of the file, if I succeed in merging them into a single file?
I remember reading somewhere that this could be done using put and append in ftp. However, since I have to use scp, I am not sure how to achieve this merging.
Thanks for reading.
1) Why not use something like pkzip?
2) I don't think converting the pdf files to EBCDIC is necessary, or even possible. The files need to be transferred in binary mode.
3) Using pkzip and scp, you will not need the record length.
File merging can easily be achieved by using the cat command on Unix with the > and >> append operators.
Also, if the next file should start on a new line (as was my case), a blank echo can be inserted between files.
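For illustration, a minimal R equivalent of that cat-based merge; the merge is byte-for-byte, like cat a b > merged. File names are taken from the question, and "merged.dat" is hypothetical.

# concatenate the four files in cover/pdf order; "merged.dat" is hypothetical
parts  <- c("Cover1.det", "Firstpdf.pdf", "Cover2.det", "Secondpdf.pdf")
merged <- "merged.dat"
file.create(merged)
for (p in parts) stopifnot(file.append(merged, p))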
