Read an .htm file and a table within the file in R - r

I have some .htm files. They are really HTML files that contain a data table.
I know how to process them in Python, using pandas to read the HTML file:
df = pd.read_html("file")[0]
pandas returns the tables in the file as a list, and the table I need is at position [0].
How can I do that in R?

Related

Scilab unable to correctly read text and csv file

I wish to open and read the following text file in Scilab (version 6.0.2).
The original file is an .xlsx that I have converted to both .txt and .csv through Excel to facilitate opening & working with it in Scilab.
Using both fscanfMat and csvRead, Scilab only reads the first column, as NaN. I understand why the first column is read as NaN, but I do not see why the rest of the document isn't read. Columns 2 and 3 are of particular interest to me.
For csvRead, I used:
M=csvRead(chemin+filename," ",",",[],[],[],[],7);
to skip the 7-row header.
Could it be something to do with the way in which the file has been formatted?
For anyone able to help, I will try to upload an example of a .txt file and also the original .xlsx file
Files available for download, here: Excel and Text files
If you convert your xlsx file into an xls file with Excel, you can read it with the readxls function.
Your separator is a tab character (ASCII code 9). Use the following command:
M=csvRead("Probe1_350N_2S.txt",ascii(9),",",[],[],[],[],7);
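For comparison, here is a minimal Python sketch of the same idea (Python rather than Scilab, standard library only): a tab separator, a decimal comma, and a 7-row header to skip. The function name read_tab_table and the sample data are hypothetical, and cells that are not numeric are turned into NaN, similar to what the question reports for the first column.

```python
import csv
import io


def read_tab_table(text, header_rows=7, decimal=","):
    """Parse tab-separated text into rows of floats, skipping header rows.

    Cells that cannot be parsed as numbers become float('nan').
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    rows = []
    for line_no, row in enumerate(reader):
        if line_no < header_rows or not row:
            continue  # skip the header block and blank lines
        parsed = []
        for cell in row:
            try:
                # Turn a decimal comma into a decimal point before converting.
                parsed.append(float(cell.strip().replace(decimal, ".")))
            except ValueError:
                parsed.append(float("nan"))
        rows.append(parsed)
    return rows


# Hypothetical sample: 7 header lines, then a text label and two numeric columns.
sample = "h\n" * 7 + "A\t1,5\t2,25\nB\t3,0\t4,75\n"
table = read_tab_table(sample)
```

To use it on a real file, read the file contents into a string first and pass that string in place of sample.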

Importing to R an Excel file saved as web-page

I would like to open an Excel file saved as webpage using R and I keep getting error messages.
The desired steps are:
1) Upload the file into RStudio
2) Change the format into a data frame / tibble
3) Save the file as an xls
The message I get when I open the file in Excel is that the file format (excel webpage format) and extension format (xls) differ. I have tried the steps in this answer, but to no avail. I would be grateful for any help!
I don't expect anybody will be able to give you a definitive answer without a link to the actual file. The complication is that many services will write files as .xls or .xlsx without them being valid Excel format. This is done because Excel is so common and some non-technical people feel more confident working with Excel files than a csv file. Now, the files will have been stored in a format that Excel can deal with (hence your warning message), but R's libraries are more strict and don't see the actual file type they were expecting, so they fail.
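One way to check what a file with an .xls or .xlsx extension really contains is to look at its first bytes. The sketch below is in Python (standard library only) and is just an illustration: the function name sniff_spreadsheet_format and the returned labels are made up for this example, and the list of markup prefixes is an assumption that will not cover every file.

```python
def sniff_spreadsheet_format(head):
    """Guess the real format of a file from its first bytes."""
    if head.startswith(b"PK\x03\x04"):
        return "zip"  # zip container, which is what a genuine .xlsx is
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"  # OLE2 compound file, which is what a genuine .xls is
    # Ignore a UTF-8 byte order mark and leading whitespace before checking for markup.
    stripped = head.lstrip(b"\xef\xbb\xbf \t\r\n").lower()
    if stripped.startswith((b"<html", b"<!doctype", b"<table", b"<?xml", b"<meta")):
        return "markup"  # HTML or XML text saved under a spreadsheet extension
    return "unknown"


# In-memory samples; for a real file, pass open(path, "rb").read(64) instead.
kind_html = sniff_spreadsheet_format(b"\r\n<html xmlns:x='urn:x'><body><table>")
kind_xlsx = sniff_spreadsheet_format(b"PK\x03\x04\x14\x00")
kind_xls = sniff_spreadsheet_format(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1\x00\x00")
```

If the result is "markup", an HTML table reader is the right tool rather than an Excel reader.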
That said, the below steps worked for me when I last encountered this problem. A service was outputting .xls files which were actually just HTML tables saved with an .xls file extension.
1) Download the file to work with it locally. You can script this of course, e.g. with download.file(), but this step helps eliminate other errors involved in working directly with a webpage or connection.
2) Load the full file with readHTMLTable() from the XML package
library(XML)
dTemp = readHTMLTable([filename], stringsAsFactors = FALSE)
This will return a list of dataframes. Your result set will quite likely be the second element or later (see ?readHTMLTable for an example with explanation). You will probably need to experiment here and explore the list structure as it may have nested lists.
3) Extract the relevant list element, e.g.
df = dTemp[[2]]  # double brackets return the data frame itself; dTemp[2] would return a one-element list
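The same "list of tables, then pick one" pattern can be sketched in Python with only the standard library. This is an illustration, not a replacement for readHTMLTable: the class name TableCollector and the sample HTML are hypothetical, and the parser assumes every row and cell has a closing tag and that tables are not nested.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect every <table> in a document as a list of rows of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>, in document order
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


# Hypothetical file contents: a layout table first, then the data table.
sample = """
<html><body>
<table><tr><td>layout only</td></tr></table>
<table>
  <tr><th>id</th><th>value</th></tr>
  <tr><td>1</td><td>3.5</td></tr>
</table>
</body></html>
"""
collector = TableCollector()
collector.feed(sample)
tables = collector.tables
data = tables[1]  # second table; Python counts from 0, R counts from 1
```

As with the R list, it is worth printing len(tables) and the first row of each entry before deciding which one holds the data.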
You also mention writing out the final data frame as an xls file which suggests you want the old-style format. I would suggest the package WriteXLS for this purpose.
I seriously doubt the Excel file is 'saved as a web page'. I'm pretty sure the file just sits on a server and all you have to do is fetch it. Some kinds of files (in particular Excel and h5) are binary rather than text files. The download then needs an extra setting (mode="wb") to tell R that the file is binary and should be handled accordingly.
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download.file(url=myurl, destfile="localcopy.xlsx", mode="wb")
or use the downloader package and try something like this:
library(downloader)
myurl <- "http://127.0.0.1/imaginary/file.xlsx"
download(myurl, destfile="localcopy.xlsx", mode="wb")

How to convert doc to docx file using R code

I tried to read a doc file using readdoc(), but when the doc file contains tables, it is not able to read them properly.
Therefore I want to convert the doc file to a docx file so that I can extract the tables using the docxtractr package available in R.
I want to convert .doc file to .docx file using R code.

How do you convert a table that is in a .docx file to an .xlsx or a csv file in python or R?

I have a document like the one mentioned below. There is some text above the table, and then there's a table. How do I extract the table from the docx file in R or Python and then convert it to a csv file or an xlsx file? I don't even mind a .txt file if it retains the exact format of the table. I just don't know what to do with this doc file.
If the document is docx, then it is all XML. The docx file is just a zip container with various XML "parts". Take a look at the Open XML SDK for some ideas on how to parse the file. This SDK is C#, but maybe you can get some ideas from that.
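To make the "zip container with XML parts" point concrete, here is a minimal Python sketch using only the standard library. The function name docx_tables is hypothetical, and the sketch rests on some assumptions: it reads only the word/document.xml part, looks only at rows and cells that are direct children of their parent, and reports a nested table both inside its outer cell and as a separate table. The demo builds a tiny stand-in archive in memory rather than using a real Word file.

```python
import csv
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML elements (w:tbl, w:tr, w:tc, w:t).
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_tables(source):
    """Return every table in a .docx as a list of rows of cell text.

    `source` is a path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            row = []
            for tc in tr.findall(W + "tc"):
                # Join all text runs found anywhere inside the cell.
                row.append("".join(t.text or "" for t in tc.iter(W + "t")))
            rows.append(row)
        tables.append(rows)
    return tables


# Hypothetical stand-in for a real document: some text, then a 2x2 table.
document_xml = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p><w:r><w:t>Intro text</w:t></w:r></w:p>"
    "<w:tbl>"
    "<w:tr><w:tc><w:p><w:r><w:t>name</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>qty</w:t></w:r></w:p></w:tc></w:tr>"
    "<w:tr><w:tc><w:p><w:r><w:t>bolt</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>4</w:t></w:r></w:p></w:tc></w:tr>"
    "</w:tbl></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)

tables = docx_tables(buffer)

# Write the first table out as CSV text.
out = io.StringIO()
csv.writer(out).writerows(tables[0])
csv_text = out.getvalue()
```

For a real file, pass its path to docx_tables and write csv_text to disk; the text above the table is simply skipped because it is not inside a w:tbl element.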
If you are just going to extract the table, it should not be too bad. (Updating complex docx documents can get very complicated. I'm working on this now.) My tip to make things easier is to go to the table properties, then to the Alt Text tab, and add a unique value to the "Title" field. The value will show up like this within the table properties: <w:tblCaption w:val="TBL1"/>, which will make the table easier to find in the XML.
If you are going to work with Open XML documents, get the OOXML Chrome Addin. That is great for exploring the internals of docx files.
Note: I saw the link to another SO answer for this. That uses "automation", which is certainly easier to code, but Office via "automation" on the server is not recommended by MS.
You can extract tables from a docx file using python-docx in Python.
Try this:
from docx import Document
import pandas as pd

# file_path is the path to your .docx file; set it before running
document = Document(file_path)
for index, table in enumerate(document.tables):
    # Build an empty rows x columns grid, then fill it with the cell text
    df = [['' for i in range(len(table.columns))] for j in range(len(table.rows))]
    for i, row in enumerate(table.rows):
        for j, cell in enumerate(row.cells):
            df[i][j] = cell.text
    # Write each table to its own workbook
    pd.DataFrame(df).to_excel("Table# "+str(index)+".xlsx")

How to write multiple data frames to multiple sheets of one csv Excel file in R?

I am trying to write multiple data frames to a single csv-formatted file, with each one in a different sheet of the Excel file:
write.csv(dataframe1, file = "file1.csv",row.names=FALSE)
write.csv(dataframe2, file = "file2.csv",row.names=FALSE)
Is there any way to specify the sheet along with the csv file in this code and write them all to one file?
Thank you in advance.
This is not possible. A csv file is plain text and by design holds a single sheet, which is why you can view it in Notepad or any similar software. If you write a second data frame to the same file, the first one gets overwritten. You can check this yourself: open a csv file in Excel, add a new sheet, write some values in it and save. Only the active sheet is saved, so the values on the other sheet are lost. A file in csv format can have only one sheet.
The xlsx and XLConnect packages will do the trick as well.
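If the output has to stay in csv format, the usual fallback is one csv file per data frame, as in the question's own code. Here is a minimal Python sketch of that pattern (Python rather than R, standard library only); the function name write_one_csv_per_table, the table names, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_one_csv_per_table(tables, directory):
    """Write each named table (a list of rows) to <directory>/<name>.csv."""
    paths = []
    for name, rows in tables.items():
        path = os.path.join(directory, name + ".csv")
        # newline="" lets the csv module control line endings itself.
        with open(path, "w", newline="", encoding="utf-8") as handle:
            csv.writer(handle).writerows(rows)
        paths.append(path)
    return paths


# Hypothetical stand-ins for the two data frames in the question.
tables = {
    "dataframe1": [["id", "value"], [1, 2.5]],
    "dataframe2": [["id", "label"], [1, "a"]],
}
out_dir = tempfile.mkdtemp()
paths = write_one_csv_per_table(tables, out_dir)
```

Each table keeps its own file, so nothing is overwritten; a true multi-sheet file needs a workbook format such as xlsx.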
