I am using R to extract HTML tables from a website.
However, the table is in Hindi and the text is displayed as Unicode escape sequences.
Is there any way to set/install the font family, or otherwise get the actual text instead of the escape sequences?
The code I use is:
library('XML')
table <- readHTMLTable(<the html file>)
n.rows <- unlist(lapply(table, function(t) dim(t)[1]))
table[[which.max(n.rows)]]
The sample site is : http://mpbhuabhilekh.nic.in/bhunakshaweb/reports/mpror.jsp?base=wz/CP8M/wj/DP8I/wz/CoA==&vsrno=26-03-02-00049-082&year=2013&plotno=71
The output comes as:
"<U+092A>"
etc.
Note: for some reason, readHTMLTable works only when I remove the first two unwanted tables from the HTML file. So if you want to test with the file, please edit out the first two tables, or simply delete the first two table headers from the file.
Any help will be appreciated. Thanks
Update:
The issue seems to be related to the locale set in R on Windows machines. I was unable to figure out how to get it working through the locale, though!
The solution I have found for this locale-related bug is to set the corresponding encoding on the extracted columns:
library('XML')
table <- readHTMLTable(<the html file>)
n.rows <- unlist(lapply(table, function(t) dim(t)[1]))
output <- table[[which.max(n.rows)]]
# mark the factor levels as UTF-8 (the only Unicode value Encoding() accepts)
for (n in names(output)) Encoding(levels(output[[n]])) <- "UTF-8"
The output in the R console might still look like gibberish, but the advantage is that once you export the dataset (say, to a CSV), it all appears in Hindi in other editors.
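For completeness, a minimal sketch of that export step, assuming the data frame from above is in output (the file name is just an example):
# write the table as UTF-8 so external editors render the Devanagari text
write.csv(output, file = "hindi_table.csv", row.names = FALSE, fileEncoding = "UTF-8")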
Related
In R, I am extracting data from PDF tables using the tabulizer library. The names are in Nepali, and after extracting I get this table:
[1]: https://i.stack.imgur.com/Ltpqv.png
Now I want to change the names in column 2 to their English equivalents.
Is there any way to do this in R?
The R code I wrote was:
library(tabulizer)
location <- "https://citizenlifenepal.com/wp-content/uploads/2019/10/2nd-AGM.pdf"
out <- extract_tables(location,pages = 113)
##write.table(out,file = "try.txt")
final <- do.call(rbind, out)
final <- as.data.frame(final)  # creating the data frame
col_name <- c("S.No.","Types of Insurance","Inforce Policy Count", "","Sum Assured of Inforce Policies","","Sum at Risk","","Sum at Risk Transferred to Re-Insurer","","Sum At Risk Retained By Insurer","")
names(final) <- col_name
final <- final[-1,]
write.csv(final, file = "/cloud/project/Extracted_data/Citizen_life.csv", row.names = FALSE)
View(final)
It appears that document is using a non-Unicode encoding. This web site https://www.ashesh.com.np/preeti-unicode/ can convert some Nepali encodings to Unicode, which would display properly in R, assuming you have the right fonts loaded. When I tried it on the output of your code, it did something that looked okay to me, but I don't know Nepali:
> out[[1]][1,2]
[1] ";fjlws hLjg aLdf"
When I convert the contents of that string, I get
सावधिक जीवन बीमा
which looks to me something like the text on that page in the document. If it's actually written correctly, then converting it to English will need some Nepali speaker to do the translation: hopefully that's you, but if I use Google Translate, it gives
Term life insurance
So here's my suggestion: contact the owner of the www.ashesh.com.np website and find out if they can give you the conversion rules. Write an R function to implement them if you can't find one written by someone else. Then do the English translations manually.
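As a rough illustration of what such a function could look like, here is a minimal sketch. The mapping entries below are placeholders rather than the real Preeti table, and real conversion rules also involve reordering of vowel signs, so treat this only as a skeleton:
# skeleton of a character-by-character converter; the named vector is a
# placeholder and must be filled with the actual Preeti-to-Unicode rules
preeti_to_unicode <- function(x) {
  map <- c(";" = "\u0938", "f" = "\u093E")  # hypothetical example entries
  chars <- strsplit(x, "")[[1]]
  converted <- ifelse(chars %in% names(map), map[chars], chars)
  paste(converted, collapse = "")
}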
I want to collect some specific text from more than 200 PDF files, so I need something kind of "automatic" to help me.
All the PDFs have almost the same structure (but not similar enough for what I want to do). The text I need comes after "Palavras" in every PDF file, but not every PDF has only what I want following that.
The code I'm using now (with help from pdftools) collects the content between "Palavras" and "ABSTRACT":
lapply(x, function(x){
  list_output <- pdftools::pdf_text(x)
  text_output <- gsub('(\\s)+', ' ', paste(unlist(list_output), collapse=" "))
  trimws(regmatches(text_output, gregexpr("(?<=Palavras).*?(?=ABSTRACT)", text_output, perl=TRUE))[[1]][1])
})
But as I said, not every PDF has the same structure, so it doesn't work for a lot of the files.
I think the only thing that would work for me is to grab a certain number of characters after "Palavras": for example, code that would extract everything that comes after "Palavras" up to 200 or 300 characters. The problem is that I have no idea how to do that.
Any suggestions? Any help would be appreciated.
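A minimal sketch of that fixed-window idea, assuming x is still the vector of PDF paths used above and a 300-character cap (adjust the number to taste):
lapply(x, function(f) {
  # collapse each PDF into one whitespace-normalised string, as in the original code
  text_output <- gsub("(\\s)+", " ", paste(unlist(pdftools::pdf_text(f)), collapse = " "))
  # take up to 300 characters immediately after "Palavras"
  m <- regmatches(text_output, regexpr("(?<=Palavras).{1,300}", text_output, perl = TRUE))
  if (length(m) == 0) NA_character_ else trimws(m)
})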
I have been trying to generate a table in R Markdown with output to Word, looking like this (a very common table format for the chemical sciences):
I started with kable, using markdown syntax to get the subscripts etc. (e.g. [FeBr~2~(dpbz)~2~]), which worked in the Word document. However, I could not modify the table design and, most importantly, I could not figure out how to get the headings to display properly. So I moved on to the flextable package. Here is my code so far (still a work in progress):
```{r DipUVvis, echo=FALSE, anchor='Table S', tab.cap="Summary of catalytic reactions monitored with *in situ* UV-Vis spectroscopy."}
library(flextable)
df <- data.frame(Entry=c('AMM 51^*a*^','AMM 52^*a*^','AMM 53^*a*^','AMM 54^*a*^','AMM 57^*b*^','AMM 58^*c*^','AMM 59^*d*^'),
                 Precat=c('[FeBr~2~(dpbz)~2~] (4.00)','[FeBr~2~(dpbz)~2~] (2.00)','[FeBr~2~(dpbz)~2~] (1.00)','[FeBr~2~(dpbz)~2~] (0.50)','[FeBr~2~(dpbz)~2~] (2.00)','[FeBr(dpbz)~2~] (1.00)','[FeBr~2~(dpbz)~2~] (2.00)'),
                 Nucl=c('Zn(4-tolyl)~2~/2 MgBr~2~ (100)','Zn(4-tolyl)~2~/2 MgBr~2~ (100)','Zn(4-tolyl)~2~/2 MgBr~2~ (100)','Zn(4-tolyl)~2~/2 MgBr~2~ (100)','Zn(4-tolyl)~2~/2 MgBr~2~ (100)','Zn(4-tolyl)~2~/2 MgBr~2~ (100)','Zn(4-tolyl)~2~/2 MgBr~2~ (100)'),
                 BnBr=c(0,0,0,0,'42 + 42',42,42))
tbl <- regulartable(df)
tbl <- set_header_labels(tbl, Entry='Entry', Precat='Pre-catalyst (mM)', Nucl='Nucleophile (mM)', BnBr='BnBr (mM)')
tbl <- align(tbl, align = "center", part = "all")
tbl <- autofit(tbl)
tbl
```
This took care of the headers, and with a bit of tweaking of the remaining parameters I think I can get the table to look like the picture above. The resulting table looks fine in the RStudio console from a formatting perspective:
However, there are two major issues:
1) The subscripts/superscripts are not being translated.
2) When I knit to Word, instead of a table I get 5 pages of code, which from my understanding must be the HTML code?
After many hours of trying to sort this out, I found that one possible cause is RStudio using an old version of pandoc (https://github.com/davidgohel/flextable/issues/34). Indeed that was the case for me, so I changed it by moving the newly installed pandoc files into the directory where RStudio looks and renaming them. This must have worked (see the console section in the second figure). However, it didn't change anything. Then I tried adding to my code:
knit_print(tbl)
This keeps giving an error:
Error in knit_print.flextable(tbl) : render_flextable needs to be used as a renderer for a knitr/rmarkdown R code chunk (render by rmarkdown).
Interestingly, when I removed the last line from the R chunk in RStudio (tbl) and added the following below the chunk (not in it):
`r tbl`
the table was generated in Word (of course I still didn't get the subscripts and superscripts right). It also had the caption at the top rather than the bottom, as a desirable side effect of generating the table after the main R chunk.
Any ideas of what is going on and how I can get the correct table output in Word? I'm really confused here, so thank you in advance for your help.
UPDATE: If I remove anchor='Table S' from the chunk header, the table comes out OK (still without the subscripts or superscripts, though), but then I can't automatically number the tables (I have used this: https://gist.github.com/benmarwick/f3e0cafe668f3d6ff6e5 for autonumbering and cross-referencing).
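On issue 1), it may help to know that flextable itself does not parse the ~...~/^...^ markdown syntax; sub- and superscripts can instead be built with its compose()/as_paragraph() helpers. A minimal, hedged sketch only (the cell indices and text are examples, and this does not address the knitting problem):
# rebuild one cell of the Precat column with a real subscript; i/j are example indices
tbl <- compose(tbl, i = 1, j = "Precat",
               value = as_paragraph("[FeBr", as_sub("2"), "(dpbz)", as_sub("2"), "] (4.00)"))
tbl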
I am learning Python (using 3.5). I realize I will probably take a bit of heat for posting my question. Here goes: I have literally reviewed several hundred posts, help docs, etc., all in an attempt to construct the code I need. No luck thus far. I hope someone can help me. I have a set of URLs, say 18 or more. Only 2 are illustrated here:
[1] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html"
[2] "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm"
I need to scrape all the data (text) behind each URL and write it out to individual text files (one for each URL) for future topic-model analysis. Right now, I pull in the URLs through R using rvest. I then take each URL (one at a time, by code) into Python and do the following:
from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('http://www.senate.mo.gov/media/14info/chappelle-nadal/Columns/012314-Condensed.html').read())
txt = soup.find('div', {'class' : 'body'})
print(soup.get_text())
#print(soup.prettify()) not much help
#store the info in an object, then write out the object
test = soup.get_text()
#below does write a file
open_file = open('23Jan2014cplNadal1.txt', 'w')
open_file.write(test)
open_file.close()
The above gets me partially to my target. It leaves me just a little cleanup of the text, but that's okay. The problem is that it is labor intensive.
Is there a way to:
1) Write a clean text file (without invisibles, etc.) out from R with all the listed URLs, one per line?
2) For Python 3.5: take all the URLs, once they are in that clean single file, and have some iterative process retrieve the text behind each URL and write out a text file for each URL's data (text) to a location on my hard drive?
I have to do this process for approximately 1000 state-level senators. Any help or direction is greatly appreciated.
Edit to original: Thank you so much, all. To N. Velasquez: I tried the following:
urls<-c("http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/120114.html",
"http://www.senate.mo.gov/media/14info/Chappelle-Nadal/releases/110614.htm"
)
for (url in urls) {
download.file(url, destfile = basename(url), method="curl", mode ="w", extra="-k")
}
HTML files are then written out to my working directory. However, is there a way to write out text files instead of HTML files? I've read the download.file documentation and can't figure out a way to produce individual text files. Regarding the suggestion of a for loop: is what I illustrate above what you mean for me to attempt? Thank you!
The answer to 1 is: sure!
The following code will loop through the URL list and export individual TXT files, as per your request.
Note that through rvest and html_node() you could get a much more structured dataset, with recurring parts of the HTML stored separately (header, office info, main body, URL, etc.).
library(rvest)
urls <- c("http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/111915.html",
          "http://www.senate.mo.gov/media/15info/Chappelle-Nadal/releases/092215.htm")
for (i in 1:length(urls)) {
  # pull the main content node of each page and strip line breaks
  ht <- html_text(html_node(read_html(urls[i]), xpath = '//*[@id="mainContent"]'), trim = TRUE)
  ht <- gsub("[\r\n]", "", ht)
  writeLines(ht, paste("DOC_", i, ".txt", sep = ""))
}
Look for the DOC_1.txt and DOC_2.txt in your working directory.
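For the other part of the question (getting a clean text file of the URLs themselves out of R, one per line), a minimal sketch would be (the output file name is just an example):
# write the URL vector to a plain text file, one URL per line
writeLines(urls, "url_list.txt")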
Why do I get garbled characters when parsing a web page?
I have used encoding="big-5\\IGNORE" to get the normal characters, but it doesn't work.
require(XML)
url <- "http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data <- htmlParse(url, isURL=TRUE, encoding="big-5\\IGNORE")
tdata <- xpathApply(data, "//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
How should I revise my code to turn the garbled characters into normal text?
@MartinMorgan (below) suggested using:
htmlParse(url, isURL=TRUE, encoding="big-5")
Here is an example of what is going on:
require(XML)
url <- "http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
options(encoding="big-5")
data <- htmlParse(url, isURL=TRUE, encoding="big-5")
tdata <- xpathApply(data, "//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header=TRUE, stringsAsFactors=FALSE)
stock
There should be 1335 records in total. In the case above there are only 309; many records appear to have been lost.
This is a complicated problem. There are a number of issues:
1) A badly formed HTML file
The page is not a standard, well-formed HTML file. Let me prove my point. Please run:
url <- "http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt <- download.file(url, destfile = "stockbig-5", quiet = TRUE)
Then try opening the downloaded file stockbig-5 with Firefox.
2) iconv function bug in R
If an HTML file is well formed, you can use:
data <- readLines(file)
datachange <- iconv(data, from = "source encoding", to = "target encoding//IGNORE")
When an HTML file is not well formed, that approach breaks. In this example, please run:
data <- readLines("stockbig-5")
A warning will occur:
1: In readLines("stockbig-5") :
  invalid input found on input connection 'stockbig-5'
You can't use R's iconv function to change the encoding of a badly formed HTML file.
You can, however, do this in the shell.
I solved it myself after one hard night.
System: Debian 6 (UTF-8 locale) + R 2.15 (UTF-8 locale) + GNOME Terminal (UTF-8 locale).
Here is the code:
require(XML)
url <- "http://www.hkex.com.hk/chi/market/sec_tradinfo/stockcode/eisdeqty_c.htm"
txt <- download.file(url, destfile = "stockbig-5", quiet = TRUE)
system('iconv -f big-5 -t UTF-8//IGNORE stockbig-5 > stockutf-8')
data <- htmlParse("stockutf-8", isURL = FALSE, encoding = "utf-8\\IGNORE")
tdata <- xpathApply(data, "//table[@class='table_grey_border']")
stock <- readHTMLTable(tdata[[1]], header = TRUE, stringsAsFactors = FALSE)
stock
I would like my code to be more elegant; the shell command inside the R code is arguably ugly:
system('iconv -f big5 -t UTF-8//IGNORE stockgb2312 > stockutf-8')
I have made attempts to replace it with pure R code and failed. How can I replace it with pure R code?
You can duplicate the result on your computer with the code above.
Half done, half a success; I will continue to try.
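For what it is worth, one possible pure-R direction is to read the raw bytes, convert them with R's iconv() using sub = "" to drop non-convertible bytes (roughly what //IGNORE does in the shell), and write the UTF-8 text back out. This is only an untested sketch using the same stockbig-5/stockutf-8 file names as above:
# untested sketch of a pure-R replacement for the shell iconv call
raw_bytes <- readBin("stockbig-5", what = "raw", n = file.info("stockbig-5")$size)
txt <- rawToChar(raw_bytes)                                # may fail if the file contains NUL bytes
utf8 <- iconv(txt, from = "big5", to = "UTF-8", sub = "")  # non-convertible bytes are dropped
writeLines(utf8, "stockutf-8", useBytes = TRUE)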