R: write CSV in Unicode, need actual text - r

I have the following R script for scraping some textual data from a website.
library('rvest')
term_data_final <- c()
defn_data_final <- c()
for (term in 1:10) {
  url_base <- 'http://www.nplg.gov.ge/gwdict/index.php?a=term&d=9&t='
  url <- paste(url_base, term, sep="")
  webpage <- read_html(url)
  term_data_html <- html_nodes(webpage, '.term')
  term_data <- html_text(term_data_html)
  if (!grepl("\\?", term_data)) {
    term_data_final <- c(term_data_final, term_data)
    defn_data_html <- html_nodes(webpage, '.defnblock')
    defn_data <- html_text(defn_data_html)
    defn_data_final <- c(defn_data_final, defn_data)
  }
}
RusGeoDict <- data.frame(term_data_final, defn_data_final)
write.csv(RusGeoDict, file = 'RusGeoDict.csv', fileEncoding="UTF-8")
The script combines the scraped data into a data frame and then writes that data frame to a csv file. The scraped text is in Russian and Georgian characters, and when it is saved to a data frame and then a csv, I get hexadecimal Unicode escape sequences instead of text. When I print the vectors created before they are combined into a data frame, such as term_data_final, I get the original text, but once I save to a data frame and write to a csv file I get the Unicode escapes. Is there any way to get the original Georgian and Russian text saved to a csv without the Unicode output? Thanks!

Ok, I don't know any Russian, but I think you can just set the locale before you build the data frame:
Sys.setlocale("LC_CTYPE", "russian")
RusGeoDict <- data.frame(term_data_final, defn_data_final)
I just tried it and it seems to work, though I can't say for sure. Try it and report back with your findings.
Finally, see the link below for additional ideas.
https://www.r-bloggers.com/r-and-foreign-characters/
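If the locale change alone doesn't help (a common situation on Windows), another option is to write through an explicitly UTF-8 connection. A minimal sketch with made-up two-column data standing in for the scraped terms:

```r
# Georgian term and Russian definition, written via a UTF-8 connection so the
# multi-byte characters reach the file intact.
df <- data.frame(term = "\u10d0\u10d1\u10d2",
                 defn = "\u043f\u0440\u0438\u0432\u0435\u0442",
                 stringsAsFactors = FALSE)
con <- file("RusGeoDict.csv", open = "w", encoding = "UTF-8")
write.csv(df, file = con, row.names = FALSE)
close(con)
```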

Related

How to load .png images with image names listed in a .csv file to R

I am using a simple code below to append multiple images together with the R magick package. It works well, however, there are many images to process and their names are stored in a .csv file. Could anyone advise on how to load the image names to the image_read function from specific cells in a .csv file (see example below the code)? So far, I was not able to find anything appropriate that would solve this.
library (magick)
pic_A <- image_read('A.png')
pic_B <- image_read('B.png')
pic_C <- image_read('C.png')
combined <- c(pic_A, pic_B, pic_C)
combined <- image_scale(combined, "300x300")
image_info(combined)
final <- image_append(image_scale(combined, "x120"))
print(final)
image_write(final, "final.png") #to save
Something like this should work: if you load the csv into a data frame, it's then straightforward to point image_read at the appropriate elements.
The index (row number) is included in the output filename so that files are not overwritten on each iteration.
library(magick)
# stringsAsFactors = FALSE keeps the paths as character strings
# (the factor default in R < 4.0 would break image_read)
file_list <- read.csv("your.csv", header = FALSE, stringsAsFactors = FALSE)
names(file_list) <- c("A", "B", "C")
for (i in 1:nrow(file_list)) {
  pic_A <- image_read(file_list$A[i])
  pic_B <- image_read(file_list$B[i])
  pic_C <- image_read(file_list$C[i])
  combined <- c(pic_A, pic_B, pic_C)
  combined <- image_scale(combined, "300x300")
  image_info(combined)
  final <- image_append(image_scale(combined, "x120"))
  print(final)
  image_write(final, paste0("final_", i, ".png")) # to save
}
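A related failure mode: one typo'd path in the csv makes image_read() error halfway through the loop. Here is a small sketch that checks every path up front (the your.csv contents below are fabricated just so the demo is self-contained):

```r
# Fabricate a demo csv and empty image files so the sketch runs standalone.
writeLines("A1.png,B1.png,C1.png", "your.csv")
file.create(c("A1.png", "B1.png", "C1.png"))

# Validate all paths before any image_read() call, failing fast if one is bad.
file_list <- read.csv("your.csv", header = FALSE, stringsAsFactors = FALSE)
paths <- unlist(file_list, use.names = FALSE)
missing <- paths[!file.exists(paths)]
if (length(missing) > 0)
  stop("Missing image files: ", paste(missing, collapse = ", "))
```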

R scraper - how to find entry with missing data

Problem:
While scraping a webpage (imdb.com, a page with film details), an error message is displayed. When I checked the details, I noticed that there is no data available for some of the entries.
How can I figure out during scraping which entry has no data, and how can I fill it with NA?
Manual investigation:
I checked the webpage manually, and the problem is with rank number 1097, where only the film genre is available and there is no runtime.
Tried:
Adding an if that appends a 0, but it is added to the last line, not to the title that is missing the value.
Code:
#install packages
install.packages("rvest")
install.packages("RSelenium")
library(rvest)
library(RSelenium)
#open browser (in my case Firefox)
rD <- rsDriver(browser=c("firefox"))
remDr <- rD[["client"]]
#set variable for the link
ile<-seq(from=1, by=250, length.out = 5)
#create empty frame
filmy_df=data.frame()
#empty values
rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
#loop reading the data from each page
for (j in ile){
  #set link for browser
  newURL<-"https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
  startNumberURL<-paste0(newURL,j)
  #open link
  remDr$navigate(startNumberURL)
  #read webpage code
  strona_int<-read_html(startNumberURL)
  #empty values
  rank_data<-NA;link<-NA;year<-NA;title_data<-NA;description_data<-NA;runtime_data<-NA;genre_data<-NA
  #read rank
  rank_data<-html_nodes(strona_int,'.text-primary')
  #convert text
  rank_data<-html_text(rank_data)
  #remove the comma for thousands
  rank_data<-gsub(",","",rank_data)
  #convert numeric
  rank_data<-as.numeric(rank_data)
  #read link for each movie
  link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
  #release year
  year<-html_nodes(strona_int,'.lister-item-year')
  #convert text
  year<-html_text(year)
  #remove non numeric
  year<-gsub("\\D","",year)
  #set factor
  year<-as.factor(year)
  #read title
  title_data<-html_nodes(strona_int,'.lister-item-header a')
  #convert text
  title_data<-html_text(title_data)
  #title_data<-as.character(title_data)
  #read description
  description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
  #convert text
  description_data<-html_text(description_data)
  #remove '\n'
  description_data<-gsub("\n","",description_data)
  #remove space
  description_data<-trimws(description_data,"l")
  #read runtime
  runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
  #convert text
  runtime_data <- html_text(runtime_data)
  #remove min
  runtime_data<-gsub(" min","",runtime_data)
  length_runtime_data<-length(runtime_data)
  #if (length_runtime_data<250){ runtime_data<-append(runtime_data,list(0))}
  runtime_data<-as.numeric(runtime_data)
  #temp_df
  filmy_df_temp<-data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
  #add to df
  filmy_df<-rbind(filmy_df,filmy_df_temp)
}
#close browser
remDr$close()
#stop RSelenium
rD[["server"]]$stop()
Error message displayed:
"Error in data.frame(Rank = rank_data, Title = title_data, Release.Year = year, :
arguments imply differing number of rows: 250, 249"
runtime_data contains only 249 entries instead of 250, and the runtime ends up missing from the last row instead of the row where it is actually missing.
Update
I have found something interesting that may help solve the problem.
Please check the pictures.
Anima - source of error
Knocked Up - next entry
When we compare the pictures, we can see that Anima, which is causing the problem with runtime_data, does not have an html node containing the runtime at all.
So the question: is there a way to check whether an html node exists, and if so, how?
You wouldn't run into this problem if you had structured your program a little differently. In general, it is better to split your program into logically separate chunks that are more or less independent of each other, instead of doing everything at once. That makes debugging much easier.
First, scrape the data and store it in a list – use lapply or something similar for that.
newURL <- "https://www.imdb.com/search/title/?title_type=feature&release_date=,2018-12-31&count=250&start="
pages <- lapply(ile, function(j) {
  #set link for browser
  startNumberURL<-paste0(newURL,j)
  #open link
  remDr$navigate(startNumberURL)
  #read webpage code
  read_html(startNumberURL)
})
Then you have your data scraped and you can take all the time you need to analyse and filter it, without having to start the reading process again. For example, define the function as follows:
parsuj_strone <- function(strona_int) {
  #read rank
  rank_data<-html_nodes(strona_int,'.text-primary')
  #convert text
  rank_data<-html_text(rank_data)
  #remove the comma for thousands
  rank_data<-gsub(",","",rank_data)
  #convert numeric
  rank_data<-as.numeric(rank_data)
  #read link for each movie
  link<-url_absolute(html_nodes(strona_int, '.lister-item-header a')%>%html_attr(.,'href'),"https://www.imdb.com")
  #release year
  year<-html_nodes(strona_int,'.lister-item-year')
  #convert text
  year<-html_text(year)
  #remove non numeric
  year<-gsub("\\D","",year)
  #set factor
  year<-as.factor(year)
  #read title
  title_data<-html_nodes(strona_int,'.lister-item-header a')
  #convert text
  title_data<-html_text(title_data)
  #read description
  description_data<-html_nodes(strona_int,'.ratings-bar+ .text-muted')
  #convert text
  description_data<-html_text(description_data)
  #remove '\n'
  description_data<-gsub("\n","",description_data)
  #remove space
  description_data<-trimws(description_data,"l")
  #read runtime
  runtime_data <- html_nodes(strona_int,'.text-muted .runtime')
  #convert text
  runtime_data <- html_text(runtime_data)
  #remove min
  runtime_data<-gsub(" min","",runtime_data)
  length_runtime_data<-length(runtime_data)
  runtime_data<-as.numeric(runtime_data)
  #temp_df
  filmy_df_temp<- data.frame(Rank=rank_data,Title=title_data,Release.Year=year,Link=link,Description=description_data,Runtime=runtime_data)
  return(filmy_df_temp)
}
Now, apply the function to each scraped web site:
pages_parsed <- lapply(pages, parsuj_strone)
And finally put them together in the data frame:
pages_df <- Reduce(rbind, pages_parsed)
Reduce won't mind an occasional NULL. Good luck!
EDIT: OK, so the problem is in the parsuj_strone() function. First, replace the final line of that function with this:
filmy_df_temp <- list(Rank = rank_data,
                      Title = title_data,
                      Release.Year = year,
                      Link = link,
                      Description = description_data,
                      Runtime = runtime_data)
return(filmy_df_temp)
Run
pages_parsed <- lapply(pages, parsuj_strone)
Then, identify which of the 5 web sites returned problematic entries:
sapply(pages_parsed, function(x) sapply(x, length))
This should give you a 5 x 6 matrix. Finally, pick an element which has only 249 entries; how does it look? Without knowing your parser well, this at least should give you a hint where the problems may be.
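To see what that diagnostic looks like, here is a toy version with two fabricated "pages" (not the real scrape), where the second one has lost a runtime:

```r
# Each column of the result is one page, each row one field; the odd count
# (249 in column 2) pinpoints both the broken field and the broken page.
pages_parsed_demo <- list(
  list(Rank = 1:250, Runtime = 1:250),
  list(Rank = 1:250, Runtime = 1:249))  # one runtime missing here
field_lengths <- sapply(pages_parsed_demo, function(x) sapply(x, length))
field_lengths
```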
On Stack Overflow you will find everything; you just need to know how to search for the answer.
Here is the link to the answer to my problem: Scraping with rvest: how to fill blank numbers in a row to transform in a data frame?
In short: instead of html_nodes, html_node (without the s) should be used.
#read runtime
runtime_data <- html_node(szczegoly_filmu,'.text-muted .runtime')
#convert to text
runtime_data <- html_text(runtime_data)
#remove " min"
runtime_data<-gsub(" min","",runtime_data)
runtime_data<-as.numeric(runtime_data)
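The difference matters because html_node() returns exactly one result per element of a node set, yielding a missing value where the selector doesn't match, whereas html_nodes() silently drops non-matches and shifts everything up. A small self-contained illustration:

```r
library(rvest)
# Two items; the second has no .runtime child (like Anima on imdb).
page <- minimal_html('
  <div class="item"><span class="runtime">90 min</span></div>
  <div class="item"></div>')
items <- html_nodes(page, ".item")
runtime <- html_text(html_node(items, ".runtime"))
runtime  # the NA stays aligned with the item that lacks a runtime
```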

Convert XML Doc to CSV using R - Or search items in XML file using R

I have a large, un-organized XML file that I need to search to determine if certain ID numbers are in the file. I would like to use R to do so, and because of the format I am having trouble converting it to a data frame or even a list to extract to a csv. I figured I could search easily if it were in csv format. So I need help understanding how to convert and extract it properly, or how to search the document for values using R. Below is the code I have used to try to convert the doc, but several errors occur with my various attempts.
## Method 1. I tried to convert to a data frame, but the each column is not the same length.
require(XML)
require(plyr)
file<-"EJ.XML"
doc <- xmlParse(file,useInternalNodes = TRUE)
xL <- xmlToList(doc)
data <- ldply(xL, data.frame)
datanew <- read.table(data, header = FALSE, fill = TRUE)
## Method 2. I tried to convert it to a list and the file extracts but only lists 2 words on the file.
data<- xmlParse("EJ.XML")
print(data)
head(data)
xml_data<- xmlToList(data)
class(data)
topxml <- xmlRoot(data)
topxml <- xmlSApply(topxml,function(x) xmlSApply(x, xmlValue))
xml_df <- data.frame(t(topxml),
row.names=NULL)
write.csv(xml_df, file = "MyData.csv",row.names=FALSE)
I am going to do some research on how to search within R as well, but I assume the file needs to be in a data frame or a list either way. Any help is appreciated! Attached is a screenshot of the data. I am interested in finding entity id numbers matching a list I have in an Excel doc.
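For the lookup itself, no csv conversion is strictly needed: XPath can search the parsed document directly. A hedged sketch (the <entity id="..."> structure below is hypothetical; substitute the real element and attribute names from EJ.XML):

```r
require(XML)
# Parse a small stand-in document and check each ID with an XPath query.
doc <- xmlParse('<records><entity id="1001"/><entity id="2002"/></records>')
ids_to_find <- c("1001", "9999")
found <- sapply(ids_to_find, function(id)
  length(getNodeSet(doc, sprintf('//entity[@id="%s"]', id))) > 0)
found  # named logical: TRUE where the id occurs in the file
```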

Multiple text file processing using scan

I have this code that works for me (it's from Jockers' Text Analysis with R for Students of Literature). However, what I need to be able to do is to automate this: I need to perform the "ProcessingSection" for up to thirty individual text files. How can I do this? Can I have a table or data frame that contains thirty occurrences of "text.v" for each scan("*.txt")?
Any help is much appreciated!
# Chapter 5 Start up code
setwd("D:/work/cpd/R/Projects/5/")
text.v <- scan("pupil-14.txt", what="character", sep="\n")
length(text.v)
#ProcessingSection
text.lower.v <- tolower(text.v)
mars.words.l <- strsplit(text.lower.v, "\\W")
mars.word.v <- unlist(mars.words.l)
#remove blanks
not.blanks.v <- which(mars.word.v!="")
not.blanks.v
#create a new vector to store the individual words
mars.word.v <- mars.word.v[not.blanks.v]
mars.word.v
It's hard to help as your example is not reproducible.
Assuming you're happy with the result of mars.word.v,
you can turn this portion of the code into a function that accepts a single argument,
the result of scan.
processing_section <- function(x){
  words <- unlist(strsplit(tolower(x), "\\W"))
  words[words != ""]  # drop the blanks, as in the original script
}
Then, if all .txt files are in the current working directory, you should be able to list them,
and apply this function with:
lf <- list.files(pattern = "\\.txt$")  # escape the dot so only .txt files match
lapply(lf, function(path) processing_section(scan(path, what="character", sep="\n")))
Is this what you want?
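To check the function in isolation (no files needed), here is a quick run on an in-memory string; the blank-dropping step from the original script is included as an assumption:

```r
processing_section <- function(x) {
  words <- unlist(strsplit(tolower(x), "\\W"))
  words[words != ""]  # drop the empty strings strsplit leaves behind
}
tokens <- processing_section("The Pupil, by Henry James")
tokens  # "the" "pupil" "by" "henry" "james"
```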

How to not overwrite file in R

I am trying to copy and paste tables from R into Excel. Consider the following code from a previous question:
data <- list.files(path = getwd())
n <- length(data)  # length of the filename vector, not of `list`
for (i in 1:n)
{
  data1 <- read.csv(data[i])
  outline <- data1[, 2]
  outline <- as.data.frame(table(outline))
  print(outline) # this prints all n tables
  name <- paste0(i, "X.csv")
  write.csv(outline, name)
}
This code writes each table into a separate csv file (i.e. "1X.csv", "2X.csv", etc.). Is there any way of "shifting" each table down some rows in a single file instead of writing a new file each time? I have also tried this code:
output <- as.data.frame(output)
wb = loadWorkbook("X.xlsx", create=TRUE)
createSheet(wb, name = "output")
writeWorksheet(wb,output,sheet="output",startRow=1,startCol=1)
writeNamedRegion(wb,output,name="output")
saveWorkbook(wb)
But this does not copy the dataframes exactly into Excel.
I think, as mentioned in the comments, the way to go is to first merge the data frames in R and then write them into one output file:
# get vector of filenames
filenames <- list.files(path=getwd())
# for each filename: load file and create outline
outlines <- lapply(filenames, function(filename) {
data <- read.csv(filename)
outline <- data[,2]
outline <- as.data.frame(table(outline))
outline
})
# merge all outlines into one data frame (by appending them row-wise)
outlines.merged <- do.call(rbind, outlines)
# save merged data frame
write.csv(outlines.merged, "all.csv")
Despite what Microsoft would like you to believe, .csv files are not Excel files; they are a common file type that can be read by Excel and many other programs.
The best approach depends on what you really want to do. Do you want all the tables read into a single worksheet in Excel? If so, you could write everything to a single file using the append argument of write.table (write.csv itself ignores append), or keep a connection open and write each new table to it. You may want to use cat to put a couple of newlines before each new table.
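A minimal sketch of that single-file approach, with two toy tables standing in for the real ones; write.table handles the appending, and cat inserts a blank line between tables:

```r
# Append each table below the previous one in a single csv; column names are
# written only for the first table.
tables <- list(table(c("a", "a", "b")), table(c("x", "y", "y")))
first <- TRUE
for (tbl in tables) {
  df <- as.data.frame(tbl)
  write.table(df, "all_tables.csv", sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  cat("\n", file = "all_tables.csv", append = TRUE)
  first <- FALSE
}
```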
Your second attempt looks like it uses the XLConnect package (but you don't say, so it could be something else). I would think this is the best approach; how is the result different from what you are expecting?
