I have a list of files and want to select only those whose names contain certain words, in this case all files that contain "Semana".
I have this code, but I am unsure what to put in the pattern argument:
malfiles11 <- list.files(path = "./", pattern = , recursive = FALSE, full.names = TRUE, ignore.case = FALSE)
Here is a section of the list:
[1] "./GBD2016_2_1000_Venezuela_MoH_Epi_2008_13.xlsxEntidades Federales10.csv"
[2] "./GBD2016_2_1000_Venezuela_MoH_Epi_2008_13.xlsxESTADO12.csv"
[3] "./GBD2016_2_1000_Venezuela_MoH_Epi_2008_13.xlsxVenezuela, Semana Epidemilógica 01 hasta la semana 13 del Año 2.00811.csv"
[4] "./GBD2016_2_1001_Venezuela_MoH_Epi_2008_14.xlsxESTADO12.csv"
[5] "./GBD2016_2_1001_Venezuela_MoH_Epi_2008_14.xlsxVenezuela, Semana Epidemilógica 01 hasta la semana 14 del Año 2.00811.csv"
[6] "./GBD2016_2_1001_Venezuela_MoH_Epi_2008_14.xlsxVenezuela, Semana Epidemiológica 14 de 2.007, Semana Epidemiológica 14 de año 200810.csv"
[7] "./GBD2016_2_1002_Venezuela_MoH_Epi_2008_15.xlsxESTADO12.csv"
[8] "./GBD2016_2_1002_Venezuela_MoH_Epi_2008_15.xlsxVenezuela, Semana Epidemilógica 01 hasta la semana 15 del Año 2.00811.csv"
Use grep():
malfiles11_semana <- malfiles11[grep(pattern = "Semana", malfiles11)]
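Alternatively (a sketch, not part of the original answer), the same pattern can be passed directly to list.files(), since its pattern argument accepts a regular expression:

# Sketch: filter at listing time instead of afterwards
malfiles11_semana <- list.files(path = "./", pattern = "Semana",
                                recursive = FALSE, full.names = TRUE,
                                ignore.case = FALSE)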
I need to remove some URLs from a data frame. So far I have been able to eliminate those with the pattern http://. However, there are still some websites in my corpus with the format www.stackoverflow.com or stackoverflow.org.
Here is my code:
#Sample of text
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar")
#Trying to remove the website with no results
test_text <- gsub("www[.]//([a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),])//[.]com", "", test_text)
The outcome should be
test_text
"la primera posibilidad real de acabar con la violencia del país es y luego desatar"
The following regex removes the test url.
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar",
"bla1 bla2 www.stackoverflow.org etc",
"this that www.nameofthewebiste.com one more"
)
gsub("(^[^w]*)www\\.[^\\.]*\\.[[:alpha:]]{2,3}(.*$)", "\\1\\2", test_text)
#[1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#[2] "bla1 bla2 etc"
#[3] "this that one more"
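A simpler alternative sketch (not from the original answer) is to match each www.-style token as a whole and delete it together with the space before it:

# Sketch: remove any whitespace followed by a www.-prefixed token
gsub("\\s*www\\.\\S+", "", test_text)

This should give the same three cleaned strings, though it only catches addresses that actually start with www.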
I'm trying to perform lemmatization on a corpus, using the function lemmatize_strings() as an argument to tm_map() of the tm package.
But I want to use my own dictionary ("lexico": the first column holds the full word form in lower case, and the second column holds the corresponding replacement lemma).
I tried to use:
corpus<-tm_map(corpus, lemmatize_strings)
But it didn't work...
When I use:
lemmatize_strings(corpus[[1]], dictionary = lexico)
I have no problem!
How can I pass my dictionary "lexico" to the function tm_map()?
Sorry for this question; it's my first attempt at text mining, at the age of 48.
To make this clearer: my corpus is composed of 2000 documents. Here is an extract from the first document:
corpus[[1]][[1]]
[9] "..."
[10] "Nos últimos dias da passada legislatura, a maioria de direita aprovou duas leis que significam enormes recuos nos direitos das cidadãs do país. Fizeram tábua rasa do pronunciamento das cidadãs e cidadãos do país em referendo, optando por humilhar e tentar culpabilizar as mulheres que abortam por sua livre escolha. Estas duas leis são a Lei n.º 134/2015 e a Lei n.º 136/2015, de setembro. A primeira prevê o pagamento de taxas moderadoras na interrupção de gravidez quando for realizada, por opção da mulher, nas primeiras 10 semanas de gravidez. A segunda representa a primeira alteração à Lei n.º 16/2007, de 17 de abril, sobre exclusão de ilicitude nos casos de interrupção voluntária da gravidez."
Then I worked on a dictionary file (lexico) with this configuration:
lexico[1:10,]
termo lema pos.tag
1 aa a NCMP000
2 aais aal NCMP000
3 aal aal NCMS000
4 aaleniano aaleniano NCMS000
5 aalenianos aaleniano NCMP000
6 ab-rogação ab-rogação NCFS000
7 ab-rogações ab-rogação NCFP000
8 ab-rogamento ab-rogamento NCMS000
9 ab-rogamentos ab-rogamento NCMP000
10 ab-rogáveis ab-rogável AQ0CP0
When I use the function lemmatize_strings(corpus[[1]], dictionary = lexico), it works correctly and gives document nº 1 of the corpus lemmatized with the lemmas from my dictionary.
The problem I have is with this function:
> corpus<-tm_map(corpus, lemmatize_strings, dictionary = lexico)
Warning messages:
1: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
2: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
> corpus[[1]][[1]]
[1] ""
This simply destroys all the documents in my corpus:
> corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2000
Thanks in advance for your reply!
You could, for example, use the quanteda package for this:
library("quanteda")
text <- "This is a test sentence. We can lemmatize it using quanteda."
dict <- data.frame(
  word = c("is", "using"),
  lemma = c("be", "use"),
  stringsAsFactors = FALSE
)
toks <- tokens(text, remove_punct = TRUE)
toks_lemma <- tokens_replace(toks,
                             pattern = dict$word,
                             replacement = dict$lemma,
                             case_insensitive = TRUE,
                             valuetype = "fixed")
toks_lemma
tokens from 1 document.
text1 :
[1] "This" "be" "a" "test" "sentence" "We" "can" "lemmatize"
[9] "it" "use" "quanteda"
The function is very fast and, despite its name, was mainly created for lemmatization.
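Applied to the question's own dictionary (a sketch, assuming the documents have already been tokenized into toks as above), the termo and lema columns slot straight into tokens_replace():

# Sketch: use the question's lexico data frame instead of the toy dict
toks_lemma <- tokens_replace(toks,
                             pattern = lexico$termo,    # full word forms
                             replacement = lexico$lema, # replacement lemmas
                             valuetype = "fixed")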
I used pdftools to convert some PDF documents to txt. This is a part of the output (it's not so bad):
REPÚBLICA DE CHILE PADRON ELECTORAL AUDITADO ELECCIONES PRESIDENCIAL, PARLAMENTARIAS y de CONSEJEROS REGIONALES 2017 REGISTROS: 2.421
SERVICIO ELECTORAL REGIÓN : ARICA Y PARINACOTA COMUNA: GENERAL LAGOS PÁGINA 1 de 38
PROVINCIA : PARINACOTA
NOMBRE C.IDENTIDAD SEXO DOMICILIO ELECTORAL CIRCUNSCRIPCIÓN MESA
AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M
AGUILERA ZENTENO PATRICIA ALEJANDRA 16.223.938-4 MUJ PUEBLO DE GUACOLLO S N CERCANO A GENERAL LAGOS 5M
AGUIRRE CHOQUE MARCOS JULIO 15.000.385-7 VAR CIRCUNSCRIPCION
CALLE TORREALBA DE VISVIRI
CASA N° 4 PUEBLO DE VISVIRI GENERAL LAGOS 7V
So I'm doing this to clean it and convert it into a formatted TSV:
library(readr)  # for read_lines()
test = read_lines("file.txt")
test2 = test[!grepl("REP\u00daBLICA", test)]
test2 = test2[!grepl("SERVICIO", test2)]
test2 = test2[!grepl("NOMBRE", test2)]
test2 = test2[!grepl("PROVINCIA", test2)]
test2 = gsub("\\.", "", test2)
test2 = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", test2, perl = TRUE)
and the output is:
ABRIGO PIZARRO PATRICIO ESTEBAN 16024716-9 VAR PUEB ALCERRECA GENERAL LAGOS 5V
ABURTO VELASCO ESTHER MARISOL 13005517-6 MUJ VILLA INDUSTRIAL GENERAL LAGOS 2M
ACEVEDO MONTT SEBASTIAN ANDRES 17829470-9 VAR CALLE RAFAEL TORREALBA N° 3 PUEBLO DE VISVIRI GENERAL LAGOS 3V
ACHILLO BLAS ADOLFO ARTURO 13008044-8 VAR VISURI GENERAL LAGOS 7V
I've read some posts and I'm not sure how to implement the following:
(1) Something like gsub("(?<=[\\s+])[0-9]", "\t", test2, perl=TRUE) to replace multiple spaces followed by a number with a tab followed by that number.
(2) How to move broken lines to the end of the previous line, such as line 8 in the above sample, which starts with multiple spaces.
Fixing (1) and (2) would return this:
ABRIGO PIZARRO PATRICIO ESTEBAN \t 16024716-9 \t VAR \t PUEB ALCERRECA \t GENERAL LAGOS \t 5V
ABURTO VELASCO ESTHER MARISOL \t 13005517-6 \t MUJ \t VILLA INDUSTRIAL \t GENERAL LAGOS \t 2M
(1) You can use the words "VAR" and "MUJ" as keywords for splitting:
x <- "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M"
strsplit(x, "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s", perl = TRUE)
The result is:
[[1]]
[1] "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA" "13.638.826-6" "MUJ"
[4] "PUEBLO DE TACORA S N VISVIRI" "GENERAL LAGOS" "4M"
Maybe not the most elegant solution, but it works, and if you can modify the data you could use real keywords and ensure they are unique.
(2) An easy solution would be to check row lengths and move values up if a row is too short; a rough sketch follows.
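A minimal sketch of that idea (my own variation, not from the original answer): instead of checking lengths, treat every line containing the sex keyword "VAR" or "MUJ" as the start of a record, paste the following broken lines back onto it, and then split as in (1).

starts <- grepl("\\bVAR\\b|\\bMUJ\\b", test2)   # lines that start a record
grp <- cumsum(starts)                           # continuation lines inherit the previous group id
merged <- unname(tapply(test2, grp, paste, collapse = " "))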
I'm trying to scrape comments from this website:
http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/
And this is my code for this task.
library(rvest)  # for read_html() and html_nodes()
url <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
webpage <- read_html(url)
data_html <- html_nodes(webpage, "gig-comment-body")
Unfortunately it seems that rvest doesn't recognize the nodes through the CSS selector (gig-comment-body).
data_html comes out as an empty node set, so it's not scraping anything.
Here is another solution, using RSelenium without Docker:
install.packages("RSelenium")
library(RSelenium)
driver <- rsDriver()
remDr <- driver[["client"]]
remDr$navigate("http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/")
elem <- remDr$findElement(using = "id", value = "commentsDiv-779453")
# or
elem <- remDr$findElement(using = "class name", "gig-comments-comments")
elem$highlightElement() # just for interactive use in the browser
elemtxt <- elem$getElementAttribute("outerHTML") # gets us the HTML
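From there (a sketch, not part of the original answer), the returned HTML string can be handed back to rvest to pull out just the comment text:

library(rvest)
comments <- read_html(elemtxt[[1]]) %>%   # getElementAttribute() returns a list
  html_nodes(".gig-comment-body") %>%
  html_text()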
@r2evans is correct. The site builds the comment <div>s with JavaScript and it also requires a delay. I prefer Splash to Selenium (though I made splashr, so I'm not exactly impartial):
library(rvest)
library(splashr)
URL <- 'http://www.latercera.com/noticia/trabajos-realizan-donde-viven-los-extranjeros-tienen-residencia-chile/'
# Needs Docker => https://www.docker.com/
# Then needs splashr::install_splash()
start_splash()
splash_local %>%
splash_response_body(TRUE) %>%
splash_go(URL) %>%
splash_wait(10) %>%
splash_html() -> pg
html_nodes(pg, "div.gig-comment-body")
## {xml_nodeset (10)}
## [1] <div class="gig-comment-body"><p><span>Algunosdesubicados comentan y se refieren a la UE<span> </span>como si en alguna forma Chil ...
## [2] <div class="gig-comment-body">Si buscan información se encontrarán que la unión Europea se está desmorona ndo por asunto de la inmi ...
## [3] <div class="gig-comment-body">Pocos inmigrantes tiene Chile en función de su población. En España hay 4.5 mill de inmigrantes. 800. ...
## [4] <div class="gig-comment-body">Chao chilenois idiotas tanto hablan y dicen que hacer cuando ni su pais les pertenece esta gobernado ...
## [5] <div class="gig-comment-body">\n<div> Victor Hugo Ramirez Lillo, de Conchalí, exiliado en Goiania, Brasil, pecha bono de exonerado, ...
## [6] <div class="gig-comment-body">Les escribo desde mi 2do pais, USA. Mi PDTE. TRUMP se bajó del TPP y Chile se va a la cresta. La o ...
## [7] <div class="gig-comment-body">En CHILE siempre fuimos muy cuidadosos con le emigración, solo lo MEJOR de Alemania, Francia, Suecia, ...
## [8] <div class="gig-comment-body"><span>Basta de inmigración!!! Santiago está lleno de vendedores ambulantes extranieros!!!¿¿esos son l ...
## [9] <div class="gig-comment-body">IGNOREN A JON LESCANO, ESE ES UN CHOLO QUE FUE DEPORTADO DE CHILE.<div>IGNOREN A LOS EXTRANJEROS MET ...
## [10] <div class="gig-comment-body">Me pregunto qué dirá el nacionalista promedio cuando agarre un libro de historia y se dé cuenta de qu ...
killall_splash()
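To get just the comment text rather than the nodes (a small addition, not in the original answer), chain html_text() on:

html_nodes(pg, "div.gig-comment-body") %>% html_text()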
I'm looking at scraping a French website using the rvest package.
library(rvest)
url <- "https://www.vins-bourgogne.fr/nos-vins-nos-terroirs/tous-les-bourgognes/toutes-les-appellations-de-bourgogne-a-votre-portee,2378,9172.html?&args=Y29tcF9pZD0xMzg2JmFjdGlvbj12aWV3RnVsbExpc3RlJmlkPSZ8"
s <- read_html(url)
s %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()
I expect to see:
Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)
Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)
Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÔTE DE BEAUNE)
Instead, I see the diacritic characters mangled (see line 3 below):
"Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
"Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
"Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÃ\u0094TE DE BEAUNE)"
The source HTML of the page shows it's encoded in UTF-8. Using guess_encoding() on the html_text() output suggests UTF-8 as well (1.00 confidence), or windows-1252 with 0.73 confidence. Changing the encoding to windows-1252 doesn't help matters:
"Aloxe-Corton (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)"
"Auxey-Duresses (Appellation Village, VIGNOBLE DE LA CÔTE DE BEAUNE)"
"Bâtard-Montrachet (Appellation Grand Cru, VIGNOBLE DE LA CÔTE DE BEAUNE)"
I tried the same code on a different French website (also encoded in UTF-8):
x <- read_html('http://www.lemonde.fr/disparitions/article/2017/12/06/johnny-hallyday-c-etait-notre-seule-rock-star-la-france-perd-son-icone-du-rock_5225507_3382.html')
x %>% html_nodes('.taille_courante+ p , .croix_blanche , .tt2') %>% html_text()
Now I get the diacritics etc. rendered correctly:
[1] "Johnny Hallyday : « C’était notre seule rock star », « La France perd son icône du rock »"
[2] "« Comme toute la France, mon cœur est brisé, a déclaré à l’Agence France-Presse (AFP) la chanteuse Sylvie Vartan, qui fut la première épouse de Johnny Hallyday, et mère de leur fils, David, né en 1966. J’ai perdu l’amour de ma jeunesse et rien ne pourra jamais le remplacer. »"
Any suggestions on where I am going wrong with the first website? Or how to fix?
This is a weird website. It is not all valid UTF-8:
lines <- readLines(url, warn = FALSE)
all(utf8::utf8_valid(lines))
#> [1] FALSE
Here are the offending lines:
lines[!utf8::utf8_valid(lines)]
#> [1] "// on supprime l'\xe9ventuel cookie"
#> [2] "//Ouverture et fermeture de l'encart r\xe9saux sociaux lors d'un clic sur le bouton"
#> [3] "//Cr\xe9ation de l'iframe facebook \xe0 la premi\xe8re ouverture de l'encart pour qu'elle fasse la bonne largeur"
#> [4] "//fermeture de l'encart r\xe9saux sociaux lors d'un clic ailleurs sur la page"
These look like comments in the JavaScript code. I suspect that read_html realizes that the page is not all valid UTF-8 and interprets the encoding to be Windows-1252 or some other 8-bit coding scheme.
You could try to work around this by removing the offending JS segments:
content <- paste(lines[utf8::utf8_valid(lines)], collapse = "\n")
content %>% read_html() %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()
This gives the expected output.
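Another workaround sketch (my assumption, not part of the original answer) is to re-encode the invalid lines, which look like Latin-1, instead of dropping them:

# Sketch: convert the non-UTF-8 lines (apparently Latin-1) before parsing
bad <- !utf8::utf8_valid(lines)
lines[bad] <- iconv(lines[bad], from = "latin1", to = "UTF-8")
content <- paste(lines, collapse = "\n")
content %>% read_html() %>% html_nodes('#resultatListeAppellation .lien') %>% html_text()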