How to turn around strings in dataframe - r

I have names such as Robin the Bruyne or Victor from the Loo.
These names are in a data frame in my session. I need to change them into the form
<lastname, firstname middlename(s)>,
so that they are turned around, but I don't know how to do this.
I know I can use things like separate() from tidyr or map() from purrr (both part of the tidyverse).
Data:
~nr, ~name, ~prodno,
2019001, "Piet de Boer", "lux_zwez",
2019002, "Elly Hamstra", "zuv_vla",
2019003, "Sue Ellen Schilder", "zuv_vla",
2019004, "Truus Janssen", "zuv_vmlk",
2019005, "Evelijne de Vries", "lux_zwez",
2019006, "Berend Boersma", "lux_gras",
2019007, "Marius van Asten", "zuv_vla",
2019008, "Corneel Jansen", "lux_gras",
2019009, "Joke Timmerman", "zuv_vla",
2019010, "Jan Willem de Jong", "lux_zwez",
2019011, "Frederik Janssen", "zuv_vmlk",
2019012, "Antonia de Jongh", "zuv_vmlk",
2019013, "Lena van der Loo", "zuv_qrk",
2019014, "Johanna Haanstra", "lux_gras"

We can try using sub here:
names <- c("Robin the Bruyne", "Victor from the Loo")
output <- sub("^(.*) ([A-Z][a-z]+)$", "\\2, \\1", names)
output
[1] "Bruyne, Robin the" "Loo, Victor from the"
This approach uses the following pattern:
^(.*) captures everything from the start of the string up to the last space
([A-Z][a-z]+)$ captures the last name: one capital letter followed by lowercase letters, at the end of the string
Then we replace the whole string with the last name (\\2), a comma, and the first/middle names (\\1).
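To apply the same substitution to a data frame column rather than a standalone vector, something like the following should work. This is a minimal sketch: the three-row dat is a cut-down copy of the question's data, and the looser \\S+ pattern and the name "Anna Smith-Jones" are my own assumptions, added because [A-Z][a-z]+ leaves hyphenated or accented surnames unchanged.

```r
# Cut-down copy of the question's data (three rows only, for illustration)
dat <- data.frame(
  nr = c(2019001, 2019003, 2019013),
  name = c("Piet de Boer", "Sue Ellen Schilder", "Lena van der Loo"),
  stringsAsFactors = FALSE
)

# Move the last word to the front, followed by a comma
dat$name <- sub("^(.*) ([A-Z][a-z]+)$", "\\2, \\1", dat$name)
dat$name

# Assumed variant: treat any run of non-space characters as the last name,
# so a hyphenated surname (hypothetical example) is swapped too
hyphenated <- sub("^(.*) (\\S+)$", "\\2, \\1", "Anna Smith-Jones")
hyphenated
```

Names that do not match the pattern are returned unchanged by sub(), so it is worth checking for rows where no comma was inserted.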

If I understood you correctly, this should work.
dat = tibble::tribble(
~nr, ~name, ~prodno,
2019001, "Piet de Boer", "lux_zwez",
2019002, "Elly Hamstra", "zuv_vla",
2019003, "Sue Ellen Schilder", "zuv_vla",
2019004, "Truus Janssen", "zuv_vmlk",
2019005, "Evelijne de Vries", "lux_zwez",
2019006, "Berend Boersma", "lux_gras",
2019007, "Marius van Asten", "zuv_vla",
2019008, "Corneel Jansen", "lux_gras",
2019009, "Joke Timmerman", "zuv_vla",
2019010, "Jan Willem de Jong", "lux_zwez",
2019011, "Frederik Janssen", "zuv_vmlk",
2019012, "Antonia de Jongh", "zuv_vmlk",
2019013, "Lena van der Loo", "zuv_qrk",
2019014, "Johanna Haanstra", "lux_gras"
)
library(magrittr)
dat %>% dplyr::mutate(
lastname = stringr::str_extract(name,"(?<=[:blank:])[:alnum:]+$"),
firstname = stringr::str_extract(name,".*(?=[:blank:])"),
name = paste(lastname,firstname,sep = ", ")
) %>% dplyr::select(-firstname,-lastname)
#> # A tibble: 14 x 3
#> nr name prodno
#> <dbl> <chr> <chr>
#> 1 2019001 Boer, Piet de lux_zwez
#> 2 2019002 Hamstra, Elly zuv_vla
#> 3 2019003 Schilder, Sue Ellen zuv_vla
#> 4 2019004 Janssen, Truus zuv_vmlk
#> 5 2019005 Vries, Evelijne de lux_zwez
#> 6 2019006 Boersma, Berend lux_gras
#> 7 2019007 Asten, Marius van zuv_vla
#> 8 2019008 Jansen, Corneel lux_gras
#> 9 2019009 Timmerman, Joke zuv_vla
#> 10 2019010 Jong, Jan Willem de lux_zwez
#> 11 2019011 Janssen, Frederik zuv_vmlk
#> 12 2019012 Jongh, Antonia de zuv_vmlk
#> 13 2019013 Loo, Lena van der zuv_qrk
#> 14 2019014 Haanstra, Johanna lux_gras
Created on 2019-06-02 by the reprex package (v0.2.1)


How can I split string in R from first square bracket and last round bracket?

I am dealing with legal citations. I want to split the citations into four parts. The citation is in general format as follows:
ABC v. DEF [Year] citation data (Authority)
So, I want to split it into four parts: ABC v. DEF, Year, citation data, and Authority. The problem is that the first part (i.e., ABC v. DEF) might contain additional round brackets, while the third part (i.e., citation data) might contain additional square and/or round brackets.
For example, in this following case
"Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"
The first part is "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation)", second part is "2013", third part is "33 taxmann.com 424/60 SOT 118 (URO)" and the last part is "Mum. Trib."
I am unable to come up with the right regex to do this. Can anyone help me with this one?
Use extract:
library(tidyr)
data.frame(txt) %>%
extract(txt,
into = c("First", "Sec", "Thrd", "Frth"),
regex = "(.+)\\[(\\d+)\\](.*)\\((.*)\\)")
First Sec Thrd Frth
1 Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) 2013 33 taxmann.com 424/60 SOT 118 (URO) Mum. Trib.
The regex part looks scarier than it is: you simply describe the string in full, wrapping the parts that you wish to extract in parentheses (the syntax for capturing groups).
Data:
txt <- "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"
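The same capturing-group idea can be tried without tidyr. The sketch below reuses the pattern from the extract() call with base R's regexec() and regmatches(). The perl = TRUE argument and the trimws() step are my additions: because the groups are greedy, the first and third parts would otherwise keep the spaces around the brackets.

```r
txt <- "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"

# Same pattern as in the extract() call: four capturing groups
pattern <- "(.+)\\[(\\d+)\\](.*)\\((.*)\\)"

# regexec() returns the full match first, then one entry per group
m <- regmatches(txt, regexec(pattern, txt, perl = TRUE))

# Drop the full match and trim the whitespace left around the brackets
parts <- trimws(m[[1]][-1])
names(parts) <- c("First", "Sec", "Thrd", "Frth")
parts
```

Greedy matching is what makes this work here: (.+) runs up to the last "[digits]" and (.*) up to the last "(", so earlier brackets stay inside the first and third parts.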
Alternatively, in base R:
text <- "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"
pattern <- "(.*?)\\s*\\[(\\d{4})\\]\\s*(.*?)\\s*\\((.*)\\)"
regmatches(text, regexec(pattern, text))
[[1]]
[1] "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) [2013] 33 taxmann.com 424/60 SOT 118 (URO) (Mum. Trib.)"
[2] "Lubrizol Corporation, USA v. Asstt. DIT (International Taxation)"
[3] "2013"
[4] "33 taxmann.com 424/60 SOT 118 (URO)"
[5] "Mum. Trib."
If you want a dataframe:
dat <- data.frame(citation = character(), year = numeric(), data = character(), Authority = character())
strcapture(pattern, text, dat)
citation year data Authority
1 Lubrizol Corporation, USA v. Asstt. DIT (International Taxation) 2013 33 taxmann.com 424/60 SOT 118 (URO) Mum. Trib.

Unable to encode csv to UTF-8

I am importing a csv and want to encode it to UTF-8 as some columns appear like this:
Comentario Fecha Mes `Estaci\xf3n` Hotel Marca
<chr> <date> <chr> <chr> <chr> <chr>
1 "No todas las instalaciones del hotel se pudieron usar, estaban cerradas sin ~ 2020-02-01 feb. "Invierno" Sol Arona T~ Sol
2 "Me he ido con un buen sabor de boca, y ganas de volver. Me ha sorprendido to~ 2019-11-01 nov. "Oto\xf1o" Sol Arona T~ Sol
3 "Hotel normalito. Est\xe1 un poco viejo. Las habitaciones no tienen aire acon~ 2019-09-01 sep. "Verano" Sol Arona T~ Sol
I have tried the following:
df <- read_csv("SolArona.csv", sep=",", encoding = "UTF-8")
But this returns Error in read_csv("SolArona.csv", sep = ",", encoding = "UTF-8") :
unused arguments (sep = ",", encoding = "UTF-8"). So, I have also tried doing the Encoding of each column:
df <-read_csv("SolArona.csv")
Encoding(df$Comentario)<-"UTF-8"
Encoding (df$Estaci\xf3n)<-"UTF-8"
Thanks very much in advance!
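Try this, substituting another encoding if necessary. The \xf3 and \xf1 escapes in your output are the Latin-1 bytes for ó and ñ, so the file is probably not UTF-8 and "latin1" may be the value you need.
library(readr)
# read_csv() already assumes a comma delimiter; the encoding is set through locale()
df <- read_csv("SolArona.csv", locale = locale(encoding = "UTF-8"))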
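If the data has already been read in, the strings can also be converted after the fact. The following is a minimal base R sketch, assuming the bytes really are Latin-1 (which \xf3 for ó suggests). The literal below imitates the column name from the question.

```r
# A string holding the Latin-1 byte 0xf3, as seen in the question's output
x <- "Estaci\xf3n"
validUTF8(x)   # FALSE: a lone 0xf3 byte is not valid UTF-8

# Re-encode from Latin-1 to UTF-8
y <- iconv(x, from = "latin1", to = "UTF-8")
validUTF8(y)   # TRUE
y
```

For a whole data frame, the same iconv() call could be applied to names(df) and to each character column.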

How to use gsub to fix case-defined multiple spaces and broken lines?

I used pdftools to convert some pdf documents to txt. This is a part of the output (it's not so bad)
REPÚBLICA DE CHILE PADRON ELECTORAL AUDITADO ELECCIONES PRESIDENCIAL, PARLAMENTARIAS y de CONSEJEROS REGIONALES 2017 REGISTROS: 2.421
SERVICIO ELECTORAL REGIÓN : ARICA Y PARINACOTA COMUNA: GENERAL LAGOS PÁGINA 1 de 38
PROVINCIA : PARINACOTA
NOMBRE C.IDENTIDAD SEXO DOMICILIO ELECTORAL CIRCUNSCRIPCIÓN MESA
AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M
AGUILERA ZENTENO PATRICIA ALEJANDRA 16.223.938-4 MUJ PUEBLO DE GUACOLLO S N CERCANO A GENERAL LAGOS 5M
AGUIRRE CHOQUE MARCOS JULIO 15.000.385-7 VAR CIRCUNSCRIPCION
CALLE TORREALBA DE VISVIRI
CASA N° 4 PUEBLO DE VISVIRI GENERAL LAGOS 7V
So I'm doing this to clean this and convert it into formatted tsv:
test = read_lines("file.txt")
test2 = test[!grepl("REP\u00daBLICA",test)]
test2 = test2[!grepl("SERVICIO",test2)]
test2 = test2[!grepl("NOMBRE",test2)]
test2 = test2[!grepl("PROVINCIA",test2)]
test2 = gsub("\\.", "", test2)
test2 = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", test2, perl=TRUE)
and the output is:
ABRIGO PIZARRO PATRICIO ESTEBAN 16024716-9 VAR PUEB ALCERRECA GENERAL LAGOS 5V
ABURTO VELASCO ESTHER MARISOL 13005517-6 MUJ VILLA INDUSTRIAL GENERAL LAGOS 2M
ACEVEDO MONTT SEBASTIAN ANDRES 17829470-9 VAR CALLE RAFAEL TORREALBA N° 3 PUEBLO DE VISVIRI GENERAL LAGOS 3V
ACHILLO BLAS ADOLFO ARTURO 13008044-8 VAR VISURI GENERAL LAGOS 7V
I've read some posts and I'm not sure how to implement:
(1) Something like gsub("(?<=[\\s+])[0-9]", "\t", test2, perl=TRUE), to replace multiple spaces followed by a number with a tab followed by that number.
(2) How to move broken lines to the end of the previous line, such as line 8 in the above sample, which starts with multiple spaces.
Fixing (1) and (2) would return this:
ABRIGO PIZARRO PATRICIO ESTEBAN \t 16024716-9 \t VAR \t PUEB ALCERRECA \t GENERAL LAGOS \t 5V
ABURTO VELASCO ESTHER MARISOL \t 13005517-6 \t MUJ \t VILLA INDUSTRIAL \t GENERAL LAGOS \t 2M
(1) You can use the words "VAR" and "MUJ" as key-words for splitting:
x <- "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M"
strsplit(x, "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s", perl = TRUE)
The result is:
[[1]]
[1] "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA" "13.638.826-6" "MUJ"
[4] "PUEBLO DE TACORA S N VISVIRI" "GENERAL LAGOS" "4M"
Maybe not the most elegant solution, but it works, and if you can modify the data you could use real keywords and make sure they are unique.
(2) An easy solution would be to check each row's length and move its values up to the previous row if the row is too short.
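A variation on (2) is sketched below. Instead of checking row length, it uses the leading whitespace mentioned in the question to spot continuation lines, and then turns runs of spaces into tabs. The three sample lines are invented to imitate raw pdftools output. This would have to run before any step that trims or collapses whitespace, since it relies on the original spacing.

```r
# Invented sample lines; the spacing imitates raw pdftools output
lines <- c(
  "AGUILERA ZENTENO PATRICIA ALEJANDRA    16.223.938-4   MUJ   PUEBLO DE GUACOLLO S N     GENERAL LAGOS   5M",
  "AGUIRRE CHOQUE MARCOS JULIO   15.000.385-7   VAR   CALLE TORREALBA",
  "          CASA N 4 PUEBLO DE VISVIRI     GENERAL LAGOS   7V"
)

# A line that starts with whitespace is treated as a continuation
is_continuation <- grepl("^\\s+", lines)

# Every non-continuation line opens a new record
record <- cumsum(!is_continuation)

# Paste each record's lines together, trimming the outer whitespace first
merged <- unname(vapply(split(trimws(lines), record), paste,
                        character(1), collapse = " "))

# Runs of two or more whitespace characters become a single tab
tsv <- gsub("\\s{2,}", "\t", merged)
tsv
```

Fields separated by only one space (as around MUJ and VAR in some rows) would still need the keyword split from (1).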

Web Scraping, extract table of a page

I have to extract the table with the columns "R.U.T" and "Entidad" from the page
http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554
I wrote the following code:
library(rvest)
# read the page
url <- "http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554.html"
url <- read_html(url)
# extract the table
table <- html_node(url, xpath = '//*[@id="listado_fiscalizados"]/table') # xpath
table <- html_table(table)
# transform the table to a data.frame
table <- data.frame(table)
but R shows me the following result:
> a
{xml_nodeset (0)}
That is, it is not recognizing the table. Maybe it's because the table has hyperlinks?
If anyone knows how to extract the table, I would appreciate it.
Many thanks in advance and sorry for my English.
The page makes an XHR request to another resource, which is used to build the table.
library(rvest)
library(dplyr)
pg <- read_html("http://www.svs.cl/institucional/mercados/consulta.php?mercado=S&Estado=VI&consulta=CSVID&_=1484105706447")
html_nodes(pg, "table") %>%
html_table() %>%
.[[1]] %>%
as_tibble() %>% # replaces the deprecated tbl_df()
select(1:2)
## # A tibble: 36 × 2
## R.U.T. Entidad
## <chr> <chr>
## 1 99588060-1 ACE SEGUROS DE VIDA S.A.
## 2 76511423-3 ALEMANA SEGUROS S.A.
## 3 96917990-3 BANCHILE SEGUROS DE VIDA S.A.
## 4 96933770-3 BBVA SEGUROS DE VIDA S.A.
## 5 96573600-K BCI SEGUROS VIDA S.A.
## 6 96656410-5 BICE VIDA COMPAÑIA DE SEGUROS S.A.
## 7 96837630-6 BNP PARIBAS CARDIF SEGUROS DE VIDA S.A.
## 8 76418751-2 BTG PACTUAL CHILE S.A. COMPAÑIA DE SEGUROS DE VIDA
## 9 76477116-8 CF SEGUROS DE VIDA S.A.
## 10 99185000-7 CHILENA CONSOLIDADA SEGUROS DE VIDA S.A.
## # ... with 26 more rows
You can use Developer Tools in any modern browser to monitor the Network requests to find that URL.
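This is the answer using RSelenium:
library(RSelenium)
library(XML) # provides htmlParse() and readHTMLTable()
# Start Selenium Server
# (checkForServer() and startServer() are defunct in newer RSelenium versions, where rsDriver() takes their place)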
RSelenium::checkForServer(beta = TRUE)
selServ <- RSelenium::startServer(javaargs = c("-Dwebdriver.gecko.driver=\"C:/Users/Mislav/Documents/geckodriver.exe\""))
remDr <- remoteDriver(extraCapabilities = list(marionette = TRUE))
remDr$open() # silent = TRUE
Sys.sleep(2)
# Simulate browser session and fill out form
remDr$navigate("http://www.svs.cl/portal/principal/605/w3-propertyvalue-18554.html")
Sys.sleep(2)
doc <- htmlParse(remDr$getPageSource()[[1]], encoding = "UTF-8")
# close and stop server
remDr$close()
selServ$stop()
tables <- readHTMLTable(doc)
head(tables)

R: find position of specific character in data frame column

I have been trying to duplicate a move that I've used a lot with SQL but can't seem to find an equivalent in R. I've been searching high and low on the list and other sources for a solution but can't find what I'm looking to do.
I have a data frame with a variable of full names, for example "Doe, John". I have been able to split these names using the following code:
# creates a list with the split name parts for each record
namesplit <- strsplit(crm$DEF_NAME, ',')
# takes the first/left part of each split, before the comma
crm$LAST_NAME <- trimws(sapply(namesplit, function(x) x[1]))
# takes the last/right part of each split, after the comma
crm$FIRST_NAME <- trimws(sapply(namesplit, function(x) x[length(x)]))
But some of the names have "." instead of "," splitting the names. For example, "Doe. John". In other cases I have two ".", i.e. "Doe. John T.". Here's an example:
> test$LAST_NAME
[1] "DEWITT. B" "TAOY. PETER" "ZULLO. JASON"
[4] "LAWLOR. JOSEPH" "CRAWFORD. ADAM" "HILL. ROBERT W."
[7] "TAGERT. CHRISTOPHER" "ROSEBERY. SCOTT W." "PAYNE. ALBERT"
[10] "BUNTZ. BRIAN JOHN" "COLON. PERFECTO GAUD" "DIAZ. JOSE CANO"
[13] "COLON. ERIK D." "COLON. ERIK D." "MARTINEZ. DAVID C."
[16] "DRISKELL. JASON" "JOHNSON. ALEXANDER" "JACKSON. RONNIE WAYNE"
[19] "SIPE. DAVID J." "FRANCO. BRANDT" "FRANCO. BRANDT"
For these cases, I'm trying to find the position of the first "." so that I can use user-defined functions to split the name. Here are those functions.
left = function (string,char){
substr(string,1,char)}
right = function (string, char){
substr(string,nchar(string)-(char-1),nchar(string))}
I've had some success with the following, but it takes the position of the first record only, so for example it'll grab position 6 for all the records rather than changing for each row.
test$LAST_NAME2 <- left(test$LAST_NAME,
which(strsplit(test$LAST_NAME, '')[[1]]=='.')-1)
I've played around with apply and sapply, but I'm obviously missing something because they don't seem to work.
My plan was to use an ifelse function to apply the "." parsing to the records that have this issue.
I fear the answer is simple. But I'm stuck. Thanks so much for your help.
I would just modify your original namesplit call to this:
namesplit <- strsplit(crm$DEF_NAME, ',|\\.')
which will split on , or .
Also, maybe change your first-name line to
crm$FIRST_NAME <- sapply(namesplit, function(x) paste(trimws(x[-1]), collapse = " "))
so that everything after the first separator is kept, even when a comma or period also appears later in the name.
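As for the position question itself: regexpr() is vectorised, so it returns the position of the first match for every element rather than only the first record. A minimal sketch, using two names from the question plus a comma-separated one. The variable names are illustrative.

```r
x <- c("DEWITT. B", "HILL. ROBERT W.", "Doe, John")

# Position of the first "." or "," in each string (-1 if there is none)
pos <- regexpr("[.,]", x)
as.integer(pos)

# Everything before the separator is the last name
last_name <- substr(x, 1, pos - 1)

# Everything after it is the first name, including later periods
first_name <- trimws(substr(x, pos + 1, nchar(x)))

last_name
first_name
```

Because substr() is vectorised over its start and stop arguments as well, no apply or ifelse step is needed.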
With tidyr,
library(tidyr)
test %>% separate(LAST_NAME, into = c('LAST_NAME', 'FIRST_NAME'), extra = 'merge')
## LAST_NAME FIRST_NAME
## 1 DEWITT B
## 2 LAWLOR JOSEPH
## 3 TAGERT CHRISTOPHER
## 4 BUNTZ BRIAN JOHN
## 5 COLON ERIK D.
## 6 DRISKELL JASON
## 7 SIPE DAVID J.
## 8 TAOY PETER
## 9 CRAWFORD ADAM
## 10 ROSEBERY SCOTT W.
## 11 COLON PERFECTO GAUD
## 12 COLON ERIK D.
## 13 JOHNSON ALEXANDER
## 14 FRANCO BRANDT
## 15 ZULLO JASON
## 16 HILL ROBERT W.
## 17 PAYNE ALBERT
## 18 DIAZ JOSE CANO
## 19 MARTINEZ DAVID C.
## 20 JACKSON RONNIE WAYNE
## 21 FRANCO BRANDT
Data
test <- structure(list(LAST_NAME = c("DEWITT. B", "LAWLOR. JOSEPH", "TAGERT. CHRISTOPHER",
"BUNTZ. BRIAN JOHN", "COLON. ERIK D.", "DRISKELL. JASON", "SIPE. DAVID J.",
"TAOY. PETER", "CRAWFORD. ADAM", "ROSEBERY. SCOTT W.", "COLON. PERFECTO GAUD",
"COLON. ERIK D.", "JOHNSON. ALEXANDER", "FRANCO. BRANDT", "ZULLO. JASON",
"HILL. ROBERT W.", "PAYNE. ALBERT", "DIAZ. JOSE CANO", "MARTINEZ. DAVID C.",
"JACKSON. RONNIE WAYNE", "FRANCO. BRANDT")), row.names = c(NA,
-21L), class = "data.frame", .Names = "LAST_NAME")
