Suppose I have the following data frame:
dd<-data.frame(a=c("xtr","la casa x-tr","x-tr"))
a
xtr
la casa x-tr
x-tr
How can I replace only the "x-tr" occurrences with "xtr"? So, the final output would be
a
xtr
la casa xtr
xtr
We can use sub
dd$a <- sub("(x)-(tr)$", "\\1\\2", dd$a)
dd$a
#[1] "xtr" "la casa xtr" "xtr"
If there is only a single -, then
sub("-", "", dd$a)
I have messy data. I want to subset the data from the row where a phrase appears in a column through to the end.
df1 <- data.frame(
V1=c("No. de Control Interno de", "la Partida / Concepto de Obra","",
"LO-009J0U004-","E50-2021",""),
V2=c("","Descripción Breve","Trabajos de señalamiento horizontal en puente de",
"cuota \"El Zacatal\", consistentes en suministro y","aplicación de pintura de tránsito, suministro y",
"colocación de botones y ménsulas reflejantes."),
V3=c("","ClaveCUCOP","","","62502002",""),
V4=c("Unidad","Observaciones de Medida","","","Obra",""),
V5=c("","Cantidad","","","1","")
)
Whenever the phrase Descripción appears in V2, the code should subset the data frame from that row to the end. For example, in the case above, this means selecting the data from row 2 to row 6. I was trying str_detect from the stringr package.
You can use the which() function to return the indices where str_detect() is TRUE.
library(stringr)
which(str_detect(df1$V2, "Descripción"))
[1] 2
If you instead save the output of which() to a variable, you can use it to subset your data. Note that the following explicitly uses the first value in x in case str_detect() returns TRUE in more than one place.
x <- which(str_detect(df1$V2, "Descripción"))
df1[x[1]:nrow(df1),]
V1 V2 V3 V4 V5
2 la Partida / Concepto de Obra Descripción Breve ClaveCUCOP Observaciones de Medida Cantidad
3 Trabajos de señalamiento horizontal en puente de
4 LO-009J0U004- cuota "El Zacatal", consistentes en suministro y
5 E50-2021 aplicación de pintura de tránsito, suministro y 62502002 Obra 1
6 colocación de botones y ménsulas reflejantes.
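If you prefer a pipe, the same "from the first match to the end" logic can be written with dplyr; a sketch, assuming you want every row from the first match onward:
library(dplyr)
library(stringr)
# cumsum() turns the logical match into a running count, so the condition is
# TRUE from the first row containing "Descripción" until the end
df1 %>% filter(cumsum(str_detect(V2, "Descripción")) > 0)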
I am trying to split strings at the first whitespace that comes after 3 characters. Here is my code:
string <- c("Le jour la nuit", "Les jours les nuits")
part1 <- sub("(\\S{3,})\\s?(.*)", "\\1", string)
part2 <- sub("(\\S{3,})\\s?(.*)", "\\2", string)
# output
> part1
[1] "Le jour" "Les"
> part2
[1] "Le la nuit" "jours les nuits"
For the first part, it works exactly as desired. However, it is not the case for the second part: part2[1] should be "la nuit" instead of "Le la nuit".
I am not sure how to achieve this and would be thankful for some guidance.
Not sure what you really want, but per your requirements, you could use
^(.{3,}?)(?:(?<!,)\\s)+(.*)
This says:
^ # start of the string
(.{3,}?) # capture 3+ characters lazily, up to...
(?:(?<!,)\\s)+ # 1+ whitespaces that must not be preceeded by a comma
(.*) # capture the rest of the string
In R:
string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits")
(part1 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\1", string, perl = T))
(part2 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\2", string, perl = T))
Yielding
[1] "Le jour" "Les" "les, jours"
and
[1] "la nuit" "jours les nuits" "les nuits"
Maybe you need a data frame as a result; if so, you can define yourself a little function (using sapply and some logic):
make_df <- function(text) {
  # extract both capture groups for every element of text
  parts <- sapply(text, function(x) {
    m <- regexec("^(.{3,}?)(?:(?<!,)\\s)+(.*)", x, perl = TRUE)
    groups <- regmatches(x, m)
    c(groups[[1]][2], groups[[1]][3])  # NA, NA when there is no match
  }, USE.NAMES = FALSE)
  # one row per input string, with named columns
  setNames(as.data.frame(t(parts), stringsAsFactors = FALSE), c("part1", "part2"))
}
(df <- make_df(string))
For string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits", "somejunk"), this would yield:
part1 part2
1 Le jour la nuit
2 Les jours les nuits
3 les, jours les nuits
4 <NA> <NA>
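If you are working in the tidyverse anyway, tidyr::extract() can apply the same regex and build the two columns directly; a sketch, assuming the strings sit in a data frame column named text:
library(tidyr)
df <- data.frame(text = string, stringsAsFactors = FALSE)
# the two capture groups become part1 and part2; rows that do not match
# (e.g. "somejunk") get NA in both columns
extract(df, text, into = c("part1", "part2"),
        regex = "^(.{3,}?)(?:(?<!,)\\s)+(.*)")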
I used pdftools to convert some PDF documents to txt. This is part of the output (it's not so bad):
REPÚBLICA DE CHILE PADRON ELECTORAL AUDITADO ELECCIONES PRESIDENCIAL, PARLAMENTARIAS y de CONSEJEROS REGIONALES 2017 REGISTROS: 2.421
SERVICIO ELECTORAL REGIÓN : ARICA Y PARINACOTA COMUNA: GENERAL LAGOS PÁGINA 1 de 38
PROVINCIA : PARINACOTA
NOMBRE C.IDENTIDAD SEXO DOMICILIO ELECTORAL CIRCUNSCRIPCIÓN MESA
AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M
AGUILERA ZENTENO PATRICIA ALEJANDRA 16.223.938-4 MUJ PUEBLO DE GUACOLLO S N CERCANO A GENERAL LAGOS 5M
AGUIRRE CHOQUE MARCOS JULIO 15.000.385-7 VAR CIRCUNSCRIPCION
CALLE TORREALBA DE VISVIRI
CASA N° 4 PUEBLO DE VISVIRI GENERAL LAGOS 7V
So I'm doing this to clean it up and convert it into a formatted TSV:
library(readr)  # read_lines() comes from readr

test = read_lines("file.txt")
# drop the repeated page headers and the column-name line
test2 = test[!grepl("REP\u00daBLICA",test)]
test2 = test2[!grepl("SERVICIO",test2)]
test2 = test2[!grepl("NOMBRE",test2)]
test2 = test2[!grepl("PROVINCIA",test2)]
# remove the dots from the ID numbers
test2 = gsub("\\.", "", test2)
# collapse runs of whitespace and trim leading/trailing spaces
test2 = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", test2, perl=TRUE)
and the output is:
ABRIGO PIZARRO PATRICIO ESTEBAN 16024716-9 VAR PUEB ALCERRECA GENERAL LAGOS 5V
ABURTO VELASCO ESTHER MARISOL 13005517-6 MUJ VILLA INDUSTRIAL GENERAL LAGOS 2M
ACEVEDO MONTT SEBASTIAN ANDRES 17829470-9 VAR CALLE RAFAEL TORREALBA N° 3 PUEBLO DE VISVIRI GENERAL LAGOS 3V
ACHILLO BLAS ADOLFO ARTURO 13008044-8 VAR VISURI GENERAL LAGOS 7V
I've read some posts and I'm not sure how to implement:
(1) Something like gsub("(?<=[\\s+])[0-9]", "\t", test2, perl=TRUE), to replace multiple spaces followed by a number with a tab followed by that number.
(2) How to move broken lines to the end of the previous line, such as line 8 in the sample above, which starts with multiple spaces.
Fixing (1) and (2) would return this:
ABRIGO PIZARRO PATRICIO ESTEBAN \t 16024716-9 \t VAR \t PUEB ALCERRECA \t GENERAL LAGOS \t 5V
ABURTO VELASCO ESTHER MARISOL \t 13005517-6 \t MUJ \t VILLA INDUSTRIAL \t GENERAL LAGOS \t 2M
(1) You can use the words "VAR" and "MUJ" as keywords for splitting:
x <- "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M"
strsplit(x, "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s", perl = TRUE)
The result is:
[[1]]
[1] "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA" "13.638.826-6" "MUJ"
[4] "PUEBLO DE TACORA S N VISVIRI" "GENERAL LAGOS" "4M"
Maybe not the most elegant solution, but it works, and if you can modify the data you could use real keywords and ensure they are unique.
(2) An easy solution would be to check each row's length and move the values of short, broken rows back up to the previous line; a small sketch of that idea follows.
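For instance, assuming the cleaned lines are in test2, the header lines have already been filtered out, and every complete record contains the sex keyword VAR or MUJ:
# lines containing VAR or MUJ start a new record; any other line is treated
# as a continuation of the record that precedes it
has_key <- grepl("\\b(VAR|MUJ)\\b", test2, perl = TRUE)
grp <- cumsum(has_key)
# paste the lines of each group back into a single record
records <- unname(tapply(test2, grp, paste, collapse = " "))
The merged records can then be split into fields with the strsplit() call from (1).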
I have a table with text like:
tt<-data.frame(a=c("esta es la unica lista que voy a hacer","esta es la 2da unica"))
I need to keep only the words that have more than 3 characters:
tt<-data.frame(a=c("esta unica lista hacer","esta unica"))
In this case I have no clue how to do it. I know I have to use nchar and a loop over the table, with another loop over the words inside.
Using the data.table package:
library(data.table)
setDT(tt)
tt[,a:=gsub("\\s+"," ",gsub("\\b\\w{1,3}\\b","",a))]
a
1: esta unica lista hacer
2: esta unica
Another option, depending on exactly the output you want, is:
library(data.table) #1.9.5+
tt[,tstrsplit(gsub("\\b\\w{1,3}\\b","",a),split="\\s+")]
V1 V2 V3 V4
1: esta unica lista hacer
2: esta unica NA NA
Edit: After much tussling at the encouragement of @rawr, here is a way to get at the problem more directly (keep words of 4+ characters instead of excluding words of 1-3 characters):
tt[,a:=lapply(regmatches(a, gregexpr('\\b\\w{4,}\\b',a)),paste0,collapse=" ")]
It's not too tricky if you break it into chunks. First use lapply to iterate over each entry of the column. Then for each entry, break the string into words, select the long ones, paste them back into a string, and return the result:
tt<-data.frame(a=c("esta es la unica lista que voy a hacer","esta es la 2da unica"))
library(stringr)
tt$a <- lapply(tt$a, function(x) {
l <- unlist(str_split(x, " "))
t <- l[which(nchar(l)>3)]
return(paste0(t, collapse=" "))
})
Here is another approach using the qdapRegex package.
library(qdapRegex)
tt <- data.frame(a = c('esta es la unica lista que voy a hacer', 'esta es la 2da unica'))
tt$a <- rm_nchar_words(tt$a, 1, pattern = '\\b\\w{1,3}\\b')
tt
# a
# 1 esta unica lista hacer
# 2 esta unica
Here's a solution using the quanteda package, which tokenizes the texts in your data.frame and removes the tokens whose length is <= 3. Note that I have specified stringsAsFactors = FALSE here in the data.frame(), although this would work equally well if you were operating directly on a character vector.
require(quanteda)
tt <- data.frame(a=c("esta es la unica lista que voy a hacer", "esta es la 2da unica"),
stringsAsFactors = FALSE)
ttTokenized <- tokenize(tt$a)
(ttTokenized <- sapply(ttTokenized, function(x) x[nchar(x) > 3]))
## [[1]]
## [1] "esta" "unica" "lista" "hacer"
##
## [[2]]
## [1] "esta" "unica"
If you want the original-looking texts rather than the tokenised versions, then use this additional step:
sapply(ttTokenized, paste, collapse = " ")
## [1] "esta unica lista hacer" "esta unica"
I want to merge/paste (paste(c(...), collapse = " ")) strings in a data frame based on the value (author) in a different column. I am looking for an efficient way to do it.
df <- data.frame(author = c("Shakespeare",
"Dante",
"Proust",
"Shakespeare",
"Dante",
"Proust",
"Shakespeare"),
text = c("Put the wild waters in this roar, allay them",
"Ma tu perche' ritorni a tanta noia?",
"Longtemps, je me suis couché de bonne heure",
"The very virtue of compassion in thee",
"Pensa oramai qual fu colui che degno",
"Quelle horreur! me disais-je",
"She said thou wast my daughter; and thy father"))
And the end result should be
result <- c("Put the wild waters in this roar, allay them The very virtue of compassion in thee She said thou wast my daughter; and thy father",
"Ma tu perche' ritorni a tanta noia? Pensa oramai qual fu colui che degno",
"Longtemps, je me suis couché de bonne heure Quelle horreur! me disais-je")
names(result) <- c("Shakespeare","Dante","Proust")
result
# Shakespeare
# "Put the wild waters in this roar, allay them The very virtue of compassion in thee She said thou wast my daughter; and thy father"
# Dante
# "Ma tu perche' ritorni a tanta noia? Pensa oramai qual fu colui che degno"
# Proust
# "Longtemps, je me suis couché de bonne heure Quelle horreur! me disais-je"
I guess I should somehow use some function from the apply family. Something like
apply( df[??? , 2 , paste , collapse = " " )
but I am not sure how to pass the condition and then obtain, as a result, the name of the author to which each pasted string corresponds...
tapply works more or less exactly as you expected:
tapply(df$text, df$author, paste, collapse = " ")
A more en vogue solution would be to use dplyr:
library(dplyr)
df %>% group_by(author) %>% summarize(passage = paste(text, collapse = " "))