Select data till the end based on a pattern in one column - r

I have messy data. I want to subset the data based on a phrase in a column till the end.
df1 <- data.frame(
V1=c("No. de Control Interno de", "la Partida / Concepto de Obra","",
"LO-009J0U004-","E50-2021",""),
V2=c("","Descripción Breve","Trabajos de señalamiento horizontal en puente de",
"cuota \"El Zacatal\", consistentes en suministro y","aplicación de pintura de tránsito, suministro y",
"colocación de botones y ménsulas reflejantes."),
V3=c("","ClaveCUCOP","","","62502002",""),
V4=c("Unidad","Observaciones de Medida","","","Obra",""),
V5=c("","Cantidad","","","1","")
)
Whenever the phrase Descripción appears in V2, the code should subset the data frame from that row till the end. For example, in the case above, this means selecting rows 2 through 6. I was trying with str_detect from the stringr package.

You can use the which() function to return the indices where str_detect() is TRUE.
library(stringr)
which(str_detect(df1$V2, "Descripción"))
[1] 2
If you instead save the output of which() to a variable, you can use it to subset your data. Note that the following explicitly takes the first value of x in case str_detect() returns TRUE in more than one place.
x <- which(str_detect(df1$V2, "Descripción"))
df1[x[1]:nrow(df1),]
                             V1                                               V2         V3                      V4       V5
2 la Partida / Concepto de Obra                                Descripción Breve ClaveCUCOP Observaciones de Medida Cantidad
3                               Trabajos de señalamiento horizontal en puente de
4                 LO-009J0U004- cuota "El Zacatal", consistentes en suministro y
5                      E50-2021  aplicación de pintura de tránsito, suministro y   62502002                    Obra        1
6                                  colocación de botones y ménsulas reflejantes.
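One case the snippet above does not cover: when the pattern never matches, x[1] is NA and NA:nrow(df1) raises an error. A minimal base R sketch of a guard follows. The helper name subset_from_match and the small demo data frame are hypothetical and stand in for df1; grep() is used in place of which(str_detect()) only to avoid the package dependency.

```r
# Hypothetical helper: return the rows of `df` from the first row where
# column `col` matches `pattern` through the last row.
subset_from_match <- function(df, col, pattern) {
  hits <- grep(pattern, df[[col]])
  if (length(hits) == 0) {
    # No match: return an empty data frame with the same columns
    return(df[0, , drop = FALSE])
  }
  df[hits[1]:nrow(df), , drop = FALSE]
}

# Hypothetical stand-in for df1
demo <- data.frame(
  V1 = c("header", "id", "a", "b"),
  V2 = c("", "Description", "first", "second"),
  stringsAsFactors = FALSE
)

res <- subset_from_match(demo, "V2", "Description")   # rows 2 to 4
none <- subset_from_match(demo, "V2", "not present")  # zero rows, no error
```

Returning an empty data frame rather than stopping is a design choice; calling stop() with a clear message may suit a pipeline better.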

Related

removing url with format www in R

I need to remove some urls from a dataframe. So far I have been able to eliminate those with the pattern http://. However, there are still some websites in my corpus with the format www.stackoverflow.com or stackoverflow.org
Here is my code
#Sample of text
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar")
#Trying to remove the website with no results
test_text <- gsub("www[.]//([a-zA-Z]|[0-9]|[$-_#.&+]|[!*\\(\\),])//[.]com", "", test_text)
The outcome should be
test_text
"la primera posibilidad real de acabar con la violencia del país es y luego desatar"
The following regex removes the test url.
test_text <- c("la primera posibilidad real de acabar con la violencia del país es www.jorgeorlandomelo.com y luego desatar",
"bla1 bla2 www.stackoverflow.org etc",
"this that www.nameofthewebiste.com one more"
)
gsub("(^[^w]*)www\\.[^\\.]*\\.[[:alpha:]]{2,3}(.*$)", "\\1\\2", test_text)
#[1] "la primera posibilidad real de acabar con la violencia del país es y luego desatar"
#[2] "bla1 bla2 etc"
#[3] "this that one more"
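Because that pattern is anchored with ^[^w]*, it stops matching as soon as the text before the URL contains a "w", and it removes at most one URL per string. A looser sketch, assuming every unwanted address starts with "www." and contains no whitespace, removes each such token together with the whitespace in front of it. The sample strings are made up, and bare domains such as stackoverflow.org are not handled.

```r
# Hypothetical sample text: a "w" before the URL and two URLs in one string
urls_text <- c(
  "we saw www.example.com twice, also www.example.org here",
  "bla1 bla2 www.stackoverflow.org etc"
)

# \\s*   optional whitespace in front of the URL (removed with it)
# \\bwww\\.  a token that starts with "www."
# \\S+   everything up to the next whitespace
no_urls <- gsub("\\s*\\bwww\\.\\S+", "", urls_text, perl = TRUE)
no_urls
```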

R: regex first occurrence based on condition

I am trying to split strings by using the first white space coming after 3 characters. Here is my code:
string <- c("Le jour la nuit", "Les jours les nuits")
part1 <- sub("(\\S{3,})\\s?(.*)", "\\1", string)
part2 <- sub("(\\S{3,})\\s?(.*)", "\\2", string)
# output
> part1
[1] "Le jour" "Les"
> part2
[1] "Le la nuit" "jours les nuits"
For the first part, it works exactly as desired. However, it is not the case for the second part: part2[1] should be la nuit instead of Le la nuit.
I am not sure how to achieve this and would be thankful for some guidance.
Not sure what you really want but per your requirements, you could use
^(.{3,}?)(?:(?<!,)\\s)+(.*)
This says:
^ # start of the string
(.{3,}?) # capture 3+ characters lazily, up to...
(?:(?<!,)\\s)+ # 1+ whitespaces that must not be preceded by a comma
(.*) # capture the rest of the string
In R:
string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits")
(part1 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\1", string, perl = T))
(part2 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\2", string, perl = T))
Yielding
[1] "Le jour" "Les" "les, jours"
and
[1] "la nuit" "jours les nuits" "les nuits"
Maybe you need a dataframe as a result, if so, you could define yourself a little function (using sapply and some logic):
make_df <- function(text) {
  parts <- sapply(text, function(x) {
    m <- regexec("^(.{3,}?)(?:(?<!,)\\s)+(.*)", x, perl = TRUE)
    groups <- regmatches(x, m)
    c(groups[[1]][2], groups[[1]][3])
  }, USE.NAMES = FALSE)
  (setNames(as.data.frame(t(parts), stringsAsFactors = FALSE), c("part1", "part2")))
}
(df <- make_df(string))
This would yield for string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits", "somejunk"):
       part1           part2
1    Le jour         la nuit
2        Les jours les nuits
3 les, jours       les nuits
4       <NA>            <NA>

How to use gsub to fix case-defined multiple spaces and broken lines?

I used pdftools to convert some pdf documents to txt. This is a part of the output (it's not so bad)
REPÚBLICA DE CHILE PADRON ELECTORAL AUDITADO ELECCIONES PRESIDENCIAL, PARLAMENTARIAS y de CONSEJEROS REGIONALES 2017 REGISTROS: 2.421
SERVICIO ELECTORAL REGIÓN : ARICA Y PARINACOTA COMUNA: GENERAL LAGOS PÁGINA 1 de 38
PROVINCIA : PARINACOTA
NOMBRE C.IDENTIDAD SEXO DOMICILIO ELECTORAL CIRCUNSCRIPCIÓN MESA
AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M
AGUILERA ZENTENO PATRICIA ALEJANDRA 16.223.938-4 MUJ PUEBLO DE GUACOLLO S N CERCANO A GENERAL LAGOS 5M
AGUIRRE CHOQUE MARCOS JULIO 15.000.385-7 VAR CIRCUNSCRIPCION
CALLE TORREALBA DE VISVIRI
CASA N° 4 PUEBLO DE VISVIRI GENERAL LAGOS 7V
So I'm doing this to clean this and convert it into formatted tsv:
library(readr)  # provides read_lines()
test = read_lines("file.txt")
test2 = test[!grepl("REP\u00daBLICA",test)]
test2 = test2[!grepl("SERVICIO",test2)]
test2 = test2[!grepl("NOMBRE",test2)]
test2 = test2[!grepl("PROVINCIA",test2)]
test2 = gsub("\\.", "", test2)
test2 = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", test2, perl=TRUE)
and the output is:
ABRIGO PIZARRO PATRICIO ESTEBAN 16024716-9 VAR PUEB ALCERRECA GENERAL LAGOS 5V
ABURTO VELASCO ESTHER MARISOL 13005517-6 MUJ VILLA INDUSTRIAL GENERAL LAGOS 2M
ACEVEDO MONTT SEBASTIAN ANDRES 17829470-9 VAR CALLE RAFAEL TORREALBA N° 3 PUEBLO DE VISVIRI GENERAL LAGOS 3V
ACHILLO BLAS ADOLFO ARTURO 13008044-8 VAR VISURI GENERAL LAGOS 7V
I've read some posts and I'm not sure how to implement:
(1) Something like gsub("(?<=[\\s+])[0-9]", "\t", test2, perl=TRUE), to replace multiple spaces followed by a number with a tab followed by that number.
(2) How to move broken lines to the end of the previous line, such as line 8 in the above sample, which starts with multiple spaces.
Fixing (1) and (2) would return this:
ABRIGO PIZARRO PATRICIO ESTEBAN \t 16024716-9 \t VAR \t PUEB ALCERRECA \t GENERAL LAGOS \t 5V
ABURTO VELASCO ESTHER MARISOL \t 13005517-6 \t MUJ \t VILLA INDUSTRIAL \t GENERAL LAGOS \t 2M
(1) You can use the words "VAR" and "MUJ" as key-words for splitting:
x <- "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M"
strsplit(x, "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s", perl = TRUE)
The result is:
[[1]]
[1] "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA" "13.638.826-6" "MUJ"
[4] "PUEBLO DE TACORA S N VISVIRI" "GENERAL LAGOS" "4M"
Maybe not the most elegant solution, but it works, and if you can modify the data you could use real keywords and ensure they are unique.
(2) An easy solution would be to check each row's length and move values up if the row is too short.
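For (2), an alternative to checking row lengths is to use the leading whitespace the question mentions. The sketch below assumes continuation lines always start with whitespace and that fields are separated by at least two spaces. It has to run before any step that trims leading whitespace. The raw lines are hypothetical and only modelled on the sample.

```r
# Hypothetical raw lines as they might come out of the PDF conversion
raw <- c(
  "AGUILERA ZENTENO PATRICIA ALEJANDRA   16.223.938-4  MUJ  PUEBLO DE GUACOLLO S N   GENERAL LAGOS   5M",
  "AGUIRRE CHOQUE MARCOS JULIO   15.000.385-7  VAR  CIRCUNSCRIPCION",
  "          CALLE TORREALBA DE VISVIRI",
  "          CASA N 4 PUEBLO DE VISVIRI   GENERAL LAGOS   7V"
)

# A line that does not start with whitespace opens a new record;
# cumsum() gives each continuation line the id of the record above it.
record_id <- cumsum(!grepl("^\\s", raw))

# Glue the pieces of each record together with a single space
merged <- vapply(split(trimws(raw), record_id), paste, character(1),
                 collapse = " ", USE.NAMES = FALSE)

# Turn runs of two or more spaces into tabs to get TSV-like lines
tsv <- gsub("\\s{2,}", "\t", merged)
tsv
```

Joining the pieces with a single space keeps the wrapped address in one field. Rows whose fields are separated by only one space would still need the keyword split from (1).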

Find row that throws error in R

I have more than a thousand rows in my dataframe. One of its columns should hold a single word. I want to lowercase this column:
df$precedingWord <- tolower(df$precedingWord)
But surprisingly, I get an error
Error in tolower(df$precedingWord) :
invalid input '/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen 😳RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #' in 'utf8towcs'
From this I gather that on one specific row, df$precedingWord doesn't hold a single word but more than a sentence, namely /home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen 😳RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #.
Now, to debug this, I'd like to know the row ID of the value that triggers the error. How can I find this out?
Use grep to search for the string:
x <- c("a",
'/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml: Ik zeg jij hebt goede ogen 😳RT #IMoonen Ik tel 16 schepen voor de kust, dat mag je gerust een #')
grep("/home/nobackup/SONAR/COMPACT/WR-P-E-L/WR-P-E-L0000106.data.ids.xml:",
x, fixed = TRUE)
#[1] 2
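Searching for the text only works when the error message happens to print the offending value. Since 'utf8towcs' errors usually point at bytes that are not valid UTF-8, a more general sketch is to ask base R which elements are invalid, or to apply the failing function element by element and record where it errors. The vector below is a hypothetical stand-in for df$precedingWord. Whether tolower() actually fails on it depends on the locale, so the second result may differ between machines.

```r
# Hypothetical stand-in for df$precedingWord; "\xff" is never a valid UTF-8 byte
words <- c("a", "bad \xff bytes", "word")

# Option 1: indices of the elements that are not valid UTF-8
bad_rows <- which(!validUTF8(words))
bad_rows

# Option 2: run the failing call on each element and record which ones error
# (locale dependent, so no particular result is assumed here)
fails <- vapply(words, function(s) {
  inherits(try(tolower(s), silent = TRUE), "try-error")
}, logical(1), USE.NAMES = FALSE)
which(fails)
```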

R- work inside text

I have a table with text like:
tt<-data.frame(a=c("esta es la unica lista que voy a hacer","esta es la 2da unica"))
I need to keep only the words that have more than 3 characters:
tt<-data.frame(a=c("esta unica lista hacer","esta unica"))
In this case I have no clue of how to do it. I know I have to use nchar and a loop over the table and inside another loop over the words.
Using the data.table package:
library(data.table)
setDT(tt)
tt[,a:=gsub("\\s+"," ",gsub("\\b\\w{1,3}\\b","",a))]
a
1: esta unica lista hacer
2: esta unica
Another option, depending on exactly the output you want, is:
library(data.table) #1.9.5+
tt[,tstrsplit(gsub("\\b\\w{1,3}\\b","",a),split="\\s+")]
V1 V2 V3 V4
1: esta unica lista hacer
2: esta unica NA NA
Edit: After much tussling, at the encouragement of @rawr, here is a way to get at the problem more directly (keep the words of four or more characters instead of removing those of up to three):
tt[,a:=sapply(regmatches(a, gregexpr('\\b\\w{4,}\\b',a)),paste0,collapse=" ")]
It's not too tricky if you break it into chunks. First use sapply to iterate over each element of the column. Then, for each element, break the string into words, select the long ones, paste them back into a string, and return the result:
tt<-data.frame(a=c("esta es la unica lista que voy a hacer","esta es la 2da unica"))
library(stringr)
tt$a <- sapply(as.character(tt$a), function(x) {
  words <- unlist(str_split(x, " "))
  long_words <- words[nchar(words) > 3]   # keep words with more than 3 characters
  paste(long_words, collapse = " ")
}, USE.NAMES = FALSE)
Here is another approach using the qdapRegex package.
library(qdapRegex)
tt <- data.frame(a = c('esta es la unica lista que voy a hacer', 'esta es la 2da unica'))
tt$a <- rm_nchar_words(tt$a, 1, pattern = '\\b\\w{1,3}\\b')
tt
# a
# 1 esta unica lista hacer
# 2 esta unica
Here's a solution using the quanteda package, that tokenizes the texts in your data.frame and removes the tokens whose length is <= 3. Note that I have specified stringsAsFactors = FALSE here in the data.frame() -- although this would work equally fine if you were operating directly on a character vector.
require(quanteda)
tt <- data.frame(a=c("esta es la unica lista que voy a hacer", "esta es la 2da unica"),
stringsAsFactors = FALSE)
ttTokenized <- tokenize(tt$a)
(ttTokenized <- sapply(ttTokenized, function(x) x[nchar(x) > 3]))
## [[1]]
## [1] "esta" "unica" "lista" "hacer"
##
## [[2]]
## [1] "esta" "unica"
If you want the original-looking texts rather than the tokenised versions, then use this additional step:
sapply(ttTokenized, paste, collapse = " ")
## [1] "esta unica lista hacer" "esta unica"

Resources