R - work inside text

I have a table with text like:
tt<-data.frame(a=c("esta es la unica lista que voy a hacer","esta es la 2da unica"))
I need to keep only the words that have more than 3 characters:
tt<-data.frame(a=c("esta unica lista hacer","esta unica"))
In this case I have no clue how to do it. I know I have to use nchar and a loop over the table, and inside it another loop over the words.
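For reference, that idea doesn't strictly need explicit loops. A minimal base-R sketch (not from the answers below; stringsAsFactors = FALSE is added here so the column is character rather than a factor):
tt <- data.frame(a = c("esta es la unica lista que voy a hacer",
                       "esta es la 2da unica"),
                 stringsAsFactors = FALSE)
# split each string into words, keep words longer than 3 characters, paste back
tt$a <- sapply(strsplit(tt$a, "\\s+"),
               function(words) paste(words[nchar(words) > 3], collapse = " "))
tt
#                        a
# 1 esta unica lista hacer
# 2             esta unica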

Using the data.table package (the inner gsub() drops words of 1-3 characters; the outer gsub() collapses the whitespace they leave behind):
library(data.table)
setDT(tt)
tt[,a:=gsub("\\s+"," ",gsub("\\b\\w{1,3}\\b","",a))]
a
1: esta unica lista hacer
2: esta unica
Another option, depending on exactly the output you want, is:
library(data.table) #1.9.5+
tt[,tstrsplit(gsub("\\b\\w{1,3}\\b","",a),split="\\s+")]
V1 V2 V3 V4
1: esta unica lista hacer
2: esta unica NA NA
Edit: After much tussling at the encouragement of @rawr, here is a way to get at the problem more directly (include words of 4+ letters instead of excluding words of 1-3 letters):
tt[,a:=lapply(regmatches(a, gregexpr('\\b\\w{4,}\\b',a)),paste0,collapse=" ")]

It's not too tricky if you break it into chunks. First use lapply to iterate over each value in the column. Then for each value, break the string into words, select the long ones, paste them back into a string, and return the result:
tt<-data.frame(a=c("esta es la unica lista que voy a hacer","esta es la 2da unica"))
library(stringr)
tt$a <- lapply(tt$a, function(x) {
  l <- unlist(str_split(x, " "))
  t <- l[which(nchar(l) > 3)]
  return(paste0(t, collapse = " "))
})
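A small variation (a sketch, not part of the original answer): using vapply() instead of lapply() keeps a as a plain character column rather than a list column:
tt$a <- vapply(as.character(tt$a), function(x) {
  words <- unlist(str_split(x, " "))
  paste0(words[nchar(words) > 3], collapse = " ")
}, character(1), USE.NAMES = FALSE)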

Here is another approach using the qdapRegex package.
library(qdapRegex)
tt <- data.frame(a = c('esta es la unica lista que voy a hacer', 'esta es la 2da unica'))
tt$a <- rm_nchar_words(tt$a, 1, pattern = '\\b\\w{1,3}\\b')
tt
# a
# 1 esta unica lista hacer
# 2 esta unica

Here's a solution using the quanteda package that tokenizes the texts in your data.frame and removes the tokens whose length is <= 3. Note that I have specified stringsAsFactors = FALSE here in the data.frame() -- although this would work equally well if you were operating directly on a character vector.
require(quanteda)
tt <- data.frame(a=c("esta es la unica lista que voy a hacer", "esta es la 2da unica"),
stringsAsFactors = FALSE)
ttTokenized <- tokenize(tt$a)
(ttTokenized <- sapply(ttTokenized, function(x) x[nchar(x) > 3]))
## [[1]]
## [1] "esta" "unica" "lista" "hacer"
##
## [[2]]
## [1] "esta" "unica"
If you want the original-looking texts rather than the tokenised versions, then use this additional step:
sapply(ttTokenized, paste, collapse = " ")
## [1] "esta unica lista hacer" "esta unica"

Related

Select data till the end based on a pattern in one column

I have messy data. I want to subset the data from the row where a phrase appears in a column to the end.
df1 <- data.frame(
  V1 = c("No. de Control Interno de", "la Partida / Concepto de Obra", "",
         "LO-009J0U004-", "E50-2021", ""),
  V2 = c("", "Descripción Breve", "Trabajos de señalamiento horizontal en puente de",
         "cuota \"El Zacatal\", consistentes en suministro y", "aplicación de pintura de tránsito, suministro y",
         "colocación de botones y ménsulas reflejantes."),
  V3 = c("", "ClaveCUCOP", "", "", "62502002", ""),
  V4 = c("Unidad", "Observaciones de Medida", "", "", "Obra", ""),
  V5 = c("", "Cantidad", "", "", "1", "")
)
Whenever the phrase Descripción appears in V2, the code should subset the data frame from that row to the end. For example, in the case above, this means selecting data from row 2 to row 6. I was trying with str_detect from the stringr package.
You can use the which() function to return the indices where str_detect() is TRUE.
library(stringr)
which(str_detect(df1$V2, "Descripción"))
[1] 2
If you instead save the output of which() to a variable, you can use it to subset your data. Note that the following explicitly uses the first value of x in case str_detect() returns TRUE in more than one place.
x <- which(str_detect(df1$V2, "Descripción"))
df1[x[1]:nrow(df1),]
V1 V2 V3 V4 V5
2 la Partida / Concepto de Obra Descripción Breve ClaveCUCOP Observaciones de Medida Cantidad
3 Trabajos de señalamiento horizontal en puente de
4 LO-009J0U004- cuota "El Zacatal", consistentes en suministro y
5 E50-2021 aplicación de pintura de tránsito, suministro y 62502002 Obra 1
6 colocación de botones y ménsulas reflejantes.
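If you prefer a dplyr-style pipeline, a rough equivalent (a sketch that assumes the phrase occurs at least once; otherwise the comparison against NA would drop every row) is:
library(dplyr)
library(stringr)
df1 %>%
  filter(row_number() >= which(str_detect(V2, "Descripción"))[1])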

How to lemmatize a corpus with a particular dictionary in R?

I'm trying to perform lemmatization on a corpus, using the function lemmatize_strings() as an argument to tm_map() from the tm package.
But I want to use my own dictionary ("lexico": the first column has the full word form in lower case, and the second column has the corresponding replacement lemma).
I tried to use:
corpus<-tm_map(corpus, lemmatize_strings)
But it didn't work...
When I use:
lemmatize_strings(corpus[[1]], dictionary = lexico)
I have no problem!
How can I pass my dictionary "lexico" to the function tm_map()?
Sorry for this question; it's my first attempt at text mining, at the age of 48.
To make this clearer, my corpus is composed of 2000 documents; here is an extract from the first document:
corpus[[1]][[1]]
[9] "..."
[10] "Nos últimos dias da passada legislatura, a maioria de direita aprovou duas leis que significam enormes recuos nos direitos das cidadãs do país. Fizeram tábua rasa do pronunciamento das cidadãs e cidadãos do país em referendo, optando por humilhar e tentar culpabilizar as mulheres que abortam por sua livre escolha. Estas duas leis são a Lei n.º 134/2015 e a Lei n.º 136/2015, de setembro. A primeira prevê o pagamento de taxas moderadoras na interrupção de gravidez quando for realizada, por opção da mulher, nas primeiras 10 semanas de gravidez. A segunda representa a primeira alteração à Lei n.º 16/2007, de 17 de abril, sobre exclusão de ilicitude nos casos de interrupção voluntária da gravidez."
Then I worked on a dictionary file (lexico) with this structure:
lexico[1:10,]
termo lema pos.tag
1 aa a NCMP000
2 aais aal NCMP000
3 aal aal NCMS000
4 aaleniano aaleniano NCMS000
5 aalenianos aaleniano NCMP000
6 ab-rogação ab-rogação NCFS000
7 ab-rogações ab-rogação NCFP000
8 ab-rogamento ab-rogamento NCMS000
9 ab-rogamentos ab-rogamento NCMP000
10 ab-rogáveis ab-rogável AQ0CP0
When I use the function lemmatize_strings(corpus[[1]], dictionary = lexico), it works correctly and gives document nº 1 of the corpus lemmatized with lemmas from my dictionary.
The problem I have is with this call:
> corpus<-tm_map(corpus, lemmatize_strings, dictionary = lexico)
Warning messages:
1: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
2: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
> corpus[[1]][[1]]
[1] ""
This simply destroys all my documents in the corpus:
> corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2000
Thanks in advance for your reply!
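One common tm pitfall (an aside, not part of the answer below): tm_map() hands whole document objects to the function, while lemmatize_strings() expects plain character strings; wrapping the function in tm's content_transformer() usually fixes that. A sketch, assuming lemmatize_strings() comes from the textstem package and lexico has the word/lemma columns shown above:
library(tm)
library(textstem)
# content_transformer() makes the function operate on the text content of each
# document; extra arguments such as dictionary = lexico are passed through.
corpus <- tm_map(corpus, content_transformer(lemmatize_strings),
                 dictionary = lexico)
corpus[[1]][[1]]  # inspect the first document after lemmatization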
You could, for example, use the quanteda package for this:
library("quanteda")
text <- "This is a test sentence. We can lemmatize it using quanteda."
dict <- data.frame(
  word = c("is", "using"),
  lemma = c("be", "use"),
  stringsAsFactors = FALSE
)
toks <- tokens(text, remove_punct = TRUE)
toks_lemma <- tokens_replace(toks,
                             pattern = dict$word,
                             replacement = dict$lemma,
                             case_insensitive = TRUE,
                             valuetype = "fixed")
toks_lemma
tokens from 1 document.
text1 :
[1] "This" "be" "a" "test" "sentence" "We" "can" "lemmatize"
[9] "it" "use" "quanteda"
The function is very fast and, despite its name, was mainly created for lemmatization.

R: regex first occurrence based on condition

I am trying to split strings at the first whitespace that comes after 3 characters. Here is my code:
string <- c("Le jour la nuit", "Les jours les nuits")
part1 <- sub("(\\S{3,})\\s?(.*)", "\\1", string)
part2 <- sub("(\\S{3,})\\s?(.*)", "\\2", string)
# output
> part1
[1] "Le jour" "Les"
> part2
[1] "Le la nuit" "jours les nuits"
For the first part, it works exactly as desired. However, this is not the case for the second part: part2[1] should be la nuit instead of Le la nuit.
I am not sure how to achieve this and would be thankful for some guidance.
Not sure what you really want but per your requirements, you could use
^(.{3,}?)(?:(?<!,)\\s)+(.*)
This says:
^               # start of the string
(.{3,}?)        # capture 3+ characters lazily, up to...
(?:(?<!,)\\s)+  # 1+ whitespaces that must not be preceded by a comma
(.*)            # capture the rest of the string
In R:
string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits")
(part1 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\1", string, perl = T))
(part2 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\2", string, perl = T))
Yielding
[1] "Le jour" "Les" "les, jours"
and
[1] "la nuit" "jours les nuits" "les nuits"
Maybe you need a data frame as a result; if so, you could define yourself a little function (using sapply() and some logic):
make_df <- function(text) {
  parts <- sapply(text, function(x) {
    m <- regexec("^(.{3,}?)(?:(?<!,)\\s)+(.*)", x, perl = T)
    groups <- regmatches(x, m)
    c(groups[[1]][2], groups[[1]][3])
  }, USE.NAMES = F)
  (setNames(as.data.frame(t(parts), stringsAsFactors = F), c("part1", "part2")))
}
(df <- make_df(string))
This would yield for string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits", "somejunk"):
part1 part2
1 Le jour la nuit
2 Les jours les nuits
3 les, jours les nuits
4 <NA> <NA>

Split Text String in R

I have a large list in R with more than 5000 elements. The elements are of the form:
$`/home/ricardo/MultiClass/data//Execucao_PUBLICACAO_DECISAO_INTERLOCUTORIA_DETERMINACAO_DE_PAGAMENTO/1117.txt.V1
[1] DATA DE DISPONIBILIZACAO DA PUBLICACAO PELA FONTE OFICIAL: 16/11/2016 Pag 4279 Decisao Processo N RTOrd-0122200-90.2006.5.15.0087 <truncated>`
I would like to transform this into a two-column data frame where:
c1: the contents between $ and [1]
c2: the rest of the text
How can I do this split? It is important to note that the number of strings between $ and [1] can change, and the strings $, [ and ] can appear in the rest of the text.
Thanks in advance,
Ricardo.
library(stringr)
string <- '$/home/ricardo/MultiClass/data//Execucao_PUBLICACAO_DECISAO_INTERLOCUTORIA_DETERMINACAO_DE_PAGAMENTO/1117.txt.V1 [1] DATA DE DISPONIBILIZACAO DA PUBLICACAO PELA FONTE OFICIAL: 16/11/2016 Pag 4279 Decisao Processo N RTOrd-0122200-90.2006.5.15.0087'
c1 <- str_match(string = string, pattern = "^\\$(.*) \\[1\\] (.*)")[,2]
c2 <- str_match(string = string, pattern = "^\\$(.*) \\[1\\] (.*)")[,3]
The $ ... text is the name of the list element, and the [1] ... is the value of that element. You can extract these (or better yet, assign them correctly when reading in your data).
a <- list(`this is the name` = "data stored in that variable")
a
#> $`this is the name`
#> [1] "data stored in that variable"
names(a)
#> [1] "this is the name"
as.character(a)
#> [1] "data stored in that variable"

Paste string together based on author

I want to merge/paste (paste(c(...), collapse = " ")) strings in a data frame based on the value (author) in a different column. I am looking for an efficient way to do it.
df <- data.frame(author = c("Shakespeare",
                            "Dante",
                            "Proust",
                            "Shakespeare",
                            "Dante",
                            "Proust",
                            "Shakespeare"),
                 text = c("Put the wild waters in this roar, allay them",
                          "Ma tu perche' ritorni a tanta noia?",
                          "Longtemps, je me suis couché de bonne heure",
                          "The very virtue of compassion in thee",
                          "Pensa oramai qual fu colui che degno",
                          "Quelle horreur! me disais-je",
                          "She said thou wast my daughter; and thy father"))
And the end result should be
result <- c("Put the wild waters in this roar, allay them The very virtue of compassion in thee She said thou wast my daughter; and thy father",
"Ma tu perche' ritorni a tanta noia? Pensa oramai qual fu colui che degno",
"Longtemps, je me suis couché de bonne heure Quelle horreur! me disais-je")
names(result) <- c("Shakespeare","Dante","Proust")
result
# Shakespeare
# "Put the wild waters in this roar, allay them The very virtue of compassion in thee She said thou wast my daughter; and thy father"
# Dante
# "Ma tu perche' ritorni a tanta noia? Pensa oramai qual fu colui che degno"
# Proust
# "Longtemps, je me suis couché de bonne heure Quelle horreur! me disais-je"
I guess I should somehow use some function from the apply family. Something like
apply( df[??? , 2 , paste , collapse = " " )
but I am not sure how to pass the condition and then get, as the result, the name of the author that each pasted string corresponds to...
tapply works more or less exactly as you expected:
tapply(df$text, df$author, paste, collapse = " ")
A more en vogue solution would be to use dplyr
library(dplyr)
df %>% group_by(author) %>% summarize(passage = paste(text, collapse = " "))
