I want to merge/paste strings (as with paste(c(...), collapse = " ")) in a data frame, grouping them by the value (author) in a different column. I am looking for an efficient way to do it.
df <- data.frame(author = c("Shakespeare",
"Dante",
"Proust",
"Shakespeare",
"Dante",
"Proust",
"Shakespeare"),
text = c("Put the wild waters in this roar, allay them",
"Ma tu perche' ritorni a tanta noia?",
"Longtemps, je me suis couché de bonne heure",
"The very virtue of compassion in thee",
"Pensa oramai qual fu colui che degno",
"Quelle horreur! me disais-je",
"She said thou wast my daughter; and thy father"))
And the end result should be
result <- c("Put the wild waters in this roar, allay them The very virtue of compassion in thee She said thou wast my daughter; and thy father",
"Ma tu perche' ritorni a tanta noia? Pensa oramai qual fu colui che degno",
"Longtemps, je me suis couché de bonne heure Quelle horreur! me disais-je")
names(result) <- c("Shakespeare","Dante","Proust")
result
# Shakespeare
# "Put the wild waters in this roar, allay them The very virtue of compassion in thee She said thou wast my daughter; and thy father"
# Dante
# "Ma tu perche' ritorni a tanta noia? Pensa oramai qual fu colui che degno"
# Proust
# "Longtemps, je me suis couché de bonne heure Quelle horreur! me disais-je"
I guess I should somehow use some function from the apply family. Something like
apply( df[??? , 2 , paste , collapse = " " )
but I am not sure how to pass the grouping condition, or how to then obtain the name of the author to which each pasted string corresponds...
tapply works more or less exactly as you expected; it returns a named character vector, just like your result:
tapply(df$text, df$author, paste, collapse = " ")
A more en vogue solution would be to use dplyr:
library(dplyr)
df %>% group_by(author) %>% summarize(passage = paste(text, collapse = " "))
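If you prefer base R and a data frame result rather than a named vector, aggregate() expresses the same grouping; a minimal sketch:
aggregate(text ~ author, data = df, FUN = paste, collapse = " ")
# one row per author, with the texts pasted together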
I'm trying to perform lemmatization on a corpus, using the function lemmatize_strings() as an argument to tm_map() from the tm package.
But I want to use my own dictionary ("lexico": the first column holds the full word form in lower case, and the second column holds the corresponding replacement lemma).
I tried to use:
corpus <- tm_map(corpus, lemmatize_strings)
But it didn't work...
When I use:
lemmatize_strings(corpus[[1]], dictionary = lexico)
I have no problem!
How can I pass my dictionary "lexico" to the function tm_map()?
Sorry for this question; it's my first attempt at text mining, at the age of 48.
To make this clearer, my corpus is composed of 2000 documents; here is an extract from the first document:
corpus[[1]][[1]]
[9] "..."
[10] "Nos últimos dias da passada legislatura, a maioria de direita aprovou duas leis que significam enormes recuos nos direitos das cidadãs do país. Fizeram tábua rasa do pronunciamento das cidadãs e cidadãos do país em referendo, optando por humilhar e tentar culpabilizar as mulheres que abortam por sua livre escolha. Estas duas leis são a Lei n.º 134/2015 e a Lei n.º 136/2015, de setembro. A primeira prevê o pagamento de taxas moderadoras na interrupção de gravidez quando for realizada, por opção da mulher, nas primeiras 10 semanas de gravidez. A segunda representa a primeira alteração à Lei n.º 16/2007, de 17 de abril, sobre exclusão de ilicitude nos casos de interrupção voluntária da gravidez."
I then worked on a dictionary file (lexico) with this structure:
lexico[1:10,]
termo lema pos.tag
1 aa a NCMP000
2 aais aal NCMP000
3 aal aal NCMS000
4 aaleniano aaleniano NCMS000
5 aalenianos aaleniano NCMP000
6 ab-rogação ab-rogação NCFS000
7 ab-rogações ab-rogação NCFP000
8 ab-rogamento ab-rogamento NCMS000
9 ab-rogamentos ab-rogamento NCMP000
10 ab-rogáveis ab-rogável AQ0CP0
When I use the function lemmatize_strings(corpus[[1]], dictionary = lexico), it works correctly and gives document nº 1 of the corpus lemmatized with lemmas from my dictionary.
The problem that I have, is with this function:
> corpus<-tm_map(corpus, lemmatize_strings, dictionary = lexico)
Warning messages:
1: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
2: In stringi::stri_extract_all_regex(x, numreg) :
argument is not an atomic vector; coercing
> corpus[[1]][[1]]
[1] ""
This simply destroys all the documents in my corpus:
> corpus
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2000
Thanks in advance for your reply!
You could, for example, use the quanteda package for this:
library("quanteda")
text <- "This is a test sentence. We can lemmatize it using quanteda."
dict <- data.frame(
word = c("is", "using"),
lemma = c("be", "use"),
stringsAsFactors = FALSE
)
toks <- tokens(text, remove_punct = TRUE)
toks_lemma <- tokens_replace(toks,
pattern = dict$word,
replacement = dict$lemma,
case_insensitive = TRUE,
valuetype = "fixed")
toks_lemma
tokens from 1 document.
text1 :
[1] "This" "be" "a" "test" "sentence" "We" "can" "lemmatize"
[9] "it" "use" "quanteda"
The function is very fast and, despite its name, was mainly created for lemmatization.
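Connecting this back to your corpus and "lexico": a minimal sketch, assuming corpus is the tm VCorpus from the question and using the termo/lema columns shown above (the flattening step is an assumption; multi-line documents are pasted into one string each):
# Flatten each tm document to a single string, then tokenize and
# replace word forms ("termo") with their lemmas ("lema").
corpus_texts <- vapply(corpus, function(d) paste(as.character(d), collapse = " "),
                       character(1))
toks <- tokens(corpus_texts, remove_punct = TRUE)
toks_lemma <- tokens_replace(toks,
                             pattern = lexico$termo,
                             replacement = lexico$lema,
                             case_insensitive = TRUE,
                             valuetype = "fixed")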
I am trying to split strings at the first whitespace that comes after 3 characters. Here is my code:
string <- c("Le jour la nuit", "Les jours les nuits")
part1 <- sub("(\\S{3,})\\s?(.*)", "\\1", string)
part2 <- sub("(\\S{3,})\\s?(.*)", "\\2", string)
# output
> part1
[1] "Le jour" "Les"
> part2
[1] "Le la nuit" "jours les nuits"
For the first part, it works exactly as desired. However, that is not the case for the second part: part2[1] should be "la nuit" instead of "Le la nuit".
I am not sure how to achieve this and would be thankful for some guidance.
Not sure what you really want, but per your requirements you could use
^(.{3,}?)(?:(?<!,)\\s)+(.*)
This says:
^ # start of the string
(.{3,}?) # capture 3+ characters lazily, up to...
(?:(?<!,)\\s)+ # 1+ whitespaces that must not be preceded by a comma
(.*) # capture the rest of the string
In R:
string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits")
(part1 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\1", string, perl = T))
(part2 <- sub("^(.{3,}?)(?:(?<!,)\\s)+(.*)", "\\2", string, perl = T))
Yielding
[1] "Le jour" "Les" "les, jours"
and
[1] "la nuit" "jours les nuits" "les nuits"
Maybe you need a data frame as a result; if so, you could define yourself a little function (using sapply and some logic):
make_df <- function(text) {
parts <- sapply(text, function(x) {
m <- regexec("^(.{3,}?)(?:(?<!,)\\s)+(.*)", x, perl = T)
groups <- regmatches(x, m)
c(groups[[1]][2], groups[[1]][3])
}, USE.NAMES = F)
(setNames(as.data.frame(t(parts), stringsAsFactors = F), c("part1", "part2")))
}
(df <- make_df(string))
For string <- c("Le jour la nuit", "Les jours les nuits", "les, jours les nuits", "somejunk"), this would yield:
part1 part2
1 Le jour la nuit
2 Les jours les nuits
3 les, jours les nuits
4 <NA> <NA>
I used pdftools to convert some pdf documents to txt. This is part of the output (it's not so bad):
REPÚBLICA DE CHILE PADRON ELECTORAL AUDITADO ELECCIONES PRESIDENCIAL, PARLAMENTARIAS y de CONSEJEROS REGIONALES 2017 REGISTROS: 2.421
SERVICIO ELECTORAL REGIÓN : ARICA Y PARINACOTA COMUNA: GENERAL LAGOS PÁGINA 1 de 38
PROVINCIA : PARINACOTA
NOMBRE C.IDENTIDAD SEXO DOMICILIO ELECTORAL CIRCUNSCRIPCIÓN MESA
AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M
AGUILERA ZENTENO PATRICIA ALEJANDRA 16.223.938-4 MUJ PUEBLO DE GUACOLLO S N CERCANO A GENERAL LAGOS 5M
AGUIRRE CHOQUE MARCOS JULIO 15.000.385-7 VAR CIRCUNSCRIPCION
CALLE TORREALBA DE VISVIRI
CASA N° 4 PUEBLO DE VISVIRI GENERAL LAGOS 7V
So I'm doing this to clean it up and convert it into a formatted TSV:
library(readr) # read_lines() comes from readr
test = read_lines("file.txt")
test2 = test[!grepl("REP\u00daBLICA",test)]
test2 = test2[!grepl("SERVICIO",test2)]
test2 = test2[!grepl("NOMBRE",test2)]
test2 = test2[!grepl("PROVINCIA",test2)]
test2 = gsub("\\.", "", test2)
test2 = gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", test2, perl=TRUE)
and the output is:
ABRIGO PIZARRO PATRICIO ESTEBAN 16024716-9 VAR PUEB ALCERRECA GENERAL LAGOS 5V
ABURTO VELASCO ESTHER MARISOL 13005517-6 MUJ VILLA INDUSTRIAL GENERAL LAGOS 2M
ACEVEDO MONTT SEBASTIAN ANDRES 17829470-9 VAR CALLE RAFAEL TORREALBA N° 3 PUEBLO DE VISVIRI GENERAL LAGOS 3V
ACHILLO BLAS ADOLFO ARTURO 13008044-8 VAR VISURI GENERAL LAGOS 7V
I've read some posts and I'm not sure how to implement:
(1) Something like gsub("(?<=[\\s+])[0-9]", "\t", test2, perl=TRUE), meant to replace multiple spaces followed by a number with a tab followed by that number.
(2) How to move broken lines to the end of the previous line, such as line 8 in the sample above, which starts with multiple spaces.
Fixing (1) and (2) would return this:
ABRIGO PIZARRO PATRICIO ESTEBAN \t 16024716-9 \t VAR \t PUEB ALCERRECA \t GENERAL LAGOS \t 5V
ABURTO VELASCO ESTHER MARISOL \t 13005517-6 \t MUJ \t VILLA INDUSTRIAL \t GENERAL LAGOS \t 2M
(1) You can use the words "VAR" and "MUJ" as keywords for splitting:
x <- "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA 13.638.826-6 MUJ PUEBLO DE TACORA S N VISVIRI GENERAL LAGOS 4M"
strsplit(x, "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s", perl = TRUE)
The result is:
[[1]]
[1] "AGUILERA SIMPERTIGUE JUDITH ALEJANDRA" "13.638.826-6" "MUJ"
[4] "PUEBLO DE TACORA S N VISVIRI" "GENERAL LAGOS" "4M"
Maybe not the most elegant solution, but it works, and if you can modify the data you could use real keywords and ensure they are unique.
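And if the end goal is the tab-separated format from the question, the split pieces can simply be pasted back together (reusing the pattern above):
pat <- "\\s{2,}|\\s(?=\\bMUJ\\b)|(?<=\\bMUJ\\b)\\s|\\s(?=\\bVAR\\b)|(?<=\\bVAR\\b)\\s"
# one TSV line per record
vapply(strsplit(x, pat, perl = TRUE), paste, character(1), collapse = "\t")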
(2) An easy solution would be to check each row's length and move values up into the previous line when a row is too short, as sketched below.
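A minimal sketch of that idea, assuming every complete record contains an ID such as 16024716-9 once the dots are stripped, and that the first line is a complete record (the ID pattern is an assumption for illustration):
# Lines without an ID are treated as broken continuations: append each
# one to the line above it (working bottom-up so runs of continuations
# collapse correctly), then drop the continuation lines.
has_id <- grepl("\\d{7,8}-[0-9Kk]", test2)
for (i in rev(which(!has_id))) {
  test2[i - 1] <- paste(test2[i - 1], test2[i])
}
test2 <- test2[has_id]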
I have a table with text like:
tt<-data.frame(a=c("esta es la unica lista que voy a hacer","esta es la 2da unica"))
I need to keep only the words that have more than 3 characters:
tt<-data.frame(a=c("esta unica lista hacer","esta unica"))
In this case I have no clue how to do it. I suppose I need nchar plus a loop over the table and, inside it, another loop over the words.
Using the data.table package:
library(data.table)
setDT(tt)
tt[,a:=gsub("\\s+"," ",gsub("\\b\\w{1,3}\\b","",a))]
a
1: esta unica lista hacer
2: esta unica
Another option, depending on exactly the output you want, is:
library(data.table) #1.9.5+
tt[,tstrsplit(gsub("\\b\\w{1,3}\\b","",a),split="\\s+")]
V1 V2 V3 V4
1: esta unica lista hacer
2: esta unica NA NA
Edit: After much tussling at the encouragement of @rawr, here is a way to get at the problem more directly (keep words of 4+ letters instead of excluding words of 1-3 letters):
tt[,a:=lapply(regmatches(a, gregexpr('\\b\\w{4,}\\b',a)),paste0,collapse=" ")]
It's not too tricky if you break it into chunks. First use lapply to iterate over each entry of the text column. Then for each entry, split the string into words, select the long ones, paste them back into a single string, and return the result:
tt<-data.frame(a=c("esta es la unica lista que voy a hacer","esta es la 2da unica"))
library(stringr)
tt$a <- lapply(tt$a, function(x) {
  l <- unlist(str_split(x, " "))     # break the string into words
  t <- l[which(nchar(l) > 3)]        # keep only words longer than 3 characters
  return(paste0(t, collapse = " "))  # paste back into a single string
})
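One caveat: lapply() returns a list, so tt$a becomes a list column; unlist it (or use sapply() in the first place) if you want a plain character column back:
tt$a <- unlist(tt$a)  # flatten the list column back to character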
Here is another approach using the qdapRegex package.
library(qdapRegex)
tt <- data.frame(a = c('esta es la unica lista que voy a hacer', 'esta es la 2da unica'))
tt$a <- rm_nchar_words(tt$a, 1, pattern = '\\b\\w{1,3}\\b')
tt
# a
# 1 esta unica lista hacer
# 2 esta unica
Here's a solution using the quanteda package, which tokenizes the texts in your data.frame and removes the tokens whose length is <= 3. Note that I have specified stringsAsFactors = FALSE in the data.frame(), although this would work equally well if you were operating directly on a character vector.
require(quanteda)
tt <- data.frame(a=c("esta es la unica lista que voy a hacer", "esta es la 2da unica"),
stringsAsFactors = FALSE)
ttTokenized <- tokenize(tt$a)
(ttTokenized <- sapply(ttTokenized, function(x) x[nchar(x) > 3]))
## [[1]]
## [1] "esta" "unica" "lista" "hacer"
##
## [[2]]
## [1] "esta" "unica"
If you want the original-looking texts rather than the tokenised versions, then use this additional step:
sapply(ttTokenized, paste, collapse = " ")
## [1] "esta unica lista hacer" "esta unica"
I have used adist to calculate the number of characters that differ between two strings:
a <- "#IvoryCoast TENNIS US OPEN Clément «Un beau combat» entre Simon et Cilic"
b <- "Clément «Un beau combat» entre Simon et Cilic"
adist(a,b) # result 27
Now I would like to extract all the occurrences of those characters that differ. In my example, I would like to get the string "#IvoryCoast TENNIS US OPEN ".
I tried and used:
paste(Reduce(setdiff, strsplit(c(a, b), split = "")), collapse = "")
But the obtained result is not what I expected!
#IvysTENOP
For this case, you could use gsub:
> a <- "#IvoryCoast TENNIS US OPEN Clément «Un beau combat» entre Simon et Cilic"
> b <- "Clément «Un beau combat» entre Simon et Cilic"
> gsub(b, "", a)
[1] "#IvoryCoast TENNIS US OPEN "
Building on your paste/Reduce attempt, you can split on spaces instead of individual characters:
paste(Reduce(setdiff, strsplit(c(a, b), split = " ")), collapse = " ")
#[1] "#IvoryCoast TENNIS US OPEN"
Or, if you want the items separately, use setdiff and strsplit (note that setdiff returns unique elements, so a repeated differing word will appear only once):
setdiff(strsplit(a," ")[[1]],strsplit(b," ")[[1]])
#[1] "#IvoryCoast" "TENNIS" "US" "OPEN"