Speed of Cleaning Text in R using a Dictionary

I currently have a list of misspellings and a list of corrections, indexed with a 1 to 1 relationship.
These corrections are specific to the work I am doing so I cannot use existing spelling correction packages.
Given a list of strings which I want to apply these corrections to, I have the following code:
for (i in 1:n) {
  new_text <- gsub(match[i], dict[i], new_text)
  new_text <- gsub('[[:punct:]]', '', new_text)
}
Although this gives the results I want, it takes most of the day to run.
I cannot figure out how to use apply functions because the operations happen in a specific order on the same object.
Is there anything else I can try to speed this up?
Edit: This is the very small test set I have put together to benchmark performance.
match <- c("\\b(abouta|aobut|bout|abot|abotu)\\b",
           "\\b(avdised|advisd|advized|advsied)\\b",
           "\\b(posible|possibl)\\b",
           "\\b(replacment|repalcement|replacemnt|replcement|rplacement)\\b",
           "\\b(tommorrow|tomorow|tommorow|tomorro|tommoro)\\b")
dict <- c('about', 'advised', 'possible', 'replacement', 'tomorrow')
new_text <- c('be advisd replacment coming tomorow',
              'did you get the email aobut the repalcement tomorro',
              'the customer has been avdised of a posible replacement',
              'there is a replacement coming tomorrow',
              'what time tommorow is the replacment coming')
n <- 5
Running my current code 1,000 times on this data gives an elapsed time of 0.424 seconds.
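For reference, a minimal sketch of how that timing might be reproduced (the exact benchmarking code is not shown in the question), wrapping the loop above in a function and repeating it 1,000 times on the test data:
# Hypothetical benchmark of the current approach on the test set above
clean_text <- function(txt) {
  for (i in 1:n) {
    txt <- gsub(match[i], dict[i], txt)
    txt <- gsub('[[:punct:]]', '', txt)
  }
  txt
}
system.time(for (k in 1:1000) clean_text(new_text))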

Try the corpus library, using a custom stemmer. The library lets you provide an arbitrary stemmer function. In your case you would use something like the following for your stemmer:
library(corpus)
dict <- strsplit(split = "\\|",
                 c("about"       = "abouta|aobut|bout|abot|abotu",
                   "advised"     = "avdised|advisd|advized|advsied",
                   "possible"    = "posible|possibl",
                   "replacement" = "replacment|repalcement|replacemnt|replcement|rplacement",
                   "tomorrow"    = "tommorrow|tomorow|tommorow|tomorro|tommoro"))
my_stemmer <- new_stemmer(unlist(dict), rep(names(dict), lengths(dict)))
Then, you can either pass this function as the stemmer argument to any function expecting text, or else you can create a corpus_text object with the stemmer attribute (as part of its token_filter that defines how text gets transformed to tokens):
new_text <- c('be advisd replacment coming tomorow',
              'did you get the email aobut the repalcement tomorro',
              'the customer has been avdised of a posible replacement',
              'there is a replacement coming tomorrow',
              'what time tommorow is the replacment coming')
Use term_stats to count (stemmed) token occurrences:
text <- as_corpus_text(new_text, stemmer = my_stemmer, drop_punct = TRUE)
term_stats(text)
#> term count support
#> 1 replacement 5 5
#> 2 tomorrow 4 4
#> 3 the 4 3
#> 4 coming 3 3
#> 5 a 2 2
#> 6 advised 2 2
#> 7 is 2 2
#> 8 about 1 1
#> 9 be 1 1
#> 10 been 1 1
#> 11 customer 1 1
#> 12 did 1 1
#> 13 email 1 1
#> 14 get 1 1
#> 15 has 1 1
#> 16 of 1 1
#> 17 possible 1 1
#> 18 there 1 1
#> 19 time 1 1
#> 20 what 1 1
#> ⋮ (21 rows total)
Use text_locate to find instances of (stemmed) tokens in the original text:
text_locate(text, "replacement")
#>   text                       before  instance      after
#> 1    1                    be advisd  replacment    coming tomorow
#> 2    2   …u get the email aobut the  repalcement   tomorro
#> 3    3    …been avdised of a posible replacement
#> 4    4                   there is a  replacement   coming tomorrow
#> 5    5     what time tommorow is the replacment    coming
The results of the stemming function get cached, so this is all very fast.
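If the goal is the corrected text itself rather than token statistics, a hedged sketch (not part of the original answer) is to pull the tokens back out with text_tokens(), which should apply the same token filter including the stemmer, and paste them back together:
# Hedged sketch: recover the corrected strings from the corpus_text object
corrected <- sapply(text_tokens(text), paste, collapse = " ")
corrected[1]
#> presumably "be advised replacement coming tomorrow"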
More examples at http://corpustext.com/articles/stemmer.html

Related

How to count collocations in quanteda based on grouping variables?

I have been working on identifying and classifying collocations with the quanteda package in R.
For instance, I create a tokens object from a list of documents and apply collocation analysis:
toks <- tokens(text$abstracts)
collocations <- textstat_collocations(toks)
However, as far as I can see, there is no clear method for seeing which collocations are frequent in, or occur in, which document. Even if I apply kwic(toks, pattern = phrase(collocations), selection = 'keep'), the result only includes row ids such as text1, text2, etc.
I would like to group the collocation analysis results based on docvars. Is that possible with quanteda?
It sounds like you wish to tally collocations by document. The output from textstat_collocations() already provides counts for each collocation, but these are for the entire corpus.
So the solution to group by document (or any other variable) is to:
1. Get the collocations using textstat_collocations(). Below, I've done that after removing stopwords and punctuation.
2. Compound the tokens from which the collocations were formed, using tokens_compound(). This converts each collocation sequence into a single token.
3. Form a dfm from the compounded tokens, and use textstat_frequency() to count the compounds by document. This is the slightly trickier part.
Implementation using the built-in inaugural corpus:
library("quanteda")
## Package version: 3.0
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
library("quanteda.textstats")
toks <- data_corpus_inaugural %>%
  tail(10) %>%
  tokens(remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE)
colls <- textstat_collocations(toks)
head(colls)
## collocation count count_nested length lambda z
## 1 let us 34 0 2 6.257000 17.80637
## 2 fellow citizens 14 0 2 6.451738 16.18314
## 3 fellow americans 15 0 2 6.221678 16.16410
## 4 one another 14 0 2 6.592755 14.56082
## 5 god bless 15 0 2 8.628894 13.57027
## 6 united states 12 0 2 9.192044 13.22077
Now we compound them and keep only the collocations, then get the frequencies by document:
dfmat <- tokens_compound(toks, colls, concatenator = " ") %>%
  dfm() %>%
  dfm_keep("* *")
That dfm already contains the counts by document of each collocation, but if you want counts in a data.frame format, with a grouping option, use textstat_frequency(). Here I've only output the top two by document, but if you remove the n = 2 then it will give you the frequencies of all collocations by document.
textstat_frequency(dfmat, groups = docnames(dfmat), n = 2) %>%
  head(10)
## feature frequency rank docfreq group
## 1 nuclear weapons 4 1 1 1985-Reagan
## 2 human freedom 3 2 1 1985-Reagan
## 3 new breeze 4 1 1 1989-Bush
## 4 new engagement 3 2 1 1989-Bush
## 5 let us 7 1 1 1993-Clinton
## 6 fellow americans 4 2 1 1993-Clinton
## 7 let us 6 1 1 1997-Clinton
## 8 new century 6 1 1 1997-Clinton
## 9 nation's promise 2 1 1 2001-Bush
## 10 common good 2 1 1 2001-Bush
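Since the question asks about grouping by docvars rather than by document, a hedged variation (assuming the inaugural corpus docvars, such as Party, carry through to dfmat, as they normally do) is to pass a docvar as the grouping variable:
# Hedged sketch: top collocations per party instead of per document
textstat_frequency(dfmat, groups = docvars(dfmat, "Party"), n = 2)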

R detect zeroes in ts object

Simple question: in R, what's the best way to detect whether there is a zero somewhere in a time series (ts class)? I run X13 (the seasonal package) on hundreds of time series, and I would like to identify those that contain zero values (since multiplicative models don't work when they encounter a zero). If I could detect those series, I could use an IF-THEN-ELSE statement with the proper specs for X13.
Thank you!
You can detect, replace, or delete them:
ts <- ts(0:10)
## Deleting
ts[ts != 0]
#> [1] 1 2 3 4 5 6 7 8 9 10
## Replacing
replace(ts, ts==0, 1)
#> Time Series:
#> Start = 1
#> End = 11
#> Frequency = 1
#> [1] 1 1 2 3 4 5 6 7 8 9 10
## Detecting
any(ts == 0)
#> [1] TRUE
Created on 2020-10-29 by the reprex package (v0.3.0)
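For the X13 use case mentioned in the question, a minimal sketch of the IF-THEN-ELSE idea (assuming the seasonal package's seas() function and its transform.function spec; adapt the specs to your own setup) could look like this:
library(seasonal)
# Hedged sketch: switch off the log/multiplicative transform when the series contains a zero
fit_x13 <- function(x) {
  if (any(x == 0, na.rm = TRUE)) {
    seas(x, transform.function = "none")
  } else {
    seas(x, transform.function = "log")
  }
}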

word_stats function from qdap package application on a dataframe

I have a dataframe, where one column contains strings.
q = data.frame(number=1:2,text=c("The surcingle hung in ribands from my body.", "But a glance will show the fallacy of this idea."))
I want to use the word_stats function on each individual record. Is that possible?
text_statistic <- apply(q,1,word_stats)
This will apply word_stats() row by row and return a list with the results of word_stats() for every row.
You can do it in many ways; lapply() or sapply() apply a function over a list or vector.
# note: this defines a simple word counter, which masks qdap::word_stats()
word_stats <- function(x) {length(unlist(strsplit(x, ' ')))}
sapply(q$text, word_stats)
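If the full qdap output per record is wanted rather than a plain word count, a hedged alternative (assuming qdap::word_stats() accepts a single character string as its text variable) is to loop over the text column directly:
# Hedged sketch: one word_stats() result per row
per_record_stats <- lapply(as.character(q$text), qdap::word_stats)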
Sure, have a look at the grouping.var argument:
dat = data.frame(number=1:2,text=c("The surcingle hung in ribands from my body.", "But a glance will show the fallacy of this idea."))
with(dat, qdap::word_stats(text, number))
## number n.sent n.words n.char n.syl n.poly wps cps sps psps cpw spw pspw n.state p.state n.hapax grow.rate
## 1 2 1 10 38 14 2 10 38 14 2 3.800 1.400 .200 1 1 10 1
## 2 1 1 8 35 12 1 8 35 12 1 4.375 1.500 .125 1 1 8 1

Summing depth data (consecutive rows) in R

How is it possible to sum up consecutive depth data with R?
For instance:
a <- data.frame(label = as.factor(c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood",
                                    "Air","Air","Air","Air","Stone","Stone","Stone","Stone",
                                    "Air","Air","Air","Air","Air","Wood","Wood")),
                depth = as.numeric(c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14)))
The desired output should be something like:
Label Depth
Air 7
Wood 3
Stone 1
First the removal of negative values is done with cummax(), because depth can only increase in this special case. Hence:
label depth
1 Air 1
2 Air 2
3 Air 3
4 Air 3
5 Air 4
6 Air 5
7 Wood 5
8 Wood 5
9 Wood 5
10 Wood 6
11 Wood 8
12 Air 9
13 Air 9
14 Air 9
15 Air 10
16 Stone 10
17 Stone 10
18 Stone 11
19 Stone 11
20 Air 11
21 Air 12
22 Air 12
23 Air 12
24 Air 13
25 Wood 14
26 Wood 14
Now, taking max minus min of the depth within each consecutive run, you would get the following (this is the step I don't know how to do):
label depth
1 Air 4
2 Wood 3
3 Air 1
4 Stone 1
5 Air 2
6 Wood 0
And finally, summing up those max-min values gives the output presented above.
Steps tried to achieve the output:
The first obvious solution, for instance for Air, would be:
diff(cummax(a[a$label=="Air",]$depth))
This solution gets rid of the negative data, which is necessary due to an expected constant increase in depth.
The problem is that the output also takes into account the big jumps between consecutive subsets. Hence, the sum for Air would be 12 instead of 7.
[1] 1 1 0 1 1 4 0 0 1 1 1 0 0 1
Even worse would be a solution with aggregate, e.g.:
aggregate(depth~label, a, FUN=function(x){sum(x>0)})
Note: solutions that filter out big jumps are not what I'm looking for. Sure, you could hard-code a limit, for instance < 2, for the example of Air once again:
sum(diff(cummax(a[a$label=="Air",]$depth))[diff(cummax(a[a$label=="Air",]$depth))<2])
This gives you almost the right result, but it does not work as expected here. I'm pretty sure there is already a function for what I'm looking for, because this is not an uncommon problem for many different tasks.
I guess taking the minimum and maximum value of each set of consecutive rows per material and summing those up would be one possible solution, but I'm not sure how to apply a function to only the consecutive subsets.
You can use data.table::rleid to quickly group by run, or reconstruct the runs with rle if you really like (a base R sketch of the rle approach appears after the dplyr example below). After that, aggregating is fairly easy in any grammar. In dplyr,
library(dplyr)
a <- data.frame(label = c("Air","Air","Air","Air","Air","Air","Wood","Wood","Wood","Wood","Wood",
                          "Air","Air","Air","Air","Stone","Stone","Stone","Stone",
                          "Air","Air","Air","Air","Air","Wood","Wood"),
                depth = c(1,2,3,-1,4,5,4,5,4,6,8,9,8,9,10,9,10,11,10,11,12,10,12,13,14,14))
a2 <- a %>%
  # filter to rows where the previous value is lower, equal, or NA
  filter(depth >= lag(depth) | is.na(lag(depth))) %>%
  # group by label and its run
  group_by(label, run = data.table::rleid(label)) %>%
  summarise(depth = max(depth) - min(depth)) # aggregate
a2 %>% arrange(run) # sort to make it pretty
#> # A tibble: 6 x 3
#> # Groups: label [3]
#> label run depth
#> <fctr> <int> <dbl>
#> 1 Air 1 4
#> 2 Wood 2 3
#> 3 Air 3 1
#> 4 Stone 4 1
#> 5 Air 5 2
#> 6 Wood 6 0
a3 <- a2 %>% summarise(depth = sum(depth)) # a2 is still grouped, so aggregate more
a3
#> # A tibble: 3 x 2
#> label depth
#> <fctr> <dbl>
#> 1 Air 7
#> 2 Stone 1
#> 3 Wood 3
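The rle() reconstruction mentioned at the top of this answer can also be sketched in base R (a hedged alternative to rleid(), reusing the question's cummax() idea):
# Hedged sketch: build run ids with rle(), then aggregate per run and per label
runs <- with(rle(as.character(a$label)), rep(seq_along(lengths), lengths))
gain <- tapply(cummax(a$depth), runs, function(x) diff(range(x)))
lab  <- tapply(as.character(a$label), runs, `[`, 1)
tapply(gain, lab, sum)
# should reproduce the totals from the question: Air 7, Stone 1, Wood 3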
A base R method using aggregate is
aggregate(cbind(val = cummax(a$depth)),
          list(label = a$label, ID = c(0, cumsum(diff(as.integer(a$label)) != 0))),
          function(x) diff(range(x)))
The first argument to aggregate calculates the cumulative maximum of the input vector, as the OP does above; the use of cbind provides a name for the calculated column in the final output. The second argument is the grouping argument; rather than using rle, it identifies each run by taking the cumulative sum of the changes in label. Finally, the third argument provides the function that calculates the desired output by taking the difference of the range within each group.
This returns
label ID val
1 Air 0 4
2 Wood 1 3
3 Air 2 1
4 Stone 3 1
5 Air 4 2
6 Wood 5 0
The data.table way (borrowing in part from #alistaire):
library(data.table)

setDT(a)
a[, depth := cummax(depth)]        # remove the decreases, as in the question
depth_gain <- a[,
  list(
    depth = max(depth) - depth[1], # Only need the starting and max values
    label = label[1]
  ),
  by = rleidv(label)               # one group per run of consecutive labels
]
result <- depth_gain[, list(depth = sum(depth)), by = label]
result                             # should give the per-label totals from the question: Air 7, Wood 3, Stone 1

Converting R data frame with RDS package: recruitment id error?

I am using the RDS package for respondent-driven sampling survey data. I want to convert a regular R data frame to an rds.data.frame. To do so, I have been trying to use the as.rds.data.frame function from RDS.
Here is an excerpted section of my data frame, where the first case (id = 1) is the 'seed' respondent (who has no recruiter). It contains the variables: id (respondent id number), recruit.id (id number of the respondent who recruited him/her), netsize (respondent's network size) and population (estimate of the whole population size).
df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10),
                 recruit.id = c(-1,1,1,2,2,4,5,3,8,3),
                 netsize = c(6,6,6,5,5,4,4,3,4,6),
                 population = rep(22,000, 10))
I then (try to) apply the relevant function:
new.df <- as.rds.data.frame(df,
                            id = df$id,
                            recruiter.id = df$recruit.id,
                            network.size = df$netsize,
                            population.size = df$population,
                            max.coupons = 2)
I get the error message:
Error in as.rds.data.frame(df, id = df$id, recruiter.id = df$recruit.id, : Invalid id
and the warning:
In addition: Warning message:
In if (!(id %in% names(x))) stop("Invalid id") :
  the condition has length > 1 and only the first element will be used
I have tried assigning various recruiter id values to seed participants, including -1, 0, or their own id number, but I still get the same message. I have also tried dropping function arguments (max.coupons, population.size) or deleting the seed respondents, with the same result.
Package documentation says the function will fail if recruitment information is incomplete. As far as I can tell, this is not the case.
I am new to this, so if anyone can point me in the right direction I would be really grateful.
This seems to work:
colnames(df)[2:4] <- c("recruiter.id", "network.size.variable", "population.size")
as.rds.data.frame(df, max.coupons = 2)
Passing the original column names as strings also gives a result, with a warning. (This points at the cause of the original error: the id, recruiter.id, network.size and population.size arguments expect column names, not the column vectors themselves; passing df$id made the internal check if (!(id %in% names(x))) compare a whole vector, hence the "condition has length > 1" warning followed by "Invalid id".)
as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id",
                  network.size = "netsize", population.size = "population",
                  max.coupons = 2)
# An object of class "rds.data.frame"
#id: 1 2 3 4 5 6 7 8 9 10
#recruiter.id: -1 1 1 2 2 4 5 3 8 3
# id recruit.id netsize population
#1 1 -1 6 22
#2 2 1 6 22
#3 3 1 6 22
#4 4 2 5 22
#5 5 2 5 22
#6 6 4 4 22
#7 7 5 4 22
#8 8 3 3 22
#9 9 8 4 22
#10 10 3 6 22
# Warning message:
#In as.rds.data.frame(df, id = "id", recruiter.id = "recruit.id", :
#NAs introduced by coercion
