I have two problems with a combinatorics/permutation task in R.
I am generating combinations of letters, and I want to validate all resulting combinations against a German dictionary.
letters <- c("a","g","t")
sum_3 <- expand.grid(letters, letters, letters)
Ideally, this would indicate that the word "tag" (English: day) will be found, as it exists in a German dictionary. How do I connect/load/validate the results against a German dictionary?
If I increase the number of letters and the size of the expand.grid to some n, this becomes too memory intensive at some point.
Is there any way to slice the expand.grid step into multiple segments in order to avoid running out of memory?
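For illustration, a sketch of what I have in mind (german.dic is a placeholder for an actual one-word-per-line word list; I also rename the vector so it does not mask R's built-in letters):
dict <- tolower(readLines("german.dic", encoding = "UTF-8"))  # hypothetical word list
my_letters <- c("a", "g", "t")
n <- 3
found <- character(0)
for (first in my_letters) {
  # build only the (n-1)-column grid for this slice, keeping memory bounded
  grid <- expand.grid(rep(list(my_letters), n - 1), stringsAsFactors = FALSE)
  words <- paste0(first, do.call(paste0, grid))
  found <- c(found, words[words %in% dict])
}
unique(found)  # should include "tag" if the dictionary contains it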
Related
Task
I am attempting to use better functionality (a loop or vectorised code) to split a larger list into 26 (maybe 27) smaller lists based on each letter of the alphabet (i.e. the first list contains all entries of the larger list that start with the letter A, the second those starting with B, and so on; a possible 27th list would contain all remaining entries that start with numbers or other characters).
I am then attempting to identify which names on the list are similar using the adist function (for instance, I need to correct company names that are misspelled, e.g. "Companyy A" needs to be corrected to "Company A").
Code thus far
#creates a vector for all uniqueID/stakeholders whose name starts with "a" or "A"
stakeA <- grep("^[aA].*", uniqueID, value=TRUE)
#creates a distance matrix for all stakeholders whose name starts with "a" or "A"
stakeAdist <- adist(stakeA, ignore.case=TRUE)
write.table(stakeAdist, "test.csv", quote=TRUE, sep = ",", row.names=stakeA, col.names=stakeA)
Explanation
I was able to complete the first step of my task using the above code; I created a list of all the entries that begin with the letter A and then calculated the "distance" between each pair of entries (returned as a matrix).
Ask One
I can copy and paste this code 26 times and work my way through the alphabet, but I figure there is likely a more elegant way to do this, and I would like to learn it!
Ask Two
To "correct" the entries, thus far I have resorted to writing a table and moving to Excel. In Excel I have to insert a row entry to have the matrix properly align (I suppose this is a small flaw in my code). To correct the entries, I use conditional formatting to highlight all instances where adist is between say 1 and 10 and then have to manually go through the highlights and correct the lists.
Any help on functions / methods to further automate this / better strategies using R would be great.
It would help to have an example of your data, but this might work.
EDIT: I am assuming your data is in a data.frame named df
for(i in 1:26) {
  # extract all uniqueIDs that start with this letter, in either case
  stake <- grep(paste0('^[', letters[i], LETTERS[i], '].*'), df$uniqueID, value=T)
  stakeDist <- adist(stake, ignore.case=T)
  write.table(stakeDist, paste0("stake_", LETTERS[i], ".csv"), quote=T, sep=',',
              row.names=stake, col.names=stake)
}
Using a combination of paste0 and the built-in letters and LETTERS, this creates your grep expression.
Using grep() with value=TRUE, the matching IDs are extracted.
paste0 also creates a unique filename for write.table(), and row.names/col.names keep the matrix aligned with the names.
And it is all tied together using a for()-loop.
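For Ask Two, a hedged sketch (not part of the answer above): rather than conditional formatting in Excel, you can pull the near-matching pairs straight out of each distance matrix with which(..., arr.ind = TRUE):
# using stake and stakeDist from one iteration of the loop above
hits <- which(stakeDist >= 1 & stakeDist <= 10, arr.ind = TRUE)
hits <- hits[hits[, 1] < hits[, 2], , drop = FALSE]  # keep each pair only once
data.frame(name1 = stake[hits[, 1]], name2 = stake[hits[, 2]], dist = stakeDist[hits])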
I have created a list (based on items in a column) in order to subset my dataset into smaller datasets relating to a particular variable. This list contains strings with hyphens (-) in them.
dim.list <- c('Age_CareContactDate-Gender', 'Age_CareContactDate-Group',
'Age_ServiceReferralReceivedDate-Gender',
'Age_ServiceReferralReceivedDate-Gender-0-18',
'Age_ServiceReferralReceivedDate-Group',
'Age_ServiceReferralReceivedDate-Group-ReferralReason')
I have then written some code to loop through each item in this list subsetting my main data.
for (i in dim.list) {
  assign(paste("df1.", i, sep=""), df[df$Dimension==i, ])
}
This works fine; however, when I come to aggregate the results in order to get some summary statistics, I can't reference the datasets, as R stops reading the name after the hyphen (I assume the hyphen is some special character).
If I use a different list without hyphens, e.g.
dim.list.abr <- c('ACCD_Gen','ACCD_Grp',
'ASRRD_Gen',
'ASRRD_Gen_0_18',
'ASRRD_Grp',
'ASRRD_Grp_RefRsn')
When my for loop above executes, I get 6 data.frames with no observations.
Why is this happening?
Comment to answer:
Hyphens aren't allowed in standard variable names. Think of a simple example: a-b. Is it a variable name with a hyphen, or is it a minus b? The R interpreter assumes a minus b, because it doesn't require spaces around binary operators. You can force non-standard names to work using backticks, e.g.,
# terribly confusing names:
`a-b` <- 5
`x+y` <- 10
`mean(x^2)` <- "this is awful"
but you're better off following the rules and using standard names without special characters like + - * / % $ # ! & | ^ ( [ ' " in them. At ?Quotes there is a section on Names and Identifiers:
Identifiers consist of a sequence of letters, digits, the period (.) and the underscore. They must not start with a digit nor underscore, nor with a period followed by a digit. Reserved words are not valid identifiers.
So that's why you're getting an error, but what you're doing isn't good practice. I completely agree with Axeman's comments: use split to divide up your data frame into a list, and keep it in a list rather than using assign; it will be much easier to loop over or lapply across that way. You might want to read my answer at How to make a list of data frames for a lot of discussion and examples.
Regarding your comment "dim.list is not the complete set of unique entries in the Dimensions column", that just means you need to subset before you split:
nice_list = df[df$Dimension %in% dim.list, ]
nice_list = split(nice_list, nice_list$Dimension)
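Keeping the subsets in a list means the aggregation step needs no non-standard names at all; a quick sketch, where Value is a placeholder for whatever column you are summarising:
# one set of summary statistics per Dimension; list names come from dim.list
lapply(nice_list, function(d) summary(d$Value))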
There must be a simple answer to this, but I'm new to regex and couldn't find one.
I have a dataframe (df) with text strings arranged in a column vector of length n (df$text). Each of the texts in this column is interspersed with parenthetical phrases. I can identify these phrases using:
regmatches(df$text, gregexpr("(?<=\\().*?(?=\\))", df$text, perl=T))[[1]]
The code above returns all text between parentheses. However, I'm only interested in parenthetical phrases that contain 'v.' in the format 'x v. y', where x and y are any number of characters (including spaces) between the parentheses; for example, '(State of Arkansas v. John Doe)'. Matching phrases (court cases) are always of this format: open parentheses, word beginning with capital letter, possible spaces and other words, v., another word beginning with a capital letter, and possibly more spaces and words, close parentheses.
I'd then like to create a new column containing counts of x v. y phrases in each row.
Bonus if there's a way to do this separately for the same phrases denoted by italics rather than enclosed in parentheses: State of Arkansas v. John Doe (but perhaps this should be posed as a separate question).
Thanks for helping a newbie!
I believe I have figured out what you want, but it is hard to tell without example data. I have made an example data frame to work with. If it is not what you are going for, please give an example.
df <- data.frame(text = c("(Roe v. Wade) is not about boats",
"(Dred Scott v. Sandford) and (Plessy v. Ferguson) have not stood the test of time",
"I am trying to confuse you (this is not a court case)",
"this one is also confusing (But with Capital Letters)",
"this is confusing (With Capitols and v. d)"),
stringsAsFactors = FALSE)
The regular expression I think you want is:
cases <- regmatches(df$text, gregexpr("(?<=\\()([[:upper:]].*? v\\. [[:upper:]].*?)(?=\\))",
df$text, perl=T))
You can then get the number of cases and add it to your data frame with:
df$numCases <- vapply(cases, length, numeric(1))
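If I have run this correctly, the example data above should give:
df$numCases
# [1] 1 2 0 0 0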
As for italics, I would really need an example of your data. Usually that kind of formatting isn't stored when you read a string into R, so the italics effectively don't exist anymore.
Change your regex as below:
regmatches(df$text, gregexpr("(?<=\\()[^()]*\\sv\\.\\s[^()]*(?=\\))", df$text, perl=T))[[1]]
I want to analyze a big (n = 500,000) corpus of documents. I am using quanteda in the expectation that it will be faster than tm_map() from tm. I want to proceed step by step instead of using the automated way with dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, as this would result in many useless bigrams; in another, I have to preprocess the text with language-specific procedures.
I would like this sequence to be implemented:
1) remove the punctuation and numbers
2) remove stopwords (i.e. before the tokenization to avoid useless tokens)
3) tokenize using unigrams and bigrams
4) create the dfm
My attempt:
> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))
> class(text.corpus)
[1] "corpus" "list"
> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") :
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"
# This is how I would theoretically continue:
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))
Bonus question
How do I remove sparse tokens in quanteda? (i.e. the equivalent of removeSparseTerms() in tm).
UPDATE
In light of @Ken's answer, here is the code to proceed step by step with quanteda:
library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’
1) Remove custom punctuation and numbers. E.g. notice the "\n" in the ie2010 corpus:
text.corpus <- ie2010Corpus
texts(text.corpus)[1] # Use texts() to extrapolate text
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is
texts(text.corpus)[1] <- gsub("\\s"," ",text.corpus[1]) # replace all whitespace (incl. \n, \t, \r ...) with a single space
texts(text.corpus)[1]
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
A further note on why one may prefer to preprocess: my present corpus is in Italian, a language in which articles are attached to the following word with an apostrophe. Thus, the straight dfm() can lead to inexact tokenization.
e.g.:
broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi")), removePunct=TRUE)
will produce two separate tokens ("un'abile" and "l'abile") for what is the same word, hence the need for an additional step with gsub() (see the sketch after the next point).
2) In quanteda it is not possible to remove stopwords directly from the text before tokenization. In my previous example, "l" and "un" have to be removed so as not to produce misleading bigrams. In tm this can be handled with tm_map(..., removeWords).
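A hedged workaround for both points (my sketch, not tested on the full corpus): split at the apostrophe before tokenising, so the article becomes a token of its own that can then be dropped like any stopword:
texts(text.corpus) <- gsub("'", "' ", texts(text.corpus))  # "L'abile" -> "L' abile"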
3) Tokenization
token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)
4) Create the dfm:
dfm <- dfm(token)
5) Remove sparse features
dfm <- trim(dfm, minCount = 5)
We designed dfm() not as a "black box" but more as a Swiss army knife that combines many of the options that typical users want to apply when converting their texts to a matrix of documents and features. However all of these options are also available through lower-level processing commands, should you wish to exert a finer level of control.
However one of the design principles of quanteda is that text only becomes "features" through the process of tokenisation. If you have a set of tokenised features that you wish to exclude, you must first tokenise your text, or you cannot exclude them. Unlike other text packages for R (e.g. tm), these steps are applied "downstream" from a corpus, so that the corpus remains an unprocessed set of texts to which manipulations will be applied (but will not itself be a transformed set of texts). The purpose of this is to preserve generality but also to promote reproducibility and transparency in text analysis.
In response to your questions:
You can, however, override our encouraged behaviour using the texts(myCorpus) <- replacement function, where what is assigned to the texts will override the existing texts. So you could use regular expressions to remove punctuation and numbers -- for example, with stringi commands, using the Unicode classes for punctuation and numerals to identify patterns.
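For example (a sketch, assuming the stringi package is installed), using the Unicode classes \p{P} (punctuation) and \p{N} (numerals):
library(stringi)
texts(myCorpus) <- stri_replace_all_regex(texts(myCorpus), "[\\p{P}\\p{N}]+", " ")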
I would recommend you tokenise before removing stopwords. Stop "words" are tokens, so there is no way to remove them from the text before you tokenise it. Even applying regular expressions to substitute them with "" involves specifying some form of word boundary in the regex -- again, this is tokenisation.
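In the current API that sequence looks roughly like this (treat it as a sketch; the function names differ in 0.9.x):
toks <- tokens(myCorpus, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("english"))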
To tokenise into unigrams and bigrams:
tokens(myCorpus, ngrams = 1:2)
To create the dfm, simply call dfm(myTokens). (You could also have applied step 3, for ngrams, at this stage.)
Bonus 1: n=2 collocations produces the same list as bigrams, except in a different format. Did you intend something else? (Separate SO question perhaps?)
Bonus 2: See dfm_trim(x, sparsity = ). The removeSparseTerms() options are quite confusing to most people, but this is included for migrants from tm; see this post for a full explanation.
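For instance (argument names have shifted across quanteda versions, so check your installed version):
myDfm <- dfm(myTokens)
dfm_trim(myDfm, min_termfreq = 5)   # keep features occurring at least 5 times
dfm_trim(myDfm, sparsity = 0.99)    # roughly removeSparseTerms(x, 0.99) in tm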
BTW: Use texts() instead of ie2010Corpus$documents$texts -- we will rewrite the object structure of a corpus soon, so you should not access its internals this way when there is an extractor function. (Also, this step is unnecessary - here you have simply recreated the corpus.)
Update 2018-01:
The new name for the corpus object is data_corpus_irishbudget2010, and the collocation scoring function is textstat_collocations().
Is it possible to add or retain one or more leading zeros to a number without the result being converted to character? Every solution I have found for adding leading zeros returns a character string, including: paste, formatC, format, and sprintf.
For example, can x be 0123 or 00123, etc., instead of 123 and still be numeric?
x <- 0123
EDIT
It is not essential. I was just playing around with the following code, and the last two lines gave the wrong answer. I just thought that if I could have leading zeros in numeric format, obtaining the correct answer would be easier.
a7 = c(1,1,1,0); b7 = c(0,1,1,1)  # expected: 4
a77 = '1110'; b77 = '0111'        # expected: 4
a777 = 1110; b777 = 0111          # expected: 4 (note: 0111 is stored as 111)
length(b7[(b7 %in% intersect(a7,b7))])
# cf. SO question: R - count matches between characters of one string and another, no replacement
keyword <- unlist(strsplit(a77, ''))
text <- unlist(strsplit(b77, ''))
sum(!is.na(pmatch(keyword, text)))
ab7 <- read.fwf(file = textConnection(as.character(rbind(a777, b777))), widths = c(1,1,1,1), colClasses = rep("character", 2))
length(ab7[2,][(ab7[2,] %in% intersect(ab7[1,],ab7[2,]))])
You are not thinking correctly about what a "number" is. Programming languages store an internal representation which retains full precision to the machine limit. You are apparently concerned with what gets printed to your screen or console. By definition, those number characters are string elements, which is to say, a couple of bytes are processed by the ASCII decoder (or equivalent) to determine what to draw on the screen. What x "is," to draw happily on Presidential Testimony, depends on your definition of what "is" is.
You could always create your own class of objects that has one slot for the value of the number and another slot or attribute for either the number of leading 0's or the number of significant digits. (If the value is stored as numeric, then what we see as 123 will actually be stored as a binary value, something like 01111011, though probably with more leading 0's.) Then you can write methods for what to do with the number (and what effect that will have on the leading 0's, significant digits, etc.).
The print method could then make sure to print it with the leading zeros while keeping the internal value as a number.
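A minimal S3 sketch of that idea (all names here are made up for illustration):
# store the numeric value; carry the desired print width as an attribute
padded <- function(x, width = 5) structure(x, width = width, class = "padded")
print.padded <- function(x, ...) {
  cat(formatC(unclass(x), width = attr(x, "width"), format = "d", flag = "0"), "\n")
  invisible(x)
}
y <- padded(123)
y      # prints 00123, but the underlying value is still numeric
y * 2  # arithmetic still works on the number itself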
But this seems a bit of overkill in most cases (though I know that some fields make a big deal about indicating the number of significant digits, so leading 0's can be important). It may be simpler to use the conversion-to-character methods that you already know about, but just do the printing in a way that does not obviously look like a number; see the cat and print functions for the options.