Alternative to grep function when searching - r

Relatively simple problem, but I'm going around in circles and not reaching a solution. Essentially, I have x=sample(1:30,20) and I need to find the position of each individual number. If I use grep, I get the position of every number that contains a 1 (including 10, 19, 21, etc.). I have a feeling there's an obscenely simple solution to this, but I can't think of it for some reason.
For example, if x=c(2,3,1,10,12),
then
f(1,x)
[1] 3
and
f(3,x)
[1] 2
Note: I tried using fixed = T, but that did not help.

You want which: it returns the indices where a logical condition is TRUE.
x <- c(2,3,1,10,12)
which(x==10)
[1] 4
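If you need the f(value, x) interface from the question, a minimal wrapper around which will do (the name f is just the question's placeholder); match is an alternative when only the first hit matters:
x <- c(2, 3, 1, 10, 12)
# which() returns every index where the condition is TRUE
f <- function(value, vec) which(vec == value)
f(1, x)
[1] 3
f(3, x)
[1] 2
# match() returns only the first matching index
match(10, x)
[1] 4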

Related

R fuzzy lookup match down to each character's position

I want to find the closest match, down to each character's position if possible.
I'm not sure whether I should dive into writing a loop to check each character, or whether there is a more efficient or elegant alternative.
Sample Data:
Look up in lookUp_List for the entry that is closest to item; in this sample, ideally it should be bmw6, or, less ideally, bmws.
Best if both methods could be shown, so I can apply them to my data and check which is better.
Thank you so much in advance!
item = "bmwx650isport(a)"
lookUp_List = c("bmw5", "bmw6", "bmws")
Trial:
amatch(item, lookUp_List, maxDist = 30)
# [1] 1
stringdist(item, lookUp_List)
# [1] 12 12 12
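One way to explore this (a sketch rather than a definitive answer; the choice of methods below is an assumption) is to compute several stringdist metrics side by side and see which lookup entry each one prefers:
library(stringdist)
item <- "bmwx650isport(a)"
lookUp_List <- c("bmw5", "bmw6", "bmws")
# distances under a few different metrics; lower means closer
methods <- c("osa", "lcs", "jw", "cosine")
dists <- sapply(methods, function(m) stringdist(item, lookUp_List, method = m))
rownames(dists) <- lookUp_List
dists
# which entry each metric would pick
apply(dists, 2, which.min)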

Correcting mis-typed data in R

I want to correct wrongly entered data in R. For example if I have a vector
V=c('PO','PO','P0')
I want R to recognize that the 0 in the last entry should be an O and to change it. Is there any way to do that? I have been trying to use correctTypos in the deducorrect package in R. However, I am having some problems with the editset; I cannot seem to specify that all the entries have to be letters. Any help greatly appreciated.
Another example would be
V2=c('PL','P1','PL','XX')
That 1 should be an L.
The Jaro-Winkler distance was developed to detect data-entry issues. But on entries that are only two characters long, this is going to be difficult, as a single error tends to score higher than you would want it to. You could combine it with other distance measures available in the stringdist package, but in this case that might be too complicated.
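To see that scoring problem concretely, here is a small illustration; the dictionary of valid codes is an assumption made up for the example:
library(stringdist)
V2 <- c('PL','P1','PL','XX')
valid <- c("PO", "PL")   # hypothetical dictionary of correct codes
# Jaro-Winkler distances between each entry and each valid code; a single
# wrong character in a two-letter code already produces a sizeable distance
d <- stringdistmatrix(V2, valid, method = "jw")
dimnames(d) <- list(V2, valid)
d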
Given your examples you might want to use the base function chartr and set up a replacement of numbers to letters.
chartr("01","OL", V2)
[1] "PL" "PL" "PL" "XX"
chartr("01","OL", V)
[1] "PO" "PO" "PO"
This will always replace a 1 with an L and a 0 (zero) with an O. You can add 5 for S, and so on, but if there are other combinations it might get complicated.
Also note that the next iteration of the deducorrect package is the deductive package.
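If a distance-based approach is still attractive, another option (sketched under the assumption that you can list the valid codes) is to snap each entry to its nearest dictionary value with amatch from the stringdist package:
library(stringdist)
V <- c('PO','PO','P0')
valid <- c("PO", "PL")   # hypothetical dictionary of valid codes
# replace each entry by the closest valid code within a distance of 1;
# if an entry is equally close to two codes, amatch returns the first,
# so ties may need extra care
valid[amatch(V, valid, maxDist = 1)]
[1] "PO" "PO" "PO"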

rank() function in R is ranking objects with floating points rather than integers

I'm quite new to R so this may seem quite trivial to many experienced programmers, sorry in advance!
I've got a numeric vector of length 8 that looks like this:
data <- c(45, 67, 23, 24, 5, 23, 45, 23)
When I type in: rank(data), R returns: [1] 6.5 8.0 3.0 5.0 1.0 3.0 6.5 3.0
However, with my (very basic) understanding of rank, I expected R to return only whole numbers, such as:
[1] 6 8 2 5 1 3 7 4
Why does rank() give the first element in data a floating-point rank rather than a whole-number rank? Is it because there are repeated values in data, so rank() is handling ties in a way I am not expecting? If so, please tell me how I can fix this so the output looks like what I expected above. Also, any information on how rank() deals with NA values would be much appreciated. A basic description of rank() and what bells and whistles can be used would be fantastic! I've looked for videos on YouTube and searched Stack Overflow to no avail. Thanks so much.
From ?rank:
With some values equal (called ‘ties’), the argument ties.method determines the result at the corresponding indices. The "first" method results in a permutation with increasing values at each index set of ties. The "random" method puts these in random order whereas the default, "average", replaces them by their mean, and "max" and "min" replaces them by their maximum and minimum respectively, the latter being the typical sports ranking.
Sounds like you're using the default ties.method = "average" for tie breaking, which replaces tied ranks with their mean, and that mean is not necessarily an integer.
The built-in documentation should always be your first stop when looking for help. In this case (and most cases), it details all the "bells and whistles"; here there aren't many: just tie handling and NA handling. It also has examples at the bottom.
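For reference, the whole-number output from the question can be reproduced with ties.method = "first", which breaks ties by order of appearance; na.last controls what happens to NAs:
data <- c(45, 67, 23, 24, 5, 23, 45, 23)
# break ties by order of appearance instead of averaging
rank(data, ties.method = "first")
[1] 6 8 2 5 1 3 7 4
# NA handling: na.last = "keep" leaves NAs as NA (by default they are ranked last)
rank(c(3, NA, 1), na.last = "keep")
[1]  2 NA  1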

R: Sample into bins of predefined sizes (partition sample vector)

I'm working on a dataset that consists of ~10^6 values which are clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping the bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)), randomly sample from it, and return a list with the data partitioned into the bin sizes already given by "sizes".
I've been trying to write such a function myself, since the task does not seem that hard. However, partitioning a vector into given bin sizes looks like something that would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could point me to a smart solution that already exists :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would also be interested in a trick to do the above for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should be primarily in the permutation step:
# A single step:
x <- sample( unlist(data))
list( one=x[1:4], two=x[5:8], three=x[9], four=x[10:12], five=x[13:15])
As mentioned above, "no.of.randomizations" may have been the number of repeated applications of this process, in which case you may want to wrap replicate around that:
replic <- replicate(n=4, { x <- sample(unlist(data))
list( x[1:4], x[5:8], x[9], x[10:12], x[13:15]) } )
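Two small notes: replicate simplifies its result into a matrix by default, so adding simplify = FALSE keeps a plain list of randomizations. And if you want the interface from the question, the same idea can be wrapped into a helper; the name partition.sample is the question's placeholder, not an existing function:
# permute the pooled values, then split them back into the original bin sizes
partition.sample <- function(values, sizes) {
  sizes <- unlist(sizes)                  # sizes may arrive as a list (from lapply)
  stopifnot(sum(sizes) == length(values))
  x <- sample(values)                     # a permutation, i.e. without replacement
  split(x, rep(seq_along(sizes), times = sizes))
}
partition.sample(1:15, lapply(data, length))
# several randomizations at once
replicate(4, partition.sample(1:15, sizes), simplify = FALSE)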
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector containing a unique permutation of "data" and then split it into a list of vectors of lengths "sizes" by supplying a factor argument to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything is as above.
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided from the data. Otherwise, I would have to obtain them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded by "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5 5 5
This new vector "cut.by" can then be provided as the argument to split():
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for. It samples from the background "1:15" and splits the result into vectors of lengths "sizes" via the vector "cut.by".
However, I am still not happy about having to go via an additional (possibly long) vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)

Counting syllables

I'm looking to assign some different readability scores to text in R, such as the Flesch-Kincaid.
Does anyone know of a way to segment words into syllables using R? I don't necessarily need the syllable segments themselves but a count.
so for instance:
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')
would yield:
1, 1, 2, 2, 1, 3
Each number corresponds to the number of syllables in the word.
qdap version 1.1.0 does this task:
library(qdap)
x <- c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle')
syllable_sum(x)
## [1] 1 1 2 2 1 3
gsk3 is correct: if you want a correct solution, it is non-trivial.
For example, you have to watch out for strange things like a silent e at the end of a word (e.g. pane), or know when it's not silent, as in finale.
However, if you just want a quick-and-dirty approximation, this will do it:
nchar( gsub( "[^X]", "", gsub( "[aeiouy]+", "X", tolower( x ))))
[1] 1 1 2 2 1 3
To understand how the parts work, just strip away the function calls from the outside in, starting with nchar and then gsub, etc., until the expression makes sense to you.
But my guess is that, given the trade-off between R's power and the profusion of exceptions in the English language, you could get a decent answer (maybe 99% right?) on normal text without a lot of work; the simple parser above may already get 90%+ right. With a little more work, you could deal with silent e's if you like.
It all depends on your application: whether this is good enough or you need something more accurate.
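For convenience, the same approximation can be wrapped into a small function (the name count_syllables is made up here); it simply counts runs of vowels as syllables:
count_syllables <- function(words) {
  # collapse each run of vowels to a single marker, then count the markers
  nchar(gsub("[^X]", "", gsub("[aeiouy]+", "X", tolower(words))))
}
count_syllables(c('dog', 'cat', 'pony', 'cracker', 'shoe', 'Popsicle'))
[1] 1 1 2 2 1 3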
Some tools for NLP are available here:
http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
The task is non-trivial though. More hints (including an algorithm you could implement) here:
Detecting syllables in a word
The koRpus package will help you out immensely, but it's a little difficult to work with.
stopifnot(require(koRpus))
# 'text' is assumed to be a character string (or vector) holding the text to score
tokens <- tokenize(text, format="obj", lang='en')
flesch.kincaid(tokens)
