Correcting mis-typed data in R

I want to correct wrongly entered data in R. For example if I have a vector
V=c('PO','PO','P0')
I want R to recognize that the 0 in the last entry should be an O and to change it. Is there any way to do that? I have been trying to use correctTypos from the deducorrect package in R. However, I am having some problems with the editset: I cannot seem to specify that all the entries have to be letters. Any help greatly appreciated.
Another example would be
V2=c('PL','P1','PL','XX')
That 1 should be an L.

The Jaro-Winkler distance was developed to find data-entry issues. But on entries only two characters long that is going to be difficult, as a single error tends to score higher than you want it to. You could combine it with other distance measures available in the stringdist package, but in this case that might be too complicated.
Given your examples, you might want to use the base function chartr and set up a replacement of numbers to letters.
chartr("01","OL", V2)
[1] "PL" "PL" "PL" "XX"
chartr("01","OL", V)
[1] "PO" "PO" "PO"
This will always replace a 1 with an L and a 0 (zero) with an O. You can add 5 for S, and so on. But if there are other combinations it might get complicated.
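If you also know the set of valid codes, a fuzzy lookup against that set is another option. Below is a minimal sketch using amatch() from the stringdist package mentioned above; the valid-code set c("PO", "XX") is an assumption for illustration, not something given in the question.
library(stringdist)
# Hypothetical set of valid codes (an assumption for this sketch)
valid <- c("PO", "XX")
V <- c("PO", "PO", "P0")
# amatch() returns the index of the nearest valid code within maxDist
# edits (optimal string alignment distance by default), or NA if none
idx <- amatch(V, valid, maxDist = 1)
ifelse(is.na(idx), V, valid[idx])
# [1] "PO" "PO" "PO"
This snaps each mistyped entry to its closest valid code and leaves anything without a close match untouched. Beware that with several valid codes at the same distance (e.g. both "PO" and "PL" one edit from "P1"), the result is ambiguous, which echoes the caveat about very short strings above.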
Also note that the next iteration of the deducorrect package is the deductive package.

Related

Is there a way to handle calculations involving exponentials of big values in R?

I have looked around a bit online and on this site, but I did not find a solution. My problem is relatively simple, so if you could point me to a possible solution, much appreciated.
test_vec <- c(2,8,709,600)
mean(exp(test_vec))
test_vec_bis <- c(2,8,710,600)
mean(exp(test_vec_bis))
exp(709)
exp(710)
# The numerical limit of R is at exp(709)
How can I calculate the mean of my vector and deal with the Inf values, knowing that R could probably represent the mean itself, just not every term in the numerator of the mean calculation?
There is an edge case where you can solve your problem by simply restating it mathematically, but that requires that your vector is very long and/or that your large exponents are close to the numeric limit:
Since the mean sum(x)/n can be written as sum(x/n), and since exp(x)/exp(y) = exp(x-y), you can calculate sum(exp(x - log(n))), which buys you headroom of log(n).
mean(exp(test_vec))
[1] 2.054602e+307
sum(exp(test_vec - log(length(test_vec))))
[1] 2.054602e+307
sum(exp(test_vec_bis - log(length(test_vec_bis))))
[1] 5.584987e+307
While this works for your example, most likely it won't work for your real vector.
In that case, you will have to consult packages like Rmpfr, as suggested by @fra.
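For completeness, here is a minimal sketch of the Rmpfr route; the choice of 128 bits of precision is arbitrary and just for illustration:
library(Rmpfr)
# Multiple-precision numbers: exp(710) no longer overflows to Inf
x <- mpfr(c(2, 8, 710, 600), precBits = 128)
sum(exp(x)) / length(x)
# 1 'mpfr' number of precision 128 bits
# roughly 5.584987e+307, now computed without overflow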
Here's one way, selecting only those entries whose exponential is finite. Note that this uses test_vec_bis, where exp(710) overflows to Inf (in test_vec, exp(709) is still finite, so nothing would be dropped):
mean(exp(test_vec_bis)[which(exp(test_vec_bis) < Inf)])
[1] 1.257673e+260
t2 <- c(2,8,600)
mean(exp(t2))
[1] 1.257673e+260
This assumes you were looking to exclude values that result in Inf, of course.
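An equivalent, slightly more idiomatic way to write the same filter uses is.finite(), which also screens out NaN and NA:
ev <- exp(test_vec_bis)
mean(ev[is.finite(ev)])
# [1] 1.257673e+260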

Multiple pattern search - pick the line with the most hits from a document

I am trying to search for a list of terms or keywords in a list of sentences. I want to pick, from the list of lines (which are review comments from customers), the line that matches the most of my terms or keywords.
At present I am doing this:
mydata<-c("i like this product, awesome",
"i could not go with this produt, since s/w is problem",
"Very good s/w. keep up the good work. i really like it")
terms<-c("really, "good", "like", "product")
termco(mydata, 1:3, terms)
and I get:
  3 word.count   really      good      like   product
1 1          5        0         0 1(20.00%) 1(20.00%)
2 2         11        0         0         0         0
3 3         12 1(8.33%) 2(16.67%)  1(8.33%)         0
I also tried a few other suggestions HERE, but I could not get the result I wanted (though that solution is very nice).
My expectation is that only the line or lines containing the maximum number of my search terms or keywords should be displayed.
In this case I expected the line below, since it contains the most terms or keywords, i.e., "really", "good", and "like":
"Very good s/w. keep up the good work. i really like it"
Thanks in advance!!
Here is a base R solution using apply and grep. The basic idea is to call grep(term, sentence), for every term in a given sentence. Then, we sum the number of hit terms for each sentence. Note carefully that we add word boundary markers around each term. This is to prevent false matches where a term happens to be a substring of another word in a sentence.
sapply(mydata, function(x) {
  Reduce("+", sapply(terms, function(y) {
    sum(grep(paste0("\\b", y, "\\b"), x))
  }))
})
i like this product, awesome
2
i could not go with this product, since s/w is problem
1
Very good s/w. keep up the good work. i really like it
3
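To go from these counts to the line itself, which is what the question asks for, you can wrap the result in which.max. This is a small extension of the answer above, not part of the original; grepl() returns TRUE/FALSE per term, so the sum counts distinct matched terms:
counts <- sapply(mydata, function(x) {
  sum(sapply(terms, function(y) grepl(paste0("\\b", y, "\\b"), x)))
})
mydata[which.max(counts)]
# [1] "Very good s/w. keep up the good work. i really like it"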
Demo
Using stringr's str_count can help as well:
Use str_count to get the counts of all matches (4 in total for the last record), then use which.max to get the index into the vector (in this case it returns 3, i.e., the third element of mydata):
mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
In case you want an exact match with word-boundary conditions, you may use:
mydata[which.max(stringr::str_count(mydata,paste0("\\b",paste0(terms, collapse="\\b|\\b"),"\\b")))]
In your case both will give you the same answer; however, the second will produce fewer matches. E.g., when a sentence contains "keeping" instead of "keep", the latter regex will not match, since it requires whole words, while the former will, as it sets no boundary conditions.
Output:
> mydata[which.max(stringr::str_count(mydata, paste0(terms, collapse="|")))]
[1] "Very good s/w. keep up the good work. i really like it"

How to get an Element from a vector without using numbers or indices?

Let's say I have these two vectors, Atom.Type and Molar.Mass, in my R workspace with the following content:
> Atom.Type
[1] "Oxygen" "Lithium" "Nitrogen" "Hydrogen"
> Molar.Mass
[1] 16 6.9 14 1
I now want to assign the Molar.Mass belonging to "Lithium" (i.e. 6.9) to a new variable called mass.
The problem is: I have to do that without using any numbers or indices.
Does anyone have a suggestion for this problem?
This should work:
mass <- Molar.Mass[Atom.Type == "Lithium"]
Clearly this assumes the two vectors have the same length and are sorted correspondingly.
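An alternative sketch, under the same assumption about alignment: attach the atom types as names and index by name rather than by position.
Atom.Type <- c("Oxygen", "Lithium", "Nitrogen", "Hydrogen")
Molar.Mass <- c(16, 6.9, 14, 1)
# Name the masses by atom type, then look up by name (no numeric index)
names(Molar.Mass) <- Atom.Type
mass <- unname(Molar.Mass["Lithium"])
mass
# [1] 6.9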

String matching for names with jumbled up words in R

I am trying to match names in R such as VISHWANATHAN KRISHNA GURUVAYUR and GURUVAYUR KRISHNA VISHWANATHAN. After removing spaces, Levenshtein similarity gives only a 21% match.
I want to know if there is some string matching algorithm that could tag these two names as similar...
library(RecordLinkage)
levenshteinSim("GURUVAYURKRISHNAVISHWANATHAN","VISHWANATHANKRISHNAGURUVAYUR")
#[1] 0.2142857
Try the Jaro-Winkler algorithm, also from the RecordLinkage package. See here for example, and here for more.
In your case,
jarowinkler("GURUVAYURKRISHNAVISHWANATHAN","VISHWANATHANKRISHNAGURUVAYUR")
yields:
0.7063492
Results are always between 0 and 1, so this is an improvement.
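Since the two names here differ only in word order, another option (a sketch, not part of the original answer) is to sort the tokens before comparing, in the spirit of token-sort matching:
# Sort the space-separated tokens of a name into a canonical order
token_sort <- function(x) {
  paste(sort(strsplit(x, "\\s+")[[1]]), collapse = " ")
}
a <- token_sort("VISHWANATHAN KRISHNA GURUVAYUR")
b <- token_sort("GURUVAYUR KRISHNA VISHWANATHAN")
library(RecordLinkage)
levenshteinSim(a, b)
# [1] 1  (identical once the tokens are sorted)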

R - How to completely detach a subset plm.dim from a parent plm.dim object?

I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.
Following the proposed solution in the comments, I find that it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried the suggestion at home on the example dataset and it worked fine. However, once again, in my real data the problem is not solved.
Here's the output for my actual data (originally 37 firms):
sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)
[1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)
[1] 8
Not only is the problem not solved; for some reason I get an extra count of a zero variable, while I explicitly ask to keep only the ones that differ from zero.
Unfortunately, I cannot recreate the same problem with a simple example; for that example, as said, droplevels() works just fine.
A simple reproducible example explains:
library(plm)
dad <- cbind(as.data.frame(matrix(seq(1:40), 8, 5)),
             factors = c("q", "w", "e", "r"),
             year = c("1991", "1992", "1993", "1994"))
dad <- plm.data(dad, index = c("factors", "year"))
kid <- dad[tapply(dad$V5, dad$factors, sum) <= 70, ]
tapply(kid$V1, kid$factors, mean)
kid <- droplevels(dad[tapply(dad$V5, dad$factors, sum) <= 70, ])
tapply(kid$V1, kid$factors, mean)
So I create a dad and a kid dataframe based on some tapply condition (I'm sure this extends more generally).
The result of the tapply on the kid is the following:
 e  q  r  w
 7 NA  8 NA
Clearly R has not forgotten the dad, and it reports the two dropped factors as NA. In itself not much of a problem, but in my real dataset, with many more variables and subsetting to do, I'd like a cleaner cut that will make searching through the kid(s) easier. In other words, I don't want the initial factors q w e r to be remembered. The desired output would thus be:
e r
7 8
So, can anyone think of a reason why something that works perfectly on a small data.frame would behave differently on a larger one? For p.data (N = 592, T = 16 and n = 37), I find that when I run two identical tapply calls, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared; literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could solve the mystery of the factors that refuse to drop as well.
Thanks
Simon
