What algorithm is used for finding ngrams?
Supposing my input data is an array of words and the size of the n-grams I want to find, what algorithm should I use?
I'm asking for code, preferably in R. The data is stored in a database, so it can be a PL/pgSQL function too. Java is the language I know best, so I can "translate" it to another language.
I'm not lazy; I'm only asking for code because I don't want to reinvent the wheel by writing an algorithm that already exists.
Edit: it's important to know how many times each n-gram appears.
Edit 2: is there an R package for n-grams?
If you want to use R to identify ngrams, you can use the tm package and the RWeka package. It will tell you how many times the ngram occurs in your documents, like so:
library("RWeka")
library("tm")
data("crude")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
tdm <- TermDocumentMatrix(crude, control = list(tokenize = BigramTokenizer))
inspect(tdm[340:345,1:10])
A term-document matrix (6 terms, 10 documents)

Non-/sparse entries: 4/56
Sparsity           : 93%
Maximal term length: 13
Weighting          : term frequency (tf)

               Docs
Terms           127 144 191 194 211 236 237 242 246 248
  and said        0   0   0   0   0   0   0   0   0   0
  and security    0   0   0   0   0   0   0   0   1   0
  and set         0   1   0   0   0   0   0   0   0   0
  and six-month   0   0   0   0   0   0   0   1   0   0
  and some        0   0   0   0   0   0   0   0   0   0
  and stabilise   0   0   0   0   0   0   0   0   0   1
hat-tip: http://tm.r-forge.r-project.org/faq.html
For anyone still interested in this topic, there is already a package on CRAN:
ngram: An n-gram Babbler
This package offers utilities for creating, displaying, and "babbling" n-grams. The babbler is a simple Markov process.
http://cran.r-project.org/web/packages/ngram/index.html
Usually n-grams are computed in order to find their frequency distribution, so yes, it does matter how many times each n-gram appears.
You also need to decide whether you want character-level or word-level n-grams. I have written code for finding character-level n-grams from a CSV file in R, using the 'tau' package. You can find it here.
Also here is the code I wrote:
library(tau)
temp<-read.csv("/home/aravi/Documents/sample/csv/ex.csv",header=FALSE,stringsAsFactors=F)
r<-textcnt(temp, method="ngram",n=4L,split = "[[:space:][:punct:]]+", decreasing=TRUE)
a<-data.frame(counts = unclass(r), size = nchar(names(r)))
b<-split(a,a$size)
b
Cheers!
EDIT: Sorry, this is PHP. I wasn't quite sure what you wanted. I don't know how to do it in Java, but perhaps the following could be converted easily enough.
Well it depends on the size of the ngrams you want.
I've had quite a lot of success with single letters (especially accurate for language detection), which is easy to get with:
$letters=str_split(preg_replace('/[^a-z]/', '', strtolower($text)));
$letters=array_count_values($letters);
Then there is the following function for calculating ngrams from a word:
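function getNgrams($word, $n = 3) {
    $ngrams = array();
    $len = strlen($word);
    for ($i = 0; $i < $len; $i++) {
        // Only start emitting once $i has reached the end of the first full n-gram
        if ($i > ($n - 2)) {
            $ng = '';
            // Collect the $n characters ending at position $i
            for ($j = $n - 1; $j >= 0; $j--) {
                $ng .= $word[$i - $j];
            }
            $ngrams[] = $ng;
        }
    }
    return $ngrams;
}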
The source of the above is here, which I recommend you read, and they have lots of functions to do exactly what you want.
You can use the ngram package. One example of its usage is http://amunategui.github.io/speak-like-a-doctor/
Have a look at https://cran.r-project.org/web/packages/ngram/vignettes/ngram-guide.pdf
Here is a quick example. It's quite fast; see the benchmark in the vignette.
require(ngram)
require(magrittr)  # provides the %>% pipe used below
"hi i am ig" %>% ngram(n = 2) %>% get.ngrams()
Simple; here's the Java answer:
int ngrams = 9; // let's say 9-grams since it's the length of "bonasuera"...
String string = "bonasuera";
// For each size j from 1 to ngrams, print every substring of length j
for (int j = 1; j <= ngrams; j++) {
    for (int k = 0; k < string.length() - j + 1; k++)
        System.out.print(string.substring(k, k + j) + " ");
    System.out.println();
}
output :
b o n a s u e r a
bo on na as su ue er ra
bon ona nas asu sue uer era
bona onas nasu asue suer uera
bonas onasu nasue asuer suera
bonasu onasue nasuer asuera
bonasue onasuer nasuera
bonasuer onasuera
bonasuera
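The loop above prints character n-grams of a single string. For the case in the question (an array of words, plus a count for each n-gram), a minimal sketch might look like the following. The class name WordNgrams and the method countNgrams are hypothetical names chosen for illustration; they do not come from any library, and joining the words with a single space is an assumption about how the n-grams should be keyed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class WordNgrams {
    // Counts every run of n consecutive words.
    // Keys are the n words joined by a single space (an assumed convention).
    static Map<String, Integer> countNgrams(String[] words, int n) {
        // LinkedHashMap keeps n-grams in order of first appearance
        Map<String, Integer> counts = new LinkedHashMap<>();
        if (n <= 0) {
            return counts;
        }
        // Slide a window of n words across the array
        for (int i = 0; i + n <= words.length; i++) {
            String gram = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(gram, 1, Integer::sum); // insert 1 or add 1 to the existing count
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] words = "to be or not to be".split(" ");
        // prints {to be=2, be or=1, or not=1, not to=1}
        System.out.println(countNgrams(words, 2));
    }
}
```

If the input has fewer than n words, the map simply comes back empty. Sorting by frequency, if needed, can be done afterwards on the map's entry set.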
Related
Original post on Vi and Vim Beta, which has had one interesting answer, but not much attention so far. I am sorry for the crossposting and I will ask for the original to be closed/deleted.
Given the following function in the .vimrc file,
fu! MyFun(count)
  echo a:count
  echo a:count
  if a:count > 0
    normal ,
    call MyFun(a:count - 1)
  endif
endf
calling :call MyFun(3) generates the following output.
3
3
2
2
1
1
0
0
However, if I define the mapping nn , :<C-U>execute "call MyFun(" . v:count . ")"<CR>, then the call to :call MyFun(3) generates
3
0
2
0
1
0
0
0
I do understand that the mapping of , makes the MyFun function call itself twice (if a:count > 0), however I cannot understand how this can cause a different result of the two successive calls to echo a:count.
The problem is all about screen redraw (see :h echo-redraw) in Vim.
Changing echo to echom still produces the same (broken) screen output (3 0 2 0 1 0 0 0), but :mess reveals what is hidden: 3 3 0 0 2 2 0 0 1 1 0 0 0 0.
In the rnn package, an int can be converted to binary using int2bin, like so:
a = int2bin(8)
Now when we use bin2int on a, it should give 8. Put simply,
bin2int(int2bin(8)) should be 8. But it gives something else: 134217728.
Why? What is the correct approach to convert it back to an int?
I cannot access the rnn package because it has some bugs reported on its CRAN page,
https://cran.r-project.org/web/packages/rnn/index.html. I suggest you try this instead.
library(binaryLogic)
as.binary(8) # [1] 1 0 0 0
# Sum 2^(position - 1) over the positions of the 1 digits, counted from the right
as.intfrombinary <- function(x) {
  returnvalue <-
    sum(2^(which(rev(unlist(strsplit(as.character(x), "")) == 1)) - 1))
  return(returnvalue)
}
as.intfrombinary(as.binary(8)) # [1] 8
I'm trying to use subs in Maple to replace derivatives in a longer formula with 0:
subs(diff(u(r),r) = 0, formula);
It seems that if formula only involves first derivatives of u(r) this works as I expect. For example,
formula := diff(u(r),r);
subs(diff(u(r),r) = 0, formula);
0
But if formula involves second derivatives I get a diff(0,r) in the result that won't go away even when using simplify:
formula := diff(u(r),r,r);
subs(diff(u(r),r) = 0, formula);
                               d
                              --- 0
                               dr
(My actual formula is quite long involving first and second derivatives of two variables. I know that all derivatives with respect to a certain variable are 0 and I'd like to remove them).
One way is to use the simplify command with so-called side-relations.
formula := diff(u(r),r,r) + 3*cos(diff(u(r),r,r))
+ diff(u(r),r) + x*(4 - diff(u(r),r,r,r)):
simplify( formula, { diff(u(r),r) = 0 } );
3 + 4 x
formula2 := diff(u(r,s),s,s) + 3*cos(diff(u(r,s),r,r))
+ diff(u(r,s),r) + x*(4 - diff(u(r,s),r,s,r,r)):
simplify( formula2, { diff(u(r,s),r) = 0 } );
                      / 2         \
                      |d          |
                  3 + |--- u(r, s)| + 4 x
                      |  2        |
                      \ds         /
[edit] I forgot to answer your additional query about why you got d/dr 0 before. The answer is that you used subs instead of 2-argument eval. The former does purely syntactic substitution and doesn't evaluate the result. The latter is the one that people often need without knowing it, and does "evaluation at a (particular) point".
formulaA := diff(u(r),r,r):
subs(diff(u(r),r) = 0, formulaA);
d
--- 0
dr
%; # does an evaluation
0
eval(formulaA, diff(u(r),r) = 0);
0
formulaB := diff(u(r,s),s,r,r,s):
eval(formulaB, diff(u(r,s),r) = 0);
0
You can see that any evaluation of those d/dr 0 objects will produce 0. But it is often better practice to use 2-argument eval than to do eval(subs(...)). People use subs because it sounds like "substitution", I guess, or because they see others use it. Sometimes subs is the right tool for the job, so it's important to know the difference.
I am a PhD student at the University of Padua and I am trying to write a little script (my first!) in R v. 3.0.1 to run an epidemiological simulation.
I'd like to change the values of a vector of 883 values based on a neighbour matrix constructed with nb2mat from a shapefile: if i and j (two cells) are neighbours (according to the matrix) and i or j has a positive value in the vector, I'd like to set the values of both i and j to 1 (positive); otherwise the values of i and j should remain 0. When I launch the following little script:
for(i in 1:883)
{ for(j in 1:883)
{ if(MatriceDist[i,j] > 0 & ((vectorID[i] > 0 | vectorID[j] > 0)) {
vectorID[i] = 1 & vectorID[j] = 1
print(vectorID)
} } }
the answer from the software is:
Error: unexpected '{' in:
" { for(j in 1:883)
{ while(MatriceDist[i,j] > 0 & ((vectorID[i] > 0 | vectorID[j] > 0)) {"
I think that it is an error in the statement for if but I can not understand how to solve it...
Thank you everyone!
Elisa
Check your brackets :-)
for(i in 1:883) {
  for(j in 1:883) {
    if(MatriceDist[i,j] > 0 & (vectorID[i] > 0 | vectorID[j] > 0)) {
      vectorID[i] = 1
      vectorID[j] = 1
      print(vectorID)
    }
  }
}
You had one ( too many before vectorID in your if statement. Also, vectorID[i] = 1 & vectorID[j] = 1 is not a valid way to make two assignments in R; they need to be separate statements, as above.
Please double-check that the condition now specified in the statement is still the one you require.
By the way, for loops can be slow in R. Since you know the final size of vectorID, make sure it is pre-allocated at full length rather than grown inside the loop. That will speed things up a little bit.
I have a function that is at the moment programmed in a functional model, and I want to speed it up and maybe solve the problem more in the spirit of R.
I have a data.frame and want to add a column where every entry depends on two rows.
At the moment it looks like the following:
faultFinging <- function(heartData){
  if(heartData$Pulse[[1]] == 0){
    Group <- 0
  }
  else{
    Group <- 1
  }
  for(i in seq(2, length(heartData$Pulse), 1)){
    if(heartData$Pulse[[i-1]] != 0
       && heartData$Pulse[[i]] != 0
       && abs(heartData$Pulse[[i-1]] - heartData$Pulse[[i]])<20){
      Group[[i]] <- 1
    }
    else{
      if(heartData$Pulse[[i-1]] == 0 && heartData$Pulse[[i]] != 0){
        Group[[i]] <- 1
      }
      else{
        Group[[i]] <- 0
      }
    }
  }
  Pulse<-heartData$Pulse
  Time<-heartData$Time
  return(data.frame(Time,Pulse,Group))
}
I can't test this without sample data, but this is the general idea. You can avoid doing the for() loop entirely by using & and | which are vectorized versions of && and ||. Also, there's no need for an if-else statement if there's only one value (true or false).
faultFinging <- function(heartData){
  Group <- as.numeric(c(heartData$Pulse[1] != 0,
    (heartData$Pulse[-nrow(heartData)] != 0
      & heartData$Pulse[-1] != 0
      & abs(heartData$Pulse[-nrow(heartData)] - heartData$Pulse[-1])<20) |
    (heartData$Pulse[-nrow(heartData)] == 0 & heartData$Pulse[-1] != 0)))
  return(cbind(heartData, Group))
}
Putting as.numeric() around the logical vector converts TRUE to 1 and FALSE to 0.
This can be done in a more vector way by separating your program into two parts: firstly a function which takes two time samples and determines if they meet your pulse specification:
isPulse <- function(previous, current)
{
(previous != 0 & current !=0 & (abs(previous-current) < 20)) |
(previous == 0 & current !=0)
}
Note the use of vector | instead of boolean ||.
And then invoke it, supplying the two vector streams 'previous' and 'current' offset by a suitable delay, in your case 1:
delay <- 1
samples = length(heartData$pulse)
isPulse(heartData$pulse[-(samples + 1 - (1:delay))], heartData$pulse[-(1:delay)])
Let's try this on some made-up data:
sampleData = c(1,0,1,1,4,25,2,0,25,0)
heartData = data.frame(pulse=sampleData)
samples = length(heartData$pulse)
result = isPulse(heartData$pulse[-(samples + 1 - (1:delay))], heartData$pulse[-(1:delay)])
Note that the code heartData$pulse[-(samples + 1 - (1:delay))] trims delay samples from the end, for the previous stream, and heartData$pulse[-(1:delay)] trims delay samples from the start, for the current stream.
Doing it manually, the results should be (using F for false and T for true)
F,T,T,T,F,F,F,T,F
and by running it, we find that they are!:
> print(result)
FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
success!
Since you want to bind these back as a column into your original dataset, you should note that the new array is delay elements shorter than your original data, so you need to pad it at the start with delay FALSE elements. You may also want to convert it into 0,1 as per your data:
resultPadded <- c(rep(FALSE,delay), result)
heartData$result = ifelse(resultPadded, 1, 0)
which gives
> heartData
   pulse result
1      1      0
2      0      0
3      1      1
4      1      1
5      4      1
6     25      0
7      2      0
8      0      0
9     25      1
10     0      0