How to smooth unigrams - information-retrieval

I have a unigram language model and I want to smooth the counts. Is add-one smoothing the only way, or can I use some other smoothing method as well? I don't think we can use Kneser-Ney, as that is for n-grams with N >= 2. Do you know of any other smoothing method?
How about Witten-Bell?

For unigram smoothing, Good-Turing is a good choice, and it's easy to apply!
http://en.wikipedia.org/wiki/Good%E2%80%93Turing_frequency_estimation
For higher orders, modified interpolated Kneser-Ney is a good choice.
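As a rough illustration of the two unigram options mentioned in this thread, here is a minimal sketch in R. The toy corpus and the vocabulary size V are made up for the example, and the Good-Turing part only computes the unseen-word mass N1/N; practical implementations usually also smooth the frequency-of-frequency counts, which is not shown.

```r
# Toy corpus; tokens and vocabulary size are assumptions for illustration only.
tokens <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")
counts <- table(tokens)
N <- sum(counts)   # total number of tokens
V <- 10            # assumed vocabulary size, including unseen words

# Add-one (Laplace) estimates for the seen words and for any single unseen word
p_add_one <- (as.numeric(counts) + 1) / (N + V)
names(p_add_one) <- names(counts)
p_unseen_add_one <- 1 / (N + V)

# Good-Turing: N_r is the number of word types seen exactly r times
freq_of_freq <- table(as.numeric(counts))
n1 <- if ("1" %in% names(freq_of_freq)) freq_of_freq[["1"]] else 0
p0_good_turing <- n1 / N   # total probability mass reserved for unseen words
```

With add-one smoothing the unseen mass depends entirely on the assumed V, whereas the Good-Turing estimate is driven by how many words occur exactly once.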

Related

I have a graph-1 and I need to predict the value of another graph-2?

Colleagues, I have graph-1 and I need to predict the values of another graph, graph-2, based on its data.
The graphs are correlated, that's for sure. Using machine learning I can predict graph-2 from graph-1, but I would like to have an explicit mathematical formula that produces the prediction.
My plan is simply to fit an approximation: there are web-sites where candidate mathematical formulas are fitted automatically, so I would take the formula with the lowest average approximation error (in %), use it, and see how it performs.
Maybe there is a smarter way?
Please see the image
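The plan described above (try several formulas and keep the one with the lowest average percentage error) can be sketched in R as follows. The data are synthetic stand-ins for graph-1 and graph-2, and the three candidate formulas are arbitrary choices for illustration, not a recommendation.

```r
set.seed(1)
x <- seq(1, 10, length.out = 50)            # stands in for graph-1
y <- 2 + 3 * log(x) + rnorm(50, sd = 0.1)   # stands in for graph-2 (synthetic)
dat <- data.frame(x = x, y = y)

# Candidate formulas to compare (hypothetical choices)
candidates <- list(
  linear    = y ~ x,
  quadratic = y ~ x + I(x^2),
  logarithm = y ~ log(x)
)

# Mean absolute percentage error of each fitted formula
mape <- sapply(candidates, function(f) {
  fit <- lm(f, data = dat)
  mean(abs((dat$y - fitted(fit)) / dat$y)) * 100
})

best <- names(which.min(mape))
```

Note that this compares in-sample error, which tends to favour more flexible formulas; checking the error on held-out points gives a fairer comparison.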

R: Evaluate Gradient Boosting Machines (GBM) for Regression

Which are the best metrics to evaluate the fit of a GBM model in R (metrics, graphs, ratios)? And how should I interpret them?
I think maybe you are overthinking this one! Take a step back and think about what matters: the error. You have forecasted values and you have observed values. The difference between them tells you most of what you need to know when comparing across models. Basic measures like MSE (mean squared error), MPE (mean percentage error), etc. should do fine. If you are looking to refine within a given model, I would recommend taking a look at the gbm documentation. For example, you can pass your gbm model object to summary() to get the relative influence of each of your variables. There is a lot more information in the documentation, so if you haven't looked at it yet, I would recommend doing so! I have posted the link at the bottom.
-Carmine
gbm_documentation
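A minimal sketch of the basic error measures mentioned above, in R. The observed and predicted vectors are made-up numbers; in practice the predictions would come from the fitted gbm model applied to held-out data.

```r
# Hypothetical observed and forecasted values, for illustration only
observed  <- c(10, 12, 15, 20)
predicted <- c(11, 11, 16, 18)

err  <- observed - predicted
mse  <- mean(err^2)                 # mean squared error
rmse <- sqrt(mse)                   # same units as the response
mae  <- mean(abs(err))              # mean absolute error
mpe  <- mean(err / observed) * 100  # mean percentage error (sign shows bias)
```

Lower MSE, RMSE and MAE mean a closer fit; an MPE far from zero suggests the model systematically over- or under-predicts.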

The proper way to plot PDF of a sample of data

I know this must be pretty basic, but what is the proper, accurate way to plot the PDF of some sample data that you know comes from a given population distribution, for example if you generated it using rnorm() or rexp()?
The reason I ask is that I know a lot of people use density() and then pass the result to plot(), but the density() function seems too arbitrary to be accurate. For example, it is inaccurate when it assigns density to negative values for data that came from the exponential distribution, which has no negative values.
So could someone recommend a more fine-tuned method for plotting sample PDFs?
The density function performs kernel density estimation (KDE). To find the best KDE for your dataset, you should tune the bandwidth (parameter bw). Here's a paper that discusses KDE and bandwidth selection: http://www.stat.washington.edu/courses/stat527/s13/readings/Sheather_StatSci_2004.pdf
Or for a simpler approach, you can try out different bandwidth methods to pass to bw:
https://stat.ethz.ch/R-manual/R-devel/library/stats/html/bandwidth.html
The current default, "nrd0", is there for historical reasons. I find "ucv" and "bcv" have worked better for my datasets.
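As a sketch of the bandwidth options discussed above, the following R snippet compares the default bandwidth with "ucv" on simulated exponential data. The sample size, seed and the use of from = 0 to stop the estimate at zero are choices made for this illustration.

```r
set.seed(42)
x <- rexp(500, rate = 1)   # simulated sample; the true density is dexp()

d_default <- density(x)                        # default bw = "nrd0"
d_ucv     <- density(x, bw = "ucv")            # unbiased cross-validation
d_cut     <- density(x, bw = "ucv", from = 0)  # evaluate only for x >= 0

# Overlay the estimate and the true density on a histogram
hist(x, freq = FALSE, breaks = 30, main = "Sample vs. estimated density")
lines(d_cut, lwd = 2)
curve(dexp(x, rate = 1), add = TRUE, lty = 2)
```

Setting from = 0 only restricts where the estimate is evaluated; it does not remove the bias a kernel estimate has near a hard boundary, so the curve still tends to be too low close to zero.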
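ggplot2 can help here: geom_density() evaluates the estimate only over the range of the data, so it does not draw density at negative values when the data are non-negative. It can be used in the following manner:
library(ggplot2)
# fill is a fixed colour, so it goes outside aes()
ggplot(df, aes(x = contVar)) +
  geom_density(fill = "green", alpha = 0.3)
I would also take a look at this post on Cross Validated.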

How to calculate first derivative of time series

I would like to calculate the first derivative (dpH/dtime) of a time series using two variables, time and pH.
Is there a function to do this in R, or should I write my own?
Assuming pH and time are plain vectors, try this:
library(pspline)
predict(sm.spline(time, pH), time, 1)
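You might want to start with stats::deriv or diff.ts as Matt L suggested (note that deriv differentiates expressions symbolically, so for sampled data diff is the relevant one). Just keep in mind what a professor of mine used to tell all his students: numeric differentiation is known as the "error multiplier".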
EDIT:
To clarify -- what he was warning about was that any noise in your data can throw the derivative estimate way off. It's been said that integration is a low-pass filter and differentiation is a high-pass filter.
So, the important thing is to do some smoothing on your data before calculating a derivative. Hence Gabor's excellent suggestion to use predict() on a smoothing spline. But keep in mind that modifying the spline parameters will smooth your data to different levels, so always look at the results to make sure you removed apparent noise but not desired features.
Here's a link to "Numerical Differentiation".
http://en.wikipedia.org/wiki/Numerical_differentiation
Here's a link describing a method based on Taylor Series Expansions:
http://ocw.usu.edu/civil_and_environmental_engineering/numerical_methods_in_civil_engineering/ODEsMatlab.pdf
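To illustrate the point about noise, here is a minimal sketch in R comparing raw finite differences with a derivative taken from a smoothing spline. It uses base R's smooth.spline() rather than the pspline package, and the pH series is synthetic (a sine curve plus noise), both assumptions made for the example.

```r
set.seed(7)
time <- seq(0, 10, by = 0.1)
pH   <- 7 + sin(time) + rnorm(length(time), sd = 0.02)  # synthetic data

# Raw finite differences: one value per interval, noise is amplified
d_raw <- diff(pH) / diff(time)

# Smooth first, then differentiate the fitted spline
fit      <- smooth.spline(time, pH)
d_smooth <- predict(fit, time, deriv = 1)$y

# The true derivative is known here, so the errors can be compared
mid        <- head(time, -1) + diff(time) / 2
err_raw    <- mean(abs(d_raw - cos(mid)))
err_smooth <- mean(abs(d_smooth - cos(time)))
```

Plotting d_raw and d_smooth against time makes the difference visible; changing the spar or df argument of smooth.spline() changes how much is smoothed away.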

Looking for interesting formula

I'm creating a game where players can make an alloy. To make it less predictable and more interesting, I thought that the durability and hardness of an alloy should not be calculated by a simple formula, because it would be extremely easy to find the extrema where the alloy has the best statistics.
So the question is: is there a formula for a function whose extrema can only be found by investigating all points? Input values will be percentages: 0.0%-100.0%. I think it should look like this: half sound wave
A very simple way would be to combine a couple of sine functions; just vary the constants and the sign for each new player. Here is one example: (sin(1.1*x) + sin(x) + sin(0.9*x))^2
If you use this between 10pi and 20pi, you get a function that broadly increases on average and has many local minima.
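A short R sketch of the example function above, evaluated on a grid between 10pi and 20pi. The grid size and the simple peak detection are choices made for illustration. By the sum-to-product identity the expression equals (sin(x) * (1 + 2*cos(0.1*x)))^2, so its peaks follow a slowly varying envelope that is bounded by 9.

```r
f <- function(x) (sin(1.1 * x) + sin(x) + sin(0.9 * x))^2

x <- seq(10 * pi, 20 * pi, length.out = 2001)
y <- f(x)

# Grid points that are higher than both neighbours (local maxima)
is_peak <- c(FALSE, diff(sign(diff(y))) == -2, FALSE)
peaks   <- data.frame(x = x[is_peak], y = y[is_peak])

plot(x, y, type = "l")
points(peaks$x, peaks$y, pch = 19)
```

Mapping the 0-100% input onto this interval (or onto a different one per player) gives many local extrema of different heights.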
Modulating a simple linear or exponential function with trigonometric functions whose frequency and amplitude are dependent on the input should get you what you want.
You don't need a formula, I think — throw a bunch of random values around your domain, and then interpolate (linear interpolation will do) between them. Then you can even change the "formula" completely each time the game is run, or once in a while, or change it slowly with time, etc, etc.
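The random-values-plus-interpolation idea can be sketched in R like this. The knot spacing, the value range and the function name hardness are hypothetical, and set.seed() is used here only so that the same curve comes back on every run.

```r
set.seed(123)                       # fixed seed: same "formula" every run
knots_x <- seq(0, 100, by = 5)      # input percentages where values are drawn
knots_y <- runif(length(knots_x))   # random property values in [0, 1]

# Linear interpolation between the random knots
hardness <- approxfun(knots_x, knots_y)

hardness(c(0, 37.5, 100))           # evaluate at arbitrary percentages
```

Changing the seed, or redrawing knots_y once in a while, changes the whole curve without touching the rest of the game logic.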
If you want something that is very hard to predict then I would suggest involving a random number generator with the same seed every time. You can use it as an envelope for whatever function you come up with (trig functions or what not) to make it more jagged.
An interesting formula to use would be that of gamma of the Black-Scholes options pricing model. It goes as follows:
You can easily replace the variables, here's a graph of how the function looks:
http://www.sqbimmer.com/aalex/gamma.png
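For reference, the standard Black-Scholes gamma for a non-dividend-paying asset is dnorm(d1) / (S * sigma * sqrt(tau)), with d1 = (log(S/K) + (r + sigma^2/2) * tau) / (sigma * sqrt(tau)). A minimal R sketch follows; the function name bs_gamma, the argument names and the parameter values are assumptions for this example, not taken from the answer.

```r
# Black-Scholes gamma: S = spot, K = strike, r = rate,
# sigma = volatility, tau = time to expiry (in years)
bs_gamma <- function(S, K, r, sigma, tau) {
  d1 <- (log(S / K) + (r + sigma^2 / 2) * tau) / (sigma * sqrt(tau))
  dnorm(d1) / (S * sigma * sqrt(tau))
}

# Gamma as a function of S gives a single skewed bump near the strike
S <- seq(50, 150, by = 1)
g <- bs_gamma(S, K = 100, r = 0, sigma = 0.2, tau = 1)
plot(S, g, type = "l")
```

In the game, S could be replaced by the input percentage and the other parameters varied to move and reshape the bump.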
