This is more of a general question, but since I am using R, I have tagged it accordingly.
My training data set has 15,000 entries, of which around 20 I would like to use as the positive set for building the SVM. I wanted to use the remaining, resampled data as my negative set, but I was wondering whether it might be better to make the negative set the same size (around 20), since otherwise the data are highly imbalanced. Is there an easy way in R to pool the classifiers (ensemble-based) after 1,000 rounds of resampling, ideally with the e1071 package?
Follow-up question: I would like to calculate a score for each prediction afterwards; is it fine to just multiply the probabilities by 100?
Thanks.
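For concreteness, here is a minimal sketch of the resampling/ensemble idea described above, assuming a hypothetical data frame dat with a factor column label (levels "pos"/"neg") and numeric feature columns; all object names are made up.

    library(e1071)

    pos <- dat[dat$label == "pos", ]   # the ~20 positive entries
    neg <- dat[dat$label == "neg", ]   # the remaining entries

    n_rounds <- 1000
    models <- vector("list", n_rounds)

    set.seed(1)
    for (i in seq_len(n_rounds)) {
      # draw a balanced negative subset of the same size as the positive set
      neg_sample <- neg[sample(nrow(neg), nrow(pos)), ]
      train <- rbind(pos, neg_sample)
      models[[i]] <- svm(label ~ ., data = train, probability = TRUE)
    }

    # pool the ensemble by averaging the predicted probability of the positive class
    pool_predict <- function(models, newdata) {
      probs <- sapply(models, function(m) {
        p <- predict(m, newdata, probability = TRUE)
        attr(p, "probabilities")[, "pos"]
      })
      rowMeans(probs)
    }

As for the follow-up: multiplying the pooled probability by 100 does give a 0-100 score, but that score is only as meaningful as the underlying probability estimates themselves.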
You can try the "class weight" approach, in which the smaller class gets more weight, so that misclassifying the positively labelled class incurs a higher cost.
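A minimal sketch of that approach, assuming the same hypothetical data frame dat with a factor label (levels "pos"/"neg"): e1071::svm() accepts a named class.weights vector, so weighting each class inversely to its frequency makes errors on the rare positive class much more expensive.

    library(e1071)

    counts <- table(dat$label)
    w <- as.numeric(max(counts) / counts)   # roughly 1 for the majority class, ~750 for ~20 positives
    names(w) <- names(counts)

    fit <- svm(label ~ ., data = dat, class.weights = w, probability = TRUE)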
I have a time series of 16 values (numbers of vehicles) from 2001 to 2016. I wanted to predict, based on the underlying trend, the values up to 2050 (which I agree is a long shot).
After doing some research, I found that this can be done with methods like HoltWinters or TBATS, even though that did not fit my original plan of using some machine-learning algorithm.
I am using R for all my work. After using the HoltWinters() and then forecast() methods, I did get a curve extrapolated up to 2050, but it is a simple exponential curve from 2017 to 2050 that I think I could have obtained with back-of-the-envelope calculations.
My question is twofold:
1) What would be the best approach to obtain a meaningful extrapolation?
2) Can my current approach be modified to give me a more meaningful extrapolation?
By meaningful I mean a curve whose details are closer to reality.
Thanks a lot.
I suspect you need more data to make such predictions. HoltWinters or TBATS may work, but there are many other ML models for time-series data you can try.
http://a-little-book-of-r-for-time-series.readthedocs.io/en/latest/src/timeseries.html
That page has sample R code for HoltWinters and the corresponding plots.
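For reference, here is a hedged sketch of that workflow on a made-up annual series (the real input would be the 16 observed values). HoltWinters() is called with gamma = FALSE because annual data has no seasonal component, and tbats() is included as the alternative mentioned in the question.

    library(forecast)

    # synthetic stand-in for the 16 annual vehicle counts, 2001-2016
    set.seed(1)
    vehicles <- ts(round(100 + 10 * (1:16) + rnorm(16, sd = 5)),
                   start = 2001, frequency = 1)

    fit_hw <- HoltWinters(vehicles, gamma = FALSE)   # Holt's linear trend, no seasonality
    fc_hw  <- forecast(fit_hw, h = 2050 - 2016)      # 34 years ahead
    plot(fc_hw)

    fit_tbats <- tbats(vehicles)                     # TBATS alternative
    fc_tbats  <- forecast(fit_tbats, h = 2050 - 2016)

With only 16 observations, any of these models will essentially extrapolate the fitted level and trend, which is why the curve beyond 2017 looks so smooth.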
This is a borderline question for Stack Overflow, I know, but I am looking for a package. If I can't get an answer here I will move it to https://stats.stackexchange.com/. I am looking for an R package or a method to create a phase diagram. That is, I have e.g. two variables, such as air pressure and temperature, and a binary variable (to keep it simple) indicating whether the substance is liquid or frozen. Below is a typical example of a phase diagram. I need to estimate the transition borders, but only for a case with two groups. Every hint is appreciated.
I think about the closest you will get is the function diagram in the package CHNOSZ. There is a lot to read about in this package and it has some nice vignettes. But it seems to calculate phase diagrams from first principles or theory. Perhaps if you look at the code for diagram you can figure out a fairly easy way to use it with your empirical data.
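If a purely empirical transition border between the two groups is enough, a rough alternative sketch (plain base R rather than CHNOSZ, with hypothetical column names pressure, temperature and a 0/1 state) is to fit a logistic regression and draw its 0.5-probability contour:

    # phases: hypothetical data frame with pressure, temperature and state (0 = frozen, 1 = liquid)
    fit <- glm(state ~ pressure + temperature, data = phases, family = binomial)

    pr <- seq(min(phases$pressure),    max(phases$pressure),    length.out = 200)
    te <- seq(min(phases$temperature), max(phases$temperature), length.out = 200)
    grid <- expand.grid(pressure = pr, temperature = te)
    grid$p <- predict(fit, newdata = grid, type = "response")

    plot(temperature ~ pressure, data = phases,
         col = ifelse(phases$state == 1, "blue", "red"), pch = 19)
    contour(pr, te, matrix(grid$p, nrow = length(pr)),
            levels = 0.5, add = TRUE, drawlabels = FALSE)   # estimated transition border

A plain linear logistic fit gives a straight border; adding polynomial or interaction terms (or a more flexible classifier) would let the border curve.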
Maybe my question will not be specific enough, but when fitting a generalized linear mixed-effects model (using the lme4 package in R) I get SE = 1000 for one of the parameters, together with an estimate as high as 16. The variable is dichotomous. My question is whether there might be an explanation for such a result, considering that the other parameters have estimates and SEs that seem fine.
That is a sign that you have complete separation. You should re-run the model without that covariate. Since it's a mixed-effects model, you may need to tabulate the outcome by covariate by grouping levels to see what is happening. More details would allow greater specificity in our answers.
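As a quick check (hypothetical object names: dat holds the model data with the binary outcome, the dichotomous covariate and the grouping factor), a zero cell in the cross-tabulation is the tell-tale sign of separation:

    with(dat, table(outcome, covariate))           # overall: a zero cell suggests separation
    with(dat, table(outcome, covariate, group))    # broken down by grouping levels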
This is a link to a posting by Jarrod Hadfield, one of the guRus on the R mixed-models mailing list. It demonstrates how complete separation leads to the Hauck-Donner effect, and it offers some further approaches for dealing with it.
You may be seeing a case of the Hauck-Donner effect. Here is one post that discusses it; you can also read the original paper or search the web for additional discussions.
I'm trying to perform correlation analysis with R's linear model, lm(). I'm wondering what a reasonable minimum sample size for it is. Is there any rule for determining that?
"As a rule of thumb, 20, 30, 1000 samples?" As a rule of thumb, you should be wary of rules of thumb, excluding perhaps that "less is more, except of course for sample size" (Cohen & Cohen, 1983: 169-171).
You could ask your question on https://stats.stackexchange.com/, but they're probably going to give you answers that won't be the round number you're looking for. For example:
Is the number 20 magic?
Is there a reference that suggest using 30 as a large enough sample size?
Rules of thumb for minimum sample size for multiple regression
What is a reasonable sample size for correlation analysis for both overall and sub-group analyses?
30 Samples. Standard, Suggestion, or Superstition?
etc.
You'll get more useful responses if you edit your question here to include a reproducible example that resembles your actual use-case and then ask for help coding calculations of relevant measures of error. You might explore the pwr package before you edit your question (see here for examples: http://www.statmethods.net/stats/power.html).
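As a hedged illustration of the kind of calculation pwr does (the effect sizes and the 0.05 / 0.80 conventions below are arbitrary placeholders, not recommendations):

    library(pwr)

    # sample size needed to detect a correlation of r = 0.3 at alpha = 0.05 with 80% power
    pwr.r.test(r = 0.3, sig.level = 0.05, power = 0.80)

    # regression analogue: u = number of predictors, f2 = effect size;
    # the result is v (denominator df), and n is approximately v + u + 1
    pwr.f2.test(u = 1, f2 = 0.15, sig.level = 0.05, power = 0.80)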
Do a bit of googling to find the names of error measures you think will be useful to you. You might start with these:
Lenth, R. V. (2001). Some practical guidelines for effective sample size determination. The American Statistician, 55, 187-193.
Wheeler, R. E. (1974). Portable power. Technometrics, 16, 193-201.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
I have a species-abundance dataset with quite a few zeros in it, and even when I set trymax = 1000 for metaMDS() the program is unable to find a stable solution for the stress. I have already tried combining data (collapsing multiple years together to reduce the number of zeros) and I can't do any more of that. I was just wondering if anyone knows: is it scientifically valid to pick what R gives me at the end (the lowest-stress of the 1000 solutions), or should I not be using NMDS at all because it cannot find a stable solution? There seems to be very little information about this on the internet.
One explanation for this is that you are trying to use too few dimensions for the mapping. I presume you are using the default k = 2? If so, try k = 3 and compare its stress with that of the best solution you got from the 1000 tries at k = 2.
I would be a little concerned about taking one solution out of 1000 just because it had the best/lowest stress.
You could also try 1000 more random starts and see whether it converges with more iterations. If you saved the output from metaMDS(), you can supply that object to another call to metaMDS() via the previous.best argument. It will then do trymax further random starts, compare any lower-stress solutions with the previous best, and declare convergence if it finds one similar to it, rather than having to find two similar low-stress solutions within the same set of 1000 starts.
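Putting those suggestions into code (a minimal sketch, assuming a hypothetical site-by-species matrix comm):

    library(vegan)

    ord2 <- metaMDS(comm, k = 2, trymax = 1000)    # the original k = 2 run
    ord3 <- metaMDS(comm, k = 3, trymax = 1000)    # one more dimension
    ord2$stress
    ord3$stress                                    # compare stress values

    # continue from the best k = 2 solution with further random starts
    ord2b <- metaMDS(comm, k = 2, trymax = 1000, previous.best = ord2)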