zoo object cannot contain both factor and numeric vectors [closed] - r

I was reading the zoo FAQs, and came across something that I found surprising.
A "zoo" object may be (1) a numeric vector, (2) a numeric matrix or
(3) a factor but may not contain both a numeric vector and factor.
Is it unreasonable to expect this to hold? And what are the reasons that this cannot be implemented in zoo? Basically, I would like to think of a zoo object as a dataframe with time ordering.

zoo objects are a matrix with an index attribute. Therefore, you cannot mix types in zoo for the same reason you cannot mix types in a matrix (i.e. a matrix is just a vector with a dim attribute and you can't mix types in a vector).
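A minimal base-R illustration of that coercion (the column names here are made up):

# a matrix, and hence the data part of a zoo object, holds a single type;
# mixing types forces coercion to a common one
m <- cbind(x = 1:3, y = c("a", "b", "a"))
str(m)                                # everything has become character
cbind(1:3, factor(c("a", "b", "a")))  # the factor is reduced to its integer codes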

You write
Basically, I would like to think of a zoo object as a dataframe with
time ordering.
and you are simply off-base here. "Wishing alone" does not make it so. In a nutshell, zoo and xts can cope with a numeric matrix (or a vector as a special case; both really are vectors with or without a dim attribute), and the factor support is already a stretch.
In all the years zoo has existed, data.frame has never been a supported data type, and it never will be, due to internal architectural and implementation choices. Performance on data.frame objects is also worse.
But you could consider data.table as an alternative.
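As a sketch of what that looks like, a keyed data.table can mix numeric and factor columns while staying ordered by time (the column names below are invented):

library(data.table)

dt <- data.table(time  = as.Date("2013-01-01") + 0:2,
                 value = rnorm(3),
                 group = factor(c("a", "b", "a")))
setkey(dt, time)   # rows are kept sorted by the time key
dt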

Related

Visualizing of frequency of integer data in R [closed]

When I try to visualize my integer data with histogram(mydata, breaks=c(0,n)), R usually ignores how many breaks I ask for (usually one bar per value) and plots n-1 bars instead (the first two bars are summed into one).
In most cases I use barplot(table(mydata))
There is one more way to do it:
How to separate the two leftmost bins of a histogram in R
but I don't think that is a clean way.
So how do you visualize the frequency of your integer data? Which approach is right?
Thanks a lot.
hist(dataset, breaks=seq(min(dataset)-0.5, max(dataset)+0.5, by=1) )
Another option (for those situations where you know these are integers) would be:
require(lattice)
barchart(table(dataset), horizontal=FALSE)
Or:
barplot(table(dataset))
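A small self-contained comparison of the two approaches, using a made-up integer sample:

set.seed(1)
mydata <- rpois(100, lambda = 3)

# half-integer breaks give exactly one bar per integer value
hist(mydata, breaks = seq(min(mydata) - 0.5, max(mydata) + 0.5, by = 1))

# explicit counts per value
barplot(table(mydata))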

RNG (random number generators) [closed]

I'm new to the field of random number generators. I would like to use the Mersenne-Twister algorithm since it has the longest period compared to other algorithms.
Which R function implements this algorithm? I looked at ?sample, but there is no information there about which algorithm is used.
Another question is: which is the best seed to set in the random number generation?
Finally: is R the best tool to generate random numbers?
The default algorithm used by R is Mersenne-Twister.
There is no best seed. It depends on your application. Do you want it to be the same set of numbers every time you run your code? Use the same seed(s). If not, perhaps using the current time will suit your needs.
The best tool to generate random numbers is something that does not use a deterministic PRNG (such as Mersenne-Twister). Instead look into something such as random.org. I think it will really benefit you to read up on True randomness vs. Pseudo randomness.
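A quick sketch of checking the generator and controlling the seed in R:

RNGkind()      # first element reports the generator in use; "Mersenne-Twister" by default
set.seed(42)   # fix the seed for a reproducible stream
runif(3)
set.seed(42)
runif(3)       # identical to the previous three draws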

When is it worth using `data.table`? When can I expect the largest performance gains? [closed]

I just spent some time researching data.table in R and was wondering about the conditions under which I can expect the largest performance gains. Maybe the simple answer is: when I have a large data.frame and often operate on subsets of it. When I just load data files and estimate models I can't expect much, but many [ (subsetting) operations make the difference. Is that true and the only answer, or what else should I consider? When does it start to matter? 10x5, 1,000x5, 1,000,000x5?
Edit: Some of the comments suggest that data.table is often faster and, equally important, almost never slower. So it would also be good to know when not to use data.table.
There are at least a few cases where data.table shines:
Updating an existing dataset with new results. Because data.table is modified by reference, this is massively faster (see the sketch after this list).
Split-apply-combine type strategies with large numbers of groups to split over (as #PaulHiemstra's answer points out).
Doing almost anything to a truly large dataset.
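A minimal sketch of the by-reference update mentioned in the first point (the table and column names are invented):

library(data.table)

dt <- data.table(id = 1:5, x = rnorm(5))
# := adds or updates the column in place, without copying the whole table
dt[, x_scaled := (x - mean(x)) / sd(x)]
dt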
Here are some benchmarks:
Benchmarking data.frame (base), data.frame(package dataframe) and data.table
One instance where data.table is veeeery fast is in the split-apply-combine type of work which made plyr famous. Say you have a data.frame with the following data:
precipitation  time  station_id
         23.3     1          A01
         24.1     2          A01
         26.1     1          A02
...
When you need to average per station id, you can use a host of R functions, e.g. ave, ddply, or data.table. If the number of unique elements in station_id grows, data.table scales really well, whilst e.g. ddply gets really slow. More details, including an example, can be found in this post on my blog. That test suggests that speed increases of more than 150-fold are possible, and the difference can probably be much bigger still.
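As a rough sketch, the data.table version of that per-station average looks like this (using a toy version of the data above):

library(data.table)

dt <- data.table(precipitation = c(23.3, 24.1, 26.1, 25.0),
                 time          = c(1, 2, 1, 2),
                 station_id    = c("A01", "A01", "A02", "A02"))

dt[, .(mean_precip = mean(precipitation)), by = station_id]   # mean per station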

Where can I find a basic implementation of the EM clustering algorithm for R? [closed]

I'm looking for a basic implementation of EM clustering in R. So far, what I can find seem to be specialized or 'some-assembly-required' versions of it. For example, the implementation from mclust defines a range of parameters that I'm not familiar with and doesn't take a parameter for k. What I am looking for is something closer to the kmeans implementation that comes with R, or ELKI's implementation of EM.
How about reading the documentation for mclust?
http://cran.r-project.org/web/packages/mclust/mclust.pdf
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Clustering/Expectation_Maximization_%28EM%29
Make sure to choose the desired model (probably VVV?), and if you want a fixed k, then set G to a single value instead of the default 1:9.
Try this:
library(mclust)
m <- Mclust(data, G = 4, modelNames = "VVV", control = emControl(tol = 1e-4))
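Assuming that call succeeds, the fit can be inspected with the usual mclust accessors:

summary(m)               # chosen model, BIC, mixing proportions
head(m$classification)   # hard cluster assignment for each observation
plot(m, what = "classification")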
I must say I don't use or like R much. It has tons of stuff, but it doesn't fit together: it is largely code written independently by different people and uploaded to a central repository, but there is no QA at all and nobody who makes the libraries compatible.

difference between ff and filehash package in R [closed]

I have a dataframe composed of 25 columns and ~1M rows, split across 12 files. I need to import them and then use a reshape package for some data management. Each file is so large that I have to look for some "non-RAM" solution for importing and processing the data. Currently I don't need to do any regression; I will only compute some descriptive statistics on the dataframe.
I searched a bit and found two packages: ff and filehash. I read the filehash manual first and it seems simple: just add some code to import the dataframe into a file, and the rest looks similar to usual R operations.
I haven't tried ff yet, as it comes with lots of different classes, and I wonder whether it is worth investing time in understanding ff itself before my real work begins. The filehash package, however, seems to have been static for some time and there is little discussion about it, so I wonder whether filehash has become less popular, or even obsolete.
Can anyone help me choose which package to use? Or can anyone tell me the differences and pros and cons between them? Thanks.
Update 01
I am currently using filehash to import the dataframe, and I have realized that a dataframe imported with filehash should be considered read-only: further modifications to that dataframe are not stored back to the file unless you save it again, which is not very convenient in my view, as I need to remember to do the saving. Any comment on this?
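For reference, a sketch of the explicit write-back pattern with filehash (chunk1 and col1 are hypothetical names):

library(filehash)

dbCreate("mydb")                 # one-time creation of the database file
db <- dbInit("mydb")

dbInsert(db, "block1", chunk1)   # store an imported chunk
x <- dbFetch(db, "block1")       # this is an in-memory copy
x$flag <- x$col1 > 0             # modifications only touch the copy...
dbInsert(db, "block1", x)        # ...so they must be written back explicitly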
