I am trying to create a boxplot in R of an extremely large data set. The file containing the data is 2.5 GB and crashes R if I try to import it. Fortunately, some other piece of (Python) software can generate the mean and variance without a problem, which is all I really want to plot (for now).
Every tutorial I've found so far requires you to input the full data set, and then R computes the statistics itself, but I was wondering how to pass the mean, median, min, max, etc. to bwplot just for plotting. The reason I prefer R and lattice is that they integrate well with the software suite the code might end up in. If I used MATLAB or some other software, that would be a problem, because it would be yet another requirement for our current users.
Boxplots do not plot the mean or variance. You actually need the full ranked data to plot a proper boxplot, because the quantities involved are the median, the quartiles, and the most extreme data points within 1.5 times the IQR of the quartiles, plus all data points outside that range (outliers). This is typically not a good idea for a large data set (because by definition you will have millions of outliers).
That said, you can generate the essential summaries any way you want and use bxp to plot them - see ?bxp in R. Just make sure you clarify what quantities you are plotting if they are not the above.
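If your summaries are computed elsewhere, you can assemble the list that bxp() expects by hand. A minimal sketch with invented numbers (the five rows of stats are lower whisker, first quartile, median, third quartile, upper whisker; in practice they would come from the external Python computation):

```r
# Assemble the summary object that bxp() expects.
# All numbers here are invented, purely for illustration.
z <- list(
  stats = matrix(c(1.2, 3.1, 4.0, 5.2, 7.9), nrow = 5),
  n     = 1e7,           # number of observations behind the box
  out   = numeric(0),    # outliers omitted here
  group = numeric(0),
  names = "my variable"
)
bxp(z)
```

As the answer says, if the five rows are not the usual boxplot quantities, label the plot accordingly so readers are not misled.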
I am trying to run a PCA on a dataframe which is accompanied by a metadata table. The PCA table is all normalized, scaled, etc.; the metadata, however, is not.
I want the PCA to cluster not only based on the dataframe, but also to have the option of adding one or multiple columns from the metadata table as explanatory variables. Again, these are not scaled and normalized with the main dataset. Also, I am not looking to color the plots by a certain data column; I want the column to be considered for the actual clustering.
I am aware that this sounds kind of vague, but I am having a hard time finding the exact words. After looking around for a bit, I found demixed PCA, which seems to be very close to what I want to achieve. Sadly, there is no R package to run it.
Any recommendations are welcome and thank you in advance.
I am working on my bachelor thesis, where I want to look into the lagged cross-correlation of a time series of search query volumes (= x) with the price of Bitcoin (= y).
I have already created several ccf plots using the ccf function in R; see picture:
I saw in the documentation of R's acf function that ccf only works with one y and one x series. I was wondering if someone knows a way to combine several of those plots into one, especially since I can categorize them into positively and negatively correlated ones.
Further, I was wondering about the dashed blue line: it represents the confidence bound, but at what level? 0.05? 0.01?
These are two questions in one.
1. question: combine plots
This question has been asked before. Please look it up:
Combining plots created by R base, lattice, and ggplot2
Combine plots in R
2. question: confidence intervals in ccf-plot:
The plot gives you the confidence intervals. The manual advises caution with these, even though ci.type = "white" is the default setting. This default bluntly adds confidence bounds based on the quantiles of a standard normal distribution; it does not take the statistical properties of your data into account. In my opinion it is altogether useless. The manual recommends ci.type = "ma", but that will only work for autocorrelations. If you try using it with cross-correlations, you will get a warning saying "can use ci.type=‘ma’ only if first lag is 0". When doing autocorrelations, the function shifts the sequence from -k to +k and allows the first lag to be zero; ccf does not.
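As for the level of the dashed line: plot.acf (which ccf's plotting goes through) uses ci = 0.95 by default, so the band is a 95% interval (significance level 0.05), drawn at plus or minus qnorm((1 + ci)/2)/sqrt(n) under the white-noise null. A small sketch with simulated data:

```r
# Reproduce the default dashed bounds of ccf()'s plot by hand.
# ci = 0.95 is plot.acf's default, i.e. a 95% band (level 0.05).
set.seed(1)
x <- rnorm(200)
y <- rnorm(200)

n     <- length(x)
ci    <- 0.95
bound <- qnorm((1 + ci) / 2) / sqrt(n)   # roughly 1.96 / sqrt(n)

r <- ccf(x, y, lag.max = 20, plot = FALSE)
sum(abs(r$acf) > bound)   # how many lags cross the default band
```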
Further support
I hope it is not against the code of conduct to offer further support.
The ccf function has some peculiarities that aren't well explained in the manual. Since I had trouble with ccf myself, I wrote it all down here for everybody.
Because I wanted meaningful confidence intervals, I developed an improved version of ccf myself (link to the repository, in case anyone is interested). It offers confidence intervals. The ccf object returned by the new function is compatible with the output of stats::ccf() but contains more information, and additional helper functions make it more useful.
I can perform a one-sample t-test in R with the t.test command, but it requires an actual set of data; I can't use summary statistics (sample size, sample mean, standard deviation). I can work around this using the BSDA package, but are there any other ways to accomplish this one-sample t-test in R without the BSDA package?
Many ways. I'll list a few:
directly calculate the p-value by computing the statistic and calling pt with it and the df as arguments, as commenters suggest above (it can be done in a single short line of R; ekstroem shows the two-tailed case, and for a one-tailed test you wouldn't double it)
alternatively, if it's something you need a lot, you could convert that into a nice robust function, even adding in tests against non-zero mu and confidence intervals if you like. Presumably if you go this route you'll want to take advantage of the functionality built around the htest class
(code and even a reasonably complete function can be found in the answers to this stats.SE question.)
If samples are not huge (smaller than a few million, say), you can simulate data with the exact same mean and standard deviation and call the ordinary t.test function. If m and s and n are the mean, sd and sample size, t.test(scale(rnorm(n))*s+m) should do (it doesn't matter what distribution you use, so runif would suffice). Note the importance of calling scale there. This makes it easy to change your alternative or get a CI without writing more code, but it wouldn't be suitable if you had millions of observations and needed to do it more than a couple of times.
call a function in a different package that will calculate it -- there are at least one or two other such packages (you don't make it clear whether using BSDA was a problem or whether you wanted to avoid packages altogether)
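The first two approaches above can be sketched as follows, with made-up summary statistics (m, s, n, and the null mean mu0 are placeholders):

```r
# Placeholder summary statistics, purely for illustration.
m <- 5.4; s <- 2.1; n <- 30; mu0 <- 5

# (1) Compute the t statistic and two-tailed p-value directly:
tstat <- (m - mu0) / (s / sqrt(n))
p <- 2 * pt(-abs(tstat), df = n - 1)

# (2) Simulate data with exactly that mean and sd, then use t.test();
# scale() forces the sample to mean 0 and sd 1 before rescaling:
x <- scale(rnorm(n)) * s + m
out <- t.test(x, mu = mu0)

all.equal(p, out$p.value)   # the two p-values agree
```

Note that scale() standardizes using the n - 1 denominator, the same one t.test uses internally, which is why the simulated sample reproduces the p-value exactly.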
At work when I want to understand a dataset (I work with portfolio data in life insurance), I would normally use pivot tables in Excel to look at e.g. the development of variables over time or dependencies between variables.
I remembered from university the nice R function where you can plot every column of a dataframe against every other column, as in:
For the dependency between issue.age and duration, this plot is actually interesting, because you can clearly see that high issue ages come with shorter policy durations (there is a maximum age for each policy). However, the plots involving the issue year iss.year are much less "visual"; in fact, you can't see anything from them. I would like to see at a glance whether the distribution of issue ages has changed over the different issue years, something like
where you could see immediately that the average age of newly issued policies has been increasing from 2014 to 2016.
I don't want to write code that needs to be customized for every dataset that I put in because then I can also do it faster manually in Excel.
So my question is, is there an easy way to plot each column of a matrix against every other column with more flexible chart types than with the standard plot(data.frame)?
Use the ggpairs() function from the GGally package. It has a lot of capability for visualizing columns of all different types, and provides a lot of control over what to visualize.
For example, here is a snippet from the vignette linked to above:
library(GGally)
data(tips, package = "reshape")
ggpairs(tips)
I run a bunch of simulations to evaluate type I error, so the result is a vector such as
pdata = c(0,0,0,0,0,0,0,0,0,0.07,0,0.02,0.03)
The mean of the simulated vector should be 0.05. Now I am thinking of a way to display the results via boxplots. The default function in R
boxplot(pdata)
gives a boxplot in which it is rather hard to see the typical value, as there are many 0's. In addition, it shows the median, but what I really want displayed on the plot is the mean. Is there any graphical display that is effective in such a situation? I know that I can simply report the numerical values, but because my simulation involves other factors which I hope to compare, a boxplot-like graph would be ideal. Thanks!
Something like this maybe :
plot(table(pdata))
Here is a ggplot2 version:
ggplot(as.data.frame(table(pdata)), aes(x = pdata, y = Freq)) + geom_col()
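If you also want the mean marked on the plot (the original motivation), one simple option with the base-graphics version is to add a dashed vertical line with abline():

```r
pdata <- c(0,0,0,0,0,0,0,0,0,0.07,0,0.02,0.03)

# Frequency of each distinct p-value, with the sample mean
# indicated by a dashed vertical line.
plot(table(pdata))
abline(v = mean(pdata), lty = 2)
```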