Correlation between 3 continuous variables - plot

I am pretty new to statistics, and I am stuck with this.
I have the data containing birth weight, length of baby and head circumference.
I need to provide an answer to how are they related.
How can I do that?
Thank you very much :)
I was thinking doing the Pearson test between each pair.

You could
First just start with finding the correlations (Pearson or any other kind) between the pairs.
Then You probably would want to plot them on X & Y axes (scatter plot) for each pair to visualize. In the plot you will see if higher values of one variable are associated (or not) with the other variable's higher values.
Lastly, you could also check
Each variable's distribution using a box plot. This will tell you the mean, median and the standard deviation.

Related

How to account for outliers in a histogram? - R/Matlab

I am wondering if there is a way to account for outlier in a histogram plot. I want to plot the frequencies of a random variable, which is very small and distributed around zero. However, in most of the cases I am considering I also have an outlier that complicates things. Is there a way to adjust the scale of the x axis in R/Matlab so that I can capture the distribution of the random variable I am considering and also show the outlier? Because normal ways to obtain the plot result in such a scale that all values are considered to be zero, and I want to show how they are distributed around zero. So ideally I would like to have the scales around zero accounting for very small numbers and than after a gap (which does not necessarily have to be proportional to the actual distance from zero) a bin to indicate the value of the outlier. And I do not want to remove the outlier from the sample.
Is such a thing possible in R/Matlab? Any other suggestions would be welcome.
Edit: The problem is not in identifying the outliers and using a different color for them. The problem is in adjusting the scales on the x-axis so I can observe the distribution of the variable as well as have the outlier included in the plot.
The next code will do the job, but you need to change the Xticklabels of the axes in order to make them show the real value of the outliers.
A=rand(1000,1)*0.1;
A(1:10)=10;
% modify the data for plotting pourposes. Get the outliers closer
expected_maximum_value=1; % You can compute this useg 3*sigma maybe?
distance_to_outliers=0.5;
outlier_mean=mean(A(A>expected_maximum_value));
A(A>expected_maximum_value)=A(A>expected_maximum_value)-outlier_mean+distance_to_outliers;
% plot
h=histogram(A,'BinWidth',0.01)
%% trick the X axis
ax=gca;
ax.XTickLabel{end-1}=[ax.XTickLabel{end-1} '//'];
ax.XTickLabel{end}=['//' num2str(outlier_mean)];

Is there a way to plot a frequency histogram from a continuous variable?

I have DNA segment lengths (relative to chromosome arm, 251296 entries), as such:
0.24592963
0.08555043
0.02128725
...
The range goes from 0 to 2, and I would like to make a continuous relative frequency plot. I know that I could bin the values and use a histogram, but I would like to show continuity. Is there a simple strategy? If not, I'll use binning. Thank you!
EDIT:
I have created a binning vector with 40 equally spaced values between 0 and 2 (both included). For simplicity's sake, is there a way to round each of the 251296 entries to the closest value within the binning vector? Thank you!
Given that most of your values are not duplicated and thus don't have an easy way to derive a value for plotting on the y-axis, I'd probably go for a density plot. This will highlight dense segment lengths i.e. where you have lots of segment lengths occurring near each other.
d <- c(0.24592963, 0.08555043, 0.02128725)
plot(density(d), xlab="DNA Segment Length", xlim=c(0,2))

Understanding what the kde2d z values mean?

I have two data sets that I am comparing using a ked2d contour plot on a log10 scale,
Here I will use an example of the following data sets,
b<-log10(rgamma(1000,6,3))
a<-log10((rweibull(1000,8,2)))
density<-kde2d(a,b,n=100)
filled.contour(density,color.palette=colorRampPalette(c('white','blue','yellow','red','darkred')))
This produces the following plot,
Now my question is what does the z values on the legend actually mean? I know it represents where most the data lies but 0-15 confuses me. I thought it could be a percentage but without the log10 scale I have values ranging from 0-1? And I have also produced plots with scales 1-1.2, 1-2 using my real data.
The colors represent the the values of the estimated density function ranging from 0 to 15 apparently. Just like with your other question about the odd looking linear regression I can relate to your confusion.
You just have to understand that a density's integral over the full domain has to be 1, so you can use it to calculate the probability of an observation falling into a specific region.

Probability distribution values plot

I have
probability values: 0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04 ,summing up to 1
the corresponding intervals where a stochastic variable may take its value with the corresponding probability from the above list:
126,162,233,304,375,446,517,588,659,730,801,839
How can I plot the probability distribution?
On the x axis, the interval values, between the intervals histogram with the probability value?
Thanks.
How about
x <- c(126,162,233,304,375,446,517,588,659,730,801,839)
p <- c(0.06,0.06,0.1,0.08,0.12,0.16,0.14,0.14,0.08,0.02,0.04)
plot(x,c(p,0),type="s")
lines(x,c(0,p),type="S")
rect(x[-1],0,x[-length(x)],p,col="lightblue")
for a quick answer? (With the rect included you might not need the lines call and might be able to change it to plot(x,p,type="n"). As usual I would recommend par(bty="l",lty=1) for my preferred graphical defaults ...)
(Explanation: "s" and "S" are two different stair-step types (see Details in ?plot): I used them both to get both the left and right boundaries of the distribution.)
edit: In your comments you say "(it) doesn't look like a histogram". It's not quite clear what you want. I added rectangles in the example above -- maybe that does it? Or you could do
b <- barplot(p,width=diff(x),space=0)
but getting the x-axis labels right is a pain.

Lorentz curve plot

I need to get a plot of a Lorentz curve of a cumulative variable as a function of the number of observations. I want both axes to be displayed on a percentage basis (e.g. say observations are the number of buyers and the y variable is the amount they bought, buyers are already ranked in descending order, I want to get the plot that says "The top 10% buyers purchased 90% of the total bought"). My dataset is a couple million observations.
What is the best way to do this? Sub-questions:
If I need to add two variables for the quantiles of total observations and total $ bought (so as to use them to plot), what is the object that returns the row number? I tried:
user_quantile <- row(df)/nrow(df)
but I get a matrix of identical columns (user_quantile.1, user_quantile.2) of which I only need one column.
Is there instead any way to skip adding percentages as variables and only have them for axes values?
The plot has way to many points than I need to get the line. What is the best approach to minimize the computational effort and get a nice graph?
Thanks.
You may want to acquaint yourself with the excellent RSeek search engine for R content. One quick query for Lorentz curve (and Lorenz curve) lead to these packages:
ineq: Measuring inequality, concentration, and poverty
reldist: Relative Distribution Methods
GeoXp: Interactive exploratory spatial data analysis
lawstat: An R package for biostatistics, public policy and law
all of which seem to supply a Lorenz curve function.
In order to get the plot done you need first to arrange the raw data.
1) You can use the cut2() function from the Hmisc package to cut the data in quantiles. Check the documentation, it's not hard. It's similar to the cut() from the base package.
2) After using the cut2() function with the income data, you need to compute the frequency of each decile. Use table() for that. Then calculate percentages of income for each decile.
3) Now you should have a very small table with the following columns:
Decile, cumulative % of total income.
Add another column with the 45 degree line. Just add a constant cumulative % of income.
finaltable$cumulative_equality_line = seq(0.1, 1, by = 0.1)
4) You can use base graphics or ggplot2 for plotting. I guess you can do it with the info of step 3 or perhaps check out specific plotting questions.
I'll have to do it soon, but i already have the final table. I'll post the code for plotting once i do it.
Good luck!

Resources