How to solve R linear regression graph problem?

I have a problem with the following graph: whatever variables I pass to plot(), I get a graph like this, with the points laid out on a regular grid. Does anyone know what it means?
cor.test() works; I get a weak correlation.
My code:
cor.test(podatki$v54, podatki$v197, method = "pearson",
         conf.level = 0.95)  # note: use = "all.obs" is an argument to cor(), not cor.test()
plot(podatki$v54, podatki$v197)

Your graph looks that way because the points are plotted directly on top of one another.
You can use the jitter(...) function to add a small amount of random noise to the data points so they are no longer exactly on top of one another (it nudges them around so you can see the ones underneath). Here is an example you can copy and paste:
# create some random numbers to plot. all are values 1-5.
x1 <- sample(c(1:5), 100, replace = TRUE)
x2 <- sample(c(1:5), 100, replace = TRUE)
# plotting without jitter
plot(x1, x2)
# plotting with jitter
plot(jitter(x1), jitter(x2))
jitter(...) changes the values by small amounts, so only use the jittered data for plotting; otherwise it will bias your results!
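If you would rather leave the values untouched, you can instead show the overlap with semi-transparent points or a sunflower plot; here is a small sketch using the same simulated data:
# keep the values exact and show overlap visually instead
x1 <- sample(1:5, 100, replace = TRUE)
x2 <- sample(1:5, 100, replace = TRUE)
# semi-transparent points: darker spots mean more overplotting
plot(x1, x2, pch = 16, cex = 2, col = rgb(0, 0, 1, alpha = 0.2))
# or draw one "petal" per duplicated point
sunflowerplot(x1, x2)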

Related

Is there a way to rescale the axes of a plot produced by plot.clusterlm (R)?

I have run cluster analysis on some time series data using permuco in R. (It permutes the labels of the control/treatment conditions and computes an F statistic for how likely it is that these time clusters of significant differences occurred by chance.)
So far so good.
I have produced a number of plots using the inbuilt function plot.clusterlm that comes with this package. However, the data come from different groups, and the F values on the y axis get rescaled in each plot, i.e. the values and ticks are reset depending on how strong the effects are.
This is problematic, because the different plots based on different cluster analyses are not visually comparable.
I would like to rescale the y axis, so that all clusters are visualised along the same F values (0-10 for example).
I haven't been able to do that, and I was wondering if there is a way to pass additional arguments to plot.clusterlm to do this.
This is the usage of the function, but I don't see a way to rescale the y axis. (Rescaling the x axis is possible by manipulating nbbaselinepts and nbptsperunit, but that's not what I want.)
plot(x, effect = "all", type = "statistic",
multcomp = "clustermass", alternative = "two.sided",
enhanced_stat = FALSE, nbbaselinepts = 0, nbptsperunit = 1, ...)
If you have any ideas on this, please let me know.
Thank you!
Thanks for using permuco! I opened an issue on GitHub to have a solution for implementing these features. You can expect changes in further releases of permuco.
However, the plot() method shows the F statistic, which is not a good measure of effect size. A better measure of effect size is partial eta-squared, which is implemented in the afex package.
In base R graphics, the axes are adjusted like this:
x <- 1:10
y <- x * x
# Simple graph
plot(x, y)
# Enlarge the scale
plot(x, y, xlim = c(1, 15), ylim = c(1, 150))
# Log scale on the y axis
plot(x, y, log = "y")
This is an example from STHDA where you can find many helpful tutorials.
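In the meantime, if you just need several base R plots on a shared F scale, a generic pattern (a sketch, not specific to plot.clusterlm) is to compute the limits once and reuse them:
# force a common y range across several plots
ys <- list(rnorm(50, sd = 1), rnorm(50, sd = 3), rnorm(50, sd = 10))
ylim_shared <- range(unlist(ys))
par(mfrow = c(1, length(ys)))
for (y in ys) plot(y, ylim = ylim_shared)
par(mfrow = c(1, 1))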

Plotting many (i.e. 1000+) distributions on one plot in R

I'm trying to plot 18,000 distributions as a heatmap-type thing in R.
One row can easily be plotted as a histogram, but since I need to represent so many, the only option I can think of is a heatmap.
This is not currently working, because all the heatmap/imaging functions seem to do some kind of clustering/comparison of the rows instead of just plotting each distribution like a histogram does.
Does anyone know how to get around this, or a better way of representing a large number of distributions?
library(plot3D)  # image2D() comes from the plot3D package
mat <- replicate(100, rnorm(100))  # named "mat" to avoid masking base::matrix
hist(mat[1, ], breaks = 60)
image2D(z = mat, border = "black")
image2D doesn't seem to do the trick...
Thanks
Edit 12/06/18: the denstrip package does the trick for anyone who needs to visualise differences in a large number of distributions.
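A minimal sketch, assuming denstrip's denstrip(x, at = ...) interface, which adds one shaded density strip per distribution to an existing plot:
library(denstrip)
m <- replicate(100, rnorm(100))
# empty canvas, then one horizontal density strip per column, stacked up the y axis
plot(range(m), c(0, ncol(m) + 1), type = "n", xlab = "value", ylab = "distribution")
for (i in 1:ncol(m)) denstrip(m[, i], at = i, width = 0.9)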
You could overlay a lot of density plots, using transparency to get a sense of the overlap.
m <- replicate(100, rnorm(100))
plot(range(m), c(0, 0.5), type = 'n')
for (i in 1:ncol(m)) lines(density(m[, i]), col = rgb(0.5, 0.5, 0.5, 0.5))
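Alternatively, to get the heatmap the question asks for without any clustering, bin every column into shared breaks yourself and hand the count matrix to base image(), which plots the matrix exactly as given:
# one histogram per column, drawn as a heatmap with image()
m <- replicate(1000, rnorm(100))
breaks <- seq(min(m), max(m), length.out = 61)
counts <- apply(m, 2, function(col) hist(col, breaks = breaks, plot = FALSE)$counts)
mids <- (head(breaks, -1) + tail(breaks, -1)) / 2
# t(counts) is distributions x bins; image() does no reordering of rows or columns
image(x = 1:ncol(counts), y = mids, z = t(counts),
      xlab = "distribution", ylab = "value")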

Set ylim() automatically

Here is some data to work with.
df <- data.frame(x1=c(234,543,342,634,123,453,456,542,765,141,636,3000),x2=c(645,123,246,864,134,975,341,573,145,468,413,636))
If I plot these data, it will produce a simple scatter plot with an obvious outlier:
plot(df$x2,df$x1)
Then I can always write the code below to remove the y-axis outlier(s).
plot(df$x2,df$x1,ylim=c(0,800))
So my question is: is there a way to exclude obvious outliers from scatterplots automatically, like outline = FALSE does for boxplots? To my knowledge, outline = FALSE doesn't work with scatterplots.
This is relevant because I have hundreds of scatterplots, and I want to exclude all obvious outlying data points without setting ylim(...) for each individual scatterplot.
You could write a function that returns the index of what you define as an obvious outlier. Then use that function to subset your data before plotting.
Here all observations with "a" exceeding 5 * median of "a" are excluded.
df <- data.frame(a = c(1,3,4,2,100), b=c(1,3,2,4,2))
f <- function(x) {
  which(x$a > 5 * median(x$a))
}
# note: if f(df) returns integer(0), df[-f(df), ] drops every row, so guard for that in real code
with(df[-f(df), ], plot(b, a))
There is no easy yes/no option to do what you are looking for (defining what an "obvious outlier" is for a generic scatterplot is potentially quite problematic).
That said, it should not be too difficult to write a reasonable function that gives y-axis limits from a set of data points. If we take "obvious outlier" to mean a point whose y value lies well above or below the bulk of the sample (which could be justified assuming a sufficient spread of x values), then you could use something like:
ybounds <- function(y) {  # y is the response variable in the data frame
  bounds <- quantile(y, probs = c(0.05, 0.95), type = 3, names = FALSE)
  # pad the 5%-95% range by 10% on each side
  bounds + c(-1, 1) * 0.1 * (bounds[2] - bounds[1])
}
Then plot each data frame with plot(df$x, df$y, ylim = ybounds(df$y)).
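With the data frame from the question, the 3000 outlier falls outside the computed limits:
plot(df$x2, df$x1, ylim = ybounds(df$x1))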

ggplot2 2d Density Weights

I'm trying to plot some data with 2d density contours using ggplot2 in R.
I'm getting one slightly odd result.
First I set up my ggplot object:
p <- ggplot(data, aes(x=Distance,y=Rate, colour = Company))
I then plot this with geom_point and geom_density2d. I want geom_density2d to be weighted by the organisation's size (the OrgSize variable). However, when I add OrgSize as a weighting variable, nothing changes in the plot:
This:
p+geom_point()+geom_density2d()
Gives an identical plot to this:
p+geom_point()+geom_density2d(aes(weight = OrgSize))
However, if I do the same with a loess line using geom_smooth, the weighting does make a clear difference.
This:
p+geom_point()+geom_smooth()
Gives a different plot to this:
p+geom_point()+geom_smooth(aes(weight=OrgSize))
I was wondering if I'm using density2d inappropriately: should I instead be using contour and supplying OrgSize as the 'height'? If so, why does geom_density2d accept a weighting factor?
Code below:
require(ggplot2)
Company <- c("One","One","One","One","One","Two","Two","Two","Two","Two")
Store <- c(1,2,3,4,5,6,7,8,9,10)
Distance <- c(1.5,1.6,1.8,5.8,4.2,4.3,6.5,4.9,7.4,7.2)
Rate <- c(0.1,0.3,0.2,0.4,0.4,0.5,0.6,0.7,0.8,0.9)
OrgSize <- c(500,1000,200,300,1500,800,50,1000,75,800)
data <- data.frame(Company,Store,Distance,Rate,OrgSize)
p <- ggplot(data, aes(x=Distance,y=Rate))
# Difference is apparent between these two
p+geom_point()+geom_smooth()
p+geom_point()+geom_smooth(aes(weight = OrgSize))
# Difference is not apparent between these two
p+geom_point()+geom_density2d()
p+geom_point()+geom_density2d(aes(weight = OrgSize))
geom_density2d is "accepting" the weight parameter but then not passing it on to MASS::kde2d, since that function has no weights argument. As a consequence, you will need to use a different 2D density method.
(I realize my answer does not address why the help page says that geom_density2d "understands" the weight aesthetic, but whenever I have had to calculate weighted 2D KDEs, I have needed packages other than MASS. Maybe this is a TODO that @hadley put in the help page and that then got overlooked?)
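One such alternative, as a sketch: the ks package's kde() accepts observation weights through its w argument (an assumption to verify against your installed version), and you can draw the resulting weighted density as contours yourself:
library(ks)
library(ggplot2)
# weighted 2D KDE; w weights each point by OrgSize
kd <- kde(x = cbind(data$Distance, data$Rate), w = data$OrgSize, binned = FALSE)
grid <- expand.grid(Distance = kd$eval.points[[1]], Rate = kd$eval.points[[2]])
grid$density <- as.vector(kd$estimate)
ggplot(data, aes(Distance, Rate)) +
  geom_point() +
  geom_contour(data = grid, aes(z = density))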

best fitting curve from plot in R

I have a probability density function in a plot called ph, which I derived from two samples of data (with the help of a Stack Overflow user) in this way:
few  <- read.table('outcome.dat', header = TRUE)
many <- read.table('alldata.dat', header = TRUE)
mh <- hist(many$G, breaks = seq(0, 1, by = 0.03), plot = FALSE)
fh <- hist(few$G, breaks = mh$breaks, plot = FALSE)
ph <- fh
ph$density <- fh$counts / (mh$counts + 0.001)
plot(ph, freq = FALSE, col = "blue")
I would like to fit the best curve to the plot of ph, but I can't find a working method.
How can I do this? Do I have to extract the values from ph and work on them, or is there some function that works on
plot(ph, freq = FALSE, col = "blue")
directly?
Assuming you mean that you want to perform a curve fit to the data in ph, then something along the lines of
nls(y ~ FUN(x, ...), data = data.frame(x = ph$mids, y = ph$counts), ...)
may work. You need to decide what sort of function FUN you think the histogram data should follow, e.g. a normal distribution. Read the help page for nls() to learn how to set up starting "guess" values for the coefficients of FUN.
If you simply want to overlay a curve onto the histogram, then
smoo <- spline(ph$mids, ph$counts)
lines(smoo$x, smoo$y)
will come close to doing that. You may have to adjust the x and/or y scaling.
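As a concrete, reproducible sketch of the nls() route, using simulated data in place of ph and a Gaussian as the assumed shape:
set.seed(1)
h <- hist(rnorm(500, mean = 0.5, sd = 0.1), breaks = 30, plot = FALSE)
dat <- data.frame(x = h$mids, y = h$density)
# fit a Gaussian bump; the start values are rough guesses read off the histogram
fit <- nls(y ~ a * exp(-(x - m)^2 / (2 * s^2)), data = dat,
           start = list(a = max(dat$y), m = weighted.mean(dat$x, dat$y), s = 0.1))
plot(h, freq = FALSE)
lines(dat$x, predict(fit), col = "red", lwd = 2)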
Do you want a density function?
x = rnorm(1000)
hist(x, breaks = 30, freq = FALSE)
lines(density(x), col = "red")
