scatter plot in r with huge unique observation

scatter plot in r with huge unique observation - r

Currently, plot is not useful. How would I plot this distribution, since the range is too high?
I have data of 50 year and have to see which activity is most harmful.
The data contain about 1000 unique activity say column1.
I am using groupby(column1) and summarise(total = sum(column2,column3))
but problem is there few total value in 6 to 7 digit
because of these two fact my plot x look bad and due few high value y most value are near x axis.
I believe the problem is at x axis since so many names are clustered together due to less space.

I think a log transformation might help you gain some better insight out of your data:
Set up some fake data that resembles your situation:
set.seed(1776) # reproducible random numbers
num_obs <- 10000 # set number of observations
options(scipen = 999) # don't use scientific notation
# don't worry about this code, just creating a reproducible example
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min=0, max=1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers), + 2) *
abs(rnorm(sum(make_these_outliers)) * 50000))
Recreate the plot you have now to show the issue you're facing:
# recreating your current situation
plot(y, main='Ugly Plot')
Log10 transformation
Now we'll use the log10 transformation on your data an visualize the result. So a value of "10" is now "1", value of "100" is now "2", value of "1000" is now "3", etc.
# log10
plot(log10(y), col= rgb(0, 0, 0, alpha=0.3), pch=16, main='Log Scale and Transparency - Slightly Better')
The pch = 16 argument fills in the points and the alpha = 0.4 sets the opacity of each point. An alpha of 0.4 means an opacity of 40% (can also think of this as 60% transparent).
ggplot2
I'll also show this in ggplot2, because using the scale transformations, ggplot2 is smart enough to put the true value on the y-axis to prevent you from having to do the mental gymnastics of log10 transforms in your head.
# now with ggplot2
# install.packages("ggplot2") # <-- run this if you haven't installed ggplot2 yet
library(ggplot2)
# ggplot2 prefers your data to be in a data.frame (makes it easier to work with)
data_df <- data.frame(
index = 1:num_obs,
y = y)
ggplot(data = data_df, aes(x = index, y = y)) +
geom_point(alpha=0.2) +
scale_y_continuous(trans="log10") +
ggtitle("Y-axis reflects values of the datapoints", "even better?") +
theme_bw(base_size = 12)
At this point, you can start to tell how I've constructed the fake data, which is why there is such a high concentration of points in the 10-1000 range.
Hopefully this helps! I definitely recommend taking PauloH's advice and asking around on stats.stackexchange.com as well to make sure you aren't misrepresenting your data.

Using ggplot2 instead and setting alpha may solve your problem but if that is not enough you may want tag along zoom_facet() from the ggforce package.
set.seed(1776)
num_obs <- 10000
options(scipen = 999)
y <- abs(rnorm(num_obs) + 2) * abs(rnorm(num_obs) * 50)
make_these_outliers <- runif(num_obs, min=0, max=1) > 0.99
y[make_these_outliers] <- abs(rnorm(sum(make_these_outliers), + 2) *
abs(rnorm(sum(make_these_outliers)) * 50000))
# install.packages('ggplot2')
library(ggplot2)
# install.packages('ggforce')
library(ggforce)
data_df <- data.frame(
index = 1:num_obs,
y = y)
ggplot(data = data_df, aes(x = index, y = y)) +
geom_point(alpha=0.05) +
facet_zoom(y = (y <= 500), zoom.size = .8) +
theme_bw()
The result would look more or less like the following:
Hope it helps. Check the ggforce's GitHub:
https://github.com/thomasp85/ggforce

Related

Plot multiple line graphs on one graph

I can get a single power curve shown below but I want to create a power analysis graph. I want to change my delta value (to .6, .7, and .8) and plot those 3 other lines on that same r curve in a different color. I provided an example of what I kinda want it to look.
n_participants <- c(5, 10, 20, 30, 40)
npercluster <- 20
n_tot <- n_participants*npercluster
icc <- 0.6 # assumption
deff <- 1 + icc*(npercluster - 1)
ess <- n_tot / deff
mydelt <- 0.5
mypowers <- power.t.test(n=ess, delta=mydelt)$power
plot(n_participants, mypowers, type='l',
main=paste('Power based on', npercluster, 'volumes per participants'),
xlab='Number of participants', ylim=c( 0, 1),
ylab='Power')

If you are planning to use R a lot I would recommend investing in learning ggplot2. Base R plotting solutions get very limited very quickly.
To solve your problem I would make a data frame with every combination of effect size and sample size.
dat <- expand.grid(mydelt=c(0.5,0.6,0.7,0.8), ess=n_tot / deff)
Then add a column for the power:
dat$mypowers = power.t.test(n=dat$ess, delta=dat$mydelt)$power
Then I can use ggplot to easily make a nice graph of the power curves:
library(ggplot2)
ggplot(dat, aes(x=ess, y=mypowers, color=factor(mydelt))) + geom_point() + geom_line()
You can easily change the overall graph look and add appropriate labels:
ggplot(dat, aes(x=ess, y=mypowers, color=factor(mydelt))) +
geom_point() +
geom_line() +
theme_bw() +
labs(x="Effective sample size", y="Power", color="Effect size" )
In response to the comment.. there was a mistake in the code above in that I plotted the effective total sample size on the x axis not the sample size per cluster. So instead we should make sure we have n_participants in the dataset for plotting, then calculate the powers and plot:
So the whole script is now:
n_participants <- 5:40
npercluster <- 20
icc <- 0.6 # assumption
deff <- 1 + icc*(npercluster - 1)
dat <- expand.grid(mydelt=c(0.5,0.6,0.7,0.8), npart=n_participants)
dat$n_tot <- dat$npart*npercluster
dat$ess <- dat$n_tot / deff
dat$mypowers <- power.t.test(n=dat$ess, delta=dat$mydelt)$power
library(ggplot2)
ggplot(dat, aes(x=npart, y=mypowers, color=factor(mydelt))) +
geom_line()+
theme_bw() +
labs(x="Number of participants", y="Power", color="Effect size" )
Which gives this graph:

You may put the logic in a function f, sapply over desired deltas and - as also suggested in comments - use matplot without having to bother with any new packages.
f <- \(mydelt=.5, n_participants=c(5, 10, 20, 30, 40), npercluster=20, icc=.6) {
n_tot <- n_participants*npercluster
deff <- 1 + icc*(npercluster - 1)
ess <- n_tot/deff
power.t.test(n=ess, delta=mydelt)$power
}
deltas <- seq(.5, .8, .1)
res <- t(sapply(deltas, f))
matplot(res, type='l', main=paste('Power based on 20 volumes per participants'),
xlab='Number of participants',
ylab='Power')
legend('topleft', legend=deltas, col=seq_along(deltas), lty=seq_along(deltas),
title='delta', cex=.8)
It's also possible pipe it directly into matplot:
t(sapply(deltas, f)) |>
matplot(res, ...)
See ?matplot for easy customizing of colors, linetypes etc.
Note: R >= 4.1 used.

Again, I have 4 graphs on R, different x axis, but similar trend profile. I tried to overlay them but they are not aligned

I was assisted to overlay two graphs with different x-axis on this link I have 2 graphs on R. They have different x axis, but similar trend profile. how do I overlay them on r?.
However, I am now trying to overlay 4 graphs. I tried to overlay them but they are not aligned.
I need assistance to overlay these four graphs.
My initial trial codes were as follows:
My raw data is in this following link https://drive.google.com/drive/folders/1ZZQAATkbeV-Nvq1YYZMYdneZwMvKVUq1?usp=sharing.
Codes used to execute:
first <- ggplot(data = first,
aes(x, y)) +
geom_line(pch = 1)
second <- ggplot(data = second,
aes(x, y)) +
geom_line(pch = 1)
third <- ggplot(data = third,
aes(x, y)) +
geom_line(pch = 1)
fourth <- ggplot(data = fourth,
aes(x, y)) +
geom_line(pch = 1)
first$match <- first$x
second$match <- second$x - second$x[second$y == max(second$y)] + first$x[first$y == max(first$y)]
third$match <- third$x
fourth$match <- fourth$x
first$series = "first"
second$series = "second"
third$series = "third"
fourth$series = "fourth"
all_data <- rbind(first, second, third, fourth)
ggplot(all_data) + geom_line(aes(x = match, y, color = series)) +
scale_x_continuous(name = "X, arbitrary units") +
theme(axis.text.x = element_blank())
Would greatly appreciate the help indeed.

OP, I thought I would propose a solution for your question. OP has 4 datasets with x and y columns, and wants to align the peaks in each dataset so that they stack on top of one another. Here's what it looks like when we plot all datasets together:
p <- ggplot(mapping=aes(x=x, y=y)) + theme_bw() +
geom_line(data=first, aes(color="first")) +
geom_line(data=second, aes(color="second")) +
geom_line(data=third, aes(color="third")) +
geom_line(data=fourth, aes(color="fourth"))
The approach will be as follows:
Find the peak x value for each dataset
Adjust each peak x value to match that of the first peak x value
Combine the datasets and plot together which respects Tidy Data principles
Finding peaks and adjusting x values
To find the peaks, I like to use the findpeaks() function from the pracma library. You feed the function your dataset's y values (arranged by increasing x value), and the function will return a matrix with each row representing a "peak" and the columns give you height of peak in y value, index or row of dataset for the peak, where the peak begins, and where the peak ends. As an example, here's how we can apply this principle and what the result looks like on one of the datasets:
library(pracma)
first <- arrange(first, x) # arrange first by increasing x
findpeaks(first$y, sortstr = TRUE, npeaks=1)
[,1] [,2] [,3] [,4]
[1,] 1047.54 402 286 515
The argument sortstr= indicates we want the list of peaks sorted by "highest" first, and we are only interested in picking the first peak. In this case, we can see that 402 is the index of the x,y value in first for the peak. So we can access that x value via first[index,]$x.
The one concern we may have here is that this may not work for fourth, since the max value of y is actually not the peak of interest; however, if we run the function and test this out, using the findpeaks() method where we return the highest peak works fine: apparently the function does not find there is a "peak" at the right since it has an "up", but not a "down".
The function below handles all the steps to do what we need to: arranging, finding peaks, and adjusting peaks.
# find the minimum peak. We know it's from third, but here's
# how you do it if you don't "know" that
peaks_first <- findpeaks(first$y, sortstr = TRUE, npeaks=1)
peaks_second <- findpeaks(second$y, sortstr = TRUE, npeaks=1)
peaks_third <- findpeaks(third$y, sortstr = TRUE, npeaks=1)
peaks_fourth <- findpeaks(fourth$y, sortstr = TRUE, npeaks=1)
# minimum peak x value
peak_x <- min(c(first[peaks_first[2],]$x, second[peaks_second[2],]$x, third[peaks_third[2],]$x, fourth[peaks_fourth[2],]$x))
# function to use to fix each dataset
fix_x <- function(peak_x, dataset) {
dataset <- arrange(dataset, x)
d_peak <- findpeaks(dataset$y, sortstr = TRUE, npeaks=1)
d_peak_x <- dataset[d_peak[2],]$x
x_adj <- peak_x - d_peak_x
dataset$x <- dataset$x + x_adj
return(dataset)
}
# apply and fix each dataset
fix_first <- fix_x(peak_x, first)
fix_second <- fix_x(peak_x, second)
fix_third <- fix_x(peak_x, third)
fix_fourth <- fix_x(peak_x, fourth)
# combine datasets
fix_first$measure <- 'First'
fix_second$measure <- 'Second'
fix_third$measure <- 'Third'
fix_fourth$measure <- 'Fourth'
fixed <- rbind(fix_first, fix_second, fix_third, fix_fourth)
fixed$measure <- factor(fixed$measure, levels=c('First','Second','Third','Fourth'))
Plot Together
Now fixed contains all the data, and we can plot them all together:
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line()
Alternate Plotting Methods
If you want to "stack" the lines on top of one another, this is what is known as a ridgeline plot. There are two methods I can show for how to create the ridgeline plot: faceting or using ggridges and geom_ridgeline(). I can demonstrate both.
# Using facets
ggplot(fixed, aes(x=x, y=y, color=measure)) + theme_bw() +
geom_line(show.legend = FALSE) +
facet_grid(measure~.)
Note I chose not to show the legend, since the strip text indicates this same information.
# Using ggridges and geom_ridgeline
ggplot(fixed, aes(x=x, y=measure, color=measure)) + theme_bw() +
geom_ridgeline(aes(height=y), fill=NA, scale=0.001)
When using geom_ridgeline(), you'll notice that the y= aesthetic becomes the column used for the stacking, and your original y value is instead mapped to the height= aesthetic. I also had to play around with scale=, since for discrete values, each measure will be treated as integers (1, 2, 3, 4). Your height= values are waaaay higher than that, so we have to scale them down so that they are around this range (scaled down by about 1000).

Trying to vertically scale the graph of a data set with R, ggplot2

I'm working with a data frame of size 2 x 400. I need to graph this (let's call it data set A) on the same graph as the main data set for my project.
All I need is the general shape of data set A's graph. ie i only need to see the trend.
The scale that data set A takes place on happens to be much smaller than that of the main graph. So dataset A just looks like a horizontal line.
I decided to scale data set A by multiplying it by a factor of... I tried various values to get the optimum vertical scaling, which leads me to the problem I'm having.
When trying to find the ideal multiplicative factor by trial and error, I expected the general shape of data set A's graph to retain its shape, and only vary in its relative vertical points . ie the horizontal coordinates of all maxes and mins shouldn't move, and only the vertical points should be moving. but this wasn't happening. I'd like to know why.
Here's the data set A (yellow), when multiplied by factor of 3:
factor of 5:
The yellow dots are the geom_point and the yellow curve is the corresponding geom_smooth.
EDIT:
here is my the code original code:
I haven't had much formal training with code. I'm apologize for any messiness!
library("ggplot2")
library("dplyr")
# READ IN DATA
temp_data <-read.table(col.names = "y",
"C:/Users/Ben/Documents/Visual Studio 2013/Projects/Home/Home/steamdata2.txt")
boilpoint <- which(temp_data$y == "boil") # JUST A MARKER..
temp_data <- filter(temp_data, y != "boil") # GETTING RID OF THE MARKER ENTRY
# DON'T KNOW WHY BUT I HAD TO DO THIS INTERMEDIATE STEP
# BEFORE I COULD CONVERT FROM FACTOR -> NUMERIC
temp_data$y <- as.character(temp_data$y)
# CONVERTING TO NUMERIC
temp_data$y <- as.numeric(temp_data$y)
# GETTING RID OF BASICALLY THE LAST ENTRY WHICH HAS THE LARGEST VALUE
temp_data <- filter(temp_data, y<max(temp_data$y))
# ADD ANOTHER COLUMN WITH THE ROW NUMBER,
# BECAUSE I DON'T KNOW HOW TO ACCESS THIS FOR GGPLOT
temp_data <- transform(temp_data, x = 1:nrow(temp_data))
n <- nrow(temp_data) # Num of readings
period <- temp_data[n,1] # (sec)
RpS <- n / period # Avg Readings per Second
MIN <- min(temp_data$y)
MAX <- max(temp_data$y)
# DERIVATIVE OF ORIGINAL
deriv <- data.frame(matrix(ncol=2, nrow=n))
# ADD ANOTHER COLUMN TO ACCESS ROW NUMBERS FOR GGPLOT LATER
colnames(deriv) <- c("y","x")
deriv <- transform(deriv, x = c(1:n))
# FILL DERIVATIVE DATAFRAME
deriv[1, 1] <- 0
for(i in 2:n){
deriv[i - 1, 1] <- temp_data[i, 1] - temp_data[i - 1, 1]
}
deriv <- filter(deriv, y != 0)
# DID THE SAME FOR SECOND DERIVATIVE
dderiv <- data.frame(matrix(ncol = 2, nrow = nrow(deriv)))
colnames(dderiv) <- c("y", "x")
dderiv <- transform(dderiv, x=rep(0, nrow(deriv)))
dderiv[1, 1] <- 0
for(i in 2:nrow(deriv)) {
dderiv$y[i - 1] <- (deriv$y[i] - deriv$y[i - 1]) /
(deriv$x[i] - deriv$x[i - 1])
dderiv$x[i - 1] <- deriv$x[i] + (deriv$x[i] - deriv$x[i - 1]) / 2
}
dderiv <- filter(dderiv, y!=0)
# HERE'S WHERE I FACTOR BY VARIOUS MULTIPLES
deriv <- MIN + deriv * 3
dderiv <- MIN + dderiv * 3
graph <- ggplot(temp_data, aes(x, y)) + geom_smooth()
graph <- graph + geom_point(data = deriv, color = "yellow")
graph <- graph + geom_smooth(data = deriv, color = "yellow")
graph <- graph + geom_point(data = dderiv, color = "green")
graph <- graph + geom_smooth(data = dderiv, color = "green")
graph <- graph + geom_vline(xintercept = boilpoint, color = "red")
graph <- graph + xlab("Readings (n)") +
ylab(expression(paste("Temperature (",degree,"C)")))
graph <- graph + xlim(c(0,n)) + ylim(c(MIN, MAX))

It's hard to check without your raw data, but I'm 99% sure that your main problem is that you're hard-coding the y limits with ylim(c(MIN, MAX)). This is exacerbated by accidentally scaling both variables in your deriv and dderiv data frame, not just y.
I was able to debug the problem when I noticed that your top "scale by 3" graph has a lot more yellow points than your bottom "scale by 5" graph.
The quick fix is don't scale the row numbers, only scale the y values, which is to say, replace this
# scales entire data frame: bad!
deriv <- MIN + deriv * 3
dderiv <- MIN + dderiv * 3
with this:
# only scale y
deriv$y <- MIN + deriv$y * 3
dderiv$y <- MIN + dderiv$y * 3
I think there is another problem too: even with my correction above, negative values of your derivatives will be excluded. If deriv$y or dderiv$y is ever negative, then MIN + deriv$y * 3 will be less than MIN, and since your y axis begins at MIN it won't be plotted.
So I think the whole fix would be to instead do something like
# keep the original y values around so we can experiment with scaling
# without running *all* the code again
deriv$y_orig <- deriv$y
# multiplicative scale
# fill in the value of `prop` to be the proportion of the vertical plot area
# that you want taken up by the derivative
deriv$y <- deriv$y_orig * diff(c(MIN, MAX)) / diff(range(deriv$y_orig)) * prop
# shift into plot range
# fill in the value of `intercept` to be the y value of the
# lowest point of this line
deriv$y <- deriv$y + MIN - min(deriv$y) + 1
I normally don't answer questions that aren't reproducible with data because I hate lack of clarity and I hate the inability to test. However, your question was very clear and I'm pretty sure this will work even without testing. Fingers crossed!
A few other, more general comments:
It's good you know that to convert factor to numeric you need to go via character. It's an annoyance, but if you want to understand more here's the r-faq on it.
I'm not sure why you bother with (deriv$x[i] - deriv$x[i - 1]) in your for loop. Since you define x to be 1, 2, 3, ... the difference is always 1. I'm more confused by why you divide by 2 in the second derivative.
Your for loop can probably be replaced by the diff() function. (See below.)
You seem to have just gotten your foot in the dplyr door, so I used base functions in my recommendation. Keep working with dplyr, I think you'll like it. The big dplyr function you're not using is mutate. It works like base::transform for adding new columns.
I dislike that you've created all these different data frames, it clutters things up. I think your code could be simplified to something like this
all_data = filter(temp_data, y != "boil") %>%
mutate(y = as.numeric(as.character(y))) %>%
filter(y < max(y)) %>%
mutate(
x = 1:n(),
deriv = c(NA, diff(y)) / c(NA, diff(x)),
dderiv = c(NA, diff(deriv)) / 2
)
Rather than having separate data frames for the original data, first derivative and second derivative, this puts them all in the same data frame.
The big benefit of having things in one data frame is that you could then "gather" it into a nice, long (rather than wide) tidy format and simplify your plotting call:
library(tidyr)
long_data = gather(all_data, key = function, value = y, y, deriv, dderiv)
Then your ggplot call would look more like this:
graph <- ggplot(temp_data, aes(x, y, color = function)) +
geom_smooth() +
geom_point() +
geom_vline(xintercept = boilpoint, color = "red") +
scale_color_manual(values = c("green", "yellow", "blue")) +
xlab("Readings (n)") +
ylab(expression(paste("Temperature (",degree,"C)"))) +
xlim(c(0,n)) + ylim(c(MIN, MAX))
With data in long format, you'd have a column of you data (I've named it "function") that maps to color, so you don't have to add all the layers one at a time, and you get a nicely generated legend!

How to convert a bar histogram into a line histogram in R

I've seen many examples of a density plot but the density plot's y-axis is the probability. What I am looking for a is a line plot (like a density plot) but the y-axis should contain counts (like a histogram).
I can do this in excel where I manually make the bins and the frequencies and make a bar histogram and then I can change the chart type to a line - but can't find anything similar in R.
I've checked out both base and ggplot2; yet can't seem to find an answer. I understand that histograms are meant to be bars but I think representing them as a continuous line makes more visual sense.

Using default R graphics (i.e. without installing ggplot) you can do the following, which might also make what the density function does a bit clearer:
# Generate some data
data=rnorm(1000)
# Get the density estimate
dens=density(data)
# Plot y-values scaled by number of observations against x values
plot(dens$x,length(data)*dens$y,type="l",xlab="Value",ylab="Count estimate")

This is an old question, but I thought it might be helpful to post a solution that specifically addresses your question.
In ggplot2, you can plot a histogram and display the count with bars using:
ggplot(data) +
geom_histogram()
You can also plot a histogram and display the count with lines using a frequency polygon:
ggplot(data) +
geom_freqpoly()
For more info --
ggplot2 reference

To adapt the example on the ?stat_density help page:
m <- ggplot(movies, aes(x = rating))
# Standard density plot.
m + geom_density()
# Density plot with y-axis scaled to counts.
m + geom_density(aes(y = ..count..))

Although this is old, I thought the following might be useful.
Let's say you have a data set of 10,000 points, and you believe they belong to a certain distribution, and you would like to plot the histogram of the actual data and the line of the probability density of the ideal distribution on top of it.
noise <- 2
#
# the noise is tagged onto the end using runif
# just do demo issues w/real data and fitting
# the subtraction causes the data to have some
# negative values, which must be addressed in
# the fit later on
#
noisylognorm <- rlnorm(10000,
mean = 0.25,
sd = 1) +
(noise * runif(10000) - noise / 10)
#
# using package fitdistrplus
#
# subset is used to remove the negative values
# as the lognormal distribution needs positive only
#
fitlnorm <- fitdist(subset(noisylognorm,
noisylognorm > 0),
"lnorm")
fitlnorm_density <- density(rlnorm(10000,
mean = fitlnorm$estimate[1],
sd = fitlnorm$estimate[2]))
hist(subset(noisylognorm,
noisylognorm < 25),
breaks = seq(-1, 25, 0.5),
col = "lightblue",
xlim = c(0, 25),
xlab = "value",
ylab = "frequency",
main = paste0("Log Normal Distribution\n",
"noise = ", noise))
lines(fitlnorm_density$x,
10000 * fitlnorm_density$y * 0.5,
type="l",
col = "red")
Note the * 0.5 in the lines function. As far as I can tell, this is necessary to account for the width of the hist() bars.

There is a very simple and fast way for count data.
First let's generate some dummy count data:
my.count.data = rpois(n = 10000, lambda = 3)
And then the plotting command (assuming you have called library(magrittr)):
my.count.data %>% table %>% plot

plot with overlapping points

I have data in R with overlapping points.
x = c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y = c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
plot(x,y)
How can I plot these points so that the points that are overlapped are proportionally larger than the points that are not. For example, if 3 points lie at (4,5), then the dot at position (4,5) should be three times as large as a dot with only one point.

Here's one way using ggplot2:
x = c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y = c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
df <- data.frame(x = x,y = y)
ggplot(data = df,aes(x = x,y = y)) + stat_sum()
By default, stat_sum uses the proportion of instances. You can use raw counts instead by doing something like:
ggplot(data = df,aes(x = x,y = y)) + stat_sum(aes(size = ..n..))

Here's a simpler (I think) solution:
x <- c(4,4,4,7,3,7,3,8,6,8,9,1,1,1,8)
y <- c(5,5,5,2,1,2,5,2,2,2,3,5,5,5,2)
size <- sapply(1:length(x), function(i) { sum(x==x[i] & y==y[i]) })
plot(x,y, cex=size)

## Tabulate the number of occurrences of each cooordinate
df <- data.frame(x, y)
df2 <- cbind(unique(df), value = with(df, tapply(x, paste(x,y), length)))
## Use cex to set point size to some function of coordinate count
## (By using sqrt(value), the _area_ of each point will be proportional
## to the number of observations it represents)
plot(y ~ x, cex = sqrt(value), data = df2, pch = 16)

You didn't really ask for this approach but alpha may be another way to address this:
library(ggplot2)
ggplot(data.frame(x=x, y=y), aes(x, y)) + geom_point(alpha=.3, size = 3)

You need to add the parameter cex to your plot function. First what I would do is use the function as.data.frame and table to reduce your data to unique (x,y) pairs and their frequencies:
new.data = as.data.frame(table(x,y))
new.data = new.data[new.data$Freq != 0,] # Remove points with zero frequency
The only downside to this is that it converts numeric data to factors. So convert back to numeric, and plot!
plot(as.numeric(new.data$x), as.numeric(new.data$y), cex = as.numeric(new.data$Freq))

You may also want to try sunflowerplot.
sunflowerplot(x,y)

Let me propose alternatives to adjusting the size of the points. One of the drawbacks of using size (radius? area?) is that the reader's evaluation of spot size vs. the underlying numeric value is subjective.
So, option 1: plot each point with transparency --- ninja'd by Tyler!
option 2: use jitter to push your data around slightly so the plotted points don't overlap.

A solution using lattice and table ( similar to #R_User but no need to remove 0 since lattice do the job)
dt <- as.data.frame(table(x,y))
xyplot(dt$y~dt$x, cex = dt$Freq^2, col =dt$Freq)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

scatter plot in r with huge unique observation - r

Related

Plot multiple line graphs on one graph

Again, I have 4 graphs on R, different x axis, but similar trend profile. I tried to overlay them but they are not aligned

Trying to vertically scale the graph of a data set with R, ggplot2

How to convert a bar histogram into a line histogram in R

plot with overlapping points

Categories

Resources