How to visualize a (value, count) dataset with thousands of data points - r

I have a file with 2 numeric columns, value and count, and it may have more than 5000 rows. I call plot(value, count) to see the shape of the distribution, but with so many data points the picture is not very clear.
Do you know a better visualization approach? Perhaps a histogram, or a barplot that groups close values on the x axis, would be a better way to look at the data? I cannot figure out the syntax for using a histogram or barplot in my case.

If you want to relate the two (continuous) quantities value and count to each other, then you want to do a scatterplot. The problem is that if you have too many observations, the points will overlap and the plot ends up as a big opaque mass with a few scattered outliers. There are a couple of ways to solve this:
Use a smaller plotting symbol: plot(value, count, pch=".")
Plot the data points with a transparency factor: plot(value, count, col=rgb(0, 0, 1, alpha=0.1))
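Both fixes can be sketched on simulated data (the value and count columns here are made up purely for illustration):

```r
# Simulated stand-in for the real (value, count) file
set.seed(1)
value <- rnorm(5000)
count <- rpois(5000, lambda = 20)

plot(value, count, pch = ".")                        # smaller plotting symbol
plot(value, count, col = rgb(0, 0, 1, alpha = 0.1))  # semi-transparent points
```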

Why not plot a subset of the data? For example, plot the counts associated with values corresponding to the 5th, 10th, ..., 90th, 95th percentiles, e.g.,
value.subset <- quantile(value, seq(0, 1, 0.05))
Then plot the quantiles against their respective counts.
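A minimal sketch of this quantile idea, using simulated data and findInterval to look up the count at the observation nearest each quantile (all names here are illustrative):

```r
set.seed(42)
d <- data.frame(value = rnorm(5000), count = rpois(5000, 20))
d <- d[order(d$value), ]                  # sort so findInterval works

probs <- seq(0, 1, 0.05)                  # 0th, 5th, ..., 100th percentiles
q <- quantile(d$value, probs)

idx <- pmax(findInterval(q, d$value), 1)  # index of observation at or below each quantile
plot(q, d$count[idx], type = "b", xlab = "value", ylab = "count")
```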

Related

Creating multiple density plots using only summary statistics (no raw data) in R

I work with a massive 4D NIfTI file (x - y - z - subject; MRI data) and, due to its size, I can't convert it to a csv file and open it in R. I would like to get a series of overlaid density plots (classic example here), one for each subject, with the idea of just visualising that there is not much variance in the density distributions across the sample.
I could, however, extract summary statistics for each subject (mean, median, SD, range, etc. of the variable of interest) and use these to create the density plots (at least for the variables that are normally distributed). Something like this would be fantastic, but I am not sure how to do it for density plots.
Your help will be much appreciated.
So these really aren't density plots per se - they are plots of the densities of normal distributions with given means and standard deviations.
That can be done in ggplot2, but you need to expand your table of subjects and summaries into grids of points and normal densities at those points.
Here's an example. First, make up some data, consisting of subject IDs and some simulated sample averages and sample standard deviations.
library(tidyverse)
set.seed(1)
foo <- data_frame(Subject = LETTERS[1:10], avg=runif(10, 10,20), stdev=runif(10,1,2))
Now, for each subject we need to obtain a suitable grid of "x" values along with the normal density (for that subject's avg and stdev) evaluated at those "x" values. I've chosen plus/minus 4 standard deviations. This can be done using do. But that produces a funny data frame with a column consisting of data frames. I use unnest to explode out the data frame.
bar <- foo %>%
  group_by(Subject) %>%
  do(densities = data_frame(x = seq(.$avg - 4*.$stdev, .$avg + 4*.$stdev, length.out = 50),
                            density = dnorm(x, .$avg, .$stdev))) %>%
  unnest()
Have a look at bar to see what happened. Now we can use ggplot2 to put all these normal densities on the same plot. I'm guessing with lots of subjects you wouldn't want a legend for the plot.
bar %>%
  ggplot(aes(x = x, y = density, color = Subject)) +
  geom_line(show.legend = FALSE)
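If your dplyr version has deprecated do(), the same grid can be built in base R before handing off to ggplot2. This is just an alternative sketch of the step above, reusing the simulated foo table:

```r
set.seed(1)
foo <- data.frame(Subject = LETTERS[1:10],
                  avg   = runif(10, 10, 20),
                  stdev = runif(10, 1, 2))

# One 50-point grid per subject, covering avg +/- 4 standard deviations
grids <- lapply(seq_len(nrow(foo)), function(i) {
  x <- seq(foo$avg[i] - 4 * foo$stdev[i],
           foo$avg[i] + 4 * foo$stdev[i], length.out = 50)
  data.frame(Subject = foo$Subject[i],
             x = x,
             density = dnorm(x, foo$avg[i], foo$stdev[i]))
})
bar <- do.call(rbind, grids)
```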

Standardize Color Range For Multiple Plots

I am plotting multiple dataframes, where the color of the line is dependent on a variable in the dataframe. The problem is that for each plot, R makes the color spectrum relative to the range of each plot.
I would like for the range (and corresponding colors) to be kept constant for all of the dataframes I'm using. I won't know the range of numbers in advance, though they'll all be set before plotting. In addition, there will be hundreds of values, so a manual mapping is not feasible.
As of right now, I have:
library(ggplot2)
df1 <- as.data.frame(list('x'=1:5,'y'=1:5,'colors'=6:10))
df2 <- as.data.frame(list('x'=1:5,'y'=1:5,'colors'=8:12))
qplot(data=df1,x,y,geom='line', colour=colors)
qplot(data=df2,x,y,geom='line', colour=colors)
The first plot produces a legend whose color range goes from 6 to 10; the second, one whose range goes from 8 to 12.
I would like a constant range for both that goes from 6-12.
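No answer appears in this extract; one common approach (a sketch, assuming a continuous colour scale) is to compute the shared range up front and pass it as the scale's limits:

```r
library(ggplot2)

df1 <- data.frame(x = 1:5, y = 1:5, colors = 6:10)
df2 <- data.frame(x = 1:5, y = 1:5, colors = 8:12)

rng <- range(c(df1$colors, df2$colors))   # shared range: 6 to 12

ggplot(df1, aes(x, y, colour = colors)) +
  geom_line() +
  scale_colour_continuous(limits = rng)

ggplot(df2, aes(x, y, colour = colors)) +
  geom_line() +
  scale_colour_continuous(limits = rng)
```

Because both plots use the same limits, a given value maps to the same colour in each.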

Barplot with threshold

I have a huge data frame consisting of binary values (extract):
id,topic,w_hello,w_apple,w_tomato
1,politics,1,1,0
2,sport,0,1,0
3,politics,1,0,1
With:
barplot(col_prefix_matrix)
I plot the number of their occurrences.
As there are many columns, the plot looks very confusing.
Would it be possible to plot only those columns with a specific threshold, say 5, to make it look more clear?
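A sketch of that filtering step, with a made-up 0/1 matrix standing in for col_prefix_matrix:

```r
set.seed(1)
# Made-up binary matrix: 30 documents x 10 word columns
mat <- matrix(rbinom(300, 1, 0.3), nrow = 30,
              dimnames = list(NULL, paste0("w_", 1:10)))

counts <- colSums(mat)           # occurrences per word column
barplot(counts[counts >= 5])     # plot only the columns meeting the threshold
```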

Plotting multiple frequency polygon lines using ggplot2

I have a dataset with records that have two variables: "time" which are id's of decades, and "latitude" which are geographic latitudes. I have 7 time periods (numbered from 26 to 32).
I want to visualize a potential shift in latitude through time. So what I need ggplot2 to do is plot a graph with latitude on the x-axis and the count of records at a given latitude on the y-axis. I need it to do this for the separate time periods and plot everything in one graph.
I understood that I need the function freqpoly from ggplot2, and I got this so far:
qplot(latitude, data = lat_data, geom = "freqpoly", binwidth = 0.25)
This gives me the correct graph of the data, ignoring time. But how can I incorporate time? I tried subsetting the data, but I can't really figure out whether this is the best way.
So basically I'm trying to get a graph with 7 lines showing the frequency distribution in each decade in order to look for a latitude shift.
Thanks!!
Without sample data it is hard to answer, but try adding color=factor(time) (where time is the name of your column with time periods). This will draw the line for each time period in a different color.
qplot(latitude, data = lat_data, geom = "freqpoly", binwidth = 0.25,
      color = factor(time))
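In current ggplot2, where qplot is deprecated, the equivalent call looks like the sketch below. The lat_data columns are as described in the question; the data here are simulated so the example runs on its own:

```r
library(ggplot2)

# lat_data is assumed to have 'latitude' and 'time' columns, as in the question
set.seed(1)
lat_data <- data.frame(latitude = runif(700, 40, 60),
                       time = rep(26:32, each = 100))

ggplot(lat_data, aes(latitude, colour = factor(time))) +
  geom_freqpoly(binwidth = 0.25)
```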

Scaled/weighted density plot

I want to generate a density plot of observed temperatures that is scaled by the number of events observed for each temperature data point. My data contains two columns: Temperature and Number [of observations].
Right now, I have a density plot that only incorporates the Temperature frequency according to:
plot(density(Temperature, na.rm=T), type="l", bty="n")
How do I scale this density to account for the Number of observations at each temperature? For example, I want the density curve to reflect whether there are more or fewer observations at higher or lower temperatures.
I think I'm looking for something that could weight the temperatures?
I think you can get what you want by passing a weights argument to density. Here's an example using ggplot2 (note that the ggplot2 aesthetic is weight, singular, even though the underlying density argument is weights):
library(ggplot2)
dat <- data.frame(Temperature = sort(runif(10)), Number = 1:10)
ggplot(dat, aes(Temperature)) + geom_density(aes(weight = Number/sum(Number)))
And to do this in base (using DanM's data):
plot(density(dat$Temperature, weights = dat$Number/sum(dat$Number), na.rm = TRUE), type = "l", bty = "n")
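One way to sanity-check the weighting (a sketch): replicating each temperature by its count and computing an unweighted density at the same bandwidth should give the same curve, since a weight of Number/sum(Number) is exactly what Number identical copies contribute.

```r
set.seed(1)
dat <- data.frame(Temperature = sort(runif(10)), Number = 1:10)

d_w <- density(dat$Temperature, weights = dat$Number / sum(dat$Number))
# Expand the data: each temperature repeated Number times, then unweighted density
d_r <- density(rep(dat$Temperature, dat$Number), bw = d_w$bw)

max(abs(d_w$y - d_r$y))   # essentially zero
```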
