Convert absolute values to ranges for charting in R

Warning: still new to R.
I'm trying to construct some charts (specifically, a bubble chart) in R that show political donations to a campaign. The idea is that the x-axis will show the amount of each contribution, the y-axis the number of contributions at that amount, and the area of the circles the total amount contributed at that level.
The data looks like this:
CTRIB_NAML CTRIB_NAMF CTRIB_AMT FILER_ID
John Smith $49 123456789
The FILER_ID field is used to filter the data for a particular candidate.
I've used the following functions to convert this data frame into a bubble chart (thanks to help here and here).
vals<-sort(unique(dfr$CTRIB_AMT))
sums<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, sum)
counts<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, length)
symbols(vals,counts, circles=sums, fg="white", bg="red", xlab="Amount of Contribution", ylab="Number of Contributions")
text(vals, counts, sums, cex=0.75)
However, this results in far too many distinct values on the x-axis. There are several million records in total, and even after filtering by candidate the result can still be an overwhelming amount of data. How can I convert the absolute contributions into ranges? For instance, how can I group the vals into ranges such as 0-10, 11-20, 21-30, etc.?
----EDIT----
Following comments, I can convert vals to numeric and then slice into intervals, but I'm not sure then how I combine that back into the bubble chart syntax.
new_vals <- as.numeric(as.character(sub("\\$","",vals)))
new_vals <- cut(new_vals,100)
But regraphing:
symbols(new_vals,counts, circles=sums)
Is nonsensical -- all the values line up at zero on the x-axis.

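Once the amounts are binned into a factor with cut, you can use tapply again to find the counts and the sums for the new breaks. Note that cut has to be applied to the full CTRIB_AMT column (converted to numeric), not to the unique vals, so that the grouping factor has one entry per record. For example:
amt = as.numeric(sub("\\$", "", as.character(dfr$CTRIB_AMT)))
bins = cut(amt, 100)
counts = tapply(amt, bins, length)
sums = tapply(amt, bins, sum)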
For this type of thing, though, you might find the plyr and ggplot2 packages helpful. Here is a complete reproducible example:
require(plyr)     # provides ddply
require(ggplot2)
# Options
n = 1000
breaks = 10
# Generate data
set.seed(12345)
CTRIB_NAML = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_NAMF = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_AMT = paste('$', round(runif(n, 0, 100), 2), sep='')
FILER_ID = replicate(10, paste(as.character((0:9)[sample(9)]), collapse=''))[sample(10, n, replace=T)]
dfr = data.frame(CTRIB_NAML, CTRIB_NAMF, CTRIB_AMT, FILER_ID)
# Format data
dfr$CTRIB_AMT = as.numeric(sub('\\$', '', dfr$CTRIB_AMT))
dfr$CTRIB_AMT_cut = cut(dfr$CTRIB_AMT, breaks)
# Summarize data for plotting
plot_data = ddply(dfr, 'CTRIB_AMT_cut', function(x) data.frame(count=nrow(x), total=sum(x$CTRIB_AMT)))
# Make plot
dev.new(width=4, height=4)
# opts() and theme_text() were removed from ggplot2; theme() and element_text() replace them
qplot(CTRIB_AMT_cut, count, data=plot_data, geom='point', size=total) + theme(axis.text.x=element_text(angle=90, hjust=1))
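To tie the binned values back into the base-graphics bubble chart from the question, one possible sketch follows. It assumes fixed bin edges every 10 dollars and uses made-up sample data in place of the real dfr; symbols() needs numeric x positions, so the bin midpoints are plotted rather than the factor returned by cut.

```r
# Hypothetical sample data standing in for the real dfr
set.seed(1)
dfr <- data.frame(CTRIB_AMT = paste0("$", sample(1:100, 500, replace = TRUE)),
                  stringsAsFactors = FALSE)

# Strip the dollar sign and bin every record (not just the unique values)
amt <- as.numeric(sub("\\$", "", dfr$CTRIB_AMT))
breaks <- seq(0, 100, by = 10)                # assumed bin edges: (0,10], (10,20], ...
bins <- cut(amt, breaks = breaks)

counts <- tapply(amt, bins, length)           # number of contributions per bin
sums <- tapply(amt, bins, sum)                # total contributed per bin
mids <- head(breaks, -1) + diff(breaks) / 2   # numeric x position for each bin

# circles= gives radii, so sqrt() makes the circle area proportional to the total
symbols(mids, counts, circles = sqrt(sums), inches = 0.3,
        fg = "white", bg = "red",
        xlab = "Amount of Contribution", ylab = "Number of Contributions")
text(mids, counts, sums, cex = 0.75)
```

The bin width and the inches scaling are arbitrary choices here; with real data the breaks would be picked to suit the range of contribution amounts.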

Related

How to change x axis (in order to be scaled) of a boxplot in R

I'm new to R and I'm having some trouble solving this problem.
I have the following table/dataframe:
I am trying to generate a boxplot like this one:
However, I want the x-axis to be scaled according to the labels 1000, 2000, 5000, etc.
That is, the distance between 1000 and 2000 should differ from the distance between 50000 and 100000, since the actual differences are not the same.
Is it possible to do that in R?
Thank you everyone and have a nice day!
Maybe try converting the data set into this format, i.e. with the x values as integers in a column rather than as column headers?
# packages
library(ggplot2)
library(reshape2)
# data in ideal format
dt <- data.frame(x=rep(c(1,10,100), each=5),
y=runif(15))
# data that we have. Use reshape2::dcast to get data in to this format
dt$id <- rep(1:5, 3)
dt_orig <- dcast(dt, id~x, value.var = "y")
dt_orig$id <- NULL
names(dt_orig) <- paste0("X", names(dt_orig))
# let's get back to dt, the ideal format :)
# melt puts it in (variable, value) form. Need to configure variable column
dt2 <- melt(dt_orig)
# firstly, remove X from the string
dt2$variable <- gsub("X", "", dt2$variable)
# almost there, can see we have a character, but need it to be an integer
class(dt2$variable)
dt2$variable <- as.integer(dt2$variable)
# boxplot with variable X axis
ggplot(dt2, aes(x=variable, y=value, group=variable)) + geom_boxplot() + theme_minimal()
Base way of re-shaping data: https://www.statmethods.net/management/reshape.html
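As a rough base-R alternative, the same idea can be sketched with stack() and the at= argument of boxplot(). The column names X1000, X2000 and X5000 and the random values are assumptions made up for illustration, as are the boxwex and xlim values.

```r
# Hypothetical wide data: one column per x position
set.seed(42)
wide <- data.frame(X1000 = runif(5), X2000 = runif(5), X5000 = runif(5))

# stack() gives a long data frame with columns 'values' and 'ind'
long <- stack(wide)
long$x <- as.numeric(sub("^X", "", as.character(long$ind)))

# at= places each box at its numeric x position;
# boxwex widens the boxes to suit the scale of the axis
pos <- sort(unique(long$x))
boxplot(values ~ x, data = long, at = pos, boxwex = 400,
        xlim = range(pos) + c(-500, 500),
        xlab = "x", ylab = "value")
```

With this layout the gap between 2000 and 5000 is drawn three times as wide as the gap between 1000 and 2000.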

R-Programming: Chart the Z distribution of a factor's frequency

I have reviewed a number of posts regarding histograms/barcharts from categorical data but I still can't seem to progress. I have a data set of names (single column) and each name occurs anywhere from once to 8,000 times. I can create a table with variable and frequency and I can move that table to a data frame, but no matter what I try I can't even get a barplot, much less a histogram, with the variable on the x axis and frequency on the y axis.
Ultimately, I want to use the table or dataframe with name and frequency to calculate the Z score for each name and then graph the distribution. I can do this easily with a series of numbers but doing it with a categorical variable has me stumped.
thanks,
rms
Is this what you're looking for?
example_data <- data.frame(Name = sample(paste0("Name", 1:15), size = 8000, replace=TRUE, prob = (1:15)/sum(1:15)))
counts <- as.data.frame(table(example_data))
colnames(counts) <- c("Name", "Freq")
library(ggplot2)
ggplot(data = counts, aes(x = Name, y = Freq)) + geom_bar(stat="identity")
For future reference, it's a little easier to answer if you provide a reproducible example, or go into more detail about what you've tried already. Hope this helps!
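For the Z-score part of the question, a minimal base-R sketch might look like the following. It assumes the Z score is meant as each name's frequency standardized against the mean and standard deviation of all the frequencies; the simulated names are made up for illustration.

```r
# Hypothetical data: 8000 draws from 15 names with unequal probabilities
set.seed(7)
names_vec <- sample(paste0("Name", 1:15), size = 8000, replace = TRUE,
                    prob = (1:15) / sum(1:15))

# Frequency table as a data frame with columns Name and Freq
counts <- as.data.frame(table(Name = names_vec))

# Standardize the frequencies: (x - mean) / sd
counts$z <- (counts$Freq - mean(counts$Freq)) / sd(counts$Freq)

# Distribution of the Z scores across names
hist(counts$z, xlab = "Z score of name frequency", main = "")
```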

Get a histogram plot of factor frequencies (summary)

I've got a factor with many different values. If you execute summary(factor) the output is a list of the different values and their frequency. Like so:
A B C D
3 3 1 5
I'd like to make a histogram of the frequency values, i.e. X-axis contains the different frequencies that occur, Y-axis the number of factors that have this particular frequency. What's the best way to accomplish something like that?
edit: thanks to the answer below I figured out that what I can do is get the factor of the frequencies out of the table, get that in a table and then graph that as well, which would look like (if f is the factor):
plot(factor(table(f)))
Update in light of clarified Q
set.seed(1)
dat2 <- data.frame(fac = factor(sample(LETTERS, 100, replace = TRUE)))
hist(table(dat2), xlab = "Frequency of Level Occurrence", main = "")
gives:
Here we just apply hist() directly to the result of table(dat2). table(dat2) provides the frequencies per level of the factor and hist() produces the histogram of these data.
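If exact counts are preferred over histogram bins, tabulating the table is another option. The sketch below uses the four-level example from the question (A, B, C, D occurring 3, 3, 1 and 5 times); the variable names are chosen only for illustration.

```r
# The example factor from the question: A x3, B x3, C x1, D x5
f <- factor(rep(LETTERS[1:4], times = c(3, 3, 1, 5)))

freq <- table(f)                          # frequency of each level
freq_of_freq <- table(as.vector(freq))    # how many levels share each frequency

# One bar per distinct frequency: 1 occurs once, 3 occurs twice, 5 occurs once
barplot(freq_of_freq, xlab = "Frequency of Level Occurrence",
        ylab = "Number of levels")
```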
Original
There are several possibilities. Your data:
dat <- data.frame(fac = rep(LETTERS[1:4], times = c(3,3,1,5)))
Here are three, from column one, top to bottom:
The default plot method for class "table", which plots the data as histogram-like bars
A bar plot, which is probably what you meant by histogram. Notice the high ink-to-information ratio here
A dot plot or dot chart; shows the same info as the other plots but uses far less ink per unit information. Preferred.
Code to produce them:
layout(matrix(1:4, ncol = 2))
plot(table(dat), main = "plot method for class \"table\"")
barplot(table(dat), main = "barplot")
tab <- as.numeric(table(dat))
names(tab) <- names(table(dat))
dotchart(tab, main = "dotchart or dotplot")
## or just this
## dotchart(table(dat))
## and ignore the warning
layout(1)
this produces:
If you just have your data in variable factor (bad name choice by the way) then table(factor) can be used rather than table(dat) or table(dat$fac) in my code examples.
For completeness, package lattice is more flexible when it comes to producing the dot plot as we can get the orientation you want:
require(lattice)
with(dat, dotplot(fac, horizontal = FALSE))
giving:
And a ggplot2 version:
require(ggplot2)
p <- ggplot(data.frame(Freq = tab, fac = names(tab)), aes(fac, Freq)) +
geom_point()
p
giving:

Subset of data included in more than one ggplot facet

I have a population and a sample of that population. I've made a few plots comparing them using ggplot2 and its faceting option, but it occurred to me that having the sample in its own facet will distort the population plots (however slightly). Is there a way to facet the plots so that all records are in the population plot, and just the sampled records in the second plot?
Matt,
If I understood your question properly - you want to have a faceted plot where one panel contains all of your data, and the subsequent facets contain only a subset of that first plot?
There's probably a cleaner way to do this, but you can create a new data.frame object with the appropriate faceting variable that corresponds to each subset. Consider:
library(ggplot2)
df <- data.frame(x = rnorm(100), y = rnorm(100), sub = sample(letters[1:5], 100, TRUE))
df2 <- rbind(
cbind(df, faceter = "Whole Sample")
, cbind(df[df$sub == "a" ,], faceter = "Subset A")
#other subsets go here...
)
qplot(x,y, data = df2) + facet_wrap(~ faceter)
Let me know if I've misunderstood your question.
-Chase

Setting up a CSV file for R to display histograms

Greetings,
Basically, I have two vectors of data (let's call it experimental and baseline). I want to use the lattice library and histogram functions of R to plot the two histograms side-by-side, just as seen at the end of this page.
I have my data in a CSV file like this:
Label1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label2,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label3,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label4,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label5,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label6,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Each row should be a new pair of histograms. Columns 1-9 represent the data for the experiment (left-side histogram). Columns 10-18 represent the baseline data (right-side histogram).
Can anyone help me on this? Thanks.
Your data is poorly formatted for faceting with lattice. You can restructure it using reshape.
read.csv(textConnection("Label1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label2,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label3,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label4,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label5,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label6,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18"), header = F)->data
colnames(data)[1] <- "ID"
colnames(data)[2:10] <- paste("exp",1:9, sep = "_")
colnames(data)[11:19] <- paste("base", 1:9, sep = "_")
library(reshape)
data.m <- melt(data, id = "ID")
data.m <- cbind(data.m, colsplit(data.m$variable, "_", names = c("Source","Measure")))
data.m is now in the format you really want your data to be in for almost everything. I don't know if each of the 9 measurements from the experiment and the baseline are meaningful or can be meaningfully compared so I kept them distinct.
Now, you can use lattice properly.
library(lattice)
histogram(~value | Source + ID, data = data.m)
If the measurements are meaningfully compared (that is, data[,2] and data[,11] are somehow the "same"), you could recast the data to directly compare experiment to baseline
data.comp <- cast(data.m, ID + Measure ~ Source)
## I know ggplot2 better
library(ggplot2)
qplot(base, exp, data = data.comp)+
geom_abline()+
expand_limits(x = 0, y = 0)
Something like this should work:
library(lattice)
data <- matrix(1:18, ncol=18, nrow=3, byrow=T)
for (i in 1:nrow(data))
{
tmp <- cbind(data[i,], rep(1:2, each=9))
print(histogram(~tmp[,1]|tmp[,2]), split=c(1,i,1,nrow(data)), more=T)
}
Note: this will work only for a few rows of data. For larger datasets you may want a slightly different layout (change the split argument passed to print).
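For reference, the long format can also be built without the reshape package. This is only a sketch under the layout described in the question (first column a label, next nine columns experiment, last nine baseline), with two of the rows inlined so it runs on its own.

```r
# Two rows of the CSV from the question, inlined so the sketch is self-contained
csv_text <- "Label1,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18
Label2,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18"
wide <- read.csv(text = csv_text, header = FALSE)

n <- nrow(wide)
# unlist() walks the data column by column, so labels repeat once per column
# and the first 9 columns' worth of values belong to the experiment
long <- data.frame(
  ID     = factor(rep(wide[[1]], times = 18)),
  Source = factor(rep(c("exp", "base"), each = 9 * n)),
  value  = unlist(wide[-1], use.names = FALSE)
)

# One panel per Source and ID, if lattice is installed
if (requireNamespace("lattice", quietly = TRUE)) {
  print(lattice::histogram(~ value | Source + ID, data = long))
}
```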
