I have a pre-binned frequency table for a rather large dataset. That is, a single column vector of bins and a single column vector of counts associated with those bins. I'd like R to plot a histogram of this data by doing further binning and summing the existing counts. For example, if in the pre-binned data I have something like [(0.01, 5000), (0.02, 231), (0.03, 948)], where the first number is the bin and the second is the count, and I choose 0.04 as the new bin width, I'd expect to get [(0.04, 6179)]. What's the fastest and/or easiest way to do this in R?
Looks like ggplot2 has the answer.
library(ggplot2)
qplot(bins, data=data.frame(bins, counts), weight=counts, geom="histogram")
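For comparison, here is a minimal base R sketch of the same rebinning without ggplot2. It swaps in cut() and tapply() rather than the weighted histogram above. The first three bins come from the question; the two extra bins and the names `width`, `breaks`, and `rebinned` are made up for illustration. It assumes every existing bin value falls strictly inside one new bin, so values sitting exactly on a break (and floating-point rounding near one) would need extra care.

```r
# Pre-binned data: the first three pairs are from the question,
# the last two are invented so that there is a second output bin.
bins   <- c(0.01, 0.02, 0.03, 0.05, 0.07)
counts <- c(5000, 231, 948, 10, 20)

width  <- 0.04
# New bin edges covering the whole range of the old bins: 0, 0.04, 0.08
breaks <- seq(0, ceiling(max(bins) / width) * width, by = width)

# Assign each old bin to a new interval (lower, upper] and sum its counts.
# A new bin that receives no old bins would show up as NA here.
summed <- tapply(counts, cut(bins, breaks = breaks), sum)

# Label each new bin by its upper edge, as in the question.
rebinned <- data.frame(upper = breaks[-1], count = as.vector(summed))
print(rebinned)

# Bar heights are the summed counts.
barplot(rebinned$count, names.arg = rebinned$upper,
        xlab = "bin (upper edge)", ylab = "count")
```

The first row reproduces the (0.04, 6179) result asked for in the question.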
The new HistogramTools package on CRAN has a number of useful functions for doing exactly this. In your example, if you want to merge three adjacent buckets together at each point in the histogram to produce a new histogram with 1/3rd as many buckets, you could use the MergeBuckets function.
install.packages("HistogramTools")
library(HistogramTools)
h <- hist(rexp(1000), breaks=60)
plot(MergeBuckets(h, adj.buckets=3))
Alternatively, you can also specify a list of the new breakpoints you want explicitly, rather than telling MergeBuckets() to always merge the same number of adjacent buckets.
In a simulation I produce one very large vector of numbers, which I want to show in a histogram. Unfortunately, my RAM doesn't allow vectors as long as I require them to be. (10^10 entries)
Thus, I put my simulation in a loop that produces several shorter vectors.
I tried the hist function and summing hist$counts, but the binning keeps changing, which makes a summation impossible (for me...).
Now I am looking for a solution that handles these smaller vectors sequentially:
read the first vector (from the loop)
extract the information needed for a histogram
keep the histogram information of the first vector but discard the vector itself to save memory
do this for all the other vectors and store only the histogram information of each
build one histogram in which the accumulated histogram information is added up into one set of counts.
Can anyone help out? Is this possible in R? I'm stuck... Thanks to all who took the time to read this!
Your problem, if I understand correctly, is that the histogram bins are changing. So the natural solution would be to fix the bins using the breaks parameter of the hist function. For better performance you can set plot = FALSE and just collect the bin counts from each part.
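A minimal sketch of that approach, assuming the simulated values are known to lie in [0, 1] so that the breaks can be fixed up front. Here runif() stands in for the simulation, and the number and size of the chunks are made up. Note that hist() stops with an error if a value falls outside the given breaks, so the breaks must span the full range.

```r
set.seed(1)

# Fixed bin edges, chosen once before the loop (assumed range [0, 1]).
breaks <- seq(0, 1, by = 0.05)
total  <- numeric(length(breaks) - 1)

for (i in 1:10) {
  x <- runif(1e5)   # stand-in for one chunk of the simulation
  # Same breaks every time, so the counts line up and can be added.
  total <- total + hist(x, breaks = breaks, plot = FALSE)$counts
  rm(x)             # discard the chunk, keep only the counts
}

# Assemble a "histogram" object from the accumulated counts and plot it.
h <- structure(
  list(breaks   = breaks,
       counts   = total,
       density  = total / (sum(total) * diff(breaks)),
       mids     = head(breaks, -1) + diff(breaks) / 2,
       xname    = "simulated values",
       equidist = TRUE),
  class = "histogram"
)
plot(h)
```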
You can obtain the information a histogram requires with the count() function from the dplyr library.
Let's say the values in your vector of numbers range from 1 to 100. First you have to define your buckets: 1-10, 11-20, ...
Then, within the loop and with a smaller vector, use the function cut() with its breaks argument to transform your numeric vector into a categorical vector. Use count() to count the number of values in each bucket.
At the end of your loop, combine all the counts you obtain.
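The steps above can be sketched as follows. To keep the example free of extra packages, base R's table() is swapped in for dplyr's count(); the idea is the same. sample() stands in for the simulation, and the chunk count and size are made up.

```r
set.seed(42)

# Buckets (0,10], (10,20], ..., (90,100], i.e. 1-10, 11-20, ... for integers.
breaks <- seq(0, 100, by = 10)
total  <- NULL

for (i in 1:5) {
  x <- sample(1:100, 1000, replace = TRUE)   # stand-in for one smaller vector
  # cut() returns a factor with every bucket as a level, so table()
  # reports all buckets (including empty ones) in the same order each time.
  part  <- table(cut(x, breaks = breaks))
  total <- if (is.null(total)) part else total + part
}

print(total)
barplot(total, las = 2, ylab = "count")
```

Because the bucket levels are identical in every iteration, the per-chunk tables can simply be added element by element.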
In case there is an easier way, I am trying to overlay the plots of 4 different "performance" objects from the ROCR package. The gist is that each of these objects contains two vectors of equal length, one for the X values and one for the Y values, but the X/Y vectors are not the same length between objects.
Currently I am just extracting and plotting these curves manually with plot() and lines(), to create this:
It's not terrible, but I think I would have better control with ggplot. The only problem is I can't think of a way to create a data.frame() from these vectors with ggplot.
ggplot prefers data in long format, so different lengths for different lines don't matter.
The structure is pretty easy - you have one column that defines the line, iteration, in your case, with values either 1, 2, 3, or 4 (probably make this one a factor); one column that gives x, and one column that gives y.
Since you don't provide any code or sample data, I'll assume that's as much of an answer as you're looking for. You can use c() on individual vectors or rbind() on individual data frames to combine them. Or dplyr::bind_rows or data.table::rbindlist() to operate on a list of data frames.
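As a rough sketch of that structure: the four curves below are invented stand-ins of different lengths, and the names `curves` and `long` are hypothetical. With real ROCR performance objects, the x and y vectors would typically be pulled from the object's x.values and y.values slots instead. The plotting step only runs if ggplot2 is installed.

```r
# Four made-up curves with different numbers of points.
curves <- list(
  list(x = seq(0, 1, length.out = 5), y = seq(0, 1, length.out = 5)^0.5),
  list(x = seq(0, 1, length.out = 8), y = seq(0, 1, length.out = 8)^0.7),
  list(x = seq(0, 1, length.out = 3), y = seq(0, 1, length.out = 3)),
  list(x = seq(0, 1, length.out = 6), y = seq(0, 1, length.out = 6)^2)
)

# One small data frame per curve, stacked into long format with rbind().
long <- do.call(rbind, lapply(seq_along(curves), function(i) {
  data.frame(iteration = i, x = curves[[i]]$x, y = curves[[i]]$y)
}))
long$iteration <- factor(long$iteration)   # the column that defines the line

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = x, y = y, colour = iteration)) + geom_line()
  print(p)
}
```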
After applying hexbin'ning I would like to know which id or rownumbers of the original data ended up in which bin.
I am currently analysing spatial data and I am binning, e.g., depth of water and temperature. Ideally, I would like to map the colormap of the bins back to the spatial map to see where more or less common parameter combinations exist. I'm not bound to hexbin though.
I wasn't able to figure out from the documentation, how to trace which datapoint ends up in which bin. It seems hexbin() only stores counts.
Is there a function that generates a list with one entry for every bin, each containing a vector of all rownumbers that were assigned to that bin?
Please point me in the right direction.
Up to now, I use plain hexbin to do the binning:
library(hexbin)
set.seed(5)
df <- data.frame(depth=runif(1000,min=0,max=100),temp=runif(1000,min=4,max=14))
h <- hexbin(df)
but currently I see no way to extract rownames of df from h that link the bins to df. Possibly there is no such thing, maybe I overlooked it or there is a completely different approach needed.
Assuming you are using the hexbin package, then you will need to set IDs=TRUE to be able to go back to the original rows
library(hexbin)
set.seed(5)
df <- data.frame(depth=runif(1000,min=0,max=100),temp=runif(1000,min=4,max=14))
h<-hexbin(df, IDs=TRUE)
Then to get the bin number for each observation, you can use
h@cID
To get the count of observations in the cell populated by a particular observation, you would do
h@count[match(h@cID, h@cell)]
The idea is that the second observation df[2,] is in cell h@cID[2]=424. Cell 424 is at index which(h@cell==424)=241 in the list of cells (zero count cells appear to be omitted). The number of observations in that cell is h@count[241]=2.
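Building on the cID slot, one way to get the list asked for in the question (one entry per bin, each holding the row numbers assigned to it) is base R's split(). This is only a sketch; the names `rows_by_cell` and `cell_count` are made up for illustration.

```r
library(hexbin)

set.seed(5)
df <- data.frame(depth = runif(1000, min = 0, max = 100),
                 temp  = runif(1000, min = 4, max = 14))
h <- hexbin(df, IDs = TRUE)

# One list entry per occupied cell, named by cell id; each entry holds
# the row numbers of df that fell into that cell.
rows_by_cell <- split(seq_len(nrow(df)), h@cID)

# Per-row count of the cell the row belongs to, which could then be used
# to colour points on a spatial map.
df$cell_count <- h@count[match(h@cID, h@cell)]

head(rows_by_cell, 3)
```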
In R, I'm drawing a rather large boxplot from a data.frame with approximately 150 columns. I know that there are some "anomalous" columns where the distribution is too different from the rest of the data set and I want to identify which ones precisely.
Rather unsurprisingly, there is not enough room for the labels, and even if there were, it would probably be inconvenient to check by hand. So I thought I could use R's identify function to locate the offending columns. Such a function, however, needs x and y coordinates, and so far I have been unable to get it to work.
I tried
boxplot(dd.noctr$TGS, outline=F)
identify(xy.coords(dd.noctr$TGS)$x, y=xy.coords(dd.noctr$TGS)$y)
where dd.noctr$TGS is my data (a matrix or data.frame), only to get the error
warning: no point within 0.25 inches
meaning that no point was identified.
Is there an alternative solution to identify column names (not single points)?
This solution seems a bit clunky, so there is probably a better solution.
Set up some example data with three columns:
TGS = data.frame(A = rnorm(100), B = rnorm(100), C=rnorm(100))
Next plot the boxplot
boxplot(TGS, outline=F)
Now we call the identify function.
identify(x=rep(1:ncol(TGS), each=nrow(TGS)),
y=as.vector(unlist(TGS)),
labels=rep(colnames(TGS), each=nrow(TGS)))
The labels are the column names. This function only works if you click near the centre of the boxplot.
If you want to get a list of outliers, you can use the 'out' component of boxplot.
Example:
Create a data frame with a few random values around a mean of 20 and add some outliers. The code below displays the outliers.
df1 = data.frame(A = c(rnorm(15,20,3),7,8,35,32)) #15 rnorm and 4 extreme values
bplot=boxplot(df1)
bplot$out
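With several columns, the value returned by boxplot() also has group and names components, which say which box each outlier belongs to. A small sketch, using made-up data in which column C is given two extreme values, that lists the outliers by column name without any clicking:

```r
set.seed(1)
# Three columns; C gets two deliberately extreme values.
TGS <- data.frame(A = rnorm(100),
                  B = rnorm(100),
                  C = c(rnorm(98), 9, -9))

# plot = FALSE returns the statistics without drawing anything.
bp <- boxplot(TGS, plot = FALSE)

# bp$group holds the index of the box for each entry of bp$out;
# bp$names maps that index back to the column name.
outliers <- data.frame(column = bp$names[bp$group], value = bp$out)

# Number of outliers per column.
table(outliers$column)
```

Counting outliers is only one possible notion of an "anomalous" column; whether it fits depends on how the distributions differ.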
I have a set of data that looks like this,
species<-"ABC"
ind<-rep(1:4,each=24)
hour<-rep(seq(0,23,by=1),4)
depth<-runif(length(ind),1,50)
df<-data.frame(species,ind,hour,depth) # data.frame() keeps ind, hour and depth numeric; cbind() would coerce them to character
In this example, the column "ind" has four levels of equal length (24 rows each), but in the real data it has more levels and they don't always have the same length: some individuals have thousands of rows of data, while others have only a few.
What I would like to do is to have an outer loop or function that will select all the rows from each individual ("ind") and generate a boxplot using the depth/hour columns.
This is the idea that I have in mind,
for (i in 1:length(unique(df$ind))){
data<-df[df$ind==df$ind[i],]
individual[i]<-data
plot.boxplot<-function(data){
boxplot(depth~hour,dat=data,xlab="Hour of day",ylab="Depth (m)")
}
}
par(mfrow=c(2,2),mar=c(5,4,3,1))
plot.boxplot(individual)
I realized that this loop might be inappropriate, but I am still learning. I can do the boxplot for each individual at a time, but I would like a faster, more efficient way of selecting the data for each individual and creating or storing boxplot results. This will be very useful for when I have many more individuals (instead of doing one at a time...). Thanks a lot in advance.
What about something like this?
par(mfrow=c(2,2))
invisible(
by(df,df$ind,
function(x)
boxplot(depth~hour,data=x,xlab="Hour of day",ylab="Depth (m)")
)
)
To provide some explanation, this runs a boxplot for each group of cases in df defined by df$ind. The invisible wrapper just makes it so that the bunch of output used for the boxplot is not written to the console.
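If the boxplot results should also be stored, as the question mentions, one possible variation is to compute the statistics per individual with plot = FALSE and draw them later with bxp(). This is a sketch on data shaped like the example in the question; the name `stats` is made up.

```r
set.seed(1)
df <- data.frame(species = "ABC",
                 ind     = rep(1:4, each = 24),
                 hour    = rep(0:23, 4),
                 depth   = runif(96, 1, 50))

# One set of boxplot statistics per individual, kept in a named list.
stats <- lapply(split(df, df$ind), function(x) {
  boxplot(depth ~ hour, data = x, plot = FALSE)
})

# Draw the stored results later, one panel per individual.
par(mfrow = c(2, 2), mar = c(5, 4, 3, 1))
for (id in names(stats)) {
  bxp(stats[[id]], main = paste("Individual", id),
      xlab = "Hour of day", ylab = "Depth (m)")
}
```

With more than four individuals, the mfrow layout would need adjusting, or the panels could be written to a multi-page device such as pdf().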