After applying hexbinning, I would like to know which IDs or row numbers of the original data ended up in which bin.
I am currently analysing spatial data and I am binning, e.g., depth of water and temperature. Ideally, I would like to map the colormap of the bins back to the spatial map to see where more or less common parameter combinations exist. I'm not bound to hexbin though.
I wasn't able to figure out from the documentation how to trace which data point ends up in which bin. It seems hexbin() only stores counts.
Is there a function that generates a list with one entry for every bin, each containing a vector of all rownumbers that were assigned to that bin?
Please point me in the right direction.
Up to now, I use plain hexbin to do the binning:
library(hexbin)
set.seed(5)
df <- data.frame(depth=runif(1000,min=0,max=100),temp=runif(1000,min=4,max=14))
h <- hexbin(df)
but currently I see no way to extract rownames of df from h that link the bins to df. Possibly there is no such thing, maybe I overlooked it or there is a completely different approach needed.
Assuming you are using the hexbin package, then you will need to set IDs=TRUE to be able to go back to the original rows
library(hexbin)
set.seed(5)
df <- data.frame(depth=runif(1000,min=0,max=100),temp=runif(1000,min=4,max=14))
h<-hexbin(df, IDs=TRUE)
Then to get the bin number for each observation, you can use
h@cID
To get the count of observations in the cell populated by a particular observation, you would do
h@count[match(h@cID, h@cell)]
The idea is that the second observation, df[2,], is in cell h@cID[2]=424. Cell 424 is at index which(h@cell==424)=241 in the list of cells (zero-count cells appear to be omitted). The number of observations in that cell is h@count[241]=2.
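To get what the question asks for (a list with one entry per bin, each holding the row numbers assigned to it), one option is to split the row indices by the cell IDs. This is only a minimal sketch, assuming the hexbin package is installed and that IDs=TRUE was used; the names rows_by_bin and bin_count are made up for illustration.

```r
library(hexbin)

set.seed(5)
df <- data.frame(depth = runif(1000, min = 0, max = 100),
                 temp  = runif(1000, min = 4, max = 14))
h <- hexbin(df, IDs = TRUE)

# One list entry per occupied bin; each entry holds the row numbers of df
# that fell into that bin. The list names are the cell ids found in h@cell.
rows_by_bin <- split(seq_len(nrow(df)), h@cID)

# Per-row bin count (hypothetical column name), e.g. to colour points on the
# spatial map by how common their depth/temperature combination is.
df$bin_count <- h@count[match(h@cID, h@cell)]
```

For example, rows_by_bin[["424"]] would then list the rows of df that share a cell with the second observation.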
I've got a 1000x1000 matrix consisting of a random distribution of the letters a - z, and I need to be able to plot the data in a rank abundance distribution plot; however I'm having a lot of trouble with it due to a) it all being in character format, b) it being as a matrix and not a vector (though I have changed it to a vector in one attempt to sort it), and c) I seem to have no idea how to summarise the data so that I get species abundance, let alone then be able to rank it.
My code for the matrix is:
##Create Species Vector
species.b<-letters[1:26]
#Matrix creation (Random)
neutral.matrix2<- matrix(sample(species.b,1000*1000,replace=TRUE), # one draw per cell, so nothing is recycled
nrow=1000,
ncol=1000)
##Turn Matrix into Vector
neutral.b<-as.character(neutral.matrix2)
##Loop
lo.op <- 2
neutral.v3 <- neutral.matrix2
neutral.c<-as.character(neutral.v3)
repeat {
neutral.v3[sample(length(neutral.v3),1)]<-as.character(sample(neutral.c,1))
neutral.c<-as.character(neutral.v3)
lo.op <- lo.op+1
if(lo.op > 10000) {
break
}
}
Which creates a matrix, 1000x1000, then replaces 10,000 elements randomly (I think, I don't know how to check it until I can check the species abundances/rank distribution).
I've run it a couple of times to get neutral.v2, neutral.v3, and neutral.b, neutral.c, so I should theoretically have two matrices/vectors that I can plot and compare - I just have no idea how to do so on a purely character dataset.
I also created a matrix of the two vectors:
abundance.matrix<-matrix(c(neutral.b,neutral.c),
nrow=1000000,
ncol=2)
A later requirement is a per-site analysis, and each repeat of my code (neutral.v2 to neutral.v11 eventually) could be treated as a separate site; however, that doesn't change the fact that I have no idea how to handle the character data set in the first place.
I think I need to calculate the abundance of each species in the matrix/vectors, then run it through either radfit (vegan) or some form of the rankabundance/rankabun plot (biodiversityR). However the requirements for those functions:
rankabundance(x,y="",factor="",level,digits=1,t=qt(0.975,df=n-1))
x Community data frame with sites as rows, species as columns and species abundance
as cell values.
y Environmental data frame.
factor Variable of the environment
aren't available in the data I have, as for all intents and purposes I just have a "map" of 1,000,000 species locations, and no idea how to analyse it at all.
Any help would be appreciated: I don't feel like I've explained it very well though, so sorry about that!.
I'm not sure exactly what you want, but this will summarise the data and make it into a data.frame for rankabundance
counts <- as.data.frame(as.list(table(neutral.matrix2)))
BiodiversityR::rankabundance(counts)
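If each run of the simulation is treated as one site, the "sites as rows, species as columns" layout can be built with table(). The following is only a base R sketch under that assumption; site1, site2 and comm are hypothetical names, and the plot is a plain rank abundance curve rather than the output of rankabundance or radfit.

```r
set.seed(1)
species <- letters[1:26]

# Two hypothetical sites, each a character vector of species labels
site1 <- sample(species, 1000, replace = TRUE)
site2 <- sample(species, 1000, replace = TRUE)

# Cross-tabulate site against species: one row per site, one column per species
site <- rep(c("site1", "site2"), each = 1000)
sp   <- factor(c(site1, site2), levels = species)
comm <- as.data.frame.matrix(table(site, sp))

# Rank abundance for one site: sort the counts, most abundant species first
abund <- sort(unlist(comm["site1", ]), decreasing = TRUE)
plot(seq_along(abund), abund, type = "b",
     xlab = "Species rank", ylab = "Abundance",
     main = "Rank abundance, site1")
```

A data frame shaped like comm is the kind of input that the x argument described above expects.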
My dataframe contains three variables:
Row_Number Sample_ID Expression_Level
1 hum_449 0.25
2 hum_459 0.35
4 mur_223 0.45
I want to produce histograms of the third column using
hist(dataframe$Expression_Level)
And I want to label some of the bars with a list of Sample_ID values that correspond to that particular expression level.
I have the desired Sample_IDs stored as a list object and also as a data frame with corresponding Row_Number and Expression_Level values (essentially just a subset of the original data frame). I don't know what to do next or even what to type into a search engine.
I have ggplot2 installed because friends told me it would probably be helpful but I am unfamiliar with it and face the same problem of not knowing what to look for when reading the documentation. Would prefer not to install more packages if possible.
You could use the following to add a label, taken from the third element of Sample_ID, under the third bar of a histogram. This seems like an odd way to go, though, since the bars of a histogram are counts. Might you want barplot instead? The same approach works there, except that barplot returns the bar midpoints directly, so you would use at=temp[3].
temp <- hist(dataframe$Expression_Level)
mtext(text=dataframe$Sample_ID[3],side=1,line=2,at=temp$mids[3])
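To label bars with several chosen Sample_IDs at once without extra packages, the bar that each sample falls into can be looked up from the breaks that hist() returns. This is a rough sketch on made-up data; the vector wanted and the object names are hypothetical stand-ins for your stored subset.

```r
set.seed(1)
df <- data.frame(Sample_ID = paste0("S-", 1:20),
                 Expression_Level = round(runif(20), 2),
                 stringsAsFactors = FALSE)
wanted <- c("S-3", "S-7", "S-15")   # hypothetical IDs to show on the plot

h <- hist(df$Expression_Level, breaks = seq(0, 1, by = 0.1))

# Find the bar each wanted sample belongs to, using the same intervals as hist()
sub <- df[df$Sample_ID %in% wanted, ]
bar <- cut(sub$Expression_Level, breaks = h$breaks,
           include.lowest = TRUE, labels = FALSE)

# Collapse the IDs that share a bar and write them above that bar
labs <- tapply(sub$Sample_ID, bar, paste, collapse = "\n")
idx  <- as.integer(names(labs))
text(x = h$mids[idx], y = h$counts[idx], labels = labs, pos = 3, xpd = NA)
```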
Something like this?
set.seed(1) # for reproducible example
# create sample data - you have this already
df <- data.frame(sample_ID=paste0("S-",1:100),
Expression_Level=round(runif(100),1),
stringsAsFactors=FALSE)
# you start here...
labels <- aggregate(sample_ID~Expression_Level,df,c)
labels$lab <- sapply(labels$sample_ID,function(x)paste(unlist(x),collapse="|"))
library(ggplot2)
ggplot(df, aes(x=factor(Expression_Level))) +
geom_bar(fill="lightgreen",color="grey50")+ # geom_bar, since x is a factor
geom_text(data=labels,aes(y=.1,label=lab),hjust=0)+
labs(x="Expression_Level")+
coord_flip()
Here's my hypothetical data frame;
location<- as.factor(rep(c("town1","town2","town3","town4","town5"),100))
visited<- as.factor(rbinom(500,1,.4)) #'Yes or No' variable
variable<- rnorm(500,10,2)
id<- 1:500
DF<- data.frame(id,location,visited,variable)
I want to create a new data frame where the number of 0's and 1's is equal for each location. I want to accomplish this by taking a random sample of the 0's for each location (since there are more 0's than 1's).
I found this solution to sample by group;
library(plyr)
ddply(DF[DF$visited=="0",],.(location),function(x) x[sample(nrow(x),size=5),])
I entered '5' for the size argument so the code would run, but I can't figure out how to set the 'size' argument equal to the number of observations where DF$visited==1.
I suspect the answer could be in other questions I've reviewed, but they've been a bit too advanced for me to implement.
Thanks for any help.
The key to using ddply well is to understand that it will:
break the original data frame down by groups into smaller data frames,
then, for each group, it will call the function you give it, whose job it is to transform that data frame into a new data frame,
and finally, it will stitch all the little transformed data frames back together.
With that in mind, here's an approach that (I think) solves your problem.
sampleFunction <- function(df) {
# Determine whether visited==1 or visited==0 is less common for this location,
# and use that count as our sample size.
n <- min(nrow(df[df$visited=="1",]), nrow(df[df$visited=="0",]))
# Sample n from the two groups (visited==0 and visited==1).
ddply(df, .(visited), function(x) x[sample(nrow(x), size=n),])
}
newDF <- ddply(DF,.(location),sampleFunction)
# Just a quick check to make sure we have the equal counts we were looking for.
ddply(newDF, .(location, visited), summarise, N=length(variable))
How it works
The main ddply simply breaks DF down by location and applies sampleFunction, which does the heavy lifting.
sampleFunction takes one of the smaller data frames (in your case, one for each location), and samples from it an equal number of visited==1 and visited==0. How does it do this? With a second call to ddply: this time, using visited to break it down, so we can sample from both the 1's and the 0's.
Notice, too, that we're calculating the sample size for each location based on whichever sub-group (0 or 1) has fewer occurrences, so this solution will work even if there aren't always more 0's than 1's.
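The same split, sample and recombine idea can also be written without plyr. The following is only a base R sketch of that approach on data like yours; balance_one is a hypothetical helper name, not part of any package.

```r
set.seed(42)
DF <- data.frame(id = 1:500,
                 location = factor(rep(paste0("town", 1:5), 100)),
                 visited = factor(rbinom(500, 1, 0.4)),
                 variable = rnorm(500, 10, 2))

# For one location: keep n rows of each visited level, where n is the size
# of the smaller level.
balance_one <- function(d) {
  idx_by_visit <- split(seq_len(nrow(d)), d$visited)
  n <- min(lengths(idx_by_visit))
  # sample.int on the length avoids sample()'s surprise with length-1 input
  keep <- unlist(lapply(idx_by_visit,
                        function(i) i[sample.int(length(i), n)]))
  d[keep, ]
}

# Split by location, balance each piece, then stitch the pieces back together
newDF <- do.call(rbind, lapply(split(DF, DF$location), balance_one))

# Quick check: the two columns should be equal within every location
table(newDF$location, newDF$visited)
```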
In R, I'm drawing a rather large boxplot from a data.frame with approximately 150 columns. I know that there are some "anomalous" columns where the distribution is too different from the rest of the data set and I want to identify which ones precisely.
Rather unsurprisingly, there is not enough room for the labels and even if there were, it would be probably inconvenient to check by hand. So I thought I could use R's
identify function to locate the offending columns. Such a function however needs x and y coordinates, and so far I was unable to get it to work.
I tried
boxplot(dd.noctr$TGS, outline=F)
identify(xy.coords(dd.noctr$TGS)$x, y=xy.coords(dd.noctr$TGS)$y)
where dd.noctr$TGS is my data (a matrix or data.frame), only to get the error
warning: no point within 0.25 inches
meaning that no point was identified.
Is there an alternative solution to identify column names (not single points)?
This solution seems a bit clunky, so there is probably a better solution.
Set up some example data with three columns:
TGS = data.frame(A = rnorm(100), B = rnorm(100), C=rnorm(100))
Next, plot the boxplot:
boxplot(TGS, outline=FALSE)
Now we call the identify function, giving every data point its column index as the x coordinate:
identify(x=rep(1:ncol(TGS), each=nrow(TGS)),
y=unlist(TGS, use.names=FALSE),
labels=rep(colnames(TGS), each=nrow(TGS)))
The labels are the column names. This only works if you click close to a data point, that is, near the vertical centre line of a box.
If you want to get a list of outliers, you can use the 'out' component of the value returned by boxplot.
Example:
Create a data frame with a few random values with mean 20 and add some outliers. This code will display the outliers.
df1 = data.frame(A = c(rnorm(15,20,3),7,8,35,32)) #15 rnorm and 4 extreme values
bplot=boxplot(df1)
bplot$out
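With many columns, the value returned by boxplot also has 'group' and 'names' components, which tell you which column each outlier came from. A small sketch on made-up data (dat and outliers are illustrative names), using plot=FALSE so that only the statistics are computed:

```r
set.seed(1)
# Column C gets two obviously extreme values appended
dat <- data.frame(A = rnorm(100),
                  B = rnorm(100),
                  C = c(rnorm(98), 8, -9))

bp <- boxplot(dat, plot = FALSE)

# bp$group holds the column index of every value in bp$out
outliers <- data.frame(column = bp$names[bp$group], value = bp$out)

# Number of outliers per column; columns with many are candidates to inspect
table(outliers$column)
```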
I have a pre-binned frequency table for a rather large dataset. That is, a single column vector of bins and a single column vector of counts associated with those bins. I'd like R to plot a histogram of this data by doing further binning and summing the existing counts. For example, if in the pre-binned data I have something like [(0.01, 5000), (0.02, 231), (0.03, 948)], where the first number is the bin and the second is the count, and I choose 0.04 as the new bin width, I'd expect to get [(0.04, 6179)]. What's the fastest and/or easiest way to do this in R?
Looks like ggplot2 has the answer.
library(ggplot2)
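# bins and counts are the two pre-binned vectors; qplot needs a data frame, not the matrix that cbind returns
qplot(bins, data=data.frame(bins,counts), weight=counts, geom="histogram")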
The new HistogramTools package on CRAN has a number of useful functions for doing exactly this. In your example, if you want to merge three adjacent buckets together at each point in the histogram to produce a new histogram with 1/3rd as many buckets, you could use the MergeBuckets function.
install.packages("HistogramTools")
library(HistogramTools)
h <- hist(rexp(1000), breaks=60)
plot(MergeBuckets(h, adj.buckets=3))
Alternatively, you can also specify a list of the new breakpoints you want explicitly, rather than telling MergeBuckets() to always merge the same number of adjacent buckets.
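If only the summed counts are needed, the re-binning can also be done in base R by mapping each old bin to its new bin and summing. This is a minimal sketch, assuming (as in the question's example) that each bin value labels the upper edge of its bin; the extra bins 0.05 and 0.06 and their counts are made up for illustration.

```r
bins   <- c(0.01, 0.02, 0.03, 0.05, 0.06)
counts <- c(5000, 231, 948, 10, 20)
width  <- 0.04

# Map every old bin to the upper edge of the new, wider bin it falls into.
# round() guards against floating-point noise in bins / width.
new_bin <- ceiling(round(bins / width, 8)) * width

# Sum the existing counts within each new bin
rebinned <- tapply(counts, new_bin, sum)   # bin 0.04 -> 6179, bin 0.08 -> 30

barplot(rebinned, xlab = "Upper bin edge", ylab = "Count")
```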