In R, I'm drawing a rather large boxplot from a data.frame with approximately 150 columns. I know that there are some "anomalous" columns where the distribution is too different from the rest of the data set and I want to identify which ones precisely.
Unsurprisingly, there is not enough room for the labels, and even if there were, it would probably be inconvenient to check by hand. So I thought I could use R's identify function to locate the offending columns. That function, however, needs x and y coordinates, and so far I have been unable to get it to work.
I tried
boxplot(dd.noctr$TGS, outline=F)
identify(xy.coords(dd.noctr$TGS)$x, y=xy.coords(dd.noctr$TGS)$y)
where dd.noctr$TGS is my data (a matrix or data.frame), only to get the error
warning: no point within 0.25 inches
meaning that no point was identified.
Is there an alternative solution to identify column names (not single points)?
This solution seems a bit clunky, so there is probably a better solution.
Set up some example data with three columns:
TGS = data.frame(A = rnorm(100), B = rnorm(100), C=rnorm(100))
Next plot the boxplot
boxplot(TGS, outline=F)
Now call identify, placing every value at the x position of its column:
identify(x=rep(1:ncol(TGS), each=nrow(TGS)),
y=as.vector(unlist(TGS)),
labels=rep(colnames(TGS), each=nrow(TGS)))
The labels are the column names. Because every point sits on the vertical centre line of its box, this only works if you click near the centre of a box.
If you want to get a list of outliers, you can use the 'out' component of the value returned by boxplot.
Example: create a data frame with a few random values with mean 20, and add some outliers. This code will display the outliers.
df1 = data.frame(A = c(rnorm(15,20,3),7,8,35,32)) #15 rnorm and 4 extreme values
bplot=boxplot(df1)
bplot$out
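To tie the outliers back to column names, the value returned by boxplot also carries 'group' and 'names' components alongside 'out'. A minimal sketch, assuming a made-up data frame dd in which column C has four planted extreme values:

```r
# Hypothetical data: column C carries four planted extreme values.
set.seed(42)
dd <- data.frame(A = rnorm(100), B = rnorm(100),
                 C = c(rnorm(96), 50, 60, -40, -50))

# plot = FALSE returns the statistics without drawing anything.
bp <- boxplot(dd, plot = FALSE)

# bp$group holds, for each value in bp$out, the index of its column.
outlier_col <- bp$names[bp$group]
outliers_per_column <- table(factor(outlier_col, levels = bp$names))
print(outliers_per_column)

# Columns ranked by how many points fall beyond the whiskers.
names(sort(outliers_per_column, decreasing = TRUE))
```

With around 150 columns, sorting this table is likely quicker than clicking on the plot, although the number of outliers is only one possible notion of "anomalous".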
In case there is an easier way, I am trying to overlay the plots of 4 different "performance" objects from the ROCR package. The gist is that each of these objects contains two vectors of equal length, one for the X values and one for the Y values, but the X/Y vectors are not the same length between objects.
Currently I am just extracting and plotting these curves manually with plot() and lines(), to create this:
It's not terrible, but I think I would have better control with ggplot. The only problem is that I can't think of a way to create a data.frame() from these vectors for ggplot.
ggplot prefers data in long format, so different lengths for different lines don't matter.
The structure is pretty simple: one column that defines the line (call it iteration; in your case it takes the values 1, 2, 3, or 4, and you should probably make it a factor), one column that gives x, and one column that gives y.
Since you don't provide any code or sample data, I'll assume that's as much of an answer as you're looking for. You can use c() on individual vectors or rbind() on individual data frames to combine them, or dplyr::bind_rows() or data.table::rbindlist() to operate on a list of data frames.
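As a rough illustration of that long format, here is a sketch that uses made-up stand-ins for the four curves (the real x/y vectors would come from the ROCR objects, which are not shown in the question). The names curves, long and iteration are assumptions for the example only.

```r
# Stand-ins for the x/y vectors pulled out of four curves;
# the lengths differ on purpose.
set.seed(1)
lens <- c(5, 8, 3, 6)
curves <- lapply(lens, function(n) {
  data.frame(x = sort(runif(n)), y = sort(runif(n)))
})

# Stack into long format and record which curve each row came from.
long <- do.call(rbind, curves)
long$iteration <- factor(rep(seq_along(curves), times = sapply(curves, nrow)))

str(long)

# One line per curve; skipped if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = x, y = y, colour = iteration)) + geom_line()
  print(p)
}
```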
After applying hexbin'ning I would like to know which id or rownumbers of the original data ended up in which bin.
I am currently analysing spatial data and I am binning, e.g., depth of water and temperature. Ideally, I would like to map the colormap of the bins back to the spatial map to see where more or less common parameter combinations exist. I'm not bound to hexbin though.
I wasn't able to figure out from the documentation how to trace which data point ends up in which bin. It seems hexbin() only stores counts.
Is there a function that generates a list with one entry for every bin, each containing a vector of all row numbers that were assigned to that bin?
Please point me in the right direction.
Up to now, I use plain hexbin to do the binning:
library(hexbin)
set.seed(5)
df <- data.frame(depth=runif(1000,min=0,max=100),temp=runif(1000,min=4,max=14))
h <- hexbin(df)
but currently I see no way to extract rownames of df from h that link the bins to df. Possibly there is no such thing, maybe I overlooked it or there is a completely different approach needed.
Assuming you are using the hexbin package, then you will need to set IDs=TRUE to be able to go back to the original rows
library(hexbin)
set.seed(5)
df <- data.frame(depth=runif(1000,min=0,max=100),temp=runif(1000,min=4,max=14))
h<-hexbin(df, IDs=TRUE)
Then to get the bin number for each observation, you can use
h@cID
To get the count of observations in the cell populated by a particular observation, you would do
h@count[match(h@cID, h@cell)]
The idea is that the second observation df[2,] is in cell h@cID[2]=424. Cell 424 is at index which(h@cell==424)=241 in the list of cells (zero count cells appear to be omitted). The number of observations in that cell is h@count[241]=2.
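The list asked for in the question (one entry per bin, each holding the row numbers in that bin) can probably be built from the same slot with base R's split(). A minimal sketch, assuming the hexbin package is installed; rows_by_bin and bin_count are names made up for the example:

```r
library(hexbin)

set.seed(5)
df <- data.frame(depth = runif(1000, min = 0, max = 100),
                 temp  = runif(1000, min = 4, max = 14))
h <- hexbin(df, IDs = TRUE)

# One list entry per occupied cell, holding the row numbers of df in it.
rows_by_bin <- split(seq_len(nrow(df)), h@cID)

# Count of the cell each row fell into, e.g. for colouring a spatial map.
df$bin_count <- h@count[match(h@cID, h@cell)]

head(rows_by_bin, 3)
```

The list names are the cell IDs as character strings, so rows_by_bin[["424"]] would give the rows in cell 424.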
My dataframe contains three variables:
Row_Number Sample_ID Expression_Level
1 hum_449 0.25
2 hum_459 0.35
4 mur_223 0.45
I want to produce histograms of the third column using
hist(dataframe$Expression_Level)
And I want to label some of the bars with a list of Sample_ID values that correspond to that particular expression level.
I have the desired Sample_IDs stored as a list object and also as a data frame with corresponding Row_Number and Expression_Level values (essentially just a subset of the original data frame). I don't know what to do next or even what to type into a search engine.
I have ggplot2 installed because friends told me it would probably be helpful but I am unfamiliar with it and face the same problem of not knowing what to look for when reading the documentation. Would prefer not to install more packages if possible.
You could use the following to add a label corresponding to the third element of Sample_ID to the third "bar" of a histogram. But this seems like an odd way to go, since the bars of a histogram are counts. Might you want barplot instead? The same approach works with barplot in place of hist, except that barplot returns the bar midpoints directly, so you would use at=temp[3].
temp <- hist(dataframe$Expression_Level)
mtext(text=dataframe$Sample_ID[3],side=1,line=2,at=temp$mids[3])
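To label bars by the samples that fall into them rather than by position, one option in base R is to reuse the breaks that hist returns. This is only a sketch on made-up data: the dataframe contents, the wanted subset and the row indices picked for it are all assumptions.

```r
# Hypothetical data in the shape described in the question.
set.seed(1)
dataframe <- data.frame(Sample_ID = paste0("hum_", 1:50),
                        Expression_Level = runif(50))
wanted <- dataframe[c(3, 17, 42), ]   # the subset to label

h <- hist(dataframe$Expression_Level, plot = FALSE)

# cut() with include.lowest = TRUE mirrors hist()'s default binning.
bin <- cut(wanted$Expression_Level, breaks = h$breaks,
           include.lowest = TRUE, labels = FALSE)

# Join IDs that share a bar, then write them above that bar.
lab <- tapply(wanted$Sample_ID, bin, paste, collapse = "\n")
bars <- as.integer(names(lab))

plot(h, ylim = c(0, max(h$counts) * 1.3), main = "Expression_Level")
text(x = h$mids[bars], y = h$counts[bars], labels = lab, pos = 3, cex = 0.8)
```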
Something like this?
set.seed(1) # for reproducible example
# create sample data - you have this already
df <- data.frame(sample_ID=paste0("S-",1:100),
Expression_Level=round(runif(100),1),
stringsAsFactors=FALSE)
# you start here...
labels <- aggregate(sample_ID~Expression_Level,df,c)
labels$lab <- sapply(labels$sample_ID,function(x)paste(unlist(x),collapse="|"))
library(ggplot2)
ggplot(df, aes(x=factor(Expression_Level))) +
geom_bar(fill="lightgreen",color="grey50")+ # x is a factor; current ggplot2 rejects geom_histogram on a discrete x
geom_text(data=labels,aes(y=.1,label=lab),hjust=0)+
labs(x="Expression_Level")+
coord_flip()
I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values, faceted using the values in B and C.
library(reshape2) # for melt()
library(ggplot2)
plotdata<-function(l){
melt<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
plot<-ggplot(melt,aes(x=X1,y=value))+geom_point()
plot2<-plot+facet_grid(B ~ C)
ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=plot2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However, I am new to writing for loops and can't find a way to loop over the specific column names that are required for my function to work. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric.
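A small sketch of the same idea that runs without any plotting. plotdata_stub is a hypothetical stand-in for plotdata() that only returns the file name it would write; wanted and files are likewise names invented for the example.

```r
# Hypothetical stand-in for plotdata(): returns the file name it would
# write instead of drawing anything.
plotdata_stub <- function(l) paste("X_vs_", l, "_faceted.jpeg", sep = "")

wanted <- c("Y1", "Y3", "Y4")

# for loop over the names themselves
for (col in wanted) {
  message("would write ", plotdata_stub(col))
}

# Same idea with vapply(), which also collects the results
files <- vapply(wanted, plotdata_stub, character(1))
files
```

Keeping the column names in a vector such as wanted means the list of plots can be changed in one place.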
I'm trying to plot multiple overlaying density plots for two vectors on the same figure. As far as I know, I'm not able to do so unless they are in the same object.
In order to plot the data, I need to have a data.frame() with two columns; one for the value, and one to specify which vector each value belongs to.
My first vector contains 400 values. The second contains 1200. My current (somewhat inelegant) solution involves concatenating the two vectors into one column of a new data.frame, and adding a second column which contains 400 'a's and 1200 'b's, to indicate which vector the original data came from. This only works because I know how many values there were in each original vector.
Surely there must be a more efficient way to do this?
Let's say my original data are from dframe1$vector and dframe2$vector. I'm looking to create a new object called dframe3 which contains the columns $value and $original_vector_number. How do I do this?
You're trying to solve a problem you don't need to solve. You don't need to have them in the same object to plot their densities. Just use lines.
x <- rnorm(400,0,1)
y <- rnorm(1200,2,2)
plot(density(x))
lines(density(y))
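One caveat with this approach is that plot() sets the axis limits from the first curve only, so the second may be clipped. A sketch of one way around that, computing both densities first; the colours and legend are illustrative choices, not part of the original answer:

```r
set.seed(1)
x <- rnorm(400, 0, 1)
y <- rnorm(1200, 2, 2)

dx <- density(x)
dy <- density(y)

# Axis limits that cover both curves, so lines() is not clipped.
xlim <- range(dx$x, dy$x)
ylim <- range(0, dx$y, dy$y)

plot(dx, xlim = xlim, ylim = ylim, main = "Densities of x and y")
lines(dy, col = "red")
legend("topright", legend = c("x", "y"), col = c("black", "red"), lty = 1)
```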
Use library(reshape) and melt if you don't want to do this by hand:
library(reshape)
dframe <- data.frame(a = rnorm(400,1,1),b = rnorm(1200,1.2,2))
df.m <- melt(dframe)
library(ggplot2)
ggplot(df.m,aes(x = value,color = variable)) + geom_density()
Note that this does not give the correct answer: data.frame() recycles the shorter vector to fill 1200 rows (which works here only because 1200 is a multiple of 400), so the density for a is estimated from repeated values. The correct way to combine the data and plot it in ggplot is the following:
By hand:
vecA <- data.frame(rnorm(400,1,1),'a')
vecB <- data.frame(rnorm(1200,1.2,2),'b')
names(vecA) <- c('value','name')
names(vecB) <- c('value','name')
dtf <- rbind(vecA,vecB)
library(ggplot2)
ggplot(dtf,aes(x=value,color=name))+geom_density()
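If the lengths are not known in advance, they do not need to be typed in. A minimal sketch with two base R options, where v1 and v2 stand in for dframe1$vector and dframe2$vector from the question:

```r
# Two vectors of different lengths, as in the question.
set.seed(1)
v1 <- rnorm(400, 1, 1)
v2 <- rnorm(1200, 1.2, 2)

# Option 1: let length() supply the group sizes.
dframe3 <- data.frame(
  value = c(v1, v2),
  original_vector_number = factor(rep(c(1, 2), times = c(length(v1), length(v2))))
)

# Option 2: stack() a named list; it returns columns `values` and `ind`.
stacked <- stack(list(a = v1, b = v2))

table(dframe3$original_vector_number)
table(stacked$ind)
```

Either result can be passed to ggplot in the same way as dtf above, mapping the grouping column to color.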