Breaks not unique error when using cut and ddply - r

I am trying to break a dataset into quantiles based on a group.
I have the following code, which works fine if I do the cut using seq(0, 1, .5), but when I change it to seq(0, 1, .2) it gives:
Error in cut.default(x = fwd_quarts$v, breaks = quantile(fwd_quarts$v, : 'breaks' are not unique
Trying different code, I can't get away from the error. How do I adjust this so that, as it expands to larger datasets, the quantiles are created without the error?
library(plyr)

g <- data.frame(g = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3))
v <- data.frame(v = c(1, 4, 4, 5, NA, 2, 6, NA, 7, 8))
df <- cbind(g, v)
df <- df[complete.cases(df), ]
ddf <- ddply(df, "g", function(fwd_quarts) {
  eps_quartile <- cut(x = fwd_quarts$v,
                      breaks = quantile(fwd_quarts$v, probs = seq(0, 1, 0.5)),
                      labels = FALSE, include.lowest = TRUE)
  cbind(fwd_quarts, eps_quartile)
})
df <- cbind(df, fwde_quart = ddf$eps_quartile)

This has nothing to do with ddply.
If your data does not generate unique breaks, you can make them unique by wrapping the quantile call in unique():
breaks = unique(quantile(fwd_quarts$v, probs = seq(0, 1, 0.2)))
However, this will lower the number of levels from what you originally desired.
Generally speaking, if you have data like c(1,1,1,2) you can't break it into 3 groups. The number of groups should be less than or equal to the number of unique values in your data. HTH.
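For example, a minimal sketch applying this fix to the question's data (reusing df from above):
library(plyr)

ddf <- ddply(df, "g", function(fwd_quarts) {
  # unique() drops duplicated quantiles so cut() receives valid breaks
  brks <- unique(quantile(fwd_quarts$v, probs = seq(0, 1, 0.2)))
  fwd_quarts$eps_quartile <- cut(fwd_quarts$v, breaks = brks,
                                 labels = FALSE, include.lowest = TRUE)
  fwd_quarts
})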

I ran into the same problem with leaflet: if there are not enough observations to draw the map, it gives the same error. As a solution, I just combined the clusters that had few observations.
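A hedged sketch of that workaround (the data frame dat and its cluster column are hypothetical):
# Collapse clusters with too few observations into a single "Other" group
dat$cluster <- as.character(dat$cluster)
counts <- table(dat$cluster)
rare <- names(counts)[counts < 5]  # the threshold of 5 is arbitrary
dat$cluster[dat$cluster %in% rare] <- "Other"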

Related

How to fix "Error in FUN(X[[i]], ...) : only defined on a data frame with all numeric variables"

I intend to draw a Q-Q plot of the data, but the qqnorm function only works on numeric data.
Since the effects include A, B, C, D and their two-, three- and four-way interactions, I have no idea how to convert them into numeric form.
The data is as follows:
Effects,Value
A,76.95
B,-67.52
C,-7.84
D,-18.73
AB,-51.32
AC,11.69
AD,9.78
BC,20.78
BD,14.74
CD,1.27
ABC,-2.82
ABD,-6.5
ACD,10.2
BCD,-7.98
ABCD,-6.25
My code is as follows:
library(readr)
data621 <- read_csv("Desktop/data621.csv")
data621_qq<-qqnorm(data621,xlab = "effects",datax = T)
qqline(data621,probs=c(0.3,0.7),datax = T)
text(data621_qq$x,data621_qq$y,names(data621),pos=4)
Your code would work if you used the proper columns instead of the entire data frame. For example,
data621_qq <- qqnorm(data621$Value, xlab = "Effects", datax = TRUE)
qqline(data621$Value, probs = c(0.3, 0.7), datax = TRUE)
text(data621_qq$x, data621_qq$y, data621$Effects, pos=4)
By the way, names(data621) would give you the column names, instead of the effect names (which are stored as values in a column).

R - Histogram Doesn't show density due to magnitude of the Data

I have a vector called data of length approximately 444,000, and almost all of the numeric values are between 1 and 100. I want to draw the histogram and overlay the appropriate density on it. However, when I draw the histogram I get this:
hist(data, freq = FALSE)
What can I do to actually see a more detailed histogram? I tried the breaks argument; it helped, but it's really hard to see the histogram because it's so small. For example, I used breaks = 2000 and got this:
Is there something that I can do? Thanks!
Since you don't show data, I'll generate some random data:
d <- c(rexp(1e4, 100), runif(100, max=5e4))
hist(d)
Dealing with outliers like this, you can display the histogram of the logs, though that may be difficult to interpret.
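For instance, a minimal sketch reusing the simulated d from above:
# A log transform compresses the outliers' range; base 10 for readability
hist(log10(d), main = "Histogram of log10(d)", xlab = "log10(d)")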
If you are okay with showing a subset of the data, then you can filter the outliers out either dynamically (perhaps using quantile) or manually. The important thing when showing this visualization in your analysis is that if you must remove data for the plot, you should be up-front about the removal. (This is terse ... it would also be informative to include the range and/or other properties of the omitted data, but that's subjective and will differ based on the actual data.)
quantile(d, seq(0, 1, len = 11))   # inspect deciles to pick a cutoff
d2 <- d[d < quantile(d, 0.90)]     # dynamic filter: keep the lower 90%
hist(d2)
txt <- sprintf("(%d points shown, %d excluded)", length(d2), length(d) - length(d2))
mtext(txt, side = 1, line = 3, adj = 1)   # annotate the omission on the plot
d3 <- d[d < 10]                    # manual filter: fixed cutoff
hist(d3)
txt <- sprintf("(%d points shown, %d excluded)", length(d3), length(d) - length(d3))
mtext(txt, side = 1, line = 3, adj = 1)

Identify spikes/peaks in density plot by group

I created a density plot with the ggplot2 package for R. I would like to identify the spikes/peaks in the plot which occur between 0.01 and 0.02. There are too many legends to pick them out, so I deleted all legends. I filtered my data to find the group with the most rows between 0.01 and 0.02, then filtered that group out to see whether the spike/peak was gone, but no, it is still plotted. Can you suggest a way to identify these spikes/peaks?
Here is some code :
ggplot(NumofHitsnormalized, aes(NumofHits_norm, fill = name)) +
  geom_density(alpha = 0.2) +
  theme(legend.position = "none") +
  xlim(0.0, 0.15)
## Filter out the data that is in the range of the first spike
test <- NumofHitsnormalized[which(NumofHitsnormalized$NumofHits_norm > 0.01 &
                                  NumofHitsnormalized$NumofHits_norm < 0.02), ]
## Figure out which group (name column) has the most rows,
## thinking that group's data might produce the spike
testMatrix <- matrix(ncol = 2, nrow = length(unique(test$name)))
for (i in 1:length(unique(test$name))) {
  testMatrix[i, 1] <- unique(test$name)[i]
  testMatrix[i, 2] <- sum(test$name == unique(test$name)[i])  # row count per group
}
Konrad,
This is the new plot, made after I filtered my data with the extremevalues package. There are new peaks located at different intervals, and 96% of the initial groups still have data in the new plot (though the number of rows in the filtered data dropped to 0.023% of the initial dataset), so I can't identify which peaks belong to which groups.
I had a similar problem to this.
What I did was to create a rolling mean and sd of the y values with a window of 3, then:
Calculate the average sd of your baseline data (the data you know won't have peaks).
Set a threshold value.
If above the threshold, 1, else 0.
library(TTR)  # provides runMean() and runSD()

d5$roll_mean <- runMean(d5$`Current (pA)`, n = 3)
d5$roll_sd <- runSD(x = d5$`Current (pA)`, n = 3)
d5$delta <- ifelse(d5$roll_sd > 1, 1, 0)
currents <- subset(d5, d5$delta == 1)  # finds all peaks
My threshold was an sd > 1. Depending on your data you may want to use the mean or the sd; for slow-rising peaks the mean would be a better choice than the sd.
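A hedged sketch of the mean-based variant (the baseline and the 0.5 offset are hypothetical and should come from your known peak-free data):
baseline <- mean(d5$roll_mean, na.rm = TRUE)
d5$delta_mean <- ifelse(d5$roll_mean > baseline + 0.5, 1, 0)
slow_peaks <- subset(d5, d5$delta_mean == 1)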
Without looking at the code, I drafted this simple function to add TRUE/FALSE flags indicating outliers:
GenerateOutlierFlag <- function(x) {
  # Load required packages
  Vectorize(require)(package = c("extremevalues"), char = TRUE)
  # Run check for outliers
  out_flg <- ifelse(1:length(x) %in% getOutliers(x, method = "I")$iLeft,
                    TRUE, FALSE)
  out_flg <- ifelse(1:length(x) %in% getOutliers(x, method = "I")$iRight,
                    TRUE, out_flg)
  return(out_flg)
}
If you care to read about the extremevalues package, you will see that it provides some flexibility in terms of identifying outliers, and broadly speaking it's a good tool for finding various peaks or spikes in the data.
Side point
You could actually optimise it significantly by creating one object corresponding to getOutliers(x, method = "I") instead of calling the method twice.
More sensible syntax
GenerateOutlierFlag <- function(x) {
  # Load required packages
  require("extremevalues")
  # Outliers object (computed once instead of twice)
  outObj <- getOutliers(x, method = "I")
  # Run check for outliers
  out_flg <- ifelse(1:length(x) %in% outObj$iLeft,
                    TRUE, FALSE)
  out_flg <- ifelse(1:length(x) %in% outObj$iRight,
                    TRUE, out_flg)
  return(out_flg)
}
Results
x <- c(1:10, 1000000, -99099999)
table(GenerateOutlierFlag(x))
FALSE TRUE
10 2

R: Create a more readable X-axis after binning data in ggplot2. Turn bins into whole numbers

I have a dummy variable, call it "drink", and a corresponding age variable that represents a precise age estimate (several decimal places) for each person in a dataset. I want to first "bin" the age variable, extract the mean value of the "drink" dummy for each bin, and then graph the result. My code to do so looks like this:
library(plyr)     # for ddply
library(ggplot2)  # for qplot

df$bins <- cut(df$age, seq(from = 17, to = 31, by = .2), include.lowest = TRUE)
df.plot <- ddply(df, .(bins), summarise, avg.drink = mean(drinks_alcohol))
qplot(bins, avg.drink, data = df.plot)
This works well enough, but the x-axis in the graph is unreadable because of the sheer number of bin labels. Is there a way to modify the x-axis to show, for example, ages 19-23 only, with the "ticks" still aligning with the correct bins? For example, in my current code there is a bin for (19, 19.2] and another bin for (20, 20.2]. I would want only the bins that start at whole numbers to be identified on the x-axis, labelled with the first number (19, 20), not the second (19.2, 20.2).
Is there any straightforward way to do this?
The most direct way to specify axis labels is with the appropriate scale function; in the case of factors on the x-axis, scale_x_discrete. It will use whatever labels you give it with the labels argument, or you can give it a function that formats things as you like.
To "manually" specify the labels, you just need to create a vector of appropriate length. In this case, if your factor values are intervals beginning with seq(17, 31.8, by = 0.2) and you want to label the bins that begin with integers, then your labels vector will be
bin_starts = seq(17, 31.8, by = 0.2)
bin_labels = ifelse(bin_starts - trunc(bin_starts) < 0.0001, as.character(bin_starts), "")
(I use a - b < 0.0001 in case of floating-point precision problems, though it shouldn't be an issue in this particular case.)
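A hedged usage sketch (assuming the length of bin_labels matches the number of levels in df.plot$bins):
qplot(bins, avg.drink, data = df.plot) +
  scale_x_discrete(labels = bin_labels)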
A more robust solution would be to label the factor levels with the number at the start of the interval from the beginning; cut also has a labels argument.
my_breaks = seq(17, 32, by = 0.2)
df$bins <- cut(df$age, breaks = my_breaks, labels = head(my_breaks, -1),
include.lowest = TRUE)
You could then fairly easily write a formatter (following templates from the scales package) to print only the ones you want:
int_only = function(x) {
# test if we can coerce to numeric, if not do nothing
if (any(is.na(as.numeric(x)))) return(x)
# otherwise convert to numeric and return integers and blanks as labels
x = as.numeric(x)
return(ifelse(x - trunc(x) < 1e-10, as.character(x), ""))
}
Then, using the nicely formatted data created above, you should be able to pass int_only as a formatter function to labels to get the labels you want, as sketched below. (Note: untested! Necessary tweaks left as an exercise for the reader, though I'll gladly accept edits :) )
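The call might look like this (same df.plot as in the question):
qplot(bins, avg.drink, data = df.plot) +
  scale_x_discrete(labels = int_only)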

Tufte tables: convert quartile plots into standard error plots hacking qTable function from NMOF package

If you remember, there is a nice version of a table conceived by Tufte that includes small quartile plots running next to the corresponding data rows:
There is an implementation of such a solution in R using the NMOF package and the qTable function, which basically creates the table shown above and outputs it as LaTeX code:
require(NMOF)
x <- rnorm(100, mean = 0, sd = 2)
y <- rnorm(100, mean = 1, sd = 2)
z <- rnorm(100, mean = 1, sd = 0.5)
X <- cbind(x, y, z)
qTable(X, filename = "res.tex")  # saves the qTable in TeX format in the current dir
This method of visualization seems especially useful if you have a small amount of information to present and you don't want to waste space on a full graph. But I would like to hack qTable a bit. Instead of displaying quartile plots, I would prefer to display standard errors of the mean. I am not great at hacking such functions, so I used brute force. I replaced the line from the qTable function which computes the quantiles:
A <- apply(X, 2L, quantile, c(0.25, 0.5, 0.75))
with something very brutal that computes standard errors:
require(psych)                      # psych's SD() computes column standard deviations
M <- colMeans(X)                    # column means
SE <- SD(X) / sqrt(nrow(X))         # standard error of the mean
SELo <- M - SE                      # lower bound
SEHi <- M + SE                      # upper bound
A <- t(data.frame(SELo, M, SEHi))   # combine into the 3-row matrix qTable expects
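For reference, the same quantities can be computed in base R without psych (a minimal sketch):
M <- colMeans(X)
SE <- apply(X, 2, sd) / sqrt(nrow(X))
A <- rbind(SELo = M - SE, M = M, SEHi = M + SE)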
I know, I know, it's a possibly unsustainable approach, but it actually works to some extent: it plots standard errors but keeps this gap in the plot:
and I would like this gap to disappear.
Here is the qTable function with the modification discussed above.
To remove the gaps, you can insert these two lines:
B[2, ] <- B[3, ]
B[4, ] <- B[3, ]
right before the for loop that starts with
for (cc in 1L:dim(X)[2L]) {...
Why does it work? Reading the graph from left to right, the five rows of B correspond to where
1) the left segments start
2) the left segments end
3) the dots are
4) the right segments start
5) the right segments end
so by forcing B[2, ] and B[4, ] to B[3, ] you are effectively getting rid of the gaps.
