show only 0-90% or 0-95% percentile - r

Here is my code and the resulting plot. Due to some outliers, the x-axis is very long. Is there a simple way to filter df$foo to only the 0-90% or 0-95% percentile range in R, so that I can plot only the normal values? Thanks.
df <- read.csv('~/Downloads/foo.tsv', sep='\t', header=F, stringsAsFactors=FALSE)
names(df) <- c('a', 'foo', 'goo')
df$foo <- as.numeric(df$foo)
goodValue <- df$foo
summary(goodValue)
hist(goodValue,main="Distribution",xlab="foo",breaks=20)

Maybe this is what you're looking for?
a <- c(rnorm(99), 50) # create some data with one outlier
quant <- as.numeric(quantile(a, c(0, 0.9))) # get the 0 and 0.9 quantiles
hist(a[a >= quant[1] & a <= quant[2]]) # histogram of only the data within these bounds (inclusive, so the minimum is kept)
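Applied to the question's own column, the same idea could look like the sketch below. It is only an illustration: `foo` is a made-up stand-in for df$foo, and the 0.95 cutoff and the `na.rm = TRUE` handling are assumptions, not something the original post specifies.

```r
set.seed(1)
foo <- c(rnorm(99), 50)   # hypothetical stand-in for df$foo, with one outlier

# Upper bound of the 0-95% range; na.rm guards against NAs from as.numeric()
upper <- quantile(foo, 0.95, na.rm = TRUE, names = FALSE)

# Keep only non-missing values at or below the cutoff
trimmed <- foo[!is.na(foo) & foo <= upper]

hist(trimmed, main = "Distribution (0-95th percentile)", xlab = "foo", breaks = 20)
```

Changing 0.95 to 0.9 gives the 0-90% range instead.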

Suppose you wanted to examine the diamonds. (I don't have your data)
library(ggplot2)
library(dplyr)
diamonds %>% ggplot() + geom_histogram(aes(x = price))
You might decide to examine the deciles of your data, and since the tail is not of interest to you, you might throw away the top decile. You could do that as follows, with a free scale so that you can see what is happening within each decile.
diamonds %>% mutate(ntile = ntile(price, 10)) %>%
filter(ntile < 10) %>%
ggplot() + geom_histogram(aes(x = price)) +
facet_wrap(~ntile, scales = "free_x")
But be cautious: although seeing your data at a much finer granularity has its benefits, notice how you can now barely tell that your data is roughly exponentially distributed (with a heavy tail, as commodity price data often are).
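For readers without dplyr, a rough base R counterpart of the decile step might look like this. It is a sketch under assumptions: `price` is simulated here rather than taken from diamonds, and `cut()` on quantile breaks is a swapped-in technique that can assign ties differently from `ntile()`.

```r
set.seed(1)
price <- rexp(1000, rate = 1 / 4000)   # hypothetical heavy-tailed prices

# Assign each value to a decile (1 = lowest 10%, 10 = highest 10%)
breaks <- quantile(price, probs = 0:10 / 10)
decile <- cut(price, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Drop the top decile, as in the pipeline above
kept <- price[decile < 10]

hist(kept, main = "Prices without the top decile", xlab = "price")
```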

Related

R crashed when using Geom_Point for large data frame

Background: I have a large data frame data_2014, containing ~ 1,000,000 rows like this
library(tidyverse)
tibble(
date_time = "4/1/2014 0:11:00",
Lat = 40.7690,
Lon = -73.9549,
Base = "B02512"
)
Problem: I want to create a plot like this
This is what I've attempted to do:
library(tidyverse)
library(ggthemes)
library(scales)
min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004
ggplot(data_2014, aes(Lon, Lat)) +
geom_point(size = 1, color = "chocolate") +
scale_x_continuous(limits = c(min_long, max_long)) +
scale_y_continuous(limits = c(min_lat, max_lat)) +
theme_map() +
ggtitle("NYC Map Based on Uber Rides Data (April-September 2014)")
However, when I ran this code, RStudio crashed. I'm not sure how to fix or improve this. Are there any suggestions?
A million points is a lot for ggplot2, but do-able if your computer is good enough. Yours may or may not be. Short of getting a bigger computer here's what you should do.
This is spatial data, so use the sf package.
library(sf)
data_2014 <- st_as_sf(data_2014, coords = c('Lon', 'Lat')) %>%
st_set_crs(4326)
If you're only plotting the points, get rid of the columns of data you don't need. I'm guessing they might include trip distance, time, borough, etc. Use dplyr's select, or whatever other method you're familiar with.
Try plotting some of the data, and then a little more. See where your computer slows down & stop there. You can plot the data from row 1:n, or sample x number of rows.
# try starting with 100,000 and go up from there.
n <- 100000
ggplot(data_2014[1:n,]) +
geom_sf()
# Alternatively sample a fraction of the data.
# Start with ~10% and go up until R crashes again.
data_2014 %>%
sample_frac(.1) %>%
ggplot() +
geom_sf()
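The subsetting advice above can also be tried without sf or dplyr. The sketch below is an assumption-laden illustration: `rides` is simulated data standing in for data_2014, and filtering to the bounding box before sampling is an extra step not mentioned in the answer. The bounds are the ones from the question.

```r
set.seed(42)
# Hypothetical stand-in for data_2014 (the real data has ~1,000,000 rows)
rides <- data.frame(
  Lat = runif(10000, 40.4, 41.1),
  Lon = runif(10000, -74.3, -73.5)
)

min_lat <- 40.5774; max_lat <- 40.9176
min_long <- -74.15; max_long <- -73.7004

# Keep only rows inside the plotting window
in_box <- rides$Lat >= min_lat & rides$Lat <= max_lat &
  rides$Lon >= min_long & rides$Lon <= max_long
rides_box <- rides[in_box, ]

# Draw a 10% random sample of the remaining rows
keep <- sample(nrow(rides_box), size = ceiling(0.1 * nrow(rides_box)))
rides_small <- rides_box[keep, ]
```

`rides_small` can then be passed to ggplot() in place of the full data frame, and the sampled fraction increased step by step.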

ROC curve in ggplot calculation [r]

I am trying to create a ROC curve in ggplot
I wrote the function myself; however, when I compare my results to those from the community-maintained roc_curve function (which I trust more), I get different results.
Where is the mistake in the function below?
library(ggplot2)
library(dplyr)
library(yardstick)
n <- 300 # sample size
data <-
data.frame(
real = sample(c(0,1), replace=TRUE, size=n),
pred = sample(runif(n), replace=TRUE, size=n)
)
simple_roc <- function(labels, scores){
labels <- labels[order(scores, decreasing=TRUE)]
data.frame(TPR=cumsum(labels)/sum(labels), FPR=cumsum(!labels)/sum(!labels), labels)
}
simple_roc(data$real, data$pred) %>%
ggplot(aes(TPR, FPR)) +
geom_line()
yardstick::roc_curve(data, factor(real), pred) %>%
ggplot(aes(1 - specificity, sensitivity)) +
geom_line()
First you need to anchor your ROC curve at the points (0, 0) and (1, 1).
simple_roc <- function(labels, scores){
labels <- labels[order(scores, decreasing=TRUE)]
data.frame(
TPR = c(0, cumsum(labels)/sum(labels), 1),
FPR = c(0, cumsum(!labels)/sum(!labels), 1)
)
}
Then the order in which your data is presented matters in ggplot2. Reversing the line direction should get you a bit closer:
yardstick::roc_curve(data, factor(real), pred) %>%
ggplot(aes(rev(1 - specificity), rev(sensitivity))) +
geom_line()
I would recommend against using your own function for any serious work. There are many other things that can go wrong and that well-maintained packages will handle properly such as missing values, infinite values, absence of some labels, and others that I can't even think about right now.
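If you still want to sanity-check a hand-written version, one option is to compute the area under the curve on a toy case where the answer is known. The sketch below is base R only; `trapezoid_auc` is a hypothetical helper name, and the trailing 1 is omitted because the cumulative sums already end at 1. It also follows the usual convention of FPR on the x-axis and TPR on the y-axis, which may be worth checking in your own plot call.

```r
# ROC points anchored at (0, 0); the cumulative sums already end at (1, 1)
simple_roc <- function(labels, scores) {
  labels <- labels[order(scores, decreasing = TRUE)]
  data.frame(
    FPR = c(0, cumsum(!labels) / sum(!labels)),
    TPR = c(0, cumsum(labels) / sum(labels))
  )
}

# Area under the curve by the trapezoid rule (hypothetical helper)
trapezoid_auc <- function(roc) {
  sum(diff(roc$FPR) * (head(roc$TPR, -1) + tail(roc$TPR, -1)) / 2)
}

# A perfectly separating score should give AUC = 1, a perfectly wrong one AUC = 0
auc_perfect <- trapezoid_auc(simple_roc(c(1, 1, 0, 0), c(0.9, 0.8, 0.2, 0.1)))
auc_worst <- trapezoid_auc(simple_roc(c(1, 1, 0, 0), c(0.1, 0.2, 0.8, 0.9)))
```

With random scores, as in the question, the AUC should land near 0.5.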

Show outliers in an efficient manner using ggplot

The actual data (and aim) I have is different, but for reproducibility I used the Titanic dataset. My aim is to create a plot of the age outliers (more than 1 SD from the mean) per class and sex.
Therefore the first thing I did is calculating the sd values and ranges:
library(dplyr)
library(ggplot2)
#Load titanic set
titanic <- read.csv("titanic_total.csv")
group <- group_by(titanic, Pclass, Sex)
#Create outlier ranges
summarise <- summarise(group, mean=mean(Age), sd=sd(Age))
summarise <- as.data.frame(summarise)
summarise$outlier_max <- summarise$mean + summarise$sd
summarise$outlier_min <- summarise$mean - summarise$sd
#Create a key
summarise$key <- paste0(summarise$Pclass, summarise$Sex)
#Create a key for the base set
titanic$key <- paste0(titanic$Pclass, titanic$Sex)
total_data <- left_join(titanic, summarise, by = "key")
total_data$outlier <- 0
Next, using a loop I determine whether the age is inside or outside the range
for (row in 1:nrow(total_data)){
if((total_data$Age[row]) > (total_data$outlier_max[row])){
total_data$outlier[row] <- 1
} else if ((total_data$Age[row]) < (total_data$outlier_min[row])){
total_data$outlier[row] <- 1
} else {
total_data$outlier[row] <- 0
}
}
Do some data cleaning ...
total_data$Pclass.x <- as.factor(total_data$Pclass.x)
total_data$outlier <- as.factor(total_data$outlier)
Now this code gives me the plot I am looking for.
ggplot(total_data, aes(x = Age, y = Pclass.x, colour = outlier)) + geom_point() +
facet_grid(. ~Sex.x)
However, this does not really seem like the easiest way to crack this problem. Any thoughts on how I can apply best practices to make this more efficient?
One way to reduce your code and make it less repetitive is to put it all into one procedure using the pipe. Instead of creating a summary with the values and then re-joining it with the data, you could do all of this within one mutate step:
titanic %>%
mutate(Pclass = as.factor(Pclass)) %>%
group_by(Pclass, Sex) %>%
mutate(Age.mean = mean(Age),
Age.sd = sd(Age),
outlier.max = Age.mean + Age.sd,
outlier.min = Age.mean - Age.sd,
outlier = as.factor(ifelse(Age > outlier.max, 1,
ifelse(Age < outlier.min, 1, 0)))) %>%
ggplot() +
geom_point(aes(Age, Pclass, colour = outlier)) +
facet_grid(.~Sex)
Pclass is converted to a factor first, as it is a grouping variable. The remaining steps are then done within the original data frame, instead of creating two new ones. Note that no changes are made to the original dataset! If you want to keep the new columns, assign the result of the data steps to titanic or another data frame, and run the ggplot part as a separate step. Otherwise you would assign the plot, not the data, to that name.
For the identification of outliers, one way is to work with ifelse. Alternatively, dplyr offers the nice between function. For this you would need to add rowwise (at least in dplyr versions where between expects scalar bounds), i.e. after creating the min and max thresholds for outliers. Note that between returns TRUE for values inside the range, so it is negated here to flag the outliers:
...
rowwise() %>%
mutate(outlier = as.factor(as.numeric(!between(Age, outlier.min, outlier.max)))) %>% ...
Plus:
Additionally, you could reduce your code even further, depending on which variables you want to keep and in which form:
titanic %>%
group_by(Pclass, Sex) %>%
mutate(outlier = as.factor(ifelse(Age > (mean(Age) + sd(Age)), 1,
ifelse(Age < (mean(Age) - sd(Age)), 1, 0)))) %>%
ggplot() +
geom_point(aes(Age, as.factor(Pclass), colour = outlier)) +
facet_grid(.~Sex)
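For comparison, the per-row loop from the question can also be replaced in base R with a vectorised comparison. This is only a sketch: the small data frame `d` is made up to stand in for the Titanic data, and `ave()` is a swapped-in technique rather than what the answer above uses.

```r
# Hypothetical stand-in for the Titanic data
d <- data.frame(
  Pclass = c(1, 1, 1, 2, 2, 2),
  Sex = c("male", "male", "male", "female", "female", "female"),
  Age = c(20, 30, 70, 5, 25, 27)
)

# One key per class/sex group, as in the question
key <- paste0(d$Pclass, d$Sex)

# Group-wise mean and sd, repeated for every row of the group
grp_mean <- ave(d$Age, key, FUN = mean)
grp_sd <- ave(d$Age, key, FUN = sd)

# An age more than one SD away from its group mean is flagged as 1
d$outlier <- factor(as.integer(abs(d$Age - grp_mean) > grp_sd))
```

With real data that contains missing ages, mean and sd would need `na.rm = TRUE` (for example via `function(x) mean(x, na.rm = TRUE)`), which is an assumption about the data rather than something shown in the question.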

How to make plot scales the same or turn them into log scales in ggplot

I am using this script to plot chemical elements using ggplot2 in R:
# Load the same data set under a different name, because it is just for plotting elements as a well log:
Core31B1 <- read.csv('OilSandC31B1BatchResultsCr.csv', header = TRUE)
#
# Calculating the ratios of Ca.Ti, Ca.K, Ca.Fe:
C31B1$Ca.Ti.ratio <- (C31B1$Ca/C31B1$Ti)
C31B1$Ca.K.ratio <- (C31B1$Ca/C31B1$K)
C31B1$Ca.Fe.ratio <- (C31B1$Ca/C31B1$Fe)
C31B1$Fe.Ti.ratio <- (C31B1$Fe/C31B1$Ti)
#C31B1$Si.Al.ratio <- (C31B1$Si/C31B1$Al)
#
# Create a subset of ratios and depth
core31B1_ratio <- C31B1[-2:-18]
#
# Removing the totCount column:
Core31B1 <- Core31B1[-9]
#
# Melting the data set based on the depth values, to have only three columns: depth, element and count
C31B1_melted <- melt(Core31B1, id.vars="depth")
#ratio melted
C31B1_ra_melted <- melt(core31B1_ratio, id.vars="depth")
#
# Eliminating the NA data from the data set
C31B1_melted<-na.exclude(C31B1_melted)
# ratios
C31B1_ra_melted <-na.exclude(C31B1_ra_melted)
#
# Rename the columns:
colnames(C31B1_melted) <- c("depth","element","counts")
# ratios
colnames(C31B1_ra_melted) <- c("depth","ratio","percentage")
#
# Plotting the data in well-log format using ggplot2:
Core31B1_Sp <- ggplot(C31B1_melted, aes(x=counts, y=depth)) +
theme_bw() +
geom_path(aes(linetype = element))+ geom_path(size = 0.6) +
labs(title='Core 31 Box 1 Bioturbated sediments') +
scale_y_reverse() +
facet_grid(. ~ element, scales='free_x') #rasterImage(Core31Image, 0, 1515.03, 150, 0, interpolate = FALSE)
#
# View the plot:
Core31B1_Sp
I got the following image (as you can see the plot has seven element plots, and each one has its scale. Please ignore the shadings and the image at the far left):
My question is: is there a way to make these scales the same, for example by using log scales? If yes, what should I change in my code to change the scales?
It is not clear what you mean by "the same", because that will not give you the same result as log transforming the values. Here is how to get the log transformation, which, when combined with not using free_x, will give you the plot I think you are asking for.
First, since you didn't provide any reproducible data (see here for more on how to ask good questions), here is some data that has at least some of the features that I think your data has. I am using tidyverse (specifically dplyr and tidyr) to do the construction:
forRatios <-
names(iris)[1:3] %>%
combn(2, paste, collapse = " / ")
toPlot <-
iris %>%
mutate_(.dots = forRatios) %>%
select(contains("/")) %>%
mutate(yLocation = 1:n()) %>%
gather(Comparison, Ratio, -yLocation) %>%
mutate(logRatio = log2(Ratio))
Note that the last line takes the log base 2 of the ratio. This allows ratios in each direction (above and below 1) to plot meaningfully. I think that step is what you need. You can accomplish something similar with myDF$logRatio <- log2(myDF$ratio) if you don't want to use dplyr.
Then, you can just plot that:
ggplot(
toPlot
, aes(x = logRatio
, y = yLocation) ) +
geom_path() +
facet_wrap(~Comparison)
Gives:
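As a small illustration of why the log2 step makes ratios above and below 1 comparable, consider the sketch below; the `ratio` values are made up for the example.

```r
# Ratios that are reciprocals of each other (hypothetical values)
ratio <- c(0.25, 0.5, 1, 2, 4)

# After log2, reciprocal ratios sit at equal distances on either side of 0
log_ratio <- log2(ratio)
# → -2 -1 0 1 2
```

On the raw scale, 0.25 and 4 are 0.75 and 3 away from 1; on the log2 scale they are both 2 units from 0, so a shared x-axis across facets becomes easier to read.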

Three-way graph (variable, mean, sd) with ggplot2

I think I have an error in my logic while reproducing a graph I found in this pdf here.
It should be fairly easy to do, but I am having trouble plotting a variable together with its mean and standard deviation, each in its own panel, as can be seen in the example graph below. Did they do it with facet_grid() or facet_wrap()?
How can I plot an arbitrary variable in that way? In particular, I would not know how to plot the mean and sd over distance (or time).
Example graph:
Here's my approach to the solution outlined by @DavidArenburg (though I simplified the data a little, using simple cumulative statistics and a plain index):
library(ggplot2)
library(tidyr)
library(dplyr)
library(TTR)
v <- rnorm(1000)
df <- data.frame(index = 1:1000,
variable = v,
mean = runMean(v, n=1, cumulative=TRUE),
sd = runSD(v, n=1, cumulative=TRUE))
dd <- gather(df, facet, value, -index)
ggplot(dd, aes(x = index, y = value)) +
geom_path() +
facet_grid(facet ~ .)
Bonus: an illustration that the sample mean and sd converge to the true values (0 and 1, respectively).
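If TTR is not available, the cumulative statistics can be approximated in base R. The sketch below is an assumption about what the cumulative running mean and sd amount to (the usual n - 1 denominator for the sd), not a claim about how TTR computes them internally.

```r
set.seed(123)
v <- rnorm(1000)
n <- seq_along(v)

# Mean of the first n values, for every n
cum_mean <- cumsum(v) / n

# Sample sd of the first n values (undefined, hence NaN, for n = 1)
cum_sd <- sqrt((cumsum(v^2) - n * cum_mean^2) / (n - 1))

df <- data.frame(index = n, variable = v, mean = cum_mean, sd = cum_sd)
```

The resulting `df` has the same columns as above, so the gather() and ggplot() steps can be applied to it unchanged.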

Resources