I am using the boxplot function in R 3.1.1, and I am trying to understand what is happening behind the scenes rather than fix my code.
png(file = "plot1.png")
par(mfrow= c(1,2))
par(mar = c(3,4,4,1))
boxplot(emissions ~ year, col = "blue", xlab="Year", ylab ="Emissions", title = "Pm25 Emissions 1999 and 2008", bg ="white",ylim=c(0,6000))
boxplot(emissions2 ~ year2, col = "blue", xlab="Year", ylab ="Emissions", title = "Pm25 Emissions per Year", bg ="white",ylim=c(0,6000))
dev.off()
The resulting output is:
From what I have read, this code should return a box-and-whisker plot, but instead it returns a linear mess of aligned dots that is no better than a bar chart. Any clues on what I have done wrong?
Thanks. (The image is not posted, as I don't have 10 reputation points.)
Full code to download the data set for processing:
url = "https://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip"
#######Erased to encourage the learning process...
NEI <- readRDS(mydata[2])
SCC <- readRDS(mydata[1])
year <- NEI[, 6]
emissions <- NEI[, 4]
mat <- cbind(year,emissions)
png(file = "plot1.png")
....
summary(NEI) results:
Emissions
Min.    :     0.0
1st Qu. :     0.0
Median  :     0.0
Mean    :     3.4
3rd Qu. :     0.1
Max.    :646952.0
year
Min.    :1999
1st Qu. :2002
Median  :2005
Mean    :2004
3rd Qu. :2008
Max.    :2008
As you may have noticed, the Emissions variable in your NEI data is strongly skewed.
library(dplyr)
nei <- as.tbl(NEI)
nei %>%
  group_by(year) %>%
  summarise(
    min = min(Emissions),
    max = max(Emissions),
    mean = mean(Emissions),
    median = median(Emissions),
    Q25 = quantile(Emissions, probs = 0.25),
    Q75 = quantile(Emissions, probs = 0.75)
  )
The summary:
Source: local data frame [4 x 7]
year min max mean median Q25 Q75
1 1999 0 66696.32 6.615401 0.040000000 0.0100000000 0.25600000
2 2002 0 646951.97 3.317747 0.007164684 0.0005436423 0.08000000
3 2005 0 58896.10 3.182719 0.006741885 0.0005283287 0.07000000
4 2008 0 20799.70 1.752560 0.005273130 0.0003983980 0.06162755
A boxplot is a representation of your data's distribution. More precisely, it depends on your data's quantile values.
For example, if your quantiles overlap, you will get only one horizontal line (the box and whiskers are flat), with your outliers as a vertical line of points.
You can easily imagine your data being distributed like this example:
set.seed(1)
boxplot(count ~ spray,
data = data.frame(count=c(rep(0,800),runif(200)),
spray=sample(1:2,1000,rep=TRUE)), col = "lightgray")
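If the skew itself is what makes the boxes collapse, plotting on a log scale often recovers readable boxes. A sketch with made-up skewed data, not the NEI file; the small offset constant is an arbitrary choice, needed only because a log scale cannot show zeros:

```r
# Heavily skewed data: most values near zero, a few huge ones
set.seed(1)
emissions <- c(rexp(900, rate = 10), rexp(100, rate = 1/1000))
year <- rep(c(1999, 2008), each = 500)

# log = "y" cannot handle zeros, so offset the data slightly first
boxplot(emissions + 0.001 ~ year, log = "y",
        col = "blue", xlab = "Year", ylab = "Emissions (log scale)")
```

On the log scale the box and whiskers spread out again instead of being squashed against zero.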
Related
I am trying to create boxplots where the medians of my variables are aligned at 0. Their ranges do not have to be fixed at [-1, 1], but I would like their min and max to fall within this range. Is there an operation that I could use?
I was able to normalize my variables in [-1, 1], but I am aiming at having the medians aligned at 0, and their range just falling within (not being fixed at) [-1, 1].
Here's a function to do that. It finds which extreme is farther from the median, then uses the median and that maximum distance to scale everything into a range between -1 and 1, with the median at the center. This will break if the data has no range (i.e. min = median = max), as that results in an infinite rescaling factor, but I'm not sure what the expected behavior should be in that case.
rescale_center_median <- function(my_numbers) {
my_median = median(my_numbers, na.rm = TRUE)
my_range = range(my_numbers, na.rm = TRUE)
scale_factor = max(abs(my_range-my_median))
(my_numbers - my_median) / scale_factor
}
Testing:
set.seed(42)
scaled_numbers <- rescale_center_median(rnorm(10))
scaled_numbers
# [1]  0.60393025 -0.58015650 -0.01258313  0.15241963  0.01258313 -0.29963620  0.68991628
# [8] -0.29262249  1.00000000 -0.27308102
median(scaled_numbers)
# [1] 0
range(scaled_numbers)
# [1] -0.4922334  1.0000000
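For the degenerate case mentioned above (min = median = max), one possible choice, and it is only one choice, is to map everything to 0 rather than divide by zero. A sketch of that variant:

```r
rescale_center_median_safe <- function(my_numbers) {
  my_median <- median(my_numbers, na.rm = TRUE)
  scale_factor <- max(abs(range(my_numbers, na.rm = TRUE) - my_median))
  # No spread: every value equals the median, so map everything to 0
  if (scale_factor == 0) return(rep(0, length(my_numbers)))
  (my_numbers - my_median) / scale_factor
}

rescale_center_median_safe(c(5, 5, 5))  # c(0, 0, 0), no Inf
```

Returning all zeros keeps the "median at 0" property; raising an error would be an equally defensible behavior.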
Apologies, I am new to R. I have a dataset with tree height and canopy density, for example:
i_h100 i_cd
2.89 0.0198
2.88 0.0198
17.53 0.658
27.23 0.347
I want to group 'i_h100' into 2 m intervals, going from a 2 m minimum to a 30 m maximum. I then want to calculate the mean i_cd value and interquartile range for each of these intervals, so that I can plot these with a least-squares regression. There is something wrong with the code I am using to get the mean. This is what I have so far:
mydata=read.csv("irelandish.csv")
height=mydata$i_h100
breaks=seq(2,30,by=2) #2m intervals
height.cut=cut(height, breaks, right=TRUE)
#attempt at calculating means per group
install.packages("dplyr")
mean=summarise(group_by(cut(height, breaks, right=TRUE),
mean(mydata$i_cd)))
install.packages("reshape2")
dcast(mean)
Thanks in advance for any advice.
Using aggregate() to calculate the groupwise means.
# Some example data
set.seed(1)
i_h100 <- round(runif(100, 2, 30), 2)
i_cd <- rexp(100, 1/i_h100)
mydata <- data.frame(i_cd, i_h100)
# Grouping i_h100
mydata$i_h100_2m <- cut(mydata$i_h100, seq(2, 30, by=2))
head(mydata)
# i_cd i_h100 i_h100_2m
# 1 2.918093 9.43 (8,10]
# 2 13.735728 12.42 (12,14]
# 3 13.966347 18.04 (18,20]
# 4 2.459760 27.43 (26,28]
# 5 8.477551 7.65 (6,8]
# 6 6.713224 27.15 (26,28]
# Calculate groupwise means of i_cd
i_cd_2m_mean <- aggregate(i_cd ~ i_h100_2m, mydata, mean)
# And IQR
i_cd_2m_iqr <- aggregate(i_cd ~ i_h100_2m, mydata, IQR)
upper <- i_cd_2m_mean[,2]+(i_cd_2m_iqr[,2]/2)
lower <- i_cd_2m_mean[,2]-(i_cd_2m_iqr[,2]/2)
# Plotting the result
plot.default(i_cd_2m_mean, xaxt="n", ylim=range(c(upper, lower)),
main="Groupwise means \U00B1 0.5 IQR", type="n")
points(upper, pch=2, col="lightblue", lwd=1.5)
points(lower, pch=6, col="pink", lwd=1.5)
points(i_cd_2m_mean, pch=16)
axis(1, at=1:nrow(i_cd_2m_mean), labels=as.character(i_cd_2m_mean[,1]), cex.axis=0.6, las=2)
Here is a solution,
library(reshape2)
library(dplyr)
mydata <- data.frame(i_h100=c(2.89,2.88,17.53,27.23),i_cd=c(0.0198,0.0198,0.658,0.347))
height <- mydata$i_h100
breaks <- seq(2,30,by=2) #2m intervals
height.cut <- cut(height, breaks, right=TRUE)
mydata$height.cut <- height.cut
mean_i_cd <- mydata %>% group_by(height.cut) %>% summarise(mean_i_cd = mean(i_cd))
A few remarks:
it is better to avoid naming variables after functions, so I used mean_i_cd rather than mean
I am using the pipe notation, which makes the code more readable and avoids repeating the first argument of each function; you can find a more detailed explanation here.
Without the pipe notation, the last line of code would be:
mean_i_cd <- summarise(group_by(mydata, height.cut), mean_i_cd = mean(i_cd))
you have to load the two packages you installed, with library()
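Since the question also asks for the interquartile range of i_cd per interval, both statistics fit in one summarise() call. A sketch using the question's four example rows; the column names are my own:

```r
library(dplyr)

mydata <- data.frame(i_h100 = c(2.89, 2.88, 17.53, 27.23),
                     i_cd   = c(0.0198, 0.0198, 0.658, 0.347))
mydata$height.cut <- cut(mydata$i_h100, seq(2, 30, by = 2), right = TRUE)

stats <- mydata %>%
  group_by(height.cut) %>%
  summarise(mean_i_cd = mean(i_cd),   # mean canopy density per 2 m bin
            iqr_i_cd  = IQR(i_cd))    # spread within the bin
```

With only four rows, three bins are occupied and the IQR of a single observation is 0, so the real data set will give more interesting numbers.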
I am looking for a way to visualize inventory throughout a day. The dataset looks as follows, with the summaries of the last two columns below:
Time Price Inventory Duration
1 9/1/2016 9:25:06 AM 13.960 318 0
2 9/1/2016 9:36:42 AM 13.980 106 696
3 9/1/2016 9:40:52 AM 13.990 -599 250
4 9/1/2016 9:52:54 AM 14.015 68 722
5 9/1/2016 9:52:54 AM 14.015 321 0
6 9/1/2016 9:54:17 AM 14.010 74 83
Inventory
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1120.00 -98.75 9.00 0.00 100.00 1988.00
Duration
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 40.25 205.50 2100.00 529.00 272700.00
I want to visualize the data by showing how much time was spent at various inventory levels. What function would you recommend for this? So far I have found only histograms based on frequency, not time. My intended result would look similar to this:
https://postimg.org/image/z074waij1/
Thanks in advance
I wrote the following function for my needs. Hope it helps
inv.barplot.IDs <- function(inv.list, IDs = 1:1620)
{
  # Subset according to the IDs
  myinvs <- as.data.frame(matrix(nrow = 0, ncol = 14))
  names(myinvs) <- inv.names
  Volume <- Duration <- vector("numeric")
  for (i in IDs)
  {
    #myinvs <- rbind(myinvs, inv.list[[i]])
    Volume <- c(Volume, as.numeric(inv.list[[i]]$Volume))
    Duration <- c(Duration, as.numeric(inv.list[[i]]$Duration))
  }
  # Design a sequence of bins
  minimum <- min(Volume)
  maximum <- max(Volume)
  width <- (maximum + abs(minimum)) / 18
  width <- round(width, -1)
  seq.pos <- seq(width, maximum + width, by = width)
  seq.neg <- -seq(0, abs(minimum) + width, by = width)
  seq <- c(rev(seq.neg), seq.pos)
  # Categorize each volume into its bin (new column)
  Skatule <- numeric(length = length(Volume))
  for (i in 1:length(Volume))
  {
    Skatule[i] <- seq[head(which(seq > Volume[i]), 1) - 1]
  }
  barplot.data <- tapply(Duration, Skatule, sum)
  # Save the barplot
  #jpeg(filename = file.barplot, width = 480 * (16/9))
  inv.barplot <- barplot(barplot.data, border = NA,
                         ylim = c(0, max(barplot.data)),
                         main = "Total time spent on various inventory levels",
                         xlab = "Inventory", ylab = "Hours")
  #print(inv.barplot)
  #dev.off()
}
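The same idea fits in a few lines for a flat data frame like the one in the question: bin the inventory levels with cut() and sum the durations per bin. A sketch using the six rows from the question's table; the break points are arbitrary:

```r
# Inventory/Duration rows as in the question's table
df <- data.frame(Inventory = c(318, 106, -599, 68, 321, 74),
                 Duration  = c(0, 696, 250, 722, 0, 83))

# Bin the inventory levels, then total the time spent in each bin
bins <- cut(df$Inventory, seq(-600, 400, by = 200))
time_per_level <- tapply(df$Duration, bins, sum)
time_per_level[is.na(time_per_level)] <- 0  # empty bins get zero time

barplot(time_per_level,
        main = "Time spent at each inventory level",
        xlab = "Inventory", ylab = "Duration (seconds)")
```

This is a duration-weighted histogram: the bar height is total time, not observation count.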
Here's my code. Everything is running smoothly until I try to create a line in the plot. It doesn't give me any error; just nothing shows up!
setwd("~/RESEARCH/")
NHISdata <- read.csv("NHIS Data.csv", header=TRUE)
attach(NHISdata)
age = 21 + days_21/365
#Create a variable centered at 0, for pre and post
z=ifelse(age>=21,1,0)
#Create a polynomial in age
agec=age-21
agec_sq=agec^2
agec_cu=agec^3
#Interact with the post variable
agec_post=agec*z
agec_sq_post=agec_sq*z
agec_cu_post=agec_cu*z
reg1<- lm(drinks_alcohol ~ z + agec + agec_post)
reg2<- lm(drinks_alcohol ~ z + agec + agec_sq + agec_post + agec_sq_post)
reg3<- lm(drinks_alcohol ~ z + agec + agec_sq+ agec_cu + agec_post +
agec_sq_post + agec_cu_post)
#z will give us the jump at 21
summary(reg1)
summary(reg2)
summary(reg3)
pred_rate_linear <- predict(reg1)
pred_rate_quad <- predict(reg2)
pred_rate_cubic <- predict(reg3)
#rate1 <-
cbind(NHISdata,z,agec,agec_sq,agec_cu,agec_post,agec_sq_post,agec_cu_post,pred_rate_linear,pred_rate_quad,pred_rate_cubic)
#attach(rate1)
bin7=floor(days_21/7)
bin14=floor(days_21/14)
bin30=floor(days_21/30)
bin100=floor(days_21/100)
tipsy = data.frame(cbind(days_21,drinks_alcohol,bin7,bin14,bin30,bin100))
tipsy = aggregate(NHISdata,by=list(bin30),FUN=mean)
attach(tipsy)
par(mfrow=c(2,2), oma=c(0,0,2,0))
age = 21 + days_21/365
plot(x=age, y=drinks_alcohol, xlim=c(19,23), ylim=c(.4,.75), xlab='Age',
ylab='Drinking Rate', cex = 0.75)
title(main='Linear Regression')
sub1 <- subset(tipsy, age>=21)
sub2 <- subset(tipsy, age<=21)
lines(sub1$age, sub1$pred_rate_linear)
lines(sub2$age, sub2$pred_rate_linear)
summary(age)
summary(pred_rate_linear)
plot(x=age, y=drinks_alcohol, xlim=c(19,23), ylim=c(.4,.75), xlab='Age',
ylab='Drinking Rate', cex = 0.75)
title(main='W/ Linear Regression Line')
sub1 <- subset(rate1, age>=21)
sub2 <- subset(rate1, age<=21)
lines(sub1$age, sub1$pred_rate_quad, col="red", lwd=3)
lines(sub2$age, sub2$pred_rate_quad, col="red", lwd=3)
Here's the link to the csv file for you to try out.
http://speedy.sh/BdmP5/NHIS-Data.csv
The summaries for age and pred_rate_linear are:
> summary(age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
17.70 20.87 24.00 24.00 27.12 30.85
> summary(pred_rate_linear)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.3096 0.6599 0.6631 0.6243 0.6665 0.6701
I don't see a problem. Why isn't R showing my lines?
Never use attach. Especially never use attach when you also use functions like subset that use non-standard evaluation (take unquoted column names).
You define age as a vector by itself, not as a column of tipsy, so it won't be a column of sub either. (I haven't run your code, but I'm pretty sure of this.)
Your debugging strategy should be to run each line one at a time and look at what it returns (or at least at its head()) in the console. If you had done this, you would have noticed that sub1 and sub2 don't have age columns, and you could have done a lot to narrow down where the problem is. Instead of giving us 50 lines of code that produce sub1 and sub2, you should either just give us dput(head(sub1)) and ask why the plot isn't working, or, if sub1 is missing columns, look back to where you think they should be created.
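To illustrate the fix: if the prediction is stored as a column of the data frame being subset, subset() carries it along and lines() has something to draw. A minimal sketch with made-up data, not the NHIS file:

```r
# Toy stand-in for the NHIS data (invented for illustration)
set.seed(1)
dat <- data.frame(age = runif(200, 19, 23))
dat$drinks <- 0.5 + 0.05 * (dat$age >= 21) + rnorm(200, sd = 0.05)

fit <- lm(drinks ~ age + I(age >= 21), data = dat)
dat$pred <- predict(fit)        # keep the prediction as a column of dat

sub1 <- subset(dat, age >= 21)  # sub1 now carries age AND pred
plot(dat$age, dat$drinks, cex = 0.75, xlab = "Age", ylab = "Drinking rate")
ord <- order(sub1$age)
lines(sub1$age[ord], sub1$pred[ord])
```

Because age and pred live in dat rather than as loose vectors, no attach() is needed and the subsets keep every column the plotting code relies on.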
I have a very large dataframe (N = 107,251), that I wish to split into relatively equal halves (~53,625). However, I would like the split to be done such that three variables are kept in equal proportion in the two sets (pertaining to Gender, Age Category with 6 levels, and Region with 5 levels).
I can generate the proportions for the variables independently (e.g., via prop.table(xtabs(~dat$Gender))) or in combination (e.g., via prop.table(xtabs(~dat$Gender + dat$Region + dat$Age)), but I'm not sure how to utilise this information to actually do the sampling.
Sample dataset:
set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)
Probabilities:
round(prop.table(xtabs(~dat$Gender)), 3) # 48.5% Female; 51.5% Male
round(prop.table(xtabs(~dat$Age)), 3) # 16.8, 18.2, ..., 16.0%
round(prop.table(xtabs(~dat$Region)), 3) # 21.5%, 17.7, ..., 21.9%
# Multidimensional probabilities:
round(prop.table(xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
The end goal for this dummy example would be two data frames with ~500 observations in each (completely independent, no participant appearing in both), and approximately equivalent in terms of gender/region/age splits. In the real analysis, there is more disparity between the age and region weights, so doing a single random split-half isn't appropriate. In real world applications, I'm not sure if every observation needs to be used or if it is better to get the splits more even.
I have been reading over the documentation from package:sampling but I'm not sure it is designed to do exactly what I require.
You can check out my stratified function, which you should be able to use like this:
set.seed(1) ## just so you can reproduce this
## Take your first group
sample1 <- stratified(dat, c("Gender", "Region", "Age"), .5)
## Then select the remainder
sample2 <- dat[!rownames(dat) %in% rownames(sample1), ]
summary(sample1)
# Gender Region Age X1
# F:235 1:112 1:84 Min. :-2.82847
# M:259 2: 90 2:78 1st Qu.:-0.69711
# 3: 94 3:82 Median :-0.03200
# 4: 97 4:80 Mean :-0.01401
# 5:101 5:90 3rd Qu.: 0.63844
# 6:80 Max. : 2.90422
summary(sample2)
# Gender Region Age X1
# F:238 1:114 1:85 Min. :-2.76808
# M:268 2: 92 2:81 1st Qu.:-0.55173
# 3: 97 3:83 Median : 0.02559
# 4: 99 4:83 Mean : 0.05789
# 5:104 5:91 3rd Qu.: 0.74102
# 6:83 Max. : 3.58466
Compare the following and see if they are within your expectations.
x1 <- round(prop.table(
xtabs(~dat$Gender + dat$Age + dat$Region)), 3)
x2 <- round(prop.table(
xtabs(~sample1$Gender + sample1$Age + sample1$Region)), 3)
x3 <- round(prop.table(
xtabs(~sample2$Gender + sample2$Age + sample2$Region)), 3)
It should be able to work fine with data of the size you describe, but a "data.table" version is in the works that promises to be much more efficient.
Update:
stratified now has a new logical argument "bothSets" which lets you keep both sets of samples as a list.
set.seed(1)
Samples <- stratified(dat, c("Gender", "Region", "Age"), .5, bothSets = TRUE)
lapply(Samples, summary)
# $SET1
# Gender Region Age X1
# F:235 1:112 1:84 Min. :-2.82847
# M:259 2: 90 2:78 1st Qu.:-0.69711
# 3: 94 3:82 Median :-0.03200
# 4: 97 4:80 Mean :-0.01401
# 5:101 5:90 3rd Qu.: 0.63844
# 6:80 Max. : 2.90422
#
# $SET2
# Gender Region Age X1
# F:238 1:114 1:85 Min. :-2.76808
# M:268 2: 92 2:81 1st Qu.:-0.55173
# 3: 97 3:83 Median : 0.02559
# 4: 99 4:83 Mean : 0.05789
# 5:104 5:91 3rd Qu.: 0.74102
# 6:83 Max. : 3.58466
The following code basically creates a key based on group membership, then loops through each group, sampling half to one set and (roughly) half to the other. If you compare the resulting probabilities, they are within 0.001 of each other. The downside is that it is biased toward a larger second group, because of how odd-sized groups are rounded; in this case the first sample has 488 observations and the second has 512. You can throw in some logic to account for that and even it out.
EDIT: Added that logic, and it now splits evenly.
set.seed(42)
Gender <- sample(c("M", "F"), 1000, replace = TRUE)
Region <- sample(c("1","2","3","4","5"), 1000, replace = TRUE)
Age <- sample(c("1","2","3","4","5","6"), 1000, replace = TRUE)
X1 <- rnorm(1000)
dat <- data.frame(Gender, Region, Age, X1)
dat$group <- with(dat, paste(Gender, Region, Age))
groups <- unique(dat$group)
setA <- dat[NULL,]
setB <- dat[NULL,]
for (i in 1:length(groups)) {
  temp <- dat[dat$group == groups[i], ]
  if (nrow(setA) > nrow(setB)) {
    tempA <- temp[1:floor(nrow(temp)/2), ]
    tempB <- temp[(1 + floor(nrow(temp)/2)):nrow(temp), ]
  } else {
    tempA <- temp[1:ceiling(nrow(temp)/2), ]
    tempB <- temp[(1 + ceiling(nrow(temp)/2)):nrow(temp), ]
  }
  setA <- rbind(setA, tempA)
  setB <- rbind(setB, tempB)
}
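One caveat with the loop above: rows are taken in their original order within each group, so the "halves" are deterministic. Shuffling the rows first makes each half a random sample of its stratum. A sketch of that variant (it always gives the floor half to the first set, so the alternating logic above would still be needed for an exactly even split):

```r
set.seed(42)
dat <- data.frame(Gender = sample(c("M", "F"), 1000, replace = TRUE),
                  Region = sample(as.character(1:5), 1000, replace = TRUE),
                  Age    = sample(as.character(1:6), 1000, replace = TRUE))
dat$group <- with(dat, paste(Gender, Region, Age))

# Shuffle first, so "the first half of each group" is a random half
dat <- dat[sample(nrow(dat)), ]

# Split row positions by stratum and take half of each for set A
idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$group),
                     function(rows) rows[seq_len(floor(length(rows) / 2))]))
setA <- dat[idx, ]
setB <- dat[-idx, ]
```

Each stratum contributes half of its rows (rounded down) to setA and the remainder to setB, so the three-way proportions stay matched while membership within each stratum is random.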