I have the following data.frame
df<-data.frame(x=c("A","B","C"),colour=c(0.3,0.6,0.9))
x colour
1 A 0.3
2 B 0.6
3 C 0.9
I want to replace the numbers in df[,"colour"] with colours such that
< 0.4 = colour 1
0.4 - 0.7 = colour 2
>= 0.7 = colour 3
I have tried the following replacement scheme, but earlier colour assignments are overwritten by later ones. Any advice?
library(RColorBrewer)
g<-brewer.pal(3,"Greens")
col1<-df[,"colour"] < 0.4
df[col1,"colour"]<-g[1]
col2<-df[,"colour"] < 0.7
df[col2,"colour"]<-g[2]
col3<-df[,"colour"] >= 0.7
df[col3,"colour"]<-g[3]
Thanks for your advice.
THE SOLUTION
Thanks seancarmody (and spaceman for useful comments)
v<-c(0.45,0.65,0.75,0.85,0.95)
breaks<-c(0.4,0.5,0.6,0.7,0.8,0.9)
#Create a colour for each section
cols<-brewer.pal(length(breaks)+1,"Greens")
#Replace the values in v with colours using the breaks
v <- as.character(cut(v, c(-Inf, breaks, Inf),labels=cols))
> v
[1] "#C7E9C0" "#74C476" "#41AB5D" "#238B45" "#005A32"
Your approach will work if you change the order of the assignments. One caveat: compute both logical indices before the first assignment, because inserting a colour string coerces the whole column to character, and later numeric comparisons then become string comparisons:
col2<-df[,"colour"] < 0.7
col1<-df[,"colour"] < 0.4
df[col2,"colour"]<-g[2]
df[col1,"colour"]<-g[1]
Since everything less than 0.4 is also less than 0.7, your original order overwrote the g[1] values with g[2]. Since you've edited your original question, the above is out of date. I'd just use the more general approach here:
breaks <- c(0.4, 0.7) # you can add more cut points here
cols <- brewer.pal(length(breaks) + 1, "Greens")
df$colour <- as.character(cut(df$colour, c(-Inf, breaks, Inf), labels=cols))
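If I have the 3-class Greens palette right, the example data frame should then end up like this:
> df
  x  colour
1 A #E5F5E0
2 B #A1D99B
3 C #31A354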
Related
I would like to implement random numbers for the time values equal to 0 (time == 0) and keep other time values as given.
set.seed(123)
df$time.new <- ifelse(df$time == 0, sample(0.2:0.8, replace=F), df$time)
Using this formula, only 0.2 is ever produced.
I'll fill in the blanks from the comment that answered the question. This is your code:
set.seed(123)
df$time.new <- ifelse(df$time == 0, sample(0.2:0.8, replace=F), df$time)
The key to understanding why you always get 0.2 is to run:
0.2:0.8
This just yields [1] 0.2, and that's the reason you are always getting 0.2. The seq() function lets you make sequences with more elements by specifying smaller increments:
> seq(0.2, 0.8, by = 0.1)
[1] 0.2 0.3 0.4 0.5 0.6 0.7 0.8
If I remember correctly, the default increment for a:b is one unit. Let's check a toy example:
> a <- 1; b <- 7
> a:b
[1] 1 2 3 4 5 6 7
If we do this with a <- 0.2 and b <- 0.8, the resulting vector consists of just the value 0.2; hence your code only ever picks up that single value.
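A minimal sketch of a fix, assuming you want one independent draw from seq(0.2, 0.8, by = 0.1) for each zero (replace = TRUE guards against having more zeros than candidate values):
set.seed(123)
idx <- df$time == 0
df$time.new <- df$time
df$time.new[idx] <- sample(seq(0.2, 0.8, by = 0.1), sum(idx), replace = TRUE)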
First: This is my first question here and I'm relatively new to R, too. So, I'm sorry if this is a stupid question or the wrong way to ask.
I have a data frame like this:
df <- data.frame(Website = c("A", "A", "A", "B", "B", "B"),
seconds = c(1,12,40,3,5,14),
visitors = c(200000,100000,12000,250000,180000,90000))
> df
Website seconds visitors
A 1 200000
A 12 100000
A 40 12000
B 3 250000
B 5 180000
B 14 90000
How to interpret the data: Website A has 200000 visitors who have been on the website for only 1 second, 100000 visitors for 12 seconds and so on.
In reality, the data has about hundred different websites, each with seconds ranging from 0 to about 900 (and a high number of visitors respectively).
Now, I want to calculate percentiles or at least quartiles for the visiting duration (for each website).
I already found and tried this solution here: https://stackoverflow.com/a/53882909
However, this solution is very inefficient as it results in a data frame with several million rows (and a very long processing time).
My question now: is there a faster (more efficient) way to calculate percentiles from such pre-aggregated data?
I believe this will be faster. First make a function to compute the quantiles you specify. Then split the data into a list and use sapply:
quant <- function(x, p=c(.25, .50, .75)) {
v <- c(0, cumsum(x$visitors)/sum(x$visitors))
s <- c(0, x$seconds)
approx(v, s, p)$y
}
df.split <- split(df, df$Website)
p <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
stats <- t(sapply(df.split, quant, p=p))
colnames(stats) <- as.character(p)
round(stats, 1)
# 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
# A 0.2 0.3 0.5 0.6 0.8 0.9 3.0 6.5 9.9
# B 0.6 1.2 1.9 2.5 3.1 3.7 4.3 4.8 8.8
To see better what is going on, here is a plot showing the data for Website A:
test1 <- df[1:3, ]
test1$cumvis <- cumsum(test1$visitors)
barplot(test1$seconds, test1$visitors, space=0, xlim=c(0, 325000))
axis(1, seq(0, 300000, 50000), c("0", "50K", "100K", "150K", "200K",
"250K", "300K"), xpd=NA)
axis(3, seq(0, sum(test1$visitors), by=31200), seq(0, 1, by=.1), lty=1)
lines(c(0, test1$cumvis), c(0, test1$seconds), col="red", lwd=2)
lines(c(0, test1$cumvis-.5*test1$visitors, tail(test1$cumvis, 1)),
c(0, test1$seconds, tail(test1$seconds, 1)), col="blue", lwd=2)
The plot shows the data as grey rectangles. The bottom x-axis shows the cumulative number of visits and the top x-axis shows the cumulative proportion. We can treat the rectangles as the distribution itself, or assume they are a sample that approximates the underlying distribution. My suggested solution took the red line and used the approx function to linearly interpolate between the data points, estimating the number of seconds along that curve.
The same approach can be used with a different definition of the curve in which the data points are placed in the middle of each rectangle, the blue curve. I'll provide code for that approach as well. It is also possible to estimate the quantiles from the original data without replicating it.
First a function to estimate the quantiles along the blue line:
quant2 <- function(x, p=c(.25, .50, .75)) {
v <- c(0, (cumsum(x$visitors) - .5*x$visitors)/sum(x$visitors), 1)
s <- c(0, x$seconds, tail(x$seconds, 1))
approx(v, s, p)$y
}
p <- c(.1, .2, .3, .4, .5, .6, .7, .8, .9)
stats <- t(sapply(df.split, quant2, p=p))
colnames(stats) <- as.character(p)
round(stats, 1)
#   0.1 0.2 0.3 0.4 0.5 0.6 0.7  0.8  0.9
# A 0.3 0.6 0.9 2.8 5.1 7.4 9.7 12.0 27.4
# B 1.2 2.5 3.3 3.8 4.3 4.7 6.6 10.1 13.5
The estimates are higher because the blue line is above the red line.
Finally, we can simply use the rectangles without any interpolation. Basically we set breaks at the boundaries of the data points and use those to identify which proportions fall in which groups of observations (seconds).
quant3 <- function(x, p=c(.25, .50, .75)){
v <- c(0, cumsum(x$visitors)/sum(x$visitors))
limits <- cut(p, breaks=v, include.lowest=TRUE, labels=x$seconds)
as.numeric(as.character(limits))
}
p <- 0:10/10
stats <- t(sapply(df.split, quant3, p=p))
colnames(stats) <- as.character(p)
stats
# 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
# A 1 1 1 1 1 1 1 12 12 12 40
# B 3 3 3 3 3 5 5 5 5 14 14
So for website A, 1 second is the value for quantiles 0 through 0.6.
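As a sanity check, for a small example you can compare against the naive approach from the linked answer, which replicates each seconds value once per visitor; type = 1 is quantile()'s inverse-ECDF definition, the closest match to quant3, and it reproduces the row for A above:
expanded <- with(df[df$Website == "A", ], rep(seconds, visitors))
quantile(expanded, probs = 0:10/10, type = 1)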
I am wondering how I can plot a QQ plot with multiple p-value vectors for different studies in one plot.
I am using the following code to generate a QQ-plot:
install.packages("ggplot2")
library(ggplot2)
The code for qq can be found here: http://gettinggeneticsdone.blogspot.com/2009/11/qq-plots-of-p-values-in-r-using-ggplot2.html
qq(data$Pvals, title="My Quantile-Quantile Plot")
Now I have 4 studies, so 4 Pval vectors. I am able to add the first one, Pval1, as:
qq(data$Pval1, title="My Quantile-Quantile Plot")
How can I add labeled lines of observed p-values for the remaining studies (Pval2, Pval3, Pval4)? Essentially I'd like to display the QQ-plot with 4 observed p-value lines representing the 4 studies in one graph.
Please help!
Thanks!
Can you share what your data looks like? I think the answer you're looking for is defining a group variable in the aes() mapping. For instance:
UPDATE TO TRANSPOSE DATA SET
# install.packages('ggplot2') # only needs to be installed first time
# install.packages('reshape2') # only needs to be installed first time
library(ggplot2)
library(reshape2)
# fakeData
# RowNum Pval1 Pval2 Pval3 Pval4
# 1 0.5 0.5 0.5 0.5
# 2 0.5 0.5 0.5 0.5
# 3 0.5 0.5 0.5 0.5
#
# melt(fakeData, id.vars = 'RowNum')
# RowNum variable value
# 1 Pval1 0.5
# 1 Pval2 0.5
# 1 Pval3 0.5
ORIGINAL CODE
df <- data.frame(Group = rep(c('A', 'B', 'C', 'D'), 50),
Number = sample(1:100, 200, replace = T))
ggplot(df, aes(sample = Number, group = Group, color = Group)) +
geom_point(stat = 'qq')
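If your four p-value vectors live as columns of one data frame, a sketch along these lines should work; the column names Pval1..Pval4 and the data frame name data are assumptions taken from the question, and -log10 is the usual scale for p-value QQ plots:
library(ggplot2)
library(reshape2)
# melt the four study columns into long form: one row per (study, p-value)
long <- melt(data[, c("Pval1", "Pval2", "Pval3", "Pval4")],
             variable.name = "study", value.name = "p")
# observed vs expected -log10(p), computed within each study
long <- do.call(rbind, lapply(split(long, long$study), function(d) {
  d <- d[order(d$p), ]
  d$expected <- -log10(ppoints(nrow(d)))
  d$observed <- -log10(d$p)
  d
}))
ggplot(long, aes(x = expected, y = observed, colour = study)) +
  geom_line() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed")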
I do not have much experience in R, and I wonder if you can help me in this situation.
I have the following matrix:
mat <- matrix(c(0,0.5,0.2,0.23,0.6,0,0,0.4,
0.56,0.37,0,0.32,0.4,0.99,0.54,0.6,0,0.39), ncol=6, nrow=3)
dimnames(mat) = list(
c("y1","y2","y3"),
c("day1","day2","day3","day4","day5","day6")
)
> mat
day1 day2 day3 day4 day5 day6
y1 0.0 0.23 0.00 0.37 0.40 0.60
y2 0.5 0.60 0.40 0.00 0.99 0.00
y3 0.2 0.00 0.56 0.32 0.54 0.39
I want to know how I can get a graph where points are plotted based on the matrix. The values are arbitrary in the interval [0,1]. Is it possible to set the colour of the plotted points according to a set of constraints?
Example:
(0,0.2] - Red
(0.2,0.4] - Green
(0.4,0.6] - Yellow
(0.6,0.9] - Blue
(0.9,1] - Black
I apologize if I have not explained myself well.
Thank you!
Assuming that your range for yellow is (0.4,0.6] (your intervals need to cover the whole range with no gaps, even if your data doesn't require it):
image(mat,col=c("red","green","yellow","blue","black"),breaks=c(0,0.2,0.4,0.6,0.9,1))
I've ignored the interval endpoint issue.
If you just want coloured points, something like this will do it:
palette(c("red","green","yellow","blue","black"))
plot.default(
as.data.frame.table(t(mat))[1:2],
col=findInterval(t(mat),c(0,0.2,0.4,0.6,0.9)),
pch=19,
axes=FALSE,ann=FALSE,
panel.first=grid()
)
axis(2,at=1:length(rownames(mat)),labels=rownames(mat),lwd=0,lwd.ticks=1,las=1)
axis(1,at=1:length(colnames(mat)),labels=colnames(mat),lwd=0,lwd.ticks=1)
box()
palette("default")
Result: (plot not shown)
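If the right-closed intervals from the question matter, one hedged alternative is to replace the col= line in the plot.default() call above with cut(), which uses (a,b] bins; include.lowest also catches exact zeros:
# (0,0.2], (0.2,0.4], ... exactly as in the question
col=as.integer(cut(t(mat), c(0, 0.2, 0.4, 0.6, 0.9, 1), include.lowest=TRUE)),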
To assign colours to the different intervals, you can break up your values into groups using cut. Like others have said, it's a bit unclear what to do with points on the boundaries, so I've set include.lowest to TRUE:
library(reshape2)
df = melt(mat)
colnames(df)[1:2] = c('year', 'day')
df$value_groups = cut(df$value, breaks=c(0,0.2,0.4,0.6,0.9,1), include.lowest=TRUE)
library(ggplot2)
ggplot(df, aes(x=day, y=value, colour=value_groups, shape=year)) +
geom_point(size=3)
Result: (plot not shown)
Here is how I would do it using lattice:
library(reshape2)
library(lattice)
mmat <- melt(mat) # reshaping the data
# include.lowest=TRUE so that zero falls in the first interval
mmat$colors <- cut(mmat$value, breaks=seq(0, 1, 0.2), include.lowest=TRUE) # stealing from Marius
xyplot(value ~ Var2 | Var1, mmat, groups = colors,
par.settings = list(superpose.symbol =
list(col = c('red', 'green', 'yellow', 'blue', 'black'))))
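If you also want a legend mapping the intervals to their colours, lattice's auto.key should work here (a sketch; the superpose.symbol colours must stay in sync with the group levels):
xyplot(value ~ Var2 | Var1, mmat, groups = colors,
       auto.key = list(space = 'right'),
       par.settings = list(superpose.symbol =
           list(col = c('red', 'green', 'yellow', 'blue', 'black'))))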
I have 2 dataframes, Tg and Pf, each of 127 columns. All columns have at least one row and can have up to thousands of them. All the values are between 0 and 1 and there are some missing values (empty cells). Here is a little subset:
Tg
Tg1 Tg2 Tg3 ... Tg127
0.9 0.5 0.4 0
0.9 0.3 0.6 0
0.4 0.6 0.6 0.3
0.1 0.7 0.6 0.4
0.1 0.8
0.3 0.9
0.9
0.6
0.1
Pf
Pf1 Pf2 Pf3 ...Pf127
0.9 0.5 0.4 1
0.9 0.3 0.6 0.8
0.6 0.6 0.6 0.7
0.4 0.7 0.6 0.5
0.1 0.6 0.5
0.3
0.3
0.3
Note that some cells are empty, and columns with the same index (i.e. 1 to 127) can have very different lengths; the Tg and Pf columns are rarely exactly the same length.
I want to generate 127 graphs, one per column pair, as follows (i.e. graph 1 is for col 1 from each dataframe, graph 2 is for col 2 from each dataframe, etc.):
Hope that makes sense. I'm looking forward to your assistance as I don't want to make those graphs one by one...
Thanks!
Here is an example to get you started (data at https://gist.github.com/1349300). For further tweaking, check out the excellent ggplot2 documentation that is all over the web.
library(ggplot2)
library(reshape2) # for melt()
library(plyr)     # for ddply() and join(), used below
# Load data
Tg = read.table('Tg.txt', header=T, fill=T, sep=' ')
Pf = read.table('Pf.txt', header=T, fill=T, sep=' ')
# Format data
Tg$x = as.numeric(rownames(Tg))
Tg = melt(Tg, id.vars='x')
Tg$source = 'Tg'
Tg$variable = factor(as.numeric(gsub('Tg(.+)', '\\1', Tg$variable)))
Pf$x = as.numeric(rownames(Pf))
Pf = melt(Pf, id.vars='x')
Pf$source = 'Pf'
Pf$variable = factor(as.numeric(gsub('Pf(.+)', '\\1', Pf$variable)))
# Stack data
data = rbind(Tg, Pf)
# Plot
dev.new(width=5, height=4)
p = ggplot(data=data, aes(x=x)) + geom_line(aes(y=value, group=source, color=source)) + facet_wrap(~variable)
p
Highlighting the area between the lines
First, interpolate the data onto a finer grid. This way the ribbon will follow the actual envelope of the lines, rather than just where the original data points were located.
data = ddply(data, c('variable', 'source'), function(x) data.frame(approx(x$x, x$value, xout=seq(min(x$x), max(x$x), length.out=100))))
names(data)[4] = 'value'
Next, calculate the data needed for geom_ribbon - namely ymax and ymin.
ribbon.data = ddply(data, c('variable', 'x'), summarize, ymin=min(value), ymax=max(value))
Now it is time to plot. Notice how we've added a new ribbon layer, for which we've substituted our new ribbon.data frame.
dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax), alpha=0.3, data=ribbon.data)
Dynamic coloring between the lines
The trickiest variation is if you want the coloring to vary based on the data. For that, you currently must create a new grouping variable to identify the different segments. Here, for example, we might use a function that indicates when the "Tg" group is on top:
GetSegs <- function(x) {
segs = x[x$source=='Tg', ]$value > x[x$source=='Pf', ]$value
segs.rle = rle(segs)
on.top = ifelse(segs, 'Tg', 'Pf')
on.top[is.na(on.top)] = 'Tg'
group = rep.int(1:length(segs.rle$lengths), times=segs.rle$lengths)
group[is.na(segs)] = NA
data.frame(x=unique(x$x), group, on.top)
}
Now we apply it and merge the results back with our original ribbon data.
groups = ddply(data, 'variable', GetSegs)
ribbon.data = join(ribbon.data, groups)
For the plot, the key is that we now specify a grouping aesthetic to the ribbon geom.
dev.new(width=5, height=4)
p + geom_ribbon(aes(ymin=ymin, ymax=ymax, group=group, fill=on.top), alpha=0.3, data=ribbon.data)
Code is available together at: https://gist.github.com/1349300
Here is a three-liner to do the same :-). We first use base R's reshape to convert the data into long form. Then it is melted to suit ggplot2. Finally, we generate the plot!
mydf <- reshape(cbind(Tg, Pf), varying = 1:8, direction = 'long', sep = "")
mydf_m <- melt(mydf, id.vars = c(1, 4), variable.name = 'source')
qplot(id, value, colour = source, data = mydf_m, geom = 'line') +
facet_wrap(~ time, ncol = 2)
NOTE. The reshape function in base R is extremely powerful, albeit very confusing to use. It is used to transform data between long and wide formats.
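For instance, a hypothetical two-column toy example of the wide-to-long direction:
wide <- data.frame(Tg1 = c(0.9, 0.4), Tg2 = c(0.5, 0.6))
long <- reshape(wide, varying = 1:2, direction = 'long', sep = '')
# long has columns 'time' (1 or 2, parsed from the column suffix),
# 'Tg' (the stacked values) and 'id' (the original row number)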
Kudos for automating something you used to do in Excel using R! That's exactly how I got started with R and a common path to R enlightenment :)
All you really need is a little looping. Here's an example, most of which is creating example data that represents your data structure:
## create some example data
Tg <- data.frame(Tg1 = rnorm(10))
for (i in 2:10) {
vec <- rep(NA, 8)
vec <- c(rnorm(sample(5:10,1)), vec)
Tg[paste("Tg", i, sep="")] <- vec[1:10]
}
Pf <- data.frame(Pf1 = rnorm(10))
for (i in 2:10) {
vec <- rep(NA, 8)
vec <- c(rnorm(sample(5:10,1)), vec)
Pf[paste("Pf", i, sep="")] <- vec[1:10]
}
## ok, sample data created
## now lets loop through all the columns
## if you didn't know how many columns there are you could
## use ncol(Tg) to figure out
for (i in 1:10) {
plot(1:10, Tg[,i], type = "l", col="blue", lwd=5, ylim=c(-3,3),
xlim=c(1, max(length(na.omit(Tg[,i])), length(na.omit(Pf[,i])))))
lines(1:10, Pf[,i], type = "l", col="red", lwd=5)
dev.copy(png, paste('rplot', i, '.png', sep=""))
dev.off()
}
This will result in 10 graphs (rplot1.png through rplot10.png) in your working directory.