I'm trying to draw a boxplot of a list of values in ggplot2, but the problem is that it doesn't know how to deal with lists. What should I try?
E.g.:
k <- list(c(1,2,3,4,5),c(1,2,3,4),c(1,3,6,8,14),c(1,3,7,8,10,37))
k
[[1]]
[1] 1 2 3 4 5
[[2]]
[1] 1 2 3 4
[[3]]
[1] 1 3 6 8 14
[[4]]
[1] 1 3 7 8 10 37
If I pass k as an argument to boxplot() it will handle it flawlessly and produce a nice (well not so nice... hehehe) boxplot with the range of all the values as the Y-axis and the list index (each element) as the X-axis.
How can I achieve the exact same effect with ggplot2? I think that data frames or matrices are not an option because the vectors have different lengths.
Thanks
The answer is that you don't. ggplot2 is designed to work with data frames, particularly long form data frames. That means you need your data as one tall vector, with a grouping factor:
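library(ggplot2)

# one tall vector of values, plus a factor recording which list element each value came from
d <- data.frame(x = unlist(k),
                grp = rep(letters[1:length(k)], times = sapply(k, length)))
ggplot(d, aes(x = grp, y = x)) + geom_boxplot()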
And as pointed out in the comments, melt achieves the same result as this manual reshaping and is much simpler. I guess I like to make things difficult.
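For comparison, here is a minimal sketch of the same reshaping without the manual `rep()` bookkeeping. It uses base R's `stack()` rather than `melt`, which is a substitution on my part; the names `k_named` and `d2` are only for illustration, and the ggplot2 call is guarded in case the package is not installed.

```r
k <- list(c(1, 2, 3, 4, 5), c(1, 2, 3, 4), c(1, 3, 6, 8, 14), c(1, 3, 7, 8, 10, 37))

# stack() needs a named list; the names become the grouping factor
k_named <- setNames(k, letters[seq_along(k)])

# d2 has two columns: `values` (numeric) and `ind` (factor with levels a, b, c, d)
d2 <- stack(k_named)

# number of observations per group, to confirm the unequal lengths survived
table(d2$ind)

# build the boxplot only if ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(d2, ggplot2::aes(x = ind, y = values)) +
    ggplot2::geom_boxplot()
}
```

Printing `p` draws the plot. The unequal vector lengths are not a problem in long form, because each value simply becomes one row.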
I have some data in a 3D grid identified by simple i,j,k locations (no real-world spatial information). These data are in a RasterStack right now.
b <- stack(system.file("external/rlogo.grd", package="raster"))
# add more layers
b <- stack(b,b)
# dimensions
dim(b)
[1] 77 101 6
yields 77 rows, 101 columns, 6 layers.
# upscale by 2
up <- aggregate(b,fact=2)
dim(up)
[1] 39 51 6
yields 39 rows, 51 columns, 6 layers.
Hoped-for behavior: 3 layers.
I'm looking for a method to aggregate across layers in addition to the present behavior, which is to aggregate within each layer. I'm open to other data structures, but would prefer an existing upscaling/resampling/aggregation algorithm to one I write myself.
Potentially related are http://quantitative-advice.gg.mq.edu.au/t/fast-way-to-grid-and-sum-coordinates/110/5 or the spacetime package, which assumes the layers are temporal rather than spatial, adding more complexity.
Suppose you define a variable agg.fact to hold the value 2:
agg.fact <- 2
up <- aggregate(b, fact = agg.fact)
dim(up)
[1] 39 51 6
Now we generate a table that indicates which layers will be aggregated with each other, using agg.fact:
positions <- matrix(1:nlayers(b), nrow = nlayers(b)/agg.fact, byrow = TRUE)
positions
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
And apply a function (in this case mean, but it could be max, sum or another) to each pair of layers:
up2 <- stack(apply(positions, 1, function(x){
mean(b[[x[1]]], b[[x[2]]])
}))
dim(up2)
[1] 77 101 3
Or, if you want to aggregate in all three dimensions (choose whether to aggregate across layers first and then within each layer, or vice versa):
up3 <- stack(apply(positions, 1, function(x){
aggregate(mean(b[[x[1]]], b[[x[2]]]), fact = agg.fact) # across layers first, then within each layer
#mean(aggregate(b[[x[1]]], fact = agg.fact), aggregate(b[[x[2]]], fact = agg.fact)) # within each layer first, then across layers
}))
dim(up3)
[1] 39 51 3
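The layer-pairing logic above can be checked without the raster package. This is only a sketch on a plain 3D array standing in for the stack; `arr`, `agg_fact` and `out` are hypothetical names, and `rowMeans(..., dims = 2)` is used here instead of the raster `mean` method.

```r
# Hypothetical stand-in for the raster stack: 4 rows, 5 columns, 6 layers
arr <- array(seq_len(4 * 5 * 6), dim = c(4, 5, 6))
agg_fact <- 2

# one row per output layer, listing the input layers that get merged
positions <- matrix(seq_len(dim(arr)[3]), ncol = agg_fact, byrow = TRUE)

# average each group of layers cell by cell; dims = 2 keeps the first two
# dimensions and averages over the third
merged <- lapply(seq_len(nrow(positions)), function(i) {
  rowMeans(arr[, , positions[i, ], drop = FALSE], dims = 2)
})

# reassemble the list of matrices into a 3D array
out <- array(unlist(merged), dim = c(dim(arr)[1:2], length(merged)))
dim(out)
```

The row and column counts are unchanged and only the layer dimension shrinks, which mirrors what up2 does for the stack.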
I had not read the documentation carefully. To aggregate across layers, supply a third value in fact. From the help page for aggregate:
For example, fact=2 will result in a new Raster* object with 2*2=4 times fewer cells. If two numbers are supplied, e.g., fact=c(2,3), the first will be used for aggregating in the horizontal direction, and the second for aggregating in the vertical direction, and the returned object will have 2*3=6 times fewer cells. Likewise, fact=c(2,3,4) aggregates cells in groups of 2 (rows) by 3 (columns) and 4 (layers).
It may be necessary to play with expand=TRUE vs expand=FALSE to get it to work, but this seems inconsistent (I have reported it as a bug).
I have a data frame DF which looks like this:
ID Area time
1 1 182.685 1
2 2 182.714 1
3 3 182.275 1
4 4 211.928 1
5 5 218.804 1
6 6 183.445 1
...
1 1 184.334 2
2 2 196.765 2
3 3 186.435 2
4 4 213.322 2
5 5 214.766 2
6 6 172.667 2
... and so on, up to ID = 6. I want to apply an autocorrelation function to each ID, i.e. compare ID = 1 at time 1 with ID = 1 at time 2, and so on.
What is the most straightforward way to apply e.g. acf() to my data frame?
When I try to use
autocorr = aggregate(x = DF$Area, by = list(DF$ID), FUN = acf)
I get a weird object.
Thanks in advance!
"I want to apply an autocorrelation function on each ID"
OK, good, so you don't want any cross-correlation, which makes things much easier.
"I get a weird object"
acf returns a bunch of things, i.e., it returns a list of things. I think you will be only interested in ACF values, so you need:
FUN = function (u) c(acf(u, plot = FALSE)$acf)
Also, using aggregate is not a good idea. You may want split and sapply:
## so your data frame is called `x`
oo <- sapply(split(x$Area, x$ID), FUN = function (u) c(acf(u, plot = FALSE)$acf) )
If you have balanced data, i.e., if you have equal number of observations for each ID, oo will be simplified into a matrix for sure. If you do not have balanced data, you may want to explicitly control the lag.max argument in acf. By default, acf will auto-decide on this value based on the number of observations.
Now suppose we want lag 0 to lag 7, we can set:
oo <- sapply(split(x$Area, x$ID),
FUN = function (u) c(acf(u, plot = FALSE, lag.max = 7)$acf) )
Thus the result oo is a matrix with 8 rows (one row per lag, one column per ID). I don't see any benefit in using a data frame to hold this result, but in case you want one, simply do:
data.frame(oo)
With data either in a matrix or a data frame, it is easy for you to do further analysis.
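To make the steps above concrete, here is a self-contained sketch on simulated data. The data frame `DF` below is made up (6 IDs, 20 time points each, random Area values), and it assumes the rows are already ordered by time within each ID, which acf() relies on.

```r
set.seed(1)

# Hypothetical balanced data: 6 IDs observed at 20 time points each
DF <- data.frame(ID   = rep(1:6, times = 20),
                 time = rep(1:20, each = 6),
                 Area = rnorm(120, mean = 190, sd = 15))

# split Area by ID, then keep only the numeric ACF values for lags 0 to 7
acf_by_id <- sapply(split(DF$Area, DF$ID),
                    function(u) c(acf(u, lag.max = 7, plot = FALSE)$acf))

# label the rows by lag; the columns are already named after the IDs
rownames(acf_by_id) <- paste0("lag", 0:7)

dim(acf_by_id)
```

The first row is the lag-0 autocorrelation, which is 1 for every ID by definition, so it is a quick sanity check that the rows really are lags.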
-----------
For a complete description of acf, please read Produce a boxplot for multiple ACFs
I have data saved in a text file with couple thousands line. Each line only has one value. Like this
52312
2
3
4
5
7
9
4
5
3
The first value is always roughly 10,000 times bigger than all the other values.
I can read in the data with data<-read.table("data.txt")
When I just use plot(data) all the data have the same y-value, resulting in a line, where the x values just represent the values given from the data.
What I want, however, is for the x-value to represent the line number and the y-value the actual data. So for the above example my points would be (1,52312), (2,2), (3,3), (4,4), (5,5), (6,7), (7,9), (8,4), (9,5), (10,3).
Also, since the first value is way higher than all the other values, I'd like to use a log scale for the y-axis.
Sorry, very new to R.
set.seed(1000)
df = data.frame(a=c(9999999,sample(2:78,77,replace = FALSE)))
plot(x=1:nrow(df), y=log(df$a))
i) set.seed(1000) makes sample() return the same random numbers each time you run this code, so the example is reproducible.
ii) Type ?sample in the R console for its documentation.
iii) Since you wanted the x-axis to be the line number, I create it with the ":" operator: 1:3 gives 1, 2, 3. Similarly, 1:nrow(df) creates an index that matches the number of rows in your data.
iv) For the log scale, simply apply log() to the values. Read more in ?plot about its parameters.
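As a variation on the above, here is a sketch that starts from a file like the one in the question. The temporary file stands in for data.txt, and it uses the `log = "y"` argument of plot() instead of transforming the values, so the axis keeps the original units; `line_no` is a name chosen here for illustration.

```r
# write the example values to a temporary file standing in for data.txt
path <- tempfile(fileext = ".txt")
writeLines(c("52312", "2", "3", "4", "5", "7", "9", "4", "5", "3"), path)

# read.table() names a single unnamed column V1
dat <- read.table(path)

# the line number is just the row index
line_no <- seq_len(nrow(dat))

# draw only in an interactive session; log = "y" puts the y-axis on a log scale
if (interactive()) {
  plot(line_no, dat$V1, log = "y", xlab = "line number", ylab = "value")
}
```

Note that `log = "y"` needs strictly positive values, so a file containing zeros would need a different treatment.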
Try this:
df
x y
1 1 52312
2 2 2
3 3 3
4 4 4
5 5 5
6 6 7
7 7 9
8 8 4
9 9 5
10 10 3
library(ggplot2)
ggplot(df, aes(x, y)) + geom_point(size=2) + scale_y_log10()
How can I plot a recurrence (each value against the previous one) in R?
Any solution with base plot, ggplot2, lattice, or a dedicated package is welcome.
For example:
Imagine I have these data:
mydata <- data.frame(t=1:10, Y=runif(10))
t Y
1 0.3744869
2 0.6314202
3 0.3900789
4 0.6896278
5 0.6894134
6 0.5549006
7 0.4296244
8 0.4527201
9 0.3064433
10 0.5783539
I could transform it like this:
mydata2 <- data.frame(t=c(NA,mydata$t),Y=c(NA,mydata$Y),Y2=c(mydata$Y, NA))
t Y Y2
NA NA 0.9103703
1 0.9103703 0.1426041
2 0.1426041 0.4150476
3 0.4150476 0.2109258
4 0.2109258 0.4287504
5 0.4287504 0.1326900
6 0.1326900 0.4600964
7 0.4600964 0.9429571
8 0.9429571 0.7619739
9 0.7619739 0.9329098
10 0.9329098 NA
(or similar methods, but I can have problems with missing data)
And plot it
plot(Y2~Y, data=mydata2)
I guess I must use some grouping function such as ave or apply. But it's not an elegant solution, and if I have more columns it can become difficult to generalize the transformation.
For example
mydata3 <- data.frame(x=sample(10,100, replace=T),t=1:100, Y=2*runif(100)+1)
For every x (or combination of values on other columns) I want to plot Y_{i+1} ~ Y_i, on the same plot.
Other tools, such as Mathematica have functions to plot sequences directly.
I've found a solution, though not a very elegant one:
For this sample data.
mydata <- data.frame(x=sample(4,25, replace=T),t=1:25, Y=2*runif(25)+1)
newdata <- mydata[order(mydata$x, mydata$t), ]
newdata$prev <- ave(newdata$Y, newdata$x, FUN=function(x) c(NA,head(x,-1)))
plot(Y~prev, data=newdata)
In this example there is not a row for every t value within each group, so you would first need to generate NAs for the missing values. But it's just a quick solution. In my real data I have many observations for each t.
lag.plot can plot recurrence plots but not within each subgroup.
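To see what the ave() step does, here is a small deterministic sketch. The data frame `toy` and its column `g` are invented for illustration, with the rows already ordered by t inside each group.

```r
# Hypothetical toy data: two groups, four consecutive time points each
toy <- data.frame(g = rep(c("a", "b"), each = 4),
                  t = rep(1:4, times = 2),
                  Y = c(1, 2, 4, 8, 10, 20, 40, 80))

# ave() applies the function within each level of g and keeps the row order,
# so prev holds the previous Y of the same group (NA for the first row)
toy$prev <- ave(toy$Y, toy$g, FUN = function(y) c(NA, head(y, -1)))
toy

# draw only in an interactive session: all groups on one plot, coloured by group
if (interactive()) {
  plot(Y ~ prev, data = toy, col = as.integer(factor(toy$g)), pch = 19)
}
```

The first row of each group gets NA, so no pair ever mixes values from two different groups.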
I have a question about plots. For example, we have variables a and b, and we plot them in R as points. Now I want to pick out the best/highest points. Is there a way to generate a ranking of the points? I thought maybe something with mean?
Thanks!
a<- c(1,3,7,5,3,8,4,5,3,6,9,4,2,6,3)
b<- c(5,3,7,2,7,2,5,2,7,3,6,2,1,1,9)
plot(a,b)
Based on your comment to get the positions of the points with the 5 highest b values, use order:
order(b,decreasing=T)[1:5]
[1] 15 3 5 9 11
And you can use this to get the relevant a and b values:
a[order(b,decreasing=T)[1:5]]
[1] 3 7 3 3 9
b[order(b,decreasing=T)[1:5]]
[1] 9 7 7 7 6
You can use this also to highlight them in the plot:
high <- order(b,decreasing=T)[1:5]
col <- rep("black",length(b))
col[high] <- "red"
plot(a,b,col=col)
Note that there is some overplotting here (2 values at (3,7))
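If an explicit ranking is wanted rather than just positions, a possible sketch uses rank(). Ranking on b alone and using `ties.method = "min"` are assumptions here, chosen so that tied points share the same rank; `rk` and `top` are illustrative names.

```r
a <- c(1, 3, 7, 5, 3, 8, 4, 5, 3, 6, 9, 4, 2, 6, 3)
b <- c(5, 3, 7, 2, 7, 2, 5, 2, 7, 3, 6, 2, 1, 1, 9)

# rank 1 = highest b; negating b turns rank()'s ascending order into descending
rk <- rank(-b, ties.method = "min")

# positions of all points whose rank is 5 or better
top <- which(rk <= 5)

# collect position, coordinates and rank in one table
data.frame(index = top, a = a[top], b = b[top], rank = rk[top])
```

Unlike `order(...)[1:5]`, this returns the positions in their original order, and with other data it can return more than five points if several tie at the cutoff.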