ggplot2 plotting single data frame with 10 different levels - r

I am using RStudio.
I have a single data frame with three columns titled
colnames(result)
[1] "v" "v2" "Lambda"
I wish to use ggplot2 to create an overlay plot, assigning a different color to each of the 10 different values in the Lambda column.
summary(result$Lambda)
1 2 3 4 5 6 7 8 9 10
101 100 100 100 100 100 100 100 100 100
Now I created a factor for the Lambda values as follows:
result$Lambda<-factor(result$Lambda, levels=c(1,2,3,4,5,6,7,8,9,10), labels=c("1","2","3","4","5","6","7","8","9","10"))
qplot(x=v, y=v2, data=result, geom="line", fill=Lambda, main="Newton Revolved Plot", xlab="x=((L/2)*((1/t)+2*t+t^(3)))", ylab="y=(L/2)*(log(1/t)+t^(2)+(3/4)*t^(4))-(7*L/8)")
but this command does not work.
I am just inquiring about how to plot this, either as one overlay or as 10 different plots, one plot for each Lambda value. Or perhaps about resources for a novice.
Thank you in advance
2 2.000000e+00 0.000000e+00 1
3 6.250000e+00 6.778426e+00 1
4 1.666667e+01 3.345069e+01 1
5 3.612500e+01 1.024319e+02 1
6 6.760000e+01 2.451953e+02 1
7 1.140833e+02 5.022291e+02 1
8 1.785714e+02 9.230270e+02 1
9 2.640625e+02 1.566085e+03 1
10 3.735556e+02 2.498901e+03 1
105 7.225000e+01 2.048637e+02 2
106 1.352000e+02 4.903906e+02 2
107 2.281667e+02 1.004458e+03 2
108 3.571429e+02 1.846054e+03 2
109 5.281250e+02 3.132171e+03 2
110 7.471111e+02 4.997803e+03 2
111 1.020100e+03 7.595947e+03 2
112 1.353091e+03 1.109760e+04 2
250 1.766205e+05 6.488994e+06 3
251 1.876500e+05 7.034992e+06 3
252 1.991295e+05 7.614744e+06 3
253 2.110680e+05 8.229615e+06 3
254 2.234745e+05 8.880996e+06 3
255 2.363580e+05 9.570303e+06 3
Here is some sample data; the first column is "v", the second column is "v2" and the third column is the "Lambda" column (not including the unnamed leading column, which is just the row index).
Just to rephrase my R problem: I have a single data frame with 10 levels, each with roughly 100 entries (see above for the exact counts), and I wish to use ggplot2 to plot each level in a different color.
There are two ways I can think of:
1) find the correct ggplot2 option to distinguish each level, or
2) split this single data frame titled "result" into 10 subsets and plot them separately.
Thank you very much in advance.

For a line use
color=
instead of
fill=
But without any data to reproduce the problem it is hard to know if that is what you want.
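For reference, here is a hedged sketch of how the overlay could be written with ggplot() rather than qplot(), assuming result is structured as described above and Lambda has already been converted to a factor; mapping color = Lambda draws one line per level, each in its own color:
library(ggplot2)
# A minimal sketch, assuming `result` has numeric columns v and v2 and a
# factor column Lambda with 10 levels, as created above.
ggplot(result, aes(x = v, y = v2, color = Lambda)) +
  geom_line() +
  labs(title = "Newton Revolved Plot",
       x = "x=((L/2)*((1/t)+2*t+t^(3)))",
       y = "y=(L/2)*(log(1/t)+t^(2)+(3/4)*t^(4))-(7*L/8)")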

Related

R: Plot several lines in the same plot: ggplot + data tables or frames vs matrices

My general problem: I tend to struggle using ggplot, because it's very data-frame-centric but the objects I work with seem to fit matrices better than data frames. Here is an example (adapted a little).
I have a quantity x that can assume values 0:5, and a "context" that can have values 0 or 1. For each context I have 7 different frequency distributions over the quantity x. (More generally I could have more than two "contexts", more values of x, and more frequency distributions.)
I can represent these 7×2 frequency distributions as a list freqs of two matrices, say:
> freqs
$`context0`
x0 x1 x2 x3 x4 x5
sample1 20 10 10 21 37 2
sample2 34 40 6 10 1 8
sample3 52 4 1 2 17 25
sample4 16 32 25 11 5 10
sample5 28 2 10 4 21 35
sample6 22 13 35 12 13 5
sample7 9 5 43 29 4 10
$`context1`
x0 x1 x2 x3 x4 x5
sample1 15 21 14 15 14 21
sample2 27 8 6 5 29 25
sample3 13 7 5 26 48 0
sample4 33 3 18 11 13 22
sample5 12 23 40 11 2 11
sample6 5 51 2 28 5 9
sample7 3 1 21 10 63 2
or a 3D array.
Or I could use a data.table tablefreqs like this one:
> tablefreqs
context x0 x1 x2 x3 x4 x5
1: 0 20 10 10 21 37 2
2: 0 34 40 6 10 1 8
3: 0 52 4 1 2 17 25
4: 0 16 32 25 11 5 10
5: 0 28 2 10 4 21 35
6: 0 22 13 35 12 13 5
7: 0 9 5 43 29 4 10
8: 1 15 21 14 15 14 21
9: 1 27 8 6 5 29 25
10: 1 13 7 5 26 48 0
11: 1 33 3 18 11 13 22
12: 1 12 23 40 11 2 11
13: 1 5 51 2 28 5 9
14: 1 3 1 21 10 63 2
Now I'd like to draw the following line plot (there's a reason why I need line plots and not, say, histograms or bar plots):
The 7 frequency distributions for context 0, with x as x-axis and the frequency as y-axis, all in the same line plot (with some alpha).
The 7 frequency distributions for context 1, again with x as x-axis and the frequency as y-axis, all in the same line plot (with alpha), but displayed upside-down below the plot for context 0.
Ggplot would surely do this very nicely, but it seems to require some acrobatics with data tables:
– If I use the data table tablefreqs it's not clear to me how to plot all its rows having context==0 in the same plot: ggplot seems to only think column-wise, not row-wise. I could use the six values of x as table rows, but then the "context" values would also end up in a row, and I'm not sure I can subset a data table by values in a row, rather than in a column.
– If I use the matrix freqs, I could create a mini-data-table having x as one column and one frequency distribution as another column, input that into ggplot+geom_line, then go over all 7 frequency distributions in a for-loop maybe. Not clear to me how to tell ggplot to keep the previous plots in this case. Then another for-loop over the two "contexts".
I'd be grateful for suggestions on how to approach this problem in particular, and more generally on what objects to choose for storing this kind of data: matrices? data tables, maybe with a different structure than shown here? some other formats?
I would suggest familiarizing yourself with the concept known as Tidy Data: a set of principles for data handling and storage that are adopted by ggplot2 and a number of other packages.
You are free to use a matrix or a list of matrices to store your data; however, you can certainly store the data as you describe it (and as I understand it) in a single data frame or table using the following column convention:
context | sample | x | freq
I'll show you how I would convert the tablefreqs dataset you shared with us into that format, then how I would go about creating a plot as you are describing it in your question. I'm assuming in this case you only have the two values for context, although you allude to there being more. I'm going to try to interpret correctly what you stated in your question.
Create the Tidy Data frame
Your data frame as shown contains columns x0 through x5, which have the values for x spread across more than one column, when you really need them converted into the format shown above. This is called "gathering" your data, and you can do that with tidyr::gather().
First, I also need to replicate the naming of your samples according to the matrix dataset, so I'll do that and gather your data:
library(dplyr)
library(tidyr)
library(ggplot2)
# create the sample names
tablefreqs$sample <- rep(paste0('sample',1:7), 2)
# gather the columns together
df <- tablefreqs %>%
gather(key='x', value='freq', -c(context, sample))
Note that in the gather() function, we have to specify that the two columns df$context and df$sample are left alone, as they are not part of the gathering. But now we are left with df$x containing character values. We can't plot that, because we want x to be a number (at least... I'm assuming you do). For that, we'll convert using:
df$x <- as.numeric(gsub("[^[:digit:].]", "", df$x))
That extracts the number from each value in df$x and represents it as a number, not a character. We have the opposite issue with df$context, which is actually a discrete factor, and we should represent it as such in order to make plotting a bit easier:
df$context <- factor(df$context)
Create the Plot
Now we're ready to create the plot. From your description, I may not have this perfectly right, but it seems that you want a plot containing both context = 1 and context = 0, and when context = 1 the data should be "upside down". By that, I'm assuming you are talking about plotting df$freq when df$context == 0 and -df$freq when df$context == 1. We could do that using some fancy logic in the ggplot() call, but I find it's easier just to create a new column in your dataset to represent what we will be plotting on the y axis. We'll call this column df$freq_adj and use that for plotting:
df$freq_adj <- ifelse(df$context==1, -df$freq, df$freq)
Then we create the plot. I'll explain a bit below the result:
ggplot(df, aes(x=x, y=freq_adj)) +
geom_line(
aes(color=context, linetype=sample)
) +
geom_hline(yintercept=0, color='gray50') +
scale_x_continuous(expand=expansion(mult=0)) +
theme_bw()
Without some clearer description or picture of what you were looking to do, I took some liberties here. I used color to discriminate between the two values for context, and I'm using linetype to discriminate the different samples. I also added a line at 0, since it seemed appropriate to do so here, and the scale_x_continuous() command is removing the extra white space that is put in place at the extreme ends of the data.
An alternative that is maybe closer to your description would be to physically have a separation between the two plots, and represent context = 1 as a physically separate plot compared to context = 0, with one over top of the other.
Here's the code and plot:
ggplot(df, aes(x=x, y=freq_adj)) +
geom_line(aes(group=sample), alpha=0.3) +
facet_grid(context ~ ., scales='free_y') +
scale_x_continuous(expand=expansion(mult=0)) +
theme_bw()
There the use of aes(group=sample) is quite important, since I want all the lines for each sample to be the same (alpha setting and color), yet ggplot2 needs to know that the connections between the points should be based on "sample". This is done using the group= aesthetic. The scales='free_y' argument on facet_grid() allows the y axis scale to shrink and fit the data according to each facet.
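As a side note (an assumption about newer package versions, not part of the original answer): in recent tidyr releases gather() is superseded by pivot_longer(), and the same reshaping step could be written as follows, assuming the same tablefreqs object with the sample column added as above:
library(tidyr)
# Equivalent reshaping with the newer tidyr interface; the gsub() and
# factor() conversions shown earlier still apply afterwards.
df <- pivot_longer(tablefreqs,
                   cols = -c(context, sample),
                   names_to = "x",
                   values_to = "freq")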

Grouping and building intervals of data in R and useful visualization

I have some data extracted via HIVE. In the end we are talking about a CSV with around 500,000 rows. I want to plot the data after grouping it into intervals.
Besides the grouping, it's not clear how to visualize the data. Since we are dealing with low spend values and sometimes high frequencies, I'm not sure how to handle this.
Here is just an overview via head(data)
userid64 spend freq
575033023245123 0.00924205 489
12588968125440467 0.00037 2
13830962861053825 0.00168 1
18983461971805285 0.001500366 333
25159368164208149 0.00215 1
32284253673482883 0.001721303 222
33221593608613197 0.00298 709
39590145306822865 0.001785281 11
45831636009567401 0.00397 654
71526649454205197 0.000949978 1
78782620614743930 0.00552 5
I want to group the data into intervals, so I want an extra column indicating the groups. The first group should contain all rows with a frequency (called freq) between 1 and 100, the second group all rows with a frequency between 101 and 200... and so on.
The result should look like
userid64 spend freq group
575033023245123 0.00924205 489 5
12588968125440467 0.00037 2 1
13830962861053825 0.00168 1 1
18983461971805285 0.001500366 333 4
25159368164208149 0.00215 1 1
32284253673482883 0.001721303 222 3
33221593608613197 0.00298 709 8
39590145306822865 0.001785281 11 1
45831636009567401 0.00397 654 7
71526649454205197 0.000949978 1 1
78782620614743930 0.00552 5 1
Is there a nice and gentle way to get this? I need this grouping for upcoming plots. I want to visualize all intervals to get an overview of the spend. If you have any ideas for the visualization, please let me know. I thought I should work with boxplots.
If you want to group freq for every 100 units, you can try the ceiling() function in base R:
ceiling(df$freq / 100)
#[1] 5 1 1 4 1 3 8 1 7 1 1
where df is your dataframe.
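To attach the group as an extra column and get a first overview of spend per interval, one option is a boxplot per group; a minimal sketch, assuming the data frame is called df as above:
library(ggplot2)
# Add the interval group as a new column.
df$group <- ceiling(df$freq / 100)
# One possible visualization: the distribution of spend within each group.
ggplot(df, aes(x = factor(group), y = spend)) +
  geom_boxplot() +
  labs(x = "frequency group (1 = freq 1-100, 2 = freq 101-200, ...)", y = "spend")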

Printing only certain panels in R lattice

I am plotting a quantile-quantile plot for a certain data that I have. I would like to print only certain panels that satisfy a condition that I put in for panel.qq(x,y,...).
Let me give you an example. The following is my code,
qq(y ~ x | cond, data = test.df, panel = function(x, y, subscripts, ...) {
  if (length(unique(test.df[subscripts, 2])) > 3) { panel.qq(x, y, subscripts, ...) }
})
Here y is the factor and x is the variable; these will be plotted on the x- and y-axes. cond is the conditioning variable. What I would like is that only those panels be printed that pass the condition in the panel function, which is
if(length(unique(test.df[subscripts,2])) > 3).
I hope this information helps. Thanks in advance.
Added Sample data,
y x cond
1 1 6 125
2 2 5 125
3 1 5 125
4 2 6 125
5 1 3 125
6 2 8 125
7 1 8 125
8 2 3 125
9 1 5 125
10 2 6 125
11 1 5 124
12 2 6 124
13 1 6 124
14 2 5 124
15 1 5 124
16 2 6 124
17 1 4 124
18 2 7 124
19 1 0 123
20 2 11 123
21 1 0 123
22 2 11 123
23 1 0 123
24 2 11 123
25 1 0 123
26 2 11 123
27 1 0 123
28 2 2 123
So this is the sample data. What I would like is to not have a panel for 123, since the number of unique x values for 123 is 3, while for the others it is 4. Thanks again.
Yeah, I think it is a subset problem, not a lattice one. You don't include an example, but it looks like you want to keep only rows where there are more than 3 rows for each value of whatever is in column 2 of your data frame. If so, here is a data.table solution.
library(data.table)
test.dt <- as.data.table(test.df)
test.dt.subset <- test.dt[,N:=.N,by=c2][N>3]
Where c2 is that variable in the second column. The last line of code first adds a variable, N, for the count of rows (.N) for each value of c2, then subsets for N>3.
UPDATE: And since a data table is also a data frame, you can use test.dt.subset directly as the data source in the call to qq (or other lattice function).
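For example, reusing the formula from the question (a sketch; it assumes test.dt.subset was built from the sample data above, so the variable in column 2 is x):
library(lattice)
qq(y ~ x | cond, data = test.dt.subset)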
UPDATE 2: Here is one way to do the same thing without data.table:
d <- data.frame(x=1:15,y=1:15%%2, # example data frame
c2=c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,5))
d$N <- 1 # create a column for count
split(d$N,d$c2) <- lapply(split(d$x,d$c2),length) # populate with count
d
d[d$N>3,] # subset
I did something very similar to DaveTurek.
My sample dataframe above is test.df
test.df.list <- split(test.df, test.df$cond, drop = FALSE)
final.test.df <- do.call("rbind", lapply(test.df.list, function(r) {
  if (length(unique(r$x)) > 3) { r }
}))
So, here I am breaking test.df into a list of data frames by the conditioning variable. Next, in the lapply() I check the number of unique x values in each subset data frame. If this number is greater than 3, the data frame is returned; if not, it is dropped (the function returns NULL). Finally, do.call() binds all the data frames back into one big data frame to run the quantile-quantile plot on.
In case anyone wants to know the qq() function call after getting the specific data, it is:
trellis.device(postscript,file="test.ps",color=F,horizontal=T,paper='legal')
qq(y ~ x|cond,data=final.test.df,layout=c(1,1),pch=".",cex=3)
dev.off()
Hope this helps.

drawing multiple boxplots from imputed data in R

I have an imputed dataset that I'm analysing, and I'm trying to draw boxplots, but I can't wrap my head around the proper procedure.
My data (a sample; the original has 20 observations per imputation and 13 variables per group, all values ranging from 0 to 25):
.imp .id FTE_RM FTE_PD OMZ_RM OMZ_PD
1 1 25 25 24 24
1 2 4 0 2 6
1 3 11 5 3 2
1 4 12 3 3 3
2 1 20 15 15 15
2 2 4 1 2 3
2 3 0 0 0 6
2 4 20 0 0 0
.imp signifies the imputation round, .id the identifier for each observation.
I want to draw all the FTE_* variables in a single plot (and the OMZ_* variables in another), but I wonder what to do with all the imputations: can I just include all values? The imputed data now has 500 observations. With an ANOVA, for instance, I'd need to average the ANOVA results over the 5 imputations to get back to 20 observations. But is this needed for a boxplot as well, since I only deal with medians, means, maxima and minima?
Such as:
data_melt <- melt(df[grep("^FTE_", colnames(df))])
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()
I've played a couple of times with ggplot, but consider myself a complete newbie.
I assume you want to keep the identifiers .imp and .id after melting, so I would rather use:
data_melt <- melt(df,c(".imp",".id"))
For completeness of the dataframe it probably helps to introduce a column that identifies the type - FTE vs. OMZ:
data_melt$type <- ifelse(grepl("FTE",data_melt$variable),"FTE","OMZ")
Having this data.frame you can, for example, facet on the type (alternatively you can just use a simple filter statement on data_melt to restrict to one type):
ggplot(data_melt, aes(x=variable, y=value))+geom_boxplot()+facet_wrap(~type,scales="free_x")
The result is a faceted figure, with the FTE boxplots in one panel and the OMZ boxplots in the other.
EDIT: fixed the data mess-up
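Putting the pieces together, here is a self-contained sketch that recreates the sample data from the question and produces the faceted boxplots (it assumes reshape2 for melt()):
library(reshape2)
library(ggplot2)
# Recreate the sample data from the question.
df <- data.frame(
  .imp = c(1, 1, 1, 1, 2, 2, 2, 2),
  .id  = c(1, 2, 3, 4, 1, 2, 3, 4),
  FTE_RM = c(25, 4, 11, 12, 20, 4, 0, 20),
  FTE_PD = c(25, 0, 5, 3, 15, 1, 0, 0),
  OMZ_RM = c(24, 2, 3, 3, 15, 2, 0, 0),
  OMZ_PD = c(24, 6, 2, 3, 15, 3, 6, 0)
)
# Melt, keeping .imp and .id as identifiers, then label the type.
data_melt <- melt(df, c(".imp", ".id"))
data_melt$type <- ifelse(grepl("FTE", data_melt$variable), "FTE", "OMZ")
ggplot(data_melt, aes(x = variable, y = value)) +
  geom_boxplot() +
  facet_wrap(~type, scales = "free_x")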

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two datasets of points, one with 41 points and another with 110,357,407, and the values in the rows represent the distances between these two sets of points (the distance of each point in list 1 to every single point in list 2). The first column is a list of point IDs (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to change this matrix into just 3 columns: the first would be like the first column of the matrix (the 110,357,407 point IDs), the second would identify which of the 41 points in the other set the distance is measured to, and the third would be the distance between those two points. So it would look something like this:
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between the back and all of the first value of pres are complete, pres will change to 2 and will eventually work its way up to 41)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
                          row = rep(rownames(output3), ncol(output3)),
                          value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.
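For instance, here is a hedged sketch of that chunked approach, assuming the full table is a data frame df with the point IDs in column 1 and the 41 distance columns after it, and using the Back/Pres/Dist names from the desired output (chunk size and file name are arbitrary choices):
# Process the matrix in row chunks and append each piece to a CSV, to avoid
# holding all ~4.5 billion distances in memory at once.
chunk_size <- 1e6
n <- nrow(df)
out_file <- "distances_long.csv"
for (start in seq(1, n, by = chunk_size)) {
  rows <- start:min(start + chunk_size - 1, n)
  chunk <- df[rows, ]
  long <- data.frame(
    Back = rep(chunk[[1]], times = ncol(df) - 1),
    Pres = rep(seq_len(ncol(df) - 1), each = length(rows)),
    Dist = unlist(chunk[-1], use.names = FALSE)
  )
  write.table(long, out_file, sep = ",", row.names = FALSE,
              col.names = (start == 1), append = (start > 1))
}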
