How do you add jitter to a scatterplot matrix in ggpairs? - r

I want to add jitter to a scatterplot matrix. As far as I can tell, the question has been addressed on only one Stack Overflow page:
How to produce a meaningful draftsman/correlation plot for discrete values
But both solutions to the jitter problem which were suggested there involve deprecated code (plotmatrix and params):
library(ggplot2)
plotmatrix(y) + geom_jitter(alpha = .2)
library(GGally)
ggpairs(y, lower = list(params = c(alpha = .2, position = "jitter")))
I would have simply commented asking for an update there so as to not create a new question, but that appears to require reputation points, and I'm new to the site. My apologies if I've done something wrong in posting the question.
EDIT:
Here's what the data looks like:
> str(EHRound4.subset)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 301 obs. of 22 variables:
$ Subject# : int 1 2 3 4 6 7 8 13 14 16 ...
$ Condition : Factor w/ 2 levels "CDR","Mturk": 1 1 1 1 1 1 1 1 1 1 ...
$ Launch4 : int 5 8 8 5 8 5 3 8 5 6 ...
$ NewSong4 : int 6 8 8 6 8 6 8 8 8 7 ...
$ StudCom5 : int 6 5 8 3 1 3 4 8 7 7 ...
$ Textbook5 : int 8 1 8 3 1 7 8 8 8 8 ...
And here are several attempts at getting jitter:
> ggpairs(EHRound4.subset, columns = 3:6,
ggplot2::aes(colour=Condition), lower = list(geom_jitter(alpha = .2)))
> ggpairs(EHRound4.subset, columns = 3:6,
ggplot2::aes(colour=Condition, alpha=.2), lower = list(geom_jitter()))
> ggpairs(EHRound4.subset, columns = 3:6,
ggplot2::aes(colour=Condition, alpha=.2, position="jitter"))

@user20650 answered the question in the comments below it. For completeness, here it is in the form of an answer:
Use wrap, such as:
library(GGally)
ggpairs(y, lower = list(continuous=wrap("points", position=position_jitter(height=3, width=3))))
Passing position = position_jitter() rather than just position = "jitter" (which also works) lets you control the additional jitter parameters, such as width and height.
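As a minimal sketch of how this might look with data shaped like the question's: the data frame below is made up (random integer ratings from 1 to 8 standing in for EHRound4.subset), and the alpha and jitter amounts are arbitrary choices rather than recommendations.

```r
library(ggplot2)
library(GGally)

set.seed(1)
n <- 100
# Hypothetical stand-in for EHRound4.subset: discrete ratings from 1 to 8
demo <- data.frame(
  Condition = factor(sample(c("CDR", "Mturk"), n, replace = TRUE)),
  Launch4   = sample(1:8, n, replace = TRUE),
  NewSong4  = sample(1:8, n, replace = TRUE),
  StudCom5  = sample(1:8, n, replace = TRUE),
  Textbook5 = sample(1:8, n, replace = TRUE)
)

# wrap() forwards the extra arguments (alpha, position) to the "points" panels
p <- ggpairs(
  demo,
  columns = 2:5,
  mapping = aes(colour = Condition),
  lower = list(continuous = wrap(
    "points",
    alpha = 0.2,
    position = position_jitter(width = 0.3, height = 0.3)
  ))
)
print(p)
```

Because alpha is passed through wrap() instead of aes(), it is treated as a fixed setting and does not show up as a legend entry.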

Related

Single instead of multiple boxplots with ggplot

I would like to make a boxplot for a variable (Theta..vol..) depending on two factors (Tiefe) and (Ort).
> str(data)
'data.frame': 30 obs. of 6 variables:
$ Nummer : int 1 2 3 4 5 6 7 8 9 10 ...
$ Name : int 11 12 13 14 15 16 17 18 19 20 ...
$ Ort : Factor w/ 2 levels "NNW","S": 2 2 2 2 2 2 2 2 2 2 ...
$ Tiefe : int 20 20 20 20 20 50 50 50 50 50 ...
$ Gerät : int 2 2 2 2 2 2 2 2 2 2 ...
$ Theta..vol..: num 15 16.4 14.9 16.6 10.6 22.1 17.6 10 18 20.3 ...
My code is:
ggplot(data, aes(x = Tiefe, y = Theta..vol.., fill=Ort))+geom_boxplot()
Since the variable Tiefe has 3 levels and the variable Ort has 2 levels, I expect to see three pairs of boxplots (one pair for each Tiefe).
But I see just a single pair (one boxplot for each level of Ort).
What should I change to get a pair of boxplots for each Tiefe? Thank you.
In your code, Tiefe is being read as an integer not a factor.
Easy fix using dplyr with ggplot2:
First I made some dummy data:
library(dplyr)
library(ggplot2)
data <- tibble(
Ort = ifelse(runif(30) > 0.5, "NNW", "S"),
Tiefe = rep(c(20, 50, 75), times = 10),
Theta..vol.. = rnorm(30,15))
Next, we modify the Tiefe column before piping into the ggplot:
data %>%
mutate(Tiefe = factor(Tiefe)) %>%
ggplot(aes(x = Tiefe, y = Theta..vol.., fill = Ort)) +
geom_boxplot()
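If you would rather skip the dplyr step, the conversion can also happen inside aes(). This is a minimal sketch with made-up data (the column name Theta is shortened for the example); layer_data() is used only to count how many boxes were drawn.

```r
library(ggplot2)

set.seed(42)
# Made-up data: 2 sites x 3 depths, 5 observations per combination
demo <- data.frame(
  Ort   = rep(c("NNW", "S"), each = 15),
  Tiefe = rep(c(20, 50, 75), times = 10),
  Theta = rnorm(30, mean = 15)
)

# factor(Tiefe) makes the x axis discrete, so each depth gets its own pair
p <- ggplot(demo, aes(x = factor(Tiefe), y = Theta, fill = Ort)) +
  geom_boxplot() +
  labs(x = "Tiefe")

# layer_data() returns one row per box that the layer will draw
n_boxes <- nrow(layer_data(p, 1))
print(n_boxes)
```

With three depths and two sites, this should report six boxes, compared with two when Tiefe is left numeric.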

How can I store a value in a name?

I use the neotoma package to get data for a geographical site, which is identified by an ID. What I want to do is "store" the number in a name, like Sitenum, so that I only need to write the ID once and can then reuse it.
What I did:
Site<-get_download(20131, verbose = TRUE)
taxa<-as.vector(Site$'20131'$taxon.list$taxon.name)
What I want to do:
Sitenum <-20131
Site<-get_download(Sitenum, verbose = TRUE) # this obv. works
taxa<-as.vector(Site$Sitenum$taxon.list$taxon.name) # this doesn't work
The structure of the dataset:
str(Site)
List of 1
$ 20131:List of 6
..$ taxon.list :'data.frame': 84 obs. of 6 variables:
.. ..$ taxon.name : Factor w/ 84 levels "Alnus","Amaranthaceae",..: 1 2 3 4 5 6 7 8 9 10 ...
I constructed an object that mimics yours as follows:
Site <- list("2043"=list(other=data.frame(that=1:10)))
Note that the structure is essentially identical.
str(Site)
List of 1
$ 2043:List of 1
..$ other:'data.frame': 10 obs. of 1 variable:
.. ..$ that: int [1:10] 1 2 3 4 5 6 7 8 9 10
Now, I save the value of the first term:
temp <- 2043
Then use the code in my comment to access the inner vector:
Site[[as.character(temp)]]$other$that
[1] 1 2 3 4 5 6 7 8 9 10
I could also use recursive referencing like this
Site[[c(temp,"other", "that")]]
[1] 1 2 3 4 5 6 7 8 9 10
because c will coerce temp to be a character vector in the presence of "other" and "that" character vectors.
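Applied to the names in the question, the same idea can be sketched as follows. The Site object here is a small hand-built stand-in for what get_download() returns, with only two made-up taxa, so the example runs without the neotoma package.

```r
# Hypothetical stand-in for the object returned by get_download(20131)
Site <- list(
  "20131" = list(
    taxon.list = data.frame(taxon.name = c("Alnus", "Amaranthaceae"))
  )
)

Sitenum <- 20131

# $ looks for an element literally named "Sitenum", so this is NULL
wrong <- Site$Sitenum
print(is.null(wrong))

# [[ ]] evaluates its argument; list names are strings, so convert first
taxa <- as.vector(Site[[as.character(Sitenum)]]$taxon.list$taxon.name)
print(taxa)
```

The as.character() call matters: a bare number inside [[ ]] would be read as a position, not a name.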

What is the difference between dataset[,'column'] and dataset$column in R?

If I want to list all rows of a column in a dataset in R, I am able to do it in these two ways:
> dataset[,'column']
> dataset$column
It appears that both give me the same result. What is the difference?
In practice, not much, as long as dataset is a data frame. The main difference is that the dataset[, "column"] formulation accepts a variable as its argument, like j <- "column"; dataset[, j], while dataset$j would instead look for a column literally named j, which is not what you want.
dataset$column is list syntax and dataset[ , "column"] is matrix syntax. Data frames are really lists, where each list element is a column and every element has the same length. This is why length(dataset) returns the number of columns. Because they are "rectangular," we are able to treat them like matrices, and R kindly allows us to use matrix syntax on data frames.
Note that, for lists, list$item and list[["item"]] are almost synonymous. Again, the biggest difference is that the latter form evaluates its argument, whereas the former does not. This is true even in the form `$`(list, item), which is exactly equivalent to list$item. In Hadley Wickham's terminology, $ uses "non-standard evaluation."
Also, as mentioned in the comments, $ always uses partial name matching, [[ does not by default (but has the option to use partial matching), and [ does not allow it at all.
I recently answered a similar question with some additional details that might interest you.
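The points above can be checked with a small base R data frame. This is only a sketch: the names dataset, column and j follow the text, while the contents are made up.

```r
dataset <- data.frame(column = 1:3, other = c("a", "b", "c"))
j <- "column"

by_bracket <- dataset[, j]   # j is evaluated: returns the "column" column
by_double  <- dataset[[j]]   # same result with list syntax
by_dollar  <- dataset$j      # looks for a column named "j": NULL

partial_dollar <- dataset$col                      # $ partially matches "column"
exact_double   <- dataset[["col"]]                 # [[ matches exactly: NULL
partial_double <- dataset[["col", exact = FALSE]]  # partial matching on request

n_cols <- length(dataset)    # a data frame is a list of columns, so this is 2

print(list(by_dollar = by_dollar, partial_dollar = partial_dollar,
           exact_double = exact_double, n_cols = n_cols))
```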
Use the str command to see the difference:
> mydf
user_id Gender Age
1 1 F 13
2 2 M 17
3 3 F 13
4 4 F 12
5 5 F 14
6 6 M 16
>
> str(mydf)
'data.frame': 6 obs. of 3 variables:
$ user_id: int 1 2 3 4 5 6
$ Gender : Factor w/ 2 levels "F","M": 1 2 1 1 1 2
$ Age : int 13 17 13 12 14 16
>
> str(mydf[1])
'data.frame': 6 obs. of 1 variable:
$ user_id: int 1 2 3 4 5 6
>
> str(mydf[,1])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[,'user_id'])
int [1:6] 1 2 3 4 5 6
> str(mydf$user_id)
int [1:6] 1 2 3 4 5 6
>
> str(mydf[[1]])
int [1:6] 1 2 3 4 5 6
>
> str(mydf[['user_id']])
int [1:6] 1 2 3 4 5 6
mydf[1] is a data frame, while mydf[,1], mydf[,'user_id'], mydf$user_id, mydf[[1]] and mydf[['user_id']] are vectors.
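A related detail, sketched here with the same mydf values: matrix-style indexing of a single column drops to a vector by default, and drop = FALSE keeps the result a data frame.

```r
mydf <- data.frame(
  user_id = 1:6,
  Gender  = factor(c("F", "M", "F", "F", "F", "M")),
  Age     = c(13L, 17L, 13L, 12L, 14L, 16L)
)

as_vector <- mydf[, "user_id"]                # dimension dropped: integer vector
as_frame  <- mydf[, "user_id", drop = FALSE]  # stays a one-column data frame

print(is.data.frame(as_vector))
print(is.data.frame(as_frame))
```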

How do I get faceted barplot values to show as negative

I'm struggling a bit to make a vertically faceted barplot. I added a 'thus far' version of my work below. My main issue is that the negative values aren't showing as I'd expect. Shouldn't there be some line, or tick, indicating 0, with negative bars registering below it? The code below should be fully reproducible. You can see several negative values in the final data set I'm trying to plot. I'm getting a rather verbose error beginning with 'Mapping a variable to y and also using stat="bin".' I sense it's likely related to my issue, but I'm not able to find or derive a concrete solution.
Also, as secondary points, if anyone has any advice past the current snag, my end goal would be to color the negative bars red and the positive ones green, to add the 'spdrNames' to the y axis, to label the bars with the actual value, and to remove the illegible values from the x axis.
require('ggplot2')
require('reshape')
require('tseries')
spdrTickers = c('XLY','XLP','XLE','XLF','XLV','XLI','XLB','XLK','XLU')
spdrNames = c('Consumer Discretionary','Consumer Staples', 'Energy',
'Financials','Health Care','Industrials','Materials','Technology',
'Utilities')
latestDate =Sys.Date()
dailyPrices = lapply(spdrTickers, function(ticker) get.hist.quote(instrument= ticker, start = "2012-01-01",
end = latestDate, quote="Close", provider = "yahoo", origin="1970-01-01", compression = "d", retclass="zoo"))
perf5Day = lapply(dailyPrices, function(x){(x-lag(x,k=-5))/lag(x,k=-5)})
perf20Day = lapply(dailyPrices, function(x){(x-lag(x,k=-20))/lag(x,k=-20)})
perf60Day = lapply(dailyPrices, function(x){(x-lag(x,k=-60))/lag(x,k=-60)})
names(perf5Day) = spdrTickers
names(perf20Day) = spdrTickers
names(perf60Day) = spdrTickers
perfsMerged = lapply(spdrTickers, function(spdr){merge(perf5Day[[spdr]],perf20Day[[spdr]],perf60Day[[spdr]])})
perfNames = c('1Week','1Month','3Month')
perfsMerged = lapply(perfsMerged, function(x){
names(x)=perfNames
return(x)
})
latestDataPoints = t(sapply(perfsMerged, function(x){return(x[nrow(x)])}))
latestDataPoints = data.frame(cbind(spdrTickers,latestDataPoints))
names(latestDataPoints) = c('Ticker', '1Week','1Month','3Month')
drm = melt(latestDataPoints, id.vars=c('Ticker'))
names(drm) = c('Ticker','Period','Value')
p = ggplot(drm, aes(x=Ticker,y=Value)) + geom_bar() + coord_flip() + facet_grid(. ~ Period)
Yields this:
Somehow you have converted your Value-values to a factor:
str(drm)
'data.frame': 27 obs. of 3 variables:
$ Ticker: Factor w/ 9 levels "XLB","XLE","XLF",..: 9 6 2 3 8 4 1 5 7 9 ...
$ Period: Factor w/ 3 levels "1Week","1Month",..: 1 1 1 1 1 1 1 1 1 2 ...
$ Value : Factor w/ 27 levels "0.0164396430248944",..: 2 4 5 1 8 3 7 6 9 11 ...
Probably happens here:
latestDataPoints = data.frame(cbind(spdrTickers,latestDataPoints))
> str( latestDataPoints )
'data.frame': 9 obs. of 4 variables:
$ Ticker: Factor w/ 9 levels "XLB","XLE","XLF",..: 9 6 2 3 8 4 1 5 7
$ 1Week : Factor w/ 9 levels "0.0164396430248944",..: 2 4 5 1 8 3 7 6 9
$ 1Month: Factor w/ 9 levels "-0.00139291932675571",..: 2 3 1 5 8 4 6 7 9
$ 3Month: Factor w/ 9 levels "-0.0110357512357742",..: 3 2 1 5 9 6 7 8 4
Since just before that step you had a numeric matrix from: t(sapply(perfsMerged, function(x){return(x[nrow(x)])}))
Then doing this:
latestDataPoints[2:4] <- lapply( latestDataPoints[2:4], function(x)
as.numeric(as.character(x)) )
drm = melt(latestDataPoints, id.vars=c('Ticker'))
names(drm) = c('Ticker','Period','Value')
# stat="identity" plots the values themselves rather than counts
p = ggplot(drm, aes(x=Ticker,y=Value)) + geom_bar(stat="identity") + coord_flip() +
facet_grid(. ~ Period)
png();print(p);dev.off()
Produces:
The construction data.frame(cbind(...)) is a real trap. I've seen it used by supposedly authoritative sources, and it is a recurrent source of puzzlement. I think R would be a safer language to use if the interpreter would simply highlight that combination in red (along with as.numeric applied to factors). When you cbind a character vector to a numeric matrix, you get an all-character matrix.
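A small sketch of the trap and one way around it, using made-up numbers and only base R. The names tickers, perf, bad and good are illustrative and not taken from the question.

```r
tickers <- c("XLY", "XLP", "XLE")
perf <- matrix(
  c(0.016, -0.004, 0.021,
    -0.001, 0.032, -0.015),
  nrow = 3,
  dimnames = list(NULL, c("1Week", "1Month"))
)

# cbind() must return a single-type matrix, so the numbers become strings
bad <- data.frame(cbind(tickers, perf))
print(sapply(bad, class))

# Passing the pieces to data.frame() directly keeps each column's own type
good <- data.frame(Ticker = tickers, perf, check.names = FALSE)
print(sapply(good, class))
```

Building the data frame this way avoids the later as.numeric(as.character(x)) repair, because the numeric columns never stop being numeric.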

Mapping factors to colors in heatmap.2

Hello I am making a heatmap in R using heatmap.2. I would like to use the RowSideColors option. However, I can't figure out how to easily create the vector of colors for the rows.
The rows represent bacteria, and I have a dataframe with information on the bacteria that I would like to use for coloring. I will either use Bact_Phylo_Info$Phylum or k to create the color vector.
> str(Bact_Phylo_Info)
'data.frame': 33 obs. of 3 variables:
$ Phylum: Factor w/ 7 levels "Actinobacteria",..: 4 3 3 2 2 2 2 2 5 5 ...
$ Order : Factor w/ 10 levels "Bacteroidales",..: NA 4 4 1 1 1 1 1 NA 5 ...
$ Family: Factor w/ 13 levels "Anaplasmataceae",..: NA 4 4 NA 11 2 11 12 NA NA ...
I have tried a few things, for example the crazy loop below, but I think there must be an easy way that I am missing. Any help appreciated.
BactPhyColors <- sapply(Bact_Phylo_Info$Phylum,
if (Bact_Phylo_Info$Phylum == levels(Bact_Phylo_Info$Phylum)[i]),
rainbow(length(levels(Bact_Phylo_Info$Phylum)))[i]
}
)
If I understood correctly, you have a variable coded as a factor and you want to turn it into colors?
Are you creating some kind of discrete color scale?
Here I am using brewer_pal from the scales package to create a Brewer palette, then merging it with the factor variable.
library(scales)
dat <- data.frame(Phylum = gl(7,2))
n <- nlevels(dat$Phylum)
dat.col <- data.frame(Phylum =unique(dat$Phylum),
BactPhyColors =brewer_pal()(n)) ## you can also use rainbow(n)
merge(dat,dat.col)
Phylum BactPhyColors
1 1 #EFF3FF
2 1 #EFF3FF
3 2 #C6DBEF
4 2 #C6DBEF
5 3 #9ECAE1
6 3 #9ECAE1
7 4 #6BAED6
8 4 #6BAED6
9 5 #4292C6
10 5 #4292C6
11 6 #2171B5
12 6 #2171B5
13 7 #084594
14 7 #084594
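One caveat with the merge approach is that merge() sorts its result by the key column by default, so the rows may no longer line up with the rows of the heatmap matrix. A minimal base R alternative, sketched with made-up phylum labels, is to index a named palette by the factor, which keeps the original row order:

```r
# Made-up phylum labels, one per heatmap row
phylum <- factor(c("Firmicutes", "Bacteroidetes", "Firmicutes",
                   "Actinobacteria", "Bacteroidetes"))

# One colour per level, named by level
palette_by_level <- setNames(rainbow(nlevels(phylum)), levels(phylum))

# Look up each row's colour by its level name; row order is preserved
row_colors <- unname(palette_by_level[as.character(phylum)])
print(row_colors)
```

The resulting character vector has one colour per row, which is the shape the RowSideColors argument of heatmap.2 expects; rows with a missing phylum would get NA and need a fallback colour.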
