R slopegraph geom_line color ggplot2 - r

I am trying to create a slopegraph with ggplot and geom_line. I want the lines for a subset of the data (e.g. those higher than 0.5) to be red and those less than 0.5 to be another color. Here's my code:
library(ggplot2)
library(reshape2)
mydata <- read.csv("testset.csv")
mydatam = melt(mydata)
line plot:
ggplot(mydatam, aes(factor(variable), value, group = Gene, label = Gene)) +
geom_line(col='red')
In this case, all the lines are red. How do I make red lines for those "Gene"s whose "Low" value is > 0.5 (there are 5 of them: aa, ac, ba, bc and bd) and black lines for the rest?
mydatam looks like this:
Gene variable value
1 aa Control 0.0
2 ab Control 0.0
3 ac Control 0.0
4 ad Control 0.0
5 ba Control 0.0
6 bb Control 0.0
7 bc Control 0.0
8 bd Control 0.0
9 aa Low 0.6
10 ab Low 0.2
11 ac Low 0.8
12 ad Low 0.1
13 ba Low 0.7
14 bb Low 0.3
15 bc Low 0.8
16 bd Low 1.2
17 aa High -0.6
18 ab High 1.6
19 ac High 2.1
20 ad High 0.7
21 ba High -1.2
22 bb High -0.7
23 bc High -0.8
24 bd High 0.6

You'll probably want to create a new variable in the data for this. Here's one way:
## Load dplyr package for data manipulation
library("dplyr")
## Genes where "Low" value is >0.5
genes <- mydatam[mydatam$variable == "Low" & mydatam$value > 0.5, "Gene"]
## Add new column
newdat <- mutate(mydatam, newval = ifelse(Gene %in% genes, ">0.5", "<=0.5"))
Now we can create the plot using newval to set the color.
## Color lines based on `newval` column
ggplot(newdat, aes(factor(variable), value, group = Gene, label = Gene)) +
geom_line(aes(color = newval)) +
scale_color_manual(values = c("<=0.5" = "#000000", ">0.5" = "#FF0000"))
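Before plotting, it can help to confirm which genes land in each color group. The following is a minimal base R sketch of the same flagging step, assuming the data shown in the question is typed in by hand (the question reads it from testset.csv, which is not available here); the intermediate name `low` is made up for illustration.

```r
# Rebuild the melted data from the question by hand (assumption: same values
# as the printed mydatam; the original comes from testset.csv).
mydatam <- data.frame(
  Gene = rep(c("aa", "ab", "ac", "ad", "ba", "bb", "bc", "bd"), times = 3),
  variable = rep(c("Control", "Low", "High"), each = 8),
  value = c(rep(0, 8),
            0.6, 0.2, 0.8, 0.1, 0.7, 0.3, 0.8, 1.2,
            -0.6, 1.6, 2.1, 0.7, -1.2, -0.7, -0.8, 0.6)
)

# Genes whose "Low" value is above 0.5
low <- mydatam[mydatam$variable == "Low", ]
genes <- as.character(low$Gene[low$value > 0.5])

# Flag every row of a selected gene, whatever its own value is
mydatam$newval <- ifelse(mydatam$Gene %in% genes, ">0.5", "<=0.5")

# Number of rows per group (each gene contributes three rows)
table(mydatam$newval)
```

The flag is set per gene rather than per row, which is what keeps each line a single color across Control, Low and High.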

Related

Color and shape coding within ggplot

I am working with a chemical dataset. I want to color-code the geom_points by the depth at which they were sampled and set the shape by when they were sampled. I also want to add a thin black border to all the geom_points in order to distinguish them.
Here is a sample table:
ID Depth(m) Sampling Date Cl Br
1 1 May 4.0 .05
2 1 June 5.0 .07
3 2 May 6.0 .03
4 2 June 7.0 .05
5 3 May 8.0 .01
6 3 June 9.0 .03
7 4 May 10.0 .00
8 4 June 11.0 .01
I am trying to use the code
graph <- df %>%
ggplot(aes(x = Cl, y = Br, fill = Depth, shape = Sampling Date), color = black) +
geom_point(shape = c(21:24, size = 4) +
labs(x = "Cl", y = "Br")
graph
But every time I do this it just fills in the shapes black, ignoring the color specification. I also need to use shapes 21:25, but every time I try to specify the number of shapes it says that it doesn't match the number of variables in my dataset.
Your code is somewhat filled with ... challenges.
Remove all spaces from the column names; that makes your life easier. Also add the shape aesthetic to geom_point and specify the shapes with a scale call.
library(ggplot2)
df <- read.table(text = "ID Depth SamplingDate Cl Br
1 1 May 4.0 .05
2 1 June 5.0 .07
3 2 May 6.0 .03
4 2 June 7.0 .05
5 3 May 8.0 .01
6 3 June 9.0 .03
7 4 May 10.0 .00
8 4 June 11.0 .01", header = TRUE)
ggplot(df, aes(x = Cl, y = Br, fill = Depth, shape = SamplingDate)) +
geom_point(aes(shape = SamplingDate), size = 4) +
scale_shape_manual(values = 21:24)
Created on 2020-07-30 by the reprex package (v0.3.0)

How to match two columns with nearest time points?

I have the following data frame. It is a time series in which each subject has values for days 1-4. An additional column shows the time, in hours, at which the test was made.
dt
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24
I have to make a time series such that each line represents the subject.
First I made a plot with days and values, with subjects as colors.
This gave me a line plot for each subject, plotted against days and values. I am happy with it.
However, I have to incorporate when the test was taken into the line plot. I could do it separately at the top or bottom of the plot, but not exactly on the line.
Could someone please help me?
Thanks in advance!
Use the directlabels package to add the times:
library(ggplot2)
library(directlabels)
ggplot(DF, aes(Days, values, color = Name)) +
geom_line() +
geom_dl(aes(label = Test), method = "last.points")
Note
The input DF in reproducible form is:
Lines <- "
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24"
DF <- read.table(text = Lines, header = TRUE)

combined barplots with R ggplot2: dodged and stacked

I have a table of data which already contain several values to be plotted on a barplot with ggplot2 package (already cumulative data).
The data in the data frame "reserves" has the form (simplified):
period,amount,a1,a2,b1,b2,h1,h2,h3,h4
J,18.1,30,60,40,60,15,50,30,5
K,29,65,35,75,25,5,50,40,5
P,13.3,94,6,85,15,10,55,20,15
N,21.6,95,5,80,20,10,55,20,15
The first column (period) is the geological epoch. It will be on x axis, and I needed to have no extra ordering on it, so I prepared appropriate factor labelling with the command
reserves$period <- factor(reserves$period, levels = reserves$period)
The column "amount" is the main column to be plotted as y axis (it is percentage of hydrocarbons in each epoch, but it could be in absolute values as well, say, millions of tons or whatever). So basic plot is invoked by the command:
ggplot(reserves,aes(x=period,y=amount)) + geom_bar(stat="identity")
But here is the question. I need to plot the other values, that is a1-a2, b1-b2 and h1-h4, on the same bar graph. These are percentage values within each letter (for example, if a1=60, then a2=40; the same holds for b1-b2, and h1-h4 also sum to 100). So I need the values a1-a2 in some color, proportionally dividing the "amount" bar for each value of x (a stacked barplot), and the same for b1-b2, so that for each period we have two adjacent columns (grouped barplots), each of them stacked. I also need a third column for the values h1-h4, perhaps also as a stacked barplot, either as a third column or as a staggered barplot above the first one.
So the layout looks like this:
I learned that I first need to reshape the data with the reshape2 package and then use position="dodge" or position="fill" in geom_bar(), but here I need a combination of the two. The third barplot (for values h1-h4) seems to need a "stacked percent" representation with fixed height.
Are there packages which handle the data for plotting in a more intuitive way? Let's say we just declare that we want the variables ai, bi, hi to be plotted.
First you should reshape your data from wide to long, then scale your proportions to their raw values. Then split your old column names (now levels of "lett") into their letters and numbers for labeling. If your real data aren't formatted like this (a1...h4), there are ways to handle that as well.
library(dplyr)
library(tidyr)
library(ggplot2)
reserves <- read.csv(text = "period,amount,a1,a2,b1,b2,h1,h2,h3,h4
J,18.1,30,60,40,60,15,50,30,5
K,29,65,35,75,25,5,50,40,5
P,13.3,94,6,85,15,10,55,20,15
N,21.6,95,5,80,20,10,55,20,15")
reserves.tidied <- reserves %>%
gather(key = lett, value = prop, -period, -amount) %>%
mutate(rawvalue = prop * amount/100,
lett1 = substr(lett, 1, 1),
num = substr(lett, 2, 2))
reserves.tidied
period amount lett prop rawvalue lett1 num
1 J 18.1 a1 30 5.430 a 1
2 K 29.0 a1 65 18.850 a 1
3 P 13.3 a1 94 12.502 a 1
4 N 21.6 a1 95 20.520 a 1
5 J 18.1 a2 60 10.860 a 2
6 K 29.0 a2 35 10.150 a 2
7 P 13.3 a2 6 0.798 a 2
8 N 21.6 a2 5 1.080 a 2
9 J 18.1 b1 40 7.240 b 1
10 K 29.0 b1 75 21.750 b 1
11 P 13.3 b1 85 11.305 b 1
12 N 21.6 b1 80 17.280 b 1
13 J 18.1 b2 60 10.860 b 2
14 K 29.0 b2 25 7.250 b 2
15 P 13.3 b2 15 1.995 b 2
16 N 21.6 b2 20 4.320 b 2
17 J 18.1 h1 15 2.715 h 1
18 K 29.0 h1 5 1.450 h 1
19 P 13.3 h1 10 1.330 h 1
20 N 21.6 h1 10 2.160 h 1
21 J 18.1 h2 50 9.050 h 2
22 K 29.0 h2 50 14.500 h 2
23 P 13.3 h2 55 7.315 h 2
24 N 21.6 h2 55 11.880 h 2
25 J 18.1 h3 30 5.430 h 3
26 K 29.0 h3 40 11.600 h 3
27 P 13.3 h3 20 2.660 h 3
28 N 21.6 h3 20 4.320 h 3
29 J 18.1 h4 5 0.905 h 4
30 K 29.0 h4 5 1.450 h 4
31 P 13.3 h4 15 1.995 h 4
32 N 21.6 h4 15 3.240 h 4
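The reshape-and-scale step above can also be sketched in base R, which makes the arithmetic (rawvalue = prop * amount / 100) easy to inspect. This is only an illustrative alternative to the gather/mutate pipeline, not a replacement for it; the names `prop_cols`, `long`, `group` and `sums` are made up here. It also checks the question's statement that each letter group sums to 100.

```r
reserves <- read.csv(text = "period,amount,a1,a2,b1,b2,h1,h2,h3,h4
J,18.1,30,60,40,60,15,50,30,5
K,29,65,35,75,25,5,50,40,5
P,13.3,94,6,85,15,10,55,20,15
N,21.6,95,5,80,20,10,55,20,15")

# Every column except period and amount holds a percentage
prop_cols <- setdiff(names(reserves), c("period", "amount"))

# Wide to long: one row per period/column pair, column by column
long <- data.frame(
  period = rep(as.character(reserves$period), times = length(prop_cols)),
  amount = rep(reserves$amount, times = length(prop_cols)),
  lett   = rep(prop_cols, each = nrow(reserves)),
  prop   = unlist(reserves[prop_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Scale each percentage to its share of the period's amount
long$rawvalue <- long$prop * long$amount / 100

# Sanity check: percentages within each letter group should sum to 100
long$group <- substr(long$lett, 1, 1)
sums <- aggregate(prop ~ period + group, data = long, FUN = sum)
sums[sums$prop != 100, ]
```

With the sample data the check flags period J, group a, where a1 + a2 is 30 + 60 = 90, so that stacked bar will stop short of the full "amount" for J.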
Then to plot your tidied data, you want the letters across the x axis, and the rawvalue we just calculated with amount*proportion on the y axis. We stack the geom_col up from 1 to 2 or 1 to 4 (the reverse=T argument overrides the default, which would have 2 or 4 at the bottom of the stack). alpha and fill let us distinguish between groups in the same bar and between bars.
Then the geom_text labels each stacked segment with the name, a newline, and the original percentage, centered on each segment. The scale reverses the default behavior again, making 1 the darkest and 2 or 4 the lightest in each bar. Then you facet across, making one group of bars for each period.
ggplot(reserves.tidied,
aes(x = lett1, y = rawvalue, alpha = num, fill = lett1)) +
geom_col(position = position_stack(reverse = T), colour = "black") +
geom_text(position = position_stack(reverse = T, vjust = .5),
aes(label = paste0(lett, ":\n", prop, "%")), alpha = 1) +
scale_alpha_discrete(range = c(1, .1)) +
facet_grid(~period) +
guides(fill = "none", alpha = "none")
Rearranging it so that the "h" bars are different from the "a" and "b" bars is a bit more complex, and you'd have to think about how you want it presented, but it's totally doable.

How to join(Merge) two SparkDataFrame in SparkR and keep one of the common columns

I have the following Spark DataFrame:
aps=data.frame(agent=c('a','b','c','d','a','a','a','b','c','a','b'),product=c('P1','P2','P3','P4','P1','P1','P2','P2','P2','P3','P3'),
sale_amount=c(1000,2000,3000,4000,1000,1000,2000,2000,2000,3000,3000))
RDD_aps=createDataFrame(sqlContext,aps)
agent product sale_amount
1 a P1 1000
2 b P2 2000
3 c P3 3000
4 d P4 4000
5 a P1 1000
6 a P1 1000
7 a P2 2000
8 b P2 2000
9 c P2 2000
10 a P3 3000
11 b P3 3000
and
percent=data.frame(agent=c('a','b','c'),percent=c(0.2 ,0.5,1.0))
agent percent
a 0.2
b 0.5
c 1.0
I need to join (merge) the two data frames so that I have a percent for each agent,
something like this as output:
agent product sale_amount percent
1 d P4 4000 NA
2 c P3 3000 1.0
3 c P2 2000 1.0
4 b P2 2000 0.5
5 b P2 2000 0.5
6 b P3 3000 0.5
7 a P1 1000 0.2
8 a P1 1000 0.2
9 a P1 1000 0.2
10 a P2 2000 0.2
11 a P3 3000 0.2
I have already tried :
joined_aps=join(RDD_aps,percent,RDD_aps$agent==percent$agent,"left_outer")
but it adds a second "agent" column from the percent data frame, and I don't want the duplicate column.
I have also tried:
merged=merge(RDD_aps,percent, by = "agent",all.x=TRUE)
This one also adds an "agent_y" column, but I want just one agent column (the agent column from RDD_aps).
I think I saw someone prevent the '_x' and '_y' variables from being generated by using join somewhere on SO, but I am unable to find that post. I personally prefer merge in my operations...I think it's easier for me, plus I like being able to switch between a left/right/inner/outer/etc join using the all.x=TRUE/FALSE and all.y=TRUE/FALSE arguments. I still get the annoying (but useful for verification purposes) _x and _y columns, but I fix those using code similar to the example below:
df1<- data.frame(person=c("Bob", "Jane", "John", "Liz"), favoriteColor=c("Blue", "Green", "Black", "White"))
df2<- data.frame(person=c("Bob", "Jane", "John", "Liz"), age=c(10,20,30,40))
sdf1<- SparkR::createDataFrame(df1)
sdf2<- SparkR::createDataFrame(df2)
sdf<- SparkR::merge(sdf1, sdf2, by.x="person", by.y="person", all.x=FALSE, all.y=FALSE) # Inner join...all.x/y not needed
colnames(sdf) # person_x and person_y are now present...to be fixed here
colnames(sdf)<- gsub(pattern= "_x$",replacement = "", colnames(sdf)) # strip only a trailing "_x"
col_names_sdf_subset<- colnames(sdf)[!grepl("_y$", colnames(sdf))] # keep everything except the "_y" duplicates
sdf<- SparkR::select(sdf, col_names_sdf_subset) # no pipe needed, so no extra package has to be loaded
colnames(sdf)
View(head(sdf, num=20L))

overlap of time series in ggplot2 keeping the x labels

How can I overlap two time series with ggplot2 and keep both X labels (one with 1970 and another with 1980)?
This is an overview of my datasets and the code I use to plot each graphic.
> dataset1.data
Date Obs
1 1/1/1970 2.0
2 1/2/1970 1.0
3 1/3/1970 0.0
4 1/4/1970 0.0
5 1/5/1970 0.5
6 1/6/1970 5.1
7 1/7/1970 0.0
8 1/8/1970 0.0
> dataset2.data
Date Obs
1 1/1/1980 3.0
2 1/2/1980 0.5
3 1/3/1980 0.5
4 1/4/1980 5.0
5 1/5/1980 0.4
6 1/6/1980 6.2
7 1/7/1980 9.0
8 1/8/1980 1.3
qplot(main="Observations 1")+xlab("Date")+ylab("Obs")+
geom_point(data = dataset1.data,aes(Date, Obs, colour="blue"),alpha = 0.7,na.rm = TRUE)+
scale_colour_identity("Legend", breaks=c("blue"), labels="1970")
qplot(main="Observations 2")+xlab("Date")+ylab("Obs")+
geom_point(data = dataset2.data,aes(Date, Obs, colour="red"),alpha = 0.7,na.rm = TRUE)+
scale_colour_identity("Legend", breaks=c("red"), labels="1980")
I would put them both in a single dataset, and then use a new Year variable for the color aesthetic:
dataset1.data = read.table('dataset1.txt')
dataset2.data = read.table('dataset2.txt')
dataset1.data$Date = as.Date(dataset1.data$Date, format='%m/%d/%Y')
dataset2.data$Date = as.Date(dataset2.data$Date, format='%m/%d/%Y')
data = rbind(dataset1.data, dataset2.data)
data = transform(data, MonthDay=gsub('(.+)-(.+-.+)', '\\2', data$Date), Year=gsub('(.+)-(.+-.+)', '\\1', data$Date))
qplot(main="Observations 1")+xlab("Date")+ylab("Obs")+geom_point(data = data,aes(MonthDay, Obs, colour=Year),alpha = 0.7,na.rm = TRUE)
It's probably also possible to do it by editing the grid objects. For example, see: https://github.com/hadley/ggplot2/wiki/Editing-raw-grid-objects-from-a-ggplot
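The date-splitting step in the answer can also be written with format() instead of the regular expressions. Below is a minimal base R sketch, assuming a few rows typed in from the question rather than read from dataset1.txt and dataset2.txt; the names `d1`, `d2` and `both` are made up for illustration.

```r
# A few rows from each dataset in the question, typed in by hand
d1 <- data.frame(Date = c("1/1/1970", "1/2/1970", "1/3/1970"),
                 Obs = c(2.0, 1.0, 0.0), stringsAsFactors = FALSE)
d2 <- data.frame(Date = c("1/1/1980", "1/2/1980", "1/3/1980"),
                 Obs = c(3.0, 0.5, 0.5), stringsAsFactors = FALSE)

# Stack the two series and parse the month/day/year strings
both <- rbind(d1, d2)
both$Date <- as.Date(both$Date, format = "%m/%d/%Y")

# Shared x position (month-day) and a per-series label (year)
both$MonthDay <- format(both$Date, "%m-%d")
both$Year <- format(both$Date, "%Y")

both
```

Because MonthDay no longer carries the year, both series share the same x positions, and Year is free to drive the color legend.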

Resources