Adding labels and colour to different points on a graph using R

Happy new year to you all!
I am plotting some graphs and would like to differentiate some plotted lines and points. This is an example of my data and the graph that I am trying to get:
anim <- c(1,2,3,4,5)
var1 <- c(32,36,40,38,39)
var2 <- c(30,31,34,36,38)
surv <- c(0,1,0,1,1)
mydf <- data.frame(anim,var1,var2,surv)
mydf
anim var1 var2 surv
1 1 32 30 0
2 2 36 31 1
3 3 40 34 0
4 4 38 36 1
5 5 39 38 1
lm.pos1 <- lm(var1~var2,data=mydf)
plot(mydf$var2,mydf$var1,xlab="ave.ear",ylab="rtemp",xlim=c(25,45),ylim=c(25,45))
abline(lm.pos1)
abline(h=37.6,v=0,col="gray10",lty=20)
abline(h=34,v=0,col="gray10",lty=20)
First, I would like to insert the label "37.6°C" on the top horizontal and continuous line and "34.0°C" on the bottom horizontal and broken line.
Second, I would like to colour those individuals (circles) as red if surv=0 (died) or green if surv=1.
Any help would be very much appreciated!
Baz

plot(mydf$var2, mydf$var1, xlab="ave.ear", ylab="rtemp",
xlim=c(25,45), ylim=c(25,45),
col=c('red', 'green')[mydf$surv+1]) # surv=0 -> red (died), surv=1 -> green
abline(lm.pos1)
abline(h=37.6,col="gray10",lty=1) # continuous line
text(25,38.1,expression("37.6"*degree*C),col='gray10',pos=4)
abline(h=34,col="gray10",lty=2) # broken line
text(25,34.5,expression("34.0"*degree*C),col='gray10',pos=4)
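If looking colours up by index feels opaque, the mapping can be spelled out by name. The following is only a sketch, assuming (as in the question) that surv = 0 means died (red) and surv = 1 means survived (green). The names surv_f, status_cols and pt_col, and the legend, are illustrative additions rather than part of the original answer.

```r
mydf <- data.frame(anim = 1:5,
                   var1 = c(32, 36, 40, 38, 39),
                   var2 = c(30, 31, 34, 36, 38),
                   surv = c(0, 1, 0, 1, 1))

# Label the survival codes, then look each colour up by label
surv_f <- factor(mydf$surv, levels = c(0, 1), labels = c("died", "survived"))
status_cols <- c(died = "red", survived = "green")
pt_col <- status_cols[as.character(surv_f)]

plot(mydf$var2, mydf$var1, xlab = "ave.ear", ylab = "rtemp",
     xlim = c(25, 45), ylim = c(25, 45), col = pt_col, pch = 19)
legend("bottomright", legend = names(status_cols), col = status_cols, pch = 19)
```

Naming the colours makes it harder to swap them by accident, and the same named vector feeds the legend.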

Related

R: Plot several lines in the same plot: ggplot + data tables or frames vs matrices

My general problem: I tend to struggle using ggplot, because it's very data-frame-centric but the objects I work with seem to fit matrices better than data frames. Here is an example (adapted a little).
I have a quantity x that can assume values 0:5, and a "context" that can have values 0 or 1. For each context I have 7 different frequency distributions over the quantity x. (More generally I could have more than two "contexts", more values of x, and more frequency distributions.)
I can represent these 7×2 frequency distributions as a list freqs of two matrices, say:
> freqs
$`context0`
x0 x1 x2 x3 x4 x5
sample1 20 10 10 21 37 2
sample2 34 40 6 10 1 8
sample3 52 4 1 2 17 25
sample4 16 32 25 11 5 10
sample5 28 2 10 4 21 35
sample6 22 13 35 12 13 5
sample7 9 5 43 29 4 10
$`context1`
x0 x1 x2 x3 x4 x5
sample1 15 21 14 15 14 21
sample2 27 8 6 5 29 25
sample3 13 7 5 26 48 0
sample4 33 3 18 11 13 22
sample5 12 23 40 11 2 11
sample6 5 51 2 28 5 9
sample7 3 1 21 10 63 2
or a 3D array.
Or I could use a data.table tablefreqs like this one:
> tablefreqs
context x0 x1 x2 x3 x4 x5
1: 0 20 10 10 21 37 2
2: 0 34 40 6 10 1 8
3: 0 52 4 1 2 17 25
4: 0 16 32 25 11 5 10
5: 0 28 2 10 4 21 35
6: 0 22 13 35 12 13 5
7: 0 9 5 43 29 4 10
8: 1 15 21 14 15 14 21
9: 1 27 8 6 5 29 25
10: 1 13 7 5 26 48 0
11: 1 33 3 18 11 13 22
12: 1 12 23 40 11 2 11
13: 1 5 51 2 28 5 9
14: 1 3 1 21 10 63 2
Now I'd like to draw the following line plot (there's a reason why I need line plots and not, say, histograms or bar plots):
The 7 frequency distributions for context 0, with x as x-axis and the frequency as y-axis, all in the same line plot (with some alpha).
The 7 frequency distributions for context 1, again with x as x-axis and the frequency as y-axis, all in the same line plot (with alpha), but displayed upside-down below the plot for context 0.
Ggplot would surely do this very nicely, but it seems to require some acrobatics with data tables:
– If I use the data table tablefreqs it's not clear to me how to plot all its rows having context==0 in the same plot: ggplot seems to only think column-wise, not row-wise. I could use the six values of x as table rows, but then the "context" values would also end up in a row, and I'm not sure I can subset a data table by values in a row, rather than in a column.
– If I use the matrix freqs, I could create a mini-data-table having x as one column and one frequency distribution as another column, input that into ggplot+geom_line, then go over all 7 frequency distributions in a for-loop maybe. Not clear to me how to tell ggplot to keep the previous plots in this case. Then another for-loop over the two "contexts".
I'd be grateful for suggestions on how to approach this problem in particular, and more generally on what objects to choose for storing this kind of data: matrices? data tables, maybe with a different structure than shown here? some other formats?
I would suggest familiarizing yourself with the concept known as Tidy Data: a set of principles for data handling and storage that is adopted by ggplot2 and a number of other packages.
You are free to use a matrix or list of matrices to store your data; however, you can certainly store the data as you describe it (and as I understand it) in a data frame or single table using the following column convention:
context | sample | x | freq
I'll show you how I would convert the tablefreqs dataset you shared with us into that format, then how I would go about creating a plot as you are describing it in your question. I'm assuming in this case you only have the two values for context, although you allude to there being more. I'm going to try to interpret correctly what you stated in your question.
Create the Tidy Data frame
Your data frame as shown contains columns x0 through x5, so the values for x are spread across more than one column, when you really need them converted to the format shown above. This is called "gathering" your data, and you can do that with tidyr::gather().
First, I also need to replicate the naming of your samples according to the matrix dataset, so I'll do that and gather your data:
library(dplyr)
library(tidyr)
library(ggplot2)
# create the sample names
tablefreqs$sample <- rep(paste0('sample',1:7), 2)
# gather the columns together
df <- tablefreqs %>%
gather(key='x', value='freq', -c(context, sample))
Note that in the gather() function, we have to specify that the two columns df$context and df$sample are left alone, as they are not part of the gathering effort. But now we are left with df$x containing character values. We can't plot that, because we want them to be numbers (at least... I'm assuming you do). For that, we'll convert using:
df$x <- as.numeric(gsub("[^[:digit:].]", "", df$x))
That extracts the number from each value in df$x and represents it as a number, not a character. We have the opposite issue with df$context, which is actually a discrete factor, and we should represent it as such in order to make plotting a bit easier:
df$context <- factor(df$context)
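If the data start out as the list of matrices freqs rather than as tablefreqs, the same long layout can be reached without tidyr. This is a rough base-R sketch, assuming the matrices carry the sample and x0..x5 dimnames shown in the question. Only the first two samples of each context are typed in here, and to_long and long_freqs are hypothetical names.

```r
# Stand-in for the question's `freqs`, truncated to two samples per context
x_names <- paste0("x", 0:5)
s_names <- c("sample1", "sample2")
freqs <- list(
  context0 = matrix(c(20, 10, 10, 21, 37, 2,
                      34, 40,  6, 10,  1, 8),
                    nrow = 2, byrow = TRUE, dimnames = list(s_names, x_names)),
  context1 = matrix(c(15, 21, 14, 15, 14, 21,
                      27,  8,  6,  5, 29, 25),
                    nrow = 2, byrow = TRUE, dimnames = list(s_names, x_names))
)

# Turn one matrix into long format and tag it with its context
to_long <- function(m, context_name) {
  long <- as.data.frame(as.table(m), stringsAsFactors = FALSE)
  names(long) <- c("sample", "x", "freq")
  long$x <- as.numeric(sub("^x", "", long$x))          # "x3" -> 3
  long$context <- sub("^context", "", context_name)    # "context1" -> "1"
  long[, c("context", "sample", "x", "freq")]
}

long_freqs <- do.call(rbind, Map(to_long, freqs, names(freqs)))
long_freqs$context <- factor(long_freqs$context)
rownames(long_freqs) <- NULL
head(long_freqs)
```

The result has the context | sample | x | freq columns described above, so it could be passed to the plotting code in place of df.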
Create the Plot
Now we're ready to create the plot. From your description, I may not have this perfectly right, but it seems that you want a plot containing both context = 1 and context = 0, and when context = 1 the data should be "upside down". By that, I'm assuming you are talking about plotting df$freq when df$context == 0 and -df$freq when df$context == 1. We could do that using some fancy logic in the ggplot() call, but I find it's easier just to create a new column in your dataset to represent what we will be plotting on the y axis. We'll call this column df$freq_adj and use that for plotting:
df$freq_adj <- ifelse(df$context==1, -df$freq, df$freq)
Then we create the plot. I'll explain a bit below the result:
ggplot(df, aes(x=x, y=freq_adj)) +
geom_line(
aes(color=context, linetype=sample)
) +
geom_hline(yintercept=0, color='gray50') +
scale_x_continuous(expand=expansion(mult=0)) +
theme_bw()
Without some clearer description or picture of what you were looking to do, I took some liberties here. I used color to discriminate between the two values for context, and I'm using linetype to discriminate the different samples. I also added a line at 0, since it seemed appropriate to do so here, and the scale_x_continuous() command is removing the extra white space that is put in place at the extreme ends of the data.
An alternative that is maybe closer to your description would be to physically have a separation between the two plots, and represent context = 1 as a physically separate plot compared to context = 0, with one over top of the other.
Here's the code and plot:
ggplot(df, aes(x=x, y=freq_adj)) +
geom_line(aes(group=sample), alpha=0.3) +
facet_grid(context ~ ., scales='free_y') +
scale_x_continuous(expand=expansion(mult=0)) +
theme_bw()
There the use of aes(group=sample) is quite important, since I want all the lines for each sample to be the same (alpha setting and color), yet ggplot2 needs to know that the connections between the points should be based on "sample". This is done using the group= aesthetic. The scales='free_y' argument on facet_grid() allows the y axis scale to shrink and fit the data according to each facet.

In R, How do I create a dataframe from a text file which splits the data?

In R I am attempting to import a massive text file with the following structure: This is an example saved as example.txt:
Curve Name:
Curve A
Curve Values:
index Variable 1 Variable 2
[°C] [%]
0 30 100
1 40 95
2 50 90
Curve Color:
Blue

Curve Name:
Curve B
Curve Values:
index Variable 1 Variable 2
[°C] [%]
0 30 100
1 40 90
2 50 80
Curve Color:
Green
So far I can extract the names and colors
file.text <- readLines("example.txt")
curve.names <- trimws(file.text[which(regexpr('Curve Name:', file.text) > 0) + 1])
curve.colors <- trimws(file.text[which(regexpr('Curve Color:', file.text) > 0) + 1])
How do I create a dataframe with curve.name as a factor, and the other values as numeric in the following structure?
curve.name index variable.1 variable.2
Curve A 0 30 100
Curve A 1 40 95
Curve A 2 50 90
Curve B 0 30 100
Curve B 1 40 90
Curve B 2 50 80
Assuming that every file has exactly the format from above:
txt <- readLines("example.txt")
curve_name <- rep(trimws(txt[c(2,13)]), each=3)
curve_color <- rep(trimws(txt[c(10,21)]), each=3)
val <- read.table(text=paste(txt[c(6:8, 17:19)], collapse = "\n"))
colnames(val) <- c("index", "var1", "var2")
cbind(curve_name, curve_color, val)
If the format is not exactly as above, you can work out the line indices from the header lines, for example by finding where it says Curve Values:.
The code above gives:
curve_name curve_color index var1 var2
1 Curve A Blue 0 30 100
2 Curve A Blue 1 40 95
3 Curve A Blue 2 50 90
4 Curve B Green 0 30 100
5 Curve B Green 1 40 90
6 Curve B Green 2 50 80
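The header-based lookup mentioned above could be sketched as follows. This is an illustration only: it works on an in-memory copy of the example (typed in as txt, with a blank line assumed between the two records) and assumes each table is preceded by exactly one header line and one units line. The names first_row, last_row, blocks and result are hypothetical.

```r
# In-memory copy of the example file; use readLines("example.txt") for a real file
txt <- c(
  "Curve Name:", "  Curve A", "Curve Values:",
  "index Variable 1 Variable 2", "[\u00B0C] [%]",
  "0 30 100", "1 40 95", "2 50 90",
  "Curve Color:", "  Blue", "",
  "Curve Name:", "  Curve B", "Curve Values:",
  "index Variable 1 Variable 2", "[\u00B0C] [%]",
  "0 30 100", "1 40 90", "2 50 80",
  "Curve Color:", "  Green"
)

# Locate each numeric table from the surrounding header lines
first_row  <- grep("Curve Values:", txt, fixed = TRUE) + 3  # skip header + units
last_row   <- grep("Curve Color:", txt, fixed = TRUE) - 1
curve_name <- trimws(txt[grep("Curve Name:", txt, fixed = TRUE) + 1])

# Read every table and attach its curve name
blocks <- Map(function(from, to, name) {
  val <- read.table(text = txt[from:to],
                    col.names = c("index", "variable.1", "variable.2"))
  cbind(curve.name = name, val)
}, first_row, last_row, curve_name)

result <- do.call(rbind, blocks)
rownames(result) <- NULL
result$curve.name <- factor(result$curve.name)
result
```

Because the row ranges are computed per record, the tables may have different numbers of rows.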
Slightly different approach assuming predictable formatting. We get each "record", extract salient components and bind them all together.
library(purrr)
library(stringi)
lines <- readLines("example.txt") # read the raw file, one element per line
starts <- which(grepl("Curve Name:", lines)) # find the start of each record
ends <- which(grepl("Curve Color:", lines))+1 # find the end of each record
map2_df(starts, ends, function(start, end) {
rec <- paste0(lines[start:(end)], collapse="\n") # extract the record
# regex extract each set of values
stri_match_first_regex(rec, c("Curve Name:[[:space:]]+([[:alnum:][:blank:]]+)",
"Curve Values:[[:space:]]+([[:print:][:space:]]+)Curve",
"Curve Color:[[:space:]]+([[:alnum:][:blank:]]+)"))[,2] %>%
trimws() -> found
df <- read.table(text=found[2], skip=2, col.names=c("index", "variable.1", "variable.2"))
df$curve.name <- found[1]
df$color <- found[3]
df
})
## index variable.1 variable.2 curve.name color
## 1 0 30 100 Curve A Blue
## 2 1 40 95 Curve A Blue
## 3 2 50 90 Curve A Blue
## 4 0 30 100 Curve B Green
## 5 1 40 90 Curve B Green
## 6 2 50 80 Curve B Green
Read the lines into L, removing any spaces before Curve Color. (Removing spaces may not be necessary if there are no spaces before Curve Color in the actual file, but in the question there is a space before Curve Color.) Then re-read the lines that start with a digit to create the variables data.frame. Then read the rest using read.dcf and put the two together using cbind.
We have assumed that:
– Curve Values comes second, so we can omit it from rest using [, -2].
– Only lines in the numeric tables start with numbers (prefaced by whitespace).
– Each numeric record has 3 columns with the column names shown in the question. The rows start with an index number of 0 and subsequent rows in the same record do not also have a 0 index number. (There is no restriction on the number of rows in each numeric table and different records may have different numbers of such rows.)
No packages are used.
L <- sub("^ *Curve Color", "Curve Color", readLines("example.txt"))
variables <- read.table(text = grep("^\\d", trimws(L), value = TRUE),
col.names = c("index", "variable.1", "variable.2"))
rest <- trimws(read.dcf(textConnection(L))[, -2])
cbind(rest[cumsum(variables$index == 0), ], variables)
giving:
Curve Name Curve Color index variable.1 variable.2
1 Curve A Blue 0 30 100
2 Curve A Blue 1 40 95
3 Curve A Blue 2 50 90
4 Curve B Green 0 30 100
5 Curve B Green 1 40 90
6 Curve B Green 2 50 80
Generally a lot of grep. Finding a way to group entries, like the cumulative sum of a blank line, can be handy as well:
l <- readLines(textConnection('Curve Name:
Curve A
Curve Values:
index Variable 1 Variable 2
[°C] [%]
0 30 100
1 40 95
2 50 90
Curve Color:
Blue

Curve Name:
Curve B
Curve Values:
index Variable 1 Variable 2
[°C] [%]
0 30 100
1 40 90
2 50 80
Curve Color:
Green '))
do.call(rbind,
lapply(split(trimws(l), cumsum(l == '')), function(x){
data.frame(
curve = x[grep('Curve Name:', x) + 1],
read.table(text = paste(x[(grep('index', x) + 2):(grep('Curve Color:', x) - 1)],
collapse = '\n'),
col.names = c('index', 'variable.1', 'variable.2')))}))
## curve index variable.1 variable.2
## 0.1 Curve A 0 30 100
## 0.2 Curve A 1 40 95
## 0.3 Curve A 2 50 90
## 1.1 Curve B 0 30 100
## 1.2 Curve B 1 40 90
## 1.3 Curve B 2 50 80

how to look at specific subset of a dataset

I have a dataset that looks like:
foo bar
23 0
72 1
41 1
32 2
21 1
21 1
I want to plot a qq plot and a histogram of the distribution of foo at bar equal to 1. How would I do that?
I know plot and qqnorm for qq plot. And I know hist.
Simply subset, as the other answer suggested.
> subset(df, bar==1)
or in one line for the hist function (hist needs the numeric column, not the whole data frame)
> hist(subset(df, bar==1)$foo)
Just get all rows with bar==1. The following should work:
df1 = ddf[ddf$bar==1,]
df1
foo bar
2 72 1
3 41 1
5 21 1
6 21 1
plot(df1$foo, df1$bar)
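To get the histogram and the Q-Q plot the question actually asks for, the subset of foo can be passed straight to hist() and qqnorm(). A minimal sketch, assuming the six rows shown in the question; foo1 is an illustrative name.

```r
df <- data.frame(foo = c(23, 72, 41, 32, 21, 21),
                 bar = c(0, 1, 1, 2, 1, 1))

# Values of foo for the rows where bar equals 1
foo1 <- df$foo[df$bar == 1]

hist(foo1, main = "foo where bar == 1", xlab = "foo")

qqnorm(foo1)  # normal Q-Q plot of the subset
qqline(foo1)  # reference line through the quartiles
```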

How to make a spaghetti plot in R?

I have the following:
head(dataframe):
ID Result Days
1 70 0
1 80 23
2 90 15
2 89 30
2 99 40
3 23 24
etc.
What I am trying to do is create a spaghetti plot with the above dataset. What I use is this:
interaction.plot(dataframe$Days,dataframe$ID,dataframe$Result,xlab="Time",ylab="Results",legend=F) but none of the patient lines are continuous, even where they are supposed to be one long line.
Also I want to convert the above dataframe to something like this:
ID Result Days
1 70 0
1 80 23
2 90 0
2 89 15
2 99 25
3 23 0
etc. (I am trying to take the first (or minimum) day of each ID and have their days start from zero and go up.) Also, in the spaghetti plot I want all patients to have the same colour if a condition is met, and another colour if the condition is not met.
Thank you for your time and patience.
How about this, using ggplot2 and data.table
# libs
library(ggplot2)
library(data.table)
# your data
df <- data.table(ID=c(1,1,2,2,2,3),
Result=c(70,80,90,89,99,23),
Days=c(0,23,15,30,40,24))
# adjust each ID to start at day 0, sort
df <- merge(df, df[, list(min_day=min(Days)), by=ID], by='ID')
df[, adj_day:=Days-min_day]
df <- df[order(ID, Days)]
# plot
ggplot(df, aes(x=adj_day, y=Result, color=factor(ID))) +
geom_line() + geom_point() +
theme_bw()
Contents of updated data.frame (actually a data.table):
ID Result Days min_day adj_day
1 70 0 0 0
1 80 23 0 23
2 90 15 15 0
2 89 30 15 15
2 99 40 15 25
3 23 24 24 0
You can handle the color coding easily using scale_color_manual()
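The day adjustment and a condition-based colour can also be sketched without any packages. The condition used here (a patient's highest Result is at least 80) is purely hypothetical, since the question does not say what the condition is. In ggplot2, the same TRUE/FALSE column could be mapped to color and styled with scale_color_manual().

```r
df <- data.frame(ID     = c(1, 1, 2, 2, 2, 3),
                 Result = c(70, 80, 90, 89, 99, 23),
                 Days   = c(0, 23, 15, 30, 40, 24))

# Shift each patient's days so that their first visit is day 0
df$adj_day <- df$Days - ave(df$Days, df$ID, FUN = min)

# Hypothetical condition: the patient's highest Result is at least 80
df$met <- ave(df$Result, df$ID, FUN = max) >= 80

# Empty plot with axes wide enough for every patient, then one line per ID
plot(df$adj_day, df$Result, type = "n",
     xlab = "Days since first visit", ylab = "Result")
for (d in split(df, df$ID)) {
  d <- d[order(d$adj_day), ]
  lines(d$adj_day, d$Result, type = "b",
        col = if (d$met[1]) "darkgreen" else "red")
}
```

Drawing each ID with its own lines() call keeps every patient's line continuous, which is what interaction.plot() was not doing.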

Plot Issues - Start always in (0,0)

I am working with a huge data set where all columns look something like this:
0
10
12
30
10
0
20
30
0
40
50
10
0
The idea is to make a simple plot in R where every time it reads a 0 the plot will begin in (0,0).
Do you have any idea of how I can do this?
Thanks in advance,
J
UPDATE:
I am a new user so I can't post any images!
Here's an example of the column I want to plot:
0
10
20
12
5
6
9
0
20
24
40
14
0
20
59
50
12
0
20
23
49
45
23
12
(...)
Imagine a line plot.
Instead of plotting a long line with all the values I want to plot several shorter lines with the first line plotting (0,10,20,12,5,6,9), the second line plotting (0,20,24,40,14) etc...
I would add an additional column specifying which sub-dataset each value belongs to:
Value Group
0 1
1 1
5 1
0 2
Etc.
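You can then plot the subgroups using e.g. ggplot2 (xcoor here is a column holding the position of each value within its group):
ggplot(yourdata, aes(x = xcoor, y = Value, color = factor(Group))) +
geom_line()
This will draw the lines in different colors. Or with base plot, using something like:
split_dat = with(yourdata, split(Value, Group))
# size the axes from the full data so that every line fits
plot(split_dat[[1]], type = "l",
xlim = c(1, max(lengths(split_dat))), ylim = range(yourdata$Value))
for(i in 2:length(split_dat)) {
lines(split_dat[[i]])
}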
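The Group column does not have to be typed by hand: since every sub-series starts with a 0, a running count of the zeros labels the groups. A minimal sketch, assuming the first 17 values from the question's update and that a 0 only ever appears at the start of a series; xcoor is built so that each series starts at x = 0.

```r
yourdata <- data.frame(
  Value = c(0, 10, 20, 12, 5, 6, 9,
            0, 20, 24, 40, 14,
            0, 20, 59, 50, 12)
)

# Each 0 starts a new group, so the running count of zeros is the group id
yourdata$Group <- cumsum(yourdata$Value == 0)

# Position within each group, starting at 0 so every line begins at (0, 0)
yourdata$xcoor <- ave(yourdata$Value, yourdata$Group, FUN = seq_along) - 1

# One line per group on shared axes
plot(yourdata$xcoor, yourdata$Value, type = "n", xlab = "step", ylab = "Value")
for (g in split(yourdata, yourdata$Group)) {
  lines(g$xcoor, g$Value, col = g$Group[1])
}
```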
