How to match two columns with nearest time points? - r

I have the following data frame. It is a time series in which each observation has values for days 1-4. An additional column shows the time, in hours, at which the test was made.
dt
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24
I have to make a time series plot in which each line represents a subject.
First I made a plot of values against days, with subjects as colors.
This gave me one line per subject, which I am happy with.
However, I also have to incorporate when the test was taken into the line plot. I could show it separately at the top or bottom of the plot, but I cannot work out how to place it exactly on the line.
Could someone please help me?
Thanks in advance!

Use the directlabels package to add the times:
library(ggplot2)
library(directlabels)
ggplot(DF, aes(Days, values, color = Name)) +
geom_line() +
geom_dl(aes(label = Test), method = "last.points")
Note
The input DF in reproducible form is:
Lines <- "
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24"
DF <- read.table(text = Lines, header = TRUE)
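If the goal is to mark the test time on each subject's line rather than at its end, one possible approach is to match each Test value to the nearest Days value per subject. The sketch below assumes Test is measured in hours from day 0, so that Test / 24 is on the same scale as Days. That conversion, and the names test_day and nearest, are assumptions and not something stated in the question.

```r
Lines <- "
Name values Days Test
a 0.2 1 20
a 0.3 2 20
a 0.6 3 20
a 0.2 4 20
b 0.3 1 44
b 0.4 2 44
b 0.8 3 44
b 0.7 4 44
c 0.2 1 24
c 0.7 2 24"
DF <- read.table(text = Lines, header = TRUE)

# Assumption: Test is hours since day 0, so Test / 24 is comparable to Days
DF$test_day <- DF$Test / 24

# For each subject, keep the observed row whose day is closest to the test time
nearest <- do.call(rbind, lapply(split(DF, DF$Name), function(g) {
  g[which.min(abs(g$Days - g$test_day)), ]
}))

# Mark and label those points on top of the line plot (only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(DF, aes(Days, values, color = Name)) +
    geom_line() +
    geom_point(data = nearest, size = 3) +
    geom_text(data = nearest, aes(label = paste0(Test, "h")),
              vjust = -1, show.legend = FALSE)
  print(p)
}
```

With the sample data this picks day 1 for subjects a and c and day 2 for subject b. If Test is measured from a different origin, only the test_day line needs to change.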

Related

Create data frame from values in every two continuous rows from an existing data frame

I have data frame z1:
z1 <- data.frame(time=as.factor(rep(0.5:9.5,times=rep(c(9,10,8,11,12),2))),
roi= rep(c(1:9,1:10,1:8,1:11,1:12),2), area=runif(100, 5.0, 7.5))
I want to create a new data frame z2 that has 10*nrow(z1) rows, built as follows:
at each time value, every pair of consecutive rows (z1$roi[i:(i+1)] and z1$area[i:(i+1)]) for i in 1:(nrow(z1) - 1) is used to make the columns roi and area in z2, like
z2$roi <- seq(z1$roi[i],z1$roi[i+1], length.out = 10)
z2$area <- seq(z1$area[i],z1$area[i+1], length.out = 10)
If the data frame z1 looks like:
time roi area
1 0.5 1 6.181150 #=z1$roi[1]
2 0.5 2 5.469366 #=z1$roi[2]
3 0.5 3 6.742525
.
.
.
98 9.5 10 6.063234
99 9.5 11 6.824393 #=z1$roi[99]
100 9.5 12 7.346298 #=z1$roi[100]
the data frame z2 would be:
time roi area
1 0.5 1.000000 6.181150 #=z1$roi[1]
2 0.5 1.111111 6.102063
.
.
.
9 0.5 1.888889 5.548453
10 0.5 2.000000 5.469366 #=z1$roi[2]
.
.
.
991 9.5 11.00000 6.824393 #=z1$roi[99]
992 9.5 11.11111 6.882383
.
.
.
999 9.5 11.88889 7.288309
1000 9.5 12.00000 7.346298 #=z1$roi[100]
Can anyone help me? Thank you!
With the tidyverse, shortening the data and using 5 points per pair so the output is easier to read (replace each 5 by 10 for the real case):
z1 <- head(z1,3)
library(tidyverse)
z1 %>%
mutate_at(vars(roi,area),~map2(.,c(.[-1],last(.)),~seq(.x,.y,length.out=5))) %>%
unnest(cols = c(roi, area)) %>%
head(-5)
# time roi area
# 1 0.5 1.00 6.302351
# 2 0.5 1.25 6.151644
# 3 0.5 1.50 6.000938
# 4 0.5 1.75 5.850231
# 5 0.5 2.00 5.699525
# 6 0.5 2.00 5.699525
# 7 0.5 2.25 5.687045
# 8 0.5 2.50 5.674566
# 9 0.5 2.75 5.662087
# 10 0.5 3.00 5.649608
We apply the same transformation to the columns roi and area, so we use mutate_at on those.
We want to turn them into list columns containing vectors, so that we can unnest afterwards and get a long data.frame (you may need to get acquainted with tidyr::unnest to understand this step; basically it makes a 'regular' data.frame out of a data.frame that has vectors, lists, or nested data.frames as elements).
The map family returns such a list output, but each value depends on the current AND the next value, so we use purrr::map2 to get both inputs.
. is the current value, c(.[-1],last(.)) is the next value (for the last element there is no next value, so we repeat the last value).
We unnest to create a long data.frame.
The repeated last value creates duplicated rows at the end, so we remove them with head(-n), where n is the length.out used in seq.
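The current/next pairing described above can be sketched in base R on a plain vector. This is only an illustration of the idea, not the answer's code; the names x, nxt and pieces are made up for the example.

```r
x <- c(1, 2, 3)

# "Next" value for each element; the last element is paired with itself
nxt <- c(x[-1], x[length(x)])

# One sequence of 5 points per (current, next) pair, like map2 + seq
pieces <- Map(function(a, b) seq(a, b, length.out = 5), x, nxt)

# Flatten (the unnest step) and drop the 5 constant values from the last pair
out <- head(unlist(pieces), -5)
out
```

The result has 10 values running from 1 to 3, with the shared endpoint 2 appearing twice, which matches the roi column in the output shown above.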
You could do this as a linear interpolation problem using approx():
s1 <- seq_len(nrow(z1)-1)
s2 <- rep(s1,each=9)
out <- approx(
x = seq_along(z1$area),
y = z1$area,
xout = c(s2 + head(seq(0,1,length.out=10),-1), nrow(z1))
)$y
z1
# time roi area
#1 0.5 1 6.413124
#2 0.5 2 6.837422
#3 0.5 3 6.656612
And then just join the results back together using row indexing:
cbind(z1[c(s2,nrow(z1)),], out)
# time roi area out
#1 0.5 1 6.413124 6.413124
#1.1 0.5 1 6.413124 6.460268
#1.2 0.5 1 6.413124 6.507413
#1.3 0.5 1 6.413124 6.554557
#1.4 0.5 1 6.413124 6.601701
#1.5 0.5 1 6.413124 6.648845
#1.6 0.5 1 6.413124 6.695989
#1.7 0.5 1 6.413124 6.743134
#1.8 0.5 1 6.413124 6.790278
#2 0.5 2 6.837422 6.837422
#2.1 0.5 2 6.837422 6.817332
#2.2 0.5 2 6.837422 6.797242
#2.3 0.5 2 6.837422 6.777152
#2.4 0.5 2 6.837422 6.757062
#2.5 0.5 2 6.837422 6.736972
#2.6 0.5 2 6.837422 6.716882
#2.7 0.5 2 6.837422 6.696792
#2.8 0.5 2 6.837422 6.676702
#3 0.5 3 6.656612 6.656612
This sort of logic should scale much better than having to calculate a sequence for each row. Something of the order of 10 secs vs 1 minute for 1 million rows from a quick and dirty test.

R slopegraph geom_line color ggplot2

I am trying to create a slopegraph with ggplot and geom_line. I want the lines of a subset of the data (e.g. those higher than 0.5) to be red and those at or below 0.5 to be another color. Here's my code:
library(ggplot2)
library(reshape2)
mydata <- read.csv("testset.csv")
mydatam = melt(mydata)
line plot:
ggplot(mydatam, aes(factor(variable), value, group = Gene, label = Gene)) +
geom_line(col='red')
In this case, all the lines are red. How do I draw red lines for those genes whose "Low" value is > 0.5 (there are 5 of them: aa, ac, ba, bc and bd) and black lines for the rest?
mydatam looks like this:
Gene variable value
1 aa Control 0.0
2 ab Control 0.0
3 ac Control 0.0
4 ad Control 0.0
5 ba Control 0.0
6 bb Control 0.0
7 bc Control 0.0
8 bd Control 0.0
9 aa Low 0.6
10 ab Low 0.2
11 ac Low 0.8
12 ad Low 0.1
13 ba Low 0.7
14 bb Low 0.3
15 bc Low 0.8
16 bd Low 1.2
17 aa High -0.6
18 ab High 1.6
19 ac High 2.1
20 ad High 0.7
21 ba High -1.2
22 bb High -0.7
23 bc High -0.8
24 bd High 0.6
You'll probably want to create a new variable in the data for this. Here's one way:
## Load dplyr package for data manipulation
library("dplyr")
## Genes where "Low" value is >0.5
genes <- mydatam[mydatam$variable == "Low" & mydatam$value > 0.5, "Gene"]
## Add new column
newdat <- mutate(mydatam, newval = ifelse(Gene %in% genes, ">0.5", "<=0.5"))
Now we can create the plot using newval to set the color.
## Color lines based on `newval` column
ggplot(newdat, aes(factor(variable), value, group = Gene, label = Gene)) +
geom_line(aes(color = newval)) +
scale_color_manual(values = c("<=0.5" = "#000000", ">0.5" = "#FF0000"))
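Because testset.csv is not available, here is a self-contained sketch of the same idea that rebuilds mydatam from the table printed in the question and uses only base R for the flagging step. The column name highlight and the variable high_low_genes are assumptions made for the example.

```r
# Rebuild the melted data shown in the question
mydatam <- data.frame(
  Gene = rep(c("aa", "ab", "ac", "ad", "ba", "bb", "bc", "bd"), times = 3),
  variable = factor(rep(c("Control", "Low", "High"), each = 8),
                    levels = c("Control", "Low", "High")),
  value = c(rep(0, 8),
            0.6, 0.2, 0.8, 0.1, 0.7, 0.3, 0.8, 1.2,
            -0.6, 1.6, 2.1, 0.7, -1.2, -0.7, -0.8, 0.6)
)

# Genes whose "Low" value exceeds 0.5
low <- mydatam[mydatam$variable == "Low", ]
high_low_genes <- low$Gene[low$value > 0.5]

# Flag every row (all three conditions) belonging to one of those genes
mydatam$highlight <- mydatam$Gene %in% high_low_genes

# Draw the slopegraph, red for flagged genes (only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mydatam, aes(variable, value, group = Gene, color = highlight)) +
    geom_line() +
    scale_color_manual(values = c("FALSE" = "black", "TRUE" = "red"))
  print(p)
}
```

Naming the entries of values ties each color to a specific level, so the result does not depend on how the levels happen to be sorted.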

R - conditional cumsum using multiple columns

I'm new to Stack Overflow, so I hope I am posting my question in the right format. I have a test dataset with three columns, where rank is the ranking of a cell, Esvalue is the value of a cell and zoneID is an area identifier. (Note: in the real dataset I have up to 40.000 zoneIDs.)
rank <- seq(0.1, 1, 0.1)
Esvalue <- seq(10, 1)
zoneID <- rep(seq.int(1, 2), times = 5)
df <- data.frame(rank, Esvalue, zoneID)
rank Esvalue zoneID
0.1 10 1
0.2 9 2
0.3 8 1
0.4 7 2
0.5 6 1
0.6 5 2
0.7 4 1
0.8 3 2
0.9 2 1
1.0 1 2
I want to calculate the following:
% ES value <- For each rank, including all lower ranks, the cumulative % share of the total ES value relative to the ES value of all zones
cumsum(df$Esvalue)/sum(df$Esvalue)
% ES value zone <- For each rank, including all lower ranks, the cumulative % share of the total Esvalue relative to the ESvalue of a zoneID for each zone. I tried this now using mutate and using dplyr. Both so far only give me the cumulative sum, not the share. In the end this will generate a variable for each zoneID
library(dplyr)
df %>%
mutate(cA=cumsum(ifelse(!is.na(zoneID) & zoneID==1,Esvalue,0))) %>%
mutate(cB=cumsum(ifelse(!is.na(zoneID) & zoneID==2,Esvalue,0)))
These two variables I want to combine by
1) calculating the abs difference between the two for all the zoneIDs
2) for each rank calculate the mean of the absolute difference over all zoneIDs
In the end the final output should look like:
rank Esvalue zoneID mean_abs_diff
0.1 10 1 0.16666667
0.2 9 2 0.01333333
0.3 8 1 0.12000000
0.4 7 2 0.02000000
0.5 6 1 0.08000000
0.6 5 2 0.02000000
0.7 4 1 0.04666667
0.8 3 2 0.01333333
0.9 2 1 0.02000000
1.0 1 2 0.00000000
Now I created the last using some intermediate steps in Excel but my final dataset will be way too big to be handled by Excel. Any advice on how to proceed would be appreciated
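One possible way to reproduce the expected output above in base R is sketched below. It assumes the rows are already sorted by rank and that zoneID has no missing values; the names overall and zone_share are made up for the example.

```r
rank    <- seq(0.1, 1, 0.1)
Esvalue <- seq(10, 1)
zoneID  <- rep(seq.int(1, 2), times = 5)
df <- data.frame(rank, Esvalue, zoneID)

# Cumulative share of the total Esvalue over all zones, per rank
overall <- cumsum(df$Esvalue) / sum(df$Esvalue)

# One column per zone: cumulative share of that zone's own total, per rank
zone_share <- sapply(sort(unique(df$zoneID)), function(z) {
  v <- ifelse(df$zoneID == z, df$Esvalue, 0)
  cumsum(v) / sum(v)
})

# Absolute difference to the overall share, averaged over zones for each rank
df$mean_abs_diff <- rowMeans(abs(zone_share - overall))
df
```

This builds a matrix with one row per rank and one column per zone, so with tens of thousands of zones and many rows it may need to be processed in chunks of zones instead of all at once.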

Grouping consecutive integers in r and performing analysis on groups

I have a data frame, with which I would like to group the intervals based on whether the integer values are consecutive or not and then find the difference between the maximum and minimum value of each group.
Example of data:
x Integers
0.1 14
0.05 15
2.7 17
0.07 19
3.4 20
0.05 21
So Group 1 would consist of 14 and 15 and Group 2 would consist of 19,20 and 21.
The difference of each group then being 1 and 2, respectively.
I have tried the following, to first group the consecutive values, with no luck.
Breaks <- c(0, which(diff(Data$Integer) != 1), length(Data$Integer))
sapply(seq(length(Breaks) - 1),
function(i) Data$Integer[(Breaks[i] + 1):Breaks[i+1]])
Here's a solution using by():
df <- data.frame(x = c(0.1, 0.05, 2.7, 0.07, 3.4, 0.05),
Integers = c(14, 15, 17, 19, 20, 21))
## a new group starts wherever the step between neighbours is not 1
grp <- cumsum(c(0, diff(df$Integers) != 1))
do.call(rbind, by(df, grp, function(g) data.frame(
imin = min(g$Integers),
imax = max(g$Integers),
irange = diff(range(g$Integers)),
xmin = min(g$x),
xmax = max(g$x),
xrange = diff(range(g$x)))))
## imin imax irange xmin xmax xrange
## 0 14 15 1 0.05 0.1 0.05
## 1 17 17 0 2.70 2.7 0.00
## 2 19 21 2 0.05 3.4 3.35
I wasn't sure what data you wanted in the output, so I just included everything you might want.
You can filter out the middle group with subset(...,irange!=0).
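If only the max-minus-min difference per run of consecutive integers is needed, the same grouping idea can be sketched more compactly with tapply. The names grp and ranges are made up for the example.

```r
df <- data.frame(x = c(0.1, 0.05, 2.7, 0.07, 3.4, 0.05),
                 Integers = c(14, 15, 17, 19, 20, 21))

# Start a new group whenever the step from the previous value is not 1
grp <- cumsum(c(TRUE, diff(df$Integers) != 1))

# Difference between the largest and smallest integer in each group
ranges <- tapply(df$Integers, grp, function(v) max(v) - min(v))

# Drop single-value groups such as the lone 17
ranges[ranges != 0]
```

For the sample data the groups are 14-15, 17 and 19-21, giving differences of 1, 0 and 2 before the single-value group is dropped.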

how to make box plot with R with interrupted Y axis

This is a sample of my data. It's a tab delimited file with a header.
X1 X2 X3 X4
1.3 0.5 0.1 1
NA 0.3 0.4 3
NA 0.2 0.3 0.3
NA 0.1 3 0.2
NA 27 5 56
NA NA 10 0.01
I would like to get a boxplot from this data. The problem is that I want to interrupt the plot at 10 and 50 on the Y-axis. I want a bigger plot size before 10 and a smaller plot size after that. I don't know how to plot with 2 gaps in the Y-axis. I tried axis.break and gap.boxplot, but my programming skills in R are very limited, so I was unable to use either of these properly. I'd be grateful for any hints on how to accomplish this.
I'm not really clear on what you want, and what you mean by "bigger plot size before 10 and a smaller plot size after that". Do you mean different scales? That is a bad idea, I think, and I don't believe it would be straightforward.
Here's how to break the axis twice (I'm guessing on the regions to exclude):
library(plotrix)
library(reshape2)
a <- read.table(textConnection("X1 X2 X3 X4
1.3 0.5 0.1 1
NA 0.3 0.4 3
NA 0.2 0.3 0.3
NA 0.1 3 0.2
NA 27 5 56
NA NA 10 0.01"), sep = " ", header = TRUE)
am <- melt(a) # from reshape2: stacks all columns into variable/value pairs
gap.boxplot(am$value ~ am$variable, # values are plotted against variable
gap = list(top = c(30, 50), bottom = c(10, 24)), # regions of the Y axis to exclude
axis.labels = TRUE) # should label the whole Y axis; doesn't seem to work well
