Multiple scatterplot figure in R - r

I have a slightly complicated plotting task. I am half way there, quite sure how to get it. I have a dataset of the form below, with multiple subjects, each in either Treatgroup 0 or Treatgroup 1, each subject contributing several rows of data. Each row corresponds to a single timepoint at which there are values in columns "count1, count2, weirdname3, etc.
Task 1. I need to calculate "Days", which is just the visitdate - the startdate, for each row. Should be an apply type function, I guess.
Task 2. I have to make a multiplot figure with one scatterplot for each of the count variables (a plot for count1, one for count2, etc). In each scatterplot, I need to plot the value of the count (y axis) against "Days" (x-axis) and connect the dots for each subject. Subjects in Treatgroup 0 are one color, subjects in treatgroup 1 are another color. Each scatterplot should be labeled with count1, count2 etc as appropriate.
I am trying to use the base plotting function, and have taken the approach of writing a plotting function to call later. I think this can work but need some help with syntax.
#Enter example data
tC <- textConnection("
ID StartDate VisitDate Treatstarted count1 count2 count3 Treatgroup
C0098 13-Jan-07 12-Feb-10 NA 457 343 957 0
C0098 13-Jan-06 2-Jul-10 NA 467 345 56 0
C0098 13-Jan-06 7-Oct-10 NA 420 234 435 0
C0098 13-Jan-05 3-Feb-11 NA 357 243 345 0
C0098 14-Jan-06 8-Jun-11 NA 209 567 254 0
C0098 13-Jan-06 9-Jul-11 NA 223 235 54 0
C0098 13-Jan-06 12-Oct-11 NA 309 245 642 0
C0110 13-Jan-06 23-Jun-10 30-Oct-10 629 2436 45 1
C0110 13-Jan-07 30-Sep-10 30-Oct-10 461 467 453 1
C0110 13-Jan-06 15-Feb-11 30-Oct-10 270 365 234 1
C0110 13-Jan-06 22-Jun-11 30-Oct-10 236 245 23 1
C0151 13-Jan-08 2-Feb-10 30-Oct-10 199 653 456 1
C0151 13-Jan-06 24-Mar-10 3-Apr-10 936 25 654 1
C0151 13-Jan-06 7-Jul-10 3-Apr-10 1147 254 666 1
C0151 13-Jan-06 9-Mar-11 3-Apr-10 1192 254 777 1
")
data1 <- read.table(header=TRUE, tC)
close.connection(tC)
# format date
data1$VisitDate <- with(data1,as.Date(VisitDate,format="%d-%b-%y"))
# stuck: need to define days as VisitDate - StartDate for each row of dataframe (I know I need an apply family fxn here)
data1$Days <- [applyfunction of some kind ](VisitDate,ID,function(x){x-data1$StartDate})))
# Unsure here. Need to define plot function
plot_one <- function(d){
with(d, plot(Days, Count, t="n", tck=1, cex.main = 0.8, ylab = "", yaxt = 'n', xlab = "", xaxt="n", xlim=c(0,1000), ylim=c(0,1200))) # set limits
grid(lwd = 0.3, lty = 7)
with(d[d$Treatgroup == 0,], points(Days, Count1, col = 1))
with(d[d$Treatgroup == 1,], points(Days, Count1, col = 2))
}
#Create multiple plot figure
par(mfrow=c(2,2), oma = c(0.5,0.5,0.5,0.5), mar = c(0.5,0.5,0.5,0.5))
#trouble here. I need to call the column names somehow, with; plyr::d_ply(data1, ???, plot_one)

Task 1:
data1$days <- floor(as.numeric(as.POSIXlt(data1$VisitDate,format="%d-%b-%y")
-as.POSIXlt(data1$StartDate,format="%d-%b-%y")))
Task 2:
par(mfrow=c(3,1), oma = c(2,0.5,1,0.5), mar = c(2,0.5,1,0.5))
plot(data1$days, data1$count1, col=as.factor(data1$Treatgroup), main="count1")
plot(data1$days, data1$count2, col=as.factor(data1$Treatgroup), main="count2")
plot(data1$days, data1$count3, col=as.factor(data1$Treatgroup), main="count3")

Related

Is there a way on R to represent the alternation of these sequences, SEQ1 and SEQ2, within a single ultra-long read?

I have a dataframe on R similar to this one, only it is 2000 rows long.
Throughout the dataframe I have this alternation of SEQ1 and SEQ2 within a single read called "id read". These sequences alternate, and SEQ1 is always 1 nucleotide away from SEQ1, while SEQ2 from SEQ1 about 335 nucleotides, sometimes jumps and goes to 670.
The sequences are both in forward and in revers, as can be seen from the value of the end coordinate which is sometimes less than the start coordinate.
sequence
id read
start
end
sequencedistance
sequencelength
SEQ1
id read
90
105
1
15
SEQ2
id read
440
458
335
18
SEQ1
id read
459
474
1
15
SEQ2
id read
808
826
334
18
SEQ1
id read
827
812
1
15
SEQ2
id read
1148
1156
336
18
SEQ1
id read
1157
1172
1
15
SEQ2
id read
1850
1868
678
18
SEQ1
id read
1869
1854
1
15
SEQ2
id read
2187
2205
333
18
SEQ1
id read
2206
2221
1
15
SEQ2
id read
2887
2905
666
18
Would anyone have any ideas on how to plot this data and visually show the pattern that these sequences have within a read?
I have tried plotting with horizontal lines, lollipop, point, but none of these methods are effective in representing the amount of data I have and to visually understand the behavior of these sequences.
Would anyone have an idea of ​​how to plot the pattern? If I wanted, I could also plot only a part of the large dataframe I have, but at least I would like to understand the particularity of these sequences in the ultra-long read taken into consideration.
I'm still not exactly sure what you are looking for, but if every row i where sequence == "SEQ" has a paired row i + 1 where sequence == "SEQ2", you can calculate the relative start and ends sites and then try to visualise it.
Assuming your data is in a variable called df, you can calculate these as follows.
df <- transform(
df,
rel_start = ifelse(
as.character(sequence) == "SEQ1",
start - start,
start - c(0, head(start, -1))
),
rel_end = ifelse(
as.character(sequence) == "SEQ1",
end - start,
end - c(0, head(start, -1))
)
)
Then for visualisation, you can just use geom_segment(). You could use arrows to indicate the direction of the reads.
library(ggplot2)
ggplot(df, aes(rel_start, y = seq_along(start), colour = sequence)) +
geom_segment(aes(xend = rel_end, yend = seq_along(start)),
arrow = arrow(length = unit(2, "mm")))
Data loading:
txt <- "sequence id read start end sequencedistance sequencelength
SEQ1 id read 90 105 1 15
SEQ2 id read 440 458 335 18
SEQ1 id read 459 474 1 15
SEQ2 id read 808 826 334 18
SEQ1 id read 827 812 1 15
SEQ2 id read 1148 1156 336 18
SEQ1 id read 1157 1172 1 15
SEQ2 id read 1850 1868 678 18
SEQ1 id read 1869 1854 1 15
SEQ2 id read 2187 2205 333 18
SEQ1 id read 2206 2221 1 15
SEQ2 id read 2887 2905 666 18"
df <- read.table(text = txt, header = TRUE)

Plot observations in same x-axis point which linked with id variable

I need help. This is a view of my database :
482 940 914 1
507 824 1042 2
514 730 1450 3
477 595 913 4
My aim is to plot in the same point of x-axis each row.
Example:
in 1 (=x) i want to plot 482, 940 and 914
in 2 (=x) I want to plot 507, 824 and 1042.
So three points in vertical for each x axis points.
it's a good idea to share the data in a reproducible way - I'm using readClipboard to read in the copied vector into R. Anyway, here's a quick answer:
x <- as.numeric(unlist(strsplit(readClipboard(), " ")))
This makes it into a numeric vector. We now need to split into groups based on the description you provided. I'm using matrix to achieve this and will then convert to data.frame for plotting using ggplot2:
m <- matrix(x, ncol = 4, byrow = T)
> m
[,1] [,2] [,3] [,4]
[1,] 482 940 914 1
[2,] 507 824 1042 2
[3,] 514 730 1450 3
[4,] 477 595 913 4
df <- as.data.frame(m)
# Assign names to the data.frame
names(df) <- letters[1:4]
> df
a b c d
1 482 940 914 1
2 507 824 1042 2
3 514 730 1450 3
4 477 595 913 4
To get the plot:
library(ggplot2)
ggplot(df, aes(x = d)) +
geom_point(aes(y = a), color = "red") +
geom_point(aes(y = b), color = "green") +
geom_point(aes(y = c), color = "blue")
OUTPUT
You can play around with ggtitle and xlab etc. to change the plot labels and add legends.
Hope this is helpful!

Creating data continuously using rnorm until an outlier occurs in R

Sorry for the confusing title, but i wasn't sure how to title what i am trying to do. My objective is to create a dataset of 1000 obs each would be the length of the run. I have created a phase1 dataset, from which a set of control limits are produced. What i am trying to do now is create a phase2 dataset most likely using rnorm. what im trying to do is create a repeat loop that will continuously create values in the phase2 dataset until one of those values is outside of the control limits produced from the phase1 dataset. for example if i had 3.0 and -3.0 as control limits the phase2 dataset would create a bunch of observations until obs 398 when the value here happens to be 3.45, thus stopping the creation of data. my objective is then to record the number 398. Furthermore, I am then trying to loop the code back to the phase1 dataset/ control limits portion and create a new set of control limits and then run another phase2, until i have 1000 run lengths recorded. the code i have for the phase1/ control limits works fine and looks like this:
nphase1=50
nphase2=1000
varcount=1
meanshift= 0
sigmashift= 1
##### phase1 dataset/ control limits #####
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- apply(phase1, 2, mean)
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
I have previously created this code in SAS and it looks like this. might be a better reference for what i am trying to achieve then me trying to explain it.
%macro phase2_dataset (n=,varcount=, meanshift=, sigmashift=, nphase1=,simID=,);
%do z=1 %to &n;
%phase1_dataset (n=&nphase1, varcount=&varcount);
data phase2; set control_limits n=lastobs;
call streaminit(0);
do until (phase2_var1<Lower_SPC_limit_method1_var1 or
phase2_var1>Upper_SPC_limit_method1_var1);
phase2_var1 = rand("normal", &meanshift, &sigmashift);
output;
end;
run;
ods exclude all;
proc means data=phase2;
var phase2_var1;
ods output summary=x;
run;
ods select all;
data run_length; set x;
keep Phase2_var1_n;
run;
proc append base= QA.Phase2_dataset&simID data=Run_length force; run;
%end;
%mend;
Also been doing research about using a while loop in replace of the repeat loop.
Im new to R so Any ideas you are able to throw my way are greatly appreciated. Thanks!
Using a while loop indeed seems to be the way to go. Here's what I think you're looking for:
set.seed(10) #Making results reproducible
replicate(100, { #100 is easier to display here
phase1 <- matrix(rnorm(nphase1*varcount, 0, 1), nrow = nphase1, ncol=varcount)
mean_var <- colMeans(phase1) #Slightly better than apply
std_var <- apply(phase1, 2, sd)
df_var <- data.frame(mean_var, std_var)
Upper_SPC_Limit_Method1 <- with(df_var, mean_var + 3 * std_var)
Lower_SPC_Limit_Method1 <- with(df_var, mean_var - 3 * std_var)
df_control_limits<- data.frame(Upper_SPC_Limit_Method1, Lower_SPC_Limit_Method1)
#Phase 2
x <- 0
count <- 0
while(x > Lower_SPC_Limit_Method1 && x < Upper_SPC_Limit_Method1) {
x <- rnorm(1)
count <- count + 1
}
count
})
The result is:
[1] 225 91 97 118 304 275 550 58 115 6 218 63 176 100 308 844 90 2758
[19] 161 311 1462 717 2446 74 175 91 331 210 118 1517 420 32 39 201 350 89
[37] 64 385 212 4 72 730 151 7 1159 65 36 333 97 306 531 1502 26 18
[55] 67 329 75 532 64 427 39 352 283 483 19 9 2 1018 137 160 223 98
[73] 15 182 98 41 25 1136 405 474 1025 1331 159 70 84 129 233 2 41 66
[91] 1 23 8 325 10 455 363 351 108 3
If performance becomes a problem, perhaps it would be interesting to explore some improvements, like creating more numbers with rnorm() at a time and then counting how many are necessary to exceed the limits and repeat if necessary.

R summing row one with all rows

I am trying to analyse website data for AB testing.
My reference point is based on experimentName = Experiment 1 (control version)
experimentName UniquePageView UniqueFrequency NonUniqueFrequency
1 Experiment 1 459 294 359
2 Experiment 2 440 286 338
3 Experiment 3 428 273 348
What I need to do is sum every UniquePageView, UniqueFrequency and NonUniqueFrequency row when experimentName = Experiment 1
e.g.
UniquePageView WHERE experimentName = 'Experiment 1 ' + UniquePageView WHERE experimentName = 'Experiment 2 ',
UniquePageView WHERE experimentName = 'Experiment 1 ' + UniquePageView WHERE experimentName = 'Experiment 3 '
so on so forth (I could have an unlimted number of experiment #)
then do the same for UniqueFrequency and NonUniqueFrequency (I could have an unlimited number of column as well)
Result expected:
experimentName UniquePageView UniqueFrequency NonUniqueFrequency Conversion Rate Pooled UniquePageView Conversion Rate Pooled UniqueFrequency Conversion Rate Pooled NonUniqueFrequency
1 Experiment 1 459 294 359 918 588 718
2 Experiment 2 440 286 338 899 580 697
3 Experiment 3 428 273 348 887 567 707
here is the math behind it:
experimentName UniquePageView UniqueFrequency NonUniqueFrequency Conversion Rate Pooled UniquePageView Conversion Rate Pooled UniqueFrequency Conversion Rate Pooled NonUniqueFrequency
1 Experiment 1 459 294 359 459 + 459 294 + 294 359 + 359
2 Experiment 2 440 286 338 459 + 440 294 + 286 359 + 338
3 Experiment 3 428 273 348 459 + 428 294 + 273 359 + 348
In base R, you can do this in one line by column binding (with cbind) the initial data frame to the initial data frame plus a version that is just duplicates of the "Experiment 1" row).
cbind(dat, dat[,-1] + dat[rep(which(dat$experimentName == "Experiment 1"), nrow(dat)), -1])
# experimentName UniquePageView UniqueFrequency NonUniqueFrequency UniquePageView UniqueFrequency
# 1 Experiment 1 459 294 359 918 588
# 2 Experiment 2 440 286 338 899 580
# 3 Experiment 3 428 273 348 887 567
# NonUniqueFrequency
# 1 718
# 2 697
# 3 707
To update the column names at the end (assuming you stored the resulting data frame in res), you could use:
names(res)[4:6] <- c("CombinedPageView", "CombinedUniqueFrequency", "CombinedNonUniqueFrequency")
Do you know how to use dplyr? If you're new to R, this is a pretty good lesson to learn. Dplyr includes the functions filter and summarise, which are all you need to do this problem - very simple!
First, take your data frame
df
Then, filter to only the data you want, in this case when experimentName = Experiment 1
df
df <- filter(df, experimentName == "Experiment 1")
Now, summarise to find the sums of UniquePageView, UniqueFrequency and NonUniqueFrequency
df
df <- filter(df, experimentName == "Experiment 1")
summarise(df, SumUniquePageView = sum(UniquePageView),
SumUniqueFrequency = sum(UniqueFrequency),
SumNonUniqueFrequency = sum(NonUniqueFrequency))
This will return a small table with the answers you're looking for. For a slightly more advanced (but simpler) way to do this, you can use the piping operator %>% from the packages magrittr. That code borrows the object from the previous statement and uses it as the first argument in the proceeding statement, as follows:
df %>% filter(experimentName == "Experiment 1") %>% summarise(SumUniquePageView = sum(UniquePageView), etc)
If you don't yet have those packages, you can get them with install.packages("dpyr"), library(dplyr)

How ask R not to combine the X axis values for a bar chart?

I am a beginner with R . My data looks like this:
id count date
1 210 2009.01
2 400 2009.02
3 463 2009.03
4 465 2009.04
5 509 2009.05
6 861 2009.06
7 872 2009.07
8 886 2009.08
9 725 2009.09
10 687 2009.10
11 762 2009.11
12 748 2009.12
13 678 2010.01
14 699 2010.02
15 860 2010.03
16 708 2010.04
17 709 2010.05
18 770 2010.06
19 784 2010.07
20 694 2010.08
21 669 2010.09
22 689 2010.10
23 568 2010.11
24 584 2010.12
25 592 2011.01
26 548 2011.02
27 683 2011.03
28 675 2011.04
29 824 2011.05
30 637 2011.06
31 700 2011.07
32 724 2011.08
33 629 2011.09
34 446 2011.10
35 458 2011.11
36 421 2011.12
37 459 2012.01
38 256 2012.02
39 341 2012.03
40 284 2012.04
41 321 2012.05
42 404 2012.06
43 418 2012.07
44 520 2012.08
45 546 2012.09
46 548 2012.10
47 781 2012.11
48 704 2012.12
49 765 2013.01
50 571 2013.02
51 371 2013.03
I would like to make a bar graph like graph that shows how much what is the count for each date (dates in format of Month-Y, Jan-2009 for instance). I have two issues:
1- I cannot find a good format for a bar-char like graph like that
2- I want all of my data-points to be present in X axis(date), while R aggregates it to each year only (so I inly have four data-points there). Below is the current command that I am using:
plot(df$date,df$domain_count,col="red",type="h")
and my current plot is like this:
Ok, I see some issues in your original data. May I suggest the following:
Add the days in your date column
df$date=paste(df$date,'.01',sep='')
Convert the date column to be of date type:
df$date=as.Date(df$date,format='%Y.%m.%d')
Plot the data again:
plot(df$date,df$domain_count,col="red",type="h")
Also, may I add one more suggestion, have you used ggplot for ploting chart? I think you will find it much easier and resulting in better looking charts. Your example could be visualized like this:
library(ggplot2) #if you don't have the package, run install.packages('ggplot2')
ggplot(df,aes(date, count))+geom_bar(stat='identity')+labs(x="Date", y="Count")
First, you should transform your date column in a real date:
library(plyr) # for mutate
d <- mutate(d, month = as.numeric(gsub("[0-9]*\\.([0-9]*)", "\\1", as.character(date))),
year = as.numeric(gsub("([0-9]*)\\.[0-9]*", "\\1", as.character(date))),
Date = ISOdate(year, month, 1))
Then, you could use ggplot to create a decent barchart:
library(ggplot2)
ggplot(d, aes(x = Date, y = count)) + geom_bar(fill = "red", stat = "identity")
You can also use basic R to create a barchart, which is however less nice:
dd <- setNames(d$count, format(d$Date, "%m-%Y"))
barplot(dd)
The former plot shows you the "holes" in your data, i.e. month where there is no count, while for the latter it is even wuite difficult to see which bar corresponds to which month (this could however be tweaked I assume).
Hope that helps.

Resources