Sliding window means with dplyr & zoo

I have a data frame containing per-base coverage along a genome. A much smaller example version is below:
> head(per_base_cov)
contig_id position coverage
1 contig_1 1 40
2 contig_1 2 33
3 contig_1 3 40
4 contig_1 4 32
5 contig_1 5 36
6 contig_1 6 30
7 contig_1 7 40
8 contig_1 8 38
9 contig_1 9 36
10 contig_1 10 40
11 contig_2 11 38
12 contig_2 12 39
13 contig_2 13 34
14 contig_2 14 39
15 contig_2 15 39
16 contig_2 16 32
17 contig_2 17 30
18 contig_2 18 37
19 contig_2 19 33
20 contig_2 20 35
I would like to calculate sliding window means for each contig, every 4 positions and overlapping by 2 positions. I've tried the following using dplyr and zoo:
per_base_cov %>%
group_by(contig_id) %>%
mutate(cov.win.mean=rollapply(coverage,4,mean,by=2))
But I get the error message:
Error: Problem with `mutate()` input `cov.win.mean`.
x Input `cov.win.mean` can't be recycled to size 10.
ℹ Input `cov.win.mean` is `rollapply(coverage, 4, mean, by = 2)`.
ℹ Input `cov.win.mean` must be size 10 or 1, not 4.
ℹ The error occurred in group 1: contig_id = "contig_1".
Does anyone know how I could solve this? I would like an output that looks something like the following:
contig_id mean_coverage
1 contig_1 36.25
2 contig_1 34.50
3 contig_1 36.00
4 contig_1 38.50
5 contig_2 37.5
6 contig_2 36
7 contig_2 34.5
8 contig_2 33.75
Many thanks in advance.

I managed to find a solution with the help of Ronak below:
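library(dplyr)
library(zoo)
# fill = NA pads each group's result to the group's length, so mutate() accepts it
win_means <- per_base_cov %>%
  group_by(contig_id) %>%
  mutate(cov.win.mean = rollapply(coverage, 4, mean, by = 2, fill = NA))
# keep only the rows where a window mean was computed
win_means_complete <- win_means[complete.cases(win_means), ]
# drop the per-base coverage column
win_means_final <- win_means_complete[, c(1, 2, 4)]
win_means_final <- as.data.frame(win_means_final)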
head(win_means_final)
contig_id position cov.win.mean
1 contig_1 2 36.25
2 contig_1 4 34.50
3 contig_1 6 36.00
4 contig_1 8 38.50
5 contig_2 12 37.50
6 contig_2 14 36.00
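As an alternative to padding with NA and filtering afterwards, the window means can be computed per contig and collected directly. The following is a minimal sketch using base R's split/lapply together with zoo; the column name window_start is made up for illustration, and the example data are retyped from the question.

```r
library(zoo)

# Example data retyped from the question
per_base_cov <- data.frame(
  contig_id = rep(c("contig_1", "contig_2"), each = 10),
  position  = 1:20,
  coverage  = c(40, 33, 40, 32, 36, 30, 40, 38, 36, 40,
                38, 39, 34, 39, 39, 32, 30, 37, 33, 35),
  stringsAsFactors = FALSE
)

width <- 4  # window size
step  <- 2  # distance between window starts

# One small data frame of window means per contig, stacked with rbind()
win_means <- do.call(rbind, lapply(
  split(per_base_cov, per_base_cov$contig_id),
  function(d) {
    starts <- seq(1, nrow(d) - width + 1, by = step)
    data.frame(
      contig_id     = d$contig_id[1],
      window_start  = d$position[starts],  # hypothetical helper column
      mean_coverage = rollapply(d$coverage, width = width, FUN = mean, by = step)
    )
  }
))
rownames(win_means) <- NULL
print(win_means)
```

Because no fill is requested, rollapply() returns only the computed means (four per contig here), which is why the result cannot be used inside mutate() but fits a summary table.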

Related

Conditional interpolation of time series data in R

I have time series data with N/As. The data are to end up in an animated scatterplot.
Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200
For a smooth animation, the data will be supplemented by calculated, additional values/rows. For the X values this is simply arithmetical. No problem so far.
Week  X   Y
1     1   105
      2
2     3   110
      4
3     5   N/A
      6
4     7   130
      8
      9
      10
      11
      12
      13
      14
8     15  160
      16
      17
      18
      19
      20
      21
      22
12    23  180
      24
      25
      26
      27
      28
      29
16    30  N/A
      31
      32
      33
      34
      35
      36
20    37  200
The Y values should be interpolated, with the additional requirement that interpolation should only happen between two consecutive known values and not between values that have an N/A between them.
Week  X   Value
1     1   105
      2   interpolated value
2     3   110
      4
3     5   N/A
      6
4     7   130
      8   interpolated value
      9   interpolated value
      10  interpolated value
      11  interpolated value
      12  interpolated value
      13  interpolated value
      14  interpolated value
8     15  160
      16  interpolated value
      17  interpolated value
      18  interpolated value
      19  interpolated value
      20  interpolated value
      21  interpolated value
      22  interpolated value
12    23  180
      24
      25
      26
      27
      28
      29
16    30  N/A
      31
      32
      33
      34
      35
      36
20    37  200
I have already experimented with approx, converted the "original" N/A to placeholder values, and tried the zoo package with na.approx etc., but I can't work out how to express the condition for this kind of "conditional approximation" or "conditional gap filling". Any hint is welcome and much appreciated.
Thanks in advance

Replace the NAs with Inf, interpolate and then revert infinite values to NA.
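library(zoo)
DF2 <- DF
# mark the original NAs as Inf so they can be told apart from the rows merge() inserts
DF2$Y[is.na(DF2$Y)] <- Inf
# expand to every week; the inserted rows get NA in X and Y
w <- merge(DF2, data.frame(Week = min(DF2$Week):max(DF2$Week)), by = 1, all.y = TRUE)
# interpolating towards or away from Inf yields Inf or NaN
w$Value <- na.approx(w$Y)
# turn every non-finite result back into NA
w$Value[!is.finite(w$Value)] <- NA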
giving the following where Week has been expanded to all weeks, Y is such that the original NAs are shown as Inf and the inserted NAs as NA. Value is the interpolated Y.
> w
Week X Y Value
1 1 1 105 105.0
2 2 3 110 110.0
3 3 5 Inf NA
4 4 7 130 130.0
5 5 NA NA 137.5
6 6 NA NA 145.0
7 7 NA NA 152.5
8 8 15 160 160.0
9 9 NA NA 165.0
10 10 NA NA 170.0
11 11 NA NA 175.0
12 12 23 180 180.0
13 13 NA NA NA
14 14 NA NA NA
15 15 NA NA NA
16 16 30 Inf NA
17 17 NA NA NA
18 18 NA NA NA
19 19 NA NA NA
20 20 37 200 200.0
Note: Input DF in reproducible form:
Lines <- "
Week X Y
1 1 105
2 3 110
3 5 N/A
4 7 130
8 15 160
12 23 180
16 30 N/A
20 37 200"
DF <- read.table(text = Lines, header = TRUE, na.strings = "N/A")
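The desired table in the question fills in X (1 to 37) rather than Week. Assuming that reading of the question, the same Inf trick can be applied with the merge done on X instead. This is only a sketch: DF is rebuilt inline so the block runs on its own.

```r
library(zoo)

# Input data retyped from the question (N/A entered as NA)
DF <- data.frame(
  Week = c(1, 2, 3, 4, 8, 12, 16, 20),
  X    = c(1, 3, 5, 7, 15, 23, 30, 37),
  Y    = c(105, 110, NA, 130, 160, 180, NA, 200)
)

DF2 <- DF
DF2$Y[is.na(DF2$Y)] <- Inf                    # original gaps become Inf

# Expand along X this time; rows added by the merge carry NA in Week and Y
w <- merge(DF2, data.frame(X = min(DF2$X):max(DF2$X)), by = "X", all.y = TRUE)

w$Value <- na.approx(w$Y)                     # Inf neighbours give Inf or NaN
w$Value[!is.finite(w$Value)] <- NA            # which are reverted to NA

print(w)
```

Rows next to an original N/A (X = 4, 6, 24 to 29 and 31 to 36) stay NA, while the stretches between two known values are filled linearly.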

ordering nodes in Sankey diagram using rCharts

I'm building a Sankey diagram in R using rCharts per https://github.com/timelyportfolio/rCharts_d3_sankey
Everything is fine except that I'd like to have control over the placement of the nodes. As I run the R script, it produces this:
I want all node-columns in ascending order, like the 2012 and 2013 node-columns. And like the image below (which I modified manually).
My graph.data is already sorted in the proper order as you can see:
g <- graph.data.frame(network.df[ , c("source","target","weight")])
edgelist <- get.data.frame(g)
colnames(edgelist) <- c("source","target","value")
edgelist$source <- as.character(edgelist$source)
edgelist$target <- as.character(edgelist$target)
edgelist #<-edgelist is sorted by source
source target value
1 2012-0 2013-0 5
2 2012-0 2013-1 21
3 2012-1 2013-1 79
4 2013-0 2014-0 42
5 2013-0 2014-1 10
6 2013-0 2014-2 13
7 2013-0 2014-3 19
8 2013-0 2014-4 12
9 2013-0 2014-5 1
10 2013-1 2014-0 29
11 2013-1 2014-1 29
12 2013-1 2014-2 23
13 2013-1 2014-3 54
14 2013-1 2014-4 17
15 2014-0 2015-0 2
16 2014-0 2015-1 8
17 2014-0 2015-2 1
18 2014-0 2015-3 1
19 2014-0 2015-4 9
20 2014-1 2015-0 5
21 2014-1 2015-1 13
22 2014-1 2015-2 68
23 2014-1 2015-3 7
24 2014-1 2015-4 66
25 2014-2 2015-0 9
26 2014-2 2015-2 23
27 2014-2 2015-3 21
28 2014-3 2015-3 56
29 2014-4 2015-4 2
30 2014-5 2015-5 1
31 2015-0 2016-0 1
32 2015-0 2016-1 1
33 2015-0 2016-2 4
<more rows omitted>
sankeyPlot <- rCharts$new()
sankeyPlot$setLib('/rCharts_d3_sankey-gh-pages/rCharts_d3_sankey-gh-pages')
sankeyPlot$setTemplate(script = "rCharts_d3_sankey-gh-pages/rCharts_d3_sankey-gh-pages/layouts/chart.html")

How to update and replace part of old data

I want to merge the data frames OldData and NewData.
In this case, Nov-2015 and Dec-2015 are present in both data frames.
Since NewData is the most accurate update available, I want to update the values for Nov-2015 and Dec-2015 using the values in NewData, and of course add the records for Jan-2016 and Feb-2016 as well.
Can anyone help?
OldData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 42
12 Dec-2015 32
NewData
Month Value
1 Nov-2015 88
2 Dec-2015 45
3 Jan-2016 32
4 Feb-2016 11
Here is the output I want:
JoinData
Month Value
1 Jan-2015 3
2 Feb-2015 76
3 Mar-2015 31
4 Apr-2015 45
5 May-2015 99
6 Jun-2015 95
7 Jul-2015 18
8 Aug-2015 97
9 Sep-2015 61
10 Oct-2015 7
11 Nov-2015 88
12 Dec-2015 45
13 Jan-2016 32
14 Feb-2016 11
Thanks to @akrun, the problem is solved, and the following code works smoothly!
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Update: Now, let's extend the problem a little bit.
Suppose OldData and NewData have another column called "Type".
How do we merge/update them this time?
> OldData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-12 C 32
13 2015-12 D 77
> NewData
Month Type Value
1 2015-11 A 88
2 2015-12 C 45
3 2015-12 D 22
4 2016-01 A 32
5 2016-02 A 11
JoinData is supposed to take all updated values from NewData, as follows:
> JoinData
Month Type Value
1 2015-01 A 3
2 2015-02 A 76
3 2015-03 A 31
4 2015-04 A 45
5 2015-05 A 99
6 2015-06 A 95
7 2015-07 A 18
8 2015-08 A 97
9 2015-09 A 61
10 2015-10 A 7
11 2015-11 B 42
12 2015-11 A 88 (originally not included, added from NewData)
13 2015-12 C 45 (value updated from NewData)
14 2015-12 D 22 (value updated from NewData)
15 2016-01 A 32 (newly added from NewData)
16 2016-02 A 11 (newly added from NewData)
Thanks to @akrun, I have a solution for the second question as well.
Thanks to everyone here for the help!
Here is the answer:
d1 <- merge(OldData, NewData, by = c("Month","Type"), all = TRUE)
d2 <- transform(d1, Value.x = ifelse(!is.na(Value.y), Value.y, Value.x))[-4]
d2[!duplicated(d2[1:2], fromLast=TRUE),]

Here is an option using data.table (a similar approach to the one @thelatemail mentioned in the comments)
library(data.table)
rbindlist(list(OldData, NewData))[!duplicated(Month, fromLast=TRUE)]
Or
rbindlist(list(OldData, NewData))[,if(.N >1) .SD[.N] else .SD, Month]
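For the updated question with the extra Type column, the same "stack, then keep the last duplicate" idea carries over if duplicates are judged on Month and Type together. The following is a base R sketch under that assumption, with the data retyped from the question. It is an illustration of the idea, not the code from the answer above.

```r
# Data retyped from the updated question
OldData <- data.frame(
  Month = c("2015-01", "2015-02", "2015-03", "2015-04", "2015-05", "2015-06",
            "2015-07", "2015-08", "2015-09", "2015-10", "2015-11", "2015-12",
            "2015-12"),
  Type  = c(rep("A", 10), "B", "C", "D"),
  Value = c(3, 76, 31, 45, 99, 95, 18, 97, 61, 7, 42, 32, 77),
  stringsAsFactors = FALSE
)
NewData <- data.frame(
  Month = c("2015-11", "2015-12", "2015-12", "2016-01", "2016-02"),
  Type  = c("A", "C", "D", "A", "A"),
  Value = c(88, 45, 22, 32, 11),
  stringsAsFactors = FALSE
)

# Stack old on top of new, so that for each Month/Type pair the NewData row comes last
combined <- rbind(OldData, NewData)

# Keep only the last occurrence of each Month/Type pair
JoinData <- combined[!duplicated(combined[c("Month", "Type")], fromLast = TRUE), ]

# Restore chronological order and renumber the rows
JoinData <- JoinData[order(JoinData$Month), ]
rownames(JoinData) <- NULL
print(JoinData)
```

Rows that exist only in OldData (such as 2015-11 B) are left alone, because they have no later duplicate to replace them.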

how to fix "undefined columns selected" for network meta-analysis in R?

I am conducting a network meta-analysis in R with two packages, gemtc and rjags. However, when I type
Model <- mtc.model(network, linearModel="fixed")
R always returns
Error in [.data.frame(data, sel1 | sel2, columns, drop = FALSE) :
undefined columns selected
In addition: Warning messages:
1: In mtc.model(network, linearModel = "fixed") :
Likelihood can not be inferred. Defaulting to normal.
2: In mtc.model(network, linearModel = "fixed") :
Link can not be inferred. Defaulting to identity
How can I fix this problem? Thanks!
I am attaching my code and data here:
SAE <- read.csv(file.choose(),head=T, sep=",")
head(SAE)
network <- mtc.network(data.ab=SAE)
summary(network)
plot(network)
model.fe <- mtc.model (network, linearModel="fixed")
plot(model.fe)
summary(model.fe)
cat(model.fe$code)
model.fe$data
# run this model
result.fe <- mtc.run(model.fe, n.adapt=0, n.iter=50)
plot(result.fe)
gelman.diag(result.fe)
result.fe <- mtc.run(model.fe, n.adapt=1000, n.iter=5000)
plot(result.fe)
gelman.diag(result.fe)
following is my data: SAE
study treatment responder sample.size
1 1 3 0 76
2 1 30 2 72
3 2 3 99 1389
4 2 23 132 1383
5 3 1 6 352
6 3 30 2 178
7 4 2 6 106
8 4 30 3 95
9 5 3 49 393
10 5 25 18 198
11 6 1 20 65
12 6 22 10 26
13 7 1 1 76
14 7 30 3 76
15 8 3 7 441
16 8 26 1 220
17 9 2 1 47
18 9 30 0 41
19 10 3 10 156
20 10 30 9 150
21 11 1 4 85
22 11 25 5 85
23 11 30 4 84
24 12 3 6 152
25 12 30 5 160
26 13 18 4 158
27 13 21 8 158
28 14 1 3 110
29 14 30 2 111
30 15 3 3 83
31 15 30 1 92
32 16 1 3 124
33 16 22 6 123
34 16 30 4 125
35 17 3 236 1553
36 17 23 254 1546
37 18 6 5 398
38 18 7 6 403
39 19 1 64 588
40 19 22 73 584

How about reading the manual ?mtc.model. It clearly states the following:
Required columns [responders, sampleSize]
So your responder variable should be responders and your sample.size variable should be sampleSize.
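The renaming can be done in place before building the network. A minimal sketch on the first two rows of the posted data:

```r
# First two rows of the posted data, with the original column names
SAE <- data.frame(
  study       = c(1, 1),
  treatment   = c(3, 30),
  responder   = c(0, 2),
  sample.size = c(76, 72)
)

# Rename to the column names quoted from the manual above
names(SAE)[names(SAE) == "responder"]   <- "responders"
names(SAE)[names(SAE) == "sample.size"] <- "sampleSize"

print(names(SAE))
```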
Next, your plot(network) should help you determine that some comparisons cannot be made. In your data, there are 2 subgroups of trials that were compared. Treatments 18 and 21 were not compared with any of the others. Therefore you can only do a meta-analysis of 21 and 18, or a network meta-analysis of the rest.
network <- mtc.network(data.ab=SAE[!SAE$treatment %in% c(21, 18), ])
model.fe <- mtc.model(network, linearModel="fixed")
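Besides inspecting the plot, connectivity can be checked directly from the study and treatment columns. The sketch below uses plain base R (no gemtc calls) and simple label propagation: every treatment starts with its own label, and all arms of a study repeatedly take the lowest label among them until nothing changes. The variable names are made up for illustration. On the data exactly as posted, it groups treatments 18 and 21 apart from the main network, and it also reports treatments 6 and 7 (study 18) as a separate group, so that pair may be worth checking too.

```r
# study and treatment columns retyped from the posted data
SAE <- data.frame(
  study = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10,
            11, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 16,
            17, 17, 18, 18, 19, 19),
  treatment = c(3, 30, 3, 23, 1, 30, 2, 30, 3, 25, 1, 22, 1, 30, 3, 26, 2, 30,
                3, 30, 1, 25, 30, 3, 30, 18, 21, 1, 30, 3, 30, 1, 22, 30,
                3, 23, 6, 7, 1, 22)
)

# Each treatment starts in its own component
trts <- sort(unique(SAE$treatment))
comp <- setNames(seq_along(trts), trts)

# Within every study, give all arms the lowest label seen; repeat until stable
repeat {
  changed <- FALSE
  for (s in unique(SAE$study)) {
    arms   <- as.character(SAE$treatment[SAE$study == s])
    lowest <- min(comp[arms])
    if (any(comp[arms] != lowest)) {
      comp[arms] <- lowest
      changed <- TRUE
    }
  }
  if (!changed) break
}

# Treatments grouped by connected component
components <- split(names(comp), comp)
print(components)
```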

Changing factors to Integers without changing the order of the data

I have the following data and am trying to change CCG and pract to numbers so I can use Stan or WinBUGS. When I try, it seems to change the order of the data.
I want to change CCG and pract to numbers without changing the order of the data. I tried hard but couldn't do it.
I am struggling more with this basic issue than with writing the BUGS code. Please help.
I have the following data
CCG pract Deno Numer Points Excep
1 01C N81049 49 46 4 4
2 01C N81022 28 26 4 23
3 01C N81632 66 64 4 4
4 01C N81069 15 14 4 3
5 01C N81062 98 89 4 9
6 01C N81033 31 28 4 9
I tried to change to integer using as.integer(), and I am getting:
CCG pract Deno Numer Points Excep
1 20 6621 160 144 41 36
2 20 6594 130 117 41 18
3 20 6698 179 164 41 36
4 20 6640 57 46 41 25
5 20 6633 214 191 41 62
6 20 6605 137 119 41 62
By checking Deno and Numer it is clear that the order of the data has been changed. Why is CCG not starting from 1?
I want
CCG pract Deno Numer Points Excep
1 01C N81049 49 46 4 4
2 01C N81022 28 26 4 23
3 01C N81632 66 64 4 4
4 01C N81069 15 14 4 3
5 01C N81062 98 89 4 9
6 01C N81033 31 28 4 9
change to something like this
CCG pract Deno Numer Points Excep
1 1 1 49 46 4 4
2 1 1 28 26 4 23
3 1 1 66 64 4 4
4 1 1 15 14 4 3
5 1 1 98 89 4 9
6 1 1 31 28 4 9
Please help me..

In R, factors are internally represented as integers, linking to a table of the factor levels. AFAIK, these internal integers are assigned based on a lexicographic order of the factor levels, so 57 gets a higher code than 238.
as.integer() will extract this internal integer coding. As you found out, this is not very useful. (I honestly don't understand why R does this when applying as.integer() to factors that have integers as factor levels.)
Solution: first convert to character, then to integer. as.integer(as.character(Deno))
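A small sketch of the difference, on three rows retyped from the question. The as.character() route applies to number-like columns such as Deno. For label columns such as CCG and pract it would give NA, so the sketch numbers them by order of first appearance with match(); that this is the numbering the asker wanted is an assumption.

```r
# Three rows from the question, with every column read in as a factor
d <- data.frame(
  CCG   = factor(c("01C", "01C", "01C")),
  pract = factor(c("N81049", "N81022", "N81632")),
  Deno  = factor(c("49", "28", "66"))
)

# Internal codes follow the sorted levels, not the numeric values
deno_codes <- as.integer(d$Deno)

# Going through character recovers the numbers that were printed
deno_num <- as.integer(as.character(d$Deno))

# For label columns: integer IDs in order of first appearance (assumed goal)
pract_chr <- as.character(d$pract)
pract_id  <- match(pract_chr, unique(pract_chr))
ccg_chr   <- as.character(d$CCG)
ccg_id    <- match(ccg_chr, unique(ccg_chr))

print(data.frame(CCG = ccg_id, pract = pract_id, Deno = deno_num))
```

The row order is never touched in either case; only the values in the converted columns change.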