Cumulative sum of values without calculating repeated values in a column - r

I have a data like this in R
x <- c(1,2,2,3,4,4,7,8)
y <- c(300,200,200,150,100,100,30,20)
df <- data.frame(x, y)
The cumulative with the dataset is
cum_df <- data.frame(x, y, Y)
> cum_df
x y Y
1 1 300 300
2 2 200 500
3 2 200 700
4 3 150 850
5 4 100 950
6 4 100 1050
7 7 30 1080
8 8 20 1100
The cumulative of "y" using cumsum(y) is:
Y <- cumsum(y)
> Y
[1] 300 500 700 850 950 1050 1080 1100
Instead, I want the cumulative of "y" to be like this
> Y
[1] 300 500 500 650 750 750 780 800
In essence, it does not compute repeated values of y. How do I go about this in R? I have tried different functions but it seem not to work. I want the answer to look like this
> ans
x y Y
1 1 300 300
2 2 200 500
3 2 200 500
4 3 150 650
5 4 100 750
6 4 100 750
7 7 30 780
8 8 20 800

We can get the distinct rows, do the cumsum and then do a join
library(dplyr)
df %>%
distinct() %>%
mutate(Y = cumsum(y)) %>%
right_join(df)
# x y Y
#1 1 300 300
#2 2 200 500
#3 2 200 500
#4 3 150 650
#5 4 100 750
#6 4 100 750
#7 7 30 780
#8 8 20 800
Or without any join by replacing the duplicated values in 'y' with 0, and then do the cumsum
df %>%
mutate(Y = cumsum(y * !duplicated(y)))
# x y Y
#1 1 300 300
#2 2 200 500
#3 2 200 500
#4 3 150 650
#5 4 100 750
#6 4 100 750
#7 7 30 780
#8 8 20 800
Or in base R
df$Y <- with(df, cumsum(y * !duplicated(y)))

Related

R: Conditional Sum by Row in DataTable

I've got a very large dataset (millions of rows that I need to loop through thousands of times), and during the loop I have to do a conditional sum that appears to be taking a very long time. Is there a way of making this more efficient?
Datatable format as follows:
DT <- data.table('A' = c(1,1,1,2,2,3,3,3,3,4),
'B' = c(500,510,540,500,540,500,510,519,540,500),
'C' = c(10,20,10,20,10,50,20,50,20,10))
A
B
C
1
500
10
1
510
20
1
540
10
2
500
20
2
540
10
3
500
50
3
510
20
3
519
50
3
540
20
4
500
10
I need the sum of column C (in a new column, D) subject to A == A, and B >= B & B < B + 20 (by row). So the output table would look like the following:
A
B
C
D
1
500
10
30
1
510
20
30
1
540
10
10
2
500
20
20
2
540
10
10
3
500
50
120
3
510
20
120
3
519
50
120
3
540
20
20
4
500
10
10
The code I'm currently using:
DT[,D:= sum(DT$C[A == DT$A & ((B >= DT$B) & (B < DT$B + 20))]), by=c('A', 'B')]
This takes a very long time to actually run, as well as giving me the wrong answer. The output I get looks like this:
A
B
C
D
1
500
10
10
1
510
20
30
1
540
10
10
2
500
20
20
2
540
10
10
3
500
50
50
3
510
20
70
3
519
50
120
3
540
20
20
4
500
10
10
(i.e. D only appears to increase cumulatively).
I'm less concerned with the cumulative thing, more about speed. Ultimately what I'm trying to get to is the largest sum of C, by A, subject to B being within 20 of eachother. I would really appreciate any help on this! Thanks in advance.
If I understand correctly, this can be solved by a non-equi self join:
DT[, Bp20 := B + 20][
DT, on = .(A, B >= B, B < Bp20), mult = "last"][
, .(B, C = i.C, D = sum(i.C)), by = .(A, Bp20)][
, Bp20 := NULL][]
A B C D
1: 1 500 10 30
2: 1 510 20 30
3: 1 540 10 10
4: 2 500 20 20
5: 2 540 10 10
6: 3 500 50 120
7: 3 510 20 120
8: 3 519 50 120
9: 3 540 20 20
10: 4 500 10 10
# logic for B
DT[, g := B >= shift(B) & B < shift(B, 1) + 20, by = A]
# creating index column
DT[, gi := !g]
DT[is.na(gi), gi := T]
DT[, gi := cumsum(gi)]
DT[, D := sum(C), by = gi] # summing by new groups
DT
# A B C g gi D
# 1: 1 500 10 NA 1 30
# 2: 1 510 20 TRUE 1 30
# 3: 1 540 10 FALSE 2 10
# 4: 2 500 20 NA 3 20
# 5: 2 540 10 FALSE 4 10
# 6: 3 500 50 NA 5 120
# 7: 3 510 20 TRUE 5 120
# 8: 3 519 50 TRUE 5 120
# 9: 3 540 20 FALSE 6 20
# 10: 4 500 10 NA 7 10
You might need to adjust logic for B, as all edge cases isn't clear from the question... if for one A value we have c(30, 40, 50, 60), all of those rows are in one group?

Calculate mean of pairwise differences between ALL observations WITHIN group() in R?

I have a table like :
Fish Group Position
1 1 100
2 1 250
3 1 500
4 2 25
5 2 75
6 2 625
I have groups of fish with their position in the stream. To know how close they are, on average, I need to calculate the mean of the difference of distances for all observations within groups.
For fish of group 1, it does:
1-2 distance = 250 - 100 = 150
2-3 distance = 500 - 250 = 250
3-1 distance = 500 - 100 = 400
So the mean I look for is mean(150 + 250 + 400)
The tricky thing for me is to find a way to do it in the tidyverse philosophy !
If DF is your data you can try this. Hope it can help:
library(dplyr)
DF %>% group_by(Group) %>% mutate(Diff=c(last(Position)-first(Position),diff(Position)))
# A tibble: 6 x 4
# Groups: Group [2]
Fish Group Position Diff
<int> <int> <int> <int>
1 1 1 100 400
2 2 1 250 150
3 3 1 500 250
4 4 2 25 600
5 5 2 75 50
6 6 2 625 550
As long as previous solution is just a sketch, try this modification and see if this applies for your original data:
#Create list by group
L <- split(DF,DF$Group)
#Create function
compute_d <- function(x)
{
xv <- as.numeric(x$Position)
y <- dist(xv)
return(y)
}
#Apply function
lapply(L,compute_d)
The results:
$`1`
1 2
2 150
3 400 250
$`2`
1 2
2 50
3 600 550
Or even more modified (new version):
#Create list by group
L <- split(DF,DF$Group)
#Create function
compute_d <- function(x)
{
xv <- as.numeric(x$Position)
y <- dist(xv)
avg <- mean(y)
y1 <- as.data.frame(as.matrix(y))
y2 <- cbind(x,y1)
y2$mean <- avg
return(y2)
}
#Apply function
z <- do.call('rbind',lapply(L,compute_d))
rownames(z)<-NULL
Fish Group Position 1 2 3 mean
1 1 1 100 0 150 400 266.6667
2 2 1 250 150 0 250 266.6667
3 3 1 500 400 250 0 266.6667
4 4 2 25 0 50 600 400.0000
5 5 2 75 50 0 550 400.0000
6 6 2 625 600 550 0 400.0000

Make a spaggetiplot of data

I would like to make a spagettiplot of the data below. Treatment C should be set as the reference,1, compared to treatment A and B. Does anyone have a suggestion how to do that? Thanks in advance! :)
id <- rep(c(300,450), each=6)
trt <- rep(c("A","B","C"),2)
q1 <- c(100,89, 60,85,40,10)
df <- data.frame(id,trt,q1)
df
id trt q1
1 300 A 100
2 300 B 89
3 300 C 60
4 300 A 85
5 300 B 40
6 300 C 10
7 450 A 100
8 450 B 89
9 450 C 60
10 450 A 85
11 450 B 40
12 450 C 10

The value in one column depends in the value of another column

I want to make all rows with number 2 in column q1 to zero in column q2. Anyone have a smart solution?
a <- rep(c(300,450), each=c(3,3))
q1 <- rep(c(1,1,2,1,1,2),2)
q2 <- c(100,40,"",80,30,"" , 45,78,"",20,58,"")
df <- cbind(a,q1,q2)
df <- as.data.frame(df)
Original input data :
> df
a q1 q2
1 300 1 100
2 300 1 40
3 300 2
4 450 1 80
5 450 1 30
6 450 2
7 300 1 45
8 300 1 78
9 300 2
10 450 1 20
11 450 1 58
12 450 2
Desired output :
> df
a q1 q2
1 300 1 100
2 300 1 40
3 300 2 0
4 450 1 80
5 450 1 30
6 450 2 0
7 300 1 45
8 300 1 78
9 300 2 0
10 450 1 20
11 450 1 58
12 450 2 0
An option would be to create a logical vector based on the column 'q1' and assign the value of 'q2' to 0
df$q2[df$q1 == 2] <- 0
df
# a q1 q2
#1 300 1 100
#2 300 1 40
#3 300 2 0
#4 450 1 80
#5 450 1 30
#6 450 2 0
#7 300 1 45
#8 300 1 78
#9 300 2 0
#10 450 1 20
#11 450 1 58
#12 450 2 0
Another option is replace
transform(df, q2 = replace(q2, q1 == 2, 0))
With cbind, it converts to a matrix first, so any character element anywhere results in the whole matrix to be character. Better, would be use data.frame directly
Or in data.table
library(data.table)
setDT(df)[q1== 2, q2 := '0']
data
df <- data.frame(a, q1, q2, stringsAsFactors = FALSE)

How to make a timeline/waterfall like plot in R for gene/genome coverage

I would like to make a relatively simple plot (reminiscent of timelines such as this: http://www.ats.ucla.edu/stat/sas/code/timeline.gif), but instead of time on the x-axis, it will be base positions in a genome. The "time spans" will be coverage distances for DNA-sequence scaffolds, showing the spans of where they fall in the genome, where they overlap and places with no coverage. Here is a crude mock-up of what I am looking for, showing contig coverage of rRNAs,(I left out, but need, an x-axis showing positions the starts and stops, and labeling of the contigs (colored lines)): http://i.imgur.com/MDABx.png , with the following coordinates:
Contig# Start1 Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
1 1 90 90 100 120 150 200 400
2 1 100 120 150 200 400 NA NA
3 1 30 90 100 120 135 200 400
4 1 100 120 140 200 400 NA NA
5 -35 80 90 100 130 150 200 400
6 1 100 200 300 360 400 NA NA
I am pretty sure this can be done in R, probably with ggplot2, but for some reason I cannot figure it out.
This is not going to be as organized as your plot but it puts the lines in with coordinates that you have yet to provide:
dfr <- data.frame(seg=sample(1:6, 20, replace=TRUE), start=sample(1:100, 20, replace=TRUE), end=sample(1:100,20, replace=TRUE) )
plot(c(1,100), c(1,6), type="n")
with(dfr, segments(y0=seg, y1=seg, x0=start, x1=end, col=2:7, lwd=3))
With new dataset:
Contig <- read.table(text=" Start1 Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
1 1 90 90 100 120 150 200 400
2 1 100 120 150 200 400 NA NA
3 1 30 90 100 120 135 200 400
4 1 100 120 140 200 400 NA NA
5 -35 80 90 100 130 150 200 400
6 1 100 200 300 360 400 NA NA")
# the reshape function can be tricky.... but seems to finally work.
reshape(Contig, direction="long", sep="",
varying=list(Start=names(Contig)[c(1,3,5,7)],
Stop=names(Contig)[c(2,4,6,8)] ) )
#------------------------------
time Start1 Stop1 id
1.1 1 1 90 1
2.1 1 1 100 2
3.1 1 1 30 3
4.1 1 1 100 4
5.1 1 -35 80 5
6.1 1 1 100 6
1.2 2 90 100 1
2.2 2 120 150 2
3.2 2 90 100 3
4.2 2 120 140 4
5.2 2 90 100 5
6.2 2 200 300 6
1.3 3 120 150 1
2.3 3 200 400 2
3.3 3 120 135 3
4.3 3 200 400 4
5.3 3 130 150 5
6.3 3 360 400 6
1.4 4 200 400 1
2.4 4 NA NA 2
3.4 4 200 400 3
4.4 4 NA NA 4
5.4 4 200 400 5
6.4 4 NA NA 6
#-----------------
LContig <- reshape(Contig, direction="long", sep="",
varying=list(Start=names(Contig)[c(1,3,5,7)], Stop=names(Contig)[c(2,4,6,8)] ) )
plot(range(c(Contig$Start1, Contig$Stop1) , na.rm=TRUE ), c(1,6),
type="n", xlab="Segments", ylab="Groups")
with(LContig, segments(y0=id, y1=id, x0=Start1, x1=Stop1, col=2:7, lwd=3))
Here's a version using ggplot2:
# Never forget
options(stringsAsFactors = FALSE)
# Load ggplot2 and reshape2
library(ggplot2)
library(reshape2)
# Read in the data
contig <- read.table(
text = "id Start1 Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
1 1 90 90 100 120 150 200 400
2 1 100 120 150 200 400 NA NA
3 1 30 90 100 120 135 200 400
4 1 100 120 140 200 400 NA NA
5 -35 80 90 100 130 150 200 400
6 1 100 200 300 360 400 NA NA",
header = TRUE
)
# Reshape it
# Melt it all the way down - each data value is gets a record
# identified by id and variable name
contig.melt <- melt(contig, id.var = "id")
# Your variable names contain two pieces of information:
# whether this point is a start or a stop, and
# which span this point is associated with.
# Much easier to work with those separately, so I'll parse them
# into variables.
# Which span?
contig.melt$span <- gsub(x = contig.melt$variable,
pattern = ".*(\\d)",
replace = "\\1")
# Start or stop?
contig.melt$point <- gsub(x = contig.melt$variable,
pattern = "(.*)\\d",
replace = "\\1")
# Cast it back into a dataset with a record for each span
contig.long <- dcast(contig.melt, id + span ~ point)
# Plot it. The vertical position and line colors are determined by
# the ID. I'm calling that x here, but I'll flip the coords later
ggplot(contig.long, aes(x = id, color = factor(id))) +
# geom_linerange plots a line from Start (ymin) to stop (ymax)
# Control the width of the plot with size
geom_linerange(aes(ymin = Start, ymax = Stop), size = 2) +
# Flip the coordinates
coord_flip() +
# Make it pretty
scale_colour_brewer("RNA ID", palette = "Dark2") +
labs(x = "RNA ID", y = "Position") +
theme_bw()

Resources