Calculate mean of pairwise differences between ALL observations WITHIN group() in R?

I have a table like :
Fish Group Position
1 1 100
2 1 250
3 1 500
4 2 25
5 2 75
6 2 625
I have groups of fish with their position in the stream. To know how close they are, on average, I need to calculate the mean of the pairwise differences in position for all observations within a group.
For the fish of group 1, that is:
1-2 distance = 250 - 100 = 150
2-3 distance = 500 - 250 = 250
3-1 distance = 500 - 100 = 400
So the mean I am looking for is mean(c(150, 250, 400)), i.e. 266.67.
The tricky thing for me is to find a way to do it in the tidyverse philosophy!

If DF is your data, you can try this. Hope it helps:
library(dplyr)
DF %>%
  group_by(Group) %>%
  mutate(Diff = c(last(Position) - first(Position), diff(Position)))
# A tibble: 6 x 4
# Groups: Group [2]
Fish Group Position Diff
<int> <int> <int> <int>
1 1 1 100 400
2 2 1 250 150
3 3 1 500 250
4 4 2 25 600
5 5 2 75 50
6 6 2 625 550
Since the previous solution is just a sketch, try this modification and see if it works on your original data:
#Create list by group
L <- split(DF, DF$Group)
#Create function
compute_d <- function(x) {
  xv <- as.numeric(x$Position)
  dist(xv)
}
#Apply function
lapply(L, compute_d)
The results:
$`1`
    1   2
2 150
3 400 250

$`2`
    1   2
2  50
3 600 550
Or, modified even further (new version) to also return the per-group mean:
#Create list by group
L <- split(DF, DF$Group)
#Create function
compute_d <- function(x) {
  xv <- as.numeric(x$Position)
  y <- dist(xv)
  avg <- mean(y)
  y1 <- as.data.frame(as.matrix(y))
  y2 <- cbind(x, y1)
  y2$mean <- avg
  return(y2)
}
#Apply function
z <- do.call('rbind', lapply(L, compute_d))
rownames(z) <- NULL
Fish Group Position 1 2 3 mean
1 1 1 100 0 150 400 266.6667
2 2 1 250 150 0 250 266.6667
3 3 1 500 400 250 0 266.6667
4 4 2 25 0 50 600 400.0000
5 5 2 75 50 0 550 400.0000
6 6 2 625 600 550 0 400.0000
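If only the per-group mean is needed, a minimal tidyverse sketch (assuming DF as above) is:
library(dplyr)
DF %>%
  group_by(Group) %>%
  summarise(mean_dist = mean(as.numeric(dist(Position))))
This returns 266.67 for group 1 and 400 for group 2, matching the mean column above.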

Related

Inexact joining data based on greater equal condition

I have some values in df:
# A tibble: 7 × 1
var1
<dbl>
1 0
2 10
3 20
4 210
5 230
6 266
7 267
that I would like to compare to a second dataframe called value_lookup:
# A tibble: 4 × 2
var1 value
<dbl> <dbl>
1 0 0
2 200 10
3 230 20
4 260 30
In particular I would like to make a join based on >=, meaning that a value that is greater than or equal to the number in var1 gets the corresponding value. E.g. take the number 210 of the original dataframe. Since it is >= 200 and < 230, it would get a value of 10.
Here is the expected output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
I thought it should be doable using {fuzzyjoin} but I cannot get it done.
value_lookup <- tibble(var1 = c(0, 200, 230, 260),
                       value = c(0, 10, 20, 30))
df <- tibble(var1 = c(0, 10, 20, 210, 230, 266, 267))
library(fuzzyjoin)
fuzzyjoin::fuzzy_left_join(
  x = df,
  y = value_lookup,
  by = "var1",
  match_fun = list(`>=`)
)
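For what it's worth, the {fuzzyjoin} attempt can be made to work by matching on an interval instead of a single >= condition. A sketch, assuming a hypothetical helper column upper (the start of the next interval) is added to the lookup table:
library(dplyr)
library(fuzzyjoin)
# upper = start of the next interval, Inf for the last one
lookup_iv <- value_lookup %>%
  mutate(upper = lead(var1, default = Inf))
fuzzy_left_join(
  x = df,
  y = lookup_iv,
  by = c("var1" = "var1", "var1" = "upper"),
  match_fun = list(`>=`, `<`)
)
Each row of df then matches exactly one interval; the extra var1.y and upper columns can be dropped afterwards.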
An option is also findInterval:
df$value <- value_lookup$value[findInterval(df$var1, value_lookup$var1)]
Output:
var1 value
1 0 0
2 10 0
3 20 0
4 210 10
5 230 20
6 266 30
7 267 30
As you're mentioning joins, you could also do a rolling join via data.table with the argument roll = TRUE, which looks for the same or closest preceding value of var1 in your df:
library(data.table)
setDT(value_lookup)[setDT(df), on = 'var1', roll = TRUE]
You can use cut:
df$value <- value_lookup$value[cut(df$var1,
                                   c(value_lookup$var1, Inf),
                                   right = FALSE)]
# # A tibble: 7 x 2
# var1 value
# <dbl> <dbl>
# 1 0 0
# 2 10 0
# 3 20 0
# 4 210 10
# 5 230 20
# 6 266 30
# 7 267 30
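This works because a factor used as an index is treated as its underlying integer codes, which here are the interval numbers. A quick illustration with made-up inputs:
f <- cut(c(5, 250), c(0, 200, 230, 260, Inf), right = FALSE)
as.integer(f)
# [1] 1 3   -- positions into value_lookup$value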

Cumulative sum of values without calculating repeated values in a column

I have data like this in R:
x <- c(1,2,2,3,4,4,7,8)
y <- c(300,200,200,150,100,100,30,20)
df <- data.frame(x, y)
The dataset together with its cumulative sum Y is
cum_df <- data.frame(x, y, Y)
> cum_df
x y Y
1 1 300 300
2 2 200 500
3 2 200 700
4 3 150 850
5 4 100 950
6 4 100 1050
7 7 30 1080
8 8 20 1100
The cumulative of "y" using cumsum(y) is:
Y <- cumsum(y)
> Y
[1] 300 500 700 850 950 1050 1080 1100
Instead, I want the cumulative of "y" to be like this
> Y
[1] 300 500 500 650 750 750 780 800
In essence, repeated values of y should not be added again. How do I go about this in R? I have tried different functions but it seems not to work. I want the answer to look like this:
> ans
x y Y
1 1 300 300
2 2 200 500
3 2 200 500
4 3 150 650
5 4 100 750
6 4 100 750
7 7 30 780
8 8 20 800
We can get the distinct rows, do the cumsum and then do a join
library(dplyr)
df %>%
  distinct() %>%
  mutate(Y = cumsum(y)) %>%
  right_join(df)
# x y Y
#1 1 300 300
#2 2 200 500
#3 2 200 500
#4 3 150 650
#5 4 100 750
#6 4 100 750
#7 7 30 780
#8 8 20 800
Or, without any join, by replacing the duplicated values in 'y' with 0 and then doing the cumsum:
df %>%
  mutate(Y = cumsum(y * !duplicated(y)))
# x y Y
#1 1 300 300
#2 2 200 500
#3 2 200 500
#4 3 150 650
#5 4 100 750
#6 4 100 750
#7 7 30 780
#8 8 20 800
Or in base R
df$Y <- with(df, cumsum(y * !duplicated(y)))
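Note that duplicated(y) flags any repeated value of y, not only consecutive ones. If only consecutive repeats should be skipped (which is how the example reads, since the repeats are adjacent), a base R sketch would be:
# TRUE for the first element of each run of equal y values
first_in_run <- c(TRUE, df$y[-1] != df$y[-nrow(df)])
df$Y <- cumsum(df$y * first_in_run)
For this data both approaches agree, because no y value reappears after a different value in between.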

The value in one column depends in the value of another column

I want to set q2 to zero in all rows that have the number 2 in column q1. Anyone have a smart solution?
a <- rep(c(300, 450), each = 3, times = 2)
q1 <- rep(c(1, 1, 2, 1, 1, 2), 2)
q2 <- c(100, 40, "", 80, 30, "", 45, 78, "", 20, 58, "")
df <- cbind(a, q1, q2)
df <- as.data.frame(df)
Original input data :
> df
a q1 q2
1 300 1 100
2 300 1 40
3 300 2
4 450 1 80
5 450 1 30
6 450 2
7 300 1 45
8 300 1 78
9 300 2
10 450 1 20
11 450 1 58
12 450 2
Desired output :
> df
a q1 q2
1 300 1 100
2 300 1 40
3 300 2 0
4 450 1 80
5 450 1 30
6 450 2 0
7 300 1 45
8 300 1 78
9 300 2 0
10 450 1 20
11 450 1 58
12 450 2 0
An option would be to create a logical vector based on the column 'q1' and assign 0 to 'q2' where it is TRUE:
df$q2[df$q1 == 2] <- 0
df
# a q1 q2
#1 300 1 100
#2 300 1 40
#3 300 2 0
#4 450 1 80
#5 450 1 30
#6 450 2 0
#7 300 1 45
#8 300 1 78
#9 300 2 0
#10 450 1 20
#11 450 1 58
#12 450 2 0
Another option is replace
transform(df, q2 = replace(q2, q1 == 2, 0))
Note that cbind converts to a matrix first, so any character element anywhere results in the whole matrix being character. It would be better to use data.frame directly (see the data section below).
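A quick illustration of that coercion (with made-up values):
m <- cbind(a = c(1, 2), q2 = c("100", ""))
typeof(m)
# [1] "character"   -- the numeric column has been coerced to character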
Or in data.table
library(data.table)
# q2 is a character column here (see the cbind note above), hence the quoted '0'
setDT(df)[q1 == 2, q2 := '0']
data
df <- data.frame(a, q1, q2, stringsAsFactors = FALSE)

How to calculate percentage in R?

I am a newbie to R and I have a data frame which contains the following fields:
day place hour time_spent count
1 1 1 1 120
1 1 1 2 100
1 1 1 3 90
1 1 1 4 80
So my aim is to calculate, for each place, the time spent by which 75% of the vehicles have crossed it. From this data frame I generate the data frame below:
day place hour time_spent count cum_count percentage
1 1 1 1 120 120 30.7%
1 1 1 2 100 220 56.4%
1 1 1 3 90 310 79%
1 1 1 4 80 390 100%
df$cum_count <- cumsum(df$count)
df$percentage <- 100 * df$cum_count / sum(df$count)
for (i in 1:length(df$percentage)) {
  if (df$percentage[i] > 75) {
    low_time <- df$time_spent[i - 1]
    high_time <- df$time_spent[i]
    break
  }
}
This means that 75% of vehicles spend 2-3 minutes at place 1. But now I have a data frame like this, which covers all the places and all the days:
day place hour time_spent count
1 1 1 1 120
1 1 1 2 100
1 1 1 3 90
1 1 1 4 80
1 2 1 1 220
1 2 1 2 100
1 2 1 3 90
1 2 1 4 80
1 3 1 1 100
1 3 1 2 80
1 3 1 3 90
1 3 1 4 100
2 1 1 1 120
2 1 1 2 100
2 1 1 3 90
2 1 1 4 80
2 2 1 1 220
2 2 1 2 100
2 2 1 3 90
2 2 1 4 80
2 3 1 1 100
2 3 1 2 80
2 3 1 3 90
2 3 1 4 100
How is it possible to calculate the high time and low time for each place? Any help is appreciated.
The max and min functions ought to do the trick here, although you could also use summary to get the median, mean, etc. in one go. I'd also recommend the quantile function for these percentages. As is usually the case with R, the tricky part is getting the data in the correct format.
Say you want the total time spent at each place:
index <- sort(unique(df$place))
times <- as.list(rep(NA, length(index)))
names(times) <- index
for (ii in index) {
  counter <- c()
  # expand each time_spent value by its count at this place
  for (jj in df[df$place == ii, ]$time_spent) {
    counter <- c(counter, rep(jj, df[df$place == ii, ]$count[jj]))
  }
  times[[ii]] <- counter
}
Now for each place you can compute the max and min with:
lapply(times, max)
lapply(times, min)
Similarly you can compute the mean:
lapply(times, function(x) sum(x)/length(x))
lapply(times, mean)
I think what you want are the quantiles:
lapply(times, quantile, 0.75)
This would be the time by which at least 75% of vehicles had passed through a place, i.e., 75% of vehicles took this time or less to pass through.
We can use a group by operation
library(dplyr)
dfN %>%
  group_by(day, place) %>%
  mutate(cum_count = cumsum(count),
         percentage = 100 * cum_count / sum(count),
         low_time = time_spent[which.max(percentage > 75) - 1],
         high_time = time_spent[low_time + 1])
If I understood your question correctly (you want the min and max value of time_spent in a place):
df %>%
  group_by(place) %>%
  summarise(min(time_spent),
            max(time_spent))
will give you this:
place min(time_spent) max(time_spent)
1 1 4
2 1 4
3 1 4
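Combining the grouped approach with the quantile idea, a sketch (assuming the full data is in df, and dplyr >= 1.0 for the .groups argument) that reports the time by which 75% of vehicles have crossed each place on each day:
library(dplyr)
df %>%
  group_by(day, place) %>%
  # expand time_spent by count so the quantile is weighted by vehicles
  summarise(time_75 = quantile(rep(time_spent, count), 0.75),
            .groups = "drop")
For day 1, place 1 this gives 3, matching the 2-3 minute window found above.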

How to make a timeline/waterfall like plot in R for gene/genome coverage

I would like to make a relatively simple plot (reminiscent of timelines such as this: http://www.ats.ucla.edu/stat/sas/code/timeline.gif), but instead of time on the x-axis, it will be base positions in a genome. The "time spans" will be coverage distances for DNA-sequence scaffolds, showing the spans of where they fall in the genome, where they overlap, and places with no coverage. Here is a crude mock-up of what I am looking for, showing contig coverage of rRNAs (I left out, but need, an x-axis showing the positions of the starts and stops, and labels for the contigs (colored lines)): http://i.imgur.com/MDABx.png , with the following coordinates:
Contig# Start1 Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
1 1 90 90 100 120 150 200 400
2 1 100 120 150 200 400 NA NA
3 1 30 90 100 120 135 200 400
4 1 100 120 140 200 400 NA NA
5 -35 80 90 100 130 150 200 400
6 1 100 200 300 360 400 NA NA
I am pretty sure this can be done in R, probably with ggplot2, but for some reason I cannot figure it out.
This is not going to be as organized as your plot, but it puts the lines in with random coordinates (you had yet to provide yours):
dfr <- data.frame(seg = sample(1:6, 20, replace = TRUE),
                  start = sample(1:100, 20, replace = TRUE),
                  end = sample(1:100, 20, replace = TRUE))
plot(c(1, 100), c(1, 6), type = "n")
with(dfr, segments(y0 = seg, y1 = seg, x0 = start, x1 = end, col = 2:7, lwd = 3))
With new dataset:
Contig <- read.table(text=" Start1 Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
1 1 90 90 100 120 150 200 400
2 1 100 120 150 200 400 NA NA
3 1 30 90 100 120 135 200 400
4 1 100 120 140 200 400 NA NA
5 -35 80 90 100 130 150 200 400
6 1 100 200 300 360 400 NA NA")
# the reshape function can be tricky.... but seems to finally work.
reshape(Contig, direction = "long", sep = "",
        varying = list(Start = names(Contig)[c(1, 3, 5, 7)],
                       Stop  = names(Contig)[c(2, 4, 6, 8)]))
#------------------------------
time Start1 Stop1 id
1.1 1 1 90 1
2.1 1 1 100 2
3.1 1 1 30 3
4.1 1 1 100 4
5.1 1 -35 80 5
6.1 1 1 100 6
1.2 2 90 100 1
2.2 2 120 150 2
3.2 2 90 100 3
4.2 2 120 140 4
5.2 2 90 100 5
6.2 2 200 300 6
1.3 3 120 150 1
2.3 3 200 400 2
3.3 3 120 135 3
4.3 3 200 400 4
5.3 3 130 150 5
6.3 3 360 400 6
1.4 4 200 400 1
2.4 4 NA NA 2
3.4 4 200 400 3
4.4 4 NA NA 4
5.4 4 200 400 5
6.4 4 NA NA 6
#-----------------
LContig <- reshape(Contig, direction = "long", sep = "",
                   varying = list(Start = names(Contig)[c(1, 3, 5, 7)],
                                  Stop  = names(Contig)[c(2, 4, 6, 8)]))
# use the stacked (long) columns so the x-axis spans all segments, not just the first pair
plot(range(c(LContig$Start1, LContig$Stop1), na.rm = TRUE), c(1, 6),
     type = "n", xlab = "Segments", ylab = "Groups")
with(LContig, segments(y0 = id, y1 = id, x0 = Start1, x1 = Stop1, col = 2:7, lwd = 3))
Here's a version using ggplot2:
# Never forget
options(stringsAsFactors = FALSE)
# Load ggplot2 and reshape2
library(ggplot2)
library(reshape2)
# Read in the data
contig <- read.table(
text = "id Start1 Stop1 Start2 Stop2 Start3 Stop3 Start4 Stop4
1 1 90 90 100 120 150 200 400
2 1 100 120 150 200 400 NA NA
3 1 30 90 100 120 135 200 400
4 1 100 120 140 200 400 NA NA
5 -35 80 90 100 130 150 200 400
6 1 100 200 300 360 400 NA NA",
header = TRUE
)
# Reshape it
# Melt it all the way down - each data value is gets a record
# identified by id and variable name
contig.melt <- melt(contig, id.var = "id")
# Your variable names contain two pieces of information:
# whether this point is a start or a stop, and
# which span this point is associated with.
# Much easier to work with those separately, so I'll parse them
# into variables.
# Which span?
contig.melt$span <- gsub(x = contig.melt$variable,
                         pattern = ".*(\\d)",
                         replace = "\\1")
# Start or stop?
contig.melt$point <- gsub(x = contig.melt$variable,
                          pattern = "(.*)\\d",
                          replace = "\\1")
# Cast it back into a dataset with a record for each span
contig.long <- dcast(contig.melt, id + span ~ point)
# Plot it. The vertical position and line colors are determined by
# the ID. I'm calling that x here, but I'll flip the coords later
ggplot(contig.long, aes(x = id, color = factor(id))) +
  # geom_linerange plots a line from Start (ymin) to Stop (ymax);
  # control the line thickness with size
  geom_linerange(aes(ymin = Start, ymax = Stop), size = 2) +
  # Flip the coordinates
  coord_flip() +
  # Make it pretty
  scale_colour_brewer("RNA ID", palette = "Dark2") +
  labs(x = "RNA ID", y = "Position") +
  theme_bw()
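As an aside, {reshape2} is retired these days; a sketch of the same reshape using tidyr instead (assuming tidyr >= 1.0 for pivot_longer()/pivot_wider()):
library(tidyr)
library(dplyr)
contig.long <- contig %>%
  pivot_longer(-id,
               names_to = c("point", "span"),
               names_pattern = "(Start|Stop)(\\d)") %>%
  pivot_wider(names_from = point, values_from = value)
The resulting contig.long has the same id/span/Start/Stop columns and feeds the same ggplot() call as above.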
