Split intervals (genomic regions) in individual numbers (nucleotides) - r

I would like to transform my data frame df, which is based on regions, into point-by-point (number-by-number, or nucleotide-by-nucleotide) information.
My input df:
start end state freq
100 103 1nT 22
100 103 3nT 34
104 106 1nT 12
104 106 3nT 16
My expected output:
position state freq
100 1nT 22
101 1nT 22
102 1nT 22
103 1nT 22
100 3nT 34
101 3nT 34
102 3nT 34
103 3nT 34
104 1nT 12
105 1nT 12
106 1nT 12
104 3nT 16
105 3nT 16
106 3nT 16
Any ideas? Thank you very much.

Here is a vectorized approach:
# load your data
df <- read.table(textConnection("start end state freq
100 103 1nT 22
100 103 3nT 34
104 106 1nT 12
104 106 3nT 16"), header=TRUE)
# extract number of needed replications
n <- df$end - df$start + 1
# calculate position and replicate state/freq
res <- data.frame(position = rep(df$start - 1, n) + sequence(n),
state = rep(df$state, n),
freq = rep(df$freq, n))
res
# position state freq
# 1 100 1nT 22
# 2 101 1nT 22
# 3 102 1nT 22
# 4 103 1nT 22
# 5 100 3nT 34
# 6 101 3nT 34
# 7 102 3nT 34
# 8 103 3nT 34
# 9 104 1nT 12
# 10 105 1nT 12
# 11 106 1nT 12
# 12 104 3nT 16
# 13 105 3nT 16
# 14 106 3nT 16
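The key step above is rep(df$start - 1, n) + sequence(n): sequence(n) restarts a 1-based counter for every interval, and adding the repeated offset turns that counter into absolute positions. A minimal sketch of just that step, using two made-up intervals (the variable names start, end and positions are only for illustration):

```r
# Two hypothetical intervals: 100-103 and 104-106
start <- c(100, 104)
end   <- c(103, 106)

n <- end - start + 1                 # interval lengths: 4 and 3
counter <- sequence(n)               # 1 2 3 4 1 2 3
positions <- rep(start - 1, n) + counter
positions                            # 100 101 102 103 104 105 106
```

Because everything is done on whole vectors, no explicit loop over rows is needed.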

Here is one approach....
Build your data
require(data.table)
fakedata <- data.table(start=c(100,100,104,104),
end=c(103,103,106,106),
state=c("1nT","3nT","1nT","3nT"),
freq=c(22,34,12,16))
Perform calculation
fakedata[ , dur := (end-start+1)]
outdata <- fakedata[ , lapply(.SD,function(x) rep(x,dur))]
outdata[ , position := (start-1)+1:.N, by=list(start,end,state)]
And the output
start end state freq dur position
1: 100 103 1nT 22 4 100
2: 100 103 1nT 22 4 101
3: 100 103 1nT 22 4 102
4: 100 103 1nT 22 4 103
5: 100 103 3nT 34 4 100
6: 100 103 3nT 34 4 101
7: 100 103 3nT 34 4 102
8: 100 103 3nT 34 4 103
9: 104 106 1nT 12 3 104
10: 104 106 1nT 12 3 105
11: 104 106 1nT 12 3 106
12: 104 106 3nT 16 3 104
13: 104 106 3nT 16 3 105
14: 104 106 3nT 16 3 106

This can be accomplished with a simple apply command.
Let's build this in sequence:
You want to perform an operation based on every row, so apply by row should be your first thought (or for loop). So we know we want to use apply(data, 1, row.function).
Think of what you would want to do for a single row. You want to repeat state and freq for every number between start and end.
To get the range of numbers between start and end we can use the colon operator, start:end.
Now, R will automatically repeat the values in a vector to match the longest vector length when creating a data.frame. Note that apply() first converts the data frame to a character matrix (because state is text), so the numeric fields need as.numeric(). We can create the piece from a single row like this:
data.frame(position=as.numeric(row['start']):as.numeric(row['end']), state=unname(row['state']), freq=as.numeric(row['freq']))
Then we want to bind it all together, so we use do.call('rbind', result).
Putting this all together now, we have:
do.call('rbind',
apply(data, 1, function(row) {
# each row arrives as a named character vector
data.frame(position=as.numeric(row['start']):as.numeric(row['end']),
state=unname(row['state']), freq=as.numeric(row['freq']))
}))
Which will give you what you want. Hopefully this helps teach you how to approach problems like this in the future too!
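As a variation on the row-wise idea, here is a minimal sketch that uses Map() instead of apply(). This is a swapped-in technique, not what the answer above uses: Map() walks the columns in parallel, so each value keeps its own type and no character conversion takes place. The names pieces and res are only for illustration.

```r
df <- data.frame(start = c(100, 100, 104, 104),
                 end   = c(103, 103, 106, 106),
                 state = c("1nT", "3nT", "1nT", "3nT"),
                 freq  = c(22, 34, 12, 16))

# Build one small data frame per input row; state and freq are recycled
# to the length of start:end by data.frame()
pieces <- Map(function(start, end, state, freq) {
  data.frame(position = start:end, state = state, freq = freq)
}, df$start, df$end, df$state, df$freq)

# Bind the pieces together and drop the automatic row names
res <- do.call(rbind, pieces)
rownames(res) <- NULL
res
```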

Here's a rough implementation using a for loop.
a = t(matrix(c(100, 103, "1nT" , 22,
100, 103 , "3nT" , 34,
104, 106 , "1nT" , 12,
104, 106 , "3nT" , 16), nrow = 4))
a = data.frame(a, stringsAsFactors = FALSE)
colnames(a) = c("start", "end" , "state", "freq")
# the matrix was built from mixed values, so every column starts as character
a$start = as.numeric(a$start)
a$end = as.numeric(a$end)
a$freq = as.numeric(a$freq)
n = dim(a)[1]
res = NULL
for (i in 1:n) {
position = a$start[i]:a$end[i]
state = rep(a$state[i], length(position))
freq = rep(a$freq[i], length(position))
temp = cbind.data.frame(position, state, freq)
res = rbind(res, temp)
}
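Calling rbind() inside the loop copies the growing result on every iteration. A common alternative, sketched here under the assumption that the input looks like the question's data, is to collect the pieces in a preallocated list and bind them once at the end (pieces is a name introduced only for this sketch):

```r
a <- data.frame(start = c(100, 100, 104, 104),
                end   = c(103, 103, 106, 106),
                state = c("1nT", "3nT", "1nT", "3nT"),
                freq  = c(22, 34, 12, 16))

# One list slot per input row
pieces <- vector("list", nrow(a))
for (i in seq_len(nrow(a))) {
  position <- a$start[i]:a$end[i]
  pieces[[i]] <- data.frame(position = position,
                            state = a$state[i],
                            freq = a$freq[i])
}

# A single rbind over all pieces
res <- do.call(rbind, pieces)
res
```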


R - Reducing a matrix

I have a square matrix that is like:
A <- c("111","111","111","112","112","113")
B <- c(100,10,20,NA,NA,10)
C <- c(10,20,40,NA,10,20)
D <- c(10,20,NA,NA,40,200)
E <- c(20,20,40,10,10,20)
F <- c(NA,NA,40,100,10,20)
G <- c(10,20,NA,30,10,20)
df <- data.frame(A,B,C,D,E,F,G)
names(df) <- c("Codes","111","111","111","112","112","113")
# Codes 111 111 111 112 112 113
# 1 111 100 10 10 20 NA 10
# 2 111 10 20 20 20 NA 20
# 3 111 20 40 NA 40 40 NA
# 4 112 NA NA NA 10 100 30
# 5 112 NA 10 40 10 10 10
# 6 113 10 20 200 20 20 20
I want to reduce it so that observations with the same row and column names are summed up.
So I want to end up with:
# Codes 111 112 113
# 1 111 230 120 30
# 2 112 50 130 40
# 3 113 230 40 20
I tried to first combine the rows with the same "Codes" number, but I was having a lot of trouble.
Using the tidyverse:
library(tidyverse)
df %>%
pivot_longer(-Codes, values_drop_na = TRUE) %>%
group_by(Codes, name) %>%
summarise(value = sum(value), .groups = 'drop')%>%
pivot_wider()
# A tibble: 3 x 4
Codes `111` `112` `113`
<chr> <dbl> <dbl> <dbl>
1 111 230 120 30
2 112 50 130 40
3 113 230 40 20
One way in base R:
tapply(unlist(df[-1]), list(names(df)[-1][col(df[-1])], df[,1][row(df[-1])]), sum, na.rm = TRUE)
111 112 113
111 230 50 230
112 120 130 40
113 30 40 20
Note that this can be simplified, as noted by @thelatemail, to
grp <- expand.grid(df$Codes, names(df)[-1])
tapply(unlist(df[-1]), grp, FUN=sum, na.rm=TRUE)
You can also use xtabs:
xtabs(vals~., na.omit(cbind(grp, vals = unlist(df[-1]))))
Var2
Var1 111 112 113
111 230 120 30
112 50 130 40
113 230 40 20
When dealing with actual matrices, especially large ones, expressing the operation as (sparse) linear algebra should be the most efficient approach.
library(Matrix) ## for sparse matrix operations
idx <- c("111","111","111","112","112","113")
mat <- matrix(c(100,10,20,NA,NA,10,
10,20,40,NA,10,20,
10,20,NA,NA,40,200,
20,20,40,10,10,20,
NA,NA,40,100,10,20,
10,20,NA,30,10,20),
nrow=length(idx),
byrow=TRUE, dimnames=list(idx, idx))
## convert NA's to zero
mat[is.na(mat)] <- 0
## examine matrix
mat
## 111 111 111 112 112 113
## 111 100 10 20 0 0 10
## 111 10 20 40 0 10 20
## 111 10 20 0 0 40 200
## 112 20 20 40 10 10 20
## 112 0 0 40 100 10 20
## 113 10 20 0 30 10 20
## indicator matrix
## converts between "code" and "idx" spaces
M_code_idx <- fac2sparse(idx)
## project to "code_code" space
M_code_idx %*% mat %*% t(M_code_idx)
## 3 x 3 Matrix of class "dgeMatrix"
## 111 112 113
## 111 230 50 230
## 112 120 130 40
## 113 30 40 20
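If loading the Matrix package is not an option, the same projection can be sketched in base R by summing over the row groups and then over the column groups with rowsum(). This is an alternative to the approach above, not a replacement for it on large sparse data; row_summed and res are names introduced for this sketch.

```r
idx <- c("111", "111", "111", "112", "112", "113")
mat <- matrix(c(100, 10, 20, NA, NA, 10,
                10, 20, 40, NA, 10, 20,
                10, 20, NA, NA, 40, 200,
                20, 20, 40, 10, 10, 20,
                NA, NA, 40, 100, 10, 20,
                10, 20, NA, 30, 10, 20),
              nrow = length(idx),
              byrow = TRUE, dimnames = list(idx, idx))
mat[is.na(mat)] <- 0

# Collapse rows that share a code (6 x 6 becomes 3 x 6)
row_summed <- rowsum(mat, group = idx)
# Collapse columns the same way by working on the transpose (3 x 3)
res <- t(rowsum(t(row_summed), group = idx))
res
```

The result has one row and one column per distinct code, matching the matrix product shown above.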

What is the best way to assign detection history using the following values?

I have three years of detection data. In each year there are 8 possible outcomes at a site. These are no, a, n, na, l, la, ln, lna. I've assigned the values below:
no = 0
a = 1
n = 1
na = 2
l = 100
la = 101
ln = 101
lna = 102
In year 2, I wish to calculate and label all outcomes, so any combination of 2 of the terms above, to describe a detection history numerically.
So essentially I'm trying to get a list of 64 terms ranging from no,no to lna,lna with their respective values.
For example, no,no = 0 and lna,lna = 204
In year 3, I wish for the same. I'd like to calculate and label all possibilities. This needs to be arranged in two columns, one with history text, and one with history value.
x1 x2
no,no,no 0
I'm sure this is possible, and possibly even basic. Though I have no idea where to begin.
Any help would be greatly appreciated.
Thanks in advance
I'm sure there are more elegant, concise ways to do it, but here's one approach:
Define the two lists of possibilities
poss = c("no", "a", "n", "na", "l", "la", "ln", "lna")
vals = c(0, 1, 1, 2, 100, 101, 101, 102)
Use expand.grid to enumerate the possibilities
output <- expand.grid(poss, poss, stringsAsFactors = FALSE)
comb_values <- expand.grid(vals, vals)
Write the output
output$names <- paste(output$Var1, output$Var2, sep = ",")
output$value <- comb_values$Var1 + comb_values$Var2
output$Var1 <- output$Var2 <- NULL
Result
names value
1 no,no 0
2 a,no 1
3 n,no 1
4 na,no 2
5 l,no 100
6 la,no 101
7 ln,no 101
8 lna,no 102
9 no,a 1
10 a,a 2
11 n,a 2
12 na,a 3
13 l,a 101
14 la,a 102
15 ln,a 102
16 lna,a 103
17 no,n 1
18 a,n 2
19 n,n 2
20 na,n 3
21 l,n 101
22 la,n 102
23 ln,n 102
24 lna,n 103
25 no,na 2
26 a,na 3
27 n,na 3
28 na,na 4
29 l,na 102
30 la,na 103
31 ln,na 103
32 lna,na 104
33 no,l 100
34 a,l 101
35 n,l 101
36 na,l 102
37 l,l 200
38 la,l 201
39 ln,l 201
40 lna,l 202
41 no,la 101
42 a,la 102
43 n,la 102
44 na,la 103
45 l,la 201
46 la,la 202
47 ln,la 202
48 lna,la 203
49 no,ln 101
50 a,ln 102
51 n,ln 102
52 na,ln 103
53 l,ln 201
54 la,ln 202
55 ln,ln 202
56 lna,ln 203
57 no,lna 102
58 a,lna 103
59 n,lna 103
60 na,lna 104
61 l,lna 202
62 la,lna 203
63 ln,lna 203
64 lna,lna 204
The same logic applies for three years: just replace poss, poss with poss, poss, poss (and vals, vals with vals, vals, vals), then include Var3 in both the paste and the sum.
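A minimal sketch of the three-year case, using the values given in the question (no = 0, a = 1, n = 1, na = 2, l = 100, la = 101, ln = 101, lna = 102). The column names y1, y2, y3 and the object names grid and history are assumptions made for this example; x1 and x2 follow the layout the question asks for.

```r
poss <- c("no", "a", "n", "na", "l", "la", "ln", "lna")
# Named lookup vector: outcome label -> numeric value
vals <- c(no = 0, a = 1, n = 1, na = 2, l = 100, la = 101, ln = 101, lna = 102)

# All 8^3 = 512 combinations of outcomes over three years
grid <- expand.grid(y1 = poss, y2 = poss, y3 = poss, stringsAsFactors = FALSE)

history <- data.frame(
  x1 = paste(grid$y1, grid$y2, grid$y3, sep = ","),          # history text
  x2 = unname(vals[grid$y1] + vals[grid$y2] + vals[grid$y3]) # history value
)
head(history)
```

Looking the values up by name keeps the labels and the numbers in step, so a second expand.grid over the values is not needed.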

How to use mutate_at() with two sets of variables, in R

Using dplyr, I want to divide a column by another one, where the two columns have a similar pattern.
I have the following data frame:
My_data = data.frame(
var_a = 101:110,
var_b = 201:210,
number_a = 1:10,
number_b = 21:30)
I would like to create a new variable: var_a_new = var_a/number_a, var_b_new = var_b/number_b and so on if I have c, d etc.
My_data %>%
mutate_at(
.vars = c('var_a', 'var_b'),
.funs = list( new = function(x) x/(.[,paste0('number_a', names(x))]) ))
I did not get an error, but the result is wrong. I think the problem is that I don't understand what 'x' is. Is it one of the strings in .vars? Is it a column in My_data? Something else?
One option could be:
bind_cols(My_data,
My_data %>%
transmute(across(starts_with("var"))/across(starts_with("number"))) %>%
rename_all(~ paste0(., "_new")))
var_a var_b number_a number_b var_a_new var_b_new
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
You can do this directly in base R, provided the columns are correctly ordered, meaning that "var_a" is the first column in the "var" group and "number_a" is the first column in the "number" group, and so on for the other pairs.
var_cols <- grep('var', names(My_data), value = TRUE)
number_cols <- grep('number', names(My_data), value = TRUE)
My_data[paste0(var_cols, '_new')] <- My_data[var_cols]/My_data[number_cols]
My_data
# var_a var_b number_a number_b var_a_new var_b_new
#1 101 201 1 21 101.00000 9.571429
#2 102 202 2 22 51.00000 9.181818
#3 103 203 3 23 34.33333 8.826087
#4 104 204 4 24 26.00000 8.500000
#5 105 205 5 25 21.00000 8.200000
#6 106 206 6 26 17.66667 7.923077
#7 107 207 7 27 15.28571 7.666667
#8 108 208 8 28 13.50000 7.428571
#9 109 209 9 29 12.11111 7.206897
#10 110 210 10 30 11.00000 7.000000
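If the column order cannot be relied on, one option is to pair the columns by their shared suffix instead of by position. The following is a sketch under the assumption that every var_<suffix> column has a matching number_<suffix> column; the suffixes variable and the loop are illustrative, not part of the answer above.

```r
My_data <- data.frame(
  var_a = 101:110,
  var_b = 201:210,
  number_a = 1:10,
  number_b = 21:30)

# Extract the suffixes ("a", "b", ...) from the var_ columns
suffixes <- sub("^var_", "", grep("^var_", names(My_data), value = TRUE))

# Divide each var_<suffix> by the number_<suffix> with the same suffix
for (s in suffixes) {
  My_data[[paste0("var_", s, "_new")]] <-
    My_data[[paste0("var_", s)]] / My_data[[paste0("number_", s)]]
}
My_data
```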
The function across() has replaced scope variants such as mutate_at(), summarize_at() and others. For more details, see vignette("colwise") or https://cran.r-project.org/web/packages/dplyr/vignettes/colwise.html. Based on tmfmnk's answer, the following works well:
My_data %>%
mutate(
new = across(starts_with("var"))/across(starts_with("number")))
The prefix "new." will be added to the names of the new variables.
var_a var_b number_a number_b new.var_a new.var_b
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000

correct way to add columns to data frame without loop

I have this "d" data frame that has 2 groups. In real life I have 20 groups.
d= data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d
group value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 101
12 B 102
13 B 103
14 B 104
15 B 105
16 B 106
17 B 107
18 B 108
19 B 109
20 B 110
21 A 10000
I'd like to add 2 columns, "upper" and "lower", that are calculated at the group level. Since there are only 2 groups, I can add the columns manually like this:
d= data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d
d$upper = ifelse(d$group=="A", quantile(d$value[d$group=="A"])[4]+ 2.5*IQR(d$value[d$group=="A"]), quantile(d$value[d$group=="B"])[4]+ 2.5*IQR(d$value[d$group=="B"]) )
d$lower = ifelse(d$group=="A", quantile(d$value[d$group=="A"])[4]- 2.5*IQR(d$value[d$group=="A"]), quantile(d$value[d$group=="B"])[4]- 2.5*IQR(d$value[d$group=="B"]) )
group value upper lower
1 A 1 21 -4.0
2 A 2 21 -4.0
3 A 3 21 -4.0
4 A 4 21 -4.0
5 A 5 21 -4.0
6 A 6 21 -4.0
7 A 7 21 -4.0
8 A 8 21 -4.0
9 A 9 21 -4.0
10 A 10 21 -4.0
11 B 101 119 96.5
12 B 102 119 96.5
13 B 103 119 96.5
14 B 104 119 96.5
15 B 105 119 96.5
16 B 106 119 96.5
17 B 107 119 96.5
18 B 108 119 96.5
19 B 109 119 96.5
20 B 110 119 96.5
21 A 10000 21 -4.0
But when I have 20 or 30 groups, what's the best way to add these columns without using a loop?
Groupwise operations can easily be done using dplyr's group_by function:
library(dplyr)
d <- data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d %>%
group_by(group) %>%
mutate(upper=quantile(value, 0.75) + 2.5*IQR(value),
lower=quantile(value, 0.75) - 2.5*IQR(value))
This splits the data frame by the "group" variable and then computes the "upper" and "lower" columns separately for each group.
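For comparison, the same group-wise calculation can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. The helper names upper_fence and lower_fence are made up for this example.

```r
d <- data.frame(group = c(rep("A", 10), rep("B", 10), "A"),
                value = c(1:10, 101:110, 10000))

# Third quartile plus or minus 2.5 times the interquartile range
upper_fence <- function(v) unname(quantile(v, 0.75)) + 2.5 * IQR(v)
lower_fence <- function(v) unname(quantile(v, 0.75)) - 2.5 * IQR(v)

# ave() computes the value per group and recycles it to every row of that group
d$upper <- ave(d$value, d$group, FUN = upper_fence)
d$lower <- ave(d$value, d$group, FUN = lower_fence)
d
```

This should reproduce the upper and lower columns from the manual ifelse() version in the question, however many groups there are.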

Change labels from numeric to numeric

I have the following dataset.
dat2 <- read.table(header=TRUE, text="
ID De Ep Ti ID1
1123 113 121 100 11231
1123 105 107 110 11232
1134 122 111 107 11241
1134 117 120 111 11242
1154 122 116 109 11243
1165 108 111 118 11251
1175 106 115 113 11252
1185 113 104 108 11253
1226 109 119 116 11261
")
dat2
ID De Ep Ti ID1
1 1123 113 121 100 11231
2 1123 105 107 110 11232
3 1134 122 111 107 11241
4 1134 117 120 111 11242
5 1154 122 116 109 11243
6 1165 108 111 118 11251
7 1175 106 115 113 11252
8 1185 113 104 108 11253
9 1226 109 119 116 11261
I want to recode the first two columns with the following numeric labels, but cut() turns them into factors.
dat2$ID <- cut(dat2$ID, breaks=c(0,1124,1154,1184,Inf),
labels=c(5, 25, 55, 75))
table(dat2$ID)
5 25 55 75
2 3 2 2
dat2$De <- cut(dat2$De, breaks=c(0,110,118,125,Inf),
labels=c(10, 20, 30, 40))
table(dat2$De)
10 20 30 40
4 3 2 0
str(dat2)
'data.frame': 9 obs. of 5 variables:
$ ID : Factor w/ 4 levels "5","25","55",..: 1 1 2 2 2 3 3 4 4
$ De : Factor w/ 4 levels "10","20","30",..: 2 1 3 2 3 1 1 2 1
$ Ep : int 121 107 111 120 116 111 115 104 119
$ Ti : int 100 110 107 111 109 118 113 108 116
$ ID1: int 11231 11232 11241 11242 11243 11251 11252 11253 11261
I used as.numeric to convert them back to numeric, but that produces new labels (1, 2, 3, ...), which I don't want. I need a simple line of code to do the transformation.
dat2$ID <- as.numeric(dat2$ID)
table(dat2$ID)
1 2 3 4
2 3 2 2
dat2$De <- as.numeric(dat2$De)
table(dat2$De)
1 2 3
4 3 2
In your case it will probably be more efficient to use findInterval directly, instead of converting numeric values to factors and then back to numeric values, as shown here:
c(5, 25, 55, 75)[findInterval(dat2$ID, c(0, 1124, 1154, 1184, Inf))]
## [1] 5 5 25 25 55 55 55 75 75
Or (as per the second column)
c(10, 20, 30, 40)[findInterval(dat2$De, c(0, 110, 118, 125, Inf))]
## [1] 20 10 30 20 30 10 10 20 10
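This is similar to using cut but returns the numeric values directly. Note that findInterval uses left-closed intervals by default whereas cut uses right-closed ones, so a value that sits exactly on a break (1154 here) is binned differently; passing left.open = TRUE to findInterval reproduces the cut result.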
cut(dat2$ID, breaks=c(0, 1124, 1154, 1184, Inf), labels=c(5, 25, 55, 75))
# [1] 5 5 25 25 25 55 55 75 75
# Levels: 5 25 55 75
Here's a quick benchmark showing a roughly 18x speed improvement:
set.seed(123)
x <- sample(1e8, 1e7, replace = TRUE)
system.time({
res1 <- cut(x, breaks = c(0, 1e4, 1e5, 1e6, Inf), labels = c(5, 25, 55, 75))
res1 <- as.numeric(levels(res1))[res1]
})
# user system elapsed
# 3.40 0.09 3.51
system.time(res2 <- c(5, 25, 55, 75)[findInterval(x, c(0, 1e4, 1e5, 1e6, Inf))])
# user system elapsed
# 0.18 0.03 0.20
identical(res1, res2)
## [1] TRUE
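One detail worth checking before swapping cut for findInterval is how values that fall exactly on a break are handled. The sketch below uses three IDs picked for illustration (1154 sits on a break) and shows the default behaviour next to the left.open = TRUE variant; the object names are assumptions for this example.

```r
ids    <- c(1123, 1154, 1185)
breaks <- c(0, 1124, 1154, 1184, Inf)
labels <- c(5, 25, 55, 75)

# Default: intervals are closed on the left, [1154, 1184) -> 55
default_bins <- labels[findInterval(ids, breaks)]
default_bins    # 5 55 75

# left.open = TRUE: intervals are closed on the right, (1124, 1154] -> 25
right_closed_bins <- labels[findInterval(ids, breaks, left.open = TRUE)]
right_closed_bins    # 5 25 75

# cut() is right-closed by default, so it agrees with the second version
cut_bins <- as.numeric(as.character(cut(ids, breaks = breaks, labels = labels)))
cut_bins    # 5 25 75
```

When no data value coincides with a break, both findInterval variants give the same result as cut.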
