Apply function to column by segments in R - r

I have a function f that needs to be applied to a single column of length n in segments of m length, where m divides n. (For example, to a column of 1000 values, apply f to the first 250 values, then to 250-500, ...).
A loop is overkill, since the column has over 16 million values. I was thinking the efficient way would be to separate the column of length n into q vectors of length m, where mq = n. Then I could apply f simultaneously to all these vectors using some lapply-like functionality. Then I could join the q vectors to obtain the transformed version of the column.
Is that the efficient way to go here? If so, what function could decompose a column into q vectors of equal length and what function should I use to broadcast f across the q vectors?
Lastly, although less importantly, what if we wanted to do this to several columns and not just one?
Context
I've programmed a function that computes the power spectrum of an EEG signal (a numeric vector). However, it is bad practice to compute the power spectrum of a whole signal at once. The correct method is to compute it epoch by epoch, in 30 or 5 second segments, and average the spectrum of all those epochs. Hence why I need to apply a function to a column (an EEG signal) by epochs (or segments).

One way to do it is to create an auxiliary variable that labels each segment. You can then transform each column segment by segment or, depending on your function, combine it with group_by and/or summarize. An example:
library(dplyr)

df <- data.frame(
  x = rnorm(15),
  y = rnorm(15),
  z = rnorm(15)
)

df %>%
  mutate(
    # aux labels the three segments of five rows each
    aux = rep(1:3, each = nrow(df) / 3),
    # add 2 * (segment label) to every column
    across(.cols = c(x, y, z), .fns = ~ . + 2 * aux)
  )
          x        y        z aux
1  2.164841 2.882465 2.139098   1
2  2.364115 2.205598 2.410275   1
3  2.552158 1.383564 1.441543   1
4  1.398107 1.265201 2.605371   1
5  1.006301 1.868197 1.493666   1
6  5.026785 4.310017 2.579434   2
7  4.751061 2.960320 4.127993   2
8  2.490833 3.815691 5.945851   2
9  3.904853 4.967267 4.800914   2
10 3.104052 3.891720 5.165253   2
11 3.929249 5.301579 6.358856   3
12 6.150120 5.724055 5.391443   3
13 5.920788 7.114649 5.797759   3
14 5.902631 6.550044 5.726752   3
15 6.216153 7.236676 5.531300   3
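For the single-column case in the question, one possible sketch (not part of the answer above) is to reshape the vector into a matrix with m rows, so that each column holds one segment, and then apply f column by column. The helper name apply_by_segments and the toy inputs are illustrative assumptions; f stands for whatever function you need, such as a power spectrum.

```r
# Sketch: apply f to consecutive segments of length m.
# apply_by_segments is a hypothetical helper name, not a library function.
apply_by_segments <- function(x, m, f) {
  stopifnot(length(x) %% m == 0)  # m must divide the length of x
  segs <- matrix(x, nrow = m)     # column j holds segment j (column-major fill)
  apply(segs, 2, f)               # one call of f per segment
}

x <- 1:12
# f returns one value per segment: result has one entry per segment
seg_sums <- apply_by_segments(x, m = 4, f = sum)
# f returns m values per segment: flatten to get a vector as long as x
seg_cumsum <- as.vector(apply_by_segments(x, m = 4, f = cumsum))
```

If f returns a fixed-length vector per segment (a spectrum, say), apply gives a matrix with one column per segment, and rowMeans of that matrix averages across epochs. For several columns of a data frame, the same helper could be passed to lapply over the columns.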

Related

Perform vector operation on dataframe of coordinates

I currently have a data frame storing separate x,y,z coordinates from an accelerometer sensor (with timestamps), but want to perform vector operations on it.
Test data (actually have thousands of rows, and a timestamp row to be preserved)
x <- c(1,3,1,0,3)
y <- c(2,4,8,8,9)
z <- c(0,1,1,2,0)
df <- data.frame(x,y,z)
proj <- function(a,b) {
as.double((a %*% b) / (b %*% b)) * b
}
v = c(1,2,3)
I want to mutate df (or create a new data frame?) by applying proj(_, v) to each row.
I have tried something along the lines of mutate(projected = proj(c(x,y,z), v)), but it doesn't work; I am probably misusing it.
What is the best way to achieve this? Should I instead be using a list of vectors to store the coordinates?
Your proj(a, b) function takes only two inputs, but in your example you pass the three columns combined, proj(c(x,y,z), v). Or did I misunderstand?
However, this would work (note that it projects the column x onto the column y, not each row onto v):
dplyr::mutate(df, projected = proj(x, y))
resulting in
  x y z projected
1 1 2 0 0.4279476
2 3 4 1 0.8558952
3 1 8 1 1.7117904
4 0 8 2 1.7117904
5 3 9 0 1.9257642
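If the goal is instead to project each row (x, y, z) onto v, as the question describes, a sketch along these lines may be closer. It relies on the fact that the projection of a row a onto v is ((a . v) / (v . v)) v, so all rows can be handled with one matrix product. The helper name proj_rows and the proj_ column prefix are assumptions made for illustration.

```r
x <- c(1, 3, 1, 0, 3)
y <- c(2, 4, 8, 8, 9)
z <- c(0, 1, 1, 2, 0)
df <- data.frame(x, y, z)
v <- c(1, 2, 3)

# Hypothetical helper: project every row of df onto the vector v.
proj_rows <- function(df, v) {
  m <- as.matrix(df)
  coef <- as.vector(m %*% v) / sum(v * v)  # (a . v) / (v . v) for each row a
  out <- outer(coef, v)                    # row i is coef[i] * v
  colnames(out) <- paste0("proj_", colnames(m))
  out
}

projected <- proj_rows(df[, c("x", "y", "z")], v)
# Keep the original columns (and any timestamp column) next to the result
df_projected <- cbind(df, projected)
```

Selecting the coordinate columns explicitly means a timestamp column in df would be left out of the matrix product and preserved by cbind.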

R expand.grid with row restrictions

I have a numeric vector x of length N and would like to create a vector of the within-set sums of all of the following sets: any possible combination of the x elements with at most M elements in each combination. I put together a slow iterative approach; what I am looking for here is a way without using any loops.
Consider the approach I have been taking, in the following example with N=5 and M=4
M <- 4
x <- 11:15
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
However, as N gets large (above 22 for me), the expand.grid output becomes too big and gives an error (replace x above with x <- 11:55 to observe this). Ideally there would be an expand.grid function that permits restrictions on the rows before constructing the full matrix, which (at least for what I want) would keep the matrix size within memory limits.
Is there a way to achieve this without causing problems for large N?
Your problem has to do with the sheer number of combinations.
What you appear to be doing is listing all the different combinations of 0's and 1's in a sequence of the same length as x.
In your example x has length 5 and you have 2^5 = 32 combinations.
When x has length 22 you have 2^22 = 4194304 combinations.
Couldn't you use a binary encoding instead?
In your case that would mean
0 stands for 00000
1 stands for 00001
2 stands for 00010
3 stands for 00011
...
It will not solve your problem completely, but you should be able to get a bit further than now.
Try this:
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
It generates the same result as with your expand.grid approach, shown below for the test data.
M <- 4
x <- 11:15
# expand.grid approach
y <- as.matrix(expand.grid(rep(list(0:1), length(x))))
result <- y[rowSums(y) <= M, ] %*% x
# combn approach
result1 <- c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
all(sort(result[,1]) == sort(result1))
# [1] TRUE
This should be fast (it takes 0.227577 secs on my machine, with N=22, M=4):
x <- 1:22 # N = 22
M <- 4
c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
# [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 3 4 5 6 7
You may want to keep only the unique values of the sums with
unique(c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k))))))
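To see why the combn approach scales better, it may help to compare sizes: expand.grid builds all 2^N rows before filtering, whereas the combn approach only ever produces choose(N, 0) + ... + choose(N, M) sums. A quick check, using the N and M from the timing above (the variable names n_all, n_kept and sums are just illustrative):

```r
N <- 22
M <- 4
n_all <- 2^N                   # rows built by the expand.grid approach
n_kept <- sum(choose(N, 0:M))  # subsets with at most M elements

x <- 1:N
sums <- c(0, unlist(lapply(1:M, function(k) colSums(combn(x, k)))))
# The combn result has exactly one entry per subset of size <= M
length(sums) == n_kept
```

The count still grows quickly when M is close to N, so this helps mainly when M is small relative to N.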

Assign numbers to each letter so that r calculates the sum of the letters in a word

I'm trying to create a tool in R that will calculate the atomic composition (i.e. number of carbon, hydrogen, nitrogen and oxygen atoms) of a peptide chain that is input in single letter amino acid code. For example, the peptide KGHLY consists of the amino acids lysine (K), glycine (G), histidine (H), leucine (L) and tyrosine (Y). Lysine is made of 6 carbon, 14 hydrogen, 2 nitrogen and 2 oxygen. Glycine is made of 2 carbon, 5 hydrogen, 1 nitrogen and 2 oxygen, and so on.
I would like the R code to either read the peptide string (KGHLY) from a data frame or take input from the keyboard using readline().
I am new to R and new to programming. I am able to make objects for each amino acid, e.g. G <- c(2, 5, 1, 2) or build a data frame containing all 20 amino acids and their respective atomic compositions.
The bit that I am struggling with is that I don't know how to get R to index from a data frame in response to a string of letters. I have a feeling the solution is probably very simple but so far I have not been able to find a function that is suited to this task.
There are two main components to take care of here: the choice of
a method for storing the basic data, and the algorithm that
computes the result you desire.
For the computation, it might be preferable to have your data
stored in a matrix, due to the way R recycles the shorter vector
when multiplying two vectors. This recycling also kicks in if you
multiply a matrix with a vector, since a matrix is a
vector with some additional attributes (namely dimension
and dimension names). Consider the example below to see how it
works.
test_matrix <- matrix(data = 1:12, nrow = 3)
test_vec <- c(3, 0, 1)
test_matrix
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
test_matrix * test_vec
     [,1] [,2] [,3] [,4]
[1,]    3   12   21   30
[2,]    0    0    0    0
[3,]    3    6    9   12
Based on this observation, a solution
where each amino acid has one row in a matrix might be a good way
to store the lookup data: given a counting vector
specifying the desired contribution from each row, it
is sufficient to multiply the matrix by the counting
vector and then sum the columns, the last part being done with
colSums.
colSums(test_matrix * test_vec)
[1] 6 18 30 42
In general it is a "pain" to store this kind of information in a
matrix, since it might be a "lot of work" to update the
information later on. However, I guess new amino acids rarely
need to be added, so that might not be an issue in
this case.
So let's create a matrix for the five amino acids needed
for the peptide you mentioned in your example. The numbers were
found on Wikipedia, and hopefully I didn't mess up when I copied
them. Just follow suit to add all the other amino acids too.
amino_acids <- rbind(
G = c(C = 2, H = 5, N = 1, O = 2),
L = c(C = 6, H = 13, N = 1, O = 2),
H = c(C = 6, H = 9, N = 3, O = 2),
K = c(C = 6, H = 14, N = 2, O = 2),
Y = c(C = 9, H = 11, N = 1, O = 3))
amino_acids
C H N O
G 2 5 1 2
L 6 13 1 2
H 6 9 3 2
K 6 14 2 2
Y 9 11 1 3
This matrix contains the information we want, but it might be
preferable to have them in lexicographic order - and it would be
nice to ensure that we haven't by mistake added the same row
twice. The code below takes care of both of these issues.
amino_acids <-
amino_acids[sort(unique(rownames(amino_acids))), ]
amino_acids
C H N O
G 2 5 1 2
H 6 9 3 2
K 6 14 2 2
L 6 13 1 2
Y 9 11 1 3
The next part is to figure out how to deal with the peptides. This
will here be done by first using strsplit to split the string
into separate characters, and then use a table-solution upon the
result to get the vector that we want to multiply with the matrix.
peptide <- "KGHLY"
peptide_2 <- unlist(strsplit(x = peptide, split = ""))
peptide_2
[1] "K" "G" "H" "L" "Y"
Using table upon peptide_2 gives us
table(peptide_2)
peptide_2
G H K L Y
1 1 1 1 1
This can thus be used to define a vector to play the role of test_vec in the first example. However, in general the resulting vector will contain fewer components than there are rows in the matrix amino_acids, so the matrix must be restricted to the matching rows first in order to get the format we want for the computation.
Several options are available, and the simplest one might be to use the names from the table to subset the required rows from amino_acids, so that the computation can proceed without any further fuss.
peptide_vec <- table(peptide_2)
## drop = FALSE keeps the matrix shape when only one distinct amino acid occurs
colSums(amino_acids[names(peptide_vec), , drop = FALSE] * as.vector(peptide_vec))
C H N O
29 52 8 11
This outlines one possible solution for the core of your problem,
and this can be collected into a function that takes care of all
the steps for us.
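peptide_function <- function(peptide, amino_acids) {
  ## Count how often each amino acid occurs in the peptide.
  peptide_vec <- table(
    unlist(strsplit(x = peptide, split = "")))
  ## Compute the result and return it to the work flow.
  ## drop = FALSE keeps the matrix shape for single-letter peptides.
  colSums(
    amino_acids[names(peptide_vec), , drop = FALSE] *
      as.vector(peptide_vec))
}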
And finally a test to see that we get the same answer as before.
peptide_function(peptide = "GHKLY",
amino_acids = amino_acids)
C H N O
29 52 8 11
What next? Well that depends on how you have stored your
peptides, and what you would like to do with the result. If for
example you have the peptides stored in a vector, and would like
to have the result stored in a matrix, then it might e.g. be
possible to use vapply as given below.
data_vector <- c("GHKLY", "GGLY", "HKLGL")
result <- t(vapply(
X = data_vector,
FUN = peptide_function,
FUN.VALUE = numeric(4),
amino_acids = amino_acids))
result
C H N O
GHKLY 29 52 8 11
GGLY 19 34 4 9
HKLGL 26 54 8 10
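As an aside, R can index a matrix by row names directly, which gives a shorter variant of the same idea: split the peptide into letters and use the letters themselves as the row index. Repeated letters simply select the same row more than once. This is only a sketch; it repeats the amino_acids matrix from above so that it runs on its own, composition is a hypothetical function name, and the check for unknown letters is an addition that the answer above does not include.

```r
amino_acids <- rbind(
  G = c(C = 2, H = 5,  N = 1, O = 2),
  H = c(C = 6, H = 9,  N = 3, O = 2),
  K = c(C = 6, H = 14, N = 2, O = 2),
  L = c(C = 6, H = 13, N = 1, O = 2),
  Y = c(C = 9, H = 11, N = 1, O = 3))

# Hypothetical helper: sum the rows selected by the letters of the peptide.
composition <- function(peptide, amino_acids) {
  letters_in <- strsplit(peptide, split = "")[[1]]
  unknown <- setdiff(letters_in, rownames(amino_acids))
  if (length(unknown) > 0) {
    stop("unknown amino acid code(s): ", paste(unknown, collapse = ", "))
  }
  # drop = FALSE keeps a matrix even for a one-letter peptide
  colSums(amino_acids[letters_in, , drop = FALSE])
}

res_kghly <- composition("KGHLY", amino_acids)
res_ggly <- composition("GGLY", amino_acids)
```

A string typed at the keyboard could be passed in the same way, for example composition(readline("Peptide: "), amino_acids) in an interactive session.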

Generate the shortest possible unique number from two number with known length limit

I have two numbers fields in the db. One with max 15 digits (say x), and the other one with maximum 5 digits (say y). I need to produce unique number from any pair (x,y), such that for any other pair (w,z), k(x,y) = k(w,z) if and only if x=w and y=z.
Note: I read about Cantor function but since i have a known limit on the number length, i'd like to use a more efficient function to generate the shortest possible key.
Just concatenate them together. k(111111111111111,22222) = 11111111111111122222. Be sure to include leading zeroes: k(3,4) = 300004. If you leave them out, then k(3,4) will have the same result as k(0,34).
By the pigeonhole principle, if you want perfect uniqueness, you can't do any better than 15+5 digits.
Let's say that x could hold the numbers between 0 - 4 and y could hold the numbers between 0 - 3; then you could create a grid like so
x 0 1 2 3 4
y----------------
0| 0 1 2 3 4
1| 5 6 7 8 9
2| 10 11 12 13 14
3| 15 16 17 18 19
Given an (x, y) you would find the unique number by computing x + 5y.
For example, (3, 2) would be 3 + 5*2 = 13, and if you check the grid you will see that 13 is in the column where the x label is 3 and in the row where the y label is 2.
Going back the other way, given a number, let's say 16, then x = 16 modulo 5 = 1
and y = (16 - x) / 5 = 3
You can see from the grid that in column 1 row 3 is the number 16.
To extend it to your question, your x holds the values between 0 - 999999999999999 and your y holds the values between 0 - 99999,
so your formula would be
k(x, y) = 1000000000000000*y + x
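The grid idea can be sketched in a few lines of R. The function names encode_pair and decode_key and the argument nx (the number of distinct x values) are illustrative assumptions. Note that with nx = 10^15 the keys go up to about 10^20, which is beyond the range in which R's double-precision numbers represent integers exactly (2^53), so the full-size case would need a string or big-integer representation; the sketch sticks to the small grid.

```r
# Hypothetical helpers for the grid scheme: key = x + nx * y
encode_pair <- function(x, y, nx) x + nx * y
decode_key <- function(k, nx) c(x = k %% nx, y = k %/% nx)

k <- encode_pair(3, 2, nx = 5)  # the (3, 2) example from the grid
xy <- decode_key(16, nx = 5)    # going back from 16

# Every (x, y) pair on the 5 x 4 grid gets its own key
grid_keys <- outer(0:4, 0:3, encode_pair, nx = 5)
```

Integer division (%/%) gives the same y as (k - x) / nx in the text, without computing x first.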
I would store the second number's length in the first digit, followed by the two numbers concatenated. It's easy and will work in every case. Since the second number is 5 digits at most, a single leading digit can store its length.
Without it, concatenating 23 and 45 would give the same result as 234 and 5; the same problem arises with zeroes between them.
example:
k(3,4) = 134
k(30,4) = 1304
k(3,40) = 2340
k(3333,4) = 133334
k(3,4321) = 434321
The expression for k(x,y) would be:
n*(10^(n+m)) + x*(10^n) + y
where
n = floor(log10(y)) + 1 and m = floor(log10(x)) + 1 are the numbers of digits of y and x (taking the digit count of 0 to be 1).

Function that group values of a list (in R)

I am trying to construct a function which shouldn't be hard in terms of programming but I am having some difficulties to conceptualize it. Hope you'll be able to understand my problem better than me!
I'd like a function that takes a single list of vectors as argument. Something like
arg1 = list(c(1,2), c(2,3), c(5,6), c(1,3), c(4,6), c(6,7), c(7,5), c(5,8))
The function should output a matrix with two columns (or a list of two vectors or something like that) where one column contains letters and the other numbers. One can think of the argument as a list of the positions/values that should be placed in the same group. If in the list there is the vector c(5,6), then the output should contain somewhere the same letters next to the values 5 and 6 in the number column. If there are the three following vectors c(1,2), c(2,3) and c(1,3), then the output should contain somewhere the same letters next to the value 1, 2 and 3 in the number column.
Therefore if we enter the object arg1 in the function it should return:
myFun(arg1)
number_column letters_column
1 A
2 A
3 A
5 B
6 B
7 B
4 C
6 C
5 D
8 D
(the order is not important. The letters E should not be present before the letter D has been used)
Therefore the function has constructed 2 groups of 3 (A:[1,2,3] and B:[5,6,7]) and 2 groups of 2 (C:[4,6] and D:[5,8]). Note that one position or number can be in several groups.
Please let me know if something is unclear in my question! Thanks!
As I wrote in the comments, it appears that you want a data frame that lists the maximal cliques of a graph given a list of vectors that define the edges.
library(igraph)
## create a matrix where each row is an edge
argmatrix <- do.call(rbind, arg1)
## create an igraph object from the matrix of edges
## (graph_from_edgelist and max_cliques are the current names of
## graph.edgelist and maximal.cliques)
gph <- graph_from_edgelist(argmatrix, directed = FALSE)
## returns a list of the maximal cliques of the graph
mxc <- max_cliques(gph)
## creates a data frame of the output
dat <- data.frame(number_column = unlist(mxc),
                  group_column = rep.int(seq_along(mxc), times = sapply(mxc, length)))
## converts group numbers to letters
## ONLY USE if max(dat$group_column) <= 26
dat$group_column <- LETTERS[dat$group_column]
# number_column group_column
# 1 5 A
# 2 8 A
# 3 5 B
# 4 6 B
# 5 7 B
# 6 4 C
# 7 6 C
# 8 3 D
# 9 1 D
# 10 2 D

Resources