In standard functional programming, Map takes a list l and a function F and returns a new list with F applied to every element. As an example consider:
F(x) = x^2 and the list l = [1, 2, 3, 4, 5]
Map(F, l) would produce the list: [1, 4, 9, 16, 25]
I would like to use this notion of Map on an R dataframe. I would like my function F(x) to compute x / rowSum(row that x belongs to in the dataframe).
Consider the data frame given by:
df <- data.frame()
for(i in 1:5)
{
df <- rbind(df, c(i, i+1, i+2, i+3, i+4))
}
colnames(df) <- c("a", "b", "c", "d", "e")
Which gives:
a b c d e
1 1 2 3 4 5
2 2 3 4 5 6
3 3 4 5 6 7
4 4 5 6 7 8
5 5 6 7 8 9
I would like Map(F, df) to produce:
[,1] [,2] [,3] [,4] [,5]
v1 0.06666667 0.1333333 0.2 0.2666667 0.3333333
v2 0.10000000 0.1500000 0.2 0.2500000 0.3000000
v3 0.12000000 0.1600000 0.2 0.2400000 0.2800000
v4 0.13333333 0.1666667 0.2 0.2333333 0.2666667
v5 0.14285714 0.1714286 0.2 0.2285714 0.2571429
which is a dataframe where F is applied to every entry x in df.
The only hard part is figuring out how to write F:
F <- function(x) x / rowSum( row in which x belongs to in dataframe)
Map(F, df)
How do I write F?
EDIT Here is an iterative solution:
pStates <- data.frame()
for(i in 1:5)
{
v <- df[i,] / rowSums(df[i,])
pStates <- rbind(pStates, v)
}
R's recycling rules work out of the box:
df / rowSums(df)
A data.frame is a (column-oriented) list of equal-length vectors (try df[[2]], for instance, or str(df)), so Map(F, df) is acting as in other functional languages by applying F to each column. The use of rowSums implies that the data are all numeric; it is often appropriate and efficient to then use a matrix, where recycling still works out of the box.
m <- as.matrix(df)
m / rowSums(m)
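A quick check of the recycling claim, as a minimal sketch: the data frame is rebuilt directly rather than with the loop from the question, and the names p and q are only illustrative. Because the divisor has one entry per row and is recycled down each column, every row of the result should sum to 1. Base R's prop.table(m, margin = 1) computes the same row proportions for a matrix.

```r
# Same data as in the question, built column by column
df <- data.frame(a = 1:5, b = 2:6, c = 3:7, d = 4:8, e = 5:9)

# rowSums(df) has one entry per row; it is recycled down each column,
# so every entry is divided by the sum of its own row
p <- df / rowSums(df)
print(rowSums(p))  # each row of proportions should sum to 1

# Matrix version, compared with prop.table() over rows (margin = 1)
m <- as.matrix(df)
q <- prop.table(m, margin = 1)
print(all.equal(as.matrix(p), q, check.attributes = FALSE))
```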
One could use a closure (here, a function returned by another function) to supply the constant argument rowSums(df) to an (inefficient) Map solution that acts explicitly on each column:
Ffactory <- function(df) { r = rowSums(df); function(x) x / r }
mapped <- Map(Ffactory(df), df)
remembering to coerce the list to a data frame
as.data.frame(mapped)
I would like to know how to split a vector by a percentage. I tried to use the stats::quantile function, but it does not separate the elements correctly when the same value occurs several times.
I would like a method that does the split only by taking into account the length of the vector without taking into account the values.
vector <- c(1,1,1, 4:10)
minProb <- 0.1
maxProb <- 0.9
l <- length(vector)
dt <- data.frame("id" = 1:l, "value" = vector)
library(dplyr)
dt <- dt %>% arrange(value)
#min <- l*minProb
#max <- l*maxProb
#data1 <- dt$id[min:max]
#data2 <- dt$id[-c(min:max)]
#q <- quantile(dt$value, probs=c(minProb,maxProb))
#w <- which(dt$value >= q[1] & dt$value <= q[2])
expected result (index of elements)
> g2
1 10
> g1
2 3 4 5 6 7 8 9
The following does split the vector; whether that is what the question asks for is not clear.
l <- length(vector)
qq <- quantile(seq_along(vector), probs = c(minProb, maxProb))
f <- logical(l)
f[round(qq[1])] <- TRUE
f[round(qq[2])] <- TRUE
split(vector, cumsum(f))
#$`0`
#[1] 1
#
#$`1`
#[1] 1 1 4 5 6 7 8
#
#$`2`
#[1] 9 10
To get the indices instead, as requested in a comment, do
split(seq_along(vector), cumsum(f))
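If the goal is exactly the g1/g2 split shown in the question (the outer share of positions versus the inner share, ignoring ties in the values), a purely position-based sketch may be closer. This assumes "split by percentage" means cutting the sorted positions at minProb and maxProb of the length; the names ord, n_lo and n_hi are illustrative and not from the question.

```r
vector  <- c(1, 1, 1, 4:10)
minProb <- 0.1
maxProb <- 0.9

l   <- length(vector)
ord <- order(vector)               # positions sorted by value; ties keep their original order

# Number of positions cut off at each end, based on the length only
n_lo <- round(l * minProb)
n_hi <- round(l * (1 - maxProb))

g2 <- c(head(ord, n_lo), tail(ord, n_hi))  # outer positions
g1 <- setdiff(ord, g2)                     # inner positions

print(g2)
print(g1)
```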
I have 3 vectors
a = c(3,7)
b = c(4,6)
c = c(2,6)
I would like to make the union of these 3 sets. I could use the union() function, but a "convex" union requires that the vector c be removed from the union because it is dominated by a, which is higher in both elements.
Any idea of a simple way to do it?
If each row of m is a pair then which.nondominated(-t(m)) gives the row numbers of the rows not dominated by some other row. The code is written in C so it should be fast.
library(ecr)
m <- rbind(a, b, c) # input data
ix <- which.nondominated(-t(m)) # 1, 2
mm <- m[ix, ]
mm
## [,1] [,2]
## a 3 7
## b 4 6
There are no duplicates in this example but if there could be and if you also wanted to remove them then:
unique(mm)
or
mm[!duplicated(mm), ]
This will work in the cases mentioned above in the comments.
a = c(3,7)
b = c(4,6)
c = c(2,6)
d = c(3.5,6.5)
my_fun <- function(A){
B <- matrix(NA,ncol=ncol(A))
for(i in 1:nrow(A)){
include <- !any(apply(A[-i, , drop = FALSE],1,function(x){all(A[i,] < x)}))
if(include){
B <- rbind(B,A[i,])
}
}
na.omit(B)
}
A <- rbind(a,b,c,d)
my_fun(A)
[,1] [,2]
[1,] 3.0 7.0
[2,] 4.0 6.0
[3,] 3.5 6.5
I think the solution by @G. Grothendieck is the best so far. Here is another solution with base R.
Assuming you are using a data frame df that consists of a, b, c and d, i.e.,
a = c(3,7)
b = c(4,6)
c = c(2,6)
d = c(3.5,6.5)
df <- data.frame(t(data.frame(a,b,c,d)))
> df
X1 X2
a 3.0 7.0
b 4.0 6.0
c 2.0 6.0
d 3.5 6.5
then maybe the following code can help you build the convex union
l <- combn(df[,1],2,diff)
b <- combn(df[,2],2,diff)
idx <- combn(seq(nrow(df)),2)
m <- df[unique(as.vector(idx[,l*b<0])),]
> m
X1 X2
a 3.0 7.0
b 4.0 6.0
d 3.5 6.5
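For comparison, here is a minimal base R sketch that tests dominance explicitly, row against row. It assumes "dominated" means some other row is at least as large in every coordinate and strictly larger in at least one (a slightly weaker condition than the strict < used in my_fun above), and that the matrix has at least two rows. The function names is_dominated and nondominated are illustrative.

```r
a <- c(3, 7)
b <- c(4, 6)
c <- c(2, 6)
d <- c(3.5, 6.5)
A <- rbind(a, b, c, d)

# TRUE if some other row is >= row i everywhere and > row i somewhere
is_dominated <- function(i, A) {
  others <- A[-i, , drop = FALSE]
  any(apply(others, 1, function(r) all(r >= A[i, ]) && any(r > A[i, ])))
}

# Keep only the rows that no other row dominates
nondominated <- function(A) {
  dom <- vapply(seq_len(nrow(A)), is_dominated, logical(1), A = A)
  A[!dom, , drop = FALSE]
}

res <- nondominated(A)
print(res)
```

The comparison is quadratic in the number of rows, so for large inputs a compiled routine such as the one in the ecr answer is likely preferable.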
add <- c( 2,3,4)
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
print(z)
}
# Result
[1] 13
[1] 15
[1] 17
R prints the results, but I want to save them in a vector, data frame or list for further computation.
Thanks in advance
Try something like:
add <- c(2, 3, 4)
z <- rep(0, length(add))
idx <- 1
for(i in add) {
a <- i + 3
b <- a + 3
z[idx] <- a + b
idx <- idx + 1
}
print(z)
This is simple algebra; there is no need for a for loop at all
res <- (add + 3)*2 + 3
res
## [1] 13 15 17
Or if you want a data.frame
data.frame(a = add + 3, b = add + 6, c = (add + 3)*2 + 3)
# a b c
# 1 5 8 13
# 2 6 9 15
# 3 7 10 17
Though in general, when you are trying to do something like that, it is better to create a function, for example
myfunc <- function(x) {
a <- x + 3
b <- a + 3
z <- a + b
z
}
myfunc(add)
## [1] 13 15 17
In cases when a loop is actually needed (unlike in your example) and you want to store its results, it is better to use the *apply family for such tasks. For example, use lapply if you want a list back
res <- lapply(add, myfunc)
res
# [[1]]
# [1] 13
#
# [[2]]
# [1] 15
#
# [[3]]
# [1] 17
Or use sapply if you want a vector back
res <- sapply(add, myfunc)
res
## [1] 13 15 17
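As a side note, vapply is a stricter sibling of sapply: you state the type and length of each result up front, and it stops with an error if the function returns anything else. A minimal sketch, repeating myfunc so the block runs on its own:

```r
add <- c(2, 3, 4)

myfunc <- function(x) {
  a <- x + 3
  b <- a + 3
  a + b
}

# numeric(1) declares that every call must return a single number
res <- vapply(add, myfunc, numeric(1))
print(res)
```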
For a data.frame to keep all the info
add <- c( 2,3,4)
results <- data.frame()
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
#print(z)
results <- rbind(results, cbind(a,b,z))
}
results
a b z
1 5 8 13
2 6 9 15
3 7 10 17
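Growing a data frame with rbind inside the loop copies it on every iteration, which gets slow for long loops. A minimal sketch of an alternative, assuming the same add vector: build one small data frame per element with lapply and bind them once at the end. The helper name one_row is illustrative.

```r
add <- c(2, 3, 4)

# One single-row data frame per input value
one_row <- function(i) {
  a <- i + 3
  b <- a + 3
  data.frame(a = a, b = b, z = a + b)
}

rows    <- lapply(add, one_row)
results <- do.call(rbind, rows)   # bind all rows in a single step
print(results)
```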
If you just want z then use a vector, no need for lists
add <- c( 2,3,4)
results <- vector()
for (i in add){
a <- i +3
b <- a + 3
z <- a + b
#print(z)
results <- c(results, z)
}
results
[1] 13 15 17
It might be instructive to compare these two results with those of @dugar:
> sapply(add, function(x) c(a=x+3, b=a+3, z=a+b) )
[,1] [,2] [,3]
a 5 6 7
b 10 10 10
z 17 17 17
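That is a matter of scoping rather than lazy evaluation, and it sometimes trips us up when computing with intermediate values: inside c(), a=x+3 only names an element and does not create a variable, so the a and b on the right-hand sides are looked up in the global environment, where (judging by the output) the earlier for loop left a = 7 and b = 10. This next one should give a more expected result: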
> sapply(add, function(x) c(a=x+3, b=(x+3)+3, z=(x+3)+((x+3)+3)) )
[,1] [,2] [,3]
a 5 6 7
b 8 9 10
z 13 15 17
Those results are the transpose of @dugar's. Using sapply or lapply often saves you the effort of setting up a zeroth-case object and then incrementing counters.
> lapply(add, function(x) c(a=x+3, b=(x+3)+3, z=(x+3)+((x+3)+3)) )
[[1]]
a b z
5 8 13
[[2]]
a b z
6 9 15
[[3]]
a b z
7 10 17
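To keep the intermediate names readable without repeating x+3 everywhere, the intermediates can be assigned inside the function body, where they are ordinary local variables. A minimal sketch, with the illustrative function name step; t() turns the sapply result so that each input value gets one row.

```r
add <- c(2, 3, 4)

step <- function(x) {
  a <- x + 3          # local variables, independent of any a or b in the workspace
  b <- a + 3
  c(a = a, b = b, z = a + b)
}

res <- t(sapply(add, step))   # one row per element of add
print(res)
print(as.data.frame(res))
```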
I have a very big list. Every element of this list is a matrix, but the dimension of the matrix (its number of rows) differs from element to element. The row names of every element are a subset of the row names of one element of the list.
My goal is to find the element of the list with the largest dimension, compare the other elements with this reference element, and add to each of them the missing row names with a value of zero.
Would someone help me implement it in R?
Here is a simple example of what I want:
> P
[[1]]
[,1]
A 1
B 2
C 3
D 4
[[2]]
[,1]
A 1
B 2
D 3
[[3]]
[,1]
B 1
C 2
Expected output is:
> P
[[1]]
[,1]
A 1
B 2
C 3
D 4
[[2]]
[,1]
A 1
B 2
D 3
C 0
[[3]]
[,1]
B 1
C 2
D 0
A 0
This should work:
N <- length(P)
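length.max <- max(sapply(P, nrow)) # number of rows of the largest matrix
for (i in 1:N){
temp <- rownames(P[[i]])
# pad with zero rows up to the size of the largest matrix
P[[i]] <- rbind(P[[i]], matrix(0, ncol=1, nrow=length.max - nrow(P[[i]])))
# assumes the full set of row names is the first length.max capital letters, as in the example
rownames(P[[i]]) <- c(temp, setdiff(LETTERS[1:length.max],temp))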
}
This can also be done with lapply
P <- list(A = matrix(1:10), B = matrix(1:4), C = matrix(1:2))
longest <- max(sapply(P, nrow))
P <- lapply(P, function(x) c(x, rep(0, longest-length(x))))
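Note that c(x, rep(0, ...)) returns plain vectors and drops the row names. If the row names matter, one option is to index by name into a zero matrix that carries the reference row names. This is a sketch under the assumption that the largest element contains every row name; the names ref, pad and P2 are illustrative. Unlike the expected output in the question, the missing rows appear in reference order rather than appended at the end.

```r
# Example list from the question, with row names
P <- list(
  matrix(1:4, dimnames = list(c("A", "B", "C", "D"), NULL)),
  matrix(1:3, dimnames = list(c("A", "B", "D"), NULL)),
  matrix(1:2, dimnames = list(c("B", "C"), NULL))
)

# Row names of the largest element serve as the reference
ref <- rownames(P[[which.max(sapply(P, nrow))]])

pad <- function(m) {
  out <- matrix(0, nrow = length(ref), ncol = ncol(m),
                dimnames = list(ref, colnames(m)))
  out[rownames(m), ] <- m   # fill the rows that exist; the rest stay 0
  out
}

P2 <- lapply(P, pad)
print(P2)
```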
I would like to do the following:
combine into a data frame, two vectors that
have different length
contain sequences found also in the other vector
contain sequences not found in the other vector
sequences that are not found in other vector are never longer than 3 elements
always have same first element
The data frame should show the equal sequences in the two vectors aligned, with NA in the column if a vector lacks a sequence present in the other vector.
For example:
vector 1 vector 2 vector 1 vector 2
1 1 a a
2 2 g g
3 3 b b
4 1 or h a
1 2 a g
2 3 g b
5 4 c h
5 c
should be combined into data frame
1 1 a a
2 2 g g
3 3 b b
4 NA h NA
1 1 or a a
2 2 g g
NA 3 NA b
NA 4 NA h
5 5 c c
I searched for merge, combine, cbind and plyr examples but was not able to find a solution. I am afraid I will need to start writing a function with nested for loops to solve this problem.
Note - this was proposed as an answer to the first version of the OP. The question has been modified since then but the problem is still not well-defined in my opinion.
Here is a solution that works with your integer example and would also work with numeric vectors. I am also assuming that:
both vectors contain the same number of sequences
a new sequence starts where value[i+1] <= value[i]
If your vectors are non-numeric or if one of my assumptions does not fit your problem, you'll have to clarify.
v1 <- c(1,2,3,4,1,2,5)
v2 <- c(1,2,3,1,2,3,4,5)
v1.sequences <- split(v1, cumsum(c(TRUE, diff(v1) <= 0)))
v2.sequences <- split(v2, cumsum(c(TRUE, diff(v2) <= 0)))
align.fun <- function(s1, s2) { #aligns two sequences
s12 <- sort(unique(c(s1, s2)))
cbind(ifelse(s12 %in% s1, s12, NA),
ifelse(s12 %in% s2, s12, NA))
}
do.call(rbind, mapply(align.fun, v1.sequences, v2.sequences, SIMPLIFY = FALSE))
# [,1] [,2]
# [1,] 1 1
# [2,] 2 2
# [3,] 3 3
# [4,] 4 NA
# [5,] 1 1
# [6,] 2 2
# [7,] NA 3
# [8,] NA 4
# [9,] 5 5
I maintain that your problem might be solved in terms of the shortest common supersequence. It assumes that your two vectors each represent one sequence. Please give the code below a try.
If it still does not solve your problem, you'll have to explain exactly what you mean by "my vector contains not one but many sequences": define what you mean by a sequence and tell us how sequences can be identified by scanning through your two vectors.
Part I: given two sequences, find the longest common subsequence
LongestCommonSubsequence <- function(X, Y) {
m <- length(X)
n <- length(Y)
C <- matrix(0, 1 + m, 1 + n)
for (i in seq_len(m)) {
for (j in seq_len(n)) {
if (X[i] == Y[j]) {
C[i + 1, j + 1] = C[i, j] + 1
} else {
C[i + 1, j + 1] = max(C[i + 1, j], C[i, j + 1])
}
}
}
backtrack <- function(C, X, Y, i, j) {
if (i == 1 || j == 1) {
return(data.frame(I = c(), J = c(), LCS = c()))
} else if (X[i - 1] == Y[j - 1]) {
return(rbind(backtrack(C, X, Y, i - 1, j - 1),
data.frame(LCS = X[i - 1], I = i - 1, J = j - 1)))
} else if (C[i, j - 1] > C[i - 1, j]) {
return(backtrack(C, X, Y, i, j - 1))
} else {
return(backtrack(C, X, Y, i - 1, j))
}
}
return(backtrack(C, X, Y, m + 1, n + 1))
}
Part II: given two sequences, find the shortest common supersequence
ShortestCommonSupersequence <- function(X, Y) {
LCS <- LongestCommonSubsequence(X, Y)[c("I", "J")]
X.df <- data.frame(X = X, I = seq_along(X), stringsAsFactors = FALSE)
Y.df <- data.frame(Y = Y, J = seq_along(Y), stringsAsFactors = FALSE)
ALL <- merge(LCS, X.df, by = "I", all = TRUE)
ALL <- merge(ALL, Y.df, by = "J", all = TRUE)
ALL <- ALL[order(pmax(ifelse(is.na(ALL$I), 0, ALL$I),
ifelse(is.na(ALL$J), 0, ALL$J))), ]
ALL$SCS <- ifelse(is.na(ALL$X), ALL$Y, ALL$X)
ALL
}
Your Example:
ShortestCommonSupersequence(X = c("a","g","b","h","a","g","c"),
Y = c("a","g","b","a","g","b","h","c"))
# J I X Y SCS
# 1 1 1 a a a
# 2 2 2 g g g
# 3 3 3 b b b
# 9 NA 4 h <NA> h
# 4 4 5 a a a
# 5 5 6 g g g
# 6 6 NA <NA> b b
# 7 7 NA <NA> h h
# 8 8 7 c c c
(where the two updated vectors are in columns X and Y.)
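As a sanity check on the table above: the length of a shortest common supersequence equals length(X) + length(Y) minus the length of the longest common subsequence. The sketch below recomputes only the LCS length with the same dynamic-programming table; the function name lcs_length is illustrative and not part of the code above.

```r
# Length of the longest common subsequence via the standard DP table
lcs_length <- function(X, Y) {
  C <- matrix(0L, length(X) + 1L, length(Y) + 1L)
  for (i in seq_along(X)) {
    for (j in seq_along(Y)) {
      C[i + 1, j + 1] <- if (X[i] == Y[j]) {
        C[i, j] + 1L
      } else {
        max(C[i + 1, j], C[i, j + 1])
      }
    }
  }
  C[length(X) + 1, length(Y) + 1]
}

X <- c("a", "g", "b", "h", "a", "g", "c")
Y <- c("a", "g", "b", "a", "g", "b", "h", "c")

k <- lcs_length(X, Y)
print(k)                           # length of the common subsequence
print(length(X) + length(Y) - k)   # number of rows the aligned table should have
```

For these two vectors the supersequence length matches the nine rows of the output shown above.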