Named arrays, dataframes and matrices - r

If I split my data matrix into rows according to class labels in another vector y like this, the result is something with 'names' like this:
> X <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> y <- c(1,3,1,3)
> X_split <- split(as.data.frame(X),y)
$`1`
V1 V2
1 1 5
3 3 7
$`3`
V1 V2
2 2 6
4 4 8
I want to loop through the results and do some operations on each matrix, for example sum the elements or sum the columns. How do I access each matrix in a loop so I can do that?
labels = names(X_split)
for (k in labels) {
# How do I get X_split[k] as a matrix?
sum_class = sum(X_split[k]) # Doesn't work
}
In fact, I don't really want to deal with dataframes and named arrays at all. Is there a way I can call split without as.data.frame and get a list of matrices or something similar?

To split without converting to a data frame
X_split <- list(X[c(1, 3), ], X[c(2, 4), ])
More generally, to write it in terms of a vector y of length nrow(X), indicating the group to which each row belongs, you can write this as
X_split <- lapply(unique(y), function(i) X[y == i, ])
To sum the results
X_sum <- lapply(X_split, sum)
# [[1]]
# [1] 16
# [[2]]
# [1] 20
(or use sapply if you want the result as a vector)
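If the class labels should survive as list names, one option is to split the row indices instead of the data. This is a minimal sketch using the question's X and y; `drop = FALSE` is an addition not present in the answer above, included so that a class with a single row stays a matrix rather than collapsing to a vector.

```r
X <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 4, ncol = 2)
y <- c(1, 3, 1, 3)

# split() on the row indices keeps the class labels as list names
X_split <- lapply(split(seq_len(nrow(X)), y),
                  function(idx) X[idx, , drop = FALSE])

for (k in names(X_split)) {
  m <- X_split[[k]]            # [[ ]] extracts the matrix itself, [ ] would return a list
  cat("class", k, "sum:", sum(m), "\n")
}

class_sums <- sapply(X_split, sum)      # named vector: 16 for class 1, 20 for class 3
col_sums   <- sapply(X_split, colSums)  # one column of column sums per class
```

Inside the loop, `X_split[[k]]` is what the question was missing: `X_split[k]` is still a list of length one, which is why `sum()` fails on it.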

Another option is not to split in the first place and just sum per y. Here's a possible data.table approach
library(data.table)
as.data.table(X)[, sum(sapply(.SD, sum)), by = y]
# y V1
# 1: 1 16
# 2: 3 20

Pretty sure operating directly on the matrix is most efficient:
tapply(rowSums(X),y,sum)
# 1 3
# 16 20
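For the "sum the columns" part of the question, base R's `rowsum()` may also be worth a look. The sketch below reuses the question's X and y and is only meant as an illustration of that function.

```r
X <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8), nrow = 4, ncol = 2)
y <- c(1, 3, 1, 3)

# rowsum() adds up the rows of X within each group of y, column by column
col_sums_by_class <- rowsum(X, y)
#   [,1] [,2]
# 1    4   12
# 3    6   14

# Summing across the columns then gives the per-class totals
totals <- rowSums(col_sums_by_class)
```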

Related

r - find maximum length "chain" of numerically increasing pairs of numbers

I have a two column dataframe of number pairs:
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
> dfPairs
ODD EVEN
1 1 10
2 1 8
3 1 2
4 3 2
5 3 6
6 3 4
7 5 2
8 7 6
9 7 8
10 9 4
11 9 8
Each row of this dataframe is a pair of numbers, and I would like to find the longest possible numerically increasing combination of pairs. Conceptually, this is analogous to making a chain link of number pairs, with the added conditions that 1) links can only be formed using the same number and 2) the final chain must increase numerically. Visually, the program I am looking for will accomplish this:
For instance, row three is pair (1,2), which increases left to right. The next link in the chain would need to have a 2 in the EVEN column and increase right to left, such as row four (3,2). Then the pattern repeats, so the next link would need to have a 3 in the ODD column, and increase left to right, such as rows 5 or 6. The chain doesn't have to start at 1, or end at 9 - this was simply a convenient example.
If you try to make all possible linked pairs, you will find that many unique chains of various lengths are possible. I would like to find the longest possible chain. In my real data, I will likely encounter a situation in which more than one chain tie for the longest, in which case I would like all of these returned.
The final result should return the longest possible chain that meets these requirements as a dataframe, or a list of dataframes if more than one solution is possible, containing only the rows in the chain.
Thanks in advance. This one has been perplexing me all morning.
Edited to deal with df that does not start at 1 and returns maximum chains rather than chain lengths
Take advantage of graph data structure using igraph
Your data, dfPairs
ODD <- c(1,1,1,3,3,3,5,7,7,9,9)
EVEN <- c(10,8,2,2,6,4,2,6,8,4,8)
dfPairs <- data.frame(ODD, EVEN)
New data, dfTest
ODD <- c(3,3,3,5,7,7,9,9)
EVEN <- c(2,6,4,2,6,8,4,8)
dfTest <- data.frame(ODD, EVEN)
Make graph of your data. A key to my solution is to rbind the reverse (rev(dfPairs)) of the data frame to the original data frame. This will allow for building directional edges from odd numbers to even numbers. Graphs can be used to construct directional paths fairly easily.
library(igraph)
library(dplyr)
GPairs <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1))
GTest <- graph_from_data_frame(dplyr::arrange(rbind(setNames(dfTest, c("X1", "X2")), setNames(rev(dfTest), c("X1", "X2"))), X1))
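To see what the rbind-with-reverse step produces before it goes into the graph, here is a small base-R sketch on a two-row toy data frame. The name `dfSmall` is made up for this illustration, and `order()` stands in for `dplyr::arrange()`.

```r
dfSmall <- data.frame(ODD = c(1, 3), EVEN = c(2, 2))

# Each pair appears once in each direction: (ODD, EVEN) and (EVEN, ODD)
both <- rbind(setNames(dfSmall, c("X1", "X2")),
              setNames(rev(dfSmall), c("X1", "X2")))
both <- both[order(both$X1), ]
both
```

Each row of `both` becomes one directed edge from X1 to X2, so every pair can be walked from either end.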
Here's the first three elements of all_simple_paths(GPairs, 1) (starting at 1)
[[1]]
+ 2/10 vertices, named, from f8e4f01:
[1] 1 2
[[2]]
+ 3/10 vertices, named, from f8e4f01:
[1] 1 2 3
[[3]]
+ 4/10 vertices, named, from f8e4f01:
[1] 1 2 3 4
I create a function to 1) convert all simple paths to list of numeric vectors, 2) filter each numeric vector for only elements that satisfy left->right increasing, and 3) return the maximum chain of left->right increasing numeric vector
max_chain_only_increasing <- function(gpath) {
list_vec <- lapply(gpath, function(v) as.numeric(names(unclass(v)))) # convert to list of numeric vector
only_increasing <- lapply(list_vec, function(v) v[1:min(which(v >= dplyr::lead(v, default=tail(v, 1))))]) # subset vector for only elements that are left->right increasing
return(unique(only_increasing[lengths(only_increasing) == max(lengths(only_increasing))])) # return maximum chain length
}
This is the output of the above function using all paths that start from 1
max_chain_only_increasing(all_simple_paths(GPairs, 1))
# [[1]]
# [1] 1 2 3 6 7 8 9
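Step 2 of that function, cutting a path off at the first place where it stops increasing, can be looked at in isolation. The helper below is a hypothetical base-R equivalent written for illustration (it uses `diff()` instead of `dplyr::lead()`) and is not part of the answer's code.

```r
# Keep the leading run of strictly increasing values
increasing_prefix <- function(v) {
  if (length(v) < 2) return(v)
  stop_at <- which(diff(v) <= 0)   # positions whose successor is not larger
  if (length(stop_at) == 0) v else v[seq_len(stop_at[1])]
}

increasing_prefix(c(1, 2, 3, 2, 5))   # 1 2 3
increasing_prefix(c(1, 10, 9))        # 1 10
```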
Now, I'll output (header) of max chains starting with each unique element in dfPairs, your original data
start_vals <- sort(unique(unlist(dfPairs)))
# [1] 1 2 3 4 5 6 7 8 9 10
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GPairs, i)))
names(max_chains) <- start_vals
# $`1`
# [1] 1 2 3 6 7 8 9
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# etc
And finally with dfTest, the newer data
start_vals <- sort(unique(unlist(dfTest)))
max_chains <- sapply(seq_len(length(start_vals)), function(i) max_chain_only_increasing(all_simple_paths(GTest, i)))
names(max_chains) <- start_vals
# $`2`
# [1] 2 3 6 7 8 9
# $`3`
# [1] 3 6 7 8 9
# $`4`
# [1] 4 9
# $`5`
# [1] 5
# $`6`
# [1] 6 7 8 9
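As a cross-check that does not need igraph, the same search can be sketched in base R as a plain depth-first enumeration. This is an illustrative alternative under the assumption that the two numbers in a pair are never equal; the function names `longest_chains` and `chain_to_links` are invented here.

```r
ODD  <- c(1, 1, 1, 3, 3, 3, 5, 7, 7, 9, 9)
EVEN <- c(10, 8, 2, 2, 6, 4, 2, 6, 8, 4, 8)
dfPairs <- data.frame(ODD, EVEN)

longest_chains <- function(df) {
  lo <- pmin(df[[1]], df[[2]])          # orient every pair small -> large
  hi <- pmax(df[[1]], df[[2]])
  nxt <- split(hi, lo)                  # for each number, the larger numbers it links to
  extend <- function(path) {
    succ <- nxt[[as.character(path[length(path)])]]
    if (is.null(succ)) return(list(path))          # dead end: chain is complete
    do.call(c, lapply(unique(succ), function(s) extend(c(path, s))))
  }
  chains <- do.call(c, lapply(sort(unique(lo)), extend))
  chains[lengths(chains) == max(lengths(chains))]  # keep all ties for longest
}

# Turn a chain of numbers back into its individual links
chain_to_links <- function(v) data.frame(from = head(v, -1), to = tail(v, -1))

best <- longest_chains(dfPairs)
best[[1]]                  # 1 2 3 6 7 8 9
chain_to_links(best[[1]])
```

Because every link goes from a smaller to a larger number, no chain can revisit a value, so the recursion always terminates. The number of chains can still grow very quickly on large inputs.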
In spite of Cpak's efforts I ended up writing my own function to solve this. In essence I realized I could turn the right-to-left chain links into left-to-right ones by using this section of code from Cpak's answer:
output <- arrange(rbind(setNames(dfPairs, c("X1", "X2")), setNames(rev(dfPairs), c("X1", "X2"))), X1)
To ensure the resulting chains were sequential, I deleted all decreasing links:
output$increase <- with(output, ifelse(X2>X1, "Greater", "Less"))
output <- filter(output, increase == "Greater")
output <- select(output, -increase)
I realized that if I split the dataframe output by unique values in X1, I could join each of these dataframes sequentially by joining the last column of the first dataframe to the first column of the next dataframe, which would create rows of sequentially increasing chains. The only problem I needed to resolve was the NAs in the last column of the merged dataframe. So I ended up splitting the joined dataframe after each merge, shifting the rows that contained NAs to remove them, and rbinding the result back together.
This is the actual code:
out_split <- split(output, output$X1)
df_final <- Reduce(join_shift, out_split)
The function, join_shift, is this:
join_shift <- function(dtf1,dtf2){
abcd <- full_join(dtf1, dtf2, setNames(colnames(dtf2)[1], colnames(dtf1)[ncol(dtf1)]))
abcd[is.na(abcd)]<-0
colnames(abcd)[ncol(abcd)] <- "end"
# print(abcd)
abcd_na <- filter(abcd, end==0)
# print(abcd_na)
abcd <- filter(abcd, end != 0)
abcd_na <- abcd_na[moveme(names(abcd_na), "end first")]
# print(abcd_na)
names(abcd_na) <- names(abcd)
abcd<- rbind(abcd, abcd_na)
z <- length(colnames(abcd))
colnames(abcd)<- c(paste0("X", 1:z))
# print(abcd)
return(abcd)
}
Finally, I found there were a lot of columns that had only zeros in it, so I wrote this to delete them and trim the final dataframe:
df_final_trim = df_final[,colSums(df_final) > 0]
Overall I'm happy with this. I imagine it could be a little more elegant, but it works on anything, including some rather huge and complicated data: it produces ~241,700 solutions from a dataset of 700 pairs.
I also used a moveme function that I found on stackoverflow (see below). I employed it to move NA values around to achieve the shift aspect of the join_shift function.
moveme <- function (invec, movecommand) {
movecommand <- lapply(strsplit(strsplit(movecommand, ";")[[1]],
",|\\s+"), function(x) x[x != ""])
movelist <- lapply(movecommand, function(x) {
Where <- x[which(x %in% c("before", "after", "first",
"last")):length(x)]
ToMove <- setdiff(x, Where)
list(ToMove, Where)
})
myVec <- invec
for (i in seq_along(movelist)) {
temp <- setdiff(myVec, movelist[[i]][[1]])
A <- movelist[[i]][[2]][1]
if (A %in% c("before", "after")) {
ba <- movelist[[i]][[2]][2]
if (A == "before") {
after <- match(ba, temp) - 1
}
else if (A == "after") {
after <- match(ba, temp)
}
}
else if (A == "first") {
after <- 0
}
else if (A == "last") {
after <- length(myVec)
}
myVec <- append(temp, values = movelist[[i]][[1]], after = after)
}
myVec
}

computing percentiles in a nested list and output in a dataframe in R

I have a nested list of participant ids and their visits. For each participant at each visit 2 variables are measured 10 times and the data structure is similar to the simulated data given by the R code. For participant [[1]] we have 3 visits corresponding to [[1]][[1]], [[1]][[2]] and [[1]][[3]]. Similarly for participant 2. I want to compute 25th and 75th quantiles for each patient at each visit and store it in a long dataframe as follows. Any help is appreciated.
set.seed(23)
ll <- lapply(1:2, FUN=function(i) replicate(3, matrix(rnorm(20), ncol=2), simplify=FALSE))
df<-data.frame(id=numeric(0),visit=numeric(0),col1.lower.ptile=numeric(0),col1.upper.ptile=numeric(0),col2.lower.ptile=numeric(0),col2.upper.ptile=numeric(0))
Here is a base R approach using a boat-load of *apply functions.
# use rapply to descend into the nested list and apply to pull the quantiles out of each column
myRList <- rapply(ll, function(x) apply(x, 2, quantile, probs=c(.25, .75)), how="list")
rapply goes into the nested structure, and apply pulls out the quantiles for each matrix by column. For the first patient, for example, this returns
myRList[1]
[[1]]
[[1]][[1]]
[,1] [,2]
25% 0.08238097 -0.5795292
75% 1.08541924 0.2856748
[[1]][[2]]
[,1] [,2]
25% -0.5572163 -0.9983007
75% 0.6329706 0.2190313
[[1]][[3]]
[,1] [,2]
25% -0.7966075 -0.4823147
75% 0.8378379 0.9661969
Note that each measurement is in a column, which is how R stores data in a matrix. Thus, we can pull out the data into the desired format using c to strip out the matrix class. To put this into a data.frame, use vapply and c to return a vector from each matrix, use lapply to run over each patient and return a list of vectors. Then use do.call with rbind.data.frame to return the desired data.frame.
dat <- cbind(id=rep(seq_along(ll), lengths(ll)),
do.call(rbind.data.frame,
lapply(myRList, function(x) t(vapply(x, c, FUN.VALUE=numeric(4))))))
This returns
dat
id V1 V2 V3 V4
1 1 0.08238097 1.08541924 -0.5795292 0.2856748
2 1 -0.55721633 0.63297057 -0.9983007 0.2190313
3 1 -0.79660754 0.83783794 -0.4823147 0.9661969
4 2 -1.07159858 0.02937252 -0.4670312 0.6528579
5 2 -0.44806361 0.52761202 0.1081107 1.1419628
6 2 -0.72854367 0.55997887 -0.1397759 0.6157683
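Once you see that you can trust the results, this can be simplified to a single operation as follows.
# byrow=TRUE because rapply returns the four quantiles of each matrix consecutively,
# so each matrix has to fill one row
dat <- cbind.data.frame(id=rep(seq_along(ll), lengths(ll)),
matrix(rapply(ll, function(x) apply(x, 2, quantile,
probs=c(.25, .75))), ncol=4, byrow=TRUE))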
Then set the variable names as desired.
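For the column names and the visit column the question asks for, something along these lines should work. It is a sketch built on the question's simulated `ll`; flattening with `unlist(..., recursive = FALSE)` and the use of `sequence()` for the visit numbers are choices made here, not taken from the answer.

```r
set.seed(23)
ll <- lapply(1:2, function(i) replicate(3, matrix(rnorm(20), ncol = 2), simplify = FALSE))

# One row per matrix: col1 25%, col1 75%, col2 25%, col2 75%
q <- t(vapply(unlist(ll, recursive = FALSE),
              function(m) c(apply(m, 2, quantile, probs = c(.25, .75))),
              FUN.VALUE = numeric(4)))
colnames(q) <- c("col1.lower.ptile", "col1.upper.ptile",
                 "col2.lower.ptile", "col2.upper.ptile")

df <- data.frame(id    = rep(seq_along(ll), lengths(ll)),
                 visit = sequence(lengths(ll)),   # 1, 2, 3 within each participant
                 q)
df
```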
A similar solution with somewhat simpler syntax:
with(new.env(), {
# compute quantiles
q <- lapply(ll, lapply, apply, 2, quantile, prob = c(0.25, 0.75))
# simplify to array
a <- simplify2array(unlist(q, recursive = FALSE))
# return dataframe
data.frame(id = rep(seq_along(ll), lengths(ll)),
visit = unlist(lapply(lengths(ll), seq)),
col1.lower.ptile = a[1, 1, ],
col1.upper.ptile = a[2, 1, ],
col2.lower.ptile = a[1, 2, ],
col2.upper.ptile = a[2, 2, ])
}) -> df
df
# id visit col1.lower.ptile col1.upper.ptile col2.lower.ptile col2.upper.ptile
#1 1 1 -0.18320744 0.42239195 -0.1075228 0.82134959
#2 1 2 -0.30466626 0.72310699 -1.1047154 -0.05519628
#3 1 3 -0.69026613 0.08553756 -0.4338562 0.34916939
#4 2 1 0.08335451 0.59375988 -1.1392453 0.45905958
#5 2 2 -0.81078650 0.23024319 -0.8819546 0.33385295
#6 2 3 -1.15689954 0.82117652 -0.2739212 0.61445726
Notice that I put the whole thing inside a with so that the intermediate results q and a are automatically destroyed at the end, but this is not strictly necessary.
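The same scoping effect is commonly obtained with `local()`, which evaluates a block in a fresh environment and returns its last value. A tiny sketch with made-up names:

```r
result <- local({
  tmp_squares <- (1:5)^2   # intermediate value, visible only inside local()
  sum(tmp_squares)
})
result                      # 55
exists("tmp_squares")       # FALSE, provided no such object was defined elsewhere
```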

Methods to exhaustively partition a vector into pairs in R

(This is inspired by another question marked as a duplicate. I think it is an interesting problem though, although perhaps there is an easy solution from combinatorics, about which I am very ignorant.)
Problem
For a vector of length n, where n mod 2 is zero, find all possible ways to partition all elements of the vector into pairs, without replacement, where order does not matter.
For example, for a vector c(1,2,3,4):
list(c(1,2), c(3,4))
list(c(1,3), c(2,4))
list(c(1,4), c(2,3))
My approach has been the following (apologies in advance for novice code):
# write a function that recursively breaks down a list of unique pairs (generated with combn). The natural ordering produced by combn means that for the first pass through, we take as the starting pair, all pairings with element 1 of the vector with all other elements. After that has been allocated, we iterate through the first p/2 pairs (this avoids duplicating).
pairer2 <- function(kn, pair_list) {
pair1_partners <- lapply(kn, function(x) {
# remove any pairs in the 'master list' that contain elements of the starting pair.
partners <- Filter(function(t) !any(t %in% x), pair_list)
if(length(partners) > 1) {
# run the function again
pairer2(kn = partners[1:(length(partners)/2)], partners)
} else {return(partners)}
})
# accumulate results into a nested list structure
return(mapply(function(x,y) {list(root = x, partners = y)}, kn, pair1_partners, SIMPLIFY = F))
}
# this function generates all possible unique pairs for a vector of length k as the starting point, then runs the pairing off function above
pair_combn <- function(k, n = 2) {
p <- combn(k, n, simplify = F)
pairer2(kn = p[1:(length(k)-1)], p)}
# so for a vector of length k = 4
pair_combn(1:4)
[[1]]
[[1]]$root
[1] 1 2
[[1]]$partners
[[1]]$partners[[1]]
[1] 3 4
[[2]]
[[2]]$root
[1] 1 3
[[2]]$partners
[[2]]$partners[[1]]
[1] 2 4
[[3]]
[[3]]$root
[1] 1 4
[[3]]$partners
[[3]]$partners[[1]]
[1] 2 3
It also works for larger k as far as I can tell. This isn't that efficient, possibly because Filter is slow for large lists, and I have to confess I can't collapse the nested lists (which are a tree representation of possible solutions) into a list of each partitioning. It feels like there should be a more elegant solution (in R)?
Mind you, it is interesting that this recursive approach generates a parsimonious (albeit inconvenient) representation of the possible solutions.
Here is one way:
> x <- c(1,2,3,4)
> xc <- combn(as.data.frame(combn(x, 2)), 2, simplify = FALSE)
> Filter(function(x) all(1:4 %in% unlist(x)), xc)
[[1]]
V1 V6
1 1 3
2 2 4
[[2]]
V2 V5
1 1 2
2 3 4
[[3]]
V3 V4
1 1 2
2 4 3
>
More generally:
pair_combn <- function(x) {
Filter(function(e) all(unique(x) %in% unlist(e)),
combn(as.data.frame(combn(x, 2)),
length(x)/2, simplify = FALSE))
}
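The combn-and-filter version builds every combination of pairs before discarding most of them, so it grows quickly with the length of the vector. A direct recursion avoids that: pair the first element with each remaining element in turn, then recurse on what is left. The sketch below is one possible way to write it (the name `all_pairings` is invented here) and it returns a flat list of partitions rather than a tree.

```r
all_pairings <- function(x) {
  stopifnot(length(x) %% 2 == 0)
  if (length(x) == 0) return(list(list()))   # one way to pair nothing
  rest <- x[-1]
  out <- list()
  for (i in seq_along(rest)) {
    # x[1] is paired with rest[i]; the remaining elements are paired recursively
    for (tail_pairs in all_pairings(rest[-i])) {
      out[[length(out) + 1]] <- c(list(c(x[1], rest[i])), tail_pairs)
    }
  }
  out
}

res <- all_pairings(1:4)
length(res)                  # 3
length(all_pairings(1:6))    # 15
```

Fixing the first element in the first pair is what prevents duplicates, so the count is (n-1) * (n-3) * ... * 1.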

R: Summing rows in a loop based on rowname

I am new to R and would find some tips very helpful.
I have a populated matrix X whose entries are numeric row names.
These refer to rows of a second matrix, Y.
I would like to sum all the rows in matrix Y that are named in each column of matrix X.
So X[,1] contains a list of row names, and I want the combined row sums of those particular rows in matrix Y.
I think where I'm having difficulty is where to put the rownames() in the statements - I've tried many different combinations using functions, with and if. Any guidance or tips would be very gratefully received. Thank you.
I have provided a simplified version of the problem below:
X      Y
1 2    10 10 10
3 3    20 20 20
5 4    30 30 30
       40 40 40
       50 50 50
Z[1] (X[,1]) should equal [10+10+10]+[30+30+30]+[50+50+50]
Z[2] (X[,2]) should equal [20+20+20]+[30+30+30]+[40+40+40]
Z should be a vector of sums of Y's rows depending on the column of X's row name values.
You can achieve this as follows:
x <- data.frame(x)
sapply(x, function(r) sum(y[r, ]))
Output is:
X1 X2
270 270
Alternatively, you can name columns of matrix x and supply them to sapply. In this case, I went with easy conversion of x to data frame.
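The same idea works without converting to a data frame. The sketch below assumes the entries of X are row positions in Y (if Y has real row names, index with `as.character()` instead); the matrices are built to match the simplified example in the question.

```r
X <- matrix(c(1, 3, 5, 2, 3, 4), nrow = 3, ncol = 2)
Y <- 10 * matrix(rep(1:5, each = 3), nrow = 5, byrow = TRUE)

# For each column of X, sum the rows of Y that it lists
Z <- apply(X, 2, function(rows) sum(Y[rows, ]))
Z   # 270 270

# Vectorised variant: look up precomputed row sums, then sum per column of X
Z2 <- colSums(matrix(rowSums(Y)[X], nrow = nrow(X)))
```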
A solution based on data.table and reshape2 packages:
library(data.table)
library(reshape2)
X <- matrix(c(1,3,5,2,3,4), nrow = 3, ncol = 2)
Y <- 10*matrix(rep(1:5, each = 3), nrow = 5, byrow = TRUE)
# Convert to data.table
X.DT <- data.table(X)
Y.DT <- data.table(Y)
Z.DT <-
# First melt the X to get the column names as grouping 'variable'
# and the numeric values in 'value'
melt(X.DT, measure.vars = names(X.DT))[
# Sum the values of Y selected by the indices stored in X
, .(Z = sum(Y.DT[value]))
, by = variable
]
Z.DT
Result looks like this:
variable Z
1: V1 270
2: V2 270
And if you need the result as a simple vector Z then you can do it like this:
Z <- Z.DT[,Z]
Z
[1] 270 270
For reference, the intermediary data.table that is returned by the melt function looks like this:
> melt(X.DT, measure.vars = names(X.DT))
variable value
1: V1 1
2: V1 3
3: V1 5
4: V2 2
5: V2 3
6: V2 4

R: Properly using a dataframe as an argument to a function

I am practicing using the apply function in R, and so I'm writing a simple function to apply to a dataframe.
I have a dataframe with 2 columns.
V1 V2
1 3
2 4
I decided to do some basic arithmetic and have the answer in the 3rd column, specifically, I want to multiply the first column by 2 and the second column by 3, then sum them.
V1 V2 V3
1 3 11
2 4 16
Here's what I was thinking:
mydf <- as.data.frame(matrix(c(1:4),ncol=2,nrow=2))
some_function <- function(some_df) {some_df[,1]*2 +
some_df[,2]*3}
mydf <- apply(mydf ,2, some_function)
But what is wrong with my arguments to the function? R is giving me an error regarding the dimension of the dataframe. Why?
Three things wrong:
1) apply passes each row or column to the function as a plain vector, so you index it with [1], not [,1]
2) you need to run by row, MARGIN=1, not 2
3) you need to cbind the result, because apply returns a new vector rather than appending a column, so assigning it to mydf overwrites the data frame
mydf <- as.data.frame(matrix(c(1:4),ncol=2,nrow=2))
some_function <- function(some_df) {some_df[1]*2 +
some_df[2]*3}
mydf <- cbind(mydf,V3=apply(mydf ,1, some_function))
# V1 V2 V3
#1 1 3 11
#2 2 4 16
but probably easier just to do the vector math:
mydf$V3<-mydf[,1]*2 + mydf[,2]*3
because vector math is one of the greatest things about R
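Since the operation is a weighted sum of columns, it can also be written as a matrix product. The sketch below compares that with the row-wise apply; the names `weights`, `row_version` and `mat_version` are made up for the illustration.

```r
mydf <- as.data.frame(matrix(1:4, ncol = 2, nrow = 2))

# apply(mydf, 1, f) hands f one row at a time as a plain vector
row_version <- apply(mydf, 1, function(r) r[1] * 2 + r[2] * 3)

# The same weighted sum as a matrix product
weights <- c(2, 3)
mat_version <- as.vector(as.matrix(mydf) %*% weights)

mydf$V3 <- mat_version   # 11 16
```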
