Downsample to equalize the counts for pairs of factor levels? - r

Suppose you have a factor variable whose level labels come in pairs
(such as 'a1' and 'a2', 'b1' and 'b2', etc.), and these pairs have unequal n-sizes.
x <- factor(c(rep("a1", 10), rep("a2", 15),rep("b1", 5), rep("b2", 30),rep("c1", 33), rep("c2", 22)))
> table(x)
a1 a2 b1 b2 c1 c2
10 15 5 30 33 22
Now suppose you want to randomly downsample the larger level of each pair so that
both levels in a pair have the same n-size. Here's the desired outcome:
a1 a2 b1 b2 c1 c2
10 10 5 5 22 22
I have found that caret::downSample() can downsample to equalize all the levels of
a factor:
x_ds <- caret::downSample(1:115, x)
table(x_ds$Class)
a1 a2 b1 b2 c1 c2
5 5 5 5 5 5
And I have the notion to use split() in conjunction with downSample(), but I'm having trouble figuring out a way to split on the level pairs. How could this be done?
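One possible approach, sketched here in base R rather than with caret::downSample(): derive a pair label by stripping the trailing digits from each level, split the row indices on that label, and within each pair sample every level down to the size of the smaller one. The names pair, idx_by_pair, and keep are illustrative, and the sketch assumes every level name is a pair prefix followed by digits.

```r
set.seed(42)  # only so the random draw is repeatable

x <- factor(c(rep("a1", 10), rep("a2", 15), rep("b1", 5), rep("b2", 30),
              rep("c1", 33), rep("c2", 22)))

# Pair label: the level name with its trailing digits removed ("a", "b", "c")
pair <- sub("[0-9]+$", "", as.character(x))

# Row indices grouped by pair
idx_by_pair <- split(seq_along(x), pair)

keep <- unlist(lapply(idx_by_pair, function(idx) {
  # Within one pair, group the indices by the actual level
  by_level <- split(idx, droplevels(x[idx]))
  n_min <- min(lengths(by_level))
  # Sample n_min indices from each level (sample.int avoids the
  # length-one pitfall of sample())
  unlist(lapply(by_level, function(i) i[sample.int(length(i), n_min)]))
}), use.names = FALSE)

x_ds <- x[sort(keep)]
table(x_ds)
```

The same keep index can be used to subset a data frame that x belongs to, so the other columns stay aligned with the downsampled factor.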

Related

Create multiple variables in data.table based other variables names [duplicate]

I am trying to create a series of variables, c1, c2, and c3, based on the values of two sets of variables: a1, a2, and a3, and b1, b2, and b3. The code below shows a hard-coded solution, but in reality I don't know the total number of variables in each set (say, up to an and bn). As you can see, the names of the c variables depend on the names of the a and b variables.
Is there a way in data.table to do this? I tried to do it by using purrr::map2 within data.table but I could not make it work. I would highly appreciate your help.
Thanks.
library(data.table)
DT <- data.table(
a1 = c(1, 2, 3),
a2 = c(1, 2, 3)*2,
a3 = c(1, 2, 3)*3,
b1 = c(5, 6, 7),
b2 = c(5, 6, 7)*4,
b3 = c(5, 6, 7)*5
)
DT[]
#> a1 a2 a3 b1 b2 b3
#> 1: 1 2 3 5 20 25
#> 2: 2 4 6 6 24 30
#> 3: 3 6 9 7 28 35
DT[,
`:=`(
c1 = a1 + b1,
c2 = a2 + b2,
c3 = a3 + b3
)
]
DT[]
#> a1 a2 a3 b1 b2 b3 c1 c2 c3
#> 1: 1 2 3 5 20 25 6 22 28
#> 2: 2 4 6 6 24 30 8 28 36
#> 3: 3 6 9 7 28 35 10 34 44
Created on 2020-08-26 by the reprex package (v0.3.0)
This first part is mostly defensive, guarding against: a* variables without matching b* variables; vice versa; and different order of each:
anames <- grep("^a[0-9]+$", colnames(DT), value = TRUE)
bnames <- grep("^b[0-9]+$", colnames(DT), value = TRUE)
numnames <- intersect(gsub("^a", "", anames), gsub("^b", "", bnames))
anames <- sort(anames[gsub("^a", "", anames) %in% numnames])
bnames <- sort(bnames[gsub("^b", "", bnames) %in% numnames])
cnames <- gsub("^b", "c", bnames)
If you know the number ranges a priori and want something less dynamic but more straightforward, then use:
anames <- paste0("a", 1:3)
bnames <- paste0("b", 1:3)
cnames <- paste0("c", 1:3)
Now the magic:
DT[, (cnames) := Map(`+`, mget(anames), mget(bnames)) ]
DT
# a1 a2 a3 b1 b2 b3 c1 c2 c3
# 1: 1 2 3 5 20 25 6 22 28
# 2: 2 4 6 6 24 30 8 28 36
# 3: 3 6 9 7 28 35 10 34 44
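To see what the defensive name matching guards against, here is a minimal base R sketch on a plain data.frame. The data are hypothetical: a3 has no matching b3, b4 has no matching a4, and the b columns are out of order. Only the suffixes present in both sets are used, and Map pairs the columns positionally, just as in the data.table call above.

```r
# Hypothetical data with unmatched and unordered columns
DF <- data.frame(a1 = 1:3, a2 = 4:6, a3 = 7:9,
                 b2 = 10:12, b1 = 13:15, b4 = 16:18)

anames <- grep("^a[0-9]+$", colnames(DF), value = TRUE)
bnames <- grep("^b[0-9]+$", colnames(DF), value = TRUE)

# Keep only the numeric suffixes that occur in both sets
common <- intersect(sub("^a", "", anames), sub("^b", "", bnames))

anames <- paste0("a", common)
bnames <- paste0("b", common)
cnames <- paste0("c", common)

# Add each matched a/b pair; unname() so the new columns take their
# names from cnames rather than from the a columns
DF[cnames] <- unname(Map(`+`, DF[anames], DF[bnames]))
names(DF)
```

Building all three name vectors from the same common suffixes keeps the pairs aligned regardless of the column order in the table.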
You could tackle this issue if you split DT column-wise by the pattern of the names first, and then aggregate it
# removes numbers from col names
(ptn <- sub("\\d", "", names(DT)))
# [1] "a" "a" "a" "b" "b" "b"
# get unique numbers contained in the col names (as strings but it doesn't matter here)
(nmb <- unique(sub("\\D", "", names(DT))))
# [1] "1" "2" "3"
Next step is to split DT and finally do the aggregation
DT[, paste0("c", nmb) := do.call(`+`, split.default(DT, f = ptn))]
Result
DT
# a1 a2 a3 b1 b2 b3 c1 c2 c3
#1: 1 2 3 5 20 25 6 22 28
#2: 2 4 6 6 24 30 8 28 36
#3: 3 6 9 7 28 35 10 34 44
We can melt to long format, create the column 'c', dcast into 'wide' format and then cbind
library(data.table)
cbind(DT, dcast(melt(DT, measure = patterns('^a', '^b'))[,
c := value1 + value2], rowid(variable) ~ paste0('c', variable),
value.var = 'c')[, variable := NULL])
# a1 a2 a3 b1 b2 b3 c1 c2 c3
#1: 1 2 3 5 20 25 6 22 28
#2: 2 4 6 6 24 30 8 28 36
#3: 3 6 9 7 28 35 10 34 44
A base R option
u<-split.default(DT,gsub("\\D","",names(DT)))
cbind(DT,do.call(cbind,Map(rowSums,setNames(u,paste0("c",names(u))))))
which gives
a1 a2 a3 b1 b2 b3 c1 c2 c3
1: 1 2 3 5 20 25 6 22 28
2: 2 4 6 6 24 30 8 28 36
3: 3 6 9 7 28 35 10 34 44

Get data from the last column with data per row [duplicate]

I have different sequences of events for elements in a spreadsheet. The number of events differs from row to row.
I want to get the last event in each row and put it in another column, as in the column "Last" below:
ev1 ev2 ev3 ev4 Last
A A1 A2 A3 NA A3
B B1 B2 NA NA B2
C C1 C2 C3 C4 C4
D D1 D2 D3 NA D3
E E1 NA NA NA E1
If any of the events in a row equals "Delivered", I want to show Delivered instead of the last event.
You can try dplyr::coalesce on the whole data.frame, but you have to reverse the order of the columns first. coalesce returns the first non-missing value across its arguments, scanning left to right, whereas you want the last (rightmost) one. The solution could be:
library(dplyr)
df$Last <- coalesce(!!! df[ncol(df):1])
df
# ev1 ev2 ev3 ev4 Last
# A A1 A2 A3 <NA> A3
# B B1 B2 <NA> <NA> B2
# C C1 C2 C3 C4 C4
# D D1 D2 D3 <NA> D3
# E E1 <NA> <NA> <NA> E1
Data:
df <- read.table(text =
"ev1 ev2 ev3 ev4
A A1 A2 A3 NA
B B1 B2 NA NA
C C1 C2 C3 C4
D D1 D2 D3 NA
E E1 NA NA NA",
header = TRUE, stringsAsFactors = FALSE)
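The answer above does not cover the "Delivered" requirement. A minimal base R sketch that handles both parts, assuming all event columns are character: find the position of the last non-NA value per row with max.col, then overwrite the result wherever any event in the row equals "Delivered". The small data frame is hypothetical and only serves to exercise the override.

```r
# Hypothetical data: row 2 contains "Delivered" in the middle of its sequence
df <- data.frame(
  ev1 = c("A1", "B1", "C1"),
  ev2 = c("A2", "Delivered", NA),
  ev3 = c("A3", "B3", NA),
  ev4 = c(NA, "B4", NA),
  stringsAsFactors = FALSE
)

m <- as.matrix(df[, c("ev1", "ev2", "ev3", "ev4")])

# Column index of the last non-NA entry in each row
last_idx <- max.col(!is.na(m), ties.method = "last")

# Pick one cell per row via (row, column) matrix indexing
df$Last <- m[cbind(seq_len(nrow(m)), last_idx)]

# Override with "Delivered" if it appears anywhere in the row
delivered <- rowSums(m == "Delivered", na.rm = TRUE) > 0
df$Last[delivered] <- "Delivered"

df$Last
```

Working on a character matrix keeps the lookup vectorised, which matters more than the choice of package when the spreadsheet has many rows.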

R - Convert a bipartite edgelist to a unipartite adjacency matrix

I have a bipartite edgelist that I would like to convert into a unipartite graph of just the 'from' nodes. I need to do this with a sparse matrix because of the size. Unfortunately, this means that easier solutions, such as bipartite.projection(graph), freeze everything. My data looks like:
To From Weight
A1 B2 1
A1 B3 1
A2 B2 1
A3 B3 1
A3 B4 1
A4 B2 1
A4 B3 1
Using the Matrix package, I've created a sparse matrix with the correct dimensions, but for some reason only the diagonal is populated. For the sparse matrix I used:
myMat <- sparseMatrix(as.integer(as.factor(df$From)),
as.integer(as.factor(df$From)),
x = df$Weight,
dimnames = list(levels(as.factor(df$From)),
levels(as.factor(df$From))
)
)
returns:
B2 B3 B4
B2 2 . .
B3 . 2 .
B4 . . 1
The diagonal summed the weight, but the rest of the matrix is empty where I was expecting it to have filled with the summed weight as well.
What I'd like is:
B2 B3 B4
B2 . 2 .
B3 2 . 1
B4 . 1 .
As this is a matrix of one side of the bipartite graph with the matrix[i,j] representing the count of df$to values connecting any two df$from values. This would then be the weight I would give to edges in any network graph.
What about as_adjacency_matrix which returns sparse?
library(igraph)
df <- read.table(textConnection("
To From Weight
A1 B2 1
A1 B3 1
A2 B2 1
A3 B3 1
A3 B4 1"), header = TRUE, stringsAsFactors = FALSE)
g <- graph_from_data_frame(df[, 1:2])
V(g)$type <- V(g)$name %in% df[, 1]
is_bipartite(g)
as_adjacency_matrix(g)
I ended up using some matrix algebra rather than a dedicated function to get my result. By changing the sparseMatrix call slightly (rows from 'From', columns from 'To') and then multiplying by the transpose, I got the matrix I was looking for:
myMat <- sparseMatrix(as.integer(as.factor(df$From)),
as.integer(as.factor(df$To)),
x = 1,
dimnames = list(levels(as.factor(df$From)),
levels(as.factor(df$To))
)
)
finalMat <- myMat %*% t(myMat)
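To make the algebra concrete, here is a small dense sketch in base R using the seven edges from the question. It is only an illustration: table() stands in for sparseMatrix to build the From-by-To incidence matrix, and the names inc and proj are made up for this example. Note that the product's diagonal holds each node's own edge count, so it is zeroed here to match the desired output.

```r
df <- data.frame(
  To     = c("A1", "A1", "A2", "A3", "A3", "A4", "A4"),
  From   = c("B2", "B3", "B2", "B3", "B4", "B2", "B3"),
  Weight = 1
)

# Incidence matrix: one row per 'From' node, one column per 'To' node
inc <- unclass(table(df$From, df$To))

# Entry [i, j] counts the 'To' nodes shared by 'From' nodes i and j
proj <- inc %*% t(inc)

# The diagonal is each node's own degree, not a connection to another node
diag(proj) <- 0
proj
```

With the Matrix package the same steps apply to the sparse myMat, where tcrossprod(myMat) is an equivalent way to write myMat %*% t(myMat).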

cbind with partially nested list

I'm trying to cbind or unnest or as.data.table a partially nested list.
id <- c(1,2)
A <- c("A1","A2","A3")
B <- c("B1")
AB <- list(A=A,B=B)
ABAB <- list(AB,AB)
nested_list <- list(id=id,ABAB=ABAB)
The length of id is the same as ABAB (2 in this case). I don't know how to unlist a part of this list (ABAB) and cbind another part (id). Here's my desired result as a data.table:
data.table(id=c(1,1,1,2,2,2),A=c("A1","A2","A3","A1","A2","A3"),B=rep("B1",6))
id A B
1: 1 A1 B1
2: 1 A2 B1
3: 1 A3 B1
4: 2 A1 B1
5: 2 A2 B1
6: 2 A3 B1
I haven't tested for more general cases, but this works for the OP example:
library(data.table)
as.data.table(nested_list)[, lapply(ABAB, as.data.table)[[1]], id]
# id A B
#1: 1 A1 B1
#2: 1 A2 B1
#3: 1 A3 B1
#4: 2 A1 B1
#5: 2 A2 B1
#6: 2 A3 B1
Or another option (which is probably faster, but is more verbose):
rbindlist(lapply(nested_list$ABAB, as.data.table),
idcol = 'id')[, id := nested_list$id[id]]
This is some super ugly base R, but produces the desired output.
Reduce(rbind, Map(function(x, y) setNames(data.frame(x, y), c("id", "A", "B")),
as.list(nested_list[[1]]),
lapply(unlist(nested_list[-1], recursive=FALSE),
function(x) Reduce(cbind, x))))
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
lapply takes a list of two elements (each containing the A and B variables), extracted with unlist and recursive=FALSE, and returns a list of character matrices with the B variable filled in by recycling. A list of the individual id values from as.list(nested_list[[1]]) and the list of matrices are fed to Map, which converts corresponding pairs to a data.frame, gives the columns the desired names, and returns a list of data.frames. Finally, this list of data.frames is fed to Reduce, which rbinds the results into a single data.frame.
The final Reduce(rbind, ...) could be replaced by data.table's rbindlist if desired.
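The same three steps can be written out one at a time, which may be easier to follow than the nested call. This is a sketch of the idea rather than a drop-in replacement; the intermediate names mats, dfs, and result are illustrative.

```r
id <- c(1, 2)
AB <- list(A = c("A1", "A2", "A3"), B = "B1")
nested_list <- list(id = id, ABAB = list(AB, AB))

# Step 1: one character matrix per ABAB element; cbind recycles B to length 3
mats <- lapply(nested_list$ABAB, function(x) do.call(cbind, x))

# Step 2: pair each id with its matrix; the scalar id is recycled down the rows
dfs <- Map(function(i, m) data.frame(id = i, m, stringsAsFactors = FALSE),
           nested_list$id, mats)

# Step 3: stack the per-id data frames
result <- do.call(rbind, dfs)
result
```

Because cbind takes the column names from the list names, no setNames call is needed in this variant.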
Here's another hideous solution
max_length = max(unlist(lapply(nested_list, function(x) lapply(x, lengths))))
data.frame(id = do.call(c, lapply(nested_list$id, rep, max_length)),
do.call(rbind, lapply(nested_list$ABAB, function(x)
do.call(cbind, lapply(x, function(y) {
if(length(y) < max_length) {
rep(y, max_length)
} else {
y
}
})))))
# id A B
#1 1 A1 B1
#2 1 A2 B1
#3 1 A3 B1
#4 2 A1 B1
#5 2 A2 B1
#6 2 A3 B1
And one more, also inelegant, but I'd gone too far by the time I saw the other answers.
restructure <- function(nested_l) {
ids <- as.numeric(max(unlist(lapply(unlist(nested_l, recursive = FALSE), function(x){
lapply(x, length)
}))))
temp = data.frame(rep(nested_l$id, each = ids),
sapply(1:length(nested_l$id), function(x){
out <-unlist(lapply(nested_l[[2]], function(y){
return(y[x])
}))
}))
names(temp) <- c("id", unique(substring(unlist(nested_l[2]), first = 1, last = 1)))
return(temp)
}
> restructure(nested_list)
id A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1
Joining the party:
library(tidyverse)
temp <- map(nested_list,~map(.x,~expand.grid(.x)))
df <- map_df(1:2,~cbind(temp$id[[.x]],temp$ABAB[[.x]]))
Var1 A B
1 1 A1 B1
2 1 A2 B1
3 1 A3 B1
4 2 A1 B1
5 2 A2 B1
6 2 A3 B1

Combining objects across a list

I have a simple question. I have a list of objects. Each object holds a few lists. Before this gets too complicated, let me illustrate:
x = a list
x[[1]] = some object
x[[2]] = another object
...
x[[n]] = another object
And as I said, each object holds some more lists. But I'm interested in a specific list, let's call it "a".
x[[1]][[a]] = ('A': 1, 'B': 2, 'C': 3, ..., Z: 26)
Sorry for the python-like syntax! I am really just learning R. Anyway, what I want to do is combine the lists held in these objects, then take their median. To make this more clear, I want to group all 'A' elements, then take their median:
x[[1]][[a]][['A']], x[[2]][[a]][['A']], x[[3]][[a]][['A']], ..., x[[n]][[a]][['A']]
Similarly I want to group all 'B', 'C', ..., 'Z' elements and take their median...
x[[1]][[a]][['Z']], x[[2]][[a]][['Z']], x[[3]][[a]][['Z']], ..., x[[n]][[a]][['Z']]
So the question is what's the best way to do this? I've spent hours trying to figure this out! It would be great if someone could help me.
And if you would like to know what I'm actually doing, basically I have a list (x) of random forest objects. So x[[1]] is the first random forest, x[[100]] is the 100th random forest. Each random forest has a list of predicted values, which are stored in, e.g. x[[1]][['predicted']]. Each prediction list has a label associated with its predicted value. What I'm actually trying to do is calculate each label's median predicted value across all 100 random forests. And I want to do it efficiently. In Python, this is easy, but in R I'm not so sure. Anyway, thanks for the help!!! I really appreciate it.
Here's one way you could do it. It's a bit tough because you can't use rapply to subset by the names of list elements (which is frustrating). But you can unlist and then subset on names and take the median that way...
# Make some reproducible data
set.seed(1)
l <- list( a = sample(10,3) , b = sample(10,3) , c = sample(10,3) )
ll <- list( l , l , l )
# Unlist - we get a named vector but all a's have unique names - e.g. a1 , a2... an
unl <- unlist(ll)
# a1 a2 a3 b1 b2 b3 c1 c2 c3 a1 a2 a3 b1 b2 b3 c1 c2 c3 a1 a2 a3 b1 b2 b3 c1 c2 c3
# 3 4 5 10 2 8 10 6 9 3 4 5 10 2 8 10 6 9 3 4 5 10 2 8 10 6 9
# Subset by those elements whose name starts with 'a'
a.unl <- unl[ grepl("^a",names(unl)) ]
# a1 a2 a3 a1 a2 a3 a1 a2 a3
# 3 4 5 3 4 5 3 4 5
# Take median
median( a.unl )
# [1] 4
To loop over multiple names try this...
sapply( c( "a" , "b" , "c" ) , function(x) median( unl[ grepl(x,names(unl) ) ] ) )
# a b c
# 4 8 9
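You could also do this with a simple loop for each of A, B, C, ... (the accumulator needs its own name so that it does not overwrite the list x):
vals <- c()
for( i in seq_along(x) ) vals <- c( vals, x[[i]][["a"]][["A"]] )
median(vals)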
Sample data for creating your top-level list x:
x <- replicate(3, list(a = as.list(setNames(sample(1:100, 26), LETTERS)),
b = runif(10)),
simplify = FALSE)
First, extract a from each list:
a.only <- lapply(x, `[[`, "a")
Then, to compute all A through Z medians in one shot, do:
do.call(mapply, c(a.only, FUN = function(...) median(unlist(list(...)))))
# A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
# 55 59 41 21 93 72 65 74 51 42 87 25 60 40 13 77 35 31 92 51 57 37 87 67 29 46
If the sublists contain more items than you need, say you only want to compute medians on A, C, Z, do:
a.slices <- lapply(a.only, `[`, c("A", "C", "Z"))
do.call(mapply, c(a.slices, FUN = function(...) median(unlist(list(...)))))
# A C Z
# 55 41 46
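An alternative sketch, assuming every object's a list carries the same labels in the same order: collect the a lists into a matrix with one row per label and one column per object, then take row-wise medians. The names a.mat and medians are illustrative, and the sample data is seeded here only so the result is repeatable.

```r
set.seed(1)
x <- replicate(3, list(a = as.list(setNames(sample(1:100, 26), LETTERS)),
                       b = runif(10)),
               simplify = FALSE)

# 26 x 3 matrix: rows are labels A..Z, columns are the objects in x
a.mat <- sapply(x, function(obj) unlist(obj[["a"]]))

# Median of each label across all objects
medians <- apply(a.mat, 1, median)
head(medians)
```

For the random forest case described in the question, the same pattern would apply with "predicted" in place of "a", provided the predictions are labelled consistently across forests.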
