R coders! I have a data frame, plan, with two columns. One column, lab, holds group labels, and the other, tr, has only two distinct values in it.
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
> plan
lab tr
1 a 1
2 a 2
3 a 2
4 a 1
5 b 1
6 b 2
7 b 1
8 b 2
I have another vector, order_new, which is a reordered version of lab.
order_new <- lab[sample(1:8)]
> order_new
[1] "b" "b" "a" "a" "b" "a" "b" "a"
I want to reorder the data frame above so the tr values are sorted in the order given by order_new but with the order within the original lab groups preserved. The result I want is:
plan_new <- data.frame(order_new = order_new, tr = c(1, 2, 1, 2, 1, 2, 2, 1))
> plan_new
order_new tr
1 b 1
2 b 2
3 a 1
4 a 2
5 b 1
6 a 2
7 b 2
8 a 1
The first row in the new data frame is a "b" value and so takes the first "b" value in the original data frame. Row 2, also a "b", takes the second "b" value in the original. The third row, an "a", takes the first "a" value in the original etc.
I can't find anything close enough in past answers to work this out and am really looking forward to someone helping me out with this!
If you don't mind a loop
order_new=c("b", "b", "a", "a", "b", "a", "b", "a")
tmp=split(plan$tr,plan$lab)
res=list()
for (x in 1:length(order_new)) {
res[[x]]=tmp[[order_new[x]]][1]
tmp[[order_new[x]]]=tail(tmp[[order_new[x]]],-1)
}
data.frame(
"lab"=order_new,
"tr"=unlist(res)
)
lab tr
1 b 1
2 b 2
3 a 1
4 a 2
5 b 1
6 a 2
7 b 2
8 a 1
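A loop-free variant of the same idea, as a minimal sketch: split() the tr values by lab, then let unsplit() deal them back out in the order given by order_new. This assumes order_new contains each label exactly as many times as plan$lab does; the name tr_new is only for illustration.

```r
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
order_new <- c("b", "b", "a", "a", "b", "a", "b", "a")

# split tr by group, then refill the positions of each label in order_new;
# within each group the values are used in their original order
tr_new <- unsplit(split(plan$tr, plan$lab), order_new)
plan_new <- data.frame(order_new = order_new, tr = tr_new)
plan_new
```

Because both split() and unsplit() order the groups by the sorted labels, the "a" values land on the "a" positions and the "b" values on the "b" positions.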
Here is a data.table approach. It can easily be adapted into a dplyr or base R solution following the same logic.
I included all intermediate results to show what each line does.
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
#hard coded, since sample is not reproducible without set.seed()
order_new <- c("b", "b", "a", "a", "b", "a", "b", "a")
library( data.table )
#make plan a data.table
setDT(plan)
#set row_id's by group (lab)
plan[, row_id := rowid( lab ) ]
# lab tr row_id
# 1: a 1 1
# 2: a 2 2
# 3: a 2 3
# 4: a 1 4
# 5: b 1 1
# 6: b 2 2
# 7: b 1 3
# 8: b 2 4
#make a new data.table for the new ordering
plan_new <- data.table( order_new = order_new )
#also add rownumbers by group
plan_new[, row_id := rowid( order_new ) ][]
# order_new row_id
# 1: b 1
# 2: b 2
# 3: a 1
# 4: a 2
# 5: b 3
# 6: a 3
# 7: b 4
# 8: a 4
#now join the tr-value from data.table 'plan' to 'plan_new', based on the row_id
plan_new[ plan, tr := i.tr, on = .(order_new = lab, row_id) ]
# order_new row_id tr
# 1: b 1 1
# 2: b 2 2
# 3: a 1 1
# 4: a 2 2
# 5: b 3 1
# 6: a 3 2
# 7: b 4 2
# 8: a 4 1
#drop the row_id column if needed
plan_new[, row_id := NULL ][]
# order_new tr
# 1: b 1
# 2: b 2
# 3: a 1
# 4: a 2
# 5: b 1
# 6: a 2
# 7: b 2
# 8: a 1
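For anyone who wants the same logic without data.table, here is a base R sketch. It numbers the occurrences within each group with ave(), builds a "label + occurrence" key on both sides, and looks the keys up with match(). The names id_plan, id_new, key_plan and key_new are made up for this example.

```r
lab <- rep(letters[1:2], each = 4)
tr <- c(1, 2, 2, 1, 1, 2, 1, 2)
plan <- data.frame(lab = lab, tr = tr)
order_new <- c("b", "b", "a", "a", "b", "a", "b", "a")

# running occurrence number within each label (the role rowid() plays above)
id_plan <- ave(seq_along(plan$lab), plan$lab, FUN = seq_along)
id_new <- ave(seq_along(order_new), order_new, FUN = seq_along)

# join on (label, occurrence number) by matching pasted keys
key_plan <- paste(plan$lab, id_plan)
key_new <- paste(order_new, id_new)
plan_new <- data.frame(order_new = order_new,
                       tr = plan$tr[match(key_new, key_plan)])
plan_new
```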
I have a data.table:
groups <- data.table(group = c("A", "B", "C", "D", "E", "F", "G"),
code_1 = c(2,2,2,7,8,NA,5),
code_2 = c(NA,3,NA,3,NA,NA,2),
code_3 = c(4,1,1,4,4,1,8))
group code_1 code_2 code_3
A 2 NA 4
B 2 3 1
C 2 NA 1
D 7 3 4
E 8 NA 4
F NA NA 1
G 5 2 8
What I would like to achieve, is for each group to find the immediate neighbors based on the available codes. For example: Group A has immediate neighbors groups B, C due to code_1 (code_1 is equal to 2 in all groups) and has immediate neighbor groups D,E due to code_3 (code_3 is equal to 4 in all those groups).
What I tried is for each code, subsetting the first column (group) based on the matches as follows:
groups$code_1_match = list()
for (row in 1:nrow(groups)){
set(groups, i=row, j="code_1_match", list(groups$group[groups$code_1[row] == groups$code_1]))
}
group code_1 code_2 code_3 code_1_match
A 2 NA 4 A,B,C,NA
B 2 3 1 A,B,C,NA
C 2 NA 1 A,B,C,NA
D 7 3 4 D,NA
E 8 NA 4 E,NA
F NA NA 1 NA,NA,NA,NA,NA,NA,...
G 5 2 8 NA,G
This "kinda" works but I would assume there is a more data table kind of way of doing this. I tried
groups[, code_1_match_2 := list(group[code_1 == groups$code_1])]
But this doesn't work.
Am I missing some obvious data table trick to deal with it?
My ideal case result would look like this (which currently would require using my method for all 3 columns and then concatenating the results):
group code_1 code_2 code_3 Immediate neighbors
A 2 NA 4 B,C,D,E
B 2 3 1 A,C,D,F
C 2 NA 1 A,B,F
D 7 3 4 B,A,E
E 8 NA 4 A,D
F NA NA 1 B,C
G 5 2 8
Using igraph, get 2nd degree neighbours, drop numeric nodes, paste remaining nodes.
library(data.table)
library(igraph)
# reshape wide-to-long
x <- melt(groups, id.vars = "group")[!is.na(value)]
# convert to graph
g <- graph_from_data_frame(x[, .(from = group, to = paste0(variable, "_", value))])
# get 2nd degree neighbours
x1 <- ego(g, 2, nodes = groups$group)
# prettify the result
groups$res <- sapply(seq_along(x1), function(i) toString(intersect(names(x1[[ i ]]),
groups$group[ -i ])))
# group code_1 code_2 code_3 res
# 1: A 2 NA 4 B, C, D, E
# 2: B 2 3 1 A, C, D, F
# 3: C 2 NA 1 A, B, F
# 4: D 7 3 4 B, A, E
# 5: E 8 NA 4 A, D
# 6: F NA NA 1 B, C
# 7: G 5 2 8
More info
This is what our data looks like before converting to an igraph object. We want to ensure that code_1 with value 2 is treated as different from code_2 with value 2, etc.
x[, .(from = group, to = paste0(variable, "_", value))]
# from to
# 1: A code_1_2
# 2: B code_1_2
# 3: C code_1_2
# 4: D code_1_7
# 5: E code_1_8
# 6: G code_1_5
# 7: B code_2_3
# 8: D code_2_3
# 9: G code_2_2
# 10: A code_3_4
# 11: B code_3_1
# 12: C code_3_1
# 13: D code_3_4
# 14: E code_3_4
# 15: F code_3_1
# 16: G code_3_8
Here is what our network looks like (plot omitted):
Note that A..G nodes are always connected through code_x_y.
So we need the 2nd degree: ego(..., order = 2) gives us neighbours up to and including 2nd degree neighbours, and returns a list object.
To get the names:
lapply(x1, names)
# [[1]]
# [1] "A" "code_1_2" "code_3_4" "B" "C" "D" "E"
#
# [[2]]
# [1] "B" "code_1_2" "code_2_3" "code_3_1" "A" "C" "D" "F"
#
# [[3]]
# [1] "C" "code_1_2" "code_3_1" "A" "B" "F"
#
# [[4]]
# [1] "D" "code_1_7" "code_2_3" "code_3_4" "B" "A" "E"
#
# [[5]]
# [1] "E" "code_1_8" "code_3_4" "A" "D"
#
# [[6]]
# [1] "F" "code_3_1" "B" "C"
#
# [[7]]
# [1] "G" "code_1_5" "code_2_2" "code_3_8"
To prettify the result, we need to remove code_x_y nodes and the origin node (1st node)
sapply(seq_along(x1), function(i) toString(intersect(names(x1[[ i ]]), groups$group[ -i ])))
#[1] "B, C, D, E" "A, C, D, F" "A, B, F" "B, A, E" "A, D" "B, C" ""
There is probably some more practical way of achieving this but you could do something like this, using melts and joins:
mgrp <- melt(groups, id.vars = "group")[!is.na(value)]
setkey(mgrp, variable, value)
for (i in seq_along(groups$group)) {
let = groups$group[i]
set(
groups,
i = i,
j = "inei",
value = list(mgrp[mgrp[group == let], setdiff(unique(group), let)])
)
}
groups
# group code_1 code_2 code_3 inei
# 1: A 2 NA 4 B,C,D,E
# 2: B 2 3 1 A,C,D,F
# 3: C 2 NA 1 A,B,F
# 4: D 7 3 4 B,A,E
# 5: E 8 NA 4 A,D
# 6: F NA NA 1 B,C
# 7: G 5 2 8
As mentioned by zx8754, using data.table::melt with combn and then igraph::as_adjacency_matrix
library(data.table)
df <- melt(groups, id.vars="group", na.rm=TRUE)[,
if (.N > 1L) transpose(combn(group, 2L, simplify=FALSE)), value][, (1) := NULL]
library(igraph)
as_adjacency_matrix(graph_from_data_frame(df, FALSE))
output:
7 x 7 sparse Matrix of class "dgCMatrix"
A B C E D G F
A . 1 1 1 1 1 .
B 1 . 2 . 1 1 1
C 1 2 . . . 1 1
E 1 . . . 1 1 .
D 1 1 . 1 . . .
G 1 1 1 1 . . .
F . 1 1 . . . .
or without using igraph
x <- df[, unique(c(V1, V2))]
df <- rbindlist(list(df, data.table(x, x)))
tab <- table(df) #or xtabs(~ V1 + V2, data=df)
ans <- t(tab) + tab
diag(ans) <- 0L
ans
output:
V1
V2 A B C D E F G
A 0 1 1 1 1 0 1
B 1 0 2 1 0 1 1
C 1 2 0 0 0 1 1
D 1 1 0 0 1 0 0
E 1 0 0 1 0 0 1
F 0 1 1 0 0 0 0
G 1 1 1 0 1 0 0
This is inspired by @sindri_baldur's melt. This solution:
Melts the groups.
Performs a cartesian self-join.
Pastes together all the groups that match.
Joins back to the original DT.
library(data.table)
groups <- data.table(group = c("A", "B", "C", "D", "E", "F", "G"), code_1 = c(2,2,2,7,8,NA,5), code_2 = c(NA,3,NA,3,NA,NA,2), code_3=c(4,1,1,4,4,1,8))
molten_grps = melt(groups, measure.vars = patterns("code"), na.rm = TRUE)
inei_dt = molten_grps[molten_grps,
on = .(variable, value),
allow.cartesian = TRUE
][,
.(inei = paste0(setdiff(i.group, .BY[[1L]]), collapse = ", ")),
by = group]
groups[inei_dt, on = .(group), inei := inei]
groups
#> group code_1 code_2 code_3 inei
#> <char> <num> <num> <num> <char>
#> 1: A 2 NA 4 B, C, D, E
#> 2: B 2 3 1 A, C, D, F
#> 3: C 2 NA 1 A, B, F
#> 4: D 7 3 4 B, A, E
#> 5: E 8 NA 4 A, D
#> 6: F NA NA 1 B, C
#> 7: G 5 2 8
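The same "share at least one code" rule can also be checked directly in base R, which may help when verifying the results above. This is only a sketch: it compares every pair of groups column by column with outer(), so it is fine for small tables but scales quadratically. The names codes, shared and neighbours are illustrative.

```r
groups <- data.frame(
  group  = c("A", "B", "C", "D", "E", "F", "G"),
  code_1 = c(2, 2, 2, 7, 8, NA, 5),
  code_2 = c(NA, 3, NA, 3, NA, NA, 2),
  code_3 = c(4, 1, 1, 4, 4, 1, 8),
  stringsAsFactors = FALSE
)

codes <- as.matrix(groups[, c("code_1", "code_2", "code_3")])
n <- nrow(groups)

# shared[i, j] becomes TRUE when groups i and j agree on at least one
# code column; NA never counts as a match
shared <- matrix(FALSE, n, n)
for (k in seq_len(ncol(codes))) {
  eq <- outer(codes[, k], codes[, k], "==")
  eq[is.na(eq)] <- FALSE
  shared <- shared | eq
}
diag(shared) <- FALSE  # a group is not its own neighbour

neighbours <- apply(shared, 1, function(row) toString(groups$group[row]))
neighbours
```

Neighbours come out in the row order of groups, so D is listed as "A, B, E" rather than "B, A, E"; the sets are the same.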
I am trying to solve a problem with R using rle() (or another relevant function) but am not sure where to start. The problem is as follows: foo, bar, baz, and qux can each be in one of three positions, A, B, or C.
Their first position will always be A, and their last position will always be C, but their positions in between are random.
My objective is to eliminate the first A or first sequence of A's, and the last C or the last sequence of C's. For example:
> foo
position
1 A
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 C
10 C
> output(foo)
position
4 B
5 B
6 A
7 B
8 A
> bar
position
1 A
2 B
3 A
4 B
5 A
6 C
7 C
8 C
9 C
10 C
> output(bar)
position
2 B
3 A
4 B
5 A
> baz
position
1 A
2 A
3 A
4 A
5 A
6 C
7 C
8 C
9 C
10 C
> output(baz)
NULL
> qux
position
1 A
2 C
3 A
4 C
5 A
6 C
> output(qux)
position
2 C
3 A
4 C
5 A
Basic rle() will tell me about the sequences and their lengths but it will not preserve row indices. How should one go about solving this problem?
> rle(foo$position)
Run Length Encoding
lengths: int [1:6] 3 2 1 1 1 2
values : chr [1:6] "A" "B" "A" "B" "A" "C"
I would write a function using cumsum that finds how many leading consecutive values equal first_position and how many trailing consecutive values equal last_position, and removes them.
get_reduced_data <- function(dat, first_position, last_position) {
dat[cumsum(dat != first_position) != 0 &
rev(cumsum(rev(dat) != last_position) != 0)]
}
get_reduced_data(foo, first_position, last_position)
#[1] "B" "B" "A" "B" "A"
get_reduced_data(bar, first_position, last_position)
#[1] "B" "A" "B" "A"
get_reduced_data(baz, first_position, last_position)
#character(0)
get_reduced_data(qux, first_position, last_position)
#[1] "C" "A" "C" "A"
data
foo <- c("A", "A","A", "B", "B", "A", "B", "A", "C")
bar <- c("A", "B","A", "B", "A", "C", "C", "C", "C", "C")
baz <- c(rep("A", 5), rep("C", 5))
qux <- c("A", "C", "A", "C", "A", "C")
first_position <- "A"
last_position <- "C"
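To see why the cumsum trick works, it can help to look at the two masks separately. A small sketch using the foo vector from the data above; lead_mask and trail_mask are names invented here.

```r
foo <- c("A", "A", "A", "B", "B", "A", "B", "A", "C")

# cumsum() stays at 0 until the first value that is not "A",
# so the mask is FALSE only for the leading run of A's
lead_mask <- cumsum(foo != "A") != 0

# same idea from the other end: reverse, count non-"C" values, reverse back
trail_mask <- rev(cumsum(rev(foo) != "C") != 0)

foo[lead_mask & trail_mask]
```

Only the positions where both masks are TRUE survive, which are the values strictly between the leading A's and the trailing C's.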
Here is one option with rle. The idea is to take the first and last run values, set them to NA where they equal 'A' and 'C' respectively, expand back with inverse.rle, and convert the result to a logical vector for subsetting
i1 <- !is.na(inverse.rle(within.list(rle(foo$position),
values[c(1, length(values))][values[c(1, length(values))] == c("A", "C")] <- NA)))
foo[i1, , drop = FALSE]
# position
#4 B
#5 B
#6 A
#7 B
#8 A
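The one-liner above is fairly dense, so here is the same rle() idea spelled out step by step, as a sketch. It assumes foo is a one-column data frame as in the question, and it recovers the row positions of each run from the cumulative run lengths. The names starts, ends and keep are illustrative.

```r
foo <- data.frame(
  position = c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C"),
  stringsAsFactors = FALSE
)

r <- rle(foo$position)
ends <- cumsum(r$lengths)        # last row of each run
starts <- ends - r$lengths + 1   # first row of each run
k <- length(r$values)

keep <- rep(TRUE, nrow(foo))
# drop the first run only if it is a run of A's
if (r$values[1] == "A") keep[starts[1]:ends[1]] <- FALSE
# drop the last run only if it is a run of C's
if (r$values[k] == "C") keep[starts[k]:ends[k]] <- FALSE

out <- foo[keep, , drop = FALSE]
out
```

Subsetting the data frame with a logical vector keeps the original row names, which addresses the concern that rle() alone loses the row indices.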
A data.table approach could be,
library(data.table)
setDT(df)[, grp := rleid(position)][
!(grp == 1 & position == 'A' | grp == max(grp) & position == 'C'), ][
, grp := NULL][]
which gives,
position
1: B
2: B
3: A
4: B
5: A
Another possible solution without rle by creating an index and subsetting rows to between first occurrence of non-A and last occurrence of non-C:
library(data.table)
output <- function(DT) {
DT[, rn:=.I][,{
mn <- min(which(position!="A"))
mx <- max(which(position!="C"))
if (mn > mx) return(NULL)
.SD[mn:mx]
}]
}
output(setDT(foo))
# position rn
#1: B 4
#2: B 5
#3: A 6
#4: B 7
#5: A 8
output(setDT(baz))
#NULL
data:
foo <- fread("position
A
A
A
B
B
A
B
A
C
C")
baz <- fread("position
A
A
A
A
A
C
C
C
C
C")
The problem seems to be two-fold: trimming the 'first' and 'last' elements, and identifying what constitutes 'first' and 'last'. I like your rle() approach, because it maps many possibilities into a common structure. So the task is to write a function to mask the first and last elements of a vector of any length
mask_end = function(x) {
n = length(x)
mask = !logical(n)
mask[c(min(1, n), max(0, n))] = FALSE # allow for 0-length x
mask
}
This is very easy to test comprehensively
> mask_end(integer(0))
logical(0)
> mask_end(integer(1))
[1] FALSE
> mask_end(integer(2))
[1] FALSE FALSE
> mask_end(integer(3))
[1] FALSE TRUE FALSE
> mask_end(integer(4))
[1] FALSE TRUE TRUE FALSE
The solution (returning the mask; easy to modify to return the actual values, x[inverse.rle(r)]) is then
mask_end_runs = function(x) {
r = rle(x)
r$values = mask_end(r$values)
inverse.rle(r)
}
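A sketch of the modification mentioned above, returning the trimmed values instead of the mask. Note that, like mask_end_runs, it drops the first and last runs whatever their values are, relying on the question's guarantee that the data start with A and end with C. The function name trim_end_runs is made up here.

```r
trim_end_runs <- function(x) {
  r <- rle(x)
  n <- length(r$values)
  keep <- rep(TRUE, n)
  keep[c(min(1, n), max(0, n))] <- FALSE  # mask first and last run
  r$values <- keep
  x[inverse.rle(r)]                       # expand run mask to element mask
}

trim_end_runs(c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C"))
trim_end_runs(c("A", "C", "A", "C", "A", "C"))
trim_end_runs(c(rep("A", 5), rep("C", 5)))
```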
I would like to extract the next 'n' rows after I find a string in R.
For example, let's say I have the following data frame:
df<-as.data.frame(rep(c("a","b","c","d","e","f"),10))
I would like to extract every row that includes "b", as well as the next two rows (in this example, I would like to extract rows with "b", or "c", or "d")
BUT, please, I don't want to specify "c" and "d", I just want the next two rows after "b" as well (in my real data the next two rows are not consistent).
I've tried many things, but no success.. Thanks in advance! Nick
You can find the indices of rows with b and then use those and the next two of each, something like this:
df <- data.frame(col1=rep(c("a","b","c","d","e","f"),3), col2=letters[1:18], stringsAsFactors = FALSE)
df
col1 col2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 a g
8 b h
9 c i
10 d j
11 e k
12 f l
13 a m
14 b n
15 c o
16 d p
17 e q
18 f r
bs <- which(df$col1=="b")
df[sort(bs + rep(0:2, each = length(bs))), ] #2 is the number of rows you want after your desired match (b).
col1 col2
2 b b
3 c c
4 d d
8 b h
9 c i
10 d j
14 b n
15 c o
16 d p
I added a second column to illustrate the dataframe better, otherwise a vector would be returned.
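If matches can sit close together or near the end of the data, the index vector may contain duplicates or run past the last row. A possible variation, as a sketch under the assumption that each row should appear at most once in the result; n_after and idx are names chosen for this example.

```r
df <- data.frame(col1 = c("a", "b", "b",
                          rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
                 col2 = letters[1:22],
                 stringsAsFactors = FALSE)

n_after <- 2                       # rows to take after each match
bs <- which(df$col1 == "b")

# every match plus its next n_after rows, de-duplicated and sorted
idx <- sort(unique(as.vector(outer(bs, 0:n_after, "+"))))
idx <- idx[idx <= nrow(df)]        # drop positions past the last row

df[idx, ]
```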
My "SOfun" package has a function called getMyRows which does what you ask for, with the exception of returning a list instead of a data.frame.
I had left the result as a list to make it easier to handle some edge cases, like where the requests for rows would overlap. For example, in the following sample data, there are two consecutive "b" values. There's also a "b" value in the final row.
df <- data.frame(col1 = c("a", "b", "b",
rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
col2 = letters[1:22])
library(SOfun)
getMyRows(df, which(df$col1 == "b"), 0:2, TRUE)
# [[1]]
# col1 col2
# 2 b b
# 3 b c
# 4 a d
#
# [[2]]
# col1 col2
# 3 b c
# 4 a d
# 5 b e
#
# [[3]]
# col1 col2
# 5 b e
# 6 c f
# 7 d g
#
# [[4]]
# col1 col2
# 11 b k
# 12 c l
# 13 d m
#
# [[5]]
# col1 col2
# 17 b q
# 18 c r
# 19 d s
#
# [[6]]
# col1 col2
# 22 b v
The usage is essentially:
Specify the data.frame.
Specify the index positions to use as the base. Here, we want all rows where "col1" equals "b" to be our base index position.
Specify the range of rows you are interested in. -1:3, for example, would give you one row before to three rows after the base.
TRUE means that you are specifying the starting points by their numeric indices.
I have two named vectors
v1 <- 1:4
v2 <- 3:5
names(v1) <- c("a", "b", "c", "d")
names(v2) <- c("c", "e", "d")
I want to add them up by the names, i.e. the expected result is
> v3
a b c d e
1 2 6 9 4
Is there a way to programmatically do this in R? Note the names may not necessarily be in a sorted order, like in v2 above.
Just combine the vectors (using c, for example) and use tapply:
v3 <- c(v1, v2)
tapply(v3, names(v3), sum)
# a b c d e
# 1 2 6 9 4
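The same idea extends to any number of named vectors if you wrap it in a small helper. A minimal sketch; sum_by_name is a hypothetical name, not an existing function.

```r
# combine any number of named vectors and sum the values sharing a name
sum_by_name <- function(...) {
  v <- c(...)
  tapply(v, names(v), sum)
}

v1 <- 1:4
v2 <- 3:5
names(v1) <- c("a", "b", "c", "d")
names(v2) <- c("c", "e", "d")

res <- sum_by_name(v1, v2)
res
```

The result is sorted by name because tapply() groups on a factor, so the input order of the names does not matter.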
Or, for fun (since you're just doing sum), continuing with "v3":
xtabs(v3 ~ names(v3))
# names(v3)
# a b c d e
# 1 2 6 9 4
I suppose with "data.table" you could also do something like:
library(data.table)
as.data.table(Reduce(c, mget(ls(pattern = "v\\d"))),
keep.rownames = TRUE)[, list(V2 = sum(V2)), by = V1]
# V1 V2
# 1: a 1
# 2: b 2
# 3: c 6
# 4: d 9
# 5: e 4
(I shared the latter not so much for "data.table" but to show an automated way of capturing the vectors of interest.)