I would like to extract the next 'n' rows after I find a string in R.
For example, let's say I have the following data frame:
df<-as.data.frame(rep(c("a","b","c","d","e","f"),10))
I would like to extract every row that includes "b", as well as the next two rows (in this example, I would like to extract rows with "b", or "c", or "d")
BUT, please, I don't want to specify "c" and "d", I just want the next two rows after "b" as well (in my real data the next two rows are not consistent).
I've tried many things, but with no success. Thanks in advance! Nick
You can find the indices of the rows containing "b" and then use those indices plus the next two after each, something like this:
df <- data.frame(col1=rep(c("a","b","c","d","e","f"),3), col2=letters[1:18], stringsAsFactors = FALSE)
df
col1 col2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 a g
8 b h
9 c i
10 d j
11 e k
12 f l
13 a m
14 b n
15 c o
16 d p
17 e q
18 f r
bs <- which(df$col1=="b")
df[sort(bs + rep(0:2, each=length(bs))), ] # 2 is the number of rows you want after your desired match ("b")
col1 col2
2 b b
3 c c
4 d d
8 b h
9 c i
10 d j
14 b n
15 c o
16 d p
I added a second column to better illustrate the data.frame; with a single column, the subsetting above would return a vector instead of a data.frame.
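Since the question asks for a general n, the same idea can be wrapped in a small helper. This is just a sketch (nextRows is an illustrative name, not an existing function):
# Sketch: keep each row matching `value` plus the next n rows,
# dropping any indices that run past the end of the data.
nextRows <- function(data, column, value, n = 2) {
  idx <- which(data[[column]] == value)
  idx <- sort(unique(as.vector(outer(idx, 0:n, "+"))))
  data[idx[idx <= nrow(data)], , drop = FALSE]
}
nextRows(df, "col1", "b", n = 2)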
My "SOfun" package has a function called getMyRows which does what you ask for, with the exception of returning a list instead of a data.frame.
I had left the result as a list to make it easier to handle some edge cases, like where the requests for rows would overlap. For example, in the following sample data, there are two consecutive "b" values. There's also a "b" value in the final row.
df <- data.frame(col1 = c("a", "b", "b",
rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
col2 = letters[1:22])
library(SOfun)
getMyRows(df, which(df$col1 == "b"), 0:2, TRUE)
# [[1]]
# col1 col2
# 2 b b
# 3 b c
# 4 a d
#
# [[2]]
# col1 col2
# 3 b c
# 4 a d
# 5 b e
#
# [[3]]
# col1 col2
# 5 b e
# 6 c f
# 7 d g
#
# [[4]]
# col1 col2
# 11 b k
# 12 c l
# 13 d m
#
# [[5]]
# col1 col2
# 17 b q
# 18 c r
# 19 d s
#
# [[6]]
# col1 col2
# 22 b v
The usage is essentially:
Specify the data.frame.
Specify the index positions to use as the base. Here, we want all rows where "col1" equals "b" to be our base index position.
Specify the range of rows you are interested in. -1:3, for example, would give you one row before through three rows after the base.
TRUE means that you are specifying the starting points by their numeric indices.
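If you prefer to avoid installing a package, the same behaviour can be approximated in base R. The helper below is a rough sketch (getRows is a made-up name, not the SOfun implementation): it returns one data.frame per match and silently drops rows that fall outside the data, which covers the "b" in the final row.
getRows <- function(data, idx, offsets = 0:2) {
  lapply(idx, function(i) {
    rows <- i + offsets
    data[rows[rows >= 1 & rows <= nrow(data)], , drop = FALSE]
  })
}
getRows(df, which(df$col1 == "b"), 0:2)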
I looked around for a solution but could not find an exact one.
Given:
a<-c('a','b','c')
b<-c('d','e','f')
d<-c('g','h')
as a toy subset of a much larger set, I want to be able to find unique pairs between
attribute (vector) sets. If I use
combn(c(a,b,d),2)
It would return ALL pairwise combinations of all of the attribute elements.
e.g.
combn(c(a,b,d),2)
returns c(a,b) c(a,c) c(a,d) c(a,e)...
But I only want pairs of elements between attributes. So I would not see a,b or a,c, but rather
a,d a,e a,f b,d b,e b,f etc...
I could sort of do it with expand.grid(a,b,d):
Var1 Var2 Var3
1 a d g
2 b d g
3 c d g
4 a e g
5 b e g
6 c e g
7 a f g
8 b f g
9 c f g
10 a d h
11 b d h
12 c d h
13 a e h
14 b e h
15 c e h
16 a f h
17 b f h
18 c f h
but now I have an n-column set of the combinations. Is there any way to limit
it to just attribute pairs of elements, in the style of combn(x,2)?
The main goal is to find a list of unique pairwise combinations of elements between all attribute pairs, but I do not want combinations of elements
within the same attribute column, as it is redundant in my application.
Taking pairwise combinations within each row of the grid and then filtering for unique entries, we have this:
unique(do.call(c, apply(expand.grid(a,b,d), 1, combn, m=2, simplify=FALSE)))
A list of 21 combinations is returned; the first five look like this:
> L <- unique(do.call(c, apply(expand.grid(a,b,d), 1, combn, m=2, simplify=FALSE)))
> length(L)
## [1] 21
> L[1:5]
## [[1]]
## Var1 Var2
## "a" "d"
##
## [[2]]
## Var1 Var3
## "a" "g"
##
## [[3]]
## Var2 Var3
## "d" "g"
##
## [[4]]
## Var1 Var2
## "b" "d"
##
## [[5]]
## Var1 Var3
## "b" "g"
First, put your vectors in a list, then create a list where each element is a pair of the original vectors, e.g. list(a, b):
L <- list(a, b, d)
L.pairs <- combn(seq_along(L), 2, simplify = FALSE, FUN = function(i)L[i])
Then run expand.grid for each of these pairs and put the pieces together:
do.call(rbind, lapply(L.pairs, expand.grid))
# Var1 Var2
# 1 a d
# 2 b d
# 3 c d
# [...]
# 19 d h
# 20 e h
# 21 f h
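If it helps to know which attribute pair each row came from, the list elements can be named before binding. This is an optional tweak, not part of the answer above:
# Label each pair by the names of the vectors it was built from, so the row
# names of the combined data.frame show the originating pair (e.g. "a-d.1").
names(L.pairs) <- combn(c("a", "b", "d"), 2, FUN = paste, collapse = "-")
do.call(rbind, lapply(L.pairs, expand.grid))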
I have a large data.frame, example:
> m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
> colnames(m) <- c("A", "B", "C", "D", "E")
> rownames(m) <- c("a", "b", "c", "d", "e")
> m
A B C D E
a 3 3 5 7 5
b 6 2 3 5 4
c 2 5 6 8 9
d 5 4 3 2 2
e 3 3 6 5 2
I would like to remove all rows where the A and/or B columns have a greater value than the C, D and E columns.
So in this case rows b, d, e should be removed and I should get this:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I cannot remove them one by one because the data.frame has more than a million rows.
Thanks
Use subsetting, together with pmin() and pmax(), to retain the values that you want. I'm not sure that I fully understand your criteria (you said "C D and E", but since you want to throw away row e, I think that you meant C, D or E), but the following seems to do what you want:
> m[pmax(m[,"A"],m[,"B"])<=pmin(m[,"C"],m[,"D"],m[,"E"]),]
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
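If the two groups of columns ever grow beyond A/B and C/D/E, the same comparison generalises with do.call(). This is a sketch under that assumption, not part of the answer above:
left  <- c("A", "B")       # columns that must all stay small
right <- c("C", "D", "E")  # columns that must all be at least as large
keep  <- do.call(pmax, as.data.frame(m[, left])) <=
         do.call(pmin, as.data.frame(m[, right]))
m[keep, , drop = FALSE]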
# creating the data
m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")

# initialize as a data frame
m <- as.data.frame(m)
df_n <- m

# mark a row for removal when the largest of A/B exceeds the smallest of C/D/E
for (i in 1:nrow(m)) {
  if (max(m[i, 1:2]) > min(m[i, 3:5])) {
    df_n[i, ] <- NA
  }
}

# drop the marked rows
df_n <- df_n[complete.cases(df_n), ]
print(df_n)
Results
> print(df_n)
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
Here's another solution with apply:
m[apply(m, 1, function(x) max(x[1], x[2]) < min(x[3], x[4], x[5])),]
Result:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I think what you actually meant is to remove rows where max(A, B) > min(C, D, E), which translates to keeping rows where the values of A and B are both smaller than all of C, D, and E.
I have a data set in Excel with a lot of vlookup formulas that I am trying to replicate in R using the data.table package.
In my example below I am saying, for each row, find the value in column y within column x and return the value in column z.
The first row results in NA because the value 6 doesn't exist in column x.
On the second row the value 5 appears twice in column x, but returning the first match, which is "e" in this case, is fine.
I've added in the result column which is the expected outcome.
library(data.table)
dt <- data.table(x = c(1,2,3,4,5,5),
y = c(6,5,4,3,2,1),
z = c("a", "b", "c", "d", "e", "f"),
Result = c("na", "e", "d", "c", "b", "a"))
Many thanks
You can do this with a join, but you need to change the order first:
setorder(dt, y)
dt[.(x = x, z = z), result1 := i.z, on = .("y" = x)]
setorder(dt, x)
# x y z Result result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 1 f a a
#6: 5 2 e b b
I haven't tested whether this is faster than match() for a big data.table, but it might be.
We can just use match() to find the indices of the elements of 'y' in 'x' and use those to pick out the corresponding 'z' values:
dt[, Result1 := z[match(y,x)]]
dt
# x y z Result Result1
#1: 1 6 a na NA
#2: 2 5 b e e
#3: 3 4 c d d
#4: 4 3 d c c
#5: 5 2 e b b
#6: 5 1 f a a
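To check the earlier note about speed, a small benchmark could compare the two approaches, for example with the microbenchmark package. This is a sketch only; it assumes microbenchmark is installed, and timings on this 6-row toy table mean little, so you would rerun it on an inflated copy of your real data.
library(microbenchmark)

dt_join  <- copy(dt)
dt_match <- copy(dt)

microbenchmark(
  join = {
    setorder(dt_join, y)
    dt_join[.(x = x, z = z), result1 := i.z, on = .("y" = x)]
    setorder(dt_join, x)
  },
  match = dt_match[, Result1 := z[match(y, x)]],
  times = 100
)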
I have two named vectors
v1 <- 1:4
v2 <- 3:5
names(v1) <- c("a", "b", "c", "d")
names(v2) <- c("c", "e", "d")
I want to add them up by the names, i.e. the expected result is
> v3
a b c d e
1 2 6 9 4
Is there a way to do this programmatically in R? Note that the names are not necessarily in sorted order, as in v2 above.
Just combine the vectors (using c, for example) and use tapply:
v3 <- c(v1, v2)
tapply(v3, names(v3), sum)
# a b c d e
# 1 2 6 9 4
Or, for fun (since you're just doing sum), continuing with "v3":
xtabs(v3 ~ names(v3))
# names(v3)
# a b c d e
# 1 2 6 9 4
I suppose with "data.table" you could also do something like:
library(data.table)
as.data.table(Reduce(c, mget(ls(pattern = "v\\d"))),
              keep.rownames = TRUE)[, list(V2 = sum(V2)), by = V1]
# V1 V2
# 1: a 1
# 2: b 2
# 3: c 6
# 4: d 9
# 5: e 4
(I shared the latter not so much for "data.table" but to show an automated way of capturing the vectors of interest.)
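For completeness, base R's rowsum() gives the same grouped sums in a single call, returning a one-column matrix rather than a named vector; this is just another option, not part of the answers above:
v3 <- c(v1, v2)
rowsum(v3, group = names(v3))  # grouped sums: a=1, b=2, c=6, d=9, e=4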