My dataset contains two columns with data that are offset - something like:
col1<-c("a", "b", "c", "d", "ND", "ND", "ND", "ND")
col2<-c("ND", "ND", "ND", "ND", "e", "f", "g", "h")
dataset<-data.frame(cbind(col1, col2))
I would like to combine those two offset columns into a single column that contains the letters a through h and nothing else.
Something like the following is what I'm thinking, but rbind is not the right command:
dataset$combine<-rbind(dataset$col1[1:4], dataset$col2[5:8])
What about:
sel2 <- col2!="ND"
col1[sel2] <- col2[sel2]
> col1
[1] "a" "b" "c" "d" "e" "f" "g" "h"
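A variation on the same idea, as a minimal sketch: ifelse() picks the value from col2 wherever col1 holds the "ND" placeholder, and leaves both original columns untouched. It assumes the columns are character (hence stringsAsFactors = FALSE, which is my addition) and that "ND" is the only placeholder.

```r
col1 <- c("a", "b", "c", "d", "ND", "ND", "ND", "ND")
col2 <- c("ND", "ND", "ND", "ND", "e", "f", "g", "h")
# Build the data frame directly; cbind() is not needed here
dataset <- data.frame(col1, col2, stringsAsFactors = FALSE)

# Take col2 where col1 is the placeholder, otherwise keep col1
dataset$combine <- ifelse(dataset$col1 == "ND", dataset$col2, dataset$col1)
dataset$combine
```

Unlike the in-place assignment, this works row by row, so it does not depend on where the offset starts.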
Use sapply and an anonymous function:
dataset[sapply(dataset, function(x) x != "ND")]
# [1] "a" "b" "c" "d" "e" "f" "g" "h"
dataset$combine <- dataset[sapply(dataset, function(x) x != "ND")]
dataset
# col1 col2 combine
# 1 a ND a
# 2 b ND b
# 3 c ND c
# 4 d ND d
# 5 ND e e
# 6 ND f f
# 7 ND g g
# 8 ND h h
Use grep to find the matching elements and select them:
c(col1[grep("^[a-h]$",col1)],col2[grep("^[a-h]$",col2)])
Yet another way, using mapply and gsub:
within(dataset, combine <- mapply(gsub, pattern='ND', replacement=col2, x=col1))
# col1 col2 combine
# 1 a ND a
# 2 b ND b
# 3 c ND c
# 4 d ND d
# 5 ND e e
# 6 ND f f
# 7 ND g g
# 8 ND h h
Per your comment on @Andrie's answer, this will also preserve NA rows.
Another point of view:
transform(dataset,
combine=dataset[apply(dataset, 2, function(x) x %in% letters[1:8])])
col1 col2 combine
1 a ND a
2 b ND b
3 c ND c
4 d ND d
5 ND e e
6 ND f f
7 ND g g
8 ND h h
dataset$combine <- dataset[apply(dataset,2, function(x) nchar(x)==1)] #Also works
Sometimes the problem is to think simple enough... ;-)
dataset$combine<-c(dataset$col1[1:4], dataset$col2[5:8])
In R I have a table containing a set of insect species and an empty column "habitat specifity". Additionally, a vector specifies those species considered habitat specialists: species B and C are habitat specialists, species A, D and E are habitat generalists.
example.species <- data.frame (species = c("A","B","C","D","E"), habitat.specifity=NA)
example.species
species habitat.specifity
1 A NA
2 B NA
3 C NA
4 D NA
5 E NA
example.specialists <- c("B","C")
I simply want to fill column two ("habitat specifity") with "s" for specialist and "g" for generalist. The table should then look like this:
species habitat.specifity
1 A g
2 B s
3 C s
4 D g
5 E g
I think it must be a simple task to accomplish, but I cannot figure out how. Any help is appreciated!
Here's a straightforward way in base R:
example.species <- data.frame (species = c("A","B","C","D","E"), habitat.specifity=NA)
example.species$habitat.specifity <- "g" # default value
example.species$habitat.specifity[example.species$species %in% c("B","C")] <- "s"
# species habitat.specifity
# 1 A g
# 2 B s
# 3 C s
# 4 D g
# 5 E g
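The two assignments can also be collapsed into one ifelse() call. This is only a sketch of the same %in% test; it uses the example.specialists vector from the question instead of hard-coding c("B","C").

```r
example.species <- data.frame(species = c("A", "B", "C", "D", "E"),
                              habitat.specifity = NA,
                              stringsAsFactors = FALSE)
example.specialists <- c("B", "C")

# "s" where the species is listed as a specialist, "g" otherwise
example.species$habitat.specifity <-
  ifelse(example.species$species %in% example.specialists, "s", "g")
example.species
```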
Example with dplyr:
library(dplyr)
# Your data
example.species <- data.frame(species = c("A","B","C","D","E"),habitat.specifity=NA)
# Simple if_else with dplyr and pipes
example.species %>%
mutate(habitat.specifity = if_else(species %in% c("B","C"), "s", "g"))
# Result
species habitat.specifity
1 A g
2 B s
3 C s
4 D g
5 E g
I am trying to solve a problem with R using rle() (or another relevant function) but am not sure where to start. The problem is as follows: foo, bar, baz, and qux can each be in one of three positions: A, B, or C.
Their first position will always be A, and their last position will always be C, but their positions in between are random.
My objective is to eliminate the first A or first sequence of A's, and the last C or the last sequence of C's. For example:
> foo
position
1 A
2 A
3 A
4 B
5 B
6 A
7 B
8 A
9 C
10 C
> output(foo)
position
4 B
5 B
6 A
7 B
8 A
> bar
position
1 A
2 B
3 A
4 B
5 A
6 C
7 C
8 C
9 C
10 C
> output(bar)
position
2 B
3 A
4 B
5 A
> baz
position
1 A
2 A
3 A
4 A
5 A
6 C
7 C
8 C
9 C
10 C
> output(baz)
NULL
> qux
position
1 A
2 C
3 A
4 C
5 A
6 C
> output(qux)
position
2 C
3 A
4 C
5 A
Basic rle() will tell me about the sequences and their lengths but it will not preserve row indices. How should one go about solving this problem?
> rle(foo$position)
Run Length Encoding
lengths: int [1:6] 3 2 1 1 1 2
values : chr [1:6] "A" "B" "A" "B" "A" "C"
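I would write a function using cumsum. cumsum(dat != first_position) stays at 0 throughout the leading run of first_position, and the same test on the reversed vector stays at 0 throughout the trailing run of last_position, so keeping only the elements where both counts are non-zero removes those two runs.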
get_reduced_data <- function(dat, first_position, last_position) {
dat[cumsum(dat != first_position) != 0 &
rev(cumsum(rev(dat) != last_position) != 0)]
}
get_reduced_data(foo, first_position, last_position)
#[1] "B" "B" "A" "B" "A"
get_reduced_data(bar, first_position, last_position)
#[1] "B" "A" "B" "A"
get_reduced_data(baz, first_position, last_position)
#character(0)
get_reduced_data(qux, first_position, last_position)
#[1] "C" "A" "C" "A"
data
foo <- c("A", "A","A", "B", "B", "A", "B", "A", "C", "C")
bar <- c("A", "B","A", "B", "A", "C", "C", "C", "C", "C")
baz <- c(rep("A", 5), rep("C", 5))
qux <- c("A", "C", "A", "C", "A", "C")
first_position <- "A"
last_position <- "C"
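Since the question asks for the row indices to be preserved, here is a sketch that keeps the two cumsum masks as a logical vector and applies it to a data frame. keep_index and foo_df are names I made up for illustration; the logic is the same as in get_reduced_data().

```r
# Same masks as get_reduced_data(), but returned as a logical vector
# so the original positions survive (keep_index is a hypothetical name)
keep_index <- function(dat, first_position, last_position) {
  cumsum(dat != first_position) != 0 &
    rev(cumsum(rev(dat) != last_position) != 0)
}

foo_df <- data.frame(
  position = c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C"),
  stringsAsFactors = FALSE
)

keep <- keep_index(foo_df$position, "A", "C")
which(keep)                            # 4 5 6 7 8
res <- foo_df[keep, , drop = FALSE]    # row names 4 to 8 are retained
res
```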
Here is one option with rle. The idea is to take the first and last run values, set them to NA where they equal 'A' and 'C' respectively, expand the runs back with inverse.rle, and use !is.na() of the result as a logical vector for subsetting.
i1 <- !is.na(inverse.rle(within.list(rle(foo$position),
values[c(1, length(values))][values[c(1, length(values))] == c("A", "C")] <- NA)))
foo[i1, , drop = FALSE]
# position
#4 B
#5 B
#6 A
#7 B
#8 A
A data.table approach could be,
library(data.table)
setDT(df)[, grp := rleid(position)][
!(grp == 1 & position == 'A' | grp == max(grp) & position == 'C'), ][
, grp := NULL][]
which gives,
position
1: B
2: B
3: A
4: B
5: A
Another possible solution without rle by creating an index and subsetting rows to between first occurrence of non-A and last occurrence of non-C:
library(data.table)
output <- function(DT) {
DT[, rn:=.I][,{
mn <- min(which(position!="A"))
mx <- max(which(position!="C"))
if (mn > mx) return(NULL)
.SD[mn:mx]
}]
}
output(setDT(foo))
# position rn
#1: B 4
#2: B 5
#3: A 6
#4: B 7
#5: A 8
output(setDT(baz))
#NULL
data:
foo <- fread("position
A
A
A
B
B
A
B
A
C
C")
baz <- fread("position
A
A
A
A
A
C
C
C
C
C")
The problem seems to be two-fold: trimming the 'first' and 'last' elements, and identifying what constitutes 'first' and 'last'. I like your rle() approach, because it maps many possibilities into a common structure. So the task is to write a function to mask the first and last elements of a vector of any length
mask_end = function(x) {
n = length(x)
mask = !logical(n)
mask[c(min(1, n), max(0, n))] = FALSE # allow for 0-length x
mask
}
This is very easy to test comprehensively
> mask_end(integer(0))
logical(0)
> mask_end(integer(1))
[1] FALSE
> mask_end(integer(2))
[1] FALSE FALSE
> mask_end(integer(3))
[1] FALSE TRUE FALSE
> mask_end(integer(4))
[1] FALSE TRUE TRUE FALSE
The solution (returning the mask; easy to modify to return the actual values, x[inverse.rle(r)]) is then
mask_end_runs = function(x) {
r = rle(x)
r$values = mask_end(r$values)
inverse.rle(r)
}
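A short usage sketch, applying the mask to the foo and baz vectors from the question. Note one assumption: the mask drops the first and last runs whatever their values are, which is fine here only because the question guarantees the data start with A and end with C.

```r
mask_end <- function(x) {
  n <- length(x)
  mask <- !logical(n)
  mask[c(min(1, n), max(0, n))] <- FALSE  # allow for 0-length x
  mask
}

mask_end_runs <- function(x) {
  r <- rle(x)
  r$values <- mask_end(r$values)  # FALSE for the first and last run
  inverse.rle(r)                  # expand back to one value per element
}

foo <- c("A", "A", "A", "B", "B", "A", "B", "A", "C", "C")
baz <- c(rep("A", 5), rep("C", 5))

keep <- mask_end_runs(foo)
which(keep)               # 4 5 6 7 8
foo[keep]                 # "B" "B" "A" "B" "A"
baz[mask_end_runs(baz)]   # character(0)
```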
I looked around for a solution but could not find an exact one.
Given:
a<-c('a','b','c')
b<-c('d','e','f')
d<-c('g','h')
as a toy subset of a much larger set, I want to be able to find unique pairs between
attribute (vector) sets. If I use
combn(c(a,b,d),2)
It would return ALL pairwise combinations of all of the attribute elements.
e.g.
combn(c(a,b,d),2)
returns c(a,b) c(a,c) c(a,d) c(a,e)...
But I only want pairs of elements between attributes. So I would not see a,b or a,c but
a,d a,e a,f b,d b,e b,f etc...
I could sort of do it with expand.grid(a,b,d)..
Var1 Var2 Var3
1 a d g
2 b d g
3 c d g
4 a e g
5 b e g
6 c e g
7 a f g
8 b f g
9 c f g
10 a d h
11 b d h
12 c d h
13 a e h
14 b e h
15 c e h
16 a f h
17 b f h
18 c f h
but now I have an n-col dimensional set of the combinations. Is there any way to limit
it to just attribute pairs of elements, such as combn(x,2)
The main goal is to find a list of unique pairwise combinations of elements between all attribute pairs, but I do not want combinations of elements
within the same attribute column, as it is redundant in my application.
Taking combinations of pairs in each row in the grid, then filtering to get unique entries, we have this:
unique(do.call(c, apply(expand.grid(a,b,d), 1, combn, m=2, simplify=FALSE)))
A list of combinations is returned:
> L <- unique(do.call(c, apply(expand.grid(a,b,d), 1, combn, m=2, simplify=FALSE)))
> length(L) ## 21
> L[1:5]
## [[1]]
## Var1 Var2
## "a" "d"
##
## [[2]]
## Var1 Var3
## "a" "g"
##
## [[3]]
## Var2 Var3
## "d" "g"
##
## [[4]]
## Var1 Var2
## "b" "d"
##
## [[5]]
## Var1 Var3
## "b" "g"
First, create a list where each element is a pair of your original vectors, e.g. list(a, b):
L <- list(a, b, d)
L.pairs <- combn(seq_along(L), 2, simplify = FALSE, FUN = function(i)L[i])
Then run expand.grid for each of these pairs and put the pieces together:
do.call(rbind, lapply(L.pairs, expand.grid))
# Var1 Var2
# 1 a d
# 2 b d
# 3 c d
# [...]
# 19 d h
# 20 e h
# 21 f h
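The two steps can be wrapped into one function that accepts any number of vectors. This is just a sketch: cross_pairs is a name I made up, and stringsAsFactors = FALSE is my addition so that rbind does not have to merge factor levels.

```r
# cross_pairs is a hypothetical helper: all pairs of elements taken
# from two *different* input vectors, never from the same one
cross_pairs <- function(...) {
  L <- list(...)
  idx <- combn(seq_along(L), 2, simplify = FALSE)  # pairs of vector positions
  do.call(rbind, lapply(idx, function(i) {
    expand.grid(L[[i[1]]], L[[i[2]]], stringsAsFactors = FALSE)
  }))
}

a <- c("a", "b", "c")
b <- c("d", "e", "f")
d <- c("g", "h")

res <- cross_pairs(a, b, d)
nrow(res)   # 3*3 + 3*2 + 3*2 = 21
head(res, 3)
```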
I would like to extract the next 'n' rows after I find a string in R.
For example, let's say I have the following data frame:
df<-as.data.frame(rep(c("a","b","c","d","e","f"),10))
I would like to extract every row that includes "b", as well as the next two rows (in this example, I would like to extract rows with "b", or "c", or "d")
BUT, please, I don't want to specify "c" and "d", I just want the next two rows after "b" as well (in my real data the next two rows are not consistent).
I've tried many things, but no success.. Thanks in advance! Nick
You can find the indices of rows with b and then use those and the next two of each, something like this:
df <- data.frame(col1=rep(c("a","b","c","d","e","f"),3), col2=letters[1:18], stringsAsFactors = FALSE)
df
col1 col2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 a g
8 b h
9 c i
10 d j
11 e k
12 f l
13 a m
14 b n
15 c o
16 d p
17 e q
18 f r
bs <- which(df$col1=="b")
df[sort(bs + rep(0:2, each = length(bs))), ] # 2 is the number of rows you want after your desired match (b).
col1 col2
2 b b
3 c c
4 d d
8 b h
9 c i
10 d j
14 b n
15 c o
16 d p
I added a second column to illustrate the dataframe better, otherwise a vector would be returned.
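If the matches can sit close together or near the end of the data, the index arithmetic above yields duplicate or out-of-range row numbers. A sketch of one way to guard against that follows; rows_after is a hypothetical helper name, and drop = FALSE keeps the result a data frame even with a single column.

```r
# rows_after is a hypothetical helper: each hit plus the n rows after it,
# with overlapping rows returned once and indices past the end dropped
rows_after <- function(df, hits, n) {
  idx <- sort(unique(as.vector(outer(hits, 0:n, `+`))))
  idx <- idx[idx <= nrow(df)]
  df[idx, , drop = FALSE]
}

df <- data.frame(col1 = c("a", "b", "b", "a", "b"), stringsAsFactors = FALSE)
res <- rows_after(df, which(df$col1 == "b"), 2)
res   # rows 2, 3, 4 and 5, each listed once
```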
My "SOfun" package has a function called getMyRows which does what you ask for, with the exception of returning a list instead of a data.frame.
I had left the result as a list to make it easier to handle some edge cases, like where the requests for rows would overlap. For example, in the following sample data, there are two consecutive "b" values. There's also a "b" value in the final row.
df <- data.frame(col1 = c("a", "b", "b",
rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
col2 = letters[1:22])
library(SOfun)
getMyRows(df, which(df$col1 == "b"), 0:2, TRUE)
# [[1]]
# col1 col2
# 2 b b
# 3 b c
# 4 a d
#
# [[2]]
# col1 col2
# 3 b c
# 4 a d
# 5 b e
#
# [[3]]
# col1 col2
# 5 b e
# 6 c f
# 7 d g
#
# [[4]]
# col1 col2
# 11 b k
# 12 c l
# 13 d m
#
# [[5]]
# col1 col2
# 17 b q
# 18 c r
# 19 d s
#
# [[6]]
# col1 col2
# 22 b v
The usage is essentially:
Specify the data.frame.
Specify the index positions to use as the base. Here, we want all rows where "col1" equals "b" to be our base index position.
Specify the range of rows interested in. -1:3, for example, would give you one row before to three rows after the base.
TRUE means that you are specifying the starting points by their numeric indices.