I have a data.frame ystr:
v1
1 a
2 B
3 B
4 C
5 d
6 a
7 B
8 D
I want to find the start and end of each group of letters in CAPS so my output would be:
groupId startPos endPos
1 1 2 4
2 2 7 8
I was able to do it with a for loop by looking at each element in order and comparing it to the one before as follows:
currentGroupId <-0
for (i in 1:length(ystr[,1])){
if (grepl("[[:upper:]]", ystr[i,]))
{
if (startCounter == 0)
{
currentGroupId <- currentGroupId +1
startCounter <-1
mygroups[currentGroupId,] <- c(currentGroupId, i, 0)
}
}else if (startCounter == 1){
startCounter <-0
mygroups[currentGroupId,3]<- i-1
}
}
Is there a simple way of doing this in R?
This might be similar to Mark start and end of groups but I could not figure out how it would apply in this case.
You can do this by calculating the run-length encoding (rle) of the binary indicator for whether your data is upper case, as determined by whether the data is equal to itself when it's converted to upper case.
with(rle(d[,1] == toupper(d[,1])),
data.frame(start=cumsum(lengths)[values]-lengths[values]+1,
end=cumsum(lengths)[values]))
# start end
# 1 2 4
# 2 7 8
You can see other examples of the use of rle by looking at Stack Overflow answers using this command.
Data:
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
You can use the IRanges package. It's basically to find the consecutive ranges.
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
d.idx <- which(d$v1 %in% LETTERS)
d.idx
# [1] 2 3 4 7 8
library(IRanges)
d.idx.ir <- IRanges(d.idx, d.idx)
reduce(d.idx.ir)
# IRanges of length 2
# start end width
# [1] 2 4 3
# [2] 7 8 2
Related
Suppose I have a vector of values, such as:
A C A B A C C B B C C A A A B B B B C A
I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above,
NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
(where NA indicates that this is the first time the element has been seen).
For example, the first and second A are in position 1 and 3 respectively, a difference of 2; the third and fourth A are in position 4 and 11, a difference of 7, and so on.
Is there a pre-built pipe-compatible function that does this?
I hacked together this function to demonstrate:
# For reproducibility
set.seed(1)
# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)
compute_lag_counts = function(x, first_time = NA){
# return vector to fill
lag_counts = rep(-1, length(x))
# values to match
vals = unique(x)
# find all positions of all elements in the target vector
match_list = grr::matches(vals, x, list = TRUE)
# compute the lags, then put them in the appropriate place in the return vector
for(i in seq_along(match_list))
lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
# return vector
return(lag_counts)
}
compute_lag_counts(x)
Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.
Or
ave(seq.int(x), x, FUN = function(x) c(NA, diff(x)))
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
We calculate the first difference of the indices for each group of x.
A data.table option thanks to #Henrik
library(data.table)
dt = data.table(x)
dt[ , d := .I - shift(.I), x]
dt
Here's a function that would work
compute_lag_counts <- function(x) {
seqs <- split(seq_along(x), x)
unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}
compute_lag_counts (x)
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
Basically you use split() to separate the indexes where values appear by each unique value in your vector. Then we use the different between the index where they appear to calculate the distance to the previous value. Then we use unstack to put those values back in the original order.
An option with dplyr by taking the difference of adjacent sequence elements after grouping by the original vector
library(dplyr)
tibble(v1) %>%
mutate(ind = row_number()) %>%
group_by(v1) %>%
mutate(new = ind - lag(ind)) %>%
pull(new)
#[1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
data
v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A",
"A", "A", "B", "B", "B", "B", "C", "A")
Consider the following data frame
## Example data frame
z1<-c("a", "b", "c", "c", "b", "a", "a", "b", "c") ##groups
z2<-c("x", "x", "x", "y", "y", "y", "z", "z", "z") ##experiments
z3<-c(2,4,8,15,9,3,1,2,3) ##results
df<-data.frame(group=z1, id=z2, res=z3) ##z1, z2, and z3 to data frame
I'd like to automatically generate the column (z4), which is the fold change for the data in z3.
The calculation I would like to perform is:
a/a & b/a & c/a for id=x and c/a & b/a & c/a for id=y etc.
and add the results in a new column of the data frame.
i.e.,
df$z4 <- c(1,2,4,5,3,1,1,2,3) ## by hand
My line of thought was that I should break the problem down with something like:
###function to check whether group=a, returns TRUE/FALSE
checkA<-function(x){
if(x=="a"){
res=TRUE
}else {
res=FALSE
}
return(res)
}
### checks whether a is in df$group
df$check<-lapply(df$group, checkA)
But it did not really help. I have the feeling that this can be done with built-in functions (one of the applys or ifelse) but I cannot get my head around it.
You can use by to apply a function to subsets of data:
df$z4 <- unlist(by(df, df$id, FUN = function(x) x$res / x$res[x$group == "a"]))
df
group id res z4
1 a x 2 1
2 b x 4 2
3 c x 8 4
4 c y 15 5
5 b y 9 3
6 a y 3 1
7 a z 1 1
8 b z 2 2
9 c z 3 3
I have a DF where I want to add a new variable called "B" into the 2nd position.
A C D
1 1 5 2
2 3 3 7
3 6 2 3
4 6 4 8
5 1 1 2
Anyone have an idea?
The easiest way would be to add the columns you want and then reorder them:
dat$B <- 1:5
newdat <- dat[, c("A", "B", "C", "D")]
Another way:
newdat <- cbind(dat[1], B=1:5, dat[,2:3])
If you're concerned about overhead, perhaps a data.table solution? (With help from this answer):
library(data.table)
dattable <- data.table(dat)
dattable[,B:=1:5]
setcolorder(dattable, c("A", "B", "C", "D"))
dat$B <- 1:5
ind <- c(1:which(names(data) == "A"),ncol(data),(which(names(data) == "A")+1):ncol(data)-1)
data <- data[,ind]
Create the variable at the end of the data.frame and then using an indicator vector signaling how to reorder the columns. ind is just a vector of numbers
I am trying to apply different functions to different rows based on the value of a string in an adjacent column. My dataframe looks like this:
type size
A 1
B 3
A 4
C 2
C 5
A 4
B 32
C 3
and I want to apply different functions to types A, B, and C, to give a third column column "size2." For example, let's say the following functions apply to A, B, and C:
for A: size2 = 3*size
for B: size2 = size
for C: size2 = 2*size
I'm able to do this for each type separately using this code
df$size2 <- ifelse(df$type == "A", 3*df$size, NA)
df$size2 <- ifelse(df$type == "B", 1*df$size, NA)
df$size2 <- ifelse(df$type == "C", 2*df$size, NA)
However, I can't seem to do it for all of the types without erasing all of the other values. I tried to use this code to limit the application of the function to only those values that were NA (i.e., keep existing values and only fill in NA values), but it didn't work using this code:
df$size2 <- ifelse(is.na(df$size2), ifelse(df$type == "C", 2*df$size, NA), NA)
Does anyone have any ideas? Is it possible to use some kind of AND statement with "is.na(df$size2)" and "ifelse(df$type == "C""?
Many thanks!
This might be a might more R-ish (and I called my dataframe 'dat' instead of 'df' since df is a commonly used function.
> facs <- c(3,1,2)
> dat$size2= dat$size* facs[ match( dat$type, c("A","B","C") ) ]
> dat
type size size2
1 A 1 3
2 B 3 3
3 A 4 12
4 C 2 4
5 C 5 10
6 A 4 12
7 B 32 32
8 C 3 6
The match function is used to construct indexes to supply to the extract function [.
if you want you can nest the ifelses:
df$size2 <- ifelse(df$type == "A", 3*df$size,
ifelse(df$type == "B", 1*df$size,
ifelse(df$type == "C", 2*df$size, NA)))
# > df
# type size size2
#1 A 1 3
#2 B 3 3
#3 A 4 12
#4 C 2 4
#5 C 5 10
#6 A 4 12
#7 B 32 32
#8 C 3 6
This could do it like this, creating separate logical vectors for each type:
As <- df$type == 'A'
Bs <- df$type == 'B'
Cs <- df$type == 'C'
df$size2[As] <- 3*df$size[As]
df$size2[Bs] <- df$size[Bs]
df$size2[Cs] <- 2*df$size[Cs]
but a more direct approach would be to create a separate lookup table like this:
df$size2 <- c(A=3,B=1,C=2)[as.character(df$type)] * df$size
This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 5 years ago.
In R I have a data frame with observations described by several values one of which is a factor. I have sorted the dataset by this factor and would like to add a column in which I would get a number of observation on each level of the factor e.g.
factor obsnum
a 1
a 2
a 3
b 1
b 2
b 3
b 4
c 1
c 2
...
In SAS I do it with something like:
data logs.full;
set logs.full;
count + 1;
by cookie;
if first.cookie then count = 1;
run;
How can I achieve that in R?
Thanks,
Use rle (run length encoding) and sequence:
x <- c("a", "a", "a", "b", "b", "b", "b", "c", "c")
data.frame(
x=x,
obsnum = sequence(rle(x)$lengths)
)
x obsnum
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 b 4
8 c 1
9 c 2
Here is the ddply() solution
dataset <- data.frame(x = c("a", "a", "a", "b", "b", "b", "b", "c", "c"))
library(plyr)
ddply(dataset, .(x), function(z){
data.frame(obsnum = seq_along(z$x))
})
One solution using base R, assuming your data is in a data.frame named dfr:
dfr$cnt<-do.call(c, lapply(unique(dfr$factor), function(curf){
seq(sum(dfr$factor==curf))
}))
There are likely better solutions (e.g. employing package plyr and its ddply), but it should work.