Suppose I have a vector of values, such as:
A C A B A C C B B C C A A A B B B B C A
I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above,
NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
(where NA indicates that this is the first time the element has been seen).
For example, the first and second A are in position 1 and 3 respectively, a difference of 2; the third and fourth A are in position 4 and 11, a difference of 7, and so on.
Is there a pre-built pipe-compatible function that does this?
I hacked together this function to demonstrate:
# For reproducibility
set.seed(1)
# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)
compute_lag_counts = function(x, first_time = NA){
# return vector to fill
lag_counts = rep(-1, length(x))
# values to match
vals = unique(x)
# find all positions of all elements in the target vector
match_list = grr::matches(vals, x, list = TRUE)
# compute the lags, then put them in the appropriate place in the return vector
for(i in seq_along(match_list))
lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
# return vector
return(lag_counts)
}
compute_lag_counts(x)
Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.
Or
ave(seq.int(x), x, FUN = function(x) c(NA, diff(x)))
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
We calculate the first difference of the indices for each group of x.
A data.table option thanks to #Henrik
library(data.table)
dt = data.table(x)
dt[ , d := .I - shift(.I), x]
dt
Here's a function that would work
compute_lag_counts <- function(x) {
seqs <- split(seq_along(x), x)
unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}
compute_lag_counts (x)
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
Basically you use split() to separate the indexes where values appear by each unique value in your vector. Then we use the different between the index where they appear to calculate the distance to the previous value. Then we use unstack to put those values back in the original order.
An option with dplyr by taking the difference of adjacent sequence elements after grouping by the original vector
library(dplyr)
tibble(v1) %>%
mutate(ind = row_number()) %>%
group_by(v1) %>%
mutate(new = ind - lag(ind)) %>%
pull(new)
#[1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
data
v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A",
"A", "A", "B", "B", "B", "B", "C", "A")
Related
I have been going crazy with something basic...
I am trying to count and list in a comma separated column each unique ID coming up in a data frame, e.g.:
df<-data.frame(id = as.character(c("a", "a", "a", "b", "c", "d", "d", "e", "f")), x1=c(3,1,1,1,4,2,3,3,3),
x2=c(6,1,1,1,3,2,3,3,1),
x3=c(1,1,1,1,1,2,3,3,2))
> > df
id x1 x2 x3
1 a 3 6 1
2 a 1 1 1
3 a 1 1 1
4 b 1 1 1
5 c 4 3 1
6 d 1 2 2
7 d 3 3 3
8 e 1 3 3
9 f 3 1 2
I am trying to get a count of unique id that satisfy a condition, >1:
res = data.frame(x1_counts =5, x1_names="a,c,d,e,f", x2_counts = 4, x2_names="a,c,d,f", x3_counts = 3, x3_names="d,e,f")
> res
x1_counts x1_names x2_counts x2_names x3_counts x3_names
1 5 a,c,d,e,f 4 a,c,d,f 3 d,e,f
I have tried with data.table but it seems very convoluted, i.e.
DT = as.data.table(df)
res <- DT[, list(x1= length(unique(id[which(x1>1)])), x2= length(unique(id[which(x2>1)]))), by=id)
But I can't get it right, I am going not getting what I need to do with data.table since it is not really a grouping I am looking for. Can you direct me in the right path please? Thanks so much!
You can reshape your data to long format and then do the summary:
library(data.table)
(melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable])
# You can replace the list to toString if you want a string as name instead of list
# variable counts names
#1: x1 5 a,c,d,e,f
#2: x2 4 a,c,d,e
#3: x3 3 d,e,f
To get what you need, reshape it back to wide format:
dcast(1~variable,
data = (melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable]),
value.var = c('counts', 'names'))
# . counts_x1 counts_x2 counts_x3 names_x1 names_x2 names_x3
# 1: . 5 4 3 a,c,d,e,f a,c,d,e d,e,f
I have a data.frame ystr:
v1
1 a
2 B
3 B
4 C
5 d
6 a
7 B
8 D
I want to find the start and end of each group of letters in CAPS so my output would be:
groupId startPos endPos
1 1 2 4
2 2 7 8
I was able to do it with a for loop by looking at each element in order and comparing it to the one before as follows:
currentGroupId <-0
for (i in 1:length(ystr[,1])){
if (grepl("[[:upper:]]", ystr[i,]))
{
if (startCounter == 0)
{
currentGroupId <- currentGroupId +1
startCounter <-1
mygroups[currentGroupId,] <- c(currentGroupId, i, 0)
}
}else if (startCounter == 1){
startCounter <-0
mygroups[currentGroupId,3]<- i-1
}
}
Is there a simple way of doing this in R?
This might be similar to Mark start and end of groups but I could not figure out how it would apply in this case.
You can do this by calculating the run-length encoding (rle) of the binary indicator for whether your data is upper case, as determined by whether the data is equal to itself when it's converted to upper case.
with(rle(d[,1] == toupper(d[,1])),
data.frame(start=cumsum(lengths)[values]-lengths[values]+1,
end=cumsum(lengths)[values]))
# start end
# 1 2 4
# 2 7 8
You can see other examples of the use of rle by looking at Stack Overflow answers using this command.
Data:
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
You can use the IRanges package. It's basically to find the consecutive ranges.
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
d.idx <- which(d$v1 %in% LETTERS)
d.idx
# [1] 2 3 4 7 8
library(IRanges)
d.idx.ir <- IRanges(d.idx, d.idx)
reduce(d.idx.ir)
# IRanges of length 2
# start end width
# [1] 2 4 3
# [2] 7 8 2
I would like to improve my piece of code. Let's say you want to remove duplicate rows that have the same 'label' and 'id'. The way I do it is:
library(data.table)
dt <- data.table(label = c("A", "A", "B", "B", "C", "A", "A", "A"),
id = c(1, 1, 2, 2, 3, 4, 5, 5))
tmp = dt[label == 'A',]
tmp = unique(tmp, by = 'id')
dt = dt[label != 'A',]
dt = rbind(dt, tmp)
Is there a smarter/shorter way to accomplish that? If possible by reference?
This code looks very ugly and implies a lot of copies.
(Moreover I have to do this operation for a few labels, but not all of them. So this implies 4 lines for every label...)
Thanks !
Example:
label id
A 1
A 1
B 2
B 2
C 3
A 4
A 5
A 5
Would give :
label id
A 1
B 2
B 2
C 3
A 4
A 5
Note that line 3 and 4 stay duplicated since the label is equal to 'B' and not to 'A'.
There is no need to create tmp and then rbind it again. You can simply use the duplicated function as follows:
dt[label != "A" | !duplicated(dt, by=c("label", "id"))]
# label id
# 1: A 1
# 2: B 2
# 3: B 2
# 4: C 3
# 5: A 4
# 6: A 5
If you want to do this over several labels:
dt[!label %in% c("A", "C") | !duplicated(dt, by=c("label", "id"))]
See ?duplicated to learn more about de-duplication functions in data.table.
This could be also done using an if/else condition
dt[, if(all(label=='A')) .SD[1L] else .SD, by = id]
# id label
#1: 1 A
#2: 2 B
#3: 2 B
#4: 3 C
#5: 4 A
#6: 5 A
If I have a data.frame like this
df <- data.frame(col1 = c(letters[1:4],"a"),col2 = 1:5,col3 = letters[10:14])
df
col1 col2 col3
1 a 1 j
2 b 2 k
3 c 3 l
4 d 4 m
5 a 5 n
I want to get the row indices that contains one of the element in c("a", "k", "n"); in this example, the result should be 1, 2, 5.
If you have a large data frame and you wish to check all columns, try this
x <- c("a", "k", "n")
Reduce(union, lapply(x, function(a) which(rowSums(df == a) > 0)))
# [1] 1 5 2
and of course you can sort the end result.
s <- c('a','k','n');
which(df$col1%in%s|df$col3%in%s);
## [1] 1 2 5
Here's another solution. This one works on the entire data.frame, and happens to capture the search strings as element names (you can get rid of those via unname()):
sapply(s,function(s) which(apply(df==s,1,any))[1]);
## a k n
## 1 2 5
Original second solution:
sort(unique(rep(1:nrow(df),ncol(df))[as.matrix(df)%in%s]));
## [1] 1 2 5
I am trying to apply different functions to different rows based on the value of a string in an adjacent column. My dataframe looks like this:
type size
A 1
B 3
A 4
C 2
C 5
A 4
B 32
C 3
and I want to apply different functions to types A, B, and C, to give a third column column "size2." For example, let's say the following functions apply to A, B, and C:
for A: size2 = 3*size
for B: size2 = size
for C: size2 = 2*size
I'm able to do this for each type separately using this code
df$size2 <- ifelse(df$type == "A", 3*df$size, NA)
df$size2 <- ifelse(df$type == "B", 1*df$size, NA)
df$size2 <- ifelse(df$type == "C", 2*df$size, NA)
However, I can't seem to do it for all of the types without erasing all of the other values. I tried to use this code to limit the application of the function to only those values that were NA (i.e., keep existing values and only fill in NA values), but it didn't work using this code:
df$size2 <- ifelse(is.na(df$size2), ifelse(df$type == "C", 2*df$size, NA), NA)
Does anyone have any ideas? Is it possible to use some kind of AND statement with "is.na(df$size2)" and "ifelse(df$type == "C""?
Many thanks!
This might be a might more R-ish (and I called my dataframe 'dat' instead of 'df' since df is a commonly used function.
> facs <- c(3,1,2)
> dat$size2= dat$size* facs[ match( dat$type, c("A","B","C") ) ]
> dat
type size size2
1 A 1 3
2 B 3 3
3 A 4 12
4 C 2 4
5 C 5 10
6 A 4 12
7 B 32 32
8 C 3 6
The match function is used to construct indexes to supply to the extract function [.
if you want you can nest the ifelses:
df$size2 <- ifelse(df$type == "A", 3*df$size,
ifelse(df$type == "B", 1*df$size,
ifelse(df$type == "C", 2*df$size, NA)))
# > df
# type size size2
#1 A 1 3
#2 B 3 3
#3 A 4 12
#4 C 2 4
#5 C 5 10
#6 A 4 12
#7 B 32 32
#8 C 3 6
This could do it like this, creating separate logical vectors for each type:
As <- df$type == 'A'
Bs <- df$type == 'B'
Cs <- df$type == 'C'
df$size2[As] <- 3*df$size[As]
df$size2[Bs] <- df$size[Bs]
df$size2[Cs] <- 2*df$size[Cs]
but a more direct approach would be to create a separate lookup table like this:
df$size2 <- c(A=3,B=1,C=2)[as.character(df$type)] * df$size