Let's say I have a data frame
mydata <- data.frame(x = 1:25,
y = 26:50)
and another data frame with a set of min and max values
df.remove <- data.frame(min = c(3,10,22,17),
max = c(6,13,24,20))
I'm looking to delete every row of mydata whose value in column x falls between the min and max of any row in df.remove, giving me this output data frame:
x y
1 26
2 27
7 32
8 33
9 34
14 39
15 40
16 41
21 46
25 50
I figured I could use the between() function to drop values that fall within a range, and since I need the min and max values from each row of df.remove, I attempted a loop:
result <- data.frame()
for(i in 1:nrow(df.remove)) {
result <- mydata[!between(mydata$x, df.remove$min[i], df.remove$max[i]),]
}
This, for obvious reasons, returns an output with only the last set of min and max values removed. To get the output I'm after, I would presumably have to run each iteration on the result of the previous iteration instead of on the original data frame mydata, but I couldn't find a way to do it.
What you are looking for is known as a non-equi anti-join. This can be done pretty easily with the data.table package. Consider
library(data.table)
mydata <- data.frame(x = 1:25, y = 26:50)
df.remove <- data.frame(min = c(3,10,22,17), max = c(6,13,24,20))
setDT(mydata)[!df.remove, on = .(x >= min, x <= max)] # drop rows where min <= x <= max
Output
x y
1: 1 26
2: 2 27
3: 7 32
4: 8 33
5: 9 34
6: 14 39
7: 15 40
8: 16 41
9: 21 46
10: 25 50
In your code, result can only keep the last update, because every iteration operates on the original mydata data frame and assigns that single update to result. Instead, you should operate on the already-updated data frame. You could try the following code.
library(dplyr) # for between()
result <- mydata
for(i in 1:nrow(df.remove)) {
result <- result[!between(result$x, df.remove$min[i], df.remove$max[i]),]
}
After first assigning the original mydata data frame to result, each iteration updates result itself, so every range is removed cumulatively.
A base R approach -
res <- subset(mydata, !x %in% unlist(Map(`:`, df.remove$min, df.remove$max)))
res
# x y
#1 1 26
#2 2 27
#7 7 32
#8 8 33
#9 9 34
#14 14 39
#15 15 40
#16 16 41
#21 21 46
#25 25 50
Using Map we create a sequence from min to max for each row, unlist them into a single vector, and drop the rows whose x value appears in that vector.
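For intuition, the intermediate vector simply enumerates every x value covered by the ranges (a quick check with the df.remove defined above):
unlist(Map(`:`, df.remove$min, df.remove$max))
# [1]  3  4  5  6 10 11 12 13 22 23 24 17 18 19 20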
Another option using the fuzzyjoin package -
fuzzyjoin::fuzzy_anti_join(mydata, df.remove,
c('x' = 'min', 'x' = 'max'),
match_fun = c(`>=`, `<=`))
Since you're using the dplyr function between(), we can use dplyr's filter(). For each row of mydata we want to apply between() against every row of df.remove to see whether the value of column x falls in that range. This can be accomplished with mapply() (since there are two values to pass to the function), which creates a matrix of TRUE/FALSE values. Then go through each row and check whether any value is TRUE, using apply() across rows. Finally, negate the filter so any row that returns TRUE, i.e. whose x falls inside one of the ranges, is dropped:
library(dplyr)
mydata %>%
filter(
!mapply(function(left, right) between(mydata$x, left, right), left = df.remove$min, right = df.remove$max) %>%
apply(., 1, any)
)
Returns:
x y
1 1 26
2 2 27
3 7 32
4 8 33
5 9 34
6 14 39
7 15 40
8 16 41
9 21 46
10 25 50
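For intuition (a small check, reusing the objects above), the mapply() call builds a 25 x 4 logical matrix, one column per (min, max) pair in df.remove, and apply(., 1, any) collapses it to one logical per row of mydata:
m <- mapply(function(left, right) dplyr::between(mydata$x, left, right),
left = df.remove$min, right = df.remove$max)
dim(m)
# [1] 25  4
head(apply(m, 1, any))
# [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE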
Just because this is an interesting problem with several possible solutions, here is another approach using metaprogramming.
The idea is that we turn df.remove into a list of expressions which we then use inside filter(mydata, !!! .) by splicing it with the !!! operator.
One way to get the list of expressions is to use a rowwise summarise and create a list of expressions with bquote, which allows us to evaluate the parts wrapped in .(), in our case the min and max values.
And although this is possible, I'd probably use either @ekoam's {data.table} or @Ronak's base R approach.
library(dplyr)
df.remove %>%
rowwise %>%
summarise(x = list(bquote(!x %in% c(.(min):.(max))))) %>%
pull(x) %>%
filter(mydata, !!! .)
#> `summarise()` has ungrouped output. You can override using the `.groups`
#> argument.
#> x y
#> 1 1 26
#> 2 2 27
#> 3 7 32
#> 4 8 33
#> 5 9 34
#> 6 14 39
#> 7 15 40
#> 8 16 41
#> 9 21 46
#> 10 25 50
Created on 2022-01-23 by the reprex package (v0.3.0)
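For intuition, splicing that list of expressions with !!! is equivalent to writing the four conditions out by hand, so the filter() call effectively becomes:
filter(mydata,
!x %in% c(3:6),
!x %in% c(10:13),
!x %in% c(22:24),
!x %in% c(17:20))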
Using data.table::inrange.
library(data.table)
mydata[!mydata$x %inrange% df.remove, ]
# x y
# 1 1 26
# 2 2 27
# 7 7 32
# 8 8 33
# 9 9 34
# 14 14 39
# 15 15 40
# 16 16 41
# 21 21 46
# 25 25 50
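The %inrange% operator takes the two columns of df.remove as the lower and upper bounds; an equivalent call with explicit vectors (a sketch) would be:
mydata[!data.table::inrange(mydata$x, df.remove$min, df.remove$max), ]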
I cannot share the dataset, but I will explain it as best as I can.
The dataset has 50 columns, 48 of which are in Y/m/d h:m:s format. The data also has many NAs, but they must not be removed.
Let's say there is a column B. I want to remove a row if the value of B is not the earliest in that row.
How can I do this in R? For example, the original would be like this:
df <- data.frame(
A = c(11,19,17,6,13),
B = c(18,9,5,16,12),
C = c(14,15,8,87,16))
A B C
1 11 18 14
2 19 9 15
3 17 5 8
4 6 16 87
5 13 12 16
but I want this:
A B C
1 19 9 15
2 17 5 8
3 13 12 16
You could use apply() to find the minimum for each row.
df |> subset(B == apply(df, 1, min, na.rm = TRUE))
# A B C
# 2 19 9 15
# 3 17 5 8
# 5 13 12 16
The tidyverse equivalent is
library(tidyverse)
df %>% filter(B == pmap(across(A:C), min, na.rm = TRUE))
If you are willing to use data.table, you could do the following for the example.
library(data.table)
setDT(df)
df[(B < A & B < C)]
A B C
1: 19 9 15
2: 17 5 8
3: 13 12 16
More generally, you could do
df <- as.data.table(df)
df[, min := do.call(pmin, .SD)][B == min, !"min"]
.SDcols in the first [ would let you control which columns you want to take the min over, if you wanted to e.g. exclude some. I am not super knowledgeable about the inner workings of data.table, but I believe that creating this new column is probably efficient RAM-wise.
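For instance, a sketch restricting the minimum to a hypothetical subset of columns (here A and C); since B itself is no longer a candidate, the test becomes B <= min:
df[, min := do.call(pmin, c(.SD, na.rm = TRUE)), .SDcols = c("A", "C")][B <= min, !"min"]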
How can I get, within each group, the index (time) of the sample whose previous samples were consecutive and above a fixed threshold?
In the example below, I need to find, for each group, the time right after 3 consecutive samples whose speed is above the threshold (speed >= 35).
speed_threshold = 35
Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40 # Speed > 35
1 25 42 # Speed > 35
1 30 52 # Speed > 35
1 35 48 # <--- Return time = 35 as answer for Group 1 !
1 40 45
2 5 22
2 10 36 # Speed > 35
2 15 38 # Speed > 35
2 20 46 # Speed > 35
2 25 53 # <--- Return time = 25 as answer for Group 2 !
3 5 45
3 10 58 # <--- Return time = NA as answer for group 3 !
If a value is above the threshold and it is the third such value in a row, its index is captured in ends. Select the first index in ends and add one to get the index of the return time. (There may be more than one such run of 3, and therefore more than one element in ends; in that case the first one is used.)
Note: In your example, the speed at the return time is always above the threshold. This code does not check that as a condition at all, but simply gives the first time after three rows with speeds above the threshold (regardless of whether the speed at that time is still above the threshold).
library(data.table)
setDT(df)
speed_thresh <- 35
df[, {above <- Speed > speed_thresh
ends <- which(above & rowid(rleid(above)) == 3)
.(Return_Time = Time[ends[1] + 1])}
, Group]
# Group Return_Time
# 1: 1 35
# 2: 2 25
# 3: 3 NA
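To see the intermediate pieces, here is what the helpers produce for Group 1, where Speed > 35 is FALSE for the first three rows and TRUE afterwards:
above <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
rleid(above)                            # 1 1 1 2 2 2 2 2  (run id)
rowid(rleid(above))                     # 1 2 3 1 2 3 4 5  (position within each run)
which(above & rowid(rleid(above)) == 3) # 6, so Time[6 + 1] = 35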
Data used:
df <- fread('
Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40
1 25 42
1 30 52
1 35 48
1 40 45
2 5 22
2 10 36
2 15 38
2 20 46
2 25 53
3 5 45
3 10 58
')
One option is to use rleid to create a grouping variable based on the logical condition on 'Speed', keep the groups where the number of rows (n()) is at least 3 and all 'Speed' values are greater than 35, and take the first three rows of each
library(dplyr)
library(data.table)
df1 %>%
group_by(Group, grp = rleid(Speed > speed_threshold)) %>%
filter(n() >= 3, all(Speed > speed_threshold)) %>%
slice(1:3)
1) Using DF defined reproducibly in the Note at the end, define a function ok which takes a vector of logicals indicating whether speed is greater than 35 and returns a logical vector of the same length which is TRUE for the first speed that comes after 3 consecutive TRUEs. Apply that to each group using ave and subset DF down those rows which are TRUE giving s.
If just returning the groups which satisfy the condition is sufficient then we are done; otherwise, define Groups as a one column data frame with one row per Group and merge that with s so that we get an NA for those groups not satisfying the condition.
library(zoo)
ok <- function(x) cumsum(rollapplyr(x, list(-(1:3)), all, fill = FALSE)) == 1
s <- subset(DF, ave(Speed > 35, Group, FUN = ok))
Groups <- data.frame(Group = unique(DF$Group))
merge(Groups, s, all.x = TRUE)[1:2]
## Group Time
## 1 1 35
## 2 2 25
## 3 3 NA
2) A second approach is to split DF by group and then perform the calculation over each component of the split.
library(zoo)
calc <- function(x) {
r <- rollapplyr(x$Speed > 35, list(-(1:3)), all, fill = FALSE)
c(which(cumsum(r) == 1), NA)[1]
}
sapply(split(DF, DF$Group), calc)
## 1 2 3
## 35 25 NA
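For intuition, the offset list(-(1:3)) makes rollapplyr apply all() to the three previous values only. For Group 1, where Speed > 35 is FALSE for the first three rows and TRUE afterwards, the intermediate results are:
x <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE)
rollapplyr(x, list(-(1:3)), all, fill = FALSE)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
cumsum(rollapplyr(x, list(-(1:3)), all, fill = FALSE)) == 1
## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
so position 7 is selected, i.e. Time = 35 for Group 1.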
Note
Lines <- "Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40 # Speed > 35
1 25 42 # Speed > 35
1 30 52 # Speed > 35
1 35 48 # <--- Return time = 35 as answer for Group 1 !
1 40 45
2 5 22
2 10 36 # Speed > 35
2 15 38 # Speed > 35
2 20 46 # Speed > 35
2 25 53 # <--- Return time = 25 as answer for Group 2 !
3 5 45
3 10 58 # <--- Return time = NA as answer for group 3 !"
DF <- read.table(text = Lines, header = TRUE)
I have a melted data table with a column containing values that refer to other column names within the same table. I want to replace each value in that column with the value, from the same row, of the column it refers to.
library("data.table")
## Example input data table
DT_input <- data.table(A=c(1:10),
B=c(11:20),
C=c(21:30),
replace=c(rep("A", 5), rep("B", 3), rep("C", 2)))
## Desired output data table
DT_output <- data.table(A=c(1:10),
B=c(11:20),
C=c(21:30),
replace=c(1:5, 16:18, 29:30))
My old approach shown here is very slow because of the for loop:
## Attempted looping solution
for (kRow in seq_len(nrow(DT_input))) {
e <- parse(text = DT_input[kRow, replace])
DT_input[kRow, replace := eval(e)]
}
If we need a vectorized approach use the row/column indexing from base R
i1 <- cbind(seq_len(nrow(df1)), match(df1$replace, names(df1)[-4]))
df1$replace <- df1[-4][i1]
df1$replace
#[1] 1 2 3 4 5 16 17 18 29 30
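For intuition, i1 is a 10 x 2 matrix of (row, column) positions into df1 without its 4th column; its first three rows are (1, 1), (2, 1), (3, 1), i.e. rows 1-3 of column A, because replace is "A" there:
head(i1, 3)
#      [,1] [,2]
# [1,]    1    1
# [2,]    2    1
# [3,]    3    1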
With data.table, an option is Map or a for loop without the eval, but it would still not be vectorized (a sketch follows the data note below).
data
df1 <- as.data.frame(DT_input)
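A sketch of the Map idea mentioned above (still row-by-row rather than vectorized): for each row, pick the value of the column named in replace.
DT_input[, replace := unlist(Map(function(col, i) DT_input[[col]][i],
replace, seq_len(.N)))]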
An option using data.table:
DT_input[, rn := .I]
DT_input[, replace :=
DT_input[, DT_input[.SD, on=c("rn", .BY$replace), get(.BY$replace)], .(replace)]$V1
]
output:
A B C replace
1: 1 11 21 1
2: 2 12 22 2
3: 3 13 23 3
4: 4 14 24 4
5: 5 15 25 5
6: 6 16 26 16
7: 7 17 27 17
8: 8 18 28 18
9: 9 19 29 29
10: 10 20 30 30
It will be slower than Akrun's base R method.
I apologize for the wording of the question and any errors; I'm a newbie on SO and in R.
Problem: find an efficient way to fill a column with numbers that uniquely identify observations that have the same value in another column.
The result would look like this:
patient_number id
1 46 1
2 47 2
3 15 3
4 42 4
5 33 5
6 26 6
7 37 7
8 7 8
9 33 5
10 36 9
Sample data frame
set.seed(42)
df <- data.frame(
patient_number = sample(seq(1, 50, 1), 100, replace = TRUE)
)
What I was able to come up with
df$id <- NA ## create id and fill with NA make if statement easier
n_unique <- length(unique(df$patient_number)) ## how many unique obs
for (i in 1:nrow(df)) {
index_identical <- which(df$patient_number == df$patient_number[i])
## get index of obs with same patient_number
if (any(is.na(df$id[index_identical]))) {
## if any of the ids of obs with same patient number not filled in,
df$id[index_identical] <- setdiff(seq(1, n_unique, 1), df$id)[1]
## get an integer between 1 and the number of unique obs that is not used
}
else {
df$id <- df$id
}
}
It does the job, but with thousands of rows, it takes time.
Thanks for bearing with me.
If you're open to other packages, you can use the group_indices function from the dplyr package:
library(dplyr)
df %>%
mutate(id = group_indices(., patient_number))
patient_number id
1 46 40
2 47 41
3 15 14
4 42 37
5 33 28
6 26 23
7 37 32
8 7 6
9 33 28
10 36 31
11 23 21
12 36 31
13 47 41
...
We can use .GRP from data.table
library(data.table)
setDT(df)[, id := .GRP, patient_number]
Or with base R, the match and factor options are fast as well
df$id <- with(df, match(patient_number, unique(patient_number)))
df$id <- with(df, as.integer(factor(patient_number,
levels = unique(patient_number))))
I am having a very hard time leading or lagging an entire data frame. With the following attempts I am able to shift individual columns, but not the whole thing:
require('DataCombine')
df_l <- slide(df, Var = var1, slideBy = -1)
Using colnames(x_ret_mon) as Var does not work; I am told the variable names are not found in the data frame.
This attempt shifts the columns right but not down:
df_l<- dplyr::lag(df)
This only creates new variables for the lagged values, and then I do not know how to efficiently delete the old, non-lagged columns:
df_l<-shift(df, n=1L, fill=NA, type=c("lead"), give.names=FALSE)
Use dplyr::mutate_all to apply lags or leads to all columns.
df = data.frame(a = 1:10, b = 21:30)
dplyr::mutate_all(df, lag)
a b
1 NA NA
2 1 21
3 2 22
4 3 23
5 4 24
6 5 25
7 6 26
8 7 27
9 8 28
10 9 29
I don't see the point in lagging all columns in a data.frame. Wouldn't that just correspond to rbinding an NA row to your original data.frame (minus its last row)?
df = data.frame(a = 1:10, b = 21:30)
rbind(NA, df[-nrow(df), ]);
# a b
#1 NA NA
#2 1 21
#3 2 22
#4 3 23
#5 4 24
#6 5 25
#7 6 26
#8 7 27
#9 8 28
#10 9 29
And similarly for leading all columns.
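A minimal sketch of that lead analogue, using the same df: drop the first row and append an NA row.
rbind(df[-1, ], NA)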
A couple more options
library(dplyr) # lag here is dplyr's vector lag, not stats::lag
data.frame(lapply(df, lag))
require(purrr)
map_df(df, lag)
If your data is a data.table you can do
require(data.table)
as.data.table(shift(df))
Or, if you're overwriting df
df[] <- lapply(df, lag) # Thanks Moody
require(magrittr)
df %<>% map_df(lag)