I have a dataset looking like this:
df <- data.frame(ID=c(1, 1, 1, 2, 2, 3), days=c(100, 10, -8, -5, 12, 10))
Now I only want to take the lowest positive value of "days" so that my output would look like this:
new <- data.frame(ID=c(1, 2, 3), days=c(10, 12, 10))
I have thought about this:
df %>%
group_by(ID) %>%
slice_min(days)
But of course this will also return the lowest number when it is negative. What can I do to get only the lowest positive values?
Preferably using dplyr.
Thanks so much!
Filtering for only positive values of days should do it.
df <- data.frame(ID=c(1, 1, 1, 2, 2, 3), days=c(100, 10, -8, -5, 12, 10))
library(dplyr)
df %>%
group_by(ID) %>%
filter(days > 0) %>%
slice_min(days)
#> # A tibble: 3 x 2
#> # Groups: ID [3]
#> ID days
#> <dbl> <dbl>
#> 1 1 10
#> 2 2 12
#> 3 3 10
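One caveat, in case your real data has an ID with no positive days at all: filter(days > 0) silently drops that ID from the result. A minimal sketch with a hypothetical participant 4:
df2 <- rbind(df, data.frame(ID = 4, days = -3))
df2 %>%
group_by(ID) %>%
filter(days > 0) %>%
slice_min(days)
# ID 4 does not appear in the output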
You can use aggregate()
aggregate(days ~ ID, df, function(x){
min(x[x > 0])
})
# ID days
# 1 1 10
# 2 2 12
# 3 3 10
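The same hedge applies here: for a group with no positive values, min(x[x > 0]) is min() of an empty vector, which returns Inf with a warning. A guarded sketch:
aggregate(days ~ ID, df, function(x) {
pos <- x[x > 0]
if (length(pos) == 0) NA_real_ else min(pos)
})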
I have a very long data frame (~10,000 rows), in which two of the columns look something like this.
A B
1 5.5
1 5.5
2 201
9 18
9 18
2 201
9 18
... ...
Just scrubbing through the data it seems that the two columns are "paired" together, but is there any way of explicitly checking this?
You want to know if value x in column A always means value y in column B?
Let's group by A and count the distinct values in B:
df <- data.frame(
A = c(1, 1, 2, 9, 9, 2, 9),
B = c(5.5, 5.5, 201, 18, 18, 201, 18)
)
df %>%
group_by(A) %>%
distinct(B) %>%
summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 1
2 2 1
3 9 1
If we now alter df so that this no longer holds:
df <- data.frame(
A = c(1, 1, 2, 9, 9, 2, 9),
B = c(5.5, 5.4, 201, 18, 18, 201, 18)
)
df %>%
group_by(A) %>%
distinct(B) %>%
summarize(n_unique = n())
# A tibble: 3 x 2
A n_unique
<dbl> <int>
1 1 2
2 2 1
3 9 1
Observe the increased count for group 1. Since you have more than 10,000 rows, what remains is to check whether at least one group has n_unique > 1, for instance with filter(n_unique > 1).
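Putting that check into the pipeline (a sketch using dplyr's n_distinct(); an empty result means every A maps to exactly one B):
df %>%
group_by(A) %>%
summarize(n_unique = n_distinct(B)) %>%
filter(n_unique > 1)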
If you run this, you will see how many unique values of B there are for each value of A:
tapply(df$B, df$A, function(x) length(unique(x)))
So if the max of this vector is 1 then there are no values of A that have more than one corresponding value of B.
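Or as a single yes/no check (a sketch; TRUE means the columns are perfectly paired):
max(tapply(df$B, df$A, function(x) length(unique(x)))) == 1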
This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 1 year ago.
I'm trying to create a column that ranks each person based on their date of entry, but since everyone's date of entry is unique, it's been challenging.
here's a reprex:
df <- data.frame(
unique_id = c(1, 1, 1, 2, 2, 3, 3, 3),
date_of_entry = c("3-12-2001", "3-13-2001", "3-14-2001", "4-1-2001", "4-2-2001", "3-28-2001", "3-29-2001", "3-30-2001"))
What I want:
df_desired <- data.frame(
unique_id = c(1, 1, 1, 2, 2, 3, 3, 3),
date_of_entry = c("3-12-2001", "3-13-2001", "3-14-2001", "4-1-2001", "4-2-2001", "3-28-2001", "3-29-2001", "3-30-2001"),
day_at_facility = c(1, 2, 3, 1, 2, 1, 2, 3))
Basically, I want to number the days at the facility, but I need the count to restart for each unique ID. Let me know if this is not clear.
(This is a dupe of something, haven't found it yet, but in the interim ...)
base R
ave(rep(1L,nrow(df)), df$unique_id, FUN = seq_along)
# [1] 1 2 3 1 2 1 2 3
so therefore
df$day_at_facility <- ave(rep(1L,nrow(df)), df$unique_id, FUN = seq_along)
dplyr
library(dplyr)
df %>%
group_by(unique_id) %>%
mutate(day_at_facility = row_number())
# # A tibble: 8 x 3
# # Groups: unique_id [3]
# unique_id date_of_entry day_at_facility
# <dbl> <chr> <int>
# 1 1 3-12-2001 1
# 2 1 3-13-2001 2
# 3 1 3-14-2001 3
# 4 2 4-1-2001 1
# 5 2 4-2-2001 2
# 6 3 3-28-2001 1
# 7 3 3-29-2001 2
# 8 3 3-30-2001 3
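One hedged addition: row_number() just follows the current row order. If the rows might not already be sorted by date within each ID, parse date_of_entry (assumed here to be month-day-year) and arrange first:
df %>%
mutate(date_parsed = as.Date(date_of_entry, format = "%m-%d-%Y")) %>%
group_by(unique_id) %>%
arrange(date_parsed, .by_group = TRUE) %>%
mutate(day_at_facility = row_number())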
Hello and hope all goes well.
I made an edit to my previous question and hope it makes things clearer.
I created an igraph object and would like to run the same analysis several times, extracting some information in each iteration.
I can't share the whole data, so I am sharing just a small subset.
df_edge is as follows:
library(dplyr)
job_1 <-c(1,2,6,6,5,6,7,8,6,8,8,6,6,8)
job_2 <- c(2,4,5,8,3,1,4,6,1,7,3,2,4,5)
weight <- c(1,1,1,2,1,1,2,1,1,1,2,1,1,1)
df_edge <- tibble(job_1,job_2,weight)
df_edge %>% glimpse()
Rows: 14
Columns: 3
$ job_1 <dbl> 1, 2, 6, 6, 5, 6, 7, 8, 6, 8, 8, 6, 6, 8
$ job_2 <dbl> 2, 4, 5, 8, 3, 1, 4, 6, 1, 7, 3, 2, 4, 5
$ weight <dbl> 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1
df_node is as follows:
job_id <- c(1,2,3,4,5,6,7,8)
job_type <- c(1,2,0,0,3,1,1,1)
df_node <- tibble(job_id,job_type)
df_node %>% glimpse()
Rows: 8
Columns: 2
$ job_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8
$ job_type <dbl> 1, 2, 0, 0, 3, 1, 1, 1
Creating the igraph object:
library(igraph)
library(tidygraph)
tp_network_subset <- graph.data.frame(df_edge, vertices = df_node, directed = FALSE)
Summary of the job_type column in df_node:
df_node %>%
count(job_type)
A tibble: 4 x 2
job_type n
<dbl> <int>
1 0 2
2 1 4
3 2 1
4 3 1
What I am doing manually is the following:
### finding a job_id that belongs to the job_type==1 category
df_node %>% filter(job_type==1) %>%
select(job_id)
A tibble: 4 x 1
job_id
<dbl>
1 1
2 6
3 7
4 8
# for instance, I pick one of them: job_id = 6
### using this job_id to create a subgraph from its order-1 neighbors
node_test <- make_ego_graph(tp_network_subset,order = 1 ,nodes="6")
### creating a data frame of this subgraph with no isolated nodes
df_test <- as_tbl_graph(node_test[[1]]) %>%
activate(nodes) %>%
filter(!node_is_isolated()) %>%
as_tibble()
df_test %>% glimpse()
Rows: 6
Columns: 2
$ name <chr> "1", "2", "4", "5", "6", "8"
$ job_type <dbl> 1, 2, 0, 3, 1, 1
## subgraph size is 6, which will be an outcome of interest
### if the graph is zero length, I should stop here and pick another job_id that belongs to the job_type==1 category
In this example, the graph is not zero length, so I proceed to the next step
### calculating the measure of interest with respect to the job_type==1 category
df_test %>%
summarise(job_rate= (nrow(df_test %>% filter(job_type==1)))/(nrow(df_test %>%
filter(job_type %in% c(1,2,3)))))
# 0.6
If job_rate > 0.5, I want to keep the job_rate and the rows (corresponding nodes) of the job_type==0 category of the subgraph. In this instance, job_rate was 0.6, so I keep the following:
df_final <- as_tbl_graph(node_test[[1]]) %>%
activate(nodes) %>%
filter(!node_is_isolated()) %>%
as_tibble() %>% filter(job_type==0)
# A tibble: 1 x 2
name job_type
<chr> <dbl>
1 4 0
But I need to attach their corresponding job_rate and some other related columns, so my desired outcome would be:
name job_type subgraph_origin_id job_rate subgraph_size no_(job_type==0)_in_subgraph no_(job_type==1)_in_subgraph no_(job_type==2)_in_subgraph no_(job_type==3)_in_subgraph
<chr> <dbl>
1 4 0 6 0.6 6
So I need to repeat this process, creating subgraphs for all job_type==1 nodes. If a graph is not zero length and its job_rate > 0.5, extract all the corresponding nodes in that subgraph along with the job_rate and the other columns shown in the desired outcome.
Does this work for you?
dflst <- split(df_node, df_node$job_type)
tpe <- as.numeric(names(dflst))
out <- tibble()
for (i in seq_along(dflst)) {
df <- dflst[[i]]
node_test_lst <- make_ego_graph(tp_network_subset, order = 1, nodes = df$job_id)
origin_id <- df$job_id
jtpe <- tpe[i]
for (j in seq_along(node_test_lst)) {
node_test <- node_test_lst[[j]]
df_test <- as_tbl_graph(node_test) %>%
activate(nodes) %>%
filter(!node_is_isolated()) %>%
as_tibble()
if (nrow(df_test %>% filter(job_type == 0)) > 0 & any(df_test$job_type %in% 1:3)) {
job_rate <- with(df_test, sum(job_type == jtpe) / sum(job_type %in% 1:3))
if (job_rate > 0.5) {
df_final <- df_test %>%
filter(job_type == 0) %>%
mutate(
subgraph_origin_id = origin_id[j],
job_rate = job_rate,
subgraph_size = nrow(df_test)
) %>%
cbind(
setNames(
as.list(table(factor(df_test$job_type, levels = 0:3))),
sprintf("no_(job_type==%s)_in_subgrapgh", 0:3)
)
)
out <- out %>% rbind(df_final)
}
}
}
}
which gives
> out
name job_type subgraph_origin_id job_rate subgraph_size
1 4 0 6 0.60 6
2 4 0 7 1.00 3
3 3 0 8 0.75 5
no_(job_type==0)_in_subgraph no_(job_type==1)_in_subgraph
1 1 3
2 1 2
3 1 3
no_(job_type==2)_in_subgraph no_(job_type==3)_in_subgraph
1 1 1
2 0 0
3 0 1
I have a dataset looking like this:
df <- data.frame(ID=c(1, 1, 1, 2, 3, 3), timeA=c(-10, NA, NA, -15, -10, -5), timeB=c(5, 100, -10, -10, -15, 5), timeC=c(1, 160, 17, -5, -5, 2))
Question 1:
I want to create a column giving me, for each participant (ID), the lowest positive value of time; if all of a participant's values are negative, keep the value that is least negative.
Question 2: Is there a function that finds the value closest to 0?
So that my output would look like this:
df <- data.frame(ID=c(1,2,3), time_new=c(1, -5, 2))
I think you're looking for Closest() from the DescTools package.
library(tidyverse)
library(DescTools)
# your data
df <- data.frame(ID=c(1, 1, 1, 2, 3, 3),
timeA=c(-10, NA, NA, -15, -10, -5),
timeB=c(5, 100, -10, -10, -15, 5),
timeC=c(1, 160, 17, -5, -5, 2))
# your results
# I stacked the information for easier searching
df %>% pivot_longer(!ID,values_to = "value") %>%
group_by(ID) %>%
summarise(time_new = Closest(value, 0, na.rm = TRUE)) # closest value to zero
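If you would rather not add the DescTools dependency, which.min(abs(...)) gives the same closest-to-zero pick (a sketch; which.min() skips NAs and keeps the first value on ties):
df %>%
pivot_longer(!ID, values_to = "value") %>%
group_by(ID) %>%
summarise(time_new = value[which.min(abs(value))])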
Simply calculate the distance to 0 and then filter.
For #1
library(tidyverse)
# helper for filter(): returns a logical vector following the logic of #1 -
# prefer the lowest positive value; if there are no positives,
# take the maximum (i.e. least negative) value
filter_function <- function(x) {
result <- rep(0, length(x))
if (all(x < 0, na.rm = TRUE)) {
reference <- max(x, na.rm = TRUE)
} else {
reference <- min(x[x > 0], na.rm = TRUE)
}
result <- result + (x == reference)
result[is.na(result)] <- 0
as.logical(result)
}
# filter as #1 option
df %>% pivot_longer(!ID,values_to = "value") %>%
# calculate the distance to ZERO for each value
mutate(distance_to_zero = 0 + value,
abs_distance_to_zero = abs(distance_to_zero)) %>%
group_by(ID) %>%
filter(filter_function(distance_to_zero))
#> # A tibble: 3 x 5
#> # Groups: ID [3]
#> ID name value distance_to_zero abs_distance_to_zero
#> <dbl> <chr> <dbl> <dbl> <dbl>
#> 1 1 timeC 1 1 1
#> 2 2 timeC -5 -5 5
#> 3 3 timeC 2 2 2
And this is for #2
# filter as closest to ZERO no matter positive or negative
df %>%
pivot_longer(!ID,values_to = "value") %>%
# calculate the distance to ZERO for each value
mutate(abs_distance_to_zero = abs(0 + value)) %>%
group_by(ID) %>%
# then keep the rows equal to the group minimum; this can return
# multiple records per group in your actual data
filter(abs_distance_to_zero == min(abs_distance_to_zero, na.rm = TRUE) &
!is.na(abs_distance_to_zero)) %>%
ungroup()
#> # A tibble: 3 x 4
#> ID name value abs_distance_to_zero
#> <dbl> <chr> <dbl> <dbl>
#> 1 1 timeC 1 1
#> 2 2 timeC -5 5
#> 3 3 timeC 2 2
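An equivalent formulation (a sketch, assuming dplyr >= 1.0): slice_min() on the absolute value, where na_rm = TRUE drops the NAs and with_ties = FALSE guarantees a single row per ID.
df %>%
pivot_longer(!ID, values_to = "value") %>%
group_by(ID) %>%
slice_min(abs(value), with_ties = FALSE, na_rm = TRUE) %>%
ungroup()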
I have the following dataset:
ID, diff
1 -40
1 -21
1 -5
1 1
1 6
1 7
...
The ID variable has values 1, 2, 3, 4, 5, ... while diff is a numeric variable. Now, from the dataset, for each ID I want to extract the row with a diff that is closest to zero AND negative, i.e. the row with the highest negative value of diff. In the dataset above, for ID 1 I want to extract row 3, with values (1, -5).
The following code can extract rows where the absolute value is closest to 0:
library(dplyr)
dataset22 = dataset1 %>% group_by(ID) %>% slice(which.min(abs(diff)))
How can I extract the row with a negative number that is closest to zero?
Thanks in advance!
This works:
library(dplyr)
df <- data.frame(ID = c(1, 1, 1, 1, 1, 1),
diff = c(-40, -21, -5, 1, 6, 7))
df %>%
group_by(ID) %>%
filter(diff < 0) %>%
summarise(min_negative_diff = max(diff))
#> # A tibble: 1 x 2
#> ID min_negative_diff
#> <dbl> <dbl>
#> 1 1 -5
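If you need the whole row rather than just the value, slice_max() on the negative subset is a close variant (a sketch; with_ties = FALSE keeps one row per ID):
df %>%
group_by(ID) %>%
filter(diff < 0) %>%
slice_max(diff, with_ties = FALSE)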