library(dplyr)
mydat1 <- data.frame(ID = c(1, 1, 2, 2),
Gender = c("Male", "Female", "Male", "Male"),
Score = c(30, 40, 20, 60))
mydat1 %>%
group_by(ID, Gender) %>%
slice(which.min(Score))
# A tibble: 3 x 3
# Groups: ID, Gender [3]
ID Gender Score
<dbl> <fctr> <dbl>
1 1 Female 40
2 1 Male 30
3 2 Male 20
I'm trying to group the rows by ID and Gender, and then keep only the row with the lowest Score in each group. The above code works as intended: for ID == 2, only the entry with the lower score is kept.
mydat2 <- data.frame(ID = c(1, 1, 2, 2),
Gender = c("Male", "Female", "Male", "Male"),
Score = c(NA, NA, 20, 60))
mydat2 %>%
group_by(ID, Gender) %>%
slice(which.min(Score))
# A tibble: 1 x 3
# Groups: ID, Gender [1]
ID Gender Score
<dbl> <fctr> <dbl>
1 2 Male 20
However, when a group contains only NAs, which.min doesn't work the way I want because it doesn't return a valid index, so all of my ID == 1 rows are dropped. My desired output in this scenario is:
# A tibble: 3 x 3
# Groups: ID, Gender [3]
ID Gender Score
<dbl> <fctr> <dbl>
1 1 Female NA
2 1 Male NA
3 2 Male 20
How can I modify my code to account for this?
Edit:
df2 <- structure(list(pubmed_id = c(23091106L, 23091106L), Gender = structure(c(4L,
4L), .Label = c("", "Both", "female", "Female", "Male"), class = "factor"),
Total_Carrier = c(NA, 1107)), class = c("grouped_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -2L), vars = "pubmed_id", drop = TRUE, indices = list(
0:1), group_sizes = 2L, biggest_group_size = 2L, labels = structure(list(
pubmed_id = 23091106L), class = "data.frame", row.names = c(NA,
-1L), vars = "pubmed_id", drop = TRUE, .Names = "pubmed_id"), .Names = c("pubmed_id",
"Gender", "Total_Carrier"))
> df2
# A tibble: 2 x 3
# Groups: pubmed_id [1]
pubmed_id Gender Total_Carrier
<int> <fctr> <dbl>
1 23091106 Female NA
2 23091106 Female 1107
In this example, I want the output to contain only row 2 (i.e. the row with a carrier sample size of 1107). However, I get the following result:
> df2 %>%
group_by(pubmed_id, Gender) %>%
slice(which.min(Total_Carrier) || 1)
# A tibble: 1 x 3
# Groups: pubmed_id, Gender [1]
pubmed_id Gender Total_Carrier
<int> <fctr> <dbl>
1 23091106 Female NA
which.min ignores missing values and returns integer(0) when the input vector contains only NAs, and slicing with an empty index drops the group. You can add a condition check inside slice, i.e. when all Scores in a group are NA, pick the first row:
mydat2 %>%
group_by(ID, Gender) %>%
slice({idx <- which.min(Score); if(length(idx) > 0) idx else 1})
# A tibble: 3 x 3
# Groups: ID, Gender [3]
# ID Gender Score
# <dbl> <fctr> <dbl>
#1 1 Female NA
#2 1 Male NA
#3 2 Male 20
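For illustration (toy vectors of my own, not taken from the question), this is the which.min behaviour that causes groups to disappear:
which.min(c(20, 60))
# [1] 1
which.min(c(NA, NA))
# integer(0)
slice() called with a zero-length index returns zero rows, which is why the all-NA groups are dropped.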
You could also use arrange to sort your scores within your groups, and then slice to select the first row of each group. That way, if there are only NAs in the group, you would still select the first row:
mydat2 %>%
group_by(ID, Gender) %>%
arrange(ID,Gender,Score) %>%
slice(1)
ID Gender Score
<dbl> <fctr> <dbl>
1 1 Female NA
2 1 Male NA
3 2 Male 20
Here is another option with which and pmin. Taking pmin against n() keeps the index of the minimum when one exists and falls back to a valid row index when the whole group is NA:
mydat2 %>%
  group_by(ID, Gender) %>%
  slice(pmin(n(), which(Score == min(Score, na.rm = TRUE))[1], na.rm = TRUE))
# A tibble: 3 x 3
# Groups: ID, Gender [3]
# ID Gender Score
# <dbl> <fctr> <dbl>
#1 1 Female NA
#2 1 Male NA
#3 2 Male 20
A solution using data.table. sort() drops NAs by default, so sort(Score)[1] returns the group minimum when at least one Score is present and NA when the whole group is missing:
library(data.table)
setDT(mydat2)
mydat2[, .(Score = sort(Score)[1]), by = .(ID, Gender)]
# ID Gender Score
# 1: 1 Male NA
# 2: 1 Female NA
# 3: 2 Male 20
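On dplyr 1.0.0 or later, slice_min() is another route. This is a sketch under the assumption that na_rm is left at its default FALSE, so a group consisting only of NAs still contributes its NA row instead of being dropped:
mydat2 %>%
  group_by(ID, Gender) %>%
  slice_min(Score, n = 1, with_ties = FALSE)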
Newbie R question:
Say I have a data frame with 3 columns: id, date, and value.
How do I capture, for each id, whether it has values that differ, but only if the dates also differ?
For example (below), id 1 would be a miss (different values but the same date), id 2 would be a hit (different values on different dates), and id 3 would be a miss since the values don't differ.
id date value
1 1/1/2000 A
1 1/1/2000 B
2 1/1/2000 A
2 1/1/1999 B
3 1/1/2000 A
3 1/1/1999 A
After grouping by 'id', check whether there is more than one unique 'date' and more than one unique 'value', and pass those conditions to filter:
library(dplyr)
df1 %>%
group_by(id) %>%
filter(n_distinct(date) > 1, n_distinct(value) > 1)
-output
# A tibble: 2 x 3
# Groups: id [1]
# id date value
# <int> <chr> <chr>
#1 2 1/1/2000 A
#2 2 1/1/1999 B
Or with anyDuplicated
df1 %>%
group_by(id) %>%
filter(!anyDuplicated(date), !anyDuplicated(value))
# A tibble: 2 x 3
# Groups: id [1]
# id date value
# <int> <chr> <chr>
#1 2 1/1/2000 A
#2 2 1/1/1999 B
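A side note of my own (not part of the original answers): with more than two rows per id the two filters can disagree. A hypothetical id with three rows, two of which share a date, passes the n_distinct() check but fails the anyDuplicated() one:
df_check <- data.frame(id = 4L,
                       date = c("1/1/2000", "1/1/2000", "1/1/1999"),
                       value = c("A", "B", "C"))
n_distinct(df_check$date) > 1    # TRUE
!anyDuplicated(df_check$date)    # FALSE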
data
df1 <- structure(list(id = c(1L, 1L, 2L, 2L, 3L, 3L), date = c("1/1/2000",
"1/1/2000", "1/1/2000", "1/1/1999", "1/1/2000", "1/1/1999"),
value = c("A", "B", "A", "B", "A", "A")),
class = "data.frame", row.names = c(NA,
-6L))
I have data that is set up like the following - the CODE variable is character and needs to remain as it is because the numbers have meaning.
ID CODE
1 1.0
1 0.00
1 9.99
2 40.56
3 33.54
3 0.00
How would I use pivot_wider to rearrange it like the following, where I have 4 CODE columns and, if an ID doesn't have a fourth code, the cell is just left blank?
ID CODE_1 CODE_2 CODE_3 CODE_4
1 1.0 0.00 9.99 "."
2 40.56 "." "." "."
3 33.54 0.00 "." "."
Thank you!
This approach gets close to what you want. You can use the tidyr function complete() to fill in the levels that are not present in your original values. Here is the code:
library(tidyverse)
# Code
df <- df %>%
  group_by(ID) %>%
  mutate(Var = factor(paste0('CODE_', row_number()),
                      levels = paste0('CODE_', 1:4),
                      labels = paste0('CODE_', 1:4),
                      ordered = TRUE, exclude = FALSE)) %>%
  complete(Var = Var) %>%
  pivot_wider(names_from = Var, values_from = CODE)
Output:
# A tibble: 3 x 5
# Groups: ID [3]
ID CODE_1 CODE_2 CODE_3 CODE_4
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0 9.99 NA
2 2 40.6 NA NA NA
3 3 33.5 0 NA NA
Some data used:
#Data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), CODE = c(1, 0,
9.99, 40.56, 33.54, 0)), class = "data.frame", row.names = c(NA,
-6L))
If you really want dots for the missing values, you have to convert the variables to character and then replace the NAs, like this:
# Code 2
df <- df %>%
  group_by(ID) %>%
  mutate(Var = factor(paste0('CODE_', row_number()),
                      levels = paste0('CODE_', 1:4),
                      labels = paste0('CODE_', 1:4),
                      ordered = TRUE, exclude = FALSE)) %>%
  complete(Var = Var) %>%
  pivot_wider(names_from = Var, values_from = CODE) %>%
  mutate(across(CODE_1:CODE_4, ~ as.character(.))) %>%
  replace(is.na(.), '.')
Output:
# A tibble: 3 x 5
# Groups: ID [3]
ID CODE_1 CODE_2 CODE_3 CODE_4
<int> <chr> <chr> <chr> <chr>
1 1 1 0 9.99 .
2 2 40.56 . . .
3 3 33.54 0 . .
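A slightly more targeted variant (my own sketch): instead of converting every column and calling replace() on the whole frame, tidyr's replace_na() can be restricted to the CODE_ columns of the wide result produced by complete()/pivot_wider():
df %>%
  mutate(across(starts_with('CODE_'), ~ replace_na(as.character(.x), '.')))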
We can use dcast from data.table
library(data.table)
dcast(setDT(df), ID ~ paste0("CODE_", rowid(ID)), value.var = 'CODE')
# ID CODE_1 CODE_2 CODE_3
#1: 1 1.00 0 9.99
#2: 2 40.56 NA NA
#3: 3 33.54 0 NA
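Note that dcast only creates columns for the code positions that actually occur (CODE_1 to CODE_3 here). If a fixed set of four columns is required, one option (my own sketch, not part of the answer) is to add the missing ones afterwards:
wide <- dcast(setDT(df), ID ~ paste0("CODE_", rowid(ID)), value.var = 'CODE')
missing_cols <- setdiff(paste0("CODE_", 1:4), names(wide))
if (length(missing_cols)) wide[, (missing_cols) := NA_real_]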
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L), CODE = c(1, 0,
9.99, 40.56, 33.54, 0)), class = "data.frame", row.names = c(NA,
-6L))
I have the following data:
ID cancer cancer_date stroke stroke_date diabetes diabetes_date
1 1 Feb2017 0 Jan2015 1 Jun2015
2 0 Feb2014 1 Jan2015 1 Jun2015
I would like to get
ID condition date
1 cancer xx
1 diabetes xx
2 stroke xx
2 diabetes xx
I tried reshape and gather, but they did not do what I want. Any ideas how I can do this?
This should do it. The key to making it work easily is to rename cancer, stroke and diabetes to cancer_val, stroke_val and diabetes_val; then pivot_longer() from tidyr can do the work.
library(tidyr)
library(dplyr)
dat <- tibble::tribble(
~ID, ~cancer, ~cancer_date, ~stroke, ~stroke_date, ~diabetes, ~diabetes_date,
1, 1, "Feb2017", 0, "Jan2015", 1, "Jun2015",
2, 0, "Feb2014", 1, "Jan2015", 1, "Jun2015")
dat %>%
rename("cancer_val" = "cancer",
"stroke_val" = "stroke",
"diabetes_val" = "diabetes") %>%
pivot_longer(cols=-ID,
names_to = c("diagnosis", ".value"),
names_pattern="(.*)_(.*)") %>%
filter(val == 1)
# # A tibble: 4 x 4
# ID diagnosis val date
# <dbl> <chr> <dbl> <chr>
# 1 1 cancer 1 Feb2017
# 2 1 diabetes 1 Jun2015
# 3 2 stroke 1 Jan2015
# 4 2 diabetes 1 Jun2015
library(data.table)
data <- data.table(ID = c(1, 2), cancer = c(1, 0), cancer_date = c("Feb2017", "Feb2014"), stroke = c(0, 1), stroke_date = c("Jan2015", "Jan2015"), diabetes = c(1, 1), diabetes_date = c("Jun2015", "Jun2015"))
datawide <-
  melt(data, id.vars = c("ID", "cancer", "stroke", "diabetes"),
       measure.vars = c("cancer_date", "stroke_date", "diabetes_date"))
# keep only the date rows whose corresponding indicator equals 1
datawide[(cancer == 1 & variable == "cancer_date") |
         (stroke == 1 & variable == "stroke_date") |
         (diabetes == 1 & variable == "diabetes_date"),
         .(ID, condition = variable, date = value)]
Try this solution using pivot_longer() and a flag variable to filter for the desired conditions. After pivoting you can drop the zero values and keep only the rows where the condition matches its date column. Here is the code:
library(tidyverse)
#Code
df2 <- df %>%
  pivot_longer(cols = -c(ID, contains('_'))) %>%
  filter(value != 0) %>%
  rename(condition = name) %>%
  select(-value) %>%
  pivot_longer(-c(ID, condition)) %>%
  separate(name, c('v1', 'v2'), sep = '_') %>%
  mutate(Flag = ifelse(condition == v1, 1, 0)) %>%
  filter(Flag == 1) %>%
  select(-c(v1, v2, Flag)) %>%
  rename(date = value)
Output:
# A tibble: 4 x 3
ID condition date
<int> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
Some data used:
#Data
df <- structure(list(ID = 1:2, cancer = 1:0, cancer_date = c("Feb2017",
"Feb2014"), stroke = 0:1, stroke_date = c("Jan2015", "Jan2015"
), diabetes = c(1L, 1L), diabetes_date = c("Jun2015", "Jun2015"
)), class = "data.frame", row.names = c(NA, -2L))
If the first option seems too complex, here is another choice:
# Code 2
df2 <- df %>%
  mutate(across(everything(), ~ as.character(.))) %>%
  pivot_longer(cols = -c(ID)) %>%
  separate(name, c('condition', 'v2'), sep = '_') %>%
  replace(is.na(.), 'val') %>%
  pivot_wider(names_from = v2, values_from = value) %>%
  filter(val == 1) %>%
  select(-val)
Output:
# A tibble: 4 x 3
ID condition date
<chr> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
I have a data set of the following:
Id Val1 Val2
ID1 3 12
ID1 4 NA
ID1 -2 NA
ID1 4 33
ID2 4 NA
I want to replace the NA with Val1+Val2 from the previous row if the Id is the same. The following is the ideal output:
Id Val1 Val2
ID1 3 12
ID1 4 15
ID1 -2 19
ID1 4 33
ID2 4 NA
I have a very big dataset. I personally don't like for loops in R and am looking for an elegant vectorized solution.
Here is one option. Group by 'Id' and by a grouping variable created by taking the cumulative sum of a logical vector (TRUE where 'Val2' is not missing). Within each group, add (+) the first element of 'Val2' to the cumulative sum of 'Val1', take the lag of that, then ungroup and remove the temporary 'grp' column:
library(dplyr)
df1 %>%
group_by(Id, grp = cumsum(!is.na(Val2))) %>%
mutate(Val2 = lag(first(Val2) + cumsum(Val1), default = first(Val2))) %>%
ungroup %>%
select(-grp)
# A tibble: 5 x 3
# Id Val1 Val2
# <fct> <dbl> <dbl>
#1 ID1 3 12
#2 ID1 4 15
#3 ID1 -2 19
#4 ID1 4 33
#5 ID2 4 NA
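To see the intermediate grouping variable (my own illustration, run on the df1 shown under data below):
df1 %>% mutate(grp = cumsum(!is.na(Val2)))
#   Id Val1 Val2 grp
# 1 ID1    3   12   1
# 2 ID1    4   NA   1
# 3 ID1   -2   NA   1
# 4 ID1    4   33   2
# 5 ID2    4   NA   2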
data
df1 <- structure(list(Id = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("ID1",
"ID2"), class = "factor"), Val1 = c(3, 4, -2, 4, 4), Val2 = c(12,
NA, NA, 33, NA)), class = "data.frame", row.names = c(NA, -5L
))
I have a data frame with duplicated IDs. An ID stands for a specific entity. The IDs are duplicated because the dataset refers to a process that every entity can go through multiple times.
Here is a small example dat:
library(dplyr)
glimpse(dat)
Observations: 6
Variables: 3
$ ID <dbl> 1, 1, 1, 2, 2, 2
$ Amount <dbl> 10, 70, 80, 50, 10, 10
$ Product <fct> A, B, C, B, E, A
ID stands for the entity, Amount stands for the amount of money the entity has spent, and Product stands for the good the entity bought.
The issue is that I have to "condense" this data so that every ID / entity occurs only once. For the continuous variable this is not a problem, because I can simply calculate the mean per ID.
library(tidyr)
dat_con_ID <- dat %>%
select(ID) %>%
unique()
dat_con_Amount <- dat %>%
group_by(ID) %>%
summarise(Amount = mean(Amount))
dat_con <- inner_join(dat_con_ID, dat_con_Amount, by = "ID")
glimpse(dat_con)
Observations: 2
Variables: 2
$ ID <dbl> 1, 2
$ Amount <dbl> 53.33333, 23.33333
The problem is that I can't calculate the mean of Product because it's a categorical variable. An option would be to make dummy variables out of this factor and calculate their means, but since the original data frame is really huge this is not a good solution. Any idea how to handle this problem?
Maybe you are trying to do this:
I am using the data.table library. I also modified your data by adding one extra row for ID = 1, so that you can see the difference in the output.
Data:
library('data.table')
dat <- data.table(ID =as.double(c(1, 1, 1, 2, 2, 2,1)),
Amount = as.double(c( 10, 70, 80, 50, 10, 10, 20)),
Product = factor( c('A', 'B', 'C', 'B', 'E', 'A', 'A')))
Code:
# average amount per id
dat[, .(avg_amt = mean(Amount)), by = .(ID) ]
# ID avg_amt
# 1: 1 45.00000
# 2: 2 23.33333
# average product per id
dat[, .SD[, .N, by = Product ][, .( avg_pdt = N/sum(N), Product)], by = .(ID) ]
# ID avg_pdt Product
# 1: 1 0.5000000 A
# 2: 1 0.2500000 B
# 3: 1 0.2500000 C
# 4: 2 0.3333333 B
# 5: 2 0.3333333 E
# 6: 2 0.3333333 A
# combining average amount and average product per id
dat[, .SD[, .N, by = Product ][, .( Product,
avg_pdt = N/sum(N),
avg_amt = mean(Amount))],
by = .(ID) ]
# ID Product avg_pdt avg_amt
# 1: 1 A 0.5000000 45.00000
# 2: 1 B 0.2500000 45.00000
# 3: 1 C 0.2500000 45.00000
# 4: 2 B 0.3333333 23.33333
# 5: 2 E 0.3333333 23.33333
# 6: 2 A 0.3333333 23.33333
Edit:
Another idea would be to count 'Product' per 'ID', calculating the mean of 'Amount' and the relative frequency of each product, and then spread the data by 'Product' to end up in wide format. That way, every ID / entity occurs only once.
dat %>%
add_count(Product, ID) %>%
group_by(ID) %>%
mutate(Amount = mean(Amount),
n = n / n()) %>%
unique() %>%
spread(Product, n, sep = "_") %>%
ungroup()
# A tibble: 2 x 6
# ID Amount Product_A Product_B Product_C Product_E
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1. 45.0 0.500 0.250 0.250 NA
#2 2. 23.3 0.333 0.333 NA 0.333
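spread() still works but is superseded in current tidyr; a roughly equivalent pivot_wider() version (my own sketch, not part of the original answer) would be:
dat %>%
  add_count(Product, ID) %>%
  group_by(ID) %>%
  mutate(Amount = mean(Amount),
         n = n / n()) %>%
  distinct() %>%
  pivot_wider(names_from = Product, values_from = n, names_prefix = "Product_") %>%
  ungroup()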
My first attempt, not what the OP was looking for, but in case someone is interested:
As suggested by @steveb in the comments, you could summarise Product as a string.
library(dplyr)
dat %>%
group_by(ID) %>%
summarise(Amount = mean(Amount),
Product = toString( sort(unique(Product)))
)
# A tibble: 2 x 3
# ID Amount Product
# <dbl> <dbl> <chr>
#1 1. 45.0 A, B, C
#2 2. 23.3 A, B, E
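toString() here just collapses the sorted unique factor levels into one comma-separated string, e.g. (toy input of my own):
toString(sort(unique(factor(c("B", "A", "A")))))
# [1] "A, B"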
data
dat <- structure(list(ID = c(1, 1, 1, 2, 2, 2, 1), Amount = c(10, 70,
80, 50, 10, 10, 20), Product = structure(c(1L, 2L, 3L, 2L, 4L,
1L, 1L), .Label = c("A", "B", "C", "E"), class = "factor")), .Names = c("ID",
"Amount", "Product"), row.names = c(NA, -7L), class = c("tbl_df",
"tbl", "data.frame"))