How can I subtract one group's values from all values using group_by on a tibble?
Below is an example with the expected result. I wish to subtract the values of category "A" from all the values.
d <- tibble(categories = c(rep("A", 3), rep("B", 3), rep("C", 3)),
            values = 1:9)
# expected outcome
d <- tibble(categories = c(rep("A", 3), rep("B", 3), rep("C", 3)),
            values = c(0, 0, 0, 3, 3, 3, 6, 6, 6))
If the categories are all the same size, we could do:
library(dplyr)
d %>%
mutate(values = values - d$values[d$categories == "A"])
-output
# A tibble: 9 × 2
categories values
<chr> <int>
1 A 0
2 A 0
3 A 0
4 B 3
5 B 3
6 B 3
7 C 6
8 C 6
9 C 6
You can do:
library(tidyverse)
d %>%
group_by(categories) %>%
mutate(id = row_number()) %>%
ungroup() %>%
pivot_wider(names_from = 'categories',
values_from = 'values') %>%
mutate(across(-id, ~ . - A)) %>%
pivot_longer(cols = -id,
names_to = 'categories',
values_to = 'values',
cols_vary = 'slowest') %>%
select(-id)
Alternatively:
d %>%
group_by(categories) %>%
mutate(id = row_number()) %>%
ungroup() %>%
mutate(values = values - values[categories == 'A'][id]) %>%
select(-id)
# A tibble: 9 x 2
categories values
<chr> <int>
1 A 0
2 A 0
3 A 0
4 B 3
5 B 3
6 B 3
7 C 6
8 C 6
9 C 6
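A further option, sketched as a join and assuming d still holds the original data (names like ref and ref_value below are just illustrative helpers, not from the question): pair each row with the "A" value in the same position within its group, which also works when the groups are not stored in contiguous blocks.
library(dplyr)

# Reference values of category "A", indexed by their position within the group
ref <- d %>%
  filter(categories == "A") %>%
  transmute(id = row_number(), ref_value = values)

d %>%
  group_by(categories) %>%
  mutate(id = row_number()) %>%   # position within each category
  ungroup() %>%
  left_join(ref, by = "id") %>%
  mutate(values = values - ref_value) %>%
  select(-id, -ref_value)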
I have a simple test dataset that has many repeating rows for participants. I want one row per participant that doesn't have NAs, unless the participant has NAs for the entire column. I tried grouping by participant name and then using coalesce(.) and fill(.), but it still leaves missing values. Here's my test dataset:
library(dplyr)
library(tibble)
test_dataset <- tibble(name = rep(c("Justin", "Corey", "Sibley"), 4),
                       var1 = c(rep(NA, 10), 2, 3),
                       var2 = c(rep(NA, 9), 2, 4, 6),
                       var3 = c(10, 15, 7, rep(NA, 9)),
                       outcome = c(3, 9, 23, rep(NA, 9)),
                       tenure = rep(c(10, 15, 20), 4))
And here's what I get when I use coalesce(.) or fill(., direction = "downup"), which both produce the same result.
library(dplyr)
library(tibble)
library(tidyr)
test_dataset_coalesced <- test_dataset %>%
group_by(name) %>%
coalesce(.) %>%
slice_head(n=1) %>%
ungroup()
test_dataset_filled <- test_dataset %>%
group_by(name) %>%
fill(., .direction="downup") %>%
slice_head(n=1) %>%
ungroup()
And here's what I want. Note that there is one NA because that participant only has NAs for that column:
library(tibble)
correct <- tibble(name = c("Justin", "Corey", "Sibley"),
                  var1 = c(NA, 2, 3),
                  var2 = c(2, 4, 6),
                  var3 = c(10, 15, 7),
                  outcome = c(3, 9, 23),
                  tenure = c(10, 15, 20))
You can group_by the name column, then fill the NAs with the non-NA values within the group (you need to fill every column, hence everything()), and then keep only the distinct rows.
library(tidyverse)
test_dataset %>%
group_by(name) %>%
fill(everything(), .direction = "downup") %>%
distinct()
# A tibble: 3 × 6
# Groups: name [3]
name var1 var2 var3 outcome tenure
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin NA 2 10 3 10
2 Corey 2 4 15 9 15
3 Sibley 3 6 7 23 20
Try this:
cleaned <- test_dataset |>
  dplyr::group_by(name) |>
  tidyr::fill(everything(), .direction = "downup") |>
  unique()
# To drop rows that are NA in every column except name
cleaned[rowSums(is.na(cleaned[, -1])) < ncol(cleaned[, -1]), ]
name var1 var2 var3 outcome tenure
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Justin NA 2 10 3 10
2 Corey 2 4 15 9 15
3 Sibley 3 6 7 23 20
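Another option, as a minimal sketch assuming the goal is simply "first non-NA value per column per participant": collapse each group with summarise() instead of filling and de-duplicating; x[!is.na(x)][1] falls back to NA when a participant has only NAs in that column.
library(dplyr)

test_dataset %>%
  group_by(name) %>%
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))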
I have a table like the following:
A, B, C
1, Yes, 3
1, No, 2
2, Yes, 4
2, No, 6
etc
I want to convert it to:
A, Yes, No
1, 3, 2
2, 4, 6
I have tried using:
dat <- dat %>%
spread(B, C) %>%
group_by(A)
However, now I have a bunch of NA values. Is it possible to use pivot_longer to do this instead?
We can use pivot_wider
library(tidyr)
pivot_wider(dat, names_from = B, values_from = C)
-output
# A tibble: 2 x 3
# A Yes No
# <dbl> <dbl> <dbl>
#1 1 3 2
#2 2 4 6
If there are duplicate rows, an option is to first create a sequence index by the B column:
library(data.table)
library(dplyr)
dat1 <- bind_rows(dat, dat) # // example with duplicates
dat1 %>%
mutate(rn = rowid(B)) %>%
pivot_wider(names_from = B, values_from = C) %>%
select(-rn)
-output
# A tibble: 4 x 3
# A Yes No
# <dbl> <dbl> <dbl>
#1 1 3 2
#2 2 4 6
#3 1 3 2
#4 2 4 6
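If duplicated A/B combinations should instead be aggregated into a single row, pivot_wider() also has a values_fn argument; a small sketch, assuming summing is the desired aggregation:
library(tidyr)
# Collapse duplicate A/B combinations during the reshape itself
pivot_wider(dat1, names_from = B, values_from = C, values_fn = sum)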
data
dat <- structure(list(A = c(1, 1, 2, 2), B = c("Yes", "No", "Yes", "No"
), C = c(3, 2, 4, 6)), class = "data.frame", row.names = c(NA,
-4L))
I have the following data:
ID cancer cancer_date stroke stroke_date diabetes diabetes_date
1 1 Feb2017 0 Jan2015 1 Jun2015
2 0 Feb2014 1 Jan2015 1 Jun2015
I would like to get
ID condition date
1 cancer xx
1 diabetes xx
2 stroke xx
2 diabetes xx
I tried reshape and gather, but they did not do what I want. Any ideas how I can do this?
This should do it. The key to making it work easily is to rename cancer, stroke and diabetes to cancer_val, stroke_val and diabetes_val; then you can use pivot_longer() from tidyr to do the work.
library(tidyr)
library(dplyr)
dat <- tibble::tribble(
~ID, ~cancer, ~cancer_date, ~stroke, ~stroke_date, ~diabetes, ~diabetes_date,
1, 1, "Feb2017", 0, "Jan2015", 1, "Jun2015",
2, 0, "Feb2014", 1, "Jan2015", 1, "Jun2015")
dat %>%
rename("cancer_val" = "cancer",
"stroke_val" = "stroke",
"diabetes_val" = "diabetes") %>%
pivot_longer(cols=-ID,
names_to = c("diagnosis", ".value"),
names_pattern="(.*)_(.*)") %>%
filter(val == 1)
# # A tibble: 4 x 4
# ID diagnosis val date
# <dbl> <chr> <dbl> <chr>
# 1 1 cancer 1 Feb2017
# 2 1 diabetes 1 Jun2015
# 3 2 stroke 1 Jan2015
# 4 2 diabetes 1 Jun2015
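As a hedged variant of the rename step, rename_with() can append the "_val" suffix programmatically instead of listing each column by hand; a sketch using the same dat as above:
library(dplyr)
library(tidyr)

dat %>%
  rename_with(~ paste0(.x, "_val"), c(cancer, stroke, diabetes)) %>%
  pivot_longer(-ID, names_to = c("diagnosis", ".value"), names_sep = "_") %>%
  filter(val == 1) %>%
  select(-val)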
library(data.table)
data <- data.table(ID = c(1, 2), cancer = c(1, 0), cancer_date = c("Feb2017", "Feb2014"), stroke = c(0, 1), stroke_date = c("Jan2015", "Jan2015"), diabetes = c(1, 1), diabetes_date = c("Jun2015", "Jun2015"))
data_long <-
  melt(data, id.vars = c("ID", "cancer", "stroke", "diabetes"),
       measure.vars = c("cancer_date", "stroke_date", "diabetes_date"))
data_long[(cancer == 1 & variable == "cancer_date") |
          (stroke == 1 & variable == "stroke_date") |
          (diabetes == 1 & variable == "diabetes_date"),
          .(ID, condition = variable, date = value)]
Try this solution using pivot_longer() and a flag variable to filter the desired rows. After pivoting you can drop the zero values and keep only the rows where the condition matches its date column. Here is the code:
library(tidyverse)
#Code
df2 <- df %>% pivot_longer(cols = -c(ID,contains('_'))) %>%
filter(value!=0) %>% rename(condition=name) %>% select(-value) %>%
pivot_longer(-c(ID,condition)) %>%
separate(name,c('v1','v2'),sep='_') %>%
mutate(Flag=ifelse(condition==v1,1,0)) %>%
filter(Flag==1) %>% select(-c(v1,v2,Flag)) %>%
rename(date=value)
Output:
# A tibble: 4 x 3
ID condition date
<int> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
Some data used:
#Data
df <- structure(list(ID = 1:2, cancer = 1:0, cancer_date = c("Feb2017",
"Feb2014"), stroke = 0:1, stroke_date = c("Jan2015", "Jan2015"
), diabetes = c(1L, 1L), diabetes_date = c("Jun2015", "Jun2015"
)), class = "data.frame", row.names = c(NA, -2L))
If the first option looks complex, here is another choice:
#Code 2
df2 <- df %>% mutate(across(everything(),~as.character(.))) %>%
pivot_longer(cols = -c(ID)) %>%
separate(name,c('condition','v2'),sep = '_') %>%
replace(is.na(.),'val') %>%
pivot_wider(names_from = v2,values_from=value) %>%
filter(val==1) %>% select(-val)
Output:
# A tibble: 4 x 3
ID condition date
<chr> <chr> <chr>
1 1 cancer Feb2017
2 1 diabetes Jun2015
3 2 stroke Jan2015
4 2 diabetes Jun2015
My data frame looks like this:
id A T C G ref var
1 1 10 15 7 0 A C
2 2 11 9 2 3 A G
3 3 2 31 1 12 T C
I'd like to create two new columns, ref_count and var_count, which will have the following values:
Value from A column and value from C column, since ref is A and var is C
Value from A column and value from G column, since ref is A and var is G
etc.
So I'd like to select a column based on the value in another column for each row.
Thanks!
We can use pivot_longer to reshape into 'long' format, filter the rows and then reshape it to 'wide' format with pivot_wider
library(dplyr)
library(tidyr)
df1 %>%
pivot_longer(cols = A:G) %>%
group_by(id) %>%
filter(name == ref|name == var) %>%
mutate(nm1 = c('ref_count', 'var_count')) %>%
ungroup %>%
select(id, value, nm1) %>%
pivot_wider(names_from = nm1, values_from = value) %>%
left_join(df1, .)
# A tibble: 3 x 9
# id A T C G ref var ref_count var_count
#* <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#1 1 10 15 7 0 A C 10 7
#2 2 11 9 2 3 A G 11 3
#3 3 2 31 1 12 T C 31 1
Or in base R, we can also make use of the vectorized row/column indexing
df1$refcount <- as.matrix(df1[2:5])[cbind(seq_len(nrow(df1)), match(df1$ref, names(df1)[2:5]))]
df1$var_count <- as.matrix(df1[2:5])[cbind(seq_len(nrow(df1)), match(df1$var, names(df1)[2:5]))]
data
df1 <- structure(list(id = 1:3, A = c(10, 11, 2), T = c(15, 9, 31),
C = c(7, 2, 1), G = c(0, 3, 12), ref = c("A", "A", "T"),
var = c("C", "G", "C")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
The following is a tidyverse alternative without creating a long dataframe that needs filtering. It essentially uses tidyr::nest() to nest the dataframe by rows, after which the correct column can be selected for each row.
df1 %>%
nest(data = -id) %>%
mutate(
data = map(
data,
~mutate(., refcount = .[[ref]], var_count = .[[var]])
)
) %>%
unnest(data)
#> # A tibble: 3 × 9
#> id A T C G ref var refcount var_count
#> <int> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 10 15 7 0 A C 10 7
#> 2 2 11 9 2 3 A G 11 3
#> 3 3 2 31 1 12 T C 31 1
A variant of this does not need the (assumed row-specific) id column but defines the nested groups from the unique values of ref and var directly:
df1 %>%
nest(data = -c(ref, var)) %>%
mutate(
data = pmap(
list(data, ref, var),
function(df, ref, var) {
mutate(df, refcount = df[[ref]], var_count = df[[var]])
}
)
) %>%
unnest(data)
The data were specified by akrun:
df1 <- structure(list(id = 1:3, A = c(10, 11, 2), T = c(15, 9, 31),
C = c(7, 2, 1), G = c(0, 3, 12), ref = c("A", "A", "T"),
var = c("C", "G", "C")), row.names = c(NA, -3L), class = c("tbl_df",
"tbl", "data.frame"))
I have a dataframe I've created in the form
FREQ CNT
0 5
1 20
2 1000
3 3
4 3
I want to further group my results to be in the following form:
CUT CNT
0+1 25
2+3 1003
4+5 ...
.....
I've tried using the between and cut functions in dplyr, but they just add a new interval column to my dataframe. Can anyone give me a good indication as to where to go to achieve this?
Here is a way to do it in dplyr:
library(dplyr)
df <- df %>%
mutate(id = 1:n()) %>%
mutate(new_freq = ifelse(id %% 2 != 0, paste0(FREQ, "+", lead(FREQ, 1)), paste0(lag(FREQ, 1), "+", FREQ)))
df <- df %>%
group_by(new_freq) %>%
mutate(new_cnt = sum(CNT))
unique(df[, 4:5])
# A tibble: 2 x 2
# Groups: new_freq [2]
# new_freq new_cnt
# <chr> <int>
#1 0+1 25
#2 2+3 1003
data
df <- structure(list(FREQ = 0:3, CNT = c(5L, 20L, 1000L, 3L)), class = "data.frame", row.names = c(NA, -4L))
A non-elegant solution using dplyr... there is probably a better way to do this.
dat <- data.frame(FREQ = c(0,1,2,3,4), CNT = c(5,20,1000, 3, 3))
dat2 <- dat %>%
mutate(index = 0:(nrow(dat)-1)%/%2) %>%
group_by(index)
dat2 %>%
summarise(new_CNT = sum(CNT)) %>%
left_join(dat2 %>%
mutate(CUT = paste0(FREQ[1], "+", FREQ[2])) %>%
distinct(index, CUT),
by = "index") %>%
select(-index)
# A tibble: 3 x 2
new_CNT CUT
<dbl> <chr>
1 25 0+1
2 1003 2+3
3 3 4+NA
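Since the question mentions cut, here is a hedged sketch of that route (assuming bins of width 2 starting at 0; note the labels come out as intervals like [0,2) rather than "0+1"):
library(dplyr)

dat %>%
  mutate(CUT = cut(FREQ, breaks = seq(0, max(FREQ) + 2, by = 2), right = FALSE)) %>%
  group_by(CUT) %>%
  summarise(CNT = sum(CNT))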