I have a datafile with one row per participant (participants are numbered 1 to x within each study they took part in). I want to check whether all participants are present in the dataset. In this toy dataset, personid identifies the participant and study is the study they took part in.
df <- read.table(text = "personid study measurement
1 x 23
2 x 32
1 y 21
3 y 23
4 y 23
6 y 23", header=TRUE)
which looks like this:
personid study measurement
1 1 x 23
2 2 x 32
3 1 y 21
4 3 y 23
5 4 y 23
6 6 y 23
So for y, I am missing participants 2 and 5. How do I check that automatically? I tried adding a counter variable and comparing it to the participant id, but once one participant is missing the comparison becomes meaningless because the alignment is off.
df %>% group_by(study) %>% mutate(id = 1:n(), check = id == personid)
Source: local data frame [6 x 5]
Groups: study [2]
personid study measurement id check
<int> <fctr> <int> <int> <lgl>
1 1 x 23 1 TRUE
2 2 x 32 2 TRUE
3 1 y 21 1 TRUE
4 3 y 23 2 FALSE
5 4 y 23 3 FALSE
6 6 y 23 4 FALSE
Assuming your personid is sequential, you can do this using setdiff, i.e.
library(dplyr)
df %>%
  group_by(study) %>%
  mutate(new = toString(setdiff(max(personid):min(personid), personid)))
#Source: local data frame [6 x 4]
#Groups: study [2]
# personid study measurement new
# <int> <fctr> <int> <chr>
#1 1 x 23
#2 2 x 32
#3 1 y 21 5, 2
#4 3 y 23 5, 2
#5 4 y 23 5, 2
#6 6 y 23 5, 2
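If you only need one row per study instead of repeating the result on every row, the same idea works with summarise(); a minimal variant of the code above:
df %>%
  group_by(study) %>%
  summarise(missing = toString(setdiff(min(personid):max(personid), personid)))
#  study missing
#1 x     ""
#2 y     "2, 5"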
One approach is to use tidyr::expand() to generate all possible combinations of study and personid and then use anti_join() to remove the combinations that actually appear in the data.
library(dplyr, warn.conflicts = FALSE)
library(tidyr)
df %>%
  expand(study, personid) %>%
  anti_join(df)
#> Joining, by = c("study", "personid")
#> # A tibble: 4 × 2
#> study personid
#> <fctr> <int>
#> 1 y 2
#> 2 x 6
#> 3 x 4
#> 4 x 3
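One caveat: expand() only generates combinations from values observed somewhere in the data, so participant 5, who appears in no study at all, is not flagged above. If you know the IDs should cover the full range, you can supply that range explicitly; a sketch using tidyr::full_seq():
df %>%
  expand(study, personid = full_seq(personid, 1)) %>%
  anti_join(df, by = c("study", "personid"))
#> # A tibble: 6 × 2
#>   study personid
#> 1 x            3
#> 2 x            4
#> 3 x            5
#> 4 x            6
#> 5 y            2
#> 6 y            5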
A simple solution using base R
tapply(df$personid, df$study, function(a) setdiff(min(a):max(a), a))
Output:
$x
integer(0)
$y
[1] 2 5
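If you would rather have that result as a two-column data frame, a small variant of the same approach using split() and stack():
missing <- lapply(split(df$personid, df$study),
                  function(a) setdiff(min(a):max(a), a))
stack(missing)
#   values ind
# 1      2   y
# 2      5   y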
I'm working on creating some error reports, and one of the issues I'm trying to address is potential errors within the ID column id_1. I've made an alternative id column, id_2, from various identifying features within the rows. To help, I've also created a date_lag column on date to catch items that were entered within a specific period after the initial entry. The main problem I'm having is returning the entire group that meets the criteria, including the first entry, which has an NA in date_lag; if I simply allow the NA values through, I get more than just the items I'm looking for (id_1 1 and 2 below).
Example:
# id_1: column where potential errors lie
# id_2: alternative id column I'm using to test
library(data.table)
library(dplyr)

df <- data.table(id_1 = c(1:4, 1:4),
                 id_2 = rep(c("b", "a"), c(2, 2)),
                 date = c(rep(1, 4), rep(20, 2), rep(10, 2)))
df %>%
  group_by(id_2) %>%
  mutate(date_lag = date - lag(date)) %>%
  filter(between(date_lag, 0, 10) | is.na(date_lag))
# A tibble: 7 x 4
# Groups: id_2 [2]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
1 b 1 NA
2 b 1 0
3 a 1 NA
4 a 1 0
2 b 20 0
3 a 10 9
4 a 10 0
Expected:
# A tibble: 4 x 4
# Groups: id_1 [2]
id_1 id_2 date date_lag
<int> <chr> <dbl> <dbl>
3 a 1 NA
4 a 1 NA
3 a 10 9
4 a 10 9
Perhaps we can use diff:
library(dplyr)
df %>%
  group_by(id_1) %>%
  filter(between(diff(date), 0, 10))
Output:
# A tibble: 4 x 3
# Groups: id_1 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 3 a 1
#2 4 a 1
#3 3 a 10
#4 4 a 10
Concatenate an NA, since diff() returns a vector one element shorter than the original data:
df %>%
  group_by(id_2) %>%
  filter(between(c(NA, diff(date)), 0, 10))
# A tibble: 5 x 3
# Groups: id_2 [2]
# id_1 id_2 date
# <int> <chr> <dbl>
#1 2 b 1
#2 4 a 1
#3 2 b 20
#4 3 a 10
#5 4 a 10
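If the goal is to keep every row of a qualifying group, including the first entry whose lag is NA, another option is to compute the lag per id_1 and filter on any(); a sketch, assuming a lag in [0, 10] is what marks a group:
df %>%
  group_by(id_1) %>%
  mutate(date_lag = date - lag(date)) %>%
  filter(any(between(date_lag, 0, 10), na.rm = TRUE)) %>%
  ungroup()
# id_1 id_2  date date_lag
#    3 a        1       NA
#    4 a        1       NA
#    3 a       10        9
#    4 a       10        9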
Apologies if this is a duplicate question; I saw some questions similar to mine, but none exactly addressing my problem.
My data look basically like this:
library(tibble)
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But this actually covers a four-week period. Nothing at all happened in week 47, and groups B and C have no data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.
The name FiscalWeek suggests integer-like data, but your use of a factor suggests ordinal or categorical data. Because of that, you need to define authoritatively what the complete set of levels is. And because your current factor does not contain all possible levels, I'll infer them (you need to adjust all_groups_weeks accordingly):
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. Using tidyr::complete won't work because you don't have all possible values in the data (47 missing).
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
  mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
  mutate(Amount = coalesce(Amount, 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
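For completeness, tidyr::complete() can also get there if you pass the full set of weeks explicitly rather than relying on the observed values; a sketch that converts the factor to numeric first and uses full_seq() to fill in week 47:
library(dplyr)
library(tidyr)

df %>%
  mutate(FiscalWeek = as.numeric(as.character(FiscalWeek))) %>%
  complete(FiscalWeek = full_seq(FiscalWeek, 1), Group,
           fill = list(Amount = 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))
# # A tibble: 3 x 2
# #   Group  Avgs
# #   <chr> <dbl>
# # 1 A      0.75
# # 2 B      1.25
# # 3 C      1.5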
You can try this. I hope this helps.
library(dplyr)
# Define the number of weeks in the period
# (named n_weeks rather than `range`, which would mask base::range)
df <- df %>% mutate(FiscalWeek = as.numeric(as.character(FiscalWeek)))
n_weeks <- length(seq(min(df$FiscalWeek), max(df$FiscalWeek), by = 1))
# Aggregation
averages <- df %>%
  group_by(Group) %>%
  summarize(Avgs = sum(Amount) / n_weeks)
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know number of weeks:
df %>%
  group_by(Group) %>%
  summarise(Avgs = sum(Amount) / length(45:48))
I have a dataframe that looks like this:
df <- data.frame(ID = c(1,2,3,4,5,6),
                 Type = c("A","A","B","B","C","C"),
                 `2019` = c(1,2,3,4,5,6),
                 `2020` = c(2,3,4,5,6,7),
                 `2021` = c(3,4,5,6,7,8))
ID Type X2019 X2020 X2021
1 1 A 1 2 3
2 2 A 2 3 4
3 3 B 3 4 5
4 4 B 4 5 6
5 5 C 5 6 7
6 6 C 6 7 8
Now, I'm looking for some code that does the following:
1. Creates a new data.frame for every row in df
2. Names each new dataframe with a combination of "Type" and "ID" (A_1, A_2, ..., C_6)
The resulting new dataframes should look like this (example for A_1, A_2 and C_6):
Year Values
1 2019 1
2 2020 2
3 2021 3
Year Values
1 2019 2
2 2020 3
3 2021 4
Year Values
1 2019 6
2 2020 7
3 2021 8
A few things complicate the code:
1. The code should work in the next few years without any changes, meaning next year the data.frame df will no longer contain the years 2019-2021, but rather 2020-2022.
2. As the data.frame df is only a minimal reproducible example, I need some kind of loop. In the "real" data, I have a lot more rows and therefore a lot more dataframes to be created.
Unfortunately, I can't give you any code, as I have absolutely no idea how I could manage that.
While researching, I found the following code that may help address the first problem with the changing years:
year <- as.numeric(format(Sys.Date(), "%Y"))
Further, I read about lists, and that it may help to work with a list in a for loop and then transform the list back into a dataframe. Sorry for my limited approach; I hope someone can give me a hint or even the solution to my problem. If you need any further information, please let me know. Thanks in advance!
A kind of similar question to mine:
Populating a data frame in R in a loop
Try this:
library(stringr)
library(dplyr)
library(tidyr)
library(magrittr)
df %>%
  gather(Year, Values, 3:5) %>%
  mutate(Year = str_sub(Year, 2)) %>%
  select(ID, Year, Values) %>%
  group_split(ID) # or split(.$ID)
# [[1]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 1 2019 1
# 2 1 2020 2
# 3 1 2021 3
#
# [[2]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 2 2019 2
# 2 2 2020 3
# 3 2 2021 4
#
# [[3]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 3 2019 3
# 2 3 2020 4
# 3 3 2021 5
#
# [[4]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 4 2019 4
# 2 4 2020 5
# 3 4 2021 6
#
# [[5]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 5 2019 5
# 2 5 2020 6
# 3 5 2021 7
#
# [[6]]
# # A tibble: 3 x 3
# ID Year Values
# <dbl> <chr> <dbl>
# 1 6 2019 6
# 2 6 2020 7
# 3 6 2021 8
Data
df <- data.frame(ID = c(1,2,3,4,5,6), Type = c("A","A","B","B","C","C"), `2019` = c(1,2,3,4,5,6),`2020` = c(2,3,4,5,6,7), `2021` = c(3,4,5,6,7,8))
library(magrittr)
library(tidyr)
library(dplyr)
library(stringr)
names(df) <- str_replace_all(names(df), "X", "") #remove X's from year names
df %>%
  gather(Year, Values, 3:5) %>%
  select(ID, Year, Values) %>%
  group_split(ID)
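Neither snippet names the list elements the way the question asks (A_1, ..., C_6). A sketch of that step, assuming the rows are ordered by ID as in the example; names_prefix strips a leading X when present, so it works whether or not the year columns carry the X prefix:
library(dplyr)
library(tidyr)

out <- df %>%
  pivot_longer(-c(ID, Type), names_to = "Year", values_to = "Values",
               names_prefix = "X") %>%
  select(ID, Year, Values) %>%
  group_split(ID, .keep = FALSE) %>%
  setNames(paste(df$Type, df$ID, sep = "_")) # "A_1", "A_2", ..., "C_6"

out$A_1
# # A tibble: 3 x 2
#   Year  Values
# 1 2019       1
# 2 2020       2
# 3 2021       3

list2env(out, envir = .GlobalEnv) # optional: creates A_1 ... C_6 as objects
Because the year columns are pivoted by exclusion (-c(ID, Type)) rather than by hard-coded names, the same code should keep working when the data contain 2020-2022 instead.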
Example data:
library(tibble)
tibbly <- tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
                 grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
                 grouping2 = c("X","X","X","Y","Y","Y","X","X","X","Y","Y","Y"),
                 value = c(1,2,3,4,4,6,2,5,3,6,3,2))
> tibbly
# A tibble: 12 x 4
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 10 A X 1
2 30 A X 2
3 50 A X 3
4 10 A Y 4
5 30 A Y 4
6 50 A Y 6
7 10 B X 2
8 30 B X 5
9 50 B X 3
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
Question:
How to obtain the order of rows for each group in a dataframe? I can use dplyr to arrange the data in an appropriate form to visualize what I am interested in:
> tibbly %>%
group_by(grouping1, grouping2) %>%
arrange(grouping1, grouping2, desc(value))
# A tibble: 12 x 4
# Groups: grouping1, grouping2 [4]
age grouping1 grouping2 value
<dbl> <chr> <chr> <dbl>
1 50 A X 3
2 30 A X 2
3 10 A X 1
4 50 A Y 6
5 10 A Y 4
6 30 A Y 4
7 30 B X 5
8 50 B X 3
9 10 B X 2
10 10 B Y 6
11 30 B Y 3
12 50 B Y 2
In the end I am interested in the order of the age column, for each group, based on the value column. Is there an elegant way to do this with dplyr? Something like summarise(), but based on the order of rows rather than the actual values.
library(dplyr)
tibbly = tibble(age = c(10,30,50,10,30,50,10,30,50,10,30,50),
grouping1 = c("A","A","A","A","A","A","B","B","B","B","B","B"),
grouping2 = c("X", "X", "X","Y","Y","Y","X","X","X","Y","Y","Y"),
value = c(1,2,3,4,4,6,2,5,3,6,3,2))
tibbly %>%
  group_by(grouping1, grouping2) %>%                 # for each group
  arrange(desc(value)) %>%                           # arrange value descending
  summarise(order = paste0(age, collapse = ",")) %>% # collapse the ages, in order, into a string
  ungroup()                                          # forget the grouping
# # A tibble: 4 x 3
# grouping1 grouping2 order
# <chr> <chr> <chr>
# 1 A X 50,30,10
# 2 A Y 50,10,30
# 3 B X 30,50,10
# 4 B Y 10,30,50
With data.table
library(data.table)
setDT(tibbly)[order(-value), .(order = toString(age)),.(grouping1, grouping2)]
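If you want the within-group positions as numbers rather than as a collapsed string, min_rank() on the descending value gives each age's rank inside its group; a sketch:
tibbly %>%
  group_by(grouping1, grouping2) %>%
  mutate(rank = min_rank(desc(value))) %>% # 1 = largest value in the group
  arrange(grouping1, grouping2, rank)
Note that ties (such as the two values of 4 in group A/Y) receive the same rank.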
Let's say I have a data.frame that looks like this:
Variable X Y Z
A 2 5 3
B 4 3 2
C 5 1 5
B 6 2 4
C 2 5 2
Using dplyr or any other suitable package, I would like to group by every single variable, compare it to the rest of the variables pooled together and compute a mathematical operation between the two resulting groups, let's say the sum along columns. I would get something like this:
Variable X Y Z
A 2 5 3
rest 17 11 13
Variable X Y Z
B 10 5 6
rest 9 11 10
Variable X Y Z
C 7 6 7
rest 12 10 9
I have a large data.frame with hundreds of variables, so I would also like to do it in an iterative way. Any suggestion would be of great help. Thank you very much in advance.
Something along these lines? (You can subset the individual results from the list.)
l <- lapply(unique(df$Variable), function(x)
  rbind(colSums(df[df$Variable == x, c("X", "Y", "Z")]),
        colSums(df[df$Variable != x, c("X", "Y", "Z")])))
#[[1]]
# X Y Z
#[1,] 2 5 3
#[2,] 17 11 13
#[[2]]
# X Y Z
#[1,] 10 5 6
#[2,] 9 11 10
#[[3]]
# X Y Z
#[1,] 7 6 7
#[2,] 12 10 9
names(l) <- LETTERS[1:3]
l <- lapply(l, function(x) {rownames(x) <- c("Variable", "Rest"); x})
list2env(l, .GlobalEnv) # loads each result as a separate object
If you want to go full tidyverse
library(tidyverse)
df <- tibble(Variable = c("A","B","C","B","C"),
X = c(2,4,5,6,2),
Y = c(5,3,1,2,5),
Z = c(3,2,5,4,2))
group_summary <- function(data, var) {
  data %>%
    group_by_(group = ~ if_else(grepl(var, Variable), var, "rest")) %>%
    summarise_each_(funs(sum), ~ -Variable) %>%
    rename_(.dots = setNames("group", "Variable"))
}
map(unique(df$Variable), ~group_summary(df, .x))
[[1]]
# A tibble: 2 × 4
Variable X Y Z
<chr> <dbl> <dbl> <dbl>
1 A 2 5 3
2 rest 17 11 13
[[2]]
# A tibble: 2 × 4
Variable X Y Z
<chr> <dbl> <dbl> <dbl>
1 B 10 5 6
2 rest 9 11 10
[[3]]
# A tibble: 2 × 4
Variable X Y Z
<chr> <dbl> <dbl> <dbl>
1 C 7 6 7
2 rest 12 10 9
If you want a different output than a list, you can explore the different map functions (e.g. map_df) and the use of tibbles.
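The underscore verbs (group_by_(), summarise_each_(), rename_()) used above are deprecated in current dplyr. A sketch of the same idea with today's verbs, assuming the df defined above:
library(dplyr)
library(purrr)

group_summary <- function(data, var) {
  data %>%
    # overwrite Variable: either the target level or "rest"
    group_by(Variable = if_else(Variable == var, var, "rest")) %>%
    summarise(across(c(X, Y, Z), sum), .groups = "drop")
}

map(unique(df$Variable), ~ group_summary(df, .x))
map_dfr() (or map() followed by bind_rows()) collapses the result into a single data frame if you prefer that over a list.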