I am trying to create a new dichotomous variable column (1/0; Yes/No) that is 1 if anyone within a household (marked by a household ID number) has var1 equal to 1 or 2 in one column AND indicated 1 (Yes) in a different column (var3).
This is the code I've tried. I am able to mutate a new column based on individuals but not at the household level. As it stands, new_dichotomous is 1 only if, for an individual, var1 is equal to 1 or 2 and var3 is 1. What I need is: if anyone within the household has var1 equal to 1 or 2 and var3 equal to 1, then new_dichotomous should be 1 for everyone in that household (those with the same householdID).
dataset %>%
group_by(householdID) %>%
mutate(new_dichotomous = if_else(var1 == 1 |
                                 var1 == 2 & any(var3 == 1), 1, 0))
We may use
library(dplyr)
dataset %>%
group_by(householdID) %>%
mutate(new_dichotomous = +(any(c(1, 2) %in% var1) & 1 %in% var3) *
NA^(if_any(c(var1, var2, var3), is.na))) %>%
ungroup
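To unpack what that expression does: `any(c(1, 2) %in% var1)` asks whether 1 or 2 appears anywhere in the group's var1, `1 %in% var3` asks whether anyone in the group has var3 equal to 1, the unary `+` converts the logical to 1/0, and the `NA^if_any(...)` factor turns the result into NA on rows with a missing value. A stripped-down sketch without the NA handling, on invented toy data:

```r
library(dplyr)

# Invented toy data: in household 1, one member has var1 == 2 and another
# has var3 == 1, so every member of that household should get 1;
# household 2 has neither, so all its members get 0.
dataset <- tibble(
  householdID = c(1, 1, 2, 2),
  var1        = c(2, 5, 5, 5),
  var3        = c(0, 1, 0, 0)
)

dataset %>%
  group_by(householdID) %>%
  mutate(new_dichotomous = +(any(var1 %in% c(1, 2)) & any(var3 == 1))) %>%
  ungroup()
# household 1 rows get 1, household 2 rows get 0
```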
Can someone help me understand what the grouping is doing here, please?
Why do these two pipelines produce different grouped outputs? The top one returns ID 1's 'A' row as well as both of ID 6's 'A' rows, while the bottom one returns only ID 6, the one ID where duplicate 'A' rows exist.
Sample Data:
df <- data.frame(ID = c(1,1,3,4,5,6,6),
                 Acronym = c('A','B','A','A','B','A','A'))
df %>%
group_by(ID) %>%
filter(Acronym == 'A',n() > 1)
df %>% filter(Acronym == 'A') %>%
group_by(ID) %>%
filter(n() > 1)
In the first example, rows with Acronym != "A" (such as ID 1's "B" row) are still in the data frame when n() is evaluated, so they contribute to the group's row count.
In the second example, these rows are removed before grouping, so they don't contribute to the row count from n().
If we want the first case to return only 'ID' 6, use sum to get the count of 'A' values in Acronym
library(dplyr)
df %>%
group_by(ID) %>%
filter(sum(Acronym == 'A') > 1)
As mentioned in the other post, it is just that n() is based on the whole group count and not on the number of 'A's. If we are unsure about the filter, create a column with mutate and check the output
df %>%
group_by(ID) %>%
mutate(ind = Acronym == 'A' & n() > 1)
# A tibble: 7 × 3
# Groups:   ID [5]
     ID Acronym ind
  <dbl> <chr>   <lgl>
1     1 A       TRUE
2     1 B       FALSE
3     3 A       FALSE
4     4 A       FALSE
5     5 B       FALSE
6     6 A       TRUE
7     6 A       TRUE
Hi All,
Example: the above is the data I have. I want to group ages 1-2 and count the values; in this data, the count for age group 1-2 is 4. Similarly, I want to group ages 3-4 and count the values; the count for age group 3-4 is 6.
How can I group the ages and aggregate the corresponding values?
This is the approach I know:
data.frame(df %>% group_by(df$Age) %>% tally())
But that tallies each individual age.
I want multiple ages aggregated into one group, as in the example above.
Any help on this will be greatly helpful.
Thanks a lot to All.
Here are two solutions, with base R and with package dplyr.
I will use the data posted by Shree.
First, base R.
I create a grouping variable grp and then aggregate on it.
grp <- with(df, c((age %in% 1:2) + 2*(age %in% 3:4)))
aggregate(age ~ grp, df, length)
# grp age
#1 1 4
#2 2 6
Second a dplyr way.
Function case_when is used to create a grouping variable. This allows for meaningful names to be given to the groups in an easy way.
library(dplyr)
df %>%
mutate(grp = case_when(
age %in% 1:2 ~ "1:2",
age %in% 3:4 ~ "3:4",
TRUE ~ NA_character_
)) %>%
group_by(grp) %>%
tally()
## A tibble: 2 x 2
#  grp       n
#  <chr> <int>
#1 1:2       4
#2 3:4       6
Here's one way using dplyr and ?cut from base R -
df <- data.frame(age = c(1,1,2,2,3,3,3,4,4,4),
Name = letters[1:10],
stringsAsFactors = F)
df %>%
count(grp = cut(age, breaks = c(0,2,4)))
# A tibble: 2 x 2
grp n
<fct> <int>
1 (0,2] 4
2 (2,4] 6
I would like to find the minimum value of a variable (time) at which several other variables equal 1 (or any other value). Basically, my application is finding the first year that x == 1, for several x. I know how to find this for one x, but I would like to avoid generating multiple reduced data frames of minima and then merging them together. Is there an efficient way to do this? Here are my example data and a solution for one variable.
d <- data.frame(cat = c(rep("A",10), rep("B",10)),
time = c(1:10),
var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))
library(plyr)
ddply(d[d$var1 == 1, ], .(cat), summarise,
      start = min(time))
How about this using dplyr
d %>%
group_by(cat) %>%
summarise_at(vars(contains("var")), funs(time[which(. == 1)[1]]))
Which gives
# A tibble: 2 x 3
# cat var1 var2
# <fct> <int> <int>
# 1 A 4 5
# 2 B 7 8
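As a side note, summarise_at() and funs() are superseded in dplyr 1.0+; the same result can be written with across() (a sketch using the question's data, same logic as above):

```r
library(dplyr)

# The question's example data
d <- data.frame(cat = c(rep("A", 10), rep("B", 10)),
                time = c(1:10),
                var1 = c(0,0,0,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1),
                var2 = c(0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1))

# For each var column, take the time of the first row where it equals 1
d %>%
  group_by(cat) %>%
  summarise(across(starts_with("var"), ~ time[which(.x == 1)[1]]))
# cat A: var1 = 4, var2 = 5; cat B: var1 = 7, var2 = 8
```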
We can use base R to get the minimum 'time' among all the columns of 'var' grouped by 'cat'
sapply(split(d[-1], d$cat), function(x)
x$time[min(which(x[-1] ==1, arr.ind = TRUE)[, 1])])
#A B
#4 7
Is this something you are expecting?
library(dplyr)
df <- d %>%
group_by(cat, var1, var2) %>%
summarise(start = min(time)) %>%
filter()
I have left a blank filter argument that you can use to specify any filter condition you want (say var1 == 1 or cat == "A")
Some questions are similar to this one (here or here, for example), and I know one solution that works, but I want a more elegant answer.
I work in epidemiology, and my variables are coded 1 and 0 (or NA). Example:
Does patient has cancer?
NA or 0 is no
1 is yes
Let's say I have several variables in my dataset and I want to count only the values coded "1". It's a classical frequency table, but dplyr is making this more complicated than I imagined at first glance.
My code is working:
dataset %>%
select(VISimpair, HEARimpai, IntDis, PhyDis, EmBehDis, LearnDis,
ComDis, ASD, HealthImpair, DevDelays) %>% # replace to your needs
summarise_all(funs(sum(1-is.na(.))))
And you can reproduce this code here:
library(tidyverse)
dataset <- data.frame(var1 = rep(c(NA,1),100), var2=rep(c(NA,1),100))
dataset %>% select(var1, var2) %>% summarise_all(funs(sum(1-is.na(.))))
But what I really want is to select all the variables of interest, count how many 0s (or NAs) and how many 1s each one has, and report that.
Thanks.
What about the following frequency table per variable?
First, I edit your sample data to also include 0's and load the necessary libraries.
library(tidyr)
library(dplyr)
dataset <- data.frame(var1 = rep(c(NA,1,0),100), var2=rep(c(NA,1,0),100))
Second, I convert the data using gather to make it easier to group_by later for the frequency table created by count, as mentioned by CPak.
dataset %>%
select(var1, var2) %>%
gather(var, val) %>%
mutate(val = factor(val)) %>%
group_by(var, val) %>%
count()
# A tibble: 6 x 3
# Groups:   var, val [6]
  var   val       n
  <chr> <fct> <int>
1 var1  0       100
2 var1  1       100
3 var1  NA      100
4 var2  0       100
5 var2  1       100
6 var2  NA      100
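gather() is superseded in current tidyr; pivot_longer() gives the same frequency table (a sketch on the same sample data):

```r
library(tidyr)
library(dplyr)

# The sample data with 0s included, as above
dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100),
                      var2 = rep(c(NA, 1, 0), 100))

# Reshape to long form, then count occurrences of each value per variable
dataset %>%
  pivot_longer(c(var1, var2), names_to = "var", values_to = "val") %>%
  count(var, val)
# six rows (0, 1, NA for each of var1 and var2), each with n = 100
```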
A quick and dirty method to do this is to coerce your input into factors:
dataset$var1 = as.factor(dataset$var1)
dataset$var2 = as.factor(dataset$var2)
summary(dataset$var1)
summary(dataset$var2)
summary() tells you the number of occurrences of each level of the factor, with NAs counted separately.
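For instance, on the question's sample data (extended with 0s, as in the other answer), this prints the counts per level directly:

```r
# Sample data: 100 each of NA, 1, and 0
dataset <- data.frame(var1 = rep(c(NA, 1, 0), 100))

# summary() on a factor reports counts per level, plus a separate NA's count
summary(as.factor(dataset$var1))
#    0    1 NA's
#  100  100  100
```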
I have the following data frame:
Group 1 ID A Value
Group 1 ID B Value
Group 1 ID C Value
Group 2 ID B Value
Group 2 ID C Value
Group 3 ID B Value
… … …
I am trying to use dplyr to get the mean value for each ID across groups (e.g. the mean value of ID B across group 1, group 2, and group 3). However, not every group has all of the IDs, so I want to subset so that means are computed only for IDs that appear in all groups. I know I can do group_by(dataFrame, group) %>% filter(...) %>% group_by(id) %>% mutate(mean), but I don't know what code to place in the filter.
How about
ngroups <- n_distinct(df$group)  # total number of groups
df %>%
  group_by(id) %>%
  mutate(count = n()) %>%
  filter(count == ngroups) %>% #...
So basically remove all the rows in the dataframe that correspond to an ID that doesn't appear in all groups, then perform the computation.
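As a complete runnable sketch (the data below is invented; column names follow the question's description): keep only IDs whose rows span every group, then take the mean per ID.

```r
library(dplyr)

# Invented data: only id "B" appears in all three groups
df <- tibble(
  group = c(1, 1, 1, 2, 2, 3),
  id    = c("A", "B", "C", "B", "C", "B"),
  value = c(10, 20, 30, 40, 50, 60)
)

ngroups <- n_distinct(df$group)  # 3 groups in total

df %>%
  group_by(id) %>%
  filter(n_distinct(group) == ngroups) %>%  # id must appear in every group
  summarise(mean_value = mean(value))
# only id "B" survives, with mean_value = mean(20, 40, 60) = 40
```

Using n_distinct(group) inside the filter (rather than n()) also handles the case where an ID appears more than once within the same group.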