How can I select only the dummy variable columns? - r

Is there a way to select only those columns in a dataframe where the values in the columns are either 0 or 1 at a time?
I have a data frame with various values, including various strings as well as numbers.
I tried to use dplyr select but could not find a way to evaluate the values contained in the columns.
Sample data is shown below.
data <- tribble(
~id, ~gender, ~height, ~smoking,
1, 1, 170, 0,
2, 0, 150, 0,
3, 1, 160, 1
)

You can pass a predicate function (or a purrr-style ~ lambda) to select_if and keep only the columns whose values are all 0 or 1.
tribble(
~id, ~gender, ~height, ~smoking,
1, 1, 170, 0,
2, 0, 150, 0,
3, 1, 160, 1
) %>%
select_if(~ all(. %in% 0:1))
# # A tibble: 3 x 2
# gender smoking
# <dbl> <dbl>
# 1 1 0
# 2 0 0
# 3 1 1
If you may have NA in a dummy-variable column, you may want to instead use %in% c(0:1, NA) in the predicate.
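In dplyr 1.0.0 and later, select_if is superseded by select() combined with where(). The following is a minimal sketch of the same idea, assuming a recent dplyr; the helper name is_dummy and the NA placed in the smoking column are illustrative and not part of the original question.

```r
library(dplyr)

# Same sample data as above, with one NA added for illustration
data <- tribble(
  ~id, ~gender, ~height, ~smoking,
  1, 1, 170, 0,
  2, 0, 150, NA,
  3, 1, 160, 1
)

# Hypothetical helper: TRUE when a column is numeric and holds only 0, 1 or NA
is_dummy <- function(x) is.numeric(x) && all(x %in% c(0, 1, NA))

dummies <- data %>% select(where(is_dummy))
names(dummies)
```

Note that a column consisting entirely of NA would also pass this predicate, so you may want an extra check if that can occur in your data.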

Related

Group_by not working, summarize() computing identical values?

I am using the data found here: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system. In RStudio, I have loaded the CSV file as BRFSS2015. I have created two new columns comparing people who have arthritis vs. people who do not have arthritis (arth and no_arth). Grouping by these variables, I am now trying to find the mean and sd of their weights. The weight variable was generated from another variable in the dataset using this code: (weight = BRFSS2015$WEIGHT2) Below is the code I am trying to run for mean and sd.
BRFSS2015%>%
group_by(arth,no_arth)%>%
summarize(mean_weight=mean(weight),
sd_weight=sd(weight))
I am getting output that says mean and sd for these two groups is identical. I doubt this is correct. Can someone check and tell me why this is happening? The numbers I am getting are:
arth: mean = 733.2044; sd= 2197.377
no_arth: mean= 733.2044; sd= 2197.377
Here is how I created the variables arth and no_arth:
a=BRFSS2015%>%
select(HAVARTH3)%>%
filter(HAVARTH3=="1")
b=BRFSS2015%>%
select(HAVARTH3)%>%
filter(HAVARTH3=="2")
as.data.frame(BRFSS2015)
arth=c(a)
no_arth=c(b)
BRFSS2015$arth <- c(arth, rep(NA, nrow(BRFSS2015)-length(arth)))
BRFSS2015$no_arth <- c(no_arth, rep(NA, nrow(BRFSS2015)-length(no_arth)))
as.tibble(BRFSS2015)
Before I started, I also removed NAs from weight using weight=na.omit(WEIGHT2)
Based on the info you provided, one can only guess what went wrong in your analysis. But here is working code using a snippet of the real data.
library(tidyverse)
BRFSS2015_minimal <- structure(list(HAVARTH3 = c(
1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2,
1, 1, 1, 1, 1, 1, 2, 1, 2
), WEIGHT2 = c(
280, 165, 158, 180, 142,
145, 148, 179, 84, 161, 175, 150, 9999, 140, 170, 128, 200, 178,
155, 163
)), row.names = c(NA, -20L), class = c(
"tbl_df", "tbl",
"data.frame"
))
BRFSS2015_minimal %>%
filter(!is.na(WEIGHT2), HAVARTH3 %in% 1:2) %>%
mutate(arth = HAVARTH3 == 1, no_arth = HAVARTH3 == 2,weight = WEIGHT2) %>%
group_by(arth, no_arth) %>%
summarize(
mean_weight = mean(weight),
sd_weight = sd(weight),
.groups = "drop"
)
#> # A tibble: 2 × 4
#> arth no_arth mean_weight sd_weight
#> <lgl> <lgl> <dbl> <dbl>
#> 1 FALSE TRUE 165 10.8
#> 2 TRUE FALSE 865 2629.
Code used to create dataset
BRFSS2015 <- readr::read_csv("2015.csv")
BRFSS2015_minimal <- dput(head(BRFSS2015[c("HAVARTH3", "WEIGHT2")], 20))
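One possible explanation for the identical values, assuming weight was created as a standalone vector in the global environment (as weight = BRFSS2015$WEIGHT2 suggests) rather than as a column of the data frame: summarize() then finds no column called weight, falls back to the global vector, and computes the same overall mean for every group. The toy data below is made up purely to illustrate this.

```r
library(dplyr)

# Toy data, invented for illustration only
df <- tibble(grp = c("a", "a", "b"), w = c(1, 2, 30))

# Standalone vector in the global environment, NOT a column of df
weight <- df$w

# Uses the global vector: every group gets the overall mean
by_global <- df %>% group_by(grp) %>% summarize(m = mean(weight))

# Uses the column: each group gets its own mean
by_column <- df %>% group_by(grp) %>% summarize(m = mean(w))

by_global
by_column
```

Creating the variable with mutate(), as in the answer above, keeps it inside the data frame so that it is split by group.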

Finding unique rows that are NOT between an interval

I'm trying to find a way to filter a data set so that I see only the rows that do NOT have a measurement in a particular interval. For some reason my brain cannot seem to put the logic together. I've created an example dataset below to try to explain my thinking.
library(dplyr)
df <- data.frame (id = c(1,1,1,1,1,1,1,1,2,2,2,2,2, 3, 3),
number = c(-10, -9, -8, -1, -0.5, 0.0, 0.23, 5, -2, -1.1, -.88, 1.2, 4, -10,10))
df
So here, ideally, I want to find the unique id's that do NOT have values in between -1 and 0. ID 1 and ID 2 both have values in between -1 and 0, so they would not be included.
df %>% filter(between(number, -1, 0))
But ID 3 only has measurements of -10 and 10, so that ID does not have measures in between the interval of -1 to 0. I'm trying to get that as my final output (the 2 rows with ID 3). But can't think of a way to achieve that.
Thanks in advance!
You could use group_by and keep only the groups in which no value falls inside the range, like this:
library(dplyr)
df <- data.frame (id = c(1,1,1,1,1,1,1,1,2,2,2,2,2, 3, 3),
number = c(-10, -9, -8, -1, -0.5, 0.0, 0.23, 5, -2, -1.1, -.88, 1.2, 4, -10,10))
df %>%
group_by(id) %>%
filter(all(!between(number, -1, 0)))
#> # A tibble: 2 × 2
#> # Groups: id [1]
#> id number
#> <dbl> <dbl>
#> 1 3 -10
#> 2 3 10
Created on 2022-09-30 with reprex v2.0.2
df %>% group_by(id) %>% filter(!any(between(number, -1, 0)))
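The two filters above are equivalent, since all(!x) is the same as !any(x). If only the unique ids are needed rather than the rows, a small sketch along the same lines (the names inside and ids_outside are illustrative) could be:

```r
library(dplyr)

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3),
  number = c(-10, -9, -8, -1, -0.5, 0.0, 0.23, 5, -2, -1.1, -.88, 1.2, 4, -10, 10)
)

# One row per id, flagging whether any measurement lies in [-1, 0]
ids_outside <- df %>%
  group_by(id) %>%
  summarise(inside = any(between(number, -1, 0))) %>%
  filter(!inside) %>%
  pull(id)

ids_outside
```

Keep in mind that between() is inclusive at both ends, so a value of exactly -1 or 0 counts as inside the interval.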

if_else with haven_labelled column fails because of wrong class

I have the following data:
dat <- structure(list(value = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
label = "value: This is my label",
labels = c(`No` = 0, `Yes` = 1),
class = "haven_labelled"),
group = structure(c(1, 2, 1, 1, 2, 3, 3, 1, 3, 1, 3, 3, 1, 2, 3, 2, 1, 3, 3, 1),
label = "my group",
labels = c(first = 1, second = 2, third = 3),
class = "haven_labelled")),
row.names = c(NA, -20L),
class = c("tbl_df", "tbl", "data.frame"),
label = "test.sav")
As you can see, the data uses a special class from the tidyverse's haven package, so-called labelled columns.
Now I want to recode my initial value variable such that:
if group equals 1, value should stay the same, otherwise it should be missing
I was trying the following, but getting an error:
dat_new <- dat %>%
mutate(value = if_else(group != 1, NA, value))
# Error: `false` must be a logical vector, not a `haven_labelled` object
I got as far as understanding that dplyr's if_else requires the true and false arguments to be of the same class, and since there is no NA equivalent for the labelled class (similar to NA_real_ for doubles), the code probably fails for that reason, right?
So, how can I recode my initial variable and preserve the labels?
I know I could change my code above and replace the if_else by R's base version ifelse. However, this deletes all labels and coerces the value column to a numeric one.
You can try dplyr::case_when for cases where group == 1. If no cases are matched, NA is returned:
dat %>% mutate(value = case_when(group == 1 ~ value))
You can create an NA value in the haven_labelled class with this ugly code:
haven::labelled(NA_real_, labels = attr(dat$value, "labels"))
I'd recommend writing a function for that, e.g.
labelled_NA <- function(value)
haven::labelled(NA_real_, labels = attr(value, "labels"))
and then the code for your mutate isn't quite so ugly:
dat_new <- dat %>%
mutate(value = if_else(group != 1, labelled_NA(value), value))
Then you get
> dat_new[1:5,]
# A tibble: 5 x 2
value group
<dbl+lbl> <dbl+lbl>
1 0 [No] 1 [first]
2 NA 2 [second]
3 0 [No] 1 [first]
4 0 [No] 1 [first]
5 NA 2 [second]

How to create a new dataset based on multiple conditions in R?

I have a dataset called carcom that looks like this
carcom <- data.frame(household = c(173, 256, 256, 319, 319, 319, 422, 422, 422, 422), individuals= c(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
Here individuals is "1" for the father, "2" for the mother, and "3" or "4" for a child. I would like to get two new columns. The first should indicate the number of children in that household, if there are any. The second should hold the sum of weights assigned to each individual: "1" for the father, "0.5" for the mother and "0.3" for each child. My new dataset should look like this
newcarcom <- data.frame(household = c(173, 256, 319, 422), child = c(0, 0, 1, 2), weight = c(1, 1.5, 1.8, 2.1))
I have been trying to find a solution for days. I would appreciate it if someone could help me. Thanks
We can count the number of individuals with value 3 or 4 in each household. To calculate the weight, we map the values 1:4 to their corresponding weights using recode and then take the sum.
library(dplyr)
newcarcom <- carcom %>%
group_by(household) %>%
summarise(child = sum(individuals %in% 3:4),
weight = sum(recode(individuals,`1` = 1, `2` = 0.5, .default = 0.3)))
# household child weight
# <dbl> <int> <dbl>
#1 173 0 1
#2 256 0 1.5
#3 319 1 1.8
#4 422 2 2.1
Base R version suggested by @markus
newcarcom <- do.call(data.frame, aggregate(individuals ~ household, carcom, function(x)
c(child = sum(x %in% 3:4), weight = sum(replace(y <- x^-1, y < 0.5, 0.3)))))
An option with data.table
library(data.table)
setDT(carcom)[, .(child = sum(individuals %in% 3:4),
weight = sum(dplyr::recode(individuals,`1` = 1, `2` = 0.5, .default = 0.3))), household]
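As a further variation on the answers above, the mapping from role to weight can be kept in a named lookup vector instead of a recode() call, which may be easier to extend if more roles are added. This is only a sketch; the name role_weights is illustrative.

```r
library(dplyr)

carcom <- data.frame(
  household = c(173, 256, 256, 319, 319, 319, 422, 422, 422, 422),
  individuals = c(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
)

# Hypothetical lookup table: role code -> weight
role_weights <- c(`1` = 1, `2` = 0.5, `3` = 0.3, `4` = 0.3)

newcarcom <- carcom %>%
  group_by(household) %>%
  summarise(
    child  = sum(individuals %in% 3:4),                      # children per household
    weight = sum(role_weights[as.character(individuals)])    # total weight per household
  )

newcarcom
```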

Iterate over dplyr code using purrr::map2

I am relatively new to R, so my apologies if this question is too basic.
I have transactions that show quantity sold and revenue earned from different products. Because there are three products, there are 2^3 = 8 combinations for selling these products in a "basket." Each basket could be sold in any of the three given years (2016, 2017, 2018) and in any of the zones (East and West). [I have 3 years worth of transactions for the two zones: East and West.]
My objective is to analyze how much revenue is earned, how many quantities are sold, and how many transactions occurred for each combination of these products in a given year for a given zone.
I was able to do the above operation (using purrr::map) by splitting the data based on zones. I have created a list of two data frames that hold data grouped by "year" for each combination described above. This works well. However, the code is a little clunky in my opinion. There are a lot of repetitive statements. I want to be able to create a list of 2X3 (i.e. 2 zones and 3 years)
Here's my code using zone-wise splitting.
First Try
UZone <- unique(Input_File$Zone)
FYear <- unique(Input_File$Fiscal.Year)
#Split based on zone
a<-purrr::map(UZone, ~ dplyr::filter(Input_File, Zone == .)) %>%
#Create combinations of products
purrr::map(~mutate_each(.,funs(Exists = . > 0), L.Rev:I.Qty )) %>%
#group by Fiscal Year
purrr::map(~group_by_(.,.dots = c("Fiscal.Year", grep("Exists", names(.), value = TRUE)))) %>%
#Summarize, delete unwanted columns and rename the "number of transactions" column
purrr::map(~summarise_each(., funs(sum(., na.rm = TRUE), count = n()), L.Rev:I.Qty)) %>%
purrr::map(~select(., Fiscal.Year:L.Rev_count)) %>%
purrr::map(~plyr::rename(.,c("L.Rev_count" = "No.Trans")))
#Now do Zone and Year-wise splitting : Try 1
EastList<-a[[1]]
EastList <- EastList %>% split(.$Fiscal.Year)
WestList<-a[[2]]
WestList <- WestList %>% split(.$Fiscal.Year)
write.xlsx(EastList , file = "East.xlsx",row.names = FALSE)
write.xlsx(WestList , file = "West.xlsx",row.names = FALSE)
As you can see, the above code is very clunky. With limited knowledge of R, I researched https://blog.rstudio.org/2016/01/06/purrr-0-2-0/ and read purrr::map2() manual but I couldn't find too many examples. After reading the solution at How to add list of vector to list of data.frame objects as new slot by parallel?, I am assuming that I could use X = zone and Y= Fiscal Year to do what I have done above.
Here's what I tried:
Second Try
#Now try Zone and Year-wise splitting : Try 2
purrr::map2(UZone,FYear, ~ dplyr::filter(Input_File, Zone == ., Fiscal.Year == .))
But this code doesn't work. I get an error message that :
Error: .x (2) and .y (3) are different lengths
Question 1: Can I use map2 to do what I am trying to do? If not, is there any other better way?
Question 2: Just in case, we are able to use map2, how can I generate two Excel files using one command? As you can see above, I have two function calls above. I'd want to have only one.
Question 3: Instead of the two statements below, is there any way to do sum and count in one statement? I am looking for cleaner ways to do sum and count.
purrr::map(~summarise_each(., funs(sum(., na.rm = TRUE), count = n()), L.Rev:I.Qty)) %>%
purrr::map(~select(., Fiscal.Year:L.Rev_count)) %>%
Can someone please help me?
Here's my data:
dput(Input_File)
structure(list(Zone = c("East", "East", "East", "East", "East",
"East", "East", "West", "West", "West", "West", "West", "West",
"West"), Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016,
2017, 2016, 2016, 2016, 2017, 2017, 2018, 2018), Transaction.ID = c(132,
133, 134, 135, 136, 137, 171, 171, 172, 173, 175, 176, 177, 178
), L.Rev = c(3, 0, 0, 1, 0, 0, 2, 1, 1, 2, 2, 1, 2, 1), L.Qty = c(3,
0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1), A.Rev = c(0, 0, 0, 1,
1, 1, 0, 0, 0, 0, 0, 1, 0, 0), A.Qty = c(0, 0, 0, 2, 2, 3, 0,
0, 0, 0, 0, 3, 0, 0), I.Rev = c(4, 4, 4, 0, 1, 0, 3, 0, 0, 0,
1, 0, 1, 1), I.Qty = c(2, 2, 2, 0, 1, 0, 3, 0, 0, 0, 1, 0, 1,
1)), .Names = c("Zone", "Fiscal.Year", "Transaction.ID", "L.Rev",
"L.Qty", "A.Rev", "A.Qty", "I.Rev", "I.Qty"), row.names = c(NA,
14L), class = "data.frame")
Output Format:
Here's the code to generate the output. I would love to see EastList.2016 and EastList.2017 as two sheets in one Excel file, and WestList.2016, WestList.2017 and WestList.2018 as 3 sheets in one Excel file.
#generate the output:
EastList.2016 <- EastList[[1]]
EastList.2017 <- EastList[[2]]
WestList.2016 <- WestList[[1]]
WestList.2017 <- WestList[[2]]
WestList.2018 <- WestList[[3]]
Two lists broken down by year with sums and counts for each?
In dplyr:
df <- Input_File # your data frame
df %>%
group_by(Zone, Fiscal.Year) %>%
summarise_at(vars(L.Rev:I.Qty), funs(sum = sum, cnt = n()))
Source: local data frame [5 x 14]
Groups: Zone [?]
Zone Fiscal.Year L.Rev_sum L.Qty_sum A.Rev_sum A.Qty_sum I.Rev_sum I.Qty_sum L.Rev_cnt L.Qty_cnt A.Rev_cnt A.Qty_cnt I.Rev_cnt I.Qty_cnt
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int>
1 East 2016 4 4 3 7 13 7 6 6 6 6 6 6
2 East 2017 2 1 0 0 3 3 1 1 1 1 1 1
3 West 2016 4 4 0 0 0 0 3 3 3 3 3 3
4 West 2017 3 3 1 3 1 1 2 2 2 2 2 2
5 West 2018 3 3 0 0 2 2 2 2 2 2 2 2
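summarise_at() and funs() have since been superseded, and map2() iterates over its two inputs in parallel, which is why it complains about 2 zones versus 3 years. A sketch of one alternative, assuming dplyr 1.0.0 or later: summarise with across(), count the transactions once with n(), then split the result by zone and by year. Only the L.Rev and L.Qty columns from the question's data are reproduced here to keep the example short; the names summary_tbl and by_zone_year are illustrative.

```r
library(dplyr)

# Subset of the question's data (Zone, Fiscal.Year, L.Rev, L.Qty only)
Input_File <- data.frame(
  Zone = rep(c("East", "West"), each = 7),
  Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016, 2017,
                  2016, 2016, 2016, 2017, 2017, 2018, 2018),
  L.Rev = c(3, 0, 0, 1, 0, 0, 2, 1, 1, 2, 2, 1, 2, 1),
  L.Qty = c(3, 0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1)
)

# Sums for every measure column plus a single transaction count
summary_tbl <- Input_File %>%
  group_by(Zone, Fiscal.Year) %>%
  summarise(
    across(L.Rev:L.Qty, sum, .names = "{.col}_sum"),
    No.Trans = n(),
    .groups = "drop"
  )

# Nested list: one element per zone, each holding one data frame per year
by_zone <- split(summary_tbl, summary_tbl$Zone)
by_zone_year <- lapply(by_zone, function(z) split(z, z$Fiscal.Year))

# Each inner list is named by year, so it could be passed to an Excel writer
# that accepts a named list of data frames, one sheet per element
# (for example writexl::write_xlsx, assuming that package is installed).
names(by_zone_year$West)
```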
