How to create a new dataset based on multiple conditions in R?

I have a dataset called carcom that looks like this:
carcom <- data.frame(household = c(173, 256, 256, 319, 319, 319, 422, 422, 422, 422),
                     individuals = c(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
Here individuals refers to the father ("1"), the mother ("2"), and children ("3" and "4"). I would like to get two new columns. The first should indicate the number of children in the household, if any. The second should assign a weight to each individual: 1 for the father, 0.5 for the mother, and 0.3 for each child. My new dataset should look like this:
newcarcom <- data.frame(household = c(173, 256, 319, 422), child = c(0, 0, 1, 2), weight = c(1, 1.5, 1.8, 2.1))
I have been trying to find a solution for days. I would appreciate it if someone could help me. Thanks.

We can count the number of individuals with value 3 or 4 in each household. To calculate the weight, we map the values 1:4 to their corresponding weights using recode and then take the sum.
library(dplyr)

newcarcom <- carcom %>%
  group_by(household) %>%
  summarise(child = sum(individuals %in% 3:4),
            weight = sum(recode(individuals, `1` = 1, `2` = 0.5, .default = 0.3)))
#   household child weight
#       <dbl> <int>  <dbl>
# 1       173     0    1
# 2       256     0    1.5
# 3       319     1    1.8
# 4       422     2    2.1
Base R version suggested by @markus:
newcarcom <- do.call(data.frame, aggregate(individuals ~ household, carcom, function(x)
  # 1/x gives 1, 0.5, 0.33..., 0.25 for x = 1:4; anything below 0.5 becomes 0.3
  c(child = sum(x %in% 3:4), weight = sum(replace(y <- x^-1, y < 0.5, 0.3)))))

An option with data.table (note that recode here is still dplyr's):
library(data.table)

setDT(carcom)[, .(child = sum(individuals %in% 3:4),
                  weight = sum(dplyr::recode(individuals, `1` = 1, `2` = 0.5, .default = 0.3))),
              household]
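Whichever variant you use, you can sanity-check it against the expected result from the question. A minimal sketch, compared here against the dplyr result (the base R version produces different column names):

expected <- data.frame(household = c(173, 256, 319, 422),
                       child = c(0, 0, 1, 2),
                       weight = c(1, 1.5, 1.8, 2.1))
all.equal(as.data.frame(newcarcom), expected, check.attributes = FALSE)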

Related

Group_by not working, summarize() computing identical values?

I am using the data found here: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system. In RStudio, I have named the csv file BRFSS2015. Below is the code I am trying to execute. I have created two new columns comparing people who have arthritis vs. people who do not (arth and no_arth). Grouping by these variables, I am now trying to find the mean and sd of their weights. The weight variable was generated from another variable in the dataset using weight = BRFSS2015$WEIGHT2. Below is the code I am trying to run for the mean and sd.
BRFSS2015 %>%
  group_by(arth, no_arth) %>%
  summarize(mean_weight = mean(weight),
            sd_weight = sd(weight))
I am getting output saying that the mean and sd for these two groups are identical. I doubt this is correct. Can someone check and tell me why this is happening? The numbers I am getting are:
arth: mean = 733.2044; sd= 2197.377
no_arth: mean= 733.2044; sd= 2197.377
Here is how I created the variables arth and no_arth:
a = BRFSS2015 %>%
  select(HAVARTH3) %>%
  filter(HAVARTH3 == "1")
b = BRFSS2015 %>%
  select(HAVARTH3) %>%
  filter(HAVARTH3 == "2")
as.data.frame(BRFSS2015)
arth = c(a)
no_arth = c(b)
BRFSS2015$arth <- c(arth, rep(NA, nrow(BRFSS2015) - length(arth)))
BRFSS2015$no_arth <- c(no_arth, rep(NA, nrow(BRFSS2015) - length(no_arth)))
as.tibble(BRFSS2015)
Before I started, I also removed NAs from weight using weight = na.omit(WEIGHT2).
Based on the info you provided, one can only guess what went wrong in your analysis; a likely culprit is that arth and no_arth were built by filtering and then padding with NA, so they no longer line up row-by-row with BRFSS2015. Here is working code using a snippet of the real data.
library(tidyverse)
BRFSS2015_minimal <- structure(list(HAVARTH3 = c(
1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2,
1, 1, 1, 1, 1, 1, 2, 1, 2
), WEIGHT2 = c(
280, 165, 158, 180, 142,
145, 148, 179, 84, 161, 175, 150, 9999, 140, 170, 128, 200, 178,
155, 163
)), row.names = c(NA, -20L), class = c(
"tbl_df", "tbl",
"data.frame"
))
BRFSS2015_minimal %>%
  filter(!is.na(WEIGHT2), HAVARTH3 %in% 1:2) %>%
  mutate(arth = HAVARTH3 == 1, no_arth = HAVARTH3 == 2, weight = WEIGHT2) %>%
  group_by(arth, no_arth) %>%
  summarize(
    mean_weight = mean(weight),
    sd_weight = sd(weight),
    .groups = "drop"
  )
#> # A tibble: 2 × 4
#> arth no_arth mean_weight sd_weight
#> <lgl> <lgl> <dbl> <dbl>
#> 1 FALSE TRUE 165 10.8
#> 2 TRUE FALSE 865 2629.
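Note the implausible mean (865) and sd (2629) in the arth row: they are driven by the 9999 in WEIGHT2. BRFSS codes non-answers as special values (for WEIGHT2, 7777 = don't know and 9999 = refused, per my reading of the codebook; treat these codes as an assumption to verify against your copy), so a fuller cleaning step might look like this sketch:

BRFSS2015_minimal %>%
  filter(!is.na(WEIGHT2),
         !WEIGHT2 %in% c(7777, 9999),  # assumed BRFSS don't-know/refused codes
         HAVARTH3 %in% 1:2) %>%
  mutate(arth = HAVARTH3 == 1, no_arth = HAVARTH3 == 2, weight = WEIGHT2) %>%
  group_by(arth, no_arth) %>%
  summarize(mean_weight = mean(weight), sd_weight = sd(weight), .groups = "drop")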
Code used to create the dataset:
BRFSS2015 <- readr::read_csv("2015.csv")
BRFSS2015_minimal <- dput(head(BRFSS2015[c("HAVARTH3", "WEIGHT2")], 20))

How can I select only the dummy variable columns?

Is there a way to select only those columns of a data frame whose values are all either 0 or 1?
I have a data frame with a variety of values, including strings as well as numbers.
I tried to use dplyr select but could not find a way to evaluate the values contained in the columns.
Sample data is shown below.
data <- tribble(
  ~id, ~gender, ~height, ~smoking,
  1, 1, 170, 0,
  2, 0, 150, 0,
  3, 1, 160, 1
)
You can pass a function (or an rlang-style tilde function) to select_if and look for columns that contain only 0 or 1.
library(dplyr)

tribble(
  ~id, ~gender, ~height, ~smoking,
  1, 1, 170, 0,
  2, 0, 150, 0,
  3, 1, 160, 1
) %>%
  select_if(~ all(. %in% 0:1))
# # A tibble: 3 x 2
#   gender smoking
#    <dbl>   <dbl>
# 1      1       0
# 2      0       0
# 3      1       1
If you may have NA in a dummy-variable column, you may want to instead use %in% c(0:1, NA) in the predicate.
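On dplyr 1.0 or later, select_if is superseded; an equivalent sketch using select(where(...)), here also tolerating NA as noted above:

library(dplyr)

data %>%
  select(where(~ all(.x %in% c(0, 1, NA))))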

Summary of N recent values

I am trying to get summary statistics (sum and max here) over the N most recent values.
Starting data:
library(data.table)

dt = data.table(id = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'),
                week = c(1, 2, 3, 4, 1, 2, 3, 4),
                value = c(2, 3, 1, 0, 5, 7, 3, 2))
Desired result:
dt = data.table(id = c('a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'),
                week = c(1, 2, 3, 4, 1, 2, 3, 4),
                value = c(2, 3, 1, 0, 5, 7, 3, 2),
                sum_recent2week = c(NA, NA, 5, 4, NA, NA, 12, 10),
                max_recent2week = c(NA, NA, 3, 3, NA, NA, 7, 7))
With this data, I would like the sum and max of the 2 (N = 2) most recent values for each row, by id. The 4th (sum_recent2week) and 5th (max_recent2week) columns are my desired output.
You can use rollsum and rollmax from the zoo package.
library(zoo)

dt[, `:=`(sum_recent2week = shift(rollsum(value, 2, align = 'left', fill = NA), 2),
          max_recent2week = shift(rollmax(value, 2, align = 'left', fill = NA), 2)),
   id]
For the sum, if you're using data.table version >= 1.12, you can use data.table::frollmean. Its default is fill = NA, so there is no need to specify that in this case.
dt[, `:=`(sum_recent2week = shift(frollmean(value, 2, align = 'left') * 2, 2),
          max_recent2week = shift(rollmax(value, 2, align = 'left', fill = NA), 2)),
   id]
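Recent data.table releases also provide frollsum (added shortly after frollmean), which avoids the frollmean * 2 trick. Since the froll* functions default to align = 'right', a lag of 1 is enough. A sketch, assuming a data.table new enough to ship frollsum:

library(data.table)

# froll* defaults: align = "right", fill = NA
dt[, sum_recent2week := shift(frollsum(value, 2)), by = id]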
I'm sure it can be done in a much more elegant way, but here is one tidyverse possibility:
library(tidyverse)

dt %>%
  group_by(id) %>%
  mutate(sum_recent2week = lag(value + lead(value), n = 2),
         max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1))) %>%
  rowid_to_column() %>%
  select(-week, -value) %>%
  top_n(-2) %>%
  right_join(dt %>% rowid_to_column(),
             by = c("rowid" = "rowid", "id" = "id")) %>%
  select(-rowid)
     id sum_recent2week max_recent2week  week value
  <chr>           <dbl>           <dbl> <dbl> <dbl>
1     a              NA              NA    1.    2.
2     a              NA              NA    2.    3.
3     a              5.              3.    3.    1.
4     a              4.              3.    4.    0.
5     b              NA              NA    1.    5.
6     b              NA              NA    2.    7.
7     b             12.              7.    3.    3.
8     b             10.              7.    4.    2.
First, it computes sum_recent2week and max_recent2week per group. Second, it selects the last two rows per group. Finally, it merges the result back with the original data.
Or if you want to compute it for all rows, not just for the last two rows per group:
dt %>%
  group_by(id) %>%
  mutate(sum_recent2week = lag(value + lead(value), n = 2),
         max_recent2week = pmax(lag(value, n = 2), lag(value, n = 1)))
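For a general N (and any summary function), the same shift pattern works with zoo::rollapply, at some speed cost. A sketch using the dt from above:

library(zoo)
library(data.table)

N <- 2
dt[, `:=`(sum_recentNweek = shift(rollapply(value, N, sum, align = "left", fill = NA), N),
          max_recentNweek = shift(rollapply(value, N, max, align = "left", fill = NA), N)),
   by = id]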

Iterate over dplyr code using purrr::map2

I am relatively new to R, so my apologies if this question is too basic.
I have transactions that show quantity sold and revenue earned for different products. Because there are three products, there are 2^3 = 8 combinations for selling these products in a "basket." Each basket could be sold in any of three years (2016, 2017, 2018) and in either zone (East or West); I have three years' worth of transactions for the two zones.
My objective is to analyze how much revenue is earned, how many quantities are sold, and how many transactions occurred for each combination of these products in a given year for a given zone.
I was able to do the above operation (using purrr::map) by splitting the data based on zones. I have created a list of two data frames that hold data grouped by year for each combination described above. This works well, but the code is a little clunky in my opinion, with a lot of repetitive statements. I want to be able to create a 2×3 list (i.e. 2 zones and 3 years).
Here's my code using zone-wise splitting.
First Try
UZone <- unique(Input_File$Zone)
FYear <- unique(Input_File$Fiscal.Year)

# Split based on zone
a <- purrr::map(UZone, ~ dplyr::filter(Input_File, Zone == .)) %>%
  # Create combinations of products
  purrr::map(~ mutate_each(., funs(Exists = . > 0), L.Rev:I.Qty)) %>%
  # Group by fiscal year
  purrr::map(~ group_by_(., .dots = c("Fiscal.Year", grep("Exists", names(.), value = TRUE)))) %>%
  # Summarise, delete unwanted columns and rename the "number of transactions" column
  purrr::map(~ summarise_each(., funs(sum(., na.rm = TRUE), count = n()), L.Rev:I.Qty)) %>%
  purrr::map(~ select(., Fiscal.Year:L.Rev_count)) %>%
  purrr::map(~ plyr::rename(., c("L.Rev_count" = "No.Trans")))
# Now do zone- and year-wise splitting: Try 1
EastList <- a[[1]]
EastList <- EastList %>% split(.$Fiscal.Year)
WestList <- a[[2]]
WestList <- WestList %>% split(.$Fiscal.Year)
write.xlsx(EastList, file = "East.xlsx", row.names = FALSE)
write.xlsx(WestList, file = "West.xlsx", row.names = FALSE)
As you can see, the above code is very clunky. With limited knowledge of R, I researched https://blog.rstudio.org/2016/01/06/purrr-0-2-0/ and read the purrr::map2() manual, but I couldn't find many examples. After reading the solution at How to add list of vector to list of data.frame objects as new slot by parallel?, I am assuming that I could use X = zone and Y = fiscal year to do what I have done above.
Here's what I tried:
Second Try
# Now try zone- and year-wise splitting: Try 2
purrr::map2(UZone, FYear, ~ dplyr::filter(Input_File, Zone == ., Fiscal.Year == .))
But this code doesn't work. I get an error message:
Error: .x (2) and .y (3) are different lengths
Question 1: Can I use map2 to do what I am trying to do? If not, is there a better way?
Question 2: In case we are able to use map2, how can I generate the two Excel files using one command? As you can see above, I have two function calls; I'd want to have only one.
Question 3: Instead of the two statements below, is there any way to do the sum and count in one statement? I am looking for cleaner ways to do this.
purrr::map(~ summarise_each(., funs(sum(., na.rm = TRUE), count = n()), L.Rev:I.Qty)) %>%
purrr::map(~ select(., Fiscal.Year:L.Rev_count)) %>%
Can someone please help me?
Here's my data:
dput(Input_File)
structure(list(Zone = c("East", "East", "East", "East", "East",
"East", "East", "West", "West", "West", "West", "West", "West",
"West"), Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016,
2017, 2016, 2016, 2016, 2017, 2017, 2018, 2018), Transaction.ID = c(132,
133, 134, 135, 136, 137, 171, 171, 172, 173, 175, 176, 177, 178
), L.Rev = c(3, 0, 0, 1, 0, 0, 2, 1, 1, 2, 2, 1, 2, 1), L.Qty = c(3,
0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1), A.Rev = c(0, 0, 0, 1,
1, 1, 0, 0, 0, 0, 0, 1, 0, 0), A.Qty = c(0, 0, 0, 2, 2, 3, 0,
0, 0, 0, 0, 3, 0, 0), I.Rev = c(4, 4, 4, 0, 1, 0, 3, 0, 0, 0,
1, 0, 1, 1), I.Qty = c(2, 2, 2, 0, 1, 0, 3, 0, 0, 0, 1, 0, 1,
1)), .Names = c("Zone", "Fiscal.Year", "Transaction.ID", "L.Rev",
"L.Qty", "A.Rev", "A.Qty", "I.Rev", "I.Qty"), row.names = c(NA,
14L), class = "data.frame")
Output Format:
Here's the code to generate the output. I would love to see EastList.2016 and EastList.2017 as two sheets in one Excel file, and WestList.2016, WestList.2017, and WestList.2018 as three sheets in another.
# Generate the output:
EastList.2016 <- EastList[[1]]
EastList.2017 <- EastList[[2]]
WestList.2016 <- WestList[[1]]
WestList.2017 <- WestList[[2]]
WestList.2018 <- WestList[[3]]
Two lists broken down by year with sums and counts for each?
In dplyr (with df as your data frame):
df %>%
  group_by(Zone, Fiscal.Year) %>%
  summarise_at(vars(L.Rev:I.Qty), funs(sum = sum, cnt = n()))
Source: local data frame [5 x 14]
Groups: Zone [?]

   Zone Fiscal.Year L.Rev_sum L.Qty_sum A.Rev_sum A.Qty_sum I.Rev_sum I.Qty_sum L.Rev_cnt L.Qty_cnt A.Rev_cnt A.Qty_cnt I.Rev_cnt I.Qty_cnt
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <int>     <int>     <int>     <int>     <int>     <int>
1  East        2016         4         4         3         7        13         7         6         6         6         6         6         6
2  East        2017         2         1         0         0         3         3         1         1         1         1         1         1
3  West        2016         4         4         0         0         0         0         3         3         3         3         3         3
4  West        2017         3         3         1         3         1         1         2         2         2         2         2         2
5  West        2018         3         3         0         0         2         2         2         2         2         2         2         2
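For question 2 (one workbook per zone with one sheet per year): openxlsx::write.xlsx accepts a named list of data frames and writes each element to its own sheet, so one pipeline can produce both files. A sketch, assuming openxlsx is the Excel writer in use (sheet and file names come from the split factors):

library(dplyr)
library(purrr)
library(openxlsx)

df %>%
  group_by(Zone, Fiscal.Year) %>%
  summarise_at(vars(L.Rev:I.Qty), funs(sum = sum, cnt = n())) %>%
  ungroup() %>%
  split(.$Zone) %>%                              # one list element per zone
  iwalk(~ write.xlsx(split(.x, .x$Fiscal.Year),  # one sheet per fiscal year
                     file = paste0(.y, ".xlsx")))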

How to reorder rows in a matrix

I have a matrix and would like to reorder the rows so that, for example, row 5 is switched to row 2 and row 2 to row 7. I have a text file with all row names, delimited with \n. I thought I could read it into R and then index the matrix (in my case k) with something like k[txt_file, ] -> k_new, but this does not work, since the identifiers are not in the first column but are defined as row names.
k[ c(1, 5, 3, 4, 7, 6, 2), ]  # But probably not what you meant....
Or perhaps (if your 'k' object rownames are something other than the default character-numeric sequence):
k[ char_vec , ] # where char_vec will get matched to the row names.
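Since the row names live in a newline-delimited text file, char_vec can come straight from readLines (a sketch; the file name is an assumption):

char_vec <- readLines("rownames.txt")  # one row name per line
k_new <- k[char_vec, ]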
(dat <- structure(list(person = c(1, 1, 1, 1, 2, 2, 2, 2), time = c(1,
2, 3, 4, 1, 2, 3, 4), income = c(100, 120, 150, 200, 90, 100,
120, 150), disruption = c(0, 0, 0, 1, 0, 1, 1, 0)), .Names = c("person",
"time", "income", "disruption"), row.names = c("h", "g", "f",
"e", "d", "c", "b", "a"), class = "data.frame"))
dat[ c('h', 'f', 'd', 'b') , ]
#-------------
person time income disruption
h 1 1 100 0
f 1 3 150 0
d 2 1 90 0
b 2 3 120 1
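If some names in the file might not match the matrix, failing early with a clear message beats the default "subscript out of bounds" error (a sketch):

stopifnot(all(char_vec %in% rownames(k)))  # catch typos / stale names early
k_new <- k[char_vec, , drop = FALSE]       # drop = FALSE keeps matrix shape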
