Group_by not working, summarize() computing identical values? - r

I am using the data found here: https://www.kaggle.com/cdc/behavioral-risk-factor-surveillance-system. In RStudio, I have loaded the CSV file as BRFSS2015. I have created two new columns comparing people who have arthritis vs. people who do not have arthritis (arth and no_arth). Grouping by these variables, I am now trying to find the mean and sd of their weights. The weight variable was generated from another variable in the dataset using this code: (weight = BRFSS2015$WEIGHT2). Below is the code I am trying to run for the mean and sd.
BRFSS2015%>%
group_by(arth,no_arth)%>%
summarize(mean_weight=mean(weight),
sd_weight=sd(weight))
I am getting output that says mean and sd for these two groups is identical. I doubt this is correct. Can someone check and tell me why this is happening? The numbers I am getting are:
arth: mean = 733.2044; sd= 2197.377
no_arth: mean= 733.2044; sd= 2197.377
Here is how I created the variables arth and no_arth:
a=BRFSS2015%>%
select(HAVARTH3)%>%
filter(HAVARTH3=="1")
b=BRFSS2015%>%
select(HAVARTH3)%>%
filter(HAVARTH3=="2")
as.data.frame(BRFSS2015)
arth=c(a)
no_arth=c(b)
BRFSS2015$arth <- c(arth, rep(NA, nrow(BRFSS2015)-length(arth)))
BRFSS2015$no_arth <- c(no_arth, rep(NA, nrow(BRFSS2015)-length(no_arth)))
as.tibble(BRFSS2015)
Before I started, I also removed NAs from weight using weight=na.omit(WEIGHT2)

Based on the info you provided, one can only guess what went wrong in your analysis. But here is working code using a snippet of the real data.
library(tidyverse)
BRFSS2015_minimal <- structure(list(HAVARTH3 = c(
1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 2,
1, 1, 1, 1, 1, 1, 2, 1, 2
), WEIGHT2 = c(
280, 165, 158, 180, 142,
145, 148, 179, 84, 161, 175, 150, 9999, 140, 170, 128, 200, 178,
155, 163
)), row.names = c(NA, -20L), class = c(
"tbl_df", "tbl",
"data.frame"
))
BRFSS2015_minimal %>%
filter(!is.na(WEIGHT2), HAVARTH3 %in% 1:2) %>%
mutate(arth = HAVARTH3 == 1, no_arth = HAVARTH3 == 2,weight = WEIGHT2) %>%
group_by(arth, no_arth) %>%
summarize(
mean_weight = mean(weight),
sd_weight = sd(weight),
.groups = "drop"
)
#> # A tibble: 2 × 4
#> arth no_arth mean_weight sd_weight
#> <lgl> <lgl> <dbl> <dbl>
#> 1 FALSE TRUE 165 10.8
#> 2 TRUE FALSE 865 2629.
Code used to create dataset
BRFSS2015 <- readr::read_csv("2015.csv")
BRFSS2015_minimal <- dput(head(BRFSS2015[c("HAVARTH3", "WEIGHT2")], 20))
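One plausible explanation for the identical values, assuming weight was created as a standalone vector (as weight = BRFSS2015$WEIGHT2 suggests) and not as a column of BRFSS2015: summarize() does not find a column called weight, falls back to the vector in the global environment, and computes the mean of the whole vector once per group. The toy data below is made up for illustration only.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from BRFSS)
df <- tibble(grp = c("a", "a", "b", "b"),
             WEIGHT2 = c(100, 120, 200, 220))

# A standalone vector in the global environment, NOT a column of df
weight <- df$WEIGHT2

# summarise() finds no column named weight, so it uses the global vector:
# every group gets the mean of all four values
masked <- df %>%
  group_by(grp) %>%
  summarise(mean_weight = mean(weight))

# Creating weight as a column first gives per-group means
fixed <- df %>%
  mutate(weight = WEIGHT2) %>%
  group_by(grp) %>%
  summarise(mean_weight = mean(weight))

masked$mean_weight  # 160 160
fixed$mean_weight   # 110 210
```

This is also why the answer above creates weight inside mutate() before grouping.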

Related

How to loop over multiple groups and create radar plots in R

I have the following dataframe:
group   Class   Maths   Science
Name1   7       74      78
Name2   7       80      91
Name3   6       69      80
I want to create separate radar plots for the variables Maths and Science for each class using R. E.g., for the above dataframe, two radar plots should be created, one for class 7 and one for class 6.
nrange <- 2
class <- c(7,6)
for (i in nrange){
plot <- ggradar::ggradar(df[i,2:3], values.radar = c(0, 50, 100), group.line.width = 1,
group.point.size = 2, legend.position = "bottom", plot.title=class[i])
}
plot
I am using the above code. However, it only creates the plot for the last row. Please help me with this issue.
Thanks a lot in advance!
You were almost there, but there were two little problems.
The for statement evaluated to for(i in 2) which means it is only using i=2. You can fix this by using for(i in 1:nrange)
You were overwriting plot each time through the loop. If you make plot a list and save each graph as a separate element in the list, then it should work.
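Both problems can be seen without any plotting package. The sketch below uses plain strings as stand-ins for the plots; the variable names seen_wrong, seen_right and results are made up for illustration.

```r
nrange <- 2

# Problem 1: for (i in nrange) iterates over the single value 2
seen_wrong <- c()
for (i in nrange) seen_wrong <- c(seen_wrong, i)

# seq_len(nrange) (or 1:nrange) iterates over 1, 2
seen_right <- c()
for (i in seq_len(nrange)) seen_right <- c(seen_right, i)

# Problem 2: store each result in its own list element instead of
# overwriting a single variable on every pass
results <- vector("list", nrange)
for (i in seq_len(nrange)) {
  results[[i]] <- paste("plot for item", i)  # stand-in for a ggradar() call
}

seen_wrong   # 2
seen_right   # 1 2
length(results)  # 2
```

seq_len() is often preferred over 1:n because it yields an empty sequence when n is 0, whereas 1:0 counts down from 1 to 0.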
mydat <- tibble::tribble(
~group, ~Class, ~Maths, ~Science,
"Name1", 7, 74, 78,
"Name2", 7, 80, 91,
"Name3", 6, 69, 80)
plots <- list()
nrange <- 2
class <- c(7,6)
for (i in 1:3){
plots[[i]] <- ggradar::ggradar(mydat[i,2:4], values.radar = c(0, 50, 100),
grid.max = 100, group.line.width = 1,
group.point.size = 2, legend.position = "bottom", plot.title=mydat$Class[i])
}
plots
#> [[1]]
#>
#> [[2]]
#>
#> [[3]]
Created on 2023-02-03 by the reprex package (v2.0.1)
Putting Together with facet_wrap()
library(dplyr)
library(ggplot2)
mydat <- tibble::tribble(
~group, ~Class, ~Maths, ~Science,
"Name1", 7, 74, 78,
"Name2", 7, 80, 91,
"Name3", 6, 69, 80)
mydat <- mydat %>%
mutate(gp = paste(group, Class, sep=": ")) %>%
select(gp, Maths, Science)
ggradar::ggradar(mydat, values.radar = c(0, 50, 100),
grid.max = 100, group.line.width = 1,
group.point.size = 2, legend.position = "bottom") +
facet_wrap(~gp)
Created on 2023-02-06 by the reprex package (v2.0.1)

How to use for-loop (or apply) in R to filter and select specific columns with specific terms

I am trying to make a table like this -
The table contains several scenarios and risk_type.
The scenarios are basically filters. For example
0 - loan_age > 18
1 - interest_rate > 8%
2 - interest_rate > 18% AND referee == "MALE" AND new_LTV > 50
risk_type are columns in the original dataset like
A - flood risk
B - wildfire risk
C - foundation risk
What I want to do is to create a summary table of all these different risks for all the filters.
This is what the data looks like -
Damage and new LTV is a function of risk score, and I want to filter for risk score > 4
Edit - The first 5 rows of the dummy dataframe.
structure(list(ID = c(1, 2, 3, 4, 5), LTV_value = c(43, 43, 32,
34, 35), loan_age = c(17, 65, 32, 33, 221), referee = c("MALE",
"FEMALE", "MALE", "MALE", "FEMALE"), interest_rate = c(0.02,
0.03, 0.05, 0.0633333333333333, 0.0783333333333333), value = c(70000,
80000, 90000, 1e+05, 45000), flood_risk_score = c(3, 4, 5, 0,
1), wildfire_risk_score = c(3, 4, 3, 3, 2), foundation_risk_score = c(5,
5, 2, 0, 1), flood_damage = c(21000, 32000, 45000, 0, 4500),
wildfire_damage = c(21000, 32000, 27000, 30000, 9000), foundation_damage = c(35000,
40000, 18000, 0, 4500), new_LTV_flood = c(40, 39, 27, 34,
34), new_LTV_wildfire = c(40, 39, 29, 31, 33), new_LTV_foundation = c(38,
38, 30, 34, 34)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))
Till now I have tried these methods.
risk_list = c("flood_risk_score"
, "wildfire_risk_score"
, "foundation_risk_score")
for (i in risk_list){
table <- df %>%
filter(df[i] > 3) %>%
summarise(Count = n()
, mean = mean(value, na.rm = TRUE)
, LTV = mean(LTV))
# Using rbind() to append the output of one iteration to the dataframe
table_append= rbind(table_append, table)
}
This helps me get the values for all the risk scores, however, I have two issues here.
I am unable to filter according to a filter list
For the filter list, I tried this code, but I am unable to add it in a loop -
filters_list = list(which(df$interest > 8)
, which(df$loan_age > 18))
For LTV update, all of them have different new LTV
All of them need to be filtered for high LTV using their new LTV scores
risk_type_list = c("flood"
, "wildfire"
, "foundation")
for (i in list(paste0(risk_type_list,"_risk_level"))){
table <- df %>%
filter(df[paste0(i,"_risk_level")] > 3) %>%
summarise(Count = n())
#Using rbind() to append the output of one iteration to the dataframe
table_append = rbind(table_append, table)
}
In the end, I want to have code that will generate data from the given data by putting in required filters for all different risk types and also use their new LTV values.
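One possible approach, as a rough sketch only: store each scenario as a quoted filter expression, build the risk-specific column names with paste0(), and summarise every scenario/risk combination in one helper. The scenario names, the 0.05 interest threshold (the sample rates are stored as fractions) and the helper summarise_one() are assumptions made for illustration. The score cutoff of 3 follows the attempted code, although the text also mentions > 4. The data is a column subset of the five dummy rows above.

```r
library(dplyr)

# Column subset of the dummy data posted in the question
df <- tibble(
  loan_age = c(17, 65, 32, 33, 221),
  interest_rate = c(0.02, 0.03, 0.05, 0.0633333333333333, 0.0783333333333333),
  value = c(70000, 80000, 90000, 1e5, 45000),
  flood_risk_score = c(3, 4, 5, 0, 1),
  wildfire_risk_score = c(3, 4, 3, 3, 2),
  foundation_risk_score = c(5, 5, 2, 0, 1),
  new_LTV_flood = c(40, 39, 27, 34, 34),
  new_LTV_wildfire = c(40, 39, 29, 31, 33),
  new_LTV_foundation = c(38, 38, 30, 34, 34)
)

# Scenarios as quoted expressions (names and thresholds are hypothetical)
scenarios <- list(
  loan_age_gt_18 = quote(loan_age > 18),
  rate_gt_5pct   = quote(interest_rate > 0.05)
)
risk_types <- c("flood", "wildfire", "foundation")
min_score <- 3  # assumed cutoff, taken from the attempted code

# Summarise one scenario / risk type combination
summarise_one <- function(scenario_name, risk_type) {
  score_col <- paste0(risk_type, "_risk_score")
  ltv_col <- paste0("new_LTV_", risk_type)
  df %>%
    filter(!!scenarios[[scenario_name]], .data[[score_col]] > min_score) %>%
    summarise(
      scenario = scenario_name,
      risk_type = risk_type,
      Count = n(),
      mean_value = mean(value, na.rm = TRUE),
      mean_new_LTV = mean(.data[[ltv_col]], na.rm = TRUE)
    )
}

# All combinations of scenario and risk type, one row each
grid <- expand.grid(scenario = names(scenarios), risk_type = risk_types,
                    stringsAsFactors = FALSE)
summary_table <- bind_rows(Map(summarise_one, grid$scenario, grid$risk_type))
summary_table
```

Combinations with no matching rows still produce a row, with Count equal to 0 and NaN for the means. Compound scenarios such as the third one in the question could be written the same way, e.g. by combining conditions with & inside quote().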

How to apply functions depending on the column, and mutate into new data frame?

I came up with the idea to represent stats on a chart like this. Example of the plot. And made it like this.
df_n <- df_normalized %>%
transmute(
Height_x = round(Height*cos_my(45), 2),
Height_y = round(Height*sin_my(45), 2),
Weight_x = round(Weight*cos_my(45*2), 2),
Weight_y = round(Weight*sin_my(45*2), 2),
Reach_x = round(Reach*cos_my(45*3), 2),
Reach_y = round(Reach*sin_my(45*3), 2),
SLpM_x = round(SLpM*cos_my(45*4), 2),
SLpM_y = round(SLpM*sin_my(45*4), 2),
Str_Def_x = round(`Str_Def %`*cos_my(45*5), 2),
Str_Def_y = round(`Str_Def %`*sin_my(45*5), 2),
TD_Avg_x = round(TD_Avg*cos_my(45*6), 2),
TD_Avg_y = round(TD_Avg*sin_my(45*6), 2),
TD_Acc_x = round(`TD_Acc %`*cos_my(45*7), 2),
TD_Acc_y = round(`TD_Acc %`*sin_my(45*7), 2),
Sub_Avg_x = round(Sub_Avg*cos_my(45*8), 2),
Sub_Avg_y = round(Sub_Avg*sin_my(45*8), 2))
Now I want to do this in a smarter way, so I created a data frame with the same number of rows, empty_df, and in a for loop I try to mutate it with a new array on every iteration. For example, I want to multiply the 1st column by cos(30), the 2nd by cos(30*2), and so on.
But...
It mutates only the last column, because on every iteration the new column gets the same name, 'column'.
I want to name each column using the variable column, built with paste0().
reprex_df <- structure(list(Height = c(190, 180, 183, 196, 185),
Weight = c(120, 77, 93, 120, 84),
Reach = c(193, 180, 188, 203, 193),
SLpM = c(2.45, 3.8, 2.05, 7.09, 3.17),
`Str_Def %` = c(58, 56, 55, 34, 44),
TD_Avg = c(1.23, 0.33, 0.64, 0.91, 0),
`TD_Acc %` = c(24, 50, 20, 66, 0),
Sub_Avg = c(0.2, 0, 0, 0, 0)), row.names = c(NA, -5L),
class = c("tbl_df", "tbl", "data.frame"))
temp <- apply(reprex_df[,1], function(x) x*cos(60), MARGIN = 2)
temp
empty_df <- data.frame(first_column = replicate(length(temp),1))
for (x in 1:8) {
temp <- apply(df[,x], function(x) round(x*cos((360/8)*x),2), MARGIN = 2)
column <- paste0("Column_",x)
empty_df <- mutate(empty_df, column = temp)
}
Later I want to make it a function where I can pass data frame and receive data frame with X, and Y coordinates.
So, how should I make it?
Perhaps this helps
library(purrr)
library(stringr)
nm1 <- names(reprex_df)
nm_cos <- str_c(names(reprex_df), "_x")
nm_sin <- str_c(names(reprex_df), "_y")
reprex_df[nm_cos] <- map2(reprex_df, seq_along(nm1),
~ round(.x * cos(45 *.y ), 2))
reprex_df[nm_sin] <- map2(reprex_df[nm1], seq_along(nm1),
~ round(.x * sin(45 *.y ), 2))
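Note that R's cos() and sin() take radians, so cos(45 * .y) above does not correspond to steps of 45 degrees. The sketch below wraps the same idea in a function, assuming cos_my() and sin_my() from the question work in degrees (their definitions are not shown). The function name to_xy and the step_deg argument are made up for illustration.

```r
# Turn each column into x/y coordinates, with the i-th column placed at
# an angle of i * step_deg degrees (converted to radians for cos/sin)
to_xy <- function(df, step_deg = 360 / ncol(df)) {
  angles <- seq_along(df) * step_deg * pi / 180
  x <- Map(function(col, a) round(col * cos(a), 2), df, angles)
  y <- Map(function(col, a) round(col * sin(a), 2), df, angles)
  names(x) <- paste0(names(df), "_x")
  names(y) <- paste0(names(df), "_y")
  # check.names = FALSE keeps names such as "Str_Def %_x" unchanged
  data.frame(c(x, y), check.names = FALSE)
}

# Small example with two columns, i.e. steps of 90 degrees
small <- data.frame(Height = c(190, 180), Weight = c(120, 77))
out <- to_xy(small, step_deg = 90)
out
```

Called as to_xy(reprex_df), the default step is 360 / 8 = 45 degrees, matching the hand-written transmute() in the question.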

How to create a new dataset based on multiple conditions in R?

I have a dataset called carcom that looks like this
carcom <- data.frame(household = c(173, 256, 256, 319, 319, 319, 422, 422, 422, 422), individuals= c(1, 1, 2, 1, 2, 3, 1, 2, 3, 4))
Here individuals is "1" for the father, "2" for the mother, and "3" or "4" for a child. I would like to get two new columns. The first should indicate the number of children in the household, if there are any. The second should be the sum of a weight assigned to each individual: "1" for the father, "0.5" for the mother and "0.3" for each child. My new dataset should look like this
newcarcom <- data.frame(household = c(173, 256, 319, 422), child = c(0, 0, 1, 2), weight = c(1, 1.5, 1.8, 2.1))
I have been trying to find the solutions for days. Would be appreciated if someone helps me. Thanks
We can count number of individuals with value 3 and 4 in each household. To calculate weight we change the value for 1:4 to their corresponding weight values using recode and then take sum.
library(dplyr)
newcarcom <- carcom %>%
group_by(household) %>%
summarise(child = sum(individuals %in% 3:4),
weight = sum(recode(individuals,`1` = 1, `2` = 0.5, .default = 0.3)))
# household child weight
# <dbl> <int> <dbl>
#1 173 0 1
#2 256 0 1.5
#3 319 1 1.8
#4 422 2 2.1
Base R version suggested by @markus
newcarcom <- do.call(data.frame, aggregate(individuals ~ household, carcom, function(x)
c(child = sum(x %in% 3:4), weight = sum(replace(y <- x^-1, y < 0.5, 0.3)))))
An option with data.table
library(data.table)
setDT(carcom)[, .(child = sum(individuals %in% 3:4),
weight = sum(recode(individuals,`1` = 1, `2` = 0.5, .default = 0.3))), household]
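The same recode-then-sum idea can also be sketched in base R with a plain lookup vector indexed by the individual code. The name role_weight is made up for illustration.

```r
carcom <- data.frame(
  household = c(173, 256, 256, 319, 319, 319, 422, 422, 422, 422),
  individuals = c(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
)

# Position i holds the weight for individual code i:
# 1 = father, 2 = mother, 3 and 4 = children
role_weight <- c(1, 0.5, 0.3, 0.3)

# tapply() returns results in sorted household order,
# which matches sort(unique(household))
newcarcom <- data.frame(
  household = sort(unique(carcom$household)),
  child = as.vector(tapply(carcom$individuals %in% 3:4, carcom$household, sum)),
  weight = as.vector(tapply(role_weight[carcom$individuals], carcom$household, sum))
)
newcarcom
```

A lookup vector works here only because the codes are the consecutive integers 1 to 4. For other codes, a named vector indexed by as.character() would be the analogous approach.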

Weighted mean calculation in R with missing values

Does anyone know if it is possible to calculate a weighted mean in R when values are missing, and when values are missing, the weights for the existing values are scaled upward proportionately?
To convey this clearly, I created a hypothetical scenario. This describes the root of the question, where the scalar needs to be adjusted for each row, depending on which values are missing.
Image: Weighted Mean Calculation
File: Weighted Mean Calculation in Excel
Using weighted.mean from the base stats package with the argument na.rm = TRUE should get you the result you need. Here is a tidyverse way this could be done:
library(tidyverse)
scores <- tribble(
~student, ~test1, ~test2, ~test3,
"Mark", 90, 91, 92,
"Mike", NA, 79, 98,
"Nick", 81, NA, 83)
weights <- tribble(
~test, ~weight,
"test1", 0.2,
"test2", 0.4,
"test3", 0.4)
scores %>%
gather(test, score, -student) %>%
left_join(weights, by = "test") %>%
group_by(student) %>%
summarise(result = weighted.mean(score, weight, na.rm = TRUE))
#> # A tibble: 3 x 2
#> student result
#> <chr> <dbl>
#> 1 Mark 91.20000
#> 2 Mike 88.50000
#> 3 Nick 82.33333
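To see why this rescales the weights, here is Mike's row worked by hand: with na.rm = TRUE, weighted.mean() drops the missing score together with its weight and then divides by the sum of the remaining weights. The variable names manual and builtin are only for illustration.

```r
score  <- c(NA, 79, 98)      # Mike's scores from the example above
weight <- c(0.2, 0.4, 0.4)

# Manual version: keep only non-missing scores and their weights,
# then divide by the sum of the weights that remain (0.8, not 1)
keep <- !is.na(score)
manual <- sum(score[keep] * weight[keep]) / sum(weight[keep])

builtin <- weighted.mean(score, weight, na.rm = TRUE)

manual   # 88.5
builtin  # 88.5
```

Dividing by 0.8 instead of 1 is the same as scaling the two remaining weights up proportionately, to 0.5 each.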
The best way to post an example dataset is to use dput(head(dat, 20)), where dat is the name of a dataset. Graphic images are a really bad choice for that.
DATA.
dat <-
structure(list(Test1 = c(90, NA, 81), Test2 = c(91, 79, NA),
Test3 = c(92, 98, 83)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
w <-
structure(list(Test1 = c(18, NA, 27), Test2 = c(36.4, 39.5, NA
), Test3 = c(36.8, 49, 55.3)), .Names = c("Test1", "Test2", "Test3"
), row.names = c("Mark", "Mike", "Nick"), class = "data.frame")
CODE.
You can use the function weighted.mean in the base package stats, together with sapply, for this. Note that if your datasets of scores and weights are R objects of class matrix, you will not need unlist.
sapply(seq_len(nrow(dat)), function(i){
weighted.mean(unlist(dat[i,]), unlist(w[i, ]), na.rm = TRUE)
})
