I'm trying to add a new column (color) to my data frame. The value in the row depends on the values in two other columns. For example, when the class value is equal to 4 and the Metro_status value is equal to Metro, I want a specific value returned in the corresponding row in the new column. I tried doing this with case_when using dplyr and it worked... to an extent.
Most of the values written to the color column don't match the defined conditions. For example, the first row's (Nome Census Area) color value should be "#fcc48b" but is "#d68182" instead.
What am I doing wrong?? TIA!
Here's my code:
#set working directory
setwd("C:/Users/weirc/OneDrive/Desktop/Undergrad Courses/Fall 2021 Classes/GHY 3814/final project/data")
#load packages
library(readr)
library(dplyr)
#load data
counties <- read_csv("vaxData_counties.csv")
#create new column for class
updated_county_data <- counties %>%
mutate(class = case_when(
Series_Complete >=75 ~ 4,
Series_Complete >= 50 ~ 3,
Series_Complete >= 25 ~ 2,
TRUE ~ 1
), color = case_when(
class == 4 | Metro_status == 'Metro' ~ '#d62023',
class == 4 | Metro_status == 'Non-metro' ~ '#d68182',
class == 3 | Metro_status == 'Metro' ~ '#fc9126',
class == 3 | Metro_status == 'Non-metro' ~ '#fcc48b',
class == 2 | Metro_status == 'Metro' ~ '#83d921',
class == 2 | Metro_status == 'Non-metro' ~ '#abd977',
class == 1 | Metro_status == 'NA' ~ '#7a7a7a'
))
View(updated_county_data)
write.csv(updated_county_data, file="county_data_manip/updated_county_data.csv")
Here's what the data frame looks like
Remark 1:
when the class value is equal to 4 and the Metro_status value is equal to Metro
In R (and many programming languages) & is the "and". You're using |, which is "or".
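To make Remark 1 concrete, here is a minimal sketch of the color logic with & in place of |. The toy data frame and its values are made up for illustration; the column names and hex codes come from the question. The final TRUE fallback (gray for class 1 or a missing Metro_status) is an assumption about what the asker wants, not something the question confirms.

```r
library(dplyr)

# Hypothetical toy data standing in for vaxData_counties.csv
toy <- data.frame(
  Series_Complete = c(80, 80, 60, 60, 30, 30, 10),
  Metro_status    = c("Metro", "Non-metro", "Metro", "Non-metro",
                      "Metro", "Non-metro", NA),
  stringsAsFactors = FALSE
)

toy_colored <- toy %>%
  mutate(
    class = case_when(
      Series_Complete >= 75 ~ 4,
      Series_Complete >= 50 ~ 3,
      Series_Complete >= 25 ~ 2,
      TRUE ~ 1
    ),
    # `&` requires BOTH conditions to hold; `|` would match if either one did
    color = case_when(
      class == 4 & Metro_status == "Metro"     ~ "#d62023",
      class == 4 & Metro_status == "Non-metro" ~ "#d68182",
      class == 3 & Metro_status == "Metro"     ~ "#fc9126",
      class == 3 & Metro_status == "Non-metro" ~ "#fcc48b",
      class == 2 & Metro_status == "Metro"     ~ "#83d921",
      class == 2 & Metro_status == "Non-metro" ~ "#abd977",
      TRUE                                     ~ "#7a7a7a"  # assumed fallback
    )
  )
```

With |, the first branch matches every Metro row regardless of class and the second branch matches every Non-metro row, which is why most rows ended up with one of the first two colors.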
Remark 2:
Consider simplifying the first four lines to two, since Metro status doesn't affect the color for classes 4 & 3
Remark 3:
To calculate class, consider base::cut(), because it's adequate, yet simpler than dplyr::case_when().
Here's my preference when escalating the complexity of recoding functions:
https://ouhscbbmc.github.io/data-science-practices-1/coding.html#coding-simplify-recoding
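As a rough sketch of Remark 3, the same four classes can be produced with base::cut(). The input vector below is hypothetical and includes the boundary values on purpose; the break points are the ones from the question's case_when().

```r
# Hypothetical vaccination percentages, including the boundary values
series_complete <- c(80, 75, 74.9, 50, 25, 24.9, 0)

# right = FALSE closes each interval on the left: [25, 50), [50, 75), [75, Inf)
# labels = FALSE returns the integer bin number instead of a factor
vax_class <- cut(series_complete,
                 breaks = c(-Inf, 25, 50, 75, Inf),
                 labels = FALSE,
                 right  = FALSE)
```

Setting right = FALSE matters here: the default (right = TRUE) would place a value of exactly 75 in class 3 rather than class 4, which would not match the >= comparisons in the question.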
Remark 4:
This was a good SO post, but see if you can improve your next one.
Read and incorporate elements from "How to make a great R reproducible example?", especially the advice to use dput() for the input and to include an explicit example of your expected dataset.
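A minimal sketch of the dput() workflow, assuming a small data frame. The rows and numbers below are invented for illustration and are not the asker's real data.

```r
# Hypothetical stand-in for a few rows of the real data
counties_sample <- data.frame(
  County          = c("Nome Census Area", "Example County"),
  Series_Complete = c(60.2, 80.1),
  Metro_status    = c("Non-metro", "Metro"),
  stringsAsFactors = FALSE
)

# dput() prints R code that rebuilds the object; paste that output into the question
dput(counties_sample)

# A reader recreates the data by evaluating the pasted text
pasted  <- paste(capture.output(dput(counties_sample)), collapse = "\n")
rebuilt <- eval(parse(text = pasted))
```

For a large data frame, dput(head(counties, 10)) keeps the pasted output short.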
Related
I am trying to code offenders into race categories to clean up a criminal justice database.
One of the race categories is "equal number of blacks and whites". I've been asked to take the number of offenders (variable V4248), split its value, and allocate half to "whites" and half to "blacks". Coding-wise, I am not sure how to do this. This is where I'm at so far:
I created the race variable
I created a variable to split the value (but this is where I'm not sure how to code it)
I combine the two variables above.
ncvs11 <- ncvs10 %>%
mutate(total_ethnic = case_when(off_race_hisp == 1 | multi_race_ethnic == 1 ~ 1, #White
off_race_hisp == 2 | multi_race_ethnic == 2 ~ 2, #Black
multi_race_ethnic == 3 ~ 3,
off_race_hisp == 4 | multi_race_ethnic == 4 ~ 4 #Hispanic
)) %>%
mutate(offsplit = case_when(total_ethnic == 3 ~ V4248/2)) %>%
#WHAT DO I DO HERE?
mutate(final_ethnic = case_when(total_ethnic == 1 | total_ethnic == 2 |
total_ethnic == 4 ~ total_ethnic,
total_ethnic == 3 ~ offsplit))
Thanks,
I am trying to create an if-and-or statement with multiple responses.
A sample dataset is as follows (my actual data does not have the columns next to each other and has a lot more responses):
hairdf=data.frame(
id=c(1:5),
drug1=c("etoh*","hhh","etoh","hhhh","blank"),
source1=c("no blood","yes","some blood","nothing","blank"),
con1=c("5","6","4","2","0"),
unit1=c("g/l","cm/l","g/km","j/nm","t/mm"),
drug2=c("hhh","etoh*","hhh","etoh","blank"),
source1=c("no ","yes blood","some","nothing","blank"),
con1=c("6","7","8","9","1"),
unit1=c("cm/l","g/km","j/nm","t/mm","g/l"))
I am trying to create 2 new columns that will return the conc and units if 1) drug1 is etoh or etoh* and source1 contains the word blood, or 2) drug2 is etoh or etoh* and source2 contains the word blood.
I have tried the following code but am coming up with an error:
wordetoh <-c("etoh", "etoh*")
hairdf<-hairdf %>%
mutate(etohconc=if_else(
drug1 %in% wordetoh & grepl("blood",source1), con1 |
drug2 %in% wordetoh & grepl("blood",source2), con2,
""))
hairdf<-hairdf %>%
mutate(etohunit=if_else(
drug1 %in% wordetoh & grepl("blood",source1), unit1 |
drug2 %in% wordetoh & grepl("blood",source2), unit 2,
""))
Based on my data, the new columns should have the following responses
etohconc: 5,7,4, blank, blank
etohunit: g/l, g/km, g/km, blank, blank.
I'm going to assume that some of the columns in the sample dataset are misnamed and should actually look like this:
hairdf=data.frame(
id=c(1:5),
drug1=c("etoh*","hhh","etoh","hhhh","blank"),
source1=c("no blood","yes","some blood","nothing","blank"),
con1=c("5","6","4","2","0"),
unit1=c("g/l","cm/l","g/km","j/nm","t/mm"),
drug2=c("hhh","etoh*","hhh","etoh","blank"),
source2=c("no ","yes blood","some","nothing","blank"),
con2=c("6","7","8","9","1"),
unit2=c("cm/l","g/km","j/nm","t/mm","g/l"))
It sounds like your situation is one where you don't just have multiple conditions; you want different outcomes depending on those different conditions. For example, I infer that if drug 1 meets the conditions, then you want the concentration and unit of drug 1; but if drug 2 meets the conditions, then you want the concentration and unit of drug 2. if_else() doesn't do this; it takes a single boolean (which can be arbitrarily complex) and returns one of exactly two outcomes: one outcome if the boolean evaluates to true, and another if it evaluates to false.
case_when() is a good option for your situation. (Nested if_else() would work too, but I find case_when() much easier to read.) Give it a series of condition:outcome pairs, and it can handle all three cases you're interested in:
library(tidyverse)
# Drug names that count as a match (defined in the question)
wordetoh <- c("etoh", "etoh*")
hairdf = hairdf %>%
  mutate(etohconc = case_when(drug1 %in% wordetoh & grepl("blood", source1) ~ con1,
                              drug2 %in% wordetoh & grepl("blood", source2) ~ con2,
                              TRUE ~ NA_character_),   # neither drug qualifies
         etohunit = case_when(drug1 %in% wordetoh & grepl("blood", source1) ~ unit1,
                              drug2 %in% wordetoh & grepl("blood", source2) ~ unit2,
                              TRUE ~ NA_character_))
I have data in a dataframe in R like this:
Value | Metric
10 | KG
5 | lbs etc.
I want to create a new column (weight) where I can calculate a converted weight based on the Metric - something like if Metric = "Kg" then Value * 1, if Metric = "lbs" then Value * 2.20462
I also have another use case where I want to do a similar conditional calculation, but based on continuous values, i.e. if x >= 2 then "Classification", else if x >= 1 then "Classification 2", else "Other".
Any ideas that might work for both in R?
Does this work:
library(dplyr)
df %>% mutate(converted_wt = case_when(Metric == 'lbs' ~ Value * 2.20462, TRUE ~ Value))
  Value Metric converted_wt
1    10     KG      10.0000
2     5    lbs      11.0231
If you have other units apart from "KG" and "lbs" you need to include those in case_when condition accordingly.
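The second use case in the question (classifying a continuous value) fits the same case_when() pattern. This is a sketch using the thresholds and labels from the question; the data frame name scores and its x values are hypothetical.

```r
library(dplyr)

# Hypothetical values, chosen to hit each branch and one boundary
scores <- data.frame(x = c(2.5, 2, 1.5, 0.3))

scores <- scores %>%
  mutate(label = case_when(
    x >= 2 ~ "Classification",    # checked first
    x >= 1 ~ "Classification 2",  # only reached when x < 2
    TRUE   ~ "Other"              # everything else
  ))
```

case_when() evaluates the conditions in order and uses the first one that is true, so the thresholds should be listed from highest to lowest.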
The following code works....
sum( (WASDATj$HCNT == 1 | WASDATj$HCNT == -1 | WASDATj$HCNT == 0 ) & WASDATj$Region=='United States'
& WASDATj$Unit=='Million Bushels'
& WASDATj$Commodity=='Soybeans'
& WASDATj$Attribute == 'Production'
& WASDATj$Fdex.x == 10
,na.rm=TRUE
)
It counts the number of observations where HCNT takes a value of -1, 1, or 0, and returns a single number for this category.
The variable WASDATj$Fdex.x takes a value from 1-20.
How can I generalize this to count the number of observations that take a value -1,1,0 for each of the values of Fdex.x (so provide me 20 sums for Fdex.x from 1-20)? I did look for an answer, but I'm such a novice I may have missed what is an obvious answer....
Simply extend your sum of a boolean vector to the aggregate() function with FUN=length, which is essentially a count aggregation and is analogous to your sum of TRUE values:
agg_df <- aggregate(cbind(Count=HCNT) ~ Fdex.x,
data=WASDATj[WASDATj$HCNT %in% c(1,-1, 0) &
WASDATj$Region=='United States' &
WASDATj$Unit=='Million Bushels' &
WASDATj$Commodity=='Soybeans' &
WASDATj$Attribute=='Production', ],
FUN=length)
The result should be a data frame with two columns and one row per distinct Fdex.x value (20 rows, provided every value has at least one matching observation), each with its corresponding count.
And if needed, you can extend grouping for other counts by adjusting formula and data filter:
agg_df <- aggregate(cbind(Count=HCNT) ~ Fdex.x + Region + Unit + Commodity + Attribute,
data=WASDATj[WASDATj$HCNT %in% c(1,-1, 0), ],
FUN=length)
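To see the counting behaviour on something small, here is a sketch with an invented toy data frame. Only the column names HCNT and Fdex.x come from the question; the values are made up, and the other filter columns are left out for brevity.

```r
# Hypothetical toy data: three Fdex.x groups, some HCNT values outside -1/0/1
toy <- data.frame(
  Fdex.x = c(1, 1, 1, 2, 2, 3),
  HCNT   = c(1, 0, 2, -1, 1, 5)
)

# Keep only the rows of interest, then count rows per Fdex.x
keep   <- toy$HCNT %in% c(-1, 0, 1)
counts <- aggregate(HCNT ~ Fdex.x, data = toy[keep, ], FUN = length)
names(counts)[2] <- "Count"
```

In this toy data, Fdex.x = 3 has no qualifying rows, so it does not appear in counts at all. If zero counts are needed, one option is table(factor(toy$Fdex.x[keep], levels = 1:3)), which reports every level.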
Building off of this question Pass a data.frame with column names and fields as filter
Let's say we have the following data set:
filt = data.table(X1 = c("Gender","Male"),
X2 = c('jobFamilyGroup','Finance'),
X3 = c('jobFamilyGroup','Software Dev'))
df = data.table(Gender = c('Male','F','Male','Male','F'),
EmployeeStatus = c('Active','na','Active','Active','na'),
jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'))
and I want to use filt as a filter for df. filt is done by grabbing an input from Shiny and transforming it a bit to get me that data.table above. My goal is to filter df so we have: All rows that are MALE AND (Software Dev OR Finance).
Currently, I'm hardcoding it to always be an AND but that isn't ideal for situations like this. My thought would be to have multiple if conditions to catch things like this, but I feel like there could be an easier approach for building this logic in.
___________UPDATE______________
Once I have a table like filt I can pass code like:
if(!is.null(primary))
{
if(ncol(primary)==1){
d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1])]
}
else if(length(primary)==2){
d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1]) &
get(as.character(primary[1,2]))==as.character(primary[2,2])]
}
else{
d2 = df[get(as.character(primary[1,1]))==as.character(primary[2,1]) &
get(as.character(primary[1,2]))==as.character(primary[2,2]) &
get(as.character(primary[1,3]))==as.character(primary[2,3])]
}
}
But this code doesn't account for the OR logic needed when one grouping has multiple inputs. The current code asks for all rows where Gender == Male & Job Family Group == 'Finance' & Job Family Group == 'Software Dev', when really it should be Gender == Male & (Job Family Group == 'Finance' | Job Family Group == 'Software Dev').
This is a minimal example; the real data has many other columns, so ideally the solution can detect when a grouping has multiple inputs.
Given your problem, what if you parsed it so your logic looked like:
Gender %in% c("Male") & jobFamilyGroup %in% c('Finance','Software Dev')
By lumping all filter values with the same column name together in an %in% you get your OR and you keep your AND between column names.
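Applied to the sample df from the question, that expression might look like the sketch below. It assumes the data.table package is installed; the result name male_fin_dev is made up for illustration.

```r
library(data.table)

# Sample data copied from the question
df <- data.table(Gender = c('Male','F','Male','Male','F'),
                 EmployeeStatus = c('Active','na','Active','Active','na'),
                 jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'))

# %in% gives the OR within a column; & gives the AND between columns
male_fin_dev <- df[Gender %in% c("Male") &
                   jobFamilyGroup %in% c('Finance','Software Dev')]
```

Only the two Male/Finance rows are kept here: the Male/HR row fails the jobFamilyGroup test, and both Software Dev rows fail the Gender test.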
UPDATE
Consider the case discussed in comments below.
Your reactive inputs a data.table specifying
Gender IS Male
Country IS China OR US
EmployeeStatus IS Active
In the sample data you provided there is no country column, so I added one. I extract the columns to be filtered and the values to be filtered and split the values to be filtered by the columns. I pass this into an lapply which does the logical check for each column using an %in% rather than a == so that options within the same column are treated as an | instead of a &. Then I rbind the logical results together and apply an all to the columns and then filter df by the results.
This approach handles the & between columns and the | within columns. It supports any number of columns to be searched removing the need for your if/else logic.
library(data.table)
df = data.table(Gender = c('Male','F','Male','Male','F'),
EmployeeStatus = c('Active','na','Active','Active','na'),
jobFamilyGroup = c('Finance','Software Dev','HR','Finance','Software Dev'),
Country = c('China','China','US','US','China'))
filt = data.table(x1 = c('Gender' , 'Male'),x2 = c('Country' , 'China'),x3 = c('Country','US'), x4 = c('EmployeeStatus','Active'))
column = unlist(filt[1,])
value = unlist(filt[2,])
tofilter = split(value,column)
tokeep = apply(do.call(rbind,lapply(names(tofilter),function(x){
`[[`(df,x) %in% tofilter[[x]]
})),2,all)
df[tokeep==TRUE]
#> Gender EmployeeStatus jobFamilyGroup Country
#> 1: Male Active Finance China
#> 2: Male Active HR US
#> 3: Male Active Finance US