Perform conditional calculations in an R data frame

I have data in a dataframe in R like this:
Value | Metric
------|-------
10    | KG
5     | lbs
(etc.)
I want to create a new column (weight) where I can calculate a converted weight based on the Metric - something like if Metric == "KG" then Value * 1, if Metric == "lbs" then Value * 2.20462.
I also have another use case where I want to do a similar conditional calculation, but based on continuous values, i.e. if x >= 2 then "Classification", else if x >= 1 then "Classification 2", else "Other".
Any ideas that might work for both in R?

Does this work:
library(dplyr)
df %>%
  mutate(converted_wt = case_when(
    Metric == 'lbs' ~ Value * 2.20462,
    TRUE ~ Value
  ))

  Value Metric converted_wt
1    10     KG      10.0000
2     5    lbs      11.0231
If you have other units apart from "KG" and "lbs", you need to include those in the case_when() conditions accordingly.
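The same pattern covers the second use case: case_when() checks its conditions top to bottom, so the x >= 1 branch only catches values below 2. A minimal sketch with made-up data:

```r
library(dplyr)

# Made-up data for the second use case; thresholds are from the question
df2 <- data.frame(x = c(2.5, 1.4, 0.3))

df2 <- df2 %>%
  mutate(class = case_when(
    x >= 2 ~ "Classification",
    x >= 1 ~ "Classification 2",
    TRUE   ~ "Other"
  ))
```

Because the branches are tried in order, there is no need to write `x >= 1 & x < 2` explicitly.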

Related

Multi conditional case_when in R

I'm trying to add a new column (color) to my data frame. The value in the row depends on the values in two other columns. For example, when the class value is equal to 4 and the Metro_status value is equal to Metro, I want a specific value returned in the corresponding row in the new column. I tried doing this with case_when using dplyr and it worked... to an extent.
The majority of the color values output into the color column don't line up with the defined conditions. For example, the first row's (Nome Census Area) color value should be "#fcc48b" but instead is "#d68182".
What am I doing wrong?? TIA!
Here's my code:
#set working directory
setwd("C:/Users/weirc/OneDrive/Desktop/Undergrad Courses/Fall 2021 Classes/GHY 3814/final project/data")
#load packages
library(readr)
library(dplyr)
#load data
counties <- read_csv("vaxData_counties.csv")
#create new column for class
updated_county_data <- counties %>%
  mutate(class = case_when(
    Series_Complete >= 75 ~ 4,
    Series_Complete >= 50 ~ 3,
    Series_Complete >= 25 ~ 2,
    TRUE ~ 1
  ), color = case_when(
    class == 4 | Metro_status == 'Metro' ~ '#d62023',
    class == 4 | Metro_status == 'Non-metro' ~ '#d68182',
    class == 3 | Metro_status == 'Metro' ~ '#fc9126',
    class == 3 | Metro_status == 'Non-metro' ~ '#fcc48b',
    class == 2 | Metro_status == 'Metro' ~ '#83d921',
    class == 2 | Metro_status == 'Non-metro' ~ '#abd977',
    class == 1 | Metro_status == 'NA' ~ '#7a7a7a'
  ))
View(updated_county_data)
write.csv(updated_county_data, file="county_data_manip/updated_county_data.csv")
Here's what the data frame looks like
Remark 1:
when the class value is equal to 4 and the Metro_status value is equal to Metro
In R (and many programming languages) & is the "and". You're using |, which is "or".
Remark 2:
Consider simplifying the first four lines to two, since Metro status doesn't affect the color for classes 4 & 3
Remark 3:
To calculate class, consider base::cut(), because it's adequate, yet simpler than dplyr::case_when().
Here's my preference when escalating the complexity of recoding functions:
https://ouhscbbmc.github.io/data-science-practices-1/coding.html#coding-simplify-recoding
Remark 4:
This was a good SO post, but see if you can improve your next one.
Read and incorporate elements from How to make a great R reproducible example?. Especially the aspects of using dput() for the input and then an explicit example of your expected dataset.
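Putting Remarks 1 and 3 together, here is a sketch of a corrected pipeline. The data below is made up; the hex codes and thresholds are copied from the question. Note that the original final branch compared Metro_status against the string 'NA' rather than a real missing value; a TRUE catch-all handles that case too:

```r
library(dplyr)

# Made-up stand-in for the counties data
counties <- data.frame(
  Series_Complete = c(80, 55, 30, 10),
  Metro_status = c("Metro", "Non-metro", "Metro", NA)
)

updated_county_data <- counties %>%
  mutate(
    # Remark 3: base::cut() replaces the first case_when();
    # labels = FALSE returns the bin number 1-4 directly
    class = cut(Series_Complete,
                breaks = c(-Inf, 25, 50, 75, Inf),
                labels = FALSE, right = FALSE),
    # Remark 1: '&' so BOTH conditions must hold for a branch to fire
    color = case_when(
      class == 4 & Metro_status == "Metro"     ~ "#d62023",
      class == 4 & Metro_status == "Non-metro" ~ "#d68182",
      class == 3 & Metro_status == "Metro"     ~ "#fc9126",
      class == 3 & Metro_status == "Non-metro" ~ "#fcc48b",
      class == 2 & Metro_status == "Metro"     ~ "#83d921",
      class == 2 & Metro_status == "Non-metro" ~ "#abd977",
      TRUE                                     ~ "#7a7a7a"  # class 1 or missing status
    )
  )
```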

row index of "looked at" row case_when in R

I'm currently struggling with a coding task concerning the use of a case_when statement in R.
In general I would like to use the row index of the row currently looked at by the case_when statement in the assignment part.
A short explanation of the data: I have a large data.frame with a date column, a geo layer column, and some numeric columns with numbers for the calculations.
The data.frame doesn't have any sorting, and not every point in time necessarily has all geo layers in the data.frame. Sadly I can't provide a real data set due to legal issues.
The task at hand is to compute, on the one hand, simple mathematical operations for the same point in time and, on the other, mathematical operations across different points in time for the same geo layer and numeric value.
The mathematical operations vary, as does the interval between the time points.
For instance I need to calculate a change rate to the last quarter and last year of the value:
((current_value - last_quarter_value) / current_value)*100
This is how I'd like to code it.
library(tidyverse)
test_dataframe <- data.frame(
  times = c(rep(as.Date("2021-03-01"), 2), rep(as.Date("2020-12-01"), 2)),
  geo_layer = rep(c("001001001", "001001002"), 2),
  numeric_value_a = 1:4,
  numeric_value_b = 4:1,
  numeric_value_c = c(1, NA, 3, 1)
)

check_comparison_times <- unique(test_dataframe$times)

test_dataframe <- test_dataframe %>%
  mutate(
    normal_calculation = case_when(
      !is.na(numeric_value_c) ~ (numeric_value_a + numeric_value_b) / numeric_value_c,
      TRUE ~ Inf
    ),
    time_comparison = case_when(
      is.na(numeric_value_c) ~ Inf,
      (times - months(3)) %in% check_comparison_times ~ test_dataframe[
        which(
          test_dataframe[, "times"] ==
            (test_dataframe[row_index_of_current_looked_at_row, "times"] - months(3)) &
          test_dataframe[, "geo_layer"] ==
            test_dataframe[row_index_of_current_looked_at_row, "geo_layer"]
        ),
        "numeric_value_c"] - test_dataframe[row_index_of_current_looked_at_row, "numeric_value_c"],
      TRUE ~ -Inf
    )
  )
With this desired outcome:
times geo_layer numeric_value_a numeric_value_b numeric_value_c normal_calculation time_comparison
1 2021-03-01 001001001 1 4 1 5.000000 2
2 2021-03-01 001001002 2 3 NA Inf Inf
3 2020-12-01 001001001 3 2 3 1.666667 -Inf
4 2020-12-01 001001002 4 1 1 5.000000 -Inf
Currently I solve the problem with a triple loop in which I first pair the values by time, then by geo_layer, and then execute the mathematical operation.
Since my data set is much, much larger than this, that solution is very inefficient.
Thanks for your help.
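One way to avoid needing the current row index entirely is to shift the dates forward one quarter and join the table to itself on date and geo layer. A sketch of that idea, assuming lubridate for the date arithmetic (the `prev_quarter` and `result` names are illustrative):

```r
library(dplyr)
library(lubridate)

test_dataframe <- data.frame(
  times = c(rep(as.Date("2021-03-01"), 2), rep(as.Date("2020-12-01"), 2)),
  geo_layer = rep(c("001001001", "001001002"), 2),
  numeric_value_c = c(1, NA, 3, 1)
)

# Shift every row's date forward one quarter, so that after a join on
# (times, geo_layer) each row sees the value from the previous quarter.
prev_quarter <- test_dataframe %>%
  transmute(times = times %m+% months(3),
            geo_layer,
            prev_c = numeric_value_c)

result <- test_dataframe %>%
  left_join(prev_quarter, by = c("times", "geo_layer")) %>%
  mutate(time_comparison = case_when(
    is.na(numeric_value_c) ~ Inf,                     # current value missing
    !is.na(prev_c)         ~ prev_c - numeric_value_c,
    TRUE                   ~ -Inf                     # no row one quarter back
  ))
```

On the example data this reproduces the desired time_comparison column (2, Inf, -Inf, -Inf), and because it is a single join rather than per-row lookups, it should scale far better than a triple loop.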

Summarizing outcomes by groups in R

The following code works....
sum((WASDATj$HCNT == 1 | WASDATj$HCNT == -1 | WASDATj$HCNT == 0) &
      WASDATj$Region == 'United States' &
      WASDATj$Unit == 'Million Bushels' &
      WASDATj$Commodity == 'Soybeans' &
      WASDATj$Attribute == 'Production' &
      WASDATj$Fdex.x == 10,
    na.rm = TRUE)
It counts the number of observations where HCNT takes a value of -1, 1, or 0, and provides a single number for this category.
The variable WASDATj$Fdex.x takes a value from 1-20.
How can I generalize this to count the number of observations that take a value -1,1,0 for each of the values of Fdex.x (so provide me 20 sums for Fdex.x from 1-20)? I did look for an answer, but I'm such a novice I may have missed what is an obvious answer....
Simply extend your sum of a boolean vector to an aggregate() call using length, which is essentially a count aggregation and analogous to your sum of TRUEs:
agg_df <- aggregate(cbind(Count=HCNT) ~ Fdex.x,
data=WASDATj[WASDATj$HCNT %in% c(1,-1, 0) &
WASDATj$Region=='United States' &
WASDATj$Unit=='Million Bushels' &
WASDATj$Commodity=='Soybeans' &
WASDATj$Attribute=='Production', ],
FUN=length)
Result should be a data frame of 20 rows by two columns for each distinct Fdex.x value and corresponding count.
And if needed, you can extend grouping for other counts by adjusting formula and data filter:
agg_df <- aggregate(cbind(Count=HCNT) ~ Fdex.x + Region + Unit + Commodity + Attribute,
data=WASDATj[WASDATj$HCNT %in% c(1,-1, 0), ],
FUN=length)
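For comparison, the same count can be sketched with dplyr; the column names are taken from the question, and the toy WASDATj below is made up just so the snippet runs:

```r
library(dplyr)

# Toy stand-in for WASDATj
WASDATj <- data.frame(
  HCNT = c(1, -1, 0, 2, 1),
  Region = "United States",
  Unit = "Million Bushels",
  Commodity = "Soybeans",
  Attribute = "Production",
  Fdex.x = c(1, 1, 2, 2, 2)
)

counts <- WASDATj %>%
  filter(HCNT %in% c(-1, 0, 1),
         Region == "United States",
         Unit == "Million Bushels",
         Commodity == "Soybeans",
         Attribute == "Production") %>%
  count(Fdex.x, name = "Count")  # one row per distinct Fdex.x value
```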

How to calculate percentage of dataframe column in R with condition?

I would like to find out how to calculate the percentage of a column based on a condition.
My table looks like this:
url | call_count
-------|-----------
bbc.com| 1
bbc.com| 1
bbc.com| 1
bbc.com| 1
ao.com | 0
ab.com | 2
I would like to group the table by the url column and calculate a new column called "percent_calling". This is based on a condition: where the call_count value is greater than 0, calculate it as a percent of the whole column. This is basically just the % calling, since a value > 0 means they made a call.
I'm currently stuck on how to do this with dplyr. The closest I have got is the following:
df %>%
  group_by(url) %>%
  summarise(percent_calling = sum(call_count) / nrow(df))
but as you can see I cannot add a condition, i.e. call_count > 0.
Your data:
df <- data.frame(
  stringsAsFactors = FALSE,
  url = c("bbc.com", "bbc.com", "bbc.com", "bbc.com", "ao.com", "ab.com"),
  call_count = c(1, 1, 1, 1, 0, 2)
)
Does the following work for you?
df %>%
  group_by(url) %>%
  summarise(sum_calling = sum(call_count)) %>%
  mutate(percent_calling = sum_calling / sum(sum_calling) * 100) %>%
  select(-sum_calling)  # remove the sum if not required

  url     percent_calling
  <chr>             <dbl>
1 ab.com             33.3
2 ao.com              0
3 bbc.com            66.7
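Note the above treats each url's total calls as a share of all calls. If the goal is specifically the call_count > 0 condition, i.e. the share of all rows in which a given url made at least one call, the condition can go directly inside summarise(); a sketch:

```r
library(dplyr)

df <- data.frame(
  url = c("bbc.com", "bbc.com", "bbc.com", "bbc.com", "ao.com", "ab.com"),
  call_count = c(1, 1, 1, 1, 0, 2)
)

callers <- df %>%
  group_by(url) %>%
  # call_count > 0 keeps only rows where a call was made;
  # nrow(df) is the size of the whole table
  summarise(percent_calling = sum(call_count > 0) / nrow(df) * 100)
```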

Obtaining mean of dataframe by value groups in another dataframe

Once again I consult your wisdom.
I have 2 dataframes of the form:
**data1sample**
ID value
water 3
water 5
fire 1
fire 3
fire 2
air 1
**data2controls**
ID value
water 1
fire 3
air 5
I want to use the values in my control dataframe (data2controls) and know their corresponding percentile in the sample distribution (data1sample). I have to classify each sample by their ID (meaning, get fire control against fire sample, and water against water, etc), but I haven't been able to do so.
I am using the command:
mean(data1sample[data1sample$ID == data2controls$ID,] <= data2controls$value)
but I get the error
In Ops.factor(left, right) : ‘<=’ not meaningful for factors
What I am after is basically the percentile of the value in dataframe2 calculated based on the samples of dataframe1 (I am trying to obtain the percentile as in percentile = mean(data1sample$value(by ID) <= dataframe2$value))
So something like this:
**data2controls**
ID value percentile(based on data1 sample values)
water 1 .30
fire 3 .14
air 5 .1
Please disregard the percentile values, they're just made up to show desired output.
I'd love if someone could give me a hand! Thanks!!
It's hard to answer without the desired output, but I will try to guess it here:
library(dplyr)
data1sample <- data.frame(ID = c("water", "water", "fire", "fire", "fire", "air"), value = c(3,5,1,3,2,1))
data2sample <- data.frame(ID = c("water", "fire", "air"), value = c(1,3,5))
by_ID <- data1sample %>% group_by(ID) %>% summarise(control = mean(value))
data2sample %>% inner_join(by_ID)
#> Joining, by = "ID"
#> ID value control
#> 1 water 1 4
#> 2 fire 3 2
#> 3 air 5 1
This gives the result I think you're after?
for(i in d2$ID){
  x <- mean(d1[d1$ID == i, 'value'] <= d2[d2$ID == i, 'value'])
  print(x)
}
Based on the data you provided it returns 0 for water, because no 'water' sample values fall at or below the control value.
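A vectorized alternative to a per-ID loop is to join the controls onto the samples and take the mean of the comparison per group; this computes the percentile = mean(value <= control) formula from the question:

```r
library(dplyr)

data1sample <- data.frame(ID = c("water", "water", "fire", "fire", "fire", "air"),
                          value = c(3, 5, 1, 3, 2, 1))
data2controls <- data.frame(ID = c("water", "fire", "air"),
                            value = c(1, 3, 5))

percentiles <- data1sample %>%
  inner_join(data2controls, by = "ID", suffix = c("_sample", "_control")) %>%
  group_by(ID) %>%
  # fraction of sample values at or below the control value
  summarise(percentile = mean(value_sample <= value_control))
```

On this data, water comes out as 0 (no sample values at or below 1), while fire and air come out as 1.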
