Sum a specific column based upon a condition - r

I have the following set of data:
Using R and tidyverse if possible I would like to sum column S based upon a condition on other columns. If my variable
condition_columns = c('A', 'B')
The output I am after is a data frame containing
Where the 490 is obtained by summing column S only when A=1 and the 250 comes from summing column S when B=1.
Could anyone suggest a tidyverse way of doing it?
Thank you,
Phil,

You can do this using summarize(across())
summarize(df, across(all_of(condition_columns), ~sum(S[.x==1])))
Output:
A B
1 490 250
Input:
structure(list(ID = 1:10, A = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
B = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0), S = c(10, 20, 30, 40,
50, 60, 70, 80, 90, 100)), class = "data.frame", row.names = c(NA,
-10L))

You may also use the following (easy to understand) code, which works because A and B contain only 0s and 1s:
df %>%
summarise(A = sum(A*S),
B = sum(B*S))
Output:
A B
1 490 250
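The multiplication approach relies on A and B being coded strictly as 0/1. As a minimal base-R sketch, assuming the df from the first answer, the two approaches can be compared side by side. The column C is hypothetical, added only to show where they diverge:

```r
# Data from the first answer
df <- data.frame(
  ID = 1:10,
  A = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
  B = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  S = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
)
condition_columns <- c("A", "B")

# Subsetting: sum S only where the flag column equals 1
by_subset <- sapply(df[condition_columns], function(flag) sum(df$S[flag == 1]))

# Multiplication: equivalent only while the flags are strictly 0/1
by_product <- sapply(df[condition_columns], function(flag) sum(flag * df$S))

# Hypothetical flag column holding a value other than 0/1
df$C <- c(2, rep(0, 9))
subset_C <- sum(df$S[df$C == 1])   # no row has C == 1
product_C <- sum(df$C * df$S)      # the 2 acts as a weight on S
```

If the flag columns can hold anything other than 0 and 1, the subsetting form is the safer of the two.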

Related

How to calculate mean value of all columns of data frame [duplicate]

I have a data frame and I want to calculate the mean of all columns and save the result in a new data frame. I found the solution in "calculate the mean for each column of a matrix in R"; however, it applies only to a matrix, not a data frame.
structure(list(TotFlArea = c(1232, 596, 708, 1052, 716), logg_weighted_assess = c(13.7765298160156,
13.1822275291412, 13.328376420438, 13.3076293132057, 13.5164823091252
), TypeDwel1.2.Duplex = c(0, 0, 0, 0, 0), TypeDwelApartment.Condo = c(0,
1, 1, 1, 1), TypeDwelTownhouse = c(1, 0, 0, 0, 0), Age_new.70 = c(0,
0, 0, 0, 0), Age_new0.1 = c(0, 0, 0, 0, 0), Age_new16.40 = c(1,
1, 0, 1, 0), Age_new2.5 = c(0, 0, 0, 0, 0), Age_new41.70 = c(0,
0, 0, 0, 0), Age_new6.15 = c(0, 0, 1, 0, 1), LandFreehold = c(1,
1, 1, 0, 1), LandLeasehold.prepaid = c(0, 0, 0, 1, 0), LandOthers = c(0,
0, 0, 0, 0), cluster_K_mean.1 = c(0, 0, 0, 0, 0)), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
Can you please advise how I can do this?
Note: my data frame can have NA values, which should be excluded from the mean calculation.
As @akrun pointed out. Another alternative is:
apply(df, 2, mean)
where 2 means by column and 1 is by row.
However, despite its flexibility (e.g. swapping mean for another summary function, or applying it to selected columns only with apply(df[, c('a', 'b')], 2, mean)), apply has a disadvantage in terms of speed, as the benchmark below shows:
library(data.table)
library(microbenchmark)
# dummy data
x <- 1e7
df <- data.table(a = 1:x )
y <- letters[2:10]
df[, (y) := lapply(2:10, \(i) a+i)]
# benchmark
z <-
microbenchmark(colMeans = {colMeans(df)}
, apply = {apply(df, 2, mean)}
, times = 30
)
plot(z)
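The question also asks for NA values to be excluded from the mean. Both colMeans() and mean() accept na.rm = TRUE. A minimal sketch on a small made-up data frame (df_na and its values are illustrative, not taken from the question):

```r
# Illustrative data with one NA per column
df_na <- data.frame(a = c(1, 2, NA, 4), b = c(10, NA, 30, 50))

# Column means, ignoring NA values
means <- colMeans(df_na, na.rm = TRUE)

# The same with apply(); extra arguments are passed on to mean()
means_apply <- apply(df_na, 2, mean, na.rm = TRUE)

# Store the result as a one-row data frame, one column per original column
means_df <- as.data.frame(t(means))
```

Without na.rm = TRUE, any column containing an NA would return NA as its mean.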

StyleInterval with cut in column

I recently started exploring DT and I am stuck on something. Imagine the following table:
dt <- data.table(group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
group2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
interval = c(NA, NA, 100, NA, NA, 150, NA, NA, 100),
value1 = c(1000, 10, 90, 2000, 30, 120, 1500, 25, 150),
value2 = c(1200, 10, 110, 2500, 35, 145, 2200, 40, 90))
Now I want to create a datatable with a style that checks the values in value1 and value2 and compares them with the value in interval. I tried something like this:
datatable(dt) %>% formatStyle(
columns = c("value1", "value2"),
backgroundColor = styleInterval(interval, c("red", "green"))
)
But interval is not recognized as an object. This leads me to believe that I cannot pass a column in the cut parameter. I also tried to pass some kind of function in the valueColumns but this didn't seem to be possible either.
Expected output:
It makes no sense to pass the full column to styleInterval(). It requires n sorted cut points for cuts and n + 1 entries for values. Try the alternative below instead:
myCut <- sort(unique(dt$interval))
myCol <- rainbow(length(myCut) + 1)
formatStyle(datatable(dt),
columns = c("value1", "value2"),
backgroundColor = styleInterval(myCut, myCol))
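To see why n cut points need n + 1 colours, the mapping can be mimicked in base R with findInterval(). This is only a sketch of the idea, not DT's implementation; it assumes intervals closed on the right (a value equal to a cut point falls in the lower interval), and the colour names are placeholders:

```r
# Two cut points split the number line into three intervals
cuts <- c(100, 150)
cols <- c("red", "orange", "green")   # one colour per interval: n + 1

value1 <- c(1000, 10, 90, 2000, 30, 120, 1500, 25, 150)

# left.open = TRUE gives intervals (-Inf, 100], (100, 150], (150, Inf)
idx <- findInterval(value1, cuts, left.open = TRUE) + 1
mapped <- data.frame(value1, colour = cols[idx])
```

Note that this still applies the same cut points to every row; it does not compare each row against its own interval value.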

Assign existing vector as replacement in case_when?

I am trying to assign ranks for an IBI. One attribute has a condition that requires me to assign the average of all other attributes when Alkalinity < 25 mg/L. Can I assign an existing vector/column of attribute score means? I've tried the code below to first compute the mean for each row (but I'm not sure this is correct):
WWMBIscores <- WWMBI[c(1:126) , c(29, 33, 35, 37, 39, 41, 43)]
ScoreMeans <- rowMeans(WWMBIscores)
This code should then assign the value from "ScoreMeans" as the rank when alkalinity is less than 25 mg/L
Mollusca_IBI <-
WWMBI %>%
mutate(
Mollusca_Abund = case_when(
Alkalinity <= 5 ~ ScoreMeans,
Mollusca_Abund <= 1 ~ 0 ,
Mollusca_Abund >= 1 & Mollusca_Abund <= 9 ~ 1 ,
Mollusca_Abund >= 9 & Mollusca_Abund <= 99 ~ 3 ,
Mollusca_Abund >= 99 ~ 5
)
) %>% select(Mollusca_Abund)
It doesn't appear to be assigning means, but rather 0s as the value. I've included a small subset of the 7 ranks I have calculated, assume some of those are for areas where alkalinity is <25.
structure(list(MargDI_IBI = c(1, 1, 1, 3, 3), Pleid_IBI = c(1, 0, 0, 5, 1), Corixid_IBI = c(2, 0, 2, 0, 2), Trichop_IBI = c(0, 0, 0, 0, 0), Stratio_IBI = c(0, 0, 0, 0, 0), NonInsect_IBI = c(1, 0, 3, 3, 1), Insect_IBI = c(1, 1, 1, 3, 3)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
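As a hedged sketch of the general pattern: case_when() accepts a vector on the right-hand side as long as it has one element per row, and the first matching condition wins. The toy data, the name score_means, and the thresholds below are all made up for illustration. Note also that the code in the question tests Alkalinity <= 5 rather than < 25, and that ScoreMeans is built from rows 1:126 only, so its length must equal the number of rows in WWMBI for this to work.

```r
library(dplyr)

# Toy data; column names follow the question, values are invented
toy <- tibble(
  Alkalinity     = c(10, 30, 40, 20),
  Mollusca_Abund = c(50, 0, 5, 200)
)

# Hypothetical per-row means of the other attribute scores
score_means <- c(1.5, 2.0, 0.5, 3.0)

result <- toy %>%
  mutate(
    Mollusca_rank = case_when(
      Alkalinity < 25      ~ score_means,  # low alkalinity: use the row's mean
      Mollusca_Abund < 1   ~ 0,
      Mollusca_Abund < 10  ~ 1,
      Mollusca_Abund < 100 ~ 3,
      TRUE                 ~ 5
    )
  )
```

Writing the result to a new column (Mollusca_rank here) rather than overwriting Mollusca_Abund makes it easier to check which branch each row took.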

How to select value in one column based on another value in the same column, to then use in an equation?

Example data set in the picture
If percentage is 35 with their corresponding response values (average of the response values = A), I want to select another set of percentages that equal 100-35 = 65 and those corresponding responses (average of the response values = B). These percentages and responses are in the same columns.
I want to multiply A by 65% and B by 35%, before adding the two together.
I would want to do that for an entire data set so that I can get a "flipped" percentage and response accuracy.
Perhaps this code can help you
with(
aggregate(response ~ ., df, mean),
sum(
rowMeans(
cbind(
response * (1 - percentage / 100),
response[match(100 - percentage, percentage)] * percentage / 100
)
)
)
)
which gives
[1] 2.175
Data
> dput(df)
structure(list(percentage = c(35, 35, 40, 40, 45, 45, 50, 50,
55, 55, 60, 60, 65, 65), response = c(1, 1, 1, 1, 1, 0, 1, 0,
0, 1, 0, 0, 0, 1)), class = "data.frame", row.names = c(NA, -14L
))
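If a per-percentage value is wanted rather than a single total, one possible reading of the question (A weighted by 100 minus the percentage, plus B weighted by the percentage) can be sketched in base R as follows. The names avg, partner, and flipped are introduced here for illustration only:

```r
df <- data.frame(
  percentage = c(35, 35, 40, 40, 45, 45, 50, 50, 55, 55, 60, 60, 65, 65),
  response   = c(1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1)
)

# Average response per percentage (A for each row of avg)
avg <- aggregate(response ~ percentage, df, mean)

# Average response of the complementary percentage, 100 - percentage (B)
partner <- avg$response[match(100 - avg$percentage, avg$percentage)]

# A weighted by (100 - percentage)%, B weighted by percentage%
avg$flipped <- avg$response * (100 - avg$percentage) / 100 +
  partner * avg$percentage / 100
```

For percentage 35 this gives 1 * 0.65 + 0.5 * 0.35 = 0.825. A percentage whose complement is absent from the data would yield NA, because match() finds no partner.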

Error in if (is.double(data$x) && !has_groups(data) && any(data$x != data$x[1L])) { : missing value where TRUE/FALSE needed

I'm trying to use ggplot, and am hoping to create a boxplot that has four categories on the x-axis for suspension data (low, lowish, highish, high) and farms on the y-axis.
I think I have broken the suspension column into four groups, but ggplot is upset with me. Here is the error:
```
Error in if (is.double(data$x) && !has_groups(data) && any(data$x != data$x[1L])) { : missing value where TRUE/FALSE needed
```
Here is my code:
```{r}
# To break suspension_rate_total_pct data into groups for clearer visualization, I found the min, and max
merged_data$suspension_rate_total_pct <-
as.numeric(merged_data$suspension_rate_total_pct)
max(merged_data$suspension_rate_total_pct, na.rm=TRUE)
min(merged_data$suspension_rate_total_pct, na.rm=TRUE)
low_suspension <- merged_data$suspension_rate_total_pct > 0 & merged_data$suspension_rate_total_pct < 0.5
low_ish_suspension <- merged_data$suspension_rate_total_pct > 0.5 & merged_data$suspension_rate_total_pct < 1
high_ish_suspension <- merged_data$suspension_rate_total_pct > 1 & merged_data$suspension_rate_total_pct < 1.5
high_suspension <- merged_data$suspension_rate_total_pct > 1.5 & merged_data$suspension_rate_total_pct < 2
ggplot(merged_data, aes(x = suspension_rate_total_pct , y = farms_pct)) +
geom_boxplot()
```
Here is the Data:
merged_data <- structure(list(schid = c("1030642", "1030766", "1030774", "1030840",
"1130103", "1230150"), enrollment = c(159, 333, 352, 430, 102,
193), farms = c(132, 116, 348, 406, 68, 130), foster = c(2, 0,
1, 8, 1, 4), homeless = c(14, 0, 8, 4, 1, 4), migrant = c(0,
0, 0, 0, 0, 0), ell = c(18, 12, 114, 45, 7, 4), suspension_rate_total = c(NA,
20, 0, 0, 95, 5), suspension_violent = c(NA, 9, 0, 0, 20, 2),
suspension_violent_no_injury = c(NA, 6, 0, 0, 47, 1), suspension_weapon = c(NA,
0, 0, 0, 8, 0), suspension_drug = c(NA, 0, 0, 0, 9, 1), suspension_defiance = c(NA,
1, 0, 0, 9, 1), suspension_other = c(NA, 4, 0, 0, 2, 0),
farms_pct = c(0.830188679245283, 0.348348348348348, 0.988636363636364,
0.944186046511628, 0.666666666666667, 0.673575129533679),
foster_pct = c(0.0125786163522013, 0, 0.00284090909090909,
0.0186046511627907, 0.00980392156862745, 0.0207253886010363
), migrant_pct = c(0, 0, 0, 0, 0, 0), ell_pct = c(0.113207547169811,
0.036036036036036, 0.323863636363636, 0.104651162790698,
0.0686274509803922, 0.0207253886010363), homeless_pct = c(0.0880503144654088,
0, 0.0227272727272727, 0.00930232558139535, 0.00980392156862745,
0.0207253886010363), suspension_rate_total_pct = c(NA, 2,
1, 1, 2, 2)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
If you can, please help me appease ggplot so that it will give me a beautiful visualization. Currently, this feels like a one-sided, emotional rollercoaster of a relationship.
Just a short answer; I am sure you can figure out the rest by yourself (otherwise, post a follow-up question).
Since the data you provided has NAs in the first row of several columns, I can only demonstrate the principle of how to get your desired result, using the merged_data$homeless values as the grouping input for the boxplots; the data (y-value) will still be farms.
# first we create our groups of low, middle & high amount of homeless
merged_data2<- merged_data %>% mutate(homelessgroup= ifelse(homeless < 4, "low",
ifelse(homeless <= 8, "middle",
ifelse(homeless > 8, "high",NA ))))
## then we plot the data using ggplot
ggplot(merged_data2,aes(y=farms,fill=homelessgroup))+geom_boxplot()
I think you can just use cut() with your data to partition into 4 groups. Then you can use that variable with the plot
merged_data <- transform(merged_data,
group = cut(
suspension_rate_total_pct,
c(0, .5, 1, 1.5, 2),
include.lowest = TRUE,
labels = c("low", "lowish", "highish", "high")))
ggplot(merged_data, aes(x = group , y = farms_pct)) +
geom_boxplot()
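The condition quoted in the error message gives a hint about the cause: when x is a numeric column whose first element is NA, the comparison against data$x[1L] yields NA, and if() cannot branch on NA. A minimal base-R sketch using the suspension_rate_total_pct values from the question (the variable names x, check, and group are illustrative):

```r
# suspension_rate_total_pct from the question; note the leading NA
x <- c(NA, 2, 1, 1, 2, 2)

# The comparison from the error message: every element compared with NA is NA
check <- any(x != x[1L])
is.na(check)   # if (check) would fail with "missing value where TRUE/FALSE needed"

# cut() turns the numeric column into a factor, so that check no longer applies
group <- cut(
  x,
  breaks = c(0, .5, 1, 1.5, 2),
  include.lowest = TRUE,
  labels = c("low", "lowish", "highish", "high")
)
table(group, useNA = "ifany")
```

The NA row stays NA after cut(), so it may be worth dropping it (for example with !is.na(group)) before plotting if an NA category is not wanted.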
