I have the following set of data:
Using R, and the tidyverse if possible, I would like to sum column S based on a condition on other columns. Given the variable
condition_columns = c('A', 'B')
the output I am after is a data frame containing
Where the 490 is obtained by summing column S only when A=1 and the 250 comes from summing column S when B=1.
Could anyone suggest a tidyverse way of doing it?
Thank you,
Phil,
You can do this using summarize(across())
summarize(df, across(all_of(condition_columns), ~sum(S[.x==1])))
Output:
A B
1 490 250
Input:
structure(list(ID = 1:10, A = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
B = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0), S = c(10, 20, 30, 40,
50, 60, 70, 80, 90, 100)), class = "data.frame", row.names = c(NA,
-10L))
You may use the following (easy to understand) code, which works because A and B only contain 0 and 1:
df %>%
  summarise(A = sum(A * S),
            B = sum(B * S))
Output:
A B
1 490 250
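If the column names live in a variable, as in the question, the same sums can also be computed in base R without naming each column by hand. This is only a minimal sketch that reuses the sample input from the first answer; `sums` is an illustrative name, not something from the original posts.

```r
# Sample input copied from the first answer
df <- data.frame(
  ID = 1:10,
  A = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
  B = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  S = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
)
condition_columns <- c("A", "B")

# For each condition column, sum S over the rows where that column equals 1
sums <- sapply(df[condition_columns], function(x) sum(df$S[x == 1]))
sums
#   A   B
# 490 250
```

Because it tests `x == 1` explicitly, this version does not depend on the condition columns containing only 0 and 1.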
I have a data frame and I want to calculate the mean of every column and save the result in a new data frame. I found the solution in "calculate the mean for each column of a matrix in R"; however, it only covers matrices, not data frames.
structure(list(TotFlArea = c(1232, 596, 708, 1052, 716), logg_weighted_assess = c(13.7765298160156,
13.1822275291412, 13.328376420438, 13.3076293132057, 13.5164823091252
), TypeDwel1.2.Duplex = c(0, 0, 0, 0, 0), TypeDwelApartment.Condo = c(0,
1, 1, 1, 1), TypeDwelTownhouse = c(1, 0, 0, 0, 0), Age_new.70 = c(0,
0, 0, 0, 0), Age_new0.1 = c(0, 0, 0, 0, 0), Age_new16.40 = c(1,
1, 0, 1, 0), Age_new2.5 = c(0, 0, 0, 0, 0), Age_new41.70 = c(0,
0, 0, 0, 0), Age_new6.15 = c(0, 0, 1, 0, 1), LandFreehold = c(1,
1, 1, 0, 1), LandLeasehold.prepaid = c(0, 0, 0, 1, 0), LandOthers = c(0,
0, 0, 0, 0), cluster_K_mean.1 = c(0, 0, 0, 0, 0)), row.names = c("1",
"2", "3", "4", "5"), class = "data.frame")
Can you please advise how I can do this?
Note: my data frame can have NA values which should be excluded from mean calculation
As @akrun pointed out. Another alternative, with na.rm = TRUE so that NA values are excluded from the mean:
apply(df, 2, mean, na.rm = TRUE)
where 2 means by column and 1 is by row.
However, despite its flexibility (e.g. swapping mean for another summary function, or applying it to selected columns only with apply(df[, c('a', 'b')], 2, mean)), the benchmark below shows the disadvantage of apply compared with colMeans in terms of speed:
library(data.table)
library(microbenchmark)
# dummy data
x <- 1e7
df <- data.table(a = 1:x )
y <- letters[2:10]
df[, (y) := lapply(2:10, \(i) a+i)]
# benchmark
z <-
microbenchmark(colMeans = {colMeans(df)}
, apply = {apply(df, 2, mean)}
, times = 30
)
plot(z)
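For the original request (column means of a data frame, NAs excluded, stored in a new data frame), a minimal sketch with `colMeans` might look like the following. The toy data and the names `means` and `means_df` are made up for illustration.

```r
# Hypothetical toy data with one missing value
df <- data.frame(a = c(1, 2, NA), b = c(4, 5, 6))

# Column means with NAs excluded, returned as a named numeric vector
means <- colMeans(df, na.rm = TRUE)

# Turn the named vector into a one-row data frame, one column per variable
means_df <- as.data.frame(t(means))
means_df
#     a b
# 1 1.5 5
```

`colMeans` accepts a data frame directly as long as all columns are numeric, so no conversion to a matrix is needed first.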
I recently started exploring DT and I am stuck on something. Imagine the following table:
dt <- data.table(group = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
group2 = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
interval = c(NA, NA, 100, NA, NA, 150, NA, NA, 100),
value1 = c(1000, 10, 90, 2000, 30, 120, 1500, 25, 150),
value2 = c(1200, 10, 110, 2500, 35, 145, 2200, 40, 90))
Now I want to create a DT table with a style that checks the values in value1 and value2 and compares them with the value in interval. I tried something like this:
datatable(dt) %>% formatStyle(
columns = c("value1", "value2"),
backgroundColor = styleInterval(interval, c("red", "green"))
)
But interval is not recognized as an object. This leads me to believe that I cannot pass a column as the cuts argument. I also tried to pass some kind of function in valueColumns, but this didn't seem to be possible either.
Expected output:
It makes no sense to pass the full column to styleInterval. It requires n sorted values for cuts and n + 1 values for values. Try the alternative below instead:
myCut <- sort(unique(dt$interval))
myCol <- rainbow(length(myCut) + 1)
formatStyle(datatable(dt),
columns = c("value1", "value2"),
backgroundColor = styleInterval(myCut, myCol))
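To see why n cut points need n + 1 values, the mapping can be mimicked in base R. This is only a sketch of the idea, not DT's implementation: it assumes the intervals are right-closed (a value equal to a cut point falls in the lower interval), which is my reading of the DT documentation, and the three colors are placeholders rather than the output of rainbow().

```r
cuts   <- c(100, 150)                # n = 2 sorted cut points, as in myCut
colors <- c("red", "green", "blue")  # n + 1 = 3 values (placeholder colors)
value1 <- c(1000, 10, 90, 2000, 30, 120, 1500, 25, 150)

# Count how many cut points each value exceeds, then add 1 for the color index
idx <- sapply(value1, function(v) sum(v > cuts)) + 1
data.frame(value1, color = colors[idx])
```

Values up to 100 get the first color, values above 100 and up to 150 the second, and everything above 150 the third.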
I am trying to assign ranks to an IBI. There is a condition for one attribute that requires me to assign the average of all other attributes when Alkalinity < 25 mg/L. Can I assign an existing vector/column of attribute score means? I've tried the code below to first compute the mean for each row (but I'm not sure this is correct):
WWMBIscores <- WWMBI[c(1:126) , c(29, 33, 35, 37, 39, 41, 43)]
ScoreMeans <- rowMeans(WWMBIscores)
This code should then assign the value from "ScoreMeans" as the rank when alkalinity is less than 25 mg/L
Mollusca_IBI <-
WWMBI %>%
mutate(
Mollusca_Abund = case_when(
Alkalinity <= 5 ~ ScoreMeans,
Mollusca_Abund <= 1 ~ 0 ,
Mollusca_Abund >= 1 & Mollusca_Abund <= 9 ~ 1 ,
Mollusca_Abund >= 9 & Mollusca_Abund <= 99 ~ 3 ,
Mollusca_Abund >= 99 ~ 5
)
) %>% select(Mollusca_Abund)
It doesn't appear to be assigning means, but rather 0s as the value. I've included a small subset of the 7 ranks I have calculated, assume some of those are for areas where alkalinity is <25.
structure(list(MargDI_IBI = c(1, 1, 1, 3, 3), Pleid_IBI = c(1, 0, 0, 5, 1), Corixid_IBI = c(2, 0, 2, 0, 2), Trichop_IBI = c(0, 0, 0, 0, 0), Stratio_IBI = c(0, 0, 0, 0, 0), NonInsect_IBI = c(1, 0, 3, 3, 1), Insect_IBI = c(1, 1, 1, 3, 3)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
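One way to sketch the rule described above in base R is shown below. It assumes the threshold is 25 mg/L as stated in the text (the code above compares against 5), and the `alkalinity` and `mollusca_abund` values are hypothetical, since the sample does not include those columns.

```r
# Score columns copied from the sample above
scores <- data.frame(
  MargDI_IBI = c(1, 1, 1, 3, 3), Pleid_IBI = c(1, 0, 0, 5, 1),
  Corixid_IBI = c(2, 0, 2, 0, 2), Trichop_IBI = c(0, 0, 0, 0, 0),
  Stratio_IBI = c(0, 0, 0, 0, 0), NonInsect_IBI = c(1, 0, 3, 3, 1),
  Insect_IBI = c(1, 1, 1, 3, 3)
)
# Hypothetical values, NOT taken from the original data
alkalinity     <- c(10, 40, 20, 60, 30)
mollusca_abund <- c(0, 5, 50, 120, 9)

# One mean per row, in the same order as the rows of the data
score_means <- rowMeans(scores)

# Low-alkalinity rows take the row mean; the rest are ranked by abundance
mollusca_rank <- ifelse(alkalinity < 25, score_means,
                 ifelse(mollusca_abund <= 1, 0,
                 ifelse(mollusca_abund <= 9, 1,
                 ifelse(mollusca_abund <= 99, 3, 5))))
mollusca_rank
```

The vector of means has to line up row for row with the data being ranked; a vector built from a different row subset would be matched by position, not by site.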
Example data set in the picture
If percentage is 35 with their corresponding response values (average of the response values = A), I want to select another set of percentages that equal 100-35 = 65 and those corresponding responses (average of the response values = B). These percentages and responses are in the same columns.
I want to multiply A by 65% and B by 35%, before adding the two together.
I would want to do that for an entire data set so that I can get a "flipped" percentage and response accuracy.
Perhaps this code can help you
with(
aggregate(response ~ ., df, mean),
sum(
rowMeans(
cbind(
response * (1 - percentage / 100),
response[match(100 - percentage, percentage)] * percentage / 100
)
)
)
)
which gives
[1] 2.175
Data
> dput(df)
structure(list(percentage = c(35, 35, 40, 40, 45, 45, 50, 50,
55, 55, 60, 60, 65, 65), response = c(1, 1, 1, 1, 1, 0, 1, 0,
0, 1, 0, 0, 0, 1)), class = "data.frame", row.names = c(NA, -14L
))
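The same calculation can be unrolled step by step to show the weighted value for each percentage. This is a sketch under the assumption that every percentage p has a partner 100 - p in the data; `avg`, `partner` and `flipped` are illustrative names.

```r
df <- data.frame(
  percentage = c(35, 35, 40, 40, 45, 45, 50, 50, 55, 55, 60, 60, 65, 65),
  response   = c(1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1)
)

# Average response per percentage (A for each row)
avg <- aggregate(response ~ percentage, df, mean)

# Average response of the complementary percentage 100 - p (B for each row)
partner <- avg$response[match(100 - avg$percentage, avg$percentage)]

# A weighted by (100 - p)% plus B weighted by p%
avg$flipped <- avg$response * (100 - avg$percentage) / 100 +
  partner * avg$percentage / 100
avg
```

For percentage 35 this gives 1 * 0.65 + 0.5 * 0.35 = 0.825. Halving the sum of the `flipped` column reproduces the 2.175 above, because that code takes the mean rather than the sum of the two weighted terms.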
I'm trying to use ggplot, and am hoping to create a boxplot that has four suspension categories on the x axis (low, lowish, highish, high) and farms on the y axis.
I think I have broken the suspension column into four groups, but ggplot is upset with me. Here is the error:
```
Error in if (is.double(data$x) && !has_groups(data) && any(data$x != data$x[1L])) { : missing value where TRUE/FALSE needed
```
Here is my code:
```{r}
# To break suspension_rate_total_pct data into groups for clearer visualization, I found the min, and max
merged_data$suspension_rate_total_pct <-
as.numeric(merged_data$suspension_rate_total_pct)
max(merged_data$suspension_rate_total_pct, na.rm=TRUE)
min(merged_data$suspension_rate_total_pct, na.rm=TRUE)
low_suspension <- merged_data$suspension_rate_total_pct > 0 & merged_data$suspension_rate_total_pct < 0.5
low_ish_suspension <- merged_data$suspension_rate_total_pct > 0.5 & merged_data$suspension_rate_total_pct < 1
high_ish_suspension <- merged_data$suspension_rate_total_pct > 1 & merged_data$suspension_rate_total_pct < 1.5
high_suspension <- merged_data$suspension_rate_total_pct > 1.5 & merged_data$suspension_rate_total_pct < 2
ggplot(merged_data, aes(x = suspension_rate_total_pct , y = farms_pct)) +
geom_boxplot()
```
Here is the Data:
merged_data <- structure(list(schid = c("1030642", "1030766", "1030774", "1030840",
"1130103", "1230150"), enrollment = c(159, 333, 352, 430, 102,
193), farms = c(132, 116, 348, 406, 68, 130), foster = c(2, 0,
1, 8, 1, 4), homeless = c(14, 0, 8, 4, 1, 4), migrant = c(0,
0, 0, 0, 0, 0), ell = c(18, 12, 114, 45, 7, 4), suspension_rate_total = c(NA,
20, 0, 0, 95, 5), suspension_violent = c(NA, 9, 0, 0, 20, 2),
suspension_violent_no_injury = c(NA, 6, 0, 0, 47, 1), suspension_weapon = c(NA,
0, 0, 0, 8, 0), suspension_drug = c(NA, 0, 0, 0, 9, 1), suspension_defiance = c(NA,
1, 0, 0, 9, 1), suspension_other = c(NA, 4, 0, 0, 2, 0),
farms_pct = c(0.830188679245283, 0.348348348348348, 0.988636363636364,
0.944186046511628, 0.666666666666667, 0.673575129533679),
foster_pct = c(0.0125786163522013, 0, 0.00284090909090909,
0.0186046511627907, 0.00980392156862745, 0.0207253886010363
), migrant_pct = c(0, 0, 0, 0, 0, 0), ell_pct = c(0.113207547169811,
0.036036036036036, 0.323863636363636, 0.104651162790698,
0.0686274509803922, 0.0207253886010363), homeless_pct = c(0.0880503144654088,
0, 0.0227272727272727, 0.00930232558139535, 0.00980392156862745,
0.0207253886010363), suspension_rate_total_pct = c(NA, 2,
1, 1, 2, 2)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
If you can, please help me appease ggplot so that it will give me a beautiful visualization. Currently, this feels like a one-sided, emotional rollercoaster of a relationship.
Just a short answer; I am sure you can figure out the rest by yourself (otherwise post a follow-up question).
Since the data you provided has NAs in the first row of several columns, I can only demonstrate the principle of how to get your desired result by using the merged_data$homeless value as the grouping input for the boxplots; the data (y value) will still be farms.
library(dplyr)
library(ggplot2)
# first we create our groups of low, middle & high amount of homeless
merged_data2 <- merged_data %>%
  mutate(homelessgroup = ifelse(homeless < 4, "low",
                         ifelse(homeless <= 8, "middle",
                         ifelse(homeless > 8, "high", NA))))
## then we plot the data using ggplot, one box per homeless group
ggplot(merged_data2, aes(y = farms, fill = homelessgroup)) +
  geom_boxplot()
I think you can just use cut() with your data to partition into 4 groups. Then you can use that variable with the plot
merged_data <- transform(merged_data,
  group = cut(
    suspension_rate_total_pct,
    c(0, .5, 1, 1.5, 2),
    include.lowest = TRUE,
    labels = c("low", "lowish", "highish", "high")))
ggplot(merged_data, aes(x = group, y = farms_pct)) +
  geom_boxplot()
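To check how cut() bins the values before plotting, the grouping can be run on its own. A small sketch using the suspension_rate_total_pct values from the sample data; `pct` and `group` are illustrative names.

```r
# suspension_rate_total_pct from the sample data in the question
pct <- c(NA, 2, 1, 1, 2, 2)

# Right-closed bins: [0, 0.5], (0.5, 1], (1, 1.5], (1.5, 2]
group <- cut(pct, breaks = c(0, 0.5, 1, 1.5, 2), include.lowest = TRUE,
             labels = c("low", "lowish", "highish", "high"))
as.character(group)
# NA "high" "lowish" "lowish" "high" "high"

# Counts per group, including the NA row
table(group, useNA = "ifany")
```

Note that the strict comparisons in the question (> and <) would leave out values sitting exactly on 0.5, 1, 1.5 or 2, which is every non-missing value in this sample, whereas cut() assigns each of them to a bin.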