Conditional Calculation for a Column in R

I have the following data:
pop.2017 <- c(434,346,345,357)
pop.2018 <- c(334,336,325,345)
pop.2019 <- c(477,346,145,345)
pop.2020 <- c(474,366,341,300)
total <- c(34,36,34,35)
incident_month_yr <- c("2017-2","2017-5","2018-2","2019-2")
df <- data.frame(incident_month_yr,pop.2017,pop.2018,pop.2019,pop.2020,total)
df['perc'] <- NA
For rows where incident_month_yr contains 2017, I want perc to equal total/pop.2017
For rows where incident_month_yr contains 2018, I want perc to equal total/pop.2018
For rows where incident_month_yr contains 2019, I want perc to equal total/pop.2019
For rows where incident_month_yr contains 2020, I want perc to equal total/pop.2020
I've tried this:
df$perc[grepl(2017,df$incident_month_yr)] <- df$total/df$pop.2017
df$perc[grepl(2018,df$incident_month_yr)] <- df$total/df$pop.2018
df$perc[grepl(2019,df$incident_month_yr)] <- df$total/df$pop.2019
df$perc[grepl(2020,df$incident_month_yr)] <- df$total/df$pop.2020
However, it's not applying the calculations to specific rows like I want. How can I do this?

You can use the following solution:
library(dplyr)
library(stringr)
df %>%
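mutate(perc = ifelse(str_detect(incident_month_yr, "2017"), total/pop.2017,
ifelse(str_detect(incident_month_yr, "2018"), total/pop.2018,
ifelse(str_detect(incident_month_yr, "2019"), total/pop.2019,
ifelse(str_detect(incident_month_yr, "2020"), total/pop.2020, NA_real_)))))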
incident_month_yr pop.2017 pop.2018 pop.2019 pop.2020 total perc
1 2017-2 434 334 477 474 34 0.07834101
2 2017-5 346 336 346 366 36 0.10404624
3 2018-2 345 325 145 341 34 0.10461538
4 2019-2 357 345 345 300 35 0.10144928
Special thanks to @akrun.
We can also replace str_detect with the grepl function from base R to use fewer packages, and use case_when in place of ifelse as an unnested alternative.
df %>%
mutate(perc = case_when(
grepl("2017", incident_month_yr) ~ total/pop.2017,
grepl("2018", incident_month_yr) ~ total/pop.2018,
grepl("2019", incident_month_yr) ~ total/pop.2019,
grepl("2020", incident_month_yr) ~ total/pop.2020
))
incident_month_yr pop.2017 pop.2018 pop.2019 pop.2020 total perc
1 2017-2 434 334 477 474 34 0.07834101
2 2017-5 346 336 346 366 36 0.10404624
3 2018-2 345 325 145 341 34 0.10461538
4 2019-2 357 345 345 300 35 0.10144928
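As for why the attempt in the question misbehaves: the left-hand side selects only the matching rows, but the right-hand side is still the full-length vector, so R fills the selected rows with the first values of that vector (and warns about the length mismatch) instead of the values belonging to those rows. A minimal base R sketch that subsets both sides, assuming every year that appears in incident_month_yr has a matching pop.<year> column:

```r
df <- data.frame(
  incident_month_yr = c("2017-2", "2017-5", "2018-2", "2019-2"),
  pop.2017 = c(434, 346, 345, 357),
  pop.2018 = c(334, 336, 325, 345),
  pop.2019 = c(477, 346, 145, 345),
  pop.2020 = c(474, 366, 341, 300),
  total    = c(34, 36, 34, 35)
)
df$perc <- NA_real_

for (yr in 2017:2020) {
  # Rows whose incident_month_yr mentions this year
  rows <- grepl(as.character(yr), df$incident_month_yr)
  # Name of the matching population column, e.g. "pop.2017"
  pop_col <- paste0("pop.", yr)
  # Subset BOTH sides so the lengths agree
  df$perc[rows] <- df$total[rows] / df[[pop_col]][rows]
}

df$perc
```

Years with no matching rows (2020 here) simply assign nothing, so those rows would stay NA.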

We can do this with match. Get the column names that contain the 'pop' substring ('nm1'), strip everything but the year from 'incident_month_yr' ('nm2') and from the column names ('nm3'), and use match to return the column index. cbind that index with the sequence of rows to extract one value per row from the 'pop' columns, then divide 'total' by those values and assign the result to the 'perc' column
nm1 <- grep('pop', names(df), value = TRUE)
nm2 <- trimws(df$incident_month_yr, whitespace = '-.*')
nm3 <- trimws(nm1, whitespace = 'pop\\.')
df$perc <- df$total/df[nm1][cbind(seq_len(nrow(df)), match(nm2, nm3))]
df$perc
#[1] 0.07834101 0.10404624 0.10461538 0.10144928
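The key step above is indexing with a two-column matrix of (row, column) pairs, which picks exactly one cell per row. A small sketch of that idea on a hypothetical two-row matrix (the names pops, years and col_idx are made up for illustration):

```r
# Hypothetical population table: one column per year
pops <- cbind(pop.2017 = c(434, 346), pop.2018 = c(334, 336))

# Year wanted for each row
years <- c("2018", "2017")

# Column position of each wanted year among the available years
col_idx <- match(years, c("2017", "2018"))

# Each row of the index matrix is one (row, column) pair,
# so this extracts pops[1, 2] and pops[2, 1]
picked <- pops[cbind(seq_len(nrow(pops)), col_idx)]
picked
```

Indexing with separate row and column vectors would instead return the full 2 x 2 sub-matrix, which is why the pairs are bound into a matrix first.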
In dplyr, an option is to use rowwise: construct the column name from 'incident_month_yr' with str_replace by capturing the year part and adding 'pop.' as a prefix, get the value of that column, and divide the 'total' column by it
library(stringr)
library(dplyr)
df %>%
rowwise %>%
mutate(perc = total/get(str_replace(incident_month_yr,
"(\\d{4})-\\d+", 'pop.\\1'))) %>%
ungroup
-output
# A tibble: 4 x 7
# incident_month_yr pop.2017 pop.2018 pop.2019 pop.2020 total perc
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 2017-2 434 334 477 474 34 0.0783
#2 2017-5 346 336 346 366 36 0.104
#3 2018-2 345 325 145 341 34 0.105
#4 2019-2 357 345 345 300 35 0.101

Here are two approaches, one in base R and one using tidy data. The provided data is not tidy, which is why the base R version looks awkward:
# Define the target
target <- c(0.07834101, 0.10404624, 0.10461538, 0.10144928)
Our goal is to reproduce target.
First, use base R and ifelse:
result1 <- with(df,
ifelse(grepl(2017, incident_month_yr),
total/pop.2017,
ifelse(grepl(2018, incident_month_yr),
total/pop.2018,
ifelse(grepl(2019, incident_month_yr),
total/pop.2019,
ifelse(grepl(2020, incident_month_yr),
total/pop.2020,
NA)))))
identical(round(result1, 4), round(target, 4))
#> [1] TRUE
And, the tidy way, reshaping into tidy data and calculating the result:
library(dplyr)
library(tidyr)
result2 <- df %>% pivot_longer(starts_with("pop."), names_to = "pop", names_prefix = "pop.") %>%
filter(substr(incident_month_yr, 1, 4) == pop) %>%
mutate(perc = total/value) %>%
pull(perc)
identical(round(result2, 4), round(target, 4))
#> [1] TRUE

Related

How to find duplicate dates within a row in R, and then replace associated values with the mean?

There are some similar questions, however I haven't been able to find the solution for my data:
ID <- c(27,46,72)
Gest1 <- c(27,28,29)
Sys1 <- c(120,123,124)
Dia1 <- c(90,89,92)
Gest2 <- c(29,28,30)
Sys2 <- c(122,130,114)
Dia2 <- c(89,78,80)
Gest3 <- c(32,29,30)
Sys3 <- c(123,122,124)
Dia3 <- c(90,88,89)
Gest4 <- c(33,30,32)
Sys4 <- c(124,123,128)
Dia4 <- c(94,89,80)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3,
Dia3,Gest4,Sys4,Dia4)
df.1
What I need to do is identify where there are any cases of gestational age duplicates (variables beginning with Gest), and then find the mean of the associated Sys and Dia variables.
Once the mean has been calculated, I need to replace the duplicates with just 1 Gest variable, and the mean of the Sys variable and the mean of the Dia variable. Everything after those duplicates should then be moved up the dataframe.
Here is what it should look like:
df.2
My real data has 25 Gest variables with 25 associated Sys variables and 25 associated Dia variables.
Sorry if this is confusing! I've tried to write an OK question, but it is my first time using Stack Overflow.
Thank you!!
This is easier to manage in long (and tidy) format.
Using tidyverse, you can use pivot_longer to put the data into long form. After grouping by ID and Gest, you can replace the Sys and Dia values with their means, so if an ID has more than one row with the same Gest, the average is used.
Then you can keep a single row per group with slice. After grouping by ID, you can renumber the rows now that those with common Gest values have been combined.
library(tidyverse)
df.1 %>%
pivot_longer(cols = -ID, names_to = c(".value", "number"), names_pattern = "(\\D+)(\\d+)") %>%
group_by(ID, Gest) %>%
mutate(across(c(Sys, Dia), mean)) %>%
slice(1) %>%
group_by(ID) %>%
mutate(number = row_number())
Output
ID number Gest Sys Dia
<dbl> <int> <dbl> <dbl> <dbl>
1 27 1 27 120 90
2 27 2 29 122 89
3 27 3 32 123 90
4 27 4 33 124 94
5 46 1 28 126. 83.5
6 46 2 29 122 88
7 46 3 30 123 89
8 72 1 29 124 92
9 72 2 30 119 84.5
10 72 3 32 128 80
Note - I would keep in long form - but if you wanted wide again, you can add:
pivot_wider(id_cols = ID, names_from = number, values_from = c(Gest, Sys, Dia))
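With 25 numbered variables, the regular expression passed to names_pattern has to cope with two-digit suffixes. A quick base R check on hypothetical column names suggests that a greedy \w+ first group swallows leading digits, whereas \D+ splits cleanly between the letters and the number:

```r
# Hypothetical column names with one- and two-digit suffixes
nms <- c("Gest1", "Sys12", "Dia25")

# \w also matches digits, so the greedy first group keeps all but the last digit
prefix_w <- sub("(\\w+)(\\d+)", "\\1", nms, perl = TRUE)
suffix_w <- sub("(\\w+)(\\d+)", "\\2", nms, perl = TRUE)

# \D matches non-digits only, so the number stays intact in the second group
prefix_D <- sub("(\\D+)(\\d+)", "\\1", nms, perl = TRUE)
suffix_D <- sub("(\\D+)(\\d+)", "\\2", nms, perl = TRUE)

data.frame(nms, prefix_w, suffix_w, prefix_D, suffix_D)
```

For single-digit suffixes, as in the sample data, both patterns give the same split.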
This involves changing the structure of the table into the long format, averaging the duplicates, and then reshaping back into the desired table:
library(tidyr)
library(dplyr)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3, Dia3,Gest4,Sys4,Dia4)
#convert data to long format
longdf <- df.1 %>% pivot_longer(!ID, names_to = c(".value", "time"), names_pattern = "(\\D+)(\\d+)")
#average duplicate rows
temp<-longdf %>% group_by(ID, Gest) %>% summarize(Sys=mean(Sys), Dia=mean(Dia)) %>% mutate(time = row_number())
#convert back to wide format
answer <- temp %>% pivot_wider(id_cols = ID, names_from = time, values_from = c("Gest", "Sys", "Dia"), names_glue = "{.value}{time}")
#reorder the columns
answer <-answer[ , names(df.1)]
answer
# A tibble: 3 × 13
# Groups: ID [3]
ID Gest1 Sys1 Dia1 Gest2 Sys2 Dia2 Gest3 Sys3 Dia3 Gest4 Sys4 Dia4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 27 27 120 90 29 122 89 32 123 90 33 124 94
2 46 28 126. 83.5 29 122 88 30 123 89 NA NA NA
3 72 29 124 92 30 119 84.5 32 128 80 NA NA NA

Pivot wider to one row in R

Here is the sample code that I am using
library(dplyr)
naics <- c("000000","000000",123000,123000)
year <- c(2020,2021,2020,2021)
January <- c(250,251,6,9)
February <- c(252,253,7,16)
March <- c(254,255,8,20)
sample2 <- data.frame (naics, year, January, February, March)
Here is the intended result
Jan2020 Feb2020 March2020 Jan2021 Feb2021 March2021
000000 250 252 254 251 253 255
123000 6 7 8 9 16 20
Is this something that is done with pivot_wider or is it more complex?
We can use pivot_wider with the month columns as values_from and 'year' as names_from, then set the column name format with names_glue. If needed, convert 'naics' to row names with column_to_rownames (from tibble)
library(tidyr)
library(tibble)
pivot_wider(sample2, names_from = year, values_from = January:March,
names_glue = "{substr(.value, 1, 3)}{year}")%>%
column_to_rownames('naics')
-output
Jan2020 Jan2021 Feb2020 Feb2021 Mar2020 Mar2021
000000 250 251 252 253 254 255
123000 6 9 7 16 8 20
With the reshape function from base R:
reshape(sample2, dir = "wide", sep="",
idvar = "naics",
timevar = "year",
new.row.names = unique(naics))[,-1]
# January2020 February2020 March2020 January2021 February2021 March2021
# 000000 250 252 254 251 253 255
# 123000 6 7 8 9 16 20
This takes a longer route than @akrun's answer. I will leave it here in case it gives more intuition about the steps being taken. Otherwise, @akrun's answer is more resource efficient.
sample2 %>%
tidyr::pivot_longer(-c(naics, year), names_to = "month",
values_to = "value") %>%
mutate(Month=paste0(month, year)) %>%
select(-year, - month) %>%
tidyr::pivot_wider(names_from = Month,values_from = value)
# A tibble: 2 x 7
naics January2020 February2020 March2020 January2021 February2021
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 000000 250 252 254 251 253
2 123000 6 7 8 9 16
# ... with 1 more variable: March2021 <dbl>

Subset a df by the last non-NA value in a column

My dataframe looks like this:
Year aquil_7 aquil_8 aquil_9
2018 NA 201 222
2019 192 145 209
2020 166 121 NA
2021 190 NA NA
I want to subset this dataframe so as to include only those columns where the last non-NA year is equal to or less than 2020. In the example above, this means deleting the aquil_7 column since its last non-NA year is 2021.
How could I do this?
A simple base R answer.
Explanation: apply iterates column-wise (that is what the second argument, 2, means) over every column except the first and checks the condition. The result is combined with T via c() so that the first column (Year) is always kept.
df <- read.table(text = "Year aquil_7 aquil_8 aquil_9
2018 NA 201 222
2019 192 145 209
2020 166 121 NA
2021 190 NA NA", header = T)
df[c(T, apply((!is.na(df[-1]))*df$Year, 2, function(x){max(x) < 2021}))]
Year aquil_8 aquil_9
1 2018 201 222
2 2019 145 209
3 2020 121 NA
4 2021 NA NA
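The same idea can be spelled out in separate steps, which may be easier to follow: compute the last non-NA year of each column, then keep the columns where that year is 2020 or earlier. A minimal base R sketch (last_year and keep are illustrative names, and it assumes every column has at least one non-NA value):

```r
df <- read.table(text = "Year aquil_7 aquil_8 aquil_9
2018 NA 201 222
2019 192 145 209
2020 166 121 NA
2021 190 NA NA", header = TRUE)

# Last year with a non-NA value, for every column except Year
last_year <- sapply(df[-1], function(col) max(df$Year[!is.na(col)]))

# Columns whose last non-NA year is 2020 or earlier
keep <- names(last_year)[last_year <= 2020]

df[c("Year", keep)]
```

Here last_year is 2021 for aquil_7, 2020 for aquil_8 and 2019 for aquil_9, so only aquil_7 is dropped.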
Not sure if there's a better way to implement this (but I do hope so). In the meantime, you could e.g. do
library(tidyverse)
cols_to_keep <- df %>%
pivot_longer(-Year) %>%
group_by(name) %>%
summarize(var = max(Year[!is.na(value)]) <= 2020) %>%
filter(var) %>%
pull(name)
df %>%
select(Year, all_of(cols_to_keep))

How to lookup and sum multiple columns in R

Suppose I have 2 dataframes structured as such:
GROUPS:
P1 P2 P3 P4
123 213 312 231
345 123 213 567
INDIVIDUAL_RESULTS:
ID SCORE
123 23
213 12
312 11
231 19
345 10
567 22
I want to add a column to the GROUPS which is a sum of each of their individual results:
P1 P2 P3 P4 SCORE
123 213 312 231 65
I've tried using various merge techniques, but have really just created a mess. I feel like there's a simple solution I just don't know about, would really appreciate some guidance!
d1=read.table(text="
P1 P2 P3 P4
123 213 312 231
345 123 213 567",h=T)
d2=read.table(text="
ID SCORE
123 23
213 12
312 11
231 19
345 10
567 22",h=T)
I will be using the apply and match functions. apply runs a function on each row of d1; inside it, match finds the positions in d2$ID of the values in that row, and those positions are used to pick the corresponding values from d2$SCORE. In the end we sum them up.
d1$SCORE=apply(d1,1,function(x){
sum(d2$SCORE[match(x,d2$ID)])
})
and the result
P1 P2 P3 P4 SCORE
1 123 213 312 231 65
2 345 123 213 567 67
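If the row-wise apply becomes slow on a large table, the lookup can also be done in one vectorized match over all cells at once. This is a sketch of an alternative, not part of the answer above; score_mat is an illustrative name, and it assumes every ID in d1 appears in d2$ID (otherwise the sum for that row would be NA):

```r
d1 <- read.table(text = "
P1 P2 P3 P4
123 213 312 231
345 123 213 567", header = TRUE)
d2 <- read.table(text = "
ID SCORE
123 23
213 12
312 11
231 19
345 10
567 22", header = TRUE)

# Look up every cell of d1 in d2$ID in a single call; as.matrix() and
# matrix() both use column-major order, so the layout of d1 is preserved
score_mat <- matrix(d2$SCORE[match(as.matrix(d1), d2$ID)], nrow = nrow(d1))

# One total per group (row)
d1$SCORE <- rowSums(score_mat)
d1
```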
Here is a slower approach that may be more intuitive for new users. I think the difficulty comes from the format of your data d1. If you tidy it up a little:
library(tidyverse)
d1<-data.frame(t(d1))
colnames(d1) <-c("group1", "group2")
d1$P = row.names(d1)
d1<-d1 %>%
pivot_longer(
cols = group1:group2,
names_to = "Group",
values_to = "ID"
)
df <-left_join(d1, d2, by ="ID")
df
# A tibble: 8 x 4
P Group ID SCORE
<chr> <chr> <int> <int>
1 P1 group1 123 23
2 P1 group2 345 10
3 P2 group1 213 12
4 P2 group2 123 23
5 P3 group1 312 11
6 P3 group2 213 12
7 P4 group1 231 19
8 P4 group2 567 22
Once you get the data to this more "conventional" format, we can easily work out a tidyverse solution.
df %>%
group_by(Group) %>%
summarize(SCORE = sum(SCORE))
# A tibble: 2 x 2
Group SCORE
<chr> <int>
1 group1 65
2 group2 67
Another possibility is to reshape the first data.frame to long format, keeping track of which group (row) each ID belongs to:
d1$group <- seq_len(nrow(d1))
groups <- tidyr::gather(d1, name, number, P1:P4)
This information can then be merged into the second data.frame and used for further analyses, such as aggregations:
d2_groups <- merge(groups, d2, by.x = "number", by.y = "ID")
aggregate(d2_groups$SCORE, by = list(group = d2_groups$group), FUN = sum)

R - Transpose columns and rows with conditions

I am working with the dataframe 'by_class_survival' and I am trying to convert it to another format, swapping rows and columns and applying conditions along the way. I have already solved this in a very crude way, but I am wondering whether there is a better way to transpose the columns and rows while applying those conditions.
library(dplyr)
titanic_tbl <- as_tibble(Titanic)
titanic_tbl <- titanic_tbl %>%
mutate(across(Class:Survived, factor))
by_class_survival <- titanic_tbl %>%
group_by(Class, Survived) %>%
summarize(Count = sum(n))
Original dataframe
# Class Survived Count
# 1 1st No 122
# 2 1st Yes 203
# 3 2nd No 167
# 4 2nd Yes 118
# 5 3rd No 528
# 6 3rd Yes 178
# 7 Crew No 673
# 8 Crew Yes 212
Creating a new dataframe based on the values from by_class_survival
first <- c(122,203)
second <- c(167, 118)
third <- c(528,178)
crew <- c(673,212)
titanic.df = data.frame(first,second,third,crew)
library(data.table)
t_titanic.df <- transpose(titanic.df)
rownames(t_titanic.df) <- colnames(titanic.df)
colnames(t_titanic.df) <- c("No survivor", "Survivor")
Expected result
## No survivor Survivor
## first 122 203
## second 167 118
## third 528 178
## crew 673 212
Is there a better way to reach the expected result?
You can do it in one step with reshape2::dcast:
library(reshape2)
library(dplyr)
titanic_tbl %>%
dcast(Class ~ Survived, value.var = "n", sum)
Class No Yes
1 1st 122 203
2 2nd 167 118
3 3rd 528 178
4 Crew 673 212
or you can use tidyr::spread on the summarised data frame:
library(tidyr)
titanic_tbl %>%
group_by(Class, Survived) %>%
summarise(sum = sum(n)) %>%
spread(Survived, sum)
# A tibble: 4 x 3
# Groups: Class [4]
Class No Yes
<chr> <dbl> <dbl>
1 1st 122 203
2 2nd 167 118
3 3rd 528 178
4 Crew 673 212
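Since Titanic ships with base R as a four-dimensional table (Class, Sex, Age, Survived), the same Class by Survived counts can also be obtained without any packages by summing over the other two dimensions. A minimal sketch of that alternative (class_by_survival is an illustrative name):

```r
# Titanic dimensions: 1 = Class, 2 = Sex, 3 = Age, 4 = Survived.
# Keep dimensions 1 and 4 and sum the counts over Sex and Age.
class_by_survival <- apply(Titanic, c(1, 4), sum)

# Rename the "No"/"Yes" columns to match the expected result
colnames(class_by_survival) <- c("No survivor", "Survivor")

class_by_survival
```

The result is a matrix with the classes as row names, so no transpose step is needed.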
