How can I make conditional selections using dplyr in R?

I have the following situation. Given the table
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
type = c("MC", "MC", "MK", "MC", "MK", "MC"),
value1 = c(512, 261, 4523, 1004, 1221, 2556),
value2 = c(726, 4000, 280, 998, 113, 6789))
I am trying to find a way to implement the following logic: if both types (MC and MK) occur for an ID, use value1 from MK and value2 from MC. Otherwise (only type MC occurs), use both values from the MC row.
Hence, the final result is supposed to be:
data.frame(ID = c(1, 2, 3, 4),
type = c("MC", "MC", "MC", "MC"),
value1 = c(512, 4523, 1221, 2556),
value2 = c(726, 4000, 998, 6789))
The MK rows are dropped once value1 has been extracted from them.

Another version with dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(value1 = ifelse(any(type == "MK"), value1[type=="MK"],value1[type=="MC"]),
value2 = value2[type == "MC"]) %>%
filter(type == "MC")
# ID type value1 value2
# <dbl> <fct> <dbl> <dbl>
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
Here, value1 takes the "MK" value if the group has one and the "MC" value otherwise, while value2 always takes the "MC" value. Finally, only the rows with type "MC" are kept. This assumes every group (ID) has an "MC" row.
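The same logic can also be expressed with summarise(), which collapses each ID to a single row directly instead of mutating and then filtering. This is only a minimal sketch, assuming every ID has exactly one "MC" row and at most one "MK" row; the name res is illustrative and not part of the original answer.

```r
library(dplyr)

df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
                 type = c("MC", "MC", "MK", "MC", "MK", "MC"),
                 value1 = c(512, 261, 4523, 1004, 1221, 2556),
                 value2 = c(726, 4000, 280, 998, 113, 6789))

res <- df %>%
  group_by(ID) %>%
  summarise(
    # value1 comes from the MK row when the group has one, else from the MC row
    value1 = if (any(type == "MK")) value1[type == "MK"][1] else value1[type == "MC"][1],
    # value2 always comes from the MC row (assumed to exist)
    value2 = value2[type == "MC"][1],
    # type is overwritten last, so the lines above still see the original column
    type = "MC",
    .groups = "drop"
  ) %>%
  select(ID, type, value1, value2)

res
```

Using if/else rather than ifelse() is a deliberate choice here: the condition is a single TRUE/FALSE per group, so a scalar branch is enough.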

For efficiency I would definitely prefer @Andre Elrico's answer, but here is a dplyr option. Try:
df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
type = c("MC", "MC", "MK", "MC", "MK", "MC"),
value1 = c(512, 261, 4523, 1004, 1221, 2556),
value2 = c(726, 4000, 280, 998, 113, 6789))
library(dplyr)
df %>%
reshape(., idvar = "ID", timevar = "type", direction = "wide") %>%
group_by(ID) %>%
mutate(value1 = ifelse(is.na(value1.MK), value1.MC, value1.MK),
value2 = ifelse(is.na(value2.MC), value2.MK, value2.MC),
type = "MC") %>%
select(ID, type, value1, value2)
# output
# A tibble: 4 x 4
# Groups: ID [4]
ID type value1 value2
<dbl> <chr> <dbl> <dbl>
1 1 MC 512 726
2 2 MC 4523 4000
3 3 MC 1221 998
4 4 MC 2556 6789
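The wide-format idea above can probably also be written with tidyr's pivot_wider() and dplyr's coalesce(), which picks the first non-missing value. This is a hedged sketch rather than the answer's own code; it assumes tidyr 1.0 or later and at most one row per ID and type, and the names wide and res are illustrative.

```r
library(dplyr)
library(tidyr)

df <- data.frame(ID = c(1, 2, 2, 3, 3, 4),
                 type = c("MC", "MC", "MK", "MC", "MK", "MC"),
                 value1 = c(512, 261, 4523, 1004, 1221, 2556),
                 value2 = c(726, 4000, 280, 998, 113, 6789))

# One row per ID; columns become value1_MC, value1_MK, value2_MC, value2_MK
wide <- df %>%
  pivot_wider(names_from = type, values_from = c(value1, value2))

res <- wide %>%
  transmute(ID,
            type = "MC",
            # prefer the MK value1, fall back to MC when MK is missing
            value1 = coalesce(value1_MK, value1_MC),
            # prefer the MC value2, fall back to MK when MC is missing
            value2 = coalesce(value2_MC, value2_MK))

res
```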

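data.table solution (df1 holds the question's data; this relies on the MK row coming last within each ID)
library(data.table)
setDT(df1)[,{x=.SD;if(all(c("MC","MK") %in% type)){x$value1[] = last(value1)};first(x)},by=ID]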
result:
# ID type value1 value2
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789
dplyr:
df1 %>% group_by(ID) %>% do(.,(function(x){if(all(c("MC","MK") %in% x$type)){x$value1[] = x$value1[x$type=="MK"]};x[1,]})(.))
# A tibble: 4 x 4
# Groups: ID [4]
# ID type value1 value2
# <dbl> <fct> <dbl> <dbl>
#1 1 MC 512 726
#2 2 MC 4523 4000
#3 3 MC 1221 998
#4 4 MC 2556 6789

Related

Converting from long to wide, using pivot_wider() on two columns in R

I would like to transform my data from long format to wide by the values in two columns. How can I do this using tidyverse?
Updated dput
structure(list(Country = c("Algeria", "Benin", "Ghana", "Algeria",
"Benin", "Ghana", "Algeria", "Benin", "Ghana"
), Indicator = c("Indicator 1",
"Indicator 1",
"Indicator 1",
"Indicator 2",
"Indicator 2",
"Indicator 2",
"Indicator 3",
"Indicator 3",
"Indicator 3"
), Status = c("Actual", "Forecast", "Target", "Actual", "Forecast",
"Target", "Actual", "Forecast", "Target"), Value = c(34, 15, 5,
28, 5, 2, 43, 5,
1)), row.names
= c(NA, -9L), class = c("tbl_df", "tbl", "data.frame"))
Country Indicator Status Value
<chr> <chr> <chr> <dbl>
1 Algeria Indicator 1 Actual 34
2 Benin Indicator 1 Forecast 15
3 Ghana Indicator 1 Target 5
4 Algeria Indicator 2 Actual 28
5 Benin Indicator 2 Forecast 5
6 Ghana Indicator 2 Target 2
7 Algeria Indicator 3 Actual 43
8 Benin Indicator 3 Forecast 5
9 Ghana Indicator 3 Target 1
Expected output
Country Indicator1_Actual Indicator1_Forecast Indicator1_Target Indicator2_Actual
Algeria 34 15 5 28
etc
Appreciate any tips!
foo <- data %>% pivot_wider(names_from = c("Indicator","Status"), values_from = "Value")
works perfectly!
I think the mistake is in your pivot_wider() command
data %>% pivot_wider(names_from = Indicator, values_from = c(Indicator, Status))
I bet you can't use the same column for both names and values.
Try this code
data %>% pivot_wider(names_from = c(Indicator, Status), values_from = Value)
Explanation: Since you want column names like Indicator 1_Actual, both the Indicator and Status columns need to go into names_from.
It would be helpful if you provided example data and expected output. But I tested this on my dummy data and it gives the expected output -
Data:
# A tibble: 4 x 4
a1 a2 a3 a4
<int> <int> <chr> <dbl>
1 1 5 s 10
2 2 4 s 20
3 3 3 n 30
4 4 2 n 40
Call : a %>% pivot_wider(names_from = c(a2, a3), values_from = a4)
Output :
# A tibble: 4 x 5
a1 `5_s` `4_s` `3_n` `2_n`
<int> <dbl> <dbl> <dbl> <dbl>
1 1 10 NA NA NA
2 2 NA 20 NA NA
3 3 NA NA 30 NA
4 4 NA NA NA 40
Data here if you want to reproduce
structure(list(a1 = 1:4, a2 = 5:2, a3 = c("s", "s", "n", "n"),
a4 = c(10, 20, 30, 40)), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Edit: Regarding the edited question, after trying the correct pivot_wider() command: it looks like your data could actually contain duplicates, in which case the output you are seeing would make sense. I would suggest checking whether your data has duplicates by using filter(Country == .., Indicator == .., Status == ..).
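One way to look for such duplicates in a single pass, instead of filtering one combination at a time, is to count the rows per key combination. The sketch below uses a made-up tibble (toy) with the question's column names; the values and the name dups are assumptions for illustration only.

```r
library(dplyr)

# Hypothetical data: Algeria / Indicator 1 / Actual appears twice
toy <- tibble(
  Country   = c("Algeria", "Algeria", "Benin"),
  Indicator = c("Indicator 1", "Indicator 1", "Indicator 1"),
  Status    = c("Actual", "Actual", "Forecast"),
  Value     = c(34, 35, 15)
)

# Count rows per key combination and keep those that occur more than once
dups <- toy %>%
  count(Country, Indicator, Status) %>%
  filter(n > 1)

dups
```

If dups comes back empty, each Country, Indicator and Status combination identifies a single row, and pivot_wider() has exactly one value per cell.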
This can be achieved by passing both columns to the names_from argument of pivot_wider().
data %>%
pivot_wider(names_from = c("Indicator","Status"),
values_from = "Value")
Result
Country `Indicator 1_Ac… `Indicator 1_Fo… `Indicator 1_Ta… `Indicator 2_Ac… `Indicator 2_Fo…
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Algeria 34 15 5 28 5
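The resulting column names keep the space from the Indicator values (for example `Indicator 1_Actual`), whereas the expected output in the question shows names like Indicator1_Actual. One possible way to get there is to strip the spaces afterwards with rename_with(). This is a sketch on a small made-up tibble (toy), assuming dplyr 1.0 or later; it is not taken from the answers.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same columns as in the question
toy <- tibble(
  Country   = c("Algeria", "Algeria", "Benin", "Benin"),
  Indicator = c("Indicator 1", "Indicator 1", "Indicator 1", "Indicator 1"),
  Status    = c("Actual", "Forecast", "Actual", "Forecast"),
  Value     = c(34, 15, 28, 5)
)

res <- toy %>%
  # Both columns feed the new names, joined by "_" (the default names_sep)
  pivot_wider(names_from = c(Indicator, Status), values_from = Value) %>%
  # Remove spaces from every column name
  rename_with(~ gsub(" ", "", .x, fixed = TRUE))

names(res)
```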

R - concatenate rows based on conditions?

I created a mapping table in R and have provided an example of what it looks like:
ex <- data.frame("id" = c(rep(1234,7)), "claim" = c(1234, 1367, 1234, 1869, 1234, 1367,1234),
"code1" = c(24, 61, 28, 21, 20, 29,80), date = c('2019-03-18', '2019-04-12',
'2019-03-18', '2019-03-18',
'2019-03-18', '2019-04-12', '2019-03-18'),
'code2' = c(24,29,24,24,24, 29,24), dx1=c("M234","M123",NA,"M434",NA,NA, NA),
dx2=c(NA,NA,NA,NA,"M789","Z123", "M999"),
dx3 = c(NA,NA,"M689",NA,NA,NA, NA),
pay = c(1000, 520, 1000, 780, 1000,520,1000))
Is there any way I could find a way to get this as my final output:
ex2 <- data.frame("id" = c(rep(1234,3)), date = c('2019-03-18', '2019-03-18','2019-04-12'),
'code2' = c(24,24,29),
dx1=c("M234","M434","M123"),
dx2=c("M789",NA,"Z123"),
dx3 = c("M689",NA,NA),
dx4 = c("M999", NA, NA),
pay = c(1000,780,520))
I basically would like any values in dx2 or dx3 in the first example (ex) to be added onto the row with the corresponding code2 value. However, if there are multiple values in dx1 for the same code2, I would like to keep them as separate rows.
Is there any way I could do something like this in R?
Thanks in advance!
edit: In my mapping table (ex) there are only the columns dx1, dx2 and dx3. I would like any additional values in dx2 or dx3 to be added as new columns (which is why ex2 now has a dx4 column). These changes are grouped by code2, so the number of values in dx2 or dx3 for code 24 determines how many new dx columns are created. The order can then be determined by the max(pay) column.
Do you require this?
library(tidyverse)
ex %>% pivot_longer(cols = c("dx1", "dx2", "dx3"), names_to = "code3", values_to = "val", values_drop_na = T) %>%
arrange(claim, code2, code3) %>% group_by(id, claim, date, code2, code3) %>%
mutate(dummy = n(),
dummy2 = row_number(),
code3 = ifelse(dummy >1 & dummy2 >1, "dx4", code3)) %>% arrange(code3) %>%
pivot_wider(id_cols = c('id', 'claim', 'date', 'code2', 'pay'), names_from = 'code3', values_from = 'val', values_fn = min) %>%
ungroup() %>% select(-claim)
# A tibble: 3 x 8
id date code2 pay dx1 dx2 dx3 dx4
<dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
1 1234 2019-03-18 24 1000 M234 M789 M689 M999
2 1234 2019-04-12 29 520 M123 Z123 NA NA
3 1234 2019-03-18 24 780 M434 NA NA NA

How to find min and max in dplyr?

I know the sum of points for each person.
I need to know the minimum and the maximum number of points that a person could have.
What I have tried:
min_and_max <- dataset %>%
group_by(person) %>%
dplyr::filter(min(sum(points, na.rm = T))) %>%
distinct(person) %>%
pull()
min_and_max
My dataset:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
I would suggest this dplyr approach. You have to summarize data like this:
library(tidyverse)
#Code
df %>% group_by(id,person) %>%
summarise(Total=sum(points,na.rm = T),
min=min(points,na.rm = T),
max=max(points,na.rm=T))
Output:
# A tibble: 7 x 5
# Groups: id [7]
id person Total min max
<int> <chr> <int> <int> <int>
1 201 rt99 5 2 3
2 202 kt 4 4 4
3 203 rr 4 4 4
4 204 jk 4 2 2
5 322 knm3 8 3 5
6 343 kll2 8 1 5
7 344 kll 8 1 7
Here is the data.table solution -
library(data.table)
setDT(dataset)  # convert the data frame to a data.table by reference
dataset[, min_points := min(points, na.rm = TRUE), by = person]
dataset[, max_points := max(points, na.rm = TRUE), by = person]
Since I don't have your data, I cannot test this code, but it should work fine.
The summarize() verb is what you want for this. You don't even need to filter out the NA values first since both min() and max() can have na.rm = TRUE.
library(dplyr)
min_and_max <- dataset %>%
group_by(person) %>%
summarize(min = min(points, na.rm = TRUE),
max = max(points, na.rm = TRUE))
min_and_max
# A tibble: 7 x 3
person min max
<chr> <dbl> <dbl>
1 jk 2 2
2 kll 1 7
3 kll2 1 5
4 knm3 3 5
5 kt 4 4
6 rr 4 4
7 rt99 2 3
dput(dataset)
structure(list(id = c(201, 201, 201, 202, 202, 202, 203, 203,
203, 204, 204, 204, 322, 322, 322, 343, 343, 343, 344, 344, 344
), person = c("rt99", "rt99", "rt99", "kt", "kt", "kt", "rr",
"rr", "rr", "jk", "jk", "jk", "knm3", "knm3", "knm3", "kll2",
"kll2", "kll2", "kll", "kll", "kll"), points = c(NA, 3, 2, 4,
NA, NA, 4, NA, NA, 2, 2, NA, 5, NA, 3, 2, 1, 5, NA, 7, 1)), class = "data.frame", row.names = c(NA,
-21L), spec = structure(list(cols = list(id = structure(list(), class = c("collector_double",
"collector")), person = structure(list(), class = c("collector_character",
"collector")), points = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
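One caveat with the summarize() approach: if a person has only NA points, min(..., na.rm = TRUE) returns Inf and max(..., na.rm = TRUE) returns -Inf, each with a warning. That does not happen in the data shown, but a guarded version could look like the sketch below. The helpers safe_min and safe_max and the small data frame are hypothetical and only serve as illustration.

```r
library(dplyr)

# Hypothetical data: person "b" has no non-missing points at all
dataset <- data.frame(person = c("a", "a", "b", "b"),
                      points = c(NA, 3, NA, NA))

# Return NA instead of Inf / -Inf when every value is missing
safe_min <- function(x) if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
safe_max <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

res <- dataset %>%
  group_by(person) %>%
  summarise(min = safe_min(points),
            max = safe_max(points),
            .groups = "drop")

res
```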

R Dplyr solution for summarize_at correlation

I am attempting to calculate correlation by (group_by) MktDate, for all columns in a dataframe to another column (Security Return).
I have attempted a number of dplyr solutions and can't quite get the correlation example to work properly but have no issues getting an example using mean to work properly.
This works, to calculate mean by specified columns
MyMeanTest <- MyDataTest %>%
filter(MktDate >='2009-12-31') %>%
group_by(MktDate) %>%
summarize_at(c('RtnVol_EM','OCFROI_EM'),mean,na.rm=TRUE)
This does not work. Essentially, I want the correlation of each specified column with the column FwdRet_12M, grouped by MktDate. I get the following error message:
Error in summarise_impl(.data, dots) :
Evaluation error: not all arguments have the same length.
MyCorTest <- MyDataTest %>%
group_by(MktDate) %>%
summarize_at(c('RtnVol_EM','OCFROI_EM'),funs(cor(.,MyDataTest$FwdRet_12M,use="pairwise.complete.obs", "spearman")))
With the code example above I should end up with something like this
MktDate,RtnVol_EM,OCFROI_EM...
Here is some sample code that should help to understand the structure of the data and end objective.
MyDataTest <- structure(list(MktDate = structure(c(17896, 17896, 17896, 17896,
17927, 17927, 17927, 17927), class = "Date"), FwdRet = c(2, 3,
4, 5, 5, 2, 1, 4), Fact1 = c(10, 30, 20, 15, 12, 25, 26, 28),
Fact2 = c(100, 500, 300, 400, 150, 400, 430, 420)), .Names = c("MktDate",
"FwdRet", "Fact1", "Fact2"), row.names = c(NA, -8L), class = "data.frame")
When running the pairwise correlation grouped by date on that data set the following should be the result.
MktDate,Fact1,Fact2
12/31/18,.2,.4
1/31/19,.4,-.8
One possible approach would be to reshape your data so that you have the variable you always want in the correlation (FwdRet) in one column and the variable that changes in a separate column. Like so:
MyDataTest_reshape <- MyDataTest %>%
gather(factor, value, -MktDate, -FwdRet)
MyDataTest_reshape
MktDate FwdRet factor value
1 2018-12-31 2 Fact1 10
2 2018-12-31 3 Fact1 30
3 2018-12-31 4 Fact1 20
4 2018-12-31 5 Fact1 15
5 2019-01-31 5 Fact1 12
6 2019-01-31 2 Fact1 25
7 2019-01-31 1 Fact1 26
8 2019-01-31 4 Fact1 28
9 2018-12-31 2 Fact2 100
10 2018-12-31 3 Fact2 500
11 2018-12-31 4 Fact2 300
12 2018-12-31 5 Fact2 400
13 2019-01-31 5 Fact2 150
14 2019-01-31 2 Fact2 400
15 2019-01-31 1 Fact2 430
16 2019-01-31 4 Fact2 420
Then you can take that reshaped data and feed it into your correlation:
MyDataTest_reshape %>%
group_by(MktDate, factor) %>%
summarize(correlation = cor(FwdRet, value)) %>%
spread(factor, correlation)
# A tibble: 2 x 3
# Groups: MktDate [2]
MktDate Fact1 Fact2
<date> <dbl> <dbl>
1 2018-12-31 0.0756 0.529
2 2019-01-31 -0.627 -0.736
You can also do this all in one step, of course:
MyDataTest %>%
gather(factor, value, -MktDate, -FwdRet) %>%
group_by(MktDate, factor) %>%
summarize(correlation = cor(FwdRet, value)) %>%
spread(factor, correlation)
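In current tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(), so the same reshape-then-correlate pipeline could be sketched as follows. Note that cor() defaults to Pearson, which is why the numbers above differ from the Spearman values expected in the question; this sketch passes method = "spearman" explicitly. The data frame is rebuilt from the question's sample, and the name res is illustrative.

```r
library(dplyr)
library(tidyr)

MyDataTest <- data.frame(
  MktDate = as.Date(c(rep("2018-12-31", 4), rep("2019-01-31", 4))),
  FwdRet  = c(2, 3, 4, 5, 5, 2, 1, 4),
  Fact1   = c(10, 30, 20, 15, 12, 25, 26, 28),
  Fact2   = c(100, 500, 300, 400, 150, 400, 430, 420)
)

res <- MyDataTest %>%
  # Long format: one row per date, FwdRet value and factor
  pivot_longer(cols = c(Fact1, Fact2), names_to = "factor", values_to = "value") %>%
  group_by(MktDate, factor) %>%
  summarise(correlation = cor(FwdRet, value, method = "spearman"),
            .groups = "drop") %>%
  # Back to one column per factor
  pivot_wider(names_from = factor, values_from = correlation)

res
```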
This works for me.
library(tidyverse)
MyDataTest <- structure(list(MktDate = structure(c(17896, 17896, 17896, 17896,
17927, 17927, 17927, 17927), class = "Date"), FwdRet = c(2, 3,
4, 5, 5, 2, 1, 4), Fact1 = c(10, 30, 20, 15, 12, 25, 26, 28),
Fact2 = c(100, 500, 300, 400, 150, 400, 430, 420)), .Names = c("MktDate",
"FwdRet", "Fact1", "Fact2"), row.names = c(NA, -8L), class = "data.frame")
MyDataTest %>%
group_by(MktDate) %>%
summarize_at(c("Fact1", "Fact2"), list(~cor(., FwdRet, use="pairwise.complete.obs", "spearman")))
#> # A tibble: 2 x 3
#> MktDate Fact1 Fact2
#> <date> <dbl> <dbl>
#> 1 2018-12-31 0.2 0.4
#> 2 2019-01-31 -0.4 -0.8
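The key difference from the failing attempt in the question is that FwdRet is referenced by its bare column name, so it is sliced per group. Writing MyDataTest$FwdRet_12M instead pulls the full, ungrouped column, whose length no longer matches the group's rows, hence the "not all arguments have the same length" error. In dplyr 1.0 and later, summarize_at() is superseded by across(); a sketch of the equivalent call, assuming that version is installed (res is an illustrative name):

```r
library(dplyr)

MyDataTest <- data.frame(
  MktDate = as.Date(c(rep("2018-12-31", 4), rep("2019-01-31", 4))),
  FwdRet  = c(2, 3, 4, 5, 5, 2, 1, 4),
  Fact1   = c(10, 30, 20, 15, 12, 25, 26, 28),
  Fact2   = c(100, 500, 300, 400, 150, 400, 430, 420)
)

res <- MyDataTest %>%
  group_by(MktDate) %>%
  summarise(
    # .x is the current column; FwdRet is the current group's slice of that column
    across(c(Fact1, Fact2),
           ~ cor(.x, FwdRet, use = "pairwise.complete.obs", method = "spearman")),
    .groups = "drop"
  )

res
```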

Tidyversing working R code with a for loop

I have a dataset of pairs of cities, V1 and V2. Each city has a population, v1_pop2015 and v2_pop2015.
I would like to create a new dataset with only the cityCode of the bigger city and its population added to the population of the smaller one.
I was able to create the output I want with a for loop. For educational purposes, I tried to do it using tidyverse tools, without success.
This is a working sample
library(tidyverse)
## Sample dataset
pairs_pop <- structure(list(cityCodeV1 = c(20073, 20888, 20222, 22974, 23792,
20779), cityCodeV2 = c(20063, 204024, 20183, 20406, 23586, 23595
), v1_pop2015 = c(414, 682, 497, 3639, 384, 596), v2_pop2015 = c(384,
757, 5716, 315, 367, 1303)), row.names = c(NA, 6L), class = c("tbl_df",
"tbl", "data.frame"))
pairs_pop
#> # A tibble: 6 x 4
#> cityCodeV1 cityCodeV2 v1_pop2015 v2_pop2015
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 20073 20063 414 384
#> 2 20888 204024 682 757
#> 3 20222 20183 497 5716
#> 4 22974 20406 3639 315
#> 5 23792 23586 384 367
#> 6 20779 23595 596 1303
#### This is working !!!
clean_df <- setNames(data.frame(matrix(ncol = 2, nrow = dim(pairs_pop)[1])),c("to_keep", "to_keep_pop"))
# For each row, determine which city is the biggest and adds the two cities population
for (i in 1:dim(pairs_pop)[1]) {
if(pairs_pop$v1_pop2015[i] > pairs_pop$v2_pop2015[i])
{
clean_df$to_keep[i] = pairs_pop$cityCodeV1[i]
clean_df$to_keep_pop[i] = pairs_pop$v1_pop2015[i] + pairs_pop$v2_pop2015[i]
}
else
{
clean_df$to_keep[i] = pairs_pop$cityCodeV2[i]
clean_df$to_keep_pop[i] = pairs_pop$v1_pop2015[i] + pairs_pop$v2_pop2015[i]
}
}
clean_df
#> to_keep to_keep_pop
#> 1 20073 798
#> 2 204024 1439
#> 3 20183 6213
#> 4 22974 3954
#> 5 23792 751
#> 6 23595 1899
This is where I'm stuck
### trying to tidy it with rowwise, mutate and a function
v1_sup_tov2 <- function(x){
print(x)
if(x$v1_pop2015 > x$v2_pop2015){
return (TRUE)
}
return(FALSE)
}
to_clean_df2 <- pairs_pop %>%
rowwise() %>%
mutate_if(v1_sup_tov2,
to_keep = cityCodeV1,
to_delete= cityCodeV2,
to_keep_pop = v1_pop2015 + v2_pop2015)
The expected output is a dataframe with 2 columns like this:
to_keep: cityCode of the city I want to keep
to_keep_pop: population of that city
clean_df
#> to_keep to_keep_pop
#> 1 20073 798
#> 2 204024 1439
#> 3 20183 6213
#> 4 22974 3954
#> 5 23792 751
#> 6 23595 1899
What about this?
library(dplyr)
## Sample dataset
pairs_pop <- structure(
list(cityCodeV1 = c(20073, 20888, 20222, 22974, 23792, 20779),
cityCodeV2 = c(20063, 204024, 20183, 20406, 23586, 23595),
v1_pop2015 = c(414, 682, 497, 3639, 384, 596),
v2_pop2015 = c(384, 757, 5716, 315, 367, 1303)),
row.names = c(NA, 6L), class = c("tbl_df", "tbl", "data.frame"))
clean_df <- transmute(pairs_pop,
to_keep = if_else(v1_pop2015 > v2_pop2015, cityCodeV1, cityCodeV2),
to_keep_pop = v1_pop2015 + v2_pop2015)
This is in case you one day get more than two cities per row (v1, v2, v3, ...).
Keep all the information in your data frame so that you know which value relates to what, i.e. use a tidy data frame.
library(dplyr)
## Sample dataset
pairs_pop <- structure(
list(cityCodeV1 = c(20073, 20888, 20222, 22974, 23792, 20779),
cityCodeV2 = c(20063, 204024, 20183, 20406, 23586, 23595),
v1_pop2015 = c(414, 682, 497, 3639, 384, 596),
v2_pop2015 = c(384, 757, 5716, 315, 367, 1303)),
row.names = c(NA, 6L), class = c("tbl_df", "tbl", "data.frame"))
# Tidy dataset with all information that was in columns
library(dplyr)
library(tidyr)
library(stringr)
tidy_pairs <- pairs_pop %>%
mutate(city = 1:n()) %>%
gather("key", "value", -city) %>%
mutate(ville = str_extract(key, "([[:digit:]])"),
key = case_when(
grepl("cityCode", key) ~ "cityCode",
grepl("pop", key) ~ "pop",
TRUE ~ "other"
)) %>%
spread(key, value)
And then you can apply the test you want
tidy_pairs %>%
group_by(city) %>%
summarise(to_keep = cityCode[pop == max(pop)],
to_keep_pop = sum(pop))
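With tidyr 1.0 or later, the gather / str_extract / spread steps can probably be condensed into one pivot_longer() call using the special ".value" name. The sketch below first renames the columns to a regular name_index pattern; the new column names (cityCode_1, pop_1, ...) and the pair identifier are assumptions introduced for this illustration, and which.max() keeps the first city when populations are tied.

```r
library(dplyr)
library(tidyr)

pairs_pop <- tibble(
  cityCodeV1 = c(20073, 20888, 20222, 22974, 23792, 20779),
  cityCodeV2 = c(20063, 204024, 20183, 20406, 23586, 23595),
  v1_pop2015 = c(414, 682, 497, 3639, 384, 596),
  v2_pop2015 = c(384, 757, 5716, 315, 367, 1303)
)

tidy_pairs <- pairs_pop %>%
  # Hypothetical regular names: <variable>_<city index>
  rename(cityCode_1 = cityCodeV1, cityCode_2 = cityCodeV2,
         pop_1 = v1_pop2015, pop_2 = v2_pop2015) %>%
  mutate(pair = row_number()) %>%
  # ".value" turns the part before "_" into columns: cityCode and pop
  pivot_longer(-pair, names_to = c(".value", "ville"), names_sep = "_")

res <- tidy_pairs %>%
  group_by(pair) %>%
  summarise(to_keep = cityCode[which.max(pop)],
            to_keep_pop = sum(pop),
            .groups = "drop")

res
```

This layout extends to a third or fourth city per row without changing the summarise() step, as long as the columns follow the same naming pattern.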

Resources