How to merge multiple variables and create a new data set? - r

The IPL data set is available at https://www.kaggle.com/nowke9/ipldata.
This is an exploratory study of the IPL data set (link above). After merging the two files on "id" and "match_id", I created four more variables, namely total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. Now I wish to combine these newly created variables into a single data frame. When I combine them into one variable, batsman_aggregate, and select only the required columns, I get an error message.
library(tidyverse)
deliveries_tbl <- read.csv("deliveries_edit.csv")
matches_tbl <- read.csv("matches.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
left_join(matches_tbl, by = c("match_id" = "id"))
# Add team score and team extra columns for each match, each inning.
total_score_extras_combined <- combined_matches_deliveries_tbl%>%
group_by(id, inning, date, batting_team, bowling_team, winner)%>%
mutate(total_score = sum(total_runs, na.rm = TRUE))%>%
mutate(total_extras = sum(extra_runs, na.rm = TRUE))%>%
group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner)%>%
select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner)%>%
distinct(total_score, total_extras)%>%
glimpse()%>%
ungroup()
# Batsman Aggregate (Runs Balls, fours, six , Sr)
# Batsman score in each match
batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
group_by(id, inning, batting_team, batsman)%>%
mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE))%>%
distinct(total_batsman_runs)%>%
glimpse()%>%
ungroup()
# Number of deliveries played .
balls_faced <- combined_matches_deliveries_tbl %>%
filter(wide_runs == 0)%>%
group_by(id, inning, batsman)%>%
summarise(deliveries_played = n())%>%
ungroup()
# Number of 4 and 6s by a batsman in each match.
fours_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 4)%>%
group_by(id, inning, batsman)%>%
summarise(fours_hit = n())%>%
glimpse()%>%
ungroup()
sixes_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 6)%>%
group_by(id, inning, batsman)%>%
summarise(sixes_hit = n())%>%
glimpse()%>%
ungroup()
batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit)%>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)
The error message is:
Error: `select()` doesn't handle lists.
The required output is a single data set containing the newly constructed variables.

You'll have to join those four tables, not combine them with c(). Calling c() on data frames returns a plain list of their columns, which is why select() complains that it doesn't handle lists.
Use left_join so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.
I've omitted the by argument, since dplyr will join on c("id", "inning", "batsman"), the only 3 columns common to all four data sets.
batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
  left_join(fours_hit) %>%
  left_join(sixes_hit) %>%
  select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
  replace(is.na(.), 0)
# A tibble: 11,335 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 1 1 DA Warner 14 8 2 1
2 1 1 S Dhawan 40 31 5 0
3 1 1 MC Henriques 52 37 3 2
4 1 1 Yuvraj Singh 62 27 7 3
5 1 1 DJ Hooda 16 12 0 1
6 1 1 BCJ Cutting 16 6 0 2
7 1 2 CH Gayle 32 21 2 3
8 1 2 Mandeep Singh 24 16 5 0
9 1 2 TM Head 30 22 3 0
10 1 2 KM Jadhav 31 16 4 1
# ... with 11,325 more rows
There are also 2 batsmen who didn't face any delivery:
batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 482 2 MK Pandey 0 0 0 0
2 7907 1 MJ McClenaghan 2 0 0 0
One of whom apparently scored 2 runs! So I think the batsman_runs column has some errors. The game is here, and it clearly says that on the second-last delivery of the first innings, 2 wides were scored, not runs to the batsman.
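The same join can be written with an explicit by argument, which avoids relying on dplyr guessing the key columns. The following is a minimal sketch on made-up toy tables (the numbers and the names runs, balls, fours, keys and toy_aggregate are illustrative assumptions, not the IPL data); it joins a list of summaries with purrr::reduce() and replaces the resulting NA counts with 0.

```r
library(dplyr)
library(purrr)

# Toy per-batsman summaries (made-up numbers, not the IPL data)
runs  <- tibble(id = c(1, 1, 2), inning = c(1, 1, 1),
                batsman = c("A", "B", "A"),
                total_batsman_runs = c(14, 40, 0))
balls <- tibble(id = c(1, 1), inning = c(1, 1),
                batsman = c("A", "B"),
                deliveries_played = c(8L, 31L))
fours <- tibble(id = 1, inning = 1, batsman = "B", fours_hit = 5L)

# Join keys stated explicitly instead of relying on the default
keys <- c("id", "inning", "batsman")

toy_aggregate <- list(runs, balls, fours) %>%
  # left_join each table in turn onto the running result
  reduce(function(x, y) left_join(x, y, by = keys)) %>%
  # batsmen missing from a summary table get a count of 0
  mutate(across(c(deliveries_played, fours_hit), ~ coalesce(.x, 0L)))

toy_aggregate
```

Using coalesce() on named columns, rather than replace(is.na(.), 0) on the whole frame, keeps the integer column types and leaves any other columns untouched.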

Related

For loop with if else

I have a data frame, part of which looks like this:
| Number|Category| A1|A2|B1|B2|C1|C2|A |B |C |
| ------| -------|---|--|--|--|--|--|--|--|--|
| 1 | 1 | 10|30|5 |15|NA|NA|5 |10|NA|
| 2 | 2 | 10|30|5 |15|25|35|40|20|45|
The conditions are as follows.
A1 & A2, B1 & B2, and C1 & C2 are the lower and upper limits, respectively, of the factors A, B, and C,
and columns A, B, C hold the measurements.
If the measurement is below the lower limit, the factor is "passed";
if it is between the two limits, the factor is in "danger";
if the measurement is above the upper limit, it is "failed".
For Category = 1, only one failed factor is permitted, in which case the asset is classified as "risk";
with 2 failures, the asset in row 1 is classified as "fail".
For Category = 2, two failures are permitted: with one failed factor the asset is "at risk", with 2 failures it is "risk", and with 3 failures it is "fail".
I would like to calculate, for every row (asset), the status of every factor and then the status of the asset. I am trying to do that with a for loop and an if-else statement that iterates through these columns for every row, but it seems difficult as I am a beginner. The final result should attach the following columns to the data set. Thank you in advance.
|Number|Aa |Bb |Cc |Result |
|------|------|------|------|-------|
|1 |passed|danger|NA | risk |
|2 |failed|failed|failed| failed|
This can be done in dplyr alone, without reshaping the data or using any loop (for/while), by combining across(), cur_data() and cur_column(). (In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick().)
library(dplyr, warn.conflicts = F)
df
#> Number Category A1 A2 B1 B2 C1 C2 A B C
#> 1 1 1 10 30 5 15 NA NA 5 10 NA
#> 2 2 2 10 30 5 15 25 35 40 20 45
df %>%
  group_by(Number, Category) %>%
  transmute(across(c('A', 'B', 'C'),
                   ~ case_when(is.na(.) | is.na(get(paste0(cur_column(), 1))) |
                                 is.na(get(paste0(cur_column(), 2))) ~ NA_character_,
                               . < get(paste0(cur_column(), 1)) ~ 'passed',
                               . <= get(paste0(cur_column(), 2)) ~ 'danger',
                               TRUE ~ 'failed'),
                   .names = '{.col}{tolower(.col)}')) %>%
  mutate(Result = ifelse(rowSums(cur_data() == 'failed', na.rm = T) <= Category, 'risk', 'failed'))
#> # A tibble: 2 x 6
#> # Groups: Number, Category [2]
#> Number Category Aa Bb Cc Result
#> <int> <int> <chr> <chr> <chr> <chr>
#> 1 1 1 passed danger <NA> risk
#> 2 2 2 failed failed failed failed
Created on 2021-07-06 by the reprex package (v2.0.0)
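If the across()/get() construction feels opaque, the per-factor rule can also be spelled out as a small helper and applied column by column. This is only a sketch under the same cut-offs as the answer above (below the lower limit is "passed", up to and including the upper limit is "danger", otherwise "failed"); the helper name classify_factor and the result name status are hypothetical, and df is rebuilt from the question's sample table.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  Number = 1:2, Category = 1:2,
  A1 = c(10, 10), A2 = c(30, 30),
  B1 = c(5, 5),   B2 = c(15, 15),
  C1 = c(NA, 25), C2 = c(NA, 35),
  A = c(5, 40), B = c(10, 20), C = c(NA, 45)
)

# Hypothetical helper: classify measurements against their limits
classify_factor <- function(value, lower, upper) {
  case_when(
    is.na(value) | is.na(lower) | is.na(upper) ~ NA_character_,
    value < lower  ~ "passed",
    value <= upper ~ "danger",
    TRUE           ~ "failed"
  )
}

status <- df %>%
  transmute(
    Number,
    Aa = classify_factor(A, A1, A2),
    Bb = classify_factor(B, B1, B2),
    Cc = classify_factor(C, C1, C2)
  )

status
```

This version repeats one line per factor, so it suits a handful of factors; with many factors the across() approach scales better.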
You can also use the following solution which is a combination of base R and tidyverse:
library(dplyr)
library(purrr)
colnames <- c(1, 2)
tmp <- df[-colnames]
lapply(split.default(tmp, gsub("(\\w)\\d+?", "\\1", names(tmp))),
       function(x) cbind(df[colnames], x)) %>%
  imap(~ .x %>%
         mutate(!!{.y} := pmap_chr(., ~
           ifelse(any(is.na(..3), is.na(..4), is.na(..5)), "NA",
                  ifelse(..5 > ..3 & ..5 < ..4, "danger", ifelse(..5 < ..3, "passed", "failed"))))) %>%
         select(-c(3, 4))) %>%
  # join on the identifying columns (the argument is `by`, not `id`)
  reduce(~ full_join(..1, ..2, by = c("Number", "Category"))) %>%
  rowwise() %>%
  mutate(Result = case_when(
    Category == 1 & sum(c_across(A:C) == "failed") <= 1 ~ "Risk",
    Category == 1 & sum(c_across(A:C) == "failed") > 1 ~ "Fail",
    Category == 2 & sum(c_across(A:C) == "failed") == 1 ~ "At_Risk",
    Category == 2 & sum(c_across(A:C) == "failed") == 2 ~ "Risk",
    Category == 2 & sum(c_across(A:C) == "failed") == 3 ~ "Fail"
  ))
# A tibble: 2 x 6
# Rowwise:
Number Category A B C Result
<dbl> <dbl> <chr> <chr> <chr> <chr>
1 1 1 passed danger NA Risk
2 2 2 failed failed failed Fail
Much of your problem is caused by the untidy nature of your data frame. I started to provide solutions based on both your untidy data and a tidy equivalent, but the untidy solution, whilst possible, became just too painful.
So, here's a solution based on a tidy equivalent of your data frame.
First, make it tidy. The reason your data frame is untidy is that your column names contain information, namely that A1 and A2 contain the acceptance limits for values in A, and so on. We can correct this by making the data frame longer.
The process is a little long because of the extent of the untidiness of the original. It might be possible to create a more compact version of the transformation using, say, names_pattern and other advanced arguments to pivot_longer(), but the long version at least has the benefit of clarity.
library(tidyverse)  # provides dplyr, tidyr (pivot_longer) and stringr (str_sub)
longDF <- df %>%
select(Number, Category, A, B, C) %>%
pivot_longer(
c(-Category, -Number),
names_to="Variable",
values_to="Value"
) %>%
left_join(
df %>%
select(Number, Category, A1, B1, C1) %>%
pivot_longer(
c(-Category, -Number),
names_to="Variable",
values_to="Lower"
) %>%
mutate(Variable=str_sub(Variable, 1, 1)),
by=c("Number", "Category", "Variable")
) %>%
left_join(
df %>%
select(Number, Category, A2, B2, C2) %>%
pivot_longer(
c(-Category, -Number),
names_to="Variable",
values_to="Upper"
) %>%
mutate(Variable=str_sub(Variable, 1, 1)),
by=c("Number", "Category", "Variable")
)
longDF
# A tibble: 6 x 6
Number Category Variable Value Lower Upper
<dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 1 1 A 5 10 30
2 1 1 B 10 5 15
3 1 1 C NA NA NA
4 2 2 A 40 10 30
5 2 2 B 20 5 15
6 2 2 C 45 25 35
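A more compact route to the same long layout is sketched below. It is an alternative, not the answer's method: it assumes the columns follow the naming scheme in the question (a single letter for the measurement, the letter plus 1 for the lower limit, the letter plus 2 for the upper limit), renames them to a letter_Role pattern, and then lets the special ".value" name in pivot_longer() split them in one step. The name longDF2 is hypothetical, and df is rebuilt from the question's sample table.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question
df <- data.frame(
  Number = c(1, 2), Category = c(1, 2),
  A1 = c(10, 10), A2 = c(30, 30),
  B1 = c(5, 5),   B2 = c(15, 15),
  C1 = c(NA, 25), C2 = c(NA, 35),
  A = c(5, 40), B = c(10, 20), C = c(NA, 45)
)

longDF2 <- df %>%
  # A -> A_Value, A1 -> A_Lower, A2 -> A_Upper (and likewise for B, C)
  rename_with(~ paste0(.x, "_Value"), c(A, B, C)) %>%
  rename_with(~ sub("1$", "_Lower", .x), matches("^[A-Z]1$")) %>%
  rename_with(~ sub("2$", "_Upper", .x), matches("^[A-Z]2$")) %>%
  # the part before "_" becomes Variable; the part after it names the output column
  pivot_longer(
    cols = -c(Number, Category),
    names_to = c("Variable", ".value"),
    names_sep = "_"
  )

longDF2
```

The result has the same six rows as longDF, with the Lower, Upper and Value columns in a different order.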
So at this point, we have columns that define the Category of the test, the Variable being measured, its Value and the two acceptance limits (Lower and Upper).
Now, determining the acceptability of each Value is straightforward.
longDF <- longDF %>%
mutate(
Result=ifelse(
Value < Lower,
"Pass",
ifelse(Value < Upper, "Danger", "Fail")
)
)
longDF
# A tibble: 6 x 7
Number Category Variable Value Lower Upper Result
<dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
1 1 1 A 5 10 30 Pass
2 1 1 B 10 5 15 Danger
3 1 1 C NA NA NA NA
4 2 2 A 40 10 30 Fail
5 2 2 B 20 5 15 Fail
6 2 2 C 45 25 35 Fail
Also, note that the categorisation of each value is independent of both the Variable and the number of possible variables. So the code is robust in these respects.
Now we can categorise the results by Number and Category.
longDF %>%
group_by(Number, Category, Result) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
names_from=Result,
values_from=N,
values_fill=0
)
# A tibble: 2 x 7
Number Category Danger Pass `NA` Fail
<dbl> <dbl> <int> <int> <int> <int>
1 1 1 1 1 1 0
2 2 2 0 0 0 3
Again, we are robust with respect to both the number of Category and Number values, and their labels.
Evaluating the overall results is also straightforward, but slightly long-winded because of the various options. Note that your text is inconsistent with the desired output because you haven't explained how an overall result of "warn" for Category = 1 is obtained. I've gone with the text. If you want to match the sample output, the changes to the code should be simple once the criteria are defined.
longDF %>%
group_by(Number, Category, Result) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
names_from=Result,
values_from=N,
values_fill=0
) %>%
mutate(
Result=ifelse(
Category == 1,
ifelse(Fail == 0, "Pass", ifelse(Fail == 1, "Risk", "Fail")),
ifelse(Fail < 2, "Pass", ifelse(Fail == 2, "Risk", "Fail"))
)
)
# A tibble: 2 x 7
Number Category Danger Pass `NA` Fail Result
<dbl> <dbl> <int> <int> <int> <int> <chr>
1 1 1 1 1 1 0 Pass
2 2 2 0 0 0 3 Fail
If you need to know which Variable caused potential failures, that can also be obtained from longDF with a small change to the grouping.
longDF %>%
group_by(Category, Variable, Result) %>%
summarise(N=n(), .groups="drop") %>%
pivot_wider(
names_from=Variable,
values_from=Result
)
# A tibble: 2 x 5
Category N A B C
<dbl> <int> <chr> <chr> <chr>
1 1 1 Pass Danger NA
2 2 1 Fail Fail Fail
And, of course, you could join these two data frames together to get a comprehensive description of both the overall results and the component variable assessments.

is there an R code for the following data wrangling and transformation

I have the following data set
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029)
dat1<-data.frame(id,s02)
I would like to create a new data set based on dat1. The code should automatically create n columns named s02__0, s02__1, s02__2, s02__3, s02__4 (here n == 5). Then, based on the id in dat1, the code should allocate each s02 value to the respective column s02__0 to s02__4. Rows are uniquely identified by a second ID, id_2, which numbers the rows created for each id. If there are fewer s02 values than columns in a row, the remaining cells should be filled with ##N/A##. If there are more s02 values than n, a new row with the next id_2 is created to hold the extra values, and every blank cell is still filled with ##N/A##.
From the data set above, I would like to get the following output:
id<-c(1,2,3,3,4,4,4,4,4,4)
id_2<-c(1,1,1,2,1,2,3,4,5,6)
s02__0<-c(1,1,1,6,1,6,11,16,21,26)
s02__1<-c(2,2,2,7,2,7,12,17,22,27)
s02__2<-c(3,3,3,##N/A##,3,8,13,18,23,28)
s02__3<-c(4,4,4,##N/A##,4,9,14,19,24,29)
s02__4<-c(##N/A##,5,5,##N/A##,5,10,15,20,25,##N/A##)
dat2<-data.frame(id,id_2,s02__0,s02__1,s02__2,s02__3,s02__4)
This can produce what you want:
library(tidyverse)
#Data
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007)
dat1<-data.frame(id,s02)
#Code
dat2 <- dat1 %>% group_by(id) %>% mutate(id2 = ifelse(s02<=5,1,2)) %>% ungroup() %>%
group_by(id,id2) %>% mutate(val=1:n()-1,nid = cur_group_id()) %>% ungroup() %>%
select(-id2) %>% mutate(id=paste0(id,'.',nid),val=paste0('s02','.',val)) %>% select(-nid) %>%
pivot_wider(names_from = c(val),values_from = s02) %>%
mutate(id=gsub("\\..*","", id)) %>% group_by(id) %>%
mutate(id2=1:n()) %>% select(order(colnames(.)))
dat2
# A tibble: 4 x 7
# Groups: id [3]
id id2 s02.0 s02.1 s02.2 s02.3 s02.4
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 3 4 NA
2 2 1 1 2 3 4 5
3 3 1 1 2 3 4 5
4 3 2 6 7 NA NA NA
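The code above hard-codes the split at s02 <= 5, so it only handles up to two rows per id. A more general sketch, assuming the values should simply be packed n per row in their original order within each id, derives both the row number and the column name from each value's position using integer division and remainder. The helper columns pos and col are hypothetical names, and dat1 is rebuilt here with the full data from the question.

```r
library(dplyr)
library(tidyr)

# Full data from the question: ids 1 to 4 with 4, 5, 7 and 29 values
id  <- c(rep(1, 4), rep(2, 5), rep(3, 7), rep(4, 29))
s02 <- c(1:4, 1:5, 1:7, 1:29)
dat1 <- data.frame(id, s02)

n <- 5  # number of s02__ columns per output row

dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(
    pos  = row_number() - 1,          # 0-based position within id
    id_2 = pos %/% n + 1,             # output row the value falls in
    col  = paste0("s02__", pos %% n)  # output column the value goes to
  ) %>%
  ungroup() %>%
  select(-pos) %>%
  pivot_wider(names_from = col, values_from = s02)

dat2
```

Empty cells come out as NA. If the literal text ##N/A## is required, the s02__ columns would have to be converted to character first, for example with mutate(across(starts_with("s02__"), ~ replace_na(as.character(.x), "##N/A##"))).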

R dplyr count observations within groups

I have a data frame with yes/no values for different days and hours. For each day, I want to get a total number of hours where I have data, as well as the total number of hours where there is a value of Y.
df <- data.frame(day = c(1,1,1,2,2,3,3,3,3,4),
hour = c(1,2,3,1,2,1,2,3,4,1),
YN = c("Y","Y","Y","Y","Y","Y","N","N","N","N"))
df %>%
group_by(day) %>%
summarise(tot.hour = n(),
totY = WHAT DO I PUT HERE?)
Compare YN with "Y" to get a logical vector, then sum it (TRUE counts as 1, FALSE as 0):
df %>%
  group_by(day) %>%
  dplyr::summarise(tot.hour = n(),
                   totY = sum(YN == 'Y'))
# A tibble: 4 x 3
day tot.hour totY
<dbl> <int> <int>
1 1 3 3
2 2 2 2
3 3 4 1
4 4 1 0
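The same trick extends to proportions, because the mean of a logical vector is the share of TRUE values. The sketch below reuses the question's data and adds a column propY (a hypothetical name); na.rm = TRUE is included on the assumption that missing YN values, if any, should be skipped rather than turn the result into NA.

```r
library(dplyr)

# Data copied from the question
df <- data.frame(day  = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 4),
                 hour = c(1, 2, 3, 1, 2, 1, 2, 3, 4, 1),
                 YN   = c("Y", "Y", "Y", "Y", "Y", "Y", "N", "N", "N", "N"))

day_summary <- df %>%
  group_by(day) %>%
  summarise(tot.hour = n(),
            totY  = sum(YN == "Y", na.rm = TRUE),   # count of "Y" hours
            propY = mean(YN == "Y", na.rm = TRUE))  # share of "Y" hours

day_summary
```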

How to Use Rank Function in R (using dplyr)

I have a data table called prob72. I want to add a column for rank. I want to rank each row by frac_miss_arr_delay. The highest value of frac_miss_arr_delay should get rank 1 and the lowest value should get the highest ranking (for my data that is rank 53). frac_miss_arr_delay are decimal values all less than 1. When I use the following line of code it ranks every single row as "1"
prob72<- prob72 %>% mutate(rank=rank(desc(frac_miss_arr_delay), ties.method = "first"))
I've tried using row_number as well
prob72<- prob72 %>% mutate(rank=row_number())
This STILL outputs all "1s" in the rank column.
week arrDelayIsMissi~ n n_total frac_miss_arr_d~
<dbl> <lgl> <int> <int> <dbl>
1 6. TRUE 1012 6101 0.166
2 26. TRUE 536 6673 0.0803
3 10. TRUE 518 6549 0.0791
4 50. TRUE 435 6371 0.0683
5 49. TRUE 404 6398 0.0631
6 21. TRUE 349 6285 0.0555
prob72[6]
# A tibble: 53 x 1
rank
<int>
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
# ... with 43 more rows
flights_week = mutate(flights, week=lubridate::week(time_hour))
prob51<-flights_week %>%
mutate(pos_arr_delay=if_else(arr_delay<0,0,arr_delay))
prob52<-prob51 %>% group_by(week) %>% mutate(avgDelay =
mean(pos_arr_delay,na.rm=T))
prob52 <- prob52 %>% mutate(ridic_late=TRUE)
prob52$ridic_late<- ifelse(prob52$pos_arr_delay>prob52$avgDelay*10,TRUE, FALSE)
prob53<- prob52 %>% group_by(week) %>% count(ridic_late) %>% arrange(desc(ridic_late))
prob53<-prob53 %>% filter(ridic_late==TRUE)
prob54<- prob52 %>% group_by(week) %>% count(n())
colnames(prob53)[3] <- "n_ridiculously_late"
prob53["n"] <- NA
prob53$n <- prob54$n
table5 = subset(prob53, select=c(week,n, n_ridiculously_late))
prob71 <- flights_week
prob72 <- prob71 %>% group_by(week) %>% count(arrDelayIsMissing=is.na(arr_delay)) %>% arrange(desc(arrDelayIsMissing)) %>% filter(arrDelayIsMissing==TRUE)
prob72["n_total"] <- NA
prob72$n_total<- table5$n
prob72<-prob72 %>% mutate(percentageMissing = n/n_total)
prob72<-prob72 %>% arrange(desc(percentageMissing))
colnames(prob72)[5]="frac_miss_arr_delay"
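A likely cause, assuming prob72 is still grouped by week at the point where the rank is added: count() on a grouped data frame keeps the grouping, so after the filter each week group contains a single row, and both rank() and row_number() are evaluated within each group. The sketch below reproduces the symptom on a made-up three-row table (the values and the names toy, grouped_rank and ungrouped_rank are illustrative, not the flights data) and shows that calling ungroup() first gives the expected ranking.

```r
library(dplyr)

# Toy stand-in for prob72: one row per week, still grouped by week
toy <- tibble(week = c(10, 6, 26),
              frac_miss_arr_delay = c(0.0791, 0.166, 0.0803)) %>%
  group_by(week)

# Ranking inside one-row groups: every row gets rank 1
grouped_rank <- toy %>%
  mutate(rank = rank(desc(frac_miss_arr_delay), ties.method = "first"))

# Dropping the grouping first ranks across the whole table
ungrouped_rank <- toy %>%
  ungroup() %>%
  mutate(rank = rank(desc(frac_miss_arr_delay), ties.method = "first"))

grouped_rank$rank
ungrouped_rank$rank
```

Printing a tibble shows a "# Groups:" line when grouping is present, which is a quick way to check whether this applies to prob72.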

How to use mutate iteratively over multiple rows in r

I am trying to calculate the percent difference in ht between all possible pairs of data, per group of individuals, as well as the time difference between the ht measures. This is my data:
hc1<- data.frame(id= c(1,1,1,2,2,2,3,3),
testoccasion= c(1,2,3,1,2,3,1,2),
ht= c(0.2,0.1,0.8,0.9,1.0,0.5,0.4,0.8),
time= c(5,4,8,5,6,5,2,1))
This is my code.
library(dplyr)
a<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht))/lag(ht)*100) %>%
mutate(t = time-lag(time))
b<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht,2))/lag(ht,2)*100) %>%
mutate(t = time-lag(time,2))
c<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht,3))/lag(ht,3)*100) %>%
mutate(t = time-lag(time,3))
diff<-rbind(a,b,c)
diff<-na.omit(diff)
I am curious how I can make this code shorter. I want to be able to find the difference across all possible pairs of ht, for all test occasions, where the number of test occasions differs between individual ids. It would be great if I didn't have to do it iteratively like this, because I have a huge data set. Thanks!
We can use map to loop over the lag offset n, binding the results into one data frame:
library(tidyverse)
map_df(1:3, ~
  hc1 %>%
    group_by(id) %>%
    arrange(id, testoccasion) %>%
    mutate(fd = (ht - lag(ht, .x)) / lag(ht, .x) * 100,
           t = time - lag(time, .x))) %>%
  na.omit()
# A tibble: 7 x 6
# Groups: id [3]
# id testoccasion ht time fd t
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 2 0.1 4 -50 -1
#2 1 3 0.8 8 700 4
#3 2 2 1 6 11.1 1
#4 2 3 0.5 5 -50 -1
#5 3 2 0.8 1 100 -1
#6 1 3 0.8 8 300. 3
#7 2 3 0.5 5 -44.4 0
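The map approach still needs the largest lag (3 here) to be chosen in advance. An alternative sketch, which is not the answer's method, pairs every earlier occasion with every later one by joining the table to itself on id, so the number of test occasions per id no longer matters. The suffixes _from and _to and the name pairs are hypothetical. On a very large data set the self-join can be memory-hungry, since it grows with the square of the occasions per id.

```r
library(dplyr)

# Data copied from the question
hc1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3),
                  testoccasion = c(1, 2, 3, 1, 2, 3, 1, 2),
                  ht = c(0.2, 0.1, 0.8, 0.9, 1.0, 0.5, 0.4, 0.8),
                  time = c(5, 4, 8, 5, 6, 5, 2, 1))

# Self-join on id gives every combination of two occasions per individual
pairs <- merge(hc1, hc1, by = "id", suffixes = c("_from", "_to")) %>%
  # keep each unordered pair once, earlier occasion first
  filter(testoccasion_from < testoccasion_to) %>%
  mutate(fd = (ht_to - ht_from) / ht_from * 100,  # percent difference in ht
         t  = time_to - time_from)                # time difference

pairs
```

Each row keeps both occasion numbers, so it is clear which pair a given fd belongs to.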

Resources