How to Use Rank Function in R (using dplyr)

I have a data table called prob72 and want to add a rank column based on frac_miss_arr_delay. The highest value of frac_miss_arr_delay should get rank 1 and the lowest value should get the highest rank (53 for my data). The values of frac_miss_arr_delay are decimals, all less than 1. When I use the following line of code, every single row is ranked "1":
prob72<- prob72 %>% mutate(rank=rank(desc(frac_miss_arr_delay), ties.method = "first"))
I've tried using row_number as well
prob72<- prob72 %>% mutate(rank=row_number())
This STILL outputs all "1s" in the rank column.
week arrDelayIsMissi~ n n_total frac_miss_arr_d~
<dbl> <lgl> <int> <int> <dbl>
1 6. TRUE 1012 6101 0.166
2 26. TRUE 536 6673 0.0803
3 10. TRUE 518 6549 0.0791
4 50. TRUE 435 6371 0.0683
5 49. TRUE 404 6398 0.0631
6 21. TRUE 349 6285 0.0555
prob72[6]
# A tibble: 53 x 1
rank
<int>
1 1
2 1
3 1
4 1
5 1
6 1
7 1
8 1
9 1
10 1
# ... with 43 more rows
flights_week = mutate(flights, week=lubridate::week(time_hour))
prob51<-flights_week %>%
mutate(pos_arr_delay=if_else(arr_delay<0,0,arr_delay))
prob52<-prob51 %>% group_by(week) %>% mutate(avgDelay =
mean(pos_arr_delay,na.rm=T))
prob52 <- prob52 %>% mutate(ridic_late=TRUE)
prob52$ridic_late<- ifelse(prob52$pos_arr_delay>prob52$avgDelay*10,TRUE, FALSE)
prob53<- prob52 %>% group_by(week) %>% count(ridic_late) %>% arrange(desc(ridic_late))
prob53<-prob53 %>% filter(ridic_late==TRUE)
prob54<- prob52 %>% group_by(week) %>% count(n())
colnames(prob53)[3] <- "n_ridiculously_late"
prob53["n"] <- NA
prob53$n <- prob54$n
table5 = subset(prob53, select=c(week,n, n_ridiculously_late))
prob71 <- flights_week
prob72 <- prob71 %>% group_by(week) %>% count(arrDelayIsMissing=is.na(arr_delay)) %>% arrange(desc(arrDelayIsMissing)) %>% filter(arrDelayIsMissing==TRUE)
prob72["n_total"] <- NA
prob72$n_total<- table5$n
prob72<-prob72 %>% mutate(percentageMissing = n/n_total)
prob72<-prob72 %>% arrange(desc(percentageMissing))
colnames(prob72)[5]="frac_miss_arr_delay"
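
A likely cause, judging only from the code above: prob72 is built with group_by(week) and never ungrouped, so mutate() computes the rank within each week, and each week has exactly one row. The sketch below uses a small made-up stand-in for prob72 (prob72_toy is a hypothetical name, with values borrowed from the printed rows) to show the difference that ungroup() makes. It is an illustration under that assumption, not a confirmed diagnosis of the original data.

```r
library(dplyr)

# Toy stand-in for prob72, still grouped by week as in the pipeline above.
prob72_toy <- tibble(
  week = c(26, 6, 50, 10),
  frac_miss_arr_delay = c(0.0803, 0.166, 0.0683, 0.0791)
) %>%
  group_by(week)

# Grouped: every week is its own one-row group, so every rank is 1.
grouped_rank <- prob72_toy %>%
  mutate(rank = row_number(desc(frac_miss_arr_delay)))

# Ungrouped: ranks run over the whole table, largest fraction first.
ungrouped_rank <- prob72_toy %>%
  ungroup() %>%
  mutate(rank = row_number(desc(frac_miss_arr_delay)))

print(grouped_rank$rank)
print(ungrouped_rank$rank)
```

row_number(x) is dplyr's equivalent of rank(x, ties.method = "first"), so the same fix should apply to the rank() call in the question.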

Related

lag() with group_by between current and last observation in R

Edit: I found the solution with na.locf().
data <-
data %>%
group_by(country) %>%
arrange(wave) %>%
mutate(weight.io = na.locf(weight, na.rm = FALSE)) %>% # na.rm = FALSE keeps leading NAs so the length is unchanged
mutate(lag_weight = weight - lag(weight.io))
I have a dataset below.
set.seed(42000)
data <- data_frame(
country = sample(letters[1:20], size = 100, replace = TRUE),
weight = round(runif(100, min = 48, max = 90)))
data <- data %>%
group_by(country) %>%
arrange(weight) %>%
mutate(wave = seq_along(weight))
n_rows <- nrow(data)
perc_missing <- 10
data[sample(1:n_rows, sample(1:n_rows, round(perc_missing/100 * n_rows, 0))), c("weight")] <- NA
I would like to obtain the difference between one country's current "weight" and the last observed "weight" for each wave.
For country "a" wave 5, I want the value to be 69 - 65 (last observed weight at wave < 5).
And for wave 8, 82(weight at wave 8) - 69(weight at wave 5).
My approach was the one below, but it didn't work.
data <-
data %>%
group_by(country) %>%
arrange(wave) %>%
mutate(lag_weight = weight - lag(weight, default = first(weight, na.rm = TRUE)))
Thank you!
I think this is a combination of diff (instead of lag, though that could work just as well) and, more importantly, tidyr::fill (or zoo::na.locf, not demonstrated):
BTW, na.rm= is not an argument of first, so I've removed it.
library(dplyr)
# library(tidyr) # fill
data %>%
group_by(country) %>%
tidyr::fill(weight) %>%
filter(country == "a") %>%
mutate(lag_weight = weight - lag(weight, default = first(weight)))
# # A tibble: 10 x 4
# # Groups: country [1]
# country weight wave lag_weight
# <chr> <dbl> <int> <dbl>
# 1 a 54 1 0
# 2 a 55 2 1
# 3 a 65 3 10
# 4 a 65 4 0
# 5 a 69 5 4
# 6 a 69 6 0
# 7 a 69 7 0
# 8 a 82 8 13
# 9 a 82 9 0
# 10 a 85 10 3
The issue here is that weight is over-written with the LOCF (last-observation carried forward) value instead of preserving the NA values. If that's important, then you can make another weight variable for temporary use (and remove it):
data %>%
mutate(tmpweight = weight) %>%
group_by(country) %>%
tidyr::fill(tmpweight) %>%
filter(country == "a") %>%
mutate(lag_weight = tmpweight - lag(tmpweight, default = first(tmpweight))) %>%
select(-tmpweight)
# # A tibble: 10 x 4
# # Groups: country [1]
# country weight wave lag_weight
# <chr> <dbl> <int> <dbl>
# 1 a 54 1 0
# 2 a 55 2 1
# 3 a 65 3 10
# 4 a NA 4 0
# 5 a 69 5 4
# 6 a NA 6 0
# 7 a NA 7 0
# 8 a 82 8 13
# 9 a 82 9 0
# 10 a 85 10 3
FYI, you can use c(0, diff(weight)) instead of weight - lag(weight) for the same effect. Since diff returns the gaps between consecutive values, its result is one element shorter than the input, so we prepend a 0 here:
data %>%
group_by(country) %>%
tidyr::fill(weight) %>%
filter(country == "a") %>%
mutate(lag_weight = c(0, diff(weight)))
(The filter(country == "a") is purely for demonstration to match your example, not that it is required for this solution.)
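
For completeness, here is a rough sketch of the zoo::na.locf variant mentioned above but not demonstrated. It assumes the zoo package is installed, and it uses a small hand-made table (toy, with the country "a" values from the printed output) rather than the seeded data from the question. The helper column name filled is made up for illustration.

```r
library(dplyr)

# Toy data: country "a" as printed above, with the NA values kept.
toy <- tibble(
  country = "a",
  wave = 1:10,
  weight = c(54, 55, 65, NA, 69, NA, NA, 82, 82, 85)
)

res <- toy %>%
  group_by(country) %>%
  arrange(wave, .by_group = TRUE) %>%
  mutate(
    # Carry the last observation forward; na.rm = FALSE keeps the length intact.
    filled = zoo::na.locf(weight, na.rm = FALSE),
    lag_weight = filled - dplyr::lag(filled, default = first(filled))
  ) %>%
  ungroup() %>%
  select(-filled)

print(res)
```

As with the tmpweight approach, the original weight column keeps its NA values because the filling happens in a temporary column.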

How to merge multiple variables and create a new data set?

https://www.kaggle.com/nowke9/ipldata (contains the IPL data)
This is an exploratory study of the IPL data set (link above). After merging both files on "id" and "match_id", I created four more variables: total_extras, total_runs_scored, total_fours_hit and total_sixes_hit. Now I want to combine these newly created variables into a single data frame. When I combine them into one variable, batsman_aggregate, and select only the required columns, I get an error message.
library(tidyverse)
deliveries_tbl <- read.csv("deliveries_edit.csv")
matches_tbl <- read.csv("matches.csv")
combined_matches_deliveries_tbl <- deliveries_tbl %>%
left_join(matches_tbl, by = c("match_id" = "id"))
# Add team score and team extra columns for each match, each inning.
total_score_extras_combined <- combined_matches_deliveries_tbl%>%
group_by(id, inning, date, batting_team, bowling_team, winner)%>%
mutate(total_score = sum(total_runs, na.rm = TRUE))%>%
mutate(total_extras = sum(extra_runs, na.rm = TRUE))%>%
group_by(total_score, total_extras, id, inning, date, batting_team, bowling_team, winner)%>%
select(id, inning, total_score, total_extras, date, batting_team, bowling_team, winner)%>%
distinct(total_score, total_extras)%>%
glimpse()%>%
ungroup()
# Batsman Aggregate (Runs Balls, fours, six , Sr)
# Batsman score in each match
batsman_score_in_a_match <- combined_matches_deliveries_tbl %>%
group_by(id, inning, batting_team, batsman)%>%
mutate(total_batsman_runs = sum(batsman_runs, na.rm = TRUE))%>%
distinct(total_batsman_runs)%>%
glimpse()%>%
ungroup()
# Number of deliveries played .
balls_faced <- combined_matches_deliveries_tbl %>%
filter(wide_runs == 0)%>%
group_by(id, inning, batsman)%>%
summarise(deliveries_played = n())%>%
ungroup()
# Number of 4 and 6s by a batsman in each match.
fours_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 4)%>%
group_by(id, inning, batsman)%>%
summarise(fours_hit = n())%>%
glimpse()%>%
ungroup()
sixes_hit <- combined_matches_deliveries_tbl %>%
filter(batsman_runs == 6)%>%
group_by(id, inning, batsman)%>%
summarise(sixes_hit = n())%>%
glimpse()%>%
ungroup()
batsman_aggregate <- c(batsman_score_in_a_match, balls_faced, fours_hit, sixes_hit)%>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit)
The error message is:
Error: `select()` doesn't handle lists.
The required output is a data set containing the newly constructed variables.
You'll have to join those four tables, not combine using c.
The join type is left_join so that all batsmen are included in the output. Those who didn't face any balls or hit any boundaries will have NA, but you can easily replace these with 0.
I've omitted the by argument since dplyr will assume you want c("id", "inning", "batsman"), the only 3 columns common to all four data sets.
batsman_aggregate <- left_join(batsman_score_in_a_match, balls_faced) %>%
left_join(fours_hit) %>%
left_join(sixes_hit) %>%
select(id, inning, batsman, total_batsman_runs, deliveries_played, fours_hit, sixes_hit) %>%
replace(is.na(.), 0)
# A tibble: 11,335 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 1 1 DA Warner 14 8 2 1
2 1 1 S Dhawan 40 31 5 0
3 1 1 MC Henriques 52 37 3 2
4 1 1 Yuvraj Singh 62 27 7 3
5 1 1 DJ Hooda 16 12 0 1
6 1 1 BCJ Cutting 16 6 0 2
7 1 2 CH Gayle 32 21 2 3
8 1 2 Mandeep Singh 24 16 5 0
9 1 2 TM Head 30 22 3 0
10 1 2 KM Jadhav 31 16 4 1
# ... with 11,325 more rows
There are also 2 batsmen who didn't face any delivery:
batsman_aggregate %>% filter(deliveries_played==0)
# A tibble: 2 x 7
id inning batsman total_batsman_runs deliveries_played fours_hit sixes_hit
<int> <int> <fct> <int> <dbl> <dbl> <dbl>
1 482 2 MK Pandey 0 0 0 0
2 7907 1 MJ McClenaghan 2 0 0 0
One of them apparently scored 2 runs! So I think the batsman_runs column has some errors. The game is here and clearly says that on the second last delivery of the first innings, 2 wides were scored, not runs to the batsman.
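
To make the join pattern easier to try without the Kaggle files, here is a minimal sketch on made-up tables. The table names (runs, fours, sixes), the batsman labels and all values are hypothetical. It also spells out the by columns instead of relying on dplyr's guess, which is a stylistic choice rather than a requirement, and it assumes dplyr 1.0 or later for across().

```r
library(dplyr)

# Hypothetical per-match tables sharing the keys id, inning and batsman.
runs <- tibble(id = c(1, 1, 2), inning = c(1, 1, 1),
               batsman = c("A", "B", "A"), total_batsman_runs = c(14, 40, 0))
fours <- tibble(id = 1, inning = 1, batsman = "B", fours_hit = 5L)
sixes <- tibble(id = 1, inning = 1, batsman = "A", sixes_hit = 1L)

keys <- c("id", "inning", "batsman")

agg <- runs %>%
  left_join(fours, by = keys) %>%   # keeps every row of runs
  left_join(sixes, by = keys) %>%
  # Batsmen missing from a table get NA after the join; turn those into 0.
  mutate(across(c(fours_hit, sixes_hit), ~ coalesce(.x, 0L)))

print(agg)
```

Replacing NA only in the count columns, rather than with replace(is.na(.), 0) over the whole table, avoids touching columns where NA might be meaningful.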

How to find observations within a certain time range of each other in R

I have a dataset with ID, date, days of life, and medication variables. Each ID has multiple observations indicating different administrations of a certain drug. I want to find UNIQUE meds that were administered within 365 days of each other. A sample of the data frame is as follows:
ID date dayoflife meds
1 2003-11-24 16361 lasiks
1 2003-11-24 16361 vigab
1 2004-01-09 16407 lacos
1 2013-11-25 20015 pheno
1 2013-11-26 20016 vigab
1 2013-11-26 20016 lasiks
2 2008-06-05 24133 pheno
2 2008-04-07 24074 vigab
3 2014-11-25 8458 pheno
3 2014-12-22 8485 pheno
I expect the outcome to be:
ID N
1 3
2 2
3 1
indicating that individual 1 had a max of 3 different types of medications administered within 365 days of each other. I am not sure if it is best to use days of life or the date to get to this expected outcome. Any help is appreciated.
One option: convert 'date' to Date class and, grouped by 'ID', take the absolute difference between 'date' and its lag. Check whether that difference is greater than 365 and use cumsum on the result to create a grouping index. Then count the distinct elements of 'meds' per group in summarise and take the maximum per 'ID'.
library(dplyr)
df1 %>%
mutate(date = as.Date(date)) %>%
group_by(ID) %>%
mutate(diffd = abs(as.numeric(difftime(date, lag(date, default = first(date)),
units = 'days')))) %>%
group_by(grp = cumsum(diffd > 365), .add = TRUE) %>% # .add in dplyr >= 1.0.0 (formerly add)
summarise(N = n_distinct(meds)) %>%
group_by(ID) %>%
summarise(N = max(N))
# A tibble: 3 x 2
# ID N
# <int> <int>
#1 1 2
#2 2 2
#3 3 1
You can try:
library(dplyr)
df %>%
group_by(ID) %>%
mutate(date = as.Date(date),
lag_date = abs(date - lag(date)) <= 365,
lead_date = abs(date - lead(date)) <= 365) %>%
mutate_at(vars(lag_date, lead_date), ~ ifelse(., ., NA)) %>%
filter(coalesce(lag_date, lead_date)) %>%
summarise(N = n_distinct(meds))
Output:
# A tibble: 3 x 2
ID N
<int> <int>
1 1 2
2 2 2
3 3 1
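
A different technique from the two answers above, offered only as a sketch: a sliding window over dayoflife. It assumes that "within 365 days of each other" means all the medications fall inside one 365-day span, and it uses the sample rows from the question. The helper max_distinct_in_window is a made-up name. For each observation it counts the distinct meds recorded from that day up to 365 days later, then takes the largest count per ID.

```r
library(dplyr)

# Sample rows from the question (date omitted, dayoflife used instead).
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
  dayoflife = c(16361, 16361, 16407, 20015, 20016, 20016, 24133, 24074, 8458, 8485),
  meds = c("lasiks", "vigab", "lacos", "pheno", "vigab", "lasiks",
           "pheno", "vigab", "pheno", "pheno")
)

# Hypothetical helper: largest number of distinct meds in any window of
# `width` days that starts at one of the observed days.
max_distinct_in_window <- function(day, med, width = 365) {
  counts <- vapply(seq_along(day), function(i) {
    in_window <- day >= day[i] & day - day[i] <= width
    length(unique(med[in_window]))
  }, integer(1))
  max(counts)
}

res <- df %>%
  group_by(ID) %>%
  summarise(N = max_distinct_in_window(dayoflife, meds))

print(res)
```

The loop compares every observation with every other one inside an ID, so it is quadratic in the number of rows per ID; for very long patient histories a rolling join would likely be a better fit.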

How to use mutate iteratively over multiple rows in r

I am trying to calculate the percent difference in ht between all possible pairs of data, per group of individuals, as well as the time difference between the ht measures. This is my data:
hc1<- data.frame(id= c(1,1,1,2,2,2,3,3),
testoccasion= c(1,2,3,1,2,3,1,2),
ht= c(0.2,0.1,0.8,0.9,1.0,0.5,0.4,0.8),
time= c(5,4,8,5,6,5,2,1))
This is my code.
library(dplyr)
a<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht))/lag(ht)*100) %>%
mutate(t = time-lag(time))
b<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht,2))/lag(ht,2)*100) %>%
mutate(t = time-lag(time,2))
c<-hc1 %>%
group_by(id) %>%
arrange(id,testoccasion) %>%
mutate(fd = (ht-lag(ht,3))/lag(ht,3)*100) %>%
mutate(t = time-lag(time,3))
diff<-rbind(a,b,c)
diff<-na.omit(diff)
I am curious how I can make this code shorter. I want to find the difference across all possible pairs of ht, for all test occasions, where the number of test occasions differs between individual ids. It would be great if I didn't have to do it iteratively like this, because my dataset is huge. Thanks!
We can use map_df to loop over the n passed to lag:
library(tidyverse)
map_df(1:3, ~
hc1 %>%
group_by(id) %>%
arrange(id, testoccasion) %>%
mutate(fd = (ht -lag(ht, .x))/lag(ht, .x) * 100,
t = time -lag(time, .x))) %>%
na.omit
# A tibble: 7 x 6
# Groups: id [3]
# id testoccasion ht time fd t
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 2 0.1 4 -50 -1
#2 1 3 0.8 8 700 4
#3 2 2 1 6 11.1 1
#4 2 3 0.5 5 -50 -1
#5 3 2 0.8 1 100 -1
#6 1 3 0.8 8 300. 3
#7 2 3 0.5 5 -44.4 0
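
Because the number of test occasions differs between ids, the range 1:3 above has to be chosen by hand. One possible alternative, sketched here with base R only, is to join the data to itself on id and keep the pairs where the first occasion precedes the second. The suffixes _from and _to are names picked for this illustration.

```r
hc1 <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3),
                  testoccasion = c(1, 2, 3, 1, 2, 3, 1, 2),
                  ht = c(0.2, 0.1, 0.8, 0.9, 1.0, 0.5, 0.4, 0.8),
                  time = c(5, 4, 8, 5, 6, 5, 2, 1))

# Self-join within id: every occasion paired with every other occasion.
pairs <- merge(hc1, hc1, by = "id", suffixes = c("_from", "_to"))

# Keep each unordered pair once, earlier occasion first.
pairs <- pairs[pairs$testoccasion_from < pairs$testoccasion_to, ]

# Percent difference in ht and time difference for each pair.
pairs$fd <- (pairs$ht_to - pairs$ht_from) / pairs$ht_from * 100
pairs$t <- pairs$time_to - pairs$time_from

print(pairs)
```

The self-join creates n squared rows per id before filtering, so on a very large data set with many occasions per id it may need to be run in chunks.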

How to summarize the data by factor levels in r

I have the following data and I want to summarise it (min/max/mean/median/mode/sd) by factor level, where the factor is the cluster.kmeans column.
head(MS.DATA.IMPVAR.KMEANS,10)
subscribers arpu handset3g mou rechargesum cluster.kmeans
1 105822 197704.10 19040 2854801.0 235430 5
2 18210 34799.21 2856 419109.0 39820 6
3 71351 133842.38 13056 2021183.0 157099 3
4 44975 104681.58 9439 1303220.6 121697 2
5 75860 133190.55 12605 1714640.8 144262 5
6 63740 119389.91 11067 1651303.2 143333 1
7 59368 117792.03 11747 1690910.7 136902 5
8 40064 80427.09 7217 886214.5 89226 2
9 51966 99385.52 9972 1407985.7 117353 5
10 70811 141131.66 12362 1373104.7 158206 4
I tried using dplyr and got the following:
s_kmeans <- MS.DATA.IMPVAR.KMEANS %>% group_by(cluster.kmeans) %>% summarise_all(c("mean", "median", "min", "max", "sd"))
s_kmeans <- gather(s_kmeans, key, value, -cluster.kmeans)
s_kmeans$variable <- sapply(strsplit(s_kmeans$key, "_"), `[`,1)
s_kmeans$stat <- sapply(strsplit(s_kmeans$key, "_"), `[`, 2)
MS.DATA.STATS.KMEANS <- select(s_kmeans, -key) %>% spread(key = stat, value = value)
head(MS.DATA.STATS.KMEANS)
A tibble: 6 × 7
cluster.kmeans variable max mean median min
<fctr> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 arpu 250153.5 164652.99 163718.33 88306.53
2 1 handset3g 21809.0 13736.38 13598.00 6936.00
3 1 mou 1143639.1 338834.54 313010.20 116523.59
4 1 rechargesum 270169.0 173397.03 171897.00 89080.00
5 1 subscribers 41428.0 26515.01 26321.00 13794.00
6 2 arpu 163566.9 84552.09 82402.23 29477.03
I would like to do this some other way, with fewer lines of code and without dplyr, using base R functions such as by, aggregate, etc.
It is not clear whether fewer lines of code or base R is the priority. However, with the current Hadleyverse format, we can place all of the code within a single %>% chain and use separate instead of the two sapply steps to make it more compact:
library(dplyr)
library(tidyr)
MS.DATA.IMPVAR.KMEANS %>%
group_by(cluster.kmeans) %>%
summarise_all(list(mean = mean, median = median, min = min, max = max, sd = sd)) %>% # named list replaces the deprecated funs()
gather(key, value, -cluster.kmeans) %>%
separate(key, into = c("variable", "stats")) %>%
spread(stats, value)
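
Since the question asks for base R, here is one possible sketch using aggregate. It runs on a cut-down copy of the ten printed rows (two measured columns only), and stats is a made-up helper name. The statistical mode is left out because base R has no built-in function for it.

```r
# Toy data: the ten rows printed in the question, two measured columns only.
ms <- data.frame(
  subscribers = c(105822, 18210, 71351, 44975, 75860,
                  63740, 59368, 40064, 51966, 70811),
  arpu = c(197704.10, 34799.21, 133842.38, 104681.58, 133190.55,
           119389.91, 117792.03, 80427.09, 99385.52, 141131.66),
  cluster.kmeans = factor(c(5, 6, 3, 2, 5, 1, 5, 2, 5, 4))
)

# Hypothetical helper returning several named statistics for one column.
stats <- function(x) {
  c(mean = mean(x), median = median(x), min = min(x), max = max(x), sd = sd(x))
}

# One row per cluster; each measured column becomes a matrix of statistics.
res <- aggregate(cbind(subscribers, arpu) ~ cluster.kmeans, data = ms, FUN = stats)

# Flatten the matrix columns into ordinary columns such as subscribers.mean.
res_flat <- do.call(data.frame, res)

print(res_flat)
```

Clusters with a single row get NA for sd, which is expected with only ten rows of toy data.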
