How to calculate this variable in R

I have the following data:
mydf[77:84,]
id game_week points code web_name first_name second_name position team_name date fixture team1 team2 home_away team_scored team_conceded minutes goals assists cleansheet goals_conceded own_goals
77 3 1 -2 51507 Koscielny Laurent Koscielny Defender Arsenal 17/08/13 ARS-AVL ARS AVL H 1 3 67 0 0 0 3 0
78 3 2 0 51507 Koscielny Laurent Koscielny Defender Arsenal 24/08/13 FUL-ARS ARS FUL A 3 1 0 0 0 0 0 0
79 3 3 6 51507 Koscielny Laurent Koscielny Defender Arsenal 01/09/13 ARS-TOT ARS TOT H 1 0 90 0 0 1 0 0
80 3 4 2 51507 Koscielny Laurent Koscielny Defender Arsenal 14/09/13 SUN-ARS ARS SUN A 3 1 90 0 0 0 1 0
81 3 5 2 51507 Koscielny Laurent Koscielny Defender Arsenal 22/09/13 ARS-STK ARS STK H 3 1 90 0 0 0 1 0
82 3 6 2 51507 Koscielny Laurent Koscielny Defender Arsenal 28/09/13 SWA-ARS ARS SWA A 2 1 90 0 0 0 1 0
83 3 7 3 51507 Koscielny Laurent Koscielny Defender Arsenal 06/10/13 WBA-ARS ARS WBA A 1 1 90 0 0 0 1 0
84 3 8 2 51507 Koscielny Laurent Koscielny Defender Arsenal 19/10/13 ARS-NOR ARS NOR H 4 1 90 0 0 0 1 0
As part of a modeling exercise, I want to create a new variable, "mov_avg_min", which for a given "id" is the average of "minutes" played in the previous 3 values of "game_week". For example, for web_name "Koscielny" the distinct "id" is 3 in this data frame. So for id = 3 and game_week = 4, a function should calculate mov_avg_min over game_weeks 1:3 (the 3 game_weeks before the current one for the same id). Hence in row 80, mov_avg_min = (67 + 0 + 90)/3 = 52.333.

I think rollapply (from the zoo package) with width = 3 includes the value of the row you are on. So for game 4 it would give you the average of minutes in games 2, 3 and 4. You have to lag the minutes column first in order to get the average based on games 1, 2 and 3. See a simple example below:
library(dplyr)
library(zoo)
dt = data.frame(id = c(1,1,1,1,1,2,2,2,2,2),
games = c(1,2,3,4,5,1,2,3,4,5),
minutes = c(61,72,73,82,82,81,71,51,90,73))
dt
# id games minutes
# 1 1 1 61
# 2 1 2 72
# 3 1 3 73
# 4 1 4 82
# 5 1 5 82
# 6 2 1 81
# 7 2 2 71
# 8 2 3 51
# 9 2 4 90
# 10 2 5 73
dt %>% group_by(id) %>%
mutate(lag_minutes = lag(minutes, default=NA)) %>%
mutate(RA = rollapply(lag_minutes,width=3,mean, align= "right", fill=NA))
# Source: local data frame [10 x 5]
# Groups: id
#
# id games minutes lag_minutes RA
# 1 1 1 61 NA NA
# 2 1 2 72 61 NA
# 3 1 3 73 72 NA
# 4 1 4 82 73 68.66667
# 5 1 5 82 82 75.66667
# 6 2 1 81 NA NA
# 7 2 2 71 81 NA
# 8 2 3 51 71 NA
# 9 2 4 90 51 67.66667
# 10 2 5 73 90 70.66667
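As a cross-check of the rollapply result, the same trailing window can be computed in base R. The helper below is a sketch (the name trail3 is mine, not from the answer): it averages the previous three values and returns NA until three prior games exist.

```r
# Trailing mean of the previous 3 values (NA until 3 prior values exist)
trail3 <- function(x) {
  sapply(seq_along(x), function(i) {
    if (i < 4) NA_real_ else mean(x[(i - 3):(i - 1)])
  })
}

minutes <- c(61, 72, 73, 82, 82)  # id = 1 from the toy data above
trail3(minutes)
# NA NA NA 68.66667 75.66667 -- matches the RA column for id = 1
```

To apply it per id without dplyr, `ave(dt$minutes, dt$id, FUN = trail3)` does the same grouped computation.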

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. The study follows a large group of people for 25 years and records changes in 'region' (categorical, 1-4), 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? There are also some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that indicates a 1 for each change in region, comparing the current region value with the previous one (using lag). Note that if the previous value is NA (as when looking at the first value for a given i), it is treated as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left in these temporary variables so you can check whether you are getting the desired results.
Edit: if you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
drop_na(region, urban) %>%
group_by(i) %>%
mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
summarize(tot_region = sum(reg_change),
tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
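For reference, the NA-dropping change count can also be written as a small base-R helper (count_changes is a hypothetical name, not from the answer above), which makes the logic easy to test on a single vector:

```r
# Count how often consecutive (non-NA) values differ
count_changes <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 2) return(0L)
  sum(x[-1] != x[-length(x)])
}

# Condensed region column for i = 4 from the question: 3 changes
count_changes(c(1, NA, 4, 4, 1, 1, 1, NA, 4))
# 3
```

Per subject, `tapply(df$region, df$i, count_changes)` gives the same totals as the dplyr pipeline.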

Changing rows to columns through merging in R

actual_date=c('2018-01-03','2018-01-02','2018-01-25','2018-01-15','2018-01-06','2018-01-02','2018-01-16','2018-01-22','2018-01-03')
date_band=c('_201801','_201801','_201803','_201802','_201801', '_201801','_201803','_201804','_201801')
action=c('text','letter','call','letter','visit','letter','text','text','call')
unique_ref=c(1,1,2,1,2,3,3,4,4)
df1=as.data.frame(cbind(unique_ref,actual_date,date_band,action))
unique_ref=c(1,2,3,4)
priority_201801=c('3','2','3','0')
balance_201801=c('30','-20','35','-100')
priority_201802=c('1','1','1','2')
balance_201802=c('60','-40','35','0')
priority_201803=c('2','2','3','2')
balance_201803=c('30','-40','-50','100')
priority_201804=c('99','0','0','0')
balance_201804=c('0','-20','-50','-100')
df2=as.data.frame(cbind(unique_ref,priority_201801,balance_201801,priority_201802,balance_201802,priority_201803,
balance_201803,priority_201804,balance_201804))
The code above produces two examples of datasets I'm working with.
df1 looks like this:
unique_ref actual_date date_band action
1 1 2018-01-03 _201801 text
2 1 2018-01-02 _201801 letter
3 2 2018-01-25 _201803 call
4 1 2018-01-15 _201802 letter
5 2 2018-01-06 _201801 visit
6 3 2018-01-02 _201801 letter
7 3 2018-01-16 _201803 text
8 4 2018-01-22 _201804 text
9 4 2018-01-03 _201801 call
While df2 looks like:
unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803 priority_201804 balance_201804
1 1 3 30 1 60 2 30 99 0
2 2 2 -20 1 -40 2 -40 0 -20
3 3 3 35 1 35 3 -50 0 -50
4 4 0 -100 2 0 2 100 0 -100
What I want to do is add a column to df2 for each date_band, named action_dateband (i.e. action_201801, action_201802, etc.). This would be taken from the action in df1 using the date_band and matched on unique_ref.
The desired output looks as follows; where there are two actions for one date_band, they are separated by a comma:
unique_ref priority_201801 balance_201801 action_201801 priority_201802 balance_201802 action_201802 priority_201803 balance_201803 action_201803
1 1 3 30 text,letter 1 60 letter 2 30
2 2 2 -20 visit 1 -40 2 -40 call
3 3 3 35 letter 1 35 3 -50 text
4 4 0 -100 call 2 0 2 100
priority_201804 balance_201804 action_201804
1 99 0
2 0 -20
3 0 -50
4 0 -100 text
library(tidyverse)
df2 %>%
left_join(df1, by=c("unique_ref")) %>% # join df1 to df2
select(-actual_date) %>% # remove column you won't need
mutate(date_band = paste0("action", date_band)) %>% # update column values
spread(date_band, action) # reshape to get the format you want
# unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803
# 1 1 3 30 1 60 2 30
# 2 2 2 -20 1 -40 2 -40
# 3 3 3 35 1 35 3 -50
# 4 4 0 -100 2 0 2 100
# priority_201804 balance_201804 action_201801 action_201802 action_201803 action_201804
# 1 99 0 text letter <NA> <NA>
# 2 0 -20 visit <NA> call <NA>
# 3 0 -50 letter <NA> text <NA>
# 4 0 -100 call <NA> <NA> text
For the case where you have multiple actions for a specific ref and date_band you can use this approach:
library(tidyverse)
# update df1
df1_upd = df1 %>%
group_by(unique_ref, date_band) %>% # for every combination of ref and date_band
summarise(action = paste0(action, collapse = ",")) %>% # combine actions
ungroup() # forget the grouping
df2 %>%
left_join(df1_upd, by=c("unique_ref")) %>%
mutate(date_band = paste0("action", date_band)) %>%
spread(date_band, action)
# unique_ref priority_201801 balance_201801 priority_201802 balance_201802 priority_201803 balance_201803
# 1 1 3 30 1 60 2 30
# 2 2 2 -20 1 -40 2 -40
# 3 3 3 35 1 35 3 -50
# 4 4 0 -100 2 0 2 100
# priority_201804 balance_201804 action_201801 action_201802 action_201803 action_201804
# 1 99 0 text,letter letter <NA> <NA>
# 2 0 -20 visit <NA> call <NA>
# 3 0 -50 letter <NA> text <NA>
# 4 0 -100 call <NA> <NA> text
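The combine-then-widen step can also be done without tidyr, using base R's aggregate() and reshape(). This is a sketch, not the answer's method; it rebuilds a character-column version of df1, and the resulting column names (action._201801, etc.) come from reshape()'s default "." separator:

```r
# Rebuild df1 from the question with character columns
df1 <- data.frame(
  unique_ref = c(1, 1, 2, 1, 2, 3, 3, 4, 4),
  date_band  = c("_201801", "_201801", "_201803", "_201802", "_201801",
                 "_201801", "_201803", "_201804", "_201801"),
  action     = c("text", "letter", "call", "letter", "visit",
                 "letter", "text", "text", "call"),
  stringsAsFactors = FALSE
)

# One row per (unique_ref, date_band), comma-separating multiple actions
actions <- aggregate(action ~ unique_ref + date_band, data = df1,
                     FUN = function(x) paste(x, collapse = ","))

# Widen: one action.<date_band> column per band, ready to merge onto df2
wide <- reshape(actions, idvar = "unique_ref", timevar = "date_band",
                direction = "wide")
```

`merge(df2, wide, by = "unique_ref")` then attaches the action columns, with NA where a ref has no action in a band.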

Calculating table in R with uneven length

I have two tables of data in R:
a = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 2 0 0 0 2
2 3 0 0 10 3
3 4 0 51 25 0
4 5 19 129 14 0
5 6 60 137 1 0
6 7 31 62 15 5
7 8 7 11 7 0
and
b = Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
1 1 0 0 1 266
2 2 1 0 47 335
3 3 1 26 415 142
4 4 3 965 508 5
5 5 145 2535 103 0
6 6 939 2239 15 6
7 7 420 613 86 34
8 8 46 84 36 16
I would like to calculate b/a by matching on Duration. I thought of something like ifelse() but it does not work. Can someone please help me?
Thanks a lot
Match the order and selection of b with a (in my example, y with x), then do the math:
x <- data.frame(duration = 2:8, v = rnorm(7))
y <- data.frame(duration = 8:1, v = rnorm(8))
m <- match(y$duration, x$duration)
ym <- y[m[!is.na(m)],]
x$v/ym$v
It does not work when x contains items that are not in y, by the way.
Do you want something like the following:
a <- a[-1]
b <- b[-1]
a <- a[order(a$Duration),]
b <- b[order(b$Duration),]
durations <- intersect(a$Duration, b$Duration)
b[b$Duration %in% durations,] / a[a$Duration %in% durations,]
Duration (-10,0] (0,0.25] (0.25,0.5] (0.5,10]
2 1 Inf NaN Inf 167.50000
3 1 Inf Inf 41.500000 47.33333
4 1 Inf 18.921569 20.320000 Inf
5 1 7.631579 19.651163 7.357143 NaN
6 1 15.650000 16.343066 15.000000 Inf
7 1 13.548387 9.887097 5.733333 6.80000
8 1 6.571429 7.636364 5.142857 Inf
You may want to replace the NaN and Inf values with something else.
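Another option is to merge() the two tables on Duration, which keeps only the durations present in both, and then divide the matched columns. A minimal sketch with hypothetical single-value columns (the real tables would divide each pair of matched columns the same way):

```r
# Toy versions of a and b sharing a Duration key
a <- data.frame(Duration = 2:4, n = c(2, 10, 25))
b <- data.frame(Duration = 1:4, n = c(100, 47, 415, 508))

# Inner join on Duration; unmatched durations (here, 1) are dropped
m <- merge(a, b, by = "Duration", suffixes = c(".a", ".b"))
m$ratio <- m$n.b / m$n.a
m
#   Duration n.a n.b ratio
# 1        2   2  47 23.50
# 2        3  10 415 41.50
# 3        4  25 508 20.32
```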

Using do() with names of list elements

I am trying to take the names of list elements and use do() to apply a function over all of them, then bind the results into a single data frame.
require(XML)
require(magrittr)
url <- "http://gd2.mlb.com/components/game/mlb/year_2016/month_05/day_21/gid_2016_05_21_milmlb_nynmlb_1/boxscore.xml"
box <- xmlParse(url)
xml_data <- xmlToList(box)
end <- length(xml_data[[2]]) - 1
x <- seq(1:end)
away_pitchers_names <- paste0("xml_data[[2]][", x, "]")
away_pitchers_names <- as.data.frame(away_pitchers_names)
names(away_pitchers_names) <- "elements"
away_pitchers_names$elements %<>% as.character()
listTodf <- function(x) {
df <- as.data.frame(x)
tdf <- as.data.frame(t(df))
row.names(tdf) <- NULL
tdf
}
test <- away_pitchers_names %>% group_by(elements) %>% do(listTodf(.$elements))
When I run the listTodf function on a list element it works fine:
listTodf(xml_data[[2]][1])
id name name_display_first_last pos out bf er r h so hr bb np s w l sv bs hld s_ip s_h s_r s_er s_bb
1 605200 Davies Zach Davies P 16 22 4 4 5 5 2 2 86 51 1 3 0 0 0 36.0 41 24 23 15
s_so game_score era
1 25 45 5.75
But when I try to loop through the names of the elements with the do() function I get the following:
Warning message:
In rbind_all(out[[1]]) : Unequal factor levels: coercing to character
And here is the output:
> test
Source: local data frame [5 x 2]
Groups: elements [5]
elements V1
(chr) (chr)
1 xml_data[[2]][1] xml_data[[2]][1]
2 xml_data[[2]][2] xml_data[[2]][2]
3 xml_data[[2]][3] xml_data[[2]][3]
4 xml_data[[2]][4] xml_data[[2]][4]
5 xml_data[[2]][5] xml_data[[2]][5]
I am sure it is something extremely simple, but I can't figure out where things are getting tripped up.
For evaluating the strings, eval(parse()) can be used:
library(dplyr)
lapply(away_pitchers_names$elements,
function(x) as.data.frame.list(eval(parse(text=x))[[1]], stringsAsFactors=FALSE)) %>%
bind_rows()
# id name name_display_first_last pos out bf er r h so hr bb np s w l
#1 605200 Davies Zach Davies P 16 22 4 4 5 5 2 2 86 51 1 3
#2 430641 Boyer Blaine Boyer P 2 4 0 0 2 0 0 0 8 7 1 0
#3 448614 Torres, C Carlos Torres P 3 4 0 0 0 1 0 2 21 11 0 1
#4 592804 Thornburg Tyler Thornburg P 3 3 0 0 0 1 0 0 14 8 2 1
#5 518468 Blazek Michael Blazek P 1 5 1 1 2 0 0 2 23 10 1 1
# sv bs hld s_ip s_h s_r s_er s_bb s_so game_score era loss note
#1 0 0 0 36.0 41 24 23 15 25 45 5.75 <NA> <NA>
#2 0 1 0 21.1 22 4 4 5 7 48 1.69 <NA> <NA>
#3 0 0 2 22.1 22 9 9 14 21 52 3.63 <NA> <NA>
#4 1 2 8 18.2 13 8 8 7 29 54 3.86 <NA> <NA>
#5 0 1 8 21.1 23 6 6 14 18 41 2.53 true (L, 1-1)
However, it is easier and faster to just do
lapply(xml_data[[2]][1:5], function(x)
as.data.frame.list(x, stringsAsFactors=FALSE)) %>%
bind_rows()
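For comparison, the same "list of records to data frame" step can be done in base R with do.call(rbind, ...), but only when every record has identical fields; the real pitcher records differ (loss and note appear only for some), which is why bind_rows() is used above. The records here are hypothetical stand-ins for xml_data[[2]][1:2]:

```r
# Stand-in records with identical fields (unlike the real boxscore data)
recs <- list(
  list(id = "605200", name = "Davies", out = "16"),
  list(id = "430641", name = "Boyer",  out = "2")
)

# Coerce each record to a one-row data frame, then stack them
rows <- lapply(recs, function(x) as.data.frame(x, stringsAsFactors = FALSE))
result <- do.call(rbind, rows)
result
```

With ragged records, do.call(rbind, ...) errors on the mismatched columns, whereas dplyr::bind_rows() fills the missing fields with NA.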

compute a Means variable for a specific value in another variable

I would like to compute the mean age for every value (1-7) of another variable called period.
This is what my data looks like:
work1 <- read.table(header=T, text="ID dead age gender inclusion_year diagnosis surv agrp period
87 0 25 2 2006 1 2174 1 5
396 0 19 2 2003 1 3077 1 3
446 0 23 2 2003 1 3144 1 3
497 0 19 2 2011 1 268 1 7
522 1 57 2 1999 1 3407 2 1
714 0 58 2 2003 1 3041 2 3
741 0 27 2 2004 1 2587 1 4
767 0 18 1 2008 1 1104 1 6
786 0 36 1 2005 1 2887 3 4
810 0 25 1 1998 1 3783 4 2")
This is a subset of a dataset with more than 1500 observations.
This is what I'm trying to achieve:
sim <- read.table(header=T, text="Period diagnosis dead surv age
1 1 50 50000 35.5
2 1 80 70000 40.3
3 1 100 80000 32.8
4 1 120 100000 39.8
5 1 140 1200000 28.7
6 1 150 1400000 36.2
7 1 160 1600000 37.1")
In this data set I would like to group by period and diagnosis, with dead (deaths) and surv (survival time in days) summed within each period. I would also like a mean value of age in every period.
I have tried everything, but still can't create the data set I'm striving for.
All help is appreciated!
You could try data.table
library(data.table)
as.data.table(work1)[, .(dead_sum=sum(dead),
surv_sum=sum(surv),
age_mean=mean(age)), keyby=.(period, diagnosis)]
Or dplyr
library(dplyr)
work1 %>% group_by(period, diagnosis) %>%
summarise(dead_sum=sum(dead), surv_sum=sum(surv), age_mean=mean(age))
# result
period diagnosis dead_sum surv_sum age_mean
1: 1 1 1 3407 57.00000
2: 2 1 0 3783 25.00000
3: 3 1 0 9262 33.33333
4: 4 1 0 5474 31.50000
5: 5 1 0 2174 25.00000
6: 6 1 0 1104 18.00000
7: 7 1 0 268 19.00000
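The same summary is possible in base R with two aggregate() calls merged together, since aggregate() applies a single function per call. A sketch on a hypothetical four-row slice of work1:

```r
# Small slice of the question's data, enough to show the shape
work1 <- data.frame(
  period    = c(1, 3, 3, 4),
  diagnosis = c(1, 1, 1, 1),
  dead      = c(1, 0, 0, 0),
  surv      = c(3407, 3077, 3144, 2587),
  age       = c(57, 19, 23, 27)
)

# Sums for dead and surv, means for age, then join the two summaries
sums  <- aggregate(cbind(dead, surv) ~ period + diagnosis, data = work1, FUN = sum)
means <- aggregate(age ~ period + diagnosis, data = work1, FUN = mean)
merge(sums, means, by = c("period", "diagnosis"))
```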
