Hope the community can help me since I am relatively new to R and to the StackOverflow community.
I am trying to replace a missing value of a group with the average of the 3 previous years and then use this newly generated mean to continue generating the next period missing value in R either using dplyr or data.table. My data looks something like this (desired output column rounded to 2 digits):
df <- data.frame(gvkey = c(10443, 10443, 10443, 10443, 10443, 10443, 10443, 29206, 29206, 29206, 29206, 29206), fyear = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017, 2018, 2019, 2020, 2021), receivables = c(543, 595, 757, NA, NA, NA, NA, 147.469, 161.422, 154.019, NA, NA), desired_output = c(543, 595, 757, 631.67, 661.22, 683.30, 658.73, 147.47, 161.42, 154.02, 154.30, 156.58))
I have attempted the following line of code, but it does not use the newly generated number:
df <- df %>% mutate(mean_rect=rollapply(rect,3,mean,align='right',fill=NA))
Any help would be greatly appreciated!
Because your desired fill value depends on any previously created fill values, I think the only reasonable approach is a trusty for loop:
df$out <- NA
for (i in 1:nrow(df)) {
if (!is.na(df$receivables[i])) {
df$out[i] <- df$receivables[i]
} else {
df$out[i] <- mean(df$out[(i-3):(i-1)], na.rm = T)
}
}
gvkey fyear receivables desired_output out
1 10443 2005 543.000 543.00 543.0000
2 10443 2006 595.000 595.00 595.0000
3 10443 2007 757.000 757.00 757.0000
4 10443 2008 NA 631.67 631.6667
5 10443 2009 NA 661.22 661.2222
6 10443 2010 NA 683.30 683.2963
7 10443 2011 NA 658.73 658.7284
8 29206 2017 147.469 147.47 147.4690
9 29206 2018 161.422 161.42 161.4220
10 29206 2019 154.019 154.02 154.0190
11 29206 2020 NA 154.30 154.3033
12 29206 2021 NA 156.58 156.5814
Related
I'm in doubt about how to define age ranges to calculate Incidence Ratio in a cohort dataset. More specifically, my data comprises individuals who entered into a specific cohort between 2008-2018, and, furthermore, this dataset was used as a reference to merge with hospitalization information from another source of data by the id number.
My data looks like this below
id = c(1:5)
year_of_entry = c(2008, 2009, 2011, 2015, 2016)
age_of_entry = c(8,10,40,20,30)
year_birth = c(2000, 1999, 1971, 1995, 1986)
hospitalization_year = c(2009, NA, 2015, 2017, NA)
age_hospitalization = c(9, NA, 44, 22, NA)
data = data.frame(
id = id,
'Age of Entry' = age_of_entry,
'Year of Birth' = year_birth,
'Hospitalization Year' = hospitalization_year,
'Age of Hospitalization' = age_hospitalization
)
>data
id Age.of.Entry Year.of.Birth Hospitalization.Year Age.of.Hospitalization
1 8 2000 2009 9
2 10 1999 NA NA
3 40 1971 2015 44
4 20 1995 2017 22
5 30 1986 NA NA
This way, my next step is to run a linear regression to study determinants for admissions by age groups (i.e 0-10; 11-20; 21-30; 31-40; 41-50), but I'm not so sure about the criteria that I need to use in order to create these age groups regarding the fact that we have people who entered into the cohort in different periods, at different ages and was admitted in different periods of time. Additionally, as you can see in the example above, in my dataset I also have some individuals who have not been admitted.
Can anyone help me to solve that?
I want to extract a sequence of values from this dataframe altogether to form a single output, I can do so individually using:
p <- function(x, i){
r<- rank_data[x][rank_data[,i] %in% 2000:2020,]
}
p(1:2, 2)
#output
jan year
235.2 2008
where the sequence of values in p() continue as:
x=c(1:2, 3:4, 5:6)
i=(2, 4, 6)
I'm looking for a single code, where a variable x or i can be indexed into the dataframe to produce the expected output. Although, using some other iteration function like apply is welcome. I want to better understand indexing through iterative functions.
expected output:
jan year feb year2 mar year3 ...
235.2 2008 287.6 2020 187.8 2019 ...
NA NA 241.9 2002 NA NA
I've asked a similar question here, although, I'm more interested in doing this through indexing with a single iterative function. The technique provided by the author in the previous question, is very specialised, so I'm looking for something simpler to get the hang of this.
Reproducible code:
structure(list(jan = c(268.1, 263.1, 235.2, 223.3, 219.2, 218.3
), year = c(1928, 1948, 2008, 1877, 1995, 1990), feb = c(287.6,
241.9, 213.7, 205.1, 191.9, 191.2), year2 = c(2020, 2002, 1997,
1990, 1958, 1923), mar = c(225.3, 190.7, 187.8, 187.2, 175.9,
173.9), year3 = c(1981, 1903, 2019, 1947, 1994, 1912)), class = "data.frame", row.names = c(NA,
6L))
You should store the values of x in a list because if you store them in a vector there is no way to distinguish between two groups.
x = c(1:2, 3:4, 5:6)
x
#[1] 1 2 3 4 5 6
Storing them in a list.
x= list(1:2, 3:4, 5:6)
x
#[[1]]
#[1] 1 2
#[[2]]
#[1] 3 4
#[[3]]
#[1] 5 6
You can use Map to index rows from your dataframe.
p <- function(x, i){
r<- rank_data[x][rank_data[,i] %in% 2000:2020,]
r
}
x= list(1:2, 3:4, 5:6)
i= c(2, 4, 6)
result <- Map(p, x, i)
result
#[[1]]
# jan year
#3 235.2 2008
#[[2]]
# feb year2
#1 287.6 2020
#2 241.9 2002
#[[3]]
# mar year3
#3 187.8 2019
If you want output same as shown you can add another step to count max number of rows.
nr <- 1:max(sapply(result, nrow))
do.call(cbind, lapply(result, function(x) x[nr, ]))
# jan year feb year2 mar year3
#3 235.2 2008 287.6 2020 187.8 2019
#NA NA NA 241.9 2002 NA NA
I am trying to round consectutive years to the nearest year that a census took place. Unfortunately, in NZ the spacing between census is not always consistent. Eg. I want to round years 2000 to 2020 to the nearest value of 2001, 2006, 2013, 2018. Is there a way to do this without resorting to a series of if_else or case_when statements?
You could use sapply to find the minimum absolute difference between the two vectors.
Suppose your vectors were like this:
census_years <- c(2001, 2006, 2013, 2018)
all_years <- 2000:2020
Then you can do:
sapply(all_years, function(x) census_years[which.min(abs(census_years - x))])
#> [1] 2001 2001 2001 2001 2006 2006 2006 2006 2006 2006 2013 2013 2013 2013 2013
#> [16] 2013 2018 2018 2018 2018 2018
Created on 2020-12-09 by the reprex package (v0.3.0)
We can use findInterval
census_year[findInterval(year_in_question, census_year)+1]
#[1] 2013
data
census_year <- c(2001, 2006, 2013, 2018)
year_in_question <- 2012
This does the trick, by finding the smallest difference between the year and the census years. Vectorizing is left as an exercise...
require(magrittr)
census_year <- c(2001, 2006, 2013, 2018)
year_in_question <- 2012
abs(census_year - year_in_question) %>% # abs diff in years
which.min() %>% # index number of the smallest abs difference
census_year[.] # use that index number
[1] 2013
I'm a beginner in R, so please be gentle :)
I have a dataframe of the following form:
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
sampleData
id year
1 1 2010
2 1 2014
3 2 2010
4 2 2014
5 3 2010
6 4 2010
7 4 2014
I want to exclude every id, which does not have both years.
In this case: id "3" only has year "2010".
Therefore I want to conditionally remove ids, which do not have another row with the missing year.
I hope you guys can understand what I'm looking for :(
thank you in advance!
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
First you count :
library(plyr)
countBy <- ddply(unique(sampleData),
.(id),
summarise,
occurence = length(year) ,
.parallel = F )
Then you subset
sampleData[sampleData$id %in% countBy$id[countBy$occurence > 1],]
We can use ave and check number of rows for each id and select only those rows with length as 2.
sampleData[ave(sampleData$year, sampleData$id, FUN = length) == 2, ]
# id year
#1 1 2010
#2 1 2014
#3 2 2010
#4 2 2014
#6 4 2010
#7 4 2014
In case if we want to check whether both "2010" and "2014" appear at least once per id we can do
sampleData[as.logical(ave(sampleData$year, sampleData$id, FUN = function(x)
any(2014 %in% x) & any(2010 %in% x))), ]
Here is a solution with data.table
library("data.table")
sampleData <- data.frame(id = c(1,1,2,2,3,4,4), year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
setDT(sampleData)
sampleData[, `:=`(n, .N), by=id][n==2]
In case you want to make your check more explicit, i.e. not just relying on two rows per id but checking whether both "2010" and "2014" appear at least once per id, you can do something like this in base R:
x <- table(sampleData$id, sampleData$year) > 0
x
# 2010 2014
# 1 TRUE TRUE
# 2 TRUE TRUE
# 3 TRUE FALSE
# 4 TRUE TRUE
ids_to_keep <- row.names(x)[rowSums(x[,c("2010", "2014")]) == 2]
ids_to_keep
#[1] "1" "2" "4"
sampleData[sampleData$id %in% ids_to_keep,]
# id year
#1 1 2010
#2 1 2014
#3 2 2010
#4 2 2014
#6 4 2010
#7 4 2014
This approach is longer than others but it's also more robust, for example if you can have multiple occurences of the same year per id, then some other approaches may fail or, if you can have other years (not just 2010 and 2014) some other approaches may also fail if they only rely on checking number of occurences per id.
There is also a nice dplyr solution:
# create the sample dataset
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
# load dplyr library
library(dplyr)
# take the sample dateset
sampleData %>%
# group by id - thus the function within filter will be evaluated for each id
group_by(id) %>%
# filter only ids which were recorded in two separate years
filter(length(unique(year)) == 2)
I want to create a new variable that's derived from specific values in my existing variables. My data frame looks something like the following:
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)
More specifically, I want to create a third column, z, that is the ratio of x and y. However, I don't want to create this ratio by simply dividing x by y for each individual year. Instead, I want the values in 2015 (and 2014 etc.) to be an average of this ratio in the three preceding years, i.e. 2014, 2013, and 2012.
I've looked at Wickham's dplyr package and, in particular, the group_by function but I'm stumped because I don't want to group my data by year per se but by each years' three preceding years as illustrated (hopefully) above.
With dplyr and library(zoo):
df_fin<- df %>% mutate( z = rollmeanr(x/y,3,na.pad=TRUE))
I think the column z is what you want but it would be good to have the desired output.
The answers that use zoo::rollmean are all on the correct track, but they have a couple of "off by one" errors in them. First, you actually want zoo::rollmeanr( ..., na.pad=TRUE ) which will correctly pad the output with NA on the left side:
> zoo::rollmeanr( df$x / df$y, 3, na.pad=TRUE )
[1] NA NA 1.0962018 1.0359948 0.9962648 1.0590378
The second "off by one" error arises from alignment of this vector with the rest of your data. From your description, you want the value for 2015 to be the average of 2014, 2013, and 2012. However, appending the vector above to your table will make the value for 2015 to be the average of 2015, 2014, and 2013, instead. To correct, you want to omit the last value in your input to the rolling average and prepend an NA to compensate:
> c( NA, zoo::rollmeanr( head(df$x / df$y,-1), 3, na.pad=TRUE ) )
[1] NA NA NA 1.0962018 1.0359948 0.9962648
Putting it all together using dplyr notation:
df %>% mutate( z = c( NA, zoo::rollmeanr( head(x/y,-1), 3, na.pad=TRUE ) ) )
year x y z
1 2010 2980 2453 NA
2 2011 2955 2919 NA
3 2012 3110 2930 NA
4 2013 2962 2864 1.0962018
5 2014 2566 2873 1.0359948
6 2015 3788 3031 0.9962648
df$z<-0
for (i in 4:6){
df$z[i]<-mean(df$x[(i-3):(i-1)])/mean(df$y[(i-3):(i-1)])
}
Whit a loop, you can get this:
year x y z
1 2010 2980 2453 0.000000
2 2011 2955 2919 0.000000
3 2012 3110 2930 0.000000
4 2013 2962 2864 1.089497
5 2014 2566 2873 1.036038
6 2015 3788 3031 0.996654
library(zoo)
library(dplyr)
df %>% mutate(z = x/y, zz = rollmean(z, 3, fill = NA)