indexing through function to gather multiple values - r

I want to extract a sequence of values from this dataframe all together to form a single output. I can do so individually using:
p <- function(x, i){
r<- rank_data[x][rank_data[,i] %in% 2000:2020,]
}
p(1:2, 2)
#output
jan year
235.2 2008
where the sequence of values in p() continue as:
x=c(1:2, 3:4, 5:6)
i=c(2, 4, 6)
I'm looking for a single piece of code in which a variable x or i can be indexed into the dataframe to produce the expected output, although using another iteration function such as apply is also welcome. I want to better understand indexing through iterative functions.
expected output:
jan year feb year2 mar year3 ...
235.2 2008 287.6 2020 187.8 2019 ...
NA NA 241.9 2002 NA NA
I've asked a similar question here, but I'm more interested in doing this through indexing with a single iterative function. The technique provided in the answer to the previous question is very specialised, so I'm looking for something simpler to get the hang of this.
Reproducible code:
rank_data <- structure(list(jan = c(268.1, 263.1, 235.2, 223.3, 219.2, 218.3
), year = c(1928, 1948, 2008, 1877, 1995, 1990), feb = c(287.6,
241.9, 213.7, 205.1, 191.9, 191.2), year2 = c(2020, 2002, 1997,
1990, 1958, 1923), mar = c(225.3, 190.7, 187.8, 187.2, 175.9,
173.9), year3 = c(1981, 1903, 2019, 1947, 1994, 1912)), class = "data.frame", row.names = c(NA,
6L))

You should store the values of x in a list because if you store them in a vector there is no way to distinguish between two groups.
x = c(1:2, 3:4, 5:6)
x
#[1] 1 2 3 4 5 6
Storing them in a list.
x= list(1:2, 3:4, 5:6)
x
#[[1]]
#[1] 1 2
#[[2]]
#[1] 3 4
#[[3]]
#[1] 5 6
You can use Map to index rows from your dataframe.
p <- function(x, i){
r<- rank_data[x][rank_data[,i] %in% 2000:2020,]
r
}
x= list(1:2, 3:4, 5:6)
i= c(2, 4, 6)
result <- Map(p, x, i)
result
#[[1]]
# jan year
#3 235.2 2008
#[[2]]
# feb year2
#1 287.6 2020
#2 241.9 2002
#[[3]]
# mar year3
#3 187.8 2019
If you want the output in the same shape as shown, you can add another step that finds the maximum number of rows and pads the shorter pieces with NA.
nr <- 1:max(sapply(result, nrow))
do.call(cbind, lapply(result, function(x) x[nr, ]))
# jan year feb year2 mar year3
#3 235.2 2008 287.6 2020 187.8 2019
#NA NA NA 241.9 2002 NA NA
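The steps above can also be put together in one self-contained sketch. The names pick, col_groups, year_cols and wide are hypothetical and are not from the original question; the helper takes the data and the year range as arguments instead of relying on a global rank_data, and the row names are reset so the padded row is not labelled "NA".

```r
# Sample data from the question
rank_data <- data.frame(
  jan   = c(268.1, 263.1, 235.2, 223.3, 219.2, 218.3),
  year  = c(1928, 1948, 2008, 1877, 1995, 1990),
  feb   = c(287.6, 241.9, 213.7, 205.1, 191.9, 191.2),
  year2 = c(2020, 2002, 1997, 1990, 1958, 1923),
  mar   = c(225.3, 190.7, 187.8, 187.2, 175.9, 173.9),
  year3 = c(1981, 1903, 2019, 1947, 1994, 1912)
)

# Hypothetical helper: keep the columns in `cols` for rows whose
# year column (position `year_col`) falls within `years`
pick <- function(cols, year_col, data = rank_data, years = 2000:2020) {
  data[data[[year_col]] %in% years, cols, drop = FALSE]
}

col_groups <- list(1:2, 3:4, 5:6)  # a list keeps the groups separate
year_cols  <- c(2, 4, 6)

# Map walks both arguments in parallel: pick(1:2, 2), pick(3:4, 4), ...
pieces <- Map(pick, col_groups, year_cols)

# Pad every piece with NA rows up to the longest one, then bind side by side
n_max <- max(vapply(pieces, nrow, integer(1)))
padded <- lapply(pieces, function(d) {
  d <- d[seq_len(n_max), , drop = FALSE]
  rownames(d) <- NULL
  d
})
wide <- do.call(cbind, padded)
wide
```

Map is used rather than apply because two inputs (the column groups and the year column positions) have to be stepped through together.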

Related

How to find rolling mean using means previously generated using R?

Hope the community can help me since I am relatively new to R and to the StackOverflow community.
I am trying to replace a missing value of a group with the average of the 3 previous years and then use this newly generated mean to continue generating the next period missing value in R either using dplyr or data.table. My data looks something like this (desired output column rounded to 2 digits):
df <- data.frame(gvkey = c(10443, 10443, 10443, 10443, 10443, 10443, 10443, 29206, 29206, 29206, 29206, 29206), fyear = c(2005, 2006, 2007, 2008, 2009, 2010, 2011, 2017, 2018, 2019, 2020, 2021), receivables = c(543, 595, 757, NA, NA, NA, NA, 147.469, 161.422, 154.019, NA, NA), desired_output = c(543, 595, 757, 631.67, 661.22, 683.30, 658.73, 147.47, 161.42, 154.02, 154.30, 156.58))
I have attempted the following line of code, but it does not use the newly generated number:
df <- df %>% mutate(mean_rect=rollapply(receivables,3,mean,align='right',fill=NA))
Any help would be greatly appreciated!
Because your desired fill value depends on any previously created fill values, I think the only reasonable approach is a trusty for loop:
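df$out <- NA_real_
for (i in seq_len(nrow(df))) {
  if (!is.na(df$receivables[i])) {
    # keep observed values as they are
    df$out[i] <- df$receivables[i]
  } else {
    # mean of the three previous rows, which may themselves be filled values
    df$out[i] <- mean(df$out[(i-3):(i-1)], na.rm = TRUE)
  }
}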
gvkey fyear receivables desired_output out
1 10443 2005 543.000 543.00 543.0000
2 10443 2006 595.000 595.00 595.0000
3 10443 2007 757.000 757.00 757.0000
4 10443 2008 NA 631.67 631.6667
5 10443 2009 NA 661.22 661.2222
6 10443 2010 NA 683.30 683.2963
7 10443 2011 NA 658.73 658.7284
8 29206 2017 147.469 147.47 147.4690
9 29206 2018 161.422 161.42 161.4220
10 29206 2019 154.019 154.02 154.0190
11 29206 2020 NA 154.30 154.3033
12 29206 2021 NA 156.58 156.5814
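The loop above walks the whole data frame and does not look at gvkey, so it relies on every group starting with at least three observed values. A minimal sketch of a grouped variant follows, assuming the rows are already sorted by fyear within each gvkey; the helper name fill_recursive and its argument k are hypothetical.

```r
df <- data.frame(
  gvkey = rep(c(10443, 29206), times = c(7, 5)),
  fyear = c(2005:2011, 2017:2021),
  receivables = c(543, 595, 757, NA, NA, NA, NA,
                  147.469, 161.422, 154.019, NA, NA)
)

# Hypothetical helper: replace each NA with the mean of up to k previous
# values in the same vector, reusing values filled in earlier iterations.
# If every previous value in the window is NA, the result is NaN.
fill_recursive <- function(x, k = 3) {
  for (i in seq_along(x)) {
    if (is.na(x[i]) && i > 1) {
      window <- x[max(1, i - k):(i - 1)]
      x[i] <- mean(window, na.rm = TRUE)
    }
  }
  x
}

# ave() applies the helper separately within each gvkey
df$out <- ave(df$receivables, df$gvkey, FUN = fill_recursive)
df
```

Because the helper only ever sees one group's values, a window can never reach back into the previous company's rows.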

Remove specific rows from data frame conditional on caseid and year

I'm a beginner in R, so please be gentle :)
I have a dataframe of the following form:
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
sampleData
id year
1 1 2010
2 1 2014
3 2 2010
4 2 2014
5 3 2010
6 4 2010
7 4 2014
I want to exclude every id that does not have both years.
In this case, id "3" only has year "2010".
Therefore I want to conditionally remove ids that do not have another row with the missing year.
I hope you guys can understand what I'm looking for :(
thank you in advance!
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
First, count the distinct years per id:
library(plyr)
countBy <- ddply(unique(sampleData),
.(id),
summarise,
occurence = length(year),
.parallel = FALSE)
Then subset to the ids that occur more than once:
sampleData[sampleData$id %in% countBy$id[countBy$occurence > 1],]
We can use ave to count the number of rows for each id and keep only the ids that have exactly 2 rows.
sampleData[ave(sampleData$year, sampleData$id, FUN = length) == 2, ]
# id year
#1 1 2010
#2 1 2014
#3 2 2010
#4 2 2014
#6 4 2010
#7 4 2014
If we want to check whether both "2010" and "2014" appear at least once per id, we can do
sampleData[as.logical(ave(sampleData$year, sampleData$id, FUN = function(x)
all(c(2010, 2014) %in% x))), ]
Here is a solution with data.table
library("data.table")
sampleData <- data.frame(id = c(1,1,2,2,3,4,4), year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
setDT(sampleData)
sampleData[, `:=`(n, .N), by=id][n==2]
In case you want to make your check more explicit, i.e. not just relying on two rows per id but checking whether both "2010" and "2014" appear at least once per id, you can do something like this in base R:
x <- table(sampleData$id, sampleData$year) > 0
x
# 2010 2014
# 1 TRUE TRUE
# 2 TRUE TRUE
# 3 TRUE FALSE
# 4 TRUE TRUE
ids_to_keep <- row.names(x)[rowSums(x[,c("2010", "2014")]) == 2]
ids_to_keep
#[1] "1" "2" "4"
sampleData[sampleData$id %in% ids_to_keep,]
# id year
#1 1 2010
#2 1 2014
#3 2 2010
#4 2 2014
#6 4 2010
#7 4 2014
This approach is longer than the others, but it is also more robust. If the same year can occur several times per id, or if years other than 2010 and 2014 can appear, approaches that only count the number of occurrences per id may fail.
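The same explicit check can be sketched with tapply. This is only an illustration: the extra duplicated row for id 4 and the names required, has_all, keep_ids and kept are assumptions added here to show that repeated years do not break the check.

```r
# Sample data with one hypothetical extra row: id 4 has 2014 twice
sampleData <- data.frame(id   = c(1, 1, 2, 2, 3, 4, 4, 4),
                         year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014, 2014))

required <- c(2010, 2014)

# One TRUE/FALSE per id: does this id have every required year?
has_all <- tapply(sampleData$year, sampleData$id,
                  function(y) all(required %in% y))

# tapply names its result by id, so the names of the TRUE entries
# are the ids to keep
keep_ids <- names(has_all)[has_all]
kept <- sampleData[as.character(sampleData$id) %in% keep_ids, ]
kept
```

A plain row count per id would drop id 4 here because it has three rows, whereas the membership check keeps it.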
There is also a nice dplyr solution:
# create the sample dataset
sampleData <- data.frame(id = c(1,1,2,2,3,4,4),
year = c(2010, 2014, 2010, 2014, 2010, 2010, 2014))
# load dplyr library
library(dplyr)
# take the sample dataset
sampleData %>%
# group by id - thus the function within filter will be evaluated for each id
group_by(id) %>%
# filter only ids which were recorded in two separate years
filter(length(unique(year)) == 2)

Grouping data by specific observations in R

I want to create a new variable that's derived from specific values in my existing variables. My data frame looks something like the following:
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)
More specifically, I want to create a third column, z, that is the ratio of x and y. However, I don't want to create this ratio by simply dividing x by y for each individual year. Instead, I want the values in 2015 (and 2014 etc.) to be an average of this ratio in the three preceding years, i.e. 2014, 2013, and 2012.
I've looked at Wickham's dplyr package and, in particular, the group_by function, but I'm stumped because I don't want to group my data by year per se but by each year's three preceding years, as illustrated (hopefully) above.
With dplyr and library(zoo):
df_fin<- df %>% mutate( z = rollmeanr(x/y,3,na.pad=TRUE))
I think the column z is what you want but it would be good to have the desired output.
The answers that use zoo::rollmean are all on the correct track, but they have a couple of "off by one" errors in them. First, you actually want zoo::rollmeanr( ..., na.pad=TRUE ) which will correctly pad the output with NA on the left side:
> zoo::rollmeanr( df$x / df$y, 3, na.pad=TRUE )
[1] NA NA 1.0962018 1.0359948 0.9962648 1.0590378
The second "off by one" error arises from alignment of this vector with the rest of your data. From your description, you want the value for 2015 to be the average of 2014, 2013, and 2012. However, appending the vector above to your table will make the value for 2015 to be the average of 2015, 2014, and 2013, instead. To correct, you want to omit the last value in your input to the rolling average and prepend an NA to compensate:
> c( NA, zoo::rollmeanr( head(df$x / df$y,-1), 3, na.pad=TRUE ) )
[1] NA NA NA 1.0962018 1.0359948 0.9962648
Putting it all together using dplyr notation:
df %>% mutate( z = c( NA, zoo::rollmeanr( head(x/y,-1), 3, na.pad=TRUE ) ) )
year x y z
1 2010 2980 2453 NA
2 2011 2955 2919 NA
3 2012 3110 2930 NA
4 2013 2962 2864 1.0962018
5 2014 2566 2873 1.0359948
6 2015 3788 3031 0.9962648
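For comparison, the same "mean of the three preceding years" can be sketched in base R without zoo. The helper name lagged_mean and its argument k are hypothetical; it returns NA wherever fewer than k earlier values exist.

```r
year <- c("2010", "2011", "2012", "2013", "2014", "2015")
x <- c(2980, 2955, 3110, 2962, 2566, 3788)
y <- c(2453, 2919, 2930, 2864, 2873, 3031)
df <- data.frame(year, x, y)

# Hypothetical helper: for position i, average the k values before i
# (positions i-k .. i-1); the value at i itself is never included
lagged_mean <- function(v, k = 3) {
  vapply(seq_along(v), function(i) {
    if (i <= k) NA_real_ else mean(v[(i - k):(i - 1)])
  }, numeric(1))
}

df$z <- lagged_mean(df$x / df$y)
df
```

Writing the window as (i - k):(i - 1) makes the alignment explicit, which is the part that is easy to get off by one with a rolling function.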
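With a loop, you can compute the ratio of the three-year means of x and y (this is not the same as the mean of the yearly ratios, so the values differ slightly):
df$z <- 0
for (i in 4:6) {
  df$z[i] <- mean(df$x[(i-3):(i-1)]) / mean(df$y[(i-3):(i-1)])
}
This gives: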
year x y z
1 2010 2980 2453 0.000000
2 2011 2955 2919 0.000000
3 2012 3110 2930 0.000000
4 2013 2962 2864 1.089497
5 2014 2566 2873 1.036038
6 2015 3788 3031 0.996654
library(zoo)
library(dplyr)
df %>% mutate(z = x/y, zz = rollmean(z, 3, fill = NA))

return final row of dataframe - recurring variable names

I want to return the final row for each subsection of a dataframe. I'm aware of the ddply and aggregate functions, but they are not giving the expected output in this case, as the column by which I split the data has recurring names.
For example, in df:
year <- rep(c(2011, 2012, 2013), each=12)
season <- rep(c("Spring", "Summer", "Autumn", "Winter"), each=3)
allseason <- rep(season, 3)
temp <- rnorm(36, mean = 61, sd = 10)
df <- data.frame(year, allseason, temp)
I want to return the final temp reading at the end of every season. When I run either
final1 <- aggregate(df, list(df$allseason), tail, 1)
or
final2 <- ddply(df, .(allseason), tail, 1)
I get only the final 4 seasons (i.e. those of 2013). The function seems to stop there and does not go back to previous years/seasons. My intended output is a data frame with 12 rows * 3 columns.
All help appreciated!
*I notice that in the df created here, the allseason column is designated as a factor with 4 levels, whereas this is not the case in my original dataframe.
In your ddply code, you only forgot to also group by year:
With plyr:
library(plyr)
ddply(df, .(year, allseason), tail, 1)
Or with dplyr
library(dplyr)
df %>%
group_by(year, allseason) %>%
do(tail(.,1))
Or if you want a base R alternative you can use ave:
df[with(df, ave(year, list(year, allseason), FUN = seq_along)) == 3,]
Result:
# year allseason temp
#1 2011 Autumn 63.40626
#2 2011 Spring 59.69441
#3 2011 Summer 42.33252
#4 2011 Winter 79.10926
#5 2012 Autumn 63.14974
#6 2012 Spring 60.32811
#7 2012 Summer 67.57364
#8 2012 Winter 61.39100
#9 2013 Autumn 50.30501
#10 2013 Spring 61.43044
#11 2013 Summer 55.16605
#12 2013 Winter 69.37070
Note that the output will contain the same rows in each case, only the ordering may differ.
And just to add to @beginneR's answer, your aggregate solution should look like:
aggregate(temp ~ allseason + year, data = df, tail, 1)
# or:
with(df, aggregate(temp, list(allseason, year), tail, 1))
Result:
allseason year temp
1 Autumn 2011 64.51539
2 Spring 2011 45.14341
3 Summer 2011 62.29240
4 Winter 2011 47.97461
5 Autumn 2012 43.16781
6 Spring 2012 80.02419
7 Summer 2012 72.31149
8 Winter 2012 45.58344
9 Autumn 2013 55.92607
10 Spring 2013 52.06778
11 Summer 2013 51.01308
12 Winter 2013 53.22452
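Another base R option, shown here only as a sketch, is to keep the rows that are the last occurrence of each year and season combination with duplicated(..., fromLast = TRUE). The seed value and the name last_rows are assumptions made for the example.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
year <- rep(c(2011, 2012, 2013), each = 12)
allseason <- rep(rep(c("Spring", "Summer", "Autumn", "Winter"), each = 3), 3)
df <- data.frame(year, allseason, temp = rnorm(36, mean = 61, sd = 10))

# duplicated(..., fromLast = TRUE) scans from the bottom, so it is FALSE
# only for the last row of each year/season combination
last_rows <- df[!duplicated(df[c("year", "allseason")], fromLast = TRUE), ]
nrow(last_rows)
# 12
```

Unlike the grouped summaries, this keeps the rows in their original order and keeps the original row names.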

Sum duplicates then remove all but first occurrence

I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an id variable. I have another continuous variable x, whose values I would like to sum for each duplicate id. The observations are time dependent: there are year and month variables, and I'd like to keep the chronologically first observation of each duplicate id and add the subsequent dupes to this first observation.
I've included dummy data that resembles what I have: dat1. I've also included a data set that shows the structure of my desired outcome: outcome.
I've tried two strategies, neither of which quite gives me what I want (see below). The first strategy gives me the correct values for x, but I lose my year and month columns, which I need to retain for the first occurrence of each duplicate id. The second strategy doesn't sum the values of x correctly.
Any suggestions for how to get my desired outcome would be much appreciated.
# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
year = rep(c("2006", "2007"), each = 5),
month = rep(c("December", "January"), each = 5),
x = round(rnorm(10, 10, 3), 2))
# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
year = c(rep("2006", 4), rep("2007", 3)),
month = c(rep("December", 4), rep("January", 3)),
x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))
# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))
# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]
# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
.(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes,
all.x = TRUE), FUN = sum)
I got different sums, but that was because I forgot to set the seed:
> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
id year month x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December 7.16
6 8503 2007 January 13.23
7 2961 2007 January 7.38
9 8564 2007 January 7.21
(To be safe, it would be better to work on a copy. You might also need to add an ordering step.)
You could do this with data.table (quicker and more memory efficient than plyr).
With a bit of self-joining fun using mult='first'. Keying by id, year and month will sort by id, then year, then month.
library(data.table)
DT <- data.table(dat1, key = c('id','year','month'))
# setnames is required as there are two x columns that get renamed x, x.1
DT1 <- setnames(DT[DT[,list(x=sum(x)),by=id],mult='first'][,x:=NULL],'x.1','x')
Or a simpler approach :
DT = as.data.table(dat1)
DT[,x:=sum(x),by=id][!duplicated(id)]
id year month x
1: 1234 2006 December 36.42
2: 1321 2006 December 11.55
3: 4321 2006 December 17.31
4: 7423 2006 December 5.97
5: 8503 2007 January 12.48
6: 2961 2007 January 10.22
7: 8564 2007 January 11.41
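The ordering step mentioned earlier can be made explicit in base R. The sketch below uses made-up x values (1 to 10) instead of the seeded rnorm draws so the sums are easy to check by hand; the names ord, sorted and result are hypothetical.

```r
dat1 <- data.frame(
  id    = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
  year  = rep(c("2006", "2007"), each = 5),
  month = rep(c("December", "January"), each = 5),
  x     = 1:10 + 0  # hypothetical values, not the question's data
)

# Sort chronologically: year as a number, month by its calendar position
ord <- order(as.integer(dat1$year), match(dat1$month, month.name))
sorted <- dat1[ord, ]

# Replace x by the per-id total, then keep the first row of each id
sorted$x <- ave(sorted$x, sorted$id, FUN = sum)
result <- sorted[!duplicated(sorted$id), ]
result
```

Working on the sorted copy leaves dat1 untouched, and sorting before calling duplicated ensures the retained year and month belong to the earliest observation of each id.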

Resources