I have a table of Calpers Private Equity Fund performance over several years. I cleaned and joined all the data into one large table with 186 entries for individual fund investments. Some of these funds have data for 5 years, most for 4 or fewer. I would like to calculate the CAGR for each fund using the earliest value and the latest value in the formula:
CAGR = (Latest/First)^(1/n) - 1
The columns with the data are named 2017, 2018, 2019, 2020, 2021, so the formula in R will look something like this (calper is the table with all the data, one fund per row):
idx<- which(startsWith(names(calperMV),"2")) # locate columns with data needed for CAGR calc
idx <- rev(idx) # match to NCOL_NA order ...
The values here are (6, 5, 4, 3, 2), which are the column numbers for 2021, 2020, 2019, 2018, 2017.
The indx column was formed from the count of NAs in each row ... all the NAs run left to right, so these totals act as a reference into idx and thus pick out the correct columns.
I use !!sym(as.String()) with names()[idx[indx]] to pull out the column names symbolically:
calperMV %>% rowwise() %>%
mutate(CAGR=`2021`/!!sym((colnames(.)[idx[indx]])^(1/(5-indx))-1))))
The problem is that the referencing either does not work correctly or produces this error:
"Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems?"
I've tried creating test code which shows the addressing is working:
calper %>% rowwise() %>% mutate(test = (names(.)[idx[indx]]),
test1= !!sym(as.String(names(.)[idx[1]])),
test2= !!sym(as.String(names(.)[idx[2]])),
test3= !!sym(as.String(names(.)[idx[3]])),
test4= !!sym(as.String(names(.)[idx[4]])),
test5= !!sym(as.String(names(.)[idx[5]])))
But when I do the full CAGR calculation I get that recursive error. Here's a tibble of the test data for reference:
Input data:
Security Name 2017 2018 2019 2020 2021 NA_cols indx
ASIA ALT NA NA NA 6,256,876.00 7,687,037.00 3 2
ASIA ALT NA NA NA 32,549,704.00 34,813,844.00 3 2
AVATAR NA NA NA NA 700,088.00 - 3 2
AVENUE FUND VI (A) NA NA NA 10,561,674.00 19,145,496.00 3 2
BDC III C NA 48,098,429.00 85,808,280.00 100,933,699.00 146,420,669.00 1 4
BIRCH HILL NA NA NA 6,488,941.00 9,348,941.00 3 2
BLACKSTONE NA NA NA 4,011,072.00 2,406,075.00 3 2
BLACKSTONE IV NA NA NA 4,923,625.00 3,101,081.00 3 2
BLACKSTONE V NA NA NA 18,456,472.00 17,796,711.00 3 2
BLACKSTONE VI NA NA NA 245,269,656.00 310,576,064.00 3 2
BLACKSTONE VII NA NA NA 465,415,036.00 607,172,062.00 3 2
Results: the indexing selects the proper string and also selects the proper number from the column ... but it won't work when I operate on the selected variable:
selYR test1 test2 test3 test4 test5
2020 7,687,037.00 6,256,876.00 NA NA NA
2020 34,813,844.00 32,549,704.00 NA NA NA
2020 - 700,088.00 NA NA NA
2020 19,145,496.00 10,561,674.00 NA NA NA
2018 146,420,669.00 100,933,699.00 85,808,280.00 48,098,429.00 NA
2020 9,348,941.00 6,488,941.00 NA NA NA
2020 2,406,075.00 4,011,072.00 NA NA NA
2020 3,101,081.00 4,923,625.00 NA NA NA
2020 17,796,711.00 18,456,472.00 NA NA NA
2020 310,576,064.00 245,269,656.00 NA NA NA
2020 607,172,062.00 465,415,036.00 NA NA NA
(Sorry ... I don't know how to put these into proper columns :( )
I never learned all those fancy tidystuff techniques. Here's a base R approach:
First and second: use read.delim to bring in the tab-delimited data, and note that your data has (yeccch) commas in the numbers.
(Ignore the warnings; they are correct and you do want the NAs.)
calpDat <- read.delim(text=calpTab)
calpDat[2:6] <- lapply(calpDat[2:6], function(x) as.numeric(gsub("[,]", "",x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion
Note that lapply in this case returns a list of numeric vectors which can be assigned back into the original dataframe to overwrite the original character values. Or you could have created new columns, which could then have gotten the same treatment as below. Now that the data is in, you can count the number of valid numbers and then calculate the CAGR for each row using apply on the numeric columns in a rowwise fashion:
calpDat$CAGR <- apply(calpDat[2:6], 1, function(rw) {
  n <- length(na.omit(rw))           # number of years with data in this row
  (rw[5] / rw[6 - n])^(1/n) - 1      # 2021 value over the earliest non-NA value
})
calpDat
#----------------
Security.Name X2017 X2018 X2019 X2020 X2021 NA_cols indx CAGR
1 ASIA ALT NA NA NA 6256876 7687037 3 2 0.10841071
2 ASIA ALT NA NA NA 32549704 34813844 3 2 0.03419508
3 AVATAR NA NA NA NA 700088 NA 3 2 NA
4 AVENUE FUND VI (A) NA NA NA 10561674 19145496 3 2 0.34637777
5 BDC III C NA 48098429 85808280 100933699 146420669 1 4 0.32089372
6 BIRCH HILL NA NA NA 6488941 9348941 3 2 0.20031241
7 BLACKSTONE NA NA NA 4011072 2406075 3 2 -0.22549478
8 BLACKSTONE IV NA NA NA 4923625 3101081 3 2 -0.20637732
9 BLACKSTONE V NA NA NA 18456472 17796711 3 2 -0.01803608
10 BLACKSTONE VI NA NA NA 245269656 310576064 3 2 0.12528383
11 BLACKSTONE VII NA NA NA 465415036 607172062 3 2 0.14218298
Problems remaining: funds that did not have a value in the most recent year; funds that might have had discontinuous reporting. You need to say how these would be handled, and provide example data, if you want tested solutions.
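For what it's worth, here is a minimal sketch of one way both of those cases could be handled, assuming the rule you want is "CAGR between the earliest and the latest non-missing value, with n equal to the number of years between them" (that rule, and the helper name cagr_row, are my assumptions, not something stated above):
cagr_row <- function(rw) {
  ok <- which(!is.na(rw))                      # positions of the years with data
  if (length(ok) < 2) return(NA_real_)         # fewer than two observations: no CAGR
  n <- max(ok) - min(ok)                       # years elapsed between first and last value
  (rw[max(ok)] / rw[min(ok)])^(1 / n) - 1
}
calpDat$CAGR2 <- apply(calpDat[2:6], 1, cagr_row)
This also covers discontinuous reporting, since it only looks at the first and last values that are actually present.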
I have data with variables in rows and samples in columns. The samples are biological replicates of several groups; each group varies in the number of replicates. An example is the following:
Variable Ctrl_1 Ctrl_2 Ctrl_3 DrugA_1 DrugA_2 DrugA_3 DrugA_4 DrugB_1 DrugB_2 DrugB_3
ALa 22.6438 28.3892 20.8251 27.9565 23.574 30.9032 25.0484 27.5785 29.5881 26.0501
Arg 152.146 166.179 93.2656 138.99 105.766 124.013 99.5645 110.642 115.571 121.694
Asn 27.0758 42.4504 30.9472 42.2874 34.0379 34.8214 36.2872 36.1781 42.2911 42.4052
Asp 165.854 351.983 232.19 249.624 114.803 96.9021 120.799 136.907 141.833 246.973
Glu 72.3663 168.251 116.67 167.514 86.3407 90.9513 100.363 122.004 134.128 175.909
Gln 119.269 147.754 107.945 152.588 102.166 115.641 165.737 133.397 147.295 172.286
Met 6.69636 13.0122 7.23503 16.6089 7.63006 6.23411 6.87084 5.65021 9.35852 12.9221
I have over 200 variables in my table. Sometimes a variable is measured very low in one sample and very high in another within the same group. I would like to write code in R that helps me filter out these variables. I tried to figure out a way based on the IQR but I did not succeed.
My understanding of the question is that you want to remove values when they are outside the IQR for a given variable (i.e., row). You could do so with:
library(dplyr)
df <- tribble(
~Variable, ~Ctrl_1, ~Ctrl_2, ~Ctrl_3, ~DrugA_1, ~DrugA_2, ~DrugA_3, ~DrugA_4, ~DrugB_1, ~DrugB_2, ~DrugB_3,
"ALa", 22.6438, 28.3892, 20.8251, 27.9565, 23.574 , 30.9032, 25.0484, 27.5785, 29.5881, 26.0501,
"Arg", 152.146, 166.179, 93.2656, 138.99 , 105.766, 124.013, 99.5645, 110.642, 115.571, 121.694,
"Asn", 27.0758, 42.4504, 30.9472, 42.2874, 34.0379, 34.8214, 36.2872, 36.1781, 42.2911, 42.4052,
"Asp", 165.854, 351.983, 232.19 , 249.624, 114.803, 96.9021, 120.799, 136.907, 141.833, 246.973,
"Glu", 72.3663, 168.251, 116.67 , 167.514, 86.3407, 90.9513, 100.363, 122.004, 134.128, 175.909,
"Gln", 119.269, 147.754, 107.945, 152.588, 102.166, 115.641, 165.737, 133.397, 147.295, 172.286,
"Met", 6.69636, 13.0122, 7.23503, 16.6089, 7.63006, 6.23411, 6.87084, 5.65021, 9.35852, 12.9221
)
df %>%
rowwise() %>%
mutate(across(-Variable, ~ifelse(between(.x,
quantile(c_across(-Variable), .25),
quantile(c_across(-Variable), .75)), .x, NA))) %>%
ungroup()
#> # A tibble: 7 × 11
#> Variable Ctrl_1 Ctrl_2 Ctrl_3 DrugA_1 DrugA_2 DrugA_3 DrugA_4 DrugB_1 DrugB_2
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ALa NA NA NA 28.0 NA NA 25.0 27.6 NA
#> 2 Arg NA NA NA NA NA 124. NA 111. 116.
#> 3 Asn NA NA NA 42.3 NA 34.8 36.3 36.2 NA
#> 4 Asp 166. NA 232. NA NA NA NA 137. 142.
#> 5 Glu NA NA 117. NA NA NA 100. 122. 134.
#> 6 Gln 119. 148. NA NA NA NA NA 133. 147.
#> 7 Met NA NA 7.24 NA 7.63 NA 6.87 NA 9.36
#> # … with 1 more variable: DrugB_3 <dbl>
Created on 2022-06-01 by the reprex package (v2.0.1)
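If the goal is instead to drop whole variables whose replicates disagree too much within a group (which is another way to read "filter out these variables"), here is a rough sketch. The reshaping assumes the sample names follow the Group_replicate pattern shown above, and iqr_cutoff is an arbitrary threshold I made up, not something from the question:
library(dplyr)
library(tidyr)
iqr_cutoff <- 50   # made-up threshold; tune it to your data
flagged <- df %>%
  pivot_longer(-Variable, names_to = "sample", values_to = "value") %>%
  mutate(group = sub("_\\d+$", "", sample)) %>%              # Ctrl, DrugA, DrugB
  group_by(Variable, group) %>%
  summarise(iqr = IQR(value), .groups = "drop") %>%          # spread within each group
  group_by(Variable) %>%
  summarise(max_iqr = max(iqr), .groups = "drop") %>%
  filter(max_iqr > iqr_cutoff)
df_filtered <- df %>% filter(!Variable %in% flagged$Variable)  # keep the well-behaved variables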
I am using the extRemes package in R in plotting the return periods.
However, my data set has missing values so I encounter many errors:
Here's the data:
https://www.dropbox.com/s/un9vghuwmptnty1/Lumbia_1979-2017.csv?dl=0
Apologies, the return period plot requires a large number of observation points, so this is the smallest data set that I can post.
Here's the script so far:
library(extRemes)
dat<-read.csv("Lumbia_1979-2017.csv",header=T)
dat[dat==-999]<-NA
#Extract annual max precip
annmax <- aggregate(dat$Rain, by = list(dat$Year),max, na.rm=TRUE,na.action=NULL)
gevfit1 <- fevd(annmax[ ,2])
gevfit1
Error
I encounter the error and warning below:
What I want
(a) Can anyone suggest a workaround solution for plotting the return periods of data sets with missing values?
(b) If, for example, I want to get the return period of 100 mm/day rainfall, how do I estimate this from the plot of the return periods?
I'll appreciate any help on this matter.
The error you are getting is from the aggregate call:
annmax <- aggregate(dat$Rain, by = list(dat$Year),max, na.rm=TRUE,na.action=NULL)
# Warning messages:
# 1: In FUN(X[[i]], ...) : no non-missing arguments to max; returning -Inf
# 2: In FUN(X[[i]], ...) : no non-missing arguments to max; returning -Inf
# [...]
In this call, you are computing the maximum of Rain for each Year. The warning appears when all the data are missing for a given year, so the maximum can't be calculated. Look at the results:
annmax
# Group.1 x
# 1 1979 5.4
# 2 1980 27.1
# 3 1981 62.5
# [...]
# 33 2011 58.2
# 34 2012 5.7
# 35 2013 74.9
# 36 2014 -Inf
# 37 2015 -Inf
# 38 2016 -Inf
# 39 2017 -Inf
In these years, when aggregate returned -Inf, you have no data for Rain:
dat[dat$Year %in% 2014:2017,]
# Year Month Day Rain
# 1086 2014 1 1 NA
# 1087 2014 1 2 NA
# 1088 2014 1 3 NA
# 1089 2014 1 4 NA
# 1090 2014 1 5 NA
# 1091 2014 1 6 NA
# 1092 2014 1 7 NA
# 1093 2014 1 8 NA
# 1094 2014 1 9 NA
# 1095 2014 1 10 NA
# 1096 2014 1 11 NA
# [...]
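The -Inf itself is just how max() behaves when every value it is given is NA and na.rm = TRUE drops them all, for example:
max(c(NA, NA), na.rm = TRUE)
# [1] -Inf
# Warning message:
# In max(c(NA, NA), na.rm = TRUE) : no non-missing arguments to max; returning -Inf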
So, it's up to you to decide what to do with those missing years. It depends on the analysis. Does the analysis need some data in the missing years?
1) In the case that the analysis (fevd) needs some data in the missing years, you need some method to interpolate them from the other years.
2) If the analysis doesn't need the missing years, just delete them. This is the simplest solution:
annmax2 <- annmax[is.finite(annmax[,2]),]
gevfit1 <- fevd(annmax2[ ,2])
And now it works :-)
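For part (b), here is a minimal sketch of how the return periods can be inspected with extRemes, assuming the gevfit1 object fitted above (the particular return periods below are just examples):
# return level plot: return period on the x axis, return level on the y axis
plot(gevfit1, type = "rl")
# estimated return levels (in the units of annmax2, here mm/day) for chosen return periods
return.level(gevfit1, return.period = c(2, 20, 50, 100))
To estimate the return period of a 100 mm/day event, read off where the curve crosses 100 on the return-level axis; extRemes also provides pextRemes() for exceedance probabilities if you prefer a numeric route, but the graphical reading is usually enough.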
I have two data frames, one containing the raw data (and lots of NAs) and one containing the means of the raw data at different intervals.
I would like to replace the NAs with the means at those intervals, while retaining the non-NAs.
I have googled a lot on "impute missing data r" and "replace missing data r", but have yet to find any solution that seems to fit; they all either seem to replace data with 0s, or use a way too complex method for this assignment, like the MICE package.
Code example:
This is the head of the first DF, with the raw data. As you can see, they are all NA for the first day.
steps date interval
1 NA 2012-10-01 0
2 NA 2012-10-01 5
3 NA 2012-10-01 10
4 NA 2012-10-01 15
5 NA 2012-10-01 20
6 NA 2012-10-01 25
...
The second data frame head contains the mean steps per interval, like this:
steps interval
1 1.72 0
2 0.340 5
3 0.132 10
4 0.151 15
5 0.0755 20
6 2.09 25
...
Now, what I am looking for is to be able to fill the NAs with the mean steps for the relevant interval, so it looks like this:
steps date interval
1 1.72 2012-10-01 0
2 0.340 2012-10-01 5
3 0.132 2012-10-01 10
4 0.151 2012-10-01 15
5 0.0755 2012-10-01 20
6 2.09 2012-10-01 25
...
Any tips, or links for resources I have missed? As this is a course assignment and I mostly wanna learn, any help without doing the assignment for me would be much appreciated! =)
edit: Also, since this is my first question on Stack Overflow, any comments on how to improve my question-making is also appreciated!
There are many ways to do that in R. For example
# generate a dataframe with some interval values
df1 <- data.frame(interval= rep(seq(0, 25, 5), 5))
# add a steps column
df1$steps <- 1:nrow(df1)
# copy the dataframe
df2 <- df1
# replace some steps values with missings in df1
df1$steps[c(1,2,5, 14)] <- NA
# sapply goes thru every unique interval...
sapply(unique(df1$interval), function(interval_i){
# replace missing steps of interval_i in df1 with the steps mean of interval_i of df2
df1$steps[is.na(df1$steps) & df1$interval == interval_i] <<- mean(df2$steps[df2$interval == interval_i], na.rm= TRUE)
# you must use <<- not <- to assign to df1 outside of sapply
})
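Since the question actually keeps the interval means in a second data frame, here is an equivalent base R sketch without the <<- side effect. I'm calling that second data frame df_means (a name I made up), with columns steps and interval as shown in the question:
miss <- is.na(df1$steps)                      # rows with missing steps
# look up the mean for the matching interval and fill only the NAs
df1$steps[miss] <- df_means$steps[match(df1$interval[miss], df_means$interval)]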
I'll add a solution using dplyr.
It's good practice to give reproducible data in your question that can be put straight into R, rather than just pasting in the head of your data. I've created some dummy data instead:
# create random sample data
library(dplyr)
set.seed(100)
df1 <- tibble(
steps = runif(1e3),
date = lubridate::today() + runif(1e3) * 24,
interval = as.numeric(sample(seq(0,25, by=5), 1e3, replace = T))
)
# add 100 NAs at random
df1$steps[sample(1:1e3, 100)] <- NA
df1
# steps date interval
# <dbl> <date> <dbl>
# 1 0.308 2019-07-18 15
# 2 NA 2019-07-19 10
# 3 NA 2019-07-31 0
# 4 0.0564 2019-08-02 20
# 5 0.469 2019-07-25 0
# 6 0.484 2019-07-21 25
# 7 NA 2019-07-17 5
# 8 0.370 2019-07-28 0
# 9 0.547 2019-07-31 5
# 10 0.170 2019-08-08 15
# # … with 990 more rows
Using dplyr, the imputation task is then pretty simple with group_by:
df1 %>%
group_by(interval) %>%
mutate(steps = if_else(is.na(steps), mean(steps, na.rm = T), steps))
# # A tibble: 1,000 x 3
# # Groups: interval [6]
# steps date interval
# <dbl> <date> <dbl>
# 1 0.308 2019-07-18 15
# 2 0.573 2019-07-19 10
# 3 0.523 2019-07-31 0
# 4 0.0564 2019-08-02 20
# 5 0.469 2019-07-25 0
# 6 0.484 2019-07-21 25
# 7 0.527 2019-07-17 5
# 8 0.370 2019-07-28 0
# 9 0.547 2019-07-31 5
# 10 0.170 2019-08-08 15
# # … with 990 more rows
We can confirm that the imputed means are correct for each group by calculating the mean for each group and comparing it to the imputed values:
df1 %>%
group_by(interval) %>%
summarise(mean_int = mean(steps, na.rm=T))
# # A tibble: 6 x 2
# interval mean_int
# <dbl> <dbl>
# 1 0 0.523
# 2 5 0.527
# 3 10 0.573
# 4 15 0.511
# 5 20 0.475
# 6 25 0.485
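As a small design note, the same imputation can also be written with dplyr::coalesce, which some people find easier to read (a sketch, equivalent to the if_else version above):
df1 %>%
  group_by(interval) %>%
  mutate(steps = coalesce(steps, mean(steps, na.rm = TRUE))) %>%   # fill NAs with the group mean
  ungroup()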
I have a table containing observations on scores of restaurants (identified by ID). The variable mean is the mean rating of reviews received in a week-long window centered on each day (i.e. from 3 days before till 3 days later), and the variable count is the number of reviews received in the same window (see the code below for a dput of a randomly-generated sample of my data frame).
I am interested in looking at those restaurants that contain big spikes in either variable (like all of a sudden their mean rating goes up by a lot, or drops suddenly). For those restaurants, I would like to investigate what's going on by plotting the distribution (I have lots of restaurants so I can't do it manually and I have to restrict my domain for semi-manual inspection).
Also, since my data is day-by-day, I would like it to be less granular. In particular, I want to average all the ratings or counts for a given month into a single value.
I think zoo should help me do this nicely: given the data frame in the example, I think I can convert it to a zoo time series that is aggregated the way I want and split the way I want by using:
z <- read.zoo(df, split = "restaurantID",
format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)
however, splitting on restaurantID does not yield the expected result. What I get instead is lots of NAs:
mean.1006054 count.1006054 mean.1006639 count.1006639 mean.1006704 count.1006704 mean.1007177 count.1007177
Lug 2004 NA NA NA NA NA NA NA NA
Ago 2004 NA NA NA NA NA NA NA NA
Nov 2004 NA NA NA NA NA NA NA NA
Gen 2005 NA NA NA NA NA NA NA NA
Feb 2005 NA NA NA NA NA NA NA NA
Mar 2005 NA NA NA NA NA NA NA NA
mean.1007296 count.1007296 mean.1007606 count.1007606 mean.1007850 count.1007850 mean.1008272 count.1008272
Lug 2004 NA NA NA NA NA NA NA NA
Ago 2004 NA NA NA NA NA NA NA NA
Nov 2004 NA NA NA NA NA NA NA NA
Gen 2005 NA NA NA NA NA NA NA NA
Feb 2005 NA NA NA NA NA NA NA NA
Mar 2005 NA NA NA NA NA NA NA NA
Note that it works if I don't split it on the restaurantID column.
df$website <- NULL
> z <- read.zoo(df, format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)
> head(z)
restaurantID mean count
Lug 2004 1418680 3.500000 1
Ago 2004 1370457 5.000000 1
Nov 2004 1324645 4.333333 1
Gen 2005 1425933 1.920000 1
Feb 2005 1315289 3.000000 1
Mar 2005 1400577 2.687500 1
Also, plot.zoo(z) works but of course the produced graph has no meaning for me.
My questions are:
1) How can I filter the restaurants that have the higher "month-month" spikes in either column?
2) How can I split on restaurantID and plot the time series of only such restaurants?
DATA HERE (wouldn't fit SO's word limit)
Try:
# packages used below
library(plyr)      # for ddply
library(reshape2)  # for melt
library(ggplot2)
# helper function to calculate the change between consecutive values in a sequence
difflist <- function(v) c(0, diff(v))
# make center as dates
df$center <- as.Date(df$center,format='%m/%d/%Y')
# sort data frame in time order
df <- df[order(df$restaurantID, df$center),]
# now calculate the change in each column
deltas <- ddply(df, .(restaurantID), function(x) {cbind(center = x$center, delta_mean = difflist(x$mean), delta_count = difflist(x$count)) } )
# filter out only the big spikes
deltas_big <- subset(deltas, delta_mean > 2 | delta_count > 3)
# arrange the data
delta_melt <- melt(deltas_big,id.vars=c('restaurantID','center'))
# now plot by time
ggplot(delta_melt, aes(x=center,y=value,color=variable)) + geom_point()
The robfilter R package was developed to filter time series data and pick out outliers, based on robust statistical methods for time series analysis. You can use the adore.filter function to fit a signal to the data and then pick out the observations that deviate far from it.
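A rough sketch of what that could look like for a single restaurant's monthly series follows; this is my own illustration, not tested against your data, and one_restaurant is a hypothetical data frame holding one restaurant's values in time order (check ?adore.filter for the exact arguments and the components of the returned object):
library(robfilter)
y <- one_restaurant$mean          # monthly mean ratings for one restaurant, in time order
fit <- adore.filter(y)            # adaptive online repeated median filter
plot(fit)                         # data with the extracted signal overlaid
# flag observations far from the extracted signal
# (the signal should be in fit$level, if I recall the robfilter docs correctly)
resid <- y - fit$level
which(abs(resid) > 2 * sd(resid, na.rm = TRUE))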