Stratified random sampling from data frame_follow up - r

I am trying to randomly sample 50% of the data for each group, following Stratified random sampling from data frame. A reproducible example using the mtcars dataset in R is shown below. What I don't understand is that the sample index clearly shows a group of gear labeled '5', but when the index is applied to the mtcars dataset, the sampled data mtcars2 does not contain any record with gear='5'. What went wrong? Thank you very much.
> set.seed(14908141)
> index=tapply(1:nrow(mtcars),mtcars$gear,function(x){sample(length(x),length(x)*0.5)})
> index
$`3`
[1] 6 7 14 4 12 9 13
$`4`
[1] 12 7 8 4 6 5
$`5`
[1] 5 1
> mtcars2=mtcars[unlist(index),]
> table(mtcars2$gear)
3 4
12 3

The approach you've used samples from 1:length(x) within each gear group, so the resulting indices are positions within the group, not row numbers of mtcars, and they repeat across groups. When you then subset mtcars with them, the wrong rows are picked; notice in your output above that index 7 appears in both the gear-3 and gear-4 groups.
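In other words, sampling the row numbers themselves inside tapply() would also fix the original code; a minimal sketch of that fix (not part of the answer below, and as in the original code length(x)*0.5 is truncated to a whole number by sample()):
set.seed(14908141)
# x already holds the row numbers of mtcars for one gear level,
# so sample from x directly instead of from 1:length(x)
index <- tapply(1:nrow(mtcars), mtcars$gear, function(x) sample(x, length(x) * 0.5))
mtcars2 <- mtcars[unlist(index), ]
table(mtcars2$gear)  # now contains rows for gears 3, 4 and 5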
Base R
I would use split first to split by gear:
res <- split(mtcars, mtcars$gear)
Then I iterate over this list with lapply and sample 50% of each group that way:
res2 <- lapply(res, function(x) {
  x[sample(1:nrow(x), nrow(x) * 0.5, FALSE), ]
})
If you would like one dataset at the end (instead of a list), you can combine the pieces using do.call:
final_df <- do.call(rbind, res2)
dplyr
A simpler approach would be:
library(dplyr)
mtcars %>%
  group_by(gear) %>%
  sample_frac(0.5)
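Note that in more recent dplyr versions (1.0.0 and later) sample_frac() is superseded by slice_sample(); assuming such a version, the equivalent would be:
mtcars %>%
  group_by(gear) %>%
  slice_sample(prop = 0.5)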

Related

Loop to run two-way frequency table in R

Let's say I have data such as this:
library(dplyr)
dat <- mtcars %>% mutate(cyl2 = cyl * 2, cyl3 = cyl * 3)
I want to generate frequency tables such as this:
table(dat$cyl, dat$vs)
table(dat$cyl2, dat$vs)
table(dat$cyl3, dat$vs)
table(dat$cyl, dat$am)
table(dat$cyl2, dat$am)
table(dat$cyl3, dat$am)
Is there a way to automate the generation of frequency tables so that I don't have to run each one manually as in the example above? First, the actual data I have is a lot bigger, and second, the output isn't very easy to digest.
Now, if I wanted only the frequency of each variable, I could do something like so:
mapply(table, dat); mylist
There must be a way to use apply for a two-way frequency distribution? Thank you for your insight.
Here is one way to put all of the tables you want into a list:
vars <- as.matrix(expand.grid(c("cyl", "cyl2", "cyl3"), c("vs", "am")))
tables <- lapply(seq(nrow(vars)), function(x) table(dat[, vars[x, ]]))
lbls <- apply(vars, 1, paste, collapse="_")
names(tables) <- lbls
You can access a table if you know the number (row number in vars) or the combination of factors, e.g.
tables[[3]]
# vs
# cyl3 0 1
# 12 1 10
# 18 3 4
# 24 14 0
tables[["cyl3_vs"]]
# vs
# cyl3 0 1
# 12 1 10
# 18 3 4
# 24 14 0
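For what it's worth, a hedged one-step alternative (not from the answer above; pairs, v1 and v2 are illustrative names): build and name the tables in a single Map() call over the two columns produced by expand.grid():
pairs <- expand.grid(v1 = c("cyl", "cyl2", "cyl3"), v2 = c("vs", "am"),
                     stringsAsFactors = FALSE)
# table() each variable pair, then label the list entries the same way as above
tables <- Map(function(a, b) table(dat[[a]], dat[[b]]), pairs$v1, pairs$v2)
names(tables) <- paste(pairs$v1, pairs$v2, sep = "_")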

How can I find count of rows by each factor level of each column in a dataframe in R?

I have several different datasets with different numbers of factor variables and an output variable. For each of these datasets I need to find the number of rows of observations grouped by each factor level of each variable, and then grouped by all variables (columns) together. I thought a for loop might do the trick but am struggling with it. Could someone please help with this?
The dataset looks something like this: (image omitted)
and I want the output to be: (image omitted)
I have tried
for (i in 1:length(df)){
  df %>% group_by(df[[i]]) %>% summarise(n = length(i)) %>% print()
}
but this doesn't seem to be working
An option is to gather into 'long' format and then do the count
library(tidyverse)
gather(df1, Variable, Factor_Level, var1:var3) %>%
  count(Variable, Factor_Level)
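gather() has since been superseded in tidyr; assuming tidyr 1.0.0 or later, a sketch of the same count with pivot_longer() would be:
library(tidyr)
library(dplyr)
pivot_longer(df1, var1:var3, names_to = "Variable", values_to = "Factor_Level") %>%
  count(Variable, Factor_Level)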
You should be able to do something like
by(data$x, data$y, function)
where data$x is what you want summarised, data$y is what you group by, and function is what you want done to those entries (e.g. mean, length, shapiro.test, etc.). Then you can coerce this output to a vector using as.vector().
If, for instance, I have a dataframe created with df <- data.frame(ID = c(1, 1, 1, 1, 2, 2, 3), value = c(10, 20, 30, 40, 50, 60, 70)), then running as.vector(by(df$value, df$ID, length)) would return the vector c(4, 2, 1).
If you are ok with a list format you could stop after creating the list. However, this is a (somewhat complex) alternative to the gather method proposed by akrun:
# Getting a vector of factor variables in dataset
factor_vars <- names(mtcars)[sapply(mtcars, is.factor)]
# Creating list of frequency tables
freq_tables <- lapply(factor_vars, function(x) group_by_(mtcars, .dots = x) %>% tally())
freq_tables <- lapply(freq_tables, function(x) cbind(colnames(x)[1], x))
do.call(rbind, lapply(freq_tables, setNames, c("Factor", "Level", "Count")))
Factor Level Count
1 vs 0 18
2 vs 1 14
3 am 0 19
4 am 1 13
5 gear 3 15
6 gear 4 12
7 gear 5 5
8 carb 1 7
9 carb 2 10
10 carb 3 3
11 carb 4 10
12 carb 6 1
13 carb 8 1
Data:
mtcars[8:11] <- lapply(mtcars[8:11], factor)
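A hedged sketch of the same list-then-bind idea with current dplyr (1.0.0+), avoiding the now-deprecated group_by_(); the intermediate names are illustrative, not from the original answer:
library(dplyr)
factor_vars <- names(mtcars)[sapply(mtcars, is.factor)]
# one count() per factor column, then stack the results
freq_tables <- lapply(factor_vars, function(v)
  mtcars %>%
    count(Level = as.character(.data[[v]]), name = "Count") %>%
    mutate(Factor = v, .before = 1))
do.call(rbind, freq_tables)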

Aggregate odd/even pairs

I am trying to simplify a large dataset (52k+ rows) by finding the maximum value for every two week interval. I have already assigned week number values to every row and used the aggregate() function to find the maximum value for each week.
Simplified sample data:
week <- c(1:5, 5, 7:10)
conc <- rnorm(mean=50, sd=20, n=10)
df <- data.frame(week,conc)
aggregate(df, by=list(week), FUN=max)
However, I am stuck on how to further aggregate based on two-week intervals (ex: weeks 1&2, weeks 3&4...). It's not as simple as combining every other row, since weeks can be skipped or sampled more than once.
I'm assuming there's a simple solution, I just haven't found it yet.
Thanks!
week <- c(1:5, 5, 7:10)
bi_week <- (week+1)%/%2
conc <- rnorm(mean=50, sd=20, n=10)
df <- data.frame(week,bi_week,conc)
aggregate(df, by=list(bi_week), FUN=max)
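To see what the helper column does: the integer division maps each odd/even pair of weeks onto the same index, so for the sample weeks above
(c(1:5, 5, 7:10) + 1) %/% 2
# 1 1 2 2 3 3 4 4 5 5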
Use pracma::ceil to grab each bi-weekly pair
library(pracma)
aggregate(df, by=list(ceil(df$week/2)), FUN=max)
Output
Group.1 week conc
1 1 2 76.09191
2 2 4 50.20154
3 3 5 54.93041
4 4 8 69.17820
5 5 10 74.67518
ceil(df$week/2)
# 1 1 2 2 3 3 4 4 5 5
library(purrr)
library(dplyr)
Odds <- seq(1, max(week), 2)
Evens <- seq(2, max(week), 2)
map2(.x = Odds, .y = Evens, .f = function(x, y) {
  df %>% filter(week == x | week == y) %>% select(conc) %>% max
})
I first made vectors of odd and even week numbers. Then, using the purrr package, I fed these pairwise (1&2, then 3&4, etc.) into a function that uses the dplyr package to keep just the relevant weeks, select the conc values, and take the max.
Here is the output:
> map2(.x=Odds,.y=Evens, .f=function(x,y) {df %>% filter(week==x | week==y) %>% select(conc) %>% max})
[[1]]
[1] 68.38759
[[2]]
[1] 56.9231
[[3]]
[1] 77.23965
[[4]]
[1] 49.39443
[[5]]
[1] 49.38465
Note: you could use map2_dbl in place of map2 and get a numeric vector instead.
Edit: removed the part about df2 as that was an error.
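For completeness, a hedged dplyr sketch of the same bi-weekly maximum (not part of the answers above; assumes dplyr is installed and df as defined in the question):
library(dplyr)
df %>%
  group_by(bi_week = (week + 1) %/% 2) %>%
  summarise(max_conc = max(conc), .groups = "drop")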

Beta estimation over panel data by group

I found some previous questions on this topic, especially R: Grouped rolling window linear regression with rollapply and ddply and R: Rolling / moving avg by group; however, neither question provides an exact solution to the problem I am facing. I am currently trying to estimate CAPM betas over panel data using a linear regression. I have different funds (in the example below I use 3 fund groups) for which I would like to calculate the betas separately and per row. To put this more abstractly: I am trying to do a linear regression with a moving window by group, to estimate the coefficient for every row based on the data in the window.
install.packages("zoo","dplyr")
library(zoo);library(dplyr)
# Create dataframe
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
I want to do a rolling regression over the two columns df[6:7] for every group indicated by the column "fund". The calculation should be done separately per group, so (with a window of 3) the first two rows of the beta column for every fund group will always be NA. In the end I want a full dataframe with all fund groups and all beta values combined.
I managed to come up with code that works, but it is pretty messy and requires ordering the data by fund & date before executing. I would welcome any suggestions on how to make it better.
funddata <- funddata[order(funddata$fund, funddata$date),]
beta_func <- function(x, benchmark_riskfree, ret_riskfree) {
  a <- coef(lm(as.formula(paste(ret_riskfree, "~", benchmark_riskfree, -1)),
               data = x))
  return(a)
}
beta_list <- list()
for (i in c(1:3)) {
  beta_list[[paste(i, sep = "_")]] <- rollapplyr(
    funddata[(funddata$fund == i), 6:7], width = 3,
    FUN = function(x) beta_func(as.data.frame(x), "benchmark_riskfree", "ret_riskfree"),
    by.column = FALSE, fill = NA)
}
beta_list<-unlist(beta_list, recursive=FALSE)
funddata$beta<-beta_list
As I mentioned in the comment above, this solution might be a bit off since I'm not able to reproduce your desired output 100%. Still, the functionality of what you're trying to accomplish is there. Have a look at it and let me know if this is something you could use or I could develop further.
EDIT: The code below does not reproduce the desired output as specified above, but turned out to be what the OP was looking for after all.
Here goes:
# Datasource
fund <- as.numeric(c(1,1,1,1,1,1,1,1,3,3,3,3,3,3,2,2,2,2,2,2,2))
return<- as.numeric(c(1:21))
benchmark <- as.numeric(c(1,13,14,20,14,32,4,1,5,7,1,0,7,1,-2,1,6,-7,9,10,9))
riskfree<-as.numeric(c(1,5,1,2,1,6,4,7,5,-5,10,0,3,1,2,1,6,7,8,9,10))
date <- as.Date(c("2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2011-02-28","2010-07-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30",
"2010-07-30","2010-08-31","2010-09-30","2010-10-31","2010-11-30","2010-12-31","2011-01-30"))
funddata<-data.frame(date,fund,return,benchmark,riskfree)
# Creating variables of interest
funddata["ret_riskfree"]<-as.numeric(funddata$return-funddata$riskfree)
funddata["benchmark_riskfree"]<-as.numeric(funddata$benchmark-funddata$riskfree)
# Target check #################################################################
# Subset last three rows in original dataframe
df_check <- funddata[funddata$fund == 1,]
df_check <- tail(df_check,3)
# Run regression check
mod_check <- lm(df_check$ret_riskfree~df_check$benchmark_riskfree)
coef(mod_check)
# My suggestion ################################################################
# The following function takes five arguments:
# 1. a dataframe, myDf
# 2. a column to subset myDf on, subCol
# 3./4. the dependent and independent variable names, varY and varX
# 5. a window length for a sliding window, myWin
fun_rollreg <- function(myDf, subCol, varY, varX, myWin){
df_main <- myDf
# Make an empty data frame to store results in
df_data <- data.frame()
# Identify unique funds
unFunds <- unique(unlist(df_main[subCol]))
# Loop through your subset
for (fundx in unFunds){
# Subset
df <- df_main
df <- df[df$fund == fundx,]
# Keep a copy of the original until later
df_new <- df
# Specify a container for your beta estimates
betas <- c()
# Specify window length
wlength <- myWin
# Retrieve some data dimensions to loop on
rows = dim(df)[1]
periods <- rows - wlength
# Loop through each subset of the data
# and run regression
for (i in rows:(rows - periods)){
# Split dataframe in subsets
# according to the window length
df1 <- df[(i-(wlength-1)):i,]
# Run regression
beta <- coef(lm(df1[[varY]]~df1[[varX]]))[2]
# Keep regression ressults
betas[[i]] <- beta
}
# Add regression data to dataframe
df_new <- data.frame(df, betas)
# Keep the new dataset for later concatenation
df_data <- rbind(df_data, df_new)
}
return(df_data)
}
# Run the function:
df_roll <- fun_rollreg(myDf = funddata, subCol = 'fund',
                       varY = 'ret_riskfree', varX = 'benchmark_riskfree',
                       myWin = 3)
# Show the results
print(head(df_roll,8))
For the first 8 rows in the new dataframe (fund = 1), this is the result:
date fund return benchmark riskfree ret_riskfree benchmark_riskfree betas
1 2010-07-30 1 1 1 1 0 0 NA
2 2010-08-31 1 2 13 5 -3 8 NA
3 2010-09-30 1 3 14 1 2 13 0.10465116
4 2010-10-31 1 4 20 2 2 18 0.50000000
5 2010-11-30 1 5 14 1 4 13 -0.20000000
6 2010-12-31 1 6 32 6 0 26 -0.30232558
7 2011-01-30 1 7 4 4 3 0 -0.11538462
8 2011-02-28 1 8 1 7 1 -6 -0.05645161
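A more compact sketch of the same rolling-beta idea (not the answerer's code) that combines zoo::rollapplyr() with dplyr grouping; roll_beta is a hypothetical helper, and the window of 3 matches the question:
library(zoo)
library(dplyr)
roll_beta <- function(y, x, width = 3) {
  # regress each 3-row window of y on x and keep the slope; pad the start with NA
  rollapplyr(seq_along(y), width,
             function(idx) coef(lm(y[idx] ~ x[idx]))[2],
             fill = NA)
}
funddata %>%
  arrange(fund, date) %>%
  group_by(fund) %>%
  mutate(beta = roll_beta(ret_riskfree, benchmark_riskfree)) %>%
  ungroup()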

Return values from a Correlation Matrix in R

I have a correlation matrix (called correl) that is 390 x 390, and I would like to scan it for values between 0.80 and 0.99. I have written the following loop:
cc1 <- NA # creates a NA vector to store values between 0.80 & 0.99
cc2 <- NA # creates a NA vector to store desired values
p <- dim(correl)[2] # dim returns the size of the correlation matrix
i <- 1
while (i <= p) {
  cc1 <- correl[, correl[, i] >= 0.80 & correl[, i] < 1.00]
  cc2 <- cbind(cc2, cc1)
  i <- i + 1
}
The problem I am having is that I also get undesired correlations ( those below 0.80) into cc2.
#Sample of what I mean:
SPY.Adjusted AAPL.Adjusted CHL.Adjusted CVX.Adjusted
1 SPY.Adjusted 1.0000000 0.83491778 0.6382930 0.8568000
2 AAPL.Adjusted 0.8349178 1.00000000 0.1945304 0.1194307
3 CHL.Adjusted 0.6382930 0.19453044 1.0000000 0.2991739
4 CVX.Adjusted 0.8568000 0.11943067 0.2991739 1.0000000
5 GE.Adjusted 0.6789054 0.13729877 0.3356743 0.5219169
6 GOOGL.Adjusted 0.5567947 0.10986655 0.2552149 0.2128337
I only want to return the correlations within the desired range (0.80 to 0.99) without losing the row names or column names, as otherwise I would not know which is which.
Let's create a simple reproducible example
m = matrix(runif(100), ncol=10)
rownames(m) = LETTERS[1:10]
colnames(m) = rownames(m)
The tricky part is getting a nice return structure that contains the variable names. So I would collapse the matrix into a standard data frame
dd = data.frame(cor = as.vector(m),
                id1 = rownames(m),
                id2 = rep(rownames(m), each = nrow(m)))
Remove duplicate entries
dd = dd[as.vector(upper.tri(m, TRUE)),]
Then select as usual
dd[dd$cor > 0.8 & dd$cor < 0.99,]
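Another base R sketch (not from the answer above): which() with arr.ind = TRUE returns the row/column positions of every matching entry, so the names are easy to keep:
hits <- which(m > 0.8 & m < 0.99 & upper.tri(m), arr.ind = TRUE)
data.frame(id1 = rownames(m)[hits[, "row"]],
           id2 = colnames(m)[hits[, "col"]],
           cor = m[hits])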
Glad you found an answer, but here's another that puts the results in a tidy data frame just in case others are looking for this.
This solution uses the corrr package (and using dplyr functions that are attached with it):
library(corrr)
mtcars %>%
correlate() %>%
shave() %>%
stretch(na.rm = TRUE) %>%
filter(between(r, .8, .99))
#> # A tibble: 3 × 3
#> x y r
#> <chr> <chr> <dbl>
#> 1 cyl disp 0.9020329
#> 2 cyl hp 0.8324475
#> 3 disp wt 0.8879799
Explanation:
mtcars is the data.
correlate() creates a correlation data frame.
shave() is optional and removes the upper triangle (to remove duplicates).
stretch() converts the data frame (in matrix format) to a long format.
filter(between(r, .8, .99)) selects only the correlations between .8 and .99
If I understood your problem correctly, one wouldn't expect a symmetric matrix as the return object. For each of your variables, you want to extract the other variables that are highly correlated with it, but this number differs from variable to variable, so you cannot work with a matrix.
If you insist on a matrix/data frame, I would rather replace small correlations with NA
correl[correl<0.8] <- NA
and then access the column names of the variables highly correlated with a given variable (e.g. the one in the first row) like this
colnames(correl)[!is.na(correl[1,])]
(Although then the NA step is somewhat redundant, as you could access the colnames directly with the constraint
colnames(correl)[correl[1, ] > 0.8]
)
