Creating random ratios that add up to 1 by group - r

I have a dataset as follows:
panelID= c(1:50)
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
urban = c("A", "B", "C")
indust = c("D", "E", "F")
sizes = c(1,2,3,4,5)
n <- 2
library(AER)
library(data.table)
library(dplyr)
set.seed(123)
DT <- data.table( country = rep(sample(country, length(panelID), replace = T), each = n),
year = c(replicate(length(panelID), sample(year, n))),
sales= round(rnorm(10,10,10),2),
industry = rep(sample(indust, length(panelID), replace = T), each = n),
urbanisation = rep(sample(urban, length(panelID), replace = T), each = n),
size = rep(sample(sizes, length(panelID), replace = T), each = n))
DT <- DT %>%
group_by(country) %>%
mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
group_by(country, year) %>%
mutate(taxrate = base_rate + as.integer(runif(1,-2.5,+2.5)))
DT <- DT %>%
group_by(country, year) %>%
mutate(vote = sample(c(0,1),1),
votewon = ifelse(vote==1, sample(c(0,1),1),0))
I would like to add a variable to this dataset called ratio. I want ratio to be a random number between 0 and 1, and I want the sum of these ratios by country to be 1.
How would I go about creating such a column? The only thing I could think of is manually creating vectors which add up to one and then sampling from those vectors.
EDIT: The countries do not have equal entries:
> table(DT$country)
A B C D E F G H I J
6 10 14 6 14 10 10 8 10 12
ratio_sample_6 <- c(0.1, 0.2, 0.3, 0.05, 0.15, 0.2)
DT[,ratio:=sample(ratio_sample_6, replace = FALSE), by="country"]
But even that I could not get to work. Any suggestions?

Pick random numbers and normalize by country:
## data.table version
DT[, ratio := runif(.N)][, ratio := ratio / sum(ratio), by = "country"]
## dplyr version
DT %>% group_by(country) %>%
mutate(
ratio = runif(n()),
ratio = ratio / sum(ratio)
)
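For reference, the same draw-then-normalise idea can be sketched in base R. This is only an illustration on a made-up data frame (`toy` and its contents are assumptions, not the question's data); `ave()` returns the group sum aligned with each row, so the division rescales every group to sum to 1 regardless of how many rows the group has.

```r
set.seed(1)

# Hypothetical toy data with unequal group sizes
toy <- data.frame(country = c("A", "A", "A", "B", "B"),
                  stringsAsFactors = FALSE)

# Draw one uniform random number per row ...
toy$ratio <- runif(nrow(toy))

# ... then divide each value by the sum of its own country
toy$ratio <- toy$ratio / ave(toy$ratio, toy$country, FUN = sum)

# Check: the ratios of every country should add up to 1
group_sums <- tapply(toy$ratio, toy$country, sum)
```

Because each ratio is a positive number divided by a larger positive total, every value stays strictly between 0 and 1.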

Related

Calculating a rolling return

I have a data frame with 3 columns. What I want to do is to calculate the product of the return over a selected month rolling period for each monthly period (or said another way, each row) (where available). This is the basic structure of the data.
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$return <- runif(774, min=0, max=1)
What it should end with is a dataframe where a new column is added with the selected period cumulative return for that time frame. For example, I have shown below a four month return. The calculation of the 4-month return on 03/30/2012 from the data would be:
((1+0.81/100)*(1+0.715/100)*(1+0.27/100)*(1+0.80/100)-1)*100
This would be repeated for each value under the X column.
I ended up using mutate together with the rolling functions from zoo, where you can set the window width. This is the version I ended up with:
library(dplyr)
library(zoo)
# Create Test Dataframe
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$performance <- runif(774, min=0, max=1)
This code computes a 5-month rolling mean for each asset. If you sort by column x you can inspect the result and reproduce it in Excel.
df <- df %>%
group_by(x) %>%
mutate(x_mean = rollmean(performance, 5, fill = NA, align = 'right'))
I also found a way to create a lag so I could take the 4 prior values to the observation and calculate the mean:
df2 <- df %>%
group_by(x) %>%
mutate(perf.1.previous = lag(performance), # value from the previous month
perf.4.previous = rollapply(data = perf.1.previous, width = 4, FUN = mean,
align = "right", fill = NA, na.rm = TRUE))
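The question asks for a compounded return rather than a mean, so here is a minimal base R sketch of that calculation. The helper name `rolling_return` and the toy data are assumptions made for illustration; returns are taken to be in percent, as in the formula in the question, and the rows are assumed to be sorted by date within each asset.

```r
# Hypothetical helper: compounded return over a trailing window, in percent.
# Positions without a full window are left as NA.
rolling_return <- function(r_pct, width = 4) {
  n <- length(r_pct)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) {
      window <- r_pct[(i - width + 1):i]
      out[i] <- (prod(1 + window / 100) - 1) * 100
    }
  }
  out
}

# Toy data: two assets with five monthly returns each (made-up numbers)
returns <- data.frame(
  x = rep(c("A", "B"), each = 5),
  return = c(0.81, 0.715, 0.27, 0.80, 0.50,
             0.10, 0.20, 0.30, 0.40, 0.50)
)

# Apply the helper within each asset; ave() keeps the original row order
returns$return_4m <- ave(returns$return, returns$x, FUN = rolling_return)
```

The loop keeps the logic explicit. With zoo loaded, passing a function that computes `(prod(1 + w / 100) - 1) * 100` as `FUN` to `rollapply` should give the same windows.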

Creating a new variable when a function recognises an interaction term

Based on this link, I wrote the following code, which is part of a function:
Sample data:
panelID = c(1:50)
year= c(2001:2010)
country = c("NLD", "BEL", "GER")
urban = c("A", "B", "C")
indust = c("D", "E", "F")
sizes = c(1,2,3,4,5)
n <- 2
library(data.table)
set.seed(123)
DT <- data.table(panelID = rep(sample(panelID), each = n),
country = rep(sample(country, length(panelID), replace = T), each = n),
year = c(replicate(length(panelID), sample(year, n))),
some_NA = sample(0:5, 6),
Factor = sample(0:5, 6),
industry = rep(sample(indust, length(panelID), replace = T), each = n),
urbanisation = rep(sample(urban, length(panelID), replace = T), each = n),
size = rep(sample(sizes, length(panelID), replace = T), each = n),
income = round(runif(100)/10,2),
sales= round(rnorm(10,10,10),2),
happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
educ = round(rnorm(10,0.75,0.3),2))
DT [, uniqueID := .I] # Creates a unique ID
DT <- as.data.frame(DT)
Code:
depvar <- "happiness"
othervar <- "factor:income"
insvar <- c("happiness","factor","income")
if (length(insvar)>2) {
DT$newvar <- DT[insvar[2]]*DT[insvar[3]]
othervar=newvar
}
The idea is that when othervar is a combination of two variables, othervar gets replaced by a new variable which is the combination of those two variables.
However, I currently get the error:
Error in `[.data.frame`(DT, insvar[2]) : undefined columns selected
How should I write this function properly?
R is case-sensitive, and the column is named Factor, not factor. If you change factor to Factor and assign through DT$newvar, the code runs and produces a new column, which I believe is what you are looking for.
depvar <- "happiness"
othervar <- "Factor:income"
insvar <- c("happiness","Factor","income")
if (length(insvar)>2) {
DT$newvar <- DT[insvar[2]]*DT[insvar[3]]
othervar=DT$newvar
}
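If the goal is to detect the interaction from the string itself, one possible sketch is to split `othervar` on the colon and look the two columns up by name. The toy data frame below is an assumption made for illustration. Using `[[` returns plain vectors, so `newvar` becomes an ordinary numeric column, and `othervar` is replaced by the new column's name rather than by its values.

```r
# Toy data standing in for DT (made-up values)
DT <- data.frame(happiness = c(3, 7, 5),
                 Factor    = c(1, 2, 0),
                 income    = c(0.05, 0.02, 0.09))

othervar <- "Factor:income"

# Split "a:b" into its two component names
parts <- strsplit(othervar, ":", fixed = TRUE)[[1]]

if (length(parts) == 2) {
  # Fail early with a clear message if a name does not match a column
  missing_cols <- setdiff(parts, names(DT))
  if (length(missing_cols) > 0) {
    stop("Unknown column(s): ", paste(missing_cols, collapse = ", "))
  }
  DT$newvar <- DT[[parts[1]]] * DT[[parts[2]]]
  othervar <- "newvar"  # refer to the new column by name from here on
}
```

The explicit check turns a misspelling such as "factor" into a readable error instead of "undefined columns selected".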

Add column to list of data frames and do incremental addition / Loop through df for simple addition

I have this data frame (actually, a list of those dfs):
ALL <- data.frame(x = 1:3, y = c("a", "b", "c"))
I want to add a column which adds up a certain value until the end of the data frame, like this:
0 + 0.05 = 0.05, 0.05 + 0.05 = 0.1, 0.1 + 0.05 = 0.15 and so on.
So, in my example the result would be
ALL <- data.frame(x = 1:3, y = c("a", "b", "c"), z = c(0, 0.05, 0.1))
I guess the way to go would be using cbind with lapply (assuming ALL is a list of dfs):
ALL<- lapply(ALL, function(x) cbind(x, z = ???))
But my brain simply doesn't come up with the right formula for z.
Your help is much appreciated.
Greetings
trumfnator
We can use transform to add new column with seq to generate the sequence.
lapply(list_df, function(df) transform(df, z = seq(0, by = 0.05, length.out = nrow(df))))
#[[1]]
# x y z
#1 1 a 0.00
#2 2 b 0.05
#3 3 c 0.10
#[[2]]
# x y z
#1 1 a 0.00
#2 2 b 0.05
#3 3 c 0.10
#4 4 d 0.15
In tidyverse we can do the same by
library(dplyr)
library(purrr)
map(list_df, ~.x %>% mutate(z = seq(0, by = 0.05, length.out = n())))
data
ALL <- data.frame(x = 1:3, y = c("a", "b", "c"))
ALL1 <- data.frame(x = 1:4, y = c("a", "b", "c", "d"))
list_df <- list(ALL, ALL1)
Maybe this is what you want
ALL$z <- 0.05*(0:(nrow(ALL)-1))
We can do an assignment and then return the dataset
lapply(list_df, function(x) {x$z <- seq(0, by= 0.05, length.out = nrow(x));x})

R multiple choice questionnaire data to ggplot

I have a Qualtrics multiple choice question that I want to use to create graphs in R. My data is organized so that you can answer multiple answers for each question. For example, participant 1 selected multiple choice answers 1 (Q1_1) & 3 (Q1_3). I want to collapse all answer choices in one bar graph, one bar for each multiple response option (Q1_1:Q1_3) divided by the number of respondents who answered this question (in this case, 3).
df <- structure(list(Participant = 1:3, A = c("a", "a", ""), B = c("", "b", "b"), C = c("c", "c", "c")), .Names = c("Participant", "Q1_1", "Q1_2", "Q1_3"), row.names = c(NA, -3L), class = "data.frame")
I want to use ggplot2 and maybe some sort of loop through Q1_1: Q1_3?
Perhaps this is what you want
f <-
structure(
list(
Participant = 1:3,
A = c("a", "a", ""),
B = c("", "b", "b"),
C = c("c", "c", "c")),
.Names = c("Participant", "Q1_1", "Q1_2", "Q1_3"),
row.names = c(NA, -3L),
class = "data.frame"
)
library(tidyr)
library(dplyr)
library(ggplot2)
nparticipant <- nrow(f)
f %>%
## Reformat the data
gather(question, response, starts_with("Q")) %>%
filter(response != "") %>%
## calculate the height of the bars
group_by(question) %>%
summarise(score = length(response)/nparticipant) %>%
## Plot
ggplot(aes(x=question, y=score)) +
geom_bar(stat = "identity")
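As a cross-check on the bar heights, the share of respondents who selected each option can also be computed in base R. This is a small sketch on the question's three-participant example; the names `responses` and `share` are assumptions. It treats an empty string as "not selected" and divides by the total number of participants.

```r
# The example data from the question
responses <- data.frame(
  Participant = 1:3,
  Q1_1 = c("a", "a", ""),
  Q1_2 = c("", "b", "b"),
  Q1_3 = c("c", "c", "c"),
  stringsAsFactors = FALSE
)

# TRUE where an option was ticked; the column mean is the share of respondents
share <- colMeans(responses[-1] != "")
```

`barplot(share)` would then draw one bar per answer option without any reshaping.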
Here is a solution using ddply from the plyr package.
library(plyr) # provides ddply()
library(ggplot2)
# I needed to increase number of participants to ensure it works in every case
df = data.frame(Participant = 1:100,
Q1_1 = sample(c("a", ""), 100, replace = T, prob = c(1/2, 1/2)),
Q1_2 = sample(c("b", ""), 100, replace = T, prob = c(2/3, 1/3)),
Q1_3 = sample(c("c", ""), 100, replace = T, prob = c(1/3, 2/3)))
df$answer = paste0(df$Q1_1, df$Q1_2, df$Q1_3)
summ = ddply(df, c("answer"), summarize, freq = length(answer)/nrow(df))
## Re-ordering of factor levels summ$answer
summ$answer <- factor(summ$answer, levels=c("", "a", "b", "c", "ab", "ac", "bc", "abc"))
# Plot
ggplot(summ, aes(answer, freq, fill = answer)) + geom_bar(stat = "identity") + theme_bw()
Note: it might be more complicated if you have more columns relating to other questions ("Q2_1", "Q2_2"...). In that case, melting the data for each question could be a solution.
I think you want something like this (proportion with a stacked bar chart):
Participant Q1_1 Q1_2 Q1_3
1 1 a c
2 2 a a c
3 3 c b c
4 4 b d
library(reshape2) # provides melt()
library(ggplot2)
# ensure that all question columns have the same factor levels, ignore blanks
for (i in 2:4) {
df[,i] <- factor(df[,i], levels = c(letters[1:4]))
}
tdf <- as.data.frame(sapply(df[2:4], function(x)table(x)/sum(table(x))))
tdf$choice <- rownames(tdf)
tdf <- melt(tdf, id='choice')
ggplot(tdf, aes(variable, value, fill=choice)) +
geom_bar(stat='identity') +
xlab('Questions') +
ylab('Proportion of Choice')

sort data into deciles based on a rolling subset

I am trying to replicate the Fama French 1993 paper using R. I need to do the following sorting:
1. for each month,
2. calculate ME decile breakpoints on NYSE stocks only
3. sort all stocks into the deciles created in 2.
Data generation:
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
exchangeCode = exchange[round(runif(n, 1, 3))],
Stock = stocks[round(runif(n, 1, 5))],
ME=floor(100*abs(rnorm(n)))))
Desired Output:
ME_NYSE_vals <- as.numeric(paste(df[df$Month==1 & df$exchangeCode=="NYSE","ME"]))
ME_ALL_vals <- as.numeric(paste(df[df$Month==1,"ME"]))
cut(x = ME_ALL_vals,
breaks = c(-Inf,quantile(ME_NYSE_vals,probs=seq(.1,.9,.1)),+Inf),
labels = 1:10
)
The breaks should be calculated based on ME_NYSE_vals. The cut should be applied to all ME_ALL_vals for each month.
If the intention is to keep the whole data frame but generate deciles only for the NYSE values, the code below could do. It assigns deciles only to the NYSE entries while keeping the full data set, which amounts to a form of partial sorting.
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
character.only = TRUE)
# Transformations
df %<>%
mutate(nTileNYSE = ifelse(exchangeCode == "NYSE", ntile(ME, 10), NA)) %>%
arrange(nTileNYSE)
The code was applied to the data:
set.seed(1)
df <- as.data.frame(cbind(exchangeCode = c("NYSE", "NASDAQ"),
Stock = c("A", "B", "C", "A"),
Month = 1:12,
ME=rnorm(1200)))
2nd approach
Following the discussion in the comments I would suggest the following approach:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c( "tidyr", "dplyr", "magrittr", "xts", "Hmisc"),
character.only = TRUE)
# Data generation ---------------------------------------------------------
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
exchangeCode = exchange[round(runif(n, 1, 3))],
Stock = stocks[round(runif(n, 1, 5))],
ME = floor(100*abs(rnorm(n)))))
# Transformations ---------------------------------------------------------
# cbind() coerces every column to character, so ME has to be converted back to numeric
df$ME <- as.numeric(as.character(df$ME))
# Generate cuts
dfNtiles <- df %>%
arrange(exchangeCode, Month, ME) %>%
group_by(exchangeCode, Month) %>%
mutate(cutsBsdOnNYSE = cut(x = ME,
breaks = cut2(x = df$ME[df$exchangeCode == "NYSE"],
g = 10, onlycuts = TRUE))) %>%
ungroup() %>%
group_by(cutsBsdOnNYSE) %>%
mutate(grpBsdOnNYSE = n())
It's fairly straightforward and boils down to:
1. Generating cut brackets reflecting a subset of the data.
2. Applying those brackets to the whole vector (ME).
3. Numbering the obtained groups so a group identifier is created.
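A minimal base R sketch of those steps, done separately for each month as the question asks. The helper name `assign_decile` and the deterministic toy data are assumptions made for illustration. `findInterval` is swapped in for `cut` here because it tolerates duplicated breakpoints, which can occur when only a few NYSE stocks are available in a month.

```r
set.seed(1234)
n <- 120

# Toy data built with data.frame() so ME stays numeric; exchanges cycle
# so that every month contains some NYSE stocks
df <- data.frame(
  Month        = rep(1:12, each = 10),
  exchangeCode = rep(c("NYSE", "NASDAQ", "AMEX"), length.out = n),
  ME           = floor(100 * abs(rnorm(n))),
  stringsAsFactors = FALSE
)

# Hypothetical helper: breakpoints from NYSE stocks only, applied to all stocks
assign_decile <- function(me, is_nyse) {
  if (!any(is_nyse)) return(rep(NA_integer_, length(me)))
  breaks <- unname(quantile(me[is_nyse], probs = seq(0.1, 0.9, 0.1)))
  # Intervals are (lower, upper], matching cut()'s default; result is 1..10
  findInterval(me, breaks, left.open = TRUE) + 1L
}

df$decile <- NA_integer_
for (m in unique(df$Month)) {
  rows <- df$Month == m
  df$decile[rows] <- assign_decile(df$ME[rows],
                                   df$exchangeCode[rows] == "NYSE")
}
```

Months without any NYSE stock get NA rather than an error, which seemed the safer default for a sketch.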
