Creating a new variable when a function recognises an interaction term - r

Based on this link, I wrote the following code, which is part of a function:
Sample data:
panelID = c(1:50)
year= c(2001:2010)
country = c("NLD", "BEL", "GER")
urban = c("A", "B", "C")
indust = c("D", "E", "F")
sizes = c(1,2,3,4,5)
n <- 2
library(data.table)
set.seed(123)
DT <- data.table(panelID = rep(sample(panelID), each = n),
country = rep(sample(country, length(panelID), replace = T), each = n),
year = c(replicate(length(panelID), sample(year, n))),
some_NA = sample(0:5, 6),
Factor = sample(0:5, 6),
industry = rep(sample(indust, length(panelID), replace = T), each = n),
urbanisation = rep(sample(urban, length(panelID), replace = T), each = n),
size = rep(sample(sizes, length(panelID), replace = T), each = n),
income = round(runif(100)/10,2),
sales= round(rnorm(10,10,10),2),
happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
educ = round(rnorm(10,0.75,0.3),2))
DT [, uniqueID := .I] # Creates a unique ID
DT <- as.data.frame(DT)
Code:
depvar <- "happiness"
othervar <- "factor:income"
insvar <- c("happiness","factor","income")
if (length(insvar)>2) {
DT$newvar <- DT[insvar[2]]*DT[insvar[3]]
othervar=newvar
}
The idea is that when othervar is a combination of two variables, othervar gets replaced by a new variable which is the combination of those two variables.
However, I currently get the error:
Error in `[.data.frame`(DT, insvar[2]) : undefined columns selected
How should I write this function properly?

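The column is named Factor, not factor, and column names in R are case-sensitive, which is why DT[insvar[2]] fails with "undefined columns selected". If you change factor to Factor and assign the result to DT$newvar, the code runs and produces a new column, which I believe is what you are looking for. Using [[ ]] keeps newvar a plain numeric vector, and othervar is then set to the name of the new column.
depvar <- "happiness"
othervar <- "Factor:income"
insvar <- c("happiness","Factor","income")
if (length(insvar)>2) {
DT$newvar <- DT[[insvar[2]]]*DT[[insvar[3]]]
othervar <- "newvar"
}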
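As a minimal sketch of how this could sit inside a function, the interaction term can be split on ":" instead of hard-coding insvar. The helper name add_interaction, its arguments, and the toy data frame are assumptions made for illustration; they are not part of the original code.

```r
# Hypothetical helper: turn an "a:b" interaction term into a product column.
add_interaction <- function(df, term, newname = "newvar") {
  parts <- strsplit(term, ":", fixed = TRUE)[[1]]
  if (length(parts) != 2) {
    # Not a two-way interaction: return the inputs unchanged.
    return(list(data = df, term = term))
  }
  missing_cols <- setdiff(parts, names(df))
  if (length(missing_cols) > 0) {
    # Column names are case-sensitive, so "factor" does not match "Factor".
    stop("Columns not found: ", paste(missing_cols, collapse = ", "))
  }
  df[[newname]] <- df[[parts[1]]] * df[[parts[2]]]
  list(data = df, term = newname)
}

# Toy data (assumed values, only for demonstration).
toy <- data.frame(Factor = c(1, 2, 3), income = c(0.5, 0.1, 0.2))
out <- add_interaction(toy, "Factor:income")
out$term         # name of the new column
out$data$newvar  # product of Factor and income
```

Returning both the modified data and the new term name lets the calling function replace othervar without relying on global state.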

Related

chi square over multiple groups and variables

I have a huge dataset with several groups (factors with between 2 to 6 levels), and dichotomous variables (0, 1).
example data
DF <- data.frame(
group1 = sample(x = c("A","B","C","D"), size = 100, replace = T),
group2 = sample(x = c("red","blue","green"), size = 100, replace = T),
group3 = sample(x = c("tiny","small","big","huge"), size = 100, replace = T),
var1 = sample(x = 0:1, size = 100, replace = T),
var2 = sample(x = 0:1, size = 100, replace = T),
var3 = sample(x = 0:1, size = 100, replace = T),
var4 = sample(x = 0:1, size = 100, replace = T),
var5 = sample(x = 0:1, size = 100, replace = T))
I want to do a chi square for every group, across all the variables.
library(tidyverse)
library(rstatix)
chisq_test(DF$group1, DF$var1)
chisq_test(DF$group1, DF$var2)
chisq_test(DF$group1, DF$var3)
...
etc
I managed to make it work by using two nested for loops, but I'm sure there is a better solution
groups <- c("group1","group2","group3")
vars <- c("var1","var2","var3","var4","var5")
results <- data.frame()
for(i in groups){
for(j in vars){
test <- chisq_test(DF[,i], DF[,j])
test <- mutate(test, group=i, var=j)
results <- rbind(results, test)
}
}
results
I think I need some kind of apply function, but I can't figure it out
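Here is one way to do it with apply. I am sure there is an even more elegant way to do it with dplyr. Note that here I extract the p.value of the test, but you can extract something else or the whole test result if you prefer. The variables are in columns 4 to 8 of DF, so the inner apply must cover 4:8 to include var5. The result is a matrix with one row per variable and one column per group.
res <- apply(DF[,1:3], 2, function(x) {
apply(DF[,4:8], 2,
function(y) {chisq.test(x,y)$p.value})
})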
Here's a quick and easy dplyr solution, that involves transforming the data into long format keyed by group and var, then running the chi-sq test on each combination of group and var.
DF %>%
pivot_longer(starts_with("group"), names_to = "group", values_to = "group_val") %>%
pivot_longer(starts_with("var"), names_to = "var", values_to = "var_val") %>%
group_by(group, var) %>%
summarise(chisq_test(group_val, var_val)) %>%
ungroup()
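For comparison, here is a hedged base R sketch that keeps the long group/var layout of the original loop without growing a data frame with rbind. It uses stats::chisq.test rather than rstatix::chisq_test and extracts only the p-value; the toy data and the name combos are assumptions for illustration.

```r
set.seed(42)
# Toy data (assumed), smaller than the DF in the question.
toy <- data.frame(
  group1 = sample(c("A", "B", "C"), 60, replace = TRUE),
  group2 = sample(c("red", "blue"), 60, replace = TRUE),
  var1 = sample(0:1, 60, replace = TRUE),
  var2 = sample(0:1, 60, replace = TRUE)
)

# One row per (group, var) combination.
combos <- expand.grid(
  group = c("group1", "group2"),
  var = c("var1", "var2"),
  stringsAsFactors = FALSE
)

# Run the test for each combination and keep the p-value.
# suppressWarnings() hides the small-expected-count warning on toy data.
combos$p_value <- mapply(
  function(g, v) suppressWarnings(chisq.test(toy[[g]], toy[[v]]))$p.value,
  combos$group, combos$var,
  USE.NAMES = FALSE
)
combos
```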

Creating random ratios that add up to 1 by group

I have a dataset as follows:
panelID= c(1:50)
year= c(2005, 2010)
country = c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J")
urban = c("A", "B", "C")
indust = c("D", "E", "F")
sizes = c(1,2,3,4,5)
n <- 2
library(AER)
library(data.table)
library(dplyr)
set.seed(123)
DT <- data.table( country = rep(sample(country, length(panelID), replace = T), each = n),
year = c(replicate(length(panelID), sample(year, n))),
sales= round(rnorm(10,10,10),2),
industry = rep(sample(indust, length(panelID), replace = T), each = n),
urbanisation = rep(sample(urban, length(panelID), replace = T), each = n),
size = rep(sample(sizes, length(panelID), replace = T), each = n))
DT <- DT %>%
group_by(country) %>%
mutate(base_rate = as.integer(runif(1, 12.5, 37.5))) %>%
group_by(country, year) %>%
mutate(taxrate = base_rate + as.integer(runif(1,-2.5,+2.5)))
DT <- DT %>%
group_by(country, year) %>%
mutate(vote = sample(c(0,1),1),
votewon = ifelse(vote==1, sample(c(0,1),1),0))
I would like to add a variable to this dataset called ratio. I want ratio to be a random number between 0 and 1, and I want the sum of these ratios by country to be 1.
How would I go about creating such a column? The only thing I could think of is manually creating vectors which add up to one and then sampling from those vectors.
EDIT: The countries do not have equal entries:
> table(DT$country)
A B C D E F G H I J
6 10 14 6 14 10 10 8 10 12
ratio_sample_6 <- c(0.1, 0.2, 0.3, 0.05, 0.15, 0.2)
DT[,ratio:=sample(ratio_sample_6, replace = FALSE), by="country"]
But even that I could not get to work. Any suggestions?
Pick random numbers and normalize by country:
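## data.table version
## (the dplyr steps in the question turn DT into a grouped tibble, so convert it back first)
DT <- as.data.table(DT)
DT[, ratio := runif(.N)][, ratio := ratio / sum(ratio), by = "country"]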
## dplyr version
DT %>% group_by(country) %>%
mutate(
ratio = runif(n()),
ratio = ratio / sum(ratio)
)
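The same normalisation can be sketched in base R, together with a check that the ratios sum to 1 within each country even when group sizes differ. The toy data frame and its group sizes are assumptions for illustration.

```r
set.seed(1)
# Toy data (assumed): three countries with unequal numbers of rows.
toy <- data.frame(country = rep(c("A", "B", "C"), times = c(6, 10, 4)))

# Draw one uniform number per row, then divide by the country total.
raw <- runif(nrow(toy))
toy$ratio <- raw / ave(raw, toy$country, FUN = sum)

# Check: each country's ratios should sum to 1 (up to floating-point error).
tapply(toy$ratio, toy$country, sum)
```

Because each value is divided by its own group total, the approach does not depend on how many rows a country has.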

Generate a column with random years within a range

I have a very simple question for which I could not find an answer. For an example I am building, I want to add a column to the following data.table containing random years within a certain range, say 2004-2010.
library(data.table)
set.seed(1)
DT <- data.table(panelID = sample(50,50), # Creates a panel ID
Country = c(rep("Albania",30),rep("Belarus",50), rep("Chilipepper",20)),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
norm = round(runif(100)/10,2),
Income = round(rnorm(10,-5,5),2),
Happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
Educ = round(rnorm(10,0.75,0.3),2))
DT [, uniqueID := .I] # Creates a unique ID
DT[DT == 0] <- NA # https://stackoverflow.com/questions/11036989/replace-all-0-values-to-na
DT$some_NA_factor <- factor(DT$some_NA_factor)
We can use sample to select random years between 2004:2010 with replace = TRUE.
library(data.table)
DT[, random_year := sample(2004:2010, .N, replace = TRUE)]

Calculating the mean of the absolute value of all numerical columns

I want to calculate the mean of the absolute value of all numerical columns for the example dataset DT:
library(data.table)
set.seed(1)
DT <- data.table(panelID = sample(50,50), # Creates a panel ID
Country = c(rep("Albania",30),rep("Belarus",50), rep("Chilipepper",20)),
some_NA = sample(0:5, 6),
some_NA_factor = sample(0:5, 6),
Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
norm = round(runif(100)/10,2),
Income = round(rnorm(10,-5,5),2),
Happiness = sample(10,10),
Sex = round(rnorm(10,0.75,0.3),2),
Age = sample(100,100),
Educ = round(rnorm(10,0.75,0.3),2))
DT [, uniqueID := .I] # Creates a unique ID
DT[DT == 0] <- NA # https://stackoverflow.com/questions/11036989/replace-all-0-values-to-na
DT$some_NA_factor <- factor(DT$some_NA_factor)
I tried to calculate the means and the absolute means as follows:
mean_of_differences <- DT[,lapply(Filter(is.numeric,.SD),mean, na.rm=TRUE)]
mean_of_differences <- as.data.frame(t(mean_of_differences))
mean_of_differences <- round(mean_of_differences, digits=2)
mean_of_absolute_diff <- DT[,lapply(Filter(is.numeric,.SD),function(x) mean(abs(x),na.rm=TRUE))]
mean_of_absolute_diff <- as.data.frame(t(mean_of_absolute_diff))
mean_of_absolute_diff <- round(mean_of_differences, digits=2)
The mean of Income for the absolute differences is however negative (as it is for the normal mean), which obviously is not possible. If I look at my code I don't understand what I am doing wrong. What am I overlooking?
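In your code, the last line rounds mean_of_differences instead of mean_of_absolute_diff, so the absolute means are overwritten with the ordinary means; that is why Income is still negative. Here is a solution using data.table. It (i) identifies numeric columns and (ii) obtains the mean of the absolute value of each numeric column.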
Data
dt = data.table(
num1 = rnorm(100),
num2 = rnorm(100),
strv = sample(LETTERS, 100, replace = T)
)
Code
numcols = colnames(dt)[unlist(lapply(dt, is.numeric))] # Which columns are numeric?
# > numcols
# [1] "num1" "num2"
# na.rm = TRUE matters for data with missing values, such as DT in the question
meandt = dt[, lapply(.SD, function(x) mean(abs(x), na.rm = TRUE)), .SDcols = numcols]
newcols = paste('mean_abs_', numcols, sep = '')
colnames(meandt) = newcols
# > meandt
# mean_abs_num1 mean_abs_num2
# 1: 0.8287523 0.8325123
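A small base R sketch of the same idea, on toy data with a negative column and a missing value, shows that the mean of absolute values cannot be negative. The data frame and the names num_cols and mean_abs are assumptions for illustration.

```r
# Toy data (assumed): one column with negatives and an NA, one character column.
toy <- data.frame(
  Income = c(-5, 3, NA, -2),
  Age = c(20, 30, 40, 50),
  Country = c("A", "B", "C", "D")
)

# Select numeric columns by name.
num_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]

# Mean of absolute values per numeric column, ignoring NAs.
mean_abs <- vapply(
  toy[num_cols],
  function(x) mean(abs(x), na.rm = TRUE),
  numeric(1)
)
mean_abs
```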

Quosure within a nested function

I am struggling to write a function fun2 that uses fun1, and I keep getting errors. I have written a simplified example below. This is the first time I have dealt with "tidy evaluation", and I am not sure I understand the ins and outs of it.
Example dataframes:
d1 = data.frame(
ID = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
EXPR = c(2, 8, 3, 5, 7, 20, 1, 5, 4)
)
d2 = data.frame(
ID = c("A", "B", "C"),
NUM = c(22, 50, 31)
)
First function
fun1 <- function(
df1 = "df 1",
df2 = "df 2",
t1 = "threshold 1",
expr_col = "expr column",
id_col = "sample column - must be present in df1 and df2") {
# dataframes
df <- df1
db <- df2
# quosure
enquo_id <- enquo(id_col)
enquo_expr <- enquo(expr_col)
# classify
df <- df %>%
mutate(threshold = t1) %>%
mutate(class = ifelse(!!enquo_expr > t1, "positive", "negative")) %>%
mutate(class = factor(class, levels = c("positive", "negative")))
# calculate sample data
df.sum <- df %>%
group_by(!!enquo_id, class) %>%
summarise(count = n()) %>%
complete(class, fill = list(count = 0)) %>%
mutate(total = sum(count), freq = count/total)
# merge dataframes
df.sum <- left_join(df.sum, db, by = quo_name(enquo_id))
# return
return(df.sum)
}
If I run a test of this, I get a dataframe in return, as expected
test <- fun1(df1 = d1, df2 = d2, t1 = 3, expr_col = EXPR, id_col = ID)
Second function
Now with fun2, I am trying to use fun1 in a for loop to iterate from ti to tf of the seq vector:
fun2 <- function(
df1 = "df 1",
df2 = "df 2",
expr_col = "expr column",
id_col = "sample column - must be present in df1 and df2",
ti = "initial value",
tf = "final value",
res = "resolution") {
# define variables for fun1
var1 <- enquo(d1)
var2 <- enquo(d2)
var3 <- enquo(t1)
var4 <- enquo(EXPR)
var5 <- enquo(ID)
# get sequence of values
seq <- seq(from = ti, to = tf, by = res)
# open list
t.list <- list()
# Loop ----
for (i in seq_along(seq)){
t1 <- seq[i]
t.list[[i]] <- fun1(df1 = var1,
df2 = var2,
t1 = var3,
expr_col = var4,
id_col = var5)
}
df.out <- plyr::ldply(t.list, rbind)
### Return ---
return(df.out)
}
But if I run this
test <- fun2(df1 = d1, df2 = d2, expr_col = EXPR, id_col = ID, ti = 1, tf = 10, res = 1)
I get an error message
Error in (function (x) : object 'EXPR' not found
I tried various things... and I am kind of stuck here. I guess I am not using enquo() properly. I can get it to work by not using varX and putting directly the actual appropriate name of each element in the fun1 arguments, but the whole point of doing this, to me, is to make it "generalisable" and therefore specify the arguments only in fun2 which will then be passed to fun1.
Any help would be greatly appreciated.
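Many thanks for your answer aosmith. I am now sorted using the following code. The key changes are that enquo() is applied to fun2's own arguments (expr_col and id_col) rather than to names that only exist outside the function, that the quosures are unquoted with !! when passed on to fun1, and that df1, df2 and t1 are passed as ordinary values: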
fun2 <- function(
df1 = "df 1",
df2 = "df 2",
expr_col = "expr column",
id_col = "sample column - must be present in df1 and df2",
ti = "initial value",
tf = "final value",
res = "resolution") {
# define variables for fun1
var4 <- enquo(expr_col)
var5 <- enquo(id_col)
# get sequence of values
seq <- seq(from = ti, to = tf, by = res)
# open list
t.list <- list()
### Loop --------------------------------------------------------------
for (i in seq_along(seq)){
t1 <- seq[i]
t.list[[i]] <- fun1(df1 = df1,
df2 = df2,
t1 = t1,
expr_col = !!var4,
id_col = !!var5)
}
df.out <- plyr::ldply(t.list, rbind)
### Return ---
return(df.out)
}
# TEST FUN2
test <- fun2(df1 = d1, df2 = d2, expr_col = EXPR, id_col = ID, ti = 1, tf = 10, res = 1)
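As a hedged alternative, assuming rlang 0.4.0 or later (as shipped with recent dplyr), the embrace operator {{ }} can forward a bare column name through nested functions without explicit enquo() and !!. The functions count_above and count_over_thresholds below are hypothetical, simplified stand-ins for fun1 and fun2, not the original code.

```r
library(dplyr)

# Hypothetical inner function: per id, count rows whose value exceeds t1.
count_above <- function(df, expr_col, id_col, t1) {
  df %>%
    group_by({{ id_col }}) %>%
    summarise(n_above = sum({{ expr_col }} > t1), .groups = "drop")
}

# Hypothetical outer function: forwards the bare column names unchanged.
count_over_thresholds <- function(df, expr_col, id_col, thresholds) {
  out <- vector("list", length(thresholds))
  for (i in seq_along(thresholds)) {
    res <- count_above(df, {{ expr_col }}, {{ id_col }}, thresholds[i])
    res$threshold <- thresholds[i]
    out[[i]] <- res
  }
  bind_rows(out)
}

# Same example data as d1 in the question.
d1 <- data.frame(
  ID = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  EXPR = c(2, 8, 3, 5, 7, 20, 1, 5, 4)
)
res <- count_over_thresholds(d1, EXPR, ID, thresholds = c(3, 6))
res
```

Ordinary values such as the data frame and the threshold are passed as plain arguments; only the column names need tidy evaluation.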
