I'm using a program called Apollo to make an ordered logit model. In this model, you have to specify a list of variables like this:
apollo_beta = c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0,
b_var2_dum1 = 0,
b_var2_dum2 = 0,
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
I want to do two things:
Firstly, I want to be able to specify these beforehand:
specification1 = c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0,
b_var2_dum1 = 0,
b_var2_dum2 = 0,
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
And then be able to call it:
apollo_beta = specification1
Secondly, I want to be able to make categories:
var1 <- c(
b_var1_dum1 = 0,
b_var1_dum2 = 0,
b_var1_dum3 = 0)
var2 <- c(
b_var2_dum1 = 0,
b_var2_dum2 = 0)
var3 <- c(
b_var3_dum1 = 0,
b_var3_dum2 = 0,
b_var3_dum3 = 0,
b_var3_dum4 = 0)
And then be able to use those in the specification:
specification1 = c(
var1,
var2,
var3)
And then:
apollo_beta = specification1
I know you might not have the best knowledge of the very niche programme Apollo. I am not quite sure if this is even possible, but since it would save me days (maybe weeks) of work, can anyone give me a hint on what I might be doing wrong? I worry I have a list within a list.
Since I have to make 60 specifications of the same model with different variations of 6 variables, it would be a lot of code and a lot of work if I can't shorten it like this.
Any tips would be greatly appreciated.
Data:
df <- data.frame(
var1_dum1 = c(0, 1, 0),
var1_dum2 = c(1, 0, 0),
var1_dum3 = c(0, 0, 1),
var2_dum1 = c(0, 1, 0),
var2_dum2 = c(1, 0, 0),
var3_dum1 = c(1, 1, 0),
var3_dum2 = c(1, 0, 0),
var3_dum3 = c(0, 1, 0),
var3_dum4 = c(0, 0, 1)
)
So there is a dataset with these variables. In Apollo you specify "database = df" first, so it already refers to the variables.
The entries of apollo_beta don't refer to the variables directly, so technically you can call them what you want. I just want to give them the same names as the variables because I will refer to them later.
My question is simple: can I condense the long list to simply say "specification1"? It's just a question about the R language: will the items of the list function the same way as when they were written out in the code?
In other words, would calling apollo_beta in the above three examples lead to the same result? If not, how do I change the code so that it does lead to the same?
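One way to check this in plain R, without Apollo: c() flattens named numeric vectors into a single named numeric vector, so combining var1, var2 and var3 should not create a list within a list. The sketch below is only a base R check; written_out is a name made up here for the comparison, and whether Apollo then accepts the vector is an assumption that is not tested.

```r
# The three categories from the question, as named numeric vectors
var1 <- c(b_var1_dum1 = 0, b_var1_dum2 = 0, b_var1_dum3 = 0)
var2 <- c(b_var2_dum1 = 0, b_var2_dum2 = 0)
var3 <- c(b_var3_dum1 = 0, b_var3_dum2 = 0, b_var3_dum3 = 0, b_var3_dum4 = 0)

# c() concatenates atomic vectors and keeps their element names
specification1 <- c(var1, var2, var3)

# The same parameters written out in full (hypothetical name, for comparison only)
written_out <- c(
  b_var1_dum1 = 0, b_var1_dum2 = 0, b_var1_dum3 = 0,
  b_var2_dum1 = 0, b_var2_dum2 = 0,
  b_var3_dum1 = 0, b_var3_dum2 = 0, b_var3_dum3 = 0, b_var3_dum4 = 0
)

identical(specification1, written_out)  # TRUE
is.list(specification1)                 # FALSE: still a plain numeric vector

apollo_beta <- specification1
```

If the categories were built with list() instead of c(), the result would be nested, so the check with is.list() is worth keeping while experimenting.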
I have the following data showing 5 possible kids to invite to a party and what neighborhoods they live in.
I have a list of solutions as well (binary indicators of whether the kid is invited or not; e.g., the first solution invites Kelly, Gina, and Patty).
data <- data.frame(c("Kelly", "Andrew", "Josh", "Gina", "Patty"), c(1, 1, 0, 1, 0), c(0, 1, 1, 1, 0))
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1), c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
I'm looking for a way to now filter the solutions in the following ways:
a) Only keep solutions where there are at least 3 kids from both neighborhood A and neighborhood B (one kid can count as one for both if they're part of both)
b) Only keep solutions that have at least 3 kids selected (i.e., sum >= 3)
I think I need to somehow join data to the solutions in solutions, but I'm a bit lost on how to manipulate everything since the solutions are stuck in lists. Basically looking for a way to add entries to every solution in the list indicating a) how many kids the solution has, b) how many kids from neighborhood A, and c) how many kids from neighborhood B. From there I'd have to somehow filter the lists to only keep the solutions that satisfy >= 3?
Thank you in advance!
I wrote a little function to check each solution and return TRUE or FALSE based on your requirements. Passing your solutions to this using sapply() will give you a logical vector, with which you can subset solutions to retain only those that met the requirements.
check_solution <- function(solution, data) {
data <- data[as.logical(solution),]
sum(data[["Neighborhood A"]]) >= 3 && sum(data[["Neighborhood B"]]) >= 3
}
### No need for a separate test of whether `sum(solution) >= 3`, since
### this will *always* be true if either neighborhood sum is >= 3.
tests <- sapply(solutions, check_solution, data = data)
# FALSE FALSE FALSE FALSE FALSE
solutions[tests]
# list()
### none of the `solutions` provided actually meet criteria
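To see why nothing passes, it can help to tabulate the counts the question asks for (kids, kids from A, kids from B) for every solution. This is a minimal base R sketch using the data from the question; the name counts and the column labels kids, A and B are made up here.

```r
data <- data.frame(
  Kid = c("Kelly", "Andrew", "Josh", "Gina", "Patty"),
  `Neighborhood A` = c(1, 1, 0, 1, 0),
  `Neighborhood B` = c(0, 1, 1, 1, 0),
  check.names = FALSE  # keep the spaces in the column names
)
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1),
                  c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))

# One row per solution: total kids invited and kids per neighborhood
counts <- t(sapply(solutions, function(s) c(
  kids = sum(s),
  A    = sum(s * data[["Neighborhood A"]]),
  B    = sum(s * data[["Neighborhood B"]])
)))
counts

# Keep solutions with at least 3 kids overall
solutions[counts[, "kids"] >= 3]

# Keep solutions with at least 3 kids from both neighborhoods (none here)
solutions[counts[, "A"] >= 3 & counts[, "B"] >= 3]
```

Keeping the counts in a matrix next to the list avoids having to store extra entries inside each solution.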
Edit: OP asked in the comments how to test against all neighborhoods in the data, and return TRUE if a specified number of neighborhoods have enough kids. Below is a solution using dplyr.
library(dplyr)
data <- data.frame(
c("Kelly", "Andrew", "Josh", "Gina", "Patty"),
c(1, 1, 0, 1, 0),
c(0, 1, 1, 1, 0),
c(1, 1, 1, 0, 1),
c(0, 1, 1, 1, 1)
)
names(data) <- c("Kid", "Neighborhood A", "Neighborhood B", "Neighborhood C",
"Neighborhood D")
solutions <- list(c(1, 0, 0, 1, 1), c(0, 0, 0, 1, 1), c(0, 1, 0, 1, 1),
c(1, 0, 1, 0, 1), c(0, 1, 0, 0, 1))
check_solution <- function(solution,
data,
min_kids = 3,
min_neighborhoods = NULL) {
neighborhood_tests <- data %>%
filter(as.logical(solution)) %>%
summarize(across(starts_with("Neighborhood"), ~ sum(.x) >= min_kids)) %>%
as.logical()
# require all neighborhoods by default
if (is.null(min_neighborhoods)) min_neighborhoods <- length(neighborhood_tests)
sum(neighborhood_tests) >= min_neighborhoods
}
tests1 <- sapply(solutions, check_solution, data = data)
solutions[tests1]
# list()
tests2 <- sapply(
solutions,
check_solution,
data = data,
min_kids = 2,
min_neighborhoods = 3
)
solutions[tests2]
# [[1]]
# [1] 1 0 0 1 1
#
# [[2]]
# [1] 0 1 0 1 1
I have written the following script to sum the rows of a data frame and count the number of columns that are not zero in each row. Suddenly my script stopped working and I am not sure what the error is.
test <- structure(list(col1 = c(0.126331200264469, 0, 0, 0, 0), col2 = c(0,
0, 0, 0, 0), col3 = c(0, 0, 0, 0, 0), col4 = c(0, 0, 0, 0, 0),
col5 = c(0, 0, 0, 0, 0)), row.names = c("row1", "row2", "row3",
"row4", "row5"), class = "data.frame")
script:
test.out <- test %>%
mutate(Not_Present = across(everything(), ~ . == 0) %>%
reduce(`+`), Present = ncol(test)- Not_Present)
error:
Error: `across()` must only be used inside dplyr verbs.
Run `rlang::last_error()` to see where the error occurred.
Another option is using rowSums
library(dplyr)
test %>%
mutate(Not_Present = rowSums(across(everything()) == 0),
Present = ncol(test) - Not_Present)
If a single count of the columns that contain any non-zero value is enough for further work, I would just go with:
test.out <- sum(apply(test != 0, 2, any))
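For the per-row counts the question asks for, base R alone may be enough, which sidesteps the across() error entirely. A minimal sketch, assuming the same test data frame as in the question (col1 is shortened to three decimals here):

```r
test <- data.frame(
  col1 = c(0.126, 0, 0, 0, 0),
  col2 = c(0, 0, 0, 0, 0),
  col3 = c(0, 0, 0, 0, 0),
  col4 = c(0, 0, 0, 0, 0),
  col5 = c(0, 0, 0, 0, 0),
  row.names = c("row1", "row2", "row3", "row4", "row5")
)

test.out <- test
# test == 0 is a logical matrix; rowSums() counts the TRUE values per row
test.out$Not_Present <- rowSums(test == 0)
# ncol(test) is taken from the original frame, before the new columns exist
test.out$Present <- ncol(test) - test.out$Not_Present
test.out
```

Here row1 has four zero columns and one non-zero column, and every other row has five zero columns.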
I am estimating a time series Error Correction Model on my data (with the package 'ecm'). In the code below you can see that I specify the short- and long-term variables with xeq and xtr.
These are the independent variables; the dependent variable is Sales.
In this case it is a pooled model, but I want to estimate the model unit by unit (so separately for every brand). My dataset is rather large and consists of 360 product categories, each having 3 brands (brand 2, brand 3 and brand 4).
xeq <- DatasetThesisSynergyClean[c('lnPrice', 'lnAdvertising', 'lnDisplay', 'IntrayearCycles', 'lnCompetitorPrices', 'lnCompADV', 'lnCompDISP' , 'ADVxDISP', 'ADVxCYC', 'DISPxCYC', 'ADVxDISPxCYC')]
xtr <- DatasetThesisSynergyClean[c('lnPrice', 'lnAdvertising', 'lnDisplay', 'IntrayearCycles', 'lnCompetitorPrices', 'lnCompADV', 'lnCompDISP', 'ADVxDISP', 'ADVxCYC', 'DISPxCYC', 'ADVxDISPxCYC')]
model11 <- ecm(DatasetThesisSynergyClean$lnSales, xeq, xtr, includeIntercept=TRUE)
summary(model11)
What I want is to generate an output for every brand of every category. To give you a glimpse of my data, please run this code:
structure(list(Week = 7:17, Category = c("2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2"), Brand = c("3", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3"), Display = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), Sales = c(0, 0, 0, 0, 13.440948, 40.097397,
32.01384, 382.169189, 2830.748779, 4524.460938, 1053.590576),
Price = c(0, 0, 0, 0, 5.949999, 5.95, 5.950003, 4.87759,
3.787015, 3.205987, 4.898724), Distribution = c(0, 0, 0,
0, 1.394019, 1.386989, 1.621416, 8.209759, 8.552915, 9.692097,
9.445554), Advertising = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), lnSales = c(11.4945151554497, 11.633214247508, 11.5862944141137,
11.5412559646132, 11.4811122484454, 11.4775106999991, 11.6333660772506,
11.4859819773102, 11.5232680456161, 11.5572670584292, 11.5303686934256
), IntrayearCycles = c(4.15446534315765, 3.62757053512638,
2.92387946552647, 2.14946414386239, 1.40455011205262, 0.768856938870769,
0.291497141953598, -0.0131078404184544, -0.162984144025091,
-0.200882782749248, -0.182877633924882), `Competitor Advertising` = c(10584.87063,
224846.3243, 90657.72553, 0, 0, 0, 2396.54212, 0, 0, 0, 40343.49444
), `Competitor Display` = c(0.385629, 2.108133, 2.515806,
4.918288, 3.81749, 3.035847, 2.463194, 3.242594, 1.850399,
1.751096, 1.337943), `Competitor Prices` = c(5.30989, 5.372752,
5.3717245, 5.3295525, 5.298393, 5.319466, 5.1958415, 5.2941095,
5.296757, 5.294059, 5.273578), ZeroSales = c(1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0)), .Names = c("Week", "Category", "Brand",
"Display", "Sales", "Price", "Distribution", "Advertising", "lnSales",
"IntrayearCycles", "Competitor Advertising", "Competitor Display",
"Competitor Prices", "ZeroSales"), row.names = 1255:1265, class = "data.frame")
As you can see, I have all the categories and brands stored in rows. To get an estimation on every single brand I want to write a for loop, but I don't really know how to specify the right category and brand in order to save this output separately.
Eventually I want to store the coefficients, standard errors, t-values and p-values of all brands in 4 separate data frames. But first I need to obtain the output. Can you help me out?
You could use dplyr like this:
library(dplyr)
library(ecm)
f <- function(.) {
xeq <- as.data.frame(select(., lnPrice, lnAdvertising, lnDisplay, IntrayearCycles, lnCompetitorPrices, lnCompADV, lnCompDISP, ADVxDISP, ADVxCYC, DISPxCYC, ADVxDISPxCYC))
xtr <- as.data.frame(select(., lnPrice, lnAdvertising, lnDisplay, IntrayearCycles, lnCompetitorPrices, lnCompADV, lnCompDISP, ADVxDISP, ADVxCYC, DISPxCYC, ADVxDISPxCYC))
print(xeq)
print(xtr)
summary(ecm(.$lnSales, xeq, xtr, includeIntercept = TRUE))
}
Models <- DatasetThesisSynergyClean %>%
group_by(Category, Brand) %>%
do(Model = f(.))
Models$Category
[1] "2" "3"
Models$Brand
[1] "3" "3"
Models$Model
[[1]]
Call:
lm(formula = dy ~ ., data = x)
# ... and so on
You end up with a list of 3 items (the categories, the brands and the model summary objects), each with a length equal to the number of unique category/brand combinations. I could not try it properly, since I do not have the complete data. To get the model summary for Category 3, Brand 3:
Models$Model[[which(Models$Category == 3 & Models$Brand == 3)]]
Update:
If you want a standalone object for each model, you can give them corresponding names and use list2env():
names(Models$Model) <- paste0("C", Models$Category, "B", Models$Brand)
list2env(Models$Model, .GlobalEnv)
I would suggest you take a look at some of the tidyverse packages, and consider combining split(df, list(df$Category, df$Brand)) with purrr's map() function to apply a function to each of your smaller datasets. The code would be something like this:
library(dplyr)
library(purrr)
df %>%
split(f = list(.$Category, .$Brand)) %>%
map(a_function_for_each_group) %>%
bind_rows()
I hope I have understood your question correctly.
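As a rough sketch of how the four result tables could be collected once there is one fitted model per group: the example below uses lm() on made-up data as a stand-in for ecm(), so toy, y, x and pick() are all hypothetical names. It assumes only that coef(summary(fit)) returns the usual matrix with the columns "Estimate", "Std. Error", "t value" and "Pr(>|t|)", which holds for lm fits.

```r
set.seed(1)
# Made-up data: 2 categories x 2 brands, 10 weeks each
toy <- data.frame(
  Category = rep(c("1", "2"), each = 20),
  Brand    = rep(rep(c("2", "3"), each = 10), times = 2),
  x        = rnorm(40)
)
toy$y <- 1 + 2 * toy$x + rnorm(40)

# One data frame per Category/Brand combination
groups <- split(toy, list(toy$Category, toy$Brand), drop = TRUE)

# Fit one model per group and keep its coefficient table
# (lm() is a stand-in here; swap in the real per-group model call)
tables <- lapply(groups, function(d) coef(summary(lm(y ~ x, data = d))))

# Pull one column of every coefficient table into a groups x terms data frame
pick <- function(column) {
  as.data.frame(t(sapply(tables, function(m) m[, column])))
}

coefs <- pick("Estimate")
ses   <- pick("Std. Error")
tvals <- pick("t value")
pvals <- pick("Pr(>|t|)")
coefs
```

Each of the four data frames has one row per Category/Brand combination (row names such as "1.2") and one column per model term.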
I want to get the highest values (let's say the highest 3) of every column of my df. It is important for me to also get the row names of these values. Here is a subset of my data:
structure(list(BLUE.fruits = c(12803543, 3745797, 19947613, 0, 130, 4),
BLUE.nuts = c(21563867, 533665, 171984, 0, 0, 0),
BLUE.veggies = c(92690, 188940, 34910, 0, 0, 577),
GREEN.fruits = c(3389314, 15773576, 8942278, 0, 814, 87538),
GREEN.nuts = c(6399474, 1640804, 464688, 0, 0, 0),
GREEN.veggies = c(15508, 174504, 149581, 0, 0, 6190),
GREY.fruits = c(293869, 0, 188368, 0, 8, 0),
GREY.nuts = c(852646, 144024, 26592, 0, 0, 0),
GREY.veggies = c(2992, 41267, 6172, 0, 0, 0)),
.Names = c("BLUE.fruits", "BLUE.nuts", "BLUE.veggies",
"GREEN.fruits", "GREEN.nuts", "GREEN.veggies", "GREY.fruits",
"GREY.nuts", "GREY.veggies"), row.names = c("Afghanistan", "Albania",
"Algeria", "American Samoa", "Angola", "Antigua and Barbuda"),
class = "data.frame")
I tried this so far for the first column:
as.data.frame(x[,1][order(x[,1], decreasing=TRUE)][1:10])
However, I don't get the original row names, and I need an approach such as apply/lapply to go through all columns (~150 cols). Ideas? Thanks
This could help:
Print one column of data frame with row names
So if you adapt your code a bit you get:
(It is one long, ugly line of code that returns a list, which I assume is your desired output format, based on your "lapply" tag.)
lapply(1:dim(df)[2], function(col.number) df[order(df[, col.number], decreasing=TRUE)[1:3], col.number, drop = FALSE])
You could write a column maximum function, colMax.
colMax <- function(data) sapply(data, max, na.rm = TRUE)
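Note that colMax returns only the single largest value per column, without row names. A possible variant that keeps the row names is to attach them to each column as element names before sorting. This is a minimal sketch on two columns of the sample data; top_n_named is a hypothetical helper name.

```r
df <- data.frame(
  BLUE.fruits = c(12803543, 3745797, 19947613, 0, 130, 4),
  BLUE.nuts   = c(21563867, 533665, 171984, 0, 0, 0),
  row.names   = c("Afghanistan", "Albania", "Algeria", "American Samoa",
                  "Angola", "Antigua and Barbuda")
)

# For every column: name the values by row, sort descending, keep the first n
top_n_named <- function(data, n = 3) {
  lapply(data, function(col) {
    head(sort(setNames(col, rownames(data)), decreasing = TRUE), n)
  })
}

top3 <- top_n_named(df, n = 3)
top3$BLUE.fruits
# Algeria: 19947613, Afghanistan: 12803543, Albania: 3745797
```

The result is a list with one named numeric vector per column, so names(top3$BLUE.fruits) gives the countries and unname(top3$BLUE.fruits) the values.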