For loop regression estimations (ECM model) - r

I am estimating a time series Error Correction Model on my data (with package 'ecm'). In below code you can see that I specify the short and long term variables with xeq and xtr.
These variables are independent variables and estimate on the dependent variable: Sales.
In this case, it is a pooled model but I want to estimate this model unit by unit (so separate for every brand). Since my dataset is rather large and consists of 360 product categories, each having 3 brands (brand 2, brand 3 and brand 4).
xeq <- DatasetThesisSynergyClean[c('lnPrice', 'lnAdvertising', 'lnDisplay', 'IntrayearCycles', 'lnCompetitorPrices', 'lnCompADV', 'lnCompDISP' , 'ADVxDISP', 'ADVxCYC', 'DISPxCYC', 'ADVxDISPxCYC')]
xtr <- DatasetThesisSynergyClean[c('lnPrice', 'lnAdvertising', 'lnDisplay', 'IntrayearCycles', 'lnCompetitorPrices', 'lnCompADV', 'lnCompDISP', 'ADVxDISP', 'ADVxCYC', 'DISPxCYC', 'ADVxDISPxCYC')]
model11 <- ecm(DatasetThesisSynergyClean$lnSales, xeq, xtr, includeIntercept=TRUE)
summary(model11)
What I want is to generate an output for every brand of every category. To give you glimpse of my data, please run this code:
structure(list(Week = 7:17, Category = c("2", "2", "2", "2",
"2", "2", "2", "2", "2", "2", "2"), Brand = c("3", "3", "3",
"3", "3", "3", "3", "3", "3", "3", "3"), Display = c(0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0), Sales = c(0, 0, 0, 0, 13.440948, 40.097397,
32.01384, 382.169189, 2830.748779, 4524.460938, 1053.590576),
Price = c(0, 0, 0, 0, 5.949999, 5.95, 5.950003, 4.87759,
3.787015, 3.205987, 4.898724), Distribution = c(0, 0, 0,
0, 1.394019, 1.386989, 1.621416, 8.209759, 8.552915, 9.692097,
9.445554), Advertising = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0), lnSales = c(11.4945151554497, 11.633214247508, 11.5862944141137,
11.5412559646132, 11.4811122484454, 11.4775106999991, 11.6333660772506,
11.4859819773102, 11.5232680456161, 11.5572670584292, 11.5303686934256
), IntrayearCycles = c(4.15446534315765, 3.62757053512638,
2.92387946552647, 2.14946414386239, 1.40455011205262, 0.768856938870769,
0.291497141953598, -0.0131078404184544, -0.162984144025091,
-0.200882782749248, -0.182877633924882), `Competitor Advertising` = c(10584.87063,
224846.3243, 90657.72553, 0, 0, 0, 2396.54212, 0, 0, 0, 40343.49444
), `Competitor Display` = c(0.385629, 2.108133, 2.515806,
4.918288, 3.81749, 3.035847, 2.463194, 3.242594, 1.850399,
1.751096, 1.337943), `Competitor Prices` = c(5.30989, 5.372752,
5.3717245, 5.3295525, 5.298393, 5.319466, 5.1958415, 5.2941095,
5.296757, 5.294059, 5.273578), ZeroSales = c(1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0)), .Names = c("Week", "Category", "Brand",
"Display", "Sales", "Price", "Distribution", "Advertising", "lnSales",
"IntrayearCycles", "Competitor Advertising", "Competitor Display",
"Competitor Prices", "ZeroSales"), row.names = 1255:1265, class = "data.frame")
As you can see, I have all the categories and brands stored in rows. To get an estimation on every single brand I want to write a for loop, but I don't really know how to specify the right category and brand in order to save this output separately.
Eventually want to store the coefficients, std. error, t-values and p-values, of all brands in 4 separate dataframes. But first I need to obtain the output, can you guys help me out?

You could use dplyr like this:
f <- function(.) {
xeq <- as.data.frame(select(., lnPrice, lnAdvertising, lnDisplay, IntrayearCycles, lnCompetitorPrices, lnCompADV, lnCompDISP, ADVxDISP, ADVxCYC, DISPxCYC, ADVxDISPxCYC))
xtr <- as.data.frame(select(., lnPrice, lnAdvertising, lnDisplay, IntrayearCycles, lnCompetitorPrices, lnCompADV, lnCompDISP, ADVxDISP, ADVxCYC, DISPxCYC, ADVxDISPxCYC))
print(xeq)
print(xtr)
summary(ecm(.$lnSales, xeq, xtr, includeIntercept = TRUE))
}
Models <- DatasetThesisSynergyClean %>%
group_by(Category, Brand) %>%
do(Model = f(.))
Models$Category
[1] "2" "3"
Models$Brand
[1] "3" "3"
Models$Model
[[1]]
Call:
lm(formula = dy ~ ., data = x)
# ... and so on
You end up with a list of 3 items (the categories, brands and model summary objects) and length equal to unique category/brand combinations. Could not try it properly, since I do not have the complete data. Model summary for Category 3, Brand 3:
Models$Model[[which(Models$Category == 3 & Models$Brand == 3)]]
Update:
If you want standalone object for each model you can give them corresponding names and use list2env():
names(Models$Model) <- paste0("C", Models$Category, "B", Models$Brand)
list2env(Models$Model, .GlobalEnv)

I would suggest you take a look at some of the tidyverse packages, and consider using a vectorised approach combining split(df, list(df$Category, df$Group)) and purrr's map() function to apply a function to each of your smaller datasets. The code would be something like this:
df %>%
split(f = list(.$Category, .$Brand)) %>%
map(a_function_for_each_group) %>%
bind_rows()
I hope i have understood your question correctly.

Related

How to add environmental variable to DCA plot using ggplot2

I would like to plot an environmental variable on a ggplot2 version of a DCA plot.
I have some code where I extract species and data scores from vegan and then plot them up in ggplot2. I am having trouble trying to work out how I can get my environmental variable SWLI to plot as an arrow - something like this RDA's plots with ggvegan: How can I change text position for arrows text? (or see PCA example here https://www.rpubs.com/an-bui/vegan-cheat-sheet)
Can anybody help?
#DCA Plot
library(plyr)
library(vegan)
library(ggplot2)
library(cluster)
library(ggfortify)
library(factoextra)
#read in csv and remove variables you don't want to go through analysis
regforamcountsall<-read_csv("regionalforamcountsallnocalcs.csv")
swli<-read_csv("DCAenv.csv")
rownames(regforamcountsall)<-regforamcountsall$Sample
regforamcountsall$Sample = NULL
regforamcountsall$Site=NULL
regforamcountsall$SWLI=NULL
#check csv
regforamcountsall
#run ordination
ord<-decorana(regforamcountsall)
#get species scores
summary(ord)
#get DCA values of environmental variable
ord.fit <- envfit(ord ~ SWLI, data=swli, perm=999)
ord.fit
plot(ord, dis="site")
plot(ord.fit)
#use this summary code to get species scores for DCA1 and DCA2
#put species scores values in from ord plot summary stats
species.scores<-read.csv("speciescores.csv")
species.scores$species <- row.names(species.scores)
#Using the scores function from vegan to extract the sample scores and convert to a data.frame
data.scores <- as.data.frame(scores(ord))
# create a column of groupings/clusters, from the rownames of data.scores
data.scores$endgroup <- as.factor(pam(regforamcountsall, 3)$clustering)
#getting the convex hull of each unique point set
find_hull <- function(df) df[chull(data.scores$DCA1, data.scores$DCA2), ]
hulls <- NULL
for(i in 1:length(unique(data.scores$endgroup))){
endgroup_coords <- data.scores[data.scores$endgroup == i,]
hull_coords <- data.frame(
endgroup_coords[chull(endgroup_coords[endgroup_coords$endgroup == i,]$DCA1,
endgroup_coords[endgroup_coords$endgroup == i,]$DCA2),])
hulls <- rbind(hulls,hull_coords)
}
data.scores$numbers <- 1:length(data.scores$endgroup)
regforamcountsall<-read_csv("regionalforamcountsallnocalcs.csv")
rownames(regforamcountsall)<-regforamcountsall$Sample
data.scores$Site<-regforamcountsall$Site
data.scores$SWLI<-regforamcountsall$SWLI
data.scores
#DCA with species
data.scores$Site <- as.character(data.scores$Site)
library(scico)
dca <- ggplot() +
# add the point markers
geom_point(data=data.scores,aes(x=DCA1,y=DCA2,colour=SWLI,pch=Site),size=4) + geom_point(data=species.scores,aes(x=DCA1,y=DCA2),size=3,pch=3,alpha=0.8,colour="grey22") +
# add the hulls and labels - numbers position labels
geom_polygon(data = hulls,aes(x=DCA1,y=DCA2,fill=endgroup), alpha = 0.25) +
#geom_text(data=data.scores,aes(x=DCA1-0.03,y=DCA2,colour=endgroup, label = numbers))+
geom_text(data=species.scores,aes(x=DCA1+0.1,y=DCA2+0.1, label = species))+
#look this up
geom_segment(data=ord.fit,aes(x = 0, y = 0, xend=DCA1,yend=DCA2), arrow = arrow(length = unit(0.3, "cm")))+
theme_classic()+
scale_color_scico(palette = "lapaz")+
coord_fixed()
dca
#regforamcountsall data
structure(list(Sample = c("T3LB7.008", "T3LB7.18", "T3LB7.303",
"WAP 0 ST-2", "T3LB7.5", "LG120"), T.salsa = c(86.63793102, 68.5897436,
70.39274924, 5.199999999, 79.15057916, 44.40000001), H.wilberti = c(0,
0, 0, 0, 0.386100386, 9.399999998), Textularia = c(0, 0, 0, 0,
0, 0.4), T.irregularis = c(2.155172414, 10.25641026, 7.854984897,
0, 2.702702703, 0), P.ipohalina = c(0, 0, 0, 0, 0, 0), J.macrescens = c(4.741379311,
5.769230769, 4.833836859, 5.800000001, 8.108108107, 5.400000001
), T.inflata = c(6.465517244, 15.38461538, 16.918429, 83.2, 5.791505794,
40.4), S.lobata = c(0, 0, 0, 2.300000001, 0, 0), M.fusca = c(0,
0, 0, 3.499999999, 3.861003862, 0), A.agglutinans = c(0, 0, 0,
0, 0, 0), A.exiguus = c(0, 0, 0, 0, 0, 0), A.subcatenulatus = c(0,
0, 0, 0, 0, 0), P.hyperhalina = c(0, 0, 0, 0, 0, 0), SWLI = c(200,
197.799175, 194.497937, 192.034776, 191.746905, 190.397351),
Site = c("LSP", "LSP", "LSP", "WAP", "LSP", "LG")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
#data.scores
structure(list(DCA1 = c(-1.88587476921648, -1.58550534382589,
-1.59816311314591, -0.0851161831632892, -1.69080448670088, -1.14488987340879
), DCA2 = c(0.320139736602921, 0.226662031865046, 0.230912045301637,
-0.0531232712001122, 0.272143119753744, 0.0696939776869396),
DCA3 = c(-0.755595015095353, -0.721144380683279, -0.675071834919103,
0.402339366526422, -0.731006052784081, 0.00474996849420783
), DCA4 = c(-1.10780013276303, -0.924265835490466, -0.957711953532202,
-0.434438970032073, -0.957873836258657, -0.508347000558056
), endgroup = structure(c(1L, 1L, 1L, 2L, 1L, 1L), .Label = c("1",
"2", "3"), class = "factor"), numbers = 1:6, Site = c("LSP",
"LSP", "LSP", "WAP", "LSP", "LG"), SWLI = c(200, 197.799175,
194.497937, 192.034776, 191.746905, 190.397351)), row.names = c(NA,
6L), class = "data.frame")
#species.scores
structure(list(species = c("1", "2", "3", "4", "5", "6"), DCA1 = c(-2.13,
-1.6996, -2.0172, -0.9689, 1.0372, -0.3224), DCA2 = c(0.342,
-0.8114, 0.3467, -0.3454, 2.0007, 0.9147)), row.names = c(NA,
6L), class = "data.frame")

if_else with haven_labelled column fails because of wrong class

I have the following data:
dat <- structure(list(value = structure(c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
label = "value: This is my label",
labels = c(`No` = 0, `Yes` = 1),
class = "haven_labelled"),
group = structure(c(1, 2, 1, 1, 2, 3, 3, 1, 3, 1, 3, 3, 1, 2, 3, 2, 1, 3, 3, 1),
label = "my group",
labels = c(first = 1, second = 2, third = 3),
class = "haven_labelled")),
row.names = c(NA, -20L),
class = c("tbl_df", "tbl", "data.frame"),
label = "test.sav")
As you can see, the data uses a special class from tidyverse's haven package, so called labelled columns.
Now I want to recode my initial value variable such that:
if group equals 1, value should stay the same, otherwise it should be missing
I was trying the following, but getting an error:
dat_new <- dat %>%
mutate(value = if_else(group != 1, NA, value))
# Error: `false` must be a logical vector, not a `haven_labelled` object
I got so far as to understand that if_else from dplyr requires the true and false checks in the if_else command to be of same class and since there is no NA equivalent for class labelled (e.g. similar to NA_real_ for doubles), the code probably fails, right?
So, how can I recode my inital variables and preserve the labels?
I know I could change my code above and replace the if_else by R's base version ifelse. However, this deletes all labels and coerces the value column to a numeric one.
You can try dplyr::case_when for cases where group == 1. If no cases are matched, NA is returned:
dat %>% mutate(value = case_when(group == 1 ~ value))
You can create an NA value in the haven_labelled class with this ugly code:
haven::labelled(NA_real_, labels = attr(dat$value, "labels"))
I'd recommend writing a function for that, e.g.
labelled_NA <- function(value)
haven::labelled(NA_real_, labels = attr(value, "labels"))
and then the code for your mutate isn't quite so ugly:
dat_new <- dat %>%
mutate(value = if_else(group != labelled_NA(value), value))
Then you get
> dat_new[1:5,]
# A tibble: 5 x 2
value group
<dbl+lbl> <dbl+lbl>
1 NA 1 [first]
2 NA 2 [second]
3 0 [No] 1 [first]
4 0 [No] 1 [first]
5 NA 2 [second]

Is there a way to count occurrences of a specific value for unique columns in a dataframe in R?

I am relatively new to R and have a dataframe (cn_data2) with several duplicated columns. It looks something like this:
Gene breast_cancer breast_cancer breast_cancer lung_cancer lung_cancer
myc 1 0 1 1 2
ARID1A 0 2 1 1 0
Essentially, the rows are genes and the columns are different types of cancers. What I want is to find for each gene the number of times, a value (0,1,or 2) occurs for each unique cancer type.
I have tried several things but haven't been able to achieve what I want. For example, cn_data2$count1 <- rowSums(cn_data == '1') gives me a column with the number of "1" for each gene but what I want the number of "1" for each individual disease.
Hope my question is clear!I appreciate any help, thank you!
structure(list(gene1 = structure(1:6, .Label = c("ACAP3", "ACTRT2",
"AGRN", "ANKRD65", "ATAD3A", "ATAD3B"), class = "factor"), glioblastoma_multiforme_Primary_Tumor = c(0,
0, 0, 0, 0, 0), glioblastoma_multiforme_Primary_Tumor.1 = c(-1,
-1, -1, -1, -1, -1), glioblastoma_multiforme_Primary_Tumor.2 = c(0,
0, 0, 0, 0, 0), glioblastoma_multiforme_Primary_Tumor.3 = c(2,
2, 2, 2, 2, 2), glioblastoma_multiforme_Primary_Tumor.4 = c(0,
0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, 6L))

Comparing vertices with specific attributes

I tried finding a solution, but since I am not that familiar with R, I'm not sure if I used the best key words searching.
I have an igraph where vertices have attributes(positions, wealth) and I'm trying to compare the wealth of those vertices that have positions == "Manager".
Edit
I'm not only comparing the wealth but also another attribute: constraint. Also I tried to do the make this reproducible:
library(igraph)
M <- matrix(c( 0, 1, 0, 0, 0,
0, 0, 1, 0, 0,
1, 1, 0, 0, 1,
0, 1, 0, 0, 0,
0, 1, 1, 0, 0), nrow = 5, byrow=TRUE)
g <- graph.adjacency(M, mode = "undirected")
V(g)$position <- c("Manager", "Manager", "Other", "Other", "Other")
V(g)$wealth <- c("12", "16", "16", "4", "29")
V(g)$constraint <- constraint(g)
What I want to do is to see a table with the wealth and constraint of the Managers only.
Edit 2
#G5W offered this solution which works perfectly:
cbind(V(g)$wealth, V(g)$constraint)[V(g)$position == "Manager"]
I think I understand what you're asking. For this sort of thing, I prefer to use the dplyr package (as part of the tidyverse) because it is usually followed with further wrangling.
Let's say that your data is stored in the dataframe df. We can then do the following:
df %>%
filter(position == "Manager")
This returns all Manager entries.
Alternatively, using the base package, you can use
df[df$position == "Manager",]
I should add that I'm not familiar with igraph and so for a better answer, sample data should be provided.

How to order an object of type "structure" based on names(c)?

I have an object that looks like this:
structure(c(0, 2, 0, 3, 5, 0), .Names = c("6", "1", "3", "4", "2", "5" ))
I need the values of this object to be in the order indicated by the names, if they would be integers and not characters as they are now. The object should be (2, 5, 0, 3, 0, 0) i don't mind it to be a vector or a matrix with row names but I simply couldn't order this object.
Thanks
You have a named vector. Function structure is a convenient way of packing your objects for distribution. Notice that you're missing a comma before .Names.
x <- structure(c(0, 2, 0, 3, 5, 0), .Names = c("6", "1", "3", "4", "2", "5" ))
right.order <- order(as.numeric(names(x)))
x[right.order]
1 2 3 4 5 6
2 5 0 3 0 0

Resources