Using 'for' to analyze multiple cross sections of a matrix in R

I have a dataset of about 144 entries and 93 variables, where each row corresponds to a municipality and the columns hold yearly measurements of environmental data (e.g. temperature, vegetated area, rainfall). The variables are split by year, so I have one column named rainfall_2004, another named rainfall_2005, and so on; the entire dataset spans 10 years.
I wanted to develop a script that creates a GLM for each municipality in each year. Luckily, I found Zuur's book, "Mixed Effects Models and Extensions in Ecology with R", which provides such code in one of its examples. I tried adapting it to my dataset, but something went wrong. My knowledge of R is a bit limited, so I'm probably missing something, but I can't quite find it.
Here's Zuur's code:
library(AED); data(RIKZ)
Beta <- vector(length = 9)
for (i in 1:9) {
  Mi <- summary(lm(Richness ~ NAP, subset = (Beach == i), data = RIKZ))
  Beta[i] <- Mi$coefficients[2, 1]
}
Now here's mine:
count <- dados_ampliados[, 1]
View(count)
for (i in count) {
  RA <- summary(glm(dados_ampliados$infect_2004 ~ dados_ampliados$mmax_2004 +
                      dados_ampliados$mmin_2004 +
                      dados_ampliados$mprec_2004 +
                      dados_ampliados$mumid_2004 +
                      dados_ampliados$prop_for_2004 +
                      dados_ampliados$prop_urb_2004 +
                      dados_ampliados$prod_2004,
                    family = poisson(),
                    subset = (dados_ampliados$Geocode == i),
                    data = dados_ampliados))
  count[i] <- RA$coefficients[2, 1]
}
Yet my code returns:
Error in `[<-.data.frame`(`*tmp*`, i, value = 0.357095537720183) :
new columns would leave holes after existing columns
Any ideas as to why this is happening? Thanks in advance.
Some observations:
The file used in this code can be obtained here (a WeTransfer link, so it won't last forever).
In his text, Zuur explains that he's creating that model to analyze data on 9 different beaches. In his code, he compares each value of the 1:9 vector to the beach value, so I'm assuming the beaches aren't named but numbered. For each value of the vector, he models the corresponding beach. My data, however, isn't organized like that: the municipalities are identified by geocodes provided by the Brazilian Institute of Geography and Statistics. My adaptation therefore consisted of creating a vector of 144 entries, one per row, populated with the municipalities' geocodes. This and replacing lm with glm were my main adaptations.
For troubleshooting, I already tried changing the indices in RA$coefficients[2, 1] to [1, 1] or [1, 2]. The error remained.
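For what it's worth, the error itself comes from count[i] <- RA$coefficients[2, 1]: judging by the error message, count is still a one-column data frame (as happens when dados_ampliados is a tibble, for example), and i is a geocode value, so R is being asked to create a column at that huge position, leaving holes after the existing columns. Note also that Zuur's loop works because each beach contributes several rows; if, as described, each municipality occupies a single row, a per-municipality regression has nothing to fit. A closer analogue may be one GLM per year across all municipalities; below is a minimal sketch, assuming every predictor exists for each year from 2004 to 2013:

years <- 2004:2013
betas <- setNames(numeric(length(years)), years)
for (j in seq_along(years)) {
  # build e.g. infect_2004 ~ mmax_2004 + mmin_2004 + ... for each year
  f <- reformulate(paste0(c("mmax_", "mmin_", "mprec_", "mumid_",
                            "prop_for_", "prop_urb_", "prod_"), years[j]),
                   response = paste0("infect_", years[j]))
  RA <- summary(glm(f, family = poisson(), data = dados_ampliados))
  betas[j] <- RA$coefficients[2, 1]  # slope of the first predictor, as in Zuur
}
betas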

Related

DEA analysis: variables are excluded from the analysis?

I'm working on a DEA (Data Envelopment Analysis) to analyze the relative efficiencies of different banks.
The packages I'm using are rDEA, dplyr and kableExtra.
What this analysis does is measure the relative effect of the input and output variables I use to examine each individual bank's efficiency.
The problem is that my code only seems to use two of the four output variables, and I can't find anywhere in the code where I ask it to do so.
Can some of you identify the problem?
Thank you in advance!
I have tried formatting the data in several different ways and assigning the created inp_var and out_var as matrices.
# install.packages('rDEA')
# install.packages('dplyr')
# install.packages('kableExtra')
library(kableExtra)
library(rDEA)
library(dplyr)

dea <- tbl_df(PANELDATA)
head(dea)

inp_var <- select(dea, 'IE', 'NIE')
out_var <- select(dea, 'L', 'D', 'II', 'NII')
inp_var <- as.matrix(inp_var)
out_var <- as.matrix(out_var)

model <- dea(XREF = inp_var, YREF = out_var, X = inp_var, Y = out_var,
             model = "output", RTS = "constant")
model
I want a number between 0 and 1 for every observation, where the most efficient bank receives a 1. What I get now is the same result whether or not I include the two extra output variables L and II.
L stands for loans to the public and II for interest income; it would be odd if these variables had no effect on the banks' efficiency.
I think you could type this:
result <- cbind(round(model$thetaOpt, 3), round(model$lambda, 3))
rownames(result) <- dea[[1]]
colnames(result) <- c("Efficiency", rownames(result))
kable(result[, ])
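Not part of the original answer, but two quick sanity checks may help here: confirm that all four output columns actually reach dea(), and refit with only two outputs to see whether the scores really are identical. (All else equal, dropping outputs can only lower or leave unchanged a bank's efficiency score, so identical scores can legitimately happen when the dropped outputs are never binding.)

dim(out_var)       # expect n rows and 4 columns
colnames(out_var)  # expect "L" "D" "II" "NII"

# refit using only two of the outputs and compare the efficiency scores
model2 <- dea(XREF = inp_var, YREF = out_var[, c("D", "NII")],
              X = inp_var, Y = out_var[, c("D", "NII")],
              model = "output", RTS = "constant")
all.equal(model$thetaOpt, model2$thetaOpt)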

Iteration / maximization with the Excel Solver in R

I am trying to do a maximization in R that I have previously done in Excel with the Solver. The problem is that I don't know how to approach it (my level in R isn't great).
Let's talk a bit about my data. I have the 26 Swiss cantons and the Swiss government (which is the sum of the values of the 26 cantons) with their population and their "wealth". So I have 27 observations per variable. I'm not sure the following descriptions are useful, but I'll give them anyway. From this, I calculate some variables with while loops. For each canton [i]:
resource potential = mean(wealth2011[i], wealth2012[i], wealth2013[i])
population mean = mean(population2011[i], population2012[i], population2013[i])
resource potential per capita = 1000 * resource potential[i] / population[i]
resource index = 100 * resource potential per capita[i] / resource potential per capita[Swiss government]
Here's a little example of the kind of loop I used:
RI <- numeric(27)
i <- 1
while (i < 28) {
  # resource_potential_capita: the per-capita values defined above
  RI[i] <- resource_potential_capita[i] / resource_potential_capita[27] * 100
  i <- i + 1
}
The resource index (RI) for the Swiss government (i = 27) is 100, because we divide the Swiss government's resource potential per capita by itself and multiply by 100. Hence, all cantons with RI > 100 are rich cantons and the others (RI < 100) are poor cantons. Up to here there was no problem; I've just explained how I built my dataset.
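(As an aside, the whole loop above collapses to a single vectorized line, assuming resource_potential_capita is the length-27 vector from before:)

RI <- resource_potential_capita / resource_potential_capita[27] * 100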
Now the problem that I face: I have to create the variable weighted difference (wd). It takes the value:
0 if RI[i] > 100 (rich canton)
(100 - RI[i])^(1 + p) * Pop[i] if RI[i] < 100 (poor canton)
I create this variable like this (sorry for the clumsiness of the code; I did my best):
wd <- -1
i <- 1
a <- 0
c <- 0
tot <- 0
while (i < 28) {
  if (i == 27) {
    wd[i] <- a                      # the government entry gets the running total
  } else if (RI[i] < 100) {
    wd[i] <- (100 - RI[i])^(1 + p) * Pop[i]
    c <- wd[i]
    a <- a + c
  } else {
    wd[i] <- 0
  }
  i <- i + 1
}
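The same logic can also be written without a loop, which makes it easy to wrap in a function of p later (a sketch, assuming RI and Pop are length-27 vectors and p is defined):

w <- ifelse(RI[1:26] < 100, (100 - RI[1:26])^(1 + p) * Pop[1:26], 0)
wd <- c(w, sum(w))  # element 27 holds the total over the poor cantons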
However, I don't know the value of p. It is a value between 0 and 1. To find it, I have to do a maximization with the following ingredients:
RI_26 = 65.9, the minimum of RI in my data
RI_min = 100 - ((x * wd[27]) / ((1 + p) * z * 100))^(1/p), where x and z are fixed values (x = 8,677; z = 4,075,977,077) and wd[27] is the sum of wd over all cantons.
We have p in two equations, RI_min and wd. To solve this in Excel, I used the Solver with the following setup:
p_dot = (RI_26 / RI_min) * p, i.e. p_dot = [65.9 / (100 - ((x * wd[27]) / ((1 + p) * z * 100))^(1/p))] * p
RI_26 = RI_min, i.e. 65.9 = 100 - ((x * wd[27]) / ((1 + p) * z * 100))^(1/p)
In Excel, p is my variable cell (the only value allowed to change), p_dot is the objective, and RI_26 = RI_min is the constraint.
So I would like to reproduce this maximization in R, and I don't know how. My main problem is the presence of p in both RI_min and wd: solving this needs some kind of iteration, which is beyond my skills.
Is anyone able to help me with the information I provided?
You should look into the optim function.
Here I will try to give you a really simple explanation, since you said you don't have a very good level in R.
Assume I have a function f(x) that I want to maximize, i.e. I want to find the parameter x that gives the maximum value of f(x).
The first thing to do is to define the function; in R you can do this with:
myfunction<- function(x) {...}
Having defined the function I can optimize it with the command:
optim(par,myfunction)
where par is the vector of initial parameter values and myfunction is the function to be optimized. Bear in mind that optim performs minimization by default; however, it will maximize if control$fnscale is negative. Another strategy is to change the function (i.e. flip its sign) to suit the problem.
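For instance (my toy example, not part of the original problem), maximizing f(x) = -(x - 2)^2 + 5:

# toy function with a known maximum at x = 2
f <- function(x) -(x - 2)^2 + 5
# fnscale = -1 flips optim from minimizing to maximizing;
# method = "Brent" is the usual choice for one-dimensional problems
res <- optim(par = 0, fn = f, method = "Brent", lower = -10, upper = 10,
             control = list(fnscale = -1))
res$par     # ~ 2, the maximizer
f(res$par)  # ~ 5, the maximum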
Hope that this helps,
Marco
From the description you provided, if I'm not mistaken, it looks like everything you need to solve is just an equation.
In particular you have the following two expressions:
RI_min = 100 - ((x * y) / ((1 + p) * z * 100))^(1/p)
where, since x, y and z are fixed, the only variable is p.
Moreover, setting RI_26 = RI_min yields:
65.9 = 100 - ((x * y) / ((1 + p) * z * 100))^(1/p)
Plugging in the values of x, y and z, this gives
p = 0.526639915936052
I don't understand what exactly you are trying to maximize.
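For completeness, here is how that equation could be checked numerically in R (my sketch; the value of wd[27] is not given in the question, so the y below is only a placeholder to make the code run):

x <- 8677
z <- 4075977077
y <- 1e9  # PLACEHOLDER for wd[27]; substitute the actual sum from your data

# RI_min(p) - RI_26 changes sign at the p that satisfies the constraint
g <- function(p) 100 - ((x * y) / ((1 + p) * z * 100))^(1 / p) - 65.9
uniroot(g, interval = c(0.1, 0.999))$root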

Need help applying regression model to dataset in R (sports data)

Update: Solved!
I'm currently trying to create a regression model for football that predicts a team's total points based on its pass yards and rush yards. I was able to get all the way to figuring out the regression equation, but from there I don't know how to "plug in" the formula.
The data table is essentially all 32 NFL teams listed in rows, with their offensive stats in columns.
Code:
# 1. Import
Offense <- read.csv(file.choose(), header = TRUE)
# 2. View
show(Offense)
# 3. Attach so headers can be referenced
attach(Offense)
# 4. Create regression model
mod1 <- lm(Total.Points ~ Pass.Yds + Rush.Yds)
summary(mod1)
# Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
# Plug in the regression equation
predict(mod1)
Output: https://imgur.com/a/AbTNF
I see that at the end it applied the regression equation to all 32 rows, but how do I:
1. get it to display as a ranked list?
2. get it to display the team name as well as the projected score (so I don't have to wonder which team "1" or "2" refers to)?
3. Since I have the equation, could I also just write a loop that runs the equation over every row of data and prints the results? (See the sketch after the answer below.)
I'm a beginner so much appreciated!
Update: I came up with this:
####Part 2. Interpretation
#1. Examining quality of model
summary(mod1)
cor(Pass.Yds, Rush.Yds)
#2. Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
#3. Predicted Points (Descending Order)
proj <- sort(predict(mod1), decreasing = TRUE)
proj
#4. Corresponding Name (Descending)
name <- Team[order(predict(mod1), decreasing = TRUE)]
name
#Data Frame
Projections <- data.frame(name, proj)
Projections
While bbrot provided a much simpler version:
Assuming that Teams is the vector of team names, something like cbind(Teams[order(predict(mod1), decreasing = TRUE)], sort(predict(mod1), decreasing = TRUE)) should do...
Edit: Your Teams vector seems to be a factor. In this case, the following commands are going to work:
# returns a character matrix
cbind(as.character(Teams)[order(predict(mod1), decreasing = TRUE)],
      sort(predict(mod1), decreasing = TRUE))

# returns a data frame
data.frame(Teams = Teams[order(predict(mod1), decreasing = TRUE)],
           Points = sort(predict(mod1), decreasing = TRUE))
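On question 3 above: predict(mod1) already evaluates the fitted equation for every row, but for what it's worth, a manual version using the stored coefficients (a sketch, equivalent to the loop you describe) looks like this:

b <- coef(mod1)  # intercept and the two slopes from the summary
manual <- b[1] + b[2] * Offense$Pass.Yds + b[3] * Offense$Rush.Yds
# should match predict(mod1) up to floating-point noise
all.equal(unname(manual), unname(predict(mod1)))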

R qqp function - why is the 'perfect fit' a flat line at 0?

This may be more of a statistical question than a programming one. I just wanted to make sure I was getting the programming right first.
I have a large count dataset (108 sites with 31 species = 3348 observations), but a lot of these are 0 counts because not every species was present at every site. Log transformation has been suggested to me, but others have said that you shouldn't log-transform count data. Here is my data for the first 8 species (this also contains the very abundant species with the highest counts):
example.abund <- c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,
0,0,1,0,8,0,1,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,1,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,2,0,0,1,0,0,0,0,2,0,3,1,0,0,0,0,0,0,0,0,0,
2,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,
0,1,0,0,0,28,1,0,1,0,0,1,0,2,0,0,2,0,0,0,1,0,0,0,1,0,0,0,2,0,0,1,0,0,
0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,2,0,1,0,0,8,7,7,1,1,13,0,8,0,3,0,1,1,
1,4,4,0,1,0,1,0,0,0,0,6,5,2,0,2,58,4,2,47,4,0,0,0,2,59,2,0,0,6,1,36,28,2,
1,1,0,6,0,0,2,5,0,0,0,0,87,7,0,1,1,1,0,0,1,1,0,6,11,0,0,0,3,0,4,0,7,2,
0,5,0,4,1,0,1,12,0,2,0,9,0,1,0,0,0,24,0,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,3,1,0,1,0,1,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,1,0,0,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,15,0,2,
81,0,1,32,26,13,2,61,0,66,2,2,0,17,43,43,0,25,19,2,25,26,91,61,0,13,0,62,186,1,4,22,1,50,3,67,86,11,56,26,74,0,6,8,7,0,152,8,14,1,97,1,0,12,11,3,1,1,112,2,35,36,5,61,26,211,15,8,173,17,97,22,18,88,11,1,66,15,3,3,3,2,0,1,0,41,9,14,1,0,38,0,0,51,27,11,38,31,1,0,221,68,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,2,0,0,2,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2,29,0,0,0,0,
0,82,12,0,0,3,0,9,0,0,164,0,0,0,0,1,0,15,0,0,0,6,56,0,0,0,6,0,0,1,0,5,5,8,
0,4,0,0,6,0,0,2,0,0,3,0,0,0,0,683,0,0,0,0,3,149,252,11,13,195,19,0,59,0,0,1,28,0,
0,0,0,0,0,0,0,0,0,0,31,55,85,0,142,0,44,52,0,0,192,0,45,0,0,0,0,0,0,11,2,0,0,6,
0,0,0,0,0,0,0,0,0,0,0,0,0,19,3,0,0,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0)
I need to fit a mixed model to the data, but first I am trying to figure out the most appropriate distribution to use. I was following the steps in this blog, but all of the red lines (meant to represent the 'perfect fit' for that distribution) come up as 0 along the entire plot.
My question is: have I coded this correctly, and there are simply so many 0s in my data that the 'perfect fit' is 0? Or is there something wrong with the way I have coded it?
Code example:
library(car)   # qqp()
library(MASS)  # fitdistr()

# add 1 so that the families that can't handle 0s can use the data
example.abund.1 <- example.abund + 1

hist(example.abund)
qqp(example.abund, "norm")
qqp(example.abund.1, "lnorm")  # lognormal

# have to generate estimates of the parameters first:
nbinom <- fitdistr(example.abund.1, "Negative Binomial")
qqp(example.abund.1, "nbinom", size = nbinom$estimate[[1]], mu = nbinom$estimate[[2]])
poisson <- fitdistr(example.abund.1, "Poisson")
qqp(example.abund.1, "pois", poisson$estimate)
gamma <- fitdistr(example.abund.1, "gamma")
qqp(example.abund.1, "gamma", shape = gamma$estimate[[1]], rate = gamma$estimate[[2]])
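One way to answer the "is it my code or my zeros?" question (my suggestion, not from the original post) is to run the same qqp call on data simulated from the fitted negative binomial; if the simulated counts produce a sensible plot while the real ones don't, the code is fine and the distribution simply fits poorly:

set.seed(1)
sim <- rnbinom(length(example.abund.1), size = nbinom$estimate[[1]],
               mu = nbinom$estimate[[2]])
qqp(sim, "nbinom", size = nbinom$estimate[[1]], mu = nbinom$estimate[[2]])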

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm now trying to analyze the effects of covariates in my model. I find the help file for mlogit.effects() not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see index(mod1) below).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects() to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Responsive and Officials). Would I just model whatever they picked as their first rank and ignore what they chose in the other ranks? I can get that information.
Reproducible code is below:
# Load packages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)

# URL where the data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
# Get data
dat <- read.csv(dat.url)

# Complete cases only, as it seems mlogit cannot handle missing values or tied
# data, which in this case you might get because of median imputation
dat <- dat[complete.cases(dat), ]
# Change the choice index variable (X) to have no interruptions, as a result
# of removing some incomplete cases
dat$X <- seq(1, nrow(dat), 1)

# Tidy data to get it into long format
dat.out <- dat %>%
  gather(Open, Rank, -c(1, 9:12)) %>%
  arrange(X, Open, Rank)

# Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape = 'long', alt.var = 'Open',
                          choice = 'Rank', ranked = TRUE, chid.var = 'X')

# Fit model
mod1 <- mlogit(Rank ~ 1 | gender + age + economic + Job, data = mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that, although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare with the help file:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data, as in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
                           catch = tapply(catch, index(m)$alt, mean),
                           income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item 1st, relative to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling
log[ Pr(Item_k is 1st) / Pr(Item_K is 1st) ] = alpha_k + x^T beta_k
for k = 1, ..., K-1, where Item_k is one of the other (non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (The same goes for the reference levels of the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item is ranked 1st, I only need to throw out records where that information is missing. (E.g., in the original dataset, record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. The structure of ranked data in mlogit is that the first set of records for a person contains all the options, the second set contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time; hence all the NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91
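Presumably the last step then mirrors the help-file pattern above (my sketch, untested against this dataset; depending on the model, the other covariates may need to be added to z in the same way, e.g. a fixed level for each factor):

alt.ok <- index(mod1)$alt[complete.cases(index(mod1)$alt)]
z <- with(mlogit.out, data.frame(economic = tapply(economic, alt.ok, mean)))
# marginal effect of economic on the probability of ranking each item 1st
effects(mod1, covariate = "economic", data = z)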
