Why is my use of predict in r not working - should be very simple case - r

So I'm having some trouble getting r to actually predict given a very, very simple linear model. Using the following,
> x=1:10
> y=1:10*2
> lm(y~x)
we get the corect answer, namely, y is twice x. But when I do,
predict(lm(y~x),newdata=2.5)
I just get
1 2 3 4 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
What is going on?

The newdata parameter should be a data.frame with column names matching the names used as covariates. So the correct case is
predict(lm(x~y),newdata=data.frame(y=2.5))
or
predict(lm(y~x),newdata=data.frame(x=2.5))
depending on which way you wanted to do the regression.

Related

How do I change the order of multiple grouped values in a row dependent on another variable in that row in R?

I need some help conditionally sorting/switching data based on a factor variable.
I'm not sure if it's a typical use case I just can't formulate properly enough for a search engine to show me a solution or if it is that niche but I haven't found anything yet.
I currently have a dataframe like this:
id group a1 a2 a3 a4 b1 b2 b3 b4
1 1 2 6 6 3 4 4 6 4
2 2 5 2 2 2 2 5 2 3
3 1 6 3 3 1 3 6 4 1
4 1 4 8 4 2 7 8 8 9
5 2 3 1 1 4 2 1 1 7
For context this is from a psychological experiment where people went through two variations of a task and the order of those conditions was determined by the experimental group they were assigned to. The columns represent different measurements from different trials and are currently grouped together for the same variable and in chronological order, meaning a1,a2,a3,a4 are essentially the same variable at consecutive time points, same with b1,b2,b3,b4.
I want to split them up for the different conditions so regardless of which group (=which order of tasks) someone went through, data from one condition should come first in the dataframe and columns should still be grouped together for the same variables and in chronological order within that condition. It should essentially look like this:
id group c1a1 c1a2 c2a1 c2a2 c1b1 c1b2 c2b1 c2b2
1 1 2 6 6 3 4 4 6 4
2 2 2 2 5 2 2 3 2 5
3 1 6 3 3 1 3 6 4 1
4 1 4 8 4 2 7 8 8 9
5 2 1 4 3 1 1 7 2 1
So essentially for group 1 everything stays the same since they happened to go through the conditions in the same order that I want to have in the new dataframe while for group 2 values are being switched where the originally second half of values for each variable is put in front of the originally first one.
I hope I formulated the problem in a way, people can understand it.
My real dataset is a bit more complicated it has 180 columns minus id and group so 178.
I have 13 variables some of which were measured over two conditions with 5 trials for each of those and some which have those 5 trials for each of the 2 main condition but which also have 2 adittional measurements for each condition where the order was determined by the same group variable.
(We essentially asked participants to do the task again in two certain ways, which allowed us to see if they were capable of doing them like that if they wanted to under the circumstences of both main conditions).
So there are an adittional 4 columns for some variables which need to be treated seperately. It should look like this when transformed (x and y are the 2 extra tasks where only b was measured once):
id group c1a1 c1a2 c2a1 c2a2 c1b1 c1b2 c1bx c1by c2b1 c2b2 c2bx c2by
1 1 2 6 6 3 4 4 3 7 6 4 4 2
2 2 2 2 5 2 2 3 4 3 2 5 2 2
3 1 6 3 3 1 3 6 2 2 4 1 1 1
4 1 4 8 4 2 7 8 1 1 8 9 5 8
5 2 1 4 3 1 1 7 8 9 2 1 3 4
What I want to say with this is, I need a pretty general solution.
I already tried formulating a function for creation of two seperate datasets for the groups and then merging them by id but got stuck with the automatic creation and naming of columns which I can't seem to wrap my head around. dplyr is currently loaded and used for some other transformations but since I'm not really good with it, I need to ask for your help regarding a solution with or without it. I'm still pretty new to R and this is for my bachelor thesis.
Thanks in advance!
Your question leaves a few things unclear that make this hard to answer, but here is maybe a start that could help, or at least help clarify your problem.
It would really help if you could clarify 2 pieces of info, what types of column rearrangements you need, and how you distinguish what indicates that a row needs to have this transformation.
I'm also wondering if instead of trying to manipulate your data in its current shape, if it not might be more practical to figure out how to change the shape of your data to better represent your data, perhaps using something like pivot_longer(), I don't know how this data will ultimately be used or what the actual values indicate, but it doesn't seem to be very tidy in its current form, and instead having a "longer" table might be more meaningful, but I'll still provide what I think is a solution to your listed problem.
This creates some example data that looks like it reflects yours in the example table.
ID=seq(1:10)
group=sample(1:2,10,replace=T)
Data=matrix(sample(1:10,80,replace=T),nrow=10,ncol=8)
DataFrame=data.frame('ID'=ID,'Group'=group,Data)
You then define the groups of columns that need to be kept together. I can't tell if there is an automated way for you to indicate which columns are grouped, but this might get bulky if done manually. Some more information on what your column names actually are, and how they are distributed in groups would help.
ColumnGroups=list('One'=c('X1','X2'),'Two'=c('X3','X4'),'Three'=c('X5','X6'),'Four'=c('X7','X8'))
You can then figure out which rows need to have rearranged done by using some conditional. Based on your example, I'm assuming when the group variable equals 2, then the rearranging needs to be done, which is what I've used here.
FlipRows=DataFrame$Group==2
You can then have R only apply the rearrangement needed to those rows that need it, and define the rearrangement based on the ordering of the different column groups. I know you ask for a general solution, but is hard to identify the general solution you need without knowing what types of column rearrangements you need. If it is always flipping two sets of consecutive column groups, that would be easier to define without having to type it all out. What I have done here would require you to manually type out the order of the different column groups that you would like the rows to be rearranged as. The SortedDataFrame object seems to be what you are looking for, but might not actually reflect your real data. I removed columns 1 and 2 in this operation since those are ID and group which you don't want overridden.
SortedDataFrame=DataFrame
SortedDataFrame[FlipRows,-c(1,2)]=DataFrame[FlipRows,c(ColumnGroups$Two,ColumnGroups$One,ColumnGroups$Four,ColumnGroups$Three)]
This solution won't work if you need to rearrange each row differently, but it is unclear if that is the case. Try to provide any of the other info requested here, and let me know where this solution doesn't work for you, and that.

sandwich + mlogit: `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

I am trying to compute robust/cluster standard errors after using mlogit() to fit a Multinomial Logit (MNL) in a Discrete Choice problem. Unfortunately, I suspect I am having problems with it because I am using data in long format (this is a must in my case), and getting the error #Error in ef/X : non-conformable arrays after sandwich::vcovHC( , "HC0").
The Data
For illustration, please gently consider the following data. It represents data from 5 individuals (id_ind ) that choose among 3 alternatives (altern). Each of the five individuals chose three times; hence we have 15 choice situations (id_choice). Each alternative is represented by two generic attributes (x1 and x2), and the choices are registered in y (1 if selected, 0 otherwise).
df <- read.table(header = TRUE, text = "
id_ind id_choice altern x1 x2 y
1 1 1 1 1.586788801 0.11887832 1
2 1 1 2 -0.937965347 1.15742493 0
3 1 1 3 -0.511504401 -1.90667519 0
4 1 2 1 1.079365680 -0.37267925 0
5 1 2 2 -0.009203032 1.65150370 1
6 1 2 3 0.870474033 -0.82558651 0
7 1 3 1 -0.638604013 -0.09459502 0
8 1 3 2 -0.071679538 1.56879334 0
9 1 3 3 0.398263302 1.45735788 1
10 2 4 1 0.291413453 -0.09107974 0
11 2 4 2 1.632831160 0.92925495 0
12 2 4 3 -1.193272276 0.77092623 1
13 2 5 1 1.967624379 -0.16373709 1
14 2 5 2 -0.479859282 -0.67042130 0
15 2 5 3 1.109780885 0.60348187 0
16 2 6 1 -0.025834772 -0.44004183 0
17 2 6 2 -1.255129594 1.10928280 0
18 2 6 3 1.309493274 1.84247199 1
19 3 7 1 1.593558740 -0.08952151 0
20 3 7 2 1.778701074 1.44483791 1
21 3 7 3 0.643191170 -0.24761157 0
22 3 8 1 1.738820924 -0.96793288 0
23 3 8 2 -1.151429915 -0.08581901 0
24 3 8 3 0.606695064 1.06524268 1
25 3 9 1 0.673866953 -0.26136206 0
26 3 9 2 1.176959443 0.85005871 1
27 3 9 3 -1.568225496 -0.40002252 0
28 4 10 1 0.516456176 -1.02081089 1
29 4 10 2 -1.752854918 -1.71728381 0
30 4 10 3 -1.176101700 -1.60213536 0
31 4 11 1 -1.497779616 -1.66301234 0
32 4 11 2 -0.931117325 1.50128532 1
33 4 11 3 -0.455543630 -0.64370825 0
34 4 12 1 0.894843784 -0.69859139 0
35 4 12 2 -0.354902281 1.02834859 0
36 4 12 3 1.283785176 -1.18923098 1
37 5 13 1 -1.293772990 -0.73491317 0
38 5 13 2 0.748091387 0.07453705 1
39 5 13 3 -0.463585127 0.64802031 0
40 5 14 1 -1.946438667 1.35776140 0
41 5 14 2 -0.470448172 -0.61326604 1
42 5 14 3 1.478763383 -0.66490028 0
43 5 15 1 0.588240775 0.84448489 1
44 5 15 2 1.131731049 -1.51323232 0
45 5 15 3 0.212145247 -1.01804594 0
")
The problem
Consequently, we can fit an MNL using mlogit() and extract their robust variance-covariance as follows:
library(mlogit)
library(sandwich)
mo <- mlogit(formula = y ~ x1 + x2|0 ,
method ="nr",
data = df,
idx = c("id_choice", "altern"))
sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays
As we can see there is an error produced by sandwich::vcovHC, which says that ef/X is non-conformable. Where X <- model.matrix(x) and ef <- estfun(x, ...). After looking through the source code on the mirror on GitHub I spot the problem which comes from the fact that, given that the data is in long format, ef has dimensions 15 x 2 and X has 45 x 2.
My workaround
Given that the show must continue, I am computing the robust and cluster standard errors manually using some functions that I borrow from sandwich and I adjusted to accommodate the Stata's output.
> Robust Standard Errors
These lines are inspired on the sandwich::meat() function.
psi<- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <- (n/(n-1))* crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)
# x1 x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662
Stata Equivalent
qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .23050262
y:x2 .09840356 .12765662
> Clustered Standard Errors
Here, given that each individual answers 3 questions is highly likely that there is some degree of correlation among individuals; hence cluster corrections should be preferred in such situations. Below I compute the cluster correction in this case and I show the equivalence with the Stata output of clogit , cluster().
id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice,)]
psi_2 <- rowsum(psi, group = id_ind_collapsed )
k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <- (n_cluster/(n_cluster-1))* crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)
# x1 x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004
Stata equivalent
qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
y: y:
x1 x2
y:x1 .17667075
y:x2 .1007703 .11800038
The Question:
I would like to accommodate my computations within the sandwich ecosystem, meaning not computing the matrices manually but actually using the sandwich functions. Is it possible to make it work with models in long format like the one described here? For example, providing the meat and bread objects directly to perform the computations? Thanks in advance.
PS: I noted that there is a dedicated bread function in sandwich for mlogit, but I could not spot something like meat for mlogit, but anyways I am probably missing something here...
Why vcovHC does not work for mlogit
The class of HC covariance estimators can just be applied in models with a single linear predictor where the score function aka estimating function is the product of so-called "working residuals" and a regressor matrix. This is explained in some detail in the Zeileis (2006) paper (see Equation 7), provided as vignette("sandwich-OOP", package = "sandwich") in the package. The ?vcovHC also pointed to this but did not explain it very well. I have improved this in the documentation at http://sandwich.R-Forge.R-project.org/reference/vcovHC.html now:
The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).
This means that vcovHC() is not applicable to multinomial logit models as they generally use separate linear predictors for the separate response categories. Similarly, two-part or hurdle models etc. are not supported.
Basic "robust" sandwich covariance
Generally, for computing the basic Eicker-Huber-White sandwich covariance matrix estimator, the best strategy is to use the sandwich() function and not the vcovHC() function. The former works for any model with estfun() and bread() methods.
For linear models sandwich(..., adjust = FALSE) (default) and sandwich(..., adjust = TRUE) correspond to HC0 and HC1, respectively. In a model with n observations and k regression coefficients the former standardizes with 1/n and the latter with 1/(n-k).
Stata, however, divides by 1/(n-1) in logit models, see:
Different Robust Standard Errors of Logit Regression in Stata and R. To the best of my knowledge there is no clear theoretical reason for using specifically one or the other adjustment. And already in moderately large samples, this makes no difference anyway.
Remark: The adjustment with 1/(n-1) is not directly available in sandwich() as an option. However, coincidentally, it is the default in vcovCL() without specifying a cluster variable (i.e., treating each observation as a separate cluster). So this is a convenient "trick" if you want to get exactly the same results as Stata.
Clustered covariance
This can be computed "as usual" via vcovCL(..., cluster = ...). For mlogit models you just have to consider that the cluster variable just needs to be provided once (as opposed to stacked several times in long format).
Replicating Stata results
With the data and model from your post:
vcovCL(mo)
## x1 x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
## x1 x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

Strings in datatable (imported from database) get coerced into integers? [duplicate]

This question already has answers here:
How to convert a factor to integer\numeric without loss of information?
(12 answers)
Closed 6 years ago.
I've imported a test file and tried to make a histogram
pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t")
hist <- as.numeric(pichman$WS)
However, I get different numbers from values in my dataset. Originally I thought that this because I had text, so I deleted the text:
table(pichman$WS)
ws <- pichman$WS[pichman$WS!="Down" & pichman$WS!="NoData"]
However, I am still getting very high numbers does anyone have an idea?
I suspect you are having a problem with factors. For example,
> x = factor(4:8)
> x
[1] 4 5 6 7 8
Levels: 4 5 6 7 8
> as.numeric(x)
[1] 1 2 3 4 5
> as.numeric(as.character(x))
[1] 4 5 6 7 8
Some comments:
You mention that your vector contains the characters "Down" and "NoData". What do expect/want as.numeric to do with these values?
In read.csv, try using the argument stringsAsFactors=FALSE
Are you sure it's sep="/t and not sep="\t"
Use the command head(pitchman) to check the first fews rows of your data
Also, it's very tricky to guess what your problem is when you don't provide data. A minimal working example is always preferable. For example, I can't run the command pichman <- read.csv(file="picman.txt", header=TRUE, sep="/t") since I don't have access to the data set.
As csgillespie said. stringsAsFactors is default on TRUE, which converts any text to a factor. So even after deleting the text, you still have a factor in your dataframe.
Now regarding the conversion, there's a more optimal way to do so. So I put it here as a reference :
> x <- factor(sample(4:8,10,replace=T))
> x
[1] 6 4 8 6 7 6 8 5 8 4
Levels: 4 5 6 7 8
> as.numeric(levels(x))[x]
[1] 6 4 8 6 7 6 8 5 8 4
To show it works.
The timings :
> x <- factor(sample(4:8,500000,replace=T))
> system.time(as.numeric(as.character(x)))
user system elapsed
0.11 0.00 0.11
> system.time(as.numeric(levels(x))[x])
user system elapsed
0 0 0
It's a big improvement, but not always a bottleneck. It gets important however if you have a big dataframe and a lot of columns to convert.

Plot empty groups in boxplot

I want to plot a lot of boxplots in on particular style to compare them.
But when a group is empty the group "isn't plotted".
lets say I have a dataframe:
a b
1 1 5
2 1 4
3 1 6
4 1 4
5 2 9
6 2 8
7 2 9
8 3 NaN
9 3 NaN
10 3 NaN
11 4 2
12 4 8
and I use boxplot to plot it:
boxplot(b ~ a , df)
than I get the plot without group 3
(which I can't show because I did not have "10 reputation")
I found some solutions for removing empty groups via Google but my problem is the other way around.
And I found the solution via at=c(1,2,4) but as I generate an Rscript with python and different groups are empty I would prefer, that the groups aren't dropped at all.
Oh I don't think I have the time to grapple with additional packages.
Therefore I would be thankful for solutions without them.
You can get the group on the x-axis by
boxplot(b ~ a , df, na.action=na.pass)
Or
boxplot(b~factor(a), df)

loop over columns with semi like columnnames

I have the following variable and dataframe
welltypes <- c("LC","HC")
qccast <- data.frame(
LC_mean=1:10,
HC_mean=10:1,
BC_mean=rep(0,10)
)
Now I only want to see the welltypes I selected(in this case LC and HC, but it could also be different ones.)
for(i in 1:length(welltypes)){
qccast$welltypes[i]_mean
}
This does not work, I know.
But how do i loop over those columns?
And it has to happen variable wise, because welltypes is of an unkown size.
The second argument to $ needs to be a column name of the first argument. I haven't run the code, but I would expect welltypes[i]_mean to be a syntax error. $ is similar to [[, so you can use paste to create the column name string and subset via [[.
For example:
qccast[[paste(welltypes[i],"_mean",sep="")]]
Depending on the rest of your code, you may be able to do something like this instead.
for(i in paste(welltypes,"_mean",sep="")){
qccast[[i]]
}
Here's another strategy:
qccast[ sapply(welltypes, grep, names(qccast)) ]
LC_mean HC_mean
1 1 10
2 2 9
3 3 8
4 4 7
5 5 6
6 6 5
7 7 4
8 8 3
9 9 2
10 10 1
Another easy way to access given welltypes
qccast[,paste(welltypes, '_mean', sep = "")]

Resources