Output from fitted after loglm - r

I am trying to reproduce some log-linear modeling analysis from Agresti's Categorical Data Analysis (3rd ed.) [CDA] using the loglm function from the MASS package:
library(MASS)
# read in the data: http://www.stat.ufl.edu/~aa/cda/data.html
dfX = read.table(textConnection('a c m r g count
1 1 1 1 1 405
1 1 1 2 1 23
1 2 1 1 1 13
1 2 1 2 1 2
2 1 1 1 1 1
2 1 1 2 1 0
2 2 1 1 1 1
2 2 1 2 1 0
1 1 2 1 1 268
1 1 2 2 1 23
1 2 2 1 1 218
1 2 2 2 1 19
2 1 2 1 1 17
2 1 2 2 1 1
2 2 2 1 1 117
2 2 2 2 1 12
1 1 1 1 2 453
1 1 1 2 2 30
1 2 1 1 2 28
1 2 1 2 2 1
2 1 1 1 2 1
2 1 1 2 2 1
2 2 1 1 2 1
2 2 1 2 2 0
1 1 2 1 2 228
1 1 2 2 2 19
1 2 2 1 2 201
1 2 2 2 2 18
2 1 2 1 2 17
2 1 2 2 2 8
2 2 2 1 2 133
2 2 2 2 2 17'), header = TRUE)
llACM = loglm(count ~ c + a + m, data = dfX)
summary(llACM)
fitted(llACM)
But I am having difficulty understanding what the .Within. dimension means, and how I can get a predicted contingency table like the one given in CDA on page 323.

Really a long comment:
I ran your example and took a look at things like fitted.loglm, which consists of the following code:
{
    if (!is.null(object$fit))
        return(unclass(object$fit))
    cat("Re-fitting to get fitted values\n")
    unclass(update(object, fitted = TRUE, keep.frequencies = FALSE)$fitted)
}
update generates all the info that goes into object$fitted. I compared all sorts of data in the MASS example:
minn38a <- xtabs(f ~ ., minn38)
fm <- loglm(~ 1 + 2 + 3 + 4, minn38a)
And about the only guess I can make is that your data is not properly dimensioned, or that the fitting model adds a needed dimension, and this extra dimension is given the default name ".Within.". My suggestion would be to read the MASS book, or dig up info on glm fitting models. I agree that the explanations of the dimensions defined in the fitted dataset are somewhat lacking.
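A likely explanation, for what it's worth (this is a guess based on how MASS's loglm handles data frames, not something I have verified against CDA's exact table): when the formula does not use all of the classifying variables in the data frame (here r and g), the rows within each a x c x m cell are no longer uniquely classified, and loglm invents a .Within. factor to keep them distinct. Collapsing over the unused variables with xtabs before fitting avoids this. A sketch with synthetic counts standing in for the real data:

```r
library(MASS)

# Synthetic stand-in for the question's data: g varies within each
# a x c x m cell, so rows are not uniquely classified by the formula
set.seed(42)
df <- expand.grid(a = 1:2, c = 1:2, m = 1:2, g = 1:2)
df$count <- rpois(nrow(df), 50)

# Collapse over the unused variable first, then fit mutual independence
tab <- xtabs(count ~ a + c + m, data = df)
fit <- loglm(~ a + c + m, data = tab)
fitted(fit)  # a 2 x 2 x 2 array of expected counts, with no .Within.
```

For the question's data, the same idea would be tabACM <- xtabs(count ~ a + c + m, data = dfX) followed by loglm(~ a + c + m, data = tabACM); the fitted() array would then be the predicted contingency table.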

Related

discrete choice experiment data preparation for analysis using GMNL package

I conducted a discrete choice experiment using Google Forms and wrote the results to a CSV in Excel. I am having trouble understanding how to take the data from a standard CSV format to a format that I can analyse using the gmnl package.
I am using the data below, which has been dummy-coded:
personid choiceid alt payment management assessment crop
1 1 1 3 2 2 3
1 2 2 2 2 1 3
1 3 1 3 2 1 3
1 4 1 2 1 3 1
1 5 1 2 1 3 1
1 6 2 1 1 2 1
1 7 2 3 1 2 3
1 8 2 3 1 2 3
1 9 2 3 1 1 2
1 10 2 3 1 1 2
1 11 2 3 1 2 1
1 12 2 2 1 1 3
1 13 3 1 2 1 1
1 14 2 1 1 2 3
1 15 2 2 1 2 2
1 16 2 1 1 1 3
2 17 3 1 2 1 2
2 18 3 1 3 1 2
2 19 1 3 1 1 3
test <- as.data.frame(testchoices)
choices <- mlogit.data(test, shape = "long", idx = list(c("choiceid", "personid")),
idnames = c("management", "crops", "assessment", "price"))
write_csv(choices, "choicesnext.csv")
It works fine up to write_csv, where this error is thrown:
Error in `[.data.frame`(x, start:min(NROW(x), start + len)) : undefined columns selected
I would be grateful for any assistance

Hierarchical Clustering produces list instead of hclust

I have been doing some hierarchical clustering in R. It's worked fine up until now, producing hclust objects left and right, but suddenly it doesn't anymore. Now it only produces lists when performing:
mydata.clusters <- hclust(dist(mydata[, 1:8]))
mydata.clustercut <- cutree(mydata.clusters, 4)
and when trying to:
table(mydata.clustercut, mydata$customer_lifetime)
it doesn't produce a table, but an endless print of the values (I'm guessing from the list).
The cutree function provides the group to which each observation belongs. For example:
iris.clust <- hclust(dist(iris[,1:4]))
iris.clustcut <- cutree(iris.clust, 4)
iris.clustcut
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
# [52] 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2
# [103] 4 2 2 4 3 4 2 4 2 2 2 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 4 4 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Additional comparisons can then be made by using this as a grouping variable for the observed data:
new.iris <- data.frame(iris, gp=iris.clustcut)
# example to visualise quickly the Species membership of each group
library(ggplot2)
ggplot(new.iris, aes(gp, fill=Species)) +
geom_bar()
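For the table() step in the question, the same pattern works directly, since cutree returns a plain integer vector (a sketch using iris in place of the original data):

```r
iris.clust <- hclust(dist(iris[, 1:4]))
iris.clustcut <- cutree(iris.clust, 4)

# cross-tabulate cluster membership against another column
table(iris.clustcut, iris$Species)
```

If table() instead prints endlessly, check class(mydata.clusters): if it is a plain list rather than an hclust object, something upstream (for example a masked hclust function) is the real culprit.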

Using an already created k-means cluster model on a new data set in R

I've built a cluster model in R (kmeans):
fit <- kmeans(sales_DP_DCT_agg_tr_bel_mod, 4)
Now I want to use this model to segment a brand new data set. How can I:
store the model
run the model on a new data set?
Let's say you're using iris as a dataset.
data = iris[,1:4] ## Don't want the categorical feature
model = kmeans(data, 3)
Here's what the output looks like:
>model
K-means clustering with 3 clusters of sizes 96, 33, 21
Cluster means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 6.314583 2.895833 4.973958 1.7031250
2 5.175758 3.624242 1.472727 0.2727273
3 4.738095 2.904762 1.790476 0.3523810
Clustering vector:
[1] 2 3 3 3 2 2 2 2 3 3 2 2 3 3 2 2 2 2 2 2 2 2 2 2 3 3 2 2 2 3 3 2 2 2 3 2 2 2 3 2 2 3 3 2 2 3 2 3 2 2 1 1 1 1 1 1 1 3 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[76] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Within cluster sum of squares by cluster:
[1] 118.651875 6.432121 17.669524
(between_SS / total_SS = 79.0 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss" "size" "iter" "ifault"
Notice you have access to the centroids via model$centers. All you have to do to classify an incoming sample is find which centroid it's closest to. You could define a Euclidean distance function as follows:
eucDist <- function(x, y) sqrt(sum((x - y)^2))
And then a classifying function:
classifyNewSample <- function(newData, centroids = model$centers) {
    dists <- apply(centroids, 1, function(y) eucDist(y, newData))
    order(dists)[1]
}
> classifyNewSample(c(7,3,6,2))
[1] 1
> classifyNewSample(c(6,2.7,4.3,1.4))
[1] 2
As far as model persistence goes, check out ?save or ?saveRDS.
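A minimal persistence sketch with saveRDS/readRDS, the usual idiom for storing a single R object (the temp-file path here is just for illustration):

```r
model <- kmeans(iris[, 1:4], 3)

# save the fitted model to disk ...
path <- tempfile(fileext = ".rds")
saveRDS(model, path)

# ... and reload it in a later session
model2 <- readRDS(path)
identical(model$centers, model2$centers)  # TRUE
```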
Edit:
To apply the classification function to a new matrix:
## I'm just generating a random 50 x 4 matrix here:
n_row <- 50
n_col <- 4
m0 <- matrix(0, n_row, n_col)
new_data <- apply(m0, c(1, 2), function(x) sample(seq(0, 10, 0.1), 1))
new_labels <- apply(new_data, 1, classifyNewSample)
>new_labels
[1] 1 2 3 3 2 1 3 1 3 1 2 3 3 1 1 3 1 1 1 3 1 1 1 1 1 1 3 1 1 3 3 1 1 3 2 1 3 2 3 1 2 1 2 1 1 2 1 3 2 1

Merge data based on response pattern in R

I have a dataframe that has survey response items (scale 1-4). This is what the data looks like for the first 10 respondents:
Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n
1 1 2 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1
3 2 1 1 1 1 1 1 2 2
4 4 4 2 2 3 3 4 4 3
5 1 1 1 1 1 1 1 2 1
6 4 4 4 3 4 4 2 4 4
7 3 3 4 3 3 3 4 4 3
8 3 3 2 2 4 2 3 3 2
9 1 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1
I fit a graded response model to the data, and now have theta hats for each response pattern. There are 901 observations in the raw data, but only 547 observations of theta.hat. The reason is that there is a single theta.hat for each observed response pattern; e.g., a score of '1' across all items appears 94 times. The theta.hat dataframe looks like this:
Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n Obs Theta
1 1 1 1 1 1 1 1 1 1 94 -1.307
2 1 1 1 1 1 1 1 1 2 10 -0.816
3 1 1 1 1 1 1 1 1 4 1 -0.750
4 1 1 1 1 1 1 1 2 1 22 -0.803
5 1 1 1 1 1 1 1 2 2 6 -0.524
What I am trying to do is merge the theta.hats with the original data. This seems to require matching the response patterns across two datasets. So, for example, line 10 in the raw data (with all '1's) would receive a theta hat of -1.307 because it matched the response pattern in line 1 of the theta matrix. Both datasets are structured so each variable is a numeric column.
I'm not sure how to send a reproducible dataset for this case, but am happy to if you have suggestions.
Thank you,
Andrea
How about a simple merge? Assuming your first dataset (the responses) is assigned to df.1 and the second dataset (with the modeled thetas) is assigned to df.2:
merge(df.1, df.2, by = names(df.1), all.x = TRUE)
# Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n Obs Theta
# 1 1 1 1 1 1 1 1 1 1 94 -1.307
# 2 1 1 1 1 1 1 1 1 1 94 -1.307
# 3 1 1 1 1 1 1 1 1 1 94 -1.307
# 4 1 1 1 1 1 1 1 2 1 22 -0.803
# 5 1 2 1 1 1 1 1 1 1 NA NA
# 6 2 1 1 1 1 1 1 2 2 NA NA
# 7 3 3 2 2 4 2 3 3 2 NA NA
# 8 3 3 4 3 3 3 4 4 3 NA NA
# 9 4 4 2 2 3 3 4 4 3 NA NA
# 10 4 4 4 3 4 4 2 4 4 NA NA
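One caveat worth adding (a common gotcha, not something the question ran into): merge() reorders rows, so if the original row order of the responses matters, carry an index column through and sort on it afterwards. A self-contained sketch with made-up two-item data:

```r
df.1 <- data.frame(q1 = c(1, 2, 1), q2 = c(1, 1, 2))
df.2 <- data.frame(q1 = c(1, 1), q2 = c(1, 2), theta = c(-1.3, -0.5))

# tag each response row, merge, then restore the original order
df.1$row_id <- seq_len(nrow(df.1))
out <- merge(df.1, df.2, by = c("q1", "q2"), all.x = TRUE)
out <- out[order(out$row_id), ]
out$theta  # -1.3, NA, -0.5, back in the original row order
```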

Digits being neglected while performing N-gram in R

I want to get the counts of all character-level n-grams present in a text file.
Using R, I wrote a small piece of code for this. However, the code is neglecting all the digits present in the text. Could anyone help me fix this issue?
Here is the code:
library(tau)
temp <- read.csv("/home/aravi/Documents/sample/csv/ex.csv", header = TRUE, stringsAsFactors = FALSE)
r <- textcnt(temp, method = "ngram", n = 4L, decreasing = TRUE)
a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)
b
Here are the contents of the input file:
abcd123
appl2345e
coun56ry
live123
names3423bsdf
coun56ryas
This is the output:
$`1`
counts size
_ 18 1
a 3 1
e 3 1
n 3 1
s 3 1
c 2 1
l 2 1
o 2 1
p 2 1
r 2 1
u 2 1
y 2 1
b 1 1
d 1 1
f 1 1
i 1 1
m 1 1
v 1 1
$`2`
counts size
_c 2 2
_r 2 2
co 2 2
e_ 2 2
n_ 2 2
ou 2 2
ry 2 2
s_ 2 2
un 2 2
_a 1 2
_b 1 2
_e 1 2
_l 1 2
_n 1 2
am 1 2
ap 1 2
as 1 2
bs 1 2
df 1 2
es 1 2
f_ 1 2
iv 1 2
l_ 1 2
li 1 2
me 1 2
na 1 2
pl 1 2
pp 1 2
sd 1 2
ve 1 2
y_ 1 2
ya 1 2
$`3`
counts size
_co 2 3
_ry 2 3
cou 2 3
oun 2 3
un_ 2 3
_ap 1 3
_bs 1 3
_e_ 1 3
_li 1 3
_na 1 3
ame 1 3
app 1 3
as_ 1 3
bsd 1 3
df_ 1 3
es_ 1 3
ive 1 3
liv 1 3
mes 1 3
nam 1 3
pl_ 1 3
ppl 1 3
ry_ 1 3
rya 1 3
sdf 1 3
ve_ 1 3
yas 1 3
$`4`
counts size
_cou 2 4
coun 2 4
oun_ 2 4
_app 1 4
_bsd 1 4
_liv 1 4
_nam 1 4
_ry_ 1 4
_rya 1 4
ames 1 4
appl 1 4
bsdf 1 4
ive_ 1 4
live 1 4
mes_ 1 4
name 1 4
ppl_ 1 4
ryas 1 4
sdf_ 1 4
yas_ 1 4
Could anyone tell me what I am missing or where I went wrong?
Thanks in advance.
The default value of the split argument in textcnt includes the [:digit:] character class, so digits are treated as delimiters. Remove it and things will work.
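Concretely, assuming tau's documented default of split = "[[:space:][:punct:][:digit:]]+", passing a version without the [:digit:] class keeps the numbers inside the n-grams (a sketch on two of the question's lines):

```r
library(tau)

txt <- c("abcd123", "appl2345e")
r <- textcnt(txt, method = "ngram", n = 4L,
             split = "[[:space:][:punct:]]+", decreasing = TRUE)
head(names(r))  # n-grams such as "d123" now survive
```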
