I have conducted a discrete choice experiment using Google Forms and saved the results as a CSV file from Excel. I am having trouble converting the data from a standard CSV layout into a format that I can analyse with the gmnl package.
I am using the data below, which has been dummy coded:
personid choiceid alt payment management assessment crop
1 1 1 3 2 2 3
1 2 2 2 2 1 3
1 3 1 3 2 1 3
1 4 1 2 1 3 1
1 5 1 2 1 3 1
1 6 2 1 1 2 1
1 7 2 3 1 2 3
1 8 2 3 1 2 3
1 9 2 3 1 1 2
1 10 2 3 1 1 2
1 11 2 3 1 2 1
1 12 2 2 1 1 3
1 13 3 1 2 1 1
1 14 2 1 1 2 3
1 15 2 2 1 2 2
1 16 2 1 1 1 3
2 17 3 1 2 1 2
2 18 3 1 3 1 2
2 19 1 3 1 1 3
test <- as.data.frame(testchoices)
choices <- mlogit.data(test, shape = "long", idx = list(c("choiceid", "personid")),
idnames = c("management", "crops", "assessment", "price"))
write_csv(choices, "choicesnext.csv")
It works fine up to write_csv, where an error is thrown: 'Error in `[.data.frame`(x, start:min(NROW(x), start + len)) : undefined columns selected'.
I would be grateful for any assistance.
I have a data frame that looks like this:
Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2
As you can see in this data frame there are 6 different combinations in the N and S columns, and 8 consecutive rows of each combination. I want to create a new data frame where one row from each combination (be it 3 & 1 or 1 & 2) is randomly selected and then put into a new data frame so there are 8 consecutive rows of each different combination. That way the entire data frame of all 48 rows is completely reorganized. Is this possible in R code?
Edit: The desired output would be something like this, repeating until all 48 rows are filled. The subject in each row would have to be random, because each row is a randomly selected row from one N & S combination.
Subject N S
3 1
1 1
3 2
1 2
2 2
2 1
2 2
3 2
2 1
1 1
3 1
1 2
A solution using functions from dplyr.
# Load package
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Process the data
dt2 <- dt %>%
group_by(N, S) %>%
sample_n(size = 1)
# View the result
dt2
## A tibble: 6 x 3
## Groups: N, S [6]
# Subject N S
# <chr> <int> <int>
#1 Sub6-3 1 1
#2 Sub5-1 1 2
#3 Sub1-5 2 1
#4 Sub5-8 2 2
#5 Sub2-4 3 1
#6 Sub3-1 3 2
Update: Reorganize the rows
The following randomizes the order of all rows.
dt3 <- dt %>% slice(sample(1:n(), n()))
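If the goal is blocks of six rows, each containing exactly one randomly chosen row from every N & S combination, a fully random shuffle will not guarantee that. Below is a minimal base R sketch of one way to do it. It uses a made-up data frame (`toy`) with the same shape as `dt`; the `block` column and the name `toy_blocked` are illustrative assumptions, not part of the original answer.

```r
set.seed(123)

# Toy data shaped like `dt`: 6 N/S combinations, 8 rows each (made-up subjects)
toy <- data.frame(
  Subject = paste0("Sub", 1:48),
  N = rep(c(3, 3, 1, 1, 2, 2), each = 8),
  S = rep(c(1, 2, 1, 2, 1, 2), each = 8)
)

# Give every row a random position (1 to 8) within its N/S combination
toy$block <- ave(seq_len(nrow(toy)), toy$N, toy$S,
                 FUN = function(i) sample(length(i)))

# Sort by block; the random tie-breaker shuffles the combinations inside each block
toy_blocked <- toy[order(toy$block, runif(nrow(toy))), ]
```

Each value of `block` then marks one run of six consecutive rows, one per combination, and the column can be dropped afterwards.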
Data Preparation
dt <- read.table(text = "Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2",
header = TRUE, stringsAsFactors = FALSE)
I have been doing some hierarchical clustering in R. It has worked fine up until now, producing hclust objects left, right and center, but suddenly it no longer does. Now it only produces lists when I run:
mydata.clusters <- hclust(dist(mydata[, 1:8]))
mydata.clustercut <- cutree(mydata.clusters, 4)
and when trying to:
table(mydata.clustercut, mydata$customer_lifetime)
it doesn't produce a table, but an endless print of the values (I'm guessing from the list).
The cutree function returns the group to which each observation belongs. For example:
iris.clust <- hclust(dist(iris[,1:4]))
iris.clustcut <- cutree(iris.clust, 4)
iris.clustcut
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
# [52] 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2
# [103] 4 2 2 4 3 4 2 4 2 2 2 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 4 4 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Additional comparison can then be done by using this as a grouping variable for the observed data:
new.iris <- data.frame(iris, gp=iris.clustcut)
# example to visualise quickly the Species membership of each group
library(ggplot2)
ggplot(new.iris, aes(gp, fill=Species)) +
geom_bar()
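The same grouping vector can also be cross-tabulated directly, which is what the table() call in the question attempts. A minimal sketch on the built-in iris data follows; the dimension names `cluster` and `species` are only labels chosen for this illustration.

```r
# Cluster the four numeric iris columns and cut the tree into 4 groups
iris.clust <- hclust(dist(iris[, 1:4]))
iris.clustcut <- cutree(iris.clust, 4)

# table() expects two vectors of equal length, one entry per observation
length(iris.clustcut) == nrow(iris)
tab <- table(cluster = iris.clustcut, species = iris$Species)
tab
```

If table() prints values endlessly instead of a small grid, it is worth checking that the second argument is a short categorical vector rather than a continuous variable or a whole data frame.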
I have my data in a txt file containing the following numbers. How can I read it into R?
I tried fread, but it did not work:
Error in fread("x.txt") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 0 when detecting types ( first):
Here is the data:
2 3 3 2 1 2 3 2 3 2 1 3 1 2
1 1 3 2 3 1 2 1 2 3 3 2
3 1 1 1 2 1 1 3 1 2 2 2
1 3 1 1 3 2 3 3 1 1 2 2
1 3 2 3 2 1 3 1 1 1 3 1
1 3 1 2 3 3 2 2 2 2 3 3
1 3 2 3 2 3 2 2 2 1 3 1
3 2 1 2 2 3 3 2 3 2 3 3
2 1
Try this.
x <- scan("x.txt")
data <- as.data.frame(x)
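scan() works here because it reads whitespace-separated values one after another and does not require every line to have the same number of fields. A small sketch using an inline string in place of x.txt (the string `txt` is made up for illustration):

```r
# Stand-in for x.txt: three lines of unequal length
txt <- "2 3 3 2 1\n1 1 3\n2 1"

# scan() returns one numeric vector holding every value in reading order
x <- scan(text = txt, quiet = TRUE)
length(x)

# One-column data frame, one value per row
data <- data.frame(x = x)
```

Note that the original line structure is not kept; all values end up in a single vector.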
I have a dataframe that has survey response items (scale 1-4). This is what the data looks like for the first 10 respondents:
Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n
1 1 2 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1
3 2 1 1 1 1 1 1 2 2
4 4 4 2 2 3 3 4 4 3
5 1 1 1 1 1 1 1 2 1
6 4 4 4 3 4 4 2 4 4
7 3 3 4 3 3 3 4 4 3
8 3 3 2 2 4 2 3 3 2
9 1 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1
I fit a graded response model to the data, and now have theta hats for each response pattern. There are 901 observations in the raw data, but only 547 observations of theta.hat. This is because there is a single theta.hat for each observed response pattern - e.g., a score of '1' across all items appears 94 times. The theta.hat dataframe looks like this:
Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n Obs Theta
1 1 1 1 1 1 1 1 1 1 94 -1.307
2 1 1 1 1 1 1 1 1 2 10 -.816
3 1 1 1 1 1 1 1 1 4 1 -0.750
4 1 1 1 1 1 1 1 2 1 22 -.803
5 1 1 1 1 1 1 1 2 2 6 -.524
What I am trying to do is merge the theta.hats with the original data. This seems to require matching the response patterns across two datasets. So, for example, line 10 in the raw data (with all '1's) would receive a theta hat of -1.307 because it matched the response pattern in line 1 of the theta matrix. Both datasets are structured so each variable is a numeric column.
I'm not sure how to send a reproducible dataset for this case, but am happy to if you have suggestions.
Thank you,
Andrea
How about a simple merge? Assuming your first dataset (responses) is assigned to df.1 and the second dataset (modeled with theta) is assigned to df.2:
merge(df.1, df.2, by = names(df.1), all.x = TRUE)
# Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n Obs Theta
# 1 1 1 1 1 1 1 1 1 1 94 -1.307
# 2 1 1 1 1 1 1 1 1 1 94 -1.307
# 3 1 1 1 1 1 1 1 1 1 94 -1.307
# 4 1 1 1 1 1 1 1 2 1 22 -0.803
# 5 1 2 1 1 1 1 1 1 1 NA NA
# 6 2 1 1 1 1 1 1 2 2 NA NA
# 7 3 3 2 2 4 2 3 3 2 NA NA
# 8 3 3 4 3 3 3 4 4 3 NA NA
# 9 4 4 2 2 3 3 4 4 3 NA NA
# 10 4 4 4 3 4 4 2 4 4 NA NA
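Note that merge() sorts the result by the key columns, so the rows no longer follow the order of the raw data. A minimal sketch of one way to keep the original order, using made-up two-item data (`resp`, `theta` and `row_id` are hypothetical names for this illustration):

```r
# Toy responses (one row per respondent) and toy theta table (one row per pattern)
resp  <- data.frame(q1 = c(1, 2, 1, 1), q2 = c(1, 1, 2, 1))
theta <- data.frame(q1 = c(1, 1, 2), q2 = c(1, 2, 1),
                    Theta = c(-1.3, -0.8, -0.5))

# Remember the original row order before merging
resp$row_id <- seq_len(nrow(resp))

# Match on all item columns; all.x = TRUE keeps respondents with no matching pattern
merged <- merge(resp, theta, by = c("q1", "q2"), all.x = TRUE)

# Restore the original respondent order
merged <- merged[order(merged$row_id), ]
```

Respondents with identical response patterns each receive the same Theta value.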
I am trying to reproduce some log-linear modeling analysis from Agresti's Categorical Data Analysis (3rd ed.) [CDA] using the loglm function from the MASS package:
library(MASS)
# read in the data: http://www.stat.ufl.edu/~aa/cda/data.html
dfX = read.table(textConnection('a c m r g count
1 1 1 1 1 405
1 1 1 2 1 23
1 2 1 1 1 13
1 2 1 2 1 2
2 1 1 1 1 1
2 1 1 2 1 0
2 2 1 1 1 1
2 2 1 2 1 0
1 1 2 1 1 268
1 1 2 2 1 23
1 2 2 1 1 218
1 2 2 2 1 19
2 1 2 1 1 17
2 1 2 2 1 1
2 2 2 1 1 117
2 2 2 2 1 12
1 1 1 1 2 453
1 1 1 2 2 30
1 2 1 1 2 28
1 2 1 2 2 1
2 1 1 1 2 1
2 1 1 2 2 1
2 2 1 1 2 1
2 2 1 2 2 0
1 1 2 1 2 228
1 1 2 2 2 19
1 2 2 1 2 201
1 2 2 2 2 18
2 1 2 1 2 17
2 1 2 2 2 8
2 2 2 1 2 133
2 2 2 2 2 17'), header = TRUE)
llACM = loglm(count ~ c + a + m, data = dfX)
summary(llACM)
fitted(llACM)
But I am having difficulty understanding what the .Within. = means, and how I can get a predicted contingency table as given in CDA on page 323.
Really a long comment:
I ran your example, and took a look at things like fitted.loglm which consists of the following code:
{
if (!is.null(object$fit))
return(unclass(object$fit))
cat("Re-fitting to get fitted values\n")
unclass(update(object, fitted = TRUE, keep.frequencies = FALSE)$fitted)
}
update generates all the info that goes into object$fitted . I compared all sorts of data in the MASS example
minn38a <- xtabs(f ~ ., minn38)
fm <- loglm(~ 1 + 2 + 3 + 4, minn38a)
And about the only guess I can make is that your data is not properly dimensioned--or the fitting model adds a needed dimension, and this extra dim is given the default name .Within. . My suggestion would be to read the MASS book, or dig up info on glm fitting models. I agree that the explanations of the dimensions defined in the fitted dataset are somewhat lacking.
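One way to test that guess, assuming .Within. appears because the data frame has several rows for each a/c/m cell (the r and g columns are not in the formula), is to collapse the counts into a three-way table first. The sketch below uses made-up counts and base R's loglin(), which loglm is documented to be a front-end for, rather than loglm itself; the data frame `toy` is purely illustrative.

```r
# Made-up counts: factors a, c, m plus one extra factor g not used in the model
toy <- expand.grid(a = 1:2, c = 1:2, m = 1:2, g = 1:2)
toy$count <- c(10, 4, 6, 2, 8, 5, 3, 9, 7, 1, 2, 6, 5, 4, 8, 3)

# Collapse over g so that each a/c/m cell occurs exactly once
tab <- xtabs(count ~ a + c + m, data = toy)
dim(tab)

# Mutual independence model (a, c, m); fit = TRUE returns the fitted table
fit <- loglin(tab, margin = list(1, 2, 3), fit = TRUE, print = FALSE)
fit$fit
```

With a table like `tab` as input, a call such as loglm(~ a + c + m, tab) should have no repeated cells to account for, so the fitted values would be expected to form a plain a x c x m table.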