Merge data based on response pattern in R

I have a dataframe that has survey response items (scale 1-4). This is what the data looks like for the first 10 respondents:
Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n
1 1 2 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1 1
3 2 1 1 1 1 1 1 2 2
4 4 4 2 2 3 3 4 4 3
5 1 1 1 1 1 1 1 2 1
6 4 4 4 3 4 4 2 4 4
7 3 3 4 3 3 3 4 4 3
8 3 3 2 2 4 2 3 3 2
9 1 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1
I fit a graded response model to the data and now have theta hats for each response pattern. There are 901 observations in the raw data, but only 547 observations of theta.hat. This is because there is a single theta.hat for each observed response pattern; e.g., a score of '1' across all items appears 94 times. The theta.hat dataframe looks like this:
Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n Obs Theta
1 1 1 1 1 1 1 1 1 1 94 -1.307
2 1 1 1 1 1 1 1 1 2 10 -0.816
3 1 1 1 1 1 1 1 1 4 1 -0.750
4 1 1 1 1 1 1 1 2 1 22 -0.803
5 1 1 1 1 1 1 1 2 2 6 -0.524
What I am trying to do is merge the theta.hats with the original data. This seems to require matching the response patterns across two datasets. So, for example, line 10 in the raw data (with all '1's) would receive a theta hat of -1.307 because it matched the response pattern in line 1 of the theta matrix. Both datasets are structured so each variable is a numeric column.
I'm not sure how to send a reproducible dataset for this case, but am happy to if you have suggestions.
Thank you,
Andrea

How about a simple merge? Assuming your first dataset (responses) is assigned to df.1 and the second dataset (modeled with theta) is assigned to df.2:
merge(df.1, df.2, by = names(df.1), all.x = TRUE)
# Q20_1n Q20_3n Q20_5n Q20_7n Q20_9n Q20_11n Q20_13n Q20_15n Q20_17n Obs Theta
# 1 1 1 1 1 1 1 1 1 1 94 -1.307
# 2 1 1 1 1 1 1 1 1 1 94 -1.307
# 3 1 1 1 1 1 1 1 1 1 94 -1.307
# 4 1 1 1 1 1 1 1 2 1 22 -0.803
# 5 1 2 1 1 1 1 1 1 1 NA NA
# 6 2 1 1 1 1 1 1 2 2 NA NA
# 7 3 3 2 2 4 2 3 3 2 NA NA
# 8 3 3 4 3 3 3 4 4 3 NA NA
# 9 4 4 2 2 3 3 4 4 3 NA NA
# 10 4 4 4 3 4 4 2 4 4 NA NA
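Note that merge() sorts the result by the key columns, so the rows above no longer follow the respondent order of the raw data. A minimal sketch of one way to restore that order, assuming a helper column (row_id, a hypothetical name) and two toy item columns Q1 and Q2 standing in for the nine survey items:

```r
# Toy stand-ins for the raw responses and the theta table (hypothetical columns)
df.1 <- data.frame(Q1 = c(1, 1, 2, 1), Q2 = c(2, 1, 1, 1))
df.2 <- data.frame(Q1 = c(1, 1), Q2 = c(1, 2), Theta = c(-1.307, -0.816))

# Record the original respondent order before merging
df.1$row_id <- seq_len(nrow(df.1))

# Match on the item columns only; all.x = TRUE keeps unmatched respondents as NA
item_cols <- setdiff(names(df.1), "row_id")
merged <- merge(df.1, df.2, by = item_cols, all.x = TRUE)

# merge() sorts by the keys, so put the rows back in respondent order
merged <- merged[order(merged$row_id), ]
merged
```

Respondents whose pattern has no row in the theta table get NA, which is what the NA rows in the output above reflect, since only the first five theta rows were available.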

Related

R: Merge dataframes not aligning properly

I am trying to merge two data frames:
df1
Day Hour S_Value
1 1 1
1 1 2
1 1 3
1 1 1
1 2 5
1 2 6
1 2 4
1 2 2
df2
Day Hour n_Value
1 1 3
1 1 7
1 1 na
1 1 na
1 2 1
1 2 9
1 2 na
1 2 na
I used:
join <- merge(df1, df2, by = "Hour")
And got this data frame, where the first S_Value for each Hour is repeated for every row with that Hour:
join
Day Hour S_Value N_Value
1 1 1 3
1 1 1 7
1 1 1 na
1 1 1 na
1 2 5 1
1 2 5 9
1 2 5 na
1 2 5 na
I want it to look like this:
join
Day Hour S_Value N_Value
1 1 1 3
1 1 2 7
1 1 3 na
1 1 1 na
1 2 5 1
1 2 6 9
1 2 4 na
1 2 2 na
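One possible approach, sketched here under the assumption that the rows of df1 and df2 should be paired by their position within each Day/Hour group: Day and Hour alone do not identify a row uniquely, so a within-group counter (idx, a hypothetical helper column) is added to both data frames and used as an extra merge key. The na entries are assumed to be missing values (NA).

```r
# Rebuild the two example data frames (na taken to mean NA)
df1 <- data.frame(Day = 1, Hour = rep(1:2, each = 4),
                  S_Value = c(1, 2, 3, 1, 5, 6, 4, 2))
df2 <- data.frame(Day = 1, Hour = rep(1:2, each = 4),
                  n_Value = c(3, 7, NA, NA, 1, 9, NA, NA))

# Number the rows within each Day/Hour group: 1, 2, 3, 4, 1, 2, 3, 4
df1$idx <- ave(seq_len(nrow(df1)), df1$Day, df1$Hour, FUN = seq_along)
df2$idx <- ave(seq_len(nrow(df2)), df2$Day, df2$Hour, FUN = seq_along)

# With idx in the key, each row matches exactly one row in the other frame
joined <- merge(df1, df2, by = c("Day", "Hour", "idx"))
joined$idx <- NULL
joined
```

If the two data frames are already in the same row order and have the same number of rows, cbind(df1, n_Value = df2$n_Value) would give the same result without a merge.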

discrete choice experiment data preparation for analysis using GMNL package

I have conducted a discrete choice experiment using Google Forms and recorded the results in a CSV file in Excel. I am having trouble understanding how to get the data from a standard CSV format into a format that I can analyse using the gmnl package.
I am using the data below, which has been dummy coded:
personid choiceid alt payment management assessment crop
1 1 1 3 2 2 3
1 2 2 2 2 1 3
1 3 1 3 2 1 3
1 4 1 2 1 3 1
1 5 1 2 1 3 1
1 6 2 1 1 2 1
1 7 2 3 1 2 3
1 8 2 3 1 2 3
1 9 2 3 1 1 2
1 10 2 3 1 1 2
1 11 2 3 1 2 1
1 12 2 2 1 1 3
1 13 3 1 2 1 1
1 14 2 1 1 2 3
1 15 2 2 1 2 2
1 16 2 1 1 1 3
2 17 3 1 2 1 2
2 18 3 1 3 1 2
2 19 1 3 1 1 3
library(mlogit)  # provides mlogit.data()
library(readr)   # provides write_csv()
test <- as.data.frame(testchoices)
choices <- mlogit.data(test, shape = "long", idx = list(c("choiceid", "personid")),
                       idnames = c("management", "crops", "assessment", "price"))
write_csv(choices, "choicesnext.csv")
It works fine up to write_csv, where an error is thrown: 'Error in [.data.frame (x, start:min(NROW(x), start + len)) : undefined columns selected'
I would be grateful for any assistance.

Using R code to reorganize data frame by randomly selecting one row from each combination

I have a data frame that looks like this:
Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2
As you can see in this data frame there are 6 different combinations in the N and S columns, and 8 consecutive rows of each combination. I want to create a new data frame where one row from each combination (be it 3 & 1 or 1 & 2) is randomly selected and then put into a new data frame so there are 8 consecutive rows of each different combination. That way the entire data frame of all 48 rows is completely reorganized. Is this possible in R code?
Edit: The desired output would be something like this, repeating until all 48 rows are filled. The subject for each row would have to be random, because each row is a randomly selected row of its N & S combination.
Subject N S
3 1
1 1
3 2
1 2
2 2
2 1
2 2
3 2
2 1
1 1
3 1
1 2
A solution using functions from dplyr.
# Load package
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Process the data
dt2 <- dt %>%
  group_by(N, S) %>%
  sample_n(size = 1)
# View the result
dt2
## A tibble: 6 x 3
## Groups: N, S [6]
# Subject N S
# <chr> <int> <int>
#1 Sub6-3 1 1
#2 Sub5-1 1 2
#3 Sub1-5 2 1
#4 Sub5-8 2 2
#5 Sub2-4 3 1
#6 Sub3-1 3 2
Update: Reorganize the rows
The following randomizes all rows.
dt3 <- dt %>% slice(sample(1:n(), n()))
Data Preparation
dt <- read.table(text = "Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2",
header = TRUE, stringsAsFactors = FALSE)
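Neither snippet above produces quite the layout shown in the question's edit, where every run of 6 rows contains one randomly chosen row from each N & S combination. A minimal base R sketch of one way to get there, assuming 8 rows per combination; dt_demo is a hypothetical stand-in for dt with made-up subject labels, and block is a hypothetical helper column:

```r
set.seed(123)

# Hypothetical stand-in for dt: 6 N/S combinations, 8 rows each
combos <- expand.grid(N = c(3, 1, 2), S = c(1, 2))
dt_demo <- combos[rep(seq_len(nrow(combos)), each = 8), ]
dt_demo$Subject <- paste0("Sub", seq_len(nrow(dt_demo)))
rownames(dt_demo) <- NULL

# Within each combination, give every row a random draw position from 1 to 8
dt_demo$block <- ave(seq_len(nrow(dt_demo)), dt_demo$N, dt_demo$S,
                     FUN = function(i) sample(length(i)))

# Sort by draw position; the random tie-breaker shuffles the 6 rows in a block
result <- dt_demo[order(dt_demo$block, runif(nrow(dt_demo))),
                  c("Subject", "N", "S")]
head(result, 12)
```

Each row is used exactly once, so the result is a reordering of all 48 rows rather than a sample with replacement.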

How to select only the last row among the subset of rows satisfying a condition in R programming

The dataframe looks like this :
Customer_id A B C D E F G
10000001 1 1 2 3 1 3 1
10000001 1 2 3 1 2 1 3
10000002 2 2 2 3 1 3 1
10000002 2 2 1 4 2 3 1
10000003 1 5 2 4 7 2 4
10000003 1 5 2 6 3 7 2
10000003 1 1 2 2 1 2 1
10000004 1 2 3 1 2 3 1
10000004 1 3 2 3 1 3 2
10000004 1 3 2 1 3 2 1
10000004 1 4 1 4 1 3 1
10000006 1 2 3 4 5 1 2
10000006 1 3 1 4 1 2 1
10000008 2 3 2 3 2 1 2
10000008 2 3 1 1 2 1 2
10000008 1 3 1 1 2 2 1
There are multiple entries for each customer_id. I need to create another data frame from this existing data frame. The new data frame should contain only the last row for every customer_id. It should look like this:
10000001 1 2 3 1 2 1 3
10000002 2 2 1 4 2 3 1
10000003 1 1 2 2 1 2 1
10000004 1 4 1 4 1 3 1
10000006 1 3 1 4 1 2 1
10000008 1 3 1 1 2 2 1
Something like this (hard to code without the data in R format):
dataframe[ rev(!duplicated(rev(dataframe$Customer_id))),]
or better
dataframe[ !duplicated(dataframe$Customer_id,fromLast=TRUE),]
You can also use aggregate
aggregate(. ~ Customer_id, data = DF, FUN = tail, 1)
## Customer_id A B C D E F G
## 1 10000001 1 2 3 1 2 1 3
## 2 10000002 2 2 1 4 2 3 1
## 3 10000003 1 1 2 2 1 2 1
## 4 10000004 1 4 1 4 1 3 1
## 5 10000006 1 3 1 4 1 2 1
## 6 10000008 1 3 1 1 2 2 1
Assuming your data is named dat, here's one way using by and rbind, although the other two methods (aggregate and duplicated) are much nicer:
do.call(rbind, by(dat, dat$Customer_id, FUN = tail, 1))
## Customer_id A B C D E F G
## 2 10000001 1 2 3 1 2 1 3
## 4 10000002 2 2 1 4 2 3 1
## 7 10000003 1 1 2 2 1 2 1
## 11 10000004 1 4 1 4 1 3 1
## 13 10000006 1 3 1 4 1 2 1
## 16 10000008 1 3 1 1 2 2 1
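Since the first answer was written without the data in R format, here is a small self-contained check of the duplicated approach. It is a sketch that reads only the first seven rows of the example, with dat as the assumed data frame name:

```r
# First three customers from the example data
dat <- read.table(text = "Customer_id A B C D E F G
10000001 1 1 2 3 1 3 1
10000001 1 2 3 1 2 1 3
10000002 2 2 2 3 1 3 1
10000002 2 2 1 4 2 3 1
10000003 1 5 2 4 7 2 4
10000003 1 5 2 6 3 7 2
10000003 1 1 2 2 1 2 1", header = TRUE)

# fromLast = TRUE marks every occurrence except the final one as a duplicate,
# so negating the result keeps the last row per Customer_id
last_rows <- dat[!duplicated(dat$Customer_id, fromLast = TRUE), ]
last_rows
```

This relies on the rows already being in the intended order within each customer; if they are not, sort the data frame first.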

Digits being neglected while performing N-gram in R

I want to get the counts of all character-level n-grams present in a text file.
I wrote a small piece of R code to do this. However, the code ignores all the digits present in the text. Could anyone help me fix this issue?
Here is the code:
library(tau)
temp <- read.csv("/home/aravi/Documents/sample/csv/ex.csv", header = TRUE, stringsAsFactors = FALSE)
r <- textcnt(temp, method = "ngram", n = 4L, decreasing = TRUE)
a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)
b
Here is the contents of the input file:
abcd123
appl2345e
coun56ry
live123
names3423bsdf
coun56ryas
This is the output:
$`1`
counts size
_ 18 1
a 3 1
e 3 1
n 3 1
s 3 1
c 2 1
l 2 1
o 2 1
p 2 1
r 2 1
u 2 1
y 2 1
b 1 1
d 1 1
f 1 1
i 1 1
m 1 1
v 1 1
$`2`
counts size
_c 2 2
_r 2 2
co 2 2
e_ 2 2
n_ 2 2
ou 2 2
ry 2 2
s_ 2 2
un 2 2
_a 1 2
_b 1 2
_e 1 2
_l 1 2
_n 1 2
am 1 2
ap 1 2
as 1 2
bs 1 2
df 1 2
es 1 2
f_ 1 2
iv 1 2
l_ 1 2
li 1 2
me 1 2
na 1 2
pl 1 2
pp 1 2
sd 1 2
ve 1 2
y_ 1 2
ya 1 2
$`3`
counts size
_co 2 3
_ry 2 3
cou 2 3
oun 2 3
un_ 2 3
_ap 1 3
_bs 1 3
_e_ 1 3
_li 1 3
_na 1 3
ame 1 3
app 1 3
as_ 1 3
bsd 1 3
df_ 1 3
es_ 1 3
ive 1 3
liv 1 3
mes 1 3
nam 1 3
pl_ 1 3
ppl 1 3
ry_ 1 3
rya 1 3
sdf 1 3
ve_ 1 3
yas 1 3
$`4`
counts size
_cou 2 4
coun 2 4
oun_ 2 4
_app 1 4
_bsd 1 4
_liv 1 4
_nam 1 4
_ry_ 1 4
_rya 1 4
ames 1 4
appl 1 4
bsdf 1 4
ive_ 1 4
live 1 4
mes_ 1 4
name 1 4
ppl_ 1 4
ryas 1 4
sdf_ 1 4
yas_ 1 4
Could anyone tell me what I am missing or where I went wrong?
Thanks in advance.
The default value of the split argument in textcnt is a regular expression that includes [:digit:], so digits are treated as delimiters. Pass a split pattern without it and the digits will be kept.
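A minimal sketch of that change, assuming the tau package is installed. The input lines are typed in as a character vector (txt, a hypothetical name) rather than read from the CSV file, and the split pattern shown is one reasonable choice that splits on whitespace and punctuation only:

```r
library(tau)

# The sample lines from the question, as a character vector
txt <- c("abcd123", "appl2345e", "coun56ry", "live123",
         "names3423bsdf", "coun56ryas")

# Drop [:digit:] from the split pattern so digits stay inside the tokens
r <- textcnt(txt, method = "ngram", n = 4L, decreasing = TRUE,
             split = "[[:space:][:punct:]]+")

a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)
b
```

With this pattern, n-grams such as those containing "123" should now appear in the counts alongside the purely alphabetic ones.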
