R: PCA - How to make a survey design object with svydesign

I am new to Principal Components Analysis in R. I'm currently working my way through a PCA with my own data, following this example: PCA by Coreysparks
I'm stuck at the section which requires me to create a survey design object. This is my code:
options(survey.lonely.psu = "adjust")
scwb<-scwb[complete.cases(scwb),]
des<-svydesign(ids=~psu, strata=~ststr, weights=~cntywt, data=scwb , nest=T)
I then receive the following error message:
Error in eval(predvars, data, env) : object 'psu' not found
I tried substituting psu with 1 (i.e. ids = ~1), since I'm fairly sure I'm working with simple random sampling. But then the error message points out object 'ststr' not found, and I assume the same holds for cntywt.
I've been working on this for a couple of hours and I simply can't figure out (even after quite some research) how to properly fill out this function so that I can continue with the PCA.
Any suggestions on how to approach this problem?
EDIT: Here is a sample of my data:
IDN
Jobsat_1 Jobsat_2 Jobsat_3 Member_1 Member_2 Member_4 LingInt Belong_2 Belong_3 Grpor_14 Ethnic10 MEDIAb Trust_G1 Trust_G2
1 10 10 8 3 1 1 5 10 10 2 1 2 1 3
2 7 7 5 4 1 1 5 10 10 4 4 2 3 3
3 7 7 7 1 1 1 5 9 10 4 3 2 1 1
4 7 7 7 2 1 1 5 10 10 3 3 2 3 3
5 7 8 8 3 3 3 5 10 8 3 4 2 3 3
6 7 7 7 2 1 2 5 10 8 4 3 2 5 3
Grpor_3 Grpor_4 Grpor_12 Emp_1 Volas_10 Mstatus1 Child_1 Health_1 YRBIRTH Rgender Brthcoun YR_IMM Relig_2 Educ_R INCOM_14
1 1 5 2 3 15 0 0 2 31 1 1 100 3 4 73
2 7 6 9 2 15 1 5 1 40 2 1 100 7 8 105
3 5 5 7 1 10 1 3 3 60 1 1 100 4 3 40
4 8 8 8 3 4 1 4 2 63 1 1 100 2 3 30
5 5 4 8 2 20 1 0 2 37 2 1 100 3 8 60
6 7 7 10 2 20 1 1 2 37 2 1 100 6 8 73
Emp_1.1 pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10
1 3 1.9148465 4.9846483 -0.8072519 -3.3183158 -3.32593626 -0.8836492 0.4113892 1.26298570 -0.32539801 1.4829219
2 2 -0.8358479 -0.3853229 -2.4203323 0.2514915 -0.08298205 -0.8393284 0.2384652 -0.04452874 -0.76024480 3.0219680
3 1 0.5800140 1.3872784 -3.8136358 -1.4466660 0.49100339 1.0119573 0.3198218 -0.27307314 -1.46013793 -0.7989148
4 3 -1.0482772 0.4416652 -2.7239398 -2.0870089 -1.23500720 -1.6997003 0.2737917 0.76019756 -1.41174450 0.2476696
5 2 0.1772093 0.7945494 0.3299359 0.7466387 -0.37083214 -0.2998434 0.4494944 -0.04000732 -0.09798538 -0.1694823
6 2 -1.5970861 -0.3793862 -2.0577713 -1.2681369 0.02748217 -0.8220763 0.1409177 -0.15252729 -1.15142836 -0.2362450
pc11 pc12 pc13
1 -0.06980251 0.6393237 -0.66798274
2 -0.39774429 -0.1281088 -0.06633185
3 0.33557364 0.4008058 -1.13155741
4 0.07448897 -0.5383089 -0.02582100
5 -0.04507346 -0.1805617 1.42605349
6 -0.06659864 1.1336526 0.04854108
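The errors mean that scwb has no columns named psu, ststr, or cntywt. If the sample really is a simple random sample with no clusters, strata, or weights, those arguments can simply be omitted. A minimal sketch (using a toy stand-in for scwb with two of the columns shown above, since the real data isn't available here):

```r
library(survey)

# Toy stand-in for the scwb data frame (hypothetical values)
scwb <- data.frame(Jobsat_1 = c(10, 7, 7, 7, 7, 7),
                   Jobsat_2 = c(10, 7, 7, 7, 8, 7))

# With no clusters, strata, or weights, svydesign() only needs ids = ~1;
# it then treats the data as an unweighted simple random sample.
des <- svydesign(ids = ~1, data = scwb)

# svyprcomp() runs a design-based PCA on selected columns.
pc <- svyprcomp(~Jobsat_1 + Jobsat_2, design = des, scale. = TRUE)
```

svydesign() will warn that it is assuming equal probabilities, which is exactly what a simple random sample implies.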

Related

Clogit function in CEDesign not converge

I designed a CE experiment using the package support.CEs. I generated a CE design with 3 attributes and 4 levels per attribute. The questionnaire had 4 alternatives and 4 blocks:
des1 <- rotation.design(attribute.names = list(
Qualitat = c("Aigua potable", "Cosetes.blanques.flotant", "Aigua.pou", "Aigua.marro"),
Disponibilitat.acces = c("Aixeta.24h", "Aixeta.10h", "Diposit.comunitari", "Pou.a.20"),
Preu = c("No.problemes.€", "Esforç.economic", "No.pagues.acces", "No.pagues.no.acces")),
nalternatives = 4, nblocks = 4, row.renames = FALSE,
randomize = TRUE, seed = 987)
The questionnaire was answered by 15 persons (ID 1-15), giving 60 responses (15 persons × 4 blocks):
ID BLOCK q1 q2 q3 q4
1 1 1 1 2 3 3
2 1 2 1 3 3 4
3 1 3 5 1 3 5
4 1 4 5 2 2 5
5 2 1 1 2 4 3
6 2 2 1 4 3 4
7 2 3 3 1 3 2
8 2 4 1 2 2 2
9 3 1 1 2 2 2
10 3 2 1 4 3 4
11 3 3 3 1 3 4
12 3 4 3 2 1 4
13 4 1 1 5 4 3
14 4 2 1 4 5 4
15 4 3 5 5 3 2
16 4 4 5 2 5 5
17 5 1 1 2 4 2
18 5 2 3 2 3 2
19 5 3 3 1 3 4
20 5 4 3 2 1 4
21 6 1 1 5 5 5
22 6 2 1 3 3 4
23 6 3 3 1 3 4
24 6 4 1 2 2 2
25 7 1 1 2 4 3
26 7 2 4 2 3 4
27 7 3 3 1 3 3
28 7 4 3 4 5 5
29 8 1 1 3 2 3
30 8 2 1 4 3 4
31 8 3 3 1 3 4
32 8 4 1 2 2 1
33 9 1 1 2 3 3
34 9 2 1 3 3 4
35 9 3 5 1 3 5
36 9 4 5 2 2 5
37 15 1 1 5 5 5
38 15 2 4 4 5 4
39 15 3 5 5 3 5
40 15 4 4 3 5 5
41 11 1 1 5 5 5
42 11 2 4 4 5 4
43 11 3 5 5 3 5
44 11 4 5 3 5 5
45 12 1 1 2 4 3
46 12 2 4 2 3 4
47 12 3 3 1 3 3
48 12 4 3 4 5 5
49 13 1 1 2 2 2
50 13 2 1 4 3 4
51 13 3 3 1 3 2
52 13 4 1 2 2 2
53 14 1 1 1 3 3
54 14 2 1 4 1 4
55 14 3 4 1 3 2
56 14 4 3 2 1 2
57 15 1 1 1 3 2
58 15 2 5 2 1 4
59 15 3 4 4 3 1
60 15 4 3 4 1 4
The problem is that when I merge the questions-and-answers matrix with
dataset1 <- make.dataset(respondent.dataset = res1,
choice.indicators = c("q1","q2","q3","q4"),
design.matrix = desmat1)
R shows a warning message: In fitter(X, Y, strats, offset, init, control, weights = weights, :
Ran out of iterations and did not converge
I would expect the generated matrix desmat1 to have 4800 observations (80 possible combinations × 60 responses). Instead I have only 1200 observations. The matrix dataset1 only shows one set of alternatives instead of the four.
For example, for ID 1, Block 1, Question 1, only alternative 1 appears. It matches the answer selected by the person, but in other cases it does not match, and that information is lost, so the results when clogit is applied are wrong.
I hope the problem is clear.
Regards,
EDIT:
I found my problem. When I build the dataset from the respondent.dataset that I generated in .csv format, R detects only the q1 response instead of q1-q4. The call
dataset1 <- make.dataset(respondent.dataset = res1,
choice.indicators = c("q1","q2","q3","q4"),
design.matrix = desmat1)
detects q1-q4 as new columns. But the key is that q1-q4 have to fill the QES column in dataset1. I did another CE before with 1 block, and that dataset was built correctly when reading the respondent.dataset. So the key point is that now I'm using 4 blocks, but I do not know how to make R interpret q1-q4 as the QES column for each block.
res1 matrix (respondent.dataset): the complete matrix has 60 rows = 15 respondents (ID 1-15) × 4 questions (QES column in make.dataset).
Kind regards,
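One way to check whether the respondent data is the issue is to reshape it so that each row carries exactly one question; a base-R sketch, using the ID/BLOCK/q1-q4 column names from the question (whether make.dataset needs this long layout is an assumption to verify against the support.CEs documentation):

```r
# Toy respondent data in the wide layout shown above (hypothetical values)
res1 <- data.frame(ID = c(1, 1), BLOCK = c(1, 2),
                   q1 = c(1, 1), q2 = c(2, 3), q3 = c(3, 3), q4 = c(3, 4))

# Reshape so each row is one question (QES) with its chosen alternative (RES)
long <- reshape(res1, direction = "long",
                varying = c("q1", "q2", "q3", "q4"),
                v.names = "RES", timevar = "QES", idvar = c("ID", "BLOCK"))
long <- long[order(long$ID, long$BLOCK, long$QES), ]
```

In the long form, every ID/BLOCK combination contributes four rows, one per question, which is the shape a QES column implies.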

Mean and SD in a table

In R, when you table two variables, you get a frequency table:
> table(data$Var1, data$Var2)
1 2 3 4 5
0 0 1 5 6 12
1 1 10 6 7 0
2 2 6 7 6 3
3 2 9 8 3 2
4 4 9 5 3 3
5 3 4 9 4 4
6 2 7 7 4 4
7 2 7 7 6 2
8 5 7 5 5 2
9 5 4 5 6 4
Is there a way to include the mean and SD in each row, something like:
1 2 3 4 5 mean SD
0 0 1 5 6 12 4.20833 0.93153
1 1 10 6 7 0 .. ..
2 2 6 7 6 3
3 2 9 8 3 2
4 4 9 5 3 3
5 3 4 9 4 4
6 2 7 7 4 4
7 2 7 7 6 2
8 5 7 5 5 2
9 5 4 5 6 4
Save the table in something called T; then, for the mean and sd:
> cbind(T,
mean=apply(T,1,function(x){
(sum(x*(1:5)))/sum(x)}),
sd=apply(T,1,function(x){sd(rep(1:5,x))}))
1 2 3 4 5 mean sd
0 4 3 1 1 1 2.200000 1.3984118
1 1 2 3 3 3 3.416667 1.3113722
2 2 2 1 2 1 2.750000 1.4880476
3 0 1 2 4 1 3.625000 0.9161254
So 2.2 and 1.3984 are the mean and sd of c(1,1,1,1,2,2,2,3,4,5).
It's probably inefficient to compute the sd by reconstructing the original vector with rep, but it's late, and working out all the sums of squares and squares of sums for the sd is not something my brain can do at 1am.
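For what it's worth, the sd can be computed from the counts directly, without rep, using the usual sum-of-squares identity; a sketch on the first row of counts from the answer above:

```r
# Counts for one table row, over the column values 1..5
x <- c(4, 3, 1, 1, 1)
v <- 1:5

n  <- sum(x)           # number of underlying observations
s  <- sum(x * v)       # sum of the observations
ss <- sum(x * v^2)     # sum of the squared observations

m    <- s / n                            # mean
sdev <- sqrt((ss - s^2 / n) / (n - 1))   # sample sd, same as sd(rep(v, x))

stopifnot(all.equal(m, mean(rep(v, x))), all.equal(sdev, sd(rep(v, x))))
```

This reproduces the 2.200000 / 1.3984118 row without materialising the raw vector, so it stays cheap even for large counts.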

Error for tune with package e1071

I'm trying to tune the SVM model for regression in R using package e1071 with the method used in this tutorial. Here is the data
> head(gps_rg5,16)
Weather sex age occupation income weekday weekend age group rg
1 6 2 57 3 1 7 1 3 0.035725277
2 6 2 32 1 5 6 1 2 1.693898548
3 1 2 63 3 1 4 0 4 0.009012839
4 6 2 65 3 2 6 1 4 0.014902879
5 6 2 57 3 2 7 1 3 0.045594146
6 6 2 76 3 1 4 0 5 0.003531616
7 6 1 65 3 2 4 0 4 0.001575542
8 4 2 57 3 3 6 1 3 0.009384690
9 4 2 52 3 2 6 1 3 0.033322905
10 4 2 56 3 2 6 1 3 0.011879944
11 4 2 56 3 2 7 1 3 0.008266786
12 4 1 63 3 2 6 1 4 3.055594036
13 1 2 42 1 2 1 0 2 0.029010174
14 4 2 42 1 2 6 1 2 0.000933115
15 1 2 66 3 2 5 0 4 2.342416927
16 6 1 79 3 2 4 0 5 2.891190912
And this is the code for tuning:
svr1<-tune(svm,rg~.,data=train,ranges=list(cost=2^(2:9),epsilon=seq(0.01,10,0.1)))
And the code returns an error saying
Error in predict.svm(ret, xhold, decision.values = TRUE) :
Model is empty!
This is the structure of the training dataset:
Any answers would be appreciated, many thanks!
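One thing worth checking (an assumption, since the full training data isn't shown) is the epsilon range: in epsilon-regression, an epsilon larger than the spread of the response puts every point inside the tube, leaving no support vectors, which is one known way to get "Model is empty!". The rg values above are all below about 3.1, while the grid goes up to 10. A sketch with a hypothetical stand-in for train and a narrower grid:

```r
library(e1071)

set.seed(1)
# Hypothetical stand-in for the 'train' data frame from the question
train <- data.frame(x1 = runif(50), x2 = runif(50), rg = runif(50))

stopifnot(!anyNA(train))           # NA rows are dropped and can empty the model
stopifnot(is.numeric(train$rg))    # regression needs a numeric response

# Keep epsilon well below the spread of rg so some points fall outside the tube
svr1 <- tune(svm, rg ~ ., data = train,
             ranges = list(cost = 2^(2:5), epsilon = seq(0.01, 1, 0.2)))
```

If the error persists with a sane epsilon grid, the next suspects are NA values or non-numeric columns in train.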

Comparing each element in subsets of a large data

I have a large data with raw responses and wanted to compare each element for subject 1 in group 1 with its corresponding element for subject 1 in group 2. Of course, the comparison needs to be kept between subject 2 in group 1 and subject 2 in group 2, and between subject 3 in group 1 and subject 3 in group 2, and so on. What makes the problem even complex is that there are 100 groups, which in turn are 50 paired groups.
The output needs to keep the original raw response if they are the same. If they are different, the raw response needs to be replaced with '9'.
I'm pretty sure I could do it with a for loop, but I'm wondering whether there is anything better in R, such as ifelse or apply.
To keep things simple, my data would look like below.
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
Thanks for any help.
#Initialization of data
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
>df
V1 V2 V3 V4 V5 subject group
1 3 3 3 4 5 1 1
2 4 4 3 1 3 2 1
3 3 2 2 4 2 3 1
4 4 4 3 5 3 1 2
5 3 2 1 5 1 2 2
6 2 5 4 4 1 3 2
7 3 2 3 2 2 1 3
8 1 2 3 3 3 2 3
9 2 2 2 2 5 3 3
10 3 3 3 5 4 1 4
11 5 3 5 4 2 2 4
12 5 3 1 1 3 3 4
Processing without for loop
#processing without for loop
# assumption: initial data is sorted by group (can be easily done)
columns<-!dimnames(df)[[2]] %in% c('group','subject') # 'df', not the undefined 'x'
subjects<-df[, 'subject']
tabl<-table(subjects)
rows<-order(subjects)
rows2<-cumsum(tabl)
rows1<-rows2-tabl+1
df[rows[-rows1],columns][df[rows[-rows1],columns]!=df[rows[-rows2],columns]]<-9
>df
V1 V2 V3 V4 V5 subject group
1 3 3 3 4 5 1 1
2 4 4 3 1 3 2 1
3 3 2 2 4 2 3 1
4 9 9 3 9 9 1 2
5 9 9 9 9 9 2 2
6 9 9 9 4 9 3 2
7 9 9 3 9 9 1 3
8 9 2 9 9 9 2 3
9 2 9 9 9 9 3 3
10 3 9 3 9 9 1 4
11 9 9 9 9 9 2 4
12 9 9 9 9 9 3 4
Below is what I did to get the output. Again, thanks to Stanislav
df<-as.data.frame(matrix(sample(c(1:5),60,replace=T),nrow=12))
df$subject<-rep(1:3)
df$group<-rep(1:4, each=3)
> df
V1 V2 V3 V4 V5 subject group
1 1 4 3 1 5 1 1
2 2 1 4 1 5 2 1
3 1 2 5 4 5 3 1
4 5 4 1 4 3 1 2
5 5 1 3 2 2 2 2
6 1 2 2 4 5 3 2
7 5 4 2 3 1 1 3
8 2 3 4 3 5 2 3
9 2 5 3 5 3 3 3
10 4 2 1 4 1 1 4
11 2 3 3 5 5 2 4
12 5 3 3 4 5 3 4
col<-!dimnames(df)[[2]] %in% c('subject','group')
n<-length(df[,1])
temp<-table(df$group)
n.sub<-temp[1]
temp<-seq(1,n,by=2*n.sub)
s1<-c(sapply(temp, function(x) seq.int(x, length.out=n.sub)))
temp<-seq(n.sub+1,n,by=2*n.sub)
s2<-c(sapply(temp, function(x) seq.int(x, length.out=n.sub)))
df[s2,col][df[s1,col]!=df[s2,col]]<-9
> df
V1 V2 V3 V4 V5 subject group
1 1 4 3 1 5 1 1
2 2 1 4 1 5 2 1
3 1 2 5 4 5 3 1
4 9 4 9 9 9 1 2
5 9 1 9 9 9 2 2
6 1 2 9 4 5 3 2
7 5 4 2 3 1 1 3
8 2 3 4 3 5 2 3
9 2 5 3 5 3 3 3
10 9 9 9 9 1 1 4
11 2 3 9 9 5 2 4
12 9 9 3 9 9 3 4
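For the paired-groups case (groups 1 and 2 form a pair, 3 and 4 the next, and so on), the same replacement can also be written with a logical index on the group number; a sketch assuming, as above, that the data is sorted by group with the same subjects in the same order within each group:

```r
set.seed(1)
df <- as.data.frame(matrix(sample(1:5, 60, replace = TRUE), nrow = 12))
df$subject <- rep(1:3)
df$group <- rep(1:4, each = 3)

cols   <- !names(df) %in% c("subject", "group")
first  <- df$group %% 2 == 1   # rows of the first group in each pair (1, 3, ...)
second <- df$group %% 2 == 0   # rows of the second group in each pair (2, 4, ...)

orig <- df                     # keep a copy to inspect afterwards
# Replace entries in the second group of each pair that differ from the
# corresponding entry (same subject) in the first group with 9.
df[second, cols][orig[first, cols] != orig[second, cols]] <- 9
```

With 100 groups forming 50 pairs this scales unchanged, since the odd/even test picks out every pair at once.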

T test and permutation testing

I have a data frame which looks like this. There are 2 separate groups and 5 different variables.
df <- read.table(text="Group var1 var2 var3 var4 var5
1 3 5 7 3 7
1 3 7 5 9 6
1 5 2 6 7 6
1 9 5 7 0 8
1 2 4 5 7 8
1 2 3 1 6 4
2 4 2 7 6 5
2 0 8 3 7 5
2 1 2 3 5 9
2 1 5 3 8 0
2 2 6 9 0 7
2 3 6 7 8 8
2 10 6 3 8 0", header = TRUE)
I'm calculating the significance of each variable for distinguishing between the 2 groups using the T test (as below). However I'd like to implement permutation testing to calculate the p values as this is quite a small dataset. What is the best method for doing this in R?
t(sapply(df[-1], function(x)
unlist(t.test(x~df$Group)[c("p.value")])))
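A permutation test can be run by repeatedly shuffling the group labels and recomputing the test statistic; the p-value is the fraction of shuffles whose statistic is at least as extreme as the observed one. A sketch using the t statistic, with an abbreviated two-variable version of the data above (B = 2000 is an arbitrary choice):

```r
set.seed(42)
df <- read.table(text = "Group var1 var2
1 3 5
1 3 7
1 5 2
1 9 5
1 2 4
1 2 3
2 4 2
2 0 8
2 1 2
2 1 5
2 2 6
2 3 6
2 10 6", header = TRUE)

perm_p <- function(x, g, B = 2000) {
  obs <- abs(t.test(x ~ g)$statistic)
  # Statistic under B random relabelings of the two groups
  null <- replicate(B, abs(t.test(x ~ sample(g))$statistic))
  mean(null >= obs)              # permutation p-value
}

pvals <- sapply(df[-1], perm_p, g = df$Group)
pvals
```

Because only the labels are permuted, this needs no distributional assumptions, which is exactly what makes it attractive for a dataset this small.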
