Cut value in creating table - r

I have following type of data:
mydata <- data.frame (yvar = rnorm(200, 15, 5), xv1 = rep(1:5, each = 40),
xv2 = rep(1:10, 20))
table(mydata$xv1, mydata$xv2)
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
I want tabulate again with yvar categories. The following is cutkey.
cutkey :
< 10 - group 1
10-12 - group 2
12-16 - group 3
>16 - group 4
Thus we will have similar to above type of table to each cutkey elements. I want to have margin sums everytime.
< 10 - group 1
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
10-12 - group 2
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
and so on for all groups
(the numbers will be definately different)
Is there easyway to do it ?

Yes, using cut, dlply (plyr package) and addmargins:
mydata$yvar1 <- cut(mydata$yvar,breaks = c(-Inf,10,12,16,Inf))
> dlply(mydata,.(yvar1),function(x) addmargins(table(x$xv1,x$xv2)))
$`(-Inf,10]`
1 2 3 4 5 6 7 8 9 10 Sum
1 0 0 0 0 0 0 2 0 1 0 3
2 1 1 0 1 0 0 0 0 2 0 5
3 0 1 0 0 1 1 0 2 0 0 5
4 0 0 2 0 1 1 0 1 0 0 5
5 0 1 1 0 1 1 1 0 0 2 7
Sum 1 3 3 1 3 3 3 3 3 2 25
$`(10,12]`
1 2 3 4 6 7 8 9 10 Sum
1 0 0 0 1 2 0 0 0 0 3
2 0 0 1 0 0 1 0 0 1 3
3 0 1 0 1 1 2 0 0 1 6
4 0 1 0 0 0 0 0 0 0 1
5 1 0 1 1 1 0 1 1 2 8
Sum 1 2 2 3 4 3 1 1 4 21
$`(12,16]`
1 2 3 4 5 6 7 8 9 10 Sum
1 2 3 1 1 1 2 0 3 0 2 15
2 0 1 0 1 3 3 2 0 0 1 11
3 3 1 3 1 0 0 0 2 4 1 15
4 3 2 1 2 2 0 1 1 4 1 17
5 3 1 1 2 0 1 1 1 1 0 11
Sum 11 8 6 7 6 6 4 7 9 5 69
$`(16, Inf]`
1 2 3 4 5 6 7 8 9 10 Sum
1 2 1 3 2 3 0 2 1 3 2 19
2 3 2 3 2 1 1 1 4 2 2 21
3 1 1 1 2 3 2 2 0 0 2 14
4 1 1 1 2 1 3 3 2 0 3 17
5 0 2 1 1 3 1 2 2 2 0 14
Sum 7 7 9 9 11 7 10 9 7 9 85
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
yvar1
1 (-Inf,10]
2 (10,12]
3 (12,16]
4 (16, Inf]
You can adjust the breaks argument to cut to get the values just how you want them. (Although the margin sums you display in your question don't look like margin sums at all.)

Related

Stratified Fisher and Wilcoxon test

I would like to perform a stratified fisher test.
I've tried with tabulate without success.
These are my data:
> db$site
[1] 2 2 2 3 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 1 1 1 1 2 2 3 3 1 1 3 1 2 1 1 2 1 1 1 1 1 3 1 1 1 1 1 3 1
[50] 2 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 1 3 3 1
[99] 3 1 1 1 1 1 1 1 3 3 1 1 1 1 1 3 2 1 1 1 1 1 3 3 1 3 3 3 3 1 3 1 3 3 1 3 1 1 3 3 3 2 3 3 3 3 1 3 3
[148] 3 2 3 3 1 3 1 3 3 3 3 3 3 1 3 3 3 1 3 3 1 3 1 1 1 1 1 1 1 3 2 1 3 2 2 2 3 2 3 2 2 2 2 2 2 2 2 3 2
[197] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 1 3 2 1 2 3 1 3 3 1 1 1 1 3 3 3 1 3 2 2 1 3 1 2 3 1
[246] 1 1 1 1 1 2 3 2 1 2 3 3 3 1 1 2 3 2 3 3 3 2 2 1 3 2 3 1 1 3 3 2 1 1 1 1 1 2 1 1 2 2 1 2 3 3 1 1 1
[295] 3 2 2 3 1 1 2 2 2 3 3 2 1 3 1 2 1 3 1 1 3 1 1 3 2 2 2 2 2 1 3 1 1 2 3 3 3 1 3 1 3 2 3 1 1 1 3 3 3
[344] 3 1 2 2 2 3 1 3 1 1 3 1 3 2 1 3 2 2 2 2 2 2 2
Levels: 1 2 3
> db$phq_cat
[1] 1 2 2 3 1 2 2 2 1 1 1 2 1 1 1 2 2 1 2 1 3 2 1 1 1 5 1 2 3 2 3 1 2 4 2 1 1 2 2 1 1 1 1 2 1 2 2 2 2
[50] 2 1 3 1 2 3 2 2 2 3 2 1 1 3 2 2 2 2 2 3 1 1 2 3 2 2 5 3 1 3 1 2 3 2 2 3 3 1 3 1 1 2 2 1 2 2 1 2 4
[99] 1 1 2 2 2 2 2 1 3 2 2 1 1 3 2 1 2 2 2 1 3 2 2 3 2 1 1 1 2 2 2 2 1 3 1 3 2 2 1 2 2 2 2 1 1 3 2 2 1
[148] 3 2 4 1 2 1 2 3 1 3 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 1 1 1 3 3 2 1 1 3 2 2 3 3 2 4 2 2 2 3 2 1 4 1 2
[197] 3 1 2 2 2 2 3 2 3 3 3 3 5 1 3 2 3 1 3 3 3 2 2 1 2 2 1 3 4 4 2 1 2 2 3 4 3 3 2 1 2 4 1 1 2 1 3 2 3
[246] 3 1 1 2 2 3 3 1 1 4 3 2 1 1 2 2 2 2 1 2 2 2 3 1 1 2 4 1 1 2 3 3 2 1 1 1 2 5 1 2 3 2 2 3 1 1 3 3 1
[295] 3 5 2 1 1 1 2 2 1 1 1 4 4 2 1 1 2 1 2 3 1 2 1 2 4 1 1 1 3 1 4 1 1 1 1 4 3 1 2 1 3 3 3 1 2 1 3 1 1
[344] 2 1 1 2 2 2 1 4 2 2 1 4 1 1 4 2 1 4 2 3 2 1 1
Levels: 1 2 3 4 5
> db$area
[1] 3 1 1 0 3 2 5 3 1 3 3 3 1 5 4 0 3 5 5
[20] 1 3 2 3 1 3 3 0 0 3 3 0 1 3 1 3 3 3 3
[39] 3 3 3 0 1 2 2 2 2 0 2 1 1 1 0 0 0 3 3
[58] 3 3 4 3 3 3 3 3 1 3 3 0 3 3 5 3 2 3 5
[77] 5 3 3 3 3 5 3 2 5 3 3 3 2 0 0 0 0 0 5
[96] 0 0 3 0 3 5 3 3 3 1 3 0 0 3 3 2 1 3 0
[115] 3 2 5 2 5 1 0 0 5 0 0 0 0 1 0 1 0 0 3
[134] 0 3 3 0 0 0 3 0 0 0 0 3 0 0 0 3 0 0 2
[153] 0 3 0 0 0 0 0 0 3 0 0 0 3 0 0 3 0 5 3
[172] 5 3 3 3 3 0 2 3 0 3 2 3 0 3 0 3 2 5 3
[191] 2 3 5 5 0 3 5 3 2 3 3 3 3 2 2 3 1 3 3
[210] 3 3 3 5 3 3 3 3 3 0 3 0 3 2 1 0 3 0 0
[229] 5 3 3 1 0 0 0 3 0 5 1 3 0 3 3 0 1 5 5
[248] 3 1 3 5 0 2 3 2 0 0 0 3 3 5 0 3 0 0 0
[267] 3 3 3 0 3 0 3 4 0 0 3 3 3 3 5 5 3 3 1
[286] 3 1 2 3 0 0 5 3 1 0 5 3 0 3 3 3 2 3 0
[305] 0 <NA> 3 0 3 5 5 0 5 3 0 3 1 0 5 3 3 3 2
[324] <NA> 0 3 5 1 0 0 0 5 0 5 0 5 0 3 3 3 0 0
[343] 0 0 2 3 3 3 0 5 0 3 3 0 3 0 5 1 0 3 3
[362] 2 3 3 3 3
Levels: 0 1 2 3 4 5
library(survival)
library(Exact)
library(plyr)
b<-tabulate(db$site, db$phq_cat, db$area, tests=c("fisher"))
I obtain this error message:
Error in tabulate(db$site, db$phq_cat, db$area, tests = c("fisher")) :
unused arguments (db$AREEDISCIPL, tests = c("fisher"))
How cain I handle this?
I also would like to perform stratified wilcoxon rank sum test.
Is there a way?
Thank you!

Reshape wide data to long with multiple variables in R (dplyr) [duplicate]

This question already has an answer here:
How to use Pivot_longer to reshape from wide-type data to long-type data with multiple variables
(1 answer)
Closed 2 years ago.
I have a dataset of adolescents over 3 waves. I need to reshape the data from wide to long, but I haven't been able to figure out how to use pivot_longer (I've checked other questions, but maybe I missed one?). Below is sample data:
HAVE DATA:
id c1sports c2sports c3sports c1smoker c2smoker c3smoker c1drinker c2drinker c3drinker
1 1 1 1 1 1 4 1 5 2
2 1 1 1 5 1 3 4 1 4
3 1 0 0 1 1 5 2 3 2
4 0 0 0 1 3 3 4 2 3
5 0 0 0 2 1 2 1 5 3
6 0 0 0 4 1 4 4 3 1
7 1 0 1 2 2 3 1 4 1
8 0 1 1 4 4 1 4 5 4
9 1 1 1 3 2 2 3 4 2
10 0 1 0 2 5 5 4 2 3
WANT DATA:
id wave sports smoker drinker
1 1 1 1 1
1 2 1 1 5
1 3 1 4 2
2 1 1 5 4
2 2 1 1 1
2 3 1 3 4
3 1 1 1 2
3 2 0 1 3
3 3 0 5 2
4 1 0 1 4
4 2 0 3 2
4 3 0 3 3
5 1 0 2 1
5 2 0 1 5
5 3 0 2 3
6 1 0 4 4
6 2 0 1 3
6 3 0 4 1
7 1 1 2 1
7 2 0 2 4
7 3 1 3 1
8 1 0 4 4
8 2 1 4 5
8 3 1 1 4
9 1 1 3 3
9 2 1 2 4
9 3 1 2 2
10 1 0 2 4
10 2 1 2 2
10 3 0 5 3
So far the only think that I've been able to run is:
long_dat <- wide_dat %>%
pivot_longer(., cols = c1sports:c3drinker)
But this doesn't get me separate columns for sports, smoker, drinker.
You could use names_pattern argument in pivot_longer.
tidyr::pivot_longer(df,
cols = -id,
names_to = c('wave', '.value'),
names_pattern = 'c(\\d+)(.*)')
# id wave sports smoker drinker
# <int> <chr> <int> <int> <int>
# 1 1 1 1 1 1
# 2 1 2 1 1 5
# 3 1 3 1 4 2
# 4 2 1 1 5 4
# 5 2 2 1 1 1
# 6 2 3 1 3 4
# 7 3 1 1 1 2
# 8 3 2 0 1 3
# 9 3 3 0 5 2
#10 4 1 0 1 4
# … with 20 more rows

How to generate Classification Analysis tables in R?

So far I have done the discriminant analysis. I generated the posterior probabilities, structure loadings, and group centroids.
I have 1 grouping variable : history
I have 3 discriminant variables : mhpg, exercise, and control
here is the code so far
td <- read.delim("H:/Desktop/TAB DATA.txt")
td$history<-factor(td$history)
fit<-lda(history~mhpg+exercise+control, data=td)
git<-predict(fit)
xx<-subset(td, select=c(mhpg, control, exercise))
cor(xx,git$x)
aggregate(git$x~history,data=td,FUN=mean)
tst<-lm(cbind(mhpg,control,exercise)~history,data=td)
Basically, the above code is for discriminant analysis.
Now I want generate frequency classification and percent classification tables for classification analysis.
my attempted code (which i sampled from someone else to no avail) is:
td[6] <- git$class
td$V6<-factor(td$V6)
ftab<-table(td$history,dt$V6)
prop.table(ftab,1)
Where column 6 is my grouping variable history.
I get the following error when trying to make td$V6 a categorical variable with factor
Error in `$<-.data.frame`(`*tmp*`, "V6", value = integer(0)) :
replacement has 0 rows, data has 50
Can anyone steer me in the right direction? I really don't know why the sample code used a capital V out of nowhere. Below is the data. Column 6 is the grouping variable, history. Column 5 is the discriminant variable, control. column 7 is the discriminant variable, exercise. Column 8 is the discriminant variable, mhpg.
1 3 6 0 2 0 4 2 4 3 0 6 0
1 4 5 0 0 1 2 5 4 6 1 4 1
1 4 4 0 2 1 1 8 6 7 1 2 1
2 4 9 0 2 1 0 6 7 8 1 4 1
2 4 3 1 4 1 2 6 6 6 1 4 1
2 5 7 0 1 1 3 6 7 7 1 1 1
2 5 8 0 1 1 1 6 6 7 1 5 1
2 6 7 0 1 1 0 9 8 8 1 3 1
2 6 4 1 2 1 2 5 7 6 1 5 1
3 4 10 0 1 1 1 8 5 7 1 4 1
3 4 4 0 1 1 1 8 9 8 1 3 1
3 4 7 0 1 0 1 6 3 4 0 8 0
3 5 4 1 4 1 2 5 4 5 0 5 1
3 5 7 0 2 1 1 7 5 7 1 4 1
3 5 6 0 0 1 0 10 9 10 1 3 1
3 5 6 0 2 1 1 9 10 9 1 2 1
3 5 5 1 2 1 2 5 4 4 0 9 1
3 6 2 1 4 1 3 6 4 4 0 7 1
3 6 3 1 2 1 2 7 5 5 0 6 1
3 6 5 1 2 1 2 6 7 6 1 6 1
3 6 7 1 3 1 3 5 4 4 0 8 1
3 6 5 1 2 1 2 5 3 3 0 10 1
3 7 8 0 0 1 1 7 6 7 1 5 1
3 7 5 1 2 1 1 5 5 5 0 6 1
3 7 6 1 2 0 4 3 1 2 0 9 0
3 8 6 1 2 1 1 6 5 5 0 7 1
3 8 9 0 0 1 0 7 5 6 1 3 1
4 5 5 1 2 1 1 5 6 5 0 6 1
4 5 5 1 2 0 2 3 3 4 0 8 0
4 6 8 0 0 1 2 8 7 7 1 4 1
4 6 6 1 3 1 2 5 4 4 0 7 0
4 6 5 1 3 1 2 4 3 2 0 8 0
4 7 2 0 3 0 4 3 6 6 1 4 1
4 7 4 1 3 0 3 4 2 1 0 7 0
4 7 7 1 3 0 4 4 5 5 0 7 0
4 7 6 1 3 0 3 3 6 5 0 4 0
5 7 5 1 1 0 4 1 7 4 0 7 1
5 8 1 1 3 0 3 4 8 7 1 5 0
5 8 3 1 3 0 3 4 5 6 1 5 1
5 9 4 1 4 0 3 2 7 5 0 5 1
5 9 6 1 4 0 3 4 6 6 1 7 0
5 10 4 1 3 0 3 4 2 3 0 6 0
1 1 8 0 1 0 2 5 6 5 0 6 1
1 2 7 0 1 1 1 7 8 9 1 5 0
1 2 7 0 1 1 0 7 5 6 1 5 1
1 3 5 0 1 1 2 7 8 8 1 5 0
2 3 3 1 2 1 2 6 7 6 1 6 0
2 3 6 1 1 1 2 7 6 4 0 7 0
2 4 6 1 3 1 3 6 5 5 0 6 0
2 5 4 1 3 1 3 4 4 3 0 6 0
Try:
tbl <- table(td$history,git$class)
tbl
# 0 1
# 0 13 2
# 1 1 34
prop.table(tbl)
# 0 1
# 0 0.26 0.04
# 1 0.02 0.68
These are the classification tables.
Regarding why your "borrowed" code does not run, there are too many possibilities.
First, if you import the data set you provided without column names, R will assign names Vn where n is 1,2,3, etc. But if this was the case none of your code would run as you refer to columns history, control, etc. So at least those must be named properly.
Second, in the line:
ftab<-table(td$history,dt$V6)
you refer to dt$V6. AFAICT there is no dt (is this a typo?).

Using R to filter special rows

I have a question which has bothered me for a long time.I have a data frame as below...
ll <- data.frame(id=1:10,
A=c(rep(0,5),3,4,5,0,2),
B=c(1,4,2,0,3,0,3,24,0,0),
C=c(0,3,4,5,0,4,0,6,0,5),
D=c(0,1,2,0,42,4,0,3,8,0))
> ll
id A B C D
1 1 0 1 0 0
2 2 0 4 3 1
3 3 0 2 4 2
4 4 0 0 5 0
5 5 0 3 0 42
6 6 3 0 4 4
7 7 4 3 0 0
8 8 5 24 6 3
9 9 0 0 0 8
10 10 2 0 5 0
I want to filter out some special rows which have more than one "0" such as...
id A B C D
1 1 0 1 0 0
I want the final output as...
id A B C D
2 2 0 4 3 1
3 3 0 2 4 2
6 6 3 0 4 4
8 8 5 24 6 3
You can just use rowSums:
> ll[rowSums(ll == 0) <= 1, ]
id A B C D
2 2 0 4 3 1
3 3 0 2 4 2
6 6 3 0 4 4
8 8 5 24 6 3
If there are any columns that shouldn't be included, you can drop them in the rowSums step. For example, I assume "id" would not be included. If that's the case, then you can do:
ll[rowSums(ll[-1] == 0) <= 1, ]

Using "ward" method with pvclust in R

I am using the pvclust package in R to get hierarchical clustering dendrograms with p-values.
I want to use the "Ward" clustering and the "Euclidean" distance method. Both work fine with my data when using hclust. In pvclust however I keep getting the error message "invalid clustering method". The problem apparently results from the "ward" method, because other methods such as "average" work fine, as does "euclidean" on its own.
This is my syntax and the resulting error message:
result <- pvclust(t(data2007num), method.hclust="ward", method.dist="euclidean", nboot=100)
Bootstrap (r = 0.5)...
Error in hclust(distance, method = method.hclust) : invalid clustering method
My data matrix has the following form (28 countries x 20 policy dimensions):
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
AUT 2 3 4 2 1 1 4 3 2 2 2 3 3 4 4 2.0 5 4 0 3
GER 3 5 3 2 1 3 2 4 4 5 4 0 4 5 4 3.0 5 5 3 2
SWE 5 5 1 5 4 3 1 4 4 5 3 4 5 2 4 3.0 3 3 5 0
NLD 4 4 2 3 2 1 0 4 4 0 4 4 4 2 2 4.0 4 4 2 5
ESP 3 4 1 4 5 0 3 2 4 1 4 3 3 1 2 3.0 2 2 0 2
ITA 3 2 0 3 1 1 3 3 5 5 4 2 4 1 1 2.0 0 2 0 2
FRA 3 2 1 3 1 2 4 2 5 2 3 2 3 3 5 4.0 1 2 0 3
DNK 5 2 1 3 4 4 2 4 3 0 4 4 2 3 5 2.0 5 4 5 3
GRE 3 3 2 5 2 1 3 2 2 2 3 2 3 0 2 3.0 0 1 0 2
CHE 5 4 3 3 4 3 2 3 4 1 4 4 2 1 1 3.0 5 4 0 3
BEL 3 2 3 1 4 2 4 2 2 2 3 3 3 1 5 2.0 2 3 2 0
CZE 2 4 3 3 2 2 1 2 5 2 3 1 4 1 2 3.0 1 4 0 2
POL 3 3 4 4 0 1 3 3 2 2 4 2 2 0 3 4.0 2 2 0 3
IRL 3 1 2 1 4 3 2 1 5 4 3 2 2 1 3 2.0 0 1 1 2
LUX 2 1 2 5 3 2 2 5 4 2 2 4 3 2 4 3.0 2 3 0 1
HUN 1 3 2 3 2 1 4 3 5 4 2 3 4 3 3 2.0 3 2 4 2
PRT 3 2 3 5 4 1 4 1 5 5 3 2 2 1 2 2.0 1 1 1 1
AUS 4 1 2 1 2 3 1 1 1 5 4 5 3 1 2 3.0 1 3 5 1
CAN 1 1 1 1 4 1 0 1 1 5 1 1 3 3 2 2.0 1 2 5 4
FIN 5 4 4 3 2 3 2 3 3 3 2 2 4 3 3 3.0 4 4 5 2
GBR 3 1 2 1 2 3 1 1 2 5 4 4 4 3 1 2.0 1 3 5 5
JPN 4 1 0 1 2 2 0 2 5 4 3 1 1 3 3 2.0 2 4 5 3
KOR 3 3 0 1 2 1 0 0 1 4 0 1 1 2 3 2.0 1 2 1 3
MEX 0 3 4 0 3 2 5 2 3 5 2 2 0 0 0 0.0 0 1 0 3
NZL 5 1 2 1 2 3 1 1 5 2 3 5 2 2 2 0.5 0 0 3 3
NOR 5 3 2 4 2 4 2 5 4 2 4 5 4 2 4 4.0 5 4 5 0
SVK 1 4 3 2 4 2 1 2 5 2 3 2 4 2 2 3.0 0 2 0 3
USA 3 0 1 3 2 4 0 3 0 1 0 0 3 4 1 2.0 1 1 5 4
I tried to used "ward" with the dataset provided by the pvclust package (lung) as well as other data provided in R (such as Boston in the MASS package, without any success. Does anyone now a solution or if the "ward" method was disabled inpvclust?

Resources