How to generate Classification Analysis tables in R? - r

So far I have run the discriminant analysis and generated the posterior probabilities, structure loadings, and group centroids.
I have 1 grouping variable: history
I have 3 discriminant variables: mhpg, exercise, and control
Here is the code so far:
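library(MASS)  # provides lda()
td <- read.delim("H:/Desktop/TAB DATA.txt")
td$history <- factor(td$history)
# linear discriminant analysis with history as the grouping variable
fit <- lda(history ~ mhpg + exercise + control, data = td)
git <- predict(fit)  # predicted classes, posterior probabilities, discriminant scores
xx <- subset(td, select = c(mhpg, control, exercise))
cor(xx, git$x)  # structure loadings
aggregate(git$x ~ history, data = td, FUN = mean)  # group centroids
tst <- lm(cbind(mhpg, control, exercise) ~ history, data = td)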
The code above covers the discriminant analysis.
Now I want to generate frequency and percent classification tables for the classification analysis.
My attempted code (which I adapted from someone else, to no avail) is:
td[6] <- git$class
td$V6<-factor(td$V6)
ftab<-table(td$history,dt$V6)
prop.table(ftab,1)
Where column 6 is my grouping variable history.
I get the following error when trying to make td$V6 a categorical variable with factor
Error in `$<-.data.frame`(`*tmp*`, "V6", value = integer(0)) :
replacement has 0 rows, data has 50
Can anyone steer me in the right direction? I really don't know why the sample code used a capital V out of nowhere. Below is the data. Column 6 is the grouping variable, history. Column 5 is the discriminant variable, control. Column 7 is the discriminant variable, exercise. Column 8 is the discriminant variable, mhpg.
1 3 6 0 2 0 4 2 4 3 0 6 0
1 4 5 0 0 1 2 5 4 6 1 4 1
1 4 4 0 2 1 1 8 6 7 1 2 1
2 4 9 0 2 1 0 6 7 8 1 4 1
2 4 3 1 4 1 2 6 6 6 1 4 1
2 5 7 0 1 1 3 6 7 7 1 1 1
2 5 8 0 1 1 1 6 6 7 1 5 1
2 6 7 0 1 1 0 9 8 8 1 3 1
2 6 4 1 2 1 2 5 7 6 1 5 1
3 4 10 0 1 1 1 8 5 7 1 4 1
3 4 4 0 1 1 1 8 9 8 1 3 1
3 4 7 0 1 0 1 6 3 4 0 8 0
3 5 4 1 4 1 2 5 4 5 0 5 1
3 5 7 0 2 1 1 7 5 7 1 4 1
3 5 6 0 0 1 0 10 9 10 1 3 1
3 5 6 0 2 1 1 9 10 9 1 2 1
3 5 5 1 2 1 2 5 4 4 0 9 1
3 6 2 1 4 1 3 6 4 4 0 7 1
3 6 3 1 2 1 2 7 5 5 0 6 1
3 6 5 1 2 1 2 6 7 6 1 6 1
3 6 7 1 3 1 3 5 4 4 0 8 1
3 6 5 1 2 1 2 5 3 3 0 10 1
3 7 8 0 0 1 1 7 6 7 1 5 1
3 7 5 1 2 1 1 5 5 5 0 6 1
3 7 6 1 2 0 4 3 1 2 0 9 0
3 8 6 1 2 1 1 6 5 5 0 7 1
3 8 9 0 0 1 0 7 5 6 1 3 1
4 5 5 1 2 1 1 5 6 5 0 6 1
4 5 5 1 2 0 2 3 3 4 0 8 0
4 6 8 0 0 1 2 8 7 7 1 4 1
4 6 6 1 3 1 2 5 4 4 0 7 0
4 6 5 1 3 1 2 4 3 2 0 8 0
4 7 2 0 3 0 4 3 6 6 1 4 1
4 7 4 1 3 0 3 4 2 1 0 7 0
4 7 7 1 3 0 4 4 5 5 0 7 0
4 7 6 1 3 0 3 3 6 5 0 4 0
5 7 5 1 1 0 4 1 7 4 0 7 1
5 8 1 1 3 0 3 4 8 7 1 5 0
5 8 3 1 3 0 3 4 5 6 1 5 1
5 9 4 1 4 0 3 2 7 5 0 5 1
5 9 6 1 4 0 3 4 6 6 1 7 0
5 10 4 1 3 0 3 4 2 3 0 6 0
1 1 8 0 1 0 2 5 6 5 0 6 1
1 2 7 0 1 1 1 7 8 9 1 5 0
1 2 7 0 1 1 0 7 5 6 1 5 1
1 3 5 0 1 1 2 7 8 8 1 5 0
2 3 3 1 2 1 2 6 7 6 1 6 0
2 3 6 1 1 1 2 7 6 4 0 7 0
2 4 6 1 3 1 3 6 5 5 0 6 0
2 5 4 1 3 1 3 4 4 3 0 6 0

Try:
tbl <- table(td$history,git$class)
tbl
#      0  1
#   0 13  2
#   1  1 34
prop.table(tbl)
#        0    1
#   0 0.26 0.04
#   1 0.02 0.68
These are the classification tables (rows are the actual history groups, columns are the predicted classes).
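If the "percent classification" table is meant per actual group, row proportions may be closer to what is wanted than proportions of the grand total. The sketch below rebuilds the frequency table from the counts printed above (so it runs without the original file) and assumes the rows are the actual groups; the names `actual`, `predicted`, `row_pct` and `accuracy` are only illustrative.

```r
# Rebuild the 2 x 2 frequency table from the counts shown above
tbl <- as.table(matrix(c(13, 1, 2, 34), nrow = 2,
                       dimnames = list(actual = c("0", "1"),
                                       predicted = c("0", "1"))))

# Percent of each actual group (row) assigned to each predicted class
row_pct <- round(100 * prop.table(tbl, 1), 1)
row_pct

# Overall proportion correctly classified: diagonal counts over the total
accuracy <- sum(diag(tbl)) / sum(tbl)
accuracy  # 0.94
```

With the real objects, the same two calls would be applied to `table(td$history, git$class)`.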
Regarding why your "borrowed" code does not run, there are too many possibilities.
First, if you import the data set you provided without column names, R will assign names Vn where n is 1,2,3, etc. But if this was the case none of your code would run as you refer to columns history, control, etc. So at least those must be named properly.
Second, in the line:
ftab<-table(td$history,dt$V6)
you refer to dt$V6. As far as I can tell, there is no object named dt (is this a typo for td?).
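One plausible reading of the capital V and of the reported error, sketched on toy data frames (the names `small`, `wide` and `msg` are made up for illustration): assigning to a column position that does not exist yet creates a column that R names `V<n>`, whereas assigning to an existing position overwrites that column and keeps its name. If the sample code was written for a data frame with only five columns, `td[6]` would have created `V6` there; with 13 columns it would instead overwrite column 6, so `td$V6` would be `NULL`.

```r
# Case 1: position 2 does not exist yet, so R creates a column named "V2"
small <- data.frame(a = 1:3)
small[2] <- c("x", "y", "z")
names(small)  # "a" "V2"

# Case 2: position 2 already exists, so its values are replaced and the name stays
wide <- data.frame(a = 1:3, history = c(0, 1, 1))
wide[2] <- c("x", "y", "z")
names(wide)   # "a" "history"

# There is no V6 column, so wide$V6 is NULL and factor(NULL) has length 0;
# assigning a length-0 vector as a column raises the error from the question
is.null(wide$V6)
msg <- tryCatch({
  wide$V6 <- factor(wide$V6)
  "no error"
}, error = function(e) conditionMessage(e))
msg
```

Under this reading, `td[6] <- git$class` would also have replaced the values of history itself, so it may be safer to store the predictions under a new name, for example `td$pred <- git$class`.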

Related

Reshape wide data to long with multiple variables in R (dplyr) [duplicate]

This question already has an answer here:
How to use Pivot_longer to reshape from wide-type data to long-type data with multiple variables
(1 answer)
Closed 2 years ago.
I have a dataset of adolescents over 3 waves. I need to reshape the data from wide to long, but I haven't been able to figure out how to use pivot_longer (I've checked other questions, but maybe I missed one?). Below is sample data:
HAVE DATA:
id c1sports c2sports c3sports c1smoker c2smoker c3smoker c1drinker c2drinker c3drinker
1 1 1 1 1 1 4 1 5 2
2 1 1 1 5 1 3 4 1 4
3 1 0 0 1 1 5 2 3 2
4 0 0 0 1 3 3 4 2 3
5 0 0 0 2 1 2 1 5 3
6 0 0 0 4 1 4 4 3 1
7 1 0 1 2 2 3 1 4 1
8 0 1 1 4 4 1 4 5 4
9 1 1 1 3 2 2 3 4 2
10 0 1 0 2 5 5 4 2 3
WANT DATA:
id wave sports smoker drinker
1 1 1 1 1
1 2 1 1 5
1 3 1 4 2
2 1 1 5 4
2 2 1 1 1
2 3 1 3 4
3 1 1 1 2
3 2 0 1 3
3 3 0 5 2
4 1 0 1 4
4 2 0 3 2
4 3 0 3 3
5 1 0 2 1
5 2 0 1 5
5 3 0 2 3
6 1 0 4 4
6 2 0 1 3
6 3 0 4 1
7 1 1 2 1
7 2 0 2 4
7 3 1 3 1
8 1 0 4 4
8 2 1 4 5
8 3 1 1 4
9 1 1 3 3
9 2 1 2 4
9 3 1 2 2
10 1 0 2 4
10 2 1 2 2
10 3 0 5 3
So far the only thing that I've been able to run is:
long_dat <- wide_dat %>%
pivot_longer(., cols = c1sports:c3drinker)
But this doesn't get me separate columns for sports, smoker, drinker.
You could use the names_pattern argument in pivot_longer.
tidyr::pivot_longer(df,
cols = -id,
names_to = c('wave', '.value'),
names_pattern = 'c(\\d+)(.*)')
#      id wave  sports smoker drinker
#   <int> <chr>  <int>  <int>   <int>
# 1     1 1          1      1       1
# 2     1 2          1      1       5
# 3     1 3          1      4       2
# 4     2 1          1      5       4
# 5     2 2          1      1       1
# 6     2 3          1      3       4
# 7     3 1          1      1       2
# 8     3 2          0      1       3
# 9     3 3          0      5       2
#10     4 1          0      1       4
# … with 20 more rows
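Note that wave comes back as character in the output above. A small self-contained sketch, using only the first two ids from the question as toy data, shows the same call with `names_transform` added so that wave is stored as an integer. This assumes a tidyr version that supports `names_transform` (1.1.0 or later, as far as I know).

```r
library(tidyr)

# Toy stand-in for the wide data: ids 1 and 2 from the question
wide_dat <- data.frame(
  id = 1:2,
  c1sports = c(1, 1), c2sports = c(1, 1), c3sports = c(1, 1),
  c1smoker = c(1, 5), c2smoker = c(1, 1), c3smoker = c(4, 3),
  c1drinker = c(1, 4), c2drinker = c(5, 1), c3drinker = c(2, 4)
)

long_dat <- pivot_longer(
  wide_dat,
  cols = -id,
  names_to = c("wave", ".value"),            # first capture group -> wave, second -> column name
  names_pattern = "c(\\d+)(.*)",
  names_transform = list(wave = as.integer)  # convert wave from character to integer
)
long_dat
```

The regular expression splits a name such as `c2smoker` into the wave number (`2`) and the measure (`smoker`); the special `.value` entry tells pivot_longer to use the second part as the output column name.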

Displaying isolated points using pm3d Gnuplot

I am plotting 3D histograms using pm3d in Gnuplot. Data is provided below. The sequence of steps in Gnuplot is:
set view map
splot 'test.dat' u 1:2:(log($3)) with pm3d t " ", 'test.dat' u 1:2:(log($3)) t " "
As the resulting plot shows, some data points are not drawn by pm3d, I think because they are isolated from their neighboring points. Is there a way to explicitly plot these isolated points in Gnuplot using pm3d?
Note: "plot with images" doesn't work in my case, because my real data set is much larger than this simple example and the plot looks like fragmented squares.
Thanks.
Data:
1 1 1
1 2 0
1 3 0
1 4 0
1 5 0
1 6 0
1 7 0
1 8 0
2 1 0
2 2 0
2 3 1
2 4 2
2 5 3
2 6 0
2 7 0
2 8 0
3 1 0
3 2 0
3 3 2
3 4 10
3 5 15
3 6 2
3 7 0
3 8 0
4 1 0
4 2 0
4 3 0
4 4 5
4 5 2
4 6 1
4 7 0
4 8 0
5 1 0
5 2 0
5 3 0
5 4 3
5 5 2
5 6 0
5 7 0
5 8 0
6 1 0
6 2 0
6 3 0
6 4 2
6 5 0
6 6 0
6 7 1
6 8 0
7 1 0
7 2 0
7 3 0
7 4 0
7 5 0
7 6 0
7 7 0
7 8 0
8 1 0
8 2 0
8 3 0
8 4 0
8 5 0
8 6 0
8 7 1.1
8 8 0

Specify effect size in simulated data

I'm using the simstudy package to create simulated data sets for a power analysis of a mixed effects model (using lmer). I would like to simulate a specific effect size for the dependent/outcome variable in each group, that is, to set an r value for each group in the simulated data set across the time points.
Here's a sample data frame that I generated using simstudy:
cid period Group male time
1 0 3 1 0
1 1 3 1 3
1 2 3 1 6
1 3 3 1 9
1 4 3 1 25
2 0 1 1 0
2 1 1 1 3
2 2 1 1 6
2 3 1 1 9
2 4 1 1 25
3 0 1 0 0
3 1 1 0 3
3 2 1 0 6
3 3 1 0 9
3 4 1 0 25
4 0 1 1 0
4 1 1 1 3
4 2 1 1 6
4 3 1 1 9
4 4 1 1 25
5 0 3 0 0
5 1 3 0 3
5 2 3 0 6
5 3 3 0 9
5 4 3 0 25
6 0 1 1 0
6 1 1 1 3
6 2 1 1 6
6 3 1 1 9
6 4 1 1 25
7 0 3 0 0
7 1 3 0 3
7 2 3 0 6
7 3 3 0 9
7 4 3 0 25
8 0 2 1 0
8 1 2 1 3
8 2 2 1 6
8 3 2 1 9
8 4 2 1 25
9 0 3 1 0
9 1 3 1 3
9 2 3 1 6
9 3 3 1 9
9 4 3 1 25
10 0 3 1 0
10 1 3 1 3
10 2 3 1 6
10 3 3 1 9
10 4 3 1 25
Let's assume a variable y with a mean = 25 and an SD = 10. Then assume an r for group 1 = .2, group 2 = .5, group 3 = .8
How would I simulate a variable (y) that has those properties? I was thinking something along the lines of rnorm, but really not having a lot of success.
Note that simstudy provides a formula module to define the outcome variable. It takes the form:
# define the column variable
def <- defData(def, varname = "small.eff", dist = "normal", formula = 0.3)
# define the values for the column
dtAdd <- defDataAdd(varname = "PRCA.small", dist = "normal",
                    formula = "25 - (Group + 1) * (period * small.eff)", variance = 10)
But I couldn't figure out how to actually create standardized variables in the formula space that would allow me to set a concrete r value for each group.
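One way to think about the question, sketched in base R rather than simstudy: if r is read as the correlation between y and time within each group (an assumption, since the question does not say what y should correlate with), then y can be built as a weighted sum of standardised time and independent noise. The helper `sim_y` and the object names below are hypothetical, and the sketch ignores the repeated-measures structure (no random effects per cid).

```r
set.seed(1)

# Hypothetical helper: y with mean mean_y, SD sd_y and population correlation r with x
sim_y <- function(x, r, mean_y = 25, sd_y = 10) {
  zx <- (x - mean(x)) / sd(x)  # standardised predictor
  e  <- rnorm(length(x))       # independent standard normal noise
  mean_y + sd_y * (r * zx + sqrt(1 - r^2) * e)
}

# Time points from the question, repeated to get a large sample per group
time <- rep(c(0, 3, 6, 9, 25), times = 2000)

# Target r per group, as stated in the question
r_by_group <- c(g1 = 0.2, g2 = 0.5, g3 = 0.8)
sims <- lapply(r_by_group, function(r) sim_y(time, r))

# Sample correlations with time should land close to the targets
achieved <- sapply(sims, function(y) cor(time, y))
round(achieved, 2)
```

The weights r and sqrt(1 - r^2) keep the variance of the standardised sum at 1, so multiplying by sd_y and adding mean_y gives roughly the requested SD and mean.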

Using "ward" method with pvclust in R

I am using the pvclust package in R to get hierarchical clustering dendrograms with p-values.
I want to use the "Ward" clustering and the "Euclidean" distance method. Both work fine with my data when using hclust. In pvclust however I keep getting the error message "invalid clustering method". The problem apparently results from the "ward" method, because other methods such as "average" work fine, as does "euclidean" on its own.
This is my syntax and the resulting error message:
result <- pvclust(t(data2007num), method.hclust="ward", method.dist="euclidean", nboot=100)
Bootstrap (r = 0.5)...
Error in hclust(distance, method = method.hclust) : invalid clustering method
My data matrix has the following form (28 countries x 20 policy dimensions):
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10
AUT 2 3 4 2 1 1 4 3 2 2 2 3 3 4 4 2.0 5 4 0 3
GER 3 5 3 2 1 3 2 4 4 5 4 0 4 5 4 3.0 5 5 3 2
SWE 5 5 1 5 4 3 1 4 4 5 3 4 5 2 4 3.0 3 3 5 0
NLD 4 4 2 3 2 1 0 4 4 0 4 4 4 2 2 4.0 4 4 2 5
ESP 3 4 1 4 5 0 3 2 4 1 4 3 3 1 2 3.0 2 2 0 2
ITA 3 2 0 3 1 1 3 3 5 5 4 2 4 1 1 2.0 0 2 0 2
FRA 3 2 1 3 1 2 4 2 5 2 3 2 3 3 5 4.0 1 2 0 3
DNK 5 2 1 3 4 4 2 4 3 0 4 4 2 3 5 2.0 5 4 5 3
GRE 3 3 2 5 2 1 3 2 2 2 3 2 3 0 2 3.0 0 1 0 2
CHE 5 4 3 3 4 3 2 3 4 1 4 4 2 1 1 3.0 5 4 0 3
BEL 3 2 3 1 4 2 4 2 2 2 3 3 3 1 5 2.0 2 3 2 0
CZE 2 4 3 3 2 2 1 2 5 2 3 1 4 1 2 3.0 1 4 0 2
POL 3 3 4 4 0 1 3 3 2 2 4 2 2 0 3 4.0 2 2 0 3
IRL 3 1 2 1 4 3 2 1 5 4 3 2 2 1 3 2.0 0 1 1 2
LUX 2 1 2 5 3 2 2 5 4 2 2 4 3 2 4 3.0 2 3 0 1
HUN 1 3 2 3 2 1 4 3 5 4 2 3 4 3 3 2.0 3 2 4 2
PRT 3 2 3 5 4 1 4 1 5 5 3 2 2 1 2 2.0 1 1 1 1
AUS 4 1 2 1 2 3 1 1 1 5 4 5 3 1 2 3.0 1 3 5 1
CAN 1 1 1 1 4 1 0 1 1 5 1 1 3 3 2 2.0 1 2 5 4
FIN 5 4 4 3 2 3 2 3 3 3 2 2 4 3 3 3.0 4 4 5 2
GBR 3 1 2 1 2 3 1 1 2 5 4 4 4 3 1 2.0 1 3 5 5
JPN 4 1 0 1 2 2 0 2 5 4 3 1 1 3 3 2.0 2 4 5 3
KOR 3 3 0 1 2 1 0 0 1 4 0 1 1 2 3 2.0 1 2 1 3
MEX 0 3 4 0 3 2 5 2 3 5 2 2 0 0 0 0.0 0 1 0 3
NZL 5 1 2 1 2 3 1 1 5 2 3 5 2 2 2 0.5 0 0 3 3
NOR 5 3 2 4 2 4 2 5 4 2 4 5 4 2 4 4.0 5 4 5 0
SVK 1 4 3 2 4 2 1 2 5 2 3 2 4 2 2 3.0 0 2 0 3
USA 3 0 1 3 2 4 0 3 0 1 0 0 3 4 1 2.0 1 1 5 4
I tried to use "ward" with the dataset provided by the pvclust package (lung) as well as other data provided in R (such as Boston in the MASS package), without any success. Does anyone know a solution, or whether the "ward" method was disabled in pvclust?
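For readers on a current R installation: since R 3.1.0, hclust names the Ward options "ward.D" (the behaviour of the old "ward" option) and "ward.D2" (which squares the dissimilarities before updating clusters). The sketch below uses base hclust only, on random toy data shaped like the matrix above; whether the method name explains the error in the asker's pvclust version is not something the question establishes, and passing the same names through `method.hclust` is an assumption to check against the installed pvclust documentation.

```r
set.seed(1)

# Toy stand-in for the 28 x 20 matrix (random values, not the real data)
m <- matrix(sample(0:5, 28 * 20, replace = TRUE), nrow = 28,
            dimnames = list(paste0("C", 1:28),
                            c(paste0("X", 1:10), paste0("Y", 1:10))))

d <- dist(m, method = "euclidean")

hc_d  <- hclust(d, method = "ward.D")   # same algorithm as the old "ward" option
hc_d2 <- hclust(d, method = "ward.D2")  # Ward's criterion on unsquared distances

hc_d2$method
```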

Cut value in creating table

I have the following type of data:
mydata <- data.frame (yvar = rnorm(200, 15, 5), xv1 = rep(1:5, each = 40),
xv2 = rep(1:10, 20))
table(mydata$xv1, mydata$xv2)
      1  2  3  4  5  6  7  8  9 10
  1   4  4  4  4  4  4  4  4  4  4
  2   4  4  4  4  4  4  4  4  4  4
  3   4  4  4  4  4  4  4  4  4  4
  4   4  4  4  4  4  4  4  4  4  4
  5   4  4  4  4  4  4  4  4  4  4
I want to tabulate again by yvar category. The following is the cut key:
< 10  - group 1
10-12 - group 2
12-16 - group 3
> 16  - group 4
So there will be one table like the one above for each cut-key group, and I want margin sums in each of them.
< 10 - group 1
      1  2  3  4  5  6  7  8  9 10
  1   4  4  4  4  4  4  4  4  4  4
  2   4  4  4  4  4  4  4  4  4  4
  3   4  4  4  4  4  4  4  4  4  4
  4   4  4  4  4  4  4  4  4  4  4
  5   4  4  4  4  4  4  4  4  4  4
10-12 - group 2
      1  2  3  4  5  6  7  8  9 10
  1   4  4  4  4  4  4  4  4  4  4
  2   4  4  4  4  4  4  4  4  4  4
  3   4  4  4  4  4  4  4  4  4  4
  4   4  4  4  4  4  4  4  4  4  4
  5   4  4  4  4  4  4  4  4  4  4
and so on for all groups
(the numbers will definitely be different)
Is there an easy way to do it?
Yes, using cut, dlply (plyr package) and addmargins:
library(plyr)
mydata$yvar1 <- cut(mydata$yvar, breaks = c(-Inf, 10, 12, 16, Inf))
dlply(mydata, .(yvar1), function(x) addmargins(table(x$xv1, x$xv2)))
$`(-Inf,10]`
       1  2  3  4  5  6  7  8  9 10 Sum
  1    0  0  0  0  0  0  2  0  1  0   3
  2    1  1  0  1  0  0  0  0  2  0   5
  3    0  1  0  0  1  1  0  2  0  0   5
  4    0  0  2  0  1  1  0  1  0  0   5
  5    0  1  1  0  1  1  1  0  0  2   7
  Sum  1  3  3  1  3  3  3  3  3  2  25
$`(10,12]`
       1  2  3  4  6  7  8  9 10 Sum
  1    0  0  0  1  2  0  0  0  0   3
  2    0  0  1  0  0  1  0  0  1   3
  3    0  1  0  1  1  2  0  0  1   6
  4    0  1  0  0  0  0  0  0  0   1
  5    1  0  1  1  1  0  1  1  2   8
  Sum  1  2  2  3  4  3  1  1  4  21
$`(12,16]`
       1  2  3  4  5  6  7  8  9 10 Sum
  1    2  3  1  1  1  2  0  3  0  2  15
  2    0  1  0  1  3  3  2  0  0  1  11
  3    3  1  3  1  0  0  0  2  4  1  15
  4    3  2  1  2  2  0  1  1  4  1  17
  5    3  1  1  2  0  1  1  1  1  0  11
  Sum 11  8  6  7  6  6  4  7  9  5  69
$`(16, Inf]`
       1  2  3  4  5  6  7  8  9 10 Sum
  1    2  1  3  2  3  0  2  1  3  2  19
  2    3  2  3  2  1  1  1  4  2  2  21
  3    1  1  1  2  3  2  2  0  0  2  14
  4    1  1  1  2  1  3  3  2  0  3  17
  5    0  2  1  1  3  1  2  2  2  0  14
  Sum  7  7  9  9 11  7 10  9  7  9  85
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
      yvar1
1 (-Inf,10]
2   (10,12]
3   (12,16]
4 (16, Inf]
You can adjust the breaks argument to cut to get the values just how you want them. (Although the margin sums you display in your question don't look like margin sums at all.)
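As an illustration of adjusting the cut, here is a base R sketch of the same idea without plyr. It makes two assumptions beyond the answer: `right = FALSE` is used so that the first interval is strictly below 10, as in the cut key, and xv1 and xv2 are wrapped in factors with fixed levels so that every table keeps all 5 rows and 10 columns, even when a value is absent from a group (as column 5 is in the (10,12] table above). The names `grp` and `tabs` and the seed are illustrative.

```r
set.seed(42)
mydata <- data.frame(yvar = rnorm(200, 15, 5),
                     xv1 = rep(1:5, each = 40),
                     xv2 = rep(1:10, 20))

# Left-closed intervals: [-Inf,10), [10,12), [12,16), [16,Inf)
mydata$grp <- cut(mydata$yvar, breaks = c(-Inf, 10, 12, 16, Inf),
                  right = FALSE, labels = paste("group", 1:4))

# One 5 x 10 table per group, each with row and column sums added
tabs <- lapply(split(mydata, mydata$grp), function(d) {
  addmargins(table(factor(d$xv1, levels = 1:5),
                   factor(d$xv2, levels = 1:10)))
})

tabs[["group 1"]]
```

Each element of `tabs` is then a 6 x 11 table (5 rows plus Sum, 10 columns plus Sum), and the bottom-right cells add up to the 200 observations.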
