Reshape wide data to long with multiple variables in R (dplyr)

I have a dataset of adolescents over 3 waves. I need to reshape the data from wide to long, but I haven't been able to figure out how to use pivot_longer (I've checked other questions, but maybe I missed one?). Below is sample data:
HAVE DATA:
id c1sports c2sports c3sports c1smoker c2smoker c3smoker c1drinker c2drinker c3drinker
1 1 1 1 1 1 4 1 5 2
2 1 1 1 5 1 3 4 1 4
3 1 0 0 1 1 5 2 3 2
4 0 0 0 1 3 3 4 2 3
5 0 0 0 2 1 2 1 5 3
6 0 0 0 4 1 4 4 3 1
7 1 0 1 2 2 3 1 4 1
8 0 1 1 4 4 1 4 5 4
9 1 1 1 3 2 2 3 4 2
10 0 1 0 2 5 5 4 2 3
WANT DATA:
id wave sports smoker drinker
1 1 1 1 1
1 2 1 1 5
1 3 1 4 2
2 1 1 5 4
2 2 1 1 1
2 3 1 3 4
3 1 1 1 2
3 2 0 1 3
3 3 0 5 2
4 1 0 1 4
4 2 0 3 2
4 3 0 3 3
5 1 0 2 1
5 2 0 1 5
5 3 0 2 3
6 1 0 4 4
6 2 0 1 3
6 3 0 4 1
7 1 1 2 1
7 2 0 2 4
7 3 1 3 1
8 1 0 4 4
8 2 1 4 5
8 3 1 1 4
9 1 1 3 3
9 2 1 2 4
9 3 1 2 2
10 1 0 2 4
10 2 1 2 2
10 3 0 5 3
So far the only thing that I've been able to run is:
long_dat <- wide_dat %>%
pivot_longer(., cols = c1sports:c3drinker)
But this doesn't get me separate columns for sports, smoker, drinker.

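You could use the names_pattern argument in pivot_longer. The first capture group (the digits after "c") becomes the wave column, and the special '.value' entry tells pivot_longer to use the second capture group as the names of the output columns.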
tidyr::pivot_longer(df,
cols = -id,
names_to = c('wave', '.value'),
names_pattern = 'c(\\d+)(.*)')
# id wave sports smoker drinker
# <int> <chr> <int> <int> <int>
# 1 1 1 1 1 1
# 2 1 2 1 1 5
# 3 1 3 1 4 2
# 4 2 1 1 5 4
# 5 2 2 1 1 1
# 6 2 3 1 3 4
# 7 3 1 1 1 2
# 8 3 2 0 1 3
# 9 3 3 0 5 2
#10 4 1 0 1 4
# … with 20 more rows
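
As a minimal, self-contained sketch of the same call: the data frame name wide is made up here and holds only the first two ids from the question. The names_transform argument is an optional extra, assuming tidyr 1.1.0 or later, that stores wave as an integer instead of a character.

```r
library(tidyr)

# First two ids from the question's sample data (name `wide` is illustrative)
wide <- data.frame(
  id = 1:2,
  c1sports = c(1, 1), c2sports = c(1, 1), c3sports = c(1, 1),
  c1smoker = c(1, 5), c2smoker = c(1, 1), c3smoker = c(4, 3),
  c1drinker = c(1, 4), c2drinker = c(5, 1), c3drinker = c(2, 4)
)

long <- pivot_longer(
  wide,
  cols = -id,
  # group 1 of the pattern -> wave, group 2 -> names of the value columns
  names_to = c("wave", ".value"),
  names_pattern = "c(\\d+)(.*)",
  # optional: convert the captured wave digits from character to integer
  names_transform = list(wave = as.integer)
)

long
```

Each id should contribute one row per wave, with sports, smoker and drinker as separate columns, as in the WANT DATA layout.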

Related

how to create a column that determines if a value is missing in a variable in R

I am trying to identify if a column has a missing number category based on a max.score. Here is a sample dataset.
df <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
score = c(0,0,2,0,2, 0,1,1,0,1, 0,1,0,1,0),
max.score = c(2,2,2,2,2, 1,1,1,1,1, 2,2,2,2,2))
> df
id score max.score
1 1 0 2
2 1 0 2
3 1 2 2
4 1 0 2
5 1 2 2
6 2 0 1
7 2 1 1
8 2 1 1
9 2 0 1
10 2 1 1
11 3 0 2
12 3 1 2
13 3 0 2
14 3 1 2
15 3 0 2
For id = 1, given a max.score of 2, category 1 is missing. I would like to add a missing column that records this as 1. For id = 3, score = 2 is missing, so the missing column should hold 2. If more than one category is missing, the column should list all of them, for example 1,3. The desired output is:
> df
id score max.score missing
1 1 0 2 1
2 1 0 2 1
3 1 2 2 1
4 1 0 2 1
5 1 2 2 1
6 2 0 1 NA
7 2 1 1 NA
8 2 1 1 NA
9 2 0 1 NA
10 2 1 1 NA
11 3 0 2 2
12 3 1 2 2
13 3 0 2 2
14 3 1 2 2
15 3 0 2 2
Any thoughts?
Thanks!
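library(dplyr)
df %>%
group_by(id) %>%
# categories in 0..max.score that never occur among this id's scores
mutate(missing = toString(setdiff(0:max.score[1], unique(score))),
# an empty string means nothing is missing, so store NA instead
missing = ifelse(nzchar(missing), missing, NA))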
# A tibble: 15 x 4
# Groups: id [3]
id score max.score missing
<dbl> <dbl> <dbl> <chr>
1 1 0 2 1
2 1 0 2 1
3 1 2 2 1
4 1 0 2 1
5 1 2 2 1
6 2 0 1 NA
7 2 1 1 NA
8 2 1 1 NA
9 2 0 1 NA
10 2 1 1 NA
11 3 0 2 2
12 3 1 2 2
13 3 0 2 2
14 3 1 2 2
15 3 0 2 2
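
The same logic can be sketched in base R, which may make the setdiff step easier to follow. The helper name missing_categories is hypothetical and not part of the answer above. It assumes, as the question does, that categories run from 0 to max.score.

```r
df <- data.frame(id = c(1,1,1,1,1, 2,2,2,2,2, 3,3,3,3,3),
                 score = c(0,0,2,0,2, 0,1,1,0,1, 0,1,0,1,0),
                 max.score = c(2,2,2,2,2, 1,1,1,1,1, 2,2,2,2,2))

# Hypothetical helper: categories in 0..max_score absent from `score`,
# collapsed to one string, or NA when every category is present
missing_categories <- function(score, max_score) {
  gaps <- setdiff(0:max_score, score)
  if (length(gaps) == 0) NA_character_ else toString(gaps)
}

# One result per id, then mapped back onto every row of that id
per_id <- sapply(split(df, df$id),
                 function(g) missing_categories(g$score, g$max.score[1]))
df$missing <- unname(per_id[as.character(df$id)])

df
```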

Pasting a string of variables into a function is not working

I was looking at this question: Find how many times duplicated rows repeat in R data frame, which provides the following code:
library(plyr)
ddply(df,.(a,b),nrow)
However, I have a dataset with many variables, so I can't type them out like a,b in this case. I've tried using names(data) with the paste function, but it doesn't seem to work. I tried this:
var_names=paste(names(data),collapse=",")
ddply(data,.(paste(a)),nrow)
It instead gives this output:
However, if I manually type them out, I get the proper output:
What do I need to do differently here?
Instead of pasting the names together and evaluating the string, use count from dplyr, which accepts multiple columns through across and a select helper such as everything():
library(dplyr)
df %>%
count(across(everything()))
A reproducible example with mtcars dataset
data(mtcars)
df <- mtcars %>%
select(vs:carb)
count(df, across(everything()))
vs am gear carb n
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Also, in ddply, we can just pass a vector of column names i.e. no need to create a single string
library(plyr)
ddply(df, names(df), nrow)
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
Or, if we do create a single string from the names, paste the whole expression together and then evaluate it (not recommended, as there are standard ways of dealing with this):
eval(parse(text = paste('ddply(df, .(', toString(names(df)), '), nrow)')))
vs am gear carb V1
1 0 0 3 2 4
2 0 0 3 3 3
3 0 0 3 4 5
4 0 1 4 4 2
5 0 1 5 2 1
6 0 1 5 4 1
7 0 1 5 6 1
8 0 1 5 8 1
9 1 0 3 1 3
10 1 0 4 2 2
11 1 0 4 4 2
12 1 1 4 1 4
13 1 1 4 2 2
14 1 1 5 2 1
You can also use aggregate, grouping by all the columns and counting the length of each group.
aggregate(1:nrow(df)~., df, length)
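
A variation on the aggregate idea, sketched on the same mtcars columns used above: it adds a helper column of ones (the name n is an arbitrary choice for this sketch) and sums it, so the count column gets a readable name instead of 1:nrow(df).

```r
# Same four columns as the mtcars example above
df <- mtcars[, c("vs", "am", "gear", "carb")]

# Helper column of ones; summing it within each group counts the rows
df$n <- 1L

# `.` on the right-hand side means "all other columns", so no names are typed
counts <- aggregate(n ~ ., data = df, FUN = sum)

counts
```

The rows may come back in a different order than in the count() and ddply() outputs, but the group counts should agree.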

Stratified Fisher and Wilcoxon test

I would like to perform a stratified fisher test.
I've tried with tabulate without success.
These are my data:
> db$site
[1] 2 2 2 3 1 1 1 1 1 1 1 1 1 1 2 3 1 1 1 1 1 1 1 1 2 2 3 3 1 1 3 1 2 1 1 2 1 1 1 1 1 3 1 1 1 1 1 3 1
[50] 2 1 1 3 3 3 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 1 3 3 1
[99] 3 1 1 1 1 1 1 1 3 3 1 1 1 1 1 3 2 1 1 1 1 1 3 3 1 3 3 3 3 1 3 1 3 3 1 3 1 1 3 3 3 2 3 3 3 3 1 3 3
[148] 3 2 3 3 1 3 1 3 3 3 3 3 3 1 3 3 3 1 3 3 1 3 1 1 1 1 1 1 1 3 2 1 3 2 2 2 3 2 3 2 2 2 2 2 2 2 2 3 2
[197] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 1 3 2 1 2 3 1 3 3 1 1 1 1 3 3 3 1 3 2 2 1 3 1 2 3 1
[246] 1 1 1 1 1 2 3 2 1 2 3 3 3 1 1 2 3 2 3 3 3 2 2 1 3 2 3 1 1 3 3 2 1 1 1 1 1 2 1 1 2 2 1 2 3 3 1 1 1
[295] 3 2 2 3 1 1 2 2 2 3 3 2 1 3 1 2 1 3 1 1 3 1 1 3 2 2 2 2 2 1 3 1 1 2 3 3 3 1 3 1 3 2 3 1 1 1 3 3 3
[344] 3 1 2 2 2 3 1 3 1 1 3 1 3 2 1 3 2 2 2 2 2 2 2
Levels: 1 2 3
> db$phq_cat
[1] 1 2 2 3 1 2 2 2 1 1 1 2 1 1 1 2 2 1 2 1 3 2 1 1 1 5 1 2 3 2 3 1 2 4 2 1 1 2 2 1 1 1 1 2 1 2 2 2 2
[50] 2 1 3 1 2 3 2 2 2 3 2 1 1 3 2 2 2 2 2 3 1 1 2 3 2 2 5 3 1 3 1 2 3 2 2 3 3 1 3 1 1 2 2 1 2 2 1 2 4
[99] 1 1 2 2 2 2 2 1 3 2 2 1 1 3 2 1 2 2 2 1 3 2 2 3 2 1 1 1 2 2 2 2 1 3 1 3 2 2 1 2 2 2 2 1 1 3 2 2 1
[148] 3 2 4 1 2 1 2 3 1 3 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 1 1 1 3 3 2 1 1 3 2 2 3 3 2 4 2 2 2 3 2 1 4 1 2
[197] 3 1 2 2 2 2 3 2 3 3 3 3 5 1 3 2 3 1 3 3 3 2 2 1 2 2 1 3 4 4 2 1 2 2 3 4 3 3 2 1 2 4 1 1 2 1 3 2 3
[246] 3 1 1 2 2 3 3 1 1 4 3 2 1 1 2 2 2 2 1 2 2 2 3 1 1 2 4 1 1 2 3 3 2 1 1 1 2 5 1 2 3 2 2 3 1 1 3 3 1
[295] 3 5 2 1 1 1 2 2 1 1 1 4 4 2 1 1 2 1 2 3 1 2 1 2 4 1 1 1 3 1 4 1 1 1 1 4 3 1 2 1 3 3 3 1 2 1 3 1 1
[344] 2 1 1 2 2 2 1 4 2 2 1 4 1 1 4 2 1 4 2 3 2 1 1
Levels: 1 2 3 4 5
> db$area
[1] 3 1 1 0 3 2 5 3 1 3 3 3 1 5 4 0 3 5 5
[20] 1 3 2 3 1 3 3 0 0 3 3 0 1 3 1 3 3 3 3
[39] 3 3 3 0 1 2 2 2 2 0 2 1 1 1 0 0 0 3 3
[58] 3 3 4 3 3 3 3 3 1 3 3 0 3 3 5 3 2 3 5
[77] 5 3 3 3 3 5 3 2 5 3 3 3 2 0 0 0 0 0 5
[96] 0 0 3 0 3 5 3 3 3 1 3 0 0 3 3 2 1 3 0
[115] 3 2 5 2 5 1 0 0 5 0 0 0 0 1 0 1 0 0 3
[134] 0 3 3 0 0 0 3 0 0 0 0 3 0 0 0 3 0 0 2
[153] 0 3 0 0 0 0 0 0 3 0 0 0 3 0 0 3 0 5 3
[172] 5 3 3 3 3 0 2 3 0 3 2 3 0 3 0 3 2 5 3
[191] 2 3 5 5 0 3 5 3 2 3 3 3 3 2 2 3 1 3 3
[210] 3 3 3 5 3 3 3 3 3 0 3 0 3 2 1 0 3 0 0
[229] 5 3 3 1 0 0 0 3 0 5 1 3 0 3 3 0 1 5 5
[248] 3 1 3 5 0 2 3 2 0 0 0 3 3 5 0 3 0 0 0
[267] 3 3 3 0 3 0 3 4 0 0 3 3 3 3 5 5 3 3 1
[286] 3 1 2 3 0 0 5 3 1 0 5 3 0 3 3 3 2 3 0
[305] 0 <NA> 3 0 3 5 5 0 5 3 0 3 1 0 5 3 3 3 2
[324] <NA> 0 3 5 1 0 0 0 5 0 5 0 5 0 3 3 3 0 0
[343] 0 0 2 3 3 3 0 5 0 3 3 0 3 0 5 1 0 3 3
[362] 2 3 3 3 3
Levels: 0 1 2 3 4 5
library(survival)
library(Exact)
library(plyr)
b<-tabulate(db$site, db$phq_cat, db$area, tests=c("fisher"))
I obtain this error message:
Error in tabulate(db$site, db$phq_cat, db$area, tests = c("fisher")) :
unused arguments (db$AREEDISCIPL, tests = c("fisher"))
How can I handle this?
I would also like to perform a stratified Wilcoxon rank-sum test.
Is there a way?
Thank you!

How to generate Classification Analysis tables in R?

So far I have done the discriminant analysis. I generated the posterior probabilities, structure loadings, and group centroids.
I have 1 grouping variable : history
I have 3 discriminant variables : mhpg, exercise, and control
here is the code so far
library(MASS) # lda() comes from the MASS package
td <- read.delim("H:/Desktop/TAB DATA.txt")
td$history<-factor(td$history)
fit<-lda(history~mhpg+exercise+control, data=td)
git<-predict(fit)
xx<-subset(td, select=c(mhpg, control, exercise))
cor(xx,git$x)
aggregate(git$x~history,data=td,FUN=mean)
tst<-lm(cbind(mhpg,control,exercise)~history,data=td)
Basically, the above code is for discriminant analysis.
Now I want to generate frequency classification and percent classification tables for classification analysis.
My attempted code (which I adapted from someone else, to no avail) is:
td[6] <- git$class
td$V6<-factor(td$V6)
ftab<-table(td$history,dt$V6)
prop.table(ftab,1)
Where column 6 is my grouping variable history.
I get the following error when trying to make td$V6 a categorical variable with factor
Error in `$<-.data.frame`(`*tmp*`, "V6", value = integer(0)) :
replacement has 0 rows, data has 50
Can anyone steer me in the right direction? I really don't know why the sample code used a capital V out of nowhere. Below is the data. Column 6 is the grouping variable, history. Column 5 is the discriminant variable, control. column 7 is the discriminant variable, exercise. Column 8 is the discriminant variable, mhpg.
1 3 6 0 2 0 4 2 4 3 0 6 0
1 4 5 0 0 1 2 5 4 6 1 4 1
1 4 4 0 2 1 1 8 6 7 1 2 1
2 4 9 0 2 1 0 6 7 8 1 4 1
2 4 3 1 4 1 2 6 6 6 1 4 1
2 5 7 0 1 1 3 6 7 7 1 1 1
2 5 8 0 1 1 1 6 6 7 1 5 1
2 6 7 0 1 1 0 9 8 8 1 3 1
2 6 4 1 2 1 2 5 7 6 1 5 1
3 4 10 0 1 1 1 8 5 7 1 4 1
3 4 4 0 1 1 1 8 9 8 1 3 1
3 4 7 0 1 0 1 6 3 4 0 8 0
3 5 4 1 4 1 2 5 4 5 0 5 1
3 5 7 0 2 1 1 7 5 7 1 4 1
3 5 6 0 0 1 0 10 9 10 1 3 1
3 5 6 0 2 1 1 9 10 9 1 2 1
3 5 5 1 2 1 2 5 4 4 0 9 1
3 6 2 1 4 1 3 6 4 4 0 7 1
3 6 3 1 2 1 2 7 5 5 0 6 1
3 6 5 1 2 1 2 6 7 6 1 6 1
3 6 7 1 3 1 3 5 4 4 0 8 1
3 6 5 1 2 1 2 5 3 3 0 10 1
3 7 8 0 0 1 1 7 6 7 1 5 1
3 7 5 1 2 1 1 5 5 5 0 6 1
3 7 6 1 2 0 4 3 1 2 0 9 0
3 8 6 1 2 1 1 6 5 5 0 7 1
3 8 9 0 0 1 0 7 5 6 1 3 1
4 5 5 1 2 1 1 5 6 5 0 6 1
4 5 5 1 2 0 2 3 3 4 0 8 0
4 6 8 0 0 1 2 8 7 7 1 4 1
4 6 6 1 3 1 2 5 4 4 0 7 0
4 6 5 1 3 1 2 4 3 2 0 8 0
4 7 2 0 3 0 4 3 6 6 1 4 1
4 7 4 1 3 0 3 4 2 1 0 7 0
4 7 7 1 3 0 4 4 5 5 0 7 0
4 7 6 1 3 0 3 3 6 5 0 4 0
5 7 5 1 1 0 4 1 7 4 0 7 1
5 8 1 1 3 0 3 4 8 7 1 5 0
5 8 3 1 3 0 3 4 5 6 1 5 1
5 9 4 1 4 0 3 2 7 5 0 5 1
5 9 6 1 4 0 3 4 6 6 1 7 0
5 10 4 1 3 0 3 4 2 3 0 6 0
1 1 8 0 1 0 2 5 6 5 0 6 1
1 2 7 0 1 1 1 7 8 9 1 5 0
1 2 7 0 1 1 0 7 5 6 1 5 1
1 3 5 0 1 1 2 7 8 8 1 5 0
2 3 3 1 2 1 2 6 7 6 1 6 0
2 3 6 1 1 1 2 7 6 4 0 7 0
2 4 6 1 3 1 3 6 5 5 0 6 0
2 5 4 1 3 1 3 4 4 3 0 6 0
Try:
tbl <- table(td$history,git$class)
tbl
# 0 1
# 0 13 2
# 1 1 34
prop.table(tbl)
# 0 1
# 0 0.26 0.04
# 1 0.02 0.68
These are the classification tables.
Regarding why your "borrowed" code does not run, there are too many possibilities.
First, if you import the data set you provided without column names, R will assign names Vn where n is 1,2,3, etc. But if this was the case none of your code would run as you refer to columns history, control, etc. So at least those must be named properly.
Second, in the line:
ftab<-table(td$history,dt$V6)
you refer to dt$V6. AFAICT there is no dt (is this a typo?).
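
For a version that runs without the original file, here is a minimal sketch on the built-in iris data, which stands in for td (Species plays the role of history). Note that prop.table(tbl) with no margin gives proportions of the grand total. Passing margin = 1 gives proportions within each actual group, which may be closer to a "percent classification" table.

```r
library(MASS)  # provides lda()

# iris is only a stand-in for the question's data
fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit)

# Frequency classification table: actual group by predicted group
freq_tab <- table(actual = iris$Species, predicted = pred$class)

# Percent classification table: each row sums to 100
pct_tab <- round(100 * prop.table(freq_tab, margin = 1), 1)

freq_tab
pct_tab
```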

Cut value in creating table

I have following type of data:
mydata <- data.frame (yvar = rnorm(200, 15, 5), xv1 = rep(1:5, each = 40),
xv2 = rep(1:10, 20))
table(mydata$xv1, mydata$xv2)
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
I want to tabulate again within yvar categories. The following is the cutkey.
cutkey :
< 10 - group 1
10-12 - group 2
12-16 - group 3
>16 - group 4
So there should be one table like the one above for each cutkey group, and I want margin sums in every table.
< 10 - group 1
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
10-12 - group 2
1 2 3 4 5 6 7 8 9 10
1 4 4 4 4 4 4 4 4 4 4
2 4 4 4 4 4 4 4 4 4 4
3 4 4 4 4 4 4 4 4 4 4
4 4 4 4 4 4 4 4 4 4 4
5 4 4 4 4 4 4 4 4 4 4
and so on for all groups
(the numbers will definitely be different)
Is there an easy way to do it?
Yes, using cut, dlply (plyr package) and addmargins:
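library(plyr)
# bin yvar into the four cutkey intervals
mydata$yvar1 <- cut(mydata$yvar,breaks = c(-Inf,10,12,16,Inf))
# one xv1-by-xv2 table per interval, with row and column sums added
dlply(mydata,.(yvar1),function(x) addmargins(table(x$xv1,x$xv2)))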
$`(-Inf,10]`
1 2 3 4 5 6 7 8 9 10 Sum
1 0 0 0 0 0 0 2 0 1 0 3
2 1 1 0 1 0 0 0 0 2 0 5
3 0 1 0 0 1 1 0 2 0 0 5
4 0 0 2 0 1 1 0 1 0 0 5
5 0 1 1 0 1 1 1 0 0 2 7
Sum 1 3 3 1 3 3 3 3 3 2 25
$`(10,12]`
1 2 3 4 6 7 8 9 10 Sum
1 0 0 0 1 2 0 0 0 0 3
2 0 0 1 0 0 1 0 0 1 3
3 0 1 0 1 1 2 0 0 1 6
4 0 1 0 0 0 0 0 0 0 1
5 1 0 1 1 1 0 1 1 2 8
Sum 1 2 2 3 4 3 1 1 4 21
$`(12,16]`
1 2 3 4 5 6 7 8 9 10 Sum
1 2 3 1 1 1 2 0 3 0 2 15
2 0 1 0 1 3 3 2 0 0 1 11
3 3 1 3 1 0 0 0 2 4 1 15
4 3 2 1 2 2 0 1 1 4 1 17
5 3 1 1 2 0 1 1 1 1 0 11
Sum 11 8 6 7 6 6 4 7 9 5 69
$`(16, Inf]`
1 2 3 4 5 6 7 8 9 10 Sum
1 2 1 3 2 3 0 2 1 3 2 19
2 3 2 3 2 1 1 1 4 2 2 21
3 1 1 1 2 3 2 2 0 0 2 14
4 1 1 1 2 1 3 3 2 0 3 17
5 0 2 1 1 3 1 2 2 2 0 14
Sum 7 7 9 9 11 7 10 9 7 9 85
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
yvar1
1 (-Inf,10]
2 (10,12]
3 (12,16]
4 (16, Inf]
You can adjust the breaks argument to cut to get the values just how you want them. (Although the margin sums you display in your question don't look like margin sums at all.)
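
A base R sketch of the same approach, for readers without plyr. The column name group, the labels and set.seed(1) are choices made for this sketch and are not in the original. Converting xv1 and xv2 to factors keeps every row and column in each table even when a level has no observations in that group (compare the (10,12] table above, where column 5 is absent).

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
mydata <- data.frame(yvar = rnorm(200, 15, 5),
                     xv1 = rep(1:5, each = 40),
                     xv2 = rep(1:10, 20))

# Bin yvar using the cutkey; intervals are right-closed by default,
# pass right = FALSE to cut() if the boundaries should fall the other way
mydata$group <- cut(mydata$yvar,
                    breaks = c(-Inf, 10, 12, 16, Inf),
                    labels = paste("group", 1:4))

# Factors keep all 5 x 10 cells in every table, including empty ones
mydata$xv1 <- factor(mydata$xv1)
mydata$xv2 <- factor(mydata$xv2)

# One table per group, each with margin sums
tabs <- lapply(split(mydata, mydata$group),
               function(d) addmargins(table(d$xv1, d$xv2)))

tabs[["group 1"]]
```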
