I'm trying to run a Shapiro-Wilk test on the variable 'Size' for subsets of a dataset that I split with ddply (by the variables 'Site' and 'Category'), but I keep getting an error message.
Here's a sample of my dataset (d). I have 4237 observations with 9 categories and 13 sites:
Site Genus Size Category
Arn01 ACR 4 ACR
Arn01 ACR 7 ACR
Arn02 ACR 3 ACR
I created a function for the Shapiro-Wilk test:
shap.w <- function(input) {  # Shapiro-Wilk test function
  if (sum(!is.na(input$Size)) > 3 & sum(!is.na(input$Size)) < 5000) {
    p <- shapiro.test(input$Size)$p.value
    return(p)
  } else {
    return(NA)
  }
}
Then, I try to apply the function to subsets of my data using ddply:
sw_test <- ddply(d, .(Site, Category), .fun = shap.w)
But when I do, I get an error message that says:
Error in shapiro.test(input$Size) : all 'x' values are identical
Even though they're clearly not. Any help/advice would be much appreciated.
ETA: output of dput(d[1:20,]):
> dput(d[1:20,])
structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Arn01n",
"Arn02n", "Arn03n", "Arn04n", "Arn05n", "Arn06n", "Arn07n", "Arn08n",
"Arn09n", "Arn10n", "Arn11n", "Arn12n", "Arn13n"), class = "factor"),
Genus = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 30L, 30L, 30L, 30L), .Label = c("ACA",
"ACR", "AST", "COS", "CYP", "ECH", "FUN", "FVA", "FVT", "GAR",
"GON", "HEL", "HYD", "ISO", "LEA", "LEO", "LEP", "LOB", "MER",
"MNT", "MST", "MYC", "PAV", "PBR", "PLA", "PLAT", "POC",
"POD", "PRE", "PRM", "PRS", "PSA", "SAR", "STY"), class = "factor"),
Size = c(4, 2, 4, 4, 3, 5, 5, 4, 4, 4, 4, 3, 6, 3, 4, 5,
2, 3, 3, 6), Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 8L, 8L, 8L, 8L), .Label = c("ACR",
"FAV", "FUN", "HEL", "ISO", "MNT", "POC", "PRM", "PRS"), class = "factor")),
.Names = c("Site",
"Genus", "Size", "Category"), row.names = c(NA, 20L), class = "data.frame")
ETA: output of table(d$Size):
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 23 24 25 26 27 28 29 30 31 33 35 36 37 38 39
14 271 525 548 521 424 201 206 50 357 23 95 36 7 171 11 14 30 4 145 11 21 5 46 4 1 5 1 95 1 2 31 3 1 2 1
40 41 42 43 44 45 46 48 50 51 53 55 56 57 60 62 63 65 66 70 72 75 76 80 82 83 85 88 90 94 95 100 105 110 120 125
80 1 9 3 4 22 1 4 42 1 1 4 1 3 64 3 5 9 4 13 1 2 1 20 2 2 2 1 5 1 2 17 1 2 6 2
128 130 143 150 155 160 180 200 230 300 890 920
1 1 1 1 1 1 1 2 1 1 1 1
Note that if you return NA, then is.numeric will give FALSE, because a bare NA is logical. Try is.numeric(NA) to see this.
You could return NA_real_ instead:
is.numeric(NA)
[1] FALSE
is.numeric(NA_real_)
[1] TRUE
It's still an NA though:
is.na(NA_real_)
[1] TRUE
However, as.numeric should also fix that problem (perhaps double check what's being returned to ddply by your function given the inputs)
Okay, thanks to the help I received in the comments, I was able to solve this problem by updating the function to read:
shap.w <- function(input) {  # Shapiro-Wilk test function
  sizes <- input$Size[!is.na(input$Size)]  # non-missing sizes in this subset
  if (length(unique(sizes)) > 3 && length(unique(sizes)) < 5000) {
    p <- shapiro.test(sizes)$p.value
    return(p)
  } else {
    return(NA)
  }
}
This skips the Site/Category combinations with 3 or fewer distinct Size values, or with 5000 or more (although I won't have any that large in this dataset), so a subset whose values are all identical never reaches shapiro.test. Once I updated this, the ddply line ran without any problems. Thank you all for your help!
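For reference, the guard can also be written directly against the conditions under which shapiro.test stops: fewer than 3 or more than 5000 non-missing values, or all values identical. The following is only a sketch on made-up toy data, using base R split/sapply in place of ddply; the names toy and safe_shapiro are hypothetical and not part of the original code.

```r
# Hypothetical toy data: group "b" has identical Size values, which is the
# situation that makes shapiro.test() stop with "all 'x' values are identical".
toy <- data.frame(
  Site = rep(c("a", "b"), each = 5),
  Size = c(4, 2, 4, 3, 6, 5, 5, 5, 5, 5)
)

# Return the Shapiro-Wilk p-value, or NA_real_ when the test cannot be run.
safe_shapiro <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  # shapiro.test() needs 3 to 5000 non-missing values that are not all equal
  if (n < 3 || n > 5000 || length(unique(x)) < 2) {
    return(NA_real_)
  }
  shapiro.test(x)$p.value
}

# One p-value per group; the constant group yields NA instead of an error
p_values <- sapply(split(toy$Size, toy$Site), safe_shapiro)
print(p_values)
```

Returning NA_real_ rather than NA keeps the result numeric for every group, which is the point made in the answer above.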
Good afternoon,
Assume we have the following dataset from UCI:
ballons=structure(list(YELLOW = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("PURPLE",
"YELLOW"), class = "factor"), SMALL = structure(c(2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L
), .Label = c("LARGE", "SMALL"), class = "factor"), STRETCH = structure(c(2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
1L, 1L), .Label = c("DIP", "STRETCH"), class = "factor"), ADULT = structure(c(1L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,
1L, 2L), .Label = c("ADULT", "CHILD"), class = "factor"), T = c(TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)), class = "data.frame", row.names = c(NA,
-19L))
# output :
YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH ADULT TRUE
2 YELLOW SMALL STRETCH CHILD FALSE
3 YELLOW SMALL DIP ADULT FALSE
4 YELLOW SMALL DIP CHILD FALSE
5 YELLOW LARGE STRETCH ADULT TRUE
6 YELLOW LARGE STRETCH ADULT TRUE
7 YELLOW LARGE STRETCH CHILD FALSE
8 YELLOW LARGE DIP ADULT FALSE
9 YELLOW LARGE DIP CHILD FALSE
10 PURPLE SMALL STRETCH ADULT TRUE
11 PURPLE SMALL STRETCH ADULT TRUE
12 PURPLE SMALL STRETCH CHILD FALSE
13 PURPLE SMALL DIP ADULT FALSE
14 PURPLE SMALL DIP CHILD FALSE
15 PURPLE LARGE STRETCH ADULT TRUE
16 PURPLE LARGE STRETCH ADULT TRUE
17 PURPLE LARGE STRETCH CHILD FALSE
18 PURPLE LARGE DIP ADULT FALSE
19 PURPLE LARGE DIP CHILD FALSE
Assume also that I applied a clustering algorithm and got results like the following:
clusterss=data.frame(index=1:19,class=c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))
> clusterss
index class
1 1 1
2 2 2
3 3 3
4 4 3
5 5 3
6 6 2
7 7 3
8 8 1
9 9 2
10 10 3
11 11 3
12 12 2
13 13 2
14 14 3
15 15 2
16 16 2
17 17 1
18 18 1
19 19 2
Here the index variable identifies the rows of ballons, and class is the cluster that each row belongs to.
I know that we can compute the frequencies of all categorical variables with:
> sapply(ballons,table)
y1 y2 y3 y4 y5
PURPLE 10 10 8 11 12
YELLOW 9 9 11 8 7
However, I need to compute this for each cluster independently. This means that, for each class, I need to select the associated observations and then compute the frequencies. For example, with class == 1:
# Expected results for the first cluster : class == 1
result1 <- filter(clusterss, class == 1)
sapply(ballons[result1[,1],],table)
y1 y2 y3 y4 y5
PURPLE 2 3 2 3 3
YELLOW 2 1 2 1 1
# Expected results for the second cluster : class == 2
result2 <- filter(clusterss, class == 2)
sapply(ballons[result2[,1],],table)
y1 y2 y3 y4 y5
PURPLE 5 5 3 4 5
YELLOW 3 3 5 4 3
# Expected results for the third cluster : class == 3
result3 <- filter(clusterss, class == 3)
sapply(ballons[result3[,1],],table)
y1 y2 y3 y4 y5
PURPLE 3 2 3 4 4
YELLOW 4 5 4 3 3
I'm looking for an efficient way to obtain such results (maybe with the select function of dplyr).
Thank you for your help!
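You can pass an additional vector, here clusterss$class, to table. In the sapply output each consecutive pair of rows is one cluster (rows 1-2 are class 1, rows 3-4 class 2, rows 5-6 class 3):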
sapply(ballons,table, clusterss$class)
#lapply(ballons,table, clusterss$class) #Alternative
# YELLOW SMALL STRETCH ADULT T
#[1,] 2 3 2 3 3
#[2,] 2 1 2 1 1
#[3,] 5 5 3 4 5
#[4,] 3 3 5 4 3
#[5,] 3 2 3 4 4
#[6,] 4 5 4 3 3
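If the flattened matrix is hard to read, the lapply alternative keeps the level and cluster labels. Below is a minimal sketch on made-up data; toy and cl are hypothetical stand-ins for ballons and clusterss$class.

```r
# Hypothetical toy data standing in for `ballons` and `clusterss$class`
toy <- data.frame(
  colour = factor(c("YELLOW", "YELLOW", "PURPLE", "PURPLE", "PURPLE", "YELLOW")),
  size   = factor(c("SMALL", "LARGE", "SMALL", "LARGE", "SMALL", "SMALL"))
)
cl <- c(1, 2, 1, 2, 1, 2)

# One two-way table per column: rows are the levels, columns are the clusters
per_column <- lapply(toy, table, cl)
print(per_column$colour)

# Alternatively, split the rows by cluster first, then tabulate each column
by_cluster <- lapply(split(toy, cl), function(g) lapply(g, table))
print(by_cluster[["1"]]$size)
```

Because the columns are factors, levels that do not occur in a cluster still appear with a count of 0.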
What I am trying to do is close to propensity score matching (or causal matching, MatchIt) but not quite the same.
I am simply interested in finding and gathering together the closest (pairwise) observations from a dataset with mixed variables (categorical and numerical).
The dataset looks like this:
id child age edu y
1 11011209 0 69 some college 495
2 11011212 0 44 secondary/primary 260
3 11011213 1 40 some college 175
4 11020208 1 47 secondary/primary 0
5 11020212 1 50 secondary/primary 25
6 11020310 0 65 secondary/primary 525
7 11020315 1 43 college 0
8 11020316 1 41 secondary/primary 5
9 11031111 0 49 secondary/primary 275
10 11031116 1 42 secondary/primary 0
11 11031119 0 32 college 425
12 11040801 1 38 secondary/primary 0
13 11040814 0 52 some college 260
14 11050109 0 59 some college 405
15 11050111 1 35 secondary/primary 20
16 11050113 0 51 secondary/primary 40
17 11051001 1 38 college 165
18 11051004 1 36 college 10
19 11051011 0 63 secondary/primary 455
20 11051018 0 44 college 40
What I want is to match the variables {child, age, edu} but not y (nor id).
Because my dataset has mixed variables, I can use the Gower distance:
library(cluster)   # daisy()
library(dplyr)     # %>%, group_by(), filter(), mutate(), ...
library(reshape2)  # melt()
# test on first ten observations
dt = dt[1:10, ]
# gower distance
ddmen = daisy(dt[,-c(1,5)], metric = 'gower')
Now, I want to retrieve the closest observations
mg = as.matrix(ddmen)
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>% mutate(m =
min(value)) %>% mutate(closest = Var1[m == value]) %>% as.data.frame()
close = mgg %>% dplyr::select(Var2, closest, dis = m) %>% distinct()
close gives me
Var2 closest dis
1 1 6 0.37931034
2 2 9 0.05747126
3 3 8 0.34482759
4 4 5 0.03448276
5 5 4 0.03448276
6 6 9 0.18390805
7 7 10 0.34482759
8 8 10 0.01149425
9 9 2 0.05747126
10 10 8 0.01149425
I can merge close to my original data
dt$id = 1:10
dt2 = merge(dt, close, by.x = 'id', by.y = 'Var2', all = T)
Then, bind it
vlist = vector('list', 10)
for(i in 1:10){
vlist[[i]] = dt2[ c( which(dt2$id == i), dt2$closest[dt2$id == i] ), ] %>%
mutate(p = i)
}
bind_rows(vlist)
and get
id child age edu y closest dis p
1 1 0 69 some college 495 6 0.37931034 1
2 6 0 65 secondary/primary 525 9 0.18390805 1
3 2 0 44 secondary/primary 260 9 0.05747126 2
4 9 0 49 secondary/primary 275 2 0.05747126 2
...
p is then the identifier of each matched pair, taken from the id of the observation that was matched. Note that an individual can appear in several pairs, because closest matches are not necessarily symmetrical: the closest match of 1 is 6, while the closest match of 6 is 9.
Questions
First, there is a little bug in the code here:
mgg = mg %>% melt() %>% group_by(Var2) %>% filter(value != 0) %>% mutate(m =
min(value)) %>% mutate(closest = Var1[m == value]) %>% as.data.frame()
I get this error message: Column closest must be length 19 (the group size) or one, not 2
The code works for the first 10 observations but not for all 20 (complete dataset provided below).
Why?
Second, is there a package available to do this automatically?
dt = structure(list(id = c(11011209L, 11011212L, 11011213L, 11020208L,
11020212L, 11020310L, 11020315L, 11020316L, 11031111L, 11031116L,
11031119L, 11040801L, 11040814L, 11050109L, 11050111L, 11050113L,
11051001L, 11051004L, 11051011L, 11051018L), child = structure(c(1L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L,
2L, 1L, 1L), .Label = c("0", "1"), class = "factor"), age = c(69L,
44L, 40L, 47L, 50L, 65L, 43L, 41L, 49L, 42L, 32L, 38L, 52L, 59L,
35L, 51L, 38L, 36L, 63L, 44L), edu = structure(c(3L, 2L, 3L,
2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 2L, 2L, 1L, 1L, 2L,
1L), .Label = c("college", "secondary/primary", "some college"
), class = "factor"), y = c(495, 260, 175, 0, 25, 525, 0, 5,
275, 0, 425, 0, 260, 405, 20, 40, 165, 10, 455, 40)), class = "data.frame",
.Names = c("id",
"child", "age", "edu", "y"), row.names = c(NA, -20L))
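Judging from the message, a likely cause of the error is a tie: when two observations are equally close to a third, Var1[m == value] returns two values for that group. One way around it, assuming it is acceptable to keep only the first of the tied matches, is to work on the distance matrix directly with which.min. The small matrix below is made up for illustration; with the real data, mg would be as.matrix(ddmen).

```r
# Hypothetical 4 x 4 distance matrix with a tie: observation 1 is equally
# close to observations 2 and 3.
mg <- matrix(c(0.0, 0.1, 0.1, 0.5,
               0.1, 0.0, 0.3, 0.4,
               0.1, 0.3, 0.0, 0.2,
               0.5, 0.4, 0.2, 0.0), nrow = 4, byrow = TRUE)

diag(mg) <- Inf                     # never match an observation to itself
closest <- apply(mg, 2, which.min)  # first minimum per column, so a tie gives one index
dis <- mg[cbind(closest, seq_len(ncol(mg)))]  # distance to that closest match

close <- data.frame(Var2 = seq_len(ncol(mg)), closest = closest, dis = dis)
print(close)
```

Setting the diagonal to Inf takes the place of filter(value != 0) for the self-distances, and which.min always returns a single index, so each observation gets exactly one match.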
I have a large survey dataset which looks as follows:
trust09 q16a q16b q16c q16f q16g q23e
1 5A3 3 3 3 4 3 3
2 5A3 2 2 2 2 3 2
3 5A3 4 4 4 5 5 5
4 5A3 3 3 2 4 4 3
5 5A3 NA NA NA NA NA NA
6 5A3 4 4 4 4 4 3
....
....
159524 TAN 2 2 3 4 4 3
159525 TAN 4 3 2 1 3 3
159526 TAN 4 4 4 4 4 4
159527 TAN 4 NA 4 2 3 4
159528 TAN 4 4 4 4 4 4
159529 TAN 4 4 4 5 4 5
trust09 is the code for the hospital or organisation; the remaining columns are survey questions scored from 1 (strongly disagree) to 5 (strongly agree).
Each row corresponds to a response from a respondent belonging to some hospital.
From this data, I want to calculate the positive response rate (PRR) for each survey question for each hospital, i.e. the number of respondents who answered 'Agree' (4) or 'Strongly Agree' (5), expressed as a percentage of the total no. of respondents.
I can get the total no. of respondents quite easily from the following code:
df0 <- nss08 %>% select(trust09, q16a, q16b, q16c, q16f, q16g, q23e) %>%
group_by(trust09) %>%
summarise_all(funs(length(.)))
Which gives me the following table:
trust09 q16a q16b q16c q16f q16g q23e
<chr> <int> <int> <int> <int> <int> <int>
1 5A3 414 414 414 414 414 414
2 5A4 298 298 298 298 298 298
3 5A5 271 271 271 271 271 271
4 5A7 384 384 384 384 384 384
5 5A8 343 343 343 343 343 343
6 5A9 502 502 502 502 502 502
I can easily count the number of responses for 'Agree'(4) and 'Strongly Agree' (5) for a single survey question using the following code:
df1 <- nss08 %>%
select(trust09, q16a) %>%
group_by(trust09) %>%
filter(q16a == 4|q16a == 5) %>%
summarise_all(funs(length(.)))
which gives this sample data:
trust09 q16a
<chr> <int>
1 5A3 124
2 5A4 65
3 5A5 107
4 5A7 142
5 5A8 126
6 5A9 159
....
I also get the same result using:
aggregate(q16a ~ trust09, data = nss08[nss08$q16a == 4|nss08$q16a == 5, ], length)
I then simply merge these two data frames and calculate the PRR for the variable/question q16a, i.e. the no. of respondents who answered 'Agree' (4) or 'Strongly Agree' (5) for this question, divided by the total responses for the question and then multiplied by 100.
The problem occurs when I wish to do the same for all the remaining variables simultaneously, rather than writing one block of code per variable.
I have tried the following, but I get an error message:
myList <- vector("list", length = length(myVars))
for (x in seq_along(myVars)){
myList[x] <- aggregate(myVars[x] ~ trust09, data = nss08[nss08$myVars[,x] == 4|nss08$myVars[,x] == 5, ], length)}
I have also tried this without any success:
for (x in seq_along(myVars)){
myList[[x]] <- nss08 %>%
select(trust09, myVars[x]) %>%
group_by(trust09) %>%
filter(myVars[x] == 4|myVars[x] == 5) %>%
summarise(length(myVars[x]))
}
Maybe you can see from the code what I am trying to do here.
How can I do the whole process more efficiently, with less code, and end up with a data frame that contains the positive response rate for each of the variables/survey questions?
Thank you.
Based on your dplyr code, I wrote this function; you can call it in a for loop or with lapply:
library(dplyr)
xx <- function(arg) {
  # arg is a column name given as a string, e.g. 'q16a'
  df %>%
    select(trust09, one_of(arg)) %>%
    group_by(trust09) %>%
    filter(.data[[arg]] %in% 4:5) %>%  # keep 'Agree' (4) and 'Strongly Agree' (5)
    summarise(n_positive = n())        # number of positive responses per trust09
}
xx('q16a')
# with the 12-row sample df shown further down: 5A3 -> 2, TAN -> 5
Assuming your data frame contains trust09 and all other columns correspond to questions you want to summarize, you can use summarise_all, count the number of 4 and 5 responses with sum(col %in% 4:5, na.rm=TRUE), and divide it by length(col) directly (multiply by 100 for a percentage):
df %>% group_by(trust09) %>% summarise_all(~ sum(. %in% 4:5, na.rm = T)/length(.))
# here . refers to all other columns individually except the group variable
# A tibble: 2 x 7
# trust09 q16a q16b q16c q16f q16g q23e
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 5A3 0.3333333 0.3333333 0.3333333 0.6666667 0.5000000 0.1666667
#2 TAN 0.8333333 0.5000000 0.6666667 0.6666667 0.6666667 0.6666667
Data used:
dput(df)
structure(list(trust09 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("5A3", "TAN"), class = "factor"),
q16a = c(3L, 2L, 4L, 3L, NA, 4L, 2L, 4L, 4L, 4L, 4L, 4L),
q16b = c(3L, 2L, 4L, 3L, NA, 4L, 2L, 3L, 4L, NA, 4L, 4L),
q16c = c(3L, 2L, 4L, 2L, NA, 4L, 3L, 2L, 4L, 4L, 4L, 4L),
q16f = c(4L, 2L, 5L, 4L, NA, 4L, 4L, 1L, 4L, 2L, 4L, 5L),
q16g = c(3L, 3L, 5L, 4L, NA, 4L, 4L, 3L, 4L, 3L, 4L, 4L),
q23e = c(3L, 2L, 5L, 3L, NA, 3L, 3L, 3L, 4L, 4L, 4L, 5L)), .Names = c("trust09",
"q16a", "q16b", "q16c", "q16f", "q16g", "q23e"), class = "data.frame", row.names = c(NA,
12L))
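The same rate can also be sketched in base R with aggregate, here scaled to a percentage as the question asks. The toy data frame and the column names q1 and q2 are made up for illustration; as in the summarise_all answer, missing answers count in the denominator but not in the numerator.

```r
# Hypothetical toy survey: two organisations, two questions scored 1 to 5
toy <- data.frame(
  trust09 = c("A", "A", "A", "A", "B", "B"),
  q1 = c(4, 5, 2, NA, 4, 1),
  q2 = c(1, 4, 3, 3, 5, 5)
)

# Percentage of answers equal to 4 or 5, per organisation and question.
# na.action = na.pass keeps rows with missing answers, so NA counts towards
# the total number of respondents (NA %in% 4:5 is FALSE).
prr <- aggregate(. ~ trust09, data = toy,
                 FUN = function(x) 100 * mean(x %in% 4:5),
                 na.action = na.pass)
print(prr)
```

To leave missing answers out of the denominator instead, the function could be changed to something like 100 * mean(x[!is.na(x)] %in% 4:5).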
Does anyone know how I can split a column into multiple columns?
For example: I want to spread the column "score" across new columns, one per value of the column "grade", with those values as the column names. In my data, I have 50 different values in column "grade" instead of only two as in the example below.
In data frame 2, the row names are the values of column "class" in data frame 1.
data frame 1
class grade score
A a 12
B a 45
C a 75
D a 18
E a 6
A b 45
B b 92
C b 78
D b 36
E b 39
data frame 2
a b
A 12 45
B 45 92
C 75 78
D 18 36
E 6 39
Base R's unstack does this out of the box:
unstack(df, score ~ grade)
# a b
#1 12 45
#2 45 92
#3 75 78
#4 18 36
#5 6 39
As does xtabs:
as.data.frame.matrix(xtabs(score ~ class + grade, data=df))
# a b
#A 12 45
#B 45 92
#C 75 78
#D 18 36
#E 6 39
library(reshape2)
dcast(df, class ~ grade, value.var = "score")
class a b
1 1 12 45
2 2 45 92
3 3 75 78
4 4 18 36
5 5 6 39
df <- structure(list(class = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L,
5L), grade = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), .Label = c("a", "b"), class = "factor"), score = c(12L,
45L, 75L, 18L, 6L, 45L, 92L, 78L, 36L, 39L)), .Names = c("class",
"grade", "score"), class = "data.frame", row.names = c(NA, -10L
))
Another option is spread from tidyr:
library(tidyr)
spread(df, grade, score)  # df is the data frame defined above
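For completeness, base R's reshape can produce the same wide layout without extra packages. This is only a sketch on a shortened copy of the example data; the renaming step assumes the default score.a / score.b column names that reshape generates.

```r
# Shortened copy of the example data (three classes, two grades)
df <- data.frame(
  class = rep(c("A", "B", "C"), times = 2),
  grade = rep(c("a", "b"), each = 3),
  score = c(12, 45, 75, 45, 92, 78)
)

# Long to wide: one row per class, one score column per grade
wide <- reshape(df, idvar = "class", timevar = "grade", direction = "wide")

# Tidy up: strip the "score." prefix and move class into the row names
names(wide) <- sub("^score\\.", "", names(wide))
rownames(wide) <- wide$class
wide$class <- NULL
print(wide)
```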
I would like to create a subset of data that consists of Units that have a higher score in QTR 4 than QTR 1 (upward trend). Doesn't matter if QTR 2 or 3 are present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time
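If the data frame is named "d", then this works on your test set, including Units that lack QTR 1 or QTR 4:
# For each Unit: TRUE when both quarters exist and the QTR 4 score is higher
up <- sapply(split(d, d$Unit), function(dd) {
  q1 <- dd$Score[dd$QTR == 1]
  q4 <- dd$Score[dd$QTR == 4]
  length(q1) == 1 && length(q4) == 1 && q4 > q1
})
d[d$Unit %in% names(up)[up], ]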
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
An alternative in two steps (here the data frame is named test):
result <- unlist(
  by(
    test,
    test$Unit,
    function(x) x$Score[x$QTR == 4] > x$Score[x$QTR == 1])  # compare QTR 4 with QTR 1
)
test[test$Unit %in% names(result[result == TRUE]), ]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
A solution using data.table (there are probably better versions than this one).
Note: this assumes each QTR value occurs at most once per Unit.
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
library(data.table)
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
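If the full rows are wanted rather than only the Score column, one more base R sketch is to pair each Unit's QTR 1 and QTR 4 scores with merge and then filter the original data. It uses the df from the data.table answer; the names q1, q4, both and up_units are hypothetical. Units missing either quarter drop out because merge keeps only Units present in both pieces.

```r
df <- data.frame(
  Unit  = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L, 3L),
  QTR   = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L),
  Score = c(34L, 22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)
)

# QTR 1 and QTR 4 scores side by side, only for Units that have both
q1 <- df[df$QTR == 1, c("Unit", "Score")]
q4 <- df[df$QTR == 4, c("Unit", "Score")]
both <- merge(q1, q4, by = "Unit", suffixes = c(".q1", ".q4"))

# Units with an upward trend, then all of their rows ordered by quarter
up_units <- both$Unit[both$Score.q4 > both$Score.q1]
upward_subset <- df[df$Unit %in% up_units, ]
upward_subset <- upward_subset[order(upward_subset$Unit, upward_subset$QTR), ]
print(upward_subset)
```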