Getting "value not found" while running chisq_test using dplyr?

I have two factors, day and group. The value column holds the total count for each group on each day.
x <- c(5, 17, 31, 9, 17, 10, 30, 28, 16, 29, 14, 34)
y <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
day <- as.character(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12))
df1 <- data.frame(x, y, day)
df2 <- reshape2::melt(df1, id.vars = 'day')
colnames(df2)[2] <- "group"
> df2
day group value
1 1 x 5
2 2 x 17
3 3 x 31
4 4 x 9
5 5 x 17
6 6 x 10
7 7 x 30
8 8 x 28
9 9 x 16
10 10 x 29
11 11 x 14
12 12 x 34
13 1 y 1
14 2 y 2
15 3 y 3
16 4 y 4
17 5 y 5
18 6 y 6
19 7 y 7
20 8 y 8
21 9 y 9
22 10 y 10
23 11 y 11
24 12 y 12
So in the above example, on day 1 the total is 5 for group x and 1 for group y. I would like to test for significance between x and y on each unique day, perhaps with a chi-square test?
I run the code as follows, but for some reason it keeps reporting that the column is not found:
df2 %>%
  group_by(day) %>%
  rstatix::chisq_test(day, group) %>%
  add_significance("p.adj")
Can anyone help with this? Thanks in advance.

Perhaps this helps. rstatix::chisq_test() expects a vector or table of counts rather than bare column names, which is likely why the columns are reported as not found. One option is to split the data by day and run a goodness-of-fit test on each day's named count vector:
library(dplyr)
library(purrr)
library(rstatix)
df2 %>%
  mutate(day = as.numeric(day)) %>%
  split(.$day) %>%
  # setNames(value, group) turns each day's two counts into a named vector
  map_dfr(~ with(.x, chisq_test(setNames(value, group))), .id = 'day')
Output:
# A tibble: 12 × 7
day n statistic p df method p.signif
<chr> <int> <dbl> <dbl> <dbl> <chr> <chr>
1 1 2 2.67 0.102 1 Chi-square test ns
2 2 2 11.8 0.000579 1 Chi-square test ***
3 3 2 23.1 0.00000157 1 Chi-square test ****
4 4 2 1.92 0.166 1 Chi-square test ns
5 5 2 6.55 0.0105 1 Chi-square test *
6 6 2 1 0.317 1 Chi-square test ns
7 7 2 14.3 0.000156 1 Chi-square test ***
8 8 2 11.1 0.000858 1 Chi-square test ***
9 9 2 1.96 0.162 1 Chi-square test ns
10 10 2 9.26 0.00235 1 Chi-square test **
11 11 2 0.36 0.549 1 Chi-square test ns
12 12 2 10.5 0.00118 1 Chi-square test **
Or you could use group_modify:
df2 %>%
  group_by(day = as.numeric(day)) %>%
  group_modify(~ with(.x, chisq_test(setNames(value, group)))) %>%
  ungroup
# A tibble: 12 × 7
day n statistic p df method p.signif
<dbl> <int> <dbl> <dbl> <dbl> <chr> <chr>
1 1 2 2.67 0.102 1 Chi-square test ns
2 2 2 11.8 0.000579 1 Chi-square test ***
3 3 2 23.1 0.00000157 1 Chi-square test ****
4 4 2 1.92 0.166 1 Chi-square test ns
5 5 2 6.55 0.0105 1 Chi-square test *
6 6 2 1 0.317 1 Chi-square test ns
7 7 2 14.3 0.000156 1 Chi-square test ***
8 8 2 11.1 0.000858 1 Chi-square test ***
9 9 2 1.96 0.162 1 Chi-square test ns
10 10 2 9.26 0.00235 1 Chi-square test **
11 11 2 0.36 0.549 1 Chi-square test ns
12 12 2 10.5 0.00118 1 Chi-square test **
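As a sanity check, you can reproduce the day 1 row by hand with base R's chisq.test, the goodness-of-fit test that rstatix wraps; with counts 5 and 1, the expected counts under equal proportions are 3 and 3:
# day 1: compare the observed counts c(x = 5, y = 1) against equal expected counts
chisq.test(c(5, 1))
#> X-squared = 2.6667, df = 1, p-value = 0.1025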

Related

Flag run-length of grouped intervals

I have a dataframe grouped by grp:
df <- data.frame(
  v = rnorm(25),
  grp = c(rep("A", 10), rep("B", 15)),
  size = 2)
I want to flag the run-length of intervals determined by size. For example, for grp == "A", size is 2, and the number of rows is 10. So the interval should have length 10/2 = 5. This code, however, creates intervals with length 2:
df %>%
  group_by(grp) %>%
  mutate(
    interval = (row_number() - 1) %/% size)
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 1
4 -0.913 A 2 1
5 0.486 A 2 2
6 -1.80 A 2 2
7 -0.370 A 2 3
8 -0.209 A 2 3
9 -0.661 A 2 4
10 -0.177 A 2 4
# … with 15 more rows
How can I flag the correct run-length of the size-determined intervals? The desired output is this:
# A tibble: 25 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <dbl>
1 -0.166 A 2 0
2 -1.12 A 2 0
3 0.941 A 2 0
4 -0.913 A 2 0
5 0.486 A 2 0
6 -1.80 A 2 1
7 -0.370 A 2 1
8 -0.209 A 2 1
9 -0.661 A 2 1
10 -0.177 A 2 1
# … with 15 more rows
If I interpreted your question correctly, this small change should do the trick?
df %>%
  group_by(grp) %>%
  mutate(
    interval = (row_number() - 1) %/% (n() / size))
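To see why this works: for grp "A", n() is 10 and size is 2, so the divisor is 10 / 2 = 5 and the zero-based row numbers fall into five-row blocks:
# zero-based row numbers, integer-divided by the block length
(seq_len(10) - 1) %/% (10 / 2)
#> [1] 0 0 0 0 0 1 1 1 1 1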
You can use gl:
df %>%
  group_by(grp) %>%
  mutate(interval = gl(first(size), ceiling(n() / first(size)))[1:n()])
Output:
# A tibble: 26 × 4
# Groups: grp [2]
v grp size interval
<dbl> <chr> <dbl> <fct>
1 -1.12 A 2 1
2 3.04 A 2 1
3 0.235 A 2 1
4 -0.0333 A 2 1
5 -2.73 A 2 1
6 -0.0998 A 2 1
7 0.976 A 2 2
8 0.414 A 2 2
9 0.912 A 2 2
10 1.98 A 2 2
11 1.17 A 2 2
12 -0.509 B 2 1
13 0.704 B 2 1
14 -0.198 B 2 1
15 -0.538 B 2 1
16 -2.86 B 2 1
17 -0.790 B 2 1
18 0.488 B 2 1
19 2.17 B 2 1
20 0.501 B 2 2
21 0.620 B 2 2
22 -0.966 B 2 2
23 0.163 B 2 2
24 -2.08 B 2 2
25 0.485 B 2 2
26 0.697 B 2 2
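For reference, gl(n, k) generates a factor with n levels, each repeated k times, so gl(first(size), ceiling(n() / first(size))) builds the block labels and [1:n()] trims them to the group's row count:
gl(2, 6)
#> [1] 1 1 1 1 1 1 2 2 2 2 2 2
#> Levels: 1 2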

Calculating R squared from multiple columns

I'm very new to R and have been trying to figure out how to calculate R^2 from a few columns within a large data set of 300+ columns.
Example:
rcalc <- data.frame('x1' = c(694, 702, 701), 'x2' = c(652, 659, 655),
                    'x3' = c(614, 612, 613), 'y1' = c(17.97, 17.95, 17.96),
                    'y2' = c(12.03, 12.0, 12.1), 'y3' = c(0.09, 0.1, 0.1))
From here I am stuck.
The formula in Excel I can do, and it looks like this:
RSQ(X1:X3, Y1:Y3) or RSQ(694:652:614, 17.97:12.03:0.09)
So, each row needs its own R^2. I was able to use the lm command, but only for one row:
I had to take the values from each x column (x1:x3) and stack them into one column, then stack the values from each y column (y1:y3) into another column. Then I performed the following:
rsqrd <- lm(x ~ y, data = rcalc)
summary(rsqrd)$r.squared
This worked but again, only for 1 row. I'm not sure how to do this for thousands of rows. I hope this wasn't too confusing. Any help is greatly appreciated.
Troubleshooting:
with pivot_longer:
row col obs value
1 c 300_0 DUT Ip2_comp 784.9775
1 c 300_12 DUT Ip2_comp 864.4234
1 c 300_18 DUT Ip2_comp 919.3384
1 c 300_0 REF O2 0.09
1 c 300_12 REF O2 11.95
1 c 300_18 REF O2 17.98
2 c 300_0 DUT Ip2_comp 781.5785
2 c 300_12 DUT Ip2_comp 865.5541
2 c 300_18 DUT Ip2_comp 921.0646
2 c 300_0 REF O2 0.09
with pivot_wider:
row obs c
1 300_0 DUT Ip2_comp 784.9775
1 300_12 DUT Ip2_comp 864.4234
1 300_18 DUT Ip2_comp 919.3384
1 300_0 REF O2 0.09
1 300_12 REF O2 11.95
1 300_18 REF O2 17.98
2 300_0 DUT Ip2_comp 781.5785
2 300_12 DUT Ip2_comp 865.5541
2 300_18 DUT Ip2_comp 921.0646
I'm sure this could be done more concisely, but here's one approach using tidyverse functions. First, I do some reshaping to add a row number and make it into a longer shape, with columns for row, observation # (1-3), x, and y.
Then I "nest" all the data except row number so that I can run a separate regression on each row's data, and then extract r squared (and a variety of other stats) from each regression.
library(tidyverse)
rcalc %>% # your data
  # reshape to get matched columns for all x and for all y values
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = c("col", "obs"), names_sep = 1) %>% # split column name into two fields after first character
  pivot_wider(names_from = col, values_from = value) %>%
  # nest data, run one regression per row, unnest
  nest(-row) %>%
  mutate(model = map(data, function(df) lm(y ~ x, data = df)),
         tidied = map(model, broom::glance)) %>%
  unnest(tidied)
Result
# A tibble: 3 x 15
row data model r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<int> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 <tibble [3 × 3]> <lm> 0.952 0.905 2.81 20.0 0.140 1 -5.71 17.4 14.7 7.91 1 3
2 2 <tibble [3 × 3]> <lm> 0.973 0.946 2.10 36.3 0.105 1 -4.84 15.7 13.0 4.43 1 3
3 3 <tibble [3 × 3]> <lm> 0.951 0.903 2.84 19.6 0.141 1 -5.74 17.5 14.8 8.05 1 3
Edit: for troubleshooting, I am adding here the results I see at each stage:
after the pivot_longer step:
# A tibble: 18 x 4
row col obs value
<int> <chr> <chr> <dbl>
1 1 x 1 694
2 1 x 2 652
3 1 x 3 614
4 1 y 1 18.0
5 1 y 2 12.0
6 1 y 3 0.09
7 2 x 1 702
8 2 x 2 659
9 2 x 3 612
10 2 y 1 18.0
11 2 y 2 12
12 2 y 3 0.1
13 3 x 1 701
14 3 x 2 655
15 3 x 3 613
16 3 y 1 18.0
17 3 y 2 12.1
18 3 y 3 0.1
after the pivot_wider step:
# A tibble: 9 x 4
row obs x y
<int> <chr> <dbl> <dbl>
1 1 1 694 18.0
2 1 2 652 12.0
3 1 3 614 0.09
4 2 1 702 18.0
5 2 2 659 12
6 2 3 612 0.1
7 3 1 701 18.0
8 3 2 655 12.1
9 3 3 613 0.1
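As an aside: for a simple regression of y on x, r.squared is just the squared Pearson correlation, which is also what Excel's RSQ returns. So a compact per-row alternative is this base R sketch:
x_cols <- c("x1", "x2", "x3")
y_cols <- c("y1", "y2", "y3")
# apply() coerces each row to a named numeric vector, so we can subset by name
rcalc$r_squared <- apply(rcalc, 1, function(r) cor(r[x_cols], r[y_cols])^2)
rcalc$r_squared
#> should match the r.squared column above (roughly 0.952, 0.973, 0.951)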

How to get the lowest and highest values for interval class

My data looks like the following:
df<-read.table(text = "temp
12
15
12
6
9
11
15
14
14
16
14
14
11
12
13
14
10
12
12
14
9
13
12
15
11
11
12
12
10
11",header=TRUE)
I want the table to run from the lowest to the highest temp value when calculating the cumulative proportion.
I have tried the following code:
library(purrr)
library(dplyr)
map(names(df), ~ df %>%
      count(!!rlang::sym(.x)) %>%
      mutate(cum = cumsum(n) / sum(n)))
As you can see, this gives the temps 6, 9, 10, 11, 12, 13, 14, 15, 16, but 7 and 8 are missing.
I want to have the following output:
temp n cum
6 x x
7 0 x
8 0 x
9 x x
10 x x
11 x x
12 x x
13 x x
14 x x
15 x x
16 x x
We can use complete to fill in the missing temp sequence and fill to carry the cum value forward.
library(dplyr)
library(tidyr)
df %>%
  count(temp) %>%
  mutate(cum = cumsum(n) / sum(n)) %>%
  complete(temp = seq(min(temp), max(temp)), fill = list(n = 0)) %>%
  fill(cum)
# A tibble: 11 x 3
# temp n cum
# <int> <dbl> <dbl>
# 1 6 1 0.0333
# 2 7 0 0.0333
# 3 8 0 0.0333
# 4 9 2 0.1
# 5 10 2 0.167
# 6 11 5 0.333
# 7 12 8 0.6
# 8 13 2 0.667
# 9 14 6 0.867
#10 15 3 0.967
#11 16 1 1
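A slight variation (a sketch): run complete() before computing the cumulative proportion, so the zero counts are already in place and the fill() step becomes unnecessary:
df %>%
  count(temp) %>%
  complete(temp = seq(min(temp), max(temp)), fill = list(n = 0)) %>%
  mutate(cum = cumsum(n) / sum(n))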
In base R you could use table to get df2, match those frequencies into a new data.frame spanning the full temperature range (setting NA to zero), and calculate the cumsum.
df2 <- data.frame(table(df$temp))
rg <- range(df$temp)
res <- within(data.frame(temp = rg[1]:rg[2]), {
  n <- df2[match(temp, df2$Var1), "Freq"]
  n[is.na(n)] <- 0
  cum <- cumsum(n / sum(n))
})[c(1, 3, 2)]
res
# temp n cum
# 1 6 1 0.03333333
# 2 7 0 0.03333333
# 3 8 0 0.03333333
# 4 9 2 0.10000000
# 5 10 2 0.16666667
# 6 11 5 0.33333333
# 7 12 8 0.60000000
# 8 13 2 0.66666667
# 9 14 6 0.86666667
# 10 15 3 0.96666667
# 11 16 1 1.00000000
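Another base R option (a sketch): make temp a factor whose levels span the full range, so table() emits zero counts for the missing values directly:
lv <- seq(min(df$temp), max(df$temp))
# the explicit levels force zero counts for temps that never occur
n <- as.vector(table(factor(df$temp, levels = lv)))
data.frame(temp = lv, n = n, cum = cumsum(n) / sum(n))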

T-test for multiple rows in R

I have a table with 40+ columns and 200,000+ rows.
Something like this:
ID GROUP-A1 GROUP-A2 GROUP-A3...A20 GROUP-B1 GROUP-B2 GROUP-B3...B20
1 5 6 3 5....3 10 21 9 15
2 3 4 6 2....13 23 42 34 23
3 5 3 1 0....12 10 12 43 15
4 0 0 2 5....3 10 21 23 15
I would like to run a t-test between the two groups A (1..20) and B (1..20) for every measurement I have (each row); the groups are independent. And ideally, have the resulting stats in the table next to each row, or in a separate table, so I can easily select the significant ones.
I looked at a few R packages, but they mostly require reformatting the table I have, putting measurements and groups in columns, and I would need 200,000+ separate tables in that case.
Any idea?
Something like this?
apply(df, 1, function(x) t.test(x[2:21], x[22:41]))
To save the test statistic or p-value in a new column you could do
df$st <- apply(df, 1, function(x) t.test(x[2:21], x[22:41])$stat)
or $p.value
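If you want both the statistic and the p-value in one pass, here is a sketch along the same lines (still assuming columns 2:21 hold group A and columns 22:41 hold group B):
res <- t(apply(df, 1, function(x) {
  tt <- t.test(x[2:21], x[22:41])
  c(statistic = unname(tt$statistic), p.value = tt$p.value)
}))
df <- cbind(df, res) # adds one statistic and one p.value column per row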
You can run all tests, one per matching column pair, with the following code.
i_group_a <- grep("GROUP.A", names(df1), ignore.case = TRUE)
i_group_b <- grep("GROUP.B", names(df1), ignore.case = TRUE)
ttest_list <- lapply(seq_along(i_group_a), function(k){
  i <- i_group_a[k]
  j <- i_group_b[k]
  t.test(df1[[i]], df1[[j]])
})
ttest_list[[1]]
#
# Welch Two Sample t-test
#
#data: df1[[i]] and df1[[j]]
#t = -2.8918, df = 3.7793, p-value = 0.04763
#alternative hypothesis: true difference in means is not equal to 0
#95 percent confidence interval:
# -19.826402 -0.173598
#sample estimates:
#mean of x mean of y
# 3.25 13.25
To extract, for instance, the p-values:
pval <- sapply(ttest_list, `[[`, 'p.value')
pval
#[1] 0.04762593 0.04449075 0.04390115 0.00192454
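The same pattern extracts any component; for example, a small summary data frame with one row per column pair (a sketch):
data.frame(
  pair = seq_along(ttest_list),
  statistic = sapply(ttest_list, function(tt) unname(tt$statistic)),
  p.value = sapply(ttest_list, `[[`, 'p.value')
)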
Data.
df1 <- read.table(text = "
ID GROUP-A1 GROUP-A2 GROUP-A3 GROUP-A20 GROUP-B1 GROUP-B2 GROUP-B3 GROUP-B20
1 5 6 3 5 10 21 9 15
2 3 4 6 2 23 42 34 23
3 5 3 1 0 10 12 43 15
4 0 0 2 5 10 21 23 15
", header = TRUE)
You can do this with the tidyverse using purrr. It does, however, require formatting your data differently. Here is an example:
require(tidyverse)
set.seed(314)
# simulate your data
df <- data.frame(ID = rep(1:5, each = 20),
                 participant = rep(rep(1:10, 2), 5),
                 group = rep(rep(c('A', 'B'), each = 10), 5),
                 answer = sample(1:10, 100, replace = TRUE))
dfflat <- df %>%
  unite(column, group, participant) %>%
  spread(column, answer)
dfflat:
ID A_1 A_10 A_2 A_3 A_4 A_5 A_6 A_7 A_8 A_9 B_1 B_10 B_2 B_3 B_4 B_5 B_6 B_7 B_8 B_9
1 1 1 8 3 8 3 3 4 3 4 6 4 4 2 3 3 6 4 8 6 1
2 2 7 6 5 6 3 1 6 4 1 3 3 6 7 1 5 5 2 10 10 6
3 3 4 3 8 5 9 7 9 7 3 1 8 2 7 6 8 3 5 6 9 4
4 4 5 4 8 2 4 1 4 6 2 2 1 1 7 10 6 9 7 7 10 1
5 5 4 1 5 10 3 5 3 10 8 3 7 3 4 6 6 9 10 7 4 5
the equivalent in long format:
dfflat %>%
  gather(participant, answer, -ID) %>%
  separate(participant, c('group', 'number'))
ID group number answer
1 1 A 1 1
2 2 A 1 7
3 3 A 1 4
4 4 A 1 5
5 5 A 1 4
6 1 A 10 8
7 2 A 10 6
8 3 A 10 3
9 4 A 10 4
10 5 A 10 1
11 1 A 2 3
12 2 A 2 5
13 3 A 2 8
14 4 A 2 8
15 5 A 2 5
16 1 A 3 8
17 2 A 3 6
18 3 A 3 5
19 4 A 3 2
20 5 A 3 10
...
Test the hypothesis with t.test per ID and extract the p.value:
dfflat %>%
  gather(participant, answer, -ID) %>%
  separate(participant, c('group', 'number')) %>%
  group_by(ID) %>%
  nest() %>%
  mutate(test = map(data, ~ with(.x, t.test(answer[group == 'A'], answer[group == 'B']))),
         p.value = map_dbl(test, pluck, 'p.value'))
results in:
# A tibble: 5 x 4
ID data test p.value
<int> <list> <list> <dbl>
1 1 <tibble [20 x 3]> <S3: htest> 0.841
2 2 <tibble [20 x 3]> <S3: htest> 0.284
3 3 <tibble [20 x 3]> <S3: htest> 0.863
4 4 <tibble [20 x 3]> <S3: htest> 0.137
5 5 <tibble [20 x 3]> <S3: htest> 0.469
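Since the goal is usually to pick out the significant IDs among many tests, you may also want multiplicity-adjusted p-values. A sketch with base R's p.adjust, where res is a hypothetical name for the result tibble above:
res %>%
  # Benjamini-Hochberg adjustment across all per-ID tests
  mutate(p.adjusted = p.adjust(p.value, method = 'BH'))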

Grouping by consecutive value occurrences

I came across a problem that forced me to use a loop instead of my preferred dplyr pipe flow.
I want to group rows based on consecutive observations of the same value.
For example, if the first four observations of type equal a, the first four observations should be assigned to the same group. Order matters, so I can't simply dplyr::group_by and dplyr::summarize.
The code below should explain the problem fairly well. I was wondering if anyone could propose a less verbose way to do this, preferably using tidyverse packages, and not data.tables.
library(tidyverse)
# Create some test data
df <- tibble(
  id = 1:20,
  type = c(rep("a", 5), rep("b", 5), rep("a", 5), rep("b", 5)),
  val = runif(20)
)
df
#> # A tibble: 20 x 3
#> id type val
#> <int> <chr> <dbl>
#> 1 1 a 0.0606
#> 2 2 a 0.501
#> 3 3 a 0.974
#> 4 4 a 0.0833
#> 5 5 a 0.752
#> 6 6 b 0.0450
#> 7 7 b 0.367
#> 8 8 b 0.649
#> 9 9 b 0.846
#> 10 10 b 0.896
#> 11 11 a 0.178
#> 12 12 a 0.295
#> 13 13 a 0.206
#> 14 14 a 0.233
#> 15 15 a 0.851
#> 16 16 b 0.179
#> 17 17 b 0.801
#> 18 18 b 0.326
#> 19 19 b 0.269
#> 20 20 b 0.584
# Solve problem with a loop
count <- 1
df$consec_group <- NA
for (i in 1:nrow(df)) {
  current <- df$type[i]
  lag <- ifelse(i == 1, NA, df$type[i - 1])
  lead <- ifelse(i == nrow(df), NA, df$type[i + 1])
  if (is.na(lead)) {
    df$consec_group[i] <- ifelse(current == lag, count, count + 1)
  } else {
    df$consec_group[i] <- count
    if (current != lead) count <- count + 1
  }
}
df
#> # A tibble: 20 x 4
#> id type val consec_group
#> <int> <chr> <dbl> <dbl>
#> 1 1 a 0.0606 1
#> 2 2 a 0.501 1
#> 3 3 a 0.974 1
#> 4 4 a 0.0833 1
#> 5 5 a 0.752 1
#> 6 6 b 0.0450 2
#> 7 7 b 0.367 2
#> 8 8 b 0.649 2
#> 9 9 b 0.846 2
#> 10 10 b 0.896 2
#> 11 11 a 0.178 3
#> 12 12 a 0.295 3
#> 13 13 a 0.206 3
#> 14 14 a 0.233 3
#> 15 15 a 0.851 3
#> 16 16 b 0.179 4
#> 17 17 b 0.801 4
#> 18 18 b 0.326 4
#> 19 19 b 0.269 4
#> 20 20 b 0.584 4
Created on 2019-03-14 by the reprex package (v0.2.1)
This grouping of consecutive type occurrences is really just an intermediate step. My endgame is to manipulate val for a given consec_group based on the values of val that occurred within the previous consec_group. Advice on relevant packages would be appreciated.
You say "no data.tables", but are you sure? It's so *** fast and easy (in this case)...
library(data.table)
setDT(df)[, groupid := rleid(type)][]
# id type val groupid
# 1: 1 a 0.624078793 1
# 2: 2 a 0.687361541 1
# 3: 3 a 0.817702740 1
# 4: 4 a 0.669857208 1
# 5: 5 a 0.100977936 1
# 6: 6 b 0.418275823 2
# 7: 7 b 0.660119857 2
# 8: 8 b 0.876015209 2
# 9: 9 b 0.473562143 2
# 10: 10 b 0.284474633 2
# 11: 11 a 0.034154862 3
# 12: 12 a 0.391760387 3
# 13: 13 a 0.383107868 3
# 14: 14 a 0.729583433 3
# 15: 15 a 0.006288375 3
# 16: 16 b 0.530179235 4
# 17: 17 b 0.802643704 4
# 18: 18 b 0.409618633 4
# 19: 19 b 0.309363642 4
# 20: 20 b 0.021918512 4
If you insist on using the tidyverse/dplyr, you can (of course) still use the rleid function as follows:
df %>% mutate( groupid = data.table::rleid(type) )
Benchmarks on a larger sample:
library(tidyverse)
library(data.table)
# Create some large test data
df <- tibble(
  id = 1:200000,
  type = sample(letters[1:26], 200000, replace = TRUE),
  val = runif(200000)
)
dt <- as.data.table(df)
microbenchmark::microbenchmark(
  dplyr.rleid = df %>% mutate(groupid = data.table::rleid(type)),
  data.table.rleid = dt[, groupid := rleid(type)][],
  rle = df %>% mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)}),
  rle2 = df %>% mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths))),
  transform = transform(df, ID = with(rle(df$type), rep(seq_along(lengths), lengths))),
  times = 10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# dplyr.rleid 3.153626 3.278049 3.410363 3.444949 3.502792 3.582626 10
# data.table.rleid 2.965639 3.065959 3.173992 3.145643 3.259672 3.507009 10
# rle 13.059774 14.042797 24.364176 26.126176 29.460561 36.874054 10
# rle2 12.641319 13.553846 30.951152 24.698338 34.139786 102.791719 10
# transform 12.330717 22.419128 22.725242 25.532084 26.187634 26.702794 10
You can use an rleid()-like approach like this:
df %>%
  mutate(ID_rleid = {ID_rleid = rle(type); rep(seq_along(ID_rleid$lengths), ID_rleid$lengths)})
id type val ID_rleid
<int> <chr> <dbl> <int>
1 1 a 0.0430 1
2 2 a 0.858 1
3 3 a 0.504 1
4 4 a 0.318 1
5 5 a 0.469 1
6 6 b 0.144 2
7 7 b 0.173 2
8 8 b 0.0706 2
9 9 b 0.958 2
10 10 b 0.557 2
11 11 a 0.358 3
12 12 a 0.973 3
13 13 a 0.982 3
14 14 a 0.177 3
15 15 a 0.599 3
16 16 b 0.627 4
17 17 b 0.454 4
18 18 b 0.682 4
19 19 b 0.690 4
20 20 b 0.713 4
Or a modification (originally proposed by @d.b) that makes it more concise:
df %>%
  mutate(ID_rleid = with(rle(type), rep(seq_along(lengths), lengths)))
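If you would rather avoid the data.table dependency entirely, a dplyr-only sketch builds the same ids by counting change points with cumsum():
df %>%
  # a new run starts wherever type differs from the previous row
  mutate(ID_rleid = cumsum(type != lag(type, default = first(type))) + 1)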
