I'm very new to R and have been trying to figure out how to calculate R^2 from a few columns within a large data set of 300+ columns.
Example:
rcalc <- data.frame('x1' = c(694, 702, 701), 'x2' = c(652, 659, 655),
                    'x3' = c(614, 612, 613), 'y1' = c(17.97, 17.95, 17.96),
                    'y2' = c(12.03, 12.0, 12.1), 'y3' = c(0.09, 0.1, 0.1))
From here I am stuck.
The formula in excel I can do, and looks like this:
RSQ(X1:X3, Y1:Y3) or RSQ(694:652:614, 17.97:12.03:0.09)
So R^2 needs to be calculated for each row. I was able to use the 'lm' command, but only for 1 row:
I had to take the values from the x columns (x1:x3) and stack them into 1 column, then take the values from the y columns (y1:y3) and stack them into 1 column. Then I performed the following:
rsqrd = lm(x~y, data=rcalc)
summary(rsqrd)$r.squared
This worked but again, only for 1 row. I'm not sure how to do this for thousands of rows. I hope this wasn't too confusing. Any help is greatly appreciated.
Troubleshooting:
with pivot_longer:
row col obs value
1 c 300_0 DUT Ip2_comp 784.9775
1 c 300_12 DUT Ip2_comp 864.4234
1 c 300_18 DUT Ip2_comp 919.3384
1 c 300_0 REF O2 0.09
1 c 300_12 REF O2 11.95
1 c 300_18 REF O2 17.98
2 c 300_0 DUT Ip2_comp 781.5785
2 c 300_12 DUT Ip2_comp 865.5541
2 c 300_18 DUT Ip2_comp 921.0646
2 c 300_0 REF O2 0.09
With Pivot_wider:
row obs c
1 300_0 DUT Ip2_comp 784.9775
1 300_12 DUT Ip2_comp 864.4234
1 300_18 DUT Ip2_comp 919.3384
1 300_0 REF O2 0.09
1 300_12 REF O2 11.95
1 300_18 REF O2 17.98
2 300_0 DUT Ip2_comp 781.5785
2 300_12 DUT Ip2_comp 865.5541
2 300_18 DUT Ip2_comp 921.0646
I'm sure this could be done more concisely, but here's one approach using tidyverse functions. First, I do some reshaping to add a row number and make it into a longer shape, with columns for row, observation # (1-3), x, and y.
Then I "nest" all the data except row number so that I can run a separate regression on each row's data, and then extract r squared (and a variety of other stats) from each regression.
library(tidyverse)
rcalc %>% # your data
  # reshape to get matched columns for all x and for all y values
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = c("col", "obs"), names_sep = 1) %>% # split column name into two fields after first character
  pivot_wider(names_from = col, values_from = value) %>%
  # nest data, regression, unnest
  nest(data = -row) %>%
  mutate(model = map(data, function(df) lm(y ~ x, data = df)),
         tidied = map(model, broom::glance)) %>%
  unnest(tidied)
Result
# A tibble: 3 x 15
row data model r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC deviance df.residual nobs
<int> <list> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1 <tibble [3 × 3]> <lm> 0.952 0.905 2.81 20.0 0.140 1 -5.71 17.4 14.7 7.91 1 3
2 2 <tibble [3 × 3]> <lm> 0.973 0.946 2.10 36.3 0.105 1 -4.84 15.7 13.0 4.43 1 3
3 3 <tibble [3 × 3]> <lm> 0.951 0.903 2.84 19.6 0.141 1 -5.74 17.5 14.8 8.05 1 3
Edit: for troubleshooting, I am adding here the results I see at each stage:
after the pivot_longer step:
# A tibble: 18 x 4
row col obs value
<int> <chr> <chr> <dbl>
1 1 x 1 694
2 1 x 2 652
3 1 x 3 614
4 1 y 1 18.0
5 1 y 2 12.0
6 1 y 3 0.09
7 2 x 1 702
8 2 x 2 659
9 2 x 3 612
10 2 y 1 18.0
11 2 y 2 12
12 2 y 3 0.1
13 3 x 1 701
14 3 x 2 655
15 3 x 3 613
16 3 y 1 18.0
17 3 y 2 12.1
18 3 y 3 0.1
after the pivot_wider step:
# A tibble: 9 x 4
row obs x y
<int> <chr> <dbl> <dbl>
1 1 1 694 18.0
2 1 2 652 12.0
3 1 3 614 0.09
4 2 1 702 18.0
5 2 2 659 12
6 2 3 612 0.1
7 3 1 701 18.0
8 3 2 655 12.1
9 3 3 613 0.1
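As a cross-check on the pipeline above: with a single predictor, the R^2 reported by lm equals the squared Pearson correlation, which is also what Excel's RSQ computes. The following is a minimal base R sketch under that assumption. The names x_cols, y_cols and r_squared are made up for illustration and would need to be adapted to the real column names.

```r
rcalc <- data.frame(
  x1 = c(694, 702, 701), x2 = c(652, 659, 655), x3 = c(614, 612, 613),
  y1 = c(17.97, 17.95, 17.96), y2 = c(12.03, 12.0, 12.1), y3 = c(0.09, 0.1, 0.1)
)

# Hypothetical column selections; adapt these to the real column names
x_cols <- c("x1", "x2", "x3")
y_cols <- c("y1", "y2", "y3")

X <- as.matrix(rcalc[x_cols])
Y <- as.matrix(rcalc[y_cols])

# One squared correlation per row, pairing x1 with y1, x2 with y2, and so on
rcalc$r_squared <- vapply(
  seq_len(nrow(rcalc)),
  function(i) cor(X[i, ], Y[i, ])^2,
  numeric(1)
)

rcalc$r_squared
```

This avoids reshaping and fitting one model per row, which may matter with thousands of rows, but it only yields R^2 and none of the other statistics that broom::glance returns.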
Related
I have two factors, day and group. The value is the total count in each group.
x <- c(5,17,31,9,17,10,30,28,16,29,14,34)
y <- c(1,2,3,4,5,6,7,8,9,10,11,12)
day <- as.character (c ( 1,2,3,4,5,6,7,8,9,10,11,12) )
df1 <- data.frame(x, y, day)
df2 <- reshape2::melt(df1, id.vars='day')
colnames (df2)[2] = "group"
> df2
day group value
1 1 x 5
2 2 x 17
3 3 x 31
4 4 x 9
5 5 x 17
6 6 x 10
7 7 x 30
8 8 x 28
9 9 x 16
10 10 x 29
11 11 x 14
12 12 x 34
13 1 y 1
14 2 y 2
15 3 y 3
16 4 y 4
17 5 y 5
18 6 y 6
19 7 y 7
20 8 y 8
21 9 y 9
22 10 y 10
23 11 y 11
24 12 y 12
So in the above example there is a total of 5 in group x and a total of 1 in group y on day 1. I would like to test, for each day, whether x and y differ significantly, so perhaps a chi-square test?
I ran the code below, but for some reason it keeps reporting that group is not found:
df2 %>% group_by(day) %>%
rstatix::chisq_test( day, group ) %>%
add_significance("p.adj")
Can anyone help with this, thanks in advance.
Perhaps this helps
library(dplyr)
library(purrr)
library(rstatix)
df2 %>%
mutate(day = as.numeric(day)) %>%
split(.$day) %>%
map_dfr(~ with(.x, chisq_test(setNames(value, group))), .id = 'day')
-output
# A tibble: 12 × 7
day n statistic p df method p.signif
<chr> <int> <dbl> <dbl> <dbl> <chr> <chr>
1 1 2 2.67 0.102 1 Chi-square test ns
2 2 2 11.8 0.000579 1 Chi-square test ***
3 3 2 23.1 0.00000157 1 Chi-square test ****
4 4 2 1.92 0.166 1 Chi-square test ns
5 5 2 6.55 0.0105 1 Chi-square test *
6 6 2 1 0.317 1 Chi-square test ns
7 7 2 14.3 0.000156 1 Chi-square test ***
8 8 2 11.1 0.000858 1 Chi-square test ***
9 9 2 1.96 0.162 1 Chi-square test ns
10 10 2 9.26 0.00235 1 Chi-square test **
11 11 2 0.36 0.549 1 Chi-square test ns
12 12 2 10.5 0.00118 1 Chi-square test **
Or we could use group_modify
df2 %>%
group_by(day = as.numeric(day)) %>%
group_modify(~ with(.x, chisq_test(setNames(value, group)))) %>%
ungroup
# A tibble: 12 × 7
day n statistic p df method p.signif
<dbl> <int> <dbl> <dbl> <dbl> <chr> <chr>
1 1 2 2.67 0.102 1 Chi-square test ns
2 2 2 11.8 0.000579 1 Chi-square test ***
3 3 2 23.1 0.00000157 1 Chi-square test ****
4 4 2 1.92 0.166 1 Chi-square test ns
5 5 2 6.55 0.0105 1 Chi-square test *
6 6 2 1 0.317 1 Chi-square test ns
7 7 2 14.3 0.000156 1 Chi-square test ***
8 8 2 11.1 0.000858 1 Chi-square test ***
9 9 2 1.96 0.162 1 Chi-square test ns
10 10 2 9.26 0.00235 1 Chi-square test **
11 11 2 0.36 0.549 1 Chi-square test ns
12 12 2 10.5 0.00118 1 Chi-square test **
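For reference, each per-day test above is a chi-square goodness-of-fit test of the two counts against an equal split. A minimal base R sketch of the same idea follows, assuming that reading of the question. The name per_day is made up for illustration, and the Holm adjustment is my assumption about what the p.adj column in the question was meant to hold.

```r
x <- c(5, 17, 31, 9, 17, 10, 30, 28, 16, 29, 14, 34)
y <- 1:12

per_day <- do.call(rbind, lapply(seq_along(x), function(i) {
  # Goodness-of-fit test: are the two counts compatible with a 50/50 split?
  # Small expected counts trigger an approximation warning, silenced here.
  test <- suppressWarnings(chisq.test(c(x = x[i], y = y[i])))
  data.frame(day = i,
             statistic = unname(test$statistic),
             p = test$p.value)
}))

# Adjust for running 12 tests; Holm is an assumed choice, not from the question
per_day$p.adj <- p.adjust(per_day$p, method = "holm")

per_day
```

With expected counts as small as 3 (day 1), the chi-square approximation is rough, so an exact test such as binom.test may be worth considering for those days.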
So let's say I have two data frames
df1 <- data.frame(n = rep(n = 2,c(0,1,2,3,4)), nn =c(rep(x = 1, 5), rep(x=2, 5)),
y = rnorm(10), z = rnorm(10))
df2 <- data.frame(x = rnorm(20))
Here is the first df:
> head(df1)
n nn y z
1 0 1 1.5683647 0.48934096
2 1 1 1.2967556 -0.77891030
3 2 1 -0.2375963 1.74355935
4 3 1 -1.2241501 -0.07838729
5 4 1 -0.3278127 -0.97555379
6 0 2 -2.4124503 0.07065982
Here is the second df:
x
1 -0.4884289
2 0.9362939
3 -1.0624084
4 -0.9838209
5 0.4242479
6 -0.4513135
I'd like to subtract the x column values of df2 from the z column values of df1, and return the rows of both data frames for which the difference is approximately equal to the y value of df1.
Is there a way to construct such a function so that I can specify how close the values must be to count as equal?
To be clear, I'd like to subtract all x values from all z values, compare each difference to the y column of df1, and check whether any of them approximately matches y.
Here's an approach where I match every row of df1 with every row of df2, then subtract x and y from z (as implied by your logic of comparing z-x to y; this is the same as comparing z-x-y to zero). Finally, for each row of df1 I keep the match with the lowest absolute difference.
library(dplyr)
left_join(
df1 %>% mutate(dummy = 1, row = row_number()),
df2 %>% mutate(dummy = 1, row = row_number()), by = "dummy") %>%
mutate(diff = z - x - y) %>%
group_by(row.x) %>%
slice_min(abs(diff)) %>%
ungroup()
Result (I used set.seed(42) before generating df1+df2.)
# A tibble: 10 x 9
n nn y z dummy row.x x row.y diff
<dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
1 0 1 1.37 1.30 1 1 0.0361 20 -0.102
2 1 1 -0.565 2.29 1 2 1.90 5 0.956
3 2 1 0.363 -1.39 1 3 -1.76 8 0.0112
4 3 1 0.633 -0.279 1 4 -0.851 18 -0.0607
5 4 1 0.404 -0.133 1 5 -0.609 14 0.0713
6 0 2 -0.106 0.636 1 6 0.705 12 0.0372
7 1 2 1.51 -0.284 1 7 -1.78 2 -0.0145
8 2 2 -0.0947 -2.66 1 8 -2.41 19 -0.148
9 3 2 2.02 -2.44 1 9 -2.41 19 -2.04
10 4 2 -0.0627 1.32 1 10 1.21 4 0.168
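The result above keeps the single closest match per row of df1, even when that match is far off (row 9 has a difference of about -2). If an explicit tolerance is wanted, as the question asks, one option is to filter on the absolute difference instead. This is a minimal base R sketch. The helper match_within, its tol argument, and the row1/row2 columns are hypothetical names introduced for illustration.

```r
# Hypothetical helper: all (df1 row, df2 row) pairs with |z - x - y| <= tol
match_within <- function(df1, df2, tol = 0.1) {
  df1$row1 <- seq_len(nrow(df1))
  df2$row2 <- seq_len(nrow(df2))
  pairs <- merge(df1, df2, by = NULL)   # by = NULL gives the Cartesian product
  pairs$diff <- pairs$z - pairs$x - pairs$y
  pairs[abs(pairs$diff) <= tol, , drop = FALSE]
}

set.seed(42)
df1 <- data.frame(n = rep(0:4, times = 2), nn = rep(1:2, each = 5),
                  y = rnorm(10), z = rnorm(10))
df2 <- data.frame(x = rnorm(20))

# Every pair whose difference is within 0.05 of zero
close_pairs <- match_within(df1, df2, tol = 0.05)
close_pairs
```

Unlike the closest-match approach, this can return several matches for one row of df1, or none at all.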
I am trying to determine the difference between the set of ids in subsequent pairs of dataframes. The dataframes are derived from an original dataframe split by a grouping variable representing the time period. The results should show the rows of the new ids that occur in the current time period compared to the previous one.
I can accomplish this with a list of dataframes:
library(tidyverse)
set.seed(999)
examp <- tibble(
id = c(replicate(4, sample.int(20, 9))),
year = rep(1:4, each = 9),
val = runif(36)
)
examp %>%
split(.$year) %>%
# note my default, I compare the first year to itself
map2(lag(., default = .[1]), anti_join, by = "id")
$`1`
# A tibble: 0 x 3
# ... with 3 variables: id <int>, year <int>, val <dbl>
$`2`
# A tibble: 3 x 3
id year val
<int> <int> <dbl>
1 5 2 0.450
2 11 2 0.943
3 2 2 0.571
$`3`
# A tibble: 6 x 3
id year val
<int> <int> <dbl>
1 19 3 0.870
2 12 3 0.403
3 9 3 0.331
4 20 3 0.315
5 16 3 0.455
6 17 3 0.699
$`4`
# A tibble: 5 x 3
id year val
<int> <int> <dbl>
1 4 4 0.190
2 11 4 0.0804
3 2 4 0.247
4 1 4 0.619
5 18 4 0.434
But I could not get the same to work using mutate in a nested dataframe:
examp %>%
nest_by(year) %>%
mutate(new = anti_join(data, lag(data), by = "id"))
# A tibble: 4 x 3
# Rowwise: year
year data new$id $val
<int> <list<tibble[,2]>> <int> <dbl>
1 1 [9 x 2] 3 0.0601
2 2 [9 x 2] 1 0.495
3 3 [9 x 2] 17 0.699
4 4 [9 x 2] 18 0.434
Here I could not figure out how to specify the default and the output is unexpected. I expected "new" to be a list-column of dataframes corresponding with those above, which I could then unnest.
I am interested in learning more about working with nested dataframes and any help understanding how to get this to work would be much appreciated. Additionally, if there is another (simple) solution to this general problem, I would be happy to learn about it.
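The result of anti_join() should be wrapped in list() so that mutate() stores one data frame per row in a list-column. Note that nest_by() returns a rowwise data frame, so the lagged column is created after ungroup and before rowwise is applied again.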
library(dplyr)
out <- examp %>%
nest_by(year) %>%
ungroup %>%
mutate(newdat = lag(data, default = data[1])) %>%
rowwise %>%
mutate(new = list(anti_join(data, newdat, by = 'id')))
-output
out$new
[[1]]
# A tibble: 0 x 2
# … with 2 variables: id <int>, val <dbl>
[[2]]
# A tibble: 3 x 2
id val
<int> <dbl>
1 5 0.450
2 11 0.943
3 2 0.571
[[3]]
# A tibble: 6 x 2
id val
<int> <dbl>
1 19 0.870
2 12 0.403
3 9 0.331
4 20 0.315
5 16 0.455
6 17 0.699
[[4]]
# A tibble: 5 x 2
id val
<int> <dbl>
1 4 0.190
2 11 0.0804
3 2 0.247
4 1 0.619
5 18 0.434
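On the request for a simpler alternative: the same rows can be found without splitting or nesting by anti-joining the data against a copy of itself shifted forward by one year. This is a minimal sketch, assuming years are consecutive integers. The helper new_ids and the toy data are made up for illustration.

```r
library(dplyr)

# Hypothetical helper: rows whose id did not occur in the preceding year
new_ids <- function(df) {
  # Shift every (id, year) pair forward one year, so that a match means
  # "this id was already present in the previous year"
  seen_last_year <- df %>%
    distinct(id, year) %>%
    mutate(year = year + 1L)

  df %>%
    filter(year > min(year)) %>%   # the first year has no predecessor
    anti_join(seen_last_year, by = c("id", "year"))
}

# Toy data: id 4 is new in year 2, id 5 is new in year 3
toy <- tibble(
  id   = c(1, 2, 3,  2, 3, 4,  4, 5),
  year = c(1L, 1L, 1L,  2L, 2L, 2L,  3L, 3L)
)

new_ids(toy)
```

Dropping the first year plays the same role as comparing the first year to itself in the list-based version: both yield no rows for it.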
I'm trying to arrange values in decreasing order within a specific group in a nested data frame. My input data looks like this: I have two grouping variables (group1 and group2) and three value columns (id, value2, value3).
library(tidyverse)
set.seed(1234)
df <- tibble(group1 = c(rep(LETTERS[1:3], 4)),
group2 = c(rep(0, 6), rep(2, 6)),
value2 = rnorm(12, 20, sd = 10),
value3 = rnorm(12, 20, sd = 50)) %>%
group_by(group1) %>%
mutate(id = c(1:4)) %>%
ungroup()
I decided to group them by group1 and group2 and then nest():
df_nested <- df %>%
group_by(group1, group2) %>%
nest()
# A tibble: 6 x 3
# Groups: group1, group2 [6]
group1 group2 data
<chr> <dbl> <list>
1 A 0 <tibble [2 x 3]>
2 B 0 <tibble [2 x 3]>
3 C 0 <tibble [2 x 3]>
4 A 2 <tibble [2 x 3]>
5 B 2 <tibble [2 x 3]>
6 C 2 <tibble [2 x 3]>
Perfect. Now I need to sort by id only those nested data frames for which group2 is equal to 2. However, I'm receiving the following error:
df_nested %>%
mutate(data = map2_df(.x = data, .y = group2,
~ifelse(.y == 2, arrange(-.x$id),
.x)))
Error: Argument 1 must have names
You could do :
library(dplyr)
library(purrr)
df_nested$data <- map2(df_nested$data, df_nested$group2,~if(.y == 2)
arrange(.x, -.x$id) else .x)
So data where group2 is not equal to 2 is not sorted
df_nested$data[[1]]
# A tibble: 2 x 3
# value2 value3 id
# <dbl> <dbl> <int>
#1 13.1 -89.0 1
#2 9.76 -3.29 2
and where group2 is 2 is sorted.
df_nested$data[[4]]
# A tibble: 2 x 3
#value2 value3 id
# <dbl> <dbl> <int>
#1 15.0 -28.4 4
#2 31.0 -22.8 3
If you want to combine them do :
map2_df(df_nested$data, df_nested$group2,~if(.y == 2) arrange(.x, -.x$id) else .x)
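As far as I can tell, the original attempt fails for two reasons: ifelse() is vectorised and returns a vector or list rather than a data frame, and arrange(-.x$id) was called without a data frame as its first argument. map2_df then likely tries to row-bind unnamed results, which would explain "Argument 1 must have names". The same fix can also stay inside a mutate() pipeline. This is a minimal sketch with placeholder values instead of the random ones. The helper sort_if and the name df_sorted are made up for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- tibble(group1 = rep(LETTERS[1:3], 4),
             group2 = rep(c(0, 2), each = 6),
             value2 = 1:12) %>%            # placeholder values for illustration
  group_by(group1) %>%
  mutate(id = 1:4) %>%
  ungroup()

# Hypothetical helper: sort one nested tibble only when the flag equals 2
sort_if <- function(d, g) if (g == 2) arrange(d, desc(id)) else d

df_sorted <- df %>%
  nest(data = c(value2, id)) %>%
  mutate(data = map2(data, group2, sort_if))

df_sorted$data[[4]]   # group A, group2 == 2: sorted by decreasing id
```

Plain if/else works here because map2 calls the function once per nested tibble, so the condition is always a single value.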
I would suggest creating an additional variable id_ which is equal to the original id variable when group2 == 2 and NA otherwise. This way, using it for sorting will have no effect when group2 != 2.
df %>%
mutate(id_ = if_else(group2 == 2, id, NA_integer_)) %>%
arrange(group1, group2, -id_)
#> # A tibble: 12 x 6
#> group1 group2 value2 value3 id id_
#> <chr> <dbl> <dbl> <dbl> <int> <int>
#> 1 A 0 17.6 50.2 1 NA
#> 2 A 0 33.8 -14.4 2 NA
#> 3 A 2 23.1 22.6 4 4
#> 4 A 2 13.7 50.2 3 3
#> 5 B 0 15.4 49.9 1 NA
#> 6 B 0 16.2 63.7 2 NA
#> 7 B 2 41.7 -2.90 4 4
#> 8 B 2 16.6 46.7 3 3
#> 9 C 0 19.9 -64.3 1 NA
#> 10 C 0 19.9 59.7 2 NA
#> 11 C 2 34.1 48.5 4 4
#> 12 C 2 32.3 23.1 3 3
Then if needed we can group and nest the result.
Consider the case below for an experiment where group is different treatments, init are the initial values for each sample, change is expected change after treatment and sd_change is standard deviation of the change.
library(tidyverse)
set.seed(001)
data1 <- tibble(group = rep(c("a", "b"), each = 4),
init = rpois(8, 10)) %>%
group_by(group, init) %>%
expand(change = seq(2, 6, 2)) %>%
mutate(sd_change = 2)
as_tibble(data1)
> data1
# A tibble: 24 x 4
# Groups: group, init [8]
group init change sd_change
<chr> <int> <dbl> <dbl>
1 a 7 2 2
2 a 7 4 2
3 a 7 6 2
4 a 8 2 2
5 a 8 4 2
6 a 8 6 2
7 a 10 2 2
8 a 10 4 2
9 a 10 6 2
10 a 11 2 2
# ... with 14 more rows
I generate final values and obtain mean and variance for each group and change as below
data2a <- data1 %>%
rowwise %>%
mutate(final = rnorm(1, change, sd_change) + init) %>%
ungroup
data2a %>%
group_by(group, change) %>%
summarise(mu_start = mean(init), mu_end = mean(final),
v_start = var(init), v_end = var(final))
# A tibble: 6 x 6
# Groups: group [2]
group change mu_start mu_end v_start v_end
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 a 2 9 10.9 3.33 13.9
2 a 4 9 14.7 3.33 4.90
3 a 6 9 15.5 3.33 10.2
4 b 2 11.5 13.2 4.33 3.69
5 b 4 11.5 14.8 4.33 17.8
6 b 6 11.5 17.7 4.33 9.77
I want to repeat the above procedure R times by generating one final random value. I can do this with a for loop but I'm learning purrr and I'm stuck when summarising. See one version below:
# function to generate final values where R = 3
f <- function(n=3, x, y, z){
out <- rnorm(n, x, y)
out <- out + z
}
data2b <- data1 %>%
mutate(final = pmap(list(z = init,
x = change,
y = sd_change),
f)) %>%
ungroup
as_tibble(data2b)
# A tibble: 24 x 5
group init change sd_change final
<chr> <int> <dbl> <dbl> <list>
1 a 7 2 2 <dbl [3]>
2 a 7 4 2 <dbl [3]>
3 a 7 6 2 <dbl [3]>
4 a 8 2 2 <dbl [3]>
5 a 8 4 2 <dbl [3]>
6 a 8 6 2 <dbl [3]>
7 a 10 2 2 <dbl [3]>
8 a 10 4 2 <dbl [3]>
9 a 10 6 2 <dbl [3]>
10 a 11 2 2 <dbl [3]>
# ... with 14 more rows
I then want to summarise to get mu_end, which should be a list of length R = 3 in this example. The following gives an error:
data2b %>%
split(.$group, .$change) %>%
mutate(mu_end = map(final, mean),
v_end = map(final, var))
Error in UseMethod("mutate_") :
no applicable method for 'mutate_' applied to an object of class "list"
The output should be like this
# A tibble: 6 x 4
# Groups: group [2]
group change mu_end v_end
<chr> <dbl> <dbl> <dbl>
1 a 2 10.9 13.9
2 a 4 14.7 4.90
3 a 6 15.5 10.2
4 b 2 13.2 3.69
5 b 4 14.8 17.8
6 b 6 17.7 9.77
but each row of mu_end and v_end should be a list of length R.
Any help?
We can either use group_split and then map over the resulting list of tibbles, using mutate with map_dbl to compute the mean and variance of each element of the list column 'final':
data2b %>%
group_split(group, change) %>%
map_df(~ .x %>%
mutate(mu_end = map_dbl(final, mean),
v_end = map_dbl(final, var)))
Or without splitting
data2b %>%
group_by(group, change) %>%
mutate(mu_end = map_dbl(final, mean), v_end = map_dbl(final, var))
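Note that map_dbl(final, mean) averages the R draws within each row. If the intent is instead one mean and one variance per replicate for every group and change (so that mu_end and v_end each hold R values), one option is to unnest the draws with a replicate index and summarise twice. This is a minimal sketch under that assumed reading of the question. The names sim, rep_id, per_replicate and per_group are made up, and the data are rebuilt in a simplified way rather than with expand().

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(1)
R <- 3  # number of simulated final values per row

inits <- tibble(group = rep(c("a", "b"), each = 4),
                init  = rpois(8, 10))

# Three rows per sample, one for each expected change, each with R draws
sim <- inits %>%
  slice(rep(seq_len(n()), each = 3)) %>%
  mutate(change = rep(c(2, 4, 6), times = 8),
         sd_change = 2,
         final = pmap(list(change, sd_change, init),
                      function(m, s, z) rnorm(R, m, s) + z))

# One row per draw, labelled by its position within the row's R draws
per_replicate <- sim %>%
  mutate(rep_id = map(final, seq_along)) %>%
  unnest(c(final, rep_id)) %>%
  group_by(group, change, rep_id) %>%
  summarise(mu_end = mean(final), v_end = var(final), .groups = "drop")

# Collapse back so each group/change row holds R values in a list-column
per_group <- per_replicate %>%
  group_by(group, change) %>%
  summarise(mu_end = list(mu_end), v_end = list(v_end), .groups = "drop")

per_group
```

Here per_group has six rows, and each element of mu_end and v_end is a numeric vector of length R.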