Using prop.test on grouped variables in dataframe - r

I've got the dataframe below and I'm trying to test whether there is a significant difference in the proportions between the groups within each category, e.g. for category A: group 1 versus 2, 1 versus 3, and 2 versus 3.
Is there a way to calculate and add the p-values to the dataframe as new columns without having to calculate it and add it manually one row at a time?
Or is there a way to calculate them and store them in a separate data frame?
   Group Category number  min total Proportion
1      1        A      6  2.5    33  0.1818182
2      1        B      4  3.2    33  0.1212121
3      1        C     16  3.2    33  0.4848485
4      1        D      7  3.1    33  0.2121212
5      2        A     22  6.4   133  0.1654135
6      2        B     17  6.7   133  0.1278195
7      2        C     56  6.0   133  0.4210526
8      2        D     38  6.4   133  0.2857143
9      3        A      3 10.0    22  0.1363636
10     3        B      3  9.7    22  0.1363636
11     3        C      9 10.6    22  0.4090909
12     3        D      7  9.9    22  0.3181818

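Although this looks like an easy task, the solution takes a few steps. Here is one using the purrr package as the core tool. Note that it compares pairs of categories within each group, and that the data below starts at row 2, so Group 1 has no Category A row.
Let's import the data: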
data <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L), Category = c("B", "C", "D", "A", "B", "C", "D", "A",
"B", "C", "D"), number = c(4L, 16L, 7L, 22L, 17L, 56L, 38L, 3L,
3L, 9L, 7L), min = c(3.2, 3.2, 3.1, 6.4, 6.7, 6, 6.4, 10, 9.7,
10.6, 9.9), total = c(33L, 33L, 33L, 133L, 133L, 133L, 133L,
22L, 22L, 22L, 22L), Proportion = c(0.1212121, 0.4848485, 0.2121212,
0.1654135, 0.1278195, 0.4210526, 0.2857143, 0.1363636, 0.1363636,
0.4090909, 0.3181818)), row.names = 2:12, class = "data.frame")
and load the required packages:
library(dplyr) # mutate, group_by and rowwise functions
library(tidyr) # nest
library(purrr) # map
library(combinat) # combn
We create a tibble foo that nests the original dataset by Group, which lets us map a function over the groups.
foo <- data %>% group_by(Group) %>% nest()
Now we define our own function combPval, which 1) creates a data.frame of all pairs of categories (combTab) and 2) creates a data.frame tab1 holding the columns that prop.test needs. The two are merged in the subsequent steps to create the data.frame data, and prop.test is then applied row by row.
combPval <- function(group){
  # all unordered pairs of categories, one pair per row (columns X1, X2)
  combTab <- combn(unique(group$Category), 2) %>% t() %>% data.frame()
  # columns needed by prop.test
  tab1 <- group %>% select(Category, number, total)
  # attach the counts of the second category (X2), then of the first (X1)
  temp <- merge(y=combTab, x=tab1, by.y="X2", by.x="Category")
  data <- merge(y=temp, x=tab1, by.y="X1", by.x="Category")
  # one two-sample test of proportions per row
  data %>%
    rowwise() %>%
    mutate(
      pval = prop.test(x=c(number.x, number.y), n=c(total.x, total.y))$p.value
    )
}
Function combPval is applied in the following way:
foo <- foo %>% mutate(results = map(data, combPval))
Results for the first group can be obtained:
foo$results[[1]]
# A tibble: 3 x 7
# Rowwise:
Category number.x total.x Category.y number.y total.y pval
<chr> <int> <int> <chr> <int> <int> <dbl>
1 B 4 33 C 16 33 0.00322
2 B 4 33 D 7 33 0.509
3 C 16 33 D 7 33 0.0388
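The question itself asks about groups within each category, whereas the output above compares categories within a group. Here is a minimal base R sketch of the question's direction. It retypes the question's counts into a hypothetical data frame dat and collects the p-values in a separate, hypothetical data frame pvals. No multiplicity correction is applied, and the warnings prop.test may raise for small counts are suppressed, so treat the numbers with care.

```r
# Counts retyped from the question's table (dat and pvals are illustrative names)
dat <- data.frame(
  Group    = rep(1:3, each = 4),
  Category = rep(c("A", "B", "C", "D"), times = 3),
  number   = c(6, 4, 16, 7, 22, 17, 56, 38, 3, 3, 9, 7),
  total    = rep(c(33, 133, 22), each = 4)
)

# For each category, test every pair of groups and keep one row per pair
pvals <- do.call(rbind, lapply(split(dat, dat$Category), function(d) {
  pairs <- utils::combn(d$Group, 2)  # each column is one pair of groups
  data.frame(
    Category = d$Category[1],
    group_a  = pairs[1, ],
    group_b  = pairs[2, ],
    pval     = apply(pairs, 2, function(g) {
      rows <- d[match(g, d$Group), ]
      # small counts trigger a chi-squared approximation warning
      suppressWarnings(prop.test(rows$number, rows$total)$p.value)
    })
  )
}))
pvals
```

For a single category, stats::pairwise.prop.test(x, n) runs the same pairwise tests and adjusts the p-values (Holm by default).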

Related

Aggregating columns based on columns name in R

I have this dataframe in R
Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6
I want to aggregate it so that all the Pro and all the Anti columns are summed by party,
for example
Party ProSum AntiSum
R. 234. 245
D. 234. 245
How would I do that in R?
You can use:
library(tidyverse)
df %>%
pivot_longer(-Party,
names_to = c(".value", NA),
names_pattern = "([a-zA-Z]*)([0-9]*)") %>%
group_by(Party) %>%
summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)))
# A tibble: 2 x 3
Party Pro Anti
<chr> <int> <int>
1 D 50 34
2 R 5 78
I would suggest a tidyverse approach: reshape the data and then compute the sum of the values:
library(tidyverse)
#Data
df <- structure(list(Party = c("R", "R", "D", "D"), Pro2005 = c(1L,
1L, 13L, 12L), Anti2005 = c(18L, 19L, 7L, 8L), Pro2006 = c(0L,
0L, 3L, 3L), Anti2006 = c(7L, 7L, 4L, 4L), Pro2007 = c(2L, 1L,
10L, 9L), Anti2007 = c(13L, 14L, 5L, 6L)), class = "data.frame", row.names = c(NA,
-4L))
The code:
df %>%
  pivot_longer(cols = -1) %>%
  # Strip the year so the names become "Pro" / "Anti"
  mutate(name = gsub('\\d+', '', name)) %>%
  # Aggregate
  group_by(Party, name) %>%
  summarise(value = sum(value, na.rm = TRUE)) %>%
  pivot_wider(names_from = name, values_from = value)
The output:
# A tibble: 2 x 3
# Groups: Party [2]
Party Anti Pro
<chr> <int> <int>
1 D 34 50
2 R 78 5
Split by party, sum over the Pro/Anti columns using sapply, and finally rbind the results.
res <- data.frame(Party=sort(unique(d$Party)), do.call(rbind, by(d, d$Party, function(x)
sapply(c("Pro", "Anti"), function(y) sum(x[grep(y, names(x))])))))
res
# Party Pro Anti
# D D 50 34
# R R 5 78
An outer solution also works (in the result, the rows are R and D and the columns are Pro and Anti).
t(outer(c("Pro", "Anti"), c("R", "D"),
Vectorize(function(x, y) sum(d[d$Party %in% y, grep(x, names(d))]))))
# [,1] [,2]
# [1,] 5 78
# [2,] 50 34
Data:
d <- read.table(header=T, text="Party Pro2005 Anti2005 Pro2006 Anti2006 Pro2007 Anti2007
R 1 18 0 7 2 13
R 1 19 0 7 1 14
D 13 7 3 4 10 5
D 12 8 3 4 9 6 ")
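One more base R variant, as a rough sketch: sum the Pro and Anti columns per row first, then let rowsum add the rows up by party. The data frame d is retyped from the question; pro, anti and res are illustrative names.

```r
d <- data.frame(
  Party    = c("R", "R", "D", "D"),
  Pro2005  = c(1, 1, 13, 12), Anti2005 = c(18, 19, 7, 8),
  Pro2006  = c(0, 0, 3, 3),   Anti2006 = c(7, 7, 4, 4),
  Pro2007  = c(2, 1, 10, 9),  Anti2007 = c(13, 14, 5, 6)
)

# Per-row totals over the columns whose names start with "Pro" / "Anti"
pro  <- rowSums(d[startsWith(names(d), "Pro")])
anti <- rowSums(d[startsWith(names(d), "Anti")])

# rowsum() adds up the rows of a matrix by group; row names are the parties
res <- rowsum(cbind(Pro = pro, Anti = anti), d$Party)
res
```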

Summarise a group value into single row

I have a large dataset with longitudinal readings from single individuals.
I want to summarise information over time into a binary variable: if diff in the input table below is > 5 for any row of an individual, that individual should be reduced to a single row with a new column saying TRUE (as for A below).
#Input
individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
#Output
individual newval
A TRUE
B FALSE
Using dplyr you can:
library(dplyr)
df %>%
group_by(individual) %>% # first group data
summarize(newval = any(diff > 5)) # then evaluate test for each group
#> # A tibble: 2 x 2
#> individual newval
#> <fct> <lgl>
#> 1 A TRUE
#> 2 B FALSE
data
df <- read.table(text = "individual val1 val2 diff
A 32 36 -4
A 36 28 8
A 28 26 2
A 26 26 0
B 65 64 1
B 58 59 -1
B 57 54 3
B 54 51 3
", header = TRUE)
There are multiple ways to do this:
In base R we can use aggregate
aggregate(diff~individual, df,function(x) any(x>5))
# individual diff
#1 A TRUE
#2 B FALSE
Or tapply
tapply(df$diff > 5, df$individual, any)
We can also use data.table
library(data.table)
setDT(df)[, .(newval = any(diff > 5)), by = individual]
An option in base R with rowsum
rowsum(+(df1$diff > 5), df1$individual) != 0
or with by
by(df1$diff > 5, df1$individual, any)
data
df1 <- structure(list(individual = c("A", "A", "A", "A", "B", "B", "B",
"B"), val1 = c(32L, 36L, 28L, 26L, 65L, 58L, 57L, 54L), val2 = c(36L,
28L, 26L, 26L, 64L, 59L, 54L, 51L), diff = c(-4L, 8L, 2L, 0L,
1L, -1L, 3L, 3L)), class = "data.frame", row.names = c(NA, -8L
))
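One caveat for large longitudinal data, sketched on made-up values: if diff contains missing readings, any() returns NA for a group that has no TRUE but at least one NA. Passing na.rm = TRUE treats missing readings as "not above the threshold". The vectors diffs and id below are hypothetical and not taken from the question.

```r
# Hypothetical readings with gaps
diffs <- c(-4, 8, NA, 1, NA, 3)
id    <- c("A", "A", "A", "B", "B", "B")

strict  <- tapply(diffs > 5, id, any)                # B becomes NA
lenient <- tapply(diffs > 5, id, any, na.rm = TRUE)  # B becomes FALSE
strict
lenient
```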

How can I pass dataframe variables to a for-loop using pipes with dplyr?

I'm trying to iterate through some calculations on subsets of my df using a for-loop at the end of a dplyr pipe, but variables I pass to the for-loop from the df aren't recognized.
I've tried to follow steps from this post:
use for loop with pipes in R.
Basically, I'm wrapping the for-loop in a user defined function and passing the df to the function via pipes.
I'm using a product sales dataset and am trying to calculate average sales of each pair of periods within each quarter for each product (a sort of sales baseline for promotions). For example, my first pass through the subset would calculate the average of periods 2 and 3, omitting 1. My second pass would exclude period 2 and calculate the average sales for 1 and 3, etc.
#Create dataframe
Article <- rep(1:3, each = 6)
Quarter <- rep(1:2, each = 3, 3)
Period <- rep(1:3, 6)
Sales <- sample(10:20, 18, replace = T)
df <-data.frame(Article, Quarter, Period, Sales)
foo <- function(x){
for (i in unique(Period)) {
filter(Period != i) %>%
summarize(average_sales = mean(Sales))
}
return(x)
}
df <- df %>%
group_by(Article, Quarter) %>%
foo()
#Desired resultant df:
average_sales <- c(14.5, 16.5, 12, 12, 16, 15, 16.5, 12.5, 16, 15, 14, 18, 11.5, 11, 11.5, 16, 16, 12)
df$average_sales <- average_sales
print(df, row.names = F)
Article Quarter Period Sales average_sales
1 1 1 14 14.5
1 1 2 10 16.5
1 1 3 19 12.0
1 2 1 19 12.0
1 2 2 11 16.0
1 2 3 13 15.0
2 1 1 12 16.5
2 1 2 20 12.5
2 1 3 13 16.0
2 2 1 17 15.0
2 2 2 19 14.0
2 2 3 11 18.0
3 1 1 11 11.5
3 1 2 12 11.0
3 1 3 11 11.5
3 2 1 12 16.0
3 2 2 12 16.0
3 2 3 20 12.0
I know this code still doesn't give me my end result, which would ideally be a fifth variable in the df which contains, for each period, the mean sales of the other two periods, but this is where I'm stuck. I'm not even sure if a for-loop is the best/most efficient way to solve this problem (I'm a limited R coder and not familiar with the entire suite of tidyverse tools), but any suggestions on how to complete the dataframe would also be greatly appreciated. Thanks!
Turning my comments into an answer, with some simplified examples to try to help you understand how to fix your function:
foo1 <- function(x) {
1 + 2
return(x)
}
foo1(0)
# [1] 0
foo1 is my simplified version of your function. It takes in an argument x, does something that doesn't use x, and then returns x. It's a pointless function: it doesn't matter that we do 1 + 2, because nothing is done with the result. In its last line, foo1 returns the same value that was passed to it, untouched.
foo2 <- function(x) {
x + 1
return(x)
}
foo2(0)
# [1] 0
foo2 is a little bit better, but ultimately equally pointless. The calculation in the middle uses x, which is logically a step forward, but the result, x + 1, isn't saved, and the function still returns the original x that was passed in.
foo3 <- function(x) {
y <- x + 1
return(y)
}
foo3(0)
# [1] 1
Finally, a function that does something! foo3 adds 1 to its input, stores that result in a new variable y (it could just as well overwrite x with x <- x + 1), and then returns the new variable.
With a for loop, you can't just do y <- for(...); the assignment has to happen inside the loop:
foo4 <- function(x) {
for(i in 1:3) {
y <- x + i
}
return(y)
}
foo4(0)
# [1] 3
foo4 shows a common beginner mistake: y is assigned each time through the loop, but it is overwritten each time. y will be x + 1 the first time through, then x + 2, and when i is 3 it will be x + 3, with no memory of the previous iterations. We need to give y some length, so it can store each iteration separately.
foo5 <- function(x) {
y <- numeric(3)
for(i in 1:3) {
y[i] <- x + i
}
return(y)
}
foo5(0)
# [1] 1 2 3
foo5 is good! We initialize y to have the right length, and each iteration of the loop saves its result to a different part of y, and then the whole y is returned at the end.
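As an aside, the pre-allocate-and-fill pattern of foo5 can also be written with vapply, which collects one result per iteration for you. This is only a sketch; foo6 is a hypothetical name that is not part of the original answer.

```r
# vapply calls the function once per i and checks that each result
# is a single number (numeric(1)) before combining them into a vector
foo6 <- function(x) {
  vapply(1:3, function(i) x + i, numeric(1))
}
foo6(0)
```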
foo <- function(x) {
y <- list() # with a `list`, we don't absolutely need to specify the length upfront
for(i in unique(x$Period)) {
# use [[ for list assignment
y[[i]] <- x %>%
filter(Period != i) %>%
summarize(
period_excluded = i, # we'll use this to keep track
average_sales = mean(Sales)
)
}
# do ourselves a favor and turn the list of data frames into a single data frame
# with bind_rows before returning
return(bind_rows(y))
}
foo(df)
# period_excluded average_sales
# 1 1 14.58333
# 2 2 14.16667
# 3 3 15.58333
If the goal is, for each period, the mean of Sales over the other periods in the same Article and Quarter, then subtract each row's Sales from the group's total Sales and divide by the group size minus 1.
library(dplyr)
df %>%
group_by(Article, Quarter) %>%
mutate(average_sales = (sum(Sales)- Sales)/(n()-1))
# A tibble: 18 x 5
# Groups: Article, Quarter [6]
# Article Quarter Period Sales average_sales
# <int> <int> <int> <int> <dbl>
# 1 1 1 1 14 14.5
# 2 1 1 2 10 16.5
# 3 1 1 3 19 12
# 4 1 2 1 19 12
# 5 1 2 2 11 16
# 6 1 2 3 13 15
# 7 2 1 1 12 16.5
# 8 2 1 2 20 12.5
# 9 2 1 3 13 16
#10 2 2 1 17 15
#11 2 2 2 19 14
#12 2 2 3 11 18
#13 3 1 1 11 11.5
#14 3 1 2 12 11
#15 3 1 3 11 11.5
#16 3 2 1 12 16
#17 3 2 2 12 16
#18 3 2 3 20 12
data
df <- structure(list(Article = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Quarter = c(1L, 1L, 1L,
2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L),
Period = c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L), Sales = c(14L, 10L, 19L, 19L, 11L,
13L, 12L, 20L, 13L, 17L, 19L, 11L, 11L, 12L, 11L, 12L, 12L,
20L)), row.names = c(NA, -18L), class = "data.frame")
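The same sum-minus-self idea can be checked in base R with ave(), shown here as a small sketch on the first six Sales values from the data above. The data frame sales and its single grouping column are a simplification for illustration.

```r
# First two Article/Quarter groups from the data above, flattened to one key
sales <- data.frame(
  Group  = rep(1:2, each = 3),
  Period = rep(1:3, times = 2),
  Sales  = c(14, 10, 19, 19, 11, 13)
)

# ave() returns the group-level statistic repeated on every row of the group
grp_sum <- ave(sales$Sales, sales$Group, FUN = sum)
grp_n   <- ave(sales$Sales, sales$Group, FUN = length)

# Mean of the other rows in the group: (group total - own value) / (n - 1)
sales$average_sales <- (grp_sum - sales$Sales) / (grp_n - 1)
sales
```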

Create a new data frame column that is a combination of other columns

I have 3 columns a, b, c and I want to combine them into a new column with the help of the column mode, as follows:
if mode = 1, data from a
if mode = 2, data from b
if mode = 3, data from c
example
mode a b c
1 2 3 4
1 5 53 14
3 2 31 24
2 12 13 44
1 20 30 40
Output
mode a b c combine
1 2 3 4 2
1 5 53 14 5
3 2 31 24 24
2 12 13 44 13
1 20 30 40 20
We can use row/column matrix indexing to get the values from the dataset. The row sequence (seq_len(nrow(df1))) and the column index (mode) are cbind-ed into a two-column matrix, which then extracts one value per row from the a, b, c columns (df1[2:4]).
df1$combine <- df1[2:4][cbind(seq_len(nrow(df1)), df1$mode)]
df1$combine
#[1] 2 5 24 13 20
data
df1 <- structure(list(mode = c(1L, 1L, 3L, 2L, 1L), a = c(2L, 5L, 2L,
12L, 20L), b = c(3L, 53L, 31L, 13L, 30L), c = c(4L, 14L, 24L,
44L, 40L)), class = "data.frame", row.names = c(NA, -5L))
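To see why the cbind index works, here is a small sketch on a plain matrix built from the first three rows of the example. The names m, idx and picked are illustrative only.

```r
# Columns a, b, c for the first three rows of the example
m <- matrix(c(2, 5, 2, 3, 53, 31, 4, 14, 24), nrow = 3,
            dimnames = list(NULL, c("a", "b", "c")))
mode <- c(1, 1, 3)

# Each row of idx is a (row, column) pair; indexing with a two-column
# matrix returns exactly one element per pair
idx <- cbind(seq_len(nrow(m)), mode)
picked <- m[idx]
picked
```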
Another base R solution converts mode to letters and then extracts the values from the matching columns. It builds an n-by-n intermediate matrix, so it is best suited to small data.
df1$combine <- diag(as.matrix(df1[, letters[df1$mode]]))
Also, two ways with dplyr. Nested if_else:
library(dplyr)
df1 %>%
mutate(combine =
if_else(mode == 1, a,
if_else(mode == 2, b, c)
)
)
And case_when():
df1 %>% mutate(combine =
case_when(mode == 1 ~ a, mode == 2 ~ b, mode == 3 ~ c)
)

Sorting and aggregating in R

I used the aggregate function in R to bring down my data entries from 90k to 1800.
a=test$ID
b=test$Date
c=test$Value
d=test$Value1
sumA=aggregate(c, by=list(Date=b,Id=a), FUN=sum)
sumB=aggregate(d, by=list(Date=b,Id=a), FUN=sum)
final=sumA[1:2]
final[3]=sumA[3]/sumB[3]
Now I have data for 20 different dates in a month, with close to 90 different IDs each day, so there are around 1800 entries in the final table.
My question: I want to aggregate further and find the maximum value of final[3] for each date, so that I am left with just 20 values.
In simple terms:
There are 20 days.
Each day has 90 values for 90 IDs.
I want to find the maximum of these 90 values for each day.
So in the end I would be left with just 20 values for 20 days.
The aggregate function is not working for me here with 'max' instead of sum.
Date ID Value Value1
1 A 20 10
1 A 25 5
1 B 50 5
1 B 50 5
1 C 25 25
1 C 35 5
2 A 30 10
2 A 25 45
2 B 40 10
2 B 40 30
This is the data.
Using the aggregate function I got the final table as
Date ID x
1 A 45/15=3
1 B 100/10=10
1 C 60/30=2
2 A 55/55=1
2 B 80/40=2
Now I want the maximum value for dates 1 and 2, that's it:
Date max- Value
1 10
2 2
This is a one-step process using data.table. A data.table is an enhanced data.frame: it inherits from data.frame, so most code written for data frames keeps working.
Step0: Converting data.frame to data.table:
library(data.table)
setDT(test)
setkey(test,Date,ID)
Step1: Do the computation
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
Here is an explanation of the step:
The first part creates what you call the final table in your question:
test[,sum(Value)/sum(Value1),by=key(test)]
# Date ID V1
# 1: 1 A 3
# 2: 1 B 10
# 3: 1 C 2
# 4: 2 A 1
# 5: 2 B 2
This result is then passed to the second bracket, which takes the max by Date:
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
# Date V1
# 1: 1 10
# 2: 2 2
Hope this helps.
It's a very well documented package. You should read more about it.
Maybe this helps:
test <- structure(list(Date = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), ID = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B"),
Value = c(20L, 25L, 50L, 50L, 25L, 35L, 30L, 25L, 40L, 40L
), Value1 = c(10L, 5L, 5L, 5L, 25L, 5L, 10L, 45L, 10L, 30L
)), .Names = c("Date", "ID", "Value", "Value1"), class = "data.frame", row.names = c(NA,
-10L))
res1 <- aggregate(. ~ID+Date, data=test, FUN=sum)
res1 <- transform(res1, x=Value/Value1)
res1
# ID Date Value Value1 x
#1 A 1 45 15 3
#2 B 1 100 10 10
#3 C 1 60 30 2
#4 A 2 55 55 1
#5 B 2 80 40 2
aggregate(. ~Date, data=res1[,-c(1,3:4)], FUN=max)
# Date x
# 1 1 10
# 2 2 2
First I ran aggregate with two grouping variables (ID and Date) on the two value columns, using the formula . ~ ID + Date.
Then I created a new variable x, i.e. Value/Value1, with transform.
Finally I ran aggregate again with one grouping variable (Date), after removing all variables except Date and x.

Resources