Extract values according to the result - r

I have a data frame that describes people by characteristics such as occupation, gender, and telework use:
data = data.frame (profession = sample (c ("craftsman", "employee", "senior executive"), 10000, replace = TRUE), sex = sample (c ("M", "F"), 10000, replace = TRUE), en_teletjob = sample (c ("Yes", "No"), 10000, replace = TRUE))
I would like to create a new data frame by extracting rows from "data" such that:
it contains 20% men and 80% women,
and 60% craftsmen, 20% employees, and 20% senior executives,
and 50% "Yes" for the use of telework.
Is it possible to do this in R?
Thank you

One approach is to combine apply() with table() and prop.table() to summarise the proportions of every variable. Here is the code:
#Code
apply(data,2,function(x) prop.table(table(x)))
Output:
$profession
x
craftsman employee senior executive
0.3331 0.3315 0.3354
$sex
x
F M
0.4987 0.5013
$en_teletjob
x
No Yes
0.503 0.497

You can use lapply() to call proportions() on each variable. It returns a list object.
lapply(data, function(x) proportions(table(x)))
# $profession
# x
# craftsman employee senior executive
# 0.3336 0.3318 0.3346
#
# $sex
# x
# F M
# 0.5035 0.4965
#
# $en_teletjob
# x
# No Yes
# 0.4978 0.5022
Note: prop.table() is an earlier name of proportions(), retained for back-compatibility.

An option with the tidyverse would be to use adorn_percentages() from the janitor package
-code
library(purrr)
library(dplyr)
library(janitor)
map(names(data), ~data %>%
select(.x) %>%
count(!! rlang::sym(.x)) %>%
adorn_percentages(denominator = 'col'))
-output
#[[1]]
# profession n
# craftsman 0.3302
# employee 0.3320
# senior executive 0.3378
#[[2]]
# sex n
# F 0.5108
# M 0.4892
#[[3]]
# en_teletjob n
# No 0.4981
# Yes 0.5019
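The snippets above report the current proportions but do not draw the extract itself. A minimal sketch of one way to do that in base R, under two assumptions of mine that are not in the question: the three targets are treated as independent, and the extract has an arbitrary size n_out. It computes a quota for every profession/sex/telework combination and samples that many rows from each one.

```r
set.seed(42)
data <- data.frame(
  profession  = sample(c("craftsman", "employee", "senior executive"), 10000, replace = TRUE),
  sex         = sample(c("M", "F"), 10000, replace = TRUE),
  en_teletjob = sample(c("Yes", "No"), 10000, replace = TRUE),
  stringsAsFactors = FALSE
)

# Target marginal shares taken from the question
target <- list(
  profession  = c(craftsman = 0.6, employee = 0.2, `senior executive` = 0.2),
  sex         = c(M = 0.2, F = 0.8),
  en_teletjob = c(Yes = 0.5, No = 0.5)
)

n_out <- 1000  # hypothetical size of the extract

# One row per combination, with the number of rows to draw from it
# (assumes the three targets are independent of each other)
cells <- expand.grid(lapply(target, names), stringsAsFactors = FALSE)
cells$quota <- round(n_out *
  target$profession[cells$profession] *
  target$sex[cells$sex] *
  target$en_teletjob[cells$en_teletjob])

# Draw each quota without replacement from the matching rows
picked <- unlist(lapply(seq_len(nrow(cells)), function(i) {
  idx <- which(data$profession  == cells$profession[i] &
               data$sex         == cells$sex[i] &
               data$en_teletjob == cells$en_teletjob[i])
  idx[sample.int(length(idx), cells$quota[i])]
}))
extract <- data[picked, ]

# Check the resulting proportions
lapply(extract, function(x) prop.table(table(x)))
```

This only works while every combination has at least as many rows as its quota; with a larger n_out or rarer combinations, sample.int() would stop with an error.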

Related

How to complete rownames in R?

I have imported a table that looks like this:
df <- data.frame(study=c("A", "", "", "B", "C", ""),
outcome=c("mortality", "mortality", "surgery", "mortality", "mortality", "surgery"),
time.point=c("30d", "1y", "10d", "1y", "5y", "20d"))
The 2nd and 3rd outcome belong to study A, the 6th outcome belongs to study C.
In my table there are various examples like this with irregular number of outcomes and time-points in each study.
How can I assign a good name to each row indicating the study and outcome and time point predicted?
I want it to look like that:
df_new <- data.frame(study=c("A", "", "", "B", "C", ""),
outcome=c("mortality", "mortality", "surgery", "mortality", "mortality", "surgery"),
time.point=c("30d", "1y", "10d", "1y", "5y", "20d"),
rowname=c("A_mortality_30d", "A_mortality_1y", "A_surgery_10d", "B_mortality_1y", "C_mortality_5y", "C_surgery_20d"))
Thank you so much!
Here is an approach that first changes the empty strings to NA:
library( data.table ); library( zoo )
#make it a data.table
setDT(df)
#set empty strings as NA
df[ study == "", study := NA_character_ ]
#create new column
df[, rowname := paste( zoo::na.locf( study), outcome, time.point, sep = "_")][]
# study outcome time.point rowname
# 1: A mortality 30d A_mortality_30d
# 2: <NA> mortality 1y A_mortality_1y
# 3: <NA> surgery 10d A_surgery_10d
# 4: B mortality 1y B_mortality_1y
# 5: C mortality 5y C_mortality_5y
# 6: <NA> surgery 20d C_surgery_20d
Credits to Oliver. First part is from him. He was faster.
Then you can use unite from tidyr package.
library(tidyr)
library(dplyr)
df1 <- df %>%
mutate(study = case_when(study == "" ~ NA_character_ ,
TRUE ~ study)) %>%
fill(study, .direction = 'down') %>%
unite(rowname, study, outcome, time.point, sep= "_", remove = FALSE)
You could do something like:
library(tidyverse)
df$rowname <- df %>% mutate(study = case_when(study == "" ~ NA_character_ ,
TRUE ~ study)) %>%
fill(study, .direction = 'down') %>%
(function(x)mapply(paste, sep = '_', study = x$study, outcome = x$outcome, time.point = x$time.point))
#alternative use rownames(df) <- ...
df
# study outcome time.point rowname
# 1 A mortality 30d A_mortality_30d
# 2 mortality 1y A_mortality_1y
# 3 surgery 10d A_surgery_10d
# 4 B mortality 1y B_mortality_1y
# 5 C mortality 5y C_mortality_5y
# 6 surgery 20d C_surgery_20d
Here I first replace the empty study values with NA_character_ so that I can use fill to fill them in. Then I use mapply to iterate over the values in each column. The mapply is wrapped in a function only because I want it within a pipe.
Base R solution using grep to get the line numbers of non-empty studies, counting their repeats with diff, and then repeating them with rep.
studies <- df[df$study != "", "study"]
reps <- diff(c(grep(".", df$study), nrow(df) +1))
rownames(df) <- paste(rep(studies, reps), df$outcome, df$time.point, sep="_")
> df
study outcome time.point
A_mortality_30d A mortality 30d
A_mortality_1y mortality 1y
A_surgery_10d surgery 10d
B_mortality_1y B mortality 1y
C_mortality_5y C mortality 5y
C_surgery_20d surgery 20d
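The same fill-down idea can also be written with cumsum() in base R. This is just a sketch, and it assumes the first row always carries a study label:

```r
df <- data.frame(
  study      = c("A", "", "", "B", "C", ""),
  outcome    = c("mortality", "mortality", "surgery", "mortality", "mortality", "surgery"),
  time.point = c("30d", "1y", "10d", "1y", "5y", "20d"),
  stringsAsFactors = FALSE
)

has_study <- df$study != ""
# cumsum() counts how many labels have been seen so far (1 1 1 2 3 3),
# which is exactly the position of each row's label among the non-empty ones
filled <- df$study[has_study][cumsum(has_study)]

df$rowname <- paste(filled, df$outcome, df$time.point, sep = "_")
df$rowname
```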

Clear and Concise Way to apply Standardization to both Train and Test Set in R

I am making a 90/10 training/test split of some data in R. Once I have the training set, I would like to standardize it. I would then like to apply the same mean and standard deviation computed on the training set to standardize the test set.
I would like to do this in the most base-R way possible but would be ok with a dplyr solution too. Note that I have columns that are both factors/chr and numeric. Of course I need to select the numeric ones first.
My first setup is below with a reproducible example code. I have the means and standard deviations for the appropriate numeric columns, now how can I apply the standardization back to the specific columns on the training and test data?
library(tidyverse)
rm(list = ls())
x <- data.frame("hame" = c("Bob", "Roberta", "Brady", "Jen", "Omar", "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
"age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34,67),
"income" = c(50000, 60000, 100000, 90000, 100000, 95000, 75000, 85000, 95000, 105000))
train_split_pct = 0.90
train_size <- ceiling(nrow(x)*train_split_pct) # num of rows for training set
test_size <- nrow(x) - train_size # num of rows for testing set
set.seed(123)
ix <- sample(1:nrow(x)) # shuffle
x_new = x[ix, ]
Train_set = x_new[1:train_size, ]
Test_set = x_new[(train_size+1):(train_size+test_size), ]
Train_mask <- Train_set %>% select_if(is.numeric)
Train_means <- Train_mask %>% apply(2, mean)
Train_stddevs <- Train_mask %>% apply(2, sd)
We can do this in a concise way. Get the mean and sd of the 'Train' dataset ('mean_sd'). Note that with dplyr version >= 1.0, summarise can return more than one row (from dplyr 1.1.0 on this is deprecated in favour of reframe()). So, make use of that feature to create a two-row dataset: first row => mean, second row => sd
library(dplyr) # >= 1.0.0
library(purrr)
mean_sd <- Train_set %>%
summarise(across(where(is.numeric), ~ c(mean(., na.rm = TRUE),
sd(., na.rm = TRUE))))
Then, create a function ('f1') to do the standardization.
f1 <- function(x, y) (x -y[1])/y[2]
Loop over the list of 'Train', 'Test' dataset, use map2 to loop over the corresponding columns based on the 'mean_sd' dataset, apply the f1 and assign that output to the columns. Then, with list2env, we can update the same objects in the global environment
list2env(map(lst(Train_set, Test_set), ~ {
.x[names(mean_sd)] <- map2(select(.x, names(mean_sd)), mean_sd, f1)
.x}), .GlobalEnv)
-output
Train_set
# hame age income
#3 Brady -1.3286522 0.7745967
#10 Jeff 1.6256451 1.0327956
#2 Roberta 0.7815601 -1.2909944
#8 Aaron -0.8362693 0.0000000
#6 Phillip -0.2735460 0.5163978
#9 Annie -0.6955885 0.5163978
#1 Bob 1.1332622 -1.8073922
#7 Natalie 0.2188368 -0.5163978
#5 Omar -0.6252481 0.7745967
Test_set
# hame age income
#4 Jen -0.9769502 0.2581989
Consider this as an option. The scale() function standardizes your variables, and mutate_if() lets you pick the numeric variables without creating extra data frames. Here is the code using dplyr, where I create two new data frames with the standardized values:
library(tidyverse)
rm(list = ls())
x <- data.frame("hame" = c("Bob", "Roberta", "Brady", "Jen", "Omar", "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
"age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34,67),
"income" = c(50000, 60000, 100000, 90000, 100000, 95000, 75000, 85000, 95000, 105000))
train_split_pct = 0.90
train.size <- ceiling(nrow(x)*train_split_pct) # num of rows for training set
test.size <- nrow(x) - train.size # num of rows for testing set
set.seed(123)
ix <- sample(1:nrow(x)) # shuffle
x_new = x[ix, ]
Train.set = x_new[1:train.size, ]
Test.set = x_new[(train.size+1):(train.size+test.size), ]
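#Normalize (each set is scaled with its own mean and sd)
Train.set2 <- Train.set %>%
mutate_if(is.numeric, scale)
#Note: this uses the test set's own statistics, not the training ones;
#with a single test row the sd is NA, so the scaled values are NA
Test.set2 <- Test.set %>%
mutate_if(is.numeric, scale)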
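scale() also accepts explicit center and scale vectors, so the training statistics can be reused on the test set in base R. A minimal sketch with shortened object names of my own (train, test, mu, sigma) and the column spelled "name":

```r
x <- data.frame(
  name   = c("Bob", "Roberta", "Brady", "Jen", "Omar", "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
  age    = c(60, 55, 25, 30, 35, 40, 47, 32, 34, 67),
  income = c(50000, 60000, 100000, 90000, 100000, 95000, 75000, 85000, 95000, 105000),
  stringsAsFactors = FALSE
)

set.seed(123)
ix    <- sample(nrow(x))            # shuffle
train <- x[ix[1:9], ]
test  <- x[ix[10], , drop = FALSE]

# Numeric columns and their training-set statistics
num_cols <- names(train)[sapply(train, is.numeric)]
mu    <- colMeans(train[num_cols])
sigma <- sapply(train[num_cols], sd)

# Apply the same centring and scaling to both sets
train[num_cols] <- as.data.frame(scale(train[num_cols], center = mu, scale = sigma))
test[num_cols]  <- as.data.frame(scale(test[num_cols],  center = mu, scale = sigma))

train
test
```

mu and sigma have to be computed before train is overwritten, otherwise the test set would be scaled with the statistics of already standardized data.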
Update: If the scale() is not working, you can try reshaping the data and joining with the computed values for mean and SD:
#Define indexes for numeric vars
index.train <- which(names(Train.set)%in% names(Train_means))
#Format means and sd to merge
Train2 <- Train.set %>%
mutate(id=row_number()) %>%
pivot_longer(cols=index.train) %>%
left_join(
Train_means %>% t() %>%data.frame %>%
pivot_longer(everything()) %>%
rename(Mean=value) %>%
left_join(Train_stddevs %>% t() %>%data.frame %>%
pivot_longer(everything()) %>%
rename(SD=value))
) %>%
#Compute standard values
mutate(SValue=(value-Mean)/SD) %>%
select(-c(value,Mean,SD)) %>%
pivot_wider(names_from = name,values_from=SValue) %>% select(-id)
Output:
# A tibble: 9 x 3
hame age income
<fct> <dbl> <dbl>
1 Brady -1.33 0.775
2 Jeff 1.63 1.03
3 Roberta 0.782 -1.29
4 Aaron -0.836 0
5 Phillip -0.274 0.516
6 Annie -0.696 0.516
7 Bob 1.13 -1.81
8 Natalie 0.219 -0.516
9 Omar -0.625 0.775
And for the test set, the process is similar:
#Define indexes
index.test <- which(names(Test.set)%in% names(Train_means))
#Format means and sd 2
Test2 <- Test.set %>%
mutate(id=row_number()) %>%
pivot_longer(cols=index.test) %>%
left_join(
Train_means %>% t() %>%data.frame %>%
pivot_longer(everything()) %>%
rename(Mean=value) %>%
left_join(Train_stddevs %>% t() %>%data.frame %>%
pivot_longer(everything()) %>%
rename(SD=value))
) %>%
#Compute standard values
mutate(SValue=(value-Mean)/SD) %>%
select(-c(value,Mean,SD)) %>%
pivot_wider(names_from = name,values_from=SValue) %>% select(-id)
Output:
# A tibble: 1 x 3
hame age income
<fct> <dbl> <dbl>
1 Jen -0.977 0.258
The key is merging the values after reshaping. To illustrate, here is the intermediate step for the test dataset:
# A tibble: 2 x 7
hame id name value Mean SD SValue
<fct> <int> <chr> <dbl> <dbl> <dbl> <dbl>
1 Jen 1 age 30 43.9 14.2 -0.977
2 Jen 1 income 90000 85000 19365. 0.258
That way it is easy to compute the standardized values you want.
After reviewing the prior answers, which worked fine, I found them a bit unclear and not very intuitive to use. I have achieved the desired result with a for loop. While slightly rudimentary, I believe it is a clearer approach. Since I don't have many columns, I don't see a major issue with this solution; with many columns of data I would need help finding a faster one.
Regardless, my method is as follows. I gather all column names in my Train_mask, which contains only the numeric columns. Next, I loop through each of the names and update the values with the standardization from their respective Train_means and Train_stddevs.
Because of the way I construct my training and testing sets, there should be no issues with the order of the columns, and they can be used sequentially in the following fashion.
library(tidyverse)
rm(list = ls())
x <- data.frame("name" = c("Bob", "Roberta", "Brady", "Jen", "Omar", "Phillip", "Natalie", "Aaron", "Annie", "Jeff"),
"age" = c(60, 55, 25, 30, 35, 40, 47, 32, 34,67),
"income" = c(50000, 60000, 100000, 90000, 100000, 95000, 75000, 85000, 95000, 105000))
train_split_pct = 0.90
train_size <- ceiling(nrow(x)*train_split_pct) # num of rows for training set
test_size <- nrow(x) - train_size # num of rows for testing set
set.seed(123)
ix <- sample(1:nrow(x)) # shuffle
x_new = x[ix, ]
Train_set = x_new[1:train_size, ]
Test_set = x_new[(train_size+1):(train_size+test_size), ]
Train_mask <- Train_set %>% select_if(is.numeric)
Train_means <- data.frame(as.list(Train_mask %>% apply(2, mean)))
Train_stddevs <- data.frame(as.list(Train_mask %>% apply(2, sd)))
col_names <- names(Train_mask)
for (i in 1:ncol(Train_mask)){
Train_set[,col_names[i]] <- (Train_set[,col_names[i]] - Train_means[,col_names[i]])/Train_stddevs[,col_names[i]]
Test_set[,col_names[i]] <- (Test_set[,col_names[i]] - Train_means[,col_names[i]])/Train_stddevs[,col_names[i]]
}
Train_set
Test_set
Output:
> Train_set
name age income
3 Brady -1.3286522 0.7745967
10 Jeff 1.6256451 1.0327956
2 Roberta 0.7815601 -1.2909944
8 Aaron -0.8362693 0.0000000
6 Phillip -0.2735460 0.5163978
9 Annie -0.6955885 0.5163978
1 Bob 1.1332622 -1.8073922
7 Natalie 0.2188368 -0.5163978
5 Omar -0.6252481 0.7745967
> Test_set
name age income
4 Jen -0.9769502 0.2581989

Filtering transaction level data

I am dealing with a data frame containing the transaction level data. It contains two fields, bill_id and product.
The data represents products purchased at a bill level, and a particular bill_id gets repeated as many times as the number of products purchased in that bill. For example, if 5 items have been purchased in bill_id 12345, the data for this bill will be like this:
bill_id product
12345 A
12345 B
12345 C
12345 D
12345 E
My objective is to keep only the rows of all bills that contain a certain product.
Following is an example of how I am performing this task currently:
library(dplyr)
set.seed(1)
# Sample data
dat <- data.frame(bill_id = sample(1:500, size = 1000, replace = TRUE),
product = sample(LETTERS, size = 1000, replace = TRUE),
stringsAsFactors = FALSE) %>%
arrange(bill_id, product)
# vector of bill_ids of product A
bills_productA <- dat %>%
filter(product == "A") %>%
pull(bill_id) %>%
unique()
# data for bill_ids in vector bills_productA
dat_subset <- dat %>%
filter(bill_id %in% bills_productA)
This leads to the creation of an intermediary vector of bill_ids (bills_productA) and a two-step filtering process (first find ids of bills containing the product, and then find all transactions of these bills).
Is there a more efficient way of performing this task?
a data.table approach:
preparation
library(data.table)
setDT(dat)
actual code
dat[ bill_id %in% dat[ product == "A",][[1]], ]
output
# bill_id product
# 1: 14 A
# 2: 14 I
# 3: 19 A
# 4: 19 W
# 5: 22 A
# ---
# 130: 478 A
# 131: 478 V
# 132: 478 Z
# 133: 494 A
# 134: 494 J
You can filter the bill_id by directly subsetting it
library(dplyr)
dat_subset1 <- dat %>% filter(bill_id %in% unique(bill_id[product == "A"]))
identical(dat_subset, dat_subset1)
#[1] TRUE
This would also work without unique in it but better to keep the list short.
Another variation:
library(dplyr)
dat_subset2 <- semi_join(dat, filter(dat, product == "A") %>% select(bill_id))
> identical(dat_subset, dat_subset2)
[1] TRUE
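The one-step subsetting idea also works without any package. A small base R sketch on a toy data frame of my own (not the 1000-row sample from the question):

```r
# Toy data: bills 1 and 3 contain product "A", bill 2 does not
dat <- data.frame(
  bill_id = c(1, 1, 2, 3, 3, 3),
  product = c("A", "B", "C", "A", "D", "E"),
  stringsAsFactors = FALSE
)

# Keep every row whose bill_id appears on a row with product "A"
dat_subset <- dat[dat$bill_id %in% dat$bill_id[dat$product == "A"], ]
dat_subset
```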

R for loop for calculating sums based on a data frame's different columns

My current data frame looks like this:
# Create sample data
my_df <- data.frame(seq(1, 100), rep(c("ind_1", "", "", ""), times = 25), rep(c("", "ind_2", "", ""), times = 25), rep(c("", "", "ind_3", ""), times = 25), rep(c("", "", "", "ind_4"), times = 25))
# Rename columns
names(my_df)[names(my_df)=="seq.1..100."] <- "value"
names(my_df)[names(my_df)=="rep.c..ind_1................times...25."] <- "ind_1"
names(my_df)[names(my_df)=="rep.c......ind_2............times...25."] <- "ind_2"
names(my_df)[names(my_df)=="rep.c..........ind_3........times...25."] <- "ind_3"
names(my_df)[names(my_df)=="rep.c..............ind_4....times...25."] <- "ind_4"
# Replace empty elements with NA
my_df[my_df==''] = NA
What I want to script is a rather simple for loop that calculates the sum of the value column for each of the four ind_* columns and prints the result.
So far my very meagre attempt has been:
# Create a vector with all individuals
individuals <- c("ind_1", "ind_2", "ind_3", "ind_4")
# Calculate aggregates for each individual
for (i in individuals){
ind <- 1
sum_i <- aggregate(value~ind_1, data = my_df, sum)
print(paste("Individual", i, "possesses an aggregated value of", sum_i$value))
ind <- ind + 1
}
As you can see, I am struggling to write the command so that the sum is calculated for one column after another; the current code, naturally, only calculates the result for ind_1. What needs to be changed in the aggregate command to achieve the desired result? (I'm a total beginner, but I thought of using indices to move from one column to the next.)
Assuming you'd want to calculate the sum where the ind column matches the corresponding entry in your individuals vector:
individuals <- c("ind_1", "ind_2", "ind_3", "ind_4")
for (i in 1:(ncol(my_df)-1)){
print(sum(my_df$value[which(my_df[,individuals[i]] == individuals[i])]))
}
Why do you want to use print() instead of storing the results in a separate vector?
You can try tidyverse as well:
library(tidyverse)
my_df %>%
gather(key, Inds, -value) %>%
filter(!is.na(Inds)) %>%
group_by(key) %>%
summarise(Sum=sum(value))
# A tibble: 4 x 2
key Sum
<chr> <int>
1 ind_1 1225
2 ind_2 1250
3 ind_3 1275
4 ind_4 1300
The idea is to make the data long using gather, filter the NAs out, then group by key and summarise the values.
A solution with reshape2 and base aggregate() would be:
library(reshape2)
my_df_long <- melt(my_df, id.vars = "value",value.name = "ID")
aggregate(value ~ ID, my_df_long, sum, na.rm = TRUE)
ID value
1 ind_1 1225
2 ind_2 1250
3 ind_3 1275
4 ind_4 1300
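For a version that needs no extra package and stays close to the loop in the question, here is a sketch with sapply(). The data frame is rebuilt with named columns and NA directly, which is my shortcut rather than the construction used in the question:

```r
my_df <- data.frame(
  value = seq(1, 100),
  ind_1 = rep(c("ind_1", NA, NA, NA), times = 25),
  ind_2 = rep(c(NA, "ind_2", NA, NA), times = 25),
  ind_3 = rep(c(NA, NA, "ind_3", NA), times = 25),
  ind_4 = rep(c(NA, NA, NA, "ind_4"), times = 25),
  stringsAsFactors = FALSE
)

individuals <- c("ind_1", "ind_2", "ind_3", "ind_4")

# For each ind_* column, sum the values on the rows where it is not NA
sums <- sapply(individuals, function(i) sum(my_df$value[!is.na(my_df[[i]])]))

# Print one line per individual, as in the original loop
for (i in individuals) {
  print(paste("Individual", i, "possesses an aggregated value of", sums[[i]]))
}
```

The key change compared with the loop in the question is my_df[[i]], which selects the column whose name is stored in i instead of always using ind_1.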

Doing chisq.test on data frame for multiple pairwise comparisons

I have the following dataframe:
species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
df <- cbind(species, category, minus, plus)
df<-as.data.frame(df)
I want to do a chisq.test for each category-species combination, like this:
Species a, category h and l: p-value
Species a, category h and m: p-value
Species a, category l and m: p-value
Species b, ... and so on
With the following chisq.test (dummy code):
chisq.test(c(minus(cat1, cat2),plus(cat1, cat2)))$p.value
I want to end up with a table that presents each chisq.test p-value for each comparison, like this:
Species Category1 Category2 p-value
a h l 0.05
a h m 0.2
a l m 0.1
b...
Where Category1 and Category2 are the categories compared in the chisq.test.
Is this possible to do using dplyr? I have tried tweaking what was mentioned in here and here, but they don't really apply to this issue, as I am seeing it.
EDIT: I also would like to see how this could be done for the following dataset:
species <- c(1:11)
minus <- c(132,78,254,12,45,76,89,90,100,42,120)
plus <- c(1,2,0,0,0,3,2,5,6,4,0)
I would like to do a chisq.test for each species in the table compared to every other species in the table (a pairwise comparison across all species). I want to end up with something like this:
species1 species2 p-value
1 2 0.5
1 3 0.7
1 4 0.2
...
11 10 0.02
I tried changing the code above to the following:
species_chisq %>%
do(data_frame(species1 = first(.$species),
species2 = last(.$species),
data = list(matrix(c(.$minus, .$plus), ncol = 2)))) %>%
mutate(chi_test = map(data, chisq.test, correct = FALSE)) %>%
mutate(p.value = map_dbl(chi_test, "p.value")) %>%
ungroup() %>%
select(species1, species2, p.value)
However, this only produced a table where each species was compared to itself, not to the other species. I do not quite understand where the original code given by @ycw specifies which groups are compared.
EDIT 2:
I managed to do this by the code found here.
A solution using dplyr and purrr. Note that I am not familiar with the chi-square test, but I follow the way you specified in @Vincent Bonhomme's post: chisq.test(test, correct = FALSE).
In addition, to create the example data frame there is no need to use cbind; data.frame alone is sufficient. stringsAsFactors = FALSE is important to prevent the columns from becoming factors.
# Create example data frame
species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
df <- data.frame(species, category, minus, plus, stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(purrr)
# Process the data
df2 <- df %>%
group_by(species) %>%
slice(c(1, 2, 1, 3, 2, 3)) %>%
mutate(test = rep(1:(n()/2), each = 2)) %>%
group_by(species, test) %>%
do(data_frame(species = first(.$species),
test = first(.$test[1]),
category1 = first(.$category),
category2 = last(.$category),
data = list(matrix(c(.$minus, .$plus), ncol = 2)))) %>%
mutate(chi_test = map(data, chisq.test, correct = FALSE)) %>%
mutate(p.value = map_dbl(chi_test, "p.value")) %>%
ungroup() %>%
select(species, category1, category2, p.value)
df2
# A tibble: 25 x 4
species category1 category2 p.value
<chr> <chr> <chr> <dbl>
1 a h l 0.3465104
2 a h m 0.1354680
3 a l m 0.6040227
4 b h l 0.2339414
5 b h m 0.4798647
6 b l m 0.4399181
7 c h l 0.4714005
8 c h m 0.6987413
9 c l m 0.5729834
10 d h l 0.2196806
# ... with 15 more rows
First, you should create your data.frame with data.frame, otherwise minus and plus columns are turned into factors.
species <- c("a","a","a","b","b","b","c","c","c","d","d","d","e","e","e","f","f","f","g","h","h","h","i","i","i")
category <- c("h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","h","l","m","l","h","l","m","h","l","m")
minus <- c(31,14,260,100,70,200,91,152,842,16,25,75,60,97,300,125,80,701,104,70,7,124,24,47,251)
plus <- c(2,0,5,0,1,1,4,4,30,1,0,0,2,0,5,0,0,3,0,0,0,0,0,0,4)
df <- data.frame(species=species, category=category, minus=minus, plus=plus)
Then, I'm not sure there is a pure dplyr way to do it (would be glad to be shown the contrary), but I think here is a partly-dplyr way to do it:
library(dplyr)
df_combinations <-
# create a df with all interactions
expand.grid(df$species, df$category, df$category) %>%
# rename columns
`colnames<-`(c("species", "category1", "category2")) %>%
# 3 lines below:
# manage to only retain within a species, category(1 and 2) columns
# with different values
unique %>%
group_by(species) %>%
filter(category1 != category2) %>%
# cosmetics
arrange(species, category1, category2) %>%
ungroup() %>%
# prepare an empty column
mutate(p.value=NA)
# now we loop to fill your result data.frame
for (i in 1:nrow(df_combinations)){
# filter appropriate lines
cat1 <- filter(df,
species==df_combinations$species[i],
category==df_combinations$category1[i])
cat2 <- filter(df,
species==df_combinations$species[i],
category==df_combinations$category2[i])
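# calculate the chisq.test and assign its p-value to the right line
# (note: given a plain vector, chisq.test runs a goodness-of-fit test against
# equal proportions; pass matrix(c(...), ncol = 2) for a 2x2 test of independence)
df_combinations$p.value[i] <- chisq.test(c(cat1$minus, cat2$minus,
cat1$plus, cat2$plus))$p.value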
}
Let's have a look at the resulting data.frame:
head(df_combinations)
# A tibble: 6 x 4
# Groups: species [1]
species category1 category2 p.value
<fctr> <fctr> <fctr> <dbl>
1 a h l 3.290167e-11
2 a h m 1.225872e-134
3 a l h 3.290167e-11
4 a l m 5.824842e-150
5 a m h 1.225872e-134
6 a m l 5.824842e-150
Checking the first row:
chisq.test(c(31, 14, 2, 0))$p.value
[1] 3.290167e-11
Is this what you wanted?
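For the second dataset in the question (every species against every other species), a possible base R sketch uses combn() to list the pairs and passes a 2x2 matrix to chisq.test(), so that a test of independence is run. correct = FALSE follows the answer above; the object names pairs and result are my own.

```r
species <- 1:11
minus   <- c(132, 78, 254, 12, 45, 76, 89, 90, 100, 42, 120)
plus    <- c(1, 2, 0, 0, 0, 3, 2, 5, 6, 4, 0)

# All unordered pairs of species, one pair per row
pairs <- t(combn(species, 2))

# 2x2 table (rows = the two species, columns = minus/plus) for each pair.
# Species are numbered 1..11, so they double as row positions here.
p_values <- apply(pairs, 1, function(p) {
  tab <- cbind(minus = minus[p], plus = plus[p])
  # Small expected counts trigger an approximation warning; silenced here
  suppressWarnings(chisq.test(tab, correct = FALSE)$p.value)
})

result <- data.frame(species1 = pairs[, 1],
                     species2 = pairs[, 2],
                     p.value  = p_values)
head(result)
```

Pairs in which neither species has any "plus" counts have zero expected counts, so their p-value is not a number and should be read as "test not applicable". With counts this small, fisher.test() may be worth considering instead.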
