Chi-squared test in R: multiple groups in a data frame

I'm new to R and trying to run some statistical tests.
My data looks like this:
Name Freqeunce Target Total
Steve 1 A 11
Marcel 1 A 11
Marie 1 A 11
John 2 A 11
Max 2 A 11
Alice 4 A 11
Mariane 1 B 1
Rose 1 C 3
Carla 1 C 3
Happy 1 C 3
I want to run a chi-squared test of homogeneity for each target type (A, B and C).
Is it possible in R to run a loop that writes the p-value for each name into a column, or do I have to extract the information first and then run the chi-squared test?
The objective is to identify which names are under-represented in their group according to the frequencies. There are more than 2000 groups, which is why I want a loop.
Thank you for your answer
Baptiste

I think this will answer your question. I don't know if this is the type of chi^2 test you want but you can always change the function. I use group_by and mutate from the dplyr package and write a function to perform the chi^2 test and extract the pvalue.
library(dplyr)
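df <- read.table("test2.txt", header = TRUE)
# Run a chi-squared test on a two-row matrix built from x and y and return the p-value
c2_all <- function(x, y) {
  mat <- matrix(c(x, y), nrow = 2)
  c2 <- chisq.test(mat)
  return(c2$p.value)
}
# One p-value per Target group, repeated on every row of that group
result <- df %>% group_by(Target) %>% mutate(pvalue = c2_all(Name, Freqeunce))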
result
# A tibble: 11 x 5
# Groups: Target [3]
Name Freqeunce Target Total pvalue
<fct> <int> <fct> <int> <dbl>
1 Steve 1 A 11 0.285
2 Marcel 1 A 11 0.285
3 Marie 1 A 11 0.285
4 John 2 A 11 0.285
5 Max 2 A 11 0.285
6 Alice 4 A 11 0.285
7 Sarah 2 B 3 1.00
8 Mariane 1 B 3 1.00
9 Rose 1 C 5 0.223
10 Carla 3 C 5 0.223
11 Happy 1 C 5 0.223
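As a hedged alternative, here is a minimal sketch that assumes the goal is a goodness-of-fit test of the frequencies within each Target against equal expected frequencies. That reading of the question is an assumption, and the helper name gof_p is made up for illustration. Groups with a single name cannot be tested, so they get NA. The data frame is typed in from the question rather than read from a file.

```r
library(dplyr)

# Data copied from the question (column name spelled as in the original)
df <- data.frame(
  Name = c("Steve", "Marcel", "Marie", "John", "Max", "Alice",
           "Mariane", "Rose", "Carla", "Happy"),
  Freqeunce = c(1, 1, 1, 2, 2, 4, 1, 1, 1, 1),
  Target = c(rep("A", 6), "B", rep("C", 3))
)

# Hypothetical helper: goodness-of-fit p-value against equal frequencies.
# chisq.test() needs at least two counts, so single-name groups return NA.
# Warnings about small expected counts are suppressed here; with counts
# this small the approximation is rough.
gof_p <- function(counts) {
  if (length(counts) < 2) return(NA_real_)
  suppressWarnings(chisq.test(counts)$p.value)
}

result <- df %>%
  group_by(Target) %>%
  mutate(pvalue = gof_p(Freqeunce)) %>%
  ungroup()

result
```

This gives one p-value per group, not per name. To see which names drive a result, the residuals component of the chisq.test() return value may be a better starting point.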

Related

Acceptable practice to use 'group_by' stats in mutate?

In the past, when I've needed to create a new variable in an R data frame that is partly based on a 'group_by' summary statistic, I've always used the following sequence:
(1) calculate 'group stats' from data in the base (ungrouped) data frame using group_by() and summarize()
(2) join the base data frame with the result of the previous step, then calculate the new variable value using mutate.
However, (after years of using dplyr!) I accidentally did the 'summarizing' in a mutate step and everything seemed to work. This is illustrated in Option #2 in the code snippet below. I'm assuming Option #2 is okay because I'm getting identical results using both options, and because I found similar examples searching the web today. However, I wasn't sure.
Is Option #2 acceptable practice, or is Option #1 preferred (and if so why)?
library(dplyr)
set.seed(123)
df <- tibble(year_ = c(rep(2019, 4), rep(2020, 4)),
             qtr_ = rep(c(1, 2, 3, 4), 2),
             foo = sample(1:8))
# Option 1: calc statistics then rejoin with input data
df_stats <- df %>%
  group_by(year_) %>%
  summarize(mean_foo = mean(foo))
df_with_stats <- left_join(df, df_stats, by = "year_") %>%
  mutate(dfoo = foo - mean_foo)
# Option 2: everything in one go
df_with_stats2 <- df %>%
group_by(year_) %>%
mutate(mean_foo = mean(foo),
dfoo = foo - mean_foo)
df_with_stats
# A tibble: 8 x 5
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
df_with_stats2
# A tibble: 8 x 5
# Groups: year_ [2]
year_ qtr_ foo mean_foo dfoo
<dbl> <dbl> <int> <dbl> <dbl>
1 2019 1 7 6 1
2 2019 2 8 6 2
3 2019 3 3 6 -3
4 2019 4 6 6 0
5 2020 1 2 3 -1
6 2020 2 4 3 1
7 2020 3 5 3 2
8 2020 4 1 3 -2
Option 2 is fine if you don't need the intermediate object anyway, and you don't even need to create mean_foo in your mutate statement:
df %>% group_by(year_) %>% mutate(dfoo = foo - mean(foo))
Alternatively, with data.table:
library(data.table)
setDT(df)[, dfoo := foo - mean(foo), by = year_]
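One practical difference between the two options is that the result of Option 2 is still grouped unless you call ungroup(). The sketch below rebuilds the example data, runs both options, and compares the computed column. The names opt1 and opt2 are only for illustration.

```r
library(dplyr)

set.seed(123)
df <- tibble(year_ = rep(c(2019, 2020), each = 4),
             qtr_  = rep(1:4, 2),
             foo   = sample(1:8))

# Option 1: summarise per group, then join back onto the rows
opt1 <- df %>%
  left_join(df %>% group_by(year_) %>% summarise(mean_foo = mean(foo)),
            by = "year_") %>%
  mutate(dfoo = foo - mean_foo)

# Option 2: grouped mutate, then drop the grouping
opt2 <- df %>%
  group_by(year_) %>%
  mutate(mean_foo = mean(foo),
         dfoo = foo - mean_foo) %>%
  ungroup()

all.equal(opt1$dfoo, opt2$dfoo)
```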

How can I compute the reverse rank abundance of a species matrix?

I have a dataset in R that contains species abundance data ordered by station and replicate sample. So, column one contains the station number, column two contains the replicate number, column three contains the species name, and column four contains the species abundance.
I want to add a new fifth column that contains the reverse rank abundance of a species per station/replicate combination (i.e., If there are four species in a station/replicate, I want the species with the lowest abundance to be given a value of 1, and the species with the highest abundance to be given a value of 4).
Here is sample code for the type of dataset I am working with:
library(tidyverse)
dat <- as.data.frame(matrix(c(1,1,"A",2.34,
1,1,"B",4.32,
1,1,"C",2.46,
1,1,"D",6.32,
1,2,"A",3.54,
1,2,"B",7.67,
1,2,"D",3.45,
2,1,"D",4.67,
2,1,"E",6.54,
2,1,"G",5.67,
2,2,"B",2.31,
2,2,"G",1.12), ncol = 4, nrow = 12, byrow = TRUE
))
names(dat)[1] <- "station"
names(dat)[2] <- "replicate"
names(dat)[3] <- "taxa"
names(dat)[4] <- "abundance"
dat %>%
mutate(abundance = parse_number(abundance))
station replicate taxa abundance
1 1 A 2.34
1 1 B 4.32
1 1 C 2.46
1 1 D 6.32
1 2 A 3.54
1 2 B 7.67
1 2 D 3.45
2 1 D 4.67
2 1 E 6.54
2 1 G 5.67
2 2 B 2.31
2 2 G 1.12
And here is some code to reorder the dataset so that it goes from the species with the lowest abundance to the species with the highest abundance per station/replicate:
dat %>%
arrange(abundance) %>%
arrange(replicate) %>%
arrange(station)
For some reason, I am unsure how to continue from here. Any help would be greatly appreciated!
If you want to rank by station/replicate, you first group_by this combination, and then create a new column with the rank value.
library(tidyverse)
dat %>%
group_by(station, replicate) %>%
mutate(abundance = as.numeric(abundance),
rank = rank(abundance))
Output
station replicate taxa abundance rank
<chr> <chr> <chr> <dbl> <dbl>
1 1 1 A 2.34 1
2 1 1 B 4.32 3
3 1 1 C 2.46 2
4 1 1 D 6.32 4
5 1 2 A 3.54 2
6 1 2 B 7.67 3
7 1 2 D 3.45 1
8 2 1 D 4.67 1
9 2 1 E 6.54 3
10 2 1 G 5.67 2
11 2 2 B 2.31 2
12 2 2 G 1.12 1
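If ties are possible, rank() returns averaged (fractional) ranks by default. A minimal sketch using dplyr's min_rank() instead, which returns integer ranks, is shown here on the station 1 rows only. The column names rank_low_first and rank_high_first are made up for illustration; the second one shows how to flip the direction with desc() if needed.

```r
library(dplyr)

# Station 1 rows from the question, with abundance already numeric
dat <- tibble(
  station   = c(1, 1, 1, 1, 1, 1, 1),
  replicate = c(1, 1, 1, 1, 2, 2, 2),
  taxa      = c("A", "B", "C", "D", "A", "B", "D"),
  abundance = c(2.34, 4.32, 2.46, 6.32, 3.54, 7.67, 3.45)
)

ranked <- dat %>%
  group_by(station, replicate) %>%
  mutate(rank_low_first  = min_rank(abundance),        # lowest abundance = 1
         rank_high_first = min_rank(desc(abundance))) %>%  # highest abundance = 1
  ungroup()

ranked
```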

Determine percentage of rows with missing values in a dataframe in R

I have a data frame with three variables and some missing values in one of the variables that looks like this:
subject <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2)
part <- c(0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3,0,0,0,0,1,1,1,1,2,2,2,2,3,3,3,3)
sad <- c(1,7,7,4,NA,NA,2,2,NA,2,3,NA,NA,2,2,1,NA,5,NA,6,6,NA,NA,3,3,NA,NA,5,3,NA,7,2)
df1 <- data.frame(subject,part,sad)
I have created a new data frame with the mean values of 'sad' per subject and part using a loop, like this:
columns<-c("sad.m",
"part",
"subject")
df2<-matrix(data=NA,nrow=1,ncol=length(columns))
df2<-data.frame(df2)
names(df2)<-columns
tn<-unique(df1$subject)
row=1
for (s in tn){
for (i in 0:3){
TN<-df1[df1$subject==s&df1$part==i,]
df2[row,"sad.m"]<-mean(as.numeric(TN$sad), na.rm = TRUE)
df2[row,"part"]<-i
df2[row,"subject"]<-s
row=row+1
}
}
Now I want to include an additional variable 'missing' that indicates the percentage of rows per subject and part with missing values, so that I get df3:
subject <- c(1,1,1,1,2,2,2,2)
part<-c(0,1,2,3,0,1,2,3)
sad.m<-df2$sad.m
missing <- c(0,50,50,25,50,50,50,25)
df3 <- data.frame(subject,part,sad.m,missing)
I'd really appreciate any help on how to go about this!
It's best to avoid loops in R where possible, since they can get messy and tend to be quite slow. For this sort of thing, the dplyr package is perfect and well worth learning. It can save you a lot of time.
You can create a data frame with both variables by first grouping by subject and part, and then performing a summary of the grouped data frame:
library(dplyr)
df2 <- df1 %>%
  group_by(subject, part) %>%
  summarise(
    sad_mean = mean(sad, na.rm = TRUE),
    # percentage of rows in the group where sad is NA
    na_count = sum(is.na(sad)) / n() * 100
  )
df2
# A tibble: 8 x 4
# Groups: subject [2]
subject part sad_mean na_count
<dbl> <dbl> <dbl> <dbl>
1 1 0 4.75 0
2 1 1 2 50
3 1 2 2.5 50
4 1 3 1.67 25
5 2 0 5.5 50
6 2 1 4.5 50
7 2 2 4 50
8 2 3 4 25
For each subject and part you can calculate the mean of sad and the percentage of NA values using is.na and mean.
library(dplyr)
df1 %>%
group_by(subject, part) %>%
summarise(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100)
# subject part sad.m perc_missing
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 4.75 0
#2 1 1 2 50
#3 1 2 2.5 50
#4 1 3 1.67 25
#5 2 0 5.5 50
#6 2 1 4.5 50
#7 2 2 4 50
#8 2 3 4 25
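The reason mean(is.na(sad)) works is that is.na() returns a logical vector, and mean() treats TRUE as 1 and FALSE as 0, so the mean is the proportion of missing values. A small sketch using the values for subject 1, part 3 from the question:

```r
# Values of sad for subject 1, part 3
x <- c(NA, 2, 2, 1)

is.na(x)                            # logical vector: one TRUE, three FALSE
pct_missing <- mean(is.na(x)) * 100 # proportion of TRUE, as a percentage
pct_missing
```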
Same logic with data.table :
library(data.table)
setDT(df1)[, .(sad.m = mean(sad, na.rm = TRUE),
perc_missing = mean(is.na(sad)) * 100), .(subject, part)]
Try this dplyr approach to compute df3:
library(dplyr)
#Code
df3 <- df1 %>% group_by(subject,part) %>% summarise(N=100*length(which(is.na(sad)))/length(sad))
Output:
# A tibble: 8 x 3
# Groups: subject [2]
subject part N
<dbl> <dbl> <dbl>
1 1 0 0
2 1 1 50
3 1 2 50
4 1 3 25
5 2 0 50
6 2 1 50
7 2 2 50
8 2 3 25
And for full interaction with df2 you can use left_join():
#Left join
df3 <- df1 %>% group_by(subject,part) %>%
summarise(N=100*length(which(is.na(sad)))/length(sad)) %>%
left_join(df2)
Output:
# A tibble: 8 x 4
# Groups: subject [2]
subject part N sad.m
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 4.75
2 1 1 50 2
3 1 2 50 2.5
4 1 3 25 1.67
5 2 0 50 5.5
6 2 1 50 4.5
7 2 2 50 4
8 2 3 25 4

for loop in R through pre-selected subset of data

For an analysis of the European Social Survey (ESS) I am trying to calculate the share of respondents with a higher education than their parents. I intend to use a for loop for the calculation. However, I am not able to calculate the shares for each country and year separately. The rows in the data frame are the individual observations (about 400k), and I have columns indicating the country (cntry) and year (essround) of the respondent. My code looks like this:
for (i in 1:nrow(ESS_cleann)) {
ESS_cleann$abs_mobility[i] <- ESS_cleann[ESS_cleann[cntry]==i && ESS_cleann[essround]==i] length(ESS_cleann$educ_mobility[i] [ESS_clean$educ_mobility [i] == "U"])/ESS_cleann[ESS_cleann[cntry]==i&& ESS_cleann[essround]==i] length(ESS_cleann$educ_mobility[i])
}
I am well aware that this is wrong, but I cannot manage to tell R to calculate the share for each country and year separately. Any help is much appreciated!
To give you an idea of the data structure, these are the heads of all three relevant columns:
ESS_cleann.cntry ESS_cleann.essround ESS_cleann.educ_mobility
1 AT 2 D
2 AT 2 D
3 AT 3 U
4 AT 3 U
5 AT 1 N
6 AT 3 N
I'm not quite sure I understand but are you trying to do something like this?
library(dplyr)
set.seed(2020)
cntry <- sample(c("AT", "UK"), 100, replace = TRUE)
essround <- sample(1:3, 100, replace = TRUE)
mobility <- sample(c("D", "U", "N"), 100, replace = TRUE)
ESS <- data.frame(cntry, essround, mobility)
ESS %>%
  group_by(cntry, essround, mobility, .drop = FALSE) %>%
  summarise(counts = n()) %>%
  mutate(perc = counts / sum(counts))
#> # A tibble: 18 x 5
#> # Groups: cntry, essround [6]
#> cntry essround mobility counts perc
#> <chr> <int> <chr> <int> <dbl>
#> 1 AT 1 D 6 0.429
#> 2 AT 1 N 4 0.286
#> 3 AT 1 U 4 0.286
#> 4 AT 2 D 3 0.273
#> 5 AT 2 N 5 0.455
#> 6 AT 2 U 3 0.273
#> 7 AT 3 D 5 0.333
#> 8 AT 3 N 4 0.267
#> 9 AT 3 U 6 0.4
#> 10 UK 1 D 7 0.318
#> 11 UK 1 N 6 0.273
#> 12 UK 1 U 9 0.409
#> 13 UK 2 D 4 0.25
#> 14 UK 2 N 7 0.438
#> 15 UK 2 U 5 0.312
#> 16 UK 3 D 7 0.318
#> 17 UK 3 N 10 0.455
#> 18 UK 3 U 5 0.227
Created on 2020-05-11 by the reprex package (v0.3.0)
data.table sounds like the package you need. You do not provide any data to reproduce the issue, but something like this should work (the column names are placeholders):
DT[, .(share = mean(education.level > parent.education.level)), by = c("country", "year")]
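If you want to do this with a for loop, I guess something like this would work (assuming years and countries are vectors of the values to loop over):
shares <- list()
for (yr in years) {
  for (ctry in countries) {
    # loop variables are named differently from the columns so the comparison is not column against itself
    subtable <- DT[year == yr & country == ctry]
    shares[[paste(ctry, yr)]] <- nrow(subtable[education > parental.education]) / nrow(subtable)
  }
}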
hope this helps.
Best regards
JA.
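Using the column names from the question (cntry, essround, educ_mobility), the share of upwardly mobile respondents per country and round can also be sketched directly with dplyr. The data below is just the six head rows shown in the question, and the output name abs_mobility is taken from the asker's loop. Coding "U" as upward mobility is an assumption based on that loop.

```r
library(dplyr)

# The six head rows shown in the question
ESS_head <- tibble(
  cntry         = rep("AT", 6),
  essround      = c(2, 2, 3, 3, 1, 3),
  educ_mobility = c("D", "D", "U", "U", "N", "N")
)

# mean() of a logical vector is the share of TRUE values
shares <- ESS_head %>%
  group_by(cntry, essround) %>%
  summarise(abs_mobility = mean(educ_mobility == "U"), .groups = "drop")

shares
```

To attach the share to every respondent row instead of collapsing to one row per group, mutate() can be used in place of summarise().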

How to calculate a ratio with lagged values per group?

I have the following dataset:
a <- tibble(school = c(2,2,2,2,2,3,3,3,3,3,3,3),
year=c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
numberofstudents=c(3,3,3,2,2,3,3,3,2,NA,2,4))
First, I want to replace all NA values with the average value of that variable for the group. So the NA should become 2.43.
Second, I want to calculate a fourth variable: the ratio of the lagged value of school to the number of students.
data <-
a %>%
group_by(school) %>%
summarize(lag.value.ratio = lag(school, 1)/numberofstudents) %>% ungroup
Unfortunately, I have the following error: Error: Column lag.value.ratio must be length 1 (a summary value), not 5.
How can I avoid this error and get the average group value instead of NA?
If you want the mean value of the group to replace the NAs, I calculate 2.83 to be the mean for school 3. You are getting the error because you are using summarize, which wants to collapse the result down to the number of groups that you have (in this case 2). I believe what you want is a mutate.
EDIT: I am loading the libraries used below and making sure that the lag function that is used is from the dplyr package.
library(dplyr)
library(tidyr)
a <- tibble(school = c(2,2,2,2,2,3,3,3,3,3,3,3),
year=c(2011,2011,2011,2012,2012,2011,2011,2011,2012,2012,2012,2012),
numberofstudents=c(3,3,3,2,2,3,3,3,2,NA,2,4))
a %>%
group_by(school) %>%
mutate(numberofstudents = replace_na(numberofstudents, mean(numberofstudents, na.rm = TRUE)),
lag.value.ratio = dplyr::lag(school, 1)/numberofstudents) %>%
ungroup()
gives
# A tibble: 12 x 4
school year numberofstudents lag.value.ratio
<dbl> <dbl> <dbl> <dbl>
1 2 2011 3 NA
2 2 2011 3 0.667
3 2 2011 3 0.667
4 2 2012 2 1
5 2 2012 2 1
6 3 2011 3 NA
7 3 2011 3 1
8 3 2011 3 1
9 3 2012 2 1.5
10 3 2012 2.83 1.06
11 3 2012 2 1.5
12 3 2012 4 0.75
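The summarize versus mutate distinction described above can be seen on a tiny made-up example: summarise() returns one row per group, while mutate() keeps every row and repeats the group value. The data and names here are purely illustrative.

```r
library(dplyr)

d <- tibble(g = c("a", "a", "b"), x = c(1, 3, 10))

# summarise(): collapses to one row per group
per_group <- d %>%
  group_by(g) %>%
  summarise(m = mean(x))

# mutate(): keeps all rows, group mean repeated within each group
per_row <- d %>%
  group_by(g) %>%
  mutate(m = mean(x)) %>%
  ungroup()

nrow(per_group)
nrow(per_row)
```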
