From a single dataset I created two datasets by filtering on the target variable. Now I'd like to compare every feature across the two datasets using a chi-square test. The problem is that one of the two datasets is much smaller than the other, so for some features there are values that do not occur in the smaller one, and when I try to apply the chi-square test I get this error: "all arguments must have the same length".
How can I add the missing values to the smaller dataset so that I can use the chi-square test?
Example:
I want to run a chi-square test on the same feature in the two datasets:
chisq.test(table(df1$var1, df2$var1))
but I get the error "all arguments must have the same length" because table(df1$var1) is:
a b c d
2 5 7 18
while table(df2$var1) is:
a b c
8 1 12
so what I would like to do is add the value d to df2 with a count of 0 so that I can run the chi-square test.
The table output for df2 can be extended by converting the column to a factor with the levels specified explicitly:
table(factor(df2$var1, levels = letters[1:4]))
a b c d
8 1 12 0
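If the full set of values is not known in advance, the levels can be taken from the data instead of being hard-coded with letters[1:4]. A minimal sketch, assuming v1 and v2 are stand-ins for df1$var1 and df2$var1 rebuilt from the counts shown in the question (the names v1, v2, all_levels and counts are illustrative only):

```r
# Stand-ins for df1$var1 and df2$var1, rebuilt from the counts in the question
v1 <- rep(c("a", "b", "c", "d"), times = c(2, 5, 7, 18))
v2 <- rep(c("a", "b", "c"), times = c(8, 1, 12))

# Use every value seen in either vector as the common set of levels
all_levels <- sort(union(v1, v2))

# One row of counts per dataset; values absent from a dataset get a count of 0
counts <- rbind(
  df1 = table(factor(v1, levels = all_levels)),
  df2 = table(factor(v2, levels = all_levels))
)
counts
```

The resulting matrix has one column per level, so it could be passed to chisq.test() as a contingency table.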
However, table() with two inputs cross-tabulates them element by element, so both inputs must have the same length. Instead, bind the datasets together with a group identifier and then call table():
library(dplyr)
table(bind_rows(df1, df2, .id = 'grp'))
   var1
grp  a  b  c  d
  1  2  5  7 18
  2  8  1 12  0
Or in base R
table(data.frame(col1 = rep(1:2, c(nrow(df1), nrow(df2))),
                 col2 = c(df1$var1, df2$var1)))
    col2
col1  a  b  c  d
   1  2  5  7 18
   2  8  1 12  0
data
df1 <- structure(list(var1 = c("a", "a", "b", "b", "b", "b", "b", "c",
"c", "c", "c", "c", "c", "c", "d", "d", "d", "d", "d", "d", "d",
"d", "d", "d", "d", "d", "d", "d", "d", "d", "d", "d")), class = "data.frame",
row.names = c(NA,
-32L))
df2 <- structure(list(var1 = c("a", "a", "a", "a", "a", "a", "a",
"a",
"b", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c", "c"
)), class = "data.frame", row.names = c(NA, -21L))
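With the combined table in hand, the chi-square test from the question can be run on it directly. The following is only a sketch under stated assumptions: v1 and v2 are stand-ins for df1$var1 and df2$var1, and combined and tab are illustrative names. Because several expected counts in this example are small, the sketch asks chisq.test() for a simulated p-value rather than relying on the asymptotic approximation.

```r
# Stand-ins for df1$var1 and df2$var1, rebuilt from the counts in the question
v1 <- rep(c("a", "b", "c", "d"), times = c(2, 5, 7, 18))
v2 <- rep(c("a", "b", "c"), times = c(8, 1, 12))

# Stack the values and record which dataset each one came from
combined <- data.frame(
  grp  = rep(c("df1", "df2"), times = c(length(v1), length(v2))),
  var1 = c(v1, v2)
)

# 2 x 4 contingency table: one row per dataset, one column per value
tab <- table(combined$grp, combined$var1)

# simulate.p.value = TRUE computes a Monte Carlo p-value,
# which avoids the small-expected-count approximation
res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
res$p.value
```

Whether a simulated p-value, Fisher's exact test, or merging rare categories is most appropriate depends on the data, so treat this choice as one option rather than a recommendation.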
I have two data matrices of different dimensions stored as objects in R (I am using Rstudio with R v4.0.2 in Windows 10):
m1 = 1 column x 44 rows (this is a list of names with no spaces).
m2 = 500,000 columns x 164 rows (this contains a string of characters, the first column being a list of names).
I want to check how many (and which) of the rows of m1 are found in m2 (meaning it will be anywhere between 0 and 44). The end goal is that I have 4000 different matrices that will substitute the place of m2, and I need to see the extent of missing entries (found in m1) in all of the m2s (i.e., I am looking at the extent of missing data of those 44 names).
I am still a beginner to R, so apologies if my description is a bit off.
I tried storing each matrix, saved as CSV files, as such:
m1 <- read.csv("names-file.csv")
m2 <- read.csv("data-file.csv")
and tried to use the row.match function in the prodlim package, and ran row.match(m1, m2) but only got numeric values. I am looking to see just a number of how many of the names from m1 (first column) are found in m2 (first column), which values those are, and what the percentage would be (x out of 44).
As an example:
m1 =
Tom
Harry
Cindy
Megan
Jack
m2 =
Tom XXXXXXXXXXXX----XXXXXXXX
Stephanie XXXXXXXXXXXXXXXX----XXXX
Megan XXXXXXXXXXXXXXXXXXXXXXXX
Ryan XXXXXXXXXXXXXXXXXXXXXX-X
David XXXXXX---XXXXXXXXXXXXXXX
Josh XXXXXXXXXXXXXXXXXXXXXXXX
In the m2 matrix, the names are in column 1, and each subsequent X (which represents an A, T, C, or G) is a subsequent column (so every column after the first holds an A, T, C, G, or a "-"). I am looking to write code that reports how many of the names from m1 are found in m2 (and conversely, how many of the m1 names are missing from m2, as a percentage). In this case, the desired outputs would be:
2
Tom
Megan
60%
Here are my specific datafile using dput() (please let me know if I am using dput() correctly):
m1:
structure(list(V1 = c("Taxon1", "Taxon2", "Taxon3", "Taxon4",
"Taxon5", "Taxon6", "Taxon7", "Taxon8")), class = "data.frame", row.names = c(NA,
-8L))
m2:
structure(list(V1 = c("Taxon1", "Taxon3", "Taxon4", "Taxon6",
"Taxon7", "Taxon9", "Taxon10", "Taxon11", "Taxon12", "Taxon13",
"Taxon14", "Taxon15", "Taxon16", "Taxon17", "Taxon18", "Taxon19",
"Taxon20", "Taxon21", "Taxon22", "Taxon23", "Taxon24", "Taxon25",
"Taxon26", "Taxon27", "Taxon28", "Taxon29", "Taxon30"), V2 = c("A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "C", "C", "C", "C", "C", "C", "C"
), V3 = c("G", "G", "G", "G", "G", "C", "C", "G", "G", "G", "G",
"G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G",
"G", "G", "G"), V4 = c("C", "C", "C", "C", "C", "T", "G", "C",
"C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C", "C",
"C", "C", "C", "C", "C", "C"), V5 = c("T", "T", "G", "T", "G",
"G", "G", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T",
"T", "T", "T", "T", "T", "T", "T", "T", "T"), V6 = c("G", "G",
"C", "G", "C", "C", "C", "G", "G", "G", "G", "G", "G", "G", "G",
"G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G", "G"),
V7 = c("C", "C", "A", "C", "A", "A", "A", "C", "C", "C",
"C", "C", "C", "C", "C", "C", "G", "G", "G", "G", "G", "G",
"G", "G", "G", "G", "G"), V8 = c("T", "T", "A", "T", "A",
"A", "A", "T", "T", "T", "T", "T", "T", "T", "T", "T", "T",
"T", "T", "T", "T", "T", "T", "T", "T", "T", "T"), V9 = c("A",
"A", "A", "A", "A", "T", "T", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "T", "T", "T", "T", "T", "T", "T", "T",
"T", "T")), class = "data.frame", row.names = c(NA, -27L))
Thank you!
You might want to have a look at the %in% operator in R. According to your question, you might want something like this:
m1[,1] %in% m2[,1]
#[1] TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE
You can then pair it with functions such as mean or sum which will help you to find the percentage as required:
sum(m1[,1] %in% m2[,1])
#[1] 5
mean(m1[,1] %in% m2[,1])
#[1] 0.625
EDIT: As requested by the OP in the comments, to list which names are found (or missing) there are various methods, my personal favourite being the which function:
m1[which(m1[,1] %in% m2[,1]),]
#[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
m1[which(!(m1[,1] %in% m2[,1])),]
#[1] "Taxon2" "Taxon5" "Taxon8"
Again, this is only one method out of many (I can count 3 right now...), so I suggest you explore the other options...
To get the names common to both data frames you can use intersect; to calculate the missing percentage you can use %in% with mean:
common_names <- intersect(m1$V1, m2$V1)
missing_percentage_in_m1 <- mean(!m1$V1 %in% m2$V1) * 100
missing_percentage_in_m2 <- mean(!m2$V1 %in% m1$V1) * 100
common_names
#[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
missing_percentage_in_m1
#[1] 37.5
missing_percentage_in_m2
#[1] 81.48148
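Since the question mentions repeating this check for thousands of m2 files, the same %in% logic can be wrapped in a small function and applied over a list. This is a sketch only: overlap_summary, ref and datasets are hypothetical names, and the in-memory data frames stand in for files that would normally be read with read.csv().

```r
# Summarise how many reference names appear in a vector of data names
overlap_summary <- function(ref_names, data_names) {
  found <- ref_names %in% data_names
  list(
    n_found     = sum(found),        # how many reference names were found
    found       = ref_names[found],  # which ones
    pct_missing = 100 * mean(!found) # share of reference names not found
  )
}

# Reference names from the question's small example
ref <- c("Tom", "Harry", "Cindy", "Megan", "Jack")

# Hypothetical stand-ins for the many m2 data frames
datasets <- list(
  first  = data.frame(V1 = c("Tom", "Stephanie", "Megan", "Ryan", "David", "Josh")),
  second = data.frame(V1 = c("Harry", "Jack", "Cindy", "Tom"))
)

# One summary per dataset
results <- lapply(datasets, function(d) overlap_summary(ref, d$V1))

results$first$n_found      # 2
results$first$found        # "Tom" "Megan"
results$first$pct_missing  # 60
```

For real files, the list could be built by calling read.csv() on each path, keeping only the first column to limit memory use.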
This code produces a result like this:
2
Tom
Megan
60%
1. How many of the names from m1 are found in m2
library(dplyr)
m1 <- t(m1)
res1 <- m2 %>%
  rowwise() %>%
  mutate(n = sum(m1 %in% c_across(V1:V9)))
res1
# A tibble: 27 x 10
# Rowwise:
V1 V2 V3 V4 V5 V6 V7 V8 V9 n
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 Taxon1 A G C T G C T A 1
2 Taxon3 A G C T G C T A 1
3 Taxon4 A G C G C A A A 1
4 Taxon6 A G C T G C T A 1
5 Taxon7 A G C G C A A A 1
6 Taxon9 A C T G C A A T 0
7 Taxon10 A C G G C A A T 0
8 Taxon11 A G C T G C T A 0
9 Taxon12 A G C T G C T A 0
10 Taxon13 A G C T G C T A 0
# ... with 17 more rows
res1 %>% select(n) %>% sum
[1] 5
res2 <-res1 %>%
filter(n >0) %>%
pull(V1) %>%
unique
res2
[1] "Taxon1" "Taxon3" "Taxon4" "Taxon6" "Taxon7"
2. How much data is missing from m2, as a percentage
res3 <- res2 %>% length
1 - res3 / length(unique(m2$V1))
[1] 0.8148148
I have a dataframe with 82 variables. Many of the variables contain alphabetic letters, which I want to change into a set of numbers. I can do this column-by-column, number-by-number using the code below:
library(dplyr)
library(tibble)
mydf <- tribble(~Var1, ~Var2.a, ~Var3.a, ~Var4.a,
"A", "b", "b", "d",
"B", "w", NA, "w",
"C", "g", "k", "b",
"D", "k", NA, "j")
newdf <- mydf %>%
mutate(Var2.a = ifelse(Var2.a %in% c("m", "p", "w", "h", "n"), 1, Var2.a),
Var2.a = ifelse(Var2.a %in% c("k", "b", "g", "j", "f", "d"), 2, Var2.a),
Var3.a = ifelse(Var3.a %in% c("m", "p", "w", "h", "n"), 1, Var3.a),
Var3.a = ifelse(Var3.a %in% c("k", "b", "g", "j", "f", "d"), 2, Var3.a),
Var4.a = ifelse(Var4.a %in% c("m", "p", "w", "h", "n"), 1, Var4.a),
Var4.a = ifelse(Var4.a %in% c("k", "b", "g", "j", "f", "d"), 2, Var4.a))
But this will take a lot of time for the 70+ columns I need to change!
All the variables of interest have a matching letter combination in the variable name (".a" in the example data), so I should be able to use an ifelse statement on these columns using contains(). However I can't work out how to do this!
I have looked at this answer, which I think is getting me close, but I can't work out how to embed an if-statement into it:
newdf <- mydf %>%
mutate_at(vars[2:4] = ifelse(vars %in% c("m", "p", "w", "h", "n"), 1, vars)
But I get the error Error in vars[2:4] : object of type 'closure' is not subsettable. I think the brackets are wrong here, and probably also the use of vars!
Try this example:
library(dplyr)
# custom function; I prefer case_when (nested if_else would also work)
foo <- function(x){
case_when(
x %in% c("m", "p", "w", "h", "n") ~ 1L,
x %in% c("k", "b", "g", "j", "f", "d") ~ 2L,
TRUE ~ NA_integer_)
}
mydf %>%
mutate_at(vars(Var2.a:Var4.a), foo)
# # A tibble: 4 x 4
# Var1 Var2.a Var3.a Var4.a
# <chr> <int> <int> <int>
# 1 A 2 2 2
# 2 B 1 NA 1
# 3 C 2 2 2
# 4 D 2 NA 2
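The question also asks about selecting the columns by name pattern. With dplyr 1.0.0 or later, a similar result can be sketched with across() and a tidyselect helper instead of mutate_at(). The function name recode_letters is hypothetical; contains(".a") matches the literal text ".a" anywhere in the column name, and ends_with(".a") would be a stricter alternative.

```r
library(dplyr)
library(tibble)

# Example data from the question
mydf <- tribble(~Var1, ~Var2.a, ~Var3.a, ~Var4.a,
                "A", "b", "b", "d",
                "B", "w", NA,  "w",
                "C", "g", "k", "b",
                "D", "k", NA,  "j")

# Map each letter group to an integer code; anything else becomes NA
recode_letters <- function(x) {
  case_when(
    x %in% c("m", "p", "w", "h", "n")      ~ 1L,
    x %in% c("k", "b", "g", "j", "f", "d") ~ 2L,
    TRUE                                   ~ NA_integer_
  )
}

# Apply the recoding to every column whose name contains ".a"
newdf <- mydf %>%
  mutate(across(contains(".a"), recode_letters))
newdf
```

Selecting by pattern means new ".a" columns are picked up automatically, at the cost of also matching any unrelated column whose name happens to contain ".a".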
I have a large table in which a few columns contain "-". I want to replace each "-" with the value from the row above in the same column.
library(tidyverse)
# This is the df I have
df <- data.frame(stringsAsFactors=FALSE,
my = c("a", "a", "a", "-", "b", "b", "b", "-", "c", "c", "c", "-"),
bad = c("d", "d", "d", "-", "e", "e", "e", "-", "f", "f", "f", "-"),
table = c("g", "g", "g", "-", "h", "h", "h", "-", "i", "i", "i", "-")
)
# This is the desired output:
output_df <- data.frame(stringsAsFactors=FALSE,
my = c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"),
bad = c("d", "d", "d", "d", "e", "e", "e", "e", "f", "f", "f", "f"),
table = c("g", "g", "g", "g", "h", "h", "h", "h", "i", "i", "i", "i")
)
# What I have tried unsuccessfully:
df %>%
mutate_at(c("my", "bad", "table"), .funs = str_replace("-", NA))
I'm a bit stumped with this one.....any ideas?
One option is to change the "-" values to NA and then use fill:
library(tidyverse)
df %>%
mutate_all(na_if, "-") %>%
fill(my, bad, table)
# or if there are many columns
# fill(!!! rlang::syms(names(.)))
# or as H1 suggested
# fill(everything())
# my bad table
#1 a d g
#2 a d g
#3 a d g
#4 a d g
#5 b e h
#6 b e h
#7 b e h
#8 b e h
#9 c f i
#10 c f i
#11 c f i
#12 c f i
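When the table also has non-character columns, it may be safer to restrict the "-" to NA step to character columns. A minimal sketch using across() and where() (dplyr 1.0.0 or later); the id column and the name filled are assumptions added for illustration and are not part of the original data:

```r
library(dplyr)
library(tidyr)

# Shortened example with an extra numeric column (id is hypothetical)
df <- data.frame(
  id  = 1:8,
  my  = c("a", "a", "a", "-", "b", "b", "b", "-"),
  bad = c("d", "d", "d", "-", "e", "e", "e", "-"),
  stringsAsFactors = FALSE
)

filled <- df %>%
  # turn "-" into NA, but only in character columns
  mutate(across(where(is.character), ~ na_if(.x, "-"))) %>%
  # carry the last non-missing value downwards
  fill(everything(), .direction = "down")
filled
```

Note that fill() only looks upwards within a column, so a "-" in the very first row would remain NA.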