R: is.na() does not pick up NA value - r

So I have a dataset and just by looking at it there are clear NA's in the dataset.
> dput(bmi.cig)
structure(list(MSI.subset.BMI = structure(c(4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 1L, 2L, 3L, 3L, 1L, 3L, 3L, 1L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("0", "1", "2",
"NA"), class = "factor"), MSI.subset.Cigarette = structure(c(3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 2L, 1L, 2L,
2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("1",
"2", "NA"), class = "factor")), .Names = c("MSI.subset.BMI",
"MSI.subset.Cigarette"), row.names = c(NA, 30L), class = "data.frame")
> head(bmi.cig)
MSI.subset.BMI MSI.subset.Cigarette
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
I want to remove any row that contains an NA in either column, so I'm using the list-wise deletion function ld in the ForImp package. However, R is not recognizing the NA values.
is.na(bmi.cig$MSI.subset.BMI)
I get
> is.na(bmi.cig$MSI.subset.BMI)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[26] FALSE FALSE FALSE FALSE FALSE
So once I use the ld function I just get an empty dataset in return.

It's b/c the columns are factors, and the levels are "NA". I.e., try
data <- structure(list(MSI.subset.BMI = structure(c(4L, 4L, 4L, 4L, 4L,
+ 4L, 4L, 4L, 4L, 4L, 4L, 1L, 2L, 3L, 3L, 1L, 3L, 3L, 1L, 4L, 4L,
+ 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("0", "1", "2",
+ "NA"), class = "factor"), MSI.subset.Cigarette = structure(c(3L,
+ 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 2L, 1L, 2L, 1L, 2L,
+ 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("1",
+ "2", "NA"), class = "factor")), .Names = c("MSI.subset.BMI",
+ "MSI.subset.Cigarette"), row.names = c(NA, 30L), class = "data.frame")
> class(blah[,1])
data[,1]=="NA"
The NA's are actually characters (class("NA")), not class logical like class(NA).

As #rbatt mentions, you have character NA values as factor levels. You can remove them and get the NA entries to register as real NA values for the entire data set with
df[] <- lapply(df, function(x) {
is.na(levels(x)) <- levels(x) == "NA"
x
})
where df is your data set. And now test with
is.na(df)

Related

Using nest_by or nest, group_by or looping to perform several chi square test to several variables in R (likert scales)

Let's say I have these categorical variables in my data set. All variables are related to the people's concerns of the COVID-19 and were assessed two times (with different participants..).
And my main goal is to check if time (will be "constant") is associated with the prevalence of each item (economy, social cohesion, and so on) (will vary). Therefore, I'll need to perform several Chi-square tests.
I've followed some instructions using nest_by or xtabs , but I'm not getting the right results.
I would like to keep the tidyverse environment in this analysis.
The main goal is to have several chi-squared tests, such as this one:
ds_plot_likert %>%
pivot_longer(cols = -c(time),
names_to = "item", values_to = "response") %>%
group_by(item, time, response) %>%
summarise(N = n()) %>%
mutate(pct = N / sum(N)) %>%
filter(item == "Children's academic achievement") %>% #need to change all the time...
xtabs(formula = pct ~ time + response, data = .) %>%
chisq.test()
But for all variables in my dataset (and preferably using tidyverse).
Thank you!
The following code gives you the possibility to reproduce.
ds <- structure(list(time = c("First", "First", "First", "First", "First",
"First", "First", "First", "First", "First", "First", "First",
".Second", "First", "First", "First", "First", ".Second", "First",
"First", "First", "First", "First", "First", "First", "First",
".Second", "First", "First", ".Second", "First", "First", "First",
"First", "First", "First", "First", "First", "First", "First",
".Second", "First", "First", "First", ".Second", ".Second", "First",
"First", "First", ".Second", ".Second", "First", "First", "First",
"First", ".Second", ".Second", "First", "First", "First", "First",
"First", ".Second", "First", "First", "First", "First", ".Second",
"First", "First", "First", "First", "First", "First", "First",
".Second", "First", ".Second", "First", "First", "First", "First",
"First", "First", "First", "First", "First", "First", "First",
".Second", "First", "First", "First", "First", "First", "First",
"First", ".Second", ".Second", "First"), Economy = structure(c(4L,
3L, 3L, 4L, 3L, 4L, 4L, 1L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 4L, 4L,
3L, 3L, NA, 2L, 3L, 4L, 3L, 3L, 4L, 4L, 2L, 3L, 4L, 4L, 3L, 2L,
4L, 2L, 3L, 2L, 3L, 3L, 3L, 2L, 4L, 4L, 3L, 3L, 3L, 4L, 3L, 3L,
4L, 2L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 4L, 3L, 2L,
3L, 3L, 3L, 4L, NA, 2L, 4L, 3L, 4L, 2L, 3L, 3L, 2L, NA, 3L, 2L,
3L, 2L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 4L, 3L, 4L, 3L, 3L, 2L, 3L,
3L, 3L, 4L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor"), `My personal finance` = structure(c(3L,
2L, 4L, 2L, 4L, 4L, 3L, 2L, 3L, 3L, 2L, 3L, 2L, 4L, 3L, 4L, 4L,
2L, 3L, NA, 3L, 3L, 4L, 3L, 3L, 4L, 3L, 4L, 3L, 4L, 4L, 3L, 4L,
3L, 4L, 2L, 2L, 3L, 2L, 2L, 4L, 3L, 4L, 3L, 3L, 3L, 3L, 2L, 2L,
3L, 2L, 4L, 2L, 3L, 3L, 2L, 3L, 3L, 2L, 2L, 3L, 3L, 4L, 4L, 2L,
3L, 3L, 4L, 3L, NA, 2L, 3L, 3L, 4L, 2L, 3L, 2L, 3L, NA, 3L, 2L,
2L, 2L, 3L, 4L, 3L, 2L, 3L, 2L, 2L, 2L, 2L, 4L, 3L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor"), `My own health` = structure(c(3L,
2L, 4L, 3L, 4L, 4L, 3L, 4L, 2L, 4L, 3L, 3L, 3L, 4L, 3L, 3L, 4L,
3L, 2L, NA, 3L, 4L, 3L, 4L, 3L, 2L, 1L, 2L, 3L, 3L, 4L, 2L, 4L,
2L, 4L, 4L, 3L, 2L, 2L, 4L, 4L, 2L, 4L, 3L, 3L, 2L, 3L, 3L, 2L,
2L, 3L, 3L, 1L, 3L, 4L, 4L, 3L, 3L, 2L, 2L, 4L, 3L, 3L, 4L, 2L,
3L, 3L, 4L, 4L, NA, 2L, 2L, 3L, 4L, 1L, 3L, 4L, 3L, NA, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 2L, 4L, 2L, 2L, 3L, 4L, 3L, 2L, 3L, 4L,
2L, 3L, 3L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor"), `My friends and family health` = structure(c(4L,
3L, 4L, 4L, 4L, 4L, 3L, 4L, 3L, 4L, 3L, NA, 3L, 4L, 3L, 3L, 4L,
4L, 3L, NA, 3L, 4L, 3L, 4L, 3L, 3L, 2L, 2L, 3L, 4L, 4L, 2L, 4L,
1L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 4L, 3L,
4L, 3L, 3L, 4L, 3L, 3L, 4L, 3L, 3L, 3L, 2L, 4L, 3L, 4L, 4L, 2L,
3L, 3L, 4L, 4L, NA, 2L, 4L, 3L, 4L, 2L, 3L, 4L, 3L, NA, 3L, 4L,
3L, 3L, 4L, 3L, 4L, 4L, 3L, 4L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 4L,
4L, 3L, 4L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor"), `Social cohesion` = structure(c(3L,
3L, 2L, 4L, 4L, 2L, 4L, 4L, 2L, 3L, 3L, 3L, 3L, 4L, 3L, 3L, 4L,
3L, 3L, NA, 3L, 3L, 4L, 3L, 3L, 2L, 1L, 2L, 2L, 4L, 4L, 3L, 3L,
1L, 3L, NA, 2L, 3L, 2L, 4L, 4L, 2L, 3L, 3L, 4L, 4L, 3L, 3L, 3L,
2L, 2L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 4L, 2L,
3L, 3L, 3L, 4L, NA, 2L, 3L, 3L, 2L, 1L, 1L, 3L, 2L, NA, 3L, NA,
3L, 3L, 4L, 2L, 4L, 3L, 1L, 4L, 2L, 4L, 3L, 2L, 4L, 2L, 3L, 4L,
4L, 2L, 2L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor"), `Food and pharmaceutical drugs` = structure(c(2L,
3L, 4L, 4L, 3L, 2L, 4L, 1L, 2L, 4L, 3L, NA, 2L, 3L, 3L, 2L, 4L,
3L, 3L, NA, 3L, 4L, 3L, 3L, 3L, 1L, 1L, 2L, 3L, 4L, 2L, 2L, 4L,
4L, 4L, 1L, 4L, 2L, 2L, 3L, 4L, 2L, 4L, 3L, 2L, 2L, 3L, 3L, 2L,
3L, 2L, 3L, 4L, 2L, 2L, 2L, 3L, 3L, 3L, 2L, 4L, 3L, 3L, 4L, 2L,
3L, 3L, 3L, 4L, NA, 2L, 3L, 2L, 2L, 1L, 1L, 2L, 2L, NA, 1L, 1L,
2L, 2L, 2L, 1L, 3L, 3L, 2L, 3L, 1L, 2L, 2L, 2L, 4L, 2L, 3L, 3L,
2L, 1L, 2L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor"), `Price of grocery products` = structure(c(2L,
2L, 3L, 4L, 3L, 4L, 3L, 2L, 2L, 4L, 3L, 3L, 2L, 4L, 3L, 3L, 4L,
4L, 3L, NA, 3L, 4L, 3L, 3L, 3L, 2L, 3L, 2L, 3L, 4L, 2L, 4L, 3L,
4L, 4L, 4L, 4L, 2L, 2L, 3L, 4L, 2L, 4L, 3L, 3L, 3L, 3L, 4L, 2L,
4L, 2L, 3L, 4L, 2L, 2L, 3L, 4L, 3L, 2L, 3L, 4L, 3L, 4L, 4L, 2L,
3L, 3L, 4L, 4L, NA, 2L, 3L, 2L, 2L, 2L, 1L, 3L, 3L, NA, 1L, 1L,
2L, 2L, 3L, 1L, 3L, 4L, 2L, 3L, 2L, 2L, 2L, 2L, 4L, 3L, 3L, 3L,
4L, 1L, 2L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor"), `Stock prices` = structure(c(2L,
2L, 2L, 4L, 3L, 2L, 1L, 1L, 3L, 4L, 2L, 2L, 2L, 4L, 3L, 3L, 4L,
3L, 2L, NA, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 3L, 1L, 2L, 2L, 2L,
4L, 4L, 2L, 3L, 2L, 2L, 3L, 2L, 3L, 4L, 3L, 4L, 3L, 3L, 2L, 2L,
2L, 1L, 3L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, NA, 2L, 4L, 2L,
3L, 3L, 4L, 4L, NA, 2L, 2L, 2L, 2L, 2L, 1L, 3L, 3L, NA, 3L, 1L,
3L, 3L, 4L, 1L, 3L, 4L, 1L, 3L, 1L, 4L, 2L, 2L, 2L, NA, 3L, 2L,
4L, 1L, 2L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor"), `Children's academic achievement` = structure(c(4L,
3L, 4L, 1L, NA, NA, 1L, 1L, 1L, 1L, 3L, 3L, 2L, 4L, 3L, 4L, NA,
4L, 3L, NA, 1L, 1L, 3L, 3L, 3L, 1L, 4L, 1L, 3L, 4L, 3L, 2L, 4L,
1L, 4L, 1L, 1L, 3L, 2L, 3L, 1L, 2L, 1L, 3L, 2L, 3L, 3L, 3L, 1L,
1L, 1L, 4L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 2L, 4L, 3L, 3L, 4L, 2L,
2L, 1L, 1L, 4L, NA, 2L, 2L, 2L, 4L, 1L, 2L, 1L, 2L, NA, 2L, 1L,
NA, 3L, 2L, 2L, 1L, 4L, 2L, 3L, 1L, 4L, 1L, 1L, 1L, 3L, 1L, 1L,
1L, 1L, 2L), .Label = c("Not at all", "A little", "Moderately",
"Very much"), class = "factor")), class = "data.frame", row.names = c(NA,
-100L))
You can store the result in a list for each item -
library(dplyr)
library(tidyr)
ds %>%
pivot_longer(cols = -c(time),
names_to = "item", values_to = "response") %>%
group_by(item, time, response) %>%
summarise(N = n()) %>%
mutate(pct = N / sum(N)) %>%
group_by(item) %>%
summarise(test = list(xtabs(formula = pct ~ time + response,
data = cur_data()) %>% chisq.test())) -> result
result$test
#[[1]]
# Pearson's Chi-squared test
#data: .
#X-squared = 0.0099329, df = 3, p-value = 0.9997
#[[2]]
#
# Pearson's Chi-squared test
#data: .
#X-squared = 0.026631, df = 3, p-value = 0.9989
#...
#...

ggplot: how to add an outline around each fill in a stacked barchart but only partly

I have the following stacked barchart produced with geom_bar.
Question: how to add an outline around each fill corresponding to the matched color in cols. The tricky part is, that the outline should not be in between each fill but around the "borders" and the top, exclusively (expected output below)
I have
Written with
library(ggplot)
cols = c("#E1B930", "#2C77BF","#E38072","#6DBCC3", "grey40","black")
ggplot(i, aes(fill=uicc, x=n)) + theme_bw() +
geom_bar(position="stack", stat="count") +
scale_fill_manual(values=alpha(cols,0.5))
Expected output
My data i
i <- structure(list(uicc = structure(c(4L, 4L, 4L, 4L, 3L, 4L, 4L,
4L, 4L, 3L, 4L, 3L, 4L, 2L, 4L, 4L, 4L, 1L, 4L, 4L, 2L, 4L, 4L,
3L, 4L, 4L, 4L, 4L, 3L, 2L, 4L, 4L, 3L, 3L, 3L, 3L, 1L, 3L, 4L,
4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 3L,
4L, 1L, 3L, 1L, 4L, 4L, 3L, 1L, 2L, 1L, 3L, 3L, 3L, 4L, 3L, 4L,
4L, 3L, 4L, 3L, 3L, 3L, 2L, 2L, 4L, 3L, 4L, 2L, 1L, 1L, 4L, 4L,
4L, 4L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 2L, 3L, 3L, 4L), .Label = c("1",
"2", "3", "4"), class = "factor"), n = structure(c(4L, 4L, 4L,
4L, 2L, 1L, 4L, 1L, 4L, 2L, 4L, 2L, 4L, 1L, 4L, 5L, 2L, 1L, 1L,
5L, 1L, 1L, 2L, 2L, 1L, 4L, 3L, 4L, 2L, 1L, 5L, 2L, 2L, 2L, 2L,
2L, 1L, 2L, 3L, 2L, 1L, 4L, 2L, 1L, 4L, 1L, 4L, 1L, 2L, 2L, 2L,
4L, 1L, 4L, 2L, 4L, 1L, 1L, 1L, 4L, 1L, 2L, 1L, 1L, 1L, 2L, 2L,
1L, 1L, 2L, 4L, 1L, 1L, 4L, 2L, 2L, 2L, 1L, 1L, 3L, 2L, 5L, 1L,
1L, 1L, 4L, 4L, 4L, 5L, 1L, 4L, 4L, 1L, 4L, 2L, 1L, 1L, 2L, 2L,
4L), .Label = c("0", "1", "2", "3", "4", "5"), class = "factor")), row.names = c(NA,
100L), class = "data.frame")
Well, I found a way.
There's no easy way to draw just the "outside" lines, so the approach I used was to go ahead and draw them with the geom_bar call. The inner lines are "erased" by drawing white rectangles over top the initial geom_bar call, and then the fill is drawn back in with a colorless color= aesthetic.
In order to draw the rectangles over the initial geom_bar call, I created a summary dataframe of i which sets the y values.
i.sum <- i %>% group_by(n) %>% tally()
ggplot(i, aes(x=n)) + theme_bw() +
# draw lines
geom_bar(position='stack', stat='count',
aes(color=uicc), fill=NA, size=1.5) +
# cover over those inner lines
geom_col(data=i.sum, aes(y=nn), fill='white') +
# put back in the fill
geom_bar(position='stack', stat='count',
aes(fill=uicc), color=NA) +
scale_fill_manual(values=alpha(cols,0.5)) +
scale_color_manual(values=cols)
Note that the size= of the color= aesthetic needs to be much higher than normal, since the white rectangle ends up covering about half the line.

how to count the number of rows of specific column that has specific character

I have data that I want to know the number of specific rows that are with specific character. The data looks like the following
df<-structure(list(Gene.refGene = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1BG", "A1BG-AS1", "A1CF",
"A1CF;PRKG1"), class = "factor"), Chr = structure(c(2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("chr10", "chr19"
), class = "factor"), Start = c(58858232L, 58858615L, 58858676L,
58859052L, 58859055L, 58859066L, 58859510L, 58863162L, 58864479L,
58864150L, 58864867L, 58864879L, 58865857L, 52566433L, 52569637L,
52571047L, 52573510L, 52576068L, 52580561L, 52603659L, 52619845L,
52625849L, 52642500L, 52650951L, 52675605L, 52703952L, 52723140L,
52723638L), End = c(58858232L, 58858615L, 58858676L, 58859052L,
58859055L, 58859066L, 58859510L, 58863166L, 58864479L, 58864150L,
58864867L, 58864879L, 58865857L, 52566433L, 52569637L, 52571047L,
52573510L, 52576068L, 52580561L, 52603659L, 52619845L, 52625849L,
52642500L, 52650958L, 52675605L, 52703952L, 52723140L, 52723638L
), Ref = structure(c(3L, 5L, 2L, 2L, 3L, 2L, 5L, 7L, 6L, 6L,
2L, 1L, 5L, 6L, 5L, 3L, 2L, 5L, 6L, 3L, 3L, 6L, 3L, 4L, 3L, 6L,
6L, 3L), .Label = c("-", "A", "C", "CTCTCTCT", "G", "T", "TTTTT"
), class = "factor"), Alt_df1 = structure(c(1L, 1L, 4L, 4L, 1L,
4L, 5L, 1L, 3L, 3L, 4L, 4L, 3L, 1L, 2L, 5L, 1L, 2L, 1L, 5L, 5L,
2L, 5L, 1L, 4L, 3L, 4L, 2L), .Label = c("-", "A", "C", "G", "T"
), class = "factor")), class = "data.frame", row.names = c(NA,
-28L))
I want to know how many rows of the column named "alt_df1" is missing or - or NA
Here is an answer using which and utilising base R's LETTERS data:
length(which(!df$Alt_df1%in%LETTERS))
#[1] 8
Or using just which:
length(which(df$Alt_df1=="-"))
#[1] 8
One way would be to create a logical vector using %in% and then sum over them to count the number of occurrences.
sum(df$Alt_df1 %in% c("-", NA))
#[1] 8
Or we can also subset and count the number of rows.
nrow(subset(df, Alt_df1 %in% c("-", NA)))
which can also be done in dplyr by
library(dplyr)
df %>% filter(Alt_df1 %in% c("-", NA)) %>% nrow
Another option using grepl
with(df, sum(grepl("-", Alt_df1)) + sum(is.na(Alt_df1)))
and I am sure there are multiple other ways.

Add a column by counting words for each row in R code

I have a data frame of 2511 rows and 6 columns with candy and color items. Please see the first 15 rows as below:
structure(list(x = 1:15, iteml = structure(c(2L, 1L, 1L, 1L,
5L, 4L, 4L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("{dulce1_rojo",
"{dulce2_verde", "{dulce7_plata", "{miel21_amarillo", "{miel30_azul"
), class = "factor"), item2 = structure(c(4L, 2L, 2L, 2L, 1L,
5L, 5L, 4L, 3L, 3L, 4L, 1L, 4L, 4L, 1L), .Label = c("chocolate2l_amarillo",
"dulce2_verde", "dulce7_plata", "miel21_amarillo", "miel30_azul"
), class = "factor"), item3 = structure(c(1L, 1L, 3L, 3L, 2L,
2L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 1L, 2L), .Label = c("chocolate2l_amarillo",
"chocolate30_azul", "miel21_amarillo"), class = "factor"), item4 = structure(c(2L,
2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("chocolate2l_amarillo",
"chocolate32_violeta", "cookie30_azul"), class = "factor"), item5 = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("cookie2l_amarillo}",
"cookie32_violeta}"), class = "factor"), item6 = structure(c(4L,
6L, 1L, 3L, 6L, 1L, 2L, 4L, 6L, 2L, 5L, 6L, 1L, 2L, 4L), .Label = c(">{chocolate2l_amarillo}",
">{chocolate30_azul}", ">{chocolate32_violeta}", ">{dulce1_rojo}",
">{dulce7_plata}", ">{miel21_amarillo}"), class = "factor")), class = "data.frame", row.names = c(NA,
-15L))
I don`t know how can I count in new columns only the kind of candy that each row has. This first line as an expected ouput of the resulting data frame:
x iteml item2 item3 item4 item5 item6 dulce miel chocolate cookie
1 1 {dulce2_verde miel21_amarillo chocolate2l_amarillo chocolate32_violeta cookie32_violeta} >{dulce1_rojo} 2 1 2 1
I'm stuck and I'd appreciate a little help.
you can use apply function to apply grepl function by row for the initial data frame. Then you use sapply to iterate through four ingridients you indicated. Then use cbind to concatentate the initial data frame and the data frame with ingedients into one. Please see the code below:
# initialize data frame
df <- structure(list(x = 1:15, iteml = structure(c(2L, 1L, 1L, 1L,
5L, 4L, 4L, 3L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("{dulce1_rojo",
"{dulce2_verde", "{dulce7_plata", "{miel21_amarillo", "{miel30_azul"
), class = "factor"), item2 = structure(c(4L, 2L, 2L, 2L, 1L,
5L, 5L, 4L, 3L, 3L, 4L, 1L, 4L, 4L, 1L), .Label = c("chocolate2l_amarillo",
"dulce2_verde", "dulce7_plata", "miel21_amarillo", "miel30_azul"
), class = "factor"), item3 = structure(c(1L, 1L, 3L, 3L, 2L,
2L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 1L, 2L), .Label = c("chocolate2l_amarillo",
"chocolate30_azul", "miel21_amarillo"), class = "factor"), item4 = structure(c(2L,
2L, 2L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("chocolate2l_amarillo",
"chocolate32_violeta", "cookie30_azul"), class = "factor"), item5 = structure(c(2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("cookie2l_amarillo}",
"cookie32_violeta}"), class = "factor"), item6 = structure(c(4L,
6L, 1L, 3L, 6L, 1L, 2L, 4L, 6L, 2L, 5L, 6L, 1L, 2L, 4L), .Label = c(">{chocolate2l_amarillo}",
">{chocolate30_azul}", ">{chocolate32_violeta}", ">{dulce1_rojo}",
">{dulce7_plata}", ">{miel21_amarillo}"), class = "factor")), class = "data.frame", row.names = c(NA,
-15L))
# counting ingridients
ingridients <- c("dulce", "miel", "chocolate", "cookie")
x <- sapply(ingridients, function(y) apply(df, 1, function(x) sum(grepl(y, x))))
df_res <- cbind(df, x)
head(df_res)
Output:
x iteml item2 item3 item4 item5 item6 dulce miel chocolate cookie
1 1 {dulce2_verde miel21_amarillo chocolate2l_amarillo chocolate32_violeta cookie32_violeta} >{dulce1_rojo} 2 1 2 1
2 2 {dulce1_rojo dulce2_verde chocolate2l_amarillo chocolate32_violeta cookie32_violeta} >{miel21_amarillo} 2 1 2 1
3 3 {dulce1_rojo dulce2_verde miel21_amarillo chocolate32_violeta cookie32_violeta} >{chocolate2l_amarillo} 2 1 2 1
4 4 {dulce1_rojo dulce2_verde miel21_amarillo chocolate2l_amarillo cookie32_violeta} >{chocolate32_violeta} 2 1 2 1
5 5 {miel30_azul chocolate2l_amarillo chocolate30_azul cookie30_azul cookie2l_amarillo} >{miel21_amarillo} 0 2 2 2
6 6 {miel21_amarillo miel30_azul chocolate30_azul cookie30_azul cookie2l_amarillo} >{chocolate2l_amarillo} 0 2 2 2

How to do iterative ANOVA and extract Mean Square Values from lm object in R

I have a dataset in which I have 18 populations. Each population has several individuals in it, each individual has a "Color" call. I would like to only compare two populations at once in a one-way ANOVA with Population as the main factor in order to get a pairwise MS-within and MS-among.
I know how to extract the MS from the omnibus ANOVA using the following code:
mylm <- lm(Color ~ Pop, data=PopColor)
anova(mylm)[["Mean Sq"]]
Which yields the among-subjects MS (PopColor$Pop) first, then the between-subjects MS (residual) respectively:
[1] 3.7079911 0.4536985
Is there a way in which I can create a do-loop to do all pairwise one-way ANOVA between all populations, then extract the among and within MS?
I would then like to move the two MS values from each comparison to their own symmetrical matrices: one among-subjects MS matrix labeled by population, and one within-subjects MS matrix labeled by population. These would have identical column and row names with the Population name.
Below is a subset of my data with six populations:
dput(dat)
structure(list(Pop = structure(c(6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("pop1001", "pop1026",
"pop252", "pop254a", "pop311", "pop317"), class = "factor"),
Color = c(4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
3L, 3L, 2L, 3L, 3L, 3L, 2L, 2L, 2L, 1L, 3L, 3L, 2L, 3L, 2L,
3L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 2L, 3L, 3L, 2L,
3L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 2L, 3L, 2L, 2L, 3L, 4L,
2L, 3L, 2L, 4L, 3L, 3L, 2L, 3L, 2L, 3L, 3L, 4L, 3L, 2L, 4L,
4L, 1L, 2L, 2L, 2L, 2L, 1L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 3L,
2L, 3L, 4L, 2L, 2L, 4L, 3L)), .Names = c("Pop", "Color"), class = "data.frame", row.names = c(NA,
-94L))
Any help would be greatly appreciated! Thanks!
I don't understand you point 2 ( a little bit technical for not statistician like me). For the first point , I understand it as you want to apply lm/Anova over all pairs of your populations. You can use combn:
combn: Generate all combinations of the elements of x taken m at a time
pops <- unique(PopColor$Pop)
ll <- combn(pops,2,function(x){
dat.pair <- PopColor[PopColor$Pop %in% pops[x],]
mylm <- lm(Color ~ Pop, data=dat.pair)
c(as.character(pops[x]),anova(mylm)[["Mean Sq"]])
},simplify=FALSE)
do.call(rbind,ll)
[,1] [,2] [,3] [,4]
[1,] "pop1026" "pop254a" "0.210291858678956" "0.597865353037767"
[2,] "pop1026" "pop1001" "0.52409988385598" "0.486874236874237"
[3,] "pop1026" "pop317" "15.7296466973886" "0.456486042692939"
[4,] "pop1026" "pop311" "1.34392057218144" "0.631962930099576"
[5,] "pop1026" "pop252" "0.339324116743472" "0.528899835796388"
[6,] "pop254a" "pop1001" "0.0166666666666669" "0.351785714285714"
[7,] "pop254a" "pop317" "14.45" "0.227777777777778"
[8,] "pop254a" "pop311" "1.92898550724637" "0.561430575035063"
[9,] "pop254a" "pop252" "0.8" "0.344444444444444"
[10,] "pop1001" "pop317" "20.4166666666667" "0.205357142857143"
[11,] "pop1001" "pop311" "3.55030333670374" "0.46474019088017"
[12,] "pop1001" "pop252" "1.35" "0.280357142857143"
[13,] "pop317" "pop311" "9.60474308300398" "0.429172510518934"
[14,] "pop317" "pop252" "8.45" "0.116666666666667"
[15,] "pop311" "pop252" "0.110803689064557" "0.496914446002805"
As you can see we have choose(6,2)=15 possibles pairs.

Resources