Calculating medians via dplyr vs. aggregate in R


Hello: I am getting slightly different medians for a data set like the one created below when I compute them via dplyr/tidyr versus aggregate. Can anyone explain the difference? Thank you!
#dataset
out2<-structure(list(d3 = structure(c(1L, 2L, NA, NA, 1L, 1L, NA,
2L,NA,3L,1L, NA, NA, 1L, 3L, NA, 1L, 2L, 3L, 2L, 1L, 3L, 2L, 3L, 1L), .Label
= c("Professional journalist", "Elected politician", "Online blogger"),
class = "factor"), Accessible = c(3, 5, 2,NA, 1, 2, NA, 3, NA, 4, 2, 5, NA,
3, 4, NA, 2, NA, 3, 4, 4, 4,2, 2, 2), Information = c(1, 2, 1, NA, 4, 1, NA,
2, NA, 2, 1, 1, NA, 4, 1, NA, 1, 1, 1, 3, 1, 3, 3, 4, 1), Responsive = c(5,
4, 6, NA, 2, 3, NA, 1, NA, 5, 4, 4, NA, 6, 3, NA, 4, NA, 2, 2, 6, 2, 1, 1,
3), Debate = c(6, 3, 4, NA, 3, 4, NA, 5, NA, 6, 5,6, NA, 1, 5, NA, 5, 2, NA,
1, 5, 6, 5, 5, 7), Officials = c(2,1, 5, NA, 5, 5, NA, 6, NA, 3, 6, 2, NA, 2,
2, NA, 6, 3, NA, 5,2, 5, 4, 6, 5), Social = c(7, 6, 7, NA, 7, 7, NA, 4, NA,
7, 7,
7, NA, 7, 7, NA, 7, NA, NA, 7, 7, 1, 6, 7, 6), `Trade-Offs` = c(4,
7, 3, NA, 6, 6, NA, 7, NA, 1, 3, 3, NA, 5, 6, NA, 3, NA, NA,
6, 3, 7, 7, 3, 4)), .Names = c("d3", "Accessible", "Information",
"Responsive", "Debate", "Officials", "Social", "Trade-Offs"), row.names =
c(171L, 126L, 742L, 379L, 635L, 3L, 303L, 419L, 324L, 97L, 758L, 136L,
770L, 405L, 101L, 674L, 386L, 631L, 168L, 590L, 731L, 387L, 673L, 208L,
728L), class = "data.frame")
#Find medians via tidyr and dplyr
library(dplyr)
library(tidyr)
library(ggplot2)
test<-out2 %>%
gather(variable, value, -1) %>%
filter(!is.na(d3)) %>%
group_by(d3, variable) %>%
summarise(value=median(value, na.rm=TRUE))
#dataframe
test<-data.frame(test)
#find Medians via aggregate
test2<-aggregate(.~d3, data=out2, FUN=median, na.rm=TRUE)
#Gather for plotting
test2<-test2 %>%
gather(variable, value, -d3)
#Plot Medians via tidyr
ggplot(test, aes(x=d3, y=value,
group=d3))+facet_wrap(~variable)+
geom_bar(stat='identity')+labs(title='Medians via TidyR')
#Plot Medians Via aggregate
ggplot(test2, aes(x=d3, y=value,
group=d3))+facet_wrap(~variable)+geom_bar(stat='identity')+
labs(title='Medians via Aggregate')
#Compare Debate, Information and Responsive

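The results differ because the formula method of aggregate uses na.action = na.omit by default: it drops every row that contains an NA in any column before FUN is called, even if other variables in that row contain data. The na.rm = TRUE passed to median therefore never sees those rows.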
You can correct this by specifying a value for the na.action argument, as described in this accepted answer. Here it would be:
test2<-aggregate(.~d3, data=out2, FUN=median, na.rm = TRUE, na.action=NULL)
test2<-test2 %>%
gather(variable, value, -d3)
Confirm that the results are the same:
identical(as.data.frame(test %>% arrange(d3, variable, value)),
as.data.frame(test2 %>% arrange(d3, variable, value)))
[1] TRUE
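The row-dropping behaviour is easier to see on a small example. The sketch below uses a made-up data frame (`df`, `g`, `x` and `y` are illustrative names, not from the question) and `na.action = na.pass`, a standard stats function that leaves NAs in place. Treat it as an illustration of the effect, not as a replacement for the code above.

```r
# Toy data (hypothetical): in group "a", the third row has x but no y
df <- data.frame(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, 2, 10, 4, 6),
  y = c(5, 7, NA, 1, 3)
)

# Default na.action = na.omit: the whole third row is removed first,
# so x = 10 never reaches median()
dropped <- aggregate(. ~ g, data = df, FUN = median, na.rm = TRUE)

# na.pass keeps every row; na.rm = TRUE then removes NAs column by column
kept <- aggregate(. ~ g, data = df, FUN = median, na.rm = TRUE,
                  na.action = na.pass)

dropped$x[dropped$g == "a"]  # median of 1, 2
kept$x[kept$g == "a"]        # median of 1, 2, 10
```

Only the median of `x` in group "a" changes between the two calls, because that is the only value that sat on a row with an NA elsewhere.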

Related

R remove rows with NA in groups of columns containing the same string

I have a dataframe that contains multiple variables, each measured with multiple items at two different time points. I want to remove all rows that have NA entries in groups of columns whose names contain the same part of a string. Some of these groups contain multiple columns (e.g., grep("learn")), some only one (e.g., T1_age). This is (part of) my original dataframe:
data <- data.frame(
T1_age = c(39, 30, 20, 48, 27, 55, 37, 50, 50, 37),
T1_sex = c(2, 1, 1, 2, 2, 1, 1, 2, 1, 1),
T2_learn1 = c(2, NA, 3, 4, 1, NA, NA, 2, 4, 4),
T2_learn2 = c(1, NA, 4, 4, 1, NA, NA, 2, 4, 4),
T2_learn3 = c(2, NA, 4, 4, 1, NA, NA, 3, 4, 4),
T2_learn4 = c(2, NA, 2, 5, 5, NA, NA, 5, 5, 5),
T2_learn5 = c(4, NA, 3, 4, 3, NA, NA, 3, 4, 3),
T2_aut1 = c(NA, NA, 4, 4, 4, NA, NA, 3, 5, 4),
T2_aut2 = c(NA, NA, 4, 4, 4, NA, NA, 3, 5, 5),
T2_aut3 = c(NA, NA, 4, 4, 3, NA, NA, 3, 5, 5),
T2_ssup1 = c(1, NA, 4, 5, 4, NA, NA, 2, 4, 3),
T2_ssup2 = c(3, NA, 4, 5, 5, NA, NA, 3, 4, 4),
T2_ssup3 = c(4, NA, 4, 5, 5, NA, NA, 4, 4, 4),
T2_ssup4 = c(2, NA, 3, 5, 5, NA, NA, 3, 4, 4),
T3_learn1 = c(3, NA, NA, 4, 4, NA, NA, 3, 3, 4),
T3_learn2 = c(1, NA, NA, 4, 3, NA, NA, 3, 3, 4),
T3_learn3 = c(3, NA, NA, 4, 4, NA, NA, 3, 3, 5),
T3_learn4 = c(4, NA, NA, 5, 4, NA, NA, 4, 5, 5),
T3_learn5 = c(4, NA, NA, 3, 4, NA, NA, 3, 3, 4),
T3_aut1 = c(NA, NA, NA, 4, 4, NA, NA, 3, 5, 5),
T3_aut2 = c(NA, NA, NA, 3, 4, NA, NA, 3, 5, 5),
T3_aut3 = c(NA, NA, NA, 3, 2, NA, NA, 3, 5, 5),
T3_ssup1 = c(3, NA, NA, 5, 4, NA, NA, 2, 4, 1),
T3_ssup2 = c(3, NA, NA, 5, 5, NA, NA, 4, 5, 5),
T3_ssup3 = c(4, NA, NA, 5, 5, NA, NA, 4, 5, 3),
T3_ssup4 = c(3, NA, NA, 5, 5, NA, NA, 4, 5, 4)
)
Now I already found a very horrible solution and I believe that could be improved. So this code basically does what I want:
library(dplyr)
library(tidyr)
data <- data %>% filter(rowSums(is.na(.[ , grep("learn", colnames(.))])) != ncol(.[ , grep("learn", colnames(.))]))
data <- data %>% filter(rowSums(is.na(.[ , grep("aut", colnames(.))])) != ncol(.[ , grep("aut", colnames(.))]))
data <- data %>% filter(rowSums(is.na(.[ , grep("ssup", colnames(.))])) != ncol(.[ , grep("ssup", colnames(.))]))
data <- data %>% drop_na(T1_age)
data <- data %>% drop_na(T1_sex)
So the new data frame (and what I want to achieve) looks like this:
data2 <- data.frame(
T1_age = c(20, 48, 27, 50, 50, 37),
T1_sex = c(1, 2, 2, 2, 1, 1),
T2_learn1 = c(3, 4, 1, 2, 4, 4),
T2_learn2 = c(4, 4, 1, 2, 4, 4),
T2_learn3 = c(4, 4, 1, 3, 4, 4),
T2_learn4 = c(2, 5, 5, 5, 5, 5),
T2_learn5 = c(3, 4, 3, 3, 4, 3),
T2_aut1 = c(4, 4, 4, 3, 5, 4),
T2_aut2 = c(4, 4, 4, 3, 5, 5),
T2_aut3 = c(4, 4, 3, 3, 5, 5),
T2_ssup1 = c(4, 5, 4, 2, 4, 3),
T2_ssup2 = c(4, 5, 5, 3, 4, 4),
T2_ssup3 = c(4, 5, 5, 4, 4, 4),
T2_ssup4 = c(3, 5, 5, 3, 4, 4),
T3_learn1 = c(NA, 4, 4, 3, 3, 4),
T3_learn2 = c(NA, 4, 3, 3, 3, 4),
T3_learn3 = c(NA, 4, 4, 3, 3, 5),
T3_learn4 = c(NA, 5, 4, 4, 5, 5),
T3_learn5 = c(NA, 3, 4, 3, 3, 4),
T3_aut1 = c(NA, 4, 4, 3, 5, 5),
T3_aut2 = c(NA, 3, 4, 3, 5, 5),
T3_aut3 = c(NA, 3, 2, 3, 5, 5),
T3_ssup1 = c(NA, 5, 4, 2, 4, 1),
T3_ssup2 = c(NA, 5, 5, 4, 5, 5),
T3_ssup3 = c(NA, 5, 5, 4, 5, 3),
T3_ssup4 = c(NA, 5, 5, 4, 5, 4)
)
Could you help me improve this a bit? Thank you!!!

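You may iterate over the patterns with sapply, grep the matching columns for each one, and check whether the number of NAs per row (rowSums) equals the number of columns in that slice. Rows where this holds for any group are dropped.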
V <- c('learn', 'aut', 'ssup')
res <- data[!rowSums(sapply(V, \(v) {
X <- data[grep(v, names(data))]
rowSums(is.na(X)) == dim(X)[2]
})), ]
stopifnot(all.equal(res, data2, check.attributes=FALSE))
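Alternatively, you can check whether the number of NAs in all "hot" columns together (everything except the demographics) reaches the number of those columns. Note that this is a weaker test: it only drops rows in which every hot column is NA, so a row that is missing one whole group but has data in another (row 1 of the sample data, where only the aut columns are NA) is kept, and the result has one row more than data2.
res1 <- data[rowSums(is.na(data[grep(paste(V, collapse='|'), names(data))])) !=
dim(data[-(1:2)])[2], ]
stopifnot(nrow(res1) == nrow(data2) + 1)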
data2 is the result data frame you provide in the OP. dim(data)[2] gives the same as ncol(data).
Note: R version 4.1.2 (2021-11-01). The \(v) lambda shorthand requires R 4.1.0 or later; use function(v) on older versions.
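The per-group check can also be wrapped in a small helper, which may read more easily when there are many groups. This is only a sketch on made-up data: `all_na_in_group`, `toy` and `groups` are hypothetical names introduced here, not part of the question or the answer above.

```r
# Hypothetical helper: TRUE for rows where every column whose name
# matches `pattern` is NA
all_na_in_group <- function(df, pattern) {
  cols <- df[grep(pattern, names(df))]
  rowSums(is.na(cols)) == ncol(cols)
}

# Toy data with two groups ("learn", "aut") measured at two time points
toy <- data.frame(
  id        = 1:4,
  T2_learn1 = c(1, NA, 3, NA),
  T3_learn1 = c(2, NA, NA, NA),
  T2_aut1   = c(NA, NA, 4, 5),
  T3_aut1   = c(NA, NA, 4, 5)
)

groups <- c("learn", "aut")

# A row is dropped if ANY group is completely missing
drop <- Reduce(`|`, lapply(groups, function(g) all_na_in_group(toy, g)))
kept <- toy[!drop, ]
kept$id
```

Here only row 3 survives: row 1 has no aut data, row 2 has no data at all, and row 4 has no learn data.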

misaligned bars in geom_bar using fill

I have a problem with ggplot's geom_bar.
I have several bar charts rendered from several variables all using the same fourth variable as a fill:
For some reason in the third chart the columns for cohort 2 are misaligned.
All three charts use the same Dataset and the same code.
library(tidyverse)
library(patchwork)
myColours <- c("#A71C49","#11897A","#DD4814", "#282A36")
DataSet <- structure(list(`Var1` = c(3, 2, 5, 3, 4, 1, 3, 1,
5, 4, 5, 3, 5, 5, 5, 4, 4, 5, 4, 5, 5, 5, 5, 1, 5, 5, 4, 4, 3,
5, 5, 4, 1, 3, 5, 2, 5, 5, 4, 4, 2, 5, 1, 5, 3, 5, 5, 5, 2, 5,
3, 1, 5, 5, 5, 5, 4), `Var2` = c(3, 1, 4, 1, 2, 2,
3, 1, 3, 3, 3, 3, 2, 5, 5, 1, 4, 4, 5, 5, 4, 5, 3, 2, 3, 5, 2,
3, 3, 5, 5, 2, 1, 3, 4, 2, 4, 5, 3, 3, 5, 3, 1, 4, 3, 5, 3, 4,
2, 4, 1, 4, 4, 5, 1, 3, 3), `Var3` = c(3, 2, 1,
3, 1, 4, 3, 2, 4, 3, 3, 3, 5, 3, 3, 3, 3, 5, 5, 5, 3, 3, 3, 4,
5, 2, 4, 4, 4, 5, 5, 1, 1, 3, 5, 2, 5, 5, 3, 4, 3, 1, 1, 4, 3,
2, 5, 4, 2, 4, 4, 1, 4, 5, 1, 5, 2), Cohort = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"
), class = "factor")), row.names = c(NA, -57L), class = c("tbl_df",
"tbl", "data.frame"))
c5 <- ggplot(DataSet, aes(`Var1`, fill=Cohort)) +
geom_bar() +
theme(legend.position = "none") +
ylim(0,25) +
scale_fill_manual(values=myColours)
c6 <- ggplot(DataSet, aes(`Var2`, fill=Cohort)) +
geom_bar() +
theme(legend.position = "none") +
ylim(0,25) +
scale_fill_manual(values=myColours)
c7 <- ggplot(DataSet, aes(`Var3`, fill=Cohort)) +
geom_bar() +
ylim(0,25) +
scale_fill_manual(values=myColours)
(c5 | c6 ) /
(c7 | guide_area())
I get the following warning messages:
1: Removed 2 rows containing missing values (geom_bar).
2: position_stack requires non-overlapping x intervals
The missing values refer to the graph for Var1, the non-overlapping x intervals for the third graph.
If I render out just Cohort two I also get these weirdly misaligned bars:
And Cohort 3 Var 3 to compare
I suspected the cause might be that there are only two distinct values in Cohort 2, but the bar chart for Var1 above works. It is also not patchwork causing it, since I get the same result when I render just the Var3 bar chart, and it is not the legend being rendered only for the Var3 graph.
Does anyone have an idea what the problem is or how I can force ggplot to align the bars correctly?
Thank you!
(R version 4.0.4 Patched (2021-02-17 r80030); tidyverse v.1.3.1; patchwork v.1.1.1)

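The third graph is misaligned because Var3 is numeric, and geom_bar treats a numeric x as continuous, which makes it unsuitable for stacking (should x = 1 and x = 1.00001 be stacked on top of each other?). Here Cohort 2 only contains the values 3 and 5 for Var3, so the bars ggplot draws for that group probably get a different width and overlap their neighbours, which is what the position_stack warning points to. Converting the variable to an ordered factor makes the x axis discrete and fixes the alignment.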
Consider this example only using the tidyverse:
myColours <- c("#A71C49","#11897A","#DD4814", "#282A36")
# Lengthen the dataset
DataSet2 <- DataSet %>% pivot_longer(cols = -Cohort,names_to = "Variable")
# This helps against the non-overlapping x intervals issue
DataSet2$value <- as.ordered(DataSet2$value)
ggplot(DataSet2,aes(x=value,fill=Cohort)) +
geom_bar(position= position_stack()) + ylim(0,25)+
facet_wrap(vars(Variable)) + # make multiple graphs split by the column "Variable"
scale_fill_manual(values=myColours)
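If you prefer to keep separate plots instead of facets, the same idea can be applied to a single variable. The sketch below uses a made-up data frame (`toy`, `score`, `score_f` are illustrative names); fixing the factor levels up front and using `scale_x_discrete(drop = FALSE)` is one way, under these assumptions, to give every cohort the same x positions even when a cohort lacks some values.

```r
library(ggplot2)

# Made-up data: cohort "2" only has the scores 3 and 5, like Var3 in the question
toy <- data.frame(
  score  = c(1, 2, 3, 4, 5, 3, 3, 5, 2, 4),
  cohort = factor(c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3))
)

# Declare the full category set so the x axis is discrete and complete
toy$score_f <- factor(toy$score, levels = 1:5)

p <- ggplot(toy, aes(x = score_f, fill = cohort)) +
  geom_bar() +                    # stacked counts per category
  scale_x_discrete(drop = FALSE)  # keep categories that have no observations
```

Printing `p` draws the chart; the same `factor(..., levels = 1:5)` conversion could be applied to Var1, Var2 and Var3 before building c5, c6 and c7.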

R: Error when grouping data using subset or index methods- produces list of all column names

I am encountering an error when trying to group my data based on categories of a variable, and I am not sure why it is happening, because I have used the two most widely recommended methods, subset(dataframe, variable==X) and dataframe[dataframe$variable==X, ], successfully with past data sets and, just now, with the mtcars dataset.
The problem is that when I try to run my code, I get an error in which R just prints out the names of all of my variables (see below).
I am not quite sure how to "show" this problem-- any recommendations regarding what information would be useful to you all would be greatly appreciated. This problem is not reproducible with other datasets. Thank you for any advice you are able to give.
My dataset "wits" has 363 observations and 92 variables. My variable "complete" is a factor variable with four levels: "completed all", "stopped after demos", "stopped after consent", and "skipped manip bc poor id." I would like to create a new dataset made up only of participants with "completed all". I have tried these two methods:
wits_c <- wits[wits$complete=="Completed all", ]
wits_c <-subset(wits,complete=="Completed all")
Which results in the following error:
Error: Columns `Start`, `End`, `GameCode`, `workerID`, `condition`, `about`, `valid`, `consv`, `merit1`, `merit2`, `merit3`, `gender`, `gender_TEXT`, `poorid`, `choseskip`, `age`, `edu`, `race`, `race_TEXT`, `complete`, `distracted`, `happen`, `about__1`, `playread`, `thinking_1`, `thinking_2`, `thinking_3`, `thinking_4`, `thinking_5`, `thinking_6`, `thinking_7`, `text`, `logical_1`, `logical_2`, `logical_3`, `logical_4`, `controll_1`, `controll_2`, `controll_3`, `controll_4`, `controll_5`, `controll_6`, `controll_7`, `controll_8`, `controll_9`, `controll_10`, `privatesol`, `publicsol`, `privatesol_2`, `privatesol_3`, `privatesol_5`, `publicsol_2`, `publicsol_3`, `publicsol_5`, `policy_1`, `policy_2`, `policy_3`, `policy_4`, `colaction_5`, `colaction_6`, `colaction_7`, `colaction_8`, `colaction_10`, `colaction_13`, `joke`, `random`, `say`, `wits`, `wits_nb`, `neutral`, `rural_id`, `relig_id`, `prog_id`, `vignette`, `merit3R`, `policy_3R`, `policy_4R`, `controll_4R`, `controll_5R`,`co
Thank you to user Markdly for the suggestion to include the following output which provides more detailed information about my dataset:
dput(head(wits))
structure(list(Start = structure(c(1499525516, 1499516293, 1499516379,
1499516319, 1499516949, 1499516709), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), End = structure(c(1499525762, 1499518121,
1499516954, 1499517222, 1499517412, 1499517512), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), GameCode = c(2991999, 5712506, 1002944,
8916111, 3495462, 9127270), workerID = c("ACIHCWKHNFC7U", "A3UAO2LYUPO7L6",
"A8L94A9EF23BV", "A258JTYUD56LOE", "A12SJSJIUR3A23", "A1HHOCO3ZZHCJZ"
), condition = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("WITS",
"WITS No Blurb", "Neutral", "Read"), class = "factor"), about = c(2,
2, 2, 2, 2, 2), valid = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label =
c("valid responses",
"invalid responses"), class = "factor"), consv = c(4, 2, 2, 6,
4, 4), merit1 = c(5, 3, 2, 4, 6, 4), merit2 = c(4, 4, 2, 5, 5,
4), merit3 = c(3, 4, 2, 5, 4, 5), gender = structure(c(1L, 1L,
2L, 2L, 2L, 2L), .Label = c("man", "woman", "non-binary", "other"
), class = "factor"), gender_TEXT = c(NA, NA, NA, NA, NA, NA),
poorid = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("not id as poor",
"id as poor"), class = "factor"), choseskip = structure(c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), .Label = c("poor but continued", "poor and skipped"), class = "factor"),
age = c(28, 30, 33, 41, 26, 30), edu = c(5, 5, 6, 5, 3, 5
), race = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("white",
"black", "latino", "asian", "native american", "other", "multiracial"
), class = "factor"), race_TEXT = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), complete = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Completed
all",
"Stopped after demos", "Stopped after consent", "Skipped manip bc poor id"
), class = "factor"), distracted = c(4, 5, 0, 0, 0, 5), happen = c(0,
0, 0, 0, 0, 0), about__1 = c(2, 2, 2, 2, 2, 2), playread = c(1,
1, 1, 1, 1, 1), thinking_1 = c(1, 6, 5, 4, 6, 4), thinking_2 = c(4,
3, 5, 1, 5, 5), thinking_3 = c(1, 4, 7, 4, 5, 5), thinking_4 = c(6,
4, 7, 1, 5, 3), thinking_5 = c(5, 3, 6, 6, 5, 4), thinking_6 = c(4,
3, 7, 6, 6, 4), thinking_7 = c(6, 3, 7, 6, 5, 5), text = c(2,
4, 5, 4, 2, 3), logical_1 = c(1, 3, 4, 4, 4, 4), logical_2 = c(2,
4, 3, 4, 5, 4), logical_3 = c(4, 3, 5, 4, 5, 4), logical_4 = c(1,
6, 3, 4, 4, 3), controll_1 = c(6, 6, 1, 4, 4, 4), controll_2 = c(1,
3, 1, 4, 5, 5), controll_3 = c(1, 4, 3, 6, 4, 5), controll_4 = c(3,
4, 6, 3, 4, 4), controll_5 = c(6, 3, 6, 1, 4, 4), controll_6 = c(3,
3, 1, 3, 5, 4), controll_7 = c(2, 5, 1, 5, 4, 5), controll_8 = c(2,
2, 1, 4, 5, 3), controll_9 = c(6, 2, 6, 3, 5, 5), controll_10 = c(1,
3, 6, 2, 5, 3), privatesol = c(5, 5.33333333333333, 12, 8,
8.33333333333333, 8.33333333333333), publicsol = c(7.66666666666667,
2.66666666666667, 12, 2, 11.3333333333333, 8.33333333333333
), privatesol_2 = c(1, 11, 12, 11, 11, 3), privatesol_3 = c(3,
2, 12, 2, 3, 11), privatesol_5 = c(11, 3, 12, 11, 11, 11),
publicsol_2 = c(11, 3, 12, 2, 12, 11), publicsol_3 = c(11,
2, 12, 2, 11, 3), publicsol_5 = c(1, 3, 12, 2, 11, 11), policy_1 = c(1,
5, 6, 1, 4, 4), policy_2 = c(3, 2, 6, 1, 5, 4), policy_3 = c(1,
3, 2, 6, 5, 3), policy_4 = c(6, 3, 5, 6, 5, 4), colaction_5 = c(2,
5, 6, 1, 2, 5), colaction_6 = c(6, 4, 1, 6, 5, 4), colaction_7 = c(6,
2, 6, 1, 3, 4), colaction_8 = c(4, 5, 6, 1, 2, 4), colaction_10 = c(4,
3, 1, 6, 5, 4), colaction_13 = c(3, 2, 6, 1, 2, 3), joke = c(2,
2, 2, 2, 2, 2), random = c(2, 2, 2, 2, 2, 2), say = c(NA,
"Nope", NA, "This was highly biased survey.", "good survey",
"NO"), wits = c(1, 1, 1, 1, 1, 1), wits_nb = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), neutral = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), rural_id = c(0,
0, 0, 1, 0, 0), relig_id = c(0, 1, 1, 1, 1, 1), prog_id = c(0,
0, 1, 0, 1, 1), vignette = structure(c(1L, 1L, 1L, 1L, 1L,
1L), .Label = c("WITS", "WITS No Blurb", "Neutral", "Read"
), class = "factor"), merit3R = c(4, 3, 5, 2, 3, 2), policy_3R = c(6,
4, 5, 1, 2, 4), policy_4R = c(1, 4, 2, 1, 2, 3), controll_4R = c(4,
3, 1, 4, 3, 3), controll_5R = c(1, 4, 1, 6, 3, 3), controll_9R = c(1,
5, 1, 4, 2, 2), controll_10R = c(6, 4, 1, 5, 2, 4), colaction_6R = c(1,
3, 6, 1, 2, 3), colaction_10R = c(3, 4, 6, 1, 2, 3), logical = c(2,
4, 3.75, 4, 4.5, 3.75), cognition = structure(c(1, 6, 5,
4, 6, 4, 4, 3, 5, 1, 5, 5, 1, 4, 7, 4, 5, 5, 6, 4, 7, 1,
5, 3, 5, 3, 6, 6, 5, 4, 4, 3, 7, 6, 6, 4, 6, 3, 7, 6, 5,
5), .Dim = 6:7), engage = c(5, 3, 6.66666666666667, 6, 5.33333333333333,
4.33333333333333), pertake = c(3.66666666666667, 3.66666666666667,
6.33333333333333, 2, 5, 4.33333333333333), policy = c(2.75,
3.75, 4.75, 1, 3.25, 3.75), colaction = c(3.16666666666667,
3.5, 6, 1, 2.16666666666667, 3.66666666666667), controllability = c(2.7,
3.9, 1.2, 4.5, 3.7, 3.8), completebi = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = c("Completed all", "Stopped after demos"
), class = "factor"), gender2 = structure(c(1L, 1L, 2L, 2L,
2L, 2L), .Label = c("man", "woman"), class = "factor")), .Names = c("Start", "End", "GameCode", "workerID", "condition", "about", "valid",
"consv", "merit1", "merit2", "merit3", "gender", "gender_TEXT",
"poorid", "choseskip", "age", "edu", "race", "race_TEXT", "complete",
"distracted", "happen", "about__1", "playread", "thinking_1",
"thinking_2", "thinking_3", "thinking_4", "thinking_5", "thinking_6",
"thinking_7", "text", "logical_1", "logical_2", "logical_3",
"logical_4", "controll_1", "controll_2", "controll_3", "controll_4",
"controll_5", "controll_6", "controll_7", "controll_8", "controll_9",
"controll_10", "privatesol", "publicsol", "privatesol_2", "privatesol_3",
"privatesol_5", "publicsol_2", "publicsol_3", "publicsol_5",
"policy_1", "policy_2", "policy_3", "policy_4", "colaction_5",
"colaction_6", "colaction_7", "colaction_8", "colaction_10",
"colaction_13", "joke", "random", "say", "wits", "wits_nb", "neutral",
"rural_id", "relig_id", "prog_id", "vignette", "merit3R", "policy_3R",
"policy_4R", "controll_4R", "controll_5R", "controll_9R", "controll_10R",
"colaction_6R", "colaction_10R", "logical", "cognition", "engage",
"pertake", "policy", "colaction", "controllability", "completebi",
"gender2"), row.names = c(NA, 6L), class = c("tbl_df", "tbl",
"data.frame"))

wits$complete
#[1] Completed \nall Completed \nall Completed \nall Completed \nall Completed \nall Completed \nall
#Levels: Completed \nall Stopped after demos Stopped after consent Skipped manip bc poor id
You can see that the level is "Completed \nall" (with an embedded newline), not "Completed all", in your wits data frame.
## So match the exact level, for example with the which function, to subset your wits data frame.
which(wits$complete == "Completed \nall")
#[1] 1 2 3 4 5 6 ## These are the row indices. Use them to subset your data frame as below and you are good to go.
## So, this will subset your data frame
wits[which(wits$complete == "Completed \nall"),]
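Rather than typing the newline every time, it may be simpler to clean the factor levels once. The following is a minimal sketch on a made-up factor (`status` is a hypothetical stand-in for wits$complete), assuming the only problem is stray whitespace inside the labels.

```r
# Stand-in for wits$complete: one level contains an embedded newline
status <- factor(c("Completed \nall", "Stopped after demos", "Completed \nall"))

# The plain label does not match yet
before <- sum(status == "Completed all")

# Collapse any run of whitespace (spaces, newlines, tabs) into a single space
levels(status) <- gsub("\\s+", " ", levels(status))

# Now the comparison behaves as expected
after <- sum(status == "Completed all")
```

After this step, a comparison against "Completed all" matches the cleaned level.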

Setting edge attributes conditionally in list-column graphs using R igraph and dplyr (purrr)

I have a dataframe with a series of igraph objects in list-column format. I would like to conditionally set the edge color attribute.
I've included the dput output for a sample version of the actual dataframe (very large, thousands of graphs) containing just three graphs. It's still long, so I've put it at the bottom of this post and I'll explain a couple of the ideas I've tried so far.
First attempt was multiple uses of mutate and map using the purrr package.
sampleColored <- sampleGraphs %>% mutate(map(graph, function(x)
E(x)[weights == 0]$color = "blue")) %>% mutate(map(graph, function(x)
E(x)[weights < 0]$color = "red")) %>% mutate(map(graph, function(x)
E(x)[weights > 0]$color = "green"))
No error messages, but the command
shortPlots <- sampleColored %>%
mutate(plots = map(graph, function(x) plot(x, layout=layout.circle,
vertex.size=20,
edge.curved=TRUE)))
produced nice graphs with all edges colored grey.
Likewise with my second attempt where I created an edgeColor function and used a single map call.
edgecolor <- function(x) {
E(x)[weights == 0]$color <- "blue"
E(x)[weights < 0]$color <- "red"
E(x)[weights > 0]$color <- "green"
return(E(x))
}
sampleColored <- sampleGraphs %>% mutate(map(graph, function(x) edgecolor(x)))
No error and grey edges. Dropping the mutate command gives rise to the error message:
Error in as.numeric(n): cannot coerce type 'closure' to vector of type 'double'
I'm confident that this is possible and I simply don't have the understanding to get to the correct syntax. Any suggestions will be appreciated. Thanks for looking.
Here's the sampleGraph dput:
sampleGraphs <- structure(list(ID = 997:1000, graph = list(structure(list(5,
TRUE, c(0, 1, 2, 0, 3, 4, 1, 2, 4, 3, 0, 4, 2, 3, 0, 1, 3,
1, 4, 2), c(1, 0, 0, 4, 1, 1, 4, 3, 0, 2, 3, 2, 1, 4, 2,
3, 0, 2, 3, 4), c(0, 14, 10, 3, 1, 17, 15, 6, 2, 12, 7, 19,
16, 4, 9, 13, 8, 5, 11, 18), c(1, 2, 16, 8, 0, 12, 4, 5,
14, 17, 9, 11, 10, 15, 7, 18, 3, 6, 19, 13), c(0, 4, 8, 12,
16, 20), c(0, 4, 8, 12, 16, 20), list(c(1, 0, 1), structure(list(), .Names = character(0)),
structure(list(name = c("3", "0", "2", "4", "1")), .Names = "name"),
structure(list(weights = c(3L, -4L, 4L, -3L, 43L, 8L,
4L, 14L, 1L, 55L, 2L, 22L, 26L, 64L, 9L, 2L, 13L, -12L,
25L, 16L)), .Names = "weights")), <environment>), class = "igraph"),
structure(list(5, TRUE, c(0, 1, 2, 2, 1, 3, 1, 3, 4, 3, 3,
0, 4, 0, 4, 4, 2, 1, 2, 0), c(3, 3, 4, 0, 2, 1, 4, 2, 0,
4, 0, 2, 1, 4, 2, 3, 3, 0, 1, 1), c(19, 11, 0, 13, 17, 4,
1, 6, 3, 18, 16, 2, 10, 5, 7, 9, 8, 12, 14, 15), c(17, 3,
10, 8, 19, 18, 5, 12, 11, 4, 7, 14, 0, 1, 16, 15, 13, 6,
2, 9), c(0, 4, 8, 12, 16, 20), c(0, 4, 8, 12, 16, 20), list(
c(1, 0, 1), structure(list(), .Names = character(0)),
structure(list(name = c("2", "0", "1", "3", "4")), .Names = "name"),
structure(list(weights = c(4L, -4L, 25L, 22L, 4L, 3L,
2L, -3L, 55L, 2L, 9L, 16L, 43L, 14L, 64L, 13L, 1L, -12L,
8L, 26L)), .Names = "weights")), <environment>), class = "igraph"),
structure(list(5, TRUE, c(0, 1, 2, 3, 4, 0, 1, 2, 1, 3, 1,
3, 2, 4, 2, 4, 0, 0, 3, 4), c(1, 4, 3, 4, 0, 4, 2, 0, 0,
2, 3, 1, 4, 1, 1, 2, 3, 2, 0, 3), c(0, 17, 16, 5, 8, 6, 10,
1, 7, 14, 2, 12, 18, 11, 9, 3, 4, 13, 15, 19), c(8, 7, 18,
4, 0, 14, 11, 13, 17, 6, 9, 15, 16, 10, 2, 19, 5, 1, 12,
3), c(0, 4, 8, 12, 16, 20), c(0, 4, 8, 12, 16, 20), list(
c(1, 0, 1), structure(list(), .Names = character(0)),
structure(list(name = c("4", "0", "3", "2", "1")), .Names = "name"),
structure(list(weights = c(43L, 4L, 9L, 16L, 25L, 64L,
-4L, 2L, 2L, 4L, -11L, 26L, -3L, 8L, 3L, 1L, 55L, 13L,
14L, 22L)), .Names = "weights")), <environment>), class = "igraph"),
structure(list(5, TRUE, c(0, 1, 2, 3, 4, 1, 3, 2, 4, 0, 1,
3, 2, 4, 0, 0, 2, 4, 1, 3), c(4, 4, 4, 1, 2, 0, 2, 3, 0,
3, 2, 0, 1, 1, 2, 1, 0, 3, 3, 4), c(15, 14, 9, 0, 5, 10,
18, 1, 16, 12, 7, 2, 11, 3, 6, 19, 8, 13, 4, 17), c(5, 16,
11, 8, 15, 12, 3, 13, 14, 10, 6, 4, 9, 18, 7, 17, 0, 1, 2,
19), c(0, 4, 8, 12, 16, 20), c(0, 4, 8, 12, 16, 20), list(
c(1, 0, 1), structure(list(), .Names = character(0)),
structure(list(name = c("1", "4", "0", "2", "3")), .Names = "name"),
structure(list(weights = c(1L, 13L, -4L, 14L, 3L, 64L,
26L, -11L, -3L, 22L, 43L, 16L, 2L, 2L, 8L, 25L, 4L, 8L,
55L, 4L)), .Names = "weights")), <environment>), class = "igraph"))), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -4L), .Names = c("ID",
"graph"))

Using set_edge_attr rather than igraph's idiomatic E() edge function helps. I had to revise the sampleGraph list to a simple list of graphs, upgraded to the newer version of igraph, but this works:
graphs <- sampleGraphs$graph
graphs <- lapply(graphs, function(x) upgrade_graph(x)) #making a simple list of graphs
edgecolor <- function(x) {
E(x)[weights == 0]$color <- "blue"
E(x)[weights < 0]$color <- "red"
E(x)[weights > 0]$color <- "green"
return(E(x)$color)
} #The function now returns a vector of colors based on the conditions above
#Pass the result of the function to the "value" argument of "set_edge_attr"
graphs_colored <- graphs %>% map(., function(x) set_edge_attr(x, "color", value = edgecolor(x)))
par(mfrow = c(2,2), mar = c(0,0,0,0))
shortPlots <- graphs_colored %>%
map(., function(x) plot(x,
layout=layout.circle,
vertex.size=20,
edge.curved=TRUE,
edge.arrow.size = 0.5))

Got it! Thanks to @paqmo for the suggestions. I needed to use mutate to redefine the graph list-column variable.
edgecolor <- function(x) {
E(x)[weights == 0]$color <- "#FF000000"
E(x)[weights < 0]$color <- "red"
E(x)[weights > 0]$color <- "green"
return(E(x)$color)
}
sampleColored <- sampleGraphs %>% mutate(graph = map(graph, function(x)
set_edge_attr(x, "color", value = edgecolor(x))))
par(mfrow = c(2,2), mar = c(0,0,0,0))
samplePlots <- sampleColored %>%
mutate(plots = map(graph, function(x) plot(x, layout=layout.circle,
vertex.size=20,
edge.curved=TRUE)))
This generates the same image as @paqmo's answer.
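The reason the first attempts left the edges grey is that R functions work on a copy of their argument: assigning to E(x)$color inside the function changes only the local copy, so the function has to return the modified graph and the result has to be stored. A minimal sketch of that pattern on a small made-up graph follows; `color_edges` and `g` are hypothetical names, and the `weights` attribute is set by hand to mirror the question.

```r
library(igraph)

# Hypothetical helper: colour edges by the sign of their "weights" attribute
# and return the modified graph (not E(g))
color_edges <- function(g) {
  w <- E(g)$weights
  E(g)$color <- ifelse(w < 0, "red", ifelse(w > 0, "green", "blue"))
  g
}

# Small made-up graph with four edges
g <- make_ring(4)
E(g)$weights <- c(-1, 0, 2, 5)

g2 <- color_edges(g)

E(g2)$color  # colours are set on the returned copy
E(g)$color   # the original graph is unchanged
```

With a list-column, the same helper would be used as mutate(graph = map(graph, color_edges)), which is the reassignment step the answer above relies on.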

Preparing data before doing Principal component analysis (PCA)

I have a data frame (200 x 300) that consists of mixed (character and numeric) variables and has many missing values (NA).
My first problem is how to convert all the data to numeric. I can use factors, but there are about 100 columns to convert.
Secondly, my columns are not all expressed in equivalent units.
I would just like some good advice on preparing the data before starting my analysis.
The following is the structure of the data:
structure(list(Hormonal.cycle.status..P4. = c(1, 1, 4, 1, 4,
1), Hormonal.medication.status = c(2, 1, 1, 2, 1, 1), Hormonal.medication.type = c(21,
27, 27, 26, 27, 27), ID.pathologist.main = c(3, 3, 3, 4, 2, 1
), ID.pathologist.sub = c(2, 1, 2, 2, 2, 2), Day.of.the.cycle_calculated = c(10,
8, 22, 19, 19, 12), Cycle.status..histology.and.cycle.day. = c(12,
18, 9, 1, 18, 3), Cycle.status.final..P4..histology..cycle.day. = c(1,
4, 5, 1, 6, 3), Deep.lesion = c(2, 2, 1, 2, 1, 2), Ovarian.lesion = c(2,
2, 2, 1, 2, 2), Peritoneal.lesion = c(2, 2, 2, 2, 2, 2), Combination.of.lesions = c(1,
1, 7, 4, 7, 1), DEEP.all.types = c(2, 2, NA, 2, NA, 2), DEEP.uterosacral = c(2,
2, NA, 2, NA, 2), DEEP.RVE = c(1, 1, 1, 2, 1, 1), DEEP.bowel = c(1,
1, 1, 1, 1, 1), DEEP.bladder = c(1, 1, 1, 1, 1, 1), Ovarian.endo.cyst = c(5,
5, 3, 1, 3, 5), Peritoneal = c(2, NA, 2, NA, 2, 2), Peritoneal.size.total = c(3,
3, 3, 3, 3, 2), Date.of.the.surgery = c(96, 98, 17, 105, 107,
108), Type.of.surgery = c(1, 1, 1, 1, 1, 1), Perit..surface = c(1,
2, 3, 3, 2, 2), Perit..deep = c(3, NA, NA, NA, 3, 3), R.ovary.surface = c(NA,
1, 2, 1, 1, NA), R.ovary.deep = c(NA, NA, 2, NA, 3, NA), L.ovary.surface = c(2,
NA, 2, NA, 1, NA), L.ovary.deep = c(2, 4, 4, NA, 4, 4), F.d.block = c(NA,
NA, 1, 1, NA, 1), R.ovary.frail = c(NA, NA, 3, NA, NA, 3), R.ovary.tight = c(NA,
NA, NA, NA, 3, NA), L.ovary.frail = c(NA, NA, NA, 2, NA, NA),
L.ovary.tight = c(2, 2, 3, NA, 2, 2), R.tuba.frail = c(NA,
NA, NA, NA, NA, 2), R.tuba.tight = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), L.tuba.frail = c(NA,
NA, NA, NA, NA, 2), L.tuba.tight = c(4, NA, NA, NA, 4, NA
), Elsewhere = c(35, NA, NA, 24, NA, NA), Other.diseases = c(16,
NA, NA, NA, NA, 11), In.microarray = c(2, 2, 1, 2, 2, 1),
In.cytokine.plexes = c(2, 2, 2, 2, 2, 2)), .Names = c("Hormonal.cycle.status..P4.",
"Hormonal.medication.status", "Hormonal.medication.type", "ID.pathologist.main",
"ID.pathologist.sub", "Day.of.the.cycle_calculated", "Cycle.status..histology.and.cycle.day.",
"Cycle.status.final..P4..histology..cycle.day.", "Deep.lesion",
"Ovarian.lesion", "Peritoneal.lesion", "Combination.of.lesions",
"DEEP.all.types", "DEEP.uterosacral", "DEEP.RVE", "DEEP.bowel",
"DEEP.bladder", "Ovarian.endo.cyst", "Peritoneal", "Peritoneal.size.total",
"Date.of.the.surgery", "Type.of.surgery", "Perit..surface", "Perit..deep",
"R.ovary.surface", "R.ovary.deep", "L.ovary.surface", "L.ovary.deep",
"F.d.block", "R.ovary.frail", "R.ovary.tight", "L.ovary.frail",
"L.ovary.tight", "R.tuba.frail", "R.tuba.tight", "L.tuba.frail",
"L.tuba.tight", "Elsewhere", "Other.diseases", "In.microarray",
"In.cytokine.plexes"), row.names = c("H003", "H004", "H006",
"H007", "H008", "H011"), class = "data.frame")
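One possible way to approach the points raised above is sketched below on a small made-up data frame (`raw`, `num` and the column names are hypothetical, not taken from the real data). It assumes that coding categories as integers is acceptable, which imposes an arbitrary order on nominal variables, and that median imputation is a reasonable stand-in for the missing values. Both are assumptions to check against the actual study, not recommendations.

```r
# Toy stand-in for the real 200 x 300 data frame (values are made up)
raw <- data.frame(
  status = c("a", "b", "a", "c", "b", "a"),
  day    = c(10, 8, 22, NA, 19, 12),
  size   = c(3, 3, NA, 3, 2, 1),
  empty  = NA_real_,
  stringsAsFactors = FALSE
)

# 1. Convert every column at once: character -> integer codes, rest -> numeric
num <- as.data.frame(lapply(raw, function(col) {
  if (is.character(col)) as.numeric(factor(col)) else as.numeric(col)
}))

# 2. Drop columns that are entirely NA
num <- num[, colSums(!is.na(num)) > 0, drop = FALSE]

# 3. Replace remaining NAs with the column median (a simple assumption)
num[] <- lapply(num, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})

# 4. Drop constant columns, which cannot be scaled to unit variance
num <- num[, sapply(num, var) > 0, drop = FALSE]

# 5. Centre and scale so that differing units do not dominate the components
pca <- prcomp(num, center = TRUE, scale. = TRUE)
```

The scale. = TRUE argument addresses the unit problem by standardising each column. For truly nominal variables, dummy coding (for example with model.matrix) may be more appropriate than integer codes.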
