Group function by two variables on data.table - r

My data looks something like this
students<-data.table(studid=c(1:6) ,FACULTY= c("IT","SCIENCE", "LAW","IT","IT","IT"),
SEX=c("Male","Male","Male","Female","Female","Male"), WAM=c(65,35,98,55,20,80))
studid FACULTY SEX AVE_MARK (WAM)
1 IT Male 65
2 SCIENCE Male 35
3 LAW Male 98
4 IT Female 55
5 IT Female 20
6 IT Male 80
I have used the following code to calculate the averages
students[, mean(WAM, na.rm=T),by=FACULTY][order(-V1)]
So my headings are
FACULTY V1
IT 65
LAW 50
etc
I would like to break this up by sex as well:
FACULTY V1 V1
Male Female
IT 65 11
LAW 50 11
Any advice on how to do this would be greatly appreciated.

You could try
dcast.data.table(students, FACULTY~SEX, fun.aggregate=mean, na.rm=TRUE,
value.var='WAM')
# FACULTY Female Male
#1: IT 37.5 72.5
#2: LAW NaN 98.0
#3: SCIENCE NaN 35.0

Do you definitely need it in cross-tabular format? If so, akrun's answer (the dcast approach above) is the way to go.
Otherwise, here they are stacked:
> students[, mean(WAM, na.rm=T),by=c('FACULTY','SEX')]
FACULTY SEX V1
1: IT Male 72.5
2: SCIENCE Male 35.0
3: LAW Male 98.0
4: IT Female 37.5
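If a descriptive column name is preferred over the default V1, the grouped mean can be named inside .(). This is only a minimal sketch on the sample data from the question; the names mean_wam and by_fac_sex are illustrative choices, not anything from the original posts.

```r
library(data.table)

# Sample data from the question
students <- data.table(
  studid  = 1:6,
  FACULTY = c("IT", "SCIENCE", "LAW", "IT", "IT", "IT"),
  SEX     = c("Male", "Male", "Male", "Female", "Female", "Male"),
  WAM     = c(65, 35, 98, 55, 20, 80)
)

# Group by both variables; mean_wam is a hypothetical name replacing V1
by_fac_sex <- students[, .(mean_wam = mean(WAM, na.rm = TRUE)),
                       by = .(FACULTY, SEX)]

# Sort descending by the named column
by_fac_sex[order(-mean_wam)]
```

Naming the column up front means the later order() call does not depend on the auto-generated V1.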

Related

Exchange NA with mean of previous and next value

I searched for a way to replace NAs with the mean of the previous and next values in one specific column of a data frame. But I didn't find an answer that shows how to do this in base R and that also handles NAs that are next to each other.
the DataFrame:
name number
1 John 56
2 Garry NA
3 Carl 70
4 Doris 96
5 Wendy NA
6 Louis NA
7 Becky 40
Desired output:
name number
1 John 56
2 Garry 63
3 Carl 70
4 Doris 96
5 Wendy 68
6 Louis 68
7 Becky 40
within(df1, number.fill <-
rowMeans(cbind(ave(number, cumsum(!is.na(number)),
FUN=function(x) x[1]),
rev(ave(rev(number), cumsum(!is.na(rev(number))),
FUN=function(x) x[1])))))
#> name number number.fill
#> 1 John 56 56
#> 2 Garry NA 63
#> 3 Carl 70 70
#> 4 Doris 96 96
#> 5 Wendy NA 68
#> 6 Louis NA 68
#> 7 Becky 40 40
Data:
read.table(text = "name number
John 56
Garry NA
Carl 70
Doris 96
Wendy NA
Louis NA
Becky 40",
header = T, stringsAsFactors = F) -> df1
In Base R you could do:
idx <- is.na(df$number)
df$number[idx] <- 0
b <- with(rle(df$number), rep(stats::filter(values, c(1,0,1)/2), lengths))
df$number[idx] <- b[idx]
df
name number
1 John 56
2 Garry 63
3 Carl 70
4 Doris 96
5 Wendy 68
6 Louis 68
7 Becky 40
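For readers who find the one-liners above hard to follow, the same idea can be spelled out step by step in base R. This is a sketch, not a replacement for the answers: it assumes that NAs at the very start or end of the column (which have no neighbour on one side) should simply stay NA, and the names known, na_idx and filled are made up for illustration.

```r
# Sample column from the question
number <- c(56, NA, 70, 96, NA, NA, 40)

known  <- which(!is.na(number))   # positions holding a value
na_idx <- which(is.na(number))    # positions to fill

# For each NA: index (within 'known') of the closest known value before it
p <- findInterval(na_idx, known)

# Only fill NAs that have a known value on both sides (assumption)
ok <- p >= 1 & p < length(known)

filled <- number
filled[na_idx[ok]] <- (number[known[p[ok]]] + number[known[p[ok] + 1]]) / 2
filled
```

Because consecutive NAs share the same previous and next known value, Wendy and Louis both receive (96 + 40) / 2.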

How do I generate a frequency table in R via dplyr and plot its values with ggplot?

I need to do a frequency table from two categorical variable columns where one is a 5-year age group and the other is health status (five states) from the brfss2013 data set, from where I extracted the columns of interest via:
> hlthgrpq1 <- brfss2013 %>% select(genhlth, X_ageg5yr)
Thus generating a two column frame, 491775 observations of 2 variables.
genhlth X_ageg5yr
1 Fair Age 60 to 64
2 Good Age 50 to 54
3 Good Age 55 to 59
4 Very good Age 60 to 64
5 Good Age 65 to 69
I can generate a summary table with the 'by' function:
> by(hlthgrpq1$genhlth, hlthgrpq1$X_ageg5yr, summary)
hlthgrpq1$X_ageg5yr: Age 18 to 24
Excellent Very good Good Fair Poor NA's
6896 10266 7795 1873 303 69
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 25 to 29
Excellent Very good Good Fair Poor NA's
5779 8488 6521 1751 325 46
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 30 to 34
Excellent Very good Good Fair Poor NA's
6412 9958 7977 2295 496 75
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 35 to 39
Excellent Very good Good Fair Poor NA's
6366 10169 8236 2637 638 61
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 40 to 44
Excellent Very good Good Fair Poor NA's
6689 11130 9193 3334 1067 95
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 45 to 49
Excellent Very good Good Fair Poor NA's
7051 12278 10611 4343 1815 112
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 50 to 54
Excellent Very good Good Fair Poor NA's
8545 15254 13761 6354 3120 139
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 55 to 59
Excellent Very good Good Fair Poor NA's
8500 16759 15394 7643 3998 197
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 60 to 64
Excellent Very good Good Fair Poor NA's
8283 16825 16266 8101 3955 229
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 65 to 69
Excellent Very good Good Fair Poor NA's
7479 15764 15600 7749 3200 205
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 70 to 74
Excellent Very good Good Fair Poor NA's
5491 11943 13125 6491 2721 196
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 75 to 79
Excellent Very good Good Fair Poor NA's
3320 8501 10128 5545 2426 173
----------------------------------------------------------------------------------------------------------------
hlthgrpq1$X_ageg5yr: Age 80 or older
Excellent Very good Good Fair Poor NA's
3697 10285 14400 8116 3695 322
And that's where I get stuck. I have tried for hours to attempt to get here:
Results obtained via spreadsheet.
Thanks for any help.
(This is for a specific assignment so I can only use dplyr and ggplot2, so, no reshape2 or tidyr.)
First off: For future postings, it is good practice to always include sample data. See here how to provide a minimal reproducible example/attempt including sample data.
Solution in base R.
as.data.frame.matrix(t(table(df)));
# Fair Good Very good
#Age 50 to 54 0 1 0
#Age 55 to 59 0 1 0
#Age 60 to 64 1 0 1
#Age 65 to 69 0 1 0
Or something like this as a tidyverse approach?
library(tidyverse);
df %>% count(genhlth, X_ageg5yr) %>% spread(genhlth, n);
## A tibble: 4 x 4
# X_ageg5yr Fair Good `Very good`
# <fct> <int> <int> <int>
#1 Age 50 to 54 NA 1 NA
#2 Age 55 to 59 NA 1 NA
#3 Age 60 to 64 1 NA 1
#4 Age 65 to 69 NA 1 NA
Or if you insist on only using dplyr and not tidyr, you can do:
df2 <- df %>%
count(genhlth, X_ageg5yr);
df2 <- as.data.frame.matrix(xtabs(n ~ X_ageg5yr + genhlth, data = df2));
# Fair Good Very good
#Age 50 to 54 0 1 0
#Age 55 to 59 0 1 0
#Age 60 to 64 1 0 1
#Age 65 to 69 0 1 0
This basically boils down to a long-to-wide reshape of the counts; SO has plenty of discussions around that topic.
Sample data
df <- read.table(text =
"genhlth X_ageg5yr
Fair 'Age 60 to 64'
Good 'Age 50 to 54'
Good 'Age 55 to 59'
'Very good' 'Age 60 to 64'
Good 'Age 65 to 69'", header = T)
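The question also asks about plotting, which the answers above do not cover. As a rough sketch using only dplyr and ggplot2, the long output of count() can be passed to ggplot() directly, so no reshaping is needed. The grouped bar chart and the axis labels are assumptions about what the plot should look like, since the target image is not available.

```r
library(dplyr)
library(ggplot2)

# Same sample rows as above, built without read.table
df <- data.frame(
  genhlth   = c("Fair", "Good", "Good", "Very good", "Good"),
  X_ageg5yr = c("Age 60 to 64", "Age 50 to 54", "Age 55 to 59",
                "Age 60 to 64", "Age 65 to 69")
)

# Long frequency table: one row per (age group, health status) pair
freq <- df %>% count(X_ageg5yr, genhlth)

# Grouped bars: one bar per health status within each age group
p <- ggplot(freq, aes(x = X_ageg5yr, y = n, fill = genhlth)) +
  geom_col(position = "dodge") +
  labs(x = "Age group", y = "Count", fill = "General health")

# print(p) draws the chart
```

On the full brfss2013 extract the same two steps should apply unchanged; rows with NA in genhlth would show up as their own group unless filtered out first.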

How to find ratio between two columns in a dataframe?

For example:
df
Cars Male Female
Ford focus 23 64
vw golf 76 12
ford ka 34 55
renault megane 12 83
How do I find the cars for which the ratio of male to female is greater than 0.5?
Just subset your data frame using that ratio:
df[df$Male / df$Female > 0.5, ]
Cars Male Female
2 vw golf 76 12
3 ford ka 34 55
You might try a which() function:
df[which(df[,2]/df[,3]>0.5),1]
Good luck!
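If the ratio itself is also of interest, it can be stored as a column first and then used for filtering. A small sketch on the example data, assuming the columns are named Male and Female; the column name ratio and the object high are invented for illustration.

```r
# Example data from the question
df <- data.frame(
  Cars   = c("Ford focus", "vw golf", "ford ka", "renault megane"),
  Male   = c(23, 76, 34, 12),
  Female = c(64, 12, 55, 83),
  stringsAsFactors = FALSE
)

# Store the male-to-female ratio ('ratio' is a hypothetical column name)
df$ratio <- df$Male / df$Female

# Keep only the rows where the ratio exceeds 0.5
high <- df[df$ratio > 0.5, ]
high
```

This keeps vw golf and ford ka, matching the subset shown above.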

Separating three dimensional array by the stratifying variable

I am working with the UCBAdmissions data set, and I want to separate out the data set into the 6 departmental tables that you get when you simply run
>UCBAdmissions
, , Dept = A
Gender
Admit Male Female
Admitted 512 89
Rejected 313 19
, , Dept = B
Gender
Admit Male Female
Admitted 353 17
Rejected 207 8
, , Dept = C
Gender
Admit Male Female
Admitted 120 202
Rejected 205 391
, , Dept = D
Gender
Admit Male Female
Admitted 138 131
Rejected 279 244
, , Dept = E
Gender
Admit Male Female
Admitted 53 94
Rejected 138 299
, , Dept = F
Gender
Admit Male Female
Admitted 22 24
Rejected 351 317
I am pretty sure I could convert the data set into a data frame, grep by department, and sum to rebuild the tables, but I am wondering whether there is an easier way, since the data is already in exactly the format I want. I just need to handle each department's table individually.
Oh, sorry I misread your question. You are not looking for converting this into a data frame but for splitting.
You may use:
setNames(lapply(1:dim(UCBAdmissions)[3], function (i) UCBAdmissions[,,i]),
dimnames(UCBAdmissions)[[3]])
#$A
# Gender
#Admit Male Female
# Admitted 512 89
# Rejected 313 19
#
#$B
# Gender
#Admit Male Female
# Admitted 353 17
# Rejected 207 8
#
#$C
# Gender
#Admit Male Female
# Admitted 120 202
# Rejected 205 391
#
#$D
# Gender
#Admit Male Female
# Admitted 138 131
# Rejected 279 244
#
#$E
# Gender
#Admit Male Female
# Admitted 53 94
# Rejected 138 299
#
#$F
# Gender
#Admit Male Female
# Admitted 22 24
# Rejected 351 317
You can use assign in a for loop:
for (i in 1:6){assign(LETTERS[i], UCBAdmissions[,,i])}
A
# Gender
# Admit Male Female
# Admitted 512 89
# Rejected 313 19
and the same goes for B, C, D, E and F
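Keeping the six tables in one named list, rather than in six separate variables, makes it easy to apply the same calculation to every department. The sketch below rebuilds the list as in the first answer and then computes an admission rate per department; the names dept_tables and admit_rate, and the choice of statistic, are only illustrative.

```r
# UCBAdmissions ships with R (package 'datasets')
dept_names  <- dimnames(UCBAdmissions)[["Dept"]]
dept_tables <- setNames(
  lapply(seq_along(dept_names), function(i) UCBAdmissions[, , i]),
  dept_names
)

# Each element is a 2 x 2 array of counts, indexable by name
dept_tables$A["Admitted", "Male"]

# Share of applicants admitted in each department
admit_rate <- sapply(dept_tables,
                     function(tab) sum(tab["Admitted", ]) / sum(tab))
round(admit_rate, 3)
```

With assign() in a loop, the same calculation would have to be repeated for A through F by hand.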

Subsetting a data frame - Confused about syntax

Say I have the following data frame:
LungCap Age Height Smoke Gender Caesarean
1 6.475 6 62.1 no male no
2 10.125 18 74.7 yes female no
3 9.550 16 69.7 no female yes
4 11.125 14 71.0 no male no
5 4.800 5 56.9 no male no
6 6.225 11 58.7 no female no
Now I want to select all rows where the age is > 11 and gender is female. This gets me what I want:
y[y$Age>11&y$Gender=="female",]
LungCap Age Height Smoke Gender Caesarean
2 10.125 18 74.7 yes female no
3 9.550 16 69.7 no female yes
But this does not:
y[y$Age>11&y$Gender=="female"]
Age Height
1 6 62.1
2 18 74.7
3 16 69.7
4 14 71.0
5 5 56.9
6 11 58.7
I'm very new at R and I don't understand what this second query is doing, other than it's not giving me what I want.
When you subset a data frame with the first syntax, the first vector in the square brackets (numeric or logical) selects the rows, while the second (after the comma) selects the columns.
If you leave the position after the comma empty, R returns all the columns.
If you omit the comma entirely, R treats the single index as a column selector.
In your case y$Age>11&y$Gender=="female" is a logical vector that is TRUE only at positions 2 and 3. So without the comma, R thinks you want columns 2 and 3, and you get Age and Height.
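The difference is easy to see by storing the condition in a variable first. A short sketch that rebuilds the six rows from the question; keep, rows_selected and cols_selected are names chosen here for illustration.

```r
# The six rows shown in the question
y <- data.frame(
  LungCap   = c(6.475, 10.125, 9.550, 11.125, 4.800, 6.225),
  Age       = c(6, 18, 16, 14, 5, 11),
  Height    = c(62.1, 74.7, 69.7, 71.0, 56.9, 58.7),
  Smoke     = c("no", "yes", "no", "no", "no", "no"),
  Gender    = c("male", "female", "female", "male", "male", "female"),
  Caesarean = c("no", "no", "yes", "no", "no", "no")
)

# One TRUE/FALSE per row: TRUE only for rows 2 and 3
keep <- y$Age > 11 & y$Gender == "female"
keep

rows_selected <- y[keep, ]   # with comma: rows 2 and 3, all columns
cols_selected <- y[keep]     # without comma: columns 2 and 3, all rows
names(cols_selected)
```

The logical vector lines up with the columns here only because this frame happens to have six rows and six columns.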
