Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I have a data frame of demographic data in R:
Name  Region  Gender
A     1       F
B     2       M
C     1       F
D     1       M
E     2       M
I want to calculate gender ratio for every region. Output should look like:
Region  GenderRatio
1       0.67
2       0.50
This can be calculated by hand with ordinary arithmetic (BODMAS). Is there an efficient way to calculate it in R?
You can use the dplyr package in R for all sorts of data manipulation. See the dplyr documentation to learn more about it and other extremely useful R packages.
An example:
First I create some sample data. (I changed it a little bit to actually have a gender ratio that fits your output.)
df <- data.frame(name = c("A", "B", "C", "D", "E"),
region = c(1,2,1,1,2),
gender = c("F", "M", "F", "M", "F"))
Now we can calculate gender_ratio and summarise the data. mutate creates the new variable gender_ratio. The first group_by ensures the ratio is calculated per region; the second group_by, together with an empty summarise, then collapses the result to one row per region and ratio.
library(dplyr)
df %>%
  group_by(region) %>%
  mutate(gender_ratio = sum(gender == "F") / length(gender)) %>%
  group_by(region, gender_ratio) %>%
  summarise()
Output is:
region gender_ratio
<dbl> <dbl>
1 1 0.667
2 2 0.5
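The same result can probably be reached in a single step, since the mean of a logical vector is the share of TRUE values. A minimal sketch, assuming the sample data defined above (the name ratio_by_region is only illustrative):

```r
library(dplyr)

# Same sample data as in the answer above
df <- data.frame(name = c("A", "B", "C", "D", "E"),
                 region = c(1, 2, 1, 1, 2),
                 gender = c("F", "M", "F", "M", "F"))

# mean() of a logical vector gives the proportion of TRUE values,
# so this is the share of "F" within each region
ratio_by_region <- df %>%
  group_by(region) %>%
  summarise(gender_ratio = mean(gender == "F"))

ratio_by_region
```

This avoids the intermediate mutate and the second group_by, at the cost of being slightly less explicit about the division.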
Hope this helps.
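As a base R alternative, you can use by with prop.table(table(...)) to return the female and male fractions for each region. Here df is the question's original data, with columns Region and Gender; the output below assumes Gender is a factor.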
with(df, by(df, Region, function(x) prop.table(table(x$Gender))))
#Region: 1
#
#        F         M
#0.6666667 0.3333333
#------------------------------------------------------------
#Region: 2
#
#F M
#0 1
Or to return only the male fraction
with(df, by(df, Region, function(x) prop.table(table(x$Gender))[2]))
#Region: 1
#[1] 0.3333333
#------------------------------------------------------------
#Region: 2
#[1] 1
Or to store male fraction and region in a data.frame simply stack the above result:
setNames(
stack(with(df, by(df, Region, function(x) prop.table(table(x$Gender))[2]))),
c("GenderRatio", "Region"))
# GenderRatio Region
#1 0.3333333 1
#2 1.0000000 2
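Note that [2] picks the male fraction by position, which only works if every region's table has both levels. Since R 4.0.0, data.frame() keeps strings as character vectors by default, so a region containing only one gender would yield a one-element table. A hedged sketch that compares against "M" directly instead, assuming the question's original data (male_share and result are illustrative names):

```r
# The question's original data
df <- data.frame(Name = c("A", "B", "C", "D", "E"),
                 Region = c(1, 2, 1, 1, 2),
                 Gender = c("F", "M", "F", "M", "M"))

# Proportion of "M" per region; this does not depend on the
# order or presence of factor levels
male_share <- tapply(df$Gender == "M", df$Region, mean)

result <- data.frame(Region = names(male_share),
                     GenderRatio = as.vector(male_share))
result
```

Swapping "M" for "F" gives the female fraction in the same way.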
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
This question was caused by a typo or a problem that can no longer be reproduced. While similar questions may be on-topic here, this one was resolved in a way less likely to help future readers.
Closed 1 year ago.
Here's my problem: dplyr's group_by doesn't work on one data.frame, but it works well on another. The problematic data frame is imported from an SPSS file with the foreign package. When I execute this:
d_summarised <- d %>%
group_by(group) %>%
summarise(Sex = (sum(d$GENRE == "F", na.rm = TRUE))/sum(!is.na(d$GENRE))) %>%
select(Sex, group)
The result is calculated on the whole sample rather than by group, so every group gets the same value, which is not what I expected.
# A tibble: 6 x 2
group Sex
* <fct> <dbl>
1 group1 0.626
2 group2 0.626
3 group3 0.626
4 group4 0.626
5 group5 0.626
6 NA 0.626
But in the same session, with the same packages loaded, this works:
dat <- data.frame(x=c(1,2,3,3,2,1), y=c(15,24,54,65,82,65))
dat %>%
group_by(x) %>%
summarise(mean(y))
Here's the result:
# A tibble: 3 x 2
x `mean(y)`
* <dbl> <dbl>
1 1 40
2 2 53
3 3 59.5
plyr is not loaded, only dplyr. How is that possible?
The issue is that d$ bypasses the grouping. Use the bare column names instead and it should work:
library(dplyr)
d %>%
group_by(group) %>%
summarise(Sex = (sum(GENRE == "F", na.rm = TRUE))/sum(!is.na(GENRE))) %>%
select(Sex, group)
NOTE: d$GENRE selects the whole column from the dataset rather than only the elements within the current group.
In the second case, the OP applied mean directly to 'y' rather than to dat$y, which is why it worked. In other words, the cause is not the data structure (data.frame vs. tibble) but extracting the whole column with $.
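The difference can be reproduced on a small made-up data frame. This is only an illustrative sketch: the data below are hypothetical and stand in for the OP's SPSS import, which is not available here.

```r
library(dplyr)

# Hypothetical stand-in for the OP's data
d <- data.frame(group = c("g1", "g1", "g2", "g2"),
                GENRE = c("F", "F", "M", "F"))

# d$GENRE ignores the grouping: every group gets the overall share
whole_sample <- d %>%
  group_by(group) %>%
  summarise(Sex = mean(d$GENRE == "F"))

# The bare column name is evaluated within each group
per_group <- d %>%
  group_by(group) %>%
  summarise(Sex = mean(GENRE == "F"))

whole_sample$Sex  # the same value repeated for g1 and g2
per_group$Sex     # one value per group
```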
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Goal: to get new row names when using base R order() function (as is done with dplyr::arrange()). The rownames/index output for the base R call is 3, 1, 2 as seen below whereas the output for arrange() is 1, 2, 3 (seen below). How can I get 1, 2, 3 using base R order()?
Reprex:
library(dplyr)
df <- data.frame(
company = c("A", "B", "C"),
sales = c(100, 200, 50)
)
# base R:
df[order(df$sales),]
# dplyr:
arrange(df, sales)
# Base R output:
## company sales
## 3 C 50
## 1 A 100
## 2 B 200
# dplyr output:
## company sales
## 1 C 50
## 2 A 100
## 3 B 200
If your goal is for the row numbers after using arrange() to match what you get from order(), then do the following (a few extra dplyr and tibble steps).
library(dplyr)
library(tibble)
df %>%
rownames_to_column() %>%
arrange(sales) %>%
column_to_rownames("rowname")
company sales
3 C 50
1 A 100
2 B 200
If your goal is to get the same row names as arrange() produces, you can reassign the row names after using order().
df_new <- df[order(df$sales),]
rownames(df_new) <- 1:nrow(df_new)
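A slightly shorter variant, sketched here with the reprex data, is to assign NULL to the row names, which makes R fall back to the automatic 1, 2, 3, ... numbering. It also behaves sensibly on a zero-row data frame, where 1:nrow(df_new) would evaluate to c(1, 0).

```r
df <- data.frame(
  company = c("A", "B", "C"),
  sales = c(100, 200, 50)
)

df_new <- df[order(df$sales), ]

# Dropping the row names resets them to the default sequence
rownames(df_new) <- NULL
df_new
```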
It may be good practice to create an ID column instead of relying on row names. Numbered IDs usually correspond to the original data, but you can of course create them after the ordering operation.
df_new <- df[order(df$sales),]
df_new$id <- 1:nrow(df_new)
Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.
Closed 2 years ago.
Creating factor levels in a dataset with NAs works for individual columns, but I need to iterate across many more columns (all start with 'impact.') and have struck a problem inside a dplyr mutate(across)
What am I doing wrong?
Reprex below
library(tibble)  # tribble() is provided by the tibble package
library(dplyr)
df <- tribble(~id, ~tumour, ~impact.chemo, ~impact.radio,
1,'lung',NA,1,
2,'lung',1,NA,
3,'lung',2,3,
4,'meso',3,4,
5,'lung',4,5)
# Factor labels
trt_labels <- c('Planned', 'Modified', 'Interrupted', 'Deferred', "Omitted")
# Factor levels should map to labels as follows, retaining NAs where present:
data.frame(level = 1:5,
label = trt_labels)
# Create factor works for individual columns
factor(df$impact.chemo, levels = 1:5, labels = trt_labels)
factor(df$impact.radio, levels = 1:5, labels = trt_labels)
# But fails inside mutate(across)
df %>%
mutate(across(.cols = starts_with('impact'), ~factor(levels = 1:5, labels = trt_labels)))
Just making #27ϕ9's comment an answer: the purrr-style lambda you specified inside across is not correct because factor is never given its first argument, the object it should operate on (in this case, each data frame column selected by across).
To fix the issue, insert .x inside the lambda. .x stands for the lambda's argument, so ~ factor(.x, ...) is shorthand for function(x) factor(x, ...). See the purrr documentation for more info about purrr-style lambda functions.
df %>%
mutate(across(.cols = starts_with('impact'), ~factor(.x, levels = 1:5, labels = trt_labels)))
# A tibble: 5 x 4
# id tumour impact.chemo impact.radio
# <dbl> <chr> <fct> <fct>
# 1 1 lung NA Planned
# 2 2 lung Planned NA
# 3 3 lung Modified Interrupted
# 4 4 meso Interrupted Deferred
# 5 5 lung Deferred Omitted
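To make the role of .x more concrete, the same call can be written with an ordinary anonymous function. A small sketch using the reprex data; the argument name col and the object names with_lambda and with_function are arbitrary choices for illustration:

```r
library(dplyr)
library(tibble)

df <- tribble(~id, ~tumour, ~impact.chemo, ~impact.radio,
              1, 'lung', NA, 1,
              2, 'lung', 1, NA,
              3, 'lung', 2, 3,
              4, 'meso', 3, 4,
              5, 'lung', 4, 5)

trt_labels <- c('Planned', 'Modified', 'Interrupted', 'Deferred', 'Omitted')

# purrr-style lambda: .x is the column currently being processed
with_lambda <- df %>%
  mutate(across(starts_with('impact'),
                ~ factor(.x, levels = 1:5, labels = trt_labels)))

# Equivalent explicit function: col plays the role of .x
with_function <- df %>%
  mutate(across(starts_with('impact'),
                function(col) factor(col, levels = 1:5, labels = trt_labels)))

# Both versions should produce the same factor columns
identical(with_lambda$impact.chemo, with_function$impact.chemo)
identical(with_lambda$impact.radio, with_function$impact.radio)
```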
This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 5 years ago.
I was wondering how I might simply split a numerical column by a second grouping variable in a dataset, then cbind the resulting columns. This would most likely be a simple extension of the separate function from tidyr. For example, changing X below:
Y <- rbind(2,5,3,6,3,2)
Z <- rbind("A", "A", "A", "B", "B", "B")
X <- data.frame(Y,Z)
Into
A B
2 6
5 3
3 2
Then ideally extract the rowMeans into a new vector. (An issue also arises here when there is only one unique value in Z, given that rowMeans requires an input with at least two dimensions.)
This would need to be infinitely expandable based on the number of unique variables in Z. e.g., if Z had A, B, and C, then the final data.frame would require 3 columns. This would allow me to capture the row means from infinite number of groups in Z.
Thanks in advance,
Conal
Looks like a job for tidyr::spread.
library(dplyr)
library(tidyr)
X2 <- X %>%
group_by(Z) %>%
mutate(ID = 1:n()) %>%
spread(Z, Y) %>%
select(-ID)
X2
# A tibble: 3 x 2
A B
* <dbl> <dbl>
1 2 6
2 5 3
3 3 2
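In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(). The sketch below shows how the same reshaping might look with it, followed by the rowMeans step the question asks for. It builds X with plain vectors rather than rbind, and X_wide and row_means are illustrative names.

```r
library(dplyr)
library(tidyr)

X <- data.frame(Y = c(2, 5, 3, 6, 3, 2),
                Z = c("A", "A", "A", "B", "B", "B"))

X_wide <- X %>%
  group_by(Z) %>%
  mutate(ID = row_number()) %>%   # position within each group
  ungroup() %>%
  pivot_wider(names_from = Z, values_from = Y) %>%
  select(-ID)

# One mean per row, across however many groups Z contains
row_means <- rowMeans(X_wide)
row_means
```

Because X_wide stays a data frame even when Z has a single unique value, rowMeans should still work in that case.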
This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 5 years ago.
I have what I know must be a simple answer but I can't seem to figure it out.
Suppose I have a dataset:
id <- c(1,1,1,2,2,3,3,4,4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test <- c(12,16, NA, 11, 15,NA, 0,12, 5)
df <- data.frame(id,visit,test)
And I want to know the number of data points per visit so that the final output looks something like this:
visit test
A 3
B 3
C 1
How would I go about doing this? I've tried using table
table(df$visit, df$test)
but I get a full grid of counts for every combination of visit and test value.
I can sum each row by doing this:
sum(table(df$visit, df$test)[1,])
sum(table(df$visit, df$test)[2,])
sum(table(df$visit, df$test)[3,])
But I feel like there is an easier way and I'm missing it! Any help would be greatly appreciated!
Base R's aggregate would be ideal for this. Group id by visit and count the length, removing the rows where test is NA with !is.na() before counting.
aggregate(x = df$id[!is.na(df$test)], by = list(df$visit[!is.na(df$test)]), FUN = length)
# Group.1 x
#1 A 3
#2 B 3
#3 C 1
How about:
data.frame(rowSums(table(df$visit, df$test)))
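Another base R option, sketched here as one possibility, is to count the non-missing test values per visit with tapply. The column names of the result are chosen to match the desired output in the question; counts and result are illustrative names.

```r
id <- c(1, 1, 1, 2, 2, 3, 3, 4, 4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test <- c(12, 16, NA, 11, 15, NA, 0, 12, 5)
df <- data.frame(id, visit, test)

# !is.na(test) is TRUE for each data point that is present;
# summing the logical vector per visit counts them
counts <- tapply(!is.na(df$test), df$visit, sum)

result <- data.frame(visit = names(counts), test = as.vector(counts))
result
```

Unlike the table approach, this does not build the full visit-by-value grid first.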