Merge rows depending on 2 column values - r

region age pop
SSC21184 0 209
SSC21184 1 195
SSC21184 2 242
SSC21184 3 248
SSC21185 0 231
SSC21185 1 287
SSC21185 2 268
SSC21185 3 257
I'm looking to:
group age groups (column 2) for ages <2 and >=2,
find the population for these age groups, for each region
so it should look something like this:
region age_group pop
SSC21184 <2 404
SSC21184 >=2 490
SSC21185 <2 518
SSC21185 >=2 525
I've attempted tapply(df$pop, df$agegroup, FUN = mean) %>% as.data.frame(); however, I keep getting the error: arguments must have same length
Edit: If possible, how would I be able to plot the population per age group per region? As for example, a stacked bar graph?
Thank you!

If you have only two age groups, we can use ifelse:
library(dplyr)
df %>%
  group_by(region, age = ifelse(age >= 2, '>=2', '<2')) %>%
  summarise(sum = sum(pop))
# region   age    sum
# <chr>    <chr>  <int>
#1 SSC21184 <2     404
#2 SSC21184 >=2    490
#3 SSC21185 <2     518
#4 SSC21185 >=2    525
A more general solution would be cut if you have a large number of age groups.
df %>%
  group_by(region, age = cut(age, breaks = c(-Inf, 1, Inf),
                             labels = c('<2', '>=2'))) %>%
  summarise(sum = sum(pop))
We can use the same logic in tapply as well.
with(df, tapply(pop, list(region, ifelse(age >=2, '>=2', '<2')), sum))
# <2 >=2
#SSC21184 404 490
#SSC21185 518 525
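The edit in the question (plotting population per age group per region, e.g. a stacked bar chart) isn't covered above; here is a minimal ggplot2 sketch building on the dplyr answer (it assumes dplyr >= 1.0 for the `.groups` argument and that ggplot2 is available):

```r
library(dplyr)
library(ggplot2)

df <- data.frame(
  region = rep(c("SSC21184", "SSC21185"), each = 4),
  age    = rep(0:3, times = 2),
  pop    = c(209, 195, 242, 248, 231, 287, 268, 257)
)

summarised <- df %>%
  group_by(region, age_group = ifelse(age >= 2, ">=2", "<2")) %>%
  summarise(pop = sum(pop), .groups = "drop")

# One bar per region, stacked by age group
p <- ggplot(summarised, aes(x = region, y = pop, fill = age_group)) +
  geom_col()
p
```

geom_col() stacks by default, so each region's bar is split into its <2 and >=2 populations.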


How to rearrange/tidy column in R data frame?

Suppose we have this data set, where avg_1, avg_2 and avg_3 repeat themselves:
avg_1 avg_2 avg_3 party_gender
424 242 213 RM
424 242 213 RF
424 242 213 DM
How can I edit this using R so that the data set looks like this (where the avg values aren't repeated, and avg_1, avg_2 and avg_3 correspond to RM, RF and DM respectively):
avg party_gender
424 RM
242 RF
213 DM
Admittedly, this is a bit hacky and doesn't work nicely if you have more than just a few conditions for the avg. value:
library(tidyverse)
dat %>%
  pivot_longer(-party_gender) %>%
  filter(party_gender == "RM" & value == 424 |
         party_gender == "RF" & value == 242 |
         party_gender == "DM" & value == 213) %>%
  mutate(name = "avg") %>%
  pivot_wider()
which gives:
# A tibble: 3 x 2
party_gender avg
<chr> <dbl>
1 RM 424
2 RF 242
3 DM 213
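Since avg_1, avg_2 and avg_3 correspond positionally to the rows RM, RF and DM, a base R sketch is also possible; note that it relies entirely on that positional correspondence holding:

```r
dat <- data.frame(
  avg_1 = c(424, 424, 424),
  avg_2 = c(242, 242, 242),
  avg_3 = c(213, 213, 213),
  party_gender = c("RM", "RF", "DM"),
  stringsAsFactors = FALSE
)

# avg_i belongs with the i-th row, so pair the first row's
# avg_1..avg_3 values with party_gender in order
res <- data.frame(
  avg = unlist(dat[1, c("avg_1", "avg_2", "avg_3")], use.names = FALSE),
  party_gender = dat$party_gender,
  stringsAsFactors = FALSE
)
res
```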

How to get 3 lists with no duplicates in a random sampling? (R)

I have done the first step:
how many persons have more than 1 point
how many persons have more than 3 points
how many persons have more than 6 points
My goal:
I need to have random samples (with no duplicates of persons)
of 3 persons that have more than 1 point
of 3 persons that have more than 3 points
of 3 persons that have more than 6 points
My dataset looks like this:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
345 nn 7
345 nn NA
490 kk 1
490 kk NA
490 kk 2
491 ww 1
491 ww 1
489 tt 1
489 tt 1
325 ll 1
325 ll 1
325 ll NA
That is what I have already tried to code, here is an example of code for finding persons that have more than 1 point:
persons_filtered <- dataset %>%
  group_by(person) %>%
  dplyr::filter(sum(points, na.rm = TRUE) > 1) %>%
  distinct(person) %>%
  pull()
persons_filtered
more_than_1 <- sample(persons_filtered, size = 3)
Question:
How can I write this code better so that I end up with 3 lists of unique persons? (I need to prevent the same person from appearing in more than one list.)
Here's a tidyverse solution, where the sampling in the three categories of interest is done at the same time.
library(tidyverse)
dataset %>%
  # Group by person
  group_by(person) %>%
  # Get each person's points sum
  summarize(sum_points = sum(points, na.rm = TRUE)) %>%
  # Classify the sums into categories defined by breaks: (0,1], (1,3], (3,6], (6,Inf]
  # Inf as the last break means every sum above 6 lands in (6,Inf]
  mutate(point_class = cut(sum_points, breaks = c(0, 1, 3, 6, Inf))) %>%
  # ungroup
  ungroup() %>%
  # group by point class
  group_by(point_class) %>%
  # Sample 3 rows per point_class
  sample_n(size = 3) %>%
  # Eliminate the sum_points column
  select(-sum_points) %>%
  # If you need this data in lists you can nest the results in the sampled_data column
  nest(sampled_data = -point_class)
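If the end result should literally be three lists of persons, split() can replace the nest() step; here is a sketch, using a hypothetical `sampled` data frame shaped like the pipeline's output just before nest():

```r
# Hypothetical result of the pipeline above, just before the nest() step
sampled <- data.frame(
  person      = c("kk", "ww", "tt", "kt", "rr", "jk", "knm3", "kll2", "kll"),
  point_class = rep(c("(1,3]", "(3,6]", "(6,Inf]"), each = 3),
  stringsAsFactors = FALSE
)

# One character vector of persons per point class; each person falls in
# exactly one class, so no person appears in more than one list
person_lists <- split(sampled$person, sampled$point_class)
```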

function for both mean and Sd based on categorical variable in the data frame

I have 30 patients with 100 clinical variables each, such as weight, BMI, waist, etc., and I want to take the mean and SD for all patients based on their disease status. For example, my data set looks like:
Patient_id DateOfBirth Sex Weight1 Bmi1 Wasit1 Disease
204065 25-06-1995 Female 113.8 41.3 105.8 0
200214 09-12-1990 Female 90 35.6 108 1
191633 14-09-1971 Male 128.4 47 150 1
186156 22-09-1967 Male 157.3 51.4 145.6 0
and I want output based on their disease status like:
Disease weight1Mean Weight1SD BMI1Mean BMI1SD Waist1Mean WaistSD
0 135 30.7 46.3 7.14 125.7 28.1
1 109 27.1 41.3 8.06 129 29.7
your_df %>%
  group_by(Disease) %>%
  summarize(Weight1Mean = mean(Weight1),
            Weight1SD = sd(Weight1)
            # Repeat for the rest of the variables to summarize
  )
You can also use summarize_at in place of summarize:
#... %>%
summarize_at(vars(Weight1, BMI1, Waist1), list(Mean = mean, SD = sd))
Or summarize_if:
#... %>%
summarize_if(is.numeric, list(Mean = mean, SD = sd))
If you have numeric variables you want to exclude from summarization, you can recode them as factors or drop them with select.
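For instance, Patient_id is numeric but shouldn't be averaged; this sketch (using made-up numbers from the question's example) drops it with select() before summarize_if():

```r
library(dplyr)

your_df <- data.frame(
  Patient_id = c(204065, 200214, 191633, 186156),
  Weight1    = c(113.8, 90, 128.4, 157.3),
  Bmi1       = c(41.3, 35.6, 47, 51.4),
  Waist1     = c(105.8, 108, 150, 145.6),
  Disease    = c(0, 1, 1, 0)
)

res <- your_df %>%
  select(-Patient_id) %>%   # drop the numeric id before summarising
  group_by(Disease) %>%
  summarize_if(is.numeric, list(Mean = mean, SD = sd))
res
```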
We can use data.table
library(data.table)
setDT(df1)[, .(Weight1Mean = mean(Weight1), Weight1SD = sd(Weight1)), Disease]

How to find totals of different categories in same dataset?

I'm a student doing exploratory analysis/data vis with this hate crime data set. I am trying to create a matrix of the different categories (i.e. race, religion, etc.) from my dataset (hate_crime) during 2009 and 2017. The full dataset can be found here.
I extracted the necessary data (incidents during 2009 or 2017) from the existing data.
SecondYear_OTYear <- hate_crime %>% filter(hate_crime$DATA_YEAR == "2017" | hate_crime$DATA_YEAR == "2009")
Then, I just made different subsets for each subcategory in the category. For example, to create subsets of bias descriptions I made the following:
antiWhiteSubset <- SecondYear_OTYear[grep("Anti-White", SecondYear_OTYear$BIAS_DESC), ]
antiWhite17 <- nrow(antiWhiteSubset[antiWhiteSubset$DATA_YEAR == "2017", ])
antiWhite09 <- nrow(antiWhiteSubset[antiWhiteSubset$DATA_YEAR == "2009", ])
antiBlackSubset <- SecondYear_OTYear[grep("Anti-Black", SecondYear_OTYear$BIAS_DESC), ]
antiBlack17 <- nrow(antiBlackSubset[antiBlackSubset$DATA_YEAR == "2017", ])
antiBlack09 <- nrow(antiBlackSubset[antiBlackSubset$DATA_YEAR == "2009", ])
antiLatinoSubset <- SecondYear_OTYear[grep("Anti-Hispanic", SecondYear_OTYear$BIAS_DESC), ]
antiLatino17 <- nrow(antiLatinoSubset[antiLatinoSubset$DATA_YEAR == "2017", ])
antiLatino09 <- nrow(antiLatinoSubset[antiLatinoSubset$DATA_YEAR == "2009", ])
And, I proceeded to do all of the different bias descriptions with the same structure. Then, I created a matrix of the totals to create varying bar plots, mosaic plots, or chi-square analysis, such as the following:
Bar plot of Hate Crime Incidents by Bias Descriptions:
However, I feel like there is a more efficient way to code for the different subsets... I'm open to any suggestions! Thank you so much.
You can use dplyr to filter the data and ggplot2::geom_bar to summarize counts.
hc_small = hate_crime %>% filter(DATA_YEAR %in% c(2009, 2017))
top_5 = hc_small %>% count(BIAS_DESC, sort=TRUE) %>% pull(BIAS_DESC) %>% head(5)
hc_5 = hc_small %>% filter(BIAS_DESC %in% top_5)
ggplot(hc_5, aes(BIAS_DESC, fill = BIAS_DESC)) +
  geom_bar() +
  facet_wrap(~DATA_YEAR) +
  coord_flip() +
  theme_minimal() +
  guides(fill = 'none')
To aggregate across phrases as in the original question, I did
anti <- hate_crime %>%
  filter(DATA_YEAR %in% c("2009", "2017")) %>%
  mutate(
    ANTI_WHITE = grepl("Anti-White", BIAS_DESC),
    ANTI_BLACK = grepl("Anti-Black", BIAS_DESC),
    ANTI_HISPANIC = grepl("Anti-Hispanic", BIAS_DESC)
  ) %>%
  select(DATA_YEAR, starts_with("ANTI"))
I then created the counts of each occurrence with group_by() and summarize_all() (noting that the sum() of a logical vector is the number of TRUE occurrences), and used pivot_longer() to create a 'tidy' summary
anti %>%
  group_by(DATA_YEAR) %>%
  summarize_all(~ sum(.)) %>%
  tidyr::pivot_longer(starts_with("ANTI"), names_to = "BIAS", values_to = "COUNT")
The result is something like (there were errors importing the data with read_csv() that I did not investigate)
# A tibble: 6 x 3
DATA_YEAR BIAS COUNT
<dbl> <chr> <int>
1 2009 ANTI_WHITE 539
2 2009 ANTI_BLACK 2300
3 2009 ANTI_HISPANIC 486
4 2017 ANTI_WHITE 722
5 2017 ANTI_BLACK 2101
6 2017 ANTI_HISPANIC 444
Visualization seems like a second, separate, question.
The code can be made a little simpler by defining a function
n_with_bias <- function(x, bias)
sum(grepl(bias, x))
and then avoiding the need to separately mutate the data
hate_crime %>%
  filter(DATA_YEAR %in% c("2009", "2017")) %>%
  group_by(DATA_YEAR) %>%
  summarize(
    ANTI_WHITE = n_with_bias(BIAS_DESC, "Anti-White"),
    ANTI_BLACK = n_with_bias(BIAS_DESC, "Anti-Black"),
    ANTI_HISPANIC = n_with_bias(BIAS_DESC, "Anti-Hispanic")
  ) %>%
  tidyr::pivot_longer(starts_with("ANTI"), names_to = "BIAS", values_to = "N")
On the other hand, a base R approach might create vectors for years-of-interest and all biases (using strsplit() to isolate the components of the compound biases)
years <- c("2009", "2017")
biases <- unique(unlist(strsplit(hate_crime$BIAS_DESC, ";")))
then create vectors of biases in each year of interest
bias_by_year <- split(hate_crime$BIAS_DESC, hate_crime$DATA_YEAR)[years]
and iterate over each year and bias (nested iterations can be inefficient with a large number of elements, e.g. tens of thousands, but that's not a concern here)
sapply(bias_by_year, function(bias) sapply(biases, n_with_bias, x = bias))
The result is a classic data.frame with all biases in each year
2009 2017
Anti-Black or African American 2300 2101
Anti-White 539 722
Anti-Jewish 932 983
Anti-Arab 0 106
Anti-Protestant 38 42
Anti-Other Religion 111 85
Anti-Islamic (Muslim) 0 0
Anti-Gay (Male) 0 0
Anti-Asian 128 133
Anti-Catholic 52 72
Anti-Heterosexual 21 33
Anti-Hispanic or Latino 486 444
Anti-Other Race/Ethnicity/Ancestry 296 280
Anti-Multiple Religions, Group 48 52
Anti-Multiple Races, Group 180 202
Anti-Lesbian (Female) 0 0
Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group) 0 0
Anti-American Indian or Alaska Native 68 244
Anti-Atheism/Agnosticism 10 6
Anti-Bisexual 24 24
Anti-Physical Disability 24 66
Anti-Mental Disability 70 89
Anti-Gender Non-Conforming 0 13
Anti-Female 0 48
Anti-Transgender 0 117
Anti-Native Hawaiian or Other Pacific Islander 0 15
Anti-Male 0 25
Anti-Jehovah's Witness 0 7
Anti-Mormon 0 12
Anti-Buddhist 0 15
Anti-Sikh 0 18
Anti-Other Christian 0 24
Anti-Hindu 0 10
Anti-Eastern Orthodox (Russian, Greek, Other) 0 0
Unknown (offender's motivation not known) 0 0
This avoids the need to enter each bias in the summarize() step. I'm not sure how to do that computation in a readable tidy-style analysis.
Note that in the table above any bias with a ( has zeros in both years. This is because grepl() treats ( in the bias as a grouping symbol; fix this by adding fixed = TRUE
n_with_bias <- function(x, bias)
sum(grepl(bias, x, fixed = TRUE))
and an updated result
2009 2017
Anti-Black or African American 2300 2101
Anti-White 539 722
Anti-Jewish 932 983
Anti-Arab 0 106
Anti-Protestant 38 42
Anti-Other Religion 111 85
Anti-Islamic (Muslim) 107 284
Anti-Gay (Male) 688 692
Anti-Asian 128 133
Anti-Catholic 52 72
Anti-Heterosexual 21 33
Anti-Hispanic or Latino 486 444
Anti-Other Race/Ethnicity/Ancestry 296 280
Anti-Multiple Religions, Group 48 52
Anti-Multiple Races, Group 180 202
Anti-Lesbian (Female) 186 133
Anti-Lesbian, Gay, Bisexual, or Transgender (Mixed Group) 311 287
Anti-American Indian or Alaska Native 68 244
Anti-Atheism/Agnosticism 10 6
Anti-Bisexual 24 24
Anti-Physical Disability 24 66
Anti-Mental Disability 70 89
Anti-Gender Non-Conforming 0 13
Anti-Female 0 48
Anti-Transgender 0 117
Anti-Native Hawaiian or Other Pacific Islander 0 15
Anti-Male 0 25
Anti-Jehovah's Witness 0 7
Anti-Mormon 0 12
Anti-Buddhist 0 15
Anti-Sikh 0 18
Anti-Other Christian 0 24
Anti-Hindu 0 10
Anti-Eastern Orthodox (Russian, Greek, Other) 0 22
Unknown (offender's motivation not known) 0 0
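The "readable tidy-style" computation mentioned above can be sketched with tidyr's separate_rows(), which splits each compound BIAS_DESC into one row per bias before counting (shown on a toy stand-in for hate_crime; assumes tidyr >= 1.1 for the scalar values_fill):

```r
library(dplyr)
library(tidyr)

# Toy stand-in for hate_crime, including one compound BIAS_DESC
hate_crime_toy <- data.frame(
  DATA_YEAR = c("2009", "2009", "2017", "2017"),
  BIAS_DESC = c("Anti-White",
                "Anti-Black or African American;Anti-White",
                "Anti-Jewish",
                "Anti-White"),
  stringsAsFactors = FALSE
)

counts <- hate_crime_toy %>%
  filter(DATA_YEAR %in% c("2009", "2017")) %>%
  separate_rows(BIAS_DESC, sep = ";") %>%        # one row per individual bias
  count(BIAS_DESC, DATA_YEAR) %>%
  pivot_wider(names_from = DATA_YEAR, values_from = n, values_fill = 0)
counts
```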

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only the first rows where the values of 'X_POSITION' are increasing. I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and I'm also not sure how to make sure it only adds the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
  group_by(TRIAL_INDEX) %>%
  filter(dplyr::lag(dat$X_POSITION, 1) > dat$X_POSITION) %>%
  summarise(FIRST_PASS_TIME = sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with dplyr package:
library(dplyr)
dat %>%
  group_by(TRIAL_INDEX) %>%
  mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
  mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
  select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- data_frame(TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                 DURATION = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
                 X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
  group_by(TRIAL_INDEX) %>%
  mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
         x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
  filter(x.increasing == TRUE) %>%
  summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122
