How to count and group categorical data by range in R

I have data from a questionnaire that includes a column for year of birth. The range of the data was too large, so my plot became confusing. I'm now trying to group the years by decade and then chart them, but I don't know how to group them.
My data looks like:
birth_year <- data.frame("years"=c(
"1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
"1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
"1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
"1920","1931","1977","1999","1967","1992","1998","1984"
))
and my plot is like:
However, I want my data by group as:
birth_year count
(1920-1930]: 5
(1931-1940]: 8
(1941-1950]: 4
(1951-1960]: 3
(1961-1970]: 5
(1971-1980]: 5
(1981-1990]: 5
(1991-2000]: 9
and then plot as a range group.

We can use cut() to group the data, and then plot with ggplot().
birth_year <- data.frame("years"=c(
"1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
"1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
"1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
"1920","1931","1977","1999","1967","1992","1998","1984"
))
birth_year$yearGroup <- cut(as.integer(birth_year$years),breaks = 8,dig.lab = 4,
include.lowest = FALSE)
library(ggplot2)
ggplot(birth_year,aes(x = yearGroup)) + geom_bar()
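If the bins should fall exactly on decade boundaries, as in the desired output in the question, one option is to pass explicit break points to cut() instead of breaks = 8. The following is a minimal base R sketch under that assumption; the names years, decade, and decade_counts are only illustrative and do not come from the original answer.

```r
# Same years as in the question, converted to integers
years <- as.integer(c(
  "1920","1923","1930","1940","1932","1935","1942","1944","1952","1956","1996","1961",
  "1962","1966","1978","1987","1998","1999","1967","1934","1945","1988","1976","1978",
  "1951","1986","1942","1999","1935","1920","1933","1987","1998","1999","1931","1977",
  "1920","1931","1977","1999","1967","1992","1998","1984"
))

# Explicit decade boundaries: 1920, 1930, ..., 2000
decade <- cut(years,
              breaks = seq(1920, 2000, by = 10),
              include.lowest = TRUE,  # keep 1920 in the first bin
              dig.lab = 4)            # print years with four digits

# Number of respondents per decade
decade_counts <- table(decade)
print(decade_counts)
```

The resulting factor can be stored as a column (for example birth_year$yearGroup) and passed to geom_bar() in the same way as in the answer above.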

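Another option is cut_width() from ggplot2 together with dplyr:
library(dplyr)
library(ggplot2) # provides cut_width()
birth_year %>%
mutate(val=cut_width(as.numeric(years),10,boundary = 1920, dig.lab=-1))%>%
count(val)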
val n
1 [1920,1930] 5
2 (1930,1940] 8
3 (1940,1950] 4
4 (1950,1960] 3
5 (1960,1970] 5
6 (1970,1980] 5
7 (1980,1990] 5
8 (1990,2000] 9

Related

Create dummy variable with survey package

I want to transform a variable into a dummy using the survey package.
I have a complex sample design defined by:
library(survey)
prestratified_design <- svydesign(
id = ~ PSU ,
strata = ~ STRAT,
data = data,
weights = ~ weight ,
nest = TRUE)
The dataset has a variable for education with 8 different categories:
# A tibble: 8 x 3
education n prop
<int> <int> <dbl>
1 1 2919 20.8
2 2 5551 39.5
3 3 447 3.18
4 4 484 3.45
5 5 3719 26.5
6 6 91 0.65
7 9 790 5.63
8 10 39 0.28
I want to create a dummy variable for categories 5 & 10 == 1 and others == 0.
I know that I have to use the update function, but I don't know how to use if in the survey package.
I have tried:
prestratified_design <- update(
prestratified_design,
dummy_educ = as.numeric (education == 5 & education == 10)
but it obviously didn't work.
thank you!
You can create dummy variables in R via ifelse() if the number of categories is two.
df$dummy_educ = with(df, ifelse(education == 5 | education == 10, 1, 0))
If there are more than two categories, you can use dplyr::case_when(); if you are creating dummies from a factor variable, model.matrix() is fast and convenient.
For the new variable to take the complex design into account, you don't need to update your data set (data in your example). Instead, update the survey design object by adding the new variable with update(), for which the survey package provides a method for design objects.
Following your example, try with the code below:
prestratified_design <- update(prestratified_design,
dummy_educ = as.integer(education == 5 | education == 10))
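The difference between the attempt in the question and the corrected call comes down to & versus |. Here is a small base R sketch on a plain vector, without the survey design, using the education codes from the question's table; the variable names dummy_and, dummy_or, and dummy_in are made up for illustration.

```r
# Education codes as listed in the question's table
education <- c(1, 2, 3, 4, 5, 6, 9, 10)

# A single value can never be 5 AND 10 at once, so this is always 0
dummy_and <- as.integer(education == 5 & education == 10)

# Either condition with |, or equivalently with %in%
dummy_or <- as.integer(education == 5 | education == 10)
dummy_in <- as.integer(education %in% c(5, 10))

print(rbind(education, dummy_and, dummy_or, dummy_in))
```

The %in% form tends to scale better if more categories need to be mapped to 1 later.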
Good luck with that!

Ridge plot: sort by value / rank

I have a data set which I uploaded here as a gist in CSV format.
It is the extracted form of the PDFs provided in the YouGov article "How good is 'good'?". People were asked to rate words (e.g. "perfect", "bad") with a score between 0 (very negative) and 10 (very positive). The gist contains exactly that data, i.e. for every word (column: Word) it stores for every rating from 0 to 10 (column: Category) the number of votes (column: Total).
I would usually try to visualize the data with matplotlib and Python since I lack knowledge in R, but it seems that ggridges can create way nicer plots than I see myself doing with Python.
Using:
library(readr) # provides read_csv()
library(ggplot2)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov, aes(x=Category, y=Word, height = Total, group = Word, fill=Word)) +
geom_density_ridges(stat = "identity", scale = 3)
I was able to create this plot (which is still far from perfect):
Ignoring the fact that I have to tweak the aesthetics, there are three things I struggle to do:
Sort the words by their average rank.
Color the ridge by the average rank.
Or color the ridge by the category value, i.e. with varying color.
I tried to adapt the suggestions from this source, but ultimately failed because my data seems to be in the wrong format: Instead of having single instances of votes, I already have the aggregated vote count for each category.
I hope to end up with a result closer to this plot, which satisfies criteria 3 (source):
It took me a little while to get there myself. The key for me was understanding the data and how to order Word based on the average Category score. So let's look at the data first:
> YouGov
# A tibble: 440 x 17
ID Word Category Total Male Female `18 to 35` `35 to 54` `55+`
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 Incr~ 0 0 0 0 0 0 0
2 1 Incr~ 1 1 1 1 1 1 0
3 2 Incr~ 2 0 0 0 0 0 0
4 3 Incr~ 3 1 1 1 1 1 1
5 4 Incr~ 4 1 1 1 1 1 1
6 5 Incr~ 5 5 6 5 6 5 5
7 6 Incr~ 6 6 7 5 5 8 5
8 7 Incr~ 7 9 10 8 10 7 10
9 8 Incr~ 8 15 16 14 13 15 16
10 9 Incr~ 9 20 20 20 22 18 19
# ... with 430 more rows, and 8 more variables: Northeast <dbl>,
# Midwest <dbl>, South <dbl>, West <dbl>, White <dbl>, Black <dbl>,
# Hispanic <dbl>, `Other (NET)` <dbl>
Every Word has a row for every Category (the score, 0 to 10). Total gives the number of responses for that Word/Category combination. So although there were no responses where the word "Incredible" scored zero, there is still a row for it.
Before calculating the average score for each Word, we calculate the product of Category and Total for each Word/Category combination; let's call it total_score. From there, we can treat Word as a factor and reorder it by the mean total_score using forcats. After that, you can plot your data just as you did.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
YouGov %>%
mutate(total_score = Category*Total) %>%
mutate(Word = fct_reorder(.f = Word, .x = total_score, .fun = mean)) %>%
ggplot(aes(x=Category, y=Word, height = Total, group = Word, fill=Word)) +
geom_density_ridges(stat = "identity", scale = 3)
By treating Word as a factor, we reordered the words by their mean total_score. ggplot also assigns colors in that order, so we don't have to modify them ourselves, unless you'd prefer a different color palette.
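To make the ordering step concrete, here is a small base R sketch on made-up data (the toy data frame and its numbers are hypothetical and not taken from the gist). It computes a vote-weighted average score per word and uses it to set the factor levels. When every word has the same number of rows and the same vote total, this should give the same order as reordering by the mean of Category*Total.

```r
# Hypothetical aggregated votes: three words, three score levels each
toy <- data.frame(
  Word     = rep(c("perfect", "average", "bad"), each = 3),
  Category = rep(c(0, 5, 10), times = 3),
  Total    = c(0, 10, 90,   # perfect
               20, 60, 20,  # average
               80, 20, 0)   # bad
)

# Vote-weighted average score per word: sum(score * votes) / sum(votes)
avg_score <- sapply(split(toy, toy$Word), function(w) {
  sum(w$Category * w$Total) / sum(w$Total)
})

# Turn Word into a factor whose levels run from lowest to highest average
toy$Word <- factor(toy$Word, levels = names(avg_score)[order(avg_score)])

print(avg_score)
print(levels(toy$Word))
```

Because ggplot places factor levels along the y axis in level order, a data frame prepared this way would be drawn from the lowest-rated word at the bottom to the highest-rated word at the top.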
The other solution is exactly correct. I just wanted to point out that you can call fct_reorder() from within aes() for an even more compact solution. However, you need to do it twice if you want to change fill color by position along the y axis.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
aes(
x = Category,
y = fct_reorder(Word, Category*Total, .fun = sum),
height = Total,
fill = fct_reorder(Word, Category*Total, .fun = sum)
)) +
geom_density_ridges(stat = "identity", scale = 3) +
theme(legend.position = "none")
Created on 2020-01-19 by the reprex package (v0.3.0)
If instead you want to color by x position, you can do something like the following. It just doesn't look as nice as the temperature example because the x values come in discrete steps.
library(tidyverse)
library(ggridges)
YouGov <- read_csv("https://gist.githubusercontent.com/camminady/2e3aeab04fc3f5d3023ffc17860f0ba4/raw/97161888935c52407b0a377ebc932cc0c1490069/poll.csv")
ggplot(YouGov,
aes(
x = Category,
y = fct_reorder(Word, Category*Total, .fun = sum),
height = Total,
fill = stat(x)
)) +
geom_density_ridges_gradient(stat = "identity", scale = 3) +
theme(legend.position = "none") +
scale_fill_viridis_c(option = "C")

Cut function alternative in R

I have some data in the form:
Person.ID Household.ID Composition
1 4593 1A_0C
2 4992 2A_1C
3 9843 1A_1C
4 8385 2A_2C
5 9823 8A_1C
6 3458 1C_9C
7 7485 2C_0C
: : :
We can think of the composition variable as a count of adults/children, i.e. 2A_1C would equate to two adults and one child.
What I want to do is reduce the number of possible levels of composition. For person 5 we have a composition of 8A_1C, and I am looking for a way to reduce this to 4+A_1C. So, for example, we would have 4+ for any adult or child count greater than 4.
Person.ID Household.ID Composition
5 9823 4+A_1C
6 3458 1A_4+C
: : :
I am unsure of how to do this in R. I am thinking of using filter() or select() from dplyr; otherwise I would need to use some sort of regular expression.
Any help would be appreciated. Thanks
Data:
library(dplyr) # provides tibble() and %>%
Person.ID <- c(1,2,3,4,5,6,7) # seven IDs, matching the other two vectors
Household.ID <- c(4593,4992,9843,8385,9823,3458,7485)
Composition <- c("1A_0C","2A_1C","1A_1C","2A_2C","8A_1C","1A_9C","2A_0C")
dat <- tibble(Person.ID, Household.ID, Composition)
Function:
above4 <- function(f){
# keep only the digits and compare as a number, so that e.g. 10 > 4 holds
n <- as.integer(gsub("[^0-9]","",f))
if(n>4) "4+" else as.character(n)
}
Apply function (done on separated data, but can recombine after):
dat_ <- dat %>% tidyr::separate(., col=Composition,
into=c("Adults", "Children"),
sep="_") %>%
dplyr::mutate(Adults_ = unlist(lapply(Adults,above4)),
Children_ = unlist(lapply(Children,above4)))
You might then use select, filter to get your required dataset.
dat_ %>% dplyr::mutate(Composition_ = paste0(Adults_, "A_", Children_, "C")) %>%
dplyr::select(Person.ID, Household.ID, Composition=Composition_)
# A tibble: 7 x 3
Person.ID Household.ID Composition
<dbl> <dbl> <chr>
1 1. 4593. 1A_0C
2 2. 4992. 2A_1C
3 3. 9843. 1A_1C
4 4. 8385. 2A_2C
5 5. 9823. 4+A_1C
6 6. 3458. 1A_4+C
7 7. 7485. 2A_0C
We can use gsub:
df$Composition <- gsub("(?<!\\d)([5-9]|\\d{2,})(?=[AC])", "4+", df$Composition, perl = TRUE)
This assumes that 2 or more consecutive digits represent a number that's always greater than 4 (i.e. no 01, 02, or 001).
Output:
Person.ID Household.ID Composition
1 1 4593 1A_0C
2 2 4992 2A_1C
3 3 9843 1A_1C
4 4 8385 2A_2C
5 5 9823 4+A_1C
6 6 3458 1C_4+C
7 7 7485 2C_0C
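The regular expression from the gsub() answer can be checked on a few standalone strings. This is a minimal sketch; the test vector x is made up for illustration and includes a two-digit count and a count of exactly 4 to show the edge cases.

```r
# Hypothetical composition strings, including edge cases
x <- c("1A_0C", "8A_1C", "1A_9C", "12A_3C", "4A_4C")

# (?<!\\d)  : the match must not be preceded by another digit
# [5-9]     : a single digit greater than 4
# \\d{2,}   : or any number with two or more digits
# (?=[AC])  : the match must be followed by A or C
capped <- gsub("(?<!\\d)([5-9]|\\d{2,})(?=[AC])", "4+", x, perl = TRUE)

print(capped)
```

Counts of 4 or less are left alone, while 8, 9, and 12 are all collapsed to 4+.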

Stacking Scatterplots in ggplot2

I am creating scatterplots in R using ggplot2 to visualize populations over time. My data set looks like the following:
sampling_period cage total_1 total_2 total_3
4 y 34 95 12
4 n 89 12 13
5 n 23 10 2
I have been making individual scatterplots for the variables total_1, total_2, and total_3 with the following code:
qplot(data=BLPcaged, x=sampling_period, y=total_1, color=cage_or_control)
qplot(data=BLPcaged, x=sampling_period, y=total_2, color=cage_or_control)
qplot(data=BLPcaged, x=sampling_period, y=total_3, color=cage_or_control)
I want to create a scatterplot that contains the information about all three populations over time. I want the final product to be composed of three scatterplots one on top of each other and have the same scale for the axes. This way I could compare all three populations in one visualization.
I know that I can use facet to make different plots for the levels of a factor, but can it also be used to create different plots for different variables?
You can use melt() to reshape your data with total as a factor that you can facet on:
BLPcaged = data.frame(sampling_period=c(4,4,5),
cage=c('y','n','n'),
total_1=c(34,89,23),
total_2=c(95,12,10),
total_3=c(12,13,2))
library(reshape2)
BLPcaged.melted = melt(BLPcaged,
id.vars=c('sampling_period','cage'),
variable.name='total')
So now BLPcaged.melted looks like this:
sampling_period cage total value
1 4 y total_1 34
2 4 n total_1 89
3 5 n total_1 23
4 4 y total_2 95
5 4 n total_2 12
6 5 n total_2 10
7 4 y total_3 12
8 4 n total_3 13
9 5 n total_3 2
You can then facet this by total:
library(ggplot2)
ggplot(BLPcaged.melted, aes(sampling_period, value, color=cage)) +
geom_point() +
facet_grid(total~.)
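If reshape2 is not available, the same long layout can also be built with base R. This is only a sketch of an alternative and not part of the original answer; the names total_cols, stacked, and BLPcaged_long are illustrative.

```r
# Example data from the answer above
BLPcaged <- data.frame(sampling_period = c(4, 4, 5),
                       cage = c("y", "n", "n"),
                       total_1 = c(34, 89, 23),
                       total_2 = c(95, 12, 10),
                       total_3 = c(12, 13, 2))

total_cols <- c("total_1", "total_2", "total_3")

# stack() concatenates the selected columns into `values`
# and records the source column name in the factor `ind`
stacked <- stack(BLPcaged[total_cols])

# Repeat the id columns once per stacked column and rename to match melt()
BLPcaged_long <- data.frame(
  sampling_period = rep(BLPcaged$sampling_period, times = length(total_cols)),
  cage            = rep(BLPcaged$cage, times = length(total_cols)),
  total           = stacked$ind,
  value           = stacked$values
)

print(BLPcaged_long)
```

The result has the same columns as BLPcaged.melted, so it could be passed to the same ggplot() call with facet_grid(total ~ .).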

calculate gender percentage from grouped data frame in R

I have fairly large data frame that includes information on individuals divided into treatment groups. I am trying to generate variable means and gender percentages per group. I was able to calculate the means but I am not sure how to get the gender percentages.
Below, I generated a small replica of what my data looks like:
library(plyr)
#create variables and data frame
sampleid<-seq(1:100)
gender = rep(c("female","male"),c(50,50))
score <- rnorm(100)
age<-sample(25:35,100,replace=TRUE)
treatment <- rep(seq(1:5), each=4)
d <- data.frame(sampleid,gender,age,score, treatment)
>head(d)
sampleid gender age score treatment
1 1 female 34 1.6917201 1
2 2 female 26 -1.6189545 1
3 3 female 28 1.2867895 1
4 4 female 34 -0.5027578 1
5 5 female 29 -1.3652895 2
6 6 female 26 -2.4430843 2
I obtain the mean of each numeric column by:
groupstat<-ddply(d, .(treatment),numcolwise(mean))
which gives:
treatment sampleid age score
1 1 42.5 29.15 0.142078574
2 2 46.5 29.50 -0.261492514
3 3 50.5 30.50 -0.188393235
4 4 54.5 30.45 0.003526078
5 5 58.5 30.55 0.062996737
However I also need an additional column "Percent Female", which should give me the percentage of females within each treatment group 1:5.
Can someone help me in how to add this?
Try this out
groupstat<-ddply(d, .(treatment),summarise,
meansc= mean(score),
meanage= mean(age),
meanID= mean(sampleid),
nfem= length(gender[gender=="female"]), # number females per treatment group
nmale= length(gender[gender=="male"]), # number of males per treatment group
percentfem= 100*nfem/(nfem+nmale)) # percentage of females per treatment group
I would first split the data into treatment groups (split(d, f = d$treatment)) and then calculate the proportion of females for each group (function(x) sum(x$gender == "female")/length(x$gender)):
sapply(split(d, f = d$treatment), function(x) sum(x$gender == "female")/length(x$gender))
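A compact base R variant of the same idea uses tapply(): the mean of a logical vector is the share of TRUE values, so taking the mean of gender == "female" within each treatment gives the proportion directly. The sketch below rebuilds only the gender and treatment columns from the question (with the recycling of treatment spelled out through length.out); pct_female is an illustrative name.

```r
# Gender and treatment as constructed in the question
gender    <- rep(c("female", "male"), c(50, 50))
treatment <- rep(rep(1:5, each = 4), length.out = 100)

# Mean of a logical vector = proportion of TRUE; scale to a percentage
pct_female <- 100 * tapply(gender == "female", treatment, mean)

print(pct_female)
```

The result is a named vector with one percentage per treatment group, which can be added to groupstat as a new column if the group order matches.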
