This is my dataframe:
mydf<-structure(list(DS_FAIXA_ETARIA = c("Inválido", "16 anos", "17 anos",
"18 anos", "19 anos", "20 anos", "21 a 24 anos", "25 a 29 anos",
"30 a 34 anos", "35 a 39 anos"), n = c(5202L, 48253L, 67401L,
79398L, 88233L, 90738L, 149634L, 198848L, 238406L, 265509L)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
I would like to combine the following observations into one group called "16 a 20 anos":
"16 anos", "17 anos",
"18 anos", "19 anos", "20 anos"
In other words, I would like to "merge" rows 2-6 and sum their values in the n column, so that a single row represents the sum of rows 2-6.
Is it possible to do this with dplyr's group_by and summarise verbs (summing the n column within each group of DS_FAIXA_ETARIA)?
This would be the output that I want:
mydf<-structure(list(DS_FAIXA_ETARIA = c("Inválido","16 a 20 anos" ,"21 a 24 anos", "25 a 29 anos",
"30 a 34 anos", "35 a 39 anos"), n = c(5202L,374023L , 149634L, 198848L, 238406L, 265509L)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Many thanks
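This should do the job. First sum rows 2-6 with summarise, then append the total to the original data frame with add_row, keep the last five rows with slice_tail, and sort them with arrange. Note that slice_tail(n = 5) also drops the "Inválido" row, so the result has five rows rather than the six in the desired output.
library(dplyr)
df1 <- mydf %>%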
summarise(`16 a 20 anos`= sum(n[2:6]))
mydf %>%
add_row(DS_FAIXA_ETARIA=names(df1), n=df1$`16 a 20 anos`[1]) %>%
slice_tail(n=5) %>%
arrange(DS_FAIXA_ETARIA)
Output:
DS_FAIXA_ETARIA n
<chr> <int>
1 16 a 20 anos 374023
2 21 a 24 anos 149634
3 25 a 29 anos 198848
4 30 a 34 anos 238406
5 35 a 39 anos 265509
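We create a grouping variable that puts 'Inválido' in its own group and lumps together the elements consisting only of digits (\\d+) followed by whitespace and 'anos'. We then summarise by pasting together the first and last labels of each multi-row group while taking the sum of 'n'. Note that the merged label comes out as "16 anos_20 anos" rather than "16 a 20 anos".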
library(dplyr)
library(stringr)
mydf %>%
group_by(grp = replace(cumsum(!str_detect(DS_FAIXA_ETARIA,
'^\\d+\\s+anos$')), DS_FAIXA_ETARIA == 'Inválido', 0)) %>%
summarise(DS_FAIXA_ETARIA = if(n() > 1)
str_c(DS_FAIXA_ETARIA[c(1, n())], collapse="_") else
DS_FAIXA_ETARIA, n = sum(n), .groups = 'drop') %>%
select(-grp)
Output:
# A tibble: 6 x 2
# DS_FAIXA_ETARIA n
# <chr> <int>
#1 Inválido 5202
#2 16 anos_20 anos 374023
#3 21 a 24 anos 149634
#4 25 a 29 anos 198848
#5 30 a 34 anos 238406
#6 35 a 39 anos 265509
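For an output that matches the requested label exactly, one option is to recode the five single-year labels first and then use group_by and summarise. This is a minimal sketch, not part of either answer above; the helper vector young and the name result are made up for illustration, and the factor step is only there to keep the rows in their original order.

```r
library(dplyr)

mydf <- tibble(
  DS_FAIXA_ETARIA = c("Inválido", "16 anos", "17 anos", "18 anos", "19 anos",
                      "20 anos", "21 a 24 anos", "25 a 29 anos",
                      "30 a 34 anos", "35 a 39 anos"),
  n = c(5202L, 48253L, 67401L, 79398L, 88233L, 90738L,
        149634L, 198848L, 238406L, 265509L)
)

# Hypothetical helper: the labels to merge ("16 anos" ... "20 anos")
young <- paste(16:20, "anos")

result <- mydf %>%
  mutate(
    # Replace the single-year labels with the merged label
    DS_FAIXA_ETARIA = if_else(DS_FAIXA_ETARIA %in% young,
                              "16 a 20 anos", DS_FAIXA_ETARIA),
    # Factor levels in order of first appearance, so summarise() keeps row order
    DS_FAIXA_ETARIA = factor(DS_FAIXA_ETARIA, levels = unique(DS_FAIXA_ETARIA))
  ) %>%
  group_by(DS_FAIXA_ETARIA) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  mutate(DS_FAIXA_ETARIA = as.character(DS_FAIXA_ETARIA))
```

Here result should have six rows, with "Inválido" first and "16 a 20 anos" second, as in the desired output.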
If I have a matrix that looks like this:
Region Ålder Antal regpop Andel
[1,] "01 Stockholms län" "0 år" "28474" "2377081" "0.0119785568939384"
[2,] "01 Stockholms län" "1 år" "29033" "2377081" "0.0122137192632477"
[3,] "01 Stockholms län" "10 år" "29678" "2377081" "0.0124850604586045"
[4,] "01 Stockholms län" "100+ år" "524" "2377081" "0.000220438428475933"
[5,] "01 Stockholms län" "11 år" "29679" "2377081" "0.0124854811426283"
[6,] "01 Stockholms län" "12 år" "28956" "2377081" "0.0121813265934144"
[7,] "01 Stockholms län" "13 år" "28592" "2377081" "0.0120281976087479"
[8,] "01 Stockholms län" "14 år" "27572" "2377081" "0.0115990999044627"
[9,] "01 Stockholms län" "15 år" "27466" "2377081" "0.0115545073979389"
[10,] "01 Stockholms län" "16 år" "26691" "2377081" "0.0112284772794869"
[11,] "01 Stockholms län" "17 år" "26004" "2377081" "0.0109394673551301"
[12,] "01 Stockholms län" "18 år" "24996" "2377081" "0.0105154178591306"
[13,] "01 Stockholms län" "19 år" "24971" "2377081" "0.0105049007585354"
[14,] "01 Stockholms län" "2 år" "29268" "2377081" "0.0123125800088428"
[15,] "01 Stockholms län" "20 år" "24777" "2377081" "0.0104232880579164"
What should I do to order the rows by age, i.e. "0 år", "1 år", "2 år", ..., "100+ år"?
The gtools::mixedsort function can help here
x <- c("0 år", "1 år", "10 år", "100+ år", "11 år", "12 år", "13 år",
"14 år", "15 år", "16 år", "17 år", "18 år", "19 år", "2 år",
"20 år")
gtools::mixedsort(x)
# [1] "0 år" "1 år" "2 år" "10 år" "11 år" "12 år" "13 år" "14 år"
# [9] "15 år" "16 år" "17 år" "18 år" "19 år" "20 år" "100+ år"
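If the object you shared is a matrix named data, you could reorder its rows with mixedorder. Note that a matrix column is selected with data[, "Ålder"]; data[["Ålder"]] only works for a data frame.
data[gtools::mixedorder(data[, "Ålder"]), ]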
Here is a base R option. Extract the digits from each string with gsub and convert them to numeric, which lets us order the rows numerically. I have created a mock matrix as an example. If you could provide the actual matrix with dput, i.e. run dput(matrix) and paste the output into your question, that would help people give more specific answers.
x <- c("0 år", "1 år", "10 år", "100+ år", "11 år", "12 år", "13 år",
"14 år", "15 år", "16 år", "17 år", "18 år", "19 år", "2 år",
"20 år")
y <- seq_along(x)
mat <- matrix(c(x, y), ncol = 2)
mat[order(as.numeric(gsub("[^0-9.]", "", mat[,1]))),]
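If the matrix holds several regions, the same idea extends to a two-key sort. The sketch below is a toy under stated assumptions: the matrix mat and its values are invented, and only the column names Region and Ålder come from the question.

```r
# Toy character matrix with the question's column names (values are invented)
mat <- matrix(
  c("03 Uppsala län", "01 Stockholms län", "01 Stockholms län",
    "03 Uppsala län", "01 Stockholms län",
    "10 år", "100+ år", "2 år", "2 år", "10 år"),
  ncol = 2,
  dimnames = list(NULL, c("Region", "Ålder"))
)

# Strip everything except digits, then convert to numeric for sorting
age_num <- as.numeric(gsub("[^0-9]", "", mat[, "Ålder"]))

# Sort by region first, then by numeric age within each region
sorted <- mat[order(mat[, "Region"], age_num), ]
```

Because order() accepts several keys, ties in the first key (the region) are broken by the second (the numeric age), so "2 år" lands before "10 år" within each region.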
I have census data of Male and Female populations organized by age group:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
if (!file.exists("./datafiles/cc-est2018-alldata-54.csv"))
download.file(url, destfile = "./datafiles/cc-est2018-alldata-54.csv", mode = "wb")
popSample <- read.csv("./datafiles/cc-est2018-alldata-54.csv") %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
popSample$AGEGRP <- as.factor(popSample$AGEGRP)
I then plot the relationship between the Male and Female populations, faceted by age group (1-18, which is stored as an int in the raw data):
g <- ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups", x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
g
Which results in this plot: https://share.getcloudapp.com/v1ur6O4e
The problem: I am trying to convert the column AGEGRP from ‘int’ to ‘factor’, and change the factors labels from “1”, “2”, “3”, … “18” to "AgeGroup1", "AgeGroup2", "AgeGroup3", … "AgeGroup18"
When I try this code, the values in my AGEGRP column are all replaced with NAs:
popSample$AGEGRP <- factor(popSample$AGEGRP, levels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
https://share.getcloudapp.com/qGuo1O4y
Thank you for your help,
The levels argument has to list the values that actually occur in the column (1 to 18); the new names go in labels:
popSample$AGEGRP <- factor(popSample$AGEGRP, levels = 1:18, labels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
All 18 labels need to be supplied, in the same order as the levels.
Alternatively, if AGEGRP is already a factor with levels 1 to 18,
levels(popSample$AGEGRP) <- c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
should work as well.
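The NAs in the question come from passing the new names as levels: factor() keeps only values that match one of the levels, and none of the codes 1-18 matches a string such as "0 to 4". A minimal sketch on a toy vector (age and labs are invented names, with only three groups for brevity):

```r
# Toy integer codes standing in for AGEGRP (assumption: only 3 groups)
age <- c(1L, 2L, 3L, 2L)
labs <- c("0 to 4", "5 to 9", "10 to 14")

# Wrong: the data contain 1, 2, 3, none of which appears in `levels`
wrong <- factor(age, levels = labs)

# Right: `levels` names the existing values, `labels` renames them
right <- factor(age, levels = 1:3, labels = labs)

print(wrong)
print(right)
```

In this sketch every element of wrong is NA, while right keeps the four observations under their new labels.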
Read in the csv again:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
popSample <- read.csv(url) %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
If you just want to add a prefix "AgeGroup" to your facet labels, you do:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP,labeller=labeller(AGEGRP = function(i)paste0("AgeGroup",i))) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
If you need new factor levels, then you need to re-create the factor (as in @Annet's answer):
lvls = c("0 to 4", "5 to 9", "10 to 14", "15 to 19",
"20 to 24", "25 to 29", "30 to 34", "35 to 39",
"40 to 44", "45 to 49", "50 to 54", "55 to 59",
"60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
# lvls[AGEGRP] maps each code 1-18 to its label; this also works if AGEGRP
# was already converted to a factor. After re-reading the csv, skip the as.factor() step.
popSample$AGEGRP = factor(lvls[popSample$AGEGRP],levels=lvls)
Then plot:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
To change all the factor labels with one function, you can use forcats::fct_relabel (forcats ships as part of the tidyverse, which you've already got loaded). The changed factor labels will carry over to the plot facets and the order stays the same.
First few entries:
# before relabelling
popSample$AGEGRP[1:4]
#> [1] 1 2 3 4
#> Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# after relabelling
forcats::fct_relabel(popSample$AGEGRP, ~paste0("AgeGroup", .))[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18
Or with base R, reassign the levels:
levels(popSample$AGEGRP) <- paste0("AgeGroup", levels(popSample$AGEGRP))
popSample$AGEGRP[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18
Is it possible to create a density plot using this population data? Age_group is a categorical variable. Does it have to be numeric to create a density plot?
library(tidyverse)
df <- structure(list(year = c(1971, 1971, 1971, 1971, 1971, 1971, 1971,
1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971
), age_group = structure(2:19, .Label = c("All ages", "0 to 4 years",
"5 to 9 years", "10 to 14 years", "15 to 19 years", "20 to 24 years",
"25 to 29 years", "30 to 34 years", "35 to 39 years", "40 to 44 years",
"45 to 49 years", "50 to 54 years", "55 to 59 years", "60 to 64 years",
"65 to 69 years", "70 to 74 years", "75 to 79 years", "80 to 84 years",
"85 to 89 years", "90 to 94 years", "95 to 99 years", "100 years and over",
"Median age"), class = "factor"), population = c(1836149, 2267794,
2329323, 2164092, 1976914, 1643264, 1342744, 1286302, 1284154,
1252545, 1065664, 964984, 785693, 626521, 462065, 328583, 206174,
101117)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-18L))
You can convert the text to numeric ranges, e.g.:
library(tidyverse) # if not already loaded
df %>%
# These extract the 1st and 3rd "word" of age_group
# Uses stringr::word(), loaded as part of tidyverse
mutate(age_min = word(age_group, 1) %>% as.numeric,
age_max = word(age_group, 3) %>% as.numeric) %>%
head
# A tibble: 6 x 5
year age_group population age_min age_max
<dbl> <fct> <dbl> <dbl> <dbl>
1 1971 0 to 4 years 1836149 0 4
2 1971 5 to 9 years 2267794 5 9
3 1971 10 to 14 years 2329323 10 14
4 1971 15 to 19 years 2164092 15 19
5 1971 20 to 24 years 1976914 20 24
6 1971 25 to 29 years 1643264 25 29
From that, you could plot the data with ggplot in several ways, for example with age_min on the x axis:
... %>%
ggplot(aes(age_min, population)) +
geom_step()
... %>%
ggplot(aes(age_min, population)) +
geom_col()
... %>%
ggplot(aes(age_min, y = population)) +
geom_density(stat = "identity")
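A kernel density estimate does need a numeric variable, so another option is to place each group at its midpoint and weight it by its population. The following is only a sketch in base R: it assumes every group spans five years, and the bandwidth bw = 5 as well as the names pop, age_mid, w and d are arbitrary choices for illustration.

```r
# Population by 5-year age group, copied from the question's data (1971)
pop <- data.frame(
  age_min = seq(0, 85, by = 5),
  population = c(1836149, 2267794, 2329323, 2164092, 1976914, 1643264,
                 1342744, 1286302, 1284154, 1252545, 1065664, 964984,
                 785693, 626521, 462065, 328583, 206174, 101117)
)

# Assumption: every group spans 5 years, so its midpoint is age_min + 2.5
pop$age_mid <- pop$age_min + 2.5

# density() expects weights that sum to 1
w <- pop$population / sum(pop$population)

# bw = 5 (one bin width) is an arbitrary smoothing choice
d <- density(pop$age_mid, weights = w, bw = 5)
```

Calling plot(d) then draws the smoothed curve. Because the input is already binned, the result is only a smoothed view of the bar heights, not an estimate from individual ages.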
I want to convert the column Annual.income, stored as character in my data frame, to numeric. The code I use produces NA values, although the resulting class is numeric.
I have searched the forums, but none of the existing questions solves my problem:
There are no NAs in the Annual.income column, only numbers. All the data is formatted with "." rather than "," as the decimal separator.
Here is the code I use.
data$Annual.income <- as.numeric(as.character(data$Annual.income))
UPDATE
Here is the dput of the column Annual.income.
dput(data$Annual.income)
c("34 500", "51 400", "43 200", "40 100", "36 400", "39 100",
"41 900", "48 700", "45 500", "45 500", "49 100", "35 100", "34 500",
"29 200", "32 200", "36 300", "35 800", "31 500", "33 000", "34 600",
"32 100", "32 000", "31 400", "33 200", "42 600", "29 200", "34 600",
"29 200", "34 100", "30 600", "34 034", "33 600", "31 000", "35 500",
"30 600", "30 600", "30 600", "30 800", "34 034", "33 200", "32 900"
)
The following still gives me NAs.
data$Annual.income <- as.numeric(data$Annual.income)
I imported the data with the Import Dataset command in the Environment pane, with stringsAsFactors unchecked, Heading = Yes, Separator = Semicolon, and Decimal = Period.
Thanks
...
The white space causes the problem here: as.numeric() cannot parse a string such as "34 500". Simply remove all white space characters with gsub(), e.g.
Annual.income <- c("34 500", "51 400", "43 200", "40 100", "36 400", "39 100",
"41 900", "48 700", "45 500", "45 500", "49 100", "35 100", "34 500",
"29 200", "32 200", "36 300", "35 800", "31 500", "33 000", "34 600",
"32 100", "32 000", "31 400", "33 200", "42 600", "29 200", "34 600",
"29 200", "34 100", "30 600", "34 034", "33 600", "31 000", "35 500",
"30 600", "30 600", "30 600", "30 800", "34 034", "33 200", "32 900"
)
as.numeric(gsub("\\s", "", Annual.income))
#> [1] 34500 51400 43200 40100 36400 39100 41900 48700 45500 45500 49100
#> [12] 35100 34500 29200 32200 36300 35800 31500 33000 34600 32100 32000
#> [23] 31400 33200 42600 29200 34600 29200 34100 30600 34034 33600 31000
#> [34] 35500 30600 30600 30600 30800 34034 33200 32900
Created on 2019-05-17 by the reprex package (v0.2.1)
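To check which entries fail before cleaning, one can flag the values that as.numeric() turns into NA. This is a small sketch on three values from the question; the names income, bad and clean are illustrative.

```r
# Three values taken from the question's dput output
income <- c("34 500", "51 400", "33 000")

# TRUE where direct conversion fails (warnings suppressed on purpose)
bad <- is.na(suppressWarnings(as.numeric(income)))

# Show the offending strings, then convert after stripping white space
print(income[bad])
clean <- as.numeric(gsub("[[:space:]]", "", income))
print(clean)
```

If some entries still fail after this step, the separator may be a character that the white space class does not cover, such as a non-breaking space; that is an assumption to verify against the raw file.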