How should I order a column with character values numerically? [duplicate]

This question already has an answer here:
Order a "mixed" vector (numbers with letters)
(1 answer)
Closed 2 years ago.
If I have a matrix that looks like this:
Region Ålder Antal regpop Andel
[1,] "01 Stockholms län" "0 år" "28474" "2377081" "0.0119785568939384"
[2,] "01 Stockholms län" "1 år" "29033" "2377081" "0.0122137192632477"
[3,] "01 Stockholms län" "10 år" "29678" "2377081" "0.0124850604586045"
[4,] "01 Stockholms län" "100+ år" "524" "2377081" "0.000220438428475933"
[5,] "01 Stockholms län" "11 år" "29679" "2377081" "0.0124854811426283"
[6,] "01 Stockholms län" "12 år" "28956" "2377081" "0.0121813265934144"
[7,] "01 Stockholms län" "13 år" "28592" "2377081" "0.0120281976087479"
[8,] "01 Stockholms län" "14 år" "27572" "2377081" "0.0115990999044627"
[9,] "01 Stockholms län" "15 år" "27466" "2377081" "0.0115545073979389"
[10,] "01 Stockholms län" "16 år" "26691" "2377081" "0.0112284772794869"
[11,] "01 Stockholms län" "17 år" "26004" "2377081" "0.0109394673551301"
[12,] "01 Stockholms län" "18 år" "24996" "2377081" "0.0105154178591306"
[13,] "01 Stockholms län" "19 år" "24971" "2377081" "0.0105049007585354"
[14,] "01 Stockholms län" "2 år" "29268" "2377081" "0.0123125800088428"
[15,] "01 Stockholms län" "20 år" "24777" "2377081" "0.0104232880579164"
What should I do to order them as "0 år", "1 år", "2 år", ..., "100+ år"?

The gtools::mixedsort function can help here
x <- c("0 år", "1 år", "10 år", "100+ år", "11 år", "12 år", "13 år",
"14 år", "15 år", "16 år", "17 år", "18 år", "19 år", "2 år",
"20 år")
gtools::mixedsort(x)
# [1] "0 år" "1 år" "2 år" "10 år" "11 år" "12 år" "13 år" "14 år"
# [9] "15 år" "16 år" "17 år" "18 år" "19 år" "20 år" "100+ år"
If the object you shared is a matrix (or data frame) named data, then you could do
data[gtools::mixedorder(data[, "Ålder"]), ]

Here is a base R option. Extract the digits from the string using gsub, then convert them to numeric so that the rows can be reordered numerically. I have created a mock matrix as an example. If you could provide the actual matrix using dput, i.e. dput(matrix), and paste the output into your question, that would help people give more specific answers.
x <- c("0 år", "1 år", "10 år", "100+ år", "11 år", "12 år", "13 år",
"14 år", "15 år", "16 år", "17 år", "18 år", "19 år", "2 år",
"20 år")
y <- 1:length(x)
mat <- matrix(c(x, y), ncol = 2)
mat[order(as.numeric(gsub("[^0-9.]", "", mat[,1]))),]
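The same idea can be sketched on a matrix with named columns, as in the question. This is only an illustration: the seven rows are a subset of the data shown in the question, and the regular expression assumes every value contains a run of digits that represents the age.

```r
# Subset of the question's data, for illustration only
age   <- c("0 år", "1 år", "10 år", "100+ år", "11 år", "2 år", "20 år")
antal <- c("28474", "29033", "29678", "524", "29679", "29268", "24777")
m <- cbind(age, antal)
colnames(m) <- c("Ålder", "Antal")

# Keep only the first run of digits in each value and convert it to numeric
age_num <- as.numeric(sub("^[^0-9]*([0-9]+).*$", "\\1", m[, "Ålder"]))

# Reorder whole rows so the other columns stay aligned with the age column
sorted <- m[order(age_num), , drop = FALSE]
sorted
```

Ordering by a separate numeric key leaves the original character values untouched, so "100+ år" keeps its "+" in the result.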

Related

R loop to iterate and find unique combination between each item

concept_id concept_name event
1: 443387 Malignant tumor of stomach comorb
2: 4193704 Type 2 diabetes mellitus without complication comorb
3: 4095320 Malignant tumor of body of stomach comorb
4: 201826 Type 2 diabetes mellitus comorb
5: 4174977 Retinopathy due to diabetes mellitus comorb
For the above data, I am trying to create a list of combinations for concept_ids. There are 5 concept ids so when we iterate each concept_id with another concept_id we get a list something like this.
nrow(comorb_event)
for (i in (1:nrow(comorb_event))) {
for (j in (1:nrow(comorb_event))){
print(paste(i,j))
}
}
[1] "1 1"
[1] "1 2"
[1] "1 3"
[1] "1 4"
[1] "1 5"
[1] "2 1"
[1] "2 2"
[1] "2 3"
[1] "2 4"
[1] "2 5"
[1] "3 1"
[1] "3 2"
[1] "3 3"
[1] "3 4"
[1] "3 5"
[1] "4 1"
[1] "4 2"
[1] "4 3"
[1] "4 4"
[1] "4 5"
[1] "5 1"
[1] "5 2"
[1] "5 3"
[1] "5 4"
[1] "5 5"
My output is not what I expect. Since [1,1] pairs an item with itself, we can skip it; similarly, [2,1] is already covered by [1,2], so we can remove that too. The expected list after removing the redundant combinations would be something like this:
[1] "1 2"
[1] "1 3"
[1] "1 4"
[1] "1 5"
[1] "2 3"
[1] "2 4"
[1] "2 5"
[1] "3 4"
[1] "3 5"
[1] "4 5"
Sample data
structure(list(concept_id = c("443387", "4193704", "4095320",
"201826", "4174977"), concept_name = c("Malignant tumor of stomach",
"Type 2 diabetes mellitus without complication", "Malignant tumor of body of stomach",
"Type 2 diabetes mellitus", "Retinopathy due to diabetes mellitus"
), event = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("comorb",
"drug", "primary_dx"), class = "factor")), class = c("data.table",
"data.frame"), row.names = c(NA, -5L))
We need combn
t(combn(seq_len(nrow(comorb_event)), 2))
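As a small sketch of how this maps back to the data, combn can also be applied to the concept_id values directly. The vector below copies the five ids from the sample data; the column names id_a and id_b are made up for illustration.

```r
concept_id <- c("443387", "4193704", "4095320", "201826", "4174977")

# Index pairs, one row per unordered combination (no self-pairs, no mirrored pairs)
idx <- t(combn(seq_along(concept_id), 2))

# The same combinations expressed as concept ids
pairs <- t(combn(concept_id, 2))
colnames(pairs) <- c("id_a", "id_b")  # illustrative names
pairs

# Equivalent nested loop: start the inner index one past the outer index
n <- length(concept_id)
loop_pairs <- character(0)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    loop_pairs <- c(loop_pairs, paste(i, j))
  }
}
loop_pairs
```

Both versions yield choose(5, 2) = 10 pairs; combn avoids the explicit loop.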

Renaming labels of a factor in R

I have census data of Male and Female populations organized by age group:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
if (!file.exists("./datafiles/cc-est2018-alldata-54.csv"))
download.file(url, destfile = "./datafiles/cc-est2018-alldata-54.csv", mode = "wb")
popSample <- read.csv("./datafiles/cc-est2018-alldata-54.csv") %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
popSample$AGEGRP <- as.factor(popSample$AGEGRP)
I then plot the Male and Female population relationships, faceted by age group (1-18, which is currently treated as an int):
g <- ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups", x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
g
Which results in this plot: https://share.getcloudapp.com/v1ur6O4e
The problem: I am trying to convert the column AGEGRP from ‘int’ to ‘factor’, and change the factors labels from “1”, “2”, “3”, … “18” to "AgeGroup1", "AgeGroup2", "AgeGroup3", … "AgeGroup18"
When I try this code, my AGEGRP column's observation values are all replaced with NAs:
popSample$AGEGRP <- factor(popSample$AGEGRP, levels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
https://share.getcloudapp.com/qGuo1O4y
Thank you for your help,
popSample$AGEGRP <- factor(popSample$AGEGRP, labels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
Use labels rather than levels here, and supply one label for every existing level.
Alternatively
levels(popSample$AGEGRP) <- c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
should work as well.
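A minimal sketch of why the NAs appear, using a tiny made-up vector rather than the census data: the levels argument of factor() lists the values expected in the input, so any input value not found among them becomes NA, whereas labels only renames the levels.

```r
agegrp <- c(1L, 2L, 3L, 18L)  # toy stand-in for popSample$AGEGRP

# levels = values to look for in the data; none of 1, 2, 3, 18 match these strings
bad <- factor(agegrp, levels = c("0 to 4", "5 to 9"))
bad   # every element is NA

# levels = the codes present in the data, labels = what to call them
good <- factor(agegrp, levels = 1:18, labels = paste0("AgeGroup", 1:18))
good
```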
Read in the csv again:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
popSample <- read.csv(url) %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
If you just want to add a prefix "AgeGroup" to your facet labels, you do:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP,labeller=labeller(AGEGRP = function(i)paste0("AgeGroup",i))) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
If you need new factor labels, then you need to refactor (like @Annet's answer):
lvls = c("0 to 4", "5 to 9", "10 to 14", "15 to 19",
"20 to 24", "25 to 29", "30 to 34", "35 to 39",
"40 to 44", "45 to 49", "50 to 54", "55 to 59",
"60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
# AGEGRP was already converted to a factor above, so index lvls by its codes
# if you read the csv again, AGEGRP is still an integer and the same line works
popSample$AGEGRP = factor(lvls[popSample$AGEGRP],levels=lvls)
Then plot:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
To change all the factor labels with one function, you can use forcats::fct_relabel (forcats ships as part of the tidyverse, which you've already got loaded). The changed factor labels will carry over to the plot facets and the order stays the same.
First few entries:
# before relabelling
popSample$AGEGRP[1:4]
#> [1] 1 2 3 4
#> Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# after relabelling
forcats::fct_relabel(popSample$AGEGRP, ~paste0("AgeGroup", .))[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18
Or with base R, reassign the levels:
levels(popSample$AGEGRP) <- paste0("AgeGroup", levels(popSample$AGEGRP))
popSample$AGEGRP[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18

Density plot using population data for a specific year

Is it possible to create a density plot using this population data? Age_group is a categorical variable. Does it have to be numeric to create a density plot?
library(tidyverse)
df <- structure(list(year = c(1971, 1971, 1971, 1971, 1971, 1971, 1971,
1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971, 1971
), age_group = structure(2:19, .Label = c("All ages", "0 to 4 years",
"5 to 9 years", "10 to 14 years", "15 to 19 years", "20 to 24 years",
"25 to 29 years", "30 to 34 years", "35 to 39 years", "40 to 44 years",
"45 to 49 years", "50 to 54 years", "55 to 59 years", "60 to 64 years",
"65 to 69 years", "70 to 74 years", "75 to 79 years", "80 to 84 years",
"85 to 89 years", "90 to 94 years", "95 to 99 years", "100 years and over",
"Median age"), class = "factor"), population = c(1836149, 2267794,
2329323, 2164092, 1976914, 1643264, 1342744, 1286302, 1284154,
1252545, 1065664, 964984, 785693, 626521, 462065, 328583, 206174,
101117)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-18L))
You can convert the text to numeric ranges, e.g.:
library(tidyverse) # if not already loaded
df %>%
# These extract the 1st and 3rd "word" of age_group
# Uses stringr::word(), loaded as part of tidyverse
mutate(age_min = word(age_group, 1) %>% as.numeric,
age_max = word(age_group, 3) %>% as.numeric) %>%
head
# A tibble: 6 x 5
year age_group population age_min age_max
<dbl> <fct> <dbl> <dbl> <dbl>
1 1971 0 to 4 years 1836149 0 4
2 1971 5 to 9 years 2267794 5 9
3 1971 10 to 14 years 2329323 10 14
4 1971 15 to 19 years 2164092 15 19
5 1971 20 to 24 years 1976914 20 24
6 1971 25 to 29 years 1643264 25 29
From that, you could display in ggplot a bunch of ways:
... %>%
ggplot(aes(age_min, population)) +
geom_step()
... %>%
ggplot(aes(age_min, population)) +
geom_col()
... %>%
ggplot(aes(age_min, y = population)) +
geom_density(stat = "identity")
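If a smoothed density estimate is wanted rather than a plot of the raw counts, one possible base R sketch is a weighted kernel density over the bin midpoints. This is an assumption-laden illustration, not part of the answer above: it uses only the first four rows of the data, assumes 5-year bins (hence the 2.5 offset), and picks a bandwidth of 5 by hand.

```r
age_group  <- c("0 to 4 years", "5 to 9 years", "10 to 14 years", "15 to 19 years")
population <- c(1836149, 2267794, 2329323, 2164092)

# Lower bound of each bin: everything before the first space
age_min <- as.numeric(sub(" .*$", "", age_group))
age_mid <- age_min + 2.5            # midpoint, assuming 5-year bins

# Weights must sum to 1; bw is set explicitly (hand-picked value)
w <- population / sum(population)
d <- density(age_mid, weights = w, bw = 5)
plot(d, main = "Weighted density of age (sketch)")
```

So the age variable does need a numeric representation for a density estimate; the categorical labels alone only support bar-style plots.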

Error: NAs introduced by coercion when converting character to numeric

I want to convert the column $Annual.income, which is stored as character in my data frame, to numeric. The code I use gives NA values, although the new class is numeric.
I have looked for answers on forums, but none of the questions address my problem:
I do not have NAs in the column Annual.income; there are only numbers. All the data is formatted to use "." instead of "," for decimals.
Here is the code I use.
data$Annual.income <- as.numeric(as.character(data$Annual.income))
******************************UPDATE********************************************
Here is the dput of the column Annual.income.
dput(data$Annual.income)
c("34 500", "51 400", "43 200", "40 100", "36 400", "39 100",
"41 900", "48 700", "45 500", "45 500", "49 100", "35 100", "34 500",
"29 200", "32 200", "36 300", "35 800", "31 500", "33 000", "34 600",
"32 100", "32 000", "31 400", "33 200", "42 600", "29 200", "34 600",
"29 200", "34 100", "30 600", "34 034", "33 600", "31 000", "35 500",
"30 600", "30 600", "30 600", "30 800", "34 034", "33 200", "32 900"
)
The following still gives me NAs.
data$Annual.income <- as.numeric(data$Annual.income)
I imported the data using the Import Dataset command in the Environment pane, with stringsAsFactors unchecked, Heading = Yes, Separator = Semicolon and Decimal = Period.
Thanks
...
The white space causes the problem here. Simply remove all white space characters with gsub(), e.g.
Annual.income <- c("34 500", "51 400", "43 200", "40 100", "36 400", "39 100",
"41 900", "48 700", "45 500", "45 500", "49 100", "35 100", "34 500",
"29 200", "32 200", "36 300", "35 800", "31 500", "33 000", "34 600",
"32 100", "32 000", "31 400", "33 200", "42 600", "29 200", "34 600",
"29 200", "34 100", "30 600", "34 034", "33 600", "31 000", "35 500",
"30 600", "30 600", "30 600", "30 800", "34 034", "33 200", "32 900"
)
as.numeric(gsub("\\s", "", Annual.income))
#> [1] 34500 51400 43200 40100 36400 39100 41900 48700 45500 45500 49100
#> [12] 35100 34500 29200 32200 36300 35800 31500 33000 34600 32100 32000
#> [23] 31400 33200 42600 29200 34600 29200 34100 30600 34034 33600 31000
#> [34] 35500 30600 30600 30600 30800 34034 33200 32900
Created on 2019-05-17 by the reprex package (v0.2.1)
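A slightly more defensive variant, offered as a sketch: thousands separators in imported files are sometimes non-breaking spaces, which "\\s" may not match depending on the locale and regex engine. Stripping everything that is not a digit, decimal point or minus sign avoids that question. The three values below are made up for illustration.

```r
income <- c("34 500", "51\u00a0400", " 43 200 ")  # second value uses a non-breaking space

# Which entries fail a plain conversion?
failed <- income[is.na(suppressWarnings(as.numeric(income)))]
failed

# Keep only digits, the decimal point and a minus sign, then convert
income_num <- as.numeric(gsub("[^0-9.-]", "", income))
income_num
```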

Changing levels in R

I have a field where the levels are broken down as below:
levels(demo$age)
"18 to 24 years old" "25 to 34 years old" "35 to 44 years old" "45 to 54 years old" "55 to 64 years old" "65 to 74 years old" "75 years old or older"
How can I change the levels to
"Total " "18 to 24 years old" "25 plus".
We create a vector of levels that needs to be changed
v1 <- c("25 to 34 years old", "35 to 44 years old", "45 to 54 years old",
"55 to 64 years old", "65 to 74 years old" , "75 years old or older")
Then assign those to the new level
levels(demo$age)[levels(demo$age) %in% v1] <- "25 plus"
If we need a 'Total' level as well
demo$age <- factor(demo$age, levels = c("Total", levels(demo$age)))
levels(demo$age)
#[1] "Total" "18 to 24 years old" "25 plus"
data
set.seed(24)
demo <- data.frame(age = sample(c("18 to 24 years old", v1), 100, replace = TRUE), stringsAsFactors = TRUE)
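A short self-contained sketch of the two steps on a four-element toy factor (not the demo data). Note the difference between renaming levels and adding one: assigning a longer vector with levels(x) <- ... relabels the existing levels by position, whereas factor(x, levels = ...) keeps each value and adds "Total" as an empty level.

```r
age <- factor(c("18 to 24 years old", "25 to 34 years old",
                "75 years old or older", "18 to 24 years old"))

# Step 1: collapse every level except the youngest group into "25 plus"
levels(age)[levels(age) != "18 to 24 years old"] <- "25 plus"

# Step 2: add an (empty) "Total" level without renaming the existing ones
age <- factor(age, levels = c("Total", levels(age)))

levels(age)
table(age)
```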

Resources