binning the numbers with wrong outcome - r

I have problems with the output after I bin a numerical vector.
I am trying to bin the length of stay, which was calculated beforehand with the difftime function. I have not included that code, since it is only background. Yet when I bin, I do not get the right answer.
Here is the length of stay, assigned to los.
dput(los)
c(61.0416666666667, 61.0416666666667, 61.0416666666667, 2, 2, 3, 3)
Here are my breaks. I tried several variants of the max(los) call inside them: na.rm = TRUE, na.rm = FALSE, and no na.rm argument at all.
breaks <- c(0, 0.8, 0.16,
1.0, 1.8, 1.16,
2.0, 2.8, 2.16,
3.0, 3.8, 3.16,
4.0, 4.8, 4.16,
5.0, 5.8, 5.16,
6.0, 6.8, 6.16,
7.0, 14.0, 21.0, 28.0, max(los)) #, , na.rm = FALSE
Nevertheless, the code I tried next,
dt_los$losbinned <- cut(dt_los$LOS,
breaks = breaks,
labels = c("0hrs", "8hrs", "16hrs", "1 d",
"1 d 8hrs", "1 d 16hrs", "2 d",
"2 d 8hrs", "2 d 16hrs", "3 d",
"3 d 8hrs", "3 d 16hrs", "4 d",
"4 d 8hrs", "4 d 16hrs", "5 d",
"5 d 8hrs", "5 d 16hrs", "6 d",
"6 d 8hrs","6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"),
right = FALSE)#
with different values passed for 'right', gives me this:
With right = FALSE, the LOS of 61.04 is not binned into the category "> 28 d" (it comes out as NA), but I do get the right bins for the other values, 2.00 and 3.00.
structure(list(IDcol = 101:107, Admissions = structure(c(1539160200,
1539160200, 1539160200, 1539154800, 1539154800, 1539154800, 1539154800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
Discharges = structure(c(1544434200, 1544434200, 1544434200,
1539327600, 1539327600, 1539414000, 1539414000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/London"), Admission_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), LOS = c(61.0416666666667, 61.0416666666667,
61.0416666666667, 2, 2, 3, 3), Ward_code = c("DSN", "DSN",
"DNA", "NAS", "BAS", "BAS", "BAS"), Same_day_discharge = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Spell_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), Adm_period = c(TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE), losbinned = structure(c(NA, NA, NA, 7L, 7L,
10L, 10L), .Label = c("0hrs", "8hrs", "16hrs", "1 d", "1 d 8hrs",
"1 d 16hrs", "2 d", "2 d 8hrs", "2 d 16hrs", "3 d", "3 d 8hrs",
"3 d 16hrs", "4 d", "4 d 8hrs", "4 d 16hrs", "5 d", "5 d 8hrs",
"5 d 16hrs", "6 d", "6 d 8hrs", "6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"), class = "factor")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
With right = TRUE, 61.04 is binned into "> 28 d", which is the desired answer, yet I do not get the right bins for 2.0 and 3.0: they are binned into "1 d 16hrs" and "2 d 16hrs" respectively. They should be binned into "2 d" and "3 d".
structure(list(IDcol = 101:107, Admissions = structure(c(1539160200,
1539160200, 1539160200, 1539154800, 1539154800, 1539154800, 1539154800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
Discharges = structure(c(1544434200, 1544434200, 1544434200,
1539327600, 1539327600, 1539414000, 1539414000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/London"), Admission_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), LOS = c(61.0416666666667, 61.0416666666667,
61.0416666666667, 2, 2, 3, 3), Ward_code = c("DSN", "DSN",
"DNA", "NAS", "BAS", "BAS", "BAS"), Same_day_discharge = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Spell_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), Adm_period = c(TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE), losbinned = structure(c(25L, 25L, 25L, 6L, 6L,
9L, 9L), .Label = c("0hrs", "8hrs", "16hrs", "1 d", "1 d 8hrs",
"1 d 16hrs", "2 d", "2 d 8hrs", "2 d 16hrs", "3 d", "3 d 8hrs",
"3 d 16hrs", "4 d", "4 d 8hrs", "4 d 16hrs", "5 d", "5 d 8hrs",
"5 d 16hrs", "6 d", "6 d 8hrs", "6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"), class = "factor")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
The expected result is that the right bins are assigned to my length of stay: 61.04 -> "> 28 d", 2 -> "2 d", 3 -> "3 d".
If this can be done with the tidyverse, that would be amazing, as long as it respects the bins I have assigned. Otherwise, I am happy with a corrected version of the code I came up with.

By default (right = TRUE), the cut function's bins are exclusive on the left and inclusive on the right.
From the cut function's help: The factor level labels are constructed as "(b1, b2]", "(b2, b3]" etc. for right = TRUE and as "[b1, b2)", ... for right = FALSE.
To include the lowest value (or, with right = FALSE, the highest value, as in this case), the include.lowest = TRUE option is required. This makes the outermost bin inclusive at both ends, "[b1, b2]".
Try:
labels<-c("0hrs", "8hrs", "16hrs", "1 d",
"1 d 8hrs", "1 d 16hrs", "2 d",
"2 d 8hrs", "2 d 16hrs", "3 d",
"3 d 8hrs", "3 d 16hrs", "4 d",
"4 d 8hrs", "4 d 16hrs", "5 d",
"5 d 8hrs", "5 d 16hrs", "6 d",
"6 d 8hrs","6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d")
dt_los$losbinned <- cut(los, breaks=breaks, labels=labels, right=FALSE, include.lowest = TRUE)
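To see what include.lowest changes, it can help to drop the labels and look at the raw intervals. The following is a small sketch using the los vector and breaks from the question; the names open_top and closed_top are made up for illustration.

```r
# Data and breaks copied from the question.
los <- c(61.0416666666667, 61.0416666666667, 61.0416666666667, 2, 2, 3, 3)
breaks <- c(0, 0.8, 0.16, 1.0, 1.8, 1.16, 2.0, 2.8, 2.16, 3.0, 3.8, 3.16,
            4.0, 4.8, 4.16, 5.0, 5.8, 5.16, 6.0, 6.8, 6.16,
            7.0, 14.0, 21.0, 28.0, max(los))

# Without labels, cut() names each level after its interval.
# right = FALSE alone leaves the top break open, so max(los) falls outside.
open_top <- cut(los, breaks = breaks, right = FALSE)

# Adding include.lowest = TRUE closes the last interval at max(los).
closed_top <- cut(los, breaks = breaks, right = FALSE, include.lowest = TRUE)

# Side-by-side view: the first three rows are NA in open_top only.
data.frame(los, open_top, closed_top)
```

Note that cut() sorts the breaks before using them, so 0.8 and 0.16 are applied as 0.16 and 0.8; the labels are matched to the sorted intervals in order.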

R loop to iterate and find unique combination between each item

concept_id concept_name event
1: 443387 Malignant tumor of stomach comorb
2: 4193704 Type 2 diabetes mellitus without complication comorb
3: 4095320 Malignant tumor of body of stomach comorb
4: 201826 Type 2 diabetes mellitus comorb
5: 4174977 Retinopathy due to diabetes mellitus comorb
For the above data, I am trying to create a list of combinations for concept_ids. There are 5 concept ids so when we iterate each concept_id with another concept_id we get a list something like this.
nrow(comorb_event)
for (i in (1:nrow(comorb_event))) {
for (j in (1:nrow(comorb_event))){
print(paste(i,j))
}
}
[1] "1 1"
[1] "1 2"
[1] "1 3"
[1] "1 4"
[1] "1 5"
[1] "2 1"
[1] "2 2"
[1] "2 3"
[1] "2 4"
[1] "2 5"
[1] "3 1"
[1] "3 2"
[1] "3 3"
[1] "3 4"
[1] "3 5"
[1] "4 1"
[1] "4 2"
[1] "4 3"
[1] "4 4"
[1] "4 5"
[1] "5 1"
[1] "5 2"
[1] "5 3"
[1] "5 4"
[1] "5 5"
My output is not what I expect. Since item [1,1] are same items we can avoid that, and similarly item [2,1] is already covered by [1,2] we can remove that too. The expected list would be something like this after removing the redundant combinations:
[1] "1 2"
[1] "1 3"
[1] "1 4"
[1] "1 5"
[1] "2 3"
[1] "2 4"
[1] "2 5"
[1] "3 4"
[1] "3 5"
[1] "4 5"
Sample data
structure(list(concept_id = c("443387", "4193704", "4095320",
"201826", "4174977"), concept_name = c("Malignant tumor of stomach",
"Type 2 diabetes mellitus without complication", "Malignant tumor of body of stomach",
"Type 2 diabetes mellitus", "Retinopathy due to diabetes mellitus"
), event = structure(c(1L, 1L, 1L, 1L, 1L), .Label = c("comorb",
"drug", "primary_dx"), class = "factor")), class = c("data.table",
"data.frame"), row.names = c(NA, -5L), .internal.selfref = <pointer: 0x5642431689a0>)
We need combn, which returns every unordered pair of row indices exactly once:
t(combn(seq_len(nrow(comorb_event)), 2))
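If the goal is pairs of the concept_ids themselves rather than row indices, combn can be applied to the ID vector directly. A minimal sketch, assuming the five IDs from the sample data; the column names id_a and id_b are hypothetical.

```r
# concept_id values copied from the sample data in the question.
concept_id <- c("443387", "4193704", "4095320", "201826", "4174977")

# Index pairs, as in the answer: one row per unordered pair (i < j).
idx_pairs <- t(combn(seq_along(concept_id), 2))

# The same pairs expressed as concept_ids.
id_pairs <- t(combn(concept_id, 2))
colnames(id_pairs) <- c("id_a", "id_b")  # hypothetical column names

nrow(id_pairs)   # choose(5, 2) pairs
head(id_pairs, 3)
```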

Formatting grouped data for tables in R

I'm trying to display my data in table format and I can't figure out how to rearrange my data to display it in the proper format. I'm used to wrangling data for plots, but I'm finding myself a little lost when it comes to preparing tables. This seems like something really basic, but I haven't been able to find an explanation on what I'm doing wrong here.
I have 3 columns of data, Type, Year, and n. The data formatted as it is now produces a table that looks like this:
Type Year n
Type C 1 5596
Type D 1 1119
Type E 1 116
Type A 1 402
Type F 1 1614
Type B 1 105
Type C 2 26339
Type D 2 14130
Type E 2 98
Type A 2 3176
Type F 2 3071
Type B 2 88
What I want to do is to have Type as row names, Year as column names, and n populating the table contents like this:
        1     2
Type A  402   3176
Type B  105   88
Type C  5596  26339
Type D  1119  14130
Type E  116   98
Type F  1614  3071
The mistake might have been made upstream from this point. Using the full original data set I arrived at this output by doing the following:
exampletable <- df %>%
group_by(Year) %>%
count(Type) %>%
select(Type, Year, n)
Here is the dput() output
structure(list(Type = c("Type C", "Type D", "Type E", "Type A",
"Type F", "Type B", "Type C", "Type D", "Type E", "Type A", "Type F",
"Type B", "Type C", "Type D", "Type E", "Type A", "Type F", "Type B",
"Type C", "Type D", "Type E", "Type A", "Type F", "Type B", "Type C",
"Type D", "Type E"), Year = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5), n = c(5596,
1119, 116, 402, 1614, 105, 26339, 14130, 98, 3176, 3071, 88,
40958, 17578, 104, 3904, 3170, 102, 33145, 23800, 93, 1264, 7084,
1262, 34642, 24911, 504)), class = c("spec_tbl_df", "tbl_df",
"tbl", "data.frame"), row.names = c(NA, -27L), spec = structure(list(
cols = list(Type = structure(list(), class = c("collector_character",
"collector")), Year = structure(list(), class = c("collector_double",
"collector")), n = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1), class = "col_spec"))
You can get the data in wide format and change Type column to rowname.
tidyr::pivot_wider(df, names_from = Year, values_from = n) %>%
tibble::column_to_rownames('Type')
# 1 2 3 4 5
#Type C 5596 26339 40958 33145 34642
#Type D 1119 14130 17578 23800 24911
#Type E 116 98 104 93 504
#Type A 402 3176 3904 1264 NA
#Type F 1614 3071 3170 7084 NA
#Type B 105 88 102 1262 NA
You can use tidyr package to get to wider format and tibble package to convert a column to rownames
dataset <- read.csv(file_location)
dataset <- tidyr::pivot_wider(dataset, names_from = Year, values_from = n)
tibble::column_to_rownames(dataset, var = 'Type')
1 2
Type C 5596 26339
Type D 1119 14130
Type E 116 98
Type A 402 3176
Type F 1614 3071
Type B 105 88
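For comparison, base R can build the same Type-by-Year layout without extra packages. This is only a sketch of an alternative technique (xtabs), not what the answers above use, and it rebuilds just the first two years of the question's data by hand.

```r
# First two years of the data from the question's dput() output.
df <- data.frame(
  Type = rep(c("Type C", "Type D", "Type E", "Type A", "Type F", "Type B"), times = 2),
  Year = rep(c(1, 2), each = 6),
  n    = c(5596, 1119, 116, 402, 1614, 105,
           26339, 14130, 98, 3176, 3071, 88)
)

# Sum n for each Type/Year cell; rows come out sorted by Type.
wide <- xtabs(n ~ Type + Year, data = df)
wide

# Optional: convert the contingency table to a plain data frame,
# keeping Type as row names and the years as columns.
wide_df <- as.data.frame.matrix(wide)
```

One difference to keep in mind: xtabs fills Type/Year combinations that are absent from the data with 0, whereas pivot_wider leaves them as NA.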

Renaming labels of a factor in R

I have census data of Male and Female populations organized by age group:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
if (!file.exists("./datafiles/cc-est2018-alldata-54.csv"))
download.file(url, destfile = "./datafiles/cc-est2018-alldata-54.csv", mode = "wb")
popSample <- read.csv("./datafiles/cc-est2018-alldata-54.csv") %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
popSample$AGEGRP <- as.factor(popSample$AGEGRP)
I then plot the Male and Female population relationships, faceted by age group (1-18, which is currently treated as an int):
g <- ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups", x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
g
Which results in this plot: https://share.getcloudapp.com/v1ur6O4e
The problem: I am trying to convert the column AGEGRP from ‘int’ to ‘factor’, and change the factors labels from “1”, “2”, “3”, … “18” to "AgeGroup1", "AgeGroup2", "AgeGroup3", … "AgeGroup18"
When I try this code, my AGEGRP column's observation values are all replaced with NAs:
popSample$AGEGRP <- factor(popSample$AGEGRP, levels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
https://share.getcloudapp.com/qGuo1O4y
Thank you for your help,
popSample$AGEGRP <- factor(popSample$AGEGRP, labels = c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+"))
Use labels rather than levels here, and supply a label for every level.
Alternatively
levels(popSample$AGEGRP) <- c("0 to 4", "5 to 9", "10 to 14", "15 to 19", "20 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44", "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
should work as well.
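The NAs in the question come from the difference between levels and labels in factor(): levels must match values that already occur in the data, while labels are the new names given to those levels. A toy sketch with three made-up age codes (age_codes and age_names are illustrative names, not from the census file):

```r
# Toy stand-in for AGEGRP: integer codes 1..3.
age_codes <- c(1L, 2L, 3L, 2L)
age_names <- c("0 to 4", "5 to 9", "10 to 14")

# levels = age_names asks factor() to look for the strings "0 to 4" etc.
# among the values 1, 2, 3. None match, so every element becomes NA.
wrong <- factor(age_codes, levels = age_names)

# levels = the existing values, labels = what to call them.
right <- factor(age_codes, levels = 1:3, labels = age_names)

wrong
right
```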
Read in the csv again:
library(tidyverse)
url <- "https://www2.census.gov/programs-surveys/popest/datasets/2010-2018/counties/asrh/cc-est2018-alldata-54.csv"
popSample <- read.csv(url) %>%
filter(AGEGRP != 0 & YEAR == 1) %>%
select("STNAME", "CTYNAME", "AGEGRP", "TOT_POP", "TOT_MALE", "TOT_FEMALE")
If you just want to add a prefix "AgeGroup" to your facet labels, you do:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP,labeller=labeller(AGEGRP = function(i)paste0("AgeGroup",i))) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
If you need new factor levels, then you need to rebuild the factor (as in @Annet's answer):
lvls = c("0 to 4", "5 to 9", "10 to 14", "15 to 19",
"20 to 24", "25 to 29", "30 to 34", "35 to 39",
"40 to 44", "45 to 49", "50 to 54", "55 to 59",
"60 to 64", "65 to 69", "70 to 74", "75 to 79", "80 to 84", "85+")
# this works on the factor you already created (indexing uses its integer codes);
# if you re-read the csv, you can skip the earlier as.factor() step
popSample$AGEGRP = factor(lvls[popSample$AGEGRP],levels=lvls)
Then plot:
ggplot(popSample, aes(x=TOT_MALE, y=TOT_FEMALE)) +
geom_point(alpha = 0.5, colour="darkblue") +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~AGEGRP) +
stat_smooth(method = "lm", col = "darkred", size=.75) +
labs(title = "F vs. M Population across all Age Groups",
x = "Total Male (log10)", y = "Total Female (log10)") +
theme_light()
To change all the factor labels with one function, you can use forcats::fct_relabel (forcats ships as part of the tidyverse, which you've already got loaded). The changed factor labels will carry over to the plot facets and the order stays the same.
First few entries:
# before relabelling
popSample$AGEGRP[1:4]
#> [1] 1 2 3 4
#> Levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
# after relabelling
forcats::fct_relabel(popSample$AGEGRP, ~paste0("AgeGroup", .))[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18
Or with base R, reassign the levels:
levels(popSample$AGEGRP) <- paste0("AgeGroup", levels(popSample$AGEGRP))
popSample$AGEGRP[1:4]
#> [1] AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4
#> 18 Levels: AgeGroup1 AgeGroup2 AgeGroup3 AgeGroup4 AgeGroup5 ... AgeGroup18

Combine two data frames by one variable and combining columns under one main header

I want to combine two dataframes T2 and T4 by variable "Industry" and the columns of each data set with one main heading. So in the final output table I want columns Industry, three columns of T2 under one column heading "Executive" and three other columns of T4 as sub-columns of one heading "management".
T2
Industry percentage_Yes percentage_No Total_responses
1 ALL 94 % 6 % 117
2 Banking/Financial Services 83 % 17 % 6
3 Chemicals 100 % 0 % 5
4 Consumer Goods 75 % 25 % 8
5 Energy 89 % 11 % 9
6 High Tech 100 % 0 % 8
7 Insurance/Reinsurance 100 % 0 % 14
8 Life Sciences 100 % 0 % 11
9 Logistics -- -- 3
10 Mining & Metals -- -- 1
11 Other Manufacturing 100 % 0 % 11
12 Other Non-Manufacturing -- -- 3
13 Retail & Wholesale 100 % 0 % 12
14 Services (Non-Financial) 88 % 12 % 24
15 Transportation Equipment -- -- 2
16 <NA> -- -- 0
T4
Industry percentage_Yes percentage_No Total_responses
1 ALL 96 % 4 % 121
2 Banking/Financial Services 86 % 14 % 7
3 Chemicals 100 % 0 % 5
4 Consumer Goods 100 % 0 % 8
5 Energy 100 % 0 % 9
6 High Tech 100 % 0 % 9
7 Insurance/Reinsurance 93 % 7 % 15
8 Life Sciences 91 % 9 % 11
9 Logistics -- -- 3
10 Mining & Metals -- -- 1
11 Other Manufacturing 100 % 0 % 12
12 Other Non-Manufacturing -- -- 3
13 Retail & Wholesale 100 % 0 % 12
14 Services (Non-Financial) 92 % 8 % 24
15 Transportation Equipment -- -- 2
16 <NA> -- -- 0
> dput(T2)
structure(list(Industry = c("ALL", "Banking/Financial Services",
"Chemicals", "Consumer Goods", "Energy", "High Tech", "Insurance/Reinsurance",
"Life Sciences", "Logistics", "Mining & Metals", "Other Manufacturing",
"Other Non-Manufacturing", "Retail & Wholesale", "Services (Non-Financial)",
"Transportation Equipment", NA), percentage_Yes = c("94 %", "83 %",
"100 %", "75 %", "89 %", "100 %", "100 %", "100 %", "--", "--",
"100 %", "--", "100 %", "88 %", "--", "--"), percentage_No = c("6 %",
"17 %", "0 %", "25 %", "11 %", "0 %", "0 %", "0 %", "--", "--",
"0 %", "--", "0 %", "12 %", "--", "--"), Total_responses = c(117,
6, 5, 8, 9, 8, 14, 11, 3, 1, 11, 3, 12, 24, 2, 0)), class = "data.frame", row.names = c(NA,
-16L), .Names = c("Industry", "percentage_Yes", "percentage_No",
"Total_responses"))
> dput(T4)
structure(list(Industry = c("ALL", "Banking/Financial Services",
"Chemicals", "Consumer Goods", "Energy", "High Tech", "Insurance/Reinsurance",
"Life Sciences", "Logistics", "Mining & Metals", "Other Manufacturing",
"Other Non-Manufacturing", "Retail & Wholesale", "Services (Non-Financial)",
"Transportation Equipment", NA), percentage_Yes = c("96 %", "86 %",
"100 %", "100 %", "100 %", "100 %", "93 %", "91 %", "--", "--",
"100 %", "--", "100 %", "92 %", "--", "--"), percentage_No = c("4 %",
"14 %", "0 %", "0 %", "0 %", "0 %", "7 %", "9 %", "--", "--",
"0 %", "--", "0 %", "8 %", "--", "--"), Total_responses = c(121,
7, 5, 8, 9, 9, 15, 11, 3, 1, 12, 3, 12, 24, 2, 0)), class = "data.frame", row.names = c(NA,
-16L), .Names = c("Industry", "percentage_Yes", "percentage_No",
"Total_responses"))
I have tried tabular, but then I'm getting the Industry column twice:
library("tables")
st<-rbind(data.frame(T2, Employee_Level = 'Exe', what = factor(rownames(T2), levels = rownames(T2)),
row.names= NULL, check.names = FALSE),
data.frame(T4,Employee_Level = 'Mgmt',what = factor(rownames(T4), levels = rownames(T4)),
row.names = NULL,check.names = FALSE))
mytable <- tabular(Heading()*what ~ Employee_Level*(`Industry`+`percentage_Yes`+`percentage_No`+`Total_responses`)*Heading()*(identity),data=st)
latex(mytable)
Here's one way using (my) huxtable package:
library(huxtable)
my_data <- cbind(T2, T4)[, c(1:4, 6:8)]
my_hux <- as_hux(my_data, add_colnames = TRUE)
my_hux <- insert_row(my_hux, rep("", 7))
my_hux[1, 2] <- "Executive"
my_hux[1, 5] <- "Management"
colspan(my_hux)[1, 2] <- 3
colspan(my_hux)[1, 5] <- 3
my_hux[2, 2:7] <- rep(c("% yes", "% no", "Total responses"), 2)
number_format(my_hux) <- 0
# This should look like what you want:
my_hux
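The cbind step above relies on T2 and T4 listing the industries in the same row order. If that cannot be guaranteed, one option is to join on Industry first. The following is a base R sketch using merge, with an abbreviated three-row subset of each table and the T4 rows deliberately shuffled; the suffixes are chosen here for illustration.

```r
# Abbreviated subsets of T2 and T4 from the question (three industries each).
T2 <- data.frame(
  Industry        = c("ALL", "Chemicals", "Energy"),
  percentage_Yes  = c("94 %", "100 %", "89 %"),
  Total_responses = c(117, 5, 9)
)
T4 <- data.frame(
  Industry        = c("Energy", "ALL", "Chemicals"),  # different row order
  percentage_Yes  = c("100 %", "96 %", "100 %"),
  Total_responses = c(9, 121, 5)
)

# Join by Industry; columns that share a name get the given suffixes.
combined <- merge(T2, T4, by = "Industry",
                  suffixes = c("_Executive", "_Management"))
combined
```

The merged data frame could then be passed to as_hux in place of my_data, with the column indices adjusted to match.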

Unlist data frame column and pasting them together

I have a dataframe as defined below:
df <- structure(list(ID = 1:19, MEDICATION = c("0", "NOVOMIX 26 BF, 20 D",
"NOVOMIX 14 D", "NOVOMIX 34 BF 22 D", "MIXTARD 52 BF 20 D", "MIXTARD 40 BF 24 D",
"MIXTARD 10 BF 8 D", "MIXTARD 42 BF 24 D", "MIXTARD 20 BF 18 D",
"MIXTARD 82 BF 46 D", "MIXTARD 14 BF 10 D", "NOVOMIX 15 BF 15 D",
"MIXTARD", NA, "MIXTARD 10 BF 4 D", "NOVOMIX", "MIXTARD --> NOVOMIX",
"NOT GIVEN ANY DIABETES MEDICATION INPATIENT PATIENT NORMALLY ON METFORMIN",
"GIVEN ASPART")), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -19L), .Names = c("ID", "MEDICATION"))
I would like to extract all the medications (i.e. NOVOMIX, MIXTARD, METFORMIN, ASPART) from the MEDICATION variable in the dataframe and paste them together. I wrote my code as follows:
library(tidyverse)
library(rebus)
df %>%
mutate(MEDICATION2 = str_extract_all(MEDICATION, pattern =
or1(c("NOVOMIX", "MIXTARD", "METFORMIN", "ASPART")))) %>%
unnest(MEDICATION2) %>%
group_by(ID) %>%
mutate(MEDICATION2 = str_c(unlist(MEDICATION2), collapse = " - ")) %>%
slice(1)
My expected output is:
df_out <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19), MEDICATION = c("0", "NOVOMIX 26 BF, 20 D",
"NOVOMIX 14 D", "NOVOMIX 34 BF 22 D", "MIXTARD 52 BF 20 D", "MIXTARD 40 BF 24 D",
"MIXTARD 10 BF 8 D", "MIXTARD 42 BF 24 D", "MIXTARD 20 BF 18 D",
"MIXTARD 82 BF 46 D", "MIXTARD 14 BF 10 D", "NOVOMIX 15 BF 15 D",
"MIXTARD", NA, "MIXTARD 10 BF 4 D", "NOVOMIX", "MIXTARD --> NOVOMIX",
"NOT GIVEN ANY DIABETES MEDICATION INPATIENT PATIENT NORMALLY ON METFORMIN",
"GIVEN ASPART"), MEDICATION2 = c(NA, "NOVOMIX", "NOVOMIX", "NOVOMIX",
"MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD", "MIXTARD",
"MIXTARD", "NOVOMIX", "MIXTARD", NA, "MIXTARD", "NOVOMIX", "MIXTARD - NOVOMIX",
"METFORMIN", "ASPART")), .Names = c("ID", "MEDICATION", "MEDICATION2"
), row.names = c(NA, -19L), class = "data.frame")
The problem is that the code removed the row with MEDICATION == 0, and I think my code is too long for a simple extraction of strings. I would like to ask for help if you know how this code can be shortened (if possible).
We can use stri_extract_all_regex from the stringi package to extract all the words that match the pattern.
library(stringi)
med_pattern <- c("NOVOMIX|MIXTARD|METFORMIN|ASPART")
df$MEDICATION2 <- stri_extract_all_regex(df$MEDICATION, pattern = med_pattern)
As mentioned by @mt1022, the new column is a list. We may paste them together with
df$MEDICATION2<-paste(stri_extract_all_regex(df$MEDICATION,pattern = med_pattern))
However, that will leave some unwanted characters for list elements with more than one match. This should give you the expected output.
chars <- stri_extract_all_regex(df$MEDICATION, pattern = med_pattern)
df$MEDICATION2 <- sapply(chars, paste, collapse = "-")
df$MEDICATION2
#[1] "NA" "NOVOMIX" "NOVOMIX" "NOVOMIX"
#[5] "MIXTARD" "MIXTARD" "MIXTARD" "MIXTARD"
#[9] "MIXTARD" "MIXTARD" "MIXTARD" "NOVOMIX"
#[13] "MIXTARD" "NA" "MIXTARD" "NOVOMIX"
#[17] "MIXTARD-NOVOMIX" "METFORMIN" "ASPART"
You can also do this in a single line:
df$MEDICATION2 <- sapply(stri_extract_all_regex(df$MEDICATION,
pattern = med_pattern), paste, collapse = "-")
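The output above contains the string "NA" where the question's expected output has a real missing value, and it joins with "-" rather than " - ". A possible way to get closer to df_out is sketched below in base R (regmatches and gregexpr instead of stringi, so this is a swapped-in technique); collapse_meds is a hypothetical helper name, and the input is a five-element excerpt of the MEDICATION column.

```r
# Hypothetical helper: extract every match of `pattern` from each string and
# join the matches with `sep`; strings with no match (or NA) give NA.
collapse_meds <- function(x, pattern, sep = " - ") {
  out <- rep(NA_character_, length(x))
  ok <- !is.na(x)
  found <- regmatches(x[ok], gregexpr(pattern, x[ok]))
  out[ok] <- vapply(
    found,
    function(m) if (length(m) == 0) NA_character_ else paste(m, collapse = sep),
    character(1)
  )
  out
}

med_pattern <- "NOVOMIX|MIXTARD|METFORMIN|ASPART"

# A few values taken from the MEDICATION column in the question.
medication <- c("0", "NOVOMIX 26 BF, 20 D", "MIXTARD --> NOVOMIX", NA, "GIVEN ASPART")

collapse_meds(medication, med_pattern)
```

Applied to the full data, this would be df$MEDICATION2 <- collapse_meds(df$MEDICATION, med_pattern), which keeps all 19 rows, including the one with MEDICATION == "0".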
