Group levels of an existing factor - r

A variable in my dataset looks like this:
df <- data.frame(Month = factor(c(sample(1:12, 15, replace = TRUE),
                                  sample(c("Apr", "May"), 5, replace = TRUE))))
The levels Apr and May were entered later by a different person and were therefore stored as month names. How do I get rid of these separate levels and group those values under the existing levels 4 and 5, respectively? Or, conversely, how do I store all values as month names instead of numbers?

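You can match with month.abb. Because Month is a factor, convert it to character before assigning (otherwise a missing "4" or "5" level would give NA) and rebuild the factor afterwards, i.e.
i1 <- match(df$Month, month.abb)          # NA for the numeric entries
df$Month <- as.character(df$Month)        # drop the factor so any value can be assigned
df$Month[!is.na(i1)] <- i1[!is.na(i1)]    # "Apr" -> 4, "May" -> 5
df$Month <- factor(df$Month, levels = 1:12)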
df
# Month
#1 5
#2 2
#3 7
#4 12
#5 5
#6 12
#7 4
#8 6
#9 7
#10 10
#11 9
#12 4
#13 11
#14 10
#15 3
#16 4
#17 5
#18 4
#19 4
#20 4
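For the converse (storing every value as a month name), here is a minimal base R sketch. It assumes the numeric entries are always whole numbers from 1 to 12; the set.seed() call and the helper names m and is_num are only there for illustration.

```r
set.seed(1)  # assumption: fixed seed so the example is reproducible
df <- data.frame(Month = factor(c(sample(1:12, 15, replace = TRUE),
                                  sample(c("Apr", "May"), 5, replace = TRUE))))

m <- as.character(df$Month)             # work on plain strings
is_num <- grepl("^[0-9]+$", m)          # entries stored as month numbers
m[is_num] <- month.abb[as.integer(m[is_num])]  # 4 -> "Apr", 5 -> "May", ...
df$Month <- factor(m, levels = month.abb)      # calendar-ordered levels
df
```

Using month.abb as the level set keeps the levels in calendar order rather than alphabetical order.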


Convert data from wide format to long format with multiple measure columns [duplicate]

I want to go from wide to long format while keeping several measure columns. Say my dataset looks like this:
ID X_1990 X_2000 X_2010 Y_1990 Y_2000 Y_2010
A 1 4 7 10 13 16
B 2 5 8 11 14 17
C 3 6 9 12 15 18
but with a lot more measure variables (i.e. also Z_1990, etc.). How can I get it so that the year becomes a variable and it will keep the different measures, like this:
ID Year X Y
A 1990 1 10
A 2000 4 13
A 2010 7 16
B 1990 2 11
B 2000 5 14
B 2010 8 17
C 1990 3 12
C 2000 6 15
C 2010 9 18
You may use pivot_longer with names_sep argument.
tidyr::pivot_longer(df, cols = -ID, names_to = c('.value', 'Year'), names_sep = '_')
# ID Year X Y
# <chr> <chr> <int> <int>
#1 A 1990 1 10
#2 A 2000 4 13
#3 A 2010 7 16
#4 B 1990 2 11
#5 B 2000 5 14
#6 B 2010 8 17
#7 C 1990 3 12
#8 C 2000 6 15
#9 C 2010 9 18
data
It is easier to help if you provide data in a reproducible format
df <- structure(list(ID = c("A", "B", "C"), X_1990 = 1:3, X_2000 = 4:6,
X_2010 = 7:9, Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18),
row.names = c(NA, -3L), class = "data.frame")
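If tidyr is not available, a base R sketch with reshape() gives the same shape. It assumes every measure column follows the name_year pattern with a single underscore; the object name long is only illustrative.

```r
df <- data.frame(ID = c("A", "B", "C"),
                 X_1990 = 1:3, X_2000 = 4:6, X_2010 = 7:9,
                 Y_1990 = 10:12, Y_2000 = 13:15, Y_2010 = 16:18)

# reshape() splits each column name at "_" into a measure name and a time value
long <- reshape(df, direction = "long", idvar = "ID",
                varying = names(df)[-1], sep = "_", timevar = "Year")

# reshape() stacks by year, so sort by ID and Year and drop the row names
long <- long[order(long$ID, long$Year), ]
rownames(long) <- NULL
long
```

Further measures such as Z_1990 need no extra code as long as they follow the same naming pattern, because varying simply takes every column except ID.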

How to reorder column values in ascending order that are separated by "," and only keep the first value in R

I have a column in a df that consists of values like so:
ID
2
NA
1
3
4
5,7
9,6,10
12
15
16
17
NA
19
22,23
I would like to reorder every row based on ascending order. Note - this column is a "character" based field and some rows are already in the correct order.
From there, I only want to keep the first value and remove the others.
Desired output:
ID
2
NA
1
3
4
5
6
12
15
16
17
NA
19
22
You can split the data on comma, sort them and extract the 1st value.
df$ID <- sapply(strsplit(df$ID, ','), function(x) sort(as.numeric(x))[1])
# ID
#1 2
#2 NA
#3 1
#4 3
#5 4
#6 5
#7 6
#8 12
#9 15
#10 16
#11 17
#12 NA
#13 19
#14 22
A couple of tidyverse alternatives.
library(tidyverse)
#1.
#Same as base R but in tidyverse
df %>% mutate(ID = map_dbl(str_split(ID, ','), ~sort(as.numeric(.x))[1]))
#2.
df %>%
mutate(row = row_number()) %>%
separate_rows(ID, sep = ',', convert = TRUE) %>%
group_by(row) %>%
summarise(ID = min(ID)) %>%
select(-row)
Here is another tidyverse solution, making use of dplyr, purrr, stringr and readr:
library(tidyverse)
df %>%
  mutate(ID = map_chr(str_split(ID, ","), ~ toString(sort(as.numeric(.x)))),  # sorted values, comma-joined
         ID = parse_number(ID))                                              # keep the first number only
output:
ID
1 2
2 NA
3 1
4 3
5 4
6 5
7 6
8 12
9 15
10 16
11 17
12 NA
13 19
14 22
We may use the minimum instead of sorting / extracting:
DF <- transform(DF, ID=sapply(strsplit(ID, ','), \(x) min(as.double(x))))
DF
# ID
# 1 2
# 2 NA
# 3 1
# 4 3
# 5 4
# 6 5
# 7 6
# 8 12
# 9 15
# 10 16
# 11 17
# 12 NA
# 13 19
# 14 22
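We could use str_extract. Note that it keeps the first number as written, without sorting, so row 7 ("9,6,10") gives 9 rather than the requested 6.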
library(stringr)
library(dplyr)
df1 %>%
mutate(ID = as.numeric(str_extract(ID, '\\d+')))
-output
ID
1 2
2 NA
3 1
4 3
5 4
6 5
7 9
8 12
9 15
10 16
11 17
12 NA
13 19
14 22
data
df1 <- structure(list(ID = c("2", NA, "1", "3", "4", "5,7", "9,6,10",
"12", "15", "16", "17", NA, "19", "22,23")), class = "data.frame", row.names = c(NA,
-14L))
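The split / convert / take-the-smallest idea used in the answers above can be wrapped in a small base R helper. This is only a sketch: first_sorted is a hypothetical name, and it assumes every non-missing piece is a plain number.

```r
# Hypothetical helper: smallest number in each comma-separated string
first_sorted <- function(x) {
  vapply(strsplit(x, ",", fixed = TRUE), function(v) {
    v <- as.numeric(v)
    # an NA input yields a single NA after strsplit(); keep it as NA
    if (all(is.na(v))) NA_real_ else min(v, na.rm = TRUE)
  }, numeric(1))
}

ids <- c("2", NA, "1", "3", "4", "5,7", "9,6,10",
         "12", "15", "16", "17", NA, "19", "22,23")
first_sorted(ids)
```

vapply() with numeric(1) guarantees a numeric vector of the same length as the input, which is safer than sapply() when assigning back into a data frame column.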

ICD-10 Variable recoding

I need to do some recoding on a database.
It is a medico-administrative database (so a large one), and I need to recode the diagnoses, which are coded in ICD-10. The example below is only an illustration.
ID<-(1:15)
Diag<-c("A001","A002","A003","A004","B001","B002","B003",
"C001","C002","C003","C004","C005","C006","C007","C008")
Age<-round(rnorm(15,25,10))
DATA<-data.frame(ID,Diag,Age)
So I would like to:
code as "Disease 1" all modalities of "Diag" that begin with "A" and "B".
code the modalities from C001 to C004 as "Disease 2".
code the modalities from C005 to C008 as "Disease 3".
We can use case_when
library(dplyr)
library(stringr)
DATA %>%
mutate(new = case_when(str_sub(Diag, 1, 1) %in% c('A', 'B') ~
'Disease 1',
Diag %in% str_c('C00', 1:4) ~ 'Disease 2',
TRUE ~ 'Disease 3'))
# ID Diag Age new
#1 1 A001 9 Disease 1
#2 2 A002 37 Disease 1
#3 3 A003 27 Disease 1
#4 4 A004 31 Disease 1
#5 5 B001 22 Disease 1
#6 6 B002 23 Disease 1
#7 7 B003 30 Disease 1
#8 8 C001 38 Disease 2
#9 9 C002 24 Disease 2
#10 10 C003 25 Disease 2
#11 11 C004 33 Disease 2
#12 12 C005 26 Disease 3
#13 13 C006 45 Disease 3
#14 14 C007 20 Disease 3
#15 15 C008 22 Disease 3
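In the case_when call above, the final TRUE branch labels every remaining code as "Disease 3", including codes outside C005 to C008. A base R sketch that spells out each range and leaves anything else as NA could look like this; recode_diag is a hypothetical helper name.

```r
Diag <- c("A001", "A002", "A003", "A004", "B001", "B002", "B003",
          "C001", "C002", "C003", "C004", "C005", "C006", "C007", "C008")

# Hypothetical helper: explicit ranges, unmatched codes stay NA
recode_diag <- function(diag) {
  out <- rep(NA_character_, length(diag))
  out[substr(diag, 1, 1) %in% c("A", "B")] <- "Disease 1"
  out[diag %in% sprintf("C%03d", 1:4)] <- "Disease 2"   # C001 to C004
  out[diag %in% sprintf("C%03d", 5:8)] <- "Disease 3"   # C005 to C008
  out
}

table(recode_diag(Diag), useNA = "ifany")
```

On a large administrative database, leaving unexpected codes as NA makes them easy to spot instead of silently folding them into the last category.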

How to split a data set with duplicated information based on date

I have this situation:
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
3 2014-07-11 56
3 NA 34
4 2014-10-05 25
4 2014-08-09 14
5 NA NA
5 NA NA
And I would like to split the dataset into two, like this:
1-
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
4 2014-10-05 25
4 2014-08-09 14
2- Lowest Date
ID date Weight
3 2014-07-11 56
3 NA 34
5 NA NA
5 NA NA
I tried this for the second dataset:
dt <- dt[order(dt$ID, dt$date), ]
dt.2=dt[duplicated(dt$ID), ]
but it didn't work.
Get the IDs that have an NA date, then subset on them:
NA_ids <- unique(df$ID[is.na(df$date)])
subset(df, !ID %in% NA_ids)
# ID date Weight
#1 1 2014-12-02 23
#2 1 2014-10-02 25
#3 2 2014-11-03 27
#4 2 2014-09-03 45
#7 4 2014-10-05 25
#8 4 2014-08-09 14
subset(df, ID %in% NA_ids)
# ID date Weight
#5 3 2014-07-11 56
#6 3 <NA> 34
#9 5 <NA> NA
#10 5 <NA> NA
Using dplyr, we can create a new column that is TRUE/FALSE for each ID depending on whether it has any NA date, and then use group_split to split into a list of two. (The keep argument of group_split is deprecated in current dplyr in favour of .keep.)
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(NA_ID = any(is.na(date))) %>%
  ungroup() %>%
  group_split(NA_ID, .keep = FALSE)
The above dplyr logic can also be implemented in base R by using ave and split
df$NA_ID <- with(df, ave(is.na(date), ID, FUN = any))
split(df[-4], df$NA_ID)
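For a copy-and-paste check, here is a self-contained base R sketch of the same idea using the data from the question. The list names "complete" and "with_na" are illustrative labels, not part of the original answer.

```r
df <- data.frame(
  ID = rep(1:5, each = 2),
  date = as.Date(c("2014-12-02", "2014-10-02", "2014-11-03", "2014-09-03",
                   "2014-07-11", NA, "2014-10-05", "2014-08-09", NA, NA)),
  Weight = c(23, 25, 27, 45, 56, 34, 25, 14, NA, NA)
)

# TRUE for every row whose ID has at least one missing date
has_na <- df$ID %in% df$ID[is.na(df$date)]

# split into a named list of two data frames
parts <- split(df, ifelse(has_na, "with_na", "complete"))
parts
```

This flags whole IDs rather than single rows, which is why ID 3 lands in the second data set even though only one of its dates is missing.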

Left join (or equivalent) to number index by group

I have a sequence of numbers (days):
dayNum <- c(1:10)
And I have a dataframe of id, day, and event:
id = c("aa", "aa", "aa", "bb", "bb", "cc")
day = c(1, 2, 3, 1, 6, 2)
event = c("Y", "Y", "Y", "Y", "Y", "Y")
df = data.frame(id, day, event)
Which looks like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
bb 1 Y
bb 6 Y
cc 2 Y
I am trying to put this dataframe into a form that resembles left joining dayNum with df for each id. That is, even if id "aa" had no event on day 5, I should still get a row for "aa" on day 5 with N/A or something under event. Like this:
id day event
aa 1 Y
aa 2 Y
aa 3 Y
aa 4 N/A
aa 5 N/A
aa 6 N/A
aa 7 N/A
aa 8 N/A
aa 9 N/A
aa 10 N/A
bb 1 Y
bb 2 N/A
bb 3 N/A
bb 4 N/A
bb 5 N/A
bb 6 Y
bb 7 N/A
...etc
I can make this work using dplyr and left_join when my dataframe only contains one unique id, but I am stuck trying to make this work with a dataframe that has many different ids.
A push in the right direction would be greatly appreciated.
Thank you!
We can use expand.grid and merge. We create a new dataset using the unique 'id' of 'df' and the 'dayNum'. Then merge with the 'df' to get the expected output.
merge(expand.grid(id=unique(df$id), day=dayNum), df, all.x=TRUE)
# id day event
#1 aa 1 Y
#2 aa 2 Y
#3 aa 3 Y
#4 aa 4 <NA>
#5 aa 5 <NA>
#6 aa 6 <NA>
#7 aa 7 <NA>
#8 aa 8 <NA>
#9 aa 9 <NA>
#10 aa 10 <NA>
#11 bb 1 Y
#12 bb 2 <NA>
#13 bb 3 <NA>
#14 bb 4 <NA>
#15 bb 5 <NA>
#16 bb 6 Y
#17 bb 7 <NA>
#18 bb 8 <NA>
#19 bb 9 <NA>
#20 bb 10 <NA>
#21 cc 1 <NA>
#22 cc 2 Y
#23 cc 3 <NA>
#24 cc 4 <NA>
#25 cc 5 <NA>
#26 cc 6 <NA>
#27 cc 7 <NA>
#28 cc 8 <NA>
#29 cc 9 <NA>
#30 cc 10 <NA>
A similar option using data.table is to convert the 'data.frame' to a 'data.table' with setDT(df), set the 'key' columns, and join with the dataset derived from the cross join (CJ) of the unique 'id' values and 'dayNum'.
library(data.table)
setDT(df, key=c('id', 'day'))[CJ(id=unique(id), day=dayNum)]
