I need to do some recoding on a database.
It is a medico-administrative database (so a large one). I need to recode the diagnoses, which are coded in ICD-10. The example below is only an illustration.
ID<-(1:15)
Diag<-c("A001","A002","A003","A004","B001","B002","B003",
"C001","C002","C003","C004","C005","C006","C007","C008")
Age<-round(rnorm(15,25,10))
DATA<-data.frame(ID,Diag,Age)
So I would like to:
code as "Disease 1" all modalities of "Diag" that begin with "A" or "B".
code the modalities from C001 to C004 as "Disease 2".
code the modalities from C005 to C008 as "Disease 3".
We can use case_when
library(dplyr)
library(stringr)
DATA %>%
  mutate(new = case_when(
    str_sub(Diag, 1, 1) %in% c('A', 'B') ~ 'Disease 1',
    Diag %in% str_c('C00', 1:4) ~ 'Disease 2',
    TRUE ~ 'Disease 3'))
# ID Diag Age new
#1 1 A001 9 Disease 1
#2 2 A002 37 Disease 1
#3 3 A003 27 Disease 1
#4 4 A004 31 Disease 1
#5 5 B001 22 Disease 1
#6 6 B002 23 Disease 1
#7 7 B003 30 Disease 1
#8 8 C001 38 Disease 2
#9 9 C002 24 Disease 2
#10 10 C003 25 Disease 2
#11 11 C004 33 Disease 2
#12 12 C005 26 Disease 3
#13 13 C006 45 Disease 3
#14 14 C007 20 Disease 3
#15 15 C008 22 Disease 3
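Note that the TRUE ~ 'Disease 3' fallback labels every code not caught by the first two conditions, which may matter on a large database containing many other ICD-10 codes. A minimal sketch of a stricter variant, assuming unmatched codes should become NA (the extra code D001 is a made-up value, added only to show the fallback):

```r
library(dplyr)

# Example data from the question, plus one hypothetical unmatched code ("D001")
DATA <- data.frame(
  ID = 1:16,
  Diag = c("A001", "A002", "A003", "A004", "B001", "B002", "B003",
           "C001", "C002", "C003", "C004", "C005", "C006", "C007", "C008",
           "D001")
)

recoded <- DATA %>%
  mutate(new = case_when(
    grepl("^[AB]", Diag)            ~ "Disease 1",  # codes starting with A or B
    Diag %in% sprintf("C%03d", 1:4) ~ "Disease 2",  # C001 to C004
    Diag %in% sprintf("C%03d", 5:8) ~ "Disease 3",  # C005 to C008
    TRUE                            ~ NA_character_ # anything else stays unlabelled
  ))
```

Rows left as NA can then be inspected separately before deciding how to code them.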
I have data that look like this:
id <- c(rep(1,5), rep(2,5), rep(3,4), rep(4,2), rep(5, 1))
year <- c(1990,1991,1992,1993,1994,1990,1991,1992,1993,1994,1990,1991,1992,1994,1990,1994, 1994)
gender <- c(rep("female", 5), rep("male", 5), rep("male", 4), rep("female", 2), rep("male", 1))
dat <- data.frame(id,year,gender)
As you can see, id 1 and 2 have observations for every year between 1990 and 1994, while there are missing observations in between 1990 and 1994 for ids 3 and 4, and, finally, only one observation for id 5.
What I want to do is copy the columns id and gender and insert the missing observations for ids 3 and 4, so that there are observations from 1990 to 1994, while leaving ids 1, 2 and 5 unchanged. Is there a way to create a sequence of numbers from the oldest to the newest observation, on the condition that there is a gap between two numbers, grouped by a variable such as id?
The final result should look like this:
id year gender
<dbl> <dbl> <chr>
1 1 1990 female
2 1 1991 female
3 1 1992 female
4 1 1993 female
5 1 1994 female
6 2 1990 male
7 2 1991 male
8 2 1992 male
9 2 1993 male
10 2 1994 male
11 3 1990 male
12 3 1991 male
13 3 1992 male
14 3 1993 male
15 3 1994 male
16 4 1990 female
17 4 1991 female
18 4 1992 female
19 4 1993 female
20 4 1994 female
21 5 1994 male
Filter the dataset for ids 3 and 4, complete their observations, and bind the result to the rows of the other ids (those that are not 3 or 4).
library(dplyr)
library(tidyr)
complete_id <- c(3, 4)
dat %>%
  filter(id %in% complete_id) %>%
  complete(id, year = 1990:1994) %>%
  fill(gender) %>%
  bind_rows(dat %>% filter(!id %in% complete_id)) %>%
  arrange(id)
# id year gender
#1 1 1990 female
#2 1 1991 female
#3 1 1992 female
#4 1 1993 female
#5 1 1994 female
#6 2 1990 male
#7 2 1991 male
#8 2 1992 male
#9 2 1993 male
#10 2 1994 male
#11 3 1990 male
#12 3 1991 male
#13 3 1992 male
#14 3 1993 male
#15 3 1994 male
#16 4 1990 female
#17 4 1991 female
#18 4 1992 female
#19 4 1993 female
#20 4 1994 female
#21 5 1994 male
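If the ids with gaps are not known in advance, one possible generalisation (a sketch, not part of the answer above) is to complete each id between its own first and last year with tidyr::full_seq(). Grouping by gender as well assumes that gender is constant within an id, as it is in the example; the name filled is illustrative only:

```r
library(dplyr)
library(tidyr)

# Example data from the question
dat <- data.frame(
  id = c(rep(1, 5), rep(2, 5), rep(3, 4), rep(4, 2), 5),
  year = c(1990:1994, 1990:1994, 1990, 1991, 1992, 1994, 1990, 1994, 1994),
  gender = c(rep("female", 5), rep("male", 9), rep("female", 2), "male")
)

filled <- dat %>%
  group_by(id, gender) %>%                           # complete within each id
  complete(year = full_seq(year, period = 1)) %>%    # every year from min to max
  ungroup() %>%
  arrange(id, year)
```

Ids without gaps, and ids with a single observation such as id 5, are returned unchanged because their year sequence is already complete.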
I have this situation:
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
3 2014-07-11 56
3 NA 34
4 2014-10-05 25
4 2014-08-09 14
5 NA NA
5 NA NA
And I would like to split the dataset in two, like this:
1-
ID date Weight
1 2014-12-02 23
1 2014-10-02 25
2 2014-11-03 27
2 2014-09-03 45
4 2014-10-05 25
4 2014-08-09 14
2- Lowest Date
ID date Weight
3 2014-07-11 56
3 NA 34
5 NA NA
5 NA NA
I tried this for the second dataset:
dt <- dt[order(dt$ID, dt$date), ]
dt.2=dt[duplicated(dt$ID), ]
but it didn't work.
Get the IDs that have at least one NA date, and then subset based on them:
NA_ids <- unique(df$ID[is.na(df$date)])
subset(df, !ID %in% NA_ids)
# ID date Weight
#1 1 2014-12-02 23
#2 1 2014-10-02 25
#3 2 2014-11-03 27
#4 2 2014-09-03 45
#7 4 2014-10-05 25
#8 4 2014-08-09 14
subset(df, ID %in% NA_ids)
# ID date Weight
#5 3 2014-07-11 56
#6 3 <NA> 34
#9 5 <NA> NA
#10 5 <NA> NA
Using dplyr, we can create a new column that is TRUE/FALSE for each ID depending on the presence of an NA date, and then use group_split to split the data into a list of two data frames.
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(NA_ID = any(is.na(date))) %>%
  ungroup() %>%
  # `keep` is deprecated in recent dplyr versions; `.keep` is the current argument
  group_split(NA_ID, .keep = FALSE)
The above dplyr logic can also be implemented in base R by using ave and split
df$NA_ID <- with(df, ave(is.na(date), ID, FUN = any))
split(df[-4], df$NA_ID)
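For a self-contained check, here is a sketch of the base R approach on the example data. The data frame is rebuilt by hand from the question, and the names has_na, complete and with_missing are illustrative only:

```r
# Example data from the question
df <- data.frame(
  ID = rep(1:5, each = 2),
  date = as.Date(c("2014-12-02", "2014-10-02", "2014-11-03", "2014-09-03",
                   "2014-07-11", NA, "2014-10-05", "2014-08-09", NA, NA)),
  Weight = c(23, 25, 27, 45, 56, 34, 25, 14, NA, NA)
)

# TRUE for every row whose ID has at least one missing date
has_na <- ave(is.na(df$date), df$ID, FUN = any)

# split() orders the pieces as FALSE, then TRUE
parts <- split(df, has_na)
names(parts) <- c("complete", "with_missing")
```

Keeping the flag in a separate vector avoids adding a helper column to df and dropping it again afterwards.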
A variable in my dataset looks like this
df <- data.frame(Month = factor(c(sample(1:12, 15, replace = T),
sample(c("Apr", "May"), 5, replace = T))))
Now, the levels Apr & May were entered later by a different person, and so are stored as month names. How do I get rid of the separate levels and group those values under the already existing levels 4 & 5 respectively? Or, conversely, how do I store all values as month names instead of numbers?
You can match with month.abb, i.e.
i1 <- match(df$Month, month.abb)
df$Month[!is.na(i1)] <- i1[!is.na(i1)]
df
# Month
#1 5
#2 2
#3 7
#4 12
#5 5
#6 12
#7 4
#8 6
#9 7
#10 10
#11 9
#12 4
#13 11
#14 10
#15 3
#16 4
#17 5
#18 4
#19 4
#20 4
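One caveat: assigning numbers into a factor only works when the matching levels (here "4" and "5") already exist, and the unused levels Apr and May remain in the factor. A sketch that avoids both issues by going through character, and that also covers the reverse direction (month names instead of numbers). The seed and the names month_chr, month_num and MonthName are assumptions for illustration:

```r
set.seed(1)  # arbitrary seed, only to make the example reproducible
df <- data.frame(Month = factor(c(sample(1:12, 15, replace = TRUE),
                                  sample(c("Apr", "May"), 5, replace = TRUE))))

month_chr <- as.character(df$Month)

# Position in month.abb for names such as "Apr"; NA for values stored as numbers
month_num <- match(month_chr, month.abb)
is_number <- is.na(month_num)
month_num[is_number] <- as.integer(month_chr[is_number])

# Option 1: everything as month numbers, with a clean set of levels
df$Month <- factor(month_num, levels = 1:12)

# Option 2: everything as month names, in calendar order
df$MonthName <- factor(month.abb[month_num], levels = month.abb)
```

Declaring the levels explicitly keeps months that happen to be absent from the sample as valid (empty) levels.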
First time posting something here, forgive any missteps in my question.
In my example below I've got a data.frame where the unique identifier is the tripID with the name of the vessel, the species code, and a catch metric.
> testFrame1 <- data.frame('tripID' = c(1,1,2,2,3,4,5),
'name' = c('SS Anne','SS Anne', 'HMS Endurance', 'HMS Endurance','Salty Hippo', 'Seagallop', 'Borealis'),
'SPP' = c(101,201,101,201,102,102,103),
'kept' = c(12, 22, 14, 24, 16, 18, 10))
> testFrame1
tripID name SPP kept
1 1 SS Anne 101 12
2 1 SS Anne 201 22
3 2 HMS Endurance 101 14
4 2 HMS Endurance 201 24
5 3 Salty Hippo 102 16
6 4 Seagallop 102 18
7 5 Borealis 103 10
I need a way to condense the data.frame so that there is only one row per tripID, as shown below.
> testFrame1
tripID name SPP kept SPP.1 kept.1
1 1 SS Anne 101 12 201 22
2 2 HMS Endurance 101 14 201 24
3 3 Salty Hippo 102 16 NA NA
4 4 Seagallop 102 18 NA NA
5 5 Borealis 103 10 NA NA
I've looked into tidyr and reshape, but neither of them seems to deliver quite what I'm asking for. Is there anything out there that does this quasi-reshaping?
Here are two alternatives using base::reshape and data.table::dcast:
1) base R
# Number the rows within each trip, then spread them into columns
reshape(transform(testFrame1,
                  timevar = ave(tripID, tripID, FUN = seq_along)),
        idvar = c("tripID", "name"),
        timevar = "timevar",
        direction = "wide")
# tripID name SPP.1 kept.1 SPP.2 kept.2
#1 1 SS Anne 101 12 201 22
#3 2 HMS Endurance 101 14 201 24
#5 3 Salty Hippo 102 16 NA NA
#6 4 Seagallop 102 18 NA NA
#7 5 Borealis 103 10 NA NA
2) data.table
library(data.table)
setDT(testFrame1)
dcast(testFrame1, tripID + name ~ rowid(tripID), value.var = c("SPP", "kept"))
# tripID name SPP_1 SPP_2 kept_1 kept_2
#1: 1 SS Anne 101 201 12 22
#2: 2 HMS Endurance 101 201 14 24
#3: 3 Salty Hippo 102 NA 16 NA
#4: 4 Seagallop 102 NA 18 NA
#5: 5 Borealis 103 NA 10 NA
Great reproducible post considering it's your first. Here's a way to do it with dplyr and tidyr -
library(dplyr)
library(tidyr)
testFrame1 %>%
  group_by(tripID, name) %>%
  summarise(
    SPP = toString(SPP),
    kept = toString(kept)
  ) %>%
  ungroup() %>%
  separate("SPP", into = c("SPP", "SPP.1"), sep = ", ", extra = "drop", fill = "right") %>%
  separate("kept", into = c("kept", "kept.1"), sep = ", ", extra = "drop", fill = "right")
# A tibble: 5 x 6
tripID name SPP SPP.1 kept kept.1
<dbl> <chr> <chr> <chr> <chr> <chr>
1 1.00 SS Anne 101 201 12 22
2 2.00 HMS Endurance 101 201 14 24
3 3.00 Salty Hippo 102 <NA> 16 <NA>
4 4.00 Seagallop 102 <NA> 18 <NA>
5 5.00 Borealis 103 <NA> 10 <NA>
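The toString()/separate() route returns SPP and kept as character columns and keeps at most two values per trip. An alternative sketch that keeps the numeric types, assuming tidyr 1.0.0 or later for pivot_wider() (the helper column n and the name wide are illustrative):

```r
library(dplyr)
library(tidyr)

# Example data from the question
testFrame1 <- data.frame(
  tripID = c(1, 1, 2, 2, 3, 4, 5),
  name = c("SS Anne", "SS Anne", "HMS Endurance", "HMS Endurance",
           "Salty Hippo", "Seagallop", "Borealis"),
  SPP = c(101, 201, 101, 201, 102, 102, 103),
  kept = c(12, 22, 14, 24, 16, 18, 10)
)

wide <- testFrame1 %>%
  group_by(tripID) %>%
  mutate(n = row_number()) %>%   # position of each row within its trip
  ungroup() %>%
  pivot_wider(id_cols = c(tripID, name),
              names_from = n,
              values_from = c(SPP, kept))
```

The resulting columns are named SPP_1, SPP_2, kept_1 and kept_2, and further columns are added automatically if a trip has more than two species.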
I have a data set with age, gender, state, income and group. Group represents the group that each user belongs to:
group gender state age income
1 3 Female CA 33 $75,000 - $99,999
2 3 Male MA 41 $50,000 - $74,999
3 3 Male KY 32 $35,000 - $49,999
4 2 Female CA 23 $35,000 - $49,999
5 3 Male KY 25 $50,000 - $74,999
6 3 Male MA 21 $75,000 - $99,999
7 3 Female CA 33 $75,000 - $99,999
8 3 Male MA 41 $50,000 - $74,999
9 3 Male KY 32 $35,000 - $49,999
10 2 Female CA 23 $35,000 - $49,999
11 3 Male KY 25 $50,000 - $74,999
12 3 Female MA 21 $75,000 - $99,999
The above is dummy data; the goal is to get the concept right.
The goal is to group by group, gender and income, get the counts, and get for each group the mean age of the users who belong to that group. The data should then be arranged in the following structure ("Expanded Version"):
group male female CA MA KY $35,000 - $49,999 $50,000 - $74,999 $75,000 - $99,999 mean_age
2 0 2 2 0 0 2 1 0 23
...
Here is my attempt, using dplyr:
> data %>% group_by(group,
+ gender,
+ state,
+ income) %>%
+ summarize(n()) %>%
+ mutate(mean_age = mean(age))
I was also exploring spread function.
You can do both the count and mean in one call to summarize():
library(dplyr)
data %>% group_by(group,
gender,
state,
income) %>%
summarize(count = n(), mean_age = mean(age))
For the wide data, the variable names in your sample won't uniquely identify what a given data point means since the unique units are group X gender X state X income but it only has one row per group.
Since you have two summaries, the summary type is an additional layer to the unique identification. So to get everything in one row you would have variable names like [group]_[gender]_[state]_[income]_[summary]. For example, 2_Female_CA_$35,000 - $49,999_count.
There may be a better wide shape - what type of calculations are you doing on the wide data frame?
In addition to @treysp's answer, you could use unite and spread to create a wide (and unwieldy) table. (I'm using as.data.frame() only to force printing of all columns.)
require(tidyverse);
df %>%
group_by(group, gender, state, income) %>%
summarize(n = n(), mean_age = mean(age)) %>%
unite(key, gender, state, income) %>%
spread(key, n) %>% as.data.frame();
# group mean_age Female_CA_$35,000 - $49,999 Female_CA_$75,000 - $99,999
#1 2 23 2 NA
#2 3 21 NA NA
#3 3 25 NA NA
#4 3 32 NA NA
#5 3 33 NA 2
#6 3 41 NA NA
# Female_MA_$75,000 - $99,999 Male_KY_$35,000 - $49,999
#1 NA NA
#2 1 NA
#3 NA NA
#4 NA 2
#5 NA NA
#6 NA NA
# Male_KY_$50,000 - $74,999 Male_MA_$50,000 - $74,999 Male_MA_$75,000 - $99,999
#1 NA NA NA
#2 NA NA 1
#3 2 NA NA
#4 NA NA NA
#5 NA NA NA
#6 NA 2 NA
#
Sample data
df <- read.table(text =
"group gender state age income
1 3 Female CA 33 '$75,000 - $99,999'
2 3 Male MA 41 '$50,000 - $74,999'
3 3 Male KY 32 '$35,000 - $49,999'
4 2 Female CA 23 '$35,000 - $49,999'
5 3 Male KY 25 '$50,000 - $74,999'
6 3 Male MA 21 '$75,000 - $99,999'
7 3 Female CA 33 '$75,000 - $99,999'
8 3 Male MA 41 '$50,000 - $74,999'
9 3 Male KY 32 '$35,000 - $49,999'
10 2 Female CA 23 '$35,000 - $49,999'
11 3 Male KY 25 '$50,000 - $74,999'
12 3 Female MA 21 '$75,000 - $99,999'", header = T, row.names = 1)
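For the "Expanded Version" layout in the question (one row per group, one count column per gender, state and income level, plus the mean age), one possible sketch counts each level separately rather than their combinations. It assumes tidyr 1.0.0 or later; the names counts, mean_age_by_group and wide are illustrative, and the data frame is rebuilt by hand from the sample data:

```r
library(dplyr)
library(tidyr)

incomes <- c("$75,000 - $99,999", "$50,000 - $74,999", "$35,000 - $49,999",
             "$35,000 - $49,999", "$50,000 - $74,999", "$75,000 - $99,999")
df <- data.frame(
  group = c(3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3),
  gender = c("Female", "Male", "Male", "Female", "Male", "Male",
             "Female", "Male", "Male", "Female", "Male", "Female"),
  state = rep(c("CA", "MA", "KY", "CA", "KY", "MA"), 2),
  age = rep(c(33, 41, 32, 23, 25, 21), 2),
  income = rep(incomes, 2),
  stringsAsFactors = FALSE
)

# Stack the three categorical columns, count each level per group, go wide
counts <- df %>%
  pivot_longer(c(gender, state, income),
               names_to = "variable", values_to = "level") %>%
  count(group, level) %>%
  pivot_wider(names_from = level, values_from = n,
              values_fill = list(n = 0L))   # levels absent from a group count as 0

mean_age_by_group <- df %>%
  group_by(group) %>%
  summarise(mean_age = mean(age))

wide <- left_join(counts, mean_age_by_group, by = "group")
```

This relies on the level names being distinct across the three variables; if they could collide, prefixing them with the variable name before pivoting would be safer.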