I have 2 dfs df1 and df2.
df1 have 2 record dates (Base date and followup date),
(Scenario 1) At first I need to match the exact Record_date1 to Drug_Date if it is matched drug name should be update to the corresponding date (i.e PID = 345).
(Scenario 2) If the date is not matched I have to fetch the minimum drug_date of the PID based on daterange like (where min(drug_Date) between Record_date_1(-7 days) and Record_date_1(+45 days))
Here I given the sample set and expected output below.
PID Record_Date_1 D1 Record_Date_2 D2
123 22-04-1996 5.3 30-10-1996 5.4
234 16-06-1994 6.8 13-12-1994 7.2
345 18-09-2000 7.5 24-02-2001 8.9
456 20-02-2001 8.5 20-08-2001 9.4
PID Drug_Date Drugs
123 23-04-1996 Biguanides
123 28-04-1996 Sulphynureas
123 31-10-1996 SGLT2
234 15-06-1994 Insulin
234 14-12-1994 Biguanides
345 18-09-2000 DPP4-inhibitor
345 24-02-2001 Incretin
456 21-02-2001 Biguanides
456 26-08-2001 Sulphynureas
Expected output :
PID Record Date D1 Record Date_2 D2 Drug_ Date1 D1_Drugs Drug_ Date2 D2_Drugs
123 22-04-1996 5.3 30-10-1996 5.4 23-04-1996 Biguanides 31-10-1996 sulphynureas
234 16-06-1994 6.8 13-12-1994 7.2 15-06-1994 Insulin 14-12-1994 Biguanides
345 18-09-2000 7.5 24-02-2001 8.9 18-09-2000 DPP4-inhibitor 24-02-2001 Incretin
456 20-02-2001 8.5 20-08-2001 9.4 21-02-2001 Biguanides 26-08-2001 sulphynureas
If you need any clarification please let me know.
Thanks in advance!
Consider a function like this
my_match <- function(x, y) {
f <- function(i, j) {
pos <- which(j >= i - 7L & j <= i + 45L)
pos[[which.min(j[pos])]]
}
x <- as.Date(x, "%d-%m-%Y")
y <- as.Date(y, "%d-%m-%Y")
out <- match(x, y)
ifelse(is.na(out), vapply(x, f, integer(1L), y), out)
}
Then, you can just
df1$Drug_Date1 <- df2$Drug_Date[my_match(df1$Record_Date_1, df2$Drug_Date)]
df1$D1_Drug <- df2$Drugs[my_match(df1$Record_Date_1, df2$Drug_Date)]
df1$Drug_Date2 <- df2$Drug_Date[my_match(df1$Record_Date_2, df2$Drug_Date)]
df1$D2_Drug <- df2$Drugs[my_match(df1$Record_Date_2, df2$Drug_Date)]
Output
> as.data.frame(df1)
PID Record_Date_1 D1 Record_Date_2 D2 Drug_Date1 D1_Drug Drug_Date2 D2_Drug
1 123 22-04-1996 5.3 30-10-1996 5.4 23-04-1996 Biguanides 31-10-1996 SGLT2
2 234 16-06-1994 6.8 13-12-1994 7.2 15-06-1994 Insulin 14-12-1994 Biguanides
3 345 18-09-2000 7.5 24-02-2001 8.9 18-09-2000 DPP4-inhibitor 24-02-2001 Incretin
4 456 20-02-2001 8.5 20-08-2001 9.4 21-02-2001 Biguanides 26-08-2001 Sulphynureas
Data (df1)
structure(list(PID = c(123, 234, 345, 456), Record_Date_1 = c("22-04-1996",
"16-06-1994", "18-09-2000", "20-02-2001"), D1 = c(5.3, 6.8, 7.5,
8.5), Record_Date_2 = c("30-10-1996", "13-12-1994", "24-02-2001",
"20-08-2001"), D2 = c(5.4, 7.2, 8.9, 9.4), Drug_Date1 = c("23-04-1996",
"15-06-1994", "18-09-2000", "21-02-2001"), D1_Drug = c("Biguanides",
"Insulin", "DPP4-inhibitor", "Biguanides"), Drug_Date2 = c("31-10-1996",
"14-12-1994", "24-02-2001", "26-08-2001"), D2_Drug = c("SGLT2",
"Biguanides", "Incretin", "Sulphynureas")), row.names = c(NA,
-4L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), spec = structure(list(
cols = list(PID = structure(list(), class = c("collector_double",
"collector")), Record_Date_1 = structure(list(), class = c("collector_character",
"collector")), D1 = structure(list(), class = c("collector_double",
"collector")), Record_Date_2 = structure(list(), class = c("collector_character",
"collector")), D2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 2), class = "col_spec"))
Data (df2)
structure(list(PID = c(123, 123, 123, 234, 234, 345, 345, 456,
456), Drug_Date = c("23-04-1996", "28-04-1996", "31-10-1996",
"15-06-1994", "14-12-1994", "18-09-2000", "24-02-2001", "21-02-2001",
"26-08-2001"), Drugs = c("Biguanides", "Sulphynureas", "SGLT2",
"Insulin", "Biguanides", "DPP4-inhibitor", "Incretin", "Biguanides",
"Sulphynureas")), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -9L), spec = structure(list(cols = list(
PID = structure(list(), class = c("collector_double", "collector"
)), Drug_Date = structure(list(), class = c("collector_character",
"collector")), Drugs = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 2), class = "col_spec"))
Related
Hi I am looking into figuring out how to match data frames together by column, then renaming it. If there is no name that matches, then I would want to drop that column instead.
For example, I would use this main dataset, call it DF1:
Name
Reference
Good
Fair
Bad
Great
Poor
George
Hill
34
21
33
21
32
Frank
Stairs
29
28
29
30
29
Bertha
Trail
25
25
24
21
26
Then another DF, call this DF2, that allows me to replace the names of the columns of DF1
Name
Adjusted_Name
Good
good_run
Great
very_great_work
Bad
bad run
Fair
fair run decent
Essentially, the words that would be substituted would not be any pattern of any sort, and I would try to match this first column in DF2 and match to DF1, and if there is a match in DF2$Name and DF(whatever column), then I would replace that name with the same row of DF2$Adjusted_Name. If there is no match, then the value in DF1 is dropped.
So the final goal would be to achieve:
Name
Reference
good_run
fair run decent
Bad run
very_great_work
George
Hill
34
21
33
21
Frank
Stairs
29
28
29
30
Bertha
Trail
25
25
24
21
In this case, "poor" was dropped because it didnt match the column name of DF1.
How should I go about this? How would I account if there thousands of columns? Does that change anything in how i Code? I am a bit new to R, and would appreciate any tips. Thank you!
If you are open to a tidyverse solution, you could use
library(dplyr)
library(tibble)
df %>%
rename_with(~deframe(df2)[.x], .cols = df2$Name) %>%
select(Name, Reference, any_of(df2$Adjusted_Name))
This returns
# A tibble: 3 x 6
Name Reference good_run very_great_work bad_run fair_run_decent
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 George Hill 34 21 33 21
2 Frank Stairs 29 30 29 28
3 Bertha Trail 25 21 24 25
Data
df <- structure(list(Name = c("George", "Frank", "Bertha"), Reference = c("Hill",
"Stairs", "Trail"), Good = c(34, 29, 25), Fair = c(21, 28, 25
), Bad = c(33, 29, 24), Great = c(21, 30, 21), Poor = c(32, 29,
26)), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"), row.names = c(NA,
-3L), spec = structure(list(cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Reference = structure(list(), class = c("collector_character",
"collector")), Good = structure(list(), class = c("collector_double",
"collector")), Fair = structure(list(), class = c("collector_double",
"collector")), Bad = structure(list(), class = c("collector_double",
"collector")), Great = structure(list(), class = c("collector_double",
"collector")), Poor = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df2 <- structure(list(Name = c("Good", "Great", "Bad", "Fair"), Adjusted_Name = c("good_run",
"very_great_work", "bad_run", "fair_run_decent")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L), spec = structure(list(
cols = list(Name = structure(list(), class = c("collector_character",
"collector")), Adjusted_Name = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Try the following - using the list of adjusted names, you can grep the list of desired words against column names and subset the data frame on it:
Data
df <- read.table(header = TRUE, text = "Name Reference Good Fair Bad Great Poor
George Hill 34 21 33 21 32
Frank Stairs 29 28 29 30 29
Bertha Trail 25 25 24 21 26")
adj_name <- c("good_run","very_great_run","bad run","fair run decent")
Index the columns based on grep from the string of desired names (note the tolower() on the column names as well)
desired_words <- paste(unlist(strsplit(adj_name, "_| ")), collapse = "|")
df[,c(1:2,grep(desired_words, tolower(names(df))))]
Output
# Name Reference Good Fair Bad Great
#1 George Hill 34 21 33 21
#2 Frank Stairs 29 28 29 30
#3 Bertha Trail 25 25 24 21
I am learning how to create a function in R, but I am struggling to understand how to write for loop. My understanding is that
for (item I list_items) {
do_something(itemn)
}
I would like to write a for loop to replace with 333 the cells that are equal with 123. So the item is 123 and the list of items is the df from sec1 till sec4.
Could somebody explain this to me, please? And how this can be included in a function?
Sample code:
structure(list(sec1 = c(1, 123, 1), sec2 = c(123, 1, 1), sec3 = c(123,
0, 0), sec4 = c(1, 123, 1)), spec = structure(list(cols = list(
sec1 = structure(list(), class = c("collector_double", "collector"
)), sec2 = structure(list(), class = c("collector_double",
"collector")), sec3 = structure(list(), class = c("collector_double",
"collector")), sec4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), row.names = c(NA,
-3L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
We do not need a for loop here:
df[df==123]<-333
If we really need for loops:
for(i in 1:ncol(df)){
df[i][df[i]==123] <-333
}
output
df
# A tibble: 3 x 4
sec1 sec2 sec3 sec4
<dbl> <dbl> <dbl> <dbl>
1 1 333 333 1
2 333 1 0 333
3 1 1 0 1
Here's how it would work for one column of your data:
dat <- structure(list(sec1 = c(1, 123, 1),
sec2 = c(123, 1, 1),
sec3 = c(123, 0, 0),
sec4 = c(1, 123, 1)),
spec = structure(list(cols = list(
sec1 = structure(list(),
class = c("collector_double", "collector")),
sec2 = structure(list(),
class = c("collector_double","collector")),
sec3 = structure(list(),
class = c("collector_double", "collector")),
sec4 = structure(list(),
class = c("collector_double","collector"))),
default = structure(list(),
class = c("collector_guess","collector")),
delim = ","), class = "col_spec"),
row.names = c(NA,-3L), class =
c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
for(i in 1:nrow(dat)){
dat$sec1[i] <- ifelse(dat$sec1[i] == 123, 333, dat$sec1[i])
}
dat
#> sec1 sec2 sec3 sec4
#> 1 1 123 123 1
#> 2 333 1 0 123
#> 3 1 1 0 1
Created on 2022-01-31 by the reprex package (v2.0.1)
To replace all of them, using for loops, you could do a double loop over columns and rows.
for(j in names(dat)){
for(i in 1:nrow(dat)){
dat[[j]][i] <- ifelse(dat[[j]][i] == 123, 333, dat[[j]][i])
}
}
Of course, as others have identified, you certainly don't need a for loop to accomplish this.
in addition to DaveArmstrong Answer this would work for all rows and columns:
dat <- structure(list(sec1 = c(1, 123, 1), sec2 = c(123, 1, 1), sec3 = c(123,
0, 0), sec4 = c(1, 123, 1)), spec = structure(list(cols = list(
sec1 = structure(list(), class = c("collector_double", "collector"
)), sec2 = structure(list(), class = c("collector_double",
"collector")), sec3 = structure(list(), class = c("collector_double",
"collector")), sec4 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), delim = ","), class = "col_spec"), row.names = c(NA,
-3L), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"))
for(i in 1:nrow(dat)){
for(j in 1:ncol(dat)){
dat[i,j] <- ifelse(dat[i,j] == 123, 333, dat[i,j])
}
}
I have my data in the wide-format
what is the easiest way to change it to long without having a X in front of the time variables
Sample data:
structure(list(X1 = c("01/12/2019", "02/12/2019"), `00:30` = c(41.95,
39.689), `01:00` = c(44.96, 40.47), `01:30` = c(42.939, 38.95
), `02:00` = c(43.221, 40.46), `02:30` = c(44.439, 41.97)), class = "data.frame", row.names = c(NA,
-2L), spec = structure(list(cols = list(X1 = structure(list(), class = c("collector_character",
"collector")), `00:30` = structure(list(), class = c("collector_double",
"collector")), `01:00` = structure(list(), class = c("collector_double",
"collector")), `01:30` = structure(list(), class = c("collector_double",
"collector")), `02:00` = structure(list(), class = c("collector_double",
"collector")), `02:30` = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
with pivot_longer and pivot_wider from tidyr:
dat |>
pivot_longer(names_to="time",values_to="val",`00:30`:`02:30`) |>
pivot_wider(names_from="X1", values_from="val")
Output:
# A tibble: 5 x 3
time `01/12/2019` `02/12/2019`
<chr> <dbl> <dbl>
1 00:30 42.0 39.7
2 01:00 45.0 40.5
3 01:30 42.9 39.0
4 02:00 43.2 40.5
5 02:30 44.4 42.0
I this special case, you could transpose the part of your data.frame containing numbers and assign the column names:
df_new <- data.frame(t(df[,-1]))
colnames(df_new) <- df[, 1]
This returns a data.frame df_new:
01/12/2019 02/12/2019
00:30 41.950 39.689
01:00 44.960 40.470
01:30 42.939 38.950
02:00 43.221 40.460
02:30 44.439 41.970
Edit (Thanks to jay.sf)
For versions of R >= 4.1, you could use the natural pipe:
t(df[, -1]) |>
data.frame() |>
`colnames<-`(df[, 1])
I need a chart of accounts to stay in order when new accounts are added or dropped in future years. This is because in Accounting the accounts are sorted by type (for example Asset, Liability Equity) but it is not explicit in the dataset. This is an example of the code that is putting new "Accounts" from Year2 and Year3 at the bottom.
XYZCompany_Consolidated <- XYZCompany_Year1 %>%
full_join(XYZCompany_Year2 by = "Account") %>%
full_join(XYZCompany_Year3, by = "Account")
Example: This picture is just to give a simplified example. The highlight in orange is where the new accounts are going and to the right is the code i'm using, and the green is what I'm trying to achieve
Perhaps I'm overthinking this problem but I find it hard to solve. Let's define some data first:
df_year1 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"JaneDoe"), Year_1 = c(100, 1000, 20, 300, 500)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_1 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df_year2 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"Laptop", "JaneDoe"), Year_2 = c(80, 1200, 50, 300, 500, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_2 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
df_year3 <- structure(list(Account = c("Cash", "Accounts", "Loan1", "Auto",
"Rent", "JaneDoe"), Year_3 = c(80, 1200, 50, 300, 1000, 0)), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(Account = structure(list(), class = c("collector_character",
"collector")), Year_3 = structure(list(), class = c("collector_double",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))
Those are similar to the data shown in the OP's picture, e.g. df_year1 looks like
# A tibble: 5 x 2
Account Year_1
<chr> <dbl>
1 Cash 100
2 Accounts 1000
3 Loan1 20
4 Auto 300
5 JaneDoe 500
Next we transform those data a little bit, namely
library(dplyr)
library(tidyr)
df_y1 <- df_year1 %>%
mutate(Year = 1,
no = row_number()) %>%
rename(value = Year_1)
which returns
# A tibble: 5 x 4
Account value Year no
<chr> <dbl> <dbl> <int>
1 Cash 100 1 1
2 Accounts 1000 1 2
3 Loan1 20 1 3
4 Auto 300 1 4
5 JaneDoe 500 1 5
The new column no stores the account's original position, column Year stores the chart's year. All three data.frames are processed like this, so we get df_y1, df_y2, df_y3.
Finally we bind them together
bind_rows(df_y1, df_y2, df_y3) %>%
mutate(num_years = max(Year)) %>%
group_by(Account) %>%
mutate(rank = sum((num_years - n() + 1) * no), .keep = "unused") %>%
pivot_wider(names_from = Year) %>%
arrange(rank) %>%
select(-rank) %>%
ungroup()
and calculate a rank for each account. The accounts are ordered by this rank. As a result, we get
# A tibble: 7 x 4
Account Year_1 Year_2 Year_3
<chr> <dbl> <dbl> <dbl>
1 Cash 100 80 80
2 Accounts 1000 1200 1200
3 Loan1 20 50 50
4 Auto 300 300 300
5 Laptop NA 500 NA
6 Rent NA NA 1000
7 JaneDoe 500 0 0
Note
I believe, there are better approaches, but at least this works for the example data.
I'm not sure about the calculated rank's stability. Take care.
For a sample dataframe:
df <- structure(list(code = c("a1", "a1", "b2", "v4", "f5", "f5", "h7",
"a1"), name = c("katie", "katie", "sally", "tom", "amy", "amy",
"ash", "james"), number = c(3.5, 3.5, 2, 6, 4, 4, 7, 3)), .Names = c("code",
"name", "number"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-8L), spec = structure(list(cols = structure(list(code = structure(list(), class = c("collector_character",
"collector")), name = structure(list(), class = c("collector_character",
"collector")), number = structure(list(), class = c("collector_double",
"collector"))), .Names = c("code", "name", "number")), default = structure(list(), class = c("collector_guess",
"collector"))), .Names = c("cols", "default"), class = "col_spec"))
I want to highlight all the records which are have two or more values of 'code' which are the same. I know I could use:
df[duplicated(df$name), ]
But this only highlights the duplicated records, but I want all of the code values which are duplicated (i.e. 3 a1s and 2 f5s).
Any ideas?
df[duplicated(df$code) | duplicated(df$code, fromLast=TRUE), ]
code name number
1 a1 katie 3.5
2 a1 katie 3.5
5 f5 amy 4.0
6 f5 amy 4.0
8 a1 james 3.0
Another solution inspired by Alok VS:
ta <- table(df$code)
df[df$code %in% names(ta)[ta > 1], ]
Edit: If you are ok with leaving base R then gdata::duplicated2() allows for more concision.
library(gdata)
df[duplicated2(df$code), ]
turn the indexes to values - and then check if 'code' fits this values:
df[df$code %in% df$code[duplicated(df$code)], ]
code name number
1 a1 katie 3.5
2 a1 katie 3.5
5 f5 amy 4.0
6 f5 amy 4.0
8 a1 james 3.0
I've come up with a crude solution,
temp<-aggregate(df$code, by=list(df$code), FUN=length)
temp<-temp[temp$x>1,]
df[df$code %in% temp$Group.1,]