How can I convert logical columns to numeric in R using dplyr?

I have a data frame that I import from an Excel (.xlsx) file, and it looks like this:
>data
# A tibble: 3,338 x 4
Dates A B C
<dttm> <lgl> <lgl> <lgl>
1 2009-01-05 00:00:00 NA NA NA
2 2009-01-06 00:00:00 NA NA NA
3 2009-01-07 00:00:00 NA NA NA
4 2009-01-08 00:00:00 NA NA NA
5 2009-01-09 00:00:00 NA NA NA
6 2009-01-12 00:00:00 NA NA NA
7 2009-01-13 00:00:00 NA NA NA
8 2009-01-14 00:00:00 NA NA NA
9 2009-01-15 00:00:00 NA NA NA
10 2009-01-16 00:00:00 NA NA NA
# ... with 3,328 more rows
# ℹ Use `print(n = ...)` to see more rows
The problem is that these three columns A, B, C do contain numeric values, but only after roughly three thousand rows of NA. Trying to convert them to numeric, I did:
data %>%
  dplyr::mutate(date = as.Date(Dates)) %>%
  dplyr::select(-Dates) %>%
  dplyr::relocate(date, .before = "A") %>%
  dplyr::mutate_if(is.logical, as.numeric) %>%
  tidyr::pivot_longer(!date, names_to = "var", values_to = "y") %>%
  dplyr::group_by(var) %>%
  dplyr::arrange(var) %>%
  tidyr::drop_na()
but the problem remains; the values come out as 1 instead of the original numbers:
date var y
<date> <chr> <dbl>
1 2021-11-30 A 1
2 2021-12-01 A 1
3 2021-12-02 A 1
4 2021-12-03 A 1
5 2021-12-06 A 1
6 2021-12-07 A 1
7 2021-12-08 A 1
8 2021-12-09 A 1
9 2021-12-10 A 1
10 2021-12-13 A 1
# ... with 189 more rows
Any help?

Summing up from comments:
It's usually easier to fix conversion errors as close to the original source as possible.
read_xlsx tries to guess column types by checking the first guess_max rows, guess_max being a read_xlsx parameter with a default value of min(1000, n_max). If read_xlsx gets the column types wrong because those columns are filled with NA for the first 1000 rows, just increasing the guess_max parameter might be a viable solution:
readxl::read_xlsx("file.xlsx", guess_max = 5000)
Though for a simple 4-column dataset, one shouldn't need more than defining the correct column types manually:
readxl::read_xlsx("file.xlsx", col_types = c("date", "numeric", "numeric", "numeric"))
If NA values are only at the beginning of some columns, changing the sorting order in Excel itself to move the NAs away from the top before importing the file into R should also work.
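For completeness: once the columns have been imported as logical, the original numbers are already gone (every non-missing cell was coerced to TRUE, which is why the pipeline above produces nothing but 1s), so re-reading the file is the real fix. For the conversion step itself, note that mutate_if() is superseded in current dplyr; a minimal sketch of the across() spelling, assuming data is the tibble from the question:

library(dplyr)

data %>%
  mutate(across(where(is.logical), as.numeric))  # convert every logical column in place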

Related

Why does grepl work but not str_detect for mutate depending on row value?

I have been trying to wrap my head around this.
I need to create a corrected column based on detecting a specific comment in another "error" column in my database. I can work around this with grepl, but I am struggling to get str_detect to work as well (it is usually faster for big datasets).
Here is an example database:
library(tibble)

test <- tibble(
  id = seq(1:30),
  date = sample(seq(as.Date('2000/01/01'), as.Date('2018/01/01'), by = "day"), 30),
  error = c(rep(NA, 3), "wrong date! Correct date = 01.03.2022",
            rep(NA, 5), "wrong date! Correct date = 01.05.2021",
            rep(NA, 5), "wrong date! Correct date = 01.03.2022",
            rep(NA, 7), "wrong date! Correct date = 01.05.2021",
            rep(NA, 2), "date already corrected on 01.05.2021",
            NA, "date already corrected on 01.03.2022", NA))
I first tried to create a new "date_corr" column with str_detect:
test %>%
  mutate(date_corr = if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
         date_corr = if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
This yields:
A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
Adding rowwise() makes no difference:
test %>%
  rowwise() %>%
  mutate(date_corr = if_else(str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
         date_corr = if_else(str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA NA
2 2 2004-06-30 NA NA
3 3 2015-09-25 NA NA
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA NA
6 6 2004-08-02 NA NA
7 7 2001-10-15 NA NA
8 8 2007-07-21 NA NA
9 9 2014-04-19 NA NA
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
However, with grepl I get the desired outcome, regardless of rowwise:
test %>%
  mutate(date_corr = if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
         date_corr = if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
# A tibble: 30 × 4
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
test %>%
  rowwise() %>%
  mutate(date_corr = if_else(grepl("date \\= 01\\.03\\.2022$", error), as.Date('2022/03/01'), date),
         date_corr = if_else(grepl("date \\= 01\\.05\\.2021$", error), as.Date('2021/05/01'), date_corr))
A tibble: 30 × 4
# Rowwise:
id date error date_corr
<int> <date> <chr> <date>
1 1 2010-04-28 NA 2010-04-28
2 2 2004-06-30 NA 2004-06-30
3 3 2015-09-25 NA 2015-09-25
4 4 2005-08-21 wrong date! Correct date = 01.03.2022 2022-03-01
5 5 2008-07-16 NA 2008-07-16
6 6 2004-08-02 NA 2004-08-02
7 7 2001-10-15 NA 2001-10-15
8 8 2007-07-21 NA 2007-07-21
9 9 2014-04-19 NA 2014-04-19
10 10 2013-02-08 wrong date! Correct date = 01.05.2021 2021-05-01
# … with 20 more rows
What am I missing here?
The difference is how they handle NA values:
str_detect(NA, "missing")
# [1] NA
grepl("missing", NA)
# [1] FALSE
And note that if you have an NA value in the condition for if_else, it will also preserve the NA value:
if_else(NA, 1, 2)
# [1] NA
So str_detect preserved the NA value. It's not clear what the "right" value should be, but if you want str_detect to produce the same values as grepl, you can be explicit about not changing NA values:
test %>%
  mutate(date_corr = if_else(!is.na(error) & str_detect(error, "date \\= 01\\.03\\.2022$"), as.Date('2022/03/01'), date),
         date_corr = if_else(!is.na(error) & str_detect(error, "date \\= 01\\.05\\.2021$"), as.Date('2021/05/01'), date_corr))
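A slightly shorter equivalent (my sketch, not from the original answer) maps the NAs produced by str_detect() to FALSE with coalesce(), so the if_else() condition never sees an NA:

library(dplyr)
library(stringr)

test %>%
  mutate(date_corr = if_else(coalesce(str_detect(error, "date \\= 01\\.03\\.2022$"), FALSE),  # coalesce() turns str_detect(NA, ...) into FALSE
                             as.Date('2022/03/01'), date),
         date_corr = if_else(coalesce(str_detect(error, "date \\= 01\\.05\\.2021$"), FALSE),
                             as.Date('2021/05/01'), date_corr))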

R: Identify first, second, and third dates of multiple columns and rows

I have a dataset where individuals can have multiple rows of data and seven columns where dates have been listed. I'm trying to find the first, second, and third earliest dates.
> head(Addim_try)
# A tibble: 6 x 8
ID HistoryDate1 HistoryDate2 HistoryDate3 HistoryDate4 Date1 Date2 PDate1
<dbl> <dttm> <dttm> <dttm> <dttm> <dttm> <dttm> <dttm>
1 1317051 NA NA NA NA NA NA 2022-04-05 00:00:00
2 1317051 2021-06-19 00:00:00 2021-07-10 00:00:00 NA NA NA NA NA
3 1317079 2021-08-10 00:00:00 2021-08-31 00:00:00 NA NA NA NA NA
4 1317079 2022-01-21 00:00:00 NA NA NA NA NA NA
5 1324163 2022-04-08 00:00:00 NA NA NA NA NA NA
6 1324163 2021-08-07 00:00:00 2021-10-09 00:00:00 NA NA NA NA NA
I'm considering first identifying the first, second, and third dose by row (I would show this, but suddenly my code isn't working) and then reshaping my data to wide format (although I'm getting stuck here, so any suggestions on how to reshape would be helpful). Any better ideas?
We could reshape to 'long' with pivot_longer and get the first 3 dates with slice_min
library(dplyr)
library(tidyr)
Addim_try %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -c(rn, ID), values_to = "Date", values_drop_na = TRUE) %>%
  group_by(rn) %>%
  slice_min(n = 3, order_by = Date) %>%
  ungroup() %>%
  select(-rn)
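If the wide shape mentioned in the question is wanted afterwards (one column per earliest date), the same pipeline can be extended with pivot_wider(). A sketch, where the EarliestDate1..EarliestDate3 column names are made up for illustration:

Addim_try %>%
  mutate(rn = row_number()) %>%
  pivot_longer(cols = -c(rn, ID), values_to = "Date", values_drop_na = TRUE) %>%
  group_by(rn) %>%
  slice_min(n = 3, order_by = Date, with_ties = FALSE) %>%  # at most 3 dates per original row
  mutate(rank = row_number()) %>%                           # 1 = earliest, 2 = second, 3 = third
  ungroup() %>%
  pivot_wider(id_cols = c(rn, ID), names_from = rank,
              names_prefix = "EarliestDate", values_from = Date) %>%
  select(-rn)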

In R, how do I code to analyse patients with follow up at any given time point?

I have patients with baseline pain scores and follow-up at 6 months, 1 year, and 2 years (each in its own variable column). I have 26,000+ patients, and there is missing data at the various time points. I can easily analyse pain score outcomes at one year excluding missing data, and likewise at 6 months and two years. What I would like to do is analyse outcomes in those with data at EITHER 6 months, one year, or two years. Some patients will have more than one, and some will have missing data for all three. Any ideas how to code this? Maybe another column with mutate() ... that creates 'vas.outcome', taking the one-year data, the two-year data if one-year is missing, and the 6-month data if two-year is also missing. If all three are missing, code it as NA.
# A tibble: 6 x 4
vas.base vas.6mth vas.year vas.two
<dbl> <dbl> <dbl> <dbl>
1 5 NA NA 4
2 9 2.3 1.2 NA
3 8.1 NA NA NA
4 10 NA NA 3.3
5 6.5 6.5 NA NA
6 8 NA NA 3
One approach:
library(dplyr)
your_data_frame %>%
  mutate(vas.outcome = coalesce(vas.6mth, vas.year, vas.two))
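Note that coalesce() returns the first non-missing value in argument order, so to match the priority described in the question (one-year first, then two-year, then 6-month) the arguments would be reordered:

your_data_frame %>%
  mutate(vas.outcome = coalesce(vas.year, vas.two, vas.6mth))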
You could use a case_when()/fcase() approach
library(data.table)
dt[, pain := fcase(
  !is.na(vas.year), vas.year,
  !is.na(vas.two), vas.two,
  !is.na(vas.6mth), vas.6mth,
  default = NA
)]
or
dt %>%
  mutate(pain := case_when(
    !is.na(vas.year) ~ vas.year,
    !is.na(vas.two) ~ vas.two,
    TRUE ~ vas.6mth
  ))
Output:
vas.base vas.6mth vas.year vas.two pain
1: 5.0 NA NA 4.0 4.0
2: 9.0 2.3 1.2 NA 1.2
3: 8.1 NA NA NA NA
4: 10.0 NA NA 3.3 3.3
5: 6.5 6.5 NA NA 6.5
6: 8.0 NA NA 3.0 3.0
I'm not 100% sure what you want your final dataset to look like, and I'm sure there are more elegant ways, but to choose the first occurrence of an outcome (after baseline), you can do:
Data
df <- read.table(text = "id vas.base vas.6mth vas.year vas.two
1 5 NA NA 4
2 9 2.3 1.2 NA
3 8.1 NA NA NA
4 10 NA NA 3.3
5 6.5 6.5 NA NA
6 8 NA NA 3", header = TRUE)
dplyr approach:
library(tidyr)

df %>%
  pivot_longer(starts_with("vas")[-1], names_to = "visit") %>%
  group_by(id) %>%
  mutate(vas.outcome = first(na.omit(value))) %>%
  slice(1) %>%
  select(id, vas.outcome) %>%
  left_join(df, by = "id")
Output:
# id vas.outcome vas.base vas.6mth vas.year vas.two
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 4 5 NA NA 4
# 2 2 2.3 9 2.3 1.2 NA
# 3 3 NA 8.1 NA NA NA
# 4 4 3.3 10 NA NA 3.3
# 5 5 6.5 6.5 6.5 NA NA
# 6 6 3 8 NA NA 3

Create a column that assigns value to a row in a dataframe based on an event in another row

I have a dataframe that is structured like the following:
example <- data.frame(
  id = c(1,1,1,1,1,1,1,2,2,2,2,2),
  event = c("email","email","email","draw","email","email","draw","email","email","email","email","draw"),
  date = c("2020-03-01","2020-06-01","2020-07-15","2020-07-28","2020-08-07","2020-09-01","2020-09-15","2020-05-22","2020-06-15","2020-07-13","2020-07-15","2020-07-31"),
  amount = c(NA,NA,NA,10000,NA,NA,1500,NA,NA,NA,NA,2200))
This is a simplified version of the dataframe. I am trying to create a column that will assign a 1 to the last email before the draw event and a column that will have the amount drawn on the same row as the email. The desired dataframe would look like the following:
desiredResult <- data.frame(
  id = c(1,1,1,1,1,1,1,2,2,2,2,2),
  event = c("email","email","email","draw","email","email","draw","email","email","email","email","draw"),
  date = c("2020-03-01","2020-06-01","2020-07-15","2020-07-28","2020-08-07","2020-09-01","2020-09-15","2020-05-22","2020-06-15","2020-07-13","2020-07-15","2020-07-31"),
  amount = c(NA,NA,NA,10000,NA,NA,1500,NA,NA,NA,NA,2200),
  EmailBeforeDrawFlag = c(NA,NA,1,NA,NA,1,NA,NA,NA,NA,1,NA),
  EmailBeforeDrawAmount = c(NA,NA,10000,NA,NA,1500,NA,NA,NA,NA,2200,NA))
Here is a dplyr solution. When you create the new columns, use if_else() in the definition of EmailBeforeDrawFlag to check a condition, and the lead() function to look at the event in the next row. EmailBeforeDrawAmount is just lead(amount).
library(dplyr)

example %>%
  mutate(EmailBeforeDrawFlag = if_else(lead(event) == "draw", 1, NA_real_),
         EmailBeforeDrawAmount = lead(amount))
id event date amount EmailBeforeDrawFlag EmailBeforeDrawAmount
1 1 email 2020-03-01 NA NA NA
2 1 email 2020-06-01 NA NA NA
3 1 email 2020-07-15 NA 1 10000
4 1 draw 2020-07-28 10000 NA NA
5 1 email 2020-08-07 NA NA NA
6 1 email 2020-09-01 NA 1 1500
7 1 draw 2020-09-15 1500 NA NA
8 2 email 2020-05-22 NA NA NA
9 2 email 2020-06-15 NA NA NA
10 2 email 2020-07-13 NA NA NA
11 2 email 2020-07-15 NA 1 2200
12 2 draw 2020-07-31 2200 NA NA
We could also make use of NA^ on the lead to create the flag column: NA^FALSE evaluates to 1, while NA^TRUE stays NA.
library(dplyr)
example %>%
  mutate(EmailBeforeDrawFlag = NA^(lead(event != 'draw')),
         EmailBeforeDrawAmount = lead(amount))
Output:
# id event date amount EmailBeforeDrawFlag EmailBeforeDrawAmount
#1 1 email 2020-03-01 NA NA NA
#2 1 email 2020-06-01 NA NA NA
#3 1 email 2020-07-15 NA 1 10000
#4 1 draw 2020-07-28 10000 NA NA
#5 1 email 2020-08-07 NA NA NA
#6 1 email 2020-09-01 NA 1 1500
#7 1 draw 2020-09-15 1500 NA NA
#8 2 email 2020-05-22 NA NA NA
#9 2 email 2020-06-15 NA NA NA
#10 2 email 2020-07-13 NA NA NA
#11 2 email 2020-07-15 NA 1 2200
#12 2 draw 2020-07-31 2200 NA NA
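One caveat on both answers (a note of mine, not from the original thread): lead() as used above can look across id boundaries, so an email in the last row of one id would be flagged if the next id happened to start with a draw. Grouping first avoids this; a sketch:

library(dplyr)

example %>%
  group_by(id) %>%   # lead() now returns NA at each id's last row instead of peeking into the next id
  mutate(EmailBeforeDrawFlag = if_else(lead(event) == "draw", 1, NA_real_),
         EmailBeforeDrawAmount = lead(amount)) %>%
  ungroup()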

R: take different variables from a row and make column header

library(dplyr)

rainfall <- data.frame("date" = rep(1:15),
                       "location_code" = rep(6:8, 5),
                       "rainfall" = runif(15, min = 12, max = 60))
rainfall30 <- rainfall %>%
  group_by(location_code) %>%
  filter(rainfall > 30)
I want to use the above data to make the following table, is there a way to do it in R using dplyr?
date  location6  location7  location8
2                47.7
5                46.8
6                           32.3
7     55.3
9                           40.5
I am just starting to use R, so please forgive me if this has already been answered. Thanks.
I think what you are looking for is tidyr::pivot_wider, which turns this long-form data.frame into a wide one. The tidyr pivoting vignette (vignette("pivot")) has more information on pivoting data.
rainfall30 %>%
  pivot_wider(names_from = location_code,
              values_from = rainfall)
# date `6` `7` `8`
# <int> <dbl> <dbl> <dbl>
# 1 1 32.3 NA NA
# 2 2 NA 52.7 NA
# 3 3 NA NA 54.3
# 4 4 30.6 NA NA
# 5 7 52.4 NA NA
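To get the location6/location7/location8 headers shown in the question, pivot_wider() also takes a names_prefix argument:

rainfall30 %>%
  pivot_wider(names_from = location_code,
              values_from = rainfall,
              names_prefix = "location")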
Here is a base R option using reshape + subset
reshape(
  subset(rainfall, rainfall > 30),
  idvar = "date",
  timevar = "location_code",
  direction = "wide"
)
which gives something like below (using set.seed(1) to generate rainfall)
date rainfall.8 rainfall.6 rainfall.7
3 3 39.49696 NA NA
4 4 NA 55.59397 NA
6 6 55.12270 NA NA
7 7 NA 57.34441 NA
8 8 NA NA 43.71829
9 9 42.19747 NA NA
13 13 NA 44.97710 NA
14 14 NA NA 30.43698
15 15 48.95239 NA NA