Converting variable with 5 digit numbers and dates into date values - r

I have the following data, in which some dates are stored as 5-digit character values. When I convert the column to Date, the values that were already valid dates become NA.
dt <- data.frame(id=c(1,1,1,1,1,1,2,2,2,2,2),
Registrationdate=c('2019-01-09','2019-01-09','2019-01-09','2019-01-09','2019-01-09',
'2019-01-09',"44105","44105","44105","44105","44105"))
Expected value
id Registrationdate
1 1 2019-01-09
2 1 2019-01-09
3 1 2019-01-09
4 1 2019-01-09
5 1 2019-01-09
6 1 2019-01-09
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01
I tried using
library(openxlsx)
dt$Registrationdate <- convertToDate(dt$Registrationdate, origin = "1900-01-01")
But I got
1 1 <NA>
2 1 <NA>
3 1 <NA>
4 1 <NA>
5 1 <NA>
6 1 <NA>
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01

Here's one approach using a mix of dplyr and base R:
library(dplyr, warn = FALSE)
dt |>
mutate(Registrationdate = if_else(grepl("-", Registrationdate),
as.Date(Registrationdate),
openxlsx::convertToDate(Registrationdate, origin = "1900-01-01")))
#> Warning in openxlsx::convertToDate(Registrationdate, origin = "1900-01-01"): NAs
#> introduced by coercion
#> id Registrationdate
#> 1 1 2019-01-09
#> 2 1 2019-01-09
#> 3 1 2019-01-09
#> 4 1 2019-01-09
#> 5 1 2019-01-09
#> 6 1 2019-01-09
#> 7 2 2020-10-01
#> 8 2 2020-10-01
#> 9 2 2020-10-01
#> 10 2 2020-10-01
#> 11 2 2020-10-01
Created on 2022-10-15 with reprex v2.0.2
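For what it's worth, the warning above appears because convertToDate() is still evaluated on every element, including the ones that already look like dates. Below is a minimal base R sketch that converts each kind of value separately. It assumes the 5-digit values are Excel serial numbers counted from the origin "1899-12-30" (which maps 44105 to 2020-10-01). The helper name to_date_mixed is made up for illustration.

```r
# Hypothetical helper: parse a character vector that mixes ISO dates
# ("2019-01-09") and Excel serial numbers stored as text ("44105").
to_date_mixed <- function(x) {
  x <- as.character(x)
  is_serial <- grepl("^[0-9]+$", x)      # digits only -> treat as a serial number
  out <- rep(as.Date(NA), length(x))     # start from an all-NA Date vector
  # Assumption: serial numbers follow the Excel convention (origin 1899-12-30)
  out[is_serial] <- as.Date(as.numeric(x[is_serial]), origin = "1899-12-30")
  out[!is_serial] <- as.Date(x[!is_serial], format = "%Y-%m-%d")
  out
}

dt <- data.frame(id = c(1, 1, 2, 2),
                 Registrationdate = c("2019-01-09", "2019-01-09", "44105", "44105"))
dt$Registrationdate <- to_date_mixed(dt$Registrationdate)
dt
```

Because only the digit-only strings are passed to as.numeric(), no coercion warning should be raised.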

library(janitor)
dt$Registrationdate <- convert_to_date(dt$Registrationdate)
id Registrationdate
1 1 2019-01-09
2 1 2019-01-09
3 1 2019-01-09
4 1 2019-01-09
5 1 2019-01-09
6 1 2019-01-09
7 2 2020-10-01
8 2 2020-10-01
9 2 2020-10-01
10 2 2020-10-01
11 2 2020-10-01

Another option is to import columns in the expected format. An example with openxlsx2 is shown below. The top half creates a file that causes the behavior you see with openxlsx. This is because some of the rows in the Registrationdate column are formatted as dates and some as strings, a fairly common error caused by the person who generated the xlsx input.
With openxlsx2 you can define the type of column you want to import. The option was inspired by readxl (iirc).
library(openxlsx2)
## prepare data
date_as_string <- data.frame(
id = rep(1, 6),
Registrationdate = rep('2019-01-09', 6)
)
date_as_date <- data.frame(
id = rep(2, 5),
Registrationdate = rep(as.Date('2019-01-10'), 5)
)
options(openxlsx2.dateFormat = "yyyy-mm-dd")
wb <- wb_workbook()$
add_worksheet()$
add_data(x = date_as_string)$
add_data(x = date_as_date, colNames = FALSE, startRow = 7)
#wb$open()
## read data as date
dt <- wb_to_df(wb, types = c(id = 1, Registrationdate = 2))
## check that Registrationdate is actually a Date column
str(dt$Registrationdate)
#> Date[1:10], format: "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" "2019-01-09" ...

Related

How to create a new column that counts the number of occurrences of a value in another column and orders them by date

I have a 2 column data frame with "date" and "ID" headings. Some IDs are listed more than once. I want to create a new column "Attempt" that denotes the number of attempts that each ID has taken, ordered by the date of occurrence.
Here is my sample data:
ID <- c(1,2,5,8,4,9,1,11,15,32,54,1,4,2,14)
Date <- c("2021-04-12", "2021-04-12", "2021-04-13", "2021-04-14", "2021-04-19",
"2021-04-19", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-28",
"2021-04-28", "2021-04-29", "2021-04-29", "2021-05-06", "2021-05-07")
Data <- data.frame(ID, Date)
Data$Date <- as.Date(Data$Date, format="%Y-%m-%d")
I tried various iterations of duplicated(). I can remove all duplicates or make every instance of a duplicated value "2" or "3" for example, but I want each occurrence to be ordered based on the date of the attempt taken.
Here is my expected result column to be added onto the original data frame:
Attempt <- c(1,1,1,1,1,1,2,1,1,1,1,3,2,2,1)
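You can group by ID and number the rows within each group (the data are already sorted by date):
library(dplyr)
Data %>%
group_by(ID) %>%
mutate(Attempt = row_number())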
ID Date Attempt
1 1 2021-04-12 1
2 2 2021-04-12 1
3 5 2021-04-13 1
4 8 2021-04-14 1
5 4 2021-04-19 1
6 9 2021-04-19 1
7 1 2021-04-20 2
8 11 2021-04-21 1
9 15 2021-04-22 1
10 32 2021-04-28 1
11 54 2021-04-28 1
12 1 2021-04-29 3
13 4 2021-04-29 2
14 2 2021-05-06 2
15 14 2021-05-07 1
With dplyr 1.1.0 or later, you can use the .by argument instead of group_by():
Data %>%
mutate(Attempt = row_number(), .by = ID)
Using data.table
library(data.table)
setDT(Data)[, Attempt := rowid(ID)]
-output
> Data
ID Date Attempt
1: 1 2021-04-12 1
2: 2 2021-04-12 1
3: 5 2021-04-13 1
4: 8 2021-04-14 1
5: 4 2021-04-19 1
6: 9 2021-04-19 1
7: 1 2021-04-20 2
8: 11 2021-04-21 1
9: 15 2021-04-22 1
10: 32 2021-04-28 1
11: 54 2021-04-28 1
12: 1 2021-04-29 3
13: 4 2021-04-29 2
14: 2 2021-05-06 2
15: 14 2021-05-07 1
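A base R version of the same idea, in case neither package is available. This is only a sketch: it sorts by date first (an assumption about what "ordered by the date" should mean when the rows are not already sorted) and then numbers the rows within each ID using ave().

```r
ID <- c(1, 2, 5, 8, 4, 9, 1, 11, 15, 32, 54, 1, 4, 2, 14)
Date <- as.Date(c("2021-04-12", "2021-04-12", "2021-04-13", "2021-04-14", "2021-04-19",
                  "2021-04-19", "2021-04-20", "2021-04-21", "2021-04-22", "2021-04-28",
                  "2021-04-28", "2021-04-29", "2021-04-29", "2021-05-06", "2021-05-07"))
Data <- data.frame(ID, Date)

# Sort by date so that attempts are counted in chronological order
Data <- Data[order(Data$Date), ]

# Within each ID, replace the row positions by 1, 2, 3, ...
Data$Attempt <- ave(seq_along(Data$ID), Data$ID, FUN = seq_along)
Data
```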

How can I create a column in R that calculates a value of a row based on the previous row's value?

Here's what I would like to achieve as a function in Excel, but I can't seem to find a solution to do it in R.
This is what I tried to do but it does not seem to allow me to operate with the previous values of the new column I'm trying to make.
Here is a reproducible example:
library(dplyr)
set.seed(42) ## for sake of reproducibility
dat <- data.frame(date=seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"))
This would be the output of the dataframe:
dat
date
1 2020-12-26
2 2020-12-27
3 2020-12-28
4 2020-12-29
5 2020-12-30
6 2020-12-31
Desired output:
date periodNumber
1 2020-12-26 1
2 2020-12-27 2
3 2020-12-28 3
4 2020-12-29 4
5 2020-12-30 5
6 2020-12-31 6
My try at this:
dat %>%
mutate(periodLag = dplyr::lag(date)) %>%
mutate(periodNumber = ifelse(is.na(periodLag)==TRUE, 1,
ifelse(date == periodLag, dplyr::lag(periodNumber), (dplyr::lag(periodNumber) + 1))))
Excel formula screenshot:
You could use dplyr's cur_group_id():
library(dplyr)
set.seed(42)
# I used a larger example
dat <- data.frame(date=sample(seq.Date(as.Date("2020-12-26"), as.Date("2020-12-31"), "day"), size = 30, replace = TRUE))
dat %>%
arrange(date) %>% # needs sorting because of the random example
group_by(date) %>%
mutate(periodNumber = cur_group_id())
This returns
# A tibble: 30 x 2
# Groups: date [6]
date periodNumber
<date> <int>
1 2020-12-26 1
2 2020-12-26 1
3 2020-12-26 1
4 2020-12-26 1
5 2020-12-26 1
6 2020-12-26 1
7 2020-12-26 1
8 2020-12-26 1
9 2020-12-27 2
10 2020-12-27 2
11 2020-12-27 2
12 2020-12-27 2
13 2020-12-27 2
14 2020-12-27 2
15 2020-12-27 2
16 2020-12-28 3
17 2020-12-28 3
18 2020-12-28 3
19 2020-12-29 4
20 2020-12-29 4
21 2020-12-29 4
22 2020-12-29 4
23 2020-12-29 4
24 2020-12-29 4
25 2020-12-30 5
26 2020-12-30 5
27 2020-12-30 5
28 2020-12-30 5
29 2020-12-30 5
30 2020-12-31 6
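If the data are sorted by date, the Excel-style rule "add 1 whenever the date differs from the previous row" can also be written as a cumulative sum in base R. This is a small sketch with made-up example dates, not the only way to do it:

```r
# Made-up example with repeated dates
dat <- data.frame(date = as.Date(c("2020-12-26", "2020-12-26", "2020-12-27",
                                   "2020-12-28", "2020-12-28", "2020-12-30")))

# The rule compares each row with the previous one, so sort first
dat <- dat[order(dat$date), , drop = FALSE]

d <- dat$date
# TRUE for the first row and for every row whose date differs from the row above
changed <- c(TRUE, d[-1] != d[-length(d)])

# Counting the changes so far gives the period number
dat$periodNumber <- cumsum(changed)
dat
```

The original attempt fails because mutate() cannot refer to periodNumber while that column is still being created. The cumulative sum avoids needing the previous value of the new column at all.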

How to read a list of list in r

I have a txt file like this:
[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]
I would like to read it by R as a table like this:
seller_id  product_id  buyer_id  sale_date   quantity  price
7          11          49        2019-01-21  5         3330
13         32          6         2019-02-10  9         1089
50         47          4         2019-01-06  1         1343
1          22          2         2019-03-03  9         7677
What R code would do this? Thanks very much for your time.
An easier option is fromJSON
library(jsonlite)
library(janitor)
library(dplyr) # provides %>% and as_tibble
fromJSON(txt = "file1.txt") %>%
as_tibble %>%
row_to_names(row_number = 1) %>%
type.convert(as.is = TRUE)
-output
# A tibble: 4 x 6
# seller_id product_id buyer_id sale_date quantity price
# <int> <int> <int> <chr> <int> <int>
#1 7 11 49 2019-01-21 5 3330
#2 13 32 6 2019-02-10 9 1089
#3 50 47 4 2019-01-06 1 1343
#4 1 22 2 2019-03-03 9 7677
You will need to parse the json from arrays into a data frame. Perhaps something like this:
# Get string
str <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'
df_list <- jsonlite::parse_json(str)
do.call(rbind, lapply(df_list[-1], function(x) {
setNames(as.data.frame(x), unlist(df_list[1]))}))
#> seller_id product_id buyer_id sale_date quantity price
#> 1 7 11 49 2019-01-21 5 3330
#> 2 13 32 6 2019-02-10 9 1089
#> 3 50 47 4 2019-01-06 1 1343
#> 4 1 22 2 2019-03-03 9 7677
Created on 2020-12-11 by the reprex package (v0.3.0)
Some base R options using:
gsub + read.table
read.table(
text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)),
sep = ",",
header = TRUE
)
gsub + read.csv
read.csv(text = gsub('"|\\[|\\]', "", gsub("\\],", "\n", s)))
which gives
seller_id product_id buyer_id sale_date quantity price
1 7 11 49 2019-01-21 5 3330
2 13 32 6 2019-02-10 9 1089
3 50 47 4 2019-01-06 1 1343
4 1 22 2 2019-03-03 9 7677
Data
s <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'
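All of the approaches above leave sale_date as text. If a real Date column is wanted, one more step is needed. Here is a sketch building on the gsub + read.csv option, where the variable name sales is just for illustration:

```r
s <- '[["seller_id","product_id","buyer_id","sale_date","quantity","price"],[7,11,49,"2019-01-21",5,3330],[13,32,6,"2019-02-10",9,1089],[50,47,4,"2019-01-06",1,1343],[1,22,2,"2019-03-03",9,7677]]'

# Turn "]," into line breaks, then drop quotes and brackets to get CSV text
csv_text <- gsub('"|\\[|\\]', "", gsub("\\],", "\n", s))
sales <- read.csv(text = csv_text)

# sale_date is read as text; convert it explicitly
sales$sale_date <- as.Date(sales$sale_date)
str(sales)
```

Note that the regex approach relies on the values themselves containing no commas, quotes or brackets. For anything messier, a JSON parser is the safer route.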

Selecting distinct entries based on specific variables in R

I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
We could create a group with rleid on the 'id' column, slice the last row, remove the temporary grouping column
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr:
myData %>%
group_by(id,date_cat) %>%
top_n(1, date1) # rank by the Date column; the character column date would be compared as text
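Several of the answers above rely on the rows already being in date order within each group. Here is a base R sketch that sorts explicitly before picking the last row per id and date_cat. The names sorted and last_rows are only for illustration.

```r
myData <- data.frame(
  id = c(3, 3, 6, 6, 4, 4, 3, 3),
  date = c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7",
           "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3"),
  date_cat = c(1, 1, 1, 1, 2, 2, 3, 3),
  measurement = c(10, 13, 14, 13, 12, 11, 14, 17)
)
myData$date1 <- as.Date(myData$date)

# Sort by the real Date column so that "last" means "latest"
sorted <- myData[order(myData$date1), ]

# Keep the final row of every (id, date_cat) combination
last_rows <- sorted[!duplicated(sorted[c("id", "date_cat")], fromLast = TRUE), ]
last_rows
```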

limited size dates groups by interval

I have a data frame with dates and I would like to group the dates into intervals of 9 days, with at most 7 dates per group. So if an interval contains 9 dates, the last 2 dates should roll over to the next group, and so on.
The starting date of an interval can only be a date that exists in the dataset.
Here is an example:
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
start_date + 10:16,
start_date + c(17, 18, 20),
start_date + c(30, 39))
x <- data.frame(date = dates)
> x
date
1 2020-04-17
2 2020-04-27
3 2020-04-28
4 2020-04-29
5 2020-04-30
6 2020-05-01
7 2020-05-02
8 2020-05-03
9 2020-05-04
10 2020-05-05
11 2020-05-07
12 2020-05-17
13 2020-05-26
And the expected output:
date group
1 2020-04-17 1
2 2020-04-27 2
3 2020-04-28 2
4 2020-04-29 2
5 2020-04-30 2
6 2020-05-01 2
7 2020-05-02 2
8 2020-05-03 2
9 2020-05-04 3
10 2020-05-05 3
11 2020-05-07 3
12 2020-05-17 4
13 2020-05-26 4
I'm really stuck on this; nothing I have tried so far has worked. Any help would be really appreciated, thank you!
I believe this is what you want. As you can see, the code is quite inefficient, but I can't think of a way to do it without going through the rows sequentially.
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
start_date + 10:16,
start_date + c(17, 18, 20),
start_date + c(30, 39))
x <- data.frame(date = dates)
assign_group <- function(group_var, group_number) {
# finding the start of the group
start_idx <- min(which(is.na(group_var)))
# finding the end of the group (either group size == 7 or the dates in the range)
end_idx <- start_idx + min(6, sum(x$date > x$date[start_idx] &
x$date <= x$date[start_idx] + 9))
# taking care of the out of range index
end_idx <- min(end_idx, length(group_var))
# assign group number
group_var[start_idx:end_idx] <- group_number
return(group_var)
}
group <- rep(NA, nrow(x))
group_number <- 1
while(sum(is.na(group[length(group)])) > 0){
group <- assign_group(group, group_number)
group_number <- group_number + 1
print(group)
}
#> [1] 1 NA NA NA NA NA NA NA NA NA NA NA NA
#> [1] 1 2 2 2 2 2 2 2 NA NA NA NA NA
#> [1] 1 2 2 2 2 2 2 2 3 3 3 NA NA
#> [1] 1 2 2 2 2 2 2 2 3 3 3 4 4
x$group <- group
x
#> date group
#> 1 2020-04-17 1
#> 2 2020-04-27 2
#> 3 2020-04-28 2
#> 4 2020-04-29 2
#> 5 2020-04-30 2
#> 6 2020-05-01 2
#> 7 2020-05-02 2
#> 8 2020-05-03 2
#> 9 2020-05-04 3
#> 10 2020-05-05 3
#> 11 2020-05-07 3
#> 12 2020-05-17 4
#> 13 2020-05-26 4
Created on 2020-05-27 by the reprex package (v0.3.0)
Here is an option using Rcpp:
library(Rcpp)
cppFunction("
IntegerVector grpDates(IntegerVector dates, int winsize, int daysaft) {
int sz = dates.size(), start = 0;
IntegerVector res(sz);
res[0] = 1;
for (int i = 1; i < sz; i++) {
if ((dates[i] - dates[start] > daysaft) || (i - start + 1 > winsize)) {
res[i] = res[i-1] + 1;
start = i;
} else {
res[i] = res[i-1];
}
}
return res;
}")
x$group <- grpDates(dates, 7L, 9L)
x
output:
date group
1 2020-04-17 1
2 2020-04-27 2
3 2020-04-28 2
4 2020-04-29 2
5 2020-04-30 2
6 2020-05-01 2
7 2020-05-02 2
8 2020-05-03 2
9 2020-05-04 3
10 2020-05-05 3
11 2020-05-07 3
12 2020-05-17 4
13 2020-05-26 4
14 2020-06-03 5
15 2020-06-04 5
16 2020-06-05 5
17 2020-06-06 5
18 2020-06-07 5
19 2020-06-08 5
20 2020-06-09 5
data with more date rows:
start_date <- as.Date("2020-04-17")
dates <- c(start_date,
start_date + 10:16,
start_date + c(17, 18, 20),
start_date + c(30, 39),
start_date + 47:53)
x <- data.frame(date = dates)
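If compiling C++ is not an option, the same loop can be written in plain R. The following is a direct port of the logic in grpDates above, offered as a sketch. The function name grp_dates_r is made up, and it assumes the dates are already sorted in increasing order.

```r
# Plain R port of the Rcpp loop (hypothetical name; dates must be sorted)
grp_dates_r <- function(dates, winsize = 7L, daysaft = 9L) {
  d <- as.integer(dates)             # days since 1970-01-01
  res <- integer(length(d))
  if (length(d) == 0L) return(res)
  res[1] <- 1L
  start <- 1L                        # index of the first date in the current group
  for (i in seq_along(d)[-1]) {
    too_late <- d[i] - d[start] > daysaft    # more than daysaft days after the group start
    too_many <- i - start + 1L > winsize     # group would exceed winsize dates
    if (too_late || too_many) {
      res[i] <- res[i - 1L] + 1L     # open a new group starting at this date
      start <- i
    } else {
      res[i] <- res[i - 1L]          # stay in the current group
    }
  }
  res
}

start_date <- as.Date("2020-04-17")
dates <- c(start_date,
           start_date + 10:16,
           start_date + c(17, 18, 20),
           start_date + c(30, 39))
x <- data.frame(date = dates)
x$group <- grp_dates_r(x$date, 7L, 9L)
x
```

For long vectors the Rcpp version will likely be faster, since this one loops in R.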
