Convert Julian date to calendar date - r

First, a reproducible example. I am using data.table because I am dealing with around 20 million rows -
> require(data.table)
> x <- structure(list(DoM = c(2011241L, 2015359L, 2016352L, 2015360L,
2015287L, 2014038L, 2017066L, 2012227L, 2015041L, 2015295L),
Year = c(2011L, 2015L, 2016L, 2015L, 2015L, 2014L, 2017L,
2012L, 2015L, 2015L), Month = c(8L, 12L, 12L, 12L, 10L, 2L,
3L, 8L, 2L, 10L)), .Names = c("DoM", "Year", "Month"), row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
> x
DoM Year Month
1: 2011241 2011 8
2: 2015359 2015 12
3: 2016352 2016 12
4: 2015360 2015 12
5: 2015287 2015 10
6: 2014038 2014 2
7: 2017066 2017 3
8: 2012227 2012 8
9: 2015041 2015 2
10: 2015295 2015 10
I need to extract the date from the DoM column, which contains the day in a Julian-like format. Each element of the DoM column has the form yyyyddd, where ddd is the day of the year yyyy (hence 1 <= ddd <= 366).
E.g. the first date would be 2011-08-29, because it corresponds to the 241st day of 2011.
I am not satisfied with what I currently have, which is -
x[, Date:=as.Date((DoM-1000*Year)-1, origin=paste(Year,1,1,sep='-'))]
I suspect the paste is inefficient and am looking for alternatives that might be faster.

This is possible with basic formatting. See ?strptime:
as.Date(as.character(x$DoM), format="%Y%j")
# or, as @Frank suggests, for integer dates in data.table:
as.IDate(as.character(x$DoM), format="%Y%j")
# [1] "2011-08-29" "2015-12-25" "2016-12-17" "2015-12-26" "2015-10-14"
# [6] "2014-02-07" "2017-03-07" "2012-08-14" "2015-02-10" "2015-10-22"
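A small self-contained check of the format-string approach, next to an arithmetic alternative that splits the integer with %/% and %%. This is only a sketch: the variable names (dom, via_format, via_arith) are made up for illustration, and whether the arithmetic route is actually faster on 20 million rows would need benchmarking.

```r
# yyyyddd integers, taken from the question's DoM column
dom <- c(2011241L, 2015359L, 2014038L)

# String route: %Y is the four-digit year, %j the day of the year (001-366)
via_format <- as.Date(as.character(dom), format = "%Y%j")

# Arithmetic route: integer division gives the year, the remainder the day of year
year <- dom %/% 1000L
doy  <- dom %% 1000L
via_arith <- as.Date(ISOdate(year, 1, 1)) + (doy - 1L)

format(via_format)
# [1] "2011-08-29" "2015-12-25" "2014-02-07"
all(via_format == via_arith)
# [1] TRUE
```

Both routes agree on these values; the second one avoids as.character on the full column but still builds one date per row for January 1st.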

Count months based on multiple columns (R or Stata)

I would like to count the number of months a person has worked.
separation_month is the calendar month of dismissal if there was one, and is 0 if the person was not dismissed in the current year (2017).
I want to count the months from the hire date to the dismissal date (if the person was dismissed).
If the person was not dismissed, they worked until the end of the current year, so I want to count all 12 months of 2017 plus the months from earlier years.
structure(list(id = 1:5, current_year = c(2017L, 2017L, 2017L,
2017L, 2017L), hire_month = c(2L, 9L, 10L, 3L, 2L), hire_year = c(2016L,
2014L, 1980L, 2017L, 2017L), separation_month = c(0L, 3L, 4L,
4L, 0L)), class = "data.frame", row.names = c(NA, -5L))
id current_year hire_month hire_year separation_month
1 1 2017 2 2016 0
2 2 2017 9 2014 3
3 3 2017 10 1980 4
4 4 2017 3 2017 4
5 5 2017 2 2017 0
E.g. for the first observation, I expect there to be 23 months (he worked for 11 months in 2016 and for 12 months in 2017 since he was not separated from his job).
Stata:
gen months_worked = separation_month+ (separation_month==0)*12
replace months_worked = months_worked + (current_year-hire_year)*12-hire_month+1
R:
library(dplyr)
df %>%
  mutate(months_worked = separation_month + (separation_month < 1) * 12,
         months_worked = months_worked + (current_year - hire_year) * 12 - hire_month + 1)
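For reference, the same arithmetic in base R, run on the question's sample data. This is a sketch rather than a different method: end_month is a helper name introduced here, standing for the last month worked in the current year (12 when separation_month is 0).

```r
# Sample data from the question
df <- data.frame(
  id               = 1:5,
  current_year     = 2017L,
  hire_month       = c(2L, 9L, 10L, 3L, 2L),
  hire_year        = c(2016L, 2014L, 1980L, 2017L, 2017L),
  separation_month = c(0L, 3L, 4L, 4L, 0L)
)

# Last month worked in the current year: December if there was no separation
end_month <- ifelse(df$separation_month == 0L, 12L, df$separation_month)

# Whole years between hire and current year, plus the month offset, inclusive
df$months_worked <- (df$current_year - df$hire_year) * 12L +
  end_month - df$hire_month + 1L

df$months_worked
# [1]  23  31 439   2  11
```

The first value, 23, matches the expected result stated in the question.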
Another Stata solution:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int current_year byte hire_month int hire_year byte separation_month
1 2017 2 2016 0
2 2017 9 2014 3
3 2017 10 1980 4
4 2017 3 2017 4
5 2017 2 2017 0
end
gen wanted = 1 + cond(separation_month == 0, ym(2017, 12) - ym(hire_year, hire_month), ym(2017, separation_month) - ym(hire_year, hire_month))

Merge two lists of data frames

I have two lists of data frames. Each list has 6 data frames.
The data frames have the same columns, but in list1 they hold info from 2015 to 2017, while in list2 they hold info for 2018, like below:
List1$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
List2$A
Name Value Year
AAA 543 2018
BBB 248 2018
I want to merge the data frames from both lists, so that in the end I have just one list of data frames with the info for all years.
Some data frames in list1 already have info for 2018, so when I merge them with the others I want those 2018 values to be replaced.
Newlist$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
BBB 248 2018
I tried this, but it didn't work:
data<- lapply(list1,list2, function (x,y) merge(x,y))
How can I do this?
It's always helpful to include a sample of data with dput, but here's an attempt, untested against your actual data:
library(tidyverse)
map2(list1, list2, ~bind_rows(.y, .x) %>% group_by(Name, Year) %>% slice(1))
We bind the rows (with list2 first), then group by Name and Year and take the first occurrence with slice. Because the list2 rows come first, any Name/Year combination that appears in both data frames keeps the value from the 2nd data frame.
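The same idea can be sketched in base R with Map, rbind and duplicated. The data below is made up to resemble the question (the 2018 value 999 in list1 is invented so the replacement is visible), and the final ordering step is optional.

```r
# Hypothetical stand-ins for the question's lists (one data frame each)
list1 <- list(A = data.frame(Name  = c("AAA", "BBB", "CCC", "AAA"),
                             Value = c(123, 456, 789, 999),
                             Year  = c(2015, 2016, 2017, 2018)))
list2 <- list(A = data.frame(Name  = c("AAA", "BBB"),
                             Value = c(543, 248),
                             Year  = c(2018, 2018)))

merged <- Map(function(old, new) {
  both <- rbind(new, old)                               # list2 rows first
  both <- both[!duplicated(both[c("Name", "Year")]), ]  # keep first Name/Year
  both[order(both$Year, both$Name), ]                   # optional: tidy order
}, list1, list2)

merged$A$Value
# [1] 123 456 789 543 248
```

Map pairs the data frames by position, so this assumes both lists hold their data frames in the same order.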
Alternatively, we could first bind everything into one long data frame and then remove the "2018" entries from list 1 wherever list 2 also has an entry.
To do this, we put both lists into one list, add an ID column to each data frame, and rbind them. With by/ave, the ID column later helps to remove the "2018" duplicates that stem from list 1, while keeping those that don't occur in list 2.
The trick in the latter step is to use rev(seq_along(x)).
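A tiny illustration of that trick on made-up values (one name with a 2015 row, a 2018 row from list 1, and a 2018 row from list 2, in that order):

```r
# Years for a single name, list-1 rows first, then the list-2 row
year <- c(2015, 2018, 2018)

# Number the rows within each year in reverse order
id <- ave(year, year, FUN = function(z) rev(seq_along(z)))
id
# [1] 1 2 1

# Dropping id == 2 removes the earlier (list 1) 2018 row, keeps the rest
year[id != 2]
# [1] 2015 2018
```

A year that occurs only once always gets id 1, which is why a 2018 row that exists only in list 1 survives.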
To demonstrate, I have created sample data that probably resembles yours (see Data below).
# list the lists
L <- list(L1=L1, L2=L2)
# add id column to sublists
L <- lapply(seq(L), function(x)
Map(`[<-`, L[[x]], "list", value=substr(names(L)[x], 2, 2)))
# rbind lists to long data frame
d <- do.call(rbind, unlist(L, recursive=FALSE))
# remove 2018 duplicates of list L1, keep if no 2018 in list L2
do.call(rbind, by(d, d$name, function(y) {
i <- cbind(y, id=ave(y$year, y$year, FUN=function(z) rev(seq_along(z))))
i[!i$id == 2, ]
}))
Result
# name value year list id
# A.A.1 A 998 2015 1 1
# A.A.4 A 456 2016 1 1
# A.A.7 A 312 2017 1 1
# A.A.13 A 478 2018 2 1
# B.A.2 B 1592 2015 1 1
# B.A.5 B 1072 2016 1 1
# B.A.8 B 673 2017 1 1
# B.A.21 B 445 2018 2 1
# C.A.3 C 957 2015 1 1
# C.A.6 C 199 2016 1 1
# C.A.9 C 2165 2017 1 1
# C.A.31 C 342 2018 2 1
# D.B.1 D 877 2015 1 1
# D.B.4 D 876 2016 1 1
# D.B.7 D 482 2017 1 1
# D.B.13 D 1077 2018 2 1
# E.B.2 E 370 2015 1 1
# E.B.5 E 1475 2016 1 1
# E.B.8 E 768 2017 1 1
# E.B.11 E 385 2018 1 1 <- this stems from list 1!
# F.B.3 F 421 2015 1 1
# F.B.6 F 930 2016 1 1
# F.B.9 F 1105 2017 1 1
# F.B.31 F 1836 2018 2 1
Data
L1 <- list(A = structure(list(name = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
value = c(1371, 565, 363, 633, 404, 106, 1512, 95, 2018,
63, 1305, 2287), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)), B = structure(list(name = structure(c(1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("D", "E", "F"), class = "factor"),
value = c(1389, 279, 133, 636, 284, 2656, 2440, 1320, 307,
1781, 172, 1215), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)))
L2 <- list(A = structure(list(name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), value = c(1895, 430, 257), year = c(2018,
2018, 2018)), class = "data.frame", row.names = c(NA, -3L)),
B = structure(list(name = structure(c(1L, 3L), .Label = c("D",
"E", "F"), class = "factor"), value = c(1763, 640), year = c(2018,
2018)), row.names = c(1L, 3L), class = "data.frame"))
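# Row 2 (E) was intentionally removed via L2$B <- L2$B[-2, ] before the dput;
# the structure above already reflects that, so do not run the removal again.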

Spread over multiple columns in R - dplyr tidyr solution

I've got monthly year-over-year data in a long format that I'm trying to spread using two columns. The only examples I've seen use a single key.
> dput(df)
structure(list(ID = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "b", "b", "b", "b", "b", "b", "b", "b"), Year = c(2015L,
2015L, 2015L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2015L,
2015L, 2015L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L), Month = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), Value = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L)), .Names = c("ID", "Year", "Month",
"Value"), class = "data.frame", row.names = c(NA, -18L))
I'm trying to get it into a format with the years as columns 3:5, and one row per month per ID:
ID Month 2015 2016 2017
a 1 1 2 3
a 2 1 2 3
a 3 1 2 3
b 1 6 9 12
b 2 7 10 13
b 3 8 11 14
I tried the following and got this error:
by_month_over_years = spread(df,key = c(Year,Month), Value)
Error: `var` must evaluate to a single number or a column name, not an integer vector
library(tidyr)
library(dplyr)
df %>% group_by(ID) %>% spread(Year, Value)
# A tibble: 6 x 5
# Groups: ID [2]
ID Month `2015` `2016` `2017`
<chr> <int> <int> <int> <int>
1 a 1 1 2 3
2 a 2 1 2 3
3 a 3 1 2 3
4 b 1 6 9 12
5 b 2 7 10 13
6 b 3 8 11 14
library(reshape2) # or data.table, for dcast
dcast(df, ID + Month ~ Year, value.var = "Value")
# ID Month 2015 2016 2017
# 1 a 1 1 2 3
# 2 a 2 1 2 3
# 3 a 3 1 2 3
# 4 b 1 6 9 12
# 5 b 2 7 10 13
# 6 b 3 8 11 14
Here is a base R option with reshape:
reshape(df, idvar = c('ID', 'Month'), direction = 'wide', timevar = 'Year')
# ID Month Value.2015 Value.2016 Value.2017
#1 a 1 1 2 3
#2 a 2 1 2 3
#3 a 3 1 2 3
#10 b 1 6 9 12
#11 b 2 7 10 13
#12 b 3 8 11 14
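In tidyr 1.0.0 and later, spread is superseded by pivot_wider, so a version of the first answer for current tidyr might look like the sketch below. The data frame is rebuilt here from the question's dput so the snippet runs on its own; wide is just an illustrative name.

```r
library(tidyr)

# Same data as the question's dput, rebuilt compactly
df <- data.frame(
  ID    = rep(c("a", "b"), each = 9),
  Year  = rep(rep(2015:2017, each = 3), times = 2),
  Month = rep(1:3, times = 6),
  Value = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 6:14)
)

# One column per Year; ID and Month stay as the identifying columns
wide <- pivot_wider(df, names_from = Year, values_from = Value)
names(wide)
# [1] "ID"    "Month" "2015"  "2016"  "2017"
```

As with spread, only the column whose values become the new column names is named as the key; the remaining columns (ID and Month) identify the rows automatically.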

Grouping data and then assigning values to variable names stored in strings - R

I am trying to migrate this activity from Excel/SQL to R and I am stuck. Any help is very much appreciated. Thanks!
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer id, get one row of output. Use the variable names stored in the Variable_Name column to create columns, and assign each column the sum of Amount. Create a parallel set of columns (suffix _P) set to 1 or 0 depending on the presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))
One option:
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 63539 2000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(result) <- gsub("\\.1$", "_P", names(result))
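A compact variant that names the indicator columns up front, run on the first five rows of the question's data only. The names sums and flags are introduced here for illustration.

```r
# First five rows of the question's data (subset for brevity)
dat <- data.frame(
  Cust_ID       = c(1L, 1L, 2L, 2L, 3L),
  Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009", "D_2009"),
  Amount        = c(2000L, 100L, 300L, 20L, 299090L)
)

# Sum of Amount per customer and variable name, one column per name
sums <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name, data = dat))

# 1/0 presence indicators, labelled with a _P suffix straight away
flags <- (sums > 0) + 0
colnames(flags) <- paste0(colnames(sums), "_P")

result <- data.frame(Cust_ID = as.integer(rownames(sums)), sums, flags)
names(result)
# [1] "Cust_ID"  "A_2009"   "B_2009"   "C_2009"   "D_2009"   "A_2009_P"
# [7] "B_2009_P" "C_2009_P" "D_2009_P"
```

Setting the suffix before calling data.frame avoids relying on the automatic ".1" de-duplication of column names.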
