Adding something to a list of dates in a column - r

Suppose a data.table is:
dt <- structure(list(type = c("A", "B", "C"), dates = c("21-07-2011",
"22-11-2011,01-12-2011", "07-08-2012,14-08-2012,18-08-2012,11-10-2012"
)), class = c("data.table", "data.frame"), row.names = c(NA, -3L))
Check it:
type dates
1: A 21-07-2011
2: B 22-11-2011,01-12-2011
3: C 07-08-2012,14-08-2012,18-08-2012,11-10-2012
I need to add, say, 5 days to each of the dates in the second column, i.e., I want the result to be as follows:
type dates
1: A 26-07-2011
2: B 27-11-2011,06-12-2011
3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
Any help would be appreciated.

Using only base R you can do:
dt$dates = sapply(dt$dates, function(x){
dates = as.Date(strsplit(x,",")[[1]], format = "%d-%m-%Y")
paste(format(dates+5, '%d-%m-%Y'), collapse = ",")
})
Result:
> dt
type dates
1: A 26-07-2011
2: B 27-11-2011,06-12-2011
3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
This procedure is practically the same as the one given by akrun, but without the extra libraries.

Grouped by 'type', we split the 'dates' on the comma (with strsplit), convert the pieces to a Date class object with dmy (from lubridate), add 5, format the result back to the original format of the data, paste it into a single string, and assign (:=) to update the 'dates' column in the dataset:
library(lubridate)
library(data.table)
dt[, dates := paste(format(dmy(unlist(strsplit(dates, ","))) + 5,
'%d-%m-%Y'), collapse=','), by = type]
dt
# type dates
#1: A 26-07-2011
#2: B 27-11-2011,06-12-2011
#3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
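Another option, without splitting, converting to Date, and reformatting, is a regex method with gsubfn. Note that it only adds 5 to the day number as text, so it is correct only when day + 5 stays within the same month (as in this example); it does not roll over into the next month or year: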
library(gsubfn)
dt[, dates := gsubfn("^(\\d+)", ~ as.numeric(x) + 5,
gsubfn(",(\\d+)", ~sprintf(",%02d", as.numeric(x) + 5), dates))]
dt
# type dates
#1: A 26-07-2011
#2: B 27-11-2011,06-12-2011
#3: C 12-08-2012,19-08-2012,23-08-2012,16-10-2012
NOTE: I would assume the second method to be faster, as we are not splitting, converting, pasting, etc.
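Since the regex variant edits the day number as text, it may help to see how the Date-based approach behaves across a month or year boundary. The following is a minimal base R sketch; the helper name shift_dates and its arguments are hypothetical, not part of the original answers.

```r
# Hypothetical helper (not from the original answers): shift every date in a
# comma-separated string by `days`, using Date arithmetic so that month and
# year boundaries roll over correctly.
shift_dates <- function(x, days = 5, fmt = "%d-%m-%Y") {
  vapply(strsplit(x, ",", fixed = TRUE), function(parts) {
    shifted <- as.Date(parts, format = fmt) + days
    paste(format(shifted, fmt), collapse = ",")
  }, character(1))
}

res <- shift_dates(c("21-07-2011", "28-07-2011,30-12-2011"))
res
# [1] "26-07-2011"            "02-08-2011,04-01-2012"
```

With a data.table this could be applied as dt[, dates := shift_dates(dates)], with no grouping needed because the helper is vectorised over the column.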

Related

data.table in r : subset using column index

DT is a data.table with columns "A" (column index 1), "B" (column index 2), "C", etc.
For example, the following code creates a subset DT1 consisting of the rows where A==2:
DT1 <- DT[A==2, ]
But how can I make a subset like DT1 using only the column index?
For example, code like the following does not work:
DT1 <- DT[.SD==2, .SDcols = 1]
Using column indices instead of column names is not recommended, as it makes your code difficult to understand and fragile with respect to any changes that could happen to your data. (See, for example, the first paragraph of the first question in the package FAQ.) However, you can subset by column index as follows:
DT = data.table(A = 1:5, B = 2:6, C = 3:7)
DT[DT[[1]] == 2]
# A B C
#1: 2 3 4
Alternatively, we can get the row index with .I and use that to subset DT:
DT[DT[, .I[.SD==2], .SDcols = 1]]
# A B C
#1: 2 3 4
data
DT <- data.table(A = 1:5, B = 2:6, C = 3:7)
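As a middle ground between the two answers above, the index can be turned into a column name first and then used in an ordinary filter. This is only a sketch; the variables col_idx and col_name are illustrative names, not from the original answers.

```r
library(data.table)

DT <- data.table(A = 1:5, B = 2:6, C = 3:7)

# Translate the (hypothetical) column index into a name once, then filter
# with get(), which looks the column up by name inside DT.
col_idx  <- 1L
col_name <- names(DT)[col_idx]
res <- DT[get(col_name) == 2]
res
```

Resolving the name up front keeps the filter readable and makes it easy to print or log which column was actually used.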

Create column names based on "by" argument the data.table way

Say I have the following data.table
dt <- data.table(var = c("a", "b"), val = c(1, 2))
Now I want to add two new columns to dt, named a, and b with the respective values (1, 2). I can do this with a loop, but I want to do it the data.table way.
The result would be a data.table like this:
dt.res <- data.table(var = c("a", "b"), val = c(1, 2), #old vars
a = c(1, NA), b = c(NA, 2)) # newly created vars
So far I came up with something like this
dt[, c(xx) := val, by = var]
where xx would be a data.table-command similar to .N which addresses the value of the by-group.
Thanks for the help!
Appendix: The for-loop way
The non-data.table-way with a for-loop instead of a by-argument would look something like this:
for (varname in dt$var){
dt[var == varname, c(varname) := val]
}
Based on the example shown, we can use dcast from data.table to convert the long format to wide, and then join the result with the original dataset on the 'val' column.
library(data.table)#v1.9.6+
dt[dcast(dt, val~var, value.var='val'), on='val']
# var val a b
#1: a 1 1 NA
#2: b 2 NA 2
Or, as @CathG mentioned in the comments, for previous versions either use merge or set the key column and then join.
merge(dt, dcast.data.table(dt, val~var, value.var='val'))
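The join above relies on 'val' identifying each row. If values can repeat, one possible variation is to cast and join on a temporary row id instead. This is a sketch under that assumption; the id column and the three-row example data are made up for illustration.

```r
library(data.table)

# Illustrative data with a repeated value, where joining on 'val' would be ambiguous
dt <- data.table(var = c("a", "b", "a"), val = c(1, 2, 1))

dt[, id := .I]                                   # temporary row id (assumption)
wide <- dcast(dt, id ~ var, value.var = "val")   # one column per level of 'var'
res  <- dt[wide, on = "id"][, id := NULL]        # join back, then drop the helper
dt[, id := NULL]                                 # leave the input as it was
res
```

Each row of res keeps its own value in the column named after its 'var' and gets NA in the other one.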

data.table - group by all except one column

Can I group by all columns except one using data.table? I have a lot of columns, so I'd rather avoid writing out all the colnames.
The reason being I'd like to collapse duplicates in a table, where I know one column has no relevance.
library(data.table)
DT <- structure(list(N = c(1, 2, 2), val = c(50, 60, 60), collapse = c("A",
"B", "C")), .Names = c("N", "val", "collapse"), row.names = c(NA,
-3L), class = c("data.table", "data.frame"))
> DT
N val collapse
1: 1 50 A
2: 2 60 B
3: 2 60 C
That is, given DT, is there something like DT[, print(.SD), by = !collapse] which gives:
> DT[, print(.SD), .(N, val)]
collapse
1: A
collapse
1: B
2: C
without actually having to specify .(N, val)? I realise I can do this by copy and pasting the column names, but I thought there might be some elegant way to do this too.
To group by all columns except one, you can use:
by = setdiff(names(DT), "collapse")
Explanation: setdiff takes the general form setdiff(x, y) and returns all values of x that are not in y. In this case it means that all column names are returned except the collapse column.
Two alternatives:
# with '%in%'
names(DT)[!names(DT) %in% 'collapse']
# with 'is.element'
names(DT)[!is.element(names(DT), 'collapse')]
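Put together with the question's data, the setdiff approach might look like the sketch below. The names group_cols and n_rows are illustrative, and unique(DT, by = ...) is shown as one possible way to do the actual de-duplication.

```r
library(data.table)

DT <- data.table(N = c(1, 2, 2), val = c(50, 60, 60),
                 collapse = c("A", "B", "C"))

# Every column except the one to ignore
group_cols <- setdiff(names(DT), "collapse")

# Count rows per group of the remaining columns
counts <- DT[, .(n_rows = .N), by = group_cols]

# Keep the first row of each group, which collapses the duplicates
deduped <- unique(DT, by = group_cols)
```

Here counts has one row per (N, val) combination, and deduped keeps the rows with collapse "A" and "B".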

data.table: How to fast stack(DT) operation, and return data.table instead of returned data.frame

DT <- data.table(a = c(1, 3), b = c(5, 2))
DT1 <- stack(DT)
> stack(DT)
values ind
1 1 a
2 3 a
3 5 b
4 2 b
Now it has been converted to a data.frame. Sure, I can use setDT(DT1) to change it back to a data.table:
> DT1
values ind
1: 1 a
2: 3 a
3: 5 b
4: 2 b
I would like to know: are there other ways in data.table to perform the stack operation (or something like the stack function, but more efficient) that directly return a data.table instead of a data.frame?
Thank you.
You can use gather from the package tidyr
library(tidyr)
DT <- data.table(a = c(1, 3), b = c(5, 2))
DT1 <- gather(DT, ind, values, a:b)
The second argument is the name of the new "key" variable, the third argument is the name of the new "value" variable, the last argument is the columns you want to gather (all in this case). Also, gather automatically calls a faster version of melt for data.tables.
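For a result that stays within data.table, the package's own melt method is another option. The sketch below assumes the column names "values" and "ind" are wanted, to mirror the output of stack; the setcolorder call is only there to match stack's column order.

```r
library(data.table)

DT <- data.table(a = c(1, 3), b = c(5, 2))

# melt() on a data.table returns a data.table; with no id columns,
# every listed measure column is stacked into a single value column.
DT1 <- melt(DT, measure.vars = c("a", "b"),
            variable.name = "ind", value.name = "values")

# Optional: reorder by reference so the layout matches stack()
setcolorder(DT1, c("values", "ind"))
DT1
```

The ind column comes back as a factor, as it does with stack.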

R - Aggregate Dates

When aggregating an R data frame, the dates are converted to integers.
For instance, suppose I want to take the maximum date for every id in the following data frame:
> df1 <- data.frame(id = rep(c(1, 2), 2), b = as.Date(paste("01/01/", 2000:2003, sep=''), format = "%d/%m/%Y"))
> df1
id b
1 1 2000-01-01
2 2 2001-01-01
3 1 2002-01-01
4 2 2003-01-01
> aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
id b
1 1 11688
2 2 12053
Why does R behave this way? (And what's the best way to keep a Date class column in the returned data frame?)
Thanks for your help,
That works for me in R version 3; perhaps the behaviour changed in an update, so I recommend updating R :)
As for your version of R, have you tried the as.Date() function after aggregating?
In your example, it should look like this:
dtf2<-aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = "max")
dtf2$b<-as.Date(dtf2$b)
You can also add the 'origin' option to as.Date, like
as.Date(dtf2$b, origin='1970-01-01')
UPD: When R looks at dates as integers, its origin is January 1, 1970.
Hope that will help.
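A version-tolerant variant of the advice above is to aggregate first and convert back only if the Date class was actually lost. This is a sketch using the question's data; the name agg is illustrative.

```r
df1 <- data.frame(
  id = rep(c(1, 2), 2),
  b  = as.Date(paste0("01/01/", 2000:2003), format = "%d/%m/%Y")
)

agg <- aggregate(x = list(b = df1$b), by = list(id = df1$id), FUN = max)

# On R versions where aggregate() returns plain day counts, restore the
# Date class; days are counted from 1970-01-01.
if (!inherits(agg$b, "Date")) {
  agg$b <- as.Date(agg$b, origin = "1970-01-01")
}
agg
```

The explicit origin matters on older R versions, where as.Date() on a number fails without it.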

Resources