This question already has answers here: Cartesian product data frame (7 answers) and Cartesian product with dplyr (7 answers).
How can I create a data frame from the Cartesian product of column A and column B?
For example, column 'year' (2018 - 2025) and for each year a column 'week' from 1:52.
Basically, I want a nicer way to get this result:
a =data.table( c(2018) , c(1:52))
x <- c("year", "week")
colnames(a) <- x
b =data.table(c(2019) , c(1:52))
x <- c("year", "week")
colnames(b) <- x
c =data.table(c(2020) , c(1:52))
x <- c("year", "week")
colnames(c) <- x
d = rbind(a, b, c)
EDIT: Thanks!!
d <- expand.grid(year = c(2018:2020), week = c(1:52))
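A detail worth noting: expand.grid varies its first argument fastest, so the call above cycles through the years within each week. To reproduce the week-within-year ordering of the rbind version, list week first and reorder the columns (a small sketch for the full 2018-2025 range):
d <- expand.grid(week = 1:52, year = 2018:2025)[, c("year", "week")]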
Use crossing from the tidyr package. Something like:
library(tidyr)
library(data.table)
crossing(
  data.table(year = 2018:2020),
  data.table(week = 1:52))
for more details, see https://stackoverflow.com/a/49630818/1358308
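As a small simplification, crossing also accepts plain named vectors (assuming a reasonably recent tidyr), so the data.table wrappers are not strictly necessary:
crossing(year = 2018:2025, week = 1:52)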
With base R
data.frame(year = rep(2018:2020, each = 52), week = rep(1:52, times = length(2018:2020)))
Since you seem to use data.table, here is one more option.
library(data.table)
CJ('year' = 2018:2020, 'week' = 1:52)
# year week
# 1: 2018 1
# 2: 2018 2
# 3: 2018 3
# 4: 2018 4
# 5: 2018 5
# ---
#152: 2020 48
#153: 2020 49
#154: 2020 50
#155: 2020 51
#156: 2020 52
Basically:
year <- rep(2018:2025, each = 52)
week <- rep(1:52, times = length(2018:2025))
d <- as.data.frame(cbind(year, week))
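A minor refinement: cbind builds a matrix first, which would coerce mixed column types to character; calling data.frame directly avoids that intermediate step:
d <- data.frame(year = rep(2018:2025, each = 52),
                week = rep(1:52, times = length(2018:2025)))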
You just need to call data.frame:
data.frame(year = rep(2018:2020, each = 52), week = rep(1:52, times = 3))
Here is a data frame with volume as a numeric column:
df1 <- data.frame(class = c("a","b","a","b"), date = c(2009,2009,2010,2010), volume = c(1,1,2,0))
How can I convert the volume column into a percentage of the total for the same year (date) across the different classes?
data.frame(class = c("a","b","a","b"), date = c(2009,2009,2010,2010), volumepercentage = c("50.00%","50.00%","100.00%","0.00%"))
Here is a base R approach:
df1.spl <- split(df1, df1$date)
df1.lst <- lapply(df1.spl, function(x) data.frame(x, pct=prop.table(x$volume)*100))
df2 <- do.call(rbind, df1.lst)
df2
# class date volume pct
# 2009.1 a 2009 1 50
# 2009.2 b 2009 1 50
# 2010.3 a 2010 2 100
# 2010.4 b 2010 0 0
Note the change in the row names. The command rownames(df2) <- NULL will remove them.
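For a more compact base R version of the same idea, ave can compute the within-group proportions without splitting and recombining (a sketch on the same df1; sprintf gives the "50.00%" text formatting if needed):
df1$pct <- ave(df1$volume, df1$date, FUN = prop.table) * 100
df1$volumepercentage <- sprintf("%.2f%%", df1$pct)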
This question already has answers here: Split date into different columns for year, month and day (4 answers).
I have a dataset which looks like:
mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11
I need to extract the constituent elements (day, month, year) from the date of birth and put them in corresponding columns, to look like:
mother_id,dateOfBirth,dayOfBirth,monthOfBirth,yearOfBirth
1,1962-09-24,24,09,1962
2,1991-02-19,19,02,1991
3,1978-11-11,11,11,1978
Currently, I have it coded as a loop:
data <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/matching file dss nacp.csv",stringsAsFactors = F)
dss_individuals <- read.csv("/home/tumaini/Desktop/IHI-Projects/Data-Linkage/Data/dssIndividuals.csv", stringsAsFactors = F)
lookup <- data[,c("patientid","extId")]
# remove duplicates
lookup <- lookup[!(duplicated(lookup$patientid)),]
dss_individuals$dateOfBirth <- as.character.Date(dss_individuals$dob)
dss_individuals$dayOfBirth <- 0
dss_individuals$monthOfBirth <- 0
dss_individuals$yearOfBirth <- 0
# Loop starts here
for(i in 1:nrow(dss_individuals)){ #nrow(dss_individuals)
  split_list <- unlist(strsplit(dss_individuals[i,]$dateOfBirth, '[- ]'))
  dss_individuals[i,]["dayOfBirth"]   <- split_list[3]
  dss_individuals[i,]["monthOfBirth"] <- split_list[2]
  dss_individuals[i,]["yearOfBirth"]  <- split_list[1]
}
This seems to work, but is horrendously slow as I have 400 000 rows. Is there a way I can get this done more efficiently?
I compared the speed of substr, format, and lubridate. It seems that lubridate and format are much faster than substr if the variable is stored as a Date. However, substr is fastest if the variable is stored as a character vector. The results of a single run are shown.
x <- sample(
  seq(as.Date('1000/01/01'), as.Date('2000/01/01'), by = "day"),
  400000, replace = T)
system.time({
y <- substr(x, 1, 4)
m <- substr(x, 6, 7)
d <- substr(x, 9, 10)
})
# user system elapsed
# 3.775 0.004 3.779
system.time({
y <- format(x,"%Y")  # %Y for the 4-digit year, matching substr above
m <- format(x,"%m")
d <- format(x,"%d")
})
# user system elapsed
# 1.118 0.000 1.118
library(lubridate)  # provides year(), month(), day()
system.time({
y <- year(x)
m <- month(x)
d <- day(x)
})
# user system elapsed
# 0.951 0.000 0.951
x1 <- as.character(x)
system.time({
y <- substr(x1, 1, 4)
m <- substr(x1, 6, 7)
d <- substr(x1, 9, 10)
})
# user system elapsed
# 0.082 0.000 0.082
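Applied to the question's data frame (a sketch, assuming dateOfBirth is stored as a character vector in "YYYY-MM-DD" form, as the loop implies), the whole loop collapses to three vectorized calls:
dss_individuals$yearOfBirth  <- substr(dss_individuals$dateOfBirth, 1, 4)
dss_individuals$monthOfBirth <- substr(dss_individuals$dateOfBirth, 6, 7)
dss_individuals$dayOfBirth   <- substr(dss_individuals$dateOfBirth, 9, 10)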
Not sure if this will solve your speed issues, but here is a nicer way of doing it using dplyr and lubridate. In general, when it comes to manipulating data frames I personally recommend either data.table or dplyr. data.table is supposed to be faster, but dplyr is more verbose, which I personally prefer as I find it easier to pick up my code after not having read it for months.
library(dplyr)
library(lubridate)
dat <- data.frame(mother_id = c(1, 2, 3),
                  dateOfBirth = ymd(c("1962-09-24", "1991-02-19", "1978-11-11")))
dat %>% mutate(year  = year(dateOfBirth),
               month = month(dateOfBirth),
               day   = day(dateOfBirth))
Or you can use the mutate_each function to save having to write the variable name multiple times (though you get less control over the name of the output variables)
dat %>% mutate_each( funs(year , month , day) , dateOfBirth)
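Note that mutate_each has since been deprecated; in current dplyr the same result can be written with across (a sketch; the default naming yields columns such as dateOfBirth_year):
dat %>% mutate(across(dateOfBirth, list(year = year, month = month, day = day)))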
Here are some solutions. These solutions each (i) use 1 or 2 lines of code and (ii) return numeric year, month and day columns. In addition, the first two solutions use no packages -- the third uses chron's month.day.year function.
1) POSIXlt Convert to "POSIXlt" class and pick off the parts.
lt <- as.POSIXlt(DF$dateOfBirth, origin = "1970-01-01")
transform(DF, year = lt$year + 1900, month = lt$mon + 1, day = lt$mday)
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
2) read.table
cbind(DF, read.table(text = format(DF$dateOfBirth), sep = "-",
col.names = c("year", "month", "day")))
giving:
mother_id dateOfBirth year month day
1 1 1962-09-24 1962 9 24
2 2 1991-02-19 1991 2 19
3 3 1978-11-11 1978 11 11
3) chron::month.day.year
library(chron)
cbind(DF, month.day.year(DF$dateOfBirth))
giving:
mother_id dateOfBirth month day year
1 1 1962-09-24 9 24 1962
2 2 1991-02-19 2 19 1991
3 3 1978-11-11 11 11 1978
Note 1: Often, adding year, month and day columns is not really necessary; they can be generated on the fly when needed using format, substr or as.POSIXlt, so you might critically examine whether you actually need to do this.
Note 2: The input data frame, DF in reproducible form, was assumed to be:
Lines <- "mother_id,dateOfBirth
1,1962-09-24
2,1991-02-19
3,1978-11-11"
DF <- read.csv(text = Lines)
Use format once for each part:
dss_individuals$dayOfBirth <- format(dss_individuals$dateOfBirth,"%d")
dss_individuals$monthOfBirth <- format(dss_individuals$dateOfBirth,"%m")
dss_individuals$yearOfBirth <- format(dss_individuals$dateOfBirth,"%Y")
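Note that this assumes dateOfBirth is already a Date; if it was read in as character (as in the question), convert it first, relying on the default "%Y-%m-%d" format:
dss_individuals$dateOfBirth <- as.Date(dss_individuals$dateOfBirth)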
Check the substr function from the base package (or functions from the stringr package) to extract different parts of a string. This approach assumes that day, month and year always appear in the same positions and with the same width.
The strsplit function is vectorized so using rbind.data.frame to convert your list to a dataframe works:
do.call(rbind.data.frame, strsplit(df$dateOfBirth, split = '-'))
The split pieces come back as a list, so they need to be bound row-wise (which do.call handles above) or transposed with the t function before use.
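A sketch of the same idea that avoids the awkward column names rbind.data.frame produces: bind the split pieces into a character matrix with plain rbind, name the columns, and cbind them back onto the data frame (using the same character dateOfBirth column as above):
parts <- do.call(rbind, strsplit(df$dateOfBirth, split = '-'))
colnames(parts) <- c("yearOfBirth", "monthOfBirth", "dayOfBirth")
df <- cbind(df, parts)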
I assume this is a very simple transformation, but I'm unable to get it right:
I have two columns in a data table. One contains the date and the other contains some unique numbers. I basically want the count of rows in a particular month and year.
I want to know the number of readings in 2011-02, then the number of readings in 2011-03, and so on.
Here's some free data:
set.seed(1)
df <- data.frame(
  x = sample(Sys.Date() - 0:120, 20, TRUE),
  y = sample(100, 20, TRUE)
)
We can do this quite easily with data.table by using a reformatted date in the by argument.
library(data.table)
setDT(df)[, .(N = .N), by = .(month = format(x, "%Y-%m"))]
# month N
# 1: 2015-09 5
# 2: 2015-08 4
# 3: 2015-07 7
# 4: 2015-06 4
Or with base R's aggregate()
aggregate(list(N = df$y), list(month = format(df$x, "%Y-%m")), length)
# month N
# 1 2015-06 4
# 2 2015-07 7
# 3 2015-08 4
# 4 2015-09 5
Here's a different approach using group_by. I also use lubridate to parse the dates as proper Date objects, if you're interested.
library(lubridate)
library(dplyr)
# create some data
data <- data.frame("dates" = ymd(c("2014-05-01","2014-05-01","2014-05-01","2014-06-02","2014-06-02")),
"values" = c(1,3,5,2,5))
# this is the actual summarize.
data %>% group_by(dates) %>% summarise(n = n())
yields
dates n
(time) (int)
1 2014-05-01 3
2 2014-06-02 2
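Note that group_by(dates) counts rows per distinct date; to count per month and year, as the question asks, group on a year-month key instead (a sketch using lubridate's floor_date):
data %>%
  group_by(month = floor_date(dates, "month")) %>%
  summarise(n = n())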
I have a data frame with a long list of dates in one column and values in another column, which looks like this:
set.seed(1234)
df <- data.frame(date = as.Date(c('2010-09-05', '2011-09-06', '2010-09-13',
                                  '2011-09-14', '2010-09-23', '2011-09-24',
                                  '2010-10-05', '2011-10-06', '2010-10-13',
                                  '2011-10-14', '2010-10-23', '2011-10-24')),
                 value = rnorm(12))
I need to calculate the mean value in each 10-day period of each month, irrespective of year, like this:
dfNeeded <- data.frame(
  datePeriod = c('period.Sept0.10', 'period.Sept11.20', 'period.Sept21.30',
                 'period.Oct0.10', 'period.Oct11.20', 'period.Oct21.31'),
  meanValue = c(mean(df$value[c(1, 2)]),
                mean(df$value[c(3, 4)]),
                mean(df$value[c(5, 6)]),
                mean(df$value[c(7, 8)]),
                mean(df$value[c(9, 10)]),
                mean(df$value[c(11, 12)])))
Is there a fast way of doing this?
Here is a way to do it which uses the lubridate package for month and day extraction, but you can do it with base R date functions as well:
library(lubridate)
df$period <- paste(month(df$date),cut(day(df$date),breaks=c(0,10,20,31)),sep="-")
aggregate(df$value, list(period=df$period), mean)
Which gives :
period x
1 10-(0,10] -0.5606859
2 10-(10,20] -0.7272449
3 10-(20,31] -0.7377896
4 9-(0,10] -0.4648183
5 9-(10,20] -0.6306283
6 9-(20,31] 0.4675903
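If you want labels closer to the question's 'period.Sept0.10' style, month.abb gives the month abbreviation in place of the month number (a sketch building on the same period construction):
df$period <- paste0("period.", month.abb[month(df$date)], ".",
                    cut(day(df$date), breaks = c(0, 10, 20, 31)))
aggregate(df$value, list(period = df$period), mean)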
This approach with format.Date and integer division (%/%) should be reasonably fast:
tapply(df$value, list( format(df$date, "%b"), as.POSIXlt(df$date)$mday %/% 10), mean)
0 1 2
Oct -0.560686 -0.727245 -0.73779
Sep -0.464818 -0.630628 0.46759
I'm not sure how it would compare to the aggregate approach:
aggregate(df$value, list( format(df$date, "%b"), as.POSIXlt(df$date)$mday %/% 10), mean)
Group.1 Group.2 x
1 Oct 0 -0.560686
2 Sep 0 -0.464818
3 Oct 1 -0.727245
4 Sep 1 -0.630628
5 Oct 2 -0.737790
6 Sep 2 0.467590
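One caveat on the binning: integer division by 10 and cut(..., breaks = c(0, 10, 20, 31)) disagree at the edges. Day 10 falls into the second %/% bin but into (0,10] with cut, and days 30-31 get a fourth %/% bin of their own. A quick check on a full month of days:
table(1:31 %/% 10)
table(cut(1:31, breaks = c(0, 10, 20, 31)))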
I don't often have to work with dates in R, but I imagine this is fairly easy. I have daily data, as below, for several years with some values, and I want to get the sum of the values for each 8-day period. What is the best approach?
Any help you can provide will be greatly appreciated!
str(temp)
'data.frame':648 obs. of 2 variables:
$ Date : Factor w/ 648 levels "2001-03-24","2001-03-25",..: 1 2 3 4 5 6 7 8 9 10 ...
$ conv2: num -3.93 -6.44 -5.48 -6.09 -7.46 ...
head(temp)
Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802
I tried to use the aggregate function, but for some reason it doesn't work and aggregates the wrong way:
z <- aggregate(amount ~ Date, timeSequence(from =as.Date("2001-03-24"),to =as.Date("2001-03-29"), by="day"),data=temp,FUN=sum)
I prefer the xts package for such manipulations.
I read your data as a zoo object; note the flexibility of the format option.
library(xts)
ts.dat <- read.zoo(text ='Date amount
24/03/2001 -3.927020472
25/03/2001 -6.4427004
26/03/2001 -5.477592528
27/03/2001 -6.09462162
28/03/2001 -7.45666902
29/03/2001 -6.731540928
30/03/2001 -6.855206184
31/03/2001 -6.807210228
1/04/2001 -5.40278802',header=TRUE,format = '%d/%m/%Y')
Then I extract the endpoint indices of each 8-day period:
ep <- endpoints(ts.dat,'days',k=8)
Finally, I apply the function to the time series between consecutive endpoints:
period.apply(x=ts.dat,ep,FUN=sum )
2001-03-29 2001-04-01
-36.13014 -19.06520
Use cut() in your aggregate() command.
Some sample data:
set.seed(1)
mydf <- data.frame(
DATE = seq(as.Date("2000/1/1"), by="day", length.out = 365),
VALS = runif(365, -5, 5))
Now, the aggregation. See ?cut.Date for details. You can specify the number of days you want in each group using cut:
output <- aggregate(VALS ~ cut(DATE, "8 days"), mydf, sum)
list(head(output), tail(output))
# [[1]]
# cut(DATE, "8 days") VALS
# 1 2000-01-01 8.242384
# 2 2000-01-09 -5.879011
# 3 2000-01-17 7.910816
# 4 2000-01-25 -6.592012
# 5 2000-02-02 2.127678
# 6 2000-02-10 6.236126
#
# [[2]]
# cut(DATE, "8 days") VALS
# 41 2000-11-16 17.8199285
# 42 2000-11-24 -0.3772209
# 43 2000-12-02 2.4406024
# 44 2000-12-10 -7.6894484
# 45 2000-12-18 7.5528077
# 46 2000-12-26 -3.5631950
rollapply. The zoo package has a rolling apply function which can also do non-rolling aggregations. First convert the temp data frame into zoo using read.zoo like this:
library(zoo)
zz <- read.zoo(temp)
and then it's just:
rollapply(zz, 8, sum, by = 8)
Drop the by = 8 if you want a rolling total instead.
(Note that the two versions of temp in your question are not the same. They have different column headings and the Date columns are in different formats. I have assumed the str(temp) output version here. For the head(temp) version one would have to add a format = "%d/%m/%Y" argument to read.zoo.)
aggregate. Here is a solution that does not use any external packages. It uses aggregate based on the original data frame.
ix <- 8 * ((1:nrow(temp) - 1) %/% 8 + 1)
aggregate(temp[2], list(period = temp[ix, 1]), sum)
Note that ix looks like this:
> ix
[1] 8 8 8 8 8 8 8 8 16
so it groups the indices of the first 8 rows, the second 8 and so on.
Those are NOT Date-classed variables. (No self-respecting program would display a date like that, not to mention the fact that they are labeled as factors.) [I later noticed these were not the same objects.] Furthermore, the timeSequence function (at least the one in the timeDate package) does not return a Date-class vector either. So the expectation that two disparate non-Date objects would be aligned in a sensible manner is ill-conceived. The irony is that just using the temp$Date column would have worked, since:
> z <- aggregate(amount ~ Date, data=temp , FUN=sum)
> z
Date amount
1 1/04/2001 -5.402788
2 24/03/2001 -3.927020
3 25/03/2001 -6.442700
4 26/03/2001 -5.477593
5 27/03/2001 -6.094622
6 28/03/2001 -7.456669
7 29/03/2001 -6.731541
8 30/03/2001 -6.855206
9 31/03/2001 -6.807210
But to get it in 8-day intervals, use cut.Date:
> z <- aggregate(temp$amount,
                 list(Dts = cut(as.Date(temp$Date, format = "%d/%m/%Y"),
                                breaks = "8 day")), FUN = sum)
> z
Dts x
1 2001-03-24 -49.792561
2 2001-04-01 -5.402788
A cleaner approach, extending @G. Grothendieck's approach. Note: it does not take into account whether the dates are continuous or discontinuous; the sum is calculated over a fixed width.
code
library(zoo)                             # for rollapply
interval <- 8                            # your desired date interval: 2 days, 3 days or whatever
enddate  <- interval - 1                 # offset to the end date of each interval
z <- aggregate(. ~ V1, data = df, sum)   # aggregate sum of all duplicate dates
z$V1 <- as.Date(z$V1)
nrows <- nrow(z)
data.frame(Start.date = z[seq(1, nrows, interval), 1],
           End.date   = z[seq(1, nrows, interval) + enddate, 1],
           Total.sum  = rollapply(z$V2, interval, sum, by = interval, partial = TRUE))
output
Start.date End.date Total.sum
1 2000-01-01 2000-01-08 9.1395926
2 2000-01-09 2000-01-16 15.0343960
3 2000-01-17 2000-01-24 4.0974712
4 2000-01-25 2000-02-01 4.1102645
5 2000-02-02 2000-02-09 -11.5816277
data
df <- data.frame(
V1 = seq(as.Date("2000/1/1"), by="day", length.out = 365),
V2 = runif(365, -5, 5))