I am having problems converting a column of imported dates in a data frame, represented as characters in a different date format, into date objects in that same data frame. Here is a toy example:
xx <- data.frame(A = c(10, 15, 20), B = c("10/15/2010", "9/8/2015", "8/5/2013"))
If I print xx,
A B
1 10 10/15/2010
2 15 9/8/2015
3 20 8/5/2013
I apply:
xx[, "B"] <- sapply(xx[, "B"], function(x) {as.Date(x,
format = "%m/%d/%Y", origin = "1970-01-01")})
and I get:
A B
1 10 14897
2 15 16686
3 20 15922
If I look at the mode of column B, it is numeric, not date. No matter what I try I cannot seem to get a result that converts column B to a date type. I can always add:
xx[, "B"] <- as.Date(xx[, "B"])
but there must be a way to do this in one statement.
If you have only one column to convert, you can do
xx$B <- as.Date(xx$B, "%m/%d/%Y")
If you have multiple columns, use lapply instead of sapply (sapply simplifies its result to a plain vector and drops the Date class):
cols <- 2
xx[cols] <- lapply(xx[cols], as.Date, "%m/%d/%Y")
Or using lubridate where you don't need to specify the format argument.
xx$B <- lubridate::mdy(xx$B)
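To see why the original sapply call produced numbers, here is a minimal sketch comparing the two approaches on the toy data. The names via_sapply and via_lapply are only illustrative, and stringsAsFactors = FALSE is set so the example behaves the same on R versions before 4.0.

```r
xx <- data.frame(A = c(10, 15, 20),
                 B = c("10/15/2010", "9/8/2015", "8/5/2013"),
                 stringsAsFactors = FALSE)

# sapply() simplifies the per-element results into an atomic vector,
# which discards the "Date" class and leaves the underlying day counts
via_sapply <- sapply(xx$B, as.Date, format = "%m/%d/%Y")
class(via_sapply)    # "numeric"

# lapply() returns a list of columns, so each column keeps its class
via_lapply <- lapply(xx["B"], as.Date, format = "%m/%d/%Y")
class(via_lapply$B)  # "Date"
```

Assigning via_lapply back with xx["B"] <- via_lapply keeps the column as Date, which is what the lapply line above relies on.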
I want to format certain columns of a data frame changing decimal mark and number of decimal positions
a <- c(3.412324463,3.2452364)
b <- c(2.2342,4.234234)
c <- data.frame(A=a, B=b)
I can do it column by column but would rather apply it to several columns at once. Also, I cannot find how to set the number of decimals: "digits=2" gives me only two digits in total, including the decimal part.
c$A <- format(c$A, decimal.mark = ",",digits = 2)
It is better not to use function names (c) to name objects. To apply the format to all the columns
c[] <- lapply(c, format, decimal.mark = ",", digits = 2)
Or with formatC
c[] <- lapply(c, formatC, decimal.mark =",", format = "f", digits = 2)
If we need to apply to selected multiple columns, i.e. columns 1 to 3 and 7:10
j1 <- c(1:3, 7:10)
c[j1] <- lapply(c[j1], formatC, decimal.mark =",", format = "f", digits = 2)
Or another option with sprintf
c[] <- lapply(c, function(x) sub(".", ",", sprintf("%0.2f", x), fixed = TRUE))
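The difference between the two functions is in what digits means. As a small sketch (the vector vals and the names sig and dec are made up for illustration): in format() it counts significant digits, while in formatC() with format = "f" it counts decimal places.

```r
vals <- c(3.412324463, 12.2452364)

# format(): digits refers to significant digits, so the number of
# decimals shown depends on the magnitude of the values
sig <- format(vals, digits = 2, decimal.mark = ",")

# formatC() with format = "f": digits is the number of decimal places
dec <- formatC(vals, format = "f", digits = 2, decimal.mark = ",")
dec  # "3,41"  "12,25"
```

Both functions return character vectors, so do any arithmetic before formatting.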
There is also a package called fmtr designed specifically for this situation. It lets you control the formatting for each column independently. There are two main functions fapply() and fdata().
fapply() applies a format to a vector or single column. Like this:
a <- c(3.412324463,3.2452364)
b <- c(2.2342,4.234234)
c1 <- data.frame(A=a, B=b)
c2 <- c1
c2$A <- fapply(c2$A, "%.3f")
c2$B <- fapply(c2$B, "%.4f")
c2
# A B
# 1 3.412 2.2342
# 2 3.245 4.2342
fdata() applies formats to an entire data frame. All you have to do is assign the format to the format attribute of the column, and then call the fdata() function on the data frame. fdata() will apply all the formats assigned, and leave any unformatted columns alone:
c3 <- c1
# Assign formats
attr(c3$A, "format") <- "%.1f"
attr(c3$B, "format") <- "%.2f"
# Apply formats
fdata(c3)
# A B
# 1 3.4 2.23
# 2 3.2 4.23
There is also a formats() function that allows you to more easily assign the format attributes to different columns. You just create a named list, where each name corresponds to the column you want to assign formats to. Then you can call the fdata() function like above:
c4 <- c1
# Assign formats
formats(c4) <- list(A = "%.2f",
B = "%.1f")
# Apply formats
fdata(c4)
# A B
# 1 3.41 2.2
# 2 3.25 4.2
I need to create a 'key' variable, since I want to combine two datasets.
Dataset1 has the variable ymd.
Dataset2 has the three variables y, m and d.
ymd (20050516,20060512)
y(2005,2006)
m(05,05)
d(16,12)
Two Options:
Combine y,m and d into variable ymd
Split variable ymd into 3 variables y, m and d.
Assuming you have two data frames:
df1 <- data.frame(
ymd = c(20050516,20060512),
x = c(1,2)
)
df2 <- data.frame(
y = c(2005,2006),
m = c('05','05'),
d = c(16,12),
z = c(5,10)
)
You can merge by pasting together the y, m, and d elements using paste0 and changing to numeric:
library(dplyr)
df2 %>%
  mutate(
    ymd = as.numeric(paste0(y, m, d))
  ) %>%
  left_join(df1)
Output:
>
Joining, by = "ymd"
y m d z ymd x
1 2005 05 16 5 20050516 1
2 2006 05 12 10 20060512 2
You can adjust the merge (eg right_join) depending on your needs.
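If you would rather not depend on dplyr, the same idea can be sketched in base R. This is only one possible variant: it assumes the same df1 and df2 as above, pads the key with sprintf() so that single-digit months or days would still give an 8-digit key, and uses the illustrative name merged for the result.

```r
df1 <- data.frame(ymd = c(20050516, 20060512), x = c(1, 2))
df2 <- data.frame(y = c(2005, 2006), m = c("05", "05"),
                  d = c(16, 12), z = c(5, 10),
                  stringsAsFactors = FALSE)

# Build the key with explicit zero padding (4-digit year, 2-digit month and day)
df2$ymd <- as.numeric(sprintf("%04d%02d%02d",
                              as.integer(df2$y),
                              as.integer(df2$m),
                              as.integer(df2$d)))

# all.x = TRUE keeps every row of df2, similar to a left join
merged <- merge(df2, df1, by = "ymd", all.x = TRUE)
```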
Here you have an example.
I use the variables as string instead of numeric, which makes it easier. You can use as.character() as in my example to convert it.
For option 1, I just use paste0() to paste the text together.
For option 2 I use substr() to cut the text in the correct locations.
If you need the output as numeric and not string, just use as.numeric() as I did in the print function.
Here is the code, let me know if you have further questions:
ymd=as.character(c(20050516,20060512))
y=as.character(c(2005,2006))
m=sprintf("%02d", c(5,5)) # zero-pad: as.character(c(05,05)) would give "5" and drop the leading zero
d=sprintf("%02d", c(16,12))
## Concatenate y, m, and d together
ymd_concatenated=paste0(y,m,d)
print(as.numeric(ymd_concatenated))
## Split ymd into single variables
y_concatenated=c()
m_concatenated=c()
d_concatenated=c()
for (date in ymd)
{
y_concatenated=c(y_concatenated,substr(date,1,4))
m_concatenated=c(m_concatenated,substr(date,5,6))
d_concatenated=c(d_concatenated,substr(date,7,8))
}
print(y_concatenated)
print(m_concatenated)
print(d_concatenated)
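The loop is not strictly needed, because substr() is vectorised. A minimal sketch of option 2 without a loop, assuming ymd is stored as a number (the data frame name parts is just for illustration):

```r
ymd <- c(20050516, 20060512)

# "%.0f" avoids scientific notation when turning the number into text
ymd_chr <- sprintf("%.0f", ymd)

# substr() works on the whole vector, so each part needs one call
parts <- data.frame(
  y = as.numeric(substr(ymd_chr, 1, 4)),
  m = as.numeric(substr(ymd_chr, 5, 6)),
  d = as.numeric(substr(ymd_chr, 7, 8))
)
```

Drop the as.numeric() calls if you want to keep the leading zeros in m and d as text.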
I have dates in the format "X12.11.1985" and if I use the as.Date() function to convert them inside a matrix, it delivers a single number.
If I use as.Date() with just one single date, then it delivers a real date.
Why is the result of the as.Date() function different in my code?
Thank you very much!
Minimal example:
col1 = c("X01.03.1988","X05.05.1995","X11.11.1990")
col2 = c(1,3,2)
mat = cbind(col1,col2)
mat[,'col1'] <- as.Date(mat[,'col1'], format='X%d.%m.%Y')
mat <- mat[order(as.numeric(mat[,'col1'])),]
mat #Result is ordered correct but as.Date converts the dates to numbers like "6634"
as.Date("X01.03.1988",format='X%d.%m.%Y') #Converts the date to a date like "1988-03-01"
A matrix cannot contain Date objects (and also can only contain one data type). It's as simple as that. You'll need a different data structure such as a data.frame.
col1 = c("X01.03.1988","X05.05.1995","X11.11.1990")
col2 = c(1,3,2)
mat = data.frame(col1,col2) #correct data structure
mat[,'col1'] <- as.Date(mat[,'col1'], format='X%d.%m.%Y')
mat <- mat[order(as.numeric(mat[,'col1'])),]
mat
# col1 col2
#1 1988-03-01 1
#3 1990-11-11 2
#2 1995-05-05 3
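If the data has to stay a matrix for some reason, one possible workaround is to keep the parsed dates in a separate vector and use them only for ordering. This is a sketch under that assumption; dates and mat_sorted are illustrative names.

```r
col1 <- c("X01.03.1988", "X05.05.1995", "X11.11.1990")
col2 <- c(1, 3, 2)
mat <- cbind(col1, col2)
typeof(mat)  # "character": cbind() coerced col2 to text as well

# Parse the dates into a separate vector instead of writing them back
dates <- as.Date(mat[, "col1"], format = "X%d.%m.%Y")

# Use the Date vector only to decide the row order
mat_sorted <- mat[order(dates), ]
```

The matrix itself still holds the original strings, so this only helps with sorting, not with date arithmetic on the stored values.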
I can't seem to figure the following out.
I have a data frame with 398 rows and 16 variables. I want to add a date variable. I know that for each row the date increases by a week and starts with 2010-01-01. I've tried the following:
date <- ymd("2010-01-01")
df <- as.data.frame(c(1:nrow(data), 1))
for (i in 1:nrow(data)){
date <- date + 7
df[i,] <- as.Date(date)
}
I then want to bind it to my data-frame. However, the values inside df are non-dates. If I perform the date +7 calculation it works (e.g. once it goes to 2010-01-08), but if I assign it to the df it turns into weird numerical values.
Appreciate any help.
Try the following:
library(lubridate)
date <- ymd("2010-01-01")
df <- data.frame(ind = 1:5)
df$dates <- seq.Date(from = date, length.out = nrow(df), by = 7)
# note that `by = "1 week"` would also work, if you prefer more readable code.
df
ind dates
1 1 2010-01-01
2 2 2010-01-08
3 3 2010-01-15
4 4 2010-01-22
5 5 2010-01-29
Try this:
df$date <- seq(as.Date("2010-01-01"), by = 7, length.out = 398)
Also, try to get in the habit of not giving your variables names that are already used by functions, such as data and date.
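Since adding a number to a Date adds that many days, the same weekly sequence can also be built with plain arithmetic. A small sketch, where n stands in for the number of rows of your own data frame:

```r
n <- 5  # illustrative; use nrow() of your own data frame
start <- as.Date("2010-01-01")

# Adding a numeric vector to a Date adds days element-wise
weekly <- start + 7 * (seq_len(n) - 1)

# Compare with the seq() approach
all(weekly == seq(start, by = 7, length.out = n))
```

This avoids the loop entirely, and the result stays a Date vector as long as it is assigned as a whole column rather than cell by cell into a numeric column.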
I've been attempting to aggregate (somewhat erratic) daily data. I'm actually working with CSV data, but if I recreate it, it would look something like this:
library(zoo)
dates <- c("20100505", "20100505", "20100506", "20100507")
val1 <- c("10", "11", "1", "6")
val2 <- c("5", "31", "2", "7")
x <- data.frame(dates = dates, val1=val1, val2=val2)
z <- read.zoo(x, format = "%Y%m%d")
Now I'd like to aggregate this on a daily basis (notice that sometimes there is more than one data point for a day, and sometimes there is not).
I've tried lots and lots of variations, but I can't seem to aggregate, so for instance this fails:
aggregate(z, as.Date(time(z)), sum)
# Error in Summary.factor(2:3, na.rm = FALSE) : sum not meaningful for factors
There seems to be a lot of content regarding aggregate, and I've tried a number of versions but can't seem to sum this on a daily level. I'd also like to run cummax and cumulative averages in addition to the daily summing.
Any help would be greatly appreciated.
Update
The code I am actually using is as follows:
z <- read.zoo(file = "data.csv", sep = ",", header = TRUE, stringsAsFactors = FALSE, blank.lines.skip = T, na.strings="NA", format = "%Y%m%d");
It seems my (unintentional) quotation of the numbers above is similar to what is happening in practice, because when I do:
aggregate(z, index(z), sum)
#Error in Summary.factor(25L, na.rm = FALSE) : sum not meaningful for factors
There are a number of columns (100 or so); how can I specify them to be as.numeric automatically? (stringsAsFactors = FALSE doesn't appear to work.)
Or you aggregate before using zoo (val1 and val2 need to be numeric though).
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
y <- aggregate(x[,2:3],by=list(x[,1]),FUN=sum)
and then feed y into zoo.
You avoid the warning:)
You started on the right path but made a couple of mistakes.
First, zoo() expects a vector or matrix, not a data.frame (read.zoo() is the function that takes data frames). Second, the values need to be numeric:
> z <- zoo(as.matrix(data.frame(val1=c(10,11,1,6), val2=c(5,31,2,7))),
+ order.by=as.Date(c("20100505","20100505","20100506","20100507"),
+ "%Y%m%d"))
Warning message:
In zoo(as.matrix(data.frame(val1 = c(10, 11, 1, 6), val2 = c(5, :
some methods for "zoo" objects do not work if the index entries in
'order.by' are not unique
This gets us a warning which is standard in zoo: it does not like identical time indices.
Always a good idea to show the data structure, maybe via str() as well, maybe run summary() on it:
> z
val1 val2
2010-05-05 10 5
2010-05-05 11 31
2010-05-06 1 2
2010-05-07 6 7
And then, once we have it, aggregation is easy:
> aggregate(z, index(z), sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
>
val1 and val2 are character strings. In R versions before 4.0.0, data.frame() converts them to factors by default. Summing factors doesn't make sense. You probably intended:
x <- data.frame(dates = dates, val1=as.numeric(val1), val2=as.numeric(val2))
z <- read.zoo(x, format = "%Y%m%d")
aggregate(z, as.Date(time(z)), sum)
which yields:
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
Convert the character columns to numeric and then use read.zoo making use of its aggregate argument:
> x[-1] <- lapply(x[-1], function(x) as.numeric(as.character(x)))
> read.zoo(x, format = "%Y%m%d", aggregate = sum)
val1 val2
2010-05-05 21 36
2010-05-06 1 2
2010-05-07 6 7
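The question also asked about cummax and cumulative averages after the daily summing. Here is a rough sketch in base R, without zoo, that sums per day and then adds running statistics. The column names val1_cummax and val1_cummean are made up for this example, and only val1 is shown.

```r
x <- data.frame(dates = c("20100505", "20100505", "20100506", "20100507"),
                val1  = c("10", "11", "1", "6"),
                val2  = c("5", "31", "2", "7"),
                stringsAsFactors = FALSE)

# Convert the value columns to numeric and the dates to Date
x[-1]   <- lapply(x[-1], as.numeric)
x$dates <- as.Date(x$dates, format = "%Y%m%d")

# One row per day, summing duplicate dates
daily <- aggregate(cbind(val1, val2) ~ dates, data = x, FUN = sum)

# Running maximum and running mean of the daily sums
daily$val1_cummax  <- cummax(daily$val1)
daily$val1_cummean <- cumsum(daily$val1) / seq_along(daily$val1)
```

The same cummax() and cumsum() calls should also work on the columns of the aggregated zoo object once it has unique dates, but that variant is not shown here.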