I have a data set that will be used for time series analysis. The date column is currently structured as follows:
> head(cam_shiller)
div stock dates
1 0.495 7.09 1933m1
2 0.490 6.25 1933m2
3 0.485 6.23 1933m3
4 0.480 6.89 1933m4
5 0.475 8.87 1933m5
6 0.470 10.39 1933m6
If I'm not mistaken, monthly data for time series should look like this: yyyy-mm. So I'm trying to make my date column look like this:
div stock dates
1 0.495 7.09 1933-01
2 0.490 6.25 1933-02
3 0.485 6.23 1933-03
4 0.480 6.89 1933-04
5 0.475 8.87 1933-05
6 0.470 10.39 1933-06
However, using the as.yearmon function produces a column full of NAs. I tried removing the 'm' and replacing it with a dash, and then running as.yearmon again. Now the results look like this:
div stock dates
1 0.495 7.09 Jan 1933
2 0.490 6.25 Feb 1933
3 0.485 6.23 Mar 1933
4 0.480 6.89 Apr 1933
5 0.475 8.87 May 1933
6 0.470 10.39 Jun 1933
How do I change the dates into the yyyy-mm format?
library(zoo)
cam_shiller = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/cam_shiller.csv')
cam_shiller$dates = gsub('m', '-', cam_shiller$dates)
cam_shiller$dates = as.yearmon(cam_shiller$dates)
Actually, in ts you just need to specify start= and frequency=.
res <- ts(cam_shiller[, -3], start=1933, frequency=12)
res
# div stock
# Jan 1933 0.4950 7.09
# Feb 1933 0.4900 6.25
# Mar 1933 0.4850 6.23
# Apr 1933 0.4800 6.89
# May 1933 0.4750 8.87
# Jun 1933 0.4700 10.39
# Jul 1933 0.4650 11.23
# Aug 1933 0.4600 10.67
# Sep 1933 0.4550 10.58
# Oct 1933 0.4500 9.55
# Nov 1933 0.4450 9.78
# Dec 1933 0.4400 9.97
# Jan 1934 0.4408 10.54
# Feb 1934 0.4417 11.32
# Mar 1934 0.4425 10.74
# Apr 1934 0.4433 10.92
# May 1934 0.4442 9.81
# Jun 1934 0.4450 9.94
# Jul 1934 0.4458 9.47
# Aug 1934 0.4467 9.10
# Sep 1934 0.4475 8.88
# Oct 1934 0.4483 8.95
# Nov 1934 0.4492 9.20
# Dec 1934 0.4500 9.26
# ...
Or
ts(cam_shiller$stock, start=c(1933, 1), frequency=12)
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
# 1933 7.09 6.25 6.23 6.89 8.87 10.39 11.23 10.67 10.58 9.55 9.78 9.97
# 1934 10.54 11.32 10.74 10.92 9.81 9.94 9.47 9.10 8.88 8.95 9.20 9.26
# 1935 9.26 8.98 8.41 9.04 9.75 10.12 10.65 11.37 11.61 11.92 13.04 13.04
# ...
It may be wise to check beforehand that there are no gaps in the data by evaluating the column and row variances of the year and month matrices:
test <- do.call(rbind, strsplit(cam_shiller$dates, 'm')) |>
  type.convert(as.is=TRUE)
matrixStats::colVars(matrix(test[, 1], 12))
# [1] 0 0 ...
matrixStats::rowVars(matrix(test[, 2], 12))
# [1] 0 0 0 0 0 0 0 0 0 0 0 0
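Wrapped up as an assertion (a small sketch, assuming the matrixStats package is available), both checks should pass silently when there are no gaps:
stopifnot(
  matrixStats::colVars(matrix(test[, 1], 12)) == 0,  # same year within each block of 12 rows
  matrixStats::rowVars(matrix(test[, 2], 12)) == 0   # months cycle 1..12 in every year
)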
If you use xts::xts, it's rather picky, since it wants a time-based class such as "Date" or "POSIXct". So you need whole dates, i.e. paste on a 01 as a pseudo day.
res <- transform(cam_shiller, dates=strptime(paste(dates, '01'), format='%Ym%m %d')) |>
  {\(.) xts::as.xts(.[1:2], .$dates)}()
head(res)
# div stock
# 1933-01-01 0.495 7.09
# 1933-02-01 0.490 6.25
# 1933-03-01 0.485 6.23
# 1933-04-01 0.480 6.89
# 1933-05-01 0.475 8.87
# 1933-06-01 0.470 10.39
class(res)
# [1] "xts" "zoo"
Data:
cam_shiller <- structure(list(div = c(0.495, 0.49, 0.485, 0.48, 0.475, 0.47,
0.465, 0.46, 0.455, 0.45, 0.445, 0.44, 0.4408, 0.4417, 0.4425,
0.4433, 0.4442, 0.445, 0.4458, 0.4467, 0.4475, 0.4483, 0.4492,
0.45), stock = c(7.09, 6.25, 6.23, 6.89, 8.87, 10.39, 11.23,
10.67, 10.58, 9.55, 9.78, 9.97, 10.54, 11.32, 10.74, 10.92, 9.81,
9.94, 9.47, 9.1, 8.88, 8.95, 9.2, 9.26), dates = c("1933m1",
"1933m2", "1933m3", "1933m4", "1933m5", "1933m6", "1933m7", "1933m8",
"1933m9", "1933m10", "1933m11", "1933m12", "1934m1", "1934m2",
"1934m3", "1934m4", "1934m5", "1934m6", "1934m7", "1934m8", "1934m9",
"1934m10", "1934m11", "1934m12")), row.names = c(NA, 24L), class = "data.frame")
Try lubridate::ym to change the dates to yyyy-mm format:
library(tidyverse)
cam_shiller = read.csv('https://raw.githubusercontent.com/bandcar/Examples/main/cam_shiller.csv')
cam_shiller %>%
  mutate(
    date = lubridate::ym(dates),
    date = strftime(date, "%Y-%m")
  ) %>%
  head()
#> div stock dates date
#> 1 0.495 7.09 1933m1 1933-01
#> 2 0.490 6.25 1933m2 1933-02
#> 3 0.485 6.23 1933m3 1933-03
#> 4 0.480 6.89 1933m4 1933-04
#> 5 0.475 8.87 1933m5 1933-05
#> 6 0.470 10.39 1933m6 1933-06
Created on 2022-10-01 with reprex v2.0.2
The form in the question is already correct. It is not true
that you need to change it. It renders as Jan 1933, etc. but internally it is represented as year+(month-1)/12 (where month is a number 1, 2, ..., 12) which is exactly what you need for analysis. You do not want a character string of the form yyyy-mm for analysis.
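For example, a minimal illustration of that internal representation (assuming zoo is loaded):
x <- as.yearmon("1933-02")
x                   # prints as "Feb 1933"
as.numeric(x)       # 1933.083, i.e. 1933 + (2 - 1)/12
format(x, "%Y-%m")  # "1933-02", only if a character label is really needed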
If by "time series" you mean a zoo series then using u defined in the Note at the end, z below gives that with a yearmon index. The index argument to read.csv.zoo gives the column number or name of the index, the FUN argument tells it how to convert it and the format argument tells it the precise form of the dates.
If what you mean by time series is that you want a ts series then tt below gives that.
If what you mean is a data frame with a yearmon column then DF below gives that.
With either a zoo series or a ts series one could perform a variety of analyses. For example, acf(z) or acf(tt) would give the autocorrelation function.
For more information see ?read.csv.zoo. There is also an entire vignette on read.zoo and its variants. The vignettes are linked to on the CRAN home page for zoo. Also see ?strptime for the percent codes.
library(zoo)
# zoo series with yearmon column
z <- read.csv.zoo(u, index = 3, FUN = as.yearmon, format = "%Ym%m")
# ts series
tt <- as.ts(z)
# data frame with yearmon column
DF <- u |>
  read.csv() |>
  transform(dates = as.yearmon(dates, "%Ym%m"))
A character string of the form yyyy-mm is not a suitable form for most analyses, but if you really did want that anyway, then:
# zoo series with yyyy-mm character string index
z2 <- aggregate(z, format(index(z), "%Y-%m"), c)
# data.frame with yyyy-mm character string column
DF2 <- transform(DF, dates = format(dates, "%Y-%m"))
Note
u <- "https://raw.githubusercontent.com/bandcar/Examples/main/cam_shiller.csv"
Related
I seem to have some trouble converting my data frame into a time series. I have a typical data set consisting of date, export quantity, GDP, FDI, etc.
# A tibble: 252 x 10
Date `Maize Exports (m/t)` `Rainfall (mm)` `Temperature ©` `Exchange rate (R/$)` `Maize price (R)` `FDI (Million R)` GDP (Million~1 Oil p~2 Infla~3
<dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2000-05-01 00:00:00 21000 30.8 14.4 0.144 678. 4337 9056 192. 5.1
2 2000-06-01 00:00:00 54000 14.9 14.0 0.147 583. -4229 9056 205. 5.1
3 2000-07-01 00:00:00 134000 11.1 12.6 0.144 518. -4229 8841 196. 5.9
4 2000-08-01 00:00:00 213000 6.1 15.3 0.143 526. -4229 8841 205. 6.8
5 2000-09-01 00:00:00 123000 38.5 17.8 0.138 576. 6315 8841 234. 6.8
6 2000-10-01 00:00:00 94000 61.9 20.1 0.132 636. 6315 4487 231. 7.1
7 2000-11-01 00:00:00 192000 93.9 19.9 0.129 685. 6315 4487 250. 7.1
8 2000-12-01 00:00:00 134000 85.6 22.3 0.132 747. -2143 4487 192. 7
9 2001-01-01 00:00:00 133000 92.4 23.4 0.0875 1066. -5651 7365 226. 5
10 2001-02-01 00:00:00 168000 51 22.0 0.0879 1042. -5651 7365 233. 5.9
I've installed the right packages (readxl), I've used the as.Date function to ensure my Date is recognized as such, and I've used the as.ts function to convert the dataset. However, after using the as.ts function, the date column is all muddled up into random numbers and not dates anymore. What am I doing wrong? Please help!
Date Maize Exports (m/t) Rainfall (mm) Temperature © Exchange rate (R/$) Maize price (R) FDI (Million R) GDP (Million R) Oil prices (R/barrel)
[1,] 957139200 21000 30.8 14.36 0.1435235 677.88 4337 9056 192.35
[2,] 959817600 54000 14.9 13.96 0.1474926 583.48 -4229 9056 205.36
[3,] 962409600 134000 11.1 12.61 0.1437298 518.10 -4229 8841 196.38
[4,] 965088000 213000 6.1 15.27 0.1433075 525.59 -4229 8841 204.66
[5,] 967766400 123000 38.5 17.83 0.1382170 576.08 6315 8841 233.64
[6,] 970358400 94000 61.9 20.10 0.1322751 635.79 6315 4487 231.27
In short nothing is wrong - and while this response should really be a comment, I wanted to use a full answer to have a bit more space to explain.
Behind each date is a numeric value tethered to an origin, so this is just R's way of handling it. And since you imported from Excel originally, those origins may not line up if you tried to cross-check it (see below).
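As a quick illustration of that underlying number (using R's default 1970-01-01 origin):
unclass(as.Date("2000-05-01"))
# [1] 11078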
You didn't make your question reproducible, but I put some similar data together to demonstrate what's going on:
Data
df <- data.frame(date = as.Date(c("2000-05-01",
                                  "2000-06-01",
                                  "2000-07-01",
                                  "2000-08-01",
                                  "2000-09-01",
                                  "2000-10-01",
                                  "2000-11-01")),
                 maize = c(21, 54, 132, 213, 123, 94, 192) * 1000,
                 rainfall = c(30, 14, 11, 6, 38, 61, 93))
tb <- tidyr::as_tibble(df)
Turning this into a time series object using as.ts()
tb_ts <- as.ts(tb)
# Time Series:
# Start = 1
# End = 7
# Frequency = 1
# date maize rainfall
# 1 11078 21000 30
# 2 11109 54000 14
# 3 11139 132000 11
# 4 11170 213000 6
# 5 11201 123000 38
# 6 11231 94000 61
# 7 11262 192000 93
Since I created these data in R, the "origin" is January 1, 1970, and we can see this in numerical dates from the time series object and convert them back into date formats:
as.Date(tb_ts[1:7], origin = '1970-01-01')
# [1] "2000-05-01" "2000-06-01" "2000-07-01" "2000-08-01"
# [5] "2000-09-01" "2000-10-01" "2000-11-01"
Note that if you import data from Excel, Excel's origin is December 30th, 1899 (i.e., as.Date(xx, origin = "1899-12-30")), so if you tried that you get the wrong dates:
as.Date(tb_ts[1:7], origin = "1899-12-30")
# [1] "1930-04-30" "1930-05-31" "1930-06-30" "1930-07-31"
# [5] "1930-08-31" "1930-09-30" "1930-10-31
The function worked as it's supposed to. Keeping the date format you're familiar with isn't practical for computation, so it converts the dates to a different value, usually something like the number of days (or minutes, or seconds) since a certain origin, usually Jan 1, 1970. For example, here is a little set to make the point:
# a test vector of dates
> del1 <- seq(as.Date("2012-04-01"), length.out=4, by=30)
# looks like
> del1
[1] "2012-04-01" "2012-05-01" "2012-05-31" "2012-06-30"
# use the as.ts
> as.ts(del1)
Time Series:
Start = 1
End = 4
Frequency = 1
[1] 15431 15461 15491 15521
So you can see the dates, which are 30 days apart, are converted to a series of values that are 30 integers apart.
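If you ever need the dates back, the conversion is reversible (assuming the same default 1970-01-01 origin):
> as.Date(as.numeric(as.ts(del1)), origin = "1970-01-01")
[1] "2012-04-01" "2012-05-01" "2012-05-31" "2012-06-30"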
Year A B C D E F
1993-Q1 15.3 5.77 437.02 487.68 97 86.9
1993-Q2 13.5 5.74 455.2 504.5 94.7 85.4
1993-Q3 12.9 5.79 469.42 523.37 92.4 82.9
:::
2021-Q1 18.3 6.48 35680.82 29495.92 182.2 220.4
2021-Q2 7.9 6.46 36940.3 30562.03 180.4 218
Dataset1 <- read.csv('C:/Users/s/Desktop/R/intro/data/Dataset1.csv')
class(Dataset1)
[1] "data.frame"
time_series <- ts(Dataset1, start=1993, frequency = 4)
class(time_series)
[1] "mts" "ts" "matrix"
I don't know how to proceed from there to read my Year column as dates (quarterly) instead of numbers!
Date class does not work well with the ts class; it is better to work in years and quarters. Using the input shown reproducibly in the Note at the end, use read.csv.zoo with the yearqtr class and then convert the result to ts. The strip.white argument is probably not needed, but we added it just in case.
library(zoo)
z <- read.csv.zoo("Dataset1.csv", FUN = as.yearqtr, format = "%Y-Q%q",
                  strip.white = TRUE)
tt <- as.ts(z)
tt
## A B C D E F
## 1993 Q1 15.3 5.77 437.02 487.68 97.0 86.9
## 1993 Q2 13.5 5.74 455.20 504.50 94.7 85.4
## 1993 Q3 12.9 5.79 469.42 523.37 92.4 82.9
class(tt)
## [1] "mts" "ts" "matrix"
as.integer(time(tt)) # years
## [1] 1993 1993 1993
cycle(tt) # quarters
## Qtr1 Qtr2 Qtr3
## 1993 1 2 3
as.numeric(time(tt)) # time in years
## [1] 1993.00 1993.25 1993.50
If you did want to use Date class it would be better to use a zoo (or xts) series.
zd <- aggregate(z, as.Date, c)
zd
## A B C D E F
## 1993-01-01 15.3 5.77 437.02 487.68 97.0 86.9
## 1993-04-01 13.5 5.74 455.20 504.50 94.7 85.4
## 1993-07-01 12.9 5.79 469.42 523.37 92.4 82.9
If you want a data frame or xts object then fortify.zoo(z), fortify.zoo(zd), as.xts(z) or as.xts(zd) can be used depending on which one you want.
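For instance, a short sketch of those conversions (the object names here are just illustrative, and xts is assumed to be installed):
library(xts)
DF_from_z  <- fortify.zoo(z)   # data frame with a yearqtr Index column
DF_from_zd <- fortify.zoo(zd)  # data frame with a Date Index column
x_from_z   <- as.xts(z)        # xts series with a yearqtr index
x_from_zd  <- as.xts(zd)       # xts series with a Date index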
Note
Lines <- "Year,A,B,C,D,E,F
1993-Q1,15.3,5.77,437.02,487.68,97,86.9
1993-Q2,13.5,5.74,455.2,504.5,94.7,85.4
1993-Q3,12.9,5.79,469.42,523.37,92.4,82.9
"
cat(Lines, file = "Dataset1.csv")
lubridate has a really nice year-quarter function, yq, to convert year-quarters to dates.
Dataset1<-structure(list(Year = c("1993-Q1", "1993-Q2", "1993-Q3", "1993-Q4", "1994-Q1", "1994-Q2"), ChinaGDP = c(15.3, 13.5, 12.9, 14.1, 14.1, 13.3), Yuan = c(5.77, 5.74, 5.79, 5.81, 8.72, 8.7), totalcredit = c(437.02, 455.2, 469.42, 521.68, 363.42, 389.01), bankcredit = c(487.68, 504.5, 523.37, 581.83, 403.48, 431.06), creditpercGDP = c(97, 94.7, 92.4, 95.6, 91.9, 90), creditGDPratio = c(86.9, 85.4, 82.9, 85.7, 82.8, 81.2)), row.names = c(NA, 6L), class = "data.frame")
library(lubridate)
library(dplyr)
df_quarter <- Dataset1 %>%
  mutate(date=yq(Year)) %>%
  relocate(date, .after=Year)
df_quarter
#> Year date ChinaGDP Yuan totalcredit bankcredit creditpercGDP
#> 1 1993-Q1 1993-01-01 15.3 5.77 437.02 487.68 97.0
#> 2 1993-Q2 1993-04-01 13.5 5.74 455.20 504.50 94.7
#> 3 1993-Q3 1993-07-01 12.9 5.79 469.42 523.37 92.4
#> 4 1993-Q4 1993-10-01 14.1 5.81 521.68 581.83 95.6
#> 5 1994-Q1 1994-01-01 14.1 8.72 363.42 403.48 91.9
#> 6 1994-Q2 1994-04-01 13.3 8.70 389.01 431.06 90.0
#> creditGDPratio
#> 1 86.9
#> 2 85.4
#> 3 82.9
#> 4 85.7
#> 5 82.8
#> 6 81.2
Created on 2022-01-15 by the reprex package (v2.0.1)
I have two dataframes - one is the base dataframe and the other is the query dataframe.
Base Dataframe (base_df):
Mon Tue Wed Thu Fri Sat
A 5.23 0.01 6.81 8.67 0.10 6.21
B 6.26 2.19 4.28 5.57 0.16 2.81
C 7.41 2.63 4.32 6.57 0.20 1.69
D 6.17 1.50 5.30 9.22 2.19 5.47
E 1.23 9.01 8.09 1.29 7.65 4.57
Query Dataframe (query_df):
Person Start End
A Tue Thu
C Mon Wed
D Thu Sat
C Thu Sat
B Wed Fri
I want to extract all the observations for a particular person between the start and end days. The span from start to end is always three days (inclusive of the start and end days).
Hence the output wanted is:
Person Start End D1 D2 D3
A Tue Thu 0.01 6.81 8.67
C Mon Wed 7.41 2.63 4.32
D Thu Sat 9.22 2.19 5.47
C Thu Sat 6.57 0.20 1.69
B Wed Fri 4.28 5.57 0.16
I want to avoid a loop because the actual base_df has more than 35,000 rows. Is there a data.table solution? Solutions using other data structures are good too. Thank you!
Another base R solution, using mapply...
query_df <- cbind(query_df,
                  t(mapply(function(p, s, e) {
                    base_df[p, match(s, names(base_df)):match(e, names(base_df))]},
                    query_df$Person,
                    query_df$Start,
                    query_df$End)))
names(query_df)[4:6] <- c("D1", "D2", "D3")
query_df
Person Start End D1 D2 D3
1 A Tue Thu 0.01 6.81 8.67
2 C Mon Wed 7.41 2.63 4.32
3 D Thu Sat 9.22 2.19 5.47
4 C Thu Sat 6.57 0.2 1.69
5 B Wed Fri 4.28 5.57 0.16
The data.table solution below should also work for varying numbers of days between Start and End (not just 3-day periods), thanks to a non-equi join and melt() / dcast() for reshaping:
library(data.table)
setDT(base_df)
setDT(query_df)
# reshape from wide to long
long <- melt(base_df, id.vars = "Person", variable.name = "Day")
# align factor levels
cols <- c("Start", "End")
query_df[, (cols) := lapply(.SD, factor, levels = levels(long$Day)), .SDcols = cols][
  # add row id because Person is not unique
  , rn := .I]
# non-equi right join, i.e., take all rows of query_df
long[query_df, on = .(Person, Day >= Start, Day <= End),
     .(rn, Person, Start = i.Start, End = i.End, value)][
  # reshape from long to wide
  , dcast(.SD, rn + Person + ... ~ rowid(rn, prefix = "D"))]
rn Person Start End D1 D2 D3
1: 1 A Tue Thu 0.01 6.81 8.67
2: 2 C Mon Wed 7.41 2.63 4.32
3: 3 D Thu Sat 9.22 2.19 5.47
4: 4 C Thu Sat 6.57 0.20 1.69
5: 5 B Wed Fri 4.28 5.57 0.16
Note that Day is a factor with the names of weekdays as factor levels in order of appearance:
str(long)
Classes ‘data.table’ and 'data.frame': 30 obs. of 3 variables:
$ Person: chr "A" "B" "C" "D" ...
$ Day : Factor w/ 6 levels "Mon","Tue","Wed",..: 1 1 1 1 1 2 2 2 2 2 ...
$ value : num 5.23 6.26 7.41 6.17 1.23 0.01 2.19 2.63 1.5 9.01 ...
- attr(*, ".internal.selfref")=<externalptr>
Aligned factor levels are crucial for the non-equi join.
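To see why the explicit levels matter: factor() sorts levels alphabetically by default, which would scramble the chronological order the join comparisons rely on. A quick check:
levels(factor(c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat")))
[1] "Fri" "Mon" "Sat" "Thu" "Tue" "Wed"
levels(long$Day)
[1] "Mon" "Tue" "Wed" "Thu" "Fri" "Sat"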
Data
library(data.table)
base_df <- fread(
"Person Mon Tue Wed Thu Fri Sat
A 5.23 0.01 6.81 8.67 0.10 6.21
B 6.26 2.19 4.28 5.57 0.16 2.81
C 7.41 2.63 4.32 6.57 0.20 1.69
D 6.17 1.50 5.30 9.22 2.19 5.47
E 1.23 9.01 8.09 1.29 7.65 4.57"
)
query_df <- fread(
"Person Start End
A Tue Thu
C Mon Wed
D Thu Sat
C Thu Sat
B Wed Fri"
)
A tidyverse answer
I reshape base_df, then join and slice the correct days, then reshape back.
library(tidyr)
library(dplyr)
base_df <- tibble::rownames_to_column(base_df, 'Person')
days <- names(base_df)[-1]
base_df %>%
  gather(day, value, -Person) %>%
  right_join(mutate(query_df, i = row_number())) %>%
  group_by(i) %>%
  slice(which(days == Start):which(days == End)) %>%
  mutate(col = c('D1', 'D2', 'D3')) %>%
  select(-day, -i) %>%
  spread(col, value)
data.table solution:
Here I use get to extract columns (e.g. Mon) from a data.table object.
library(data.table)
# Prepare data
base_df$Person <- rownames(base_df)
d <- merge(query_df, base_df, "Person", sort = FALSE)
setDT(d)
# Extract mid day (the day between start and end);
# 'days' must already hold the weekday column names in order
# (it is defined in the base R solution below)
d[, Mid := days[which(Start == days) + 1], 1:nrow(d)]
# Extract columns using get
d[, .(Person, Start, End,
      D1 = get(Start), D2 = get(Mid), D3 = get(End)), 1:nrow(d)][, nrow := NULL][]
Person Start End D1 D2 D3
1: A Tue Thu 0.01 6.81 8.67
2: C Mon Wed 7.41 2.63 4.32
3: D Thu Sat 9.22 2.19 5.47
4: C Thu Sat 6.57 0.20 1.69
5: B Wed Fri 4.28 5.57 0.16
Base R solution:
# Order of days
days <- names(base_df)
# Order of persons
subjects <- rownames(base_df)
res <- apply(query_df, 1, function(x) {
  # Extract observation between start:end date
  foo <- base_df[x[1] == subjects, which(x[2] == days):which(x[3] == days)]
  colnames(foo) <- paste0("D", 1:3)
  foo})
# Merge with original query_df
res <- cbind(query_df, do.call("rbind", res))
rownames(res) <- NULL
res
A base solution using indexing with a numeric matrix:
ri <- match(query_df$Person, rownames(base_df))
ci <- match(query_df$Start, names(base_df))
cbind(query_df, `dim<-`(base_df[cbind(ri, rep(ci, 3) + rep(0:2, each = nrow(query_df)))],
                        c(nrow(query_df), 3)))
# Person Start End 1 2 3
# 1 A Tue Thu 0.01 6.81 8.67
# 2 C Mon Wed 7.41 2.63 4.32
# 3 D Thu Sat 9.22 2.19 5.47
# 4 C Thu Sat 6.57 0.20 1.69
# 5 B Wed Fri 4.28 5.57 0.16
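In case the matrix indexing is unfamiliar: subsetting a data frame with a two-column matrix picks one element per (row, column) pair. A tiny illustration, assuming base_df from the question (persons as row names, days as columns):
base_df[cbind(c(1, 3), c(2, 1))]  # Tue value for person A, Mon value for person C
# [1] 0.01 7.41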
This is my first question on this forum.
I would like to re-model the structure of my dataset.
I would like to split the column "Teams" into two columns. One with the hometeam and another with the awayteam.
I would also like to split the result into two columns, Homegoals and Awaygoals. The new columns should not have a zero in front of the "real" goals scored.
BEFORE
Date Time Teams Results Homewin Draw Awaywin
18 May 19:45 AC Milan - Sassuolo 02:01 1.26 6.22 10.47
18 May 19:45 Chievo - Inter 02:01 3.73 3.42 2.05
18 May 19:45 Fiorentina - Torino 02:02 2.84 3.58 2.39
AFTER
Date Time Hometeam Awayteam Homegoals Awaygoals Homewin Draw Awaywin
18 May 19:45 AC Milan Sassuolo 2 1 1.26 6.22 10.47
18 May 19:45 Chievo Inter 2 1 3.73 3.42 2.05
18 May 19:45 Fiorentina Torino 2 2 2.84 3.58 2.39
Can R fix this problem for me? Which packages do I need?
I want to be able to do this for many Excel spreadsheets with different leagues and divisions, but all with the same structure.
Can someone help me and my data.frame?
tidyr solution:
separate(your.data.frame, Teams, c('Home', 'Away'), sep = " - ")
Base R solution (following this answer):
df <- data.frame(do.call(rbind, strsplit(as.character(your.df$Teams), " - ")))
names(df) <- c("Home", "Away")
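The Results column can be split the same way; here is a small sketch along the same lines (assuming the column is named Results, as in the question), where as.integer() drops the leading zeros:
goals <- do.call(rbind, strsplit(as.character(your.df$Results), ":"))
your.df$Homegoals <- as.integer(goals[, 1])  # "02" -> 2
your.df$Awaygoals <- as.integer(goals[, 2])  # "01" -> 1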
Here's an approach that uses cSplit from the splitstackshape package, which uses and returns a data.table. Presuming your original data frame is named df,
library(splitstackshape)
setnames(
  cSplit(df, 3:4, c(" - ", ":"))[, c(1:2, 6:9, 3:5), with = FALSE],
  3:6,
  paste0(c("Home", "Away"), rep(c("Team", "Goals"), each = 2))
)[]
# Date Time HomeTeam AwayTeam HomeGoals AwayGoals Homewin Draw Awaywin
# 1: 18 May 19:45 AC Milan Sassuolo 2 1 1.26 6.22 10.47
# 2: 18 May 19:45 Chievo Inter 2 1 3.73 3.42 2.05
# 3: 18 May 19:45 Fiorentina Torino 2 2 2.84 3.58 2.39
I am stuck on why this is happening and have tried searching everywhere for the answer. When I try to plot a time series object in R, the resulting plot comes out in reverse.
I have the following code:
library(sqldf)
stock_prices <- read.csv('~/stockPrediction/input/REN.csv')
colnames(stock_prices) <- tolower(colnames(stock_prices))
colnames(stock_prices)[7] <- 'adjusted_close'
stock_prices <- sqldf('SELECT date, adjusted_close FROM stock_prices')
head(stock_prices)
date adjusted_close
1 2014-10-20 3.65
2 2014-10-17 3.75
3 2014-10-16 4.38
4 2014-10-15 3.86
5 2014-10-14 3.73
6 2014-10-13 4.09
tail(stock_prices)
date adjusted_close
1767 2007-10-15 8.99
1768 2007-10-12 9.01
1769 2007-10-11 9.02
1770 2007-10-10 9.06
1771 2007-10-09 9.06
1772 2007-10-08 9.08
But when I try the following code:
stock_prices_ts <- ts(stock_prices$adjusted_close, start=c(2007, 1), end=c(2014, 10), frequency=12)
plot(stock_prices_ts, col='blue', lwd=2, type='l')
The image that results (omitted here) shows the series in reverse.
And even if I reverse the time series object with this code:
plot(rev(stock_prices_ts), col='blue', lwd=2, type='l')
I get a plot (also omitted) which has arbitrary numbers.
Any idea why this is happening? Any help is much appreciated.
This happens because your object loses its time series structure once you apply the rev function.
For example :
set.seed(1)
gnp <- ts(cumsum(1 + round(rnorm(100), 2)),
          start = c(1954, 7), frequency = 12)
gnp ## gnp has a real time series structure
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1954 0.37 1.55 1.71 4.31 5.64 5.82
1955 7.31 9.05 10.63 11.32 13.83 15.22 15.60 14.39 16.51 17.47 18.45 20.39
1956 22.21 23.80 25.72 27.50 28.57 27.58 29.20 30.14 30.98 30.51 31.03 32.45
...
rev(gnp) ## the reversal is just a vector
[1] 110.91 110.38 110.60 110.17 110.45 108.89 106.30 104.60 102.44 ....
In general it is a little bit painful to manipulate the ts class. One idea is to use an xts object, which "generally" keeps its structure once you apply common operations to it.
Even if the generic method rev is not implemented for an xts object, it is easy to coerce the resulting zoo time series back to an xts one using as.xts.
par(mfrow=c(2,2))
plot(gnp,col='red',main='gnp')
plot(rev(gnp),type='l',col='red',main='rev(gnp)')
library(xts)
xts_gnp <- as.xts(gnp)
plot(xts_gnp)
## note here that I apply as.xts again after the rev operation,
## otherwise I lose the xts structure
rev_xts_gnp = as.xts(rev(as.xts(gnp)))
plot(rev_xts_gnp)