Subtracting value in a column and change another one - r

I have a data frame which looks like this:
structure(list(V1 = c(1174060957322141696, 1174107739209043968,
1175456617980149760, 1175463444805558272, 1175475052307013632,
1175916108697808896, 1177035962104369152, 1177959867077791744,
1180512511436709888, 1179879113844236288), V2 = structure(c(573L,
595L, 87L, 88L, 91L, 67L, 561L, 100L, 77L, 1L), .Label = c("Fri Oct 04 00:01:16 CEST 2019",
"Sat Oct 05 13:55:30 CEST 2019", "Sat Oct 05 13:55:56 CEST 2019",
"Wed Oct 02 10:25:36 CEST 2019", "Wed Oct 02 11:47:16 CEST 2019",
"Wed Oct 02 23:43:18 CEST 2019", "Wed Oct 02 23:46:07 CEST 2019",
"Wed Oct 02 23:52:27 CEST 2019", "Wed Oct 02 23:54:42 CEST 2019",
"Wed Oct 02 23:55:50 CEST 2019", "Wed Oct 02 23:56:11 CEST 2019",
"Wed Oct 02 23:56:41 CEST 2019", "Wed Oct 02 23:57:12 CEST 2019",
"Wed Oct 02 23:58:02 CEST 2019", "Wed Oct 02 23:58:53 CEST 2019",
"Wed Oct 02 23:59:05 CEST 2019", "Wed Oct 02 23:59:16 CEST 2019",
"Wed Oct 02 23:59:42 CEST 2019", "Wed Sep 18 01:47:53 CEST 2019",
"Wed Sep 25 00:50:36 CEST 2019", "Wed Sep 25 01:06:26 CEST 2019"
), class = "factor")), row.names = c(NA, 10L), class = "data.frame")
I want to change the hours in column V4 by subtracting 07:00:00. If the time in column V4 is earlier than 07:00:00, the day in column V3 should change as well, and if the day then falls into the previous month, the month in column V2 should change too. The final aim is to count how many rows there are for each day, for which I can use:
count(entertainment_one, c("V2", "V3"))
but first I need to reorganise my data frame.
I am new to R and do not know where to start. Any help would be really appreciated, thank you very much!

First thing to notice is that your V2 is a factor; they do not behave as you think. Quickly convert it back to a character vector!
df$V2 <- as.character(df$V2)
Now, let's turn the date into an actual datetime vector. But first, set the locale to English, as your dates seem to be in English; otherwise, parsing dates written in a different language than your computer's locale might fail:
Sys.getlocale('LC_TIME') # take note of this value if you want to reset it.
Sys.setlocale('LC_TIME', 'english') # works on windows
df$dates <- strptime(df$V2, '%a %b %d %T CEST %Y', tz='XXX')
You see that 'XXX': that's because I did not look up which timezone CEST is (it is Central European Summer Time, UTC+2, so an Olson name such as 'Europe/Berlin' would be the proper value). If all your dates are in the same timezone, you probably won't notice the difference.
At this point, df$dates is a POSIXlt-class object. Try adding 1 (or any small integer):
df$dates + 1
[1] "2019-10-04 00:01:17 EDT" "2019-10-05 13:55:31 EDT" "2019-10-05 13:55:57 EDT" ...
Ahh, it's counting seconds.
So to subtract 7 hours, subtract 7 hours worth of seconds:
df$offset <- df$dates - 7 * 60 * 60
See, both days and months change accordingly. Now use the package lubridate to extract day and month-components:
library(lubridate)
month(df$offset)
day(df$offset)
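The steps above, through to the per-day count the question asks for, can be sketched in base R as follows. This is a minimal sketch under stated assumptions: the three sample strings are hypothetical (same format as V2), the session locale is assumed to recognise English day and month names, and tz = "Europe/Berlin" is an assumed stand-in for CEST.

```r
# Hypothetical sample in the same format as V2 (English names assumed).
df <- data.frame(
  V2 = c("Wed Oct 02 03:25:36 CEST 2019",
         "Wed Oct 02 11:47:16 CEST 2019",
         "Tue Oct 01 02:00:00 CEST 2019"),
  stringsAsFactors = FALSE
)

# Parse to POSIXct; "CEST" is matched literally and the zone is set explicitly.
# tz = "Europe/Berlin" is an assumption (any zone observing CEST would do).
df$dates <- as.POSIXct(df$V2, format = "%a %b %d %H:%M:%S CEST %Y",
                       tz = "Europe/Berlin")

# Subtract 7 hours, expressed in seconds.
df$offset <- df$dates - 7 * 60 * 60

# Calendar day after the shift, then the number of rows per day.
df$day <- format(df$offset, "%Y-%m-%d")
counts <- table(df$day)
counts
```

Because the subtraction happens on the full datetime, rows before 07:00 roll back to the previous day (and month) on their own, so no separate handling of the day and month columns is needed.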

Related

Error while trying to decompose a time series: non-numeric argument

I am new to R and am working on an assignment where I import some JSON data to (1) create a time series graph and (2) decompose the time-series. It's the decompose function where I'm struggling. Here is what works...
# Import JSON & convert to data.frame
aor <- fromJSON.....
aor <- as.data.frame......
# Combine the year and month into a date format
aor$date <- as.yearmon(paste(aor$year, aor$month), "%Y %m")
# Ensure data is float not chr
aor$mwh <- as.numeric(aor$mwh)
# Prep the data.frame for time-series analysis by converting to xts
aor <- xts(x = aor, order.by = aor$date)
# Successfully output a time-series graph.
dygraph(aor)
Here is a sample of aor up to this point...
> aor
mwh date
Jan 2001 " 1.42000" "Jan 2001"
Feb 2001 " 1.28400" "Feb 2001"
Mar 2001 " 1.25800" "Mar 2001"
Apr 2001 " 1.53600" "Apr 2001"
May 2001 " 1.47100" "May 2001"
Jun 2001 " 1.91800" "Jun 2001"
Jul 2001 " 2.37800" "Jul 2001"
> dput(head(aor, 10))
structure(c(" 1.42000", " 1.28400", " 1.25800", " 1.53600",
" 1.47100", " 1.91800", " 2.37800", " 2.47000", " 1.65100",
" 1.58100", "Jan 2001", "Feb 2001", "Mar 2001", "Apr 2001",
"May 2001", "Jun 2001", "Jul 2001", "Aug 2001", "Sep 2001", "Oct 2001"
), .Dim = c(10L, 2L), .Dimnames = list(NULL, c("mwh", "date")))
The code I thought would produce the decomposition graphic...
ts <- as.ts(aor)
> ts
mwh date
Jan 1 1.42000 Jan 2001
Feb 1 1.28400 Feb 2001
Mar 1 1.25800 Mar 2001
Apr 1 1.53600 Apr 2001
May 1 1.47100 May 2001
Jun 1 1.91800 Jun 2001
Jul 1 2.37800 Jul 2001
d <- decompose(ts)
plot(d)
I get this error when trying to decompose ts...
Error in `-.default`(x, trend) : non-numeric argument to binary operator
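Judging from the dput output, the date column appears to force the whole xts matrix to character, which would explain the non-numeric error. A minimal sketch of one possible way around it, assuming the numeric values are kept in a plain vector and passed to base ts: the 36 monthly values below are made up for illustration, and the names mwh and mwh_ts are hypothetical.

```r
# Hypothetical monthly series (36 months); the values are made up.
months <- 1:36
mwh <- 1.5 + 0.5 * sin(2 * pi * months / 12) + 0.01 * months

# Build a numeric ts directly: start in January 2001, 12 observations per year.
mwh_ts <- ts(mwh, start = c(2001, 1), frequency = 12)

# decompose() needs numeric data and at least two full periods.
d <- decompose(mwh_ts)
```

Calling plot(d) afterwards draws the observed, trend, seasonal and random components. Naming the series mwh_ts rather than ts also avoids masking the ts function.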

which.max outcome in list

edit: See solution at the bottom.
I have trouble using the result of which.max in a list.
Below follows an example which reproduces my problem.
Create dataframe
library(dplyr)
library(ggplot2)
library(forcats)
name <- c('A','A','A', 'A','A','A', 'A','A','A',
'B','B','B', 'B','B','B', 'B','B','B',
'C','C','C', 'C','C','C', 'C','C','C')
month = c("oct 2018", "oct 2018", "oct 2018","nov 2018", "nov 2018", "nov 2018","dec 2018", "dec 2018", "dec 2018",
"oct 2018", "oct 2018", "oct 2018","nov 2018", "nov 2018", "nov 2018","dec 2018", "dec 2018", "dec 2018" ,
"oct 2018", "oct 2018", "oct 2018","nov 2018", "nov 2018", "nov 2018","dec 2018", "dec 2018", "dec 2018" )
value <- seq(1:length(month))
df = data.frame(name, month, value)
df
Outcome
name month value
A oct 2018 1
A oct 2018 2
A oct 2018 3
A nov 2018 4
A nov 2018 5
A nov 2018 6
A dec 2018 7
A dec 2018 8
A dec 2018 9
B oct 2018 10
B oct 2018 11
B oct 2018 12
B nov 2018 13
B nov 2018 14
B nov 2018 15
B dec 2018 16
B dec 2018 17
B dec 2018 18
C oct 2018 19
C oct 2018 20
C oct 2018 21
C nov 2018 22
C nov 2018 23
C nov 2018 24
C dec 2018 25
C dec 2018 26
C dec 2018 27
Extract name of the observation with the largest value
memberLargestValue = df[which.max(df$value),]$name
memberLargestValue
Outcome
[1] C
Levels: A B C
Merge memberLargestValue with pre-existing list
oldList = c("A", "A")
newList = c(oldList, memberLargestValue)
newList
Outcome
[1] "A" "A" "3"
I do not want the "3" in the above list; I want "C" instead. Does anybody know how I can access the "C" in "memberLargestValue" and get it into the list?
Solution:
Change to "character" type:
memberLargestValue = as.character(df[which.max(df$value),]$name)
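The "3" appears because a factor stores integer codes, and combining it with a character vector uses the code rather than the label. A minimal sketch of the fix on a smaller, hypothetical data frame (name is made a factor explicitly, since newer R versions no longer do this by default):

```r
# Hypothetical three-row data frame; 'name' is explicitly a factor.
df <- data.frame(name = factor(c("A", "B", "C")), value = c(1, 2, 3))

# Convert the selected factor element to its label before combining.
memberLargestValue <- as.character(df$name[which.max(df$value)])

oldList <- c("A", "A")
newList <- c(oldList, memberLargestValue)
newList
```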

R Change bar order grouped box-plot (fill-variable)

edit: I rewrote the whole post including an example that is possible to replicate directly, and also containing the solution provided by Paweł Chabros. Thank you Paweł Chabros for providing a very neat answer!
In the following picture I struggle reversing the order of the box-plots, wanting to change it to go from October to December when looking left to right:
[Image: grouped box-plot of value by name, filled by month]
The dataframe is created by
library(dplyr)
library(ggplot2)
library(forcats)
name <- c('A','A','A', 'A','A','A', 'A','A','A',
'B','B','B', 'B','B','B', 'B','B','B',
'C','C','C', 'C','C','C', 'C','C','C')
month = c("oct 2018", "oct 2018", "oct 2018","nov 2018", "nov 2018", "nov 2018","dec 2018", "dec 2018", "dec 2018",
"oct 2018", "oct 2018", "oct 2018","nov 2018", "nov 2018", "nov 2018","dec 2018", "dec 2018", "dec 2018" ,
"oct 2018", "oct 2018", "oct 2018","nov 2018", "nov 2018", "nov 2018","dec 2018", "dec 2018", "dec 2018" )
value <- seq(1:length(month))
df = data.frame(name, month, value)
df
The data frame looks like this
name month value
A oct 2018 1
A oct 2018 2
A oct 2018 3
A nov 2018 4
A nov 2018 5
A nov 2018 6
A dec 2018 7
A dec 2018 8
A dec 2018 9
B oct 2018 10
B oct 2018 11
B oct 2018 12
B nov 2018 13
B nov 2018 14
B nov 2018 15
B dec 2018 16
B dec 2018 17
B dec 2018 18
C oct 2018 19
C oct 2018 20
C oct 2018 21
C nov 2018 22
C nov 2018 23
C nov 2018 24
C dec 2018 25
C dec 2018 26
C dec 2018 27
The plot in the figure above is created by
wantedMonths = c("oct 2018", "nov 2018", "dec 2018")
wantedNames = c("A", "B")
df2= df[df$name %in% wantedNames, ]
ggplot(df2[df2$month %in% wantedMonths , ]) + geom_boxplot(aes(as.factor(name), value, fill=month))#fct_rev(month)
The command that creates the correct plot, which was provided by Paweł Chabros, is
ggplot(df2[df2$month %in% wantedMonths , ]) + geom_boxplot(aes(as.factor(name), value, fill=fct_rev(month)))
ggplot uses the order of the factor levels for this purpose. You can set month as an ordered factor either inside the ggplot call or beforehand, in the data. In the latter case, add the following line before df is subset into df2 and ggplot is called:
df[['month']] = ordered(df[['month']], levels = c('oct 2018', 'nov 2018', 'dec 2018'))
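If your problem is only which colour each month gets, you can set the colours manually with the scale_fill_manual function (month is mapped to fill here, so scale_colour_manual would have no effect). Note that this changes the colours, not the position of the boxes.
Just add this while plotting with ggplot:
scale_fill_manual(values = c("red","green","blue"))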
The answer, which is also included in the edited question, is to use fct_rev:
ggplot(df2[df2$month %in% wantedMonths , ]) + geom_boxplot(aes(as.factor(name), value, fill=fct_rev(month)))
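Why reversing the levels works here can be seen without plotting. A minimal base R sketch, using a hypothetical shortened month vector: the default factor levels are alphabetical, which for these three labels happens to be the reverse of chronological order.

```r
# Hypothetical shortened vector with the same labels as in the question.
month <- c("oct 2018", "nov 2018", "dec 2018", "oct 2018")

# Default levels are sorted alphabetically: dec, nov, oct.
default_levels <- levels(factor(month))

# Explicit levels give the chronological order directly.
chrono <- factor(month, levels = c("oct 2018", "nov 2018", "dec 2018"))
levels(chrono)

# Reversing the alphabetical levels yields the same order here,
# which is why fct_rev happens to fix this particular plot.
rev(default_levels)
```

Setting the levels explicitly is the more general option, since reversing only helps when the alphabetical order is exactly the opposite of the wanted one.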

Modify the date in a data frame in R

Recently I stumbled over a problem. Unfortunately, my date variable has not been recorded uniformly.
I have a data frame similar to the one shown below:
Variable1 <- c(10,20,30,40,50)
Variable2 <- c("a", "b", "c", "d", "d")
Date <- c("today 10:45", "yesterday 3:10", "28 october 2018 5:32", "28 october 2018 8:32", "27 october 2018 5:32")
df <- data.frame(Variable1, Variable2, Date)
df
For my purposes I need to extract only the date part. Therefore, I would like to create a new variable based on "Date".
The Date variable should only contain the date. The hour is irrelevant for my purpose and can be ignored.
My goal is to get the following data frame:
Variable1 <- c(10,20,30,40,50)
Variable2 <- c("a", "b", "c", "d", "d")
Date <- c("31 october 2018", "30 october 2018", "28 october 2018", "28 october 2018", "27 october 2018")
df2 <- data.frame(Variable1, Variable2, Date)
df2
Preferably the values for Date should also be in the correct format (date).
Thank you already in advance.
library(magrittr) # provides %>% (also available via dplyr)
df$NewDate[grepl("today",df$Date)]<-Sys.Date() # Convert today to date
df$NewDate[grepl("yesterday",df$Date)]<-Sys.Date()-1 # Convert yesterday to date
df$NewDate[is.na(df$NewDate)]<-df$Date[is.na(df$NewDate)] %>% as.Date(format="%d %b %Y") # Convert explicit dates to date format
class(df$NewDate)<-"Date" # Convert column to Date class
df
Variable1 Variable2 Date NewDate
1 10 a today 10:45 2018-10-31
2 20 b yesterday 3:10 2018-10-30
3 30 c 28 october 2018 5:32 2018-10-28
4 40 d 28 october 2018 8:32 2018-10-28
5 50 d 27 october 2018 5:32 2018-10-27
tolower( # not strictly necessary, but for consistency
gsub("yesterday", format(Sys.Date()-1, "%d %B %Y"), # convert *day to dates
gsub("today", format(Sys.Date(), "%d %B %Y"),
gsub("\\s*[0-9:]*$", "", # remove the times
c("today 10:45", "yesterday 3:10", "28 october 2018 5:32", "28 october 2018 8:32", "27 october 2018 5:32")))))
# [1] "31 october 2018" "30 october 2018" "28 october 2018" "28 october 2018" "27 october 2018"
Another solution, using indices.
Date <- c("today 10:45", "yesterday 3:10", "28 october 2018 5:32", "28 october 2018 8:32", "27 october 2018 5:32")
Date <- sub("today", Sys.Date(), Date)
Date <- sub("yesterday", Sys.Date() - 1, Date)
i <- grep("[[:alpha:]]", Date)
Date[i] <- format(as.POSIXct(Date[i], format = "%d %B %Y %H:%M"), format = "%d %B %Y")
Date[-i] <- format(as.POSIXct(Date[-i]), format = "%d %B %Y")
Date
#[1] "31 October 2018" "30 October 2018" "28 October 2018"
#[4] "28 October 2018" "27 October 2018"
Then I noticed the solution by user r2evans, that converts everything to lowercase. So, if necessary, end with
Date <- tolower(Date)
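The approaches above can also be combined into one small function that returns a proper Date vector. This is a minimal sketch: to_date is a hypothetical helper name, the reference day is passed in as an argument so the result is reproducible, and the session locale is assumed to recognise English month names.

```r
# Hypothetical helper: 'today' is an argument so results are reproducible.
to_date <- function(x, today = Sys.Date()) {
  x <- sub("\\s*[0-9:]+$", "", x)           # drop the trailing time
  out <- as.Date(x, format = "%d %B %Y")    # explicit dates; NA for the rest
  out[x == "today"] <- today
  out[x == "yesterday"] <- today - 1
  out
}

Date <- c("today 10:45", "yesterday 3:10", "28 october 2018 5:32",
          "28 october 2018 8:32", "27 october 2018 5:32")
res <- to_date(Date, today = as.Date("2018-10-31"))
res
```

Keeping the result as class Date, rather than formatted text, means it sorts and subtracts correctly; format(res, "%d %B %Y") can still produce the display form when needed.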

Remove records which have the same datetime stamp

My data has this format:
DF <- data.frame(ids = c("uniqueid1", "uniqueid1", "uniqueid1", "uniqueid2", "uniqueid2", "uniqueid2", "uniqueid2", "uniqueid3", "uniqueid3", "uniqueid3", "uniqueid4", "uniqueid4", "uniqueid4"), stock_year = c("April 2014", "March 2012", "April 2014", "January 2017", "January 2016", "January 2015", "January 2014", "November 2011", "November 2011", "December 2009", "August 2001", "July 2000", "May 1999"))
ids stock_year
1 uniqueid1 April 2014
2 uniqueid1 March 2012
3 uniqueid1 April 2014
4 uniqueid2 January 2017
5 uniqueid2 January 2016
6 uniqueid2 January 2015
7 uniqueid2 January 2014
8 uniqueid3 November 2011
9 uniqueid3 November 2011
10 uniqueid3 December 2009
11 uniqueid4 August 2001
12 uniqueid4 July 2000
13 uniqueid4 May 1999
How can I remove all rows of any id that has the same value in the stock_year column more than once?
An example output of expected results is this:
DF <- data.frame(ids = c("uniqueid2", "uniqueid2", "uniqueid2", "uniqueid2", "uniqueid4", "uniqueid4", "uniqueid4"), stock_year = c("January 2017", "January 2016", "January 2015", "January 2014", "August 2001", "July 2000", "May 1999"))
ids stock_year
1 uniqueid2 January 2017
2 uniqueid2 January 2016
3 uniqueid2 January 2015
4 uniqueid2 January 2014
5 uniqueid4 August 2001
6 uniqueid4 July 2000
7 uniqueid4 May 1999
We can group by 'ids' and check for duplicates, keeping only those 'ids' that have no duplicates:
library(dplyr)
DF %>%
group_by(ids) %>%
filter(!anyDuplicated(stock_year))
Or using ave from base R (anyDuplicated returns 0 for a group without duplicates, so those are the rows to keep):
DF[with(DF, ave(as.character(stock_year), ids, FUN=anyDuplicated)==0),]
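As a check on the base R variant, here is a minimal sketch on the sample data from the question. It passes row indices through ave so the result stays integer; the names dup and result are hypothetical.

```r
DF <- data.frame(
  ids = c("uniqueid1", "uniqueid1", "uniqueid1", "uniqueid2", "uniqueid2",
          "uniqueid2", "uniqueid2", "uniqueid3", "uniqueid3", "uniqueid3",
          "uniqueid4", "uniqueid4", "uniqueid4"),
  stock_year = c("April 2014", "March 2012", "April 2014", "January 2017",
                 "January 2016", "January 2015", "January 2014",
                 "November 2011", "November 2011", "December 2009",
                 "August 2001", "July 2000", "May 1999"),
  stringsAsFactors = FALSE
)

# For each id, anyDuplicated() gives 0 when no stock_year repeats,
# otherwise the position of the first repeat within the group.
dup <- ave(seq_along(DF$ids), DF$ids,
           FUN = function(i) anyDuplicated(DF$stock_year[i]))

# Keep only the ids whose group has no repeated stock_year.
result <- DF[dup == 0, ]
result
```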
