I have a dataframe with a date column. These dates represent the date that a particular poll result was actually taken. However, the website takes these results and adds them to a table not necessarily on the date of the poll taking. So for example:
20/01/2018
21/01/2018
20/01/2018
19/01/2018
So the date at the top (20/01/2018) came in after the ones below it. But the poll below it is dated the 21st, and that's the date that poll was taken, so the earliest the one above could have been added is the 21st. Thus the list becomes:
21/01/2018
21/01/2018
20/01/2018
19/01/2018
and now my column is sorted. I need to do this for about 50 variables! Suggestions?
I want to sort my dates column such that, going from the bottom to the top of the column, if a date has a later date below it, then that date becomes the later date too.
Maybe there is a prettier way, but this should give the desired output:
data$Date <- as.POSIXct(rev(cummax(rev(as.numeric(data$Date)))), origin = "1970-01-01")
The idea is that you want a rolling maximum from the bottom up: for example, once 2018-01-02 has been reached, the rows above cannot have a date earlier than 2018-01-02. This is what the cummax function does. It carries the maximum date reached so far and overwrites earlier (smaller) dates with it. Since you want it to run from the bottom up, you have to reverse your date column with rev and then reverse it back after the call to cummax. Because cummax only works on numeric input, I converted your date column to numeric and back to a date-time at the end.
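For readers working in Python, the same bottom-up rolling maximum can be sketched with the standard library. This is only an illustration of the idea above, not the answerer's code; the helper name backfill_running_max is made up for the example, and the dates are the four from the question.

```python
from datetime import datetime
from itertools import accumulate


def backfill_running_max(dates):
    """Bottom-up rolling maximum (hypothetical helper name).

    Reverse the list, take a running max, then reverse back,
    mirroring rev(cummax(rev(x))) in the R answer.
    """
    running_max = accumulate(reversed(dates), max)
    return list(running_max)[::-1]


# The four dates from the question, in day/month/year order.
raw = ["20/01/2018", "21/01/2018", "20/01/2018", "19/01/2018"]
parsed = [datetime.strptime(d, "%d/%m/%Y").date() for d in raw]

fixed = backfill_running_max(parsed)
print([d.strftime("%d/%m/%Y") for d in fixed])
# ['21/01/2018', '21/01/2018', '20/01/2018', '19/01/2018']
```

Parsing to real date objects first matters here: comparing the raw day/month/year strings would sort by day of month rather than by calendar date.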
I am designing a flex dashboard for my office.
I have a column (let's say Entry time) containing timestamps (e.g. 2020-06-01 20:30). I want to remove those rows for which the difference between the current time and the entry time is greater than 24 hours. Can you please help?
If you are a tidyverse person, you can do this fairly easily using lubridate and filter, and then use select to keep the columns you want after filtering.
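library(lubridate)
library(tidyverse)
df <- df %>%
  # interval between the two timestamp columns
  mutate(time_difference = interval(ymd_hm(start_column), ymd_hm(end_column))) %>%
  # keep only rows whose interval is 24 hours or less
  filter(as.numeric(time_length(time_difference, "hour")) <= 24) %>%
  # drop the helper column again
  select(-time_difference)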
This takes a dataframe and creates a new column holding a lubridate interval. It then uses time_length to get the duration in hours, which is coerced to numeric (just in case, as some date objects are strings under the hood) within a filter that keeps times of 24 hours or less. The last line, using select, simply removes the time_difference field created for the filtering.
This will all be saved back into the original dataframe.
Just check the syntax before you go. Without code to test it on, I may have missed a closing parenthesis or something somewhere.
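The same filtering logic can be sketched in plain Python. This is an illustration under assumptions: the rows are dictionaries with a hypothetical "entry_time" key in the 2020-06-01 20:30 format from the question, and the current time is passed in explicitly so the result is reproducible.

```python
from datetime import datetime, timedelta


def keep_recent(rows, now, max_age=timedelta(hours=24)):
    """Keep rows whose entry time is at most max_age before now.

    "entry_time" is an assumed field name, not one from the question.
    """
    kept = []
    for row in rows:
        entry = datetime.strptime(row["entry_time"], "%Y-%m-%d %H:%M")
        if now - entry <= max_age:
            kept.append(row)
    return kept


rows = [
    {"id": 1, "entry_time": "2020-06-01 20:30"},  # 12.5 hours old
    {"id": 2, "entry_time": "2020-05-30 08:00"},  # more than two days old
]
now = datetime(2020, 6, 2, 9, 0)  # fixed "current time" for the example

recent = keep_recent(rows, now)
print([r["id"] for r in recent])  # [1]
```

In a live dashboard, now would presumably come from datetime.now() instead of a fixed value.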
I have a vector that contains time data, but there's a problem: some of the entries are listed as dates (e.g., 10/11/2017), while other entries are listed as dates with a time (e.g., 12/15/2016 09:07:17). This is a problem for me, since as.Date() can't recognize the time portion and produces dates in an odd format (0012-01-20), while seemingly turning the date-with-time entries into NAs. Furthermore, using as.POSIXct() doesn't work, since not all entries are a combination of date and time.
I suspect that, since these entries are entered in a consistent format, I could hypothetically use an if function to change the entries in the vector to a consistent format, such as using an if statement to remove time entirely, but I don't know enough about it to get it to work.
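Use lubridate:
library(lubridate)
Here x is the name of the data frame or table, and Date is the column that holds the dates.
Use the ymd function (or whichever lubridate parser matches the order of your data, such as mdy for month/day/year):
x$newdate <- ymd(x$Date)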
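The "if" approach the question hints at can be sketched in Python by trying the date-with-time format first and falling back to the date-only format. This is a minimal sketch, assuming every entry is month/day/year as in the two examples given; the function name parse_mixed is made up for the illustration.

```python
from datetime import datetime

# Formats are tried in order: date with time first, then date only.
FORMATS = ("%m/%d/%Y %H:%M:%S", "%m/%d/%Y")


def parse_mixed(text):
    """Return a date for either kind of entry, dropping any time part."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date entry: {text!r}")


entries = ["10/11/2017", "12/15/2016 09:07:17"]
print([parse_mixed(e).isoformat() for e in entries])
# ['2017-10-11', '2016-12-15']
```

Raising on unrecognised entries, rather than silently returning a missing value, makes it easier to spot rows that follow neither format.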
I am at a standstill with this problem. I outlined it in another question ( Creating data histograms/visualizations using ipython and filtering out some values ) which meandered a bit so I'd like to fix the question and give it more context since I am sure others must have a workaround for this or have the problem. I've also seen similar, not identical, questions asked and can't quite adapt any of the solutions thus far given.
I have columns in my data frame for Start Time and End Time and created a 'Duration' column for time lapsed. I'm using ipython.
The Start Time/End Time columns have fields that look like:
2014/03/30 15:45
A date and then a time in hh:mm
when I type:
pd.to_datetime(df['End Time']) and
pd.to_datetime(df['Start Time'])
I get fields resulting that look like:
2014-03-30 15:45:00
same date but with hyphens and same time but with :00 seconds appended
I then decided to create a new column for the difference between the End and Start times. The 'Duration' or time lapsed column was created by typing in one command:
df['Duration'] = pd.to_datetime(df['End Time'])-pd.to_datetime(df['Start Time'])
The format of the fields in the duration column is:
01:14:00
no date just a time lapsed in the format hh:mm:ss
to indicate time lapsed or 74 mins in the above example.
When I type:
df.Duration.dtype
dtype('m8[ns]') is returned, whereas, when I type
df.Duration.head(4)
0 00:14:00
1 00:16:00
2 00:03:00
3 00:09:00
Name: Duration, dtype: timedelta64[ns]
is returned which seems to indicate a different dtype for Duration.
How can I convert the format I have in the Duration column to a single integer value of minutes (time lapsed)? I see no methods that I can use, I'd write a function but wouldn't know how to treat the input of hh:mm:ss. This must be a common requirement of data analysis, should I be going about converting these dates and times differently if my end goal is to get a single integer indicating minutes lapsed? Should I just be using Excel?... because I have so far spent a day on this problem and it should be a simple problem to solve.
Update:
THANK YOU!! (Jeff and Dataswede) I added a column with the command:
df['Durationendminusstart'] = pd.to_timedelta(df.Duration,unit='ns').astype('timedelta64[m]')
which seems to give me the Duration (minutes lapsed) as wanted, so that huge part is solved!
What is still not clear is why there were two different dtypes for the same column depending on how I asked. Oh well, right now it doesn't matter.
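For reference, the underlying arithmetic can be sketched with only the Python standard library: subtract two datetimes to get a timedelta, then divide its total seconds by 60. The helper name minutes_between and the start time used below are assumptions for the illustration; only the 2014/03/30 15:45 format comes from the question.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"  # matches values like 2014/03/30 15:45


def minutes_between(start, end):
    """Whole minutes elapsed between two timestamp strings."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# A duration of 01:14:00 comes out as 74 minutes.
print(minutes_between("2014/03/30 14:31", "2014/03/30 15:45"))  # 74
```

In pandas, the analogous step on a timedelta column would be Series.dt.total_seconds() divided by 60. As for the two dtype printouts, as far as I know 'm8[ns]' is just NumPy's short code for timedelta64[ns], so both outputs describe the same dtype.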
I have, per cell, a date value in the format 2013-01-05 11:21.
Is there a way to separate the time of day (i.e., 11:21) and put it in a new column, without having to manually cut and paste?
I have a lot of date values in one column, and I want to separate the time-of-day portion of these dates into a new adjacent column.
Yes, the TIMEVALUE function should do this. You may need to format the result cells (in my example: B1:B8) as time values. Using cell formatting, you can set the output to an hh:mm format, too.
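If the column ever needs to be processed outside the spreadsheet, the same split can be sketched in Python. This assumes the values are text in the 2013-01-05 11:21 format from the question; the function name split_time_of_day is made up for the illustration.

```python
from datetime import datetime


def split_time_of_day(value):
    """Return (date, time-of-day) strings for a 'YYYY-MM-DD HH:MM' value."""
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M")
    return parsed.strftime("%Y-%m-%d"), parsed.strftime("%H:%M")


print(split_time_of_day("2013-01-05 11:21"))  # ('2013-01-05', '11:21')
```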
Basically I want to know why as.Date(200322,format="%Y%W") gives me NA. While we are at it, I would appreciate any advice on a data structure for repeated cross-section (aka pseudo-panel) in R.
I did get aggregate() to (sort of) work, but it is not flexible enough - it misses data on columns when I omit the missed values, for example.
Specifically, I have a survey that is repeated weekly for a couple of years with a bunch of similar questions, the answers to which I would like to combine, average, condition and plot in both dimensions. Getting the date conversion right should presumably help me towards my goal with the zoo package or something similar.
Any input is appreciated.
Update: thanks for the string suggestion, but as you can see in your own example, the %W part doesn't work. It only identifies the year and sets the current day, whereas I need to set a specific week (and leave the day blank).
Use a string as first argument in as.Date() and select a specific weekday (format %w, value 0-6). There are seven possible dates in each week, therefore strptime needs more information to select a unique date. Otherwise the current day and month are returned.
> as.Date(paste("200947", "0", sep="-"), format="%Y%W-%w")
[1] "2009-11-22"
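The same point, that a year and week number only pin down a date once a weekday is supplied, can be sketched in Python, whose strptime likewise ignores %W unless a day of the week is given. The helper name week_start is made up for the illustration, and it picks Monday (%w value 1) rather than the Sunday used above.

```python
from datetime import datetime


def week_start(year_week):
    """Monday of the %W-numbered week in a 'YYYYWW' string."""
    # Appending "-1" supplies the weekday (1 = Monday) that strptime needs.
    return datetime.strptime(year_week + "-1", "%Y%W-%w").date()


print(week_start("200947"))  # 2009-11-23

# Without a weekday the week number is ignored and 1 January comes back.
print(datetime.strptime("200947", "%Y%W").date())  # 2009-01-01
```

Fixing one weekday for every record gives each survey week a single representative date, which may be enough for grouping and plotting.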