Merge two columns, set value based upon original - r

I have a data.frame looking like this:
date1 date2
2015-09-17 03:07:00 2015-09-17 11:53:00
2015-09-17 08:00:00 2015-09-18 11:48:59
2015-09-18 15:58:00 2015-09-22 12:14:00
2015-09-22 12:14:00 2015-09-24 13:58:21
I'd like to combine these two into one column, something like:
dates
2015-09-17 03:07:00
2015-09-17 11:53:00
2015-09-17 08:00:00
2015-09-18 11:48:59
2015-09-18 15:58:00
2015-09-22 12:14:00
2015-09-22 12:14:00
2015-09-24 13:58:21
Please note that dates (like the last but one and the last but two) can be equal. Now I'd like to add a column 'value'. For every date that has its origin in date1, the value should be 1; if its origin is in date2, it should be 2.
Adding a new column is obvious. Merging works fine. I've used:
df <- as.data.frame(df$date1)
df <- data.frame(date1 = c(df$date1, test$date2 ))
That works perfectly fine for the merging of the columns, but how to get the correct value for df$value?
The result should be:
dates value
2015-09-17 03:07:00 1
2015-09-17 11:53:00 2
2015-09-17 08:00:00 1
2015-09-18 11:48:59 2
2015-09-18 15:58:00 1
2015-09-22 12:14:00 1
2015-09-22 12:14:00 2
2015-09-24 13:58:21 2

I tried to mock up your problem with a small example.
If you are not concerned about time complexity (the vectors are regrown on every iteration), this is the simplest solution I can suggest.
a = c(1,3,5)
b = c(2,4,6)
df = data.frame(a, b)
d1 = c()
d2 = c()
for(counter in 1:length(df$a))
{
d1 = c(d1,df$a[counter],df$b[counter])
d2 = c(d2,1,2)
}
df = data.frame(d1, d2)
print(df)
Input:
a b
1 2
3 4
5 6
Output:
d1 d2
1 1
2 2
3 1
4 2
5 1
6 2

Can't you just do something like this?
dates1 <- data.frame(dates = c("2015-09-17 03:07:00",
"2015-09-17 08:00:00",
"2015-09-18 15:58:00",
"2015-09-22 12:14:00"), value = 1)
dates2 <- data.frame(dates = c("2015-09-17 11:53:00",
"2015-09-18 11:48:59",
"2015-09-22 12:14:00",
"2015-09-24 13:58:21"), value = 2)
# row-bind the two data.frames
df <- rbind(dates1, dates2)
# if "dates" is in a string format, convert to timestamp
df$dates <- as.POSIXct(df$dates, format = "%Y-%m-%d %H:%M:%S")
# order by "dates"
df[order(df$dates),]
# result:
dates value
1 2015-09-17 03:07:00 1
2 2015-09-17 08:00:00 1
5 2015-09-17 11:53:00 2
6 2015-09-18 11:48:59 2
3 2015-09-18 15:58:00 1
4 2015-09-22 12:14:00 1
7 2015-09-22 12:14:00 2
8 2015-09-24 13:58:21 2

There might be a more clever solution, but I'd just separate each column into its own data frame, add a value column, and then rbind() into a single dates data frame.
# df$date1 is a plain vector, so build each data frame explicitly
df1 <- data.frame(dates = df$date1, value = 1)
df2 <- data.frame(dates = df$date2, value = 2)
dates <- rbind(df1, df2)
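If the row-by-row order from the question matters (date1, then date2, for each row), the stacking idea above can be combined with a helper index. This is only a sketch: it assumes both columns are POSIXct, recreates the question's data by hand, and uses a hypothetical helper column named row that exists only to restore the interleaved order.

```r
# Recreate the question's data (assumed POSIXct; tz fixed for reproducibility)
df <- data.frame(
  date1 = as.POSIXct(c("2015-09-17 03:07:00", "2015-09-17 08:00:00",
                       "2015-09-18 15:58:00", "2015-09-22 12:14:00"), tz = "UTC"),
  date2 = as.POSIXct(c("2015-09-17 11:53:00", "2015-09-18 11:48:59",
                       "2015-09-22 12:14:00", "2015-09-24 13:58:21"), tz = "UTC")
)
n <- nrow(df)

# Stack both columns; record the origin (1 = date1, 2 = date2) and the source row
out <- data.frame(
  dates = c(df$date1, df$date2),
  value = rep(1:2, each = n),
  row   = rep(seq_len(n), times = 2)  # hypothetical helper column
)

# Interleave: sort by source row, then by origin, and drop the helper column
out <- out[order(out$row, out$value), c("dates", "value")]
rownames(out) <- NULL
print(out)
```

Because the origin is recorded before any sorting, equal timestamps still get the right value. Sorting on dates instead of row would give chronological order.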

Related

Make an hourly dataset full

I have a dataset with a date-time vector (format is m/d/y h:m) that looks like this:
june2018_2$datetime
[1] "6/1/2018 1:00" "6/1/2018 2:00" "6/1/2018 3:00" "6/1/2018 4:00"
And I have 61 other variables that are all numeric (with some already missing values indicated with 'NA'). My date time vector is missing some hourly slots and I want to make the date-time vector full and fill in missing spots in the other 61 variables with 'NA'. I tried to use what's already out there but I can't seem to find some code or function that works for what I'm specifically working with. Any tips?
If your datetime column is not already POSIXct, convert it first with mutate(). Then complete() fills in a row for every hour; the other columns in the added rows will be NA.
library(tidyverse)
df %>%
mutate(datetime = as.POSIXct(datetime, format = "%m/%d/%Y %H:%M")) %>%
complete(datetime = seq(from = first(datetime), to = last(datetime), by = "hours"))
For example, if you have test data:
set.seed(123)
df <- data.frame(
datetime = c("6/1/2018 1:00", "6/1/2018 3:00", "6/1/2018 5:00", "6/1/2018 9:00"),
var1 = sample(10,4)
)
The output would be:
# A tibble: 9 x 2
datetime var1
<dttm> <int>
1 2018-06-01 01:00:00 3
2 2018-06-01 02:00:00 NA
3 2018-06-01 03:00:00 10
4 2018-06-01 04:00:00 NA
5 2018-06-01 05:00:00 2
6 2018-06-01 06:00:00 NA
7 2018-06-01 07:00:00 NA
8 2018-06-01 08:00:00 NA
9 2018-06-01 09:00:00 8
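If you would rather avoid the tidyverse, roughly the same result can be had in base R by merging against a complete hourly sequence. This is a sketch under assumptions: the var1 values are typed in by hand, the time zone is set to UTC for reproducibility, and the names full and filled are made up for illustration.

```r
# Small example with gaps (values entered by hand)
df <- data.frame(
  datetime = c("6/1/2018 1:00", "6/1/2018 3:00", "6/1/2018 5:00", "6/1/2018 9:00"),
  var1 = c(3, 10, 2, 8)
)

# Parse the m/d/Y H:M strings into POSIXct
df$datetime <- as.POSIXct(df$datetime, format = "%m/%d/%Y %H:%M", tz = "UTC")

# Every hour between the first and last observation
full <- data.frame(datetime = seq(min(df$datetime), max(df$datetime), by = "hour"))

# Left join: hours without data get NA in all other columns
filled <- merge(full, df, by = "datetime", all.x = TRUE)
print(filled)
```

Using min() and max() rather than the first and last element means the input does not have to be sorted beforehand.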

How to extract first observation of the day in dataframe?

I have this dataframe containing a date column, and a unique ID. I would simply want to extract the first observation of each day.
I tried to use the dplyr package (and the aggregate function) and date functions, but I'm still a beginner in R. I also looked for an answer on this forum without success. Thanks in advance for your help!
Here is the situation:
df <- as.data.frame(c(2013-01-12 07:30:00, 2013-01-12 12:40:00, 2013-01-16 06:50:00, 2013-01-16 15:10:00, 2013-01-14 11:20:00, 2013-01-14 08:15:00),
c(A,B,E,F,C,D))
The outcome should be:
2013-01-12 07:30:00 A
2013-01-14 08:15:00 D
2013-01-16 06:50:00 E
Try the code below. Note that I have edited your example data.
library(dplyr)
df <- data.frame(date = as.POSIXct(c("2013-01-12 07:30:00",
"2013-01-12 12:40:00",
"2013-01-16 06:50:00",
"2013-01-16 15:10:00",
"2013-01-14 11:20:00",
"2013-01-14 08:15:00")),
id = letters[1:6])
df %>%
group_by(as.Date(date)) %>%
filter(date == min(date))
The result should look like this:
# A tibble: 3 x 3
# Groups: as.Date(date) [3]
date id `as.Date(date)`
<dttm> <fct> <date>
1 2013-01-12 07:30:00 a 2013-01-12
2 2013-01-16 06:50:00 c 2013-01-16
3 2013-01-14 08:15:00 f 2013-01-14
Here is an approach using aggregate from the stats package, also editing your dataset definition:
df <- data.frame(times=strptime(c('2013-01-12 07:30:00', '2013-01-12 12:40:00',
'2013-01-16 06:50:00', '2013-01-16 15:10:00',
'2013-01-14 11:20:00', '2013-01-14 08:15:00'),
format = "%Y-%m-%d %H:%M:%S"),
id=c('A','B','E','F','C','D'))
df$day <- as.Date(df$times, format='%Y-%m-%d') #create a day column
aggregate(times ~ day, data = df, FUN='min')
# day times
# 1 2013-01-12 2013-01-12 07:30:00
# 2 2013-01-14 2013-01-14 08:15:00
# 3 2013-01-16 2013-01-16 06:50:00
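If the id column should be kept alongside the earliest time, another base R option is to sort by time and keep the first row of each day. This is a sketch, assuming UTC timestamps; the names day and first_obs are only illustrative.

```r
# Example data from the question, with the original IDs
df <- data.frame(
  times = as.POSIXct(c("2013-01-12 07:30:00", "2013-01-12 12:40:00",
                       "2013-01-16 06:50:00", "2013-01-16 15:10:00",
                       "2013-01-14 11:20:00", "2013-01-14 08:15:00"), tz = "UTC"),
  id = c("A", "B", "E", "F", "C", "D")
)

# Sort chronologically so the earliest row of each day comes first
df <- df[order(df$times), ]

# Calendar day of each observation (same time zone as the timestamps)
df$day <- as.Date(df$times, tz = "UTC")

# duplicated() flags all but the first row per day, so negate it
first_obs <- df[!duplicated(df$day), ]
print(first_obs)
```

For the sample data this keeps the rows with ids A, D and E, matching the outcome asked for in the question.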

Fill missing sequence values with dplyr

I have a data frame with missing values for "SNAP_ID". I'd like to fill in the missing values with floating point values based on a sequence from the previous non-missing value (lag()?). I would really like to achieve this using just dplyr if possible.
Assumptions:
There will never be missing data in the first or last row (I'm generating the missing dates based on missing days between the min and max of a data set)
There can be multiple gaps in the data set
Current data:
end SNAP_ID
1 2015-06-26 12:59:00 365
2 2015-06-26 13:59:00 366
3 2015-06-27 00:01:00 NA
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367
8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 366.2
5 2015-06-28 00:01:00 366.3
6 2015-06-28 23:00:00 366.4
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
As a data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260,
1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct",
"POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end",
"SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")
This was my attempt at achieving this goal, but it only works for the first missing value:
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID))
end SNAP_ID
1 2015-06-26 12:59:00 365.0
2 2015-06-26 13:59:00 366.0
3 2015-06-27 00:01:00 366.1
4 2015-06-27 23:00:00 NA
5 2015-06-28 00:01:00 NA
6 2015-06-28 23:00:00 NA
7 2015-06-29 09:00:00 367.0
8 2015-06-29 09:59:00 368.0
The outstanding answer from @mathematical.coffee below:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>%
ungroup() %>%
select(-tmp)
EDIT: new version works for any number of NA runs.
This one doesn't need zoo, either.
First, notice that tmp=cumsum(!is.na(SNAP_ID)) groups the SNAP_IDs such that each group with the same tmp consists of one non-NA value followed by a run of NA values.
Then group by this variable and just add .1 to the first SNAP_ID to fill out the NAs:
df %>%
arrange(end) %>%
group_by(tmp=cumsum(!is.na(SNAP_ID))) %>%
mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1)))
end SNAP_ID tmp
1 2015-06-26 12:59:00 365.0 1
2 2015-06-26 13:59:00 366.0 2
3 2015-06-27 00:01:00 366.1 2
4 2015-06-27 23:00:00 366.2 2
5 2015-06-28 00:01:00 366.3 2
6 2015-06-28 23:00:00 366.4 2
7 2015-06-29 09:00:00 367.0 3
8 2015-06-29 09:59:00 368.0 4
Then you can drop the tmp column afterwards (add %>% select(-tmp) to the end).
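To see what the grouping does outside of a pipeline, here is the same idea on a bare vector in base R. It is only an illustrative sketch: snap, grp and filled are made-up names, and ave() stands in for group_by() plus mutate().

```r
# SNAP_ID values from the question
snap <- c(365, 366, NA, NA, NA, NA, 367, 368)

# The counter increases at every non-NA value, so each group is
# one known value followed by its run of NAs: 1 2 2 2 2 2 3 4
grp <- cumsum(!is.na(snap))

# Within each group, start from the first (known) value and add 0.1 per step
filled <- ave(snap, grp, FUN = function(x) x[1] + 0.1 * (seq_along(x) - 1))
print(filled)
```

Since the offset restarts at zero in every group, any number of separate NA runs is handled the same way. A run of ten or more NAs would reach or pass the next integer ID.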
EDIT: this is the old version which doesn't work for subsequent runs of NAs.
If your aim is to fill each NA with the previous value + 0.1, you can use zoo's na.locf (which fills each NA with the previous value), along with cumsum(is.na(SNAP_ID))*0.1 to add the extra 0.1.
library(zoo)
df %>%
arrange(end) %>%
mutate(SNAP_ID=ifelse(is.na(SNAP_ID),
na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1,
SNAP_ID))

Transpose and Filter Dataframe with null values in R

This one is almost a challenge!
I have the following dataframe:
tag hour val
N1 2013-01-01 00:00:00 0.3404266179
N1 2013-01-01 01:00:00 0.3274182995
N1 2013-01-01 02:00:00 0.3142598749
N2 2013-01-01 02:00:00 0.3189924887
N2 2013-01-01 04:00:00 0.3170907762
N3 2013-01-01 05:00:00 0.3161910788
N3 2013-01-01 06:00:00 0.4247638954
I need to transform it to something like this:
hour N1 N2 N3
2013-01-01 00:00:00 0.3404266179 NULL NULL
2013-01-01 01:00:00 0.3274182995 NULL NULL
2013-01-01 02:00:00 0.3142598749 0.3189924887 NULL
2013-01-01 03:00:00 NULL NULL NULL
2013-01-01 04:00:00 NULL 0.3170907762 NULL
2013-01-01 05:00:00 NULL NULL 0.3161910788
2013-01-01 06:00:00 NULL NULL 0.4247638954
As things are not that easy, my dataframe goes up to N5000 and hour has almost 200.000 entries for each N.
The timestamp is very well behaved: it increases minute by minute for every tag, so you could generate all timestamps with a simple command like strptime("2013-01-01 00:00:00", "%Y-%m-%d %H:%M:%S") + c(0:172800)*60 (172800 minutes ~ 4 months). However, there is not necessarily data for every timestamp, as the example shows.
I know I could write a function with endless loops, but is there a way to do this using only R (and its packages) functions?
Thanks!
You want to use the "reshape2" package:
install.packages("reshape2")
library(reshape2)
# name the value column explicitly so dcast does not have to guess it
newdf <- dcast(mydata, hour ~ tag, value.var = "val")
reshape2 is a wildly powerful package that I completely fail to understand... but sometimes it has nice useful things like this that just work. :-)
UPDATED: that's "dcast" not "cast"... I mistakenly used the "reshape" not the "reshape2" package. Fixed!
This is neither the most straightforward nor elegant solution, but it works:
An exemplary data.frame:
df <- data.frame(tag=rep(c("N1", "N2", "N4"), c(3,2,2)),
hour=structure(c(1,2,3,3,5,6,7), class="POSIXct"),
val=runif(7))
## tag hour val
## 1 N1 1970-01-01 01:00:01 0.6645598
## 2 N1 1970-01-01 01:00:02 0.7924186
## 3 N1 1970-01-01 01:00:03 0.3813311
## 4 N2 1970-01-01 01:00:03 0.8555780
## 5 N2 1970-01-01 01:00:05 0.4480540
## 6 N4 1970-01-01 01:00:06 0.1875233
## 7 N4 1970-01-01 01:00:07 0.5755332
Now we create the resulting date column (it's just an example):
uh <- structure(1:7, class="POSIXct") # or e.g. uh <- unique(df$hour), or seq(), etc.
Then we create an "empty" resulting data frame (each val will be NA)
nr <- length(uh) # number of rows in the output
# column definitions:
(coldef <- paste("hour=uh", paste(unique(df$tag), "NA_real_", sep="=", collapse=", "), sep=", "))
## [1] "hour=uh, N1=NA_real_, N2=NA_real_, N4=NA_real_"
# create output df:
outdf <- eval(parse(text=sprintf("data.frame(list(%s))", coldef)))
Finally, let's set vals in each N* column:
for (idx in split(1:nrow(df), df$tag))
outdf[outdf$hour %in% df$hour[idx], as.character(df$tag[idx[1]])] <- df$val[idx]
You might also consider the base function reshape if you don't want to bother with another package. Using @gagolews's sample data:
> reshape(df, idvar="hour", timevar="tag", v.names="val", direction="wide")
hour val.N1 val.N2 val.N4
1 1969-12-31 19:00:01 0.8156553 NA NA
2 1969-12-31 19:00:02 0.9203821 NA NA
3 1969-12-31 19:00:03 0.8127614 0.7386737 NA
5 1969-12-31 19:00:05 NA 0.9648562 NA
6 1969-12-31 19:00:06 NA NA 0.2540216
7 1969-12-31 19:00:07 NA NA 0.5024042
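Note that the desired output in the question also has a row for 03:00, where no tag has data. Reshaping alone does not create such rows, so one option is to merge the wide result with a complete sequence of timestamps. A sketch under assumptions: the question's data is retyped by hand with UTC times, and the names long, wide, all_hours and result are made up.

```r
# The question's data in long format
long <- data.frame(
  tag  = c("N1", "N1", "N1", "N2", "N2", "N3", "N3"),
  hour = as.POSIXct(c("2013-01-01 00:00:00", "2013-01-01 01:00:00",
                      "2013-01-01 02:00:00", "2013-01-01 02:00:00",
                      "2013-01-01 04:00:00", "2013-01-01 05:00:00",
                      "2013-01-01 06:00:00"), tz = "UTC"),
  val  = c(0.3404266179, 0.3274182995, 0.3142598749, 0.3189924887,
           0.3170907762, 0.3161910788, 0.4247638954)
)

# Long to wide: one column per tag
wide <- reshape(long, idvar = "hour", timevar = "tag", v.names = "val",
                direction = "wide")
names(wide) <- sub("^val\\.", "", names(wide))  # val.N1 -> N1, etc.

# Complete hourly grid, then left join so empty hours become all-NA rows
all_hours <- data.frame(hour = seq(min(long$hour), max(long$hour), by = "hour"))
result <- merge(all_hours, wide, by = "hour", all.x = TRUE)
print(result)
```

R shows the gaps as NA rather than NULL. For the real data the grid would be built by = "min" instead of by = "hour".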

processing date and time data in R

Dear all, I have a data frame which comes directly from a sensor. The data provide the date and time in a single column. I want R to be able to recognise this data and then create an adjacent column in the data frame which gives a number that corresponds to a new day in the time and date column. For example 25/02/2011 13:34 in data$time.date would give 1 in the new column data$day, and 26/02/2011 13:34 in data$time.date would get 2 and so on....
Does anyone know how to go about solving this? Thanks in advance for any help.
You can use cut() and convert to numeric the factor resulting from that call. Here is an example with dummy data:
> sdate <- as.POSIXlt("25/02/2011 13:34", format = "%d/%m/%Y %H:%M")
> edate <- as.POSIXlt("02/03/2011 13:34", format = "%d/%m/%Y %H:%M")
> df <- data.frame(dt = seq(from = sdate, to = edate, by = "hours"))
> head(df)
dt
1 2011-02-25 13:34:00
2 2011-02-25 14:34:00
3 2011-02-25 15:34:00
4 2011-02-25 16:34:00
5 2011-02-25 17:34:00
6 2011-02-25 18:34:00
We cut the date-time column into days using cut(). By default this returns a factor with the dates as labels; passing labels = FALSE returns the integer codes 1, 2, ... directly:
> df <- within(df, day <- cut(dt, "day", labels = FALSE))
> head(df, 13)
dt day
1 2011-02-25 13:34:00 1
2 2011-02-25 14:34:00 1
3 2011-02-25 15:34:00 1
4 2011-02-25 16:34:00 1
5 2011-02-25 17:34:00 1
6 2011-02-25 18:34:00 1
7 2011-02-25 19:34:00 1
8 2011-02-25 20:34:00 1
9 2011-02-25 21:34:00 1
10 2011-02-25 22:34:00 1
11 2011-02-25 23:34:00 1
12 2011-02-26 00:34:00 2
13 2011-02-26 01:34:00 2
You can achieve this using cut.POSIXt. For example:
dat <- data.frame(datetimes = Sys.time() - seq(360000, 0, by=-3600))
dat$day <- cut(dat$datetimes, breaks="day", labels=FALSE)
Note that this assumes your date-time column is correctly formatted as a date-time class.
See ?DateTimeClasses for details.
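If the column still holds text such as "25/02/2011 13:34", it has to be parsed first. The day number can then also be computed as the offset in calendar days from the first date. A small sketch, assuming day/month/year strings and UTC; the vector names are made up for illustration.

```r
# Raw sensor timestamps as text (day/month/year hour:minute)
raw <- c("25/02/2011 13:34", "25/02/2011 23:50",
         "26/02/2011 00:10", "28/02/2011 09:00")

# Parse into a proper date-time class
dt <- as.POSIXct(raw, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Calendar date of each reading, then days elapsed since the first date, plus one
d <- as.Date(dt, tz = "UTC")
day <- as.integer(d - min(d)) + 1L
print(day)
```

As with cut(), a day without any readings (the 27th here) still advances the counter, so the last reading gets day 4 rather than 3.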
