Creating a new column in a data frame based on start dates and end dates - r

I have the following 2 data frames:
Dataframe1 <- data.frame(Time = seq(as.POSIXct("2017-09-06 4:30:00"), as.POSIXct("2017-09-08 15:00:15"), by = "15 min"))
Dataframe2 <- data.frame(Start_Date = as.POSIXct(c("2017-09-07 4:32:00", "2017-09-07 13:02:00", "2017-09-08 10:20:00")), End_Date = as.POSIXct(c("2017-09-07 7:20:00", "2017-09-07 17:46:00", "2017-09-08 13:41:00")))
I want to create a new column in Dataframe1 (Dataframe1$New_Column) that is of class "logical". It should be TRUE by default and FALSE wherever a value of Dataframe1$Time falls between the Start_Date and End_Date of any row of Dataframe2 (as in the code below). The result should look like:
Dataframe1$New_Column <- TRUE
Dataframe1$New_Column[which(Dataframe1$Time > Dataframe2$Start_Date[1] & Dataframe1$Time < Dataframe2$End_Date[1])] <- FALSE
Dataframe1$New_Column[which(Dataframe1$Time > Dataframe2$Start_Date[2] & Dataframe1$Time < Dataframe2$End_Date[2])] <- FALSE
Dataframe1$New_Column[which(Dataframe1$Time > Dataframe2$Start_Date[3] & Dataframe1$Time < Dataframe2$End_Date[3])] <- FALSE
View(Dataframe1)
What is an efficient way to do this using base R functions?
Thank you!

A non-equi join with data.table might be a better fit:
library(data.table)
Dataframe1$New_Column <- TRUE
setDT(Dataframe1)[Dataframe2, New_Column := FALSE,
on = .(Time > Start_Date, Time < End_Date)]
which(!Dataframe1$New_Column)
#[1] 98 99 100 101 102 103 104 105 106 107 108 132 133 134 135 136
#[17] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
With base R, we can use lapply() to loop over the rows of 'Dataframe2', do the comparison for each interval, and combine the results with Reduce():
out <- !Reduce(`|`, lapply(seq_len(nrow(Dataframe2)),
function(i) with(Dataframe1, Time > Dataframe2$Start_Date[i] &
Time < Dataframe2$End_Date[i])))
which(!out)
#[1] 98 99 100 101 102 103 104 105 106 107 108 132 133 134 135 136
#[17] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
Dataframe1$New_Column <- out
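Another base R option, a minimal sketch assuming the same Dataframe1 and Dataframe2 as above, is to compare every Time against every interval at once with outer():
# each outer() call builds a times-by-intervals logical matrix; a row sum > 0
# means that Time falls inside at least one Start_Date/End_Date interval
inside_any <- rowSums(
  outer(Dataframe1$Time, Dataframe2$Start_Date, ">") &
  outer(Dataframe1$Time, Dataframe2$End_Date, "<")) > 0
Dataframe1$New_Column <- !inside_any
This avoids the explicit loop, but it builds an n-by-m matrix, so it trades memory for speed on large inputs.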

Related

Create a dataframe in R

I would like to create a dataframe with 117 columns and 90 rows, the first ones being: ID, date1, date2, Category, DR1, DRM01, DRM02, DRM03 ... up to DRM111. The ID column would have values ranging from 1 to 3. date1 would have a fixed value, "2022-01-05"; date2 would have values from 2021-12-20 up to the maximum it allows. Category can be ABC or ERF, DR1 would have values varying from 200 to 250, and finally the DRM columns would have values varying from 0 to 300. Is it possible to create a dataframe like this?
I'm wondering if this is an effort at simulation. The first few tasks seem blindingly obvious, but the last one, a call to replicate with simplify=FALSE, might have been a bit less than trivial.
test <- data.frame( ID = rep(1:3, length.out = 90),
    date1 = as.Date("2022-01-05"),
    date2 = seq(as.Date("2021-12-20"), length.out = 90, by = 1),
    # Category = ????  the ABC/ERF split was not specified, so it is omitted here (see the variant below)
    DR1 = sample(200:250, 90, replace = TRUE), # replace = TRUE needed: 90 draws from only 51 values
    setNames( replicate(111, { sample(0:300, 90) }, simplify = FALSE),
              nm = paste("DRM", 1:111) ) )
Snipped the last 105 lines of the output from str():
str(test)
'data.frame': 90 obs. of 115 variables:
$ ID : int 1 2 3 1 2 3 1 2 3 1 ...
$ date1 : Date, format: "2022-01-05" "2022-01-05" "2022-01-05" "2022-01-05" ...
$ date2 : Date, format: "2021-12-20" "2021-12-21" "2021-12-22" "2021-12-23" ...
$ DR1 : int 229 218 240 243 221 202 242 221 237 208 ...
$ DRM.1 : int 41 238 142 100 19 56 224 152 85 84 ...
$ DRM.2 : int 150 185 141 55 34 83 88 105 165 294 ...
$ DRM.3 : int 144 22 237 174 78 291 120 63 261 236 ...
$ DRM.4 : int 223 105 263 214 45 226 129 80 182 15 ...
$ DRM.5 : int 27 108 288 237 129 251 150 70 300 243 ...
# additional rows elided
The last item in that construction returns a list that has 111 "columns" with ascending numbered names. I admit to being puzzled about why there were periods in the DRM names but then realized that the data.frame function uses check.names to make sure they are legitimate, so the spaces from paste were converted to periods. If you don't like periods then use paste0.
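If you do want the Category column and plain DRM names, here is a variant sketch; the uniform ABC/ERF split is only an assumption, since the question does not say how the categories should be distributed:
test <- data.frame(ID = rep(1:3, length.out = 90),
                   date1 = as.Date("2022-01-05"),
                   date2 = seq(as.Date("2021-12-20"), length.out = 90, by = 1),
                   Category = sample(c("ABC", "ERF"), 90, replace = TRUE), # assumed 50/50 draw
                   DR1 = sample(200:250, 90, replace = TRUE),
                   setNames(replicate(111, sample(0:300, 90), simplify = FALSE),
                            nm = paste0("DRM", 1:111))) # paste0 avoids the periods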

Using dplyr to compute calculated fields depending on multiple columns without explicitly writing column names

Consider the following code.
set.seed(56)
library(dplyr)
df <- data.frame(
NUM_1 = sample.int(500, replace = TRUE),
DENOM_1 = sample.int(500, replace = TRUE),
NUM_2 = sample.int(500, replace = TRUE),
DENOM_2 = sample.int(500, replace = TRUE)
)
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2
1 417 379 154 173
2 160 437 239 154
3 243 315 106 361
4 291 169 393 340
5 170 450 429 421
6 422 131 75 64
Without having to manually specify each of the column names (the actual problem has about 40 of these I need to create), I would like to create columns FRAC_1 and FRAC_2 for which FRAC_X = NUM_X/DENOM_X.
So, this would be what I'm looking for with regard to output, but since I'm dealing with about 40 of these, I don't want to have to manually type out each column:
df_frac <- df %>%
mutate(FRAC_1 = NUM_1 / DENOM_1,
FRAC_2 = NUM_2 / DENOM_2)
head(df_frac)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
I would strongly prefer a dplyr solution to this. I thought maybe I could use mutate() with across(), but it isn't clear to me how to tell across() to pair the NUM_x with the corresponding DENOM_x columns.
Here is an option in tidyverse:
Loop across() the columns whose names start with 'NUM'.
Extract the current column name with cur_column() and replace 'NUM' with 'DENOM' using str_replace.
get() that DENOM column, divide the NUM column by it, and set .names to replace 'NUM' with 'FRAC' so the results become the 'FRAC' columns.
library(dplyr)
library(stringr)
df <- df %>%
mutate(across(starts_with("NUM"), ~
./get(str_replace(cur_column(), 'NUM', 'DENOM')),
.names = "{str_replace(.col, 'NUM', 'FRAC')}"))
Output:
head(df)
NUM_1 DENOM_1 NUM_2 DENOM_2 FRAC_1 FRAC_2
1 417 379 154 173 1.1002639 0.8901734
2 160 437 239 154 0.3661327 1.5519481
3 243 315 106 361 0.7714286 0.2936288
4 291 169 393 340 1.7218935 1.1558824
5 170 450 429 421 0.3777778 1.0190024
6 422 131 75 64 3.2213740 1.1718750
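For comparison, here is a base R sketch that does the same pairing by name; it assumes every NUM_* column has a matching DENOM_* column:
num_cols   <- grep("^NUM_", names(df), value = TRUE)  # e.g. NUM_1, NUM_2, ...
denom_cols <- sub("NUM", "DENOM", num_cols)           # matching DENOM_* names
frac_cols  <- sub("NUM", "FRAC", num_cols)            # names for the new columns
df[frac_cols] <- df[num_cols] / df[denom_cols]        # element-wise division, column by column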

Average over rows pairs and paste the value based on condition

In R, I have a df such as:
a b c
1 124 70 aa
2 129 67 aa
3 139 71 aa
4 125 77 aa
5 125 82 aa
6 121 69 aa
7 135 68 bb
8 137 72 bb
9 137 78 bb
10 140 86 bb
I want to iterate along the rows within columns (a, b), computing the mean of each pair of consecutive rows and assigning this mean to the same two rows of new columns (a_new, b_new) whenever the absolute difference between those two rows is >= 12. Otherwise just copy the old value. This behaviour should be restricted to groups as marked by another column (c), i.e. it should not apply when the two rows come from different groups.
In this example, this happens at row 3 (because in column a the difference from the next (4th) row is 14) and at row 5 (because in column b the difference from the next row is 13). However, it should not happen at row 6, because row 7 belongs to another c group.
Thus, resulting df would look like:
a b c a_new b_new
1 124 70 aa 124 70
2 129 67 aa 129 67
3 139 71 aa 132 71
4 125 77 aa 132 77
5 125 82 aa 125 75.5
6 121 69 aa 121 75.5
7 135 68 bb 135 68
8 137 72 bb 137 72
9 137 78 bb 137 78
10 140 86 bb 140 86
I've been struggling with this for a while; I figured that perhaps the lag function could be used, but had no success. Help would be much appreciated (be it base R, dplyr, or whatever).
Dput:
structure(list(a = c(124, 129, 139, 125, 125, 121, 135, 137,
137, 140), b = c(70, 67, 71, 77, 82, 69, 68, 72, 78, 86), c = c("aa",
"aa", "aa", "aa", "aa", "aa", "bb", "bb", "bb", "bb")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
We can write a function which works for one chunk.
apply_fun <- function(x) {
inds <- which(abs(diff(x)) >= 12)
if(length(inds))
x[sort(c(inds, inds + 1))] <- c(sapply(inds, function(i)
rep(mean(x[c(i, i + 1)]), 2)))
return(x)
}
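A quick check of apply_fun() on one chunk (column a of the 'aa' group) shows the pair at rows 3 and 4 replaced by their mean:
apply_fun(c(124, 129, 139, 125, 125, 121))
#[1] 124 129 132 132 125 121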
and then apply it for multiple columns by group.
library(dplyr)
df %>% group_by(c) %>% mutate_at(vars(a, b), list(new = apply_fun))
# a b c a_new b_new
# <dbl> <dbl> <chr> <dbl> <dbl>
# 1 124 70 aa 124 70
# 2 129 67 aa 129 67
# 3 139 71 aa 132 71
# 4 125 77 aa 132 77
# 5 125 82 aa 125 75.5
# 6 121 69 aa 121 75.5
# 7 135 68 bb 135 68
# 8 137 72 bb 137 72
# 9 137 78 bb 137 78
#10 140 86 bb 140 86
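mutate_at() is superseded in current dplyr; an equivalent sketch with across() (assuming the same df and apply_fun) would be:
df %>%
  group_by(c) %>%
  mutate(across(c(a, b), apply_fun, .names = "{.col}_new")) %>%
  ungroup()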
What I understood is that the procedure commented in the code below should be applied within each group given by the indicator column "c":
pairAverage <- function(x) {
# x should be a numeric vector of length > 1
if (is.vector(x) & is.numeric(x) & length(x) > 1) {
# copy data to an aux vector
aux <- x
# get differences of lag 1
dh <- diff(x, 1)
# get means of consecutive pairs
med <- x[2:length(x)] - dh/2
# get positions (index) of pairs whose absolute difference is >= 12
idx <- match(med[abs(dh) >= 12], med)
# need 2 reps of each mean to replace consecutive values of x
valToRepl <- med[sort(rep(idx,2))]
# ordered indexes pairs of consecutive elements of x to be replaced
idxToRepl <- sort(c(idx,idx+1))
# replace pairs of values
aux[idxToRepl] <- valToRepl
return(aux)
} else {
# do nothing
warning("paramater x should be a numeric vector of length > 1")
return(NULL)
}
}
pairAverageByGroups <- function(x, gr) {
if (is.vector(x) & is.numeric(x) & length(x) == length(gr)) {
x.ls <- split(x, as.factor(gr))
output <- unlist(lapply(x.ls, pairAverage))
names(output) <- NULL
output
} else {
# do nothing
warning("paremater x should be a numeric vector of length > 1")
return(NULL)
}
}
pairAverageByGroups(dd$a, dd$c)
[1] 124 129 132 132 125 121 135 137 137 140
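Applied to both columns (assuming dd holds the dput data shown above), this fills the new columns directly:
dd$a_new <- pairAverageByGroups(dd$a, dd$c)
dd$b_new <- pairAverageByGroups(dd$b, dd$c)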

R 2.15.0 lm() on Windows 6.1.7601 - Subsetting a data frame: receiving an error when labeling columns

Purpose: Subset a dataframe into 2 columns that are labeled with new names
for example:
Age Height
1 65 183
2 73 178
[data1[dataset1$Age>50 | dataset1$Height>140,], c("Age","Cm")]
# Error: unexpected ',' in "data1[data1$Age>50 | data1$Height>140,],"
What I've tried:
data1[dataset1$Age>50 | dataset1$Height>140,] #This doesn't organize results in columns
data1[dataset1$Age>50 | dataset1$Height>140,], c("Age","Cm") #Returns same error
I can't get the columns to be organized side-by-side with the labels in c("label1", "label2"). Thanks for your help! New to R and learning it alongside biostats.
If I understood correctly, the subset() function may be of help:
dataset1 <- data.frame(
age=c(44,77,21,55,66,90,23,54,31),
height=c(144,177,121,155,166,190,123,154,131)
)
data1 <- subset(dataset1, age > 50 | height > 140) # subset() already returns a data frame
colnames(data1) <- c("Age", "Height")
I may have missed what you were trying to do; a bit more reproducible data would help, I think.
Nevertheless, I had a go:
dataset1 = data.frame(cbind((35:75),(135:175)))
colnames(dataset1) = c("Age","Height")
Age Height
35 135
36 136
37 137
38 138
39 139
40 140
41 141
42 142
43 143
44 144
and subset
data1 = dataset1[dataset1$Age>50 | dataset1$Height>140,]
colnames(data1) = c("Age","Cm")
Age Cm
41 141
42 142
43 143
44 144
45 145
46 146
47 147
48 148
49 149
50 150
My apologies if I missed what you wanted but to me it wasn't very clear.
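For the exact output the question asks for (only the two relabelled columns), a one-step sketch using the toy dataset1 built above:
data1 <- dataset1[dataset1$Age > 50 | dataset1$Height > 140, c("Age", "Height")]
colnames(data1) <- c("Age", "Cm")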

grep: How can i search through my data using a wildcard in R

I have recently started using R, so now I am trying to get some data out of it. However, the results I get are quite confusing. I have daily data from 1961 to 1963 in the format 1961-04-25, and I created a vector called date.
So when I try to use grep to just search for the period between April 10 and May 21 and display the dates I used this command:
date[date >= grep("196.-04-10", date, value = TRUE) &
date <= grep("196.-05-21", date, value = TRUE)]
The results I get are confusing, as they jump in 3-day steps instead of giving me every single day; see below.
[1] "1961-04-10" "1961-04-13" "1961-04-16" "1961-04-19" "1961-04-22" "1961-04-25" "1961-04-28" "1961-05-01" "1961-05-04" "1961-05-07" "1961-05-10"
[12] "1961-05-13" "1961-05-16" "1961-05-19" "1962-04-12" "1962-04-15" "1962-04-18" "1962-04-21" "1962-04-24" "1962-04-27" "1962-04-30" "1962-05-03"
[23] "1962-05-06" "1962-05-09" "1962-05-12" "1962-05-15" "1962-05-18" "1962-05-21" "1963-04-11" "1963-04-14" "1963-04-17" "1963-04-20" "1963-04-23"
[34] "1963-04-26" "1963-04-29" "1963-05-02" "1963-05-05" "1963-05-08" "1963-05-11" "1963-05-14" "1963-05-17" "1963-05-20"
I think the grep strategy is misguided: each grep() call returns three matched dates (one per year), and the >= and <= comparisons recycle those length-3 vectors against the whole date vector, which is what produces the every-third-day pattern. Maybe something like this will work instead ... basically, I'm computing the day-of-year (Julian date, yday()) and using that for comparison.
z <- as.Date(c("1961-04-10","1961-04-11","1961-04-12",
"1961-05-21","1961-05-22","1961-05-23",
"1963-04-09","1963-04-12","1963-05-21","1963-05-22"))
library(lubridate)
z[yday(z)>=yday(as.Date("1961-04-10")) & yday(z)<=yday(as.Date("1961-05-21"))]
## [1] "1961-04-10" "1961-04-11" "1961-04-12" "1961-05-21" "1963-04-12"
## [6] "1963-05-21"yz <- year(z)
Actually, this solution is fragile to leap-years ...
Better (?):
yz <- year(z)
z[z>=as.Date(paste0(yz,"-04-10")) & z<=as.Date(paste0(yz,"-05-21"))]
(You should definitely test this for yourself, I haven't tested carefully!)
Using a date format for your variable would be the best bet here.
## set up some test data
datevar <- seq.Date(as.Date("1961-01-01"),as.Date("1963-12-31"),by="day")
test <- data.frame(date=datevar,id=1:(length(datevar)))
head(test)
## which looks like:
> head(test)
date id
1 1961-01-01 1
2 1961-01-02 2
3 1961-01-03 3
4 1961-01-04 4
5 1961-01-05 5
6 1961-01-06 6
## find the date ranges you want
selectdates <-
(format(test$date,"%m") == "04" & as.numeric(format(test$date,"%d")) >= 10) |
(format(test$date,"%m") == "05" & as.numeric(format(test$date,"%d")) <= 21)
## subset the original data
result <- test[selectdates,]
## which looks as expected:
> result
date id
100 1961-04-10 100
101 1961-04-11 101
102 1961-04-12 102
103 1961-04-13 103
104 1961-04-14 104
105 1961-04-15 105
106 1961-04-16 106
107 1961-04-17 107
108 1961-04-18 108
109 1961-04-19 109
110 1961-04-20 110
111 1961-04-21 111
112 1961-04-22 112
113 1961-04-23 113
114 1961-04-24 114
115 1961-04-25 115
116 1961-04-26 116
117 1961-04-27 117
118 1961-04-28 118
119 1961-04-29 119
120 1961-04-30 120
121 1961-05-01 121
122 1961-05-02 122
123 1961-05-03 123
124 1961-05-04 124
125 1961-05-05 125
126 1961-05-06 126
127 1961-05-07 127
128 1961-05-08 128
129 1961-05-09 129
130 1961-05-10 130
131 1961-05-11 131
132 1961-05-12 132
133 1961-05-13 133
134 1961-05-14 134
135 1961-05-15 135
136 1961-05-16 136
137 1961-05-17 137
138 1961-05-18 138
139 1961-05-19 139
140 1961-05-20 140
141 1961-05-21 141
465 1962-04-10 465
...
