grep: How can I search through my data using a wildcard in R

I have recently started using R, and I am trying to extract some data. However, the results I get are quite confusing. I have daily data from 1961 to 1963, with dates in the format 1961-04-25, stored in a vector called date.
I tried to use grep to select the period between April 10 and May 21 of each year and display the dates with this command:
date[date >= grep("196.-04-10", date, value = TRUE) &
date <= grep("196.-05-21", date, value = TRUE)]
The results are confusing: I get steps of 3 days instead of every single day (see below).
[1] "1961-04-10" "1961-04-13" "1961-04-16" "1961-04-19" "1961-04-22" "1961-04-25" "1961-04-28" "1961-05-01" "1961-05-04" "1961-05-07" "1961-05-10"
[12] "1961-05-13" "1961-05-16" "1961-05-19" "1962-04-12" "1962-04-15" "1962-04-18" "1962-04-21" "1962-04-24" "1962-04-27" "1962-04-30" "1962-05-03"
[23] "1962-05-06" "1962-05-09" "1962-05-12" "1962-05-15" "1962-05-18" "1962-05-21" "1963-04-11" "1963-04-14" "1963-04-17" "1963-04-20" "1963-04-23"
[34] "1963-04-26" "1963-04-29" "1963-05-02" "1963-05-05" "1963-05-08" "1963-05-11" "1963-05-14" "1963-05-17" "1963-05-20"

I think the grep strategy is misguided, but maybe something like this will work: compute the day of year with lubridate's yday() and use that for the comparison.
z <- as.Date(c("1961-04-10","1961-04-11","1961-04-12",
"1961-05-21","1961-05-22","1961-05-23",
"1963-04-09","1963-04-12","1963-05-21","1963-05-22"))
library(lubridate)
z[yday(z)>=yday(as.Date("1961-04-10")) & yday(z)<=yday(as.Date("1961-05-21"))]
## [1] "1961-04-10" "1961-04-11" "1961-04-12" "1961-05-21" "1963-04-12"
## [6] "1963-05-21"
Actually, this solution is fragile in leap years, because every day of year after February 29 shifts by one.
Better (?): build the bounds from each date's own year.
yz <- year(z)
z[z>=as.Date(paste0(yz,"-04-10")) & z<=as.Date(paste0(yz,"-05-21"))]
(You should definitely test this for yourself, I haven't tested carefully!)
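As for why the original command steps by three days: a likely explanation is vector recycling. Each grep call returns three strings (one per year), and the comparison recycles those three bounds along the whole vector, so every element is only checked against one year's bounds. The sketch below assumes the asker's vector is a character vector of consecutive daily dates starting on 1961-01-01 (the question does not show it).

```r
# Assumed reconstruction of the asker's vector: daily dates stored as strings
date <- as.character(seq(as.Date("1961-01-01"), as.Date("1963-12-31"), by = "day"))

lower <- grep("196.-04-10", date, value = TRUE)  # one match per year
upper <- grep("196.-05-21", date, value = TRUE)
lower
# three lower bounds: "1961-04-10" "1962-04-10" "1963-04-10"

# The three bounds are recycled along `date`, so element i is compared
# with bound number ((i - 1) %% 3) + 1 rather than with its own year's bound
res <- date[date >= lower & date <= upper]
head(res, 3)
# [1] "1961-04-10" "1961-04-13" "1961-04-16"
```

No recycling warning appears here because the 1095 days divide evenly by 3.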

Using a date format for your variable would be the best bet here.
## set up some test data
datevar <- seq.Date(as.Date("1961-01-01"),as.Date("1963-12-31"),by="day")
test <- data.frame(date=datevar,id=1:(length(datevar)))
head(test)
## which looks like:
> head(test)
date id
1 1961-01-01 1
2 1961-01-02 2
3 1961-01-03 3
4 1961-01-04 4
5 1961-01-05 5
6 1961-01-06 6
## find the date ranges you want
selectdates <-
(format(test$date,"%m") == "04" & as.numeric(format(test$date,"%d")) >= 10) |
(format(test$date,"%m") == "05" & as.numeric(format(test$date,"%d")) <= 21)
## subset the original data
result <- test[selectdates,]
## which looks as expected:
> result
date id
100 1961-04-10 100
101 1961-04-11 101
102 1961-04-12 102
103 1961-04-13 103
104 1961-04-14 104
105 1961-04-15 105
106 1961-04-16 106
107 1961-04-17 107
108 1961-04-18 108
109 1961-04-19 109
110 1961-04-20 110
111 1961-04-21 111
112 1961-04-22 112
113 1961-04-23 113
114 1961-04-24 114
115 1961-04-25 115
116 1961-04-26 116
117 1961-04-27 117
118 1961-04-28 118
119 1961-04-29 119
120 1961-04-30 120
121 1961-05-01 121
122 1961-05-02 122
123 1961-05-03 123
124 1961-05-04 124
125 1961-05-05 125
126 1961-05-06 126
127 1961-05-07 127
128 1961-05-08 128
129 1961-05-09 129
130 1961-05-10 130
131 1961-05-11 131
132 1961-05-12 132
133 1961-05-13 133
134 1961-05-14 134
135 1961-05-15 135
136 1961-05-16 136
137 1961-05-17 137
138 1961-05-18 138
139 1961-05-19 139
140 1961-05-20 140
141 1961-05-21 141
465 1962-04-10 465
...
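A more compact variant of the same idea, offered as a sketch rather than a tested replacement: turn month and day into a single integer (410 for April 10, 521 for May 21) and compare on that. The names md and result2 are introduced here for illustration only.

```r
# Same test data as above
datevar <- seq.Date(as.Date("1961-01-01"), as.Date("1963-12-31"), by = "day")
test <- data.frame(date = datevar, id = seq_along(datevar))

# Month and day as one integer, e.g. 1961-04-10 becomes 410
md <- as.integer(format(test$date, "%m%d"))

# Keep April 10 through May 21 of every year
result2 <- test[md >= 410 & md <= 521, ]
nrow(result2)
# [1] 126   (42 days in each of the 3 years)
```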

Related

Create a dataframe in R

I would like to create a dataframe with 117 columns and 90 rows, the first ones being: ID, date1, date2, Category, DR1, DRM01, DRM02, DRM03 .... up to DRM111. For the first column, it would have values ranging from 1 to 3. In date1 it would have a fixed value, which would be "2022-01-05", in date2, it would have values between 2021-12-20 to the maximum that it gives. Category can be ABC or ERF, in DR1 would be values that would vary from 200 to 250, and finally, in DRM columns, would be values that would vary from 0 to 300. Is it possible to create a dataframe like this?
I'm wondering if this is an effort at simulation. The first few tasks seem blindingly obvious, but the last call to replicate with simplify=FALSE might have been a bit less than trivial.
test <- data.frame( ID = rep(1:3, length=90),
date1 = as.Date( "2022-01-05"),
date2= seq( as.Date("2021-12-20"), length.out=90, by=1),
#Category = ???? so far not specified
DR1 = sample( 200:250, 90, replace=TRUE), # replace=TRUE is needed: 90 draws from only 51 values
setNames( replicate(111, { sample(0:300, 90)}, simplify=FALSE) ,
nm=paste("DRM",1:111) ) )
Snipped the last 105 rows of the output from str:
str(test)
'data.frame': 90 obs. of 115 variables:
$ ID : int 1 2 3 1 2 3 1 2 3 1 ...
$ date1 : Date, format: "2022-01-05" "2022-01-05" "2022-01-05" "2022-01-05" ...
$ date2 : Date, format: "2021-12-20" "2021-12-21" "2021-12-22" "2021-12-23" ...
$ DR1 : int 229 218 240 243 221 202 242 221 237 208 ...
$ DRM.1 : int 41 238 142 100 19 56 224 152 85 84 ...
$ DRM.2 : int 150 185 141 55 34 83 88 105 165 294 ...
$ DRM.3 : int 144 22 237 174 78 291 120 63 261 236 ...
$ DRM.4 : int 223 105 263 214 45 226 129 80 182 15 ...
$ DRM.5 : int 27 108 288 237 129 251 150 70 300 243 ...
# additional rows elided
The last item in that construction returns a list that has 111 "columns" with ascending numbered names. I admit to being puzzled about why there were periods in the DRM names but then realized that the data.frame function uses check.names to make sure they are legitimate, so the spaces from paste were converted to periods. If you don't like periods then use paste0.
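A variant of the construction above, as a sketch under two assumptions that are not in the question: Category is drawn at random from "ABC" and "ERF", and the DRM names are zero-padded with sprintf to match the DRM01, DRM02 style the question lists. set.seed is only there to make the sketch reproducible.

```r
set.seed(1)  # only for reproducibility of this sketch
n <- 90

test2 <- data.frame(
  ID       = rep(1:3, length.out = n),
  date1    = as.Date("2022-01-05"),
  date2    = seq(as.Date("2021-12-20"), length.out = n, by = 1),
  Category = sample(c("ABC", "ERF"), n, replace = TRUE),  # assumption: random assignment
  DR1      = sample(200:250, n, replace = TRUE),
  # 111 columns of 90 values each; the names contain no spaces,
  # so check.names leaves them unchanged
  setNames(replicate(111, sample(0:300, n), simplify = FALSE),
           sprintf("DRM%02d", 1:111))
)

dim(test2)
# [1]  90 116
```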

Creating a new column in a data frame based on start dates and end dates

I have the following 2 data frames:
Dataframe1 <- data.frame(Time = seq(as.POSIXct("2017-09-06 4:30:00"), as.POSIXct("2017-09-08 15:00:15"), by = "15 min"))
Dataframe2 <- data.frame(Start_Date = as.POSIXct(c("2017-09-07 4:32:00", "2017-09-07 13:02:00", "2017-09-08 10:20:00")), End_Date = as.POSIXct(c("2017-09-07 7:20:00", "2017-09-07 17:46:00", "2017-09-08 13:41:00")))
I want to create a new column in Dataframe1 (Dataframe1$New_Column) that is of class "logical". If values in Dataframe1$Time are between start dates and end dates (i.e., if they are between the two dates in each row of Dataframe2), Dataframe1$New_Column will be TRUE, and if they aren't, Dataframe1$New_Column will be FALSE. The result should look like:
Dataframe1$New_Column <- TRUE
Dataframe1$New_Column[which(Dataframe1$Time > Dataframe2$Start_Date[1] & Dataframe1$Time< Dataframe2$End_Date[1])] <- F
Dataframe1$New_Column[which(Dataframe1$Time > Dataframe2$Start_Date[2] & Dataframe1$Time< Dataframe2$End_Date[2])] <- F
Dataframe1$New_Column[which(Dataframe1$Time > Dataframe2$Start_Date[3] & Dataframe1$Time< Dataframe2$End_Date[3])] <- F
View(Dataframe1)
What is an efficient way to do this using base R functions?
Thank you!
A non-equi join might be better
library(data.table)
Dataframe1$New_Column <- TRUE
setDT(Dataframe1)[Dataframe2, New_Column := FALSE,
on = .(Time > Start_Date, Time < End_Date)]
which(!Dataframe1$New_Column)
#[1] 98 99 100 101 102 103 104 105 106 107 108 132 133 134 135 136
#[17] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
With base R, we can use lapply/sapply to loop over the rows of 'Dataframe2' and do the comparison
out <- !Reduce(`|`, lapply(seq_len(nrow(Dataframe2)),
function(i) with(Dataframe1, Time > Dataframe2$Start_Date[i] &
Time < Dataframe2$End_Date[i])))
which(!out)
#[1] 98 99 100 101 102 103 104 105 106 107 108 132 133 134 135 136
#[17] 137 138 139 140 141 142 143 144 145 146 147 148 149 150
Dataframe1$New_Column <- out
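Another base R possibility, sketched here without benchmarking: compare every time against every interval at once with outer, then flag the rows that fall in no interval. The tz = "UTC" argument and the helper names tm and inside are assumptions added for reproducibility and illustration; the question uses the local time zone.

```r
Dataframe1 <- data.frame(
  Time = seq(as.POSIXct("2017-09-06 4:30:00", tz = "UTC"),
             as.POSIXct("2017-09-08 15:00:15", tz = "UTC"), by = "15 min"))
Dataframe2 <- data.frame(
  Start_Date = as.POSIXct(c("2017-09-07 4:32:00", "2017-09-07 13:02:00",
                            "2017-09-08 10:20:00"), tz = "UTC"),
  End_Date   = as.POSIXct(c("2017-09-07 7:20:00", "2017-09-07 17:46:00",
                            "2017-09-08 13:41:00"), tz = "UTC"))

# Work on the underlying numeric seconds so outer() sees plain vectors
tm <- as.numeric(Dataframe1$Time)

# inside[i, j] is TRUE when time i lies strictly within interval j
inside <- outer(tm, as.numeric(Dataframe2$Start_Date), ">") &
          outer(tm, as.numeric(Dataframe2$End_Date), "<")

# Same convention as the question's code: FALSE inside any interval, TRUE otherwise
Dataframe1$New_Column <- rowSums(inside) == 0
```

This builds a matrix with one row per time and one column per interval, so it trades memory for the absence of an explicit loop.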

Prevent duplicates in R

I have a column in a data table which has entries in non-decreasing order. But there can be duplicate entries.
labels <- c(123,123,124,125,126,126,128)
time <- data.table(labels,unique_labels="")
time
labels unique_labels
1: 123
2: 123
3: 124
4: 125
5: 126
6: 126
7: 128
I want to make all entries unique, so the output will be
time
labels unique_labels
1: 123 123
2: 123 124
3: 124 125
4: 125 126
5: 126 127
6: 126 128
7: 128 130
Following is a loop implementation for this:
prev_label <- 0
unique_counter <- 0
for (i in 1:length(time$label)){
if (time$label[i]!=prev_label)
prev_label <- time$label[i]
else
unique_counter <- unique_counter + 1
time$unique_label[i] <- time$label[i] + unique_counter
}
There's a vectorized solution that avoids the for loop entirely.
Since time is an R function, I've changed the name of your data.frame to tm.
cumsum(duplicated(tm$labels)) + tm$labels
[1] 123 124 125 126 127 128 130
tm$unique_labels <- cumsum(duplicated(tm$labels)) + tm$labels
tm
labels unique_labels
1: 123 123
2: 123 124
3: 124 125
4: 125 126
5: 126 127
6: 126 128
7: 128 130
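To see why this works, the intermediate steps can be spelled out on the plain vector. This is only a sketch; is_dup and offset are names introduced here for illustration.

```r
labels <- c(123, 123, 124, 125, 126, 126, 128)

# TRUE for every repeat of a value that has already been seen
is_dup <- duplicated(labels)   # FALSE TRUE FALSE FALSE FALSE TRUE FALSE

# Running count of repeats so far: how far each entry must be shifted
offset <- cumsum(is_dup)       # 0 1 1 1 1 2 2

unique_labels <- labels + offset
unique_labels
# [1] 123 124 125 126 127 128 130
```

The result is strictly increasing because the input is non-decreasing and the offset grows by one at each repeat; for unsorted input this guarantee would not hold.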
tank = paste("t", 1:NROW(labels), sep="")
time$unique_labels = ifelse(duplicated(time), tank, time$labels)
The duplicated function of the data.table package returns a logical vector flagging the duplicated rows of your dataset; just replace those with "random" values you are sure are not used in your set.

How to filter a data.Frame using a String in another table?

So, when I type to see the first element of a table I get the following.
>names(freq25)[1]
"japão"
When I type
>library(dplyr)
>filter(data, grepl("japão", HeadLine)) %>% select(Column)
Correct table
I get the correct table. But if I put both together I get:
>filter(data, grepl(names(freq25)[1], Headline)) %>% select(Column)
[1] Column
<0 rows> (or 0-length row.names)
I get an empty table.
I imagine the class of names() is not appropriate for the function grepl. But then I tried
>class(names(freq)[1])
[1] "character"
I don't know what I am doing wrong.
Any advice?
Ps: I want to filter using names(freq25)[i] because I want to filter a number of times, one per element in freq25. Maybe I could do that using something else.
Edit1: Content of freq25
>str(freq25)
Named num [1:25] 1260 304 215 193 192 167 164 151 150 149 ...
- attr(*, "names")= chr [1:25] "japão" "anos" "japonês" "tóquio" ...
>freq25
japão anos japonês tóquio preso homem tokyo japoneses polícia
1260 304 215 193 192 167 164 151 150
maior dia sobre previsão após parte japonesa mil mulher
149 136 134 131 129 128 117 113 112
brasil estrangeiros tempo pessoas governo novo pode
108 107 100 97 95 93 90
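The per-word filtering described in the Ps could be sketched in base R as below. The data frame and the two-element freq25 are hypothetical stand-ins, since the question does not show the real data, and fixed = TRUE is an assumption that the names should be matched literally rather than as regular expressions. This does not diagnose the empty result above.

```r
# Hypothetical stand-ins for the asker's objects
data <- data.frame(
  Headline = c("terremoto no japão", "novo governo", "tóquio em alerta"),
  Column   = 1:3,
  stringsAsFactors = FALSE)
freq25 <- c("japão" = 1260, "governo" = 95)

# One filtered data frame per word, matched as a literal substring
hits <- lapply(names(freq25), function(w) {
  data[grepl(w, data$Headline, fixed = TRUE), "Column", drop = FALSE]
})
names(hits) <- names(freq25)

hits[["japão"]]
```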

R split data into categories

I am trying to find the most efficient way to split a list of numbers into bins by value and then calculate a cumulative sum for each successive category.
I can't seem to get the value categories from this for the plot.
> scores
[1] 115 119 119 134 121 128 128 152 97 108 98 130 108 110 111 122 106 142 143 140 141 151 125 126
> table(cut(scores,breaks=10))
(96.9,102] (102,108] (108,113] (113,119] (119,124] (124,130] (130,136] (136,141] (141,147] (147,152]
2 1 4 1 4 5 1 2 2 2
> cumsum(table(cut(scores,breaks=10)))
(96.9,102] (102,108] (108,113] (113,119] (119,124] (124,130] (130,136] (136,141] (141,147] (147,152]
2 3 7 8 12 17 18 20 22 24
> plot(100*cumsum(table(cut(scores,breaks=10)))/length(scores),ylab="percent of scores")
> lines(100*cumsum(table(cut(scores,breaks=10)))/length(scores))
This produces an acceptable plot, which contains index values (2,4,6...). How can I get the values 96.9, 102, etc... Is there a better way to do this?
You need to set xaxt = "n" so that plot does not draw the x axis labels, then draw them yourself with axis, taking the labels from names:
plot(100*cumsum(table(cut(scores,breaks=10)))/length(scores),ylab="percent of scores", xaxt = "n")
lines(100*cumsum(table(cut(scores,breaks=10)))/length(scores))
axis(1, 1:10, names(table(cut(scores,breaks=10))))
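The same plot can be written with the binned table computed once and reused, which avoids repeating the cut call four times. This is a sketch using the scores shown in the question; tab and pct are names introduced here.

```r
scores <- c(115, 119, 119, 134, 121, 128, 128, 152, 97, 108, 98, 130,
            108, 110, 111, 122, 106, 142, 143, 140, 141, 151, 125, 126)

tab <- table(cut(scores, breaks = 10))        # counts per bin
pct <- 100 * cumsum(tab) / length(scores)     # cumulative percent, named by bin

# type = "b" draws points and connecting lines in one call
plot(pct, type = "b", xaxt = "n", xlab = "score bin", ylab = "percent of scores")
axis(1, at = seq_along(pct), labels = names(pct))
```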
