Let's say I have two dataframes like the ones below:
df1 = structure(list(Date = c("2000-01-05", "2000-02-03", "2000-03-02",
"2000-03-30", "2000-04-13", "2000-05-11", "2000-06-08", "2000-07-06",
"2000-09-14", "2000-10-19", "2000-11-02", "2000-12-14", "2001-02-01",
"2001-03-01", "2001-04-11", "2001-05-10", "2001-06-07", "2001-06-21",
"2001-07-05", "2001-08-30", "2001-10-11", "2001-11-08", "2001-12-06"
)), row.names = c(NA, 23L), class = "data.frame")
Date
1 2000-01-05
2 2000-02-03
3 2000-03-02
4 2000-03-30
5 2000-04-13
6 2000-05-11
7 2000-06-08
8 2000-07-06
9 2000-09-14
10 2000-10-19
11 2000-11-02
12 2000-12-14
13 2001-02-01
14 2001-03-01
15 2001-04-11
16 2001-05-10
17 2001-06-07
18 2001-06-21
19 2001-07-05
20 2001-08-30
21 2001-10-11
22 2001-11-08
23 2001-12-06
df2 = structure(list(Date = structure(c(10987, 11016, 11047, 11077,
11108, 11138, 11169, 11200, 11230, 11261, 11291, 11322, 11353,
11381, 11412, 11442, 11473, 11503, 11534, 11565, 11595, 11626,
11656, 11687), class = "Date"), x = c(3.04285714285714, 3.27571428571429,
3.5104347826087, 3.685, 3.92, 4.29454545454545, 4.30857142857143,
4.41913043478261, 4.59047619047619, 4.76272727272727, 4.82909090909091,
4.82684210526316, 4.75590909090909, 4.9925, 4.78136363636364,
5.06421052631579, 4.65363636363636, 4.53952380952381, 4.50545454545454,
4.49130434782609, 3.9865, 3.97130434782609, 3.50727272727273,
3.33888888888889)), row.names = c(NA, 24L), class = "data.frame")
Date x
1 2000-01-31 3.042857
2 2000-02-29 3.275714
3 2000-03-31 3.510435
4 2000-04-30 3.685000
5 2000-05-31 3.920000
6 2000-06-30 4.294545
7 2000-07-31 4.308571
8 2000-08-31 4.419130
9 2000-09-30 4.590476
10 2000-10-31 4.762727
11 2000-11-30 4.829091
12 2000-12-31 4.826842
13 2001-01-31 4.755909
14 2001-02-28 4.992500
15 2001-03-31 4.781364
16 2001-04-30 5.064211
17 2001-05-31 4.653636
18 2001-06-30 4.539524
19 2001-07-31 4.505455
20 2001-08-31 4.491304
21 2001-09-30 3.986500
22 2001-10-31 3.971304
23 2001-11-30 3.507273
24 2001-12-31 3.338889
Now, what I would like to do is create a real-time dataframe, that is, one holding only the data in df2 that were already available at each date in df1. For instance, at 2000-01-05 (first row in df1) no data in df2 was available, since 2000-01-31 (first row of df2) occurs after 2000-01-05. However, at 2000-02-03 (second row in df1) the observation from 2000-01-31 (first row of df2) is available. The same reasoning applies to every row. The outcome should look like this:
Date y
1 2000-01-05 NA
2 2000-02-03 3.042857
3 2000-03-02 3.275714
4 2000-03-30 3.275714
5 2000-04-13 3.510435
6 2000-05-11 3.685000
....
The rule would be: pick up from df2 only the observation that was available at the time of df1.
Can anyone help me?
Thanks!
What you can do is complete the df2 dates and then join.
library(dplyr)
library(tidyr)
# create a dataframe with all the days, not just the snapshots
df2_complete <- df2 %>%
complete(Date = seq.Date(min(Date), max(Date), by = "day")) %>%
fill(x, .direction = "down")
# convert to Date class for this case and join
df1 %>%
mutate(Date = as.Date(Date)) %>%
left_join(df2_complete, by = "Date")
Which gives:
Date x
1 2000-01-05 NA
2 2000-02-03 3.042857
3 2000-03-02 3.275714
4 2000-03-30 3.275714
5 2000-04-13 3.510435
6 2000-05-11 3.685000
....
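If you would rather not build the full daily calendar, the same lookup can be sketched in base R with findInterval. This is only a sketch on a small hand-copied subset of the data above, and it assumes that df2$Date is sorted in increasing order and that an observation dated on the same day counts as available (which is also what the join above does).

```r
# Small subset of the question's data, typed in by hand for illustration
df1 <- data.frame(Date = c("2000-01-05", "2000-02-03", "2000-03-02", "2000-03-30"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Date = as.Date(c("2000-01-31", "2000-02-29", "2000-03-31")),
                  x = c(3.042857, 3.275714, 3.510435))

# For each df1 date, position of the last df2 date that is <= that date
# (0 means that nothing in df2 was available yet)
idx <- findInterval(as.numeric(as.Date(df1$Date)), as.numeric(df2$Date))
idx[idx == 0] <- NA

df1$y <- df2$x[idx]
df1
```

The first row comes out as NA because no df2 date precedes 2000-01-05, and the last two rows both pick up the 2000-02-29 value.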
My data looks as follows:
var1 var2 var3
1 9V .6V 77V
2 6V .3V 15V
3 9V .7V 114V
4 12V 1.0V 199V
5 14V 1.2V 245V
6 13V .8V 158V
7 11V .6V 136V
8 11V .7V 132V
9 12V .8V 171V
10 11V .7V 155V
11 13V .8V 166V
12 11V .7V 138V
13 11V .9V 173V
14 9V .8V 143V
15 8V .7V 105V
16 4V .4V 21V
17 8V .4V 26V
18 10V .8V 154V
19 9V .8V 130V
20 10V .7V 113V
21 10V .6V 102V
22 11V .8V 135V
23 9V .7V 120V
24 10V .7V 126V
25 7N .6N 124N
26 14N 1.1N 210N
The last 2 rows shown contain N. I am trying to set the observations that contain N to NA.
I have tried some combinations of str_detect and str_replace but I cannot get it working.
Additionally, other letters appear (very rarely), i.e. M and P. I would like a way to set an observation to NA if it contains any of these letters, something like c(var1:var3) %in% str_detect(c("N", "M", "P"))... str_replace_all.
Data:
structure(list(var1 = c("9V", "6V", "9V", "12V", "14V", "13V",
"11V", "11V", "12V", "11V", "13V", "11V", "11V", "9V", "8V",
"4V", "8V", "10V", "9V", "10V", "10V", "11V", "9V", "10V", "7N",
"14N", "7V", "5V", "7V", "9V", "8V", "8V", "5V", "4V", "4V",
"5V", "7V", "5V", "6V", "8V", "9V", "6V", "6V", "7V", "8V", "7V",
"8V", "8V", "7V", "8V"), var2 = c(".6V", ".3V", ".7V", "1.0V",
"1.2V", ".8V", ".6V", ".7V", ".8V", ".7V", ".8V", ".7V", ".9V",
".8V", ".7V", ".4V", ".4V", ".8V", ".8V", ".7V", ".6V", ".8V",
".7V", ".7V", ".6N", "1.1N", ".4V", ".3V", ".4V", ".6V", ".5V",
".6V", ".4V", ".3V", ".2V", ".3V", ".4V", ".3V", ".3V", ".5V",
".6V", ".4V", ".4V", ".4V", ".5V", ".4V", ".4V", ".5V", ".4V",
".4V"), var3 = c("77V", "15V", "114V", "199V", "245V", "158V",
"136V", "132V", "171V", "155V", "166V", "138V", "173V", "143V",
"105V", "21V", "26V", "154V", "130V", "113V", "102V", "135V",
"120V", "126V", "124N", "210N", "35V", "9V", "48V", "91V", "81V",
"80V", "14V", "11V", "7V", "13V", "34V", "18V", "15V", "58V",
"76V", "29V", "30V", "31V", "32V", "34V", "57V", "58V", "52V",
"49V")), row.names = c(NA, 50L), class = "data.frame")
Here's one solution:
x[] <- lapply(x, function(s) ifelse(grepl("N$", s), NA_character_, s))
x
# var1 var2 var3
# 1 9V .6V 77V
# 2 6V .3V 15V
# 3 9V .7V 114V
# 4 12V 1.0V 199V
# 5 14V 1.2V 245V
# 6 13V .8V 158V
# 7 11V .6V 136V
# 8 11V .7V 132V
# 9 12V .8V 171V
# 10 11V .7V 155V
# 11 13V .8V 166V
# 12 11V .7V 138V
# 13 11V .9V 173V
# 14 9V .8V 143V
# 15 8V .7V 105V
# 16 4V .4V 21V
# 17 8V .4V 26V
# 18 10V .8V 154V
# 19 9V .8V 130V
# 20 10V .7V 113V
# 21 10V .6V 102V
# 22 11V .8V 135V
# 23 9V .7V 120V
# 24 10V .7V 126V
# 25 <NA> <NA> <NA>
# 26 <NA> <NA> <NA>
# 27 7V .4V 35V
# 28 5V .3V 9V
# 29 7V .4V 48V
# 30 9V .6V 91V
# 31 8V .5V 81V
# 32 8V .6V 80V
# 33 5V .4V 14V
# 34 4V .3V 11V
# 35 4V .2V 7V
# 36 5V .3V 13V
# 37 7V .4V 34V
# 38 5V .3V 18V
# 39 6V .3V 15V
# 40 8V .5V 58V
# 41 9V .6V 76V
# 42 6V .4V 29V
# 43 6V .4V 30V
# 44 7V .4V 31V
# 45 8V .5V 32V
# 46 7V .4V 34V
# 47 8V .4V 57V
# 48 8V .5V 58V
# 49 7V .4V 52V
# 50 8V .4V 49V
If your data has columns where you do not want to do this replacement, just use a subset:
x[2:3] <- lapply(x[2:3], ...)
Variant:
library(dplyr)
x %>%
mutate_at(vars(var1, var2, var3), ~ if_else(grepl("N$", .), NA_character_, .))
# or, if all columns
x %>%
mutate_all(~ if_else(grepl("N$", .), NA_character_, .))
NA_character_ is used for two reasons:
In the base R version, it is purely declarative: it says that I intend the result to always be character;
In the dplyr version, if_else requires that the "true" and "false" arguments have the same class, and class(NA) is not class("A").
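To make the class point concrete, here is a tiny base R check. The vector s is made up for illustration.

```r
# A bare NA is logical; NA_character_ is the character-typed missing value
class(NA)             # "logical"
class(NA_character_)  # "character"

# Made-up example vector
s <- c("9V", "7N")
res <- ifelse(grepl("N$", s), NA_character_, s)
res
class(res)
```

Here res stays a character vector, with the second element missing.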
You just need to change your pattern to "N|M|P" :
dat <- structure(list(var1 = c("9V", "6V", "9V", "12V", "14V", "13V",
"11V", "11V", "12V", "11V", "13V", "11V", "11V", "9V", "8V",
"4V", "8V", "10V", "9V", "10V", "10V", "11V", "9V", "10V", "7N",
"14N", "7V", "5V", "7V", "9V", "8V", "8V", "5V", "4V", "4V",
"5V", "7V", "5V", "6V", "8V", "9V", "6V", "6V", "7V", "8V", "7V",
"8V", "8V", "7V", "8V"), var2 = c(".6V", ".3V", ".7V", "1.0V",
"1.2V", ".8V", ".6V", ".7V", ".8V", ".7V", ".8V", ".7V", ".9V",
".8V", ".7V", ".4V", ".4V", ".8V", ".8V", ".7V", ".6V", ".8V",
".7V", ".7V", ".6N", "1.1N", ".4V", ".3V", ".4V", ".6V", ".5V",
".6V", ".4V", ".3V", ".2V", ".3V", ".4V", ".3V", ".3V", ".5V",
".6V", ".4V", ".4V", ".4V", ".5V", ".4V", ".4V", ".5V", ".4V",
".4V"), var3 = c("77V", "15V", "114V", "199V", "245V", "158V",
"136V", "132V", "171V", "155V", "166V", "138V", "173V", "143V",
"105V", "21V", "26V", "154V", "130V", "113V", "102V", "135V",
"120V", "126V", "124N", "210N", "35V", "9V", "48V", "91V", "81V",
"80V", "14V", "11V", "7V", "13V", "34V", "18V", "15V", "58V",
"76V", "29V", "30V", "31V", "32V", "34V", "57V", "58V", "52V",
"49V")), row.names = c(NA, 50L), class = "data.frame")
library(stringr)
library(dplyr)
dat %>% mutate(var3 = str_replace_all(var3, c("N|M|P"), replacement = NA_character_))
The dplyr-stringr solution that you were trying to figure out would be like below:
library(stringr)
library(dplyr)
df1 %>%
mutate_at(vars(var1:var3),
list(~str_replace_all(., "N$|M$|P$", replacement = NA_character_)))
#> var1 var2 var3
#> 1 9V .6V 77V
#> 2 6V .3V 15V
#> 3 9V .7V 114V
#> 4 12V 1.0V 199V
#> 5 14V 1.2V 245V
## ...
#> 20 10V .7V 113V
#> 21 10V .6V 102V
#> 22 11V .8V 135V
#> 23 9V .7V 120V
#> 24 10V .7V 126V
#> 25 <NA> <NA> <NA>
#> 26 <NA> <NA> <NA>
#> 27 7V .4V 35V
#> 28 5V .3V 9V
#> 29 7V .4V 48V
#> 30 9V .6V 91V
## ...
#> 45 8V .5V 32V
#> 46 7V .4V 34V
#> 47 8V .4V 57V
#> 48 8V .5V 58V
#> 49 7V .4V 52V
#> 50 8V .4V 49V
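For comparison, the same idea can be sketched without any packages by using a character class instead of the alternation. The toy data frame below is made up, and the assumption is that the flag letter is always the last character of the value.

```r
# Toy data (made up): values flagged with a trailing N, M or P should become NA
dat <- data.frame(var1 = c("9V", "7N", "3M"),
                  var2 = c(".6V", ".6N", ".2P"),
                  stringsAsFactors = FALSE)

flag <- "[NMP]$"  # any one of N, M, P at the end of the string

# Replace flagged entries column by column, keeping the data.frame structure
dat[] <- lapply(dat, function(s) replace(s, grepl(flag, s), NA_character_))
dat
```

"[NMP]$" matches the same strings as "N$|M$|P$", it is just shorter to extend if more letters turn up.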
I have this data
date signal
1 2009-01-13 09:55:00 4645.00 4838.931 5358.883 Buy2
2 2009-01-14 09:55:00 4767.50 4718.254 5336.703 Buy1
3 2009-01-15 09:55:00 4485.00 4653.316 5274.384 Buy2
4 2009-01-16 09:55:00 4580.00 4537.693 5141.435 Buy1
5 2009-01-19 09:55:00 4532.00 4548.088 4891.041 Buy2
6 2009-01-27 09:55:00 4190.00 4183.503 4548.497 Buy1
7 2009-01-30 09:55:00 4436.00 4155.236 4377.907 Sell1
8 2009-02-02 09:55:00 4217.00 4152.626 4390.802 Sell2
9 2009-02-09 09:55:00 4469.00 4203.437 4376.277 Sell1
10 2009-02-12 09:55:00 4469.90 4220.845 4503.798 Sell2
11 2009-02-13 09:55:00 4553.00 4261.980 4529.777 Sell1
12 2009-02-16 09:55:00 4347.20 4319.656 4564.387 Sell2
13 2009-02-17 09:55:00 4161.05 4371.474 4548.912 Buy2
14 2009-02-27 09:55:00 3875.55 3862.085 4101.929 Buy1
15 2009-03-02 09:55:00 3636.00 3846.423 4036.020 Buy2
16 2009-03-12 09:55:00 3420.00 3372.665 3734.949 Buy1
17 2009-03-13 09:55:00 3656.00 3372.100 3605.357 Sell1
18 2009-03-17 09:55:00 3650.00 3360.421 3663.322 Sell2
19 2009-03-18 09:55:00 3721.00 3363.735 3682.293 Sell1
20 2009-03-20 09:55:00 3687.00 3440.651 3784.778 Sell2
and have to arrange it in this form
2 2009-01-14 09:55:00 4767.50 4718.254 5336.703 Buy1
7 2009-01-30 09:55:00 4436.00 4155.236 4377.907 Sell1
8 2009-02-02 09:55:00 4217.00 4152.626 4390.802 Sell2
13 2009-02-17 09:55:00 4161.05 4371.474 4548.912 Buy2
14 2009-02-27 09:55:00 3875.55 3862.085 4101.929 Buy1
17 2009-03-13 09:55:00 3656.00 3372.100 3605.357 Sell1
18 2009-03-17 09:55:00 3650.00 3360.421 3663.322 Sell2
So the data should follow the repeating order Buy1, Sell1, Sell2, Buy2, with the observations in between eliminated.
I have tried several dplyr::filter commands but none gives the desired output.
If I have understood your problem correctly, the following code should solve it. It is adapted from this discussion.
The idea is to define your sequence as a pattern:
pattern <- c("Buy1", "Sell1", "Sell2", "Buy2")
Then find the position of this pattern in your column:
library(zoo)
pos <- which(rollapply(data$signal, 4, identical, pattern, fill = FALSE, align = "left"))
and extract the rows following the position of your patterns:
rows <- unlist(lapply(pos, function(x, n) seq(x, x+n-1), 4))
data_filtered <- data[rows,]
Voilà.
EDIT
Since I had misunderstood your problem, here is a new solution.
You want to retrieve the sequence "Buy1", "Sell1", "Sell2", "Buy2" in your column, and eliminate the observations that do not fit in this sequence. I do not see a trivial vectorised solution, so here is a loop that does it. Depending on the size of your data, you may want to implement a similar algorithm in Rcpp or vectorise it in some way.
sequence <- c("Buy1", "Sell1", "Sell2", "Buy2")
keep <- logical(length(data$signal))
s <- 0
for (i in seq_along(data$signal)) {
  if (sequence[s + 1] == data$signal[i]) {
    # this row is the next expected step: keep it and move to the next step
    keep[i] <- TRUE
    s <- (s + 1) %% 4
  } else {
    keep[i] <- FALSE
  }
}
data_filtered <- data[keep,]
Tell me if this works better.
If anyone has a vectorised solution, I would be curious to see it.
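Not truly vectorised, but the same state machine can be written without an explicit loop using Reduce. This is just a sketch: the signal vector is copied by hand from the question, and state is a name I made up for the running count of matched steps.

```r
# Signal column from the question, typed in by hand
signal <- c("Buy2", "Buy1", "Buy2", "Buy1", "Buy2", "Buy1", "Sell1", "Sell2",
            "Sell1", "Sell2", "Sell1", "Sell2", "Buy2", "Buy1", "Buy2", "Buy1",
            "Sell1", "Sell2", "Sell1", "Sell2")
pattern <- c("Buy1", "Sell1", "Sell2", "Buy2")

# state counts how many pattern steps have been matched so far;
# it goes up by one exactly on the rows we want to keep
state <- Reduce(function(s, sig) if (pattern[s %% 4 + 1] == sig) s + 1 else s,
                signal, accumulate = TRUE, init = 0)
keep <- diff(state) == 1

which(keep)
```

On this data the kept positions are 2, 7, 8, 13, 14, 17 and 18, the same rows as in the desired output. Under the hood Reduce still walks the vector element by element, so do not expect it to be faster than the loop.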
You can coerce the column data$signal into a factor and define the levels.
data$signal <- factor(data$signal, levels = c("Buy1", "Sell1", "Buy2", "Sell2"))
Then you can sort it
sorted.data <- data[order(data$signal), ]
Here is a great answer that talks about what you want to do:
Sort data frame column by factor
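A minimal illustration of what ordering by factor levels does, on a made-up vector. Note that this groups all values of each level together and keeps every element, so it may not be what the question is after.

```r
# Made-up signals
sig <- c("Sell2", "Buy1", "Buy2", "Sell1", "Buy1")

# order() on a factor follows the order of the levels, not the alphabet
f <- factor(sig, levels = c("Buy1", "Sell1", "Buy2", "Sell2"))
sorted <- sig[order(f)]
sorted
```

Both "Buy1" entries end up first, followed by "Sell1", "Buy2" and "Sell2".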
Here is a Rcpp solution:
library(Rcpp)
cppFunction('LogicalVector FindHit(const CharacterVector x, const CharacterVector y) {
LogicalVector res(x.size());
int k = 0;
for(int i = 0; i < x.size(); i++){
if(x[i] == y[k]){
res[i] = true;
k = (k + 1) % y.size();
}
}
return res;
}')
dtt[FindHit(dtt$V6, c('Buy1', 'Sell1', 'Sell2', 'Buy2')),]
# V1 V2 V3 V4 V5 V6
# 2 2009-01-14 09:55:00 4767.50 4718.254 5336.703 Buy1
# 7 2009-01-30 09:55:00 4436.00 4155.236 4377.907 Sell1
# 8 2009-02-02 09:55:00 4217.00 4152.626 4390.802 Sell2
# 13 2009-02-17 09:55:00 4161.05 4371.474 4548.912 Buy2
# 14 2009-02-27 09:55:00 3875.55 3862.085 4101.929 Buy1
# 17 2009-03-13 09:55:00 3656.00 3372.100 3605.357 Sell1
# 18 2009-03-17 09:55:00 3650.00 3360.421 3663.322 Sell2
Here is the dtt:
> dput(dtt)
structure(list(V1 = c("2009-01-13", "2009-01-14", "2009-01-15",
"2009-01-16", "2009-01-19", "2009-01-27", "2009-01-30", "2009-02-02",
"2009-02-09", "2009-02-12", "2009-02-13", "2009-02-16", "2009-02-17",
"2009-02-27", "2009-03-02", "2009-03-12", "2009-03-13", "2009-03-17",
"2009-03-18", "2009-03-20"), V2 = c("09:55:00", "09:55:00", "09:55:00",
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00",
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00",
"09:55:00", "09:55:00", "09:55:00", "09:55:00", "09:55:00"),
V3 = c(4645, 4767.5, 4485, 4580, 4532, 4190, 4436, 4217,
4469, 4469.9, 4553, 4347.2, 4161.05, 3875.55, 3636, 3420,
3656, 3650, 3721, 3687), V4 = c(4838.931, 4718.254, 4653.316,
4537.693, 4548.088, 4183.503, 4155.236, 4152.626, 4203.437,
4220.845, 4261.98, 4319.656, 4371.474, 3862.085, 3846.423,
3372.665, 3372.1, 3360.421, 3363.735, 3440.651), V5 = c(5358.883,
5336.703, 5274.384, 5141.435, 4891.041, 4548.497, 4377.907,
4390.802, 4376.277, 4503.798, 4529.777, 4564.387, 4548.912,
4101.929, 4036.02, 3734.949, 3605.357, 3663.322, 3682.293,
3784.778), V6 = c("Buy2", "Buy1", "Buy2", "Buy1", "Buy2",
"Buy1", "Sell1", "Sell2", "Sell1", "Sell2", "Sell1", "Sell2",
"Buy2", "Buy1", "Buy2", "Buy1", "Sell1", "Sell2", "Sell1",
"Sell2")), row.names = c(NA, -20L), class = "data.frame")
rptdate st
1 2/18/2017 2/12/2017
2 2/25/2017 2/19/2017
3 3/4/2017 2/26/2017
4 3/11/2017 3/5/2017
5 3/18/2017 3/12/2017
6 3/25/2017 3/19/2017
7 4/1/2017 3/26/2017
8 4/8/2017 4/2/2017
9 4/15/2017 4/9/2017
10 4/22/2017 4/16/2017
11 4/29/2017 4/23/2017
12 5/6/2017 4/30/2017
13 5/13/2017 5/7/2017
14 5/20/2017 5/14/2017
15 5/27/2017 5/21/2017
16 6/3/2017 5/28/2017
17 6/10/2017 6/4/2017
So basically rptdate is a bunch of Saturdays and st is each previous Sunday.
I would like to reshape this data frame (the columns are already in date format). In pseudocode, what I would like to do for each row i is this:
i = 1
j = 1
while (rptdate[i][j] > st[i][j])
{add a new row where rptdate[i][j+1] = rptdate[i][j] and st[i][j+1] = st[i][j] + 1; j = j + 1}
So basically, my desired new dataframe should be like this:
rptdate st
1 2/18/2017 2/12/2017
2/18/2017 2/13/2017
2/18/2017 2/14/2017
2/18/2017 2/15/2017
2/18/2017 2/16/2017
2/18/2017 2/17/2017
2/18/2017 2/18/2017
2 2/25/2017 2/19/2017
2/25/2017 2/20/2017
2/25/2017 2/21/2017
2/25/2017 2/22/2017
2/25/2017 2/23/2017
2/25/2017 2/24/2017
2/25/2017 2/25/2017
Thank you very much for your time.
Here is an idea via base R. You need to convert your variables to dates first. Then expand the data frame so that each date has 7 rows (1 week). Finally, generate all the missing dates using seq and put them in your st variable.
d2[] <- lapply(d2, function(i) as.Date(i, format = '%m/%d/%Y'))
d3 <- d2[rep(row.names(d2), each = 7),]
d3$st<- do.call(c, Map(function(x, y)seq(x, y, by = 1), d2$st, d2$rptdate))
head(d3, 10)
# rptdate st
#1 2017-02-18 2017-02-12
#1.1 2017-02-18 2017-02-13
#1.2 2017-02-18 2017-02-14
#1.3 2017-02-18 2017-02-15
#1.4 2017-02-18 2017-02-16
#1.5 2017-02-18 2017-02-17
#1.6 2017-02-18 2017-02-18
#2 2017-02-25 2017-02-19
#2.1 2017-02-25 2017-02-20
#2.2 2017-02-25 2017-02-21
...
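If the gap between st and rptdate is not always exactly one week, a variant of the same idea is to build the date sequences first and repeat each row as many times as its sequence is long. This is a sketch on two made-up rows in the same layout as d2, and days is a helper name I introduced.

```r
# Two rows in the same layout as d2, already converted to Date
d2 <- data.frame(rptdate = as.Date(c("2017-02-18", "2017-02-25")),
                 st = as.Date(c("2017-02-12", "2017-02-19")))

# One daily sequence per row, from st up to rptdate
days <- Map(function(x, y) seq(x, y, by = 1), d2$st, d2$rptdate)

# Repeat each row once per day in its sequence, then fill in the days
d3 <- d2[rep(seq_len(nrow(d2)), lengths(days)), ]
d3$st <- do.call(c, days)
d3
```

With these two rows you get 14 rows in total, 7 per rptdate.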
library(data.table)
dt <- data.table(V1=as.Date(c("2/18/2017","2/25/2017","3/4/2017","3/11/2017"),format = "%m/%d/%Y"),
V2=as.Date(c("2/12/2017","2/19/2017","2/26/2017","3/5/2017"),format = "%m/%d/%Y"))
for(i in 0:6){
dt[,paste0("colomn_i",i):=V1-i]
}
dt[,V2:=NULL]
temp <- melt(dt,id.vars = "V1")
setorder(temp,V1,value)
temp[,variable:=NULL]
Note that, in the end, V2 is not actually needed.
Here is an example using functions from dplyr and lubridate. dt2 would be the final output.
# Create example data frame
dt <- read.table(text = "rptdate st
2/18/2017 2/12/2017
2/25/2017 2/19/2017
3/4/2017 2/26/2017
3/11/2017 3/5/2017
3/18/2017 3/12/2017
3/25/2017 3/19/2017
4/1/2017 3/26/2017
4/8/2017 4/2/2017
4/15/2017 4/9/2017
4/22/2017 4/16/2017
4/29/2017 4/23/2017
5/6/2017 4/30/2017
5/13/2017 5/7/2017
5/20/2017 5/14/2017
5/27/2017 5/21/2017
6/3/2017 5/28/2017
6/10/2017 6/4/2017",
header = TRUE, stringsAsFactors = FALSE)
# Load packages
library(dplyr)
library(lubridate)
# Process the data
dt2 <- dt %>%
mutate(rptdate = mdy(rptdate), st = mdy(st)) %>%
rowwise() %>%
do(tibble(rptdate = rep(.$rptdate[1], 7),
st = seq(.$st[1], .$rptdate[1], by = 1))) %>%
mutate(rptdate = format(rptdate, "%m/%d/%Y"),
st = format(st, "%m/%d/%Y"))
Or you can use the map2 and unnest functions from tidyverse.
# Load packages
library(tidyverse)
library(lubridate)
# Process the data
dt2 <- dt %>%
mutate(rptdate = mdy(rptdate), st = mdy(st)) %>%
mutate(st = map2(st, rptdate, seq, by = 1)) %>%
unnest(st) %>%
mutate(rptdate = format(rptdate, "%m/%d/%Y"),
st = format(st, "%m/%d/%Y"))
Let's suppose I have a vector of numeric values
[1] 2844 4936 4936 4972 5078 6684 6689 7264 7264 7880 8133 9018 9968 9968 10247
[16] 11267 11508 11541 11607 11717 12349 12349 12364 12651 13025 13086 13257 13427 13427 13442
[31] 13442 13442 13442 14142 14341 14429 14429 14429 14538 14872 15002 15064 15163 15163 15324
[46] 15324 15361 15361 15400 15624 15648 15648 15648 15864 15864 15881 16332 16847 17075 17136
[61] 17136 17196 17843 17925 17925 18217 18455 18578 18578 18742 18773 18806 19130 19195 19254
[76] 19254 19421 19421 19429 19585 19686 19729 19729 19760 19760 19901 20530 20530 20530 20581
[91] 20629 20629 20686 20693 20768 20902 20980 21054 21079 21156
and I want to create a sequence along this vector, but numbered by unique value. For example,
length(unique(vector))
is 74 and there are a total of 100 values in the vector. The sequence should have numbers ranging from 1 - 74 only but with length 100 as some numbers will be repeated.
Any idea on how this can be done?
Thanks.
Perhaps
res <- as.numeric(factor(v1))
head(res)
#[1] 1 2 2 3 4 5
Or
res1 <- match(v1, unique(v1))
Or
library(fastmatch)
res2 <- fmatch(v1, unique(v1))
Or
res3 <- findInterval(v1, unique(v1))
data
v1 <- c(2844, 4936, 4936, 4972, 5078, 6684, 6689, 7264, 7264, 7880,
8133, 9018, 9968, 9968, 10247, 11267, 11508, 11541, 11607, 11717,
12349, 12349, 12364, 12651, 13025, 13086, 13257, 13427, 13427,
13442, 13442, 13442, 13442, 14142, 14341, 14429, 14429, 14429,
14538, 14872, 15002, 15064, 15163, 15163, 15324, 15324, 15361,
15361, 15400, 15624, 15648, 15648, 15648, 15864, 15864, 15881,
16332, 16847, 17075, 17136, 17136, 17196, 17843, 17925, 17925,
18217, 18455, 18578, 18578, 18742, 18773, 18806, 19130, 19195,
19254, 19254, 19421, 19421, 19429, 19585, 19686, 19729, 19729,
19760, 19760, 19901, 20530, 20530, 20530, 20581, 20629, 20629,
20686, 20693, 20768, 20902, 20980, 21054, 21079, 21156)
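One caveat worth checking: these approaches agree here only because v1 is already sorted. A small sketch on made-up vectors (v and u are hypothetical, not from the question) shows the difference. factor numbers the values by sorted order, while match numbers them by first appearance.

```r
# Sorted input: the approaches give the same index
v <- c(2844, 4936, 4936, 4972, 5078, 5078, 6684)
by_factor   <- as.numeric(factor(v))
by_match    <- match(v, unique(v))
by_interval <- findInterval(v, unique(v))

# Unsorted input: factor ranks by value, match by first appearance
u <- c(30, 10, 30, 20)
u_factor <- as.numeric(factor(u))
u_match  <- match(u, unique(u))
```

For v all three give 1 2 2 3 4 4 5. For u the factor version gives 3 1 3 2 while match gives 1 2 1 3. findInterval needs its second argument to be sorted, so I did not use it on u.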
You could use .GRP from "data.table" for this:
library(data.table)
y <- as.data.table(x)[, y := .GRP, by = x]
head(y)
# x y
# 1: 2844 1
# 2: 4936 2 ## Note the duplicated value
# 3: 4936 2 ## in these rows, corresponding to x
# 4: 4972 3
# 5: 5078 4
# 6: 6684 5
tail(y)
# x y
# 1: 20768 69
# 2: 20902 70
# 3: 20980 71
# 4: 21054 72
# 5: 21079 73
# 6: 21156 74 ## "y" values go to 74