Making a list by grouping consecutive months in R [duplicate]

This sounds simple, but I'm having a hard time figuring it out. I have a data frame (S) with one column populated with numeric months (1-12, i.e. Jan-Dec):
S$month
[1] 6 7 12 1 2 3 4 5 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10
[27] 11 12 2 3 4 6 10 11 12 1 2 3 5 6 7 7
I'd like to split the data frame into a list such that consecutive months are grouped, as shown below:
S[[1]]$month
[1] 6 7
S[[2]]$month
[1] 12 1 2 3 4 5 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10
[25] 11 12
S[[3]]$month
[1] 2 3 4
S[[4]]$month
[1] 6
S[[5]]$month
[1] 10 11 12 1 2 3
S[[6]]$month
[1] 5 6 7 7
Note that some months are repetitive because more than one measurement was taken.
Is there an easy way to do this other than writing every piece out by hand, like:
S[[1]]<-S[c(1:2),]; S[[2]]<-S[c(3:28),]; and so on? That's quite inefficient!

You can use cumsum and diff to create a grouping variable, then use the split function to turn your vector into a list of consecutive months:
split(S$month, cumsum(!c(1, diff(S$month)) %in% c(0, 1, -11)))
# A diff of 1 is the next month, a diff of 0 is a repeated month, and a diff of
# -11 can only come from December followed by January (12, 1), so all three
# count as consecutive. Any other diff starts a new group.
# $`0`
# [1] 6 7
# $`1`
# [1] 12 1 2 3 4 5 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
# $`2`
# [1] 2 3 4
# $`3`
# [1] 6
# $`4`
# [1] 10 11 12 1 2 3
# $`5`
# [1] 5 6 7 7
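Since the question asks for the data frame itself to be split, here is a minimal sketch of the same idea applied to S rather than to the bare vector. The intermediate names step, new_run, runs and run_sizes are only illustrative.

```r
# Data from the question: a data frame with a numeric month column.
S <- data.frame(month = c(6, 7, 12, 1, 2, 3, 4, 5, 5, 6, 7, 8, 9, 10, 11, 12,
                          1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 2, 3, 4,
                          6, 10, 11, 12, 1, 2, 3, 5, 6, 7, 7))

# Step between neighbouring rows; the first row gets a dummy step of 1
# so that it never opens a new group by itself.
step <- c(1, diff(S$month))

# A new run starts wherever the step is not 0 (repeat), 1 (next month)
# or -11 (December followed by January).
new_run <- !(step %in% c(0, 1, -11))

# cumsum() turns the break flags into a group id; split() also accepts
# a data frame and returns a list of data frames.
runs <- split(S, cumsum(new_run))

# Number of rows in each run
run_sizes <- vapply(runs, nrow, integer(1))
run_sizes
```

Each element of runs is a data frame, so any other columns of S stay attached to their months.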

We can also do this by converting the months to a running index, so that we don't rely on a hard-coded set of diff values.
with(S, split(month, cumsum(c(TRUE, diff(cumsum(c(FALSE,
(month==12)[-length(month)]))*12 + month)>1))))
#$`1`
#[1] 6 7
#$`2`
#[1] 12 1 2 3 4 5 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10 11 12
#$`3`
#[1] 2 3 4
#$`4`
#[1] 6
#$`5`
#[1] 10 11 12 1 2 3
#$`6`
#[1] 5 6 7 7
data
S <- structure(list(month = c(6, 7, 12, 1, 2, 3, 4, 5, 5, 6, 7, 8,
9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 2, 3, 4,
6, 10, 11, 12, 1, 2, 3, 5, 6, 7, 7)), .Names = "month", row.names = c(NA,
-42L), class = "data.frame")
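To see what the nested expression is doing, it may help to look at the intermediate values on a short vector. This is only a sketch on the first six months of the data, and the names years_done, idx and grp are made up for illustration.

```r
month <- c(6, 7, 12, 1, 2, 3)

# Number of Decembers seen strictly before each position,
# i.e. how many times the calendar has wrapped around so far.
years_done <- cumsum(c(FALSE, (month == 12)[-length(month)]))

# Months unwrapped onto a single increasing axis.
idx <- years_done * 12 + month

# A jump of more than 1 on that axis starts a new group.
grp <- cumsum(c(TRUE, diff(idx) > 1))

idx
grp
```

One thing to be aware of: with this scheme a repeated December (12, 12) is counted as two wrap-arounds, so the second 12 would open a new group. The data in the question has no repeated 12, so it does not show up there.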

Related

Logical function of the sample numbers to be sequential

Suppose I am generating a data frame of cards, each with its own letter (A/B/C) and its own number (2-11).
Now I randomly sample 5 cards.
card <- data.frame(
pack = rep(c("A","B","C"), 10),
rank = rep(2:11, 3)
)
card
pack rank
1 A 2
2 B 3
3 C 4
4 A 5
5 B 6
6 C 7
7 A 8
8 B 9
9 C 10
10 A 11
11 B 2
12 C 3
13 A 4
14 B 5
15 C 6
16 A 7
17 B 8
18 C 9
19 A 10
20 B 11
21 C 2
22 A 3
23 B 4
24 C 5
25 A 6
26 B 7
27 C 8
28 A 9
29 B 10
30 C 11
s<-card[sample(seq_len(nrow(card)), 5),]
s
This generates our sample.
Now, define "sequential rank".
A sample is said to contain five cards of sequential rank if the numbers are consecutive.
For example, cards with ranks 3 4 5 6 7 are sequential.
We also define 2 to be a special number, such that
2 3 4 5 6 is sequential and 8 9 10 11 2 is also sequential, but 9 10 11 2 3 is not.
My goal is to write a logical function that returns TRUE if the sampled cards are sequential and FALSE otherwise.
I tried listing all the sequential combinations:
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10
7 8 9 10 11
8 9 10 11 2
But note that 2 4 3 5 6 is also sequential, and so on, because the order of the cards does not matter. So there are actually a great many cases to deal with.
My question is: is there a more compact way to write this function (TRUE if the sampled cards are sequential, FALSE otherwise)?
Here's another approach:
is_run <- function(x) {
x <- sort(x)
# All consecutive runs should return TRUE
if(all(diff(x) == 1)) return(TRUE)
# The special case of 8, 9, 10, 11, 2 should also return TRUE
if(all(x == c(2, 8, 9, 10, 11))) return(TRUE)
# In all other cases this is not a consecutive run
return(FALSE)
}
Tests:
# Sequence
is_run(c(4, 5, 6, 7, 8))
#> [1] TRUE
# Sequence starting with 2
is_run(c(2, 4, 3, 5, 6))
#> [1] TRUE
# Sequence ending with 2
is_run(c(8, 9, 10, 11, 2))
#> [1] TRUE
# Cards not in order but can be arranged into sequence
is_run(c(6, 5, 8, 7, 4))
#> [1] TRUE
# Cards not in order but can be arranged into sequence (including a 2)
is_run(c(9, 11, 10, 2, 8))
#> [1] TRUE
# Non-sequence
is_run(c(3, 4, 6, 7, 8))
#> [1] FALSE
# Wrap around not allowed
is_run(c(10, 11, 2, 3, 4))
#> [1] FALSE
Created on 2022-10-06 with reprex v2.0.2
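Exclude the special number 2, then sort and check whether every remaining difference is 1. Note that this also accepts a hand such as 2 4 5 6 7, because the 2 is simply dropped.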
# example input
d <- read.table(text = "
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10
7 8 9 10 11
8 9 10 11 2
9 10 11 2 3")
foo <- function(x) all(diff(sort(x[ x != 2 ])) == 1)
apply(d, 1, foo)
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
It is unclear exactly how 2 operates. This solution assumes the only "sequential rank" multisets involving 2 are {2, 3, 4, 5, 6} and {2, 8, 9, 10, 11} (e.g., {2, 4, 5, 6, 7} and {2, 2, 9, 10, 11} are not considered "sequential rank"). With s a numeric vector holding the five sampled ranks:
!any((diff(sort(s)) - 1) %% 5:8)
It can also be vectorized for a 5-by-n matrix:
library(Rfast)
colAll(!((diff(colSort(s)) - 1) %% 5:8))
Example usage:
set.seed(1013119055L)
s <- matrix(sample(2:11, 5e3, TRUE), 5)
s[,colAll(!((diff(colSort(s)) - 1) %% 5:8))]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] 3 7 4 6 9 3 9 8 11 6
#> [2,] 7 8 2 2 8 7 8 7 2 8
#> [3,] 4 9 6 4 10 5 7 6 10 7
#> [4,] 5 10 5 5 11 4 6 10 9 10
#> [5,] 6 6 3 3 7 6 5 9 8 9
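For anyone puzzled by the modulo trick, here is a base-R sketch that spells it out for a single hand. It assumes exactly five cards with ranks in 2:11. The function name is_seq_rank is made up, and it uses all(... == 0) in place of !any(...), which I believe avoids the coercion warning that any() gives for non-integer input.

```r
is_seq_rank <- function(ranks) {
  # After sorting, neighbouring cards give a gap of 0, a duplicate gives -1,
  # and the special hand 2 8 9 10 11 gives a first gap of 5 (the jump 2 -> 8).
  gaps <- diff(sort(ranks)) - 1

  # Taking the four gaps modulo 5, 6, 7, 8 lets only the first gap be 5;
  # within ranks 2:11 every other non-zero gap leaves a non-zero remainder.
  all(gaps %% 5:8 == 0)
}

is_seq_rank(c(2, 4, 3, 5, 6))     # TRUE
is_seq_rank(c(9, 11, 10, 2, 8))   # TRUE
is_seq_rank(c(9, 10, 11, 2, 3))   # FALSE
is_seq_rank(c(2, 2, 9, 10, 11))   # FALSE
```

For the data frame sampled in the question this would be called as is_seq_rank(s$rank).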
So you want a function that tells you if a vector of integers contains a straight sequence when ordered.
Edit: And if the last number is 2 but everything else is sequential, it should still be deemed sequential.
This should work:
is.sequential <- function(x) {
if (x[length(x)]==2) {
x <- x[-length(x)] # neglect last element if it is 2
if (2 %in% x) return(FALSE) # but then no other 2s are allowed
}
# sequential means the range spans exactly length(x) values with no repeats
length(x) == max(x) - min(x) + 1 && !anyDuplicated(x)
}
x <- c(1, 4, 5, 2, 3)
is.sequential(x) # TRUE
y <- c(1, 2, 3, 5, 6)
is.sequential(y) # FALSE
z <- c(4, 5, 6, 7, 2)
is.sequential(z) # TRUE
z2 <- c(2, 3, 4, 5, 2)
is.sequential(z2) # FALSE
Note that this function will not work well if x contains anything that is not an integer.

Remove rows from a certain dataframe by the values of a different data frame in R [duplicate]

I have these two data frames, a and b.
I want to remove from b the rows that match a.
Example a =
X Y
1 1 3
2 2 4
3 3 5
example b =
X Y Z
1 3 5 4 --- want to remove this
2 4 6 2
3 1 3 2 --- want to remove this
4 2 3 4
5 5 3 4
6 2 4 2 --- want to remove this
7 4 3 4
8 2 4 6 ---- want remove this
9 6 9 6
10 2 0 3
So I'm only keeping the rows of b whose X and Y combination does not appear in a.
The final result would be this:
X Y Z
1 4 6 2
2 2 3 4
3 5 3 4
4 4 3 4
5 6 9 6
6 2 0 3
Thanks
anti_join from the dplyr package can be very helpful.
library(tidyverse)
a <- tibble(X=c(1, 2, 3), Y=c(3, 4, 5))
b <- tibble(X=c(3, 4, 1, 2, 5, 2, 4, 2, 6, 2),
Y=c(5, 6, 3, 3, 3, 4, 3, 4, 9, 0),
Z=c(4, 2, 2, 4, 4, 2, 4, 6, 6, 3))
# named result rather than c, to avoid masking base::c()
result <- b %>% anti_join(a, by=c("X", "Y"))
result
Gives
# A tibble: 6 x 3
X Y Z
<dbl> <dbl> <dbl>
1 4 6 2
2 2 3 4
3 5 3 4
4 4 3 4
5 6 9 6
6 2 0 3
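If you would rather not load a package, the same filtering can be sketched in base R by building one key per row from the matching columns. The names key_a, key_b and kept are only illustrative, and the "_" separator is an assumption that works here because the columns are plain numbers.

```r
a <- data.frame(X = c(1, 2, 3), Y = c(3, 4, 5))
b <- data.frame(X = c(3, 4, 1, 2, 5, 2, 4, 2, 6, 2),
                Y = c(5, 6, 3, 3, 3, 4, 3, 4, 9, 0),
                Z = c(4, 2, 2, 4, 4, 2, 4, 6, 6, 3))

# One string key per row, built from the columns that must match.
key_a <- paste(a$X, a$Y, sep = "_")
key_b <- paste(b$X, b$Y, sep = "_")

# Keep only the rows of b whose key does not occur in a.
kept <- b[!(key_b %in% key_a), ]
rownames(kept) <- NULL   # renumber the rows 1..n
kept
```

With character columns that may themselves contain the separator, a different separator (or anti_join) would be the safer choice.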

R: create a vector based on a list

I have the following list called m1:
> m1
[[1]]
[1] 36 37 38
[[2]]
[1] 34 35
[[3]]
[1] 30 31 32 33
[[4]]
[1] 24 25 26 27 28 29
[[5]]
[1] 20 21 22 23
[[6]]
[1] 14 15 16 17 18 19
[[7]]
[1] 11 12 13
[[8]]
[1] 7 8 9 10
[[9]]
[1] 5 6
[[10]]
[1] 1 2 3 4
[[11]]
integer(0)
I would like to create a vector based on this list, which has the value 1 at positions 36, 37, and 38; the value 2 at positions 34 and 35, etc. The final output should be:
vector_1 <- c(10, 10, 10, 10, 9, 9, 8, 8, 8, 8, 7, 7, 7, 6, 6, 6, 6, 6, 6, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 2, 2, 1, 1, 1)
How can I accomplish this in R?
EDIT:
Thanks to a comment below:
> rep(length(m1):1, sapply(m1, length))
[1] 11 11 11 10 10 9 9 9 9 8 8 8 8 8 8 7 7 7 7 6 6 6 6 6 6 5 5 5 4
[30] 4 4 4 3 3 2 2 2 2
That doesn't quite give me what I want, but it's definitely on the right track!
This should handle cases with empty entries and non-sequential entries....
m1 <- list(c(7,4,5), c(2,10,9), c(1,3,6,8), integer())
# [[1]]
# [1] 7 4 5
#
# [[2]]
# [1] 2 10 9
#
# [[3]]
# [1] 1 3 6 8
#
# [[4]]
# integer(0)
rep(seq_along(m1), sapply(m1, length))[order(unlist(m1))]
#[1] 3 2 3 1 1 3 1 3 2 2
This solution should work for more general cases too even if the elements inside m1 are not in a specific order
#DATA
m1 = list(36:38, 34:35, 30:33, 24:29, 20:23,
14:19, 11:13, 7:10, 5:6, 1:4, integer(0))
#Extract the maximum element in m1
mymax = max(unlist(m1))
#Go through m1 using index and replace respective indices in the position
#defined by the elements of m1, otherwise make the elements zero
Reduce("+", lapply(1:length(m1), function(i)
replace(rep(0, mymax), m1[[i]], i)))
# [1] 10 10 10 10 9 9 8 8 8 8 7 7 7 6 6 6 6 6 6 5 5 5 5
#[24] 4 4 4 4 4 4 3 3 3 3 2 2 1 1 1
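Here is a straightforward base-R solution (it relies on the list elements running from the highest positions down to the lowest, as they do in the question):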
# data
m1 <- list(36:38, 34:35, 30:33, 24:29, 20:23, 14:19, 11:13, 7:10, 5:6, 1:4, integer(0))
# Count length, and repeat each number in 1:11 accordingly
rev(rep(1:11, sapply(m1, length)))
[1] 10 10 10 10 9 9 8 8 8 8 7 7 7 6 6 6 6 6 6 5 5 5 5 4 4 4 4 4 4 3 3 3
[33] 3 2 2 1 1 1
Edit:
A more generalisable answer would be:
rev(rep(seq_along(m1), sapply(m1, length)))
Try this:
rev(unlist(sapply(1:length(m1), function(x) rep(x,length(m1[[x]])))))
# or even better, @snoram's edited version of this:
rev(rep(seq_along(m1), sapply(m1, length)))
Output:
[1] 10 10 10 10 9 9 8 8 8 8 7 7 7 6 6 6 6 6 6 5 5 5 5 4
[25] 4 4 4 4 4 3 3 3 3 2 2 1 1 1
Sample data:
m1 <- list(36:38,34:35,30:33,24:29,20:23,
14:19,11:13,7:10,5:6,1:4)
names(m1) <- 1:10
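Another way to think about it, sketched below, is to treat the list elements as positions and assign the list index directly into those positions. This should not depend on the order of the list or on the order inside each element. The names positions and group_id are made up for the example.

```r
m1 <- list(36:38, 34:35, 30:33, 24:29, 20:23,
           14:19, 11:13, 7:10, 5:6, 1:4, integer(0))

# All target positions, in list order (empty elements contribute nothing).
positions <- unlist(m1)

# The list index repeated once per position it owns.
group_id <- rep(seq_along(m1), lengths(m1))

# Start from zeros and write each index into its positions.
vector_1 <- integer(max(positions))
vector_1[positions] <- group_id
vector_1
```

Any position not mentioned in the list stays 0, which makes gaps easy to spot.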

how to make a vector of x from 1 to max value

In a dataset like this one
what code should I use if I want to make a vector of
x <- 1: max (day)/ID
? So x will be
1:7 for B1
1:11 for B2
1:22 for B3
I tried
library(doBy) # provides summaryBy
max_day <- summaryBy(day ~ ID, df, FUN = max) # extract the maximum day per ID
df <- merge(df, max_day) # add a column with the maximum day
max_day <- unique(df[, c("ID", "day.max")]) # keep one (max) value per ID
# and finally the vector
x <- 1:(max_day$day.max)
I got this message
Warning message:
In 1:(max_day$day.max) :
numerical expression has 11134 elements: only the first used
Any suggestions?
tapply(df$day, df$ID, function(x) 1:max(x))
I don't know what your output should look like, but you can try this:
my_data <- data.frame(ID = c(rep("B1", 3), rep("B2", 4), rep("B3", 3)),
day = sample(1:20, 10, replace = TRUE))
tmp <- aggregate(my_data$day, by = list(my_data$ID), FUN = max)
sapply(1:nrow(tmp), function(y) return(1:tmp$x[y]))
# example output (it varies between runs because no seed is set):
# [[1]]
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
# [[2]]
# [1] 1 2 3 4 5 6 7 8 9 10 11
# [[3]]
# [1] 1 2 3 4 5 6 7 8 9 10 11
We can use sapply to loop over the unique elements of ID and generate a sequence from 1 to the max of the day column for that ID:
sapply(unique(df$ID), function(x) seq(1, max(df[df$ID == x, "day"])))
#[[1]]
#[1] 1 2 3 4 5 6 7
#[[2]]
#[1] 1 2 3 4 5 6 7 8 9 10 11
#[[3]]
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
If we want everything as one vector, we can use unlist:
unlist(sapply(unique(df$ID), function(x) seq(1, max(df[df$ID == x, "day"]))))
#[1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 8 9 10 11 1 2 3 4 5 6 7 8 9 10
# 11 12 13 14 15 16 17 18 19 20 21 22
Yet another option, using Hadley Wickham's purrr package, as part of the tidyverse.
d <- data.frame(id = rep(c("B1", "B2", "B3"), c(3, 4, 5)),
v = c(1:3, 1:4, 1:5),
day = c(1, 3, 7, 1, 5, 9, 11, 3, 5, 11, 20, 22),
number = c(15, 20, 30, 25, 26, 28, 35, 10, 12, 14, 16, 18))
library(purrr)
d %>%
split(.$id) %>%
map(~1:max(.$day))
# $B1
# [1] 1 2 3 4 5 6 7
# $B2
# [1] 1 2 3 4 5 6 7 8 9 10 11
# $B3
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
df <-
data.frame(ID = c(rep("B1",3),rep("B2",4),rep("B3",5)),
V = c(1,2,3,1,2,3,4,1,2,3,4,5),
day = c(1,3,7,1,5,9,11,3,5,11,20,22),
number = c(15,20,30,25,26,28,35,10,12,14,16,18))
x <- list()
n <- 1
for(i in unique(df$ID)){
max_day <- max(df$day[df$ID==i])
x[[n]] <- 1:max_day
n <- n+1
}
x
[[1]]
[1] 1 2 3 4 5 6 7
[[2]]
[1] 1 2 3 4 5 6 7 8 9 10 11
[[3]]
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
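The warning in the question appears because the : operator only takes single numbers, so a whole column on the right-hand side gets cut down to its first element. A compact base-R sketch that builds one sequence per ID is below. It uses only the ID and day columns from the example data above.

```r
df <- data.frame(ID  = rep(c("B1", "B2", "B3"), c(3, 4, 5)),
                 day = c(1, 3, 7, 1, 5, 9, 11, 3, 5, 11, 20, 22))

# split() groups the day values by ID; seq_len() then builds 1..max per group.
x <- lapply(split(df$day, df$ID), function(d) seq_len(max(d)))
x
```

The result is a list named by ID, so x$B2 or x[["B2"]] gives 1 to 11.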

lag not working as expected

lag is not working as I expected
a<-c(0,1,2,3,4,5,6,7,8)
a
## [1] 0 1 2 3 4 5 6 7 8
lag(a,k=1)
## [1] 0 1 2 3 4 5 6 7 8
## attr(,"tsp")
## [1] 0 8 1
I thought I would get:
0 0 1 2 3 4 5 6 7
or
1 2 3 4 5 6 7 8 0
What am I doing wrong?
You should use Lag from the Hmisc package:
library(Hmisc)
Lag(c(0,1,2,3,4,5,6,7,8), shift = 1)
# [1] NA 0 1 2 3 4 5 6 7
Actually, I went with this in the end:
c(a[-1],0)
It does what I wanted.
data.table::shift is another one:
library(data.table)
a <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)
shift(a)
#[1] NA 0 1 2 3 4 5 6 7
Another way, using stats::lag with the zoo library:
library(zoo)
a <- zoo(c(0, 1, 2, 3, 4, 5, 6, 7, 8)) #convert to zoo
So:
lag(a, 1, na.pad = TRUE)
# 1 2 3 4 5 6 7 8 9
# 1 2 3 4 5 6 7 8 NA
Or:
lag(a, -1, na.pad = TRUE)
# 1 2 3 4 5 6 7 8 9
#NA 0 1 2 3 4 5 6 7
You can wrap an as.vector around it to get rid of the indices.
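As for why the original call appears to do nothing: on a plain vector, lag from the stats package shifts the time base (the tsp attribute in the output) and leaves the values alone. If you just want shifted values without a package, a small sketch is below. lag_vec and lead_vec are made-up helper names, and both assume k is between 1 and length(x).

```r
a <- c(0, 1, 2, 3, 4, 5, 6, 7, 8)

# Shift values k places to the right, padding the front with fill.
lag_vec <- function(x, k = 1, fill = NA) c(rep(fill, k), head(x, -k))

# Shift values k places to the left, padding the end with fill.
lead_vec <- function(x, k = 1, fill = NA) c(tail(x, -k), rep(fill, k))

lag_vec(a, fill = 0)    # 0 0 1 2 3 4 5 6 7
lead_vec(a, fill = 0)   # 1 2 3 4 5 6 7 8 0
```

These give the two outputs the question was expecting. Leaving fill at its default pads with NA instead.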
