Returning index of vector - r

I have a vector that looks like this:
c(1,1,1,1,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5..)
I want to get the index of when the element changes, i.e. (1,5,9,...)
I know how to do it with a for loop, but I am trying a faster way as my vector is very large.
Thanks,

Try
which(c(TRUE,diff(v1)!=0))
Or
match(unique(v1), v1)
Or if the vector is sorted
head(c(1, findInterval(unique(v1), v1)+1),-1)
data
v1 <- c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4,
4, 4, 5, 5, 5, 5, 5)

Another fun approach:
v1 <- c(1, 1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 8)
head(c(1, cumsum(rle(v1)$lengths) + 1), -1)
Or if you have magrittr then it can become
library(magrittr)
v1 %>%
rle %>%
.$lengths %>%
cumsum %>%
add(1) %>%
c(1, .) %>%
head(-1)
Result: 1 3 4 5 7 8 9 12
Might look weird but it's fun to think that through :)
Explanation: cumsum(rle(v1)$lengths) gets you almost all the way there, but it'll give you the index of where a sequence ends rather than where the next sequence starts, so that's why we add one to each element, append the index 1, and remove the last element.

Related

Subset every 5 rows by group?

I have a dataset with multiple groups, and want to subset rows within groups along multiples of 5, with the addition of the first row (so row 1,5,10,15, etc within every group).
Right now my dataset has a column named "Group ID" and a few other columns (e.g. time, date, etc), but nothing indicating row numbers of any kind.
Any help would be appreciated! I was thinking maybe something compatible with dplyr? I was trying things using the function slice but no luck so far.
You need to create the sequence within each group and then just use filter
library(dplyr)
df <- data.frame(id = c(1, 2, 1, 2, 2, 3, 4, 3, 1, 2, 4, 4, 4, 3, 1, 1, 1, 2, 2),
b = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6))
df <- df %>%
group_by(id) %>%
mutate(group_index = row_number()) %>%
filter(group_index == 1 | group_index %% 5 == 0)

split vector after all predefined set of elements occured

I have to do the following:
I have a vector, let as say
x <- c(1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 3, 2, 11, 1, 3, 3, 4, 1)
I have to subset the remainder of a vector after 1, 2, 3, 4 occurred at least once.
So the subset new vector would only include 4, 5, 5, 3, 2, 11, 1, 3, 3, 4, 1.
I need a relatively easy solution on how to do this. It might be possible to do an if and while loop with breaks, but I am kinda struggling to come up with a solution.
Is there a simple (even mathematical way) to do this in R?
Use sapply to find where each predefined number occurs first time.
x[-seq(max(sapply(1:4, function(y) which(x == y)[1])))]
# [1] 4 5 5 3 2 11 1 3 3 4 1
Data
x <- c(1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 3, 2, 11, 1, 3, 3, 4, 1)
You can use run length encoding for this
x = c(1, 1, 2, 3, 3, 3, 4, 4, 5, 5, 3, 2, 11, 1, 3, 3, 4, 1)
encoded = rle(x)
# Pick the first location of 1, 2, 3, and 4
# Then find the max index location
indices = c(which(encoded$values == 1)[1],
which(encoded$values == 2)[1],
which(encoded$values == 3)[1],
which(encoded$values == 4)[1])
index = max(indices)
# Find the index of x corresponding to your split location
reqd_index = cumsum(encoded$lengths)[index-1] + 2
# Print final split value
x[reqd_index:length(x)]
The result is as follows
> x[reqd_index:length(x)]
[1] 4 5 5 3 2 11 1 3 3 4 1

Selecting columns using ends_with helper and a vector of string names

I have a data frame, in wide format, with each column representing one questionnaire item for one particular version of a questionnaire for a particular time point (repeated measures design).
My data would look something like the following:
df <- data.frame(id = c(1:5), t1_QOL_child_Q1 = c(5, 3, 6, 2, 7), t1_QOL_child_Q2 = c(5, 2, 3, 7, 1), t1_QOL_child_Q3 = c(7, 7, 6, 2, 5), t1_QOL_child_joy = c(9,9, 5, 3, 6), t1_QOL_teen_Q1 = c(5, 3, 6, 2, 7), t1_QOL_teen_Q2 = c(5, 2, 3, 7, 1), t1_QOL_teen_Q3 = c(7, 7, 6, 2, 5), t1_QOL_teen_joy = c(5, 7, 4, 7, 9), t1_QOL_adult_Q1 = c(5, 3, 6, 2, 7), t1_QOL_adult_Q2 = c(5, 2, 3, 7, 1), t1_QOL_adult_Q3 = c(7, 7, 6, 2, 5), t1_QOL_adult_joy = c(6, 5, 3, 3, 2), t2_QOL_child_Q1 = c(5, 3, 6, 2, 7), t2_QOL_child_Q2 = c(5, 2, 3, 7, 1), t2_QOL_child_Q3 = c(7, 7, 6, 2, 5), t2_QOL_child_joy = c(9,9, 5, 3, 6), t2_QOL_teen_Q1 = c(5, 3, 6, 2, 7), t2_QOL_teen_Q2 = c(5, 2, 3, 7, 1), t2_QOL_teen_Q3 = c(7, 7, 6, 2, 5), t2_QOL_teen_joy = c(5, 7, 4, 7, 9), t2_QOL_adult_Q1 = c(5, 3, 6, 2, 7), t2_QOL_adult_Q2 = c(5, 2, 3, 7, 1), t2_QOL_adult_Q3 = c(7, 7, 6, 2, 5), t2_QOL_adult_joy = c(6, 5, 3, 3, 2))
For example, column t1_QOL_child_Q1 would mean Question 1 (Q1) of the child version (child) of Quality of Life (QOL) questionnaire, with time point 1 (t1) data.
I want to select only subscales/columns whose suffix are labelled differently. In the sample data above, it would be the columns ending with "joy".
I have over 3000 columns and many more suffixes and it would be a pain to use the following:
select(df, ends_with("joy"), ends_with(<another suffix>), ends_with(<another suffix>))
I have thought of putting all the potential suffixes in a string vector, and use the vector as an input to the ends_with function, but ends_with could only take a single string instead of a vector of strings.
I have searched on Stackoverflow and found a solution that could accommodate a small vector of strings, which is the following:
select(df, sapply(vector_of_strings, starts_with))
However, I have too many suffixes in my vector of strings and the following error message resulted from it: Error: sapply(vector_of_strings, ends_with) must resolve to integer column positions, not a list
Help appreciated. Thanks!
We can use a single matches with multiple patterns separated by | to match substrings at the end ($) of the string
df %>%
select(matches("(joy|Q2)$"))

dplyr sample by groups of values

I want to make samples based on grouped values with dplyr :
What I tried :
id <- c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 6, 7, 8, 8, 8, 8, 8)
id <- as.data.frame(id)
sample <- id %>%
group_by(id) %>%
sample_n(2, replace = FALSE) %>%
ungroup(id)
sample
Expected result ( n sample =2) :
1, 1, 1, 2
or
1, 1, 1, 3, 3
or
5, 5, 5, 6, 6
etc.
I have got an error:
Error: `size` must be less or equal than 1 (size of data), set `replace` = TRUE to use sampling with replacement
Perhaps this helps
id %>%
distinct(id) %>%
sample_n(2, replace = FALSE) %>%
inner_join(id, .)

Subsample a matrix by selection locations with specific values within a matrix in R

I'm have to use R instead of Matlab and I'm new to it.
I have a large array of data repeating like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10...
I need to find the locations where values equal to 1, 4, 7, 10 are found to create a sample using those locations.
In this case it will be position(=corresponding value) 1(=1) 4(=4) 7(=7) 10(=10) 11(=1) 14(=4) 17(=7) 20(=10) and so on.
in MatLab it would be y=find(ismember(x,[1, 4, 7, 10 ])),
Please, help! Thanks, Pavel
something like this?
foo <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
bar <- c(1, 4, 7, 10)
which(foo %in% bar)
#> [1] 1 4 7 10 11 14 17 20
#nicola, feel free to copy my answer and get the recognition for your answer, simply trying to close answered questions.
The %in% operator is what you want. For example,
# data in x
targets <- c(1, 4, 7, 10)
locations <- x %in% targets
# locations is a logical vector you can then use:
y <- x[locations]
There'll be an extra step or two if you wanted the row and column indices of the locations, but it's not clear if you do. (Note, the logicals will be in column order).

Resources