R: Divide columns into various subcolumns at specific chosen points / values

I know this might be simple; however, I searched and couldn't find a clear answer, and as an inexperienced R user, I couldn't develop it myself.
I simply need to divide a column in a list or data frame into several sub-columns (not necessarily of equal length) at certain defined points, given either by position or by value. I'm dealing with large data, so I'm hoping there is a fast function that directly divides the column at the chosen points.
To make it clear, I need to make something like:
# data frame
df<- data.frame(cbind("l1"=c(1:20),"l2"=c(21:40)))
# separation points
pts<- c(4, 11, 17)
# dividing into sub columns
gp1<-df$l1[1:pts[1]]
gp2<-df$l1[pts[1]:pts[2]]
gp3<-df$l1[pts[2]:pts[3]]
gp4<-df$l1[pts[3]:20]
# combining
res<- list(gp1, gp2, gp3, gp4)
> res
[[1]]
[1] 1 2 3 4
[[2]]
[1] 4 5 6 7 8 9 10 11
[[3]]
[1] 11 12 13 14 15 16 17
[[4]]
[1] 17 18 19 20
But without defining the separation points one by one, and without reordering the data on a value basis.
Thanks in advance for your help!

We can use Map to build the index sequences. Prepend 1 to 'pts' to get the start positions and append nrow(df) to 'pts' to get the end positions, then use Map to create each index sequence i:j and extract the corresponding values of the 'l1' column of 'df'.
Map(function(i, j) df$l1[i:j], c(1, pts), c(pts, nrow(df)))
#[[1]]
#[1] 1 2 3 4
#[[2]]
#[1] 4 5 6 7 8 9 10 11
#[[3]]
#[1] 11 12 13 14 15 16 17
#[[4]]
#[1] 17 18 19 20
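The Map call above can be wrapped in a small helper. This is only a sketch: the names split_at_points and split_between_points are hypothetical (not from the answer), and it assumes 'pts' holds sorted row positions. The first helper shares each breakpoint between adjacent groups, as in the question; the second is a variant, using cut() and split(), that puts each breakpoint only in the group it ends.

```r
# Hypothetical helper: split a vector at sorted index breakpoints,
# sharing each breakpoint between adjacent groups (as in the question).
split_at_points <- function(x, pts) {
  stopifnot(!is.unsorted(pts), all(pts >= 1), all(pts <= length(x)))
  starts <- c(1, pts)
  ends <- c(pts, length(x))
  Map(function(i, j) x[i:j], starts, ends)
}

# Hypothetical variant: each breakpoint closes its group and is not repeated.
# cut() uses right-closed intervals by default: (0, 4], (4, 11], ...
split_between_points <- function(x, pts) {
  grp <- cut(seq_along(x), breaks = c(0, pts, length(x)), labels = FALSE)
  split(x, grp)
}

df <- data.frame(l1 = 1:20, l2 = 21:40)
pts <- c(4, 11, 17)
res_shared <- split_at_points(df$l1, pts)
res_disjoint <- split_between_points(df$l1, pts)
```

Which variant fits depends on whether the value at a separation point should appear in both neighbouring groups.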

Related

How to implement extract/separate functions (from dplyr and tidyr) to separate a column into multiple columns based on arbitrary values?

I have a column:
Y = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
I would like to split into multiple columns, based on the positions of the column values. For instance, I would like:
Y1=c(1,2,3,4,5)
Y2=c(6,7,8,9,10)
Y3=c(11,12,13,14,15)
Y4=c(16,17,18,19,20)
Since I am working with a big data time series set, the divisions will be arbitrary depending on the length of one time period.
You can use the base split to split this vector into vectors that are each 5 items long. You could also use a variable to store this interval length.
Using rep with each = 5, and creating a sequence programmatically, gets you a sequence of the numbers 1, 2, ... up to the length divided by 5 (in this case, 4), each 5 times consecutively. Then split returns a list of vectors.
It's worth noting that a variety of SO posts will recommend you store similar data in lists such as this, rather than creating multiple variables, so I'm leaving it in list form here.
Y <- 1:20
breaks <- rep(1:(length(Y) / 5), each = 5)
split(Y, breaks)
#> $`1`
#> [1] 1 2 3 4 5
#>
#> $`2`
#> [1] 6 7 8 9 10
#>
#> $`3`
#> [1] 11 12 13 14 15
#>
#> $`4`
#> [1] 16 17 18 19 20
Created on 2019-02-12 by the reprex package (v0.2.1)
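As a sketch of the "store the interval length in a variable" idea, the grouping index can also be derived from the positions themselves, which still works when the vector length is not a multiple of the interval. The name chunk_size is an assumption for illustration, and Y is deliberately given 23 elements here.

```r
Y <- 1:23          # length deliberately not a multiple of the chunk size
chunk_size <- 5    # hypothetical variable holding the interval length

# positions 1..5 map to group 1, 6..10 to group 2, and so on
groups <- ceiling(seq_along(Y) / chunk_size)
chunks <- split(Y, groups)
lengths(chunks)
```

The last group simply ends up shorter than the others.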
Not a dplyr solution, but I believe the easiest way would involve using matrices.
foo = function(data, sep.in=5) {
# matrix() fills column-wise, so each column holds sep.in consecutive values
data.matrix = matrix(data, nrow=sep.in)
data.df = as.data.frame(data.matrix)
return(data.df)
}
I have not tested it, but this function should create a data.frame that can be merged into an existing one using cbind(). It assumes length(data) is a multiple of sep.in.
We can make use of split (writing the commented code as solution) to split the vector into a list of vectors.
lst <- split(Y, as.integer(gl(length(Y), 5, length(Y))))
lst
#$`1`
#[1] 1 2 3 4 5
#$`2`
#[1] 6 7 8 9 10
#$`3`
#[1] 11 12 13 14 15
#$`4`
#[1] 16 17 18 19 20
Here, gl creates a grouping index from the n, k and length parameters, where n is an integer giving the number of levels, k is an integer giving the number of replications, and length is an integer giving the length of the result.
In our case, we want 'k' to be 5.
as.integer(gl(length(Y), 5, length(Y)))
#[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4 4 4 4 4
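For comparison, here is a small sketch checking that three ways of building this grouping index agree for Y <- 1:20 with blocks of 5. The variable names are illustrative only.

```r
Y <- 1:20
k <- 5  # block length

idx_gl  <- as.integer(gl(length(Y), k, length(Y)))  # form used in this answer
idx_rep <- rep(seq_len(length(Y) / k), each = k)    # rep() with each = k
idx_div <- (seq_along(Y) - 1) %/% k + 1             # integer division on positions

all_agree <- all(idx_gl == idx_rep) && all(idx_gl == idx_div)
```

Any of these can be passed as the second argument of split().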
If we want to have multiple objects in the global environment, use list2env
list2env(setNames(lst, paste0("Y", seq_along(lst))), envir = .GlobalEnv)
Y1
#[1] 1 2 3 4 5
Y2
#[1] 6 7 8 9 10
Y3
#[1] 11 12 13 14 15
Y4
#[1] 16 17 18 19 20
Or as the OP mentioned dplyr/tidyr in the question, we can use those packages as well
library(tidyverse)
tibble(Y) %>%
group_by(grp = (row_number()-1) %/% 5 + 1) %>%
summarise(Y = list(Y)) %>%
pull(Y)
#[[1]]
#[1] 1 2 3 4 5
#[[2]]
#[1] 6 7 8 9 10
#[[3]]
#[1] 11 12 13 14 15
#[[4]]
#[1] 16 17 18 19 20
data
Y <- 1:20

How can I remove shared values from a list of vectors

I have a list :
x <- list("a" = c(1:6,32,24) , "b" = c(1:4,8,10,12,13,17,24),
"F" = c(1:5,9:15,17,18,19,20,32))
x
$a
[1] 1 2 3 4 5 6 32 24
$b
[1] 1 2 3 4 8 10 12 13 17 24
$F
[1] 1 2 3 4 5 9 10 11 12 13 14 15 17 18 19 20 32
Each vector in the list shares a number of elements with the others. How can I remove the shared values to get the following result?
$a
[1] 1 2 3 4 5 6 32 24
$b
[1] 8 10 12 13 17
$F
[1] 9 11 14 15 18 19 20
As you can see, the first vector does not change. The elements shared between the first and second vectors are removed from the second vector, and then the elements that the third vector shares with the first and second vectors are removed from the third. The goal of this task is clustering a data set (the original data set contains 590 objects).
You can use Reduce and setdiff on the list in the reverse order to find all elements of the last vector that do not appear in the others. Bung this into an lapply to run over partial sub-lists to get your desired output:
lapply(seq_along(x), function(y) Reduce(setdiff,rev(x[seq(y)])))
[[1]]
[1] 1 2 3 4 5 6 32 24
[[2]]
[1] 8 10 12 13 17
[[3]]
[1] 9 11 14 15 18 19 20
When scaling up, the number of rev calls may become an issue, so you might want to reverse the list once, outside the lapply as a new variable, and subset that within it.
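A minimal sketch of that suggestion, assuming the same x as in the question: the list is reversed once, and each step takes the tail of the reversed list that corresponds to the first y elements of x. The names x_rev, n and res are introduced here for illustration.

```r
x <- list(a = c(1:6, 32, 24),
          b = c(1:4, 8, 10, 12, 13, 17, 24),
          F = c(1:5, 9:15, 17, 18, 19, 20, 32))

x_rev <- rev(x)   # reverse once, outside the lapply
n <- length(x)

# the last y elements of x_rev are x[[y]], x[[y - 1]], ..., x[[1]]
res <- lapply(seq_len(n), function(y) Reduce(setdiff, x_rev[(n - y + 1):n]))
names(res) <- names(x)
```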
x <- list("a" = c(1:6,32,24) ,
"b" = c(1:4,8,10,12,13,17,24),
"F" = c(1:5,9:15,17,18,19,20,32))
This is inefficient since it re-makes the union of the previous set of lists at each step (rather than keeping a running total), but it was the first way I thought of.
for (i in 2:length(x)) {
## construct union of all previous lists
prev <- Reduce(union,x[1:(i-1)])
## remove shared elements from the current list
x[[i]] <- setdiff(x[[i]],prev)
}
You could probably improve this by initializing prev as numeric(0) and updating it with prev <- c(prev, x[[i-1]]) at each step (although this grows a vector at each step, which is a slow operation). If you don't have a gigantic data set and don't have to do this operation millions of times, it's probably good enough.
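The running-total idea could look roughly like this. It is a sketch rather than a tuned implementation: the variable seen is a hypothetical name, and union() is used to extend it so that it stays free of duplicates.

```r
x <- list(a = c(1:6, 32, 24),
          b = c(1:4, 8, 10, 12, 13, 17, 24),
          F = c(1:5, 9:15, 17, 18, 19, 20, 32))

seen <- x[[1]]                      # everything encountered so far
for (i in seq_along(x)[-1]) {
  current <- x[[i]]
  x[[i]] <- setdiff(current, seen)  # keep only values not seen before
  seen <- union(seen, current)      # extend the running total
}
```

Using seq_along(x)[-1] instead of 2:length(x) also avoids a surprise when the list has only one element.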

R break up data frame into list using vector of number of rows

I have a data.frame that I want to break up into a list of data.frames using a vector that will tell me how many rows should be in each consecutive list element.
Sample Data
vectornom <- c(1,2,4,3)
df <- data.frame(x=1:10,y=11:20)
Desired result
> new_list
[[1]]
x y
1 11
[[2]]
x y
2 12
3 13
[[3]]
x y
4 14
5 15
6 16
7 17
[[4]]
x y
8 18
9 19
10 20
I appreciate your help
You can use the (pretty awesome) split function for this, using vectornom to create the index on which to "split"
split(df, rep(1:length(vectornom), vectornom))
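Here is the same call as a self-contained sketch with the sample data. The sanity check on the row counts and the unname() step (to get [[1]], [[2]], ... rather than $`1`, $`2`, ...) are additions for illustration, not part of the answer.

```r
vectornom <- c(1, 2, 4, 3)
df <- data.frame(x = 1:10, y = 11:20)

# the group sizes should account for every row exactly once
stopifnot(sum(vectornom) == nrow(df))

# group index: 1, 2, 2, 3, 3, 3, 3, 4, 4, 4
grp <- rep(seq_along(vectornom), times = vectornom)
new_list <- unname(split(df, grp))
sapply(new_list, nrow)
```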

How to store the result of a loop over combinatoric pairs of a list?

I have a matrix (but for the purposes of the example I will simplify to a vector).
I want to loop over all pairs of the list. So if the list is length n (or the matrix has n columns), the resulting list has to be (n choose 2) items long.
Suppose n = 6 for the example, but in reality is 36.
Basically, I want a loop like this:
list=1:6
endlist= vector("list", 15) # 15 from 6!/((4!)(2!))
Here is what I want:
Note the below loop does NOT work since there is no i index, and there appears to be no linear combination of j and k that fits the index. Is there a nonlinear one? Or is there a better way to program this?
for(j in 1:5){
for(k in (j+1):6){
endlist[[i]]=list[j]*list[k]
}
}
Giving the output:
endlist=
[[1]]
[1] 2 3 4 5 6
[[2]]
[1] 6 8 10 12
etc.
There's definitely a better way to code that. I'm not sure how this will necessarily apply to your matrix, but for your example:
combn(list, 2, prod)
#[1] 2 3 4 5 6 6 8 10 12 12 15 18 20 24 30
combn() produces the combinations of a vector and can apply a function (here prod) to each combination. If you really want the output as a list, you can do it with split():
split(combn(list, 2, prod), rep(1:(length(list)-1), times = (length(list)-1):1))
# $`1`
# [1] 2 3 4 5 6
#
# $`2`
# [1] 6 8 10 12
#
# $`3`
# [1] 12 15 18
#
# $`4`
# [1] 20 24
#
# $`5`
# [1] 30
I think the takeaway here is that it's better to calculate your combinations, and work on those, rather than create the combinations yourself in some kind of loop.
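For the matrix case mentioned in the question, one possible sketch is to take combinations of the column indices and compute on each pair of columns. The matrix m below is made up for illustration, and the elementwise product stands in for whatever the real per-pair computation is.

```r
m <- matrix(1:12, nrow = 2)   # hypothetical 2 x 6 matrix

# combn() over the column indices; simplify = FALSE returns a list
# with one entry per pair of columns
endlist <- combn(ncol(m), 2,
                 FUN = function(idx) m[, idx[1]] * m[, idx[2]],
                 simplify = FALSE)
length(endlist)
```

The result has choose(ncol(m), 2) entries, so no manual index bookkeeping is needed.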

Making a data frame that is a subset of two data frames

I am stumped again.
I have two data frames
dataframe1
a b c
[1] 21 12 22
[2] 11 9 6
[3] 4 6 7
and
dataframe2
f g h
[1] 21 12 22
[2] 11 9 6
[3] 4 6 7
I want to take the first column of dataframe1 and make three new dataframes with the second column being each of the three f,g and h
Obviously I could just do a subset over and over
subset1 <- cbind(dataframe1[,1], dataframe2[,1])
subset2 <- cbind(dataframe1[,1], dataframe2[,2])
but my data frames will have variable numbers of columns and a very large number of rows, so I am looking for something a little more general. My data frames will always be the same length.
The closest I have come was with apply and cbind, but I got either a set of three rows (a and f, a and g, a and h), each combined into a single numeric vector, or a single data frame with four columns: a, f, g, h.
Help is deeply appreciated.
You can use lapply to iterate over the columns of dataframe2 like so:
lapply(dataframe2, function(x) as.data.frame(cbind(dataframe1[,1], x)))
This will result in a list object where each entry corresponds to a column of dataframe2. For example:
$f
V1 x
1 21 21
2 11 11
3 4 4
$g
V1 x
1 21 12
2 11 9
3 4 6
$h
V1 x
1 21 22
2 11 6
3 4 7
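If the generic V1 and x column names are a nuisance, a variant that iterates over the column names can keep the original names. This is a sketch using the example values from the question; key and pairs are hypothetical names.

```r
dataframe1 <- data.frame(a = c(21, 11, 4), b = c(12, 9, 6), c = c(22, 6, 7))
dataframe2 <- data.frame(f = c(21, 11, 4), g = c(12, 9, 6), h = c(22, 6, 7))

key <- names(dataframe1)[1]   # name of the first column of dataframe1

pairs <- lapply(names(dataframe2), function(nm) {
  out <- data.frame(dataframe1[[key]], dataframe2[[nm]])
  names(out) <- c(key, nm)    # e.g. "a" and "f"
  out
})
names(pairs) <- names(dataframe2)
```

Each element of pairs is then a two-column data frame such as pairs$g with columns a and g.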

Resources