How to create all the combinations in a vector using R

I have a vector of n observations. Now I need to create all the possible combinations with those n elements. For example, my vector is
a<-1:4
In my output, combinations should be like,
1
2
3
4
12
13
14
23
24
34
123
124
134
234
1234
How can I get this output?
Thanks in advance.

Something like this could work:
unlist(sapply(1:4, function(x) apply(combn(1:4, x), 2, paste, collapse = '')))
First we get the combinations using combn and then we paste the outputs together. Finally, unlist gives us a vector with the output we need.
Output:
[1] "1" "2" "3" "4" "12" "13" "14" "23" "24" "34" "123" "124"
"134" "234" "1234"

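The same idea can be wrapped in a small helper so that it works for a vector of any length. This is only a minimal sketch: the name all_combinations is made up for illustration, and it relies on combn() applying its FUN argument to each combination.

```r
# Hypothetical helper: every non-empty combination of the elements of v,
# each pasted into a single string, ordered by combination size.
all_combinations <- function(v) {
  unlist(lapply(seq_along(v), function(k) {
    # combn() calls FUN on each size-k combination; collapse is passed on to paste()
    combn(v, k, FUN = paste, collapse = "")
  }))
}

a <- 1:4
res <- all_combinations(a)
print(res)
```

A vector of n elements yields 2^n - 1 combinations, so the result grows quickly. Note also that combn() treats a single positive integer as 1:n, so a length-one numeric vector would need special handling.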
Related

How to compare vectors with different structures

I have two vectors (fo, fo2) and I would like to compare if the numbers are matching between them (such as with intersect(fo,fo2)).
However, fo and fo2 can't be compared directly. fo is numeric (each element is typed into c() ) while fo2 is read from a string such as "1 3 6 7 8 10 11 13 14 15".
The output of the vectors is shown here for illustration. Any help is greatly appreciated!
# fo is a vector
> fo <- c(1,3,6,7,8,9,10,11)
> fo
[1] 1 3 6 7 8 9 10 11
> is.vector(fo)
[1] TRUE
# fo2 is also a vector
> library(stringr)
> fo2 <- str_split("1 3 6 7 8 10 11 13 14 15", " ")
> fo2
[[1]]
[1] "1" "3" "6" "7" "8" "10" "11" "13" "14" "15"
> is.vector(fo2)
[1] TRUE
> intersect(fo,fo2)
list()
Here fo2 is a list (str_split returns one list element per input string), whereas fo is an atomic vector, so extract the first element of fo2 before intersecting:
intersect(fo , fo2[[1]])
#> [1] "1" "3" "6" "7" "8" "10" "11"
To learn the difference between lists and atomic vectors, see Vectors.
Another option:
fo %in% fo2[[1]]
Output:
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
Check with setdiff:
setdiff(fo, fo2[[1]])
Output:
[1] 9
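If the result should stay numeric rather than character, one option is to convert the split string before comparing. The sketch below uses base strsplit() instead of stringr; the names fo2_num, common and only_in_fo are made up for illustration.

```r
fo <- c(1, 3, 6, 7, 8, 9, 10, 11)

# Split on single spaces, take the first (only) list element, convert to numeric
fo2_num <- as.numeric(strsplit("1 3 6 7 8 10 11 13 14 15", " ", fixed = TRUE)[[1]])

common <- intersect(fo, fo2_num)     # values present in both vectors
only_in_fo <- setdiff(fo, fo2_num)   # values of fo that are missing from fo2_num

print(common)
print(only_in_fo)
```

Because both inputs are numeric here, the results are numeric too, so they can be used directly in later arithmetic.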

Create a new data frame every time it encounters a value

I need to split a data frame based on a condition. For example, I have a data frame my_df with a variable k that has no negative values, and I need to split my_df every time it encounters 0. To make this clearer, below is the code that creates my_df.
my_df <- data.frame("k" = c(0, 0,0, 0.1,1.3,4,5,7,8,11,14,17,10,5,0.4,0,0,0,1.0,2.3,5,7,3,0.1,0))
Upon executing the above code my dataframe is as shown below,
row_number k
1 0
2 0
3 0
4 0.1
5 1.3
6 4
7 5
8 7
9 8
10 11
11 14
12 17
13 10
14 5
15 0.4
16 0
17 0
18 0
19 1.0
20 2.3
21 5
22 7
23 3
24 0.1
25 0
My expected output splits the above data frame wherever a new run of zeros begins.
That is, a new data frame df1 is created containing rows 1 to 15, another data frame df2 containing rows 16 to 24, and another data frame df3 containing row 25, and so on until the end of the data frame.
I found that split() does the job of splitting the data frame but I do not know how to implement my requirement in the function.
From data.table you can use the function rleidv() to create a grouping variable:
library("data.table")
my_df <- data.frame("k" = c(0, 0,0, 0.1,1.3,4,5,7,8,11,14,17,10,5,0.4,0,0,0,1.0,2.3,5,7,3,0.1,0))
split(my_df, (rleidv(my_df$k==0) - 1) %/% 2)
Here is a solution with base R:
r <- rle(my_df$k!=0)
r$values <- gl((length(r$values) + 1) %/% 2, k=2, length=length(r$values))
split(my_df, inverse.rle(r))
We can create a grouping variable with cumsum and diff, then split the 'my_df' based on it to have a list of data.frames
lst <- split(my_df, cumsum(c(TRUE, diff(!my_df$k) ==1)))
lapply(lst, row.names)
#$`1`
#[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15"
#$`2`
#[1] "16" "17" "18" "19" "20" "21" "22" "23" "24"
#$`3`
#[1] "25"
NOTE: No packages are used. Only base R methods are used.
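To see why the cumsum/diff approach works, it can help to look at the intermediate vectors on a smaller input. The data and the names is_zero, run_start, group and pieces below are made up for illustration.

```r
k <- c(0, 0, 0.1, 1.3, 0, 0, 2, 0)   # made-up data with three runs of zeros

is_zero <- k == 0
# diff() is 1 exactly where a non-zero value is followed by a zero,
# i.e. where a new run of zeros starts; the first row always opens a group
run_start <- c(TRUE, diff(is_zero) == 1)
# The running count of run starts gives one group id per row
group <- cumsum(run_start)

pieces <- split(data.frame(k = k), group)
print(group)
print(lapply(pieces, row.names))
```

If the data begins with non-zero values, those rows simply form the first group up to the first run of zeros.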

Converting a vector of strings into a numerical vector, based on string-sequences

I have a vector like
A <- c("A","A","B","B", "C","C","C", "D")
I would like to convert it into a numerical vector based on the sequence in A, which would look like:
c(1:2, 3:4, 5:7, 8)
Is this possible?
Try:
A <- c("A","A","B","B", "C","C","C", "D")
as.numeric(factor(A))
[1] 1 1 2 2 3 3 3 4
and in case you really want a sequence from 1 to the length of the vector:
labels(factor(A))
[1] "1" "2" "3" "4" "5" "6" "7" "8"
or
1:length(A)
[1] 1 2 3 4 5 6 7 8
If the first sequence is what you want, you may find plyr::mapvalues interesting in case you have more complicated cases at some point. For instance,
library(plyr)
mapvalues(A, from=unique(A), to=1:4)
[1] "1" "1" "2" "2" "3" "3" "3" "4"
This comes in handy when you need a bit more control. For instance, you could easily supply other output as the to argument, e.g. month.name[1:4].
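One detail worth keeping in mind: factor() sorts its levels, so as.numeric(factor(A)) numbers the groups in sorted order, not in order of appearance. The two coincide for the vector in the question only because it is already sorted. A small sketch with a made-up, unsorted vector B shows the difference; match() with unique() is one base R way to number groups by first appearance.

```r
B <- c("B", "B", "A", "C", "C")   # made-up input that is not in alphabetical order

# Codes follow the sorted factor levels "A", "B", "C"
by_level <- as.integer(factor(B))

# Codes follow the order in which each value first appears
by_appearance <- match(B, unique(B))

print(by_level)
print(by_appearance)
```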

Automatically changing matrix length and row names

My data lengthens each quarter, and the start date varies between data sets.
I have written code that runs lots of tests and produces forecasts, and it is automatically documented with graphs and tables of the data.
Everything works fine until the length of the data or the start date changes, because the data in the tables is then either not the correct length or doesn't match up to the correct quarter.
Here is an example:
Test.data <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
Test.dates <- c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3")
Test <- matrix(c(Test.data,""),nrow=4,byrow=FALSE)
colnames(Test) <- c("'08","'09","'10","'11","'12","'13","'14")
rownames(Test) <- c("Qtr 1", "Qtr 2", "Qtr 3", "Qtr 4")
Which quite nicely gives:
      '08 '09 '10 '11 '12 '13 '14
Qtr 1 1   5   9   13  17  21  25
Qtr 2 2   6   10  14  18  22  26
Qtr 3 3   7   11  15  19  23  27
Qtr 4 4   8   12  16  20  24
However, in the next quarter the data will grow by one value and produce an error:
Warning message:
In matrix(c(Test.data, ""), nrow = 4, byrow = FALSE) :
data length [29] is not a sub-multiple or multiple of the number of rows [4]
Error in `colnames<-`(`*tmp*`, value = c("'08", "'09", "'10", "'11", "'12", :
length of 'dimnames' [2] not equal to array extent
Or if a data set begins in 08Q2 instead of 08Q1 then the data will all be next to the wrong quarter.
I need to display my data in the specific way of:
'yr1 'yr2 'yr3 ...
Qtr 1
Qtr 2
Qtr 3
Qtr 4
Does anyone have suggestions on how I can get this to adjust automatically to fit my data without having to change anything? Very soon it will be joined to a database that constantly produces results, so it cannot be changed by hand each time the data has a different length.
Thank you for your help.
Please comment below if you want any more information.
You can pad the data with empty strings up to a multiple of 4 before building the matrix:
Test.data.padded <- as.character(Test.data)
length(Test.data.padded) <- ceiling(length(Test.data.padded) / 4) * 4
Test.data.padded[is.na(Test.data.padded)] <- ""
Test <- matrix(Test.data.padded, nrow=4, byrow=FALSE)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] "1" "5" "9" "13" "17" "21" "25"
#[2,] "2" "6" "10" "14" "18" "22" "26"
#[3,] "3" "7" "11" "15" "19" "23" "27"
#[4,] "4" "8" "12" "16" "20" "24" ""
Then use a regex to extract the years from your Test.dates.
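As a rough sketch of that last step, the year and quarter can be pulled out of each label with sub() and used to place every value in its cell. This also copes with a series that starts in Q2 or later. The function name quarter_table is made up, and the code assumes every label has the form of two year digits followed by Q and a quarter number, as in Test.dates.

```r
# Hypothetical helper: arrange values in a 4 x (number of years) character matrix,
# using the labels (e.g. "08Q2") to decide the row and column of each value.
quarter_table <- function(values, dates) {
  year <- sub("Q[1-4]$", "", dates)             # "08Q2" -> "08"
  qtr <- as.integer(sub("^.*Q", "", dates))     # "08Q2" -> 2
  years <- unique(year)
  out <- matrix("", nrow = 4, ncol = length(years),
                dimnames = list(paste("Qtr", 1:4), paste0("'", years)))
  # A two-column index matrix addresses one (row, column) cell per value
  out[cbind(qtr, match(year, years))] <- as.character(values)
  out
}

# Made-up example: a short series that starts in the second quarter of '08
tab <- quarter_table(1:6, c("08Q2", "08Q3", "08Q4", "09Q1", "09Q2", "09Q3"))
print(tab, quote = FALSE)
```

Cells without data stay as empty strings, so neither the length of the data nor the starting quarter needs to be adjusted by hand.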
Not sure if this helps.
library(stringi)
n <- 4
l <- length(Test.data)
m1 <- stri_list2matrix(split(Test.data,as.numeric(gl(l,n,l))), fill='')
nm1 <- do.call(rbind,strsplit(Test.dates, '(?<=[0-9])(?=[Q])', perl=TRUE))
dimnames(m1) <- list(unique(nm1[,2]), unique(nm1[,1]))
m1
# 08 09 10 11 12 13 14
#Q1 "1" "5" "9" "13" "17" "21" "25"
#Q2 "2" "6" "10" "14" "18" "22" "26"
#Q3 "3" "7" "11" "15" "19" "23" "27"
#Q4 "4" "8" "12" "16" "20" "24" ""

Extracting nth element from a nested list following strsplit - R

I've been trying to understand how to deal with the output of strsplit a bit better. I often have data such as this that I wish to split:
mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")
#[1] "144/4/5" "154/2" "146/3/5" "142" "143/4" "DNB" "90"
After splitting that the results are as follows:
strsplit(mydata, "/")
#[[1]]
#[1] "144" "4" "5"
#[[2]]
#[1] "154" "2"
#[[3]]
#[1] "146" "3" "5"
#[[4]]
#[1] "142"
#[[5]]
#[1] "143" "4"
#[[6]]
#[1] "DNB"
#[[7]]
#[1] "90"
I know from the strsplit help page that final empty strings are not produced. Therefore, there will be 1, 2 or 3 elements in each of my results, depending on the number of "/" to split by.
Getting the first element is very trivial:
sapply(strsplit(mydata, "/"), "[[", 1)
#[1] "144" "154" "146" "142" "143" "DNB" "90"
But I am not sure how to get the 2nd, 3rd... when there is an unequal number of elements in each result.
sapply(strsplit(mydata, "/"), "[[", 2)
# Error in FUN(X[[4L]], ...) : subscript out of bounds
I would hope to return from a working solution, the following:
#[1] "4" "2" "3" "NA" "4" "NA" "NA"
This is a relatively small example. I could do some for loop very easily on these data, but for real data with 1000s of observations to run the strsplit on and dozens of elements produced from that, I was hoping to find a more generalizable solution.
At least for 1D vectors, [ seems to return NA when "i > length(x)", whereas [[ returns an error.
x = runif(5)
x[6]
#[1] NA
x[[6]]
#Error in x[[6]] : subscript out of bounds
Digging a bit, do_subset_dflt (i.e. [) calls ExtractSubset where we notice that when a wanted index ("ii") is "> length(x)" NA is returned (a bit modified to be clean):
if(0 <= ii && ii < nx && ii != NA_INTEGER)
result[i] = x[ii];
else
result[i] = NA_INTEGER;
On the other hand do_subset2_dflt (i.e. [[) returns an error if the wanted index ("offset") is "> length(x)" (modified a bit to be clean):
if(offset < 0 || offset >= xlength(x)) {
if(offset < 0 && (isNewList(x)) ...
else errorcall(call, R_MSG_subs_o_b);
}
where #define R_MSG_subs_o_b _("subscript out of bounds")
(I'm not sure about the above code snippets but they do seem relevant based on their returns)
Try this:
> read.table(text = mydata, sep = "/", as.is = TRUE, fill = TRUE)
V1 V2 V3
1 144 4 5
2 154 2 NA
3 146 3 5
4 142 NA NA
5 143 4 NA
6 DNB NA NA
7 90 NA NA
If you want to treat DNB as an NA then add the argument na.strings="DNB".
If you really want to use strsplit then try this:
> do.call(rbind, lapply(strsplit(mydata, "/"), function(x) head(c(x,NA,NA), 3)))
[,1] [,2] [,3]
[1,] "144" "4" "5"
[2,] "154" "2" NA
[3,] "146" "3" "5"
[4,] "142" NA NA
[5,] "143" "4" NA
[6,] "DNB" NA NA
[7,] "90" NA NA
Note: Using alexis_laz's observation that x[i] returns NA if i is not in 1:length(x) the last line of code above could be simplified to:
t(sapply(strsplit(mydata, "/"), "[", 1:3))
You could use regex (if it is allowed)
library(stringr)
# perl() is deprecated in current stringr; a plain pattern string supports the lookbehind
str_extract(mydata, "(?<=\\d/)\\d+")
#[1] "4" "2" "3" NA "4" NA NA
str_extract(mydata, "(?<=/\\d/)\\d+")
#[1] "5" NA "5" NA NA NA NA
You can assign the length inside sapply, resulting in NA where the current length is shorter than the assigned length.
s <- strsplit(mydata, "/")
sapply(s, function(x) { length(x) <- 3; x[2] })
# [1] "4" "2" "3" NA "4" NA NA
Then you can add a second indexing argument with mapply
m <- max(sapply(s, length))
mapply(function(x, y, z) { length(x) <- z; x[y] }, s, 2, m)
# [1] "4" "2" "3" NA "4" NA NA
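The answers above share one idea: indexing with [ past the end of a vector yields NA instead of an error. A minimal sketch that packages this into a reusable function follows; the name nth_piece and its arguments are made up for illustration.

```r
# Hypothetical helper: the n-th piece of each string after splitting,
# or NA where a string has fewer than n pieces.
nth_piece <- function(x, n, split = "/") {
  parts <- strsplit(x, split, fixed = TRUE)
  # p[n] is NA when n > length(p); vapply guarantees a character vector back
  vapply(parts, function(p) p[n], character(1))
}

mydata <- c("144/4/5", "154/2", "146/3/5", "142", "143/4", "DNB", "90")
second <- nth_piece(mydata, 2)
third <- nth_piece(mydata, 3)
print(second)
print(third)
```

Using vapply() rather than sapply() makes the return type explicit, which is helpful when the function is applied to thousands of observations.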
