Split a vector into multiple vectors in R

I want to split one vector (x) into multiple vectors (x1, x2, ..., xn).
My input: x <- 1:10
My desired output:
x1 <- c(1,2,3,4)
x2 <- c(2,3,4,5)
x3 <- c(3,4,5,6)
x4 <- c(4,5,6,7)
x5 <- c(5,6,7,8)
x6 <- c(6,7,8,9)
x7 <- c(7,8,9,10)
My code (thanks to Mrs. Richard Herron for inspiration):
x <- 1:10
n <- 3
vectors <- function(x, n) split(x, sort(rank(x) %% n))
vectors(x,n)
Thanks very much!

We can use lapply to loop over the start positions in 'x'. For each index i we build the sequence i:(i + n) and use it to subset 'x', so every list element has length n + 1 (4 here). If individual vectors are needed, we can name the list elements and call list2env to create them in the global environment.
n <- 3
lst <- lapply(1:(length(x)-n), function(i) x[i:(i+n)])
names(lst) <- paste0("x", seq_along(lst))
list2env(lst, envir = .GlobalEnv)
x1
#[1] 1 2 3 4
x2
#[1] 2 3 4 5
x3
#[1] 3 4 5 6
Or we can create a matrix instead of multiple vectors in the global environment, where each row corresponds to one of the vectors of interest:
matrix(x[1:4] + rep(0:6, each = 4), ncol=4, byrow = TRUE)
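The windowing idea above can be wrapped in a small function. This is only a sketch: sliding_windows is a hypothetical helper name, and it takes the window width directly (4 here) rather than the offset n used in the answer. The last lines show that embed() from the stats package produces the same windows as matrix rows, once its reversed column order is flipped back.

```r
# Hypothetical helper: all contiguous windows of `width` elements from `x`.
sliding_windows <- function(x, width) {
  stopifnot(width >= 1, width <= length(x))
  starts <- seq_len(length(x) - width + 1)
  lapply(starts, function(i) x[i:(i + width - 1)])
}

x <- 1:10
windows <- sliding_windows(x, 4)
length(windows)   # number of windows
windows[[7]]      # the last window

# embed() returns each window as a row, but with the columns reversed,
# so reorder the columns to get the windows in their original order.
m <- embed(x, 4)[, 4:1]
m
```

Keeping the windows in a list or a matrix avoids creating many numbered variables in the global environment.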

Related

Sum of all vectors of variables with common prefix

Is it possible to sum all vector variables with a common prefix?
Example:
x1 <- c(1,2,3)
x2 <- c(4,5,6)
.
.
.
xn <- c(n,n,n)
y = x1 + x2 + ... xn
The number of variables xn (i.e. with prefix x) is only known at runtime.
Assuming y should have the same length as each x vector, you could collect all the variables into a list and sum them element-wise.
> x2 <- c(4,5,6)
> x1 <- c(1,2,3)
> ls(pattern = "^x\\d+$") # this is regex for finding "x" and "digits",
# ^ is start of string, $ is end of string
[1] "x1" "x2"
> sapply(ls(pattern = "^x\\d+$"), get, simplify = FALSE)
$x1
[1] 1 2 3
$x2
[1] 4 5 6
> out <- sapply(ls(pattern = "^x\\d+$"), get, simplify = FALSE)
> Reduce("+", out)
[1] 5 7 9
You can also use mget, as suggested by @LyzandeR, especially if you fancy one-liners.
Reduce("+", mget(ls(pattern = "^x\\d+$")))
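The one-liner can be wrapped in a function that looks in a given environment, which makes the runtime lookup easier to test. This is a minimal sketch: sum_by_prefix is a hypothetical name, and it assumes the prefix contains no regex metacharacters and that all matching vectors have the same length.

```r
# Hypothetical helper: element-wise sum of all variables named <prefix><digits>
# found in `envir`.
sum_by_prefix <- function(prefix, envir = parent.frame()) {
  nms <- ls(envir = envir, pattern = paste0("^", prefix, "[0-9]+$"))
  Reduce(`+`, mget(nms, envir = envir))
}

# Use a separate environment so the example does not depend on the workspace.
e <- new.env()
e$x1 <- c(1, 2, 3)
e$x2 <- c(4, 5, 6)
e$z  <- c(7, 8, 9)   # ignored: does not start with the prefix
e$xs <- "text"       # ignored: prefix is not followed by digits only

sum_by_prefix("x", envir = e)
```

Anchoring the pattern with ^ and $ matters here: an unanchored pattern would also pick up names such as xs.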
You can check an example:
xx <- 1
xx2 <- 2
xx3 <- 3
#get the names of the variables containing xx
vars <- ls(pattern = 'xx')
#mget will get the variables from the names, unlist will combine them into one atomic vector
sum(unlist(mget(vars)))
#[1] 6
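Note that sum(unlist(...)) collapses everything into a single total, which only matches an element-wise sum when each variable is a scalar. A short sketch of the difference, using vectors like those in the question (the environment e is just a container for the example):

```r
e <- new.env()
e$x1 <- c(1, 2, 3)
e$x2 <- c(4, 5, 6)
vals <- mget(c("x1", "x2"), envir = e)

# One grand total over all elements of all vectors
total <- sum(unlist(vals))

# Element-wise sum, as in y = x1 + x2
elementwise <- Reduce(`+`, vals)

total
elementwise
```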
A very naive solution could be:
# the first two vectors are the ones of interest
x1 <- c(1,2,3)
x2 <- c(4,5,6)
# z must not be included in the sum
z <- c(7,8,9)
# create an answer vector initialized with all 0s; its length is that of a single vector being added
answer <- rep(0, length(x1))
# loop through each variable in the current environment
for (var in ls()) {
  # check whether the variable name begins with x
  if (startsWith(var, 'x')) {
    # add it to our answer
    answer <- answer + get(var)
  }
}
# print the answer
print(answer)

Data cleaning: Function to find very similar variables

I have some large data, which partly consists of very similar variables. Some variables have missing values (e.g. x3 and x5 in the example below) and some variables are similar, but with different labels (e.g. x2 and x5). In order to clean my data, I want to identify and eventually delete these similar variables. I am trying to write a function, which returns the column names of all similar variable pairs. Here is some exemplifying data:
# Example data
set.seed(222)
N <- 100
x1 <- round(rnorm(N, 0, 10))
x2 <- round(rnorm(N, 10, 20))
x3 <- x1
x3[sample(1:N, 7)] <- NA
x4 <- x1
x4[sample(1:N, 5)] <- round(rnorm(5, 0, 10))
x5 <- x2
x5 <- paste("A", x5, sep = "")
x5[sample(1:N, 15)] <- NA
df <- data.frame(x1, x2, x3, x4, x5)
df$x1 <- as.character(df$x1)
df$x2 <- as.character(df$x2)
df$x3 <- as.character(df$x3)
df$x4 <- as.character(df$x4)
df$x5 <- as.character(df$x5)
head(df)
As you can see, x1, x3, and x4 are very similar; and x2 and x5 are very similar as well. My function should print a list, which includes all pairs with the same values in 80% or more of the cases. Here is what I got so far:
# My attempt to write such a function
fun_clean <- function(data, similarity) {
  output <- list()
  data <- data[complete.cases(data), ]
  for(i in 1:ncol(data)) {
    if(i < ncol(data)) {
      for(j in (i + 1):ncol(data)) {
        similarity_ij <- sum(data[ , i] == data[ , j]) / nrow(data)
        if(similarity_ij >= similarity) {
          output[[length(output) + 1]] <- colnames(data)[c(i, j)]
        }
      }
    }
  }
  output
}
fun_clean(data = df, similarity = 0.8)
I managed to identify the similarity of x1, x3, and x4. The similarity of x2 and x5 (i.e. similar variables with different labels) is not found by my function. Further, my function is very slow. Therefore, I have the following question:
Question: How could I identify all similar variables in a computationally efficient way?
In order to compare your columns, you need numeric values first. You can strip the non-numeric characters with gsub() and then convert the result to numeric. After this transformation, you'll be good to go:
# "[^0-9-]" keeps digits and the minus sign, so negative values stay negative
df <- apply(df, 2, function(x) as.numeric(gsub("[^0-9-]", "", x)))
Now you can compare all columns by first using combn(ncol(df), 2) to get all pairs of columns you want to compare. Then, for each pair, calculate the share of rows in which the two columns are equal.
combs <- combn(ncol(df), 2)
res <- apply(combs, 2, function(x){
  sum(df[, x[1]] == df[, x[2]], na.rm = TRUE)/nrow(df)
})
thresh <- 0.8
combs[, res > thresh]
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    2    3
# [2,]    3    4    5    4
So columns 1 & 3, 1 & 4, 2 & 5 and 3 & 4 are equal to each other in more than 80% of the cases.
Note: If one or both of the compared values have an NA, this will be considered as not a match!
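The steps above can be combined into one function that reports column names instead of indices. This is a sketch under stated assumptions: similar_pairs and the toy data are hypothetical, the pattern keeps digits, the minus sign and the decimal point, and "similar" means equal in at least the given share of all rows (rows with NA count as non-matching).

```r
# Hypothetical helper: pairs of columns that agree in at least `threshold`
# of the rows after stripping non-numeric characters.
similar_pairs <- function(data, threshold = 0.8) {
  num <- sapply(data, function(col) as.numeric(gsub("[^0-9.-]", "", col)))
  pairs <- combn(ncol(num), 2)
  share <- apply(pairs, 2, function(p) {
    sum(num[, p[1]] == num[, p[2]], na.rm = TRUE) / nrow(num)
  })
  keep <- share >= threshold
  data.frame(var1 = colnames(num)[pairs[1, keep]],
             var2 = colnames(num)[pairs[2, keep]],
             share = share[keep])
}

# Toy data: b is a relabelled copy of a, c is unrelated.
toy <- data.frame(a = c("1", "2", "3", "4", "-5"),
                  b = c("A1", "A2", "A3", "A4", "A-5"),
                  c = c("9", "8", "7", "6", "5"),
                  stringsAsFactors = FALSE)
res <- similar_pairs(toy)
res
```

Because the minus sign is kept, the last row of a and c (-5 versus 5) is correctly treated as different.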
The caret package has functionality to detect correlated variables and variables that are linear combinations of others:
http://topepo.github.io/caret/pre-processing.html

Conditional expression for a specific column in a list of data frames in R

Sorry if the title is confusing.
I have a list of data frames combined into temp.list. I want to raise each row of a specific column to a power given by the corresponding value in vec. For example, vec has the values 2, 0, and 3. I want to do: X2^2, log(X2), X2^3. So log(X2) should be used when the value in vec is 0. The last three lines of code are where I have an issue.
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M2 <- data.frame(matrix(1:9, nrow = 3, ncol = 3))
M3 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
mlist <- list(M1, M2, M3)
temp.list <- mlist
vec <- c(2,0,3)
The code below works! But I don't want to raise X2^0.
for(i in 1:length(vec)){
temp.list[[i]]$X2 <- temp.list[[i]]$X2^vec[[i]]
}
The code below replaces all rows of X2 by the first value calculated in X2.
for(i in 1:length(vec)){
  temp.list[[i]]$X2 <- ifelse(vec[[i]]==0, log(temp.list[[i]]$X2), temp.list[[i]]$X2^vec[[i]])
}
Any other ways of doing this would also be much appreciated.
You could use this:
for (i in seq_along(vec)) {
  temp.list[[i]]$X2 <- if (vec[[i]] == 0) {
    log(temp.list[[i]]$X2)
  } else {
    temp.list[[i]]$X2^vec[[i]]
  }
}
temp.list
# [[1]]
#   X1 X2
# 1  1  9
# 2  2 16
#
# [[2]]
#   X1       X2 X3
# 1  1 1.386294  7
# 2  2 1.609438  8
# 3  3 1.791759  9
#
# [[3]]
#   X1 X2
# 1  1 27
# 2  2 64
The problem is with the ifelse(...) call, which returns a vector of the same length as its condition (i.e., length 1 in your case), so only the first computed value is kept and then recycled over the whole column. The if (...) ... else ... statement instead evaluates the condition once and returns the full result of whichever expression applies.
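The difference is easy to see on a small vector. The sketch below also shows a loop-free variant with Map(), since other approaches were asked for; transform_x2 is a hypothetical helper name, and the data frames are rebuilt so the example is self-contained.

```r
x2 <- c(4, 5, 6)
p <- 0

# ifelse() is vectorized over its test: a length-1 test gives a length-1 result
a <- ifelse(p == 0, log(x2), x2^p)
length(a)

# if/else selects a whole expression, so the full vector is returned
b <- if (p == 0) log(x2) else x2^p
length(b)

# Loop-free variant: pair each data frame with its power
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M2 <- data.frame(matrix(1:9, nrow = 3, ncol = 3))
M3 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))

transform_x2 <- function(d, p) {
  d$X2 <- if (p == 0) log(d$X2) else d$X2^p
  d
}
result <- Map(transform_x2, list(M1, M2, M3), c(2, 0, 3))
result
```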

number elements in a vector with constraints

Given x and y I wish to create the desired.result below:
x <- 1:10
y <- c(2:4,6:7,8:9)
desired.result <- c(1,2,2,2,3,4,4,5,5,6)
where, in effect, each sequence in y is replaced in x by the first element of that sequence, and then the elements of the new x are numbered.
The intermediate step for x would be:
x.intermediate <- c(1,2,2,2,5,6,6,8,8,10)
Below is code that does this. However, the code is not general and is overly complex:
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
unique.x <- 1:(length(x[-unlist(y)]) + length(y))
y1 <- rep(min(unlist(y[1])), length(unlist(y[1])))
y2 <- rep(min(unlist(y[2])), length(unlist(y[2])))
y3 <- rep(min(unlist(y[3])), length(unlist(y[3])))
new.x <- x
new.x[unlist(y[1])] <- y1
new.x[unlist(y[2])] <- y2
new.x[unlist(y[3])] <- y3
rep(unique.x, rle(new.x)$lengths)
[1] 1 2 2 2 3 4 4 5 5 6
Below is my attempt to generalize the code. However, I am stuck on the second lapply.
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
unique.x <- 1:(length(x[-unlist(y)]) + length(y))
y2 <- lapply(y, function(i) rep(min(i), length(i)))
new.x <- x
lapply(y2, function(i) new.x[i[1]:(i[1]-1+length(i))] = i)
rep(unique.x, rle(new.x)$lengths)
Thank you for any advice. I suspect there is a much simpler solution I am overlooking. I prefer a solution in base R.
A solution like this should work:
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
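# replace every group in y by its first element, repeated once per group member
x[unlist(y)] <- rep(sapply(y, '[', 1), lengths(y))
# number the runs of equal values
runs <- rle(x)$lengths
rep(seq_along(runs), runs)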
## [1] 1 2 2 2 3 4 4 5 5 6
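An alternative that skips the intermediate vector is to mark which elements start a new number and take a cumulative sum. This is a sketch that assumes, as in the question, that the groups in y hold values of x and do not overlap; the variable names continues and starts_new are made up for the example.

```r
x <- 1:10
y <- list(2:4, 6:7, 8:9)

# Every element of a group except its first one continues the previous number
continues <- unlist(lapply(y, function(g) g[-1]))

# All other elements start a new number
starts_new <- !(x %in% continues)

numbered <- cumsum(starts_new)
numbered
```

Unlike the rle() approach, this does not merge two adjacent groups if they happen to be mapped to equal values.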

How to combine a list of unequal lm object length into a data frame?

I would like to extract the coefficients and standard errors of each lm object and combine them into a data.frame, with NA filled in for the missing predictors.
set.seed(12345)
x<-matrix(rnorm(1000),nrow=100,ncol=10)
colnames(x)<-paste("x",1:10,sep="")
df<-data.frame(y=rnorm(100),x)
m1<-vector('list', 10)
for ( i in 2:11){
eqn <- as.formula(paste("y ~", paste(colnames(df)[2:i], collapse='+')))
m1[[i-1]] <- lm(eqn, df)
}
Any suggestions would be much appreciated!
This should do the trick:
cList <- lapply(m1, coef)
nms <- names(cList[[length(cList)]]) # coefficient names of the largest model
cMat <- do.call(rbind, lapply(cList, function(X) X[nms]))
cDF <- as.data.frame(cMat); names(cDF) <- nms # Pretty up the results
cDF[1:5, 1:6]
# (Intercept) x1 x2 x3 x4 x5
# 1 -0.2345084 0.2027485 NA NA NA NA
# 2 -0.2334043 0.2074812 -0.05006297 NA NA NA
# 3 -0.2299977 0.2099620 -0.03892985 0.09777829 NA NA
# 4 -0.2095798 0.2221179 -0.02710201 0.06403695 -0.1184191 NA
# 5 -0.2060406 0.2180674 -0.01062671 0.06632922 -0.1045128 0.130937
Edit:
To collect the standard errors into a similar structure, just do something like this:
seList <- lapply(m1, function(X) coef(summary(X))[,2])
seMat <- do.call(rbind, lapply(seList, function(X) X[nms]))
seDF <- as.data.frame(seMat); names(seDF) <- nms
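Since estimates and standard errors are collected the same way, the pattern can be written once. The sketch below is an illustration, not the answer's own code: collect_stat is a hypothetical helper, the three small models are made up for the example, and the column names "Estimate" and "Std. Error" are those of the coefficient matrix returned by coef(summary(lm_object)).

```r
set.seed(1)
d <- data.frame(y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
models <- list(lm(y ~ x1, d), lm(y ~ x1 + x2, d), lm(y ~ x1 + x2 + x3, d))

# Hypothetical helper: one row per model, one column per term,
# NA where a model does not contain the term.
collect_stat <- function(models, column) {
  stats <- lapply(models, function(m) {
    cs <- coef(summary(m))
    setNames(cs[, column], rownames(cs))
  })
  all_terms <- unique(unlist(lapply(stats, names)))
  mat <- do.call(rbind, lapply(stats, function(s) s[all_terms]))
  colnames(mat) <- all_terms
  as.data.frame(mat)
}

est <- collect_stat(models, "Estimate")
se  <- collect_stat(models, "Std. Error")
se
```

Taking the union of term names, rather than the names of one particular model, means the models do not have to be nested or sorted by size.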
Here is an approach using merge and Reduce:
m2 <- lapply(m1, function(x) as.data.frame(coef(summary(x))) )
tmpfun <- function(x,y) {
  n <- as.character(nrow(y)-1)
  xn <- if( 'Row.names' %in% colnames(x) ) 1 else 0
  merge(x,y,by.x=xn, by.y=0, suffixes=c('',n), all=TRUE)
}
out <- Reduce(tmpfun, m2)
You may want to reorder the columns, or drop some of the columns in m2, or transpose the result.
