Making a column to help aggregation in an R data frame

I need to construct a new column for an R data frame that would help with aggregation.
First, I have some vectors:
vector1 <- c("ITEM11","ITEM12","ITEM13")
vector2 <- c("ITEM21","ITEM22","ITEM32")
and a data frame DF which has a column VAR containing the items in these vectors. Now I want to make a new column AGGVAR:
DF$AGGVAR[DF$VAR %in% vector1] <- "vector1"
This is manageable with a small number of vectors, but I want something neater for more vectors. I made a list
vectorList <- ls(pattern = "^vector")
and my obviously naive attempt was
for(i in regList){DF$AGGVAR[DF$VAR %in i] <- i}
What is still needed to make this work?
EDIT: My problem was actually a bit hairier than I first presented. The vectors don't actually have neat numerical suffixes, e.g.:
vectorGHI <- c("ITEM11","ITEM12","ITEM13")
vectorJKL <- c("ITEM21","ITEM22","ITEM32")

Something like this should do the trick:
vector1 <- c("ITEM11","ITEM12","ITEM13")
vector2 <- c("ITEM21","ITEM22","ITEM32")
d <- data.frame(var=c(vector1, vector2))
L <- mget(ls(pattern='^vector'))
d$aggvar <- paste0('vector', sapply(d$var, grep, L))
d
# var aggvar
# 1 ITEM11 vector1
# 2 ITEM12 vector1
# 3 ITEM13 vector1
# 4 ITEM21 vector2
# 5 ITEM22 vector2
# 6 ITEM32 vector2
An alternative, which might have better performance:
lookup <- cbind(unlist(L),
                rep(names(L), lengths(L)))
d$aggvar <- lookup[match(d$var, lookup[, 1]), 2]

Slightly modified answer based on jbaums' suggestion to make this complete:
namesVectors <- ls(pattern = "^vector")
vectorList <- mget(namesVectors)
# Getting rid of auxiliary prefix
namesVectors <- substring(namesVectors, 7)
DF$AGGVAR <- sapply(DF$VAR, grep, vectorList)
# seq_along() visits every index; length() alone would only visit the last one
for(i in seq_along(namesVectors)) {DF$AGGVAR[DF$AGGVAR == i] <- namesVectors[i]}
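For the edited case, where the vector names have no numeric suffix, one possible sketch is to build a lookup table with stack() and match against it. The contents of DF and the "OTHER" item below are made up for illustration; items that appear in none of the vectors come out as NA.

```r
vectorGHI <- c("ITEM11", "ITEM12", "ITEM13")
vectorJKL <- c("ITEM21", "ITEM22", "ITEM32")

# Hypothetical data frame, only for illustration
DF <- data.frame(VAR = c("ITEM12", "ITEM32", "ITEM11", "OTHER"),
                 stringsAsFactors = FALSE)

# Named list of all objects whose names start with "vector"
vectorList <- mget(ls(pattern = "^vector"))

# Long-format lookup: column "values" holds the items, "ind" the vector name
lookup <- stack(vectorList)

# Look up each item's vector name, then drop the auxiliary prefix
DF$AGGVAR <- as.character(lookup$ind)[match(DF$VAR, lookup$values)]
DF$AGGVAR <- sub("^vector", "", DF$AGGVAR)
DF
```

This assumes each item belongs to at most one vector; match() returns only the first hit.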

Related

How to find out the best combination of a given vector whose sum is closest to a given number

My question is quite similar to this one: Find a subset from a set of integer whose sum is closest to a value
That question discusses the algorithm only, but I want to solve it with R. I'm quite new to R and tried to work out a solution, but I wonder whether there is a more efficient way.
Here is my example:
# Define a vector; the goal is to find the subset whose sum is closest to the reference number 20.
A <- c(2,5,6,3,7)
# display all the possible combinations
y1 <- combn(A,1)
y2 <- combn(A,2)
y3 <- combn(A,3)
y4 <- combn(A,4)
y5 <- combn(A,5)
Y <- list(y1,y2,y3,y4,y5)
# calculate the distance to the reference number of each combination
s1 <- abs(apply(y1,2,sum)-20)
s2 <- abs(apply(y2,2,sum)-20)
s3 <- abs(apply(y3,2,sum)-20)
s4 <- abs(apply(y4,2,sum)-20)
s5 <- abs(apply(y5,2,sum)-20)
S <- list(s1,s2,s3,s4,s5)
# find the minimum difference
M <- sapply(S,FUN=function(x) list(which.min(x),min(x)))
Mm <- which.min(as.numeric(M[2,]))
# return the right combination
data.frame(Y[Mm])[as.numeric(M[,Mm[1]])]
so the answer is 2,5,6,7.
How can I refine this program? In particular, is there a way to do the five combn() calls and five apply() calls in one go? Ideally, when A has more items in it, I could use length(A) to cover them all.
Here is another way to do it,
l1 <- sapply(seq_along(A), function(i) combn(A, i))
l2 <- sapply(l1, function(i) abs(colSums(i) - 20))
Filter(length, Map(function(x, y)x[,y], l1, sapply(l2, function(i) i == Reduce(min, l2))))
#[[1]]
#[1] 2 5 6 7
The last line uses Map to index l1 based on a logical list created by finding the minimum value from list l2.
The combiter library has an isubsetv iterator, which goes through all subsets of a vector. Combining it with foreach simplifies the code.
library(combiter)
library(foreach)
A <- c(2,5,6,3,7)
res <- foreach(x = isubsetv(A), .combine = c) %do% sum(x)
absdif <- abs(res-20)
ind <- which(absdif==min(absdif))
as.list(isubsetv(A))[ind]
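If you would rather stay in base R, the five combn() calls and five apply() calls from the question can be folded into one lapply() over seq_along(A). This is only a sketch; closest_subset is a made-up helper name, and like the other approaches it enumerates all 2^n - 1 subsets, so it is only practical for short vectors.

```r
# Hypothetical helper: all non-empty subsets of A whose sum is closest to target
closest_subset <- function(A, target) {
  # One list element per subset, for every subset size from 1 to length(A)
  subsets <- unlist(
    lapply(seq_along(A), function(k) combn(A, k, simplify = FALSE)),
    recursive = FALSE
  )
  # Distance of each subset sum to the target
  dist <- abs(vapply(subsets, sum, numeric(1)) - target)
  # Keep every subset that attains the minimum distance (ties included)
  subsets[dist == min(dist)]
}

A <- c(2, 5, 6, 3, 7)
res <- closest_subset(A, 20)
res
```

One caveat: combn() treats a single positive integer as seq_len() of that number, so this sketch assumes A has at least two elements.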

making an integer vector from index of another vector in R

I have just started to learn R, and trying to do the following task.
I have a vector of 10 random values, some of which are NA and some numeric, like
a <- rnorm(100)
b <- rep(NA, 100)
c <- sample(c(a, b), 10)
Now I want to make another vector "d" containing the indices of all the NA values in "c", for example
d <- c(2, 7, 9)
I tried
d <- which(c %in% is.na(c))
but it's not giving me the desired result.
Also, what is wrong with this code I tried for the same purpose?
navects <- function(x) {
  for(i in 1:length(x)) {
    if(is.na(x[i])) c(i)
  }
}
You can try which:
which(is.na(c))
NOTE: c is also a function, so it is better not to name objects with c.
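As for what is wrong with the navects loop in the question: c(i) builds a one-element vector on each pass but never stores it, and a for loop returns NULL, so the function returns nothing useful. A minimal sketch of a working loop version, assuming you want the same result as which(is.na(x)) (the example vector x is made up):

```r
navects <- function(x) {
  idx <- integer(0)                    # accumulator for the NA positions
  for (i in seq_along(x)) {
    if (is.na(x[i])) idx <- c(idx, i)  # store the index instead of discarding it
  }
  idx                                  # return the collected indices
}

x <- c(0.3, NA, 1.2, NA)
navects(x)
which(is.na(x))   # the vectorised equivalent
```

In practice which(is.na(x)) is shorter and faster; the loop is only here to show what the original was missing.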

Using R to loop through vector and copy some sequences to data.frame

I want to search through a vector for the sequence of strings "hello" "world". When I find this sequence, I want to copy it, including the 10 elements before and after, as a row in a data.frame to which I'll apply further analysis.
My problem: I get an error "new column would leave holes after existing columns". I'm new to coding, so I'm not sure how to manipulate data.frames. Maybe I need to create rows in the loop?
This is what I have:
df = data.frame()
i <- 1
for(n in 1:length(v))
{
  if(v[n] == 'hello' & v[n+1] == 'world')
  {
    df[i,n-11:n+11] <- v[n-10:n+11]
    i <- i+1
  }
}
Thanks!
Maybe this helps:
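# positions where 'hello' is immediately followed by 'world'
indx <- which(v1[-length(v1)]=='hello' & v1[-1]=='world')
lst <- Map(function(x,y) {
  s1 <- seq(x,y)
  # keep only the indices that fall inside v1 (including the last element)
  v1[s1[s1>0 & s1 <= length(v1)]]
}, indx-10, indx+11)
len <- max(sapply(lst, length))
# pad shorter windows with NA so that every row has the same length
d1 <- as.data.frame(do.call(rbind,lapply(lst, `length<-`, len)))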
Data:
set.seed(496)
v1 <- sample(c(letters[1:3], 'hello', 'world'), 100, replace=TRUE)
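A note on the indexing in the question's loop: in R the : operator binds more tightly than + and -, so n-10:n+11 is read as n - (10:n) + 11, not as the window from n-10 to n+11. A small illustration with an arbitrary n:

```r
n <- 15

wrong <- n-10:n+11          # parsed as n - (10:n) + 11
wrong

right <- (n - 10):(n + 11)  # the intended window around position n
right
```

With the parentheses in place, the window has 22 elements, which is what the rbind approach above collects per match.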

How to modify some but not all variables of a data frame?

Suppose there is a data.frame where some variables are coded as integers:
a <- c(1,2,3,4,5)
b <- as.integer(c(2,3,4,5,6))
c <- as.integer(c(5,1,0,9,2))
d <- as.integer(c(5,6,7,3,1))
e <- c(2,6,1,2,3)
df <- data.frame(a,b,c,d,e)
str(df)
Suppose I want to convert columns b to d to numeric:
varlist <- names(df)[2:4]
lapply(varlist, function(x) {
  df$x <- as.numeric(x, data=x)
})
str(df)
does not work.
I tried:
df$b <- as.numeric(b, data=df)
df$c <- as.numeric(c, data=df)
df$d <- as.numeric(d, data=df)
str(df)
which works fine.
Questions:
How do I do this (in a loop or, better, with lapply; I'm a Stata person and as such used to writing loops)?
And more generally: how do I apply any function to a list of variables in a data.frame
(e.g. multiply each variable on the list by some other variable, which always stays the same;
BONUS: or changes with each variable on the list)?
For the first question you can use sapply:
df[2:4] <- sapply(df[2:4],as.numeric)
For the second, you can use mapply. For example, to multiply the three variables (columns 2 to 4) by three different random scalars:
df[2:4] <- mapply(function(x,y)df[[x]]*y,2:4,rnorm(3))
df[,2:4] <- sapply(df[,2:4], as.numeric)
As for your second question, if you want to, say, multiply column c by 5:
df$c <- df$c * 5
Or by any vector the same length as c, for example a new column multiplying c by d:
df$cd <- df$c * df$d
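Putting both parts together, here is a sketch that works on columns by name with lapply and Map instead of positions. The small data frame and the scalars 10 and 100 are made up for illustration.

```r
# Hypothetical data: b and c are stored as integers
df <- data.frame(a = c(1, 2, 3), b = 2:4, c = c(5L, 1L, 0L))
cols <- c("b", "c")

# Convert the selected columns to numeric (double)
df[cols] <- lapply(df[cols], as.numeric)

# Multiply each selected column by one fixed column
df[cols] <- lapply(df[cols], function(col) col * df$a)

# BONUS case: multiply each selected column by its own scalar
df[cols] <- Map(`*`, df[cols], c(10, 100))

str(df)
```

Assigning a list back into df[cols] keeps each column's own type, whereas sapply() first combines the columns into a single matrix.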

How do you find the sample sizes used in calculations in R?

I am running correlations between variables, some of which have missing data, so the sample sizes for the correlations are likely different. I tried print and summary, but neither of these shows me how big my n is for each correlation. This is a fairly simple problem that I cannot find the answer to anywhere.
Like this?
x <- c(1:100,NA)
length(x)
length(x[!is.na(x)])
You can also get the degrees of freedom (n - 2 for a Pearson correlation, where n is the number of complete pairs) like this:
y <- c(1:100,NA)
x <- c(1:100,NA)
cor.test(x,y)$parameter
But I think it would be best if you showed the code for how you are estimating the correlation, so that we can give exact help.
Here's an example of how to find the pairwise sample sizes among the columns of a matrix. If you want to apply it to (certain) numeric columns of a data frame, combine them accordingly, coerce the resulting object to matrix and apply the function.
# Example matrix:
xx <- rnorm(3000)
# Generate some NAs
vv <- sample(3000, 200)
xx[vv] <- NA
# reshape to a matrix
dd <- matrix(xx, ncol = 3)
# find the number of NAs per column
apply(dd, 2, function(x) sum(is.na(x)))
# tack on some column names
colnames(dd) <- paste0("x", seq(3))
# Function to find the number of pairwise complete observations
# among all pairs of columns in a matrix. It returns a data frame
# whose first two columns comprise all column pairs
pairwiseN <- function(mat)
{
  u <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
  h <- expand.grid(x = u, y = u)
  f <- function(x, y)
    sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
  h$n <- mapply(f, h[, 1], h[, 2])
  h
}
# Call it
pairwiseN(dd)
The function can easily be improved; for example, you could set h <- expand.grid(x = u[-1], y = u[-length(u)]) to cut down on the number of calculations, you could return an n x n matrix instead of a three-column data frame, etc.
Here is a for-loop implementation of Dennis' function above that outputs an n x n matrix directly, rather than having to pivot_wider() the result. On my Databricks cluster it cut the compute time for a 1865 row x 69 column matrix down from 2.5 - 3 minutes to 30-40 seconds.
Thanks for your answer Dennis, this helped me with my work.
pairwise_nxn <- function(mat)
{
  cols <- if(is.null(colnames(mat))) paste0("x", seq_len(ncol(mat))) else colnames(mat)
  nn <- data.frame(matrix(nrow = length(cols), ncol = length(cols)))
  rownames(nn) <- colnames(nn) <- cols
  f <- function(x, y)
    sum(apply(mat[, c(x, y)], 1, function(z) !any(is.na(z))))
  for (i in 1:nrow(nn))
    for (j in 1:ncol(nn))
      nn[i,j] <- f(rownames(nn)[i], colnames(nn)[j])
  nn
}
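If your variables are vectors named a and b, would something like sum(is.na(a) | is.na(b)) help you? That counts the pairs with at least one missing value; sum(!is.na(a) & !is.na(b)) counts the complete pairs.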
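The same complete-pair count can be computed for all column pairs at once with a matrix product, which avoids the nested apply calls. This is a sketch on a small made-up matrix, not a drop-in replacement for the functions above: it returns an n x n matrix whose diagonal holds the number of non-missing values per column.

```r
# Hypothetical example matrix with some missing values
mat <- cbind(x1 = c(1, NA, 3, 4),
             x2 = c(NA, 2, 3, 4),
             x3 = c(1, 2, 3, NA))

# TRUE where a value is observed
ok <- !is.na(mat)

# Entry [i, j] counts the rows where both column i and column j are observed
n_pairs <- crossprod(ok * 1)
n_pairs
```

These counts correspond to what cor(mat, use = "pairwise.complete.obs") uses for each pair of columns.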
