cbind a vector of different length to a dataframe - r

I have a dataframe consisting of two samples. Only one sample has answered a questionnaire about state anxiety.
For this case, I have calculated a vector for somatic state anxiety with the following function "rowSums":
som_lp <- rowSums(sample1[,c(1, 7, 8, 10 )+108], na.rm = TRUE)
Now I would like to add this to my existing dataframe "data", but the function "cbind" doesn't work here, because of the different lengths (dataframe 88, som_lp 59).
data <- cbind(data, som_lp)
Can anyone help me and is there another option to calculate "som_lp" to avoid the different lengths?

We can use cbind.fill from rowr
library(rowr)
cbind.fill(data, som_lp, fill = NA)
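If rowr cannot be installed, the same padding can be done in base R. The following is only a sketch: the data frame, its group column, and the scores are made-up stand-ins, not the asker's data. It first pads the vector with NA at the end (what a fill-style cbind does), then shows an index-based assignment, which may be safer because it puts each score on the row it belongs to instead of assuming that sample 1 occupies the first rows.

```r
# Hypothetical stand-in data: 5 rows in total, but only group 1 has a score
data <- data.frame(id = 1:5, group = c(2, 1, 1, 2, 1))
som_lp <- c(4, 7, 5)

# Option 1: pad the shorter vector with NA up to nrow(data), then cbind
padded <- som_lp
length(padded) <- nrow(data)        # extends the vector with NA
data_padded <- cbind(data, som_lp = padded)

# Option 2: create an NA column and fill only the rows that belong to group 1
data_aligned <- data
data_aligned$som_lp <- NA_real_
data_aligned$som_lp[data_aligned$group == 1] <- som_lp
```

With option 1 the scores land in rows 1 to 3 regardless of group; with option 2 they land in the rows where group is 1.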

Related

Bootstrapping data frame columns independently in R

I have a data.frame where each column represents a different individual and each row represents different food items eaten.
My goal is to resample each column via bootstrapping and then calculate a metric score and C.I.s for each individual (data column) using a defined function.
I have done this successfully on a single vector but cannot figure out how to apply the bootstrapping and metric function to individual columns in a data frame. Below is the code I have to apply it to a single vector:
data.1 <- c(10, 50, 200, 54, 6) ## example vector
## create function
metric.function <- function(x){
  p <- x/sum(x)
  dap <- 1/sum(p^2)
  return(dap)
}
vect <- c() ## empty vector for bootstrap data
for (i in 1:1000){
  data.2 <- sample(data.1, replace = TRUE) ## bootstrap sample
  vect[i] <- metric.function(data.2) ## apply metric.function
}
summary(vect) ## summary
quantile(vect, probs = c(0.025, 0.975)) ## C.I.
This works fine for a single vector, but I want to apply it independently to multiple columns in a data frame. For example, in the example.df below I want to apply it to X1:X10 independently, resulting in 10 metric scores and 10 C.I.s
example.df <- data.frame(replicate(10, sample(0:50, 10, replace = TRUE)))
I have tried changing the vector item to a data.frame and messing around with apply and dply but cannot figure it out, can anyone suggest how to do it or point me in the direction of useful guide/website etc?
This is a perfect chance to use replicate and sapply.
replicate(1000, sapply(example.df, function(x)
  metric.function(sample(x, replace = TRUE))))
sapply will operate column-wise (given that a data.frame is in a sense a list of columns); once we've isolated a column within sapply, we need only resample it & apply our metric.
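To get from the bootstrap replicates to one confidence interval per column, something like the following could work. It is a minimal sketch that reuses metric.function from the question; the object names boot and ci are made up here, and the example data is drawn from 1:50 rather than 0:50 (an assumption made only so that a resample can never sum to zero).

```r
set.seed(1)
# Example data: 10 columns (individuals), 10 rows (food items)
example.df <- data.frame(replicate(10, sample(1:50, 10, replace = TRUE)))

metric.function <- function(x) {
  p <- x / sum(x)
  1 / sum(p^2)
}

# One row per column of example.df, one column per bootstrap replicate
boot <- replicate(1000, sapply(example.df, function(x)
  metric.function(sample(x, replace = TRUE))))

# 2.5% and 97.5% quantiles for each individual
ci <- t(apply(boot, 1, quantile, probs = c(0.025, 0.975)))

# Metric on the original (unresampled) data, for comparison
observed <- sapply(example.df, metric.function)
```

Each row of ci then holds the lower and upper bound for one column of example.df.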

Obtain a data.frame (or list) of X times the original source, after applying some function

I've been having problems with this one for a while.
What I would like is to apply a function to a data.frame that is divided by factors. This data frame has n>2 columns of values that I need to use for this function.
For the sake of this example, this dataset has a column of 5 factors (a,b,c,d,e) and 2 columns of values (values1,values2). I would like to apply a number of functions that take into account each column of values (auto.arima first and then forecast.Arima, in this case). A dataset to play with follows:
library(forecast)
set.seed(2)
dat <- data.frame(factors = letters[1:5],values1 = rnorm(50), values2 =rnorm(50))
I would like (for the sake of the exercise) to apply auto.arima to values1 and values2, per factor. My expected output would be something that, per factor, takes into account both columns of values and forecasts both (each as its own univariate time series). So if the dataset has 5 factors and 2 columns of values, I would need 10 lists/data.frames.
Some options that did not work: Splitting the data.frame per factor via:
split(dat, dat$factor)
And then using rapply:
rapply(dat,function(x) forecast.Arima(auto.arima(x)),dat$factors)
Or lapply:
lapply(split(dat,dat$factors), function(x) forecast.Arima(auto.arima(x)))
And some other combinations, all to no avail.
I thought that the easiest solution would involve a function in the apply family, but any solution would be valid.
Is this what you're looking for?
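library(reshape2)  # provides melt()
m = melt(dat, id.vars = "factors")
l = split(m, paste(m$factors, m$variable))
# newer versions of the forecast package no longer export forecast.Arima();
# call the generic forecast() there instead
lapply(l, function(x) forecast.Arima(auto.arima(x$value)))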
i.e. splitting the data into 10 different frames, then applying the forecast on the values?
The problem with your apply solutions is that you were passing the whole dataframe to the auto.arima function, which takes a vector, so you'd need something like this:
lapply(split(dat,dat$factors), function(df) {
  apply(df[,-1], 2, function(col) forecast.Arima(auto.arima(col)))
})
This splits the dataframe as before on the factors and then applies over each column (ignoring the first which is the factor) the auto.arima function wrapped in forecast.Arima. This returns a list of lists (5 factors by 2 values) so allows you to keep values1 and values2 separate.
You can use unlist(x, recursive=FALSE) to flatten this to a list of 10.
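The split, apply and flatten pattern can be tried without the forecast package. In the sketch below, fit_and_forecast is a hypothetical helper that uses a mean-only arima model from base stats purely as a stand-in; with forecast installed, its body could presumably be swapped for forecast(auto.arima(v)). The names nested and flat are also made up for illustration.

```r
set.seed(2)
dat <- data.frame(factors = letters[1:5], values1 = rnorm(50), values2 = rnorm(50))

# Stand-in for forecast(auto.arima(v)): fit a mean-only model and
# predict 3 steps ahead (assumption: any univariate fit/forecast works here)
fit_and_forecast <- function(v) {
  fit <- arima(v, order = c(0, 0, 0))
  predict(fit, n.ahead = 3)$pred
}

# List of 5 factors, each holding a list of 2 forecasts (values1, values2)
nested <- lapply(split(dat[, -1], dat$factors), function(df) {
  lapply(df, fit_and_forecast)
})

# Flatten one level: a list of 10, named a.values1, a.values2, b.values1, ...
flat <- unlist(nested, recursive = FALSE)
names(flat)
```

Using lapply over the columns rather than apply avoids converting each piece to a matrix first, and the names of the flattened list record which factor and which value column each forecast belongs to.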

Combine several columns under same name

I am trying to get the mvr function in the R-package pls to work. When having a look at the example dataset yarn I realized that all 268 NIR columns are in fact treated as one column:
library(pls)
data(yarn)
head(yarn)
colnames(yarn)
I would need that to use the function with my data (so that a multivariate dataset is treated as one entity), but I have no idea how to achieve that. I tried
TT<-matrix(NA, 2, 3)
colnames(TT)<-rep("NIR", ncol(TT))
TT
colnames(TT)
You will notice that while all columns have the same heading, colnames(TT) shows a vector of length three, because each column is treated separately. What I would need is what can be found in yarn: the colname "NIR" occurs only once and applies to columns 1-268 alike.
Does anybody know how to do that?
You can just assign the matrix to a column of a data.frame
TT <- matrix(1:6, 2, 3)
# assign to an existing dataframe
out <- data.frame(density = 1:nrow(TT))
out$NIR <- TT
str(out)
# assign to empty dataframe
out <- data.frame(matrix(integer(0), nrow = nrow(TT)))
out$NIR <- TT
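An alternative that builds the data frame in one call is to wrap the matrix in I(), which asks data.frame() to keep it as a single column instead of splitting it. This is a small sketch with made-up values; the density column is only a placeholder response.

```r
TT <- matrix(1:6, 2, 3)

# I() protects the matrix so data.frame() stores it as one column
out <- data.frame(density = 1:nrow(TT), NIR = I(TT))

colnames(out)   # "density" "NIR"
dim(out$NIR)    # 2 3
```

The data frame then has two columns, and out$NIR is still the full 2 x 3 matrix, so it can be referred to by a single name in a formula such as density ~ NIR.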

Means from a list of data frames in R

I am relatively new to R and have a complicated situation to solve. I have uploaded a list of over 1000 data frames into R and called this list x. What I want to do is take certain data frames and take the mean and variance of the entire data frames (excluding the first column of each) and save these into two separate vectors. For example I wish to take the mean and variance of every third data frame in the list starting from element (3) and going to element (54).
So what I ultimately want are two vectors:
meanvector=c(mean(data frame(3)), mean(data frame(6)),..., mean(data frame(54)))
variancevector=c(var(data frame (3)), var(data frame (6)), ..., var(data frame(54)))
This problem is way above my knowledge level but I am thinking I can do this effectively using some sort of loop but I do not know how to go about making such loop. Any help would be much appreciated! Thank you in advance.
You can use lapply and pass indices as follows:
ids <- seq(3, 54, by = 3)
out <- do.call(rbind, lapply(ids, function(idx) {
  vals <- unlist(x[[idx]][, -1])  # all values except the first column
  c(mean(vals), var(vals))
}))
# out[, 1] holds the means, out[, 2] the variances
If x is a list of 1000 dataframes, you can use lapply to return the means and variances of a subset of this list.
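ix = seq(3, 54, by = 3)  # every third data frame, from element 3 to element 54
lapply(x[ix], function(df){
  # exclude the first column and flatten the remaining values into one vector
  vals <- unlist(df[, -1])
  c(mean(vals), var(vals))
})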
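To end up with the two separate vectors the question asks for, one possible sketch uses vapply. The list x here is simulated stand-in data (60 small data frames with an id column), not the asker's; the name stats is made up, while meanvector and variancevector follow the question.

```r
set.seed(42)
# Hypothetical stand-in for the real list: 60 data frames, first column is an id
x <- replicate(60, data.frame(id = 1:4, a = rnorm(4), b = rnorm(4)),
               simplify = FALSE)

ids <- seq(3, 54, by = 3)   # elements 3, 6, ..., 54

# 2 x length(ids) matrix: row "mean" and row "var"
stats <- vapply(x[ids], function(df) {
  vals <- unlist(df[, -1])  # pool every value except the first column
  c(mean = mean(vals), var = var(vals))
}, FUN.VALUE = c(mean = 0, var = 0))

meanvector <- stats["mean", ]
variancevector <- stats["var", ]
```

Pooling with unlist treats each data frame as one set of numbers, which matches "the mean and variance of the entire data frame"; calling var on the data frame directly would return a covariance matrix instead.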

Getting the minimum of the rows in a data frame

I am working with a dataframe that has 65 variables in it. The first variable catalogs a person, and the next 64 variables indicate the geographic distance that person is from each of 64 locations. Using R, I would like to create a new variable that catalogs the shortest distance for each person to one of those 64 locations.
For example: if person X is 35, 50, 79, 100, 450...miles away from the locations, I would like the new variable to automatically assign them a 35, because this is the shortest distance.
Any help with this would be much appreciated. Thanks.
Or, using the example of Justin:
df$shortest <- do.call(pmin,df[-1])
See also ?pmin and ?do.call, and note that you can drop the first variable in your data frame by using list indexing (so without any comma at all; see also ?Extract).
df <- data.frame(let=letters[1:25], d1=sample(1:25,25), d2=sample(1:25,25), d3=sample(1:25,25))
df$shortest <- apply(df[,2:4],1,min)
The second line applies the function min to each row and assigns the result to the new column in my data.frame df. See ?apply for more explanation of what the second line is doing. Be careful to skip the first column, or any columns that aren't distances:
apply(df,1,min) gives completely different answers since it's finding the "min" of strings.
> min(2:10)
[1] 2
> min(as.character(2:10))
[1] "10"
I'd approach this with apply, but transform or another approach could work.
#fake data set
DF <- as.data.frame(matrix(sample(1:100, 100, replace = TRUE), 5, 20))
DF <- data.frame(ID=LETTERS[1:5], DF)
#solution
DF$newvar <- apply(DF[,-1], 1, min)
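The apply and pmin approaches from the answers above should give the same result on purely numeric distance columns. The sketch below checks this on simulated data; the names by_apply and by_pmin are made up for illustration.

```r
set.seed(1)
# Fake data: an ID column followed by 20 numeric distance columns
DF <- data.frame(ID = LETTERS[1:5],
                 matrix(sample(1:100, 100, replace = TRUE), 5, 20))

# Row-wise minimum via apply (the ID column must be excluded)
by_apply <- apply(DF[, -1], 1, min)

# Same result via pmin, which takes the element-wise minimum across columns
by_pmin <- do.call(pmin, DF[-1])

all(by_apply == by_pmin)
```

pmin is vectorised over the columns, so it avoids the per-row function calls that apply makes, which may matter on data frames with many rows.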
