Combine several columns under same name - r

I am trying to get the mvr function in the R-package pls to work. When having a look at the example dataset yarn I realized that all 268 NIR columns are in fact treated as one column:
library(pls)
data(yarn)
head(yarn)
colnames(yarn)
I would need that to use the function with my data (so that a multivariate datset is treated as one entity) but I have no idea how to achive that. I tried
TT<-matrix(NA, 2, 3)
colnames(TT)<-rep("NIR", ncol(TT))
TT
colnames(TT)
You will notice that while all columns have the same heading, colnames(TT) shows a vector of length three, because each column is treated separately. What I would need is what can be found in yarn, that the colname "NIR" occurs only once and applies columns 1-268 alike.
Does anybody know how to do that?

You can just assign the matrix to a column of a data.frame
TT <- matrix(1:6, 2, 3 )
# assign to an existing dataframe
out <- data.frame(desnity = 1:nrow(TT))
out$NIR <- TT
str(out)
# assign to empty dataframe
out <- data.frame(matrix(integer(0), nrow=nrow(TT))) ;
out$NIR <- TT

Related

Data.frame of Data.frames

I'm using a data.frame that contains many data.frames. I'm trying to access these sub-data.frames within a loop. Within these loops, the names of the sub-data.frames are contained in a string variable. Since this is a string, I can use the [,] notation to extract data from these sub-data.frames. e.g. X <- "sub.df"and then df[42,X] would output the same as df$sub.df[42].
I'm trying to create a single row data.frame to replace a row within the sub-data.frames. (I'm doing this repeatedly and that's why my sub-data.frame name is in a string). However, I'm having trouble inserting this new data into these sub-data.frames. Here is a MWE:
#Set up the data.frames and sub-data.frames
sub.frame <- data.frame(X=1:10,Y=11:20)
df <- data.frame(A=21:30)
df$Z <- sub.frame
Col.Var <- "Z"
#Create a row to insert
new.data.frame <- data.frame(X=40,Y=50)
#This works:
df$Z[3,] <- new.data.frame
#These don't (even though both sides of the assignment give the correct values/dimensions):
df[,Col.Var][6,] <- new.data.frame #Gives Warning and collapses df$Z to 1 dimension
df[7,Col.Var] <- new.data.frame #Gives Warning and only uses first value in both places
#This works, but is a work-around and feels very inelegant(!)
eval(parse(text=paste0("df$",Col.Var,"[8,] <- new.data.frame")))
Are there any better ways to do this kind of insertion? Given my experience with R, I feel like this should be easy, but I can't quite figure it out.

R - Summation of data frame columns changes data type

I have a data frame of 15 columns where the first column is an integer and others are numeric. I have to generate a one-liner summary of the sum of all columns except the last one. I need to generate mean of the last column. So, I am doing something as below:
summary <- c(sum(df$col1), ... mean(df$col15))
The summary then appears with values up to two decimal places even for the integer column (first one). I have been trying the round function to fix this. I can understand, when different types are added, e.g. 1 + 1.0. But, in this case, shouldn't the summation maintain the data-type?
Please let me know what am I missing?
If you are looking for a one-line summary:
lst <- c(lapply(df[-ncol(df)], function(x) sum(x)), mean=mean(df[,ncol(df)]))
as.data.frame(lst)
# int num1 mean
#1 10 6 2.5
The output is a data frame that preserves the classes of each vector. If you would like the output to be added to the original data frame you can replace as.data.frame(lst) with:
names(lst) <- names(df)
rbind(df, lst)
If you are trying to get the sum of all integer columns and the mean of numeric columns, go with #Frank's answer.
Data
df <- data.frame(int=1:4, num1=seq(1,2,length.out=4), num2=seq(2,3,length.out=4))
Perhaps an adaptation of this?
apply(iris[,1:4], 2, sum) / c(rep(1,3), nrow(iris))

How to apply operation and sum over columns in R?

I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful like so. Note, I use sapply which also returns a matrix.
# random custom function
myfun <- function(x){
return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest to convert the data set into the 'long' format, group it by sample, and then calculate the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x),id.vars = 'sample')[,sum(value^2),by=sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
You can use apply function. And get those columns that you need with c(i1,i2,..,etc).
apply(( x[ , c(2, 3) ])^2, 1 ,sum )
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do :
# if somefunction can be vectorized :
x$results<-apply(x[,col_indices],1,function(x) sum(somefunction(x)))
# if not :
x$results<-apply(x[,col_indices],1,function(x) sum(sapply(x,somefunction)))
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors--each column is itself a vector. So you can you the handy-dandy lapply function to apply a function to the desired column in the list/data frame.
I'm going to define a function as the square as you have above, but of course this can be any function of any complexity (so long as it takes a vector as an input and returns a vector of the same length. If it doesn't, it won't fit into the original data.frame!
The steps below are extra pedantic to show each little bit, but obviously it can be compressed into one or two steps. Note that I only retain the sum of the squares of each column, given that you might want to save space in memory if you are working with lots and lots of data.
create data; define the function
grab the columns you want as a separate (temporary) data.frame
apply the function to the data.frame/list you just created.
lapply returns a list, so if you intend to retain it seperately make it a temporary data.frame. This is not necessary.
calculate the sums of the rows of the temporary data.frame and append it as a new column in x.
remove the temp data.table.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #Step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
Here is an other apply solution
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))

Order a data set

I have a list of dataframe which I want to order according 3 column.
I've tried to apply an anonymous function
mylist<-lapply(mylist, function (x) x[order((data[,col1]),(data$namecol2),na.last=NA),])
I've tried in a loop :
for (i in 1:length(mylist)) {
list_sorted <-mylist[[i]][order((data[,col1]),(data$namecol2),na.last=NA),]
}
Either way I get a list of dataframe which are full of NA when they were not in the first place. This step create the dataframe full of NA, I checked the step before and it return my dataframe full of values.
I don't know what I'm doing wrong, any tips?
Thank you.
I guess you have a list with dataframes, and want to sort each of the dataframes based on a column in the dataframe.
The example I have below is a list with two dataframes, the dataframe consists of two columns("x" and "y"). And I sort it based on the column "x" in a descending order. Hope this gives you an idea to accomplish what you want.
x <- rep(1:5)
y <- rnorm(5)
dfrm <- data.frame(x,y)
str(dfrm)
names(dfrm)
listd <- list(dfrm,dfrm)
str(listd)
listsorted <- lapply(listd, function(z) z[with(z,order(x,decreasing=TRUE,na.last=NA)),])
listsorted

Losing Class information when I use apply in R

When I pass a row of a data frame to a function using apply, I lose the class information of the elements of that row. They all turn into 'character'. The following is a simple example. I want to add a couple of years to the 3 stooges ages. When I try to add 2 a value that had been numeric R says "non-numeric argument to binary operator." How do I avoid this?
age = c(20, 30, 50)
who = c("Larry", "Curly", "Mo")
df = data.frame(who, age)
colnames(df) <- c( '_who_', '_age_')
dfunc <- function (er) {
print(er['_age_'])
print(er[2])
print(is.numeric(er[2]))
print(class(er[2]))
return (er[2] + 2)
}
a <- apply(df,1, dfunc)
Output follows:
_age_
"20"
_age_
"20"
[1] FALSE
[1] "character"
Error in er[2] + 2 : non-numeric argument to binary operator
apply only really works on matrices (which have the same type for all elements). When you run it on a data.frame, it simply calls as.matrix first.
The easiest way around this is to work on the numeric columns only:
# skips the first column
a <- apply(df[, -1, drop=FALSE],1, dfunc)
# Or in two steps:
m <- as.matrix(df[, -1, drop=FALSE])
a <- apply(m,1, dfunc)
The drop=FALSE is needed to avoid getting a single column vector.
-1 means all-but-the first column, you could instead explicitly specify the columns you want, for example df[, c('foo', 'bar')]
UPDATE
If you want your function to access one full data.frame row at a time, there are (at least) two options:
# "loop" over the index and extract a row at a time
sapply(seq_len(nrow(df)), function(i) dfunc(df[i,]))
# Use split to produce a list where each element is a row
sapply(split(df, seq_len(nrow(df))), dfunc)
The first option is probably better for large data frames since it doesn't have to create a huge list structure upfront.

Resources