sum different columns in a data.frame - r

I have a very big data.frame and want to sum the values in every column.
So I used the following code:
sum(production[,4],na.rm=TRUE)
or
sum(production$X1961,na.rm=TRUE)
The problem is that the data.frame is very big. And I only want to sum 40 certain columns with different names of my data.frame. And I don't want to list every single column. Is there a smarter solution?
At the end I also want to store the sum of every column in a new data.frame.
Thanks in advance!

Try this:
colSums(df[sapply(df, is.numeric)], na.rm = TRUE)
where sapply(df, is.numeric) is used to detect all the columns that are numeric.
If you just want to sum a few columns, then do:
colSums(df[c("X1961", "X1962", "X1999")], na.rm = TRUE)

res <- unlist(lapply(production, function(x) if(is.numeric(x)) sum(x, na.rm=T)))
will return the sum of each numeric column.
You could create a new data frame based on the result with
data.frame(t(res))

If you dont want to include every single column, you somehow have to indicate which ones to include (or alternatively, which to exclude)
colsInclude <- c("X1961", "X1962", "X1963") # by name
# or #
colsInclude <- paste0("X", 1961:2003) # by name
# or #
colsInclude <- c(10:19, 23, 55, 147) # by column number
To put those columns in a new data frame simply use [ ] as you've done: '
newDF <- oldDF[, colsInclude]
To sum up each column, simply use colSums
sums <- colSums(newDF, na.rm=T)
# or #
sums <- colSums(oldDF[, colsInclude], na.rm=T)
Note that sums will be a vector, not necessarilly a data frame.
You can make it into a data frame using as.data.frame
sums <- as.data.frame(sums)
# or, to include the data frame from which it came #
sums <- rbind(newDF, "totals"=sums)

Related

remove outliers from multiple columns in R

I used below codes to identify outliers on different columns:
outliers_x1 <- boxplot(mydata$x1, plot=FALSE)$out
outliers_x4 <- boxplot(mydata$x4, plot=FALSE)$out
outliers_x6 <- boxplot(mydata$x6, plot=FALSE)$out
Now, how can I remove those outliers from the dataset by one code?
This will set any outlier values to NA, and then optionally remove all rows where any column contains an outlier. Works with arbitrary number of columns.
Uses data.table for convenience.
library(data.table)
library(matrixStats)
##
# create sample data
#
set.seed(1)
dt <- data.table(x1=rnorm(100), x2=rnorm(100), x3=rnorm(100))
##
# incorporate possible outliers
#
dt[sample(100, 5), x1:=10*x1]
dt[sample(100, 5), x2:=10*x2]
dt[sample(100, 5), x3:=10*x3]
##
# you start here...
# remove all rows where any column contains an outlier
#
indx <- sapply(dt, \(x) !(x %in% boxplot(x, plot=FALSE)$out))
dt[as.logical(rowProds(indx))]
In the above, indx is a matrix with three logical columns. Each element is TRUE unless the corresponding column contained an outlier in that row. We use rowProds(...) from the matrixStats package to multiply ( & ) the 3 rows together. Unfortunately this converts everything numeric (1, 0), so we have to convert back to logical to use as an index into dt.
##
# replaces outliers with NA in each column
#
dt.melt <- melt(dt[, id:=seq(.N)], id='id')
dt.melt[, ol:=(value %in% boxplot(value, plot=FALSE)$out), by=.(variable)]
dt.melt[(ol), value:=NA]
result <- dcast(dt.melt, id~variable)[, id:=NULL]
##
# remove all rows where any column contains an outlier
#
na.omit(result)
In the code above we add an id column, then melt(...) so all other columns are in one column (value) with a second column (variable) indicating the original source column. Then we apply the boxplot(...) algorithm group-wise (by variable) to produce an ol column indicating an outlier. Then we set any value corresponding to ol == TRUE to NA. Then we re-convert to your original wide format with dcast(...) and remove the id.
It's a bit roundabout but this melt - process - dcast pattern is common when processing multiple columns like this.
Finally, na.omit(result) will remove any rows which have NA in any of the columns. If that's what you want it's simpler to use the first approach.

Add columns dynamically to dataframe in R

I am only a few days old in the R ecosystem and trying to figure out a way to add dynamic column for each numeric column found in the original dataframe.
I have succesfully written a way to change the value in the existing column in the dataframe but what I need is to put those calculated values into a new column rather than overwriting the existing one.
Here is what I've done so far,
myDf <- read.csv("MyData.csv",header = TRUE)
normalize <- function(x) {
return ((x - min(x,na.rm = TRUE)) / (max(x,na.rm = TRUE) - min(x,na.rm = TRUE)))
}
normalizeAllCols <- function(df){
df[,sapply(x, is.numeric)] <- lapply(df[,sapply(df, is.numeric)], normalize)
df
}
normalizedDf<-normalizeAllCols(myDf)
I came with above snippet (with a lot of help from the internet) to apply normalize function to all numeric columns in the given data frame. I want to know how to put those calculated values into a new column in the data frame. (in the given snippet I'd like to know how to put normalized value in a new column like "norm" + colname ).
You can find the column names which are numeric and use paste0 create new columns.
normalizeAllCols <- function(df){
cols <- names(df)[sapply(df, is.numeric)]
df[paste0('norm_', cols)] <- lapply(df[cols], normalize)
df
}
normalizedDf<-normalizeAllCols(myDf)
In dplyr you can use across to apply a function to only numeric columns directly.
library(dplyr)
normalizeAllCols <- function(df){
df %>%
mutate(across(where(is.numeric), list(norm = ~normalize)))
}

Subsetting efficiently on multiple columns and rows

I am trying to subset my data to drop rows with certain values of certain variables. Suppose I have a data frame df with many columns and rows, I want to drop rows based on the values of variables G1 and G9, and I only want to keep rows where those variables take on values of 1, 2, or 3. In this way, I aim to subset on the same values across multiple variables.
I am trying to do this with few lines of code and in a manner that allows quick changes to the variables or values I would like to use. For example, assuming I start with data frame df and want to end with newdf, which excludes observations where G1 and G9 do not take on values of 1, 2, or 3:
# Naive approach (requires manually changing variables and values in each line of code)
newdf <- df[which(df$G1 %in% c(1,2,3), ]
newdf <- df[which(newdf$G9 %in% c(1,2,3), ]
# Better approach (requires manually changing variables names in each line of code)
vals <- c(1,2,3)
newdf <- df[which(df$G1 %in% vals, ]
newdf <- df[which(newdf$G9 %in% vals, ]
If I wanted to not only subset on G1 and G9 but MANY variables, this manual approach would be time-consuming to modify. I want to simplify this even further by consolidating all of the code into a single line. I know the below is wrong but I am not sure how to implement an alternative.
newdf <- c(1,2,3)
newdf <- c(df$G1, df$G9)
newdf <- df[which(df$vars %in% vals, ]
It is my understanding I want to use apply() but I am not sure how.
You do not need to use which with %in%, it returns boolean values. How about the below:
keepies <- (df$G1 %in% vals) & (df$G9 %in% vals)
newdf <- df[keepies, ]
Use data.table
First, melt your data
library(data.table)
DT <- melt.data.table(df)
Then split into lists
DTLists <- split(DT, list(DT[1:9])) #this is the number of columns that you have.
Now you can operate on the lists recursively using lapply
DTresult <- lapply(DTLists, function(x) {
...
}

Replacing or imputing NA values in R without For Loop

Is there a better way to go through observations in a data frame and impute NA values? I've put together a 'for loop' that seems to do the job, swapping NAs with the row's mean value, but I'm wondering if there is a better approach that does not use a for loop to solve this problem -- perhaps a built in R function?
# 1. Create data frame with some NA values.
rdata <- rbinom(30,5,prob=0.5)
rdata[rdata == 0] <- NA
mtx <- matrix(rdata, 3, 10)
df <- as.data.frame(mtx)
df2 <- df
# 2. Run for loop to replace NAs with that row's mean.
for(i in 1:3){ # for every row
x <- as.numeric(df[i,]) # subset/extract that row into a numeric vector
y <- is.na(x) # create logical vector of NAs
z <- !is.na(x) # create logical vector of non-NAs
result <- mean(x[z]) # get the mean value of the row
df2[i,y] <- result # replace NAs in that row
}
# 3. Show output with imputed row mean values.
print(df) # before
print(df2) # after
Here's a possible vectorized approach (without any loop)
indx <- which(is.na(df), arr.ind = TRUE)
df[indx] <- rowMeans(df, na.rm = TRUE)[indx[,"row"]]
Some explanation
We can identify the locations of the NAs using the arr.ind parameter in which. Then we can simply index df (by the row and column indexes) and the row means (only by the row indexes) and replace values accordingly
Data:
set.seed(102)
rdata <- matrix(rbinom(30,5,prob=0.5),nrow=3)
rdata[cbind(1:3,2:4)] <- NA
df <- as.data.frame(rdata)
This is a little trickier than I'd like -- it relies on the column-major ordering of matrices in R as well as the recycling of the row-means vector to the full length of the matrix. I tried to come up with a sweep() solution but didn't manage so far.
rmeans <- rowMeans(df,na.rm=TRUE)
df[] <- ifelse(is.na(df),rmeans,as.matrix(df))
One possibility, using impute from Hmisc, which allows for choosing any function to do imputation,
library(Hmisc)
t(sapply(split(df2, row(df2)), impute, fun=mean))
Also, you can hide the loop in an apply
t(apply(df2, 1, function(x) {
mu <- mean(x, na.rm=T)
x[is.na(x)] <- mu
x
}))

Calculate pairwise-difference between each pair of columns in dataframe

I cannot get around the problem of creating the differentials of every variable (column) in "adat" and saving it to a matrix "dfmtx".
I would just need to automate the following sequence to run for each column in "adat" and than name the obtained vector according to the name of the ones subtracted from each other and placed in to a column of "dfmtx".
In "adat" I have 14 columns and 26 rows not including the header.
dfmtx[,1]=(adat[,1]-adat[,1])
dfmtx[,2]=(adat[,1]-adat[,2])
dfmtx[,3]=(adat[,1]-adat[,3])
dfmtx[,4]=(adat[,1]-adat[,4])
dfmtx[,5]=(adat[,1]-adat[,5])
dfmtx[,6]=(adat[,1]-adat[,6])
.....
dfmtx[,98]=(adat[,14]-adat[,14])
Any help would be appreciated thank you!
If adat is a data.frame, you can use outer to get the combinations of columns and then do the difference between pairwise subset of columns based on the index from outer. It is not clear how you got "98" columns. By removing the diagonal and lower triangular elements, the number of columns will be "91".
nm1 <- outer(colnames(adat), colnames(adat), paste, sep="_")
indx1 <- which(lower.tri(nm1, diag=TRUE))
res <- outer(1:ncol(adat), 1:ncol(adat),
function(x,y) adat[,x]-adat[,y])
colnames(res) <- nm1
res1 <- res[-indx1]
dim(res1)
#[1] 26 91
data
set.seed(24)
adat <- as.data.frame(matrix(sample(1:20, 26*14,
replace=TRUE), ncol=14))

Resources