How to sum data.frame column values? - r

I have a data frame with several columns; some numeric and some character. How to compute the sum of a specific column? I’ve googled for this and I see numerous functions (sum, cumsum, rowsum, rowSums, colSums, aggregate, apply) but I can’t make sense of it all.
For example suppose I have a data frame people with the following columns
people <- read.table(
text =
"Name Height Weight
Mary 65 110
John 70 200
Jane 64 115",
header = TRUE
)
…
How do I get the sum of all the weights?

You can just use sum(people$Weight).
sum sums up a vector, and people$Weight retrieves the weight column from your data frame.
Note - you can get built-in help by using ?sum, ?colSums, etc. (by the way, colSums will give you the sum for each column).
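As a minimal sketch of the two calls mentioned above, using the toy data from the question: the names total_weight, numeric_cols and col_totals are made up for illustration, and selecting the numeric columns with sapply(people, is.numeric) is an assumption about how you might want to skip the Name column.

```r
# Toy data copied from the question
people <- read.table(
  text = "Name Height Weight
Mary 65 110
John 70 200
Jane 64 115",
  header = TRUE
)

# Sum of a single column
total_weight <- sum(people$Weight)

# colSums() needs numeric input, so keep only the numeric columns
# (this drops the character column Name)
numeric_cols <- sapply(people, is.numeric)
col_totals <- colSums(people[, numeric_cols])

total_weight  # 425
col_totals    # Height 199, Weight 425
```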

To sum values in data.frame you first need to extract them as a vector.
There are several ways to do it:
# $ operator
x <- people$Weight
x
# [1] 110 200 115
Or using [, ] similar to matrix:
x <- people[, 'Weight']
x
# [1] 110 200 115
Once you have the vector you can use any vector-to-scalar function to aggregate the result:
sum(people[, 'Weight'])
# [1] 425
If you have NA values in your data, you should specify na.rm parameter:
sum(people[, 'Weight'], na.rm = TRUE)
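A small sketch of why na.rm matters, assuming a hypothetical weights vector with one missing value (not part of the original data):

```r
# Hypothetical vector with one missing value
weights <- c(110, NA, 115)

# Without na.rm, a single NA makes the whole sum NA
with_na <- sum(weights)

# With na.rm = TRUE, the NA is dropped before summing
without_na <- sum(weights, na.rm = TRUE)

with_na     # NA
without_na  # 225
```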

You can also use the tidyverse packages, which I find more readable. Note that column names are case-sensitive, so it is Weight, not weight:
library(tidyverse)
people %>%
  summarise(sum(Weight, na.rm = TRUE))

When the column has NA values, and particularly when it is stored as text, convert it with as.numeric and drop the NAs:
sum(as.numeric(JuneData1$Account.Balance), na.rm = TRUE)

To rank the columns by their sums (colSums needs numeric columns, so the character column Name is dropped first):
order(colSums(people[, -1]), decreasing = TRUE)
If the data frame has more than 20 columns and the first 4 should be left out:
order(colSums(people[, 5:25]), decreasing = TRUE)

Related

Using previous row values when creating new R dataframe variable

I'd really appreciate some help with an issue I have with my R dataframe. Couldn't find a similar thread, so please share if it exists already!
I have the following data:
mydata <- data.frame(inflow=c(50,60,55,70,80),
outflow=c(70,80,70,65,65),
current=c(100,100,100,100,100))
I want to create a new column which does something like:
mutate(calc=pmax(lag(calc,default=current)+inflow-outflow,inflow))
which basically creates a new column called calc that takes the maximum of a) the previous row's value of calc plus this row's inflow minus outflow or b) this row's inflow value. pmax is a base R function that returns the element-wise maximum of the vectors it is given, so inside mutate it picks the maximum across the given columns for each row.
so my results will be: row1 = max(100+50-70, 50) which is 80, row2 = max(80+60-80,60) which is 60 and so on.
The main issue is that the lag function doesn't allow for taking previous row values for the same column you're creating, it has to be a column that already exists in the data. I thought of doing it in steps by creating the calc column first and then adding a second calculation step, but can't exactly work it out.
Lastly, I know that using a for loop might be a solution but was wondering if there is a different way? my data is grouped by an extra column and not sure the for loop will work well with grouped data rows?
Thanks for any help :)
library(dplyr)  # mutate() and %>%
library(purrr)  # accumulate2()
# I don't define the current column, as this is handled with the .init argument of accumulate2
mydata <- data.frame(
  inflow = c(50, 60, 55, 70, 80),
  outflow = c(70, 80, 70, 65, 65)
)
# define your recursive function
flow_function <- function(current, inflow, outflow) {
  pmax(inflow, inflow - outflow + current)
}
mydata %>%
  mutate(result = accumulate2(inflow, outflow, flow_function, .init = 100)[-1] %>% unlist)
# inflow outflow result
# 1 50 70 80
# 2 60 80 60
# 3 55 70 55
# 4 70 65 70
# 5 80 65 85
Detail
The purrr::accumulate family of functions are designed to perform recursive calculations.
accumulate can handle functions which take the previous value plus values from one other column, whilst accumulate2 allows for a second additional column. Your scenario falls into the latter.
accumulate2 expects the following arguments:
.x - the first column for the calculation.
.y - the second column for the calculation.
.f - the function to apply recursively: this should have three arguments, the first of which is the recursive argument.
.init - (optional) the initial value to use as the first argument.
So in your case the function to pass to .f will be
# define your recursive function
flow_function <- function(current, inflow, outflow){
pmax(inflow, inflow - outflow + current)
}
We first test what this produces outside of a dplyr::mutate
# note I don't define the current column, as this is handled with the .init argument
mydata <- data.frame(
inflow=c(50,60,55,70,80),
outflow=c(70,80,70,65,65)
)
purrr::accumulate2(mydata$inflow, mydata$outflow, flow_function, .init = 100)
# returns
# [[1]]
# [1] 100
#
# [[2]]
# [1] 80
#
# [[3]]
# [1] 60
#
# [[4]]
# [1] 55
#
# [[5]]
# [1] 70
#
# [[6]]
# [1] 85
So there are two things to note about the returned value:
The returned object is a list, so we'll want to unlist back to a vector.
The list has 6 entries as it includes the initial value, we'll want to drop this.
These two final steps are brought together in the full example at the top.
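For comparison, the same recurrence can be written as a plain base R loop, which makes the "previous result feeds the next row" step explicit. This is only a sketch: running_calc is a hypothetical helper name, not part of purrr or dplyr, and the starting value of 100 is taken from the question's current column.

```r
inflow  <- c(50, 60, 55, 70, 80)
outflow <- c(70, 80, 70, 65, 65)

# Hypothetical helper: walk the rows in order, carrying the previous result
running_calc <- function(inflow, outflow, init) {
  result <- numeric(length(inflow))
  previous <- init
  for (i in seq_along(inflow)) {
    # max of (previous + inflow - outflow) and this row's inflow
    result[i] <- max(inflow[i], previous + inflow[i] - outflow[i])
    previous <- result[i]
  }
  result
}

calc <- running_calc(inflow, outflow, init = 100)
calc  # 80 60 55 70 85
```

Because the helper takes and returns plain vectors, it could in principle be called inside a grouped mutate so that each group restarts from its own initial value.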
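Maybe the cummax function will help:
mutate(calc=pmax(cummax(current+inflow-outflow),inflow))
Note that on the sample data this returns 80, 80, 85, 105, 115 rather than the expected 80, 60, 55, 70, 85, because cummax does not feed each row's result back into the next row.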

correlation for data in matrix format in r

I have created a matrix in R and I want to investigate the correlation between two columns. My_matrix is:
         speed motor rpm acceleration age
cadillac     3        42           67  22
porche       5        40           68  21
ferrari      7        37           69  20
peugeot     10        32           70  19
kia         12        28           71  18
When I try cor(speed~age, data=My_matrix) I get the following error:
Error in cor(speed ~ age, data = My_matrix) : unused argument (data = My_matrix)
Any idea how I can address this? Thanks.
We can subset the columns and apply the cor directly as the usage of cor is
cor(x, y = NULL, use = "everything",
method = c("pearson", "kendall", "spearman"))
and there is no formula method
cor(My_matrix[,c("speed", "age")])
# speed age
#speed 1.0000000 -0.9971765
#age -0.9971765 1.0000000
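A self-contained sketch that rebuilds the matrix from the question and correlates the two columns directly. The column name "motor rpm" is an assumption about how the flattened header should be read (four columns, not five), and speed_age is a name made up for this example.

```r
# Rebuild the matrix from the question's values
My_matrix <- matrix(
  c( 3, 42, 67, 22,
     5, 40, 68, 21,
     7, 37, 69, 20,
    10, 32, 70, 19,
    12, 28, 71, 18),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("cadillac", "porche", "ferrari", "peugeot", "kia"),
    c("speed", "motor rpm", "acceleration", "age")
  )
)

# cor() takes two numeric vectors; there is no formula interface
speed_age <- cor(My_matrix[, "speed"], My_matrix[, "age"])
round(speed_age, 7)  # -0.9971765
```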
I also tried this and it worked.
I converted the matrix to a data frame "b":
b=as.data.frame(My_matrix)
Then I called cor on the two columns and got the correlation:
cor(b$speed, b$age)
There are some great base R solutions on here already (hats off to @akrun & @Debutant, base R is great!). I would like to add alternate solutions for future viewers and code preference options.
If you don't like typing quote marks and the dataset is small enough, column numbers can be faster, although quoted variable names are safer (especially if the columns are reordered).
@mikey in the comments offered a column number solution; here is an alternate version:
cor(My_matrix[,c(1,4)])
If your data is a dataframe instead of a matrix, you might enjoy a tidyverse approach, which also does not require quotation marks (although pesky variables with spaces in their names may require ` marks):
library(dplyr)
My_dataframe %>% select(speed, age) %>% cor()
@Debutant only asked for 2 variables for the correlation, but if we wanted to go all out and get the full correlation matrix, here are additional options:
# assuming all your columns are numeric as they are here
cor(My_matrix)
# if you have a dataframe with different data types, select only the numeric ones
library(dplyr)
My_dataframe %>% select_if(is.numeric) %>% cor()
# if you don't like the long decimals, toss in a round() for good measure
My_dataframe %>% select_if(is.numeric) %>% cor() %>% round(3)
Hope you find this useful. :)

R: How to run calculation on a part of the df without previous subsetting?

I want to apply a percentage calculation on certain rows (according to column criteria) of my data set. Normally I would do a (1) subset for this, (2) calculate the percentage, (3) delete the old (or previously subsetted rows) in my original data and (4) finally stack them together via rbind().
My question: is there a better/faster/shorter way to do this calculation? Here is some example data:
df <- data.frame(object = c("apples","tomatoes", "apples","pears" ),
Value = c(50,10,30,40))
The percentage calculation (50%) I would like to use for the subset on e.g. apples:
sub[,2] <- sub$Value * 50 /100
And the result should look like this:
object Value
1 apples 25
2 tomatoes 10
3 apples 15
4 pears 40
Thank you. Probably there is an easy way, but I didn't find online a solution so far.
Create a logical index for the rows where 'object' is 'apples', and do the calculation only on the subset of 'Value' selected by that index.
i1 <- df$object=='apples'
df$Value[i1] <- df$Value[i1]*50/100
Or you can use ifelse
df$Value <- with(df, ifelse(object=='apples', Value*50/100, Value))
Or a faster approach would be data.table
library(data.table)
setDT(df)[object=='apples', Value := Value*0.5]
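A quick sketch checking that the logical-index and ifelse approaches agree on the question's sample data. The names is_apple, by_index and by_ifelse are hypothetical, and working on copies (rather than overwriting df) is a choice made only so the two results can be compared.

```r
df <- data.frame(object = c("apples", "tomatoes", "apples", "pears"),
                 Value = c(50, 10, 30, 40))

# Approach 1: logical index, modify only the matching rows
is_apple <- df$object == "apples"
by_index <- df
by_index$Value[is_apple] <- by_index$Value[is_apple] * 50 / 100

# Approach 2: ifelse() over the whole column
by_ifelse <- df
by_ifelse$Value <- with(df, ifelse(object == "apples", Value * 50 / 100, Value))

by_index$Value   # 25 10 15 40
by_ifelse$Value  # 25 10 15 40
```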

Data Manipulation, Looping to add columns

I have asked this question a couple times without any help. I have since improved the code so I am hoping somebody has some ideas! I have a dataset full of 0's and 1's. I simply want to add the 10 columns together resulting in 1 column with 3835 rows. This is my code thus far:
# select for valid IDs
data = history[history$studyid %in% valid$studyid,]
sibling = data[,c('b16aa','b16ba','b16ca','b16da','b16ea','b16fa','b16ga','b16ha','b16ia','b16ja')]
# replace all NA values by 0
sibling[is.na(sibling)] <- 0
# loop over all columns and count the number of 174
apply(sibling, 2, function(x) sum(x==174))
The problem is this code adds together all the rows, I want to add together all the columns so I would result with 1 column. This is the answer I am now getting which is wrong:
b16aa b16ba b16ca b16da b16ea b16fa b16ga b16ha b16ia b16ja
68 36 22 18 9 5 6 5 4 1
In apply() you have MARGIN set to 2, which applies the function to each column. Set MARGIN to 1 so that your function is applied across each row instead. This was mentioned by @sgibb.
If that doesn't work (I can't reproduce your example), you could first convert the elements of the matrix to logicals with X2 <- apply(sibling, c(1,2), function(x) x==174), and then use rowSums to add up the columns in each row: Xsum <- rowSums(X2, na.rm=TRUE). With this setup you do not need to change the NAs to 0s first, because the na.rm argument of rowSums() handles them.
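A minimal sketch of the row-wise count, assuming a small made-up stand-in for the sibling data (three rows and three of the ten columns, with invented values). Comparing the whole data frame to 174 yields a logical matrix, so the apply() step can be skipped.

```r
# Hypothetical stand-in for `sibling`; the values are invented for illustration
sibling <- data.frame(
  b16aa = c(174, 0, 174),
  b16ba = c(NA, 174, 174),
  b16ca = c(0, 0, 174)
)

# TRUE/FALSE/NA matrix, then count the TRUEs in each row, ignoring NAs
per_row <- rowSums(sibling == 174, na.rm = TRUE)
per_row  # one count per row: 1, 1, 3
```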

Running sums for every row of the previous 25 rows

I've been trying for a while now to produce a code that brings me a new vector of the sum of the 25 previous rows for an original vector.
So if we say I have a variable Y with 500 rows and I would like a running sum, in a new vector, which contains the sum of rows [1:25] then [2:26] for the length of Y, such as this:
y<-1:500
runsum<-function(x){
cumsum(x)-cumsum(x[26:length(x)])
}
new<-runsum(y)
I've tried using some different functions here and then even using the apply functions on top but none seem to produce the right answers....
Would anyone be able to help? I realise it's probably very easy for many of the community here but any help would be appreciated
Thanks
This function calculates the sum of the 24 preceding values and the current value:
# stats:: avoids a clash with dplyr::filter if dplyr is loaded
movsum <- function(x, n = 25) {stats::filter(x, rep(1, n), sides = 1)}
It is easy to adapt to sum only preceding values, if this is what you really want.
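One way that adaptation might look, as a sketch: prepending a 0 to the filter weights gives the current value a weight of zero, so only the n preceding values are summed. prevsum is a hypothetical name, and wrapping the result in as.numeric() (to drop the time-series class) is a choice made here, not part of the original answer.

```r
y <- 1:500
n <- 25

# Sum of the n values before each position; the leading 0 excludes the current value
prevsum <- function(x, n = 25) {
  as.numeric(stats::filter(x, c(0, rep(1, n)), sides = 1))
}

p <- prevsum(y, n)
p[26]   # sum(y[1:25]) = 325
p[500]  # sum(y[475:499])
```

The first n entries are NA because fewer than n preceding values exist at those positions.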
In addition to Roland's answer you could use the zoo library
library(zoo)
y <- 1:500
rollapply(zoo(y), 25, sum)
HTH
I like Roland's answer better as it relies on a time series function and will probably be pretty quick. Since you mentioned you started going down the path of using apply() and friends, here's one approach to do that:
y<-1:500
#How many to sum at a time?
n <- 25
#Create a matrix of the start and end positions of each window
idx <- seq_along(y)
mat <- cbind(start = head(idx, -(n-1)), end = tail(idx, -(n-1)))
#Check output
rbind(head(mat,3), tail(mat,3))
#-----
start end
1 25
2 26
3 27
[474,] 474 498
[475,] 475 499
[476,] 476 500
#add together: index y by position, from each row's start to its end
apply(mat, 1, function(x) sum(y[x[1]:x[2]]))
#Is it the same as Roland's answer after removing the NA values it returns?
all.equal(apply(mat, 1, function(x) sum(y[x[1]:x[2]])),
movsum(y)[-(1:(n-1))])
#-----
[1] TRUE

Resources