How to calculate means for each row in a dataframe? [R]

Here is the df:
# A tibble: 6 x 5
t a b c d
<dbl> <dbl> <dbl> <dbl> <dbl>
1 3999. 0.00586 0.00986 0.00728 0.00856
2 3998. 0.0057 0.00958 0.00702 0.00827
3 3997. 0.00580 0.00962 0.00711 0.00839
4 3996. 0.00602 0.00993 0.00726 0.00875
I want to get the mean of each row, excluding the first column. The code I wrote:
df$Mean <- rowMeans(df[select(df, -"t")])
The error I get:
Error: Must subset columns with a valid subscript vector.
x Subscript `select(group1, -"t")` has the wrong type `tbl_df<
p2 : double
p8 : double
p10: double
p9 : double
>`.
ℹ It must be logical, numeric, or character.
I tried converting df to a matrix, but then I get another error. How should I solve this?
Now I'm trying to calculate standard error using the code:
se <- function(x){sd(df[,x])/sqrt(length(df[,x]))}
sapply(group1[,2:5],se)
I try to indicate which columns should be used to calculate the error, but again an error pops up:
Error: Must subset columns with a valid subscript vector.
x Can't convert from `x` <double> to <integer> due to loss of precision.
I have used valid column subscripts, so I don't understand why I get the error.
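As a side note on the second error (not from the original answers): sapply(group1[, 2:5], se) already passes each column's values to se(), so x inside the function is a numeric vector, and df[, x] then tries to subset columns with those doubles. A minimal sketch of a per-column standard error that operates on the vector directly:
# x is already the vector of column values, so work on it directly
se <- function(x) sd(x, na.rm = TRUE) / sqrt(sum(!is.na(x)))
sapply(df[, 2:5], se)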

A base R solution would be:
df$Mean <- rowMeans(df[, -1], na.rm = TRUE)
Output:
t a b c d Mean
1 3999 0.00586 0.00986 0.00728 0.00856 0.0078900
2 3998 0.00570 0.00958 0.00702 0.00827 0.0076425
3 3997 0.00580 0.00962 0.00711 0.00839 0.0077300
4 3996 0.00602 0.00993 0.00726 0.00875 0.0079900

We can use setdiff to select the columns other than 't' and then get the rowMeans. This works regardless of where the 't' column sits, since it does not rely on column position.
df$Mean <- rowMeans(df[setdiff(names(df), "t")], na.rm = TRUE)
df
# t a b c d Mean
#1 3999 0.00586 0.00986 0.00728 0.00856 0.0078900
#2 3998 0.00570 0.00958 0.00702 0.00827 0.0076425
#3 3997 0.00580 0.00962 0.00711 0.00839 0.0077300
#4 3996 0.00602 0.00993 0.00726 0.00875 0.0079900
select from dplyr returns a subset of the data.frame, not column names or indices, so we can apply rowMeans to it directly:
library(dplyr)
rowMeans(select(df, -t), na.rm = TRUE)
Or in a pipe
df <- df %>%
  mutate(Mean = rowMeans(select(., -t), na.rm = TRUE))
Update
If we need to get the standard error per row, we can use apply with MARGIN as 1
apply(df[setdiff(names(df), 't')], 1,
      function(x) sd(x)/sqrt(length(x)))
Or with rowSds from matrixStats
library(matrixStats)
rowSds(as.matrix(df[setdiff(names(df), 't')]))/sqrt(ncol(df)-1)
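For a dplyr-only alternative (not part of the original answer; a sketch using rowwise() and c_across(), which is usually slower than rowMeans()/apply() on large data):
library(dplyr)
df %>%
  rowwise() %>%
  mutate(Mean = mean(c_across(a:d), na.rm = TRUE),
         SE   = sd(c_across(a:d), na.rm = TRUE) / sqrt(length(c_across(a:d)))) %>%
  ungroup()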
data
df <- structure(list(t = c(3999, 3998, 3997, 3996), a = c(0.00586,
0.0057, 0.0058, 0.00602), b = c(0.00986, 0.00958, 0.00962, 0.00993
), c = c(0.00728, 0.00702, 0.00711, 0.00726), d = c(0.00856,
0.00827, 0.00839, 0.00875)), class = "data.frame", row.names = c("1",
"2", "3", "4"))

Related

Dynamic summarise through dynamic multiplication with an external vector

I have a tibble such as:
X = tibble(Name = rep(c("A", "B", "C"), 5),
           Coeffs_1 = runif(15, 0, 1),
           Coeffs_2 = runif(15, 0, 1)) %>% arrange(Name)
In this example the number of "Reps" per "Name" is arbitrarily fixed at 5 and the number of Coeffs columns is arbitrarily fixed at 2, but in my real code they could be any number and I don't know them in advance.
I also have a vector with n = Reps*2 elements:
Y = runif(10, 0, 100)
In this specific case, 10 = 5*2.
My task is to summarise, for each Name and each Coeff column x, this formula:
Result_x[1]    = sum(Coeff_x[1]*Y[2]      + Coeff_x[2]*Y[3]      + ... + Coeff_x[Reps]*Y[Reps+1]) - Y[1]
Result_x[2]    = sum(Coeff_x[1]*Y[3]      + Coeff_x[2]*Y[4]      + ... + Coeff_x[Reps]*Y[Reps+2]) - Y[2]
...
Result_x[Reps] = sum(Coeff_x[1]*Y[Reps+1] + Coeff_x[2]*Y[Reps+2] + ... + Coeff_x[Reps]*Y[Reps*2]) - Y[Reps]
So that in the end, the final summarised tibble should look like:
Name  Lag   Result_1  ...      Result_x
A     +1    a number  numbers  a number
A     +2    a number  numbers  a number
A     ...   a number  numbers  a number
A     Reps  a number  numbers  a number
B     +1    a number  numbers  a number
B     +2    a number  numbers  a number
...   ...   a number  numbers  a number
...   Reps  a number  numbers  a number
The dynamic nature of the problem makes it hard for me to handle with a for loop, and the external vector, which must be re-indexed and summarised for each row of the original tibble, makes it difficult to work with a pipeline.
I thought defining a custom function could help, but again it does not fit well into pipeline code.
Split 'X' by 'Name' and loop over the resulting list with map, while creating lead-shifted versions of 'Y' in a list (shift with n given as a vector). For each shifted 'y', summarise across the 'Coeff' columns of the group by taking the sum of the products of the column values with the corresponding (length-corrected) 'y' values and subtracting the first value of 'y'.
library(dplyr)
library(purrr)
library(data.table)  # for shift() and rowid()
X %>%
  group_split(Name) %>%                                        # one tibble per Name
  map_dfr(~ map_dfr(shift(Y, n = 1:nrow(.x), type = 'lead'),   # list of lead-shifted Y vectors
                    function(y) .x %>%
                      summarise(Name = first(Name),
                                across(starts_with('Coeff'),
                                       ~ sum(. * y[seq_along(.)], na.rm = TRUE) - first(y))))) %>%
  mutate(Lag = rowid(Name))                                    # row index within each Name
Output:
# A tibble: 15 × 4
Name Coeffs_1 Coeffs_2 Lag
<chr> <dbl> <dbl> <int>
1 A 127. 54.4 1
2 A 162. 134. 2
3 A 127. 68.2 3
4 A 109. 38.0 4
5 A 108. 94.0 5
6 B 175. 197. 1
7 B 187. 240. 2
8 B 151. 200. 3
9 B 132. 159. 4
10 B 102. 152. 5
11 C 48.8 131. 1
12 C 89.1 128. 2
13 C 42.5 98.7 3
14 C 29.4 95.7 4
15 C 41.7 50.1 5

Calculate the slope for each individual

I have longitudinal data:
ID <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4)
time <- c(0, 12, 36, 0, 7, 23, 68, 0, 23, 0, 32, 45)
Age <- rnorm(12, 45, 9)
Sexe <- c("F", "F", "F", "M", "M", "M", "M", "M", "M", "F", "F", "F")
biology1 <- rnorm(12, 12, 3)
biology2 <- rnorm(12, 100, 20)
biology3 <- rnorm(12, 45, 9)
biology4 <- rnorm(12, 20, 2)
Death <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1)
data <- data.frame(ID, time, Age, Sexe, biology1, biology2, biology3, biology4, Death)
I would like to calculate the slope (from the beginning to the end of the follow-up) for each numerical variable (biology1, biology2, biology3, biology4) and for each individual; ideally a function that calculates the slope for each variable without retyping a new line of code for each variable. I have no idea how to do it.
Here's an approach with dplyr. Here are the tricks:
Use group_by to group the data for each individual.
Use summarise to perform an action for each group.
Use across to do so for multiple columns
Use starts_with to select the appropriate columns
Use list(slope = ...) to name the columns.
Use $coef to extract the coefficients and [2] to get the slope rather than the intercept.
library(dplyr)
data %>%
  group_by(ID) %>%
  summarise(across(starts_with("biology"),
                   list(slope = ~ lm(. ~ time)$coef[2])))
# A tibble: 4 x 5
ID biology1_slope biology2_slope biology3_slope biology4_slope
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 -0.0459 -1.61 -0.204 -0.00106
2 2 0.131 -0.553 0.0783 -0.0340
3 3 -0.0462 -0.427 -0.402 -0.191
4 4 -0.0524 -1.10 0.379 -0.0736
Here is a data.table option with lm
library(data.table)
dt <- as.data.table(data)  # assumed setup: convert the question's data.frame to a data.table
dt[,
   lapply(
     .SD,
     function(x) coef(lm(x ~ time, data = cbind(x, dt[, "time"])))["time"]
   ), ID,
   .SDcols = patterns("^biology")
]
which gives results like
ID biology1 biology2 biology3 biology4
1: 1 0.07223152 0.07187708 -2.960618e-02 0.022861337
2: 2 -0.05728224 0.13207814 -2.349493e-01 -0.018541982
3: 3 -0.03925044 -0.63219541 -3.166489e-05 -0.009484951
4: 4 -0.01801599 0.04758699 -6.547484e-03 -0.004253647
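For completeness, a base R sketch of the same idea (not from the original answers), splitting by ID and fitting one lm() per biology column:
# split by ID, then regress each biology column on time and keep the slope
do.call(rbind, lapply(split(data, data$ID), function(d)
  sapply(d[grep("^biology", names(d))], function(y) coef(lm(y ~ d$time))[2])))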

How to modify existing variable in the data frame based on conditions

I have to modify an existing variable in the data frame based on a condition.
df <- data.frame(a = c('upperGI', 'UpperGI', 'UpperGI', 'UpperGI'),
                 b = c('C22.0 - Liver cell carcinoma', 'C16.0 - Cardia',
                       'C15.3 - Upper third of oesophagus', 'C25.9 - Pancreas, unspecified'))
I would like to split the variable a into upperGI and HPB when the number in b lies between 22.0 and 25.0.
So the first and second should be upperGI and the third and fourth would be HPB.
I am trying to learn the dplyr package, so it would be great to see a dplyr solution (if possible).
You can try the following:
library(dplyr)
df_new <- df %>%
  mutate(num = readr::parse_number(b),
         col = if_else(between(num, 22, 25), 'upperGI', 'HPB', missing = 'upperGI'))
We use parse_number to extract the numeric code from the b column and assign 'upperGI' when that number is between 22 and 25 and 'HPB' otherwise (missing values default to 'upperGI').
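If more categories are ever needed, case_when() scales better than nested if_else(); a sketch with the same logic (assuming the same df and parsed number as above):
library(dplyr)
df %>%
  mutate(num = readr::parse_number(b),
         col = case_when(between(num, 22, 25) ~ 'upperGI',
                         is.na(num)           ~ 'upperGI',  # same default as missing = 'upperGI'
                         TRUE                 ~ 'HPB'))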
Quoting the question: "I would like to split the variable a into upperGI and HPB when b lies between 22.0 and 25.0. So the first and second should be upperGI and the third and fourth would be HPB."
The latter seems a bit odd given the data set you provide (e.g. the first value satisfies 25 >= 22.0 >= 22, whilst the second is 26.3 >= 25.0). Like Ronak Shah, I assume that you mean that values between 22 and 25 should be "upperGI". In that case, an alternative but similar base R solution to Ronak's is:
transform(df, new_col = c('HPB', 'upperGI')[(b >= 22 & b <= 25) + 1L])
#R> a b new_col
#R> 1 upperGI 22.0 upperGI
#R> 2 UpperGI 26.3 HPB
#R> 3 UpperGi 21.0 HPB
#R> 4 UpperGI 25.0 upperGI
It adds a transform call but saves you three df$s.
Data
df <- data.frame(a = c('upperGI','UpperGI','UpperGi','UpperGI'),
b = c(22.0, 26.3, 21.0, 25.0))
There is also a package called fmtr specifically designed for this situation. You can define a format using the value() function, and apply it using the fapply() function. Like this:
library(fmtr)
# Set up data
df <- data.frame(a = c('upperGI', 'UpperGI', 'UpperGI', 'UpperGI', 'UpperGI'),
                 b = c('C22.0 - Liver cell carcinoma',
                       'C16.0 - Cardia',
                       'C15.3 - Upper third of oesophagus',
                       'C25.9 - Pancreas, unspecified',
                       'C23.0 - Livery, pancreas, and biliary surgery'))
df$c <- as.numeric(substr(df$b, 2, 5))
# Define format
fmt <- value(condition(x > 22 && x <= 25, "HPB"),
             condition(TRUE, "upperGI"))
# Apply format to the numeric code in column c
df$a <- fapply(df$c, fmt)
df
# a b c
#1 upperGI C22.0 - Liver cell carcinoma 22.0
#2 upperGI C16.0 - Cardia 16.0
#3 upperGI C15.3 - Upper third of oesophagus 15.3
#4 upperGI C25.9 - Pancreas, unspecified 25.9
#5 HPB C23.0 - Livery, pancreas, and biliary surgery 23.0

Elegant way of adding columns at a specific position in a data frame

I have a data.frame with 3 cols: date, rate, price. I want to add columns that come from a matrix, after rate and before price.
df = tibble('date' = c('01/01/2000', '02/01/2000', '03/01/2000'),
            'rate' = c(7.50, 6.50, 5.54),
            'price' = c(92, 94, 96))
I computed the lags of rate using a function that outputs a matrix:
rate_Lags = matrix(data = c(NA, 7.50, 5.54, NA, NA, 7.50), ncol = 2,
                   dimnames = list(c(), c('rate_tMinus1', 'rate_tMinus2')))
I want to insert those lags after rate (and before price) using names indexing rather than column order.
The add_column function from the tibble package (Adding a column between two columns in a data.frame) does not work because it only accepts an atomic vector (hence if I have 10 lags I would have to call add_column 10 times). I could use apply on my rate_Lags matrix, but then I lose the dimnames of my rate_Lags matrix.
Using number indexing (subsetting) (https://stat.ethz.ch/pipermail/r-help/2011-August/285534.html) could work if I knew the position of a specific column name (any function that retrieves the position of a column name?).
Is there any simple way of inserting a bunch of columns in a specific position in a data frame/tibble object?
You may be overlooking the following:
library(dplyr)
I <- which(names(df) == "rate")
if (I == ncol(df)) {
  cbind(df, rate_Lags)
} else {
  cbind(select(df, 1:I), rate_Lags, select(df, (I + 1):ncol(df)))
}
# date rate rate_tMinus1 rate_tMinus2 price
# 1 0.0005 7.50 NA NA 92
# 2 0.0010 6.50 7.50 NA 94
# 3 0.0015 5.54 5.54 7.5 96
Maybe this is not very elegant, but you only call the function once and I believe it's more or less general purpose.
fun <- function(DF, M){
  nms_DF <- colnames(DF)
  nms_M <- colnames(M)
  # position of the DF column whose name occurs in the matrix's column names
  inx <- which(sapply(nms_DF, function(x) length(grep(x, nms_M)) > 0))
  cbind(DF[seq_len(inx)], M, DF[seq_along(nms_DF)[-seq_len(inx)]])
}
fun(df, rate_Lags)
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
We could unclass the dataset to a list, use append to insert 'rate_Lags' at the desired position, and then convert the list back to a data.frame:
i1 <- match('rate', names(df))
data.frame(append(unclass(df), as.data.frame(rate_Lags), after = i1))
# date rate rate_tMinus1 rate_tMinus2 price
#1 01/01/2000 7.50 NA NA 92
#2 02/01/2000 6.50 7.50 NA 94
#3 03/01/2000 5.54 5.54 7.5 96
Or with tidyverse
library(tidyverse)
rate_Lags %>%
  as_tibble %>%
  append(unclass(df), ., after = i1) %>%
  bind_cols
# A tibble: 3 x 5
# date rate rate_tMinus1 rate_tMinus2 price
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 01/01/2000 7.5 NA NA 92
#2 02/01/2000 6.5 7.5 NA 94
#3 03/01/2000 5.54 5.54 7.5 96

Using aggregate to compute monthly weighted average

I need to compute a monthly weighted average. The data frame looks like this:
Month Variable Weighting
460773 1998-06-01 11 153.00
337134 1998-06-01 9 0.96
473777 1998-06-01 10 264.00
358226 1998-06-01 6 0.52
414626 1998-06-01 10 34.00
341020 1998-05-01 9 1.64
453066 1998-05-01 5 26.00
183276 1998-05-01 8 0.51
403729 1998-05-01 6 123.00
203005 1998-05-01 11 0.89
When I use aggregate e.g.,
Output <- aggregate(Variable ~ Month, df , mean )
Output
Month Variable
1 1998-05-01 7.8
2 1998-06-01 9.2
I get correct results. However, when I try to add the weights to the aggregation, e.g.,
Output <- aggregate(Variable ~ Month, df , FUN = weighted.mean, w = df$Weighting)
I get a different-vector-lengths error:
Error in weighted.mean.default(X[[1L]], ...) :
'x' and 'w' must have the same length
Is there a way to remedy this situation?
With aggregate() it is not possible, because your weight vector is not partitioned during aggregate(). You can use by(), or split() plus sapply(), or the data.table package, or ddply() from the plyr package, or functions from the dplyr package.
example with split() plus sapply():
sapply(split(df, df$Month), function(d) weighted.mean(d$Variable, w = d$Weighting))
result:
1998-05-01 1998-06-01
5.89733 10.33142
a variant with by()
by(df, df$Month, FUN=function(d) weighted.mean(d$Variable, w = d$Weighting)) # or
unclass(by(df, df$Month, FUN=function(d) weighted.mean(d$Variable, w = d$Weighting)))
with package plyr
library(plyr)
ddply(df, ~Month, summarize, weighted.mean(Variable, w=Weighting))
with data.table
library(data.table)
setDT(df)[, weighted.mean(Variable, w = Weighting), Month]
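Or with dplyr, mentioned above but not shown there (a minimal sketch):
library(dplyr)
df %>%
  group_by(Month) %>%
  summarise(Variable = weighted.mean(Variable, w = Weighting))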
In case you don't have plyr, dplyr or data.table installed and cannot install them for some reason, it is still possible to use aggregate to compute the monthly weighted average. All you need is the following trick:
df$row <- 1:nrow(df) #the trick
aggregate(row~Month, df, function(i) mean(df$Variable[i])) #mean
aggregate(row~Month, df, function(i) weighted.mean(df$Variable[i], df$Weighting[i])) #weighted mean
Here are the outputs:
Mean:
> aggregate(row~Month, df, function(i) mean(df$Variable[i]))
Month row
1 1998-05-01 7.8
2 1998-06-01 9.2
Weighted mean:
> aggregate(row~Month, df, function(i) weighted.mean(df$Variable[i], df$Weighting[i]))
Month row
1 1998-05-01 5.89733
2 1998-06-01 10.33142
