Get summaries of repeated consecutive values by row in R

I'm trying to get some statistics (min, max, mean) of repeated values by row in R.
My dataframe looks similar to this:
b <- as.data.frame(matrix(ncol=7, nrow=3,
c(3,NA,NA,4,5,NA,7,6,NA,7,NA,8,9,NA,NA,4,6,NA,NA,7,NA), byrow = TRUE))
For each row, I want to add columns with the max, min and mean of the lengths of the runs of consecutive NAs. The result should look something like this:
V1 V2 V3 V4 V5 V6 V7 max min mean
1 3 NA NA 4 5 NA 7 2 1 1.5
2 6 NA 7 NA 8 9 NA 1 1 1.0
3 NA 4 6 NA NA 7 NA 2 1 1.33
This is just a small example of my dataset with 2000 rows and 48 columns.
Does anyone have some code for this?

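You can apply over the rows and get the "runs" of non-NA columns with rle. Once you have that, you can simply take the summary stats of those. Note that this summarises the runs of non-NA values; to summarise the runs of NAs, as in the question, use rle(is.na(x)) instead:
b[, c("mean", "max", "min")] <- do.call(rbind, apply(b, 1, function(x) {
  res <- rle(!is.na(x))                       # runs of non-NA (TRUE) and NA (FALSE)
  res2 <- res[["lengths"]][res[["values"]]]   # lengths of the non-NA runs only
  data.frame(mean = mean(res2), max = max(res2), min = min(res2))
}))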
b
# V1 V2 V3 V4 V5 V6 V7 mean max min
#1 3 NA NA 4 5 NA 7 1.333333 2 1
#2 6 NA 7 NA 8 9 NA 1.333333 2 1
#3 NA 4 6 NA NA 7 NA 1.500000 2 1
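For the runs of NAs that the question asks about, a minimal base R sketch could look like the following. The helper name na_run_stats and the choice to return NA for a row that contains no NA at all are assumptions made for illustration, not part of the answer above.

```r
# Example data from the question
b <- as.data.frame(matrix(c(3, NA, NA, 4, 5, NA, 7,
                            6, NA, 7, NA, 8, 9, NA,
                            NA, 4, 6, NA, NA, 7, NA),
                          ncol = 7, nrow = 3, byrow = TRUE))

# Hypothetical helper: summarise the lengths of the NA runs in one row
na_run_stats <- function(x) {
  runs <- rle(is.na(x))
  len <- runs$lengths[runs$values]   # keep only the runs of NA
  if (length(len) == 0) {            # assumption: a row without NA yields NA
    return(c(max = NA_real_, min = NA_real_, mean = NA_real_))
  }
  c(max = max(len), min = min(len), mean = mean(len))
}

# apply() returns one column per row of b, so transpose before binding
stats <- t(apply(b, 1, na_run_stats))
res <- cbind(b, stats)
```

On the example data this gives the max, min and mean columns in the same order as the table in the question.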

A dplyr solution with rle, which computes the lengths of runs of equal values in a vector.
library(dplyr)
b %>%
  cbind(b %>% rowwise() %>% do(rl = rle(is.na(.))$lengths[rle(is.na(.))$values])) %>%
  rowwise() %>%
  mutate(max = max(rl),
         min = min(rl),
         mean = mean(rl)) %>%
  select(-rl)
# V1 V2 V3 V4 V5 V6 V7 max min mean
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <dbl>
# 1 3 NA NA 4 5 NA 7 2 1 1.50
# 2 6 NA 7 NA 8 9 NA 1 1 1.00
# 3 NA 4 6 NA NA 7 NA 2 1 1.33

Related

Using Mutate and conditional statement to calculate column means [duplicate]

This question already has answers here:
Average across Columns in R, excluding NAs
I have a dataset with an identifier variable and some numeric variables. I want to calculate the mean of the columns according to the identifier variable. Here is a simple example:
From this
id v1 v2 v3 v4
d 1 2 NA NA
e NA NA 3 3
e NA NA 2 4
d 3 5 NA NA
I want to get to this:
id v1 v2 v3 v4 mean
d 1 2 NA NA 1.5
e NA NA 3 3 3
e NA NA 2 4 3
d 3 5 NA NA 4
I would like to use an if-else statement, something like this (pseudocode):
ifelse(id == "d", colMeans(v1:v2), colMeans(v3:v4))
Thank you in advance!
Use rowwise() with c_across() and drop the NAs when taking the mean:
df %>%
  rowwise() %>%
  mutate(mean = mean(c_across(v1:v4), na.rm = TRUE))
# A tibble: 4 x 6
# Rowwise:
id v1 v2 v3 v4 mean
<chr> <int> <int> <int> <int> <dbl>
1 d 1 2 NA NA 1.5
2 e NA NA 3 3 3
3 e NA NA 2 4 3
4 d 3 5 NA NA 4
Or:
df %>%
  rowwise() %>%
  mutate(mean = mean(c_across(where(is.numeric)), na.rm = TRUE))
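Because every row only has values in the columns that belong to its id, no conditional is needed. As a hedged base R alternative, rowMeans with na.rm = TRUE should give the same column. The data frame below is rebuilt by hand from the table in the question.

```r
# Data from the question, typed in by hand
df <- data.frame(id = c("d", "e", "e", "d"),
                 v1 = c(1, NA, NA, 3),
                 v2 = c(2, NA, NA, 5),
                 v3 = c(NA, 3, 2, NA),
                 v4 = c(NA, 3, 4, NA),
                 stringsAsFactors = FALSE)

# Pick the numeric columns, then average each row while ignoring NA
num_cols <- vapply(df, is.numeric, logical(1))
df$mean <- rowMeans(df[, num_cols], na.rm = TRUE)
```

rowMeans is vectorised over rows, so it avoids the per-row evaluation that rowwise() implies on larger data.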

Why does my function not overwrite part of my variable?

I wrote a function that is supposed to count how many NA's there are per column. Before I packed everything into a function it worked. Now it doesn't.
I bet it is just a stupid beginner mistake, but I could use your help on this.
My thought is that the statement
x[nrow(x),i] <- aux_count
does not properly assign my stuff, and I wonder why.
The following code shows my function and demonstrates the problem.
check_Quandl_tibble <- function(x){
  for(i in 2:ncol(x)){
    aux_count <- 0
    for(j in 1:(nrow(x)-1)){
      if(is.na(x[j,i])){
        aux_count <- aux_count + 1
      }
    }
    x[nrow(x),i] <- aux_count
  }
}
a <- matrix(c(1,4, NA, 81), nrow = 5, ncol = 5)
a <- rbind(a, rep(NA, ncol(a)))
a <- as_tibble(a)
# a now looks like this
# A tibble: 6 x 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 NA 81 1
2 4 NA 81 1 4
3 NA 81 1 4 NA
4 81 1 4 NA 81
5 1 4 NA 81 1
6 NA NA NA NA NA
a <- check_Quandl_tibble(a)
# a now looks like this
# A tibble: 6 x 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 NA 81 1
2 4 NA 81 1 4
3 NA 81 1 4 NA
4 81 1 4 NA 81
5 1 4 NA 81 1
6 NA NA NA NA NA
# instead I wanted
# A tibble: 6 x 5
V1 V2 V3 V4 V5
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 4 NA 81 1
2 4 NA 81 1 4
3 NA 81 1 4 NA
4 81 1 4 NA 81
5 1 4 NA 81 1
6 1 1 2 1 1 # this row is supposed to count the NA's per column.
We can take the colSums of the logical matrix (is.na(a)) and rbind the result to the matrix:
rbind(a, colSums(is.na(a)))
Here, it is assumed that 'a' is the matrix from the first line of code:
a <- matrix(c(1,4, NA, 81), nrow = 5, ncol = 5)
If we want to replace the last row after creating the tibble:
a %>%
  mutate_all(list(~ replace(., n(), sum(is.na(.[-n()])))))
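As for why the original function seems to do nothing: the assignment inside the function changes a local copy of x, and the last expression of the function is the for loop, which evaluates to NULL, so the modified copy is never returned. A hedged sketch of a corrected version follows. The name count_na_last_row is hypothetical, a plain data frame stands in for the tibble, and all columns are counted (the original loop starts at column 2) so that the result matches the desired output.

```r
# Hypothetical corrected version: returns the modified copy explicitly
count_na_last_row <- function(x) {
  n <- nrow(x)
  for (i in seq_len(ncol(x))) {
    # count the NAs in all rows except the last and store the count there
    x[n, i] <- sum(is.na(x[seq_len(n - 1), i]))
  }
  x
}

# Same values as in the question, recycled explicitly to avoid a warning
a <- matrix(rep_len(c(1, 4, NA, 81), 25), nrow = 5, ncol = 5)
a <- as.data.frame(rbind(a, rep(NA, ncol(a))))
a <- count_na_last_row(a)
```

The colSums approach is shorter and faster; the loop version is only meant to show the missing return value.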

Iteratively shift variables in data

Some example of my data:
library(tidyverse)
set.seed(1234)
df <- tibble(
v1 = c(1:6),
v2 = rnorm(6, 5, 2) %>% round,
v3 = rnorm(6, 4, 2) %>% round,
v4 = rnorm(6, 4, 1) %>% round %>% lag(1),
v5 = rnorm(6, 6, 2) %>% round %>% lag(2),
v6 = rnorm(6, 5, 3) %>% round %>% lag(3),
v7 = rnorm(6, 5, 3) %>% round %>% lag(4))
v1 v2 v3 v4 v5 v6 v7
1 1 3 3 NA NA NA NA
2 2 6 3 3 NA NA NA
3 3 7 3 4 4 NA NA
4 4 0 2 5 11 3 NA
5 5 6 3 4 6 1 8
6 6 6 2 3 5 7 4
I want to shift the columns along the diagonal that separates the NAs from the filled data.
So, desired output looks like this:
v1 v2 v3 v4 v5 v6 v7
1 NA NA 3 3 4 3 8
2 NA 3 3 4 11 1 4
3 1 6 3 5 6 7 NA
4 2 7 2 4 5 NA NA
5 3 0 3 4 NA NA NA
6 4 6 2 NA NA NA NA
7 5 6 NA NA NA NA NA
8 6 NA NA NA NA NA NA
Each column around v3 is just shifted by 1, 2, 3, etc. rows down or up.
I tried to achieve this inside dplyr::mutate_all(), but I failed to iterate over the columns with the lag() and lead() functions.
EDIT: following @wibeasley's advice, I came up with this:
df %>%
  mutate(dummy1 = c(3:8)) %>%
  gather("var", "val", -dummy1) %>%
  mutate(
    dummy2 = sub("v", "", var, fixed = TRUE),
    dummy3 = dummy1 - as.numeric(dummy2) + 1) %>%
  select(-dummy1, -dummy2) %>%
  spread(var, val) %>%
  slice(-c(1:4)) %>% select(-dummy3)
Looks ugly, but works.
We can use lapply to handle each column, moving the NAs to the end of the column.
df[] <- lapply(df, function(x) c(x[!is.na(x)], x[is.na(x)]))
df
# # A tibble: 6 x 7
# v1 v2 v3 v4 v5 v6 v7
# <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 3 3 3 4 3 8
# 2 2 6 3 4 11 1 4
# 3 3 7 3 5 6 7 NA
# 4 4 0 2 4 5 NA NA
# 5 5 6 3 3 NA NA NA
# 6 6 6 2 NA NA NA NA
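The result above keeps six rows, so v1 and v2 are not pushed down as in the desired output. A hedged base R sketch that also adds the leading NAs is shown below. The helper name shift_around and its pivot argument are hypothetical, and the data are hard-coded from the table printed in the question rather than regenerated from the seed.

```r
# Data as printed in the question
df <- data.frame(
  v1 = 1:6,
  v2 = c(3, 6, 7, 0, 6, 6),
  v3 = c(3, 3, 3, 2, 3, 2),
  v4 = c(NA, 3, 4, 5, 4, 3),
  v5 = c(NA, NA, 4, 11, 6, 5),
  v6 = c(NA, NA, NA, 3, 1, 7),
  v7 = c(NA, NA, NA, NA, 8, 4)
)

# Hypothetical helper: columns left of `pivot` are pushed down,
# columns right of it are pulled up by dropping their leading NAs
shift_around <- function(df, pivot = 3) {
  pad   <- pmax(pivot - seq_along(df), 0)         # leading NAs per column
  vals  <- lapply(df, function(x) x[!is.na(x)])   # non-NA values, in order
  n_out <- max(pad + lengths(vals))               # rows needed in the result
  out <- Map(function(v, p) {
    col <- c(rep(NA, p), v)
    length(col) <- n_out                          # pad the tail with NA
    col
  }, vals, pad)
  as.data.frame(out)
}

shifted <- shift_around(df)
```

With pivot = 3 the result has eight rows, with v1 starting in row 3 and v2 in row 2, as in the desired output.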

Transpose multiple columns as column names and fill with values in R

The sample data are as follows:
x <- read.table(header=T, text="
ID CostType1 Cost1 CostType2 Cost2
1 a 10 c 1
2 b 2 c 20
3 a 1 b 50
4 a 40 c 1
5 c 2 b 30
6 a 60 c 3
7 c 10 d 1
8 a 20 d 2")
I want the values of the cost-type columns (CostType1 and CostType2) to become the names of new columns, filled with the corresponding cost for each cost type. Where there is no match, the cell should be NA. The ideal format is the following:
a b c d
1 10 NA 1 NA
2 NA 2 20 NA
3 1 50 NA NA
4 40 NA 1 NA
5 NA 30 2 NA
6 60 NA 3 NA
7 NA NA 10 1
8 20 NA NA 2
A solution using tidyverse. We first work out how many groups there are (two in this example). We then convert each group to wide format, combine the results, and summarise the data frame by taking the first non-NA value in each column.
library(tidyverse)
# Number of CostType/Cost column pairs
g <- (ncol(x) - 1)/2
x2 <- map_dfr(1:g, function(i){
  # Transform the data frame one group at a time
  x %>%
    select(ID, ends_with(as.character(i))) %>%
    spread(paste0("CostType", i), paste0("Cost", i))
}) %>%
  group_by(ID) %>%
  # Select the first non-NA value if there are multiple values
  summarise_all(~ first(.[!is.na(.)]))
x2
# # A tibble: 8 x 5
# ID a b c d
# <int> <int> <int> <int> <int>
# 1 1 10 NA 1 NA
# 2 2 NA 2 20 NA
# 3 3 1 50 NA NA
# 4 4 40 NA 1 NA
# 5 5 NA 30 2 NA
# 6 6 60 NA 3 NA
# 7 7 NA NA 10 1
# 8 8 20 NA NA 2
A base solution using reshape
x1 <- setNames(x[,c("ID", "CostType1", "Cost1")], c("ID", "CostType", "Cost"))
x2 <- setNames(x[,c("ID", "CostType2", "Cost2")], c("ID", "CostType", "Cost"))
reshape(data=rbind(x1, x2), idvar="ID", timevar="CostType", v.names="Cost", direction="wide")
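Another base R option is to fill an NA matrix by matrix indexing. This is only a sketch: the names long, types and wide are made up for illustration, the data are typed in from the question, and it assumes each ID has at most one cost per cost type (a later value would overwrite an earlier one).

```r
# Sample data from the question
x <- data.frame(
  ID        = 1:8,
  CostType1 = c("a", "b", "a", "a", "c", "a", "c", "a"),
  Cost1     = c(10, 2, 1, 40, 2, 60, 10, 20),
  CostType2 = c("c", "c", "b", "c", "b", "c", "d", "d"),
  Cost2     = c(1, 20, 50, 1, 30, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Stack the two (type, cost) pairs into long form
long <- data.frame(ID   = rep(x$ID, 2),
                   type = c(x$CostType1, x$CostType2),
                   cost = c(x$Cost1, x$Cost2),
                   stringsAsFactors = FALSE)

# Start from an all-NA matrix with one row per ID and one column per type
types <- sort(unique(long$type))
wide <- matrix(NA_real_, nrow = nrow(x), ncol = length(types),
               dimnames = list(x$ID, types))

# Fill each (ID, type) cell; cells without a match stay NA
wide[cbind(match(long$ID, x$ID), match(long$type, types))] <- long$cost
```

The result is a numeric matrix with the IDs as row names; as.data.frame(wide) converts it if a data frame is needed.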

Sort dataframe in a function

I am trying to create a function which takes a dataframe and the columns by which I want to sort as arguments. This is what I have come up with:
sortDf <- function(df, columns){
  df <- df[order(df[,columns]),]
  return(df)
}
This is my use case:
set.seed(24)
dataset <- matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5)
df <- as.data.frame(dataset)
sortedDf <- sortDf(df, c('V1', 'V2'))
However, I get this as a result:
V1 V2 V3 V4 V5
3 1 1 5 3 4
5 1 5 2 5 2
NA NA NA NA NA NA
NA.1 NA NA NA NA NA
NA.2 NA NA NA NA NA
NA.3 NA NA NA NA NA
1 5 2 1 2 5
4 5 2 1 2 1
NA.4 NA NA NA NA NA
2 NA 4 NA 1 4
The dataframe is kinda sorted but where does the 'NA' come from and how can I remove them? What do I do wrong? I want to sort descending. Thanks in advance.
We can create a different function
f1 <- function(dat, cols){
dat[do.call(order, dat[cols]),]
}
f1(df, c("V1", "V2"))
# V1 V2 V3 V4 V5
#2 1 1 2 1 3
#1 1 5 3 5 NA
#5 3 1 1 NA 1
#4 3 4 4 3 NA
#3 4 4 4 NA 4
In the OP's code, order is applied to a data.frame instead of to vectors, which is why the result has more rows than df, several of them filled with NA. order can be called on the columns either separately or within do.call, i.e.
df[order(df$V1, df$V2),]
# V1 V2 V3 V4 V5
#2 1 1 2 1 3
#1 1 5 3 5 NA
#5 3 1 1 NA 1
#4 3 4 4 3 NA
#3 4 4 4 NA 4
gives the same result as f1. So the columns can either be named individually (which gets unwieldy when there are many columns) or passed together via do.call.
This can also be implemented with quosures, using the devel version of dplyr (soon to be released as 0.6.0). The input vector is converted to quosures (parse_quosures) and then evaluated by unquoting it (!!!) in arrange:
library(dplyr)
f2 <- function(dat, cols){
cols <- rlang::parse_quosures(paste(cols, collapse=";"))
dat %>%
arrange(!!! cols)
}
f2(df, c("V1", "V2"))
# V1 V2 V3 V4 V5
#1 1 1 2 1 3
#2 1 5 3 5 NA
#3 3 1 1 NA 1
#4 3 4 4 3 NA
#5 4 4 4 NA 4
data
set.seed(24)
df <- as.data.frame(matrix(sample(c(NA, 1:5), 25, replace = TRUE), 5))
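The question also asks for a descending sort, which the functions above do not do. A minimal sketch, assuming all key columns should be sorted in decreasing order with NAs last; the name sort_desc and the toy data frame are hypothetical.

```r
# Hypothetical helper: sort by several columns, all descending, NA last
sort_desc <- function(dat, cols) {
  keys <- unname(as.list(dat[cols]))   # one sort key per column
  dat[do.call(order, c(keys, list(decreasing = TRUE, na.last = TRUE))), ]
}

# Toy data, not the seeded example from the question
toy <- data.frame(V1 = c(1, 3, NA, 3, 5),
                  V2 = c(2, 1, 4, 4, NA))
sorted <- sort_desc(toy, c("V1", "V2"))
```

The keys are unnamed before the call so that a column name cannot clash with an argument of order.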
