Calculate mean in matrix subgroup (sub-matrix) in R

I have a large matrix, in which every row corresponds to a sample, and samples belong to a population. For example, the row name s1-2 means population 1 - sample 2. I would like to calculate the mean for every population, such as in the illustration (unfortunately, I cannot create a sample):
Is this possible in R? May I kindly ask for guidance?

It's not clear why you can't create a sample. Here's one for the purposes of exposition:
set.seed(1)
dimnames <- paste(rep(paste0('s', 1:3), each = 3), rep(1:3, 3), sep = '-')
m <- matrix(sample(0:5, 81, TRUE), 9, dimnames = list(dimnames, dimnames))
m
#> s1-1 s1-2 s1-3 s2-1 s2-2 s2-3 s3-1 s3-2 s3-3
#> s1-1 0 2 4 5 5 3 4 3 3
#> s1-2 3 0 4 0 3 0 1 5 4
#> s1-3 0 4 0 3 3 5 5 2 4
#> s2-1 1 4 0 0 3 1 5 0 3
#> s2-2 4 1 5 3 1 2 5 3 5
#> s2-3 2 5 4 2 3 1 0 4 4
#> s3-1 5 5 4 5 0 5 2 0 3
#> s3-2 1 1 1 1 5 5 2 0 3
#> s3-3 2 0 1 1 0 1 5 5 0
To get the mean of each row/column group, assuming we can identify the group by the first two characters of the row or column name (as in your example), we could do:
groups <- expand.grid(row = unique(substr(rownames(m), 1, 2)),
col = unique(substr(colnames(m), 1, 2)))
m2 <- matrix(unlist(Map(function(r, c) {
mean(m[grep(r, rownames(m)), grep(c, colnames(m))])
}, r = groups$row, c = groups$col)), 3,
dimnames = list(unique(substr(rownames(m), 1, 2)),
unique(substr(colnames(m), 1, 2))))
Resulting in
m2
#> s1 s2 s3
#> s1 1.888889 3.000000 3.444444
#> s2 2.888889 1.777778 3.222222
#> s3 2.222222 2.555556 2.222222
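A more compact sketch of the same idea uses tapply with the group labels expanded to one label per cell. It assumes, as above, that the population is the part of the name before the hyphen; row_grp, col_grp and block_means are names chosen here for illustration only.

```r
set.seed(1)
nms <- paste(rep(paste0("s", 1:3), each = 3), rep(1:3, 3), sep = "-")
m <- matrix(sample(0:5, 81, TRUE), 9, dimnames = list(nms, nms))

# Population label = everything before the hyphen (assumption about the naming scheme)
row_grp <- sub("-.*$", "", rownames(m))
col_grp <- sub("-.*$", "", colnames(m))

# row(m) and col(m) give the row / column index of every cell,
# so each cell gets one row-group label and one column-group label
block_means <- tapply(m, list(row_grp[row(m)], col_grp[col(m)]), mean)
block_means
```

Because the labels come from sub() rather than a fixed character width, this should also cope with population names longer than two characters, such as s10.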

Related

Conditional replacing of a numeric value in dplyr

Dear all I have a data frame that looks like this
df <- data.frame(time=c(1,2,3,4,1,2,3,4,5), type=c("A","A","A","A","B","B","B","B","B"), count=c(10,0,0,1,8,0,1,0,1))
df
time type count
1 1 A 10
2 2 A 0
3 3 A 0
4 4 A 1
5 1 B 8
6 2 B 0
7 3 B 1
8 4 B 0
9 5 B 1
I want to examine each group of types, and if a count is 0, replace every later count in that group with 0. I do not want the count to be resurrected once it has reached zero.
I want my data to look like this
time type count
1 1 A 10
2 2 A 0
3 3 A 0
4 4 A 0
5 1 B 8
6 2 B 0
7 3 B 0
8 4 B 0
9 5 B 0
If I understood correctly
library(tidyverse)
df <-
data.frame(
time = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
type = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
count = c(10, 0, 0, 1, 8, 0, 1, 0, 1)
)
df %>%
group_by(type) %>%
mutate(count = if_else(lag(count, default = first(count)) == 0, 0, count))
#> # A tibble: 9 x 3
#> # Groups: type [2]
#> time type count
#> <dbl> <chr> <dbl>
#> 1 1 A 10
#> 2 2 A 0
#> 3 3 A 0
#> 4 4 A 0
#> 5 1 B 8
#> 6 2 B 0
#> 7 3 B 0
#> 8 4 B 0
#> 9 5 B 0
Created on 2021-09-10 by the reprex package (v2.0.1)
You may use the cummin function.
library(dplyr)
df %>% group_by(type) %>% mutate(count = cummin(count))
# time type count
# <dbl> <chr> <dbl>
#1 1 A 10
#2 2 A 0
#3 3 A 0
#4 4 A 0
#5 1 B 8
#6 2 B 0
#7 3 B 0
#8 4 B 0
#9 5 B 0
Since cummin is a base R function you may also implement it in base R -
transform(df, count = ave(count, type, FUN = cummin))
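Note that cummin returns the running minimum, so a count that drops without reaching zero also lowers the values after it. If the intent is strictly "once a zero appears, everything after it in the group is zero", a minimal base R sketch could look like this (zero_after_first_zero is a helper name made up for this example):

```r
df <- data.frame(
  time  = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
  type  = c("A", "A", "A", "A", "B", "B", "B", "B", "B"),
  count = c(10, 0, 0, 1, 8, 0, 1, 0, 1)
)

# Hypothetical helper: keep values until the first zero, then force zeros
zero_after_first_zero <- function(x) ifelse(cumsum(x == 0) > 0, 0, x)

df$count <- ave(df$count, df$type, FUN = zero_after_first_zero)
df

# Where the two approaches differ:
cummin(c(10, 5, 8, 0, 3))                 # 10 5 5 0 0
zero_after_first_zero(c(10, 5, 8, 0, 3))  # 10 5 8 0 0
```

On the example data both approaches give the same result; they only diverge when a group contains a non-zero value that is smaller than a later one.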

How to replace 0 in the nth column of all dataframe in a list with another value?

I have a list of dataframes and I would like to replace 0 in the second column of all dataframes in that list.
Here's a minimal working example for the list of dataframes:
> named <- c(1, 2, 3, 4, 5, 6)
> one <- c(0, 2, 0, 4, 5, 0)
> two <- c(1, 0, 3, 0, 0, 6)
> df <- data.frame(named, one, two)
> df1 <- data.frame(named, two, one)
> listed <- list(df, df1)
>
> listed
[[1]]
named one two
1 1 0 1
2 2 2 0
3 3 0 3
4 4 4 0
5 5 5 0
6 6 0 6
[[2]]
named two one
1 1 1 0
2 2 0 2
3 3 3 0
4 4 0 4
5 5 0 5
6 6 6 0
I can replace a column of a particular dataframe (second column of second dataframe in the following code) with replace(listed[[2]][2], listed[[2]][2] == 0, 1).
But how do I do this for all the dataframes in the list? I tried:
for (i in 1:2) {
replace(listed[[i]][2], listed[[i]][2] == 0, -1)
}
But it's obviously a terrible attempt.
You can use lapply -
listed <- lapply(listed, function(x) {x[2][x[2] == 0] <- -1;x})
Or, to use replace in a for loop, assign the changed data back to the list.
for (i in seq_along(listed)) {
listed[[i]][2] <- replace(listed[[i]][2], listed[[i]][2] == 0, -1)
}
listed
#[[1]]
# named one two
#1 1 -1 1
#2 2 2 0
#3 3 -1 3
#4 4 4 0
#5 5 5 0
#6 6 -1 6
#[[2]]
# named two one
#1 1 1 0
#2 2 -1 2
#3 3 3 0
#4 4 -1 4
#5 5 -1 5
#6 6 6 0
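If the column position or the replacement value may change, the lapply approach can be wrapped in a small function. This is only a sketch; replace_zero_in_col and its arguments col and value are names invented here, not part of any package.

```r
named <- c(1, 2, 3, 4, 5, 6)
one <- c(0, 2, 0, 4, 5, 0)
two <- c(1, 0, 3, 0, 0, 6)
listed <- list(data.frame(named, one, two), data.frame(named, two, one))

# Hypothetical helper: replace zeros in column `col` of every data frame in `dfs`
replace_zero_in_col <- function(dfs, col = 2, value = -1) {
  lapply(dfs, function(d) {
    d[[col]][d[[col]] == 0] <- value
    d
  })
}

out <- replace_zero_in_col(listed)
out
```

Calling replace_zero_in_col(listed, col = 3) would target the third column instead, leaving the second untouched.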
You may also use a column index in across:
library(tidyverse)
map(listed, ~.x %>% mutate(across(2, ~replace(., .== 0, -1))))
#> [[1]]
#> named one two
#> 1 1 -1 1
#> 2 2 2 0
#> 3 3 -1 3
#> 4 4 4 0
#> 5 5 5 0
#> 6 6 -1 6
#>
#> [[2]]
#> named two one
#> 1 1 1 0
#> 2 2 -1 2
#> 3 3 3 0
#> 4 4 -1 4
#> 5 5 -1 5
#> 6 6 6 0
Created on 2021-06-26 by the reprex package (v2.0.0)
You could use map from tidyverse:
library(tidyverse)
map(listed, ~ mutate_at(.x, .vars = colnames(.x)[length(colnames(.x))],
~ case_when(. == 0 ~ -1, T ~ as.numeric(.))))
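This code goes through each dataframe in listed, selects only the last column, and changes its 0 values to -1. Note that this targets the last column rather than the second column used in your for loop; use colnames(.x)[2] to target the second column instead.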
Output:
[[1]]
named one two
1 1 0 1
2 2 2 -1
3 3 0 3
4 4 4 -1
5 5 5 -1
6 6 0 6
[[2]]
named two one
1 1 1 -1
2 2 0 2
3 3 3 -1
4 4 0 4
5 5 0 5
6 6 6 -1
An option with na_if/replace_na
library(dplyr)
library(tidyr)
library(purrr)
map(listed, ~ .x %>%
mutate(across(2, ~ replace_na(na_if(., 0), -1))))
[[1]]
named one two
1 1 -1 1
2 2 2 0
3 3 -1 3
4 4 4 0
5 5 5 0
6 6 -1 6
[[2]]
named two one
1 1 1 0
2 2 -1 2
3 3 3 0
4 4 -1 4
5 5 -1 5
6 6 6 0

"R" Warning message: In mapply longer argument not a multiple of length of shorter

I am trying to make a simple function to reverse code multiple columns in my R dataframe. I am using lapply and mapply in the function, which seems to give me the expected outcome, except for the warning message
Warning message:
In mapply("-", max_value, data[, rev]) :
longer argument not a multiple of length of shorter
To illustrate, here is some sample data
{
A = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3)
B = c(9, 2, 3, 2, 4, 0, 2, 7, 2, 8)
C = c(2, 4, 1, 0, 2, 1, 3, 0, 7, 8)
df1 = data.frame(A, B, C)
print(df1)
}
A B C
1 3 9 2
2 3 2 4
3 3 3 1
4 3 2 0
5 3 4 2
6 3 0 1
7 3 2 3
8 3 7 0
9 3 2 7
10 3 8 8
The function to reverse-code is below:
## columns to reverse-code
revcode_cols = c("A", "B")
## function to reverse code variables
reverseCode <- function(data, rev){
# get maximum value per column
max_value = lapply(data, max)
# subtract values in designated cols from max value plus 1
data[, rev] = mapply("-", max_value, data[, rev]) + 1
return(data)
}
reverseCode(df1, revcode_cols)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
and it gives the right output, but for the warning message. Just wondering which part of my script I need to fix to get rid of the warning message.
The unit of a data.frame is a column, and the unit of a vector is a single element. mapply/Map takes two inputs here, but in the OP's code the max is calculated on the whole dataset, so 'max_value' is a list of 3 elements while data[, rev] has length 2. We just need to subset the data before calculating the max:
reverseCode <- function(data, rev){
# get maximum value per column
max_value = lapply(data[rev], max)
# subtract values in designated cols from max value plus 1
data[, rev] = mapply("-", max_value, data[, rev]) + 1
return(data)
}
Testing:
reverseCode(df1, revcode_cols)
A B C
1 1 1 2
2 1 8 4
3 1 7 1
4 1 8 0
5 1 6 2
6 1 10 1
7 1 8 3
8 1 3 0
9 1 8 7
10 1 2 8
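The length mismatch behind the warning can be reproduced in isolation. The sketch below uses small made-up lists (max_all and cols are illustrative names, not the OP's objects) and captures the warning text with tryCatch:

```r
max_all <- list(A = 3, B = 9, C = 8)  # one maximum per column of the full data
cols    <- list(A = 1:2, B = 3:4)     # only two columns selected

# Lengths 3 and 2: mapply recycles the shorter input and warns
msg <- tryCatch(
  {
    mapply("-", max_all, cols)
    "no warning"
  },
  warning = function(w) conditionMessage(w)
)
msg

# Subsetting the maxima to the selected columns makes the lengths match
ok <- tryCatch(
  {
    res <- mapply("-", max_all[names(cols)], cols)
    TRUE
  },
  warning = function(w) FALSE
)
ok
```

With matching lengths, mapply pairs each maximum with its own column, which is what the corrected function relies on.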
Instead of using two apply loops you can do this in one lapply.
reverseCode <- function(data, rev){
data[rev] <- lapply(data[rev], function(x) max(x) - x + 1)
data
}
reverseCode(df1, c("A", "B"))
# A B C
#1 1 1 2
#2 1 8 4
#3 1 7 1
#4 1 8 0
#5 1 6 2
#6 1 10 1
#7 1 8 3
#8 1 3 0
#9 1 8 7
#10 1 2 8

Add a count column and count twice if a certain condition is met

I am wondering if there is a way to make a conditional column-count by a group, adding 1 to a row_number or rowid if a certain value is met (in this case 0). For example:
df<-data.frame(group=c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,3,3,3),
condition=c(1,0,1,1,1,0,0,1,1,0,1,1,0, 1),
want=c(1, 3, 4,5,1,3,5,6,7,2,3,4,6,7))
group condition want
1 1 1 1
2 1 0 3
3 1 1 4
4 1 1 5
5 2 1 1
6 2 0 3
7 2 0 5
8 2 1 6
9 2 1 7
10 3 0 2
11 3 1 3
12 3 1 4
13 3 0 6
14 3 1 7
I think this might involve making a row_number per group and then making a customized row_number but I am open to suggestions. It is kind of a work-around method to "break up" my data when a 0 appears.
Using dplyr, for each group of data (group_by(group)) we can add a column that counts from 1 to the size of the group (1:n()). Adding the cumulative sum of condition == 0 makes that counter jump by one extra step whenever the desired condition is met.
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(desired = (1:n()) + cumsum(condition == 0))
Output:
#> # A tibble: 14 x 3
#> # Groups: group [3]
#> group condition desired
#> <dbl> <dbl> <int>
#> 1 1 1 1
#> 2 1 0 3
#> 3 1 1 4
#> 4 1 1 5
#> 5 2 1 1
#> 6 2 0 3
#> 7 2 0 5
#> 8 2 1 6
#> 9 2 1 7
#> 10 3 0 2
#> 11 3 1 3
#> 12 3 1 4
#> 13 3 0 6
#> 14 3 1 7
Data:
df1 <- data.frame(group=c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3,3,3,3),
condition=c(1,0,1,1,1,0,0,1,1,0,1,1,0, 1))
You can do:
transform(df, want = ave(condition, group, FUN = function(x) cumsum(x + (x == 0) * 2 )))
group condition want
1 1 1 1
2 1 0 3
3 1 1 4
4 1 1 5
5 2 1 1
6 2 0 3
7 2 0 5
8 2 1 6
9 2 1 7
10 3 0 2
11 3 1 3
12 3 1 4
13 3 0 6
14 3 1 7
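The row-number-plus-cumulative-sum idea from the dplyr answer can also be sketched in base R with ave. The column name desired is reused from that answer; the rest is an illustration under the same assumptions about the data.

```r
df <- data.frame(group = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
                 condition = c(1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1),
                 want = c(1, 3, 4, 5, 1, 3, 5, 6, 7, 2, 3, 4, 6, 7))

# Within each group: position in the group plus the number of zeros seen so far
df$desired <- ave(df$condition, df$group,
                  FUN = function(x) seq_along(x) + cumsum(x == 0))

all.equal(df$desired, df$want)
```

seq_along(x) plays the role of 1:n() and avoids the 1:0 pitfall if a group were ever empty.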

Remove last n rows for 0 by id and then tail last n elements by Id in R

I have a dataset as attached.
I want to remove all trailing 0s in binary for each ID, and then select the last 3 rows for each ID. (Note: for ID 2 only 2 rows remain, weeks 5 and 6.) The result should look like the final dataset.
Is there any efficient way to do it?
How about something like this?
ID = c(rep(1, 8), rep(2, 4), rep(3, 8))
week = c(2:9, 5:8, 13:20)
binary = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0)
dataset = data.frame(ID, week, binary)
> dataset
ID week binary
1 1 2 0
2 1 3 1
3 1 4 1
4 1 5 1
5 1 6 1
6 1 7 0
7 1 8 0
8 1 9 0
9 2 5 1
10 2 6 1
11 2 7 0
12 2 8 0
13 3 13 1
14 3 14 1
15 3 15 1
16 3 16 0
17 3 17 0
18 3 18 1
19 3 19 0
20 3 20 0
Solution:
# Split dataset by ID
byGroupDf = split(dataset, ID)
# For each grouped dataset, keep rows up to the last row where binary is not equal to 0, then take the last 3
byGroupFinal = lapply(byGroupDf, function(x) tail(x[1:max(which(x$binary != 0)),], 3))
# Combine the grouped datasets
FinalDf = do.call(rbind, byGroupFinal)
> FinalDf
ID week binary
1.3 1 4 1
1.4 1 5 1
1.5 1 6 1
2.9 2 5 1
2.10 2 6 1
3.16 3 16 0
3.17 3 17 0
3.18 3 18 1
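One edge case worth guarding against: if an ID has no non-zero binary value at all, which() returns an empty vector and max() of it is -Inf. A hedged variant of the solution, with a made-up helper name trim_and_tail, returns zero rows for such a group instead:

```r
ID <- c(rep(1, 8), rep(2, 4), rep(3, 8))
week <- c(2:9, 5:8, 13:20)
binary <- c(0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0)
dataset <- data.frame(ID, week, binary)

# Hypothetical helper: drop trailing zeros, then keep at most the last n rows
trim_and_tail <- function(d, n = 3) {
  nz <- which(d$binary != 0)
  if (length(nz) == 0) return(d[0, ])  # group contains only zeros
  tail(d[seq_len(max(nz)), ], n)
}

res <- do.call(rbind, lapply(split(dataset, dataset$ID), trim_and_tail))
res

# A group consisting only of zeros yields an empty data frame
all_zero <- trim_and_tail(data.frame(ID = 4, week = 1:2, binary = c(0, 0)))
nrow(all_zero)
```

Splitting on dataset$ID rather than the free-standing ID vector also keeps the code working when the vectors used to build the data frame are no longer in the workspace.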
