Display sum after each row r [duplicate] - r

This question already has answers here:
Calculating cumulative sum for each row
(6 answers)
Closed 2 years ago.
df <- as_tibble(a <- c(1,2,3))
df
# A tibble: 3 x 1
value
<dbl>
1 1
2 2
3 3
The goal is this:
# A tibble: 3 x 2
value Sum
<dbl>
1 1 1
2 2 3
3 3 6
So just display the sum after each row. 1 = 1. 1+2 = 3. 3+3 = 6, and so on. I guess it's kinda easy, maybe with rowSums?

It is a cumulative sum. In R, cumsum does that (rowSums would instead add across the columns of each row):
df$Sum <- cumsum(df$value)
We could do the same while constructing the tibble:
library(tibble)
df <- tibble(value = 1:3, Sum = cumsum(value))
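As a small self-contained sketch of the difference between cumsum and rowSums (base R only; the data frame wide and its column other are made-up names used purely for illustration): cumsum accumulates down one column, while rowSums adds across the columns within each row.

```r
# Same values as in the question, stored in a plain data frame.
df <- data.frame(value = c(1, 2, 3))

# cumsum() accumulates down the column: 1, 1 + 2, 1 + 2 + 3.
df$Sum <- cumsum(df$value)

# rowSums() works across columns within each row, so it answers a
# different question. 'wide' and 'other' are hypothetical.
wide <- data.frame(value = c(1, 2, 3), other = c(10, 20, 30))
across_cols <- rowSums(wide)

print(df)
print(across_cols)
```

Here df$Sum holds the running total asked for in the question, whereas across_cols holds value + other for each row.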

Related

Use replicate to create new variable

I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' that counts the measurement occasions within each individual. But since every ID has a different number of rows, I am not sure how to implement this. Does anybody know how to include that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
group_by(ID) |>
mutate(times = row_number()) |>
ungroup()
With base R we could build the sequence from the run lengths of the ID variable (this assumes the rows of each ID are contiguous, as they are here):
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
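The rle approach counts within runs of identical IDs, so it relies on each ID's rows sitting next to each other. As a hedged sketch of a base R alternative that does not depend on row order, ave can apply seq_along within each ID. The vector ids below is a small made-up example, not the simulated data above.

```r
# Hypothetical IDs where the rows of ID 1 are NOT contiguous.
ids <- c(1, 1, 2, 1, 2)

# ave() splits by ID, numbers the rows within each ID, and puts the
# results back in the original row positions.
times_ave <- ave(seq_along(ids), ids, FUN = seq_along)

# rle() only sees runs of equal neighbours, so the counter restarts
# every time the ID changes from one row to the next.
times_rle <- sequence(rle(ids)$lengths)

print(times_ave)  # 1 2 1 3 2
print(times_rle)  # 1 2 1 1 1
```

On data sorted by ID, as in the question, both approaches give the same result.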

R data imputation from group_by table [duplicate]

This question already has answers here:
How to replace NA with mean by group / subset?
(5 answers)
Closed 7 months ago.
group = c(1,1,4,4,4,5,5,6,1,4,6)
animal = c('a','b','c','c','d','a','b','c','b','d','c')
sleep = c(14,NA,22,15,NA,96,100,NA,50,2,1)
test = data.frame(group, animal, sleep)
print(test)
group_animal = test %>% group_by(`group`, `animal`) %>% summarise(mean_sleep = mean(sleep, na.rm = TRUE))
I would like to replace the NA values in the sleep column with the mean sleep value grouped by group and animal.
Is there any way that I can perform some sort of lookup like Excel that matches group and animal from the test dataframe to the group_animal dataframe and replaces the NA value in the sleep column from the test df with the sleep value in the group_animal df?
We could use mutate instead of summarise, because summarise returns a single row per group while mutate keeps every original row:
library(dplyr)
library(tidyr)
test <- test %>%
group_by(group, animal) %>%
mutate(sleep = replace_na(sleep, mean(sleep, na.rm = TRUE))) %>%
ungroup()
-output
test
# A tibble: 11 × 3
group animal sleep
<dbl> <chr> <dbl>
1 1 a 14
2 1 b 50
3 4 c 22
4 4 c 15
5 4 d 2
6 5 a 96
7 5 b 100
8 6 c 1
9 1 b 50
10 4 d 2
11 6 c 1
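For comparison, a minimal base R sketch of the same group-wise imputation, without dplyr or tidyr. The names key, grp_mean and sleep_filled are introduced here for illustration only. ave returns the group mean aligned with every row, which plays the role of the Excel-style lookup the question asks about.

```r
group  <- c(1, 1, 4, 4, 4, 5, 5, 6, 1, 4, 6)
animal <- c("a", "b", "c", "c", "d", "a", "b", "c", "b", "d", "c")
sleep  <- c(14, NA, 22, 15, NA, 96, 100, NA, 50, 2, 1)
test <- data.frame(group, animal, sleep)

# One key per (group, animal) combination.
key <- paste(test$group, test$animal, sep = "_")

# Mean of the non-missing values in each combination, repeated for
# every row that belongs to that combination.
grp_mean <- ave(test$sleep, key, FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; use the group mean only where sleep is NA.
test$sleep_filled <- ifelse(is.na(test$sleep), grp_mean, test$sleep)

print(test)
```

If every value in a combination is NA, mean(..., na.rm = TRUE) returns NaN, so such rows would need separate handling.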

Average values from multiple data frames by position [duplicate]

This question already has an answer here:
Average Cells of Two or More DataFrames
(1 answer)
Closed 1 year ago.
I have two dataframes:
dataA <- data.frame(A = replicate(5, 1), B = replicate(5, 2))
dataB <- data.frame(A = replicate(5, 3), B = replicate(5, 4))
I would like to create a third data frame dataC that is the average of the other two. For example, row 1 column 1 in the third data frame would be the average of the same position in the first two data frames.
Desired output:
dataC <- data.frame(A = replicate(5, 2), B = replicate(5, 3))
dataC
A B
2 3
2 3
2 3
2 3
2 3
We can place the datasets in a list, do an elementwise sum with +, and divide by the number of datasets (here 2):
Reduce(`+`, list(dataA, dataB))/2
-output
A B
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
Or another option is to bind the datasets while creating a grouping column based on sequence and then do the group by mean
library(dplyr)
library(data.table)
bind_rows(dataA, dataB, .id = 'grp') %>%
group_by(grp = rowid(grp)) %>%
summarise(across(everything(), mean)) %>%
select(-grp)
-output
# A tibble: 5 x 2
A B
<dbl> <dbl>
1 2 3
2 2 3
3 2 3
4 2 3
5 2 3
Here are some solutions:
# method 1:
dataC <- (dataA + dataB) / 2
# method 2:
dataC <- dataA
dataC[] <- Map(function(x,y) (x+y)/2, dataA, dataB)
# A B
# 1 2 3
# 2 2 3
# 3 2 3
# 4 2 3
# 5 2 3
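The Reduce approach extends to any number of data frames if the divisor is taken from the length of the list instead of being hard-coded. A minimal sketch, assuming all data frames have the same dimensions and numeric columns in the same order; dataD and frames are hypothetical names added for this example.

```r
dataA <- data.frame(A = rep(1, 5), B = rep(2, 5))
dataB <- data.frame(A = rep(3, 5), B = rep(4, 5))
dataD <- data.frame(A = rep(5, 5), B = rep(6, 5))  # hypothetical third input

frames <- list(dataA, dataB, dataD)

# Sum the data frames cell by cell, then divide by how many there are.
dataC <- Reduce(`+`, frames) / length(frames)

print(dataC)
```

With these three inputs every cell of column A averages to 3 and every cell of column B to 4.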

R lag across arbitrary number of missing values [duplicate]

This question already has answers here:
Replacing NAs with latest non-NA value
(21 answers)
Closed 1 year ago.
library(tidyverse)
testdata <- tibble(ID=c(1,NA,NA,2,NA,3),
Observation = LETTERS[1:6])
testdata1 <- testdata %>%
mutate(
ID1 = case_when(
is.na(ID) ~ lag(ID, 1),
TRUE ~ ID
)
)
testdata1
I have a dataset like testdata, with a valid ID only when ID changes. There can be an arbitrary number of records in a set, but the above case_when and lag() structure does not fill in ID for all records, just for record 2 in each group. Is there a way to get the 3rd (or deeper) IDs filled with the appropriate value?
We can use fill from the tidyr package. Since you are loading tidyverse, tidyr is already included.
testdata1 <- testdata %>%
fill(ID)
testdata1
# # A tibble: 6 x 2
# ID Observation
# <dbl> <chr>
# 1 1 A
# 2 1 B
# 3 1 C
# 4 2 D
# 5 2 E
# 6 3 F
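Or we can use na.locf from the zoo package. Setting na.rm = FALSE keeps any leading NA in place, so the result always has the same length as the input.
library(zoo)
testdata1 <- testdata %>%
mutate(ID = na.locf(ID, na.rm = FALSE))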
testdata1
# # A tibble: 6 x 2
# ID Observation
# <dbl> <chr>
# 1 1 A
# 2 1 B
# 3 1 C
# 4 2 D
# 5 2 E
# 6 3 F
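The lag attempt in the question only looks one row back, which is why the third row of each set stays NA. As a rough sketch of the same fill without any packages, the running count of non-missing values can be used as an index into the non-missing values. The names idx and filled are introduced here for illustration.

```r
id <- c(1, NA, NA, 2, NA, 3)

# Running count of non-missing values: 1 1 1 2 2 3.
# Each row gets the position of the latest non-NA value seen so far.
idx <- cumsum(!is.na(id))

# Prepend NA and shift the index by one so that rows before the first
# non-NA value (idx == 0) stay NA instead of being dropped.
filled <- c(NA, id[!is.na(id)])[idx + 1]

print(filled)  # 1 1 1 2 2 3
```

This carries the last observed value forward across any number of consecutive NAs.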

R reset counter based on two columns [duplicate]

This question already has an answer here:
R code to assign a sequence based off of multiple variables [duplicate]
(1 answer)
Closed 3 years ago.
I have the following kind of data and I need the output shown in the second data frame:
a <- c(1,1,1,1,2,2,2,2,2,2,2)
b <- c(1,1,1,2,3,3,3,3,4,5,6)
d <- c(1,2,3,4,1,2,3,4,5,6,7)
df <- as.data.frame(cbind(a,b,d))
output <- c(1,1,1,2,1,1,1,1,2,3,4)
df_output <- as.data.frame(cbind(df,output))
I have tried cumsum and I am not able to get the desired results. Please guide. Regards, Enthu.
The counter should reset to 1 whenever the value in column a changes.
Within each value of a, rows that share the same value of b should get the same number, and each change in b should increment the counter by 1.
For example, at the 5th record a changes and b is 3, so the counter resets to 1. Rows 5 to 8 all have b equal to 3, so they all get 1, and each later change in b adds 1.
We can group by column 'a' and then create the new column by matching each value of 'b' against the unique values of 'b' within that group:
library(dplyr)
df2 <- df %>%
group_by(a) %>%
mutate(out = match(b, unique(b)))
df2
# A tibble: 11 x 4
# Groups: a [2]
# a b d out
# <dbl> <dbl> <dbl> <int>
# 1 1 1 1 1
# 2 1 1 2 1
# 3 1 1 3 1
# 4 1 2 4 2
# 5 2 3 1 1
# 6 2 3 2 1
# 7 2 3 3 1
# 8 2 3 4 1
# 9 2 4 5 2
#10 2 5 6 3
#11 2 6 7 4
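Or another option is to coerce 'b' to a factor and then to integer. Note that factor sorts its levels, so this agrees with the match approach only when 'b' is increasing within each group, as it is here: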
df %>%
group_by(a) %>%
mutate(out = as.integer(factor(b)))
data
df <- data.frame(a, b, d)
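A base R sketch of the same idea, using ave to apply match(b, unique(b)) within each value of a. The second part uses a small made-up vector, b_unsorted, to show where the match and factor approaches differ.

```r
a <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2)
b <- c(1, 1, 1, 2, 3, 3, 3, 3, 4, 5, 6)
df <- data.frame(a, b)

# Within each value of a, number the distinct b values in order of
# first appearance.
df$out <- ave(df$b, df$a, FUN = function(x) match(x, unique(x)))
print(df)

# Hypothetical group where b is not increasing.
b_unsorted <- c(3, 3, 1)
by_first_seen   <- match(b_unsorted, unique(b_unsorted))  # 1 1 2
by_sorted_level <- as.integer(factor(b_unsorted))         # 2 2 1
```

match with unique numbers the values by first appearance, while factor numbers them by sorted order, so the two agree only when b never decreases inside a group.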
