Cumulative sum of dataframe up to a certain line (row) - r

I would like the cumulative sum of the following data:
c1 c2 c3
1 3 6 3
2 4 3 2
3 6 2 5
4 1 5 4
5 0 0 0
6 0 0 0
but only up to the 4th line (row). For example, the following code produces the general cumulative sum of the data frame over all rows:
library(tibble)  # D below is a tibble; cumsum() itself is base R
D <- structure(list(c1 = c(3, 4, 6, 1, 0, 0), c2 = c(6, 3, 2, 5, 0,
0), c3 = c(3, 2, 5, 4, 0, 0)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
D
csD <- cumsum(D)
csD
resulting with
c1 c2 c3
1 3 6 3
2 7 9 5
3 13 11 10
4 14 16 14
5 14 16 14
6 14 16 14
However, I would like to have:
c1 c2 c3
1 3 6 3
2 7 9 5
3 13 11 10
4 14 16 14
5 0 0 0
6 0 0 0

Multiply the cumulative sum by a logical mask so that every cell where the original data is 0 becomes 0:
csD * (D != 0)
c1 c2 c3
1 3 6 3
2 7 9 5
3 13 11 10
4 14 16 14
5 0 0 0
6 0 0 0
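Note that this mask zeroes every cell where the original value is 0, including any zeros that occur before row 4. A variant that only blanks the all-zero rows (a sketch, assuming the rows to be excluded are entirely zero) is:
zero_rows <- rowSums(D) == 0   # TRUE for rows 5 and 6 in this example
csD <- cumsum(D)
csD[zero_rows, ] <- 0          # keep the cumsum elsewhere, even past interior zeros
csD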

Does this work? It splits the data at the first all-zero row and takes the cumulative sum of each piece:
z <- min(which(rowSums(D) == 0))  # first all-zero row
rbind(cumsum(D[1:(z - 1), ]), cumsum(D[z:nrow(D), ]))
# A tibble: 6 x 3
c1 c2 c3
<dbl> <dbl> <dbl>
1 3 6 3
2 7 9 5
3 13 11 10
4 14 16 14
5 0 0 0
6 0 0 0

Maybe not the most optimal way, but you can define N and use apply() and rbind() like this:
# Define N
N <- 4
# Compute: cumsum over the first N rows, then append the remaining rows unchanged
newdf <- rbind(apply(D, 2, function(x) cumsum(x[1:N])),
               D[(N + 1):nrow(D), ])
Output:
newdf
c1 c2 c3
1 3 6 3
2 7 9 5
3 13 11 10
4 14 16 14
5 0 0 0
6 0 0 0

We can convert the 0s to NA (na_if), get the cumsum, and replace the NA with 0 (replace_na) across all the columns:
library(dplyr)
library(tidyr)
D %>%
  mutate(across(everything(), ~ replace_na(cumsum(na_if(., 0)), 0)))
Output:
# A tibble: 6 x 3
# c1 c2 c3
# <dbl> <dbl> <dbl>
#1 3 6 3
#2 7 9 5
#3 13 11 10
#4 14 16 14
#5 0 0 0
#6 0 0 0
Or if we want to specify a row number:
D %>%
  mutate(across(everything(), ~ case_when(row_number() <= 4 ~ cumsum(.),
                                          TRUE ~ .)))
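To make the cut-off reusable across the answers above, here is a minimal helper (a sketch; cumsum_upto is a made-up name, not from any package):
# Column-wise cumsum over the first n rows; remaining rows are left unchanged.
# cumsum() belongs to the Math group generic, which has a data.frame method,
# so this works for data frames and tibbles alike.
cumsum_upto <- function(df, n) {
  df[seq_len(n), ] <- cumsum(df[seq_len(n), ])
  df
}
cumsum_upto(D, 4)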

Related

Combine csv files with redundant columns R

I have a series of .csv files that look like this :
a.csv contains
id, a, b, c
1, 10, 0, 0
2, 3, 0 , 0
3, 20, 0, 0
b.csv contains
id, a, b, c
1, 0, 7, 0
2, 0, 9, 0
3, 0, 14, 0
c.csv contains
id, a, b, c
1, 0, 0, 12
2, 0, 0, 8
3, 0, 0, 22
I'm trying to figure out the most efficient way to read them in and create a data frame that looks like this:
id, a, b, c
1, 10, 7, 12
2, 3, 9, 8
3, 20, 14, 22
What would be the best way to do this if there are many more files with many more columns and rows? tidyverse is preferred.
How about this. If all redundant columns have zeros, then you can go long, filter out the zeros, bind the rows, and then go wide.
library(tidyverse)
df_a <- read_table("id a b c
1 10 0 0
2 3 0 0
3 20 0 0")
df_b <- read_table("id a b c
1 0 7 0
2 0 9 0
3 0 14 0")
df_c <- read_table("id a b c
1 0 0 12
2 0 0 8
3 0 0 22")
list(df_a, df_b, df_c) |>
  map(\(d) pivot_longer(d, cols = -id) |>
        filter(value > 0)) |>
  bind_rows() |>
  pivot_wider(names_from = name, values_from = value)
#> # A tibble: 3 x 4
#> id a b c
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 7 12
#> 2 2 3 9 8
#> 3 3 20 14 22
Or better yet, read in the data marking 0 as NA and then coalesce the data frames.
df_a <- read_table("id a b c
1 10 0 0
2 3 0 0
3 20 0 0", na = "0")
df_b <- read_table("id a b c
1 0 7 0
2 0 9 0
3 0 14 0", na = "0")
df_c <- read_table("id a b c
1 0 0 12
2 0 0 8
3 0 0 22", na = "0")
coalesce(df_a, df_b, df_c)
#> # A tibble: 3 x 4
#> id a b c
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 7 12
#> 2 2 3 9 8
#> 3 3 20 14 22
Or, if the data are already read in without the na = "0" argument, you can recode 0 as NA afterwards and then coalesce:
list(df_a, df_b, df_c) |>
  map(\(d) mutate(d, across(everything(), \(x) ifelse(x == 0, NA, x)))) |>
  reduce(coalesce)
#> # A tibble: 3 x 4
#> id a b c
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 10 7 12
#> 2 2 3 9 8
#> 3 3 20 14 22
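If there really are many files, the same idea scales by listing and reading them in one pass (a sketch; the file-name pattern is an assumption):
library(tidyverse)
files <- list.files(pattern = "\\.csv$")
files |>
  map(\(f) read_csv(f, na = "0", show_col_types = FALSE)) |>  # 0 becomes NA on read
  reduce(coalesce)                                            # fold all frames together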
A base R solution, given the symmetry of the files.
Read the files
file_names <- list.files(pattern = "^[abc]\\.csv")
lis <- sapply(file_names, function(x) list(read.csv(x, header = TRUE)))
lis
$a.csv
id a b c
1 1 10 0 0
2 2 3 0 0
3 3 20 0 0
$b.csv
id a b c
1 1 0 7 0
2 2 0 9 0
3 3 0 14 0
$c.csv
id a b c
1 1 0 0 12
2 2 0 0 8
3 3 0 0 22
Combine the columns
column_names <- c("a", "b", "c")
# rowSums() works here because, in each file, all but one value column are zero
cbind(lis[["a.csv"]]["id"], sapply(lis, function(x) rowSums(x[column_names])))
id a.csv b.csv c.csv
1 1 10 7 12
2 2 3 9 8
3 3 20 14 22
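Alternatively, because the redundant columns are all zero, plain element-wise addition of the value columns recovers the combined table and keeps the original column names, unlike the sapply version above, which names them after the files (a sketch resting on that all-zero assumption):
# Zeros act as the additive identity, so summing the frames merges them
cbind(lis[[1]]["id"], Reduce(`+`, lapply(lis, `[`, column_names)))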

conditionally copying reference values in r

I am trying to conditionally copy values from the x column into a new column based on a reference value. For example, in row 1, for time == 1, the ref value is 7, so the newx value should copy the x value from the row where time == 1 and id == 7. The copied value always needs to come from the same time block.
In the event the ref value is 0, the newx value should also be 0
I have tried a few approaches, and the below is probably the closest I have come, but it still isn't working:
library(dplyr)
x <- sample(1:50, 24)
y <- sample(1:50, 24)
ref <- c(7,7,7,7,0,0,0,0,0,0,0,0,4,3,4,1,8,8,5,8,0,0,0,0)
id <- rep(seq(1,8,1), 3)
time <- rep(1:3, each = 8)
x y ref id time
1 41 29 7 1 1
2 18 37 7 2 1
3 50 25 7 3 1
4 47 7 7 4 1
5 2 40 0 5 1
6 22 19 0 6 1
7 48 9 0 7 1
8 26 36 0 8 1
9 49 47 0 1 2
10 46 18 0 2 2
11 25 23 0 3 2
12 38 3 0 4 2
13 28 31 4 5 2
14 34 4 3 6 2
15 21 32 4 7 2
16 9 48 1 8 2
17 43 43 8 1 3
18 39 38 8 2 3
19 6 16 5 3 3
20 12 41 8 4 3
21 1 13 0 5 3
22 19 17 0 6 3
23 7 34 0 7 3
24 33 10 0 8 3
df <- as.data.frame(cbind(x, y, ref, id, time))
df <- df %>%
  group_by(time) %>%
  mutate(Newx = case_when(ref > 0 ~ x[which(id == ref)],
                          TRUE ~ 0))
You can join df with itself. The last mutate is just to remove the NAs in the ref == 0 rows. You could also use tidyr::replace_na, but I wanted to stick to using only dplyr:
df %>%
  left_join(df %>% select(x, id, time) %>% rename(newx = x),
            by = c("time", "ref" = "id")) %>%
  mutate(newx = ifelse(is.na(newx), 0, newx))
Which results in:
x y ref id time newx
1 44 44 7 1 1 36
2 37 26 7 2 1 36
3 40 27 7 3 1 36
4 32 46 7 4 1 36
5 48 33 0 5 1 0
6 31 6 0 6 1 0
7 36 1 0 7 1 0
8 27 11 0 8 1 0
9 26 32 0 1 2 0
10 42 22 0 2 2 0
11 22 21 0 3 2 0
12 15 28 0 4 2 0
13 45 47 4 5 2 15
14 49 4 3 6 2 22
15 25 50 4 7 2 15
16 14 3 1 8 2 26
17 13 42 8 1 3 12
18 38 7 8 2 3 12
19 10 12 5 3 3 50
20 2 40 8 4 3 12
21 50 43 0 5 3 0
22 4 9 0 6 3 0
23 34 49 0 7 3 0
24 12 31 0 8 3 0
Using purrr::map_dbl you could do:
library(purrr)
library(dplyr)
df %>%
  group_by(time) %>%
  mutate(newx = map_dbl(ref, function(ref) if (ref > 0) .data$x[.data$id == ref] else 0)) %>%
  ungroup()
#> # A tibble: 24 × 6
#> x y ref id time newx
#> <int> <int> <dbl> <dbl> <int> <dbl>
#> 1 31 17 7 1 1 37
#> 2 15 43 7 2 1 37
#> 3 14 39 7 3 1 37
#> 4 3 12 7 4 1 37
#> 5 42 15 0 5 1 0
#> 6 43 32 0 6 1 0
#> 7 37 42 0 7 1 0
#> 8 48 7 0 8 1 0
#> 9 25 9 0 1 2 0
#> 10 26 41 0 2 2 0
#> # … with 14 more rows
DATA
set.seed(123)
x <- sample(1:50, 24)
y <- sample(1:50, 24)
ref <- c(7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 4, 3, 4, 1, 8, 8, 5, 8, 0, 0, 0, 0)
id <- rep(seq(1, 8, 1), 3)
time <- rep(1:3, each = 8)
df <- data.frame(x, y, ref, id, time)
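For completeness, a base R sketch that avoids grouping by matching on a composite time/id key:
# Look up x at the row whose id equals ref within the same time block;
# rows with ref == 0 simply get 0
key <- paste(df$time, df$id)
df$newx <- ifelse(df$ref == 0, 0,
                  df$x[match(paste(df$time, df$ref), key)])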

Sum values incrementally for panel data

I have a very basic question as I am relatively new to R. I would like to add each value in a particular column to the running total of the previous values, separately for each cross-sectional unit in my data. My data looks like this:
firm date value
A 1 10
A 2 15
A 3 20
A 4 0
B 1 0
B 2 1
B 3 5
B 4 10
C 1 3
C 2 2
C 3 10
C 4 1
D 1 7
D 2 3
D 3 6
D 4 9
And I want to achieve the data below, i.e. sum the values incrementally within each cross-sectional unit.
firm date value cumulative value
A 1 10 10
A 2 15 25
A 3 20 45
A 4 0 45
B 1 0 0
B 2 1 1
B 3 5 6
B 4 10 16
C 1 3 3
C 2 2 5
C 3 10 15
C 4 1 16
D 1 7 7
D 2 3 10
D 3 6 16
D 4 9 25
Below is a reproducible example code. I tried lag() but couldn't figure out how to repeat it for each firm.
firm <- c("A","A","A","A","B","B","B","B","C","C","C", "C","D","D","D","D")
date <- c("1","2","3","4","1","2","3","4","1","2","3","4", "1", "2", "3", "4")
value <- c(10, 15, 20, 0, 0, 1, 5, 10, 3, 2, 10, 1, 7, 3, 6, 9)
data <- data.frame(firm = firm, date = date, value = value)
Does this work:
library(dplyr)
data %>% group_by(firm) %>% mutate(cumulative_value = cumsum(value))
# A tibble: 16 x 4
# Groups: firm [4]
firm date value cumulative_value
<chr> <int> <int> <int>
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25
Using base R with ave
data$cumulative_value <- with(data, ave(value, firm, FUN = cumsum))
Output:
> data
firm date value cumulative_value
1 A 1 10 10
2 A 2 15 25
3 A 3 20 45
4 A 4 0 45
5 B 1 0 0
6 B 2 1 1
7 B 3 5 6
8 B 4 10 16
9 C 1 3 3
10 C 2 2 5
11 C 3 10 15
12 C 4 1 16
13 D 1 7 7
14 D 2 3 10
15 D 3 6 16
16 D 4 9 25
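For larger panels, a data.table version of the same grouped cumulative sum (a sketch):
library(data.table)
# := adds the column by reference; the trailing [] prints the result
setDT(data)[, cumulative_value := cumsum(value), by = firm][]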

Fill subset of rows with values from row above

I have a long-format dataset with longitudinal data, and for one variable I want to fill in the missing values at timepoint 0 with the values at timepoint 1, but I do not want to fill in the missing values at timepoint 1 with values from timepoint 2, and so on.
My dataset is ordered by id and timepoint.
I have used the fill function successfully in cases where I just needed to fill the missing values across all timepoints for a specific id.
Example dataframe:
df <- data.frame(id = c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4),
                 timepoint = c(0,1,2,3, 0,1,2,3, 0,1,2,3, 0,1,2,3),
                 var1 = c(NA, 9, 8, 10, NA, 10, NA, 12, NA, NA, 12, 11, NA, 12, 12, NA))
> df
id timepoint var1
1 1 0 NA
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 NA
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 NA
14 4 1 12
15 4 2 12
16 4 3 NA
This is what works when I just need to fill any missing value no matter the timepoint:
library(dplyr)
library(tidyr)
df <- df %>%
  group_by(id) %>%
  fill(var1, .direction = "up") %>%   # `var9`:`var12` in the real data
  as.data.frame
But now I have trouble specifying to only fill in the missings in rows at timepoint 0. Any help is appreciated.
My expected output:
> df
id timepoint var1
1 1 0 9
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 10
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 12
14 4 1 12
15 4 2 12
16 4 3 NA
This might be an oversimplification, but calling the fill function again, this time with direction down, would make your entire data frame complete; that fills more rows than you want, though. So in this case I will use an ifelse statement followed by the lead function:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(var1 = ifelse(is.na(var1) & timepoint == 0,
                       lead(var1, 1), var1))
Yields:
# A tibble: 16 x 3
# Groups: id [4]
id timepoint var1
<dbl> <dbl> <dbl>
1 1 0 9
2 1 1 9
3 1 2 8
4 1 3 10
5 2 0 10
6 2 1 10
7 2 2 NA
8 2 3 12
9 3 0 NA
10 3 1 NA
11 3 2 12
12 3 3 11
13 4 0 12
14 4 1 12
15 4 2 12
16 4 3 NA
We can group_by id and use replace to change the values where timepoint == 0 and var1 is NA to the corresponding value of var1 where timepoint == 1 in each group:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(var2 = replace(var1, timepoint == 0 & is.na(var1), var1[timepoint == 1]))
# id timepoint var1 var2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 0 NA 9
# 2 1 1 9 9
# 3 1 2 8 8
# 4 1 3 10 10
# 5 2 0 NA 10
# 6 2 1 10 10
# 7 2 2 NA NA
# 8 2 3 12 12
# 9 3 0 NA NA
#10 3 1 NA NA
#11 3 2 12 12
#12 3 3 11 11
#13 4 0 NA 12
#14 4 1 12 12
#15 4 2 12 12
#16 4 3 NA NA
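The same logic can also be phrased with dplyr's coalesce() (a sketch, assuming every id has a timepoint 1 row):
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(var1 = if_else(timepoint == 0,
                        coalesce(var1, var1[timepoint == 1]),  # fall back to timepoint 1
                        var1)) %>%
  ungroup()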

identifying rows having common values in two columns

How do I identify rows that have the same values in two columns (here: treatment, replicate) as at least one other row?
set.seed(0)
x <- rep(1:10, 4)
y <- sample(c(rep(1:10, 2)+rnorm(20)/5, rep(6:15, 2) + rnorm(20)/5))
treatment <- sample(gl(8, 5, 40, labels=letters[1:8]))
replicate <- sample(gl(8, 5, 40))
d <- data.frame(x=x, y=y, treatment=treatment, replicate=replicate)
table(d$treatment, d$replicate)
# 1 2 3 4 5 6 7 8
# a 1 0 0 1 1 2 0 0
# b 1 1 0 0 1 2 0 0
# c 0 0 0 0 2 0 1 2
# d 2 0 1 1 0 0 1 0
# e 0 2 1 1 0 0 0 1
# f 0 1 1 0 1 1 1 0
# g 0 1 0 2 0 0 1 1
# h 1 0 2 0 0 0 1 1
From the above output, my guess is that the result should contain 16 rows (the eight treatment/replicate cells with a count of 2). Any idea how to achieve this?
Update:
d %>% group_by(treatment, replicate) %>% filter(n() > 1)
# A tibble: 16 x 4
x y treatment replicate
<int> <dbl> <fctr> <fctr>
1 2 7.050445 h 3
2 5 1.840198 b 6
3 8 9.160838 d 1
4 9 4.254486 h 3
5 2 8.870106 g 4
6 4 7.821616 a 6
7 6 9.752492 e 2
8 7 9.988579 c 5
9 9 10.480931 c 8
10 1 2.770469 c 8
11 2 7.913338 e 2
12 3 13.743080 d 1
13 9 5.692010 b 6
14 10 11.100722 a 6
15 3 12.198432 g 4
16 5 5.955146 c 5
I have identified one approach where the results seem to satisfy the condition. Any other better solutions?
You can use duplicated as a condition; scanning from both ends catches every member of each duplicated pair:
dups <- d[duplicated(d[, c("treatment", "replicate")]) |
          duplicated(d[, c("treatment", "replicate")], fromLast = TRUE), ]
> dups
x y treatment replicate
2 2 7.050445 h 3
5 5 1.840198 b 6
8 8 9.160838 d 1
9 9 4.254486 h 3
12 2 8.870106 g 4
14 4 7.821616 a 6
16 6 9.752492 e 2
17 7 9.988579 c 5
19 9 10.480931 c 8
21 1 2.770469 c 8
22 2 7.913338 e 2
23 3 13.743080 d 1
29 9 5.692010 b 6
30 10 11.100722 a 6
33 3 12.198432 g 4
35 5 5.955146 c 5
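The same filter can be written without group_by by counting the pairs first (a sketch with dplyr's add_count()):
library(dplyr)
d %>%
  add_count(treatment, replicate, name = "pair_n") %>%  # rows per treatment/replicate pair
  filter(pair_n > 1) %>%                                # keep pairs that occur more than once
  select(-pair_n)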
