Cumulative summing between groups using dplyr - r

I have a tibble structured as follows:
day theta
1 1 2.1
2 1 2.1
3 2 3.2
4 2 3.2
5 5 9.5
6 5 9.5
7 5 9.5
Note that the tibble contains multiple rows for each day, and for each day the same value for theta is repeated an arbitrary number of times. (The tibble contains other arbitrary columns necessitating this repeating structure.)
I'd like to use dplyr to cumulatively sum values for theta across days such that, in the example above, 2.1 is added only a single time to 3.2, etc. The tibble would be mutated so as to append the new cumulative sum (c.theta) as follows:
day theta c.theta
1 1 2.1 2.1
2 1 2.1 2.1
3 2 3.2 5.3
4 2 3.2 5.3
5 5 9.5 14.8
6 5 9.5 14.8
7 5 9.5 14.8
...
My initial efforts to group_by day and then cumsum over theta resulted only in cumulative summing over the full set of data (e.g., 2.1 + 2.1 + 3.2 ...) which is undesirable. In my Stack Overflow searches, I can find many examples of cumulative summing within groups, but never between groups, as I describe above. Nudges in the right direction would be much appreciated.

Doing this in dplyr I came up with a very similar solution to PoGibas - use distinct to just get one row per day, find the sum and merge back in:
df = read.table(text="day theta
1 1 2.1
2 1 2.1
3 2 3.2
4 2 3.2
5 5 9.5
6 5 9.5
7 5 9.5", header = TRUE)
cumsums = df %>%
distinct(day, theta) %>%
mutate(ctheta = cumsum(theta))
df %>%
left_join(cumsums %>% select(day, ctheta), by = 'day')

Not a dplyr, but just an alternative data.table solution:
library(data.table)
# Original table is called d
setDT(d)
merge(d, unique(d)[, .(c.theta = cumsum(theta), day)])
day theta c.theta
1: 1 2.1 2.1
2: 1 2.1 2.1
3: 2 3.2 5.3
4: 2 3.2 5.3
5: 5 9.5 14.8
6: 5 9.5 14.8
7: 5 9.5 14.8
PS: If you want to preserve other columns you have to use unique(d[, .(day, theta)])

In base R you could use split<- and tapply to return the desired result.
# construct 0 vector to fill in
dat$temp <- 0
# fill in with cumulative sum for each day
split(dat$temp, dat$day) <- cumsum(tapply(dat$theta, dat$day, head, 1))
Here, tapply returns the first element of theta for each day which is is fed to cumsum. The elements of cumulative sum are fed to each day using split<-.
This returns
dat
day theta temp
1 1 2.1 2.1
2 1 2.1 2.1
3 2 3.2 5.3
4 2 3.2 5.3
5 5 9.5 14.8
6 5 9.5 14.8
7 5 9.5 14.8

Related

Adding a header to columns based on the values of rows

I have the following different dataframes:
df1:
Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10
Optimise Reduced A 5 5.8 3 12
df2:
Convert Reduced A 14 25
Configure Reduced A 14.7 6.8
Race Reduced A 2 6.3
df3:
Abstract Reduced A 8 7.5 9 8 4.5 11
Follower Reduced A 5.5 6 14 19 6 13.5
I would like to add a header for each of the dataframes where the column names are:
Class Technique Algorithm 1 2 3 ....
My issue is not with the first three columns but with the rest of the columns (integer values). As you see in the example, the number of columns for these integer values differs which makes it difficult to me how to name these columns (i.e., starting form 1 until the last value, for example, 4 in df1).
Can someone help me please in solving this issue?
Here is a function for you. The first argument, dat, is your data frame. The second argument, chr, is the vector names for your first few columns.
header_fun <- function(dat, chr = c("Class", "Technique", "Algorithm")){
dat2 <- setNames(dat, c(chr, 1:(ncol(dat) - length(chr))))
return(dat2)
}
The function will return a new data frame with the updated header.
header_fun(df1)
# Class Technique Algorithm C1 C2 C3 C4
# 1 Scribe Reduced A 5.0 2.5 3 10
# 2 Reader Reduced A 9.2 4.0 12 10
# 3 Optimise Reduced A 5.0 5.8 3 12
header_fun(df2)
# Class Technique Algorithm 1 2
# 1 Convert Reduced A 14.0 25.0
# 2 Configure Reduced A 14.7 6.8
# 3 Race Reduced A 2.0 6.3
header_fun(df3)
# Class Technique Algorithm 1 2 3 4 5 6
# 1 Abstract Reduced A 8.0 7.5 9 8 4.5 11.0
# 2 Follower Reduced A 5.5 6.0 14 19 6.0 13.5
DATA
df1 <- read.table(text = "Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10
Optimise Reduced A 5 5.8 3 12",
header = FALSE, stringsAsFactors = FALSE)
df2 <- read.table(text = "Convert Reduced A 14 25
Configure Reduced A 14.7 6.8
Race Reduced A 2 6.3",
header = FALSE, stringsAsFactors = FALSE)
df3 <- read.table(text = "Abstract Reduced A 8 7.5 9 8 4.5 11
Follower Reduced A 5.5 6 14 19 6 13.5",
header = FALSE, stringsAsFactors = FALSE)

R - Create random subsamples of size n for multiple sample groups

I have a large data set of samples that belong to different groups and differ in the area covered. The structure of the data set is simplified below. I now would like to create pooled samples (Subgroups) for each Group where the area covered by each Subgroup equates to a specified area (e.g. 20). Samples should be allocated randomly and without replacement to each Subgroup and the number of the Subgroup should be listed in a new column at the end of the data frame.
SampleID Group Area Subgroup
1 A 1.5 1
2 A 3.8 2
3 A 6 4
4 A 1.9 1
5 A 1.5 3
6 A 4.1 1
7 A 3.7 1
8 A 4.5 3
...
300 B 1.2 1
301 B 3.8 1
302 B 4.1 4
303 B 2.6 3
304 B 3.1 5
305 B 3.5 3
306 B 2.1 2
...
2000 S 2.7 5
...
I am currently using the ‘cumsum’ command to create the Subgroups, using the code below.
dat <- read.table("Pooling_Test.txt", header = TRUE, sep = "\t")
dat$CumArea <- cumsum(dat$Area)
dat$Diff_CumArea <- c(0, head(cumsum(dat$Area), -1))
dat$Sample_Int_1 <- "0"
dat$Sample_End <- "0"
current.sum <- 0
for (c in 1:nrow(dat)) {
current.sum <- current.sum + dat[c, "Area"]
dat[c, "Diff_CumArea"] <- current.sum
if (current.sum >= 20) {
dat[c, "Sample_Int_1"] <- "1"
dat[c, "Sample_End"] <- "End"
current.sum <- 0
dat$Sample_Int_2 <- cumsum(dat$Sample_Int_1)+1
dat$Sample_Final <- dat$Sample_Int_2
for (d in 1:nrow(dat)) {
if (dat$Sample_End[d] == 'End')
dat$Subgroup[d] <- dat$Sample_Int_2[d]-1
else 0 }
}}
write.csv(dat, file = 'Pooling_Test_Output.csv', row.names = FALSE)
The resultant data frame shows what I want (see below). However, there are a couple of steps I would like to improve. First, I have problems including a command for choosing samples randomly from each Group, so I currently randomise the order of samples before loading the data frame into R. Secondly, in the output table the Subgroups are numbered consecutively, but I would like to start the Subgroup numbering with 1 for each new Group. Has anybody any advice on how to achieve this?
SampleID Group CumArea Subgroups
1 A 1.5 1
77 A 4.6 1
6 A 9.3 1
43 A 16.4 1
17 A 19.5 1
67 A 2.1 2
4 A 4.3 2
32 A 8.9 2
...
300 B 4.5 10
257 B 6.8 10
397 B 10.6 10
344 B 14.5 10
367 B 16.7 10
303 B 20.1 10
306 B 1.5 11
...
A few functions in the dplyr package make this fairly straightforward. You can use slice to randomize the data, group_by to perform computations at the group level, and mutate to create new variables. If you chain the functions together with the %>% operator, I believe the solution would look something like this, assuming that you want groups that add up to 20.
install.packages("dplyr") #If you haven't used dplyr before
library(dplyr)
dat %>%
group_by(Group) %>%
slice(sample(1:n())) %>%
mutate(CumArea = cumsum(Area), SubGroup = ceiling(CumArea / 20))

R: Creating an index vector

I need some help with R coding here.
The data set Glass consists of 214 rows of data in which each row corresponds to a glass sample. Each row consists of 10 columns. When viewed as a classification problem, column 10
(Type) specifies the class of each observation/instance. The remaining columns are attributes that might beused to infer column 10. Here is an example of the first row
RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0.0 0.0 1
First, I casted column 10 so that it is interpreted by R as a factor instead of an integer value.
Now I need to create a vector with indices for all observations (must have values 1-214). This needs to be done to creating training data for Naive Bayes. I know how to create a vector with 214 values, but not one that has specific indices for observations from a data frame.
If it helps this is being done to set up training data for Naive Bayes, thanks
I'm not totally sure that I get what you're trying to do... So please forgive me if my solution isn't helpful. If your df's name is 'df', just use the dplyr package for reordering your columns and write
library(dplyr)
df['index'] <- 1:214
df <- df %>% select(index,everything())
Here's an example. So that I can post full dataframes, my dataframes will only have 10 rows...
Let's say my dataframe is:
df <- data.frame(col1 = c(2.3,6.3,9.2,1.7,5.0,8.5,7.9,3.5,2.2,11.5),
col2 = c(1.5,2.8,1.7,3.5,6.0,9.0,12.0,18.0,20.0,25.0))
So it looks like
col1 col2
1 2.3 1.5
2 6.3 2.8
3 9.2 1.7
4 1.7 3.5
5 5.0 6.0
6 8.5 9.0
7 7.9 12.0
8 3.5 18.0
9 2.2 20.0
10 11.5 25.0
If I want to add another column that just is 1,2,3,4,5,6,7,8,9,10... and I'll call it 'index' ...I could do this:
library(dplyr)
df['index'] <- 1:10
df <- df %>% select(index, everything())
That will give me
index col1 col2
1 1 2.3 1.5
2 2 6.3 2.8
3 3 9.2 1.7
4 4 1.7 3.5
5 5 5.0 6.0
6 6 8.5 9.0
7 7 7.9 12.0
8 8 3.5 18.0
9 9 2.2 20.0
10 10 11.5 25.0
Hope this will help
df$ind <- seq.int(nrow(df))

How to identify missing rows in a R df - without NA

I have several df with different length of the observations. The observations should have been collected at every 0.2 m depth. This is often the case but sometimes 1, 2 or more depth intervals are "missed out/jumped over", in a completely random order. The tables have same fields. I need to know which depth that has been missed out and at which time. As an example is a reduced file; Table A (profile, time_UTC, depth_m). Table A has 12 rows, but should have 15 and i.e 1 at 0.4m, and at 1.2m and 1.4m. The time stamps (ms) are irregular so I cannot use them to identify gaps in depth.
==============
Table A
Has 12 rows but should have 15:
profil time_UTC depth_m
1 V 24871 0.2
2 V 24877 0.6
3 V 24882 0.8
4 V 24887 1
5 V 24898 1.6
6 V 24901 1.8
7 V 24902 2
8 V 24905 2.2
9 V 24907 2.4
10 V 24909 2.6
11 V 24912 2.8
12 V 24915 3
The check needs to be done in an operational manner for all df that I will read in.
I need help to write a query where I can find those missing rows and add new rows with the missing depth and integrated time.
I provided two links for the similar problem, but in the R example the sequence is categorical (my df are not) and the other is for Sql code.
R identify missing rows from pre-specified sequence
How to find missing rows?
Thanks for the help in advance.
library(dplyr)
right_join(df, data_frame(depth_m = seq(0.2, 3, by = 0.2)))
#> Joining, by = "depth_m"
#> profil time_UTC depth_m
#> 1 V 24871 0.2
#> 2 <NA> NA 0.4
#> 3 <NA> NA 0.6
#> 4 V 24882 0.8
#> 5 V 24887 1.0
#> 6 <NA> NA 1.2
#> 7 <NA> NA 1.4
#> 8 V 24898 1.6
#> 9 V 24901 1.8
#> 10 V 24902 2.0
#> 11 V 24905 2.2
#> 12 <NA> NA 2.4
#> 13 <NA> NA 2.6
#> 14 <NA> NA 2.8
#> 15 V 24915 3.0

Computing a "rightmost" moving average?

I would like to compute a moving average (ma) over some time series data but I would like the ma to consider the order n starting from the rightmost of my series so my last ma value corresponds to the ma of the last n values of my series. The desired function rightmost_ma would produce this output:
data <- seq(1,10)
> data
[1] 1 2 3 4 5 6 7 8 9 10
rightmost_ma(data, n=2)
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I was reviewing the different ma possibilities e.g. package forecast and could not find how to cover this use case. Note that the critical requirement for me is to have valid non NA ma values for the last elements of the series or in other words I want my ma to produce valid results without "looking into the future".
Take a look at rollmean function from zoo package
> library(zoo)
> rollmean(zoo(1:10), 2, align ="right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
you can also use rollapply
> rollapply(zoo(1:10), width=2, FUN=mean, align = "right", fill=NA)
1 2 3 4 5 6 7 8 9 10
NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
I think using stats::filter is less complicated, and might have better performance (though zoo is well written).
This:
filter(1:10, c(1,1)/2, sides=1)
gives:
Time Series:
Start = 1
End = 10
Frequency = 1
[1] NA 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
If you don't want the result to be a ts object, use as.vector on the result.

Resources