Conditionally combining rows and adding to an existing row in R

Here is my data frame:
df<- data.frame(age=c(10,11,12,11,11,10,11,13,13,13,14,14,15,15,15),
time1=c(10:24),time2=c(20:34))
I want to sum the rows for ages 14 and 15 and keep the result as age 14. My expected output would look like this:
age time1 time2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
Thank you in advance.

Here is one method: replace the 'age' values of 15 with 14, then summarise across the 'time' columns, taking the sum only for the group where 'age' is 14 and leaving the rows of every other group unchanged.
library(dplyr)
df %>%
  group_by(age = replace(age, age %in% 15, 14)) %>%
  summarise(across(everything(), ~ if (all(age == 14)) sum(.x) else .x),
            .groups = 'drop')
-output
# A tibble: 11 × 3
age time1 time2
<dbl> <int> <int>
1 10 10 20
2 10 15 25
3 11 11 21
4 11 13 23
5 11 14 24
6 11 16 26
7 12 12 22
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
Or using base R with colSums and subset/rbind
rbind(subset(df, !age %in% c(14, 15)),
c(age = 14, colSums(df[df$age %in% c(14, 15), - 1])))
-output
age time1 time2
1 10 10 20
2 11 11 21
3 12 12 22
4 11 13 23
5 11 14 24
6 10 15 25
7 11 16 26
8 13 17 27
9 13 18 28
10 13 19 29
11 14 110 160
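The base R idea above can also be wrapped in a small helper. This is only a sketch: the function name collapse_ages and its arguments are made up for illustration, and it assumes the grouping column is called age and every other column is numeric.

```r
# Hypothetical helper (not from the answers above): sum all rows whose age
# is in `ages` into a single row labelled `keep`, leaving other rows as-is.
collapse_ages <- function(data, ages, keep = min(ages)) {
  hit <- data$age %in% ages
  value_cols <- names(data) != "age"
  # colSums() gives one total per value column; t() turns it into a 1-row matrix
  summed <- data.frame(age = keep,
                       t(colSums(data[hit, value_cols, drop = FALSE])))
  out <- rbind(data[!hit, , drop = FALSE], summed)
  rownames(out) <- NULL
  out
}

df <- data.frame(age = c(10, 11, 12, 11, 11, 10, 11, 13, 13, 13, 14, 14, 15, 15, 15),
                 time1 = 10:24, time2 = 20:34)
res <- collapse_ages(df, c(14, 15))
tail(res, 1)
```

The last row should hold the totals for ages 14 and 15 (time1 = 110, time2 = 160), with the original row order preserved for everything else.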

Related

Filling the gap in the time series

I have the following data,
id <- c(rep(12, 10), rep(14, 12), rep(16, 2))
m <- c(seq(1:5), seq(8,12), seq(1:12), 10, 12)
y <- c(rep(14, 10), rep(14, 12), rep(15, 2))
v <- rnorm(24)
df <- data.frame(id, m, y, v)
> df
id m y v
1 12 1 14 0.9453216
2 12 2 14 1.0666393
3 12 3 14 -0.2750527
4 12 4 14 1.3264349
5 12 5 14 -1.8046676
6 12 8 14 0.3334960
7 12 9 14 -1.2448408
8 12 10 14 0.5258248
9 12 11 14 -0.1233157
10 12 12 14 1.4717530
11 14 1 14 0.6217376
12 14 2 14 -0.8344823
13 14 3 14 1.1468841
14 14 4 14 -0.3363987
15 14 5 14 -1.3543311
16 14 6 14 -0.2146853
17 14 7 14 -0.6546186
18 14 8 14 -2.4286257
19 14 9 14 -1.3314888
20 14 10 14 0.8215581
21 14 11 14 -0.9999368
22 14 12 14 -1.2935147
23 16 10 15 0.7339261
24 16 12 15 1.1303524
The first column is the id, second column m is the month, third column y is the year, and the last column is the value.
In the month column, two observations (June and July) are missing for id 12 in year 14, and November is missing for id 16 in year 15.
I would like to have those missing months with a value of zero. That means, for example, for the year 15, the data should look like this,
16 10 15 0.7339261
16 11 15 0
16 12 15 1.1303524
Can anyone suggest a way to do that?
Or in data.table: generate the full sequence of months for each id and year, left join it with the original dataset on id, y and m, and then replace the NAs with 0:
library(data.table)
setDT(df)
df[df[, .(m=min(m):max(m)), by=.(id, y)], on=.(id,y,m)][
is.na(v), v := 0]
With dplyr and tidyr, you can do:
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  complete(m = seq(min(m), max(m), 1), fill = list(v = 0)) %>%
  fill(y)
id m y v
<dbl> <dbl> <dbl> <dbl>
1 12 1 14 0.539
2 12 2 14 -0.0768
3 12 3 14 1.85
4 12 4 14 -0.855
5 12 5 14 0.0326
6 12 6 14 0
7 12 7 14 0
8 12 8 14 -1.03
9 12 9 14 -0.982
10 12 10 14 0.00410
11 12 11 14 -0.233
12 12 12 14 -0.499
13 14 1 14 1.55
14 14 2 14 0.0875
15 14 3 14 1.32
16 14 4 14 -0.981
17 14 5 14 -0.246
18 14 6 14 -1.40
19 14 7 14 1.44
20 14 8 14 -0.981
21 14 9 14 1.47
22 14 10 14 -0.991
23 14 11 14 -0.0945
24 14 12 14 -2.88
25 16 10 15 -0.247
26 16 11 15 0
27 16 12 15 0.0147
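For completeness, the same "build the full month grid, then left join" idea can be sketched in base R without extra packages. This is a rough sketch under the assumption that gaps should only be filled between each id's first and last observed month; the names grid and filled are made up here, and set.seed() is added only so the example is reproducible.

```r
set.seed(1)
id <- c(rep(12, 10), rep(14, 12), rep(16, 2))
m  <- c(1:5, 8:12, 1:12, 10, 12)
y  <- c(rep(14, 22), rep(15, 2))
df <- data.frame(id, m, y, v = rnorm(24))

# One row per month between the first and last observed month of each id/year
grid <- do.call(rbind, lapply(split(df, list(df$id, df$y), drop = TRUE), function(g) {
  data.frame(id = g$id[1], y = g$y[1], m = seq(min(g$m), max(g$m)))
}))

# Left join: months absent from df come back with v = NA, which we set to 0
filled <- merge(grid, df, by = c("id", "y", "m"), all.x = TRUE)
filled$v[is.na(filled$v)] <- 0
filled <- filled[order(filled$id, filled$y, filled$m), c("id", "m", "y", "v")]
filled[filled$id == 16, ]
```

With the sample data this should give 27 rows, three of which (June and July for id 12, November for id 16) carry a value of 0.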

Subtract and find the difference of a value or volume

I have volume measurements of brain parts (optic lobe, olfactory lobe, auditory cortex, etc.); all the parts add up to the total brain volume, as shown in the example data frame here.
a b c d e total
1 2 3 4 5 15
2 3 4 5 6 20
4 6 7 8 9 34
7 8 10 10 15 50
I would like to find the difference in brain volume when I subtract one component from the total volume.
So I was wondering how to go about it in R, without having to create a new column for every brain part.
For example: (total - a = 14, total - b =13, and so on for other components).
total-a total-b total-c total-d total-e
14 13 12 11 10
18 17 16 15 14
30 28 27 26 25
43 42 40 40 35
You can do
dat[, "total"] - dat[1:5]
# a b c d e
#1 14 13 12 11 10
#2 18 17 16 15 14
#3 30 28 27 26 25
#4 43 42 40 40 35
If you want also the column names, then one tidyverse possibility could be:
library(dplyr)
library(tidyr)
df %>%
  gather(var, val, -total) %>%
  mutate(var = paste0("total-", var),
         val = total - val) %>%
  spread(var, val)
total total-a total-b total-c total-d total-e
1 15 14 13 12 11 10
2 20 18 17 16 15 14
3 34 30 28 27 26 25
4 50 43 42 40 40 35
If you do not care about the column names, then with just dplyr you can do:
df %>%
mutate_at(vars(-matches("(total)")), list(~ total - .))
a b c d e total
1 14 13 12 11 10 15
2 18 17 16 15 14 20
3 30 28 27 26 25 34
4 43 42 40 40 35 50
Or without column names with just base R:
df[, grepl("total", names(df))] - df[, !grepl("total", names(df))]
a b c d e
1 14 13 12 11 10
2 18 17 16 15 14
3 30 28 27 26 25
4 43 42 40 40 35
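If you want the "total-" column names but prefer to stay in base R, something along these lines should work. It is a minimal sketch that rebuilds the example data by hand; the object names dat, parts and diffs are chosen here for illustration.

```r
dat <- data.frame(a = c(1, 2, 4, 7), b = c(2, 3, 6, 8), c = c(3, 4, 7, 10),
                  d = c(4, 5, 8, 10), e = c(5, 6, 9, 15),
                  total = c(15, 20, 34, 50))

# Every column except `total` is treated as a component
parts <- setdiff(names(dat), "total")

# The total vector is subtracted from each component column in turn
diffs <- dat$total - dat[parts]
names(diffs) <- paste0("total-", parts)
diffs
```

Because the component columns are picked by name rather than by position, this keeps working if more brain parts are added later.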

Trying to integrate over discrete points from a data frame

I have several months of weather data; an example day is here:
Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10
I need to figure out the total number of hours above 15 degrees by integrating in R. I'm analyzing for degree days, a concept in agriculture, that gives valuable information about relative growth rate. For example, hour 10 is 2 degree hours and hour 11 is 4 degree hours above 15 degrees. This can help predict when to harvest fruit. How can I write the code for this?
Another column could potentially work with a simple subtraction. Then I would have to make a cumulative sum after canceling out all negative numbers. That is the approach I'm setting out to do right now. Is there an integral I could write and have an answer in one step?
This solution subtracts your threshold (i.e., 15°), builds a linear interpolation function of the result with approxfun(), then integrates that function. Note that any temperature below the threshold contributes zero to the total rather than a negative value.
df <- read.table(text = "Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", header = TRUE)
with(df, integrate(approxfun(Hour, pmax(Avg.Temp-15, 0)),
lower = min(Hour), upper = max(Hour)))
#> 53.00017 with absolute error < 0.0039
Created on 2019-02-08 by the reprex package (v0.2.1.9000)
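To see where the value of about 53 comes from, the same area can be computed by hand with the trapezoid rule, which is exact for a piecewise linear function like the one approxfun() builds. This is just an illustrative check, with the vector names hour, temp, excess and trapz made up here.

```r
hour <- 1:24
temp <- c(11, 11, 11, 10, 10, 11, 12, 14, 15, 17, 19, 21,
          22, 24, 23, 22, 21, 18, 16, 15, 14, 12, 11, 10)

# Degrees above the threshold, floored at zero
excess <- pmax(temp - 15, 0)

# Trapezoid rule: width of each interval times the mean of its two end heights
trapz <- sum(diff(hour) * (head(excess, -1) + tail(excess, -1)) / 2)
trapz
```

This gives 53 degree hours, which agrees with the integrate() result above up to its reported numerical error.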
The OP has requested to figure out the total number of hours above 15 degrees by integrating in R.
It is not fully clear to me what the expected result is. Does the OP want to count the number of hours above 15 degrees, or does the OP want to sum up the degrees greater than 15 ("integrate")?
However, the code below creates both figures. Supposing the data is sampled at each hour without gaps (as suggested by the OP's sample dataset), cumsum() and sum() can be used, respectively:
library(data.table)
setDT(DT)[, c("deg_hrs_sum", "deg_hrs_cnt") :=
.(cumsum(pmax(0, Avg.Temp - 15)), cumsum(Avg.Temp > 15))]
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
1: 1 11 0 0
2: 2 11 0 0
3: 3 11 0 0
4: 4 10 0 0
5: 5 10 0 0
6: 6 11 0 0
7: 7 12 0 0
8: 8 14 0 0
9: 9 15 0 0
10: 10 17 2 1
11: 11 19 6 2
12: 12 21 12 3
13: 13 22 19 4
14: 14 24 28 5
15: 15 23 36 6
16: 16 22 43 7
17: 17 21 49 8
18: 18 18 52 9
19: 19 16 53 10
20: 20 15 53 10
21: 21 14 53 10
22: 22 12 53 10
23: 23 11 53 10
24: 24 10 53 10
Hour Avg.Temp deg_hrs_sum deg_hrs_cnt
Alternatively,
setDT(DT)[, .(deg_hrs_sum = sum(pmax(0, Avg.Temp - 15)),
deg_hrs_cnt = sum(Avg.Temp > 15))]
returns only the final result (last row):
deg_hrs_sum deg_hrs_cnt
1: 53 10
Data
library(data.table)
DT <- fread("
rn Hour Avg.Temp
1 1 11
2 2 11
3 3 11
4 4 10
5 5 10
6 6 11
7 7 12
8 8 14
9 9 15
10 10 17
11 11 19
12 12 21
13 13 22
14 14 24
15 15 23
16 16 22
17 17 21
18 18 18
19 19 16
20 20 15
21 21 14
22 22 12
23 23 11
24 24 10", drop = 1L)
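Since the question mentions several months of data, the same sum can be taken per day. The sketch below assumes the data has a day identifier, which the sample does not show; the Day column and the small two-day data set are invented purely for illustration.

```r
# Hypothetical two-day data set; `Day` is an assumed column
weather <- data.frame(Day = rep(1:2, each = 4),
                      Hour = rep(1:4, times = 2),
                      Avg.Temp = c(14, 16, 18, 15, 20, 15, 13, 17))

# Degree hours above 15 for every row, never negative
weather$deg_hrs <- pmax(weather$Avg.Temp - 15, 0)

# One total per day
daily <- aggregate(deg_hrs ~ Day, data = weather, FUN = sum)
daily
```

Here day 1 accumulates 4 degree hours and day 2 accumulates 7.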

Sum a variable based on another variable

I have a dataset consisting of two variables, Contents and Time like so:
Time Contents
2017M01 123
2017M02 456
2017M03 789
. .
. .
. .
2018M12 789
Now I want to create a numeric vector that aggregates Contents for six months, that is I want to sum 2017M01 to 2017M06 to one number, 2017M07 to 2017M12 to another number and so on.
I'm able to do this by indexing but I want to be able to write: "From 2017M01 to 2017M06 sum contents corresponding to that sequence" in my code.
I would really appreciate some help!
You can create a grouping variable based on the number of rows and the number of elements per group. In your case you want to group every 6 rows, so the number of rows in your data frame should be divisible by 6. Using iris to demonstrate (it has 150 rows, so 150 / 6 = 25):
rep(seq(nrow(iris)%/%6), each = 6)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
There are plenty of ways to handle how you want to call it. Here is a custom function that allows you to do that (i.e. create the grouping variable),
f1 <- function(x, df) {
  v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
  v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
  i1 <- (v2 - v1) + 1
  return(rep(seq(nrow(df) %/% i1), each = i1))
}
f1("2017M01:2017M06", iris)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6 7 7 7 7 7 7 8 8 8 8 8 8 9 9 9 9 9 9 10 10 10 10
#[59] 10 10 11 11 11 11 11 11 12 12 12 12 12 12 13 13 13 13 13 13 14 14 14 14 14 14 15 15 15 15 15 15 16 16 16 16 16 16 17 17 17 17 17 17 18 18 18 18 18 18 19 19 19 19 19 19 20 20
#[117] 20 20 20 20 21 21 21 21 21 21 22 22 22 22 22 22 23 23 23 23 23 23 24 24 24 24 24 24 25 25 25 25 25 25
EDIT: We can easily make the function handle row counts that are not a multiple of the group size by appending the value max + 1 to the result, repeated once for each leftover row, i.e.
f1 <- function(x, df) {
  v1 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\1', x))
  v2 <- as.numeric(gsub('[0-9]{4}M(.*):[0-9]{4}M(.*)$', '\\2', x))
  i1 <- (v2 - v1) + 1
  final_v <- rep(seq(nrow(df) %/% i1), each = i1)
  if (nrow(df) %% i1 == 0) {
    return(final_v)
  } else {
    remainder <- nrow(df) %% i1
    final_v1 <- c(final_v, rep((max(final_v) + 1), remainder))
    return(final_v1)
  }
}
So for a data frame with 20 rows, doing groups of 6, the above function will yield the result:
f1("2017M01:2017M06", df)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4
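If the goal is to tie the sums to the Time labels themselves rather than to row positions, one possible sketch is to derive a half-year key from the label. It assumes Time is always formatted as four digits, "M", and a zero-padded month; the Contents values below are placeholders, not the asker's data.

```r
# Placeholder data: 24 months with Contents 1..24
dat <- data.frame(Time = sprintf("%dM%02d", rep(2017:2018, each = 12), rep(1:12, 2)),
                  Contents = 1:24,
                  stringsAsFactors = FALSE)

year  <- as.integer(substr(dat$Time, 1, 4))
month <- as.integer(substr(dat$Time, 6, 7))

# Months 1-6 fall in half 1, months 7-12 in half 2
half <- paste0(year, "H", (month - 1) %/% 6 + 1)
sums <- tapply(dat$Contents, half, sum)
sums

# "From 2017M01 to 2017M06": zero-padded labels compare correctly as strings
first_half <- sum(dat$Contents[dat$Time >= "2017M01" & dat$Time <= "2017M06"])
```

The string comparison in the last line relies on the zero-padded format, so it would not work for labels like "2017M1".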

Assigning Values based on row value

I have a large vector (a column of a data frame) containing integers from 1 to 30. I want to replace the numbers from 1 to 5 with 1, 6 to 10 with 5, 11 to 15 with 9...
> x3 <- sample(1:30, 100, rep=TRUE)
> x3
[1] 13 24 16 30 10 6 15 10 3 17 18 22 11 13 29 7 25 28 17 27 1 5 6 20 15 15 8 10 13 26 27 24 3 24 5 7 10 6 28 27 1 4 22 25 14 13 2 10 4 29 23 24 30 24 29 11 2 28 23 1 1 2
[63] 3 23 13 26 21 22 11 4 8 26 17 11 20 23 6 14 24 5 15 21 11 13 6 14 20 11 22 9 6 29 4 30 20 30 4 24 23 29
As I mentioned, this is a column in a data frame, and I want to use the above mapping to create a new column. If I do the following, I have to do it 30 times.
myFrame$NewColumn[myFrame$oldColumn==1] <- 1
myFrame$NewColumn[myFrame$oldColumn==2] <- 1
myFrame$NewColumn[myFrame$oldColumn==3] <- 1
...
What's a better way to do this?
We can do this with cut (supposing that what you mean by '...' is that the pattern continues with 13, 17, 21):
x4 <- cut(x3,
          breaks = c(seq(1, 30, 5), 30), right = FALSE, include.lowest = TRUE, # generate correct intervals
          labels = 4 * (0:5) + 1) # number to fill
# x4 is a factor, so convert it to character first and then to numeric
x4 <- as.numeric(as.character(x4))
Did you try:
myFrame$NewColumn[myFrame$oldColumn > 0 & myFrame$oldColumn < 6] <- 1
myFrame$NewColumn[myFrame$oldColumn > 5 & myFrame$oldColumn < 11] <- 5
...
Or even better:
myFrame$NewColumn <- as.integer((myFrame$oldColumn - 1) / 5) * 4 + 1
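As a quick sanity check of the arithmetic approach, the sketch below compares it against an explicit lookup over all 30 possible values. It assumes the pattern continues as 1, 5, 9, 13, 17, 21; the names by_arith and by_interval are made up for this example.

```r
x <- 1:30

# Arithmetic recoding: integer-divide into blocks of 5, then scale to 1, 5, 9, ...
by_arith <- ((x - 1) %/% 5) * 4 + 1

# Lookup recoding: findInterval() returns which block (1..6) each value falls in
new_values <- c(1, 5, 9, 13, 17, 21)
by_interval <- new_values[findInterval(x, seq(1, 26, by = 5))]

all(by_arith == by_interval)
```

The lookup version is handy when the replacement values do not follow a simple formula, since only new_values needs to change.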
