I am trying to do less in Excel and more in R, but I get stuck on a simple calculation. I have a dataframe with meter readings over a number of weeks. I need to calculate the consumption in each week, i.e. subtract each Reading column from the one after it. For instance, in the example below I need to subtract Reading1 from Reading2 and Reading2 from Reading3. My actual data set contains hundreds of readings, so I need an easy way to do this.
SerialNo = c(1,2,3,4,5)
Reading1 = c(100, 102, 119, 99, 200)
Reading2 = c(102, 105, 120, 115, 207)
Reading3 = c(107, 109, 129, 118, 209)
df <- data.frame(SerialNo, Reading1, Reading2, Reading3)
df
SerialNo Reading1 Reading2 Reading3
1 1 100 102 107
2 2 102 105 109
3 3 119 120 129
4 4 99 115 118
5 5 200 207 209
Here's a tidyverse solution that returns a data frame with similar formatting. It converts the data to long format (pivot_longer), applies the lag function, does the subtraction and then widens back to the original format (pivot_wider).
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(Reading1:Reading3,
               names_to = "reading",
               names_prefix = "Reading",
               values_to = "value") %>%
  group_by(SerialNo) %>%
  mutate(offset = lag(value, 1),
         measure = value - offset) %>%
  select(SerialNo, reading, measure) %>%
  pivot_wider(names_from = reading,
              values_from = measure,
              names_prefix = "Reading")
# A tibble: 5 x 4
# Groups: SerialNo [5]
SerialNo Reading1 Reading2 Reading3
<dbl> <dbl> <dbl> <dbl>
1 1 NA 2 5
2 2 NA 3 4
3 3 NA 1 9
4 4 NA 16 3
5 5 NA 7 2
df[,paste0(names(df)[3:4], names(df)[2:3])] <- df[,names(df)[3:4]] - df[,names(df)[2:3]]
df
SerialNo Reading1 Reading2 Reading3 Reading2Reading1 Reading3Reading2
1 1 100 102 107 2 5
2 2 102 105 109 3 4
3 3 119 120 129 1 9
4 4 99 115 118 16 3
5 5 200 207 209 7 2
PS: I assume the Reading columns are ordered 1, 2, 3, etc.
We can use apply row-wise to calculate the difference between consecutive columns.
temp <- t(apply(df[-1], 1, diff))
df[paste0('ans', seq_len(ncol(temp)))] <- temp
df
# SerialNo Reading1 Reading2 Reading3 ans1 ans2
#1 1 100 102 107 2 5
#2 2 102 105 109 3 4
#3 3 119 120 129 1 9
#4 4 99 115 118 16 3
#5 5 200 207 209 7 2
Another option is to use a simple for loop over the columns of your data frame. I think this solution can be easier to understand, especially if you are starting out with R.
# Create a data frame with the same rows as your df and number of cols - 1
resul <- as.data.frame(matrix(nrow = nrow(df), ncol = ncol(df) - 1))
# Add the SerialNo column as the first column of the results df
resul[, 1] <- df[, 1]
# Set the name of the first column to SerialNo (the first colname of df)
colnames(resul)[1] <- colnames(df)[1]
# Loop over the Reading columns of df (from the second column to the last minus 1)
for (i in 2:(ncol(df) - 1)) {
  # Do the subtraction
  resul[, i] <- df[, i + 1] - df[, i]
  # Set the colname for each iteration
  colnames(resul)[i] <- paste0(colnames(df)[i + 1], "-", colnames(df)[i])
}
resul
#   SerialNo Reading2-Reading1 Reading3-Reading2
# 1        1                 2                 5
# 2        2                 3                 4
# 3        3                 1                 9
# 4        4                16                 3
# 5        5                 7                 2
Below is a sample data frame of what I am working with.
df <- data.frame(
Sample = c('A', 'A', 'B', 'C'),
Length = c('100', '110', '99', '102'),
Molarity = c(5,4,6,7)
)
df
Sample Length Molarity
1 A 100 5
2 A 110 4
3 B 99 6
4 C 102 7
I would like the result listed below but am unsure how to approach the problem.
Sample Length Molarity
1 A 100,110 9
2 B 99 6
3 C 102 7
We may do a group-by summarisation:
library(dplyr)
df %>%
group_by(Sample) %>%
summarise(Length = toString(Length), Molarity = sum(Molarity))
-output
# A tibble: 3 × 3
Sample Length Molarity
<chr> <chr> <dbl>
1 A 100, 110 9
2 B 99 6
3 C 102 7
A base R option (note FUN = sum for Molarity, to match the requested output):
df |>
  within({
    Length <- ave(Length, Sample, FUN = toString)
    Molarity <- ave(Molarity, Sample, FUN = sum)
  }) |>
  unique()

  Sample   Length Molarity
1      A 100, 110        9
3      B       99        6
4      C      102        7
I have a column of IDs in a dataframe that sometimes has duplicates, take for example,
ID
209
315
109
315
451
209
What I want to do is take this column and create another column that indicates what ID the row belongs to. i.e. I want it to look like,
ID    ID Category
209   1
315   2
109   3
315   2
451   4
209   1
Essentially, I want to loop through the IDs and, if one equals a previous ID, indicate that it belongs to the same ID; if it is a new ID, create a new indicator for it.
Does anyone know of a quick function in R that could do this? Or have any other suggestions?
Convert to factor with levels ordered with unique (order of appearance in the data set) and then to numeric:
data$IDCategory <- as.numeric(factor(data$ID, levels = unique(data$ID)))
#> data
# ID IDCategory
#1 209 1
#2 315 2
#3 109 3
#4 315 2
#5 451 4
#6 209 1
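As an aside, the same indices can also be obtained without building a factor, using match(), which returns the position of each ID within the vector of unique IDs (a base R idiom; data is assumed to be a data frame with the ID column from the question):

```r
# Data frame with duplicated IDs, as in the question
data <- data.frame(ID = c(209, 315, 109, 315, 451, 209))

# match() finds each ID's position among the unique IDs,
# which serves directly as the category index
data$IDCategory <- match(data$ID, unique(data$ID))
data$IDCategory
# [1] 1 2 3 2 4 1
```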
library(tidyverse)
data <- tibble(ID= c(209,315,109,315,451,209))
data %>%
  left_join(
    data %>%
      distinct(ID) %>%
      mutate(`ID Category` = row_number())
  )
#> Joining, by = "ID"
#> # A tibble: 6 × 2
#> ID `ID Category`
#> <dbl> <int>
#> 1 209 1
#> 2 315 2
#> 3 109 3
#> 4 315 2
#> 5 451 4
#> 6 209 1
Created on 2022-03-10 by the reprex package (v2.0.0)
df <- df %>%
dplyr::mutate(`ID Category` = as.numeric(interaction(ID, drop=TRUE)))
Answer with data.table
library(data.table)
df <- as.data.table(df)
df <- df[
j = `ID Category` := as.numeric(interaction(ID, drop=TRUE))
]
The advantage of this solution is that you can create a unique ID for a combination of variables. Here you only need ID, but if you wanted a unique ID for, say, the pair [ID, Location], you could:
data <- tibble(ID= c(209,209,209,315,315,315), Location = c("A","B","C","A","A","B"))
data <- data %>%
dplyr::mutate(`ID Category` = as.numeric(interaction(ID, Location, drop=TRUE)))
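As a side note, data.table also provides a built-in group counter, .GRP, which yields the same kind of index directly; with by = ID, groups are numbered in order of first appearance (a sketch, using its own small example data):

```r
library(data.table)

dt <- data.table(ID = c(209, 315, 109, 315, 451, 209))

# .GRP is data.table's group counter: each distinct ID gets
# the next integer, in order of first appearance
dt[, `ID Category` := .GRP, by = ID]
dt$`ID Category`
# [1] 1 2 3 2 4 1
```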
another way:
merge(data,
data.frame(ID = unique(data$ID),
ID.Category = seq_along(unique(data$ID))
), sort = F)
# ID ID.Category
# 1 209 1
# 2 209 1
# 3 315 2
# 4 315 2
# 5 109 3
# 6 451 4
data:
tibble(ID = c(209,315,109,315,451,209)) -> data
I'm currently reviewing R for Data Science and encountered this chunk of code.
The question for this code is as follows. I don't understand the necessity of the arrange function here. Doesn't the arrange function just reorder the rows?
library(tidyverse)
library(nycflights13)
flights %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  mutate(delay_gt1hr = dep_delay > 60) %>%
  mutate(before_delay = cumsum(delay_gt1hr)) %>%
  filter(before_delay < 1) %>%
  count(sort = TRUE)
However, it does output differently with or without the arrange function, as shown below:
#with the arrange function
tailnum n
<chr> <int>
1 N954UW 206
2 N952UW 163
3 N957UW 142
4 N5FAAA 117
5 N38727 99
6 N3742C 98
7 N5EWAA 98
8 N705TW 97
9 N765US 97
10 N635JB 94
# ... with 3,745 more rows
and
#Without the arrange function
tailnum n
<chr> <int>
1 N952UW 215
2 N315NB 161
3 N705TW 160
4 N961UW 139
5 N713TW 128
6 N765US 122
7 N721TW 120
8 N5FAAA 117
9 N945UW 104
10 N19130 101
# ... with 3,774 more rows
I'd appreciate it if you can help me understand this. Why is it necessary to include the arrange function here?
Yes, arrange just orders the rows, but you are filtering after that, which changes the result.
Here is a simplified example to demonstrate how the output differs with and without arrange.
library(dplyr)
df <- data.frame(a = 1:5, b = c(7, 8, 9, 1, 2))
df %>% filter(cumsum(b) < 20)
# a b
#1 1 7
#2 2 8
df %>% arrange(b) %>% filter(cumsum(b) < 20)
# a b
#1 4 1
#2 5 2
#3 1 7
#4 2 8
df <- data.frame(row.names = c('1s.u1', '1s.u2', '2s.u1', '2s.u2', '6s.u1'),
                 fjri_deu_klcea = c(0, 0, 0, 15, 23),
                 hfue_klcea = c(2, 2, 0, 156, 45),
                 dji_dhi_ghcea_jk = c(456, 0, 0, 15, 15),
                 jdi_jdi_ghcea = c(1, 2, 3, 4, 100),
                 gz7_jfu_dcea_jdi = c(5, 6, 3, 7, 56))
df
fjri_deu_klcea hfue_klcea dji_dhi_ghcea_jk jdi_jdi_ghcea gz7_jfu_dcea_jdi
1s.u1 0 2 456 1 5
1s.u2 0 2 0 2 6
2s.u1 0 0 0 3 3
2s.u2 15 156 15 4 7
6s.u1 23 45 15 100 56
I want to sum up df based on the cea part of the column names, i.e. all columns with the same cea part should be summed together.
df should look like this
klcea ghcea dcea
1s.u1 2 457 5
1s.u2 2 2 6
2s.u1 0 3 3
2s.u2 171 19 7
6s.u1 68 115 56
I thought about first getting a new column with the cea name, called cea, and then summing based on row names and the respective cea,
with something like with(df, ave(cea, row.names(df), FUN = sum)).
However, I do not know how to generate the new column based on a pattern in a string. I guess grepl is useful, but I could not come up with anything; I tried df$cea <- df[grepl(colnames(df), 'cea'), ], which is wrong...
Using base R, you can extract the "cea" part from the names and use it in split.default to split the dataframe into groups of columns; we can then use rowSums to sum each individual dataframe.
sapply(split.default(df, sub('.*_(.*cea).*', '\\1', names(df))), rowSums)
# dcea ghcea klcea
#1s.u1 5 457 2
#1s.u2 6 2 2
#2s.u1 3 3 0
#2s.u2 7 19 171
#6s.u1 56 115 68
where sub part returns :
sub('.*_(.*cea).*', '\\1', names(df))
#[1] "klcea" "klcea" "ghcea" "ghcea" "dcea"
Using dplyr:
df %>% rowwise() %>% mutate(klcea = sum(c_across(ends_with('klcea'))),
                            ghcea = sum(c_across(contains('ghcea'))),
                            dcea = sum(c_across(contains('dcea')))) %>%
  select(klcea, ghcea, dcea)
# A tibble: 5 x 3
# Rowwise:
klcea ghcea dcea
<dbl> <dbl> <dbl>
1 2 457 5
2 2 2 6
3 0 3 3
4 171 19 7
5 68 115 56
If you wish to retain row names:
library(tibble)  # for rownames_to_column() / column_to_rownames()
df %>% rownames_to_column('rn') %>% rowwise() %>%
  mutate(klcea = sum(c_across(ends_with('klcea'))),
         ghcea = sum(c_across(contains('ghcea'))),
         dcea = sum(c_across(contains('dcea')))) %>%
  select(klcea, ghcea, dcea, rn) %>% column_to_rownames('rn')
klcea ghcea dcea
1s.u1 2 457 5
1s.u2 2 2 6
2s.u1 0 3 3
2s.u2 171 19 7
6s.u1 68 115 56
I need to combine some of the columns for these multiple IDs, and for the other columns I can just keep the values from the first listing of each ID. For example, here I want to combine the spending column, and combine the heart attack column to just say whether they ever had a heart attack. I then want to delete the duplicate ID#s, keeping the values from the first listing for the other columns:
df <- read.table(text =
"ID Age Gender heartattack spending
1 24 f 0 140
2 24 m na 123
2 24 m 1 58
2 24 m 0 na
3 85 f 1 170
4 45 m na 204", header=TRUE)
What I need:
df2 <- read.table(text =
"ID Age Gender ever_heartattack all_spending
1 24 f 0 140
2 24 m 1 181
3 85 f 1 170
4 45 m na 204", header=TRUE)
I tried group_by with transmute() and sum() as follows:
df$heartattack = as.numeric(as.character(df$heartattack))
df$spending = as.numeric(as.character(df$spending))
library(dplyr)
df = df %>% group_by(ID) %>% transmute(ever_heartattack = sum(heartattack, na.rm = T), all_spending = sum(spending, na.rm=T))
But this removes all the other columns! Also it turns NA values into zeros...for example I still want "NA" to be the value for patient ID#4, I don't want to change the data to say they never had a heart attack!
> print(df) #This doesn't at all match df2 :(
ID ever_heartattack all_spending
1 1 0 140
2 2 1 181
3 2 1 181
4 2 1 181
5 3 1 170
6 4 0 204
Could you do this?
aggregate(
spending ~ ID + Age + Gender,
data = transform(df, spending = as.numeric(as.character(spending))),
FUN = sum)
# ID Age Gender spending
#1 1 24 f 140
#2 3 85 f 170
#3 2 24 m 181
#4 4 45 m 204
Some comments:
The thing is that when aggregating you don't give clear rules for how to deal with data in additional columns that differ (like heartattack in this case). For example, for ID = 2, why do you retain heartattack = 1 instead of heartattack = na or heartattack = 0?
Your "na"s are in fact not real NAs. That leads to spending being a factor column instead of a numeric column vector.
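As an aside, that problem can be avoided at the source: read.table() accepts an na.strings argument specifying which strings should be treated as missing, so the literal "na" entries become real NAs and spending is numeric from the start:

```r
# Same data as in the question, but telling read.table that
# "na" means missing; spending is then read as numeric with real NAs
df <- read.table(text =
  "ID Age Gender heartattack spending
   1  24 f  0 140
   2  24 m na 123
   2  24 m  1  58
   2  24 m  0  na
   3  85 f  1 170
   4  45 m na 204", header = TRUE, na.strings = "na")

is.numeric(df$spending)
# [1] TRUE
sum(is.na(df$spending))
# [1] 1
```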
To exactly reproduce your expected output one can do
df %>%
mutate(
heartattack = as.numeric(as.character(heartattack)),
spending = as.numeric(as.character(spending))) %>%
group_by(ID, Age, Gender) %>%
summarise(
heartattack = ifelse(
any(heartattack %in% c(0, 1)),
max(heartattack, na.rm = T),
NA),
spending = sum(spending, na.rm = T))
## A tibble: 4 x 5
## Groups: ID, Age [?]
# ID Age Gender heartattack spending
# <int> <int> <fct> <dbl> <dbl>
#1 1 24 f 0 140
#2 2 24 m 1 181
#3 3 85 f 1 170
#4 4 45 m NA 204
This feels a bit "hacky" on account of the rules not being clear about which heartattack value to keep. In this case we:
- keep the maximum value of heartattack if heartattack contains either 0 or 1;
- return NA if heartattack does not contain 0 or 1.