subtract first or second value from each row [duplicate] - r

This question already has answers here:
R subtract value for the same ID (from the first ID that shows)
(3 answers)
subtract first value from each subset of dataframe
(2 answers)
Closed 4 years ago.
I'm manipulating my data using dplyr, and after grouping my data, I would like to subtract the first or second value in each group from every value in that group (i.e., subtract a baseline). Is it possible to do this in a single pipe step?
MWE:
test <- tibble(one=c("c","d","e","c","d","e"), two=c("a","a","a","b","b","b"), three=1:6)
test %>% group_by(`two`) %>% mutate(new=three-three[.$`one`=="d"])
My desired output is:
# A tibble: 6 x 4
# Groups: two [2]
one two three new
<chr> <chr> <int> <int>
1 c a 1 -1
2 d a 2 0
3 e a 3 1
4 c b 4 -1
5 d b 5 0
6 e b 6 1
However I am getting this as the output:
# A tibble: 6 x 4
# Groups: two [2]
one two three new
<chr> <chr> <int> <int>
1 c a 1 -1
2 d a 2 NA
3 e a 3 1
4 c b 4 -1
5 d b 5 NA
6 e b 6 1

We can use first() from dplyr:
test %>%
group_by(two) %>%
mutate(new=three- first(three))
# A tibble: 6 x 4
# Groups: two [2]
# one two three new
# <chr> <chr> <int> <int>
#1 c a 1 0
#2 d a 2 1
#3 e a 3 2
#4 c b 4 0
#5 d b 5 1
#6 e b 6 2
If we are subsetting the 'three' values based on the string "c" in 'one', then we don't need .$, because .$one refers to the whole 'one' column (all six rows) instead of the values within the current group:
test %>%
group_by(`two`) %>%
mutate(new=three-three[one=="c"])
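The same idea covers the baseline from the question, the row where 'one' is "d". This is a minimal sketch, assuming every group contains exactly one "d" row; the name `result` is only illustrative.

```r
library(dplyr)

test <- tibble(
  one   = c("c", "d", "e", "c", "d", "e"),
  two   = c("a", "a", "a", "b", "b", "b"),
  three = 1:6
)

# Inside mutate() on a grouped tibble, bare column names refer to the rows
# of the current group, so `one == "d"` has the same length as `three`.
result <- test %>%
  group_by(two) %>%
  mutate(new = three - three[one == "d"]) %>%
  ungroup()

result
```

If a group had no "d" row, or more than one, the subtraction would no longer line up, so it may be worth checking that assumption first.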

library(tidyverse)
tibble(
one = c("c", "d", "e", "c", "d", "e"),
two = c("a", "a", "a", "b", "b", "b"),
three = 1:6
) -> test_df
test_df %>%
group_by(two) %>%
mutate(new = three - three[1])
## # A tibble: 6 x 4
## # Groups: two [2]
## one two three new
## <chr> <chr> <int> <int>
## 1 c a 1 0
## 2 d a 2 1
## 3 e a 3 2
## 4 c b 4 0
## 5 d b 5 1
## 6 e b 6 2
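The question also asks about using the second value as the baseline. One possible sketch uses dplyr's nth(), the positional counterpart of first(); `second_baseline` is just an illustrative name.

```r
library(dplyr)

test_df <- tibble(
  one   = c("c", "d", "e", "c", "d", "e"),
  two   = c("a", "a", "a", "b", "b", "b"),
  three = 1:6
)

# nth(three, 2) picks the second value of `three` within each group.
second_baseline <- test_df %>%
  group_by(two) %>%
  mutate(new = three - nth(three, 2)) %>%
  ungroup()

second_baseline
```

A group with fewer than two rows has no second value, so `new` would be NA for that group.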

Related

How to create new column of repeating sequence based on other column

I have the following dataframe:
Participant_ID Order
1 A
1 A
2 B
2 B
3 A
3 A
4 B
4 B
5 B
5 B
6 A
6 A
Every two rows refer to the same participant. I want to create a new column based on the value in the column 'Order'. If the 'Order' == A, then I want it to create a new column with two rows of [1, 2], and then if the 'Order' == B, then I want it to create two rows of [2,1] in the same column
The preferred output would be the following:
Participant_ID Order Period
1 A 1
1 A 2
2 B 2
2 B 1
3 A 1
3 A 2
4 B 2
4 B 1
5 B 2
5 B 1
6 A 1
6 A 2
Any help would be appreciated
Here are a couple of possibilities. Both assume that the Order value is the same for all rows of a given Participant_ID. If this isn't the case, you will need to include additional logic.
You can use if_else:
library(tidyverse)
df %>%
group_by(Participant_ID) %>%
mutate(Period = if_else(Order == "A", 1:2, 2:1))
Or to explicitly check for multiple different values (e.g., "A", "B", etc.), have more flexibility, and include NA for other cases, you can use case_when:
df %>%
group_by(Participant_ID) %>%
mutate(Period = case_when(
Order == "A" ~ 1:2,
Order == "B" ~ 2:1,
TRUE ~ NA_integer_
))
Output
Participant_ID Order Period
<int> <chr> <int>
1 1 A 1
2 1 A 2
3 2 B 2
4 2 B 1
5 3 A 1
6 3 A 2
7 4 B 2
8 4 B 1
9 5 B 2
10 5 B 1
11 6 A 1
12 6 A 2
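Both versions above hard-code 1:2 and 2:1, so they rely on every participant having exactly two rows. A hedged variant that counts up for "A" and down for any other Order value, whatever the group size, could look like this. The data frame is rebuilt from the question's table, and `periods` is an illustrative name.

```r
library(dplyr)

df <- tibble(
  Participant_ID = rep(1:6, each = 2),
  Order          = rep(c("A", "B", "A", "B", "B", "A"), each = 2)
)

# row_number() counts 1..n within each participant;
# n() - row_number() + 1L is the same sequence reversed.
periods <- df %>%
  group_by(Participant_ID) %>%
  mutate(Period = if_else(Order == "A",
                          row_number(),
                          n() - row_number() + 1L)) %>%
  ungroup()

periods
```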

dplyr mutate: create column using first occurrence of another column

I was wondering if there's a more elegant way of taking a dataframe, grouping by x to see how many x's occur in the dataset, then mutating to find the first occurrence of every x (y)
test <- data.frame(x = c("a", "b", "c", "d",
"c", "b", "e", "f", "g"),
y = c(1,1,1,1,2,2,2,2,2))
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 c 2
6 b 2
7 e 2
8 f 2
9 g 2
Current Output
output <- test %>%
group_by(x) %>%
summarise(count = n())
x count
<fct> <int>
1 a 1
2 b 2
3 c 2
4 d 1
5 e 1
6 f 1
7 g 1
Desired Output
x count first_seen
<fct> <int> <dbl>
1 a 1 1
2 b 2 1
3 c 2 1
4 d 1 1
5 e 1 2
6 f 1 2
7 g 1 2
I can filter the test dataframe for the first occurrences then use a left_join but was hoping there's a more elegant solution using mutate?
# filter for first occurrences of y
right <- test %>%
group_by(x) %>%
filter(y == min(y)) %>%
slice(1) %>%
ungroup()
# bind to the output dataframe
left_join(output, right, by = "x")
We can use first after grouping by 'x' to create a new column, use that also in group_by and get the count with n()
library(dplyr)
test %>%
group_by(x) %>%
group_by(first_seen = first(y), .add = TRUE) %>% # '.add' replaced the deprecated 'add' argument in dplyr 1.0.0
summarise(count = n())
# A tibble: 7 x 3
# Groups: x [7]
# x first_seen count
# <fct> <dbl> <int>
#1 a 1 1
#2 b 1 2
#3 c 1 2
#4 d 1 1
#5 e 2 1
#6 f 2 1
#7 g 2 1
I have a question. Why not keep it simple? For example:
test %>%
group_by(x) %>%
summarise(
count = n(),
first_seen = first(y)
)
#> # A tibble: 7 x 3
#> x count first_seen
#> <chr> <int> <dbl>
#> 1 a 1 1
#> 2 b 2 1
#> 3 c 2 1
#> 4 d 1 1
#> 5 e 1 2
#> 6 f 1 2
#> 7 g 1 2
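One caveat: first(y) returns whichever row comes first in the data, so it equals the earliest y only when the rows are already ordered by y, as they are in the example. The sketch below reverses the rows to show the difference; min(y) is the order-independent alternative, matching the filter(y == min(y)) step in the question. The names `shuffled`, `by_first` and `by_min` are illustrative.

```r
library(dplyr)

test <- data.frame(
  x = c("a", "b", "c", "d", "c", "b", "e", "f", "g"),
  y = c(1, 1, 1, 1, 2, 2, 2, 2, 2)
)

# Reverse the row order so the data are no longer sorted by y.
shuffled <- test[nrow(test):1, ]

# first(y) now picks up y = 2 for "b" and "c".
by_first <- shuffled %>%
  group_by(x) %>%
  summarise(count = n(), first_seen = first(y))

# min(y) does not depend on row order.
by_min <- shuffled %>%
  group_by(x) %>%
  summarise(count = n(), first_seen = min(y))
```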

Separate data frame depending on one column duplicates

I have a large data frame with a lot of rows and columns. One column contains characters; some of them occur only once, others multiple times. I would like to split the whole data frame in two: one data frame with all the rows whose character repeats in this column, and another with all the rows whose character occurs only once. For example:
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)
> df
One Two Three
1 1 4 a
2 2 5 b
3 3 3 c
4 4 6 d
5 5 2 d
6 6 7 e
7 7 1 f
8 8 8 e
9 9 1 g
10 10 9 c
I wish to have two data frames like
> dfSingle
One Two Three
1 1 4 a
2 2 5 b
7 7 1 f
9 9 1 g
> dfMultiple
One Two Three
3 3 3 c
4 4 6 d
5 5 2 d
6 6 7 e
8 8 8 e
10 10 9 c
I tried with the duplicated() function
dfSingle = subset(df, !duplicated(df$Three))
dfMultiple = subset(df, duplicated(df$Three))
but it does not work as the first of the "c", "d" and "e" go to the "dfSingle".
I also tried to do a for-loop
MulipleValues = unique(df$Three[c(which(duplicated(df$Three)))])
dfSingle = data.frame()
x = 1
dfMultiple = data.frame()
y = 1
for (i in 1:length(df$One)) {
if(df$Three[i] %in% MulipleValues){
dfMultiple[x,] = df[i,]
x = x+1
} else {
dfSingle[y,] = df[i,]
y = y+1
}
}
It seems to do the right thing, as the data frames now have the right number of rows, but they somehow have 0 columns.
> dfSingle
data frame with 0 columns and 4 rows
> dfMultiple
data frame with 0 columns and 6 rows
What am I doing wrong? Or is there another way to do this?
Thanks for your help!
In base R, we can use split with duplicated, which returns a list of two data frames.
df1 <- split(df, duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE))
df1
#$`FALSE`
# One Two Three
#1 1 4 a
#2 2 5 b
#7 7 1 f
#9 9 1 g
#$`TRUE`
# One Two Three
#3 3 3 c
#4 4 6 d
#5 5 2 d
#6 6 7 e
#8 8 8 e
#10 10 9 c
where df1[[1]] can be considered as dfSingle and df1[[2]] as dfMultiple.
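As for the loop in the question: dfSingle and dfMultiple start out as data.frame(), which has no columns, so an assignment such as dfSingle[y,] = df[i,] has no columns to fill and seems only to add rows. That would explain the "0 columns" result reported there. A loop-free sketch of the same idea (collect the repeated values, then test membership) follows; `multiple_values` and `is_multi` are illustrative names.

```r
df <- data.frame(
  One   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Two   = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c"),
  stringsAsFactors = FALSE
)

# Values of Three that occur more than once.
multiple_values <- unique(df$Three[duplicated(df$Three)])

# TRUE for every row whose Three value is one of the repeated values.
is_multi <- df$Three %in% multiple_values

dfSingle   <- df[!is_multi, ]
dfMultiple <- df[is_multi, ]
```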
Here is a dplyr one for fun,
library(dplyr)
df %>%
group_by(Three) %>%
mutate(new = n() > 1) %>%
split(.$new)
which gives,
$`FALSE`
# A tibble: 4 x 4
# Groups: Three [4]
One Two Three new
<dbl> <dbl> <fct> <lgl>
1 1 4 a FALSE
2 2 5 b FALSE
3 7 1 f FALSE
4 9 1 g FALSE
$`TRUE`
# A tibble: 6 x 4
# Groups: Three [3]
One Two Three new
<dbl> <dbl> <fct> <lgl>
1 3 3 c TRUE
2 4 6 d TRUE
3 5 2 d TRUE
4 6 7 e TRUE
5 8 8 e TRUE
6 10 9 c TRUE
A way with dplyr:
library(dplyr)
df %>%
group_split(Duplicated = (add_count(., Three) %>% pull(n)) > 1)
Output:
[[1]]
# A tibble: 4 x 4
One Two Three Duplicated
<dbl> <dbl> <fct> <lgl>
1 1 4 a FALSE
2 2 5 b FALSE
3 7 1 f FALSE
4 9 1 g FALSE
[[2]]
# A tibble: 6 x 4
One Two Three Duplicated
<dbl> <dbl> <fct> <lgl>
1 3 3 c TRUE
2 4 6 d TRUE
3 5 2 d TRUE
4 6 7 e TRUE
5 8 8 e TRUE
6 10 9 c TRUE
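If two separately named data frames are wanted without a helper column, another dplyr option is to filter on the group size. This is a sketch, assuming the same df as in the question.

```r
library(dplyr)

df <- data.frame(
  One   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  Two   = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c")
)

# n() is the number of rows in the current Three group.
dfSingle <- df %>%
  group_by(Three) %>%
  filter(n() == 1) %>%
  ungroup()

dfMultiple <- df %>%
  group_by(Three) %>%
  filter(n() > 1) %>%
  ungroup()
```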
You can do it using base R
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)
str(df)
df$Three <- as.character(df$Three)
df$count <- as.numeric(ave(df$Three,df$Three,FUN = length))
dfSingle = subset(df,df$count == 1)
dfMultiple = subset(df,df$count > 1)

dplyr: create new column with values from other specified columns [duplicate]

This question already has answers here:
How can I use one column to determine where I get the value for another column?
(3 answers)
Closed 4 years ago.
I have a tibble:
library(tibble)
library(dplyr)
(
data <- tibble(
a = 1:3,
b = 4:6,
mycol = c('a', 'b', 'a')
)
)
#> # A tibble: 3 x 3
#> a b mycol
#> <int> <int> <chr>
#> 1 1 4 a
#> 2 2 5 b
#> 3 3 6 a
Using dplyr::mutate I'd like to create a new column called value which uses a value from either column a or b, depending on which column name is specified in the mycol column.
(
desired <- tibble(
a = 1:3,
b = 4:6,
mycol = c('a', 'b', 'a'),
value = c(1, 5, 3)
)
)
#> # A tibble: 3 x 4
#> a b mycol value
#> <int> <int> <chr> <dbl>
#> 1 1 4 a 1
#> 2 2 5 b 5
#> 3 3 6 a 3
Here we're just using the values from column a all the time.
data %>%
mutate(value = a)
#> # A tibble: 3 x 4
#> a b mycol value
#> <int> <int> <chr> <int>
#> 1 1 4 a 1
#> 2 2 5 b 2
#> 3 3 6 a 3
Here we're just assigning the values of mycol to the new column rather than getting the values from the appropriate column.
data %>%
mutate(value = mycol)
#> # A tibble: 3 x 4
#> a b mycol value
#> <int> <int> <chr> <chr>
#> 1 1 4 a a
#> 2 2 5 b b
#> 3 3 6 a a
I've tried various combinations of !!, quo(), etc. but I don't fully understand what's going on under the hood in terms of NSE.
@Jaap has marked this as a duplicate but I'd still like to see a dplyr/tidyverse approach using NSE rather than using base R if possible.
Here is one approach:
data %>%
mutate(value = ifelse(mycol == "a", a, b))
#output
# A tibble: 3 x 4
a b mycol value
<int> <int> <chr> <int>
1 1 4 a 1
2 2 5 b 5
3 3 6 a 3
and here is a more general way in base R
data$value <- diag(as.matrix(data[, data$mycol]))
more complex example:
df <- tibble(
a = 1:4,
b = 4:7,
c = 5:8,
mycol = c('a', 'b', 'a', "c"))
df$value <- diag(as.matrix(df[,df$mycol]))
#output
# A tibble: 4 x 5
a b c mycol value
<int> <int> <int> <chr> <int>
1 1 4 5 a 1
2 2 5 6 b 5
3 3 6 7 a 3
4 4 7 8 c 8
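The diag() approach builds an n-by-n matrix (one column per row of the data) only to read its diagonal, which can get large for long data frames. A hedged alternative is to index a matrix of the candidate columns with (row, column) pairs. In this sketch `value_cols`, `m` and `idx` are illustrative names, and it assumes every entry of mycol names one of the candidate columns.

```r
df <- data.frame(
  a = 1:4,
  b = 4:7,
  c = 5:8,
  mycol = c("a", "b", "a", "c"),
  stringsAsFactors = FALSE
)

value_cols <- c("a", "b", "c")

# Matrix holding only the columns that mycol can point to.
m <- as.matrix(df[, value_cols])

# One (row, column) pair per row: row i, and the column named in mycol[i].
idx <- cbind(seq_len(nrow(df)), match(df$mycol, value_cols))

# Indexing a matrix with a two-column matrix picks one cell per pair.
df$value <- m[idx]
```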

Add missing subtotals to each group using dplyr

I need to add a new row to each id group where key = "n" and value is total - (a + b)
x <- tibble( id = c(1,1,1,2,2,2,2), # tibble() replaces the deprecated data_frame()
key = c("a","b","total","a","x","b","total"),
value = c(1,2,10,4,1,3,12) )
# A tibble: 7 × 3
id key value
<dbl> <chr> <dbl>
1 1 a 1
2 1 b 2
3 1 total 10
4 2 a 4
5 2 x 1
6 2 b 3
7 2 total 12
In this example, the new rows should be
1 n 7
2 n 5
I tried getting the a+b subtotal and joining that to the total count to get the difference, but after using nine dplyr verbs I seem to be going in the wrong direction. Thanks.
This isn't a join, it's just binding new rows on:
x %>% group_by(id) %>%
summarize(
value = sum(value[key == 'total']) - sum(value[key %in% c('a', 'b')]),
key = 'n'
) %>%
bind_rows(x) %>%
select(id, key, value) %>% # back to original column order
arrange(id, key) # and a consistent row order
# # A tibble: 9 × 3
# id key value
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 b 2
# 3 1 n 7
# 4 1 total 10
# 5 2 a 4
# 6 2 b 3
# 7 2 n 5
# 8 2 total 12
# 9 2 x 1
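The order of the two expressions inside summarize() matters here. summarise() evaluates its arguments in sequence, and a later expression sees any column an earlier one has just overwritten. The sketch below contrasts the two orders; `right` and `wrong` are illustrative names.

```r
library(dplyr)

x <- tibble(
  id    = c(1, 1, 1, 2, 2, 2, 2),
  key   = c("a", "b", "total", "a", "x", "b", "total"),
  value = c(1, 2, 10, 4, 1, 3, 12)
)

# value first: `key` still refers to the original column at this point.
right <- x %>%
  group_by(id) %>%
  summarise(
    value = sum(value[key == "total"]) - sum(value[key %in% c("a", "b")]),
    key   = "n"
  )

# key first: `key` is now the single string "n", so neither condition
# matches and both sums run over empty vectors.
wrong <- x %>%
  group_by(id) %>%
  summarise(
    key   = "n",
    value = sum(value[key == "total"]) - sum(value[key %in% c("a", "b")])
  )
```

With value computed first the result is 7 and 5, as in the answer; with key computed first both values come out as 0.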
Here's a way using data.table, binding rows as in Gregor's answer:
library(data.table)
setDT(x)
dcast(x, id ~ key)[, .(id, key = "n", value = total - a - b)][, rbind(.SD, x)][order(id)]
id key value
1: 1 n 7
2: 1 a 1
3: 1 b 2
4: 1 total 10
5: 2 n 5
6: 2 a 4
7: 2 x 1
8: 2 b 3
9: 2 total 12
