I have the following data
a <- c("A1","A1","A2","A2")
b <- c("B1","B1","B2","B2")
val <- c(10,20,30,40)
df <- data.frame(a,b,val)
I want to replace the values in 'val' for rows that share the same 'a' and 'b', so that 'val' takes the value from the first row of each such group.
You may try
library(dplyr)
df %>%
group_by(a,b) %>%
mutate(val = first(val))
a b val
<chr> <chr> <dbl>
1 A1 B1 10
2 A1 B1 10
3 A2 B2 30
4 A2 B2 30
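If you would rather not load dplyr, a rough base R equivalent is sketched below. It assumes the rows are already in the order that defines the "initial" row of each group; ave() applies the given function within each combination of a and b.

```r
a <- c("A1", "A1", "A2", "A2")
b <- c("B1", "B1", "B2", "B2")
val <- c(10, 20, 30, 40)
df <- data.frame(a, b, val)

# Within each (a, b) group, overwrite val with the group's first value
df$val <- ave(df$val, df$a, df$b, FUN = function(v) v[1])
df
```

Unlike the dplyr version, this returns a plain data frame with no grouping attached.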
Suppose I have the data frame (df) below:
x=c("a1","a2","a3","b1","b2","b3")
y1=c(4,2,1,1,5,8)
y2=c(7,1,9,3,2,10)
df<-data.frame(x,y1,y2)
Namely:
> df
x y1 y2
1 a1 4 7
2 a2 2 1
3 a3 1 9
4 b1 1 3
5 b2 5 2
6 b3 8 10
Within each group of x (a or b), I want to find the value of x at which y1 is at its minimum, and likewise for y2.
The output I want for df is:
y1 y2
a3 a2
b1 b2
How can I get that result? My original data is much bigger.
Thanks a lot.
You don't have a clear group column defined, so we can create one first. For the example shown, we can remove all the digits from the x column and use the result as a group column. For each group, we can then find the minimum value in each column and get the corresponding x value.
library(dplyr)
df %>%
group_by(group = sub('\\d+', '', x)) %>%
summarise(across(y1:y2, ~x[which.min(.)]))
# group y1 y2
# <chr> <chr> <chr>
#1 a a3 a2
#2 b b1 b2
We could use
library(stringr)
library(dplyr)
df %>%
group_by(grp = str_remove(x, "\\d+")) %>%
summarise(across(where(is.numeric), ~ x[which.min(.)]))
# A tibble: 2 x 3
# grp y1 y2
# <chr> <chr> <chr>
#1 a a3 a2
#2 b b1 b2
A data.table option
> library(data.table)
> setDT(df)[, lapply(.SD, function(v) x[which.min(v)]), .(grp = gsub("\\d", "", x))]
grp y1 y2
1: a a3 a2
2: b b1 b2
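For comparison, the same idea (derive a group key from x, then pick the x at each column's minimum within the group) can be sketched in base R alone. The object names grp and res are only illustrative, and the sketch assumes x is stored as character.

```r
df <- data.frame(x  = c("a1", "a2", "a3", "b1", "b2", "b3"),
                 y1 = c(4, 2, 1, 1, 5, 8),
                 y2 = c(7, 1, 9, 3, 2, 10),
                 stringsAsFactors = FALSE)

# Group key: x with its digits stripped ("a1" -> "a")
grp <- sub("\\d+", "", df$x)

# For each y column and each group, return the x at the group's minimum
res <- sapply(df[c("y1", "y2")], function(v) {
  tapply(seq_along(v), grp, function(i) df$x[i][which.min(v[i])])
})
res
```

The result is a character matrix with one row per group and one column per y variable.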
Here is a data frame:
ID<-c(rep("A",3),rep("B",2), rep("C",3),rep("D",5))
cell<-c("a1","a2","a3","a1","a2","a1","a2", "a3","a1","a2","a1","a2","a3")
value<-c(2,5,3,4,5,6,9,8,7,2,5,2,4)
df <- data.frame(ID, cell, value)  # keeps value numeric; cbind() would coerce every column to character
I want to calculate the sum of all values for each ID up to and including cell a2. The order of the cells and IDs must be taken into account. If no further cell “a2” follows once a sum has been calculated, the remaining rows should not be taken into account.
As a result I would like to get this table:
Could you please help me to code this condition?
Thanks in advance.
Best regards, Inna
Assuming the data are already correctly ordered by cell:
library( tidyverse )
df %>%
group_by( ID ) %>%
mutate( value = cumsum( value ) ) %>%
filter( cell == "a2" )
# # A tibble: 5 x 3
# # Groups: ID [4]
# ID cell value
# <chr> <chr> <dbl>
# 1 A a2 7
# 2 B a2 9
# 3 C a2 15
# 4 D a2 9
# 5 D a2 16
Treating each occurrence of "a2" as the end of a separate group, we can do:
library(dplyr)
df %>%
#Create a group column with every value of cell == 'a2' as different group
group_by(ID, grp = cumsum(lag(cell == 'a2', default = TRUE))) %>%
#Remove those groups that do not have 'a2' in them
filter(any(cell == 'a2')) %>%
#Sum till 'a2' value
summarise(value = sum(value[seq_len(match('a2', cell))]),
cell = last(cell)) %>%
select(-grp)
# ID value cell
# <chr> <dbl> <chr>
#1 A 7 a2
#2 B 9 a2
#3 C 15 a2
#4 D 9 a2
#5 D 7 a2
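To see how that grouping key behaves, here is a small base R illustration on the cells of ID "D" only. The manual lag via c(TRUE, head(..., -1)) is a stand-in for dplyr's lag(..., default = TRUE).

```r
cell <- c("a1", "a2", "a1", "a2", "a3")   # the cells of ID "D"

is_a2 <- cell == "a2"
# Shift by one so a new group starts on the row *after* each "a2"
starts_group <- c(TRUE, head(is_a2, -1))
grp <- cumsum(starts_group)
grp
# [1] 1 1 2 2 3
```

Group 3 contains only "a3" and no "a2", which is why the filter(any(cell == 'a2')) step drops it.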
A succinct solution using ave.
r <- transform(df, value=ave(value, ID, FUN=cumsum))[df$cell == "a2", ]
r
# ID cell value
# 2 A a2 7
# 5 B a2 9
# 7 C a2 15
# 10 D a2 9
# 12 D a2 16
An option with data.table
library(data.table)
setDT(df)[, value := cumsum(value) , ID][cell == 'a2']
-output
# ID cell value
#1: A a2 7
#2: B a2 9
#3: C a2 15
#4: D a2 9
#5: D a2 16
I'm trying to use dplyr's mutate_at to subtract one numeric column (A2) from another corresponding numeric column (A1). I have multiple columns (B, C, D, E, ...) and several data frames (df1 to df99) that I want to do this for, so I want to write a function.
df1 <- df1 %>% mutate_at(.vars = vars(A1), .funs = funs(remainder = .-A2))
This works fine. However, when I try to write a function to perform this:
REMAINDER <- function(df, numer, denom){
df <- df %>% mutate_at(.vars = vars(numer), .funs = funs(remainder = .-denom))
return(df)
}
With arguments df1 <- REMAINDER(df1, A1, A2)
I get the error Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to binary operator.
I don't understand this, as I just ran the same line of code outside a function and my columns are numeric.
The vignette Programming with dplyr explains in great detail what to do:
library(dplyr)
REMAINDER <- function(df, numer, denom) {
numer <- enquo(numer)
denom <- enquo(denom)
df %>% mutate_at(.vars = vars(!! numer), .funs = funs(remainder = . - !! denom))
}
df1 <- data_frame(A1 = 11:13, A2 = 3:1, B1 = 21:23, B2 = 8:6)
REMAINDER(df1, A1, A2)
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 8
2 12 2 22 7 10
3 13 1 23 6 12
REMAINDER(df1, B1, B2)
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 13
2 12 2 22 7 15
3 13 1 23 6 17
Naming the result column
The OP wants to update df1 and he wants to apply this operation to other columns as well.
Unfortunately, the REMAINDER() function as it is currently defined will overwrite the result column:
df1
# A tibble: 3 x 4
A1 A2 B1 B2
<int> <int> <int> <int>
1 11 3 21 8
2 12 2 22 7
3 13 1 23 6
df1 <- REMAINDER(df1, A1, A2)
df1
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 8
2 12 2 22 7 10
3 13 1 23 6 12
df1 <- REMAINDER(df1, B1, B2)
df1
# A tibble: 3 x 5
A1 A2 B1 B2 remainder
<int> <int> <int> <int> <int>
1 11 3 21 8 13
2 12 2 22 7 15
3 13 1 23 6 17
The function can be modified so that the result column is individually named:
REMAINDER <- function(df, numer, denom) {
numer <- enquo(numer)
denom <- enquo(denom)
result_name <- paste0("remainder_", quo_name(numer), "_", quo_name(denom))
df %>% mutate_at(.vars = vars(!! numer),
.funs = funs(!! result_name := . - !! denom))
}
Now, calling REMAINDER() twice on different columns and replacing df1 after each call, we get
df1 <- REMAINDER(df1, A1, A2)
df1 <- REMAINDER(df1, B1, B2)
df1
# A tibble: 3 x 6
A1 A2 B1 B2 remainder_A1_A2 remainder_B1_B2
<int> <int> <int> <int> <int> <int>
1 11 3 21 8 8 13
2 12 2 22 7 10 15
3 13 1 23 6 12 17
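In later dplyr releases, funs() and data_frame() were deprecated and mutate_at() was superseded. A minimal sketch of the same function in the newer style follows, assuming dplyr 1.0 or later with a recent rlang (needed for the "{{ }}" interpolation in the column name). remainder2 is a hypothetical name chosen only to keep it apart from REMAINDER() above.

```r
library(dplyr)

# Hypothetical rewrite of REMAINDER() using embracing ({{ }}) instead of enquo()/!!
remainder2 <- function(df, numer, denom) {
  df %>%
    mutate("remainder_{{ numer }}_{{ denom }}" := {{ numer }} - {{ denom }})
}

df1 <- tibble(A1 = 11:13, A2 = 3:1, B1 = 21:23, B2 = 8:6)
df1 <- remainder2(df1, A1, A2)
df1 <- remainder2(df1, B1, B2)
df1
```

The column names are built from the argument names, so the two calls produce remainder_A1_A2 and remainder_B1_B2 as before.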
I have used this suggestion to subtract pairs of columns in a list of data frames. My example has only 3 pairs of columns in each of the two data frames, but it also works with a larger number of columns and data frames.
library(data.table)
dt <- data.table(A1 = round(runif(3),1), A2 = round(runif(3),1),
B1 = round(runif(3),1), B2 = round(runif(3),1),
C1 =round(runif(3),1), C2 =round(runif(3),1))
dt = list(dt,dt+dt)
lapply(seq_along(dt), function(z) {
dt[[z]][, lapply(1:(ncol(.SD)/2), function(x) (.SD[[2*x-1]] - .SD[[2*x]]))]
})
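The pairwise subtraction itself can also be sketched in base R on a single data frame. This assumes the columns are laid out as adjacent pairs (A1, A2, B1, B2, ...) and that each first column's name ends in "1"; the values here are made up for illustration.

```r
df <- data.frame(A1 = c(5, 6), A2 = c(1, 2),
                 B1 = c(9, 8), B2 = c(3, 3))

# Positions of the first column in each pair: 1, 3, ...
odd <- seq(1, ncol(df), by = 2)

# Subtract each pair's second column from its first
res <- df[odd] - df[odd + 1]
names(res) <- sub("1$", "", names(df)[odd])   # "A1" -> "A", "B1" -> "B"
res
```

For a list of data frames, the same steps can be wrapped in a function and passed to lapply().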
I have a data frame df1 with columns a, b, c. I want to assign c = 0 to the first row of each group returned by group_by(a, b). I tried something like
t <- df1 %>% group_by(a,b) %>% filter(row_number(a)==1) %>% mutate(c= 0)
But that reduced the number of rows. The expected output is
a b c
a1 b1 0
a1 b1 NA
a2 b2 0
a2 b2 NA
You can use seq_along to number elements in each group from 1 to the total number of elements within each group (2, in this case). Then use ifelse to set the first element of 'c' for each group to be 0 and leave the other element as is.
library(dplyr)
df %>%
group_by(a, b) %>%
mutate(c = ifelse(seq_along(c) == 1, 0, c))
# A tibble: 4 x 3
# Groups: a, b [2]
# a b c
# <fct> <fct> <dbl>
#1 a1 b1 0.
#2 a1 b1 NA
#3 a2 b2 0.
#4 a2 b2 NA
data
df <- data.frame(a = rep(c("a1", "a2"), each = 2),
b = rep(c("b1", "b2"), each = 2),
c = NA)
df
# a b c
#1 a1 b1 NA
#2 a1 b1 NA
#3 a2 b2 NA
#4 a2 b2 NA
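The same "first row of each group" idea can also be written without dplyr. A minimal base R sketch, assuming the first row of a group is simply its first occurrence in the data (first_in_group is an illustrative name):

```r
df <- data.frame(a = rep(c("a1", "a2"), each = 2),
                 b = rep(c("b1", "b2"), each = 2),
                 c = NA_real_)

# TRUE for the first occurrence of each (a, b) combination
first_in_group <- !duplicated(df[c("a", "b")])

# Set c to 0 on those rows only; all other rows keep their value
df$c[first_in_group] <- 0
df
```

Because nothing is filtered out, the number of rows stays the same.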
OK, so my data frame looks like this; let's call it df:
KEY A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4
1 120 100 NA 110 1 1 NA 1 NA NA NA NA
2 100 NA 115 NA NA NA NA NA Y N Y N
What I'm trying to do is this: when an A column has a value of 100 and the corresponding B or C column has a value of 1 or "Y" respectively, create a new column X with a value of 1. In row 1 that would be A2 and B2, and in row 2 that would be A1 and C1.
I tried doing three sets of gather and then using mutate with case_when, like so:
df<- df %>%
gather(key="A",value="code",dx)%>%
gather(key="B",value="number",dxadm)%>%
gather(key="C",value="character",dxpoa) %>%
mutate(X=case_when(
code == 100 & present >0 ~ 1,
code ==100 & character == "Y"~1)
)
Except that when I spread these rows back, the rows came out all awry and my X was out of place.
Alternatively, I considered something like
df <- df %>%
mutate(X=case_when(
A1 == 100 & B1 >0 ~ 1,
A1 ==100 & C1 == "Y"~1,
A2 == 100 & B2 >0 ~ 1,
A2 ==100 & C2 == "Y"~1,)
and so on for all permutations. The two problems with this are that I have a lot of columns and that I'd like to do this for multiple different values of A.
Can anyone recommend an alternative, or at least a way to turn the second solution into something that would only require one annoyingly long piece of code that I could make into a more generalizable function? Thanks!
A suggestion
require(read.so) # awesome package to read from Stack Overflow,
# available on GitHub: https://alistaire47.github.io/read.so/
require(tidyr)
require(reshape2)
require(dplyr)
dat <- read.so()
dat %>% gather(var, value, 2:13) %>% #make it long
mutate(var = gsub('([A-Z])', '\\1_', .[['var']])) %>% #add underscore
separate(var, c('var', 'number') ) %>% #separate your column
dcast(KEY+number ~ var) %>% #dcast is a bit complex but quite powerful
group_by(KEY) %>%
filter(A == 100)
# A tibble: 2 x 5
# Groups: KEY [2]
KEY number A B C
<int> <chr> <chr> <chr> <chr>
1 1 2 100 1 <NA>
2 2 1 100 <NA> Y
A solution using dplyr and tidyr. We can gather all the columns except KEY, separate the letters from the numbers, and then spread the letters, so that the X column can be created without naming each numbered column. Note that I assume X should be 0 if the condition is not met. Based on your description, I used any(A %in% 100 & (B %in% 1 | C %in% "Y")) to test the condition: if any number within a KEY meets it, X is 1 for that KEY.
library(dplyr)
library(tidyr)
df2 <- df %>%
gather(Column, Value, -KEY) %>%
separate(Column, into = c("Letter", "Number"), sep = 1) %>%
spread(Letter, Value, convert = TRUE) %>%
group_by(KEY) %>%
mutate(X = ifelse(any(A %in% 100 & (B %in% 1 | C %in% "Y")), 1L, 0L))
df2 %>% as.data.frame()
# KEY Number A B C X
# 1 1 1 120 1 <NA> 1
# 2 1 2 100 1 <NA> 1
# 3 1 3 NA NA <NA> 1
# 4 1 4 110 1 <NA> 1
# 5 2 1 100 NA Y 1
# 6 2 2 NA NA N 1
# 7 2 3 115 NA Y 1
# 8 2 4 NA NA N 1
I think the structure of df2 is good, but if you really want the original structure, we can do the following.
df3 <- df2 %>%
gather(Letter, Value, A:C) %>%
unite(Column, Letter, Number, sep = "") %>%
spread(Column, Value) %>%
select(names(df), X)
df3 %>% as.data.frame()
# KEY A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4 X
# 1 1 120 100 <NA> 110 1 1 <NA> 1 <NA> <NA> <NA> <NA> 1
# 2 2 100 <NA> 115 <NA> <NA> <NA> <NA> <NA> Y N Y N 1
df3 is the final output.
DATA
df <- read.table(text = "KEY A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4
1 120 100 NA 110 1 1 NA 1 NA NA NA NA
2 100 NA 115 NA NA NA NA NA Y N Y N",
header = TRUE, stringsAsFactors = FALSE)
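gather() and spread() have since been superseded in tidyr by pivot_longer() and pivot_wider(). A minimal sketch of the same reshaping in that style follows, assuming tidyr 1.0 or later. The ".value" sentinel sends the letter part of each name to its own column, so A, B and C keep their own types. The object names long, flag and res are only illustrative.

```r
library(dplyr)
library(tidyr)

df <- read.table(text = "KEY A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4
1 120 100 NA 110 1 1 NA 1 NA NA NA NA
2 100 NA 115 NA NA NA NA NA Y N Y N",
                 header = TRUE, stringsAsFactors = FALSE)

# One row per KEY and Number, with columns A, B and C
long <- df %>%
  pivot_longer(-KEY,
               names_to = c(".value", "Number"),
               names_pattern = "([A-Z])(\\d+)")

# One flag per KEY: 1 if any Number has A == 100 together with B == 1 or C == "Y"
flag <- long %>%
  group_by(KEY) %>%
  summarise(X = as.integer(any(A %in% 100 & (B %in% 1 | C %in% "Y"))))

# Attach the flag to the original wide layout
res <- left_join(df, flag, by = "KEY")
res
```

Joining the flag back onto df avoids the second round trip from long to wide.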
Same idea as Tjebo, but sticking to the tidyverse....
library(tidyverse)
dat <- data.frame(stringsAsFactors=FALSE,
KEY = c(1L, 2L),
A1 = c(120L, 100L),
A2 = c(100L, NA),
A3 = c(NA, 115L),
A4 = c(110L, NA),
B1 = c(1L, NA),
B2 = c(1L, NA),
B3 = c(NA, NA),
B4 = c(1L, NA),
C1 = c(NA, "Y"),
C2 = c(NA, "N"),
C3 = c(NA, "Y"),
C4 = c(NA, "N"))
dat %>%
gather(var, value, -KEY) %>% #make it long
extract(var, regex = "(.)(.)", into = c("var", "number") ) %>%
spread(var, value) %>%
filter( A %in% 100 )
#> KEY number A B C
#> 1 1 2 100 1 <NA>
#> 2 2 1 100 <NA> Y
Created on 2018-02-27 by the reprex package (v0.2.0).