I have a data.frame with two variables. I need to group the rows by var1 and replace every "x" in var2 with the single other value that occurs in that group.
For example:
var1 var2
1 1 a
2 2 a
3 2 x
4 3 b
5 4 c
6 5 a
7 6 c
8 6 x
9 7 c
10 8 x
11 8 b
12 8 b
13 9 a
Outcome should be:
var1 var2
1 1 a
2 2 a
3 2 a <-
4 3 b
5 4 c
6 5 a
7 6 c
8 6 c <-
9 7 c
10 8 b <-
11 8 b
12 8 b
13 9 a
I did manage to solve this example:
library(dplyr)
dat <- data.frame(var1=c(1,2,2,3,4,5,6,6,7,8,8,8,9), var2=c("a","a","x","b","c","a","c","x","c","x","b","b","a"))
dat %>% group_by(var1) %>% mutate(
  var2 = as.character(var2),
  var2 = ifelse(var2 == 'x', var2[order(var2)][1], var2))
But this does not work for my real data because of the ordering (it only works when "x" sorts after the other value) :(
I need another approach. I am thinking of something like checking explicitly for "not x", but I have not found a solution.
Any help appreciated!
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'var1', take the elements of 'var2' that are not 'x', select the first one, and assign (:=) it to 'var2'. Here 'df1' stands for the data frame from the question ('dat').
library(data.table)
setDT(df1)[, var2 := var2[var2!='x'][1], var1]
Or with dplyr
library(dplyr)
df1 %>%
group_by(var1) %>%
mutate(var2 = var2[var2!="x"][1])
# var1 var2
# <int> <chr>
#1 1 a
#2 2 a
#3 2 a
#4 3 b
#5 4 c
#6 5 a
#7 6 c
#8 6 c
#9 7 c
#10 8 b
#11 8 b
#12 8 b
#13 9 a
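If some group in the real data could contain nothing but "x", the one-liners above would turn that group into NA. Below is a minimal base R sketch of the same "check for not x" idea that leaves such groups untouched. It assumes var2 is a character column; the helper name fill_x and the column var2_filled are made up for illustration.

```r
# Example data matching the table in the question
dat <- data.frame(
  var1 = c(1, 2, 2, 3, 4, 5, 6, 6, 7, 8, 8, 8, 9),
  var2 = c("a", "a", "x", "b", "c", "a", "c", "x", "c", "x", "b", "b", "a"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace "x" with the first non-"x" value of the group,
# and return the group unchanged when it holds no other value
fill_x <- function(v) {
  known <- v[v != "x"]
  if (length(known) == 0) return(v)
  v[v == "x"] <- known[1]
  v
}

# ave() applies fill_x within each var1 group and keeps the row order
dat$var2_filled <- ave(dat$var2, dat$var1, FUN = fill_x)
dat
```

Because ave() returns a vector of the same length and order as its input, nothing has to be sorted or re-joined afterwards.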
Related
I have a dataframe:
my_df <- data.frame(var1 = c(1,2,3,4,5), var2 = c(6,7,8,9,10))
my_df
var1 var2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I also have a vector:
my_vec <- c("a", "b", "c")
I want to repeat the dataframe length(my_vec) times, filling in the values of a new variable with the vector values. Is there a simple way to do this? If possible, I'd like to do this in a dplyr chain. Desired output:
var1 var2 var3
1 1 6 a
2 2 7 a
3 3 8 a
4 4 9 a
5 5 10 a
6 1 6 b
7 2 7 b
8 3 8 b
9 4 9 b
10 5 10 b
11 1 6 c
12 2 7 c
13 3 8 c
14 4 9 c
15 5 10 c
We can use crossing or expand_grid from tidyr
library(tidyr)
crossing(my_df, var3 = my_vec)
#expand_grid(my_df, var3 = my_vec)
If the order is important, use arrange
library(dplyr)
crossing(my_df, var3 = my_vec) %>%
arrange(var3)
-output
# A tibble: 15 × 3
var1 var2 var3
<dbl> <dbl> <chr>
1 1 6 a
2 2 7 a
3 3 8 a
4 4 9 a
5 5 10 a
6 1 6 b
7 2 7 b
8 3 8 b
9 4 9 b
10 5 10 b
11 1 6 c
12 2 7 c
13 3 8 c
14 4 9 c
15 5 10 c
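For comparison, the same result can be sketched without any packages by repeating the row indices. This is only an illustration under the question's setup; the name out is arbitrary.

```r
my_df <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 = c(6, 7, 8, 9, 10))
my_vec <- c("a", "b", "c")

n <- nrow(my_df)
# Repeat the whole block of rows once per element of my_vec
out <- my_df[rep(seq_len(n), times = length(my_vec)), ]
# Label each block with one vector element
out$var3 <- rep(my_vec, each = n)
# Reset the row names left over from the indexing
rownames(out) <- NULL
out
```

Here the rows already come out block by block, so no arrange step is needed.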
Though I don't think this is likely to be the simplest answer in practice, I specifically saw that you wanted a dplyr chain that would solve this, and so I tried to do this without using the pre-existing functions that do this for you.
For your example specifically, you could use this chain with the tibble package functions add_column and add_row
my_df %>%
tibble::add_column(var3 = my_vec[1]) %>%
tibble::add_row(tibble::add_column(my_df, var3 = my_vec[2])) %>%
tibble::add_row(tibble::add_column(my_df, var3 = my_vec[3]))
which directly yields
var1 var2 var3
1 1 6 a
2 2 7 a
3 3 8 a
4 4 9 a
5 5 10 a
6 1 6 b
7 2 7 b
8 3 8 b
9 4 9 b
10 5 10 b
11 1 6 c
12 2 7 c
13 3 8 c
14 4 9 c
15 5 10 c
That chain only handles a fixed number of vector elements, so it is not very adaptable. To generalise it, I wrote a function that does the same thing for a vector of any length.
my_fxn <-
  function(frame, yourVector, new.col.name = paste0("var", NCOL(frame) + 1)) {
    # only the tibble package is needed; its functions are called with tibble::
    origcols <- colnames(frame)
    for (i in seq_along(yourVector)) {
      # copy of the frame with the i-th vector element as a constant column
      intermediateFrame <- tibble::add_column(
        frame,
        temp.name = rep_len(yourVector[[i]], nrow(frame))
      )
      colnames(intermediateFrame) <- append(origcols, new.col.name)
      if (i == 1) {
        Frame3 <- intermediateFrame
      } else {
        # stack this copy below the ones built so far
        Frame3 <- tibble::add_row(Frame3, intermediateFrame)
      }
    }
    return(Frame3)
  }
Running my_fxn(my_df, my_vec) should get you the same data frame/table that we got above.
I also experimented with using a for loop outside a function on its own to do this, but decided that it was getting to be overkill. That approach is definitely also possible, though.
I have a data.frame (the eBird basic dataset) in which several observers may upload a record of the same sighting. In that case the event is given a "group identifier"; records that do not come from a group session have NA in that column. I'm trying to filter out the duplicates from group events while keeping all the NAs, and I'd like to do this without splitting the data frame in two:
library(dplyr)
set.seed(1)
df <- tibble(
  x = sample(c(1:6, NA), 30, replace = TRUE),
  y = sample(c(letters[1:4]), 30, replace = TRUE)
)
df %>% count(x,y)
gives:
> df %>% count(x,y)
# A tibble: 20 x 3
x y n
<int> <chr> <int>
1 1 a 1
2 1 b 2
3 2 a 1
4 2 b 1
5 2 c 1
6 2 d 3
7 3 a 1
8 3 b 1
9 3 c 4
10 4 d 1
11 5 a 1
12 5 b 2
13 5 c 1
14 5 d 1
15 6 a 1
16 6 c 2
17 NA a 1
18 NA b 2
19 NA c 2
20 NA d 1
I don't want the rows with NA in x to be grouped together, as happened here with the "NA b" and "NA c" combinations. The distinct function has no option to leave NAs out of the comparison. Is splitting the data frame the only solution?
With distinct, an option is to create a helper column that is 0 for ordinary rows and unique for every row where 'x' is NA, so those rows can never be treated as duplicates
library(dplyr)
df %>%
mutate(x1 = row_number() * is.na(x)) %>%
distinct %>%
select(-x1)
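Or we can use duplicated with an OR (|) condition in filter, so that every row with NA in 'x' is kept. (The output below comes from a different random draw than the count table in the question, so the exact rows differ.)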
df %>%
filter(is.na(x)|!duplicated(cur_data()))
# A tibble: 20 x 2
# x y
# <int> <chr>
# 1 1 b
# 2 4 b
# 3 NA a
# 4 1 d
# 5 2 c
# 6 5 a
# 7 NA d
# 8 3 c
# 9 6 b
#10 2 b
#11 3 b
#12 1 c
#13 5 d
#14 2 d
#15 6 d
#16 2 a
#17 NA c
#18 NA a
#19 1 a
#20 5 b
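The same condition can be checked on a tiny hand-made data frame where the result is easy to verify by eye. This is a base R sketch with made-up data, not the eBird data; the names keep and res are arbitrary. As a side note, in dplyr 1.1.0 and later cur_data() is deprecated in favour of pick(everything()).

```r
# Small made-up example: two duplicated group rows and three identical NA rows
df <- data.frame(
  x = c(1, 1, NA, NA, 2, 2, NA),
  y = c("a", "a", "b", "b", "c", "d", "b"),
  stringsAsFactors = FALSE
)

# Keep a row if x is NA, or if it is the first occurrence of its (x, y) pair
keep <- is.na(df$x) | !duplicated(df)
res <- df[keep, ]
res
```

Only the second "1 a" row is dropped; all three "NA b" rows survive even though they are identical to each other.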
df <- data.frame(dat=c("11-03","12-03","13-03"),
c=c(0,15,20,4,19,21,2,10,14), d=rep(c("A","B","C"),each=3))
Suppose c holds cumulative values. I want to create a column daily so that the result looks like
dat c d daily
1 11-03 0 A 0
2 12-03 15 A 15
3 13-03 20 A 5
4 11-03 4 B 4
5 12-03 19 B 15
6 13-03 21 B 2
7 11-03 2 C 2
8 12-03 10 C 8
9 13-03 14 C 4
For each value of d, going through dat in date order, daily is the day-to-day change computed from the cumulative values in column c.
We can get the diff of 'c' after grouping by 'd'
library(dplyr)
df %>%
group_by(d) %>%
mutate(daily = c(first(c), diff(c)))
# A tibble: 9 x 4
# Groups: d [3]
# dat c d daily
# <fct> <dbl> <fct> <dbl>
#1 11-03 0 A 0
#2 12-03 15 A 15
#3 13-03 20 A 5
#4 11-03 4 B 4
#5 12-03 19 B 15
#6 13-03 21 B 2
#7 11-03 2 C 2
#8 12-03 10 C 8
#9 13-03 14 C 4
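Or take the difference between 'c' and the lag of 'c', with default = 0 so that the first row of each group keeps its own value instead of becoming NA
df %>%
  group_by(d) %>%
  mutate(daily = c - lag(c, default = 0))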
Data.table solution:
library(data.table)
df <- as.data.table(df)
df[, daily := c - shift(c, fill = 0), by = d]
shift is data.table's lag operator, so basically we subtract from 'c' its previous value within each group.
fill = 0 replaces the NA with zero, because within each group there is no previous value (shift(c)) for the first element.
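For completeness, here is a base R sketch of the same per-group differencing using ave(), followed by a round-trip check that the daily values sum back to the cumulative column. The helper expression inside FUN is just one way to write it.

```r
df <- data.frame(
  dat = c("11-03", "12-03", "13-03"),
  c = c(0, 15, 20, 4, 19, 21, 2, 10, 14),
  d = rep(c("A", "B", "C"), each = 3)
)

# Within each group of d: keep the first cumulative value,
# then take successive differences
df$daily <- ave(df$c, df$d, FUN = function(v) c(v[1], diff(v)))

# Round trip: the cumulative sum of daily within each group gives back c
all.equal(ave(df$daily, df$d, FUN = cumsum), df$c)
df
```

This assumes the rows are already in date order within each group, as they are in the question.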
I have the following dataframe
df<-data.frame("ID"=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
'A_Frequency'=c(1,2,3,4,5,1,2,3,4,5),
'B_Frequency'=c(1,2,NA,4,6,1,2,5,6,7))
The dataframe appears as follows
ID A_Frequency B_Frequency
1 A 1 1
2 A 2 2
3 A 3 NA
4 A 4 4
5 A 5 6
6 B 1 1
7 B 2 2
8 B 3 5
9 B 4 6
10 B 5 7
I wish to create a new dataframe df2 from df that looks as follows
ID CFreq
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 1
8 B 2
9 B 3
10 B 4
11 B 5
12 B 6
13 B 7
The new dataframe has a column CFreq that takes unique values from A_Frequency, B_Frequency and groups them by ID. Then it ignores the NA values and generates the CFreq column
I have tried dplyr but am unable to get the required result
df2<-df%>%group_by(ID)%>%select(ID, A_Frequency,B_Frequency)%>%
mutate(Cfreq=unique(A_Frequency, B_Frequency))
This yields the following which is quite different
ID A_Frequency B_Frequency Cfreq
<fct> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 NA 3
4 A 4 4 4
5 A 5 6 5
6 B 1 1 1
7 B 2 2 2
8 B 3 5 3
9 B 4 6 4
10 B 5 7 5
Could someone help me here?
The gather function from the tidyr package will be helpful here:
library(tidyverse)
df %>%
gather(x, CFreq, -ID) %>%
select(-x) %>%
na.omit() %>%
unique() %>%
arrange(ID, CFreq)
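In tidyr 1.0.0 and later, gather is superseded by pivot_longer. A sketch of the equivalent chain, assuming dplyr and a recent tidyr are installed (the column name "source" is arbitrary and dropped by distinct):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  A_Frequency = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  B_Frequency = c(1, 2, NA, 4, 6, 1, 2, 5, 6, 7)
)

df2 <- df %>%
  # stack both frequency columns into one, dropping the NA on the way
  pivot_longer(cols = -ID, names_to = "source", values_to = "CFreq",
               values_drop_na = TRUE) %>%
  # keep each (ID, CFreq) pair once
  distinct(ID, CFreq) %>%
  arrange(ID, CFreq)
df2
```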
A different tidyverse possibility could be (this uses the nest()/unnest() interface from before tidyr 1.0.0):
df %>%
nest(A_Frequency, B_Frequency, .key = C_Frequency) %>%
mutate(C_Frequency = map(C_Frequency, function(x) unique(x[!is.na(x)]))) %>%
unnest()
ID C_Frequency
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
9 A 6
10 B 1
11 B 2
12 B 3
13 B 4
14 B 5
18 B 6
19 B 7
A base R approach would be to split the dataframe by ID, and for every list element count the number of unique entries and create a sequence based on that count.
do.call(rbind, lapply(split(df, df$ID), function(x) data.frame(ID = x$ID[1] ,
CFreq = seq_len(length(unique(na.omit(unlist(x[-1]))))))))
# ID CFreq
#A.1 A 1
#A.2 A 2
#A.3 A 3
#A.4 A 4
#A.5 A 5
#A.6 A 6
#B.1 B 1
#B.2 B 2
#B.3 B 3
#B.4 B 4
#B.5 B 5
#B.6 B 6
#B.7 B 7
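This also runs when A_Frequency and B_Frequency hold characters or arbitrary numbers instead of sequential numbers. Note, however, that CFreq is then a running index 1..n over the unique values, not the values themselves; it matches the expected output here only because the values are consecutive integers starting at 1.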
In tidyverse we can do
library(tidyverse)
df %>%
group_split(ID) %>%
map_dfr(~ data.frame(ID = .$ID[1],
CFreq= seq_len(length(unique(na.omit(flatten_chr(.[-1])))))))
A data.table option
library(data.table)
cols <- c('A_Frequency', 'B_Frequency')
out <- setDT(df)[, .(CFreq = sort(unique(unlist(.SD)))),
.SDcols = cols,
by = ID]
out
# ID CFreq
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: A 5
# 6: A 6
# 7: B 1
# 8: B 2
# 9: B 3
#10: B 4
#11: B 5
#12: B 6
#13: B 7
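A base R sketch along the same lines as the data.table option, returning the unique values themselves per ID. The names freq_cols and parts are made up for illustration.

```r
df <- data.frame(
  ID = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  A_Frequency = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  B_Frequency = c(1, 2, NA, 4, 6, 1, 2, 5, 6, 7),
  stringsAsFactors = FALSE
)

freq_cols <- c("A_Frequency", "B_Frequency")

# One small data frame per ID with the sorted unique values;
# sort() drops NA by default
parts <- lapply(split(df, df$ID), function(g) {
  data.frame(ID = g$ID[1], CFreq = sort(unique(unlist(g[freq_cols]))))
})

df2 <- do.call(rbind, parts)
rownames(df2) <- NULL
df2
```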
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
Into this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to work with any number of columns, so that it would also work if I had S1, S2, S3, S4, or if there was an additional category, e.g. DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns, and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
keep <- enquo(keep)
build_exp <- function(x){
paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
parse_expr()
}
exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
sel_fun <- function(num){
df %>% select(!!keep,
!!!exps)
}
map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
df %>% select(!!keep,
!!sym(cnames[[1]]) := paste0(patterns[[1]], num),
!!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. Loop through the 'patterns' and the corresponding new column names ('cnames') using map2, gather into 'long' format, rename the 'val' column to the 'cnames' passed into the function, bind the columns (bind_cols) and select the columns of interest
stack_col <- function(dat, pat, cname, keep) {
purrr::map2(pat, cname, ~
dat %>%
dplyr::select(keep, matches(.x)) %>%
tidyr::gather(key, val, matches(.x)) %>%
dplyr::select(-key) %>%
dplyr::rename(!! .y := val)) %>%
dplyr::bind_cols(.) %>%
dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, multiple patterns reshaping can be done with data.table::melt
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
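A comparable reshape is possible with tidyr's pivot_longer and the special ".value" name. This is a sketch assuming tidyr 1.0.0 or later; the regular expression and the column name "set" are my own choices for this example. Unlike the output above, the rows come out interleaved (set 1 and set 2 of each original row next to each other) and the new columns keep the prefixes S and PR as names.

```r
library(tidyr)

df <- data.frame(
  obj = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
  S1 = rep(c("a", "b"), length.out = 10), PR1 = rep(c(3, 7), length.out = 10),
  S2 = rep(c("c", "d"), length.out = 10), PR2 = rep(c(7, 3), length.out = 10),
  stringsAsFactors = FALSE
)

long <- pivot_longer(
  df,
  cols = -obj,
  # the letters become the output column name, the digits become "set"
  names_to = c(".value", "set"),
  names_pattern = "^([A-Z]+)(\\d+)$"
)
long
```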
This solves your problem, although it does not fix your function:
The idea is to use gather and spread on the columns that start with one of the given patterns. I build a regex that matches those column names, gather all of them, extract the group from each name, and rename the groups using cnames. Finally, spread separates the groups into new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
mutate(rownumber = 1:n()) %>%
gather(new_variable, value, matches(complete_pattern)) %>%
mutate(group = str_extract(new_variable, complete_pattern),
group = str_replace_all(group, cnames),
group_number = str_extract(new_variable, "\\d+")) %>%
select(-new_variable) %>%
spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d
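If the column names really follow the prefix-plus-number scheme, the function from the question can also be sketched in base R without any quasiquotation. The function name stack_sets and its argument names are hypothetical; it assumes every prefix/number combination exists as a column.

```r
df <- data.frame(
  obj = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
  S1 = rep(c("a", "b"), length.out = 10), PR1 = rep(c(3, 7), length.out = 10),
  S2 = rep(c("c", "d"), length.out = 10), PR2 = rep(c(7, 3), length.out = 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each set number, pick the kept columns plus
# <prefix><number> for every prefix, rename to cnames, then stack the sets
stack_sets <- function(df, prefixes, nums, cnames, keep) {
  pieces <- lapply(nums, function(n) {
    out <- df[keep]
    for (i in seq_along(prefixes)) {
      out[[cnames[i]]] <- df[[paste0(prefixes[i], n)]]
    }
    out
  })
  do.call(rbind, pieces)
}

res <- stack_sets(df, prefixes = c("S", "PR"), nums = 1:2,
                  cnames = c("Species", "PR"), keep = "obj")
res
```

This keeps the block order of the expected output (all of set 1, then all of set 2).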