I have a list of datasets with different variables. I need to rename their columns according to the naming convention in the name data frame below.
df1 <- data.frame(x1 = c(1,2,3), x2 = c(1,2,3))
df2 <- data.frame(x1 = c(1,2,3), x3 = c(1,2,3))
df3 <- data.frame(x4 = c(1,2,3), x5 = c(1,2,3))
mylist <- list(df1, df2, df3)
name <- data.frame(old = c("x1","x2","x3","x4","x5"), new = c("A","B","A","A","C"))
I can do this one by one, but I am wondering how to be more efficient and rename them all at once:
newdf <- map_if(mylist, ~ "x1" %in% colnames(.x),
.f = list(. %>% rename("A"="x1")))
I was hoping something like this would work, but it doesn't:
for (i in nrow(name)) {
newdf <- map_if(mylist, ~ name[i,1] %in% colnames(.x),
.f = list(. %>% rename(name[2] = name[1])))
}
You can use setnames from data.table, which takes vectors of old and new names; with skip_absent = TRUE, old names that are missing from a given data frame are simply skipped.
library(data.table)
library(purrr)
map(mylist, ~ setnames(.x, name$old, name$new, skip_absent=TRUE))
Output
[[1]]
A B
1 1 1
2 2 2
3 3 3
[[2]]
A A
1 1 1
2 2 2
3 3 3
[[3]]
A C
1 1 1
2 2 2
3 3 3
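One caveat worth noting: setnames() renames by reference, so the data frames inside mylist (and the df1/df2/df3 they came from) are modified in place. A sketch of how to keep the originals untouched, using data.table::copy():

# setnames() modifies its input in place; renaming a copy() leaves the
# original data frames unchanged
newdf <- map(mylist, ~ setnames(copy(.x), name$old, name$new, skip_absent = TRUE))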
Column names must be unique, so there is a typo (?) in your example (as "x1" and "x3" would both be re-labelled as "A").
If we fix the typo, here is an option using map and rename_with.
name <- data.frame(old= c("x1","x2","x3","x4","x5"), new=c("A","B","C","D","E"))
library(tidyverse)
mylist %>%
  map(function(df) df %>% rename_with(~ name$new[match(.x, name$old)]))
#[[1]]
# A B
#1 1 1
#2 2 2
#3 3 3
#
#[[2]]
# A C
#1 1 1
#2 2 2
#3 3 3
#
#[[3]]
# D E
#1 1 1
#2 2 2
#3 3 3
You could use set_names + recode:
library(tidyverse)
map(mylist, set_names, ~ recode(.x, !!!deframe(name)))
[[1]]
A B
1 1 1
2 2 2
3 3 3
[[2]]
A A
1 1 1
2 2 2
3 3 3
[[3]]
A C
1 1 1
2 2 2
3 3 3
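For reference, deframe(name) is what builds the old-to-new lookup that !!! splices into recode() (a quick peek, assuming the tidyverse is loaded):

deframe(name)
 x1  x2  x3  x4  x5 
"A" "B" "A" "A" "C" 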
I would like to combine two dataframes using crossing, but some of them have the same column names. For those, I would like to add "_nameofdataframe" to the columns. Here are some reproducible dataframes (dput below):
> df1
person V1 V2 V3
1 A 1 3 3
2 B 4 4 5
3 C 2 1 1
> df2
V2 V3
1 2 5
2 1 6
3 1 2
When I run the following code, it returns duplicated column names:
library(tidyr)
crossing(df1, df2, .name_repair = "minimal")
#> # A tibble: 9 × 6
#> person V1 V2 V3 V2 V3
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 A 1 3 3 1 2
#> 2 A 1 3 3 1 6
#> 3 A 1 3 3 2 5
#> 4 B 4 4 5 1 2
#> 5 B 4 4 5 1 6
#> 6 B 4 4 5 2 5
#> 7 C 2 1 1 1 2
#> 8 C 2 1 1 1 6
#> 9 C 2 1 1 2 5
As you can see, the duplicated column names are returned unchanged. My desired output should look like this:
person V1 V2_df1 V3_df1 V2_df2 V3_df2
1 A 1 3 3 1 2
2 A 1 3 3 1 6
3 A 1 3 3 2 5
4 B 4 4 5 1 2
5 B 4 4 5 1 6
6 B 4 4 5 2 5
7 C 2 1 1 1 2
8 C 2 1 1 1 6
9 C 2 1 1 2 5
So I was wondering if anyone knows a more automatic way to rename the duplicated columns with crossing, as in the desired output above?
dput of df1 and df2:
df1 <- structure(list(person = c("A", "B", "C"), V1 = c(1, 4, 2), V2 = c(3,
4, 1), V3 = c(3, 5, 1)), class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(V2 = c(2, 1, 1), V3 = c(5, 6, 2)), class = "data.frame", row.names = c(NA,
-3L))
As you probably know, the .name_repair parameter can take a function. The problem is crossing() only passes that function one argument: a single vector containing the concatenated column names of both data frames (the sketch after the list below shows what that function receives). So we can't easily pass the names of the data frame objects to it. It seems to me that there are two solutions:
Manually add the desired suffix to an anonymous function.
Create a wrapper function around crossing().
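To make that concrete, here is a minimal sketch of what the repair function is handed (peek is a hypothetical helper; any function returning valid unique names would do):

peek <- function(nms) { print(nms); make.unique(nms) }  # hypothetical helper
crossing(df1, df2, .name_repair = peek)
#> [1] "person" "V1"     "V2"     "V3"     "V2"     "V3"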
1. Manually add the desired suffix to an anonymous function
We can simply supply the suffixes as a character vector default in an anonymous function passed to .name_repair, e.g. suffix = c("_df1", "_df2").
crossing(
  df1,
  df2,
  .name_repair = \(x, suffix = c("_df1", "_df2")) {
    names_to_repair <- names(which(table(x) == 2))
    x[x %in% names_to_repair] <- paste0(
      x[x %in% names_to_repair],
      rep(suffix, each = length(unique(names_to_repair)))
    )
    x
  }
)
# person V1 V2_df1 V3_df1 V2_df2 V3_df2
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 A 1 3 3 1 2
# 2 A 1 3 3 1 6
# 3 A 1 3 3 2 5
# 4 B 4 4 5 1 2
# 5 B 4 4 5 1 6
# 6 B 4 4 5 2 5
# 7 C 2 1 1 1 2
# 8 C 2 1 1 1 6
# 9 C 2 1 1 2 5
The disadvantage of this is that there is room for error when typing the suffix, or we might forget to change it if we change the names of the data frames.
Also note that we are checking for names which appear twice. If one of your original data frames already has broken (duplicated) names then this function will also rename those columns. But I think it would be unwise to try to do any type of join if either data frame did not have unique column names.
2. Create a wrapper function around crossing()
This might be more in the spirit of the tidyverse. The crossing() docs to which you linked state that crossing() is a wrapper around expand_grid(). The source for expand_grid() shows that it is basically a wrapper which uses map() to apply vctrs::vec_rep() to some inputs. So if we want to add another function to the call stack, there are two ways I can think of:
Using deparse(substitute())
crossing_fix_names <- function(df_1, df_2) {
  suffixes <- paste0(
    "_",
    c(deparse(substitute(df_1)), deparse(substitute(df_2)))
  )
  crossing(
    df_1,
    df_2,
    .name_repair = \(x, suffix = suffixes) {
      names_to_repair <- names(which(table(x) == 2))
      x[x %in% names_to_repair] <- paste0(
        x[x %in% names_to_repair],
        rep(suffix, each = length(unique(names_to_repair)))
      )
      x
    }
  )
}
# Output the same as above
crossing_fix_names(df1, df2)
The disadvantage of this is that deparse(substitute()) is ugly and can occasionally have surprising behaviour. The advantage is we do not need to remember to manually add the suffixes.
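As a small illustration of that surprising behaviour (f and g are hypothetical helpers): substitute() only captures the expression in the immediate call, so the original name is lost as soon as the data frame is forwarded by another function.

f <- function(x) deparse(substitute(x))
f(df1)             # "df1" -- as expected
g <- function(y) f(y)
g(df1)             # "y" -- the original name is gone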
Using match.call()
crossing_fix_names2 <- function(df_1, df_2) {
  args <- as.list(match.call())
  suffixes <- paste0(
    "_",
    c(args$df_1, args$df_2)
  )
  crossing(
    df_1,
    df_2,
    .name_repair = \(x, suffix = suffixes) {
      names_to_repair <- names(which(table(x) == 2))
      x[x %in% names_to_repair] <- paste0(
        x[x %in% names_to_repair],
        rep(suffix, each = length(unique(names_to_repair)))
      )
      x
    }
  )
}
# Also the same output
crossing_fix_names2(df1, df2)
As we avoid the drawbacks of deparse(substitute()) and we don't have to manually specify the suffix, I think this is probably the best approach.
Test for the clashing names using the dputs:
colnames(df1) %in% colnames(df2)
[1] FALSE FALSE TRUE TRUE
Rename:
colnames(df2) <- paste0(colnames(df2), '_df2')
then cbind:
cbind(df1,df2)
person V1 V2 V3 V2_df2 V3_df2
1 A 1 3 3 2 5
2 B 4 4 5 1 6
3 C 2 1 1 1 2
Not so elegant, and note that cbind pairs rows positionally (3 rows here, not the 9 combinations crossing gives), but the renamed columns stay usefully discernible later.
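If you do want the full cross product rather than a row-wise cbind, the rename alone already removes the clash, so plain crossing() works afterwards:

crossing(df1, df2)   # 9 rows; V2_df2 and V3_df2 stay distinct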
I would like to merge multiple columns. Here is what my sample dataset looks like.
df <- data.frame(
id = c(1,2,3,4,5),
cat.1 = c(3,4,NA,4,2),
cat.2 = c(3,NA,1,4,NA),
cat.3 = c(3,4,1,4,2))
> df
id cat.1 cat.2 cat.3
1 1 3 3 3
2 2 4 NA 4
3 3 NA 1 1
4 4 4 4 4
5 5 2 NA 2
I am trying to merge columns cat.1, cat.2 and cat.3. It is a little complicated for me since there are NAs.
I need to end up with only one cat variable, and even though some columns contain NAs, I need to ignore them. The desired output is below:
> df
id cat
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Any thoughts?
Another variation of Gregor's answer using dplyr::transmute:
library(dplyr)
df %>%
  transmute(id = id, cat = coalesce(cat.1, cat.2, cat.3))
#> id cat
#> 1 1 3
#> 2 2 4
#> 3 3 1
#> 4 4 4
#> 5 5 2
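As a quick illustration of why coalesce() fits here: it works element-wise, taking the first non-NA value across its arguments at each position.

coalesce(c(NA, 2, NA), c(1, NA, NA), c(9, 9, 3))
#> [1] 1 2 3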
With dplyr:
library(dplyr)
df %>%
  mutate(cat = coalesce(cat.1, cat.2, cat.3)) %>%
  select(-cat.1, -cat.2, -cat.3)
An option with fcoalesce from data.table
library(data.table)
setDT(df)[, .(id, cat = do.call(fcoalesce, .SD)), .SDcols = patterns('^cat')]
Output
# id cat
#1: 1 3
#2: 2 4
#3: 3 1
#4: 4 4
#5: 5 2
Does this work:
> library(dplyr)
> df %>% rowwise() %>% mutate(cat = mean(c(cat.1, cat.2, cat.3), na.rm = T)) %>% select(-(2:4))
# A tibble: 5 x 2
# Rowwise:
id cat
<dbl> <dbl>
1 1 3
2 2 4
3 3 1
4 4 4
5 5 2
Since the non-NA values within each row are identical, the row mean returns that same value; max or min would work just as well.
Here is a base R solution which uses apply:
df$cat <- apply(df, 1, function(x) unique(x[!is.na(x)][-1]))
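Note this works because each row collapses to a single distinct non-NA value once id is dropped; if a row could contain several distinct values, unique() would return a vector and cat would become a list. A slightly more explicit sketch of the same idea:

# drop the id column first, then take the single distinct non-NA value per row
df$cat <- apply(df[-1], 1, function(x) unique(x[!is.na(x)])[1])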
I have a dataset where the first line is the header, the second line is some explanatory data, and rows 3 onward are numbers. When I read in the data, this second explanatory row causes the columns to be converted to factors (or to character, if I set stringsAsFactors = FALSE).
What I would like to do is remove the second row, and have a function that goes through all columns, detects whether they contain only numbers, and changes the class to the appropriate type. Is there something like that available? Perhaps using dplyr? I have many columns, so I'd like to avoid manually reassigning them.
A simplified example below
> df <- data.frame(A = c("col 1",1,2,3,4,5), B = c("col 2",1,2,3,4,5))
> df
A B
1 col 1 col 2
2 1 1
3 2 2
4 3 3
5 4 4
6 5 5
If all values after the first row are numbers, then we can do this:
library(tidyverse)
df[-1, ] %>% mutate_all(as.numeric)
Depending on the task, it can also be done this way:
df <- tibble(A = c("col 1", 1, 2, 3, 4, 5),
             B = c("col 2", 1, 2, 3, 4, 5),
             C = c(letters[1:5], 6))
df[-1, ] %>% mutate_if(~ any(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <dbl>
1 1 1 NA
2 2 2 NA
3 3 3 NA
4 4 4 NA
5 5 5 6
Or, converting a column only when every value in it is numeric:
df[-1, ] %>% mutate_if(~ all(!is.na(as.numeric(.))), as.numeric)
A B C
<dbl> <dbl> <chr>
1 1 1 b
2 2 2 c
3 3 3 d
4 4 4 e
5 5 5 6
In base R, we can drop the first row and then convert every column:
df <- df[-1, ]
df[] <- lapply(df, as.numeric)  # assumes the columns were read in as character
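As an aside, in recent versions of R (4.0+), utils::type.convert() has a data.frame method that guesses a suitable class for every column in one call; a sketch, assuming the columns were read in as character:

df2 <- type.convert(df[-1, ], as.is = TRUE)  # as.is = TRUE keeps text columns character, not factor
str(df2)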
I have several dataframes in a list in R. There are entries in each of those DFs I would like to summarise. I'm trying to get into lapply, so that would be my preferred way (though if there's a better solution, I would be happy to know it and why).
My Sample data:
df1 <- data.frame(Count = c(1,2,3), ID = c("A","A","C"))
df2 <- data.frame(Count = c(1,1,2), ID = c("C","B","C"))
dfList <- list(df1,df2)
> head(dfList)
[[1]]
Count ID
1 1 A
2 2 A
3 3 C
[[2]]
Count ID
1 1 C
2 1 B
3 2 C
I tried to implement this in lapply with
dfList_agg<-lapply(dfList, function(i) {
aggregate(i[[1:length(i)]][1L], by=list(names(i[[1:length(i)]][2L])), FUN=sum)
})
However, this gives me an error: "arguments must have same length". What am I doing wrong?
My desired output would be the sum of Column "Count" by "ID" which looks like this:
> head(dfList_agg)
[[1]]
Count ID
1 3 A
2 3 C
[[2]]
Count ID
1 3 C
2 1 B
I think you've overcomplicated it. Try this...
dfList_agg <- lapply(dfList, function(i) {
  aggregate(i[, 1], by = list(i[, 2]), FUN = sum)
})
dfList_agg
[[1]]
Group.1 x
1 A 3
2 C 3
[[2]]
Group.1 x
1 B 1
2 C 3
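If you would rather keep the original column names than the default Group.1/x, aggregate accepts named lists (a sketch along the same lines):

# naming the aggregated column and the grouping list preserves ID/Count
dfList_agg <- lapply(dfList, function(i) {
  aggregate(list(Count = i$Count), by = list(ID = i$ID), FUN = sum)
})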
Here is a third option
lapply(dfList, function(x) aggregate(. ~ ID, data = x, FUN = "sum"))
#[[1]]
# ID Count
#1 A 3
#2 C 3
#
#[[2]]
# ID Count
#1 B 1
#2 C 3
I guess this is what you need (note that ddply, .(), and summarize come from plyr, not dplyr):
library(plyr)
lapply(dfList, function(x) ddply(x, .(ID), summarize, Count = sum(Count)))
An option with tidyverse would be
library(tidyverse)
map(dfList, ~ .x %>%
      group_by(ID) %>%
      summarise(Count = sum(Count)) %>%
      select(names(.x)))
#[[1]]
# A tibble: 2 x 2
# Count ID
# <dbl> <fctr>
#1 3.00 A
#2 3.00 C
#[[2]]
# A tibble: 2 x 2
# Count ID
# <dbl> <fctr>
#1 1.00 B
#2 3.00 C
I have a dataframe as follows. It is ordered by column time.
Input -
df = data.frame(time = 1:20,
                grp = sort(rep(1:5, 4)),
                var1 = rep(c('A','B'), 10))
head(df,10)
time grp var1
1 1 1 A
2 2 1 B
3 3 1 A
4 4 1 B
5 5 2 A
6 6 2 B
7 7 2 A
8 8 2 B
9 9 3 A
10 10 3 B
I want to create another variable var2 which computes the number of distinct var1 values so far, i.e. up to that point in time, for each group grp. This is a little different from what I'd get if I were to use n_distinct.
Expected output -
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
I want to create a function say cum_n_distinct for this and use it as -
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))
A dplyr solution inspired by @akrun's answer.
The logic is basically to set the 1st occurrence of each unique value of var1 to 1 and the rest to 0 within each group grp, and then apply cumsum:
df = df %>%
  arrange(time) %>%
  group_by(grp, var1) %>%
  mutate(var_temp = ifelse(row_number() == 1, 1, 0)) %>%
  group_by(grp) %>%
  mutate(var2 = cumsum(var_temp)) %>%
  select(-var_temp)
head(df,10)
Source: local data frame [10 x 4]
Groups: grp
time grp var1 var2
1 1 1 A 1
2 2 1 B 2
3 3 1 A 2
4 4 1 B 2
5 5 2 A 1
6 6 2 B 2
7 7 2 A 2
8 8 2 B 2
9 9 3 A 1
10 10 3 B 2
Assuming stuff is ordered by time already, first define a cumulative distinct function:
dist_cum <- function(var)
  sapply(seq_along(var), function(x) length(unique(head(var, x))))
Then a base solution that uses ave to create groups (note, assumes var1 is factor), and then applies our function to each group:
transform(df, var2=ave(as.integer(var1), grp, FUN=dist_cum))
A data.table solution, basically doing the same thing:
library(data.table)
(data.table(df)[, var2:=dist_cum(var1), by=grp])
And dplyr, again, same thing:
library(dplyr)
df %>% group_by(grp) %>% mutate(var2=dist_cum(var1))
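An aside on performance: dist_cum() re-runs unique() on every prefix, so it is quadratic in the group size. A linear sketch of the same running count, based on the fact that !duplicated() flags first occurrences:

dist_cum2 <- function(var) cumsum(!duplicated(var))
dist_cum2(c("A", "B", "A", "B"))  # 1 2 2 2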
Update
With your new dataset, here is an approach in base R:
df$var2 <- unlist(lapply(split(df, df$grp), function(x) {
  x$var2 <- 0
  indx <- match(unique(x$var1), x$var1)
  x$var2[indx] <- 1
  cumsum(x$var2)
}))
head(df,7)
# time grp var1 var2
# 1 1 1 A 1
# 2 2 1 B 2
# 3 3 1 A 2
# 4 4 1 B 2
# 5 5 2 A 1
# 6 6 2 B 2
# 7 7 2 A 2
Here's another solution using data.table that's pretty quick.
Generic Function
cum_n_distinct <- function(x, na.include = TRUE){
  # Given a vector x, returns a corresponding vector y
  # where the ith element of y gives the number of unique
  # elements observed up to and including index i.
  # If na.include = TRUE (default), NA is counted as an
  # additional unique element; otherwise it's essentially ignored.
  temp <- data.table(x, idx = seq_along(x))
  firsts <- temp[temp[, .I[1L], by = x]$V1]
  if(na.include == FALSE) firsts <- firsts[!is.na(x)]
  y <- rep(0, times = length(x))
  y[firsts$idx] <- 1
  y <- cumsum(y)
  return(y)
}
Example Use
cum_n_distinct(c(5,10,10,15,5)) # 1 2 2 3 3
cum_n_distinct(c(5,NA,10,15,5)) # 1 2 3 4 4
cum_n_distinct(c(5,NA,10,15,5), na.include = FALSE) # 1 1 2 3 3
Solution To Your Question
d_out = df %>%
  arrange(time) %>%
  group_by(grp) %>%
  mutate(var2 = cum_n_distinct(var1))