I have a vector containing "potential" column names:
col_vector <- c("A", "B", "C")
I also have a data frame, e.g.
library(tidyverse)
df <- tibble(A = 1:2,
B = 1:2)
My goal now is to create all columns mentioned in col_vector that don't yet exist in df.
For the above example, my code below works:
df %>%
mutate(!!sym(setdiff(col_vector, colnames(.))) := NA)
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
The problem is that this code fails as soon as a) more than one column from col_vector is missing or b) no column from col_vector is missing. I thought about some sort of if_else, but I don't know how to make the column creation conditional in such a way, preferably in a tidyverse way. I know I can just loop through all the missing columns, but I'm wondering if there is a more direct approach.
Example data where code above fails:
df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2,
B = 1:2,
C = 1:2)
This should work.
df[,setdiff(col_vector, colnames(df))] <- NA
Solution
This base operation might be simpler than a full-fledged dplyr workflow:
library(tidyverse) # For tibble(); setdiff() itself is available in base R.
# ...
# Code to generate 'df'.
# ...
# Find the subset of missing names, and create them as columns filled with 'NA'.
df[, setdiff(col_vector, names(df))] <- NA
# View results
df
Results
Given your sample col_vector and df here
col_vector <- c("A", "B", "C")
df <- tibble(A = 1:2, B = 1:2)
this solution should yield the following results:
# A tibble: 2 x 3
A B C
<int> <int> <lgl>
1 1 1 NA
2 2 2 NA
Advantages
An advantage of my solution over the alternative linked above by @geoff is that you need not hand-code the set of column names as symbols and strings within the dplyr workflow:
df %>% mutate(
#####################################
A = ifelse("A" %in% names(.), A, NA),
B = ifelse("B" %in% names(.), B, NA),
C = ifelse("C" %in% names(.), C, NA)
# ...
# etc.
#####################################
)
My solution is by contrast more dynamic
##############################
df[, setdiff(col_vector, names(df))] <- NA
##############################
if you ever decide to change (or even dynamically calculate!) your variable names midstream, since it determines the setdiff() at runtime.
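For completeness, here is a minimal sketch that checks all three cases from the question (one missing column, several missing, none missing). It uses plain base R data frames rather than tibbles, and a hypothetical helper, add_missing_cols(), that is not part of the answers above; the loop form is chosen only because it behaves the same for zero, one, or many missing names.

```r
# Hypothetical helper: add every name in `cols` that `df` lacks, filled with NA.
add_missing_cols <- function(df, cols) {
  for (nm in setdiff(cols, names(df))) {
    df[[nm]] <- NA
  }
  df
}

col_vector <- c("A", "B", "C")
df  <- data.frame(A = 1:2, B = 1:2)          # one column missing
df2 <- data.frame(A = 1:2)                   # two columns missing
df3 <- data.frame(A = 1:2, B = 1:2, C = 1:2) # nothing missing

out  <- add_missing_cols(df,  col_vector)
out2 <- add_missing_cols(df2, col_vector)
out3 <- add_missing_cols(df3, col_vector)    # returned unchanged
```

When nothing is missing, setdiff() returns an empty vector, the loop body never runs, and the data frame comes back as it went in.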
Note
Incredibly, @AustinGraves posted their answer at precisely the same time (2021-10-25 21:03:05Z) as I posted mine, so both answers qualify as original solutions.
Related
I am working with a dataset of more than 3 million observations. This data set includes more than 770,000 unique IDs that are of interest to me. The data includes descriptive information about these IDs. The challenge is that these unique IDs contain non-unique duplicates, which means I need to find a way to consolidate the data.
After much thinking, I decided to take the mode of each column for each ID in the data set. The output gives me the most common value for each column for each ID. By taking the most common value, I am able to consolidate the non-unique duplicates into one row per ID.
The problem: To do so, I have to iterate over 770,000 unique IDs in a for loop. I want to use code that will be as efficient as possible because the for loop I have been using takes days to complete.
Given the code I have provided, is there a way to optimize the code, use parallel processing, or a different way to complete the task more efficiently?
Reproducible code:
ID <- c(1,2,2,3,3,3)
x1 <- c("A", "B", "B","C", "C", "C")
x2 <- c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2")
x3 <- c("apple", "banana", "banana", "plum1", "plum1", "plum")
df <- data.frame(ID, x1, x2, x3)
#Mode Function
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
library(reshape2)
#Takes the mode for every column
mode_row <- function(dat){
x <- setNames(as.data.frame(apply(dat, 2, getmode)), c("value"))
x$variable <- rownames(x); rownames(x) <- NULL
mode_row <- reshape2::dcast(x, . ~ variable, value.var = "value")
mode_row$. <- NULL
return(mode_row)
}
#Take the mode of each row to account for duplicate donors
df2 <- NULL
for(i in unique(df$ID)){
df2 <- rbind(df2, mode_row(subset(df, ID == i)))
#message(i)
}
df2
Expected Output:
ID x1 x2 x3
1 1 A alpha apple
2 2 B bravo banana
3 3 C charlie2 plum1
There are grouped functions available in base R, dplyr and data.table :
Base R :
aggregate(.~ID, df, getmode)
# ID x1 x2 x3
#1 1 A alpha apple
#2 2 B bravo banana
#3 3 C charlie2 plum1
dplyr :
library(dplyr)
df %>% group_by(ID) %>% summarise(across(x1:x3, getmode))
#Use summarise_at in older version of dplyr
#df %>% group_by(ID) %>% summarise_at(vars(x1:x3), getmode)
data.table :
library(data.table)
setDT(df)[, lapply(.SD, getmode), ID, .SDcols = x1:x3]
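As a quick sanity check, the sketch below reproduces the expected output from the question with a single grouped call in base R. It uses the data-frame interface of aggregate() rather than the formula interface shown above, and the name by_id is my own; everything else is taken from the question.

```r
# Mode function from the question: the most frequent value, first one wins on ties.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

df <- data.frame(
  ID = c(1, 2, 2, 3, 3, 3),
  x1 = c("A", "B", "B", "C", "C", "C"),
  x2 = c("alpha", "bravo", "bravo", "charlie", "charlie2", "charlie2"),
  x3 = c("apple", "banana", "banana", "plum1", "plum1", "plum"),
  stringsAsFactors = FALSE
)

# One grouped call replaces the for loop plus rbind().
by_id <- aggregate(df[c("x1", "x2", "x3")], by = list(ID = df$ID), FUN = getmode)
by_id
```

The loop in the question grows df2 with rbind() on every iteration, which copies the accumulated result each time; a grouped call avoids that repeated copying.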
I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to assign values ("f" and "g") that are not among the existing factor levels of df1$y. R doesn't know what to do with them, so it just inserts NA values.
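The factor behaviour can be seen in isolation with a small sketch. The vectors y and y_chr are illustrative stand-ins for df1$y, not objects from the question.

```r
# A factor only accepts values that are among its existing levels.
y <- factor(c("a", "b", "c", "d"))

# "f" is not a level of y, so R stores NA and emits the
# "invalid factor level, NA generated" warning (suppressed here).
suppressWarnings(y[1] <- "f")
is_na_after_assign <- is.na(y[1])

# With a plain character vector the same assignment simply works.
y_chr <- c("a", "b", "c", "d")
y_chr[1] <- "f"
```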
As r2evans answered, he used stringsAsFactors to disable the conversion of strings to factors - you can even go as far as disabling it on a session-wide basis using options(stringsAsFactors=FALSE) (and I've heard it will be disabled by default in the forthcoming R 4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
Calling df1$x %in% df2$x returns a logical (boolean) vector of which elements in df1$x are found in df2$x - i.e. the first two and not the second two. But it doesn't tell you which positions in the first vector correspond to which in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
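To make the difference concrete, here is a small sketch with the rows of df2 swapped, as in the experiment above. The intermediate names hit and pos are mine. %in% only says whether a match exists, while match() says where it is, which is what the replacement needs.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
# Same content as df2 in the question, but with the two rows swapped.
df2 <- data.frame(x = c(2, 1), y = c("g", "f"), stringsAsFactors = FALSE)

hit <- df1$x %in% df2$x      # membership only: TRUE TRUE FALSE FALSE
pos <- match(df1$x, df2$x)   # position in df2: 2 1 NA NA

# Use the positions, so each row of df1 receives the y of its own x.
df1$y[!is.na(pos)] <- df2$y[pos[!is.na(pos)]]
df1
```

Because the lookup goes through pos, the result is the same whatever the row order of df2.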
For solutions, I would recommend r2evans's, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (which is the default from R 4.0.0 onward); otherwise all the factor levels need to be common to both datasets.
I need to merge several different dataframes.
On the one hand, I have several data frames with metadata A and, on the other hand, respective information B.
A.
[1] "LOJun_Meta" "LOMay_Meta" "VOJul_Meta" "VOJun_Meta" "VOMay_Meta" "ZOJun_Meta"
[7] "ZOMay_Meta"
B.
[1] "LOJun_All." "LOMay_all." "VOJul_All." "VOJun_all." "VOMay_all." "ZOJun_all."
[7] "ZOMay_all."
The names of the data frames are already in a list format (i.e. list1 and list2) and the data frames are already imported in R.
My aim is to create a loop which would merge dplyr > left-join the respective dataframes. For example:
LOJun_Meta + LOJun_All.; LOMay_Meta + LOMay_all.; etc.
What I have a hard time on is creating the loop that would "synchronize" the "merging" procedure.
I am unsure if I should create a function which would have two inputs and would do such "merging".
It would be something like
merging <- function(list1, list2) {
  for (i in seq_along(list1)) {
    # list1[i] and list2[i] are only names here, not the data frames themselves
    left_join(list1[i], list2[i], by = c("PrimaryKey" = "ForeignKey"))
  }
}
I reckon the problem is that the function should refer to data frames which are not list1 & list2 values but data frame names stored in list1 & list2.
Any ideas?
Thanks a lot! Cheers
A diagram of what I intend to achieve is presented below:
[Diagram of loop - dplyr / several dataframes]
An example of what I am keen to automate would be this action:
ZOMay <- left_join(ZOMay_Meta, ZOMay_all., by = c("Primary Key" = "Foreign Key"))
ZOJun <- left_join(ZOJun_Meta, ZOJun_all., by = c("Primary Key" = "Foreign Key"))
write.csv(ZOMay, file = "ZOMay_Consolidated.csv")
write.csv(ZOJun, file = "ZOJun_Consolidated.csv")
Here's an example of how you could build a reproducible example for your situation:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
df1a <- tibble(id = 1:3, var1 = LETTERS[1:3])
df2a <- tibble(id = 1:3, var1 = LETTERS[4:6])
df1b <- tibble(id = 1:3, var2 = LETTERS[7:9])
df2b <- tibble(id = 1:3, var2 = LETTERS[10:12])
list1 <- list(df1a, df2a)
list2 <- list(df1b, df2b)
Now as I understand it you want to do a left_join for df1a and df1b, as well as df2a and df2b. Instead of a loop, you can use map2 from the purrr package. This will iterate over two lists and apply a function to each pair of elements.
map2(list1, list2, left_join)
# [[1]]
# # A tibble: 3 x 3
# id var1 var2
# <int> <chr> <chr>
# 1 1 A G
# 2 2 B H
# 3 3 C I
#
# [[2]]
# # A tibble: 3 x 3
# id var1 var2
# <int> <chr> <chr>
# 1 1 D J
# 2 2 E K
# 3 3 F L
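If, as in the question, the lists hold the names of the data frames rather than the data frames themselves, the objects have to be looked up first. The sketch below is one possible way to do that in base R, under some loudly stated assumptions: the toy frames, their columns (PrimaryKey, ForeignKey, site, value), and the simplified object names are all hypothetical, and merge(..., all.x = TRUE) stands in for left_join so that the example runs without extra packages.

```r
# Hypothetical toy data standing in for the real frames.
LOJun_Meta <- data.frame(PrimaryKey = 1:2, site = c("a", "b"))
LOMay_Meta <- data.frame(PrimaryKey = 1:2, site = c("c", "d"))
LOJun_All  <- data.frame(ForeignKey = 1:2, value = c(10, 20))
LOMay_all  <- data.frame(ForeignKey = 1:2, value = c(30, 40))

# Character vectors of object names, in matching order.
meta_names <- c("LOJun_Meta", "LOMay_Meta")
all_names  <- c("LOJun_All", "LOMay_all")

# mget() turns the names into a list of the actual data frames.
meta_list <- mget(meta_names, envir = environment())
all_list  <- mget(all_names,  envir = environment())

# Map() walks both lists in parallel; all.x = TRUE keeps every metadata row.
joined <- Map(
  function(meta, info) {
    merge(meta, info, by.x = "PrimaryKey", by.y = "ForeignKey", all.x = TRUE)
  },
  meta_list, all_list
)
names(joined) <- sub("_Meta$", "", meta_names)
```

The named result list can then be written out in a second pass, for example by looping over names(joined) and calling write.csv() on each element.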
I am learning the syntax of manipulating data.table variables. Although I can do simple things, my understanding is not thorough enough for more complex tasks. For example, I would like to transform the following data to have one distinct "type" value per row, separate columns generated based on the value of "subtype", and unique values collapsed when there are multiple rows with the same "type/subtype" combination.
Given the input data:
data = data.frame(
var1 = c("a","b","c","b","d","e","f"),
var2 = c("aa","bb","cc","dd","ee","ee","ff"),
subtype = c("1","2","2","2","1","1","2"),
type = c("A","A","A","A","B","B","B")
)
var1 var2 subtype type
1 a aa 1 A
2 b bb 2 A
3 c cc 2 A
4 b dd 2 A
5 d ee 1 B
6 e ee 1 B
7 f ff 2 B
I would like to derive:
1.var1 1.var2 2.var1 2.var2 2.type
A "a" "aa" "b|c" "bb|cc|dd" "A"
B "d|e" "ee" "f" "ff" "B"
Using a data frame, I can achieve this with the following code:
data.derived = do.call(
rbind,
lapply(
split(data,list(data$type)),
function(x) {
do.call (
c,
lapply(
split(x, list(x$subtype)),
function(y) {
result = c(
var1 = paste(unique(y$var1),collapse ="|"),
var2 = paste(unique(y$var2),collapse ="|")
)
if (as.character(y$subtype[1]) == "2") {
result = c(result, type = as.character(y$type[1]))
}
result}))}))
How can I do the same using a data table?
From your result, it's clear that you are transforming data from long to wide format, with subtype spread across the columns, so you will need dcast from data.table. And since you want to aggregate the values of var1 and var2 into a single string, you will need a custom aggregate function that uses paste to collapse the result:
library(data.table)
setDT(data)
dcast(data, type ~ subtype, value.var = c("var1", "var2"),
fun = function(v) paste0(unique(v), collapse = "|"))
# type var1_function_1 var1_function_2 var2_function_1 var2_function_2
# 1: A a b|c aa bb|cc|dd
# 2: B d|e f ee ff
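The collapsing step can also be looked at on its own. The following base R sketch is not the data.table solution; it uses tapply() purely as an illustration of "one cell per type/subtype pair, unique values pasted together". The helper name collapse_unique and the result names are my own.

```r
data <- data.frame(
  var1 = c("a", "b", "c", "b", "d", "e", "f"),
  var2 = c("aa", "bb", "cc", "dd", "ee", "ee", "ff"),
  subtype = c("1", "2", "2", "2", "1", "1", "2"),
  type = c("A", "A", "A", "A", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Paste the distinct values of a group into one "|"-separated string.
collapse_unique <- function(v) paste(unique(v), collapse = "|")

# Rows are types, columns are subtypes, each cell is the collapsed string.
wide_var1 <- tapply(data$var1, list(data$type, data$subtype), collapse_unique)
wide_var2 <- tapply(data$var2, list(data$type, data$subtype), collapse_unique)
```

Here wide_var1["A", "2"] holds the collapsed var1 values for type A, subtype 2, which is the same cell the dcast call fills in its var1 column for subtype 2.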
Not sure if you want to use data.table package and commands, or if you want to find out whether your code works with data tables as well.
I think complex computations require the usage of the appropriate packages. The above script works for you, but it's hard to see what it does if it's not written by you.
Before you start using data.table check some nice packages that make life easier for you. Like
library(dplyr)
library(tidyr)
data = data.frame(
var1 = c("a","b","c","b","d","e","f"),
var2 = c("aa","bb","cc","dd","ee","ee","ff"),
subtype = c("1","2","2","2","1","1","2"),
type = c("A","A","A","A","B","B","B")
)
data %>%
group_by(type, subtype) %>%
summarise(x1 = paste(unique(var1),collapse ="|"),
x2 = paste(unique(var2),collapse ="|")) %>%
unite(xx,x1,x2) %>%
spread(subtype,xx) %>%
separate(`1`, c("1.var1","1.var2"), sep="_") %>%
separate(`2`, c("2.var1","2.var2"), sep="_") %>%
ungroup
# # A tibble: 2 x 5
# type 1.var1 1.var2 2.var1 2.var2
# * <fctr> <chr> <chr> <chr> <chr>
# 1 A a aa b|c bb|cc|dd
# 2 B d|e ee f ff
You can use the same code, or even your script, when having a data table instead of a data frame. But if you're looking for using data table commands that's a different story.
Situation
I have two data frames, df1 and df2, with the same column headings
x <- c(1,2,3)
y <- c(3,2,1)
z <- c(3,2,1)
names <- c("id","val1","val2")
df1 <- data.frame(x, y, z)
names(df1) <- names
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(3, 2, 1)
df2 <- data.frame(a, b, c)
names(df2) <- names
And am performing a merge
#library(dplyr) # not needed for merge
joined_df <- merge(x=df1, y=df2, c("id"),all=TRUE)
This gives me the columns in the joined_df as id, val1.x, val2.x, val1.y, val2.y
Question
Is there a way to co-locate the columns that had the same heading in the original data frames, to give the column order in the joined data frame as id, val1.x, val1.y, val2.x, val2.y?
Note that in my actual data frame I have 115 columns, so I'd like to stay clear of using joined_df <- joined_df[, c(1, 2, 4, 3, 5)] if possible.
Update/Edit: I would also like to maintain the original order of the column headings, so sorting alphabetically is not an option on my actual data (I realise it would work with the example I have given).
My desired output is
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
Update with solution for general case
The accepted answer solves my issue nicely.
I've adapted the code slightly here to use the original column names, without having to hard-code them in the rep function.
#specify columns used in merge
merge_cols <- c("id")
# identify duplicate columns and remove those used in the 'merge'
dup_cols <- names(df1)
dup_cols <- dup_cols [! dup_cols %in% merge_cols]
# replicate each duplicate column name and append an 'x' and 'y'
dup_cols <- rep(dup_cols, each=2)
var <- c("x", "y")
newnames <- paste(dup_cols, ".", var, sep = "")
#create new column names and sort the joined df by those names
newnames <- c(merge_cols, newnames)
joined_df <- joined_df[newnames]
How about something like this
numrep <- rep(1:2, each = 2)
numrep
var <- c("x", "y")
var
newnames <- paste("val", numrep, ".", var, sep = "")
newdf <- cbind(joined_df$id, joined_df[newnames])
names(newdf)[1] <- "id"
Which should give you the dataframe like this
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
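A compact variant of the same idea, sketched here as one possible alternative: build the ".x" and ".y" names as two rows of a matrix and read the matrix column by column, which interleaves them while preserving the original column order. The names value_cols and interleaved are my own; the data are those from the question.

```r
df1 <- data.frame(id = c(1, 2, 3), val1 = c(3, 2, 1), val2 = c(3, 2, 1))
df2 <- data.frame(id = c(1, 2, 3), val1 = c(1, 2, 3), val2 = c(3, 2, 1))

joined_df <- merge(df1, df2, by = "id", all = TRUE)

# Columns that appear in both frames, in their original order.
value_cols <- setdiff(names(df1), "id")

# rbind() stacks the .x names over the .y names; as.vector() reads the
# matrix column by column, giving val1.x, val1.y, val2.x, val2.y.
interleaved <- as.vector(rbind(paste0(value_cols, ".x"),
                               paste0(value_cols, ".y")))

joined_df <- joined_df[c("id", interleaved)]
names(joined_df)
```

This assumes every non-key column exists in both data frames, so that merge() adds the ".x" and ".y" suffixes to all of them.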