Situation
I have two data frames, df1 and df2, with the same column headings:
x <- c(1,2,3)
y <- c(3,2,1)
z <- c(3,2,1)
names <- c("id","val1","val2")
df1 <- data.frame(x, y, z)
names(df1) <- names
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(3, 2, 1)
df2 <- data.frame(a, b, c)
names(df2) <- names
and I am performing a merge:
#library(dplyr) # not needed for merge
joined_df <- merge(x = df1, y = df2, by = "id", all = TRUE)
This gives me the columns in joined_df as id, val1.x, val2.x, val1.y, val2.y.
Question
Is there a way to co-locate the columns that had the same heading in the original data frames, to give the column order in the joined data frame as id, val1.x, val1.y, val2.x, val2.y?
Note that in my actual data frame I have 115 columns, so I'd like to steer clear of using joined_df <- joined_df[, c(1, 2, 4, 3, 5)] if possible.
Update/Edit: I would also like to maintain the original order of the column headings, so sorting alphabetically is not an option on my actual data (I realise it would work with the example I have given).
My desired output is
  id val1.x val1.y val2.x val2.y
1  1      3      1      3      3
2  2      2      2      2      2
3  3      1      3      1      1
Update with solution for general case
The accepted answer solves my issue nicely.
I've adapted the code slightly here to use the original column names, without having to hard-code them in the rep function.
#specify columns used in merge
merge_cols <- c("id")
# identify duplicate columns and remove those used in the 'merge'
dup_cols <- names(df1)
dup_cols <- dup_cols[!dup_cols %in% merge_cols]
# replicate each duplicate column name and append an 'x' and 'y'
dup_cols <- rep(dup_cols, each=2)
var <- c("x", "y")
newnames <- paste(dup_cols, ".", var, sep = "")
#create new column names and sort the joined df by those names
newnames <- c(merge_cols, newnames)
joined_df <- joined_df[newnames]
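A slightly more compact way to build the interleaved names is sketched below. It is only an illustration under the assumption that merge() uses its default suffixes (".x" and ".y"); the variable name shared is introduced here for the example and is not part of the original code. It relies on rbind() stacking the ".x" names above the ".y" names, so that as.vector() reads them off column by column in interleaved order.

```r
# Example data from the question
df1 <- data.frame(id = c(1, 2, 3), val1 = c(3, 2, 1), val2 = c(3, 2, 1))
df2 <- data.frame(id = c(1, 2, 3), val1 = c(1, 2, 3), val2 = c(3, 2, 1))

merge_cols <- "id"
joined_df <- merge(df1, df2, by = merge_cols, all = TRUE)

# Columns present in both inputs, in df1's original order, minus the merge key
# ('shared' is a hypothetical name used only in this sketch)
shared <- setdiff(intersect(names(df1), names(df2)), merge_cols)

# rbind() gives a 2-row matrix (.x names on top, .y names below);
# as.vector() flattens it column-wise, which interleaves the two suffixes
newnames <- c(merge_cols,
              as.vector(rbind(paste0(shared, ".x"), paste0(shared, ".y"))))

joined_df <- joined_df[newnames]
names(joined_df)
```

Because the names are derived from names(df1), the original column order is kept and nothing is sorted alphabetically.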
How about something like this?
numrep <- rep(1:2, each = 2)
numrep
var <- c("x", "y")
var
newnames <- paste("val", numrep, ".", var, sep = "")
newdf <- cbind(joined_df$id, joined_df[newnames])
names(newdf)[1] <- "id"
which should give you a data frame like this:
  id val1.x val1.y val2.x val2.y
1  1      3      1      3      3
2  2      2      2      2      2
3  3      1      3      1      1
Related
I am using a panel data set and intend to model it as a dynamic affiliation network using SAOMs. The data is unfortunately very messy and a pain to deal with.
I have managed to create adjacency matrices for each panel wave. However, over time the panel grew in size / people left. I need the number of rows in each matrix to be the same and in the same order according to the unique IDs, which are present when inspecting the objects in R. All "added IDs" should show 10s across the whole row.
Here is a reproducible example that should make the issue clear and also shows what I aim for. I assume this can be solved by smart use of the merge() function, but I could not get it to work:
wave1 <- matrix(c(0,0,1,1,0,1,1,0,1,1), nrow = 5, ncol = 2, dimnames = list(c("1","2","4","5","9"), c("group1","group2")))
wave2 <- matrix(c(0,1,1,0,1,0,1,1), nrow = 4, ncol = 2, dimnames = list(c("1","4","8","9"), c("group1","group2")))
wave1_c <- matrix(c(0,0,1,1,10,0,1,1,0,0,10,1), nrow = 6, ncol = 2, dimnames = list(c("1","2","4","5","8","9"), c("group1","group2")))
wave2_c <- matrix(c(0,10,1,10,1,0,1,10,0,10,1,1), nrow = 6, ncol = 2, dimnames = list(c("1","2","4","5","8","9"), c("group1","group2")))
Thanks in advance. Numbers in the matrices are arbitrary except for the 10s.
Solution in base R using dataframes and merge.
Merge and outer join.
dwave1_c <- merge(wave1, wave2, by = 'row.names', all = TRUE, suffixes="")[2:3]
dwave2_c <- merge(wave2, wave1, by = 'row.names', all = TRUE, suffixes="")[2:3]
dwave1_c[is.na(dwave1_c)] <- 10
dwave2_c[is.na(dwave2_c)] <- 10
as.matrix(dwave1_c)
as.matrix(dwave2_c)
Update.
both <- merge(wave1, wave2, by = 'row.names', all = TRUE)
Output.
  Row.names group1.x group2.x group1.y group2.y
1         1        0        1        0        1
2         2        0        1       NA       NA
3         4        1        0        1        0
4         5        1        1       NA       NA
5         8       NA       NA        1        1
6         9        0        1        0        1
dwave1_c <- both[,2:3]; colnames(dwave1_c) <- colnames(wave1)
dwave2_c <- both[,4:5]; colnames(dwave2_c) <- colnames(wave2)
dwave1_c[is.na(dwave1_c)] <- 10
dwave2_c[is.na(dwave2_c)] <- 10
Show result.
as.matrix(dwave1_c)
as.matrix(dwave2_c)
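One thing to watch with the merge-based version: the IDs end up in the Row.names column, so the matrices returned by as.matrix() carry plain 1..6 row names rather than the IDs. A minimal sketch of putting the IDs back is shown below; the object names w1 and w2 are introduced only for this example.

```r
wave1 <- matrix(c(0,0,1,1,0,1,1,0,1,1), nrow = 5, ncol = 2,
                dimnames = list(c("1","2","4","5","9"), c("group1","group2")))
wave2 <- matrix(c(0,1,1,0,1,0,1,1), nrow = 4, ncol = 2,
                dimnames = list(c("1","4","8","9"), c("group1","group2")))

# Full outer join on the row names; the IDs land in the 'Row.names' column
both <- merge(wave1, wave2, by = "row.names", all = TRUE)
ids  <- as.character(both$Row.names)

# Split the joined columns back into one matrix per wave (w1, w2 are
# hypothetical names) and restore the IDs as row names
w1 <- as.matrix(both[, 2:3]); dimnames(w1) <- list(ids, colnames(wave1))
w2 <- as.matrix(both[, 4:5]); dimnames(w2) <- list(ids, colnames(wave2))

# Rows that were absent from a wave are NA after the join; fill them with 10
w1[is.na(w1)] <- 10
w2[is.na(w2)] <- 10

w1
w2
```

Note that merge() sorts the key as character, so IDs such as "10" would sort before "2"; reorder explicitly if that matters for your data.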
First try.
## Convert matrix to dataframe.
df1 <- as.data.frame(wave1)
df2 <- as.data.frame(wave2)
## Merge df1 and df2 by row name.
m_df1_df2 <- merge(df1, df2, by = 'row.names', all = TRUE)
rownames(m_df1_df2) <- m_df1_df2$Row.names
# Rows not in df1, but in df2,
# rows not in df2, but in df1
not1_2 <- m_df1_df2[is.na(m_df1_df2$group1.x),][c("group1.x", "group2.x")] # not in df1, in df2
not2_1 <- m_df1_df2[is.na(m_df1_df2$group1.y),][c("group1.y", "group2.y")] # not in df2, in df1
## Same column names.
colnames(not1_2) <- colnames(df1)
colnames(not2_1) <- colnames(df2)
## append
df1_c <- rbind(df1, not1_2)
df2_c <- rbind(df2, not2_1)
## order by row name
df1_c <- df1_c[order(row.names(df1_c)), ]
df2_c <- df2_c[order(row.names(df2_c)), ]
## replace NA by 10
df1_c[is.na(df1_c)] <- 10
df2_c[is.na(df2_c)] <- 10
as.matrix(df1_c)
as.matrix(df2_c)
The conversion of wave1 and wave2 to data frames in my first attempt is redundant and can be omitted, although at the expense of implicit coercions.
## merge wave1 and wave2 by row name.
m_df1_df2 <- merge(wave1, wave2, by = 0, all = TRUE)
rownames(m_df1_df2) <- m_df1_df2$Row.names
# rows not in set 1, but in set 2,
# rows not in set 2, but in set 1.
not1_2 <- m_df1_df2[is.na(m_df1_df2$group1.x),][c("group1.x", "group2.x")]
not2_1 <- m_df1_df2[is.na(m_df1_df2$group1.y),][c("group1.y", "group2.y")]
## Same column names.
colnames(not1_2) <- colnames(wave1)
colnames(not2_1) <- colnames(wave2)
## append.
wave1_c <- rbind(wave1, not1_2)
wave2_c <- rbind(wave2, not2_1)
## order by row name.
wave1_c <- wave1_c[order(row.names(wave1_c)), ]
wave2_c <- wave2_c[order(row.names(wave2_c)), ]
## replace NA by 10.
wave1_c[is.na(wave1_c)] <- 10
wave2_c[is.na(wave2_c)] <- 10
## show result.
wave1_c
wave2_c
Solution using setdiff.
## rn_not2_1: rownames in set 1, but not in set 2,
## rn_not1_2: rownames in set 2, but not in set 1.
rn_not2_1 <- setdiff(rownames(wave1), rownames(wave2))
rn_not1_2 <- setdiff(rownames(wave2), rownames(wave1))
## missing rows to add.
add_to_1 <- wave2[rn_not1_2,,drop=FALSE]
add_to_2 <- wave1[rn_not2_1,,drop=FALSE]
add_to_1[,] <- 10
add_to_2[,] <- 10
## append.
wave1_c <- rbind(wave1, add_to_1)
wave2_c <- rbind(wave2, add_to_2)
## order by row name.
wave1_c <- wave1_c[order(row.names(wave1_c)), ]
wave2_c <- wave2_c[order(row.names(wave2_c)), ]
## show result.
wave1_c
wave2_c
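The same idea generalises to any number of waves if you first build the full set of IDs and then fill a template matrix. The sketch below is an illustration only: pad_wave() is a hypothetical helper written for this example, and the numeric sort assumes the IDs are number-like strings, as in the example data.

```r
wave1 <- matrix(c(0,0,1,1,0,1,1,0,1,1), nrow = 5, ncol = 2,
                dimnames = list(c("1","2","4","5","9"), c("group1","group2")))
wave2 <- matrix(c(0,1,1,0,1,0,1,1), nrow = 4, ncol = 2,
                dimnames = list(c("1","4","8","9"), c("group1","group2")))

# Hypothetical helper: start from a matrix filled with 'fill' that has one
# row per ID, then copy the observed rows in by row name
pad_wave <- function(wave, all_ids, fill = 10) {
  out <- matrix(fill, nrow = length(all_ids), ncol = ncol(wave),
                dimnames = list(all_ids, colnames(wave)))
  out[rownames(wave), ] <- wave
  out
}

# All IDs seen in any wave, sorted numerically (assumes number-like IDs)
all_ids <- union(rownames(wave1), rownames(wave2))
all_ids <- all_ids[order(as.numeric(all_ids))]

wave1_c <- pad_wave(wave1, all_ids)
wave2_c <- pad_wave(wave2, all_ids)
```

With more waves, all_ids could be built with Reduce(union, ...) over the row names and pad_wave() applied with lapply().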
I have a vector containing "potential" column names:
col_vector <- c("A", "B", "C")
I also have a data frame, e.g.
library(tidyverse)
df <- tibble(A = 1:2,
B = 1:2)
My goal now is to create all columns mentioned in col_vector that don't yet exist in df.
For the above example, my code below works:
df %>%
mutate(!!sym(setdiff(col_vector, colnames(.))) := NA)
# A tibble: 2 x 3
      A     B C
  <int> <int> <lgl>
1     1     1 NA
2     2     2 NA
The problem is that this code fails as soon as a) more than one column from col_vector is missing or b) no column from col_vector is missing. I thought about some sort of if_else, but I don't know how to make the column creation conditional in such a way, preferably in a tidyverse way. I know I can just loop through all the missing columns, but I'm wondering if there is a more direct approach.
Example data where code above fails:
df2 <- tibble(A = 1:2)
df3 <- tibble(A = 1:2,
B = 1:2,
C = 1:2)
This should work.
df[,setdiff(col_vector, colnames(df))] <- NA
Solution
This base operation might be simpler than a full-fledged dplyr workflow:
library(tidyverse) # For tibble(); setdiff() itself is available in base R.
# ...
# Code to generate 'df'.
# ...
# Find the subset of missing names, and create them as columns filled with 'NA'.
df[, setdiff(col_vector, names(df))] <- NA
# View results
df
Results
Given your sample col_vector and df here
col_vector <- c("A", "B", "C")
df <- tibble(A = 1:2, B = 1:2)
this solution should yield the following results:
# A tibble: 2 x 3
      A     B C
  <int> <int> <lgl>
1     1     1 NA
2     2     2 NA
Advantages
An advantage of my solution, over the alternative linked above by @geoff, is that you need not code by hand the set of column names, as symbols and strings within the dplyr workflow.
df %>% mutate(
#####################################
A = ifelse("A" %in% names(.), A, NA),
B = ifelse("B" %in% names(.), B, NA),
C = ifelse("C" %in% names(.), C, NA)
# ...
# etc.
#####################################
)
My solution is by contrast more dynamic
##############################
df[, setdiff(col_vector, names(df))] <- NA
##############################
if you ever decide to change (or even dynamically calculate!) your variable names midstream, since it determines the setdiff() at runtime.
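To check the two problem cases from the question (several columns missing, and none missing), here is a small sketch. It uses plain data frames to stay self-contained, and add_missing_cols() is a hypothetical wrapper written for this example; the explicit length check is a defensive assumption on my part rather than something the one-liner is known to need.

```r
col_vector <- c("A", "B", "C")

# Hypothetical wrapper around the setdiff() idea
add_missing_cols <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  # Only assign when something is actually missing
  if (length(missing_cols) > 0) {
    df[missing_cols] <- NA
  }
  df
}

df2 <- data.frame(A = 1:2)                       # case a) B and C are missing
df3 <- data.frame(A = 1:2, B = 1:2, C = 1:2)     # case b) nothing is missing

out2 <- add_missing_cols(df2, col_vector)
out3 <- add_missing_cols(df3, col_vector)

names(out2)
names(out3)
```

New columns are appended in the order they appear in col_vector; existing columns are left untouched.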
Note
Incredibly, @AustinGraves posted their answer at precisely the same time (2021-10-25 21:03:05Z) as I posted mine, so both answers qualify as original solutions.
Let's say I have this list of data frames:
DF1_A<- data.frame (first_column = c("A", "B","C"),
second_column = c(5, 5, 5),
third_column = c(1, 1, 1)
)
DF1_B <- data.frame (first_column = c("A", "B","E"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
DF2_A <- data.frame (first_column = c("E", "F","G"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
DF2_B <- data.frame (first_column = c("K", "L","B"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
mylist <- list(DF1_A, DF1_B, DF2_A, DF2_B)
names(mylist) = c("DF1_A", "DF1_B", "DF2_A", "DF2_B")
mylist = lapply(mylist, function(x){
x[, "first_column"] <- as.character(x[, "first_column"])
x
})
I want to bind them by their name (all DF1, all DF2, etc.), or, put differently, two by two in this ordered named list. Keeping the "named list structure" of the list is important for keeping track (for example, DF1_A and DF1_B = DF1, or something similar, in names(mylist)).
Some rows have duplicated values, and I want to keep them (which will introduce some duplicated characters, such as value A in first_column).
I have tried finding clues here on Stack Overflow, but most people want to bind data frames irrespective of their names or order.
Final result would look something like this:
mylist
DF1
DF2
DF1
first_column second_column third_column
           A             1            1
           A             5            1
           B             1            1
           B             5            1
           C             5            1
           E             5            1
Do you mean something like this?
lapply(
split(mylist, gsub("_.*", "", names(mylist))),
function(v) `row.names<-`((out <- do.call(rbind, v))[do.call(order, out), ], NULL)
)
which gives
$DF1
  first_column second_column third_column
1            A             1            1
2            A             5            1
3            B             1            1
4            B             5            1
5            C             5            1
6            E             5            1

$DF2
  first_column second_column third_column
1            B             5            1
2            E             1            1
3            F             1            1
4            G             5            1
5            K             1            1
6            L             1            1
Here is a solution with Map, but it only works for two suffixes. If you want to merge, use the first Map instruction; if you want to keep duplicates, use the 2nd, rbind solution.
sp <- split(mylist, sub("^DF.*_", "", names(mylist)))
res1 <- Map(function(x, y)merge(x, y, all = TRUE), sp[["A"]], sp[["B"]])
res2 <- Map(function(x, y)rbind(x, y), sp[["A"]], sp[["B"]])
names(res1) <- sub("_.*$", "", names(res1))
names(res2) <- sub("_.*$", "", names(res2))
One of many obligatory tidyverse solutions can be this.
library(dplyr)   # for bind_rows()
library(purrr)
library(stringr)
# find the unique DF names
unique_df <- set_names(unique(str_split_fixed(names(mylist), "_", 2)[,1]))
# loop over each unique name, keeping the matching list elements and binding their rows
purrr::map(unique_df, ~ keep(mylist, str_starts(names(mylist), .x))) %>%
map(bind_rows)
Also for things like this, bind_rows() from dplyr has a .id argument which will add a column with the list element name, and stack the rows. That can also be a helpful way. You can bind, manipulate the name how you'd like, and then split().
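The bind, relabel, then split route described above could look roughly like the sketch below. The column names source and group are introduced only for this example, and the grouping rule (everything before the first underscore) is an assumption about how the list names are structured.

```r
library(dplyr)

mylist <- list(
  DF1_A = data.frame(first_column = c("A", "B", "C"), second_column = c(5, 5, 5), third_column = c(1, 1, 1)),
  DF1_B = data.frame(first_column = c("A", "B", "E"), second_column = c(1, 1, 5), third_column = c(1, 1, 1)),
  DF2_A = data.frame(first_column = c("E", "F", "G"), second_column = c(1, 1, 5), third_column = c(1, 1, 1)),
  DF2_B = data.frame(first_column = c("K", "L", "B"), second_column = c(1, 1, 5), third_column = c(1, 1, 1))
)

# Stack everything; .id records which list element each row came from
stacked <- bind_rows(mylist, .id = "source")

# Derive the group from the element name ("DF1_A" -> "DF1")
stacked$group <- sub("_.*$", "", stacked$source)

# Sort within group, then split back into a named list, dropping helper columns
stacked <- stacked[order(stacked$group, stacked$first_column, stacked$second_column), ]
res <- split(stacked[c("first_column", "second_column", "third_column")], stacked$group)
res <- lapply(res, function(d) { rownames(d) <- NULL; d })

names(res)
```

Duplicated values are kept because rows are only stacked, never merged.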
I have a relatively large amount of data stored in a list of data frames with several columns.
For each element of the list I wish to check one column against a reference and if present extract the value held in another column of the same element and place in a new summary matrix.
e.g. with the following example code:
add1 = c("N1","N1","N1")
coords1 = c(1,2,3)
vals1 = c("a","b","c")
extra1 = c("x","y","x")
add2 = c("N2","N2","N2","N2")
coords2 = c(2,3,4,5)
vals2 = c("b","c","d","e")
extra2 = c("z","y","x","x")
add3 = c("N3","N3","N3")
coords3 = c(1,3,5)
vals3 = c("a","c","e")
extra3 = c("z","z","x")
df1 <- data.frame(add1, coords1, vals1, extra1)
df2 <- data.frame(add2, coords2, vals2, extra2)
df3 <- data.frame(add3, coords3, vals3, extra3)
list_all <- list(df1, df2, df3)
coordinate.extract <- unique(unlist(lapply(list_all, "[", 2)))
my_matrix <- matrix(0, ncol = length(list_all)
, nrow = (length(coordinate.extract)))
my_matrix_new <- cbind(as.character(coordinate.extract)
, my_matrix)
I would like to end up with:
my_matrix_new =
V1 V2 V3 V4
1  a     a
2  b  b
3  c  c  c
4     d
5     e  e
i.e. the 3rd column of each list element is chosen based on the value of the second column.
I hope this is clear.
Thanks,
Matt
I would use a data.frame, as there are mixed classes. You may try merge with Reduce to get the expected output: select the 2nd and 3rd columns in each list element, change the name of the 2nd column so that it is the same across all list elements, merge, and, if needed, replace the NA elements with ''.
lst1 <- lapply(list_all, function(x) {names(x)[2] <- 'V1';x[2:3] })
res <- Reduce(function(...) merge(..., by='V1', all=TRUE), lst1)
res[-1] <- lapply(res[-1], as.character)
res[is.na(res)] <- ''
res
#   V1 vals1 vals2 vals3
# 1  1     a           a
# 2  2     b     b
# 3  3     c     c     c
# 4  4           d
# 5  5           e     e
We can change the column names
names(res) <- paste0('V', seq_along(res))
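If a character matrix is wanted, as in the question, an alternative that swaps merge() for match() is sketched below. This is not the approach of the answer above, just an illustration; the names coords and vals are introduced for the example, and it assumes each coordinate occurs at most once per list element.

```r
df1 <- data.frame(add1 = rep("N1", 3), coords1 = c(1, 2, 3),
                  vals1 = c("a", "b", "c"), extra1 = c("x", "y", "x"))
df2 <- data.frame(add2 = rep("N2", 4), coords2 = c(2, 3, 4, 5),
                  vals2 = c("b", "c", "d", "e"), extra2 = c("z", "y", "x", "x"))
df3 <- data.frame(add3 = rep("N3", 3), coords3 = c(1, 3, 5),
                  vals3 = c("a", "c", "e"), extra3 = c("z", "z", "x"))
list_all <- list(df1, df2, df3)

# All coordinates that occur in the 2nd column of any element
coords <- sort(unique(unlist(lapply(list_all, `[[`, 2))))

# For each element, look up the 3rd-column value at every coordinate;
# coordinates that are absent give NA, which is replaced by ""
vals <- sapply(list_all, function(d) {
  v <- as.character(d[[3]])[match(coords, d[[2]])]
  ifelse(is.na(v), "", v)
})

my_matrix_new <- cbind(as.character(coords), vals)
my_matrix_new
```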
Here is an example,
df <- data.frame(x = I(list(1:2, 3:4)))
x <- df[1,]
Now the following does not work,
df[2,] <- x
or
df[2,] <- I(x)
Warning message:
In `[<-.data.frame`(`*tmp*`, 2, , value = list(1:2)) :
replacement element 1 has 2 rows to replace 1 rows
How do I add more rows to a data frame with a single list column?
After a few tries, I found that the following adds a new row of list type:
df[2,] <- list(x)
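A short sketch of why the extra list() seems to matter: with a single column, df[1, ] drops to the column's own type, so x is itself a list of length one holding the vector 1:2. Assigned directly, that list is read as a column with two rows; wrapped in list(), it is read as one column whose single entry is the list cell. This is my reading of the behaviour, not a statement from the R documentation.

```r
df <- data.frame(x = I(list(1:2, 3:4)))

# With one column, the row subset drops to a list of length 1
x <- df[1, ]
is.list(x)
length(x)

# Wrapping in list() makes 'x' the value for one column and one row
df[2, ] <- list(x)

# The second cell now holds the same vector as the first
df$x[[2]]
```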
It might be because you are using a list. If you set your data frame as:
df <- data.frame(rbind(c(1, 2), c(3, 4)))
then your code should work:
df <- data.frame(rbind(c(1, 2), c(3, 4))) # Make DF
x <- df[1,]
df[2,] <- x
print(df)
> df
  X1 X2
1  1  2
2  1  2