Select data frame values row-wise using a variable of column names - r

Suppose I have a data frame that looks like this:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
# x y
# 1 1 4
# 2 2 5
# 3 3 6
And a vector of column names, one per row of the data frame:
colname = c('x', 'y', 'x')
For each row of the data frame, I would like to select the value from the corresponding column in the vector. Something similar to dframe[, colname] but for each row.
Thus, I want to obtain c(1, 5, 3) (i.e. row 1: col "x"; row 2: col "y"; row 3: col "x")

My favourite old matrix-indexing will take care of this. Just pass a 2-column matrix with the respective row/column index:
rownames(dframe) <- seq_len(nrow(dframe))
dframe[cbind(rownames(dframe),colname)]
#[1] 1 5 3
Or, if you don't want to add rownames:
dframe[cbind(seq_len(nrow(dframe)), match(colname,names(dframe)))]
#[1] 1 5 3
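As a minimal sketch, the numeric matrix-indexing idea above could be wrapped in a small helper that also reports column names not present in the data frame. The name pick_by_column is hypothetical and not part of the original answer.

```r
dframe <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname <- c("x", "y", "x")

# Hypothetical helper: for row i, return the value in column cols[i].
pick_by_column <- function(df, cols) {
  stopifnot(length(cols) == nrow(df))
  col_idx <- match(cols, names(df))          # column positions, NA if unknown
  if (anyNA(col_idx)) {
    stop("unknown column name(s): ",
         paste(unique(cols[is.na(col_idx)]), collapse = ", "))
  }
  # A two-column (row, column) matrix selects one cell per row.
  # Note: the data frame is converted with as.matrix() first, so mixed
  # column types would be coerced to a common type.
  df[cbind(seq_len(nrow(df)), col_idx)]
}

result <- pick_by_column(dframe, colname)
result
#[1] 1 5 3
```

The explicit check matters because match returns NA for a misspelled column name, and an NA in the index matrix would otherwise silently yield NA in the result.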

mapply can iterate over the row numbers of dframe and the column names in parallel, returning the value at each (row, column) pair:
dframe = data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname = c('x', 'y', 'x')
mapply(function(x,y)dframe[x,y],1:nrow(dframe), colname)
#[1] 1 5 3
The next option is less intuitive, but if you want a solution within a dplyr chain, one way using gather is:
library(tidyverse)
data.frame(colname = c('x', 'y', 'x'), stringsAsFactors = FALSE) %>%
rownames_to_column() %>%
left_join(dframe %>% rownames_to_column() %>%
gather(colname, value, -rowname),
by = c("rowname", "colname" )) %>%
select(rowname, value)
# rowname value
# 1 1 1
# 2 2 5
# 3 3 3
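Since gather has been superseded in tidyr by pivot_longer, the same reshape-and-join idea might be sketched as follows. This assumes tidyr 1.0.0 or later; the helper names long and wanted are illustrative only.

```r
library(dplyr)
library(tidyr)

dframe <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))
colname <- c("x", "y", "x")

# Reshape to one line per (row, column) cell.
long <- dframe %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "colname", values_to = "value")

# One (row, column name) pair per requested value.
wanted <- data.frame(row = seq_along(colname), colname = colname,
                     stringsAsFactors = FALSE)

# The join keeps exactly the requested cell for each row.
result <- wanted %>%
  left_join(long, by = c("row", "colname")) %>%
  pull(value)
result
#[1] 1 5 3
```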

Related

Bind dataframes in a list two by two (or by name) - R

Let's say I have this list of data frames:
DF1_A<- data.frame (first_column = c("A", "B","C"),
second_column = c(5, 5, 5),
third_column = c(1, 1, 1)
)
DF1_B <- data.frame (first_column = c("A", "B","E"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
DF2_A <- data.frame (first_column = c("E", "F","G"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
DF2_B <- data.frame (first_column = c("K", "L","B"),
second_column = c(1, 1, 5),
third_column = c(1, 1, 1)
)
mylist <- list(DF1_A, DF1_B, DF2_A, DF2_B)
names(mylist) = c("DF1_A", "DF1_B", "DF2_A", "DF2_B")
mylist = lapply(mylist, function(x){
x[, "first_column"] <- as.character(x[, "first_column"])
x
})
I want to bind them by their name (all DF1, all DF2, etc.) or, put differently, two by two in this ordered named list. Keeping the named-list structure is important for tracking which frames were combined (for example, DF1_A and DF1_B become DF1, or something similar, in names(mylist)).
Some rows have duplicated values, and I want to keep them (which will introduce some duplicated entries, such as value A in first_column).
I have tried finding clues here on Stack Overflow, but most people want to bind data frames irrespective of their names or order.
Final result would look something like this:
mylist
DF1
DF2
DF1
first_column second_column third_column
A 1 1
A 5 1
B 1 1
B 5 1
C 5 1
E 5 1
Do you mean something like this?
lapply(
split(mylist, gsub("_.*", "", names(mylist))),
function(v) `row.names<-`((out <- do.call(rbind, v))[do.call(order, out), ], NULL)
)
which gives
$DF1
first_column second_column third_column
1 A 1 1
2 A 5 1
3 B 1 1
4 B 5 1
5 C 5 1
6 E 5 1
$DF2
first_column second_column third_column
1 B 5 1
2 E 1 1
3 F 1 1
4 G 5 1
5 K 1 1
6 L 1 1
Here is a solution with Map, but it only works for two suffixes. If you want to merge, use the first Map instruction; if you want to keep duplicates, use the 2nd, rbind solution.
sp <- split(mylist, sub("^DF.*_", "", names(mylist)))
res1 <- Map(function(x, y)merge(x, y, all = TRUE), sp[["A"]], sp[["B"]])
res2 <- Map(function(x, y)rbind(x, y), sp[["A"]], sp[["B"]])
names(res1) <- sub("_.*$", "", names(res1))
names(res2) <- sub("_.*$", "", names(res2))
One of the many possible tidyverse solutions:
library(purrr)
library(stringr)
library(dplyr) # needed for bind_rows()
# find the unique DF names
unique_df <- set_names(unique(str_split_fixed(names(mylist), "_", 2)[,1]))
# loop over each unique name, extracting the matching elements and stacking their rows
purrr::map(unique_df, ~ keep(mylist, str_starts(names(mylist), .x))) %>%
map(bind_rows)
Also for things like this, bind_rows() from dplyr has a .id argument which will add a column with the list element name, and stack the rows. That can also be a helpful way. You can bind, manipulate the name how you'd like, and then split().
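A minimal sketch of that bind, relabel, and split approach follows. The column names source and group are made up for the example, and the sub pattern assumes the group is everything before the first underscore.

```r
library(dplyr)

mylist <- list(
  DF1_A = data.frame(first_column = c("A", "B", "C"), second_column = c(5, 5, 5),
                     third_column = c(1, 1, 1), stringsAsFactors = FALSE),
  DF1_B = data.frame(first_column = c("A", "B", "E"), second_column = c(1, 1, 5),
                     third_column = c(1, 1, 1), stringsAsFactors = FALSE),
  DF2_A = data.frame(first_column = c("E", "F", "G"), second_column = c(1, 1, 5),
                     third_column = c(1, 1, 1), stringsAsFactors = FALSE),
  DF2_B = data.frame(first_column = c("K", "L", "B"), second_column = c(1, 1, 5),
                     third_column = c(1, 1, 1), stringsAsFactors = FALSE)
)

# Stack everything; .id records which list element each row came from.
stacked <- bind_rows(mylist, .id = "source")

# Derive the group label (DF1, DF2) from the element name.
stacked$group <- sub("_.*$", "", stacked$source)

# Split back into a named list, dropping the helper columns.
value_cols <- c("first_column", "second_column", "third_column")
result <- split(stacked[, value_cols], stacked$group)
names(result)
```

Duplicated rows are kept, because bind_rows only stacks and never merges.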

Adding a column based on a list dplyr

I am trying to summarise a list of dataframes. Here is some test data
noms <- list('A', 'B')
A_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(3, 2, 6, 7))
B_data <- data.frame('Dis' = c(1, 1, 2, 2),
'adj' = c(2, 6, 3, 6))
frames <- list(A_data, B_data)
I want to produce a list of data frames where 'adj' is summed for each 'Dis' group, and then add a column holding the relevant name from 'noms', so that I can later combine the data frames into a single data frame.
So far I have this:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)))
,paste0(unlist(noms)))
But I can't figure out how to add a column with the relevant name. I know I need to use the mutate function, something like so:
totals <- setNames(lapply(frames, function (x)
x %>%
dplyr::group_by(Dis) %>%
dplyr::summarise(total = sum(adj)) %>%
dplyr::mutate(nom = )
,paste0(unlist(noms)))
but I can't figure out how to add the correct name.
The expected output would be a list of two dataframes one for 'A' and one for 'B'. Here is the expected output for 'A':
Dis total Nom
1 1 5 A
2 2 13 A
How do I do this?
A base R option where we use Map instead of lapply
out <- Map(function(x, y) {
transform(aggregate(adj ~ Dis, data = x, sum), Nom = y)
}, x = frames, y = noms)
out
#[[1]]
# Dis adj Nom
#1 1 5 A
#2 2 13 A
#[[2]]
# Dis adj Nom
#1 1 8 B
#2 2 9 B
The same idea with tidyverse functions
library(purrr); library(dplyr)
map2(.x = frames, .y = noms, ~ .x %>%
group_by(Dis) %>%
summarise(adj = sum(adj)) %>%
mutate(Nom = .y))
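The question mentions combining the per-name data frames into a single data frame later. A minimal base R sketch of that follow-up step, reusing the Map approach above, could look like this (the name combined is illustrative):

```r
noms <- list("A", "B")
A_data <- data.frame(Dis = c(1, 1, 2, 2), adj = c(3, 2, 6, 7))
B_data <- data.frame(Dis = c(1, 1, 2, 2), adj = c(2, 6, 3, 6))
frames <- list(A_data, B_data)

# Summarise each frame and tag it with its name, as in the Map answer.
out <- Map(function(x, y) {
  transform(aggregate(adj ~ Dis, data = x, sum), Nom = y)
}, x = frames, y = noms)

# Stack the tagged summaries into one data frame.
combined <- do.call(rbind, out)
combined
```

Because every summary carries its own Nom column, the origin of each row stays visible after stacking.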

filter one data.frame by another data.frame by specific columns

full = data.frame(group = c('a', 'a', 'a', 'a', 'a', 'b', 'c'), values = c(1, 2, 2, 3, 5, 3, 4))
filter = data.frame(group = c('a', 'b', 'c'), values = c(4, 3, 3))
## find rows of full where values are larger than filter for the given group
full[full$group == filter$group & full$values > filter$values, ]
prints an empty data.frame with the warning:
Warning messages:
1: In full$group == filter$group :
longer object length is not a multiple of shorter object length
2: In full$values > filter$values :
longer object length is not a multiple of shorter object length
I'm looking for all the rows in full that match that criteria, to end up with:
> full
group values
a 5
c 4
Using merge
full=merge(full,filter,by='group')
full=full[full$values.x>full$values.y,]
full$values.y=NULL
names(full)=c('group','values')
> full
group values
5 a 5
7 c 4
Or match
full$Filter=filter$values[match(full$group,filter$group)]
full=full[full$values>full$Filter,]
full$Filter=NULL
> full
group values
5 a 5
7 c 4
full[unlist(sapply(1:NROW(filter), function(i)
which(full$group == filter$group[i] & full$values > filter$values[i]))),]
# group values
#5 a 5
#7 c 4
Using base R functions Map, split, unlist, and logical indexing you can do
full[unlist(Map(">", split(full$values, full$group), split(filter$values, filter$group))),]
group values
5 a 5
7 c 4
Here, the value vectors are split by group into lists and fed to Map, which applies > to each pair. Because Map returns a list, unlist flattens it into a logical vector that is passed to [ for subsetting. Note that this requires that both data.frames are sorted by group and that each has the same levels in the group variable.
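To illustrate the sorting requirement, here is a sketch on deliberately unsorted data, ordering both frames by group before applying the Map approach. The data and the name filter_df are made up for the example.

```r
# Unsorted input: the groups are not contiguous.
full <- data.frame(group = c("c", "a", "a", "b", "a"),
                   values = c(4, 5, 1, 3, 2),
                   stringsAsFactors = FALSE)
filter_df <- data.frame(group = c("b", "a", "c"),
                        values = c(3, 4, 3),
                        stringsAsFactors = FALSE)

# Sort both frames by group so the flattened logical vector lines up with rows.
full_sorted <- full[order(full$group), ]
filter_sorted <- filter_df[order(filter_df$group), ]

keep <- unlist(Map(">",
                   split(full_sorted$values, full_sorted$group),
                   split(filter_sorted$values, filter_sorted$group)),
               use.names = FALSE)

result <- full_sorted[keep, ]
result
```

Without the sort, the logical vector would be in grouped order while the rows of full are not, so the wrong rows could be selected.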
One option is to use dplyr.
library(dplyr)
dt <- full %>%
left_join(filter, by = "group") %>%
dplyr::filter(values.x > values.y) %>%
select(group, values = values.x)
dt
group values
1 a 5
2 c 4
Or purrr.
library(purrr)
dt <- full %>%
split(.$group) %>%
map2_df(filter %>% split(.$group), ~.x[.x$values > .y$values, ])
dt
group values
1 a 5
2 c 4

Deduplicating a data frame when the order of values may differ in R

Let's say I have a data.frame that looks like this:
df = data.frame(from=c(1, 1, 2, 1),
to=c(2, 3, 1, 4),
title=c("A", "B", "A", "A"),
stringsAsFactors=F)
df is an object that holds all of the various connections for a network graph. I also have a second data.frame, which is the simplified graph data:
df2 = data.frame(from=c(1, 1, 3),
to=c(2, 4, 1),
stringsAsFactors=F)
What I need is to pull the title values from df into df2. I can't simply dedup df because a) from and to can be in different orders, and b) title is not unique between connections. The current condition I have is:
df2$title = df$title[df2$from == df$from & df2$to == df$to]
However, this results in too few rows due to the order of from and to being reversed in row 2 of df2. If I introduce an OR condition, then I get too many results because the connection between 1 and 2 will be matched twice.
My question, then, is how do I effectively "dedup" the title variable to append it to df2?
The expected outcome is this:
from to title
1 1 2 A
2 1 4 A
3 3 1 B
library(dplyr);
merge(mutate(df2, from1 = pmin(from, to), to1 = pmax(from, to)),
mutate(df, from1 = pmin(from, to), to1 = pmax(from, to)),
by = c("from1", "to1"), all.x = T) %>%
select(from1, to1, title) %>% unique()
# from1 to1 title
#1 1 2 A
#3 1 3 B
#4 1 4 A
Another approach: the edgeSort function builds an order-independent key for each edge by sorting its two vertices, so that (1, 2) and (2, 1) produce the same key; match then looks up each edge of df2 among the edges of df.
edgeSort <- function(df) apply(df, 1, function(row) paste0(sort(row[1:2]), collapse = ", "))
df2$title <- df$title[match(edgeSort(df2), edgeSort(df))]
df2
from to title
1 1 2 A
2 1 4 A
3 3 1 B
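A vectorised variant of the same key idea can be sketched with pmin and pmax, which avoids the row-wise apply. The helper name edge_key and the "-" separator are assumptions made for this example.

```r
df <- data.frame(from = c(1, 1, 2, 1),
                 to = c(2, 3, 1, 4),
                 title = c("A", "B", "A", "A"),
                 stringsAsFactors = FALSE)
df2 <- data.frame(from = c(1, 1, 3),
                  to = c(2, 4, 1),
                  stringsAsFactors = FALSE)

# Hypothetical helper: smaller vertex first, so direction does not matter.
edge_key <- function(d) paste(pmin(d$from, d$to), pmax(d$from, d$to), sep = "-")

# match() returns the first matching edge in df, so duplicates collapse.
df2$title <- df$title[match(edge_key(df2), edge_key(df))]
df2
```

Taking the first match is what deduplicates here; it assumes that all rows of df describing the same undirected edge carry the same title.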
I guess you can do it in base R by 2 merge statements:
step1 <- merge(df2, df, all.x = TRUE)
step2 <- merge(df2[is.na(step1$title),], df, all.x = TRUE, by.x = c("to", "from"), by.y = c("from", "to"))
rbind(step1[!is.na(step1$title),], step2)
from to title
1 1 2 A
2 1 4 A
3 3 1 B

R - co-locate columns with the same name after merge

Situation
I have two data frames, df1 and df2, with the same column headings
x <- c(1,2,3)
y <- c(3,2,1)
z <- c(3,2,1)
names <- c("id","val1","val2")
df1 <- data.frame(x, y, z)
names(df1) <- names
a <- c(1, 2, 3)
b <- c(1, 2, 3)
c <- c(3, 2, 1)
df2 <- data.frame(a, b, c)
names(df2) <- names
And am performing a merge
#library(dplyr) # not needed for merge
joined_df <- merge(x=df1, y=df2, c("id"),all=TRUE)
This gives me the columns in the joined_df as id, val1.x, val2.x, val1.y, val2.y
Question
Is there a way to co-locate the columns that had the same heading in the original data frames, to give the column order in the joined data frame as id, val1.x, val1.y, val2.x, val2.y?
Note that in my actual data frame I have 115 columns, so I'd like to stay clear of using joined_df <- joined_df[, c(1, 2, 4, 3, 5)] if possible.
Update/Edit: also, I would like to maintain the original order of column headings, so sorting alphabetically is not an option (-on my actual data, I realise it would work with the example I have given).
My desired output is
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
Update with solution for general case
The accepted answer solves my issue nicely.
I've adapted the code slightly here to use the original column names, without having to hard-code them in the rep function.
#specify columns used in merge
merge_cols <- c("id")
# identify duplicate columns and remove those used in the 'merge'
dup_cols <- names(df1)
dup_cols <- dup_cols [! dup_cols %in% merge_cols]
# replicate each duplicate column name and append an 'x' and 'y'
dup_cols <- rep(dup_cols, each=2)
var <- c("x", "y")
newnames <- paste(dup_cols, ".", var, sep = "")
#create new column names and sort the joined df by those names
newnames <- c(merge_cols, newnames)
joined_df <- joined_df[newnames]
How about something like this
numrep <- rep(1:2, each = 2)
numrep
var <- c("x", "y")
var
newnames <- paste("val", numrep, ".", var, sep = "")
newdf <- cbind(joined_df$id, joined_df[newnames])
names(newdf)[1] <- "id"
Which should give you the dataframe like this
id val1.x val1.y val2.x val2.y
1 1 3 1 3 3
2 2 2 2 2 2
3 3 1 3 1 1
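For the general case, the interleaved names can also be built without rep, by row-binding the ".x" and ".y" names and flattening column-wise. This is a sketch that assumes merge's default suffixes; the variable names shared, interleaved, and rest are illustrative.

```r
df1 <- data.frame(id = c(1, 2, 3), val1 = c(3, 2, 1), val2 = c(3, 2, 1))
df2 <- data.frame(id = c(1, 2, 3), val1 = c(1, 2, 3), val2 = c(3, 2, 1))

merge_cols <- "id"
joined_df <- merge(df1, df2, by = merge_cols, all = TRUE)

# Columns present in both inputs (other than the keys), in df1's order.
shared <- setdiff(intersect(names(df1), names(df2)), merge_cols)

# rbind() stacks the .x names over the .y names; as.vector() reads the
# result column by column, which interleaves them.
interleaved <- as.vector(rbind(paste0(shared, ".x"), paste0(shared, ".y")))

# Keep any columns that existed in only one input at the end.
rest <- setdiff(names(joined_df), c(merge_cols, interleaved))
joined_df <- joined_df[c(merge_cols, interleaved, rest)]
names(joined_df)
#[1] "id"     "val1.x" "val1.y" "val2.x" "val2.y"
```

The original column order is preserved because intersect and setdiff both keep the order of their first argument.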
