I have a list of dfs. The dfs all have the same column names. I would like to:
(1) Change one of the column names to the name of the df within the list
(2) full_join all the dfs after name change
Example of my list:
my_list <- list(one = data.frame(Type = c(1,2,3), Class = c("a", "a", "b")),
two = data.frame(Type = c(1,2,3), Class = c("a", "a", "b")))
Output that I want:
data.frame(Type = c(1,2,3),
one = c("a", "a", "b"),
two = c("a", "a", "b"))
Type one two
1 a a
2 a a
3 b b
You could possibly use dplyr::bind_rows combined with tidyr::spread to achieve the same result (if you are happy to consider alternative approaches). For example:
library(tidyverse)
my_list %>% bind_rows(.id = "groups") %>% spread(groups, Class)
#> Type one two
#> 1 1 a a
#> 2 2 a a
#> 3 3 b b
The first step can be tricky, but it's simple if you iterate over names(my_list).
transformed <- sapply(names(my_list), function(name) {
df <- my_list[[name]]
colnames(df)[colnames(df) == 'Class'] <- name
df
}, simplify = FALSE, USE.NAMES = TRUE)
With purrr::reduce and dplyr::full_join the result can be obtained:
purrr::reduce(transformed, dplyr::full_join)
# Type one two
# 1 1 a a
# 2 2 a a
# 3 3 b b
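For comparison, the same two steps (rename, then full join) can be sketched in base R only. This is a minimal sketch, assuming every data frame shares the key column Type and that Class is the column to rename; the names renamed and joined are only illustrative.

```r
my_list <- list(
  one = data.frame(Type = c(1, 2, 3), Class = c("a", "a", "b")),
  two = data.frame(Type = c(1, 2, 3), Class = c("a", "a", "b"))
)

# Step 1: rename "Class" in each data frame to that data frame's list name
renamed <- Map(function(df, name) {
  names(df)[names(df) == "Class"] <- name
  df
}, my_list, names(my_list))

# Step 2: full join (all = TRUE) the renamed data frames on "Type"
joined <- Reduce(function(x, y) merge(x, y, by = "Type", all = TRUE), renamed)
joined
```

Naming the key explicitly with by = "Type" avoids relying on the join guessing the common columns.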
I have 2 data frames with account numbers and amounts plus some other irrelevant columns. I would like to compare the output with a Y or N if they match or not.
I need to compare the account number in row 1 in dataframe A to the account number in row 1 in dataframe B and if they match put a Y in a column or an N if they don't. I've managed to get the code to check if there is a match in the entire dataframe but I need to check each row individually.
E.g.
df1
|account.num|x1|x2|x3|
|100|a|b|c|
|101|a|b|c|
|102|a|b|c|
|103|a|b|c|
df2
|account.num|x1|x2|x3|
|100|a|b|c|
|102|a|b|c|
|101|a|b|c|
|103|a|b|c|
output
|account.num|x1|x2|x3|match|
|100|a|b|c|Y|
|101|a|b|c|N|
|102|a|b|c|N|
|103|a|b|c|Y|
So, row 1 matches as they have the same account number, but row 2 doesn't because they are different. However, the other data in the dataframe doesn't matter just that column. Can I do this without merging the data frames? (I did have tables, but they won't work. I don't know why. So sorry if that's hard to follow).
You can use == to test whether the account.num values are equal row by row, and use the resulting logical vector to index into c("N", "Y"):
df1$match <- c("N", "Y")[1 + (df1[[1]] == df2[[1]])]
df1
# account.num x1 x2 x3 match
#1 100 a b c Y
#2 101 a b c N
#3 102 a b c N
#4 103 a b c Y
Data:
df1 <- data.frame(account.num=100:103, x1="a", x2="b", x3="c")
df2 <- data.frame(account.num=c(100,102,101,103), x1="a", x2="b", x3="c")
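The indexing trick above works because the logical comparison is coerced to 0/1, so adding 1 picks "N" or "Y". A more explicit variant of the same idea, sketched here with ifelse() and assuming both data frames have the same number of rows, might look like this:

```r
df1 <- data.frame(account.num = 100:103, x1 = "a", x2 = "b", x3 = "c")
df2 <- data.frame(account.num = c(100, 102, 101, 103), x1 = "a", x2 = "b", x3 = "c")

# Row-by-row comparison only makes sense if the row counts agree
stopifnot(nrow(df1) == nrow(df2))

# "Y" where the account numbers in the same row position match, "N" otherwise
df1$match <- ifelse(df1$account.num == df2$account.num, "Y", "N")
df1
```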
If you want a base R solution, here is a quick sketch. Assuming both data frames have the same number of rows, it should work with your data. Note that it compares every column in the row, not only the account number.
# example dataframes
a <- data.frame(A=c(1,2,3), B=c("one","two","three"))
b <- data.frame(A=c(3,2,1), B=c("three","two","one"))
res <- logical(nrow(a)) # initialise the result vector
for (rownum in seq_len(nrow(a))) {
# iterate over the row numbers and compare the two rows
res[rownum] <- all(a[rownum,]==b[rownum,])
}
res # result vector
# [1] FALSE TRUE FALSE
# you can put it in frame a like this. example colname is "equalB"
a$equalB <- res
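The loop can also be written without explicit iteration. The following is a sketch of a vectorised equivalent, assuming both data frames have the same shape and column order; res_vec is an illustrative name.

```r
a <- data.frame(A = c(1, 2, 3), B = c("one", "two", "three"), stringsAsFactors = FALSE)
b <- data.frame(A = c(3, 2, 1), B = c("three", "two", "one"), stringsAsFactors = FALSE)

# a == b gives a logical matrix; a row matches when every column is TRUE
res_vec <- unname(rowSums(a == b) == ncol(a))
res_vec
# [1] FALSE  TRUE FALSE
```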
If you want a tidyverse solution, you can use left_join.
The principle here is to try to match the data from df2 to the data from df1. Where a row matches, it gets TRUE in a match column. The code then replaces the remaining NA values with FALSE.
I'm also adding code to create the data frames from the example.
library(tidyverse)
df1 <-
tribble(~account_num, ~x1, ~x2, ~x3,
100, "a", "b", "c",
101, "a", "b", "c",
102, "a", "b", "c",
103, "a", "b", "c") %>%
rowid_to_column() # because position in the df is an important information,
# I need to hardcode it in the df
df2 <-
tribble(~account_num, ~x1, ~x2, ~x3,
100, "a", "b", "c",
102, "a", "b", "c",
101, "a", "b", "c",
103, "a", "b", "c") %>%
rowid_to_column()
# take df1
df1 %>%
# try to match df1 with version of df2 with a new column where `match` = TRUE
# according to `rowid`, `account_num`, `x1`, `x2`, and `x3`
left_join(df2 %>%
tibble::add_column(match = TRUE),
by = c("rowid", "account_num", "x1", "x2", "x3")
) %>%
# replace the NA in `match` with FALSE in the df
replace_na(list(match = FALSE))
I want to add rows to a dataframe (or tibble) as part of a data entry project. I need to:
Find one row that holds a specific value in one column (obsid)
Duplicate that row. However, replace the value in column "word".
Append the new row to the dataframe
I want to write a function that makes it easy. When I write the function, it won't add the new rows. I can print out the answer. But it won't alter the basic dataframe
If I do it without a function it works as well.
Why won't the function add the row?
df <- tibble(obsid = c("a","b" , "c", "d"), b=c("a", "a", "b", "b"), word= c("what", "is", "the", "answer"))
df$main <- 1
addrow <- function(id, newword) {
rowtoadd <- df %>%
filter(obsid== id & main==1) %>%
mutate(word=replace(word, main==1, newword)) %>%
mutate(main=replace(main, word==newword, 0))
df <- bind_rows(df, rowtoadd)
print(rowtoadd)
print(filter(df, df$obsid== id))}
addrow("a", "xxx")
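R functions work on a copy of the data frame, so assigning to df inside the function does not change the df outside it. You need to return the modified copy from the function (for example with return()) and assign the result back to df.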
Change your function to:
df <- tibble(obsid = c("a","b" , "c", "d"), b=c("a", "a", "b", "b"), word= c("what", "is", "the", "answer"))
df$main <- 1
addrow <- function(id, newword) {
rowtoadd <- df %>%
filter(obsid== id & main==1) %>%
mutate(word=replace(word, main==1, newword)) %>%
mutate(main=replace(main, word==newword, 0))
df <- bind_rows(df, rowtoadd)
return(df)
}
> addrow("a", "xxx")
# A tibble: 5 x 4
obsid b word main
<chr> <chr> <chr> <dbl>
1 a a what 1
2 b a is 1
3 c b the 1
4 d b answer 1
5 a a xxx 0
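To make the change stick, the returned value has to be assigned back. Here is a minimal base R sketch of that pattern; it uses a plain data.frame instead of a tibble, and add_row_copy and rows_before are hypothetical names chosen for illustration.

```r
df <- data.frame(
  obsid = c("a", "b", "c", "d"),
  b = c("a", "a", "b", "b"),
  word = c("what", "is", "the", "answer"),
  main = 1,
  stringsAsFactors = FALSE
)

# Takes the data as an argument and returns a modified copy
add_row_copy <- function(data, id, newword) {
  row_to_add <- data[data$obsid == id & data$main == 1, , drop = FALSE]
  row_to_add$word <- newword
  row_to_add$main <- 0
  rbind(data, row_to_add)
}

add_row_copy(df, "a", "xxx")    # prints a 5-row copy, df itself is untouched
rows_before <- nrow(df)         # still 4

df <- add_row_copy(df, "a", "xxx")  # assigning the result updates df
nrow(df)                            # now 5
```

Passing the data frame in as an argument, rather than reading df from the global environment, also makes the function reusable with other data frames.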
I want to count the number of the unique edges in an undirected network, e.g, net
x y
1 A B
2 B A
3 A B
There should be only one unique edge for this matrix, because edges A-B and B-A are the same in an undirected network.
For the directed network I can get the number of unique edges by:
nrow(unique(net[,c("x","y")]))
But this doesn't work for the undirected network.
Given that you are working with networks, an igraph solution:
library(igraph)
as_data_frame(simplify(graph_from_data_frame(dat, directed=FALSE)))
Then use nrow
Explanation
dat %>%
graph_from_data_frame(., directed=FALSE) %>% # convert to undirected graph
simplify %>% # remove loops / multiple edges
as_data_frame # return remaining edges
Try this,
df <- data.frame(x=c("A", "B", "A"), y = c("B", "A", "B"))
unique(apply(df, 1, function(x) paste(sort(unlist(strsplit(x, " "))),collapse = " ")))
[1] "A B"
So how does this work?
We are applying a function to each row of the data frame, so we can take each row at a time.
Take the second row of the df,
df[2,]
x y
1 B A
We then split this (strsplit) and unlist the result into a vector with one element per letter. (We use as.matrix to isolate the elements.)
unlist(strsplit(as.matrix(df[2,]), " "))
[1] "B" "A"
Use the sort function to put into alphabetical order, then paste them back together,
paste(sort(unlist(strsplit(as.matrix(df[2,]), " "))), collapse = " ")
[1] "A B"
Then the apply function does this for all the rows, as we set the index to 1, then use the unique function to identify unique edges.
Extension
This can be extended to n variables, for example n=3,
df <- data.frame(x=c("A", "B", "A"), y = c("B", "A", "B"), z = c("C", "D", "D"))
unique(apply(df, 1, function(x) paste(sort(unlist(strsplit(x, " "))),collapse = " ")))
[1] "A B C" "A B D"
If more letters are needed, just combine two letters like the following,
df <- data.frame(x=c("A", "BC", "A"), y = c("B", "A", "BC"))
df
x y
1 A B
2 BC A
3 A BC
unique(apply(df, 1, function(x) paste(sort(unlist(strsplit(x, " "))),collapse = " ")))
[1] "A B" "A BC"
Old version
Using the tidyverse package, create a function called rev that orders the edges (note that this name masks base::rev). Then use mutate to create a new column that combines the x and y columns in a form the rev function can handle, run the new column through the function, and find the unique pairs.
library(tidyverse)
rev <- function(x){
unname(sapply(x, function(x) {
paste(sort(trimws(strsplit(x[1], ',')[[1]])), collapse=',')} ))
}
df <- data.frame(x=c("A", "B", "A"), y = c("B", "A", "B"))
rows <- df %>%
mutate(both = c(paste(x, y, sep = ", ")))
unique(rev(rows$both))
Here is a solution without the intervention of igraph, all inside one pipe:
df = tibble(x=c("A", "B", "A"), y = c("B", "A", "B"))
It is possible to use group_by() and then sort() combinations of values and paste() them in the new column via mutate(). unique() is utilized if you have "true" duplicates (A-B, A-B will get into one group).
df %>%
group_by(x, y) %>%
mutate(edge_id = paste(sort(unique(c(x,y))), collapse=" "))
When you have properly sorted edge names in a new column, it's quite straightforward to count unique values or filter duplicates out of your data frame.
If you have additional variables for edges, just add them into grouping.
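Once each edge is stored in a canonical order, counting is a one-liner. As a dependency-free sketch of that idea, assuming a two-column character edge list named net as in the question, pmin() and pmax() can put the alphabetically smaller endpoint first; canonical and n_unique_edges are illustrative names.

```r
net <- data.frame(x = c("A", "B", "A"), y = c("B", "A", "B"),
                  stringsAsFactors = FALSE)

# Put the alphabetically smaller endpoint first so A-B and B-A look the same
canonical <- data.frame(
  from = pmin(net$x, net$y),
  to = pmax(net$x, net$y),
  stringsAsFactors = FALSE
)

# Number of unique undirected edges
n_unique_edges <- nrow(unique(canonical))
n_unique_edges
# [1] 1
```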
If you're not using {igraph} or just want to know how to do it cleanly without any dependencies...
Here's your data...
your_edge_list <- data.frame(x = c("A", "B", "A"),
y = c("B", "A", "B"),
stringsAsFactors = FALSE)
your_edge_list
#> x y
#> 1 A B
#> 2 B A
#> 3 A B
and here's a step-by-step breakdown...
`%>%` <- magrittr::`%>%`
your_edge_list %>%
apply(1L, sort) %>% # sort dyads
t() %>% # transpose resulting matrix to get the original shape back
unique() %>% # get the unique rows
as.data.frame() %>% # back to data frame
setNames(names(your_edge_list)) # reset column names
#> x y
#> 1 A B
If we drop the pipes, the core of it looks like this...
unique(t(apply(your_edge_list, 1, sort)))
#> [,1] [,2]
#> [1,] "A" "B"
And we can wrap it up in a function that 1) handles both directed and undirected, 2) handles data frames and (the more common) matrices, and 3) can drop loops...
simplify_edgelist <- function(el, directed = TRUE, drop_loops = TRUE) {
stopifnot(ncol(el) == 2)
if (drop_loops) {
el <- el[el[, 1] != el[, 2], , drop = FALSE] # drop = FALSE keeps a one-row matrix two-dimensional
}
if (directed) {
out <- unique(el)
} else {
out <- unique(t(apply(el, 1, sort)))
}
colnames(out) <- colnames(el)
if (is.data.frame(el)) {
as.data.frame(out, stringsAsFactors = FALSE)
} else {
out
}
}
el2 <- rbind(your_edge_list,
data.frame(x = c("C", "C"), y = c("C", "A"), stringsAsFactors = FALSE))
el2
#> x y
#> 1 A B
#> 2 B A
#> 3 A B
#> 4 C C
#> 5 C A
simplify_edgelist(el2, directed = FALSE)
#> x y
#> 1 A B
#> 5 A C
How can I use a column index with dplyr::left_join (and its family)?
Example (by column names):
library(dplyr)
data1 <- data.frame(var1 = c("a", "b", "c"), var2 = c("d", "d", "f"))
data2 = data.frame(alpha = c("d", "f"), beta = c(20, 30))
left_join(data1, data2, by = c("var2" = "alpha"))
However, replacing by = c("var2" = "alpha") with by = c(data1[,2] = data2[,1]) results in this error:
by must be a (named) character vector, list, or NULL for natural
joins (not recommended in production code), not logical.
I need to use the column position so that I can loop over columns inside new functions.
How can I do it?
Using dplyr:
# rename_at changes alpha into var2 in data2
left_join(data1, rename_at(data2, 1, ~ names(data1)[2]), by = names(data1)[2])
# output
var1 var2 beta
1 a d 20
2 b d 20
3 c f 30
Using base R:
merge(data1, data2, by.x = 2, by.y = 1, all.x = TRUE, all.y = FALSE)
# output
var2 var1 beta
1 d a 20
2 d b 20
3 f c 30
I don't know how you're going to use the column index but a hacky solution is the following:
#make a named vector for the by argument, see ?left_join
join_var <- names(data2)[1] #change index here based on data2
names(join_var) <- names(data1)[2] #change index here based on data1
left_join(data1, data2, by = join_var)
Depending on the final output you desire by using the column index, there is probably a more appropriate solution than this.
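If the goal is to call the join from a loop or another function, the positions can be turned into names inside a small wrapper. The following is a base R sketch under that assumption; join_by_position is a hypothetical helper, not part of dplyr or base R.

```r
data1 <- data.frame(var1 = c("a", "b", "c"), var2 = c("d", "d", "f"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(alpha = c("d", "f"), beta = c(20, 30),
                    stringsAsFactors = FALSE)

# Hypothetical helper: left join two data frames by column position
join_by_position <- function(left, right, left_pos, right_pos) {
  merge(left, right,
        by.x = names(left)[left_pos],    # translate positions into names
        by.y = names(right)[right_pos],
        all.x = TRUE)                    # keep every row of `left`
}

result <- join_by_position(data1, data2, left_pos = 2, right_pos = 1)
result
```

As with the merge() call above, the key column comes first in the result and the rows are sorted by the key.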
The shape of my data is fairly simple:
set.seed(1337)
id <- c(1:4)
values <- runif(0, 1, n=4)
df <- data.frame(id, values)
df
id values
1 1 0.57632155
2 2 0.56474213
3 3 0.07399023
4 4 0.45386562
What isn't simple: I have a list of character-value arrays that match up to each row, where each list item can be empty, or it can contain up to 5 separate tags, like...
tags <- list(
c("A"),
NA,
c("A", "B", "C"),
c("B", "C")
)
I will be asked various questions using the tags as classifiers, for instance, "what is the average value of all rows with a B tag?" Or "how many rows contain both tag A and tag C?"
What way would you choose to store the tags so that I can do this? My real-life data file is quite large, which makes experimenting with unlist or other commands difficult.
Here are a couple of options to get the expected output. Create 'tags' as a list column in the dataset and unnest it (as already suggested in the comments), then summarise the number of 'A' or 'C' tags by taking the sum of a logical vector. Similarly, take the mean of 'values' where 'tag' is 'B'.
library(tidyverse)
df %>%
mutate(tag = tags) %>%
unnest(tag) %>%
summarise(nAC = sum(tag %in% c("A", "C")),
meanB = mean(values[tag == "B"], na.rm = TRUE))
That is not very hard. You just need to assign your list to your df as a new column named tags and then unnest it. I have listed solutions for the two questions you mentioned.
library(tidyr)
library(dplyr)
df$tags=list(
c("A"),
NA,
c("A", "B", "C"),
c("B", "C")
)
Newdf=df%>%tidyr::unnest(tags)
Q1.
Newdf%>%group_by(tags)%>%summarise(Mean=mean(values))%>%filter(tags=='B')
tags Mean
<chr> <dbl>
1 B 0.263927925960161
Q2.
Newdf%>%group_by(id)%>%dplyr::summarise(Count=any(tags=='A')&any(tags=='C'))
# A tibble: 4 x 2
id Count
<int> <lgl>
1 1 FALSE
2 2 NA
3 3 TRUE
4 4 FALSE
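If unnesting the full data is too expensive, the two questions can also be answered directly on the list. This is a base R sketch assuming tags is a list aligned row by row with df; has_tag is a hypothetical helper name.

```r
set.seed(1337)
df <- data.frame(id = 1:4, values = runif(4))
tags <- list(c("A"), NA, c("A", "B", "C"), c("B", "C"))

# Hypothetical helper: TRUE for each row whose tag vector contains `tag`
has_tag <- function(tag_list, tag) {
  vapply(tag_list, function(t) tag %in% t, logical(1))
}

# Q1: average value of all rows with a B tag
mean_b <- mean(df$values[has_tag(tags, "B")])

# Q2: number of rows that contain both tag A and tag C
n_a_and_c <- sum(has_tag(tags, "A") & has_tag(tags, "C"))
```

Rows whose tag entry is NA simply count as not having the tag, because "B" %in% NA is FALSE.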