Join columns of the same data.frame - r

How can I join two columns from a single data.frame?
For example:
Column A: a,b,c,d,e
Column B: b,c,a,b,e
The column I want:
New column: a,b,c,d,e,b,c,a,b,e
Basically, I want to stack all the data from both columns into a single column.

df <- setNames(data.frame(matrix(, nrow = 100, ncol = 2)), c("V1", "V2"))
df$V1 <- "a, b, c, d, e"
df$V2 <- "b, c, a, b, e"
df$V3 <- paste(df$V1, df$V2, sep = ", ")
Hope this helps.

Using base R, you can build a new data.frame and concatenate columns A and B with the c() function:
df <- data.frame(
  A = c("a", "b", "c", "d", "e"),
  B = c("b", "c", "a", "b", "e"),
  stringsAsFactors = FALSE
)
# c() appends all values of B after all values of A
df2 <- data.frame(
  AB = c(df$A, df$B)
)
Alternatively, you could use a tidyverse approach with the gather() function from the tidyr package. This has the advantage that you can easily keep the old column ID (A or B) from the original data.frame in each row. (In current tidyr releases gather() still works but is superseded by pivot_longer().)
library(tidyr)
df_tidy <- df %>%
  gather(key = "old_col_id", value = "value", A, B)
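If you would rather not load a package, base R's stack() gives a similar long layout. This is a minimal sketch, assuming the columns are character vectors (stack() skips factor columns); the name df_long is only for illustration.

```r
# Same example data as above
df <- data.frame(
  A = c("a", "b", "c", "d", "e"),
  B = c("b", "c", "a", "b", "e"),
  stringsAsFactors = FALSE
)

# stack() puts all values into one column ("values") and records the
# name of the source column in a factor column ("ind")
df_long <- stack(df)

df_long$values              # all of A followed by all of B
as.character(df_long$ind)   # source column of each value
```

The values column is the single combined column asked for in the question, and ind plays the same role as old_col_id in the gather() version.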

Related

Grouping to form more than one comma-separated columns in data.table

Problem: I basically want to group data based on the data.table syntax and in parallel create two or more columns which contain comma-separated values (as in the example below).
Approach: I thought about an lapply where I can provide a list of columns which I want to comma-separate; however this did not turn out as expected.
Any suggestions?
EDIT
I am somehow looking for an approach where I only have to provide a list/vector of columns and then apply the function on this list (similar to the not-working lapply approach)
library(data.table)
dt <- data.table(
x = c(1, 1, 1, 3, 3, 2),
y = c("AA", "BB", "CC", "BB", "EE", "AA"),
z = c("H", "A", "C", "Z", "F", "G")
)
## Attempts
dt[, paste0(y, collapse = ","), by = .(x)]
dt[, lapply(c("y", "z"), paste0, collapse = ","), by = x]
## Desired Output
   x        y       z
1: 1 AA,BB,CC H, A, C
2: 3    BB,EE    Z, F
3: 2       AA       G
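You can apply toString() to every character column through .SD, so the columns never have to be listed by hand:
library(data.table)
dt[, lapply(.SD, toString), by = x, .SDcols = names(dt)[sapply(dt, is.character)]]
Or name each output column explicitly, here also dropping duplicate values within each group:
dt_sum <- dt[, .(yy = toString(unique(y)), zz = toString(unique(z))), by = c("x")]
dt_sum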
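The lapply() attempt in the question does not do what was hoped because it iterates over the strings "y" and "z" themselves rather than over the columns with those names. One way to drive the grouping from a character vector is to pass that vector to .SDcols. This is a sketch rather than the only option, and the vector name cols is hypothetical.

```r
library(data.table)

dt <- data.table(
  x = c(1, 1, 1, 3, 3, 2),
  y = c("AA", "BB", "CC", "BB", "EE", "AA"),
  z = c("H", "A", "C", "Z", "F", "G")
)

# Hypothetical vector naming the columns to collapse
cols <- c("y", "z")

# Within each group of x, .SD contains only the columns listed in .SDcols,
# so lapply() now sees the column values instead of the column names
res <- dt[, lapply(.SD, paste, collapse = ","), by = x, .SDcols = cols]
res
```

Adding another column to the result then only requires extending cols.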

bind_rows() error: by reading in a function?

The block below runs and produces df_all as intended. But when I uncomment the single function at the top (I do not even call it here, but I need it for other things) and rerun the same block, I get: Error in bind_rows_(x, .id): Argument 1 must be a data frame or a named atomic vector, not a function
library(data.table)
library(dplyr) # for %>%, filter() and bind_rows()
# addxtoy_newy_csv <- function(df) {
# zdf1 <- df %>% filter(Variable == "s44")
# setDT(df)
# setDT(zdf1)
# df[zdf1, Value := Value + i.Value, on=.(tstep, variable, Scenario)]
# setDF(df)
#}
tstep <- rep(c("a", "b", "c", "d", "e"), 5)
Variable <- c(rep(c("v"), 5), rep(c("w"), 5), rep(c("x"), 5), rep(c("y"), 5), rep(c("x"), 5))
Value <- c(1,2,3,4,5,10,11,12,13,14,33,22,44,57,5,3,2,1,2,3,34,24,11,11,7)
Scenario <- c(rep(c("i"), 20), rep(c("j"), 5) )
df1 <- data.frame(tstep, Variable, Value, Scenario)
tstep <- c("a", "b", "c", "d", "e")
Variable <- rep(c("x"), 5)
Value <- c(100, 34, 100,22, 100)
Scenario <- c(rep(c("i"), 5))
df2<- data.frame(tstep, Variable, Value, Scenario)
setDT(df1)
setDT(df2)
df1[df2, Value := Value + i.Value, on=.(tstep, Variable, Scenario)]
setDF(df1)
df_all <- mget(ls(pattern="df*")) %>% bind_rows()
The pattern argument of ls() is a regular expression, not a shell-style wildcard. "df*" means "d, followed by zero or more f's", so it matches any object with a "d" in its name, and addxtoy_newy_csv gets included in the list of object names. I think a safer pattern would be ^df.* (or simply ^df), which matches only objects whose names start with "df":
df1 <- data.frame(x = 1:3)
df2 <- data.frame(x = 4:6)
adder <- function(x) x + 1
ls(pattern = "df*")    # the result includes "adder"
ls(pattern = "^df.*")  # only names starting with "df"
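To see the difference without depending on what is in your workspace, the sketch below lists objects in a throwaway environment (env is a hypothetical name). It also filters the result with is.data.frame(), an extra safeguard that goes beyond the pattern change, so a stray function can never reach the binding step. Base rbind() stands in for bind_rows() here only to keep the example free of packages.

```r
# Throwaway environment so the example does not depend on the workspace
env <- new.env()
assign("df1", data.frame(x = 1:3), envir = env)
assign("df2", data.frame(x = 4:6), envir = env)
assign("adder", function(x) x + 1, envir = env)

loose  <- ls(env, pattern = "df*")   # "d" followed by zero or more "f"
strict <- ls(env, pattern = "^df")   # names that start with "df"

# Keep only genuine data frames before binding them together
dfs <- Filter(is.data.frame, mget(strict, envir = env))
df_all <- do.call(rbind, dfs)
```

Here loose contains "adder" while strict does not, which is exactly the difference that triggers the bind_rows() error in the question.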

How to combine two data frames by equal elements

I have two data frames containing the names of genetic elements. I want another data frame with the elements in common in both data frames.
Example:
  data.a   data.b
  Column   Column
1 a        c
2 b        e
3 c        l
4 d        a
I want this result:
data.c
Column
1 a
2 c
This is just an example. The data frame data.b has more elements than data.a.
The %in% operator lets you find which elements are in both.
data.c = data.frame(Column = data.a$Column[data.a$Column %in% data.b$Column])
data.c
Column
1 a
2 c
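The same approach as a self-contained example:
a <- data.frame(a = c("a", "b", "c", "d"))
a
b <- data.frame(b = c("c", "d", "e", "f"))
b
# Named "common" rather than "c" to avoid masking base::c()
common <- data.frame(common = a[a$a %in% b$b, ])
common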
The merge() function lets you control the type of join you want.
df1 <- data.frame(a = c("a", "b", "c", "d"))
df2 <- data.frame(a = c("c", "e", "l", "a"))
merge(x=df1, y=df2, by.x="a", by.y="a", all = FALSE)
With dplyr, you can use inner_join():
library(dplyr)
# tibble() replaces the deprecated data_frame()
data.a <- tibble(a = c("a", "b", "c", "d"))
data.b <- tibble(a = c("c", "e", "l", "a"))
data.c <- data.a %>% inner_join(data.b, by = "a")
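When only the shared values matter, base R's intersect() is another option. This is a minimal sketch, assuming both data frames store the names in a column called Column as in the question.

```r
data.a <- data.frame(Column = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
data.b <- data.frame(Column = c("c", "e", "l", "a"), stringsAsFactors = FALSE)

# intersect() returns each shared value once
data.c <- data.frame(
  Column = intersect(data.a$Column, data.b$Column),
  stringsAsFactors = FALSE
)
data.c
```

Unlike the %in% approach, intersect() drops duplicates, so a name that occurs several times in data.a appears only once in the result.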

Change the class of columns in a list of data frames

I have a list of data frames, e.g.
df1 = data.frame(ID=c("id1", "id2", "id3"), A1 = c("A", "A", "B"), A2 = c("AA", "AA", "AA"))
df2 = data.frame(ID=c("id2", "id3", "id4"), A1 = c("B", "B", "B"), A2=c("BB", "BB", "BB"))
df3 = data.frame(ID=c("id1", "id2", "id3"), A1 = c("A", "A", "A"), A2 = c("AA", "BB", "BB"))
listDF = list(df1, df2, df3)
I am wondering if there is a good way to change the class of these columns from factor to character. This is what I have tried:
d <- lapply(listDF, function(x) sapply(x[, c("A1", "A2")], as.character))
This code gives me the columns I want to change, but is there a way to change the class in place, without having to re-append the new columns?
The mutate_at() function from the dplyr package comes in handy here (the column argument is called .vars; older dplyr releases called it .cols):
library(dplyr)
d <- lapply(listDF, function(df) mutate_at(df, .vars = 2:3, as.character))
You can also pass the column names to the .vars parameter:
d <- lapply(listDF, function(df) mutate_at(df, .vars = c("A1", "A2"), as.character))
Or select the columns by regex:
# mutate columns whose names start with A
d <- lapply(listDF, function(df) mutate_at(df, vars(matches("^A")), as.character))
In base R, this could be:
d <- lapply(listDF, function(df) {df[c("A1", "A2")] <- lapply(df[c("A1", "A2")], as.character); df})
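If you do not want to name the columns at all, the same base R idea can target every factor column. This is a sketch under the assumption that all factor columns, including ID, should become character. Note that since R 4.0.0 data.frame() no longer creates factors by default, so the example sets stringsAsFactors = TRUE to reproduce the situation in the question. The helper name factors_to_character is hypothetical.

```r
df1 <- data.frame(ID = c("id1", "id2", "id3"), A1 = c("A", "A", "B"),
                  A2 = c("AA", "AA", "AA"), stringsAsFactors = TRUE)
df2 <- data.frame(ID = c("id2", "id3", "id4"), A1 = c("B", "B", "B"),
                  A2 = c("BB", "BB", "BB"), stringsAsFactors = TRUE)
listDF <- list(df1, df2)

# Hypothetical helper: convert every factor column of one data frame
factors_to_character <- function(df) {
  is_fac <- vapply(df, is.factor, logical(1))
  df[is_fac] <- lapply(df[is_fac], as.character)
  df
}

d <- lapply(listDF, factors_to_character)
sapply(d[[1]], class)   # check the resulting column classes
```

Assigning into df[is_fac] keeps the object a data.frame, so nothing has to be re-appended afterwards.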
We can use data.table (setDT() converts each data frame to a data.table):
library(data.table)
lapply(listDF, function(df) setDT(df)[, (2:3) := lapply(.SD, as.character), .SDcols = 2:3])
The list.update() function from the rlist package provides a handy alternative:
library(rlist)
names(listDF) <- c("df1", "df2", "df3")
d <- list.update(listDF, A1 = as.character(A1), A2 = as.character(A2))
With tidyverse, you can combine purrr's map() with across() inside mutate() to change particular columns in each data frame in the list. Here I use starts_with(), but you can also supply the column names directly (i.e., c(A1, A2)).
library(tidyverse)
map(listDF, ~ .x %>%
  mutate(across(starts_with("A"), as.character)))

R ddply vector version

I am looking for a vector version of ddply.
I would like to do the following:
vector_ddply(frame1, frame2, ..., frameN, c("column1", "column2"), processingFunction);
Here all frames have both "column1" and "column2" and processingFunction takes N parameters.
Note that in my specific case it doesn't make sense to merge the N data frames into one.
The resulting frame would be made of the union of all the keys of the N frames.
Is there a way to achieve this?
Thanks
Let's start with some sample data (the solutions below use the plyr package):
library(plyr)
ll <- list(
f1 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), p = 1:4 ),
f2 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), q = 1:4 ),
f3 = data.frame( x = c("a", "b", "a", "b"), y = c(1,1,2,2), z = rnorm(4), r = 1:4 )
)
1. Solution: apply data.frame-wise
You want to ddply processingFunction on each data.frame individually, and combine the results into one resulting data.frame:
ldply( ll, ddply, .(x, y), summarise, z = processingFunction(z) )
2. Solution: apply on one rbind-ed data.frame
You want to apply processingFunction over all rows of the data.frames at once. In that case, just rbind all data.frames together into one large data.frame. If this is not directly possible because the individual frames do not have all columns in common, rbind on the subset of common columns:
commonCols <- Reduce( "intersect", lapply(ll, colnames) )
oneDf <- do.call( "rbind", lapply( ll, "[", commonCols ) )
ddply( oneDf, .(x,y), summarise, z = processingFunction(z) )
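For readers without plyr, the second solution can be approximated in base R with aggregate(). This is only a sketch: processingFunction is not defined in the question, so mean stands in for it here as an assumption, and fixed z values replace rnorm() to keep the result reproducible.

```r
ll <- list(
  f1 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = c(1, 2, 3, 4),    p = 1:4),
  f2 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = c(5, 6, 7, 8),    q = 1:4),
  f3 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = c(9, 10, 11, 12), r = 1:4)
)

# Assumption: stand-in for the asker's own function
processingFunction <- mean

# Columns shared by every frame (here x, y and z)
commonCols <- Reduce(intersect, lapply(ll, colnames))

# Stack the frames on the shared columns only
oneDf <- do.call(rbind, lapply(ll, "[", commonCols))

# One row per (x, y) key, with processingFunction applied to z
res <- aggregate(z ~ x + y, data = oneDf, FUN = processingFunction)
res
```

Because the frames are stacked before grouping, a key that is present in only some of the frames still gets a row, which matches the "union of all keys" requirement in the question.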
