Merge rows in same data frame based on common values - r

Most of the approaches I've come across involve using dplyr to apply a function when combining features. However, I would just like to restructure a single data frame without applying any function to each group.
I have a single data frame that looks like this:
gene_name chr nb_pos nb_ref nb_alt m_pos m_ref m_alt
ACAA1 3 38173733 C T 38144875 G T
ACAA1 3 38144875 G T 38144876 G A
I would like to combine all rows that share a gene_name and chr into a single row, where each gene can have a variable number of rows, so that the result looks like this:
gene_name chr np_pos1 nb_ref1 nb_alt1 nb_pos2 nb_ref2 nb_alt2 nb_alt2
ACAA1 3 38173733 C T 38144875 G T T
Does anyone know of a way to do this?

We can use dcast from data.table. Casting several value.var columns at once was introduced in the devel version v1.9.5 and is available in CRAN releases from v1.9.6 onward.
Create a sequence column ('ind') based on the grouping columns ('gene_name', 'chr'), and then use dcast specifying the value.var columns.
library(data.table)
dcast(setDT(df1)[, ind:= 1:.N ,.(gene_name, chr)],
gene_name+chr~ind, value.var=names(df1)[3:8])
# gene_name chr 1_nb_pos 2_nb_pos 1_nb_ref 2_nb_ref 1_nb_alt 2_nb_alt 1_m_pos
#1: ACAA1 3 38173733 38144875 C G TRUE TRUE 38144875
# 2_m_pos 1_m_ref 2_m_ref 1_m_alt 2_m_alt
#1: 38144876 G G T A
Or using reshape from base R after we create the sequence column using ave.
df2 <- transform(df1, ind=ave(seq_along(gene_name),
gene_name, chr, FUN=seq_along))
reshape(df2, idvar=c('gene_name', 'chr'), timevar='ind',
direction='wide')
# gene_name chr nb_pos.1 nb_ref.1 nb_alt.1 m_pos.1 m_ref.1 m_alt.1 nb_pos.2
#1 ACAA1 3 38173733 C TRUE 38144875 G T 38144875
# nb_ref.2 nb_alt.2 m_pos.2 m_ref.2 m_alt.2
#1 G TRUE 38144876 G A
data
df1 <- structure(list(gene_name = c("ACAA1", "ACAA1"), chr = c(3L, 3L
), nb_pos = c(38173733L, 38144875L), nb_ref = c("C", "G"),
nb_alt = c(TRUE,
TRUE), m_pos = 38144875:38144876, m_ref = c("G", "G"), m_alt = c("T",
"A")), .Names = c("gene_name", "chr", "nb_pos", "nb_ref", "nb_alt",
"m_pos", "m_ref", "m_alt"), class = "data.frame",
row.names = c(NA, -2L))

Related

Is there a way to replace rows in one dataframe with another in R?

I'm trying to figure out how to replace rows in one dataframe with another by matching the values of one of the columns. Both dataframes have the same column names.
Ex:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"))
df2 <- data.frame(x = c(1,2), y = c("f", "g"))
Is there a way to replace the rows of df1 with the same row in df2 where they share the same x variable? It would look like this.
data.frame(x = c(1,2,3,4), y = c("f","g","c","d"))
I've been working on this for a while and this is the closest I've gotten -
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
But it just replaces the values with NA.
Does anyone know how to do this?
We can use match:
inds <- match(df1$x, df2$x)
df1$y[!is.na(inds)] <- df2$y[na.omit(inds)]
df1
# x y
#1 1 f
#2 2 g
#3 3 c
#4 4 d
First off, well done in producing a nice reproducible example that's directly copy-pastable. That always helps, especially with an example of expected output. Nice one!
You have several options, but let's look at why your solution doesn't quite work:
First of all, I tried copy-pasting your last line into a new session and got the dreaded factor-error:
Warning message:
In `[<-.factor`(`*tmp*`, iseq, value = 1:2) :
invalid factor level, NA generated
If we look at your data frames df1 and df2 with the str function, you will see that they do not contain text but factors. These are not text - in short they represent categorical data (male vs. female, scores A, B, C, D, and F, etc.) and are really integers that have a text as label. So that could be your issue.
Running your code gives a warning because you are trying to import new factors (labels) into df1 that don't exist. And R doesn't know what to do with them, so it just inserts NA-values.
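A minimal sketch of that behaviour, using a throwaway factor f that is not part of the original question: assigning a value that is not among the existing levels stores NA (with a warning), while registering the level first, or working with character vectors, avoids the problem.

```r
# Hypothetical example vector, not taken from the question
f <- factor(c("a", "b", "c"))

# "z" is not an existing level, so R warns ("invalid factor level, NA generated")
# and stores NA instead; the warning is silenced here to keep the output clean
g <- f
suppressWarnings(g[1] <- "z")
is.na(g[1])          # TRUE

# Option 1: register the new level before assigning
h <- f
levels(h) <- c(levels(h), "z")
h[1] <- "z"
as.character(h)      # "z" "b" "c"

# Option 2: work with plain character vectors, which have no fixed set of levels
s <- as.character(f)
s[1] <- "z"
s                    # "z" "b" "c"
```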
In his answer, r2evans used stringsAsFactors = FALSE to stop strings from being converted to factors. You can even disable the conversion for the whole session with options(stringsAsFactors=FALSE) (and I've heard it will be disabled by default in the forthcoming R 4.0 - yay!).
After disabling stringsAsFactors, your code works - or does it? Try this on for size:
df2 <- df2[c(2,1),]
df1[which(df1$x %in% df2$x),]$y <- df2[which(df1$x %in% df2$x),]$y
What's in df1 now? Not quite right anymore.
In the first line, I swapped the two rows in df2 and lo and behold, the replaced values in df1 were swapped. Why is that?
Let's deconstruct your statement df2[which(df1$x %in% df2$x),]$y
Calling df1$x %in% df2$x returns a logical (boolean) vector saying which elements of df1$x are found in df2$x - i.e. the first two and not the last two. But it doesn't say which position in the first vector corresponds to which position in the second.
Calling which(df1$x %in% df2$x) then reduces the logical vector to the indices that were TRUE. Again, we do not know which elements correspond to which.
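To make the difference concrete, here is a small sketch that reuses the question's data with the rows of df2 deliberately swapped. The helper names pos and hit are only illustrative. %in% reports presence only, whereas match() reports the row of df2 in which each value of df1$x was found, so the replacement stays aligned.

```r
df1 <- data.frame(x = c(1, 2, 3, 4), y = c("a", "b", "c", "d"),
                  stringsAsFactors = FALSE)
# Same content as in the question, but with the two rows swapped
df2 <- data.frame(x = c(2, 1), y = c("g", "f"), stringsAsFactors = FALSE)

# Presence only: TRUE TRUE FALSE FALSE, no information about positions in df2
df1$x %in% df2$x

# Position in df2 for every element of df1$x: 2 1 NA NA
pos <- match(df1$x, df2$x)

# Replace only the matched rows, taking each value from its matching row in df2
hit <- !is.na(pos)
df1$y[hit] <- df2$y[pos[hit]]
df1$y                # "f" "g" "c" "d"
```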
As for solutions, I would recommend r2evans' answer, as it doesn't rely on extra packages (although data.table and dplyr are two powerful packages to get to know).
In his solution, he uses merge to perform a "full join" which matches rows based on the value, rather than - well, what you did. With transform, he assigns new variables within the context of the data.frame returned from the merge function called in the first argument.
I think what you need here is a "merge" or "join" operation.
(I add stringsAsFactors=FALSE to the frames so that the merging and later work is without any issue, as factors can be disruptive sometimes.)
Base R:
df1 <- data.frame(x = c(1,2,3,4), y = c("a", "b", "c", "d"), stringsAsFactors = FALSE)
df2 <- data.frame(x = c(1,2), y = c("f", "g"), stringsAsFactors = FALSE)
merge(df1, df2, by = "x", all = TRUE)
# x y.x y.y
# 1 1 a f
# 2 2 b g
# 3 3 c <NA>
# 4 4 d <NA>
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y))
# x y.x y.y y
# 1 1 a f f
# 2 2 b g g
# 3 3 c <NA> c
# 4 4 d <NA> d
transform(merge(df1, df2, by = "x", all = TRUE), y = ifelse(is.na(y.y), y.x, y.y), y.x = NULL, y.y = NULL)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
Dplyr:
library(dplyr)
full_join(df1, df2, by = "x") %>%
mutate(y = coalesce(y.y, y.x)) %>%
select(-y.x, -y.y)
# x y
# 1 1 f
# 2 2 g
# 3 3 c
# 4 4 d
A join option with data.table where we join on the 'x' column, assign the values of 'y' in second dataset (i.y) to the first one with :=
library(data.table)
setDT(df1)[df2, y := i.y, on = .(x)]
NOTE: It is better to use stringsAsFactors = FALSE (which is the default from R 4.0.0 onward); otherwise all factor levels need to be common to both datasets.

R - How to use dplyr left_join by column index?

How can I use a column index with dplyr::left_join (and the rest of its family)?
Example (by column names):
library(dplyr)
data1 <- data.frame(var1 = c("a", "b", "c"), var2 = c("d", "d", "f"))
data2 = data.frame(alpha = c("d", "f"), beta = c(20, 30))
left_join(data1, data2, by = c("var2" = "alpha"))
However, replacing by = c("var2" = "alpha") with by = c(data1[,2] = data2[,1]) results in this error:
by must be a (named) character vector, list, or NULL for natural
joins (not recommended in production code), not logical.
I need to use the column position so that I can loop over columns inside new functions.
How can I do it?
Using dplyr:
# rename_at changes alpha into var2 in data2
left_join(data1, rename_at(data2, 1, ~ names(data1)[2]), by = names(data1)[2])
# output
var1 var2 beta
1 a d 20
2 b d 20
3 c f 30
Using base R:
merge(data1, data2, by.x = 2, by.y = 1, all.x = TRUE, all.y = FALSE)
# output
var2 var1 beta
1 d a 20
2 d b 20
3 f c 30
I don't know how you're going to use the column index but a hacky solution is the following:
#make a named vector for the by argument, see ?left_join
join_var <- names(data2)[1] #change index here based on data2
names(join_var) <- names(data1)[2] #change index here based on data1
left_join(data1, data2, by = join_var)
Depending on the final output you desire by using the column index, there is probably a more appropriate solution than this.
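Since the question mentions using the column position inside functions, the named-vector idea can be wrapped in a small helper. This is only a sketch: left_join_by_pos and its arguments x_pos and y_pos are hypothetical names, not part of dplyr, and the example assumes a single join column.

```r
library(dplyr)

# Hypothetical helper: left join x and y using column positions instead of names
left_join_by_pos <- function(x, y, x_pos, y_pos) {
  # Build the named vector that `by` expects: c(<name in x> = "<name in y>")
  by <- setNames(names(y)[y_pos], names(x)[x_pos])
  left_join(x, y, by = by)
}

data1 <- data.frame(var1 = c("a", "b", "c"), var2 = c("d", "d", "f"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(alpha = c("d", "f"), beta = c(20, 30),
                    stringsAsFactors = FALSE)

# Join the 2nd column of data1 to the 1st column of data2
res <- left_join_by_pos(data1, data2, x_pos = 2, y_pos = 1)
res
```

Because the positions are ordinary arguments, the helper can be called in a loop over several pairs of data frames.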

From long to wide formats just based on two columns Rstudio

This is my data frame:
I have a data frame with six columns; the last column contains the values. The column 'code' contains s and d, and the column 'Sex' contains M and F. There are two thousand offspring in the column 'offspring'.
seq parent code Sex offspring Value
1 49032 s M J44010_CCG7YANXX_2_661_X4 -0.38455056
2 48741 s M J44010_CCG7YANXX_2_661_X4 0.10574340
3 48757 s M J44010_CCG7YANXX_2_661_X4 0.39572906
4 48465 d f J44010_CCG7YANXX_2_661_X4 0.43409006
5 48521 d f J44010_CCG7YANXX_2_661_X4 0.40337447
6 48703 d f J44010_CCG7YANXX_2_661_X4 -0.38148980
The column parent includes ids for both males and females.
I want to keep the female/dam id, code and sex in columns right beside those of the male/sire, and also keep the sire value and the dam value separately. So the column 'Value' will be split into two parts.
The data frame should look like this:
seq parent1 sirecode Sex parent2 damcode Sex offspring sireValue damvalue
1 49032 s M 48465 d f J44010 -0.38455056 0.43409006
2 48741 s M 48521 d f J44010 0.10574340 0.40337447
3 48757 s M 48703 d f J44010 0.39572906 -0.38148980
So, each offspring will have 3 or 4 pair of parents.
I tried to use dcast function on it.
We could use dcast after creating a sequence column
library(data.table)
setDT(df1)[, n := seq_len(.N), .(code, Sex)]
dcast(df1, n + offspring ~ rowid(n), value.var = c('parent', 'code', 'Sex', 'Value'), sep = "")
# n offspring parent1 parent2 code1 code2 Sex1 Sex2 Value1 Value2
#1: 1 J44010_CCG7YANXX_2_661_X4 49032 48465 s d M f -0.3845506 0.4340901
#2: 2 J44010_CCG7YANXX_2_661_X4 48741 48521 s d M f 0.1057434 0.4033745
#3: 3 J44010_CCG7YANXX_2_661_X4 48757 48703 s d M f 0.3957291 -0.3814898
In base R, we can use reshape
df1$n <- with(df1, ave(seq_along(Sex), Sex, FUN = seq_along))
df1$n1 <- with(df1, ave(n, n, FUN = seq_along))
reshape(df1[-1], idvar = c('n', 'offspring'), timevar = 'n1', direction = 'wide' )
data
df1 <- structure(list(seq = 1:6, parent = c(49032L, 48741L, 48757L,
48465L, 48521L, 48703L), code = c("s", "s", "s", "d", "d", "d"
), Sex = c("M", "M", "M", "f", "f", "f"),
offspring = c("J44010_CCG7YANXX_2_661_X4",
"J44010_CCG7YANXX_2_661_X4", "J44010_CCG7YANXX_2_661_X4",
"J44010_CCG7YANXX_2_661_X4",
"J44010_CCG7YANXX_2_661_X4", "J44010_CCG7YANXX_2_661_X4"),
Value = c(-0.38455056,
0.1057434, 0.39572906, 0.43409006, 0.40337447, -0.3814898)),
class = "data.frame", row.names = c(NA, -6L))

When creating new data from a column, it has no name

dataset=structure(list(goods = structure(1:6, .Label = c("a", "b", "c",
"d", "e", "f"), class = "factor")), .Names = "goods", class = "data.frame", row.names = c(NA,
-6L))
goods
1 a
2 b
3 c
4 d
5 e
6 f
I want to create a new data set, so I simply do
df1=dataset$goods
but afterwards df1 no longer has the column name goods.
Why?
str(df1)
Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 5 6
As you can see, it doesn't have the name goods.
How can I make df1 keep the column name goods?
If this post is a duplicate, let me know and I will delete it.
You are assigning a column vector, not a data frame. To assign the whole data frame, simply do
df = dataset
If you want to preserve only some columns and not all, use column subsetting (documentation):
df = dataset[, "goods", drop = FALSE]
drop = FALSE is necessary here because the dataframe subset operator will otherwise return a vector instead of a data frame with a single column (this is arguably a bug, which is why tidyverse tibbles behave differently).
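The difference is easy to check on a small stand-in for dataset (the three-row data frame below is an assumption made for brevity, not the question's exact data):

```r
# Small stand-in for the question's `dataset`
dataset <- data.frame(goods = c("a", "b", "c"), stringsAsFactors = FALSE)

v  <- dataset$goods                      # plain vector, the column name is gone
d1 <- dataset[, "goods"]                 # also a plain vector: one column gets dropped
d2 <- dataset[, "goods", drop = FALSE]   # one-column data frame, name preserved
d3 <- dataset["goods"]                   # list-style indexing also keeps a data frame

is.data.frame(d1)    # FALSE
is.data.frame(d2)    # TRUE
names(d2)            # "goods"
identical(d2, d3)    # TRUE
```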
Using tidyverse operations (aka the “modern” R way), this would be written as
library(dplyr)
df = select(dataset, goods)
df1=data.frame(goods=dataset$goods, stringsAsFactors=F) works perfectly well, or you can use the longer but (somewhat?) more explicit:
ds <- dataset[,c("goods")]
df1=data.frame(goods=dataset$goods)
library(dplyr)
ds <- dataset[,c("goods")] %>% as.data.frame(stringsAsFactors=F)
colnames(ds) <- "goods"
edit: Added the stringsAsFactors option as it is useful to control where you'd like factor conversion or not. c("goods") is equivalent to "goods", but I left it as a template in case you need to add more columns.

Recode a variable using data.table

I am trying to recode a variable using data.table. I have googled for almost 2 hours but couldn't find an answer.
Assume I have a data.table as the following:
DT <- data.table(V1=c(0L,1L,2L),
V2=LETTERS[1:3],
V4=1:12)
I want to recode V1 and V2. For V1, I want to recode 1s to 0 and 2s to 1.
For V2, I want to recode A to T, B to K, C to D.
If I use dplyr, it is simple.
library(dplyr)
DT %>%
mutate(V1 = recode(V1, `1` = 0L, `2` = 1L)) %>%
mutate(V2 = recode(V2, A = "T", B = "K", C = "D"))
But I have no idea how to do this in data.table
DT[V1==1, V1 := 0]
DT[V1==2, V1 := 1]
DT[V2=="A", V2 := "T"]
DT[V2=="B", V2 := "K"]
DT[V2=="C", V2 := "D"]
The code above is the best I can come up with, but there must be a better and more efficient way to do this.
Edit
I changed how I want to recode V2 to make my example more general.
With data.table the recode can be solved with an update on join:
DT[.(V1 = 1:2, to = 0:1), on = "V1", V1 := i.to]
DT[.(V2 = LETTERS[1:3], to = c("T", "K", "D")), on = "V2", V2 := i.to]
which converts DT to
V1 V2 V4
1: 0 T 1
2: 0 K 2
3: 1 D 3
4: 0 T 4
5: 0 K 5
6: 1 D 6
7: 0 T 7
8: 0 K 8
9: 1 D 9
10: 0 T 10
11: 0 K 11
12: 1 D 12
Edit: @Frank suggested using i.to to be on the safe side.
Explanation
The expressions .(V1 = 1:2, to = 0:1) and .(V2 = LETTERS[1:3], to = c("T", "K", "D")), resp., create lookup tables on-the-fly.
Alternatively, the lookup tables can be set-up beforehand
lut1 <- data.table(V1 = 1:2, to = 0:1)
lut2 <- data.table(V2 = LETTERS[1:3], to = c("T", "K", "D"))
lut1
V1 to
1: 1 0
2: 2 1
lut2
V2 to
1: A T
2: B K
3: C D
Then, the update joins become
DT[lut1, on = "V1", V1 := i.to]
DT[lut2, on = "V2", V2 := i.to]
Edit 2: Answers to How can I use this code dynamically?
mat asked "How can I use this code dynamically?"
So, here is a modified version where the name of column to update is provided as a character variable my_var_name but the lookup tables still are created on-the-fly:
my_var_name <- "V1"
DT[.(from = 1:2, to = 0:1), on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
my_var_name <- "V2"
DT[.(from = LETTERS[1:3], to = c("T", "K", "D")), on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
There are 3 points to note:
1. Instead of naming the first column of the lookup table dynamically, it gets the fixed name from. This requires a join between differently named columns (foreign key join). The names of the columns to join on have to be specified via the on parameter.
2. The on parameter accepts character strings of the form "V1==from" for foreign key joins. This string is created dynamically using paste0().
3. In the expression (my_var_name) := i.to, the parentheses around my_var_name force data.table to use the contents of the variable as the column name rather than creating a column called my_var_name.
Dynamic code using pre-defined lookup tables
Now, while the column to recode is specified dynamically by a variable, the lookup tables to use are still hard-coded in the statement, which means we have stopped halfway: we also need to select the appropriate lookup table dynamically.
This can be achieved by storing the lookup tables in a list where each list element is named according to the column of DT it is supposed to recode:
lut_list <- list(
V1 = data.table(from = 1:2, to = 0:1),
V2 = data.table(from = LETTERS[1:3], to = c("T", "K", "D"))
)
lut_list
$V1
from to
<int> <int>
1: 1 0
2: 2 1
$V2
from to
<char> <char>
1: A T
2: B K
3: C D
Now, we can pick the appropriate lookup table from the list dynamically as well:
my_var_name <- "V1"
DT[lut_list[[my_var_name]], on = paste0(my_var_name, "==from"),
(my_var_name) := i.to]
Going one step further, we can recode all relevant columns of DT in a loop:
for (v in intersect(names(lut_list), colnames(DT))) {
DT[lut_list[[v]], on = paste0(v, "==from"), (v) := i.to]
}
Note that DT is updated by reference, i.e., only the affected elements are replaced in place without copying the whole object. So, the for loop is applied iteratively on the same data object. This is a speciality of data.table and will not work with data.frames or tibbles.
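A short sketch of what update by reference means in practice. The names alias and backup are only illustrative: a plain assignment does not copy a data.table, so := through either name changes the same object, while copy() creates an independent object.

```r
library(data.table)

DT <- data.table(V1 = c(0L, 1L, 2L), V4 = 1:3)

alias  <- DT         # no copy is made: both names refer to the same object
backup <- copy(DT)   # independent copy that later updates will not touch

# Update through the alias; DT changes as well because it is the same object
alias[V1 == 2L, V1 := 1L]

DT$V1       # 0 1 1
backup$V1   # 0 1 2
```

So if the original values are still needed after recoding, take a copy() before running the loop.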
I think this might be what you're looking for. On the left hand side of := we name the variables we want to update and on the right hand side we have the expressions we want to update the corresponding variables with.
DT[, c("V1","V2") := .(as.numeric(V1==2), sapply(V2, function(x) {if(x=="A") "T"
else if (x=="B") "K"
else if (x=="C") "D" }))]
# V1 V2 V4
#1: 0 T 1
#2: 0 K 2
#3: 1 D 3
#4: 0 T 4
#5: 0 K 5
#6: 1 D 6
#7: 0 T 7
#8: 0 K 8
#9: 1 D 9
#10: 0 T 10
#11: 0 K 11
#12: 1 D 12
Alternatively, just use recode within data.table:
library(dplyr)
DT[, c("V1","V2") := .(as.numeric(V1==2), recode(V2, "A" = "T", "B" = "K", "C" = "D"))]
mapvalues() from plyr, in combination with data.table, works really well.
I use it on large-ish data (50 to 400 million rows). Although I haven't benchmarked it against other possibilities, I find the clear syntax is worth a lot, as it means fewer errors in complicated recode operations.
library(data.table)
library(plyr)
DT <- data.table(V1=c(0L,1L,2L),
V2=LETTERS[1:3],
V4=1:12)
DT[, V1 := mapvalues(V1, from=c(1, 2), to=c(0, 1))]
DT[, V2 := mapvalues(V2, from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))]
For more complicated recode operations, I would always create a new variable first with NA, and use another data.table with from-to vectors as variables.
A feature that in some use cases is more of a bug is that mapvalues() keeps those values from the old variable that aren't in the from argument.
This is a problem if you're sure that all the valid values are in the from vector, so that any values in the data.table that aren't in this vector should become NA instead.
DT <- data.table(V1=c(LETTERS[1:3], 'i dont want this value transfered'),
V4=1:12)
map_DT <- data.table(from=c('A', 'B', 'C'), to=c('T', 'K', 'D'))
# NA variable to begin with is good practice because it is clearer to spot an error
DT[, V1_new := NA_character_]
DT[V1 %in% map_DT$from , V1_new := mapvalues(V1, from=map_DT$from, to=map_DT$to)][]
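For comparison, a base R sketch of the same "unmatched values become NA" behaviour, using a named character vector as the lookup table. This is an alternative technique, not part of plyr or data.table, and the names x, lookup and recoded are only illustrative.

```r
x <- c("A", "B", "C", "i dont want this value transfered")

# Names are the old values, elements are the new values
lookup <- c(A = "T", B = "K", C = "D")

# Indexing by name returns NA for every value that has no entry in the lookup
recoded <- unname(lookup[x])
recoded              # "T" "K" "D" NA
```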
Note that plyr is deprecated, so the mapvalues function is somewhat at risk of disappearing at some point in the future. Because of this, the update-join method proposed might be the better choice, although I find mapvalues just a tad clearer to read. It will most likely take many years before mapvalues is removed, but it is still something to keep in mind when deciding whether to use it.
