How to merge multiple columns into one in R

I have a data frame called mydf with hundreds of paired columns (value1 to valueX and rec1 to recX). I want to combine all these paired columns, ordered by their values, into single value and rec columns as shown in the result below. How can I do this in R?
mydf<-structure(list(samples = structure(1:3, .Label = c("A", "B",
"c"), class = "factor"), value1 = c(1, 8, 7), value2 = c(2, 5,
9), rec1 = c(7158, 6975, 6573), rec2 = c(1122, 2235, 229)), .Names = c("samples",
"value1", "value2", "rec1", "rec2"), row.names = c(NA, -3L), class = "data.frame")
result
sample value rec
A 1 7158
A 2 1122
B 5 2235
C 7 6573
B 8 6975
C 9 229

You could solve this quickly using data.table's melt method, which lets you specify regex patterns via the measure.vars argument:
library(data.table) # v >= 1.9.6
melt(setDT(mydf), measure = patterns("value", "rec"), value.name = c("value", "rec"))
# samples variable value rec
# 1: A 1 1 7158
# 2: B 1 8 6975
# 3: c 1 7 6573
# 4: A 2 2 1122
# 5: B 2 5 2235
# 6: c 2 9 229
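If you prefer the tidyverse, a roughly equivalent sketch with tidyr::pivot_longer (assuming tidyr >= 1.0.0; the names_pattern regex splits each column name into its stem and its index):
library(tidyr)
pivot_longer(mydf, cols = -samples,
             names_to = c(".value", "set"),
             names_pattern = "([a-z]+)(\\d+)")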

Related

Merging data frames and filling missing values

I want to merge the following three data frames and fill the missing values with -1. I think I should use the merge() function but I don't know exactly how to do it.
> df1
Letter Values1
1 A 1
2 B 2
3 C 3
> df2
Letter Values2
1 A 0
2 C 5
3 D 9
> df3
Letter Values3
1 A -1
2 D 5
3 B -1
The desired output would be:
Letter Values1 Values2 Values3
1 A 1 0 -1
2 B 2 -1 -1 # fill missing values with -1
3 C 3 5 -1
4 D -1 9 5
code:
> dput(df1)
structure(list(Letter = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), Values1 = c(1, 2, 3)), class = "data.frame", row.names = c(NA,
-3L))
> dput(df2)
structure(list(Letter = structure(1:3, .Label = c("A", "C", "D"
), class = "factor"), Values2 = c(0, 5, 9)), class = "data.frame", row.names = c(NA,
-3L))
> dput(df3)
structure(list(Letter = structure(c(1L, 3L, 2L), .Label = c("A",
"B", "D"), class = "factor"), Values3 = c(-1, 5, -1)), class = "data.frame", row.names = c(NA,
-3L))
You can put the data frames in a list and use merge with Reduce. Missing values in the merged data frame can then be replaced with -1.
new_df <- Reduce(function(x, y) merge(x, y, all = TRUE), list(df1, df2, df3))
new_df[is.na(new_df)] <- -1
new_df
# Letter Values1 Values2 Values3
#1 A 1 0 -1
#2 B 2 -1 -1
#3 C 3 5 -1
#4 D -1 9 5
A tidyverse way with the same logic (replace_na() comes from tidyr):
library(dplyr)
library(purrr)
library(tidyr)
list(df1, df2, df3) %>%
  reduce(full_join) %>%
  mutate(across(everything(), ~ replace_na(.x, -1)))
Here's a dplyr solution (replace_na() again comes from tidyr):
library(dplyr)
library(tidyr)
df1 %>%
  full_join(df2, by = "Letter") %>%
  full_join(df3, by = "Letter") %>%
  mutate_if(is.numeric, function(x) replace_na(x, -1))
output:
Letter Values1 Values2 Values3
<chr> <dbl> <dbl> <dbl>
1 A 1 0 -1
2 B 2 -1 -1
3 C 3 5 -1
4 D -1 9 5

Add the index of the list to bind_rows?

I have this data:
dat=list(structure(list(Group.1 = structure(3:4, .Label = c("A","B", "C", "D", "E", "F"), class = "factor"), Pr1 = c(65, 75)), row.names = c(NA, -2L), class = "data.frame"),NULL, structure(list( Group.1 = structure(3:4, .Label = c("A","B", "C", "D", "E", "F"), class = "factor"), Pr1 = c(81,4)), row.names = c(NA,-2L), class = "data.frame"))
I want to combine these using bind_rows(dat) but keep the index number as a variable.
The output should include the list index as a type column (here [[1]] and [[3]]):
type Group.1 Pr1
1 1 C 65
2 1 D 75
3 3 C 81
4 3 D 4
data.table solution
Use rbindlist() from the data.table package, which has built-in id support that respects NULL data frames.
library(data.table)
rbindlist( dat, idcol = TRUE )
.id Group.1 Pr1
1: 1 C 65
2: 1 D 75
3: 3 C 81
4: 3 D 4
dplyr - partial solution
bind_rows() also has id support, but it 'skips' empty (NULL) elements...
bind_rows( dat, .id = "id" )
id Group.1 Pr1
1 1 C 65
2 1 D 75
3 2 C 81
4 2 D 4
Note that the ID of the third element from dat becomes 2, and not 3.
According to the documentation of bind_rows(), you can supply a name for the .id argument of the function. When you apply bind_rows() to a list of data frames, the names of the list containing your data frames are assigned to the identifier column. But, as @Wimpel pointed out, there is a problem:
names(dat)
NULL
However, supplying names to the list will do the trick:
names(dat) <- 1:length(dat)
names(dat)
[1] "1" "2" "3"
bind_rows(dat, .id = "type")
type Group.1 Pr1
1 1 C 65
2 1 D 75
3 3 C 81
4 3 D 4
Or in one line, if you prefer:
bind_rows(setNames(dat, seq_along(dat)), .id = "type")
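For completeness, the same idea can be written as a single pipeline, assuming purrr is available: naming the list first preserves the original positions, and compact() drops the NULL elements before bind_rows().
library(dplyr)
library(purrr)
dat %>%
  set_names(seq_along(.)) %>%  # keep the original list positions as names
  compact() %>%                # drop NULL elements
  bind_rows(.id = "type")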

Matching a column in a data frame by nearest values in a column of another data frame

I have a question about matching two data frames.
Consider the following two datasets:
Dataframe 1:
"A" "B"
91 1
92 3
93 11
94 4
95 10
96 6
97 7
98 8
99 9
100 2
structure(list(A = 91:100, B = c(1, 3, 11, 4, 10, 6, 7, 8, 9,
2)), .Names = c("A", "B"), row.names = c(NA, -10L), class = "data.frame")
Dataframe 2:
"C" "D"
91.12 1
92.34 3
93.65 11
94.23 4
92.14 10
96.98 6
97.22 7
98.11 8
93.15 9
100.67 2
91.45 1
96.45 3
83.78 11
84.66 4
100 10
structure(list(C = c(91.12, 92.34, 93.65, 94.23, 92.14, 96.98,
97.22, 98.11, 93.15, 100.67, 91.25, 96.45, 83.78, 84.66, 100),
D = c(1, 3, 11, 4, 10, 6, 7, 8, 9, 2, 1, 3, 11, 4, 10)), .Names = c("C",
"D"), row.names = c(NA, -15L), class = "data.frame")
Now I want to find the rounded matches between columns A and C and replace column D with the corresponding value from column B of Dataframe 1. Where there is no corresponding value (by rounded match between A and C), I want NaN in the replaced column D.
result:
"C" "newD"
91.12 1
92.34 3
93.65 4
94.23 4
92.14 3
96.98 7
97.22 7
98.11 8
93.15 11
100.67 NaN
91.25 1
96.45 6
83.78 NaN
84.66 NaN
100 2
structure(list(C = c(91.12, 92.34, 93.65, 94.23, 92.14, 96.98,
97.22, 98.11, 93.15, 100.67, 91.25, 96.45, 83.78, 84.66, 100),
D = c(1, 3, 4, 4, 3, 7, 7, 8, 11, NaN, 1, 6, NaN, NaN, 2)), .Names = c("C",
"D"), row.names = c(NA, -15L), class = "data.frame")
Does anybody know how to do that, especially for large datasets?
Thanks a lot!
Making an update join with data.table:
library(data.table)
setDT(DF1); setDT(DF2)
DF2[, A := round(C)]
DF2[, D := DF1[DF2, on=.(A), x.B] ]
# alternately, chain together in one step:
DF2[, A := round(C)][, D := DF1[DF2, on=.(A), x.B] ]
This gives NA in unmatched rows. To switch those to NaN: DF2[is.na(D), D := NaN].
To drop the new DF2$A column, use DF2[, A := NULL].
Does anybody know how to do that, especially for large datasets?
This modifies DF2 in place (instead of making a new table like a vanilla join as in Mike's answer), so it should be fairly efficient for large tables. It might perform better if A is stored as an integer instead of a float in both tables.
On data.table 1.9.6, use on="A", B instead of on=.(A), x.B. Thanks to Mike H for checking this.
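For instance, a small sketch of that suggestion (DF1$A from the dput above is already integer, so only DF2's key needs converting; this only matters for very large tables):
DF2[, A := as.integer(round(C))]  # store the join key as integer rather than double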
You can create a lookup table where the values in A are used to look up the values in B.
Lookup = df1$B
names(Lookup) = df1$A
df3 = data.frame(C = df2$C, newD = Lookup[as.character(round(df2$C))])
df3$newD[is.na(df3$newD)] = NaN
For these types of merges I like SQL:
library(sqldf)
res <- sqldf("SELECT l.C, r.B
FROM df2 as l
LEFT JOIN df1 as r
on round(l.C) = round(r.A)")
res
# C B
#1 91.12 1
#2 92.34 3
#3 93.65 4
#4 94.23 4
#5 92.14 3
#6 96.98 7
#7 97.22 7
#8 98.11 8
#9 93.15 11
#10 100.67 NA
#11 91.45 1
#12 96.45 6
#13 83.78 NA
#14 84.66 NA
#15 100.00 2
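If you would rather stay in base R, roughly the same left join can be sketched with merge() on a rounded key, again using df1/df2 for Dataframe 1 and 2; note that merge() re-sorts the rows, so this assumes the output order does not matter:
df2$A <- round(df2$C)                           # rounded key
res <- merge(df2, df1, by = "A", all.x = TRUE)  # left join keeps unmatched rows as NA
res$newD <- ifelse(is.na(res$B), NaN, res$B)    # turn NA into NaN as requested
res[, c("C", "newD")]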

Subset R data.frame by index and name in one line

Sample data.frame:
structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
Output:
df
# a b c
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
I'd like to get the first and third columns, but I want to subset by name and also by column index.
df[, "a"]
# [1] 1 2 3
df[, 3]
# [1] 7 8 9
df[, c("a", 3)]
# Error in `[.data.frame`(df, , c("a", 3)) : undefined columns selected
df[, c(match("a", names(df)), 3)]
# a c
# 1 1 7
# 2 2 8
# 3 3 9
Are there functions or packages that allow for clean/simple syntax, as in the third example, while also achieving the result of the fourth example?
Maybe use dplyr?
For interactive use - i.e., if you know ahead of time the name of the column you want to select:
library(dplyr)
df %>% select(a, 3)
If you do not know the name of the column in advance, and want to pass it as a variable,
x <- names(df)[1]
x
[1] "a"
df %>% select_(x, 3)
Either way the output is
# a c
#1 1 7
#2 2 8
#3 3 9
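As a side note, select_() has since been deprecated; in current dplyr the programmatic version would presumably be written with all_of(), a tidyselect helper re-exported by dplyr:
x <- names(df)[1]
df %>% select(all_of(x), 3)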
In base R you can use subset() with its select argument.
df <- structure(list(a = c(1, 2, 3),
b = c(4, 5, 6), c = c(7, 8, 9)),
.Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df <- subset(df, select = c(a, 3))
You can index names(df) without using dplyr:
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df[,c("a",names(df)[3]) ]
Output:
a c
1 1 7
2 2 8
3 3 9

R: Subsetting a data.table with repeated column names by numerical positions

I have a data.table that looks like this:
> DT
A B C A B C D
1: 1 2 3 3 5 6 7
2: 2 1 3 2 1 3 4
Here's the dput
DT <- structure(list(A = 1:2, B = c(2L, 1L), C = c(3L, 3L), A = c(3L,
2L), B = c(5L, 1L), C = c(6L, 3L), D = c(7L, 4L)), .Names = c("A",
"B", "C", "A", "B", "C", "D"), row.names = c(NA, -2L), class = c("data.table",
"data.frame"))
Basically, I want to subset the columns according to their names. So for the name "B", I would do this:
subset(DT,,grep(unique(names(DT))[2],names(DT)))
B B
1: 2 2
2: 1 1
As you can see, the values are wrong as the second column is simply a repeat of the first. I want to get this instead:
B B
1: 2 5
2: 1 1
Can anyone help me please?
The following alternatives work for me:
pos <- grep("B", names(DT))
DT[, ..pos]
# B B
# 1: 2 5
# 2: 1 1
DT[, .SD, .SDcols = patterns("B")]
# B B
# 1: 2 5
# 2: 1 1
DT[, names(DT) %in% unique(names(DT))[2], with = FALSE]
# B B
# 1: 2 5
# 2: 1 1
