R, creating several random numbers from each data frame row - r

I want to generate several random numbers, sampled from normal distribution, for several pairs of mean and standard deviation.
These pairs are stored in a data frame with three columns containing the identifier of the pair, the mean, and the standard deviation, as in the following example:
ex <- data.frame("id" = c("id_1_0.1", "id_2_0.5"), "mean" = c(1, 2), "sd" = c(0.1, 0.5))
To create 10 random numbers for each pair, I used these two lines:
tmp <- by(cbind(ex$mean, ex$sd), ex$id, function(x) rnorm(10, mean = x[, 1], sd = x[, 2]))
tmp <- do.call(rbind, lapply(tmp, data.frame, stringsAsFactors = FALSE))
What I would like to do is to then merge both data frames ex and tmp to have all the information in one data frame.
With this method, the row names of tmp get a counter appended to each id (id_1_0.1.1, id_1_0.1.2, ...), so I cannot do a simple merge.
Should I try to solve this using a regex, or is there a simpler solution?

This code should work for you:
library(dplyr)
ex <- data.frame(id = c("id_1_0.1", "id_2_0.5"), mean = c(1, 2), sd = c(0.1, 0.5))
# apply() passes each row as a character vector, hence the as.numeric() calls
random_list <- apply(ex[, c("id", "mean", "sd")], 1, function(x) {
  data.frame(id = rep(x[1], 10),
             random = rnorm(10, mean = as.numeric(x[2]), sd = as.numeric(x[3])))
})
# Stack the per-row data frames and attach mean and sd by id
ex <- do.call(rbind, random_list) %>% left_join(ex, by = "id")
Hope this helps!
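A base R alternative, sketched here under the assumption that 10 draws per row are wanted (the names n_draws and long are introduced for illustration only): repeat each row of ex and let rnorm() recycle over the mean and sd columns, so that no merge is needed afterwards.

```r
ex <- data.frame(id = c("id_1_0.1", "id_2_0.5"), mean = c(1, 2), sd = c(0.1, 0.5))
n_draws <- 10  # hypothetical name: number of random values per row

# Repeat every row n_draws times, keeping id, mean and sd next to each draw
long <- ex[rep(seq_len(nrow(ex)), each = n_draws), ]
rownames(long) <- NULL

# rnorm() is vectorised over mean and sd, so each draw uses its own row's parameters
long$random <- rnorm(nrow(long), mean = long$mean, sd = long$sd)
head(long)
```

Because the parameters travel with each draw, the result already has the id, mean, sd and random columns in one data frame.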

I was able to use some regex to delete the incrementation counters off your IDs, allowing them to merge with your original IDs. There may be a prettier way to do this, but this appears to work.
# Pull the row names in and delete the trailing counter
tmp$id <- gsub("\\.[^.]*$", "", rownames(tmp))
# Merge with original data
new <- merge(ex, tmp, by = "id")
head(new)
# id mean sd X..i..
# 1 id_1_0.1 1 0.1 1.1226943
# 2 id_1_0.1 1 0.1 1.0666694
# 3 id_1_0.1 1 0.1 0.8848397
# 4 id_1_0.1 1 0.1 0.9839212
# 5 id_1_0.1 1 0.1 0.9027086
# 6 id_1_0.1 1 0.1 0.9389538
Regex: match a literal . followed by any number of non-. characters ([^.]*), anchored at the end of the string ($).
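A quick check of this kind of pattern on hand-written row names (the vector rn is made up for illustration). In this sketch the dot is escaped so that it can only match a literal period:

```r
# Row names of the form <id>.<counter>, as do.call(rbind, ...) builds them from a named list
rn <- c("id_1_0.1.1", "id_1_0.1.10", "id_2_0.5.3")

# Strip the last ".<counter>" chunk; the dots inside the id itself are left alone
ids <- sub("\\.[^.]*$", "", rn)
ids
# [1] "id_1_0.1" "id_1_0.1" "id_2_0.5"
```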

Related

How would I run a t test on 58 (variables) columns to compare 2 different data frames

I have 58 columns in each data frame, and I would like to test whether the two data frames differ significantly, both column by column and as a whole. Each of the 58 columns represents part of a water basin; together they sum to the whole, but individually they represent different things. I am not sure how to run a t.test on this. I am really new to coding and to R.
Here is a way of conducting t-tests on all columns of two data.frames using an lapply loop. Each test returns a list of class "htest", and the sapply calls extract the list members of interest.
tests_list <- lapply(seq_along(df1), function(i){
t.test(df1[[i]], df2[[i]])
})
sapply(tests_list, '[[', 'statistic')
sapply(tests_list, '[[', 'p.value')
sapply(tests_list, '[[', 'conf.int')
Test data
set.seed(2021)
n <- 20
df1 <- matrix(rnorm(n*4), ncol = 4)
df2 <- matrix(rnorm(n*4), ncol = 4)
df1 <- as.data.frame(df1)
df2 <- as.data.frame(df2)
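As a possible follow-up, the per-column results can be collected into one table. This is only a sketch: the results data frame and its column names are made up here, and the Holm adjustment via p.adjust() is an added suggestion for the many-columns case, not something the answer above prescribes.

```r
set.seed(2021)
n <- 20
df1 <- as.data.frame(matrix(rnorm(n * 4), ncol = 4))
df2 <- as.data.frame(matrix(rnorm(n * 4), ncol = 4))

# One Welch t-test per column pair (columns are matched by position)
tests_list <- lapply(seq_along(df1), function(i) t.test(df1[[i]], df2[[i]]))

# Gather the pieces of each "htest" object into one row per column
results <- data.frame(
  column    = names(df1),
  statistic = sapply(tests_list, function(x) unname(x$statistic)),
  p.value   = sapply(tests_list, `[[`, "p.value"),
  conf.low  = sapply(tests_list, function(x) x$conf.int[1]),
  conf.high = sapply(tests_list, function(x) x$conf.int[2])
)

# With many columns (58 in the question), adjust for multiple testing
results$p.adjusted <- p.adjust(results$p.value, method = "holm")
results
```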
In the simplest case, you would loop through each column and run one t-test per pair of columns. One such test is shown below.
# Dataframe 1, column 1: 100 values, mean = 1, SD = 1
df_1_col_1 = rnorm(100, 1, 1)
# Dataframe 2, column 1: 75 values, mean = 2, SD = 1
df_2_col_1 = rnorm(75, 2, 1)
# Null hypothesis: the difference between the means of x and y is 0
t.test(df_1_col_1, df_2_col_1)
# If the p-value is < 0.05, you reject the null hypothesis.
Or, you can aggregate the 58 columns row-wise to get one value per row, for example by taking the mean of the 58 column values. This gives one vector of values for each data frame (playing the role of df_1_col_1 and df_2_col_1 in the code above). If you don't like a simple mean, you can run a PCA on each data frame and use the first principal component from both data frames in the t-test.
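The row-wise aggregation could look roughly like this. The two data frames are simulated stand-ins (their sizes and means are arbitrary assumptions), and rowMeans() is just one possible aggregate:

```r
set.seed(1)
# Hypothetical stand-ins for the two 58-column data frames
df1 <- as.data.frame(matrix(rnorm(30 * 58, mean = 1), ncol = 58))
df2 <- as.data.frame(matrix(rnorm(25 * 58, mean = 2), ncol = 58))

# Collapse each row to a single value (here the mean of its 58 columns)
agg1 <- rowMeans(df1)
agg2 <- rowMeans(df2)

# One t-test on the aggregated values
res <- t.test(agg1, agg2)
res$p.value
```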

Problem in merging a gff file and a csv file in R [duplicate]

I want to merge two data frames keeping the original row order of one of them (df.2 in the example below).
Here are some sample data (all values from class column are defined in both data frames):
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
If I do:
merge(df.2, df.1)
Output is:
class object prob
1 1 B 0.5
2 1 C 0.5
3 2 A 0.7
4 2 D 0.7
5 3 F 0.3
If I add sort = FALSE:
merge(df.2, df.1, sort = F)
Result is:
class object prob
1 2 A 0.7
2 2 D 0.7
3 1 B 0.5
4 1 C 0.5
5 3 F 0.3
But what I would like is:
class object prob
1 2 A 0.7
2 1 B 0.5
3 2 D 0.7
4 3 F 0.3
5 1 C 0.5
You just need to create a variable that stores the row number in df.2. Then, once you have merged your data, you sort the new data set by this variable. Here is an example:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
df.2$id <- 1:nrow(df.2)
out <- merge(df.2,df.1, by = "class")
out[order(out$id), ]
Check out the join function in the plyr package. It's like merge, but it allows you to keep the row order of one of the data sets. Overall, it's more flexible than merge.
Using your example data, we would use join like this:
> join(df.2,df.1)
Joining by: class
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
Here are a couple of links describing fixes to the merge function for keeping the row order:
http://www.r-statistics.com/2012/01/merging-two-data-frame-objects-while-preserving-the-rows-order/
http://r.789695.n4.nabble.com/patching-merge-to-allow-the-user-to-keep-the-order-of-one-of-the-two-data-frame-objects-merged-td4296561.html
You can also check out the inner_join function in Hadley's dplyr package (next iteration of plyr). It preserves the row order of the first data set. The minor difference to your desired solution is that it also preserves the original column order of the first data set. So it does not necessarily put the column we used for merging at the first position.
Using your example above, the inner_join result looks like this:
inner_join(df.2,df.1)
Joining by: "class"
object class prob
1 A 2 0.7
2 B 1 0.5
3 D 2 0.7
4 F 3 0.3
5 C 1 0.5
From data.table v1.9.5+, you can do:
require(data.table) # v1.9.5+
setDT(df.1)[df.2, on="class"]
This performs a join on the column class by finding the matching rows in df.1 for each row in df.2 and extracting the corresponding columns.
For the sake of completeness, updating in a join preserves the original row order as well. This might be an alternative to Arun's data.table answer if there are only a few columns to append:
library(data.table)
setDT(df.2)[df.1, on = "class", prob := i.prob][]
object class prob
1: A 2 0.7
2: B 1 0.5
3: D 2 0.7
4: F 3 0.3
5: C 1 0.5
Here, df.2 is right joined to df.1 and gains a new column prob which is copied from the matching rows of df.1.
The accepted answer proposes a manual way to keep order when using merge, which works most of the time but requires unnecessary manual work. This solution builds on How to ddply() without sorting?, which deals with the issue of keeping order but in a split-apply-combine context:
This came up on the plyr mailing list a while back (raised by @kohske no less) and this is a solution offered by Peter Meilstrup for limited cases:
#Peter's version used a function gensym to
# create the col name, but I couldn't track down
# what package it was in.
keeping.order <- function(data, fn, ...) {
col <- ".sortColumn"
data[,col] <- 1:nrow(data)
out <- fn(data, ...)
if (!col %in% colnames(out)) stop("Ordering column not preserved by function")
out <- out[order(out[,col]),]
out[,col] <- NULL
out
}
So now you can use this generic keeping.order function to keep the original row order of a merge call:
df.1<-data.frame(class=c(1,2,3), prob=c(0.5,0.7,0.3))
df.2<-data.frame(object=c('A','B','D','F','C'), class=c(2,1,2,3,1))
keeping.order(df.2, merge, y=df.1, by = "class")
Which will yield, as requested:
> keeping.order(df.2, merge, y=df.1, by = "class")
class object prob
3 2 A 0.7
1 1 B 0.5
4 2 D 0.7
5 3 F 0.3
2 1 C 0.5
So keeping.order effectively automates the approach in the accepted answer.
Thanks to @PAC, I came up with something like this:
merge_sameord = function(x, y, ...) {
UseMethod('merge_sameord')
}
merge_sameord.data.frame = function(x, y, ...) {
rstr = paste(sample(c(0:9, letters, LETTERS), 12, replace=TRUE), collapse='')
x[, rstr] = 1:nrow(x)
res = merge(x, y, all.x=TRUE, sort=FALSE, ...)
res = res[order(res[, rstr]), ]
res[, rstr] = NULL
res
}
This assumes that you want to preserve the order of the first data frame, and that the merged data frame will have the same number of rows as the first data frame. It will give you a clean data frame without extra columns.
In this specific case you could use factor for a compact base solution:
df.2$prob = factor(df.2$class,labels=df.1$prob)
df.2
# object class prob
# 1 A 2 0.7
# 2 B 1 0.5
# 3 D 2 0.7
# 4 F 3 0.3
# 5 C 1 0.5
This is not a general solution, however; it works if:
1. You have a lookup table containing unique values
2. You want to update a table, not create a new one
3. The lookup table is sorted by the merging column
4. The lookup table doesn't have extra levels
5. You want a left_join
6. You're fine with factors
Condition 1 is not negotiable; for the rest we can do:
df.3 <- df.2 # deal with 2.
df.1b <- df.1[order(df.1$class),] # deal with 3.
df.1b <- df.1b[df.1b$class %in% df.2$class,] # deal with 4.
df.3$prob = factor(df.3$class,labels=df.1b$prob)
df.3 <- df.3[!is.na(df.3$prob),] # deal with 5. if you want an `inner join`
df.3$prob <- as.numeric(as.character(df.3$prob)) # deal with 6.
For package developers
As a package developer, you want to depend on as few other packages as possible. This applies especially to tidyverse functions, which change too often for package developers, IMHO.
To be able to make use of the join functions of the dplyr package without importing dplyr, below is a quick implementation. It keeps the original sorting (as requested by OP) and does not move the joining column to the front (which is another annoying thing of merge()).
left_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.x = TRUE, ...)
}
right_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all.y = TRUE, ...)
}
inner_join <- function(x, y, ...) {
  # merge() defaults to all = FALSE, i.e. an inner join
  merge_exec(x = x, y = y, ...)
}
full_join <- function(x, y, ...) {
  merge_exec(x = x, y = y, all = TRUE, ...)
}
# workhorse:
merge_exec <- function(x, y, ...) {
  # set index
  x$join_id_ <- seq_len(nrow(x))
  # do the join
  joined <- merge(x = x, y = y, sort = FALSE, ...)
  # get the suffixes; fall back to merge()'s own default
  if ("suffixes" %in% names(list(...))) {
    suffixes <- list(...)$suffixes
  } else {
    suffixes <- c(".x", ".y")
  }
  # get column names in the right order, so the 'by' column won't be forced first
  cols <- unique(c(colnames(x),
                   paste0(colnames(x), suffixes[1]),
                   colnames(y),
                   paste0(colnames(y), suffixes[2])))
  # restore the original row order and column order
  joined[order(joined$join_id_),
         cols[cols %in% colnames(joined) & cols != "join_id_"]]
}
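For reference, this is how the four join types map onto the all, all.x and all.y arguments of base merge(). The data frames x and y are made up for illustration:

```r
x <- data.frame(key = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- data.frame(key = c(2, 3, 4), b = c("y2", "y3", "y4"))

inner <- merge(x, y, by = "key")                # keys present in both: 2, 3
left  <- merge(x, y, by = "key", all.x = TRUE)  # all keys of x: 1, 2, 3
right <- merge(x, y, by = "key", all.y = TRUE)  # all keys of y: 2, 3, 4
full  <- merge(x, y, by = "key", all = TRUE)    # union of keys: 1, 2, 3, 4

# Row counts per join type
sapply(list(inner = inner, left = left, right = right, full = full), nrow)
```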
The highest rated answer does not produce what the Original Poster would like, i.e., "class" in column 1. If OP would allow switching column order in df.2, then here is a possible base R non-merge one-line answer:
df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(class = c(2, 1, 2, 3, 1), object = c('A', 'B', 'D', 'F', 'C'))
cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])
I happen to like the information portrayed in the row.names. A complete one-liner that exactly duplicates the OP's desired outcome is
data.frame(cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE]),
row.names = NULL)
I agree with https://stackoverflow.com/users/4575331/ms-berends that the fewer dependencies of a package developer on another package (or "verse") the better because development paths frequently diverge over time.
Note: The one-liner above does not work when there are duplicates in df.1$class. This can be overcome sans merge with 'outer' and a loop, or more generally with Ms Berends' clever post-merge rescrambling code.
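The duplicate-key caveat can be seen on a small made-up lookup table in which class 1 appears twice (the data below are illustrative, not the OP's):

```r
# Lookup table with a duplicated key (class 1 appears twice)
df.1 <- data.frame(class = c(1, 1, 2), prob = c(0.5, 0.9, 0.7))
df.2 <- data.frame(class = c(2, 1), object = c("A", "B"))

# match() returns only the FIRST matching position, so prob = 0.9 is silently dropped
via_match <- cbind(df.2, df.1[match(df.2$class, df.1$class), -1, drop = FALSE])

# merge() keeps every matching pair, so class 1 yields two rows
via_merge <- merge(df.2, df.1, by = "class")

nrow(via_match)  # 2
nrow(via_merge)  # 3
```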
There are several use cases in which a simple subset will do:
# Use the key variable as row.names
row.names(df.1) = df.1$key
# Sort df.1 so that its rows match df.2 (index by name, hence as.character)
df.3 = df.1[as.character(df.2$key), ]
# Create a data.frame with the variables from df.2 and (the sorted) df.1
df.4 = cbind(df.2, df.3)
This code will preserve df.2 and its order and add only matching data from df.1.
If only one variable is to be added, the cbind() is not required:
row.names(df.1) = df.1$key
df.2$data = df.1[as.character(df.2$key), "data"]
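A small worked version of the row-name lookup, with made-up data (key and data are placeholder column names). One thing to watch with numeric keys: indexing by a number selects rows by position, so this sketch converts the key to character to select rows by name.

```r
# Hypothetical data: `key` and `data` are placeholder column names
df.1 <- data.frame(key = c(10, 20, 30), data = c("a", "b", "c"))
df.2 <- data.frame(key = c(30, 10, 30, 20))

row.names(df.1) <- as.character(df.1$key)

# Index by character so that 30 means the row NAMED "30", not the 30th row
df.2$data <- df.1[as.character(df.2$key), "data"]
df.2
```

The order of df.2 is untouched, and repeated keys in df.2 simply pick up the same lookup row more than once.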
I had the same problem and simply added a helper column 'num' holding the row numbers 1 to 5:
df.2 <- data.frame(object = c('A', 'B', 'D', 'F', 'C'), class = c(2, 1, 2, 3, 1))
df.2$num <- seq_len(nrow(df.2)) # original row positions, used for ordering in the last step
dfm <- merge(df.2, df.1) # merged
dfm <- dfm[order(dfm$num),] # back to the original (ascending) order
There may be a more efficient way in base. This would be fairly simple to make into a function.
varorder <- names(mydata) # remember the column order before merging
mydata <- merge(mydata, otherData, by = "commonVar")
restOfvars <- names(mydata)[!(names(mydata) %in% varorder)]
mydata[c(varorder, restOfvars)]
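Wrapped into a function, the idea might look like the following sketch. The name merge_keep_colorder is hypothetical, and the example reuses the df.1 and df.2 data from the question:

```r
# Hypothetical helper: merge, then restore x's column order (new columns go last)
merge_keep_colorder <- function(x, y, ...) {
  varorder <- names(x)
  merged <- merge(x, y, ...)
  rest <- setdiff(names(merged), varorder)
  merged[c(intersect(varorder, names(merged)), rest)]
}

df.1 <- data.frame(class = c(1, 2, 3), prob = c(0.5, 0.7, 0.3))
df.2 <- data.frame(object = c("A", "B", "D", "F", "C"), class = c(2, 1, 2, 3, 1))
out <- merge_keep_colorder(df.2, df.1, by = "class")
names(out)
# [1] "object" "class"  "prob"
```

Note that this only restores the column order; the row order is still whatever merge() returns.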

How to use parameters from data frame in R and loop through time holding them constant

I have a function (weisurv) that has 2 parameters - sc and shp. It is a function through time (t). Time is a sequence, i.e. t<-seq(1:100).
weisurv<-function(t,sc,shp){
surv<-exp(-(t/sc)^shp)
return(surv)
}
I have a data frame (df) that contains a list of sc and shp values (more than 300 of them). For example, I have:
M shp sc p C i
1 1 1.138131 10.592154 0.1 1 1
2 1.01 1.143798 10.313217 0.1 1 2
3 1.02 1.160653 10.207863 0.1 1 3
4 1.03 1.185886 9.861997 0.1 1 4
...
I want to apply each set (row) of sc and shp parameters to my function, so the call would look like weisurv(t, sc[i], shp[i]) for each row i. I do not understand how to use apply or adply to do this, though I'm sure one of these or a combination of both is what is needed.
In the end, I am looking for a data frame that gives a value of weisurv for each time given a set of sc and shp (held constant through time). So if I had 10 sets of sc and shp parameters, I would end up with 10 time series of weisurv.
Thanks....
Using plyr:
As a matrix (time in cols, rows corresponding to rows of df):
aaply(df, 1, function(x) weisurv(t, x$sc, x$shp), .expand = FALSE)
As a list:
alply(df, 1, function(x) weisurv(t, x$sc, x$shp))
As a data frame (structure as per matrix above):
adply(df, 1, function(x) setNames(weisurv(t, x$sc, x$shp), t))
As a long data frame (one row per t/sc/shp combination; note that this uses mutate and the pipe operator from dplyr):
newDf <- data.frame(t = rep(t, nrow(df)),
                    sc = rep(df$sc, each = length(t)),
                    shp = rep(df$shp, each = length(t))) %>%
  mutate(surv = weisurv(t, sc, shp))
You can also create a wide data.frame and then use reshape2::melt to reformat as long:
wideDf <- adply(df, 1, function(x) setNames(weisurv(t, x$sc, x$shp), t))
newDf <- melt(wideDf, id.vars = colnames(df), variable.name = "t", value.name = "surv")
newDf$t <- as.numeric(as.character(newDf$t))
Pretty plot of last newDf (using ggplot2):
ggplot(newDf, aes(x = t, y = surv, col = sprintf("sc = %0.3f, shp = %0.3f", sc, shp))) +
geom_line() +
scale_color_discrete(name = "Parameters")
Not sure about the exact structure you want in the final dataframe...
and I think there must be a cleaner way to do this, but this should work.
option 1
rows are the same as your df, with a new column t<n> for each value of t:
for (n in t) {
  # add one column per time point, named t1, t2, ...
  df[[paste0('t', n)]] <- weisurv(n, df$sc, df$shp)
}
option 2
long dataframe, with columns sc, shp, t, and weisurv(t,sc,shp):
l = length(t)
newdf <- data.frame(sc=rep(df$sc, each=l), shp=rep(df$shp, each=l),
t=rep(t, times=nrow(df)) )
newdf$weisurv <- weisurv(newdf$t, newdf$sc, newdf$shp)
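A base R sketch of the same idea without plyr, using mapply() to walk over the parameter rows. The three parameter rows are taken from the question's example; surv_mat is a name introduced here.

```r
weisurv <- function(t, sc, shp) exp(-(t / sc)^shp)

t <- 1:100
# First rows of the example data from the question
df <- data.frame(shp = c(1.138131, 1.143798, 1.160653),
                 sc  = c(10.592154, 10.313217, 10.207863))

# One column per parameter row, one row per time point
surv_mat <- mapply(function(sc, shp) weisurv(t, sc, shp), df$sc, df$shp)
dim(surv_mat)
# [1] 100   3
```

Each column of surv_mat is one time series of weisurv with sc and shp held constant.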

Combine and aggregate multiple data.frames

I have a collection of .csv files each consisting of the same number of rows and columns. Each file contains observations (column 'value') of some test subjects characterised by A, B, C and takes the form similar to the following:
A B C value
1 1 1 0.5
1 1 2 0.6
1 2 1 0.1
1 2 2 0.2
. . . .
Suppose each file is read into a separate data frame. What would be the most efficient way to combine these data frames into a single data frame in which 'value' column contains means, or generally speaking, results of some function call over all 'value' rows for a given test subject. Columns A, B and C are constant across all files and can be viewed as keys for these observations.
Thank you for your help.
This should be pretty easy, assuming that the files are all ordered in the same way:
dflist <- lapply(dir(pattern='csv'), read.csv)
# row means:
rowMeans(do.call('cbind', lapply(dflist, `[`, 'value')))
# other function `myfun` applied to each row:
apply(do.call('cbind', lapply(dflist, `[`, 'value')), 1, myfun)
Here is another solution in the case where the keys might be in any order, or maybe missing:
n <- 10 # of csv files to create
obs <- 10 # of observations
# create test files
for (i in 1:n){
df <- data.frame(A = sample(1:3, obs, TRUE)
, B = sample(1:3, obs, TRUE)
, C = sample(1:3, obs, TRUE)
, value = runif(obs)
)
write.csv(df, file = tempfile(fileext = '.csv'), row.names = FALSE)
}
# read in the data (note: this picks up every .csv file in tempdir())
input <- lapply(list.files(tempdir(), "\\.csv$", full.names = TRUE)
, function(file) read.csv(file)
)
# put the data frames together and then compute the mean for each unique combination
# of A, B & C assuming that they could be in any order.
input <- do.call(rbind, input)
result <- lapply(split(input, list(input$A, input$B, input$C), drop = TRUE)
, function(sect){
sect$value[1L] <- mean(sect$value)
sect[1L, ]
}
)
# create output DF
result <- do.call(rbind, result)
result
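Once the files are row-bound into one data frame, the split/lapply step could also be written with aggregate(). This is a sketch on simulated data standing in for the combined input; mean can be replaced by any other summary function.

```r
set.seed(42)
# Hypothetical stand-in for the row-bound contents of all files
input <- data.frame(A = sample(1:2, 40, TRUE),
                    B = sample(1:2, 40, TRUE),
                    C = sample(1:2, 40, TRUE),
                    value = runif(40))

# Mean of value for every observed combination of A, B and C
result <- aggregate(value ~ A + B + C, data = input, FUN = mean)
result
```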

