Function for pasting corrected values inside existing dataframe - r

Does something like the 'paste_over' function below already exist within base R or one of the standard R packages?
paste_over <- function(original, corrected, key){
  corrected <- corrected[order(corrected[[key]]),]
  output <- original
  output[
    original[[key]] %in% corrected[[key]],
    names(corrected)
  ] <- corrected
  return(output)
}
An example:
D1 <- data.frame(
k = 1:5,
A = runif(5),
B = runif(5),
C = runif(5),
D = runif(5),
E = runif(5)
)
D2 <- data.frame(
k=c(4,1,3),
D=runif(3),
E=runif(3),
A=runif(3)
)
D2 <- D2[order(D2$k),]
D3 <- D1
D3[
D1$k %in% D2$k,
names(D2)
] <- D2
D4 <- paste_over(D1, D2, "k")
all(D4==D3)
In the example D2 contains some values that I want to paste over corresponding cells within D1. However D2 is not in the same order and does not have the same dimension as D1.
The motivation for this is that I was given a very large dataset, reported some errors within it, and received a subset of the original dataset with some corrected values. I would like to be able to 'paste over' the new, corrected values into the old dataset without changing the old dataset in terms of structure. (The rest of the code I've written assumes the old dataset's structure.)
Although the paste_over function seems to work I can't help but think this must have been tackled before, and so maybe there's already a well known function that's both faster and has error checking. If there is then please let me know what it is.
Thanks.

We can accomplish this using data.table as follows:
library(data.table)
setkeyv(setDT(D1), "k")
cols = c("D", "E", "A")
D1[D2, (cols) := D2[, cols]]
setDT() converts a data.frame to data.table by reference (without actually copying the data). We want D1 to be a data.table.
setkeyv() (like setkey(), but taking the column name as a string) sorts the data.table by the column specified (here k) and marks that column as sorted (by setting the attribute sorted) by reference. This allows us to perform joins using binary search.
x[i] in data.table performs a join. You can read more about it here. Briefly, for each row of column k in D2, it finds the matching row indices in D1 by matching on D1's key column (here k).
x[i, LHS := RHS] performs the join to find matching rows, and the LHS := RHS part adds/updates x with the columns specified in LHS with the values specified in RHS by reference. LHS should be a vector of column names or numbers, and RHS should be a list of values.
So, D1[D2, (cols) := D2[, cols]] finds matching rows in D1 for k=c(1,3,4) from D2 and updates the columns D,E,A specified in cols by the list (a data.frame is also a list) of corresponding columns from D2 on RHS.
D1 will now be modified in-place.
HTH
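One thing to keep in mind with the keyed approach is that setkey()/setkeyv() physically re-sorts D1 by k, which may matter if the original row order has to be preserved. A minimal sketch of the same update join using the on= argument instead of a key (this assumes data.table >= 1.9.6; the toy values below are made up for illustration and are not the D1/D2 from the question):

```r
library(data.table)

# Toy data (hypothetical values), deliberately not sorted by k
D1 <- data.table(k = c(3L, 1L, 2L), A = c(30, 10, 20), D = c(3, 1, 2))
D2 <- data.table(k = c(2L, 3L), D = c(200, 300), A = c(-20, -30))

# Update join: rows are matched on k; the i. prefix refers to D2's columns
D1[D2, on = "k", c("A", "D") := list(i.A, i.D)]

D1$k  # row order of D1 is untouched: 3 1 2
D1$A  # -30 10 -20
D1$D  # 300 1 200
```

Because no key is set, D1 keeps its original row order and only the matched cells change.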

You could use the replacement method for data frames in your function, perhaps like this. It does adequate checking for you. I chose to pass the logical row subset as an argument, but you can change that.
pasteOver <- function(original, corrected, key) {
"[<-.data.frame"(original, key, names(corrected), corrected)
}
(p1 <- pasteOver(D1, D2, D1$k %in% D2$k))
k A B C D E
1 1 0.18827167 0.006275082 0.3754535 0.8690591 0.73774065
2 2 0.54335829 0.122160101 0.6213813 0.9931259 0.38941407
3 3 0.62946977 0.323090601 0.4464805 0.5069766 0.41443988
4 4 0.66155954 0.201218532 0.1345516 0.2990733 0.05296677
5 5 0.09400961 0.087096652 0.2327039 0.7268058 0.63687025
p2 <- paste_over(D1, D2, "k")
identical(p1, p2)
# [1] TRUE
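Both functions above appear to rely on the corrected rows arriving in the same key order as the matching rows of the original. A rough sketch of a variant that looks rows up with match() instead, with a few basic checks, might look like the following. The name paste_over_match and the toy data are hypothetical, and it assumes the key is unique in both data frames:

```r
# Hypothetical helper (not part of base R): paste corrected cells into
# `original`, matching rows by `key` regardless of row order.
paste_over_match <- function(original, corrected, key) {
  stopifnot(key %in% names(original), key %in% names(corrected))
  stopifnot(all(names(corrected) %in% names(original)))
  stopifnot(!anyDuplicated(corrected[[key]]))
  # Row of `original` that each corrected row belongs to
  idx <- match(corrected[[key]], original[[key]])
  if (anyNA(idx)) stop("some keys in `corrected` are not in `original`")
  value_cols <- setdiff(names(corrected), key)
  output <- original
  output[idx, value_cols] <- corrected[, value_cols, drop = FALSE]
  output
}

D1 <- data.frame(k = 1:5, A = 1:5 / 10, B = 11:15)
D2 <- data.frame(k = c(4, 1), B = c(99, 77))  # deliberately unsorted
D3 <- paste_over_match(D1, D2, "k")
D3$B  # 77 12 13 99 15
```

The key column itself is never overwritten, so its type and order in the output are those of the original.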

Related

R - Append rows from dataframe to another one without duplicate on "primary keys columns"

I have two dataframes (A and B). B contains new values and A contains outdated values.
Each of these dataframes have one column representing the key and another one representing the value.
I want to add rows from B to A and then clean rows that contain duplicated keys from A (update A with the new values that are in B). Order doesn't really matter, I think it is easier in the other order : cleaning duplicates and then appending.
At the moment, I have done this script :
A <- bind_rows(B, A)
A <- A[!duplicated(A),]
The issue I have is that it doesn't clean rows because they are not real duplicates (value is different).
How could I handle this?
This is just a hunch because there's no example data provided, but I suspect a merge is a much safer approach than a row-bind:
Solution with data.table
library(data.table)
1 - Rename variables to prepare for a merge
setnames(A, old="value", new="value_A")
setnames(B, old="value", new="value_B")
2 - Merge, be sure to use the all arg
# convert to data.table so that := works on the merged result below
dt <- merge(as.data.table(A), as.data.table(B), by="key", all=TRUE)
3 - Use some rule for the update - for example: use value_B unless it's missing, in which case use value_A
dt[ , value := value_B]
dt[is.na(value), value := value_A]
Solution with Base R
names(A) <- c("key", "value_A")
names(B) <- c("key", "value_B")
df <- merge(A, B, by="key", all=TRUE)
df$value <- df$value_B
df[is.na(df$value), "value"] <- df[is.na(df$value), "value_A"]
Solution with dplyr/tidyverse
library(dplyr)
df <- full_join(A, B, by="key") %>%
mutate(value = ifelse(is.na(value_B), value_A, value_B))
Example Data
set.seed(1234)
A <- data.frame(
key = sample(1:50, size=20),
value = runif(20, 1, 10))
B <- data.frame(
key = sample(1:50, size=20),
value = runif(20, 1, 10))
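If the goal is simply "B wins on duplicate keys", the original bind-then-deduplicate idea can also be made to work by checking duplicates on the key column only rather than on whole rows. A small sketch in base R, using rbind() in place of bind_rows() and made-up toy values:

```r
# Toy data (hypothetical values): key 2 is outdated in A, key 4 is new in B
A <- data.frame(key = c(1, 2, 3), value = c(10, 20, 30))
B <- data.frame(key = c(2, 4), value = c(200, 400))

# Put B first so that its rows are the ones kept for duplicated keys
combined <- rbind(B, A)
updated <- combined[!duplicated(combined$key), ]

# Optional: restore key order and tidy the row names
updated <- updated[order(updated$key), ]
rownames(updated) <- NULL
updated$value  # 10 200 30 400
```

This relies on duplicated() flagging the second and later occurrences of a key, which is why B has to come first in the rbind().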

Looping by row across subset of columns

I have a data frame with column 1 being the gene and all other columns being gene expression data for that gene under different conditions. I want to go gene by gene and divide all the expression values by the median expression value for that gene. I have the medians in a data frame called s.med.df.
I’m trying to direct R to divide all the expression columns (2:n) but not the first column by the median value for each gene. I'm new to R, but the script I have so far is as follows:
Con1 <- c(5088.77, 274.62, 251.97, 122.21)
Con2 <- c(4382.59, 288.55, 208.12, 171.93)
Con3 <- c(4732.81, 417.43, 305.58, 132.93)
Solid.df <- data.frame(Gene = c("A", "B", "C", "D"), Con1=Con1, Con2=Con2, Con3=Con3)
Gene Con1 Con2 Con3
A 5088.77 4382.59 4732.81
B 274.62 288.55 417.43
C 251.97 208.12 305.58
D 122.21 171.93 132.93
n <- ncol(Solid.df)
genes = levels(s.med.df$Gene)
Solid.mt.df = Solid.df
for (i in 1:length(genes)) {
gene = genes[i]
Solid.mt.df[2:n][Solid.mt.df$Gene == gene] = Solid.mt.df[2:n][Solid.mt.df$Gene == gene] / s.med.df$Medians[i]
print(gene)
}
Thank you in advance
This can be achieved by direct division. Change s.med.df to a vector. See the following example.
d1 <- data.frame(ge=c("A", "B", "C"), e1=1:3, e2=7:9,
stringsAsFactors = FALSE)
m1 <- data.frame(md=4:6, stringsAsFactors = FALSE)
d1[,2:3]/unlist(m1)
# e1 e2
# 1 0.25 1.75
# 2 0.40 1.60
# 3 0.50 1.50
Can also bind the gene names with the results.
cbind(d1[,1], d1[,2:3]/unlist(m1))
For anything to do with applying a function over columns or rows, you're looking for apply:
median_centered <- t(apply(genes[,2:length(genes)], 1, function(x) x / median(x)))
genes2 <- cbind(genes[,1], median_centered)
This takes the data frame except for the first column, iterates over the 1st axis (rows), and applies x / median(x) to those rows. Since R broadcasts scalar operations to vectors, you'll get the desired result, but it will be transposed, so calling t() on it turns it back into the original format. Then we can cbind it back with the gene names.
As @VenYao pointed out, you can use direct division if you turn your medians into a vector. It would be helpful to show the structure of your s.med.df.
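Another way to make the row-wise division explicit is sweep(), which applies a vector of statistics along a chosen margin of a matrix. The sketch below uses a small made-up data frame and computes the medians on the fly (the vector meds is a stand-in for the medians stored in s.med.df, whose structure is not shown in the question):

```r
# Toy expression data (hypothetical values)
Solid.df <- data.frame(Gene = c("A", "B"),
                       Con1 = c(2, 10), Con2 = c(4, 20), Con3 = c(8, 40))

mat  <- as.matrix(Solid.df[-1])    # expression columns only
meds <- apply(mat, 1, median)      # one median per gene (row): 4 and 20

# Divide every row (MARGIN = 1) by its own median
scaled <- sweep(mat, 1, meds, "/")

# Re-attach the gene names
result <- cbind(Solid.df["Gene"], scaled)
result$Con1  # 0.5 0.5
result$Con3  # 2 2
```

No transpose is needed here because sweep() keeps the original row and column layout.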
This can be achieved using data.table pretty easily:
cbind your dataframes into a data.table:
library(data.table)
combined <- data.table(cbind(Solid.df, s.med.df))
combined[, med.con1 := Con1/median]
# assume median is the column in s.med.df that stores median values.
# then you can repeat that for all three conditions:
combined[, med.con2 := Con2/median]
combined[, med.con3 := Con3/median]

assigning a subset of data.table rows and columns by join

I'm trying to do something similar but different enough from what's described here:
Update subset of data.table based on join
Specifically, I'd like to assign to matching key values (person_id is a key in both tables) column values from table control. ci is the column index. The statement below says 'with=F' was not used. When I delete those parts, it also doesn't work as expected. Any suggestions?
To rephrase: I'd like to set the subset of flatData that corresponds to control FROM control.
flatData[J(eval(control$person_id)), ci, with=F] = control[, ci, with=F]
To give a reproducible example using classic R:
x = data.frame(a = 1:3, b = 1:3, key = c('a', 'b', 'c'))
y = data.frame(a = c(2, 5), b = c(11, 2), key = c('a', 'b'))
colidx = match(c('a', 'b'), colnames(y))
x[x$key %in% y$key, colidx] = y[, colidx]
As an aside, someone please explain how to easily assign SETS of columns without using indices! Indices and data.table are a marriage made in hell.
You can use the := operator along with the join simultaneously as follows:
First prepare data:
require(data.table) ## >= 1.9.0
setDT(x) ## converts DF to DT by reference
setDT(y)
setkey(x, key) ## set key column
setkey(y, key)
Now the one-liner:
x[y, c("a", "b") := list(i.a, i.b)]
:= modifies by reference (in-place). The rows to modify are provided by the indices computed from the join in i.
i.a and i.b are the column names data.table internally generates for easy access to i's columns when both x and i have identical column names, when performing a join of the form x[i].
HTH
PS: In your example y's columns a and b are of type numeric and x's are of type integer, and therefore you'll get a warning when run on your data that the types didn't match and a coercion had to take place.

Backfilling from another data.frame

I frequently have situations where I have to "fill in" information from another data source.
For example:
x <- data.frame(c1=letters[1:26],c2=letters[26:1])
x[x$c1 == "m","c2"] <- NA
x[x$c1 == "a","c2"] <- NA
c1 c2
1 a <NA>
2 b y
3 c x
4 d w
5 e v
6 f u
7 g t
8 h s
9 i r
10 j q
11 k p
12 l o
13 m <NA>
...
Now, with that missing variable, I'd like to check and fill it in using a separate data.frame, let's call it y
y <- data.frame(c1=c("m","a"),c2=c("n","z"))
So, what I would like to happen is for x to be filled in with y. (row 13 should be c("m","n"), row 1 should be c("a","z"))
The method I use to deal with this currently seems convoluted and indirect. What would your approach be? Keeping in mind that my data is not necessarily in a nice order like this one is but the order should be maintained in x. My preference would be for a solution that does not rely on anything but base R.
This will be a far simpler proposition if you deal with character variables, not factors.
I will present a simple data.table solution (for elegant and easy to use syntax amongst many other advantages)
x <- data.frame(c1=letters[1:26],c2=letters[26:1], stringsAsFactors =FALSE)
x[x$c1 == "m","c2"] <- NA
y <- data.frame(c1="m",c2="n", stringsAsFactors = FALSE)
library(data.table)
X <- as.data.table(x)
Y <- as.data.table(y)
For simplicity of merging, I will create a column indicating whether c2 is missing
X[,missing_c2 := is.na(c2)]
# a similar column in Y
Y[,missing_c2 := TRUE]
# key on c1 (the lookup column) plus the missing flag
setkey(X, c1, missing_c2)
setkey(Y, c1, missing_c2)
# merge and replace (by reference) those values in X with the values in `Y`
X[Y, c2 := i.c2]
The i.c2 means that we use the values of c2 from the i argument to [
This approach assumes that not all values where c1 = 'm' will be missing in X, and that you want to replace only the missing values of c2 where c1 = 'm' with 'n', not all of them
A base solution
Here is a base solution -- I use merge so that the y data.frame can contain more missing replacements than actually needed (i.e. it could have values for all c1 values, although only c1 = 'm' is required).
# add a second missing-value row to make the solution more generalizable
x <- rbind(x, data.frame(c1 = 'm',c2 = NA, stringsAsFactors = FALSE) )
missing <- x[is.na(x$c2),]
merged <- merge(missing, y, by = 'c1')
x[is.na(x$c2),] <- with(merged, data.frame(c1 = c1, c2 = c2.y, stringsAsFactors = FALSE))
If you use factors you will come up against a wall of pain ensuring that the levels correspond.
In base R, I believe this will work for you:
nas <- is.na(x$c2)
# look up each missing row's c1 in y, so the row order of y does not matter
x[nas, "c2"] <- y$c2[match(x$c1[nas], y$c1)]

How do you delete a column by name in data.table?

To get rid of a column named "foo" in a data.frame, I can do:
df <- df[-grep('foo', colnames(df))]
However, once df is converted to a data.table object, there is no way to just remove a column.
Example:
df <- data.frame(id = 1:100, foo = rnorm(100))
df2 <- df[-grep('foo', colnames(df))] # works
df3 <- data.table(df)
df3[-grep('foo', colnames(df3))]
But once it is converted to a data.table object, this no longer works.
Any of the following will remove column foo from the data.table df3:
# Method 1 (and preferred as it takes 0.00s even on a 20GB data.table)
df3[,foo:=NULL]
df3[, c("foo","bar"):=NULL] # remove two columns
myVar = "foo"
df3[, (myVar):=NULL] # lookup myVar contents
# Method 2a -- A safe idiom for excluding (possibly multiple)
# columns matching a regex
df3[, grep("^foo$", colnames(df3)):=NULL]
# Method 2b -- An alternative to 2a, also "safe" in the sense described below
df3[, which(grepl("^foo$", colnames(df3))):=NULL]
data.table also supports the following syntax:
## Method 3 (you could then assign the result back to df3)
df3[, !"foo"]
though if you were actually wanting to remove column "foo" from df3 (as opposed to just printing a view of df3 minus column "foo") you'd really want to use Method 1 instead.
(Do note that if you use a method relying on grep() or grepl(), you need to set pattern="^foo$" rather than "foo", if you don't want columns with names like "fool" and "buffoon" (i.e. those containing foo as a substring) to also be matched and removed.)
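The difference between the two patterns is easy to check in base R on a plain character vector of column names. The names below are made up for illustration:

```r
# Hypothetical column names
cols <- c("id", "foo", "fool", "buffoon")

# Unanchored pattern: matches every name containing "foo"
loose <- grep("foo", cols, value = TRUE)
loose   # "foo" "fool" "buffoon"

# Anchored pattern: matches only the exact name
strict <- grep("^foo$", cols, value = TRUE)
strict  # "foo"

# Negating an empty grep() result selects nothing rather than everything
none <- cols[-grep("bar", cols)]
length(none)  # 0
```

The last line shows why negative indexing with grep() is risky when the pattern might not match: a no-match returns integer(0), and negating that drops everything.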
Less safe options, fine for interactive use:
The next two idioms will also work -- if df3 contains a column matching "foo" -- but will fail in a probably-unexpected way if it does not. If, for instance, you use any of them to search for the non-existent column "bar", you'll end up with a zero-row data.table.
As a consequence, they are really best suited for interactive use where one might, e.g., want to display a data.table minus any columns with names containing the substring "foo". For programming purposes (or if you are wanting to actually remove the column(s) from df3 rather than from a copy of it), Methods 1, 2a, and 2b are really the best options.
# Method 4:
df3[, .SD, .SDcols = !patterns("^foo$")]
Lastly there are approaches using with=FALSE, though data.table is gradually moving away from using this argument so it's now discouraged where you can avoid it; showing here so you know the option exists in case you really do need it:
# Method 5a (like Method 3)
df3[, !"foo", with=FALSE]
# Method 5b (like Method 4)
df3[, !grep("^foo$", names(df3)), with=FALSE]
# Method 5c (another like Method 4)
df3[, !grepl("^foo$", names(df3)), with=FALSE]
You can also use set for this, which avoids the overhead of [.data.table in loops:
dt <- data.table( a=letters, b=LETTERS, c=seq(26), d=letters, e=letters )
set( dt, j=c(1L,3L,5L), value=NULL )
> dt[1:5]
b d
1: A a
2: B b
3: C c
4: D d
5: E e
If you want to do it by column name, which(colnames(dt) %in% c("a","c","e")) should work for j.
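As far as I know, set() also accepts column names directly in j, so the which() lookup is optional. A short sketch on the same kind of table as above:

```r
library(data.table)

dt <- data.table(a = letters, b = LETTERS, c = seq(26), d = letters, e = letters)

# Remove columns by name; j may be a character vector of existing columns
set(dt, j = c("a", "c", "e"), value = NULL)

names(dt)  # "b" "d"
```

Like :=, this modifies dt by reference, so there is no need to assign the result back.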
I simply do it in the data frame kind of way:
DT$col = NULL
Works fast and as far as I could see doesn't cause any problems.
UPDATE: not the best method if your DT is very large, as using the $<- operator will lead to object copying. So better use:
DT[, col:=NULL]
Very simple option in case you have many individual columns to delete in a data table and you want to avoid typing in all column names #careadviced
dt <- dt[, -c(1,4,6,17,83,104)]
This will remove columns based on column number instead.
It's obviously not as efficient because it bypasses data.table advantages but if you're working with less than say 500,000 rows it works fine
Suppose your dt has columns col1, col2, col3, col4, col5, coln.
To delete a subset of them:
vx <- as.character(bquote(c(col1, col2, col3, coln)))[-1]
DT[, paste0(vx):=NULL]
Here is a way when you want to set a # of columns to NULL given their column names
a function for your usage :)
deleteColsFromDataTable <- function (train, toDeleteColNames) {
for (myNm in toDeleteColNames)
train <- train [,(myNm):=NULL]
return (train)
}
DT[,c:=NULL] # remove column c
For a data.table, assigning the column to NULL removes it:
DT[,c("col1", "col2", "col3", "col4")] <- NULL
   ^
   |---- Notice the extra comma if DT is a data.table
... which is the equivalent of:
DT$col1 <- NULL
DT$col2 <- NULL
DT$col3 <- NULL
DT$col4 <- NULL
The equivalent for a data.frame is:
DF[c("col1", "col2", "col3", "col4")] <- NULL
   ^
   |---- Notice the missing comma if DF is a data.frame
Q. Why is there a comma in the version for data.table, and no comma in the version for data.frame?
A. As data.frames are stored as a list of columns, you can skip the comma. You could also add it in, however then you will need to assign them to a list of NULLs, DF[, c("col1", "col2", "col3")] <- list(NULL).
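For the data.frame case, the no-comma form can be checked quickly on a toy example (the column names and values here are made up):

```r
# Toy data frame (hypothetical)
DF <- data.frame(col1 = 1:2, col2 = 3:4, col3 = 5:6, col4 = 7:8)

# List-style assignment of NULL removes the named columns
drop <- c("col1", "col3")
DF[drop] <- NULL

names(DF)  # "col2" "col4"
```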

Resources