I am trying to update an old data frame with data from a new data frame.
I found the approach below. It works for some of the fields, but not all, and I am not sure how to change it, as that is beyond my skill set. I tried removing the is.na(x) part of the ifelse() call, and that did not work.
df_old <- data.frame(
bb = as.character(c("A", "A", "A", "B", "B", "B")),
y = as.character(c("i", "ii", "ii", "i", "iii", "i")),
z = 1:6,
aa = c(NA, NA, 123, NA, NA, 12))
df_new <- data.frame(
bb = as.character(c("A", "A", "A", "B", "A", "A")),
z = 1:6,
aa = c(NA, NA, 123, 1234, NA, 12))
cols <- names(df_new)[names(df_new) != "z"]
df_old[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[df_new$z == df_old$z], x), df_old[,cols], df_new[,cols])
The code also changes my bb variable from a character vector to a numeric. Do I need another call to mapply focusing on specific variable bb?
To update the aa and bb columns you can approach this using a join via merge(). This assumes column z is the index for these data frames.
# join on `z` column
df_final <- merge(df_old, df_new, by = "z")
# replace NAs with new values for column `aa` from `df_new`
df_final$aa <- ifelse(is.na(df_final$aa.x), df_final$aa.y, df_final$aa.x)
# choose new values for column `bb` from `df_new`
df_final$bb <- df_final$bb.y
df_final <- df_final[, c("bb", "z", "y", "aa")]
df_final
bb z y aa
1 A 1 i NA
2 A 2 ii NA
3 A 3 ii 123
4 B 4 i 1234
5 A 5 iii NA
6 A 6 i 12
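If you would rather update df_old in place than build a merged copy, the same logic can be sketched with match(). This is only a sketch under a few assumptions: z is a unique key in df_new, the columns are plain character/numeric vectors (hence stringsAsFactors = FALSE), and the helper names idx, has_new and fill are made up for illustration. Updating one column at a time keeps each column's type, so bb stays character.

```r
df_old <- data.frame(
  bb = c("A", "A", "A", "B", "B", "B"),
  y  = c("i", "ii", "ii", "i", "iii", "i"),
  z  = 1:6,
  aa = c(NA, NA, 123, NA, NA, 12),
  stringsAsFactors = FALSE
)
df_new <- data.frame(
  bb = c("A", "A", "A", "B", "A", "A"),
  z  = 1:6,
  aa = c(NA, NA, 123, 1234, NA, 12),
  stringsAsFactors = FALSE
)

# position of each old key in the new data (NA if the key is missing there)
idx <- match(df_old$z, df_new$z)
has_new <- !is.na(idx)

# bb: always take the new value where the key was found
df_old$bb[has_new] <- df_new$bb[idx[has_new]]

# aa: only fill in values that are currently NA
fill <- has_new & is.na(df_old$aa)
df_old$aa[fill] <- df_new$aa[idx[fill]]

df_old
```

Because each column is assigned separately, no second mapply() call is needed for bb.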
Let's assume there are two columns in each of two huge data frames (of different lengths), like:
df1 df2
A 1 C X
A 1 D X
B 4 C X
A 1 F X
B 4 A X
B 4 B X
C 7 B X
Each time there is a match between the first columns, X should be replaced with the value from column 2 of df1. If the first column of df2 contains elements that are not in the first column of df1 (F, D), X should be replaced with 0.
Because the data frames are huge, a loop inside a loop would not be practical.
The solution should look like this:
df1 df2
A 1 C 7
A 1 D 0
B 4 C 7
A 1 F 0
B 4 A 1
B 4 B 4
C 7 B 4
Thank You in advance
As there are duplicate rows in 'df1', we can get the unique rows
df3 <- unique(df1)
Then, use match to get the index
i1 <- match(df2$Col1, df3$Col1)
and based on the index, assign
df2$Col2 <- df3$Col2[i1]
If there are no matches, it would be NA, which can be changed to 0
df2$Col2[is.na(df2$Col2)] <- 0
df2
# Col1 Col2
#1 C 7
#2 D 0
#3 C 7
#4 F 0
#5 A 1
#6 B 4
#7 B 4
Or this can be done with data.table: remove the placeholder 'Col2' from 'df2', then join on 'Col1' and assign 'Col2' from 'df3'. Rows of 'df2' without a match are left as NA and can be set to 0 as above.
library(data.table)
setDT(df2)[, Col2 := NULL][df3, Col2 := Col2, on = .(Col1)]
data
df1 <- structure(list(Col1 = c("A", "A", "B", "A", "B", "B", "C"), Col2 = c(1,
1, 4, 1, 4, 4, 7)), class = "data.frame", row.names = c(NA, -7L
))
df2 <- structure(list(Col1 = c("C", "D", "C", "F", "A", "B", "B"), Col2 = c("X",
"X", "X", "X", "X", "X", "X")), class = "data.frame", row.names = c(NA,
-7L))
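The match steps can also be wrapped in a small helper. This is just a sketch: lookup_or_default is a made-up name, not an existing function. Since match() returns the position of the first match, the unique() step is optional as long as duplicated keys in df1 always carry the same value, which is the case in this example.

```r
df1 <- data.frame(
  Col1 = c("A", "A", "B", "A", "B", "B", "C"),
  Col2 = c(1, 1, 4, 1, 4, 4, 7),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Col1 = c("C", "D", "C", "F", "A", "B", "B"),
  Col2 = "X",
  stringsAsFactors = FALSE
)

# hypothetical helper: look up `keys` in `table_keys` and return the
# corresponding `table_values`, using `default` where there is no match
# (note: NA values already present in `table_values` also become `default`)
lookup_or_default <- function(keys, table_keys, table_values, default = 0) {
  out <- table_values[match(keys, table_keys)]
  out[is.na(out)] <- default
  out
}

df2$Col2 <- lookup_or_default(df2$Col1, df1$Col1, df1$Col2)
df2
```

This stays vectorised, so it avoids the nested loop even for large inputs.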
Comparing "x1", "x2", and "x3" to "target", how do I return the first index of the column that matches "target"? An NA can result for no match.
pop <- c("A", "B", "C", "D")
target <- pop
x1 <- sample(pop)
x2 <- sample(pop)
x3 <- sample(pop)
df <- data.frame(target,x1,x2,x3)
> df
target x1 x2 x3
1 A B B D
2 B D C C
3 C C A A
4 D A D B
I have tried using something along the lines of:
min(which(df[3, 1] == df[3, 2:ncol(df)]))
...(row 3 being used as an example), but I don't know how to gracefully handle cases where there is no match, which is probably why I am having trouble using this in a function with apply(). The goal is either a new column on df or a vector of the returned values.
Thanks!
Here's a solution using match -
> df
target x1 x2 x3
1 A C A C
2 B A B B
3 C D D D
4 D B C A
apply(df, 1, function(x) match(TRUE, x[-1] == x[1]))
[1] 2 2 NA NA
Data -
df <- structure(list(target = c("A", "B", "C", "D"), x1 = c("C", "A",
"D", "B"), x2 = c("A", "B", "D", "C"), x3 = c("C", "B", "D",
"A")), .Names = c("target", "x1", "x2", "x3"), row.names = c(NA,
-4L), class = "data.frame")
There are many ways to do this. Loop through the columns 2:4, compare with the target and get the index of first match with which
sapply(df[-1], function(x) which(x == df$target)[1])
# x1 x2 x3
#  1  3 NA
If it is for comparing the rows
m1 <- df$target == df[-1]
max.col(m1, 'first') * NA^!rowSums(m1)
Or
apply(m1, 1, function(x) which(x)[1])
data
df <- data.frame(target,x1,x2,x3, stringsAsFactors = FALSE)
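The NA^!rowSums(m1) part of the row-wise one-liner is compact but cryptic: !rowSums(m1) is TRUE for rows with no match, NA^TRUE is NA and NA^FALSE is 1, so multiplying by it blanks out rows where max.col() would otherwise report column 1. A more explicit sketch of the same idea, using the fixed example data from the other answer (the names hits, first_hit and no_hit are made up here):

```r
df <- data.frame(
  target = c("A", "B", "C", "D"),
  x1 = c("C", "A", "D", "B"),
  x2 = c("A", "B", "D", "C"),
  x3 = c("C", "B", "D", "A"),
  stringsAsFactors = FALSE
)

# logical matrix: TRUE where a comparison column equals the target in that row
hits <- as.matrix(df[-1]) == df$target

# index of the first TRUE in each row (multiplying by 1 makes the matrix numeric)
first_hit <- max.col(hits * 1, ties.method = "first")

# rows without any TRUE would report column 1, so overwrite them with NA
no_hit <- rowSums(hits) == 0
first_hit[no_hit] <- NA_integer_

first_hit
```

The indices are relative to the comparison columns, so 1 means x1, 2 means x2, and so on.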
I am trying to add a counter column to my dataframe based on the combination of two categorical values. e.g:
dat <- data.frame(cat1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
cat2 = c("x", "x", "x", "y", "y", "j", "j", "k", "l"),
Result = c(1, 1, 1, 2, 2, 1, 1, 2, 3))
I have used this:
dat$Result <- ave(dat$cat1, dat$cat2, FUN=function(x) match(x,sort(unique(x))))
but I get errors. I have checked similar suggestions in other threads, but the answers only apply to numeric columns. Could anybody please offer a suggestion? Thank you.
We can use
with(dat, as.numeric(ave(as.character(cat2), cat1,
FUN = function(x) match(x, unique(x)))))
If 'cat2' is a factor and its levels are already in the same order, coercing it to numeric also works
with(dat, ave(as.numeric(cat2), cat1, FUN = function(x) match(x, unique(x))))
Update
With the new dataset,
with(dat, as.numeric(ave(as.character(cat2), cat1, FUN =
function(x) inverse.rle(within.list(rle(x), values <- seq_along(values))))))
#[1] 1 1 1 2 2 1 1 2 3 4
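If the columns are plain character vectors (the default from R 4.0.0 on), one way to sidestep the type coercion inside ave() is to pass row indices instead of the values. This is a sketch on the data from the question, not the only way to do it:

```r
dat <- data.frame(
  cat1 = c("a", "a", "a", "a", "a", "b", "b", "b", "b"),
  cat2 = c("x", "x", "x", "y", "y", "j", "j", "k", "l"),
  stringsAsFactors = FALSE
)

# ave() splits the row indices by cat1; within each group, number the
# distinct cat2 values in order of first appearance
dat$Result <- ave(
  seq_len(nrow(dat)), dat$cat1,
  FUN = function(i) match(dat$cat2[i], unique(dat$cat2[i]))
)

dat
```

Note that match(x, unique(x)) gives a value that reappears later in the group its original number, whereas the rle-based version starts a new number for every run.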
You can use rleid from data.table,
library(data.table)
setDT(dat)[, Result := rleid(cat2), by = cat1]
dat
# cat1 cat2 Result
#1: a x 1
#2: a x 1
#3: a x 1
#4: a y 2
#5: a y 2
#6: b j 1
#7: b j 1
#8: b k 2
#9: b l 3
Consider the following dataframe:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
How can I remove rows that duplicate the same pair in reverse order (e.g. A-B and B-A), so that the output looks like:
# V1 V2 n
#1 A B 1
#2 A C 3
#3 B C 2
I tried unique() and duplicated() to no avail.
Not sure if this is the simplest way of doing it (transposing can be computationally expensive) but this would work with your data frame:
df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
V2 = c("B", "C", "A", "C", "A", "B"),
n = c(1, 3, 1, 2, 3, 2))
First, sort the data frame row-wise, so your value-pairs become true duplicates.
df <- data.frame(t(apply(df, 1, sort)))
Then you can just apply the unique function.
df <- unique(df)
If your column names and order are important, you'll have to re-establish those.
names(df) <- c("n", "V1", "V2")
df <- df[, c("V1", "V2", "n")]
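Sorting whole rows also mixes n in with the labels and turns every column into text. A sketch that only orders the V1/V2 pair, assuming both are character columns (hence stringsAsFactors = FALSE), is to build an order-independent key with pmin()/pmax() and keep the first row for each key. The names lo, hi and res are just for illustration.

```r
df <- data.frame(
  V1 = c("A", "A", "B", "B", "C", "C"),
  V2 = c("B", "C", "A", "C", "A", "B"),
  n  = c(1, 3, 1, 2, 3, 2),
  stringsAsFactors = FALSE
)

# order-independent key: smaller label first, larger label second
lo <- pmin(df$V1, df$V2)
hi <- pmax(df$V1, df$V2)

# keep only the first row seen for each unordered pair
res <- df[!duplicated(data.frame(lo, hi)), ]
res
```

This keeps the original column names, order and types. If the two mirrored rows could carry different n values, you would have to decide which one to keep.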
Another option would be to reshape (xtabs(n~..)) the dataset ('df') to wide format, set the lower triangular matrix to 0, and remove the rows with "Freq" equal to 0.
m1 <- xtabs(n~V1+V2, df)
m1[lower.tri(m1)] <- 0
subset(as.data.frame(m1), Freq!=0)
# V1 V2 Freq
#4 A B 1
#7 A C 3
#8 B C 2
I'd be very grateful if you could help me with the following, as after a few tests I still haven't been able to get the right outcome.
I've got this data:
dd_1 <- data.frame(ID = c("1","2", "3", "4", "5"),
Class_a = c("a",NA, "a", NA, NA),
Class_b = c(NA, "b", "b", "b", "b"))
And I'd like to produce a new column 'CLASS':
dd_2 <- data.frame(ID = c("1","2", "3", "4", "5"),
Class_a = c("a",NA, "a", NA, NA),
Class_b = c(NA, "b", "b", "b", "b"),
CLASS = c("a", "b", "a-b", "b", "b"))
Thanks a lot!
Here it is:
tmp <- paste(dd_1$Class_a, dd_1$Class_b, sep='-')
tmp <- gsub('NA-|-NA', '', tmp)
(dd_2 <- cbind(dd_1, tmp))
First we concatenate (join as strings) the 2 columns. paste treats NAs as ordinary strings, i.e. "NA", so we get either a-NA, NA-b, or a-b. Then we substitute NA- or -NA with an empty string.
Which results in:
## ID Class_a Class_b tmp
## 1 1 a <NA> a
## 2 2 <NA> b b
## 3 3 a b a-b
## 4 4 <NA> b b
## 5 5 <NA> b b
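One caveat with the unanchored pattern: it also removes "NA-" or "-NA" inside a longer value. A sketch with anchors, using made-up class labels such as "DNA" purely to show the difference:

```r
# hypothetical labels, chosen so that one of them contains "NA"
class_a <- c("a", NA, "a", "DNA")
class_b <- c(NA, "b", "b", "x")

tmp <- paste(class_a, class_b, sep = "-")

# unanchored: also eats the "NA-" inside "DNA-x"
loose <- gsub("NA-|-NA", "", tmp)

# anchored: only strips a leading "NA-" or a trailing "-NA"
strict <- gsub("^NA-|-NA$", "", tmp)

loose
strict
```

Even the anchored version cannot tell a missing value from a class that is literally the string "NA", so the ifelse() and na.omit() approaches are safer if that can occur.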
Another option:
dd_1$CLASS <- with(dd_1, ifelse(is.na(Class_a), as.character(Class_b),
ifelse(is.na(Class_b), as.character(Class_a),
paste(Class_a, Class_b, sep="-"))))
This way you would check if any of the classes is NA and return the other, or, if none is NA, return both separated by "-".
Here's a short solution with apply:
dd_2 <- cbind(dd_1, CLASS = apply(dd_1[2:3], 1,
function(x) paste(na.omit(x), collapse = "-")))
The result
ID Class_a Class_b CLASS
1 1 a <NA> a
2 2 <NA> b b
3 3 a b a-b
4 4 <NA> b b
5 5 <NA> b b