I have a dataframe as follows:
df1
ColA ColB ColC ColD ColE COlF ColG Recs
1 A-1 A - 3 B B NA C
1 B-1 C R D E NA B
1 NA A B A B
How do I determine whether the value in column Recs is found in any other column of its respective row?
I tried the code below, but it doesn't work because there are duplicates in my real dataset:
df1$Exist <- apply(df1, 1, FUN = function(x)
c("No", "Yes")[(anyDuplicated(x[!is.na(x) & x != "" ])!=0) +1])
There are also blanks, NA's, and character values that have spaces and dashes.
Final output should be:
ColA ColB ColC ColD ColE COlF ColG Recs Exist?
1 A-1 A - 3 B B NA C No
1 B-1 C R D E NA B No
1 NA A B A B Yes
Thanks
For efficiency, you could use data.table here.
library(data.table)
setDT(df)[, Exist := Recs %chin% unlist(.SD), .SDcols=-"Recs", by=1:nrow(df)]
which gives
ColA ColB ColC ColD ColE COlF ColG Recs Exist
1: 1 A-1 A-3 B B NA NA C FALSE
2: 1 B-1 C R D E NA B FALSE
3: 1 NA A B A NA B TRUE
Original data:
df <-structure(list(ColA = c(1L, 1L, 1L), ColB = c("A-1", "", NA),
ColC = c("A-3", "B-1", "A"), ColD = c("B", "C R", "B"), ColE = c("B",
"D", "A"), COlF = c(NA, "E", ""), ColG = c(NA, NA, NA), Recs = c("C",
"B", "B")), .Names = c("ColA", "ColB", "ColC", "ColD", "ColE",
"COlF", "ColG", "Recs"), row.names = c(NA, -3L), class = "data.frame")
If I understood you correctly, this should work:
# Compute column index of reference variable
col_ind <- which(colnames(df1) == "Recs")
# Compute boolean vector of presence
present_bool <- apply(df1, 1, function(row) {
any(row[col_ind] == row[-col_ind], na.rm = TRUE)
})
# Create the desired column
df1$Exist <- ifelse(present_bool, "Yes", "No")
exist <- rep(NA, nrow(df1))
for (i in 1:nrow(df1)) {
exist[i] <- df1$Recs[i] %in% df1[i, 1:7]
}
df1 <- cbind(df1, exist)
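The row-wise checks above can also be written column by column, which avoids both the explicit loop and the character-matrix coercion done by apply. This is only a sketch: the sample data is rebuilt by hand from the question, and other_cols, is_hit, and found are illustrative names that do not appear in the answers.

```r
# Sample data mirroring the question (blank cells as "", missing cells as NA)
df1 <- data.frame(
  ColA = c(1L, 1L, 1L),
  ColB = c("A-1", "", NA),
  ColC = c("A-3", "B-1", "A"),
  ColD = c("B", "C R", "B"),
  ColE = c("B", "D", "A"),
  COlF = c(NA, "E", ""),
  ColG = c(NA, NA, NA),
  Recs = c("C", "B", "B"),
  stringsAsFactors = FALSE
)

# Every column except the reference column
other_cols <- setdiff(names(df1), "Recs")

# Exact, NA-safe comparison of one column against Recs
is_hit <- function(col) {
  eq <- as.character(col) == df1$Recs
  !is.na(eq) & eq
}

# A row counts as found if any of the other columns equals Recs in that row
found <- Reduce(`|`, lapply(df1[other_cols], is_hit))
df1$Exist <- ifelse(found, "Yes", "No")
df1$Exist
```

Because the comparison uses ==, "B" is not considered found in "B-1", and blanks or NAs never count as a match.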
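This should be another way of obtaining the desired result:
# TRUE if the value in column 8 (Recs) of row x exactly equals any of columns 1 to 7.
# An exact comparison is used rather than grepl(), whose substring matching
# would wrongly report "B" as found in "B-1".
f.checkExist <- function(x) {
df[x, 8] %in% unlist(df[x, 1:7])
}
df$exists <- vapply(seq_len(nrow(df)), f.checkExist, logical(1))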
Given two dataframes df1 and df2 as follows:
df1:
df1 <- structure(list(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L), class = "data.frame", row.names = c(NA,
-1L))
Out:
A B C D G
1 1 2 3 4 5
df2:
df2 <- structure(list(Col1 = c("A", "B", "C", "D", "X"), Col2 = c("E",
"Q", "R", "Z", "Y")), class = "data.frame", row.names = c(NA,
-5L))
Out:
Col1 Col2
1 A E
2 B Q
3 C R
4 D Z
5 X Y
I need to rename the columns of df1 using df2, except column G, since it is not in df2's Col1.
I use df2$Col2[match(names(df1), df2$Col1)] based on the answer from here, but it returns "E" "Q" "R" "Z" NA: as you can see, column G becomes NA. I would like it to keep its original name.
The expected result:
E Q R Z G
1 1 2 3 4 5
How could I deal with this issue? Thanks.
Using na.omit (it's a little bit messy):
colnames(df1)[na.omit(match(names(df1), df2$Col1))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
df1
E Q R Z G
1 1 2 3 4 5
I was able to reproduce your error with
df2 <- data.frame(
Col1 = c("H","I","K","A","B","C","D"),
Col2 = c("a1","a2","a3","E","Q","R","Z")
)
The problem is the order of df2$Col1 and names(df1) in match.
na.omit(match(names(df1), df2$Col1))
gives [1] 4 5 6 7, which are positions in df2, not in df1; df1 has only 5 columns, so indices such as 6 and 7 do not exist there.
To index df1, swap the arguments of match: na.omit(match(df2$Col1, names(df1))) gives [1] 1 2 3 4.
colnames(df1)[na.omit(match(df2$Col1, names(df1)))] <- df2$Col2[na.omit(match(names(df1), df2$Col1))]
This works.
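The same idea can be written with a single match call, so that the positions in df1 and the positions in df2 always come from the same lookup. This is a minimal sketch using the reordered df2 from this answer; idx and matched are illustrative names.

```r
df1 <- data.frame(A = 1L, B = 2L, C = 3L, D = 4L, G = 5L)
df2 <- data.frame(
  Col1 = c("H", "I", "K", "A", "B", "C", "D"),
  Col2 = c("a1", "a2", "a3", "E", "Q", "R", "Z"),
  stringsAsFactors = FALSE
)

# For each column of df1, the row of df2 holding its new name (NA if none)
idx <- match(names(df1), df2$Col1)
matched <- !is.na(idx)

# Rename only the matched columns; unmatched ones (here G) keep their name
names(df1)[matched] <- df2$Col2[idx[matched]]
names(df1)
```

Since both sides of the assignment are derived from idx, the result does not depend on the row order of df2.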
A solution using the rename_with function from the dplyr package.
library(dplyr)
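# Keep only the mappings whose old name is actually a column of df1
df3 <- df2 %>%
filter(Col1 %in% names(df1))
# Look up each selected column name in Col1 and return the matching Col2
df4 <- df1 %>%
rename_with(.cols = all_of(df3$Col1), .fn = function(x) df3$Col2[match(x, df3$Col1)])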
df4
# E Q R Z G
# 1 1 2 3 4 5
I have a dataset with one ID column, 12 information columns (strings) and n rows. It looks like this:
ID Col1 Col2 Col3 Col4 Col5 ...
01 a b c d a
02 a a a a a
03 b b b b b
...
I need to go row by row and check whether that row (considering all of its columns) is equal to any other row in the dataset. My output needs to be two new columns: one indicating whether that particular row is equal to any other row, and a second indicating which row it is equal to (when the previous column is TRUE).
I appreciate any suggestions.
Assuming DF as in the Note at the end, sort it and create a column dup indicating whether a prior duplicate row exists. Then set wx to the row number, in the original data frame, of that duplicate. Finally, sort back to the original order.
We have assumed that duplicate means that the columns other than the ID are the same, but that is readily changed if need be. We have also assumed that we should mark the second and subsequent rows among duplicates, whereas the first is not marked because up to that point it has no duplicate.
The question does not address the situation of more than 2 identical rows, but if that situation exists then each duplicate will point to the nearest prior row of which it is a duplicate.
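# Order rows by all non-ID columns so that identical rows become adjacent
o <- do.call("order", DF[-1])
DFo <- DF[o, ]
# Flag rows whose non-ID columns repeat an earlier row (wx starts as FALSE, shown as 0)
DFo$wx <- DFo$dup <- duplicated(DFo[-1])
# For each flagged row, record the original row number of the row just before it
DFo$wx[DFo$dup] <- as.numeric(rownames(DFo))[which(DFo$dup) - 1]
DFo[order(o), ] # back to original order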
giving:
ID Col1 Col2 Col3 Col4 Col5 dup wx
1 1 a b c d a FALSE 0
2 2 a a a a a FALSE 0
3 3 b b b b b FALSE 0
4 1 a b c d a TRUE 1
Note
Lines <- "ID Col1 Col2 Col3 Col4 Col5
01 a b c d a
02 a a a a a
03 b b b b b"
DF <- read.table(text = Lines, header = TRUE)
DF <- DF[c(1:3, 1), ]
rownames(DF) <- NULL
giving:
> DF
ID Col1 Col2 Col3 Col4 Col5
1 1 a b c d a
2 2 a a a a a
3 3 b b b b b
4 1 a b c d a
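A variant without sorting is sketched below. It differs from the approach above in one respect: each duplicate points to the first identical row rather than the nearest prior one. It assumes that no cell contains the "|" character used to build the row key; key and first are illustrative names.

```r
DF <- data.frame(
  ID   = c(1L, 2L, 3L, 1L),
  Col1 = c("a", "a", "b", "a"),
  Col2 = c("b", "a", "b", "b"),
  Col3 = c("c", "a", "b", "c"),
  Col4 = c("d", "a", "b", "d"),
  Col5 = c("a", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# One string per row built from the non-ID columns
# (assumption: "|" never occurs inside a cell)
key <- do.call(paste, c(DF[-1], sep = "|"))

# Row number of the first occurrence of each key
first <- match(key, key)

# dup: TRUE for the second and later occurrences; wx: row of the first occurrence
DF$dup <- duplicated(key)
DF$wx <- ifelse(DF$dup, first, NA_integer_)
DF
```

Here wx is NA rather than 0 for rows that are not duplicates, which keeps 0 from being mistaken for a row number.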
With a df like below:
ID Col1 Col2 Col3 Col4 Col5
1 1 a b c d a
2 2 a a a a a
3 3 b b b b b
4 3 b b b b b
You could try grouping by all columns and checking whether any count > 1 as well as pasting together row numbers (1:nrow(df)):
df <- transform(
df,
dupe = ave(ID, mget(names(df)), FUN = length) > 1,
dupeRows = ave(1:nrow(df), mget(names(df)), FUN = toString)
)
As this would get you a number for each row, even when there are no duplicates, you could do:
df$dupeRows <- with(df,
Map(function(x, y)
toString(x[x != y]),
strsplit(as.character(dupeRows), split = ', '),
1:nrow(df)))
Output:
ID Col1 Col2 Col3 Col4 Col5 dupe dupeRows
1 1 a b c d a FALSE
2 2 a a a a a FALSE
3 3 b b b b b TRUE 4
4 3 b b b b b TRUE 3
Data
df <- structure(list(ID = c(1L, 2L, 3L, 3L), Col1 = structure(c(1L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col2 = structure(c(2L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor"), Col3 = structure(c(3L,
1L, 2L, 2L), .Label = c("a", "b", "c"), class = "factor"), Col4 = structure(c(3L,
1L, 2L, 2L), .Label = c("a", "b", "d"), class = "factor"), Col5 = structure(c(1L,
1L, 2L, 2L), .Label = c("a", "b"), class = "factor")), row.names = c(NA,
-4L), class = "data.frame")
A dplyr solution
library(dplyr)
df %>%
mutate(row_num = 1:n(), is_dup = duplicated(df)) %>%
group_by(across(-c(row_num, is_dup))) %>%
mutate(
has_copies = n() > 1L,
which_row = if_else(is_dup, first(row_num), NA_integer_),
row_num = NULL, is_dup = NULL
)
Output
# A tibble: 5 x 8
# Groups: ID, Col1, Col2, Col3, Col4, Col5 [3]
ID Col1 Col2 Col3 Col4 Col5 has_copies which_row
<chr> <fct> <fct> <fct> <fct> <fct> <lgl> <int>
1 1 a b c d a FALSE NA
2 2 a a a a a FALSE NA
3 3 b b b b b TRUE NA
4 3 b b b b b TRUE 3
5 3 b b b b b TRUE 3
For each row that has more than one copy, has_copies is TRUE.
For a set of identical rows, I consider the first one the original and all other rows duplicates. Accordingly, which_row gives the index of the original for each duplicate found. In other words, if a row has no duplicate or is itself the original, which_row is NA.
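The distinction between has_copies and is_dup can also be illustrated in base R with duplicated and its fromLast argument. This is a small sketch on a made-up three-column data frame, not the data of the answer.

```r
df <- data.frame(
  ID   = c(1L, 2L, 3L, 3L, 3L),
  Col1 = c("a", "a", "b", "b", "b"),
  Col2 = c("b", "a", "b", "b", "b"),
  stringsAsFactors = FALSE
)

# TRUE only for the second and later occurrences of a row
is_dup <- duplicated(df)

# TRUE for every member of a group of identical rows, including the first:
# scanning from the end flags the rows that the forward scan treats as originals
has_copies <- is_dup | duplicated(df, fromLast = TRUE)

data.frame(df, is_dup, has_copies)
```

Rows 3 to 5 all get has_copies = TRUE, while only rows 4 and 5 are flagged by is_dup.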
Let's assume there are two huge data frames (of different lengths) with 2 columns each, like:
df1 df2
A 1 C X
A 1 D X
B 4 C X
A 1 F X
B 4 A X
B 4 B X
C 7 B X
Each time there is a match in the 1st columns, X should be replaced with the value from column 2 of df1. If the 1st column of df2 contains elements that do not occur in the first column of df1 (F, D), X should be replaced with 0.
Since the data frames are huge, a loop inside a loop would not be practical.
The solution should look like this:
df1 df2
A 1 C 7
A 1 D 0
B 4 C 7
A 1 F 0
B 4 A 1
B 4 B 4
C 7 B 4
Thank you in advance.
As there are duplicate rows in 'df1', we can get the unique rows
df3 <- unique(df1)
Then, use match to get the index
i1 <- match(df2$Col1, df3$Col1)
and based on the index, assign
df2$Col2 <- df3$Col2[i1]
If there are no matches, it would be NA, which can be changed to 0
df2$Col2[is.na(df2$Col2)] <- 0
df2
# Col1 Col2
#1 C 7
#2 D 0
#3 C 7
#4 F 0
#5 A 1
#6 B 4
#7 B 4
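The match step can also be expressed as a named lookup vector, which some readers find easier to follow. This is a sketch on the same data; lookup and new_vals are illustrative names.

```r
df1 <- data.frame(
  Col1 = c("A", "A", "B", "A", "B", "B", "C"),
  Col2 = c(1, 1, 4, 1, 4, 4, 7),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Col1 = c("C", "D", "C", "F", "A", "B", "B"),
  Col2 = "X",
  stringsAsFactors = FALSE
)

# One value per key: names are the keys, elements are the values
df3 <- unique(df1)
lookup <- setNames(df3$Col2, df3$Col1)

# Indexing by name returns NA for keys that are absent (D and F here)
new_vals <- unname(lookup[df2$Col1])
new_vals[is.na(new_vals)] <- 0
df2$Col2 <- new_vals
df2
```

Like match, this is vectorized, so no loop over the rows of df2 is needed.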
Alternatively, this can be done with data.table: remove Col2 from df2, then join on 'Col1' and assign Col2 from 'df3'. Rows of df2 without a match in df3 are left as NA, so the replacement of NA with 0 shown above is still needed.
library(data.table)
setDT(df2)[, Col2 := NULL][df3, Col2 := Col2, on = .(Col1)]
data
df1 <- structure(list(Col1 = c("A", "A", "B", "A", "B", "B", "C"), Col2 = c(1,
1, 4, 1, 4, 4, 7)), class = "data.frame", row.names = c(NA, -7L
))
df2 <- structure(list(Col1 = c("C", "D", "C", "F", "A", "B", "B"), Col2 = c("X",
"X", "X", "X", "X", "X", "X")), class = "data.frame", row.names = c(NA,
-7L))
I would like to paste0 two columns if the element in one column is not NA. If an element of one column is NA, then keep only the element of the other column.
structure(list(col1 = structure(1:3, .Label = c("A", "B", "C"),
class = "factor"), col2 = c(1, NA, 3)), .Names = c("col1", "col2"),
class = "data.frame",row.names = c(NA, -3L))
# col1 col2
# 1 A 1
# 2 B NA
# 3 C 3
structure(list(col1 = structure(1:3, .Label = c("A", "B", "C"),
class = "factor"),col2 = c(1, NA, 3), col3 = c("A|1", "B", "C|3")),
.Names = c("col1", "col2", "col3"), row.names = c(NA,-3L),
class = "data.frame")
# col1 col2 col3
#1 A 1 A|1
#2 B NA B
#3 C 3 C|3
You can also do it with regular expressions:
df$col3 <- sub("NA\\||\\|NA", "", with(df, paste0(col1, "|", col2)))
That is, paste them the regular way and then replace any "NA|" or "|NA" with "". Note that | needs to be "double escaped" because it means "OR" in regexps; that is why the strange-looking pattern NA\\||\\|NA actually means "NA|" OR "|NA".
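To see what the pattern does, it can be tried on a few pasted strings. The sketch below also shows an anchored variant, which is an addition and not part of the answer; it may be safer when a value happens to end in "NA", as in the made-up "DNA|x".

```r
x <- c("A|1", "B|NA", "NA|3", "DNA|x")

# Pattern from the answer: removes the first "NA|" or "|NA" found anywhere
plain <- sub("NA\\||\\|NA", "", x)

# Anchored variant (an assumption about what is wanted): only a leading "NA|"
# or a trailing "|NA" is removed
anchored <- sub("^NA\\||\\|NA$", "", x)

plain
anchored
```

With the unanchored pattern, "DNA|x" becomes "Dx"; the anchored variant leaves it untouched while handling the other three strings the same way.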
As @Roland says, this is easy using ifelse (just translate the mental logic into a series of nested ifelse statements):
x <- transform(x,col3=ifelse(is.na(col1),as.character(col2),
ifelse(is.na(col2),as.character(col1),
paste0(col1,"|",col2))))
Update: as.character is needed in some cases.
Try:
> df$col1 = as.character(df$col1)
> df$col3 = with(df, ifelse(is.na(col1),col2, ifelse(is.na(col2), col1, paste0(col1,'|',col2))))
> df
col1 col2 col3
1 A 1 A|1
2 B NA B
3 C 3 C|3
You could also do:
library(stringr)
df$col3 <- apply(df, 1, function(x)
paste(str_trim(x[!is.na(x)]), collapse="|"))
df
# col1 col2 col3
#1 A 1 A|1
#2 B NA B
#3 C 3 C|3
I have the following data.frames:
a <- data.frame(id = 1:3, v1 = c('a', NA, NA), v2 = c(NA, 'b', 'c'))
b <- data.frame(id = 1:3, v1 = c(NA, 'B', 'C'), v2 = c("A", NA, NA))
> a
id v1 v2
1 1 a <NA>
2 2 <NA> b
3 3 <NA> c
> b
id v1 v2
1 1 <NA> A
2 2 B <NA>
3 3 C <NA>
Note: there are no ids for which v1 or v2 is defined in both tables; there is only a single unique non-NA value in each column for each id value.
I would like to merge these data frames on matching values of "id":
ab <- merge(a, b, by = "id")
However, I would also like to combine the two columns v1 and v2, so that the data.frame ab looks like this:
ab <- data.frame(id = 1:3, v1 = c("a", "B", "C"), v2 = c("A", "b", "c"))
> ab
id v1 v2
1 1 a A
2 2 B b
3 3 C c
Instead, I get this:
> merge(a, b, by = "id")
id v1.x v2.x v1.y v2.y
1 1 a <NA> <NA> A
2 2 <NA> b B <NA>
3 3 <NA> c C <NA>
It would be helpful to have examples using both data.frame and data.table, so here are the data.table versions of the above:
A <- data.table(a, key = 'id')
B <- data.table(b, key = 'id')
A[B]
The type of merge you specify probably won't be possible using merge (with data frames), although saying that usually invites being proved wrong.
You also omit some details: will there always be a single unique non-NA value in each column for each id value? If so, this will work:
> library(plyr)
> ab <- rbind(a,b)
> colFun <- function(x){x[which(!is.na(x))]}
> ddply(ab,.(id),function(x){colwise(colFun)(x)})
id v1 v2
1 1 a A
2 2 B b
3 3 C c
A similar strategy should work with data.tables as well:
abDT <- data.table(ab,key = "id")
> abDT[,list(colFun(v1),colFun(v2)),by = id]
id V1 V2
[1,] 1 a A
[2,] 2 B b
[3,] 3 C c
If your data is as simple as it is above, joran's answer is likely the simplest way. Here's my approach in base:
a <- data.frame(id = 1:3, v1 = c('a', NA, NA), v2 = c(NA, 'b', 'c'))
b <- data.frame(id = 1:3, v1 = c(NA, 'B', 'C'), v2 = c("A", NA, NA))
decider <- function(x, y) factor(ifelse(is.na(x), as.character(y), as.character(x)))
data.frame(mapply(a, b, FUN = decider))
If your data has different ids (some overlap and some do not), then here's a different approach:
a <- data.frame(id = c(1,2,4,5), v1 = c('a', NA, "q", NA), v2 = c(NA, 'b', 'c', "e"))
b <- data.frame(id = 1:4, v1 = c(NA, "A", "C", 'B'), v2 = c("A", NA, "D", NA))
decider <- function(x, y) factor(ifelse(is.na(x), as.character(y), as.character(x)))
DF <- data.frame(mapply(a, b, FUN = decider))
DF2 <- rbind(b[!b$id %in% DF$id , ], DF)
DF2 <- DF2[order(DF2$id), ]
rownames(DF2) <- 1:nrow(DF2)
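For partly overlapping ids, another possible route is a full outer merge followed by coalescing the suffixed columns. This is only a sketch under the assumption that, when both tables define a value for the same id, the value from a should win; m and ab are illustrative names.

```r
a <- data.frame(id = c(1, 2, 4, 5), v1 = c("a", NA, "q", NA),
                v2 = c(NA, "b", "c", "e"), stringsAsFactors = FALSE)
b <- data.frame(id = 1:4, v1 = c(NA, "A", "C", "B"),
                v2 = c("A", NA, "D", NA), stringsAsFactors = FALSE)

# Full outer join: ids present in only one table are kept, with NA on the other side
m <- merge(a, b, by = "id", all = TRUE)
m <- m[order(m$id), ]

# Take the value from a (suffix .x) when present, otherwise the one from b (suffix .y)
ab <- data.frame(
  id = m$id,
  v1 = ifelse(is.na(m$v1.x), m$v1.y, m$v1.x),
  v2 = ifelse(is.na(m$v2.x), m$v2.y, m$v2.x),
  stringsAsFactors = FALSE
)
ab
```

An id that has no value in either table for a given column (v1 for id 5 here) stays NA.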