Comparing "x1", "x2", and "x3" to "target", how do I return the index of the first column that matches "target" in each row? An NA can result when there is no match.
pop <- c("A", "B", "C", "D")
target <- pop
x1 <- sample(pop)
x2 <- sample(pop)
x3 <- sample(pop)
df <- data.frame(target,x1,x2,x3)
> df
  target x1 x2 x3
1      A  B  B  D
2      B  D  C  C
3      C  C  A  A
4      D  A  D  B
I have tried using something along the lines of:
min(which(df[3, 1] == df[3, 2:ncol(df)]))
...(row 3 being used as an example), but I don't know how to gracefully handle cases where there is no match, which is probably why I am having trouble using this in a function with apply(). The goal is either a new column on df or a vector of the returned values.
Thanks!
Here's a solution using match -
> df
  target x1 x2 x3
1      A  C  A  C
2      B  A  B  B
3      C  D  D  D
4      D  B  C  A
apply(df, 1, function(x) match(TRUE, x[-1] == x[1]))
[1] 2 2 NA NA
Data -
df <- structure(list(target = c("A", "B", "C", "D"), x1 = c("C", "A",
"D", "B"), x2 = c("A", "B", "D", "C"), x3 = c("C", "B", "D",
"A")), .Names = c("target", "x1", "x2", "x3"), row.names = c(NA,
-4L), class = "data.frame")
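To get the result as a new column on df, as the question asks, the same call can be assigned back directly. This is only a minimal sketch using the data above; the column name first_match is my own choice, not something from the question.

```r
df <- data.frame(
  target = c("A", "B", "C", "D"),
  x1 = c("C", "A", "D", "B"),
  x2 = c("A", "B", "D", "C"),
  x3 = c("C", "B", "D", "A"),
  stringsAsFactors = FALSE
)

# For each row, x[1] is the target and x[-1] holds the candidate columns.
# match(TRUE, ...) gives the position of the first TRUE, or NA if there is none.
first_match <- apply(df, 1, function(x) match(TRUE, x[-1] == x[1]))

# Hypothetical column name; compute the vector before adding it so the
# new column is not itself compared against the target.
df$first_match <- first_match
```

The index counts from the first non-target column, so 2 here means x2.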
There are many ways to do this. One is to loop through columns 2:4, compare each with the target, and take the index of the first match with which:
sapply(df[-1], function(x) which(x == df$target)[1])
# x1 x2 x3
#  1  3 NA
If the comparison should instead be made row by row:
m1 <- df$target == df[-1]
max.col(m1, 'first') * NA^!rowSums(m1)
Or
apply(m1, 1, function(x) which(x)[1])
data
df <- data.frame(target,x1,x2,x3, stringsAsFactors = FALSE)
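The max.col one-liner is compact but a bit cryptic, so here is an unrolled sketch of what I understand it to do. It assumes the fixed data from the match answer rather than the random sample, and the names first_true, no_match and res are mine.

```r
df <- data.frame(
  target = c("A", "B", "C", "D"),
  x1 = c("C", "A", "D", "B"),
  x2 = c("A", "B", "D", "C"),
  x3 = c("C", "B", "D", "A"),
  stringsAsFactors = FALSE
)

# Logical matrix: TRUE where a column equals the target in that row
m1 <- df$target == df[-1]

# Column index of the first TRUE in each row; a row with no TRUE
# is an all-way tie, so "first" returns 1 for it
first_true <- max.col(m1, "first")

# Rows without any match must be overwritten with NA explicitly
no_match <- rowSums(m1) == 0
res <- first_true
res[no_match] <- NA
res
```

In the one-liner, NA^!rowSums(m1) does the last step arithmetically: it is NA for rows without a match and 1 otherwise, so multiplying by it blanks out the misleading 1s.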
Related
I think this problem can be solved in many different ways, but I basically want to find a function that will give me a dataframe with every combination of values from a list into its columns, including the incomplete sets and excluding some, but not all, redundant combinations (order isn't important for now).
So I might start out with a list like this:
List = c("A","B","C")
and I want to get a dataframe that looks like
C1 = c("A","B","C","A","A","B","A")
C2 = c("","","","B","C","C","B")
C3 = c("","","","","","","C")
df <- cbind(C1, C2, C3)
row.names(df) <- c("A", "B", "C", "AB", "AC", "BC", "ABC")
colnames(df) <- c("First_Item", "Second_Item","Third_Item")
And then it fills in each cell with the corresponding letter.
e.g. position A1 in the df would be "A", positions A2 and A3 would be empty.
any idea how to do this?
I tried with tidyr:
library(tidyr)
list_1 = c("A", "B", "C", "NA")
list_2 = c("A", "B", "C", "NA")
list_3 = c("A", "B", "C", "NA")
list_4 = c("A", "B", "C", "NA")
test <- crossing(list_1, list_2,list_3,list_4)
# keep only the rows in which no value is repeated
test <- test[apply(test, MARGIN = 1, FUN = function(x) !any(duplicated(x))), ]
But I want to keep all the values with multiple NAs in them, so this doesn't quite work.
expand.grid has the same problem
expand.grid(list_1 = c("A", "B", "C", "NA"),list_2 = c("A", "B", "C", "NA"),list_3 = c("A", "B", "C", "NA"),list_4 = c("A", "B", "C", "NA"))
That's basically Roland's answer:
library(magrittr) # just for the pipe-operator
List %>%
seq_along() %>%
lapply(combn, x = List, simplify = FALSE) %>%
unlist(recursive = FALSE) %>%
sapply(`length<-`, length(List)) %>%
t() %>%
data.frame()
returns
  X1   X2   X3
1  A <NA> <NA>
2  B <NA> <NA>
3  C <NA> <NA>
4  A    B <NA>
5  A    C <NA>
6  B    C <NA>
7  A    B    C
Furthermore, you could use the dplyr and tidyr packages to replace the NAs with empty strings. Just add one more step to the end of the pipe:
mutate(across(everything(), ~ replace_na(.x, "")))
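If you would rather avoid the pipe and extra packages, the same idea can be sketched in base R. This is just one possible version: it pads with "" directly instead of going through NA, and it uses the row and column names from the question.

```r
List <- c("A", "B", "C")

# All combinations of size 1, 2, ..., length(List), as one flat list
combos <- unlist(
  lapply(seq_along(List), function(m) combn(List, m, simplify = FALSE)),
  recursive = FALSE
)

# Pad every combination with "" up to the full length, one row per combination
padded <- t(sapply(combos, function(v) c(v, rep("", length(List) - length(v)))))

out <- data.frame(padded, stringsAsFactors = FALSE)
names(out) <- c("First_Item", "Second_Item", "Third_Item")
rownames(out) <- sapply(combos, paste, collapse = "")
out
```

The hard-coded column names only fit a three-element List; for other lengths something like paste0("Item_", seq_along(List)) would be needed.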
Here is my approach:
library(purrr)
List <- c("xA","xB","xC") # arbitrary as per request in comments
seq_along(List) %>% # h/t @MartinGal
map(~ combn(List, m = .x) %>%
apply(2, paste, collapse = "<!>")) %>%
unlist() %>%
tibble::tibble() %>%
tidyr::separate(1, into = c("First_Item", "Second_Item", "Third_Item"),
sep = "<!>")
Returns:
# A tibble: 7 x 3
  First_Item Second_Item Third_Item
  <chr>      <chr>       <chr>
1 xA         NA          NA
2 xB         NA          NA
3 xC         NA          NA
4 xA         xB          NA
5 xA         xC          NA
6 xB         xC          NA
7 xA         xB          xC
I have the following dataset
df <- structure(list(
X1 = c("A", "B", "C", "D"),
X2 = c("NA", "B", "C", "D"),
X3 = c("NA", "B", "C", "D"),
X4 = c("NA", "B", "C", "D")),
class = "data.frame", row.names = c(NA, -4L))
I need to transpose the top row into its own column such that the data looks like
df <- structure(list(
X1 = c("A", "A", "A"),
X2 = c("B", "C", "D"),
X3 = c("B", "C", "D"),
X4 = c("B", "C", "D")),
class = "data.frame", row.names = c(NA, -3L))
I have thought about just subsetting and taking the top row, then transposing only that one and then merging it back to the original dataset.
I am wondering if there is a more elegant solution.
Edit: Thanks for everyone's help.
The next step is to take this and apply it to a list of tibbles that were split via group_split (code thanks to @LMc).
data %>%
group_by(split_on = cumsum(is.na(Company) & is.na(lag(Company)))) %>%
group_split(.keep = F) %>%
`names<-`({.} %>%
map(~ .[1,1])%>%
unlist())
"[<-"("["(df, -1, ), ,1,df[1,1])
:-)
df[,1] <- df[1,1]
df[-1,]
  X1 X2 X3 X4
2  A  B  B  B
3  A  C  C  C
4  A  D  D  D
We could use
library(dplyr)
df %>%
mutate(X1 = first(X1)) %>%
slice(-1)
#  X1 X2 X3 X4
#1  A  B  B  B
#2  A  C  C  C
#3  A  D  D  D
Or in base R (R 4.1.0)
df |>
transform(X1 = X1[1]) |>
subset( seq_along(X1) > 1)
#  X1 X2 X3 X4
#2  A  B  B  B
#3  A  C  C  C
#4  A  D  D  D
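For the follow-up in the edit (applying this to every element of a list produced by group_split), one option is to wrap the step in a small function and lapply it. This is only a sketch: the function name fill_first_and_drop and the toy list pieces are made up for illustration, and I am assuming every element has the same layout as df.

```r
# Hypothetical helper: copy the first value of column 1 down the column,
# then drop the first row
fill_first_and_drop <- function(d) {
  d[[1]] <- rep(d[[1]][1], nrow(d))
  d[-1, , drop = FALSE]
}

df <- data.frame(
  X1 = c("A", "B", "C", "D"),
  X2 = c("NA", "B", "C", "D"),
  X3 = c("NA", "B", "C", "D"),
  X4 = c("NA", "B", "C", "D"),
  stringsAsFactors = FALSE
)

res <- fill_first_and_drop(df)

# Stand-in for the output of group_split(): a plain list of data frames
pieces <- list(df, df)
res_list <- lapply(pieces, fill_first_and_drop)
```

With purrr already loaded, map(pieces, fill_first_and_drop) should do the same thing.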
I am working to update an old dataframe with data from a new dataframe.
I found the option below. It works for some of the fields, but not all, and I am not sure how to alter it, as that is beyond my skill set. I tried removing the is.na(x) portion of the ifelse code, and that did not work.
df_old <- data.frame(
bb = as.character(c("A", "A", "A", "B", "B", "B")),
y = as.character(c("i", "ii", "ii", "i", "iii", "i")),
z = 1:6,
aa = c(NA, NA, 123, NA, NA, 12))
df_new <- data.frame(
bb = as.character(c("A", "A", "A", "B", "A", "A")),
z = 1:6,
aa = c(NA, NA, 123, 1234, NA, 12))
cols <- names(df_new)[names(df_new) != "z"]
df_old[,cols] <- mapply(function(x, y) ifelse(is.na(x), y[df_new$z == df_old$z], x), df_old[,cols], df_new[,cols])
The code also changes my bb variable from a character vector to a numeric. Do I need another call to mapply focusing on specific variable bb?
To update the aa and bb columns you can approach this using a join via merge(). This assumes column z is the index for these data frames.
# join on `z` column
df_final<- merge(df_old, df_new, by = c("z"))
# replace NAs with new values for column `aa` from `df_new`
df_final$aa <- ifelse(is.na(df_final$aa.x), df_final$aa.y, df_final$aa.x)
# choose new values for column `bb` from `df_new`
df_final$bb <- df_final$bb.y
df_final<- df_final[,c("bb", "z", "y", "aa")]
df_final
  bb z   y   aa
1  A 1   i   NA
2  A 2  ii   NA
3  A 3  ii  123
4  B 4   i 1234
5  A 5 iii   NA
6  A 6   i   12
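If you would rather update df_old in place than build a merged copy, a lookup with match() on the key is another possibility. A minimal sketch, assuming z is unique in both data frames; idx, fill and df_upd are names I made up, and I set stringsAsFactors = FALSE explicitly so bb stays character on older R versions too.

```r
df_old <- data.frame(
  bb = c("A", "A", "A", "B", "B", "B"),
  y  = c("i", "ii", "ii", "i", "iii", "i"),
  z  = 1:6,
  aa = c(NA, NA, 123, NA, NA, 12),
  stringsAsFactors = FALSE
)
df_new <- data.frame(
  bb = c("A", "A", "A", "B", "A", "A"),
  z  = 1:6,
  aa = c(NA, NA, 123, 1234, NA, 12),
  stringsAsFactors = FALSE
)

# Row of df_new that corresponds to each row of df_old (NA if z is missing there)
idx <- match(df_old$z, df_new$z)

df_upd <- df_old
# bb: take the new value wherever a matching row exists
df_upd$bb <- ifelse(is.na(idx), df_old$bb, df_new$bb[idx])
# aa: only fill the gaps, keep existing old values
fill <- is.na(df_upd$aa) & !is.na(idx)
df_upd$aa[fill] <- df_new$aa[idx[fill]]
df_upd
```

Handling each column separately also keeps the column types intact, since nothing is forced through a common matrix.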
I have a list of dataframes called list and it looks like this:
list[[1]]
X1 X2 X3 X4
a 1 b c
d 2 e f
g 3 h i
j 4 k l
list[[2]]
X1 X2 X3 X4
a 1 b c
d 2 e f
g 2 h i
j 3 k l
list[[3]]
X1 X2 X3 X4
a 1 b c
d 2 e f
g 3 h i
j 4 k l
I have been trying to use lapply to loop through the list and print out all the duplicates in column X2 of each dataframe.
I'm not able to figure this out. Would appreciate any help. Thanks.
I've tried
lapply(list, function(i) {
if(length(unique(i[X2])) != length(i[X2])) {
print(i[X2][duplicated(i[X2]))
} else {
print("No duplicates")
}
})
We could use lapply, find out the duplicated indices in X2 column and print the unique duplicated values.
lapply(list_df, function(x) {
inds <- duplicated(x$X2)
if(any(inds)) unique(x$X2[inds]) else "No duplicates"
})
#[[1]]
#[1] "No duplicates"
#[[2]]
#[1] 2
#[[3]]
#[1] "No duplicates"
I use list_df instead of list here, since list is already the name of a base R function.
We can use table to find out the frequency of values in the column 'X2', extract the names of the output where the frequency is greater than 1
lapply(list, function(x) {
x1 <- names(which(table(x$X2) > 1))
if(length(x1)== 0) "No duplicates" else x1})
#[[1]]
#[1] "No duplicates"
#[[2]]
#[1] "2"
#[[3]]
#[1] "No duplicates"
Or using duplicated
lapply(list, function(x) unique(x$X2[duplicated(x$X2)|duplicated(x$X2,
fromLast = TRUE)]))
Or another option is to stack after extracting the column and get the index of duplicate elements with table and which
which(table(stack(setNames(lapply(list, `[[`, "X2"),
seq_along(list)))[2:1]) > 1, arr.ind = TRUE)
Or another option is
library(tidyverse)
map(list, ~ .x %>%
count(X2) %>%
filter(n > 1) %>%
pull(X2))
data
list <- list(structure(list(X1 = c("a", "d", "g", "j"), X2 = 1:4, X3 = c("b",
"e", "h", "k"), X4 = c("c", "f", "i", "l")), class = "data.frame", row.names = c(NA,
-4L)), structure(list(X1 = c("a", "d", "g", "j"), X2 = c(1L,
2L, 2L, 3L), X3 = c("b", "e", "h", "k"), X4 = c("c", "f", "i",
"l")), class = "data.frame", row.names = c(NA, -4L)), structure(list(
X1 = c("a", "d", "g", "j"), X2 = 1:4, X3 = c("b", "e", "h",
"k"), X4 = c("c", "f", "i", "l")), class = "data.frame", row.names = c(NA,
-4L)))
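If the goal is to see the whole duplicated rows rather than just the values, the duplicated() idea extends naturally. A small sketch, using list_df as the name to avoid masking list(); the data is the same as above, built with a hypothetical helper make_df to keep it short.

```r
# Hypothetical helper to rebuild the example data frames
make_df <- function(x2) {
  data.frame(
    X1 = c("a", "d", "g", "j"), X2 = x2,
    X3 = c("b", "e", "h", "k"), X4 = c("c", "f", "i", "l"),
    stringsAsFactors = FALSE
  )
}
list_df <- list(make_df(1:4), make_df(c(1L, 2L, 2L, 3L)), make_df(1:4))

# Keep every row whose X2 value occurs more than once
# (checking from both ends also keeps the first occurrence)
dup_rows <- lapply(list_df, function(d) {
  is_dup <- duplicated(d$X2) | duplicated(d$X2, fromLast = TRUE)
  d[is_dup, , drop = FALSE]
})
dup_rows
```

Data frames without duplicates come back with zero rows, which is easy to test with nrow() afterwards.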
I have a list aa which references the index names of another list bb as well as containing one other element (call it cm). List bb items contain strings. I have a loop that goes through bb and, for every item which matches a string I've specified, adds it to a new row in a dataframe. What I need is to also add the cm value to that row.
Example:
library("tidyverse")
aa <- list(c(123, 1), c(234, 1), c(345, 2), c(456, 3))
bb <- list("123" = c("a", "b", "c"), "234" = c("b", "c", "d"), "345" = c("c", "d", "e"), "456" = c("f", "g", "h"))
cc <- c("a", "b", "c")
tbl <- NULL
for (a in aa){
for (b in bb) {
if (any(cc %in% b)) {
tb <- tibble(cm=a[2],n1=b[1],n2=b[2],n3=b[3])
tbl <- bind_rows(tbl,tb)
}
}
}
This iterates through every possible combination of aa and bb and pairs each matching bb item with every cm, which is no good. My output should look something like this:
output <- tibble(cm = c(1, 1, 2), n1 = c("a", "b", "c"),
n2 = c("b", "c", "d"), n3 = c("c", "d", "e"))
> output
# A tibble: 3 x 4
     cm n1    n2    n3
  <dbl> <chr> <chr> <chr>
1     1 a     b     c
2     1 b     c     d
3     2 c     d     e
I thought maybe something like this would work, as at least then I could loop through tbl later and use nm to replace it with the appropriate cm values:
tbl <- NULL
for (a in aa){
for (b in bb) {
if (any(cc %in% b)) {
tb <- tibble(nm = names(bb)[b], n1=b[1],n2=b[2],n3=b[3])
tbl <- bind_rows(tbl,tb)
}
}
}
I don't really understand why this doesn't work, because names(bb)[1] returns 123 so I figured it would work the same in a loop with names(bb)[b].
If you're happy with a base R solution without the explicit loops, would this work?
# generate data
aa <- list(c(123, 1), c(234, 1), c(345, 2), c(456, 3))
# cm is an element of bb
bb <- list("123" = c("a", "b", "c"), "234" = c("b", "c", "d"),
"345" = c("c", "d", "e"), "456" = c("f", "g", "h"),
cm = c(1, 1, 2))
cc <- c("a", "b", "c")
tbl <- data.frame(
bb[["cm"]],
# apply to each element of aa
do.call(rbind, lapply(aa, function(x, y, c) { # function takes 3 args
# only elements of bb whose names are in aa[[x]]
names_y <- as.character(intersect(x, names(y)))
# turn subset of bb into data.frame
out <- as.data.frame(do.call(rbind, y[names_y]))
# subset rows for which any row element %in% cc
out <- out[apply(out, 1, function(x, c) any(x %in% c), c), , drop = FALSE]
return(out)
}, bb, cc))) # pass bb and cc as args to the function in lapply()
names(tbl) <- c("cm", paste0("n", 1:(ncol(tbl) - 1)))
gives
> tbl
    cm n1 n2 n3
123  1  a  b  c
234  1  b  c  d
345  2  c  d  e
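As for why names(bb)[b] fails in the question: inside the loop b is the character vector itself (e.g. c("a", "b", "c")), not a position, so it cannot be used to index names(bb). An alternative sketch that keeps aa and bb exactly as in the question is to loop over aa only and look up the matching bb element by name. The variable rows is my own, and this assumes the first element of each aa entry is the bb name.

```r
aa <- list(c(123, 1), c(234, 1), c(345, 2), c(456, 3))
bb <- list("123" = c("a", "b", "c"), "234" = c("b", "c", "d"),
           "345" = c("c", "d", "e"), "456" = c("f", "g", "h"))
cc <- c("a", "b", "c")

rows <- lapply(aa, function(a) {
  # a[1] is the name of the bb element, a[2] is the cm value
  b <- bb[[as.character(a[1])]]
  # skip entries with no bb element or no overlap with cc
  if (is.null(b) || !any(cc %in% b)) return(NULL)
  data.frame(cm = a[2], n1 = b[1], n2 = b[2], n3 = b[3],
             stringsAsFactors = FALSE)
})

# Drop the skipped entries and stack the rest
tbl <- do.call(rbind, Filter(Negate(is.null), rows))
tbl
```

Because each aa entry is paired with exactly one bb element, the cm values can no longer be combined with the wrong rows.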