How to apply a function to factored subgroups in R?

I have some columns of characters as a data frame df:
V1 V2 V3 group
B C - 1
B C C 1
B C C 1
A C A 2
A A A 2
A A A 2
I would like to find out, for each column, whether the intersection of the values across the groups is empty, and output the result in, say, TRUE/FALSE format.
Column 2 is the only column with a non-empty intersection, which I checked using:
> is.na(intersect(df[,2][df$group=="1"],df[,2][df$group=="2"]))
[1] FALSE
I was trying to automate this for the three columns V1-V3 using
by(df[,1:3], df$group, function(x) { is.na(intersect(x[df$group=="1"],x[df$group=="2"]))})
but got an error:
Error in `[.data.frame`(x, df$group == "2") : undefined columns selected
Thanks for any suggestions/alternatives!

Try
lapply(df[,1:3], function(x)
is.na(intersect(x[df$group=='1'], x[df$group=='2'])))
Or
Map(function(x,y) is.na(intersect(x,y)),
df[df$group=='1',-4], df[df$group=='2', -4])
If you have many groups,
lapply(df[,1:3], function(x) is.na(Reduce(`intersect`,split(x, df$group))))
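A note on the is.na() test used above: intersect() returns a zero-length vector when the groups share nothing, and is.na() of a zero-length vector is logical(0), not TRUE. The by() call in the question fails for a related reason: by() hands the function a data-frame subset per group, so indexing it with a length-6 logical tries to select columns. The sketch below is one possible alternative, not the answerer's code. It tests the length of the intersection instead, and the name has_common is made up for illustration.

```r
# Same data as in the question, rebuilt here so the block runs on its own
df <- data.frame(
  V1 = c("B", "B", "B", "A", "A", "A"),
  V2 = c("C", "C", "C", "C", "A", "A"),
  V3 = c("-", "C", "C", "A", "A", "A"),
  group = c(1L, 1L, 1L, 2L, 2L, 2L),
  stringsAsFactors = FALSE
)

# For each column: split the values by group, intersect across all groups,
# and report TRUE when at least one value is shared by every group
has_common <- sapply(df[, 1:3], function(x) {
  length(Reduce(intersect, split(x, df$group))) > 0
})

has_common
#    V1    V2    V3
# FALSE  TRUE FALSE
```

Because the result is a plain named logical vector, names(has_common)[has_common] gives the columns with a non-empty intersection.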
data
df <- structure(list(V1 = c("B", "B", "B", "A", "A", "A"), V2 = c("C",
"C", "C", "C", "A", "A"), V3 = c("-", "C", "C", "A", "A", "A"
), group = c(1L, 1L, 1L, 2L, 2L, 2L)), .Names = c("V1", "V2",
"V3", "group"), class = "data.frame", row.names = c(NA, -6L))

Related

How to collapse rows by identical values in a column

Good evening,
I have a two columns tab separated .txt file, as the following:
number letter
1 a
1 b
2 a
2 b
3 b
I would like to collapse rows where the column "number" has identical value, by creating a comma separated value in the corresponding column "letter".
In other words, this should be the output:
number letter
1 a,b
2 a,b
3 b
I have searched the web but did not find a working solution.
Thank you in advance,
Giuseppe
We can use aggregate in base R
aggregate(letter ~ number, df1, FUN = paste, collapse=",")
-output
# number letter
#1 1 a,b
#2 2 a,b
#3 3 b
Or with tidyverse
library(dplyr)
library(stringr)
df1 %>%
group_by(number) %>%
summarise(letter = str_c(letter, collapse=","))
data
df1 <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
We can also combine aggregate() with toString:
#Code
newdf <- aggregate(letter~.,df,toString)
Output:
number letter
1 1 a, b
2 2 a, b
3 3 b
Some data:
#Data
df <- structure(list(number = c(1L, 1L, 2L, 2L, 3L), letter = c("a",
"b", "a", "b", "b")), class = "data.frame", row.names = c(NA,
-5L))
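Since the question starts from a tab-separated file, here is a minimal sketch of the full round trip, assuming the file has a header row. To stay self-contained it reads the sample from an inline string through the text argument; with a real file you would pass its path to read.delim() instead.

```r
# Inline stand-in for the tab-separated .txt file from the question
txt <- "number\tletter\n1\ta\n1\tb\n2\ta\n2\tb\n3\tb"

# read.delim() defaults to tab as separator and header = TRUE
df1 <- read.delim(text = txt, stringsAsFactors = FALSE)

# Collapse the letters within each number into one comma-separated string
res <- aggregate(letter ~ number, df1, FUN = paste, collapse = ",")

# Print the result as tab-separated text without quotes or row names
write.table(res, sep = "\t", quote = FALSE, row.names = FALSE)
```

Passing a file name as the file argument of write.table() would save the collapsed table rather than print it.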

subset of R undefined columns

I'm trying to use subset to get values from the union of two tables
> ans<-subset(table2, select=rownames(table1))
But I get the following error:
Error in `[.data.frame`(x, r, vars, drop = drop) : undefined columns selected
Given table1
V2
E x
F x
G x
H x
And table2
V1 V2 V3 V4 V5 V6
1 A B C D E F
2 2 5 6 4 6 8
I want to obtain:
E F
6 8
Used this data:
table1 <- structure(list(V2 = structure(c(1L, 1L, 1L, 1L), .Label = "x", class = "factor")), class = "data.frame", row.names = c("E",
"F", "G", "H"))
table2 <- structure(list(X1 = c("A", "2"), X2 = c("B", "5"), X3 = c("C",
"6"), X4 = c("D", "4"), X5 = c("E", "6"), X6 = c("F", "8")), class = "data.frame", row.names = c(NA,
-2L))
Note: This does not work if the columns are factors. I assembled table2 with:
table2 <- data.frame(rbind(as.character(LETTERS[1:6]), c(2, 5, 6, 4, 6, 8)), stringsAsFactors = FALSE)
So, then this works:
ans <- table2[, as.character(table2[1, ]) %in% rownames(table1)]
ans
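One way to see why subset() fails: select looks up column names, but the letters A to F are stored in the first row of table2, not in its names (X1 to X6 in the data above), and G and H do not occur at all. A possible alternative, sketched below under the assumption that the first row is meant to be a header, is to promote that row to column names and keep only the names that exist. The names values and wanted are illustrative.

```r
table1 <- data.frame(V2 = rep("x", 4), row.names = c("E", "F", "G", "H"))
table2 <- data.frame(rbind(LETTERS[1:6], c(2, 5, 6, 4, 6, 8)),
                     stringsAsFactors = FALSE)

# Drop the header row from the data and use it as the column names
values <- table2[-1, , drop = FALSE]
names(values) <- as.character(unlist(table2[1, ]))

# Keep only the row names of table1 that are real columns (E and F here)
wanted <- intersect(rownames(table1), names(values))
ans <- values[, wanted, drop = FALSE]
ans
```

Note that the values stay character strings, because rbind() on a letter vector and a numeric vector coerces everything to character.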

R extract unique row values in a column in a dataframe in a list [duplicate]

I have a list of dataframes called list and it looks like this:
list[[1]]
X1 X2 X3 X4
a 1 b c
d 2 e f
g 3 h i
j 4 k l
list[[2]]
X1 X2 X3 X4
a 1 b c
d 2 e f
g 2 h i
j 3 k l
list[[3]]
X1 X2 X3 X4
a 1 b c
d 2 e f
g 3 h i
j 4 k l
I have been trying to use lapply to loop through the list and print out all the duplicates in column X2 of each dataframe.
I'm not able to figure this out. Would appreciate any help. Thanks.
I've tried
lapply(list, function(i) {
if(length(unique(i[X2])) != length(i[X2])) {
print(i[X2][duplicated(i[X2]))
} else {
print("No duplicates")
}
})
We could use lapply, find out the duplicated indices in X2 column and print the unique duplicated values.
lapply(list_df, function(x) {
inds <- duplicated(x$X2)
if(any(inds)) unique(x$X2[inds]) else "No duplicates"
})
#[[1]]
#[1] "No duplicates"
#[[2]]
#[1] 2
#[[3]]
#[1] "No duplicates"
I used list_df instead of list as the object name, since list is already the name of a base R function.
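If the whole rows are of interest rather than just the repeated values, a small variation is possible. This is a sketch, not part of the original answers; it uses a trimmed-down list_df with only the X1 and X2 columns, and dup_rows is an illustrative name.

```r
list_df <- list(
  data.frame(X1 = c("a", "d", "g", "j"), X2 = 1:4,
             stringsAsFactors = FALSE),
  data.frame(X1 = c("a", "d", "g", "j"), X2 = c(1L, 2L, 2L, 3L),
             stringsAsFactors = FALSE),
  data.frame(X1 = c("a", "d", "g", "j"), X2 = 1:4,
             stringsAsFactors = FALSE)
)

# duplicated() marks only the second and later occurrences; combining it
# with fromLast = TRUE also marks the first occurrence of each repeated value
dup_rows <- lapply(list_df, function(x) {
  x[duplicated(x$X2) | duplicated(x$X2, fromLast = TRUE), , drop = FALSE]
})

# Number of rows involved in a duplicate, per data frame
sapply(dup_rows, nrow)
```

A data frame without duplicates yields a zero-row data frame here instead of the "No duplicates" string, which keeps every list element the same type.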
We can use table to find the frequency of the values in column 'X2' and then extract the names of the entries whose frequency is greater than 1
lapply(list, function(x) {
x1 <- names(which(table(x$X2) > 1))
if(length(x1)== 0) "No duplicates" else x1})
#[[1]]
#[1] "No duplicates"
#[[2]]
#[1] "2"
#[[3]]
#[1] "No duplicates"
Or using duplicated
lapply(list, function(x) unique(x$X2[duplicated(x$X2)|duplicated(x$X2,
fromLast = TRUE)]))
Or another option is to stack after extracting the column and get the index of duplicate elements with table and which
which(table(stack(setNames(lapply(list, `[[`, "X2"),
seq_along(list)))[2:1]) > 1, arr.ind = TRUE)
Or another option is
library(tidyverse)
map(list, ~ .x %>%
count(X2) %>%
filter(n > 1) %>%
pull(X2))
data
list <- list(structure(list(X1 = c("a", "d", "g", "j"), X2 = 1:4, X3 = c("b",
"e", "h", "k"), X4 = c("c", "f", "i", "l")), class = "data.frame", row.names = c(NA,
-4L)), structure(list(X1 = c("a", "d", "g", "j"), X2 = c(1L,
2L, 2L, 3L), X3 = c("b", "e", "h", "k"), X4 = c("c", "f", "i",
"l")), class = "data.frame", row.names = c(NA, -4L)), structure(list(
X1 = c("a", "d", "g", "j"), X2 = 1:4, X3 = c("b", "e", "h",
"k"), X4 = c("c", "f", "i", "l")), class = "data.frame", row.names = c(NA,
-4L)))

Converting a data frame to matrix in R

I would like to convert a data frame to a matrix in R, as in the following example:
df
row.index column.index matrix element
1 1 A
1 2 B
2 1 C
2 2 D
matrix
A B
C D
Is it possible to do the same with row names? For example:
df
row.name column.name matrix element
X P A
X Q B
Y P C
Y Q D
matrix
P Q
X A B
Y C D
Thanks for help!
We can use tapply
tapply(df$matrixelement, df[1:2], FUN = I)
It would also work for the second dataset
res <- tapply(df1$matrixelement, df1[1:2], FUN = I)
names(dimnames(res)) <- NULL
res
# P Q
#X "A" "B"
#Y "C" "D"
If we need a data.frame, then dcast can be used
library(reshape2)
dcast(df, row.index ~ column.index, value.var = "matrixelement")
data
df <- structure(list(row.index = c(1L, 1L, 2L, 2L), column.index = c(1L,
2L, 1L, 2L), matrixelement = c("A", "B", "C", "D")), .Names = c("row.index",
"column.index", "matrixelement"), class = "data.frame", row.names = c(NA,
-4L))
df1 <- structure(list(row.name = c("X", "X", "Y", "Y"), column.name = c("P",
"Q", "P", "Q"), matrixelement = c("A", "B", "C", "D")), .Names = c("row.name",
"column.name", "matrixelement"), class = "data.frame", row.names = c(NA,
-4L))
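Another base R route, offered here as a sketch and not taken from the answer above, is matrix indexing: a two-column index matrix addresses one cell per row, so the elements can be assigned in a single step. The names m, m1, rn and cn are illustrative.

```r
df <- data.frame(row.index = c(1L, 1L, 2L, 2L),
                 column.index = c(1L, 2L, 1L, 2L),
                 matrixelement = c("A", "B", "C", "D"),
                 stringsAsFactors = FALSE)
df1 <- data.frame(row.name = c("X", "X", "Y", "Y"),
                  column.name = c("P", "Q", "P", "Q"),
                  matrixelement = c("A", "B", "C", "D"),
                  stringsAsFactors = FALSE)

# Numeric positions: each row of the index matrix is a (row, column) pair
m <- matrix(NA_character_, nrow = max(df$row.index), ncol = max(df$column.index))
m[cbind(df$row.index, df$column.index)] <- df$matrixelement

# Names instead of positions: a character index matrix is matched
# against the dimnames of the target matrix
rn <- unique(df1$row.name)
cn <- unique(df1$column.name)
m1 <- matrix(NA_character_, nrow = length(rn), ncol = length(cn),
             dimnames = list(rn, cn))
m1[cbind(df1$row.name, df1$column.name)] <- df1$matrixelement

m
m1
```

Cells with no matching row in the data frame simply stay NA, which can be handy when the input is sparse.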

Count matching instances between two data frames [closed]

I'm a newbie with R and can't find my answer/anything that works.
I've got two data frames that look like this:
Teams
A
B
C
...
and
TCF
A
B
C
C
B
A
...
I need to count how many times each value in the first data frame's column occurs in the second data frame, and return that count to the first data frame. Thanks in advance!
You could use base R to do this:
sapply(unique(df1$Teams), function(x) sum(df2$TCF %in% x))
#A B C
#2 2 2
Or
setNames(table(match(df2$TCF, unique(df1$Teams))), unique(df1$Teams))
#A B C
#2 2 2
Or using data.table
library(data.table)
setkey(setDT(df1), Teams)
setkey(setDT(df2), TCF)
df2[J(unique(df1$Teams)),.N, by=.EACHI]
# TCF N
#1: A 2
#2: B 2
#3: C 2
data
df1 <- structure(list(Teams = c("A", "B", "C")), .Names = "Teams",
class = "data.frame", row.names = c(NA,-3L))
df2 <- structure(list(TCF = c("A", "B", "C", "C", "B", "A")), .Names = "TCF",
class = "data.frame", row.names = c(NA, -6L))
Would this option be easier on the eyes?
library(dplyr)
df2 %>% count(TCF) %>% filter(TCF %in% unique(df1$Teams))
# Source: local data frame [3 x 2]
# TCF n
# 1 A 2
# 2 B 2
# 3 C 2
Data
df1 <- structure(list(Teams = c("A", "B", "C")), .Names = "Teams", class = "data.frame", row.names = c(NA,
-3L))
df2 <- structure(list(TCF = structure(c(1L, 2L, 3L, 3L, 2L, 1L, 4L,
5L, 5L), .Label = c("A", "B", "C", "X", "Y"), class = "factor")), .Names = "TCF", row.names = c(NA,
-9L), class = "data.frame")
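The question also asks for the count to end up in the first data frame. A minimal sketch of that last step, assuming a new column named n is acceptable (the column name and the helper counts are illustrative, not from the answers above):

```r
df1 <- data.frame(Teams = c("A", "B", "C"), stringsAsFactors = FALSE)
df2 <- data.frame(TCF = c("A", "B", "C", "C", "B", "A", "X", "Y", "Y"),
                  stringsAsFactors = FALSE)

# Restricting the factor levels to the teams of interest drops X and Y
# and would give a count of 0 for any team that never appears in df2
counts <- table(factor(df2$TCF, levels = unique(df1$Teams)))

# Look the counts up by team name and store them next to each team
df1$n <- as.integer(counts[df1$Teams])
df1
#   Teams n
# 1     A 2
# 2     B 2
# 3     C 2
```

Unlike the count() and filter() pipeline, this keeps teams with zero matches in the result.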
