How do I identify by row id the values in a data frame column not in another data frame column? - r

How do I identify, by row id, the values in column c3 of data frame d2 that are not in column c1 of data frame d1? My which() call returns every row when I subset as shown below. I need to keep this bracket sub-setting structure rather than the value$field form, which does work:
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = F)
d2 <- data.frame(c3, c4, stringsAsFactors = F)
x <- unique(d1["c1"])
y <- d2[,"c3"]
id <- which(!(y %in% x) ) # incorrect, all row ids returned
I am trying to find the ids of the rows in y whose value does not appear in x.
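One likely reason the attempt above returns every row: d1["c1"] with single brackets and no comma returns a one-column data frame rather than a vector, so %in% compares each value of y against the whole column as a single list element and nothing matches. A minimal sketch that keeps the bracket style, assuming it is acceptable to extract the column as a plain vector with d1[, "c1"]:

```r
c1 <- c("A", "B", "C", "D", "E")
c2 <- c("a", "b", "c", "d", "e")
c3 <- c("A", "z", "C", "z", "E", "F")
c4 <- c("a", "x", "x", "d", "e", "f")
d1 <- data.frame(c1, c2, stringsAsFactors = FALSE)
d2 <- data.frame(c3, c4, stringsAsFactors = FALSE)

# The comma form drops to a character vector instead of a data frame
x <- unique(d1[, "c1"])
y <- d2[, "c3"]

# Positions in y whose value does not occur in x
id <- which(!(y %in% x))
id
# [1] 2 4 6
```

d1[["c1"]] would extract the same vector if the double-bracket form is preferred.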

I believe setdiff would work here. I see z and F are what you want, right? They are not in d1[,"c1"] but are in d2[,"c3"]
includes <- setdiff(d2[,"c3"], d1[,"c1"])
d2_new <- d2[d2[,"c3"] %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
# or
ids <- rownames(d2[d2[,"c3"] %in% includes,])
Output:
d2_new
# c3 c4 id
#2 z x 2
#4 z d 4
#6 F f 6
ids
#[1] "2" "4" "6"

I had the same problem, and this code worked for me. However, bracket indexing did not work for me; with a slight change to $ indexing it worked perfectly.
includes <- setdiff(d2$c3, d1$c1)  # compare d2's c3 against d1's c1 (d1 has no c3 column)
d2_new <- d2[d2$c3 %in% includes,]
d2_new$id <- rownames(d2_new)
d2_new
Thank you, @jpsmith

Related

Looping in R with dynamic variables as dataframe names

I am trying to loop through dataframes where my search variable is in the name of the dataframe. Here I have multiple dataframes beginning with "person", "place", or "thing" and ending with either "5" or "8." I would like to loop through the many combinations of beginning and ending to create a temporary dataframe. The temporary dataframe will be used to create a plot and save the plot.
When I try my current code, I'm able to get the variable name to loop correctly (in other words, I can get "person_odds5" or "place_odds5"), but I cannot use those variables to access the corresponding column in the dataframe.
My current code is:
person_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=1:4, or_uci95=11:14, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
place_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=5:8, or_uci95=15:18, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
thing_odds5 <- data.frame(odds=c("a", "b", "c", "d"), or_lci95=9:12, or_uci95=19:22, id.exposure=c("f", "g", "h", "i"), id.outcome=c("w", "x", "y", "z"))
nouns <- list("person", "place", "thing")
for (x in nouns) {
pval <- c(5)
for (p in pval) {
name <- paste(x,"_odds",p, sep="")
odds <- paste(name,"$odds", sep="")
temp_dat <- data.frame(odds=odds, index=1:nrow(name))
}
}
When I run this code, my output for "name" is "person_odds5" as character type; my output for "odds" is "person_odds5$odds" as character type, and I encounter "Error in 1:nrow(name) : argument of length 0." Basically, it appears that I can't parse my name assignment through the original dataframe.
Input:
>person_odds5
odds or_lci95 or_uci95 id.exposure id.outcome
1 a 1 11 f w
2 b 2 12 g x
3 c 3 13 h y
4 d 4 14 i z
>
Desired output:
>temp_dat
odds index
1 a 1
2 b 2
3 c 3
4 d 4
>
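The error arises because name holds a character string, not the data frame itself, so nrow(name) is NULL. A minimal sketch of one way around this, assuming base R's get() is acceptable for looking an object up by its name; the results list is a hypothetical container added only for illustration, and the data frames are trimmed to the columns used here:

```r
person_odds5 <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 1:4,
                           stringsAsFactors = FALSE)
place_odds5  <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 5:8,
                           stringsAsFactors = FALSE)
thing_odds5  <- data.frame(odds = c("a", "b", "c", "d"), or_lci95 = 9:12,
                           stringsAsFactors = FALSE)

nouns <- c("person", "place", "thing")
pval <- c(5)
results <- list()  # hypothetical: collects one temp_dat per data frame

for (x in nouns) {
  for (p in pval) {
    name <- paste0(x, "_odds", p)
    dat <- get(name)  # fetch the data frame whose name is stored in `name`
    temp_dat <- data.frame(odds = dat$odds, index = seq_len(nrow(dat)),
                           stringsAsFactors = FALSE)
    results[[name]] <- temp_dat
  }
}

results[["person_odds5"]]
```

Storing the data frames in a named list from the start, and indexing that list by name, is often easier to maintain than building variable names and calling get().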

Exclude common rows in tibbles [duplicate]

This question already has an answer here:
Using anti_join() from the dplyr on two tables from two different databases
(1 answer)
Closed 2 years ago.
I'm looking for a way to join two tibbles so that only the rows unique to the first tibble, or unique to either tibble, are kept: simply those that do not have any matched key.
Let's see example:
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
With common dplyr::join I am not able to get this:
A
1 d
2 e
Is there some way to do this within dplyr, or in the tidyverse in general?
Use the setdiff() function from the dplyr library:
library(dplyr)
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
C <- setdiff(A,B)
Just to add:
setdiff(A,B) returns the rows present in A but not in B.
dplyr::anti_join will keep only the rows of the first argument (tibble or data.frame) that have no match in the second.
library(dplyr)  # also provides tibble()
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
dplyr::anti_join(A, B, by = "A")
# A
# <chr>
# 1 d
# 2 e
A base R possibility (well except the tibble):
A[!A$A %in% B$A,]
returns
# A tibble: 2 x 1
A
<chr>
1 d
2 e
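The answers above cover rows unique to the first table. For the other case the question mentions, rows without a match in either table, a minimal base R sketch follows. It assumes plain data frames instead of tibbles, and the extra value "x" in B is hypothetical, added only so that B has an unmatched row:

```r
A <- data.frame(A = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
B <- data.frame(A = c("a", "b", "c", "x"), stringsAsFactors = FALSE)

# Rows whose key has no match in the other table
only_A <- A[!A$A %in% B$A, , drop = FALSE]
only_B <- B[!B$A %in% A$A, , drop = FALSE]

# Stack both sets of unmatched rows and reset the row names
unmatched <- rbind(only_A, only_B)
rownames(unmatched) <- NULL
unmatched
```

drop = FALSE keeps the result a data frame even though there is only one column.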

How can I make a data.frame using chr vector [duplicate]

This question already has answers here:
aggregating unique values in columns to single dataframe "cell" [duplicate]
(2 answers)
Closed 2 years ago.
I have the data.frame below.
> Chr Chr
> A E
> A F
> A E
> B G
> B G
> C H
> C I
> D E
I want to convert the dataset as shown below.
I want to collapse all the chr values for each key into a single row.
chr chr
A E,F
B G
C H,I
D E
They are all character columns, so I tried several things to get this result.
First, I used the unique function on the first column (FILTER <- unique(chr[,15])) and tried to subset with the resulting FILTER data, combining the pieces with rbind or the bind rows function.
Second, I tested whether my idea works:
FILTER <- unique(Top[,15])
NN <- data.frame()
for(i in 1 :nrow(FILTER)){
result = unique(Top10Data[TGT == FILTER[i]]$`NM`)
print(result)
}
Up to this stage, it seems to work.
The problem is that when I use either function, the resulting data frame has only one column and all the other values (from the second variable of the data.frame above) are ignored.
The functions work when a group has a single value, as in chr[1,1], but groups with several values, as in chr[1,n], cannot be coerced.
Here is my code for reference:
FILTER <- unique(Top[,15])
NN <- data.frame()
for(i in 1 :nrow(FILTER)){
CGONM <- rbind(NN,unique(Top10Data[TGT == FILTER[i]]$`NM`))
}
Base R solutions:
# Solution 1: collapse the unique var2 values within each var1 group
df_str_agg1 <- aggregate(var2 ~ var1, df, FUN = function(x) {
  paste0(unique(x), collapse = ",")
})
# Solution 2: split by var1, build one row per group, then bind the rows
df_str_agg2 <- data.frame(
  do.call("rbind", lapply(split(df, df$var1), function(x) {
    data.frame(var1 = unique(x$var1),
               var2 = paste0(unique(x$var2), collapse = ","))
  })),
  row.names = NULL
)
Tidyverse solution:
library(tidyverse)
df_str_agg3 <-
df %>%
group_by(var1) %>%
summarise(var2 = str_c(unique(var2), collapse = ",")) %>%
ungroup()
Data:
df <- data.frame(var1 = c("A", "A", "A", "B", "B", "C", "C", "D"),
var2 = c("E", "F", "E", "G", "G", "H", "I", "E"), stringsAsFactors = FALSE)
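As a quick check, the sample data and Solution 1 can be put together in one self-contained snippet. The name result is a placeholder chosen here, not part of the answer:

```r
df <- data.frame(var1 = c("A", "A", "A", "B", "B", "C", "C", "D"),
                 var2 = c("E", "F", "E", "G", "G", "H", "I", "E"),
                 stringsAsFactors = FALSE)

# One row per var1, with the distinct var2 values joined by commas
result <- aggregate(var2 ~ var1, df, FUN = function(x) {
  paste0(unique(x), collapse = ",")
})
result
#   var1 var2
# 1    A  E,F
# 2    B    G
# 3    C  H,I
# 4    D    E
```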

Generating distinct groups based on vector/column pairs in R

SEE UPDATE BELOW:
Given a data frame with two columns (x1, x2) representing pairs of objects, I would like to generate groups where all members of each group are paired with all other members in that group. Thus far, I have been able to generate groups by showing all items in x2 that are paired with each item in x1, but this leaves me with groups where a couple of members are only paired with one other group member. I'm having a hard time getting off the ground with this one... Thanks in advance for any help you may have. Please let me know if I should edit this post as I am new to Stack Overflow and new to R coding.
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E")
df <- data.frame(x1, x2)
I would like to go from this df, to an output that looks like df2.
group1 <- c("A")
group2 <- c("B", "C")
group3 <- c("B", "D")
group4 <- c("D", "E")
df2 <- data.frame(cbind.fill(group1, group2, group3, group4, fill = "NULL"))
UPDATE:
Given the following dataset....
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E", "B", "C", "F")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E", "F", "F", "F")
df <- data.frame(x1, x2)
.... I would like to identify groups of x1/x2 where all objects within said group are connected to all other objects of that group.
This is what I have thus far (I'm sure this is riddled with best-practice errors, feel free to call them out. I'm eager to learn)...
n <- nrow(as.data.frame(unique(df$x1)))
RosterGuide <- as.data.frame(matrix(nrow = n , ncol = 1))
RosterGuide$V1 <- seq.int(nrow(RosterGuide))
RosterGuide$Object <- (unique(df$x1))
colnames(RosterGuide) <- c("V1","Object")
groups_frame <- matrix(, ncol= length(n), nrow = length(n))
for (loopItem in 1:nrow(RosterGuide)) {
object <- subset(RosterGuide$Object, RosterGuide$V1 == loopItem)
group <- as.data.frame(subset(df$x2, df$x1 == object))
groups_frame <- cbind.fill(group, groups_frame, fill = "NULL")
}
Groups <- as.data.frame(groups_frame)
Groups <- subset(Groups, select = - c(object))
colnames(Groups) <- RosterGuide$V1
This yields the data frame 'Groups'....
1 2 3 4 5 6
1 F D B B B A
2 NULL E D C C NULL
3 NULL NULL E F D NULL
4 NULL NULL NULL NULL F NULL
... which is exactly what I am looking for, except that if you look at the original df, objects F and D are never paired, rendering group 5 invalid. Also, objects B and E are never paired, rendering group 3 invalid. A valid output should look like this...
1 2 3 4 5
1 D B B B A
2 E D C C NULL
3 NULL NULL NULL F NULL
Question: is there some way that I can relate the groups listed above in the 'Groups' data frame to the original df to remove groups with invalid relationships? This really has me stumped.
For context: What I am really trying to do is group items based on pairwise connections derived from a network of nodes where not all nodes are connected.
Here is one way of doing it in base R using apply and unique:
df <- data.frame(x1, x2, stringsAsFactors = F)
df <- df[df$x1 != df$x2, ]
unique(t(apply(df, 1, sort)))
[,1] [,2]
3 "B" "C"
4 "B" "D"
9 "D" "E"
dplyr:
library(dplyr)
df %>%
dplyr::filter(x1 != x2) %>%
dplyr::filter(!duplicated(paste(pmin(x1,x2), pmax(x1,x2), sep = "-")))
x1 x2
1 B C
2 B D
3 D E
data.table (there might be another better way)
library(data.table)
as.data.table(df)[, .SD[x1 != x2]][, .GRP, by = .(x1 = pmin(x1,x2), x2 = pmax(x1,x2))]
x1 x2 GRP
1: B C 1
2: B D 2
3: D E 3
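The answers above list the distinct pairs. For the follow-up question about discarding candidate groups that contain members who are never paired, a minimal base R sketch is given below. The helpers pair_key and is_fully_paired are hypothetical names introduced here, and the candidate groups are copied by hand from the columns of the 'Groups' data frame in the question. Note that this drops invalid groups entirely; it does not trim them to a valid subset.

```r
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E", "B", "C", "F")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E", "F", "F", "F")
df <- data.frame(x1, x2, stringsAsFactors = FALSE)

# Order-independent key for a pair, so B-C and C-B are the same key
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b), sep = "-")
observed <- unique(pair_key(df$x1, df$x2))

# TRUE when every pair of members in the group appears in df
is_fully_paired <- function(group) {
  if (length(group) < 2) return(TRUE)
  pairs <- combn(group, 2)
  all(pair_key(pairs[1, ], pairs[2, ]) %in% observed)
}

candidates <- list(c("F"), c("D", "E"), c("B", "D", "E"),
                   c("B", "C", "F"), c("B", "C", "D", "F"), c("A"))
valid <- Filter(is_fully_paired, candidates)
valid
```

With this data, the groups B, D, E and B, C, D, F are rejected because B-E and C-D (and D-F) never occur in df.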

Counting number of elements in a character column by levels of a factor column in a dataframe

I am a beginner in R. I have a dataframe in which there are two factor columns. One column is a company column, second is a product column. There are several missing values in product column and so I want to count the number of values in product column for each company (or each level of the company variable). I tried table, and count function in plyr package but they only seem to work with numeric variables. Please help!
Let's say the data frame looks like this:
df <- data.frame(company= c("A", "B", "C", "D", "A", "B", "C", "C", "D", "D"), product = c(1, 1, 2, 3, 4, 3, 3, NA, NA, NA))
So the output I am looking for is -
A 2
B 2
C 3
D 2
Thanks in advance!!
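A dplyr solution (written against the comp/prod column names used in the other answers and in the output below):
library(dplyr)
df %>%
filter(!is.na(prod)) %>%
group_by(comp) %>%
count()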
# A tibble: 4 × 2
comp n
<fctr> <int>
1 A 2
2 B 2
3 C 3
4 D 1
We can use rowsum from base R
with(df, rowsum(+!is.na(prod), comp))
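The rowsum one-liner works because !is.na(prod) is a logical vector and the unary + turns it into 0/1 integers, which rowsum then adds up per group. A self-contained sketch, assuming the comp/prod data used in the other answers; the name res is a placeholder:

```r
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D", "D")
prod <- c(1, 1, 2, 3, 4, 3, 3, 1, NA, NA)
df <- data.frame(comp = comp, prod = prod, stringsAsFactors = FALSE)

# 1 for each non-missing prod, 0 for each NA, summed within each comp
res <- with(df, rowsum(+!is.na(prod), comp))
res
```

The result is a one-column matrix whose row names are the companies, with counts 2, 2, 3 and 1 for A, B, C and D.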
Assuming your df is one of the following:
CASE 1) As given in the question
Data for df:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c(1,1,2,3,4,3,3,1,NA,NA)
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = FALSE)
Program:
df$prodflag <- !is.na(df$prod)
tapply(df$prodflag , df$comp,sum)
Output:
> tapply(df$prodflag , df$comp,sum)
A B C D
2 2 3 1
#########################################################################
CASE 2) If stringsAsFactors is on and prod holds characters, with even the NAs quoted as the string "NA" and stored as factor levels, then you can do:
Data:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c("a","a","b","c","d","c","c","a","NA","NA")
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = T)
Solution:
df$prodflag <- as.numeric(!as.character(df$prod)=="NA")
tapply(df$prodflag , df$comp,sum)
#########################################################################
CASE 3) If prod holds characters and stringsAsFactors is on, but the NAs are not quoted, then you can do:
Data:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c("a","a","b","c","d","c","c","a",NA,NA)
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = T)
Solution:
df$prodflag <- as.numeric(!is.na(df$prod))
tapply(df$prodflag , df$comp,sum)
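The difference between CASE 2 and CASE 3 comes down to how R treats a quoted "NA". A short sketch with made-up three-element factors (prod_quoted and prod_missing are hypothetical names) shows why is.na() is not enough in CASE 2:

```r
prod_quoted  <- factor(c("a", "NA", "b"))  # "NA" is an ordinary string level
prod_missing <- factor(c("a", NA, "b"))    # NA is a real missing value

is.na(prod_quoted)    # FALSE FALSE FALSE: the quoted "NA" is not missing
is.na(prod_missing)   # FALSE  TRUE FALSE

# For the quoted case, compare against the string instead
as.character(prod_quoted) == "NA"   # FALSE  TRUE FALSE
```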
Moral of the story: we should understand our data first, and then we can choose the logic that best suits our need.
