Counting on dataframe in R

I have a data frame like
A B
A E
B E
B C
..
I want to convert it to two data frames.
One counts how many times A, B, C, ... appear in the first column, and the other counts how many times A, B, C, ... appear in the second column.
A 5
B 4
...
Could you give me some suggestions?
Thanks

Try the plyr library:
library(plyr)
myDataFrame <- as.data.frame(cbind( c("A", "A", "B", "B", "B", "C"), c("B", "E", "E", "C", "C", "E") ))
count(myDataFrame[,1]) ##prints counts of first column
count(myDataFrame[,2]) ##prints counts of second column

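We can use lapply to loop over the columns, get the frequencies with table, and convert each result to a data.frame. If separate datasets are needed, use list2env (not recommended). Here df1 is the input data frame; because the results are named df1 and df2, the original df1 in the global environment is overwritten.
list2env(setNames(lapply(df1, function(x) as.data.frame(table(x))),
                  paste0("df", 1:2)), envir = .GlobalEnv)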
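If the separate objects are not really needed, a minimal sketch that keeps the counts in a named list could look like the following. The column names V1 and V2 and the result column names value and n are assumptions made for illustration.

```r
# Sample data mirroring the question's two-column layout (V1/V2 are assumed names)
df1 <- data.frame(
  V1 = c("A", "A", "B", "B", "B", "C"),
  V2 = c("B", "E", "E", "C", "C", "E")
)

# One frequency table per column, kept together as a named list of data frames
counts <- lapply(df1, function(x) {
  as.data.frame(table(value = x), responseName = "n")
})

counts$V1  # counts for the first column
counts$V2  # counts for the second column
```

Keeping the results in a list avoids creating or overwriting objects in the global environment.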

Alternatively, you could use the dplyr library:
library(dplyr)
df <- as.data.frame(cbind( c("A", "A", "B", "B", "B", "C"), c("B", "E", "E", "C", "C", "E") ))
names(df) <- c("V1", "V2")
df <- as_tibble(df) ## tbl_df() is deprecated in current dplyr
df %>% group_by(V1) %>% summarise(c1 = n()) ## for column 1
df %>% group_by(V2) %>% summarise(c1 = n()) ## for column 2


Apply filter criteria to variables that contain/start with certain string in R

I am trying to find a way to filter a data frame by criteria applied to the variables whose names contain a certain string.
In the example below,
I want to find the subjects for whom any test result contains "d".
d=structure(list(ID = c("a", "b", "c", "d", "e"), test1 = c("a", "b", "a", "d", "a"), test2 = c("a", "b", "b", "a", "s"), test3 = c("b", "c", "c", "c", "d"), test4 = c("c", "d", "a", "a", "f")), class = "data.frame", row.names = c(NA, -5L))
I can use dplyr and write the conditions one by one with |, which works for small examples like this, but for my real data it would be time consuming.
library(dplyr)
library(stringr)
d %>% filter(str_detect(test1, "d") | str_detect(test2, "d") | str_detect(test3, "d") | str_detect(test4, "d"))
the output I get shows that subjects b, d and e meet the criteria:
ID test1 test2 test3 test4
1 b b b c d
2 d d a c a
3 e a s d f
The output is what I need, but I was looking for an easier way, for example applying the filter criteria to all variables whose names contain the word "test".
I know about the contains() function in dplyr for selecting certain variables and I tried it here, but it does not work:
d %>% filter(str_detect(contains("test"), "d"))
Is there a way to write this code differently, or another way to achieve the same goal?
Thank you.
In base R you can use lapply/sapply :
d[Reduce(`|`, lapply(d[-1], grepl, pattern = 'd')), ]
#d[rowSums(sapply(d[-1], grepl, pattern = 'd')) > 0, ]
# ID test1 test2 test3 test4
#2 b b b c d
#4 d d a c a
#5 e a s d f
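The base R line above drops the first column by position. A small sketch that instead picks the columns by name prefix, which is closer to what the question asks for, might look like this (test_cols and keep are names introduced here for illustration):

```r
d <- data.frame(
  ID    = c("a", "b", "c", "d", "e"),
  test1 = c("a", "b", "a", "d", "a"),
  test2 = c("a", "b", "b", "a", "s"),
  test3 = c("b", "c", "c", "c", "d"),
  test4 = c("c", "d", "a", "a", "f"),
  stringsAsFactors = FALSE
)

# Logical mask of the columns whose names start with "test"
test_cols <- startsWith(names(d), "test")

# One logical vector per test column, combined row-wise with OR
keep <- Reduce(`|`, lapply(d[test_cols], grepl, pattern = "d", fixed = TRUE))

result <- d[keep, ]
result
```

Using Reduce() rather than rowSums() also works when only a single column matches the prefix.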
If you are interested in a dplyr solution, you can use any of the methods below:
library(dplyr)
library(stringr)
#1.
d %>%
filter_at(vars(starts_with('test')), any_vars(str_detect(., 'd')))
#2.
d %>%
rowwise() %>%
filter(any(str_detect(c_across(starts_with('test')), 'd')))
#3.
d %>%
filter(Reduce(`|`, across(starts_with('test'), ~ str_detect(.x, 'd'))))
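In more recent dplyr versions (1.0.4 or later, as far as I know) there is also if_any(), which expresses "any of these columns matches" directly. A short sketch, assuming dplyr and stringr are installed:

```r
library(dplyr)
library(stringr)

d <- data.frame(
  ID    = c("a", "b", "c", "d", "e"),
  test1 = c("a", "b", "a", "d", "a"),
  test2 = c("a", "b", "b", "a", "s"),
  test3 = c("b", "c", "c", "c", "d"),
  test4 = c("c", "d", "a", "a", "f"),
  stringsAsFactors = FALSE
)

# Keep rows where at least one column starting with "test" contains "d"
result <- d %>%
  filter(if_any(starts_with("test"), ~ str_detect(.x, "d")))

result
```

Swapping if_any() for if_all() would instead keep only the rows where every test column matches.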

Exclude common rows in tibbles [duplicate]

I'm looking for a way to join two tibbles so that only the rows unique to the first tibble, or unique across both tibbles, are kept: simply those that do not have any matched key.
Let's see an example:
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
With the common dplyr joins I am not able to get this:
A
1 d
2 e
Is there some way within dplyr, or in the tidyverse in general, to achieve it?
Use the setdiff() function from the dplyr library:
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
C <- setdiff(A,B)
Just to add:
setdiff(A, B) returns the elements present in A but not in B.
dplyr::anti_join will keep only the rows that are unique to the tibble/data.frame of the first argument.
A <- tibble( A = c("a", "b", "c", "d", "e"))
B <- tibble( A = c("a", "b", "c"))
dplyr::anti_join(A, B, by = "A")
# A
# <chr>
# 1 d
# 2 e
A base R possibility (well except the tibble):
A[!A$A %in% B$A,]
returns
# A tibble: 2 x 1
A
<chr>
1 d
2 e
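The question also mentions rows that are "unique in both tibbles". A minimal base R sketch of that case, using plain data frames and an extra value "x" in B that is not in the original example (added here only so that B has something unmatched):

```r
A <- data.frame(A = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
B <- data.frame(A = c("a", "b", "c", "x"), stringsAsFactors = FALSE)  # "x" is illustrative

# Rows whose key has no match on the other side; drop = FALSE keeps a data frame
only_in_A <- A[!A$A %in% B$A, , drop = FALSE]
only_in_B <- B[!B$A %in% A$A, , drop = FALSE]

# Rows that are unmatched in either direction
unmatched <- rbind(only_in_A, only_in_B)
unmatched
```

With tibbles the drop = FALSE is not needed, since single-column subsetting already returns a tibble, as the output above shows.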

Generating distinct groups based on vector/column pairs in R

SEE UPDATE BELOW:
Given a data frame with two columns (x1, x2) representing pairs of objects, I would like to generate groups where all members of each group are paired with all other members in that group. Thus far, I have been able to generate groups by showing all items in x2 that are paired with each item in x1, but this leaves me with groups where a couple of members are only paired with one other group member. I'm having a hard time getting off the ground with this one... Thanks in advance for any help you may have. Please let me know if I should edit this post as I am new to Stack Overflow and new to R coding.
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E")
df <- data.frame(x1, x2)
I would like to go from this df, to an output that looks like df2.
group1 <- c("A")
group2 <- c("B", "C")
group3 <- c("B", "D")
group4 <- c("D", "E")
df2 <- data.frame(cbind.fill(group1, group2, group3, group4, fill = "NULL"))
UPDATE:
Given the following dataset....
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E", "B", "C", "F")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E", "F", "F", "F")
df <- data.frame(x1, x2)
.... I would like to identify groups of x1/x2 where all objects within said group are connected to all other objects of that group.
This is what I have thus far (I'm sure this is riddled with best-practice errors, feel free to call them out. I'm eager to learn)...
n <- nrow(as.data.frame(unique(df$x1)))
RosterGuide <- as.data.frame(matrix(nrow = n , ncol = 1))
RosterGuide$V1 <- seq.int(nrow(RosterGuide))
RosterGuide$Object <- (unique(df$x1))
colnames(RosterGuide) <- c("V1","Object")
groups_frame <- matrix(, ncol= length(n), nrow = length(n))
for (loopItem in 1:nrow(RosterGuide)) {
object <- subset(RosterGuide$Object, RosterGuide$V1 == loopItem)
group <- as.data.frame(subset(df$x2, df$x1 == object))
groups_frame <- cbind.fill(group, groups_frame, fill = "NULL")
}
Groups <- as.data.frame(groups_frame)
Groups <- subset(Groups, select = - c(object))
colnames(Groups) <- RosterGuide$V1
This yields the data frame 'Groups'....
1 2 3 4 5 6
1 F D B B B A
2 NULL E D C C NULL
3 NULL NULL E F D NULL
4 NULL NULL NULL NULL F NULL
... which is exactly what I am looking for, except that if you look at the original df, objects F and D are never paired, rendering group 5 invalid. Also, objects B and E are never paired, rendering group 3 invalid. A valid output should look like this...
1 2 3 4 5
1 D B B B A
2 E D C C NULL
3 NULL NULL NULL F NULL
Question: is there some way that I can relate the groups listed above in the 'Groups' data frame to the original df to remove groups with invalid relationships? This really has me stumped.
For context: What I am really trying to do is group items based on pairwise connections derived from a network of nodes where not all nodes are connected.
Here is one way of doing it in base R using apply and unique:
df <- data.frame(x1, x2, stringsAsFactors = FALSE)
df <- df[df$x1 != df$x2, ]
unique(t(apply(df, 1, sort)))
[,1] [,2]
3 "B" "C"
4 "B" "D"
9 "D" "E"
With dplyr:
library(dplyr)
df %>%
dplyr::filter(x1 != x2) %>%
dplyr::filter(!duplicated(paste(pmin(x1,x2), pmax(x1,x2), sep = "-")))
x1 x2
1 B C
2 B D
3 D E
data.table (there might be another better way)
library(data.table)
as.data.table(df)[, .SD[x1 != x2]][, .GRP, by = .(x1 = pmin(x1,x2), x2 = pmax(x1,x2))]
x1 x2 GRP
1: B C 1
2: B D 2
3: D E 3
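The answers above reduce the data to unique pairs. For the specific question in the update (dropping candidate groups in which some members are never paired), a rough base R sketch could check every pair inside a group against the original pairs. The names pair_keys, is_valid_group and candidates are hypothetical helpers introduced here, and the candidate groups are copied from columns 2 to 5 of the 'Groups' output in the question.

```r
x1 <- c("A", "B", "B", "B", "C", "C", "D", "D", "D", "E", "E", "B", "C", "F")
x2 <- c("A", "B", "C", "D", "B", "C", "B", "D", "E", "D", "E", "F", "F", "F")

# Order-independent key for every pair seen in the data, e.g. "B-C"
pair_keys <- unique(paste(pmin(x1, x2), pmax(x1, x2), sep = "-"))

# TRUE when every pair of members in the group appears in the original pairs
is_valid_group <- function(members) {
  members <- sort(unique(members))
  if (length(members) < 2) return(TRUE)
  pairs <- combn(members, 2)  # each column is one sorted pair
  all(paste(pairs[1, ], pairs[2, ], sep = "-") %in% pair_keys)
}

candidates <- list(
  `2` = c("D", "E"),
  `3` = c("B", "D", "E"),
  `4` = c("B", "C", "F"),
  `5` = c("B", "C", "D", "F")
)

# Keep only the groups in which all members are connected to each other
valid <- Filter(is_valid_group, candidates)
names(valid)
```

This only validates groups that were already generated; it does not search for all fully connected groups by itself.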

Counting number of elements in a character column by levels of a factor column in a dataframe

I am a beginner in R. I have a dataframe in which there are two factor columns. One column is a company column, second is a product column. There are several missing values in product column and so I want to count the number of values in product column for each company (or each level of the company variable). I tried table, and count function in plyr package but they only seem to work with numeric variables. Please help!
Let's say the data frame looks like this:
df <- data.frame(company= c("A", "B", "C", "D", "A", "B", "C", "C", "D", "D"), product = c(1, 1, 2, 3, 4, 3, 3, NA, NA, NA))
So the output I am looking for is -
A 2
B 2
C 3
D 2
Thanks in advance!!
A dplyr solution.
df %>%
filter(!is.na(product)) %>%
group_by(company) %>%
count()
# A tibble: 4 × 2
comp n
<fctr> <int>
1 A 2
2 B 2
3 C 3
4 D 1
We can use rowsum from base R
with(df, rowsum(+!is.na(prod), comp))
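For comparison, here is a small base R sketch run on the data frame exactly as it is defined in the question (columns company and product). It returns a data frame rather than a matrix; the result column name n is an assumption made here.

```r
df <- data.frame(
  company = c("A", "B", "C", "D", "A", "B", "C", "C", "D", "D"),
  product = c(1, 1, 2, 3, 4, 3, 3, NA, NA, NA)
)

# Sum the non-missing indicator within each company
res <- aggregate(
  list(n = !is.na(df$product)),
  by  = list(company = df$company),
  FUN = sum
)

res
```

On that data the counts come out as A 2, B 2, C 2, D 1, because the last three rows (one for C and two for D) are NA. Companies whose products are all missing are still listed, with a count of 0.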
Assuming your df is:
CASE 1) As given in the question
Data for df:
options(stringsAsFactors = FALSE)
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c(1,1,2,3,4,3,3,1,NA,NA)
df <- data.frame(comp=comp,prod=prod)
Program:
df$prodflag <- !is.na(df$prod)
tapply(df$prodflag , df$comp,sum)
Output:
> tapply(df$prodflag , df$comp,sum)
A B C D
2 2 3 1
#########################################################################
CASE 2) If stringsAsFactors is on and prod is a character column in which even the NAs are quoted strings (so that "NA" becomes a factor level), you can do:
Data:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c("a","a","b","c","d","c","c","a","NA","NA")
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = T)
Solution:
df$prodflag <- as.numeric(!as.character(df$prod)=="NA")
tapply(df$prodflag , df$comp,sum)
#########################################################################
CASE 3) If prod is a character column and stringsAsFactors is on, but the NAs are not quoted, you can do:
Data:
comp <- c("A", "B", "C", "D", "A", "B", "C", "C", "D","D" )
prod <- c("a","a","b","c","d","c","c","a",NA,NA)
df <- data.frame(comp=comp,prod=prod,stringsAsFactors = T)
Solution:
df$prodflag <- as.numeric(!is.na(df$prod))
tapply(df$prodflag , df$comp,sum)
Moral of the story: we should understand our data first, and then we can apply the logic that best suits our needs.

Reorder / arrange bars in a plot(table) while keeping value names

I would like to plot the result of table in a decreasing order, but if I sort the table before plotting it the plot does not show the value names anymore.
a <- data.frame(var = c("A", "A", "B", "B", "B", "B", "B", "C", "D", "D", "D"))
plot(table(a))
plot(sort(table(a)))
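We get the counts with table, take the value names in decreasing order of count, and rebuild the table from a factor with those levels, so that the result is still a table and each name stays attached to its count. Overwriting only the values with tbl[] <- tbl[order(tbl)] would leave the names in their original positions and mislabel the bars. In the OP's code, sort can drop the table class (depending on the R version), after which plot no longer shows the value names.
counts <- table(a$var)
lev <- names(counts)[order(counts, decreasing = TRUE)]
tbl <- table(factor(a$var, levels = lev))
plot(tbl)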
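If a bar chart is acceptable instead of the default table plot, a short sketch with barplot() is another option. It takes the labels from the names of the sorted counts.

```r
a <- data.frame(var = c("A", "A", "B", "B", "B", "B", "B", "C", "D", "D", "D"))

# Counts per value, largest first; the names travel with the counts
tbl <- sort(table(a$var), decreasing = TRUE)

# barplot() labels each bar with the corresponding name
barplot(tbl, ylab = "Count")
```

Checking names(tbl) before plotting is a quick way to confirm that the labels and counts are still aligned.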
