Iterating over groups and counting the matches between data frames - r

Apologies if the title of the question is not so clear.
I have two data frames as below:
df1
NAME FOLLOWS
san big supa
san EAU
san simulate
san spang
glyn guido
glyn claire
glyn vincent
glyn dan
glyn peter
glyn EAU
df2
FOLLOWS
guido
vincent
EAU
EUSC
brian
simulate
peter
I would like to count the matches between df1$FOLLOWS and df2$FOLLOWS for each NAME in df1, and also the length of df1$FOLLOWS (the number of rows) for each NAME in df1. For these data frames, I am expecting output like this:
df3
NAME LENGTH_FOLLOWS COUNT_Match
san 4 2
glyn 6 4
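For reference, a minimal reconstruction of the two example data frames, assuming "big supa" is a single FOLLOWS value:
df1 <- data.frame(
  NAME    = c(rep("san", 4), rep("glyn", 6)),
  FOLLOWS = c("big supa", "EAU", "simulate", "spang",
              "guido", "claire", "vincent", "dan", "peter", "EAU"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  FOLLOWS = c("guido", "vincent", "EAU", "EUSC", "brian", "simulate", "peter"),
  stringsAsFactors = FALSE
)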

You can merge df1 with df2 first, which will keep only the values present in df1. Then you can simply count the instances.
library(sqldf)
sqldf('select NAME, count(NAME) as LENGTH_FOLLOWS, count(Actual_F) as COUNT_Match
       from (select t1.*, t2.FOLLOWS as Actual_F
             from df1 t1 left join df2 t2 on t1.FOLLOWS = t2.FOLLOWS)
       group by NAME')
Or, using base R:
# index of each df1$FOLLOWS value in df2$FOLLOWS (NA when there is no match)
df1$index <- match(df1$FOLLOWS, df2$FOLLOWS)
# count the non-NA values per NAME: FOLLOWS gives the group size, index gives the matches
aggregate(cbind(df1$FOLLOWS, df1$index), by = list(df1$NAME),
          FUN = function(x) length(x[!is.na(x)]))

Here is an option using data.table. Convert the first data.frame to a 'data.table' (setDT(df1)) and join it with 'df2' to create an index column ('ind'). Then, grouped by 'NAME', we get the number of rows (.N) and the sum of the logical vector of non-NA elements in 'ind'.
library(data.table)
setDT(df1)[df2, ind := 1, on = .(FOLLOWS)]
df1[, .(LENGTH_FOLLOWS = .N, COUNT_MATCH = sum(!is.na(ind))), NAME]
# NAME LENGTH_FOLLOWS COUNT_MATCH
#1: san 4 2
#2: glyn 6 4
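For comparison, a hedged dplyr sketch of the same logic (a left join, then a grouped count). Only the column names come from the question; the join assumes df2$FOLLOWS has no duplicates.
library(dplyr)
df1 %>%
  # mark which FOLLOWS values exist in df2, then count per NAME
  left_join(mutate(df2, matched = TRUE), by = "FOLLOWS") %>%
  group_by(NAME) %>%
  summarise(LENGTH_FOLLOWS = n(), COUNT_Match = sum(!is.na(matched)))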

Related

In R, how can I find rows with any values included in a column

I have two data frames:
df1: Names of staff in my organization.
df2: Names of staff in 10 different organizations
I would like to find the people listed in df1 within df2. In particular, I would like to make an additional variable showing whether each name in df2 overlaps with the names in df1 (yes: 1, no: 0).
How should I code this?
Thanks
You can try something like this:
Use data.table to check for matches between df1 and df2 on the staff_names column.
library(data.table)
Manually create the data tables:
df1 <- data.table(staff_names = c("John Appleseed", "Daniel Lewis", "Todd Smith"))
df2 <- data.table(staff_names = c("John Appleseed", "Greg Scott", "Tony Hawk"))
Code:
df3 <- df1[df2, on=c(staff_names="staff_names"), overlap:="1"]
df3[is.na(df3)] <- 0
#> staff_names overlap
#> 1: John Appleseed 1
#> 2: Daniel Lewis 0
#> 3: Todd Smith 0
Created on 2020-08-08 by the reprex package (v0.3.0)
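A hedged base R sketch of the same check with these data, using %in% on the staff_names column (John Appleseed gets 1, the other two get 0):
# 1 if a df1 name also appears in df2, otherwise 0
df1$overlap <- as.integer(df1$staff_names %in% df2$staff_names)
df1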

Splitting a column into two based on conditions, getting the prior row

I have a data frame as follows:
A <- c('Proceed', 'John Smith', 'K University, J.smith#Ku.edu', 'Arun Pandey', 'P.S University, a.pan#ps.ed', 'This is a test')
new <- data.frame(A)
I would like to split column A of the data frame into two columns: 1) one containing the rows that hold an email address (the full row value) and 2) one containing the name, which appears in the row before the email address row. Any suggestions?
email name
K University, J.smith#Ku.edu John Smith
P.S University, a.pan#ps.ed Arun Pandey
Get the index of the rows where the 'A' column contains the # character with grep. Then use it to subset the rows of the data when creating the two-column dataset:
i1 <- grep("#", new$A)
data.frame(email = new$A[i1], name = new$A[i1-1])
# email name
#1 K University, J.smith#Ku.edu John Smith
#2 P.S University, a.pan#ps.ed Arun Pandey

Deleting duplicates in R, changing remainder

I have a fairly straightforward question, but I am very new to R and struggling a little. Basically I need to delete duplicate rows and then change the remaining unique row based on the number of duplicates that were deleted.
In the original file I have directors and the company boards they sit on, with directors appearing as a new row for each company. I want to have each director appear only once, but with column that lists the number of their board seats (so 1 + the number of duplicates that were removed) and a column that lists the names of all companies on which they sit.
So I want to go from the first layout to the second.
Bonus if I can also get the code to list each director's "home company" as the company on which she/he is an executive rather than an outsider.
Thanks so very much in advance!
N
You could use the ddply function from the plyr package:
#First I will enter a part of your original data frame
Name <- c('Abbot, F', 'Abdool-Samad, T', 'Abedian, I', 'Abrahams, F', 'Abrahams, F', 'Abrahams, F')
Position <- c('Executive Director', 'Outsider', 'Outsider', 'Executive Director','Outsider', 'Outsider')
Companies <- c('ARM', 'R', 'FREIT', 'FG', 'CG', 'LG')
NoBoards <- c(1,1,1,1,1,1)
df <- data.frame(Name, Position, Companies, NoBoards)
# Then you could concatenate the Positions and Companies for each Name
library(plyr)
sumPosition <- ddply(df, .(Name), summarize, Position = paste(Position, collapse=", "))
sumCompanies <- ddply(df, .(Name), summarize, Companies = paste(Companies, collapse=", "))
# Merge the results into one data frame using the Name to join them
df2 <- merge(sumPosition, sumCompanies, by = 'Name')
# Summarize the NoBoards count for each Name
names_NoBoards <- aggregate(df$NoBoards, by = list(df$Name), sum)
names(names_NoBoards) <- c('Name', 'NoBoards')
# Merge the result with df2
df3 <- merge(df2, names_NoBoards, by = 'Name')
You get something like this:
Name Position Companies NoBoards
1 Abbot, F Executive Director ARM 1
2 Abdool-Samad, T Outsider R 1
3 Abedian, I Outsider FREIT 1
4 Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG 3
In order to list each director's "home company" as the company on which she/he is an executive rather than an outsider, you could use the following code:
ExecutiveDirector <- df[df$Position == 'Executive Director', c('Name', 'Companies')]
df4 <- merge(df3, ExecutiveDirector, by = 'Name', all.x = TRUE)
You get the following data frame:
Name Position Companies.x NoBoards Companies.y
1 Abbot, F Executive Director ARM 1 ARM
2 Abdool-Samad, T Outsider R 1 <NA>
3 Abedian, I Outsider FREIT 1 <NA>
4 Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG 3 FG
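For comparison, a hedged dplyr sketch that builds the summary (including the "home company" bonus) in a single pass. The column names come from the example df above; everything else is an assumption.
library(dplyr)
df %>%
  group_by(Name) %>%
  summarise(
    # the company where the director is an Executive Director ("" when there is none)
    HomeCompany = paste(Companies[Position == "Executive Director"], collapse = ", "),
    Positions   = paste(Position, collapse = ", "),
    Companies   = paste(Companies, collapse = ", "),
    NoBoards    = sum(NoBoards)
  )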

Delete rows with matching words in the same column, and matching values in multiple columns

I have a data frame with over 20000 rows (data3) and a column named "collector". In this column I have strings of words, for example: "Ruiz Galvis Marta". I need to compare each row with all the other rows in my data frame and delete those rows in which one or more of the words in the df$collector column match the words in the same column of another row, and in which the values in the "sample" and "number" columns also match. That is:
INPUT:
Collector Times sample number
Ruiz Galvis Marta 9 SP.1 one
Smith et al Marta 8 SP.2 two
Ruiz Andres Allan 4 SP.1 one
EXPECTED OUTPUT
Collector Times sample number
Smith et al Marta 8 SP.2 two
Thanks for any help!
Probably going to be slow as hell, but:
dd <- data.frame(Collector = c('Ruiz Galvis Marta', 'Smith et al Marta', 'Ruiz Andres Allan'),
stringsAsFactors = FALSE)
## create a matrix with the words by column
tt <- strsplit(dd$Collector, '\\s+')
mm <- do.call('rbind', lapply(tt, `length<-`, max(lengths(tt))))
## remove all duplicates
dd[rowSums(apply(mm, 2, function(x)
duplicated(x) | duplicated(x, fromLast = TRUE))) == 0, ]
# [1] "Smith et al Marta"
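A hedged sketch that also enforces the sample/number conditions from the question: a row is dropped when it shares at least one Collector word with another row that has the same sample and number. The values are reconstructed from the question's example; the pairwise check is roughly quadratic, so it will also be slow on 20000 rows.
dd <- data.frame(
  Collector = c('Ruiz Galvis Marta', 'Smith et al Marta', 'Ruiz Andres Allan'),
  Times     = c(9, 8, 4),
  sample    = c('SP.1', 'SP.2', 'SP.1'),
  number    = c('one', 'two', 'one'),
  stringsAsFactors = FALSE)
words <- strsplit(dd$Collector, '\\s+')
n <- nrow(dd)
## for each row, check every other row for same sample, same number, and a shared word
drop <- vapply(seq_len(n), function(i) {
  any(vapply(seq_len(n)[-i], function(j) {
    dd$sample[i] == dd$sample[j] &&
      dd$number[i] == dd$number[j] &&
      length(intersect(words[[i]], words[[j]])) > 0
  }, logical(1)))
}, logical(1))
dd[!drop, ]
#           Collector Times sample number
# 2 Smith et al Marta     8   SP.2    two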

R: Subset by multiple criteria filter

I have df1:
City Freq
Seattle 20
San Jose 10
SEATTLE 5
SAN JOSE 15
Miami 12
I created this dataframe using table(df)
I have another df2:
City
San Jose
Miami
I want to subset df1 where the city values in df1 are equal to those in df2. This df2 is only a sample, so I can't use an OR condition (" | ") because I have many different criteria. Perhaps I could convert this df2 into a vector... but I'm not sure how to do this. as.vector() doesn't seem to work.
I thought about using
subset(df1, City == df2)
but this gives me errors.
Also, if you guys could show me a way to make this case-insensitive, such that "San Jose" and "SAN JOSE" are added together, that would be even better!
If I use "toupper / tolower", I get the error: invalid multibyte
Thanks in advance!!
Here are a few more methods.
R Code:
# Method 1: using dplyr package
library(dplyr)
filter(df1, tolower(df1$City) %in% tolower(df2$City))
df1 %>% filter(tolower(df1$City) %in% tolower(df2$City))
# Method 2: using which function
df1[ which( tolower(df1$City) %in% tolower(df2$City)) , ]
# Method 3:
df1[(tolower(df1$City) %in% tolower(df2$City)), ]
Output:
City Freq
2 San Jose 10
4 SAN JOSE 15
5 Miami 12
Hope this helps.
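And a hedged follow-up for the "added together" part of the question: after subsetting, fold the case and sum the Freq values. The column names come from the question; the rest is an assumption.
df1_sub <- df1[tolower(df1$City) %in% tolower(df2$City), ]
df1_sub$City <- tolower(df1_sub$City)   # so "San Jose" and "SAN JOSE" collapse together
aggregate(Freq ~ City, data = df1_sub, FUN = sum)
#       City Freq
# 1    miami   12
# 2 san jose   25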
