I have two data frames:
df1: Names of staff in my organization.
df2: Names of staff in 10 different organizations.
I would like to find which people listed in df1 also appear in df2. In particular, I would like to add a variable showing whether each name in df2 also appears in df1 (yes: 1, no: 0).
How should I code this?
Thanks
You can try something like this:
Use data.table to check for matches between df1 and df2 on the staff_names column.
library(data.table)
Manually create data tables
df1 <- data.table(staff_names = c("John Appleseed", "Daniel Lewis", "Todd Smith"))
df2 <- data.table(staff_names = c("John Appleseed", "Greg Scott", "Tony Hawk"))
Code:
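# Join on staff_names and set overlap to "1" for the rows of df1 that also appear in df2
df3 <- df1[df2, on = c(staff_names = "staff_names"), overlap := "1"]
# Rows without a match have NA in overlap; replace those with 0
df3[is.na(df3)] <- 0
df3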
#> staff_names overlap
#> 1: John Appleseed 1
#> 2: Daniel Lewis 0
#> 3: Todd Smith 0
Created on 2020-08-08 by the reprex package (v0.3.0)
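Note that the question asks for the flag on df2, while the join above adds it to df1. A minimal base R sketch that puts the flag on df2, assuming the names must match exactly (same spelling, case and spacing) and reusing the sample data from the answer:

```r
# Sample data mirroring the answer above (plain data frames, no packages needed)
df1 <- data.frame(staff_names = c("John Appleseed", "Daniel Lewis", "Todd Smith"))
df2 <- data.frame(staff_names = c("John Appleseed", "Greg Scott", "Tony Hawk"))

# 1 if the df2 name also appears in df1, otherwise 0
df2$overlap <- as.integer(df2$staff_names %in% df1$staff_names)
df2
#>      staff_names overlap
#> 1 John Appleseed       1
#> 2     Greg Scott       0
#> 3      Tony Hawk       0
```

If the real names differ in case or have stray whitespace, wrapping both sides in tolower() and trimws() before the comparison may help.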
I have a data frame as follows:
A<- c ('Proceed', 'John Smith', 'K University, J.smith#Ku.edu', 'Arun Pandey', 'P.S University, a.pan#ps.ed', 'This is a test')
new <- data.frame (A)
I would like to split column A of the data frame into two columns: 1) one containing the email addresses (from every row of the data frame that has one) and 2) one containing the name, which appears in the row just before the email address row. Any suggestions?
email                         name
K University, J.smith#Ku.edu  John Smith
P.S University, a.pan#ps.ed   Arun Pandey
Use grep to get the index of the rows where column 'A' contains the # character. Then use that index to subset the rows of the data when creating the two-column dataset:
i1 <- grep("#", new$A)
data.frame(email = new$A[i1], name = new$A[i1-1])
#                         email        name
#1 K University, J.smith#Ku.edu  John Smith
#2  P.S University, a.pan#ps.ed Arun Pandey
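The approach above assumes every email row has a name row directly before it. If the very first row could contain a #, the index i1 - 1 would include 0, which R silently drops, so the two columns would end up with different lengths. A small sketch that guards against this; the sample vector here is hypothetical and only illustrates the edge case:

```r
# Hypothetical data where the first row already contains a "#"
A <- c("x#y.edu", "Proceed", "John Smith", "K University, J.smith#Ku.edu")
new <- data.frame(A, stringsAsFactors = FALSE)

# fixed = TRUE treats "#" as a literal character rather than a regex
i1 <- grep("#", new$A, fixed = TRUE)

# Keep only matches that actually have a preceding row
i1 <- i1[i1 > 1]

res <- data.frame(email = new$A[i1], name = new$A[i1 - 1])
res
```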
I have a fairly straightforward question, but I am very new to R and struggling a little. Basically I need to delete duplicate rows and then change the remaining unique row based on the number of duplicates that were deleted.
In the original file I have directors and the company boards they sit on, with directors appearing as a new row for each company. I want each director to appear only once, but with a column that lists the number of their board seats (so 1 + the number of duplicates that were removed) and a column that lists the names of all companies on whose boards they sit.
So I want to go from this:
To this
Bonus if I can also get the code to list the director's "home company" as the company on which she/he is an executive rather than an outsider.
Thanks so very much in advance!
N
You could use the ddply function from the plyr package
#First I will enter a part of your original data frame
Name <- c('Abbot, F', 'Abdool-Samad, T', 'Abedian, I', 'Abrahams, F', 'Abrahams, F', 'Abrahams, F')
Position <- c('Executive Director', 'Outsider', 'Outsider', 'Executive Director','Outsider', 'Outsider')
Companies <- c('ARM', 'R', 'FREIT', 'FG', 'CG', 'LG')
NoBoards <- c(1,1,1,1,1,1)
df <- data.frame(Name, Position, Companies, NoBoards)
# Then you could concatenate the Positions and Companies for each Name
library(plyr)
sumPosition <- ddply(df, .(Name), summarize, Position = paste(Position, collapse=", "))
sumCompanies <- ddply(df, .(Name), summarize, Companies = paste(Companies, collapse=", "))
# Merge the results into one data frame, using the name to join them
df2 <- merge(sumPosition, sumCompanies, by = 'Name')
# Sum the number of boards for each Name
names_NoBoards <- aggregate(df$NoBoards, by = list(df$Name), sum)
names(names_NoBoards) <- c('Name', 'NoBoards')
# Merge the result with df2
df3 <- merge(df2, names_NoBoards, by = 'Name')
You get something like this:
  Name             Position                                Companies   NoBoards
1 Abbot, F         Executive Director                      ARM         1
2 Abdool-Samad, T  Outsider                                R           1
3 Abedian, I       Outsider                                FREIT       1
4 Abrahams, F      Executive Director, Outsider, Outsider  FG, CG, LG  3
To also list each director's "home company", that is, the company on which she/he is an executive rather than an outsider, you could use the following code:
# Index with df$Position so the filter does not depend on a Position vector in the global environment
ExecutiveDirector <- df[df$Position == 'Executive Director', c(1, 3)]
df4 <- merge(df3, ExecutiveDirector, by = 'Name', all.x = TRUE)
You get the following data frame:
  Name             Position                                Companies.x  NoBoards  Companies.y
1 Abbot, F         Executive Director                      ARM          1         ARM
2 Abdool-Samad, T  Outsider                                R            1         <NA>
3 Abedian, I       Outsider                                FREIT        1         <NA>
4 Abrahams, F      Executive Director, Outsider, Outsider  FG, CG, LG   3         FG
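For comparison, here is a more compact base R sketch of the same idea that avoids the intermediate merges. It reuses the sample rows from above; the names out, exec and HomeCompany are made up for illustration, and it assumes each director is an executive at no more than one company (match() keeps only the first hit).

```r
df <- data.frame(
  Name = c("Abbot, F", "Abdool-Samad, T", "Abedian, I",
           "Abrahams, F", "Abrahams, F", "Abrahams, F"),
  Position = c("Executive Director", "Outsider", "Outsider",
               "Executive Director", "Outsider", "Outsider"),
  Companies = c("ARM", "R", "FREIT", "FG", "CG", "LG"),
  stringsAsFactors = FALSE
)

# One row per director with all of their companies pasted together
out <- aggregate(Companies ~ Name, data = df, FUN = paste, collapse = ", ")

# Number of board seats = number of rows the director had in df
out$NoBoards <- as.vector(table(df$Name)[out$Name])

# Home company: the company where the director is an executive (NA if none)
exec <- df[df$Position == "Executive Director", ]
out$HomeCompany <- exec$Companies[match(out$Name, exec$Name)]

out
```

Counting rows per name with table() also removes the need for a helper column of 1s such as NoBoards in the original data.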
I have a data frame with over 20000 rows (data3) with a column named "collector". This column holds strings of words, for example: "Ruiz Galvis Marta". I need to compare each row with all other rows in my data frame, and delete those rows in which one or more words in the collector column match words in the same column of another row that also has the same values in the columns "sample" and "number". That is:
INPUT:
Collector          Times  sample  number
Ruiz Galvis Marta  9      SP.1    one
Smith et al Marta  8      SP.2    two
Ruiz Andres Allan  4      SP.1    one
EXPECTED OUTPUT:
Collector          Times  sample  number
Smith et al Marta  8      SP.2    two
Thanks for any help!
Probably going to be slow as hell but
dd <- data.frame(Collector = c('Ruiz Galvis Marta', 'Smith et al Marta', 'Ruiz Andres Allan'),
stringsAsFactors = FALSE)
## create a matrix with the words by column
tt <- strsplit(dd$Collector, '\\s+')
mm <- do.call('rbind', lapply(tt, `length<-`, max(lengths(tt))))
## remove all duplicates
dd[rowSums(apply(mm, 2, function(x)
duplicated(x) | duplicated(x, fromLast = TRUE))) == 0, ]
# [1] "Smith et al Marta"
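The code above compares words position by position and does not look at the sample and number columns. A rough sketch that also takes those two columns into account, assuming "match" means sharing at least one whole word (case-sensitive) with another row that has the same sample and number; the helper names key and shares_word are made up for this example:

```r
dd <- data.frame(
  Collector = c("Ruiz Galvis Marta", "Smith et al Marta", "Ruiz Andres Allan"),
  Times = c(9, 8, 4),
  sample = c("SP.1", "SP.2", "SP.1"),
  number = c("one", "two", "one"),
  stringsAsFactors = FALSE
)

# Split each collector string into its words
words <- strsplit(dd$Collector, "\\s+")

# Rows can only match each other when sample and number are identical
key <- paste(dd$sample, dd$number, sep = "|")

# TRUE when a row shares at least one word with another row of the same group
shares_word <- vapply(seq_len(nrow(dd)), function(i) {
  others <- setdiff(which(key == key[i]), i)
  any(unlist(words[others]) %in% words[[i]])
}, logical(1))

# Keep only the rows that share no word within their group
dd[!shares_word, ]
```

On the three sample rows this keeps only "Smith et al Marta": it shares "Marta" with the first row, but that row has a different sample and number. Each row is compared only with the rows of its own group, so the cost depends on how large the groups are.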
I have df1:
City Freq
Seattle 20
San Jose 10
SEATTLE 5
SAN JOSE 15
Miami 12
I created this dataframe using table(df)
I have another df2:
City
San Jose
Miami
I want to subset df1 to the rows whose city values are equal to those in df2. This df2 is only a sample, so I can't use an OR condition ( " | " ) because I have many different criteria. Perhaps I could convert this df2 into a vector, but I'm not sure how to do this. as.vector() doesn't seem to work.
I thought about using
subset(df1, City == df2)
but this gives me errors.
Also, if you guys could get me a way to make this case insensitive such that "San Jose" and "SAN JOSE" are added together, that would be even better!
If I use "toupper / tolower", I get the error: invalid multibyte
Thanks in advance!!
Here are a few more methods
R Code:
# Method 1: using dplyr package
library(dplyr)
filter(df1, tolower(City) %in% tolower(df2$City))
df1 %>% filter(tolower(City) %in% tolower(df2$City))
# Method 2: using which function
df1[ which( tolower(df1$City) %in% tolower(df2$City)) , ]
# Method 3:
df1[(tolower(df1$City) %in% tolower(df2$City)), ]
Output:
City Freq
2 San Jose 10
4 SAN JOSE 15
5 Miami 12
Hope this helps.
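The methods above keep "San Jose" and "SAN JOSE" as separate rows. To also add their frequencies together, as the question asks, one option is to aggregate on a lower-cased key. A minimal sketch using the sample data from the question; the column name city_key is made up for this example:

```r
df1 <- data.frame(
  City = c("Seattle", "San Jose", "SEATTLE", "SAN JOSE", "Miami"),
  Freq = c(20, 10, 5, 15, 12),
  stringsAsFactors = FALSE
)
df2 <- data.frame(City = c("San Jose", "Miami"), stringsAsFactors = FALSE)

# Normalise case once, then keep only the cities listed in df2
df1$city_key <- tolower(df1$City)
keep <- df1[df1$city_key %in% tolower(df2$City), ]

# Add up the frequencies of spellings that differ only by case
totals <- aggregate(Freq ~ city_key, data = keep, FUN = sum)
totals
#>   city_key Freq
#> 1    miami   12
#> 2 san jose   25
```

If tolower() stops with an "invalid multibyte" error on the real data, the strings probably contain bytes that are not valid in the current encoding. Converting them first, for example with iconv(), may help, but the right source encoding depends on where the data came from.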