Deleting duplicates in R, changing remainder - r

I have a fairly straightforward question, but I am very new to R and struggling a little. Basically, I need to delete duplicate rows and then change the remaining unique row based on the number of duplicates that were deleted.
In the original file I have directors and the company boards they sit on, with each director appearing in a new row for each company. I want each director to appear only once, but with a column that lists the number of their board seats (so 1 + the number of duplicates that were removed) and a column that lists the names of all companies on whose boards they sit.
So I want to go from this:
To this
Bonus if I can also get the code to list each director's "home company", i.e. the company on which she/he is an executive rather than an outsider.
Thanks so very much in advance!
N

You could use the ddply function from the plyr package:
#First I will enter a part of your original data frame
Name <- c('Abbot, F', 'Abdool-Samad, T', 'Abedian, I', 'Abrahams, F', 'Abrahams, F', 'Abrahams, F')
Position <- c('Executive Director', 'Outsider', 'Outsider', 'Executive Director','Outsider', 'Outsider')
Companies <- c('ARM', 'R', 'FREIT', 'FG', 'CG', 'LG')
NoBoards <- c(1,1,1,1,1,1)
df <- data.frame(Name, Position, Companies, NoBoards)
# Then you could concatenate the Positions and Companies for each Name
library(plyr)
sumPosition <- ddply(df, .(Name), summarize, Position = paste(Position, collapse=", "))
sumCompanies <- ddply(df, .(Name), summarize, Companies = paste(Companies, collapse=", "))
# Merge the results into one data frame, using the name to join them
df2 <- merge(sumPosition, sumCompanies, by = 'Name')
# Sum the number of boards (NoBoards) for each Name
names_NoBoards <- aggregate(df$NoBoards, by = list(df$Name), sum)
names(names_NoBoards) <- c('Name', 'NoBoards')
# Merge the result with df2
df3 <- merge(df2, names_NoBoards, by = 'Name')
You get something like this
             Name                               Position  Companies NoBoards
1        Abbot, F                     Executive Director        ARM        1
2 Abdool-Samad, T                               Outsider          R        1
3      Abedian, I                               Outsider      FREIT        1
4     Abrahams, F Executive Director, Outsider, Outsider FG, CG, LG        3
To list each director's "home company" (the company on which she/he is an executive rather than an outsider), you could use the following code:
# Index through df$Position so the filter does not depend on the global Position vector
ExecutiveDirector <- df[df$Position == 'Executive Director', c('Name', 'Companies')]
df4 <- merge(df3, ExecutiveDirector, by = 'Name', all.x = TRUE)
You get the following data frame:
             Name                               Position Companies.x NoBoards Companies.y
1        Abbot, F                     Executive Director         ARM        1         ARM
2 Abdool-Samad, T                               Outsider           R        1        <NA>
3      Abedian, I                               Outsider       FREIT        1        <NA>
4     Abrahams, F Executive Director, Outsider, Outsider  FG, CG, LG        3          FG
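As a rough alternative, the same summary can be sketched in base R in a single pass by splitting the data frame per director. This is only a sketch under the assumption that the sample rows above are representative; the column name HomeCompany is made up for illustration and is not part of the original data.

```r
# Sample rows taken from the answer above
df <- data.frame(
  Name      = c("Abbot, F", "Abdool-Samad, T", "Abedian, I",
                "Abrahams, F", "Abrahams, F", "Abrahams, F"),
  Position  = c("Executive Director", "Outsider", "Outsider",
                "Executive Director", "Outsider", "Outsider"),
  Companies = c("ARM", "R", "FREIT", "FG", "CG", "LG"),
  stringsAsFactors = FALSE
)

# Build one summary row per director
per_director <- lapply(split(df, df$Name), function(d) {
  # Companies where this director is an executive (may be empty)
  home <- d$Companies[d$Position == "Executive Director"]
  data.frame(
    Name        = d$Name[1],
    NoBoards    = nrow(d),                            # one row per board seat
    Companies   = paste(d$Companies, collapse = ", "),
    HomeCompany = if (length(home) > 0) home[1] else NA_character_,  # hypothetical column name
    stringsAsFactors = FALSE
  )
})

# Stack the per-director rows into one data frame
result <- do.call(rbind, per_director)
rownames(result) <- NULL
result
```

If a director is an executive at more than one company, this sketch keeps only the first one; that choice is an assumption, since the question does not say how such cases should be handled.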

Related

RStudio - Multiple name for one id problem

I have a data frame that looks like this (called df1):
trip_id  station_id  station_name
id123    s01         A Street
id385    s02         B Street
id332    s01         C Street
id423    s01         A Street
The problem is there is an inconsistency with the station name column (multiple names for one id) and I want to correct it based on the most popular name used with the same station id. For example, in the table above, all rows with station id = "s01" must have station name = "A Street" (since A Street occurred 2 times and C Street occurred only once).
The result should look like this:
trip_id  station_id  station_name
id123    s01         A Street
id385    s02         B Street
id332    s01         A Street
id423    s01         A Street
All I'm able to do so far is to extract a list of station id with more than 1 name:
dupl_list <- unique(df1[,c("station_id","station_name")]) %>% group_by (station_id) %>% count() %>% filter(n>1)
Thx for reading
Using base R you can do this:
# data
df1 <- data.frame(trip_id=c('id123', 'id332', 'id385', 'id423'),
                  station_id=c('s01', 's02', 's01', 's01'),
                  station_name=c('A Street', 'B Street', 'C Street', 'A Street'),
                  date_of_trip=c(1, 1, 3, 2))
# most common name for each id (alphabetically lowest in case of ties)
id.name <- tapply(df1$station_name, df1$station_id, function(x) {
  tab <- table(x)
  names(which.max(tab))
})
df1$station_name <- id.name[df1$station_id]
dplyr way:
df1 <- df1 %>%
  group_by(station_id) %>%
  mutate(station_name = names(which.max(table(station_name))))
According to the most recent trip:
df1 <- df1 %>%
  group_by(station_id) %>%
  mutate(station_name = station_name[which.max(date_of_trip)])

R! mutate conditional and list intersect (How many time was a player on the court ?)

This is a sports analysis question: how many times was a player on the court?
I have a list of players I am interested in
names <- c('John','Bill','Peter')
and a list of actions during multiple matches
team <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team,player1,player2)
I want to build a column that lists how many actions each player was on the court for
actions_when_player_on_court <- df %>% group_by(team) %>%
calculate({nb of observation where the player is either player1 or player2} )
so I end up with a new list like
actions_when_player_on_court <- c(4,3,1)
so I can create a new DF like this
new_df <- data.frame(names,actions_when_player_on_court)
where John appears 4 times on the court, Bill three times, and Peter once
I feel I may need to intersect the names and c(player1,player2) especially if
names are unique - John, Bill and Peter cannot belong to other teams and are unique in df
I may have 0 to n players on the field so 0 to n column (player1, player2... playern)
The following code should do what you need.
We first need to create a new data frame to store all names and an empty actions_when_player_on_court variable.
names = c()
for (i in 2:ncol(df)) {
  names = c(names, unique(df[,i]))
}
names = data.frame(name = unique(names), actions_when_player_on_court = 0)
Then, we can fill the actions_when_player_on_court variable using a for loop:
df$n = 1
for (i in 2:(ncol(df)-1)) {
  tmp = aggregate(cbind(n = n) ~ df[, i], data = df[, c(i, ncol(df))], FUN="sum")
  names(tmp)[1] = "name"
  names = merge(names, tmp, all=T)
  names[is.na(names)] = 0
  names$actions_when_player_on_court = names$actions_when_player_on_court + names$n
  names = names[-ncol(names)]
}
You can have as many players as you want, as long as they start in the second column and run until the end of the data frame. Note that the resulting data frame does not include the team variable. I think you can deal with that yourself. Here is the result:
> names
name actions_when_player_on_court
1 Bill 3
2 Bob 1
3 John 4
4 Mike 5
5 Peter 1
6 Steve 2
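For comparison, here is a shorter sketch that counts only the players of interest by pooling every player column and tabulating it. It assumes, as in the question, that every column other than team holds player names; the names players_of_interest and player_cols are introduced here for illustration only.

```r
team    <- c('teama','teama','teama','teama','teama','teama','teamb','teamb')
player1 <- c('John', 'John', 'John', 'Bill', 'Mike', 'Mike', 'Steve', 'Steve')
player2 <- c('Mike', 'Mike', 'Mike', 'John', 'Bill', 'Bill', 'Peter', 'Bob')
df <- data.frame(team, player1, player2, stringsAsFactors = FALSE)

players_of_interest <- c('John', 'Bill', 'Peter')

# Assumption: every column except `team` is a player column (player1 ... playern)
player_cols <- setdiff(names(df), "team")

# Pool all player columns into one vector; restricting the factor levels
# keeps only the players of interest and yields 0 for anyone never on court
on_court <- factor(unlist(df[player_cols], use.names = FALSE),
                   levels = players_of_interest)

new_df <- data.frame(
  names = players_of_interest,
  actions_when_player_on_court = as.vector(table(on_court))
)
new_df
```

Because the counts come from a factor with fixed levels, the result keeps the order of players_of_interest and scales to any number of player columns without a loop.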

How to go from long to wide dataframe in R with multiple values separated by a comma in focal column [duplicate]

Say I have a list of movies with their directors. I want to convert these directors to dummy variables (i.e. if a director directs a movie, they have their own column with a 1, if they don't direct that movie then that column has a zero). This is tricky because there are occasionally movies with two directors. See example below. df is the data I have, df2 is what I want.
movie <- c("Star Wars V", "Jurassic Park", "Terminator 2")
budget <- c(100,300,400)
director <- c("George Lucas, Lawrence Kasdan", "Steven Spielberg", "Steven Spielberg")
df <- data.frame(movie,budget,director)
df
movie <- c("Star Wars V", "Jurassic Park", "Terminator 2")
budget <- c(100,300,400)
GeorgeLucas <- c(1,0,0)
LawrenceKasdan <- c(1,0,0)
StevenSpielberg <- c(0,1,1)
df2 <- data.frame(movie, budget, GeorgeLucas, LawrenceKasdan, StevenSpielberg)
df2
One option is cSplit_e from the splitstackshape package:
library(splitstackshape)
library(dplyr)
library(stringr)
cSplit_e(df, 'director', sep=", ", type = 'character', fill = 0, drop = TRUE) %>%
rename_at(vars(starts_with('director_')), ~ str_remove(., 'director_'))
# movie budget George Lucas Lawrence Kasdan Steven Spielberg
#1 Star Wars V 100 1 1 0
#2 Jurassic Park 300 0 0 1
#3 Terminator 2 400 0 0 1
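If installing extra packages is not an option, the same dummy columns can be sketched in base R with strsplit. This is only an illustrative alternative, not the method from the answer above; it assumes directors are always separated by a comma, and the helper names director_list and all_directors are made up here.

```r
movie    <- c("Star Wars V", "Jurassic Park", "Terminator 2")
budget   <- c(100, 300, 400)
director <- c("George Lucas, Lawrence Kasdan", "Steven Spielberg", "Steven Spielberg")
df <- data.frame(movie, budget, director, stringsAsFactors = FALSE)

# Split each movie's director string on commas (plus optional whitespace)
director_list <- strsplit(df$director, ",\\s*")

# Every distinct director becomes one dummy column
all_directors <- sort(unique(unlist(director_list)))

# One row per movie: 1 if the director worked on it, 0 otherwise
dummies <- do.call(rbind, lapply(director_list, function(d) {
  as.integer(all_directors %in% d)
}))
colnames(dummies) <- gsub(" ", "", all_directors, fixed = TRUE)

df2 <- cbind(df[c("movie", "budget")], dummies)
df2
```

Stripping the spaces from the column names gives GeorgeLucas, LawrenceKasdan and StevenSpielberg, matching the layout requested in the question.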

How to remove dupes and replace column variables

I'm working with a data set named CCCrn on candidates in a local election with some duplicate values. Here's a sample:
Adam Hill 4100 New Texas Rd. Pittsburgh 15239 School Director PLUM Democratic 4 5
Adam Hill 4100 New Texas Rd. Pittsburgh 15239 School Director PLUM Republican 4 5
As you can see, this candidate cross-listed and was on both parties' ballots. I'd like to remove one of the rows, and then edit the Party variable to say "Cross Listed".
Obviously unique and distinct haven't been much help. I tried
test <- CCCrn[!duplicated(CCCrn$Name), ] which succeeded in removing the duplicate candidates, but now I'm not sure how I would go back and edit the "Party" variable.
Create a flag for duplicated records, drop the duplicates, then update the party:
df <- df %>% mutate(dup = ifelse(duplicated(name)|duplicated(name, fromLast=TRUE),1,0))
df <- df[!duplicated(df$name),] ## remove duplicate
df <- df %>% mutate(party= ifelse(dup==1, "Cross Listed", party)) # update party
df <- df%>% select(-dup) ## remove flag
One way, using dplyr, would be to group_by all fields other than the party, and then summarise to "CrossListed" if the number of rows in the group is bigger than 1, i.e. if n() > 1.
Something like this...
library(dplyr)
# group_by(-Party) does not mean "everything except Party"; across(-Party) does (needs dplyr >= 1.0.0)
df2 <- df %>% group_by(across(-Party)) %>%
  summarise(Party = ifelse(n() > 1, "CrossListed", first(Party)), .groups = "drop")
or an alternative to the last line would be to paste all the party names together so that you can see where they are cross-listed (which might be useful if there are lots of parties - less so if there are only two!)... summarise(Party = paste(sort(Party), collapse=", "))
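The flag-then-deduplicate idea can also be sketched in base R, which avoids the dependency on dplyr. The data frame below is a cut-down, partly invented stand-in for CCCrn (the second candidate and the column names Name, Office and Party are assumptions for illustration), so adjust the names to the real data.

```r
# Hypothetical miniature of CCCrn; "Jane Doe" is invented for illustration
CCCrn <- data.frame(
  Name   = c("Adam Hill", "Adam Hill", "Jane Doe"),
  Office = c("School Director", "School Director", "School Director"),
  Party  = c("Democratic", "Republican", "Democratic"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose Name occurs more than once (first and later copies alike)
is_cross <- duplicated(CCCrn$Name) | duplicated(CCCrn$Name, fromLast = TRUE)

# Overwrite the party before dropping rows, so the surviving row is already updated
CCCrn$Party[is_cross] <- "Cross Listed"

# Keep the first row for each candidate
test <- CCCrn[!duplicated(CCCrn$Name), ]
test
```

Updating Party before removing the duplicates is the key step: once the extra rows are gone, there is no longer any way to tell which candidates had been listed twice.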

How do I make this nested for loop work faster

My data is as shown below:
txt$txt:
my friend stays in adarsh nagar
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
I have an exhaustive list of city names. Listing few of them below:
city:
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
I am searching for city names (from the "city" list I have) in txt$txt and extracting them into another column if they are present. So the simple loop below works for me... but it's taking a lot of time on the bigger dataset.
for(i in 1:nrow(txt)){
  a <- c()
  for(j in 1:nrow(city)){
    a[j] <- grepl(paste("\\b",city[j,1],"\\b", sep = ""),txt$txt[i])
  }
  txt$city[i] <- ifelse(sum(a) > 0, paste(city[which(a),1], collapse = "_"), "NONE")
}
I tried to use an apply function, and this is as far as I could get.
apply(as.matrix(txt$txt), 1, function(x){ifelse(sum(unlist(strsplit(x, " ")) %in% city[,1]) > 0, paste(unlist(strsplit(x, " "))[which(unlist(strsplit(x, " ")) %in% city[,1])], collapse = "_"), "NONE")})
[1] "NONE" "NONE" "bangalore" "bkc"
Desired Output:
> txt
txt city
1 my friend stays in adarsh nagar adarsh nagar
2 I changed one apple one samsung S3 n one sony experia z. NONE
3 Hi girls..Friends meet at bangalore bangalore
4 what do u think of ccd at bkc bkc
I want a faster process in R that does the same thing as the for loop above. Please advise. Thanks
Here's a possibility using stri_extract_first_regex from stringi package:
library(stringi)
# prepare some data
df <- data.frame(txt = c("in adarsh nagar", "sony experia z", "at bangalore"))
city <- c("ahmedabad", "adarsh nagar", "airoli", "bangalore")
df$city <- stri_extract_first_regex(str = df$txt, pattern = paste(city, collapse = "|"))
df
# txt city
# 1 in adarsh nagar adarsh nagar
# 2 sony experia z <NA>
# 3 at bangalore bangalore
This should be much faster:
bigPattern <- paste('(\\b',city[,1],'\\b)',collapse='|',sep='')
txt$city <- sapply(regmatches(txt$txt,gregexpr(bigPattern,txt$txt)),FUN=function(x) ifelse(length(x) == 0,'NONE',paste(unique(x),collapse='_')))
Explanation:
in the first line we build a big regular expression matching all the cities, e.g. :
(\\bahmedabad\\b)|(\\badarsh nagar\\b)|(\\bairoli\\b)| ...
Then we use gregexpr in combination with regmatches, in this way we get a list of the matches for each element in txt$txt.
Finally, with a simple sapply, for each element of the list we concatenate the matched cities (after removing the duplicates i.e. cities mentioned more than one time).
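The three steps above can be sketched end to end on the question's sample data. This is a minimal sketch, assuming the city names contain no regular-expression metacharacters (otherwise they would need escaping first); perl = TRUE is used here so that \b is handled by PCRE, which is a choice of this sketch rather than part of the answer above.

```r
txt <- data.frame(
  txt = c("my friend stays in adarsh nagar",
          "I changed one apple one samsung S3 n one sony experia z.",
          "Hi girls..Friends meet at bangalore",
          "what do u think of ccd at bkc"),
  stringsAsFactors = FALSE
)
city <- data.frame(
  city = c("ahmedabad", "adarsh nagar", "airoli", "bangalore",
           "bangaladesh", "banerghatta Road", "bkc", "calcutta"),
  stringsAsFactors = FALSE
)

# Step 1: one alternation pattern covering every city, bounded by word breaks
big_pattern <- paste0("\\b(", paste(city[, 1], collapse = "|"), ")\\b")

# Step 2: all matches per text, returned as a list with one element per row
hits <- regmatches(txt$txt, gregexpr(big_pattern, txt$txt, perl = TRUE))

# Step 3: collapse the unique matches per row, or "NONE" when nothing matched
txt$city <- vapply(hits, function(x) {
  if (length(x) == 0) "NONE" else paste(unique(x), collapse = "_")
}, character(1))

txt$city
```

The speed-up comes from scanning each text once against a single pattern instead of once per city, so the work no longer grows with nrow(txt) times nrow(city) separate grepl calls.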
Try this:
# YOUR DATA
##########
txt <- readLines(n = 4)
my friend stays in adarsh nagar and airoli
I changed one apple one samsung S3 n one sony experia z.
Hi girls..Friends meet at bangalore
what do u think of ccd at bkc
city <- readLines(n = 8)
ahmedabad
adarsh nagar
airoli
bangalore
bangaladesh
banerghatta Road
bkc
calcutta
# MATCHING
##########
matches <- unlist(setNames(lapply(city, grep, x = txt, fixed = TRUE),
city))
(res <- (sapply(1:length(txt), function(x)
paste0(names(matches)[matches == x], collapse = "___"))))
# [1] "adarsh nagar___airoli" ""
# [3] "bangalore" "bkc"

Resources