Count the number of team members based on team name in R

I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!

This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt <- data.table(df)
# First, convert TeamName from factor to character
dt[, TeamName := as.character(TeamName)]
# Now count the members of each team
dt[, TeamNo := .N, by = "TeamName"]
# Handle the special cases: missing or empty team names get NA
dt[is.na(TeamName), TeamNo := NA]
dt[TeamName == "", TeamNo := NA]
It is clearly not the best solution, but I hope this helps

If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
  unite(Var, Name, Surname) %>% # paste the columns together
  group_by(TeamName) %>% # group by TeamName
  mutate(TeamNo = n_distinct(Var)) %>% # create the TeamNo column
  separate(Var, into = c('Name', 'Surname')) # split the 'Var' column
Or, if it is just the number of rows per 'TeamName', we can group by 'TeamName', get the number of rows per group with n(), and create the 'TeamNo' column with mutate based on that n(). If needed, an ifelse condition can be used to give NA for 'TeamName' values that are '' or NA.
df %>%
  group_by(TeamName) %>%
  mutate(TeamNo = ifelse(is.na(TeamName) | TeamName == '', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. If the column contains both '' and NA, I would first convert the '' to NA and then use ave to get the group size for each value of 'TeamName'. The result is NA wherever 'TeamName' is NA. For example:
v1 <- c(as.character(df$TeamName), NA) # append an NA to the example to show that case
is.na(v1) <- v1 == '' # convert '' to NA
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
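The question asks for 0 rather than NA when the team name is missing. A minimal base R sketch of that variant follows. It assumes the example data from the question (rebuilt here so the snippet is self-contained) and treats both '' and NA as "no team"; the helper names team and counts are only illustrative.

```r
# Example data from the question; Bruce Harper has an empty team name
df <- data.frame(
  Name     = c("John", "Mary", "Mark", "Rory", "Jane", "Bruce"),
  Surname  = c("Smith", "Osborne", "Johnson", "Bradon", "Bryant", "Harper"),
  TeamName = c("Champions", "Socceroos", "Champions", "Champions", "Socceroos", ""),
  stringsAsFactors = FALSE
)

# Treat empty strings as missing
team <- as.character(df$TeamName)
team[!is.na(team) & team == ""] <- NA

# table() ignores NA by default, so only real teams are counted
counts <- table(team)

# Look up each row's team size; rows without a team get NA, then 0
df$TeamNo <- as.vector(counts)[match(team, names(counts))]
df$TeamNo[is.na(df$TeamNo)] <- 0L

df
```

If NA is preferred over 0, as in the answers above, the last assignment can simply be dropped.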

Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA

Related

Join of 2 dataframes [duplicate]

I have 2 dataframes and I want to join by name, but names are not written exactly the same:
Df1:
ID Name Age
1 Jose 13
2 M. Jose 12
3 Laura 8
4 Karol P 32
Df2:
Name Surname
José Hall
María José Perez
Laura Alza
Karol Smith
I need to join and get this:
ID Name Age Surname
1 Jose 13 Hall
2 M. Jose 12 Perez
3 Laura 8 Alza
4 Karol P 32 Smith
How can I account for the names not being written exactly the same when joining?
You can get close to your result using stringdist_left_join from fuzzyjoin
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = "Name")
# ID Name.x Age Name.y Surname
#1 1 Jose 13 José Hall
#2 2 M. Jose 12 <NA> <NA>
#3 3 Laura 8 Laura Alza
#4 4 Karol P 32 Karol Smith
For the example shared, this fails for one entry, since it is difficult to match María with M. You can get a match for it by raising the max_dist argument (default is 2); however, this would also produce unwanted matches elsewhere. If only a few entries are left as NA after this join (as in the example shared), you could just match them by hand.
I would clean the data first (for example, by deleting the accents ´; this is easy to do with find-and-replace in Excel) and then use
new_df <- merge(df1, df2, by = "Name")
Or you could try to assign an ID to df2 that coincides with the ID in df1, if that is possible.
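The cleaning step can also be done in R instead of Excel. The following is only a sketch: strip_accents is a hypothetical helper that covers just the five lowercase accented vowels, and the accented letters are written as \u escapes (\u00e9 is é, \u00ed is í) so the snippet does not depend on the file encoding. With the example data it recovers exact matches for Jose and Laura only; "M. Jose" and "Karol P" still need fuzzy matching or manual handling.

```r
# Example data from the question
df1 <- data.frame(ID = 1:4,
                  Name = c("Jose", "M. Jose", "Laura", "Karol P"),
                  Age = c(13, 12, 8, 32),
                  stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Jos\u00e9", "Mar\u00eda Jos\u00e9", "Laura", "Karol"),
                  Surname = c("Hall", "Perez", "Alza", "Smith"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: replace accented lowercase vowels with plain ones
strip_accents <- function(x) {
  from <- c("\u00e1", "\u00e9", "\u00ed", "\u00f3", "\u00fa")
  to   <- c("a", "e", "i", "o", "u")
  for (k in seq_along(from)) {
    x <- gsub(from[k], to[k], x, fixed = TRUE)
  }
  x
}

# Clean the join key in df2, then do an exact merge
df2_clean <- df2
df2_clean$Name <- strip_accents(df2_clean$Name)
new_df <- merge(df1, df2_clean, by = "Name")
new_df
```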

Multiple criteria lookup in R

I have data like this:
ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats
And I want to add columns indicating teams A and B by matching the row ID of that row in the ID column, and by matching one of the names in one of the "a" columns of that row in the "Name" column (for Team A), and doing the same for Team B using one of the names in one of the "b" columns of that row:
ID 1a 2a 3a 1b 2b 3b Name Team Team A Team B
cb128c James John Bill Jeremy Ed Simon Simon Wolves Tigers Wolves
cb128c John James Randy Simon David Ben John Tigers Tigers Wolves
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows Wildcats Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats Wildcats Sparrows
In row 1, we know Team A is Tigers because we match the ID of row 1, cb128c, in the ID column, and one of the "a" names of row 1 (either James, John or Bill) in the Name column. In this case, Row 2 has that ID, cb128c, and has "John" in the Name column. The Team in row 2 is "Tigers." Therefore, Row 1's Team A is Tigers. Team B is the Wolves because we match row 1's ID, still cb128c, and one of the "b" names in row 1 (either Jeremy, Ed or Simon) in the Name column. In this case, row 1 itself has the data we're looking for since one of the "b" names appears in the "Name" column of that row (Simon). The "Team" listed in each row will always either be the Team A or the Team B for that row.
Further down, we know Team A for row 3 is Wildcats because we match row 3's ID, ko351u and one of row 3's "a" names (either Adam, Alex or Jacob) in the "Name" column. Row 4 has that ID and "Adam" in the Name column. So the Team in Row 4 is Team A for Row 3.
Also notice that David switched teams in Row 3. In Row 2, David was on Simon's team, which we know is the Wolves (as explained above), but when we match Row 3's ID and one of Row 3's "b" names (Bob, Oscar or David), we get the Sparrows (like Row 1, one of the "b" names appears in the name column of that same row, so the Team B is the Team listed in that row).
How can I get this done in R?
df = read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats", header = T)
# convert to character
df[] = lapply(df, as.character)
library(tidyr)
library(dplyr)
The following code 1. gathers to long format, 2. creates "Team_A" and "Team_B" out of the a or b suffix, 3. matches names to fill in the A/B Team Name, 4. removes missing values (no match), 5. gets rid of unnecessary columns, 6. converts back to wide format, 7. joins the A and B teams to the original data.
I'd encourage you to step through the code line by line to understand what's going on. I'll leave reordering the columns to you.
result = gather(df, key = "key", value = "value", starts_with("X")) %>%
  mutate(ab = paste0("Team_", toupper(substr(key, start = nchar(key), stop = nchar(key)))),
         team = ifelse(Name == value, Team, NA)) %>%
  filter(!is.na(team)) %>%
  select(ID, ab, team) %>%
  spread(key = ab, value = team) %>%
  right_join(df)
result
# ID Team_A Team_B X1a X2a X3a X1b X2b X3b Name Team
# 1 cb128c Tigers Wolves James John Bill Jeremy Ed Simon Simon Wolves
# 2 cb128c Tigers Wolves John James Randy Simon David Ben John Tigers
# 3 ko351u Wildcats Sparrows Adam Alex Jacob Bob Oscar David Oscar Sparrows
# 4 ko351u Wildcats Sparrows Adam Matt Sam Fred Frank Harry Adam Wildcats
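For comparison, the same logic can be sketched in base R without reshaping. Like the pipeline above, it checks whether each row's Name appears in that row's own "a" or "b" columns, records the row's Team under that side, and then shares the result across all rows with the same ID. It assumes every ID has exactly one row for each side, as in the example; a_cols, in_a, side and teams are illustrative names.

```r
df <- read.table(text = "ID 1a 2a 3a 1b 2b 3b Name Team
cb128c James John Bill Jeremy Ed Simon Simon Wolves
cb128c John James Randy Simon David Ben John Tigers
ko351u Adam Alex Jacob Bob Oscar David Oscar Sparrows
ko351u Adam Matt Sam Fred Frank Harry Adam Wildcats",
                 header = TRUE, stringsAsFactors = FALSE)

# read.table prefixes names that start with a digit with "X"
a_cols <- c("X1a", "X2a", "X3a")

# Does the row's Name appear in one of its own "a" columns?
in_a <- rowSums(df[a_cols] == df$Name) > 0
side <- ifelse(in_a, "Team_A", "Team_B")

# One Team per (ID, side) combination
teams <- tapply(df$Team, list(df$ID, side), function(x) x[1])

# Look up both sides for every row by its ID
df$Team_A <- unname(teams[df$ID, "Team_A"])
df$Team_B <- unname(teams[df$ID, "Team_B"])
df[, c("ID", "Name", "Team", "Team_A", "Team_B")]
```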

Add column to R dataframe that is length of string in another column

This should be EASY but I can't figure it out and search didn't help. I'd like to add a column to a dataframe that is just the length of the strings in another column.
So say I have a data frame of names like such:
Name Last
1 John Doe
2 Edgar Poe
3 Walt Whitman
4 Jane Austen
I'd like to append a new column with the string length of, say, the last name, so it would look like:
Name Last Length
1 John Doe 3
2 Edgar Poe 3
3 Walt Whitman 7
4 Jane Austen 6
Thanks
We can use str_count from stringr
library(stringr)
df1$Length <- str_count(df1$Last)
df1$Length
[1] 3 3 7 6
If you want to filter rows by the string length of a column, do the following:
library(dplyr)
# keep only the rows whose Last has at most 3 characters
df <- df %>%
  filter(nchar(Last) <= 3)
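Both steps can also be done in base R with nchar, with no packages loaded. A small sketch using the example data from the question; the as.character call is only a precaution in case the column was read in as a factor, and short is an illustrative name.

```r
df1 <- data.frame(Name = c("John", "Edgar", "Walt", "Jane"),
                  Last = c("Doe", "Poe", "Whitman", "Austen"),
                  stringsAsFactors = FALSE)

# Add the string length of the last name as a new column
df1$Length <- nchar(as.character(df1$Last))

# Keep only the rows whose last name has at most 3 characters
short <- df1[df1$Length <= 3, ]
short
```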

How to rbind when only some of the columns match

I have about 18 dataframes which are essentially frequency counts of the elements stored in the column Rptnames. They all have some different and some the same elements in the Rptnames columns so they look like this
dataframe called GroupedTableProportiondelAll
Rptname freq
bob 4324234
jane 433
ham 4324
tim 22
dataframe called GroupedTableProportiondelLUAD
Rptname freq
bob 987
jane 223
jonny 12
jim 98092
I am trying to set up a table so that the Rptname becomes the column and each row is the frequencies. This is so that I can combine all the dataframes.
I have tried the following
GroupedTableProportiondelAll_T <- as.data.frame(t(GroupedTableProportiondelAll))
GroupedTableProportiondelLUAD_T <- as.data.frame(t(GroupedTableProportiondelLUAD))
total <- rbind(GroupedTableProportiondelLUAD_T, GroupedTableProportiondelAll_T)
but I get the error
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
So the question is
a) how can I do rbind (cbind would also do without transposing I suppose) so that the bind can happen without needing to match.
b) would merge be better here
c) in either is there a way to enter zero for empty values
d) P'raps there's a better way to do this, like matrices, which I'm not really familiar with? I know it's 4 questions but the central question's the same: how to bind when not all the rows or columns are matching
An alternative to the rbind + dcast technique is to use the tidyverse.
Use pipes (%>%) to first use bind_rows() to bind all your dataframes together while simultaneously creating a dataframe id column (in this case I just called the variable "df"). Then use spread() to move unique "Rptname" values to become column names and spreading the values of "freq" across the new columns. "Rptname" is the key and "freq" is the value in this case.
It would look like this:
Input:
GTP_A
Rptname freq
1 bob 4324234
2 jane 433
3 ham 4324
4 tim 22
GTP_LUAD
Rptname freq
1 bob 987
2 jane 223
3 jonny 12
4 jim 98092
Code:
library(dplyr)
library(tidyr)
GroupTable <- bind_rows(GTP_A, GTP_LUAD, .id = "df") %>%
  spread(Rptname, freq)
Output:
GroupTable
df bob ham jane jim jonny tim
1 1 4324234 4324 433 NA NA 22
2 2 987 NA 223 98092 12 NA
UPDATE:
As of the release of tidyr 1.0.0 on 2019/09/13, spread() and gather() have been retired and replaced by pivot_wider() and pivot_longer(), respectively. In the release notes, Hadley Wickham states "spread() and gather() won't go away, but they've been retired which means that they're no longer under active development."
In order to get the same output as above, you will now need to first arrange() by Rptname then use pivot_wider(). If you do not arrange first you will get a similar output but the column order will not be the same as the output from spread().
GroupTable <- bind_rows(GTP_A, GTP_LUAD, .id = "df") %>%
arrange(Rptname) %>%
pivot_wider(names_from = Rptname, values_from = freq)
You could first rbind the data frames after adding a column that identifies the source data.frame, then use the dcast function from the reshape2 package. With the identifying df column added, the two data frames look like this:
rpt1
## Rptname freq df
## 1 bob 4324234 rpt1
## 2 jane 433 rpt1
## 3 ham 4324 rpt1
## 4 tim 22 rpt1
rpt2
## Rptname freq df
## 1 bob 987 rpt2
## 2 jane 223 rpt2
## 3 jonny 12 rpt2
## 4 jim 98092 rpt2
library(reshape2)
rpt1$df <- "rpt1"
rpt2$df <- "rpt2"
rpt <- rbind(rpt1, rpt2)
dcast(data = rpt, df ~ Rptname, value.var = "freq")
## df bob ham jane tim jim jonny
## 1 rpt1 4324234 4324 433 22 NA NA
## 2 rpt2 987 NA 223 NA 98092 12
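Part (c) of the question asks for zeros in place of empty values, whereas the answers above leave NA. One way to get zeros directly is a base R sketch with xtabs, which sums freq for each combination and returns 0 where a combination is absent. The list names All and LUAD, and the variables tables, long, counts and wide, are illustrative choices.

```r
GTP_A <- data.frame(Rptname = c("bob", "jane", "ham", "tim"),
                    freq = c(4324234, 433, 4324, 22),
                    stringsAsFactors = FALSE)
GTP_LUAD <- data.frame(Rptname = c("bob", "jane", "jonny", "jim"),
                       freq = c(987, 223, 12, 98092),
                       stringsAsFactors = FALSE)

# Put all tables in a named list; the names become the row labels
tables <- list(All = GTP_A, LUAD = GTP_LUAD)

# Stack the tables, tagging each row with the table it came from
long <- do.call(rbind, Map(function(d, id) {
  data.frame(df = id, d, stringsAsFactors = FALSE)
}, tables, names(tables)))

# Cross-tabulate: one row per table, one column per Rptname, 0 if absent
counts <- xtabs(freq ~ df + Rptname, data = long)
wide <- as.data.frame.matrix(counts)
wide
```

With around 18 data frames, only the tables list needs to grow; the rest of the code stays the same.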

Locate and merge duplicate rows in a data.frame but ignore column order

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.
Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (assumes original data frame is dd): it's all really intuitive. We create a lookup column (take a look and should be self explanatory), get the sums of the total column for each combination, and then filter down to the unique combinations...
dd$lookup=apply(dd[,c("name1","name2","name3")],1,
function(x){paste(sort(x),collapse="~")})
tab1=tapply(dd$total,dd$lookup,sum)
ee=dd[match(unique(dd$lookup),dd$lookup),]
ee$newtotal=as.numeric(tab1)[match(ee$lookup,names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy - and no external packages needed. And crucially, you can see at every stage of the process what is going on!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf = with(ee,data.frame(name1,name2,name3,
total=newtotal,stringsAsFactors=FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35
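The "sorted paste" idea from the first answer can also be combined with aggregate from base R, which does the grouping and summing in one call. A compact sketch using the example data; the column name key and the result name res are illustrative.

```r
dd <- read.table(text = "name1 name2 name3 total
Bob Fred Sam 30
Bob Joe Frank 20
Frank Sam Tom 25
Sam Tom Frank 10
Fred Bob Sam 15", header = TRUE, stringsAsFactors = FALSE)

# Build an order-independent key: sort the three names, then paste them
dd$key <- apply(dd[, c("name1", "name2", "name3")], 1,
                function(x) paste(sort(x), collapse = "~"))

# Sum the totals of all rows that share the same key
res <- aggregate(total ~ key, data = dd, FUN = sum)
res
```

The key can be split back into three name columns with strsplit if the original layout is needed.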

Resources