Add column to R dataframe that is length of string in another column - r

This should be EASY but I can't figure it out and search didn't help. I'd like to add a column to a dataframe that is just the length of the strings in another column.
So say I have a data frame of names like such:
Name Last
1 John Doe
2 Edgar Poe
3 Walt Whitman
4 Jane Austen
I'd like to append a new column with the string length of, say, the last name, so it would look like:
Name Last Length
1 John Doe 3
2 Edgar Poe 3
3 Walt Whitman 7
4 Jane Austen 6
Thanks

We can use str_count from stringr
library(stringr)
df1$Length <- str_count(df1$Last)
df1$Length
[1] 3 3 7 6

If you want to filter by the length, based on column, then do the following:
library(dplyr)
df<- df %>%
filter(nchar(Last) <= 3)

Related

Conversion of Matrix to Columns

Say you have
Name August September October November
Bob 5 4 3 2
George 3 2 2 4
Gina 1 4 2 1
And you want to convert into 3 columns like so
Name Month Output
Bob August 5
Bob September 4
.....
I see how to do it in VBA through the following link : https://www.extendoffice.com/documents/excel/2773-excel-convert-matrix-to-list.html
Unsure how to execute in R. All of the searching I've yielded want to simply split the matrix into vectors which isn't correct.
If you have a dataframe, say, df you can define its column names as a column in its own right by using names:
df$Month <- names(df)[2:5]

Complex merging in R with duplicate matching values in y set producing problems

So I'm trying to merge two dataframes. Dataframe x looks something like:
Name ParentID
Steve 1
Kevin 1
Stacy 1
Paula 4
Evan 7
Dataframe y looks like:
ParentID OtherStuff
1 things
2 stuff
3 item
4 ideas
5 short
6 help
7 me
The dataframe I want would look like:
Name ParentID OtherStuff
Steve 1 things
Kevin 1 things
Stacy 1 things
Paula 4 ideas
Evan 7 me
Using a left merge gives me substantially more observations than I want, with many duplicates. Any idea how to merge things, where y is duplicated where appropriate to match x?
I'm working with a databases set up similarly to the example. x has 5013 observations, while y has 6432. Using the merge function as described by Joel and thelatemail gives me 1627727 observations.
We can use match from base R
df1$OtherStuff <- with(df1, df2$OtherStuff[match(ParentID, df2$ParentID)])
df1
# Name ParentID OtherStuff
#1 Steve 1 things
#2 Kevin 1 things
#3 Stacy 1 things
#4 Paula 4 ideas
#5 Evan 7 me

Classification according to unique values \

I have a data frame named as Records having 2 vectors Rank and Name
Rank Name
1 Ashish
1 Ashish
2 Ashish
3 Mark
4 Mark
1 Mark
3 Spencer
2 Spencer
1 Spencer
2 Mary
4 Joseph
I want that every name should be placed in either 1, 2 ,3 or 4 tag depending on their occurrence and uniqueness:
I want to create a new vector which will be named as Tagging
So The output should be:
Rank 1 has three unique elements Mark Spencer and Ashish so the tag is 1 for all three.
Rank 2 has one unique records which is Mary as Ashish has already been assigned tag 1 so Mary is tagged as 2.
Rank 3 has no unique records as Spencer and Mark has already been assigned 1 so I cannot tag 3 to anybody.
Rank 4 has one unique record Joseph so he gets tagged as 4.
Let me know which function can help me do this.
I do not want to use looping as this is 1000000 row database
The below solution follows the principle that the highest Rank of a person is going to be that person's tag too.
tbl <- read.table(header=TRUE, text='
Rank Name
1 Ashish
1 Ashish
2 Ashish
3 Mark
4 Mark
1 Mark
3 Spencer
2 Spencer
1 Spencer
2 Mary
4 Joseph
')
Ordering the 'tbl' dataframe by Rank
tbl_ord <- tbl[with(tbl,order(Rank)),]
Removing multiple occurrence of name within same Rank
> name_ord<- tbl_ord[duplicated(tbl_ord$Rank),]
> name_ord
Rank Name
2 1 Ashish
6 1 Mark
9 1 Spencer
8 2 Spencer
10 2 Mary
7 3 Spencer
11 4 Joseph
Displaying unique Names
#name_ord[unique(name_ord$Name),] #this will work too
> name_ord[!duplicated(name_ord$Name),]
Rank Name
2 1 Ashish
6 1 Mark
9 1 Spencer
10 2 Mary
11 4 Joseph
Using the setkey function of data.table package and unique:
library(data.table)
dt<-data.table(Rank=c(1,1,2,3,4,1,3,2,1,2,4), Name=c(rep("Ashish", 3), rep("Mark", 3), rep("Spencer", 3), "Mary", "Joseph"))
setkey(dt, Rank, Name)
dt<-unique(dt)
setkey(dt, Name)
dt<-unique(dt) # works because of the above setkey call which sorted it
setkey(dt, Rank) # if you want to order them by Rank again

R count number of Team members based on Team name

I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, let's convert the factors of TeamName, to characters
dt[,TeamName:=as.character(TeamName)]
# Now, let find all the team numbers
dt[,TeamNo:=.N, by='TeamName']
# Let's exclude the special cases
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
It is clearly not the best solution, but I hope this helps
If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
unite(Var, Name, Surname) %>% #paste the columns together
group_by(TeamName) %>% #group by TeamName
mutate(TeamNo= n_distinct(Var)) %>% #create the TeamNo column
separate(Var, into=c('Name', 'Surname')) #split the 'Var' column
Or if it just the number of rows per 'TeamName', we can group by 'TeamName', get the number of rows per group with n(), create the 'TeamNo' column with mutate based on that n(), and if needed an ifelse condition can be used to give NA for 'TeamName' that are '' or NA.
df %>%
group_by(TeamName) %>%
mutate(TeamNo = ifelse(is.na(TeamName)|TeamName=='', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. Suppose if there are '' and NA, I would first convert the '' to NA and then use ave to get the length of 'TeamNo' grouped by that column. It will give NA for `NA' values. For example.
v1 <- c(df$TeamName, NA)# appending an NA with the example to show the case
is.na(v1) <- v1=='' #convert the `'' to `NA`
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA

Locate and merge duplicate rows in a data.frame but ignore column order

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.
Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):
name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15
However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.
Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.
Brief code snippet (assumes original data frame is dd): it's all really intuitive. We create a lookup column (take a look and should be self explanatory), get the sums of the total column for each combination, and then filter down to the unique combinations...
dd$lookup=apply(dd[,c("name1","name2","name3")],1,
function(x){paste(sort(x),collapse="~")})
tab1=tapply(dd$total,dd$lookup,sum)
ee=dd[match(unique(dd$lookup),dd$lookup),]
ee$newtotal=as.numeric(tab1)[match(ee$lookup,names(tab1))]
You now have in ee a set of unique rows and their corresponding total counts. Easy - and no external packages needed. And crucially, you can see at every stage of the process what is going on!
(Minor update to help OP:) And if you want a cleaned-up version of the final answer:
outdf = with(ee,data.frame(name1,name2,name3,
total=newtotal,stringsAsFactors=FALSE))
This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
Sort the index columns, then use ddply to aggregate and sum:
Define the data:
dat <- " name1 name2 name3 total
1 Bob Fred Sam 30
2 Bob Joe Frank 20
3 Frank Sam Tom 25
4 Sam Tom Frank 10
5 Fred Bob Sam 15"
x <- read.table(text=dat, header=TRUE)
Create a copy:
xx <- x
Use apply to sort the columns, then aggregate:
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
name1 name2 name3 total
1 Bob Frank Joe 20
2 Bob Fred Sam 45
3 Frank Sam Tom 35

Resources