R Identifying Dataframe Change Patterns by Groups - r

I have a dataframe looks like below:
person year location salary
Harry 2002 Los Angeles $2000
Harry 2006 Boston $3000
Harry 2007 Los Angeles $2500
Peter 2001 New York $2000
Peter 2002 New York $2300
Lily 2007 New York $7000
Lily 2008 Boston $2300
Lily 2011 New York $4000
Lily 2013 Boston $3300
I want to identify a pattern at the person level. I want to know who moves out of a location and came back later. For example, Harry moves out of Los Angeles and came back later. Lily moved out of new York and came back later. Also for Lily, we can say she also moved out of Boston and came back later. I only am interested in who has this pattern and does not care the number of back and forth. Therefore, ideally, the output can look like:
person move_back (yes/no)
Harry 1
Peter 0
Lily 1

With the help of data.table rleid you can do -
library(dplyr)
df %>%
arrange(person, year) %>%
group_by(person) %>%
mutate(val = data.table::rleid(location)) %>%
arrange(person, location) %>%
group_by(location, .add = TRUE) %>%
summarise(move_back = any(val != lag(val, default = first(val)))) %>%
summarise(move_back = as.integer(any(move_back)))
# person move_back
# <chr> <int>
#1 Harry 1
#2 Lily 1
#3 Peter 0

You could use rle to identify situations where the are one or more instances of repeats. (I think your item Lily had two repeats.)
lapply( split(dat, dat$person), function(x) duplicated( rle(x$location)$values))
$Harry
[1] FALSE FALSE TRUE
$Lily
[1] FALSE FALSE TRUE TRUE
$Peter
[1] FALSE
You could use sapply with sum or any to determine the number of move-backs or whether any move-backs occurred. If you only want to know if there's a move-back to the first site then the logic would be different.

A slightly different data.table method, based on joins and row number (.I).
Basically I'm flagging all the times that a location for a person matches a row that is not the next row, then aggregating.
library(data.table)
setDT(dat)
dat[, rn := .I]
dat[, rnp1 := .I + 1]
dat[dat, on=.(person, location, rn > rnp1), back := TRUE]
dat[, .(move_back = any(back, na.rm=TRUE)), by=person]
# person move_back
#1: Harry TRUE
#2: Peter FALSE
#3: Lily TRUE
Where dat was:
dat <- read.csv(text="person,year,location,salary
Harry,2002,Los Angeles,$2000
Harry,2006,Boston,$3000
Harry,2007,Los Angeles,$2500
Peter,2001,New York,$2000
Peter,2002,New York,$2300
Lily,2007,New York,$7000
Lily,2008,Boston,$2300
Lily,2011,New York,$4000
Lily,2013,Boston,$3300", header=TRUE)

Related

Mutate DF1 based on DF2 with a check

nubie here with a dataframe/mutate question... I want to update a dataframe (df1) based on data in another dataframe (df2). For one offs I've used MUTATE so I figure this is the way to go. Additionally I would like a check function added (TRUE/FALSE ?) to indicate if the the field in df1 was updated.
For Example..
df1-
State
<chr>
1 N.Y.
2 FL
3 AL
4 MS
5 IL
6 WS
7 WA
8 N.J.
9 N.D.
10 S.D.
11 CALL
df2
State New_State
<chr> <chr>
1 N.Y. New York
2 FL Florida
3 AL Alabama
4 MS Mississippi
5 IL Illinois
6 WS Wisconsin
7 WA Washington
8 N.J. New Jersey
9 N.D. North Dakota
10 S.D. South Dakota
11 CAL California
I want the output to look like this
df3
New_State Test
<chr>
1 New York TRUE
2 Florida TRUE
3 Alabama TRUE
4 Mississippi TRUE
5 Illinois TRUE
6 Wisconsin TRUE
7 Washington TRUE
8 New Jersey TRUE
9 North Dakota TRUE
10 South Dakota TRUE
11 CALL FALSE
In essence I want R to read the data in df1 and change df1 based on the match in df2 chaining out to the full state name and replace. Lastly if the data in df1 was update mark as "TRUE" (N.Y. to NEW YORK) and "FALSE" if not updated (CALL vs CAL)
Thanks in advance for any and all help.
This should give you the result you're looking for:
match_vec <- match(df1$State, table = df2$State)
This vector should match all the abbreviated state names in df1 with those in df2. Where there's no match, you end up with a missing value:
Then the following code using dplyr should produce the df3 you requested.
library(dplyr)
df3 <- df1 %>%
mutate(New_State = df2$New_State[match_vec]) %>%
mutate(Test = !is.na(match_vec)) %>%
mutate(New_State = ifelse(is.na(New_State),
State, New_State)) %>%
select(New_State, Test)

standardizing customer ids based on the same company name

I need to use one of the many customers ids and standarize it upon all companies names that are extact same.
Before
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1756 Lightz California
After
Customer.Ids Company Location
1211 Lightz New York
1325 Comput.Inc Seattle
1211 Lightz California
The customer ids for the two companies are now the same. Which code would be the best for this?
We can use match here as it returns the first matching position. We can match Company with Company. According to ?match
match returns a vector of the positions of (first) matches of its first argument in its second.
df$Customer.Ids <- df$Customer.Ids[match(df$Company, df$Company)]
df
# Customer.Ids Company Location
#1 1211 Lightz NewYork
#2 1325 Comput.Inc Seattle
#3 1211 Lightz California
where
match(df$Company, df$Company) #returns
#[1] 1 2 1
Some other options, using sapply
df$Customer.Ids <- df$Customer.Ids[sapply(df$Company, function(x)
which.max(x == df$Company))]
Here we loop over each Company and get the first instance of it's occurrence.
Or another option using ave which follows same logic as that of #Shree, to get first occurrence by group.
with(df, ave(Customer.Ids, Company, FUN = function(x) head(x, 1)))
#[1] 1211 1325 1211
Here's a way using dplyrpackage. It'll replace all Ids as per the first instance for any company -
df %>%
group_by(Company) %>%
mutate(
Customer.Ids = Customer.Ids[1]
) %>%
ungroup()
# A tibble: 3 x 3
Customer.Ids Company Location
<int> <fct> <fct>
1 1211 Lightz New York
2 1325 Comput.Inc Seattle
3 1211 Lightz California

R count number of Team members based on Team name

I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, let's convert the factors of TeamName, to characters
dt[,TeamName:=as.character(TeamName)]
# Now, let find all the team numbers
dt[,TeamNo:=.N, by='TeamName']
# Let's exclude the special cases
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
It is clearly not the best solution, but I hope this helps
If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
unite(Var, Name, Surname) %>% #paste the columns together
group_by(TeamName) %>% #group by TeamName
mutate(TeamNo= n_distinct(Var)) %>% #create the TeamNo column
separate(Var, into=c('Name', 'Surname')) #split the 'Var' column
Or if it just the number of rows per 'TeamName', we can group by 'TeamName', get the number of rows per group with n(), create the 'TeamNo' column with mutate based on that n(), and if needed an ifelse condition can be used to give NA for 'TeamName' that are '' or NA.
df %>%
group_by(TeamName) %>%
mutate(TeamNo = ifelse(is.na(TeamName)|TeamName=='', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. Suppose if there are '' and NA, I would first convert the '' to NA and then use ave to get the length of 'TeamNo' grouped by that column. It will give NA for `NA' values. For example.
v1 <- c(df$TeamName, NA)# appending an NA with the example to show the case
is.na(v1) <- v1=='' #convert the `'' to `NA`
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA

Get count of group-level observations with multiple individual observations from dataframe in R

How do I get a dataframe like this:
soccer_player country position
"sam" USA left defender
"jon" USA right defender
"sam" USA left midfielder
"jon" USA offender
"bob" England goalie
"julie" England central midfielder
"jane" England goalie
To look like this (country with the counts of unique players per country):
country player_count
USA 2
England 3
The obvious complication is that there are multiple observations per player, so I cannot simply do table(df$country) to get the number of observations per country.
I have been playing with the table() and merge() functions but have not had any luck.
Here's one way:
as.data.frame(table(unique(d[-3])$country))
# Var1 Freq
# 1 England 3
# 2 USA 2
Drop the third column, remove any duplicate Country-Name pairs, then count the occurrences of each country.
The new features of dplyr v 3.0 provide a compact solution:
Data:
dd <- read.csv(text='
soccer_player,country,position
"sam",USA,left defender
"jon",USA,right defender
"sam",USA,left midfielder
"jon",USA,offender
"bob",England,goalie
"julie",England,central midfielder
"jane",England,goalie')
Code:
library(dplyr)
dd %>% distinct(soccer_player,country) %>%
count(country)
Without using any packages you can do:
List = by(df, df$country, function(x) length(unique(x$soccer_player)))
DataFrame = do.call(rbind, lapply(names(List), function(x)
data.frame(country=x, player_count=List[[x]])))
# country player_count
#1 England 2
#2 USA 2
It's easier with something like data.table:
dt = data.table(df)
dt[,list(player_count = length(unique(soccer_player))),by=country]
Here is an sqldf solution:
library(sqldf)
sqldf("select country, count(distinct soccer_player) player_count
from df
group by country")
## country player_count
## 1 England 2
## 2 USA 2
and here is a base R solution:
as.data.frame(xtabs(~ country, unique(df[1:2])), responseName = "player_count")
## country player_count
## 1 England 2
## 2 USA 2
One more base R option, using aggregate:
> aggregate(soccer_player ~ country, dd, FUN = function(x) length(unique(x)))
# country soccer_player
#1 England 3
#2 USA 2

How do I cast a character column in R

Using a data frame like this
df <- data.frame(Season=c("1992","1993","1993"),
Team=c("Man Utd.","Blackburn","Blackburn"),
Player=c("Peter Schmeichel(42)","Tim Flowers(39)","Bobby Mimms(4)"),
Order = c(1,1,2))
How do i get to this
1992 Man Utd. Peter Schmeichel(42)
1993 Blackburn Tim Flowers(39) Bobby Mimms(4)
Here is one option:
library(reshape2)
dcast(df,Season+Team~Order,value.var = "Player")
Here's one solution:
library(plyr)
ddply(df, .(Season, Team), summarize, Players = paste(Player, collapse = " "))
#-----
Season Team Players
1 1992 Man Utd. Peter Schmeichel(42)
2 1993 Blackburn Tim Flowers(39) Bobby Mimms(4)
Sticking in base R, you can do the following:
aggregate(list(Player = df$Player),
list(Season = df$Season, Team = df$Team), paste)
# Season Team Player
# 1 1993 Blackburn Tim Flowers(39), Bobby Mimms(4)
# 2 1992 Man Utd. Peter Schmeichel(42)
Update
Having seen your desired output from the accepted answer, note that this is also possible using base R's reshape() function:
reshape(df, direction = "wide", idvar=c("Season", "Team"), timevar="Order")
# Season Team Player.1 Player.2
# 1 1992 Man Utd. Peter Schmeichel(42) <NA>
# 2 1993 Blackburn Tim Flowers(39) Bobby Mimms(4)
Another option is to use `tidyr
library(tidyr)
spread(df,Order,Player)

Resources