How do I cast a character column in R - r

Using a data frame like this
df <- data.frame(Season=c("1992","1993","1993"),
Team=c("Man Utd.","Blackburn","Blackburn"),
Player=c("Peter Schmeichel(42)","Tim Flowers(39)","Bobby Mimms(4)"),
Order = c(1,1,2))
How do I get to this?
1992 Man Utd. Peter Schmeichel(42)
1993 Blackburn Tim Flowers(39) Bobby Mimms(4)

Here is one option:
library(reshape2)
dcast(df,Season+Team~Order,value.var = "Player")

Here's one solution:
library(plyr)
ddply(df, .(Season, Team), summarize, Players = paste(Player, collapse = " "))
#-----
Season Team Players
1 1992 Man Utd. Peter Schmeichel(42)
2 1993 Blackburn Tim Flowers(39) Bobby Mimms(4)

Sticking in base R, you can do the following:
aggregate(list(Player = df$Player),
list(Season = df$Season, Team = df$Team), paste)
# Season Team Player
# 1 1993 Blackburn Tim Flowers(39), Bobby Mimms(4)
# 2 1992 Man Utd. Peter Schmeichel(42)
Update
Having seen your desired output from the accepted answer, note that this is also possible using base R's reshape() function:
reshape(df, direction = "wide", idvar=c("Season", "Team"), timevar="Order")
# Season Team Player.1 Player.2
# 1 1992 Man Utd. Peter Schmeichel(42) <NA>
# 2 1993 Blackburn Tim Flowers(39) Bobby Mimms(4)

Another option is to use tidyr:
library(tidyr)
spread(df,Order,Player)
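The question's desired output has each team's players in a single string, which the base R aggregate() call above only approximates because paste is used without collapse. A minimal base R sketch, assuming the players should be joined with a space and ordered by the Order column (the name res is just illustrative):

```r
# Same toy data as in the question.
df <- data.frame(Season = c("1992", "1993", "1993"),
                 Team   = c("Man Utd.", "Blackburn", "Blackburn"),
                 Player = c("Peter Schmeichel(42)", "Tim Flowers(39)", "Bobby Mimms(4)"),
                 Order  = c(1, 1, 2),
                 stringsAsFactors = FALSE)

# Sort first so players are pasted in Order within each Season/Team group.
df <- df[order(df$Season, df$Team, df$Order), ]

# collapse = " " makes paste() return one string per group
# instead of one element per player.
res <- aggregate(Player ~ Season + Team, data = df,
                 FUN = function(p) paste(p, collapse = " "))
print(res)
```

If separate columns per player are preferred, the reshape() and dcast() answers above are the better fit; this variant is only for the one-string-per-group layout.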

Related

Unlist dataframes but also keep the original

I have the following data, which I wish to unlist to make a new data frame. It is probably easier if I show what I'm looking for. I currently have names and codes like this:
name code
joe blogs/john williams 100000/100001
What I want:
name code
joe blogs 100000
john williams 100001
joe blogs/john williams 100000/100001
So I'm unlisting the original but also keeping it while making a new df.
Something like this may work for you
rbind(data.frame(sapply(df, strsplit, "/")), df)
name code
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001
Data
df <- structure(list(name = "joe blogs/john williams", code = "100000/100001"), class = "data.frame", row.names = c(NA,
-1L))
You can use separate_rows() for that:
library(dplyr)
library(tidyr)
df <- data.frame(name = "joe blogs/john williams",
code = "100000/100001")
df |>
separate_rows(everything(), sep = "/") |>
bind_rows(df)
# A tibble: 3 × 2
name code
<chr> <chr>
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001
Using reframe
library(dplyr)
df %>%
reframe(across(everything(), ~ c(unlist(strsplit(.x, "/")), .x)))
-output
name code
1 joe blogs 100000
2 john williams 100001
3 joe blogs/john williams 100000/100001
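For readers who want to see the individual steps without extra packages, here is a rough base R sketch of the same idea. It assumes every column in a row splits into the same number of "/"-separated pieces; parts, split_df and result are illustrative names only.

```r
# Single-row example from the question.
df <- data.frame(name = "joe blogs/john williams",
                 code = "100000/100001",
                 stringsAsFactors = FALSE)

# Split every column on "/" and flatten each result to a plain character vector.
parts <- lapply(df, function(col) unlist(strsplit(col, "/", fixed = TRUE)))

# Assumption: all columns yield the same number of pieces,
# otherwise as.data.frame() will fail or recycle values.
split_df <- as.data.frame(parts, stringsAsFactors = FALSE)

# Stack the split rows on top of the untouched original.
result <- rbind(split_df, df)
print(result)
```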

R Identifying Dataframe Change Patterns by Groups

I have a data frame that looks like this:
person year location salary
Harry 2002 Los Angeles $2000
Harry 2006 Boston $3000
Harry 2007 Los Angeles $2500
Peter 2001 New York $2000
Peter 2002 New York $2300
Lily 2007 New York $7000
Lily 2008 Boston $2300
Lily 2011 New York $4000
Lily 2013 Boston $3300
I want to identify a pattern at the person level: who moved out of a location and came back later. For example, Harry moved out of Los Angeles and came back later. Lily moved out of New York and came back later, and she also moved out of Boston and came back later. I am only interested in who shows this pattern, not in the number of times they moved back and forth. Ideally, the output would look like this:
person move_back (yes/no)
Harry 1
Peter 0
Lily 1
With the help of data.table rleid you can do -
library(dplyr)
df %>%
arrange(person, year) %>%
group_by(person) %>%
mutate(val = data.table::rleid(location)) %>%
arrange(person, location) %>%
group_by(location, .add = TRUE) %>%
summarise(move_back = any(val != lag(val, default = first(val)))) %>%
summarise(move_back = as.integer(any(move_back)))
# person move_back
# <chr> <int>
#1 Harry 1
#2 Lily 1
#3 Peter 0
You could use rle to identify situations where there are one or more instances of repeats. (I think your item Lily had two repeats.)
lapply( split(dat, dat$person), function(x) duplicated( rle(x$location)$values))
$Harry
[1] FALSE FALSE TRUE
$Lily
[1] FALSE FALSE TRUE TRUE
$Peter
[1] FALSE
You could use sapply with sum or any to determine the number of move-backs or whether any move-backs occurred. If you only want to know if there's a move-back to the first site then the logic would be different.
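As a rough sketch of the sapply-with-any idea, assuming the rows should first be ordered by person and year and that location is stored as character rather than factor (rle() needs an atomic vector); move_back and out are illustrative names:

```r
# Data from the question (salary omitted, it is not needed here).
dat <- data.frame(
  person   = c("Harry", "Harry", "Harry", "Peter", "Peter",
               "Lily", "Lily", "Lily", "Lily"),
  year     = c(2002, 2006, 2007, 2001, 2002, 2007, 2008, 2011, 2013),
  location = c("Los Angeles", "Boston", "Los Angeles", "New York", "New York",
               "New York", "Boston", "New York", "Boston"),
  stringsAsFactors = FALSE
)

# Make sure each person's rows are in chronological order.
dat <- dat[order(dat$person, dat$year), ]

# rle() collapses consecutive stays in the same place; a duplicated value
# among the runs means the person left a location and later returned.
move_back <- sapply(split(dat$location, dat$person),
                    function(loc) as.integer(any(duplicated(rle(loc)$values))))

out <- data.frame(person = names(move_back), move_back = unname(move_back))
print(out)
```

Replacing any() with sum() would count the returns instead of flagging them.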
A slightly different data.table method, based on joins and row number (.I).
Basically I'm flagging all the times that a location for a person matches a row that is not the next row, then aggregating.
library(data.table)
setDT(dat)
dat[, rn := .I]
dat[, rnp1 := .I + 1]
dat[dat, on=.(person, location, rn > rnp1), back := TRUE]
dat[, .(move_back = any(back, na.rm=TRUE)), by=person]
# person move_back
#1: Harry TRUE
#2: Peter FALSE
#3: Lily TRUE
Where dat was:
dat <- read.csv(text="person,year,location,salary
Harry,2002,Los Angeles,$2000
Harry,2006,Boston,$3000
Harry,2007,Los Angeles,$2500
Peter,2001,New York,$2000
Peter,2002,New York,$2300
Lily,2007,New York,$7000
Lily,2008,Boston,$2300
Lily,2011,New York,$4000
Lily,2013,Boston,$3300", header=TRUE)

How to Restructure R Data Frame in R [duplicate]

I have data in this format:
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
and I would like it in this format:
boss employee
1 wil james
2 wil andy
3 james dean
4 james bert
5 billy herb
6 billy collin
7 tony mike
8 tony david
I have searched the forums, but I have not yet found anything that helps. I have tried using dplyr and some others, but I am still pretty new to R.
If this question has been answered and you could give me a link that would be greatly appreciated.
Thanks,
Wil
Here is a solution that uses tidyr. Specifically, the gather function is used to combine the two employee columns. This also generates a column based on the column headers (employee1 and employee2), which is called key. We remove that with select from dplyr.
library(tidyr)
library(dplyr)
df <- read.table(
text = "boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david",
header = TRUE,
stringsAsFactors = FALSE
)
df2 <- df %>%
gather(key, employee, -boss) %>%
select(-key)
> df2
boss employee
1 wil james
2 james dean
3 billy herb
4 tony mike
5 wil andy
6 james bert
7 billy collin
8 tony david
I would be shocked if there isn't a slicker, base solution but this should work for you.
Using base R:
df1 <- df[, 1:2]
df2 <- df[, c(1, 3)]
names(df1)[2] <- names(df2)[2] <- "employee"
rbind(df1, df2)
# boss employee
# 1 wil james
# 2 james dean
# 3 billy herb
# 4 tony mike
# 11 wil andy
# 21 james bert
# 31 billy collin
# 41 tony david
Using dplyr:
df %>%
select(boss, employee1) %>%
rename(employee = employee1) %>%
bind_rows(df %>%
select(boss, employee2) %>%
rename(employee = employee2))
# boss employee
# 1 wil james
# 2 james dean
# 3 billy herb
# 4 tony mike
# 5 wil andy
# 6 james bert
# 7 billy collin
# 8 tony david
Data:
df <- read.table(text = "
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
", header = TRUE, stringsAsFactors = FALSE)
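The answers above return all employee1 rows followed by all employee2 rows, whereas the question lists each boss's employees together. A minimal base R sketch that produces the interleaved order, assuming the two employee columns are character and that long is just an illustrative name:

```r
df <- read.table(text = "
  boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
", header = TRUE, stringsAsFactors = FALSE)

# t() turns the two employee columns into a 2 x 4 character matrix;
# as.vector() then reads it column by column, i.e. row by row of df,
# so each boss's employees end up next to each other.
long <- data.frame(
  boss     = rep(df$boss, each = 2),
  employee = as.vector(t(df[, c("employee1", "employee2")])),
  stringsAsFactors = FALSE
)
print(long)
```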

Get count of group-level observations with multiple individual observations from dataframe in R

How do I get a dataframe like this:
soccer_player country position
"sam" USA left defender
"jon" USA right defender
"sam" USA left midfielder
"jon" USA offender
"bob" England goalie
"julie" England central midfielder
"jane" England goalie
To look like this (country with the counts of unique players per country):
country player_count
USA 2
England 3
The obvious complication is that there are multiple observations per player, so I cannot simply do table(df$country) to get the number of observations per country.
I have been playing with the table() and merge() functions but have not had any luck.
Here's one way:
as.data.frame(table(unique(d[-3])$country))
# Var1 Freq
# 1 England 3
# 2 USA 2
Drop the third column, remove any duplicate Country-Name pairs, then count the occurrences of each country.
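Spelled out step by step, and compared with the naive table() call from the question, the same idea might look like the sketch below. The names pairs, naive and counts are illustrative, and the data frame is rebuilt here so the block runs on its own.

```r
d <- data.frame(
  soccer_player = c("sam", "jon", "sam", "jon", "bob", "julie", "jane"),
  country       = c("USA", "USA", "USA", "USA", "England", "England", "England"),
  position      = c("left defender", "right defender", "left midfielder",
                    "offender", "goalie", "central midfielder", "goalie"),
  stringsAsFactors = FALSE
)

# Counting rows directly counts observations, not players: USA shows up 4 times.
naive <- table(d$country)

# Keep one row per player/country pair, then count rows per country.
pairs  <- unique(d[c("soccer_player", "country")])
counts <- as.data.frame(table(pairs$country), responseName = "player_count")
names(counts)[1] <- "country"
print(counts)
```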
The new features of dplyr 0.3 provide a compact solution:
Data:
dd <- read.csv(text='
soccer_player,country,position
"sam",USA,left defender
"jon",USA,right defender
"sam",USA,left midfielder
"jon",USA,offender
"bob",England,goalie
"julie",England,central midfielder
"jane",England,goalie')
Code:
library(dplyr)
dd %>% distinct(soccer_player,country) %>%
count(country)
Without using any packages you can do:
List = by(df, df$country, function(x) length(unique(x$soccer_player)))
DataFrame = do.call(rbind, lapply(names(List), function(x)
data.frame(country=x, player_count=List[[x]])))
# country player_count
#1 England 3
#2 USA 2
It's easier with something like data.table:
library(data.table)
dt = data.table(df)
dt[,list(player_count = length(unique(soccer_player))),by=country]
Here is an sqldf solution:
library(sqldf)
sqldf("select country, count(distinct soccer_player) player_count
from df
group by country")
## country player_count
## 1 England 3
## 2 USA 2
and here is a base R solution:
as.data.frame(xtabs(~ country, unique(df[1:2])), responseName = "player_count")
## country player_count
## 1 England 3
## 2 USA 2
One more base R option, using aggregate:
> aggregate(soccer_player ~ country, dd, FUN = function(x) length(unique(x)))
# country soccer_player
#1 England 3
#2 USA 2

Lookup values in a vectorized way

I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. I also have another data frame with name and birthplace, but this data frame is much larger than the first and contains some, but not all, of the names from the first data frame. How can I add a third column to the first table that is populated with birthplaces looked up in the second table?
What I have now is:
corresponding.birthplaces <- sapply(table1$Name,
function(name){return(table2$Birthplace[table2$Name==name])})
This seems inefficient. Thoughts? Does anyone know of a good book or resource for using R 'properly'? I get the feeling that I generally think in the least computationally efficient manner conceivable.
Thanks :)
See ?merge, which performs a database-like merge or join.
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
"Jennifer","Joshua","Robyn","Iterator"),
Birthplace = sample(c("London","New York",
"San Francisco", "Berlin",
"Tokyo", "Paris"), 10, rep = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco
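Note that merge() drops rows without a match by default, and the question says the second table does not contain every name. A small sketch of two ways to keep those rows, using made-up tables (table1, table2 and their values are hypothetical, not the asker's data):

```r
# Hypothetical tables: "Zed" has no entry in table2.
table1 <- data.frame(Name = c("Bill", "Bob", "Zed"), ID = 1:3,
                     stringsAsFactors = FALSE)
table2 <- data.frame(Name = c("Bob", "Gavin", "Bill"),
                     Birthplace = c("Berlin", "Tokyo", "New York"),
                     stringsAsFactors = FALSE)

# Left join: all.x = TRUE keeps every row of table1, filling NA where
# no birthplace is found. The result is sorted by Name.
joined <- merge(table1, table2, all.x = TRUE)

# Vectorized lookup: match() returns, for each name in table1, the row
# position in table2 (or NA), so the original row order is preserved.
table1$Birthplace <- table2$Birthplace[match(table1$Name, table2$Name)]

print(joined)
print(table1)
```

The match() form assumes names are unique in table2; with duplicates it silently takes the first hit, whereas merge() would return one row per match.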

Resources