R: Sort data by most common value of a column - r

I am following this Stack Overflow post: Sort based on Frequency in R
I am trying to sort my data by the most frequent value of the column "Node_A".
library(dplyr)
Data_I_Have <- data.frame(
"Node_A" = c("John", "John", "John", "John", "John", "Peter", "Tim", "Kevin", "Adam", "Adam", "Xavier"),
"Node_B" = c("Claude", "Peter", "Tim", "Tim", "Claude", "Henry", "Kevin", "Claude", "Tim", "Henry", "Claude"),
" Place_Where_They_Met" = c("Chicago", "Boston", "Seattle", "Boston", "Paris", "Paris", "Chicago", "London", "Chicago", "London", "Paris"),
"Years_They_Have_Known_Each_Other" = c("10", "10", "1", "5", "2", "8", "7", "10", "3", "3", "5"),
"What_They_Have_In_Common" = c("Sports", "Movies", "Computers", "Computers", "Video Games", "Sports", "Movies", "Computers", "Sports", "Sports", "Video Games")
)
sort = Data_I_Have %>% arrange(Node_A, desc(Freq))
Could someone please show me what I am doing wrong?
Thanks

arrange() fails here because Data_I_Have has no Freq column. Before sorting by frequency you need to count how often each value occurs. You can try:
library(dplyr)
Data_I_Have %>%
count(Node_A, sort = TRUE) %>%
left_join(Data_I_Have, by = 'Node_A')
# Node_A n Node_B X.Place_Where_They_Met Years_They_Have_Known_Each_Other What_They_Have_In_Common
#1 John 5 Claude Chicago 10 Sports
#2 John 5 Peter Boston 10 Movies
#3 John 5 Tim Seattle 1 Computers
#4 John 5 Tim Boston 5 Computers
#5 John 5 Claude Paris 2 Video Games
#6 Adam 2 Tim Chicago 3 Sports
#7 Adam 2 Henry London 3 Sports
#8 Kevin 1 Claude London 10 Computers
#9 Peter 1 Henry Paris 8 Sports
#10 Tim 1 Kevin Chicago 7 Movies
#11 Xavier 1 Claude Paris 5 Video Games
Or we can use add_count instead of count so that we don't have to join the data.
Data_I_Have %>% add_count(Node_A, sort = TRUE)
You can remove the n column from the final output if it is not needed.
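A minimal sketch of that last step, assuming dplyr is loaded. `dat` is a small made-up stand-in for Data_I_Have, not the original data:

```r
library(dplyr)

# Hypothetical stand-in for Data_I_Have with only two columns
dat <- data.frame(
  Node_A = c("Peter", "John", "Adam", "John", "Adam", "John"),
  Node_B = c("Henry", "Claude", "Tim", "Peter", "Henry", "Tim"),
  stringsAsFactors = FALSE
)

# add_count() appends the per-value count as column n and, with
# sort = TRUE, orders the rows by that count (largest first);
# select(-n) then drops the helper column again
sorted <- dat %>%
  add_count(Node_A, sort = TRUE) %>%
  select(-n)

print(sorted)
```

Rows with the same count keep their original relative order.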

As in the last answer of the post you mentioned:
Data_I_Have %>%
group_by(Node_A) %>%
arrange( desc(n()))
# Node_A Node_B X.Place_Where_They_Met Years_They_Have_Known_Each_Other What_They_Have_In_Common
# <chr> <chr> <chr> <chr> <chr>
# 1 John Claude Chicago 10 Sports
# 2 John Peter Boston 10 Movies
# 3 John Tim Seattle 1 Computers
# 4 John Tim Boston 5 Computers
# 5 John Claude Paris 2 Video Games
# 6 Peter Henry Paris 8 Sports
# 7 Tim Kevin Chicago 7 Movies
# 8 Kevin Claude London 10 Computers
# 9 Adam Tim Chicago 3 Sports
# 10 Adam Henry London 3 Sports
# 11 Xavier Claude Paris 5 Video Games
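Note that in the output above Adam (two rows) is listed after Peter, Tim and Kevin (one row each), so the rows are not actually ordered by frequency. A sketch that stores the group size in a column before sorting, assuming dplyr is loaded and using a made-up stand-in `dat` rather than the original data:

```r
library(dplyr)

# Hypothetical stand-in for Data_I_Have
dat <- data.frame(
  Node_A = c("Peter", "John", "Adam", "John", "Adam", "John"),
  Node_B = c("Henry", "Claude", "Tim", "Peter", "Henry", "Tim"),
  stringsAsFactors = FALSE
)

sorted <- dat %>%
  group_by(Node_A) %>%
  mutate(n = n()) %>%   # size of each Node_A group, repeated on every row
  ungroup() %>%
  arrange(desc(n))      # most frequent Node_A values first

print(sorted)
```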

Related

Separate column into two: before and after a certain word

I have the following data set
> data
firm_name
1: Light Ltd John Smith
2: Bolt Night Ltd Mary Poppins
3: Bright Yellow Sun Ltd Harry Potter
---
I want to separate it into two columns depending on the position of the "Ltd". So, the data would look like:
> data
firm_name name
1: Light Ltd John Smith
2: Bolt Night Ltd Mary Poppins
3: Bright Yellow Sun Ltd Harry Potter
---
I tried with the stringr package but did not find any particular solution.
thanks in advance
You can use separate from tidyr with a lookbehind regular expression for this.
library(tidyr)
df %>%
separate(col = firm_name, into = c("firm_name", "name"), sep = "(?<=Ltd)")
#> firm_name name
#> 1 Light Ltd John Smith
#> 2 Bolt Night Ltd Mary Poppins
#> 3 Bright Yellow Sun Ltd Harry Potter
data
df <- data.frame(firm_name = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter"))
We can use base R with read.csv
read.csv(text = sub("(Ltd)", "\\1,", df$names),
header = FALSE, col.names = c('firm_name', 'name'))
# firm_name name
#1 Light Ltd John Smith
#2 Bolt Night Ltd Mary Poppins
#3 Bright Yellow Sun Ltd Harry Potter
data
df <- structure(list(names = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter")), row.names = c(NA, -3L
), class = "data.frame")
Are you after something like this?
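library(dplyr) # provides tibble() and %>%
df <-
tibble(
names = c("Light Ltd John Smith",
"Bolt Night Ltd Mary Poppins",
"Bright Yellow Sun Ltd Harry Potter")
)
# note: splitting on "Ltd" itself removes "Ltd" from both halves
df %>%
tidyr::separate(names, c("half_1", "half_2"), sep = "Ltd")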
Does this work:
> df %>% mutate(name = gsub('([A-z].*Ltd) (.*)','\\2', df$firm_name), firm_name = gsub('([A-z].*Ltd) (.*)','\\1', df$firm_name))
# A tibble: 3 x 2
firm_name name
<chr> <chr>
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter
>
Data used:
> df
# A tibble: 3 x 1
firm_name
<chr>
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter
>
Using tidyr::extract :
tidyr::extract(df, names, c('firm_name', 'name'), regex = '(.*Ltd)\\s(.*)')
# A tibble: 3 x 2
# firm_name name
# <chr> <chr>
#1 Light Ltd John Smith
#2 Bolt Night Ltd Mary Poppins
#3 Bright Yellow Sun Ltd Harry Potter
Or in base R :
df$name <- sub('.*Ltd\\s', '', df$names)
df$firm_name <- sub('(.*Ltd).*', '\\1', df$names)
df$names <- NULL
Another base R option
setNames(
data.frame(
do.call(
rbind,
strsplit(df$names, "(?<=Ltd)\\s+", perl = TRUE)
)
),
c("firm_name", "name")
)
giving
firm_name name
1 Light Ltd John Smith
2 Bolt Night Ltd Mary Poppins
3 Bright Yellow Sun Ltd Harry Potter
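The strsplit() pattern above also consumes the whitespace that follows "Ltd". The same pattern can be passed to tidyr::separate(), so that the name column does not start with a space. A sketch, assuming tidyr is installed:

```r
library(tidyr)

df <- data.frame(
  firm_name = c("Light Ltd John Smith",
                "Bolt Night Ltd Mary Poppins"),
  stringsAsFactors = FALSE
)

# Split right after "Ltd" (lookbehind keeps "Ltd" in the first part)
# and swallow the whitespace that follows it
out <- separate(df, col = firm_name,
                into = c("firm_name", "name"),
                sep = "(?<=Ltd)\\s+")

print(out)
```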

Converting Names into Identification Codes in different columns in R

I am new with R and I am struggling with the following issue:
I have a dataset more or less like this:
NAME Collegue1 Collegue 2
John Smith Bill Gates Brad Pitt
Adam Sandler Bill Gates John Smith
Bill Gates Brad Pitt Adam Sandler
Brad Pitt John Smith Bill Gates
I need to create an ID code and substitute names with the corresponding ID in the three columns, how can I do that?
Maybe you can try the code like below
df[]<-as.integer(factor(unlist(df),levels = df$NAME))
such that
> df
NAME Collegue1 Collegue2
1 1 3 4
2 2 3 1
3 3 4 2
4 4 1 3
Or
df[-1] <- as.integer(factor(unlist(df[-1]),levels = df$NAME))
such that
> df
NAME Collegue1 Collegue2
1 John Smith 3 4
2 Adam Sandler 3 1
3 Bill Gates 4 2
4 Brad Pitt 1 3
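The same lookup can also be sketched with match(), which returns the position of each name in df$NAME. This assumes every colleague also appears in the NAME column; names that do not would become NA:

```r
df <- data.frame(
  NAME      = c("John Smith", "Adam Sandler", "Bill Gates", "Brad Pitt"),
  Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt", "John Smith"),
  Collegue2 = c("Brad Pitt", "John Smith", "Adam Sandler", "Bill Gates"),
  stringsAsFactors = FALSE
)

# Replace every name by its row position in NAME; that position is the ID
ids <- df
ids[] <- lapply(df, match, table = df$NAME)

print(ids)
```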
Data
df <- structure(list(NAME = c("John Smith", "Adam Sandler", "Bill Gates",
"Brad Pitt"), Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt",
"John Smith"), Collegue2 = c("Brad Pitt", "John Smith", "Adam Sandler",
"Bill Gates")), class = "data.frame", row.names = c(NA, -4L))
You can convert the names to a factor and use unclass to get the ID codes.
x[-1] <- unclass(factor(unlist(x[-1]), x$NAME))
cbind(x["NAME"], ID=seq_along(x$NAME), x[-1])
# NAME ID Collegue1 Collegue.2
#1 John Smith 1 3 4
#2 Adam Sandler 2 3 1
#3 Bill Gates 3 4 2
#4 Brad Pitt 4 1 3
In case you are just interested in ID's:
levels(factor(unlist(x))) #Only in case you are interested in the codes of the table
#[1] "Adam Sandler" "Bill Gates" "Brad Pitt" "John Smith"
x[] <- unclass(factor(unlist(x)))
x
# NAME Collegue1 Collegue.2
#1 4 2 3
#2 1 2 4
#3 2 3 1
#4 3 4 2
Data:
x <- structure(list(NAME = c("John Smith", "Adam Sandler", "Bill Gates",
"Brad Pitt"), Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt",
"John Smith"), Collegue.2 = c("Brad Pitt", "John Smith", "Adam Sandler",
"Bill Gates")), class = "data.frame", row.names = c(NA, -4L))

How can I group the same value across multiple columns and sum subsequent values?

I have a table of information that looks like the following:
rusher_full_name receiver_full_name rushing_fpts receiving_fpts
<chr> <chr> <dbl> <dbl>
1 Aaron Jones NA 5 0
2 NA Aaron Jones 0 5
3 Mike Davis NA 0.5 0
4 NA Allen Robinson 0 3
5 Mike Davis NA 0.7 0
What I'm trying to do is get all of the values from the rushing_fpts and receiving_fpts to sum up depending on the rusher_full_name and receiver_full_name value. For example, for every instance of "Aaron Jones" (whether it's in rusher_full_name or receiver_full_name) sum up the values of rushing_fpts and receiving_fpts
In the end, this is what I'd like it to look like:
player_full_name total_fpts
<chr> <dbl>
1 Aaron Jones 10
2 Mike Davis 1.2
3 Allen Robinson 3
I'm pretty new to using R and have Googled a number of things but can't find any solution. Any suggestions on how to accomplish this?
library(tidyverse)
df %>%
mutate(player_full_name = coalesce(rusher_full_name, receiver_full_name)) %>%
group_by(player_full_name) %>%
summarise(total_fpts = sum(rushing_fpts+receiving_fpts))
Output
# A tibble: 3 x 2
player_full_name total_fpts
<chr> <dbl>
1 Aaron Jones 10
2 Allen Robinson 3
3 Mike Davis 1.2
Data
df <- data.frame(
rusher_full_name = c("Aaron Jones", NA, "Mike Davis", NA, "Mike Davis"),
receiver_full_name = c(NA, "Aaron Jones", NA, "Allen Robinson", NA),
rushing_fpts = c(5,0,0.5,0,.7),
receiving_fpts = c(0,5,0,3,0),
stringsAsFactors = FALSE
)
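Note that coalesce() takes the first non-missing name per row, so this relies on each row naming either a rusher or a receiver, as in the sample data. For comparison, a base R sketch of the same computation under that assumption:

```r
df <- data.frame(
  rusher_full_name   = c("Aaron Jones", NA, "Mike Davis", NA, "Mike Davis"),
  receiver_full_name = c(NA, "Aaron Jones", NA, "Allen Robinson", NA),
  rushing_fpts       = c(5, 0, 0.5, 0, 0.7),
  receiving_fpts     = c(0, 5, 0, 3, 0),
  stringsAsFactors   = FALSE
)

# Use the rusher name where present, otherwise the receiver name
df$player_full_name <- ifelse(is.na(df$rusher_full_name),
                              df$receiver_full_name,
                              df$rusher_full_name)

# Points per row, then summed per player
df$total_fpts <- df$rushing_fpts + df$receiving_fpts
totals <- aggregate(total_fpts ~ player_full_name, data = df, FUN = sum)

print(totals)
```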

Merge two datasets

I create a node list as follows:
name <- c("Joe","Frank","Peter")
city <- c("New York","Detroit","Maimi")
age <- c(24,55,65)
node_list <- data.frame(name,age,city)
node_list
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
Then I create an edge list as follows:
from <- c("Joe","Frank","Peter","Albert")
to <- c("Frank","Albert","James","Tony")
to_city <- c("Detroit","St. Louis","New York","Carson City")
edge_list <- data.frame(from,to,to_city)
edge_list
from to to_city
1 Joe Frank Detroit
2 Frank Albert St. Louis
3 Peter James New York
4 Albert Tony Carson City
Notice that the names in the node list and edge list do not overlap 100%. I want to create a master node list of all the names, capturing city information as well. This is my dplyr attempt to do this:
new_node <- edge_list %>%
gather("from_to", "name", from, to) %>%
distinct(name) %>%
full_join(node_list)
new_node
name age city
1 Joe 24 New York
2 Frank 55 Detroit
3 Peter 65 Maimi
4 Albert NA <NA>
5 James NA <NA>
6 Tony NA <NA>
I need to figure out how to add to_city information. What do I need to add to my dplyr code to make this happen? Thanks.
Join twice, once on to and once on from, with the irrelevant columns subsetted out:
library(dplyr)
# tibble() replaces the deprecated data_frame()
node_list <- tibble(name = c("Joe", "Frank", "Peter"),
city = c("New York", "Detroit", "Maimi"),
age = c(24, 55, 65))
edge_list <- tibble(from = c("Joe", "Frank", "Peter", "Albert"),
to = c("Frank", "Albert", "James", "Tony"),
to_city = c("Detroit", "St. Louis", "New York", "Carson City"))
node_list %>%
full_join(select(edge_list, name = to, city = to_city)) %>%
full_join(select(edge_list, name = from))
#> Joining, by = c("name", "city")
#> Joining, by = "name"
#> # A tibble: 6 x 3
#> name city age
#> <chr> <chr> <dbl>
#> 1 Joe New York 24.
#> 2 Frank Detroit 55.
#> 3 Peter Maimi 65.
#> 4 Albert St. Louis NA
#> 5 James New York NA
#> 6 Tony Carson City NA
In this case the second join doesn't do anything because everybody is already included, but it would insert anyone who only existed in the from column.
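To see the second join matter, here is a small sketch with a made-up edge list in which "Zed" (a hypothetical name, not in the original data) appears only in the from column. The by arguments are spelled out to avoid the "Joining, by" messages:

```r
library(dplyr)

node_list <- tibble(name = c("Joe", "Frank"),
                    city = c("New York", "Detroit"),
                    age  = c(24, 55))

# "Zed" only ever appears as a sender, never as a receiver
edge_list <- tibble(from    = c("Joe", "Zed"),
                    to      = c("Frank", "Albert"),
                    to_city = c("Detroit", "St. Louis"))

master <- node_list %>%
  # adds receivers together with their city
  full_join(select(edge_list, name = to, city = to_city),
            by = c("name", "city")) %>%
  # adds senders that are not in the table yet (city and age stay NA)
  full_join(select(edge_list, name = from), by = "name")

print(master)
```

Because the first join matches on both name and city, a person whose city differs between the two tables would end up with two rows.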

Lookup values in a vectorized way

I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. I also have another data frame with name and birthplace, but this data frame is much larger than the first and contains some but not all of the names from the first data frame. How can I add a third column to the first table that is populated with birthplaces looked up using the second table?
What I have now is:
corresponding.birthplaces <- sapply(table1$Name,
function(name){return(table2$Birthplace[table2$Name==name])})
This seems inefficient. Thoughts? Does anyone know of a good book or resource for using R 'properly'? I get the feeling that I generally think in the least computationally efficient manner conceivable.
Thanks :)
See ?merge, which performs a database-style merge (join).
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
"Jennifer","Joshua","Robyn","Iterator"),
Birthplace = sample(c("London","New York",
"San Francisco", "Berlin",
"Tokyo", "Paris"), 10, replace = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco
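By default merge() keeps only names found in both tables. If some names in the first table may be missing from the second, two options are sketched below with small made-up tables (not the data above): all.x = TRUE keeps unmatched rows, and match() does the lookup as a single vectorized indexing step:

```r
d1 <- data.frame(ID = 1:3, Name = c("Bill", "Bob", "Zoe"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Name = c("Bob", "Gavin", "Bill"),
                 Birthplace = c("Berlin", "Tokyo", "New York"),
                 stringsAsFactors = FALSE)

# Left join: every row of d1 is kept, Birthplace is NA where no match exists
merged <- merge(d1, d2, by = "Name", all.x = TRUE)

# Vectorized lookup: match() gives the row of d2 for each name in d1
# (NA if absent), which is then used to index the Birthplace column
d1$Birthplace <- d2$Birthplace[match(d1$Name, d2$Name)]

print(merged)
print(d1)
```

The match() version keeps the row order of d1, whereas merge() sorts the result by the join column.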
