Is there a function to get the difference between rows of the same type? [duplicate] - r

This question already has answers here:
Calculate difference between values in consecutive rows by group
(4 answers)
Closed 1 year ago.
I want to find the difference in the values of the same type.
Please refer to the sample dataframe below:
df <- data.frame(
x = c("Jimmy Page","Jimmy Page","Jimmy Page","Jimmy Page", "John Smith", "John Smith", "John Smith", "Joe Root", "Joe Root", "Joe Root", "Joe Root", "Joe Root"),
y = c(1,2,3,4,5,7,89,12,34,67,95,9674 )
)
I would like to get the difference between consecutive values for the same name, e.g. Jimmy Page = 1 and Jimmy Page = 2 gives difference = 1.
Where consecutive rows have different names, the difference should be NA.

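You can use diff inside ave, which applies the function to each group of x. Padding with NA at the end puts the NA on the last row of each name.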
df$diff <- ave(df$y, df$x, FUN=function(z) c(diff(z), NA))
df
# x y diff
#1 Jimmy Page 1 1
#2 Jimmy Page 2 1
#3 Jimmy Page 3 1
#4 Jimmy Page 4 NA
#5 John Smith 5 2
#6 John Smith 7 82
#7 John Smith 89 NA
#8 Joe Root 12 22
#9 Joe Root 34 33
#10 Joe Root 67 28
#11 Joe Root 95 9579
#12 Joe Root 9674 NA
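If the difference should instead sit on the later row of each pair (NA on the first row of every name), the same ave call can pad at the front. This is a minimal base R sketch, assuming the sample df from the question; the column name diff_prev is only illustrative.

```r
df <- data.frame(
  x = c(rep("Jimmy Page", 4), rep("John Smith", 3), rep("Joe Root", 5)),
  y = c(1, 2, 3, 4, 5, 7, 89, 12, 34, 67, 95, 9674)
)
# Pad with NA at the front so each row holds y minus the previous y in its group
df$diff_prev <- ave(df$y, df$x, FUN = function(z) c(NA, diff(z)))
df$diff_prev
```

A group with a single row simply gets NA, because diff returns an empty vector for it.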

library(tidyverse)
df <-
data.frame(
x = c(
"Jimmy Page",
"Jimmy Page",
"Jimmy Page",
"Jimmy Page",
"John Smith",
"John Smith",
"John Smith",
"Joe Root",
"Joe Root",
"Joe Root",
"Joe Root",
"Joe Root"
),
y = c(1, 2, 3, 4, 5, 7, 89, 12, 34, 67, 95, 9674)
)
df %>%
group_by(x) %>%
mutate(res = c(NA, diff(y))) %>%
ungroup()
#> # A tibble: 12 x 3
#> x y res
#> <chr> <dbl> <dbl>
#> 1 Jimmy Page 1 NA
#> 2 Jimmy Page 2 1
#> 3 Jimmy Page 3 1
#> 4 Jimmy Page 4 1
#> 5 John Smith 5 NA
#> 6 John Smith 7 2
#> 7 John Smith 89 82
#> 8 Joe Root 12 NA
#> 9 Joe Root 34 22
#> 10 Joe Root 67 33
#> 11 Joe Root 95 28
#> 12 Joe Root 9674 9579
Created on 2021-09-14 by the reprex package (v2.0.1)

Related

R: Sort data by most common value of a column

I am following this stackoverflow post over here: Sort based on Frequency in R
I am trying to sort my data by the most frequent value of the column "Node_A".
library(dplyr)
Data_I_Have <- data.frame(
"Node_A" = c("John", "John", "John", "John", "John", "Peter", "Tim", "Kevin", "Adam", "Adam", "Xavier"),
"Node_B" = c("Claude", "Peter", "Tim", "Tim", "Claude", "Henry", "Kevin", "Claude", "Tim", "Henry", "Claude"),
" Place_Where_They_Met" = c("Chicago", "Boston", "Seattle", "Boston", "Paris", "Paris", "Chicago", "London", "Chicago", "London", "Paris"),
"Years_They_Have_Known_Each_Other" = c("10", "10", "1", "5", "2", "8", "7", "10", "3", "3", "5"),
"What_They_Have_In_Common" = c("Sports", "Movies", "Computers", "Computers", "Video Games", "Sports", "Movies", "Computers", "Sports", "Sports", "Video Games")
)
sort = Data_I_Have %>% arrange(Node_A, desc(Freq))
Could someone please show me what I am doing wrong?
Thanks
Your data has no Freq column, so you need to count the rows per Node_A before you can sort by that count. You can try:
library(dplyr)
Data_I_Have %>%
count(Node_A, sort = TRUE) %>%
left_join(Data_I_Have, by = 'Node_A')
# Node_A n Node_B X.Place_Where_They_Met Years_They_Have_Known_Each_Other What_They_Have_In_Common
#1 John 5 Claude Chicago 10 Sports
#2 John 5 Peter Boston 10 Movies
#3 John 5 Tim Seattle 1 Computers
#4 John 5 Tim Boston 5 Computers
#5 John 5 Claude Paris 2 Video Games
#6 Adam 2 Tim Chicago 3 Sports
#7 Adam 2 Henry London 3 Sports
#8 Kevin 1 Claude London 10 Computers
#9 Peter 1 Henry Paris 8 Sports
#10 Tim 1 Kevin Chicago 7 Movies
#11 Xavier 1 Claude Paris 5 Video Games
Or we can use add_count instead of count so that we don't have to join the data.
Data_I_Have %>% add_count(Node_A, sort = TRUE)
You can remove the n column from the final output if it is not needed.
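For comparison, the same count-then-sort idea can be sketched in base R without adding a helper column. This assumes a cut-down version of Data_I_Have with only two columns; freq and sorted are illustrative names.

```r
Data_I_Have <- data.frame(
  Node_A = c("John", "John", "John", "John", "John", "Peter", "Tim", "Kevin", "Adam", "Adam", "Xavier"),
  Node_B = c("Claude", "Peter", "Tim", "Tim", "Claude", "Henry", "Kevin", "Claude", "Tim", "Henry", "Claude"),
  stringsAsFactors = FALSE
)
# Number of rows sharing each row's Node_A value
freq <- ave(seq_along(Data_I_Have$Node_A), Data_I_Have$Node_A, FUN = length)
# Most frequent Node_A first; rows with equal counts keep their original order
sorted <- Data_I_Have[order(-freq), ]
sorted$Node_A
```

Unlike count(sort = TRUE), which lists tied names alphabetically, this keeps tied names in the order they first appear in the data.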
As in the last answer of the post you mentioned:
Data_I_Have %>%
group_by(Node_A) %>%
arrange( desc(n()))
# Node_A Node_B X.Place_Where_They_Met Years_They_Have_Known_Each_Other What_They_Have_In_Common
# <chr> <chr> <chr> <chr> <chr>
# 1 John Claude Chicago 10 Sports
# 2 John Peter Boston 10 Movies
# 3 John Tim Seattle 1 Computers
# 4 John Tim Boston 5 Computers
# 5 John Claude Paris 2 Video Games
# 6 Peter Henry Paris 8 Sports
# 7 Tim Kevin Chicago 7 Movies
# 8 Kevin Claude London 10 Computers
# 9 Adam Tim Chicago 3 Sports
# 10 Adam Henry London 3 Sports
# 11 Xavier Claude Paris 5 Video Games
Note that this output is still in the original row order: Adam (2 rows) is listed after Peter, Tim and Kevin (1 row each). arrange() ignores grouping unless .by_group = TRUE, so n() appears to be the same for every row here and nothing is reordered. Counting first, as with add_count, is what sorts by frequency.

Converting Names into Identification Codes in different columns in R

I am new with R and I am struggling with the following issue:
I have a dataset more or less like this:
NAME          Collegue1   Collegue 2
John Smith    Bill Gates  Brad Pitt
Adam Sandler  Bill Gates  John Smith
Bill Gates    Brad Pitt   Adam Sandler
Brad Pitt     John Smith  Bill Gates
I need to create an ID code and substitute names with the corresponding ID in the three columns, how can I do that?
You can try the code below, which uses the order of df$NAME as the ID of each person:
df[]<-as.integer(factor(unlist(df),levels = df$NAME))
such that
> df
NAME Collegue1 Collegue2
1 1 3 4
2 2 3 1
3 3 4 2
4 4 1 3
Or
df[-1] <- as.integer(factor(unlist(df[-1]),levels = df$NAME))
such that
> df
NAME Collegue1 Collegue2
1 John Smith 3 4
2 Adam Sandler 3 1
3 Bill Gates 4 2
4 Brad Pitt 1 3
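If you also want to keep an explicit ID-to-name table, the same mapping can be sketched with match. This is only an illustration, assuming the ID is the row position in NAME as above; id_table and coded are hypothetical names.

```r
df <- data.frame(
  NAME      = c("John Smith", "Adam Sandler", "Bill Gates", "Brad Pitt"),
  Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt", "John Smith"),
  Collegue2 = c("Brad Pitt", "John Smith", "Adam Sandler", "Bill Gates"),
  stringsAsFactors = FALSE
)
# Lookup table: one row per person, ID taken from the row position in NAME
id_table <- data.frame(ID = seq_along(df$NAME), NAME = df$NAME,
                       stringsAsFactors = FALSE)
# match() gives the position of each name in the lookup table, i.e. its ID
coded <- df
coded[] <- lapply(df, function(col) id_table$ID[match(col, id_table$NAME)])
coded
```

A name that does not occur in NAME would come back as NA, which makes missing people easy to spot.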
Data
df <- structure(list(NAME = c("John Smith", "Adam Sandler", "Bill Gates",
"Brad Pitt"), Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt",
"John Smith"), Collegue2 = c("Brad Pitt", "John Smith", "Adam Sandler",
"Bill Gates")), class = "data.frame", row.names = c(NA, -4L))
You can convert the names to a factor and use unclass to get the ID codes.
x[-1] <- unclass(factor(unlist(x[-1]), x$NAME))
cbind(x["NAME"], ID=seq_along(x$NAME), x[-1])
# NAME ID Collegue1 Collegue.2
#1 John Smith 1 3 4
#2 Adam Sandler 2 3 1
#3 Bill Gates 3 4 2
#4 Brad Pitt 4 1 3
In case you are just interested in ID's:
levels(factor(unlist(x))) #Only in case you are interested in the codes of the table
#[1] "Adam Sandler" "Bill Gates" "Brad Pitt" "John Smith"
x[] <- unclass(factor(unlist(x)))
x
# NAME Collegue1 Collegue.2
#1 4 2 3
#2 1 2 4
#3 2 3 1
#4 3 4 2
Data:
x <- structure(list(NAME = c("John Smith", "Adam Sandler", "Bill Gates",
"Brad Pitt"), Collegue1 = c("Bill Gates", "Bill Gates", "Brad Pitt",
"John Smith"), Collegue.2 = c("Brad Pitt", "John Smith", "Adam Sandler",
"Bill Gates")), class = "data.frame", row.names = c(NA, -4L))

How can I group the same value across multiple columns and sum subsequent values?

I have a table of information that looks like the following:
  rusher_full_name receiver_full_name rushing_fpts receiving_fpts
  <chr>            <chr>                     <dbl>          <dbl>
1 Aaron Jones      NA                          5              0
2 NA               Aaron Jones                 0              5
3 Mike Davis       NA                          0.5            0
4 NA               Allen Robinson              0              3
5 Mike Davis       NA                          0.7            0
What I'm trying to do is sum the values of rushing_fpts and receiving_fpts per player, where the player's name may appear in either rusher_full_name or receiver_full_name. For example, for every instance of "Aaron Jones" (whether it's in rusher_full_name or receiver_full_name), sum up the values of rushing_fpts and receiving_fpts.
In the end, this is what I'd like it to look like:
  player_full_name total_fpts
  <chr>                 <dbl>
1 Aaron Jones            10
2 Mike Davis              1.2
3 Allen Robinson          3
I'm pretty new to using R and have Googled a number of things but can't find any solution. Any suggestions on how to accomplish this?
library(tidyverse)
df %>%
mutate(player_full_name = coalesce(rusher_full_name, receiver_full_name)) %>%
group_by(player_full_name) %>%
summarise(total_fpts = sum(rushing_fpts+receiving_fpts))
Output
# A tibble: 3 x 2
player_full_name total_fpts
<chr> <dbl>
1 Aaron Jones 10
2 Allen Robinson 3
3 Mike Davis 1.2
Data
df <- data.frame(
rusher_full_name = c("Aaron Jones", NA, "Mike Davis", NA, "Mike Davis"),
receiver_full_name = c(NA, "Aaron Jones", NA, "Allen Robinson", NA),
rushing_fpts = c(5,0,0.5,0,.7),
receiving_fpts = c(0,5,0,3,0),
stringsAsFactors = FALSE
)
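The same idea (take whichever name column is not NA, then sum per name) can be sketched in base R. This assumes each row has exactly one of the two names filled in, as in the sample data; player and total_fpts are illustrative names.

```r
df <- data.frame(
  rusher_full_name = c("Aaron Jones", NA, "Mike Davis", NA, "Mike Davis"),
  receiver_full_name = c(NA, "Aaron Jones", NA, "Allen Robinson", NA),
  rushing_fpts = c(5, 0, 0.5, 0, 0.7),
  receiving_fpts = c(0, 5, 0, 3, 0),
  stringsAsFactors = FALSE
)
# Take the rusher name when present, otherwise the receiver name
player <- ifelse(is.na(df$rusher_full_name),
                 df$receiver_full_name, df$rusher_full_name)
# Sum both point columns per player; tapply() returns a vector named by player
total_fpts <- tapply(df$rushing_fpts + df$receiving_fpts, player, sum)
total_fpts
```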

Summing Rows Next to a Name in R

I'm working on a banking project where I'm trying to find a yearly sum of money spent, while the dataset has these listed as monthly transactions.
Month  Name        Money Spent
2      John Smith  10
3      John Smith  25
4      John Smith  20
2      Joe Nais    10
3      Joe Nais    25
4      Joe Nais    20
Right now, this is the code I have:
OTData <- OTData %>%
mutate(
OTData,
Full Year = [CODE NEEDED TO SUM UP]
)
Thanks!
As @Pawel said, there's no question here. I assume you want:
library(dplyr)
df <- data.frame(Month = c(2,3,4,2,3,4),
                 Name = c("John Smith", "John Smith", "John Smith",
                          "Joe Nais", "Joe Nais", "Joe Nais"),
                 Money_Spent = c(10,25,20,10,25,20))
# one row per Name with the total spent
df %>%
  group_by(Name) %>%
  summarize(Full_year = sum(Money_Spent))
Name Full_year
<fct> <dbl>
1 Joe Nais 55
2 John Smith 55
NOTE: You're going to run into trouble if you include spaces in your variable names. You really should replace them with ., _, or camelCase as in the above example.
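If the goal is to keep every monthly row and add the yearly total as a new column (which is what the mutate attempt in the question suggests), a base R sketch with ave could look like this. It assumes the columns are named Month, Name and Money_Spent; Full_Year is an illustrative name.

```r
OTData <- data.frame(
  Month = c(2, 3, 4, 2, 3, 4),
  Name = c("John Smith", "John Smith", "John Smith",
           "Joe Nais", "Joe Nais", "Joe Nais"),
  Money_Spent = c(10, 25, 20, 10, 25, 20)
)
# ave() repeats the per-Name sum on every row of that Name
OTData$Full_Year <- ave(OTData$Money_Spent, OTData$Name, FUN = sum)
OTData
```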

Replace multiple strings/values based on separate list

I have a data frame that looks similar to this:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 John Smith GROUP1 2015 1 John Smith 5 Adam Smith 12 Mike Smith 20 Sam Smith 7 Luke Smith 3 George Smith
Each row repeats for new logs, but the values in X.1 : Y.3 change often.
The values in ID and in X.1 : Y.3 are strings made of a numeric value followed by the name, e.g. "1 John Smith" or "20 Sam Smith".
I have an issue where in certain instances, the ID will remain as "1 John Smith" but in X.1 : Y.3 the number may change preceding "John Smith", so for example it might be "14 John Smith". The names will always be correct, it's just the number that sometimes gets mixed up.
I have a list of 200+ ID's that are impacted by this mismatch - what is the most efficient way to replace the values in X.1 : Y.3 so that they match the correct ID in column ID?
I won't know which column "14 John Smith" shows up in, it could be X.1, or Y.2, or Y.3 depending on the row.
I can use a replace function in a dplyr line of code, or gsub for each of the 200+ IDs and for each column affected, but it seems very inefficient. Is there a quicker way than repeating something like the below x times?
df%>%mutate(X.1=replace(X.1, grepl('John Smith', X.1), "1 John Smith"))%>%as.data.frame()
Sometimes it helps to temporarily reshape the data. That way we can operate on all the X and Y values without iterating over them.
library(stringr)
library(tidyr)
## some data to work with
exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
stringsAsFactors = FALSE)
## re-arrange to put X and Y columns into a single column
exd <- gather(exd, key = "var", value = "value", X.1, X.2, X.3, Y.1, Y.2, Y.3)
## find the X and Y values that contain the ID name
matches <- str_detect(exd$value, str_replace_all(exd$ID, "^\\d+ *", ""))
## replace X and Y values with the matching ID
exd[matches, "value"] <- exd$ID[matches]
## put it back in the original shape
exd <- spread(exd, key = "var", value = value)
exd
## EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
## 1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
## 2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
## 3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
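The same row-wise replacement can also be sketched without reshaping, by looping over the value columns in base R. This is only an illustration on a cut-down version of the data with two value columns; id_name and hit are hypothetical names.

```r
exd <- read.csv(text = "EVENT,ID,X.1,X.2
1,1 John Smith,19 John Smith,11 Adam Smith
2,2 John Smith,1 George Smith,11 John Smith
3,3 John Smith,5 George Smith,18 John Smith",
                stringsAsFactors = FALSE)
# Name part of each row's ID, with the leading number removed
id_name <- sub("^[0-9]+ *", "", exd$ID)
for (col in c("X.1", "X.2")) {
  # Row by row: does this cell contain the name from the same row's ID?
  hit <- mapply(grepl, id_name, exd[[col]], MoreArgs = list(fixed = TRUE))
  # Overwrite matching cells with the full ID of their own row
  exd[[col]][hit] <- exd$ID[hit]
}
exd
```

The loop runs once per column rather than once per ID, so it stays short even with 200+ IDs.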
Not sure if you're set on dplyr and piping, but I think this is a plyr solution that does what you need. Given this example dataset:
> df
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 19 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 11 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 18 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
This adply function goes row by row and replaces any matching X:Y column values with the one from the ID column:
library(plyr)
adply(df, .margins = 1, function(x) {
idcol <- as.character(x$ID)
searchname <- trimws(gsub('[[:digit:]]+', "", idcol))
sapply(x[5:10], function(y) {
ifelse(grepl(searchname, y), idcol, as.character(y))
})
})
Output:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Data:
names <- c("EVENT","ID",'GROUP','YEAR', paste(rep(c("X.", "Y."), each = 3), 1:3, sep = ""))
first <- c("John", "Sam", "Adam", "Mike", "Luke", "George")
set.seed(2017)
randvals <- t(sapply(1:3, function(x) paste(sample(1:20, size = 6),
paste(sample(first, replace = FALSE, size = 6), "Smith"))))
df <- cbind(data.frame(1:3, paste(1:3, "John Smith"), "GROUP1", 2015), randvals)
names(df) <- names
I think that the most efficient way to accomplish this is by building a loop. The reason is that you will have to repeat the function to replace the names for every name in your ID list. With a loop, you can automate this.
I will make some assumptions first:
The ID list can be read as a character vector.
You don't have any typos in the ID list or in your data.frame, including differences between lowercase and uppercase letters in the names.
Your ID list does not contain the numbers. If it does contain numbers, you have to use gsub to erase them.
The example works with a data.frame (DF) with the same structure as the one in your question.
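ID <- c("John Smith", "Adam Smith", "George Smith")
# grep() on a data.frame treats each whole column as one string, so search a character matrix of the cells instead
vals <- as.matrix(DF[, 5:10])
for (i in seq_along(ID)) {
  # replace every cell that contains the name with the name itself
  vals[grepl(ID[i], vals, fixed = TRUE)] <- ID[i]
}
DF[, 5:10] <- vals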
With each iteration, this loop will:
Identify the positions in columns X.1:Y.3 (columns 5 to 10 in your question) where the name ID[i] appears.
Change all of those values to ID[i].
So the first iteration will: 1) find every position where "John Smith" appears in those columns; 2) replace each of those "# John Smith" values with "John Smith".
Note: If you simply want to delete the numbers, you can use gsub to replace them. Take into account that you probably want to erase the first space between the number and the name too. One way to do this is using gsub and a regular expression:
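# apply gsub to each column separately; gsub on the whole data.frame would collapse each column into one string
DF[, 5:10] <- lapply(DF[, 5:10], function(col) gsub("^[0-9]+ ", "", col))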
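String functions such as gsub and grep work on character vectors, so on a data frame they are best applied one column at a time. The sketch below shows the difference on a small made-up data frame (the columns a and b and their values are invented for illustration).

```r
DF <- data.frame(
  a = c("14 John Smith", "3 Adam Smith"),
  b = c("7 Sam Smith", "2 John Smith"),
  stringsAsFactors = FALSE
)
# Coercing the whole data frame gives one string per column, not one per cell
flat <- as.character(DF)
length(flat)
# Applying gsub per column keeps every cell and strips the leading number
DF[] <- lapply(DF, function(col) gsub("^[0-9]+ ", "", col))
DF
```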
