How do I vectorize using a lookup in R?

Suppose I have two data frames, Df1 and Df2, as follows:
Df1
Id Price Profit Month
10 5 2 1
10 5 3 2
10 5 2 3
11 7 3 1
11 7 1 2
12 0 0 1
12 5 1 2
Df2
Id Name
9 Kane
10 Jack
10 Jack
11 Will
12 Matt
13 Lee
14 Han
Now I want to insert a new column named Name into Df1 and fill it from Df2 by matching on Id, so that the modified Df1 will be:
Id Price Profit Month Name
10 5 2 1 Jack
10 5 3 2 Jack
10 5 2 3 Jack
11 7 3 1 Will
11 7 1 2 Will
12 0 0 1 Matt
12 5 1 2 Matt

df1 <- data.frame(Id = c(10L, 10L, 10L, 11L, 11L, 12L, 12L), Price = c(5L, 5L, 5L, 7L, 7L, 0L, 5L), Profit = c(2L, 3L, 2L, 3L, 1L, 0L, 1L), Month = c(1L, 2L, 3L, 1L, 2L, 1L, 2L), stringsAsFactors = FALSE)
df2 <- data.frame(Id = c(9L, 10L, 10L, 11L, 12L, 13L, 14L), Name = c('Kane', 'Jack', 'Jack', 'Will', 'Matt', 'Lee', 'Han'), stringsAsFactors = FALSE)
# match() returns, for each df1$Id, the position of its first occurrence in df2$Id
df1$Name <- df2$Name[match(df1$Id, df2$Id)]
df1
## Id Price Profit Month Name
## 1 10 5 2 1 Jack
## 2 10 5 3 2 Jack
## 3 10 5 2 3 Jack
## 4 11 7 3 1 Will
## 5 11 7 1 2 Will
## 6 12 0 0 1 Matt
## 7 12 5 1 2 Matt
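To make the behaviour of match() in the lookup above more concrete, here is a minimal sketch on a smaller, made-up pair of data frames (the Id 99 row is an assumption added for illustration, not part of the question). It suggests two things worth knowing: an Id that is missing from the lookup table yields NA, and a duplicated Id in the lookup table is resolved to its first occurrence.

```r
# Small illustrative data; Id 99 is hypothetical and has no match in the lookup table
main   <- data.frame(Id = c(10L, 11L, 99L))
lookup <- data.frame(Id = c(10L, 10L, 11L),
                     Name = c("Jack", "Jack", "Will"),
                     stringsAsFactors = FALSE)

# Position of the first occurrence of each main$Id in lookup$Id (NA if absent)
idx <- match(main$Id, lookup$Id)

# Index the lookup column with those positions; NA positions give NA names
main$Name <- lookup$Name[idx]
main
```

Because match() always takes the first hit, this approach never multiplies rows, which is one reason it can be a safer default than a join when the lookup table may contain duplicates.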

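Use left_join() from dplyr. Because df2 contains a duplicated row for Id 10, deduplicate it first with distinct(); otherwise every row of df1 with Id 10 is matched twice.
library(dplyr)
left_join(df1, distinct(df2), by = "Id")
For example:
> left_join(df1, distinct(df2))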
Joining by: "Id"
Id Price Profit Month Name
1 10 5 2 1 Jack
2 10 5 3 2 Jack
3 10 5 2 3 Jack
4 11 7 3 1 Will
5 11 7 1 2 Will
6 12 0 0 1 Matt
7 12 5 1 2 Matt
The Data Wrangling cheatsheet by RStudio is a very helpful resource.

Here is an option using data.table
library(data.table)
setDT(Df1)[unique(Df2), on = "Id", nomatch=0]
# Id Price Profit Month Name
#1: 10 5 2 1 Jack
#2: 10 5 3 2 Jack
#3: 10 5 2 3 Jack
#4: 11 7 3 1 Will
#5: 11 7 1 2 Will
#6: 12 0 0 1 Matt
#7: 12 5 1 2 Matt
Or, as @Arun mentioned in the comments, we can assign (:=) the "Name" column after joining on "Id" to reflect the changes in the original dataset "Df1".
setDT(Df1)[Df2, Name:= Name, on = "Id"]
Df1

A simple base R option could be merge()
merge(Df1,unique(Df2), by="Id")
# Id Price Profit Month Name
#1 10 5 2 1 Jack
#2 10 5 3 2 Jack
#3 10 5 2 3 Jack
#4 11 7 3 1 Will
#5 11 7 1 2 Will
#6 12 0 0 1 Matt
#7 12 5 1 2 Matt
The function unique() is used here because of the duplicate entry in Df2 concerning "Jack". For the example data described in the OP the option by="Id" can be omitted, but it might be necessary in a more general case.
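The effect of the duplicate "Jack" entry can be checked directly. The following sketch rebuilds the example data with a reduced set of columns (only Id and Month are kept for Df1, as a simplification) and compares the row counts with and without unique().

```r
# Example data from the question, reduced to the columns needed here
Df1 <- data.frame(Id = c(10L, 10L, 10L, 11L, 11L, 12L, 12L),
                  Month = c(1L, 2L, 3L, 1L, 2L, 1L, 2L))
Df2 <- data.frame(Id = c(9L, 10L, 10L, 11L, 12L, 13L, 14L),
                  Name = c("Kane", "Jack", "Jack", "Will", "Matt", "Lee", "Han"),
                  stringsAsFactors = FALSE)

# Without unique(): each of the three Id 10 rows matches both "Jack" rows
n_dup <- nrow(merge(Df1, Df2, by = "Id"))             # 10 rows

# With unique(): one lookup row per Id, so the original 7 rows are kept
n_unique <- nrow(merge(Df1, unique(Df2), by = "Id"))  # 7 rows
```

If some Ids in Df1 might be absent from Df2, adding all.x = TRUE to merge() keeps those rows with Name set to NA instead of dropping them.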

Related

Grouping similar elements together

I am trying to group similar entities together and can't find an easy way to do so.
For example, here is a table:
Names Initial_Group Final_Group
1 James,Gordon 6 A
2 James,Gordon 6 A
3 James,Gordon 6 A
4 James,Gordon 6 A
5 James,Gordon 6 A
6 James,Gordon 6 A
7 Amanda 1 A
8 Amanda 1 A
9 Amanda 1 A
10 Gordon,Amanda 5 A
11 Gordon,Amanda 5 A
12 Gordon,Amanda 5 A
13 Gordon,Amanda 5 A
14 Gordon,Amanda 5 A
15 Gordon,Amanda 5 A
16 Gordon,Amanda 5 A
17 Gordon,Amanda 5 A
18 Edward,Gordon,Amanda 4 A
19 Edward,Gordon,Amanda 4 A
20 Edward,Gordon,Amanda 4 A
21 Anna 2 B
22 Anna 2 B
23 Anna 2 B
24 Anna,Leonard 3 B
25 Anna,Leonard 3 B
26 Anna,Leonard 3 B
I am unsure how to get the 'Final_Group' field, in the table above.
For that, I need to assign any element that has any connections to another element, and group them together:
For example, rows 1 to 20 need to be grouped together because they are all connected by at least one shared element.
So for rows 1 to 6, 'James, Gordon' appear, and since "Gordon" is in rows 10:20, they all have to be grouped. Likewise, since 'Amanda' appears in rows 7:9, these have to be grouped with "James,Gordon", "Gordon, Amanda", and "Edward, Gordon, Amanda".
Below is code to generate the initial data:
# Manually generating data
Names <- c(rep('James,Gordon',6)
,rep('Amanda',3)
,rep('Gordon,Amanda',8)
,rep('Edward,Gordon,Amanda',3)
,rep('Anna',3)
,rep('Anna,Leonard',3))
Initial_Group <- rep(1:6,c(6,3,8,3,3,3))
Final_Group <- rep(c('A','B'),c(20,6))
data <- data.frame(Names,Initial_Group,Final_Group)
# Grouping
data %>%
select(Names) %>%
mutate(Initial_Group=group_indices(.,Names))
Does anyone know of any way to do this in R?
This is a long one, but you could do the following (df here is the data frame called data in the question):
library(tidyverse)
library(igraph)
df %>%
select(Names)%>%
distinct() %>%
separate(Names, c('first', 'second'), extra = 'merge', fill = 'right')%>%
separate_rows(second) %>%
mutate(second = coalesce(second, as.character(cumsum(is.na(second)))))%>%
graph_from_data_frame()%>%
components()%>%
getElement('membership')%>%
imap(~str_detect(df$Names, .y)*.x) %>%
invoke(pmax, .)%>%
cbind(df, value = LETTERS[.], value1 = .)
Names Initial_Group Final_Group value value1
1 James,Gordon 6 A A 1
2 James,Gordon 6 A A 1
3 James,Gordon 6 A A 1
4 James,Gordon 6 A A 1
5 James,Gordon 6 A A 1
6 James,Gordon 6 A A 1
7 Amanda 1 A A 1
8 Amanda 1 A A 1
9 Amanda 1 A A 1
10 Gordon,Amanda 5 A A 1
11 Gordon,Amanda 5 A A 1
12 Gordon,Amanda 5 A A 1
13 Gordon,Amanda 5 A A 1
14 Gordon,Amanda 5 A A 1
15 Gordon,Amanda 5 A A 1
16 Gordon,Amanda 5 A A 1
17 Gordon,Amanda 5 A A 1
18 Edward,Gordon,Amanda 4 A A 1
19 Edward,Gordon,Amanda 4 A A 1
20 Edward,Gordon,Amanda 4 A A 1
21 Anna 2 B B 2
22 Anna 2 B B 2
23 Anna 2 B B 2
24 Anna,Leonard 3 B B 2
25 Anna,Leonard 3 B B 2
26 Anna,Leonard 3 B B 2
Check the column called value
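I initially misunderstood the question; I now take it that your focus is on Final_Group. If that is not the case, please let me know.
My approach is based on the distance between samples: each name becomes a TRUE/FALSE indicator column, and the rows are then clustered hierarchically.
library(dplyr)
library(stringr)
data <- data %>%
mutate(Names = sapply(Names, function(x) as.vector(str_split(x, ","))))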
for (i in c(1:26)){
data$James[i] = ("James" %in% data$Names[[i]])
data$Gordon[i] = ("Gordon" %in% data$Names[[i]])
data$Amanda[i] = ("Amanda" %in% data$Names[[i]])
data$Edward[i] = ("Edward" %in% data$Names[[i]])
data$Anna[i] = ("Anna" %in% data$Names[[i]])
data$Leonard[i] = ("Leonard" %in% data$Names[[i]])
}
hc <- data %>% select(-Names) %>%
select(-Final_Group, -Initial_Group) %>%
dist() %>% hclust(., method = "complete")
# cutree() needs the number of clusters (k) or a cut height (h); two groups are wanted here
cutree(hc, k = 2)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2
plot(hc)
The resulting cluster labels match Final_Group.
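For readers who would rather avoid extra packages, the grouping rule from the question (any two entries sharing a name belong together) can also be sketched in base R by repeatedly merging name sets that overlap. This is only an illustrative alternative to the answers above, not the method either answer uses; the variable names are made up for the example.

```r
# Data from the question
Names <- c(rep("James,Gordon", 6), rep("Amanda", 3), rep("Gordon,Amanda", 8),
           rep("Edward,Gordon,Amanda", 3), rep("Anna", 3), rep("Anna,Leonard", 3))

# One character vector of names per distinct entry
name_sets <- strsplit(unique(Names), ",")

# Merge each set into every existing group it shares a name with
groups <- list()
for (s in name_sets) {
  hit <- vapply(groups, function(g) any(s %in% g), logical(1))
  merged <- unique(c(s, unlist(groups[hit])))
  groups <- c(groups[!hit], list(merged))
}

# After merging, every name sits in exactly one group, so the first name of a row is enough
first_name <- vapply(strsplit(Names, ","), `[`, character(1), 1)
group_id <- vapply(first_name, function(n) {
  which(vapply(groups, function(g) n %in% g, logical(1)))
}, integer(1), USE.NAMES = FALSE)

Final_Group <- LETTERS[group_id]
```

For this data the result is "A" for rows 1 to 20 and "B" for rows 21 to 26. In general the group numbers depend on the order in which sets are merged, so only the partition, not the particular letter, should be relied on.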

How to use a join to combine two data frames by two variables and keep the non-matching rows of the second variable

I want to combine two data frames by both the ID and date variables, keeping all IDs and all dates that appear in either data frame.
examples:
data A:
ID date V1
1 1 a
1 4 b
2 9 d
3 10 e
data B:
ID date X
1 1 24
1 2 30
1 4 15
2 2 40
2 5 10
2 7 12
results:
ID date X V1
1 1 24 a
1 2 30 NA
1 4 15 b
2 2 40 NA
2 5 10 NA
2 7 12 NA
2 9 NA d
3 10 NA e
You could use the following solution (df1 and df2 stand for data A and data B):
library(dplyr)
df1 %>%
full_join(df2, by = c("ID", "date")) %>%
arrange(ID, date)
ID date V1 X
1 1 1 a 24
2 1 2 <NA> 30
3 1 4 b 15
4 2 2 <NA> 40
5 2 5 <NA> 10
6 2 7 <NA> 12
7 2 9 d NA
8 3 10 e NA
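The same full join can be sketched in base R with merge() and all = TRUE, which may be convenient when dplyr is not available. The data frame names A and B follow the question; the explicit order() call is added here so that the row order does not depend on merge()'s own sorting.

```r
# Data A and data B from the question
A <- data.frame(ID = c(1, 1, 2, 3), date = c(1, 4, 9, 10),
                V1 = c("a", "b", "d", "e"), stringsAsFactors = FALSE)
B <- data.frame(ID = c(1, 1, 1, 2, 2, 2), date = c(1, 2, 4, 2, 5, 7),
                X = c(24, 30, 15, 40, 10, 12))

# all = TRUE keeps unmatched rows from both sides (a full outer join)
res <- merge(A, B, by = c("ID", "date"), all = TRUE)

# Sort explicitly by ID and then date
res <- res[order(res$ID, res$date), ]
rownames(res) <- NULL
res
```

Using all.x = TRUE or all.y = TRUE instead of all = TRUE gives the left and right join variants.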

Recode dates to study day within subject

I have data in which subjects completed multiple ratings per day over 6-7 days. The number of ratings per day varies. The data set includes subject ID, date, and the ratings. I would like to create a new variable that recodes the dates for each subject into "study day" --- so 1 for first day of ratings, 2 for second day of ratings, etc.
For example, I would like to take this:
id Date Rating
1 10/20/2018 2
1 10/20/2018 3
1 10/20/2018 5
1 10/21/2018 1
1 10/21/2018 7
1 10/21/2018 9
1 10/22/2018 4
1 10/22/2018 5
1 10/22/2018 9
2 11/15/2018 1
2 11/15/2018 3
2 11/15/2018 4
2 11/16/2018 3
2 11/16/2018 1
2 11/17/2018 0
2 11/17/2018 2
2 11/17/2018 9
And end up with this:
id Day Date Rating
1 1 10/20/2018 2
1 1 10/20/2018 3
1 1 10/20/2018 5
1 2 10/21/2018 1
1 2 10/21/2018 7
1 2 10/21/2018 9
1 3 10/22/2018 4
1 3 10/22/2018 5
1 3 10/22/2018 9
2 1 11/15/2018 1
2 1 11/15/2018 3
2 1 11/15/2018 4
2 2 11/16/2018 3
2 2 11/16/2018 1
2 3 11/17/2018 0
2 3 11/17/2018 2
2 3 11/17/2018 9
I was going to look into setting up some kind of loop, but I thought it would be worth asking if there is a more efficient way to pull this off. Are there any functions that would allow me to automate this sort of thing? Thanks very much for any suggestions.
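Since you want to reset the count for every id, this question is a bit different.
Using only base R, we can split Date by id and then number the distinct dates within each group in order of appearance. This assumes the rows are already sorted by id, because split() returns the groups in id order. lapply() is used rather than sapply() so that the result is not simplified to a matrix when all groups happen to have the same size.
df$Day <- unlist(lapply(split(df$Date, df$id), function(x) match(x, unique(x))), use.names = FALSE)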
df
# id Date Rating Day
#1 1 10/20/2018 2 1
#2 1 10/20/2018 3 1
#3 1 10/20/2018 5 1
#4 1 10/21/2018 1 2
#5 1 10/21/2018 7 2
#6 1 10/21/2018 9 2
#7 1 10/22/2018 4 3
#8 1 10/22/2018 5 3
#9 1 10/22/2018 9 3
#10 2 11/15/2018 1 1
#11 2 11/15/2018 3 1
#12 2 11/15/2018 4 1
#13 2 11/16/2018 3 2
#14 2 11/16/2018 1 2
#15 2 11/17/2018 0 3
#16 2 11/17/2018 2 3
#17 2 11/17/2018 9 3
I don't know how I missed this, but thanks to @thelatemail, who pointed out that this is basically the same as
library(dplyr)
df %>%
group_by(id) %>%
mutate(Day = match(Date, unique(Date)))
and
df$Day <- as.numeric(with(df, ave(Date, id, FUN = function(x) match(x, unique(x)))))
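If you want a slightly hacky dplyr version, you can convert the date column to a numeric date and then shift that number so that each subject's first date becomes 1. Note that this counts calendar days since the first date, so it gives the desired result only when each subject's rating dates are consecutive.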
library(tidyverse)
library(lubridate)
df <- tibble(id=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
Date= c('10/20/2018', '10/20/2018', '10/20/2018', '10/21/2018', '10/21/2018', '10/21/2018',
'10/22/2018', '10/22/2018', '10/22/2018','11/15/2018', '11/15/2018', '11/15/2018',
'11/16/2018', '11/16/2018', '11/17/2018', '11/17/2018', '11/17/2018'),
Rating=c(2,3,5,1,7,9,4,5,9,1,3,4,3,1,0,2,9))
df %>%
group_by(id) %>%
mutate(
Date = mdy(Date),
Day = as.numeric(Date),
Day = Day-min(Day)+1)
# A tibble: 17 x 4
# Groups: id [2]
id Date Rating Day
<dbl> <date> <dbl> <dbl>
1 1 2018-10-20 2 1
2 1 2018-10-20 3 1
3 1 2018-10-20 5 1
4 1 2018-10-21 1 2
5 1 2018-10-21 7 2
6 1 2018-10-21 9 2
7 1 2018-10-22 4 3
8 1 2018-10-22 5 3
9 1 2018-10-22 9 3
10 2 2018-11-15 1 1
11 2 2018-11-15 3 1
12 2 2018-11-15 4 1
13 2 2018-11-16 3 2
14 2 2018-11-16 1 2
15 2 2018-11-17 0 3
16 2 2018-11-17 2 3
17 2 2018-11-17 9 3
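The two ways of numbering days shown above agree on the example data but can differ when a subject skips a day. The following sketch uses a made-up set of dates for one subject with a gap on 10/21 (an assumption for illustration only) to show the difference.

```r
# Hypothetical dates for a single subject, with no ratings on 10/21/2018
dates <- as.Date(c("10/20/2018", "10/20/2018", "10/22/2018", "10/23/2018"),
                 format = "%m/%d/%Y")

# Order of appearance among distinct dates: the gap is ignored
rank_day <- match(dates, unique(dates))             # 1 1 2 3

# Calendar days since the first date: the gap is preserved
calendar_day <- as.numeric(dates - min(dates)) + 1  # 1 1 3 4
```

Which one is appropriate depends on whether "study day" should mean the n-th day with ratings or the n-th calendar day since enrolment. If the dates are not already in chronological order, match(dates, sort(unique(dates))) may be the safer form of the first variant.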

Give unique identifier to consecutive groupings

I'm trying to identify groups based on sequential numbers. For example, I have a dataframe that looks like this (simplified):
UID
1
2
3
4
5
6
7
11
12
13
15
17
20
21
22
I would like to add a column that identifies runs of consecutive numbers. For example, 1 to 7 form the first consecutive run and get 1, the second consecutive run gets 2, and so on.
UID Group
1 1
2 1
3 1
4 1
5 1
6 1
7 1
11 2
12 2
13 2
15 3
17 4
20 5
21 5
22 5
None of the existing code I found helped me solve this issue.
Here is one base R method that uses diff, a logical check, and cumsum:
cumsum(c(1, diff(df$UID) > 1))
[1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
Adding this onto the data.frame, we get:
df$id <- cumsum(c(1, diff(df$UID) > 1))
df
UID id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 11 2
9 12 2
10 13 2
11 15 3
12 17 4
13 20 5
14 21 5
15 22 5
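Broken into steps, the one-liner above can be read as follows. The intermediate names gaps and new_run are introduced here only for illustration, and the vector assumes UID is sorted in increasing order as in the question.

```r
# UID values from the question
UID <- c(1:7, 11:13, 15, 17, 20:22)

# Difference between each value and the one before it
gaps <- diff(UID)             # 1 1 1 1 1 1 4 1 1 2 2 3 1 1

# A new run starts at the first element and wherever the gap exceeds 1
new_run <- c(TRUE, gaps > 1)

# Counting the run starts so far gives the group number of each element
group <- cumsum(new_run)      # 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
```

Using diff(UID) != 1 instead of > 1 would also start a new group at repeated or decreasing values, which may or may not be wanted.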
Or you can use dplyr as follows (inside mutate() the column can be referred to directly as UID):
library(dplyr)
df %>% mutate(ID = cumsum(c(1, diff(UID) > 1)))
# UID ID
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 5 1
#6 6 1
#7 7 1
#8 11 2
#9 12 2
#10 13 2
#11 15 3
#12 17 4
#13 20 5
#14 21 5
#15 22 5
We can also get the difference between the current row and the previous row using the shift function from data.table, take the cumulative sum of the resulting logical vector, and assign it to create the 'Group' column. This may be faster on large data because := adds the column by reference.
library(data.table)
setDT(df1)[, Group := cumsum(UID- shift(UID, fill = UID[1])>1)+1]
df1
# UID Group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 1
# 7: 7 1
# 8: 11 2
# 9: 12 2
#10: 13 2
#11: 15 3
#12: 17 4
#13: 20 5
#14: 21 5
#15: 22 5

Calculate each chunk by group using dplyr?

How can I get the expected calculation using the dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new group variable using cumsum. Whenever the difference between two numbers in group is not 0, R assigns a new group number. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
group_by(foo) %>%
mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
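The run-based grouping used in this answer can also be combined with ave() to get the expected column without any package. The sketch below uses the values from the question as plain vectors; run_id is an illustrative name playing the role of aux.

```r
# value and group columns from the question
value <- c(2, 4, 5, 6, 11, 12, 15)
group <- c(1, 1, 1, 2, 2, 1, 1)

# rle() finds runs of equal consecutive values; number each run and repeat by its length
r <- rle(group)
run_id <- rep(seq_along(r$lengths), times = r$lengths)   # 1 1 1 2 2 3 3

# Within each run, subtract the previous value; the first element of a run has none
expected <- ave(value, run_id, FUN = function(v) c(NA, diff(v)))
expected   # NA 2 1 NA 5 NA 3
```

The key point is that run_id distinguishes rows 1 to 3 from rows 6 to 7 even though both carry group number 1.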
Here is an option using data.table. The functions rleid and shift (default type is "lag" and fill is NA) were introduced in the 1.9.5 development version and are available in CRAN releases from 1.9.6 onward.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3
