Data merge with data.table for repeating unique values

I am trying to merge two columns in data.table 'A' with a column in another data.table 'B' that contains the unique values of a column. I want to merge in such a way that, for every unique combination of the two variables in data.table 'A', we get all unique values of the column in data.table 'B' repeated.
I tried merge, but it doesn't give me all the values. I also tried data.table's automatic recycling, but that doesn't give the result either.
Input:
data.table A
X Y
1 1
1 2
1 3
2 1
3 1
4 4
4 5
5 6
data.table B
Z
1
2
Expected output
X Y Z
1 1 1
1 1 2
1 2 1
1 2 2
1 3 1
1 3 2
2 1 1
2 1 2
3 1 1
3 1 2
4 4 1
4 4 2
4 5 1
4 5 2
5 6 1
5 6 2

We can make use of crossing from tidyr
library(tidyr)
crossing(A, B)
# X Y Z
#1 1 1 1
#2 1 1 2
#3 1 2 1
#4 1 2 2
#5 1 3 1
#6 1 3 2
#7 2 1 1
#8 2 1 2
#9 3 1 1
#10 3 1 2
#11 4 4 1
#12 4 4 2
#13 4 5 1
#14 4 5 2
#15 5 6 1
#16 5 6 2
Or with merge from base R, but the order will be slightly different
merge(A, B)
To get the expected column order, reverse the arguments and then reorder the columns:
merge(B, A)[c(names(A), names(B))]
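If you prefer to stay within data.table itself, one option is to group A by its two columns and expand each group with all values from B. This is only a sketch: it assumes A and B are built from the tables above, and it collapses any duplicated X/Y pairs in A into a single combination (which matches the "unique combination" requirement).
library(data.table)
A <- data.table(X = c(1, 1, 1, 2, 3, 4, 4, 5),
                Y = c(1, 2, 3, 1, 1, 4, 5, 6))
B <- data.table(Z = c(1, 2))
# for every unique (X, Y) combination in A, repeat every value of B$Z
A[, .(Z = B$Z), by = .(X, Y)]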

R left_join() replacing joined values rather than adding in new columns

I have the following dataframes:
A<-data.frame(AgentNo=c(1,2,3,4,5,6),
N=c(2,5,6,1,9,0),
Rarity=c(1,2,1,1,2,2))
AgentNo N Rarity
1 1 2 1
2 2 5 2
3 3 6 1
4 4 1 1
5 5 9 2
6 6 0 2
B<-data.frame(Rank=c(1,5),
AgentNo.x=c(2,5),
AgentNo.y=c(1,4),
N=c(3,1),
Rarity=c(1,2))
Rank AgentNo.x AgentNo.y N Rarity
1 1 2 1 3 1
2 5 5 4 1 2
I would like to left join B onto A by columns "AgentNo"="AgentNo.y" and "N"="N", but rather than adding new columns to A from B, I want to keep the same columns as A, with the joined values updated to those taken from B.
For any joined rows, I want A.AgentNo to become B.AgentNo.x, A.N to become B.N, and A.Rarity to become B.Rarity. I would like to drop B.Rank and B.AgentNo.y completely.
The result should be:
Result <- data.frame(AgentNo=c(2,2,3,5,5,6), N=c(3,5,6,1,9,0), Rarity=c(1,2,1,2,2,2))
AgentNo N Rarity
1 2 3 1
2 2 5 2
3 3 6 1
4 5 1 2
5 5 9 2
6 6 0 2
After some data wrangling, you can use rows_update to update the rows of A by the values of B:
library(dplyr)
A <- A %>%
  mutate(AgentNo.y = AgentNo)
B <- select(B, AgentNo = AgentNo.x, AgentNo.y, N, Rarity)
rows_update(A, B, by = "AgentNo.y") %>%
  select(-AgentNo.y)
output
AgentNo N Rarity
1 2 3 1
2 2 5 2
3 3 6 1
4 5 1 2
5 5 9 2
6 6 0 2
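For comparison, a base R sketch of the same update, starting from the original A and B (i.e. before the mutate/select above); it matches B$AgentNo.y against A$AgentNo and overwrites the matched rows positionally:
idx <- match(B$AgentNo.y, A$AgentNo)
# replace AgentNo, N and Rarity in the matched rows of A with AgentNo.x, N and Rarity from B
A[idx, c("AgentNo", "N", "Rarity")] <- B[c("AgentNo.x", "N", "Rarity")]
A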

In R: How to coerce a list of vectors with unequal length to a dataframe using tidyverse?

Suppose you have the following list in R:
list_test <- list(c(2,4,5, 6), c(1,2,3), c(7,8))
What I am looking for is a dataframe of the following form:
value list_index
2 1
4 1
5 1
6 1
1 2
2 2
3 2
7 3
8 3
I tried to find a solution with the tidyverse, but I either lost the list_index/name or had problems with the unequal lengths of the vectors.
You can give names to the list and then use stack from base R.
names(list_test) <- seq_along(list_test)
stack(list_test)
# values ind
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
If interested in a tidyverse solution we can use enframe with unnest.
tibble::enframe(list_test) %>% tidyr::unnest(value)
Or imap_dfr from purrr.
purrr::imap_dfr(list_test, ~tibble::tibble(value = .x, list_index = .y))
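On recent purrr versions (1.0.0+), imap_dfr is considered superseded; if you prefer the newer interface, the same idea can be written with imap plus list_rbind (a sketch):
purrr::list_rbind(
  purrr::imap(list_test, ~ tibble::tibble(value = .x, list_index = .y))
)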
Another option could be:
map_dfr(list_test, ~ enframe(.) %>%
select(-name), .id = "name")
name value
<chr> <dbl>
1 1 2
2 1 4
3 1 5
4 1 6
5 2 1
6 2 2
7 2 3
8 3 7
9 3 8
Or, if you don't mind also having a column with the within-vector indexes:
map_dfr(list_test, enframe, .id = "name_list")
name_list name value
<chr> <int> <dbl>
1 1 1 2
2 1 2 4
3 1 3 5
4 1 4 6
5 2 1 1
6 2 2 2
7 2 3 3
8 3 1 7
9 3 2 8
In base R, we can use lengths to replicate the sequence and unlist the list elements into a two-column 'data.frame':
data.frame(value = unlist(list_test),
list_index = rep(seq_along(list_test), lengths(list_test)))
# value list_index
#1 2 1
#2 4 1
#3 5 1
#4 6 1
#5 1 2
#6 2 2
#7 3 2
#8 7 3
#9 8 3
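A data.table variant of the same idea, for completeness (a sketch; idcol stores the list index in its own column):
library(data.table)
# bind the per-vector data.tables row-wise, keeping the list position as list_index
rbindlist(lapply(list_test, function(x) data.table(value = x)),
          idcol = "list_index")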

Number similar/duplicated rows in R [duplicate]

Hi, I'm using R and I have data like this:
1 2 3 4 5
1 2 1 2 2
3 4 1 2 3
1 2 3 4 5
3 4 1 2 3
I want to number identical lines with the same number; for the above example:
1 2 3 4 5 --> 1
1 2 1 2 2 --> 2
3 4 1 2 3 --> 3
1 2 3 4 5 --> 1
3 4 1 2 3 --> 3
Does anyone know how to do this in R (for both the numeric and the character case)?
Your help is really appreciated!
This is your data:
df <- data.frame(a=c(1,1,3,1,3),
b=c(2,2,4,2,4),
c=c(3,1,1,3,1),
d=c(4,2,2,4,2),
e=c(5,2,3,5,3))
Approach 1:
You would need the data.table package to perform the below approach:
library(data.table)
i <- interaction(data.table(df), drop=TRUE)
df.out <- cbind(df, id=factor(i,labels=length(unique(i)):1))
This would give you the following:
# a b c d e id
#1 1 2 3 4 5 1
#2 1 2 1 2 2 3
#3 3 4 1 2 3 2
#4 1 2 3 4 5 1
#5 3 4 1 2 3 2
Approach 2:
Another approach is by using the plyr package, as follows:
library(plyr)
.id <- 0
df.out <- ddply(df, colnames(df), transform, id=(.id<<-.id+1))
This will give you the following output:
# a b c d e id
#1 1 2 1 2 2 1
#2 1 2 3 4 5 2
#3 1 2 3 4 5 2
#4 3 4 1 2 3 3
#5 3 4 1 2 3 3
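A third, more compact option, if you are already using data.table, is .GRP, which assigns one group counter per distinct row in order of first appearance (a sketch with the same df); it reproduces the requested 1, 2, 3, 1, 3 numbering:
library(data.table)
# group by all columns, so identical rows fall into the same group
setDT(df)[, id := .GRP, by = names(df)]
df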
Hope it helps.

Find minimal value for multiple same keys in a table [duplicate]

I have a table that contains multiple rows of data for each key, where the key is made up of multiple columns.
Table looks like this:
A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2
I have also discovered how to remove all duplicate elements with the unique command over multiple columns, so data duplication is not a problem.
I would like to know how, for every key (columns A and B in the example), to keep only the row with the minimum value in the third column (column C).
In the end the table should look like this:
A B C
1 1 1 2
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
Thanks for any help. It is really appreciated.
If you have any questions, feel free to ask.
con <- textConnection(" A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2")
df <- read.table(con, header = T)
df <- df[with(df, order(A, B, C)), ]  # sort so the smallest C comes first within each A/B key
df[!duplicated(df[1:2]), ]            # keep only the first (minimum C) row per A/B key
# A B C
# 1 1 1 2
# 3 2 1 4
# 4 1 2 4
# 5 2 2 3
# 6 2 3 1
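If you only need the minimum of C per A/B key and not the original row numbers, aggregate does the same in one line (a sketch using the df read in above; note the rows come back ordered by the grouping columns):
aggregate(C ~ A + B, data = df, FUN = min)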

Extracting event rows from a data frame

I have this data frame:
df <-
ID var TIME value method
1 3 0 2 1
1 3 2 2 1
1 3 3 0 1
1 4 0 10 1
1 4 2 10 1
1 4 4 5 1
1 4 6 5 1
2 3 0 2 1
2 3 2 2 1
2 3 3 0 1
2 4 0 10 1
2 4 2 10 1
2 4 4 5 1
2 4 6 5 1
I want to extract the rows that have a new event in the value column. For example, for ID=1, var=3 has a value of 2 at TIME=0. This value stays the same at TIME=2, so I would take only the first row at TIME=0 and discard the second row. However, in the third row the value for var=3 has changed to zero, so I also have to extract this row. And so on for the rest of the variables. This has to be applied for every subject ID. For the above df, the result should be as follows:
dfevent <-
ID var TIME value method
1 3 0 2 1
1 3 3 0 1
1 4 0 10 1
1 4 4 5 1
2 3 0 2 1
2 3 3 0 1
2 4 0 10 1
2 4 4 5 1
Could anyone help me do this in R? I have a huge data set, and I want to extract the rows at which a new event occurs in the value of every var. I have several variables in the data frame (numbered 3, 4, 5, 6, and 7). The above is an example for two of them (variable numbers 3 and 4).
This does it using dplyr
library(dplyr)
df %>%
  group_by(ID, var) %>%
  mutate(tf = ifelse(value == lag(value), 1, 0)) %>%
  filter(is.na(tf) | tf == 0) %>%
  select(-tf)
# ID var TIME value method
#1 1 3 0 2 1
#2 1 3 3 0 1
#3 1 4 0 10 1
#4 1 4 4 5 1
#5 2 3 0 2 1
#6 2 3 3 0 1
#7 2 4 0 10 1
#8 2 4 4 5 1
Basically, I created an extra variable that is '1' when the value is the same as in the preceding row within each unique ID/var group. We then get rid of this variable before returning the output.
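The helper column can also be avoided by filtering directly on the lagged comparison; a sketch with the same grouping, where the is.na() check keeps the first row of every ID/var group:
df %>%
  group_by(ID, var) %>%
  filter(is.na(lag(value)) | value != lag(value)) %>%
  ungroup()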
Base solution:
df[with(df, abs(ave(value,ID,FUN=function(x) c(1,diff(x)) ))) > 0,]
# ID var TIME value method
#1 1 3 0 2 1
#3 1 3 3 0 1
#4 1 4 0 10 1
#6 1 4 4 5 1
#8 2 3 0 2 1
#10 2 3 3 0 1
#11 2 4 0 10 1
#13 2 4 4 5 1
From the expected results, you may also try rleid from data.table
library(data.table)#data.table_1.9.5
setDT(df)[df[, .I[1L] , list(ID, var, rleid(value))]$V1]
# ID var TIME value method
#1: 1 3 0 2 1
#2: 1 3 3 0 1
#3: 1 4 0 10 1
#4: 1 4 4 5 1
#5: 2 3 0 2 1
#6: 2 3 3 0 1
#7: 2 4 0 10 1
#8: 2 4 4 5 1
Or a similar approach to @thelatemail's:
setDT(df)[df[, .I[abs(c(1,diff(value)))>0] , ID]$V1]
Or
unique(setDT(df)[, id:=rleid(value)], by=c('ID', 'var', 'id'))
