R - Get Common Columns without the use of Joins or Merge

I have two data frames in R
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
> df1
Cust ItemId
1 1 1
2 1 2
3 1 3
4 1 4
5 2 2
6 2 3
7 2 2
8 2 5
9 3 1
10 3 2
11 3 5
12 4 6
13 4 2
> df2
ItemId1 ItemId2
1 1 3
2 3 1
3 2 2
4 3 3
5 2 4
6 1 1
7 2 6
8 3 4
9 4 2
10 6 4
11 5 3
12 3 1
13 2 3
14 4 5
All I need is the following output, which should be less costly than joins/merge (because in my real data I am dealing with billions of records)
> output
ItemId1 ItemId2 Cust
1 1 3 1
2 3 1 1
3 2 2 1, 2, 3, 4
4 3 3 1, 2
5 2 4 1
6 1 1 1, 3
7 2 6 4
8 3 4 1
9 4 2 1
10 6 4 NA
11 5 3 2
12 3 1 1
13 2 3 1, 2
14 4 5 NA
What happens is: if a (ItemId1, ItemId2) combination from df2 is present among a customer's ItemId values in df1, we need to return those Cust values (even if there are multiple). If the combination is not present for any customer, we need to return NA.
i.e. take the first row as an example: ItemId1 = 1, ItemId2 = 3. Only Cust = 1 has both ItemId 1 and 3 in df1. The remaining rows follow the same logic.
We can do this using joins/merge, but those are costly operations, and they are resulting in a memory error.

This may take more time, but it won't use much of your memory.
Feel free to convert the for loops to apply calls if possible:
library(plyr)
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
# One row per customer: all of that customer's ItemIds as a comma-separated string
temp2 = ddply(df1[, c("Cust", "ItemId")], .(Cust), summarize, ItemId = toString(unique(ItemId)))
# One row per item: all customers who bought it as a comma-separated string
temp3 = ddply(df1[, c("ItemId", "Cust")], .(ItemId), summarize, Cust = toString(unique(Cust)))
dfout = cbind(df2[0, ], data.frame(Cust = df1[0, 1]))
for (i in 1:nrow(df2)) {
  a = df2[i, 1]
  b = df2[i, 2]
  if (a == b) {
    dfout = rbind(dfout, data.frame(ItemId1 = a, ItemId2 = a, Cust = temp3$Cust[temp3$ItemId == a]))
  } else {
    cusli = c()
    for (j in 1:nrow(temp2)) {
      # customer j qualifies if both items appear in their item string
      # (grep-based matching assumes single-digit ItemIds)
      if (length(grep(a, temp2$ItemId[j])) > 0 & length(grep(b, temp2$ItemId[j])) > 0) {
        cusli = c(cusli, temp2$Cust[j])
      }
    }
    dfout = rbind(dfout, data.frame(ItemId1 = a, ItemId2 = b, Cust = paste(cusli, collapse = ", ")))
  }
}
dfout$Cust[dfout$Cust == ""] = NA
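Since the answer invites converting the loops to apply calls, here is one possible sketch (my own rewrite, not the answer's original method): it replaces the string matching with a per-customer list lookup built via split(), then checks each df2 row with mapply(). The helper name lookup is mine.

```r
# Same df1/df2 as in the question
df1 <- data.frame(Cust   = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
                  ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 <- data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4),
                  ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))

# One list element per customer holding that customer's ItemIds
items_by_cust <- split(df1$ItemId, df1$Cust)
custs <- names(items_by_cust)

lookup <- function(a, b) {
  # which customers bought both items?
  hit <- vapply(items_by_cust, function(it) all(c(a, b) %in% it), logical(1))
  if (any(hit)) paste(custs[hit], collapse = ", ") else NA_character_
}

df2$Cust <- mapply(lookup, df2$ItemId1, df2$ItemId2)
```

This avoids the inner for loop and the grep() substring matching, at the cost of scanning every customer's item set per df2 row.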

Related

How to dynamically alternate columns between dataframe subset given column prefixes in R?

I have a dataframe that looks like this:
ID <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)
State <- c('AZ', 'IA', 'MN', 'NY', 'IL', 'FL', 'TX', 'TN', 'LA', 'ND')
ABC1_current <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1, 4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC2_current <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4, 2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC1_future <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3, 1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2_future <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4, 2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df <- data.frame(ID, State, ABC1_current, ABC2_current, ABC1_future, ABC2_future)
I am trying to dynamically place the columns with the future suffix to the right of the columns with the current suffix for the given prefix (ABC1, ABC2, etc.) The ID and State columns don't move at all. Here's what I am hoping to get as a result:
df2 <- data.frame(ID, State, ABC1_current, ABC1_future, ABC2_current, ABC2_future)
Is there a way to interlace columns like this dynamically? Ideally, I'd like to use dplyr if possible.
Although not purely dplyr, this may help.
It takes advantage of the alphabetical ordering of the column names, where "current" happens to sort before "future".
library(dplyr)
df1 <- df %>% select("ID","State")
df2 <- df %>% select(-c("ID","State"))
index <- sort(colnames(df2))
df3 <- cbind(df1, df2[index])  # bind columns side by side; merge() here would produce a cross join
df3
ID State ABC1_current ABC1_future ABC2_current ABC2_future
1 1.1 AZ 1 2 4 2
2 1.2 IA 5 5 5 5
3 1.3 MN 3 3 5 5
4 1.4 NY 4 5 4 1
5 1.5 IL 3 3 2 2
6 1.6 FL 4 4 5 1
7 1.7 TX 5 5 5 5
8 1.8 TN 2 3 1 1
9 1.9 LA 2 2 2 3
10 2.0 ND 1 3 4 4
Basically the same idea as #PeteKittinun's:
library(dplyr)
df %>%
  select(ID, State, sort(colnames(.)[3:ncol(.)]))
returns
> df %>%
+ select(ID, State, sort(colnames(.)[3:ncol(.)]))
ID State ABC1_current ABC1_future ABC2_current ABC2_future
1 1.1 AZ 1 2 4 2
2 1.2 IA 5 5 5 5
3 1.3 MN 3 3 5 5
4 1.4 NY 4 5 4 1
5 1.5 IL 3 3 2 2
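Both answers rely on sort() putting _current before _future, which only works because "current" happens to sort alphabetically before "future". As a hedged sketch for suffixes that don't sort that way, the measure columns can be ordered by prefix and then by an explicit suffix order (column names as in the question; the sample values here are made up for illustration):

```r
library(dplyr)

# Tiny stand-in for the question's df: same column names, made-up values
df <- data.frame(ID = c(1.1, 1.2), State = c("AZ", "IA"),
                 ABC1_current = c(1, 5), ABC2_current = c(4, 5),
                 ABC1_future  = c(2, 5), ABC2_future  = c(2, 5))

suffix_order <- c("current", "future")   # explicit, not alphabetical luck
measure_cols <- setdiff(names(df), c("ID", "State"))
prefix <- sub("_(current|future)$", "", measure_cols)  # e.g. "ABC1"
suffix <- sub("^.*_", "", measure_cols)                # e.g. "current"
ordered_cols <- measure_cols[order(prefix, match(suffix, suffix_order))]

df %>% select(ID, State, all_of(ordered_cols))
```

The order() call sorts by prefix first, then by the position of each suffix in suffix_order, so the interlacing survives any renaming of the suffixes.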

How to collapse dataframe columns by character in the middle of column name with varying column names?

This is similar to a question I asked earlier, but I left out a couple of important pieces: an ID column and the XYZ variables.
I have a dataset with the following layout (strange column titles, I know):
ID <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)
XYZ1_a <- c(1, 2, 1, 2, 1, 2, 4, 2, 5, 1)
XYZ1_b <- c(2, 1, 1, 1, 2, 2, 4, 2, 1, 5)
ABC1a_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1)
ABC1b_1 <- c(4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1a_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4)
ABC1b_2 <- c(2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2a_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3)
ABC2b_1 <- c(1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2a_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4)
ABC2b_2 <- c(2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df <- data.frame(ID, XYZ1_a, XYZ1_b, ABC1a_1, ABC1b_1, ABC1a_2, ABC1b_2, ABC2a_1, ABC2b_1, ABC2a_2, ABC2b_2)
I want to collapse all of the ABC[N][x]_[n] variables into a single ABC[N]_[n] variable like the below, but I also need to do the same for the columns with the XYZ naming convention:
ID <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0,
1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)
XYZ1 <- c(1, 2, 1, 2, 1, 2, 4, 2, 5, 1, 2, 1, 1, 1, 2, 2, 4, 2, 1, 5)
ABC1_1 <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1, 4, 2, 1, 1, 5, 3, 2, 1, 1, 5)
ABC1_2 <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4, 2, 3, 3, 2, 2, 3, 2, 1, 4, 2)
ABC2_1 <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3, 1, 2, 2, 4, 5, 3, 2, 4, 1, 4)
ABC2_2 <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4, 2, 3, 3, 2, 1, 3, 1, 1, 2, 2)
df2 <- data.frame(ID, XYZ1, ABC1_1, ABC1_2, ABC2_1, ABC2_2)
What's the best way to achieve this, ideally with a tidyverse solution?
We can rearrange the substrings in the column names that start with 'ABC' by capturing the letter as one group and the underscore plus one or more digits (\\d+) as a second group, then specifying the backreferences in reverse order in the replacement while adding a new _. In pivot_longer, specify names_sep to match the _ that precedes a letter.
library(dplyr)
library(stringr)
library(tidyr)
df %>%
  rename_with(~ str_replace(., "([a-z])_(\\d+)$", "_\\2_\\1"),
              starts_with('AB')) %>%
  pivot_longer(cols = -ID, names_to = c(".value", "grp"),
               names_sep = "_(?=[a-z])", values_drop_na = TRUE) %>%
  select(-grp)
-output
# A tibble: 20 x 6
ID XYZ1 ABC1_1 ABC1_2 ABC2_1 ABC2_2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1.1 1 1 4 2 2
2 1.1 2 4 2 1 2
3 1.2 2 5 5 5 5
4 1.2 1 2 3 2 3
5 1.3 1 3 5 3 5
6 1.3 1 1 3 2 3
7 1.4 2 4 4 5 1
8 1.4 1 1 2 4 2
9 1.5 1 3 2 3 2
10 1.5 2 5 2 5 1
11 1.6 2 4 5 4 1
12 1.6 2 3 3 3 3
13 1.7 4 5 5 5 5
14 1.7 4 2 2 2 1
15 1.8 2 2 1 3 1
16 1.8 2 1 1 4 1
17 1.9 5 2 2 2 3
18 1.9 1 1 4 1 2
19 2 1 1 4 3 4
20 2 5 5 2 4 2
With older versions of dplyr, use rename_at instead:
df %>%
  rename_at(vars(starts_with('AB')),
            ~ str_replace(., "([a-z])_(\\d+)$", "_\\2_\\1")) %>%
  pivot_longer(cols = -ID, names_to = c(".value", "grp"),
               names_sep = "_(?=[a-z])", values_drop_na = TRUE) %>%
  select(-grp)
-output (identical to the tibble above)
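To see what the rename step does before the reshape, the regex can be run on a few of the column names in isolation (a small illustrative check, using names from the question):

```r
library(stringr)

nm <- c("ABC1a_1", "ABC1b_1", "ABC2a_2")
# The trailing letter moves behind the index, so pivot_longer can later
# split on the "_" that precedes a letter
str_replace(nm, "([a-z])_(\\d+)$", "_\\2_\\1")
# "ABC1a_1" becomes "ABC1_1_a", "ABC1b_1" becomes "ABC1_1_b", etc.
```

After this rename, every column is named prefix_index_letter, which is what the names_sep = "_(?=[a-z])" pattern relies on.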

R DataTable Select Rows

data1=data.frame("Student"=c(1, 1, 1, 2, 2, 2, 3, 3, 3),
"Grade"=c(1, 2, 3, 1, 2, 3, 1, 2, 3),
"Score"=c(5, 7, 9, 2, 2, 3, 10, NA, 3))
data2=data.frame("Student"=c(1, 1, 1, 3, 3, 3),
"Grade"=c(1, 2, 3, 1, 2, 3),
"Score"=c(5, 7, 9, 10, NA, 3))
I have 'data1' and wish for 'data2' where I ONLY include 'Student' if 'Score' at 'Grade' = 1 is at least 4.
My only knowledge of how to do this is filtering by 'Grade' and 'Score', but that does not give the desired output.
library(data.table)
setDT(data1)
data1 = data1[Grade == 1 & Score >= 4]
How is it possible to specify that I wish to select all students who have a Score >= 4 at Grade 1, and not just those rows?
You just need to do a join with your desired conditions to retain the Student id.
Does this work?
library(data.table)
data1 <- data.frame("Student"=c(1, 1, 1, 2, 2, 2, 3, 3, 3),
"Grade"=c(1, 2, 3, 1, 2, 3, 1, 2, 3),
"Score"=c(5, 7, 9, 2, 2, 3, 10, NA, 3))
data2 <- data.frame("Student"=c(1, 1, 1, 3, 3, 3),
"Grade"=c(1, 2, 3, 1, 2, 3),
"Score"=c(5, 7, 9, 10, NA, 3))
setDT(data1)
setDT(data2)
wanted <- data1[ Grade == 1 & Score >= 4, .( Student ) ]
setkey( wanted, Student )
setkey( data1, Student )
data3 = data1[ wanted ]
data2
#> Student Grade Score
#> 1: 1 1 5
#> 2: 1 2 7
#> 3: 1 3 9
#> 4: 3 1 10
#> 5: 3 2 NA
#> 6: 3 3 3
data3
#> Student Grade Score
#> 1: 1 1 5
#> 2: 1 2 7
#> 3: 1 3 9
#> 4: 3 1 10
#> 5: 3 2 NA
#> 6: 3 3 3
Created on 2020-04-29 by the reprex package (v0.3.0)
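As a hedged alternative to the keyed join above, the same filter can be written without setting keys: extract the qualifying Student ids once, then subset with %in%.

```r
library(data.table)

data1 <- data.table(Student = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                    Grade   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                    Score   = c(5, 7, 9, 2, 2, 3, 10, NA, 3))

# Students scoring at least 4 in Grade 1
qualifying <- data1[Grade == 1 & Score >= 4, Student]

# Keep every row belonging to a qualifying student
data3 <- data1[Student %in% qualifying]
```

This preserves the original row order and avoids the setkey() calls, which may matter if data1 is already keyed on something else.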

How to expand a vector of integers into consecutive integers in each group in r

I want to expand a vector of integers into consecutive integers within each group in R. Does anyone have some hints on this problem?
Below is my original dataset:
x = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
data = data.frame(x, group)
and my desired dataset is as below:
desired_data = data.frame(
x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3))
Thanks for your help!
This can be easily done via expand from tidyr:
library(tidyverse)
data %>%
  group_by(group) %>%
  expand(x = full_seq(x, 1))
Which gives,
# A tibble: 19 x 2
# Groups: group [3]
group x
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5
11 2 6
12 3 1
13 3 2
14 3 3
15 3 4
16 3 5
17 3 6
18 3 7
19 3 8
I'm sure someone will have a cleaner solution soon. In the meantime:
minVals = aggregate(data$x, by = list(data$group), min)[, 2]
maxVals = aggregate(data$x, by = list(data$group), max)[, 2]
ls = apply(cbind(minVals, maxVals), 1, function(x) x[1]:x[2])
desired_data = data.frame(
  x = unlist(ls),
  group = rep(unique(data$group), lengths(ls)))
x group
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 1 2
7 2 2
8 3 2
9 4 2
10 5 2
11 6 2
12 1 3
13 2 3
14 3 3
15 4 3
16 5 3
17 6 3
18 7 3
19 8 3
Here's a base R solution.
x = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
sl = split(x, group)
expanded = lapply(names(sl), function(g) {  # g is the group name
  r = range(sl[[g]])
  data.frame(x = seq(r[1], r[2], 1), group = g)
})
do.call(rbind, expanded)
split x by group, which results in a named list with one element per group;
use lapply over the names to expand the integer range for each group;
finally, use do.call to rbind the results together.
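For comparison, the same expansion can be sketched in data.table (assuming, as in the desired output, that each group should run from its own minimum to its own maximum):

```r
library(data.table)

data <- data.table(x     = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8),
                   group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3))

# For each group, replace x with the full run min(x):max(x)
desired <- data[, .(x = seq(min(x), max(x))), by = group]
```

The by = group clause computes the sequence per group and stacks the results, so no explicit split/rbind is needed.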

Categorizing the contents of a vector

Here is my problem:
myvec <- c(1, 2, 2, 2, 3, 3,3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
I want to develop a function that categorizes this vector depending on the number of categories I define.
If there is 1 category, all newvec elements will be 1.
If there are 2 categories, then over
unique(myvec), i.e.
1 = 1, 2 = 2, 3 = 1, 4 = 2, 5 = 1, 6 = 2, 7 = 1, 8 = 2, 9 = 1, 10 = 2
(which is the odd/even situation)
If there are 3 categories, the first three numbers will be 1:3 and then the pattern repeats:
1 = 1, 2 = 2, 3 = 3, 4 = 1, 5 = 2, 6 = 3, 7 = 1, 8 = 2, 9 = 3, 10 = 1
If there are 4 categories, the first four numbers will be 1:4 and then the pattern repeats:
1 = 1, 2 = 2, 3 = 3, 4 = 4, 5 = 1, 6 = 2, 7 = 3, 8 = 4, 9 = 1, 10 = 2
Similarly, for n categories: first 1:n, then the pattern repeats.
This should do what you need, if I have understood the question correctly. You can vary the variable n to choose the number of categories.
myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
out <- vector(mode="integer", length=length(myvec))
uid <- sort(unique(myvec))
n <- 3
for (i in 1:n) {
  s <- seq(i, length(uid), n)
  out[myvec %in% s] <- i
}
Using the recycling features of R (this gives a warning if the vector length is not divisible by n):
R> myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
R> n <- 3
R> y <- cbind(x=sort(unique(myvec)), y=1:n)[, 2]
R> y
[1] 1 2 3 1 2 3 1 2 3 1
or using rep:
R> x <- sort(unique(myvec))
R> y <- rep(1:n, length.out=length(x))
R> y
[1] 1 2 3 1 2 3 1 2 3 1
Update: you could just use the modulo operator
R> myvec
[1] 1 2 2 2 3 3 3 4 4 5 6 6 6 6 7 8 8 9 10 10 10
R> n <- 4
R> ((myvec - 1) %% n) + 1
[1] 1 2 2 2 3 3 3 4 4 1 2 2 2 2 3 4 4 1 2 2 2
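As a hedged generalization (the categorize name is mine, not from the answers): the modulo trick works here only because the unique values happen to be the consecutive integers 1..10. Ranking each value among the sorted unique values with match() removes that assumption.

```r
categorize <- function(v, n) {
  # rank of each value among the sorted unique values (1, 2, 3, ...)
  r <- match(v, sort(unique(v)))
  # cycle the ranks through the categories 1..n
  ((r - 1) %% n) + 1
}

myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
categorize(myvec, 4)   # same result as ((myvec - 1) %% 4) + 1 here
```

For myvec the two approaches agree, but categorize() also handles inputs like c(10, 20, 30, 40), where the raw modulo on the values would not cycle correctly.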
