Merge 2 dataframes based on condition in R

I have the following 2 data frames that I want to merge:
x <- data.frame(a= 1:11, b =3:13, c=2:12, d=7:17, invoice = 1:11)
x =
a b c d invoice
1 3 2 7 1
2 4 3 8 2
3 5 4 9 3
4 6 5 10 4
5 7 6 11 5
6 8 7 12 6
7 9 8 13 7
8 10 9 14 8
9 11 10 15 9
10 12 11 16 10
11 13 12 17 11
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2
> y
nr invoice
100 1
101 1
102 1
103 1
104 1
105 1
106 1
107 1
108 1
109 1
110 1
111 2
112 2
113 2
114 2
115 2
116 2
117 2
I want to merge the letter columns from data frame x into data frame y wherever the invoice number is the same. It should start with the value from column a, then b, then c and d, and cycle back to a. This should continue until the invoice number changes, and then the values for invoice number 2 are used.
The output should be like this:
> output
nr invoice letter_count
100 1 1
101 1 3
102 1 2
103 1 7
104 1 1
105 1 3
106 1 2
107 1 7
108 1 1
109 1 3
110 1 2
111 2 2
112 2 4
113 2 3
114 2 8
115 2 2
116 2 4
I tried to use the merge function with the by argument, but this produced an error because the number of rows is not the same. Any help will be appreciated.

Here is a solution using the purrr package.
# Prepare the data frames
x <- data.frame(a = 1:11, b = 3:13, c = 2:12, d = 7:17, invoice = 1:11)
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2
# Load package
library(purrr)
# Split the data based on invoice
y_list <- split(y, f = y$invoice)
# Design a function to transfer data
trans_fun <- function(main_df, letter_df = x) {
  # Get the invoice number
  temp_num <- unique(main_df$invoice)
  # Extract letter_count information from x
  add_vec <- unlist(letter_df[letter_df$invoice == temp_num, 1:4])
  # Get the remainder of nrow(main_df) divided by length(add_vec)
  remain_num <- nrow(main_df) %% length(add_vec)
  # Get the integer quotient of nrow(main_df) and length(add_vec)
  multiple_num <- nrow(main_df) %/% length(add_vec)
  # Create the entire sequence to add
  add_seq <- rep(add_vec, multiple_num + 1)
  add_seq2 <- add_seq[1:(length(add_seq) - (length(add_vec) - remain_num))]
  # Add new column, add_seq2, to main_df
  main_df$letter_count <- add_seq2
  return(main_df)
}
# Apply the trans_fun function using map_df
output <- map_df(y_list, .f = trans_fun)
# See the result
output
nr invoice letter_count
1 100 1 1
2 101 1 3
3 102 1 2
4 103 1 7
5 104 1 1
6 105 1 3
7 106 1 2
8 107 1 7
9 108 1 1
10 109 1 3
11 110 1 2
12 111 2 2
13 112 2 4
14 113 2 3
15 114 2 8
16 115 2 2
17 116 2 4
18 117 2 3
19 118 2 8
20 119 2 2
21 120 2 4
22 121 2 3
23 122 2 8
24 123 2 2
25 124 2 4
26 125 2 3
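As a side note on the design: the remainder arithmetic inside trans_fun can be collapsed using rep() with its length.out argument, which recycles add_vec to exactly nrow(main_df) rows. A minimal sketch with the same inputs (not tested against edge cases such as invoice numbers absent from x):

```r
library(purrr)

x <- data.frame(a = 1:11, b = 3:13, c = 2:12, d = 7:17, invoice = 1:11)
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2

trans_fun2 <- function(main_df, letter_df = x) {
  temp_num <- unique(main_df$invoice)
  add_vec <- unlist(letter_df[letter_df$invoice == temp_num, 1:4])
  # rep() with length.out recycles add_vec to the exact number of rows,
  # replacing the manual %% / %/% bookkeeping
  main_df$letter_count <- rep(add_vec, length.out = nrow(main_df))
  main_df
}
output2 <- map_df(split(y, y$invoice), trans_fun2)
```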

Related

calculate count of number observation for all variables at once in R

numbers1 <- c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
and
numbers2 <- c(4,23,4,23,5,44,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
To perform the counting I do so manually:
as.data.frame(table(numbers1))
as.data.frame(table(numbers2))
but I can have 100 variables, from mydat$x1 to mydat$x100, and I don't want to enter this manually 100 times.
How can I do the counting for all variables at once?
as.data.frame(table(mydat$x1-mydat$x100))
does not work.
We can make a list of all variables in the environment that have a pattern like numbers. Then we can loop through all of the elements of the list:
number_lst <- mget(ls(pattern = 'numbers\\d'), envir = .GlobalEnv) #thanks NelsonGon
lapply(number_lst, function(x) as.data.frame(table(x)))
$numbers1
x Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
$numbers2
x Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 44 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
As I read your question, you want to count the number of times each unique element in a set occurs using minimal re-typing over many sets.
To do this, you'll first need to put the sets into a single object, e.g. into a list:
list_of_sets <- list(numbers1 = c(4,23,4,23,5,43,54,56,657,67,67,435,
                                  453,435,324,34,456,56,567,65,34,435),
                     numbers2 = c(4,23,4,23,5,44,54,56,657,67,67,435,
                                  453,435,324,34,456,56,567,65,34,435))
Then you loop over each list element, e.g. using a for loop:
list_of_counts <- list()
for (i in seq_along(list_of_sets)) {
  list_of_counts[[i]] <- as.data.frame(table(list_of_sets[[i]]))
}
list_of_counts then contains the results:
[[1]]
Var1 Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
[[2]]
Var1 Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 44 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
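If the 100 variables are columns of a data frame (mydat$x1 through mydat$x100), as the question's mydat$ notation suggests, there is no need for mget()/ls() or a manual list at all: lapply() over the data frame visits each column directly. A sketch with a small made-up mydat (the column values here are illustrative only):

```r
# Toy stand-in for the asker's mydat with 100 columns
mydat <- data.frame(x1 = c(4, 23, 4), x2 = c(5, 5, 6))

# One frequency table per column, returned as a named list
count_list <- lapply(mydat, function(col) as.data.frame(table(col)))
```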

R purrr row-wise lookups from two lists

Here’s a simplified version of a problem that involves larger, more complex inputs. First, I create data:
input <- tibble(
  person = rep(101:103, each = 12),
  item = rep(1:12, 3),
  response = sample(1:4, 36, replace = TRUE)
)
These data are responses from three persons on a 12-item test. input is a multilevel table in which the test items are nested within each person. The columns of input are:
person: ID numbers for persons 101, 102, and 103 (12 rows for each person)
item: test items 1-12 for each person. Note how the items are nested within each person
response: score for each item
The test is divided into four subscales consisting of three items each.
scale_assign <- list(1:3, 4:6, 7:9, 10:12)
scale_num <- 1:4
scale_assign is a four-element list containing four item sets (expressed as four numerical ranges): items 1-3 (subscale 1), items 4-6 (subscale 2), items 7-9 (subscale 3), and items 10-12 (subscale 4). scale_num is a four element numerical vector containing the numbers (1-4) that label the four subscales.
What I want R to do is process input row-wise, creating a new column scale, and filling it with the correct value of scale_num for each item (that is, each item's subscale assignment). In each row, R needs to check the value of item against the ranges in scale_assign and fill in scale with the value of scale_num that corresponds to the scale_assign range for that item.
The desired output looks like this:
# A tibble: 36 x 4
# person item response scale
# 1 101 1 4 1
# 2 101 2 2 1
# 3 101 3 4 1
# 4 101 4 4 2
# 5 101 5 4 2
# 6 101 6 4 2
# 7 101 7 3 3
# 8 101 8 2 3
# 9 101 9 4 3
# 10 101 10 1 4
# 11 101 11 1 4
# 12 101 12 4 4
# 13 102 1 1 1
# 14 102 2 3 1
# 15 102 3 1 1
# 16 102 4 1 2
# 17 102 5 3 2
# 18 102 6 3 2
# 19 102 7 4 3
# 20 102 8 1 3
# 21 102 9 3 3
# 22 102 10 4 4
# 23 102 11 3 4
# 24 102 12 3 4
# 25 103 1 4 1
# 26 103 2 1 1
# 27 103 3 2 1
# 28 103 4 2 2
# 29 103 5 4 2
# 30 103 6 1 2
# 31 103 7 4 3
# 32 103 8 4 3
# 33 103 9 1 3
# 34 103 10 4 4
# 35 103 11 1 4
# 36 103 12 2 4
Preferring a tidyverse solution, I thought this might be a job for purrr::map2(), because it seems to involve simultaneous iteration over a four-element list scale_assign and a four-element vector scale_num. I tried to implement the coding of scale within a map2() call, using mutate() and case_when() to do the coding, but could not get it to work.
Thanks in advance for any help!
Instead of performing this operation row-wise and checking each value, it would be easier to perform a join: give scale_assign names, convert it into a data frame, and do a right_join with the input data frame.
scale_assign <- list(1:3, 4:6, 7:9, 10:12)
names(scale_assign) <- 1:4
library(tidyverse)
enframe(scale_assign) %>%
  unnest(cols = value) %>%
  mutate_all(as.integer) %>%
  right_join(input, by = c("value" = "item"))
# A tibble: 36 x 4
# name value person response
# <int> <int> <int> <int>
# 1 1 1 101 4
# 2 1 2 101 4
# 3 1 3 101 2
# 4 2 4 101 2
# 5 2 5 101 1
# 6 2 6 101 4
# 7 3 7 101 3
# 8 3 8 101 1
# 9 3 9 101 1
#10 4 10 101 2
# … with 26 more rows
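If the column names from the desired output matter (item and scale rather than value and name), a rename and reorder after the join restores them. A sketch building on the same pipeline (self-contained, so it recreates the question's objects first):

```r
library(tidyverse)

scale_assign <- list(1:3, 4:6, 7:9, 10:12)
names(scale_assign) <- 1:4
input <- tibble(person = rep(101:103, each = 12),
                item = rep(1:12, 3),
                response = sample(1:4, 36, replace = TRUE))

res <- enframe(scale_assign) %>%
  unnest(cols = value) %>%
  mutate_all(as.integer) %>%
  right_join(input, by = c("value" = "item")) %>%
  # restore the question's column names and order
  rename(scale = name, item = value) %>%
  select(person, item, response, scale) %>%
  arrange(person, item)
```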
In base R, that can be done using stack and merge
merge(input, stack(scale_assign), all.x = TRUE, by.x = "item", by.y = "values")
data
set.seed(1234)
input <- tibble(
  person = rep(101:103, each = 12),
  item = rep(1:12, 3),
  response = sample(1:4, 36, replace = TRUE))
Here is a data.table solution, using an update join.
Basically this is @Ronak Shah's base R answer, but using the data.table package (i.e. fast performance on large data sets).
library(data.table)
# 1. convert input to a data.table
# 2. create a lookup table using `stack(scale_assign)`,
#    and make that also a data.table (using setDT())
# 3. left update join on item
setDT(input)[setDT(stack(scale_assign)),
             scale := i.ind,
             on = .(item = values)][]
output
# person item response scale
# 1: 101 1 3 1
# 2: 101 2 4 1
# 3: 101 3 3 1
# 4: 101 4 2 2
# 5: 101 5 3 2
# 6: 101 6 4 2
# 7: 101 7 1 3
# 8: 101 8 3 3
# 9: 101 9 4 3
# 10: 101 10 2 4
# 11: 101 11 3 4
# 12: 101 12 4 4
# 13: 102 1 4 1
# 14: 102 2 2 1
# 15: 102 3 3 1
# 16: 102 4 2 2
# 17: 102 5 1 2
# 18: 102 6 4 2
# 19: 102 7 1 3
# 20: 102 8 3 3
# 21: 102 9 2 3
# 22: 102 10 1 4
# 23: 102 11 4 4
# 24: 102 12 3 4
# 25: 103 1 1 1
# 26: 103 2 1 1
# 27: 103 3 2 1
# 28: 103 4 1 2
# 29: 103 5 2 2
# 30: 103 6 4 2
# 31: 103 7 4 3
# 32: 103 8 2 3
# 33: 103 9 3 3
# 34: 103 10 2 4
# 35: 103 11 2 4
# 36: 103 12 2 4
# person item response scale
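Worth noting: because the subscales in this particular question are consecutive triples (1-3, 4-6, 7-9, 10-12), no lookup table is strictly needed. Integer division maps each item to its subscale directly. This sketch only holds for this specific scale_assign (three consecutive items per subscale, in order), not for arbitrary item sets:

```r
library(dplyr)

input <- tibble(person = rep(101:103, each = 12),
                item = rep(1:12, 3),
                response = sample(1:4, 36, replace = TRUE))

# items 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
input <- input %>% mutate(scale = (item - 1) %/% 3 + 1)
```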

How to rank a column with a condition

I have a data frame :
dt <- read.table(text = "
1 390
1 366
1 276
1 112
2 97
2 198
2 400
2 402
3 110
3 625
4 137
4 49
4 9
4 578 ")
The first column is Index and the second is distance.
I want to add a column that ranks the distances within each Index in descending order (the highest distance ranked first).
The result will be:
dt <- read.table(text = "
1 390 1
1 366 2
1 276 3
1 112 4
2 97 4
2 198 3
2 400 2
2 402 1
3 110 2
3 625 1
4 137 2
4 49 3
4 9 4
4 578 1")
Another R base approach
> dt$Rank <- unlist(tapply(-dt$V2, dt$V1, rank))
A tidyverse solution
dt %>%
  group_by(V1) %>%
  mutate(Rank = rank(-V2))
transform(dt, s = ave(-V2, V1, FUN = rank))
V1 V2 s
1 1 390 1
2 1 366 2
3 1 276 3
4 1 112 4
5 2 97 4
6 2 198 3
7 2 400 2
8 2 402 1
9 3 110 2
10 3 625 1
11 4 137 2
12 4 49 3
13 4 9 4
14 4 578 1
You could group, arrange, and use row_number(). The result is a bit easier on the eyes than a simple rank, I think, and so worth the extra step.
dt %>%
  group_by(V1) %>%
  arrange(V1, desc(V2)) %>%
  mutate(rank = row_number())
# A tibble: 14 x 3
# Groups: V1 [4]
V1 V2 rank
<int> <int> <int>
1 1 390 1
2 1 366 2
3 1 276 3
4 1 112 4
5 2 402 1
6 2 400 2
7 2 198 3
8 2 97 4
9 3 625 1
10 3 110 2
11 4 578 1
12 4 137 2
13 4 49 3
14 4 9 4
An alternative that leaves the rows in their original order is min_rank:
dt %>%
  group_by(V1) %>%
  mutate(min_rank(desc(V2)))
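For larger data, a data.table sketch of the same grouped ranking; frank() with ties.method = "min" behaves like dplyr's min_rank():

```r
library(data.table)

dt <- read.table(text = "
1 390
1 366
1 276
1 112
2 97
2 198
2 400
2 402
3 110
3 625
4 137
4 49
4 9
4 578 ")

# rank -V2 within each V1 group so the largest distance gets rank 1
setDT(dt)[, Rank := frank(-V2, ties.method = "min"), by = V1][]
```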

Replace 0s in dataframe rows with the preceding non-zero value

Here an example of my dataframe:
df = read.table(text = 'a b
120 5
120 5
120 5
119 0
118 0
88 3
88 3
87 0
10 3
10 3
10 3
7 4
6 0
5 0
4 0', header = TRUE)
I need to replace each 0 in column b with the preceding non-zero value.
Here my desired output:
a b
120 5
120 5
120 5
119 5
118 5
88 3
88 3
87 3
10 3
10 3
10 3
7 4
6 4
5 4
4 4
Until now I tried:
df$b[df$b == 0] = (df$b == 0) - 1
But it does not work.
Thanks
na.locf from zoo can help with this:
library(zoo)
#converting zeros to NA so that na.locf can get them
df$b[df$b == 0] <- NA
#using na.locf to replace NA with previous value
df$b <- na.locf(df$b)
Out:
> df
a b
1 120 5
2 120 5
3 120 5
4 119 5
5 118 5
6 88 3
7 88 3
8 87 3
9 10 3
10 10 3
11 10 3
12 7 4
13 6 4
14 5 4
15 4 4
Performing this task with simple conditional indexing seems pretty hard, but you could also use a small for loop instead of loading a package.
for (i in which(df$b == 0)) {
  df$b[i] <- df$b[i - 1]
}
Output:
> df
a b
1 120 5
2 120 5
3 120 5
4 119 5
5 118 5
6 88 3
7 88 3
8 87 3
9 10 3
10 10 3
11 10 3
12 7 4
13 6 4
14 5 4
15 4 4
I assume that this could be slow for large data frames.
Here is a base R method using rle.
# get the run length encoding of variable
temp <- rle(df$b)
# fill in 0s with previous value
temp$values[temp$values == 0] <- temp$values[which(temp$values == 0) -1]
# replace variable
df$b <- inverse.rle(temp)
This returns
df
a b
1 120 5
2 120 5
3 120 5
4 119 5
5 118 5
6 88 3
7 88 3
8 87 3
9 10 3
10 10 3
11 10 3
12 7 4
13 6 4
14 5 4
15 4 4
Note that the replacement line will throw an error if the first element of the vector is 0. You can fix this by creating a vector that excludes it.
For example
replacers <- which(temp$values == 0)
replacers <- replacers[replacers > 1]
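A tidyverse equivalent of the zoo approach: turn the zeros into NA with dplyr's na_if() and carry the last non-missing value downward with tidyr's fill(). A sketch, recreating df from the question:

```r
library(dplyr)
library(tidyr)

df <- read.table(text = 'a b
120 5
120 5
120 5
119 0
118 0
88 3
88 3
87 0
10 3
10 3
10 3
7 4
6 0
5 0
4 0', header = TRUE)

res <- df %>%
  mutate(b = na_if(b, 0)) %>%  # zeros become NA
  fill(b)                      # NA filled with last preceding value
```

Like na.locf, this would leave a leading 0 unfilled (as NA) rather than error.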

Create Customized weighted variable in R

My data set looks like this
set.seed(1)
data <- data.frame(ITEMID = 101:120,DEPT = c(rep(1,10),rep(2,10)),
CLASS = c(1,1,1,1,1,2,2,2,2,2,1,1,1,1,1,2,2,2,2,2),
SUBCLASS = c(3,3,3,3,4,4,4,4,4,3,3,3,3,3,3,4,4,4,4,4),
PRICE = sample(1:20,20),UNITS = sample(1:100,20)
)
> data
ITEMID DEPT CLASS SUBCLASS PRICE UNITS
1 101 1 1 3 6 94
2 102 1 1 3 8 22
3 103 1 1 3 11 64
4 104 1 1 3 16 13
5 105 1 1 4 4 26
6 106 1 2 4 14 37
7 107 1 2 4 15 2
8 108 1 2 4 9 36
9 109 1 2 4 19 81
10 110 1 2 3 1 31
11 111 2 1 3 3 44
12 112 2 1 3 2 54
13 113 2 1 3 20 90
14 114 2 1 3 10 17
15 115 2 1 3 5 72
16 116 2 2 4 7 57
17 117 2 2 4 12 67
18 118 2 2 4 17 9
19 119 2 2 4 18 60
20 120 2 2 4 13 34
Now I want to add another column called PRICE_RATIO using the following logic
Taking ITEMID 101 and grouping by DEPT, CLASS and SUBCLASS yields prices c(6,8,11,16) and UNITS c(94,22,64,13) for ITEMIDs c(101,102,103,104) respectively.
Now, for each item ID, PRICE_RATIO will be the ratio of that item's price to the weighted price of all other item IDs in the group. For example:
For item ID 101, the other items are c(102,103,104), whose total UNITS is (22 + 64 + 13) = 99, so the weights are (22/99, 64/99, 13/99). The weighted price of all other items is therefore (22/99)*8 + (64/99)*11 + (13/99)*16 = 10.9899, and the value of PRICE_RATIO is 6/10.9899 = 0.55.
Similarly for all other items.
Any help in creating the code for this will be greatly appreciated
One solution to your problem, and to such problems generally, is the dplyr package and its data-munging capabilities. The logic is as you say: group by the desired columns, then mutate in the desired value, namely the ratio of each price to the weighted price of the other rows (the sum-product of price and units, excluding the current row's product, divided by the total units of the other rows). You can execute every step of this computation separately (I encourage that, so you can learn) and see exactly what it does.
library(dplyr)
data %>%
  group_by(DEPT, CLASS, SUBCLASS) %>%
  mutate(price_ratio = round(PRICE /
                               ((sum(UNITS * PRICE) - UNITS * PRICE) /
                                  (sum(UNITS) - UNITS)),
                             2))
Output is as follows:
Source: local data frame [20 x 7]
Groups: DEPT, CLASS, SUBCLASS [6]
ITEMID DEPT CLASS SUBCLASS PRICE UNITS price_ratio
<int> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 101 1 1 3 6 94 0.55
2 102 1 1 3 8 22 0.93
3 103 1 1 3 11 64 1.50
4 104 1 1 3 16 13 1.99
5 105 1 1 4 4 26 NaN
6 106 1 2 4 14 37 0.88
7 107 1 2 4 15 2 0.97
8 108 1 2 4 9 36 0.52
9 109 1 2 4 19 81 1.63
10 110 1 2 3 1 31 NaN
11 111 2 1 3 3 44 0.29
12 112 2 1 3 2 54 0.18
13 113 2 1 3 20 90 4.86
14 114 2 1 3 10 17 1.08
15 115 2 1 3 5 72 0.46
16 116 2 2 4 7 57 0.48
17 117 2 2 4 12 67 0.93
18 118 2 2 4 17 9 1.36
19 119 2 2 4 18 60 1.67
20 120 2 2 4 13 34 1.03
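The NaN rows (ITEMIDs 105 and 110) come from groups containing a single item, where sum(UNITS) - UNITS is 0 and the expression divides 0 by 0. A sketch that makes this explicit by converting those NaN values to NA afterwards (same algebra, rearranged to avoid the nested division):

```r
library(dplyr)

set.seed(1)
data <- data.frame(ITEMID = 101:120, DEPT = c(rep(1, 10), rep(2, 10)),
                   CLASS = c(1,1,1,1,1,2,2,2,2,2,1,1,1,1,1,2,2,2,2,2),
                   SUBCLASS = c(3,3,3,3,4,4,4,4,4,3,3,3,3,3,3,4,4,4,4,4),
                   PRICE = sample(1:20, 20), UNITS = sample(1:100, 20))

res <- data %>%
  group_by(DEPT, CLASS, SUBCLASS) %>%
  # PRICE / weighted price of the others, written as a single fraction
  mutate(price_ratio = round(PRICE * (sum(UNITS) - UNITS) /
                               (sum(UNITS * PRICE) - UNITS * PRICE), 2),
         # singleton groups produce 0/0 = NaN; flag them as NA instead
         price_ratio = replace(price_ratio, is.nan(price_ratio), NA))
```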
