I have a big database and I'm trying to create a new column from an existing one by taking the difference between elements in consecutive cells (same column, different row):
existing_column   new_column
A                 A-B
B                 B-C
C                 C-D
D                 D-E
...               ...
Z                 Z-NULL
The way I'm doing it is to duplicate the existing column into a dummy one, remove the first element, append NULL as the last element, and then take the difference between the existing column and the dummy one ... is there a better way? Thank you.
exist <- c("A", "B", "C", "D", "E")
db <- data.frame(exist)
dummy <- exist[-1]
dummy[length(dummy) + 1] <- "NULL"
new_col <- paste(exist, "-", dummy)
new_col
db <- data.frame(exist, new_col)
db
Does this work?
library(dplyr)
df <- data.frame(existing_column = LETTERS)
df %>% mutate(new_column = paste(existing_column, lead(existing_column, default = 'NULL'), sep = '-'))
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NULL
Try the code below
transform(
  df,
  new_column = paste(existing_column, c(existing_column[-1], NA), sep = "-")
)
which gives
existing_column new_column
1 A A-B
2 B B-C
3 C C-D
4 D D-E
5 E E-F
6 F F-G
7 G G-H
8 H H-I
9 I I-J
10 J J-K
11 K K-L
12 L L-M
13 M M-N
14 N N-O
15 O O-P
16 P P-Q
17 Q Q-R
18 R R-S
19 S S-T
20 T T-U
21 U U-V
22 V V-W
23 W W-X
24 X X-Y
25 Y Y-Z
26 Z Z-NA
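If you literally want the trailing "NULL" from the question rather than NA, a small variation of the same call (my tweak, not part of the original answer) appends the string instead:
transform(
  df,
  new_column = paste(existing_column, c(existing_column[-1], "NULL"), sep = "-")
)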
If you are working with numeric data that is just represented as characters in your example, you can use mutate() and lead():
df <- data.frame(old_col = sample(1:10))
df %>% mutate(new_col = old_col - lead(old_col, default = 0))
old_col new_col
1 10 4
2 6 -3
3 9 8
4 1 -1
5 2 -5
6 7 3
7 4 1
8 3 -5
9 8 3
10 5 5
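A side note of mine: with default = 0 the last row's "difference" is just old_col itself. If you would rather mark it as missing, drop the default, since lead() pads with NA by default:
df %>% mutate(new_col = old_col - lead(old_col))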
In case a fast data.table version is needed:
library(data.table)
# assuming `dt` holds your data as a data.table, e.g. dt <- as.data.table(db)
dt[, new_column := paste(exist, shift(exist, type = "lead"), sep = "-")]
Edit: turns out it isn't much faster:
df <- data.table(exist = rep(letters, 80000))
m <- microbenchmark::microbenchmark(
  a = df %>% mutate(new_column = paste(exist, lead(exist, default = 'NULL'), sep = '-')),
  b = transform(
    df,
    new_column = paste(exist, c(exist[-1], NA), sep = "-")
  ),
  d = df[, new_column := paste(exist, shift(exist, type = "lead"), sep = "-")]
)
m
Unit: milliseconds
expr min lq mean median uq max neval
a 292.2430 309.6150 342.0191 323.9778 361.0937 603.8449 100
b 349.4509 383.3391 475.0177 423.8864 472.0276 2136.2970 100
d 294.6786 302.8530 332.3989 315.6228 340.9642 641.8345 100
Related
I am working with the R programming language.
I simulated this dataset, which contains 1000 coin flips, and then calculated the number of "2 Flip Sequences":
Coin <- c('H', 'T')
Results = sample(Coin,1000, replace = TRUE)
My_Data = data.frame(id = 1:1000, Results)
Pairs = data.frame(first = head(My_Data$Results, -1), second = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second Freq
1 H H 255
2 T H 245
3 H T 246
4 T T 253
I am curious - is it possible to extend the above code for "3 Flip Sequences"?
For example - I tried modifying parts of the code to see how the results change (and hoped to stumble across the correct way to write this code):
# First Attempt
Pairs = data.frame(first = head(My_Data$Results, -1), second = head(My_Data$Results, -1) , third = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second third Freq
1 H H H 255
2 T H H 245
3 H T H 0
4 T T H 0
5 H H T 0
6 T H T 0
7 H T T 246
8 T T T 253
# Second Attempt
Pairs = data.frame(first = head(My_Data$Results, -1), second = tail(My_Data$Results, -1) , third = tail(My_Data$Results, -1))
Final = as.data.frame(table(Pairs))
first second third Freq
1 H H H 255
2 T H H 0
3 H T H 0
4 T T H 245
5 H H T 246
6 T H T 0
7 H T T 0
8 T T T 253
I am not sure which of these options is correct.
In general, I am looking to understand the logic of how I can adapt the above code for an "arbitrary number of coin flips" (e.g. "4 flip sequences", "5 flip sequences", etc.).
Also, this might not be the most efficient way to calculate these frequencies - I would also be interested in learning about other ways that might be more efficient (e.g. as the overall size of the data increases).
Thanks!
It might be helpful to work with strings.
coin <- c("H", "T")
results <- sample(coin, 1000, replace = TRUE)
Then to get sequence counts (assuming overlapping sequences also count) for triples, we could do something like:
triples <- table(
  sapply(
    1:(length(results) - 2),  # last valid start index for a window of length 3
    function(i) sprintf(
      "%s%s%s",
      results[i],
      results[i + 1],
      results[i + 2]
    )
  )
)
which gives me something like:
HHH HHT HTH HTT THH THT TTH TTT
132 129 138 115 129 124 116 114
This idea could be generalized fairly easily, for example:
n_sequences <- function(n, results) {
  helper <- function(i, n) if (n < 1) "" else sprintf(
    "%s%s",
    helper(i, n - 1),
    results[i + n - 1]
  )
  result <- data.frame(
    table(
      sapply(
        1:(length(results) - n + 1),
        function(i) helper(i, n)
      )
    )
  )
  colnames(result) <- c("Sequence", "Frequency")
  result
}
For example:
n_sequences(5, results)
Gives me something like:
Sequence Frequency
1 HHHHH 34
2 HHHHT 31
3 HHHTH 36
4 HHHTT 31
5 HHTHH 35
6 HHTHT 36
7 HHTTH 20
8 HHTTT 37
9 HTHHH 35
10 HTHHT 34
11 HTHTH 41
12 HTHTT 27
13 HTTHH 27
14 HTTHT 24
15 HTTTH 34
16 HTTTT 30
17 THHHH 31
18 THHHT 36
19 THHTH 36
20 THHTT 26
21 THTHH 34
22 THTHT 32
23 THTTH 31
24 THTTT 27
25 TTHHH 32
26 TTHHT 28
27 TTHTH 25
28 TTHTT 31
29 TTTHH 33
30 TTTHT 31
31 TTTTH 30
32 TTTTT 20
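On the OP's efficiency question: the same overlapping-window counts can be computed without calling sprintf row by row. This is a sketch of my own (the name window_counts is made up, not from the answer above):
# count overlapping windows of length n in a character vector of flips
window_counts <- function(results, n) {
  idx <- seq_len(length(results) - n + 1)                  # window start positions
  cols <- lapply(0:(n - 1), function(k) results[idx + k])  # k-th symbol of every window
  table(do.call(paste0, cols))                             # paste element-wise, then tabulate
}
window_counts(results, 3)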
You could first cut the row indices along 3 + 1 breaks and split the results along the levels. The interaction of the splits can then be tabled to get the result.
My_Data$cut3 <- cut(seq_len(nrow(My_Data)),
                    seq.int(1, nrow(My_Data), length.out = 3 + 1),
                    include.lowest = TRUE)
(res <- interaction(split(My_Data$Results, My_Data$cut3)) |> table() |> as.data.frame())
# Var1 Freq
# 1 H.H.H 51
# 2 T.H.H 58
# 3 H.T.H 43
# 4 T.T.H 49
# 5 H.H.T 38
# 6 T.H.T 51
# 7 H.T.T 64
# 8 T.T.T 46
To get the desired output, we can strsplit Var1.
strsplit(as.character(res$Var1), '\\.') |> do.call(what=rbind) |>
cbind.data.frame(res$Freq) |> setNames(c('first', 'second', 'third', 'Freq'))
# first second third Freq
# 1 H H H 51
# 2 T H H 58
# 3 H T H 43
# 4 T T H 49
# 5 H H T 38
# 6 T H T 51
# 7 H T T 64
# 8 T T T 46
Note that the nrow of your data should be divisible by 3.
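If it is not, one workaround (my addition, not from the original answer) is to trim the trailing rows so the row count becomes a multiple of 3:
# drop the last nrow(My_Data) %% 3 rows
My_Data_trim <- My_Data[seq_len(nrow(My_Data) %/% 3 * 3), ]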
Edit
To generalize, we may write a small function.
f <- \(x, n) {
  ct <- cut(seq_len(nrow(x)), seq.int(1L, nrow(x), length.out = n + 1L), include.lowest = TRUE)
  res <- interaction(split(x$Results, ct)) |> table() |> as.data.frame()
  strsplit(as.character(res$Var1), '\\.') |> do.call(what = rbind) |>
    cbind.data.frame(res$Freq) |> setNames(c(LETTERS[seq_len(n)], 'Freq'))
}
f(My_Data, 4)
# A B C D Freq
# 1 H H H H 13
# 2 T H H H 25
# 3 H T H H 18
# 4 T T H H 17
# 5 H H T H 18
# 6 T H T H 15
# 7 H T T H 21
# 8 T T T H 24
# 9 H H H T 26
# 10 T H H T 15
# 11 H T H T 16
# 12 T T H T 18
# 13 H H T T 22
# 14 T H T T 18
# 15 H T T T 10
# 16 T T T T 24
Data:
set.seed(42)
My_Data <- data.frame(id=1:1200, Results=sample(c('H', 'T'), 1200, replace=TRUE))
A slightly generalized solution with tidyverse tools. Change the sets variable for longer or shorter sequences.
coin <- c("H", "T")
sets <- 4
rolls <- 10000
results <- sample(coin, sets * rolls, rep = TRUE)
named_results <- purrr::map_chr(
  0:(rolls - 1),
  ~ paste0(results[(sets * .x + 1):(sets * .x + sets)], collapse = "")
)
dplyr::count(tibble::tibble(x = named_results), x)
with output
# A tibble: 16 x 2
x n
<chr> <int>
1 HHHH 629
2 HHHT 627
3 HHTH 638
4 HHTT 599
5 HTHH 602
6 HTHT 633
7 HTTH 596
8 HTTT 661
9 THHH 631
10 THHT 589
11 THTH 633
12 THTT 647
13 TTHH 660
14 TTHT 637
15 TTTH 623
16 TTTT 595
sets = 8 would give something like
# A tibble: 256 x 2
x n
<chr> <int>
1 HHHHHHHH 37
2 HHHHHHHT 36
3 HHHHHHTH 43
4 HHHHHHTT 35
5 HHHHHTHH 38
6 HHHHHTHT 27
7 HHHHHTTH 32
8 HHHHHTTT 28
9 HHHHTHHH 33
10 HHHHTHHT 38
# ... with 246 more rows
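Note that this counts non-overlapping blocks of length sets. If overlapping windows are wanted instead, a variant of my own (not from the original answer) slides the window by one position rather than by sets:
named_overlapping <- purrr::map_chr(
  1:(length(results) - sets + 1),
  ~ paste0(results[.x:(.x + sets - 1)], collapse = "")
)
dplyr::count(tibble::tibble(x = named_overlapping), x)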
I am attempting to write a simulation that involves randomly re-assigning items to categories with some restrictions.
Let's say I have a collection of pebbles 1 to N distributed across buckets A through J:
set.seed(100)
df1 <- data.frame(pebble = 1:100,
                  bucket = sample(LETTERS[1:10], 100, T),
                  stringsAsFactors = F)
head(df1)
#> pebble bucket
#> 1 1 D
#> 2 2 C
#> 3 3 F
#> 4 4 A
#> 5 5 E
#> 6 6 E
I want to randomly re-assign pebbles to buckets. Without restrictions I could do it like so:
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
colSums(table(random.permutation.df1))
#> A B C D E F G H I J
#> 4 7 13 14 12 11 11 10 9 9
colSums(table(df1))
#> A B C D E F G H I J
#> 4 7 13 14 12 11 11 10 9 9
Importantly this re-assigns pebbles while ensuring that each bucket retains the same number (because we are sampling without replacement).
However, I have a set of restrictions such that certain pebbles cannot be assigned to certain buckets. I encode the restrictions in df2:
df2 <- data.frame(pebble = sample(1:100, 10),
                  bucket = sample(LETTERS[1:10], 10, T),
                  stringsAsFactors = F)
df2
#> pebble bucket
#> 1 33 I
#> 2 39 I
#> 3 5 A
#> 4 36 C
#> 5 55 J
#> 6 66 A
#> 7 92 J
#> 8 95 H
#> 9 2 C
#> 10 49 I
The logic here is that pebbles 33 and 39 cannot be placed in bucket I, or pebble 5 in bucket A, etc. I would like to permute which pebbles are in which bucket subject to these restrictions.
So far, I've thought of tackling it in a loop as below, but this does not result in buckets retaining the same number of pebbles:
perms <- character(0)
cnt <- 1
for (p in df1$pebble) {
  perms[cnt] <- sample(df1$bucket[!df1$bucket %in% df2$bucket[df2$pebble == p]], 1)
  cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G H I J
#> 6 7 12 22 15 1 14 7 7 9
I then tried sampling positions, and then removing that position from the available buckets and the available remaining positions. This is also not working, and I suspect it is because I am sampling my way into branches of the tree that do not yield solutions.
set.seed(42)
perms <- character(0)
cnt <- 1
ids <- 1:nrow(df1)
bckts <- df1$bucket
for (p in df1$pebble) {
  id <- sample(ids[!bckts %in% df2$bucket[df2$pebble == p]], 1)
  perms[cnt] <- bckts[id]
  bckts <- bckts[-id]
  ids <- ids[ids != id]
  cnt <- cnt + 1
}
table(perms)
#> perms
#> A B C D E F G J
#> 1 1 4 1 2 1 2 2
Any thoughts or advice much appreciated (and apologies for the length).
EDIT:
I foolishly forgot to clarify that I was previously solving this by just resampling until I got a draw that didn't violate any of the conditions in df2, but I now have so many conditions that this makes my code take too long to run. I am still up for trying to force it if I can figure out a way to make forcing it faster.
I have a solution (I managed to write it in base R, but the data.table version is easier to understand and write):
random.permutation.df2 <- data.frame(pebble = df1$pebble,
                                     bucket = rep(NA, length(df1$pebble)))
for (bucket in unique(df1$bucket)) {
  eligible <- is.na(random.permutation.df2$bucket) &
    !random.permutation.df2$pebble %in% df2$pebble[df2$bucket == bucket]
  N <- sum(eligible)
  random.permutation.df2$bucket[eligible] <-
    sample(c(rep(bucket, sum(df1$bucket == bucket)),
             rep(NA, N - sum(df1$bucket == bucket))))
}
The idea is to sample the authorised pebbles for each bucket: those that are not forbidden by df2 and those that are not already filled. You then sample a vector of the right length, choosing between NAs (left as placeholders for the following buckets) and the current bucket's value, and voilà.
The same is easier to read with data.table:
library(data.table)
random.permutation.df2 <- setDT(random.permutation.df2)
df2 <- setDT(df2)
for (bucketi in unique(df1$bucket)) {
  random.permutation.df2[is.na(bucket) & !pebble %in% df2[bucket == bucketi, pebble],
                         bucket := sample(c(rep(bucketi, sum(df1$bucket == bucketi)),
                                            rep(NA, .N - sum(df1$bucket == bucketi))))]
}
It satisfies the two conditions:
> colSums(table(df1))
A B C D E F G H I J
4 7 13 14 12 11 11 10 9 9
> colSums(table(random.permutation.df2))
A B C D E F G H I J
4 7 13 14 12 11 11 10 9 9
To verify that there isn't any contradiction with df2
> df2
pebble bucket
1: 37 D
2: 95 H
3: 90 C
4: 80 C
5: 31 D
6: 84 G
7: 76 I
8: 57 H
9: 7 E
10: 39 A
> random.permutation.df2[pebble %in% df2$pebble,.(pebble,bucket)]
pebble bucket
1: 7 D
2: 31 H
3: 37 J
4: 39 F
5: 57 B
6: 76 E
7: 80 F
8: 84 B
9: 90 H
10: 95 D
Here is a brute-force approach where one simply retries until a valid solution is found:
set.seed(123)
df1 <- data.frame(pebble = 1:100,
                  bucket = sample(LETTERS[1:10], 100, T),
                  stringsAsFactors = F)
df2 <- data.frame(pebble = sample(1:100, 10),
                  bucket = sample(LETTERS[1:10], 10, T),
                  stringsAsFactors = F)
random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
The first random permutation does not satisfy the restrictions, so we try new ones:
merge(random.permutation.df1, df2)
#> pebble bucket
#> 1 60 J
while (TRUE) {
  random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
  if (nrow(merge(random.permutation.df1, df2)) == 0)
    break
}
The new permutation satisfies the restrictions:
merge(random.permutation.df1, df2)
#> [1] pebble bucket
#> <0 rows> (or 0-length row.names)
colSums(table(random.permutation.df1))
#> A B C D E F G H I J
#> 7 12 11 9 14 7 11 11 11 7
colSums(table(df1))
#> A B C D E F G H I J
#> 7 12 11 9 14 7 11 11 11 7
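Since the OP's edit says that with many restrictions this can run for a very long time, a small safeguard of my own is to cap the number of attempts so the loop cannot spin forever on (nearly) unsatisfiable restrictions:
attempts <- 0
repeat {
  random.permutation.df1 <- data.frame(pebble = df1$pebble, bucket = sample(df1$bucket))
  attempts <- attempts + 1
  if (nrow(merge(random.permutation.df1, df2)) == 0) break  # valid draw found
  if (attempts >= 10000) stop("no valid permutation found") # give up eventually
}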
Say I have a dataframe like this:
library(dplyr)

set.seed(1)
n <- 20
df <- data.frame(ID = sample(1:5, n, replace = TRUE),
                 Fac1 = sample(letters[1:5], n, replace = TRUE),
                 Fac2 = sample(LETTERS[10:15], n, replace = TRUE),
                 Val1 = sample(1:10, n, replace = TRUE)) %>%
  arrange(ID) %>% group_by(ID, Fac1) %>%
  summarise(Val1 = sum(Val1), Fac2 = first(Fac2)) %>%
  group_by(ID, Fac2) %>%
  mutate(Val2 = sum(Val1))
df
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 N 10
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 N 6
13 5 a 13 M 13
14 5 b 3 N 3
ID is a grouping variable. Rows with a Fac1 value of e should have their Fac2 value changed to be the same as that of the other row in the group where Fac1 is either b or c and the sum of Val2 for the two rows is greater than 20. (I've simplified this to the point where you probably don't get why, but just work with me.)
This is what I have tried so far:
result <- df %>% group_by(ID) %>%
  mutate(Fac2 = case_when(
    Fac1 == "e" &
      sum(Val2, ifelse(Fac1 %in% c("b", "c"), Val2, 0)) > 20 ~
      ifelse(sum(Val2, ifelse(Fac1 %in% c("b", "c"), Val2, 0)) > 20,
             as.character(Fac2),
             NA_character_),
    TRUE ~ as.character(Fac2)
  ))
It doesn't work properly because it is summing the first value of Val2 in the group rather than only doing so when Fac1 is b or c.
Any ideas?
Adding desired outcome:
ID Fac1 Val1 Fac2 Val2
1 1 b 9 N 9
2 1 c 9 O 9
3 2 a 4 K 4
4 2 b 10 M 18
5 2 c 4 L 4
6 2 d 8 M 18
7 2 e 10 M 10 **Changed to M b/c row 4 is M and 10 + 18 > 20
8 3 d 14 N 14
9 4 b 8 L 22
10 4 c 14 L 22
11 4 d 9 K 9
12 4 e 6 L 6 **Changed to L b/c row 10 is L and 6 + 22 > 20
13 5 a 13 M 13
14 5 b 3 N 3
I'm having a hard time following what you want the values to be changed to.
But when I have multiple conditions or decisions that need to be made in a sequence, I use a loop and a series of if statements to go through the data frame. I prefer while loops, so that's what I'll use in the example.
counter <- 1
stopper <- nrow(df)
while (counter <= stopper) {
  fac1 <- df$Fac1[counter]
  if (fac1 == 'e') {
    if ([INSERT NEXT CONDITION]) {
      # Change whichever value you're trying to change, using the counter
      # to reference the correct row.
    } else {
      # Change whichever value you're trying to change, using the counter
      # to reference the correct row.
    }
  }
  counter <- counter + 1
}
For me, simplifying the code makes it a lot easier for me to keep track of what decisions are being made. It also allows for complex decisions that are difficult to get functions to work with.
I was able to get the desired result with this code. I made a new column containing the result of the test for what value to replace Fac2 with, which wasn't strictly necessary but makes it more readable and debuggable.
The key thing was to use first(na.omit()) to get the value from a different row in the same group which met the condition.
result <- df %>% group_by(ID) %>%
  mutate(Max_bc_Val = ifelse(Val2 == max(ifelse(Fac1 %in% c("b", "c"), Val2, 0)),
                             ifelse(Fac1 %in% c("b", "c"),
                                    as.character(Fac2), NA), NA)) %>%
  mutate(Fac2 = case_when(
    Fac1 == "e" ~ ifelse(is.na(first(na.omit(Max_bc_Val))),
                         NA_character_,
                         first(na.omit(Max_bc_Val))),
    TRUE ~ as.character(Fac2)))
This works but doesn't seem like the best solution. Any other ideas?
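One possibility, sketched by me rather than taken from the thread (the helper names bc_fac2 and bc_val2 are made up): index directly into the b/c rows of each group with which.max() instead of flagging them with a Max_bc_Val column:
library(dplyr)

result2 <- df %>%
  group_by(ID) %>%
  mutate(
    # Fac2 of the b/c row with the largest Val2 (NA if the group has none)
    bc_fac2 = if (any(Fac1 %in% c("b", "c"))) {
      as.character(Fac2[Fac1 %in% c("b", "c")])[which.max(Val2[Fac1 %in% c("b", "c")])]
    } else NA_character_,
    # that row's Val2 itself (-Inf if the group has none)
    bc_val2 = if (any(Fac1 %in% c("b", "c"))) max(Val2[Fac1 %in% c("b", "c")]) else -Inf,
    Fac2 = ifelse(Fac1 == "e" & Val2 + bc_val2 > 20, bc_fac2, as.character(Fac2))
  ) %>%
  select(-bc_fac2, -bc_val2)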
I have data of the form:
ID A1 A2 A3 ... A100
1 john max karl ... kevin
2 kevin bosy lary ... rosy
3 karl lary bosy ... hale
.
.
.
10000 isha john lewis ... dave
For each ID, I want to find one other ID such that the two of them have the maximum number of common attributes (A1, A2, ..., A100).
How can I do this in R?
Edit: Let's call the output a MatchId:
ID MatchId
1 70
2 4000
.
.
10000 3000
I think this gets what you're looking for:
library(dplyr)
# make up some data
set.seed(1492)
rbind_all(lapply(1:15, function(i) {
  x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
  colnames(x) <- c("ID", sprintf("A%d", 1:10))
  x
})) -> dat
print(dat)
## Source: local data frame [15 x 11]
##
## ID A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
## 1 1 H F E C B A R J Z N
## 2 2 Q P E M L Z C G V Y
## 3 3 Q J D N B T L K G Z
## 4 4 D Y U F V O I C A W
## 5 5 T Z D I J F R C B S
## 6 6 Q D H U P V O E R N
## 7 7 C L I M E K N S X Z
## 8 8 M J S E N O F Y X I
## 9 9 R H V N M T Q X L S
## 10 10 Q H L Y B W S M P X
## 11 11 M N J K B G S X V R
## 12 12 W X A H Y D N T Q I
## 13 13 K H V J D X Q W A U
## 14 14 M U F H S T W Z O N
## 15 15 G B U Y E L A Q W O
# get commons
rbind_all(lapply(1:15, function(i) {
  rbind_all(lapply(setdiff(1:15, i), function(j) {
    data.frame(id1 = i,
               id2 = j,
               common = length(intersect(c(t(dat[i, 2:11])),
                                         c(t(dat[j, 2:11])))))
  }))
})) -> commons
commons %>%
group_by(id1) %>%
top_n(1, common) %>%
filter(row_number()==1) %>%
select(ID=id1, MatchId=id2)
## Source: local data frame [15 x 2]
## Groups: ID
##
## ID MatchId
## 1 1 5
## 2 2 7
## 3 3 5
## 4 4 12
## 5 5 1
## 6 6 9
## 7 7 8
## 8 8 7
## 9 9 10
## 10 10 9
## 11 11 9
## 12 12 13
## 13 13 12
## 14 14 8
## 15 15 2
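A maintenance note: rbind_all() has since been deprecated and removed from dplyr. On current versions the same construction can be written with bind_rows(), e.g. for the data step (my adaptation):
dat <- bind_rows(lapply(1:15, function(i) {
  x <- cbind.data.frame(stringsAsFactors = FALSE, i, t(sample(LETTERS, 10)))
  colnames(x) <- c("ID", sprintf("A%d", 1:10))
  x
}))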
Using similar data as provided by @hrbrmstr:
set.seed(1492)
dat <- do.call(rbind, lapply(1:15, function(i) {
  x <- cbind.data.frame(stringsAsFactors=FALSE, i, t(sample(LETTERS, 10)))
  colnames(x) <- c("ID", sprintf("A%d", 1:10))
  x
}))
You could achieve the same using base R only
Res <- sapply(seq_len(nrow(dat)),
              function(x) apply(dat[-1], 1,
                                function(y) length(intersect(dat[x, -1], y))))
diag(Res) <- -1
cbind(dat[1], MatchId = max.col(Res, ties.method = "first"))
# ID MatchId
# 1 1 5
# 2 2 7
# 3 3 5
# 4 4 12
# 5 5 1
# 6 6 9
# 7 7 8
# 8 8 7
# 9 9 10
# 10 10 9
# 11 11 9
# 12 12 13
# 13 13 12
# 14 14 8
# 15 15 2
If I understand correctly, the requirement is to obtain the maximum number of common attributes for each ID.
Frequency tables can be obtained using table() applied recursively in lapply(), assuming the ID column is unique (a slight modification is necessary if not: unique(df$ID) rather than df$ID in lapply()). The maximum frequencies can then be taken and, if there is a tie, only the first one is chosen. Finally the pieces are combined by do.call().
df <- read.table(header = T, text = "
ID A1 A2 A3 A100
1 john max karl kevin
2 kevin bosy lary rosy
3 karl lary bosy hale
10000 isha john lewis dave")
do.call(rbind, lapply(df$ID, function(x) {
  tbl <- table(unlist(df[df$ID == x, 2:ncol(df)]))
  data.frame(ID = x, MatchId = tbl[tbl == max(tbl)][1])
}))
# ID MatchId
#john 1 1
#kevin 2 1
#karl 3 1
#isha 10000 1
I have two data frames of differing lengths. There is a unique factor that links the two data frames together. I want to multiply the values in the larger data frame by the matching factor in the smaller data frame. Here is code to demonstrate:
d1 <- data.frame(u = factor(x = LETTERS[1:5]), n1 = 1:5)
d2 <- data.frame(u = factor(x = rep(x = LETTERS[1:5], each = 2)), n2 = 1:10)
I want d2[1:2, 2] both multiplied by d1[1, 2] because the factor "A" matches, and so forth for the rest of the matching factors.
For this problem you can also use match, which should be somewhat more efficient than the merge/transform approach (particularly if you don't need the data.frame that the latter creates):
d2$n2 * d1[match(d2$u, d1$u), 'n1']
# [1] 1 2 6 8 15 18 28 32 45 50
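To see what match() contributes here: it returns, for each element of d2$u, the position of its first match in d1$u, and that index vector is then used to pick the right n1 row-wise:
match(d2$u, d1$u)
# [1] 1 1 2 2 3 3 4 4 5 5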
Use merge to join the two data frames, then transform to add a column to it.
> transform(merge(d1, d2), n.total = n1*n2)
u n1 n2 n.total
1 A 1 1 1
2 A 1 2 2
3 B 2 3 6
4 B 2 4 8
5 C 3 5 15
6 C 3 6 18
7 D 4 7 28
8 D 4 8 32
9 E 5 9 45
10 E 5 10 50
If you don't need the data frame created by transform you can use with instead.
> with(merge(d1, d2), n1*n2)
[1] 1 2 6 8 15 18 28 32 45 50
If you have a lot of data and the above solutions are too slow or inefficient, I suggest you go for @jbaums' solution, but otherwise I find that the increased readability of merge is preferable.
> require(microbenchmark)
> microbenchmark(transform(merge(d1, d2), n.total = n1*n2),
+ with(merge(d1, d2), n1*n2),
+ d2$n2 * d1[match(d2$u, d1$u), 'n1'])
Unit: microseconds
                                        expr     min       lq       mean   median        uq      max neval cld
 transform(merge(d1, d2), n.total = n1 * n2) 826.897 904.2275 1126.41204 940.3890 1087.0350 2695.521   100   c
                with(merge(d1, d2), n1 * n2) 658.295 722.6715  907.34581 764.2965  934.5555 2463.300   100  b
         d2$n2 * d1[match(d2$u, d1$u), "n1"]  49.372  59.5830   78.42575  66.2475   86.1505  260.820   100 a
If we are into speed comparisons, you might just as well try the data.table package (although for such a small data set, @jbaums' approach would probably be more efficient):
library(data.table)
setkey(setDT(d1), u); setDT(d2)
d1[d2][, n.total := n1*n2][]
# u n1 n2 n.total
# 1: A 1 1 1
# 2: A 1 2 2
# 3: B 2 3 6
# 4: B 2 4 8
# 5: C 3 5 15
# 6: C 3 6 18
# 7: D 4 7 28
# 8: D 4 8 32
# 9: E 5 9 45
# 10: E 5 10 50
Or, as suggested by @Arun:
d2[d1, n2 := n2*n1] # Update (by reference) `n2`
OR
d2[d1, new := n2*n1] # Add new column
Note: although these would be faster, you won't see column n1 in the final result.
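If you do want n1 alongside the result, here is a sketch of my own (assuming fresh copies of d1 and d2 converted with setDT()): join with an explicit on = and reach d1's column through the i. prefix:
# add both n1 and the product in one update, no keys needed
d2[d1, c("n1", "new") := .(i.n1, n2 * i.n1), on = "u"]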