Multiplication of different subsets with different data in R

I have a large dataset, which I split up into subsets. For each subset, I have to do the same calculations but with different numbers. Example:
Main Table
x a b c d
A 1 2 4 5
A 4 5 1 7
A 3 5 6 2
B 4 5 2 9
B 3 5 2 8
C 4 2 5 2
C 1 9 6 9
C 1 2 3 4
C 6 3 6 2
Additional Table for A
a b c d
A 5 1 6 1
Additional Table for B
a b c d
B 1 5 2 6
Additional Table for C
a b c d
C 8 2 4 1
I need to multiply all rows A in the Main Table by the values from the Additional Table for A, all rows B in the Main Table by the values from B, and all rows C in the Main Table by the values from C. It is completely fine to merge the additional tables into a combined one if this makes the solution easier.
I thought about a for-loop, but I am not able to put the different multipliers (from the Additional Tables) into the code. Since there is a large number of subgroups, coding each multiplication manually should be avoided. How do I do these multiplications?

If we start with the combined additional table as addDf and the main table as df:
addDf
x a b c d
A A 5 1 6 1
B B 1 5 2 6
C C 8 2 4 1
We can use a merge to repeat the matching multiplier row for each row of df, followed by element-wise multiplication:
df[-1] <- merge(addDf, data.frame(x = df[1]), by = "x")[-1] * df[order(df[1]), -1]
df
x a b c d
1 A 5 2 24 5
2 A 20 5 6 7
3 A 15 5 36 2
4 B 4 25 4 54
5 B 3 25 4 48
6 C 32 4 20 2
7 C 8 18 24 9
8 C 8 4 12 4
9 C 48 6 24 2
Note: Borrowed a little syntactic sugar from @akrun, namely the df[-1] assignment.
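A base R variation on the same idea looks up the multiplier row for each group with match() instead of merging, so the main table never needs to be reordered. This is only a minimal sketch: it assumes the combined table addDf has exactly one row per group and the same value columns as df, and the names value_cols, idx and result are introduced here for illustration.

```r
# Main table and combined multiplier table, re-typed from the question
df <- data.frame(
  x = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  a = c(1, 4, 3, 4, 3, 4, 1, 1, 6),
  b = c(2, 5, 5, 5, 5, 2, 9, 2, 3),
  c = c(4, 1, 6, 2, 2, 5, 6, 3, 6),
  d = c(5, 7, 2, 9, 8, 2, 9, 4, 2)
)
addDf <- data.frame(
  x = c("A", "B", "C"),
  a = c(5, 1, 8),
  b = c(1, 5, 2),
  c = c(6, 2, 4),
  d = c(1, 6, 1)
)

# Columns to multiply: everything except the grouping column
value_cols <- setdiff(names(df), "x")

# For every row of df, the position of its group in addDf
idx <- match(df$x, addDf$x)
stopifnot(!anyNA(idx))  # every group needs a multiplier row

# Element-wise product of two data frames of identical shape
result <- df
result[value_cols] <- df[value_cols] * addDf[idx, value_cols]
print(result)
```

Because the lookup is done per row, this also works when the rows of df are not sorted by group.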

We can use Map after splitting the main data 'df' (assuming that all of the datasets are data.frames).
df[-1] <- unsplit(Map(function(x,y) x*y[col(x)],
split(df[-1], df$x),
list(unlist(dfA), unlist(dfB), unlist(dfC))), df$x)
df
# x a b c d
#1 A 5 2 24 5
#2 A 20 5 6 7
#3 A 15 5 36 2
#4 B 4 25 4 54
#5 B 3 25 4 48
#6 C 32 4 20 2
#7 C 8 18 24 9
#8 C 8 4 12 4
#9 C 48 6 24 2
Or we can use a faster option with data.table
library(data.table)
setnames(setDT(do.call(rbind, list(dfA, dfB, dfC)), keep.rownames=TRUE)[df,
.(a= a*i.a, b= b*i.b, c = c*i.c, d= d*i.d), on = c('rn' = 'x'), by = .EACHI], 1, 'x')[]
# x a b c d
#1: A 5 2 24 5
#2: A 20 5 6 7
#3: A 15 5 36 2
#4: B 4 25 4 54
#5: B 3 25 4 48
#6: C 32 4 20 2
#7: C 8 18 24 9
#8: C 8 4 12 4
#9: C 48 6 24 2
The above would be tedious if there are many columns. In that case, we could use mget to retrieve the columns and apply * to them and the corresponding i. columns with Map:
setDT(do.call(rbind, list(dfA, dfB, dfC)), keep.rownames=TRUE)[df,
Map(`*`, mget(names(df)[-1]), mget(paste0("i.", names(df)[-1]))) ,
on = c('rn' = 'x'), by = .EACHI]

Related

How to add a column with repeating but changing sequence?

I'm trying to add a column with repeating sequence but one that changes for each group. In the example data, the group is the id column.
data <- tibble::expand_grid(id = 1:12, condition = c("a", "b", "c"))
data
id condition
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
... and so on
I'd like to add a column called order to repeat various combinations like 1 2 3 2 3 1 3 1 2 1 3 2 2 1 3 3 2 1 for each id.
In the end, the desired output will look like this
id condition order
1 a 1
1 b 2
1 c 3
2 a 2
2 b 3
2 c 1
3 a 3
3 b 1
3 c 2
... and so on
I'm looking for a simple mutate solution or base R solution. I tried generating a list of combinations but I'm not sure how to create a variable from that.
You can use perms from package pracma to generate all permutations, e.g.,
data %>%
cbind(order = c(t(pracma::perms(1:3))))
which gives
id condition order
1 1 a 3
2 1 b 2
3 1 c 1
4 2 a 3
5 2 b 1
6 2 c 2
7 3 a 2
8 3 b 3
9 3 c 1
10 4 a 2
11 4 b 1
12 4 c 3
13 5 a 1
14 5 b 2
15 5 c 3
16 6 a 1
17 6 b 3
18 6 c 2
19 7 a 3
20 7 b 2
21 7 c 1
22 8 a 3
23 8 b 1
24 8 c 2
25 9 a 2
26 9 b 3
27 9 c 1
28 10 a 2
29 10 b 1
30 10 c 3
31 11 a 1
32 11 b 2
33 11 c 3
34 12 a 1
35 12 b 3
36 12 c 2
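Since the question also asks for a base R option, the same idea can be sketched without pracma. The helper all_perms below is a hypothetical name written for this example, not a function from any package, and the sketch assumes the data is sorted by id with exactly three rows per id. The permutations come out in a different order than pracma::perms, and they repeat once all six have been used.

```r
# Hypothetical helper: all permutations of a vector, one per row
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  do.call(rbind, lapply(seq_along(v), function(i) {
    cbind(v[i], all_perms(v[-i]))
  }))
}

# Same layout as the question's data: 12 ids, 3 conditions each
data <- expand.grid(condition = c("a", "b", "c"), id = 1:12,
                    stringsAsFactors = FALSE)[, c("id", "condition")]

p <- all_perms(1:3)  # 6 x 3 matrix of permutations

# Pick one permutation per id, cycling through the 6 rows of p
ids <- unique(data$id)
idx <- (seq_along(ids) - 1) %% nrow(p) + 1

# Flatten row by row so each id gets its own permutation
data$order <- c(t(p[idx, ]))
head(data, 9)
```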

I want to delete the IDs that have no information in the remaining columns

Here is a representation of my dataset:
Number<-c(1:10)
AA<-c(head(LETTERS,4), rep(NA,6))
BB<-c(head(letters,6), rep(NA,4))
CC<-c(1:6, rep(NA,4))
DD<-c(10:14, rep(NA,5))
EE<-c(3:8, rep(NA,4))
FF<-c(6:1, rep(NA,4))
mydata<-data.frame(Number,AA,BB,CC,DD,EE,FF)
I want to delete all the IDs (Number) that have no information in the remaining columns, automatically. I want to tell the function that if there is a value in Number but there is only NA in all the remaining columns, delete the row.
I must have the dataframe below:
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
Another possible base R solution:
mydata[rowSums(is.na(mydata[,-1])) != ncol(mydata[,-1]), ]
Output
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
Or we could use apply:
mydata[!apply(mydata[,-1], 1, function(x) all(is.na(x))),]
A possible solution, using janitor::remove_empty:
library(dplyr)
library(janitor)
inner_join(mydata, remove_empty(mydata[-1], which = "rows"))
#> Joining, by = c("AA", "BB", "CC", "DD", "EE", "FF")
#> Number AA BB CC DD EE FF
#> 1 1 A a 1 10 3 6
#> 2 2 B b 2 11 4 5
#> 3 3 C c 3 12 5 4
#> 4 4 D d 4 13 6 3
#> 5 5 <NA> e 5 14 7 2
#> 6 6 <NA> f 6 NA 8 1
We can use if_any/if_all
library(dplyr)
mydata %>%
filter(if_any(-Number, complete.cases))
Output
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
or
mydata %>%
filter(!if_all(-Number, is.na))
Or with base R
subset(mydata, rowSums(!is.na(mydata[-1])) >0 )
Number AA BB CC DD EE FF
1 1 A a 1 10 3 6
2 2 B b 2 11 4 5
3 3 C c 3 12 5 4
4 4 D d 4 13 6 3
5 5 <NA> e 5 14 7 2
6 6 <NA> f 6 NA 8 1
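The rowSums approach can be wrapped in a small function so the identifier columns are named explicitly rather than addressed by position. This is a sketch only; drop_empty_rows and its id_cols argument are hypothetical names introduced for this example.

```r
# Data from the question
Number <- c(1:10)
AA <- c(head(LETTERS, 4), rep(NA, 6))
BB <- c(head(letters, 6), rep(NA, 4))
CC <- c(1:6, rep(NA, 4))
DD <- c(10:14, rep(NA, 5))
EE <- c(3:8, rep(NA, 4))
FF <- c(6:1, rep(NA, 4))
mydata <- data.frame(Number, AA, BB, CC, DD, EE, FF)

# Hypothetical helper: drop rows where every non-id column is NA
drop_empty_rows <- function(data, id_cols) {
  value_cols <- setdiff(names(data), id_cols)
  # Count the non-missing values per row across the value columns
  keep <- rowSums(!is.na(data[value_cols])) > 0
  data[keep, , drop = FALSE]
}

cleaned <- drop_empty_rows(mydata, id_cols = "Number")
print(cleaned)
```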
Try this (note that it drops columns that are entirely NA, rather than rows):
df <- df[,colSums(is.na(df))<nrow(df)]
This makes a copy of your data though. If you have a large dataset then you can use:
Filter(function(x)!all(is.na(x)), df)
Or, if you want to use data.table, which is usually a pretty solid go-to:
library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=FALSE]

r - dedupe the rows with value in dataframe

How to subset only the rows with values in a particular column among the duplicates based on another column.
Example:
df
A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5
So in the above dataset, the first 4 rows are duplicates based on columns A and C. Among them, I want to keep only the rows which have a value in column B.
Desired output,
A B C D
1 5 8 9
2 6 5 8
3 NA 8 5
Thanks.
Using dplyr:
library(dplyr)
df <- read.table(text="A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5", header=TRUE)
df %>%
group_by(A,C) %>%
filter(n()==1|!is.na(B))
A B C D
<int> <int> <int> <int>
1 1 5 8 9
2 2 6 5 8
3 3 NA 8 5
In base R, keep the rows that are duplicated on A and C (checking both forwards and backwards) and are not missing in B, plus the rows that are not duplicated at all:
anydup <- duplicated(df[c("A","C")]) | duplicated(df[c("A","C")], fromLast=TRUE)
df[(anydup & (!is.na(df$B))) | (!anydup),]
# A B C D
#2 1 5 8 9
#3 2 6 5 8
#5 3 NA 8 5
Or use ave to check the length per group as per @HubertL's dplyr answer:
df[!is.na(df$B) | ave(df$B, df[c("A","C")], FUN=length)==1,]
# A B C D
#2 1 5 8 9
#3 2 6 5 8
#5 3 NA 8 5
Here is one option with data.table
library(data.table)
setDT(df)[df[, .I[.N==1 | complete.cases(B)] , .(A, C)]$V1]
# A B C D
#1: 1 5 8 9
#2: 2 6 5 8
#3: 3 NA 8 5
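One edge case is worth noting: with the conditions used above, a duplicated (A, C) group in which B is missing in every row is dropped entirely. The question does not say what should happen then. If such rows should be kept, a base R variant along these lines might do it; this is a sketch using a different condition than the answers above, and key, group_all_na and kept are names introduced here.

```r
df <- read.table(text = "A B C D
1 NA 8 7
1 5 8 9
2 6 5 8
2 NA 5 6
3 NA 8 5", header = TRUE)

# One grouping key per (A, C) combination
key <- paste(df$A, df$C)

# TRUE for every row whose group has no non-missing B at all
group_all_na <- ave(is.na(df$B), key, FUN = all)

# Keep rows with a value in B, or rows from groups where B is always missing
kept <- df[!is.na(df$B) | group_all_na, ]
print(kept)
```

On the example data this returns the same three rows as the answers above, since the only group without any value in B has a single row.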

R Subset matching contiguous blocks

I have a dataframe.
dat <- data.frame(k=c("A","A","B","B","B","A","A","A"),
a=c(4,2,4,7,5,8,3,2),b=c(2,5,3,5,8,4,5,8),
stringsAsFactors = F)
k a b
1 A 4 2
2 A 2 5
3 B 4 3
4 B 7 5
5 B 5 8
6 A 8 4
7 A 3 5
8 A 2 8
I would like to subset contiguous blocks based on variable k. This would be a standard approach.
#using rle rather than levels
kval <- rle(dat$k)$values
for(i in 1:length(kval))
{
subdf <- subset(dat,dat$k==kval[i])
print(subdf)
#do something with subdf
}
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
k a b
3 B 4 3
4 B 7 5
5 B 5 8
k a b
1 A 4 2
2 A 2 5
6 A 8 4
7 A 3 5
8 A 2 8
So the subsetting above obviously does not work the way I intended. Any elegant way to get these results?
k a b
1 A 4 2
2 A 2 5
k a b
1 B 4 3
2 B 7 5
3 B 5 8
k a b
1 A 8 4
2 A 3 5
3 A 2 8
We can use rleid from data.table to create a grouping variable
library(data.table)
setDT(dat)[, grp := rleid(k)]
dat
# k a b grp
#1: A 4 2 1
#2: A 2 5 1
#3: B 4 3 2
#4: B 7 5 2
#5: B 5 8 2
#6: A 8 4 3
#7: A 3 5 3
#8: A 2 8 3
We can group by 'grp' and do all the operations within the 'grp' using standard data.table methods.
Here is a base R option to create 'grp'
dat$grp <- with(dat, cumsum(c(TRUE, k[-1]!= k[-length(k)])))
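With such a run identifier in hand, the contiguous blocks asked for in the question can be obtained with split(). The following is a minimal base R sketch; grp and blocks are names chosen for this example, and the row names are reset only to mirror the desired output.

```r
dat <- data.frame(k = c("A", "A", "B", "B", "B", "A", "A", "A"),
                  a = c(4, 2, 4, 7, 5, 8, 3, 2),
                  b = c(2, 5, 3, 5, 8, 4, 5, 8),
                  stringsAsFactors = FALSE)

# Run identifier: increases by one whenever k differs from the previous row
grp <- cumsum(c(TRUE, dat$k[-1] != dat$k[-nrow(dat)]))

# One data frame per contiguous block
blocks <- split(dat, grp)

for (block in blocks) {
  rownames(block) <- NULL  # restart row numbering within each block
  print(block)
  # do something with block
}
```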

Reproduce character pattern as numeric pattern

I would like to expand the following data frame
d <- data.frame(a = c(rep("A",5),rep("B",5),rep("C",3),rep("D",2)))
> d
a
1 A
2 A
3 A
4 A
5 A
6 B
7 B
8 B
9 B
10 B
11 C
12 C
13 C
14 D
15 D
so that there is a column b looking like:
> d
a b
1 A 1
2 A 1
3 A 1
4 A 1
5 A 1
6 B 2
7 B 2
8 B 2
9 B 2
10 B 2
11 C 3
12 C 3
13 C 3
14 D 4
15 D 4
Not really sure how to realise that.
Use match:
d$b <- match(d$a, unique(d$a))
Or convert to a factor whose levels follow the order of first appearance:
d$b <- as.integer(factor(d$a, levels=unique(d$a)))
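Both approaches number the values by order of first appearance, so they should agree. The short sketch below checks this on the question's data and also shows one property to be aware of: a value that reappears later gets its earlier number again, rather than a new one. The names by_match, by_factor and recurring are introduced here for illustration.

```r
d <- data.frame(a = c(rep("A", 5), rep("B", 5), rep("C", 3), rep("D", 2)))

# Position of each value among the distinct values, in order of appearance
by_match <- match(d$a, unique(d$a))

# Same numbering via factor codes with levels in order of appearance
by_factor <- as.integer(factor(d$a, levels = unique(d$a)))

d$b <- by_match
print(d)

# A value that comes back later reuses its earlier number
recurring <- c("A", "B", "A")
match(recurring, unique(recurring))
```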
