Multiply columns in different dataframes - r

I am writing code to analyze a set of data with dplyr.
Here is how my table_1 looks:
1 A B C
2 5 2 3
3 9 4 1
4 6 3 8
5 3 7 3
And my table_2 looks like this:
1 D E F
2 2 9 3
Based on column "A" of table_1, I would like to create a column "G" in table_1 that equals "C*D+C*E" wherever A > 6.
Basically, table_2 acts as a set of factors.
Is there any way I can do this?
So far I can only apply a condition on column "A" and multiply column "C" by fixed numbers instead of the factors from table_2:
table_1_New <- mutate(table_1, G = if_else(A > 6, C*2 + C*9, 0))

You could try
#Initialize G column with 0
df1$G <- 0
#Get index where A value is greater than 6
inds <- df1$A > 6
#Multiply those values with D and E from df2
df1$G[inds] <- df1$C[inds] * df2$D + df1$C[inds] * df2$E
df1
# A B C G
#2 5 2 3 0
#3 9 4 1 11
#4 6 3 8 0
#5 3 7 3 0
Using dplyr, we can do
df1 %>% mutate(G = ifelse(A > 6, C*df2$D + C*df2$E, 0))
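For a self-contained check, the sketch below rebuilds the two tables from the question (the names df1 and df2 follow the answer above; `multiplier` and `result` are made up for illustration). It assumes df2 has exactly one row, so that D and E behave like single numbers; with more rows the multiplication would recycle or fail instead.

```r
library(dplyr)

# Data from the question; df2 is assumed to have exactly one row
df1 <- data.frame(A = c(5, 9, 6, 3), B = c(2, 4, 3, 7), C = c(3, 1, 8, 3))
df2 <- data.frame(D = 2, E = 9, F = 3)

# Combine the two factors once, so the formula reads C * (D + E)
multiplier <- df2$D + df2$E

# if_else() is stricter than ifelse(): both branches must have the same type
result <- df1 %>% mutate(G = if_else(A > 6, C * multiplier, 0))
result$G
```

Only the row with A = 9 passes the condition, so G is 11 there and 0 elsewhere, matching the output shown above.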

Related

Filtering observations using multivariate column conditions

I'm not a very experienced R user, so I'm seeking advice on how to optimize what I've built and which direction to take next.
I have one reference data frame; it contains four columns with integer values and one ID column.
df <- matrix(ncol=5,nrow = 10)
colnames(df) <- c("A","B","C","D","ID")
# df
for (i in 1:10){
df[i,1:4] <- sample(1:5,4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS,length.out=10),sep='')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
The second data frame holds manual user input. I want to turn this into a Shiny app later on, which is another reason I'm asking for optimization: my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from the reference sequentially, following these rules:
If a whole row of the reference table matches the manual input exactly, extract that row (or those rows) from the reference and show it. If not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A and B).
So with my limited knowledge I wrote this:
# subtraction manual from reference
df <- df %>% dplyr::mutate(Adiff=A-df.man$A)%>%
dplyr::mutate(Bdiff=B-df.man$B)%>%
dplyr::mutate(Cdiff=C-df.man$C) %>%
dplyr::mutate(Ddiff=D-df.man$D)
# check manually how much in a row has zero difference and filter those
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0),
"less then two exact match")
))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and finds ID G, but it looks ugly to me. So the first question is: what would be the recommended way to improve this? Are there any functions, packages, or something else I'm missing?
Second question: I want to make the condition more complex.
Imagine we have this reference data set.
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
The filtering rules should be the following:
If a whole row of the reference table matches the manual input exactly, extract that row (or those rows) from the reference and show it. If not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A and B).
From the rows where only two variables match, keep those that differ by ±1 in the columns to the right. So cases G and I should be filtered from the reference table in the example above.
Continuing the way I did above, I would do the following:
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)))>0,
df01 <- df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1)))>0,
df01<- df0%>%filter(Cdiff %in% (-1:1)),
"NA"))
There will be about 11 columns in the end, but I assume that doesn't matter much.
Keeping this objective in mind, how would you suggest I proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df as a matrix and use row names for your letters. Something like:
set.seed(2)
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
And for demonstration, you could use a vector for manual as this is input:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and reference (all 4 columns), with all numbers, you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, you could calculate the differences between df and vec.man:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
Rows whose diffs start with a run of zeros are your best matches (this is equivalent to dropping columns from right to left iteratively). The quality of each row's match is then given by the column index of its first non-zero element:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E, whose first non-zero difference is in the 4th column (only the last column did not match). You can extract the rows with the highest value in df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = T)), ]
A B C D
3 1 5 3
Finally, if only 2 columns match and you want the rows whose next column is off by +/- 1, check whether the maximum of df.best is 3 (i.e. the first mismatch is in the 3rd column). Then compare the absolute differences with the vector c(0, 0, 1), which means two exact matches followed by a 3rd column that is off by +/- 1:
# Example vec.man with only 2 matches
vec.man <- c(3, 1, 6, 9)
> df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = T) == 3) {
vec.alt = c(0, 0, 1)
df[apply(df.diff[,1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should be scalable for 11 columns and 4 matches.
To generalize for different numbers of columns, @IlyaT suggested:
n.cols <- max(df.best, na.rm=TRUE)
vec.alt <- c(rep(0, each=n.cols-1), 1)
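The steps above can be folded into one small helper. The sketch below is not from the answer: `prefix_match_length` is a hypothetical name, and it counts how many leading columns match exactly instead of returning the index of the first mismatch, so that a complete match gives the number of columns rather than NA. The reference matrix is typed in from the printed example, and the minimum of two matching columns comes from the question's rules.

```r
# Hypothetical helper (not from the answer): number of leading columns
# that match the manual input exactly, for each reference row
prefix_match_length <- function(ref, man) {
  diffs <- sweep(ref, 2, man)              # reference minus manual input
  apply(diffs, 1, function(x) {
    first_miss <- which(x != 0)[1]         # first column that differs
    if (is.na(first_miss)) length(x) else first_miss - 1
  })
}

# Reference matrix typed in from the printed example above
ref <- matrix(c(5, 1, 5, 1,
                4, 5, 1, 2,
                3, 1, 3, 2,
                3, 1, 1, 4,
                3, 1, 5, 3,
                1, 5, 5, 2,
                2, 3, 4, 3,
                1, 1, 5, 1,
                2, 4, 5, 5,
                4, 2, 5, 5),
              ncol = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:10], c("A", "B", "C", "D")))

n.match <- prefix_match_length(ref, c(3, 1, 6, 9))
best <- max(n.match)

# Keep the best rows only if at least two leading columns match
if (best >= 2) ref[n.match == best, , drop = FALSE] else "fewer than two exact matches"
```

With the manual input c(3, 1, 6, 9), rows C, D and E each match on the first two columns, which agrees with the df.match output shown earlier. Using drop = FALSE keeps the result a matrix even when only one row survives.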

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns containing the cumulative product of columns a, b, and c. However, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns, e.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows, apply cumprod over the reversed elements, and then reverse the result
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
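Or a vectorized option is rowCumprods from matrixStats (run on the original three-column x):
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another option in base R uses Reduce with accumulate = TRUE:
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))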
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
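The follow-up about restricting the calculation to columns b and c is not covered by the code above. A minimal sketch, assuming the columns of interest are selected by name and that at least two columns are chosen (with a single column, apply returns a plain vector and the transpose no longer lines up); the output names built with paste0 are placeholders:

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

# Columns to include in the reverse cumulative product (assumed choice)
cols <- c("b", "c")

# Row-wise reverse cumulative product over the selected columns only
rev_cumprod <- t(apply(x[cols], 1, function(r) rev(cumprod(rev(r)))))

# Hypothetical output names, one per selected column
x[paste0("result_", cols)] <- rev_cumprod
x
```

Column a stays in the data frame but does not enter any product, so the new columns hold b*c and c for each row.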

Using %in% for multiple criteria simultaneously

I have a dataframe showing some data on individuals (ID), with one row for each year they live. It also contains information on parent ID (P.ID) and parent age when born (P.AB).
# Dataframe A: 1 row per individual
dfA <- data.frame(
"ID" = c("A", "B", "C", "D", "E"),
"P.ID" = c(NA, "A", "A", "B", "B"),
"P.AB" = c(NA, 3, 4, 2, 4),
"LS" = c(5, 6, 3, 4, 5))
# Dataframe B: 1 row per year of life
dfB <- data.frame("ID" = rep(dfA[,'ID'], dfA[,'LS']+1))
dfB <- merge(dfB, dfA, by = "ID")
dfB[ ,'AGE'] <- 0
for(i in 2:length(dfB[,1])){
if(dfB[i,'ID'] == dfB[i-1, 'ID']){
dfB[i,'AGE'] <- dfB[i-1, 'AGE'] + 1
}
}
Giving:
> head(dfB)
ID P.ID P.AB LS AGE
1 A <NA> NA 5 0
2 A <NA> NA 5 1
3 A <NA> NA 5 2
4 A <NA> NA 5 3
5 A <NA> NA 5 4
6 A <NA> NA 5 5
What I am trying to do is get R to put a "1" into column REP to show the years in which an individual reproduced. E.g. B was born to A when A was 3, so the row where A is 3 years old gets a 1. I have been trying to do this using %in% but am struggling to make it work with multiple criteria. A workaround is to paste together the ID and age (plus a random string to make sure that there is no false duplication in my larger dataset), but this feels inelegant and unnecessarily complex. Can one use %in% for multiple criteria, and if so, how?
# Add 1 where an individual reproduced
dfB[,'REP'] <- 0
dfB[,'T1'] <- paste0(dfB[,'AGE'], "abcdefghijk656hjhjhj", dfB[,'ID'])
dfB[,'T2'] <- paste0(dfB[,'P.AB'], "abcdefghijk656hjhjhj", dfB[,'P.ID'])
dfB[,'REP'][dfB[,'T1'] %in% dfB[,'T2']] <- 1
dfB[,'T2'] <- dfB[,'T1'] <- NULL
dfB
The output would then look like this:
> dfB
ID P.ID P.AB LS AGE REP
1 A <NA> NA 5 0 0
2 A <NA> NA 5 1 0
3 A <NA> NA 5 2 0
4 A <NA> NA 5 3 1
5 A <NA> NA 5 4 1
6 A <NA> NA 5 5 0
7 B A 3 6 0 0
8 B A 3 6 1 0
9 B A 3 6 2 1
10 B A 3 6 3 0
11 B A 3 6 4 1
12 B A 3 6 5 0
13 B A 3 6 6 0
14 C A 4 3 0 0
15 C A 4 3 1 0
16 C A 4 3 2 0
17 C A 4 3 3 0
18 D B 2 4 0 0
19 D B 2 4 1 0
20 D B 2 4 2 0
21 D B 2 4 3 0
22 D B 2 4 4 0
23 E B 4 5 0 0
24 E B 4 5 1 0
25 E B 4 5 2 0
26 E B 4 5 3 0
27 E B 4 5 4 0
28 E B 4 5 5 0
I tried this (and some variants), which gets close: it adds them to the right individuals, but at the wrong years. It sees that A and B both reproduce, and that reproductions occurred at ages 2, 3, and 4 (6 events in total), but not that A and B both reproduce at age 4, while A also reproduces at age 3 and B also reproduces at age 2 (4 events in total):
dfB[,'REP'][dfB[,'P.ID'] %in% dfB[,'ID'] & dfB[,'P.AB'] %in% dfB[,'AGE']] <- 1
dfB[,'REP'][dfB[,'ID'] %in% dfB[,'P.ID'] & dfB[,'AGE'] %in% dfB[,'P.AB'] ] <- 1
As an extension of this, I'd like to have the number of offspring per age rather than just a 1 or 0. This works (I changed dfA so that B and C are twins), but it is also probably inefficient:
# Counts of offspring per year
dfA[,'PASTED'] <- paste0(dfA[,'P.ID'], "randomtext", dfA[,'P.AB'])
# Create rep column
dfB[,'REP'] <- 0
# Paste together ID and AGE columns to give unique row identifiers
dfB[,'T1'] <- paste0(dfB[,'AGE'], "randomtext", dfB[,'ID'])
dfB[,'T2'] <- paste0(dfB[,'P.AB'], "randomtext", dfB[,'P.ID'])
# Add Reps
dfB[,'REP'][dfB[,'T1'] %in% dfB[,'T2']] <- table(dfA[,'PASTED'])
# Remove excess columns
dfB[,'T2'] <- dfB[,'T1'] <- NULL
If you are thinking about using %in% with more than one column, then you are probably looking for a merge/join. You can do this all with base R, but I find it a bit easier to do with some help from dplyr:
library(dplyr)
dfB %>%
select(P.ID, P.AB) %>%
distinct() %>%
filter(!is.na(P.ID)) %>%
rename(ID=P.ID, AGE=P.AB) %>%
mutate(REP=1) %>%
left_join(dfB, .) %>%
mutate(REP=coalesce(REP, 0))
Basically you just find the unique parent/age values from the data, then you join that back to the same data.frame, but match on different columns.
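The same idea can be written in base R, and counting instead of flagging also covers the question's extension (number of offspring per age). This is only a sketch under a few assumptions: the names `births` and `out` are made up here, dfB is rebuilt with just the ID and AGE columns, and sequence() replaces the loop that fills AGE.

```r
# Parent table from the question (dfA), one row per individual
dfA <- data.frame(
  ID   = c("A", "B", "C", "D", "E"),
  P.ID = c(NA, "A", "A", "B", "B"),
  P.AB = c(NA, 3, 4, 2, 4),
  LS   = c(5, 6, 3, 4, 5))

# One row per year of life; sequence() builds ages 0..LS without a loop
dfB <- data.frame(ID  = rep(dfA$ID, dfA$LS + 1),
                  AGE = sequence(dfA$LS + 1) - 1)

# Count offspring per (parent, parent age) pair; rows with NA parents are dropped
births <- aggregate(REP ~ P.ID + P.AB,
                    data = transform(dfA, REP = 1), FUN = sum)
names(births) <- c("ID", "AGE", "REP")

# Left join on both keys at once, then fill non-reproducing years with 0
out <- merge(dfB, births, by = c("ID", "AGE"), all.x = TRUE)
out$REP[is.na(out$REP)] <- 0
out <- out[order(out$ID, out$AGE), ]
```

Because the join matches on ID and AGE together, a parent is only marked in the years in which it actually reproduced. If two offspring shared the same parent and parent age (twins), REP would be 2 for that year. The remaining dfA columns can be merged back by ID if they are needed.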

Sort Data in the Table

For example, I currently have the table
A B C
A 0 4 1
B 2 1 3
C 5 9 6
I would like to order the columns and rows by my own defined order, to get
B A C
B 1 2 3
A 4 0 1
C 9 5 6
This can be accomplished in base R. First we make the example data:
# make example data
df.text <- 'A B C
0 4 1
2 1 3
5 9 6'
df <- read.table(text = df.text, header = T)
rownames(df) <- LETTERS[1:3]
A B C
A 0 4 1
B 2 1 3
C 5 9 6
Then we simply re-order the columns and rows, using a vector of names as indices:
# re-order data
defined.order <- c('B', 'A', 'C')
df <- df[, defined.order]
df <- df[defined.order, ]
B A C
B 1 2 3
A 4 0 1
C 9 5 6
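If the data is a matrix rather than a data frame, rows and columns can be reordered in one subsetting call; a contingency table returned by table() can be indexed by dimnames in the same way. A small sketch with the example values typed in as a matrix (the names `m` and `m.sorted` are placeholders):

```r
# Same example as a matrix with row and column names
m <- matrix(c(0, 4, 1,
              2, 1, 3,
              5, 9, 6),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

defined.order <- c("B", "A", "C")

# Rows and columns reordered by name in a single subsetting call
m.sorted <- m[defined.order, defined.order]
m.sorted
```

Indexing by name rather than by position means the result does not depend on how the rows and columns were ordered to begin with.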
If the defined order is given as
defined_order <- c("B", "A", "C")
and the initial table is created by
library(data.table)
# create data first
dt <- fread("
id A B C
A 0 4 1
B 2 1 3
C 5 9 6")
# note that row names are added as own id column
then you could achieve the desired result using data.table as follows:
# change column order
setcolorder(dt, c("id", defined_order))
# change row order by looking up each name in the id column
dt[match(defined_order, id)]
# id B A C
# 1: B 1 2 3
# 2: A 4 0 1
# 3: C 9 5 6

R Sum columns by index

I need to find a way to sum columns by their index. I'm working on a bigread.csv file, so I'll show a sample of the problem here: I'd like, for example, to sum the 2nd to the 5th column and the 6th to the 7th column of the following matrix:
a 1 3 3 4 5 6
b 2 1 4 3 4 1
c 1 3 2 1 1 5
d 2 2 4 3 1 3
The result has to be like this:
a 11 11
b 10 5
c 7 6
d 11 4
The columns all have different names
We can use rowSums on the subset of columns i.e 2:5 and 6:7 separately and then create a new data.frame with the output.
data.frame(df1[1], Sum1=rowSums(df1[2:5]), Sum2=rowSums(df1[6:7]))
# id Sum1 Sum2
#1 a 11 11
#2 b 10 5
#3 c 7 6
#4 d 11 4
With dplyr, transmute can build the sums by column name or, using .[[i]], by column position:
library(dplyr)
df1 = data.frame(a=c(1,2,3,4,3,3),b=c(1,2,3,2,1,2),c=c(1,2,3,21,2,3))
# sum by column name
df2 = df1 %>% transmute(sum1 = a+b , sum2 = b+c)
# the same sums by column index
df2 = df1 %>% transmute(sum1 = .[[1]]+.[[2]], sum2 = .[[2]]+.[[3]])
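When there are many index ranges, writing one rowSums call per range gets repetitive. The sketch below keeps the ranges in a named list and loops over it in base R. The data is the sample from the question; the column names id and v1 to v6, as well as `blocks`, `sums` and `result`, are placeholders chosen here.

```r
# Sample data from the question; column names are placeholders
df1 <- data.frame(id = c("a", "b", "c", "d"),
                  v1 = c(1, 2, 1, 2), v2 = c(3, 1, 3, 2),
                  v3 = c(3, 4, 2, 4), v4 = c(4, 3, 1, 3),
                  v5 = c(5, 4, 1, 1), v6 = c(6, 1, 5, 3))

# Each list element is one block of column positions to add up
blocks <- list(Sum1 = 2:5, Sum2 = 6:7)

# rowSums() per block; sapply() binds the results into named columns
sums <- sapply(blocks, function(idx) rowSums(df1[idx]))
result <- data.frame(df1[1], sums)
result
```

Adding another range only requires another entry in `blocks`; the list names become the names of the sum columns.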
