Conditional Subset, Manipulate and Replace - r

Following on from a previous question here I extracted the following data.frame
DF <- data.frame(A =c("One","Two","Three","Four","Five"),
B=c(1,1,2,2,3),
D=c(10,2,3,-5,5))
subset(DF, B %in% c(1,3))
A B D
1 One 1 10
2 Two 1 2
5 Five 3 5
Now I want to (for example) multiply the numbers by (say) five and put them back into the original data.frame.
The following code
subset(DF, B %in% c(1,3))[,2:3] * 5
B D
1 5 50
2 5 10
5 15 25
gives me the numbers I want, but how do I get them back into the original data.frame so that it looks like this?
A B D
1 One 5 50
2 Two 5 10
3 Three 2 3
4 Four 2 -5
5 Five 15 25
The answer is staring me in the face (ie the index numbers ... but how do I get to them)?

You can do
DF[DF$B %in% c(1, 3), 2:3] <- DF[DF$B %in% c(1, 3), 2:3] * 5
DF
# A B D
#1 One 5 50
#2 Two 5 10
#3 Three 2 3
#4 Four 2 -5
#5 Five 15 25
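If repeating the condition on both sides feels error-prone, one variation is to store the logical index once and reuse it. This is only a sketch of the same idea; the names idx and num_cols are mine and not part of the answer above.

```r
DF <- data.frame(A = c("One", "Two", "Three", "Four", "Five"),
                 B = c(1, 1, 2, 2, 3),
                 D = c(10, 2, 3, -5, 5))

# Logical index of the rows to change (TRUE where B is 1 or 3)
idx <- DF$B %in% c(1, 3)

# Columns to multiply, referred to by name rather than by position
num_cols <- c("B", "D")

# Overwrite only the selected cells; all other rows stay as they were
DF[idx, num_cols] <- DF[idx, num_cols] * 5
DF
```

The row numbers the question asks about can be recovered with which(idx), but the logical vector itself is enough for the assignment.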

Related

Filtering observations using multivariate column conditions

I'm not a very experienced R user, so I'm seeking advice on how to optimize what I've built and which direction to take next.
I have one reference data frame, it contains four columns with integer values and one ID.
df <- matrix(ncol=5,nrow = 10)
colnames(df) <- c("A","B","C","D","ID")
# df
for (i in 1:10){
df[i,1:4] <- sample(1:5,4, replace = TRUE)
}
df <- data.frame(df)
df$ID <- make.unique(rep(LETTERS,length.out=10),sep='')
df
A B C D ID
1 2 4 3 5 A
2 5 1 3 5 B
3 3 3 5 3 C
4 4 3 1 5 D
5 2 1 2 5 E
6 5 4 4 5 F
7 4 4 3 3 G
8 2 1 5 5 H
9 4 4 1 3 I
10 4 2 2 2 J
The second data frame holds manual user input. I want to turn this into a Shiny app later on, which is another reason I'm asking about optimization: my code doesn't seem very neat to me.
df.man <- data.frame(matrix(ncol=5,nrow=1))
colnames(df.man) <- c("A","B","C","D","ID")
df.man$ID <- c("man")
df.man$A <- 4
df.man$B <- 4
df.man$C <- 3
df.man$D <- 4
df.man
A B C D ID
4 4 3 4 man
I want to filter rows from the reference sequentially, following this rule:
If a whole row of the reference table exactly matches the manual input, extract that row (or those rows) from the reference and show it. If not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A, B).
So, with my limited knowledge, I wrote this:
library(dplyr)
# subtract the manual input from the reference
df <- df %>% dplyr::mutate(Adiff=A-df.man$A)%>%
dplyr::mutate(Bdiff=B-df.man$B)%>%
dplyr::mutate(Cdiff=C-df.man$C) %>%
dplyr::mutate(Ddiff=D-df.man$D)
# check manually how much in a row has zero difference and filter those
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0 & Ddiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0 & Cdiff==0),
ifelse(nrow(df%>%filter(Adiff==0 & Bdiff==0)) != 0,
df0<-df%>%filter(Adiff==0 & Bdiff==0),
"less then two exact match")
))
tbl_df(df0[,1:5])
# A tibble: 1 x 5
A B C D ID
<int> <int> <int> <int> <chr>
1 4 4 3 3 G
It works and finds ID G, but it looks ugly to me. So the first question is: what would be the recommended way to improve this? Are there any functions, packages, or something else I'm missing?
Second question: I want to make the condition more complicated.
Imagine we have this reference data set:
A B C D ID
2 4 3 5 A
5 1 3 5 B
3 3 5 3 C
4 3 1 5 D
2 1 2 5 E
5 4 4 5 F
4 4 3 3 G
2 1 5 5 H
4 4 1 3 I
4 2 2 2 J
Manual input is
A B C D ID
4 4 2 2 man
The filtering rules should be as follows:
If a whole row of the reference table exactly matches the manual input, extract that row (or those rows) from the reference and show it. If not, reduce the number of matching columns from right to left until there is a match, but never match on fewer than two variables (columns A, B).
From the rows where only two variables match, keep those that differ by ± 1 in the columns to the right. So in the example above I should get cases G and I from the reference table.
Continuing the way I did above, I would do the following:
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)))>0,
df01 <- df0%>%filter(Cdiff %in% (-1:1) & Ddiff %in% (-1:1)),
ifelse(nrow(df0%>%filter(Cdiff %in% (-1:1)))>0,
df01<- df0%>%filter(Cdiff %in% (-1:1)),
"NA"))
There will be about 11 columns in the end, but I assume that doesn't matter much.
Keeping this objective in mind, how would you suggest I proceed?
Thanks!
This is a lot to sort through, but I have some ideas that might be helpful.
First, you could keep your df a matrix, and use row names for your letters. Something like:
set.seed(2)
df
A B C D
A 5 1 5 1
B 4 5 1 2
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
F 1 5 5 2
G 2 3 4 3
H 1 1 5 1
I 2 4 5 5
J 4 2 5 5
And for demonstration, you could use a vector for manual as this is input:
# Complete match example
vec.man <- c(3, 1, 5, 3)
To check for complete matches between the manual input and reference (all 4 columns), with all numbers, you can do:
df[apply(df, 1, function(x) all(x == vec.man)), ]
A B C D
3 1 5 3
If you don't have a complete match, you could calculate the differences between df and vec.man:
# Change example vec.man
vec.man <- c(3, 1, 5, 2)
df.diff <- sweep(df, 2, vec.man)
A B C D
A 2 0 0 -1
B 1 4 -4 0
C 0 0 -2 0
D 0 0 -4 2
E 0 0 0 1
F -2 4 0 0
G -1 2 -1 1
H -2 0 0 -1
I -1 3 0 3
J 1 1 0 3
Rows whose diffs start with, and continue with, 0 are your best matches (this is the same as dropping columns from right to left iteratively). The quality of each row's match is then given by the position of its first non-zero element:
df.best <- apply(df.diff, 1, function(x) which(x!=0)[1])
A B C D E F G H I J
1 1 3 3 4 1 1 1 1 1
You can see that the best match is E, whose first non-zero difference is in the 4th column (only the last column did not match). You can extract the rows that have the maximum value in df.best as your best matches:
df.match <- df[which(df.best == max(df.best, na.rm = TRUE)), ]
A B C D
3 1 5 3
Finally, if only 2 columns match and you want all the rows whose next column is off by +/- 1, check whether the maximum of df.best is 3 (meaning the first mismatch is in the 3rd column). Then compare the absolute differences with the vector c(0,0,1), which implies 2 matches followed by a 3rd column that is off by exactly 1:
# Example vec.man with only 2 matches
vec.man <- c(3, 1, 6, 9)
> df.match
A B C D
C 3 1 3 2
D 3 1 1 4
E 3 1 5 3
if (max(df.best, na.rm = TRUE) == 3) {
vec.alt = c(0, 0, 1)
df[apply(df.diff[,1:3], 1, function(x) all(abs(x) == vec.alt)), ]
}
A B C D
3 1 5 3
This should scale to 11 columns and 4 matches.
To generalize for different numbers of columns, @IlyaT suggested:
n.cols <- max(df.best, na.rm=TRUE)
vec.alt <- c(rep(0, each=n.cols-1), 1)
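As a self-contained variation on the same idea, the number of leading matching columns can be counted directly with cumprod, which also covers a complete match without producing NA. This is a sketch rather than the answer's exact code: the matrix is typed in from the printout above, and count_leading, best_rows and min_match are hypothetical names introduced for illustration.

```r
# Reference matrix copied from the printed example above
ref <- matrix(c(5, 1, 5, 1,
                4, 5, 1, 2,
                3, 1, 3, 2,
                3, 1, 1, 4,
                3, 1, 5, 3,
                1, 5, 5, 2,
                2, 3, 4, 3,
                1, 1, 5, 1,
                2, 4, 5, 5,
                4, 2, 5, 5),
              ncol = 4, byrow = TRUE,
              dimnames = list(LETTERS[1:10], c("A", "B", "C", "D")))

# Hypothetical helper: number of columns that match the manual input,
# counted from the left and stopping at the first mismatch
count_leading <- function(ref, man) {
  eq <- sweep(ref, 2, man, FUN = "==")        # logical matrix of cell matches
  apply(eq, 1, function(r) sum(cumprod(r)))   # cumprod drops to 0 after a FALSE
}

# Hypothetical helper: rows with the longest run of leading matches,
# or NULL when fewer than min_match leading columns match anywhere
best_rows <- function(ref, man, min_match = 2) {
  n <- count_leading(ref, man)
  if (max(n) < min_match) return(NULL)
  ref[n == max(n), , drop = FALSE]
}

best_rows(ref, c(3, 1, 5, 3))  # complete match
best_rows(ref, c(3, 1, 5, 2))  # first three columns match
best_rows(ref, c(9, 9, 9, 9))  # nothing matches, so NULL
```

Because the helpers take the column count from the matrix itself, the same code should work unchanged with 11 columns; the +/- 1 tolerance step would still need to be added on top.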

cumulative product in R across column

I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns that hold the cumulative product of columns a, b and c. However, I need a reverse cumulative product, i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
The column names do not matter; this is just an example. Does anyone have any idea how to do this?
As per my comment, is it possible to do this on a subset of columns, e.g. only for columns b and c, to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows with apply, take the cumprod of the reversed elements, and then reverse the result:
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or, as a vectorized option, use rowCumprods from matrixStats (run this on the original three-column x, since it uses every column of x):
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
Another base R option is Reduce with accumulate = TRUE:
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
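For the follow-up about using only some of the columns, the apply approach can be restricted to a chosen set. This is a hedged sketch, not taken from the answers above; cols and the result_ prefix are my own naming, so the new columns come out as result_b and result_c rather than results_e and results_f.

```r
x <- data.frame(a = c(1, 1), b = c(2, 2), c = c(3, 4))

# Columns to include in the reverse cumulative product (column a is ignored)
cols <- c("b", "c")

# For each row: reverse, take the cumulative product, reverse back.
# apply() returns one column per row, hence the t().
res <- t(apply(x[cols], 1, function(r) rev(cumprod(rev(r)))))

# Attach the results under new names
x[paste0("result_", cols)] <- res
x
```

Note that this sketch assumes at least two columns in cols; with a single column, apply returns a plain vector and the t() step would need adjusting.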

Using %in% for multiple criteria simultaneously

I have a dataframe showing some data on individuals (ID), where for each year they live there is one row. It also contains information on parent ID (P.ID) and parent age when born (P.AB).
# Dataframe A: 1 row per individual
dfA <- data.frame(
"ID" = c("A", "B", "C", "D", "E"),
"P.ID" = c(NA, "A", "A", "B", "B"),
"P.AB" = c(NA, 3, 4, 2, 4),
"LS" = c(5, 6, 3, 4, 5))
# Dataframe B: 1 row per year of life
dfB <- data.frame("ID" = rep(dfA[,'ID'], dfA[,'LS']+1))
dfB <- merge(dfB, dfA, by = "ID")
dfB[ ,'AGE'] <- 0
for(i in 2:length(dfB[,1])){
if(dfB[i,'ID'] == dfB[i-1, 'ID']){
dfB[i,'AGE'] <- dfB[i-1, 'AGE'] + 1
}
}
Giving:
> head(dfB)
ID P.ID P.AB LS AGE
1 A <NA> NA 5 0
2 A <NA> NA 5 1
3 A <NA> NA 5 2
4 A <NA> NA 5 3
5 A <NA> NA 5 4
6 A <NA> NA 5 5
What I am trying to do is get R to put a "1" into column REP to show the years in which an individual reproduced. E.g. B was born to A when A was 3, so the row where A is 3 years old gets a 1. I have been trying to do this using %in% but am struggling to make it work with multiple criteria. A workaround is to paste together the ID and age (plus a random string to make sure there is no false duplication in my larger dataset), but this feels inelegant and unnecessarily complex. Can one use %in% for multiple criteria, and if so, how?
# Add 1 where an individual reproduced
dfB[,'REP'] <- 0
dfB[,'T1'] <- paste0(dfB[,'AGE'], "abcdefghijk656hjhjhj", dfB[,'ID'])
dfB[,'T2'] <- paste0(dfB[,'P.AB'], "abcdefghijk656hjhjhj", dfB[,'P.ID'])
dfB[,'REP'][dfB[,'T1'] %in% dfB[,'T2']] <- 1
dfB[,'T2'] <- dfB[,'T1'] <- NULL
dfB
The output would then look like this:
> dfB
ID P.ID P.AB LS AGE REP
1 A <NA> NA 5 0 0
2 A <NA> NA 5 1 0
3 A <NA> NA 5 2 0
4 A <NA> NA 5 3 1
5 A <NA> NA 5 4 1
6 A <NA> NA 5 5 0
7 B A 3 6 0 0
8 B A 3 6 1 0
9 B A 3 6 2 1
10 B A 3 6 3 0
11 B A 3 6 4 1
12 B A 3 6 5 0
13 B A 3 6 6 0
14 C A 4 3 0 0
15 C A 4 3 1 0
16 C A 4 3 2 0
17 C A 4 3 3 0
18 D B 2 4 0 0
19 D B 2 4 1 0
20 D B 2 4 2 0
21 D B 2 4 3 0
22 D B 2 4 4 0
23 E B 4 5 0 0
24 E B 4 5 1 0
25 E B 4 5 2 0
26 E B 4 5 3 0
27 E B 4 5 4 0
28 E B 4 5 5 0
I tried this (and some variants), which gets close: it adds the 1s to the right individuals but in the wrong years. It sees that A and B both reproduce, and that reproduction occurred at ages 2, 3 and 4 (6 events in total), but not that A and B both reproduce at age 4 while A also reproduces at age 3 and B also reproduces at age 2 (4 events in total):
dfB[,'REP'][dfB[,'P.ID'] %in% dfB[,'ID'] & dfB[,'P.AB'] %in% dfB[,'AGE']] <- 1
dfB[,'REP'][dfB[,'ID'] %in% dfB[,'P.ID'] & dfB[,'AGE'] %in% dfB[,'P.AB'] ] <- 1
As an extension, I'd like to have the number of offspring per age rather than just a 1 or 0. This works (I changed dfA so that B and C are twins), but it is probably also inefficient:
# Counts of offspring per year
dfA[,'PASTED'] <- paste0(dfA[,'P.ID'], "randomtext", dfA[,'P.AB'])
# Create rep column
dfB[,'REP'] <- 0
# Paste together ID and AGE columns to give unique row identifiers
dfB[,'T1'] <- paste0(dfB[,'AGE'], "randomtext", dfB[,'ID'])
dfB[,'T2'] <- paste0(dfB[,'P.AB'], "randomtext", dfB[,'P.ID'])
# Add Reps
dfB[,'REP'][dfB[,'T1'] %in% dfB[,'T2']] <- table(dfA[,'PASTED'])
# Remove excess columns
dfB[,'T2'] <- dfB[,'T1'] <- NULL
If you are thinking about using %in% with more than one column, then you are probably looking for a merge/join. You can do this all with base R, but I find it a bit easier with some help from dplyr:
library(dplyr)
dfB %>%
select(P.ID, P.AB) %>%
distinct() %>%
filter(!is.na(P.ID)) %>%
rename(ID=P.ID, AGE=P.AB) %>%
mutate(REP=1) %>%
left_join(dfB, .) %>%
mutate(REP=coalesce(REP, 0))
Basically you just find the unique parent/age values from the data, then you join that back to the same data.frame, but match on different columns.
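For the base R route mentioned above, and for the follow-up about offspring counts, something like the following might work. It is a sketch under my own assumptions: dfB is rebuilt here with rep and sequence instead of the loop, and births, n and out are names I introduced. The join idea is the same: count births per parent/age pair, then merge that back onto the per-year rows.

```r
dfA <- data.frame(ID   = c("A", "B", "C", "D", "E"),
                  P.ID = c(NA, "A", "A", "B", "B"),
                  P.AB = c(NA, 3, 4, 2, 4),
                  LS   = c(5, 6, 3, 4, 5),
                  stringsAsFactors = FALSE)

# One row per year of life: repeat each individual LS + 1 times, ages 0..LS
dfB <- dfA[rep(seq_len(nrow(dfA)), dfA$LS + 1), ]
dfB$AGE <- sequence(dfA$LS + 1) - 1
rownames(dfB) <- NULL

# Number of offspring per parent and parental age
# (rows with NA parents are dropped by aggregate's default na.action)
births <- aggregate(n ~ P.ID + P.AB, data = cbind(dfA, n = 1), FUN = sum)
names(births) <- c("ID", "AGE", "REP")

# Left join on both key columns at once, then fill the non-matches with 0
out <- merge(dfB, births, by = c("ID", "AGE"), all.x = TRUE)
out$REP[is.na(out$REP)] <- 0
out <- out[order(out$ID, out$AGE), ]
out
```

With twins in dfA, REP would hold 2 for that parent and age; out$REP > 0 gives back the plain 1/0 indicator.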

R: Filtering by two columns using "is not equal" operator dplyr/subset

This question must have been answered before, but I cannot find it anywhere. I need to filter/subset a data frame using values in two columns to remove rows. In the example I want to remove only the rows that have both replicate "1" and treatment "a", and keep everything else. However, both subset and filter remove all of replicate 1 and all of treatment a. I could solve it by using which and then indexing, but that does not work well with the pipe operator. Do you know why filter/subset do not restrict the removal to rows where both conditions are true?
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
#filtering data
> filter(df, replicate!=1, treatment!="a")
replicate treatment
1 2 b
2 3 b
3 2 b
4 3 b
> subset(df, (replicate!=1 & treatment!="a"))
replicate treatment
8 2 b
9 3 b
11 2 b
12 3 b
#solution by which - indexing
index = which(df$replicate==1 & df$treatment=="a")
> df[-index,]
replicate treatment
2 2 a
3 3 a
5 2 a
6 3 a
7 1 b
8 2 b
9 3 b
10 1 b
11 2 b
12 3 b
I think you're looking to use an "or" condition here. How does this look:
require(dplyr)
#Create example dataframe
replicate = rep(c(1:3), times = 4)
treatment = rep(c("a","b"), each = 6)
df = data.frame(replicate, treatment)
df %>%
filter(replicate != 1 | treatment != "a")
replicate treatment
1 2 a
2 3 a
3 2 a
4 3 a
5 1 b
6 2 b
7 3 b
8 1 b
9 2 b
10 3 b
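The reason "or" is correct here is De Morgan's law: "not (replicate is 1 and treatment is a)" is the same as "replicate is not 1, or treatment is not a". A small base R sketch (the names keep_or and keep_not are mine) shows that the two spellings select the same rows:

```r
df <- data.frame(replicate = rep(1:3, times = 4),
                 treatment = rep(c("a", "b"), each = 6))

# Version 1: "or" of the two negated conditions
keep_or <- subset(df, replicate != 1 | treatment != "a")

# Version 2: negate the combined condition that describes the rows to drop
keep_not <- subset(df, !(replicate == 1 & treatment == "a"))

identical(keep_or, keep_not)
nrow(keep_or)
```

The second form often reads closer to the intent ("drop the rows where both are true") and can be used in the same way inside dplyr's filter.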

R - Match column by other columns in same data frame

In the following dataframe I want to create a new column called D2 that matches the corresponding A, B, or C column. For example, if D == A, I want D2 == A2.
A A2 B B2 C C2 D
1 10 2 90 3 9 1
1 11 2 99 3 15 1
1 42 2 2 3 9 2
1 5 2 54 3 235 2
1 13 2 20 3 10 3
1 6 2 1 3 4 3
This is what I want the new data frame to look like:
A A2 B B2 C C2 D D2
1 10 2 90 3 9 1 10
1 11 2 99 3 15 1 11
1 42 2 2 3 9 2 2
1 5 2 54 3 235 2 54
1 13 2 20 3 10 3 10
1 6 2 1 3 4 3 4
I have succeeded in doing this with ifelse statements using dplyr, but because I am doing this with many columns, it gets tedious after a while. I was wondering if there is a more clever way to accomplish the same task.
library(dplyr)
newdata <- olddata %>% mutate(D2=ifelse(D==A,A2,ifelse(D==B,B2,C2)))
We can do this efficiently with max.col from base R. Subset 'olddata' to only the 'A', 'B', 'C' columns ('d1') and check where it is equal to 'D' (after replicating 'D' to match the number of columns). Then use max.col to find the index of the maximum element in each row (i.e. TRUE in this case, assuming there is a single TRUE value per row), and multiply it by 2, because the 'A2', 'B2', 'C2' columns alternate with 'A', 'B', 'C'. Finally, cbind this with the row sequence to create a row/column index, and extract the elements based on that to create the 'D2' column.
d1 <- olddata[c("A", "B", "C")]
olddata$D2 <- olddata[cbind(1:nrow(d1), max.col(d1 == rep(olddata["D"],
ncol(d1)), "first")*2)]
olddata$D2
#[1] 10 11 2 54 10 4
A slightly different approach is to compare the columns separately in a loop using lapply (this should be efficient if the dataset is very big, as converting to a big logical matrix can cost memory) and, based on that, subset the corresponding A2, B2, C2 columns with mapply:
i1 <- grep("^[^D]", names(olddata)) #create an index for columns that are not D
i2 <- seq(1, ncol(olddata[i1]), by = 2)#for subsetting A, B, C
i3 <- seq(2, ncol(olddata[i1]), by = 2)# for subsetting A2, B2, C2
olddata$D2 <- c(mapply(`[`, olddata[i3], lapply(olddata[i2], `==`, olddata$D)))
olddata$D2
[1] 10 11 2 54 10 4
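To try the max.col idea end to end, here is a self-contained sketch that rebuilds the example data and looks up the value columns by name instead of relying on their positions. The names keys, vals, eq and hit are my own, and the NA guard for rows without any match is an addition that the answers above do not include.

```r
olddata <- data.frame(A = 1, A2 = c(10, 11, 42, 5, 13, 6),
                      B = 2, B2 = c(90, 99, 2, 54, 20, 1),
                      C = 3, C2 = c(9, 15, 9, 235, 10, 4),
                      D = c(1, 1, 2, 2, 3, 3))

keys <- c("A", "B", "C")   # columns compared against D
vals <- paste0(keys, "2")  # matching value columns: A2, B2, C2

# Logical matrix: TRUE where a key column equals D in that row
eq <- as.matrix(olddata[keys]) == olddata$D

# Column position of the first TRUE per row (eq * 1 converts to numeric)
hit <- max.col(eq * 1, ties.method = "first")

# Guard (my addition): rows where no key matches would otherwise get column 1
hit[rowSums(eq) == 0] <- NA

# Pick one cell per row from the value columns via a row/column index matrix
olddata$D2 <- as.matrix(olddata[vals])[cbind(seq_len(nrow(olddata)), hit)]
olddata$D2
```

Unlike the mapply version, this does not depend on the rows being ordered by D, because each row is looked up individually.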

Resources