Multiplying values in same position in R

I am working in R and I have two datasets. One dataset contains a contribution amount, and the other includes an include/exclude flag. Below are the data:
> contr_df
asof_dt X Y
1 2014-11-03 0.3 1.2
2 2014-11-04 -0.5 2.3
3 2014-11-05 1.2 0.4
> inex_flag
asof_dt X Y
1 2014-11-03 1 0
2 2014-11-04 1 1
3 2014-11-05 0 0
I would like to create a third dataset that shows one multiplied by the other. For example, I want to see the following:
2014-11-03 0.3 * 1 1.2*0
2014-11-04 -0.5*1 2.3*1
2014-11-05 1.2*0 0.4*0
So far the only way I've been able to accomplish this is with a for loop that iterates over the columns. However, this is complicated and inefficient. Is there an easier way to do this?

This does the multiplication, but doesn't make sense for factors:
df1 * df2
# asof_dt X Y
#1 NA 0.3 0.0
#2 NA -0.5 2.3
#3 NA 0.0 0.0
#Warning message:
#In Ops.factor(left, right) : '*' not meaningful for factors
One Option: You can cbind the first column and the multiplied values like this:
cbind(df1[1], df1[-1] * df2[-1])
# asof_dt X Y
#1 2014-11-03 0.3 0.0
#2 2014-11-04 -0.5 2.3
#3 2014-11-05 0.0 0.0
That is, you multiply df1 and df2 with the first column of each data frame dropped, then bind the first column of df1 (the dates) back onto the result.
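A minimal sketch of the same idea that picks the numeric columns by name rather than by position. The data frames are rebuilt from the question; the name num_cols and the alignment check are assumptions added for illustration, not part of the original answer.

```r
# Rebuild the two data frames from the question
contr_df <- data.frame(
  asof_dt = as.Date(c("2014-11-03", "2014-11-04", "2014-11-05")),
  X = c(0.3, -0.5, 1.2),
  Y = c(1.2, 2.3, 0.4)
)
inex_flag <- data.frame(
  asof_dt = contr_df$asof_dt,
  X = c(1, 1, 0),
  Y = c(0, 1, 0)
)

# Hypothetical helper: every column except the date column
num_cols <- setdiff(names(contr_df), "asof_dt")

# Element-wise multiplication assumes both frames list the same dates in the same order
stopifnot(identical(contr_df$asof_dt, inex_flag$asof_dt))

# Multiply the numeric columns and put the dates back in front
result <- cbind(contr_df["asof_dt"], contr_df[num_cols] * inex_flag[num_cols])
print(result)
```

Selecting by name keeps the code working if the date column is not the first one.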

The one-line answer is:
mapply(`*`, contr_df, inex_flag)
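This applies the multiplication function pairwise across the columns of the two data frames and returns a matrix. Every column has to be numeric, so drop a non-numeric column such as asof_dt first.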
d = data.frame(a=c(1,2,3), b=c(0,2,-1))
e = data.frame(a=c(.2, 2, -1), b=c(0, 2, -2))
mapply(`*`, d, e)
a b
[1,] 0.2 0
[2,] 4.0 4
[3,] -3.0 2
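A sketch of how mapply could be applied to the data from the question, assuming the date column is dropped before multiplying and re-attached afterwards. The names prod_mat and result are made up for this example.

```r
# Data from the question (dates kept as plain strings for simplicity)
contr_df <- data.frame(
  asof_dt = c("2014-11-03", "2014-11-04", "2014-11-05"),
  X = c(0.3, -0.5, 1.2),
  Y = c(1.2, 2.3, 0.4),
  stringsAsFactors = FALSE
)
inex_flag <- data.frame(
  asof_dt = contr_df$asof_dt,
  X = c(1, 1, 0),
  Y = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Multiply matching columns pairwise; [-1] drops the non-numeric date column
prod_mat <- mapply(`*`, contr_df[-1], inex_flag[-1])

# mapply simplifies to a matrix, so rebuild a data frame with the dates
result <- data.frame(asof_dt = contr_df$asof_dt, prod_mat)
print(result)
```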

Related

Remove groups based on multiple conditions in dplyr R

I have a data that looks like this
gene=c("A","A","A","A","B","B","B","B")
frequency=c(1,1,0.8,0.6,0.3,0.2,1,1)
time=c(1,2,3,4,1,2,3,4)
df <- data.frame(gene,frequency,time)
gene frequency time
1 A 1.0 1
2 A 1.0 2
3 A 0.8 3
4 A 0.6 4
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
I want to remove every gene group (here A or B) that has
frequency > 0.9 at time==1
In this case I want to remove A and my data to look like this
gene frequency time
1 B 0.3 1
2 B 0.2 2
3 B 1.0 3
4 B 1.0 4
Any hint or help is appreciated
We may use subset from base R: build a logical vector from the two conditions, extract the 'gene' values that satisfy it, use %in% to flag every row belonging to those genes, and negate (!) to keep the rows of the remaining genes. Alternatively, change the > to <= and remove the !, which gives the same result for this data.
subset(df, !gene %in% gene[frequency > 0.9 & time == 1])
-output
gene frequency time
5 B 0.3 1
6 B 0.2 2
7 B 1.0 3
8 B 1.0 4
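Since the question title mentions dplyr, here is a hedged sketch of a grouped filter that should give the same rows. It assumes dplyr is installed; the name kept is made up for this example.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  gene = c("A", "A", "A", "A", "B", "B", "B", "B"),
  frequency = c(1, 1, 0.8, 0.6, 0.3, 0.2, 1, 1),
  time = c(1, 2, 3, 4, 1, 2, 3, 4)
)

# Within each gene, any() is TRUE if some row has frequency > 0.9 at time 1;
# negating it drops the whole group
kept <- df %>%
  group_by(gene) %>%
  filter(!any(frequency > 0.9 & time == 1)) %>%
  ungroup()

print(kept)
```

Because the condition is collapsed with any(), filter() keeps or drops all rows of a gene together.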

Filter a list of data frames based on the first value in a column

I have some data, example here:
dat1 <- data.frame(a = c("5","10","15","20"), b = c("0.1","0.2","0.3","0.4"))
dat2 <- data.frame(a = c("15","20","25","30"), b = c("0.5","0.6","0.7","0.8"))
datalist <-list (dat1,dat2)
Giving me a format like this
[[1]]
a b
1 5 0.1
2 10 0.2
3 15 0.3
4 20 0.4
[[2]]
a b
1 15 0.5
2 20 0.6
3 25 0.7
4 30 0.8
I want to be able to filter the list of data frames with the condition that the first value of column a should be <= 10. So in this scenario the output would be just the first data frame [[1]], and the second data frame would be ignored entirely.
Desired output
[[1]]
a b
1 5 0.1
2 10 0.2
3 15 0.3
4 20 0.4
Any advice would be greatly appreciated!
Thanks
You can use sapply to get a vector of logicals indicating whether each element of a list meets a certain condition. This can then be applied to subset the list in the usual way with [. E.g:
datalist[sapply(datalist, function(x){as.numeric(x[[1,"a"]]) <= 10})]
will return only the first element in your example.
(Note the as.numeric is necessary because your numbers are stored as character strings here)
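A small variation on the sapply approach, sketched with vapply so that the predicate is guaranteed to return exactly one logical per data frame. The names keep_idx and filtered are assumptions for this example.

```r
# Data from the question (numbers stored as character strings)
dat1 <- data.frame(a = c("5", "10", "15", "20"), b = c("0.1", "0.2", "0.3", "0.4"))
dat2 <- data.frame(a = c("15", "20", "25", "30"), b = c("0.5", "0.6", "0.7", "0.8"))
datalist <- list(dat1, dat2)

# One TRUE/FALSE per list element: is the first value of column a <= 10?
# as.character() guards against the column being a factor in older R versions
keep_idx <- vapply(
  datalist,
  function(x) as.numeric(as.character(x$a[1])) <= 10,
  logical(1)
)

# Subset the list with the logical vector
filtered <- datalist[keep_idx]
print(filtered)
```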
We can also use the keep function from purrr. It takes a predicate function .p, applies it to every element of a list, and returns the elements for which the predicate evaluates to a single TRUE.
library(purrr)
datalist %>%
keep(~ .x[["a"]][1] %>% as.numeric() <= 10)
[[1]]
a b
1 5 0.1
2 10 0.2
3 15 0.3
4 20 0.4
We may use Filter from base R
Filter(\(x) as.numeric(x$a[1]) <= 10, datalist)
[[1]]
a b
1 5 0.1
2 10 0.2
3 15 0.3
4 20 0.4

How to do this in R

I have a dataset that looks like this:
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
head(sample.data)
groups A B position
1 1 1 3 2
2 2 3 2 1
3 3 2 4 2
4 4 4 1 1
5 5 2 5 2
6 6 5 2 1
The "position" column always alternates between 2 and 1. I want to do this calculation in R: starting from the first row, if it's in position 1, ignore it. If it starts at 2 (as in this example), then calculate as follows:
Take the first 2 values of column A that are at position 2, average them, then subtract the first value that is at position 1 (in this example: (1+2)/2 - 3 = -1.5). Then repeat the calculation for the next set of values, using the last position 2 value as the starting point, i.e. the next calculation would be (2+2)/2 - 4 = -2.
So basically, in this example, the calculations are done for the values of these sets of groups: 1-2-3, 3-4-5, 5-6-7, etc. (the last value of the previous is the first value of the next set of calculation)
Repeat the calculation until the end. Also do the same for column B.
Since I need the original data frame intact, put the newly calculated values in a new data frame(s), with columns dA and dB corresponding to the calculated values of column A and B, respectively (if not possible then they can be created as separated data frames, and I will extract them into one afterwards).
Desired output (from the example):
dA dB
1 -1.5 1.5
2 -2 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4
groups <- c(1:20)
A <- c(1,3,2,4,2,5,1,6,2,7,3,5,2,6,3,5,1,5,3,4)
B <- c(3,2,4,1,5,2,4,1,3,2,6,1,4,2,5,3,7,1,4,2)
position <- c(2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1)
sample.data <- data.frame(groups,A,B,position)
start <- match(2, sample.data$position)
twos <- seq(from = start, to = nrow(sample.data), by = 2)
df <-
  sapply(c("A", "B"), function(l) {
    sapply(twos, function(i) {
      mean(sample.data[c(i, i+2), l]) - sample.data[i+1, l]
    })
  })
df <- setNames(as.data.frame(df), c('dA', 'dB'))
As your values in position always alternate between 1 and 2, you can define an index of odd rows i1 and an index of even rows i2, and do your calculations:
## lead() comes from dplyr
library(dplyr)
## If the first row has position == 1, shift both indexes by one row
inc <- 0
if (sample.data$position[1] == 1) inc <- 1
i1 <- seq(1 + inc, nrow(sample.data), by = 2)  # rows at position 2
i2 <- seq(2 + inc, nrow(sample.data), by = 2)  # rows at position 1
res <- data.frame(
  dA = (lead(sample.data$A[i1]) + sample.data$A[i1]) / 2 - sample.data$A[i2],
  dB = (lead(sample.data$B[i1]) + sample.data$B[i1]) / 2 - sample.data$B[i2])
This returns:
dA dB
1 -1.5 1.5
2 -2.0 3.5
3 -3.5 2.5
4 -4.5 2.5
5 -4.5 2.5
6 -2.5 4.0
7 -3.5 2.5
8 -3.0 3.0
9 -3.0 4.5
10 NA NA
The last row returns NA, you can remove it if you need.
res=na.omit(res)
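For comparison, a base R sketch of the same calculation that needs no extra package and only evaluates complete 2-1-2 triples, so no NA row is produced. The helper name window_diff is hypothetical, and the code assumes position strictly alternates between 2 and 1 as stated in the question.

```r
# Data from the question
A <- c(1, 3, 2, 4, 2, 5, 1, 6, 2, 7, 3, 5, 2, 6, 3, 5, 1, 5, 3, 4)
B <- c(3, 2, 4, 1, 5, 2, 4, 1, 3, 2, 6, 1, 4, 2, 5, 3, 7, 1, 4, 2)
position <- rep(c(2, 1), times = 10)
sample.data <- data.frame(groups = 1:20, A, B, position)

# Hypothetical helper: average of two consecutive position-2 values
# minus the position-1 value that sits between them
window_diff <- function(x, pos) {
  twos <- which(pos == 2)
  twos <- twos[twos + 2 <= length(x)]  # keep only complete triples
  (x[twos] + x[twos + 2]) / 2 - x[twos + 1]
}

res <- data.frame(
  dA = window_diff(sample.data$A, sample.data$position),
  dB = window_diff(sample.data$B, sample.data$position)
)
print(res)
```

The first row reproduces the worked example from the question: (1+2)/2 - 3 = -1.5 for A.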

apply distribution to new sample set

I have a data frame dfu that holds, for each id (each id belongs to one team, and a team has many ids), the percentage of samples in which properties prop1, prop2, and so on were observed in past studies. It serves as a reference table for future studies. A new experiment now gives a new set of ids in dfi. I need the percentage of samples where prop1, prop2, and so on are observed per team, using the reference data in dfu. This could be done by counting the occurrences of each id in dfi and then taking a weighted average grouped by team. Not all ids in dfu may be present in dfi, and dfi may contain ids that are not in dfu. Ids missing from dfu can be excluded from the weighted average, since no per-property values are available for them.
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
> dfu
id team prop1 prop2
1 A 0.8 0.2
2 B 0.9 0.3
3 C 0.6 0.3
4 A 0.5 0.2
5 A 0.8 0.2
6 C 0.9 0.3
>
> dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
> dfi
id
2
3
2
1
4
3
7
The output format would be like below. For example the value for prop1 for group A would be (0.8*1 + 0.5*1)/2 = 0.65.
team prop1 prop2
A
B
C
I would prefer a base R approach, but other approaches are welcome. The number of columns could be large.
I don't know exactly how to do it with base R.
With data.table it should be pretty easy.
Let's convert your data.frames into data.tables.
library(data.table)
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"), prop1=c(0.8,0.9,0.6,0.5,0.8,0.9), prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
dfi <- data.table(dfi)
dfu <- data.table(dfu)
Then merge them like
dfu[dfi,on="id"]
## > dfu[dfi,on="id"]
## id team prop1 prop2
## 1: 2 B 0.9 0.3
## 2: 3 C 0.6 0.3
## 3: 2 B 0.9 0.3
## 4: 1 A 0.8 0.2
## 5: 4 A 0.5 0.2
## 6: 3 C 0.6 0.3
## 7: 7 NA NA NA
Then we just have to take the mean by group. In fact, we can do it in one line:
dfu[dfi,on="id"][,mean(prop1),team]
## > dfu[dfi,on="id"][,mean(prop1),team]
## team V1
## 1: B 0.90
## 2: C 0.60
## 3: A 0.65
## 4: NA NA
You can achieve the same thing in base R by merging the data.frame and using the function aggregate I guess.
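A hedged extension of the one-liner above, sketched for the case of many property columns: it drops unmatched ids during the join and averages every column whose name starts with "prop". The name prop_cols and the "^prop" naming pattern are assumptions based on the example data.

```r
library(data.table)

# Data from the question
dfu <- data.table(
  id = 1:6,
  team = c("A", "B", "C", "A", "A", "C"),
  prop1 = c(0.8, 0.9, 0.6, 0.5, 0.8, 0.9),
  prop2 = c(0.2, 0.3, 0.3, 0.2, 0.2, 0.3)
)
dfi <- data.table(id = c(2, 3, 2, 1, 4, 3, 7))

# Assumed naming convention: all property columns start with "prop"
prop_cols <- grep("^prop", names(dfu), value = TRUE)

# nomatch = 0L turns the join into an inner join, so id 7 (absent from dfu) is dropped;
# repeated ids in dfi yield repeated rows, which weights the mean by occurrence count
res <- dfu[dfi, on = "id", nomatch = 0L][
  , lapply(.SD, mean), by = team, .SDcols = prop_cols
]
print(res)
```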
Taking a cue from @DJJ's answer:
dfu <- data.frame(id=1:6, team=c('A',"B","C","A","A","C"),
prop1=c(0.8,0.9,0.6,0.5,0.8,0.9),
prop2=c(0.2,0.3,.3,.2,.2,.3))
dfi <- data.frame(id=c(2 , 3 , 2 , 1 , 4 , 3 , 7))
Merge by id
> dfx <- merge(dfi, dfu, by="id")
> dfx
id team prop1 prop2
1 1 A 0.8 0.2
2 2 B 0.9 0.3
3 2 B 0.9 0.3
4 3 C 0.6 0.3
5 3 C 0.6 0.3
6 4 A 0.5 0.2
Aggregate prop1 and prop2 by team with mean
> aggregate(cbind(prop1, prop2) ~ team, dfx, mean)
team prop1 prop2
1 A 0.65 0.2
2 B 0.90 0.3
3 C 0.60 0.3
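The question describes the calculation as counting occurrences per id and then taking a weighted average per team. A base R sketch of that explicit formulation follows; the names counts, m, prop_cols and res are made up here, and the "^prop" pattern is an assumption about how the property columns are named.

```r
# Data from the question
dfu <- data.frame(
  id = 1:6,
  team = c("A", "B", "C", "A", "A", "C"),
  prop1 = c(0.8, 0.9, 0.6, 0.5, 0.8, 0.9),
  prop2 = c(0.2, 0.3, 0.3, 0.2, 0.2, 0.3)
)
dfi <- data.frame(id = c(2, 3, 2, 1, 4, 3, 7))

# Count how often each id occurs in the new sample
counts <- as.data.frame(table(id = dfi$id), responseName = "n")
counts$id <- as.integer(as.character(counts$id))

# Inner join: ids missing from either side (5, 6 and 7) drop out
m <- merge(dfu, counts, by = "id")

# Weighted mean of every property column, one row per team
prop_cols <- grep("^prop", names(m), value = TRUE)
res <- do.call(rbind, lapply(split(m, m$team), function(g) {
  data.frame(team = g$team[1], lapply(g[prop_cols], weighted.mean, w = g$n))
}))
print(res)
```

For team A this evaluates (0.8*1 + 0.5*1)/2 = 0.65 for prop1, matching the example in the question.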

Multiplying column value by another value depending on value in certain column R

In the following dataset, I would like to multiply the value in column Size by the value in column Month1, Month2 or Month3, depending on the number in column Month. So if the Month value in a given row is 2, I would like to multiply the value in column Size by the value in column Month2 and save the result in a new column NewSize. Many thanks in advance for your help!
Orig = c("A","B","A","A","B","A","A","B","A")
Dest = c("B","A","C","B","A","C","B","A","C")
Month = c(1,1,1,2,2,2,3,3,3)
Size = c(30,20,10,10,20,20,30,50,20)
Month1 = c(1,0.2,0,1,0.2,0,1,0.2,0)
Month2 = c(0.6,1,0,0.6,1,0,0.6,1,0)
Month3 = c(0,1,0.6,0,1,0.6,0,1,0.6)
df <- data.frame(Orig,Dest,Month,Size,Month1,Month2,Month3)
df
Orig Dest Month Size Month1 Month2 Month3
1 A B 1 30 1.0 0.6 0.0
2 B A 1 20 0.2 1.0 1.0
3 A C 1 10 0.0 0.0 0.6
4 A B 2 10 1.0 0.6 0.0
5 B A 2 20 0.2 1.0 1.0
6 A C 2 20 0.0 0.0 0.6
7 A B 3 30 1.0 0.6 0.0
8 B A 3 50 0.2 1.0 1.0
9 A C 3 20 0.0 0.0 0.6
Here's one alternative using ifelse
> transform(df, NewSize=ifelse(Month==1, Size*Month1,
ifelse(Month==2, Size*Month2, Size*Month3)))
Orig Dest Month Size Month1 Month2 Month3 NewSize
1 A B 1 30 1.0 0.6 0.0 30
2 B A 1 20 0.2 1.0 1.0 4
3 A C 1 10 0.0 0.0 0.6 0
4 A B 2 10 1.0 0.6 0.0 6
5 B A 2 20 0.2 1.0 1.0 20
6 A C 2 20 0.0 0.0 0.6 0
7 A B 3 30 1.0 0.6 0.0 0
8 B A 3 50 0.2 1.0 1.0 50
9 A C 3 20 0.0 0.0 0.6 12
In base R, fully vectorized:
df$Size*df[,5:7][cbind(1:nrow(df),df$Month)]
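To spell out what the one-liner does, here is a sketch of the same matrix-indexing idea with the month columns selected by name rather than by position 5:7. The names month_cols and idx are assumptions for this example.

```r
# Data from the question
df <- data.frame(
  Orig = c("A", "B", "A", "A", "B", "A", "A", "B", "A"),
  Dest = c("B", "A", "C", "B", "A", "C", "B", "A", "C"),
  Month = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Size = c(30, 20, 10, 10, 20, 20, 30, 50, 20),
  Month1 = c(1, 0.2, 0, 1, 0.2, 0, 1, 0.2, 0),
  Month2 = c(0.6, 1, 0, 0.6, 1, 0, 0.6, 1, 0),
  Month3 = c(0, 1, 0.6, 0, 1, 0.6, 0, 1, 0.6)
)

month_cols <- paste0("Month", 1:3)

# Two-column index matrix: row i picks the element (i, Month[i])
idx <- cbind(seq_len(nrow(df)), df$Month)

# Indexing a matrix with a two-column matrix returns one value per index row
df$NewSize <- df$Size * as.matrix(df[month_cols])[idx]
print(df$NewSize)
```

Because each index row names its own (row, column) pair, this does not depend on df being sorted by Month.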
Here's how I'd handle this using data.table.
require(data.table)
setkey(setDT(df),
Month)[.(mon = 1:3), ## i
NewSize := Size * get(paste0("Month", mon)), ## j
by=.EACHI] ## by
setDT converts df from data.frame to data.table by reference.
setkey reorders that data.table by the column specified, Month, in increasing order, and marks that column as key column, on which we'll perform a join.
We perform a join on the key column set in the previous step with the values 1:3. This can also be interpreted as a subset operation that extracts all rows matching 1, 2 and 3 from the key column Month.
So, for each value of 1:3, we calculate the matching rows in i. And on those matching rows, we compute NewSize by extracting Size and MonthX for those matching rows, and multiplying them. We use get() to achieve extracting the right MonthX column.
by=.EACHI as the name implies, executes the expression in j for each i. As an example, i=1 matches (or joins) to rows 1:3 of df. For those rows, the j-expression extracts Size = 30,20,10 and Month1 = 1.0, 0.2, 0.0, and it gets evaluated to return 30, 4, 0. And then for i=2 and so on..
Hope this helps a bit even if you're looking for a dplyr only answer.
You can use apply:
apply(df, 1, function(u) as.numeric(u[paste0('Month', u['Month'])])*as.numeric(u['Size']))
#[1] 30 4 0 6 20 0 0 50 12
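Or a vectorized solution (a logical matrix index returns values column by column, so the result lines up with the rows only because df is sorted by Month):
bool = matrix(rep(df$Month, each=3)==rep(1:3, nrow(df)), byrow=TRUE, ncol=3)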
df[c('Month1', 'Month2', 'Month3')][bool] * df$Size
#[1] 30 4 0 6 20 0 0 50 12
