Multiply values in a dataset by values in another dataset in R

I have two datasets which both share a common ID variable, and which also share n variables denoted SNP1-SNPn. An example of the two datasets is shown below.
Dataset 1
ID SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7
1 0 1 1 0 0 0 0
2 1 1 0 0 0 0 0
3 1 0 0 0 1 1 0
4 0 1 1 0 0 0 0
5 1 0 0 0 1 1 0
6 1 0 0 0 1 1 0
7 0 1 1 0 0 0 0
Dataset 2
ID SNP1 SNP2 SNP3 SNP4 SNP5 SNP6 SNP7
1 0.65 1.3 2.8 0.43 0.62 0.9 1.5
2 0.74 1.6 3.4 0.9 2.4 4.4 2.3
3 0.28 0.5 5.7 6.7 0.3 2.5 0.56
4 0.74 1.6 3.4 0.9 2.4 4.4 2.3
5 0.65 1.3 2.8 0.43 0.62 0.9 1.5
6 0.74 1.6 3.4 0.9 2.4 4.4 2.3
7 0.28 0.5 5.7 6.7 0.3 2.5 0.56
I would like to multiply each value at a given position in dataframe 1 by the value at the equivalent position in dataframe 2.
For example, I would like to multiply position [1,2] in dataset 1 (value = 0) by position [1,2] in dataset 2 (value = 0.65). My dataset is very large and spans almost 300 columns and 500,000 IDs.
The variable names for SNP1-n are longer in reality (for example, they actually read Affx.5869593), so I cannot just write SNP1-300 in my code; the columns would have to be specified by number.
Do I need to unlist both datasets by person ID and SNP name first? What function can be used to multiply values across two datasets?

I am assuming that you are trying to return a third data frame which has, in each position, the product of the values at that position in the two input data frames.
For example, if the following are your two dataframes
df1 <- structure(list(ID = c(1, 2, 3, 4, 5), SNP1a = c(0, 1, 1, 0, 1
), SNP2a = c(1, 1, 0, 1, 0)), class = "data.frame", row.names = c(NA,
-5L))
ID SNP1a SNP2a
1 0 1
2 1 1
3 1 0
4 0 1
5 1 0
df2 <- structure(list(ID = c(1, 2, 3, 4, 5), SNP1b = c(0.65, 0.74,
0.28, 0.74, 0.65), SNP2b = c(1.3, 1.6, 0.5, 1.6, 1.3)), class =
"data.frame", row.names = c(NA, -5L))
ID SNP1b SNP2b
1 0.65 1.3
2 0.74 1.6
3 0.28 0.5
4 0.74 1.6
5 0.65 1.3
Then
df3 <- df1[,2:3] * df2[,2:3]
SNP1a SNP2a
1 0.00 1.3
2 0.74 1.6
3 0.28 0.0
4 0.00 1.6
5 0.65 0.0
will work (as long as the two data frames are the same size).
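Since the real data has almost 300 SNP columns, the same idea can be written without naming any columns, by dropping the ID column and multiplying everything that remains. A minimal sketch (the toy data here is made up; it assumes both data frames contain the same IDs and the same SNP columns in the same order):

```r
# Align both data frames on ID, then multiply all SNP columns at once.
df1 <- data.frame(ID = c(2, 1, 3), SNP1 = c(1, 0, 1), SNP2 = c(1, 1, 0))
df2 <- data.frame(ID = c(1, 2, 3), SNP1 = c(0.65, 0.74, 0.28), SNP2 = c(1.3, 1.6, 0.5))

df1 <- df1[order(df1$ID), ]          # sort so rows line up by ID
df2 <- df2[order(df2$ID), ]
df3 <- df1[, -1] * df2[, -1]         # elementwise product, ID column dropped
df3 <- cbind(ID = df1$ID, df3)       # reattach the IDs
```

The same code works unchanged for 300 columns and 500,000 rows, because `[, -1]` selects every column except the first regardless of how the columns are named.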

If your data frames have an identical set of IDs and are the same size, you can sort both by ID and do this:
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  snp1 = c(0, 0, 1, 0, 0),
  snp2 = c(1, 1, 1, 0, 1)
)
df2 <- data.frame(
  id = c(1, 2, 3, 4, 5),   # note: `=`, not `<-`, inside data.frame()
  snp1 = c(0.3, 0.2, 0.3, 0.1, 0.2),
  snp2 = c(0.5, 0.8, 0.2, 0.3, 0.3)
)
res <- as.data.frame(mapply(`*`, df[, -1], df2[, -1]))  # mapply returns a matrix
res$id <- df$id

How to find the first column with a certain value for each row with dplyr

I have a dataset like this:
df <- data.frame(id=c(1:4), time_1=c(1, 0.9, 0.2, 0), time_2=c(0.1, 0.4, 0, 0.9), time_3=c(0,0.5,0.3,1.0))
id time_1 time_2 time_3
1 1.0 0.1 0
2 0.9 0.4 0.5
3 0.2 0 0.3
4 0 0.9 1.0
And I want to identify for each row, the first column containing a 0, and extract the corresponding number (as the last element of colname), obtaining this:
id time_1 time_2 time_3 count
1 1.0 0.1 0 3
2 0.9 0.4 0.5 NA
3 0.2 0 0.3 2
4 0 0.9 1.0 1
Do you have a tidyverse solution?
We may use max.col
v1 <- max.col(df[-1] == 0, "first")
v1[rowSums(df[-1] == 0) == 0] <- NA
df$count <- v1
-output
> df
id time_1 time_2 time_3 count
1 1 1.0 0.1 0.0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0.0 0.3 2
4 4 0.0 0.9 1.0 1
Or using dplyr: use if_any to check whether any of the 'time' columns is 0 for each row; if so, return the index of the 'first' 0 value with max.col inside case_when (pick is from the development version and can be replaced with across).
library(dplyr)
df %>%
mutate(count = case_when(if_any(starts_with("time"), ~ .x == 0) ~
max.col(pick(starts_with("time")) == 0, "first")))
-output
id time_1 time_2 time_3 count
1 1 1.0 0.1 0.0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0.0 0.3 2
4 4 0.0 0.9 1.0 1
You can do this:
df <- df %>%
rowwise() %>%
mutate(count = which(c_across(starts_with("time")) == 0)[1])
df
id time_1 time_2 time_3 count
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 0.1 0 3
2 2 0.9 0.4 0.5 NA
3 3 0.2 0 0.3 2
4 4 0 0.9 1 1

Difficulty getting frequency/proportion output using prop.table and merge for numeric and categorical data

I have data on plant species cover at site and plot level which looks like this:
SITE PLOT SPECIES AREA
1 1 A 0.3
1 1 B 25.5
1 1 C 1.0
1 2 A 0.3
1 2 C 0.3
1 2 D 0.3
2 1 B 17.9
2 1 C 131.2
2 2 A 37.3
2 2 C 0.3
2 3 A 5.3
2 3 D 0.3
I have successfully used the following code to obtain percentage values for species at various sites:
dfnew <- merge(df1, prop.table(xtabs(AREA ~ SPECIES + SITE, df1), 2)*100)
I am now trying to find the relative proportion of each species within each plot (as a proportion of all species in the plot), with a desired output like the one below:
SITE PLOT SPECIES AREA Plot-freq
1 1 A 0.3 1.06
1 1 B 25.5 95.39
1 1 C 1.0 3.56
1 2 A 0.3 33.33
1 2 C 0.3 33.33
1 2 D 0.3 33.33
2 1 B 17.9 12.02
2 1 C 131.2 87.98
2 2 A 37.3 99.25
2 2 C 0.3 0.75
2 3 A 5.3 94.94
2 3 D 0.3 5.06
I tried adding the PLOT variable to the original code, but ended up with tiny values:
a <- merge(df1, prop.table(xtabs(AREA ~ SPECIES + PLOT + SITE, woods2), 2)*100)
I have been looking at similar questions, but most of those don't have similar data and none of the solutions seem to work for me. Any help much appreciated.
data
> dput(df1)
structure(list(SITE = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
PLOT = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3), SPECIES = c("A",
"B", "C", "A", "C", "D", "B", "C", "A", "C", "A", "D"), AREA = c(0.3,
25.5, 1, 0.3, 0.3, 0.3, 17.9, 131.2, 37.3, 0.3, 5.3, 0.3)), class = "data.frame", row.names = c(NA,
-12L))
I'm not sure I completely understand your calculation, but I believe you can do this:
library(dplyr)
df1 %>% group_by(SITE, PLOT) %>% mutate(Plot_freq = AREA/sum(AREA))
Output:
SITE PLOT SPECIES AREA Plot_freq
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 1 A 0.3 0.0112
2 1 1 B 25.5 0.951
3 1 1 C 1 0.0373
4 1 2 A 0.3 0.333
5 1 2 C 0.3 0.333
6 1 2 D 0.3 0.333
7 2 1 B 17.9 0.120
8 2 1 C 131. 0.880
9 2 2 A 37.3 0.992
10 2 2 C 0.3 0.00798
11 2 3 A 5.3 0.946
12 2 3 D 0.3 0.0536
Very interesting idea to merge with the prop.table output! I also had no luck modifying that approach, though.
However, if you want to avoid dplyr, you can use ave to calculate the plot sums and then pipe (|>) the result further to calculate the relative areas, like so:
transform(df1, Psum=ave(AREA, SITE, PLOT, FUN=sum)) |> transform(Plot_freq=AREA/Psum*100)
# SITE PLOT SPECIES AREA Psum Plot_freq
# 1 1 1 A 0.3 26.8 1.1194030
# 2 1 1 B 25.5 26.8 95.1492537
# 3 1 1 C 1.0 26.8 3.7313433
# 4 1 2 A 0.3 0.9 33.3333333
# 5 1 2 C 0.3 0.9 33.3333333
# 6 1 2 D 0.3 0.9 33.3333333
# 7 2 1 B 17.9 149.1 12.0053655
# 8 2 1 C 131.2 149.1 87.9946345
# 9 2 2 A 37.3 37.6 99.2021277
# 10 2 2 C 0.3 37.6 0.7978723
# 11 2 3 A 5.3 5.6 94.6428571
# 12 2 3 D 0.3 5.6 5.3571429
Note: R >= 4.1 used.
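If the intermediate Psum column is not needed, the same per-plot percentage can be computed in a single step with ave (a base-R sketch using the question's own data; no pipe is involved, so it also runs on R < 4.1):

```r
# df1 as posted in the question
df1 <- structure(list(SITE = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),
    PLOT = c(1, 1, 1, 2, 2, 2, 1, 1, 2, 2, 3, 3),
    SPECIES = c("A", "B", "C", "A", "C", "D", "B", "C", "A", "C", "A", "D"),
    AREA = c(0.3, 25.5, 1, 0.3, 0.3, 0.3, 17.9, 131.2, 37.3, 0.3, 5.3, 0.3)),
    class = "data.frame", row.names = c(NA, -12L))

# each AREA divided by the total AREA of its SITE/PLOT group, times 100
df1$Plot_freq <- with(df1, AREA / ave(AREA, SITE, PLOT, FUN = sum) * 100)
```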

Create proportion matrix for multivariate categorical data

Suppose I've got this data simulated from the below R code:
library(RNGforGPD)
set.seed(1)
sample.size = 10; no.gpois = 3
lambda.vec = c(-0.2, 0.2, -0.3); theta.vec = c(1, 3, 4)
M = c(0.352, 0.265, 0.342); N = diag(3); N[lower.tri(N)] = M
TV = N + t(N); diag(TV) = 1
cstar = CmatStarGpois(TV, theta.vec, lambda.vec, verbose = TRUE)
data = GenMVGpois(sample.size, no.gpois, cstar, theta.vec, lambda.vec, details = FALSE)
> prop.table(table(data[,1]))
0 1 2
0.3 0.4 0.3
> prop.table(table(data[,2]))
2 3 6 8 10
0.2 0.4 0.1 0.2 0.1
> prop.table(table(data[,3]))
2 3 4 5 6
0.2 0.3 0.1 0.3 0.1
> table(data)
data
0 1 2 3 4 5 6 8 10
3 4 7 7 1 3 2 2 1
I'd like to create a proportion matrix for each of the three categorical variables. If a category does not occur in a specific column, its proportion should be shown as 0.
Cat X1 X2 X3
0 0.3 0.0 0.0
1 0.4 0.0 0.0
2 0.3 0.2 0.2
3 0.0 0.4 0.3
4 0.0 0.0 0.1
5 0.0 0.0 0.3
6 0.0 0.1 0.1
8 0.0 0.2 0.0
10 0.0 0.1 0.0
This is the data-object:
dput(data)
structure(c(1, 0, 2, 1, 0, 0, 1, 2, 2, 1, 3, 8, 3, 3, 2, 2, 6,
3, 10, 8, 2, 5, 2, 6, 3, 3, 4, 3, 5, 5), .Dim = c(10L, 3L), .Dimnames = list(
NULL, NULL))
I have tried to put the logic at the appropriate points in the code sequence.
props <- data.frame(Cat = sort(unique(c(data))) ) # Just the Cat column
#Now fill in the entries
# the entries will be obtained with table function
apply(data, 2, table) # run `table(.)` over the columns individually
[[1]]
0 1 2 # these are actually character valued names
3 4 3 # while these are the count values
[[2]]
2 3 6 8 10
2 4 1 2 1
[[3]]
2 3 4 5 6
2 3 1 3 1
Now iterate over that list to fill in values that match the Cat column:
props2 <- cbind(props, # using dfrm first argument returns dataframe object
lapply( apply(data, 2, table) , # irregular results are a list
function(col) { # first make a named vector of zeros
x <- setNames(rep(0,length(props$Cat)), props$Cat)
# could have skipped that step by using `tabulate`
# then fill with values using names as indices
x[names(col)] <- col # values to matching names
x}) )
props2
#-------------
Cat V1 V2 V3
0 0 3 0 0
1 1 4 0 0
2 2 3 2 2
3 3 0 4 3
4 4 0 0 1
5 5 0 0 3
6 6 0 1 1
8 8 0 2 0
10 10 0 1 0
#---
# now just "proportionalize" those counts
props2[2:4] <- prop.table(data.matrix(props2[2:4]), margin=2)
props2
#-------------
Cat V1 V2 V3
0 0 0.3 0.0 0.0
1 1 0.4 0.0 0.0
2 2 0.3 0.2 0.2
3 3 0.0 0.4 0.3
4 4 0.0 0.0 0.1
5 5 0.0 0.0 0.3
6 6 0.0 0.1 0.1
8 8 0.0 0.2 0.0
10 10 0.0 0.1 0.0
library(dplyr)
library(tidyr)
colnames(data) <- c("X1", "X2", "X3")
as_tibble(data) %>%
pivot_longer(cols = "X1":"X3", values_to = "Cat") %>%
group_by(name, Cat) %>%
count() %>%
ungroup(Cat) %>%
summarize(name, Cat, proportion = n / sum(n)) %>%
pivot_wider(names_from = name, values_from = proportion) %>%
arrange(Cat) %>%
replace(is.na(.), 0)
# A tibble: 9 × 4
Cat X1 X2 X3
<dbl> <dbl> <dbl> <dbl>
1 0 0.3 0 0
2 1 0.4 0 0
3 2 0.3 0.2 0.2
4 3 0 0.4 0.3
5 4 0 0 0.1
6 5 0 0 0.3
7 6 0 0.1 0.1
8 8 0 0.2 0
9 10 0 0.1 0
If you would like it as a matrix, you can use as.matrix()
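For example (a small sketch; `props` here stands in for the first rows of the result above, since the full tibble is not reproduced):

```r
# a stand-in for the proportion table produced above
props <- data.frame(Cat = c(0, 1, 2),
                    X1 = c(0.3, 0.4, 0.3),
                    X2 = c(0.0, 0.0, 0.2),
                    X3 = c(0.0, 0.0, 0.2))

m <- as.matrix(props[, -1])   # drop Cat, keep the numeric proportions
rownames(m) <- props$Cat      # carry the categories over as row names
```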

Create a new variable based on three variables in R

I want to create a new variable called N1 based on three existing variables (resp, exp.1, exp.2) in R.
df <- data.frame(
resp = c(1, 2, 4, 3, 5, 7 ),
exp.1 = c(0, 0.24, 1, 1.5, 0, 0.4),
exp.2 = c(1, 1, 0, 0, 0.3, 0.2)
)
df
resp exp.1 exp.2
1 1 0.00 1.0
2 2 0.24 1.0
3 4 1.00 0.0
4 3 1.50 0.0
5 5 0.00 0.3
6 7 0.40 0.2
I want to make a new variable N1 like this:
when resp > 4, take the value from exp.1
when resp < 4, take the value from exp.2
when resp == 4, make it a missing value (NA).
The desired outcome is:
resp exp.1 exp.2 N1
1 1 0 1 1
2 2 0.24 1 1
3 4 1 0 NA
4 3 1.5 0 0
5 5 0 0.3 0
6 7 0.4 0.2 0.4
I tried my best using mutate() or car::recode() but it does not work. Any clues?
Using case_when():
library(dplyr)
df %>%
mutate(N1 = case_when(
resp>4 ~ exp.1,
resp<4 ~ exp.2,
resp == 4 ~ NA_real_
))
resp exp.1 exp.2 N1
1 1 0.00 1.0 1.0
2 2 0.24 1.0 1.0
3 4 1.00 0.0 NA
4 3 1.50 0.0 0.0
5 5 0.00 0.3 0.0
6 7 0.40 0.2 0.4
Edit: Using case_when(), as given in the solution above, might be better.
library(dplyr)
# Data
df <- data.frame(
resp = c(1, 2, 4, 3, 5, 7 ),
exp.1 = c(0, 0.24, 1, 1.5, 0, 0.4),
exp.2 = c(1, 1, 0, 0, 0.3, 0.2)
)
df %>%
rowwise() %>%
mutate(N1 = if (resp >4) {
exp.1
} else if (resp <4) {
exp.2
} else if (resp ==4) {
NA
} else {
NA
}
)
## A tibble: 6 x 4
## Rowwise:
# resp exp.1 exp.2 N1
# <dbl> <dbl> <dbl> <dbl>
#1 1 0 1 1
#2 2 0.24 1 1
#3 4 1 0 NA
#4 3 1.5 0 0
#5 5 0 0.3 0
#6 7 0.4 0.2 0.4
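For completeness, the same logic can also be written in vectorized base R with nested ifelse() calls, avoiding rowwise() entirely (a sketch using the question's data):

```r
df <- data.frame(
  resp  = c(1, 2, 4, 3, 5, 7),
  exp.1 = c(0, 0.24, 1, 1.5, 0, 0.4),
  exp.2 = c(1, 1, 0, 0, 0.3, 0.2)
)

# resp > 4 takes exp.1, resp < 4 takes exp.2, resp == 4 falls through to NA
df$N1 <- ifelse(df$resp > 4, df$exp.1,
         ifelse(df$resp < 4, df$exp.2, NA))
```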

R merge.data.frame: probabilistic result for ambiguous keys

Data and context first: The data in question is
set.seed(123)
df1 <- data.frame(A = rep(1, 4), B = c(2, 6, 4, 4), D = c(0.1, 0.2, 0.3, 0.4))
df2 <- data.frame(A = rep(1, 4), C = c(2, 4, 6, 4), D = c(0.5, 0.6, 0.7, 0.8))
so we have
> df1
A B D
1 1 2 0.1
2 1 6 0.2
3 1 4 0.3
4 1 4 0.4
and
> df2
A C D
1 1 2 0.5
2 1 4 0.6
3 1 6 0.7
4 1 4 0.8
Now, when doing
merge(df1, df2, by.x = c("A", "B"), by.y = c("A", "C"))
one gets
A B D.x D.y
1 1 2 0.1 0.5
2 1 4 0.3 0.6
3 1 4 0.3 0.8
4 1 4 0.4 0.6
5 1 4 0.4 0.8
6 1 6 0.2 0.7
because of ambiguous combinations of (A,B) and (A,C) values.
The actual question: How could one solve this by randomly distributing the D.x and D.y to the (A,B), e.g. to get equally likely
A B D.x D.y
1 1 2 0.1 0.5
2 1 4 0.3 0.6
5 1 4 0.4 0.8
6 1 6 0.2 0.7
and
A B D.x D.y
1 1 2 0.1 0.5
3 1 4 0.3 0.8
4 1 4 0.4 0.6
6 1 6 0.2 0.7
as a result of the merge?
With the use of the data.table package, you could do it as follows:
library(data.table)
DT <- dt1[dt2, on = c(A="A", B="C")][, .(i.D = sample(i.D,1)), by = .(A, B, D)]
which gives two possible results (run the code from above several times to see the different results):
> DT
A B D i.D
1: 1 2 0.1 0.5
2: 1 4 0.3 0.6
3: 1 4 0.4 0.8
4: 1 6 0.2 0.7
or:
> DT
A B D i.D
1: 1 2 0.1 0.5
2: 1 4 0.3 0.8
3: 1 4 0.4 0.6
4: 1 6 0.2 0.7
Although this simple solution works, it will be less efficient (especially with regard to memory use). A more memory efficient solution which leads to the same result is:
dt1[, indx := 1:.N, keyby = .(A, B)]
dt2[, indx := if(.N > 1L) sample(.N) else 1L, keyby = .(A, C)]
dt1[dt2, on = c(A = "A", B = "C", indx = "indx")]
By creating an index in both datasets and sampling that index for the second dataset, you can join on it. This avoids a Cartesian join, in which all possible combinations would be generated first.
Used data:
dt1 <- data.table(A = rep(1, 4), B = c(2, 6, 4, 4), D = c(0.1, 0.2, 0.3, 0.4))
dt2 <- data.table(A = rep(1, 4), C = c(2, 4, 6, 4), D = c(0.5, 0.6, 0.7, 0.8))
In base R you could do:
df12 <- merge(df1, df2, by.x = c("A", "B"), by.y = c("A", "C"))
aggregate( . ~ A + B + D.x, df12, sample, 1)
which gives me the following three results in three consecutive runs of the aggregate function:
# run 1
A B D.x D.y
1 1 2 0.1 0.5
2 1 6 0.2 0.7
3 1 4 0.3 0.6
4 1 4 0.4 0.8
# run 2
A B D.x D.y
1 1 2 0.1 0.5
2 1 6 0.2 0.7
3 1 4 0.3 0.8
4 1 4 0.4 0.8
# run 3
A B D.x D.y
1 1 2 0.1 0.5
2 1 6 0.2 0.7
3 1 4 0.3 0.8
4 1 4 0.4 0.6
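Note that the aggregate approach samples each group independently, so a D.y value can appear twice (as in run 2 above). If every D.y should be used exactly once, the data.table index trick can be sketched in base R by permuting df2's rows within each key group before binding (this assumes both frames contain the same multiset of keys, as in the question):

```r
df1 <- data.frame(A = rep(1, 4), B = c(2, 6, 4, 4), D = c(0.1, 0.2, 0.3, 0.4))
df2 <- data.frame(A = rep(1, 4), C = c(2, 4, 6, 4), D = c(0.5, 0.6, 0.7, 0.8))

df1s <- df1[order(df1$A, df1$B), ]   # sort both frames by their keys
df2s <- df2[order(df2$A, df2$C), ]

# permute row positions within each (A, C) group; leave singleton groups alone
shuffle <- function(i) if (length(i) > 1) sample(i) else i
df2s <- df2s[ave(seq_len(nrow(df2s)), df2s$A, df2s$C, FUN = shuffle), ]

# rows now line up one-to-one, so each D.y is paired exactly once
res <- cbind(df1s, D.y = df2s$D)     # df1's D keeps its original name here
```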
