Create subset of data using conditions from another data frame - r

I would like to use something like dplyr to create a subset of data from one data frame using conditions from another data frame. In one data frame (lsp) I have sea-level data with minimum and maximum years, and in another I have a time series of ocean dynamics. For each row in the lsp data frame, I would like to extract every year between the minimum and maximum ages from the dynamics data frame and create a subset of data. I think this will require a for loop. Does anyone have any idea if this is possible?
Desired output using row 1 of LSP as an example:
Row 1 LSP (simplified) is:
Age min  Age max
1997     2007
I want to use this information to create a data frame like this from the dynamics file:
Subset:
Year  Dynamics
1997    125
1998    109
1999    152
2000    161
2001    106
2002    120
2003     58
2004     68
2005    110
2006    144
2007    100
Many thanks
## LSP data
structure(list(Depth = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5,
8.5, 10.5, 13.5, 14.5, 18.5, 19.5, 27.5, 28.5, 32, 35.5, 40.5,
41.5), RSL = c(0.03, 0.03, 0.01, 0.01, -0.04, -0.01, -0.03, 0,
0.04, 0.03, 0, -0.01, -0.05, -0.07, -0.19, -0.24, -0.31, -0.31,
-0.27, -0.29), RSL_err_1sig = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
0.1), Age_mean = c(2001.754499, 1994.278776, 1987.678949, 1980.805889,
1973.270485, 1965.018421, 1957.442729, 1952.134369, 1949.031929,
1945.148184, 1939.132213, 1936.957531, 1927.311071, 1924.379033,
1897.26123, 1892.977317, 1876.1995, 1858.135589, 1825.967544,
1820.605298), Age.min = c(1996.752238, 1985.111654, 1977.483594,
1968.26211, 1961.886124, 1958.219318, 1947.496532, 1943.084044,
1941.761439, 1935.843414, 1923.952516, 1920.057048, 1906.228232,
1902.242998, 1875.327613, 1869.925103, 1834.992176, 1811.928966,
1784.998245, 1767.524866), Age.max = c(2006.75676, 2003.445898,
1997.874304, 1993.349668, 1984.654846, 1971.817524, 1967.388926,
1961.184694, 1956.302419, 1954.452954, 1954.31191, 1953.858014,
1948.39391, 1946.515068, 1919.194847, 1916.029531, 1917.406824,
1904.342212, 1866.936843, 1873.68573)), class = "data.frame", row.names = c(NA,
-20L))
## Dynamics (only head)
structure(list(Year = 1815:1820, dynamics = c(-76.01893261, -64.50519732,
-66.06270761, -76.22822397, -72.35960029, -77.34157443)), row.names = c(NA,
6L), class = "data.frame")

Here is a base R option with Map and subset -
Map(function(x, y) subset(dynamics, Year >= x & Year <= y),
LSP$Age.min, LSP$Age.max)
The same logic can be implemented using tidyverse functions as well.
library(dplyr)
library(purrr)
map2(LSP$Age.min, LSP$Age.max, ~dynamics %>% filter(Year >= .x & Year <= .y))
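Both versions return a list with one data frame per LSP row. If a single combined data frame is preferred, a small follow-up sketch (the lsp_row identifier below is just the list position, not a column from the original data):
library(dplyr)
library(purrr)
subsets <- map2(LSP$Age.min, LSP$Age.max,
                ~ dynamics %>% filter(Year >= .x & Year <= .y))
# stack the list into one data frame, tagging each piece with its LSP row number
bind_rows(subsets, .id = "lsp_row")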

As long as your dataset isn't huge, I would take something like the following approach.
Add the (nested) dynamics dataset to each row of your lsp dataset
Unnest the dynamics dataset to get one row per year
Filter out years that aren't relevant
(Optional)
Renest the dynamics columns so you have one row per lsp record, with a tibble of all relevant years from the dynamics set.
library(dplyr)
library(tidyr)
library(tibble)
lsp %>%
  add_column(dynamics %>% nest(data = everything())) %>%
  unnest(data) %>%
  filter(Year >= Age.min & Year <= Age.max) %>%
  nest(filtered = c(Year, dynamics))
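If dplyr >= 1.1.0 is available, a non-equi join gives the same kind of result without nesting; a minimal sketch, using the column names from the dput above:
library(dplyr)
# keep every dynamics Year that falls inside each row's [Age.min, Age.max] window;
# the result has one row per (lsp row, matching year) combination
inner_join(lsp, dynamics, by = join_by(Age.min <= Year, Age.max >= Year))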

I guess this does what you want to do. First, assign names to your input data so the code below is easier to follow.
lsp <- structure(list(Depth = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5,
8.5, 10.5, 13.5, 14.5, 18.5, 19.5, 27.5, 28.5, 32, 35.5, 40.5,
41.5), RSL = c(0.03, 0.03, 0.01, 0.01, -0.04, -0.01, -0.03, 0,
0.04, 0.03, 0, -0.01, -0.05, -0.07, -0.19, -0.24, -0.31, -0.31,
-0.27, -0.29), RSL_err_1sig = c(0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1,
0.1), Age_mean = c(2001.754499, 1994.278776, 1987.678949, 1980.805889,
1973.270485, 1965.018421, 1957.442729, 1952.134369, 1949.031929,
1945.148184, 1939.132213, 1936.957531, 1927.311071, 1924.379033,
1897.26123, 1892.977317, 1876.1995, 1858.135589, 1825.967544,
1820.605298), Age.min = c(1996.752238, 1985.111654, 1977.483594,
1968.26211, 1961.886124, 1958.219318, 1947.496532, 1943.084044,
1941.761439, 1935.843414, 1923.952516, 1920.057048, 1906.228232,
1902.242998, 1875.327613, 1869.925103, 1834.992176, 1811.928966,
1784.998245, 1767.524866), Age.max = c(2006.75676, 2003.445898,
1997.874304, 1993.349668, 1984.654846, 1971.817524, 1967.388926,
1961.184694, 1956.302419, 1954.452954, 1954.31191, 1953.858014,
1948.39391, 1946.515068, 1919.194847, 1916.029531, 1917.406824,
1904.342212, 1866.936843, 1873.68573)), class = "data.frame", row.names = c(NA,
-20L))
dynamics <- structure(list(Year = 1815:1820, dynamics = c(-76.01893261, -64.50519732,
-66.06270761, -76.22822397, -72.35960029, -77.34157443)), row.names = c(NA,
6L), class = "data.frame")
Then the actual code to get the subset:
# first get the years covered by each row of the "lsp" dataset,
# following the example in the comments
year_min <- list()
year_max <- list()
all_years <- list()
for (i in 1:nrow(lsp)) {
  year_min[[i]] <- round(lsp$Age.min[[i]])
  year_max[[i]] <- round(lsp$Age.max[[i]])
  all_years[[i]] <- data.frame(Year = year_min[[i]]:year_max[[i]])
}
# now join the "Year" info from "lsp" with the "dynamics" data to get each subset;
# an inner join keeps only the years present in both data frames
library(dplyr)
subset_output <- list()
for (i in 1:length(all_years)) {
  subset_output[[i]] <- inner_join(all_years[[i]], dynamics, by = "Year")
}

R: How do I sort a dataframe based on a numeric vector?

I want to calculate the fold change between the thyroid and testes data frames using TPM values and report the top 10 genes overexpressed in testes tissue (testes$gene_id in the testes data frame).
In my code below, I first calculate the fold change and store it as a numeric vector tpm.foldchange, but I don't know how to sort the gene_id column of the testes data frame based on the sorted fold-change values in tpm.foldchange.
# Parse the gene results file from the testes and thyroid output
thyroid <- read.table("thyroid.genes.results", header=T, sep="\t")
testes <- read.table("testes.genes.results", header=T, sep="\t")
# Extract the TPM values
# Add one to each value and log them (base 2)
library(tidyverse)
thyroid.tpm <- log(thyroid %>% pull(TPM) + 1)
testes.tpm <- log(testes %>% pull(TPM) + 1)
# Pearson's correlation coefficient between thyroid and testes using TPM
cor(thyroid.tpm, testes.tpm, method="pearson")
# Calculate fold change between the testes and thyroid tissue TPM values and provide top 10 genes that are overexpressed in testes
library(gtools)
tpm.foldchange <- foldchange(testes.tpm, thyroid.tpm)
#tpm.df <- merge(testes.tpm, tpm.foldchange)
tpm.sorted <- sort(tpm.foldchange, decreasing=T)
tpm.sortedgenes <- testes[order(factor(testes$TPM, levels=tpm.sorted)),]
tpm.top10genes <- head(tpm.sortedgenes, 10)
testes[order(factor(testes$TPM, levels=tpm.sorted)),]
I initially wanted to sort after merging like this:
tpm.df <- merge(testes.tpm, tpm.foldchange)
tpm.sorted <- sort(tpm.df$tpm.foldchange, decreasing=T)
but it raised an error:
Error: cannot allocate vector of size 8.0 Gb
thyroid dataframe:
# Show only the first 20 rows, first column, and 6th column of thyroid dataframe
dput(thyroid[1:20, c(1,6)])
structure(list(gene_id = c("gene0_DDX11L1", "gene1_WASH7P", "gene100_C1orf233",
"gene1000_ZC3H12A", "gene10000_CD86", "gene10001_CASR", "gene10003_CSTA",
"gene10004_CCDC58", "gene10005_FAM162A", "gene10006_WDR5B", "gene10007_LOC102723582",
"gene10008_KPNA1", "gene1001_MIR6732", "gene10010_PARP9", "gene10011_DTX3L",
"gene10012_PARP15", "gene10015_PARP14", "gene10016_HSPBAP1",
"gene10017_DIRC2", "gene10018_LOC100129550"), TPM = c(0, 45.96,
2.72, 2.4, 1.67, 5.14, 4.33, 47.68, 81.1, 10.12, 0.96, 45.21,
0, 19.63, 15.06, 0.49, 21.76, 12.16, 19.37, 5.3)), row.names = c(NA,
20L), class = "data.frame")
testes dataframe:
# Show only the first 20 rows, first column, and 6th column of testes dataframe
dput(testes[1:20, c(1,6)])
structure(list(gene_id = c("gene0_DDX11L1", "gene1_WASH7P", "gene100_C1orf233",
"gene1000_ZC3H12A", "gene10000_CD86", "gene10001_CASR", "gene10003_CSTA",
"gene10004_CCDC58", "gene10005_FAM162A", "gene10006_WDR5B", "gene10007_LOC102723582",
"gene10008_KPNA1", "gene1001_MIR6732", "gene10010_PARP9", "gene10011_DTX3L",
"gene10012_PARP15", "gene10015_PARP14", "gene10016_HSPBAP1",
"gene10017_DIRC2", "gene10018_LOC100129550"), TPM = c(2.33, 47.56,
9.45, 2.03, 3.09, 0.11, 3.73, 28.52, 120.65, 6.89, 1.38, 30.89,
0, 20.39, 13.66, 0.59, 9.62, 22.04, 7.42, 2.53)), row.names = c(NA,
20L), class = "data.frame")
Based on Akrun's comment, I've attempted:
library(gtools)
tpm.foldchange <- foldchange(thyroid.tpm, testes.tpm)
testes.sorted <- testes %>%
left_join(thyroid, by="gene_id") %>%
mutate(TPM=testes.tpm, tpm.foldchange, .keep="unused") %>%
slice_max(n=10, order_by=tpm.foldchange)
Output:
> dim(testes.sorted)
[1] 304 15
> dput(testes.sorted[1:10,])
structure(list(gene_id = c("gene10075_LOC101927056", "gene10311_A4GNT",
"gene10394_SLC9A9-AS1", "gene10504_SUCNR1", "gene10511_TMEM14E",
"gene10798_LOC102724550", "gene10990_FLJ42393", "gene11054_DPPA2P3",
"gene11065_GP5", "gene11400_USP17L12"), transcript_id.s..x = c("rna28860_NR_125396.1,rna28861_NR_125395.1",
"rna29540_NM_016161.2", "rna29785_NR_048544.1", "rna30020_NM_033050.4",
"rna30060_NM_001123228.1", "rna30716_NR_110826.1", "rna31241_NR_024413.1",
"rna31390_NR_027764.1", "rna31430_NM_004488.2", "rna32519_NM_001256853.1"
), length.x = c(659, 1771, 518, 1650, 1293, 2957, 2266, 1146,
3493, 1593), effective_length.x = c(413.57, 1525.5, 272.62, 1404.5,
1047.5, 2711.5, 2020.5, 900.5, 3247.5, 1347.5), expected_count.x = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0.12), TPM.x = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0), FPKM.x = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0), transcript_id.s..y = c("rna28860_NR_125396.1,rna28861_NR_125395.1",
"rna29540_NM_016161.2", "rna29785_NR_048544.1", "rna30020_NM_033050.4",
"rna30060_NM_001123228.1", "rna30716_NR_110826.1", "rna31241_NR_024413.1",
"rna31390_NR_027764.1", "rna31430_NM_004488.2", "rna32519_NM_001256853.1"
), length.y = c(796, 1771, 518, 1650, 1293, 2957, 2266, 1146,
3493, 1593), effective_length.y = c(535.05, 1510.04, 257.15,
1389.04, 1032.04, 2696.04, 2005.04, 885.04, 3232.04, 1332.04),
expected_count.y = c(9, 3, 2, 233, 2, 2, 36, 2, 35, 1.91),
TPM.y = c(0.58, 0.07, 0.27, 5.8, 0.07, 0.03, 0.62, 0.08,
0.37, 0.05), FPKM.y = c(0.29, 0.03, 0.14, 2.94, 0.03, 0.01,
0.31, 0.04, 0.19, 0.03), TPM = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0), tpm.foldchange = c(Inf, Inf, Inf, Inf, Inf, Inf, Inf,
Inf, Inf, Inf)), row.names = c(NA, 10L), class = "data.frame")
This code returns a dataframe with (304, 15) dimensions. But I'm only looking for the top ten genes. Also, please note that thyroid.tpm is the log2-transformed TPM values.
If we want to order by the fold change, do a join first and arrange based on the fold change between the 'TPM' columns:
library(dplyr)
library(gtools)
testes2 <- testes %>%
  left_join(thyroid, by = 'gene_id') %>%
  mutate(across(starts_with("TPM"), ~ log(.x + 1),
                .names = "tpm_{.col}")) %>%
  mutate(foldchange = foldchange(tpm_TPM.x, tpm_TPM.y)) %>%
  filter(is.finite(foldchange)) %>%
  arrange(tpm_TPM.x) %>%
  dplyr::select(gene_id, TPM = TPM.x, foldchange) %>%
  slice_head(n = 10)
If we want to select top 10 foldchange rows, use slice_max
testes %>%
  left_join(thyroid, by = 'gene_id') %>%
  mutate(TPM = TPM.x, foldchange = foldchange(log(TPM.x + 1), log(TPM.y + 1)),
         .keep = "unused") %>%
  filter(is.finite(foldchange)) %>%
  slice_max(n = 10, order_by = foldchange, with_ties = FALSE)
Output:
gene_id TPM foldchange
1 gene100_C1orf233 9.45 1.786222
2 gene10000_CD86 3.09 1.434249
3 gene10007_LOC102723582 1.38 1.288517
4 gene10016_HSPBAP1 22.04 1.217311
5 gene10012_PARP15 0.59 1.162893
6 gene10005_FAM162A 120.65 1.089205
7 gene10010_PARP9 20.39 1.011953
8 gene1_WASH7P 47.56 1.008704
9 gene10011_DTX3L 13.66 -1.033968
10 gene10003_CSTA 3.73 -1.076854
The merge() results in a memory error because it was done on two vectors, creating a Cartesian join.
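A tiny illustration of why that merge() blows up (not the original data, just two short vectors): with no shared column names, merge() falls back to a cross join of every pairwise combination.
# 3 x 4 = 12 rows here; with two long TPM vectors the result has
# length(x) * length(y) rows, hence the 8 Gb allocation error
merge(1:3, 1:4)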

Split data by difference between rows values

I'm looking to split the dataset into chunks of 10 days. The step between days is not always 1: it can be 2, as in the case of -149 -> -147.
Is there a smarter way than testing the difference between days every time and recording begin and end indexes for the split?
df = structure(list(day = c(-155, -153, -152, -151, -150, -149, -147,
-146, -145, -144, -143, -142, -141, -140, -139, -138, -137, -135,
-134, -131), margin = c(0.02, 0.03, 0.065, 0.06, 0.07, 0.05,
0.035, 0.06, 0.0266666666666667, 0.03, 0.04, 0.06, 0.0366666666666667,
0.035, 0.09, 0.12, 0.045, 0.04, 0.02, 0.06)), row.names = c(NA,
-20L), class = c("tbl_df", "tbl", "data.frame"))
An option is to get the diff of adjacent elements of the 'day' column and take the cumulative sum (cumsum) to create a diff column. Use that column with %/% to build a grouping index that increments at every 10, then pass that column to group_split to split the data into a list of data.frames.
library(dplyr)
df %>%
  mutate(diff = cumsum(c(0, diff(day))),
         diff = pmax(0, (diff - 1)) %/% 10) %>%
  group_split(diff, .keep = FALSE)
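The same grouping idea works in base R if dplyr is not wanted; a minimal sketch using split():
# elapsed days since the first observation, bucketed into 10-day groups
grp <- pmax(0, cumsum(c(0, diff(df$day))) - 1) %/% 10
split(df, grp)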

Combining pheatmaps in R

I've been trying, with no success, to work out how 2 or more pheatmaps (heatmaps) can be combined into a final plot.
data1 <- structure(list(DC1 = c(NA, NA, 1.98), DC2 = c(NA, NA, 0.14),
DC3 = c(1.85, 1.51, 0.52), DC4 = c(0.89, 0.7, 1.47), DC5 = c(0,
0.78, 0), DC6 = c(0, 1.3, 0), DC7 = c(0, 1.47, 0), DC8 = c(0,
1.2, 0), DC9 = c(0, 0, 0), DC10 = c(0.51, 1.9, 0)), .Names = c("DC1",
"DC2", "DC3", "DC4", "DC5", "DC6", "DC7", "DC8", "DC9", "DC10"),
class = "data.frame", row.names = c("A", "B", "C"))
data2 <- structure(list(DC1 = c(9.56, 1.87, 2.07, 1.87, 2.07, 1.35), DC2 = c(5.51, 1.13, 1.25, 1.13, 0.99, 0.45), DC3 = c(4.84, 1.17, 0.66, 1.17,
0.34, 0.16), DC4 = c(4.18, 0.59, 0.05, 0.97, 0.43, 0.59), DC5 = c(3.26,
0, 0.14, 0.31, 0.79, 0.63), DC6 = c(3.35, 0, 1.12, 0.05, 1.12,
0), DC7 = c(4.18, 0.63, 1.27, 0.47, 1.27, 0), DC8 = c(4.37, 1.17,
1.3, 1.17, 0, 0), DC9 = c(4.3, 1.13, 0, 1.13, 0, 0), DC10 = c(7.47,
1.88, 0.71, 1.88, 0, 0)), .Names = c("DC1", "DC2", "DC3", "DC4",
"DC5", "DC6", "DC7", "DC8", "DC9", "DC10"), class = "data.frame", row.names = c("TD6 vs SH",
"TD6 vs SAP", "TD6 vs NEA", "SH vs SAP", "SH vs NEA", "SAP vs NEA"
))
I can easily construct each heatmap with pheatmap using these two calls:
hm_data1 <- pheatmap(as.matrix(data1))
hm_data2 <- pheatmap(as.matrix(data2))
However, I cannot get both printed in one figure. I would like to see them side by side horizontally. My real figure will be composed of 16 pheatmaps, so they must be arranged in 4 columns and 4 rows.
I tried par(mfrow = ...) with no success.
How can I combine pheatmaps?
I know there are plenty of R packages that can plot heatmaps, but I would like to do it with pheatmap
This will work.
library(gridExtra); library(pheatmap)
m <- matrix(c(1:4), ncol=2)
n <- matrix(c(1,1,1,2), ncol=2)
# [[4]] extracts the gtable component of the pheatmap object
a <- list(pheatmap(m)[[4]])
a[[2]] <- pheatmap(n)[[4]]
z <- do.call(grid.arrange,a)
plot(z)
Based on one of the comments: if you have many single plots, you can use a loop like this.
mn <- list(m, n)
a <- list()
for(i in 1:length(mn)){
a[i] <- list(pheatmap(mn[[i]])[[4]])
}
z <- do.call(grid.arrange,a)
plot(z)
The point is to add all the data for your single plots to a list. You can then loop over the list, applying pheatmap.
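For the 16-heatmap figure mentioned in the question, the same idea scales by handing the list of gtables to grid.arrange with ncol and nrow; a minimal sketch, where mats is a hypothetical list of 16 matrices:
library(gridExtra)
library(pheatmap)
# `mats` is assumed to be a list of 16 matrices (not defined in the question);
# silent = TRUE stops pheatmap from drawing each heatmap on its own page
plots <- lapply(mats, function(m) pheatmap(m, silent = TRUE)[[4]])
grid.arrange(grobs = plots, ncol = 4, nrow = 4)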

How to store the output into a list of matrices

Data:
x <- seq(0, 1, len = 1024)
pos <- c(0.1, 0.13, 0.15, 0.23, 0.25, 0.40, 0.44, 0.65, 0.76, 0.78, 0.81)
hgt <- c(4, 5, 3, 4, 5, 4.2, 2.1, 4.3, 3.1, 5.1, 4.2)
wdt <- c(0.005, 0.005, 0.006, 0.01, 0.01, 0.03, 0.01, 0.01, 0.005, 0.008, 0.005)
pSignal <- numeric(length(x))
for (i in seq(along=pos)) {
pSignal <- pSignal + hgt[i]/(1 + abs((x - pos[i])/wdt[i]))^4
}
df = as.data.frame(rbind(pSignal,pSignal,pSignal))
dflist=list(df,df,df)
I'm trying to run the pracma package's findpeaks() function to find the local maxima of each row of each data.frame in the list dflist. The output is an N x 4 array, where N is the number of peaks. So if it finds 4 peaks in the first row of the first data.frame, the result will be a 4x4 matrix. My goal is to loop this function over every row of each data.frame and store the output matrix in a list.
My code:
## Find Peaks
pks=list()
for (i in 1:length(dflist)){
for (j in 1:length(dflist[[i]])){
row = dflist[[i]][j,]
temppks = findpeaks(as.vector(row,mode='numeric')
,minpeakheight = 1.1,nups=2)
pks[i][[j]]=rbind(pks,temppks)
}
}
This doesn't seem to be doing quite what I want it to. Any ideas?
A combination of apply() and sapply() could do the work:
library(pracma)
my.f.row <- function(row) findpeaks(as.vector(row, mode = 'numeric'), minpeakheight = 1.1, nups = 2)
sapply(dflist, function(df.i) apply(df.i, 1, my.f.row))
Eventually you may have to reorganize the result.
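If the nested-loop structure from the question is preferred, a minimal corrected sketch (looping over rows rather than columns, and indexing the list of lists with double brackets):
library(pracma)
pks <- vector("list", length(dflist))
for (i in seq_along(dflist)) {
  pks[[i]] <- vector("list", nrow(dflist[[i]]))
  for (j in seq_len(nrow(dflist[[i]]))) {
    # one N x 4 peak matrix per row, stored per data.frame
    pks[[i]][[j]] <- findpeaks(as.numeric(dflist[[i]][j, ]),
                               minpeakheight = 1.1, nups = 2)
  }
}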

for loop to find threshold values between different data frames

I have 2 data frames with some matching columns (pollutants).
The first data frame contains the observations, while the second one contains different thresholds for some pollutants.
Here is a small subset of both data frames:
dput(df1)
structure(list(sample = structure(27:76, .Label = c("A_1", "A_2",
"A_LS", "A_PC", "A_PM", "B_1", "B1_1", "B1_2", "B1-8_PC", "B1-8_PM",
"B1_LS", "B1_PC", "B1_PM", "B_2", "B2_1", "B2_2", "B2-8_PC",
"B2-8_PM", "B2_LS", "B2_PC", "B2_PM", "B_LS", "B_PC", "B_PM",
"C_1", "C_2", "C386", "C387", "C388", "C389", "C390", "C391",
"C392", "C393", "C394", "C395", "C396", "C397", "C398", "C399",
"C400", "C401", "C402", "C403", "C404", "C405", "C406", "C407",
"C408", "C409", "C410", "C411", "C412", "C413", "C414", "C415",
"C416", "C417", "C418", "C419", "C420", "C421", "C422", "C423",
"C424", "C425", "C426", "C427", "C428", "C429", "C430", "C431",
"C432", "C433", "C434", "C435", "C436", "C437", "C438", "C439",
"C440", "C441", "C442", "C443", "C444", "C445", "C446", "C447",
"C448", "C449", "C450", "C451", "C452", "C453", "C454", "C455",
"C456", "C457", "C458", "C459", "C460", "C461", "C462", "C463",
"C464", "C465", "C466", "C467", "C468", "C469", "C470", "C471",
"C472", "C473", "C474", "C475", "C476", "C477", "C478", "C479",
"C480", "C481", "C482", "C483", "C484", "C485", "C486", "C487",
"C488", "C489", "C490", "C491", "C492", "C493", "C494", "C495",
"C496", "C497", "C498", "C499", "C500", "C501", "C502", "C503",
"C504", "C505", "C506", "C507", "C508", "C509", "C510", "C511",
"C512", "C513", "C514", "C515", "C516", "C517", "C518", "C519",
"C520", "C521", "C522", "C523", "C524", "C-8_PC", "C-8_PM", "D_1",
"D_2", "E_1", "E_2", "F_1", "F_2"), class = "factor"), As = c(9,
8.75, 13.5, 7.75, 7.6, 8.33, 8, 8.75, 7.4, 8.25, 8.17, 7.75,
7.6, 7.5, 7.2, 8, 7.83, 7.75, 7, 7.5, 8.17, 8.75, 6.67, 7, 5.83,
6.75, 5.6, 6.4, 6.2, 6.2, 6.2, 6.25, 7, 6, 6, 6.4, 6, 5.8, 5.6,
6, 5.8, 7.25, 8.8, 8.5, 8, 8.25, 8.25, 8.5, 8.25, 8.25), Al = c(30245,
38060, 36280, 24355, 27776, 35190, 38733.8, 36400, 29624, 33699.75,
32163.33, 30645.75, 31373, 26647.5, 19987.6, 32210, 27158, 24220.25,
18598.5, 23081.75, 29393, 26800.5, 22581.67, 29290, 29651.67,
20947.5, 19762.6, 23815, 32784.8, 20696.2, 26880.6, 25087.75,
19497.2, 21794, 32232, 24253.4, 20034, 21270, 22510, 15170.25,
8956.6, 21612.25, 35828, 30006.25, 27128.75, 25835, 31118.75,
35614.5, 37440.25, 33736.75), Hg = c(0.25, 0.35, 0.48, 1.03,
1.12, 0.2, 1.14, 0.4, 2, 0.48, 0.85, 0.18, 0.76, 0.4, 0.48, 0.35,
0.32, 0.33, 0.4, 0.13, 0.15, 0.13, 0.87, 0.12, 0.03, 0.33, 0.2,
0.22, 0.04, 0.16, 0.1, 0.18, 0.11, 0.08, 0.03, 0.06, 0.06, 0.1,
0.03, 0.07, 0.03, 0.1, 0.08, 0.11, 0.1, 0.13, 0.08, 0.12, 0.07,
0.09)), .Names = c("sample", "As", "Al", "Hg"), row.names = c(NA,
50L), class = "data.frame")
and
dput(df2)
structure(list(As = c(25L, 32L), Hg = c(0.4, 0.8), Cr = c(100L,
360L), Element = structure(c(1L, 3L), .Label = c("LCB", "LCB_pelite",
"LCL"), class = "factor")), .Names = c("As", "Hg", "Cr", "Element"
), row.names = c(NA, -2L), class = "data.frame")
The original data frames are bigger, but this subset gives the idea.
What I want now is to put into a 3rd data frame the values of each element of the first df that exceed the threshold values contained in the second df.
Be aware that there are 2 different threshold values for each element in df2, and that df2 has some elements not matched in df1 (for example Cr).
I've tried to write a for loop but I was able to do that just for 1 element at a time:
for (i in df2$As) {
print(length(which(df1$As > i)))
}
I've also tried to use nested for loops, but without success.
I'm pretty sure this does not look good, but I think it works. I added some extra lines to match only the elements found in both data frames, which in this case is only 1. It might need some changes for your full data:
df1.2 <- rbind(df1, df1) #Duplicate the df1 to compare to each threshold value
df1.2 <- df1.2[order(df1.2$sample),] #Order by sample again
cols2 <- na.omit(match(colnames(df1), colnames(df2)))[[1]] #Get the columns of df2 which are in df1
cols1 <- na.omit(match(colnames(df2), colnames(df1)))[[1]] #Get the columns of df1 which are in df2
df2.2 <- df2[rep(1:2, nrow(df1)),cols2] #Replicates df2 the number of times to allow matching the thresholds to each sample, once for each threshold
exceeds <- df1.2[,cols1]>df2.2 #Make the comparions and return a boolean
sum(exceeds) #You will need colSums() for more than one column
With your sample data it's also not clear from the answer which elements it refers to, but this shouldn't happen if more than one element matches and your result is a matrix.
Maybe there's a more elegant way without replicating the dataframes and having to worry about number of element matches.
df3 <- data.frame(Pollutant = "Z", LCB = 0, LCL = 0, stringsAsFactors = FALSE)
for (p in names(df1)[-1]) {
  if (p %in% names(df2)[1:(length(df2) - 1)]) {
    df3 <- rbind(df3, c(p, sum(df1[p] > df2[[p]][1]), sum(df1[p] > df2[[p]][2])))
  }
}
df3 <- df3[-1, ]
df3
Update:
Ah, each new row is rbound as a character vector. To finish up:
str(df3)
df3$LCB=as.numeric(df3$LCB)
df3$LCL=as.numeric(df3$LCL)
str(df3)
How about this?
foo <- function(x, y) {
sapply(x, function(i) sum(y>i))
}
cols = c("As", "Hg")
mapply(foo, df2[cols], df1[cols])
# As Hg
# [1,] 0 10
# [2,] 0 6
Convert this to a data.frame if necessary.
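If the threshold labels from df2$Element should end up on the result, one option is to set them as row names before converting; a small sketch building on the mapply() answer above:
cols <- intersect(names(df2), names(df1))   # pollutants present in both data frames
res <- mapply(foo, df2[cols], df1[cols])
rownames(res) <- as.character(df2$Element)  # the LCB / LCL threshold rows
as.data.frame(res)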
