Find the 3 nearest neighbours (dist()?) and calculate mean in new column - r

This is a sample of the data
structure(list(Season = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("2018/2019",
"2019/2020"), class = "factor"), Date2 = structure(c(17860, 17888,
17916, 17940, 17945, 17952, 17953, 17954, 17978, 17999, 18005,
18188, 18209, 18223, 18237, 18320, 18322, 18334, 18447, 18476
), class = "Date"), HT.av.points = c(0.57, 1.5, 1.67, 1.8, 1.09,
2.18, 1.42, 1.45, 1.79, 1.35, 1.14, 1.83, 2, 1.17, 1.88, 1.83,
1.33, 0.92, 1.31, 1.06), AT.av.points = c(1.14, 2.33, 0.56, 1.2,
1.09, 1.6, 1.08, 1.9, 1.17, 0.9, 1.38, 0.67, 2.14, 1.33, 0.62,
1.08, 2.17, 1.38, 0.56, 0.94), HT_av.PointsTotal = c(0.86, 1.16,
1.18, 1.23, 0.86, 1.86, 1.2, 1.18, 1.5, 1.1, 1.07, 1.46, 1.6,
1.08, 1.75, 1.4, 1.16, 0.92, 1.03, 0.97), AT_av.PointsTotal = c(2.07,
2.21, 0.76, 1.42, 1.59, 1.5, 1.2, 1.91, 1.65, 1.43, 1.38, 0.54,
1.87, 1.58, 0.8, 1.6, 2.32, 1.42, 1.12, 1.32), DIFF.AV.POINTS.PREDICTION = c(-0.28,
-0.43, 0.51, 0.52, -0.36, 0.56, 0.28, -0.38, -0.2, 0.03, -0.43,
1.24, -0.32, -0.29, 1.44, 0.28, -0.85, -0.38, 1.01, 0.22), Over2.5G = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1)), row.names = c(NA,
-20L), class = c("data.table",
"data.frame"), .Names = c("Season", "Date2", "HT.av.points",
"AT.av.points", "HT_av.PointsTotal", "AT_av.PointsTotal", "DIFF.AV.POINTS.PREDICTION",
"Over2.5G"))
What I want to do:
Group by Season.
After the group by, I want to find the 3 previous rows that are most similar to the current row, according to the following columns:
(HT.av.points, AT.av.points, HT_av.PointsTotal, AT_av.PointsTotal, DIFF.AV.POINTS.PREDICTION)
I guess the dist() function is a possibility.
Finally, I want to create a new column with the mean of the Over2.5G values of those 3 most similar rows.
New column:
First 3 rows (of each Season): NA.
In the fourth row (of each Season), the 3 nearest neighbours (and their Over2.5G values) will always be the first 3 rows.

Breaking the code below up, the approach is:
a helper function that returns the row indices of the nearest neighbours, using a ready-made function, e.g. get.knn from package FNN
calling this function on increasingly large slices (from row one to the current row) of the input data df and storing the result as an extra column
extracting the row indices as integers from the result string, to index the desired column of the input data for the aggregation (mean, in your case)
Here we go:
library(FNN)     ## for get.knn()
library(dplyr)

## helper function returns row indices of the nearest 3 neighbours
## as a comma-separated string
find_nearest_predecessors <- function(df, ...){
  ifelse(nrow(df) < 4,  ## can't calculate n neighbours for < n rows:
         paste(1:3, collapse = ','),
         ## otherwise, i.e. if there are sufficient rows,
         ## get the row indices of the 3 nearest neighbours:
         get.knn(data = df,
                 k = 3,
                 algorithm = 'CR'
                 ) %>%
           .[['nn.index']] %>%
           tail(1) %>%
           paste(collapse = ',')
         )
}
## df being your input data:
df %>%
mutate(rownum = row_number()) %>%
rowwise %>%
mutate(nearest_neighbours = find_nearest_predecessors(
df = ## use previous data up to current row:
slice(df, 1:rownum) %>%
## choose features/dimensions of distance:
select(HT.av.points, AT.av.points, HT_av.PointsTotal,
AT_av.PointsTotal, DIFF.AV.POINTS.PREDICTION)
),
## calculate mean of OVER2.5G
mean_Over2.5G = mean(df$Over2.5G[
strsplit(nearest_neighbours,',') %>%
unlist %>% as.integer
], na.rm = TRUE)
)
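Since the question mentions dist() and grouping by Season, here is an alternative sketch in base R plus dplyr: compute the pairwise distances within each Season with dist(), restrict each row to its predecessors, and average Over2.5G over the 3 closest ones. knn_mean is just an illustrative helper name, and the first-k-rows-NA rule follows the question; treat this as a sketch rather than a drop-in replacement.
library(dplyr)

## mean of 'outcome' over the k nearest *previous* rows, distances taken from 'features'
knn_mean <- function(features, outcome, k = 3) {
  d <- as.matrix(dist(features))           # pairwise Euclidean distances between rows
  sapply(seq_len(nrow(features)), function(i) {
    if (i <= k) return(NA_real_)           # first k rows of the Season: NA
    prev <- seq_len(i - 1)                 # only rows before the current one
    nn   <- prev[order(d[i, prev])][1:k]   # indices of the k nearest predecessors
    mean(outcome[nn], na.rm = TRUE)
  })
}

df %>%
  group_by(Season) %>%
  mutate(mean_Over2.5G = knn_mean(
    data.frame(HT.av.points, AT.av.points, HT_av.PointsTotal,
               AT_av.PointsTotal, DIFF.AV.POINTS.PREDICTION),
    Over2.5G
  )) %>%
  ungroup()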

Related

Melt/ reshape dataframe to combine columns and fill rows with NAs

Apologies, I know there is a wealth of information on this site about melting and reshaping data; however, I cannot find the answer to my question on any of the pages I've visited. I have a data set which looks something like:
A Year | A Mean Temp | A Max Temp | A Min Temp | B Year | B Mean Temp | B Max Temp | B Min Temp |
and I want to end up with
Year | A Mean Temp | A Max Temp | A Min Temp |B Mean Temp | B Max Temp | B Min Temp
and fill columns which don't have data for that specific year with 'NA'.
The desired output would be something like:
[Table][1]
I believe the answer lies somewhere in something like:
library(dplyr)
library(tidyr)
library(stringr)
Data %>%
  pivot_longer(cols = contains("Year"), names_to = c("Country", ".value"),
               names_sep = "_", values_drop_na = TRUE) %>%
  rename_with(~ str_c('Country_', .), Rating:Year)
But as of yet no luck.
Any help would be appreciated.
Thank you
Data
structure(list(Antarctica.Year.CE = 167:172, Antarctica.Temp..C. = c(0.33,
0.31, 0.18, 0.08, -0.01, -0.11), Antarctica.Min..C. = c(-1.24,
-1.26, -1.39, -1.48, -1.57, -1.67), Antarctica.Max..C. = c(1.89,
1.87, 1.74, 1.64, 1.55, 1.45), Arctic.Year.CE = 1:6, Arctic.Temp..C. = c(-1.15,
-0.96, -0.32, 0.1, -0.18, -0.61), Arctic.Min..C. = c(-1.92, -1.76,
-1.38, -0.74, -1.08, -1.17), Arctic.Max..C. = c(-0.31, -0.11,
0.48, 0.83, 0.73, 0.16), Asia.Year.CE = 800:805, Asia.Temp..C. = c(-0.31,
-0.14, -0.36, -0.67, -0.78, -0.26), Asia.Min..C. = c(-1.4, -1.23,
-1.45, -1.76, -1.87, -1.35), Asia.Max..C. = c(0.79, 0.96, 0.74,
0.43, 0.31, 0.83), Australasia.Year.CE = 1001:1006, Australasia.Temp..C. = c(-0.24,
-0.38, -0.29, -0.33, -0.34, -0.11), Australasia.Min..C. = c(-0.62,
-0.79, -0.71, -0.73, -0.73, -0.56), Australasia.Max..C. = c(0.15,
0.03, 0.13, 0.07, 0.05, 0.34), Europe.Year.CE = 1:6, Europe.Temp..C. = c(0.09,
-0.26, -0.24, 0.22, 0.32, 0.67), Europe.Min..C. = c(-0.69, -1.14,
-1.18, -0.66, -0.48, -0.11), Europe.Max..C. = c(0.88, 0.56, 0.61,
1.07, 1.14, 1.5), North.America...Pollen.Year.CE = c(480L, 510L,
540L, 570L, 600L, 630L), North.America...Pollen.Temp..C. = c(-0.25,
-0.29, -0.33, -0.34, -0.34, -0.34), North.America...Pollen.Min..C. = c(-0.74,
-0.7, -0.66, -0.65, -0.64, -0.64), North.America...Pollen.Max..C. = c(0.24,
0.11, 0, -0.04, -0.04, -0.04), North.America...Trees.Year.CE = c(1204L,
1214L, 1224L, 1234L, 1244L, 1254L), North.America...Trees.Temp..C. = c(-0.22,
-0.45, -0.38, -0.87, -0.81, -0.06), North.America...Trees.Min..C. = c(-0.53,
-0.72, -0.67, -1.12, -1.09, -0.35), North.America...Trees.Max..C. = c(0.04,
-0.2, -0.11, -0.57, -0.52, 0.18), South.America.Year.CE = 857:862,
South.America.Temp..C. = c(-0.3, -0.21, -0.07, -0.38, -0.41,
-0.19), South.America.Min..C. = c(-1.12, -1, -0.88, -1.19,
-1.22, -0.98), South.America.Max..C. = c(0.53, 0.58, 0.74,
0.43, 0.39, 0.61)), row.names = c(NA, 6L), class = "data.frame")
[1]: https://i.stack.imgur.com/0sV7a.png
For something as small as this, I'd often just go with a more manual approach.
Given your df above, I specify the list of countries appearing in the column names and then use grepl() on the column names to select each country's columns. Then we rename the columns and return the new data frame. We can then apply the function to the list of countries and rbind the results with do.call.
country_list = c('Antarctica', 'Arctic', 'Asia', 'Australasia', 'Europe',
                 'North.America...Pollen', 'North.America...Trees', 'South.America')

get_cols = function(country) {
  df_new = df[, grepl(country, colnames(df))]
  df_new$Country = rep(country, nrow(df_new))
  colnames(df_new) = c('Year', 'Temp', 'Min_Temp', 'Max_Temp', 'Country')
  return(df_new)
}

df_final = do.call(rbind, lapply(country_list, get_cols))
Hope that returns what you're looking for?
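If you'd still like the tidyverse route you started on, pivot_longer() can do the same reshaping in one call with a names_pattern that splits the country prefix from the measurement suffix. A sketch, assuming the column names follow the dput above (Country, Year, Temp, Min and Max are names I've chosen; the trailing ".CE"/"..C." parts are simply dropped by the regex):
library(dplyr)
library(tidyr)

long <- df %>%
  pivot_longer(
    everything(),
    names_to      = c("Country", ".value"),
    ## "<country>.<Year|Temp|Min|Max>...": keep the prefix and the measurement word
    names_pattern = "(.*?)\\.(Year|Temp|Min|Max)"
  )

## optionally spread back out to one row per Year, filling missing countries with NA
wide <- long %>%
  pivot_wider(names_from = Country, values_from = c(Temp, Min, Max))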

How to calculate the average of a comma separated string of numbers in R

I have the following file:
file 1
structure(list(Total_Gene_Symbol = c("5S_rRNA", "7SK", "A1BG-AS1"
), Test = c("1.02, 1.12, 1.11, 1.18, 1.12, 1.19, 1.25, 1.24, 1.24, 1.02",
"1.97, 2.27, 2.14, 1.15", "1.3, 1.01, 1.36, 1.42, 1.38, 1.01, 1.31, 1.34,
1.29, 1.34, 2.02, 1.12, 1.01, 1.31, 1.22"
)), .Names = c("Total_Gene_Symbol", "Test"), row.names = c(NA,
3L), class = "data.frame")
In file 1, the Test column contains numbers separated by ",".
I tried
mat <- stri_split_fixed(Down_FC, ',', simplify=T)
mat <- `dim<-`(as.numeric(mat), dim(mat)) # convert to numeric and save dims
rowMeans(mat, na.rm=T)->M
View(M)
but the above code is averaging over the entire data set.
I want the output to be the same as file 2 below
file 2
structure(list(Total_Gene_Symbol = c("5S_rRNA", "7SK", "A1BG-AS1"
), Test = c("1.02, 1.12, 1.11, 1.18, 1.12, 1.19, 1.25, 1.24, 1.24, 1.02",
"1.97, 2.27, 2.14, 1.15", "1.3, 1.01, 1.36, 1.42, 1.38, 1.01, 1.31, 1.34,
1.29, 1.34, 2.02, 1.12, 1.01, 1.31, 1.22"
), Average = c(11.49, 7.53, 19.44)), .Names = c("Total_Gene_Symbol",
"Test", "Average"), row.names = c(NA, 3L), class = "data.frame")
The values you want are actually sums, not averages: e.g. 11.49 is the sum of the first row's numbers, whereas its average (mean) would be 1.149.
library(magrittr)
library(stringr)   # str_split()

df1$total_sum <-
  df1$Test %>%
  str_split(",\\s+") %>%
  sapply(function(x) as.numeric(x) %>% sum(na.rm = TRUE))
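If you do want the row-wise mean with the same split-and-convert approach, just swap sum() for mean() (row_mean below is only an illustrative column name), e.g.:
library(magrittr)
library(stringr)

df1$row_mean <-
  df1$Test %>%
  str_split(",\\s+") %>%
  sapply(function(x) mean(as.numeric(x), na.rm = TRUE))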
Using apply:
d1$sum <- apply(d1, 1, function(x)
  sum(as.numeric(unlist(strsplit(x['Test'], ','))), na.rm = TRUE))
You can use scan :
df$sum <- sapply(df$Test, function(x) sum(scan(text = x, what=numeric(),sep=","), na.rm=TRUE))
df$average <- sapply(df$Test, function(x) mean(scan(text = x, what=numeric(),sep=","), na.rm=TRUE))
# Total_Gene_Symbol Test sum average
# 1 5S_rRNA 1.02, 1.12, 1.11, 1.18, 1.12, 1.19, 1.25, 1.24, 1.24, 1.02 11.49 1.1490
# 2 7SK 1.97, 2.27, 2.14, 1.15 7.53 1.8825
# 3 A1BG-AS1 1.3, 1.01, 1.36, 1.42, 1.38, 1.01, 1.31, 1.34, \n 1.29, 1.34, 2.02, 1.12, 1.01, 1.31, 1.22 19.44 1.2960

Combining pheatmaps in R

I've been trying, with no success, to work out how 2 or more pheatmaps (heatmaps) can be combined into a final plot.
data1 <- structure(list(DC1 = c(NA, NA, 1.98), DC2 = c(NA, NA, 0.14),
DC3 = c(1.85, 1.51, 0.52), DC4 = c(0.89, 0.7, 1.47), DC5 = c(0,
0.78, 0), DC6 = c(0, 1.3, 0), DC7 = c(0, 1.47, 0), DC8 = c(0,
1.2, 0), DC9 = c(0, 0, 0), DC10 = c(0.51, 1.9, 0)), .Names = c("DC1",
"DC2", "DC3", "DC4", "DC5", "DC6", "DC7", "DC8", "DC9", "DC10"),
class = "data.frame", row.names = c("A", "B", "C"))
data2 <- structure(list(DC1 = c(9.56, 1.87, 2.07, 1.87, 2.07, 1.35), DC2 = c(5.51, 1.13, 1.25, 1.13, 0.99, 0.45), DC3 = c(4.84, 1.17, 0.66, 1.17,
0.34, 0.16), DC4 = c(4.18, 0.59, 0.05, 0.97, 0.43, 0.59), DC5 = c(3.26,
0, 0.14, 0.31, 0.79, 0.63), DC6 = c(3.35, 0, 1.12, 0.05, 1.12,
0), DC7 = c(4.18, 0.63, 1.27, 0.47, 1.27, 0), DC8 = c(4.37, 1.17,
1.3, 1.17, 0, 0), DC9 = c(4.3, 1.13, 0, 1.13, 0, 0), DC10 = c(7.47,
1.88, 0.71, 1.88, 0, 0)), .Names = c("DC1", "DC2", "DC3", "DC4",
"DC5", "DC6", "DC7", "DC8", "DC9", "DC10"), class = "data.frame", row.names = c("TD6 vs SH",
"TD6 vs SAP", "TD6 vs NEA", "SH vs SAP", "SH vs NEA", "SAP vs NEA"
))
I can construct a heatmap very easily using pheatmap with these two calls:
hm_data1 <- pheatmap(as.matrix(data1))
hm_data2 <- pheatmap(as.matrix(data2))
However, I cannot get both printed in one figure. I would like to see them side by side; my real figure will be composed of 16 pheatmaps, so they must be arranged in 4 columns and 4 rows.
I tried par(mfrow = ...) with no success.
How can I combine pheatmaps?
I know there are plenty of R packages that can plot heatmaps, but I would like to do it with pheatmap.
This will work.
library(gridExtra); library(pheatmap)
m <- matrix(c(1:4), ncol = 2)
n <- matrix(c(1, 1, 1, 2), ncol = 2)
a <- list(pheatmap(m)[[4]])   # [[4]] is the gtable (grob) slot of the pheatmap object
a[[2]] <- pheatmap(n)[[4]]
z <- do.call(grid.arrange, a)
plot(z)
Based on one of the comments: if you have many single plots, you can use a loop like this.
mn <- list(m, n)
a <- list()
for (i in 1:length(mn)) {
  a[[i]] <- pheatmap(mn[[i]])[[4]]
}
z <- do.call(grid.arrange, a)
plot(z)
The point is to add all the data for your single plots to a list. You can then loop over the list, applying pheatmap.
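For the real figure with 16 heatmaps in 4 columns and 4 rows, grid.arrange() also accepts ncol and nrow. A small sketch, assuming plots is a list holding the 16 gtables collected as in the loop above:
library(gridExtra)

## plots: a list of 16 gtables, e.g. plots[[i]] <- pheatmap(mats[[i]])[[4]]
z <- do.call(grid.arrange, c(plots, list(ncol = 4, nrow = 4)))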

for loop to find threshold values between different data frames

I have 2 data frames with some matching columns (pollutants).
The first data frame contains the observations while the second one contains different thresholds for some pollutants.
Here a small subset of both data frames:
dput(df1)
structure(list(sample = structure(27:76, .Label = c("A_1", "A_2",
"A_LS", "A_PC", "A_PM", "B_1", "B1_1", "B1_2", "B1-8_PC", "B1-8_PM",
"B1_LS", "B1_PC", "B1_PM", "B_2", "B2_1", "B2_2", "B2-8_PC",
"B2-8_PM", "B2_LS", "B2_PC", "B2_PM", "B_LS", "B_PC", "B_PM",
"C_1", "C_2", "C386", "C387", "C388", "C389", "C390", "C391",
"C392", "C393", "C394", "C395", "C396", "C397", "C398", "C399",
"C400", "C401", "C402", "C403", "C404", "C405", "C406", "C407",
"C408", "C409", "C410", "C411", "C412", "C413", "C414", "C415",
"C416", "C417", "C418", "C419", "C420", "C421", "C422", "C423",
"C424", "C425", "C426", "C427", "C428", "C429", "C430", "C431",
"C432", "C433", "C434", "C435", "C436", "C437", "C438", "C439",
"C440", "C441", "C442", "C443", "C444", "C445", "C446", "C447",
"C448", "C449", "C450", "C451", "C452", "C453", "C454", "C455",
"C456", "C457", "C458", "C459", "C460", "C461", "C462", "C463",
"C464", "C465", "C466", "C467", "C468", "C469", "C470", "C471",
"C472", "C473", "C474", "C475", "C476", "C477", "C478", "C479",
"C480", "C481", "C482", "C483", "C484", "C485", "C486", "C487",
"C488", "C489", "C490", "C491", "C492", "C493", "C494", "C495",
"C496", "C497", "C498", "C499", "C500", "C501", "C502", "C503",
"C504", "C505", "C506", "C507", "C508", "C509", "C510", "C511",
"C512", "C513", "C514", "C515", "C516", "C517", "C518", "C519",
"C520", "C521", "C522", "C523", "C524", "C-8_PC", "C-8_PM", "D_1",
"D_2", "E_1", "E_2", "F_1", "F_2"), class = "factor"), As = c(9,
8.75, 13.5, 7.75, 7.6, 8.33, 8, 8.75, 7.4, 8.25, 8.17, 7.75,
7.6, 7.5, 7.2, 8, 7.83, 7.75, 7, 7.5, 8.17, 8.75, 6.67, 7, 5.83,
6.75, 5.6, 6.4, 6.2, 6.2, 6.2, 6.25, 7, 6, 6, 6.4, 6, 5.8, 5.6,
6, 5.8, 7.25, 8.8, 8.5, 8, 8.25, 8.25, 8.5, 8.25, 8.25), Al = c(30245,
38060, 36280, 24355, 27776, 35190, 38733.8, 36400, 29624, 33699.75,
32163.33, 30645.75, 31373, 26647.5, 19987.6, 32210, 27158, 24220.25,
18598.5, 23081.75, 29393, 26800.5, 22581.67, 29290, 29651.67,
20947.5, 19762.6, 23815, 32784.8, 20696.2, 26880.6, 25087.75,
19497.2, 21794, 32232, 24253.4, 20034, 21270, 22510, 15170.25,
8956.6, 21612.25, 35828, 30006.25, 27128.75, 25835, 31118.75,
35614.5, 37440.25, 33736.75), Hg = c(0.25, 0.35, 0.48, 1.03,
1.12, 0.2, 1.14, 0.4, 2, 0.48, 0.85, 0.18, 0.76, 0.4, 0.48, 0.35,
0.32, 0.33, 0.4, 0.13, 0.15, 0.13, 0.87, 0.12, 0.03, 0.33, 0.2,
0.22, 0.04, 0.16, 0.1, 0.18, 0.11, 0.08, 0.03, 0.06, 0.06, 0.1,
0.03, 0.07, 0.03, 0.1, 0.08, 0.11, 0.1, 0.13, 0.08, 0.12, 0.07,
0.09)), .Names = c("sample", "As", "Al", "Hg"), row.names = c(NA,
50L), class = "data.frame")
and
dput(df2)
structure(list(As = c(25L, 32L), Hg = c(0.4, 0.8), Cr = c(100L,
360L), Element = structure(c(1L, 3L), .Label = c("LCB", "LCB_pelite",
"LCL"), class = "factor")), .Names = c("As", "Hg", "Cr", "Element"
), row.names = c(NA, -2L), class = "data.frame")
Actually the original data frames are bigger, but this subset gives the idea.
What I want now is to put into a 3rd data frame the values of each element of the first df that exceed the threshold values contained in the second df.
Be aware that there are 2 different threshold values (for each element) in df2, and that df2 has some elements not matched in df1 (for example Cr).
I've tried to write a for loop but I was able to do that just for 1 element at a time:
for (i in df2$As) {
  print(length(which(df1$As > i)))
}
I've also tried to use nested for loops but without success..
I'm pretty sure this does not look good, but I think it works. I added some extra lines to match only the elements found in both data frames, which in this case is only 1. It might need some changes for your full data:
df1.2 <- rbind(df1, df1) #Duplicate the df1 to compare to each threshold value
df1.2 <- df1.2[order(df1.2$sample),] #Order by sample again
cols2 <- na.omit(match(colnames(df1), colnames(df2)))[[1]] #Get the columns of df2 which are in df1
cols1 <- na.omit(match(colnames(df2), colnames(df1)))[[1]] #Get the columns of df1 which are in df2
df2.2 <- df2[rep(1:2, nrow(df1)),cols2] #Replicates df2 the number of times to allow matching the thresholds to each sample, once for each threshold
exceeds <- df1.2[,cols1]>df2.2 #Make the comparions and return a boolean
sum(exceeds) #You will need colSums() for more than one column
With your sample data it's also not clear from the result which element it refers to, but this shouldn't happen if more than one element matches, since your result would then be a matrix.
Maybe there's a more elegant way without replicating the dataframes and having to worry about number of element matches.
df3 = data.frame(Pollutant = "Z", LCB = 0, LCL = 0, stringsAsFactors = FALSE)
for (p in names(df1)[-1]) {
  if (p %in% names(df2)[1:(length(df2) - 1)]) {
    df3 = rbind(df3, c(p, sum(df1[p] > df2[[p]][1]), sum(df1[p] > df2[[p]][2])))
  }
}
df3 = df3[-1, ]
df3
Update:
Ah, each new row is rbound as a character vector. To finish up:
str(df3)
df3$LCB=as.numeric(df3$LCB)
df3$LCL=as.numeric(df3$LCL)
str(df3)
How about this?
foo <- function(x, y) {
  sapply(x, function(i) sum(y > i))
}
cols <- c("As", "Hg")
mapply(foo, df2[cols], df1[cols])
#      As Hg
# [1,]  0 10
# [2,]  0  6
Convert this to a data.frame if necessary.
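For example, to turn that matrix into a labelled data frame (a small follow-up sketch, using the Element column of df2 for the row labels):
res <- as.data.frame(mapply(foo, df2[cols], df1[cols]))
res$Element <- df2$Element
res
#   As Hg Element
# 1  0 10     LCB
# 2  0  6     LCL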

How can I stop the overwriting of lagged columns?

Reference: https://stackoverflow.com/a/28056113/3942806
I am using the code from the link above to create lagged values for two columns.
n <- 4
odd2 <- setDT(odd)[, paste("OBS_Q", 1:n) := shift(OBS_Q, 1:n)]
odd2 <- setDT(odd)[, paste("sac", 1:n) := shift(sac, 1:n)]
This works great! I get 18 columns.
But for convenience, I tried to convert it into a function:
masterlag <- function(df, col, n){
  setDT(df)[, paste(col, sep = '_', 1:n) := shift(df[[col]], 1:n)]
}
odd3 <- masterlag(df = odd, col = "OBS_Q", n = 4)
odd3 <- masterlag(df = odd, col = "sac", n = 4)
But in this case, the newly created columns from the first call ('OBS_Q') get replaced when I use the function the second time ('sac'), so I am only left with 14 columns instead of 18.
Any pointers as to why?
odd<-structure(list(DATE = 19630101:19630104, PRECIP = c(0, 0, 0,0),
OBS_Q = c(1.61, 1.48, 1.4, 1.33), swb = c(1.75, 1.73, 1.7,1.67),
gr4j = c(1.9, 1.77, 1.67, 1.58), isba = c(0.83, 0.83,0.83, 0.83),
noah = c(1.31, 1.19, 1.24, 1.31), sac = c(1.99,1.8, 1.66, 1.57),
swap = c(1.1, 1.05, 1.08, 0.99), vic.mm.day. = c(2.1,1.75, 1.55, 1.43)),
.Names = c("DATE", "PRECIP", "OBS_Q", "swb","gr4j", "isba", "noah", "sac", "swap", "vic.mm.day."),
class = c("data.table","data.frame"), row.names = c(NA, -4L))
The data.tables update themselves by reference, so assign the result of each call back to odd itself rather than to a new name; the second call then starts from the table that already contains the OBS_Q lags:
odd <- masterlag(df = odd, col = "OBS_Q", n = 4)
odd <- masterlag(df = odd, col = "sac", n = 4)
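If you also want a single call that lags several columns at once, here is a sketch of a loop-based variant (masterlag_multi is just an illustrative name; it keeps the same paste()/shift() pattern as above):
library(data.table)

masterlag_multi <- function(df, cols, n) {
  setDT(df)
  for (col in cols) {
    df[, paste(col, 1:n, sep = "_") := shift(df[[col]], 1:n)]
  }
  df[]   # [] so the updated table also prints when the call is run interactively
}

odd <- masterlag_multi(odd, cols = c("OBS_Q", "sac"), n = 4)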
