I'm trying to loop the reverse_geo() function from the tidygeocoder package over a list.
When I apply the function to a single data frame it looks like this:
library(tidyverse, warn.conflicts = FALSE)
library(tidygeocoder)
num_coords <- 25 # number of coordinates
set.seed(103) # for reproducibility
# latitude and longitude bounds
lat_limits <- c(40.40857, 40.42585)
long_limits <- c(-3.72472, -3.66983)
# randomly sample latitudes and longitude values
random_lats <- runif(
num_coords,
min = lat_limits[1],
max = lat_limits[2]
)
random_longs <- runif(
num_coords,
min = long_limits[1],
max = long_limits[2]
)
# Reverse geocode the coordinates
# the speed of the query is limited to 1 coordinate per second to comply
# with Nominatim's usage policies
madrid <- reverse_geo(
lat = random_lats, long = random_longs,
method = 'osm', full_results = TRUE,
custom_query = list(extratags = 1, addressdetails = 1, namedetails = 1)
)
This works and returns results.
Now, when I try to apply it to each element of a list using lapply:
#NOW try to use reverse_geo() looping through a list
df1<- data.frame(random_lats,random_longs)
df2<- data.frame(random_lats,random_longs)
list <- list(df1, df2)
data_frame_list = lapply(list, function(x) reverse_geo (lat = x[["random_lats"]], long = x[["random_longs"]], "osm",
TRUE, list(extratags = 1, addressdetails = 1, namedetails = 1)))
I get the error: Error: limit must be NULL or >= 1. See ?reverse_geo
I think it's because reverse_geo() is not seeing the lat and long vectors correctly, since I'm not passing them correctly inside lapply().
Any recommendations on how I could fix this using lapply() or purrr::map()? I don't have a preference for one or the other, I just want to avoid using a for loop.
You're not having trouble with lat or long. The problem is that you are passing unnamed arguments to reverse_geo() that are not in the correct position. You should only pass arguments positionally if you are certain they will land in the correct position.
You have:
data_frame_list = lapply(list, function(x) reverse_geo (
lat = x[["random_lats"]],
long = x[["random_longs"]],
"osm",
TRUE,
list(extratags = 1, addressdetails = 1, namedetails = 1)))
So, "osm", TRUE, and your list(...) are getting passed to the 3 arguments that follow lat and long. These are:
method: which does accept "osm", so that's fine.
address: the name of the address column to output, which I am pretty sure you don't want to be TRUE.
limit: the maximum number of results per coordinate, which has to be numeric (or NULL). Yet you are passing it a list(...).
Clearly, your error statement highlights the issue:
#> Error: limit must be NULL or >= 1. See ?reverse_geo
You need to explicitly pass these arguments with the name of the argument they fit. I don't know exactly which argument you wanted to pass TRUE or the list(...) to, but the code below runs.
data_frame_list = lapply(list, function(x) reverse_geo (
lat = x[["random_lats"]],
long = x[["random_longs"]],
method = "osm",
flatten = TRUE,
custom_query = list(extratags = 1, addressdetails = 1, namedetails = 1)))
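To see the positional-matching problem in isolation, here is a minimal sketch that does not call the geocoding service at all. toy_geo() is a hypothetical stand-in, not part of tidygeocoder; it only copies the argument order described above (method, address, limit, then the rest) and the limit check from the error message.

```r
# Hypothetical stand-in for reverse_geo(); only the argument order matters here.
toy_geo <- function(lat, long, method = "osm", address = "address",
                    limit = 1, full_results = FALSE, custom_query = list()) {
  # Same kind of validation that produced the error in the question
  if (!is.null(limit) && !(is.numeric(limit) && limit >= 1)) {
    stop("limit must be NULL or >= 1")
  }
  list(method = method, address = address, limit = limit,
       full_results = full_results, custom_query = custom_query)
}

# Unnamed arguments fill method, address and limit in that order,
# so the list ends up in `limit` and the call fails.
positional <- tryCatch(
  toy_geo(lat = 40.4, long = -3.7, "osm", TRUE, list(extratags = 1)),
  error = function(e) conditionMessage(e)
)

# Named arguments land where they were intended to go.
named <- toy_geo(lat = 40.4, long = -3.7, method = "osm",
                 full_results = TRUE, custom_query = list(extratags = 1))
```

Here positional holds the error message, while named holds the arguments as they were intended. If the goal was to repeat the single-data-frame call from the question, full_results = TRUE is probably the argument the bare TRUE was meant for.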
My goal is to apply wavelet analysis and image construction to a large data set of time series, to be used eventually in a pipeline for time series clustering. The function for the first step is from WaveletComp, and I am using purrr's map() from the tidyverse. Ideally the output is a list, labeled by column, that I can then apply other functions to in the pipeline.
library(WaveletComp)
The data set has 3 columns and 6000 values
df <- data.frame(replicate(3,sample(-5:5,6000,rep=TRUE)))
wave_emg <- function(df) {
analyze.wavelet(my.data = df, my.series = "X1", loess.span =50,
dt=1, dj=1/250,
lowerPeriod = 32,
upperPeriod = 512,
make.pval = TRUE, n.sim = 100)
}
Solution <- mutate(model = map(df, wave_emg))
I get the following error: Error in my.data[, ind] : incorrect number of dimensions
It appears to me that the my.series argument of analyze.wavelet() expects a single column to be specified. Is there a way to tell it to take each column in turn?
You could write a function that takes two inputs: the data frame and a column name or position.
library(WaveletComp)
library(purrr)
ave_emg <- function(df, col) {
analyze.wavelet(my.data = df, my.series = col, loess.span =50,
dt=1, dj=1/250,
lowerPeriod = 32,
upperPeriod = 512,
make.pval = TRUE, n.sim = 100)
}
analyze.wavelet() accepts either a column name or a column index, so you could use either of these versions:
#column names
result <- map(names(df), ave_emg, df = df)
#column index
result <- map(seq_along(df), ave_emg, df = df)
You can also replace map with lapply to get the same output.
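The calling pattern can be checked without running a wavelet analysis. In this sketch col_range() is a hypothetical stand-in for ave_emg(); it has the same (df, col) signature but just returns the range of the selected column. Because df is passed by name, each element of the first argument is matched to col.

```r
set.seed(1)
# Small data frame with columns X1, X2, X3, built like the one in the question
df <- data.frame(replicate(3, sample(-5:5, 20, replace = TRUE)))

# Hypothetical stand-in for ave_emg(): same signature, cheap computation
col_range <- function(df, col) range(df[[col]])

# Iterate over column names; df is fixed via the named argument
by_name <- lapply(names(df), col_range, df = df)
names(by_name) <- names(df)

# Iterate over column positions instead
by_index <- lapply(seq_along(df), col_range, df = df)
```

Both versions return one result per column; naming the list by column (as in by_name) makes later steps in a pipeline easier to follow.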
It looks like df needs to be split into single-column data frames before it is passed to the function, to avoid the error from analyze.wavelet(). This code seems to work with this function, but @Ronak's code works with other functions.
library(tidyverse)
library(WaveletComp)
wave_emg <- function(df) {
# df is a one-column data frame here, so select the series by position
analyze.wavelet(my.data = df, my.series = 1, loess.span = 50,
dt = 1, dj = 1/250,
lowerPeriod = 32,
upperPeriod = 512,
make.pval = TRUE, n.sim = 100)
}
Solution <- df %>% split.default(., seq_along(.)) %>% map(., wave_emg)
I have 15 protein sequences as fasta format in one file. I have to pairwise align them globally and locally then generate a distance score matrix 15x15 to construct dendrogram.
But when I do, a sequence is not aligned with itself and I get an NA result. Moreover, AxB gives a score of 12131 but BxA gives NA, so R cannot construct the phylogenetic tree.
What should I do?
I'm using this script for the loop, but it only fills the pairs in one direction:
for (i in 1:150) {
globalpwa<-pairwiseAlignment(toString(ProtDF[D[1,i],2])
,toString(ProtDF[D[2,i],2]),
substitutionMatrix = "BLOSUM62",
gapOpening = 0,
gapExtension = -2,
scoreOnly=FALSE,
type="global")
ScoreX[i]<-c(globalpwa@score)
nameSeq1[i]<-c(as.character(ProtDF[D[1,i],1]))
nameSeq2[i]<-c(as.character(ProtDF[D[2,i],1]))
}
I used an example fasta file, protein sequence of RPS29 in fungi.
ProtDF <-
c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD",
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR",
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL",
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK",
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR",
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL",
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR",
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD",
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR",
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR",
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR",
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR",
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR"
)
You were right to use combn. Since you need a distance score matrix for the dendrogram, it is better to store the values in a matrix. Below, I name the rows and columns of the matrix after the fasta names and fill in the pairwise values in both directions.
library(Biostrings)
# you can also read in your file
# like ProtDF = readAAStringSet("fasta")
ProtDF=AAStringSet(ProtDF)
# combination like you want
# here we just use the names
D = combn(names(ProtDF),2)
#make the pairwise matrix
mat = matrix(NA,ncol=length(ProtDF),nrow=length(ProtDF))
rownames(mat) = names(ProtDF)
colnames(mat) = names(ProtDF)
# loop through D
for(idx in 1:ncol(D)){
i <- D[1,idx]
j <- D[2,idx]
globalpwa<-pairwiseAlignment(ProtDF[i],
ProtDF[j],
substitutionMatrix = "BLOSUM62",
gapOpening = 0,
gapExtension = -2,
scoreOnly=FALSE,
type="global")
mat[i,j]<-globalpwa@score
mat[j,i]<-globalpwa@score
}
# if you need to make diagonal zero
diag(mat) <- 0
# plot
plot(hclust(as.dist(mat)))
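The same fill-both-triangles pattern can be tried in base R without Biostrings. This is only a sketch under stated assumptions: the three short sequences are made up, and adist() (generalized edit distance from utils) is swapped in for pairwiseAlignment(), so the numbers are edit distances, not BLOSUM62 alignment scores.

```r
# Toy sequences (made up for illustration)
seqs <- c(a = "MSHESVW", b = "MAHENVW", c = "MTHESVF")

# Each unordered pair once, by name
pairs <- combn(names(seqs), 2)

mat <- matrix(NA_real_, nrow = length(seqs), ncol = length(seqs),
              dimnames = list(names(seqs), names(seqs)))

for (idx in seq_len(ncol(pairs))) {
  i <- pairs[1, idx]
  j <- pairs[2, idx]
  # adist() is a stand-in for the alignment score used above
  d <- drop(adist(seqs[[i]], seqs[[j]]))
  mat[i, j] <- d  # upper triangle
  mat[j, i] <- d  # mirror into the lower triangle, so BxA equals AxB
}

# A sequence compared with itself has distance zero
diag(mat) <- 0

tree <- hclust(as.dist(mat))
```

Writing each value to both mat[i, j] and mat[j, i] and setting the diagonal explicitly is what removes the NA entries that stopped the tree from being built.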
An alternate method, if you're interested, using the same example as above:
library(DECIPHER)
ProtDF <- c(OQS54945.1 = "MINDLKVRKDVEKSKAHCHVKPFGKGSRACERCASHRGHNRKYGMNLCRRCLHTNAKILGFTSFD",
XP_031008245.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY80688.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVKHR",
TVY57447.1 = "LPFLKIRVEPARRDNLKPAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCVDAMGTLEPRASSPEL",
TVY47820.1 = "EPARRDNLKTTIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKAADIGFVK",
TVY37154.1 = "AISKLNSRPQRPISTTPRKAKHTKSLVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKHR",
TVY29458.1 = "KHTESPVEPARRDNLKTAIMSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGHTDSSYDGSEF",
TVY14147.1 = "MSHESVWNSRPRTYGKGARACRVCTHKAGLIRKYGLNICRQCFREKASDIGFVKVCDGWIGTLEL",
`sp|Q6CPG3.1|RS29_KLULA` = "MAHENVWYSHPRKFGKGSRQCRISGSHSGLIRKYGLNIDRQSFREKANDIGFYKYR",
`sp|Q8SS73.1|RS29_ENCCU` = "MSFEPSGPHSHRKPFGKGSRSCVSCYTFRGIIRKLMMCRRCFREYAGDIGFAIYD",
`sp|O74329.3|RS29_SCHPO` = "MAHENVWFSHPRKYGKGSRQCAHTGRRLGLIRKYGLNISRQSFREYANDIGFVKYR",
TPX23066.1 = "MTHESVFYSRPRNYGKGSRQCRVCAHKAGLIRKYGLLVCRQCFREKSQDIGFVKYR",
`sp|Q6FWE3.1|RS29_CANGA` = "MAHENVWFSHPRRFGKGSRQCRVCSSHTGLIRKYDLNICRQCFRERASDIGFNKYR",
`sp|Q6BY86.1|RS29_DEBHA` = "MAHESVWFSHPRNFGKGSRQCRVCSSHSGLIRKYDLNICRQCFRERASDIGFNKFR",
XP_028490553.1 = "MSHESVWNSRPRSYGKGSRSCRVCKHSAGLIRKYDLNLCRQCFREKAKDIGFNKFR")
# All pairwise alignments:
# Convert characters to an AA String Set
ProtDF <- AAStringSet(ProtDF)
# Initialize some outputs
AliMat <- matrix(data = list(),
ncol = length(ProtDF),
nrow = length(ProtDF))
DistMat <- matrix(data = 0,
ncol = length(ProtDF),
nrow = length(ProtDF))
# loop through the upper triangle of your matrix
for (m1 in seq_len(length(ProtDF) - 1L)) {
for (m2 in (m1 + 1L):length(ProtDF)) {
# Align each pair
AliMat[[m1, m2]] <- AlignSeqs(myXStringSet = ProtDF[c(m1, m2)],
verbose = FALSE)
# Assign a distance to each alignment, fill both triangles of the matrix
DistMat[m1, m2] <- DistMat[m2, m1] <- DistanceMatrix(myXStringSet = AliMat[[m1, m2]],
type = "dist", # return a single value
includeTerminalGaps = TRUE, # return a global distance
verbose = FALSE)
}
}
dimnames(DistMat) <- list(names(ProtDF),
names(ProtDF))
Dend01 <- IdClusters(myDistMatrix = DistMat,
method = "NJ",
type = "dendrogram",
showPlot = FALSE,
verbose = FALSE)
# A single multiple alignment:
AllAli <- AlignSeqs(myXStringSet = ProtDF,
verbose = FALSE)
AllDist <- DistanceMatrix(myXStringSet = AllAli,
type = "matrix",
verbose = FALSE,
includeTerminalGaps = TRUE)
Dend02 <- IdClusters(myDistMatrix = AllDist,
method = "NJ",
type = "dendrogram",
showPlot = FALSE,
verbose = FALSE)
Dend01, from all the pairwise alignments:
Dend02, from a single multiple alignment:
Finally, DECIPHER has a function for loading up your alignment in your browser just to look at it, which, if your alignments are huge, can be a bit of a mistake, but in this case (and in cases up to a few hundred short sequences) is just fine:
BrowseSeqs(AllAli)
A side note about BrowseSeqs, for some reason it doesn't behave well with Safari, but it plays just fine with Chrome. Sequences are displayed in the order in which they exist in the aligned string set.
EDIT: BrowseSeqs DOES play fine with Safari directly, but it does have a weird issue when incorporated into HTML documents knitted with RMarkdown. Weird, but not applicable here.
Another big aside: all of the functions I've used have a processors argument, which is set to 1 by default. If you'd like to get greedy with your cores, you can set it to NULL and it'll just grab everything available. If you're aligning very large string sets, this can be pretty useful; if you're doing trivially small things like this example, not so much.
combn is a great function and I use it all the time. However for these really simple set ups I like looping through the upper triangle, but that's just a personal preference.
I have a portfolio of five shares. I calculate monthly returns and need all possible combinations of three stocks at a time, hence 10 combinations.
How do I calculate the monthly return for all possible portfolios? I think I need to use 'combn', but the result I get is just a very long list of numbers. Maybe the answer is in there somewhere, but I can't make anything out of all the numbers.
code so far:
library('quantmod')
tickers <- c('MSFT','YHOO','ORCL','EBAY','AMZN')
getSymbols(tickers, src='yahoo', from='2015-03-01', to='2016-02-29')
amzn_ret = monthlyReturn(AMZN, type = 'log')
ebay_ret = monthlyReturn(EBAY, type = 'log')
msft_ret = monthlyReturn(MSFT, type = 'log')
orcl_ret = monthlyReturn(ORCL, type = 'log')
yhoo_ret = monthlyReturn(YHOO, type = 'log')
stock_ret = c(amzn_ret, ebay_ret,yhoo_ret,orcl_ret,msft_ret)
combin = combn(stock_ret, 3, sum, simplify = FALSE)
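You are combining your returns with c(), which joins them into a single object instead of keeping one column per stock. What you need is cbind or some form of table with one column per stock, for example a data.frame, so that combn() picks three columns at a time:
stock_ret <- data.frame(amzn_ret, ebay_ret, yhoo_ret, orcl_ret, msft_ret)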
combin <- combn(stock_ret, 3, rowSums, simplify = TRUE)
and you probably meant to say rowSums?
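A small sketch with made-up numbers shows what a per-combination result looks like. The returns data frame and its column names are assumptions for illustration only; nothing here is downloaded from Yahoo.

```r
# Made-up monthly returns: one column per stock, one row per month
returns <- data.frame(
  a = c(0.01, 0.02),
  b = c(0.03, -0.01),
  c = c(0.00, 0.02),
  d = c(0.05, 0.01)
)

# All ways of choosing 3 of the 4 columns, as a list of name vectors
combos <- combn(names(returns), 3, simplify = FALSE)

# For each combination, sum the three return columns month by month
portfolio_sums <- sapply(combos, function(cols) rowSums(returns[cols]))
colnames(portfolio_sums) <- sapply(combos, paste, collapse = "+")
```

The result has one row per month and one column per combination, which is easier to read than a single long vector. Whether a plain sum is the right portfolio return depends on the weights you have in mind.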
I am trying to apply functions described here for a set of time series. For this, mapply seems to be a good approach but I guess there is some problem either in defining the function or in using mapply.
Here is the example code, where I found some discrepancy in the format of dataframe being returned and might be the source of error.
# define the function to apply
ccffunction <- function(x, y, plot = FALSE){
ts1 = get(x)
ts2 = get(y)
d <- ccf(ts1, ts2,lag.max = 24, plot = plot)
cor = d$acf[,,1]
lag = d$lag[,,1]
dd <- data.frame(lag = lag, ccf = cor)
return(t(dd)) # if I dont take transpose, not getting a df but info on the contents.
# It seems that mapply is adding the results from two series vertically ;
# and main part may be to define correct format of object returned
}
# List of time series simulated for testing results
rm(list = ls())
set.seed(123)
ts1 = arima.sim(model = list(ar=c(0.2, 0.4)), n = 10)
ts2 = arima.sim(model = list(ar=c(0.1, 0.2)), n = 10)
ts3 = arima.sim(model = list(ar=c(0.1, 0.8)), n = 10)
assign("series1", ts1)
assign("series2" , ts2)
assign("series3" , ts3)
tslist <- list(series1 = ts1, series2 = ts2, series3 = ts3)
# convert to mts object if it makes any difference
tsmts <- do.call(cbind, tslist)
class(tsmts)
# create pairs of time series using combn function
tspairs <- combn(names(tslist), 2)
tspairs
tspairs2 <- combn(colnames(tsmts), 2)
tspairs2
try1 <- mapply(ccffunction, tspairs[1, ], tspairs[2, ])
try2 <- mapply(function(x, y){ccf(x, y)}, tspairs2[1, ], tspairs2[2,])
I expected try2 to work directly when the pairs of time series are created with combn(tslist, 2) and plyr::mlply is used to pass the time series as arguments, but that approach does not work, or I am not using it correctly.
Is there a way to find CCF matrix for a set of time series using this approach or any alternatives ?
Edits : Tried to make the question more clear and specific.
Thanks.
You can try this:
library(plyr) # provides aaply()
ccff <- function(tsVec)
{
return (list(ccf(tsVec[[1]], tsVec[[2]], plot=FALSE)))
}
corList <- aaply(combn(tslist, 2), 2, ccff)
The results are stored in corList, which can then be accessed through corList[[1]].
Key points:
Note the tsVec[[1]] in the function definition. ccff essentially receives a list, hence the [[]].
Also note the return (list(...)) in the function definition. It is needed so that the caller can merge all the return values from the function into a single data structure.
Hope this helps.
Thank you,
GK
http://gk.palem.in/
ccf cannot look up a time-series object from its name; that lookup is what the get in try1 does.
So, in try2 you are simply passing ccf two strings, because it never sees the time-series objects.
> ccf("a_string","another_string")
Error in acf(X, lag.max = lag.max, plot = FALSE, type = type, na.action = na.action) :
'x' must be numeric
and
mapply(function(x, y){ccf(x, y)}, tspairs2[1, ], tspairs2[2,])
Error in acf(X, lag.max = lag.max, plot = FALSE, type = type, na.action = na.action) :
'x' must be numeric
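One way to avoid get() altogether is to index the list of series by name. The sketch below is a base R alternative, not the method from either answer; it reuses the simulated series from the question and stores one data frame of lags and cross-correlations per pair.

```r
set.seed(123)
tslist <- list(
  series1 = arima.sim(model = list(ar = c(0.2, 0.4)), n = 10),
  series2 = arima.sim(model = list(ar = c(0.1, 0.2)), n = 10),
  series3 = arima.sim(model = list(ar = c(0.1, 0.8)), n = 10)
)

# Every pair of series names, as a list of length-2 character vectors
pairs <- combn(names(tslist), 2, simplify = FALSE)

ccf_list <- lapply(pairs, function(p) {
  # Look the series up in the list instead of passing their names to ccf()
  res <- ccf(tslist[[p[1]]], tslist[[p[2]]], lag.max = 24, plot = FALSE)
  data.frame(lag = res$lag[, 1, 1], ccf = res$acf[, 1, 1])
})
names(ccf_list) <- sapply(pairs, paste, collapse = "_")
```

Each element, for example ccf_list$series1_series2, is an ordinary data frame, so the results can be bound together or reshaped into a matrix afterwards.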
I am trying to use a for loop to create multiple objects. Here is a simplified example (not my exact code):
l_gr <- list (1:10, 11:20, 21:30)
for (i in 1:length(l_gr)){
grp <- NULL
grp[[i]] <- mean(l_gr[[i]])
}
This is not what I am expecting. Instead, I need to output multiple objects (of different classes) whose names differ with each level of i, for example grp1, grp2, grp3 here.
Each of these objects holds the output of the function for a particular element i of the list. Sorry for the simple question.
Edits: response to provide specific example:
install.packages("onemap")
require(onemap)
data(example.out)
twopts <- rf.2pts(example.out)
all.data <- make.seq(twopts,"all")
link_gr <- group(all.data)
link_gr$n.groups
starts the loop
# without loop:
# for 1
grp1 <- make.seq(link_gr, 1)
grp1.od <- order.seq(input.seq=grp1, n.init = 5, subset.search = "twopt",
twopt.alg = "rcd", THRES = 3, draw.try = TRUE, wait = 1, touchdown=TRUE)
# for 2
grp2 <- make.seq(link_gr, 2)
grp2.od <- order.seq(input.seq=grp2, n.init = 5, subset.search = "twopt",
twopt.alg = "rcd", THRES = 3, draw.try = TRUE, wait = 1, touchdown=TRUE)
The same process repeats for 1:link_gr$n.groups.
So I want create a for loop and output objects:
for (i in 1:link_gr$n.groups){
grp <- NULL
grp[i] <- make.seq(link_gr, i)
grp[i].od <- order.seq(input.seq=grp[i], n.init = 5, subset.search = "twopt",
twopt.alg = "rcd", THRES = 3, draw.try = TRUE, wait = 1, touchdown=TRUE)
}
Note that your for loops are wrong. If you set grp <- NULL within the loop, you'll just wipe your results variable with each iteration, which is probably not what you want. You need to put the variable initialisation outside the loop.
Note, too, that I'd suggest you are still better off using a single variable instead of multiple ones. list objects are very flexible in R and can accommodate objects of different classes. You can do
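Applied to the simplified example from the question, the fix looks roughly like this. It is a minimal sketch; the names grp1, grp2, grp3 are attached to list elements rather than created as separate variables.

```r
l_gr <- list(1:10, 11:20, 21:30)

# Initialise the result once, outside the loop
grp <- vector("list", length(l_gr))

for (i in seq_along(l_gr)) {
  grp[[i]] <- mean(l_gr[[i]])
}

# Label the elements so they can be accessed as grp$grp1, grp$grp2, ...
names(grp) <- paste0("grp", seq_along(grp))
```

Each result is then available by position (grp[[2]]) or by name (grp$grp2), and the list can hold objects of different classes.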
require(onemap)
data(example.out)
twopts <- rf.2pts(example.out)
all.data <- make.seq(twopts,"all")
link_gr <- group(all.data)
link_gr$n.groups
# initialise list outputs
grp = list()
grp.od = list()
for (i in 1:2){
grp[[i]] <- make.seq(link_gr, i)
grp.od[[i]] <- order.seq(input.seq=grp[[i]], n.init = 5, subset.search = "twopt",
twopt.alg = "rcd", THRES = 3, draw.try = TRUE, wait = 1, touchdown=TRUE)
}
#check out output
str(grp)
str(grp.od)
grp[[1]]
grp[[2]]
If you must insist on using different variables, consider ?assign and ?get. Something like this will work:
i = 1
assign(paste("grp", i, sep = ""), grp[[1]])
exists("grp1")
str(get(paste("grp", i, sep = "")))