I have a data.table that holds ids and locations. for example, here is it with one row in it:
(it has col and row names, don't know if it matters)
locations<-data.table(c(11,12),c(-159.58,0.2),c(21.901,22.221))
colnames(locations)<-c("id","location_lon","location_lat")
rownames(locations)<-c("1","2")
I then want to iterate over the rows and compare them to another point (with lat,lon).
In a for loop it works:
for (i in 1:nrow(locations)) {
loc <- locations[i,]
dist <- gdist(-159.5801, 21.901, loc$location_lon, loc$location_lat, units="m")
if(dist <= 50) {
return (loc)
}
return (NULL)
}
and returns:
id location_lon location_lat
1: 11 -159.58 21.901
but I want to use apply.
The following code fails to run:
dists <- apply(locations,1,function(x) if (50 - gdist(-159.5801, 21.901, x$location_lon, x$location_lat, units="m")>=0) x else NULL)
with $ operator is invalid for atomic vectors error. Changing to reference by location (x[2],x[3]) isn't enough to fix this, I get
Error in if (radius - gdist(lon, lat, x[2], x[3], units = "m") >= 0) x else NULL :
missing value where TRUE/FALSE needed
This is because the data.table is converted to matrix, and the coordinates are treated as text instead of numbers.
Is there a way to overcome this? The solution needs to be efficient (I want to run this check for >1,000,000 different coordinates). Changing the data structure of the locations table is possible if needed.
No loops are required, just use data.table as intended. If all you want to see are the rows that within 50 meters from the desired location, all you have to do is
locations[, if (gdist(-159.58, 21.901, location_lon, location_lat, units="m") <= 50) .SD, id]
## id location_lon location_lat
## 1: 11 -159.58 21.901
Here we are iterating by the id column within the locations data set itself and checking if each id is within 50 meters from -159.58, 21.901. If so, we are calling .SD which is basically the data set itself for that specific id.
As a side note, data.table doesn't have row.names, so there is no need of specifiying them, see here, for example
Related
EDIT: I implemented offered solutions so far, and the code looks way cleaner now. This was the key to finally finding my error. It was a logical condition that I didn't check within the while loop. It could happen that the iterator would exceed the number of elements in the vector and thus pass a "NA" to the while condition! Thx
I also changed the solution to use vector assignments to store the results and then recombine after the for loop, as vector indexing seems to be way faster than data.table indexing and value assignment within the loop.
Pls let me apologize first for any errors and lack of information for troubleshooting my problem as this is my first post so far. I have already read that this can happen accidentally whenever ther is an error in a computation and the value of a condition results in an error, such as
if (TRUE & sqrt(-1))
It's been days and I am still receiving this error. It really gives me a headache, as the inherent logic behind such code is actually pretty straigth forward and I still can't properly formalize it. It goes like following: Compare for each unique bond ID contained in a vector of size N (loop through with i), the static value of its corresponding maturity to 7 periods' end date for distinct set of rules (loop through with k) to determine which periods with unique rules the respective issue falls into, and then determine by looping through all the periods' size thresholds (loop through by l) to find if a particular issue has violted these minimium size requirements. If a violation is determined, I can assign the date of the violation. If (l == k), I can reckon that for all periods that the issue's maturity falls into, have also successfully looped through the corresponding size requirements checks and as such hasn't violated any rules. I then assign the result of the conditional checks as corresponding binary values in a new data.table column as well as the violation date. So far, I really cant determine what is casusing this error.
My data looks like following. I have a pretty large data.table containing bond issue identifiers and various other column variables that describe those issues. It was imported as initially with the read_dta() function and then transformed to a data.table with setDT().
I extract 3 columns out of this data.table, using
issue_IDs.vec <- as.numeric(issues.dt[[2]])
maturity.vec <- as.Date(issues.dt[[8]], "%Y-%m-%d")
offerings_atm.vec <- as.numeric(issues.dt[[33]])
Next, I transform eligibility criteria of an index as following.
# (1) Creating size requirement end periods (valid thru) ----
size_req_per_1 <- as.Date("1992-01-01", "%Y-%m-%d")
size_req_per_2 <- as.Date("1994-01-01", "%Y-%m-%d")
size_req_per_3 <- as.Date("1999-07-01", "%Y-%m-%d")
size_req_per_4 <- as.Date("2003-10-01", "%Y-%m-%d")
size_req_per_5 <- as.Date("2004-07-01", "%Y-%m-%d")
size_req_per_6 <- as.Date("2017-02-01", "%Y-%m-%d")
size_req_per_7 <- as.Date("2021-02-01", "%Y-%m-%d")
size_req_val_per.vec <- c(size_req_per_1, size_req_per_2, size_req_per_3, size_req_per_4,
size_req_per_5, size_req_per_6, size_req_per_7)
# (2) Create a size requirement threshold per rules' validity period ----
size_req_thresh_1 <- 25000
size_req_thresh_2 <- 50000
size_req_thresh_3 <- 100000
size_req_thresh_4 <- 150000
size_req_thresh_5 <- 200000
size_req_thresh_6 <- 250000
size_req_thresh_7 <- 300000
size_req_thresh.vec <- c(size_req_thresh_1, size_req_thresh_2, size_req_thresh_3,
size_req_thresh_4, size_req_thresh_5, size_req_thresh_6,
size_req_thresh_7)
Next, I do write a loop to perform conditional checks to find for each issue ID stored in the issues_ID.vec if they violate the index eligibility criterium of the minimim issance size during their maturity. I do this by passing the value of iterator variable i as a position value to the issues_ID.vec.
# (3) Looping through a set of conditional check to find out if and if so when a particular issue violated the size requirement ---
# Iterator variables ----
# Length of issues.dt
j <- issues.dt[, .N]
# Main iterator looping through all entries of isssues.dt extracted as vector
i <- 1
# Looping through vector elements of issue rules (vec. 1: validity periods)
k <- 1
# Looping through vector elements of issue rules (vec. 2: size thresholds)
l <- 1
# Loop
for (i in 1:j) {
id <- issue_IDs.vec[i]
maturity <- maturity.vec[i]
offering_atm <- issue_IDs.vec[i]
k <- 1
maturity_comp <- size_req_val_per.vec[k]
while (maturity >= maturity_comp) {
if (k < 7) {
k <- k + 1
maturity_comp <- size_req_val_per.vec[k]
} else {
break
}
}
l <- 1
offering_size_comp <- size_req_thresh.vec[l]
for (l in 1:k) {
if (offering_atm >= offering_size_comp) {
offering_size_comp <- size_req_thresh.vec[l]
next
} else {}
}
if (l == k) {
issues.dt[ISSUE_ID == id,
`:=`(SIZE_REQ_VIOLATION = 0,
SIZE_REQ_VIOLATION_DATE = NA)]
} else {
issues.dt[ISSUE_ID == id,
`:=`(SIZE_REQ_VIOLATION = 1,
SIZE_REQ_VIOLATION_DATE = size_req_val_per.vec[l])]
}
i <- i + 1
}
Whenever I try running the code in a simplified version, such as
k <- 1
for (1 in 1:7) {
print(maturity >= maturity_comp)
k <- k + 1
maturity_comp <- format(as.Date(size_req_val_per.vec[k]), "%Y-%m-%d")
}
the code runs smooth and always results in the printed evaluations TRUE or FALSE, depending which ID I initially to create the corresponding static maturity of the particular bond issue. As this stage, I already exhasuted my troubleshooting skills.
I'd appreciate any input from you guys, and if you need any additional information, explanations etc. just let me know.
I think the answer lies in Gregor's comment. The way you are formatting your dates converts them to character variables. Here's a quick example:
Exmpl<-as.Date("08-25-2020", "%m-%d-%Y")
class(Exmpl)
[1] "Date"
##Not your preferred format, but it is a Date variable##
Exmpl
"2020-08-25"
##Formatting changes it to a character
Exmpl2<-format(as.Date(Exmpl), "%m-%d-%Y")
class(Exmpl2)
[1] "character"
When you call them in the while() function, R is trying make a comparison to decided if the condition (i.e., maturity is greater than or equal to maturity comp) is TRUE or FALSE (logical variables). Because you have character variables, R cannot make this comparison.
I think your code will work if you don't format the dates, but simply read them in and leave them in the YYYY-mm-dd format.
I have two excel file
And,
I want to know the range and positions with 0 coverage values and an output as follows:
Where,
size = (end - start)+1
mapped = positions with > 0 Coverage
%mapped = (mapped/size)*100
Completeness = (Total mapped/Total Size)*100
for e.g for the above output Completeness = ((3+2)/(7+5))*100 = 41.66%
I have several such input files to be analyzed. How can I do this in R?
To get to know which part of a data.frame satisfies some condition, you can use which, will give you all the indexes for which that condition is TRUE, so you can use that to get the parts you're interested in.
If we assume you have a data.frame called df1 for the first part of your question, and the second image is called df2, then you can get the index-range of the rows in df1 with 'chr1' like this:
range <- which(df1$chr=='chr1')[df2$start[1]]:which(df1$chr=='chr1')[df2$end[1]]
or instead of manually typing 'chr1', you can use df2$chr[1].
For the count, sum(df1[range, 'coverage'] > 0) tells you how many values are more then zero.
Now we need to do that for all rows together, we can use sapply to do something for all values provided:
df2$mapped <- sapply(1:nrow(df2), function(row) {
range <- which(df1$chr==df2$chr[row])[df2$start[row]]:which(df1$chr==df2$chr[row])[df2$end[row]]
sum(df1[range, 'coverage'] > 0)
}
Your other questions are easier answered then asked, as in R most functions are vectorised: you can do something for multiple values at the same time.
df2$size = (df2$end - df2$start)+1
df2$perc_mapped = (df2$mapped/df2$size)*100
Completeness is just a total of all rows together, sum(df2$size) and sum(df2$mapped)
I'm using the fb_insightsfrom the package fbRads like this: (I use more metrics in my real problem)
fb_campaigns <- rbindlist(lapply(l, function(l) cbind(Campaign = l$campaign_name, rbindlist(l$actions))))
Oh, and I get some warnings (I know I'm doing something wrong, but can't solve it):
Warning messages:
1: In data.table::data.table(...) :
Item 1 is of size 11 but maximum size is 104 (recycled leaving remainder of 5 items)
The result is the data frame with all the data I need (Campaign, action_type, value), but... the columns with the "action_types" and their numbers came out of order. The action data don't seem to be from the campaigns in the rows.
How can I merge the action types with the campaigns?
After I the data in the correct rows, I will use reshape to make the action_types columns with the values.
The data I get from fb Rads and I want to transform are like this:
The data I get using my code are like this (the format is OK, but not the order of the values, they are not the values for the campaigns)
daroczig give me the solution bellow, and seems to work fine!
## list of action types to extract
actiontypes <- c('link_click', 'comment', 'like', 'post')
## extract actions from the data returned by the Insights API
lactions <- unlist(lapply(l, function(x) x$actions), recursive = FALSE)
## extract fields from the actions
library(data.table)
lactions <- rbindlist(lapply(lactions, function(actions) {
setnames(as.data.table(
do.call(cbind,
lapply(actiontypes,
function(action) {
if (is.null(actions)) return(0)
value <- subset(actions, action_type == action, value)
if (nrow(value) == 0) return(0) else
return(value[[1]])
}))),
actiontypes)
}))
## Merging the dataframe with the original data and the dataframe with the actions
fb_campaigns <- cbind(l[,c(1,4:11)],lactions))
I’m new to programming and I’m currently writing a function to go through hundreds of csv files in the working directory.
The files have tons of NA values in it.
The function (which I call it corr) has two parameters, the directory, and a threshold value (numeric vector of length 1 indicating the number of complete cases).
The purpose of the function is to take the complete cases for two columns that are sulfate and nitrate(second and third column in the spreadsheet) and calculate the correlation between them if the number of complete cases is greater than the threshold parameter.
The function should return a vector with the correlation if it met the threshold requirement (the default threshold value is 0).
When I run the code I get back two of the following:
A + sign in the console
OR
2.The objects I created in the function can't be found.
Any help would be much appreciated. Thank you in advance!
corr <- function(directory, threshold=0){
filelist2<- data.frame(list.files(path=directory,
pattern=".csv", full.names=TRUE))
corvector <- numeric()
for(i in 1:length(filelist2)){
data <-data.frame(read.csv(filelist2[i]))
removedNA<-complete.cases(data)
newdata<-data[removedNA,2:3]
if(nrow(removedNA) > threshold){
corvector<-c(corvector, cor(data$sulfate, data$nitrate ))
}
}
corvector
}
I don't think your nrow(removedNA) does what you think it does. To replicate the example I use the mtcars dataset.
data <- mtcars # create dataset
data[2:4, 2] <- NA # create some missings in column 2
data[15:17, 3] <- NA # create some missing in column 3
removedNA <- complete.cases(data)
table(removedNA) # 6 missings indeed
nrow(removedNA) # NULL removedNA is no data.frame, so nrow() doesn't work
newdata <- data[removedNA, 2:3] # this works though
nrow(newdata) # and this shows the rows in 'newdata'
#---- therefore instead of nrow(removedNA) try
if(nrow(data)-nrow(newdata) < threshold) {
...
}
NB: I changed the > in < in the line with threshold. I guess it depends on whether you want to set an absolute minimum number of lines (in which cases you could simply use nrow(newdata) > threshold) as threshold, or whether you want the threshold to reflect the different number of lines in the original data and 'new' data.
I have this code:
dat<-dat[,list(colA,colB
,RelativeIncome=Income/.SD[Nation=="America",Income]
,RelativeIncomeLog2=log2(Income)-log2(.SD[Nation=="America",Income])) #Read 1)
,by=list(Name,Nation)]
1) I would like to be able to say "RelativeIncomeLog2=log2(RelativeIncome)", but "RelativeIncome" is not available in j's scope?
2) I tried the following instead (per the data.table FAQ). Now "RelativeIncome" is available but it doesn't add the columns:
dat<-dat[,{colA;colB;RelativeIncome=Income/.SD[Nation=="America",Income];
,RelativeIncomeLog2=log2(RelativeIncome)]))
,by=list(Name,Nation)]
You can create and assign objects in j, just use { curly braces }.
You can then pass these objects (or functions & calculations of the objects) out of j and assign them as columns of the data.table. To assign more than once column at a time, simply:
wrap the LHS in c(.) make sure column names are strings and
the last line of j (ie, the "return" value) should be a list.
dat[ , c("NewIncomeComlumn", "AnotherNewColumn") := {
RelativeIncome <- Income/.SD[Nation == "A", Income];
RelativeIncomeLog2 <- log2(RelativeIncome);
## this last line is what will be asigned.
list(RelativeIncomeLog2 * 100, c("A", "hello", "World"))
# assigned values are recycled as needed.
# If the recycling does not match up, a warning is issued.
}
, by = list(Name, Nation)
]
You can losely think of j as a function within the environment of dat
You can also get a lot more sophisticated and complex if required. You can also incorporate by arguments as well, using by=list(<someName>=col)
In fact, similar to functions, simply creating an object in j and assigning it a value, does not mean that it will be available outside of j. In order for it to be assigned to your data.table, you must return it. j automatically returns the last line; if that last line is a list, each element of the list will be handled as a column. If you are assigning by reference (ie, using := ) then you will achieve the results you are expecting.
On a separate note, I noticed the following in your code:
Income / .SD[Nation == "America", Income]
# Which instead could simply be:
Income / Income[Nation == "America"]
.SD is great in that it is a wonderful shorthand. However, to invoke it without needing all of the columns which it encapsulates is to burden your code with extra memory costs. If you are using only a single column, consider naming that column explicitly or perhaps add the .SDcols argument (after j) and being naming the columns needed there.