R: Update/generate variables without a loop

How can I generate or update variables without using a loop? mutate doesn't work here (at least I don't know how to get it to work for this problem) because I need to calculate values from multiple rows in another data set.
I'm supposed to replicate the regression results of an academic paper, and I'm trying to generate some variables required in the regression. The following is what I need.
I have 2 relevant data sets for this question, subset (containing
geocoded residential property transactions) and sch_relocs (containing the date
of school relocation events as well as their locations)
I need to calculate the distance between each residential property and the nearest (relocated) school
If the closest school is one that relocated to the area near the residential property, the dummy variable new should be 1 (if the school relocated away from the area, then new should be 0)
If the relocated school moved only a small distance, and a house is within the overlapping portion of the respective 2km radii around the school locations, the dummy variable overlap should be 1, otherwise 0
If the distance to the nearest school is <= 2km, the dummy variable in_zone should be 1. If the distance is between 2km and 4km, these transactions are considered controls, and hence in_zone should be 0. If the distance is greater than 4km, I should drop the observations from the data
I have tried to do this using a for loop, but it's taking ages to run (it still hadn't finished after running overnight), so I need a better way. Here's my code (very messy; the explanation above is probably easier to follow if you want to work out what I'm trying to do):
for (i in 1:nrow(subset)) {
  # distance to new school locations
  for (j in 1:nrow(sch_relocs)) {
    dist = distHaversine(c(subset[i, ]$longitude, subset[i, ]$latitude),
                         c(sch_relocs[j, ]$new_lon, sch_relocs[j, ]$new_lat)) / 1000
    if (dist < subset[i, ]$min_dist_new) {
      subset[i, ]$min_dist_new = dist
      subset[i, ]$closest_new_sch = sch_relocs[j, ]$school_name
      subset[i, ]$date_new_loc = sch_relocs[j, ]$date_reloc
    }
  }
  # distance to old school locations
  for (j in 1:nrow(sch_relocs)) {
    dist = distHaversine(c(subset[i, ]$longitude, subset[i, ]$latitude),
                         c(sch_relocs[j, ]$old_lon, sch_relocs[j, ]$old_lat)) / 1000
    if (dist < subset[i, ]$min_dist_old) {
      subset[i, ]$min_dist_old = dist
      subset[i, ]$closest_old_sch = sch_relocs[j, ]$school_name
      subset[i, ]$date_old_loc = sch_relocs[j, ]$date_reloc
    }
  }
  # generate dummy "new"
  if (subset[i, ]$min_dist_new < subset[i, ]$min_dist_old) {
    subset[i, ]$new = 1
    subset[i, ]$date_move = subset[i, ]$date_new_loc
  } else {
    subset[i, ]$date_move = subset[i, ]$date_old_loc
  }
  # find overlaps
  if (subset[i, ]$closest_old_sch == subset[i, ]$closest_new_sch &
      subset[i, ]$min_dist_old <= 2 &
      subset[i, ]$min_dist_new <= 2) {
    subset[i, ]$overlap = 1
  }
  # find min dist
  subset[i, ]$min_dist = min(subset[i, ]$min_dist_old, subset[i, ]$min_dist_new)
  # zoning (2 marks observations to drop later)
  if (subset[i, ]$min_dist <= 2) {
    subset[i, ]$in_zone = 1
  } else if (subset[i, ]$min_dist <= 4) {
    subset[i, ]$in_zone = 0
  } else {
    subset[i, ]$in_zone = 2
  }
}
Here's what the data sets look like (just the relevant variables).
subset data set with desired result (first 2 rows):
sch_relocs data set (full with only relevant columns)
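Since the question is how to avoid the loop entirely: geosphere::distm() computes the whole property-by-school distance matrix in one call, after which the per-row minima and all the dummies are plain vectorized operations. A minimal sketch, assuming the column names used in the loop above (the date_move handling is left out for brevity):

```r
library(geosphere)

pts <- cbind(subset$longitude, subset$latitude)

# full distance matrices (rows = properties, cols = schools), in km
d_new <- distm(pts, cbind(sch_relocs$new_lon, sch_relocs$new_lat),
               fun = distHaversine) / 1000
d_old <- distm(pts, cbind(sch_relocs$old_lon, sch_relocs$old_lat),
               fun = distHaversine) / 1000

# max.col(-d) gives the column index of each row's minimum
j_new <- max.col(-d_new, ties.method = "first")
j_old <- max.col(-d_old, ties.method = "first")

subset$min_dist_new    <- d_new[cbind(seq_len(nrow(subset)), j_new)]
subset$min_dist_old    <- d_old[cbind(seq_len(nrow(subset)), j_old)]
subset$closest_new_sch <- sch_relocs$school_name[j_new]
subset$closest_old_sch <- sch_relocs$school_name[j_old]

subset$new     <- as.integer(subset$min_dist_new < subset$min_dist_old)
subset$overlap <- as.integer(subset$closest_new_sch == subset$closest_old_sch &
                             subset$min_dist_new <= 2 & subset$min_dist_old <= 2)
subset$min_dist <- pmin(subset$min_dist_new, subset$min_dist_old)

subset <- subset[subset$min_dist <= 4, ]           # drop observations beyond 4 km
subset$in_zone <- as.integer(subset$min_dist <= 2)
```

With many properties but only a handful of schools, this is one matrix computation plus a few vectorized columns instead of a nested loop, which is where the overnight runtime goes.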

Related

I want to calculate the time difference between two times

I want to calculate the difference of two columns of a data frame containing times. Since the value in the same column is not always the bigger/later one, I have to work around it with an if clause:
counter = 1
while (counter <= nrow(data)) {
  if (data$time_end[counter] - data$time_begin[counter] < 0) {
    data$chargingDuration[counter] = 1 - abs(data$time_end[counter] - data$time_begin[counter])
  }
  if (data$time_end[counter] - data$time_begin[counter] > 0) {
    data$chargingDuration[counter] = data$time_end[counter] - data$time_begin[counter]
  }
  counter = counter + 1
}
The output I get is a decimal value smaller than 1 (e.g. 0.53322, meaning half a day). However, if I use my console and calculate the time difference manually for a single line, I get my desired result looking like 02:12:03.
Thanks for the help guys :)
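Assuming time_begin and time_end are stored as numeric fractions of a day (which would explain 0.53322 meaning half a day), the whole while loop collapses to one vectorized modulo, and formatting the fraction as seconds gives the HH:MM:SS display. A sketch under that assumption:

```r
data <- data.frame(time_begin = c(0.25, 0.9), time_end = c(0.5, 0.1))

# past-midnight differences wrap around instead of going negative;
# (end - begin) %% 1 is the vectorized equivalent of the if-clause workaround
diff_frac <- (data$time_end - data$time_begin) %% 1

# convert day fractions to a clock-style string such as "02:12:03"
data$chargingDuration <- format(as.POSIXct(diff_frac * 86400,
                                           origin = "1970-01-01", tz = "UTC"),
                                "%H:%M:%S")
```

For the two sample rows this yields "06:00:00" and "04:48:00"; the second row crosses midnight and is handled by the modulo rather than the 1 - abs(...) branch.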

Find local minima in noisy data with exact criteria and without for-loop

I have a time series and would like to find minima which fulfill specific criteria. These are: the number of points within the valley (below the red line) between 2 peaks should exceed a certain value, and the number of points above the red line should also exceed a certain value for each peak neighboring the valley. In addition, the lower of the two peaks should be resolved at a value lower than 50% of its height, meaning that the max(intensity) of the lower of the two peaks should be at least 2-fold the lowest intensity within the valley between the two peaks (as calculated below in the code sample). I drew the red line here at a specific height, but in reality those unresolved peaks can have any height and can be separated by any distance. So what I am doing at the moment is to "scan" with the red line over each point of the time series, which is of course very slow and inefficient.
So here is the for-loop I am using at the moment:
library(data.table)

detect_double_peaks <- function(pot.doubleP.v, Min.PpP = 10) {
  peak.dt <- data.table(
    breakP = NA,
    breakH = NA,
    resolved = NA
  )
  for (point in pot.doubleP.v[pot.doubleP.v > 0 &
                              pot.doubleP.v < 0.8 * max(pot.doubleP.v)]) {
    doublePeak.rle <- S4Vectors::Rle(pot.doubleP.v > point)
    doublePeak.rle.dt <- data.table(
      idx = as.numeric(seq.int(length(doublePeak.rle@values))),
      values = doublePeak.rle@values,
      lengths = doublePeak.rle@lengths,
      start = start(doublePeak.rle),
      end = end(doublePeak.rle)
    )
    doublePeak.rle.dt_p <- doublePeak.rle.dt[values == TRUE & lengths > Min.PpP]
    if (nrow(doublePeak.rle.dt_p) > 1) {
      # parentheses matter here: 1:nrow(...) - 1 would start the loop at 0
      for (peak in 1:(nrow(doublePeak.rle.dt_p) - 1)) {
        doublePeak.rle.dt_v <- doublePeak.rle.dt[idx > doublePeak.rle.dt_p[peak]$idx &
                                                   idx < doublePeak.rle.dt_p[peak + 1]$idx]
        if (sum(doublePeak.rle.dt_v[values == FALSE]$lengths) >=
            max(max(doublePeak.rle.dt_p[peak]$lengths,
                    doublePeak.rle.dt_p[peak + 1]$lengths) * 0.5, Min.PpP)) {
          dp.p_height_h <-
            max(max(pot.doubleP.v[(doublePeak.rle.dt_p[peak]$start):(doublePeak.rle.dt_p[peak]$end)]),
                max(pot.doubleP.v[(doublePeak.rle.dt_p[peak + 1]$start):(doublePeak.rle.dt_p[peak + 1]$end)]))  # - baseL
          dp.p_height_l <-
            min(max(pot.doubleP.v[(doublePeak.rle.dt_p[peak]$start):(doublePeak.rle.dt_p[peak]$end)]),
                max(pot.doubleP.v[(doublePeak.rle.dt_p[peak + 1]$start):(doublePeak.rle.dt_p[peak + 1]$end)]))  # - baseL
          breakH <-
            min(pot.doubleP.v[min(doublePeak.rle.dt[idx > doublePeak.rle.dt_p[peak]$idx]$start):max(doublePeak.rle.dt[idx < doublePeak.rle.dt_p[peak + 1]$idx]$end)])  # - baseL
          resolved <- breakH / dp.p_height_l * 100
          breakP <-
            which.min(pot.doubleP.v[min(doublePeak.rle.dt[idx > doublePeak.rle.dt_p[peak]$idx]$start):max(doublePeak.rle.dt[idx < doublePeak.rle.dt_p[peak + 1]$idx]$end)]) +
            doublePeak.rle.dt_p[peak]$end
          peak.dt <- rbind(peak.dt,
                           data.table(breakP = breakP,
                                      breakH = breakH,
                                      resolved = resolved))
        }
      }
    }
  }
  if (nrow(peak.dt) == 1) {
    return(NULL)
  } else {
    return(na.omit(unique(peak.dt, by = "breakP")))
  }
}
Here are some example data:
testvector <- c(13126.177734375, 12040.060546875, 10810.6171875, 10325.94140625,
13492.8359375, 33648.5703125, 14402.603515625, 29920.12890625,
24316.224609375, 36019.26171875, 34492.4609375, 53799.82421875,
45988.72265625, 47930.453125, 67438.9140625, 61231.83984375,
56710.9140625, 62301.6796875, 54844.7578125, 70913.578125, 81028.1640625,
75234.203125, 59611.05078125, 79240.4375, 52313.3828125, 78758.2734375,
87918.5859375, 80764.7421875, 108035.5390625, 76263.875, 72401.6796875,
83167.640625, 76173.96875, 66241.4296875, 68687.4375, 52107.83984375,
45672.5390625, 51907.33203125, 39967.453125, 58856.90625, 52402.53125,
36980.3125, 43365.76171875, 40480.75, 39057.96484375, 31622.58984375,
23830.455078125, 27393.30078125, 30675.208984375, 27327.48046875,
25150.08984375, 23746.212890625, 9637.625, 19065.58984375, 21367.40625,
6789.0625, 9892.7490234375, 26820.685546875, 19965.353515625,
28281.462890625, 25495.0703125, 28808.416015625, 40244.03125,
35159.421875, 35257.984375, 39971.8046875, 34710.4453125, 60987.73828125,
50620.06640625, 58757.69140625, 52998.97265625, 55601.96484375,
69057.9453125, 58486.52734375, 66115.4765625, 80801.7578125,
77444.6015625, 43545.48828125, 79545.0703125, 50352.484375, 77401.8671875,
85118.421875, 80521.9296875, 68945.8125, 93098.0234375, 83065.8046875,
95970.8203125, 74141.8828125, 90298.75, 81251.0234375, 99658.3359375,
88812.2578125, 81939.4921875, 82632.1015625, 100125.0078125,
71627.84375, 70560.1484375, 77831.765625, 68122.328125, 79049.140625,
88000.890625, 64897.4453125, 57333.3046875, 68185.3046875, 67742.3515625,
58941.85546875, 63184.8671875, 36998.67578125, 45416.58984375,
31547.3359375, 32141.58203125, 35292.9765625, 30511.861328125,
25419.716796875, 23901.431640625, 15616.8759765625, 14469.16015625,
15026.0009765625, 18321.42578125, 15820.861328125, 19532.056640625,
13230.6240234375, 14586.76953125, 14912.642578125, 8541.5224609375,
21740.98046875, 19588.986328125, 18603.662109375, 19656.5625,
10812.94921875, 18379.3359375, 31242.716796875, 25626.0390625,
42446.71875, 27782.294921875, 38450.703125, 39070.97265625, 52914.375,
56484.47265625, 47741.88671875, 52397.18359375, 79378.2109375,
77866.078125, 55902.09765625, 66988.2265625, 63571.01171875,
66192.53125, 79989.8046875, 57204.59765625, 51172.9921875, 49612.16015625,
60508.0390625, 69518.09375, 48079.5625, 48691.0390625, 33679.12890625,
30697.470703125, 31209.359375, 49656.16796875, 32041.912109375,
13851.48828125, 29316.44921875, 31586.216796875, 45422.19921875,
24208.515625, 31496.083984375, 26250.646484375, 14318.302734375
)
For this vector the minima at 56 and 125 should be returned.
Those should be returned because when scanning with the red line through the points of the vector, there is at least one position at which there are more than Min.PpP = 10 consecutive points above the red line on each side of the valley, and with the same red line there are also more than Min.PpP = 10 points in the valley. The reason why point 4 should not be returned as a minimum is that no matter where we put the red line, the valley will never exceed 3 points (which is lower than Min.PpP = 10), and the peak on the left side of that minimum would only be 1 point (also lower than Min.PpP = 10).
I am aware of functions like pastecs::turnpoints. However, they do not seem to allow implementing criteria the way I want to use them.
So is there any other, more efficient way to achieve this?
P.S.:
I have also put another criterion in the example code which says that there should be at least half as many points in the valley as there are in the peak with the smaller number of points, even when Min.PpP is exceeded:
if(sum(doublePeak.rle.dt_v[values == FALSE]$lengths) >= max(max(doublePeak.rle.dt_p[peak]$lengths, doublePeak.rle.dt_p[peak+1]$lengths) * 0.5, Min.PpP))
However, I guess that's not really important for this problem.
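For a single fixed threshold the scan can be expressed with one rle() pass instead of nested loops. A minimal sketch under simplifying assumptions (one threshold level rather than scanning every intensity, the valley size counted as all points between the two qualifying peak runs, and the 50%-resolution check omitted; the function name is mine):

```r
# find the index of the lowest point in each valley between two runs of at
# least min_ppp consecutive points above the threshold `thr`
find_valley_minima <- function(x, thr, min_ppp = 10) {
  r <- rle(x > thr)
  ends   <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  peaks  <- which(r$values & r$lengths >= min_ppp)  # qualifying runs above the line
  out <- integer(0)
  if (length(peaks) < 2) return(out)
  for (k in seq_len(length(peaks) - 1)) {
    lo <- ends[peaks[k]] + 1         # first point after the left peak run
    hi <- starts[peaks[k + 1]] - 1   # last point before the right peak run
    if (hi - lo + 1 >= min_ppp) {    # enough points in the valley
      out <- c(out, lo + which.min(x[lo:hi]) - 1)
    }
  }
  out
}
```

Running this over a grid of threshold levels and pooling the unique results approximates the "scanning red line", but each level is a single vectorized pass rather than rebuilding data.tables per point.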

How to include logical checks in a custom function

I have written a custom function that performs a mathematical transformation on a column of data with the inputs being the data and one other input (temperature). I would like to have 2 different logical checks. The first one is whether or not any values in the column exceed a certain threshold, because the transformation is different above and below the threshold. The second is a check if the temperature input is above a certain value and in that case, to deliver a warning that values above the threshold are unusual and to check the data.
Right now, I have the function written with a series of if/else statements. However, this gives a warning that only the first element of the logical vector is being used. A simplified example of my function is as follows:
myfun = function(temp, data) {
  if (temp > 34) {
    warning('Temperature higher than expected')
  }
  if (data > 50) {
    result = temp * data
    return(result)
  } else if (data <= 50) {
    result = temp / data
    return(result)
  }
}
myfun(temp = c(25,45,23,19,10), data = c(30,40,NA,50,10))
As you can see, because it only uses the first value in the if/else statements, it does not properly calculate the return values: it doesn't switch between the two versions of the transformation. Additionally, it only checks whether the first temp value is above the threshold. How can I get it to apply the logical check to every value and not just the first?
-edit- simplified the function per @The_Questioner's suggestion and changed < 50 to <= 50.
The main issue with your code is that you are passing all the values to the function as vectors but then doing single-element comparisons. You need to either pass the elements one by one to the function, or put some kind of vectorized comparison or for loop into the function. Below is the for-loop approach, which is probably the least elegant way to do this, but at least it's easy to understand what's going on.
Another issue is that NAs need to be handled in the data vector before it reaches any of your conditional statements, or you'll get an error.
A final issue is what to do when data = 50. Right now you have conditional tests for greater than or less than 50, and since the 4th point in data is exactly 50, you get an NA.
myfun = function(temp, data) {
  result <- rep(NA, length(temp))
  for (t in 1:length(temp)) {
    if (temp[t] > 34) {
      warning('Temperature higher than expected')
    }
    if (!is.na(data[t])) {
      if (data[t] > 50) {
        result[t] <- temp[t] * data[t]
      } else if (data[t] < 50) {
        result[t] <- temp[t] / data[t]
      }
    }
  }
  return(result)
}
Output:
> myfun(temp = c(25,45,23,19,10), data = c(30,40,NA,50,10))
[1] 0.8333333 1.1250000 NA NA 1.0000000
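As a vectorized alternative to the loop, ifelse() does the element-wise branching and any() collapses the temperature check into a single warning. A sketch (note that with this comparison, data = 50 falls into the division branch instead of returning NA; the function name myfun_vec is mine):

```r
myfun_vec = function(temp, data) {
  if (any(temp > 34, na.rm = TRUE)) {
    warning('Temperature higher than expected')
  }
  # element-wise branch; NA in data propagates to NA in the result
  ifelse(data > 50, temp * data, temp / data)
}
```

myfun_vec(temp = c(25, 45, 23, 19, 10), data = c(30, 40, NA, 50, 10)) then returns 0.8333333, 1.1250000, NA, 0.3800000, 1.0000000, differing from the loop only at the data = 50 point.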

Null objects in list when splitting data with R

I'm new to R and got an assignment to do some basic research using R.
I have a CSV file imported with wind direction and wind speed data, and I want to split the wind speed based on direction.
So I created this bit of R code:
north.ls = list()
east.ls = list()
south.ls = list()
west.ls = list()
print("start")
for (i in 1:length(DD)) {
  if (DD[i] >= 315 & DD[i] <= 360 | DD[i] >= 1 & DD[i] < 45) {  # DD < 45 was missing the [i]
    north.ls[[i]] = as.integer(FH[i])
    print("north")
  }
  if (DD[i] >= 45 & DD[i] < 135) {
    east.ls[[i]] = as.integer(FH[i])
    print("east")
  }
  if (DD[i] >= 135 & DD[i] < 225) {  # 145 was presumably a typo; it would leave 135-145 unassigned
    south.ls[[i]] = as.integer(FH[i])
    print("south")
  }
  if (DD[i] >= 225 & DD[i] < 315) {
    west.ls[[i]] = as.integer(FH[i])
    print("west")
  }
}
This works fine and puts the right speeds in the right lists, but every time the condition is not met the list still gets a NULL value, so I end up with a lot of NULL values in the lists. What is the problem and how can I fix it?
I hope you understand my explanation.
Thanks in advance.
When you create a new item in a list at position [i] without items in the previous positions, all those earlier positions get NULLs.
Here's a slightly better way of producing what you're trying to do (I'm making some educated guesses about your data structure and your goals), without introducing these NULLs:
north.ls <- FH[(DD >= 315 & DD <= 360) | (DD >= 1 & DD < 45)]
east.ls  <- FH[DD >= 45 & DD < 135]
south.ls <- FH[DD >= 135 & DD < 225]
west.ls  <- FH[DD >= 225 & DD < 315]
This will give you four vectors that divide the data in FH into north, east, south, and west based on the data in DD. The length of each of the four vectors is NOT equal to the length of FH or DD (or each other), and no NAs are introduced unless they're already in FH.
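Under the same assumptions about DD and FH, base R's cut() plus split() expresses the binning declaratively; the level names and the merging of the two north intervals are my choices:

```r
DD <- c(0, 90, 180, 270, 350)   # example directions in degrees
FH <- c(1, 2, 3, 4, 5)          # matching wind speeds

# bin directions into quadrants; [315, 360] wraps around to north
dir <- cut(DD, breaks = c(0, 45, 135, 225, 315, 360),
           labels = c("north", "east", "south", "west", "north.wrap"),
           right = FALSE, include.lowest = TRUE)
levels(dir)[levels(dir) == "north.wrap"] <- "north"  # merge the wrap-around bin

wind <- split(FH, dir)   # wind$north, wind$east, wind$south, wind$west
```

This scales to any number of bins by editing breaks/labels instead of adding more if branches.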

R and apply info

I couldn't find any answers to this. I've got the following code and I'm trying to rewrite it with apply so it runs quicker; my data set is 130k rows long. I need an apply that will calculate the missing times of the horses from Behind (in lengths) and the winning horse's time. The problem is that the column Behind gives the distance behind the horse before it, not behind the first one. So I need to create a variable that carries on as the function goes and, when a new race is identified (it finds that position == 1), resets the variables.
missingTimes <- function(x) {
  L <- 2.4384
  for (i in 1:(nrow(x) - 10)) {   # parenthesized: 1:nrow(x) - 10 would start at -9
    distanceL <- (x$distance[i] * 1000) / L
    LperS <- x$Winner.Race.time[i] / distanceL
    if (!is.na(x$position[i]) && x$position[i] == 1) {   # check NA before comparing
      distanceL <- NULL
      LperS <- NULL
    }
    if (grepl("L", x$Behind[i])) {
      x$results[i] <- (distanceL + as.numeric(sub("L", "", x$Behind[i]))) * LperS
    }
  }
  x   # return the modified data frame; the original version returned nothing
}
I need at least 10 reputation to post images, so I'll give you links instead:
http://i.stack.imgur.com/xN23M.png
http://i.stack.imgur.com/Cspfr.png
The results should just give me a column with the proper finish times of the other horses, in the same form as the column Winner Race Time.
For further understanding, let me work through a few results for you:
Starting with the first row, it sees position = 1, so it resets the variables.
Then it takes the distance * 1000 and divides it by the constant L:
2.375 * 1000 / 2.4384 = 973.99
Then it needs the time in seconds it takes to complete one length (L):
290.9 / 973.99 = 0.298
Now, to get the finish time for the second horse, it adds the lengths BEHIND to the distance of the racing track and multiplies by the length-per-second time:
(973.99 + 2.25) * 0.298 = 976.24 * 0.298 = 290.91952
Then for the next horse's time it'd be:
(976.24 + 13) * 0.298 = 989.24 * 0.298 = 294.79352
and so on. Remember: when it hits position = 1, the distance needs to reset.
What I've done alternatively is put distanceL in a separate column, same with LperS, after calculating them.
If you could walk me through the steps required to get that done, it'd be great. I'm a complete rookie at R, so please be descriptive. I hope you catch my meaning!
Thank you!
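Because the lengths behind accumulate from the winner within each race, the whole calculation becomes column-wise once each row carries a race identifier. A sketch in base R, where race_id is a hypothetical helper column (derived here from the rows where position == 1, which is exactly the reset condition) and the other column names follow the screenshots:

```r
L <- 2.4384
races <- data.frame(position = c(1, 2, 3),
                    distance = 2.375,
                    Winner.Race.time = 290.9,
                    Behind = c("", "2.25L", "13L"),
                    stringsAsFactors = FALSE)

# every position == 1 row starts a new race
races$race_id <- cumsum(!is.na(races$position) & races$position == 1)

races$distanceL <- (races$distance * 1000) / L               # track length in lengths
races$LperS     <- races$Winner.Race.time / races$distanceL  # seconds per length

# numeric lengths behind the previous horse (winner rows get 0)
races$behindL <- ifelse(grepl("L", races$Behind),
                        as.numeric(sub("L", "", races$Behind)), 0)

# cumulative lengths behind the winner, restarting at each race
races$cumBehind <- ave(races$behindL, races$race_id, FUN = cumsum)

races$results <- (races$distanceL + races$cumBehind) * races$LperS
```

With the example numbers above, the winner's computed time comes back as exactly 290.9 and each following horse's time grows by its cumulative lengths behind, matching the worked arithmetic in the question (up to the rounding used there).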
