Null objects in list when splitting data with R

I'm new to R and got an assignment to do some basic research using R.
I have imported a CSV file with wind direction and wind speed data, and I want to split the wind speeds by direction.
So I created this bit of R code:
north.ls = list()
east.ls = list()
south.ls = list()
west.ls = list()
i = as.integer(1)
print("start")
for (i in 1:length(DD)) {
if ((DD[i] >= 315 & DD[i] <= 360) | (DD[i] >= 1 & DD[i] < 45)) {
north.ls[[i]] = as.integer(FH[i])
print("nord")
}
if(DD[i] >=45 & DD[i] < 135){
east.ls[[i]] = as.integer(FH[i])
print("east")
}
if(DD[i] >= 135 & DD[i] < 225){
south.ls[[i]] = as.integer(FH[i])
print("south")
}
if(DD[[i]] >=225 & DD[i] < 315){
west.ls[[i]] = as.integer(FH[i])
print("west")
}
}
This works fine and puts the right speeds in the right lists, but every time a condition is not met the list still gets a NULL value, so I end up with a lot of NULL values in the lists. What is the problem and how can I fix it?
I hope you understand my explanation.
Thanks in advance.

When you create a new item in a list at position [[i]] and the earlier positions hold no items yet, R fills all of those positions with NULL.
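A minimal sketch of that padding behaviour, using a hypothetical list `x` that is not part of the original code:

```r
# Start with an empty list and assign only to position 3
x <- list()
x[[3]] <- 10L

length(x)        # the list now has three elements
is.null(x[[1]])  # positions 1 and 2 were filled with NULL

# unlist() drops the NULL entries and returns a plain vector
speeds <- unlist(x)
```

This is why indexing each direction list by the loop counter `i` leaves a NULL in every position where the condition was false.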
Here's a slightly better way of producing what you're trying to do (I'm making some educated guesses about your data structure and your goals), without introducing these NULLs:
north.ls <- FH[(DD >= 315 & DD <= 360) | (DD >= 1 & DD < 45)]
east.ls <- FH[DD >= 45 & DD < 135]
south.ls <- FH[DD >= 135 & DD < 225]
west.ls <- FH[DD >= 225 & DD < 315]
This will give you four vectors that divide the data in FH into north, east, south, and west based on the data in DD. The length of each of the four vectors is NOT equal to the length of FH or DD (or each other), and there should be no NULLs introduced unless they're already in FH.
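If a single object holding all four groups is more convenient, the same split can be sketched with cut() and split(). The sample values of DD and FH below are made up for illustration, and the names `keep`, `sector`, and `by_dir` are hypothetical. Shifting the directions by 45 degrees is just one way to handle north wrapping around 360.

```r
# Made-up sample data: direction in degrees and speed
DD <- c(350, 90, 180, 270, 10, 0)
FH <- c(30L, 40L, 50L, 60L, 20L, 0L)

# Keep only directions in 1..360, as the conditions above do
keep <- DD >= 1 & DD <= 360

# Shift by 45 degrees so that north becomes the interval [0, 90)
shifted <- (DD[keep] + 45) %% 360
sector <- cut(shifted,
              breaks = c(0, 90, 180, 270, 360),
              labels = c("north", "east", "south", "west"),
              right = FALSE)

# A named list with one speed vector per direction
by_dir <- split(FH[keep], sector)
```

Here `by_dir$north` holds the speeds for 350 and 10 degrees, and the row with direction 0 is dropped before the split.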

Related

multiApply function for loop on a 3D array

I am trying to make my data processing more efficient for a spatial temperature data project. I have a for loop that will do what I want, but it is much too slow for processing multiple years of data. This loop looks at each spatial cell and, based on the 365 temperature values in that year, creates a value for the frequency, duration, number, and temp of heat events that will go into separate 2D data frames.
for (b in 1:299) { #longitude
for (c in 1:424) { #latitude
data <- year[b,c,] #makes all temps for this cell into a vector
for (d in 2:364) {
if (data[d]>=Threshold & data[d+1]>=Threshold) {
frequencydf[b,c]=frequencydf[b,c]+1
tempsdf[b,c]=tempsdf[b,c]+data[d]
}else if (data[d-1]>=Threshold & data[d]>=Threshold & data[d+1]<Threshold) {
frequencydf[b,c]=frequencydf[b,c]+1
numberdf[b,c]=numberdf[b,c]+1
tempsdf[b,c]=tempsdf[b,c]+data[d]
}else {
frequencydf[b,c]=frequencydf[b,c]
numberdf[b,c]=numberdf[b,c]
tempsdf[b,c]=tempsdf[b,c]
}
}
durationdf[b,c]=frequencydf[b,c]/numberdf[b,c]
tempsdf[b,c]=tempsdf[b,c]/frequencydf[b,c]
}
}
Therefore, I am trying to work with apply functions to speed up the process. I think I am running into issues when attempting to analyze each spatial cell by values in the 3rd (time) dimension of my array.
I am starting with the frequency parameter and trying to create the same data frame as above.
frequencylist <- Apply(year_array, fun = frequency.calc1, margins=c(1, 2))
frequencydf <- as.data.frame(frequencylist)
Using this function:
frequency.calc1 = function(cell) {
data <- as.vector(cell)
frequency <- 0
for (d in 2:364) {
if (data[d]>=Threshold & data[d+1]>=Threshold) {
frequency=frequency+1
}else if (data[d-1]>=Threshold & data[d]>=Threshold & data[d+1]<Threshold) {
frequency=frequency+1
}else {
frequency=frequency
}
}
return(frequency)
}
I am very new to creating functions and using the Apply function so any advice would be appreciated!
For loops and *apply functions run at about the same speed. Your problem is all those "if"s.
First of all, you have two separate conditions, both of which lead to incrementing frequency. Figure out how to combine them. Next, remember that R is vectorized, so you don't need a loop at all. With a little careful thought, you can write a line something like
frequency <- sum(data[1:(N-2)] >= Threshold & data[2:(N-1)] >= Threshold & data[3:N] < Threshold)
I haven't checked all the ">" vs "<" but you get the idea. Note the parentheses: in R, 1:N-2 means (1:N)-2, not 1:(N-2).
As a side note, NEVER hard-code the range of a loop. You can start with 2, since your conditionals reference d-1, but let the maximum value be defined as something like length(data) - 1.
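To make the idea concrete, here is a small sketch that runs the loop logic from the question and a vectorised equivalent on the same made-up temperature vector. The sample values and the names `hot`, `ongoing`, and `ending` are assumptions for illustration only.

```r
Threshold <- 30
data <- c(25, 31, 32, 33, 28, 31, 29, 32, 34, 27)  # made-up temperatures
N <- length(data)

# Loop version, following the two conditions in the question
frequency_loop <- 0
number_loop <- 0
for (d in 2:(N - 1)) {
  if (data[d] >= Threshold & data[d + 1] >= Threshold) {
    frequency_loop <- frequency_loop + 1
  } else if (data[d - 1] >= Threshold & data[d] >= Threshold & data[d + 1] < Threshold) {
    frequency_loop <- frequency_loop + 1
    number_loop <- number_loop + 1
  }
}

# Vectorised version: d is every "middle" day, compared with its neighbours
d <- 2:(N - 1)
hot <- data >= Threshold
ongoing <- hot[d] & hot[d + 1]               # event continues into the next day
ending <- hot[d - 1] & hot[d] & !hot[d + 1]  # event ends on day d
frequency_vec <- sum(ongoing) + sum(ending)
number_vec <- sum(ending)
```

Both versions count 5 hot days in 2 events for this vector. The three index vectors d - 1, d, and d + 1 all have the same length, so no recycling takes place.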
The solution used to simplify the process is shown below. Sum functions with conditionals were used in place of the if statements. This made the process incredibly efficient and did not use the apply function or an additional function.
for (b in 1:299) {
  for (c in 1:424) {
    data <- year[b,c,]
    N <- length(data)
    d <- 2:(N-1)  # each day that has both a previous and a next day
    ending <- data[d-1] >= Threshold & data[d] >= Threshold & data[d+1] < Threshold
    ongoing <- data[d] >= Threshold & data[d+1] >= Threshold
    frequency[b,c] <- sum(ending) + sum(ongoing)
    number[b,c] <- sum(ending)
    duration[b,c] <- frequency[b,c]/number[b,c]
    temps[b,c] <- sum(data[d][ending]) + sum(data[d][ongoing])
    temps[b,c] <- temps[b,c]/frequency[b,c]
  }
}
Thank you for your help, @Carl Witthoft.

I want to calculate the time difference between two times

I want to calculate the difference between two columns of a data frame that contain times. Since the value in one column is not always the later one, I have to use a workaround with an if clause:
counter = 1
while(counter <= nrow(data)){
if(data$time_end[counter] - data$time_begin[counter] < 0){
data$chargingDuration[counter] = 1-abs(data$time_end[counter]-data$time_begin[counter])
}
if(data$time_end[counter] - data$time_begin[counter] > 0){
data$chargingDuration[counter] = data$time_end[counter]-data$time_begin[counter]
}
counter = counter + 1
}
The output I get is a decimal value smaller than 1 (e.g. 0,53322, meaning about half a day). However, if I calculate the time difference manually in the console for a single row, I get my desired result, which looks like 02:12:03.
Thanks for the help guys :)
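One possible way to avoid the loop is sketched below, assuming both columns hold times as plain numeric fractions of a day (the question does not say which class they really are). The sample values and the names `duration`, `secs`, and `formatted` are hypothetical. The modulo operation wraps negative differences around midnight, which matches the `1 - abs(...)` branch of the loop.

```r
# Made-up times as fractions of a day (0.90 is 21:36, 0.10 is 02:24)
time_begin <- c(0.90, 0.25)
time_end <- c(0.10, 0.75)

# Vectorised difference; %% 1 wraps negative values past midnight
duration <- (time_end - time_begin) %% 1

# Convert the fraction of a day to whole seconds, then format as HH:MM:SS
secs <- round(duration * 86400)
formatted <- sprintf("%02d:%02d:%02d",
                     secs %/% 3600, (secs %% 3600) %/% 60, secs %% 60)
```

Unlike the loop, this also assigns a value (zero) when both times are equal.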

R: Update/generate variables without loop

How can I generate or update variables without using a loop? mutate doesn't work here (at least I don't know how to get it to work for this problem) because I need to calculate stuff from multiple rows in another data set.
I'm supposed to replicate the regression results of an academic paper, and I'm trying to generate some variables required in the regression. The following is what I need.
I have 2 relevant data sets for this question, subset (containing
geocoded residential property transactions) and sch_relocs (containing the date
of school relocation events as well as their locations)
I need to calculate the distance between each residential property and the nearest (relocated) school
If the closest school is one that relocated to the area near the residential property, the dummy variable new should be 1 (if the school relocated away from the area, then new should be 0)
If the relocated school moved only a small distance, and a house is within the overlapping portion of the respective 2km radii around the school locations, the dummy variable overlap should be 1, otherwise 0
If the distance to the nearest school is <= 2km, the dummy variable in_zone should be 1. If the distance is between 2km and 4km, these transactions are considered controls, and hence in_zone should be 0. If the distance is greater than 4km, I should drop the observations from the data
I have tried to do this using a for loop, but it's taking ages to run (it's still not done running after one night), so I need a better way to do it. Here's my code (very messy; I think the above explanation is a lot easier to follow if you want to figure out what I'm trying to do).
for (i in 1:as.integer(tally(subset))) {
# dist to new sch locations
for (j in 1:as.integer(tally(sch_relocs))) {
dist = distHaversine(c(subset[i,]$longitude, subset[i,]$latitude),
c(sch_relocs[j,]$new_lon, sch_relocs[j,]$new_lat)) / 1000
if (dist < subset[i,]$min_dist_new) {
subset[i,]$min_dist_new = dist
subset[i,]$closest_new_sch = sch_relocs[j,]$school_name
subset[i,]$date_new_loc = sch_relocs[j,]$date_reloc
}
}
# dist to old sch locations
for (j in 1:as.integer(tally(sch_relocs))) {
dist = distHaversine(c(subset[i,]$longitude, subset[i,]$latitude),
c(sch_relocs[j,]$old_lon, sch_relocs[j,]$old_lat)) / 1000
if (dist < subset[i,]$min_dist_old) {
subset[i,]$min_dist_old = dist
subset[i,]$closest_old_sch = sch_relocs[j,]$school_name
subset[i,]$date_old_loc = sch_relocs[j,]$date_reloc
}
}
# generate dummy "new"
if (subset[i,]$min_dist_new < subset[i,]$min_dist_old) {
subset[i,]$new = 1
subset[i,]$date_move = subset[i,]$date_new_loc
}
else if (subset[i,]$min_dist_new >= subset[i,]$min_dist_old) {
subset[i,]$date_move = subset[i,]$date_old_loc
}
# find overlaps
if (subset[i,]$closest_old_sch == subset[i,]$closest_new_sch &
subset[i,]$min_dist_old <= 2 &
subset[i,]$min_dist_new <= 2) {
subset[i,]$overlap = 1
}
# find min dist
subset[i,]$min_dist = min(subset[i,]$min_dist_old, subset[i,]$min_dist_new)
# zoning
if (subset[i,]$min_dist <= 2) {
subset[i,]$in_zone = 1
}
else if (subset[i,]$min_dist <= 4) {
subset[i,]$in_zone = 0
}
else {
subset[i,]$in_zone = 2
}
}
Here's what the data sets look like (just the relevant variables):
subset data set with desired result (first 2 rows):
sch_relocs data set (full with only relevant columns)
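The nearest-school search in the loop above could be vectorised roughly as follows. This is only a sketch in base R: the two tiny data frames are made up, `haversine_km` is a hypothetical helper (with an assumed Earth radius of 6378.137 km) standing in for distHaversine, and only the new school locations are handled. Observations beyond 4 km are marked NA here instead of 2.

```r
# Hypothetical helper: great-circle distance in km between lon/lat points
haversine_km <- function(lon1, lat1, lon2, lat2, r = 6378.137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Made-up sample data
houses <- data.frame(longitude = c(103.80, 103.90),
                     latitude = c(1.30, 1.35))
schools <- data.frame(school_name = c("A", "B"),
                      new_lon = c(103.81, 103.92),
                      new_lat = c(1.30, 1.36),
                      stringsAsFactors = FALSE)

# Distance matrix: one row per house, one column per school
dist_new <- outer(seq_len(nrow(houses)), seq_len(nrow(schools)),
                  function(i, j) haversine_km(houses$longitude[i], houses$latitude[i],
                                              schools$new_lon[j], schools$new_lat[j]))

# Index of the nearest school for every house, then look up its details
nearest <- apply(dist_new, 1, which.min)
houses$min_dist_new <- dist_new[cbind(seq_len(nrow(houses)), nearest)]
houses$closest_new_sch <- schools$school_name[nearest]

# Zoning dummy: 1 within 2 km, 0 within 4 km, NA (to be dropped) beyond that
houses$in_zone <- ifelse(houses$min_dist_new <= 2, 1,
                         ifelse(houses$min_dist_new <= 4, 0, NA))
```

Computing the whole distance matrix at once replaces the row-by-row data frame updates, which are typically the slow part of a loop like the one above.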

Nested loop for sequential recoding

I'm doing a sequence analysis on a large data sample. What I want to do is to rewrite my old Stata code in R, so that all of my analysis is performed in one single environment.
However, I would also like to improve it a little bit - the code is pretty long, and I would like to rewrite it using loops, so that it becomes more readable. Unfortunately my loop-writing skills are questionable.
1st loop [I think it needs to include an if statement]
I would like to write a loop for the following commands:
dt$dur.ofA1 <-(dt$M2_3R_A_1 - dt$M2_2R_A_1)
dt$dur.ofB1<-(dt$M2_3R_B_1 - dt$M2_2R_B_1)
dt$dur.ofC1<-(dt$M2_3R_C_1 - dt$M2_2R_C_1)
dt$dur.ofD1<-(dt$M2_3R_D_1 - dt$M2_2R_D_1)
dt$dur.ofE1<-(dt$M2_3R_E_1 - dt$M2_2R_E_1)
dt$dur.ofF1<-(dt$M2_3R_F_1 - dt$M2_2R_F_1)
dt$dur.ofG1<-(dt$M2_3R_G_1 - dt$M2_2R_G_1)
dt$dur.ofH1<-(dt$M2_3R_H_1 - dt$M2_2R_H_1)
dt$dur.ofA2<-(dt$M2_3R_A_2 - dt$M2_2R_A_2)
dt$dur.ofB2<-(dt$M2_3R_B_2 - dt$M2_2R_B_2)
dt$dur.ofC2<-(dt$M2_3R_C_2 - dt$M2_2R_C_2)
dt$dur.ofD2<-(dt$M2_3R_D_2 - dt$M2_2R_D_2)
dt$dur.ofE2<-(dt$M2_3R_E_2 - dt$M2_2R_E_2)
dt$dur.ofF2<-(dt$M2_3R_F_2 - dt$M2_2R_F_2)
dt$dur.ofG2<-(dt$M2_3R_G_2 - dt$M2_2R_G_2)
dt$dur.ofH2<-(dt$M2_3R_H_2 - dt$M2_2R_H_2)
dt$dur.ofA3<-(dt$M2_3R_A_3 - dt$M2_2R_A_3)
dt$dur.ofB3<-(dt$M2_3R_B_3 - dt$M2_2R_B_3)
dt$dur.ofC3<-(dt$M2_3R_C_3 - dt$M2_2R_C_3)
dt$dur.ofD3<-(dt$M2_3R_D_3 - dt$M2_2R_D_3)
dt$dur.ofE3<-(dt$M2_3R_E_3 - dt$M2_2R_E_3)
dt$dur.ofF3<-(dt$M2_3R_F_3 - dt$M2_2R_F_3)
dt$dur.ofG3<-(dt$M2_3R_G_3 - dt$M2_2R_G_3)
dt$dur.ofH3<-(dt$M2_3R_H_3 - dt$M2_2R_H_3)
My attempt:
db1 <- paste(rep("M2_", 24), "2R_", rep(LETTERS[seq( from = 1, to = 8)],3), "_",
rep(seq(from=1, to =3), 8),
sep = "")
db2 <- paste(rep("M2_", 24), "3R_", rep(LETTERS[seq( from = 1, to = 8)],3), "_",
rep(seq(from=1, to =3), 8),
sep = "")
dur <- paste(rep("dur.of", 24), rep(LETTERS[seq( from = 1, to = 8)],3),
rep(seq(from=1, to =3), 8),
sep = "")
dur <- as.list(dur)
for(e in dur){
for (j in db1){
for (i in db2){
{
dt[,e] <- dt[,i] - dt[,j]
}
I think the loop needs an if statement in the middle, so that it stops at a single item (subtracts A1 from A1, A2 from A2 etc.) from the list.
2) The second case is a little bit more complicated, but essentially it is the same case as described above:
The M2_2R_A_1 (start) M2_3R_A_1 (finish) indicate the yearly dates in which an educational activity took place. I would like to generate 1948:2013 variables that indicate that an activity took place in a particular year (stedu==x). A part of my Stata code is as follows (it goes on like that up to 2013):
recode stedu1948(0=2) if M2_2R_A_1<=1948 & 1948<= M2_3R_A_1 | M2_2R_A_2<=1948 & 1948<= M2_3R_A_2 | M2_2R_A_3<=1948 & 1948<= M2_3R_A_3
recode stedu1949(0=2) if M2_2R_A_1<=1949 & 1949<= M2_3R_A_1 | M2_2R_A_2<=1949 & 1949<= M2_3R_A_2 | M2_2R_A_3<=1949 & 1949<= M2_3R_A_3
recode stedu1950(0=2) if M2_2R_A_1<=1950 & 1950<= M2_3R_A_1 | M2_2R_A_2<=1950 & 1950<= M2_3R_A_2 | M2_2R_A_3<=1950 & 1950<= M2_3R_A_3
So in order to write a loop I would also need to include some conditions in order to stop the loop at a given point.
For your first item, use @thelatemail's suggestion. For the second item, consider the following for loop using the ifelse() function (it assumes the stedu columns already exist, as in the Stata code):
for (i in 1948:2013) {
  dt[[paste0("stedu", i)]] <- ifelse((dt$M2_2R_A_1 <= i & dt$M2_3R_A_1 >= i) |
                                     (dt$M2_2R_A_2 <= i & dt$M2_3R_A_2 >= i) |
                                     (dt$M2_2R_A_3 <= i & dt$M2_3R_A_3 >= i),
                                     2,
                                     dt[[paste0("stedu", i)]])
}
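For the first item, one possible sketch (not necessarily the suggestion referred to above) is to build the three name vectors in the same order and walk through them with a single index, so no nested loop or if statement is needed. The toy data frame and the names `start_cols`, `end_cols`, and `dur_cols` are assumptions for illustration.

```r
# Column names in matching order: A1..H1, A2..H2, A3..H3
activity <- rep(LETTERS[1:8], times = 3)
spell <- rep(1:3, each = 8)
start_cols <- paste0("M2_2R_", activity, "_", spell)
end_cols <- paste0("M2_3R_", activity, "_", spell)
dur_cols <- paste0("dur.of", activity, spell)

# Toy data: two rows, every start year is 1990, end years differ by column
dt <- data.frame(id = 1:2)
for (k in seq_along(start_cols)) {
  dt[[start_cols[k]]] <- 1990
  dt[[end_cols[k]]] <- 1990 + k
}

# One loop over the position k: element k of each vector belongs together
for (k in seq_along(dur_cols)) {
  dt[[dur_cols[k]]] <- dt[[end_cols[k]]] - dt[[start_cols[k]]]
}
```

Because the three vectors are aligned, position k always pairs the start and end columns of the same activity and spell.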

Use R switch for less than or greater than?

I've used switch for some easy conditionals where variables equal various values, but can't figure out how I would use it for less than or greater than conditionals such as
if (thedate >= as.Date("1981-01-20") & thedate < as.Date("1989-01-20")) {
thepres <- "Reagan"}
if (thedate >= as.Date("1989-01-20") & thedate < as.Date("1993-01-20")) {
thepres <- "George HW Bush"}
if (thedate >= as.Date("1993-01-20") & thedate < as.Date("2001-01-20")) {
thepres <- "Clinton"}
if (thedate >= as.Date("2001-01-20") & thedate < as.Date("2009-01-20")) {
thepres <- "George W Bush"}
if (thedate >= as.Date("2009-01-20")) {
thepres <- "Obama"}
(I know those should be nested ifelse statements but I find more than 3 or 4 difficult to code & follow).
Is there some way to use switch for situations like this, or do I have to go the nested ifelse route? (Or just leave it wildly inefficient like this)
Thanks.
The function cut is pretty good for situations like this. (I didn't include all of the presidents, but hopefully you get the idea)
thedate <- as.Date("1982-02-01")
thepresident <- cut(thedate,
c(as.Date("1981-01-20"), as.Date("1989-01-20"), as.Date("1993-01-20")),
labels = c("Reagan", "George HW Bush"), right = FALSE)
Also, note that this returns a factor, so you may want to convert to a string.
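A fuller sketch of the same approach, covering all five presidents from the question and converting the result to character. The sample dates are made up, and the closing break of 2017-01-20 (the end of Obama's second term) is added here because cut() needs an upper bound; dates outside the breaks come back as NA.

```r
# Start of each term, plus one closing break
breaks <- as.Date(c("1981-01-20", "1989-01-20", "1993-01-20",
                    "2001-01-20", "2009-01-20", "2017-01-20"))
presidents <- c("Reagan", "George HW Bush", "Clinton",
                "George W Bush", "Obama")

# cut() is vectorised, so several dates can be classified at once
thedate <- as.Date(c("1982-02-01", "1995-06-15", "2010-03-01", "1970-01-01"))
thepres <- as.character(cut(thedate, breaks = breaks,
                            labels = presidents, right = FALSE))
```

With right = FALSE each interval includes its start date and excludes its end date, which matches the >= and < comparisons in the question.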
