I'm doing a sequence analysis on a large data sample. What I want to do is to rewrite my old Stata code in R, so that all of my analysis is performed in one single environment.
However, I would also like to improve it a little bit - the code is pretty long, and I would like to rewrite it using loops, so that it becomes more readable. Unfortunately my loop-writing skills are questionable.
1st loop [I think it needs to include an if statement]
I would like to write a loop for the following commands:
dt$dur.ofA1 <-(dt$M2_3R_A_1 - dt$M2_2R_A_1)
dt$dur.ofB1<-(dt$M2_3R_B_1 - dt$M2_2R_B_1)
dt$dur.ofC1<-(dt$M2_3R_C_1 - dt$M2_2R_C_1)
dt$dur.ofD1<-(dt$M2_3R_D_1 - dt$M2_2R_D_1)
dt$dur.ofE1<-(dt$M2_3R_E_1 - dt$M2_2R_E_1)
dt$dur.ofF1<-(dt$M2_3R_F_1 - dt$M2_2R_F_1)
dt$dur.ofG1<-(dt$M2_3R_G_1 - dt$M2_2R_G_1)
dt$dur.ofH1<-(dt$M2_3R_H_1 - dt$M2_2R_H_1)
dt$dur.ofA2<-(dt$M2_3R_A_2 - dt$M2_2R_A_2)
dt$dur.ofB2<-(dt$M2_3R_B_2 - dt$M2_2R_B_2)
dt$dur.ofC2<-(dt$M2_3R_C_2 - dt$M2_2R_C_2)
dt$dur.ofD2<-(dt$M2_3R_D_2 - dt$M2_2R_D_2)
dt$dur.ofE2<-(dt$M2_3R_E_2 - dt$M2_2R_E_2)
dt$dur.ofF2<-(dt$M2_3R_F_2 - dt$M2_2R_F_2)
dt$dur.ofG2<-(dt$M2_3R_G_2 - dt$M2_2R_G_2)
dt$dur.ofH2<-(dt$M2_3R_H_2 - dt$M2_2R_H_2)
dt$dur.ofA3<-(dt$M2_3R_A_3 - dt$M2_2R_A_3)
dt$dur.ofB3<-(dt$M2_3R_B_3 - dt$M2_2R_B_3)
dt$dur.ofC3<-(dt$M2_3R_C_3 - dt$M2_2R_C_3)
dt$dur.ofD3<-(dt$M2_3R_D_3 - dt$M2_2R_D_3)
dt$dur.ofE3<-(dt$M2_3R_E_3 - dt$M2_2R_E_3)
dt$dur.ofF3<-(dt$M2_3R_F_3 - dt$M2_2R_F_3)
dt$dur.ofG3<-(dt$M2_3R_G_3 - dt$M2_2R_G_3)
dt$dur.ofH3<-(dt$M2_3R_H_3 - dt$M2_2R_H_3)
My attempt:
db1 <- paste(rep("M2_", 24), "2R_", rep(LETTERS[seq( from = 1, to = 8)],3), "_",
rep(seq(from=1, to =3), 8),
sep = "")
db2 <- paste(rep("M2_", 24), "3R_", rep(LETTERS[seq( from = 1, to = 8)],3), "_",
rep(seq(from=1, to =3), 8),
sep = "")
dur <- paste(rep("dur.of", 24), rep(LETTERS[seq( from = 1, to = 8)],3),
rep(seq(from=1, to =3), 8),
sep = "")
dur <- as.list(dur)
for(e in dur){
for (j in db1){
for (i in db2){
{
dt[,e] <- dt[,i] - dt[,j]
}
I think the loop needs an if statement in the middle, so that it only pairs matching items from the lists (subtracts the A1 start column from the A1 end column, A2 from A2, etc.).
2) The second case is a little bit more complicated, but essentially it is the same case as described above:
The M2_2R_A_1 (start) and M2_3R_A_1 (finish) columns indicate the years in which an educational activity took place. I would like to generate variables stedu1948 through stedu2013 that indicate that an activity took place in a particular year (stedu == x). A part of my Stata code is as follows (it goes on like that up to 2013):
recode stedu1948(0=2) if M2_2R_A_1<=1948 & 1948<= M2_3R_A_1 | M2_2R_A_2<=1948 & 1948<= M2_3R_A_2 | M2_2R_A_3<=1948 & 1948<= M2_3R_A_3
recode stedu1949(0=2) if M2_2R_A_1<=1949 & 1949<= M2_3R_A_1 | M2_2R_A_2<=1949 & 1949<= M2_3R_A_2 | M2_2R_A_3<=1949 & 1949<= M2_3R_A_3
recode stedu1950(0=2) if M2_2R_A_1<=1950 & 1950<= M2_3R_A_1 | M2_2R_A_2<=1950 & 1950<= M2_3R_A_2 | M2_2R_A_3<=1950 & 1950<= M2_3R_A_3
So in order to write a loop I would also need to include some conditions in order to stop the loop at a given point.
For your first item, use #thelatemail's suggestion. For the second item, consider the following for loop using the ifelse() function:
for (i in 1948:2013) {
  dt[[paste0("stedu", i)]] <- ifelse((dt$M2_2R_A_1 <= i & dt$M2_3R_A_1 >= i) |
                                       (dt$M2_2R_A_2 <= i & dt$M2_3R_A_2 >= i) |
                                       (dt$M2_2R_A_3 <= i & dt$M2_3R_A_3 >= i),
                                     2,
                                     dt[[paste0("stedu", i)]])
}
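For completeness, the first item can be handled with a similar name-building loop instead of the triple nesting. This is only a minimal sketch, assuming the column names follow the pattern shown in the question:

for (l in LETTERS[1:8]) {
  for (n in 1:3) {
    # build the matching start/end column names and subtract them
    dt[[paste0("dur.of", l, n)]] <-
      dt[[paste0("M2_3R_", l, "_", n)]] - dt[[paste0("M2_2R_", l, "_", n)]]
  }
}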
I have a data frame in R as defined below:
df <- data.frame('ID'=c(1,1,1,1),
'Month' =c('M1','M2','M3','M4'),
"Initial.Balance" =c(100,100,100,0),
"Value" = c(0.1,0.2,0.2,0.2),
"Threshold"=c(0.05,0.18,0.25,0.25),
"Intermediate.Balance"=c(0,0,100,0),
"Final.Balance"=c(100,100,0,0))
This task takes the Initial.Balance of the current row from the Final.Balance of the previous row.
When Value >= Threshold: Intermediate.Balance = 0 and Final.Balance = Initial.Balance - Intermediate.Balance.
When Value < Threshold: Intermediate.Balance = Initial.Balance and Final.Balance = Initial.Balance - Intermediate.Balance.
I have tried to accomplish this task using a for loop, but it takes a lot of time on a large dataset (with many IDs).
Here is my solution:
for (i in 1:nrow(df)){
  df$Intermediate.Balance[i] <- ifelse(df$Value[i] > df$Threshold[i], 0, df$Initial.Balance[i])
  df$Final.Balance[i] <- df$Initial.Balance[i] - df$Intermediate.Balance[i]
  if (i + 1 <= nrow(df)){
    df$Initial.Balance[i+1] <- df$Final.Balance[i]
  }
}
Can we find a similar solution using data.table? As data.table operations are quicker than a for loop over a data frame, I believe this will help me save computation time.
Thanks,
I think in this particular case, the final balance goes to 0 once there is a row with Value less than or equal to Threshold, and all subsequent balances stay at 0. So you can use this:
library(data.table)
setDT(df)

ib <- 100
df[, InitBal := ib * 0^shift(cumsum(Value <= Threshold), fill = 0L)]
df[, ItmdBal := replace(rep(0, .N), which(Value <= Threshold)[1L], ib)]
df[, FinlBal := InitBal - ItmdBal]
or in one []:
df[, c("InitBal", "ItmdBal", "FinlBal") := {
v <- Value<=Threshold
InitBal <- ib * 0^shift(cumsum(v), fill=0L)
ItmdBal <- replace(rep(0, .N), which(v)[1L], ib)
.(InitBal, ItmdBal, InitBal - ItmdBal)
}]
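Either way, checking against the sample df from the question reproduces the expected balances (printed output approximate):

df[, .(InitBal, ItmdBal, FinlBal)]
#    InitBal ItmdBal FinlBal
# 1:     100       0     100
# 2:     100       0     100
# 3:     100     100       0
# 4:       0       0       0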
Or a more general approach using Rcpp when the intermediate balance is not simply equal to the initial balance:
library(Rcpp)
cppFunction('List calc(NumericVector Value, NumericVector Threshold, double init) {
int n = Value.size();
NumericVector InitialBalance(n), IntermediateBalance(n), FinalBalance(n);
InitialBalance[0] = init;
for (int i=0; i<n; i++) {
if (Value[i] <= Threshold[i]) {
IntermediateBalance[i] = InitialBalance[i];
}
FinalBalance[i] = InitialBalance[i] - IntermediateBalance[i];
if (i < n-1) {
InitialBalance[i+1] = FinalBalance[i];
}
}
return List::create(Named("InitialBalance") = InitialBalance,
Named("IntermediateBalance") = IntermediateBalance,
Named("FinalBalance") = FinalBalance);
}')
setDT(df)[, calc(Value, Threshold, Initial.Balance[1L])]
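If you want to keep the computed columns rather than just print them, the same call can assign them by reference; the column names here are just illustrative:

setDT(df)[, c("InitialBalance", "IntermediateBalance", "FinalBalance") :=
              calc(Value, Threshold, Initial.Balance[1L])]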
I can't see an obvious way of getting rid of the loop, since each row deterministically feeds into the next. That being said, data.frames copy the whole frame, or at least whole columns, whenever you modify part of them, whereas data.table assigns by reference. As such you can do this:
dt <- as.data.table(df)
for (i in 1:nrow(dt)) {
  dt[i, Intermediate.Balance := ifelse(Value > Threshold, 0, Initial.Balance)]
  dt[i, Final.Balance := Initial.Balance - Intermediate.Balance]
  if (i + 1 <= nrow(dt)) dt[i + 1, Initial.Balance := dt[i, Final.Balance]]
}
You could also try the set function but I'm not sure if it'll be faster, or by how much, given that the data comes from the data.table anyway.
dt <- as.data.table(df)
for (i in 1:nrow(dt)) {
  i <- as.integer(i)
  set(dt, i, "Intermediate.Balance", ifelse(dt[i, Value] > dt[i, Threshold], 0, dt[i, Initial.Balance]))
  set(dt, i, "Final.Balance", dt[i, Initial.Balance - Intermediate.Balance])
  if (i + 1 <= nrow(dt)) set(dt, i + 1L, "Initial.Balance", dt[i, Final.Balance])
}
I would like to delete all of the rows that sit between certain headers in this example text file.
fileConn <- file("sample.txt")
one <- "*Keyword"
two <- "*Node"
three <- "$ Node,X,Y,Z"
four <- "1,639982.78040607,4733827.5104821,0"
five <- "2,639757.59709573,4733830.43494066,0"
six <- "3,639738.81268144,4733834.3619618,0"
seven <- "*End"
writeLines (c(one, two, three, four, five, six, seven), fileConn)
close(fileConn)
sample <- readLines("sample.txt")
What I am looking to do is delete all of the rows/lines between "*Node" and "*End". Since I am dealing with files with different lengths of rows between these headers, the deletion method needs to be based on headers only. I have no idea how to do this since I've only deleted rows in dataframes referenced by row numbers previously. Any clues?
Expected output is:
*Keyword
*Node
*End
readLines returns a vector, not a data frame, so we can create the sample input more simply:
sample = c("*Keyword",
"*Node",
"$ Node,X,Y,Z",
"1,639982.78040607,4733827.5104821,0",
"2,639757.59709573,4733830.43494066,0",
"3,639738.81268144,4733834.3619618,0",
"*End")
Find the starting and ending headers, and remove the elements in between with negative indexing:
node = which(sample == "*Node")
end = which(sample == "*End")
result = sample[-seq(from = node + 1, to = end - 1)]
result
# [1] "*Keyword" "*Node" "*End"
This assumes there is a single *Node and a single *End line. It also assumes that there is at least one line to delete. You may want to create a more robust solution with some handling for those special cases, e.g.,
delete_between = function(input, start, end) {
  start_index = which(input == start)
  end_index = which(input == end)
  if (length(start_index) == 0 | length(end_index) == 0) {
    warning("No start or end found, returning input as-is")
    return(input)
  }
  if (length(start_index) > 1 | length(end_index) > 1) {
    stop("Multiple starts or ends found.")
  }
  if (start_index == end_index - 1) {
    return(input)
  }
  return(input[-seq(from = start_index + 1, to = end_index - 1)])
}
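For example, applied to the sample vector above:

delete_between(sample, "*Node", "*End")
# [1] "*Keyword" "*Node"    "*End"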
I'm new to R and got an assignment to do some basic research using R.
I have a CSV file imported with wind direction and wind speed data, and I want to split the wind speed based on direction.
So I created this bit of R code:
north.ls = list()
east.ls = list()
south.ls = list()
west.ls = list()
i = as.integer(1)
print("start")
for (i in 1:length(DD)) {
  if (DD[i] >= 315 & DD[i] <= 360 | DD[i] >= 1 & DD[i] < 45) {
    north.ls[[i]] = as.integer(FH[i])
    print("nord")
  }
  if (DD[i] >= 45 & DD[i] < 135) {
    east.ls[[i]] = as.integer(FH[i])
    print("east")
  }
  if (DD[i] >= 145 & DD[i] < 225) {
    south.ls[[i]] = as.integer(FH[i])
    print("south")
  }
  if (DD[i] >= 225 & DD[i] < 315) {
    west.ls[[i]] = as.integer(FH[i])
    print("west")
  }
}
This works fine and puts the right speeds in the right lists, but every time the condition is not met the list still gets a NULL value, so I have a lot of NULL values in the lists. What is the problem and how can I fix it?
I hope you understand my explanation
thanks in advance
When you create a new item on a list at position [i] without items in previous positions, all those positions get NULLs.
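A quick illustration of that behaviour:

x <- list()
x[[4]] <- 10   # positions 1-3 are padded with NULL
length(x)      # 4
x[1:3]         # list of three NULLs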
Here's a slightly better way of producing what you're trying to do (I'm making some educated guesses about your data structure and your goals), without introducing these NULLs:
north.ls <- FH[(DD >= 315 & DD <= 360) | (DD >= 1 & DD < 45)]
east.ls  <- FH[DD >= 45 & DD < 135]
south.ls <- FH[DD >= 135 & DD < 225]
west.ls  <- FH[DD >= 225 & DD < 315]
This will give you four vectors that divide the data in FH into north, east, south, and west based on the data in DD. The lengths of the four vectors will generally NOT equal the length of FH or DD (or each other), and no NULLs are introduced unless they're already in FH.
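If you ever need more bins, or want all four groups in one step, a more scalable alternative is cut() plus split(). This is just a sketch, and the handling of points exactly on 45/135/225/315 degrees differs slightly from the ranges above:

# rotate by 45 degrees so each compass sector becomes one 90-degree interval
dir <- cut((DD + 45) %% 360,
           breaks = c(0, 90, 180, 270, 360),
           labels = c("north", "east", "south", "west"),
           include.lowest = TRUE)
groups <- split(FH, dir)   # named list: groups$north, groups$east, ...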
I have the following code which performs a sensitivity analysis:
sensitivity_analysis <- function(stainless_margins, standard_margins,
df_in=all_vals_expanded){
if (length(stainless_margins) != length(standard_margins)){
stop("Error: Margin arrays must be of equal length")
}
out <- numeric(length(stainless_margins)^2)
out <- list(x=out, y=out, z=out)
i <- 1
for (stainless in stainless_margins){
df_in <- change_target(target_val=stainless, target_df=df_in,
target_shape="straight",
target_cat="Stainless - R6")
for (standard in standard_margins){
df_in <- change_target(target_val=standard, target_df=df_in,
target_shape="straight",
target_cat="Standard - R6")
out$x[i] = stainless
out$y[i] = standard
out$z[i] = sum(df_in$delta.margin.dollars)
i <- i + 1
}
}
return(as.data.frame(out))
}
Where the custom change_target function is as follows:
change_target <- function(target_val, target_df, target_shape="straight", target_cat="Standard - R6"){
target_df$proposed.best.net[target_df$categ == target_cat & target_df$shape
== target_shape] <- target_val
target_df$new.margin.dollars <-
with(target_df, cost/(1 - proposed.best.net) * (percent.of.list / 100)
/ 0.5364 - cost)
target_df$delta.margin.dollars <-
with(target_df, new.margin.dollars - old.margin.dollars)
target_df$delta.margin.percent.orig <-
with(target_df, delta.margin.dollars / old.margin.dollars)
target_df$new.list.price <-
with(target_df, cost / (1 - proposed.best.net) * (1 / 0.60))
target_df$list.paid.delta <-
with(target_df, (percent.of.list / 100) * round((new.list.price -
list.price), 0))
target_df$list.paid.delta.percent.orig <-
with(target_df,list.paid.delta / (list.price * (percent.of.list / 100)))
target_df$list.paid.delta.percent.orig[is.nan(target_df$list.paid.delta.percent.orig)] <- 0
return(target_df)
}
Is there a way to use R built-in functions, or even functions from add-on libraries, to avoid the double for-loop while still keeping the code relatively simple and readable? Further, is there a solution that would lend itself to a more generic situation, say one in which I wanted to run the sensitivity analysis with four different inputs, rather than two?
I am open to suggestions in which the change_target function must be modified, but would rather avoid this.
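One possible way to flatten the double loop is to build the full grid of inputs with expand.grid() and compute z with mapply(). This is only a sketch, assuming change_target behaves as shown above, and the same pattern extends to four (or more) inputs by adding columns to the grid:

sensitivity_grid <- function(stainless_margins, standard_margins,
                             df_in = all_vals_expanded) {
  # every combination of the two margin inputs, one row per scenario
  grid <- expand.grid(x = stainless_margins, y = standard_margins)
  grid$z <- mapply(function(stainless, standard) {
    d <- change_target(stainless, df_in, target_shape = "straight",
                       target_cat = "Stainless - R6")
    d <- change_target(standard, d, target_shape = "straight",
                       target_cat = "Standard - R6")
    sum(d$delta.margin.dollars)
  }, grid$x, grid$y)
  grid
}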
I couldn't find any answers to this. I've got the following code and I'm trying to turn it into an apply so it runs faster; my data set is 130k rows long. I need an apply that will calculate the missing times of the horses from Behind (in lengths) and the winning horse's time. The problem is that the Behind column gives the distance behind the horse in front, not behind the first one. So I need to create a variable that carries on as the function runs and, whenever a new race is identified (position == 1), resets.
missingTimes <- function(x) {
  L <- 2.4384
  for (i in 1:(nrow(x) - 10)) {
    distanceL <- (x$distance[i] * 1000) / L
    LperS <- x$Winner.Race.time[i] / distanceL
    if (!is.na(x$position[i]) && x$position[i] == 1) {
      distanceL <- NULL
      LperS <- NULL
    }
    if (grepl("L", x$Behind[i])) {
      x$results[i] <- (distanceL + as.numeric(sub("L", "", x$Behind[i]))) * LperS
    }
  }
  x
}
I need at least 10 reputation to post images, that's why I give you links instead!
http://i.stack.imgur.com/xN23M.png
http://i.stack.imgur.com/Cspfr.png
The result should just be a column with the proper finish times for the other horses, in the same form as the Winner Race Time column.
For further understanding, I'll work out a few results myself for you:
Starting with the first row, it sees position = 1, so it resets the variables.
Then it takes the distance * 1000 and divides it by the constant L:
2.375 * 1000 / 2.4384 = 973.99
Then it needs the time in seconds it takes to complete 1 length (L):
290.9 / 973.99 = 0.298
Now, to get the finish time for the second horse, it adds the length BEHIND to the distance of the racing track and multiplies by the length per second:
(973.99 + 2.25) * 0.298 = 976.24 * 0.298 = 290.91952
Then for the next horse's time it'd be:
(976.24 + 13) * 0.298 = 989.24 * 0.298 = 294.79352
and so on; remember, when it hits position = 1, the distance needs to reset.
Alternatively, I've put distanceL in a separate column, and the same with LperS, after calculating them of course.
If you could walk me through the steps required to get that done it'd be great. I'm a complete rookie with R, so please be descriptive. I hope my explanation makes sense!
Thank you!
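A rough sketch of the row-by-row logic described above (fill_times is just an illustrative name; it assumes Behind entries look like "2.25L" and that the cumulative distance and seconds-per-length reset whenever position == 1):

fill_times <- function(x) {
  L <- 2.4384                        # metres per length
  cumL <- NA_real_                   # cumulative distance in lengths
  LperS <- NA_real_                  # seconds per length
  x$results <- NA_real_
  for (i in seq_len(nrow(x))) {
    if (!is.na(x$position[i]) && x$position[i] == 1) {
      cumL  <- (x$distance[i] * 1000) / L       # race distance in lengths
      LperS <- x$Winner.Race.time[i] / cumL     # time per length
      x$results[i] <- x$Winner.Race.time[i]     # winner keeps its own time
    } else if (grepl("L", x$Behind[i])) {
      cumL <- cumL + as.numeric(sub("L", "", x$Behind[i]))
      x$results[i] <- cumL * LperS
    }
  }
  x
}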