R - for loop - two columns whose values depend on each other

Apologies - I'm used to working with Excel/Minitab/SQL, where this kind of thing works differently.
I have an Excel dataset with columns "Date" (column "A"), "Pallets" (column "B"), "Lt" (column "K") and "Tt" (column "L"), where the values in "Lt" and "Tt" depend on each other, plus fixed parameters (alpha and beta) stored in cells "T3" and "T4" respectively.
In Excel, I simply enter the formulae for Lt and Tt in cells "K6" and "L6" respectively and drag them down. Excel then updates both columns together to reach the correct values. The formulae are =$T$3*K5+(1-$T$3)*(K5+L5) and =$T$4*(C6-C5)+(1-$T$4)*L5 respectively.
However, in R, I have tried updating the values of the two columns with two separate for loops:
for (i in 3:368)
  df[i, "Lt"] <- alpha*df[i-1, "Lt"] + (1-alpha)*(df[i-1, "Lt"] + df[i-1, "Tt"])
for (i in 3:368)
  df[i, "Tt"] <- beta*(df[i, "Pallets"] - df[i-1, "Pallets"]) + (1-beta)*df[i-1, "Tt"]
The problem is that the two columns update separately: they don't interact with each other as they update, so I end up with two not-quite-correct columns.
The values of alpha and beta are 269 and 0.787890411 respectively. In Excel, I get:
Date Pallets Lt Tt
01/01/2011 491
02/01/2011 385 269 0.79
03/01/2011 662 269.7879 0.843133
04/01/2011 28 270.6298 0.843133
05/01/2011 46 271.4718 0.843132
06/01/2011 403 272.3156 0.843132
07/01/2011 282 273.1588 0.843133
08/01/2011 315 274.0021 0.843133
Whereas in R, because the two columns don't update together, I get different values for Lt and Tt each time I update either column. Currently I have:
Date Pallets Lt Tt
1 28/12/2011 491 NA NA
2 29/12/2011 385 269.0000 0.7878904
3 30/12/2011 662 269.7879 0.8431328
4 31/12/2011 28 270.6310 0.7161642
5 01/01/2012 46 271.3472 0.7196210
6 02/01/2012 403 272.0668 0.7908770
7 03/01/2012 282 272.8577 0.7665189
8 04/01/2012 315 273.6242 0.7729656
9 05/01/2012 327 274.3971 0.7752110
10 06/01/2012 458 275.1724 0.8012559
How can I get both columns to update and reflect each other, as happens automatically in Minitab or Excel?

Combine the two for loops into one, so that each iteration updates Lt and Tt together:
for (i in seq(3, nrow(df), by = 1)) {
  df[i, "Lt"] <- alpha*df[i-1, "Lt"] + (1-alpha)*(df[i-1, "Lt"] + df[i-1, "Tt"])
  df[i, "Tt"] <- beta*(df[i, "Pallets"] - df[i-1, "Pallets"]) + (1-beta)*df[i-1, "Tt"]
}
Also, you shouldn't use hard-coded numbers for things like row counts. Those can change, and it'd be a pain to constantly update the code when they do.
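For completeness, here is a minimal end-to-end sketch of the combined loop. It assumes df already holds the Date and Pallets columns, alpha and beta are defined as in the question, and row 2 is seeded with the starting Lt and Tt values shown in the question's output:
# Minimal sketch: initialise the columns, seed row 2, then fill rows 3..n together.
# alpha and beta are assumed to be defined already, as in the question.
df$Lt <- NA_real_
df$Tt <- NA_real_
df$Lt[2] <- 269          # starting values, taken from row 2 of the question's output
df$Tt[2] <- 0.787890411
for (i in 3:nrow(df)) {
  df$Lt[i] <- alpha*df$Lt[i-1] + (1 - alpha)*(df$Lt[i-1] + df$Tt[i-1])
  df$Tt[i] <- beta*(df$Pallets[i] - df$Pallets[i-1]) + (1 - beta)*df$Tt[i-1]
}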

Related

Delete row n and rows from n to n+x in a dataframe in R

I have data representing stock prices, with 1 minute bars. I need to delete the row corresponding to the first minute of each day and the following 29 rows.
The first row of each day always has a value >60 in the time_difference variable.
If I write del <- df[which(df$time_difference > 60), ] and then df_new <- anti_join(df, del, by = "Time"), I select the first row of each day. However, I need to remove the next 29 rows as well.
Here is a sample of the df; I also added a time_difference variable, computed as the difference between each row and the next for the Time variable (not displayed here). The df file can be downloaded from here
Time Open High Low Close Volume Wap Gap Count
1 1536154200 234.61 234.95 234.57 234.76 302 234.600 0 31
2 1536154260 234.76 235.23 234.76 235.16 135 235.008 0 94
3 1536154320 235.09 235.33 234.88 235.33 121 235.010 0 109
4 1536154380 235.24 235.35 235.08 235.35 24 235.203 0 22
5 1536154440 235.27 235.47 235.22 235.42 62 235.340 0 35
6 1536154500 235.39 235.81 235.39 235.63 136 235.633 0 110
My original answer only works on one set of rows at a time. Thanks for sharing your data. Below is an updated answer. Note that this is sensitive to your data being in chronological order as we are using row indices rather than the actual time!
dat <- read.csv("MSFT.3years.csv")
startofday <- which(dat$time_difference > 60)                 # first row of each day
removerows <- unlist(Map(`:`, startofday, startofday + 29))   # each start row plus the next 29
dat_new <- dat[-removerows, ]
Inspired by: Generate sequence between each element of 2 vectors
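A quick way to sanity-check the Map(`:`, ...) trick is on a toy vector (here each start index expands to itself plus the next 3 rows, instead of 29):
startofday <- c(2, 10)
unlist(Map(`:`, startofday, startofday + 3))
# [1]  2  3  4  5 10 11 12 13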

R function or loop for repeatedly selecting rows that meet a condition, saving as separate object, and renaming column headers

I have 16 large datasets of landcover variables around routes. Example dataset "Trial1":
RtNo TYPE CA PLAND NP PD LPI TE
2001 cls_11 996.57 6.4297 22 0.1419 6.3055 31080
2010 cls_11 56.34 0.3654 23 0.1492 0.1669 15480
18003 cls_11 141.12 0.9899 37 0.2596 0.1503 38700
18014 cls_11 797.58 5.3499 47 0.3153 1.3969 98310
2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670
2010 cls_21 638.55 4.1414 95 0.6161 0.7489 463260
18003 cls_21 904.68 6.3463 612 4.2931 0.8769 549780
18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650
2001 cls_22 732.33 4.7249 653 4.2131 0.7212 377430
2010 cls_22 32.31 0.2096 168 1.0896 0.0198 31470
18003 cls_22 275.85 1.9351 781 5.4787 0.0423 237390
18014 cls_22 469.44 3.1488 104 6.7345 0.1014 377580
I want to first select rows that meet a condition, for example, all rows where the "TYPE" column is cls_21. I know the following code does this job:
Trial21 <-subset(Trial1, TYPE==" cls_21 ")
(Yes, the invisible space before and after the categorical value caused me a considerable headache.)
And there are several other ways of doing this, as shown in
https://stackoverflow.com/questions/5391124/select-rows-of-a-matrix-that-meet-a-condition
I get the following output (sorry, this one has extra columns, but that shouldn't affect my question):
RtNo TYPE CA PLAND NP PD LPI TE ED LSI
2 18003 cls_21 904.68 6.3463 612 4.2931 0.8769 549780 38.5668 46.1194
18 18014 cls_21 1189.89 7.9814 759 5.0911 0.4123 769650 51.6255 56.2522
34 2001 cls_21 1514.97 9.7744 592 3.8195 0.8443 761670 49.1418 49.3462
50 2010 cls_21 638.55 4.1414 95 0.6161 0.7489 463260 30.0457 46.0118
62 2020 cls_21 625.5 4.1165 180 1.1846 0.5064 384840 25.3268 38.6407
85 2021 cls_21 503.55 2.7926 214 1.1868 0.1178 348330 19.3175 38.9267
I want to rename the columns in this subset so they uniquely identify the class, by appending "L21" to the existing column names, and I can do this using
library(data.table)
setnames(Trial21, old = c('CA', 'PLAND', 'NP', 'PD', 'LPI', 'TE', 'ED', 'LSI'),
new = c('CAL21', 'PLANDL21', 'NPL21', 'PDL21', 'LPIL21', 'TEL21', 'EDL21', 'LSIL21'))
I want help developing a function or a loop that automates this process so I don't have to spend days repeating the same code for 15 different classes and 16 datasets (240 times); it would also decrease the risk of errors. I may have to do the same for additional datasets. Any help to speed up the process will be greatly appreciated.
You could do:
a <- split(df, df$TYPE)
b <- sapply(names(a), function(x) setNames(a[[x]],
       paste0(names(a[[x]]), sub(".*_", "L", x))), simplify = FALSE)
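If you then want each renamed subset back as its own object (hypothetical TrialNN names built from the TYPE values), one option is list2env; trimws handles the stray spaces mentioned in the question:
# Hypothetical follow-up: name each list element TrialNN, then export to the global environment
names(b) <- paste0("Trial", sub(".*_", "", trimws(names(b))))
list2env(b, envir = .GlobalEnv)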
You can use ls to get the variable names of the datasets, manipulate them as you wish inside a loop with the get function, and then create new datasets with assign.
library(dplyr) # needed for %>% and rename_with
sets = grep("Trial", ls(), value = TRUE) # assuming every dataset has "Trial" in the name
for (i in sets) {
  classes = unique(get(i)$TYPE)
  for (j in classes) {
    # extract just the number; this might be an overly complicated way of doing it,
    # you can look for better options if you want
    number = gsub("(.+)([0-9]{2})( )", "\\2", j)
    assign(paste0("Trial", number),
           subset(get(i), TYPE == j) %>% rename_with(function(x) paste0(x, number)))
  }
}
Here is a start that should work for your example:
library(dplyr)
myfilter <- function(data, number) {
  data %>%
    filter(TYPE == sprintf(" cls_%s ", number)) %>%
    rename_with(\(x) sprintf("%s%s", x, number), !1:2)
}
myfilter(example_data, 21)
Given a list of numbers (here: 21 to 31) you could then automatically use them to filter a single dataframe:
multifilter <- function(data) {
  purrr::map(21:31, \(i) myfilter(data, i))
}
multifilter(example_data)
Finally, given a list of dataframes, you can automatically apply the filters to them:
purrr::map(list_of_dataframes, multifilter)
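If you want the results to stay labelled by class, a small variation (a sketch, assuming the same 21:31 class numbers) names each element of the output list:
multifilter_named <- function(data) {
  purrr::map(purrr::set_names(21:31, paste0("cls_", 21:31)), \(i) myfilter(data, i))
}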

Double entries in dataframe after merge in R

My data
Hello, I have a problem with merging two dataframes.
The goal is to merge them so that each date has the corresponding values. If there is no corresponding value, I want to replace NA with 0.
names(FiresNearLA.ab.03)[1] <- "Date.Local"
U.NO2.ab.03 <- unique(NO2.ab.03) # No2.ab.03 has all values multiplied
ind <- merge(FiresNearLA.ab.03,U.NO2.ab.03, all = TRUE, all.x=TRUE)
ind[is.na(ind)] <- 0
So far, so good, and the first lines look the way they are supposed to. But beginning from 2004-04-24, all dates are doubled, and weird values appear in the second NO2.Mean column.
U.NO2.Mean table:
Date.Local NO2.Mean
361 2004-03-31 30.217391
365 2004-04-24 50.000000
366 2004-04-25 47.304348
370 2004-04-26 50.913043
374 2004-04-27 41.157895
ind table:
Date.Local FIRE_SIZE F.number.n_fires NO2.Mean
113 2004-04-22 34.30 10 13.681818
114 2004-04-23 45.00 13 17.222222
115 2004-04-24 55.40 22 28.818182
116 2004-04-24 55.40 22 50.000000
117 2004-04-25 2306.85 15 47.304348
118 2004-04-25 2306.85 15 21.090909
Why are there values in NO2.Mean for 2004-04-22 and 2004-04-23 if they should be 0? And why are the dates after the 24th doubled, and where do the second values come from?
Thank you
So I managed to merge your data:
FiresNearLA.ab.03 <- dget("FiresNearLA.ab.03.txt", keep.source = FALSE)
U.NO2.ab.03 <- dget("NO2.ab.03.txt", keep.source = FALSE)
ind <- merge(FiresNearLA.ab.03,
             U.NO2.ab.03,
             all = TRUE,
             by.x = "DISCOVERY_DATEymd",
             by.y = "Date.Local")
As a side note: usually you share a small sample of your data on Stack Overflow, not the whole thing. In your case, dput(FiresNearLA.ab.03[1:50, ]) and then copying and pasting from the console into the question would have been sufficient.
Back to your problem: the duplication already happens in NO2.ab.03, where a number of dates and values occur twice or more. The easiest way to solve this (in my experience) is the data.table package, which has a duplicated method that is more straightforward and also faster:
library(data.table)
# Test duplicated occurrences in U.NO2.ab.03
> table(duplicated(U.NO2.ab.03, by = c("DISCOVERY_DATEymd", "NO2.Mean")))
FALSE TRUE
7767 27308
>
> nrow(ind)
[1] 35229
# Remove duplicated rows from data frame
> ind <- ind[!duplicated(ind, by = c("DISCOVERY_DATEymd", "NO2.Mean")), ]
> nrow(ind)
[1] 7921
After these steps, you should be fine :)
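If the duplicates are legitimate repeated measurements rather than errors, an alternative sketch (base R, assuming the column names used in the merge above) is to average NO2.Mean per date first, so each Date.Local appears only once:
# Sketch: collapse duplicate dates to one row each by averaging NO2.Mean
NO2.daily <- aggregate(NO2.Mean ~ Date.Local, data = NO2.ab.03, FUN = mean)
ind <- merge(FiresNearLA.ab.03, NO2.daily,
             by.x = "DISCOVERY_DATEymd", by.y = "Date.Local", all.x = TRUE)
ind$NO2.Mean[is.na(ind$NO2.Mean)] <- 0   # fill days without a measurement with 0, as requested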
I got the answer: the original data source of NO2.ab.03 was faulty.
As JonGrub suggested, the problem was within NO2.ab.03. For some days it had two different NO2.Means corresponding to the same date. I deleted these rows and now it's working well. Thank you again for the help and the great advice.

How to column bind and row bind a large number of data frames in R?

I have a large data set of vehicles. They were recorded every 0.1 seconds, so their IDs repeat in the Vehicle ID column. In total there are 2169 vehicles. I filtered the Vehicle velocity column for every vehicle (using a for loop), which resulted in a new column with the first and last 30 values removed (per vehicle). In order to bind it with the original data frame, I removed the first and last 30 rows of the table too, and then combined them using cbind(). This works for the last vehicle only. I want this smoothing and column binding for all vehicles, and finally I want to combine all the vehicles' data frames into one single table; that means row-binding in sequence of vehicle IDs. This is what I wrote so far:
traj1 <- read.csv('trajectories-0750am-0805am.txt', sep=' ', header=F)
head(traj1)
names (traj1)<-c('Vehicle ID', 'Frame ID','Total Frames', 'Global Time','Local X', 'Local Y', 'Global X','Global Y','Vehicle Length','Vehicle width','Vehicle class','Vehicle velocity','Vehicle acceleration','Lane','Preceding Vehicle ID','Following Vehicle ID','Spacing','Headway')
# TIME COLUMN
Time <- sapply(traj1$'Frame ID', function(x) x/10)
traj1$'Time' <- Time
# SMOOTHING VELOCITY
smooth <- function(x, D, delta) {
  z <- exp(-abs(-D:D/delta))
  r <- convolve(x, z, type = 'filter') / convolve(rep(1, length(x)), z, type = 'filter')
  r
}
for (i in unique(traj1$'Vehicle ID')) {
  veh <- subset(traj1, traj1$'Vehicle ID' == i)
  svel <- smooth(veh$'Vehicle velocity', 30, 10)
  svel <- data.frame(svel)
  veh <- head(tail(veh, -30), -30)
  fta <- cbind(veh, svel)
}
'fta' now only holds the data frame for the last vehicle, but I want all the data frames (for all vehicles i) combined by row. Maybe a for loop is not the right way to do it, but I don't know how I can use tapply (or any other apply function) to do so many things at the same time.
EDIT
I can't reproduce my dataset here, but the 'Orange' dataset in R provides a good analogy. Using the same smoothing function, the for loop would look like this (if the 'age' column is smoothed and the 'Tree' column is the equivalent of my 'Vehicle ID' column):
for (i in unique(Orange$Tree)) {
  tre <- subset(Orange, Orange$'Tree' == i)
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb <- cbind(tre, age2)
}
Umair, I am not sure I understood what you want.
If I understood correctly, you want to combine all the results by row. To do that, you could save all the results in a list and then do.call an rbind:
comb <- list()  ### create a list to save the results
length(comb) <- length(unique(Orange$Tree))
## Your loop for smoothing:
for (i in 1:length(unique(Orange$Tree))) {
  tre <- subset(Orange, Tree == unique(Orange$Tree)[i])
  age2 <- round(smooth(tre$age, 2, 0.67), digits = 2)
  age2 <- data.frame(age2)
  tre <- head(tail(tre, -2), -2)
  comb[[i]] <- cbind(tre, age2)  ### save results in the list
}
final.data <- do.call("rbind", comb)  ### combine all results by row
This will give you:
Tree age circumference age2
3 1 664 87 687.88
4 1 1004 115 982.66
5 1 1231 120 1211.49
10 2 664 111 687.88
11 2 1004 156 982.66
12 2 1231 172 1211.49
17 3 664 75 687.88
18 3 1004 108 982.66
19 3 1231 115 1211.49
24 4 664 112 687.88
25 4 1004 167 982.66
26 4 1231 179 1211.49
31 5 664 81 687.88
32 5 1004 125 982.66
33 5 1231 142 1211.49
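Applied back to the original vehicle data, the same list-and-rbind pattern would look like this (a sketch reusing the question's own loop body and column names):
comb <- list()
ids <- unique(traj1$'Vehicle ID')
for (k in seq_along(ids)) {
  veh <- subset(traj1, traj1$'Vehicle ID' == ids[k])
  svel <- data.frame(svel = smooth(veh$'Vehicle velocity', 30, 10))
  comb[[k]] <- cbind(head(tail(veh, -30), -30), svel)  # trim 30 rows at each end to match
}
fta <- do.call("rbind", comb)  # one table, row-bound in vehicle ID order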
Just for fun, a different way to do it using plyr::ddply and sapply with split:
library(plyr)
data <- ddply(Orange, .(Tree), tail, n = -2)
data <- ddply(data, .(Tree), head, n = -2)
data <- cbind(data,
              age2 = matrix(sapply(split(Orange$age, Orange$Tree), smooth, D = 2, delta = 0.67),
                            ncol = 1, byrow = FALSE))

Select a column by criteria *and* the column name in each row of an R dataframe?

If I have a dataframe of election results by district and candidate, is there an easy way to find the winner in each district in R? That is, for each row, select both the maximum value and the column name for that maximum value?
District CandidateA CandidateB CandidateC
1 702 467 35
2 523 642 12
...
So I'd want to select not only 702 in row 1 and 642 in row 2, but also "CandidateA" from row 1 and "CandidateB" in row 2.
I'm asking this as a learning question, as I know I can do this with any general-purpose scripting language like Perl or Ruby. Perhaps R isn't the tool for this, but it seems like it could be. Thank you.
d <- read.table(textConnection(
"District CandidateA CandidateB CandidateC
1 702 467 35
2 523 642 12"),
header=TRUE)
d2 <- d[, -1]  ## drop district number
data.frame(winner = names(d2)[apply(d2, 1, which.max)],
           votes  = apply(d2, 1, max))
result:
winner votes
1 CandidateA 702
2 CandidateB 642
Do you need to worry about ties? See the help for which and which.max; they treat ties differently ...
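As an aside, base R's max.col does the row-wise maximum lookup directly and exposes tie handling through its ties.method argument (a sketch using the same d2 as above):
# max.col returns, for each row, the column index of the maximum;
# ties.method can be "random" (the default), "first", or "last"
idx <- max.col(d2, ties.method = "first")
data.frame(winner = names(d2)[idx],
           votes  = d2[cbind(seq_len(nrow(d2)), idx)])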
If this isn't too messy, you can try running a for loop and printing out the results using cat. So, if your data.frame object is x:
for (i in 1:length(x$District)) {
  row <- x[i, ]
  max_row <- max(row[2:length(row)])
  winner_row <- names(x)[which(row == max_row)]
  cat(winner_row, max_row, "\n")
}
