I am trying to store the loop output. However, my dataset is quite big, and it crashes RStudio whenever I try to View it. I have tried different techniques, such as the functions in library(iterators) and library(foreach), but they do not do what I want. I am trying to take a row from my main table (Table A, 54000 rows) and then a row from another, smaller table (Table B, 6 rows). I have also taken a look at Storing loop output in a dataframe in R, but it doesn't really allow me to view my results.
The code takes the first row from Table A, iterates it 6 times through Table B, outputs the result of each iteration, and then moves on to Table A's second row. As such, my final dataset should contain 324000 (54000 * 6) observations.
Below is the code that provides me with the correct observations (but I am unable to view it to see if the values are being calculated correctly), along with a snippet of Table A and Table B.
# yrs: vector of years (defined elsewhere); new.data collects the adjusted columns
new.data <- list()
output_ratios <- NULL

for (yr in seq_along(yrs)) {
  if (is.na(yr)) {
    numerator <- 0
    numerator1 <- 0
    numerator2 <- 0
    denominator <- 0
  } else {
    # pull this year's adjustment factors from Table B
    numerator   <- Table.B[Table.B$PERIOD == paste0("PY_", yr), "1"]
    denominator <- Table.B[Table.B$PERIOD == paste0("PY_", yr), "2"]
    # adjust every Table A row by this year's factors
    denom <- Table.A[, "1"] + (abs(Table.A[, "1"]) * denominator)
    num   <- Table.A[, "2"] + (abs(Table.A[, "2"]) * numerator)
    new.data[["1"]] <- num
    new.data[["2"]] <- denom
    NI <- num / denom
    output_ratios <- rbind(output_ratios, NI)
  }
}
TABLE B:
PERIOD 1 2 3 4 5
1 PY_1 0.21935312 -0.32989691 0.12587413 -0.28323699 -0.04605116
2 PY_2 0.21328526 0.42051282 -0.10559006 0.41330645 0.26585064
3 PY_3 -0.01338112 -0.03971119 -0.06641667 -0.08238231 -0.05323772
4 PY_4 0.11625091 0.01127819 0.07114166 0.08501516 0.55676498
5 PY_5 -0.01269256 -0.02379182 0.39115278 -0.03716100 0.63530682
6 PY_6 0.69041864 0.51034273 0.59290357 0.78571429 -0.48683736
TABLE A:
1 2 3 4
1 25 3657 2258
2 23 361361 250
3 24 35 000
4 25 362 502
5 25 1039 502
I would greatly appreciate any help.
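For what it's worth, here is a minimal sketch of the same calculation without growing output_ratios one rbind at a time, assuming Table.A and Table.B as shown above (with the numeric column names "1", "2", ...):

# cross-join every Table.A row with every Table.B row (54000 * 6 = 324000 rows);
# shared column names get .x (Table.A) / .y (Table.B) suffixes
crossed <- merge(Table.A, Table.B, by = NULL)

num   <- crossed[["2.x"]] + abs(crossed[["2.x"]]) * crossed[["1.y"]]
denom <- crossed[["1.x"]] + abs(crossed[["1.x"]]) * crossed[["2.y"]]
crossed$NI <- num / denom

# head() or write.csv() are kinder to RStudio than View() on 324000 rows
head(crossed)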
I have a data frame called nurse. It contains several columns, but only one (nurse$word) is relevant at the moment. I want to create a new column named nurse$w.frequency which looks at the words in the nurse$word column; if it finds the one specified, I want it to change the corresponding nurse$w.frequency value to a specified integer.
nurse <- read.csv(...)
file word w.frequency
1 determining
2 journey
3 journey
4 serving
5 work
6 journey
... ...
The word frequencies for determining and journey, for instance, are 1590 and 4650 respectively. So it should look like the following:
file word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
... ...
I have tried it with an ifelse statement (below), which seems to work; however, every time I try to change the actual word and frequency, it overwrites the results from before.
nurse$w.frequency <- ifelse(nurse$word == "determining", nurse$w.frequency[nurse$word["determining"]] <- 1590, "")
You could first initialise an empty column
nurse$w.frequency <- NA
then populate it with the data you want
nurse$w.frequency[nurse$word == "determining"] <- 1590
nurse$w.frequency[nurse$word == "journey"] <- 4650
Using dplyr:
library(dplyr)

nurse %>%
  mutate(w.frequency = case_when(
    word == "determining" ~ "1590",
    word == "journey" ~ "4650",
    TRUE ~ ""
  ))
Gives us:
word w.frequency
1 determining 1590
2 journey 4650
3 journey 4650
4 serving
5 work
6 journey 4650
Data:
nurse <- data.frame(word = c("determining", "journey", "journey", "serving", "work", "journey"))
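Another way that scales a bit better, as a sketch: keep the word/frequency pairs in a named lookup vector and index into it (the values are just the ones from the question):

# lookup table of word frequencies
freqs <- c(determining = 1590, journey = 4650)

# words without an entry get NA rather than ""
nurse$w.frequency <- freqs[as.character(nurse$word)]

Adding another word is then a single new entry in freqs instead of a new assignment line.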
I want to create a for loop that will replace a value in a row with the value from the previous year (same month), based on whether two columns match.
I have created the structure of a for loop, but I have not made progress on how to get the loop to reference a value from a previous year.
Here is an example dataset:
fish <- c("A", "A", "B", "B", "C", "C")
fish_wt <- c(2, 3, 4, 5, 5, 7)
fish_count <- c(2, 200, 47, 78, 5, 845)
date <- as.Date(c('2010-11-1', '2009-11-1', '2009-11-1', '2008-11-1', '2008-2-1', '2007-2-1'))
data <- data.frame(fish, fish_wt, fish_count, date)
data$newcount <- 0
Here is my for loop so far:
for (i in 1:nrow(data)) {
  if (data$fish_wt[i] == data$fish_count[i]) {
    data$newcount[i] <- 10
  } else {
    data$newcount[i] <- data$fish_count[i]
  }
}
Right now, I am using the value of the previous row (i - 1), which is fine for this small dataset but does not work for a larger one, where the two fish A rows will not be next to one another.
for (i in 1:nrow(data)) {
  if (data$fish_wt[i] == data$fish_count[i]) {
    data$newcount[i] <- data$newcount[data$date == data$date[i - 1]]
  } else {
    data$newcount[i] <- data$fish_count[i]
  }
}
This is what I want my dataset to look like:
fish fish_wt fish_count date newcount
1 A 2 2 2010-11-01 200
2 A 3 200 2009-11-01 200
3 B 4 47 2009-11-01 47
4 B 5 78 2008-11-01 78
5 C 5 5 2008-02-01 845
6 C 7 845 2007-02-01 845
I have thought of separating rows by fish, then using the row-1 solution. I am just wondering if there is something easier.
As a solution to this problem, I set up a table of mean temperature by fish, year, and month (long format), then merged the dataset and used the average value for any row where fish_wt==fish_count.
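That merge idea can be sketched with the example data above (a sketch, not a full solution: it assumes every row that needs replacing has a same-month counterpart exactly one year earlier):

# build a lookup of each row's count, keyed by fish and date shifted
# forward one year, so it lines up with the following year's row
lookup <- data
lookup$date <- as.Date(paste0(as.integer(format(lookup$date, "%Y")) + 1,
                              format(lookup$date, "-%m-%d")))
lookup <- lookup[, c("fish", "date", "fish_count")]
names(lookup)[3] <- "prev_count"

# join, then use the previous year's count wherever fish_wt == fish_count
# (note that merge() reorders the rows)
merged <- merge(data, lookup, by = c("fish", "date"), all.x = TRUE)
merged$newcount <- ifelse(merged$fish_wt == merged$fish_count & !is.na(merged$prev_count),
                          merged$prev_count, merged$fish_count)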
I already tried my best but am still pretty much a newbie to R.
Based on roughly 500 MB of input data that currently looks like this:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days
1 2818 5829821 335511.0 1
2 20168 5829746 335265.2 3
3 25428 5830640 331534.6 0
4 27886 5832156 332003.1 3
5 28658 5830888 329727.2 3
6 28871 5829980 332071.3 7
I need to calculate a conditional sum of reviews_last30days - the condition being a specific, changing area range for each record; i.e. R should sum only those reviews for which calc.latitude and calc.longitude deviate by no more than +/-500 from the latitude and longitude values in each row.
EXAMPLE:
ROW 1 has a calc.latitude of 5829821 and a calc.longitude of 335511.0, so R should take the sum of all reviews_last30days for which the following ranges apply:
calc.latitude 5829321 to 5830321 (value of Row 1 latitude +/-500)
calc.longitude 335011.0 to 336011.0 (value of Row 1 longitude +/-500)
So my intended output would look somewhat like this in column 5:
TOTALLISTINGS
listing_id calc.latitude calc.longitude reviews_last30days reviewsper1000
1 2818 5829821 335511.0 1 4
2 20168 5829746 335265.2 3 4
3 25428 5830640 331534.6 0 10
4 27886 5832156 332003.1 3 3
5 28658 5830888 331727.2 3 10
6 28871 5829980 332071.3 7 10
Hope I calculated correctly in my head, but you get the idea..
So far I have particularly struggled with the fact that my sum conditions are dynamic and "newly assigned", since the latitude and longitude bounds have to be adjusted for each record.
My current code looks like this, but it obviously doesn't work that way:
review1000 <- function(TOTALLISTINGS = NULL) {
  # tibble to return
  to_return <- TOTALLISTINGS %>%
    group_by(listing_id) %>%
    summarise(reviews1000 = sum(reviews_last30days[
      (calc.latitude >= (calc.latitude - 500) | calc.latitude <= (calc.latitude + 500))
    ]))
  return(to_return)
}
REVIEWPERAREA <- review1000(TOTALLISTINGS)
I know I would also have to add something for the longitude in the code above.
Does anyone have an idea how to fix this?
Any help or hints highly appreciated & thanks in advance! :)
See whether the code below helps (between() here comes from dplyr).

library(dplyr)

TOTALLISTINGS$reviews1000 <- sapply(1:nrow(TOTALLISTINGS), function(r) {
  currentLATI <- TOTALLISTINGS$calc.latitude[r]
  currentLONG <- TOTALLISTINGS$calc.longitude[r]
  sum(TOTALLISTINGS$reviews_last30days[
    between(TOTALLISTINGS$calc.latitude, currentLATI - 500, currentLATI + 500) &
    between(TOTALLISTINGS$calc.longitude, currentLONG - 500, currentLONG + 500)
  ])
})
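One caveat on the approach above: the sapply compares every listing against the whole table once per row, so the work grows quadratically with nrow(TOTALLISTINGS). That is fine for moderate sizes, but for very large tables some form of spatial binning or indexing would be worth considering.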
I have data containing quotations of indexes (S&P 500, CAC 40, ...) for every 5 minutes of the last 3 years, which makes it quite large. I am trying to create new columns containing the performance of each index at each time (i.e. (quotation at [TIME] / quotation at yesterday's close) - 1). I began this way (my data is named temp):
listIndexes <- list("CAC", "SP", "MIB") # there are a lot more
listTime <- list(900, 905, 910, ..., 1735) # every 5 minutes
for (j in 1:length(listTime)) {
  Time <- listTime[j]
  for (i in 1:length(listIndexes)) {
    Index <- listIndexes[i]
    temp[[paste0(Index, "perf", Time)]] <- temp[[paste0(Index, Time)]] / temp[[paste0(Index, "close")]] - 1
    # other stuff to do but with the same concept
  }
}
but it is quite slow. Is there a way to get rid of the for loop(s), or to make the creation of those variables quicker? I have read about the apply family of functions, but I do not see if and how they could be used here.
My data looks like this :
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 4001.2 4005 .... 2000 .... 2003
20140106 4005 4004 4003.5 4002 .... 2005 .... 2002
...
and my desired output would be a new column for each time and each index, added to temp:
date CACperf1000 CACperf1005 ... SPperf1000 ...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
and the same for each following day.
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879.
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <-list("CAC","SP","MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
  this_row <- df[row_i, ]
  temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would be to reshape your data into a form that makes this transformation much simpler: for instance, converting it into a long, tidy format with columns for Date, Index, Time and Value (plus the closing value) and operating directly on just the relevant columns there; a sketch follows.
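A minimal sketch of that reshape, assuming the column names follow the <Index><Time> / <Index>closeyesterday pattern shown above and that tidyr (>= 1.0) and dplyr are available; here the closing value is kept as just another Time row rather than its own column:

library(dplyr)
library(tidyr)

# split names like "CAC1000" / "CACcloseyesterday" into Index + Time
long <- temp %>%
  pivot_longer(-date,
               names_to = c("Index", "Time"),
               names_pattern = "^([A-Z]+)(\\d+|close.*)$",
               values_to = "Value")

# within each date/index, divide every quote by yesterday's close
perf <- long %>%
  group_by(date, Index) %>%
  mutate(perf = Value / Value[Time == "closeyesterday"] - 1) %>%
  ungroup() %>%
  filter(grepl("^\\d+$", Time)) # keep only the intraday time rows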
I have read up on vectorization as a solution for speeding up a for-loop. However, the data structure I am creating within a for-loop seems to need to be a data.frame/table.
Here is the scenario:
I have a large table of serial numbers and timestamps. Several timestamps can apply to the same serial number. I only want the latest timestamp for every serial number.
My approach so far is to create a vector of unique serial numbers. Then, in each pass of a loop over this vector, I create a temporary table that holds all observations of one serial number's serial number/timestamp combinations ('temp'). I then take the last entry of this temporary table (using tail) and put it into another table that will eventually hold every unique serial number and its latest timestamp ('last.pass'). Finally, I remove rows from the starting table where the serial number/timestamp combination cannot be found in 'last.pass'.
Here is my code:
library(data.table)

# create list of unique serial numbers found in merged 9000 table
hddsn.unique <- unique(merge.data$HDDSN)

# create empty data.table to populate
last.pass <- data.table(HDDSN = as.character(1:length(hddsn.unique)),
                        ENDDATE = as.character(1:length(hddsn.unique)))

# populate last.pass with the combination of serial numbers and their latest timestamps
for (i in 1:length(hddsn.unique)) {
  # create temporary table holding all timestamps for one serial number
  temp <- merge.data[merge.data$HDDSN %in% hddsn.unique[i], ][, .(HDDSN, ENDDATE)]
  # keep the latest timestamp record for this serial number
  last.pass[i, ] <- tail(temp, n = 1)
}

match <- which(merge.data[, (merge.data$HDDSN %in% last.pass$HDDSN) &
                            (merge.data$ENDDATE %in% last.pass$ENDDATE)] == TRUE)
final <- merge.data[match]
My ultimate question is, how do I maintain the automated nature of this script while speeding it up, say, through vectorization or turning it into a function.
Thank you!!!
How about this? Without a clear idea of what your input data looks like, I took a guess.
# make some dummy data with multiple visits per serial
merge.data <- data.frame(HDDSN = 1001:1020,
                         timestamps = sample(1:9999, 100))

# create a function to find the final visit for a given serial
fun <- function(serial) {
  this.serial <- subset(merge.data, HDDSN == serial)
  this.serial[which.max(this.serial$timestamps), ]
}

# apply the function to each serial number and clean up the result
final <- as.data.frame(t(sapply(unique(merge.data$HDDSN), fun)))
This data has several ENDDATE values for each HDDSN:
merge.data <- data.frame(HDDSN = 1001:1100, ENDDATE = sample(9999, 1000))
Place it in order, first by HDDSN and then by ENDDATE:
df = merge.data[do.call("order", merge.data),]
Then find the last entry for each HDDSN:
df[!duplicated(df[["HDDSN"]], fromLast=TRUE),]
The following illustrates the key steps:
> head(df, 12)
HDDSN ENDDATE
701 1001 4
101 1001 101
1 1001 1225
301 1001 2800
201 1001 6051
501 1001 6714
801 1001 6956
601 1001 7894
401 1001 8234
901 1001 8676
802 1002 247
402 1002 274
> head(df[!duplicated(df[["HDDSN"]], fromLast=TRUE),])
HDDSN ENDDATE
901 1001 8676
902 1002 6329
803 1003 9947
204 1004 8825
505 1005 8472
606 1006 9743
If there are composite keys, then look for duplicates on a data.frame rather than a vector, !duplicated(df[, c("key1", "key2")]), as illustrated in the following:
> df = data.frame(k0=c(1:3, 1:6), k1=1:3)
> df[!duplicated(df, fromLast=TRUE),]
k0 k1
1 1 1
2 2 2
3 3 3
7 4 1
8 5 2
9 6 3
(The row numbers are from the original data frame, so rows 4-6 were duplicates. Some care might need to be taken, especially if one of the columns is numeric, because duplicated.data.frame pastes columns together into a single string, and rounding error may creep in.)
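Since the question already builds a data.table, the same last-entry-per-group idea can also be written there; a sketch, assuming merge.data is converted with setDT():

library(data.table)

setDT(merge.data) # convert the data.frame in place
final <- merge.data[order(ENDDATE), .SD[.N], by = HDDSN]

Ordering by ENDDATE and taking .SD[.N] keeps the last (latest) row within each HDDSN group.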