Trying to create a loop for a population range and seem to be missing something - r

This is the prompt I am working from:
In the state population data, can you write a loop that pulls only states with populations between 200,000 and 250,000 in 1800? You could accomplish this a few different ways. Try to use next and break statements.
The dataset in question tracks how each state's population has shifted from year to year, decade to decade.
There are various columns representing each year of observations, including one for 1800 ("X1800"). The states are all also listed under one column ("state," of course).
I've been working on this one for quite a while, being new to coding. Right now, the code I have is as follows:
i <- 1
for (i in 1:length(statepopulations$X1800)) {
  if ((statepopulations$X1800[i] < 200000) == TRUE)
    next
  if ((statepopulations$X1800[i] > 250000) == TRUE)
    break
  print(statepopulations$state[i])
}
I want to print the names of all states whose population falls within that range for the year 1800. I keep getting the error message "missing value where TRUE/FALSE needed." I'm not sure where I'm going wrong; any help is appreciated.

In the condition, you don't need the == TRUE part.
(statepopulations$X1800[i] < 200000) works by itself -- it already returns TRUE or FALSE, which dictates what happens next.
The "missing value where TRUE/FALSE needed" error means a comparison returned NA, which happens when X1800[i] itself is NA, so you'll want to skip those rows with is.na() as well. Also note that break stops the loop entirely, so it only makes sense here if the data are sorted by population.
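A minimal runnable sketch of the corrected loop, using a toy data frame in place of the real statepopulations (the column names X1800 and state are taken from the question); NA values are skipped with next, which avoids the "missing value where TRUE/FALSE needed" error:

```r
# toy stand-in for the real state population data
statepopulations <- data.frame(
  state = c("A", "B", "C", "D"),
  X1800 = c(150000, 220000, NA, 240000)
)

matches <- character(0)
for (i in seq_along(statepopulations$X1800)) {
  pop <- statepopulations$X1800[i]
  if (is.na(pop)) next                    # skip missing observations
  if (pop < 200000 || pop > 250000) next  # skip out-of-range populations
  matches <- c(matches, statepopulations$state[i])
}
print(matches)  # the states in range
```

With the toy data above, only states B and D survive the filter.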


Next/Break Statements for dataset problem

I'm trying to get a loop to pull only states with populations between 200,000 and 250,000. How can I do this with next and break statements? I feel like I have the right idea (population > 200,000) and (population < 250,000), but it's not working out for me.
The dataset tracks how much state populations have shifted from one year to the next, and has columns with each year (like for 1800, the year I'm looking at).
There are also columns for each state.
I'm not sure what I expected - a few states? Tons of states? But instead I keep getting errors about my else if statement, saying that else was found where it wasn't expected.
Going by your description, it seems that you only need states with a population between 200,000 and 250,000, regardless of the year.
Your dataset is like this:
------------------------------
| state | population | year |
------------------------------
| NY    | 220,000    | 2018 |
------------------------------
| ...   | ...        | ...  |
------------------------------
So you can do something like this (assuming your dataset is a list of lists, i.e. a 2D matrix):
for i in dataset:
    if 200000 < i[1] < 250000:
        print('State: ', i[0])
        print('Population: ', i[1])
This will only print rows that satisfy the population condition.
Later, instead of print you can put your own logic.
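Since the question itself is about R, it's worth noting the same filter needs no loop at all there. A vectorized sketch, with the column names state and X1800 assumed from the question and a toy data frame standing in for the real one:

```r
# toy stand-in for the real data frame
statepopulations <- data.frame(
  state = c("A", "B", "C"),
  X1800 = c(150000, 220000, 300000)
)

in.range <- !is.na(statepopulations$X1800) &
  statepopulations$X1800 >= 200000 &
  statepopulations$X1800 <= 250000
result <- statepopulations$state[in.range]
print(result)
```

The is.na() guard keeps missing observations from propagating NA into the logical index.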

Check whether list objects exist and name them

I am trying to write my own function, and after some calculations I obtain a list like this;
According to the data, the number of clusters can vary from 1 to 31.
So, however many clusters there are, I want to collect them in a list like the code below.
maxm5<-list(m.5$`Disaggregated rainfall depths`$`Cluster 1`, m.5$`Disaggregated rainfall depths`$`Cluster 2`...)
To do this, I tried sapply:
maxm5<-sapply(1:31, function(zz) list(m.5$`Disaggregated rainfall depths`$`Cluster [zz]`))
And then I tried a for loop:
month <- 31
maxm5 <- for (i in month) {
  list(m.5$`Disaggregated rainfall depths`$`Cluster [i]`)
}
But all I got was a list of 31 NULLs.
And then I want to name them like this:
m5.1<-maxm5[[1]]
m5.2<-maxm5[[2]] ....
Based on your last comment:
sapply(m.5$`Disaggregated rainfall depths`, function(x) max(x[, -(1:4)]))
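For the original goal of collecting the clusters into one list, note that backticks don't interpolate: `Cluster [zz]` looks up an element literally named "Cluster [zz]", which doesn't exist, hence the NULLs. A sketch using paste0 with [[ ]] instead, with a toy nested list standing in for the real m.5:

```r
# toy stand-in for the question's nested structure
m.5 <- list(`Disaggregated rainfall depths` = list(
  `Cluster 1` = c(1, 2, 3),
  `Cluster 2` = c(4, 5, 6)
))

rain <- m.5$`Disaggregated rainfall depths`
# build the name "Cluster i" at run time and index with [[ ]]
maxm5 <- lapply(seq_along(rain), function(i) rain[[paste0("Cluster ", i)]])
```

setNames(maxm5, names(rain)) would carry the cluster names along, instead of assigning m5.1, m5.2, ... one by one.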

Conditionally removing duplicates in R (20K observations)

I am currently working with a large data set of duplicate water rights. Each right holder is assigned a RightID, but some were recorded twice for clerical purposes. However, some RightIDs are listed more than once and are relevant to my end goal. One example: there are double entries when a metal tag number was assigned to a specific water right. To avoid double counting that critical information, I need to delete an observation.
I have this written at the moment,
#Updated Metal Tag Number
for (i in 1:nrow(duplicate.rights)) {
  if ([i, "RightID"] == [i-1, "RightID"] & [i, "MetalTagNu"] =! [i-1, "MetalTagNu"]) {
    remove(i)
  }
  print[i]
}
The original data frame is set up similarly:
RightID  Source         Use         MetalTagNu
1-0000   Wolf Creek     Irrigation  N/A
1-0000   Wolf Creek     Irrigation  12345
1-0001   Bear River     Domestic    N/A
1-0002   Beaver Stream  Domestic    00001
1-0002   Beaver Stream  Irrigation  00001
E.g. right holder 1-0002 must be kept because he is using his water right for two different purposes, whereas right holder 1-0000 is an unnecessary repeat that I need to eliminate. I should also note that there can be up to 10 entries for a single RightID, but out of those 10 only 1 is an unnecessary duplicate. Also, the duplicate and the original entry will not be next to each other in the dataset.
I am quite the novice, so please forgive my poor previous attempt. I know I can use lapply to make this go faster and more efficiently; any guidance there would be much appreciated.
So I would suggest the following:
1) You say that you want to keep some duplicates (those where a metal tag number was assigned to a specific water right). I'm not sure exactly what that means, but I assume it is something like this: if the metal tag number equals some value (say 1), then you want to keep those rows even if they are duplicates. So I propose that you first take those rows out of your data (let's call it data):
data_to_keep <- data[data$metal_tag_number == 1, ]
data_to_dedupe <- data[data$metal_tag_number != 1, ]
2) Now that you have the two dataframes, you can dedupe the dataframe data_to_dedupe with no problem:
deduped_data = data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]
3) Now you can merge the two dataframes back together:
final_data <- rbind(data_to_keep, deduped_data)
If this is what you wanted, please upvote and mark the answer as correct. Thanks!
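A runnable sketch of the three steps on toy data; the column names metal_tag_number and dedupe_key are the answer's placeholders, standing in for MetalTagNu and RightID:

```r
# toy data standing in for the water-rights table
data <- data.frame(
  dedupe_key       = c("1-0000", "1-0000", "1-0001"),
  metal_tag_number = c(0, 1, 0)
)

# 1) split out the rows to keep unconditionally
data_to_keep   <- data[data$metal_tag_number == 1, ]
data_to_dedupe <- data[data$metal_tag_number != 1, ]

# 2) dedupe the remainder on the key
deduped_data <- data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]

# 3) stitch the two back together
final_data <- rbind(data_to_keep, deduped_data)
```

Here nothing is dropped (the two 1-0000 rows differ in metal_tag_number), which mirrors the "keep some duplicates" requirement.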
Create a new column, key, which is a combination of RightID & Use.
Assuming your dataframe is called df:
df$key <- paste(df$RightID, df$Use)
Then remove duplicates using this command:
df1 <- df[!duplicated(df$key), ]
df1 will have no duplicates.
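Applied to the sample table from the question, a minimal sketch; note this keys on RightID and Use only, so it keeps both 1-0002 rows and drops the second 1-0000 Irrigation row regardless of MetalTagNu:

```r
# sample rows from the question
df <- data.frame(
  RightID    = c("1-0000", "1-0000", "1-0001", "1-0002", "1-0002"),
  Use        = c("Irrigation", "Irrigation", "Domestic", "Domestic", "Irrigation"),
  MetalTagNu = c(NA, "12345", NA, "00001", "00001")
)

df$key <- paste(df$RightID, df$Use)  # e.g. "1-0002 Domestic"
df1 <- df[!duplicated(df$key), ]     # first occurrence of each key wins
```

One caveat: duplicated() keeps the first occurrence, so here the N/A-tag 1-0000 row survives rather than the tagged one; ordering the rows first (e.g. tagged entries before N/A) would flip that.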

Summarized huge data, How to handle it with R?

I am working on EBS Forex-market limit order book (LOB) data. Here is an example of the LOB in a 100-millisecond time slice:
datetime|side(0=Bid,1=Ask)| distance(1:best price, 2: 2nd best, etc.)| price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, they apply two rules (I have changed them a bit for simplicity):
If there is no change in the LOB on the bid or ask side, they do not record that side. Look at the last line of the data: the millisecond field was 000 and is now 500, which means there was no change in the LOB on either side at 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).
When the last price (only the last) is removed from a given side of the order book, they write a single record with nothing in the price field. Again, there will be no record for the whole LOB at that time.
Example: 2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk-maxBid (1.6067-1.6066), or the weighted average price (using the sizes at all distances as weights; there is a size column in my real data), over my whole data set. But as you can see, the data has been summarized, so this is not routine. I have written code to reconstruct the whole data (not just the summary); this is fine for a small data set, but for a large one it creates a huge file. I was wondering if you have any tips on how to handle the data, i.e. how to fill the gaps efficiently.
You did not give a great reproducible example so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(for this to work you might have to transform date/time into a linear measure, e.g. time in seconds since market opening.)
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
                    best.ask$bid - best.ask$price))
I'm not sure I understand the end of day particularity but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.
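A small runnable sketch of the findInterval step on toy numbers, with timestamps already converted to seconds as suggested above; note that findInterval returns 0 for a bid timestamp earlier than the first ask record, so real code should guard that case:

```r
# toy best-bid / best-ask series (time in seconds, prices invented)
best.bid <- data.frame(time = c(1, 3, 5), price = c(1.6066, 1.6065, 1.6067))
best.ask <- data.frame(time = c(1, 4),    price = c(1.6067, 1.6069))

# for each bid timestamp, index of the most recent ask at or before it
idx <- findInterval(best.bid$time, best.ask$time)
best.bid$ask <- best.ask$price[idx]
best.bid$spread <- best.bid$ask - best.bid$price
min.spread <- min(best.bid$spread)
```

The ask at time 4 is carried forward to the bid observation at time 5, which is exactly the gap-filling behavior the summarized data needs.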

R Accumulate equity data - add time and price

I have some data formatted as below. I have done some analysis on it and would like to plot the price development in the same graph as the analyzed data, which requires the two to share the same x-axis.
So I would like to aggregate the "shares" column in increments of, say, 150 shares, and attach the "finalprice" and "time" to each increment.
The aggregation should carry the latest time and price, so if an increment spans two or more rows of data, the last row should provide the price and time.
My question is how to create a new vector with 150 shares per row; the length of the vector will equal sum(shares)/150.
Is there an easy way to do this? Thanks in advance.
Edit:
I thought about expanding the observations using rep(finalprice, shares) and then taking every 150th value of the expanded vector.
Data sample:
"date","ord","shares","finalprice","time","stock"
20120702,E,2000,99.35,540.84753333,500
20120702,E,28000,99.35,540.84753333,500
20120702,E,50,99.5,542.03073333,500
20120702,E,13874,99.5,542.29411667,500
20120702,E,292,99.5,542.30191667,500
20120702,E,784,99.5,542.30193333,500
20120702,E,13300,99.35,543.04805,500
20120702,E,16658,99.35,543.04805,500
20120702,E,42,99.5,543.04805,500
20120702,E,400,99.4,546.17173333,500
20120702,E,100,99.4,547.07,500
20120702,E,2219,99.3,549.47988333,500
20120702,E,781,99.3,549.5238,500
20120702,E,50,99.3,553.4052,500
20120702,E,1500,99.35,559.86275,500
20120702,E,103,99.5,567.56726667,500
20120702,E,1105,99.7,573.93326667,500
20120702,E,4100,99.5,582.2657,500
20120702,E,900,99.5,582.2657,500
20120702,E,1024,99.45,582.43891667,500
20120702,E,8214,99.45,582.43891667,500
20120702,E,10762,99.45,582.43895,500
20120702,E,1250,99.6,586.86446667,500
20120702,E,5000,99.45,594.39061667,500
20120702,E,20000,99.45,594.39061667,500
20120702,E,15000,99.45,594.39061667,500
20120702,E,4000,99.45,601.34491667,500
20120702,E,8700,99.45,603.53608333,500
20120702,E,3290,99.6,609.23213333,500
I think I got it solved.
expand <- rep(finalprice, shares)
Increment <- expand[seq(from = 1, to = length(expand), by = 150)]
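The same rep trick extends to the time column, so each 150-share bucket carries a price and a time. A sketch on a few toy rows (note that with seq starting at 1, each bucket takes the value at the first share past the boundary rather than the last row it spans, so this only approximates the "latest price" rule stated in the question):

```r
# toy rows standing in for the sample data
shares     <- c(2000, 50, 300)
finalprice <- c(99.35, 99.5, 99.4)
time       <- c(540.85, 542.03, 542.30)

expand.price <- rep(finalprice, shares)  # one entry per share traded
expand.time  <- rep(time, shares)
idx <- seq(from = 1, to = length(expand.price), by = 150)
bucket.price <- expand.price[idx]
bucket.time  <- expand.time[idx]
```

With 2350 shares total this yields 16 buckets; the last two fall in the 50-share and 300-share rows, so they pick up 99.5 and 99.4.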
