Conditionally removing duplicates in R (20K observations)

I am currently working in a large data set looking at duplicate water rights. Each right holder is assigned a RightID, but some were recorded twice for clerical purposes. However, some RightIDs are listed more than once and do have relevance to my end goal. One example: there are double entries when a metal tag number was assigned to a specific water right. To avoid double counting the critical information, I need to delete an observation.
This is what I have written at the moment:
# Updated Metal Tag Number
for (i in 1:nrow(duplicate.rights)) {
  if (duplicate.rights[i, "RightID"] == duplicate.rights[i - 1, "RightID"] &
      duplicate.rights[i, "MetalTagNu"] != duplicate.rights[i - 1, "MetalTagNu"]) {
    remove(i)  # intended to drop row i
  }
  print(i)
}
The original data frame is set up similarly:
RightID  Source         Use         MetalTagNu
1-0000   Wolf Creek     Irrigation  N/A
1-0000   Wolf Creek     Irrigation  12345
1-0001   Bear River     Domestic    N/A
1-0002   Beaver Stream  Domestic    00001
1-0002   Beaver Stream  Irrigation  00001
E.g. right holder 1-0002 is necessary to keep because he is using his water right for two different purposes. However, right holder 1-0000 is an unnecessary repeat.
Right holder 1-0000 I need to eliminate, but right holder 1-0002 is valuable to my end goal. I should also note that there can be up to 10 entries for a single RightID, but out of those 10 only 1 is an unnecessary duplicate. Also, the duplicate and the original entry will not be next to each other in the dataset.
I am quite the novice, so please forgive my poor previous attempt. I know I can use the lapply function to make this go faster and more efficiently. Any guidance there would be much appreciated.

So I would suggest the following:
1) You say that you want to keep some duplicates (a metal tag number was assigned to a specific water right). I don't know exactly what this means, but I assume it is something like this: if the metal tag number equals 1, then even if there are duplicates, you want to keep them. So I propose that you take these rows out of your data (let's call it data):
data_to_keep <- data[data$metal_tag_number == 1, ]
data_to_dedupe <- data[data$metal_tag_number != 1, ]
2) Now that you have the two dataframes, you can dedupe the dataframe data_to_dedupe with no problem:
deduped_data = data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]
3) Now you can merge the two dataframes back together:
final_data <- rbind(data_to_keep, deduped_data)
If this is what you wanted, please upvote and mark the answer as accepted. Thanks!

Create a new column, key, which is a combination of RightID and Use.
Assuming your dataframe is called df,
df$key <- paste(df$RightID,df$Use)
Then, remove duplicates on that key using this command:
df1 <- df[!duplicated(df$key), ]
df1 will have no duplicates.
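If dplyr is an option, the same result can be had without a helper column via distinct(). A minimal sketch, assuming a repeated RightID/Use pair marks a clerical duplicate and the remaining columns of the first occurrence should be kept:
library(dplyr)

# Keep one row per RightID/Use combination; .keep_all = TRUE retains the other
# columns (Source, MetalTagNu, ...) from the first occurrence of each pair.
df1 <- df %>% distinct(RightID, Use, .keep_all = TRUE)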

Related

Trying to create a loop for a population range and seem to be missing something

This is the prompt I am working from:
In the state population data, can you write a loop that pulls only states with populations between 200,000 and 250,000 in 1800? You could accomplish this a few different ways. Try to use next and break statements.
The dataset in question tracks how each state's population has shifted from year to year, decade to decade.
There are various columns representing each year of observations, including one for 1800 ("X1800"). The states are all also listed under one column ("state," of course).
I've been working on this one for quite a while, being new to coding. Right now, the code I have is as follows:
i <- 1
for (i in 1:length(statepopulations$X1800)) {
  if ((statepopulations$X1800[i] < 200000) == TRUE)
    next
  if ((statepopulations$X1800[i] > 250000) == TRUE)
    break
  print(statepopulations$state[i])
}
I want to print the names of all states that fall within that population range for the year 1800.
I'm not sure where I'm going wrong. Any help is appreciated.
I keep getting a message that I'm "missing value where TRUE/FALSE needed."
In the condition, you don't need the == TRUE part.
(statepopulations$X1800[i] < 200000) should work by itself -- it will return TRUE or FALSE, which will dictate what happens next.
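For what it's worth, the "missing value where TRUE/FALSE needed" message usually means the condition evaluated to NA, which happens when X1800 has missing values for some states. A minimal sketch, assuming the column names from the question, that skips NAs first and otherwise keeps the filtering pattern from the prompt:
# Skip missing populations first; otherwise `NA < 200000` yields NA and if()
# stops with "missing value where TRUE/FALSE needed".
for (i in seq_len(nrow(statepopulations))) {
  if (is.na(statepopulations$X1800[i])) next
  if (statepopulations$X1800[i] < 200000) next
  if (statepopulations$X1800[i] > 250000) next  # break would only be safe here if rows were sorted by X1800
  print(statepopulations$state[i])
}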

How do I pull the values from multiple columns, conditionally, into a new column?

I am a relatively novice R user, though familiar with dplyr and the tidyverse. I still can't seem to figure out how to pull the actual data from one column, if it meets a certain condition, into a new column.
Here is what I'm trying to do. Participants have ranked specific practices (n=5) and provided responses to questions that represent their beliefs about these practices. I want to have five new columns that assign their beliefs about the practices to their ranks, rather than the practices.
For example, they have a score for "beliefs about NI" called ni.beliefs, if a participant ranked NI as their first choice, I want the value for ni.beliefs to be pulled into the new column for first.beliefs. The same is true that if a participant put pmii as their first choice practice, their value for pmii.beliefs should be pulled into the first.beliefs column.
So, I need five new columns called: first.beliefs, second.beliefs, third.beliefs, fourth.beliefs, last.beliefs and then I need each of these to have the data pulled in conditionally from the practice specific beliefs (ni.beliefs, dtt.beliefs, pmi.beliefs, sn.beliefs, script.beliefs) dependent on the practice specific ranks (rank assigned of 1-5 for each practice, rank.ni, rank.dtt, rank.pmi, rank.sn, rank.script).
Here is what I have so far but I am stuck and aware that this is not very close. Any help is appreciated!!!
Diss$first.beliefs <- ifelse(rank.ni == 1, ni.beliefs,
                      ifelse(rank.dtt == 1, dtt.beliefs,
                      ifelse(rank.pmi == 1, pmi.beliefs,
                      ifelse(rank.sn, sn.beliefs,
                      ifelse(rank.script == 1, script.beliefs)))))
Thank you!!
I'm not sure if I understood correctly (it would help if you showed what your data looks like), but this is what I'm thinking:
Without using additional packages, if the ranking columns are equivalent to the index of the new columns you want (i.e. they rank each practice from 1 to 5, without repeats, and in the same order as the new columns "firsts belief, second belief, etc"), then you can use that data as the indices for the second set of columns:
for (j in 1:nrow(people_table)) {
  people_table[j, ]$first.belief[[1]]  <- names(beliefs)[(people_table[j, c(A:B)]) %in% 1]
  people_table[j, ]$second.belief[[1]] <- names(beliefs)[(people_table[j, c(A:B)]) %in% 2]
  ...
}
Where
A -> index of the first preference rank column
B -> index of the last preference rank column
(people_table[j,c(A:B)] %in% 1) -> this returns something like (FALSE FALSE TRUE FALSE FALSE)
beliefs -> vector with the names of each belief
That should work. It's simple, no need for packages, and it'll be fast too. Just make sure you've initialized/created the new columns first, otherwise you'll get some errors.
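A vectorized base-R alternative is also possible (a sketch, assuming the rank and belief columns are listed in the same practice order and each participant ranks exactly one practice as 1): build a matrix of the belief columns and index each row by the position of the rank-1 practice.
# Column names taken from the question; the two vectors must list the
# practices in the same order.
belief.cols <- c("ni.beliefs", "dtt.beliefs", "pmi.beliefs", "sn.beliefs", "script.beliefs")
rank.cols   <- c("rank.ni", "rank.dtt", "rank.pmi", "rank.sn", "rank.script")

# For each participant, find which practice they ranked 1, then pull that
# practice's belief score via matrix indexing.
first.idx <- apply(Diss[rank.cols], 1, function(r) which(r == 1)[1])
Diss$first.beliefs <- as.matrix(Diss[belief.cols])[cbind(seq_len(nrow(Diss)), first.idx)]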
This is done very easily with the case_when() function. You can improve on the code below.
library(dplyr)
Diss$first.beliefs <- case_when(
  rank.ni == 1 ~ ni.beliefs,
  rank.dtt == 1 ~ dtt.beliefs,
  rank.pmi == 1 ~ pmi.beliefs,
  rank.sn == 1 ~ sn.beliefs,
  rank.script == 1 ~ script.beliefs
)
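For the bare column names to resolve, the case_when() call usually needs to run inside mutate() (a sketch, assuming all the rank.* and *.beliefs columns live in Diss):
library(dplyr)

# Inside mutate() the bare column names are looked up in Diss itself.
Diss <- Diss %>%
  mutate(first.beliefs = case_when(
    rank.ni     == 1 ~ ni.beliefs,
    rank.dtt    == 1 ~ dtt.beliefs,
    rank.pmi    == 1 ~ pmi.beliefs,
    rank.sn     == 1 ~ sn.beliefs,
    rank.script == 1 ~ script.beliefs
  ))
The same pattern repeated with == 2 through == 5 gives second.beliefs through last.beliefs.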

Looping through a dataset in R and counting occurrences of variables

I found an interesting dataset from a psychology study (the dataset is called WearingTShirt), and I would like to replicate the results. I need to summarize two variables into a single variable. This is what I have written:
# Create empty variable
PinkAndRed = 0
# Count instances of people wearing both pink and red and add 1
for i in WearingTShirt:
    PinkAndRed+1 if:
        WearingTShirt$PINKSHIRT==1 OR WearingTShirt$REDSHIRT==1
# Add variable to dataset
WearingTShirt$PinkAndRed
I do not have much R experience (I have written mostly in Python).
Your code is more Python than R. The equivalent code in R for what you want to do is:
PinkAndRed = rep(0, dim(WearingTShirt)[1])
for (i in 1:dim(WearingTShirt)[1]) {
  if ((WearingTShirt$PINKSHIRT[i] == 1) || (WearingTShirt$REDSHIRT[i] == 1)) {
    PinkAndRed[i] = 1
  }
}
WearingTShirt = cbind(WearingTShirt, PinkAndRed)
You need to review the basics of R. There are countless small differences between R and Python, such as the parentheses in loops and conditions, or how the length of a loop is set (in the above code, dim calculates the dimensions of the dataset and [1] selects the number of rows)...
Update:
Thanks to the comments I've realized it is not clear whether you want a cumulative sum of the individuals with pink or red shirts, or a variable which is 1 when the shirt is pink or red and 0 otherwise.
The code above is for a variable that flags pink and red shirts in one variable.
If you want the running sum, you must use the cumsum function, as noted in the comments.
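If the running total is what is wanted, a sketch (assuming PINKSHIRT and REDSHIRT are 0/1 indicator columns, as in the question):
# Cumulative count of observations wearing a pink or red shirt.
WearingTShirt$PinkAndRedCount <- cumsum(WearingTShirt$PINKSHIRT == 1 |
                                          WearingTShirt$REDSHIRT == 1)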
I would not choose to loop, but:
WearingTShirt$PinkAndRed <- ifelse(WearingTShirt$PINKSHIRT == 1 |
                                   WearingTShirt$REDSHIRT == 1, 1, 0)
PinkAndRed sounds more like PinkOrRed based on the example given.

Deleting ranges of values based on character string in R

I have a pretty gigantic dataframe with, among others, a nutscode column and a nutslevel column.
I want to delete all NUTS2 values for certain countries (let's say Belgium here) and have no clue how to proceed. So far, the only thing that works has been this:
alldata<-alldata[!(alldata$nutscode=="be21" & alldata$nutslevel=="nuts2"),]
but I would have to keep writing this same line hundreds of times for all possible countries.
I want to exclude all values from the dataset where the nutscode variable has the character string "be" in the values AND the nutslevel equals 2.
I've tried using
alldata[!grepl("be", alldata$nutscode, alldata$nutslevel=="nuts2"),]
or
alldata[!grepl("be", alldata$nutscode) & alldata$nutslevel=="nuts2",]
since I've seen this posted in a similar thread here,
but I am clearly writing something wrong; it doesn't work, it just prints out values. I've also tried many other alternatives, but nothing worked.
Is there a simpler way of removing the rows containing those specific strings from my dataframe, without writing the same line hundreds of times? Also please please if you reply, do provide a complete answer, I am a total noob at this and if I had known how to write a fancy loop or function to do this for me, I would have done it by now. :/
Thank you very much in advance!
Also, for clarification: NUTS codes are used to classify regions and increase in complexity the deeper one goes on a regional level. E.g. AT0 is Austria as a whole, AT2 and AT3 are regions on NUTS1 level, and AT21 or AT34 are even smaller regions on NUTS2 level. Each country has its own NUTS codes following the same structure (e.g. BE, BE1 and BE34 are examples of NUTS level 0, 1 and 2 regions in Belgium).
I think you're very close with grepl. Why did you abandon the & construct from your first example? This works fine for me...
nutslevel <- c('nuts1', 'nuts1', 'nuts2', 'nuts2')
nutscode <- c('be2', 'o2', 'be2', 'o2')
dat <- data.frame(nutslevel, nutscode)
dat[!(grepl('be', dat$nutscode) & dat$nutslevel=='nuts2'), ]
last line returns
  nutslevel nutscode
1     nuts1      be2
2     nuts1       o2
4     nuts2       o2
which excludes the third row, as desired.
Also, perhaps subset offers a slightly cleaner way to achieve this
subset(dat, !(grepl('be', nutscode) & nutslevel=='nuts2'))
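One refinement worth considering: anchor the pattern with ^ so only codes that start with a country prefix match, and build the pattern from a vector of prefixes instead of repeating the line per country. A sketch, assuming lowercase codes as in the question and using "be" and "at" purely as example prefixes:
# Drop NUTS2 rows for every listed country prefix in one go.
prefixes <- c("be", "at")                       # assumed example prefixes
pattern  <- paste0("^(", paste(prefixes, collapse = "|"), ")")
alldata  <- alldata[!(grepl(pattern, alldata$nutscode) & alldata$nutslevel == "nuts2"), ]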
Just for clarification: do the different countries share the same nutscode structure? What is the pattern of the nutscode? As explained above, you want to exclude all values from the dataset where the nutscode variable contains the character string "be" AND the nutslevel equals 2. One has to visualize the pattern of the codes to answer properly, so if possible, give the nutscode for at least four countries. I assume the nutslevel is "nuts2" for all the countries. Thank you.

Summarized huge data: how to handle it with R?

I am working on the EBS Forex market Limit Order Book (LOB); here is an example of the LOB in a 100 millisecond time slice:
datetime|side(0=Bid,1=Ask)| distance(1:best price, 2: 2nd best, etc.)| price
2008/01/28,09:11:28.000,0,1,1.6066
2008/01/28,09:11:28.000,0,2,1.6065
2008/01/28,09:11:28.000,0,3,1.6064
2008/01/28,09:11:28.000,0,4,1.6063
2008/01/28,09:11:28.000,0,5,1.6062
2008/01/28,09:11:28.000,1,1,1.6067
2008/01/28,09:11:28.000,1,2,1.6068
2008/01/28,09:11:28.000,1,3,1.6069
2008/01/28,09:11:28.000,1,4,1.6070
2008/01/28,09:11:28.000,1,5,1.6071
2008/01/28,09:11:28.500,0,1,1.6065 (I skip the rest)
To summarize the data, they apply two rules (I have changed them a bit for simplicity):
If there is no change in the LOB on the Bid or Ask side, that side is not recorded. Look at the last line of the data: the millisecond field was 000 and is now 500, which means there was no change in the LOB on either side at 100, 200, 300 and 400 milliseconds (but that information is important for any calculation).
If the last price (and only the last) is removed from a given side of the order book, a single record is written with nothing in the price field. Again, there will be no record for the whole LOB at that time.
Example: 2008/01/28,09:11:28.800,0,1,
I want to calculate minAsk - maxBid (1.6067 - 1.6066) or the weighted average price (using the sizes at all distances as weights; there is a size column in my real data). I want to do this for my whole data set. But as you can see, the data has been summarized, which makes this far from routine. I have written code to reproduce the whole (unsummarized) data. This is fine for a small data set, but for a large one I end up creating a huge file. I was wondering if you have any tips on how to handle the data, and how to fill the gaps efficiently.
You did not give a great reproducible example so this will be pseudo/untested code. Read the docs carefully and make adjustments as needed.
I'd suggest you first filter and split your data into two data.frames:
best.bid <- subset(data, side == 0 & distance == 1)
best.ask <- subset(data, side == 1 & distance == 1)
Then, for each of these two data.frames, use findInterval to compute the corresponding best ask or best bid:
best.bid$ask <- best.ask$price[findInterval(best.bid$time, best.ask$time)]
best.ask$bid <- best.bid$price[findInterval(best.ask$time, best.bid$time)]
(for this to work you might have to transform date/time into a linear measure, e.g. time in seconds since market opening.)
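A sketch of that conversion, assuming the date and time end up in a single string such as "2008/01/28,09:11:28.000" after reading the file:
# %OS keeps the fractional milliseconds; as.numeric() gives seconds on one
# linear axis so findInterval() can line bid and ask timestamps up.
data$time <- as.numeric(as.POSIXct(data$datetime,
                                   format = "%Y/%m/%d,%H:%M:%OS", tz = "UTC"))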
Then it should be easy:
min.spread <- min(c(best.bid$ask - best.bid$price,
best.ask$bid - best.ask$price))
I'm not sure I understand the end of day particularity but I bet you could just compute the spread at market close and add it to the final min call.
For the weighted average prices, use the same idea but instead of the two best.bid and best.ask data.frames, you should start with two weighted.avg.bid and weighted.avg.ask data.frames.
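A sketch of what those per-timestamp weighted averages could look like, assuming the gap-filled data has time, side, price and size columns:
# Volume-weighted average bid price per timestamp (side == 0); the ask side
# is the same with side == 1.
bids <- subset(data, side == 0)
vwap.by.time <- sapply(split(bids, bids$time),
                       function(d) sum(d$price * d$size) / sum(d$size))
weighted.avg.bid <- data.frame(time = as.numeric(names(vwap.by.time)),
                               vwap = vwap.by.time)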
