Deleting ranges of values based on character string in R

I have a pretty gigantic dataframe containing, among other things, a nutscode and a nutslevel column.
I want to delete all NUTS2 values for certain countries (let's say Belgium here) and have no clue how to proceed. So far, the only thing that works has been this:
alldata<-alldata[!(alldata$nutscode=="be21" & alldata$nutslevel=="nuts2"),]
but I would have to keep writing this same line hundreds of times for all possible countries.
I want to exclude all values from the dataset where the nutscode variable has the character string "be" in the values AND the nutslevel equals 2.
I've tried using
alldata[!grepl("be", alldata$nutscode, alldata$nutslevel=="nuts2"),]
or
alldata[!grepl("be", alldata$nutscode) & alldata$nutslevel=="nuts2",]
since I've seen this posted in a similar thread here,
but I am clearly writing something wrong: it doesn't delete anything, it just prints out values. I've also tried many, many other alternatives, but nothing worked.
Is there a simpler way of removing the rows containing those specific strings from my dataframe, without writing the same line hundreds of times? Also please please if you reply, do provide a complete answer, I am a total noob at this and if I had known how to write a fancy loop or function to do this for me, I would have done it by now. :/
Thank you very much in advance!
Also for clarification: NUTS codes are used to classify regions and increase in complexity the deeper one goes on a regional level. E.g. AT0 is Austria as a whole, AT2 and AT3 are regions on NUTS1 level, and AT21 or AT34 are even smaller regions on NUTS2 level. Each country has its own NUTS codes following the same structure (e.g. BE, BE1 and BE34 are examples of NUTS level 0, 1 and 2 regions in Belgium).

I think you're very close with grepl. Why did you abandon the & construct from your first example? This works fine for me...
nutslevel <- c('nuts1', 'nuts1', 'nuts2', 'nuts2')
nutscode <- c('be2', 'o2', 'be2', 'o2')
dat <- data.frame(nutslevel, nutscode)
dat[!(grepl('be', dat$nutscode) & dat$nutslevel=='nuts2'), ]
last line returns
nutslevel nutscode
1 nuts1 be2
2 nuts1 o2
4 nuts2 o2
which excludes the third row, as desired.
Also, perhaps subset offers a slightly cleaner way to achieve this
subset(dat, !(grepl('be', nutscode) & nutslevel=='nuts2'))
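One caveat worth adding (my addition, not part of the original answer): grepl('be', ...) matches "be" anywhere in the code. Since NUTS codes start with the country prefix, anchoring the pattern with ^ avoids accidental hits in the middle of other codes. A small sketch with made-up codes:

```r
# Anchored version: only codes *starting* with "be" are treated as Belgian.
nutslevel <- c('nuts1', 'nuts1', 'nuts2', 'nuts2')
nutscode  <- c('be2', 'albe2', 'be2', 'albe2')  # 'albe2' is a made-up code containing "be"
dat <- data.frame(nutslevel, nutscode)

res <- dat[!(grepl('^be', dat$nutscode) & dat$nutslevel == 'nuts2'), ]
res
# drops only row 3; the unanchored pattern would also drop row 4 ('albe2' on nuts2)
```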

Just for clarification: do the nutscodes differ between countries? What is the pattern of the nutscode? As explained above, you want to exclude all values from the dataset where the nutscode variable contains the character string "be" AND the nutslevel equals 2. Only if the nutscode differs from country to country will someone be able to answer your question; one has to see the pattern. So if possible, give the nutscode for at least four countries. I hope nutslevel = 2 holds for all the countries. Thank you

Related

Get next level from a given level factor

I am currently making my first steps using R with RStudio and right now I am struggling with the following problem:
I got some test data about a marathon with four columns, where the third column is a factor with 15 levels representing different age classes.
One age class randomAgeClass will be randomly selected at the beginning, and an object is created holding the data that matches this age class.
set.seed(12345678)
attach(marathon)
randomAgeClass <- sample(levels(marathon[,3]), 1)
filteredMara <- subset(marathon, AgeClass == randomAgeClass)
My goal is to store a second object that holds the data matching the next higher level, meaning that if age class 'kids' was randomly selected, I now want to access the data relating to 'teenagers', which is the next higher level. Looking something like this:
nextAgeClass <- .... randomAgeClass+1 .... ?
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
Note that I already found this StackOverflow question, which seems to partially match my situation, but the accepted answer is not understandable to me, thus I wasn't able to apply it to my needs.
Thanks a lot for any patient help!
First you have to make sure that the levels of your factor are ordered by age:
factor(marathon$AgeClass,levels=c("kids","teenagers",etc.))
Then you almost got there in your example:
next_pos<-which(levels(marathon$AgeClass)==randomAgeClass)+1 #here you get the desired position in the level vector
nextAgeClass <- levels(marathon$AgeClass) [next_pos]
filteredMaraAgeClass <- subset(marathon, AgeClass == nextAgeClass)
You might have a problem if randomAgeClass is the last level (next_pos would then index past the end of the levels vector and return NA), so make sure to handle that case.
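The warning about the last level can be sketched like this (hedged: the marathon data is not shown, so toy levels are assumed here):

```r
# Sketch: guard against randomAgeClass being the highest level.
# Indexing one past the end of levels() would otherwise return NA silently.
AgeClass <- factor(c("kids", "teenagers", "adults"),
                   levels = c("kids", "teenagers", "adults"))
randomAgeClass <- "adults"  # worst case: already the highest level

next_pos <- which(levels(AgeClass) == randomAgeClass) + 1
if (next_pos > nlevels(AgeClass)) {
  nextAgeClass <- NA  # no higher class; could also wrap around to levels(AgeClass)[1]
} else {
  nextAgeClass <- levels(AgeClass)[next_pos]
}
```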

Making a histogram

This sounds pretty basic, but every time I try to make a histogram, my code says 'x' must be numeric. I've been looking everywhere but can't find anything relating to my problem. I have data with 240 obs of 5 variables.
Nipper length
Number of Whiskers
Crab Carapace
Sex
Estuary location
There are 3 locations and I'm trying to make a histogram of nipper length.
I've tried making new factors and levels with the 80 obs in each location, but it's not working.
Crabs.data <-read.table(pipe("pbpaste"),header = FALSE)##Mac
names(Crabs.data)<-c("Crab Identification","Estuary Location","Sex","Crab Carapace","Length of Nipper","Number of Whiskers")
Crabs.data<-Crabs.data[,-1]
attach(Crabs.data)
hist(`Length of Nipper`~`Estuary Location`)
Error in hist.default(Length of Nipper ~ Estuary Location) :
'x' must be numeric
This error appears instead of the correct result.
hist() has no formula interface in base R; it expects a single numeric vector, so it won't take more than one variable.
I think you'd have the best luck subsetting the data, that is, making a vector of nipper lengths for all crabs in a given estuary.
crabs.data<-read.table("whatever you're calling it")
names<-(as you have it)
Estuary1<-as.vector(unlist(subset(crabs.data, `Estuary Loc`=="Location", select = `Length of Nipper`)))
hist(Estuary1)
Repeat the last two lines for your other two estuaries. You may not need the unlist() command, depending on your table. I've tended to need it for Excel files, but I don't know what format your table is in (that would've been helpful).
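An alternative sketch (my addition, with toy data standing in for the real file; column names follow the question): loop over the locations instead of writing one subset per estuary.

```r
# Toy data with the question's structure: 3 estuaries, 80 crabs each.
crabs.data <- data.frame(
  `Estuary Location` = rep(c("A", "B", "C"), each = 80),
  `Length of Nipper` = rnorm(240, mean = 20, sd = 3),
  check.names = FALSE
)

par(mfrow = c(1, 3))  # three panels side by side
for (loc in unique(crabs.data$`Estuary Location`)) {
  # pull out the numeric vector for this location, then plot it
  lengths <- crabs.data$`Length of Nipper`[crabs.data$`Estuary Location` == loc]
  hist(lengths, main = paste("Estuary", loc), xlab = "Length of Nipper")
}
```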

Conditionally removing duplicates in R (20K observations)

I am currently working with a large data set, looking at duplicate water rights. Each right holder is assigned a RightID, but some were recorded twice for clerical purposes. However, some RightIDs are listed more than once and do have relevance to my end goal. One example: there are double entries when a metal tag number was assigned to a specific water right. To avoid double counting the critical information, I need to delete one of the observations.
I have this written at the moment,
#Updated Metal Tag Number
for(i in 1:nrow(duplicate.rights)) {
if( [i, "RightID"]==[i-1, "RightID"] & [i,"MetalTagNu"]=![i-1, "MetalTagNu"] ){
remove(i)
}
print[i]
}
The original data frame is set up similarly:
RightID Source Use MetalTagNu
1-0000 Wolf Creek Irrigation N/A
1-0000 Wolf Creek Irrigation 12345
1-0001 Bear River Domestic N/A
1-0002 Beaver Stream Domestic 00001
1-0002 Beaver Stream Irrigation 00001
E.g. right holder 1-0002 is necessary to keep because he is using his water right for two different purposes. However, right holder 1-0000 is an unnecessary repeat.
Right holder 1-0000 I need to eliminate but right holder 1-0002 is valuable to my end goal. I should also note that there can be up to 10 entries for a single rightID but out of those 10 only 1 is an unnecessary duplicate. Also, the duplicate and original entry will not be next to each other in the dataset.
I am quite the novice, so please forgive my poor previous attempt. I know I can use the lapply function to make this go faster and more efficiently. Any guidance there would be much appreciated.
So I would suggest the following:
1) You say that you want to keep some duplicates (metal tag number was assigned to a specific water right). I don't know what this means. But I assume that it is something like this - if metal tag number = 1 then even if there are duplicates, you want to keep them. So I propose that you take these rows in your data (let's call this data) out:
data_to_keep <- data[data$metal_tag_number == 1, ]
data_to_dedupe <- data[data$metal_tag_number != 1, ]
2) Now that you have the two dataframes, you can dedupe the dataframe data_to_dedupe with no problem:
deduped_data = data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]
3) Now you can merge the two dataframes back together:
final_data <- rbind(data_to_keep, deduped_data)
If this is what you wanted, please upvote and accept the answer. Thanks!
Create a new column,key, which is a combination of RightID & Use.
Assuming your dataframe is called df,
df$key <- paste(df$RightID,df$Use)
Then, remove duplicates using this command :
df1 <- df[!duplicated(df$key), ]
df1 will have no duplicates.
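A worked sketch of that key idea on the sample rows from the question: duplicates are defined by RightID + Use, so the two 1-0000 irrigation rows collapse to one while both 1-0002 rows survive.

```r
# Sample rows from the question.
df <- data.frame(
  RightID    = c("1-0000", "1-0000", "1-0001", "1-0002", "1-0002"),
  Use        = c("Irrigation", "Irrigation", "Domestic", "Domestic", "Irrigation"),
  MetalTagNu = c(NA, "12345", NA, "00001", "00001")
)

df$key <- paste(df$RightID, df$Use)   # duplicate = same RightID AND same Use
df1 <- df[!duplicated(df$key), ]
nrow(df1)  # 4 rows remain: one 1-0000 row dropped, both 1-0002 rows kept
```

Note that duplicated() keeps the first occurrence, so here the N/A-tag row survives rather than the tagged one; if you want to keep the row with the metal tag, order the frame so tagged rows come first before deduplicating.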

Match dates and id between data.frames only once

Have 2 example databases as follows
id<-c(1,2,3,1,4,3,5)
date<-c("2011-1-1","2011-1-1","2011-2-2","2012-3-3","2012-4-4","2012-5-5","2012-6-6")
d<-data.frame(cbind(id,date))
colnames(d)<-c("id","date")
d$w<-do.call(paste,c(d[c("id","date")],sep=" "))
id<-c(7,8,9,10,7,10,8,10,11,12)
date<-c("2011-1-1","2011-1-1","2011-2-2","2012-3-3","2012-3-3","2012-4-4","2012-4-4","2012-5-5","2012-6-6","2012-6-6")
contr<-data.frame(cbind(id,date))
colnames(contr)<-c("id","date")
contr$w<-do.call(paste,c(contr[c("id","date")],sep=" "))
Consider that id and dates are repeated in both datasets but d$id are all different from contr$id and that all contr$date are %in% d$date
What I want is y that is a vector including ONE contr$w FOR EACH d$id that have a contr$date%in%d$date
I have tried this, which does not work, but I am sure there must be a much easier, simpler, better way to do it:
y<-0
for(i in length(levels(factor(d$w)))){
for(j in length(levels(factor(contr$w)))){
z<-ifelse(d$date[i]==contr$date[j],contr$w[j],NA)
y<-c(y,z)
y<-subset(y,!is.na(y))
}
}
Anyone can help?
Many thanks,
Marco
This did what I wanted, maybe I was not clear enough in my explanation. I just wanted a random date per id (then I can create the w column). I have sorted this by using a solution from this other question:
Random row selection in R
Many thanks for the effort anyway!
Marco
Actually, I have now written a loop that does this (the previous answer did not work, as some cases in d did not have a matching date in contr). It is very slow, but it does exactly what I wanted:
for(i in 1:length(d$rownames)){
if(TRUE%in%levels(factor(contr$w%in%d$w[i]))==TRUE){
control.2$rownames[i]<-sample(contr$rownames[contr$w==d$w[i]],1)
contr<-contr[!contr$rownames%in%control.2$rownames[i],]
}else{
z<-contr[contr$practice==d$practice[i],]
z$tempo<-abs(difftime(z$date,d$date[i],units="days"))
z<-z[!is.na(z$tempo),]
z<-z[z$tempo==min(z$tempo),]
control.2$rownames[i]<-sample(z$rownames,1)
contr<-contr[!contr$rownames%in%control.2$rownames[i],]
}
}
Not the best code I am sure, but it works. The else branch handles the few cases where there was no row with a matching date, so there I choose a sample()d row with the closest date instead. If you can come up with a faster version, that would be nice. My datasets are about d = ~5K rows and contr = ~2.5 million rows, and it takes roughly 2 hours to run. Painful but worth the wait!
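For the exact-match part, a rough sketch of the same idea on the question's toy data (assuming, as in the loop above, that a contr row should not be reused once picked; marking rows as used avoids repeatedly rebuilding the contr frame):

```r
set.seed(1)
id <- c(1, 2, 3, 1, 4, 3, 5)
date <- c("2011-1-1", "2011-1-1", "2011-2-2", "2012-3-3",
          "2012-4-4", "2012-5-5", "2012-6-6")
d <- data.frame(id, date)

id <- c(7, 8, 9, 10, 7, 10, 8, 10, 11, 12)
date <- c("2011-1-1", "2011-1-1", "2011-2-2", "2012-3-3", "2012-3-3",
          "2012-4-4", "2012-4-4", "2012-5-5", "2012-6-6", "2012-6-6")
contr <- data.frame(id, date)

available <- rep(TRUE, nrow(contr))   # mark used contr rows instead of deleting them
y <- rep(NA_character_, nrow(d))      # one contr "w" (id + date) per d row
for (i in seq_len(nrow(d))) {
  cand <- which(available & contr$date == d$date[i])
  if (length(cand) > 0) {
    # sample(cand, 1) misbehaves when cand has length 1 (it samples from 1:cand)
    pick <- if (length(cand) == 1) cand else sample(cand, 1)
    y[i] <- paste(contr$id[pick], contr$date[pick])
    available[pick] <- FALSE
  }
}
y
```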

R - How to completely detach a subset plm.dim from a parent plm.dim object?

I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.
Following the proposed solution in the comments, I find it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried this at home on the example dataset and it worked fine. However, once again, in my real data the problem is not solved.
Here's the output of my actual data (original 37 firms)
sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)
[1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)
[1] 8
Not only is the problem not solved; for some reason I get an extra count of a zero variable, while I explicitly ask to keep only the ones that differ from zero.
Unfortunately, I cannot recreate the same problem with a simple example. For that example, as said, droplevels() works just fine.
A simple reproducible example explains:
library(plm)
dad<-cbind(as.data.frame(matrix(seq(1:40),8,5)),factors = c("q","w","e","r"), year = c("1991","1992", "1993","1994"))
dad<-plm.data(dad,index=c("factors","year"))
kid<-dad[tapply(dad$V5,dad$factors,sum)<=70,]
tapply(kid$V1,kid$factors,mean)
kid<-droplevels(dad[tapply(dad$V5,dad$factors,sum)<=70,])
tapply(kid$V1,kid$factors,mean)
So I create a dad and a kid dataframe based on some tapply condition (I'm sure this extends more generally).
the result of the tapply on the kid is the following
e q r w
7 NA 8 NA
Clearly R has not forgotten the dad: it still reports the two dropped factors as NA. In itself not much of a problem, but in my real dataset, with many more variables and subsetting to do, I'd like a cleaner cut so that searching through the kid(s) is easier. In other words, I don't want the initial factors q w e r to be remembered. The desired output would thus be:
e r
7 8
So, can anyone think of a reason why what works perfectly in a small data.frame would behave differently in a larger one? For p.data (N = 592, T = 16 and n = 37), I find that when I run two identical tapply calls, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared; literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could solve the mystery of the factors that refuse to drop as well
Thanks
Simon
