Looping through a dataset in R and counting occurrences of variables

I found an interesting dataset from a psychology study (the dataset is called WearingTShirt), and I would like to replicate the results. I need to summarize two variables into a single variable. This is what I have written:
# Create empty variable
PinkAndRed = 0
# Count instances of people wearing both pink and red and add 1
for i in WearingTShirt:
    PinkAndRed + 1 if:
        WearingTShirt$PINKSHIRT == 1 OR WearingTShirt$REDSHIRT == 1
# Add variable to dataset
WearingTShirt$PinkAndRed
I do not have much R experience (I have mostly written Python).

Your code is more Python than R. The equivalent R code for what you want to do is:
# One 0 per row of the dataset
PinkAndRed = rep(0, dim(WearingTShirt)[1])
# Loop over the rows and flag anyone wearing a pink or a red shirt
for (i in 1:dim(WearingTShirt)[1]) {
  if ((WearingTShirt$PINKSHIRT[i] == 1) || (WearingTShirt$REDSHIRT[i] == 1)) {
    PinkAndRed[i] = 1
  }
}
# Attach the new variable to the dataset
WearingTShirt = cbind(WearingTShirt, PinkAndRed)
You need to review the basics of R. There are countless small differences between R and Python, such as the parentheses in loops and conditions, or how the length of a loop is set (in the code above, dim calculates the dimensions of the dataset, and [1] selects the number of rows).
Update:
Thanks to the comments I've realized it is not clear whether you want a cumulative sum of the individuals with pink and red shirts, or a variable which is 1 when the shirt is pink or red and 0 otherwise.
The code above is for a variable that combines pink and red shirts into one variable.
If you want the sum, you must use the cumsum function, as mentioned in the comments.
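For the cumulative-sum reading, a minimal sketch (assuming the same binary PINKSHIRT and REDSHIRT columns as above):
# 0/1 indicator per row, then a running count down the dataset
pink_or_red <- as.integer(WearingTShirt$PINKSHIRT == 1 | WearingTShirt$REDSHIRT == 1)
running_total <- cumsum(pink_or_red)
tail(running_total, 1)  # total number of pink-or-red wearers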

I would not choose to loop, but:
WearingTShirt$PinkAndRed <- ifelse(WearingTShirt$PINKSHIRT == 1 |
                                   WearingTShirt$REDSHIRT == 1, 1, 0)
PinkAndRed sounds more like PinkOrRed, based on the example given.

Related

Create variables from list in R

I am trying to create variables based on a list I have.
The list looks something like this:
color = c("blue","green","yellow")
Each string in the list will become a variable. Each variable should take its values based on another column (for example, usercolorlist). Here is the pseudocode:
for each row in usercolorlist:
    if usercolorlist contains "blue":
        blue = 1
    else:
        blue = 0
Ultimately, the output would be:
usercolorlist     blue
"blue/red/green"  1
"red/green"       0
"blue/red"        1
I want to implement this as cleanly as possible. I mainly use Python and have been told that for loops are not as efficient in R.
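A vectorized sketch of the idea (assumptions: usercolorlist is a character column of a data frame called df, and grepl does the substring test, so no row loop is needed):
df <- data.frame(usercolorlist = c("blue/red/green", "red/green", "blue/red"),
                 stringsAsFactors = FALSE)
color <- c("blue", "green", "yellow")
# one 0/1 indicator column per color; fixed = TRUE treats the color as a
# literal string rather than a regular expression
for (col in color) {
  df[[col]] <- as.integer(grepl(col, df$usercolorlist, fixed = TRUE))
}
df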

Conditionally removing duplicates in R (20K observations)

I am currently working with a large data set, looking at duplicate water rights. Each right holder is assigned a RightID, but some were recorded twice for clerical purposes. However, some RightIDs are listed more than once and do have relevance to my end goal. One example: there are double entries when a metal tag number was assigned to a specific water right. To avoid double counting the critical information, I need to delete one of the observations.
I have this written at the moment,
# Updated Metal Tag Number
for (i in 2:nrow(duplicate.rights)) {
  if (duplicate.rights[i, "RightID"] == duplicate.rights[i - 1, "RightID"] &
      duplicate.rights[i, "MetalTagNu"] != duplicate.rights[i - 1, "MetalTagNu"]) {
    duplicate.rights <- duplicate.rights[-i, ]  # attempt to drop the duplicate row
  }
  print(i)
}
The original data frame is set up similarly:
RightID  Source         Use         MetalTagNu
1-0000   Wolf Creek     Irrigation  N/A
1-0000   Wolf Creek     Irrigation  12345
1-0001   Bear River     Domestic    N/A
1-0002   Beaver Stream  Domestic    00001
1-0002   Beaver Stream  Irrigation  00001
E.g. right holder 1-0002 is necessary to keep because he is using his water right for two different purposes, whereas the repeat of right holder 1-0000 is unnecessary.
Right holder 1-0000 I need to eliminate, but right holder 1-0002 is valuable to my end goal. I should also note that there can be up to 10 entries for a single RightID, but out of those 10 only 1 is an unnecessary duplicate. Also, the duplicate and the original entry will not be next to each other in the dataset.
I am quite the novice, so please forgive my poor previous attempt. I know I can use the lapply function to make this go faster and more efficiently. Any guidance there would be much appreciated.
So I would suggest the following:
1) You say that you want to keep some duplicates (rows where a metal tag number was assigned to a specific water right). I don't know exactly what this means, but I assume it is something like this: if metal tag number = 1, then even if there are duplicates, you want to keep them. So I propose that you split these rows out of your data (let's call it data):
data_to_keep <- data[data$metal_tag_number == 1, ]
data_to_dedupe <- data[data$metal_tag_number != 1, ]
2) Now that you have the two dataframes, you can dedupe the dataframe data_to_dedupe with no problem:
deduped_data = data_to_dedupe[!duplicated(data_to_dedupe$dedupe_key), ]
3) Now you can merge the two dataframes back together:
final_data <- rbind(data_to_keep, deduped_data)
If this is what you wanted, please upvote and accept the answer. Thanks!
Create a new column, key, which is a combination of RightID & Use.
Assuming your dataframe is called df,
df$key <- paste(df$RightID,df$Use)
Then, remove duplicates using this command :
df1 <- df[!duplicated(df$key), ]
df1 will have no duplicates.
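Putting both steps together on the sample data from the question (a sketch; column names are taken from the table above, and note that duplicated keeps the first occurrence, so order the rows with the tagged entries first if those are the ones you want to survive):
df <- data.frame(
  RightID    = c("1-0000", "1-0000", "1-0001", "1-0002", "1-0002"),
  Source     = c("Wolf Creek", "Wolf Creek", "Bear River", "Beaver Stream", "Beaver Stream"),
  Use        = c("Irrigation", "Irrigation", "Domestic", "Domestic", "Irrigation"),
  MetalTagNu = c(NA, "12345", NA, "00001", "00001"),
  stringsAsFactors = FALSE
)
df$key <- paste(df$RightID, df$Use)
df1 <- df[!duplicated(df$key), ]  # 1-0000 kept once; both 1-0002 rows survive
df1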

Return matching names instead of binary variables in R

I'm new here and diving into R, and I'm encountering a problem while trying to solve a knapsack problem.
For optimization purposes I wrote a dynamic program in R. Now that I am at the point of returning the selected items, which I succeeded in, I only get binary numbers saying whether each item has been selected or not (1 = yes). Like this:
Select
[1] 1 0 0 1
However, now I would like the Select function to return the names of the values instead of these binary values. Below I have created an example of what my problem looks like.
This would be the data and a related data frame.
items <- c("Glasses","gloves","shoes")
grams <- c(4,2,3)
value <- c(100,20,50)
data <- data.frame(items,grams,value)
Now, I created various functions, with the final one indicating whether a product has been selected by 1 (yes) or 0 (no), as above. However, I would really like it to return the name of the related item instead. Is there a way to do this by linking back to the data frame created?
So that, in case all products are selected, instead of
Select
[1] 1 1 1
it would say
Select
[1] Glasses gloves shoes
I believe I would have to create a new function. But as I mentioned, is there a good way to refer back to the data frame and take the related values from another column in case of a 1 (yes)?
I really hope my question is more clear now and someone can direct me in the right direction.
Best, Berber
Let's say your binary vector is
idx <- c(1, 0, 1)
Just use
items[as.logical(idx)]
which will give you the names of the selected items, and
items[!as.logical(idx)]
will give you the names of the unselected items.
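A minimal end-to-end sketch using the question's example data (here Select stands in for the binary vector your final function returns; that function itself is assumed, not shown):
items <- c("Glasses", "gloves", "shoes")
grams <- c(4, 2, 3)
value <- c(100, 20, 50)
data <- data.frame(items, grams, value)
Select <- c(1, 0, 1)        # e.g. glasses and shoes selected
items[as.logical(Select)]   # "Glasses" "shoes"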

Filtering grouped data in R

I was wondering if anyone can help with grouping the data below. I'm trying to use the subset function to filter out volumes below a certain threshold, but given that the data represents groups of objects, this creates the problem of removing certain items that should be kept.
In column F (and I) you can see Blue, Red, and Yellow objects. Each represents one of three separate colored probes on one DNA strand. Odd-numbered or non-numbered Blue, Red, and Yellow objects are paired with a homologous strand represented by an even-numbered Blue, Red, and Yellow. I.e., the data in rows 2, 3, and 4 are one "group" and pair with the "group" shown in rows 5, 6, and 7. This then repeats, so 8, 9, 10 are a new group, and that group pairs with the one in 11, 12, 13.
What I would like to do is subset the groups so that only those below a certain Distance to Midpoint (column M) are kept. The midpoint here is the midpoint of the line that connects the blue of one group with the blue of its partner, so the subset should only apply to the Blue distance to midpoint, and that is where I'm having a problem. For instance, if I ask to keep blue distances to midpoint that are less than 3, then the objects in rows 3 and 4 should also be kept, because they are part of the group whose blue distance is below 3. Right now, though, when I filter with the subset function, I lose the Red and Yellow selections. I'm confident there is a straightforward solution to this in R, but I'd also be open to some type of filtering in Excel if anyone has suggestions via that route instead.
EDIT
I managed to work something out in Excel last night after posting the question. The solution isn't pretty, but it works well enough. I just added a new column next to "distance to midpoint" that gives all the objects in one group the same distance, so that when I filter the data I won't lose any objects that I shouldn't. If it helps anyone in the future, the formula I used in Excel was (an R sketch of the same idea follows the formula):
=SQRT( ((INDEX($B$2:$B$945,1+QUOTIENT(ROWS(B$2:B2)-1,3)*3))-(INDEX($O$2:$O$945,1+QUOTIENT(ROWS(O$2:O2)-1,3)*3)))^2 + ((INDEX($C$2:$C$945,1+QUOTIENT(ROWS(C$2:C2)-1,3)*3))-(INDEX($P$2:$P$945,1+QUOTIENT(ROWS(P$2:P2)-1,3)*3)))^2 + ((INDEX($D$2:$D$945,1+QUOTIENT(ROWS(D$2:D2)-1,3)*3))-(INDEX($Q$2:$Q$945,1+QUOTIENT(ROWS(Q$2:Q2)-1,3)*3)))^2 )
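An R version of the same trick (a sketch; it assumes a data frame called mydata whose rows come in fixed blocks of three with the Blue row first, and a column named distance_to_midpoint; all of those names are placeholders):
# give every row in a block of 3 the blue (first) row's distance,
# so whole groups survive or fall together when filtering
grp <- rep(seq_len(nrow(mydata) / 3), each = 3)
mydata$group_distance <- ave(mydata$distance_to_midpoint, grp, FUN = function(d) d[1])
subset(mydata, group_distance < 3)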
Would be easier with a reproducible example, but here's a (hacky) dplyr solution:
library(dplyr)

filterframe <- function(df, threshold) {
  # label each block of 6 rows (a group of 3 objects plus its partner group)
  df$grouper <- rep(seq_len(nrow(df) / 6), each = 6)
  df %>%
    group_by(grouper) %>%
    filter(first(distance_to_midpoint) < threshold) %>%  # blue row comes first in its block
    ungroup()
}

filterframe(mydata, threshold = 3)
A base R solution is provided below. The idea is that once your data are in R, you keep rows iff they meet 2 criteria. First, the Surpass column has to contain the word "blue" in it, which is done with the grepl function. Second, the distance must be below a certain threshold (set arbitrarily by thresh).
fakeData = data.frame(Surpass = c('blue', 'red', 'green', 'blue'),
                      distance = c(1, 2, 5, 3), num = c(90, 10, 9, 4))
# thresh is your distance threshold
thresh = 2
# keep the blue rows whose distance is under the threshold
fakeDataBlue = fakeData[which(grepl('blue', fakeData$Surpass)
                              & fakeData$distance < thresh), ]
There's probably also a quick dplyr solution using filter, but I haven't fully explored the functionality there. Also, I may be a bit confused about whether you also want to keep the other colors. If so, that's the same as saying you want to remove the blue rows exceeding a certain distance threshold, in which case you would negate the which command (-which) and turn the < operator into a >.
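For reference, a dplyr sketch of the same selection (assuming the fakeData and thresh defined above):
library(dplyr)
fakeData %>%
  filter(grepl('blue', Surpass), distance < thresh)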

R - How to completely detach a subset plm.dim from a parent plm.dim object?

I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.
Update: Following the proposed solution in the comments, I find it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried this at home on the example dataset and it worked fine. However, once again, in my real data the problem is not solved.
Here's the output of my actual data (originally 37 firms):
sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)
[1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)
[1] 8
Not only is the problem not solved; for some reason I get an extra count of a zero variable, even though I explicitly ask to keep only the ones that differ from zero.
Unfortunately, I cannot recreate the same problem with a simple example. For that example, as said, droplevels() works just fine.
A simple reproducible example explains:
library(plm)
dad<-cbind(as.data.frame(matrix(seq(1:40),8,5)),factors = c("q","w","e","r"), year = c("1991","1992", "1993","1994"))
dad<-plm.data(dad,index=c("factors","year"))
kid<-dad[tapply(dad$V5,dad$factors,sum)<=70,]
tapply(kid$V1,kid$factors,mean)
kid<-droplevels(dad[tapply(dad$V5,dad$factors,sum)<=70,])
tapply(kid$V1,kid$factors,mean)
So I create a dad and a kid dataframe based on some tapply condition (I'm sure this extends more generally).
The result of the tapply on the kid is the following:
e q r w
7 NA 8 NA
Clearly R has not forgotten the dad, and it adds that two factors are NA. In itself not much of a problem, but in my real dataset, with many more variables and subsetting to do, I'd like a cleaner cut so that searching through the kid(s) is easier. In other words, I don't want the initial factors q w e r to be remembered. The desired output would thus be:
e r
7 8
So, can anyone think of a reason why what works perfectly in a small data.frame would work differently in a larger dataframe, namely p.data (N = 592, T = 16 and n = 37)? I find that when I run 2 identical tapply functions, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared, literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could solve the mystery of the factors that refuse to drop as well.
Thanks
Simon
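One avenue worth testing (a sketch only, assuming the plm.dim index can simply be rebuilt after subsetting; this has not been verified against plm's internals):
# strip the plm.dim class so droplevels operates on a plain data.frame,
# then rebuild the index from the surviving rows (sketch, unverified)
kid <- dad[tapply(dad$V5, dad$factors, sum) <= 70, ]
kid <- droplevels(as.data.frame(kid))
kid <- plm.data(kid, index = c("factors", "year"))
tapply(kid$V1, kid$factors, mean)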
