Why does the mutate() command create NAs? - r

I am currently working on an Amazon dataset with many rows, which makes it hard to spot issues in the data.
My goal is to look at the Amazon data and see whether certain products have a higher variance in star ratings than others. I have a variable indicating product ID (asin), a variable indicating the star rating (overall), and want to create a variance variable.
I have therefore used dplyr's group_by function in combination with mutate. Even though none of the input variables have NAs/missings, my output variable does. I have tried to look for a solution, yet only found answers on what to do if the input has NAs.
See my code attached:
any(is.na(data$asin))
#[1] FALSE
any(is.na(data$overall))
# [1] FALSE
#create variable that represents variance of rating, grouped by product type
data <- data %>%
group_by(asin) %>%
mutate(ProductVariance = var(overall))
any(is.na(data$ProductVariance))
# [1] TRUE
sum(is.na(data$ProductVariance))
# [1] 289
I would much appreciate your help! Even though the number of NAs is small relative to the number of reviews, I would still like to get accurate means (the NAs hinder the use of tapply) and be as precise as possible in follow-up analyses.
Thank you in advance!

var will return NA if the input is length one. So any ASINs that appear only once in your data will have NA variance. Depending on what you're doing with it, you may find it convenient to change those NAs to 0s:
var(1)
# [1] NA
...
mutate(ProductVariance = coalesce(var(overall), 0))
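A tiny illustration of the fix on toy data (not the asker's Amazon set):
library(dplyr)
toy <- data.frame(asin = c("A", "A", "B"), overall = c(5, 3, 4))
toy %>%
  group_by(asin) %>%
  mutate(ProductVariance = coalesce(var(overall), 0))
# "A" appears twice, so var() is defined; "B" appears once, so var() is NA
# and coalesce() replaces it with 0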

Is it possible that what you're seeing is that "empty" groups are not showing up? You can change the default with group_by()'s .drop argument: when .drop = TRUE (the default), empty groups are dropped.


R mean of one column based on another [duplicate]

I have a dataset named bwght which contains the variable cigs (cigarettes smoked per day).
When I calculate the mean of cigs in the dataset bwght using mean(bwght$cigs), I get 2.08.
Only 212 of the 1388 women in the sample smoke (and 1176 do not smoke):
summary(bwght$cigs>0) gives the result:
   Mode   FALSE    TRUE    NA's
logical    1176     212       0
I'm asked to find the average of cigs among the women who smoke (the 212).
I'm having a hard time finding the right syntax for excluding the non-smokers (cigs == 0).
I have tried:
mean(bwght$cigs| bwght$cigs>0)
mean(bwght$cigs>0 | bwght$cigs=TRUE)
if (bwght$cigs > 0){
sum(bwght$cigs)
}
x <-as.numeric(bwght$cigs, rm="0");
mean(x)
But nothing seems to work! Can anyone please help me??
If you want to exclude the non-smokers, you have a few options. The easiest is probably this:
mean(bwght[bwght$cigs>0,"cigs"])
With a data frame, the first index is the row and the second is the column, so dataframe[1,2] gets the first row, second column. You can also use a logical vector in the row position: with bwght$cigs>0 as the first index, you keep only the rows where cigs is greater than zero.
Your other ones didn't work for the following reasons:
mean(bwght$cigs| bwght$cigs>0)
This is effectively a logical comparison: you're asking for the TRUE/FALSE result of bwght$cigs OR bwght$cigs>0 and then taking the mean of that. mean() does accept a logical vector, but it coerces TRUE to 1 and FALSE to 0, so you'd get the proportion of TRUEs, not the mean number of cigarettes.
mean(bwght$cigs>0 | bwght$cigs=TRUE)
Same problem: | returns a logical vector, so you'd again be averaging TRUEs and FALSEs. (This one actually fails before that, because bwght$cigs=TRUE uses = rather than the comparison operator ==, which the parser rejects inside the larger expression.)
if(bwght$cigs > 0){sum(bwght$cigs)}
By any chance, were you originally a SAS programmer? This looks like how I used to write at first. if() doesn't work the same way in R as it does in SAS: here the condition bwght$cigs > 0 is a whole vector, and if() only uses its first element (recent R versions raise an error for a condition of length greater than one). R handles this kind of element-wise or grouped work differently from SAS - check out functions like lapply, tapply, and so on (a small tapply sketch follows these notes).
x <- as.numeric(bwght$cigs, rm="0")
mean(x)
as.numeric() has no rm argument, so this call just throws an error; removing the quotes wouldn't help. Subsetting, as shown above, is the idiomatic way to drop the zeros.
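As an aside, tapply applies a function within groups in one call; a small sketch on the question's data:
# mean cigarettes per day for non-smokers (FALSE) and smokers (TRUE)
tapply(bwght$cigs, bwght$cigs > 0, mean)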
mean(bwght[bwght$cigs>0,"cigs"])
I found this statement failed, returning "argument is not numeric or logical: returning NA" - this happens when the single-column subset comes back as a data frame rather than a vector (e.g., when bwght is a tibble).
Converting to matrix solved this:
mean(data.matrix(bwght[bwght$cigs>0,"cigs"]))
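A simpler alternative that sidesteps the data-frame-versus-vector issue entirely (no matrix conversion needed): subset the column as a plain vector before averaging.
mean(bwght$cigs[bwght$cigs > 0])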

How to extract top features by CATScore in r?

I am running a machine learning algorithm that uses the CAT score for feature selection:
library(sda)
train1<- data.matrix(train, rownames.force = NA)
ranking.LDA = sda.ranking(train1[,1:lengthvar], train1[,lengthtrain], diagonal=FALSE)
topfs<-which(ranking.LDA[,"score"] >2)
My question is: how can I ask for, say, the top 20 features by CAT score? The only way I could extract features was by setting a threshold, but that returns a different number of features for each dataset. What I want is to always get the top 20 (or any other fixed number of) features.
Thanks in advance for your valuable contribution.
ranking.LDA returns the predictors ranked by score, so we can use that ranking directly:
# As ranking.LDA gives a ranking of the predictors, we extract column names using this ranking
colnames(train1[,ranking.LDA[1:20]])
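For reference, a slightly more explicit sketch; it assumes sda.ranking() returns its usual matrix with rows ordered by score and the original column positions in an "idx" column:
top20 <- ranking.LDA[1:20, "idx"]   # column indices of the 20 best-ranked predictors
colnames(train1)[top20]             # their names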

Optimizing dataset based on several conditions

I am trying to construct an (optimal) subset from a large dataset based on several conditions. I know that there are some possibilities to construct such a subset; see for example this link. I tried that function, but it is unsatisfactory, since it takes too long to find such a subset and might not be "intelligent" enough. Below you can find some sample data:
library(data.table)
data <- data.table(id = rep(c("a","b","c","d","e","f"), 3),
                   balance = c(1000, 2000, 1500, 2000, 4000, 1500,
                               800, 2000, 1300, 1800, 2000, 500,
                               700, 1900, 1100, 1600, 500, 30),
                   rate = c(1100, 1500, 1000, 700, 300, 200,
                            400, 700, 500, 1300, 1600, 700,
                            800, 1100, 1200, 700, 400, 150),
                   grade = c(70, 100, 90, 50, 150, 40,
                             30, 80, 55, 80, 85, 20,
                             35, 70, 55, 75, 15, 10),
                   date = rep(c(2012, 2013, 2014), each = 6))
data_agg <- aggregate(cbind(rate, grade) ~ date, data = data.frame(data), sum, na.rm = TRUE)
data_agg$ratio <- data_agg$rate / data_agg$grade
data_agg$ratio
[1] 9.60000 14.85714 16.73077
Now the objective is (e.g.) to minimize the increase in data_agg$ratio over the years while keeping at least 3 IDs in the subset.
By looking at the data we see, e.g., that ID == "e" has a ratio of 300/150 = 2 in 2012, 1600/85 ≈ 19 in 2013, and 400/15 ≈ 27 in 2014. Since the objective is to minimize the increase over the years, deleting "e" might have a desirable effect on the subset.
datasubset <- subset(data, subset = id != "e")
data_aggsubset <- aggregate(cbind(rate, grade) ~ date, data = data.frame(datasubset), sum, na.rm = TRUE)
data_aggsubset$ratio <- data_aggsubset$rate / data_aggsubset$grade
data_aggsubset$ratio
[1] 12.85714 13.58491 16.12245
And indeed, the ratio is more stable over the years now. My question is thus whether there is some optimizer function that seeks IDs such that this ratio stays, e.g., within a bandwidth of +/- 50% of the starting value (9.6 in this example) while containing at least three IDs. My original dataset is large, so I am looking for a more intelligent function than the one linked above. Please let me know if anything is unclear. Thank you in advance!
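No ready-made optimizer is named here, but an exhaustive search makes the objective concrete; a minimal sketch, which reads "minimize the increase" as minimizing the spread of the yearly ratio, and which is only feasible for a handful of IDs (a large dataset would need a greedy or other heuristic search):
ids <- unique(data$id)
best <- NULL; best_spread <- Inf
for (k in 3:length(ids)) {                        # at least 3 IDs, as required
  for (keep in combn(ids, k, simplify = FALSE)) {
    agg <- data[id %in% keep, .(ratio = sum(rate) / sum(grade)), by = date]
    spread <- max(agg$ratio) - min(agg$ratio)     # how unstable the ratio is
    if (spread < best_spread) { best_spread <- spread; best <- keep }
  }
}
best          # the IDs to keep
best_spread   # the achieved spread of the ratio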

R - How to completely detach a subset plm.dim from a parent plm.dim object?

I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.
Following the proposed solution in the comments, I find it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried the suggestion at home on the example dataset and it worked fine. However, in my real data the problem remains unsolved.
Here's the output for my actual data (originally 37 firms):
sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)
[1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)
[1] 8
Not only is the problem not solved; for some reason I get an extra count of a zero-sum firm, even though I explicitly ask to keep only the ones that differ from zero.
Unfortunately, I cannot recreate the problem with a simple example; there, as said, droplevels() works just fine.
A simple reproducible example explains:
library(plm)
dad<-cbind(as.data.frame(matrix(seq(1:40),8,5)),factors = c("q","w","e","r"), year = c("1991","1992", "1993","1994"))
dad<-plm.data(dad,index=c("factors","year"))
kid<-dad[tapply(dad$V5,dad$factors,sum)<=70,]
tapply(kid$V1,kid$factors,mean)
kid<-droplevels(dad[tapply(dad$V5,dad$factors,sum)<=70,])
tapply(kid$V1,kid$factors,mean)
So I create a dad and a kid dataframe based on some tapply condition (I'm sure this extends more generally).
The result of the tapply on the kid is the following:
e q r w
7 NA 8 NA
Clearly R has not forgotten the dad: the two dropped factors still show up, as NA. In itself that's not much of a problem, but in my real dataset, with many more variables and more subsetting to do, I'd like a cleaner cut so that searching through the kid(s) is easier. In other words, I don't want the initial factor levels q, w, e, r to be remembered. The desired output would thus be:
e r
7 8
So, can anyone think of a reason why something that works perfectly in a small data.frame would behave differently in a larger one, for p.data (N = 592, T = 16 and n = 37)? I find that when I run two identical tapply calls, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared: literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could solve the mystery of the factors that refuse to drop as well.
Thanks
Simon
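A possible workaround, assuming the stale plm index attribute is what keeps the old levels alive: strip the plm structure, drop the unused levels on a plain data.frame, and rebuild the panel object from scratch, so nothing survives from the parent:
kid_plain <- droplevels(as.data.frame(kid))        # plain data.frame, no panel index left
kid_fresh <- plm.data(kid_plain, index = c("factors", "year"))
tapply(kid_fresh$V1, kid_fresh$factors, mean)      # only the surviving levels remain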

Variable with vacant value

I want to know how R treats a vacant (missing) value of a variable. I mean: I have a variable with two values, 1 (a behavior occurs) and 0 (a behavior doesn't occur). But in my table some data are missing because I couldn't observe the behavior, so there are some vacant values. Can I work with this variable as it is without any problems, or do I first have to remove the vacant data and keep only the known values?
It really depends on what you want to do with the data. In R you would usually use NA for missing data. Some functions can deal with this, for example:
mean(c(1, NA))
# [1] NA
mean(c(1, NA), na.rm = TRUE)
# [1] 1
but in other cases you may need to eliminate missing values before performing the analysis, for example using the subset function, as sketched below.
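A small sketch, where my_data and behaviour stand in for your table and variable (hypothetical names):
# keep only the rows where the behaviour was actually observed
clean <- subset(my_data, !is.na(behaviour))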
