Variable with vacant values - R

I want to know how R treats vacant (missing) values of a variable. I have a variable with two values: 1 (a behavior occurs) and 0 (a behavior doesn't occur). But in my table some data are missing because I couldn't observe the behavior, so there are some vacant entries. If I work with this variable, can I use it without any problems, or do I have to remove the vacant data first and keep only the known values?

It really depends on what you want to do with the data. In R you would usually use NA for missing data. Some functions can deal with this, for example:
mean(c(1, NA))
# [1] NA
mean(c(1, NA), na.rm = TRUE)
# [1] 1
but in other cases you may need to eliminate missing values before performing the analysis, for example using the subset() function.
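For instance, a minimal sketch (assuming a hypothetical data frame df with a 0/1 column behavior):
df <- data.frame(id = 1:4, behavior = c(1, 0, NA, 1))
subset(df, !is.na(behavior))  # keep only the rows where the behavior was observed
na.omit(df)                   # or drop every row containing an NA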

Related

How to remove some values from a 4-dimensional matrix?

I'm working with a 4-dimensional matrix (Year, Simulation, Flow, Time instant: 10x5x20x10) in R. I need to remove some values from the matrix. For example, for year 1 I need to remove simulations number 1 and 2; for year 2 I need to remove simulation number 5.
Can anyone suggest how I can make such changes?
Arrays (which is how R documentation usually refers to higher-dimensional 'matrices') can be indexed with negative values in the same way as matrices or vectors: a negative value removes the corresponding row/column/slice. So if you wanted to remove year 1 completely (for example), you could use a[-1,,,]; to remove simulation 5 completely, a[,-5,,].
However, arrays can't be "ragged", there has to be something in every row/column/slice combination. You could replace the values you want to remove with NAs (and then make sure to account for the NAs appropriately when computing, e.g. using na.rm = TRUE in sum()/min()/max()/median()/etc.): a[1,1:2,,] <- NA or a[2,5,,] <- NA in your examples.
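A small sketch of both approaches on a toy array with the dimensions above (values are arbitrary):
a <- array(rnorm(10 * 5 * 20 * 10), dim = c(10, 5, 20, 10))
a_no_year1 <- a[-1, , , ]   # drop year 1 entirely (now 9 x 5 x 20 x 10)
a_no_sim5  <- a[, -5, , ]   # drop simulation 5 entirely (now 10 x 4 x 20 x 10)
a[1, 1:2, , ] <- NA         # blank out simulations 1 and 2 of year 1
a[2, 5, , ]   <- NA         # blank out simulation 5 of year 2
mean(a, na.rm = TRUE)       # summaries then need na.rm = TRUE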
If you knew that all values of Flow and Time would always be present, you could store your data as a list of lists of matrices: e.g.
results <- list(Year1 = list(Simulation1 = matrix(...),
Simulation2 = matrix(...),
...),
Year2 = list(Simulation1 = matrix(...),
Simulation2 = matrix(...),
...))
Then you could easily remove years, or simulations within years, by setting them to NULL, but it would make indexing a little harder (e.g. "retrieve Simulation1 values for all years" would require an lapply or a loop across years).
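A minimal sketch of that pattern, with tiny 2x2 matrices standing in for the real Flow x Time data:
results <- list(Year1 = list(Simulation1 = matrix(1:4, 2, 2),
                             Simulation2 = matrix(5:8, 2, 2)),
                Year2 = list(Simulation1 = matrix(9:12, 2, 2)))
lapply(results, function(y) y$Simulation1)  # retrieve Simulation1 for every year
results$Year1$Simulation2 <- NULL           # remove one simulation
results$Year2 <- NULL                       # remove a whole year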

Why does the mutate() command create NAs?

I am currently working on an Amazon dataset with many rows, which makes it hard to spot issues in the data.
My goal is to look at the Amazon data and see whether certain products have a higher variance in star ratings than others. I have a variable indicating product ID (asin) and a variable indicating the star rating (overall), and I want to create a variance variable.
I have thus used dplyr's group_by function in combination with the mutate function. Even though none of the input variables have NAs/missings, my output variable does. I have tried to find a solution, yet have only found solutions for what to do if the input has NAs.
See my code attached:
any(is.na(data$asin))
# [1] FALSE
any(is.na(data$overall))
# [1] FALSE
# create a variable that represents the variance of the rating, grouped by product
data <- data %>%
  group_by(asin) %>%
  mutate(ProductVariance = var(overall))
any(is.na(data$ProductVariance))
# [1] TRUE
sum(is.na(data$ProductVariance))
# [1] 289
I would much appreciate your help! Even though the number of NAs is not big relative to the number of reviews, I would still like to get accurate means (the NAs hinder the use of tapply) and to be as precise as possible in follow-up analyses.
Thank you in advance!
var() will return NA if the input has length one, so any ASIN that appears only once in your data will have an NA variance. Depending on what you're doing with it, you may find it convenient to change those NAs to 0s:
var(1)
# [1] NA
...
mutate(ProductVariance = coalesce(var(overall), 0))
Is it possible that what you're seeing is that "empty" groups are not showing up? You can change the default with .drop.
When .drop = TRUE, empty groups are dropped.
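A minimal reproducible sketch of the length-one case and the coalesce() fix (toy data, hypothetical values):
library(dplyr)
toy <- data.frame(asin = c("A", "A", "B"), overall = c(5, 3, 4))
toy %>% group_by(asin) %>% mutate(ProductVariance = var(overall))
# asin "B" has a single review, so its ProductVariance is NA
toy %>% group_by(asin) %>% mutate(ProductVariance = coalesce(var(overall), 0))
# the singleton group now gets 0 instead of NA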

R: Produce Index Values to Group Increasing Values in Vector

I have a list of increasing year values that occasionally has breaks in it, and I want to create a grouping value for each unbroken sequence. Think of a vector like this one (missing 2005 and 2011-2012):
x <- c(2001,2002,2003,2004,2006,2007,2008,2009,2010,2013,2014,2015,2016)
I would like to produce an equal-length vector that numbers every value in a run with the same index, to end up with something like this:
[1] 1 1 1 1 2 2 2 2 2 3 3 3 3
I would like to do this using best R practices, so I am trying to avoid falling back to a for loop, but I am not sure how to get from vector A to vector B. Does anyone have any suggestions?
Some things I know I can do:
I can flag the record before or after a gap as TRUE with an ifelse
I can get the index of when the counter should change by wrapping that in a which statement
Here is the code for each (note that lag() here is dplyr's lag(), not base R's stats::lag()):
ifelse(!is.na(lag(x)) & x == lag(x)+1, FALSE, TRUE)
which(ifelse(!is.na(lag(x)) & x == lag(x)+1, FALSE, TRUE))
I think there are a couple of solutions to this problem. One, as d.b posted in the comment above, produces a sequence that increments every time there is a break in the sequence:
cummax(c(1, diff(x)))
There is a similar solution, which I chose to use, with ifelse() flagging breaks and cumsum(). I chose this solution because additional information, like other vectors, can be included in the decision, and diff() seems to have problems with very erratic up-and-down values.
cumsum(ifelse(!is.na(lag(x)) & x == lag(x) + 1, FALSE, TRUE))
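For the example vector above, this reproduces the desired grouping (dplyr must be loaded for this lag()):
library(dplyr)
x <- c(2001,2002,2003,2004,2006,2007,2008,2009,2010,2013,2014,2015,2016)
cumsum(ifelse(!is.na(lag(x)) & x == lag(x) + 1, FALSE, TRUE))
# [1] 1 1 1 1 2 2 2 2 2 3 3 3 3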

extracting value of variable from dataframe

I have an issue selecting the value of one variable conditional on the value of another variable in a dataframe.
Dilutionfactor <- c(1, 3, 9, 27, 80)
Log10Dilutionfactor <- log10(Dilutionfactor)
Protection <- c(100, 81.25, 40, 10.52, 0)
RM <- as.data.frame(cbind(Dilutionfactor, Log10Dilutionfactor, Protection))
Now I want to find the value of Log10Dilutionfactor where Protection is equal to 50 (if it appears) or, failing that, the value immediately below 50.
When I used subset(RM, Protection <= 50) it gave three rows, and when I tried RM[grepl(RM$Protection <= 50, Log10Dilutionfactor), ] it gave 0 values with a warning message. I would really appreciate it if someone could help me.
You can use two subset() calls:
subset(RM, Protection == max(subset(RM, Protection <= 50)$Protection))$Log10Dilutionfactor
# [1] 0.9542425
You could use
with(RM, Log10Dilutionfactor[which(Protection == max(Protection[Protection <= 50]))])
# [1] 0.9542425
or find the index of the Protection value that is closest to 50:
index <- which(abs(RM$Protection - 50) <= min(abs(RM$Protection - 50)))
and then look it up in whatever column you want, e.g. for Dilutionfactor:
RM$Dilutionfactor[index]
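For the example data, Protection = 40 is the value closest to 50, so index is 3 and this prints:
# [1] 9
Note that this approach picks the closest value on either side of 50, which here happens to fall below it.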

R - How to completely detach a subset plm.dim from a parent plm.dim object?

I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.
Following the proposed solution in the comments, I find it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried this at home on the example dataset and it worked fine. However, once again, in my real data the problem is not solved.
Here's the output for my actual data (originally 37 firms):
sum(tapply(p.data$abs_pb_t, p.data$Rfirm, sum) == 0)
# [1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t, p.data$ID, sum) != 0, ])
sum(tapply(s.data$abs_pb_t, s.data$Rfirm, sum) == 0)
# [1] 8
Not only is the problem not solved; for some reason I get an extra count of a zero variable, even though I explicitly ask to keep only the ones that differ from zero.
Unfortunately, I cannot recreate the same problem with a simple example. For that example, as said, droplevels() works just fine.
A simple reproducible example explains:
library(plm)
dad <- cbind(as.data.frame(matrix(1:40, 8, 5)),
             factors = c("q", "w", "e", "r"),
             year = c("1991", "1992", "1993", "1994"))
dad <- plm.data(dad, index = c("factors", "year"))
kid <- dad[tapply(dad$V5, dad$factors, sum) <= 70, ]
tapply(kid$V1, kid$factors, mean)
kid <- droplevels(dad[tapply(dad$V5, dad$factors, sum) <= 70, ])
tapply(kid$V1, kid$factors, mean)
So I create a dad and a kid dataframe based on some tapply condition (I'm sure this extends more generally).
The result of the tapply on the kid is the following:
e q r w
7 NA 8 NA
Clearly R has not forgotten the dad, and it adds that two factors are NA. In itself this is not much of a problem, but in my real dataset, with many more variables and subsetting to do, I'd like a cleaner cut to make searching through the kid(s) easier. In other words, I don't want the initial factors q w e r to be remembered. The desired output would thus be:
e r
7 8
So, can anyone think of a reason why what works perfectly on a small data.frame would behave differently on a larger one? For p.data (N = 592, T = 16 and n = 37), I find that when I run two identical tapply functions, one on s.data and one on p.data, all the values are different. So not only have the zeros not disappeared; literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could solve the mystery of the factors that refuse to drop as well.
Thanks
Simon
