I have a dataframe, and I want to confirm that two columns match for each entry. So I tried:
> nrow(subset(df, col.a!=col.b))
[1] 0
That seemed good to me, but then I tried to compare how many matches there were to the total number of entries in the data frame. It seems like these numbers should be equal but they are not:
> nrow(subset(df, col.a==col.b))
[1] 3443
> nrow(df)
[1] 3453
Any idea what is going on here? Why does it look like the subset dropped 10 entries? Thanks so much for your help.
Also, I'm fairly new to this, so please let me know if there is a better way of checking if the two columns match.
subset automatically drops rows where the criterion evaluates to NA. It should always (?) be the case that
nrow(d)
and
nrow(subset(d, col.a!=col.b))+
nrow(subset(d, col.a==col.b))+
nrow(subset(d, is.na(col.a) | is.na(col.b)))
are equal.
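A minimal sketch with a made-up data frame d, showing where the dropped rows go and a vectorized check that handles the NAs explicitly:
d <- data.frame(col.a = c(1, 2, NA), col.b = c(1, 2, 2))
nrow(subset(d, col.a == col.b))   # 2 -- the NA row is silently dropped
nrow(d)                           # 3
# count mismatches and incomparable rows separately
sum(d$col.a != d$col.b, na.rm = TRUE)    # 0 genuine mismatches
sum(is.na(d$col.a) | is.na(d$col.b))     # 1 row where the comparison is NA
If all you want is a yes/no answer, all(d$col.a == d$col.b, na.rm = TRUE) together with the NA count above is a compact way to check that the columns match.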
I have a dataset named bwght which contains the variable cigs (cigarettes smoked per day).
When I calculate the mean of cigs in the dataset bwght using:
mean(bwght$cigs), I get 2.08.
Only 212 of the 1388 women in the sample smoke (and 1176 do not smoke):
summary(bwght$cigs>0) gives the result:
Mode FALSE TRUE NA's
logical 1176 212 0
I'm asked to find the average of cigs among the women who smoke (the 212).
I'm having a hard time finding the right syntax for excluding the non-smokers (cigs = 0).
I have tried:
mean(bwght$cigs| bwght$cigs>0)
mean(bwght$cigs>0 | bwght$cigs=TRUE)
if (bwght$cigs > 0){
sum(bwght$cigs)
}
x <-as.numeric(bwght$cigs, rm="0");
mean(x)
But nothing seems to work! Can anyone please help me??
If you want to exclude the non-smokers, you have a few options. The easiest is probably this:
mean(bwght[bwght$cigs>0,"cigs"])
With a data frame, the first index is the row and the second is the column, so you can subset with dataframe[1,2] to get the first row, second column. You can also use a logical condition in the row position: by using bwght$cigs>0 as the first element, you keep only the rows where cigs is greater than zero.
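For reference, here are a few equivalent ways of getting the same subset, assuming bwght is a plain data frame with a numeric cigs column:
mean(bwght$cigs[bwght$cigs > 0])      # index the column vector directly
mean(subset(bwght, cigs > 0)$cigs)    # subset() the rows, then take the column
with(bwght, mean(cigs[cigs > 0]))     # with() avoids repeating bwght$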
Your other attempts didn't work for the following reasons:
mean(bwght$cigs| bwght$cigs>0)
This is a logical expression: bwght$cigs is coerced to TRUE/FALSE (non-zero vs. zero), OR'd with bwght$cigs>0, and then you take the mean of the result. mean() does accept logical vectors, but it treats them as 0/1, so this returns the proportion of smokers, not their average cigarette count.
mean(bwght$cigs>0 | bwght$cigs=TRUE)
Same problem: | returns a logical, so at best you'd get a proportion again. On top of that, = is not the equality operator in R (that's ==); inside a function call, = marks a named argument, so R won't even interpret this line the way you intend.
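As an aside, mean() of a logical vector is perfectly legal in R; it just answers a different question, namely the share of TRUE values:
mean(bwght$cigs > 0)    # proportion of smokers, 212/1388, roughly 0.15
mean(bwght$cigs == 0)   # proportion of non-smokers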
if(bwght$cigs > 0){sum(bwght$cigs)}
By any chance, were you a SAS programmer originally? This looks like how I used to write code at first. Basically, if() doesn't work the same way in R as it does in SAS. Here you are using bwght$cigs > 0, a whole vector, as the if() condition; R only looks at its first element (and recent R versions throw an error when the condition has length greater than one). R handles this kind of row-wise logic differently from SAS: for a vectorized if, use subsetting or ifelse(), as sketched below, and for looping, check out functions like lapply, tapply, and so on.
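For instance, a quick sketch of what a conditional sum looks like in vectorized R:
sum(bwght$cigs[bwght$cigs > 0])               # sum over smokers only
sum(ifelse(bwght$cigs > 0, bwght$cigs, 0))    # ifelse() is the vectorized if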
x <-as.numeric(bwght$cigs, rm="0")
mean(x)
as.numeric() has no rm argument, so this can't remove the zeros; you may be thinking of na.rm = TRUE, which removes NAs, not zeros. Even if it ran, the conversion would leave cigs unchanged, so mean(x) would give the same 2.08 as before.
mean(bwght[bwght$cigs>0,"cigs"])
I found this statement failed for me, warning "argument is not numeric or logical: returning NA". That typically happens when bwght is a tibble rather than a base data frame, because [ then returns a one-column tibble instead of a vector.
Converting to matrix solved this:
mean(data.matrix(bwght[bwght$cigs>0,"cigs"]))
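If the tibble explanation is the cause, a simpler fix that works whether bwght is a base data frame or a tibble is to pull the column out as a plain vector before averaging:
mean(bwght$cigs[bwght$cigs > 0])     # $ always returns a vector
mean(bwght[bwght$cigs > 0, ]$cigs)   # subset the rows first, then extract the column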
I am currently working on an Amazon dataset with many rows, which makes it hard to spot issues in the data.
My goal is to look at the Amazon data and see whether certain products have a higher variance in star ratings than others. I have a variable indicating the product ID (asin) and a variable indicating the star rating (overall), and I want to create a variance variable.
I have therefore used dplyr's group_by function in combination with mutate. Even though none of the input variables have NAs/missing values, my output variable does. I have looked for a solution but only found answers about what to do when the input has NAs.
See my code attached:
any(is.na(data$asin))
#[1] FALSE
any(is.na(data$overall))
# [1] FALSE
#create variable that represents variance of rating, grouped by product type
data <- data %>%
  group_by(asin) %>%
  mutate(ProductVariance = var(overall))
any(is.na(data$ProductVariance))
# [1] TRUE
sum(is.na(data$ProductVariance))
# [1] 289
I would much appreciate your help! Even though the number of NAs is small relative to the number of reviews, I would still like to get accurate means (the NAs get in the way of tapply) and be as precise as possible in follow-up analyses.
Thank you in advance!
var will return NA if the input has length one, so any ASINs that appear only once in your data will have NA variance. Depending on what you're doing with it, you may find it convenient to change those NAs to 0s:
var(1)
# [1] NA
...
mutate(ProductVariance = coalesce(var(overall), 0))
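A minimal reproducible sketch (with a toy data frame, not your actual Amazon data) showing where the NAs come from and how to list the products that have only one review:
library(dplyr)
toy <- data.frame(asin = c("A1", "A1", "A2"), overall = c(5, 3, 4))
toy %>%
  group_by(asin) %>%
  mutate(ProductVariance = var(overall))
# A2 has a single review, so its ProductVariance is NA
toy %>% count(asin) %>% filter(n == 1)   # the ASINs that will come out as NA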
Is it possible that what you're seeing is that "empty" groups are not showing up? You can change the default with .drop.
When .drop = TRUE, empty groups are dropped.
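For completeness, .drop only comes into play when you group by a factor with unused levels; a small sketch, assuming dplyr >= 0.8:
library(dplyr)
f <- factor(c("a", "a"), levels = c("a", "b"))
tibble(g = f, y = 1:2) %>%
  group_by(g, .drop = FALSE) %>%
  summarise(n = n())
# level "b" is kept as an empty group with n = 0; with the default .drop = TRUE it vanishes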
I have an issue with selecting the value of one variable conditional on the value of another variable in a data frame.
Dilutionfactor=c(1,3,9,27,80)
Log10Dilutionfactor=log10(Dilutionfactor)
Protection=c(100,81.25,40,10.52,0)
RM=as.data.frame(cbind(Dilutionfactor,Log10Dilutionfactor,Protection))
Now I want to know the value of Log10Dilutionfactor where Protection equals 50 (if that value appears) or, failing that, the value immediately below 50.
When I used subset(RM, Protection <= 50) it gave three rows, and when I tried RM[grepl(RM$Protection<=50, Log10Dilutionfactor),] it gave 0 rows with a warning message. I would really appreciate it if someone could help me.
You can use two nested subset calls:
subset(RM, Protection == max(subset(RM, Protection <= 50)$Protection))$Log10Dilutionfactor
# [1] 0.954243
You could use
with(RM, Log10Dilutionfactor[which(Protection == max(Protection[Protection <= 50]))])
# [1] 0.9542425
or find the index of the Protection value that is closest to 50
index <- which(abs(RM$Protection - 50) <= min(abs(RM$Protection - 50)))
and then look it up in whatever column you want, e.g. for Dilutionfactor:
RM$Dilutionfactor[index]
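A slightly more compact version of the closest-value lookup is which.min(), which returns only the first index in case of ties. Note that "closest to 50" can also land on a value above 50, which is not quite the same as "50 or the value immediately below it":
index <- which.min(abs(RM$Protection - 50))
RM$Log10Dilutionfactor[index]
# [1] 0.9542425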
I will try to explain my question clearly.
I have a table with some variables X, Y, Z, for example.
Each variable has numeric values.
So, let's say I have
RDIST RDENS AGR BLF
1 146 0.000 0 0.0
2 338 0.000 0 0.0
3 931 0.000 0 3.7
I'm trying to identify outliers, so I used dotchart.
But now I want to know, for each variable, in which observations the outliers are.
With the list(x$BLF>3) command, I get a table of TRUE or FALSE values. But what I need to know is whether the outlier is in observation 2, 3, or 145.
I agree with @MrFlick: which() is the way to go. If you want to take the next step and remove those outliers, you could do
x$BLF <- x$BLF[-which(x$BLF>3)], which takes those indexes MrFlick was talking about and drops those entries from the BLF vector using the - operator, then stores the result back into the same column. Actually, REALLY don't do that: the shortened vector no longer matches the number of rows in the data frame, so R will either throw a "replacement has N rows" error or recycle values to pad the column back out.
It's probably best to replace the outliers with NA, like this: x$BLF[which(x$BLF>3)] <- NA. Or you could remove the entire row from your dataset, like this: x <- x[-which(x$BLF>3),]. The reason you have the comma now is that when you're dealing with a rectangular data frame you have to specify [row I want, column I want], so here I specify the rows I want deleted without specifying a column.
Probably more than you wanted, but I thought it might help.
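A quick sketch on the three sample rows above, using x for the data frame as in the question:
x <- data.frame(RDIST = c(146, 338, 931),
                RDENS = c(0, 0, 0),
                AGR   = c(0, 0, 0),
                BLF   = c(0, 0, 3.7))
which(x$BLF > 3)                # 3 -- the observation (row) numbers of the outliers
x$BLF[which(x$BLF > 3)] <- NA   # or flag them as NA in place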
That's it! Thank you both!
For now, I just want to identify the outliers for each variable. That command completely solved my problem.
Thank you