Using summarize function to calculate average ages in a dataset - r

I'm trying to use the "summarize" function to determine the average age for each occupation in a large dataset. I'm pairing this function with the "group_by" function. This is all part of my learning to use pipes to tidy up datasets. Here's the code I have written thus far:
Boston_occupations <- BostonWomenVoters %>%
  select(Age, Occupation) %>%
  group_by(Occupation) %>%
  summarize(mean(as.numeric(Age)))
To be clear, I'm trying to write code that returns a new dataset listing the average age for every occupation, not just particular ones. Right now, I'm receiving the following warning message:
Warning: There were 4 warnings in `summarize()`.
The first warning was:
ℹ In argument: `mean(as.numeric(Age))`.
ℹ In group 372: `Occupation = "House-wife"`.
Caused by warning in `mean()`:
! NAs introduced by coercion
ℹ Run dplyr::last_dplyr_warnings() to see the 3 remaining warnings.
I've been writing and re-writing this code for some time and can't figure out where I'm going wrong. Any help would be much appreciated!
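The warning text indicates that as.numeric() could not parse some Age entries and turned them into NA, and mean() returns NA for any group that contains an NA unless na.rm = TRUE is set. A minimal base-R sketch of that behaviour, assuming a small made-up data frame (the name voters and its values are hypothetical, not taken from BostonWomenVoters):

```r
# Hypothetical stand-in for the real data: Age is stored as text and
# one entry is not a number.
voters <- data.frame(
  Age = c("34", "41", "unknown", "29"),
  Occupation = c("Clerk", "Clerk", "House-wife", "House-wife"),
  stringsAsFactors = FALSE
)

# as.numeric() turns unparseable text into NA (normally with a warning).
age_num <- suppressWarnings(as.numeric(voters$Age))

# Show the raw entries that failed to convert.
bad_entries <- voters$Age[is.na(age_num) & !is.na(voters$Age)]
print(bad_entries)

# Mean age per occupation, skipping the NA values.
mean_age <- tapply(age_num, voters$Occupation, mean, na.rm = TRUE)
print(mean_age)
```

In the dplyr pipeline, the corresponding change would be summarize(mean_age = mean(as.numeric(Age), na.rm = TRUE)), which also gives the result column a name.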

Related

I am not able to run R code via the dplyr library

I am trying to run this code:
data("ToothGrowth")
View(ToothGrowth)
filtered_tg <- filter(ToothGrowth, dose == 0,5)
but the following error is causing me problems:
Error in filter():
ℹ In argument: 5.
Caused by error:
! ..2 must be a logical vector, not the number 5.
Run rlang::last_error() to see where the error occurred.
I have already run the following in the RStudio console:
> install.packages("dplyr")
> library(dplyr)
There are a few possibilities here.
1. This is a simple typo: you meant to type 0.5 instead of 0,5.
2. You are confused about decimal separator conventions. This has the same solution, but is a different conceptual problem.
R uses the North American convention where ., not , is used as the decimal separator. Specify the dose value as 0.5, not 0,5.
filtered_tg <- filter(ToothGrowth, dose == 0.5)
As R uses a comma for lots of other things, this is a setting you can't change. (You can change it for the purpose of reading and writing data, e.g. see the read.csv2() function.)
3. You are trying to specify two different possible values for dose, in which case you should use dose == 0 | dose == 5 or dose %in% c(0,5) as your criterion. (This seems implausible but was mentioned by commenters.)
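The three readings above can be compared side by side. This is a minimal base-R sketch using the built-in ToothGrowth data; the inline text passed to read.csv2() is made-up example data.

```r
# ToothGrowth ships with base R (datasets package).
data("ToothGrowth")

# Decimal point: selects the rows with a dose of one half.
half_dose <- ToothGrowth[ToothGrowth$dose == 0.5, ]
print(nrow(half_dose))

# %in% with c(0, 5) tests for the two separate values 0 and 5.
zero_or_five <- ToothGrowth[ToothGrowth$dose %in% c(0, 5), ]
print(nrow(zero_or_five))

# read.csv2() expects ';' between fields and ',' as the decimal mark,
# so comma decimals can be used when reading data.
parsed <- read.csv2(text = "dose;len\n0,5;4,2\n1,0;16,5")
print(parsed$dose)
```

Since no row of ToothGrowth has a dose of 0 or 5, the %in% filter matches nothing, which is why the two-values reading seems unlikely to be what was intended.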

sum(births) : invalid 'type' (character) of argument

Hi everyone,
I am using a sample dataset in RStudio. I used the code below:
njnew <- nj %>%
  group_by(NAME_2) %>%
  summarise(Num.totalbirths = sum(births),
            Num.totalvulnerable = sum(vulnerable)) %>%
  mutate(percent.potentailcase = potentialcase / Num.totalpotentialcase,
         percent.vulerablecase = vulnerable / Num.vulnerablecase)
I get after running:
Error in sum(births) : invalid 'type' (character) of argument
My dataset is a CSV file, but I manually added and filled in 2 additional columns (births, vulnerable).
Could you kindly let me know how this error may have happened?
Judging from the error message, it looks like births is of type character. However, you can only compute the sum of numeric, complex or logical vectors. This likely happened when you manually added the column after reading in the csv.
You can double-check the type of the variable with class(nj$births), which probably returns character. Try converting your variable(s) with as.numeric(). You may need to repeat that process for other variables (such as vulnerable) which you manually added, e.g.:
nj <- nj %>%
  mutate(births = as.numeric(births),
         vulnerable = as.numeric(vulnerable))
Then your code should work fine.
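The check-then-convert steps can be tried on a small example first. This is a base-R sketch under the assumption that births was stored as text; nj_demo and its values are made up for illustration and are not the asker's data.

```r
# Made-up data: births was typed in by hand and ended up as text.
nj_demo <- data.frame(
  NAME_2 = c("Atlantic", "Atlantic", "Bergen"),
  births = c("120", "95", "210"),
  stringsAsFactors = FALSE
)

# The column type to check first.
print(class(nj_demo$births))

# sum() refuses character input; capture the error message instead of stopping.
err <- tryCatch(sum(nj_demo$births), error = function(e) conditionMessage(e))
print(err)

# After conversion the grouped sum works.
nj_demo$births <- as.numeric(nj_demo$births)
totals <- tapply(nj_demo$births, nj_demo$NAME_2, sum)
print(totals)
```

If as.numeric() produces NA values with a warning, some of the hand-entered cells probably contain stray characters such as spaces or thousands separators.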

Calculate duration of a response with R and dplyr? Some problems with group_by

I have measured a response ('y') over time ('x') in a group of animals ('subject') in a set of conditions ('factor1','factor2'). The response was measured continuously for a fixed period of 20 min after a stimulus of duration = 'z' was given.
For these data, I would like to compute the time taken (here denoted 'duration') for 'y' to return to its baseline value (which is 0) after the stimulus ended, grouping the data by 'subject', 'factor1' and 'factor2'. Here is an example data set
data <- data.frame(
  x = rep(rep(1:20, 4), 6),
  y = rnorm(480, mean = 4, sd = 2),
  z = rep(3, 80),
  factor1 = rep(rep(c("A", "B"), each = 20), 4),
  factor2 = rep(c(rep("C", 20), rep("D", 20), rep("C", 20), rep("D", 20))),
  subject = rep(factor(1:6), each = 80)
)
I tried to solve this using dplyr:
library("dplyr")
data %>%
  group_by(subject, factor1, factor2) %>%
  mutate(duration = nth(x, first(which(y <= 0))) - z)
This yields the error "Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed."
I thought that this might occur as some subjects never returned to baseline, so I tried amending the code by setting those observations to 'duration'=20:
data %>%
  group_by(subject, factor1, factor2) %>%
  mutate(duration = ifelse(
    (nth(x, first(which(y <= 0))) - z) <= (20 - z),
    (nth(x, first(which(y <= 0))) - z), 20)
  )
However, the error message remains "Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed."
In both cases, the error message disappears when I remove the "group_by" statement, but I cannot quite figure out why (apart from the fact that some individuals never returned to baseline).
How do I best go about solving this? I assume I might be missing something quite obvious...
Many thanks,
Andreas
See my comment. In some groups y never drops to 0 or below, so which(y<=0) is empty and first() of it evaluates to NA. You need to specify how to deal with those cases, e.g. replace with NA:
data %>%
  group_by(subject, factor1, factor2) %>%
  mutate(duration = ifelse(is.na(first(which(y <= 0))), NA,
                           nth(x, first(which(y <= 0))) - z))
Also, I would recommend against the use of factors. They can cause a lot of trouble if you don't understand what they actually are (I don't, so I don't use them). You can use characters instead.
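The empty-which() case can be seen in isolation with plain vectors. The sketch below uses base R only; first_return is a hypothetical helper introduced for illustration, and base indexing with [1] stands in for dplyr's first().

```r
# y never reaches 0 here, so which() finds nothing.
y_never <- c(4, 3, 2, 1)
hits <- which(y_never <= 0)
print(length(hits))  # number of matches
print(hits[1])       # first element of an empty vector

# Hypothetical helper: x value at the first return to baseline, or NA
# when the response never returns.
first_return <- function(x, y) {
  idx <- which(y <= 0)
  if (length(idx) == 0) NA_real_ else as.numeric(x[idx[1]])
}

x <- 1:4
print(first_return(x, y_never))
print(first_return(x, c(4, 0, 2, -1)))
```

Inside the grouped mutate() one could then write duration = first_return(x, y) - z, which keeps the missing-value handling in one place.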

R: Dummy coding using mutate, ifelse and grepl - Error

I'm attempting to dummy code two levels of three in a variable (in two steps) as I want to run a regression. I'm very new to R and have not written the code myself.
Step 1: The variable is Birth_order and the two levels I'd like to analyse are Firstborn and Later born, while excluding only children from the analysis (and dummy coding).
Dat <- mutate(Dat, Wth_Sib = ifelse(grepl("Firstborn", Dat$Birth_Order), 1,
                             ifelse(grepl("Later born", Dat$Birth_Order), 0, NA)))
Running the code gives me this error:
Error in mutate_impl(.data, dots) :
Column `Wth_Sib` must be length 212 (the number of rows) or one, not 0
Step 2: Comparing siblings vs. only children.
Dat <- mutate(Dat, Sib_vs_Only = ifelse(grepl("Firstborn", Dat$Birth_Order), 1,
                                 ifelse(grepl("Later born", Dat$Birth_Order), 1, 0)))
Error:
Error in mutate_impl(.data, dots) :
Column `Sib_vs_Only` must be length 212 (the number of rows) or one, not 0
I don't know what the error means, and I'm somewhat unsure whether the code is the best way of approaching the task. I've looked everywhere for answers and I'd be so grateful for any help or advice on a better method!
Thanks!
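One possible cause, offered as a guess from the text above: the variable is described as Birth_order, but the code refers to Dat$Birth_Order. On a data frame, $ returns NULL for a name that does not exist, and grepl() on NULL gives a zero-length result, which would match the "not 0" in the error. A base-R sketch with made-up data (Dat_demo is hypothetical):

```r
# Made-up data using the lower-case spelling from the description.
Dat_demo <- data.frame(
  Birth_order = c("Firstborn", "Later born", "Only child"),
  stringsAsFactors = FALSE
)

# Misspelled name: `$` returns NULL, so the whole expression has length 0.
wrong <- ifelse(grepl("Firstborn", Dat_demo$Birth_Order), 1, 0)
print(length(wrong))

# Matching name: one value per row; only children become NA.
Wth_Sib <- ifelse(grepl("Firstborn", Dat_demo$Birth_order), 1,
           ifelse(grepl("Later born", Dat_demo$Birth_order), 0, NA))
print(Wth_Sib)
```

Running names(Dat) shows the exact spelling of each column. Inside mutate(), a column can also be referenced by its bare name, without the Dat$ prefix.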

dplyr data frame tbl: incorrect reported column length

I came across a problem with dplyr, which caused an error message when I used it in a survival analysis. The root cause turned out to be that when a variable in a grouped data frame (or any object with class tbl_df) is referred to using [,] notation, it always reports a length of 1, even when the real length is greater than that. Using the $x notation reports the correct length.
With a data frame, the following return the expected length of 32:
length(mtcars$mpg)
length(mtcars[ , "mpg"])
With a grouped data frame the $ notation returns 32, and all the rest using [] notation return a length of 1:
foo <- mtcars %>% group_by(cyl)
length(foo$mpg)
length(foo[ , "mpg"])
length(foo[ , 1])
VarName <- "mpg"
length(foo[ , VarName])
It is just the reported length that is incorrect. The data itself is all there, i.e.:
head(foo[ , "mpg"])
The incorrect reported length leads to an error message in functions such as Surv(), which presumably include a length() check. This is obviously a very simplified example to illustrate. In the failed program I was using [ , VarName] notation inside a function to refer to a variable column. The workaround is simply to convert the data from the offending Data Frame Tbl format to an ordinary data frame within the function. Can anyone shed any light on why this happens? It might save others wasting as much time as I have!
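One likely explanation, stated as an assumption rather than a confirmed diagnosis of the failed program: single-bracket indexing on a tbl_df keeps the result as a one-column data frame instead of dropping it to a vector, and length() of a data frame is its number of columns. The same effect can be reproduced in base R with drop = FALSE:

```r
# A plain data frame drops a single column to a vector by default.
as_vector <- mtcars[, "mpg"]
print(length(as_vector))

# With drop = FALSE the result stays a one-column data frame,
# and length() of a data frame counts columns, not rows.
as_frame <- mtcars[, "mpg", drop = FALSE]
print(length(as_frame))
print(nrow(as_frame))

# [[ ]] extracts the column as a vector, also with a name held in a variable.
VarName <- "mpg"
as_column <- mtcars[[VarName]]
print(length(as_column))
```

Double-bracket extraction with [[VarName]] returns a plain vector for tibbles as well, so it may avoid the need to convert to an ordinary data frame inside the function.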
