R: Dummy coding using mutate, ifelse and grepl - Error

I'm attempting to dummy code two levels of three in a variable (in two steps) as I want to run a regression. I'm very new to R and have not written the code myself.
Step 1: The variable is Birth_order and the two levels I'd like to analyse are Firstborn and Later born, while excluding only children from the analysis (and dummy coding).
Dat <- mutate(Dat, Wth_Sib= ifelse(grepl("Firstborn", Dat$Birth_Order), 1,
ifelse(grepl("Later born", Dat$Birth_Order), 0, NA)))
Running the code gives me this error:
Error in mutate_impl(.data, dots) :
Column `Wth_Sib` must be length 212 (the number of rows) or one, not 0
Step 2: Comparing siblings vs. only children.
Dat <- mutate(Dat, Sib_vs_Only= ifelse(grepl("Firstborn", Dat$Birth_Order), 1,
ifelse(grepl("Later born", Dat$Birth_Order), 1, 0)))
Error:
Error in mutate_impl(.data, dots) :
Column `Sib_vs_Only` must be length 212 (the number of rows) or one, not 0
I don't know what the error means, and I'm somewhat unsure whether this code is the best way to approach the task. I've looked everywhere for answers and I'd be so grateful for any help or advice on a better method!
Thanks!
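The error says the new column came out with length 0. That typically happens when the expression inside mutate() evaluates to an empty vector, for example when Dat$Birth_Order is NULL because no column has exactly that name (the question text spells it Birth_order, with a lowercase "o"). A minimal sketch of that reading, assuming the column is really called Birth_order and using a small made-up data frame in place of the real 212-row Dat:

```r
library(dplyr)

# Hypothetical stand-in for Dat; the "Only child" label is an assumption.
Dat <- data.frame(
  Birth_order = c("Firstborn", "Later born", "Only child", "Later born")
)

# A misspelled column name yields NULL, and grepl() on NULL has length 0.
length(grepl("Firstborn", Dat$Birth_Order))  # 0

# Inside mutate(), refer to the column by its bare (correctly spelled) name.
Dat <- Dat %>%
  mutate(
    # Step 1: firstborn = 1, later born = 0, everyone else NA
    Wth_Sib = ifelse(grepl("Firstborn", Birth_order), 1,
              ifelse(grepl("Later born", Birth_order), 0, NA)),
    # Step 2: any sibling = 1, only children = 0
    Sib_vs_Only = ifelse(grepl("Firstborn|Later born", Birth_order), 1, 0)
  )
```

Checking names(Dat) against the name used in the code is a quick way to confirm or rule out this cause.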

Related

Need help fixing this Error message in R "Error in cbind(yval2, yprob, nodeprob) : number of rows of matrices must match (see arg 2)"

I am trying boosting and bagging methods in R using the penguins data in the palmerpenguins package but keep getting this error message while trying to do the bagging method.
This is my code below
penguins.bagging <- bagging(formula = species ~ ., data =penguins[train,], mfinal =8,
control=rpart.control(maxdepth = 1))
This is the error message I get after running this line
Error in cbind(yval2, yprob, nodeprob) : number of rows of matrices
must match (see arg 2)
I have tried changing mfinal to 344 and 333 (the number of rows in the data with and without NAs, respectively). I have also tried smaller numbers like 10 and 5.

Calculate duration of a response with R and dplyr? Some problems with group_by

I have measured a response ('y') over time ('x') in a group of animals ('subject') in a set of conditions ('factor1','factor2'). The response was measured continuously for a fixed period of 20 min after a stimulus of duration = 'z' was given.
For these data, I would like to compute the time taken (here denoted 'duration') for 'y' to return to its baseline value (which is 0) after the stimulus ended, grouping the data by 'subject', 'factor1' and 'factor2'. Here is an example data set
data<-
data.frame(x=rep(rep(1:20,4),6),y=rnorm(480,mean=4,sd=2),z=rep(3,80),
factor1=rep(rep(c("A","B"),each=20),4),
factor2=rep(c(rep("C",20),rep("D",20),rep("C",20),rep("D",20))),
subject=rep(factor(1:6),each=80))
I tried to solve this using dplyr:
library("dplyr")
data %>%
group_by(subject,factor1,factor2) %>%
mutate(duration=nth(x,first(which(y<=0)))-z)
This yields the error "Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed."
I thought that this might occur as some subjects never returned to baseline, so I tried amending the code by setting those observations to 'duration'=20:
data %>%
group_by(subject,factor1,factor2) %>%
mutate(duration=ifelse(
(nth(x,first(which(y<=0)))-z)<=(20-z),
(nth(x,first(which(y<=0)))-z),20)
)
However, the error message remains "Error in mutate_impl(.data, dots) :
Evaluation error: missing value where TRUE/FALSE needed."
In both cases, the error message disappears when I remove the "group_by" statement, but I cannot quite figure out why (apart from the fact that some individuals never returned to baseline).
How do I best go about solving this? I assume I might be missing something quite obvious...
Many thanks,
Andreas
See my comment. In groups where y never reaches 0, which() returns an empty vector, so first(which(y<=0)) evaluates to NA. You need to specify how to deal with those cases, e.g. replace with NA:
data %>%
group_by(subject,factor1,factor2) %>%
mutate(duration= ifelse(is.na(first(which(y<=0))),NA, nth(x,first(which(y<=0)))-z))
Also, I would recommend against using factors; they cause a lot of trouble if you don't understand what they actually are (I don't, so I don't use them). You can use characters instead.
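A variation on the same idea, as a sketch rather than the answerer's method: indexing with which(...)[1] gives NA for groups that never return to baseline, because x[NA] is NA instead of an error. The toy data below is made up so the result is easy to check; it uses summarise() to get one duration per group and takes z from the first row of each group.

```r
library(dplyr)

# Hypothetical data: subject "1" reaches baseline at x = 5, subject "2" never does.
toy <- data.frame(
  subject = rep(c("1", "2"), each = 6),
  x = rep(1:6, 2),
  y = c(4, 3, 2, 1, 0, 0,  4, 3, 3, 2, 2, 1),
  z = 3
)

durations <- toy %>%
  group_by(subject) %>%
  # which(...)[1] is NA when no row qualifies, so the duration becomes NA
  summarise(duration = x[which(y <= 0)[1]] - first(z))
```

Here subject "1" gets a duration of 2 (5 minus the stimulus length 3) and subject "2" gets NA.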

(R) I'm trying to reference a column in a dataframe with an if() statement to compute multiple other columns

This is the simplified dataset I created to illustrate my question.
Hello all, I'm trying to reference a column in a dataframe Database. My goal is to be able to reference, say, the Weight column and from that populate the Risk and Overweight columns. Here is what I'm trying (along with other failed code):
ifelse(Database[,"Weight"] >190, Database$Risk="HIGH", Database$Risk="LOW")
Error: unexpected '=' in "ifelse(Database[,"Weight"] >190, Database$Risk="
I have also tried doing groups of code with the if() command.
if(Database$Weight > 190) {Database$Risk="HIGH"; Database$Overweight="YES"}
Error in if (Database$Weight > 190) { :
missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In Ops.factor(Database$Weight, 190) : ‘>’ not meaningful for factors
2: In if (Database$Weight > 190) { :
the condition has length > 1 and only the first element will be used
...Which clearly I'm not doing correctly.
The ideal output of this code would resemble this:
We can avoid ifelse with in place assignment using data.table
library(data.table)
setDT(Database)[, Risk := "LOW"][Weight > 190, Risk := "HIGH"]
Here is a dplyr solution. I have assumed Overweight is also classified based on Weight. You can change the condition if required.
library(dplyr)
Database <- Database %>%
mutate(Risk=ifelse(Weight>190,"HIGH","LOW"),
Overweight=ifelse(Weight>190,"YES","NO"))
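The warning "'>' not meaningful for factors" in the question suggests that Weight is stored as a factor, in which case the comparisons above would return NA. A minimal base R sketch, assuming that is the case and using a made-up stand-in for Database:

```r
# Hypothetical stand-in for Database, with Weight stored as a factor.
Database <- data.frame(Weight = factor(c("150", "200", "185", "210")))

# Convert via character first; as.numeric() on a factor alone returns level codes.
Database$Weight <- as.numeric(as.character(Database$Weight))

# ifelse() is vectorised, so each row gets its own label.
Database$Risk <- ifelse(Database$Weight > 190, "HIGH", "LOW")
Database$Overweight <- ifelse(Database$Weight > 190, "YES", "NO")
```

Once Weight is numeric, the data.table and dplyr versions should behave the same way.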

Error in huge R package when criterion "stars"

I am trying to do an association network using some expression data I have, the data is really huge: 300 samples and ~30,000 genes. I would like to apply a Gaussian graphical model to my data using the huge R package.
Here is the code I am using
dim(data)
#[1] 317 32291
huge.out <- huge.npn(data)
huge.stars <- huge.select(huge.out, criterion="stars")
However in this last step I got an error:
Error in cor(x) : ling....in progress:10%
Missing values present in input variable 'x'. Consider using use = 'pairwise.complete.obs'
Any help would be very appreciated.
You posted this exact question on Rhelp today. Both SO and Rhelp deprecate cross-posting but if you do choose to switch venues it is at the very least courteous to inform the readership.
You responded to the suggestion here on SO that there were missing data in your data-object named 'data' by claiming there were no missing data. So what does this code return:
lapply(data , function(x) sum(is.na(x)))
That would be a first-level check, but there could also be an error caused by a later step that encountered a missing value in the matrix of correlation coefficients computed from 'huge.out'. That could happen if there were a) infinities in the calculations or b) a constant column:
> cor(c(1:10,Inf), 1:11)
[1] NaN
> cor(rep(2,7), rep(2,7))
[1] NA
Warning message:
In cor(rep(2, 7), rep(2, 7)) : the standard deviation is zero
So the next check is:
sum( is.na(huge.out) )
That will at least give you some basis for defending your claim of no missings and will also give you a plausible theory as to the source of the error. To locate a column that is entirely constant you might do something like this (assuming it were a dataframe):
which(sapply(data, function(col) length(unique(col))) == 1)
If it's a matrix, you need to use apply.
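For the matrix case, a minimal sketch with apply() over columns, using a small made-up matrix in place of the real expression data:

```r
# Hypothetical matrix; column "b" is constant.
m <- cbind(a = c(1, 2, 3, 4), b = c(5, 5, 5, 5), c = c(2, 4, 6, 9))

# Count distinct values per column; columns with exactly one are constant.
constant_cols <- which(apply(m, 2, function(col) length(unique(col))) == 1)

# Missing values per column, the matrix counterpart of the lapply() check.
na_per_col <- colSums(is.na(m))
```

Here constant_cols picks out column "b", which is the kind of column that makes cor() return NA.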

"Error in 1:ncol(x) : argument of length 0" when using Amelia in R

I am working with panel data. I have well over 6,000 country-year observations, and have specified my Amelia imputation as follows:
(CountDependentVariable, m=5, ts="year", cs="cowcode",
sqrts=c("OtherCountVariable2", "OtherCount3", "OtherCount4"),
ords=c("OrdinalVar1", "Ordinal Variable 2"),
lgstc=c("ProportionVariale"),
noms=c("NominalVar1"),p2s = 0, idvars = c("country"))
When I run those lines of code, I continue to receive the following error:
Error in 1:ncol(x) : argument of length 0
I've seen people get a similar error, but in different contexts. Importantly, there are several continuous independent variables I left out of the Amelia code, because I am under the impression that they get imputed without having to be listed in any of the arguments. Does anyone know:
1) What this error means?
2) How to correct this error?
Update #1: Provided more context, in terms of the types of variables in my count panel data, in the above sample code.
Update #2: I did some research, and ran into an R file containing a function that diagnoses possible errors for Amelia code. After running the code, I got the following error message first (and many more thereafter):
AMn<-nrow(x)
Error in nrow(x) : object 'x' not found
AMp<-ncol(x)
Error in ncol(x) : object 'x' not found
subbedout<-c(idvars,cs,ts)
Error: object 'idvars' not found
Error Code: 4
if (any(colSums(!is.na(x)) <= 1)) {
all.miss <- colnames(x)[colSums(!is.na(x)) <= 1]
if (is.null(all.miss)) {
all.miss <- which(colSums(!is.na(x)) <= 1)
}
all.miss <- paste(all.miss, collapse = ", ")
error.code<-4
error.mess<-paste("The data has a column that is completely missing or only has one,observation. Remove these columns:", all.miss)
return(list(code=error.code,mess=error.mess))
}
Error in is.data.frame(x) : object 'x' not found
Error codes: 5-6
Errors in one of the list variables
idout<-listcheck(idvars,"One of the 'idvars'")
Error in identical(vars, NULL) : object 'idvars' not found
Currently, there are no missing values for the country variable I place in the idvars argument. However, the very first "chunk" of errors wants me to believe that this is so.
Am I not properly specifying the Amelia code I have above?
I had forgotten to specify the dataframe in the original Amelia code (slaps hand on forehead). So now, after resolving the wacky issue above, I am getting the following error from Amelia:
Amelia Error Code: 44
One of the variable names in the options list does not match a variable name in the data.
I've checked the variable names, and they match, verbatim, to what I named them in the dataframe.
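One way to double-check is to compare the names programmatically rather than by eye. This is only a sketch with made-up data: it relies on the fact that data.frame() and read.csv() by default rewrite names that contain spaces (a space becomes a dot), so a name like "Ordinal Variable 2" may not exist in the data frame under that spelling. Whether that is the cause here is an assumption.

```r
# Hypothetical data frame; the default check.names = TRUE turns spaces into dots.
panel <- data.frame(
  year = 2000:2002,
  cowcode = c(2, 2, 2),
  `Ordinal Variable 2` = c(1, 2, 3)
)
names(panel)  # "year" "cowcode" "Ordinal.Variable.2"

# Names as they would be passed in the imputation options (hypothetical subset).
option_vars <- c("year", "cowcode", "Ordinal Variable 2")

# Anything returned here is listed in the options but absent from the data.
missing_vars <- setdiff(option_vars, names(panel))
```

An empty result from setdiff() would rule out a name mismatch of this kind.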
