New variable for missing values across multiple columns

I'm trying to see if R has an equivalent of a Stata idiom. In Stata, generating a variable from the expression !mi(a, b, c, ...) gives a 1/0 indicator of whether the listed variables have no missing data: 1 = no missing data across the listed variables, 0 = missing data in at least one of them.
I'm looking for simple code because I sometimes have about 15-20 variables (mainly to mark listwise deletion cases). It takes a little more work, but I specify the column names instead of using the : marker. The options I've found (na.omit) create a new dataframe, but I want to retain all the cases.
I know that ifelse can accomplish this using:
df$test <- ifelse(!is.na(df$ID) & !is.na(df$STATUS), 1,0)
I'd like to know if there's another way with less code, where I don't need to write "!is.na(df$ )" over and over. Maybe a $global code (similar to Stata)?

You should be able to do this using complete.cases, which flags the rows that have no missing values in any column:
df$test <- as.numeric(complete.cases(df))

You could also use rowSums:
df$test <- as.numeric(rowSums(is.na(df)) == 0)
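
Both one-liners above look at every column of df. To restrict the check to a chosen set of columns, as the question asks, one option is to subset the data frame first. This is a minimal sketch: the toy data and the vars vector are assumptions made for illustration.

```r
# Toy data; the column names mirror the question, the values are made up.
df <- data.frame(
  ID     = c(1, 2, NA, 4),
  STATUS = c("a", NA, "c", "d"),
  OTHER  = c(NA, 1, 2, 3)
)

# Columns that should count towards listwise deletion (assumed selection).
vars <- c("ID", "STATUS")

# 1 = no missing values in the selected columns, 0 = at least one missing.
df$test <- as.integer(complete.cases(df[vars]))
df$test
```

OTHER is ignored here, so the first row still counts as complete even though OTHER is NA.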

Related

How to recode multiple variables at once with an else=copy option depending on another variables condition in R?

Let's see an example.
library(sjmisc)
data(efc)
From this dataset I want to recode all variables whose name contains cop (so I could use the tidyselect helper contains) as follows. For males (e16sex==1), recode NA into 999 and copy everything else, as I could do with sjmisc::rec(..., rec = "NA=999; else=copy"). For females (e16sex==2), keep the values intact.
I tried through dplyr (and sjmisc) the next naive test:
mutate_at(efc, vars(contains("cop")), list(~if_else(e16sex == 1, rec(., rec="NA=999; else=copy"),.)))
but, understandably, if_else does not seem to treat the second dot . as the original contains("cop") variables for the rows with e16sex != 1.
I am looking for a function (or a composition of functions) that returns a data frame with the specified recoding (so, please, avoid for loops). I could not try data.table because I do not know the language yet, but all effective (and efficient) solutions are welcome. Maybe it could be done with purrr?
Thank you!
UPDATE
The naive test above works. I hadn't tried it with this example but with the iris dataset, using the Species variable instead of the cop variables. As Species is a factor, trying to replace some of its levels with a new one produces NAs, hence my confusion.

I'm not sure I fully understand the question, but you could use a for loop for this:
for (x in grep("cop", names(efc))) {
  efc[!is.na(efc$e16sex) & efc$e16sex == 1 & is.na(efc[, x]), x] <- 999
}
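
Since the question asks to avoid an explicit for loop, the same replacement can be sketched in base R with lapply. The toy data frame below stands in for efc, and its column names other than e16sex are made up for illustration.

```r
# Toy stand-in for efc; only e16sex is a real column name from the question.
dat <- data.frame(
  e16sex  = c(1, 2, 1, NA),
  c82cop1 = c(NA, NA, 3, NA),
  c83cop2 = c(2, NA, NA, 1),
  other   = c(NA, 5, 6, 7)
)

# Columns whose name contains "cop".
cop_cols <- grep("cop", names(dat), value = TRUE)

# TRUE only for rows known to be male; rows with missing sex are left alone.
is_male <- !is.na(dat$e16sex) & dat$e16sex == 1

# Replace NA with 999 for males in each selected column, copy everything else.
dat[cop_cols] <- lapply(dat[cop_cols], function(v) {
  v[is_male & is.na(v)] <- 999
  v
})
dat
```

Columns that do not match "cop" and rows for females or with missing sex are returned unchanged.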

R: Check for finite values in DataFrame

I need to check whether a data frame is "empty" or not ("empty" in the sense that the data frame contains zero finite values; if there is a mix of finite and non-finite values, it should NOT be considered "empty").
Referring to "How to check a data.frame for any non-finite", I came up with a one-line piece of code that almost achieves this objective:
nrow(tmp[rowSums(sapply(tmp, function(x) is.finite(x))) > 0,]) == 0
where tmp is some data frame.
This code works fine for most cases, but it fails if the data frame contains a single row.
For example, the above code works fine for
tmp <- data.frame(a=c(NA,NA), b=c(NA,NA)) OR tmp <- data.frame(a=c(3,NA), b=c(4,NA))
But not for,
tmp <- data.frame(a=NA, b=NA)
because, I think, rowSums expects at least two rows.
I looked at some other posts such as https://stats.stackexchange.com/questions/6142/how-to-calculate-the-rowmeans-with-some-single-rows-in-data, but I still couldn't come up with a solution for my problem.
My question is: are there any clean ways (i.e. avoiding loops, and ideally a one-liner) to check whether any data frame is "empty"?
Thanks

If you are checking all columns, the data frame is "empty" exactly when no value is finite, so you can just do
!any(sapply(tmp, is.finite))
Here we are using any rather than the rowSums trick so we don't have to worry about preserving matrices.
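
As a quick check against the data frames from the question, here is a small sketch. The helper name is_empty_df is hypothetical, and it flattens the per-column results with unlist so that it does not matter whether they would simplify to a vector or a matrix.

```r
# Hypothetical helper: TRUE when the data frame holds no finite value at all.
is_empty_df <- function(d) !any(unlist(lapply(d, is.finite)))

two_rows_all_na <- data.frame(a = c(NA, NA), b = c(NA, NA))
two_rows_mixed  <- data.frame(a = c(3, NA), b = c(4, NA))
one_row_all_na  <- data.frame(a = NA, b = NA)

is_empty_df(two_rows_all_na)  # TRUE: nothing finite
is_empty_df(two_rows_mixed)   # FALSE: 3 and 4 are finite
is_empty_df(one_row_all_na)   # TRUE: the single-row case works too
```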

Retaining a value in an R dataset if it's present in another dataset

I am currently working on code that applies to various datasets from an experiment which looks at a wide range of variables, not all of which are present in every repetition. My first step is to create an empty dataset with all the possible variables, and then write a function which retains the columns that are in the dataset being input and deletes the rest. Here is an example of what I want to achieve:
x<-c("a","b","c","d","e","f","g")
y<-c("c","f","g")
Is there a way of removing elements of x that aren't present in y and/or retaining values of x that are present in y?

For your first question: "My first step is to create an empty dataset with all the possible variables", I would use factor on the concatenation of all the vectors, for example:
all_vect = c(x, y)
possible = levels(factor(all_vect))
Then, for the second part, "write a function which retains columns that are in the dataset being inputted and delete the rest", I would write:
df[, names(df) %in% possible, drop = FALSE]

As akrun wrote, use intersect(x,y) or
x[x %in% y]
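
Using the vectors from the question, the two approaches can be compared directly. This is a small sketch; setdiff is added only to show the complementary operation (elements of x that are not in y).

```r
x <- c("a", "b", "c", "d", "e", "f", "g")
y <- c("c", "f", "g")

kept_intersect <- intersect(x, y)  # elements present in both vectors
kept_in        <- x[x %in% y]      # same result here, via logical indexing
dropped        <- setdiff(x, y)    # elements of x that are absent from y

kept_intersect
# [1] "c" "f" "g"
dropped
# [1] "a" "b" "d" "e"
```

One difference to keep in mind: intersect removes duplicates, whereas x[x %in% y] keeps every matching element of x.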

Omitting NA in specific rows when analyzing data from 2 columns of a very large dataframe

I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large dataframe with several columns (up to 40) and rows (up to 200ish). I want to use data from two of the columns at a time to do simple stats (wilcox.test, boxplot, etc): one column has a continuous variable (V1), while the other has a binary variable (V2; 0 or 1) that defines 2 groups. I want to repeat this for the same continuous variable using different, unrelated binary variables. I organized this data in Excel, saved it as CSV and am using RStudio.
All these columns have interspersed NA values, and when I use na.omit it removes every single row where an NA value is present in any column, which takes away an awful lot of data. Is there a simple way to drop only the rows that are NA in the two columns being analyzed? I have seen some answers to similar topics, but none seems to be exactly what I need.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!

If I understand correctly, you want to apply a function to one pair of columns at a time:
wilcox.test(V1,V2)
wilcox.test(V1,V3)...
where each Vi has its missing values removed. I would do something like this:
## use complete.cases to assert that you have no missing values
## for the selected pair
apply_clean <-
  function(x, y){
    ## keep only the positions where both x and y are observed
    ok <- complete.cases(x, y)
    wilcox.test(x[ok], y[ok])
  }
## apply this function to all columns after removing the continuous column
lapply(subset(dat,select=-V1),apply_clean,y=dat$V1)
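
Because the question describes V2 as a 0/1 grouping variable, another way to read it is as a two-group comparison of V1. The formula interface of wilcox.test fits that reading, and each test then uses only the rows where both V1 and the chosen grouping column are observed. This is a sketch under that assumption, with made-up data and an assumed list of grouping columns.

```r
# Made-up data with NAs scattered across columns.
dat <- data.frame(
  V1 = c(5.1, NA, 3.2, 7.8, 6.4, 2.9, NA, 4.4, 8.1, 3.7),
  V2 = c(0, 1, NA, 1, 0, 1, 0, 0, 1, NA),
  V3 = c(1, 1, 0, NA, 0, 1, 0, 1, NA, 0)
)

# Binary grouping columns to test against V1 (assumed selection).
group_cols <- c("V2", "V3")

# One test per grouping column; reformulate() builds V1 ~ V2, V1 ~ V3, ...
results <- lapply(group_cols, function(g) {
  wilcox.test(reformulate(g, response = "V1"), data = dat)
})
names(results) <- group_cols

# Collect the p-values.
sapply(results, function(r) r$p.value)
```

A row that is NA in V3 still contributes to the V1 ~ V2 test, so far less data is lost than with na.omit on the whole data frame.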

You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1,2,3,4,5,6,7,NA,9,10), col2 = c(10, 9, 8, 7,6,5,4,3,2,1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1),]
This code uses is.na() to test whether each value in a specific column is NA. The ! negates the test, so only the rows where col1 is not NA are kept.

Specifying names of columns to be used in a loop R

I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values, faceted using the values in B and C.
library(reshape2)  # melt()
library(ggplot2)   # ggplot(), facet_grid(), ggsave()
plotdata <- function(l) {
  # reshape to long format, keeping only the requested measure column
  melted <- melt(df, id.vars = c("X1", "B", "C"), measure.vars = l)
  p <- ggplot(melted, aes(x = X1, y = value)) + geom_point()
  p2 <- p + facet_grid(B ~ C)
  ggsave(filename = paste("X_vs_", l, "_faceted.jpeg", sep = ""), plot = p2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type each column of interest into plotdata and get the result, but this seems quite inelegant (and time consuming). I would prefer to specify the columns of interest once, e.g. "Y1","Y3","Y4", and then write a loop that runs the function for each of them.
However, I am new to writing for loops and can't find a way to loop over the specific column names that my function needs. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.

Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
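The index variable doesn't have to be numeric; a for loop iterates over the elements of any vector, including a character vector of column names.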
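
The same idea can be sketched without plotting. The stand-in function plot_file below is hypothetical and only builds the filename that plotdata would pass to ggsave, which makes the iteration over names easy to inspect.

```r
# Hypothetical stand-in for plotdata(): returns the filename it would write.
plot_file <- function(l) paste("X_vs_", l, "_faceted.jpeg", sep = "")

# User-specified columns of interest.
cols <- c("Y1", "Y3", "Y4")

# for loop over the names themselves, collecting results by name.
files <- character(0)
for (x in cols) {
  files[x] <- plot_file(x)
}

# Equivalent functional form: one call per name, result checked to be a string.
files_v <- vapply(cols, plot_file, character(1))

unname(files)
# [1] "X_vs_Y1_faceted.jpeg" "X_vs_Y3_faceted.jpeg" "X_vs_Y4_faceted.jpeg"
```

With the real function, invisible(lapply(cols, plotdata)) would be the analogous call when only the side effect (the saved files) matters.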
