Two data formatting questions for R

I have two questions, both fairly simple I believe, dealing with R.
I would like to create an IF statement that will assign an NA value to certain rows in a column. I have tried the following command:
a[a[,21] == 0, 5:10] <- NA
the error says:
Error in `[<-.data.frame`(`*tmp*`, a[, 21] == 0, 5:20, value = NA) : missing values are not allowed in subscripted assignments of data frames
Essentially, that code is supposed to take any 0 value in column 21 and replace that row's values in columns 5 to 10 with NA. There are already NAs in column 21, but I am not sure whether that matters.
I am not sure how to craft this next function at all. I need to manipulate data that contains positive and negative controls. When I manipulate the data, I don't want the positive and negative control values to be a part of the manipulation, but I do want the controls to remain in the columns because I have to use them later. Is there any way to temporarily ignore these values so they aren't included in the manipulation?
Here is some sample data:
L = c(2,1,4,3,1,4,2,4,5,1)
R = c(2,4,5,1,"Neg",2,"",1,2,1)
T = c(2,1,4,2,"CTRL",2,"PCTRL",2,1,4)
test <- data.frame(L=L,R=R,T=T)
I would like to be able to temporarily ignore these rows based on the characters "Neg"/"CTRL" and ""/"PCTRL" rather than their positions in the data frame, if possible. Notice how, for the negative control, Neg and CTRL are in separate columns but the same row, just like the positive control, where a blank and PCTRL are in separate columns but the same row. Any way to do this given these odd conditions?
Hope this was written clearly enough, and I thank anyone in advance for taking the time to help me!

Try this to subset your data frame to the rows where R is not "Neg":
subset(test, R!="Neg")
For the NA problem: you probably already have NAs in your data frame, right? See if this works:
a[a[,21] %in% 0, 5:10] <- NA
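The difference is easy to see on a small vector: == propagates NA into the subscript, while %in% returns FALSE for NA, so the assignment never sees a missing subscript:
x <- c(0, 1, NA)
x == 0    # TRUE FALSE    NA  -- the NA is what breaks the assignment
x %in% 0  # TRUE FALSE FALSE  -- NA becomes FALSE, safe to subscript with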

Try instead:
a[which(a[, 21] == 0), 5:10] <- NA
Explanation: the == operation returns NA values and the [<- function doesn't accept them. The which function returns a numeric vector and "throws away" the NAs. As an aside, the [ function (without the <-) returns a row of NAs for each NA in the subscript. This is considered a 'feature', but I find it an 'annoyance', so I typically use which for selection as well as for selective assignment.
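A quick illustration of how which() discards the NAs before they ever reach the assignment:
x <- c(0, 1, NA, 0)
x == 0         # TRUE FALSE    NA  TRUE
which(x == 0)  # 1 4  -- positions only, the NA is dropped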

For the first problem: if a[,21] is NA, do you also want to assign NA? In that case,
a[replace(a[,21],is.na(a[,21]),0)==0,5:10] <- NA
Otherwise, note that I replaced the replacement value of 0 with something nonzero (1 is used here, but it doesn't really matter as long as it's not zero):
a[replace(a[,21],is.na(a[,21]),1)==0,5:10] <- NA
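To see what the replace() trick is doing, here it is on a small vector:
x <- c(0, 2, NA)
replace(x, is.na(x), 0) == 0  # TRUE FALSE  TRUE -- NA rows are assigned too
replace(x, is.na(x), 1) == 0  # TRUE FALSE FALSE -- NA rows are left alone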
As for the second problem,
subset(test, !(R %in% c("Neg", "") | T %in% c("CTRL", "PCTRL")))
This covers the case where the filtering conditions in R and T do not always coincide. If they always coincide, then you can just apply the test to one of R or T. You may also want to keep in mind that T stands for TRUE in S, S-PLUS, and R; you can reassign another value to T and things will be okay, but it's generally discouraged (the same goes for c, which people also like to assign to).
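Applied to the sample data from the question, this drops rows 5 and 7 (the control rows) and keeps the other eight:
subset(test, !(R %in% c("Neg", "") | T %in% c("CTRL", "PCTRL")))
#    L R T
# 1  2 2 2
# 2  1 4 1
# 3  4 5 4
# 4  3 1 2
# 6  4 2 2
# 8  4 1 2
# 9  5 2 1
# 10 1 1 4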

Related

R returns incorrect answers for logical operations?

I am trying to subset a dataframe based on values in a single column (Column_A), using code similar to the following:
new_df <- subset(df, df$Column_A<4)
I noticed that this code returns all rows where the value for Column_A is less than 4... as well as one row where the value is 12.4 (so, greater than 4).
I tried to look more closely at what R believes the value of this cell to be: df$Column_A[[2]] returned the expected value of 12.4.
I then tested several other variants of this logical operation, e.g. df$Column_A[[2]] < 12, df$Column_A[[2]] < 11, df$Column_A[[2]] < 10, df$Column_A[[2]] < 9...
The first three expressions returned the expected answer (FALSE). However, df$Column_A[[2]] < 9 and all variants with lower values (e.g. < 8, < 7, ...) return TRUE. This is clearly incorrect.
I have no idea what is causing this and would really appreciate any insight.
It could happen if the class of the column is character:
"12.4" < 4
[1] TRUE
The remedy is to convert to numeric first and then subset:
df$Column_A <- as.numeric(df$Column_A)
subset(df, Column_A < 4)
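For a minimal sketch of the symptom and the fix, assuming Column_A was read in as character (and stringsAsFactors = FALSE, the default since R 4.0):
df <- data.frame(Column_A = c("3.1", "12.4", "5.0"))
subset(df, Column_A < 4)   # returns "3.1" AND "12.4": character comparison is lexicographic
df$Column_A <- as.numeric(df$Column_A)
subset(df, Column_A < 4)   # returns only 3.1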

R: Check for finite values in DataFrame

I need to check whether a data frame is "empty" or not ("empty" in the sense that it contains zero finite values; if there is a mix of finite and non-finite values, it should NOT be considered "empty").
Referring to How to check a data.frame for any non-finite, I came up with a one-liner that almost achieves this objective:
nrow(tmp[rowSums(sapply(tmp, function(x) is.finite(x))) > 0,]) == 0
where tmp is some data frame.
This code works fine for most cases, but it fails if the data frame contains a single row.
For example, the above code would work fine for,
tmp <- data.frame(a=c(NA,NA), b=c(NA,NA)) OR tmp <- data.frame(a=c(3,NA), b=c(4,NA))
But not for,
tmp <- data.frame(a=NA, b=NA)
because, I think, rowSums expects an input with at least two dimensions.
I looked at some other posts such as https://stats.stackexchange.com/questions/6142/how-to-calculate-the-rowmeans-with-some-single-rows-in-data, but I still couldn't come up with a solution to my problem.
My question is: are there any clean ways (i.e. avoiding loops, ideally a one-liner) to check whether any data frame is "empty" in this sense?
Thanks
If you are checking all columns, then you can just do
!any(sapply(tmp, is.finite))
which is TRUE exactly when the data frame contains no finite values. Here we use any rather than the rowSums trick, so we don't have to worry about sapply collapsing a one-row result into a plain vector (which is what breaks rowSums).
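For illustration, wrapping it in a helper (is_empty is just a hypothetical name) and running it on the three examples from the question:
is_empty <- function(tmp) !any(sapply(tmp, is.finite))
is_empty(data.frame(a = c(NA, NA), b = c(NA, NA)))  # TRUE
is_empty(data.frame(a = c(3, NA), b = c(4, NA)))    # FALSE: mix of finite and NA
is_empty(data.frame(a = NA, b = NA))                # TRUE: the single-row case now works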

R: Assign values to a new column based on values of another column where a condition is satisfied

I want to create a new column in a data.frame whose value equals the value in another data.frame wherever a particular condition is satisfied between a column in each of the two data frames.
The R pseudo-code being something like this:
DF1$Activity <- DF2$Activity where DF2$NAME == DF1$NAME
In each data.frame, the values of $NAME are unique within the column.
Use the ifelse function. Here, I put NA when the condition is not met, but you may choose any value, or values from any vector. Recycling rules apply.
DF1$Activity <- ifelse(DF2$NAME == DF1$NAME, DF2$Activity, NA)
I'm not sure this one actually needs an example. What happens when you create a column of NA values and then assign the required rows, using the same logical vector on both sides:
DF1$Activity <- NA
DF1$Activity[DF2$NAME == DF1$NAME] <- DF2$Activity[DF2$NAME == DF1$NAME]
Without an example it's quite hard to tell, but from your description it sounds like a base::merge or dplyr::inner_join operation. Those are quite fast in comparison to if statements.
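For example, the merge route might look like this (a sketch with made-up frames, assuming NAME is the key in both):
DF1 <- data.frame(NAME = c("a", "b", "c"))
DF2 <- data.frame(NAME = c("b", "c", "d"), Activity = c(10, 20, 30))
merge(DF1, DF2, by = "NAME", all.x = TRUE)  # left join: keeps every row of DF1,
                                            # Activity is NA where NAME has no match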
Cheers

Omitting NA in specific rows when analyzing data from 2 columns of a very large dataframe

I am very new to R and I am struggling to understand how to omit NA values in a specific way.
I have a large data frame with several columns (up to 40) and rows (up to 200ish). I want to use data from one of the columns to do simple stats (wilcox.test, boxplot, etc.): one column holds a continuous variable (V1), while another holds a binary variable (V2; 0 or 1) that divides the data into 2 groups. I want to do this for the continuous variable using different, unrelated binary variables like V2. I organized this data in Excel, saved it as CSV, and am using RStudio.
All these columns have interspersed NA values, and when I use na.omit, it removes every single row where an NA value is present, which takes away an awful lot of data. Is there a simple solution to this? I have seen answers to similar topics, but none seems to be quite what I need.
Many thanks for any answer. Again, I am a baby-level newbie to R and may have overlooked something in other topics!
If I understand correctly, you want to apply the function to a pair of columns each time:
wilcox.test(V1,V2)
wilcox.test(V1,V3)...
where the Vi have no missing values. I would do something like this:
## use complete.cases to assert that you have no missing values
## for the selected pair
apply_clean <- function(x, y) {
  ok <- complete.cases(x, y)
  wilcox.test(x[ok], y[ok])
}
## apply this function to all columns after removing the continuous column
lapply(subset(dat, select = -V1), apply_clean, y = dat$V1)
You can manipulate the data.frame to omit based on any rules you like. For example:
dirty.frame <- data.frame(col1 = c(1,2,3,4,5,6,7,NA,9,10), col2 = c(10, 9, 8, 7,6,5,4,3,2,1))
cleaned.frame <- dirty.frame[!is.na(dirty.frame$col1),]
This code uses is.na() to test, for a specific column, whether each row's value is NA. The ! means "not", so the rows where the value is NA are omitted.

Matrix turned to class(character) when removing NA values

Reproducible example below. I have a simulation loop, within which I occasionally have rows I need to remove from a matrix. I have done this by entering an NA value at a specific position in the row I need to remove, and then a line of code removes any row with an NA. This has worked great so far. My issue is that I am now running simulations in a way that occasionally whittles my matrix down to a single row. When this occurs, the matrix gets transformed into a 'character' and crashes the simulation.
Example:
mat <- matrix(1:10, 5, 2)       # setting up a simplified example matrix
mat[3:5, 1] <- NA               # giving 3 rows NA values, for removal of these rows
mat <- mat[!is.na(mat[, 1]), ]  # an example where my procedure works just fine
class(mat)
mat[2, 1] <- NA                 # setting 1 of the remaining 2 rows as NA
mat <- mat[!is.na(mat[, 1]), ]  # removing one of the final two rows
class(mat)                      # no longer a matrix
Is there some way I can do this, where I don't lose my formatting as a matrix at the end? I am assuming this issue is coming from my use of the "is.na" command, but I haven't found a good way around using this.
To give a bit more insight into the issue, in case there is a MUCH better way to do this I am too naive to have found yet... In my real-life simulation, I have a column in the matrix that holds a '1' when the individual in the given row is alive, and a '0' when dead. When an individual (a single row) dies, (and the value goes from a '1' to a '0'), I need to remove the row. The only way I knew how to do this was to change the '0' to an 'NA' and then remove all rows with an NA. If there is a way to just remove the rows with a '0' in a specific column that avoids this issue, that would be great!
By default, the [ function coerces its result to the lowest possible dimension. Your matrix is a two-dimensional array: when you extract a single row, it is coerced into a plain vector (a character vector in your real simulation).
To avoid that, have a look at the drop argument of the [ function. You should be doing:
mat <- mat[!is.na(mat[,1]),, drop = FALSE]
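As for the aside in the question: you can skip the NA detour entirely and drop the "dead" rows straight off the 0/1 status column, again with drop = FALSE (a sketch assuming column 1 holds the alive/dead flag):
alive <- c(1, 1, 0, 1, 0)
mat <- matrix(c(alive, 10, 20, 30, 40, 50), 5, 2)  # column 1 = status, column 2 = data
mat <- mat[mat[, 1] != 0, , drop = FALSE]          # keep only the rows still alive
class(mat)                                         # still a matrix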
