Corr.test exclusion of NA values in R - r

I am trying to run the corr.test function in a for loop between a range of columns in a data frame against the rest of the columns in the same data frame. However, I have a lot of NA values throughout this data frame. I don't want to omit the rows altogether and lose the rest of the data in the rows and I also don't want to set NA = 0 because it will interfere with the rest of the data (scores that are either -1, 1, or 0). Every time I try to run the corr.test function, R keeps saying that x or y are not numeric vectors.
Is there any way to get around this?
The first column (rownames) of my data frame is a list of sample IDs, columns 2-50 are scores, and columns 51 onward are scores of a different type. What I've been doing so far is using a for loop to run corr.test between each range of columns, like this example:
cor.test(data[1:50], data[51:200])
This works fine in the for loop if I convert NA values to 0 but is there any way to avoid doing that?
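One possible way around this, sketched here with base R only: cor() accepts use = "pairwise.complete.obs", which computes each correlation from just the rows where both columns are present, so no rows are dropped globally and no NA is turned into 0. As far as I know, cor.test() expects two numeric vectors rather than data frames (which would explain the "not numeric vectors" error) and already ignores incomplete pairs on its own. The data frame and column names below are made up for illustration.

```r
# Toy data: two score columns (-1/0/1) and two columns of another score type,
# with NAs scattered through all of them (values are invented for illustration).
scores <- data.frame(
  a = c(1, 0, -1, NA, 1, 0, -1, 1),
  b = c(0, 1, NA, -1, 1, 0, 0, -1),
  c = c(2.5, NA, 1.0, 3.2, 4.1, 0.5, 2.2, 3.3),
  d = c(NA, 1.1, 0.3, 2.2, 3.0, 1.5, 0.9, 2.8)
)
first_block  <- c("a", "b")
second_block <- c("c", "d")

# Correlation matrix: each pair uses only the rows where both columns are present.
r_mat <- cor(scores[first_block], scores[second_block],
             use = "pairwise.complete.obs")

# p-values: cor.test() takes two numeric vectors, so loop over the column pairs.
p_mat <- sapply(second_block, function(y) {
  sapply(first_block, function(x) cor.test(scores[[x]], scores[[y]])$p.value)
})

print(r_mat)
print(p_mat)
```

Note that each pair may then be based on a different number of observations, which is worth keeping in mind when comparing the coefficients.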

Related

How do I keep my missing values to stay the same after I do mice imputation and save my results?

As a new R user I'm having trouble understanding why the NA values in my dataframe keep changing. I'm running my code on Kaggle. Maybe that's where my problem is coming from?
Original dataframe titled "abc"
There are multiple columns that have NA values, so I decided to try using multiple imputation to handle them.
So I created a new dataframe with just the columns that had NA values and began the imputation.
This is the new dataframe titled "abc1"
abc1 <- select(abc, c(9,10,15,16,17,18,19,25,26))
#mice imputation
input_data = abc1
my_imp = mice(input_data, m=5, method="pmm", maxit=20)
summary(input_data$m_0_9)
my_imp$imp$m_0_9
When the imputation begins it creates 5 columns that contain new values to fill in for the NA values of column m_0_9 and I choose which column.
Imputation of column 'm_0_9'
Then I run this code:
final_clean_abc1 <- complete(my_imp,5)
This assigns the values from column 5 of the last image to the NA values in my "abc1" dataframe and saves as "final_clean_abc1."
Lastly I replace the columns from the original "abc" dataframe that had missing values with the new columns in "final_clean_abc1."
I know this probably isn't the cleanest:
abc$m_0_9 <- final_clean_abc1$m_0_9
abc$m_10_12 <- final_clean_abc1$m_10_12
abc$f_0_9 <- final_clean_abc1$f_0_9
abc$f_10_12 <- final_clean_abc1$f_10_12
abc$f_13_14 <- final_clean_abc1$f_13_14
abc$f_15 <- final_clean_abc1$f_15
abc$f_16 <- final_clean_abc1$f_16
abc$asian_pacific_islander <- final_clean_abc1$asian_pacific_islander
abc$american_indian <- final_clean_abc1$american_indian
Now that I have a dataframe 'abc' with no missing values this is where my problem arises. I should be seeing '162' for row 10 for the m_0_9 column but when I save my code and view it on Kaggle I get the value '7' for that specific row and column. As shown in the photo below.
"abc" dataframe with no NA values
Hopefully this makes sense; I tried to be as specific as I could be.
mice is stochastic: it draws several different imputed values for each missing value (one per imputed dataset), and it is the analysis results that are later pooled, not the imputed values themselves. You should not expect the same result each time you run mice.
From the MICE documentation
In the first step, the dataset with missing values (i.e. the
incomplete dataset) is copied several times. Then in the next step,
the missing values are replaced with imputed values in each copy of
the dataset. In each copy, slightly different values are imputed due
to random variation. This results in multiple imputed datasets. In the
third step, the imputed datasets are each analyzed and the study
results are then pooled into the final study result. In this Chapter,
the first phase in multiple imputation, the imputation step, is the
main topic. In the next Chapter, the analysis and pooling phases are
discussed.
https://bookdown.org/mwheymans/bookmi/multiple-imputation.html
We have a wonderful series of vignettes that detail the use of mice. One part of this series covers the stochastic nature of the algorithm and how to make it reproducible. Setting a seed, e.g. mice(yourdata, seed = 123), generates the same set of multiple imputations every time.
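As a rough sketch of the seed advice, the example below runs mice twice with the same seed on nhanes (a small example dataset shipped with the mice package, used here as a stand-in for "abc") and checks that the completed data match. It also shows one way to write all imputed columns back at once instead of assigning them one by one. The object names are illustrative, not taken from the question.

```r
library(mice)

abc <- nhanes  # stand-in for the "abc" data frame in the question

# Same seed, same settings: the two runs should produce identical imputations.
imp1 <- mice(abc, m = 5, method = "pmm", maxit = 20, seed = 123, printFlag = FALSE)
imp2 <- mice(abc, m = 5, method = "pmm", maxit = 20, seed = 123, printFlag = FALSE)
same <- identical(complete(imp1, 5), complete(imp2, 5))

# Write every column that had missing values back in one assignment.
final_clean <- complete(imp1, 5)
na_cols <- names(abc)[colSums(is.na(abc)) > 0]
abc[na_cols] <- final_clean[na_cols]

print(same)
print(anyNA(abc))
```

Without a seed, each run (and each fresh Kaggle session) can draw different values, which would be consistent with seeing a different number in the same cell after saving.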

Using FOR LOOP over Multiple Columns of MATRIX and keeping FIRST column constant in RStudio

I am running the Automatic Variance Ratio (AVR) test on my dataset in R. My dataset contains 6 indices, i.e. 6 columns excluding the date column. For this test, I need a for loop that keeps the first column (the date column) fixed and moves from the 2nd through the last index column. I am new to R, so I don't know exactly what to do or how to do it. Currently, I have code that runs the test for only the 2nd column; it does not loop over the remaining columns. Any help would be appreciated.
A standard way to loop through the columns of a dataframe is with lapply. If your dataframe is df with 7 columns and you want to loop through columns 2 through 7 and your function is Av.VR() then
output_list <- lapply(df[,2:7], function(x) Av.VR(x))
should yield a list of outputs for each column.
Note I have no experience using the function Av.VR().
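To make the lapply pattern concrete, here is a small sketch on made-up data. Since the actual test function is not shown, toy_stat() below is a hypothetical stand-in (a crude variance-ratio-like number, not the AVR test); swap in the real function. The for loop at the end is the equivalent explicit version.

```r
set.seed(42)

# Made-up data: a Date column followed by 6 index columns.
df <- data.frame(
  Date = seq(as.Date("2020-01-01"), by = "day", length.out = 50),
  matrix(rnorm(300), ncol = 6)
)
names(df)[-1] <- paste0("index", 1:6)

# Hypothetical stand-in for the real test function (NOT the AVR test).
toy_stat <- function(x) var(diff(x, lag = 2)) / (2 * var(diff(x)))

# lapply version: one result per column, Date column left out.
output_list <- lapply(df[, 2:7], toy_stat)

# Equivalent explicit for loop over columns 2 to 7.
results <- numeric(6)
names(results) <- names(df)[2:7]
for (j in 2:7) {
  results[j - 1] <- toy_stat(df[[j]])
}

print(unlist(output_list))
```

If the real function needs the dates as well, the anonymous function can take them from df$Date inside the call, since the first column never changes.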

Update a data frame within a for loop

The point of this question is that I want to know how to update a data frame inside either a for loop or a function. I know there are other ways to do the specific task I am looking at, but I want to know how to do it the way I am trying to do it.
I have a data frame with 15 columns and 2k observations, with some 98s and 99s. For each row where there is a 98 or 99 in any variable/column, I want to remove the whole row. I created a function that filters by variable name not equal to 98/99, and applied it with lapply. However, instead of continually updating the data frame, it just spits out a series of data frames, each overwriting the previous one, so at the end I only get a data frame with the last column cleaned. How do I get it to update the data frame for each column sequentially?
nafunction = function(variable){
kuwait5=kuwait5%>%
filter(variable<90)
}
lapply(kuwait5, nafunction)
The expected result is a new data frame with all rows that contain a 98 or 99 removed. What I get is a sequence of data frames, each one having ONE column in which those rows are removed.
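One likely reason for this behaviour: an assignment inside a function only changes a local copy, so the kuwait5 in the global environment is never updated, and lapply returns one separately filtered data frame per column. A minimal base R sketch of sequential updating is below, using a tiny made-up kuwait5 and assuming there are no NA values in the columns (comparisons with NA would need extra handling).

```r
# Tiny made-up stand-in for kuwait5; 98 and 99 mark rows to drop.
kuwait5 <- data.frame(
  a = c(1, 98, 3, 4),
  b = c(5, 6, 99, 8),
  c = c(9, 10, 11, 12)
)

# Sequential update: reassign the same object on every pass of the loop.
cleaned <- kuwait5
for (col in names(cleaned)) {
  cleaned <- cleaned[cleaned[[col]] < 90, , drop = FALSE]
}

# Same result without a loop: keep rows where no column is 90 or above.
keep <- rowSums(kuwait5 >= 90) == 0
cleaned_vec <- kuwait5[keep, , drop = FALSE]

print(cleaned)
```

The key point is that the result of each step is assigned back to the object that the next step reads from, which a function called through lapply cannot do for a global variable without extra machinery.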

Using data frame values to select columns of a different data frame

I'm relatively new in R so excuse me if I'm not even posting this question the right way.
I have a matrix generated by the combn function.
double_expression_combinations <- combn(marker_column_vector,2)
This matrix has x columns and 2 rows. The 2 numbers in each column are column numbers in my main data frame, named initial, and each pair is a combination of columns to be tested. The initial data frame has 27 columns (and thousands of rows) with values of 1 and 0. The test consists of taking the 2 columns of initial given by a column of double_expression_combinations, adding them row by row, and counting how many times the sum is equal to 2.
I believe I'm able to come up with the counting part, I just don't know how to use the data from the double_expression_combinations data frame to select columns to test from the "initial" data frame.
Edited to incorporate corrections made by commenters
In R it's important to keep your terminology precise. double_expression_combinations is not a dataframe but rather a matrix. It's easy to loop over columns in a matrix with apply. I'm a bit unclear about the exact test, but this might succeed:
apply( double_expression_combinations, 2, # the 2 selects each column in turn
function(cols){ sum( initial[ , cols[1] ] + initial[ , cols[2] ] == 2) } )
Both the '+' and '==' operators are vectorised so no additional loop is needed inside the call to sum.
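For a quick check that the apply call does what is intended, here is the same code on a tiny made-up version of initial with 3 marker columns instead of 27. The data values are invented so the counts can be verified by hand.

```r
# Made-up 0/1 data standing in for the real 27-column "initial" data frame.
initial <- data.frame(
  m1 = c(1, 1, 0, 1),
  m2 = c(1, 0, 0, 1),
  m3 = c(0, 1, 1, 1)
)
marker_column_vector <- 1:3

# 2 x 3 matrix; each column is one pair of column numbers: (1,2), (1,3), (2,3).
double_expression_combinations <- combn(marker_column_vector, 2)

# For each pair, count the rows where both columns are 1 (sum equals 2).
counts <- apply(double_expression_combinations, 2, function(cols) {
  sum(initial[, cols[1]] + initial[, cols[2]] == 2)
})

print(counts)  # 2 2 1
```

Each element of counts lines up with the corresponding column of double_expression_combinations, so the two can be bound together if the pair labels are needed.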

Transpose/Reshape Data in R

I have a data set in a wide format, consisting of two rows, one with the variable names and one with the corresponding values. The variables represent characteristics of individuals from a sample of size 1000. For instance I have 1000 variables regarding the size of each individual, then 1000 variables with the height, then 1000 variables with the weight etc. Now I would like to run simple regressions (say weight on calorie consumption), the only way I can think of doing this is to declare a vector that contains the 1000 observations of each variable, say for instance:
regressor1=c(mydata$height0, mydata$height1, mydata$height2, mydata$height3, ... mydata$height1000)
But given that I have a few dozen variables and each containing 1000 observations this will become cumbersome. Is there a way to do this with a loop?
I have also thought about the reshape options of R, but this again would put me in a position where I have to type 1000 variables a few dozen times.
Thank you for your help.
Here is how I would go about your issue. t() will transpose the data for you from many columns to many rows.
Note: t() can be used with a matrix rather than a data frame, I simply coerced to data frame to show my example will work with your data.
# Many columns, 2 rows
x <- as.data.frame(matrix(1:2000, nrow = 2, ncol = 1000))
# 2 columns, many rows (t() returns a matrix)
x_t <- t(x)
Based on your comments you are looking to generate vectors.
If you have transposed:
regressor1 <- x_t[, 1]
regressor2 <- x_t[, 2]
If you have not transposed (unlist() turns the one-row data frame into a plain vector):
regressor1 <- unlist(x[1, ])
regressor2 <- unlist(x[2, ])
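If the variable names follow a pattern such as height1, height2, ..., another option is to pick columns by name rather than typing them out. The sketch below assumes one row of values, column names made of a prefix plus a number, and columns already in individual order; the data and the helper get_var() are hypothetical.

```r
# Made-up wide data: one row, 5 individuals, two variables per individual.
n <- 5
mydata <- as.data.frame(matrix(
  c(170, 165, 180, 175, 160,   # heights
    70, 60, 85, 77, 55),       # weights
  nrow = 1
))
names(mydata) <- c(paste0("height", 1:n), paste0("weight", 1:n))

# Hypothetical helper: collect all columns named <prefix><number> into a vector.
get_var <- function(data, prefix) {
  cols <- grepl(paste0("^", prefix, "[0-9]+$"), names(data))
  unname(unlist(data[1, cols]))
}

height <- get_var(mydata, "height")
weight <- get_var(mydata, "weight")

# The vectors can then go straight into a regression.
fit <- lm(weight ~ height)
print(coef(fit))
```

With a few dozen variables, the same helper can be called once per prefix, for example through sapply over a vector of prefixes.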
