Using data frame values to select columns of a different data frame - r

I'm relatively new in R so excuse me if I'm not even posting this question the right way.
I have a matrix generated by the combn function.
double_expression_combinations <- combn(marker_column_vector,2)
This matrix has x columns and 2 rows. Each column holds two numbers that are used as column numbers in my main data frame, named initial; together they list the combinations of columns to be tested. The initial data frame has 27 columns (and thousands of rows) with values of 1 and 0. The test consists of taking the 2 numbers given by double_expression_combinations as column numbers in initial, adding those 2 columns row by row, and counting how many times the sum is equal to 2.
I believe I'm able to come up with the counting part, I just don't know how to use the data from the double_expression_combinations data frame to select columns to test from the "initial" data frame.
Edited to incorporate corrections made by commenters.

When using R, it's important to keep your terminology precise. double_expression_combinations is not a dataframe but rather a matrix. It's easy to loop over the columns of a matrix with apply. I'm a bit unclear about the exact test, but this might succeed:
apply( double_expression_combinations, 2, # the 2 selects each column in turn
function(cols){ sum( initial[ , cols[1] ] + initial[ , cols[2] ] == 2) } )
Both the '+' and '==' operators are vectorised so no additional loop is needed inside the call to sum.
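As a minimal sketch of the approach above, here is the same apply call run on a small made-up data set. The toy contents of initial and the choice of marker_column_vector are assumptions for illustration only; the real data has 27 columns and thousands of rows.

```r
# Toy stand-in for `initial`: 4 binary columns, 5 rows (hypothetical data)
initial <- data.frame(
  m1 = c(1, 0, 1, 1, 0),
  m2 = c(1, 1, 0, 1, 0),
  m3 = c(0, 1, 1, 1, 0),
  m4 = c(1, 0, 0, 0, 0)
)

# Column numbers to test; here simply all four columns (an assumption)
marker_column_vector <- 1:4
double_expression_combinations <- combn(marker_column_vector, 2)

# For every pair of columns, count the rows where both values are 1
both_positive <- apply(
  double_expression_combinations, 2,
  function(cols) sum(initial[, cols[1]] + initial[, cols[2]] == 2)
)

# Attach the counts to the pairs so each result can be traced back
result <- data.frame(
  col_a = double_expression_combinations[1, ],
  col_b = double_expression_combinations[2, ],
  count = both_positive
)
print(result)
```

Keeping the pair of column numbers next to each count makes it easier to see which combination produced which result.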

Related

Using FOR LOOP over Multiple Columns of MATRIX and keeping FIRST column constant in RStudio

I am running the Automatic Variance Ratio (AVR) test on my dataset in R. My dataset contains 6 indices, i.e. 6 columns excluding the date column. In this test, I need to use a for loop that keeps the first column (the Date column) fixed and moves across the index columns, starting from the 2nd. I am new to R, so I don't know exactly how to do this. Currently, I have code that runs the test for only the 2nd column, but it does not loop over the columns from the 2nd onwards. Any help would be appreciated.
A standard way to loop through the columns of a dataframe is with lapply. If your dataframe is df with 7 columns and you want to loop through columns 2 through 7 and your function is Av.VR() then
output_list <- lapply(df[,2:7], function(x) Av.VR(x))
should yield a list of outputs for each column.
Note I have no experience using the function Av.VR().
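A small sketch of the lapply pattern above, shown next to the equivalent for loop. The data and the function my_test are hypothetical placeholders (my_test just returns the sample variance); substitute the real test function in its place.

```r
# Hypothetical data: a Date column followed by six index columns
set.seed(1)
df <- data.frame(
  Date = seq(as.Date("2020-01-01"), by = "day", length.out = 50),
  matrix(rnorm(50 * 6), ncol = 6,
         dimnames = list(NULL, paste0("Index", 1:6)))
)

# Placeholder for the real test function (an assumption for illustration)
my_test <- function(x) var(x)

# lapply keeps the column names, so each result is labelled by its index
output_list <- lapply(df[, 2:7], my_test)

# The equivalent explicit for loop over columns 2 to 7
output_loop <- vector("list", 6)
names(output_loop) <- names(df)[2:7]
for (j in 2:7) {
  output_loop[[j - 1]] <- my_test(df[[j]])
}
```

Both versions leave the Date column untouched, since only columns 2 through 7 are passed to the function.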

how to divide the value in each cell of a .csv by the value in another cell across multiple rows and variables in R?

I have a .csv file of 39 variables and 713 rows, each containing a count of plastic items. I have another column which is the survey length, and I want to standardise each count of items by a survey length of 100. I am unsure how to create a loop to run through each row and cell individually to do this. Many also have NA values.
Any ideas would be great.
Thank you.
Consider applying the formula directly to the columns, with no need for a loop:
# RETRIEVE ALL COLUMN NAMES (MINUS SURVEY LENGTH)
vars <- names(df)[!grepl("survey_length", names(df))]
# EXPAND SINGLE COLUMN TO EQUAL DIMENSION OF DATA FRAME
survey_length_mat <- matrix(df$survey_length, ncol=length(vars), nrow=nrow(df))
# APPLY FORMULA
df[vars] <- (df[vars] / survey_length_mat) * 100
df
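For illustration, here is a slightly shorter variant on made-up data that relies on R recycling the survey_length vector down each column rather than building the matrix explicitly. The column names bottles and bags and all values are assumptions; only survey_length comes from the code above.

```r
# Hypothetical survey data: two item counts plus the survey length
df <- data.frame(
  bottles       = c(10, NA, 4),
  bags          = c(5, 2, NA),
  survey_length = c(50, 200, 100)
)

# Every column except the survey length is a count to standardise
vars <- setdiff(names(df), "survey_length")

# Dividing a data frame by a vector with one value per row applies the
# division row-wise to every column; NA counts stay NA
df[vars] <- df[vars] / df$survey_length * 100
print(df)
```

In this toy case the first row, surveyed over a length of 50, has its counts doubled, while the NA entries are left as NA.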

Selecting different elements of an R dataframe (one for each row, but possibly different columns) without using loops

Say I have a data.frame of arbitrary dimensions (n by p). I want to extract a vector of length n from that data.frame, one element in the vector per row in the data.frame. However, the column in which each element lies may vary by row. Is there a way to do this without loops?
For example, if I have the following (3x3) data frame, called say DATA
X Y Z
1 17 43
3 4 2
6 9 0
I want to extract one scalar value from DATA per row. I have a vector, call it column.list, c(1,3,1) (arbitrarily selected in this case) which gives the column index for the elements I want, where the kth element of column.list is the column index for row k in DATA. How do I do this without loops? I want to avoid loops because I am using this repeatedly in a simulation study that will take a lot of running time even without loops, and the row number might be 100,000 or so. Much appreciated!
You can do this by indexing your data.frame with a matrix. The first column indicates row, the second indicates column. So if you do
column.list <- c(1,3,1)
DATA[cbind(1:nrow(DATA), column.list)]
You will get
[1] 1 2 6
as desired. If you mix across columns of different classes, all the values will be coerced to the most accommodating data type.
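The example from the question can be reproduced as a self-contained sketch. The second data frame, MIXED, is a hypothetical addition that shows the coercion mentioned above.

```r
DATA <- data.frame(X = c(1, 3, 6), Y = c(17, 4, 9), Z = c(43, 2, 0))
column.list <- c(1, 3, 1)

# Each row of the index matrix is a (row, column) pair
idx <- cbind(seq_len(nrow(DATA)), column.list)
picked <- DATA[idx]
print(picked)  # 1 2 6

# With mixed column classes the result is coerced to a common type,
# here character (MIXED is made up for illustration)
MIXED <- data.frame(num = c(1, 2), chr = c("a", "b"))
mixed_picked <- MIXED[cbind(1:2, c(1, 2))]
print(class(mixed_picked))
```

Using seq_len(nrow(DATA)) instead of 1:nrow(DATA) also behaves sensibly if the data frame happens to have zero rows.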

Subset dataframe based on statistical range of each column

I would like to subset a dataframe by selecting only columns that exceed a specific range, i.e. I would like to evaluate max-min for each column individually and select only columns whose range is greater than a given value. For example, given the following simple dataframe, I would like to create a subset dataframe that only contains columns with a range > 99 (columns b and c).
d <- data.frame(a=seq(0,10,1),b=seq(0,100,10),c=seq(0,200,20))
I have tried modifying the example here: Subset a dataframe based on a single condition applied to multiple columns, but have had no luck. I'm sure I'm missing something simple.
You can use sapply() to apply a function to each column of d that calculates the difference between the column's maximum and minimum, and then compare that difference to 99. The result is a vector of TRUE or FALSE values, which you can use to subset the columns.
d[,sapply(d,function(x) diff(range(x))>99)]
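Splitting the one-liner into two steps makes the intermediate values visible. This is only a sketch; the names col_ranges and wide, and the use of drop = FALSE, are additions not present in the original answer.

```r
d <- data.frame(a = seq(0, 10, 1), b = seq(0, 100, 10), c = seq(0, 200, 20))

# Range (max - min) of every column
col_ranges <- sapply(d, function(x) diff(range(x)))
print(col_ranges)  # a = 10, b = 100, c = 200

# drop = FALSE keeps a data frame even if only one column qualifies
wide <- d[, col_ranges > 99, drop = FALSE]
print(names(wide))  # "b" "c"
```

Without drop = FALSE, a selection that matches a single column would be returned as a plain vector rather than a data frame.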

extract columns that don't have a header or name in R

I need to extract the columns from a dataset without header names.
I have a ~10000 x 3 data set and I need to plot the first column against the second two.
I know how to do it when the columns have names ~ plot(data$V1, data$V2) but in this case they do not. How do I access each column individually when they do not have names?
Thanks
Why not give them sensible names?
names(data)=c("This","That","Other")
plot(data$This,data$That)
That's a better solution than using the column number: names are meaningful, whereas if your data changes to have a different number of columns, code that relies on column numbers may break in several places. Give your data the correct names, and as long as you always refer to data$This, your code will work.
I usually select columns by their position in the matrix/data frame.
e.g.
dataset[,4] to select the 4th column.
The 1st number in brackets refers to rows, the second to columns. Here, I didn't use a "1st number" so all rows of column 4 are selected, i.e., the whole column.
This is easy to remember since it stems from matrix notation. E.g., a 4x3 matrix has 4 rows and 3 columns. Thus when I want to select the 1st row of the third column, I could do something like matrix[1,3]
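A short sketch of positional selection on a made-up data set. The 5 x 3 contents are an assumption for illustration; the real data is roughly 10000 x 3.

```r
# Hypothetical 5 x 3 data set with no meaningful column names
data <- as.data.frame(matrix(1:15, ncol = 3))
print(names(data))  # as.data.frame() supplies "V1" "V2" "V3" by default

first  <- data[, 1]   # whole first column as a vector
second <- data[[2]]   # same idea, list-style extraction
cell   <- data[1, 3]  # row 1 of column 3
```

The extracted vectors can then be passed straight to plot(), for example plot(first, second).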
