Convert range of column titles to variables for vars() function - r

I have a data frame with 100+ variables in columns and one subject per row. I'd like to loop through each column to perform an ANOVA. The loop itself works fine; the step I am stuck on is listing which columns to loop through.
Currently the loop runs through my list of vars, which I build by typing the column names manually. This is obviously not practical:
variables <- vars(height, width, strength)
Which only loops for those selected 3 out of 100+ variables that I have had to manually type in.
I had thought I could list the range of column names for dataframe df between columns 3 to 100 within the vars expression as below...
variables <- vars(colnames(df[3:100]))
This just provides one variable of the name colnames(df[3:100]).
Any ideas to avoid typing or manually inserting commas/removing quotation marks from 100+ different variable names? Thanks in advance.

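Consider do.call, which calls a function with its arguments supplied as a list rather than typed out one by one. Note that do.call requires a list, so the character vector of names is wrapped in as.list. Specifically, the call below:
variables <- do.call(vars, as.list(colnames(df)[3:100]))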
is equivalent to expanded version:
variables <- vars(colnames(df)[3], colnames(df)[4], ..., colnames(df)[100])
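If the end goal is only to run an ANOVA per column, one alternative is to skip vars() and loop over the column names as plain strings. The sketch below is an illustration under assumptions, not the asker's actual loop: it uses the built-in iris data as a stand-in, treats Species as the grouping factor, and the names response_cols and anova_results are made up for the example.

```r
# Stand-in data: iris ships with R; Species plays the role of the grouping factor.
# colnames(iris)[1:4] is analogous to colnames(df)[3:100] in the question.
response_cols <- colnames(iris)[1:4]

anova_results <- setNames(
  lapply(response_cols, function(col) {
    # reformulate() builds a formula such as Sepal.Length ~ Species from strings
    model_formula <- reformulate("Species", response = col)
    summary(aov(model_formula, data = iris))
  }),
  response_cols
)

# Results are stored by column name
anova_results[["Sepal.Length"]]
```

Because the names stay ordinary character strings, no quoting or unquoting is needed, and the result list can be indexed by column name afterwards.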

Related

How to reference a column in a dataframe through a variable in R

I'm trying to iterate through columns in an R data.frame.
To do so, I'm hoping to write a for loop which loops over the column names and then filters the data.table accordingly with values.
My issue is that given the syntax:
df[which(df$XX == y), ]
XX needs to actually be a column name versus a variable that is a string equivalent to the column name.
Is there a way to loop over the columns via inputting a variable?
Many thanks!
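One common way to do this is double-bracket indexing, df[[name]], which accepts the column name as a string held in a variable. A minimal sketch, assuming a made-up data frame and a hypothetical target value y:

```r
# Made-up data for illustration only
df <- data.frame(a = c(1, 2, 2), b = c(2, 2, 3))
y <- 2

matches <- list()
for (col in names(df)) {
  # df[[col]] looks the column up by the string stored in `col`;
  # df$col would instead look for a column literally named "col".
  matches[[col]] <- df[which(df[[col]] == y), , drop = FALSE]
}

# Rows where column "a" equals y
matches$a
```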

Using a variable to point to a vector

I am trying to run comparisons between many different items using a Spearman test (from the package pspearman). I would like to automate switching the variables in, so that rather than running the test one variable at a time I can swap each one in and run them all at once.
I tried to pass the list of the vectors that I would like to compare it to.
spearman.test(access_sam2$Area,access_sam2$B)
All the columns are in the dataframe access_sam2. In the y position there is a list of columns that I need to run:
"CD8_PD1, CD8_PDL1, CD8_GBNEG_FOXP3, CD8_GBNEG_FOXP3_CD45RO, CD8_GBNEG_FOXP3NEG_CD45RO, CD8NEG_PD1, CD8NEG_PDL1, CD8NEG_FOXP3, CD8NEG_FOXP3_CD45RO,CD68_PDL1, CK_PDL1."
The problem is that it is not possible to use indexes because the columns are not sequential, and the data frame has 660+ columns.
I could write 7 Spearman tests, but changing all 7 for each Area variable seems inefficient.
First set yvars to be a character variable which names the columns you want or is a numeric variable that gives their column numbers. We have shown the first few elements of it below. Then we define a function which takes a variable name and outputs the spearman test. Finally use Map to apply that function to each component of yvars.
yvars <- c("CD8_PD1", "CD8_PDL1", "CD8_GBNEG_FOXP3")
sptest <- function(yvar) spearman.test(access_sam2$Area, access_sam2[[yvar]])
Map(sptest, yvars)
Reproducible example
Below is a reproducible example using the mtcars data frame that comes with R.
library(pspearman)
yvars <- c("cyl", "disp", "hp")
sptest <- function(yvar) spearman.test(mtcars$mpg, mtcars[[yvar]])
Map(sptest, yvars)

Separating data frame based on column values

I am having a bit of trouble with trying to script a code in R so that it separates a data frame based on the character in a data frame column without manually specifying a subset command. Below is the script for reproduction in R:
a=c("Model_A","R1",358723.0,171704.0,1.0,36.818500,4.0222700,1.38895000)
b=c("Model_A","R2",358723.0,171704.0,2.6,36.447300,4.0116100,1.37479000)
c=c("Model_A","R3",358723.0,171704.0,5.0,35.615400,3.8092600,1.34301000)
d=c("Model_B","R1",358723.0,171704.0,1.0,39.818300,2.4475600,1.50384000)
e=c("Model_B","R2",358723.0,171704.0,2.6,39.391600,2.4209900,1.48754000)
f=c("Model_B","R3",358723.0,171704.0,5.0,38.442700,2.3618400,1.45126000)
g=c("Model_C","R1",358723.0,171704.0,1.0,31.246400,2.2388000,1.30652000)
h=c("Model_C","R2",358723.0,171704.0,2.6,30.911600,2.2144800,1.29234000)
i=c("Model_C","R3",358723.0,171704.0,5.0,30.166700,2.1603000,1.26077000)
df=data.frame(a,b,c,d,e,f,g,h,i)
df=t(df)
df=data.frame(df)
col_list=list("Model","Receptor.name","X(m.)","Y(m.)","Z(m.)",
"nox","PM10","PM2.5")
colnames(df)=col_list
Essentially what I am trying is to separate the data frame (df) by the Model names ("Model_A", "Model_B", and "Model_C") and store them in new and different data frames. I have been trying to use the following command
df_test=split(df,with(df,interaction(Model,Model)), drop = TRUE)
This command separates the data frame but stores them in lists, and I don't know how to extract the lists individually and store them as data frames. Is there a simpler solution (avoiding the subset command if possible as I need the script to be dynamic and relative) or does anyone know how to use the last command shown above to separate the lists into individual data frames? Also if possible, is it possible to name the data frame after the model?
I apologize if these are a lot of questions but any help would be hugely appreciated! Thank you!
list2env(split(df, df$Model), envir = .GlobalEnv) will give you three dataframes in your global environment, named after the models, containing the relevant rows.
> Model_A
    Model Receptor.name  X(m.)  Y(m.) Z(m.)     nox    PM10   PM2.5
a Model_A            R1 358723 171704     1 36.8185 4.02227 1.38895
b Model_A            R2 358723 171704   2.6 36.4473 4.01161 1.37479
c Model_A            R3 358723 171704     5 35.6154 3.80926 1.34301
Although I would just keep the list of three dataframes by only using dflist <- split(df, df$Model).
Why a list? Lists allow you the use of lapply - a looping function that applies an operation over every list element. A quick example: Let's say you'd want to get a frequency table for both PM variables in your data for all three datasets.
For single elements in your global environment this would be
table(Model_A$PM10)
table(Model_A$PM2.5)
...
table(Model_C$PM2.5)
With a list, it would be
lapply(dflist, function(x) table(x["PM10"]))
lapply(dflist, function(x) table(x["PM2.5"]))
Right now, it seems to only save some lines of code, but better yet, the output of lapply is again a list, which you can store in an object and further use for different operations. Due to this, you can have a global environment with only a few objects in it, each being lists which contain certain similar objects, like dataframes, tables, summaries or even plots.
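To make the list workflow concrete, here is a small self-contained sketch. The data frame readings and its values are invented for illustration; only the split/apply pattern is the point.

```r
# Invented data for illustration only
readings <- data.frame(
  Model = c("Model_A", "Model_A", "Model_B", "Model_B", "Model_C"),
  nox   = c(36.8, 36.4, 39.8, 39.4, 31.2)
)

# One data frame per model, kept together in a named list
by_model <- split(readings, readings$Model)

# Apply the same operation to every element; the result is named by model
mean_nox <- sapply(by_model, function(d) mean(d$nox))

# A single data frame can still be pulled out by name when needed
by_model[["Model_B"]]
```

Keeping the pieces in by_model means a new model in the data needs no code changes, whereas separately named data frames would each have to be referenced by hand.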

correlation of several columns need to be calculated

I'm trying to get the correlation coefficient for corresponding columns of two csv files. I simply use the following but get errors. Each csv file has 50 columns.
first_values <- read.csv("")
second_values <- read.csv("")
correlation.csv <- cor(x = first_values, y = second_values, method = "spearman")
But I get an "'x' must be numeric" error!
subset of one csv file
Thanks for your help
The read.table function and all of its derivatives return a data.frame, which is an R list object. The mapply function processes lists in "parallel". If the matching columns are in the same order in the two datasets, have the same number of rows, and do not have spaces in their names, it would be as simple as:
mapply(cor, first_values , second_values)
If it's more complicated than that, then you need to fill in the missing details with example data by editing the question (not by responding in comments).
There must be some categorical variable in X. So you can first separate that categorical variable from X and then use X in the cor() function.
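The two suggestions can be combined: drop the non-numeric columns first, then correlate the remaining columns pairwise. The sketch below uses invented data in place of the two csv files and assumes both data frames share the same column names and row order; num_cols and correlations are names made up for the example.

```r
# Invented stand-ins for the two csv files; "label" is a categorical column
first_values <- data.frame(
  a = c(1, 2, 3, 4), b = c(4, 3, 2, 1), label = c("w", "x", "y", "z")
)
second_values <- data.frame(
  a = c(2, 4, 6, 9), b = c(1, 2, 3, 5), label = c("w", "x", "y", "z")
)

# Keep only the numeric columns (assumes the same names exist in both files)
num_cols <- names(first_values)[sapply(first_values, is.numeric)]

# Column-by-column Spearman correlation; MoreArgs passes method to every cor() call
correlations <- mapply(
  cor,
  first_values[num_cols],
  second_values[num_cols],
  MoreArgs = list(method = "spearman")
)
correlations
```

The result is a named numeric vector with one coefficient per shared numeric column.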

Specifying names of columns to be used in a loop R

I have a df with over 30 columns and over 200 rows, but for simplicity will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values, faceted using the values in B and C.
plotdata<-function(l){
melt<-melt(df,id.vars=c("X1","B","C"),measure.vars=l)
plot<-ggplot(melt,aes(x=X1,y=value))+geom_point()
plot2<-plot+facet_grid(B ~ C)
ggsave(filename=paste("X_vs_",l,"_faceted.jpeg",sep=""),plot=plot2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type the column of interest into plotdata and then get the result, but this seems quite inelegant (and time consuming). I would prefer to be able to manually specify the columns of interest e.g. "Y1","Y3","Y4" and then write a loop function to do all those specified.
However, I am new to writing for loops and can't find a way to loop over the specific column names that my function requires. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow. I couldn't find it if there was.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric
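The same pattern works with any per-column action. As a rough, self-contained sketch, the loop below swaps the question's ggplot2/ggsave function for base graphics so that it runs without extra packages; the output directory (tempdir()) and the names wanted and out_files are assumptions made for the example.

```r
set.seed(1)
# Simplified version of the question's data
df <- data.frame(
  X1 = sample(100, 25), Y1 = sample(100, 25),
  Y3 = sample(100, 25), Y4 = sample(100, 25)
)

wanted <- c("Y1", "Y3", "Y4")   # user-specified columns
out_files <- character(0)

for (col in wanted) {
  # One file per column, named after the column
  out_file <- file.path(tempdir(), paste0("X_vs_", col, ".pdf"))
  pdf(out_file)
  plot(df$X1, df[[col]], xlab = "X1", ylab = col)
  dev.off()
  out_files <- c(out_files, out_file)
}

out_files
```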
