How to call a column name using a loop in R

I'm very new to R, and I would like to know the best way to reference a different column on each iteration of a for loop.
My code goes like this:
Variables <- c("Var1","Var2","Var3","Var4","Var5","Var6","Var7")
Years <- c(2015,2016,2017,2018)
for (Year in Years) {
  for (Var in Variables) {
    TT = auc(data[data$Def_Year==Year,]$Good_Bad,
             data[data$Def_Year==Year,]$Var)
    print(TT)
  }
}
I'm trying to calculate the AUC (area under the ROC curve) for each variable in each year, in order to check the stability of a credit scoring model's performance.
The problem is that R does not understand the $Var expression. In Excel I sometimes use & to get around this kind of obstacle. I would love to hear your recommendations.

Hi, you could do something like this. See my sample code below:
df <- data.frame(v1 = c(1,2,3), v2 = c(4,5,6))
variables <- c("v1", "v2")
for(var in variables) {
  print(df[, var])
}
Output:
[1] 1 2 3
[1] 4 5 6
I have not solved your code directly, as it is not advised on SO to solve the task fully but rather to give general guidance towards a solution. I would suggest you go through this: https://stats.idre.ucla.edu/r/modules/subsetting-data/ to better understand subsetting in R.
Also see https://cran.r-project.org/doc/manuals/R-lang.html#Indexing to understand the indexing in R.
From above:
The form using $ applies to recursive objects such as lists and pairlists. It allows only a literal character string or a symbol as the index. That is, the index is not computable: for cases where you need to evaluate an expression to find the index, use x[[expr]]. Applying $ to a non-recursive object is an error.
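Applied to the nested loop in the question, the x[[expr]] form might look like the sketch below. The data frame is made up for illustration, and simple_auc() is a hypothetical base-R stand-in (a rank-based AUC) used only so the example runs without extra packages; it is not the auc() function from the question.

```r
# Hypothetical data: one outcome column, two score columns, two years
data <- data.frame(
  Def_Year = rep(c(2015, 2016), each = 4),
  Good_Bad = c(0, 0, 1, 1, 0, 1, 0, 1),
  Var1 = c(0.1, 0.2, 0.8, 0.9, 0.3, 0.7, 0.4, 0.6),
  Var2 = c(0.9, 0.8, 0.2, 0.1, 0.5, 0.4, 0.6, 0.3)
)

# Stand-in for auc(): rank-based (Mann-Whitney) AUC in base R
simple_auc <- function(label, score) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

results <- list()
for (Year in c(2015, 2016)) {
  subset_year <- data[data$Def_Year == Year, ]
  for (Var in c("Var1", "Var2")) {
    # [[Var]] evaluates Var, so it selects the column whose name Var holds
    results[[paste(Year, Var)]] <- simple_auc(subset_year$Good_Bad,
                                              subset_year[[Var]])
  }
}
print(unlist(results))
```

Subsetting the year once per outer iteration also avoids repeating the row filter for every variable.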

Related

How to save multiple variables with 1 line of code in R?

I have 7 large Seurat objects, saved as sn1, sn2, sn3 ... sn7.
I am trying to run ScaleData on all 7 samples. I could write the same lines 7 times:
all.genes <- rownames(sn1)
snN1<-ScaleData(sn1, features = all.genes)
all.genes <- rownames(sn2)
snN2<-ScaleData(sn2, features = all.genes)
all.genes <- rownames(sn3)
snN3<-ScaleData(sn3, features = all.genes)
.
.
.
This would work perfectly. Since I still have to use all 7 samples for quite a while, I thought I'd save time and write a for loop to do the job with one line of code, but I am unable to save the variables; I get the error "Error in { : target of assignment expands to non-language object".
This is what I tried:
samples<-c("sn1", "sn2", "sn3", "sn4", "sn5", "sn6", "sn7")
list<-c("snN1", "snN2", "snN3", "snN4", "snN5", "snN6", "snN7")
for (i in samples) {
  all.genes <- rownames(get(i))
  list[1:7]<-ScaleData(get(i), features = all.genes)
}
How do I have to format the code so that it creates the variables snN1, snN2, snN3... and saves the scaled data from sn1, sn2, sn3... to each respective new variable?
I think the error is in this line: list[1:7]<-ScaleData(get(i), features = all.genes). On every iteration you are assigning the output of ScaleData to all 7 elements of a character vector, which makes no sense. I think you are looking for the function assign(), but it is recommended only in very specific contexts.
Also, there are better methods than for loops in R, for example apply() and related functions. I recommend writing the steps you want to apply as a custom function, and then calling lapply() to run it on every object, as a for loop would, and store the results in a list. To pass every 'snX' variable as input, you can put the objects themselves in a list.
# Custom function: scale one object using all of its genes
custom_scale <- function(x) {
  all.genes <- rownames(x)
  ScaleData(x, features = all.genes)  # last expression is the return value
}
# Create a list that points to every variable
samples <- list(sn1, sn2, sn3, sn4, sn5, sn6, sn7) # Note I'm not using characters, I'm referencing the actual variables.
# Use lapply to iterate over the list and apply the custom function, saving the results in a list
scaled_Data_list <- lapply(samples, custom_scale)
This should work; however, without example data I can't test it.
Here is how to do it using a loop and assign. I removed some redundant code/variables as this can always be a source of error. However, I agree with RobertoT that storing such data in a list and using lapply is a good idea.
samples <- paste0('sn', 1:7)
for (sn in samples) {
  sn.data <- get(sn)
  assign(sub('n', 'nN', sn),
         ScaleData(sn.data, features=rownames(sn.data)))
}
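The two answers can be combined: mget() collects existing objects into a named list by name, and lapply() then processes them without creating new global variables. The following is only a sketch under stated assumptions: the sn objects are small stand-in matrices rather than Seurat objects, and scale_one() is a hypothetical placeholder for the ScaleData() call.

```r
# Stand-ins for the Seurat objects (assumption: plain numeric matrices)
sn1 <- matrix(as.numeric(1:6), nrow = 3)
sn2 <- matrix(as.numeric(2:7), nrow = 3)
sn3 <- matrix(as.numeric(3:8), nrow = 3)

# Hypothetical placeholder for ScaleData(): centre and scale each column
scale_one <- function(x) scale(x)

samples <- mget(paste0("sn", 1:3))              # named list: sn1, sn2, sn3
scaled <- lapply(samples, scale_one)            # same names, scaled contents
names(scaled) <- sub("n", "nN", names(scaled))  # snN1, snN2, snN3

print(names(scaled))
```

Individual results are then available as scaled$snN1, scaled[["snN2"]], and so on, and the whole collection can be passed to further lapply() calls.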

How to reference variables from a list when looping over variables using "for"

I am a beginner at R coming from Stata, and my first headache is figuring out how to loop over a list of names, applying the same operation to each. The names are variables in a data frame. I tried defining a list this way: mylist <- c("df$name1", "df$name2"), and then I tried: for (i in mylist) { i }, which I hoped would be equivalent to writing df$name1 and then df$name2, so that R would print the contents of the variables name1 and name2 from the data frame df. I also tried other commands inside the loop, such as deleting a variable with i = NULL, but that didn't work either. I would greatly appreciate it if someone could tell me what I am doing wrong. I wonder if it has something to do with the way I write the i; maybe R does not interpret it as the elements of my character vector.
For more clarification, I will write out the code I would use in Stata in this instance. Instead of asking Stata to print the content of a variable, I am asking it to give summary statistics of a variable, i.e. the number of observations, mean, standard deviation, min and max, using the summarize command. In Stata I don't need to refer to the data frame, as I usually have only one dataset in memory, and I need only write:
foreach i in name1 name2 { // name1 and name2 being the names of the variables
    summarize `i'
}
So far, I haven't managed to do the same thing using the for function in R, which I naively thought would be:
mylist<-c("df$name1", "df$name2")
for (i in mylist) {
  summary(i)
}
You probably just need to print the name to see it. For example, if we have a data frame like this:
df <- data.frame("A" = "a", "B" = "b", "C" = "c")
df
#>   A B C
#> 1 a b c
names(df)
#> [1] "A" "B" "C"
We can operate on the names using a for loop on the names(df) vector (no need to define a special list).
for (name in names(df)) {
  print(name)
  # your code here
}
R is a little more reluctant than Stata to let you use strings/locals as code. You can do it with functions like eval, but in general that's not the ideal way to do it.
In the case of variable names, though, you're in luck, as you can use a string to pull out a variable from a data.frame with [[]]. For example:
df <- data.frame(a = 1:10,
                 b = 11:20,
                 c = 21:30)
for (i in c('a','b')) {
  print(i)
  print(summary(df[[i]]))
}
Notes:
If you want an object printed from inside a for loop, you need to use print().
I'm assuming that you're using the summary() function just as an example and so need the loop. But if you really just want a summary of each variable, summary(df) will do them all, or summary(df[,c('a','b')]) to just do a and b. Or check out the stargazer() function in the stargazer package, which has defaults that will feel pretty comfortable for a Stata user.
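If the summaries are needed as values rather than printed output, the same [[ ]] idea works with lapply() and sapply(). A minimal sketch using the same example data frame; the names vars, summaries and means are only illustrative:

```r
df <- data.frame(a = 1:10,
                 b = 11:20,
                 c = 21:30)
vars <- c("a", "b")

# One summary per selected column, returned as a named list
summaries <- lapply(df[vars], summary)

# A single statistic per column, returned as a named numeric vector
means <- sapply(vars, function(v) mean(df[[v]]))

print(summaries$a)
print(means)
```

Because the results are stored, they can be compared or combined later, which printing inside a for loop does not allow.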

Different variable lengths when looping over a string vector which corresponds to data frame columns

I am new to writing loops and I am having some difficulties. I have already looked through other questions, but didn't find the answer to my specific problem.
So let's just create a random dataset, give it column names, and set the variables as character:
d<-data.frame(replicate(4,sample(1:9,197,rep=TRUE)))
colnames(d)<-c("variable1","variable2","trait1","trait2")
d$variable1<-as.character(d$variable1)
d$variable2<-as.character(d$variable2)
Now I define the vector over which I want to loop. It corresponds to trait 1 and trait 2:
trt.nm <- names(d[c(3,4)])
Now I want to apply the following model for trait 1 and trait 2 (which should now be as column names in trt.nm) in a loop:
library(lme4)
for(trait in trt.nm)
{
  lmer (trait ~ 1 + variable1 + (1|variable2) ,data=d)
}
Now I get the error that variable lengths differ. How could this be explained?
If I apply the model without loop for each trait, I get a result, so the problem has to be somewhere in the loop, I think.
trait is a string, so you'll have to build the formula from it for this to work; see http://www.cookbook-r.com/Formulas/Creating_a_formula_from_a_string/ for more info.
Try this (you'll have to add a print statement or save the result to actually see what it does, but this will run without errors):
for(trait in trt.nm) {
  lmer(as.formula(paste(trait, " ~ 1 + variable1 + (1|variable2)")), data = d)
}
Another suggestion would be to use a list and lapply or purrr::map instead. Good luck!
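A sketch of the list-based variant follows. To keep it runnable without lme4, it fits an ordinary lm() with only the fixed effect; that simplification is an assumption of this example, and with lme4 loaded the lmer() call from the loop above would take its place. reformulate() from base R builds the formula from strings.

```r
set.seed(1)
d <- data.frame(replicate(4, sample(1:9, 197, replace = TRUE)))
colnames(d) <- c("variable1", "variable2", "trait1", "trait2")
trt.nm <- c("trait1", "trait2")

# Fit one model per trait and keep the results in a named list
models <- lapply(setNames(trt.nm, trt.nm), function(trait) {
  f <- reformulate("variable1", response = trait)  # e.g. trait1 ~ variable1
  lm(f, data = d)                                  # stand-in for lmer()
})

print(names(models))
print(formula(models$trait1))
```

Storing the fitted models also addresses the point that a bare call inside a for loop neither prints nor saves its result.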

R- Please help. Having trouble writing for loop to lag date

I am attempting to write a for loop which will take subsets of a dataframe by person id and then lag the EXAMDATE variable by one for comparison. So a given row will have the original EXAMDATE and also a variable EXAMDATE_LAG which will contain the value of the EXAMDATE one row before it.
for (i in length(uniquerid))
{
  temp <- subset(part2test, RID==uniquerid[i])
  temp$EXAMDATE_LAG <- temp$EXAMDATE
  temp2 <- data.frame(lag(temp, -1, na.pad=TRUE))
  temp3 <- data.frame(cbind(temp,temp2))
}
It seems that I am creating the new variable just fine but I know that the lag won't work properly because I am missing steps. Perhaps I have also misunderstood other peoples' examples on how to use the lag function?
To answer this fully: there are a handful of things wrong with your code. Lucaino has pointed one out. Each time through the loop you create (or overwrite) temp, temp2, and temp3, so you'll be left with only the output of the last iteration.
However, this isn't something that needs a loop. Instead, you can make use of the vectorized nature of R:
> x <- 1:10
> c(x[-1], NA)
 [1]  2  3  4  5  6  7  8  9 10 NA
So if you combine that notion with a library like plyr that splits data nicely you should have a workable solution. If I've missed something or this doesn't solve your problem, please provide a reproducible example.
library(plyr)
myLag <- function(x) {
  c(x[-1], NA)
}
# Group by the RID column of part2test (uniquerid is a separate vector, not a column)
ddply(part2test, .(RID), transform, EXAMDATE_LAG=myLag(EXAMDATE))
You could also do this in base R using split or the data.table package using its by= argument.
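A base R sketch of the split approach is shown here with made-up data. Note that c(x[-1], NA) pairs each row with the value that follows it; since the question asks for the value one row before, the hypothetical helper lag_one() below shifts in the other direction, using indexing so that the Date class is preserved.

```r
# Made-up example data with the column names from the question
part2test <- data.frame(
  RID = c(1, 1, 1, 2, 2),
  EXAMDATE = as.Date(c("2010-01-01", "2010-06-01", "2011-01-01",
                       "2012-03-01", "2012-09-01"))
)

# Shift a vector down by one position; the first element becomes NA
lag_one <- function(x) x[c(NA, head(seq_along(x), -1))]

# Split by person, add the lagged column within each piece, and recombine
pieces <- lapply(split(part2test, part2test$RID), function(g) {
  g$EXAMDATE_LAG <- lag_one(g$EXAMDATE)
  g
})
result <- do.call(rbind, pieces)
print(result)
```

The first exam of each person has no predecessor, so its EXAMDATE_LAG is NA.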

How to rewrite this Stata code in R?

One of the things Stata does well is the way it constructs new variables (see example below). How to do this in R?
foreach i in A B C D {
    forval n=1990/2000 {
        local m = `n'-1
        * create new columns from existing ones on-the-fly
        generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
    }
}
DON'T do it in R. The reason it's messy is because it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names. They have no structure, so do not try to impose one on them. Decent programming languages have structures for this; rubbishy programming languages have tacked-on 'Macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.
For example, how do you know how many popXXXX variables you have? How do you know if you have a complete sequence of pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.
Use a data structure that the language gives you. In this case probably a list.
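As a minimal sketch of what that could look like: one list with an element per population, each holding a numeric vector named by year. The starting values and the trend vector are invented for illustration; only the growth rule pop[n] = pop[n-1] * (1 + trend[n]) comes from the Stata code in the question.

```r
set.seed(1)
years <- 1990:2000
trend <- setNames(runif(length(years), -0.1, 0.1), years)  # hypothetical trends
start <- c(A = 100, B = 200, C = 300, D = 400)             # hypothetical 1989 values

# One list element per population, one named value per year
pop <- lapply(start, function(p0) {
  p0 <- unname(p0)
  values <- setNames(p0 * cumprod(1 + trend), years)
  c("1989" = p0, values)
})

print(names(pop))       # which populations exist
print(names(pop$A))     # which years are present for A
print(pop$B[["1995"]])  # a single value, addressed by name
```

With this layout, the questions "how many populations are there?" and "is the sequence of years complete?" become length(pop) and names(pop$A), and the whole structure can be saved with a single saveRDS() call.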
Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest to add the variables to a dataframe (which is also a kind of list) instead of to the global environment (see below).
But honestly, the more R-ish way to do this is to keep your factors as factors instead of encoding them in variable names.
I make some data as I believe it is in your R version now (at least, I hope so...)
Data <- data.frame(
  popA1989 = 1:10,
  popB1989 = 10:1,
  popC1989 = 11:20,
  popD1989 = 20:11
)
Trend <- replicate(11,runif(10,-0.1,0.1))
You can then use the stack() function to obtain a dataframe where you have a factor pop and a numeric variable year
newData <- stack(Data)
newData$pop <- substr(newData$ind,4,4)
newData$year <- as.numeric(substr(newData$ind,5,8))
newData$ind <- NULL
Filling up the dataframe is then quite easy :
for(i in 1:11){
  tmp <- newData[newData$year==(1988+i),]
  newData <- rbind(newData,
    # grow by the trend, i.e. multiply by (1 + trend) as in the Stata code
    data.frame( values = tmp$values*(1+Trend[,i]),
                pop = tmp$pop,
                year = tmp$year+1
    )
  )
}
In this format, you'll find most R commands (selections of some years, of a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
And if you insist, you can still create a wide format with unstack()
unstack(newData,values~paste("pop",pop,year,sep=""))
Adaptation of Joshua's answer to add the columns to the dataframe :
for(L in LETTERS[1:4]) {
  for(i in 1990:2000) {
    new <- paste("pop",L,i,sep="") # create name for new variable
    old <- get(paste("pop",L,i-1,sep=""),Data) # get old variable
    trend <- Trend[,i-1989] # get trend variable
    Data <- within(Data,assign(new, old*(1+trend)))
  }
}
Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.
for(L in LETTERS[1:4]) {
  for(i in 1990:2000) {
    new <- paste("pop",L,i,sep="") # create name for new variable
    old <- get(paste("pop",L,i-1,sep="")) # get old variable
    trend <- get(paste("trend",i,sep="")) # get trend variable
    assign(new, old*(1+trend))
  }
}
Assuming you have population data in the vector pop1989 and data for the trend in trend:
require(stringr)# because str_c has better default for sep parameter
dta <- kronecker(pop1989,cumprod(1+trend))
names(dta) <- kronecker(str_c("pop",LETTERS[1:4]),1990:2000,str_c)

Resources