Creating a loop over variables' names - r

I'm new to R (started a few days ago) and coming from STATA. I am trying to create a loop to create dummy variables when a variable has value -9. I want to use a loop as I have got plenty of variables like this.
In the following, reflex_working is my dataframe and "A7LECTUR" etc are my variables. I am trying to create a dummy called "miss_varname" for each variable using the ifelse function.
varlist<-c("A7LECTUR", "A7GROASG", "A7RESPRJ", "A7WORPLC", "A7PRACTI",
"A7THEORI", "A7TEACHR", "A7PROBAL", "A7WRIASG", "A7ORALPR")
for (i in varlist){
reflex_working$miss_[i]<-ifelse(reflex_working$i==-9,1,0)
}
I get the following warnings for each iteration:
1: Unknown or uninitialised column: 'miss_'.
2: Unknown or uninitialised column: 'i'.
And no variable is created. I assume this must be something very trivial for everyone, but I have been trying for the last hour to create this kind of loop and have zero results to show.
Edit:
I have something like:
A7LECTUR
1
2
1
4
-9
And would like, after the loop, to have a new column like:
reflex_working$miss_A7LECTUR
0
0
0
0
1
Hope this helps clarify what I'm trying to achieve!
Any help would be seriously appreciated.
Gabriele

Let's break this down into why it doesn't work. For starters, in R
i
A7LECTUR
# and
"A7LECTUR"
are different. The first two are variable names; the last is a value (a character string). I am emphasising this difference because it is an important distinction.
This matters when working with lists (and data frames, as data frames are basically lists with some restrictions to make them rectangular): in the syntax reflex_working$i, reflex_working refers to the variable and i refers to the element named "i" within the list. In reflex_working$i, the i is literal, and R doesn't care whether you have a variable named i.
With programming, we want to be a bit more dynamic. So you correctly assumed using a variable would do the trick. If you want to do that, you have to use the [ or [[ subset method ([ always returns a list, while [[ will return the element without the encapsulating list[1]).
To summarise:
reflex_working$i # gets the element named i, no matter what.
reflex_working[[i]] # gets the element whose name (or position) is stored in the variable i
reflex_working$i # is equivalent to reflex_working[["i"]]
That should explain the right-hand-side of your line in the loop. The correct statement should read
ifelse(reflex_working[[i]]==-9,1,0)
For the left-hand-side, reflex_working$miss_[i], things are completely off. What you want can be decomposed into several steps:
Compose a value by concatenating "miss_" and the value of i.
Use that value as the element/column name.
We can combine these two into (as a commenter stated)
reflex_working[[paste0('miss_', i)]] <- ...
Good job on you, for realising that R is inherently vectorized - since you are not writing a loop for each row in the column. Good one!
[1] But [[ can return a list if the element itself is a list. R can be... full of surprises.
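Putting the two halves together, the whole loop can be sketched as follows. The data frame here is a made-up stand-in for reflex_working with only two of the columns, so the values are purely illustrative.

```r
# Toy stand-in for reflex_working (values invented for illustration)
reflex_working <- data.frame(
  A7LECTUR = c(1, 2, 1, 4, -9),
  A7GROASG = c(-9, 3, 2, -9, 1)
)
varlist <- c("A7LECTUR", "A7GROASG")

for (i in varlist) {
  # [[i]] looks up the column whose name is stored in i;
  # paste0() builds the new column name, e.g. "miss_A7LECTUR"
  reflex_working[[paste0("miss_", i)]] <- ifelse(reflex_working[[i]] == -9, 1, 0)
}

reflex_working$miss_A7LECTUR
```

Note that if a column contains NA, ifelse() returns NA for that row rather than 0, so you may want to decide separately how to treat those.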

Assuming you want this for the entire data frame.
tt <- read.table(text="
A7LECTUR A7GROASG
1 2
2 3
1 -9
4 -9
-9 0", header=TRUE)
tt.d <- (tt == -9)*1
colnames(tt.d) <- paste0("miss_", colnames(tt))
tt.d
# miss_A7LECTUR miss_A7GROASG
# [1,] 0 0
# [2,] 0 0
# [3,] 0 1
# [4,] 0 1
# [5,] 1 0
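Since tt.d is a matrix and not part of the data frame, you would probably want to attach it afterwards. A minimal sketch, assuming the same toy data as above (tt_all is just a name picked for this example):

```r
tt <- data.frame(A7LECTUR = c(1, 2, 1, 4, -9),
                 A7GROASG = c(2, 3, -9, -9, 0))

# (tt == -9) is a logical matrix; multiplying by 1 turns TRUE/FALSE into 1/0
tt.d <- (tt == -9) * 1
colnames(tt.d) <- paste0("miss_", colnames(tt))

# cbind() with a data frame as first argument returns a data frame
tt_all <- cbind(tt, tt.d)
names(tt_all)
```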


using seq_along() to handle the empty case

I read that using seq_along() handles the empty case much better, but this concept is not so clear in my mind.
For example, I have this data frame:
df
a b c d
1 1.2767671 0.133558438 1.5582137 0.6049921
2 -1.2133819 -0.595845408 -0.9492494 -0.9633872
3 0.4512179 0.425949910 0.1529301 -0.3012190
4 1.4945791 0.211932487 -1.2051334 0.1218442
5 2.0102918 0.135363711 0.2808456 1.1293810
6 1.0827021 0.290615747 2.5339719 -0.3265962
7 -0.1107592 -2.762735937 -0.2428827 -0.3340126
8 0.3439831 0.323193841 0.9623515 -0.1099747
9 0.3794022 -1.306189542 0.6185657 0.5889456
10 1.2966537 -0.004927108 -1.3796625 -1.1577800
Considering these three different code snippets:
# Case 1
for (i in 1:ncol(df)) {
print(median(df[[i]]))
}
# Case 2
for (i in seq_along(df)) {
print(median(df[[i]]))
}
# Case 3
for(i in df) print(median(i))
What is the difference between these different procedures when a full data.frame exists or in the presence of an empty data.frame?
Under the condition that df <- data.frame(), we have:
Case 1 falling victim to...
Error in .subset2(x, i, exact = exact) : subscript out of bounds
while Case 2 and 3 are not triggered.
In essence, the error in Case 1 is due to ncol(df) being 0. This makes the sequence 1:ncol(df) equal to 1:0, which counts down and creates the vector c(1,0). The for loop therefore starts with the first element of that vector, 1, and tries to access column 1, which does not exist. Hence, the subscript is found to be out of bounds.
Meanwhile, in Case 2 and 3 the for loop body is never executed, because the collections being iterated over are empty. That is, they have a length of 0.
As this question specifically relates to what the heck is happening to seq_along(), let's take a traditional seq_along example by constructing a full vector a and seeing the results:
set.seed(111)
a <- runif(5)
seq_along(a)
#[1] 1 2 3 4 5
In essence, for each element of the vector a, there is a corresponding index that was created by seq_along to be accessed.
If we apply seq_along now to the empty df in the above case, we get:
seq_along(df)
# integer(0)
Thus, what was created was a zero-length vector. It's mighty hard to move along a zero-length vector.
Ergo, Case 1 poorly protects against the empty case.
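To see the difference without triggering the error, here is a small sketch that only counts how many times each loop body would run on an empty data.frame (empty_df, n_colon and n_seq are names made up for this illustration):

```r
empty_df <- data.frame()

1:ncol(empty_df)      # counts down: 1 0
seq_along(empty_df)   # integer(0)

# Count how many times each loop body runs
n_colon <- 0
for (i in 1:ncol(empty_df)) n_colon <- n_colon + 1

n_seq <- 0
for (i in seq_along(empty_df)) n_seq <- n_seq + 1

c(colon = n_colon, seq_along = n_seq)
```

If you need the index form based on a count, seq_len(ncol(empty_df)) also gives a zero-length vector.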
Now, under the traditional assumption that there is some data within the data.frame (a very bad assumption for any kind of developer to make)...
set.seed(1234)
df <- data.frame(matrix(rnorm(40), ncol = 4))
All three cases would be operating as expected. That is, you would receive a median per column of the data.frame.
[1] -0.5555419
[1] -0.4941011
[1] -0.4656169
[1] -0.605349

R: for-loop solution to deleting columns from multiple data frames

My question is probably quite simple, but I think my code could definitely be improved. Right now it's two for-loops, and I'm sure there's a way to do what I need in a single loop, but for the life of me I can't see what it is.
Having searched Stack, I found this excellent answer from Ananda where he was able to extract and keep columns within a range using lapply and for-loop methods. The structure of my data gets in the way, however, as I want to be able to pick specific columns to delete. My data structure looks like this:
1 AAAT_1 1 GROUP **** 1 -13.70 0
2 AAAT_2 51 GROUP **** 1 -9.21 0
3 AAAT_3 101 GROUP **** 1 -7.60 0
4 AAAT_4 151 GROUP **** 1 -6.28 0
It's an extract from some docking software, and the only columns I want to keep are 2 (e.g. AAAT_1) and 7 (e.g. -13.70). The code I've used to do it, two for-loops:
for (i in 1:length(temp)) {
assign(temp[i], get(temp[i])[2:7])
}
....to keep the data from columns 2-7, followed by:
for (i in 1:length(temp)) {
assign(temp[i], get(temp[i])[-2:-5])
}
....to delete the rest of the columns I didn't need, where temp[i] is just a list of data frames the loop is acting on.
So, as you can see, it's just two loops doing similar actions. Surely there's a way to pick specific columns to keep/delete and do it all in one loop/lapply statement? Trying things like [2,7] in the get statement doesn't work: it appears to keep only column 7 and turns each data frame into 'Values' instead. I'm not sure what's going on, so any insight there would be wonderful, but either way, if anyone can turn this two-loop solution into one, it would be really appreciated. I definitely feel like I'm missing something really simple/obvious.
Cheers.
EDIT: Have taken into account the vectorised solutions from below to do the following instead. The names of raw imported data start with stuff like F0001, F0002, etc. hence the pattern to make the initial list.
lst <- mget(ls(pattern='^F\\d+'))
lst <- lapply(lst, "[", TRUE, c("V2","V7") )
lst <- lapply(seq_along(lst),
function(i,x) {assign(paste0(temp[i]),x[[i]], envir=.GlobalEnv)},
x=lst)
I know loops get a bad rap in R, was a natural solution to me as a CPP programmer but meh, this was far quicker. Initially, the only downside from the other example was that the assign command pasted a letter to each of the created tables in sequence 1,2,3,....,n when the list of raw imported data files weren't entirely in numerical order (i.e. 1,2,3,5,6,10,...etc.) so this didn't preserve that order. So I had to use a list of the files (our old friend temp) to name them correctly. Minor thing and the code isn't much shorter than two loops but it's most certainly faster.
So, in short, the above three lines add all the imported raw data to a list, keep only the columns I need then split the list up into separate dataframes whilst preserving the correct names. Cheers for the help!
If you have a data frame, you index rows and columns with
data.frame[row, column]
So, data.frame[2,7] will give you the value in the 2nd row of the 7th column. I guess you were looking for
temp <- temp[, c(2,7)]
or, if temp is a list of data frames
temp <- lapply(temp, function(x) x[, c(2,7)])
So, if you want to use a vector of numbers as column- or row-indices, create this vector with c(...). If I understand your example right, you don't need any loop-command, if you use lapply.
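As a rough sketch of the lapply route on a named list: the two data frames below are invented stand-ins for the imported files, and make_df is a hypothetical helper used only to build them. list2env() is one way to turn the list elements back into separately named objects, if you really need that.

```r
# Hypothetical helper that builds a toy data frame shaped like the docking output
make_df <- function(ids, scores) {
  data.frame(V1 = seq_along(ids), V2 = ids, V3 = 1, V4 = "GROUP",
             V5 = "****", V6 = 1, V7 = scores, V8 = 0)
}
lst <- list(F0001 = make_df(c("AAAT_1", "AAAT_2"), c(-13.70, -9.21)),
            F0002 = make_df(c("AAAT_3", "AAAT_4"), c(-7.60, -6.28)))

# Keep columns 2 and 7 of every data frame in one pass; list names are preserved
lst <- lapply(lst, function(x) x[, c(2, 7)])

# Optional: copy each element into an environment under its own name
env <- new.env()
list2env(lst, envir = env)
ls(env)
```

Keeping the data frames in the list is usually easier to work with than separate objects, but that is a matter of taste.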
A for loop? Maybe I'm missing something, but why not just use the solution proposed by @Daniel, or a dplyr approach like this?
data
V1 V2 V3 V4 V5 V6 V7 V8
1 1 AAAT_1 1 GROUP **** 1 -13.70 0
2 2 AAAT_2 51 GROUP **** 1 -9.21 0
3 3 AAAT_3 101 GROUP **** 1 -7.60 0
4 4 AAAT_4 151 GROUP **** 1 -6.28 0
and here the code:
library(dplyr)
data <- select(data, V2, V7)
data
V2 V7
1 AAAT_1 -13.70
2 AAAT_2 -9.21
3 AAAT_3 -7.60
4 AAAT_4 -6.28

Calling on a column from a data frame within a data frame

I have a list of data frames (let's call it "data") that I have generated, which goes something like this:
$"something.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
$"something else.csv"
x y z
1 1 1 1
2 2 2 2
3 3 3 3
I would like to output from the table "something.csv" one number within column x.
So far I have used:
data$"something.csv"$x[2]
This code works and I am happy that it does, but my problem is that I want to automate this process, so I have put all the table titles into a list (filename) which goes:
[1] "something.csv", "something else.csv"
So I made a for loop which should allow me to do so, but when I put in:
data$filename[1]$x[2]
it gives me back NULL.
When I print filename[1], I get [1] "something.csv" and if I type
data$"something.csv"$x[2]
I get the result I want. So if filename[1] = "something.csv", why does it not give me the same results?
I just want my code to output the second row of column x and automate this by using filename[i] in a for loop.
Your approach looks for an element literally named 'filename' in the list (data$filename[1] is parsed as (data$filename)[1]), but no such element exists. Hence, NULL is returned.
You need to use square brackets, and subset the object data. Here's an example:
# Generate data
data<-vector("list", 2)
names(data)<-c("something.csv", "something else.csv")
data[[1]]<-data.frame(x=1:3, y=1:3, z=1:3)
data[[2]]<-data.frame(x=1:3, y=1:3, z=1:3)
filename<-names(data)
# Subset the data
# The first data frame, notice the square brackets for subsetting lists!
data[[filename[1]]]
# column x
data[[filename[1]]]$x
# Second observation of x
data[[filename[1]]]$x[2]
The above subsets using the names of the objects in the list. You can also use the number-based subsetting suggested by @Jeremy.
You can also use [ and [[ to get data$"something.csv"$x[2]. Try
data[[1]][2,1]
where [[1]] selects the first list element and [2,1] selects row 2, column 1 of that data frame.
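For the automation part of the question, a for loop over filename could look roughly like this. The x values in the second data frame are changed from the question so the two results differ, and second_x is just a name chosen for this sketch.

```r
data <- list(
  "something.csv"      = data.frame(x = 1:3, y = 1:3, z = 1:3),
  "something else.csv" = data.frame(x = 4:6, y = 1:3, z = 1:3)
)
filename <- names(data)

# Pre-allocate a result vector, then fill it one file at a time
second_x <- integer(length(filename))
for (i in seq_along(filename)) {
  # [[ ]] evaluates filename[i] first, then uses the resulting string as the name
  second_x[i] <- data[[filename[i]]]$x[2]
}
names(second_x) <- filename
second_x
```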

Create an unknown number of subsets with specific conditions using R

I am still an R beginner, so please be kind :). There are gaps that occur in my data at unknown times and for unknown intervals. I would like to pull these gaps out of my data by subsetting them. I don't want them removed from the data frame, I just want as many different subsets as there are data gaps so that I can make changes to them and eventually merge the changed subsets back into the original data frame. Also, eventually I will be running the greater part of this script on multiple .csv files so it cannot be hardcoded. A sample of my data is below with just the relevant column:
fixType (column 9)
fix
fix
fix
fix
fix
fix
lastvalidfix
0
0
0
0
0
firstfix
fix
fix
fix
fix
lastvalidfix
0
0
0
0
0
0
0
0
0
0
firstfix
The code I have now is not functional and not completely correct R, but I'm hoping that it expresses what I need to do. Essentially, every time lastvalidfix and firstfix are found in the rows of column 9, I would like to create a subset which would include those two rows and however many rows are between them. Using my sample data above, I would be creating 2 subsets, the first with 7 rows and the second with 12 rows. The number of data gaps in each file varies, so the number of subsets and their lengths will likely be different each time. I realize that each subset will need a unique name, which is why I've done the subset + 1.
subset <- 0 # This is my attempt at creating unique names for the subsets
for (i in 2:nrow(dataMatrix)) { # Creating new subsets of data for each time the signal is lost
if ((dataMatrix[i, 9] == "lastvalidfix") &
(dataMatrix[i, 9] == "firstfix")){
subCreat <- subset(dataMatrix, dataMatrix["lastvalidfix":"firstfix", 9], subset + 1)
}
}
Any help would be most appreciated.
Try this:
start.idx <- which(df$fixType == "lastvalidfix")
end.idx <- which(df$fixType == "firstfix")
mapply(function(i, j) df[i:j, , drop = FALSE],
start.idx, end.idx, SIMPLIFY = FALSE)
It will return a list of sub-data.frames or sub-matrices.
(Note: my df$fixType is what you refer to as dataMatrix[, 9]. If it has a column name, I would highly recommend you use that.)
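Here is the same idea run on a reconstruction of the sample column from the question, as a quick check. It assumes every "lastvalidfix" is followed by a matching "firstfix", so start.idx and end.idx have the same length; gaps is a name made up for this sketch.

```r
# Rebuild the sample fixType column from the question
df <- data.frame(
  fixType = c(rep("fix", 6), "lastvalidfix", rep("0", 5), "firstfix",
              rep("fix", 4), "lastvalidfix", rep("0", 10), "firstfix"),
  stringsAsFactors = FALSE
)

start.idx <- which(df$fixType == "lastvalidfix")
end.idx   <- which(df$fixType == "firstfix")

# One sub-data.frame per gap, from each lastvalidfix to the next firstfix
gaps <- mapply(function(i, j) df[i:j, , drop = FALSE],
               start.idx, end.idx, SIMPLIFY = FALSE)

vapply(gaps, nrow, integer(1))
```

Because the row names of each piece are the original row numbers, you can later write modified values back with something like df[rownames(gaps[[1]]), ] <- gaps[[1]].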

Trying to use user-defined function to populate new column in dataframe. What is going wrong?

Super short version: I'm trying to use a user-defined function to populate a new column in a dataframe with the command:
TestDF$ELN<-EmployeeLocationNumber(TestDF$Location)
However, when I run the command, it seems to just apply EmployeeLocationNumber to the first row's value of Location rather than using each row's value to determine the new column's value for that row individually.
Please note: I'm trying to understand R, not just perform this particular task. I was actually able to get the output I was looking for using the Apply() function, but that's irrelevant. My understanding is that the above line should work on a row-by-row basis, but it isn't.
Here are the specifics for testing:
TestDF<-data.frame(Employee=c(1,1,1,1,2,2,3,3,3),
Month=c(1,5,6,11,4,10,1,5,10),
Location=c(1,5,6,7,10,3,4,2,8))
This testDF keeps track of where each of 3 employees was over the course of the year among several locations.
(You can think of "Location" as unique to each Employee...it is essentially a unique ID for that row.)
The function EmployeeLocationNumber takes a location and outputs a number indicating the order in which that employee visited that location. For example EmployeeLocationNumber(8) = 2 because it was the second location visited by the employee who visited it.
EmployeeLocationNumber <- function(Site){
CurrentEmployee <- subset(TestDF,Location==Site,select=Employee, drop = TRUE)[[1]]
LocationDate<- subset(TestDF,Location==Site,select=Month, drop = TRUE)[[1]]
LocationNumber <- length(subset(TestDF,Employee==CurrentEmployee & Month<=LocationDate,select=Month)[[1]])
return(LocationNumber)
}
I realize I probably could have packed all of that into a single subset command, but I didn't know how referencing worked when you used subset commands inside other subset commands.
So, keeping in mind that I'm really trying to understand how to work in R, I have a few questions:
Why won't TestDF$ELN<-EmployeeLocationNumber(TestDF$Location) work row-by-row like other assignment statements do?
Is there an easier way to reference a particular value in a dataframe based on the value of another one? Perhaps one that does not return a dataframe/list that then must be flattened and extracted from?
I'm sure the function I'm using is laughably un-R-like...what should I have done to essentially emulate an INNER Join type query?
Using logical indexing, the condensed one-liner replacement for your function is:
EmployeeLocationNumber <- function(Site){
with(TestDF[do.call(order, TestDF), ], which(Location[Employee==Employee[which(Location==Site)]] == Site))
}
Of course this isn't the most readable way, but it demonstrates the principles of logical indexing and which() in R. Then, like others have said, just wrap it up with a vectorized *ply function to apply this across your dataset.
A) TestDF$Location is a vector. Your function is not set up to return a vector, so giving it a vector will probably fail.
B) In what sense is Location:8 the "second location visited"?
C) If you want within-group ordering, then you need to pass your dataframe, split up by employee, to a function that calculates a result.
D) Conditional access of a data.frame typically involves logical indexing and or the use of which()
If you just want the sequence of visits by employee try this:
(Changed first argument to Month since that is what determines the sequence of locations)
with(TestDF, ave(Month, Employee, FUN=seq))
[1] 1 2 3 4 1 2 1 2 3
TestDF$LocOrder <- with(TestDF, ave(Month, Employee, FUN=seq))
If you wanted the second location for EE:3 it would be:
subset(TestDF, LocOrder==2 & Employee==3, select= Location)
# Location
# 8 2
The vectorized nature of R (aka row-by-row) works not by repeatedly calling the function with each next value of the arguments, but by passing the entire vector at once and operating on all of it at one time. But in EmployeeLocationNumber, you only return a single value, so that value gets repeated for the entire data set.
Also, your example for EmployeeLocationNumber does not match your description.
> EmployeeLocationNumber(8)
[1] 3
Now, one way to vectorize a function in the manner you are thinking (repeated calls for each value) is to pass it through Vectorize()
TestDF$ELN<-Vectorize(EmployeeLocationNumber)(TestDF$Location)
which gives
> TestDF
Employee Month Location ELN
1 1 1 1 1
2 1 5 5 2
3 1 6 6 3
4 1 11 7 4
5 2 4 10 1
6 2 10 3 2
7 3 1 4 1
8 3 5 2 2
9 3 10 8 3
As to your other questions, I would just write it as
TestDF$ELN<-ave(TestDF$Month, TestDF$Employee, FUN=rank)
The logic is take the months, looking at groups of the months by employee separately, and give me the rank order of the months (where they fall in order).
Your EmployeeLocationNumber function takes a vector in and returns a single value.
The assignment to create a new data.frame column therefore just gets a single value:
EmployeeLocationNumber(TestDF$Location) # returns 1
TestDF$ELN<-1 # Creates a new column with the single value 1 everywhere
Assignment doesn't do any magic like that. It takes a value and puts it somewhere. In this case the value 1. If the value was a vector of the same length as the number of rows, it would work as you wanted.
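To make that concrete, here is a sketch where a one-value-at-a-time function is called once per row and the results are collected into a vector of the right length. location_number is a hypothetical simplified rewrite of the lookup, not the function from the question.

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Hypothetical scalar lookup: expects exactly one Site per call
location_number <- function(Site) {
  row <- TestDF[TestDF$Location == Site, ]
  # Count this employee's visits up to and including this month
  sum(TestDF$Employee == row$Employee & TestDF$Month <= row$Month)
}

# One call per Location; vapply() returns a vector with one value per row
TestDF$ELN <- vapply(TestDF$Location, location_number, integer(1))
TestDF$ELN
```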
I'll get back to you on that :)
Ditto.
Update: I finally worked out some code to do it, but by then @DWin had a much better solution :(
TestDF$ELN <- unlist(lapply(split(TestDF, TestDF$Employee), function(x) rank(x$Month)))
...I guess the ave function does pretty much what the code above does. But for the record:
First I split the data.frame into sub-frames, one per employee. Then I rank the months (just in case your months are not in order). You could use order too, but rank can handle ties better. Finally I combine all the results into a vector and put it into the new column ELN.
Update again: Regarding question 2, "What is the best way to reference a value in a dataframe?":
This depends a bit on the specific problem, but if you have a value, say Employee=3, and want to find all rows in the data.frame that match it, then simply:
TestDF$Employee == 3 # Returns logical vector with TRUE for all rows with Employee == 3
which(TestDF$Employee == 3) # Returns a vector of indices instead
TestDF[which(TestDF$Employee == 3), ] # Subsets the data.frame on Employee == 3
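If what you want is a single value rather than a sub-data.frame, you can index the column vector directly, which avoids the flattening step. A small sketch using the TestDF from the question:

```r
TestDF <- data.frame(Employee = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Month    = c(1, 5, 6, 11, 4, 10, 1, 5, 10),
                     Location = c(1, 5, 6, 7, 10, 3, 4, 2, 8))

# Month in which Location 8 was visited: a plain vector, nothing to unwrap
month_at_8 <- TestDF$Month[TestDF$Location == 8]

# Same lookup with [row, column] indexing and a column name
month_at_8_alt <- TestDF[TestDF$Location == 8, "Month"]

# All locations visited by Employee 3
locs_emp3 <- TestDF$Location[TestDF$Employee == 3]

month_at_8
locs_emp3
```

One difference worth knowing: which() drops positions where the comparison is NA, while a logical index containing NA produces NA entries in the result.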
