Loop, create new variable as function of existing variable with conditional - r

I have some data that contains 400+ columns and ~80 observations. I would like to use a for loop to go through each column and, if it contains the desired prefix exp_, I would like to create a new column which is that value divided by a reference column, stored as the same name but with a suffix _pp. I'd also like to do an else if with the other prefix rev_ but I think as long as I can get the first problem figured out I can solve the rest myself. Some example data is below:
exp_alpha exp_bravo rev_charlie rev_delta pupils
10 28 38 95 2
24 56 39 24 5
94 50 95 45 3
15 93 72 83 9
72 66 10 12 3
The first time I tried it, the loop ran through properly but only stored the final column in which the if statement was true, rather than storing each column in which the if statement was true. I made some tweaks and lost that code but now have this which runs without error but doesn't modify the data frame at all.
for (i in colnames(test)) {
if(grepl("exp_", colnames(test)[i])) {
test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
}
}
My understanding of what this is doing:
loop through the vector of column names
if the substring "exp_" is in the ith element of the colnames vector == TRUE
create a new column in the data set which is the ith element of the colnames vector divided by the reference category (pupils), and with "_pp" appended at the end
else do nothing
I imagine that, since my code is executing without error but not doing anything, my problem is in the if() statement, but I can't figure out what I'm doing wrong. I also tried adding "==TRUE" in the if() statement, but that achieved the same result.

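Almost correct. In your loop, i is a column name rather than a position, so colnames(test)[i] returns NA and the if() condition is never TRUE. Loop over the positions instead, and look the name up when you build the new column:
for (i in 1:length(colnames(test))) {
if(grepl("exp_", colnames(test)[i])) {
test[paste(colnames(test)[i],"pp", sep="_")] <- test[i] / test$pupils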
}
}

As an alternative to @timfaber's answer, you can keep your first line the same but not treat i as an index:
for (i in colnames(test)) {
if(grepl("exp_", i)) {
print(i)
test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
}
}
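To see why the original condition never fired, here is a minimal sketch using the example data from the question. The loop variable name nm and the anchored pattern "^exp_" are illustrative choices, not part of the original code.

```r
# Example data from the question
test <- data.frame(
  exp_alpha   = c(10, 24, 94, 15, 72),
  exp_bravo   = c(28, 56, 50, 93, 66),
  rev_charlie = c(38, 39, 95, 72, 10),
  rev_delta   = c(95, 24, 45, 83, 12),
  pupils      = c(2, 5, 3, 9, 3)
)

# colnames(test) is an unnamed character vector, so indexing it by a
# name returns NA, and grepl() on NA is FALSE: the if() never runs.
print(colnames(test)["exp_alpha"])
print(grepl("exp_", colnames(test)["exp_alpha"]))

# Loop over the names themselves; "^exp_" only matches at the start.
for (nm in colnames(test)) {
  if (grepl("^exp_", nm)) {
    test[[paste0(nm, "_pp")]] <- test[[nm]] / test$pupils
  }
}

print(names(test))
```

The sequence in for() is evaluated once, before the first iteration, so the newly added _pp columns are not visited by the loop.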

Vectorized solution:
Don't use a loop for that! You can vectorize your code, which is shorter and usually faster than looping over columns. Here's how to do it:
# Extract column names
cNames <- colnames(test)
# Find exp in column names
foo <- grep("exp", cNames)
# Divide by reference: ALL columns at the SAME time
bar <- test[, foo] / test$pupils
# Rename exp to pp : ALL columns at the SAME time
colnames(bar) <- gsub("exp", "pp", cNames[foo])
# Add to original dataset instead of iteratively appending
test <- cbind(test, bar)
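The question asked for the new columns to keep their original name plus a _pp suffix, and mentioned a second prefix rev_. A hedged sketch of the same vectorized idea with that naming is below; it assumes the rev_ columns are divided by the same pupils column, which the question does not state explicitly.

```r
# Example data from the question
test <- data.frame(
  exp_alpha   = c(10, 24, 94, 15, 72),
  exp_bravo   = c(28, 56, 50, 93, 66),
  rev_charlie = c(38, 39, 95, 72, 10),
  rev_delta   = c(95, 24, 45, 83, 12),
  pupils      = c(2, 5, 3, 9, 3)
)

# Names of all columns that start with either prefix
cols <- grep("^(exp|rev)_", names(test), value = TRUE)

# test[cols] stays a data frame even if only one column matches;
# each column is divided element-wise by pupils
pp <- test[cols] / test$pupils

# Append the suffix instead of replacing the prefix
names(pp) <- paste0(cols, "_pp")

test <- cbind(test, pp)
print(names(test))
```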

Related

How to determine number of rows meeting a certain condition?

I created a data frame from another data set called CompleteData:
conditions=data.frame(CompleteData$CAD,CompleteData$CVD,CompleteData$AAA,CompleteData$PAD)
This data frame has 2000 rows of data, with potential values of either 0 or 1. (The values represent the presence or absence of a condition such as CAD or CVD). I am trying to determine how many rows have two or more conditions in this data frame.
My general plan was to use an if statement combined with a for loop to determine which rows have multiple conditions, and then add the number of rows together. Here is the function I created:
for (conditions in conditions)
{
multiple.conditions=function(conditions)
{
if(sum(conditions)>1){return("multiple conditions")}else{
return("0 or 1 condition")
}
}
}
I'm just stuck in trying to figure out how to apply the for loop so that it performs the if statement row by row. I tested the if statement portion, which comes out correct, but how to structure the for loop is confusing me. As this is an introductory-level class, we are limited in the types of functions we are able to use. In this case, we have learned the syntax of functions, creating nested functions, if statements, for loops, and while loops. Any ideas?
What about just using rowSums()?
sum(rowSums(conditions)>1)
[1] 106
If you need to use a loop, you can do something like this:
mult_cond_rows = 0
for(i in 1:nrow(conditions)) {
row = conditions[i,]
if(sum(row)>1) mult_cond_rows = mult_cond_rows + 1
}
print(mult_cond_rows)
[1] 106
Input:
set.seed(123)
conditions = as.data.frame(setNames(
lapply(1:4, \(i) sample(0:1,2000, replace=TRUE, prob=c(.9,.1))),
nm = c("CAD", "CVD", "AAA", "PAD")
))
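For a case small enough to check by hand, here is a sketch of the rowSums() approach on five made-up rows (the values are invented for illustration). It also shows one way to get the per-row labels the question's function was trying to return.

```r
# Five hypothetical patients, 1 = condition present
conditions <- data.frame(
  CAD = c(1, 0, 1, 0, 1),
  CVD = c(1, 0, 0, 0, 1),
  AAA = c(0, 0, 0, 1, 1),
  PAD = c(0, 1, 0, 0, 0)
)

# Number of conditions per row
n_conditions <- rowSums(conditions)

# Label each row, then count the rows with two or more conditions
label <- ifelse(n_conditions > 1, "multiple conditions", "0 or 1 condition")
multiple <- sum(n_conditions > 1)

print(n_conditions)
print(table(label))
print(multiple)
```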

When I try to iterate through a row of a dataframe in R and put the values in a vector, the values end up changing. What is causing that?

I want to iterate through a dataframe in R and put each value in a separate numeric vector. As you can see, in the original dataframe (called singaporeINF, see the first photo), the values are all between 3000 and 4000.
However, when I iterate and extract the values using the code below:
singaporeVector <- numeric()
for (i in singaporeINF){
singaporeVector <- append(singaporeVector, i)
}
print(singaporeVector)
I end up with this output:
[1] 10 8 8 8 8 8 7 8
What is happening? Why do the values change from 3000 to 10 or 8?
My response is better set up as a comment, but I lack the reputation to do that.
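Check what class() those values have in the data frame. It looks like there might be a space in them, which would mean the columns were read in as character or, more likely, as factors. Appending a factor to a numeric vector stores its integer level codes rather than the numbers you see printed. Try converting each value before appending it:
singaporeVector <- numeric()
for (i in singaporeINF){
singaporeVector <- append(singaporeVector, as.numeric(as.character(i)))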
}
print(singaporeVector)
That said, if I'm right about the class issue, it would be easier to do the following without the loop:
singaporeVector <- as.numeric(sapply(singaporeINF[1,], as.character))
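The factor behaviour is easy to reproduce in isolation. The sketch below uses made-up values in the 3000 to 4000 range and a hypothetical one-row data frame standing in for singaporeINF, since the real data is not shown.

```r
# A factor stores integer codes that point at its sorted levels
f <- factor(c("3500", "3120", "3980"))
codes  <- as.numeric(f)                 # the level codes, not the values
values <- as.numeric(as.character(f))   # the values as printed

print(codes)
print(values)

# Hypothetical stand-in for singaporeINF: one row, factor columns
singaporeINF <- data.frame(y2000 = factor("3500"), y2001 = factor("3120"))

# Convert column by column, going through character first
singaporeVector <- unname(vapply(
  singaporeINF,
  function(col) as.numeric(as.character(col)),
  numeric(1)
))
print(singaporeVector)
```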

write result of rank loop in r

I've been hitting walls trying to write the results of a loop to a csv. I'm trying to rank data within each of 20 columns. The loop I'm using is:
for (i in 1:ncol(testing_file)) {
print(rank(testing_file[[i]]))
}
This works and prints expected results to screen. I've tried a lot of methods suggested in various discussions to write this result to file or data frame, most with no luck.
I'll just include my most promising lead, which returns only one column of correct data, with a column heading of "testing":
for (i in 1:ncol(testing_file)) {
testing<- (rank(testing_file[[i]]))
testingdf <- as.data.frame(testing)
}
Any help is greatly appreciated!
I found a solution that works:
# Create an empty data frame that the ranked results will go into
testage <- data.frame(matrix(, nrow=73, ncol=20))
# Loop that ranks the data within each column
for (i in 1:ncol(testing_file)) {
testage[i] <- rank(testing_file[[i]])
print(testage[i])
}
# Take the column names from the original file and apply them to the ranked file
colnames(testage) <- colnames(testing_file)
I'm bad with nested loops, so I'd try lapply(), which applies rank() to each column and keeps the column names:
testing_file <- data.frame(x = 1:5, y = 15:11)
testing <- as.data.frame(lapply(testing_file, rank))
> testing_file
x y
1 1 15
2 2 14
3 3 13
4 4 12
5 5 11
This gets you out of messy nested loops. Did you want to check the results of rank() prior to writing to csv?
Or just wrap it in write.csv(); the colnames will be the original df colnames:
> write.csv(testing <- as.data.frame(lapply(testing_file, rank)), "testing.csv", quote = FALSE)
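As a quick end-to-end check, the sketch below ranks each column of a small made-up data frame, writes the result to a temporary file, and reads it back. The data values and the use of tempfile() are assumptions for illustration only.

```r
# Small made-up input
testing_file <- data.frame(x = c(10, 30, 20), y = c(5, 1, 3))

# lapply() over a data frame visits each column and keeps its name
ranked <- as.data.frame(lapply(testing_file, rank))

# Write to a temporary csv and read it back to confirm the contents
out <- tempfile(fileext = ".csv")
write.csv(ranked, out, row.names = FALSE)
round_trip <- read.csv(out)

print(ranked)
print(round_trip)
```

Setting row.names = FALSE keeps the row numbers out of the file, so the csv has exactly one column per ranked variable.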

Convert A List Object into a Useable Matrix Name (R)

I want to be able to use a loop to perform the same function on a group of data sets without having to recall the names of all of the data sets individually. For example, say I have the following matrices:
a<-matrix(1:5,nrow=5,ncol=2)
b<-matrix(6:10,nrow=5,ncol=2)
c<-matrix(11:15,nrow=5,ncol=2)
I define a vector of set names:
SetNames<- c("a","b","c")
Then I want to sum the second column of all of the matrices without having to call each matrix name. Basically, I would like to be able to call SetNames[1] and have the program return 'a' as USABLE text which can be used to call apply(a[2],2,sum).
If apply(SetNames[1][2],2,sum) worked, that would be the basic syntax I was looking for, however I would replace the 1 with a variable I can increase in a loop.
sapply can do that.
sapply(SetNames, function(z) {
dfz <- get(z)
sum(dfz[,2])
})
# a b c
# 15 40 65
Notice that get() is used here to dynamically access a variable.
A less compact way of writing this would be
sumRowTwo <- function(z) {
dfz <- get(z)
sum(dfz[,2])
}
sapply(SetNames, sumRowTwo)
and now you can play around with sumRowTwo and see what e.g.
sumRowTwo("a")
returns
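An alternative that avoids get() altogether is to keep the matrices in a named list from the start. This is a sketch of a different technique from the one in the answer; the names sets and col_sums are hypothetical.

```r
a <- matrix(1:5, nrow = 5, ncol = 2)
b <- matrix(6:10, nrow = 5, ncol = 2)
c <- matrix(11:15, nrow = 5, ncol = 2)

# Store the matrices in a named list instead of looking them up by name
sets <- list(a = a, b = b, c = c)

# Sum the second column of every matrix in the list
col_sums <- sapply(sets, function(m) sum(m[, 2]))
print(col_sums)

# A single element can still be reached by name or by position
print(sum(sets[["a"]][, 2]))
print(sum(sets[[1]][, 2]))
```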

Creating pointer for specific point in a dataframe

A little background on the project before I get to the details. I'm working with a list of ~50 countries with data for somewhere between 40 and 60 years per country. I've been able to set up a loop for an individual country which tries out various values of a variable (named DELTA in the code) and logs results.
I first bring in the data and clean it to have no null values and create a vector containing all the 3 letter codes used to represent each country using the following code.
Clean <- na.omit(Data)
Clean <- Clean[order(country.isocode),]
Codes <- levels(Clean[,2])
I then use a loop and the subset function to create a different data frame for each country.
for (i in 1:length(Codes)) {
assign((Codes[i]),droplevels(subset(Clean,country.isocode==Codes[i])))
}
Now all 50 of my countries are in their own data frame named after their 3-letter ISO code. The following is the code I run to create the results I want for Angola (AGO).
AGO_Results <- matrix(numeric(0), 100,2)
AGOROW<-nrow(AGO)
for (j in 1:100) {
AGO[1,12]<-AGO[1,9]/DELTA
for (i in 2:AGOROW) {
AGO[i,12] <- AGO[i-1,12]*(1-DELTA)+AGO[i,9]
}
AGO[,13] <- AGO[,12]/AGO[,8]
AGO_Results[j,1] <- DELTA
AGO_Results[j,2] <- sum(AGO[,13] > 1 & AGO[,13] < 3)
DELTA=DELTA+.002
}
At the end of this AGO_Results contains the values I want, but I'd rather not do this manually for 50 countries, so I'm trying to create a loop around this for all 50 countries. I've managed using eval() and assign() to get rather far, but I'm stuck on what I think is the last hurdle.
for (k in 1:length(Codes)) {
# Initialize Delta and Create Storage Matrix and Row Count
DELTA <- .01
assign(paste(Codes[k],"_Results", sep=""), matrix(numeric(0), 100,2))
assign(paste(Codes[k],"ROW",sep=""), nrow(eval(as.name(Codes[k]))))
This portion is complete and works. Now we're at my real problem, how to reference the individual point [1,12] to be written in each data frame. What can I do to create a pointer to let me replace an individual item in a data frame, when I have to paste the name of the data frame in each time?
EDIT: Sample Data Posted below
country country.isocode year POP rgdpl ki rgdpl2wok rgdp investment workers L.P
21 Angola AGO 1970 5605.63 2366.51 23.27 5904.14 13265745651 3087431388 2246856 0.4
22 Angola AGO 1971 5752.96 2445.13 23.25 6127.95 14066747655 3270057880 2295508 0.4
First, there is a problem with
Clean <- Clean[order(country.isocode),]
(It will use a global variable country.isocode, if there is one, not the column in the data frame.)
Instead of
for (i in 1:length(Codes)) {
assign((Codes[i]),droplevels(subset(Clean,country.isocode==Codes[i])))
}
you could do
xyz <- split(Clean, list(country.isocode)) # or, probably Clean$country.isocode
Now you have split the data frame by countries. You can lapply a function (possibly self-made) to the resulting list (xyz) and you get the results separately for each country. Try this and then say if you really need a "pointer".
edit after comments
xyz <- split(Clean, list(Clean$country.isocode))
xyz <- lapply(xyz, droplevels) # whatever that's for
Now you can define what you want to do with each country (I rewrote your code without trying to understand what it does but noted only an obvious problem):
doit <- function(x){
# where does the DELTA come from? do you initialize it to zero?
# anyway, you need to define it here or pass it as argument
Results <- matrix(numeric(0), 100,2) # I'd use 0 or NA instead of numeric(0)
NROWs<-nrow(x)
for (j in 1:100) {
x[1,12]<-x[1,9]/DELTA
for (i in 2:NROWs) {
x[i,12] <- x[i-1,12]*(1-DELTA)+x[i,9]
}
x[,13] <- x[,12]/x[,8]
Results[j,1] <- DELTA
Results[j,2] <- sum(x[,13] > 1 & x[,13] < 3)
DELTA=DELTA+.002
}
Results # returns results
}
And now you can apply the newly defined function to your list:
lapply(xyz, doit)
And that should be it. You will probably need a few modifications and some trial and error, but in my view that's a more sensible approach than creating lots of variables with assign.
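A self-contained sketch of the split() plus lapply() pattern is below. It passes the starting DELTA in as an argument (0.01, as in the question's outer loop) and refers to columns by name instead of position. The function name capital_ratio_scan, its argument names, and the toy numbers are all assumptions, and the column roles (investment for column 9, rgdp for column 8) are inferred from the sample header rather than confirmed.

```r
# Scan n values of delta for one country's data frame x
capital_ratio_scan <- function(x, delta_start = 0.01, step = 0.002, n = 100) {
  results <- matrix(NA_real_, nrow = n, ncol = 2,
                    dimnames = list(NULL, c("delta", "n_in_range")))
  delta <- delta_start
  for (j in seq_len(n)) {
    # Build the accumulated series row by row
    k <- numeric(nrow(x))
    k[1] <- x$investment[1] / delta
    for (i in seq_len(nrow(x))[-1]) {
      k[i] <- k[i - 1] * (1 - delta) + x$investment[i]
    }
    ratio <- k / x$rgdp
    # Record delta and the number of rows with a ratio between 1 and 3
    results[j, ] <- c(delta, sum(ratio > 1 & ratio < 3))
    delta <- delta + step
  }
  results
}

# Made-up data for two countries
Clean <- data.frame(
  country.isocode = factor(c("AGO", "AGO", "BEN", "BEN")),
  rgdp            = c(100, 110, 50, 55),
  investment      = c(3, 3, 2, 2)
)

# One data frame per country, then one results matrix per country
by_country  <- split(Clean, Clean$country.isocode)
all_results <- lapply(by_country, capital_ratio_scan)

print(names(all_results))
print(head(all_results$AGO, 3))
```

Because delta is local to the function, every country starts its scan from the same value and no global variable is modified.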

Resources