Creating pointer for specific point in a dataframe - r

A little background on the project before I get to the details. I'm working with a list of ~50 countries with data for somewhere between 40 and 60 years per country. I've been able to set up a loop for an individual country which tries out various values of a variable (named DELTA in the code) and logs results.
I first bring in the data and clean it to have no null values and create a vector containing all the 3 letter codes used to represent each country using the following code.
Clean <- na.omit(Data)
Clean <- Clean[order(country.isocode),]
Codes <- levels(Clean[,2])
I then use a loop and the subset function to create a different data frame for each country.
for (i in 1:length(Codes)) {
assign((Codes[i]),droplevels(subset(Clean,country.isocode==Codes[i])))
}
Now all 50 of my countries are in their own data frame, named after their 3 letter ISO code. The following is the code I run to create the results I want for Angola (AGO).
AGO_Results <- matrix(numeric(0), 100,2)
AGOROW<-nrow(AGO)
for (j in 1:100) {
AGO[1,12]<-AGO[1,9]/DELTA
for (i in 2:AGOROW) {
AGO[i,12] <- AGO[i-1,12]*(1-DELTA)+AGO[i,9]
}
AGO[,13] <- AGO[,12]/AGO[,8]
AGO_Results[j,1] <- DELTA
AGO_Results[j,2] <- sum(AGO[,13] > 1 & AGO[,13] < 3)
DELTA=DELTA+.002
}
At the end of this AGO_Results contains the values I want, but I'd rather not do this manually for 50 countries, so I'm trying to create a loop around this for all 50 countries. I've managed using eval() and assign() to get rather far, but I'm stuck on what I think is the last hurdle.
for (k in 1:length(Codes)) {
# Initialize Delta and Create Storage Matrix and Row Count
DELTA <- .01
assign(paste(Codes[k],"_Results", sep=""), matrix(numeric(0), 100,2))
assign(paste(Codes[k],"ROW",sep=""), nrow(eval(as.name(Codes[k]))))
This portion is complete and works. Now we're at my real problem, how to reference the individual point [1,12] to be written in each data frame. What can I do to create a pointer to let me replace an individual item in a data frame, when I have to paste the name of the data frame in each time?
EDIT: Sample Data Posted below
country country.isocode year POP rgdpl ki rgdpl2wok rgdp investment workers L.P
21 Angola AGO 1970 5605.63 2366.51 23.27 5904.14 13265745651 3087431388 2246856 0.4
22 Angola AGO 1971 5752.96 2445.13 23.25 6127.95 14066747655 3270057880 2295508 0.4

First, there is a problem with
Clean <- Clean[order(country.isocode),]
(order() here looks up country.isocode as a global variable, not as a column of the data frame, so it only works if such a global exists. Use Clean$country.isocode instead.)
Instead of
for (i in 1:length(Codes)) {
assign((Codes[i]),droplevels(subset(Clean,country.isocode==Codes[i])))
}
you could do
xyz <- split(Clean, list(country.isocode)) # or, probably Clean$country.isocode
Now you have split the data frame by countries. You can lapply a function (possibly self-made) to the resulting list (xyz) and you get the results separately for each country. Try this and then say if you really need a "pointer".
edit after comments
xyz <- split(Clean, list(Clean$country.isocode))
xyz <- lapply(xyz, droplevels) # whatever that's for
Now you can define what you want to do with each country (I rewrote your code without trying to understand what it does but noted only an obvious problem):
doit <- function(x, DELTA = .01){
# DELTA has to be defined here or passed as an argument;
# the default .01 is the starting value from your outer loop
Results <- matrix(NA_real_, 100,2) # NA instead of numeric(0)
NROWs<-nrow(x)
for (j in 1:100) {
x[1,12]<-x[1,9]/DELTA
for (i in 2:NROWs) {
x[i,12] <- x[i-1,12]*(1-DELTA)+x[i,9]
}
x[,13] <- x[,12]/x[,8]
Results[j,1] <- DELTA
Results[j,2] <- sum(x[,13] > 1 & x[,13] < 3)
DELTA=DELTA+.002
}
Results # returns results
}
And now you can apply the newly defined function to your list:
lapply(xyz, doit)
And that should be it. You probably need a few modifications and trials-and-errors but that's in my view a more sensible approach than creating lots of variables with assign.
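For reference, here is a minimal, self-contained sketch of the split-and-lapply pattern described above. The toy values, the function name scan_delta, and the use of column names (investment, rgdp) instead of the positions 9 and 8 are assumptions for illustration, not the asker's actual data or code.

```r
# Toy data standing in for Clean (values are made up for illustration)
Clean <- data.frame(
  country.isocode = rep(c("AGO", "BEN"), each = 3),
  rgdp            = c(100, 110, 120, 200, 210, 220),
  investment      = c(10, 12, 14, 30, 28, 26)
)

# For one country: try each DELTA and count the rows whose ratio lies in (1, 3)
scan_delta <- function(x, deltas = seq(0.01, by = 0.002, length.out = 100)) {
  t(sapply(deltas, function(DELTA) {
    k <- numeric(nrow(x))
    k[1] <- x$investment[1] / DELTA
    for (i in seq_len(nrow(x))[-1]) {
      k[i] <- k[i - 1] * (1 - DELTA) + x$investment[i]
    }
    ratio <- k / x$rgdp
    c(DELTA = DELTA, count = sum(ratio > 1 & ratio < 3))
  }))
}

# One list element per country, then one results matrix per country
by_country <- split(Clean, Clean$country.isocode)
results <- lapply(by_country, scan_delta)
str(results)
```

Each element of results is a 100 x 2 matrix named after its country code, so results$AGO plays the role of AGO_Results without any assign() or eval().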

Related

write result of rank loop in r

I've been hitting walls trying to write the results of a loop to a csv. I'm trying to rank data within each of 20 columns. The loop I'm using is:
for (i in 1:ncol(testing_file)) {
print(rank(testing_file[[i]]))
}
This works and prints expected results to screen. I've tried a lot of methods suggested in various discussions to write this result to file or data frame, most with no luck.
I'll just include my most promising lead, which returns only one column of correct data, with a column heading of "testing":
for (i in 1:ncol(testing_file)) {
testing<- (rank(testing_file[[i]]))
testingdf <- as.data.frame(testing)
}
Any help is greatly appreciated!
I found a solution that works:
testage <- data.frame(matrix(, nrow=73, ncol=20)) # creates an empty data frame that the ranked results will go into
for (i in 1:ncol(testing_file)) {
testage[i] <- rank(testing_file[[i]])
print(testage[i])
} # this is the loop that ranks data within each column
colnames(testage) <- colnames(testing_file) # take the column names from the original file and apply them to the ranked file
I'm bad with nested loops so I'd try:
testing_file <- data.frame(x = 1:5, y = 15:11)
testing <- as.data.frame(lapply(testing_file, rank))
> testing_file
x y
1 1 15
2 2 14
3 3 13
4 4 12
5 5 11
This gets you out of messy nested loops. Did you want to check the results of rank() prior to writing to csv?
Or just wrap it in a write.csv; because lapply runs over the data frame itself, the colnames will be the original df colnames:
write.csv(testing <- as.data.frame(lapply(testing_file, rank)), "testing.csv", quote = FALSE)
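As a small end-to-end check of the idea above, the sketch below ranks every column, writes the result to a CSV, and reads it back. The toy values and the temporary file path are assumptions for illustration only.

```r
# Toy stand-in for testing_file (values are made up)
testing_file <- data.frame(x = c(10, 30, 20), y = c(5, 1, 3))

# lapply over the data frame keeps the original column names
ranked <- as.data.frame(lapply(testing_file, rank))

# Write to a temporary file so nothing in the working directory is touched
out <- tempfile(fileext = ".csv")
write.csv(ranked, out, row.names = FALSE)

# Read it back to confirm what was written
read.csv(out)
```

Swap tempfile() for a real path such as "testing.csv" once the output looks right.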

Loop, create new variable as function of existing variable with conditional

I have some data that contains 400+ columns and ~80 observations. I would like to use a for loop to go through each column and, if it contains the desired prefix exp_, I would like to create a new column which is that value divided by a reference column, stored as the same name but with a suffix _pp. I'd also like to do an else if with the other prefix rev_ but I think as long as I can get the first problem figured out I can solve the rest myself. Some example data is below:
exp_alpha exp_bravo rev_charlie rev_delta pupils
10 28 38 95 2
24 56 39 24 5
94 50 95 45 3
15 93 72 83 9
72 66 10 12 3
The first time I tried it, the loop ran through properly but only stored the final column in which the if statement was true, rather than storing each column in which the if statement was true. I made some tweaks and lost that code but now have this which runs without error but doesn't modify the data frame at all.
for (i in colnames(test)) {
if(grepl("exp_", colnames(test)[i])) {
test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
}
}
My understanding of what this is doing:
loop through the vector of column names
if the substring "exp_" is in the ith element of the colnames vector == TRUE
create a new column in the data set which is the ith element of the colnames vector divided by the reference category (pupils), and with "_pp" appended at the end
else do nothing
I imagine since my the code is executing without error but not doing anything that my problem is in the if() statement, but I can't figure out what I'm doing wrong. I also tried adding "==TRUE" in the if() statement but that achieved the same result.
Almost correct. In your loop i is a column name, not a position, so colnames(test)[i] is NA, grepl() returns FALSE, and nothing happens. Loop over the column positions instead:
for (i in 1:length(colnames(test))) {
if(grepl("exp_", colnames(test)[i])) {
test[paste(colnames(test)[i],"pp", sep="_")] <- test[i] / test$pupils
}
}
As an alternative to @timfaber's answer, you can keep your first line the same but not treat i as an index:
for (i in colnames(test)) {
if(grepl("exp_", i)) {
print(i)
test[paste(i,"pp", sep="_")] <- test[i] / test$pupils
}
}
Linear solution:
Don't use a loop for that! You can vectorize your code, which is usually faster than looping over columns. Here's how to do it:
# Extract column names
cNames <- colnames(test)
# Find the exp_ prefix in column names
foo <- grep("^exp_", cNames)
# Divide by reference: ALL columns at the SAME time
bar <- test[, foo, drop = FALSE] / test$pupils
# Append the _pp suffix: ALL columns at the SAME time
colnames(bar) <- paste(cNames[foo], "pp", sep = "_")
# Add to original dataset instead of iteratively appending
test <- cbind(test, bar)
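Here is a runnable sketch of the same vectorized idea on the example data from the question. The variable name exp_cols is an assumption for illustration. The rev_ columns could be handled the same way with a second pattern.

```r
# Example data from the question
test <- data.frame(
  exp_alpha   = c(10, 24, 94, 15, 72),
  exp_bravo   = c(28, 56, 50, 93, 66),
  rev_charlie = c(38, 39, 95, 72, 10),
  rev_delta   = c(95, 24, 45, 83, 12),
  pupils      = c(2, 5, 3, 9, 3)
)

# Names of all columns that start with the exp_ prefix
exp_cols <- grep("^exp_", colnames(test), value = TRUE)

# Divide every matching column by pupils and store it under <name>_pp
test[paste(exp_cols, "pp", sep = "_")] <- test[exp_cols] / test$pupils

head(test)
```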

how to fill up NA entries in the vector generated with assign command?

I would like to generate a series of vectors like india_a, india_b, india_c. These vectors will have length 3. For example, the 1st entry in india_a will be the sum of 'total' when yrs=1 and crs=a.
for (i in crimes){ assign(paste("india_",i,sep=""),rep(NA,12))
for (j in 1:12){
india_i[j] <-sum(juvenile_crime$total[
juvenile_crime$year==years[j]&juvenile_crime$crime==crimes[i]]) } }
This is the message I get when I run the above code:
Error in india_i[j] <- sum(juvenile_crime$total[juvenile_crime$year == :
object 'india_i' not found
This example might help:
sts <- c(rep("s1",9),rep("s2",9))
yrs <- c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
crs <- c("a","b","c","a","b","c","a","b","c")
total <- c(1:18)
cri <- data.frame(sts,yrs,crs,total)
attach(cri)
yr <- levels(cri$yrs)
cr <- levels(cri$crs)
for (i in crs){
assign(paste("india_",i,sep=""),rep(NA,3))
for (j in 1:3){
india_i[j] <-sum(total[yrs==yr[j]&crs==cr[i]]
}
}
There is no object with the name "india_i". We don't have any information about what is in the "crimes" vector but if it's the numbers 1:12 then the objects have names like "india_1". You should learn to make named lists, rather than using separate objects.
After your edit, we can demonstrate this using a slightly modified version of your code (and adding the missing closing parenthesis for the sum call).
India_L <- list() # create an (empty) master list
for (i in crs){
assign(paste("india_",i,sep=""),rep(NA,3))
for (j in 1:3){
India_L[[paste("india_",j,sep="")]] <-sum(total[yrs==yr[j]&crs==cr[i]])
}
}
India_L # print to see the structure
#---- result
$india_1
[1] 0
$india_2
[1] 0
$india_3
[1] 0
The reason you got all zeroes is that the yrs column of the cri object has no levels. It is "numeric", and only "factor" objects have levels in R. A comment on your strategy: I wasn't really sure what goal you had set for yourself (besides getting the assign function to succeed). The sum of those logical tests didn't seem particularly informative. Perhaps you meant to use the %in% operator. Using == with a vector will not generally be informative.
Using attach is generally very unwise. Notice the warning you got:
> attach(cri)
The following objects are masked _by_ .GlobalEnv:
crs, sts, total, yrs
The following objects are masked from cri (pos = 3):
crs, sts, total, yrs
So the objects named might be changed with an edit:
sts <- c(rep("s1.a",9),rep("s2.a",9))
What object do you think was altered? And if you then detach-ed the cri-dataframe, where do you think the edit would reside? One of the big problems with attaching objects is that the user gets confused about what is actually being changed.
It would be clearer to create the values inside the dataframe in one pass and then work on components of the dataframe:
cri <- data.frame(
sts = c(rep("s1",9),rep("s2",9)),
yrs = c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3),
crs = c("a","b","c","a","b","c","a","b","c"),
total = c(1:18) )
If you want to create 3 vectors, you may do something like this:
for (i in unique(crs)){
# Note: only one of each value
vect <- numeric(3)
# create a help vector
for (j in 1:3) {
vect[j] <-sum(total[yrs==yr[j] & crs==cr[i]])
}
assign(paste("india_",i,sep=""), vect)
# assign this vector to "india_i"
}
However, this program does not work. As yrs is numeric it will be included in the cri data frame as-is, and does not have any levels, and hence yrs==yr[j] is never true.
Another point: it is usually better to use lists instead of assignment of india_i. I would do
india <- vector("list", 3)
names(india) <- letters[1:3]
and the assignment later would be like
india[[i]] <- vect
And please!!! ensure your code runs (except the error you are struggling with.) Currently it does not even load as a parenthesis is missing from india_i[j] <-sum(total[yrs==yr[j]&crs==cr[i]].
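Putting the points from this thread together, here is one possible working version on the sample data. It is a sketch that assumes the goal is one vector of yearly totals per crime. It uses unique() values instead of levels(), indexes the data frame explicitly instead of using attach(), and stores the results in a named list instead of calling assign(). The names years, crimes and india are chosen for illustration.

```r
# Sample data from the question, built in one pass
cri <- data.frame(
  sts   = c(rep("s1", 9), rep("s2", 9)),
  yrs   = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2, 3, 3, 3),
  crs   = rep(c("a", "b", "c"), 6),
  total = 1:18
)

# unique() works for numeric and character columns; levels() only for factors
years  <- sort(unique(cri$yrs))
crimes <- as.character(unique(cri$crs))

# Named list with one slot per crime
india <- vector("list", length(crimes))
names(india) <- crimes

for (cr in crimes) {
  vect <- numeric(length(years))
  for (j in seq_along(years)) {
    vect[j] <- sum(cri$total[cri$yrs == years[j] & cri$crs == cr])
  }
  india[[cr]] <- vect
}

india$a  # totals for crime "a" in years 1, 2, 3: 11 17 23
```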

Avoiding nested loops in R

I have a set of sequences with 2 variables for each value of a 3rd variable (device). Now I want to break the sequence for each device into sets of 300. dsl is a data frame in which d is the device id and s is the number of sequences of length 300.
First, I label (in column Sid) all the sequences: rep(1,300) followed by rep(2,300) and so on until rep(s,300). Whatever remains unlabelled, i.e. still has its initial label (=0), needs to be ignored. The actual labelling happens with the seqid vector, though.
I had to do this because I want to stack each set of 300 data points and then transpose it. This forms one row of my predata data.frame. For each predata data frame I run k-means to generate 5 clusters, which I store in finaldata.
Essentially, for every device I will have 5 clusters that I can then pull by referencing the row number in finaldata (mapped to device id).
#subset processed data by device
for (ds in 1:387){
d <- dsl[ds,1]
s <- dsl[ds,3]
temp.data <- subset(data,data$Device==d)
temp.data$Sid <- 0
temp.data[1:(s*300),4] <- rep(1:300,s)
temp.data <- subset(temp.data,temp.data$Sid!="0")
seqid <- NA
for (j in 1:s){ seqid[(300*(j-1)+1):(300*j)] <- j }
temp.data$Sid <- seqid
predata <- as.data.frame(matrix(numeric(0),s,600))
for(k in 1:s){
temp.data2 <- subset(temp.data[,c(1,2)], temp.data$Sid==k)
predata[k,] <- t(stack(temp.data2)[,1])
}
ob <- kmeans(predata,5,iter.max=10,algorithm="Hartigan-Wong")
finaldata <- rbind(finaldata,(unique(fitted(ob,method="centers"))))
}
Being a noob to R, I ended up with 3 nested loops (the function did work for the outermost loop being one value). This has taken 5h and running. Need a faster way to go about this.
Any help will be appreciated.
Thanks
Ok, I am going to suggest a radical simplification of your code within the loop. However, it is hard to verify that I in fact did assume the right thing without having sample data. So please ensure that my predata in fact equals yours.
First the code:
for (ds in 1:387){
d <- dsl[ds,1]
s <- dsl[ds,3]
temp.data <- subset(data,data$Device==d)
temp.data <- temp.data[1:(s*300),]
predata <- cbind(matrix(temp.data[,1], byrow=T, ncol=300), matrix(temp.data[,2], byrow=T, ncol=300))
ob <- kmeans(predata,5,iter.max=10,algorithm="Hartigan-Wong")
finaldata <- rbind(finaldata,(unique(fitted(ob,method="centers"))))
}
What I understand you are doing: take the first 300*s elements from your subset(data, data$Device == d). This can easily be done using the command
temp.data <- temp.data[1:(s*300),]
Afterwards, you collect a matrix that has the first row c(temp.data[1:300, 1], temp.data[1:300, 2]), and so on for all further rows. I do this using the matrix command as above.
I assume that your outer loop could be transformed into a call to tapply or something similar, but for that we would need more context.
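To see what the matrix(..., byrow=TRUE) reshaping does, here is a toy sketch with a chunk size of 3 standing in for 300. The data values and the name chunk are assumptions for illustration.

```r
chunk <- 3                                     # stands in for 300
temp.data <- data.frame(v1 = 1:6, v2 = 11:16)  # two chunks, i.e. s = 2

# Each row holds one chunk of column 1 followed by the same chunk of column 2
predata <- cbind(matrix(temp.data[, 1], byrow = TRUE, ncol = chunk),
                 matrix(temp.data[, 2], byrow = TRUE, ncol = chunk))
predata

# Same layout as stacking and transposing the first chunk by hand
first_row <- as.vector(t(stack(temp.data[1:chunk, ])[, 1]))
```

The first row of predata matches first_row, which is what the original innermost loop built one row at a time.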

R: rewrite loop with apply

I have the following type of data set:
id;2011_01;2011_02;2011_03; ... ;2001_12
id01;NA;NA;123; ... ;NA
id02;188;NA;NA; ... ;NA
That is, each row is a unique customer and each column depicts a trait for this customer from the past 10 years (each month has its own column). I want to condense this 120-column data frame into a 10-column data frame, because I know that almost all rows have 1 or 0 observations from each year (although the month itself can vary).
I've already done this, one year at a time, using a loop with a nested if-clause:
for(i in 1:nrow(input_data)) {
temp_row <- input_data[i,c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
loc2011 <- which(!is.na(temp_row))
if(length(loc2011 ) > 0) {
temp_row_2011[i,] <- temp_row[loc2011[1]] #pick the first observation if there are several
} else {
temp_row_2011[i,] <- NA
}
}
Since my data set is quite big, and I need to run the above loop 10 times (once for each year), this is taking way too much time. I know one is much better off using the apply commands in R, so I would greatly appreciate help on this task. How could I write the whole thing (including the different years) better?
Are you after something like this?:
temp_row_2011 <- apply(input_data, 1, function(x){
temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
temp_row[!is.na(temp_row)][1]
})
If this gives you the right output, and if it runs faster than your loop, then it's not necessarily due only to the fact of using an apply(), but also because it assigns less stuff and avoids an if {} else {}. You might be able to make it go even faster by compiling the anonymous function:
reduceyear <- function(x){
temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11", "2011_12")]
temp_row[!is.na(temp_row)][1]
}
# compile, just in case it runs faster:
reduceyear_c <- compiler::cmpfun(reduceyear)
# this ought to do the same as the above.
temp_row_2011 <- apply(input_data, 1, reduceyear_c)
You didn't say whether input_data is a data.frame or a matrix, but a matrix would be faster (although that is only valid if all of input_data is the same class of data).
[EDIT: full example, motivated by DWin]
input_data <- matrix(ncol=24,nrow=10)
# years and months:
colnames(input_data) <- c(paste(2010,1:12,sep="_"),paste(2011,1:12,sep="_"))
# some ids
rownames(input_data) <- 1:10
# put in some values:
input_data[sample(1:length(input_data),200,replace=FALSE)] <- round(runif(200,100,200))
# make an all-NA case:
input_data[2,1:12] <- NA
# and here's the full deal:
sapply(2010:2011, function(x,input_data){
input_data_yr <- input_data[, grep(x, colnames(input_data) )]
apply(input_data_yr, 1, function(id){
id[!is.na(id)][1]
}
)
}, input_data)
All NA case works. grep() column selection idea lifted from DWin. As in the above example, you could actually define the anonymous interior function and compile it to potentially make the thing run faster.
I built a tiny test case (for which timriffe's suggestion fails). You might attract more interest by putting up code that creates a more complete test case such as 4 quarters for 2 years and including pathological cases such as all NA's in one row of one year. I would think that instead of requiring you to write out all the year columns by name, that you ought to cycle through them with a grep() strategy:
# funyear <- function to work on one year's data and return a single vector
# my efforts keep failing on the all(NA) row by year combos
sapply(2011:2001, function (pat) funyear(input_data[grep(pat, names(input_data))]))
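One way the funyear placeholder could be filled in so that the all-NA case returns NA is sketched below. The function body and the toy data frame are assumptions for illustration. The anchored pattern "^2010_" is used instead of a bare year so that a year cannot match inside a month or another column name.

```r
# Toy data: 3 customers, 2 months in each of 2 years; row 3 is all NA
input_data <- data.frame(
  `2010_01` = c(NA, 5, NA),
  `2010_02` = c(3, NA, NA),
  `2011_01` = c(NA, NA, NA),
  `2011_02` = c(7, 8, NA),
  check.names = FALSE
)

# First non-NA value per row; indexing an empty vector with [1] yields NA
funyear <- function(d) {
  unname(apply(as.matrix(d), 1, function(r) r[!is.na(r)][1]))
}

# One column per year, one row per customer
res <- sapply(c("2010", "2011"), function(yr) {
  cols <- grep(paste0("^", yr, "_"), names(input_data))
  funyear(input_data[cols])
})
res
```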

Resources