I am trying to build a data frame with two columns and 10 rows, where the first column contains the weight (w in the code) and the second column contains the error rate (cv.error). However, I get a data frame containing only NA. I don't know what I am doing wrong. Help would be appreciated.
I want a data frame in which the first column has w and the other has cv.error.
Here is my code:
l <- data.frame(matrix(NA, nrow = 10, ncol = 2))
k_fun <- function(combined_distance,n,j)
{
glm_fit <- glm(gcms$train$response ~ combined_distance ,family=binomial, data=gcms$train,control = list(maxit = 50))
cv.error = cv.glm(gcms$train, glm_fit,K=5)$delta[1]
l[j,1] = n
l[j,2] = cv.error
}
w = c(0.1,0.2,0.25,0.3,0.35,0.4,0.45,0.50,0.7,0.9)
for(j in 1:10)
{
combined_distance <- alkoloiddistance + (1 - alkoloiddistance^w[j]) * solventdistance
k_fun(combined_distance,w[j],j)
}
You need l[j,1] <<- n and l[j,2] <<- cv.error. You intend to update l inside the function, but only the function's local copy of l is modified. After the loop has run, the l in your R session is therefore unchanged. Since you set up l as a data frame of NA, you still get a data frame that is all NA.
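An alternative to <<- is to have the function return the error and to build the data frame from the returned values. Below is a minimal sketch of that idea. It assumes a hypothetical stand-in, toy_cv_error(), in place of the glm() and cv.glm() calls, because gcms and the distance objects are not available here:

```r
# Hypothetical stand-in for the glm() + cv.glm() step from the question;
# it only needs to return one number per weight.
toy_cv_error <- function(weight) {
  (weight - 0.4)^2
}

w <- c(0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.50, 0.7, 0.9)

# Return the value from the function and collect the results outside it,
# so nothing has to be assigned to an object in the calling environment.
cv.error <- vapply(w, toy_cv_error, numeric(1))
l <- data.frame(w = w, cv.error = cv.error)
print(l)
```

Returning values keeps the function free of side effects, which tends to be easier to debug than modifying l from inside it.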
The problem may also be related to how you instantiated the data frame. If you run str(l) after the first line of your code, you will see which type R assigned to each column (logical here, or factor in other setups). Assigning a numeric value to a factor column yields NA with a warning, whereas a logical column is simply coerced to numeric. Try running options(stringsAsFactors = FALSE) before the rest of your code. Alternatively, you can set the type of NA you want in the first line of your code (e.g., NA_real_ or NA_integer_).
Can someone help me with this? I got the cut_interval code to work for a single test column, but can't seem to get it to work in a for loop to have it run on all of the columns.
#Bin worker data into three groups (low/medium/high %methylation) for the cpg cg10757709
#This code works
cg10757709_interval <- cut_interval(cpgs$cg10757709, n=3, labels = c("low","med","high"))
View(cg10757709_interval)
#Write a loop so that data for each of the significant cpgs will be binned into low, medium, and high groups
#This code gives an error (more elements supplied than there are to replace)
cpgs_interval <- matrix(ncol = length(cpgs), nrow = 29)
for (i in seq_along(cpgs)) {
cpgs_interval[[i]] <- cut_interval(cpgs[[i]], n=3, labels = c("low","med","high"))
}
View(cpgs_interval)
The error says "Error in cpgs_interval[[i]] <- cut_interval(cpgs[[i]], n = 3, labels = c("low", : more elements supplied than there are to replace". Should I not be using a matrix for cpgs_interval? Or is something else the problem? I'm rather new to writing for loops. Thanks.
In your example, cpgs_interval is a matrix. If you want to put the variable into the ith column of the matrix, you could do:
for (i in seq_along(cpgs)) {
cpgs_interval[,i] <- cut_interval(cpgs[[i]], n=3, labels = c("low","med","high"))
}
That said, you might be better off making cpgs_interval a data frame; then you'll retain the factor rather than turning it into text.
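The data frame variant might look like the following sketch. It assumes a small made-up cpgs data frame (the column names cg_a and cg_b are invented) and uses base R's cut() with breaks = 3 as a stand-in for ggplot2's cut_interval(), so that it runs without extra packages:

```r
set.seed(1)

# Made-up data standing in for the real cpgs data frame (29 rows).
cpgs <- data.frame(cg_a = runif(29), cg_b = runif(29))

# Bin every column into three equal-width groups; lapply() returns one
# factor per column and as.data.frame() keeps them as factors.
cpgs_interval <- as.data.frame(
  lapply(cpgs, function(col) {
    cut(col, breaks = 3, labels = c("low", "med", "high"))
  })
)

str(cpgs_interval)
```

With the real data, replacing the cut() call with cut_interval(col, n = 3, labels = c("low", "med", "high")) should give the same structure.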
I am trying to create a loop that produces a new table on each iteration. I want each table to be called table_loopnumber, and each one needs to read the table created in the previous iteration.
I've tested this code for i = 1 and it works fine, but it doesn't work as a loop. Any help would be appreciated, as I am very new to R.
for(i in 1:2) {
proj4810_op_iteration_[i+1]<-setDT(proj4810_op_iteration_i[, list(Median_High = median(unlist(.SD), na.rm = TRUE)),
by = list(item1, section,RL_Description_Full,
seed_dept,
Total_DOD,
england_DoD,
scotland_DoD,
wales_DoD,
IOM_DoD,
NI_DoD,
unknown_DoD,
turnover,
baskets,
items,
unit_price,
Ambient_Low,
Bakery_Low ,
Cleaning_Low,
FTN_Low,
Fresh_Low,
FrozPrep_Low,
current_seedprod)])
}
Thanks in advance
You can "assign a value to a name in an environment" using assign. It takes as its first argument "a variable name, given as a character string" and as its second argument the object you want to assign to that name. See ?assign. The opposite operation (getting an object based on its name) is get. Hence, the following should work:
for(i in 1:2){
previous <- get(paste0("proj4810_op_iteration_", i-1)) # get previous data table
tmp <- setDT(previous[, list(Median_High = median(unlist(.SD), na.rm = TRUE)),
by = list(item1, section,RL_Description_Full,seed_dept,Total_DOD,england_DoD,scotland_DoD,
wales_DoD,IOM_DoD,NI_DoD,unknown_DoD,turnover,baskets,items, unit_price,
Ambient_Low, Bakery_Low ,Cleaning_Low, FTN_Low, Fresh_Low, FrozPrep_Low,
current_seedprod)])
vname <- paste0("proj4810_op_iteration_", i) # name of the current data table
assign(vname, tmp) # save the data table under that name
}
Of course, for the first loop iteration you need to create an object proj4810_op_iteration_0 before the loop begins, otherwise it won't find anything.
As for the elegance of this approach, I agree more with the list solution someone else already posted, but if you really want to do it this way, this should work.
And please remember for the next time that you ask something, that you provide a minimal reproducible example.
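The assign/get pattern can be tried on toy data first. The sketch below uses invented object names (step_0, step_1, ...) and a trivial doubling step in place of the data.table aggregation:

```r
# Toy starting object; in the question this would be the initial data table.
step_0 <- data.frame(value = 1)

for (i in 1:3) {
  # Look up the object created in the previous iteration by its name.
  previous <- get(paste0("step_", i - 1))
  current <- data.frame(value = previous$value * 2)
  # Create step_1, step_2, step_3 in the current environment.
  assign(paste0("step_", i), current)
}

print(step_3)
```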
Another approach is to use a list for the data frames.
Here is a simplified version of your problem, solved. It involves calculating the elements of the Fibonacci sequence. The desired data tables are
dfList[[1]]
# xnmo xn
# 1 1 2
dfList[[2]]
# xnmo xn
# 1 2 3
dfList[[3]]
# xnmo xn
# 1 3 5
So the first table contains the first and second terms of the sequence, the second table contains the second and third terms, and so on.
A loop to calculate such tables can be written as follows:
dfList = list(data.frame(xnmo=1, xn=1))
for(i in 1:10)
dfList[[i+1]] = data.frame(
xnmo = dfList[[i]]$xn,
xn = dfList[[i]]$xnmo + dfList[[i]]$xn
)
I am trying to write the following loop over an empirical data set where each ID replicate has a different number of observations for each sample period.
Any suggestions would be greatly appreciated!
a <- unique(bma$ID)
t <- unique(bma$Sample.period)
# empty list to hold the data
dens.data <- vector(mode='list', length = length(a) * length(t))
tank1 <- double(length(a))
index = 0
for (i in 1:length(a)){
for (j in 1:length(t)){
index = index + 1
tank1[index] = a[index] ### building an ID column
temp.tank <- subset(bma, bma$ID == a[i])
time.tank <- subset(temp.tank, temp.tank$Sample.period == t[j])
temp1 <- unique(temp.tank$Sample.period)
temp.tank <- data.frame(temp.tank, temp1)
dens.1 <- density(time.tank$Biomass_.adults_mgC.mm.3, na.rm = T)
# extract the y-values from the pdf function - these need to be separated by each Replicate and Sample Period
dens.data[[index]] <- dens.1$y
}
}
#### extract the data and place into a dataframe
dens.new<- data.frame(dens.data)
dens.new
colnames(dens.new) <- c("Treatment","Sample Period","pdf/density for biomass")
all<- list(dens.new)
all
### create new spreadsheet with all the data from the loop
dens.new.data<- write.csv(dens.new, "New.density.csv") ## export file to excel spreadsheet
Calling dens.new <- data.frame(dens.data) yields the following error message:
Error in data.frame(c(...) :
arguments imply differing number of rows: 512, 0
The loop seems to work for dens.data[[1]] but returns NULL for dens.data[[>1]]
As there isn't a minimal example, it is difficult for me to guess what the original data frame looks like. However, the error message makes it clear that your for loop fails to assign values to the list dens.data for indices greater than 1.
My guess is that index was not updated by index = index + 1. You could try changing the equal sign = to the standard R assignment operator <- and see whether the whole list is then filled.
I have heard that using the equal sign for assignment could cause problems in older versions of R, but I'm not sure whether you are facing that problem, since = and <- behave the same for a plain assignment inside a loop body. In any case, <- is the conventional and recommended way to assign a value.
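One way to sidestep the manual index bookkeeping is to split the data by ID and sample period and compute one density per group. The sketch below is only an illustration under assumptions: it uses made-up data in place of bma, shortens the biomass column name to Biomass, and skips groups with fewer than two observations, because density() cannot select a bandwidth automatically from a single value. The result is a long data frame, which avoids the "differing number of rows" error that data.frame() raises when some list elements are NULL.

```r
set.seed(42)

# Made-up data: each ID has a different number of observations per period.
bma <- data.frame(
  ID = rep(c("A", "B"), times = c(7, 5)),
  Sample.period = c(1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
  Biomass = rexp(12)
)

# One data frame per ID x Sample.period combination that actually occurs.
groups <- split(bma, list(bma$ID, bma$Sample.period), drop = TRUE)

dens.list <- lapply(groups, function(g) {
  x <- g$Biomass[!is.na(g$Biomass)]
  if (length(x) < 2) return(NULL)   # too few points for a density estimate
  d <- density(x)
  data.frame(ID = g$ID[1], Sample.period = g$Sample.period[1],
             x = d$x, y = d$y)
})

# Drop skipped groups, then stack everything into one long data frame.
dens.list <- Filter(Negate(is.null), dens.list)
dens.new <- do.call(rbind, dens.list)

head(dens.new)
```

The long layout can be written out with write.csv(dens.new, "New.density.csv") in the same way as in the question.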
I have a set of sequences with two variables for a third variable (device). Now I want to break the sequence for each device into sets of 300. dsl is a data frame that contains d, the device ID, and s, the number of sequences of length 300.
First, I label (column Sid) all the sequences with rep(1,300) followed by rep(2,300) and so on up to rep(s,300). Whatever remains unlabelled, i.e. with the initialized label (= 0), needs to be ignored. The actual labelling happens with the seqid vector, though.
I had to do this because I want to stack the sets of 300 data points and then transpose them. This forms one row of my predata data frame. For each predata data frame I run k-means to generate 5 clusters, which I store in finaldata.
Essentially, for every device I will have 5 clusters that I can then pull by referencing the row number in finaldata (mapped to device ID).
#subset processed data by device
for (ds in 1:387){
d <- dsl[ds,1]
s <- dsl[ds,3]
temp.data <- subset(data,data$Device==d)
temp.data$Sid <- 0
temp.data[1:(s*300),4] <- rep(1:300,s)
temp.data <- subset(temp.data,temp.data$Sid!="0")
seqid <- NA
for (j in 1:s){ seqid[(300*(j-1)+1):(300*j)] <- j }
temp.data$Sid <- seqid
predata <- as.data.frame(matrix(numeric(0),s,600))
for(k in 1:s){
temp.data2 <- subset(temp.data[,c(1,2)], temp.data$Sid==k)
predata[k,] <- t(stack(temp.data2)[,1])
}
ob <- kmeans(predata,5,iter.max=10,algorithm="Hartigan-Wong")
finaldata <- rbind(finaldata,(unique(fitted(ob,method="centers"))))
}
Being a noob to R, I ended up with 3 nested loops (the function did work for the outermost loop being one value). This has taken 5h and running. Need a faster way to go about this.
Any help will be appreciated.
Thanks
OK, I am going to suggest a radical simplification of the code within your loop. However, without sample data it is hard to verify that my assumptions are right, so please check that my predata is in fact equal to yours.
First the code:
for (ds in 1:387){
d <- dsl[ds,1]
s <- dsl[ds,3]
temp.data <- subset(data,data$Device==d)
temp.data <- temp.data[1:(s*300),]
predata <- cbind(matrix(temp.data[,1], byrow=T, ncol=300), matrix(temp.data[,2], byrow=T, ncol=300))
ob <- kmeans(predata,5,iter.max=10,algorithm="Hartigan-Wong")
finaldata <- rbind(finaldata,(unique(fitted(ob,method="centers"))))
}
What I understand you are doing: take the first 300*s elements from subset(data, data$Device == d). This can easily be done using the command
temp.data <- temp.data[1:(s*300),]
Afterwards, you collect a matrix that has the first row c(temp.data[1:300, 1], temp.data[1:300, 2]), and so on for all further rows. I do this using the matrix command as above.
I assume that your outer loop could be transformed into a call to tapply or something similar, but for that we would need more context.
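The equivalence between the matrix() reshaping and the original stack-and-transpose loop can be checked on a small example. This sketch assumes a block size of 3 instead of 300 and a made-up two-column temp.data (the column names v1 and v2 are invented):

```r
block <- 3   # stands in for 300
s <- 2       # number of complete blocks
temp.data <- data.frame(v1 = 1:6, v2 = 11:16)

# Reshape each column into s rows of `block` values and bind them side by side.
via_matrix <- cbind(
  matrix(temp.data[, 1], byrow = TRUE, ncol = block),
  matrix(temp.data[, 2], byrow = TRUE, ncol = block)
)

# The loop from the question: stack both columns of each block into one row.
via_loop <- matrix(NA_real_, nrow = s, ncol = 2 * block)
for (k in 1:s) {
  rows <- ((k - 1) * block + 1):(k * block)
  via_loop[k, ] <- stack(temp.data[rows, ])[, 1]
}

print(via_matrix)
all.equal(via_matrix, via_loop)
```

Running the same comparison on one real device before replacing the loop is a cheap way to confirm that predata is unchanged.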
For example, I have a matrix k
> k
d e
a 1 3
b 2 4
I want to apply a function on k
> apply(k,MARGIN=1,function(p) {p+1})
a b
d 2 3
e 4 5
However, I also want to print the row name of the row the function is being applied to, so that I know which row is being processed at that time.
It might look like this:
apply(k,MARGIN=1,function(p) {print(rowname(p)); p+1})
But I really don't know how to do that in R.
Does anyone have any idea?
Here's a neat solution to what I think you're asking. (I've called the input matrix mat rather than k for clarity - in this example, mat has 2 columns and 10 rows, and the rows are named abc1 through to abc10.)
In the code below, the result out1 is the thing you wanted to calculate (the outcome of the apply command). The result out2 comes out identically to out1 except that it prints out the rownames that it is working on (I put in a delay of 0.3 seconds per row so you can see it really does do this - take this out when you want the code to run full speed obviously!)
The trick I came up with was to cbind the row numbers (1 to n) onto the left of mat (to create a matrix with one additional column), and then use this to refer back to the rownames of mat. Note the line x = y[-1] which means that the actual calculation within the function (here, adding 1) ignores the first column of row numbers, which means it's the same as the calculation done for out1. Whatever sort of calculation you want to perform on the rows can be done this way - just pretend that y never existed, and formulate your desired calculation using x. Hope this helps.
set.seed(1234)
mat = as.matrix(data.frame(x = rpois(10,4), y = rpois(10,4)))
rownames(mat) = paste("abc", 1:nrow(mat), sep="")
out1 = apply(mat,1,function(x) {x+1})
out2 = apply(cbind(seq_len(nrow(mat)),mat),1,
function(y) {
x = y[-1]
cat("Doing row:",rownames(mat)[y[1]],"\n")
Sys.sleep(0.3)
x+1
}
)
identical(out1,out2)
You can use a variable outside of the apply call to keep track of the row index and pass the row names as an extra argument to your function:
idx <- 1
apply(k, 1, function(p, rn) {print(rn[idx]); idx <<- idx + 1; p + 1}, rownames(k))
This should work. The cat() function is what you want to use when printing results during evaluation of a function. paste(), conversely, just returns a character vector but doesn't send it to the command window.
The solution below uses a counter created as a closure, allowing it to "remember" how many times the function has been run before. Note the use of the superassignment operator <<-. If you really want to understand what's going on here, I recommend reading through this wiki https://github.com/hadley/devtools/wiki/
Note there may be an easier way to do this; my solution assumes that there is no way to access the rownumber or rowname of a current row using typical means within an apply function. As previously mentioned, this would be no problem in a loop.
k <- matrix(c(1,2,3,4),ncol=2)
rownames(k) <- c("a","b")
colnames(k) <- c("d","e")
make.counter <- function(x){
i <- 0
function(){
i <<- i+1
i
}
}
counter1 <- make.counter()
apply(k,MARGIN=1,function(p){
current.row <- rownames(k)[counter1()]
cat(current.row,"\n")
return(p+1)
})
As far as I know you cannot do that with apply, but you could loop through the rownames of your data frame. Lame example:
lapply(rownames(mtcars), function(x) sprintf('The mpg of %s is %s.', x, mtcars[x, 1]))
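Applied to the matrix k from the question, looping over the row names could look like the sketch below. It uses sapply() rather than lapply() so that the result is simplified to a matrix, which should have the same layout as the apply() result:

```r
k <- matrix(c(1, 2, 3, 4), ncol = 2,
            dimnames = list(c("a", "b"), c("d", "e")))

# Iterate over the row names, so the name is available inside the function.
out <- sapply(rownames(k), function(rn) {
  cat("Doing row:", rn, "\n")
  k[rn, ] + 1
})

print(out)
```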