I can't for the life of me figure out what is going on here. I have a data frame that has several thousand rows. One of the columns is "name" and the other columns hold various factors. I'm trying to count how many unique rows (i.e. sets of factors) belong to each "name".
Here is the loop that I am running as a script:
names<-as.matrix(unique(all.rows$name))
count<-matrix(1:length(names))
for (i in 1:length(names)) {
count[i]<-dim(unique(subset(all.rows,name==names[i])[,c(1,3,4,5)]))[1]
}
When I run the line in the for loop from the console and replace "i" with an arbitrary number (i.e. 10, 27, 40, ...), it gives me the correct count. But when I run this line inside the for loop, the end result is that the counts are all the same. I can't figure out why it's not working. Any ideas?
Your code works for me:
# Sample data.
set.seed(1)
n=10000
all.rows=data.frame(a=sample(LETTERS,n,replace=T),b=sample(LETTERS,n,replace=T),name=sample(LETTERS,n,replace=T))
names<-as.matrix(unique(all.rows$name))
count<-matrix(1:length(names))
for (i in 1:length(names)) {
count[i]<-dim(unique(subset(all.rows,name==names[i])[,c(1,2)]))[1]
}
t(count)
If you want to stick with a for loop, this is a little clearer:
count<-c()
for (i in unique(all.rows$name))
count[i]<-nrow(unique(all.rows[all.rows$name==i,names(all.rows)!='name']))
count
But using by would be very concise:
c(by(all.rows,all.rows$name,function(x) nrow(unique(x))))
You can do this with much simpler code. Try just pasting together the factor values in each row and then using tapply. Here is a working example:
data(trees)
trees$name <- rep(c('elm', 'oak'), length.out = nrow(trees))
trees$HV <- with(trees, paste(Height, Volume))
tapply(trees$HV, trees$name, function (x) length(unique(x)))
The last command gives you the counts that you need. As far as I can tell, the analogous code given your variable names is
# margin 1 pastes across each row; the separator keeps e.g. 'ab','c' distinct from 'a','bc'
all.rows$factorCombo <- apply(all.rows[, c(1, 3:5)], 1, function (x) paste(x, collapse = ' '))
tapply(all.rows$factorCombo, all.rows$name, function (x) length(unique(x)))
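As a quick sanity check, the row-wise paste approach can be compared against by() on a small data frame. The data frame df and its columns below are invented for illustration, so treat this as a sketch rather than a drop-in solution:

```r
# Toy data: 'name' plus two factor-like columns (all values invented).
df <- data.frame(
  name = c("a", "a", "a", "b", "b"),
  f1   = c("x", "x", "y", "x", "x"),
  f2   = c("p", "p", "p", "q", "q"),
  stringsAsFactors = FALSE
)

# Paste each row's factor values together (margin 1 = rows), with a
# separator so that different value pairs cannot collapse to one string.
combo <- apply(df[, c("f1", "f2")], 1, paste, collapse = " ")

# Count distinct combinations per name.
counts_tapply <- tapply(combo, df$name, function(x) length(unique(x)))

# Same count via by(), which hands each name's rows to the function.
counts_by <- c(by(df[, c("f1", "f2")], df$name, function(x) nrow(unique(x))))
```

Both counts_tapply and counts_by should agree group by group; if they ever differ on real data, the paste separator is the first thing to check.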
I want to concatenate "iris$" and a column name, so I can use the result in a function to get the Sepal Length column from the iris data frame. But when I use paste("iris$", colnames(iris[3])), the result is a character string (with quotes), such as "iris$SepalLength". I need the result not to be a character string. I have tried noquote(), as.data.frame() etc. but it doesn't work.
freq <- function(y) {
for (i in iris) {
count <-1
y <- paste0("iris$",colnames(iris[count]))
data.frame(as.list(y))
print(y)
span = seq(min(y),max(y), by = 1)
freq = cut(y, breaks = span, right = FALSE)
table(freq)
count = count +1
}
}
freq(1)
The crux of your problem isn't making that object not be a string, it's convincing R to do what you want with the string. You can do this with, e.g., eval(parse(text = foo)). Isolating out a small working example:
y <- "iris$Sepal.Length"
data.frame(as.list(y)) # does not display iris$Sepal.Length
data.frame(as.list(eval(parse(text = y)))) # DOES display iris$Sepal.Length
That said, I wanted to point out some issues with your function:
The input variable appears to not do anything (because it is immediately overwritten), which may not have been intended.
The for loop seems broken, since it resets count to 1 on each pass, which I think you didn't mean. Relatedly, it iterates over all i in iris, but then it doesn't use i in any meaningful way other than to keep a count. Instead, you could do something like for(count in 1 : length(iris)), which would establish the count variable and iterate it for you as well.
It's generally better to avoid for loops in R where you can; the apply family of functions exists for applying a function to (e.g.) every column of a data frame. As a very simple version of this, something like apply(iris, 2, table) will apply the table function along margin 2 (the columns) of iris and, in this case, place the results in a list. The idea would be to build your function to do what you want to a single vector, then pass each vector through the function with something from the apply() family. For instance:
cleantable <- function(x) {
myspan = seq(min(x), max(x)) # if unspecified, by = 1
myfreq = cut(x, breaks = myspan, right = FALSE)
table(myfreq)
}
apply(iris[1:4], 2, cleantable) # can only use first 4 columns since 5th isn't numeric
would do what I think you were trying to do on the first 4 columns of iris. This way of programming will be generally more readable and less prone to mistakes.
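On the original question of turning the string "iris$Sepal.Length" into the column itself: a column can usually be looked up by a name held in a string with [[ ]], which avoids eval(parse()). The sketch below combines that with the cleantable idea. The helper name bin_counts is made up for illustration, and the floor/ceiling break points are an assumption chosen so that the maximum value falls inside a bin:

```r
# Look a column up by a name stored in a string; no eval(parse()) needed.
col_name <- "Sepal.Length"
sepal <- iris[[col_name]]

# Hypothetical helper: frequency table of a numeric vector in bins of width 1.
bin_counts <- function(x) {
  # Breaks run from floor(min) to ceiling(max) + 1 so every value lands in
  # a bin even with right = FALSE (intervals closed on the left).
  breaks <- seq(floor(min(x)), ceiling(max(x)) + 1, by = 1)
  table(cut(x, breaks = breaks, right = FALSE))
}

# Apply the helper to every numeric column, selected by name.
numeric_cols <- names(iris)[sapply(iris, is.numeric)]
result <- lapply(iris[numeric_cols], bin_counts)
```

Using lapply on the data frame keeps each column as its own numeric vector, so no conversion to a matrix is involved.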
Can someone help me with this? I got the cut_interval code to work for a single test column, but can't seem to get it to work in a for loop to have it run on all of the columns.
#Bin worker data into three groups (low/medium/high %methylation) for the cpg cg10757709
#This code works
cg10757709_interval <- cut_interval(cpgs$cg10757709, n=3, labels = c("low","med","high"))
View(cg10757709_interval)
#Write a loop so that data for each of the significant cpgs will be binned into low, medium, and high groups
#This code gives an error (that more elements are supplied than there are to replace)
cpgs_interval <- matrix(ncol = length(cpgs), nrow = 29)
for (i in seq_along(cpgs)) {
cpgs_interval[[i]] <- cut_interval(cpgs[[i]], n=3, labels = c("low","med","high"))
}
View(cpgs_interval)
The error says "Error in cpgs_interval[[i]] <- cut_interval(cpgs[[i]], n = 3, labels = c("low", : more elements supplied than there are to replace". Should I not be using a matrix for cpgs_interval? Or is something else the problem? I'm rather new to writing for loops. Thanks.
In your example, cpgs_interval is a matrix. If you want to put the variable into the ith column of the matrix, you could do:
for (i in seq_along(cpgs)) {
cpgs_interval[,i] <- cut_interval(cpgs[[i]], n=3, labels = c("low","med","high"))
}
That said, you might be better off making cpgs_interval a data frame; then each column stays a factor with its labels instead of being coerced when it is stored in the matrix.
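A data-frame version might look like the sketch below. The cpgs data here is simulated, and base R's cut() with breaks = 3 stands in for ggplot2's cut_interval() so that the example runs without extra packages. Both split the range into three equal-width bins, though their exact break points may differ slightly:

```r
set.seed(42)
# Simulated stand-in for the real data: 29 samples, 3 CpG columns.
cpgs <- data.frame(
  cg_a = runif(29),
  cg_b = runif(29),
  cg_c = runif(29)
)

# lapply() returns one factor per column; wrapping the list in data.frame()
# keeps each column a factor with the low/med/high labels.
cpgs_interval <- data.frame(
  lapply(cpgs, function(x) cut(x, breaks = 3, labels = c("low", "med", "high")))
)

str(cpgs_interval)
```

Because the result is built column by column, there is no need to know the number of rows in advance, unlike the preallocated matrix with nrow = 29.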
For some basic publications I have to write almost the same code for many tables. So I need reasonably quick code that builds data frames from files and performs the same operations on the data with a single function.
Example:
# Creating function
basic_sum <- function (place, DF, factor_col, sum) {
# Uploading data.frame
DF <- read.csv (place, sep = ";")
# Converting to factor
for (i in factor_col) {
DF [, i] <- as.factor (DF [, i])
}
# Summary
sum <- summary (DF)
View (sum)
}
Then I run that code and get a function basic_sum.
If I want to work with my data, I call this function with arguments:
basic_sum (place = "~/DataFrame.csv", DF = DataFrame,
factor_col = c (1, 6 : 11), sum = DF_sum)
After running it, nothing happens. I mean, I don't have anything new in the Environment: no new data, no new variables, nothing else.
As I understand it, I should end up with:
1) a data.frame "DataFrame", loaded from DataFrame.csv;
2) the 1st, 6th, 7th and all other columns up to the 11th converted to factors;
3) a data.frame "DF_sum" with a summary of all the columns of "DataFrame";
4) a view of the data.frame "DF_sum".
Well, I see all of it in the console, but I need it in the Environment so I can save it somewhere.
It seems that I'm doing something wrong... but I don't know what.
P.S.: If I run it without the function (of course replacing DF with DataFrame, factor_col with c (1, 6 : 11) and so on...) everything is all right. But then I have to rewrite the code every time, or at least replace every DF and the other names, which is a bother.
With great regards,
Dmitrii
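One likely explanation is that variables created inside a function are local to that call: DF and sum disappear when the function returns, and passing DF = DataFrame does not create an object called DataFrame in the calling environment. A common pattern is to return the results and assign them at the call site. The sketch below assumes a semicolon-separated file; the temporary file and its two columns are invented so that the example is runnable:

```r
# Read a ;-separated file, convert the given columns to factors, and
# return both the data and its summary instead of relying on side effects.
basic_sum <- function(place, factor_col) {
  DF <- read.csv(place, sep = ";")
  DF[factor_col] <- lapply(DF[factor_col], as.factor)
  list(data = DF, summary = summary(DF))
}

# Invented example file standing in for ~/DataFrame.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c("group;value", "1;10.5", "2;11.0", "1;9.8"), tmp)

res <- basic_sum(place = tmp, factor_col = 1)
DataFrame <- res$data     # now visible in the calling environment
DF_sum    <- res$summary
```

With the real file, the call would presumably become basic_sum("~/DataFrame.csv", c(1, 6:11)), and View(DF_sum) can then be run on the returned summary.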
The following code in R uses a for-loop. What is a way I could solve the same problem without a for-loop (maybe by vectorizing it)?
I am looking at an unfamiliar dataset with many columns (243), and am trying to figure out which columns hold unstructured text. As a first check, I was going to flag columns that are 1) of class 'character' and 2) have at least ten unique values.
openEnded <- rep(x = NA, times = ncol(scaryData))
for(i in 1:ncol(scaryData)) {
openEnded[i] <- is.character(scaryData[[i]]) & length(unique(scaryData[[i]])) >= 10
}
This should do the job:
openEnded <- sapply(scaryData, function(x) is.character(x) && length(unique(x)) >= 10)
Instead of the loop, sapply iterates over the columns of the data frame and applies an anonymous function that combines your two conditions (function(x) cond1 && cond2).
I would avoid apply(scaryData, 2, function(x) ...) here: apply converts the data frame to a matrix first, so if any column is character, every column reaches the function as character and is.character(x) is TRUE for all of them.
A nice post about the *apply family can be found there.
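How apply and sapply treat a data frame with mixed column types can be seen in a small sketch. The data frame toy and the helper is_open_ended below are made up for illustration:

```r
# Made-up data: one integer and one character column, 12 rows each.
toy <- data.frame(
  id   = 1:12,
  note = paste("free text", letters[1:12]),
  stringsAsFactors = FALSE
)

# The two conditions from the question, wrapped in a helper.
is_open_ended <- function(x) is.character(x) && length(unique(x)) >= 10

# sapply() passes each column as stored, so only 'note' is flagged.
by_sapply <- sapply(toy, is_open_ended)

# apply() coerces the whole data frame to a character matrix first,
# so the integer 'id' column is flagged as well.
by_apply <- apply(toy, 2, is_open_ended)
```

vapply(toy, is_open_ended, logical(1)) is a stricter alternative to sapply when the result must always be a logical vector.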
I am having trouble optimising a piece of R code. The following example code should illustrate my optimisation problem:
Some initialisations and a function definition:
a <- c(10,20,30,40,50,60,70,80)
b <- c("a","b","c","d","z","g","h","r")
c <- c(1,2,3,4,5,6,7,8)
myframe <- data.frame(a,b,c)
columns <- 6 # must be defined before it is used below
values <- vector(length=columns)
solution <- matrix(nrow=nrow(myframe),ncol=columns+3)
myfunction <- function(frame,columns){
athing = 0
if(columns == 5){
athing = 100
}
else{
athing = 1000
}
values[columns+1] = athing
return(values)}
The problematic for-loop looks like this:
columns = 6
for(i in 1:nrow(myframe)){
values <- myfunction(as.matrix(myframe[i,]), columns)
values[columns+2] = i
values[columns+3] = myframe[i,3]
#more columns added with simple operations (i.e. sum)
solution <- rbind(solution,values)
#solution is a large matrix from outside the for-loop
}
The problem seems to be the rbind function. I frequently get error messages regarding the size of solution, which seems to become too large after a while (more than 50 MB).
I want to replace this loop and the rbind with a list and lapply and/or foreach. I have started by converting myframe to a list.
myframe_list <- lapply(seq_len(nrow(myframe)), function(i) myframe[i,])
I have not really come further than this, although I tried applying this very good introduction to parallel processing.
How do I have to reconstruct the for-loop without having to change myfunction? Obviously I am open to different solutions...
Edit: This problem seems to be straight from the 2nd circle of hell from the R Inferno. Any suggestions?
The reason that using rbind in a loop like this is bad practice is that each iteration enlarges solution and copies it to a new object, which is a very slow process and can also lead to memory problems. One way around this is to create a list whose ith component stores the output of the ith loop iteration. The final step is to call rbind on that list (just once at the end). This will look something like
my.list <- vector("list", nrow(myframe))
for(i in 1:nrow(myframe)){
# Call all necessary commands to create values
my.list[[i]] <- values
}
solution <- rbind(solution, do.call(rbind, my.list))
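Filled in with the example data from the question, the pattern might look like the sketch below. The helper make_row is hypothetical: it stands in for whatever computes values for one row and is not the asker's real function.

```r
myframe <- data.frame(
  a = c(10, 20, 30, 40, 50, 60, 70, 80),
  b = c("a", "b", "c", "d", "z", "g", "h", "r"),
  c = 1:8
)
columns <- 6

# Hypothetical stand-in for the per-row computation.
make_row <- function(i) {
  values <- numeric(columns + 3)
  values[columns + 1] <- if (columns == 5) 100 else 1000
  values[columns + 2] <- i
  values[columns + 3] <- myframe[i, 3]
  values
}

# Preallocate the list, fill one slot per iteration, bind once at the end.
my.list <- vector("list", nrow(myframe))
for (i in seq_len(nrow(myframe))) {
  my.list[[i]] <- make_row(i)
}
solution <- do.call(rbind, my.list)
```

The loop could equally be written as my.list <- lapply(seq_len(nrow(myframe)), make_row), which is also the natural starting point for a parallel version.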
A bit too long for a comment, so I put it here:
If columns is known in advance:
myfunction <- function(frame){
athing = 0
if(columns == 5){
athing = 100
}
else{
athing = 1000
}
values[columns+1] = athing
return(values)}
# margin 1 applies myfunction to each row; t() puts one result per row
t(apply(myframe, 1, myfunction))
If columns is not given via the environment, you can use:
t(apply(myframe, 1, myfunction, columns)) with your original myfunction definition.