Suppose I have a table of ages:
ages <- array(round(runif(min=10,max=200,n=100)),dim=100,dimnames=list(age=0:99))
Suppose now I want to collapse my ages table in 5-year wide age groups.
This could be done quite easily by summarizing over different values:
ages.5y <- array(NA,dim=20,dimnames=list(age=paste(seq(from=0,to=95,by=5),seq(from=4,to=99,by=5),sep=""))
ages.5y[1]<-sum(ages[1:5])
ages.5y[2]<-sum(ages[6:10)
...
ages.5y[20]<-sum(ages[96:100])
It could also be done using a loop:
for(i in 1:20) ages.5y[i]<-sum(ages[(5*i-4):(5*i)])
But while this method is easy for "regular" transformations, the loop approach becomes infeasible if the new intervals are irregular, eg. 0-4,5:12,13-24,25-50,60-99.
If, instead of a table, I had individual values, this could be done quite easily using cut:
flattened <- rep(as.numeric(dimnames(ages)$age),ages)
table(cut(flattened,breaks=seq(from=0,to=100,by=5)))
This allows the use of any random break points, eg breaks=c(5,10,22,33,41,63,88)
However, this is a quite ressource intense way to do it.
So, my question is: Is there a better way to recode a contingency table?
You could use cut on the age values, but not the counts. Like this:
ages =0:99
ageCounts = array(round(runif(min=10,max=200,n=100)),dim=100)
groups = cut(ages,breaks=seq(from=-1,to=100,by=5))
Then group them. I use data.table for this:
DT = data.table(ages=ages, ageCounts=ageCounts, groups)
DT[,list(sum=sum(ageCounts)), by=groups]
Related
I hope I phrased the question right, I'm not even sure how to word my question, which is probably part of why I'm having trouble finding the answer.
Consider a data.frame that has multiple string vectors. I would like to construct another variable that pair-wise combines the two vectors together, agnostic of their order.
For example, consider the following data.frame
df <- data.frame(var1 = c('string1', 'string2', 'string3'),
var2 = c('string3', 'string4', 'string1')
)
I'd like to have a variable that is identical for the first and 3rd element, like:
c('string1, string3', 'string2, string 4', 'string1, string3')
I'm imagining that it might be best to make a variable/vector that's a list of the two component variables, but I'm obviously open to any solution. I tried to make a list variable that does what I want based on this question but with no luck:
Create a data.frame where a column is a list
If possible, I'd like to do this in a way that could extend to more than 2 columns and could efficiently run over millions of rows, especially if there is a data.table method.
Thanks for your help!
Edit: A crappy example of how I could do it with a forloop that doesn't quite work but you get the idea:
for (i in 1:nrow(df)) {
df$var.new[i] <- paste(sort( c(df$var1[i], df$var2[i])))
}
Dear Friends I would appreciate if someone can help me in some question in R.
I have a data frame with 8 variables, lets say (v1,v2,...,v8).I would like to produce groups of datasets based on all possible combinations of these variables. that is, with a set of 8 variables I am able to produce 2^8-1=63 subsets of variables like {v1},{v2},...,{v8}, {v1,v2},....,{v1,v2,v3},....,{v1,v2,...,v8}
my goal is to produce specific statistic based on these groupings and then compare which subset produces a better statistic. my problem is how can I produce these combinations.
thanks in advance
You need the function combn. It creates all the combinations of a vector that you provide it. For instance, in your example:
names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)
This gives you all permutations of V1-V8 taken 3 at a time.
I'll use data.table instead of data.frame;
I'll include an extraneous variable for robustness.
This will get you your subsetted data frames:
nn<-8L
dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
c("id",paste0("V",1:nn)))
#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
rep(rep(c(0,1),each=64),2),
rep(rep(c(0,1),each=32),4),
rep(rep(c(0,1),each=16),8),
rep(rep(c(0,1),each=8),16),
rep(rep(c(0,1),each=4),32),
rep(rep(c(0,1),each=2),64),
rep(c(0,1),128)) *
t(matrix(rep(1:nn),2^nn,nrow=nn))
#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})
#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=F]})
You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:
ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=F])})
#exclude the first row, which is null
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
I have a data set, and I am trying to create a new variable with random values that are associated with a particular subset.
For example, given the data frame:
data(iris)
iris=iris
I want another variable that associates each value of iris$Species with a random number (between 0 and 1). This can be accomplished in a circuitous fashion by creating a data frame:
df=data.frame(unique(iris$Species),runif(length(unique(iris$Species))))
And merging it with the original data frame:
iris=merge(iris,df,by.x="Species",by.y="unique.iris.Species.")
This accomplishes what I want, but it is inelegant. Furthermore, if I wanted to replicate this process many times over different variables this process would be burdensome. What I would hope for is some quick indexing method that would hopefully look something like:
iris$Species.unif=runif(length(unique(iris$Species)))[iris$Species]
Given that indexing in R is typically very slick, I expect there is some way of doing this that I am not aware of.
Thank you in advance.
You may want to try by using levels:
iris <- iris
iris$species_unif <- iris$Species
levels(iris$species_unif ) <- runif(length(levels(iris$Species)))
I have a set of data that looks like this,
species<-"ABC"
ind<-rep(1:4,each=24)
hour<-rep(seq(0,23,by=1),4)
depth<-runif(length(ind),1,50)
df<-data.frame(cbind(species,ind,hour,depth))
df$depth<-as.numeric(df$depth)
In this example, the column "ind" has more levels and they don't have always the same length (here each individual has 4 levels, but in reality some individuals have thousands of rows of data, while other only a few lines).
What I would like to do is to have an outer loop or function that will select all the rows from each individual ("ind") and generate a boxplot using the depth/hour columns.
This is the idea that I have in mind,
for (i in 1:length(unique(df$ind))){
data<-df[df$ind==df$ind[i],]
individual[i]<-data
plot.boxplot<-function(data){
boxplot(depth~hour,dat=data,xlab="Hour of day",ylab="Depth (m)")
}
}
par(mfrow=c(2,2),mar=c(5,4,3,1))
plot.boxplot(individual)
I realized that this loop might be inappropriate, but I am still learning. I can do the boxplot for each individual at a time, but I would like a faster, more efficient way of selecting the data for each individual and creating or storing boxplot results. This will be very useful for when I have many more individuals (instead of doing one at a time...). Thanks a lot in advance.
What about something like this?
par(mfrow=c(2,2))
invisible(
by(df,df$ind,
function(x)
boxplot(depth~hour,data=x,xlab="Hour of day",ylab="Depth (m)")
)
)
To provide some explanation, this runs a boxplot for each group of cases in df defined by df$ind. The invisible wrapper just makes it so that the bunch of output used for the boxplot is not written to the console.
I am trying to restructure an enormous dataframe (about 12.000 cases): In the old dataframe one person is one row and has about 250 columns (e.g. Person 1, test A1, testA2, testB, ...)and I want all the results of test A (1 - 10 A´s overall and 24 items (A-Y) for that person in one column, so one person end up with 24 columns and 10 rows. There is also a fixed dataframe part before the items A-Y start (personal information like age, gender etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance) but for the 12.000 it is still calculating, for nearly 24hours now. Any ideas why?
restructure <- function(data, firstcol, numcol, numsets){
out <- data.frame(t(rep(0, (firstcol-1)+ numcol)) )
names(out) <- names(daten[0:(firstcol+numcol-1)])
for(i in 1:nrow(daten)){
fixdata <- (daten[i, 1:(firstcol-1)])
for (j in (seq(firstcol, ((firstcol-1)+ numcol* numsets), by = numcol))){
flexdata <- daten[i, j:(j+numcol-1)]
tmp <- cbind(fixdata, flexdata)
names(tmp) <- names(daten[0:(firstcol+numcol-1)])
out <- rbind(out,tmp)
}
}
out <- out[2:nrow(out),]
return(out)
}
Thanks in advance!
Idea why: you rbind to out in each iteration. This will take longer each iteration as out grows - so you have to expect more than linear growth in run time with increasing data sets.
So, as Andrie tells you can look at melt.
Or you can do it with core R: stack.
Then you need to cbind the fixed part yourself to the result, (you need to repeat the fixed columns with each = n.var.cols
A third alternative would be array2df from package arrayhelpers.
I agree with the others, look into reshape2 and the plyr package, just want to add a little in another direction. Particularly melt, cast,dcast might help you. Plus, it might help to make use of smart column names, e.g.:
As<-grep("^testA",names(yourdf))
# returns a vector with the column position of all testA1 through 10s.
Besides, if you 'spent' the two dimensions of a data.frame on test# and test type, there's obviously none left for the person. Sure, you identify them by an ID, that you could add an aesthetic to when plotting, but depending on what you want to do you might want to store them in a list. So you end up with a list of persons with a data.frame for every person. I am not sure what you are trying to do, but still hope this helps though.
Maybe you're not getting the plyr or other functions for reshaping the data component. How about something more direct and low level. If you currently just have one line that goes A1, A2, A3... A10, B1-B10, etc. then extract that lump of stuff from your data frame, I'm guessing columns 11-250, and then just make that section the shape you want and put them back together.
yDat <- data[, 11:250]
yDF <- lapply( 1:nrow(data), function(i) matrix(yDat[i,], ncol = 24) )
yDF <- do.call(rbind, y) #combine the list of matrices returned above into one
yDF <- data.frame(yDF) #get it back into a data.frame
names(yDF) <- LETTERS[1:24] #might as well name the columns
That's the fastest way to get the bulk of your data in the shape you want. All the lapply function did was add dimension attributes to each row so that they were in the shape you wanted and then return them as a list, which was massaged with the subsequent rows. But now it doesn't have any of your ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times. Or you can use the convenience function merge to help with that. Make a common column that is already in your first 10 rows one of the columns of the new data.frame and then just merge them.
yInfo <- data[, 1:10]
ID <- yInfo$ID
yDF$ID <- rep( yInfo$ID, each = 10 )
newDat <- merge(yInfo, yDF)
And now you're done... mostly, you might want to make an extra column that names the new rows
newDat$condNum <- rep(1:10, nrow(newDat)/10)
This will be very fast running code. Your data.frame really isn't that big at all and much of the above will execute in a couple of seconds.
This is how you should be thinking of data in R. Not that there aren't convenience functions to handle the bulk of this but you should be doing this that avoid looping as much as possible. Technically, what happened above only had one loop, the lapply used right at the start. It had very little in that loop as well (they should be compact when you use them). You're writing in scalar code and it is very very slow in R... even if you weren't really abusing memory and growing data while doing it. Furthermore, keep in mind that, while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which is one of your biggest problems.
(read this to better understand your problems in this code... you've made most of the big errors in there)