Sort data set by biggest range of entries - R

I am dipping my toes into R a bit. I can now sort columns by their means, but I would now like to sort the columns by the biggest range of the data points in each column.
Say I have a table with point ratings for movies. How can I get the top 10 movies where the opinions differ the most? Is there a function that can measure this? One idea of mine was to use the minValue and maxValue, but then just one outlier can mess it all up.
I guess the size of the box of a boxplot (the interquartile range) could be a pretty good measure.
Any ideas?
Update:
So I tried sorting my table's columns by their respective sd(), but I can't quite get that to work. What I was trying is this (the table has column headings, by the way, and is named newdata here):
> newdata.sd <- sapply(1:107, function(j) sd(newdata[,j], na.rm=TRUE))
> newdata.sorted.sd <- newdata[,names(sort(newdata.sd, decreasing=TRUE))]
The second line throws an error because the first doesn't produce a named numeric vector.
When I did the same thing to sort by the column means, it worked. I did that with the following two lines:
> newdata.mean <- colMeans(newdata, na.rm=TRUE)
> newdata.sorted <- newdata[,names(sort(newdata.mean, decreasing=TRUE))]
How can I produce a named vector of sd()s like in the second example?
A different way to sort by sd() would also be appreciated.
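One approach that should give a named vector (a sketch only, assuming newdata is a data frame with named columns): apply sd over the data frame itself rather than over column indices, so the column names are carried along.
# sapply over the data frame applies sd to each column and keeps the names
newdata.sd <- sapply(newdata, sd, na.rm = TRUE)
newdata.sorted.sd <- newdata[, names(sort(newdata.sd, decreasing = TRUE))]
# alternatively, keep the index-based version and add the names afterwards:
# names(newdata.sd) <- names(newdata)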

Related

How to make a histogram of a non-numeric column in R?

I have a data frame called mydata with multiple columns, one of which is Benefit, which records for each sample whether it is CB (completely responding), ICB (intermediate) or NCB (not responding at all).
So basically the Benefit column is a vector with three values:
Benefit <- c("CB" , "ICB" , "NCB")
I want to make a histogram/barplot based on the count of each of those values. So basically it's not a numeric column. I tried solving this with the following code:
hist(as.numeric(metadata$Benefit))
I also tried
barplot(metadata$Benefit)
which obviously didn't work.
The second thing I want to do is to find a relation between the Age column of the same data frame and the Benefit column, for example: do the younger patients get more benefit? Is there any way to do that?
THANKS!
Hi and welcome to the site :)
One nice way to find issues with code is to run only one command at a time.
# let's create some data
metadata <- data.frame(Benefit = c("ICB", "CB", "CB", "NCB"))
Now, the command as.numeric does not work on character data:
as.numeric(metadata$Benefit) # returns NA's
Instead, what we want is to count the number of instances of each unique value of the Benefit column; we do this with table:
tabledata <- table(metadata$Benefit)
and then barplot is the function we want for creating the plot:
barplot(tabledata)
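For the second part of the question, relating Age to Benefit, a simple first look could be a boxplot of Age by Benefit group. A minimal sketch, assuming metadata also has a numeric Age column (the ages below are made up):
# hypothetical ages, one per row of the example data above
metadata$Age <- c(54, 61, 47, 72)
# boxplot of Age split by Benefit group
boxplot(Age ~ Benefit, data = metadata, xlab = "Benefit group", ylab = "Age")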

Specifying names of columns to be used in a loop in R

I have a df with over 30 columns and over 200 rows, but for simplicity I will use an example with 8 columns.
X1<-c(sample(100,25))
B<-c(sample(4,25,replace=TRUE))
C<-c(sample(2,25,replace =TRUE))
Y1<-c(sample(100,25))
Y2<-c(sample(100,25))
Y3<-c(sample(100,25))
Y4<-c(sample(100,25))
Y5<-c(sample(100,25))
df<-cbind(X1,B,C,Y1,Y2,Y3,Y4,Y5)
df<-as.data.frame(df)
I wrote a function that melts the data and generates a plot with X1 giving the x-axis values, faceted using the values in B and C.
library(reshape2)  # for melt()
library(ggplot2)   # for ggplot() and ggsave()

plotdata <- function(l){
  melted <- melt(df, id.vars = c("X1", "B", "C"), measure.vars = l)
  plot <- ggplot(melted, aes(x = X1, y = value)) + geom_point()
  plot2 <- plot + facet_grid(B ~ C)
  ggsave(filename = paste("X_vs_", l, "_faceted.jpeg", sep = ""), plot = plot2)
}
I can then manually input the required Y variable
plotdata("Y1")
I don't want to generate plots for all columns. I could just type each column of interest into plotdata and get the result, but this seems quite inelegant (and time-consuming). I would prefer to manually specify the columns of interest, e.g. "Y1", "Y3", "Y4", and then write a loop to handle all of those.
However, I am new to writing for loops and can't find a way to loop over the specific column names that my function needs. A standard for(i in 1:length(df)) wouldn't be appropriate because I only want to loop over the user-specified columns.
Apologies if there is already an answer to this on Stack Overflow; I couldn't find one.
Thanks to Roland for providing the following answer:
Try
for (x in c("Y1","Y3","Y4")) {plotdata(x)}
The index variable doesn't have to be numeric.
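If you prefer to avoid the explicit loop, the same calls can be made with lapply (a sketch only, using the same plotdata function and the same column names):
# apply plotdata to each requested column name in turn
lapply(c("Y1", "Y3", "Y4"), plotdata)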

Selecting matching row values from a column (data frame) to create plots using a loop in R

I have a set of data that looks like this,
species<-"ABC"
ind<-rep(1:4,each=24)
hour<-rep(seq(0,23,by=1),4)
depth<-runif(length(ind),1,50)
df<-data.frame(cbind(species,ind,hour,depth))
df$depth<-as.numeric(df$depth)
In this example, the column "ind" would have more levels, and they don't always have the same number of rows (here there are 4 individuals with 24 rows each, but in reality some individuals have thousands of rows of data, while others have only a few).
What I would like to do is to have an outer loop or function that will select all the rows from each individual ("ind") and generate a boxplot using the depth/hour columns.
This is the idea that I have in mind,
for (i in 1:length(unique(df$ind))){
  data <- df[df$ind == df$ind[i], ]
  individual[i] <- data
  plot.boxplot <- function(data){
    boxplot(depth ~ hour, dat = data, xlab = "Hour of day", ylab = "Depth (m)")
  }
}
par(mfrow=c(2,2),mar=c(5,4,3,1))
plot.boxplot(individual)
I realized that this loop might be inappropriate, but I am still learning. I can make the boxplot for each individual one at a time, but I would like a faster, more efficient way of selecting the data for each individual and creating or storing the boxplots. This will be very useful when I have many more individuals (instead of doing one at a time...). Thanks a lot in advance.
What about something like this?
par(mfrow = c(2, 2))
invisible(
  by(df, df$ind,
     function(x)
       boxplot(depth ~ hour, data = x, xlab = "Hour of day", ylab = "Depth (m)")
  )
)
To provide some explanation: this draws a boxplot for each group of cases in df defined by df$ind. The invisible() wrapper just keeps the statistics that boxplot returns from being printed to the console.
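For comparison, a roughly equivalent version using split() and lapply() (just a sketch; it performs the same grouping as by()):
par(mfrow = c(2, 2), mar = c(5, 4, 3, 1))
invisible(
  lapply(split(df, df$ind),
         function(x) boxplot(depth ~ hour, data = x, xlab = "Hour of day", ylab = "Depth (m)"))
)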

R: assigning value from one column to another

This probably has a very simple answer, but I'm having trouble figuring it out...
What is a vectorized way to take a value from a cell in one column of a data frame, conditional on some criterion in that row being satisfied, and assign it to the cell in the same row but in a different column? I've done it with loops over if-else statements, but I'm working with pretty big data sets, and my little laptop freezes for many minutes while it works through the looping conditionals.
E.g. if I have something like this:
Results$TResponseCorrect[Results$rownum %in% CorrectTs$rownum] <- 1
that works fine. But what doesn't work is something like
Results$TResponseCorrect[Results$rownum %in% CorrectTs$rownum] <- Results$TCorrect
In that case I get a warning saying, "number of items to replace is not a multiple of replacement length", which I basically take to mean that it can't figure out which cell of the Results$Subject column to take.
Since your problem statement implies that these are all in the same data frame, you may want:
Results$TResponseCorrect[Results$rownum %in% CorrectTs$rownum] <-
Results$TCorrect[Results$rownum %in% CorrectTs$rownum]
It will then have the same number of items on the LHS and the RHS of the assignment.
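A minimal self-contained illustration of that fix, with made-up toy data (the values are hypothetical; only the pattern matters):
Results <- data.frame(rownum = 1:6,
                      TCorrect = c(1, 0, 1, 1, 0, 1),
                      TResponseCorrect = NA)
CorrectTs <- data.frame(rownum = c(2, 4, 5))

# build the logical index once, then use it on both sides of the assignment
sel <- Results$rownum %in% CorrectTs$rownum
Results$TResponseCorrect[sel] <- Results$TCorrect[sel]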

Endless function/loop in R: Data Management

I am trying to restructure an enormous data frame (about 12,000 cases): in the old data frame one person is one row and has about 250 columns (e.g. Person 1, testA1, testA2, testB1, ...), and I want all the results for a given item (there are 10 A's overall, A1-A10, and 24 items, A-Y) for that person in one column, so one person ends up with 24 columns and 10 rows. There is also a fixed part of the data frame before the items A-Y start (personal information like age, gender, etc.), which I want to keep as it is (fixdata).
The function/loop works for 30 cases (I tried it in advance), but for the 12,000 it is still running, for nearly 24 hours now. Any ideas why?
restructure <- function(data, firstcol, numcol, numsets){
  # start from a dummy row so rbind has something to bind to
  out <- data.frame(t(rep(0, (firstcol - 1) + numcol)))
  names(out) <- names(data)[1:(firstcol + numcol - 1)]
  for (i in 1:nrow(data)){
    fixdata <- data[i, 1:(firstcol - 1)]
    for (j in seq(firstcol, (firstcol - 1) + numcol * numsets, by = numcol)){
      flexdata <- data[i, j:(j + numcol - 1)]
      tmp <- cbind(fixdata, flexdata)
      names(tmp) <- names(data)[1:(firstcol + numcol - 1)]
      out <- rbind(out, tmp)
    }
  }
  out <- out[2:nrow(out), ]  # drop the dummy first row
  return(out)
}
Thanks in advance!
As to why: you rbind to out in each iteration. This takes longer with each iteration as out grows, so you have to expect worse-than-linear growth in run time as the data set gets bigger.
So, as Andrie says, you can look at melt.
Or you can do it with base R: stack.
Then you need to cbind the fixed part to the result yourself (repeating the fixed columns with each = n.var.cols).
A third alternative would be array2df from package arrayhelpers.
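To tackle the growing-object problem directly while keeping the loop structure, one option (a sketch only, using the same arguments as the restructure() function above) is to collect the pieces in a preallocated list and call rbind once at the end:
restructure2 <- function(data, firstcol, numcol, numsets){
  pieces <- vector("list", nrow(data) * numsets)  # preallocate one slot per output row
  k <- 0
  for (i in seq_len(nrow(data))){
    fixdata <- data[i, 1:(firstcol - 1)]
    for (j in seq(firstcol, (firstcol - 1) + numcol * numsets, by = numcol)){
      flexdata <- data[i, j:(j + numcol - 1)]
      names(flexdata) <- names(data)[firstcol:(firstcol + numcol - 1)]
      k <- k + 1
      pieces[[k]] <- cbind(fixdata, flexdata)
    }
  }
  do.call(rbind, pieces)  # a single rbind instead of one per iteration
}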
I agree with the others: look into the reshape2 and plyr packages; I just want to add a little in another direction. In particular, melt, cast and dcast might help you. Plus, it might help to make use of smart column names, e.g.:
As<-grep("^testA",names(yourdf))
# returns a vector with the column positions of all testA1 through testA10 columns
Besides, if you 'spend' the two dimensions of a data.frame on test number and test item, there's obviously none left for the person. Sure, you can identify people by an ID, which you could map to an aesthetic when plotting, but depending on what you want to do you might want to store them in a list instead (see the sketch below), so you end up with a list of persons, with a data.frame for every person. I am not sure exactly what you are trying to do, but I hope this helps.
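As a sketch of that list idea (PersonID is a hypothetical column name here):
# one data.frame per person, in a named list
person_list <- split(yourdf, yourdf$PersonID)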
Maybe you're not getting on with plyr or the other data-reshaping functions. How about something more direct and low level? If you currently have one row per person that goes A1, A2, A3, ..., A10, B1-B10, etc., then extract that lump of stuff from your data frame (I'm guessing columns 11-250), reshape that section into the form you want, and then put the pieces back together.
yDat <- data[, 11:250]
yDF <- lapply( 1:nrow(data), function(i) matrix(unlist(yDat[i,]), ncol = 24) ) #reshape each row into a 10 x 24 matrix
yDF <- do.call(rbind, yDF) #combine the list of matrices returned above into one
yDF <- data.frame(yDF) #get it back into a data.frame
names(yDF) <- LETTERS[1:24] #might as well name the columns
That's the fastest way to get the bulk of your data into the shape you want. All the lapply call did was reshape each person's row into a 10 x 24 matrix and return those matrices as a list, which the subsequent lines then massage into a data.frame. But now it doesn't have any of your ID information from the main data.frame. You just need to replicate each row of the first 10 columns 10 times, or you can use the convenience function merge to help with that: make a common column that is already in your first 10 columns one of the columns of the new data.frame, and then just merge them.
yInfo <- data[, 1:10]
ID <- yInfo$ID
yDF$ID <- rep( yInfo$ID, each = 10 )
newDat <- merge(yInfo, yDF)
And now you're done... mostly. You might want to make an extra column that numbers the new rows:
newDat$condNum <- rep(1:10, nrow(newDat)/10)
This will be very fast-running code. Your data.frame really isn't that big at all, and much of the above will execute in a couple of seconds.
This is how you should be thinking about data in R. Not that there aren't convenience functions to handle the bulk of this, but you should be working in ways that avoid looping as much as possible. Technically, the code above has only one loop, the lapply used right at the start, and there is very little inside that loop (loops should be compact when you do use them). You're writing scalar code, and that is very, very slow in R... even before you start abusing memory by growing objects as you go. Furthermore, keep in mind that, while you can't always avoid a loop of some kind, you can almost always avoid nested loops, which is one of your biggest problems here.
(read this to better understand your problems in this code... you've made most of the big errors in there)
