R list/table manipulation replacing "for" loop with sapply? - r

I am attempting in R to just add a simple constant to a column of a table with e.g.
dim(exampletable)
[1] 3900 2
To add a value to the second column, what I do, and what works, is:
newtable <- exampletable
for (i in 1:nrow(newtable)){newtable[i,2] <- exampletable[i,2] + constant}
but this seems like overkill. Is there a more elegant way to do it with, say, sapply?
Thanks, Johannes

R is vectorised and has very handy syntax for operations that tend to be more verbose in other languages. What you have described is possibly the worst implementation of what you want to do, and pretty much the antithesis of what R is about. Instead, use R's inbuilt vectorisation and live a happy long life!
There are so many ways to do this, but the canonical way (excepting the use of column index integers rather than column names) is:
newtable[,2] <- newtable[,2] + constant
e.g.
df <- data.frame( x = 1:3 )
df$y <- df$x + 1
df
# x y
#1 1 2
#2 2 3
#3 3 4
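As a quick check that the vectorised form gives the same result as the loop from the question, here is a minimal sketch. The data values and the value of constant are made up for illustration; only the 3900 x 2 shape comes from the question.

```r
# Hypothetical stand-in for the question's 3900 x 2 table.
exampletable <- data.frame(x = seq_len(3900), y = seq_len(3900) / 10)
constant <- 5  # assumed value, not from the question

# Loop version, as in the question.
looped <- exampletable
for (i in seq_len(nrow(looped))) {
  looped[i, 2] <- exampletable[i, 2] + constant
}

# Vectorised version, as in the answer.
vectorised <- exampletable
vectorised[, 2] <- vectorised[, 2] + constant

# Compare the two results.
all.equal(looped, vectorised)
```

sapply would also work here, but it is not needed: + already operates on the whole column at once.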
I recommend reading up on the basics of R. There are several good tutorials on the info page of the r tag.

Try this:
#Dummy data
exampletable <- data.frame(x=runif(3900), y=runif(3900))
#Define new constant
MyConstant <- 10
#Make newtable with MyConstant update
newtable <- exampletable
newtable$y <- newtable$y + MyConstant
These are the basics of the R language; reading an introductory manual will help.

Related

Order of column in R

I want to get the number in order of the column in a dataframe.
df <- data.frame(item = rep(c('a','b','c'), 3),
year = rep(c('2010','2011','2012'), each=3),
count = c(1,4,6,3,8,3,5,7,9))
Let's say the function I am looking for is columnorder. I want to have this result:
x <- columnorder(df$count)
x
> 3
x <- columnorder(df$item)
x
> 1
It seems like a basic task but I couldn't find the answer until now. I will appreciate your help. Thank you
You said,
It seems like a basic task but I couldn't find the answer until now.
In the general sense what you are trying to do -- translate a column name into a column index -- is basic, and a pretty common question. However, the particular scenario you describe above, where your input is of the form object_name$column_name, is atypical WRT what you are trying to achieve, which is most likely why you haven't found an existing solution.
In short, the problem is that when you pass an argument as df$count, you may as well just have used c(1,4,6,3,8,3,5,7,9) instead, because df$count will be evaluated as c(1,4,6,3,8,3,5,7,9). Of course, R does allow for a fair bit of metaprogramming, so with a little extra work, this could be implemented as, for example
column_order <- function(expr) {
x <- strsplit(deparse(substitute(expr)), "$", TRUE)[[1]]
match(x[2], names(get(x[1])))
}
column_order(df$item)
#[1] 1
column_order(df$year)
#[1] 2
column_order(df$count)
#[1] 3
But as I said above, this is an atypical interface for what you are ultimately trying to do. A much more standard approach would be for this function to accept the column name (typically as a string) and the target object as arguments, in which case the solution is much simpler:
column_order2 <- function(col, obj) match(col, names(obj))
column_order2("item", df)
#[1] 1
column_order2("year", df)
#[1] 2
column_order2("count", df)
#[1] 3
As proposed in the comments by @mtoto, here is one solution:
x <- which(colnames(df) == "count")
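The match() and which() approaches above agree when the column exists, but they behave differently when it does not. A small sketch, using a cut-down version of the question's df and a made-up name "missing":

```r
df <- data.frame(item = c("a", "b"),
                 year = c("2010", "2011"),
                 count = c(1, 4))

# Both return the position of an existing column.
pos_match <- match("count", names(df))
pos_which <- which(names(df) == "count")

# For an absent column, match() gives NA while which() gives a zero-length vector.
miss_match <- match("missing", names(df))
miss_which <- which(names(df) == "missing")

# match() also handles several names at once, in the order requested.
several <- match(c("count", "item"), names(df))
```

Which behaviour is preferable depends on how the caller wants to detect a missing column: testing is.na() on the match() result, or testing length() on the which() result.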

Changing columns positions in a data frame without total reassignment

I want to swap two columns in a data.frame. I know I could do something like:
dd <- dd[c(1:4, 6:5, 7:10)]
But I find it inelegant, potentially slow (?) and not program-friendly (you need to know length(dd), and even have some cases if the swapped columns are close or not to that value...)
Is there an easy way to do it without reassigning the whole data frame?
dd[2:3] <- dd[3:2]
This turns out to be very "lossy" because [<- only affects the values, not the attributes. So for instance:
(dd <- data.frame( A = 1:4, Does = 'really', SO = 'rock' ) )
dd[3:2]
dd[2:3] <- dd[3:2]
print(dd)
The column names are obviously not flipped...
Any idea? I could also add a small custom function to my very long list, but grrr... should be a way. ;-)
It's not a single function, but relatively simple:
dd[replace(seq(dd), 2:3, 3:2)]
A SO Does
1 1 rock really
2 2 rock really
3 3 rock really
4 4 rock really
This:
dd[,2:3] <- dd[,3:2]
works, but you have to update the names as well:
names(dd)[2:3] <- names(dd)[3:2]
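If the swap is needed often, the index-reordering idea can be wrapped in a small helper that takes column names, so the caller does not need to know the positions or length(dd). This is only a sketch; swap_cols is a hypothetical name, not an existing function, and it still builds a reordered data frame rather than modifying columns in place.

```r
# Hypothetical helper: swap two columns, identified by name.
swap_cols <- function(data, a, b) {
  idx <- seq_along(data)               # current column order
  pos <- match(c(a, b), names(data))   # positions of the two columns
  stopifnot(!anyNA(pos))               # both names must exist
  idx[pos] <- idx[rev(pos)]            # exchange the two positions
  data[idx]                            # names travel with the columns
}

dd <- data.frame(A = 1:4, Does = "really", SO = "rock")
swapped <- swap_cols(dd, "Does", "SO")
names(swapped)
```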

R- Please help. Having trouble writing for loop to lag date

I am attempting to write a for loop which will take subsets of a dataframe by person id and then lag the EXAMDATE variable by one for comparison. So a given row will have the original EXAMDATE and also a variable EXAMDATE_LAG which will contain the value of the EXAMDATE one row before it.
for (i in length(uniquerid))
{
temp <- subset(part2test, RID==uniquerid[i])
temp$EXAMDATE_LAG <- temp$EXAMDATE
temp2 <- data.frame(lag(temp, -1, na.pad=TRUE))
temp3 <- data.frame(cbind(temp,temp2))
}
It seems that I am creating the new variable just fine but I know that the lag won't work properly because I am missing steps. Perhaps I have also misunderstood other peoples' examples on how to use the lag function?
So that this can be fully answered: there are a handful of things wrong with your code. Lucaino has pointed one out. Each time through your loop you create temp, temp2, and temp3 (or overwrite the old ones), and thus you'll be left with only the output of the last pass through the loop.
However, this isn't something that needs a loop. Instead, you can make use of the vectorized nature of R:
x <- 1:10
> c(x[-1], NA)
[1] 2 3 4 5 6 7 8 9 10 NA
So if you combine that notion with a library like plyr that splits data nicely you should have a workable solution. If I've missed something or this doesn't solve your problem, please provide a reproducible example.
library(plyr)
myLag <- function(x) {
c(x[-1], NA)
}
ddply(part2test, .(RID), transform, EXAMDATE_LAG=myLag(EXAMDATE))
You could also do this in base R using split or the data.table package using its by= argument.
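A base R version might look like the sketch below. The sample data is invented, and it assumes the rows are already sorted by EXAMDATE within each RID. Note the direction of the shift: c(x[-1], NA) pairs each row with the following value, whereas the question asks for the value one row before, so this sketch shifts the other way. It shifts row indices rather than the dates themselves, which keeps the Date class intact.

```r
# Hypothetical sample data; only the column names come from the question.
part2test <- data.frame(
  RID = c(1, 1, 1, 2, 2),
  EXAMDATE = as.Date(c("2010-01-01", "2010-06-01", "2011-01-01",
                       "2010-03-01", "2010-09-01"))
)

# For each row, the index of the previous row within the same RID (NA for the first).
prev_idx <- ave(seq_len(nrow(part2test)), part2test$RID,
                FUN = function(i) c(NA, head(i, -1)))

# Indexing with NA yields NA, so the first exam of each person has no lagged date.
part2test$EXAMDATE_LAG <- part2test$EXAMDATE[prev_idx]
part2test
```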

Assigning output of a function to two variables in R [duplicate]

This seems like an easy question, but I can't figure it out and I haven't had luck in the R manuals I've looked at. I want to find dim(x), but I want to assign dim(x)[1] to a and dim(x)[2] to b in a single line.
I've tried [a b] <- dim(x) and c(a, b) <- dim(x), but neither has worked. Is there a one-line way to do this? It seems like a very basic thing that should be easy to handle.
This may not be as simple of a solution as you had wanted, but this gets the job done. It's also a very handy tool in the future, should you need to assign multiple variables at once (and you don't know how many values you have).
Output <- SomeFunction(x)
VariablesList <- letters[seq_along(Output)]
for (i in seq_along(Output)) {
assign(VariablesList[i], Output[i])
}
Loops aren't the most efficient things in R, but I've used this multiple times. I personally find it especially useful when gathering information from a folder with an unknown number of entries.
EDIT: And in this case, Output could be any length (as long as VariablesList is longer).
EDIT #2: Changed up the VariablesList vector to allow for more values, as Liz suggested.
You can also write your own function that will always make a global a and b. But this isn't advisable:
mydim <- function(x) {
out <- dim(x)
a <<- out[1]
b <<- out[2]
}
The "R" way to do this is to output the results as a list or vector just like the built in function does and access them as needed:
out <- dim(x)
out[1]
out[2]
R has excellent list and vector comprehension that many other languages lack and thus doesn't have this multiple assignment feature. Instead it has a rich set of functions to reach into complex data structures without looping constructs.
Doesn't look like there is a way to do this. Really the only way to deal with it is to add a couple of extra lines:
temp <- dim(x)
a <- temp[1]
b <- temp[2]
It depends what is in a and b. If they are just numbers try to return a vector like this:
dim <- function(x,y)
return(c(x,y))
dim(1,2)[1]
# [1] 1
dim(1,2)[2]
# [1] 2
If a and b are something else, you might want to return a list
dim <- function(x,y)
return(list(item1=x:y,item2=(2*x):(2*y)))
dim(1,2)[[1]]
# [1] 1 2
dim(1,2)[[2]]
# [1] 2 3 4
EDIT:
try this: x <- c(1,2); names(x) <- c("a","b")
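Applied to the original dim(x) question, the naming idea can be sketched as follows. The matrix x here is a made-up 3 x 4 example. For this particular case, nrow() and ncol() also give the two values directly.

```r
x <- matrix(0, nrow = 3, ncol = 4)  # hypothetical example input

# Keep both values in one named vector and pull them out by name.
d <- setNames(dim(x), c("a", "b"))
d[["a"]]
d[["b"]]

# For dimensions specifically, base R already has dedicated accessors.
a <- nrow(x)
b <- ncol(x)
```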

Make nested loops more efficient?

I'm analyzing large sets of data using the following script:
M <- c_alignment
c_check <- function(x){
if (x == c_1) {
1
}else{
0
}
}
both_c_check <- function(x){
if (x[res_1] == c_1 && x[res_2] == c_2) {
1
}else{
0
}
}
variance_function <- function(x,y){
sqrt(x*(1-x))*sqrt(y*(1-y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
for (c_1 in 1:c_vector[res_1]){
res_1_insertion <- res_1_insertion + 1
insertion <- insertion + 1
res_1_subset <- sapply(M[,res_1], c_check)
freq_vector[insertion] <- sum(res_1_subset)/frames_total
res_2_insertion <- 0
for (res_2 in 1:cols){
if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
for (c_2 in 1:max(c_vector[res_2])){
res_2_insertion <- res_2_insertion + 1
both_res_subset <- apply(M, 1, both_c_check)
co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
}
}
}
}
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix
A model input would be something like this:
1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2
What I'm calculating is the binomial covariance for each state found in M[,i] with each state found in M[,j]. Each row is the state found for that trial, and I want to see how the state of the columns co-vary.
Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.
The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply function here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!
I thought of writing a comment, but I have too much to say.
First of all, if you think apply goes faster, look at "Is R's apply family more than syntactic sugar?". It might be, but it's far from guaranteed.
Next, please don't grow matrices as you move through your code; that slows it down enormously. Preallocate the matrix and fill it in, which can speed up your code more than tenfold. You're growing different vectors and matrices throughout your code, and that's insane (forgive the strong language).
Then, look at the help page of ?subset and the warning given there:
This is a convenience function intended for use interactively. For
programming it is better to use the standard subsetting functions like
[, and in particular the non-standard evaluation of argument subset
can have unanticipated consequences.
Always. Use. Indices.
Further, you recalculate the same values over and over again. freq_res_2, for example, is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Move whatever doesn't need recalculating out of your loops, and save it in matrices you can simply access again.
Heck, while I'm at it: please use vectorized functions. Think again and see what you can drag out of the loops. This is what I see as the core of your calculation:
cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2)))
As I see it, you can construct a matrix freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. And that will be the whole covariance matrix (don't call it cov, cov is a function). Exit loops. Enter fast code.
Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.
Let this be a start: The R Inferno
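One way the loop-free calculation could look is sketched below, using the small model input from the question rather than the real c_alignment data. It assumes the states in each column are the integers 1 up to that column's maximum. The name indicator is introduced here for illustration. It is a sketch of the idea, not a drop-in replacement for the original script.

```r
# Model input from the question: rows are trials, columns are positions.
M <- matrix(c(1, 2, 1, 4, 3,
              1, 3, 4, 2, 1,
              2, 3, 3, 3, 1,
              1, 1, 2, 1, 2,
              2, 3, 4, 4, 2), nrow = 5, byrow = TRUE)

# One 0/1 indicator column per (column, state) pair, states 1..max per column.
indicator <- do.call(cbind, lapply(seq_len(ncol(M)), function(j) {
  outer(M[, j], seq_len(max(M[, j])), "==") * 1
}))

n <- nrow(M)
freq <- colMeans(indicator)              # frequency of each (column, state) pair
freq_both <- crossprod(indicator) / n    # joint frequency of every pair of states

# The core formula, applied to whole matrices at once.
covariance <- freq_both - outer(freq, freq)
scale <- sqrt(outer(freq * (1 - freq), freq * (1 - freq)))
correlation <- covariance / scale
```

A state with frequency exactly 0 or 1 makes the denominator zero, so those entries come out as NaN and would need separate handling.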
It's not really the 4 way nested loops but the way your code is growing memory on each iteration. That's happening 4 times where I've placed # ** on the cbind and rbind lines. Standard advice in R (and Matlab and Python) in situations like this is to allocate in advance and then fill it in. That's what the apply functions do. They allocate a list as long as the known number of results, assign each result to each slot, and then merge all the results together at the end. In your case you could just allocate the correct size matrix in advance and assign into it at those 4 points (roughly speaking). That should be as fast as the apply family, and you might find it easier to code.
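The difference between growing and preallocating can be seen in a minimal sketch. The vector length n and the squared values are arbitrary choices for illustration.

```r
n <- 1000  # arbitrary size for illustration

# Growing: a new, longer vector is built on every iteration.
grown <- numeric(0)
for (i in seq_len(n)) {
  grown <- c(grown, i^2)
}

# Preallocating: the full-size vector is created once, then filled in place.
filled <- numeric(n)
for (i in seq_len(n)) {
  filled[i] <- i^2
}

# Both approaches produce the same values.
identical(grown, filled)
```

The same pattern carries over to matrices: create matrix(NA_real_, nrow, ncol) at the final size first, then assign into rows or cells, instead of calling rbind or cbind inside the loop.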

Resources