Using mapply() in R over rows, vs. columns

I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables: the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores, gender, age) {
  # pick the block of columns in conversion_table (a global) that
  # corresponds to this gender/age band
  if (gender == "Male") {
    if (8 <= age & age <= 11) {convertvec <- 1:12}
    if (12 <= age & age <= 14) {convertvec <- 13:24}
  } else if (gender == "Female") {
    if (8 <= age & age <= 11) {convertvec <- 25:36}
    if (12 <= age & age <= 14) {convertvec <- 37:48}
  }
  converted_scores <- rep(0, 12)
  for (z in 1:12) {
    converted_scores[z] <- conversion_table[(unlist(rawscores) + 1)[z],
                                            convertvec[z]]
  }
  return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[, 21:32] <- 0
for (x in 1:dim(mydata)[1]) {
  mydata[x, 21:32] <- converter(mydata[x, 7:18],
                                mydata[x, "gender"],
                                mydata[x, "age"])
}
This works, but as I say below, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the conversion table stores a raw score of zero in its first row.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it with a loop of the form for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, as follows:
apply(X = mydata[, 1:12], MARGIN = 1,
      FUN = converter, gender = mydata[, "gender"], age = mydata[, "age"])
Unfortunately, R doesn't seem to approve of this approach: it does not iterate through the vectors passed to the subsequent arguments, but rather passes each of them whole as a single argument. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows instead of columns.
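The closest I've been able to imagine (sketch only, untested; converter and mydata as above) is splitting the data frame into a list of rows so that mapply() can walk all three arguments in parallel:
# split() produces a list holding one one-row data frame per row of mydata;
# mapply() then walks that list together with the gender and age vectors.
# Each call returns 12 scores, so the simplified result is a 12 x nrow
# matrix -- transpose it to get one row per record.
scores_list <- split(mydata[, 7:18], seq_len(nrow(mydata)))
out <- t(mapply(converter, scores_list, mydata$gender, mydata$age))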
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.

You could rewrite 'converter' so that it takes vectors of gender and age plus a row index, which you then use to do lookups and assignments into converted_scores, using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply: it will convert its entire X argument to "character" class, because the gender column is "character". It also wasn't clear whether your normdf[ rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in absence of working example (with normdf, mydata):
converted_scores <- matrix(NA, nrow = NROW(rawscores), ncol = 12)
converter <- function(idx, gender, age) {
  gidx <- match(gender, c("Male", "Female"))
  aidx <- findInterval(age, c(8, 12, 15))
  # one row of cvt.arr per gender/age band: 2 gender categories, and aidx
  # runs over the valid age categories
  ag.idx <- gidx + 2 * (aidx - 1)
  cvt <- cvt.arr[ag.idx, ]
  # matrix indexing: one (row, column) pair per subscale
  converted_scores[idx, ] <- normdf[cbind(unlist(rawscores[idx, ]) + 1, cvt)]
  return(converted_scores)
}
cvt.arr <- matrix(1:48, nrow = 4, byrow = TRUE)[c(1, 3, 2, 4), ]  # the genders alternate
cvt.scores <- mapply(converter, 1:NROW(mydata), mydata$gender, mydata$age)

I'd advise against applying this stuff by row, and would rather apply it by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24 * (gender == "Female") + 12 * (age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[, 1:12] <- matrix(idxs$norm, ncol = 12)
The tricky part here is the idxs data frame, which ties all the rest together (see the small illustration just below the list). It has the following columns:
row and col: position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on the original value + 1
norm: corresponding normalized value
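If the matrix-indexing trick in the last two lines is unfamiliar: indexing with a two-column matrix of (row, col) pairs pulls out one element per pair, which is what makes this approach loop-free. A tiny standalone illustration with made-up numbers:
m <- matrix(1:12, nrow = 3)
m[cbind(c(1, 3, 2), c(4, 1, 2))]  # c(m[1,4], m[3,1], m[2,2]) == c(10, 3, 5)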
I'll post this as a first thought, and see whether I can come up with a better answer, either based on joran's comment or using a three- or four-dimensional array for normdf. Not sure yet.

Related

How to optimize for loops and rbinds with large datasets

I am currently working on a large dataset (~1.5M entries) using R, a language I am not yet completely familiar with.
Basically, what I am trying to do is the following:
I want to check what happens during a time interval after "Start".
"Start" represents a few temporal values within every "Trial", and "Trial" represents all of the trials recorded for one "Reference".
So for each Reference, I want to check all Trials and see what happens after "Start" during that Trial.
It doesn't matter much if what I'm trying to do is still obscure; the point is that I want to check every row in my data frame.
My instinctive (understand, R-noob-ish) way of programming this function led me to a piece of code which I know is far from being optimized, and takes a LOT of time to run.
My_Function <- function(DataFrame) {
  counts <- data.frame()
  for (reference in DataFrame$Ref) {
    ref_tested <- subset(DataFrame, Ref == reference)
    ref_count <- data.frame()
    for (trial in ref_tested$Trial) {
      trial_tested <- subset(ref_tested, Trial == trial)
      for (timing in trial_tested$Start) {
        # some_time is the interval length of interest, defined elsewhere
        interesting <- subset(DataFrame,
                              Start > timing & Start <= timing + some_time & Trial == trial)
        ref_count <- rbind(ref_count, as.data.frame(table(interesting$ele)))
      }
    }
    temp <- aggregate(Freq ~ Var1, data = ref_count, FUN = sum)
    counts <- rbind(counts, temp)
  }
  return(counts)
}
Here, as.data.frame(table(interesting$ele)) can have a different length on each iteration, and thus so can ref_count.
I failed to find a way to grow my data frame without using rbind, but I also know that, given the size of my output, it is not time-efficient at all (the sketch below shows the pattern I've since been pointed to).
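(For what it's worth, the pattern I've since been pointed to is to collect the pieces in a list and bind them once at the end, instead of growing the data frame inside the loop. A rough sketch, where make_piece() and n_pieces are hypothetical stand-ins for whatever builds one chunk:)
# Hypothetical sketch: build each chunk, store it in a pre-allocated list,
# and do a single rbind at the end instead of one rbind per iteration.
pieces <- vector("list", n_pieces)
for (k in seq_len(n_pieces)) {
  pieces[[k]] <- make_piece(k)   # make_piece() is a stand-in, not a real function
}
result <- do.call(rbind, pieces)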
Also, I have programmed in other languages such as Python and C++ (a long time ago), and I know that three nested for loops usually mean that you're doing it wrong. But then again, I did not find a way to avoid doing that in this particular case.
So, do you have any advice on how to use R, or one of its package, to avoid such a situation?
Thank you in advance,
K.
EDIT :
Thank you for your first advices.
I tried the 'plyr' package and was able to reduce the size of my code chunk; it does what's expected and is more understandable. Plus, I was able to produce some example data for reproducibility. See:
library(plyr)

# Example input
DF <- data.frame(sample(1:400, 500000, replace = TRUE),
                 sample(1:25, 500000, replace = TRUE),
                 rnorm(n = 500000, mean = 1, sd = 1))
colnames(DF) <- c("Trial", "Ref", "Start")
DF$rn <- rownames(DF)
tempDF <- DF[sample(nrow(DF), 100), ]  # for testing purposes
Test <- ddply(.data = tempDF, "rn", function(x) {
  interesting <- subset(DF,
                        Trial == x$Trial &
                          Start > x$Start &
                          Start < x$Start + some_time)  # some_time defined elsewhere
  interesting$Elec <- x$Ref
  return(interesting)
})
This is nice, but I still feel like it is not the way to go; in this example, we only browse 100 observations, which takes ~4 sec (I used system.time()), but if I wanted to scan all 500,000 observations of DF, it would take more than 5 hours.
I have checked data.table, but I am still trying to understand how to use it.
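(In case it helps, this is the rough shape of the data.table idea as I currently understand it; untested, and some_time is as before:)
library(data.table)
DT <- as.data.table(DF)
DT[, End := Start + some_time]
# Non-equi self-join: for each row, pull the rows of the same Trial whose
# Start lies in (Start, Start + some_time] -- no explicit loop required.
hits <- DT[DT, on = .(Trial, Start > Start, Start <= End),
           allow.cartesian = TRUE, nomatch = 0L]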

Allocate people to teams based on 33/67% percentile of variable

I have a dataset where I would like to allocate people to different groups based on criteria, and I would like R to do this automatically. I have separated my variables into the <= .33 percentile, the >= .67 percentile, and everything in between.
dfOCEAN <- df[1:60, 1:7]
print(colnames(dfOCEAN))
dfOCEAN <- dfOCEAN[complete.cases(dfOCEAN), ]

for (i in 1:length(dfOCEAN$factor_e)) {
  if (dfOCEAN$factor_e[i] <= quantile(dfOCEAN$factor_e, .33)) {
    dfOCEAN$Introversion[i] <- 1
  } else if (dfOCEAN$factor_e[i] >= quantile(dfOCEAN$factor_e, .67)) {
    dfOCEAN$Introversion[i] <- 2
  } else {
    dfOCEAN$Introversion[i] <- 3
  }
}

for (i in 1:length(dfOCEAN$factor_c)) {
  if (dfOCEAN$factor_c[i] <= quantile(dfOCEAN$factor_c, .33)) {
    dfOCEAN$Conscientious[i] <- 1
  } else if (dfOCEAN$factor_c[i] >= quantile(dfOCEAN$factor_c, .67)) {
    dfOCEAN$Conscientious[i] <- 2
  } else {
    dfOCEAN$Conscientious[i] <- 3
  }
}
Then I am trying to create random samples with dplyr's slice() function.
dfOCEANset <- dfOCEAN %>% group_by(c(Introversion, Conscientious)) %>% slice(sample(c(1,2),1))
However, I am unable to get the desired results. Ideally, I would retrieve a data frame whereby the data would be clustered by the combinations of the different categories, and the names would remain.
Try this loop-less (but untested in the absence of a reproducible example) method:
dfOCEAN$fac_grp <- c(1, 3, 2)[ findInterval(dfOCEAN$factor_e,
                               quantile(dfOCEAN$factor_e, c(0, .33, .67))) ]
R is intended to be used as a "vectorized" language: both the findInterval and quantile functions return vectors, with findInterval giving a vector the same length as its first argument. You added a little wrinkle by asking to arrange the groups in a rather unnatural order, which I handled by using the result from findInterval as an index into a three-item vector. The other function that does something similar (but returns a factor) is the cut function.
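For completeness, a cut() version might look like this (same caveat: untested in the absence of a reproducible example):
# cut() returns a factor; the label order reproduces the 1/3/2 coding above
dfOCEAN$fac_grp <- cut(dfOCEAN$factor_e,
                       quantile(dfOCEAN$factor_e, c(0, .33, .67, 1)),
                       labels = c(1, 3, 2), include.lowest = TRUE)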

Double "for loops" in a dataframe in R

I need to do a quality control in a dataset with more than 3000 variables (columns). However, I only want to apply some conditions in a couple of them. A first step would be to replace outliers by NA. I want to replace the observations that are greater or smaller than 3 standard deviations from the mean by NA. I got it, doing column by column:
height <- ifelse(abs(height - mean(height, na.rm = TRUE)) <
                   3 * sd(height, na.rm = TRUE), height, NA)
And I also want to create other variables based on different columns. For example:
data$CGmark <- ifelse(!is.na(data$mark) & !is.na(data$height),
                      paste(data$age, data$mark, sep = ""), NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1 <- names(data)
list <- c("age", "height", "mark")  # note: this masks the built-in list() (see answer below)
ntraits <- length(list)
nrows <- dim(data)[1]
for (i in 1:ntraits) {
  a <- list[i]
  b <- which(d1 == a)
  d2 <- data[, b]
  for (j in 1:nrows) {
    d2[j] <- ifelse(abs(d2[j] - mean(d2, na.rm = TRUE)) < 3 * sd(d2, na.rm = TRUE),
                    d2[j], NA)
  }
}
Someone told me that I am not storing d2. How can I write for loops that apply the conditions I want? I know that there are similar questions, but I didn't get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, function dispatch is a tiny bit slower than otherwise, but the code is often easier to read and debug. Same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x <- sd(x, na.rm = TRUE)
  # TRUE for points beyond 3 SDs; existing NAs stay NA and are left alone
  x_outside_3s <- !is.na(x) & abs(x - mean_x) > 3 * sd_x
  x[x_outside_3s] <- NA  # no need for ifelse here
  x
}
Of course, you can choose any function name you want. More descriptive is better.
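A quick sanity check on made-up numbers:
x <- c(rnorm(100), 50)   # 100 well-behaved values plus one wild outlier
NA_outside_3s(x)[101]    # the 50 should come back as NA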
Then if you want to apply the function to every column, just loop over the columns. The function NA_outside_3s is already vectorized, i.e. it takes a vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data)  # or some subset of columns
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in and a fairly important one at that. You also shouldn't generally name variables data, because there is also a data function, for loading built-in data sets.

R: rewrite loop with apply

I have the following type of data set:
id;2011_01;2011_02;2011_03; ... ;2001_12
id01;NA;NA;123; ... ;NA
id02;188;NA;NA; ... ;NA
That is, each row is a unique customer and each column depicts a trait for this customer from the past 10 years (each month has its own column). The thing is that I want to condense this 120-column data frame into a 10-column data frame, because I know that almost all rows have (although the month itself can vary) 1 or 0 observations from each year.
I've already done this, one year at a time, using a loop with a nested if-clause:
for (i in 1:nrow(input_data)) {
  temp_row <- input_data[i, c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06",
                              "2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  loc2011 <- which(!is.na(temp_row))
  if (length(loc2011) > 0) {
    temp_row_2011[i, ] <- temp_row[loc2011[1]]  # pick the first observation if there are several
  } else {
    temp_row_2011[i, ] <- NA
  }
}
Since my data set is quite big, and I need to perform the above loop ten times (once for each year), this is taking far too much time. I know one is much better off using the apply commands in R, so I would greatly appreciate help with this task. How could I write the whole thing (including the different years) better?
Are you after something like this?
temp_row_2011 <- apply(input_data, 1, function(x) {
  temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06",
                  "2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  temp_row[!is.na(temp_row)][1]
})
If this gives you the right output, and if it runs faster than your loop, that's not necessarily due only to using apply(): it also assigns less stuff and avoids an if {} else {}. You might be able to make it go even faster by compiling the anonymous function:
reduceyear <- function(x) {
  temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06",
                  "2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  temp_row[!is.na(temp_row)][1]
}
# compile, just in case it runs faster:
reduceyear_c <- compiler::cmpfun(reduceyear)
# this ought to do the same as the above
temp_row_2011 <- apply(input_data, 1, reduceyear_c)
You didn't say whether input_data is a data.frame or a matrix, but a matrix would be faster than a data frame (though that is only valid if the columns of input_data are all of the same class).
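If you want to check that claim on your own data, a quick sketch (assuming reduceyear_c from above and your real input_data):
# apply() has to coerce a data.frame to a matrix before it can iterate,
# so handing it a matrix up front skips that conversion.
input_mat <- as.matrix(input_data)
system.time(apply(input_data, 1, reduceyear_c))
system.time(apply(input_mat, 1, reduceyear_c))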
[EDIT: full example, motivated by DWin]
input_data <- matrix(ncol = 24, nrow = 10)
# years and months:
colnames(input_data) <- c(paste(2010, 1:12, sep = "_"), paste(2011, 1:12, sep = "_"))
# some ids:
rownames(input_data) <- 1:10
# put in some values:
input_data[sample(1:length(input_data), 200, replace = FALSE)] <- round(runif(200, 100, 200))
# make an all-NA case:
input_data[2, 1:12] <- NA
# and here's the full deal:
sapply(2010:2011, function(x, input_data) {
  input_data_yr <- input_data[, grep(x, colnames(input_data))]
  apply(input_data_yr, 1, function(id) {
    id[!is.na(id)][1]
  })
}, input_data)
The all-NA case works. The grep() column selection idea was lifted from DWin. As in the above example, you could define the anonymous interior function separately and compile it to potentially make the thing run faster.
I built a tiny test case (for which timriffe's suggestion fails). You might attract more interest by putting up code that creates a more complete test case, such as 4 quarters for 2 years, including pathological cases such as all NAs in one row of one year. I would think that instead of requiring you to write out all the year columns by name, you ought to cycle through them with a grep() strategy:
# funyear <- a function that works on one year's data and returns a single vector
# my efforts keep failing on the all-NA row-by-year combos
sapply(2011:2001, function(pat) funyear(input_data[grep(pat, names(input_data))]))

How to rewrite this Stata code in R?

One of the things Stata does well is the way it constructs new variables (see the example below). How can this be done in R?
foreach i in A B C D {
    forval n = 1990/2000 {
        local m = `n' - 1
        // create new columns from existing ones on-the-fly
        generate pop`i'`n' = pop`i'`m' * (1 + trend`n')
    }
}
DON'T do it in R. The reason it's messy is that it's UGLY code. Constructing lots of variables with programmatic names is a BAD THING. Names are names; they have no structure, so do not try to impose one on them. Decent programming languages have structures for this; rubbishy programming languages have tacked-on 'macro' features and end up with this awful pattern of constructing variable names by pasting strings together. This is a practice from the 1970s that should have died out by now. Don't be a programming dinosaur.
For example, how do you know how many popXXXX variables you have? How do you know whether you have a complete sequence from pop1990 to pop2000? What if you want to save the variables to a file to give to someone? Yuck, yuck, yuck.
Use a data structure that the language gives you. In this case probably a list.
Both Spacedman and Joshua have very valid points. As Stata has only one dataset in memory at any given time, I'd suggest adding the variables to a data frame (which is also a kind of list) instead of to the global environment (see below).
But honestly, the more R-ish way is to keep your factors as factors instead of encoding them in variable names.
I'll make some data as I believe it is in your R version now (at least, I hope so...):
Data <- data.frame(
  popA1989 = 1:10,
  popB1989 = 10:1,
  popC1989 = 11:20,
  popD1989 = 20:11
)
Trend <- replicate(11, runif(10, -0.1, 0.1))
You can then use the stack() function to obtain a data frame where you have a factor pop and a numeric variable year:
newData <- stack(Data)
newData$ind <- as.character(newData$ind)  # substr() needs character, not factor
newData$pop <- substr(newData$ind, 4, 4)
newData$year <- as.numeric(substr(newData$ind, 5, 8))
newData$ind <- NULL
Filling up the data frame is then quite easy:
for (i in 1:11) {
  tmp <- newData[newData$year == (1988 + i), ]
  newData <- rbind(newData,
                   data.frame(values = tmp$values * (1 + Trend[, i]),  # multiply by (1 + trend)
                              pop = tmp$pop,
                              year = tmp$year + 1))
}
In this format, you'll find most R commands (selecting certain years, selecting a single population, modelling effects of either or both, ...) a whole lot easier to perform later on.
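For instance, a couple of throwaway examples using newData as built above:
subset(newData, pop == "A" & year >= 1995)            # one population, later years
aggregate(values ~ year, data = newData, FUN = mean)  # yearly means across populations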
And if you insist, you can still create a wide format with unstack():
unstack(newData, values ~ paste("pop", pop, year, sep = ""))
An adaptation of Joshua's answer (below) to add the columns to the data frame:
for (L in LETTERS[1:4]) {
  for (i in 1990:2000) {
    new <- paste("pop", L, i, sep = "")                  # create name for new variable
    old <- get(paste("pop", L, i - 1, sep = ""), Data)   # get old variable from Data
    trend <- Trend[, i - 1989]                           # get trend variable
    Data <- within(Data, assign(new, old * (1 + trend)))
  }
}
Assuming popA1989, popB1989, popC1989, popD1989 already exist in your global environment, the code below should work. There are certainly more "R-like" ways to do this, but I wanted to give you something similar to your Stata code.
for (L in LETTERS[1:4]) {
  for (i in 1990:2000) {
    new <- paste("pop", L, i, sep = "")           # create name for new variable
    old <- get(paste("pop", L, i - 1, sep = ""))  # get old variable
    trend <- get(paste("trend", i, sep = ""))     # get trend variable
    assign(new, old * (1 + trend))
  }
}
Assuming you have population data in the vector pop1989, and trend data in trend:
require(stringr)  # because str_c has a better default for the sep parameter
dta <- kronecker(pop1989, cumprod(1 + trend))
names(dta) <- kronecker(str_c("pop", LETTERS[1:4]), 1990:2000, str_c)
