I have the following type of data set:
id;2011_01;2011_02;2011_03; ... ;2001_12
id01;NA;NA;123; ... ;NA
id02;188;NA;NA; ... ;NA
That is, each row is a unique customer and each column depicts a trait for this customer from the past 10 years (each month has its own column). The thing is that I want to condense this 120-column data frame into a 10-column data frame, because I know that almost all rows have 1 or 0 observations from each year (although the month itself can vary).
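(For reference, such a file could be read with something like the following sketch, where "customers.csv" is just a placeholder file name:
input_data <- read.csv("customers.csv", sep = ";", na.strings = "NA", stringsAsFactors = FALSE)
)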
I've already done this, one year at a time, using a loop with a nested if clause:
# (assumes temp_row_2011 has already been initialised with nrow(input_data) rows)
for(i in 1:nrow(input_data)) {
  temp_row <- input_data[i, c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  loc2011 <- which(!is.na(temp_row))
  if(length(loc2011) > 0) {
    temp_row_2011[i,] <- temp_row[loc2011[1]] # pick the first observation if there are several
  } else {
    temp_row_2011[i,] <- NA
  }
}
Since my data set is quite big, and I need to perform the above loop 10 times (once for each year), this is taking way too much time. I know one is much better off using apply commands in R, so I would greatly appreciate help on this task. How could I write the whole thing (including the different years) better?
Are you after something like this?
temp_row_2011 <- apply(input_data, 1, function(x){
  temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  temp_row[!is.na(temp_row)][1]
})
If this gives you the right output, and if it runs faster than your loop, the speed-up is not necessarily due only to using an apply(), but also to the fact that it assigns less and avoids an if {} else {}. You might be able to make it go even faster by compiling the anonymous function:
reduceyear <- function(x){
  temp_row <- x[c("2011_01","2011_02","2011_03","2011_04","2011_05","2011_06","2011_07","2011_08","2011_09","2011_10","2011_11","2011_12")]
  temp_row[!is.na(temp_row)][1]
}
# compile, just in case it runs faster:
reduceyear_c <- compiler::cmpfun(reduceyear)
# this ought to do the same as the above.
temp_row_2011 <- apply(input_data, 1, reduceyear_c)
You didn't say whether input_data is a data.frame or a matrix, but a matrix would be faster than a data frame (though only valid if all of input_data's columns are the same class of data).
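A minimal sketch of that conversion, assuming the first column of input_data holds the ids and every other column is numeric:
input_mat <- as.matrix(input_data[, -1])   # drop the id column, keep the numeric months
rownames(input_mat) <- input_data$id       # keep the ids around as row names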
[EDIT: full example, motivated by DWin]
input_data <- matrix(ncol=24,nrow=10)
# years and months:
colnames(input_data) <- c(paste(2010,1:12,sep="_"),paste(2011,1:12,sep="_"))
# some ids
rownames(input_data) <- 1:10
# put in some values:
input_data[sample(1:length(input_data),200,replace=FALSE)] <- round(runif(200,100,200))
# make an all-NA case:
input_data[2,1:12] <- NA
# and here's the full deal:
sapply(2010:2011, function(x, input_data){
  input_data_yr <- input_data[, grep(x, colnames(input_data))]
  apply(input_data_yr, 1, function(id){
    id[!is.na(id)][1]
  })
}, input_data)
The all-NA case works. The grep() column-selection idea is lifted from DWin. As in the example above, you could define the anonymous interior function separately and compile it to potentially make the whole thing run faster.
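A sketch of what that could look like, reusing the example above (the inner function is just the pick-first-non-NA step):
pick_first <- compiler::cmpfun(function(id) id[!is.na(id)][1])
sapply(2010:2011, function(x, input_data){
  input_data_yr <- input_data[, grep(x, colnames(input_data))]
  apply(input_data_yr, 1, pick_first)
}, input_data)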
I built a tiny test case (for which timriffe's suggestion fails). You might attract more interest by putting up code that creates a more complete test case, such as 4 quarters for 2 years, including pathological cases such as all NAs in one row of one year. I would think that instead of requiring you to write out all the year columns by name, you ought to cycle through them with a grep() strategy:
# funyear <- function to work on one year's data and return a single vector
# my efforts keep failing on the all(NA) row by year combos
sapply(seq("2011", "2001"), function (pat) funyear(input_data[grep(pat, names(input_data) )] )
Related
I'm in need of some R for-loop and grep optimisation assistance.
I have a data.frame made up of columns of different data types. 42 of these columns are named "treatmentmedication_code_#", where # is a number from 1 to 42.
There is a lot of code so a reproducible example is quite tricky. As a compromise, the following code is the precise operation I need to optimise.
for(i in 1:nTreatments) {
  ...lots of code...
  controlsDrugStatusDF <- cbind(controlsTreatmentDF, Drug=0)
  for(n in 1:nControls) {
    if(treatment %in% controlsDrugStatusDF[n, grep(pattern="^treatmentmedication_code*", x=colnames(controlsDrugStatusDF))]) {
      controlsDrugStatusDF$Drug[n] <- 1
    } else {
      controlsDrugStatusDF$Drug[n] <- 0
    }
  }
}
treatment is some coded medication, e.g. 145374524. The condition inside the if statement is very slow: it checks whether the treatment value is present in any one of the columns selected by the grep for row n. To make matters worse, this is done for every treatment, hence the i for-loop.
Short of launching multiple processes or massacring my data.frames into lots of separate matrices then pasting them together and converting them back into a data.frame, are there any notable improvements one could make on the if statement?
As part of the optimization, the grep that selects the columns can be done outside the loop. The treatments part is not entirely clear; assuming it is a vector of values, we can use
nm1 <- grep("^treatmentmedication_code*",
            colnames(controlsDrugStatusDF), value = TRUE)
nm2 <- paste0("Drug", seq_along(nm1))
controlsDrugStatusDF[nm2] <- lapply(controlsDrugStatusDF[nm1],
                                    function(x) +(x %in% treatments))
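If a single 0/1 Drug column per row is what is actually needed (an assumption on my part), the per-column flags could then be collapsed afterwards, for example:
controlsDrugStatusDF$Drug <- +(rowSums(controlsDrugStatusDF[nm2]) > 0)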
I am trying to obtain the number of cases for each variable in a df. There are 275 cases in the df but most columns have some missing data. I am trying to run a for loop to obtain the information as follows:
idef_id<-readxl::read_xlsx("IDEF.xlsx")
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(i))
275-nas
}
however the output for casenums is
> summary(casenums)
Length Class Mode
0 NULL NULL
Any help would be much appreciated!
A for loop isn't a function; it doesn't return anything, so x <- for(... never makes sense. You can do that with, e.g., sapply, like this:
casenums <- sapply(idef_id, function(x) sum(!is.na(x)))
Or you can do it in a for loop, but you need to assign to a particular value inside the loop:
casenums = rep(NA, ncol(idef_id))
names(casenums) = names(idef_id)
for(i in names(idef_id)) {
  casenums[i] = sum(!is.na(idef_id[[i]]))
}
You also had a problem that i is taking on column names, so sum(is.na(i)) is asking if the value of the column name is missing. You need to use idef_id[[i]] to access the actual column, not just the column name, as I show above.
You seem to want the answer to be the number of non-NA values, so I switched to sum(!is.na(...)) to count that directly, rather than hard-coding the number of rows of the data frame and doing subtraction.
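An equivalent vectorised alternative (just a sketch), since !is.na() works on a whole data frame at once:
casenums <- colSums(!is.na(idef_id))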
The immediate fix for your for loop is that your i is a column name, not the data within. On your first pass through the for loop, your i is class character, always length 1, so sum(is.na(i)) is going to be 0. Due to how frames are structured, there is very little likelihood that a name is NA (though it is possible ... with manual subterfuge).
I suggest a literal fix for your code could be:
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
275-nas
}
But this has the added problem that for loops don't return anything (as Gregor's answer also discusses). For the sake of walking through things, I'll keep that (for the first bullet), and then fix it (in the second):
Two things:
hard-coding 275 (assuming that's the number of rows in the frame) will be problematic if/when your data ever changes. Even if you're "confident" it never will ... I still recommend not hard-coding it. If it's based on the number of rows, then perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
casenums <- for (i in names(idef_id)) {
nas<- sum(is.na(idef_id[[i]]))
OUT_OF - nas
}
at least in a declarative sense, where the variable name (please choose something better) makes it clear how you determined 275 and how (if necessary) it should be fixed in the future.
(Or better, use Gregor's logic of sum(!is.na(...)) if you just need to count not-NA.)
doing something for each column of a frame is easily done using sapply or lapply, perhaps
OUT_OF <- 275 # should this be nrow(idef_id)?
OUT_OF - sapply(idef_id, function(one_column) sum(is.na(one_column)))
## or
sapply(idef_id, function(one_column) OUT_OF - sum(is.na(one_column)))
Let's assume that I have a dataset with the following structure:
I have N products
I'm operating in N countries
I have N payment partners
My dataset contains N days
I have N different prices that customers can choose from
For example:
customer.id <- c(1,2,3,4,5,6,7,8)
product <- c("product1","product2","product1","product2","product1","product2","product1","product2")
country <- c("country1","country2","country1","country2","country1","country2","country1","country2")
payment.partner <- c("pp1","pp2","pp1","pp2","pp1","pp2","pp1","pp2")
day <- c("day1","day2","day1","day2","day1","day2","day1","day2")
price <- c("price1","price2","price1","price2","price1","price2","price1","price2")
customer.data <- data.frame(customer.id, product, country, payment.partner, day, price)
library(data.table)
customer.data <- data.table(customer.data)
Suppose I want to generate an aggregate out of it that, for instance, performs a forecasting algorithm for each combination. In order to do so, I identify the unique items for each condition and iterate it as follows:
unique.products <- droplevels(unique(customer.data[,product]))
unique.countries <- droplevels(unique(customer.data[,country]))
unique.payment.partners <- droplevels(unique(customer.data[,payment.partner]))
unique.days <- droplevels(unique(customer.data[,day]))
unique.prices <- droplevels(unique(customer.data[,price]))
for(i in seq_along(unique.products)){
  temp.data1 <- customer.data[product==unique.products[[i]]]
  for(j in seq_along(unique.countries)){
    temp.data2 <- temp.data1[country==unique.countries[[j]]]
    for(k in seq_along(unique.payment.partners)){
      temp.data3 <- temp.data2[payment.partner==unique.payment.partners[[k]]]
      for(l in seq_along(unique.days)){
        temp.data4 <- temp.data3[day==unique.days[[l]]]
        for(m in seq_along(unique.prices)){
          temp.data5 <- temp.data4[price==unique.prices[[m]]]
          if(nrow(temp.data5)!=0){
            # do your calculations here
            print(temp.data5)
          }
        }
      }
    }
  }
}
In general, this code structure works fine, but it gets really annoying when applied to real data with 5 million rows. I guess R is not the best language in terms of speed and performance. Of course, I have used multicore processing in the past, or tried to get such an aggregate straight out of Hive or a MySQL data warehouse. Using another language like C++ or Python is also always an option.
However, sometimes all these options are not possible, which then always leads me back to this exact same processing structure. So I have been wondering for quite a while whether there is a better, or rather faster, solution from an architectural point of view, since it is known (and also becomes VERY clear when benchmarking) that for loops and frequent data subsetting are very, very slow.
Grateful for all comments, hints and possible solutions!
You should read the documentation of packages you are using. Package data.table offers some excellent introductory tutorials.
customer.data <- data.frame(customer.id,product,country,payment.partner,day,price)
library(data.table)
setDT(customer.data)
customer.data[,
  print(customer.data[.I]), # don't do this, just refer to the columns you want to work on
  by = .(product, country, payment.partner, day, price)]
Of course, generally, you wouldn't print the data.table subset here, but work directly on specific columns.
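For instance, a hedged sketch of working per group; forecast_fn is a hypothetical stand-in for whatever forecasting routine you apply to each subset:
# count rows per combination
customer.data[, .(n_obs = .N), by = .(product, country, payment.partner, day, price)]
# or pass each group's rows (.SD) to a hypothetical function
# customer.data[, forecast_fn(.SD), by = .(product, country, payment.partner, day, price)]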
From your description (but not your code, which I found incomprehensible as to its purpose), I am thinking you may want to use the interaction function:
customer.data$grp <- droplevels(with(customer.data,
  interaction(product, country, payment.partner, day, price)))
table(customer.data$grp)
#-----------------------
product1.country1.pp1.day1.price1 product2.country2.pp2.day2.price2
                                4                                 4
You could then use lapply(split(dat, dat$grp), analytic_function) to create separate analyses within subsets. I didn't have data.table loaded, so I showed the method for data frames, but there's no reason interaction shouldn't succeed in the data.table world:
customer.data[ , grp2 := droplevels(interaction(
product, country ,payment.partner, day, price) ) ]
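A usage sketch of that split-and-apply idea, with analytic_function standing in for whatever per-subset analysis is run:
results <- lapply(split(customer.data, customer.data$grp2), analytic_function)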
I've been learning R for my project and have been unable to google a solution to my current problem.
I have ~100 csv files and need to perform an exact set of operations across them. I've read them in as separate objects (which I assume is probably improper R style) but I've been unable to write a function that can loop through them. Each csv is a data frame that contains information, including a column with dates in decimal-year form. I need to create 2 new columns containing year and day of year. I've figured out how to do it manually, but I would like to find a way to automate the process. Here's what I've been doing:
#setup
library(lubridate) #Used to check for leap years
df.00 <- data.frame( site = seq(1:10), date = runif(10,1980,2000 ))
#what I need done
df.00$doy <- NA # make an empty column which I'm going to place the day of the year
df.00$year <- floor(df.00$date) # grabs the year from the date column
df.00$dday <- df.00$date - df.00$year # get the year fraction. intermediate step.
# multiply the fraction year by 365 or 366 if it's a leap year to give me the day of the year
df.00$doy[which(leap_year(df.00$year))] <- round(df.00$dday[which(leap_year(df.00$year))] * 366)
df.00$doy[which(!leap_year(df.00$year))] <- round(df.00$dday[which(!leap_year(df.00$year))] * 365)
The above, while inelegant, does what I would like it to. However, I need to do this to the other data frames, df.01 - df.99. So far I've been unable to place it in a function or for loop. If I place it into a function:
funtest <- function(x) {
x$doy <- NA
}
funtest(df.00) does nothing, which is what I would expect from my understanding of how functions work in R. But if I wrap it up in a for loop:
for(i in c(df.00)) {
i$doy <- NA }
I get "In i$doy <- NA : Coercing LHS to a list" several times which tells me that the loop isn't treat the dataframe as a single unit but perhaps looking at each column in the frame.
I would really appreciate some insight on what I should be doing. I feel that I could have solved this easily using bash and awk, but I would like to be less incompetent using R.
The most efficient and direct way is to use a list:
Put all of your CSV's into one folder
grab a list of the files in that folder
eg: files <- dir('path/to/folder', full.names=TRUE)
iteratively read in all those files into a list of data.frames
eg: df.list <- lapply(files, read.csv, <additional args>)
apply your function iteratively over each data.frame
eg: lapply(df.list, myFunc, <additional args>)
Since your df's are already loaded, and they have nice convenient names, you can grab them easily using the following:
nms <- c(paste0("df.0", 0:9), paste0("df.", 10:99))
df.list <- lapply(nms, get)
Then take everything you have in the #what I need done portion and put it inside a function, e.g.:
myFunc <- function(DF) {
# what you want done to a single DF
return(DF)
}
And then lapply accordingly
df.list <- lapply(df.list, myFunc)
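For instance, a minimal sketch of myFunc, filled in with the operations from the #what I need done block in the question (it assumes each data frame has the same decimal-year date column, and uses leap_year() from lubridate as the question does):
library(lubridate)   # for leap_year()
myFunc <- function(DF) {
  DF$year <- floor(DF$date)                                      # whole year
  dday    <- DF$date - DF$year                                   # fractional part of the year
  DF$doy  <- round(dday * ifelse(leap_year(DF$year), 366, 365))  # day of year
  return(DF)
}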
On a separate note, regarding functions:
The reason your funtest "does nothing" is that you are not having it return anything. That is to say, it is doing something, but when it finishes doing that, it then does "nothing" with the result.
You need to include a return(.) statement in the function. Alternatively, the output of the last line of the function, if not assigned to an object, will be used as the return value -- but this last statement is only loosely true, and hence one needs to be cautious. The cleanest option (in my opinion) is to use return(.).
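Applied to the funtest from the question, that might look like this (a sketch; note the result still has to be assigned back, since R functions work on copies):
funtest <- function(x) {
  x$doy <- NA
  return(x)          # explicitly return the modified copy
}
df.00 <- funtest(df.00)  # assign the result back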
Regarding the for loop over the data.frame:
As you observed, using for (i in someDataFrame) {...} iterates over the columns of the data.frame.
You can iterate over the rows using apply:
apply(myDF, MARGIN=1, function(x) { x["doy"] <- ...; return(x) }) # don't forget to return
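Note that apply() coerces each row to an atomic vector, so a list-based alternative that keeps each row as a one-row data frame may be easier to work with (a sketch, with the actual computation left as a placeholder):
rows <- lapply(seq_len(nrow(myDF)), function(i) {
  row <- myDF[i, ]
  row$doy <- NA      # placeholder for the real computation
  row
})
myDF <- do.call(rbind, rows)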
I deal with a great deal of survey data and the like in my work, and I often have to make various scoring programs that process data on a row-by-row level. For instance, I am dealing with a table right now that contains 12 columns with subscale scores from a psychometric instrument. These will be converted to normalized scores using tables provided by the instrument's creator. Seems straightforward so far.
However, there are four tables: the instrument is scored differently depending on gender and age range. So, for instance, a 14-year-old female and a 10-year-old male get different normalization tables. All of the normalization data is stored in an R data frame.
What I would like to do is write a function which can be applied over rows, which returns a vector looked up from the normalization data. So, something vaguely like this:
converter <- function(rawscores, gender, age) {
  if(gender=="Male") {
    if(8 <= age & age <= 11) {convertvec <- c(1:12)}
    if(12 <= age & age <= 14) {convertvec <- c(13:24)}
  }
  else if(gender=="Female") {
    if(8 <= age & age <= 11) {convertvec <- c(25:36)}
    if(12 <= age & age <= 14) {convertvec <- c(37:48)}
  }
  converted_scores <- rep(0, 12)
  for(z in 1:12) {
    converted_scores[z] <- conversion_table[(unlist(rawscores)+1)[z],
                                            convertvec[z]]
  }
  rm(z)
  return(converted_scores)
}
EDITED: I updated this with the code I actually got to work yesterday. This version returns a simple vector with the scores. Here's how I then implemented it.
mydata[,21:32] <- 0
for(x in 1:dim(mydata)[1]) {
tscc_scores[x,21:32] <- converter(mydata[x,7:18],
mydata[x,"gender"],
mydata[x,"age"])
}
This works, but like I said, I'm given to understand that it is bad practice?
Side note: the reason rawscores+1 is there is that the data frame has a score of zero in the first index.
Fundamentally, the function doesn't seem very complicated, and I know I could just implement it using a loop where I would do for(x in 1:number_of_records), but my understanding is that doing so is poor practice. I had hoped to simply use apply() to do this, as follows:
apply(X=mydata[,1:12],MARGIN=1,
FUN=converter,gender=mydata[,"gender"],age=mydata[,"age"])
Unfortunately, R doesn't seem to approve of this approach, as it does not iterate through the vectors passed to subsequent arguments, but rather takes each of them as a whole argument. The solution would appear to be mapply(), but I can't figure out if there's a way to use mapply() over rows instead of columns.
So, I guess my questions are threefold. One, is there a way to use mapply() over rows? Two, is there a way to make apply() iterate over arguments? And three, is there a better option out there? I've seen and heard a lot about the plyr package, but I didn't want to jump to that before I fully investigated the options present in Base R.
You could rewrite converter so that it takes vectors of gender and age, plus a row index, which you then use to do lookups and assignments to converted_scores, using a conversion array and a data array that is just the numeric score columns. There is an additional problem with using apply, since it will convert all of its x arguments to "character" class because the gender column is "character". It wasn't clear whether your code normdf[rawscores+1, convertvec] was supposed to be an array extraction or a function call.
Untested in absence of working example (with normdf, mydata):
converted_scores <- matrix(NA, nrow=NROW(rawscores), ncol=12)
converter <- function(idx, gender, age) {
  gidx <- match(gender, c("Male", "Female"))
  aidx <- findInterval(age, c(8, 12, 15))
  ag.idx <- gidx + 2*aidx - 1
  # the aidx factor needs to be the same number of valid age categories
  cvt <- cvt.arr[ag.idx, ]
  converted_scores[idx] <- normdf[rawscores+1, convertvec]
  return(converted_scores)
}
cvt.arr <- matrix(1:48, nrow=4, byrow=TRUE)[c(1,3,2,4), ] # the genders alternate
cvt.scores <- mapply(converter, 1:NROW(mydata), mydata$gender, mydata$age)
I'd advise against applying this stuff by row, but would rather apply this by column. The reason is that there are only 12 columns, but there might be many rows.
The following piece of code works for me. There might be better ways, but it might be interesting for you nevertheless.
offset <- with(mydata, 24*(gender == "Female") + 12*(age >= 12))
idxs <- expand.grid(row = 1:nrow(mydata), col = 1:12)
idxs$off <- idxs$col + offset
idxs$val <- as.numeric(mydata[as.matrix(idxs[c("row", "col")])]) + 1
idxs$norm <- normdf[as.matrix(idxs[c("val", "off")])]
converted <- mydata
converted[,1:12] <- matrix(idxs$norm, ncol=12)
The tricky part here is this idxs data frame, which combines all the rest. It has the following columns:
row and col: position in the original data
off: column in normdf, based on gender and age
val: row in normdf, based on original value + 1
norm: corresponding normalized value
I'll post this here as a first thought, and see whether I can come up with a better answer, either based on joran's comment or using a three- or four-dimensional array for normdf. Not sure yet.