I have 8 datasets and I want to apply a function that converts any number less than 5 to NA in 3 columns (var1, var2, var3) of each dataset. How can I write a function to do this efficiently? I went through lots of similar questions on Stack Overflow, but I didn't find any answer where specific columns were used. I have written the replacement function but can't figure out how to apply it to all the datasets.
Input:
Data1
variable1 variable2 variable3 variable4
10 36 56 99
15 3 2 56
4 24 1 1
Expected output:
variable1 variable2 variable3 variable4
10 36 56 99
15 NA NA 56
NA 24 NA 1
Perform the same thing for 7 more datasets.
So far I have stored the needed variables and datasets in two different lists.
var1=enquo(variable1)
var2=enquo(variable2)
var3=enquo(variable3)
Total=3
listofdfs=list()
listofdfs_1=list()
for(i in 1:8) {
df=sym((paste0("Data",i)))
listofdfs[[i]]=df
}
for(e in 1:Total) {
listofdfs_1[[e]]= eval(sym(paste0("var",e)))
}
The selected columns will go through this function:
temp_1=function(x,h) {
h=enquo(h)
for(e in 1:Total) {
if(substr(eval(sym(paste0("var",e))),1,3)=="var") {
y= x %>% mutate_at(vars(!!h), ~ replace(., which(.<=5),NA))
return(y)
}
}
}
I was expecting something like:
lapply(for each dataset's selected columns,temp_1)
Here's a simple approach that should work:
cols_to_edit = paste0("var", 1:3)
result_list = lapply(list_of_dfs, function(x) {
x[cols_to_edit][x[cols_to_edit] < 5] = NA
return(x)
})
I assume your starting data is in a list called list_of_dfs, that the names of columns to edit are the same in all data frames, and that you can construct a character vector cols_to_edit with those names.
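To make those assumptions concrete, here is a minimal sketch run on the sample data from the question. The column names variable1 to variable3 are taken from the question's table; the second data frame is made up purely for illustration.

```r
# Sample data from the question
Data1 <- data.frame(
  variable1 = c(10, 15, 4),
  variable2 = c(36, 3, 24),
  variable3 = c(56, 2, 1),
  variable4 = c(99, 56, 1)
)
# Hypothetical second dataset with the same column names
Data2 <- data.frame(
  variable1 = c(2, 8),
  variable2 = c(9, 4),
  variable3 = c(5, 7),
  variable4 = c(3, 3)
)

# Collect the data frames in a named list
# (mget(paste0("Data", 1:2)) would do the same from the global environment)
list_of_dfs <- list(Data1 = Data1, Data2 = Data2)

cols_to_edit <- paste0("variable", 1:3)

result_list <- lapply(list_of_dfs, function(x) {
  # Logical matrix index: TRUE wherever a selected cell is below 5
  x[cols_to_edit][x[cols_to_edit] < 5] <- NA
  x
})

result_list$Data1
#   variable1 variable2 variable3 variable4
# 1        10        36        56        99
# 2        15        NA        NA        56
# 3        NA        24        NA         1
```

variable4 is left untouched because it is not listed in cols_to_edit.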
Here is a solution to the problem in the question.
First of all, create a test data set.
createData <- function(Total = 3){
numcols <- Total + 1
set.seed(1234)
for(i in 1:8){
tmp <- replicate(numcols, sample(10, 20, TRUE))
tmp <- as.data.frame(tmp)
names(tmp) <- paste0("var", seq_len(numcols))
assign(paste0("Data", i), tmp, envir = .GlobalEnv)
}
}
createData()
Now, the data transformation.
This is much easier if the many dataframes are in a "list".
df_list <- mget(ls(pattern = "^Data"))
I will present two solutions, a base R one and a tidyverse one. Note that both use the function temp_1, which is written in base R only.
library(tidyverse)
temp_1 <- function(x, h){
f <- function(v){
is.na(v) <- v <= 5 # same cutoff as the question's code; use v < 5 for strictly "less than 5"
v
}
x[h] <- lapply(x[h], f)
x
}
h <- grep("var[123]", names(df_list[[1]]), value = TRUE)
df_list1 <- lapply(df_list, temp_1, h)
df_list2 <- df_list %>% map(temp_1, h)
identical(df_list1, df_list2)
#[1] TRUE
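Since the question used mutate_at(), which dplyr has superseded with across(), the same replacement could also be sketched in dplyr itself. This assumes dplyr 1.0.0 or later; the helper name below_to_na, its cutoff argument, and the two small data frames are made up for illustration.

```r
library(dplyr)

# Made-up stand-ins for the question's datasets
df_list <- list(
  Data1 = data.frame(var1 = c(10, 15, 4), var2 = c(36, 3, 24),
                     var3 = c(56, 2, 1), var4 = c(99, 56, 1)),
  Data2 = data.frame(var1 = c(1, 7), var2 = c(8, 2),
                     var3 = c(5, 6), var4 = c(0, 3))
)

cols <- c("var1", "var2", "var3")

# Hypothetical helper: set values below `cutoff` to NA in the chosen columns
below_to_na <- function(df, cols, cutoff = 5) {
  mutate(df, across(all_of(cols), ~ replace(.x, which(.x < cutoff), NA)))
}

result <- lapply(df_list, below_to_na, cols = cols)
result$Data1$var1
# [1] 10 15 NA
```

Wrapping the condition in which() keeps existing NA values out of the index, so they are simply left as NA.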
My first post here so please tell me if I'm missing any important information.
I am handling a lot of data in the form of time (1:30 = row ID) vs. value, all stored in a number of data frames, and I need to keep each one as a data.frame.
I wrote a function that gets dataframes from my global environment and sorts the columns in each set into new data frames depending on their values.
So I start with a list of names of my data frames as input for my function and then end with assigning the created new dataframes to my global environment while using the assign function.
The data frames I get are all 30 rows long but have different numbers of columns, depending on how often a case appears in a dataset. The name of each data frame represents one dataset and the column names inside represent one timeline. I use data frames so I don't lose the information in the column names.
This works when a case appears 0 times or more than once.
But if a data.frame ends up with only one column and I use the assign function, it appears as a vector in my global environment instead of a data frame. I therefore lose the column name, and my other functions, which only accept data frames, stop at such a case and throw errors.
Here is a basic example of my problem:
#create two datasets with different cases
data1 <- data.frame(matrix(nrow=30, ncol=5))
data1[1] <- c(rep(1,each=30))
data1[2] <- c(rep(5, each=30))
data1[3] <- c(rep(5, each=30))
data1[4] <- c(rep(10, each=30))
data1[5] <- c(rep(10, each=30))
data2 <- data.frame(matrix(nrow=30, ncol=6))
data2[1] <- c(rep(5,each=30))
data2[2] <- c(rep(1, each=30))
data2[3] <- c(rep(1, each=30))
data2[4] <- c(rep(0, each=30))
data2[5] <- c(rep(0, each=30))
data2[6] <- c(rep(10, each=30))
#create list with names of datasets
names <- c('data1','data2')
#function for sorting
examplefunction <- function(VarNames) {
for (i in 1:length(VarNames)) {
#get current dataset
name <- VarNames[i]
data <- get(VarNames[i])
#create new empty data.frames for sorting
data.0 <- data.frame(matrix(nrow=30))
name.data.0 <- paste(name,"0", sep=".")
c.0 = 2 #start at second column, since first doesn't like the colname later
data.1 <- data.frame(matrix(nrow=30))
name.data.1 <- paste(name,"1", sep=".")
c.1 = 2
data.5 <- data.frame(matrix(nrow=30))
name.data.5 <- paste(name,"5", sep=".")
c.5 = 2
data.10 <- data.frame(matrix(nrow=30))
name.data.10 <- paste(name,"10", sep=".")
c.10 = 2
#sort data into new different data.frames
for (c in 1:ncol(data)) {
if(data[1,c]==0) {
data.0[c.0] = data[c]
c.0 = c.0 +1
}
else if(data[1,c]==1) {
data.1[c.1] = data[c]
c.1 = c.1 +1
}
else if(data[1,c]==5) {
data.5[c.5] = data[c]
c.5 = c.5 +1
}
else if(data[1,c]==10) {
data.10[c.10] = data[c]
c.10 = c.10 +1
}
else stop("new values")
}
#remove first column with weird name
data.0 <- data.0[,-1]
data.1 <- data.1[,-1]
data.5 <- data.5[,-1]
data.10 <- data.10[,-1]
#assign data frames to global environment
assign(name.data.0, data.0, envir = .GlobalEnv)
assign(name.data.1, data.1, envir = .GlobalEnv)
assign(name.data.5, data.5, envir = .GlobalEnv)
assign(name.data.10, data.10, envir = .GlobalEnv)
}
}
#function call
examplefunction(names)
As explained before, if you run this you will end up with data frames of 0 variables and of more than 1 variable, plus three vectors where the data frame had only one column.
So my questions are:
1. Is there any way to keep the data type and force R to assign a data frame instead of a vector?
2. Or is there an alternative function I could use instead of assign()? If I use <<- how can I do the name assigning as above?
You can use drop = FALSE when subsetting:
examplefunction <- function(VarNames) {
for (i in 1:length(VarNames)) {
#get current dataset
name <- VarNames[i]
data <- get(VarNames[i])
#create new empty data.frames for sorting
data.0 <- data.frame(matrix(nrow=30))
name.data.0 <- paste(name,"0", sep=".")
c.0 = 2 #start at second column, since first doesn't like the colname later
data.1 <- data.frame(matrix(nrow=30))
name.data.1 <- paste(name,"1", sep=".")
c.1 = 2
data.5 <- data.frame(matrix(nrow=30))
name.data.5 <- paste(name,"5", sep=".")
c.5 = 2
data.10 <- data.frame(matrix(nrow=30))
name.data.10 <- paste(name,"10", sep=".")
c.10 = 2
#sort data into new different data.frames
for (c in 1:ncol(data)) {
if(data[1,c]==0) {
data.0[c.0] = data[c]
c.0 = c.0 +1
}
else if(data[1,c]==1) {
data.1[c.1] = data[c]
c.1 = c.1 +1
}
else if(data[1,c]==5) {
data.5[c.5] = data[c]
c.5 = c.5 +1
}
else if(data[1,c]==10) {
data.10[c.10] = data[c]
c.10 = c.10 +1
}
else stop("new values")
}
#remove first column with weird name
data.0 <- data.0[ , -1, drop = FALSE]
data.1 <- data.1[ , -1, drop = FALSE]
data.5 <- data.5[ , -1, drop = FALSE]
data.10 <- data.10[ , -1, drop = FALSE]
#assign data frames to global environment
assign(name.data.0, data.0, envir = .GlobalEnv)
assign(name.data.1, data.1, envir = .GlobalEnv)
assign(name.data.5, data.5, envir = .GlobalEnv)
assign(name.data.10, data.10, envir = .GlobalEnv)
}
}
#function call
examplefunction(names)
Let's take a look at the one-column dataframes:
str(data1.1)
'data.frame': 30 obs. of 1 variable:
$ X1: num 1 1 1 1 1 1 1 1 1 1 ...
str(data2.10)
'data.frame': 30 obs. of 1 variable:
$ X6: num 10 10 10 10 10 10 10 10 10 10 ...
Now, all that said, I agree with Roland's comment -- you almost never want to take this approach of assigning to the global environment in a complicated way, and instead should return a list; that's best practice. However, you'd still need drop = FALSE to keep the column names.
Really, to me, there's probably an entirely different approach to doing whatever kind of data wrangling you're wanting to do that is a much better approach. I just don't have a good grasp of your task to make a suggestion.
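As a rough illustration of the "return a list" idea, here is a minimal sketch. It assumes, as in the question, that the value in the first row decides which group a column belongs to; the helper name split_columns_by_first_value and the column names a to e are made up.

```r
# Hypothetical helper: group the columns of one data frame by the value in
# their first row and return a named list of data frames.
split_columns_by_first_value <- function(data) {
  first_values <- unlist(data[1, ], use.names = FALSE)
  groups <- split(seq_along(data), first_values)
  # drop = FALSE keeps one-column results as data frames with their names
  lapply(groups, function(idx) data[, idx, drop = FALSE])
}

# Same shape as data1 in the question, with made-up column names
data1 <- data.frame(a = rep(1, 30), b = rep(5, 30), c = rep(5, 30),
                    d = rep(10, 30), e = rep(10, 30))

result <- split_columns_by_first_value(data1)
names(result)
# [1] "1"  "5"  "10"
str(result[["1"]])
# 'data.frame':	30 obs. of  1 variable:
#  $ a: num  1 1 1 1 1 1 1 1 1 1 ...
```

Values that never occur (here 0) simply have no entry in the list, and several datasets can be handled with lapply(list(data1 = data1, data2 = data2), split_columns_by_first_value) without touching the global environment.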
I am trying to create an iterative function in R using a loop or array, which will create three variables and three data frames with the same 1-3 suffix. My current code is:
function1 <- function(b1,lvl1,lvl2,lvl3,b2,x) {
lo1 <- exp(b1*lvl1 + b2*x)
lo2 <- exp(b1*lvl2 + b2*x)
lo3 <- exp(b1*lvl3 + b2*x)
out1 <- t(c(lvl1,lo1))
out2 <- t(c(lvl2,lo2))
out3 <- t(c(lvl3,lo3))
out <- rbind(out1, out2, out3)
colnames(out) <- c("level","risk")
return(out)
}
function1(.18, 1, 2, 3, .007, 24)
However, I would like to iterate the same line of code three times to create lo1, lo2, lo3, and out1, out2 and out3. The syntax below is completely wrong because I don't know how to use two arguments in a for-loop, or nest a for loop within a function, but as a rough idea:
function1 <- function(b1,b2,x) {
for (i in 1:3) {
loi <- exp(b1*i + b2*x)
return(lo[i])
outi <- t(c(i, loi)
return(out[i])
}
out <- rbind(out1, out2, out3)
colnames(out) <- c("level","risk")
return(out)
}
function1(.18,.007,24)
The output should look like:
level risk
1 1.42
2 1.70
3 2.03
In R, explicit for loops are often slower and more verbose than vectorized code, especially when they grow an object one element at a time. A good practice is to vectorize wherever possible and otherwise to use the functions from the apply family.
For your task, you can simply work with a data frame. Here is an example:
# The function
function1 <- function(b1,b2,level,x) {
# Create the dataframe with the level column
df = data.frame("level" = level)
# Add the risk column
df$risk = exp(b1*df$level + b2*x)
return(df)
}
# Your variables
b1 = .18
b2 = .007
level = c(1,2,3)
# Your process
function1(b1, b2, level, 24)
# level risk
# 1 1 1.416232
# 2 2 1.695538
# 3 3 2.029927
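For comparison, the loop the question was reaching for could be sketched as follows. The idea is to preallocate the result once and fill one row per iteration, instead of creating numbered variables such as lo1 or out1. The name function1_loop and the levels argument are made up for illustration.

```r
# Hypothetical loop version: one row of `out` is filled per level
function1_loop <- function(b1, b2, x, levels = 1:3) {
  out <- matrix(NA_real_, nrow = length(levels), ncol = 2,
                dimnames = list(NULL, c("level", "risk")))
  for (i in seq_along(levels)) {
    out[i, ] <- c(levels[i], exp(b1 * levels[i] + b2 * x))
  }
  out
}

res <- function1_loop(.18, .007, 24)
round(res, 2)
#      level risk
# [1,]     1 1.42
# [2,]     2 1.70
# [3,]     3 2.03
```

Note that return() ends the function immediately, so it should appear only once, after the loop has finished.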
I have a dataframe with a set of objects df$data and a set of rules to be applied on every object df$rules.
df <- data.frame(
data = c(1,2,3),
rules = c("rule1", "rule1, rule2, rule3", "rule3, rule2"),
stringsAsFactors = FALSE
)
The rules are
rule1 <- function(data) {
data * 2
}
rule2 <- function(data) {
data + 1
}
rule3 <- function(data) {
data ^ 3
}
For every row in the dataframe I want to apply all the rules specified in the rules column. The rules should be applied in series.
What I figured out:
apply_rules <- function(data, rules) {
for (i in 1:length(data)) {
rules_now <- unlist(strsplit(rules[i], ", "))
for (j in 1:length(rules_now)) {
data[i] <- apply_rule(data[i], rules_now[j])
}
}
return(data)
}
apply_rule <- function(data, rule) {
return(sapply(data, rule))
}
apply_rules(df$data, df$rules)
# [1] 2 125 28
Although this works, I'm pretty sure there must be more elegant solutions. On SO I could find lots of material about the apply functions, one post about applying many functions to a vector, and something about chaining functions. The Compose idea looks promising, but I couldn't figure out how to call Compose with my rules given as strings. (parse() didn't work.)
Any hints?
Some good answers already, but to throw in another option: build a pipe chain as a string, then evaluate it. For example, for row 1, eval(parse(text = "1 %>% rule1")) gives 2. This needs the %>% operator from magrittr.
library(magrittr) # provides %>%
eval_chain <- function(df) {
eval(parse(text = paste(c(df$data, unlist(strsplit(df$rules, ", "))), collapse=" %>% ")))
}
df$value <- sapply(1:nrow(df), function(i) df[i, ] %>% eval_chain)
# data rules value
# 1 1 rule1 2
# 2 2 rule1, rule2, rule3 125
# 3 3 rule3, rule2 28
You can use mapply and Reduce together with mget in this case.
mapply(function(d,r) Reduce(function(lhs,rhs) rhs(lhs),
c(d,mget(strsplit(r,", ")[[1]],envir = globalenv())))
,df$data
,df$rules)
# [1] 2 125 28
You might have to adjust the envir argument of mget to your specific case. It would probably be more robust to explicitly pass the environment where your rules are defined to mget.
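A minimal sketch of that more explicit variant, assuming the rules are stored in a dedicated environment. The names rule_env and apply_rule_chain are made up for illustration.

```r
# Keep the rules in their own environment instead of the global one
rule_env <- new.env()
rule_env$rule1 <- function(x) x * 2
rule_env$rule2 <- function(x) x + 1
rule_env$rule3 <- function(x) x^3

df <- data.frame(
  data = c(1, 2, 3),
  rules = c("rule1", "rule1, rule2, rule3", "rule3, rule2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: look up the named rules in `env` and apply them in order
apply_rule_chain <- function(value, rule_string, env) {
  rule_names <- strsplit(rule_string, ", ", fixed = TRUE)[[1]]
  # mode = "function" makes mget() fail if a name is not a function in env
  rule_funs <- mget(rule_names, envir = env, mode = "function")
  Reduce(function(acc, f) f(acc), rule_funs, value)
}

values <- mapply(apply_rule_chain, df$data, df$rules,
                 MoreArgs = list(env = rule_env), USE.NAMES = FALSE)
values
# [1]   2 125  28
```

Because mget() does not search enclosing environments by default, a misspelled rule name raises an error rather than silently picking up an unrelated object.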
I think you have to change the approach a little (expressions will only make things worse in this case):
df <- data.frame(
data = c(1,2,3),
rules = c("rule1", "rule1, rule2, rule3", "rule3, rule2"),
stringsAsFactors = FALSE
)
# list of functions
fun_list <- list(
rule1 = function(x) x*2,
rule2 = function(x) x+1,
rule3 = function(x) x^3
)
# function to call list of functions
call_funs <- function(x, fun_vec) {
for (i in seq_along(fun_vec)) {
x <- fun_list[[fun_vec[[i]]]](x)
}
x
}
(want <- unlist(Map(call_funs, df$data, strsplit(gsub(" ", "", df$rules), ","))))
# 2 125 28
So I have a small problem in R. I have multiple data sets (data0, data1,...) and I want to do the following:
data01 <- data0[1:6,]
data02 <- data0[7:12,]
data11 <- data1[1:6,]
data12 <- data1[7:12,]
data21 <- data2[1:6,]
data22 <- data2[7:12,]
data31 <- data3[1:6,]
data32 <- data3[7:12,]
...etc
I would like to do this in a for loop like so:
for(i in 1:(some high number)){
datai1 <- datai[1:6,]
datai2 <- datai[7:12,]
}
I've tried messing around with assign() and get(), however I cannot make it work. I found something that might work in this question, however the difference is that here the variable d should also change depending on the index. Any idea how I could make this work?
Here is a more R-like approach than using assign:
data1 <- data0 <- data.frame(x = 1:12, y = letters[1:12]) #some data
mylist <- mget(ls(pattern = "data\\d")) #collect free floating objects into list
#it would be better to put the data.frames into a list when you create them
res <- lapply(mylist, function(d) split(d[1:12,], rep(1:2, each = 6))) #loop over list and split each data.frame
The result is a nested list and it's easy to extract its elements:
res[["data1"]][["2"]]
# x y
#7 7 g
#8 8 h
#9 9 i
#10 10 j
#11 11 k
#12 12 l
Assemble the variable names with paste() and then use get() and assign() as you suggest.
for (i in 1:10) {
datai <- get(paste('data', i, sep = ''))
assign(paste('data', i, '1', sep = ''), datai[1:6,])
assign(paste('data', i, '2', sep = ''), datai[7:12,])
}
I am doing systematic calculations for my created dataframe. I have the code for the calculations but I would like to:
1) Write it as a function and call it for the dataframe I created.
2) Reset the calculations for the next ID in the dataframe.
I would appreciate your help and advice on this.
The dataframe is created in R using the following code:
#Create a dataframe
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df))
{
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
head(df)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
The other thing is that the previous code assumes subject ID=1 for all time points. Suppose instead that the subject ID becomes 2 when WT (weight) changes to 120. How can I reset the calculations and automate them for all subject IDs in the dataframe? In this case the original dataframe would look like this:
#code:
rm(list=ls(all=TRUE))
dosetimes <- c(0,6,12,18)
df <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
df$ID[(df$WT>=120)==T] <- 2
df$TIME[df$ID==2] <- c(seq(0,20,1))
Thank you in advance!
In general, when doing calculations on different subjects' data, I like to split the dataframe by ID, pass the list of per-subject data into a for loop, do all the calculations, collect the newly calculated data in a list, and then collapse that list and return a single dataframe with all the numbers you want. This allows for a lot of control over what you do for each subject.
subjects = split(df, df$ID)
forResults = vector("list", length=length(subjects))
# initialize these constants
C <- 2
V <- 10
k <- C/V
myFunc = function(data, resultsArray){
# the loop index is 's' rather than 'k', so it does not mask the rate constant k defined above
for(s in seq_along(data)){
df = data[[s]]
df$A1 = 100 # I assume this should be 100 for t=0 for each subject?
# you could vectorize this nested for loop..
for(i in 2:nrow(df)) {
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
# you can add all sorts of other calculations you want to do on each subject's data
# when you're done doing calculations, put the resultant into
# the resultsArray and we'll rebuild the dataframe with all the new variables
resultsArray[[s]] = df
# if you're not using RStudio, then you want to use dev.new() to instantiate a new plot canvas
# dev.new() # dont need this if you're using RStudio (which doesnt allow multiple plots open)
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
}
# collapse the results vector into a dataframe
resultsDF = do.call(rbind, resultsArray)
return(resultsDF)
}
results = myFunc(subjects, forResults)
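The same split-calculate-recombine pattern can also be sketched with lapply, so that no results list has to be passed around. The helper name compute_a1 and the small toy dataset are made up for illustration, and the sketch assumes each subject's rows are already sorted by TIME and that A1 at a subject's first record equals that record's AMT.

```r
# Hypothetical helper: fill in A1 for one subject's rows
compute_a1 <- function(subject_df, k) {
  # Assumption: the amount at the first record is the dose given there
  subject_df$A1[1] <- subject_df$AMT[1]
  # seq_len(n)[-1] is empty for a single-row subject, so the loop is skipped
  for (i in seq_len(nrow(subject_df))[-1]) {
    dt <- subject_df$TIME[i] - subject_df$TIME[i - 1]
    subject_df$A1[i] <- subject_df$AMT[i] + subject_df$A1[i - 1] * exp(-dt * k)
  }
  subject_df
}

# Toy data: subject 1 is dosed at TIME 0 and 2, subject 2 at TIME 0 only
toy <- data.frame(
  ID   = c(1, 1, 1, 2, 2),
  TIME = c(0, 1, 2, 0, 1),
  AMT  = c(100, 0, 100, 100, 0),
  A1   = NA_real_
)

k <- 2 / 10  # C / V, as in the question

# Split by ID, compute per subject, then bind the pieces back together
result <- do.call(rbind, lapply(split(toy, toy$ID), compute_a1, k = k))
rownames(result) <- NULL
result
```

Because each subject is processed in its own call, the calculation restarts automatically at every new ID.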
Do you want this:
dosetimes <- c(0,6,12,18) # must be defined before ddf is built
ddf <- data.frame("ID"=1,"TIME"=sort(unique(c(seq(0,30,1),dosetimes))),"AMT"=0,"A1"=NA,"WT"=NA)
myfn = function(df){
dosetimes <- c(0,6,12,18)
doserows <- subset(df, TIME%in%dosetimes)
doserows$AMT[doserows$TIME==dosetimes[1]] <- 100
doserows$AMT[doserows$TIME==dosetimes[2]] <- 100
doserows$AMT[doserows$TIME==dosetimes[3]] <- 100
doserows$AMT[doserows$TIME==dosetimes[4]] <- 100
#Add back dose information
df <- rbind(df,doserows)
df <- df[order(df$TIME,-df$AMT),]
df <- subset(df, (TIME==0 & AMT==0)==F)
df$A1[(df$TIME==0)] <- df$AMT[(df$TIME ==0)]
#Time-dependent covariate
df$WT <- 70
df$WT[df$TIME >= 12] <- 120
#The calculations are done in a for-loop. Here is the code for it:
#values needed for the calculation
C <- 2
V <- 10
k <- C/V
#I would like this part to be written as a function
for(i in 2:nrow(df))
{
t <- df$TIME[i]-df$TIME[i-1]
A1last <- df$A1[i-1]
df$A1[i] = df$AMT[i]+ A1last*exp(-t*k)
}
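print(head(df)) # print() is needed to show output from inside a function
plot(A1~TIME, data=df, type="b", col="blue", ylim=c(0,150))
invisible(df) # return the updated data frame so the caller can keep it
}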
myfn(ddf)
For multiple calls:
for(i in unique(ddf$ID)) { # one call per subject ID present in ddf
myfn(ddf[ddf$ID==i,])
readline(prompt="Press <Enter> to continue...")
}