How to subset rows with strings - r

I want to use a function to repeatedly create subsets with different names.
For example, suppose I have 5 random vectors:
number1<-sample(1:10, 3)
number2<-sample(1:10, 3)
number3<-sample(1:10, 3)
number4<-sample(1:10, 3)
number5<-sample(1:10, 3)
Then I use these vectors to select rows from a raw data set (i.e. a data frame):
testset1<-raw[number1,]
testset2<-raw[number2,]
testset3<-raw[number3,]
testset4<-raw[number4,]
testset5<-raw[number5,]
Writing out each command takes up a lot of space in the manuscript, so I'm trying to shorten these commands by using a function.
However, I found it hard to use a variable inside a function to build part of an object name (a 'text argument'). For example, it is easy to use variables like this:
mean_function<-function(x){
mean(x)
}
But I want to use a function like this:
testset "number with 1-5" <-raw[number"number 1-5",]
I would really appreciate your help.

You don't need to create a function for this task, simply use lapply to loop over the list of elements produced by mget(), then set some names and finally put all results in the global environment:
rowSelected <-lapply(mget(paste0("number", 1:5)), function(x) raw[x, ])
names(rowSelected) <- paste0("testset", 1:5)
list2env(rowSelected, envir = .GlobalEnv)
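If the subsets do not have to exist as separate objects, keeping both the index vectors and the results in lists avoids mget() and list2env() altogether. This is a minimal sketch, not the code from the answer above: the raw data frame is a hypothetical stand-in, and idx and testsets are illustrative names.

```r
set.seed(42)

# Hypothetical stand-in for the asker's raw data frame
raw <- data.frame(id = 1:10, value = rnorm(10))

# Five index vectors kept in one list instead of number1 ... number5
idx <- replicate(5, sample(1:10, 3), simplify = FALSE)

# One subset per index vector, named testset1 ... testset5
testsets <- lapply(idx, function(rows) raw[rows, ])
names(testsets) <- paste0("testset", seq_along(testsets))
```

Each subset is then available as testsets$testset1, testsets[["testset2"]] and so on, and the whole collection can be passed to lapply again later.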

Related

How to use lapply function to instead of for loop in a custom function which is not vectorized in arguments

Firstly, let us generate data like this:
library(data.table)
data <- data.table(date = as.Date("2015-05-01")+0:299)
set.seed(123)
data[,":="(
a = round(30*cumprod(1+rnorm(300,0.001,0.05)),2),
b = rbinom(300,5000,0.8)
)]
Then I want to apply my custom function to multiple columns multiple times without typing everything out manually. For example, my custom function is add <- function(x,n) (x+n).
I provide my for loops code as following:
add <- function(x,n) (x+n)
n <- 3
freture_old <- c("a","b")
for(i in 1:n ){
data[,(paste0(freture_old,"_",i)) := add(.SD,i),.SDcols =freture_old ]
}
Could you please show me an lapply version to use instead of the for loop?
If all you want is to use an lapply loop instead of a for loop you really do not need to change much. For a data.table object it is even easier since every iteration will change the data.table without having to save a copy to the global environment. One thing I add just to suppress the output to the console is to wrap an invisible around it.
lapply(1:n,function(i) data[,paste0(freture_old,"_",i):=lapply(.SD,add,i),.SDcols =freture_old])
Note that if you assign this lapply to an object you will get a list whose length is the number of iterations, in this case 3. Because := updates the data.table by reference, every element of that list is just the same updated data.table, so the list is of no use: you are really only interested in data itself. Therefore just run the code without assigning it to a variable. Now if you do not assign it to anything you will get every iteration printed out to the console. So what I would suggest is to wrap an invisible around it like this:
invisible(lapply(1:n,function(i) data[,paste0(freture_old,"_",i):=lapply(.SD,add,i),.SDcols =freture_old]))
Hope this helps and let me know if you need me to add anything else to this answer. Good luck!
An option without an R "loop" (quoted since ultimately it's a loop at some level somewhere):
data[,
c(outer(freture_old, seq_len(n), paste, sep="_")) :=
as.data.table(matrix(outer(as.matrix(.SD), seq_len(n), add), .N)),
.SDcols=freture_old]
Or equivalently in base R:
setDF(data)
cbind(data, matrix(outer(as.matrix(data[, freture_old]), seq_len(n), add),
nrow(data)))

get() not working for column in a data frame in a list in R (phew)

I have a list of data frames. I want to use lapply on a specific column for each of those data frames, but I keep throwing errors when I tried methods from similar answers:
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")})
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can return values from function(a[[1]]$DIM) all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place you are better off using a for loop, since lapply would require the <<- assignment operator (a plain <- inside the function passed to lapply only changes a local copy). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
})
)
You have to use invisible(), otherwise lapply would print the output on the console. The <<- assigns the vector runif(...) to the newly created column.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
return(aList[[x]])
})
Also, may I suggest the use of seq_along(list) in lapply and for loops as opposed to 1:length(list) since it avoids unexpected behavior such as:
# no length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0.
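As for what's up with get(): it looks up an object by its name, so the string "a[[1]]$DIM" is treated as one (non-existent) object name rather than as an expression to evaluate. Passing the data frames themselves to lapply avoids building strings at all. A minimal sketch, assuming two small made-up data frames in place of the asker's list and mean as a placeholder for "some function on the DIMs":

```r
# Made-up stand-ins for the asker's data frames, each with a DIM column
a <- list(
  data.frame(DIM = c(1, 2, 3)),
  data.frame(DIM = c(10, 20))
)

# get() works with a plain object name ...
x <- 5
get("x")

# ... but "a[[1]]$DIM" is an expression, not a name, so get() cannot find it.
# Extract the column from each data frame directly instead:
dims <- lapply(a, function(df) df$DIM)

# Placeholder for "some function on the DIMs"
results <- lapply(dims, mean)
```

The extraction step can also be written as lapply(a, "[[", "DIM"), which does the same thing without an anonymous function.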

r- call function inside for loop alternative

my_function <- function(n){}
result = list()
for(i in 0:59){
result[i] = my_function(i)
}
write.csv(result, "result.csv")
New to R, read that for-loops are bad in R, so is there an alternative to what I'm doing? I'm basically trying to call my_function with a parameter that's increasing, and then write the results to a file.
Edit
Sorry I didn't specify that I wanted to use some function of i as a parameter for my_function, 12 + (22*i) for example. Should I create a list of values and then call lapply with that list of values?
for loops are fine in R, but they're syntactically inefficient in a lot of use cases, especially simple ones. The apply family of functions usually makes a good substitute.
result <- lapply(0:59, my_function)
write.csv(result, "result.csv")
Depending on what your function's output is, you might want sapply rather than lapply.
Edit:
Per your update, you could do it as you say, creating the vector of values first, or you could just do something like:
lapply(12+(22*0:59), my_function)
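If my_function returns a single number per call, collecting the results in a vector and writing them next to their inputs tends to give a tidier CSV than writing a raw list. A minimal sketch under that assumption: the body of my_function below is a made-up placeholder, and a temporary file stands in for "result.csv".

```r
# Placeholder body: the real my_function is not shown in the question
my_function <- function(n) n^2

# The parameter values from the edit: 12 + 22*i for i = 0..59
inputs <- 12 + 22 * 0:59

# vapply checks that every call returns exactly one numeric value
values <- vapply(inputs, my_function, numeric(1))

# One row per call: the input and the corresponding result
result <- data.frame(input = inputs, value = values)

# Temporary file used here in place of "result.csv"
out_file <- tempfile(fileext = ".csv")
write.csv(result, out_file, row.names = FALSE)
```

If my_function returns something more complex (a vector or a data frame per call), lapply followed by do.call(rbind, ...) is a common alternative.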

Replace characters in a column, based on a translation table from another data frame

I have a data.frame mapping which contains path and map.
I also have another data.frame DATA which contains the raw path and value.
EDIT: Path might have two components or more: e.g. "A>C" or "A>C>B"
set.seed(24);
DATA <- data.frame(
path=paste0(sample(LETTERS[1:3], 25, replace=TRUE), ">", sample(LETTERS[1:3], 25, replace=TRUE)),
value=rnorm(25)
)
mapping <- data.frame(path=c("A","B","C"), map=c("X","Y","Z"))
lapply(mapping, function (x) {
for (i in 1:nrow(DATA)) {
DATA$path[i] <- gsub(as.character(x["path"]),as.character(x["map"]),as.character(DATA$path[i]))
}
})
I'm trying to replace the path in DATA with the map value in mapping but this doesn't seem to be working for me.
"A>C" will be converted to "X>Z".
I understand that for loops are not good in R, but I can't think of another way to code it. The data I'm working with has 6 million rows in DATA and 16k rows in mapping.
Clarification on Data: While the path consists of alphabets (ABC) now, the real path are actually domain names. Number of steps in a path is also not fixed at 2 and can be any number.
You can use chartr
DATA$path <- chartr('ABC', 'XYZ', DATA$path)
Or if we are using the data from 'mapping'
DATA$path <- chartr(paste(mapping$path, collapse=''),
paste(mapping$map, collapse=''), DATA$path)
Or using gsubfn
library(gsubfn)
pat <- paste0('[', paste(mapping$path, collapse=''),']')
indx <- setNames(as.character(mapping$map), mapping$path)
gsubfn(pat, as.list(indx), as.character(DATA$path))
Or a base R option based on @smci's comment
vapply(strsplit(as.character(DATA$path), '>'), function(x)
paste(indx[x], collapse=">"), character(1L))
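The split-and-lookup idea also covers paths with any number of steps and multi-character components, which the single-character chartr approach does not. A self-contained sketch with a small made-up paths vector (lookup and translated are illustrative names, not from the answer):

```r
mapping <- data.frame(path = c("A", "B", "C"),
                      map  = c("X", "Y", "Z"),
                      stringsAsFactors = FALSE)

# Made-up paths of different lengths
paths <- c("A>C", "A>C>B", "B")

# Named vector used as a lookup table: names are the old values
lookup <- setNames(mapping$map, mapping$path)

# Split each path on ">", translate every component, and glue it back together
translated <- vapply(
  strsplit(paths, ">", fixed = TRUE),
  function(parts) paste(lookup[parts], collapse = ">"),
  character(1)
)
translated
# "X>Z" "X>Z>Y" "Y"
```

Components that are missing from mapping come back as NA from the lookup, so they would show up as the string "NA" in the result; that case needs separate handling if it can occur.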
Using data.table (1.9.5+), especially advisable b/c of the size of your data.
library(data.table)
setDT(DATA); setDT(mapping)
DATA[,paste0("path",1:2):=tstrsplit(path,split=">")]
setkey(DATA,path1)[mapping,new.path1:=i.map]
setkey(DATA,path2)[mapping,new.path2:=i.map]
DATA[,new.path:=paste0(new.path1,">",new.path2)]
If you want to get rid of the extra columns:
DATA[,paste0(c("","","new.","new."),"path",rep(1:2,2)):=NULL]
If you just want to overwrite path, use path on the LHS of the last line instead of new.path.
This could also be written more concisely:
library(data.table)
setDT(mapping)
setkey(setkey(setDT(DATA)[,paste0("path",1:2):=tstrsplit(path,split=">")
],path1)[mapping,new.path1:=i.map],path2
)[mapping,new.path:=paste0(new.path1,">",i.map)]
I think you're using the wrong apply.
mapply allows you to use two arguments to the function, here the path and the map. Note that in mapply, the argument FUN comes first. You also do not need to do this row by row, you can just do the entire column at once. Finally, in an apply the variables do not get updated as they do in a for loop, so you need to assign them in the .GlobalEnv. You can do this with an explicit call to assign() or using <<- which assigns them in the first place it finds them in the stack. In this case, that will be back in .GlobalEnv.
After defining mapping and DATA as you do above, try this.
head(DATA)
invisible(mapply( function (x,y) {
DATA$path <<- gsub(x,y,DATA$path)
},mapping$path, mapping$map))
head(DATA)
Note that the call to invisible suppresses output from mapply.
If you really want to use lapply, you can. But you need to transpose mapping. You can do that but it will be converted to a matrix, so you have to convert it back. Then, you can just use the same tricks with <<- and not using a for loop as above to get this code:
invisible(lapply(as.data.frame(t(mapping)), function (x) {
DATA$path <<- gsub(x[1],x[2],DATA$path)
}))
head(DATA)
Thanks for sharing, I learned a lot answering this question.
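For what it's worth, the same sequential replacement can be done without <<- by threading the column through Reduce. This is a sketch of an alternative, not the method above, using a small made-up DATA. It assumes the replacements do not feed into each other (no map value contains a later path value) and that plain substring matching is what you want, hence fixed = TRUE.

```r
mapping <- data.frame(path = c("A", "B", "C"),
                      map  = c("X", "Y", "Z"),
                      stringsAsFactors = FALSE)

# Small made-up stand-in for DATA
DATA <- data.frame(path = c("A>C", "B>A", "C>C"),
                   value = c(0.1, 0.2, 0.3),
                   stringsAsFactors = FALSE)

# Apply one gsub per mapping row, passing the updated column along each time
DATA$path <- Reduce(
  function(paths, i) gsub(mapping$path[i], mapping$map[i], paths, fixed = TRUE),
  seq_len(nrow(mapping)),
  DATA$path
)
DATA$path
# "X>Z" "Y>X" "Z>Z"
```

Because the result is assigned once at the end, nothing outside the call is modified as a side effect.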

Loop within a function and automatically create objects in R

I am trying to calculate the column means for different groups in R. There are several ways to assign the groups, so two columns were created that contain different groupings.
# create a test df
df.abcd.2<-data.frame(Grouping1=c("a","f","a","d","d","f","a"),Grouping2=c("y","y","z","z","x","x","q"),Var1=sample(1:7),Var2=sample(1:7),Var3=rnorm(1:7))
df.abcd.2
Now I created a loop with assign, lapply, split and colMeans to get my results and store them in different data frames. The loop works fine.
#Loop to create the colmeans and store them in dataframes
for (i in 1:2){
nam <- paste("RRRRRR",deparse(i), sep=".")
assign(nam, as.data.frame(
lapply(
split(df.abcd.2[,3:5], df.abcd.2[,i]), colMeans)
)
)
}
So now I would like to create a function to apply this method to different data frames. My attempt looked like this:
# 1. function to calculate colMeans for different groups
# df = desired data frame
# a = first, b = last of the columns that contain the groups
# c = first, d = last of the columns to be analyzed
function.split.colMeans<-function(df,a,b,c,d)
{for (i in a:b){
nam <- paste("OOOOO",deparse(i), sep=".")
assign(nam, as.data.frame(
lapply(
split(df[,c:d], df[,i]), colMeans)
)
)
}
}
#test the function
function.split.colMeans(df.abcd.2,1,2,3,5)
So when I test this function I get neither an error message nor results... Can anyone help me out, please?
It's working perfectly. Read the help for assign. Learn about frames and environments.
In other words, it's creating the variables inside your function, but they don't leak out into the environment you see when you do ls() at the command line. If you put print(ls()) inside your function's loop you'll see them, but when the function ends, they disappear.
Normally, the only way functions interact with their calling environment is by their return value. Any other method is entering a whole world of pain.
DON'T use assign to create things with sequential or informative names. Ever. Unless you know what you are doing, which you don't... Stick them in lists, then you can index the parts for looping and so on.
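To make the list suggestion concrete, here is a sketch of how the function could return its results instead of assigning them. The name split_col_means and its arguments are made up for illustration, and the test data uses fixed values rather than sample() and rnorm() so the result is reproducible.

```r
# Returns one data frame of column means per grouping column, as a named list
split_col_means <- function(df, group_cols, value_cols) {
  out <- lapply(group_cols, function(g) {
    groups <- split(df[, value_cols, drop = FALSE], df[[g]])
    as.data.frame(lapply(groups, colMeans))
  })
  names(out) <- names(df)[group_cols]
  out
}

# Fixed-value version of the test data frame from the question
df.abcd.2 <- data.frame(
  Grouping1 = c("a", "f", "a", "d", "d", "f", "a"),
  Grouping2 = c("y", "y", "z", "z", "x", "x", "q"),
  Var1 = 1:7,
  Var2 = 7:1,
  Var3 = c(0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5)
)

res <- split_col_means(df.abcd.2, group_cols = 1:2, value_cols = 3:5)
names(res)
# "Grouping1" "Grouping2"
```

The results are then available as res$Grouping1 and res$Grouping2 in the calling environment, and no assign() is needed.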

Resources