As an example, I want a function that will iterate over the columns in a dataframe and print out each column's data type (e.g., "numeric", "integer", "character", etc)
Without a variable I know I can do class(df$MyColumn) and get the data type. How can I change it so "MyColumn" is a variable?
What I'm trying is
f <- function(df) {
for(column in names(df)) {
columnClass = class(df[column])
print(columnClass)
}
}
But this just prints out [1] "data.frame" for each column.
Since a data frame is simply a list, you can loop over the columns using lapply and apply the class function to each column:
lapply(df, class)
To address the concern raised in User's comment: if you write a function that does whatever you want to each column, then this will work:
func <- function(col) {print(class(col))}
lapply(df, func)
It's really mostly equivalent to:
for(col in names(df) ) { print(class(df[[col]]))}
And there would not be an unneeded 'colClass' variable cluttering up the .GlobalEnv.
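To see why the original attempt printed "data.frame", here is a minimal sketch comparing the two kinds of indexing. The data frame `df` and its columns are made up purely for illustration.

```r
# Hypothetical data frame, for illustration only
df <- data.frame(n = c(1.5, 2.5), i = 1:2, s = c("a", "b"),
                 stringsAsFactors = FALSE)

single <- class(df["n"])    # single brackets return a one-column data frame
double <- class(df[["n"]])  # double brackets extract the column vector itself

# Class of every column, collected into a named character vector
col_classes <- vapply(df, function(col) class(col)[1], character(1))
print(col_classes)
```

Using `vapply` instead of `lapply` is just a convenience here: it returns a plain character vector rather than a list.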
Use a comma before column:
for(column in names(df)) {
columnClass = class(df[,column])
print(columnClass)
}
Much as DWin suggested
apply(df,2,class)
But you say you want to do more with each column?
What do you want to do? Try to avoid abstract examples.
In case it helps
apply(df,2,mean)
apply(df,2,sd)
or something more complicated
apply(df,2,function(x){s = c(summary(x)["Mean"], summary(x)["Median"], sd(x))})
Note that the summary function gives you most of this functionality anyway, but this is just an example. Any function can be placed inside of an apply and iterated over the columns of a matrix or data frame. That function can be as complex or as simple as you need it to be.
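One caveat with apply on a data frame: apply converts its input to a matrix before iterating, so when the columns have mixed types every column is seen as character. The sketch below uses a made-up two-column data frame to show the difference from sapply, which works on the columns directly.

```r
# Hypothetical mixed-type data frame, for illustration only
df <- data.frame(x = c(1, 2, 3), label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# apply() coerces the data frame to a character matrix first
via_apply <- apply(df, 2, class)

# sapply() iterates over the columns as they are stored
via_sapply <- sapply(df, class)

print(via_apply)
print(via_sapply)
```

For all-numeric data frames the two approaches agree, so apply(df, 2, mean) and similar calls are fine in that case.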
You can use the colwise function of the plyr package to transform any function into a column wise function. This is a wrapper for lapply.
library(plyr)
colwise.print.class<-colwise(.fun=function(col) {print(class(col))})
colwise.print.class(df)
You can view the function created with
print(colwise.print.class)
This topic has been covered numerous times I see but I can't really get the answer I'm looking for. Thus, here I go.
I am trying to do a loop to create variables in 5 data sets that have similar names as such:
Ech_repondants_nom_1
Ech_repondants_nom_2
Ech_repondants_nom_3
Ech_repondants_nom_4
Ech_repondants_nom_5
Below is the code that I have tried:
list <- c(1:5)
for (i in list) {
Ech_repondants_nom_[[i]]$sec = as.numeric(Ech_repondants_nom_[[i]]$interviewtime)
Ech_repondants_nom_[[i]]$min = round(Ech_repondants_nom_[[i]]$sec/60,1)
Ech_repondants_nom_[[i]]$heure = round(Ech_repondants_nom_[[i]]$min/60,1)
}
Any clues why this does not work?
cheers!
These are object names, not elements of a list, so they cannot be subset as Ech_repondants_nom_[[i]]. We would need to build the name with paste0 and fetch the object with get, i.e.
get(paste0("Ech_repondants_nom_", i))$sec
but then, if we need to update the original object, we have to call assign. Instead of all this, it can be done more easily if we load the datasets into a list and loop over the list with lapply
lst1 <- lapply(mget(paste0("Ech_repondants_nom_", 1:5)), function(dat)
within(dat, {sec <- as.numeric(interviewtime);
min <- round(sec/60, 1);
heure <- round(min/60, 1)}))
It may be better to keep it as a list, but if we need to update the original object, use list2env
list2env(lst1, .GlobalEnv)
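As a worked sketch of the list approach, the toy data frames below stand in for the real data sets (only two are created, and the `interviewtime` values are invented). `environment()` is used instead of `.GlobalEnv` so the snippet behaves the same wherever it is run; at the top level the two are identical.

```r
# Toy stand-ins for the real data sets; interviewtime is stored as text here
Ech_repondants_nom_1 <- data.frame(interviewtime = c("120", "3600"))
Ech_repondants_nom_2 <- data.frame(interviewtime = c("90", "7200"))

noms <- paste0("Ech_repondants_nom_", 1:2)

# Collect the objects into a named list, looking them up by name
dfs <- mget(noms, envir = environment())

# Add the three derived columns to every data frame in the list
lst1 <- lapply(dfs, function(dat) {
  dat$sec   <- as.numeric(dat$interviewtime)
  dat$min   <- round(dat$sec / 60, 1)
  dat$heure <- round(dat$min / 60, 1)
  dat
})

# Optional: overwrite the original objects with the updated versions
invisible(list2env(lst1, envir = environment()))
```

Keeping everything in `lst1` and skipping the last step is usually easier to maintain, since later steps can loop over the list again.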
Ech_repondants_nom_[[i]]
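This isn't actually selecting that data frame, because you can't refer to objects like that. Try creating a function that takes a data frame as an argument, then iterating through the data frames:
changing_time_stamp <- function(df) {
  df$sec = as.numeric(df$interviewtime)
  df$min = round(df$sec/60, 1)
  df$heure = round(df$min/60, 1)
  df  # return the modified data frame
}
for (i in 1:5) {
  # build the object name, fetch it, transform it, and store it back
  name <- paste0("Ech_repondants_nom_", i)
  assign(name, changing_time_stamp(get(name)))
}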
EDIT: I fixed some of the variable names in the function
Here is how I created a number of data sets named data_1, data_2, data_3, and so on.
The initial data is a matrix with 500 rows and 17 columns:
for ( i in 1:length(unique( data$cluster ))) {
assign(paste("data", i, sep = "_"),subset(data[data$cluster == i,]))
}
Up to this point everything is fine.
Now I am trying to use these inside another loop, one by one, like this:
for (i in 1:5) {
data<- paste(data, i, sep = "_")
}
However, this does not give me the data in the required format.
Any help will be really appreciated.
Thank you in advance
Let me give you a tip here: don't just assign everything in the global environment, but use lists for this. That way you avoid all the things that can go wrong when meddling with the global environment. The code in your question will overwrite the original dataset data, so you'll be in trouble if you want to rerun that code when something goes wrong: you'll have to reconstruct the original data frame.
Second: If you need to split a data frame based on a factor and carry out some code on each part, you should take a look at split, by and tapply, or at the plyr and dplyr packages.
Using Base R
With base R, it depends on what you want to do. In the most general case you can use a combination of split() and lapply or even a for loop:
mylist <- split( data, f = data$cluster)
for(mydata in mylist){
print(head(mydata))  # an explicit print() is needed inside a for loop
...
}
Or
mylist <- split( data, f = data$cluster)
result <- lapply(mylist, function(mydata){
doSomething(mydata)
})
Which one you use depends largely on what the result should be. If you need some kind of summary for every subset, lapply will give you a list with the results per subset. If you need this for a simulation, for plotting, or similar side effects, the for loop is the better fit.
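A small worked sketch of the split() and lapply() combination follows. The data frame contents and the `value` column are invented for illustration; only the `cluster` column comes from the question.

```r
# Hypothetical data with a cluster column, for illustration only
data <- data.frame(cluster = c(1, 1, 2, 2, 3),
                   value   = c(10, 20, 30, 50, 70))

# One data frame per cluster, named after the cluster value
mylist <- split(data, f = data$cluster)

# One summary per subset, returned as a named list
means <- lapply(mylist, function(mydata) mean(mydata$value))

# Combine the per-cluster results back into a single data frame
summary_df <- data.frame(cluster = names(means),
                         mean_value = unlist(means),
                         row.names = NULL)
print(summary_df)
```

Because the list elements are named after the cluster values, an individual subset can still be reached with `mylist[["2"]]` instead of a separate `data_2` object.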
If you want to add some variables based on other variables, then the plyr or dplyr packages come in handy
Using plyr and dplyr
These packages come in especially handy if the result of your code is going to be an array or data frame of some kind. This would be similar to using split and lapply, but in a way Hadley approves of :-)
For example:
library(plyr)
result <- ddply(data, .(cluster),
function(mydata){
doSomething(mydata)
})
Use dlply if the result should be a list.
I need to apply transformations to all numeric variables of a large dataframe. The dataframe has variables of other types as well. My initial idea was to iterate over all the columns, check if they are numerical and then divide them by 1000.
I've got stuck in my code for a function, would appreciate some pointers here:
transformDivideThousand <- function(data_frame){
for(i in ncol(data_frame)){
if (is.numeric(data_frame[i])) {
data_frame[i]/1000
}
}
return(data_frame)
}
The execution of the function:
test <- transformDivideThousand(mypatients)
test is a dataframe, but the transformations are not happening. Where did I err?
As an extra, I would also like transformDivideThousand to have an optional argument where I could pass a list with the names of the variables to use; if it is empty, then iterate over all of them.
@nicola's comment explains what's going wrong with your loop. Another option is to use sapply to identify the numeric columns, which results in more succinct code. For example, using the built-in iris data frame:
iris[, sapply(iris, is.numeric)] =
iris[, sapply(iris, is.numeric)]/1000
You can just run this directly on a data frame, as above, or put it inside a function:
tDT <- function(data_frame) {
data_frame[, sapply(data_frame, is.numeric)] =
data_frame[, sapply(data_frame, is.numeric)]/1000
return(data_frame)
}
Then, to run it:
iris.new = tDT(iris)
For future reference, per @nicola's comment, here's how to make the for loop version work:
tDT2 <- function(data_frame) {
for (i in 1:ncol(data_frame)) {
if (is.numeric(data_frame[,i])) {
data_frame[,i] = data_frame[,i]/1000
}
}
return(data_frame)
}
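The question also asked for an optional argument naming the columns to transform. One possible sketch is below; the argument name `cols` and the error on unknown names are assumptions of this example, not something specified in the thread.

```r
# `cols` is a hypothetical optional argument: NULL means "all columns"
transformDivideThousand <- function(data_frame, cols = NULL) {
  if (is.null(cols)) {
    cols <- names(data_frame)
  }
  # Fail early if a requested column does not exist
  stopifnot(all(cols %in% names(data_frame)))
  for (col in cols) {
    # Non-numeric columns are left untouched even if they were requested
    if (is.numeric(data_frame[[col]])) {
      data_frame[[col]] <- data_frame[[col]] / 1000
    }
  }
  data_frame
}

scaled_all  <- transformDivideThousand(iris)
scaled_some <- transformDivideThousand(iris, cols = c("Sepal.Length", "Species"))
```

In the second call only Sepal.Length is rescaled: Species is a factor, so the is.numeric check skips it, and the remaining numeric columns were not requested.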
I have a custom function that I want to apply to any dataset that shares a common name.
common_funct=function(rank_p=5){
df = ANY_DATAFRAME_HERE[ANY_DATAFRAME_HERE$rank <rank_p,]
return(df)
}
I know with common functions I could do something like below to get the value of each.
apply(mtcars,1,mean)
But what if I wanted to do :
apply(any_dataset, 1, common_funct(anyvalue))
How would I pass that along?
library(dplyr)
mtcars$rank = dense_rank(mtcars$mpg)
iris$rank = dense_rank(iris$Sepal.Length)
Now how would I go about applying my same function to both values?
If I understand your question, I would suggest putting your data frames into a list and applying the function over it. So
## Your example function
common_funct=function(df, rank_p=5){
df[df$rank <rank_p,]
}
## Sanity check
common_funct(mtcars)
common_funct(iris)
Next create a list of the data frames
l = list(mtcars, iris)
and use lapply
lapply(l, common_funct)
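To pass a non-default `rank_p`, extra arguments can be given to lapply after the function and are forwarded to every call. The sketch below is self-contained, so it replaces dplyr's dense_rank with a small base R helper (`dense`, a name made up here) and works on copies of the built-in data sets.

```r
common_funct <- function(df, rank_p = 5) {
  df[df$rank < rank_p, ]
}

# Base R stand-in for dplyr::dense_rank (hypothetical helper, no NA handling)
dense <- function(x) match(x, sort(unique(x)))

cars <- mtcars
cars$rank <- dense(cars$mpg)
flowers <- iris
flowers$rank <- dense(flowers$Sepal.Length)

# Naming the list elements keeps the results identifiable
l <- list(cars = cars, flowers = flowers)

# rank_p = 3 is forwarded to common_funct for each data frame
filtered <- lapply(l, common_funct, rank_p = 3)
```

The result is a list with the same names as `l`, so `filtered$cars` and `filtered$flowers` hold the filtered rows of each data set.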
I have a data.frame mapping which contains path and map.
I also have another data.frame DATA which contains the raw path and value.
EDIT: Path might have two components or more: e.g. "A>C" or "A>C>B"
set.seed(24);
DATA <- data.frame(
path=paste0(sample(LETTERS[1:3], 25, replace=TRUE), ">", sample(LETTERS[1:3], 25, replace=TRUE)),
value=rnorm(25)
)
mapping <- data.frame(path=c("A","B","C"), map=c("X","Y","Z"))
lapply(mapping, function (x) {
for (i in 1:nrow(DATA)) {
DATA$path[i] <- gsub(as.character(x["path"]),as.character(x["map"]),as.character(DATA$path[i]))
}
})
I'm trying to replace the path in DATA with the map value in mapping but this doesn't seem to be working for me.
"A>C" will be converted to "X>Z".
I understand that for loops are not good in R, but I can't think of another way to code it. Data size I'm working with is 6m row in DATA and 16k rows in mapping.
Clarification on Data: While the path consists of alphabets (ABC) now, the real path are actually domain names. Number of steps in a path is also not fixed at 2 and can be any number.
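You can use chartr (this translates character by character, so it only works while every path component is a single character, as in the example data)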
DATA$path <- chartr('ABC', 'XYZ', DATA$path)
Or if we are using the data from 'mapping'
DATA$path <- chartr(paste(mapping$path, collapse=''),
paste(mapping$map, collapse=''), DATA$path)
Or using gsubfn
library(gsubfn)
pat <- paste0('[', paste(mapping$path, collapse=''),']')
indx <- setNames(as.character(mapping$map), mapping$path)
gsubfn(pat, as.list(indx), as.character(DATA$path))
Or a base R option based on @smci's comment
vapply(strsplit(as.character(DATA$path), '>'), function(x)
paste(indx[x], collapse=">"), character(1L))
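Since the real path components are domain names and paths can have any number of steps, here is a sketch of the same split-and-look-up idea on multi-character components. The domain names and the `lookup` and `paths` objects are invented for illustration.

```r
# Hypothetical mapping with multi-character components
mapping <- data.frame(path = c("a.com", "b.org", "c.net"),
                      map  = c("X", "Y", "Z"),
                      stringsAsFactors = FALSE)
paths <- c("a.com>c.net", "a.com>c.net>b.org", "b.org")

# Named vector: names are the raw components, values are the replacements
lookup <- setNames(mapping$map, mapping$path)

# Split each path on ">", translate every component, and glue it back together
mapped <- vapply(strsplit(paths, ">", fixed = TRUE),
                 function(parts) paste(lookup[parts], collapse = ">"),
                 character(1))
print(mapped)
```

A component that is missing from the mapping comes back as the string "NA" in this sketch, so it may be worth checking for unmapped components first on real data.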
Using data.table (1.9.5+), especially advisable b/c of the size of your data.
library(data.table)
setDT(DATA); setDT(mapping)
DATA[,paste0("path",1:2):=tstrsplit(path,split=">")]
setkey(DATA,path1)[mapping,new.path1:=i.map]
setkey(DATA,path2)[mapping,new.path2:=i.map]
DATA[,new.path:=paste0(new.path1,">",new.path2)]
If you want to get rid of the extra columns:
DATA[,paste0(c("","","new.","new."),"path",rep(1:2,2)):=NULL]
If you just want to overwrite path, use path on the LHS of the last line instead of new.path.
This could also be written more concisely:
library(data.table)
setDT(mapping)
setkey(setkey(setDT(DATA)[,paste0("path",1:2):=tstrsplit(path,split=">")
],path1)[mapping,new.path1:=i.map],path2
)[mapping,new.path:=paste0(new.path1,">",i.map)]
I think you're using the wrong apply.
mapply allows you to use two arguments to the function, here the path and the map. Note that in mapply, the argument FUN comes first. You also do not need to do this row by row; you can process the entire column at once. Finally, assignments made inside the function passed to an apply-family call are local to that function, unlike in a for loop, so to update DATA you need to assign into the .GlobalEnv. You can do this with an explicit call to assign() or with <<-, which walks up the enclosing environments and assigns to the first existing variable of that name it finds. In this case, that will be in .GlobalEnv.
After defining mapping and DATA as you do above, try this.
head(DATA)
invisible(mapply( function (x,y) {
DATA$path <<- gsub(x,y,DATA$path)
},mapping$path, mapping$map))
head(DATA)
Note that the call to invisible suppresses the output from mapply.
If you really want to use lapply, you can. But you need to transpose mapping. You can do that but it will be converted to a matrix, so you have to convert it back. Then, you can just use the same tricks with <<- and not using a for loop as above to get this code:
invisible(lapply(as.data.frame(t(mapping)), function (x) {
DATA$path <<- gsub(x[1],x[2],DATA$path)
}))
head(DATA)
Thanks for sharing, I learned a lot answering this question.
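If assigning into the global environment is undesirable, the same replacement can be written as a function that returns the updated paths. This is only a sketch: the function name `replace_paths` is made up, and `fixed = TRUE` is an assumption so that characters such as "." in domain names are matched literally.

```r
# Hypothetical helper: applies each mapping row in turn and returns the result
replace_paths <- function(paths, mapping) {
  paths <- as.character(paths)
  for (k in seq_len(nrow(mapping))) {
    paths <- gsub(as.character(mapping$path[k]),
                  as.character(mapping$map[k]),
                  paths, fixed = TRUE)
  }
  paths
}

mapping <- data.frame(path = c("A", "B", "C"), map = c("X", "Y", "Z"))
new_paths <- replace_paths(c("A>C", "A>C>B"), mapping)
print(new_paths)
```

Like the gsub versions above, this matches substrings rather than whole components, and an earlier replacement can be picked up by a later pattern, so it is only safe when the replacement values cannot collide with the remaining patterns.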