I have a question here about print() inside a for loop.
I have a dataset (gpa) with 2 columns. I am trying to get the mean, variance, and standard deviation of the values in the two columns. When I code,
for(x in c(1:2)) {
  mean(gpa[[x]])
  var(gpa[[x]])
  sd(gpa[[x]])
}
I don't get any output. But if I insert print() before each of the lines:
for(x in c(1:2)) {
  print(mean(gpa[[x]]))
  print(var(gpa[[x]]))
  print(sd(gpa[[x]]))
}
then I do get the desired values.
What is the difference here? Is print() really necessary?
The reason for this is that everything inside the loop gets evaluated, but never returned anywhere: R only auto-prints the value of an expression at the top level, not inside a for loop. Although print() does deliver something, I wouldn't advise it, because you can't use those values later; they go to the console rather than into the global environment. Instead you might want to assign them to something.
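A minimal illustration of the auto-printing rule, using a throwaway vector rather than your gpa data:
x <- c(1, 2, 3)
mean(x)                # at the top level, the value of an expression is auto-printed
for (i in 1) mean(x)   # inside a for loop, nothing is auto-printed
m <- mean(x)           # assigning keeps the value around so you can reuse it later
m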
For example, assigning the results into a pre-allocated list:
#example data
df <- data.frame(x = 1:10, y = rnorm(10))
#it is good practice to create the output object at the desired length first;
#it is more efficient in terms of speed and memory used, but here it doesn't really matter
ret <- vector("list", NCOL(df))
for(x in seq_len(NCOL(df))){
  ret[[x]] <- c(mean(df[[x]]),
                var(df[[x]]),
                sd(df[[x]]))
}
ret
Though I wouldn't advise that either. For most (if not all) things you can do with a for loop, you can use the apply family of functions from base R or the map functions from purrr. That would look like this:
library(purrr)
map(df, ~c(mean(.), var(.), sd(.)))
#or even save it with names
ret <- map(df, ~c(mean = mean(.), var = var(.), sd = sd(.)))
ret
The apply/map variants are faster and shorter, but more importantly (to me) they are easier to understand and leave less room for error. There is also a whole bunch of other arguments for why you might want to use apply/map.
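For comparison, here is a base R sketch with sapply on the same toy df; since every column returns a named vector of length 3, sapply simplifies the result to a matrix with one column per variable:
sapply(df, function(col) c(mean = mean(col), var = var(col), sd = sd(col)))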
Related
I have a list of data frames. I want to use lapply on a specific column for each of those data frames, but I keep getting errors when I try methods from similar answers:
The setup is something like this:
a <- list(*a series of data frames that each have a column named DIM*)
dim_loc <- lapply(1:length(a), function(x){paste0("a[[", x, "]]$DIM")})
Eventually, I'll want to write something like results <- lapply(dim_loc, *some function on the DIMs*)
However, when I try get(dim_loc[[1]]), say, I get an error: Error in get(dim_loc[[1]]) : object 'a[[1]]$DIM' not found
But I can return values from function(a[[1]]$DIM) all day long. It's there.
I've tried working around this by using as.name() in the dim_loc assignment, but that doesn't seem to do the trick either.
I'm curious 1. what's up with get(), and 2. if there's a better solution. I'm constraining myself to the apply family of functions because I want to try to get out of the for-loop habit, and this name-as-list method seems to be preferred based on something like R- how to dynamically name data frames?, but I'd be interested in other, more elegant solutions, too.
I'd say that if you want to modify an object in place, you are better off using a for loop, since lapply would require the <<- assignment operator (<- doesn't work inside lapply). Like so:
set.seed(1)
aList <- list(cars = mtcars, iris = iris)
for(i in seq_along(aList)){
  aList[[i]][["newcol"]] <- runif(nrow(aList[[i]]))
}
As opposed to...
invisible(
  lapply(seq_along(aList), function(x){
    aList[[x]][["newcol"]] <<- runif(nrow(aList[[x]]))
  })
)
You have to use invisible(), otherwise lapply would print the output on the console. The <<- assigns the vector runif(...) to the newly created column.
If you want to produce another set of data.frames using lapply then you do:
lapply(seq_along(aList), function(x){
  aList[[x]][["newcol"]] <- runif(nrow(aList[[x]]))
  return(aList[[x]])
})
Also, may I suggest using seq_along(list) in lapply and for loops, as opposed to 1:length(list), since it avoids unexpected behavior such as:
# no length list
seq_along(list()) # prints integer(0)
1:length(list()) # prints 1 0.
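The difference matters most when such a sequence drives a loop; a quick sketch with a hypothetical empty list:
empty <- list()
for (i in 1:length(empty)) print(i)   # runs twice (i = 1, then i = 0) even though the list is empty
for (i in seq_along(empty)) print(i)  # runs zero times, as intended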
my_function <- function(n){}
result = list()
for(i in 0:59){
  result[i] = my_function(i)
}
write.csv(result, "result.csv")
I'm new to R and have read that for loops are bad in R, so is there an alternative to what I'm doing? I'm basically trying to call my_function with a parameter that's increasing, and then write the results to a file.
Edit
Sorry, I didn't specify that I want to use some function of i as the parameter for my_function, 12 + (22*i) for example. Should I create a list of values and then call lapply with that list of values?
for loops are fine in R, but they're syntactically inefficient in a lot of use cases, especially simple ones. The apply family of functions usually makes a good substitute.
result <- lapply(0:59, my_function)
write.csv(result, "result.csv")
Depending on what your function's output is, you might want sapply rather than lapply.
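To illustrate with a stand-in (your real my_function isn't shown): if each call returns a single value, sapply simplifies the list into a vector, which is usually nicer to pass to write.csv.
my_function <- function(n) n^2   # hypothetical stand-in for your function
lapply(0:3, my_function)         # a list of length-one numeric vectors
sapply(0:3, my_function)         # simplified to the numeric vector 0 1 4 9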
Edit:
Per your update, you could do it as you say, creating the vector of values first, or you could just do something like:
lapply(12+(22*0:59), my_function)
In Stata, summarize prints a brief statistical summary of all variables in the current workspace. In R, summary(<myvariable>) does something similar for a particular <myvariable>.
Q: In R, how should I print a statistical summary of ALL relevant variables in my workspace?
I tried:
x <- runif(4)
y <- runif(4)
z <- runif(3)
w <- matrix(runif(4), nrow = 2)
sapply(ls(), function(i) {if (class(get(i)) == "numeric") summary(get(i))})
which gets close to what I want. But it still prints
$w
NULL
...
which is undesirable. Also, this code throws an error when there's a variable of type closure in my workspace...
I feel like I'm going off into the weeds here. There must be a simpler, straightforward way of more-or-less replicating Stata's summarize in R, right?
You can use methods() to determine which variable types work with summary:
summary.methods = methods(summary)
check.method <- function(x){
  #atomic vectors are handled by summary.default; otherwise look for a class-specific method
  is.atomic(x) ||
    any(grepl(paste0('^summary\\.',class(x)[1],'$'),summary.methods))
}

#named helper so the recursion into lists/data frames actually works
summarize_var <- function(z, envir = .GlobalEnv) {
  obj <- get(z, envir = envir)
  if (inherits(obj, c('list','data.frame')))
    lapply(names(obj), summarize_var, envir = as.environment(obj))
  else if (check.method(obj))
    print(summary(obj))
  else
    print(paste0("No summary for: ",z))
}

#invisible() stops lapply's own return value from printing a second time
invisible(lapply(ls(), summarize_var))
You may want to change this depending on how much data you have, but it should work.
Added some recursion for list/data frames.
If you want to get it to work with lists and individual data frame columns, I would check for those classes and use as.environment to get variables from the list/frame. I can show you a more explicit way of doing this later if you like.
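In the meantime, here is a rough sketch of that idea, using the built-in iris data frame as a stand-in for one of your objects (if as.environment() complains about a data frame in your R version, list2env(as.list(iris)) does the same job):
env <- as.environment(iris)   # the data frame's columns become variables in an environment
invisible(lapply(ls(envir = env), function(nm) {
  print(summary(get(nm, envir = env)))
}))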
I need to do quality control on a dataset with more than 3000 variables (columns). However, I only want to apply some conditions to a couple of them. A first step would be to replace outliers with NA. I want to replace observations that are greater or smaller than 3 standard deviations from the mean with NA. I managed to do it column by column:
height = ifelse(abs(height-mean(height,na.rm=TRUE)) <
3*sd(height,na.rm=TRUE),height,NA)
And I also want to create other variables based on different columns. For example:
data$CGmark = ifelse(!is.na(data$mark) & !is.na(data$height) ,
paste(data$age, data$mark,sep=""),NA)
An example of my dataset would be:
name = factor(c("A","B","C","D","E","F","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
I have tried this (for one condition):
d1=names(data)
list = c("age","height","mark")
ntraits=length(list)
nrows=dim(data)[1]
for(i in 1:ntraits){
  a=list[i]
  b=which(d1==a)
  d2=data[,b]
  for (j in 1:nrows){
    d2[j] = ifelse(abs(d2[j]-mean(d2,na.rm=TRUE)) < 3*sd(d2,na.rm=TRUE),d2[j],NA)
  }
}
Someone told me that I am not storing d2. How can I write for loops to apply the conditions I want? I know that there are similar questions, but I didn't get it yet. Thanks in advance.
You pretty much wrote the answer in your first line. You're overthinking this one.
First, it's good practice to encapsulate this kind of operation in a function. Yes, calling a function adds a tiny bit of overhead, but the code is often easier to read and debug. The same goes for assigning "helper" variables like mean_x: the cost of assigning the variable is very, very small and absolutely not worth worrying about.
NA_outside_3s <- function(x) {
  mean_x <- mean(x, na.rm = TRUE)
  sd_x <- sd(x, na.rm = TRUE)
  # flag values more than 3 SDs from the mean; !is.na() keeps existing NAs out of the index
  x_outside_3s <- !is.na(x) & abs(x - mean_x) > 3 * sd_x
  x[x_outside_3s] <- NA # no need for ifelse here
  x
}
Of course, you can choose any function name you want; more descriptive is better.
Then if you want to apply the function to every column, just loop over the columns. The function NA_outside_3s is already vectorized, i.e. it takes a numeric vector as an argument and returns a vector of the same length.
cols_to_loop_over <- 1:ncol(my_data) # or, some subset of columns.
for (j in cols_to_loop_over) {
  my_data[, j] <- NA_outside_3s(my_data[, j])
}
I'm not sure why you wrote your code the way you did (and it took me a minute to even understand what you were trying to do), but looping over columns is usually straightforward.
In my comment I said not to worry about efficiency, but once you understand how the loop works, you should rewrite it using lapply:
my_data[cols_to_loop_over] <- lapply(my_data[cols_to_loop_over], NA_outside_3s)
Once you know how the apply family of functions works, they are very easy to read if written properly. And yes, they are somewhat faster than looping, but not as much as they used to be. It's more a matter of style and readability.
Also: do NOT name a variable list! This masks the function list, which is an R built-in function and a fairly important one at that. You also shouldn't generally name variables data because there is also a data function for loading built-in data sets.
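A quick throwaway illustration of why masking list is confusing:
list <- c("age", "height", "mark")   # shadows the built-in in ordinary variable lookup
list(1, 2)            # calls still work, because R looks specifically for a function here
mode(list)            # "character" -- anywhere `list` is used as a value you now get your vector
lapply(list, nchar)   # iterates over your character vector, which can surprise a reader
rm("list")            # the function was never gone; removing the variable just restores clarity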
I found an odd issue with plyr when using it inside a loop.
What I want to perform with this script is to iterate the plyr function with different input values (provided by the for loop) and store the results as a list of data.frames.
library(plyr)
k=as.factor(c(rep("a",2), rep("b",2), rep("c",2), rep("d",2), rep("e",2)))
indata=data.frame(k)
outdata<-list()
for (i in 1:10){
  tempdata<-ddply(.data = indata, .variables = .(k), .fun = summarize, i=i)
  outdata[[i]]<-tempdata
  rm(tempdata)
}
outdata
I would expect it to produce a list of data.frames each produced within a single iteration of the loop, and therefore a single value of the loop variable.
What happens instead is that each of the data.frames looks identical, with each row having a sequential value of the loop variable.
Storing the loop variable into a separate one makes it work, but seems like an awkward workaround.
k=as.factor(c(rep("a",2), rep("b",2), rep("c",2), rep("d",2), rep("e",2)))
indata=data.frame(k)
outdata<-list()
for (i in 1:10){
  z=i
  tempdata<-ddply(.data = indata, .variables = .(k), .fun = summarize, i=i, z=z)
  outdata[[i]]<-tempdata
  rm(tempdata)
}
outdata
Any ideas on what's causing this odd behavior?
This is a scoping issue. Functions within ddply (llply, I believe) use i as a local variable, and that i comes before your i on the search path. The easiest fix would be to use j as the iterator:
for (j in 1:10)
However, I have no idea why you use ddply in your example. It doesn't seem necessary, so I assume it's only a toy example.
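For completeness, here is a sketch of the loop from the question with only the iterator renamed (same toy indata as above, plyr loaded):
outdata <- list()
for (j in 1:10){
  # the output column is still named i, but its value now comes from j,
  # which plyr's internals don't shadow
  outdata[[j]] <- ddply(.data = indata, .variables = .(k), .fun = summarize, i = j)
}
outdata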