Standardizing and renaming variables in a data.frame in R?

I'm trying to standardize all variables in a given data.frame and add the standardized variables to the original data.frame with a prefix such as "s." (e.g., if the original variable's name is wt, the standardized one is s.wt).
My function below does that, but I'm wondering why I cannot access the new standardized variables (see the example below).
standard <- function(dataframe = mtcars){
  var.names <- names(dataframe)
  dataframe$s <- as.data.frame(lapply(dataframe[, var.names], scale))
  dataframe
}
# Example:
(d <- standard()) # HERE I see the new standardized variables with prefix "s."
d$s.wt # NULL # but HERE I can't access the standard variables!!!

Assuming we need new columns with the prefix "s.": in the OP's function the assignment goes to a single column, i.e. dataframe$s, while lapply returns a list with as many scaled columns as the dataset has columns. The scaled columns therefore end up as a data.frame nested inside the single column s, which is why d$s.wt is NULL. Instead, we can use paste0 to build the new column names with the prefix "s." and assign to all of them at once:
standard <- function(dataframe = mtcars){
  var.names <- names(dataframe)
  dataframe[paste0("s.", var.names)] <- lapply(dataframe[var.names], function(x) c(scale(x)))
  dataframe
}
standard()$s.wt
#[1] -0.610399567 -0.349785269 -0.917004624 -0.002299538 0.227654255 0.248094592 0.360516446 -0.027849959 -0.068730634 0.227654255 0.227654255
#[12] 0.871524874 0.524039143 0.575139986 2.077504765 2.255335698 2.174596366 -1.039646647 -1.637526508 -1.412682800 -0.768812180 0.309415603
#[23] 0.222544170 0.636460997 0.641571082 -1.310481114 -1.100967659 -1.741772228 -0.048290296 -0.457097039 0.360516446 -0.446876870
NOTE: The output of scale is a matrix with a single column. Wrapping it in c() converts it to a plain vector.
Also, we can apply scale to the entire dataset at once:
mtcars[paste0("s.", names(mtcars))] <- scale(mtcars)
identical(mtcars$s.wt, standard()$s.wt)
#[1] TRUE
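To see why the original function appeared to work when printed but returned NULL for d$s.wt, here is a minimal sketch. The name standard_nested is hypothetical and simply wraps the question's original body; datasets::mtcars is used so that the example does not depend on any earlier modification of mtcars.

```r
# Hypothetical name; the body is the original function from the question
standard_nested <- function(dataframe = datasets::mtcars) {
  var.names <- names(dataframe)
  dataframe$s <- as.data.frame(lapply(dataframe[, var.names], scale))
  dataframe
}

d <- standard_nested()
length(d)        # 12 list elements: the 11 original columns plus one column "s"
class(d$s)       # the single new column is itself a data.frame
is.null(d$s.wt)  # TRUE: there is no top-level column called "s.wt"
head(d$s$wt)     # the scaled values are reachable one level down
```

So the standardized values exist, but only as d$s$wt. The paste0 approach in the answer creates real top-level columns instead.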

Related

R: Define ranges from text using regex

I need a way to call defined variables depending on a string within text.
Let's say I have five variables (r010, r020, r030, r040, r050).
If there is a given text in that form "r010-050" I want to have the sum of values from all five variables.
The whole text would look like "{r010-050} == {r060}"
The first part of that equation needs to be replaced by the sum of the five variables and since r060 is also a variable the result (via parsing the text) should be a logical value.
I think regex will help here again.
Can anyone help?
Thanks.
Define the inputs: the variables r010 etc. which we assume are scalars and the string s.
Then define a pattern pat which matches the {...} part and a function Sum which accepts the 3 capture groups in pat (i.e. the strings matched to the parts of pat within parentheses) and performs the desired sum.
Use gsubfn to match the pattern, passing the capture groups to Sum and replacing the match with the output of Sum. Then evaluate it.
In the example the only variables in the global environment whose names are between r010 and r050 inclusive are r010 and r020 (it would have used more had they existed) and since they sum to r060 it returned TRUE.
library(gsubfn)
# inputs
r010 <- 1; r020 <- 2; r060 <- 3
s <- "{r010-050} == {r060}"
pat <- "[{](\\w+)(-(\\w+))?[}]"
Sum <- function(x1, x2, x3, env = .GlobalEnv) {
  x3 <- if(x3 == "") x1 else paste0(gsub("\\d", "", x1), x3)
  lst <- ls(env)
  sum(unlist(mget(lst[lst >= x1 & lst <= x3], envir = env)))
}
eval(parse(text = gsubfn(pat, Sum, s)))
## [1] TRUE
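If gsubfn is not available, the same idea can be sketched in base R with gregexpr and regmatches. This is an alternative, not the answer's method; the names vals, expand_token and expanded are hypothetical, and the variables are kept in their own environment (an assumption made here so the lookup does not depend on the global environment).

```r
# Hypothetical inputs, stored in a dedicated environment
vals <- list2env(list(r010 = 1, r020 = 2, r060 = 3))
s <- "{r010-050} == {r060}"

# Sum every variable in `env` whose name lies in the range given by the token
expand_token <- function(token, env) {
  parts <- regmatches(token, regexec("^[{](\\w+)(-(\\w+))?[}]$", token))[[1]]
  from <- parts[2]
  # no "-xxx" part: the range is a single name; otherwise reuse the letter prefix
  to <- if (parts[4] == "") from else paste0(gsub("\\d", "", from), parts[4])
  vars <- ls(env)
  vars <- vars[vars >= from & vars <= to]
  sum(unlist(mget(vars, envir = env)))
}

# Locate each {...} token and replace it with its sum
m <- gregexpr("[{][^}]+[}]", s)
tokens <- regmatches(s, m)[[1]]
expanded <- s
regmatches(expanded, m) <- list(
  as.character(vapply(tokens, expand_token, numeric(1), env = vals))
)

expanded                      # "3 == 3"
eval(parse(text = expanded))  # TRUE
```

As in the gsubfn version, the range is resolved by comparing variable names as strings, so it relies on the names sharing a prefix and a fixed number of digits.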

R: create a named vector containing the means of multiple numeric vectors

I have over 20 numeric vectors, each consisting of a series of values. Each vector is distinguished by a letter, e.g. val_a, val_b, val_c etc.
I would like to put the means from each of these vectors into a single named vector. I could of course do this in a laborious manner like so:
obs <- c("val_a" = round(mean(val_a),3),
"val_b" = round(mean(val_b),3),
"val_c" = round(mean(val_c),3))
But with 20 vectors this becomes tedious to write out, not to mention inelegant. How can I create the named vector more succinctly? I have made an attempt using a for loop, like so:
obs <- c(for (j in 1:20) {
assign(paste("val",letters[j], sep = "_"),
mean(as.name(paste('val',letters[j], sep = '_'))),)
})
In the right-hand argument passed to assign, "as.name" is used to remove the quotation marks from the output of "paste". So the second argument passed to assign returns a character which has the exact same name as the numeric vector that I want to get the mean of, e.g. val_a. But I get this warning message:
Warning messages:
1: In mean.default(as.name(paste("val", letters[j], sep = "_"))) :
argument is not numeric or logical: returning NA
Does anyone know how to accomplish this?
Solution
To build on bouncyball's comment so you have a full answer, you can do this:
sapply(paste('val', letters[1:20], sep='_'), function(x) round(mean(get(x)), 3))
Explanation
For an object in your environment called x, get("x") will return x. See help("get"). Then we can do this for every element of paste('val', letters[1:20], sep='_') using sapply(), or if you like, a loop.
Example
val_a <- rnorm(100)
val_b <- rnorm(100)
val_c <- rnorm(100)
sapply(paste('val', letters[1:3], sep='_'), function(x) round(mean(get(x)), 3))
 val_a  val_b  val_c
-0.093 -0.156 -0.098
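A closely related variant, sketched here as an alternative rather than as part of the original answer, is to fetch all the vectors at once with mget() and then take the means. The seed and the name vec_names are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only so the example is reproducible
val_a <- rnorm(100)
val_b <- rnorm(100)
val_c <- rnorm(100)

vec_names <- paste("val", letters[1:3], sep = "_")

# mget() returns a named list of the objects; sapply() keeps those names
obs <- round(sapply(mget(vec_names), mean), 3)
obs
```

The result is a numeric vector named val_a, val_b, val_c, the same shape the get()-based version produces.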

Function to Rename Factors in R

I am trying to create a function in R that acts the same way as the "Reclassify" node in SPSS modeler. The function below has 3 arguments but it doesn't seem to be working and I'm not sure why.
column <- iris$Species
list.of.names <- c("setosa","versicolor")
new.name <- "some flowers"
reclassify <- function(column, list.of.names, new.name){
  new.name.vec <- rep(new.name, length(list.of.names))
  require(plyr)
  column <- mapvalues(column, from = list.of.names, to = new.name.vec)
}
Why doesn't this change the two factor names to "some flowers" ?
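One likely explanation, offered as a sketch rather than a confirmed diagnosis: the last expression in reclassify is an assignment to the function's local copy of column, so the result is returned invisibly and the caller's object is never modified. The caller has to capture the return value. The version below is a hypothetical base-R variant (reclassify_base is not from the question) that avoids the plyr dependency by editing the factor levels directly.

```r
# Hypothetical base-R variant: merge the listed levels into one new level
reclassify_base <- function(column, list.of.names, new.name) {
  column <- as.factor(column)
  # assigning the same label to several levels merges them
  levels(column)[levels(column) %in% list.of.names] <- new.name
  column  # return the modified factor visibly
}

species2 <- reclassify_base(iris$Species, c("setosa", "versicolor"), "some flowers")
levels(species2)
table(species2)
```

The key point is the assignment species2 <- ...: without it the modified factor is discarded, whichever implementation is used.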

Rename columns for a group of data frames using names()

I would like to rename the columns of a bunch of data frames with the names function, but I am not able to do it with lapply or a loop.
I have a group of data frames named qcew.2007, qcew.2014, etc. I have a vector with the column names I would like every data frame to have; they are all the same. The vector is named colnm:
colnm = c("area_fips" , "own_code", "industry_code", "agglvl_code") # example shortened
# groups has names of all data frames and goes to 2013
group =c("qcew.2007", "qcew.2008", "qcew.2009")
# using lapply
names <- lapply(group, function(d){
n = paste0(d)
names(n) = colnm
})
# using loop does not work either
for (i in seq(group)) {
names(group[[i]]) = colnm
}
Neither option works; R complains that I am comparing vectors with uneven lengths. I must be missing something obvious. Thanks
Here you go. You need to use get; otherwise you're assigning names to the character strings in group rather than to the data frames they name:
# sample data
qcew.2007 <- data.frame(a=1, b=2, c=3, d=4)
qcew.2008 <- data.frame(a=3, b=4, c=5, d=6)
qcew.2009 <- data.frame(a=5, b=6, c=7, d=8)
for(i in seq_along(group))
  assign(group[i], `names<-`(get(group[i]), colnm))
names(qcew.2007)
# [1] "area_fips" "own_code" "industry_code" "agglvl_code"
names(qcew.2008)
# [1] "area_fips" "own_code" "industry_code" "agglvl_code"
names(qcew.2009)
# [1] "area_fips" "own_code" "industry_code" "agglvl_code"
Here you use get to get the object named in each position in group and then use assign to reassign the modified object (modified by changing column names) back into that named object.
Also:
list2env(lapply(mget(group), setNames, colnm),envir=.GlobalEnv)
names(qcew.2007)
#[1] "area_fips" "own_code" "industry_code" "agglvl_code"
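A further option, sketched here as a suggestion that is not part of the answers above, is to keep the data frames in a named list from the start. Renaming then needs neither get/assign nor list2env. The name frames is hypothetical, and the sample data mirrors the toy data frames used earlier.

```r
# Toy data, mirroring the sample data above
qcew.2007 <- data.frame(a = 1, b = 2, c = 3, d = 4)
qcew.2008 <- data.frame(a = 3, b = 4, c = 5, d = 6)
colnm <- c("area_fips", "own_code", "industry_code", "agglvl_code")

# Hypothetical container: one list element per year
frames <- list(qcew.2007 = qcew.2007, qcew.2008 = qcew.2008)

# setNames() returns each data frame with the new column names
frames <- lapply(frames, setNames, colnm)
names(frames$qcew.2008)
```

Working on a list keeps the renaming step to a single lapply call and avoids writing into the global environment.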

Vector-version / Vectorizing a for which equals loop in R

I have a vector of values, call it X, and a data frame, call it dat.fram. I want to run something like "grep" or "which" to find all the indices of dat.fram[,3] which match each of the elements of X.
This is the very inefficient for loop I have below. Notice that there are many observations in X and each member of "match.ind" can have zero or more matches. Also, dat.fram has over 1 million observations. Is there any way to use a vector function in R to make this process more efficient?
Ultimately, I need a list since I will pass the list to another function that will retrieve the appropriate values from dat.fram .
Code:
match.ind=list()
for(i in 1:150000){
  match.ind[[i]]=which(dat.fram[,3]==X[i])
}
UPDATE:
Ok, wow, I just found an awesome way of doing this... it's really slick. Wondering if it's useful in other contexts...?!
### define v as a sample column of data - you should define v to be
### the column in the data frame you mentioned (dat.fram[,3])
v = sample(1:150000, 1500000, rep=TRUE)
### now here's the trick: concatenate the indices for each possible value of v,
### to form mybiglist - the rownames of mybiglist give you the possible values
### of v, and the values in mybiglist give you the index points
mybiglist = tapply(seq_along(v),v,c)
### now you just want the parts of this that intersect with X... again I'll
### generate a random X but use whatever X you need to
X = sample(1:200000, 150000)
mylist = mybiglist[which(names(mybiglist)%in%X)]
And that's it! As a check, let's look at the first 3 elements of mylist:
> mylist[1:3]
$`1`
[1] 401143 494448 703954 757808 1364904 1485811
$`2`
[1] 230769 332970 389601 582724 804046 997184 1080412 1169588 1310105
$`4`
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
There's a gap at 3, as 3 doesn't appear in X (even though it occurs in v). And the numbers listed against 4 are the index points in v where 4 appears:
> which(X==3)
integer(0)
> which(v==3)
[1] 102194 424873 468660 593570 713547 769309 786156 828021 870796
883932 1036943 1246745 1381907 1437148
> which(v==4)
[1] 149021 282361 289661 456147 774672 944760 969734 1043875 1226377
Finally, it's worth noting that values that appear in X but not in v won't have an entry in the list, but this is presumably what you want anyway as they're NULL!
Extra note: You can use the code below to create an NA entry for each member of X not in v...
blanks = sort(setdiff(X,names(mylist)))
mylist_extras = rep(list(NA),length(blanks))
names(mylist_extras) = blanks
mylist_all = c(mylist,mylist_extras)
mylist_all = mylist_all[order(as.numeric(names(mylist_all)))]
Fairly self-explanatory: mylist_extras is a list with all the additional list stuff you need (the names are the values of X not featuring in names(mylist), and the actual entries in the list are simply NA). The final two lines firstly merge mylist and mylist_extras, and then perform a reordering so that the names in mylist_all are in numeric order. These names should then match exactly the (unique) values in the vector X.
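The same grouping idea can also be sketched with split(), which builds the index list in one pass and lets you look up each element of X by name. This is an alternative under assumptions, not the method above: the small v and X are made-up toy inputs, and idx_by_value is a hypothetical name. Values of X that never occur in v get integer(0), matching what which() would return in the original loop.

```r
set.seed(1)  # arbitrary seed for a reproducible toy example
v <- sample(1:20, 100, replace = TRUE)  # stands in for dat.fram[,3]
X <- c(3, 7, 25)                        # 25 cannot occur in v

# One pass over v: positions grouped by value, named by the value
idx_by_value <- split(seq_along(v), v)

# One entry per element of X, in the order of X
match.ind <- lapply(as.character(X), function(k) {
  hit <- idx_by_value[[k]]
  if (is.null(hit)) integer(0) else hit
})
names(match.ind) <- X

match.ind[["3"]]   # same positions as which(v == 3)
match.ind[["25"]]  # empty: no match in v
```

Unlike mylist above, this result has exactly one entry per element of X, so it lines up with the list the original for loop would have produced.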
Cheers! :)
ORIGINAL POST BELOW... superseded by the above, obviously!
Here's a toy example with tapply that might well run significantly quicker... I made X and d relatively small so you could see what's going on:
X = 3:7
n = 100
d = data.frame(a = sample(1:10,n,rep=TRUE), b = sample(1:10,n,rep=TRUE),
c = sample(1:10,n,rep=TRUE), stringsAsFactors = FALSE)
tapply(X,X,function(x) {which(d[,3]==x)})
