Conditionally creating a new column - r

I am fairly certain this is a really obvious question, but I can't figure it out.
Let's say I have the following dataset:
test <- data.frame(A = c(1:10),
B = c(1:10), C = c(1:10),
P = c(1:10))
And I want to test, if there is a column called "P", create a new column called "Z" and put some content in it calculated from P.
I wrote the following code (just to try and get it to conditionally create the column, I've not tried to get it to do anything with that yet!):
Clean <- function(data) {
if("P" %in% colnames(data)) {
data$Z <- NA
}
else {
cat("doobedooo")
}
}
Clean(test)
But it doesn't seem to do anything, and I don't understand why, when simply running test$Z <- NA on the dataset does work.
I put the "doobedooo" in there, to see if it is returning a false at the first condition. It doesn't seem to be doing so.
Have I simply misunderstood how if statements work?

You have to return a value from your function, and then assign that value to an object. Unlike many other languages, R doesn't modify objects in-place, at least not without a lot of work.
Clean <- function(data) {
  if ("P" %in% colnames(data)) {
    data$Z <- NA
  } else {
    cat("doobedooo")
  }
  return(data)
}
test <- Clean(test)
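To see why the original version appeared to do nothing, here is a minimal sketch of the copy behaviour. The helper name add_z and the cut-down data frame are only illustrative; they are not from the question.

```r
# Illustrative data: a smaller version of the question's data frame.
test <- data.frame(A = 1:10, P = 1:10)

# Hypothetical helper: adds Z to its local copy and returns that copy.
add_z <- function(data) {
  data$Z <- NA
  data
}

# Calling the function without assigning the result leaves `test` untouched,
# because the change was made to a copy inside the function.
invisible(add_z(test))
before <- "Z" %in% colnames(test)

# Assigning the returned value back is what actually updates `test`.
test <- add_z(test)
after <- "Z" %in% colnames(test)
```

Here before should be FALSE and after should be TRUE.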

HongOi's answer is the direct answer to your question. Mine is the R way to deal with your problem. Since you want to create another column as a combination of others, you can use transform (or within), for example:
if('P' %in% colnames(test))
test <- transform(test,Z={## you can put any statement here
x=P+1
x^2
round(x/12,2)
}
)
head(test)
A B C P Z
1 1 1 1 1 0.17
2 2 2 2 2 0.25
3 3 3 3 3 0.33
4 4 4 4 4 0.42
5 5 5 5 5 0.50
6 6 6 6 6 0.58
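Since within is mentioned as an alternative, here is a rough sketch of the same idea with it. It assumes the same test data as in the question and computes the same Z value as the transform call above.

```r
# Same data as in the question.
test <- data.frame(A = 1:10, B = 1:10, C = 1:10, P = 1:10)

if ("P" %in% colnames(test)) {
  # Inside within(), columns can be referred to by name and
  # new variables become new columns of the returned data frame.
  test <- within(test, {
    Z <- round((P + 1) / 12, 2)
  })
}
```

As with transform, the result has to be assigned back to test for the new column to persist.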

The previous answer already gives everything you need. However, there is another way to deal with these problems. In R you can use an environment to set and add data by reference instead of return()ing the whole table (even if you only change a piece of it).
env <- new.env()
env$test <- test
system.time({
Clean <- function(data) {
if("P" %in% names(data$test)) {
data$test$Z <- NA
}
else {
cat("doobedooo")
}
}
Clean(env)
})
> env$test
A B C P Z
1 1 1 1 1 NA
2 2 2 2 2 NA
3 3 3 3 3 NA
4 4 4 4 4 NA
5 5 5 5 5 NA
6 6 6 6 6 NA
7 7 7 7 7 NA
8 8 8 8 8 NA
9 9 9 9 9 NA
10 10 10 10 10 NA

Related

Placing multiple outputs from each function call using apply into a row in a dataframe in R

I have a function that I repeat, changing the argument each time, using apply/sapply/lapply.
Works great.
I want to return a data set, where each row contains two (or more) variables from each iteration of the function.
Instead I get an unusable list.
do <-function(x){
a <- x+1
b <- x+2
cbind(a,b)
}
over <- 1:6
final <- lapply(over, do)
Any suggestions?
Without changing your function do, you can use sapply and transpose it.
data.frame(t(sapply(over, do)))
# X1 X2
#1 2 3
#2 3 4
#3 4 5
#4 5 6
#5 6 7
#6 7 8
If you want to use do in current form with lapply, we can do
do.call(rbind.data.frame, lapply(over, do))
You could also try
as.data.frame(Reduce(rbind, final))
# a b
# 1 2 3
# 2 3 4
# 3 4 5
# 4 5 6
# 5 6 7
# 6 7 8
See ?Reduce and ?rbind for information about what they'll do.
You could also modify your final expression as
final <- as.data.frame(Reduce(rbind, lapply(over, do)))
#final
# a b
# 1 2 3
# 2 3 4
# 3 4 5
# 4 5 6
# 5 6 7
# 6 7 8
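If changing the function is acceptable, another option is to return a named vector from each call and bind the results by row. This is only a sketch; do2 is a made-up name for the modified function.

```r
# Hypothetical variant of `do` that returns a named vector instead of a matrix.
do2 <- function(x) {
  c(a = x + 1, b = x + 2)
}

over <- 1:6

# rbind() stacks the named vectors into a matrix whose columns are a and b;
# as.data.frame() then turns that matrix into a data frame with one row per call.
final <- as.data.frame(do.call(rbind, lapply(over, do2)))
```

Each row of final then holds the two values produced by one iteration.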

Why TTR::SMA returns NA for first element of series when n=1?

This is what I am looking at:
library(TTR)
test <- c(1:10)
test <- SMA(test, n=1)
test
[1] NA 2 3 4 5 6 7 8 9 10
The reason I am asking is that I have a script that lets you define n:
library(TTR)
test <- c(1:10)
Index_Transformation <- 1 #1 means no transformation to the series
test <- SMA(test, n = Index_Transformation)
test
[1] NA 2 3 4 5 6 7 8 9 10
Is there any way I can have the SMA function return the first element of the series when "n =1" instead of NA?
Thanks a lot for your help
You can use rollmean instead from zoo package
library(zoo)
rollmean(test, 1)
#[1] 1 2 3 4 5 6 7 8 9 10
Just out of curiosity I looked into the SMA function; it calls the runMean function internally. So if you do
runMean(test, 1)
# [1] NA 2 3 4 5 6 7 8 9 10
it still gives the same output.
Further, runMean calls runSum in this way
runSum(x, n)/n
So if you now do
runSum(test, 1)
#[1] NA 2 3 4 5 6 7 8 9 10
there is still an NA. runSum is a fairly large function, and that is where the original NA is generated.
So if you still need to use the SMA function, you can add an additional if check:
if (Index_Transformation > 1) # OR (Index_Transformation != 1)
test <- SMA(test, n = Index_Transformation)
So test only changes if Index_Transformation is greater than 1 and stays as it is if it is 1.
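If you would rather not depend on how SMA treats n = 1, a trailing moving average can also be sketched in base R. simple_ma is a hypothetical helper name, not something from TTR; it relies on stats::filter with sides = 1.

```r
# Hypothetical helper (not part of TTR): trailing moving average in base R.
simple_ma <- function(x, n = 1) {
  # With n = 1 every window holds a single value, so return the input as-is.
  if (n == 1) return(as.numeric(x))
  # sides = 1 uses only the current and past values;
  # the first n - 1 results are NA because the window is incomplete.
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

series <- c(1:10)
ma1 <- simple_ma(series, 1)  # no transformation
ma3 <- simple_ma(series, 3)  # 3-period trailing mean
```

With n = 1 the series comes back unchanged, while larger n gives leading NA values as SMA does.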

Finding and matching reversed strings efficiently in R

I have a large number of strings (~280,000) that all have the following format "ABC12D/XYZ34A". In my data, each of those strings has a duplicate entry that is identical but in reverse, e.g. "XYZ34A/ABC12D" for the example above. So, my data looks something like this:
1 "ABC12D/XYZ34A"
2 "TUR44F/SWP29R"
3 "PLL93S/WQQ22F"
4 "YNV77C/AAZ05S"
5 "SWP29R/TUR44F"
6 "AAZ05S/YNV77C"
7 "CLK86G/ERF74Q"
8 "XYZ34A/ABC12D"
9 "ERF74Q/CLK86G"
10 "WQQ22F/PLL93S"
Row 1 matches row 8, row 2 matches row 5, etc.
My aims are: 1) for a given string, find where its reversed entry is and keep this index and then 2) replace the reverse entry with the non-reverse entry:
1 "ABC12D/XYZ34A" 8
2 "TUR44F/SWP29R" 5
3 "PLL93S/WQQ22F" 10
4 "YNV77C/AAZ05S" 6
5 "TUR44F/SWP29R" 0
6 "YNV77C/AAZ05S" 0
7 "CLK86G/ERF74Q" 9
8 "ABC12D/XYZ34A" 0
9 "CLK86G/ERF74Q" 0
10 "PLL93S/WQQ22F" 0
Currently, I do this in the following way using a loop:
df <- data.frame(c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F",
"YNV77C/AAZ05S", "SWP29R/TUR44F", "AAZ05S/YNV77C", "CLK86G/ERF74Q",
"XYZ34A/ABC12D", "ERF74Q/CLK86G", "WQQ22F/PLL93S"), stringsAsFactors =
FALSE)
colnames(df) <- "entries"
df
# Reverse function
reverse.entry <- function(string) {
string.reversed <- paste(rev(strsplit(string, "/")[[1]]), collapse = '/')
string.reversed
}
duplicate.flag <- list()
duplicate.idx <- list()
# Find and replace reversed entries
for (i in 1:dim(df)[[1]]) {
# current entry
string = df[i,]
# reverse the current entry
string.reversed <- reverse.entry(string)
# if any other entry matches the reversed string get match index
if (grepl(string.reversed, df)) {
print(sprintf("%d found a reversal", i))
idx <- which(df == string.reversed)
duplicate.flag[i] <- 1;
duplicate.idx[i] <- idx;
# replace reversed strings with original strings
df[idx,] <- string
} else {
duplicate.flag[i] <- 0;
duplicate.idx[i] <- 0;
}
}
data.frame(df, unlist(duplicate.idx), unlist(duplicate.flag))
However, this is quite slow and is taking several hours. Is there a better way of programming this? I'm fairly new to R and programming so am not terribly good at vectorization etc. Since each entry has one reverse entry, I could also just have the loop for 1:dim(df)[[1]] / 2. Would that already save a lot of time?
Many thanks!
You could do something like this...
df$no <- seq_along(df$entries) #number the entries
df$rev <- gsub("(.+)/(.+)","\\2/\\1",df$entries) #calculate reverse entries
df$whererev <- match(df$rev, df$entries) #identify where reversed entries occur
df$whererev[df$whererev>df$no] <- NA #remove the first of each duplicated pair
df$entries[!is.na(df$whererev)] <- df$rev[!is.na(df$whererev)] #replace duplicates
df
no entries rev whererev
1 1 ABC12D/XYZ34A XYZ34A/ABC12D NA
2 2 TUR44F/SWP29R SWP29R/TUR44F NA
3 3 PLL93S/WQQ22F WQQ22F/PLL93S NA
4 4 YNV77C/AAZ05S AAZ05S/YNV77C NA
5 5 TUR44F/SWP29R TUR44F/SWP29R 2
6 6 YNV77C/AAZ05S YNV77C/AAZ05S 4
7 7 CLK86G/ERF74Q ERF74Q/CLK86G NA
8 8 ABC12D/XYZ34A ABC12D/XYZ34A 1
9 9 CLK86G/ERF74Q CLK86G/ERF74Q 7
10 10 PLL93S/WQQ22F PLL93S/WQQ22F 3
Note that I have marked the second duplicate rather than the first, as this makes it easier (and probably substantially quicker) to replace the second one, rather than having to look it up from the first one. (Line 4 would have < rather than > if you wanted to recreate your marking of the first of each duplicated pair).
Here's my solution:
require(data.table)
get_index <- function(string,values,current_index){
string_present <- match(string,values)
string_present[string_present<current_index] <- 0
return(string_present)
}
mydata <- c("ABC12D/XYZ34A","TUR44F/SWP29R","PLL93S/WQQ22F","YNV77C/AAZ05S","SWP29R/TUR44F","AAZ05S/YNV77C","CLK86G/ERF74Q","XYZ34A/ABC12D","ERF74Q/CLK86G","WQQ22F/PLL93S")
mydf <- data.table(mystring = mydata,stringsAsFactors = FALSE)
mydf[,revmystring:=gsub("(.+)\\/(.+)","\\2\\/\\1",mystring)]
mydf[,duplicate_index:=get_index(revmystring,mystring,.I)]
The solution it gives is:
> mydf
mystring revmystring duplicate_index
1: ABC12D/XYZ34A XYZ34A/ABC12D 8
2: TUR44F/SWP29R SWP29R/TUR44F 5
3: PLL93S/WQQ22F WQQ22F/PLL93S 10
4: YNV77C/AAZ05S AAZ05S/YNV77C 6
5: SWP29R/TUR44F TUR44F/SWP29R 0
6: AAZ05S/YNV77C YNV77C/AAZ05S 0
7: CLK86G/ERF74Q ERF74Q/CLK86G 9
8: XYZ34A/ABC12D ABC12D/XYZ34A 0
9: ERF74Q/CLK86G CLK86G/ERF74Q 0
10: WQQ22F/PLL93S PLL93S/WQQ22F 0
You can implement this without data.table as well.
Here is a proposition using outer and gsub:
## Create a matrix of correspondence o between elements and reverses
o = outer(df[,1],df[,1],function(x,y) gsub("(.*)/(.*)","\\2/\\1",y)==x)
o[upper.tri(o)] = F
## Identify the indices of correspondence
df$ind = unlist(apply(o,2,function(x) which(x==T)[1]))
df$ind[is.na(df$ind)] = 0
## Replace reverses by originals
df[,1][df$ind[df$ind!=0]] = df[,1][df$ind!=0]
This returns:
V1 ind
1 ABC12D/XYZ34A 8
2 TUR44F/SWP29R 5
3 PLL93S/WQQ22F 10
4 YNV77C/AAZ05S 6
5 TUR44F/SWP29R 0
6 YNV77C/AAZ05S 0
7 CLK86G/ERF74Q 9
8 ABC12D/XYZ34A 0
9 CLK86G/ERF74Q 0
10 PLL93S/WQQ22F 0
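For reference, the match-based idea from the answers above can be condensed into a short base R sketch that reproduces the output format asked for in the question. It assumes each entry contains exactly one "/" and has at most one reversed partner; the variable names are only illustrative.

```r
entries <- c("ABC12D/XYZ34A", "TUR44F/SWP29R", "PLL93S/WQQ22F",
             "YNV77C/AAZ05S", "SWP29R/TUR44F", "AAZ05S/YNV77C",
             "CLK86G/ERF74Q", "XYZ34A/ABC12D", "ERF74Q/CLK86G",
             "WQQ22F/PLL93S")

# Swap the two halves around the slash for every entry at once.
reversed <- sub("^(.+)/(.+)$", "\\2/\\1", entries)

# Position of each entry's reversed form in the original vector (NA if absent).
partner <- match(reversed, entries)
idx <- seq_along(entries)

# The first member of a pair points forward, the second points backward.
is_first  <- !is.na(partner) & partner > idx
is_second <- !is.na(partner) & partner < idx

result <- data.frame(
  entries = ifelse(is_second, reversed, entries),  # replace reversed duplicates
  duplicate.idx = ifelse(is_first, partner, 0L),   # index of partner, else 0
  stringsAsFactors = FALSE
)
```

Because everything is vectorized and match does a single hashed lookup, this avoids the row-by-row loop entirely.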

Deleting columns in a data frame using a list of variable names

I have an automated script that produces a standard formula (i.e., y~x1+x2) and I would like to screen my data out based on those variables.
So far I have gotten this far, but I hit a sticking point where I can't quite figure it out:
#Example data
df <- data.frame(x=1:5, y=2:6, z=3:7, u=4:8)
df
x y z u
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
5 5 6 7 8
#Example formula
ex_form = "x~y+u"
#Delete the ~ and add a + sign to be consistent
step1 = gsub("~","+", ex_form)
#Remove + signs
step2 = strsplit(step1, "\\+")
#Final list of variables
step3 = unlist(step2)
Most solutions I've seen are something along the lines of:
#Create list of variables
mylist = c("x", "y", "u")
#Cut data
temp = df[ ,mylist]
temp
x y u
1 1 2 4
2 2 3 5
3 3 4 6
4 4 5 7
5 5 6 8
But this solution doesn't quite fit into the automation...so I need to jump from what I have to that outcome. Any thoughts?
Note: Tags are my guesses.
If you don't put your formula between " " it will be recognized as a formula, and you can use all.vars() to extract the variables from it.
ex_form = x~y+u #Without quotes it is a formula, check str(ex_form)
df[, all.vars(ex_form)]
# x y u
#1 1 2 4
#2 2 3 5
#3 3 4 6
#4 4 5 7
#5 5 6 8
Am I missing something or does simply doing temp <- df[,step3] return exactly what you say you want?
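If the automated script can only hand over the formula as a string, one possible bridge is to convert it with as.formula first. This is a sketch using the example data from the question; vars and temp are just illustrative names.

```r
df <- data.frame(x = 1:5, y = 2:6, z = 3:7, u = 4:8)

# The formula arrives as a character string from the automated script.
ex_form <- "x~y+u"

# as.formula() parses the string; all.vars() lists the variable names it uses.
vars <- all.vars(as.formula(ex_form))

# Keep only the columns that appear in the formula.
temp <- df[, vars]
```

This avoids the manual gsub and strsplit steps while keeping the string input.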

Passing a variable name to a function in R

I've noticed that quite a few packages allow you to pass symbol names that may not even be valid in the context where the function is called. I'm wondering how this works and how I can use it in my own code?
Here is an example with ggplot2:
a <- data.frame(x=1:10,y=1:10)
library(ggplot2)
qplot(data=a,x=x,y=y)
x and y don't exist in my namespace, but ggplot understands that they are part of the data frame and postpones their evaluation to a context in which they are valid. I've tried doing the same thing:
b <- function(data,name) { within(data,print(name)) }
b(a,x)
However, this fails miserably:
Error in print(name) : object 'x' not found
What am I doing wrong? How does this work?
Note: this is not a duplicate of Pass variable name to a function in r
I've recently discovered what I think is a better approach to passing variable names.
a <- data.frame(x = 1:10, y = 1:10)
b <- function(df, name){
eval(substitute(name), df)
}
b(a, x)
[1] 1 2 3 4 5 6 7 8 9 10
Update: The approach uses non-standard evaluation. I began explaining but quickly realized that Hadley Wickham does it much better than I could. Read this: http://adv-r.had.co.nz/Computing-on-the-language.html
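As a small illustration of what is going on, the sketch below splits the eval(substitute(...)) call into its parts. show_capture is a made-up helper used only to inspect the intermediate values.

```r
a <- data.frame(x = 1:10, y = 1:10)

# Hypothetical helper that exposes each step of the non-standard evaluation.
show_capture <- function(df, name) {
  # substitute() returns the unevaluated expression the caller typed,
  # here the symbol x, without trying to look it up.
  expr <- substitute(name)
  list(
    class = class(expr),     # type of the captured expression
    label = deparse(expr),   # the symbol as a character string
    value = eval(expr, df)   # evaluate the symbol with df's columns in scope
  )
}

res <- show_capture(a, x)
```

The captured symbol is only evaluated once the data frame is supplied as the evaluation context, which is why x does not need to exist where the function is called.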
You can do this using match.call for example:
b <- function(data,name) {
## match.call return a call containing the specified arguments
## and the function name also
## I convert it to a list , from which I remove the first element(-1)
## which is the function name
pars <- as.list(match.call()[-1])
data[,as.character(pars$name)]
}
b(mtcars,cyl)
[1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
explanation:
match.call returns a call in which all of the specified arguments are
specified by their full names.
So here the output of match.call is 2 symbols:
b <- function(data,name) {
str(as.list(match.call()[-1])) ## I am using str to get the type and name
}
b(mtcars,cyl)
List of 2
$ data: symbol mtcars
$ name: symbol cyl
So then I use the first symbol, mtcars, and convert the second to a string:
mtcars[,"cyl"]
or equivalent to :
eval(pars$data)[,as.character(pars$name)]
Very old thread, but you can also use the get command. It seems to work better for me.
a <- data.frame(x = 1:10, y = 11:20)
b <- function(df, name){
get(name, df)
}
b(a, "x")
[1] 1 2 3 4 5 6 7 8 9 10
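When the name is already passed as a string, plain [[ indexing is another option worth considering. A minimal sketch follows; pick is a hypothetical helper name, and the check for a missing column is an addition that the get version above does not have.

```r
a <- data.frame(x = 1:10, y = 11:20)

# Hypothetical helper: fetch a column by its name given as a string.
pick <- function(df, name) {
  # Fail early with a clear message if the column does not exist.
  stopifnot(name %in% names(df))
  df[[name]]
}

col_x <- pick(a, "x")
col_y <- pick(a, "y")
```

Unlike get, this only ever looks inside the data frame, so it cannot accidentally pick up an object of the same name from elsewhere.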
If you put the variable name between quotes when you call the function, it works:
> b <- function(data,name) { within(data,print(name)) }
> b(a, "x")
[1] "x"
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
