Difference between dataframe's $ and [] functions - r

How and why do a data frame's $ and [] operators behave differently when assigning values?
Can I tweak the abc.df[,"b"] = get("b") line so that it has the same effect as abc.df$b = get("b")?
abc.df = NULL
a = 1:10
abc.df = data.frame(a)
b_vector = 11:20
b_list = rep(list(c(1,2)),10)
sp_colmns1 = c("b_vector")
# This works :
abc.df$b_vector_method1 = get(sp_colmns1) # Method 1
abc.df[,"b_vector_method2"] = get(sp_colmns1) # Method 2
print(abc.df)
sp_colmns2 = c("b_list")
# Similarly :
# The same code as above, but does not work
# Only difference is b_list is a list
abc.df$b_list_method1 = get(sp_colmns2) # Method 1 (Works)
# TODO: Need to get the reason for & Solve the error on following line
# abc.df[,"b_list_method2"] = get(sp_colmns2) # Method 2 (Doesn't work)
print(abc.df)

You could add the list under a temporary name such as "new" and then, in a second step, rename the column using the string you stored elsewhere.
abc.df$new <- get(sp_colmns2)
names(abc.df)[which(names(abc.df) == "new")] <- "b_list_method2"
# > head(abc.df)
# a b_list_method2
# 1 1 1, 2
# 2 2 1, 2
# 3 3 1, 2
# 4 4 1, 2
# 5 5 1, 2
# 6 6 1, 2

After quite a lot of trial and error, this seems to work.
The solution turns out to be quite a simple one...
list(get(sp_colmns2)) instead of get(sp_colmns2)
abc.df = NULL
a = 1:10
abc.df = data.frame(a)
b_vector = 11:20
b_list = rep(list(c(1,2)),10)
sp_colmns1 = c("b_vector")
# This works :
abc.df$b_vector_method1 = get(sp_colmns1) # Method 1
abc.df[,"b_vector_method2"] = get(sp_colmns1) # Method 2
print(abc.df)
sp_colmns2 = c("b_list")
# Similarly, for the list:
# Only difference is b_list is a list
abc.df$b_list_method1 = get(sp_colmns2) # Method 1 (Works)
# Wrapping the value in list() makes [<- treat it as a single list-column
abc.df[,"b_list_method2"] = list(get(sp_colmns2)) # Method 2 (Works once wrapped in list())
print(abc.df)
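A minimal sketch of why the wrapper helps, assuming the behaviour described in the data.frame documentation: `[<-` reads a list value as a set of columns (one list element per column), whereas `$<-` and `[[<-` store the whole list as one column. The column names below are made up for illustration.

```r
abc.df <- data.frame(a = 1:10)
b_list <- rep(list(c(1, 2)), 10)

# `$<-` and `[[<-` store the 10-element list as one list-column
abc.df$via_dollar <- b_list
abc.df[["via_double_bracket"]] <- b_list

# `[<-` expects a list of columns, so the list-column is wrapped in list()
abc.df[, "via_single_bracket"] <- list(b_list)

# Check which of the new columns are list-columns
sapply(abc.df[-1], is.list)
```

If the column name is held in a string, abc.df[[name]] <- value may be the simplest drop-in replacement for abc.df$name <- value.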

Related

How to find a sequence in r in the middle of the texts?

Say there is a string of t and f characters. How might one use the grep function to find a pattern that starts with f, stays in f for some time and then switches to t? I also want to count the number of steps it then stays in t.
a <- "fffftttfff"
b <- "fttttttfff"
c <- "tttttttttt"
d <- "fffffffftf"
path_ <- c(a,b,c,d)
ID <- 1:4
library(data.table) # needed for data.table()
tf_dt <- data.table("ID" = ID,"path" = path_)
tf_dt
ID path
1: 1 fffftttfff
2: 2 fttttttfff
3: 3 tttttttttt
4: 4 fffffffftf
dt_raw <- tf_dt[,-1]
s <- paste0(as.vector(t(dt_raw)), collapse = "")
v <- substring(s,seq(1,nchar(s)-9,10), seq(10,nchar(s),10))
idx <- grep("^f*f.+t",v)
dt_final <- data.frame("ID" = tf_dt$ID, count = FALSE, time = NA)
dt_final$count[idx] <- TRUE
dt_final$time[idx] <- ???
What I reckon I should do is remove the leading run of f and everything that follows the first run of t. However, I am not sure how to do that. Any help is appreciated.
My attempt:
nchar(gsub("^f*f","",gsub("something that relates to the end of the string","",v)))
More attempts:
#If I do gsub("^f*f+t*","",v) it gives me the last string that I want to remove
#But I cant do something like
nchar(gsub("^f*f","",gsub("gsub("^f*f+t*","",v)$",""v)))
Expected Output:
tf_count <- c(TRUE,TRUE,FALSE,TRUE)
tf_time <- c(3,6,NA,1)
output <- data.table("ID" = ID, "count" = tf_count,"time_taken" = tf_time)
# ID count time_taken
# 1: 1 TRUE 3
# 2: 2 TRUE 6
# 3: 3 FALSE NA
# 4: 4 TRUE 1
Also, as a side note, is there somewhere I can look at a lot of examples of how grep() and stringr work? (From what I have seen, I think this falls under stringr.) I tried reading about this, but nothing really came of it, and I am still as confused as before. Thanks.
A solution in base using grepl and gsub as you have tried already in the question.
tf_count <- grepl("^f+t+", tf_dt$path)
tf_time <- nchar(gsub("^f+(t+).*","\\1",tf_dt$path))
tf_time[!tf_count] <- NA
output <- data.frame("ID" = ID, "count" = tf_count,"time_taken" = tf_time)
output
# ID count time_taken
#1 1 TRUE 3
#2 2 TRUE 6
#3 3 FALSE NA
#4 4 TRUE 1
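One caveat with the gsub() approach is that a string that does not match is returned unchanged, which is why the extra NA step is needed. A sketch of an alternative that extracts the capture group directly with regexec() and regmatches() from base R; the variable names are only illustrative.

```r
paths <- c("fffftttfff", "fttttttfff", "tttttttttt", "fffffffftf")

# Each list element is c(full_match, group_1), or character(0) when nothing matched
m <- regmatches(paths, regexec("^f+(t+)", paths))

# Length of the first run of t, or NA when the string does not start with f
time_taken <- vapply(
  m,
  function(x) if (length(x) == 2) nchar(x[2]) else NA_integer_,
  integer(1)
)
count <- !is.na(time_taken)
```

Here the match and the length come from the same regular expression, so the two columns cannot disagree.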
One way would be to find out number of t's after removing the first set of f's which can be achieved by
library(data.table)
tf_dt[, time_taken:= NA_integer_]
tf_dt[grep('^f', path), time_taken := nchar(sub('^f*(t{1,}).*', '\\1',path))]
tf_dt
# ID path time_taken
#1: 1 fffftttfff 3
#2: 2 fttttttfff 6
#3: 3 tttttttttt NA
#4: 4 fffffffftf 1
If you are interested in a stringr & tidyverse solution, try the following code. I borrowed the pattern "^f*(t{1,})" from Ronak Shah's excellent answer:
library(dplyr)
library(stringr)
tf_dt %>%
  mutate(count = str_detect(path, "ft"),
         time_taken = ifelse(count, str_count(str_extract(path, "^f*(t{1,})"), "t"), NA))

Hidden objects in list

I am creating a custom object for a package and I want to have a list of two objects, but for one of those elements to be 'hidden'
For example:
l = list(data = data.frame(a = 1:3, b = 4:6), hidden = list(obj1 = 1, obj2 = 2))
When I interact with the list, I want to interact only with the data element; the other element should be accessible only when requested explicitly.
So, if I typed l, I would see:
> l
a b
1 1 4
2 2 5
3 3 6
Which I can manage with a custom print method. But I also want to be able to do
> l[,1]
[1] 1 2 3
Which I don't think is possible with a custom print method.
I don't have any specific requirements for how the other element should be accessed, but something 'R friendly' I guess.
Is there a different class I should be using or creating a new class? Any advice would be appreciated.
You could indeed define a custom class for your object. Let
class(l) <- "myclass"
Then you may define class-specific methods for your functions of interest. For instance, in the case of l[, 1] we have
`[.myclass` <- function(x, ...) `[`(x[[1]], ...)
which takes this double list and then calls the usual [ function on the first list element:
l[, 1]
# [1] 1 2 3
The same can be done for any other generic function (written as fun below), such as print:
fun.myclass <- function(x, ...) fun(x[[1]], ...)
And you still can always access the second object in the usual way,
l$hidden
# $obj1
# [1] 1
#
# $obj2
# [1] 2
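Putting the pieces together, a minimal sketch of such a class might look as follows. It keeps the class name myclass from above, assumes the first element is always named data, and adds a print method that returns its argument invisibly, as print methods conventionally do.

```r
l <- structure(
  list(data = data.frame(a = 1:3, b = 4:6),
       hidden = list(obj1 = 1, obj2 = 2)),
  class = "myclass"
)

# Forward single-bracket indexing to the data element
`[.myclass` <- function(x, ...) `[`(x[["data"]], ...)

# Print only the data element
print.myclass <- function(x, ...) {
  print(x[["data"]], ...)
  invisible(x)
}

l[, 1]          # indexes the data frame
l               # auto-prints via print.myclass
l$hidden$obj2   # the hidden element is still reachable by name
```

Other generics that should act on the data (for example dim or summary) would need their own methods in the same style.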
I think it would be cleaner for you to use attributes :
l <- list(data = data.frame(a = 1:3, b = 4:6),
          hidden = list(obj1 = 1, obj2 = 2))
foo <- function(x){
  attr(x$data,"hidden") <- x$hidden
  x$data
}
l <- foo(l)
l
# a b
# 1 1 4
# 2 2 5
# 3 3 6
l[,1]
# [1] 1 2 3
attr(l,"hidden")
# $obj1
# [1] 1
#
# $obj2
# [1] 2

Counting the frequency of differing patterns in a character string

I currently have a string in R that looks like this:
a <- "BMMBMMMMBMMMBMMBBMMM"
First, I need to determine the frequency of different patterns of "M" that appear in the string.
In this example it would be:
MM = 2
MMM = 2
MMMM = 1
Secondly, I then need to designate a numerical value/score for each different pattern.
i.e:
MM = 1
MMM = 2
MMMM = 3
This would mean that the total value/score of M's in a would equal 9.
If anyone knows a script that would let me do this for multiple strings like this in a data frame, that would be great.
Thank you.
a <- "BMMBMMMMBMMMBMMBBMMM"
tbl <- table(strsplit(a, "B"), exclude="")
tbl
# MM MMM MMMM
# 2 2 1
score <- sum(tbl * 1:3)
score
# 9
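For several strings in a data frame, the same idea can be wrapped in a function. This is only a sketch: score_runs and the example data frame are made-up names, and it assumes the scoring rule generalises to "a run of n M's scores n - 1" (so a single M scores 0), which matches MM = 1, MMM = 2, MMMM = 3 but is not stated in the question.

```r
# Hypothetical helper: total score of all runs of `char` in one string
score_runs <- function(s, char = "M") {
  runs <- rle(strsplit(s, "")[[1]])
  lens <- runs$lengths[runs$values == char]
  sum(lens[lens >= 2] - 1)   # assumed rule: a run of length n scores n - 1
}

df <- data.frame(
  id  = 1:3,
  seq = c("BMMBMMMMBMMMBMMBBMMM", "BBBB", "MMBM"),
  stringsAsFactors = FALSE
)

# Apply the helper to every row
df$score <- vapply(df$seq, score_runs, numeric(1), USE.NAMES = FALSE)
df
```

If the scores are not a simple function of the run length, a named lookup vector indexed by the run length would be the more general choice.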
You could also use the table function.
a_list<-unlist(strsplit(a,"B"))
a_list<-a_list[!a_list==""] #remove cases when 2 B are together
a_list<-table(a_list)
# MM MMM MMMM
# 2 2 1
Here's a solution that uses the dplyr package. First, I load the library and define my string.
library(dplyr)
a <- "BMMBMMMMBMMMBMMBBMMM"
Next, I define a function that counts the runs of character x in string y, grouped by run length.
char_count <- function(x, y){
  # Get runs of same character
  tmp <- rle(strsplit(y, split = "")[[1]])
  # Count runs of character stored in `x`
  tmp <- data.frame(table(tmp$lengths[tmp$values == x]))
  # Return strings and frequencies
  # (Var1 is a factor, so convert its labels, not its codes, to run lengths)
  tmp %>%
    mutate(String = strrep(x, as.integer(as.character(Var1)))) %>%
    select(String, Freq)
}
Then, I run the function.
# Run the function
res <- char_count("M", a)
res
#   String Freq
# 1     MM    2
# 2    MMM    2
# 3   MMMM    1
Finally, I define my value vector and calculate the total value of string a.
# My value vector
value_vec <- c(MM = 1, MMM = 2, MMMM = 3)
# Total `value` of string `a`, looking up each pattern by name
sum(value_vec[res$String] * res$Freq)
#[1] 9
If it's acceptable to skip the first step, you could do:
nchar(gsub("(B+M)|(^M)","",a))
# [1] 9
First, compute all the different patterns that appear in your string:
a <- "BMMBMMMMBMMMBMMBBMMM"
chars = unlist(strsplit(a, ""))
pat = c()
for (i in 1:length(chars)) {
  for (j in 1:(length(chars) - i + 1)) {
    pat = c(pat, paste(chars[j:(j + i - 1)], collapse = ""))
  }
}
pat =sort(unique(pat))
pat[1:5]
# [1] "B" "BB" "BBM" "BBMM" "BBMMM"
Next, count the occurrences of each pattern:
counts = sapply(pat, function(w) length(gregexpr(w, a, fixed = TRUE)[[1]]))
Finally, build a data frame to summarize everything:
df = data.frame(counts = counts, num = 1:length(pat))
head(df, 10)
counts num
B 6 1
BB 1 2
BBM 1 3
BBMM 1 4
BBMMM 1 5
BM 5 6
BMM 5 7
BMMB 2 8
BMMBB 1 9
BMMBBM 1 10
library(stringr)
str_count(a, "MMMM")
# gives 1
str_count(gsub("MMMM", "", a), "MMM") # now count how many times "MMM" occurs, but first delete the "MMMM"
# gives 2
str_count(gsub("MMM", "", a), "MM") # now count how many times "MM" occurs, but first delete the "MMM"'s
# gives 2

How to get consistent output when checking for infinite values

# works fine
check = c(1,2,3,4, Inf)
out = check[-which(check == Inf)]
print(out)
# [1] 1 2 3 4
# does not work fine
check = c(1,2,3,4)
out = check[-which(check == Inf)]
print(out)
# numeric(0)
The first example creates an out variable with the correct values 1, 2, 3, 4. The second creates an empty out, because which() returns integer(0), and negating integer(0) still gives integer(0), so the index selects zero elements instead of removing none.
I know how to write this in several lines but is there a one-liner for this?
Try is.finite():
# example 1
check <- c(1, 2, 3, 4, Inf)
out <- check[ is.finite(check) ]
out
# [1] 1 2 3 4
# example 2
check <- c(1, 2, 3, 4)
out <- check[ is.finite(check) ]
out
# [1] 1 2 3 4
Related post about: is.finite().
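A short sketch of the edge case and of how the two logical predicates differ. The vector mixed is made up for illustration; note that is.finite() also drops NA and NaN, while !is.infinite() keeps them.

```r
check <- c(1, 2, 3, 4)
idx <- which(check == Inf)   # integer(0): nothing matched
length(check[-idx])          # 0, because -integer(0) is still integer(0)

# Logical indexing has no such edge case
check[is.finite(check)]

# The two predicates differ on missing values
mixed <- c(1, NA, Inf, -Inf, NaN, 4)
keep_finite <- mixed[is.finite(mixed)]      # drops NA and NaN as well
keep_noninf <- mixed[!is.infinite(mixed)]   # keeps NA and NaN
```

Which of the two is appropriate depends on whether missing values should survive the filter.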
check = c(1,2,3,4)
out = check[!is.infinite(check)]
print(out)
Not sure whether this is technically a one-liner...
out = if (any(is.infinite(check))) {check[-which(is.infinite(check))]} else {check}

Nested ifelse: improved syntax

Description
The ifelse() function filters the values of a vector through a series of tests, each of which triggers a different action when it is true. For instance, let xx be a data.frame, as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx
a b
1 1
2 2
1 3
3 4
Suppose that you want to create a new column, c, from column b, but depending on the values in column a in the following way:
For each row,
if the value in column a is 1, the value in column c, is the same value in column b.
if the value in column a is 2, the value in column c, is 100 times the value in column b.
in any other case, the value in column c is the negative of the value in column b.
Using ifelse(), a solution could be:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
-xx$b))
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
Problem 1
An aesthetic problem arises when the number of tests increases, say, four tests:
xx$c <- ifelse(xx$a==1, xx$b,
ifelse(xx$a==2, xx$b*100,
ifelse(xx$a==3, ...,
ifelse(xx$a==4, ...,
...))))
I found a partial solution to the problem on this page, which consists of defining the functions if.else_(), i_() and e_(), as follows:
library(lazyeval)
i_ <- function(if_stat, then) {
  if_stat <- lazyeval::expr_text(if_stat)
  then <- lazyeval::expr_text(then)
  sprintf("ifelse(%s, %s, ", if_stat, then)
}
e_ <- function(else_ret) {
  else_ret <- lazyeval::expr_text(else_ret)
  else_ret
}
if.else_ <- function(...) {
  args <- list(...)
  for (i in 1:(length(args) - 1) ) {
    if (substr(args[[i]], 1, 6) != "ifelse") {
      stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
    }
  }
  if (substr(args[[length(args)]], 1, 6) == "ifelse"){
    stop("Last argument needs to be an else_ function.", call. = FALSE)
  }
  args$final <- paste(rep(')', length(args) - 1), collapse = '')
  eval_string <- do.call('paste', args)
  eval(parse(text = eval_string))
}
In this way, the problem given in the Description, can be rewritten as follows:
xx <- data.frame(a=c(1,2,1,3), b=1:4)
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
e_(-xx$b)
)
xx
a b c
1 1 1
2 2 200
1 3 3
3 4 -4
And the code for the four tests will simply be:
xx$c <- if.else_(
i_(xx$a==1, xx$b),
i_(xx$a==2, xx$b*100),
i_(xx$a==3, ...), # dots meaning actions for xx$a==3
i_(xx$a==4, ...), # dots meaning actions for xx$a==4
e_(...) # dots meaning actions for any other case
)
Problem 2 & Question
The given code apparently solves the problem. Then, I wrote the following test function:
test.ie <- function() {
  dd <- data.frame(a=c(1,2,1,3), b=1:4)
  if.else_(
    i_(dd$a==1, dd$b),
    i_(dd$a==2, dd$b*100),
    e_(-dd$b)
  ) # it should give c(1, 200, 3, -4)
}
When I tried the test:
test.ie()
it produced the following error message:
Error in ifelse(dd$a == 1, dd$b, ifelse(dd$a == 2, dd$b * 100, -dd$b)) :
object 'dd' not found
Question
Since the if.else_() syntactic constructor is not supposed to run only from the console, is there a way for it to 'know' the variables from the function that calls it?
Note
In "Best way to replace a lengthy ifelse structure in R", a similar problem was posted. However, the given solution there focuses on building the table's new column with the given constant output values (the "then" or "else" slots of the ifelse() function), whereas my case addresses a syntactic problem in which the "then" or "else" slots can even be expressions in terms of other data.frame elements or variables.
I think you can use dplyr::case_when inside dplyr::mutate to achieve this.
library(dplyr)
df <- tibble(a=c(1,2,1,3), b=1:4)
df %>%
  mutate(
    foo = case_when(
      .$a == 1 ~ .$b,
      .$a == 2 ~ .$b * 100L,
      TRUE ~ .$b * -1L
    )
  )
#> # A tibble: 4 x 3
#> a b foo
#> <dbl> <int> <int>
#> 1 1 1 1
#> 2 2 2 200
#> 3 1 3 3
#> 4 3 4 -4
In the upcoming release of dplyr 0.6.0 you won't need the awkward work-around of .$, and you can just use:
df %>%
  mutate(
    foo = case_when(
      a == 1 ~ b,
      a == 2 ~ b * 100L,
      TRUE ~ b * -1L
    )
  )
Taking into account MrFlick's advice, I re-coded the if.else_() function as follows:
if.else_ <- function(...) {
  args <- list(...)
  for (i in 1:(length(args) - 1) ) {
    if (substr(args[[i]], 1, 6) != "ifelse") {
      stop("All but the last argument, need to be if.then_ functions.", call. = FALSE)
    }
  }
  if (substr(args[[length(args)]], 1, 6) == "ifelse"){
    stop("Last argument needs to be an else_ function.", call. = FALSE)
  }
  args$final <- paste(rep(')', length(args) - 1), collapse = '')
  eval_string <- do.call('paste', args)
  eval(parse(text = eval_string), envir = parent.frame())
}
Now the test.ie() function runs properly
test.ie()
[1] 1 200 3 -4
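The reason this change helps can be seen in a stripped-down sketch. By default, eval() evaluates in the frame of the function that calls it, so variables local to that function's own caller are not visible; passing envir = parent.frame() moves the evaluation one level up. The helper names below are hypothetical and not part of the code above.

```r
# Hypothetical helpers, for illustration only
eval_default   <- function(txt) eval(parse(text = txt))
eval_in_caller <- function(txt) eval(parse(text = txt), envir = parent.frame())

caller <- function() {
  local_value <- 41
  eval_in_caller("local_value + 1")   # sees local_value in caller()'s frame
}
caller()

failing_caller <- function() {
  local_value <- 41
  # eval_default() looks in its own frame and then in its enclosing environment,
  # never in failing_caller()'s frame, so the lookup fails
  tryCatch(eval_default("local_value + 1"), error = function(e) "not found")
}
failing_caller()
```

This only reaches one level up: if if.else_() were itself wrapped in another helper, the environment would have to be passed along explicitly.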
With full respect to the OP's remarkable effort to improve nested ifelse(), I prefer a different approach which I believe is easy to write, concise, maintainable and fast:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
library(data.table)
# coerce to data.table, and set the default first
setDT(xx)[, c:= -b]
xx[a == 1L, c := b] # 1st special case
xx[a == 2L, c := 100L*b] # 2nd special case, note use of integer 100L
# xx[a == 3L, c := ...] # other cases
# xx[a == 4L, c := ...]
#...
xx
# a b c
#1: 1 1 1
#2: 2 2 200
#3: 1 3 3
#4: 3 4 -4
Note that for the 2nd special case b is multiplied by the integer constant 100L to make sure that the right hand sides are all of type integer in order to avoid type conversion to double.
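The type difference mentioned here is easy to check in base R; the vector b below simply mirrors the column from the example.

```r
b <- 1:4

typeof(-b)         # "integer": negation keeps the integer type
typeof(100L * b)   # "integer": integer times integer
typeof(100 * b)    # "double": a plain 100 is a double, so the result is promoted
```

In a data.table, the column type is fixed when the column is first created, so keeping every right-hand side an integer avoids a coercion on later updates.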
Edit 2: This can also be written in an even more concise (but still maintainable) way as a one-liner:
setDT(xx)[, c:= -b][a == 1L, c := b][a == 2L, c := 100L*b][]
data.table chaining works here, because c is updated in place so that subsequent expressions are acting on all rows of xx even if the previous expression was a selective update of a subset of rows.
Edit 1: This approach can be implemented with base R as well:
xx <- data.frame(a=c(1L,2L,1L,3L), b=1:4)
xx$c <- -xx$b
idx <- xx$a == 1L; xx$c[idx] <- xx$b[idx]
idx <- xx$a == 2L; xx$c[idx] <- 100 * xx$b[idx]
xx
# a b c
#1 1 1 1
#2 2 2 200
#3 1 3 3
#4 3 4 -4

Resources