Attach/detach in R behaving very strangely

I want to subset a dataframe by applying two conditions to it. When I attach the dataframe, apply the first condition, detach the dataframe, attach it again, apply the second condition, and detach again, I get the expected result, a dataframe with 9 observations.
Of course, you wouldn't normally detach and re-attach before applying the second condition. So I attach, apply the two conditions one after the other, and then detach. But the result is different now: it's a dataframe with 24 observations, and all but 5 of them consist exclusively of NA values.
I know there's lots of advice against using attach, and I understand the point that it's dangerous, because it's easy to lose track of an attach statement still being active. My point here is a different one; I see a behaviour in attach that I just can't understand. I'm using RStudio 0.99.465 with 64-bit R 3.2.1.
So here's the code, first the version that is clumsy, but produces the correct result (df with 9 observations, all non-NA):
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
detach(df)
attach(df)
df <- df[early_vvl <= late_no_reaction,]
detach(df)
Now the one that produces the dataframe with 24 observations, most of which consist only of NA values:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
df <- df[early_vvl <= late_no_reaction,]
detach(df)
I'm puzzled. Does anybody understand why the second version produces a different result?

Have a look at what happens here:
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl,]
length(early_vvl <= late_no_reaction)
## [1] 32
df <- df[early_vvl <= late_no_reaction,]
detach(df)
So your logical vector early_vvl <= late_no_reaction still uses the original df, the one that you attached. When you subset the data.frame the second time, the logical vector is longer than the data.frame has rows, and subsetting then behaves like this:
df <- data.frame(x=1:5, y = letters[1:5])
df[rep(c(TRUE, FALSE), 5), ]
## x y
## 1 1 a
## 3 3 c
## 5 5 e
## NA NA <NA>
## NA.1 NA <NA>
You could just use & to avoid the problem:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
attach(df)
df <- df[(1-early_vvl) >= inter_churn + inter_new_contract + late_vvl & early_vvl <= late_no_reaction,]
detach(df)
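Not part of the answer above, but for completeness: a minimal attach-free sketch of the same subsetting using base R's subset(), which evaluates both conditions inside the current df, so there is no stale attached copy to worry about:
df <- expand.grid(early_vvl=c(0,1), inter_churn=c(0,1), inter_new_contract=c(0,1), late_vvl=c(0,1), late_no_reaction=c(0,1))
# both conditions see the same, current df; no attach/detach state to track
df <- subset(df, (1 - early_vvl) >= inter_churn + inter_new_contract + late_vvl &
                 early_vvl <= late_no_reaction)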

Related

Split a group of integers into two subgroups of approximately the same sums

I have a group of integers, as in this R data.frame:
set.seed(1)
df <- data.frame(id = paste0("id",1:100), length = as.integer(runif(100,10000,1000000)), stringsAsFactors = F)
So each element has an id and a length.
I'd like to split df into two data.frames with approximately equal sums of length.
Any idea of an R function to achieve that?
I thought that Hmisc's cut2 might do it but I don't think that's its intended use:
library(Hmisc) # cut2
ll <- split(df, cut2(df$length, g=2))
> sum(ll[[1]]$length)
[1] 14702139
> sum(ll[[2]]$length)
[1] 37564671
This is known as the bin packing problem; https://en.wikipedia.org/wiki/Bin_packing_problem may be helpful.
Using the BBmisc::binPack function:
library(BBmisc)
df$bins <- binPack(df$length, sum(df$length)/2 + 1)
tapply(df$length, df$bins, sum)
The result looks like:
1 2 3
25019106 24994566 26346
Now, since you want two groups, merge the leftover third bin into the bin with the smaller sum:
df$bins[df$bins == 3] <- 2 # bin 2's sum is smaller, so it absorbs bin 3
tapply(df$length, df$bins, sum)
The result is:
1 2
25019106 25020912
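An alternative sketch (mine, not from the answer above) that avoids extra packages: a greedy heuristic that sorts the lengths in decreasing order and always assigns the next element to the group with the smaller running sum. It is not guaranteed optimal, but it is usually very close for this many items:
# assumes the df built in the question
ord <- order(df$length, decreasing = TRUE)
group <- integer(nrow(df))
sums <- c(0, 0)
for (i in ord) {
  g <- which.min(sums)        # the currently lighter group
  group[i] <- g
  sums[g] <- sums[g] + df$length[i]
}
ll <- split(df, group)
sapply(ll, function(d) sum(d$length))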

Automatic changing of rows and columns for facet_wrap()

How would one automate the choice of columns and rows in facet_wrap() so that there are never any empty facets? For example, my facet_wrap() call displays 24 plots in a 5-by-5 grid, which leaves an empty space, but I would like 4 columns and 6 rows. I know one can control the output within facet_wrap(), but I would like a solution that automates this when producing multiple ggplots.
A bit long-winded, but it should give you the most "balanced" pair of rows and columns for a given n_plots (n_plots being the number of distinct values in the variable you are using to facet on):
library(dplyr)

n_plots <- 24
possible_arrangements <- data.frame()
ind <- 1
for (i in 1:n_plots) {
  if (n_plots %% i == 0) {
    j <- n_plots / i
    print(paste(i, j)) # for debugging purposes
    possible_arrangements[ind, 1] <- i
    possible_arrangements[ind, 2] <- j
    ind <- ind + 1
  }
}
colnames(possible_arrangements) <- c("rows", "cols")
possible_arrangements <- possible_arrangements %>%
  mutate(balance = abs(rows - cols))
solution <- possible_arrangements %>%
  filter(balance == min(balance))
Output:
> solution
rows cols balance
1 4 6 2
2 6 4 2
If you want a unique solution, you can further require that cols >= rows (or rows >= cols), depending on how you like it.
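A more compact route (my sketch, not part of the answer above): the most balanced arrangement uses the largest divisor of n_plots not exceeding its square root, so you can compute the pair directly and hand it to facet_wrap(). The helper name balanced_dims is hypothetical:
balanced_dims <- function(n) {
  rows <- max(which(n %% seq_len(floor(sqrt(n))) == 0)) # largest divisor <= sqrt(n)
  c(rows = rows, cols = n / rows)
}
balanced_dims(24)
## rows cols
##    4    6
# e.g. facet_wrap(~ group, nrow = balanced_dims(24)["rows"])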

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states the "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
  filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
  select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
  dat %>%
    filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
    select_if(is.numeric)
}
you should probably give it a better name though
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns something of zero length; your inequality then doesn't return TRUE or FALSE, but rather logical(0). The error is telling you that the if statement cannot evaluate whether logical(0) >= 1.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") > 1) {print("no dice")}
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column: the next loop iteration (when i = 6) will be handling what was the 7th column! This eventually ends in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected. A small made-up illustration follows below.
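Here is that pitfall in miniature (hypothetical data, mine): the loop bound 1:ncol(x) is fixed at 5 on entry, but each deletion renumbers the remaining columns, so the loop eventually indexes past the shrunken data frame:
x <- data.frame(a = "chr", b = 1, c = "chr", d = 2, e = "chr",
                stringsAsFactors = FALSE)
for (i in 1:ncol(x)) {        # 1:5, evaluated once at loop entry
  if (!is.numeric(x[, i])) {
    x <- x[, -i]              # every deletion shifts the later columns left
  }
}
## Error in `[.data.frame`(x, , i) : undefined columns selected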
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the regex version "\\.DERIVED" and not ".DERIVED", which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
                   b = c(1,2,3,4,5),
                   c = c("A","B","C","D","E"),
                   d = c(2,5,6,8,9),
                   stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric, because
# the example code provided assumed that. It will not detect columns that
# store numbers as character strings.
# Using the base subset command
test2 <- subset(test,
                subset = !grepl("\\.DERIVED", test$a),
                select = sapply(test, is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
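One more sketch (mine, not from the answers above): the question asks to drop a row when any cell contains ".DERIVED", while the examples above only check column a. A vectorized base-R version that scans every column of the test data:
# keep rows where no cell matches ".DERIVED", then keep only numeric columns;
# drop = FALSE keeps a data.frame even if a single numeric column remains
keep_rows <- !Reduce(`|`, lapply(test, function(col) grepl("\\.DERIVED", col)))
test2 <- test[keep_rows, sapply(test, is.numeric), drop = FALSE]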

R programming Sum data frame

I have some R code and I want to sum values derived from a data frame (df$number is the unlisted result stored in res).
The values whose total I want are: [1] 1 3 5 7 9 20 31 42
digits <- function(x){as.integer(substring(x, seq(nchar(x)), seq(nchar(x))))}
generated <- function(x){ x + sum(digits(x))}
digitadition <- function(x,N) { c(x, replicate(N-1, x <<- generated(x))) }
res <- NULL
for(i in 0:50){
  for(j in 2:50){
    tmp <- digitadition(i,j)
    IND <- 50*(i-1) + (j-1) - (i-1) # to index results
    res[IND] <- tmp[length(tmp)]
  }
}
df <- data.frame(number = unlist(res), generator=rep(1:50, each=49), N=2:50)
total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
setdiff(1:50, as.numeric(names(total)))
sum(total)
I'm using sum(total), but the result is 155, which is not the right answer; the right answer is 118.
What is the specific code to sum total?
Thank you.
I ran your code and I think you may be confused about what you want to sum.
Your setdiff() result contains the values 1 3 5 7 9 20 31 42, whose sum is 118.
So, if you do sum(setdiff(1:50, as.numeric(names(total)))), you'll get the 118 you are looking for.
Your total variable is different from this. Let me explain what you are doing and what I think you should do.
Your code: total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
When you call table(), you get each unique value from the vector along with the number of times it appears.
And when you take the names() of this table, you get each of these unique values as a character string; that's why you need as.numeric.
But the function unique() does this job for you: it extracts the unique values from a vector.
Here's what you can do: total <- unique(df$number[which(df$number <= 50)])
where which() gets the indices of the values <= 50, and unique() extracts the unique values among them.
And finally, sum(setdiff(1:50, total)) sums all the values from 1 to 50 that are not in your total vector.
Note that setdiff() is not symmetric, so the argument order matters: sum(setdiff(total, 1:50)) would give 0 here, because every value of total already lies in 1:50.
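A quick check, reusing the df built in the question (the printed values match what the question expects):
total <- unique(df$number[df$number <= 50])
missing_vals <- setdiff(1:50, total)
missing_vals       # 1 3 5 7 9 20 31 42
sum(missing_vals)  # 118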

Creating function to read data set and columns and displaying nrow

I am struggling a bit with a probably fairly simple task. I wanted to create a function whose arguments are a dataframe (df), two column names of that dataframe (T and R), and a value from each selected column (a and b). I know that the function reads the dataframe, but I don't know how the columns are selected, and I'm getting an error.
fun <- function(df,T,a,R,b)
{
  col <- ds[c("x","y")]
  omit <- na.omit(col)
  data1 <- omit[omit$x == 'a',]
  data2 <- omit[omit$x == 'b',]
  nrow(data2)/nrow(data1)
}
fun(jugs,Place,UK,Price,10)
I'm new to r language. So, please help me.
There are several errors you're making.
col <- ds[c("x","y")]
What are x and y? Presumably they're arguments that you're passing, but you specify T and R in your function, not x and y.
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
Again, presumably you want a and b to be arguments you passed to the function, but you specified the literal strings 'a' and 'b', not the arguments. Also, I assume that second "omit$x" should be "omit$y" (or vice versa). And actually, since you just made this into a new data frame with two columns, you can just use the column index.
nrow(data2)/nrow(data1)
You should print this line, or return it. Either one will suffice.
fun(jugs,Place,UK,Price,10)
Finally, you should use quotes on Place, UK, and Price, at least the way I've done it:
fun <- function(df, col1, val1, col2, val2){
  new_cols <- df[,c(col1, col2)]
  omit <- na.omit(new_cols)
  data1 <- omit[omit[,1] == val1,]
  data2 <- omit[omit[,2] == val2,]
  print(nrow(data2)/nrow(data1))
}
fun(jugs, "Place", "UK", "Price", 10)
And if I understand what you're trying to do, it may be easier to avoid creating multiple dataframes that you don't need and just use counts instead.
fun <- function(df, col1, val1, col2, val2){
  new_cols <- df[,c(col1, col2)]
  omit <- na.omit(new_cols)
  n1 <- sum(omit[,1] == val1)
  n2 <- sum(omit[,2] == val2)
  print(n2/n1)
}
fun(jugs, "Place", "UK", "Price", 10)
I would write this function as follows:
fun <- function(df,T,a,R,b) {
  data <- na.omit(df[c(T,R)]);
  sum(data[[R]]==b)/sum(data[[T]]==a);
};
As you can see, you can combine the first two lines into one, because in your code col was not reused anywhere. Secondly, since you only care about the number of rows of the two subsets of the intermediate data.frame, you don't actually need to construct those two data.frames; instead, you can just compute the logical vectors that result from the two comparisons, and then call sum() on those logical vectors, which naturally treats FALSE as 0 and TRUE as 1.
Demo:
fun <- function(df,T,a,R,b) { data <- na.omit(df[c(T,R)]); sum(data[[R]]==b)/sum(data[[T]]==a); };
df <- data.frame(place=c(rep(c('p1','p2'),each=4),NA,NA), price=c(10,10,20,NA,20,20,20,NA,20,20), stringsAsFactors=F );
df;
## place price
## 1 p1 10
## 2 p1 10
## 3 p1 20
## 4 p1 NA
## 5 p2 20
## 6 p2 20
## 7 p2 20
## 8 p2 NA
## 9 <NA> 20
## 10 <NA> 20
fun(df,'place','p1','price',20);
## [1] 1.333333
