Here's what happened:
> NA.of.df = which(rowSums(is.na(df)) == ncol(df))
> NA.of.df
named integer(0)
> fix(df) # i want to see what's in here -- nothing wrong
> NA.of.df # so i run it again
1 3 5 7 9 # it works!
Why would this happen?
A reproducible example (although dput() does not seem to show any useful data structure) looks like the following:
> dput(NA.of.df)
structure(integer(0), .Names = character(0))
NA.of.df is just the result of the code for finding rows that are all NA (taken from the question "Remove rows in R matrix where all data is NA"), i.e. NA.of.df = which(rowSums(is.na(df)) == ncol(df)).
It could be an issue with quotes around the NA: a value stored as the string "NA" is not a missing value, so is.na does not pick up those elements.
is.na(c(NA, "NA"))
#[1] TRUE FALSE
After running fix(), the quotes may have been dropped, so the values are then evaluated correctly as missing.
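A minimal sketch of that hypothesis, assuming a made-up data frame in which the missing values were read in as the string "NA" (the original df is not shown, so the column names and values here are illustrative only):

```r
# Hypothetical data: every "missing" value is really the character string "NA".
df <- data.frame(x = c("NA", "1", "NA"),
                 y = c("NA", "2", "NA"),
                 stringsAsFactors = FALSE)

# is.na() sees ordinary strings, so no row counts as all-NA.
before <- which(rowSums(is.na(df)) == ncol(df))

# Convert the "NA" strings to real missing values, then run the same check.
df[df == "NA"] <- NA
after <- which(rowSums(is.na(df)) == ncol(df))

print(before)
print(after)
```

If this is what happened, converting the strings explicitly (or reading the data with a suitable na.strings setting) avoids depending on a side effect of fix().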
I am encountering an issue when I use the extraction operator `$`() inside of a function. The problem does not exist if I follow the same logic outside of the function, so I assume there might be a scoping issue that I'm unaware of.
The general setup:
## Make some fake data for your reproducible needs.
set.seed(2345)
my_df <- data.frame(cat_1 = sample(c("a", "b"), 100, replace = TRUE),
cat_2 = sample(c("c", "d"), 100, replace = TRUE),
continuous = rnorm(100),
stringsAsFactors = FALSE)
head(my_df)
This process I am trying to dynamically reproduce:
index <- which(`$`(my_df, "cat_1") == "a")
my_df$continuous[index]
But once I program this logic into a function, it fails:
## Function should take a string for the following:
## cat_var - string with the categorical variable name as it appears in df
## level - a level of cat_var appearing in df
## df - data frame to operate on. Function assumes it has a column
## "continuous".
extract_sample <- function(cat_var, level, df = my_df) {
index <- which(`$`(df, cat_var) == level)
df$continuous[index]
}
## Does not work.
extract_sample(cat_var = "cat_1", level = "a")
This is returning numeric(0). Any thoughts on what I'm missing? Alternative approaches are welcome as well.
The problem isn't the function, it's the way $ handles the input.
cat_var = "cat_1"
length(`$`(my_df,"cat_1"))
#> [1] 100
length(`$`(my_df,cat_var))
#> [1] 0
You can instead use [[ to achieve your desired outcome.
cat_var = "cat_1"
length(`[[`(my_df,"cat_1"))
#> [1] 100
length(`[[`(my_df,cat_var))
#> [1] 100
UPDATE
It's been noted that using [[ this way is ugly. And it is. The function-call form is useful when you want to write something like lapply(stuff, '[[', 1).
Here, you should probably be writing it as my_df[[cat_var]].
Also, this question/answer goes into a little more detail about why $ doesn't work the way you want it to.
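As a sketch of that suggestion, here is the question's function rewritten with [[ in place of `$`(). The data setup is copied from the question; nothing else is assumed:

```r
set.seed(2345)
my_df <- data.frame(cat_1 = sample(c("a", "b"), 100, replace = TRUE),
                    cat_2 = sample(c("c", "d"), 100, replace = TRUE),
                    continuous = rnorm(100),
                    stringsAsFactors = FALSE)

extract_sample <- function(cat_var, level, df = my_df) {
  # [[ evaluates cat_var, so the column named by the string is looked up.
  index <- which(df[[cat_var]] == level)
  df[["continuous"]][index]
}

result <- extract_sample(cat_var = "cat_1", level = "a")
length(result)
```

The returned vector has one value per row where cat_1 equals "a", rather than numeric(0).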
The problem is that $ uses non-standard evaluation: when you don't quote the name, it takes what you typed literally as the column name, even if you meant it to refer to another variable.
Or more simply, as #42 put it in the first comment in the linked question:
The "$" function does not evaluate its arguments, whereas "[[" does.
Here's a much simpler data set as an example.
my_df <- data.frame(a=c(1,2))
v <- "a"
Compare the usual usage. The first two give the same result: whether or not you quote the name, $ takes it literally. So it is (now) clear why the third one doesn't do what you want.
my_df$"a"
## [1] 1 2
my_df$a
## [1] 1 2
my_df$v
## NULL
That's exactly what's happening to you:
`$`(my_df, "a")
## [1] 1 2
`$`(my_df, v)
## NULL
Instead we need to evaluate v before sending to $ by using do.call.
do.call(`$`, list(my_df, v))
## [1] 1 2
Or, more appropriately, use the [[ version which does evaluate the parameters first.
`[[`(my_df, v)
## [1] 1 2
The problem lies in the way you are indexing the column. A slight tweak to your function makes it work:
extract_sample <- function(cat_var, level, df = my_df) {
index <- df[, cat_var] == level
df$continuous[index]
}
Using it dynamically:
> extract_sample(cat_var = "cat_2", level = "d")
[1] -0.42769207 -0.75650031 0.64077840 -1.02986889 1.34800344 0.70258431 1.25193247
[8] -0.62892048 0.48822673 0.10432070 1.11986063 -0.88222370 0.39158408 1.39553002
[15] -0.51464283 -1.05265106 0.58391650 0.10555913 0.16277385 -0.55387829 -1.07822831
[22] -1.23894422 -2.32291394 0.11118881 0.34410388 0.07097271 1.00036812 -2.01981056
[29] 0.63417799 -0.53008375 1.16633422 -0.57130500 0.61614135 1.06768285 0.74182293
[36] 0.56538633 0.16784205 -0.14757303 -0.70928924 -1.91557732 0.61471302 -2.80741967
[43] 0.40552376 -1.88020372 -0.38821089 -0.42043745 1.87370600 -0.46198139 0.10788358
[50] -1.83945868 -0.11052531 -0.38743950 0.68110902 -1.48026285
I am using the following code in a loop; I am only reproducing the part where I face the problem. The entire code is extremely long, and I have removed the parts between these lines that run fine. This is just to explain the problem:
for (j in 1:2)
{
assign(paste("numeric_data",j,sep="_"),unique_id)
for (i in 1:2)
{
assign(paste("numeric_data",j,sep="_"),
merge(eval(as.symbol(paste("numeric_data",j,sep="_"))),
eval(as.symbol(paste("sd_1",i,sep="_"))),all.x = TRUE))
}
}
The problem is that in the second step I want to use eval + paste instead of assign:
for (j in 1:2)
{
assign(paste("numeric_data",j,sep="_"),unique_id)
for (i in 1:2)
{
eval(as.symbol((paste("numeric_data",j,sep="_"))))<-
merge(eval(as.symbol(paste("numeric_data",j,sep="_"))),
eval(as.symbol(paste("sd_1",i,sep="_"))),all.x = TRUE)
}
}
However, R does not accept eval on the left-hand side of an assignment. I searched the forum, and assign is suggested everywhere as the solution. However, if I use assign, the loop overwrites my previously generated "numeric_data" instead of adding to it, so I get output for only one value of i instead of both.
Here is a very basic intro to one of the most fundamental data structures in R. I highly recommend reading more about them in standard documentation sources.
#A list is a (possibly named) set of objects
numeric_data <- list(A1 = 1, A2 = 2)
#I can refer to elements by name or by position, e.g. numeric_data[[1]]
> numeric_data[["A1"]]
[1] 1
#I can add elements to a list with a particular name
> numeric_data <- list()
> numeric_data[["A1"]] <- 1
> numeric_data[["A2"]] <- 2
> numeric_data
$A1
[1] 1
$A2
[1] 2
#I can refer to named elements by building the name with paste()
> numeric_data[[paste0("A",1)]]
[1] 1
#I can change all the names at once...
> numeric_data <- setNames(numeric_data,paste0("B",1:2))
> numeric_data
$B1
[1] 1
$B2
[1] 2
#...in multiple ways
> names(numeric_data) <- paste0("C",1:2)
> numeric_data
$C1
[1] 1
$C2
[1] 2
Basically, the lesson is that if you have objects with names with numeric suffixes: object_1, object_2, etc. they should almost always be elements in a single list with names that you can easily construct and refer to.
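Applied to the loop in the question, that advice might look like the sketch below. The question does not show unique_id or the sd_1_* objects, so the small data frames here are hypothetical stand-ins, and holding them in a list called sd_list is an assumption of this example:

```r
# Hypothetical stand-ins for the question's objects.
unique_id <- data.frame(id = 1:3)
sd_list <- list(
  sd_1_1 = data.frame(id = c(1L, 2L), a = c(10, 20)),
  sd_1_2 = data.frame(id = c(2L, 3L), b = c(200, 300))
)

numeric_data <- list()
for (j in 1:2) {
  merged <- unique_id
  for (i in seq_along(sd_list)) {
    # Each pass merges into the running result, so nothing is overwritten.
    merged <- merge(merged, sd_list[[i]], all.x = TRUE)
  }
  # Store the result under a constructed name instead of calling assign().
  numeric_data[[paste0("numeric_data_", j)]] <- merged
}

numeric_data[["numeric_data_1"]]
```

Keeping the running result in an ordinary variable and storing it in the list at the end removes the need for assign(), eval() and as.symbol() altogether.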
Sorry if this is trivial. I am seeing the following behaviour in R:
> myDF <- data.frame(Score=5, scoreScaled=1)
> myDF$score ## forgot that the Score variable was capitalized
[1] 1
Expected result: returns NULL (even better: throws error).
I have searched for this, but was unable to find any discussion of this behaviour. Is anyone able to provide any references on this, the rationale on why this is done and if there is any way to prevent this? In general I would love a version of R that is a little stricter with its variables, but it seems that will never happen...
The $ operator does partial matching: it needs only enough of the start of a column name to identify it uniquely. So for example:
> d <- data.frame(score=1, scotch=2)
> d$sco
NULL
> d$scor
[1] 1
A way of avoiding this behavior is to use the [[ operator, which matches names exactly by default and behaves like so:
> d <- data.frame(score=1, scotch=2)
> d[['scor']]
NULL
> d[['score']]
[1] 1
I hope that was helpful.
Cheers!
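For the data frame in the question, the difference can be sketched as follows. The exact argument of [[ and the warnPartialMatchDollar option are part of base R; whether the option produces a warning for data frames in a particular R version is an assumption worth checking locally:

```r
myDF <- data.frame(Score = 5, scoreScaled = 1)

partial <- myDF$score                      # partial match picks scoreScaled
exact   <- myDF[["score"]]                 # exact matching: no such column
loose   <- myDF[["score", exact = FALSE]]  # explicitly opt in to partial matching

# Ask R to warn whenever $ falls back to a partial match, then restore the option.
old <- options(warnPartialMatchDollar = TRUE)
warned <- tryCatch({ myDF$score; FALSE }, warning = function(w) TRUE)
options(old)

print(partial)
print(exact)
print(loose)
print(warned)
```

Setting the option at the top of a script is one way to get somewhat stricter behavior without changing every $ into [[.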
Using [,""] instead of $ will throw an error in case the name is not found.
myDF$score
#[1] 1
myDF[,"score"]
#Error in `[.data.frame`(myDF, , "score") : undefined columns selected
myDF[,"Score"]
#[1] 5
myDF[,"score", drop=TRUE] #More explicit and will also work with tibble::as_tibble
#Error in `[.data.frame`(myDF, , "score", drop = TRUE) :
# undefined columns selected
myDF[,"Score", drop=TRUE]
#[1] 5
as.data.frame(myDF)[,"score"] #Will work also with tibble::as_tibble and data.table::as.data.table
#Error in `[.data.frame`(as.data.frame(myDF), , "score") :
# undefined columns selected
as.data.frame(myDF)[,"Score"]
#[1] 5
unlist(myDF[,"score"], use.names = FALSE) #Will work also with tibble::as_tibble and data.table::as.data.table
#Error in `[.data.frame`(myDF, , "score") : undefined columns selected
unlist(myDF[,"Score"], use.names = FALSE)
#[1] 5
I am trying to remove useless columns from a data frame. I used a while loop with an if statement, and it seems that it never leaves the if branch:
it = 1
while (it < ncol(testing))
{
if ("drop" %in% CategOfData[it,])
{testing[,it]<-NULL}
else it = it + 1
}
The if statement works as long as it's not nested in a while loop.
testing is my data frame, containing 400 rows and 12 columns.
CategOfData is a data frame of 12 rows and 2 columns. It contains the headers of my data frame "testing" and their categories; 3 rows contain the word "drop".
I tested this code by replacing {testing[,it]<-NULL} with {jkl = jkl + 0.5}.
Again the code ran for a long time. I cut it short and asked the console for the value of jkl, and it returned a number well over 800,000, when it should have returned 2.5 (1 + 3*0.5).
I don't understand why it never enters the else part of the code, which makes the while loop infinite since "it" is never incremented.
I would use a for loop, but R doesn't agree since I'm dropping columns as I go.
The structure of CategOfData:
> CategOfData [1,]
header x
"PIERRE.MARIE" "drop"
and of "testing":
> head(testing[,1])
[1] PIERRE-MARIE PIERRE-MARIE PIERRE-MARIE PIERRE-MARIE PIERRE-MARIE PIERRE-MARIE
Levels: LAURENNE PIERRE-MARIE
Can you help me pinpoint where the problem lies please?
I tried this instead:
it = 1
rem = ncol(testing)
while (it < rem)
{
if ("drop" %in% CategOfData2[it,])
{testing[,it]<-NULL
CategOfData2 = CategOfData2[-it,]
rem = rem - 1
}
it = it + 1
}
It works more or less, but it doesn't remove the last column, which is a "drop".
Your problem can be solved with two lines of R code:
drop <- apply(CategOfData, 1, function(x) { "drop" %in% x })
testing <- testing[, !drop]
And this is a good example of the kind of power which R has, when you use it correctly.
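A likely reason the original loop misbehaves is that row it of CategOfData never changes: once that row contains "drop", the condition stays true on every pass, so the else branch that increments it is never reached. The second attempt removes the row too, but it still increments it after a deletion (skipping the column that just slid into that position) and its condition it < rem never examines the last column. The vectorized approach sidesteps all of this. A small sketch with made-up data, since the real testing and CategOfData are not shown:

```r
# Hypothetical stand-ins: four columns, two of them flagged "drop".
testing <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)
CategOfData <- data.frame(header = c("a", "b", "c", "d"),
                          x = c("keep", "drop", "keep", "drop"),
                          stringsAsFactors = FALSE)

# One logical flag per row of CategOfData, i.e. per column of testing.
drop_flag <- apply(CategOfData, 1, function(x) "drop" %in% x)

# drop = FALSE keeps a data frame even if only one column survives.
testing_clean <- testing[, !drop_flag, drop = FALSE]
names(testing_clean)
```

Because the flags are computed once, before anything is removed, the last column is handled like any other.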
I have a simple but strange problem.
indices.list is a list containing 118,771 elements (integer or numeric). By applying the function unlist I lose about 500 elements.
Look at the following code:
> indices <- unlist(indices.list, use.names = FALSE)
>
> length(indices.list)
[1] 118771
> length(indices)
[1] 118248
How is that possible? I checked whether indices.list contains any NA, but it does not:
> any(is.na(indices.list) == TRUE)
[1] FALSE
data.set.merged is a dataframe containing more than 200,000 rows. When I use the vector indices (which apparently has the length 118,248) in order to get a subset of data.set.merged, I get a dataframe with 118,771 rows!?? That's so strange!
data.set.merged.2 <- data.set.merged[indices, ]
> nrow(data.set.2)
[1] 118771
Any ideas what's going on here?
Well, for your first mystery, the likely explanation is that some elements of indices.list are NULL, which means they will disappear when you use unlist:
unlist(list(a = 1,b = "test",c = 2,d = NULL, e = 5))
a b c e
"1" "test" "2" "5"
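To check whether this is what is happening, one option is to count the elements of each list entry with lengths(). The sketch below uses a small made-up list in place of the real indices.list:

```r
# Hypothetical stand-in for indices.list, with two empty entries.
indices.list <- list(1L, NULL, 3L, integer(0), 5L)

len_list <- length(indices.list)                            # counts every entry
len_flat <- length(unlist(indices.list, use.names = FALSE)) # empty entries vanish

# Positions of entries that contribute nothing to unlist().
empty_pos <- which(lengths(indices.list) == 0)

# is.na() does not flag NULL or zero-length entries, so this check misses them.
any_na <- any(is.na(indices.list))

print(c(len_list, len_flat))
print(empty_pos)
print(any_na)
```

The same check also reveals entries with more than one element, which would make the unlisted vector longer rather than shorter.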