Add a column to empty data.frame - r

I want to initialise a column in a data.frame like so:
df$newCol = 1
where df is a data.frame that I have defined earlier and already done some processing on. As long as nrow(df)>0, this isn't a problem, but sometimes my data.frame has row length 0 and I get:
> df$newCol = 1
Error in `[[<-`(`*tmp*`, name, value = 1) :
1 elements in value to replace 0 elements
I can work around this by changing my original line to
df$newCol = rep(1,nrow(df))
but this seems a bit clumsy and is computationally prohibitive if the number of rows in df is large. Is there a built-in or standard solution to this problem? Or should I use a custom function like this:
addCol = function(df, name, value) {
  if (nrow(df) == 0) {
    df[, name] = rep(value, 0)
  } else {
    df[, name] = value
  }
  df
}

If I understand correctly,
df = mtcars[0, ]
df$newCol = numeric(nrow(df))
should be it?
This assumes that by "row length" you mean nrow(df), in which case you need to assign a vector of length 0. In that case, numeric(nrow(df)) gives exactly the same result as rep(0, nrow(df)).
It also assumes that you just need a new column, not specifically a column of ones. If you do need ones, simply add 1 to the result; that is a vectorized operation and therefore fast.
Other than that, I'm not sure you can have an "empty" column: the vector must have the same number of elements as the other columns in the data frame. But numeric() is fast, so it should not hurt.
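As a minimal sketch of the idea above, the helper below always builds the new column with rep(value, nrow(df)), so the same line works for empty and non-empty data frames. The name add_const_col is hypothetical and only used for illustration; mtcars stands in for the asker's df.

```r
# Hypothetical helper: add a constant column whose length always matches nrow(df).
add_const_col <- function(df, name, value) {
  # rep(value, 0) is a zero-length vector, so this also works when df has no rows
  df[[name]] <- rep(value, nrow(df))
  df
}

empty <- add_const_col(mtcars[0, ], "newCol", 1)  # zero-row data frame
full  <- add_const_col(mtcars, "newCol", 1)       # ordinary data frame
```

Because the length is derived from nrow(df), no special case for empty input is needed.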

Related

R- Remove rows based on condition across some columns

I have a data frame like this :
I want to remove the rows that have values equal to 0 in the columns that are "numeric". I tried some functions, but they either returned an error or didn't remove anything, because the entire row is not equal to 0. In summary, I need to remove the rows that are equal to 0 in the columns that have a numeric class (i.e. from sales month to expected sales). How could I do this? (The result I expect is attached below.)
P.S.: If I could do it with a function that lets me give the column number instead of the name, that would be great!
Here is a simple solution with lapply. Note that it drops every row that contains a 0 in any column.
set.seed(5)
df <- data.frame(a=1:10,b=letters[1:10],x=sample(0:5,5,replace = T),y=sample(c(0,10,20,30,40,50),5,replace = T))
df <-df[!unlist(lapply(1:nrow(df), function(i) {
any(df[i, ] == 0)
})), ]
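If the goal is instead to drop only the rows where all of the chosen columns are 0, and to choose those columns by position as the question asks, something like the following sketch may help. The data and the positions in cols are made up for illustration, and the code assumes the selected columns contain no NA values.

```r
# Illustrative data: columns 3 and 4 play the role of the numeric columns to check.
df <- data.frame(
  a = 1:6,
  b = letters[1:6],
  x = c(0, 1, 0, 3, 0, 5),
  y = c(0, 10, 20, 0, 0, 50)
)

cols <- 3:4  # column positions to test (an assumption for this example)

# A row is "all zero" when none of the selected columns is non-zero.
all_zero <- rowSums(df[, cols, drop = FALSE] != 0) == 0

result <- df[!all_zero, ]
```

Here only rows 1 and 5 have 0 in both x and y, so those are the rows removed.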

Subset data.table based on value in column of type list

So I have this case currently of a data.table with one column of type list.
This list can contain different values, NULL among other possible values.
I tried to subset the data.table to keep only rows for which this column has the value NULL.
Behold... my attempts below (for the example I named the column "ColofTypeList"):
DT[is.null(ColofTypeList)]
It returns an empty data.table.
Then I tried:
DT[ColofTypeList == NULL]
It returns the following error (I expected an error):
Error in .prepareFastSubset(isub = isub, x = x, enclos = parent.frame(), :
RHS of == is length 0 which is not 1 or nrow (96). For robustness, no recycling is allowed (other than of length 1 RHS). Consider %in% instead.
(A clarification: my original data.table contains 96 rows, which is why the error message says "which is not 1 or nrow (96)". The number of rows is not the point.)
Then I tried this:
DT[ColofTypeList == list(NULL)]
It returns the following error:
Error: comparison of these types is not implemented
I also tried to give a list of the same length as the column, and got this same last error.
So my question is simple: What is the correct data.table way to subset the rows for which elements of this "ColofTypeList" are NULL ?
EDIT: here is a reproducible example
DT<-data.table(Random_stuff=c(1:9),ColofTypeList=rep(list(NULL,"hello",NULL),3))
Have fun!
If it is a list, we can loop through the list and apply the is.null to return a logical vector
DT[unlist(lapply(ColofTypeList, is.null))]
# ColofTypeList anotherCol
#1: 3
Or another option is lengths
DT[lengths(ColofTypeList)==0]
data
DT <- data.table(ColofTypeList = list(0, 1:5, NULL, NA), anotherCol = 1:4)
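To see why the first attempt in the question returns nothing, it may help to look at the list column on its own. The sketch below uses a plain base R list (named col here purely for illustration) rather than a data.table, so it runs without extra packages.

```r
# A list standing in for the list column from the question.
col <- list(NULL, "hello", NULL)

# is.null() tests the list as a whole, so it yields a single FALSE,
# which selects no rows when used as a filter.
whole_list_is_null <- is.null(col)

# Testing element by element gives one logical value per row.
per_element <- vapply(col, is.null, logical(1))

# lengths() gives the same answer here, but it would also flag
# other zero-length elements such as character(0).
by_length <- lengths(col) == 0
```

Either per-element vector can then be used inside DT[...] as in the answer above.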
I have found another way that is also quite nice:
DT[lapply(ColofTypeList, is.null)==TRUE]
It is also important to mention that using isTRUE() doesn't work.

For each row in DF, check if there is a match in a vector

I have a data frame in R, and for each row I want to check whether its string matches any element of a vector. I can't seem to get it to work exactly right.
exampledf=as.data.frame(c("PIT","SLC"))
colnames(exampledf)="Column1"
examplevector=c("PITTPA","LAXLAS","JFKIAH")
This gets me close, but the result is a vector of (1,0,0) instead of a 0 or 1 for each row
exampledf$match=by(exampledf,1:nrow(exampledf),function(row) ifelse(grepl(exampledf$Column1,examplevector),1,0))
Expected result:
exampledf$match=c("1","0")
grepl returns a logical vector the same length as your examplevector. You can wrap it with the any() function (equivalent to using sum() as suggested above).
Here's a slightly modified form of your code:
exampledf$match = vapply(exampledf$Column1, function(x) any(grepl(x, examplevector)), 1L)
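A slightly more explicit variant of the same idea is sketched below. It assumes the codes should be matched as literal substrings (hence fixed = TRUE, which is an addition not in the original answer) and converts the logical result to an integer flag.

```r
exampledf <- data.frame(Column1 = c("PIT", "SLC"), stringsAsFactors = FALSE)
examplevector <- c("PITTPA", "LAXLAS", "JFKIAH")

# For each row, check whether the code occurs anywhere in any element of the vector.
exampledf$match <- vapply(
  exampledf$Column1,
  function(code) as.integer(any(grepl(code, examplevector, fixed = TRUE))),
  integer(1),
  USE.NAMES = FALSE
)
```

"PIT" occurs in "PITTPA" and "SLC" occurs nowhere, so the flags are 1 and 0.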
So here is my solution:
library(dplyr)
exampledf=as.data.frame(c("PIT","SLC"))
colnames(exampledf)="Column1"
examplevector=c("PITTPA","LAXLAS","JFKIAH")
pmatch does what you want and tells you which element of the example vector it matches. Use duplicates.ok because you want multiple matches to show up. If you don't want that, set the argument to FALSE. I just used dplyr to create the new column, but you can do this however you like.
exampledf %>%
  mutate(match_flag = ifelse(is.na(pmatch(Column1, examplevector, duplicates.ok = T)),
                             0,
                             pmatch(Column1, examplevector, duplicates.ok = T)))
Column1 match_flag
1 PIT 1
2 SLC 0

R - sample used in %in% modify dataframe which is being subsetted

I'm not sure I titled the question correctly, because I don't fully understand the reason for the following behaviour:
dfSet <- data.frame(ID = sample(1:15, size = 15, replace = FALSE), va1 = NA, va3 = 0, stringsAsFactors = FALSE)
dfSet[1:10, ]$va1 <- 'o1'
dfSet[11:15, ]$va1 <- 'o2'
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1
print(length(unique(dfSet$ID)))
I expect the final print to show 15, but it doesn't. Instead 13 or 14 appears, and dfSet is modified in such a way that at least two rows have the same ID. It seems that this part of the code:
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), ]$va3 <- 1
modifies the ID column, and I don't know why.
Workaround:
temp <- sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE)
dfSet[dfSet$ID %in% temp, ]$va3 <- 1
In this case everything works as expected - there are 15 rows with unique ID.
The question is why direct usage of sample in %in% modifies data frame?
The problem seems to be that R does something tricky when you assign to function return values. For example, something like
a <- c(1,3)
names(a) <- c("one", "three")
would look very odd in most languages. How do you assign a value to the return value of a function? What's really happening is that there is a function named names<- that's defined. Basically that's returning a transformed version of the original object that can then be used to replace the value passed to that function. So it really looks like this
.temp. <- `names<-`(a, c("one","three"))
a <- .temp.
The variable a is always completely replaced, not just its names.
When you do something like
dfSet$a<-1
what's really happening again is
.temp. <- "$<-"(dfSet, a, 1)
dfSet <- .temp.
Now things get a bit more tricky when you try to do both [] and $ subsetting. Look at this sample
#for subsetting
f <- function(x,v) {print("testing"); x==v}
x <- rep(0:1, length.out=nrow(dfSet))
dfSet$a <- 0
dfSet[f(x,1),]$a<-1
Notice how "testing" is printed twice. What's going on is really more like
.temp1. <- "$<-"(dfSet[f(x,1),], a, 1)
.temp2. <- "[<-"(dfSet, f(x,1), , .temp1.)
dfSet <- .temp2.
So the f(x,1) is evaluated twice. This means that sample would be evaluated twice as well.
The error is a bit more obvious if you try to replace a variable that does not exist yet
dfSet[f(x,1),]$b<-1
# Warning message:
# In `[<-.data.frame`(`*tmp*`, f(x, 1), , value = list(ID = c(6L, :
# provided 4 variables to replace 3 variables
Here you get the warning because the .temp1. variable has added the column and now has 4 columns, but when you try to do the assignment to .temp2. you have a problem: the slice of the data frame that you are trying to replace is a different size.
The IDs are replaced because the $<- operator doesn't just return a new column, it returns a new data.frame with the column updated to whatever value you assigned. This means that the rows that were updated are returned along with the ID that was there when the assignment happened. This is saved in the .temp1. variable. Then when you do the [<- assignment, you are choosing a new set of rows to swap out. The values of all columns of these rows are replaced with the values from .temp1.. This means that you will be overwriting the IDs for the replacement rows and they may differ so you are likely to wind up with two or more copies of a given ID.
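The double evaluation described above can be checked with a small counter. In the sketch below, pick_rows and counter are hypothetical names introduced only for this demonstration; the function is deterministic so the data stay intact, but it records how many times it is called.

```r
# Hypothetical call counter kept in its own environment.
counter <- new.env()
counter$n <- 0

# Returns a logical row index and records each call.
pick_rows <- function(n_rows) {
  counter$n <- counter$n + 1
  seq_len(n_rows) <= 2
}

# Combined [ and $ assignment: the index expression is evaluated
# once to extract the rows and once more to write them back.
d <- data.frame(id = 1:4, v = 0)
d[pick_rows(nrow(d)), ]$v <- 1
two_step_calls <- counter$n

# Single [<- assignment with the column named in the index.
counter$n <- 0
d2 <- data.frame(id = 1:4, v = 0)
d2[pick_rows(nrow(d2)), "v"] <- 1
one_step_calls <- counter$n
```

With a random index such as sample(), the two evaluations in the first form can pick different rows, which is how the IDs end up duplicated.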
Although I'm not 100% sure, I suspect that R is running the sample two times. When you subset and assign in R, for example:
x[i:j,]$v1 <- 1
It gets evaluated as "take out rows i to j from x as a temporary data frame, assign 1 to the v1 column of that data frame, then copy the temporary data frame back into rows i to j in x".
So maybe the indexing expression (i:j) gets executed twice (once to extract, and once to put back), and if it's a random variable, it's going to put the results back in different rows than the ones originally selected.
Consider this simpler example:
x <- data.frame(a=1:10, b=10:1)
x$b <- 5
What the second line actually does is
x <- `$<-`(x, 'b', 5)
You can see that $<- is just a function that takes three arguments: an object, a name and a value. (Note the backticks are necessary if you want to use $<- directly.)
The problem, I think, is that in your example x is an expression that evaluates to different things each time it's evaluated, due to the call to sample, so you should avoid this.
An alternative is to use [<- which apparently doesn't have this problem:
dfSet[dfSet$ID %in% sample(dfSet[dfSet$va1 == 'o1', ]$ID, 7, replace = FALSE), 'va3'] <- 1

Learning Functions [Error: undefined columns selected]

R newbie here.
I'm learning functions, and I have a problem running this:
newfunction = function(x) {
  limit = ncol(x)
  for(i in 1:limit){
    if(anyNA(x[,i] == T)) {
      x[,i] = NULL
    }
  }
}
newfunction(WBD_SA)
I get the error: Error in '[.data.frame(x, , i) : undefined columns selected
I'm trying to remove all columns that have any NA values from my data set WBD_SA.
I know na.omit() removes rows with NA values, but I'm not sure if there is something for columns.
Any suggestions regarding packages/functions that can make this happen are also appreciated.
Cheers!
You are getting this error because you are iterating from 1 to limit, where limit is the number of columns at the start of the function, and you're dropping columns from the data.frame as you iterate through the for loop. This means that if you drop even 1 column, ncol(x) will be less than limit by the time the for loop ends. I'll give you 3 alternatives that work:
iterate backward:
for(i in limit:1)
if(anyNA(x[,i] == TRUE))
x[,i] = NULL
with the above loop, the i'th column will always be in the same position as it was when the for loop started.
iterate forward using a while loop, advancing i only when no column was dropped (after a drop, the next column shifts into position i):
i = 1
while(i <= ncol(x)){
  if(anyNA(x[,i] == TRUE)){
    x[,i] = NULL
  }else{
    i = i+1
  }
}
use the fact that data.frames are subclasses of lists, and use sapply to create a logical index that is TRUE for columns that contain a missing value and FALSE otherwise, like so:
columnHasMissingValue <- sapply(x, function(y) any(is.na(y)))
x <- x[, !columnHasMissingValue]
as long as you're learning about data.frames, it's useful to know that you can use negative indices to drop columns (this is only safe when at least one column matches, because an empty negative index selects no columns), like so:
x <- x[, -which(columnHasMissingValue)]
Note that the above solution is similar to the apply solution in user1362215's answer, which takes advantage of the fact that data.frames have two dimensions* so you can apply a function over the second margin (columns) like so:
good_cols = apply(x,            # the object over which to apply the function
                  2,            # apply the function over the second margin (columns)
                  function(x)   # the function to apply
                    !any(is.na(x))
)
x = x[,good_cols]
* 2 dimensions means that the [ operator defined for the data.frame class takes 2 arguments that are interpreted as rows and columns indexes.
When you are iterating over the columns, using x[,i] = NULL removes the column, reducing the number of columns by 1. Unless i is the last column, this will produce errors for future values of i. You should instead do something like this
good_cols = apply(x,2,function(x) {!any(is.na(x))})
x = x[,good_cols]
apply(x, margin, function) applies function over the margin dimension of x (rows for a margin of 1, columns for 2; 3 or higher is possible with arrays). This is more concise than looping and avoids the errors caused by changing x partway through.
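Putting the pieces together, a function version might look like the sketch below. The name drop_na_columns and the demo data are assumptions for illustration. Unlike the function in the question, it returns the modified data frame, and drop = FALSE keeps the result a data frame even if only one column survives.

```r
# Hypothetical helper: return x without the columns that contain any NA.
drop_na_columns <- function(x) {
  # One TRUE/FALSE per column; vapply guarantees a logical vector.
  has_na <- vapply(x, anyNA, logical(1))
  x[, !has_na, drop = FALSE]
}

# Small made-up data set standing in for WBD_SA.
demo <- data.frame(
  a = 1:3,
  b = c(1, NA, 3),
  c = c("x", "y", "z"),
  d = c(NA, NA, NA)
)

cleaned <- drop_na_columns(demo)
```

The result must be assigned, for example WBD_SA <- drop_na_columns(WBD_SA), because R functions do not modify their arguments in place.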
