R selecting rows from dataframe using logical indexing: accessing columns by `$` vs `[]` - r

I have a simple R data.frame object df. I am trying to select rows from this dataframe based on logical indexing from a column col in df.
I am coming from the python world where during similar operations i can either choose to select using df[df[col] == 1] or df[df.col == 1] with the same end result.
However, in the R data frame df[df$col == 1] gives an incorrect result compared to df[df[,col] == 1] (confirmed by summary command). I am not able to understand this difference as from links like http://adv-r.had.co.nz/Subsetting.html it seems that either way is ok. Also, str command on df$col and df[, col] shows the same output.
Is there any guidelines about when to use $ vs [] operator ?
Edit:
digging a little deeper and using this question as reference, it seems like the following code works correctly
df[which(df$col == 1), ]
however, not clear how to guard against NA and when to use which

You confused many things.
In
df[,col]
col should be the column number. For example,
col = 2
x = df[,col]
would select the second column and store it to x.
In
df$col
col should be the column name. For example,
df=data.frame(aa=1:5,bb=10:14)
x = df$bb
would select the second column and store it to x. But you cannot write df$2.
Finally,
df[[col]]
is the same as df[,col] if col is a number. If col is a character ("character" in R means the same as string in other languages), then it selects the column with this name. Example:
df=data.frame(aa=1:5,bb=10:14)
foo = "bb"
x = df[[foo]]
y = df[[2]]
z = df[["bb"]]
Now x, y, and z are all contain the copy of the second column of df.
The notation foo[[bar]] is from lists. The notation foo[,bar] is from matrices. Since dataframe has features of both matrix and list, it can use both.

Use $ when you want to select one specific column by name df$col_name.
Use [] when you want to select one or more columns by number:
df[,1] # select column with index 1
df[,1:3]# select columns with indexes 1 to 3
df[,c(1,3:5,7)] # select columns with indexes 1, 3 to 5 and 7.
[[]] is mostly for lists.
EDIT: df[which(df$col == 1), ] works because which function creates a logical vector which checks if the column index is equal to 1 (true) or not (false). This logical vector is passed to df[] and only true value is shown.
Remove rows with NAs (missing values) in data.frame - to find out more about how to deal with missing values. It is always a good practice to exclude missing values from dataset.

Related

Subset data.table based on value in column of type list

So I have this case currently of a data.table with one column of type list.
This list can contain different values, NULL among other possible values.
I tried to subset the data.table to keep only rows for which this column has the value NULL.
Behold... my attempts below (for the example I named the column "ColofTypeList"):
DT[is.null(ColofTypeList)]
It returns me an Empty data.table.
Then I tried:
DT[ColofTypeList == NULL]
It returns the following error (I expected an error):
Error in .prepareFastSubset(isub = isub, x = x, enclos = parent.frame(), :
RHS of == is length 0 which is not 1 or nrow (96). For robustness, no recycling is allowed (other than of length 1 RHS). Consider %in% instead.
(Just a precision my original data.table contains 96 rows, which is why the error message say such thing:
which is not 1 or nrow (96).
The number of rows is not the point).
Then I tried this:
DT[ColofTypeList == list(NULL)]
It returns the following error:
Error: comparison of these types is not implemented
I also tried to give a list of the same length than the length of the column, and got this same last error.
So my question is simple: What is the correct data.table way to subset the rows for which elements of this "ColofTypeList" are NULL ?
EDIT: here is a reproducible example
DT<-data.table(Random_stuff=c(1:9),ColofTypeList=rep(list(NULL,"hello",NULL),3))
Have fun!
If it is a list, we can loop through the list and apply the is.null to return a logical vector
DT[unlist(lapply(ColofTypeList, is.null))]
# ColofTypeList anotherCol
#1: 3
Or another option is lengths
DT[lengths(ColofTypeList)==0]
data
DT <- data.table(ColofTypeList = list(0, 1:5, NULL, NA), anotherCol = 1:4)
I have found another way that is also quite nice:
DT[lapply(ColofTypeList, is.null)==TRUE]
It is also important to mention that using isTRUE() doesn't work.

Find elements in one vector and replace by the equivalents in another vector in R

I have 3 columns of names (corresponding to different diatom species). The first column is the current name of my species, the second column is the "old" (i.e. not used any more) name of the species and the third one is the "new" (i.e. after taxonomic update) name of the species.
For each value in the first column I need to find it in the second column and, if found, I need to replace it by the updated name (stored in the third column). So for example, given this matrix:
Column 1 Column 2 Column 3
Achnanthes.atomus Amphora.coffeaeformis Halamphora.coffeaeformis
Achnanthes.biasolettiana Achnanthes.atomus Achnanthidium.atomus
Achnanthes.atomus found in column 1 (first row), should be identified in column 2 (second row) and replaced by its "new name" Achnanthidium.atomus (column 2, second row).
My matrix is called Diatosdef. If I do this, it works:
colnames(Diatosdef) <- gsub("Achnanthes.atomus","Achnanthidium.atomus",colnames(Diatosdef))
But I need to do it species by species, and I have almost 100 species
Can anybody please help me?
Thanks!
P.S: I found that I can do it in Excel with the vlookup function, but I am still looking for a way of doing it in R
fun <- function(x, mat){
if(x %in% mat[,2]) mat[which(mat[,2]==x), 3]
else x
}
Diatosdef[,1] <- sapply(Diatosdef[,1], fun, mat = Diatosdef)
you can try this:
for(i in matrix[:1]){
if(matrix[i:1] == matrix[i:2])
matrix[i:1] = matrix[i:3]
}
maybe the syntax isn't correct.

R select subset of data

I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects those rows based on conditions for specific columns whereas I want any of the three/two columns to fullfill me criteria
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list especially when there is some structure that guides this storage as is the case here. Following #richard-scriven 's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
data[as.integer(names(rowCounts)[rowCounts == i]),]
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", df.oneThree)
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
#eddi below suggests a nice shortcut to my method of pulling out the table names using tabulate and the arr.ind argument of which. When which is applied on a multipdimensional object such as an array or a data.frame, setting arr.ind==TRUE produces indices of the rows and the columns where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the row vector where a 1 is found across all variables. The tabulate function is then applied to these row values and tabulate returns a sorted vector that where each element represents a row and rows without a 1 are filled in with a 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1])
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", df.oneThree)

Guetting a subset in R

I have a dataframe with 14 columns, and I want to subset a dataframe with the same column but keeping only row that repeats (for example, I have an ID variable and if ID = 2 repeated so I subset it).
To begin, I applied a table to my dataframe to see the frequencies of ID
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 repeat two time; so I want to see the two observation for this ID.
Afterward, I did x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the result is false (for x, it's returning me 0 observation of 14 variale and for hsb6 I have only NA in my dataframe).
Can you help me, thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example which works perfectly...
You can use duplicated that is a function giving you an array that is TRUE in case the record is duplicated.
isDuplicated <- duplicated(call.dat$IMSI)
Then, you can extract all the rows containing a duplicated value.
call.dat.duplicated <- all.dat[isDuplicated, ]

What is the difference between colnames(x[1])<-"name" and colnames(x)[1]<-"name"?

I want to rename column names in data.frame,
> x=data.frame(name=c("n1","n2"),sex=c("F","M"))
> colnames(x[1])="Name"
> x
name sex
1 n1 F
2 n2 M
> colnames(x)[1]="Name"
> x
Name sex
1 n1 F
2 n2 M
>
Why does colnames(x[1]) = "Name" not work, while colnames(x)[1]="Name" does?
What is the reason? What is the difference betweent them?
The too much information answer:
If you look at what each of the options "de-sugars" to:
# 1.
`[<-`(x, 1, value=`colnames<-`(x[1], 'Name'))
# 2.
`colnames<-`(x, `[<-`(colnames(x), 1, 'Name'))
The first option makes a new data.frame from just the first column, renames that column (successfully), and then tries to assign that data.frame back over the first column. [<-.data.frame will propagate the values, however will not rename existing columns based on the names of value.
The second option gets the colnames of the data.frame, updates the first value, and creates a new data.frame with the updated names.
(Answer to #Peng Peng's question here because I can't figure out how to get backtick quoting to work in a comment...)
The backtick is to quote the variable name. Consider the difference here:
x<-1
`x<-`<-1
The first assigns 1 to a variable called x, but the second assigns to a variable called x<-. These unusal variable names are actually used by the <- primitive function - you are allowed arbitrary function calls on the lhs of an assignment, and a function with <- appended to the name specifies how to perform the update (similar to setf in lisp).
Because you want to modify the column names attribute of x, a data.frame. Hence
colnames(x) <- ....
is correct, whether or not you assign one or more at the same time.

Resources