Understanding the syntax for Column vs Row indexing in R - r

I'm a bit confused on the filtering scheme on an R data frame.
For example, let's say we have the following data frame titled dframe:
> str(dframe)
'data.frame': 143 obs. of 3 variables:
$ Year : int 1999 2005 2007 2008 2009 2010 2005 2006 2007 2008 ...
$ Name : Factor w/ 18 levels "AADAM","AADEN",..: 1 1 2 2 2 2 3 3 3 3 ...
$ Frequency: int 5 6 10 34 38 12 10 6 10 5 ...
Now if I want to filter dframe where the values of Name is of "AADAM", the proper filter is:
dframe[dframe$Name=="AADAM",]
The part where I'm confused is why the comma doesn't come first. Why isn't it this: dframe[,dframe$Name=="AARUSH"]

UPDATE: You clarified your question is really "Please give examples of what sort of logical expressions are valid for filtering columns?"
I agree with you the syntax appears weird initially, but it has the following logic.
The bottom line is that column-filter expressions are typically less rich and expressive than row-filtering expressions, and in particular you can't chain logical indexing the way you do with rows.
Best way is to think of indexing expressions as the general form:
dframe[<row-index-expression>,<col-index-expression>]
where either index-expression is optional, so you can just do one and we (crucially!) need the comma to disambiguate whether it's row- or column-indexing:
dframe[<row-index-expression>,] # such as dframe[dframe$Name=="ADAM",]
dframe[,<col-index-expression>]
Before we look at examples of col-index-expression and what's valid (and invalid) to include in one, let's review and discuss how R does indexing - I had the same confusion when I started with it.
In this example, you have three columns. You can refer to them by their string names 'Year','Name','Frequency'. You can also refer to them by column indices 1,2,3 where the numbers 1,2,3 correspond to the entries colnames(dframe). R does indexing using the '[' operator, also the '[[' operator. Here are some valid examples of ways to index column-indexing:
dframe[,2] # column 2 / Name
dframe[,'Name'] # column 2 / Name
dframe[,c('Name','Frequency')] # string vector - very common
dframe[,c(2,3)] # integer vector - also very common
dframe[,c(F,T,T)] # logical vector - very rarely seen, and a pain in the butt to compute
Now, if you choose to use a logical expression for the column-index, it must be a valid expression without using column names - inside a column it doesn't know their own names.
Suppose you wanted to dynamically filter "give me only the factor columns from dframe".
Something like:
unlist(apply(dframe[1,1:3], 2, is.factor), use.names=F) # except I can't seem to remove the colnames
For more help and examples on indexing look at the '[' operator help-page:
Type ?'['
dframe[,dframe$Name=="ADAM"] is invalid attempt at column-indexing because the columns know nothing about Name=="ADAM"
Addendum: code to generate example dataframe (because you didn't dump us a dput output)
set.seed(123)
N = 10
randomName <- function() { cat(sample(letters, size=runif(1)*6+2, replace=T), sep='') }
dframe = data.frame(Year=round(runif(N,1980,2014)),
Name = as.factor(replicate(N, randomName())),
Frequency=round(runif(N, 2,40)))

You have to remember that when you're sub-setting, the part before the comma is specifying which rows you want, and the part after the comma is specifying which columns you want. ie:
dframe[rowsyouwant, columnsyouwant]
You're filtering based on columns, but you want all of the columns in your result, so the space after the comma is blank. You want some sub-set of rows, so your filtering specification goes before the comma, where the rows you want are specified.

As others have indicated, requesting a certain subset of a data frame requires the syntax [rows, columns]. Since dframe[has 143 rows, has 3 columns], any request for some part of dframe should be of the form
dframe[which of the 143 rows do I want?, which of the 3 columns do I want?].
Because dframe$Name is a vector of length 143, the comparison dframe$Name=='AADAM' is a vector of T/F values that also has length 143. So,
dframe[dframe$Name=='AADAM',]
is like saying
dframe[of the 143 rows I want these ones, I want all columns]
whereas
dframe[,dframe$Name=='AADAM']
generates an error because it's like saying
dframe[I want all rows, of the 143 columns I want these ones]
On a side note, you may want to look into the subset() function if you're not already familiar with it. You could get the same result by writing subset(dframe, Name=='AADAM')

As others have said, the structure within brackets is row, then column.
One way I think of the syntax of selecting data from a data.frame using:
dframe[dframe$Name=="AADAM",]
is to think of a noun, then a verb where:
dframe[] is the noun. It is the object on which you want to perform an action
and
[dframe$Name=="AADAM",] is the verb. It is the action you want to perform.
I have a silly way of expressing this to myself, but it keeps things straight in my mind:
Hey, you! dframe! I am going to... ...in this case, select all of your rows in which Name is equal to AADAM!
By keeping the column portion of [dframe$Name=="AADAM",] blank you are saying you want to keep all columns.
Sometimes it can be a little difficult to remember that you have to write dframe both inside and outside the brackets.
As for exactly why row comes first and column comes second, I do not know, but row had to be either first or second.
dframe <- read.table(text = '
Year Name Frequency
1 ADAM 4
3 BOB 10
7 SALLY 5
2 ADAM 12
4 JIM 3
12 ADAM 7
', header = TRUE)
dframe[,dframe$Name=="ADAM"]
# Error in `[.data.frame`(dframe, , dframe$Name == "ADAM") :
# undefined columns selected
dframe[dframe$Name=="ADAM",]
# Year Name Frequency
# 1 1 ADAM 4
# 4 2 ADAM 12
# 6 12 ADAM 7
dframe[,'Name']
# [1] ADAM BOB SALLY ADAM JIM ADAM
# Levels: ADAM BOB JIM SALLY
dframe[dframe$Name=="ADAM",'Name']
# [1] ADAM ADAM ADAM
# Levels: ADAM BOB JIM SALLY

Related

R - Update value of a column based on condition

I need to update all the values of a column, using as reference another df.
The two dataframes have equal structures:
cod name dom_by
1 A 3
2 B 4
3 C 1
4 D 2
I tried to use the following line, but apparently it did not work:
df2$name[df2$dom_by==df1$cod] <- df1$name[df2$dom_by==df1$cod]
It keeps saying that replacement has 92 rows, data has 2.
(df1 has 92 rows and df2 has 2).
Although it seems like a simple problem, I still can not solve it, even after some searches.

R subset dataframe column with variable when column name is escaped

I am trying to select a column from a dataframe using a variable as a column name, with the problem that the column name is escaped. I have a couple of workarounds for doing it, which involve changing my code a bit too much, and anyway I've been looking around and I am curious if anybody knew the solution for this kind of weird case.
My dataset is actually a list of time series (which I construct after some operations), this would be a toy example.
df <- list(`01/19/17`=seq(1,10), `01/20/17`=seq(2,11))
> df
$`01/19/17`
[1] 1 2 3 4 5 6 7 8 9 10
$`01/20/17`
[1] 2 3 4 5 6 7 8 9 10 11
I don't put the escapes ` in the column names because I want to, but because they come as dates from the process I follow to construct the dataset.
If I know the column name I can access like this,
df$`01/19/17`
If I want to use a variable, looking around e.g. here I see I could rewrite it to something like this,
`$`(df, `01/19/17`)
But I cannot assign a variable like this,
> name1 <- `01/19/17`
Error: object '01/19/17' not found
and if assign it this other way I get a NULL,
> name1 <- "01/19/17"
> `$`(df, name1)
NULL
As I say there are workarounds like e.g. changing all the column names in the list of series, but I just would like to know. Thank you so much.
You can access with brackets rather than with $, even when the key is a string:
df <- list(`01/19/17`=seq(1,10), `01/20/17`=seq(2,11))
name1 <- "01/19/17"
df[[name1]]
# [1] 1 2 3 4 5 6 7 8 9 10

Exclude rows that contain NA in a particular column in subsets

I am trying exclude rows of a subset which contain an NA for a particular column that I choose. I have a CSV spreadsheet of survey data this kind of organization, for instance:
name idnum term type q2 q3
bob 0321 1 2 0 .
. . 3 1 5 3
ron . 2 4 2 1
. 2561 4 3 4 2
When I was creating my R-workspace, I set it such that data <- read.csv(..., na.strings='.'). For purposes of my analysis, I then created subsets by term and type, like set13 <- subset(data, term=1 & type=2), for example. When I trying to conduct t-tests, I noticed that the function threw out any instance of NA, effectively cutting my sample size in half.
For my analysis, I want to exclude responses that are missing survey items, such as Bob from my example, missing question 3. But I still want to include rows that have one or more NAs in the name or idnum columns. So, in essence, I want to pick by columns which NAs are omitted. (Keep in mind, this is just an example - my actual CSV has about 1000 rows, so each subset may contain 100-150 rows.)
I know this can be done using data frames, but I'm not sure how to incorporate that into my given subset format. Is there a way to do this?
Check out complete.cases as shown in the answer to this SO post.
data[complete.cases(data[,3:6]),]
This will return all rows with complete information in columns 3 through 6.
Another approach.
data[rowSums(is.na(data[,3:6]))==0,]
Another option is
data[!Reduce(`|`, lapply(data[3:6], is.na)),]

Applying a function on each row of a data frame in R

I would like to apply some function on each row of a dataframe in R.
The function can return a single-row dataframe or nothing (I guess 'return ()' return nothing?).
I would like to apply this function on each of the rows of a given dataframe, and get the resulting dataframe (which is possibly shorter, i.e. has less rows, than the original one).
For example, if the original dataframe is something like:
id size name
1 100 dave
2 200 sarah
3 50 ben
And the function I'm using gets a row n the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", otherwise returns null, then the result should be:
id size name
1 100 dave
This example actually refers to filtering a dataframe, and I would love to get both an answer specific to this kind of task but also to a more general case when even the result of the helper function (the one that operates on a single row) may be an arbitrary data frame with a single row. Please note than even in the case of filtering, I would like to use some sophisticated logic (not something simple like $size>100, but a more complex condition that is checked by a function, let's say boo(single_row_df).
P.s.
What I have done so far in these cases is to use apply(df, MARGIN=1) then do.call(rbind ...) but I think it give me some trouble when my dataframe only has a single row (I get Error in do.call(rbind, filterd) : second argument must be a list)
UPDATE
Following Stephen reply I did the following:
ranges.filter <- function(ranges,boo) {
subset(x=ranges,subset=!any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start end
100 200
250 400
698 1520
1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the work, but I get a warning saying numerical expression has 53 elements: only the first used.
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages(plyr)
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
x y z xx
1 1 0.68434946 0.643786918 8
2 2 0.64429292 0.231382912 5
3 3 0.15106083 0.307459540 3
4 4 0.65725669 0.553340712 5
5 5 0.02981373 0.736611949 4
6 6 0.83895251 0.845043443 4
7 7 0.22788855 0.606439470 4
8 8 0.88663285 0.048965094 9
9 9 0.44768780 0.009275935 9
10 10 0.23954606 0.356021488 4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
xx mean sd
1 3 3.0 NA
2 4 7.0 2.1602469
3 5 3.0 1.4142136
4 8 1.0 NA
5 9 8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
id size name
1 1 100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
id size name
1 1 100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
id size name
1 1 200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
id size name
1 1 200 dave
UPDATE:
To select rows that do not fall between start and end, you might construct a different function (note: when summing result of boolean/logical vectors, TRUE values are converted to 1s and FALSE values are converted to 0s)
test <- function(x)
rowSums(mapply(function(start,end,x) x >= start & x <= end,
start=c(100,250,698,1988),
end=c(200,400,1520,2147))) == 0
subset(dfr,test(size))
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument evaluates to a logical expression that determines which rows are kept. You can make this expression use values from as many columns as you want, eg grepl("ave",name) & size>50

count of entries in data frame in R

I'm looking to get a count for the following data frame:
> Santa
Believe Age Gender Presents Behaviour
1 FALSE 9 male 25 naughty
2 TRUE 5 male 20 nice
3 TRUE 4 female 30 nice
4 TRUE 4 male 34 naughty
of the number of children who believe. What command would I use to get this?
(The actual data frame is much bigger. I've just given you the first four rows...)
Thanks!
You could use table:
R> x <- read.table(textConnection('
Believe Age Gender Presents Behaviour
1 FALSE 9 male 25 naughty
2 TRUE 5 male 20 nice
3 TRUE 4 female 30 nice
4 TRUE 4 male 34 naughty'
), header=TRUE)
R> table(x$Believe)
FALSE TRUE
1 3
I think of this as a two-step process:
subset the original data frame according to the filter supplied
(Believe==FALSE); then
get the row count of this subset
For the first step, the subset function is a good way to do this (just an alternative to ordinary index or bracket notation).
For the second step, i would use dim or nrow
One advantage of using subset: you don't have to parse the result it returns to get the result you need--just call nrow on it directly.
so in your case:
v = nrow(subset(Santa, Believe==FALSE)) # 'subset' returns a data.frame
or wrapped in an anonymous function:
>> fnx = function(fac, lev){nrow(subset(Santa, fac==lev))}
>> fnx(Believe, TRUE)
3
Aside from nrow, dim will also do the job. This function returns the dimensions of a data frame (rows, cols) so you just need to supply the appropriate index to access the number of rows:
v = dim(subset(Santa, Believe==FALSE))[1]
An answer to the OP posted before this one shows the use of a contingency table. I don't like that approach for the general problem as recited in the OP. Here's the reason. Granted, the general problem of how many rows in this data frame have value x in column C? can be answered using a contingency table as well as using a "filtering" scheme (as in my answer here). If you want row counts for all values for a given factor variable (column) then a contingency table (via calling table and passing in the column(s) of interest) is the most sensible solution; however, the OP asks for the count of a particular value in a factor variable, not counts across all values. Aside from the performance hit (might be big, might be trivial, just depends on the size of the data frame and the processing pipeline context in which this function resides). And of course once the result from the call to table is returned, you still have to parse from that result just the count that you want.
So that's why, to me, this is a filtering rather than a cross-tab problem.
sum(Santa$Believe)
You can do summary(santa$Believe) and you will get the count for TRUE and FALSE
DPLYR makes this really easy.
x<-santa%>%
count(Believe)
If you wanted to count by a group; for instance, how many males v females believe, just add a group_by:
x<-santa%>%
group_by(Gender)%>%
count(Believe)
A one-line solution with data.table could be
library(data.table)
setDT(x)[,.N,by=Believe]
Believe N
1: FALSE 1
2: TRUE 3
using sqldf fits here:
library(sqldf)
sqldf("SELECT Believe, Count(1) as N FROM Santa
GROUP BY Believe")

Resources