Subset a dataframe using a logical vector with $ - r

I'm having trouble understanding both the reason for use and behavior of the $ symbol in subsetting a data.frame in R. The following example was presented in a beginner's class I'm taking (not with a live professor so can't ask there):
temp_mat <- matrix(1:9, nrow=3)
colnames(temp_mat) <- c('a', 'b', 'c')
temp_df <- data.frame(temp_mat)
Calling temp_df obviously outputs:
a b c
1 1 4 7
2 2 5 8
3 3 6 9
The example given in the course is then:
temp_df[temp_df$c < 10]
Which outputs:
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Reason for use question: The course indicates that $ is used for partial matching, and that x$y is an exact substitute for x[["y", exact=FALSE]]. Why would we want to use a partial matching operator here? Do we use it because we know for sure that in our temp_df there is no other column similar to "c" that could be mistakenly picked up? Additionally, how is partial match measured? A minimum % of characters matching or something? It appears there is a getElement function that would be much more appropriate if working with datasets with unknown or similar column names (e.g. Home Phone versus Cell Phone, would these be seen as a valid partial match?)
Behavior question: it appears the above example temp_df[temp_df$c < 10] is saying "return the subset of elements from temp_df where column c is less than 10" and because all column c elements meet the criteria, the entire dataframe is returned. My interpretation is obviously wrong because temp_df[temp_df$c < 9] returns:
a b
1 1 4
2 2 5
3 3 6
Although the row 1 and 2 elements in column c do meet the criteria of being less than 9, the entire column is omitted. My question then becomes twofold: what is that logical vector actually saying/doing? And how would I write my interpretation of "return the subset of elements from temp_df where column c is less than 9" and have it return:
a b c
1 1 4 7
2 2 5 8
Because in my mind, elements 1 and 2 (rows 1 and 2) met that criteria as their column c values are less than 9 and thus should be returned.

Try breaking down the operation in steps.
temp_df$c < 9
gives a vector as follows:
[1] TRUE TRUE FALSE
When you pass this vector as a single index, as you have shown, temp_df[c(TRUE, TRUE, FALSE)] operates on columns.
Think about a data.frame as a list, with column names as the keys and the column contents as vector values. The operation keeps the columns whose position is TRUE (a and b) and drops the one whose position is FALSE (c). With temp_df$c < 10 all three values are TRUE, so every column is kept.
To filter rows instead, add a comma after the vector: the comma marks it as the row index. The first two rows are retained and the last one is dropped. Thus, temp_df[c(TRUE, TRUE, FALSE), ] gives:
a b c
1 1 4 7
2 2 5 8
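The difference between the two forms can be checked directly. The sketch below rebuilds temp_df from scratch and compares column indexing with row indexing; the names keep, cols, rows and rows2 are introduced here only for illustration, and subset() is shown as an alternative base R spelling of the row filter.

```r
temp_df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# logical vector with one entry per row of temp_df
keep <- temp_df$c < 9

# no comma: the vector is matched against columns (list elements)
cols <- temp_df[keep]

# with a comma: the vector is matched against rows
rows <- temp_df[keep, ]

# subset() evaluates the condition inside the data frame and filters rows
rows2 <- subset(temp_df, c < 9)

print(cols)
print(rows)
```

It is only a coincidence that the data frame has as many columns as rows; with a different shape the comma-less form would recycle or reject the vector rather than filter rows.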

Both $ and [[ are extract operators that allow you to extract elements by name.
The OP asked about the behavior of the exact argument. The exact argument of the [[ operator is documented in R's help page (?Extract) as:
Controls possible partial matching of [[ when extracting by a
character vector (for most objects, but see under ‘Environments’). The
default is no partial matching. Value NA allows partial matching but
issues a warning when it occurs. Value FALSE allows partial matching
without any warning.
What does it mean? To understand its behavior, let's change the column names of the data.frame used by the OP:
names(temp_df) <- c("aa","bb","cc")
#partial name of column will work with exact = FALSE
temp_df[["a", exact = FALSE]]
#[1] 1 2 3
#partial name of column will not work with exact = TRUE
temp_df[["a", exact = TRUE]]
#NULL
temp_df[["a", exact = NA]]
#[1] 1 2 3
#Warning message:
#In .subset2(x, i, exact = exact) : partial match of 'a' to 'aa'
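On the question of how a partial match is measured: it is a prefix match that must be unique, not a similarity score. The sketch below uses a made-up contacts data frame (its name and columns are assumptions for illustration, not from the question) to show which abbreviations $ resolves.

```r
# hypothetical data frame with two columns sharing the prefix "home"
contacts <- data.frame(
  home_phone = c(111, 222),
  cell_phone = c(333, 444),
  home_addr  = c(1, 2)
)

# "cell" is a prefix of exactly one column name, so that column is returned
cell_by_dollar <- contacts$cell

# "home" is a prefix of two column names: the match is ambiguous, result is NULL
home_by_dollar <- contacts$home

# "phone" is not a prefix of any column name, so nothing matches
phone_by_dollar <- contacts$phone

# [[ is exact by default, so the abbreviation is not resolved
cell_exact <- contacts[["cell"]]

# exact = FALSE opts in to the same prefix matching that $ uses
cell_partial <- contacts[["cell", exact = FALSE]]
```

Under these rules "Home Phone" and "Cell Phone" would never be confused with each other, since neither name is a prefix of the other. Writing the full name with [[ remains the safer habit when column names are not known in advance.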

Related

Function to recode multiple variables conditional on other variables

I have a dataset with multiple variables. Each question has the actual survey answer and three other characteristics. So there are four variables for each question. I want to specify if Q135_L ==1 , leave Q135_RT as it is, otherwise code it as NA. I can do that with an ifelse statement.
df$Q135_RT <- ifelse(df$Q135_L == 1, df$Q135_RT, NA)
However, I have hundreds of variables and the names are not related. For example, in the picture we can see Q135, SG1_1 and so on. How can I specify for the whole dataset that if a variable ending in _L equals 1, then the same variable ending in _RT should remain as it is, and otherwise the variable ending in _RT should be coded as NA?
I tried this but it only returns NAs
ifelse(grepl("//b_L" ==1, df), "//b_RT" , NA)
If I understand your problem correctly, you have a data frame of which the columns represent survey question variables. Each column contains two identifiers, namely: a survey question number (134, 135, etc) and a variable letter (L, R, etc). Because you provide no reproducible example, I tried to make a simplified example of your data frame:
set.seed(5)
DF <- data.frame(array(sample(1:4, 24, replace = TRUE), c(4,6)))
colnames(DF) <- c("Q134_L","Q135_L", "Q134_RT", "Q135_RT", "Q_L1", "Q134_S")
DF
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 2 3 1 1
# 2 3 1 3 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 3 3 2 1
What you want is that if Q135_L == 1, leave Q135_RT as it is, otherwise code it as NA. Here is a function that implements this recoding logic:
recode <- function(yourdf, questnums) {
  charnum <- as.character(questnums)
  for (k in seq_along(charnum)) {
    # columns for question k whose names end in _L and in _RT
    col_end_L_k <- yourdf[grepl("_L\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    col_end_R_k <- yourdf[grepl("_RT\\b", colnames(yourdf)) &
                            grepl(charnum[k], colnames(yourdf))]
    # rows where the _L column equals 1; every other row is set to NA
    row_is_1 <- which(col_end_L_k == 1)
    row_not_1 <- setdiff(seq_len(nrow(yourdf)), row_is_1)
    col_end_R_k[row_not_1, ] <- NA
    yourdf[, colnames(col_end_R_k)] <- col_end_R_k
  }
  return(yourdf)
}
This function takes a data frame and a vector of question numbers, and then returns the data frame that has been recoded.
What this function does:
It loops over the question numbers with for.
It uses grepl to identify any column whose name contains the selected number and ends in _L.
It does the same for column names ending in _RT.
It uses which to find the rows in which the _L column equals 1.
It keeps the values of the _RT column with the same question number in those rows and changes the values in all other rows to NA.
The result:
recode(DF, 134:135)
# Q134_L Q135_L Q134_RT Q135_RT Q_L1 Q134_S
# 1 2 3 NA NA 1 1
# 2 3 1 NA 2 4 4
# 3 1 1 3 2 4 3
# 4 3 1 NA 3 2 1
Note that the Q_L1 column is not affected because _L in this column is not located at the end of the column name.
As for how to define questnums, the question numbers, you just need to create a numeric vector. Examples:
If your questnums are 1 to 200, use 1:200 or seq(200), so recode(DF, 1:200).
If your questnums are 1, 3, 134, 135, use recode(DF, c(1, 3, 134, 135)).
You can also assign the question numbers to an object first, such as n = c(25, 135, 145), and then use it: recode(DF, n)
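Since the question says the variable names are unrelated (Q135, SG1_1 and so on), another option is to pair columns by their suffix rather than by question number. The following is only a sketch under that assumption: the helper recode_by_suffix and the toy column names are hypothetical, and it presumes every _RT column shares its exact prefix with one _L column.

```r
# Hypothetical helper (not part of the answer above): for every column
# ending in _L, blank out the matching _RT column where _L is not 1.
recode_by_suffix <- function(yourdf) {
  l_cols <- grep("_L$", colnames(yourdf), value = TRUE)
  for (l_col in l_cols) {
    # derive the partner name, e.g. "SG1_1_L" -> "SG1_1_RT"
    rt_col <- sub("_L$", "_RT", l_col)
    if (rt_col %in% colnames(yourdf)) {
      # rows to keep: _L is present and equal to 1
      keep <- !is.na(yourdf[[l_col]]) & yourdf[[l_col]] == 1
      yourdf[[rt_col]][!keep] <- NA
    }
  }
  yourdf
}

# toy data with made-up values
toy <- data.frame(
  Q134_L   = c(2, 3, 1, 3),
  Q134_RT  = c(2, 3, 3, 3),
  SG1_1_L  = c(1, 1, 2, 1),
  SG1_1_RT = c(5, 6, 7, 8)
)
res <- recode_by_suffix(toy)
print(res)
```

Matching on the full prefix avoids having to list question numbers and means a number such as 1 cannot accidentally match Q134.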

Sort function in R when index.return=TRUE

I have the following vector in R:
> A<-c(8.1915935, 3.0138083, 0.3245712, 10.7353747, 13.7505131 ,63.2337407, 16.7505131, 5.7781297)
I want to sort it and, at the same time, know each element's position in the sorted vector. So I use the following function:
sort(A, index.return=T)
And I get the following output, which I don't clearly understand:
$x
[1] 0.3245712 3.0138083 5.7781297 8.1915935 10.7353747 13.7505131 16.7505131 63.2337407
$ix
[1] 3 2 8 1 4 5 7 6
Looking at the original vector A, the first element goes in the 4th position of the sorted vector. So the first element of "$ix" should be 4. Why is it 3?
Then, the biggest number of the vector is the 6th element of A. But the 6th element of $ix is not 8 (the length of the vector), as I expected, but 6. Why?
And so on, for all the elements. Clearly, there is something I don't understand about this output.
$ix gives, for each element of x, its position in the original vector; you were hoping for the reverse, that is, for each element of the original vector, its position in x. This is the difference between order() and rank():
> order(A)
[1] 3 2 8 1 4 5 7 6
> rank(A)
[1] 4 2 1 5 6 8 7 3
Note that order(order(A)) == rank(A), so one way to get the answer you're looking for is
result <- sort(A, index.return = TRUE)
order(result$ix)
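The two directions can be verified on the vector from the question. In this sketch the name pos is introduced for illustration; the checks rely on A having no tied values, since rank() averages ties by default.

```r
A <- c(8.1915935, 3.0138083, 0.3245712, 10.7353747,
       13.7505131, 63.2337407, 16.7505131, 5.7781297)

result <- sort(A, index.return = TRUE)

# ix maps sorted positions back to original positions:
# indexing A by ix reproduces the sorted vector
sorted_again <- A[result$ix]

# inverting the permutation gives, for each element of A,
# its position in the sorted vector
pos <- order(result$ix)

# indexing the sorted vector by pos reproduces A
original_again <- result$x[pos]

print(pos)
```

Here pos[1] is 4 and pos[6] is 8, which are the values the question expected to see.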

Creating tibble returns error due to name

I have 2 vectors. I am trying to create a tibble with all combinations of the 2 vectors, but I get the following error.
C <- c(1,2,3,4)
G <- c(1,2,3,4,5)
tibble('C' = rep(C, each = length(G)), 'G' = rep(G, length(C)))
Error: Column `C` must be length 1 or 100, not 20
Error disappears when I rename column 'C' to column 'A' for example.
We also don't get the same error with a data.frame
I suspect length(C) takes 'C' value from the tibble.
Is this an intended behaviour?
If so can someone explain how this is useful in practice? (i.e how would someone take advantage of this in their code)
Because tibbles are an extension to data.frame, and not an exact drop-in replacement, you can do things like:
tibble(a=1:3, b=a+1)
## A tibble: 3 x 2
# a b
# <int> <dbl>
#1 1 2
#2 2 3
#3 3 4
...where you can reference earlier created columns. And your example is an instance of when that might be a problem.
To quote the manual:
"Arguments are evaluated sequentially, so you can refer to previously
created variables."
Source: http://tibble.tidyverse.org/reference/tibble.html
So in this case, the C in rep(G, length(C)) is actually referencing the tibblename$C you just created, which is length 20, rather than the vector C in the global environment, which is length 4.
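One way to sidestep the clash is to compute the lengths before the call, so that no column expression has to look up C after a column of that name exists. This is a sketch rather than the only fix: the names n_C, n_G, combos and combos_tbl are introduced here, and the tibble call is guarded so the example still runs where the tibble package is not installed.

```r
C <- c(1, 2, 3, 4)
G <- c(1, 2, 3, 4, 5)

# lengths taken from the vectors in the calling environment,
# before any column named C or G has been created
n_C <- length(C)
n_G <- length(G)

# base R version: data.frame() does not evaluate arguments sequentially
combos <- data.frame(C = rep(C, each = n_G), G = rep(G, times = n_C))

# the same column expressions passed to tibble(), if it is available
if (requireNamespace("tibble", quietly = TRUE)) {
  combos_tbl <- tibble::tibble(C = rep(C, each = n_G),
                               G = rep(G, times = n_C))
  print(dim(combos_tbl))
}

print(dim(combos))
```

Renaming the columns or the vectors so they differ, as the question observed, has the same effect.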

Count number of short strings in a long string in R [duplicate]

Suppose I have a long string such as:
c<-"abcabcdabcdeabcdefghijkabcdabcaba"
My question is how to quickly count the number of exact "abcd" in c.
1) gregexpr First paste "abcd" onto c so that there is at least 1 match. (This is needed because gregexpr returns -1 for any component of c having no matches rather than a zero length numeric vector.) Now, gregexpr returns a list whose components are numeric vectors of the starting positions of the matches one component per component of c -- in this case c only has one component but the code below works more generally. Now find the lengths of the components of the result of gregexpr and subtract 1 to take into account the extra abcd we added. No packages are used.
Example 1
lengths(gregexpr("abcd", paste(c, "abcd"))) - 1
## [1] 4
Note: If we knew that there was at least one match it could be slightly simplified to: lengths(gregexpr("abcd", c)) .
Example 2
Here is another example. Here DF has 3 rows and the corresponding components of c have 4, 4, and 0 occurrences of "abcd".
DF <- data.frame(c = c(c, c, "X")) # test input
lengths(gregexpr("abcd", paste(DF$c, "abcd"))) - 1
## [1] 4 4 0
2) regmatches
Here is an alternative approach. This approach has the advantage that no special code is needed for the no-match case. Again, no packages are used.
Here are the same two examples:
lengths(regmatches(c, gregexpr("abcd", c)))
## [1] 4
lengths(regmatches(DF$c, gregexpr("abcd", DF$c)))
## [1] 4 4 0
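If the -1 returned for a non-match is the only obstacle, it can also be handled by counting positive positions. The helper below is a hypothetical wrapper (count_matches is not a base R function) and assumes the pattern is a literal string, hence fixed = TRUE; the variable txt stands in for c to avoid masking the c() function.

```r
# Hypothetical helper: count non-overlapping literal occurrences of
# `pattern` in each element of the character vector `x`.
count_matches <- function(x, pattern) {
  m <- gregexpr(pattern, x, fixed = TRUE)
  # gregexpr() reports -1 when there is no match, so count positions > 0
  vapply(m, function(pos) sum(pos > 0), integer(1))
}

txt <- "abcabcdabcdeabcdefghijkabcdabcaba"
counts <- count_matches(c(txt, "X", "abcd"), "abcd")
print(counts)
```

Like the approaches above, this uses no packages and works on a data frame column as well as on a single string.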
Using library stringr, you can do it as follows (on larger set, it will be fairly fast and efficient):
library(stringr)
c <- "abcabcdabcdeabcdefghijkabcdabcaba"
c
[1] "abcabcdabcdeabcdefghijkabcdabcaba"
str_count(c, 'abcd')
[1] 4
This will work on a column of a data frame as follows:
df <- data.frame(txt = rep(c, 10))
df$abcd_count <- str_count(df$txt, 'abcd')
df
txt abcd_count
1 abcabcdabcdeabcdefghijkabcdabcaba 4
2 abcabcdabcdeabcdefghijkabcdabcaba 4
3 abcabcdabcdeabcdefghijkabcdabcaba 4
4 abcabcdabcdeabcdefghijkabcdabcaba 4
5 abcabcdabcdeabcdefghijkabcdabcaba 4
6 abcabcdabcdeabcdefghijkabcdabcaba 4
7 abcabcdabcdeabcdefghijkabcdabcaba 4
8 abcabcdabcdeabcdefghijkabcdabcaba 4
9 abcabcdabcdeabcdefghijkabcdabcaba 4
10 abcabcdabcdeabcdefghijkabcdabcaba 4
Here is one method using base R's gsub and strsplit:
# example
temp <- "abcabcdabcdeabcdefghijkabcdabcaba"
# substitute pattern for character not in string, here 9
temp2 <- gsub("abcd", "9", temp)
# split on 9, and count number of elements
length(strsplit(temp2, split="9")[[1]]) - 1
You need the [[1]] because strsplit is designed to operate over vectors of strings; here the vector is of length 1. An alternative to [[1]] in this case is unlist.
Also, 1 is subtracted because splitting produces one more element than the number of abcd patterns. Note that the count comes out one too low when the string ends with the pattern, because strsplit drops a trailing empty string.

How to unquote string in R to access column in data table

Suppose I have a data.table called mysample. It has multiple columns, two of them being weight and height. I can access the weight column by typing:
mysample[,weight]
But when I try to write mysample[,colnames(mysample)[1]] I cannot see the elements of weight. Is there something wrong with my code?
Please refer to section 1.1 of data.table FAQ: http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf
colnames(mysample)[1] evaluates to the character vector "weight", and the second argument, j, of a data.table is an expression that is evaluated within the scope of the data.table. Thus, "weight" evaluates to the character vector "weight" itself, and you don't see the elements of the weight column. To actually subset the weight column, try:
mysample[,colnames(mysample)[1], with = F]
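If a plain vector is wanted rather than a one-column table, [[ with the column name is another option, since it treats the table as a list of columns. The sketch below uses a made-up mysample (the values are invented) and guards the data.table part so it still runs where the package is not installed.

```r
# made-up data standing in for mysample
mysample <- data.frame(weight = c(60, 72, 85), height = c(160, 175, 182))

# column name held in a variable
first_col <- colnames(mysample)[1]

# [[ extracts the column as a vector from any list-like table
w <- mysample[[first_col]]

if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::as.data.table(mysample)
  # same extraction on a data.table: a plain numeric vector
  w_dt <- dt[[first_col]]
  # with = FALSE treats j as column names and returns a one-column data.table
  w_tbl <- dt[, first_col, with = FALSE]
  print(w_tbl)
}

print(w)
```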
Your syntax should work for data frames. data.table has its own rules.
df <- data.frame(a=1:3, b=4:6)
df
a b
1 1 4
2 2 5
3 3 6
df[,"a"]
[1] 1 2 3
df$a
[1] 1 2 3
df[,1]
[1] 1 2 3
df[,colnames(df)[1]]
[1] 1 2 3
