R: match() only returns first occurrence

I have a dataframe
names2 <- c('AdagioBarber','AdagioBarber', 'Beethovan','Beethovan')
Value <- c(33,55,21,54)
song.data <- data.frame(names2,Value)
I would like to arrange it according to this character vector
names <- c('Beethovan','Beethovan','AdagioBarber','AdagioBarber')
I am using match() to achieve this
data.frame(song.data[match((names), (song.data$names2)),])
The problem is that match() returns only the first occurrence of each name:
names2 Value
3 Beethovan 21
3.1 Beethovan 21
1 AdagioBarber 33
1.1 AdagioBarber 33

You can use order, as @zx8754 and @Evan Friedland have pointed out.
> name.order <- c('Beethovan','AdagioBarber')
> song.data$names2 <- factor(song.data$names2, levels= name.order)
> song.data[order(song.data$names2), ]
names2 Value
3 Beethovan 21
4 Beethovan 54
1 AdagioBarber 33
2 AdagioBarber 55
Basically, factor turns the strings into integers and creates a lookup table of what integers correspond to what strings. The levels argument specifies what you want that lookup table to be. Without that argument, the levels are the sorted unique values (alphabetical order for strings), not the order of appearance.
So for example:
> as.numeric(factor(letters[1:5]))
[1] 1 2 3 4 5
> as.numeric(factor(letters[1:5], levels=c("d","b","e","a","c")))
[1] 4 2 5 1 3
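As a quick check of how factor() picks its levels, here is a small sketch. The vector x is made up for illustration and is not part of the question. It shows the default (sorted) levels and what happens to a value that is left out of an explicit levels vector:

```r
# Hypothetical input, only for illustration
x <- c("b", "a", "c", "a")

# With no levels argument, the levels are the sorted unique values
default_levels <- levels(factor(x))

# "a" is deliberately left out of the levels, so it becomes NA
explicit <- factor(x, levels = c("c", "b"))
codes <- as.integer(explicit)

# order() sorts by the integer codes and puts NA entries last
ord <- order(explicit)

print(default_levels)
print(codes)
print(ord)
```

So a value missing from levels is not an error; it silently turns into NA and ends up at the bottom after ordering.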
Note: You'll need to be absolutely sure you get all your (correctly spelled) levels in that name.order vector; otherwise the unmatched values become NA in the factor, and order puts those rows at the end.
(sort does work on the factor itself, but it cannot reorder the rows of a data frame, which is why order is needed here.)
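If you would rather not convert the column to a factor, a possible alternative is to combine match and order. This is only a sketch: target_names is a name I made up for the desired ordering vector, and it assumes every name in the data also appears in that vector.

```r
song.data <- data.frame(
  names2 = c("AdagioBarber", "AdagioBarber", "Beethovan", "Beethovan"),
  Value  = c(33, 55, 21, 54)
)

# Desired order (called `names` in the question; renamed here for clarity)
target_names <- c("Beethovan", "Beethovan", "AdagioBarber", "AdagioBarber")

# Position of each row's name in the de-duplicated target order
rank_in_target <- match(song.data$names2, unique(target_names))

# order() keeps tied rows in their original relative order,
# so every duplicate row survives
arranged <- song.data[order(rank_in_target), ]
print(arranged)
```

Here match is applied in the other direction (data against target), so returning only the first occurrence is exactly what is wanted.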

Related

Subset a dataframe using a logical vector with $

I'm having trouble understanding both the reason for use and behavior of the $ symbol in subsetting a data.frame in R. The following example was presented in a beginner's class I'm taking (not with a live professor so can't ask there):
temp_mat <- matrix(1:9, nrow=3)
colnames(temp_mat) <- c('a', 'b', 'c')
temp_df <- data.frame(temp_mat)
Calling temp_df obviously outputs:
a b c
1 1 4 7
2 2 5 8
3 3 6 9
The example given in the course is then:
temp_df[temp_df$c < 10]
Which outputs:
a b c
1 1 4 7
2 2 5 8
3 3 6 9
Reason for use question: The course indicates that $ is used for partial matching, and that x$y is an exact substitute for x[["y", exact=FALSE]]. Why would we want to use a partial matching operator here? Do we use it because we know for sure that in our temp_df there is no other column similar to "c" that could be mistakenly picked up? Additionally, how is partial match measured? A minimum % of characters matching or something? It appears there is a getElement function that would be much more appropriate if working with datasets with unknown or similar column names (e.g. Home Phone versus Cell Phone, would these be seen as a valid partial match?)
Behavior question: it appears the above example temp_df[temp_df$c < 10] is saying "return the subset of elements from temp_df where column c is less than 10" and because all column c elements meet the criteria, the entire dataframe is returned. My interpretation is obviously wrong because temp_df[temp_df$c < 9] returns:
a b
1 1 4
2 2 5
3 3 6
Although the row 1 and 2 elements in column c do meet the criteria of being less than 9, the entire column is omitted. My question then becomes twofold: what is that logical vector actually saying/doing? And how would I write my interpretation of "return the subset of elements from temp_df where column c is less than 9" and have it return:
a b c
1 1 4 7
2 2 5 8
Because in my mind, elements 1 and 2 (rows 1 and 2) met that criteria as their column c values are less than 9 and thus should be returned.
Try breaking down the operation in steps.
temp_df$c < 9
gives a vector as follows:
[1] TRUE TRUE FALSE
When you pass this vector in the manner you have shown:
temp_df[c(TRUE, TRUE, FALSE)] has the effect of operating on columns.
Think about a data.frame as a list, with column names as the keys and the column contents as vector values. The operation preserves the TRUE keys (i.e. columns) and drops the FALSE.
Adding a comma after the vector marks it as a row index instead. The first two rows are retained and the last one is dropped. Thus, temp_df[c(TRUE, TRUE, FALSE), ] gives:
a b c
1 1 4 7
2 2 5 8
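To see both behaviours side by side, here is a small self-contained sketch using the data frame from the question. The names by_column and by_row are mine:

```r
temp_mat <- matrix(1:9, nrow = 3)
colnames(temp_mat) <- c("a", "b", "c")
temp_df <- data.frame(temp_mat)

keep <- temp_df$c < 9        # TRUE TRUE FALSE

# No comma: the logical vector selects columns (list-style indexing)
by_column <- temp_df[keep]

# With a comma: the logical vector selects rows
by_row <- temp_df[keep, ]

print(by_column)
print(by_row)
```

subset(temp_df, c < 9) should give the same rows as the comma version, if you prefer that style.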
Both $ and [[ are extraction operators, which allow elements to be extracted by name.
The OP also asked about the behavior of the exact argument. The exact argument of the [[ operator is documented in R's help (see ?Extract) as:
Controls possible partial matching of [[ when extracting by a
character vector (for most objects, but see under ‘Environments’). The
default is no partial matching. Value NA allows partial matching but
issues a warning when it occurs. Value FALSE allows partial matching
without any warning.
What does it mean? To understand its behavior, let's change the column names of the data.frame used by the OP:
names(temp_df) <- c("aa","bb","cc")
#partial name of column will work with exact = FALSE
temp_df[["a", exact = FALSE]]
#[1] 1 2 3
#partial name of column will not work with exact = TRUE
temp_df[["a", exact = TRUE]]
#NULL
temp_df[["a", exact = NA]]
#[1] 1 2 3
#Warning message:
#In .subset2(x, i, exact = exact) : partial match of 'a' to 'aa'
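On the question of how a partial match is decided: as far as I know it is a prefix match that must be unique, not a percentage of matching characters. Here is a small sketch; the data frame contacts and its column names are made up for illustration:

```r
# Hypothetical columns, only for illustration
contacts <- data.frame(home_phone = 1:2, cell_phone = 3:4, cell_plan = 5:6)

unique_prefix <- contacts$home    # "home" is a prefix of exactly one column
ambiguous     <- contacts$cell    # prefix of two columns, so no match
not_a_prefix  <- contacts$phone   # "phone" is a suffix, not a prefix

# [[ ]] and getElement() match exactly by default
exact_only  <- contacts[["home"]]
via_element <- getElement(contacts, "home")

print(unique_prefix)
print(is.null(ambiguous))
print(is.null(not_a_prefix))
print(is.null(exact_only))
print(is.null(via_element))
```

So something like Home Phone versus Cell Phone would not be confused with each other, but a short prefix shared by several columns quietly returns NULL, which is a good reason to spell column names out in full.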

Convert all features of an image to a row vector

I currently have:
arr<-array(1:3, c(2,4,6))
dim(arr)
#[1] 2 4 6
mg <-data.frame(arr)
dim(mg)
#[1] 2 24
dim(mg2)
#[1] 2 24
And I want to get a row vector with the result:
1 * 48
I've tried to use:
as.vector(t(mg2))
But the result isn't the 1 * 48 (2*24) row vector I expect.
How can I get the result?
To transform the array into a vector, you can use any of the following:
> c(arr)
> as.vector(arr)
> as.matrix(arr)
> t(as.matrix(arr))
The first two produce a plain vector of length 48 (an R vector has no row or column orientation), while the last two produce matrices of dimension 48*1 and 1*48 respectively.
If you first convert it to a data frame, remember that the first dimension of your array is 2, so the data frame has two rows (and 4*6 = 24 columns). That's why it is giving you 2*24. You can still make a vector from there.
The code as.vector(t(mg)) will give a vector, but the values will be read row by row instead of column by column. Thus for the example above the result will be 1 3 2 1 3 2 ... instead of 1 2 3 1 2 3 .... You can fix this by transposing mg twice, i.e. as.vector(t(t(mg))), or by using c(as.matrix(mg)).
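To check the two reading orders described above, here is a small sketch based on the array from the question. The names row_wise, col_wise and row_vec are mine:

```r
arr <- array(1:3, c(2, 4, 6))
mg  <- data.frame(arr)

# Reads the data frame row by row
row_wise <- as.vector(t(mg))

# Reads the data frame column by column, like c(arr)
col_wise <- c(as.matrix(mg))

# An explicit 1 x 48 "row vector", stored as a one-row matrix
row_vec <- matrix(c(arr), nrow = 1)

print(head(row_wise, 6))
print(head(col_wise, 6))
print(dim(row_vec))
```

Both flattened vectors have 48 elements; only the order of the values differs.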

R enumerate duplicates in a dataframe with unique value

I have a dataframe containing a set of parts and test results. The parts are tested on 3 sites (North, Centre and South). Sometimes those parts are re-tested. I want to eventually create some charts that compare the results from the first time that a part was tested with the second (or third, etc.) time that it was tested, e.g. to look at tester repeatability.
As an example, I've come up with the below code. I've explicitly removed the "Experiment" column from the morley data set, as this is the column I'm effectively trying to recreate. The code works, however it seems that there must be a more elegant way to approach this problem. Any thoughts?
Edit - I realise that the example given was overly simplistic for my actual needs (I was trying to generate a reproducible example as easily as possible).
New example:
part<-as.factor(c("A","A","A","B","B","B","A","A","A","C","C","C"))
site<-as.factor(c("N","C","S","C","N","S","N","C","S","N","S","C"))
result<-c(17,20,25,51,50,49,43,45,47,52,51,56)
data<-data.frame(part,site,result)
data$index<-1
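repeat {
  # Stop once every (part, site, index) combination is unique
  if (!anyDuplicated(data[, c("part", "site", "index")])) break
  # Increment only rows that still collide on all three columns
  # (checking only part and site would never terminate with three or more repeats)
  data$index <- ifelse(duplicated(data[, c("part", "site", "index")]), data$index + 1, data$index)
}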
data
part site result index
1 A N 17 1
2 A C 20 1
3 A S 25 1
4 B C 51 1
5 B N 50 1
6 B S 49 1
7 A N 43 2
8 A C 45 2
9 A S 47 2
10 C N 52 1
11 C S 51 1
12 C C 56 1
Old example:
#Generate a trial data frame from the morley dataset
df<-morley[,c(2,3)]
#Set up an iterative variable
#Create the index column and initialise to 1
df$index<-1
# Loop through the dataframe looking for duplicate pairs of
# Runs and Indices and increment the index if it's a duplicate
repeat {
if(!anyDuplicated(df[,c(1,3)]))
{ break }
df$index<-ifelse(duplicated(df[,c(1,3)]),df$index+1,df$index)
}
# Check - The below vector should all be true
df$index==morley$Expt
We may use diff and cumsum on the 'Run' column to get the expected output. With this method we do not need to create an 'index' column of 1s, but we do assume that 'Run' is ordered as shown in the OP's example, so that a drop in 'Run' marks the start of a new experiment.
indx <- cumsum(c(TRUE,diff(df$Run)<0))
identical(indx, morley$Expt)
#[1] TRUE
Or we can use ave
indx2 <- with(df, ave(Run, Run, FUN=seq_along))
identical(indx2, morley$Expt)
#[1] TRUE
Update
Using the new example
with(data, ave(seq_along(part), part, site, FUN=seq_along))
#[1] 1 1 1 1 1 1 2 2 2 1 1 1
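For completeness, a self-contained sketch of the ave approach on the new example, storing the result as a column. I have called the data frame parts rather than data, and the expected vector is the index column from the question's own output:

```r
parts <- data.frame(
  part   = c("A", "A", "A", "B", "B", "B", "A", "A", "A", "C", "C", "C"),
  site   = c("N", "C", "S", "C", "N", "S", "N", "C", "S", "N", "S", "C"),
  result = c(17, 20, 25, 51, 50, 49, 43, 45, 47, 52, 51, 56)
)

# Within each part/site group, number the rows in order of appearance
parts$index <- ave(seq_len(nrow(parts)), parts$part, parts$site, FUN = seq_along)

print(parts)
```

Unlike the repeat loop, this makes a single pass and does not depend on how many times a part is re-tested.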
Or we can use getanID from library(splitstackshape)
library(splitstackshape)
getanID(data, c('part', 'site'))[]
I think this is a job for make.unique, with some manipulation.
index <- 1L + as.integer(sub("\\d+(\\.)?","",make.unique(as.character(morley$Run))))
index <- ifelse(is.na(index),1L,index)
identical(index,morley$Expt)
[1] TRUE
Details of your actual data.frame may matter. However, a couple of options working with your example:
#this works if each group starts with 1:
df$index<-cumsum(df$Run==1)
#this is maybe more general, with data.table
require(data.table)
dt<-as.data.table(df)
dt[,index:=seq_along(Speed),by=Run]

Obtain indices of a factor of characters with names contained in a character vector

I have a factor of characters, let's say:
A <- factor(c(rep("home", times=5), rep("work", times=3), rep("hobby", times=7), rep("friends", times=10)))
and I would like to get the indices of the characters equal to the ones contained in another vector, say:
B <- c("work", "hobby")
in this case I would like to obtain the vector 6:15.
I tried with which(A==B) but it does not work...
Any idea?
As akrun pointed out, %in% should do the trick: which(A %in% B) gives the output:
[1] 6 7 8 9 10 11 12 13 14 15
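As for why which(A==B) does not work: as far as I understand, == compares element by element and recycles the shorter vector, so each element of A is compared with only one element of B, whereas %in% tests membership in the whole of B. A small sketch (idx and recycled are names I made up):

```r
A <- factor(c(rep("home", times = 5), rep("work", times = 3),
              rep("hobby", times = 7), rep("friends", times = 10)))
B <- c("work", "hobby")

# Membership test: TRUE wherever A is any of the values in B
idx <- which(A %in% B)

# Element-wise comparison: B is recycled as work, hobby, work, hobby, ...
# R warns because length(A) is not a multiple of length(B)
recycled <- suppressWarnings(which(A == B))

print(idx)
print(recycled)
```

The second result picks up only the positions where the recycled value happens to line up, so it misses most of the matches.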

How to unquote string in R to access column in data table

Suppose I have a data.table called mysample. It has multiple columns, two of them being weight and height. I can access the weight column by typing:
mysample[,weight]
But when I try to write mysample[,colnames(mysample)[1]] I cannot see the elements of weight. Is there something wrong with my code?
Please refer to section 1.1 of data.table FAQ: http://cran.r-project.org/web/packages/data.table/vignettes/datatable-faq.pdf
colnames(mysample)[1] evaluates to the character vector "weight", and the second argument, j, of a data.table is an expression that is evaluated within the scope of the data.table. Thus the expression simply returns the string "weight" itself, and you don't see the elements of the weight column. To actually select the weight column, try:
mysample[, colnames(mysample)[1], with = FALSE]
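A few ways to select a column by a name held in a variable, as a sketch. It assumes the data.table package is installed; the values in mysample and the variable first_col are made up for illustration:

```r
library(data.table)

# Hypothetical data, only for illustration
mysample <- data.table(weight = c(60, 72, 81), height = c(160, 175, 182))
first_col <- colnames(mysample)[1]

# A data.table is also a list, so [[ ]] returns the column as a vector
as_vector <- mysample[[first_col]]

# get() looks the column up by name inside j
via_get <- mysample[, get(first_col)]

# with = FALSE treats j as column names and returns a one-column data.table
as_table <- mysample[, first_col, with = FALSE]

print(as_vector)
print(via_get)
print(as_table)
```

The first two give a plain vector, like mysample[, weight]; the last keeps the result as a data.table.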
Your syntax should work for data frames; data.table has its own rules.
df <- data.frame(a=1:3, b=4:6)
df
a b
1 1 4
2 2 5
3 3 6
df[,"a"]
[1] 1 2 3
df$a
[1] 1 2 3
df[,1]
[1] 1 2 3
df[,colnames(df)[1]]
[1] 1 2 3

Resources