Subset rows based on "start and stop" strings - r

I'm looking to write an R script that will search a column for a specific value and begin subsetting rows until a specific text value is reached.
Example:
X1 X2
[1,] "a" "1"
[2,] "b" "2"
[3,] "c" "3"
[4,] "d" "4"
[5,] "e" "5"
[6,] "f" "6"
[7,] "c" "7"
[8,] "k" "8"
What I'd like to do is search through X1 until the letter 'c' is found, and begin to subset rows until another letter 'c' is found, at which point the subset procedure would stop. Using the above example, the result should be a vector containing c(3,4,5,6,7).
Assume there will be no more than 2 rows where X1 equals 'c'
Any help is greatly appreciated.

You can look up where a value occurs with the function which, and use the result as an index to get the values you are looking for. If you want everything from the first to the second "c", it would look like this:
indices <- which(df$X1=='c')
range <- indices[1]:indices[2]
df$X2[range]
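As a self-contained sketch of the same idea: the data frame below is a hypothetical reconstruction of the example in the question, and the second half assumes the data are a character matrix (as the printed output suggests), where `$` is not available and columns have to be indexed by name.

```r
# Hypothetical data frame mirroring the example in the question
df <- data.frame(X1 = c("a", "b", "c", "d", "e", "f", "c", "k"),
                 X2 = 1:8,
                 stringsAsFactors = FALSE)

# Row positions where X1 equals "c"
indices <- which(df$X1 == "c")

# All rows from the first match through the second, inclusive
rows <- seq(indices[1], indices[2])
result <- df$X2[rows]

# Same idea for a character matrix: index the columns by name
m <- cbind(X1 = df$X1, X2 = as.character(df$X2))
idx <- which(m[, "X1"] == "c")
result_m <- as.numeric(m[idx[1]:idx[2], "X2"])
```

Both `result` and `result_m` hold the values 3 through 7. Avoiding `range` as a variable name keeps the base function of the same name unmasked.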

Related

R subset string values including vertical bar(|)

I am trying to subset data based on a column value: I want to keep only the rows where that column holds a single level of information. Here is what my data look like.
data <- cbind(v1=c("a", "ab", "a|12|bc", "a|b", "ac","bc|2","b|bc|12"),
v2=c(1,2,3,5,3,1,2))
> data
v1 v2
[1,] "a" "1"
[2,] "ab" "2"
[3,] "a|12|bc" "3"
[4,] "a|b" "5"
[5,] "ac" "3"
[6,] "bc|2" "1"
[7,] "b|bc|12" "2"
I want to keep only the rows whose character values do not include "|", like below:
> data
v1 v2
[1,] "a" "1"
[2,] "ab" "2"
[3,] "ac" "3"
Basically, I am trying to get rid of two-level (x|y) or three-level (x|y|z) values. Any thoughts on this?
Thanks!
We can use grep to find the rows that contain |, use the invert option to get the row indices of the elements that have no |, and use those to subset the rows of the matrix.
data[grep("|", data[,1], invert = TRUE, fixed = TRUE), ]
# v1 v2
#[1,] "a" "1"
#[2,] "ab" "2"
#[3,] "ac" "3"
NOTE: fixed = TRUE is used because otherwise the pattern is treated as a regular expression, in which | is a metacharacter for the OR condition. Other options are to escape it (\\|) or place it inside square brackets ([|]) to match the literal character (when fixed = FALSE).
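A small sketch comparing the three ways of matching a literal | mentioned in the note. The vector `v1` is copied from the question; the other variable names are only illustrative.

```r
v1 <- c("a", "ab", "a|12|bc", "a|b", "ac", "bc|2", "b|bc|12")

# Three ways to test for a literal vertical bar
has_bar_fixed   <- grepl("|", v1, fixed = TRUE)  # plain string match
has_bar_escaped <- grepl("\\|", v1)              # escaped metacharacter
has_bar_class   <- grepl("[|]", v1)              # inside a character class

# Keep only the values without a bar
single_level <- v1[!has_bar_fixed]
```

All three logical vectors agree on this input, so the choice is mostly a matter of readability.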
Using the logical grepl, this can be done as follows. I will leave it as two lines of code for clarity, but it's straightforward to turn it into a one-liner.
i <- !grepl("\\|", data[, 1])
data[i, ]
# v1 v2
#[1,] "a" "1"
#[2,] "ab" "2"
#[3,] "ac" "3"

Obtain previous level of ordinal factor in R

I must be missing something here, because the question looks simple and I have already spent too much time on it.
Let's say we have an ordinal factor column in a data frame. We want a new column obtained by shifting the original column down or up by one level or category. What is the fastest way?
Data
col_string <- as.character(c(1:5))
col_factor <- factor(col_string, levels = as.character(c(1:8)), ordered = TRUE)
Desired solution:
col_solution <- c(8,1,2,3,4)
df <- cbind(col_string, col_factor, col_solution)
df
col_string col_factor col_solution
[1,] "1" "1" "8"
[2,] "2" "2" "1"
[3,] "3" "3" "2"
[4,] "4" "4" "3"
[5,] "5" "5" "4"
How, in code, can I tell R to:
col_solution <- shift down one level of the element in col_factor
Edit for clarification:
The col_factor column has 8 categories, although only 5 of them are represented. The categories are ordered as 1-2-3-4-5-6-7-8. If an element is in category 1 and we want to go down by one category, we would go to category 8.
function(myFactor, shift = -1){
  # shift the level codes by `shift`, wrapping around with the modulo operator
  myFactor[] <- (as.numeric(myFactor) - 1 + shift) %% length(levels(myFactor)) + 1
  return(myFactor)
}
It is a bit of a pain to get your head around what the indexing is doing.
((x-1) %% y) + 1 gives the remainder of x/y, but when the remainder is 0, it returns y.
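To make the wrap-around concrete, here is a minimal sketch under a few assumptions. The helper name `shift_levels` is hypothetical, and instead of assigning numbers into the factor it indexes into `levels(f)`, so it should not depend on the level labels happening to be "1" to "8".

```r
# Hypothetical helper: shift each element of a factor by `shift` levels,
# wrapping cyclically past the first or last level
shift_levels <- function(f, shift = -1) {
  n   <- nlevels(f)
  idx <- (as.integer(f) - 1 + shift) %% n + 1  # position of the new level
  factor(levels(f)[idx], levels = levels(f), ordered = is.ordered(f))
}

col_factor <- factor(as.character(1:5),
                     levels = as.character(1:8), ordered = TRUE)

# Shift down by one level: "1" wraps around to "8"
col_solution <- shift_levels(col_factor)

# Shift up by one level: "8" wraps around to "1"
wrapped_up <- shift_levels(factor("8", levels = as.character(1:8),
                                  ordered = TRUE), shift = 1)
```

Here `as.character(col_solution)` gives "8", "1", "2", "3", "4", matching the desired col_solution from the question.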

Subsetting Identical Observations in R [duplicate]

This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
I am trying to look at protein sequence homology using R, and I'd like to go through a data frame looking for identical pairs of Position and Letter. The data look similar to the frame below:
Letter <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)
Which yields:
Position Letter
[1,] "1" "A"
[2,] "2" "B"
[3,] "3" "C"
[4,] "4" "D"
[5,] "4" "D"
[6,] "5" "E"
[7,] "6" "G"
[8,] "7" "L"
I'd like to loop through and find all identical observations (in this case, observations 4 and 5), but I'm having difficulty in discovering the best way to do it.
I'd like the resultant data frame to look like:
Position Letter
[1,] "4" "D"
[2,] "4" "D"
The ways I've tried to do this ended up yielding this code, but unfortunately it returns one value of TRUE because I realized that I am comparing two identical data frames:
> identical(data.set[1:nrow(data.set),1:2], data.set[1:nrow(data.set),1:2])
[1] TRUE
I'm not sure if looping through using the identical() function would be the best way? I'm sure there's a more elegant solution that I am missing.
Thanks for any help!
Try the unique function:
unique(data.set)
...
You can use duplicated twice, once with fromLast = TRUE, so that duplicates are flagged in both directions:
data.set[duplicated(data.set) | duplicated(data.set, fromLast = TRUE), ]
# Position Letter
#[1,] "4" "D"
#[2,] "4" "D"
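The sketch below, using the data from the question, shows why both directions are needed: each call to duplicated flags only one member of the pair, and unique returns the distinct rows rather than the repeated ones. The intermediate variable names are only illustrative.

```r
Letter   <- c("A", "B", "C", "D", "D", "E", "G", "L")
Position <- c(1, 2, 3, 4, 4, 5, 6, 7)
data.set <- cbind(Position, Letter)

# Scanning top to bottom flags only the later copy (row 5)
dup_forward  <- duplicated(data.set)
# Scanning bottom to top flags only the earlier copy (row 4)
dup_backward <- duplicated(data.set, fromLast = TRUE)

# Combining both keeps every row that has a twin
all_dups <- data.set[dup_forward | dup_backward, , drop = FALSE]

# unique() drops the repeat instead of selecting it
distinct_rows <- unique(data.set)
```

`drop = FALSE` keeps the result a matrix even if only one row were selected.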

Print out matrix in R within function for which each column has a specified number of digits defined by a function parameter

I have been thinking about this for some time already but I cannot find the solution. Here is the problem.
I have a function that iteratively calculates the root of a function that I pass to it, so with every iteration I come closer to the final solution (Newton's method). Within the function I build a matrix that stores the number of the iteration (i), the value for x (x) and the value for f(x) (y).
matrix <- rbind(matrix, c(i,x,y))
The function itself works perfectly fine. But I want to print out the result in a specific way.
I want to return the matrix that is built in the function like this:
[,1] [,2] [,3]
[1,] "1" "0.000" "3.000"
[2,] "2" "-299999.975" "89999985109.735"
[3,] "3" "-150000.381" "22500114442.253"
[4,] "4" "-75000.123" "5625014307.234"
[5,] "5" "-37500.048" "1406253577.781"
[6,] "6" "-18750.030" "351563619.088"
[7,] "7" "-9375.093" "87890906.234"
[8,] "8" "-4687.507" "21972727.599"
[9,] "9" "-2343.753" "5493182.588"
What I am doing at the moment is:
return(matrix(sprintf(c("%.0f","%.3f","%.3f"),matrix),nrow=N))
But this yields
[,1] [,2] [,3]
[1,] "1" "0" "3"
[2,] "2.000" "-299999.975" "89999985109.735"
[3,] "3.000" "-150000.381" "22500114442.253"
[4,] "4" "-75000" "5625014307"
[5,] "5.000" "-37500.048" "1406253577.781"
[6,] "6.000" "-18750.030" "351563619.088"
[7,] "7" "-9375" "87890906"
[8,] "8.000" "-4687.507" "21972727.599"
[9,] "9.000" "-2343.753" "5493182.588"
So the formats seem to be cycled down the rows of each column instead of one format being applied per column.
In a next step - to make it even more complicated - my function is supposed to have a parameter that allows users to specify the number of digits of column 2 and 3.
so something like:
newton <- function(fx, p=0)
Where p is the number of digits and by default 0.
Can somebody help me with this? Thank you!
If your matrix always has 3 columns, you can simply do:
x.digits <- 3
y.digits <- 4
mxStr <- cbind(
  sprintf('%d', mx[, 1]),
  sprintf(paste('%.', x.digits, 'f', sep = ''), mx[, 2]),
  sprintf(paste('%.', y.digits, 'f', sep = ''), mx[, 3])
)
Of course you can wrap this code in a function and pass x.digits and y.digits as parameters...
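One way the wrapping could look, as a sketch only: the function name `format_iterations` and the small matrix `mx` are made up for illustration, and the iteration column is formatted with "%.0f" as in the question. Formatting each column with its own sprintf call avoids the format vector being recycled across the matrix elements.

```r
# Hypothetical wrapper: format a 3-column numeric matrix column by column
format_iterations <- function(mx, x.digits = 0, y.digits = x.digits) {
  cbind(
    sprintf("%.0f", mx[, 1]),                       # iteration counter
    sprintf(paste0("%.", x.digits, "f"), mx[, 2]),  # x values
    sprintf(paste0("%.", y.digits, "f"), mx[, 3])   # f(x) values
  )
}

# Made-up iteration matrix with columns i, x, f(x)
mx  <- rbind(c(1, 0, 3),
             c(2, -1.5, 2.25))
out <- format_iterations(mx, x.digits = 3, y.digits = 3)
```

Inside a function like newton, the parameter p from the question could then be passed on as both x.digits and y.digits.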

r - pairwise combinations of rows from table?

Assume a table as below:
X =
col1 col2 col3
row1 "A" "0" "1"
row2 "B" "2" "NA"
row3 "C" "1" "2"
I select combinations of two rows, using the code below:
pair <- apply(X, 2, combn, m=2)
This returns a matrix of the form:
pair =
[,1] [,2] [,3]
[1,] "A" "0" "1"
[2,] "B" "2" NA
[3,] "A" "0" "1"
[4,] "C" "1" "2"
[5,] "B" "2" NA
[6,] "C" "1" "2"
I wish to iterate over pair, taking two rows at a time, i.e. first isolate [1,] and [2,], then [3,] and [4,], and finally [5,] and [6,]. These rows will then be passed as arguments to regression models, i.e. lm(Y ~ row[i]*row[j]).
I am dealing with a large dataset. Can anybody advise how to iterate over a matrix two rows at a time, assign those rows to variables and pass as arguments to a function?
Thanks,
S ;-)
It is unnecessary to multiply the rows of your matrix like that, and with a large data set it might become problematic. Instead, just pick out the relevant rows for each instance. It is convenient to create the selection beforehand, something like this perhaps:
xselect <- combn(1:nrow(X),2)
To illustrate with your data (assuming you only use columns 2 and 3):
X <- matrix(c("A", "B", "C", 0,2,1,1,NA,2),3,3)
Y <- rnorm(2, 4, 2)
for (i in 1:ncol(xselect))
{
x1 <- as.numeric(X[xselect[1,i], c(2,3)])
x2 <- as.numeric(X[xselect[2,i], c(2,3)])
print(lm(Y ~ x1 * x2))
}
I'm not sure exactly what you're trying to do with the linear models, but to iterate over X a pair of rows at a time, make a factor that labels each pair, and then use by:
fac <- as.factor(sort(rep(1:(nrow(X)/2), 2)))
by(X, fac, FUN)
where FUN is whatever function you want to apply over the pairs of rows in X.
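As an alternative sketch of iterating over row pairs without materialising them first, combn accepts a FUN argument that is called once per combination. The numeric matrix X and response Y below are made-up data, assuming each row holds one predictor measured on six observations.

```r
# Made-up data: 3 predictors (rows) measured on 6 observations (columns)
X <- matrix(c(0, 1, 2, 3, 4, 5,
              2, 0, 1, 5, 3, 4,
              1, 3, 0, 2, 5, 4),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("row1", "row2", "row3"), NULL))
Y <- c(1.2, 0.7, 2.9, 3.1, 4.8, 5.0)

# Fit one model per pair of rows; idx holds the two row numbers
fits <- combn(nrow(X), 2, FUN = function(idx) {
  x1 <- X[idx[1], ]
  x2 <- X[idx[2], ]
  lm(Y ~ x1 * x2)
}, simplify = FALSE)

# The pairs themselves, in the same order as `fits`
pairs <- combn(nrow(X), 2, simplify = FALSE)
```

With simplify = FALSE the result is a list with one fitted model per pair, so only the row indices are held in memory rather than copies of the rows.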
