Select lines of a matrix with conditions - r

I have a matrix called a:
lab col1 col2 col3
one 1 4 7
two 2 5 8
three 3 6 9
and I want to select only the lines where "lab" is "one" or "two".
In fact my matrix is much bigger and I want to select many different values from the column "lab".
I tried to build a vector
selected.lines=c("one","two")
a=a[a$lab==selected.lines,]
but it doesn't work, I guess because R tries to select the lines from the column "lab" whose value equals "one" AND "two"?
Any help would be appreciated.

We need to use %in% when there is more than one element to compare against, because == recycles the elements of 'selected.lines': the first two elements of 'lab' are compared with the two elements of 'selected.lines', then the third element of 'lab' is compared with the first element of 'selected.lines' again, and so on until the end of the 'lab' column. Also, with a matrix, we use [ for subsetting instead of $.
a[a[,"lab"] %in% selected.lines,]
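To see the difference between == and %in%, here is a small sketch that rebuilds the question's matrix as a character matrix (the exact storage type is an assumption, since the question only shows the printed values). The vector `reversed` is a hypothetical extra, added only to show that the result of == depends on the order of the values.

```r
# Rebuild the question's matrix; a matrix holding labels must be character.
a <- matrix(
  c("one",   "1", "4", "7",
    "two",   "2", "5", "8",
    "three", "3", "6", "9"),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("lab", "col1", "col2", "col3"))
)
selected.lines <- c("one", "two")
reversed <- c("two", "one")  # hypothetical: same values, different order

# == recycles the shorter vector position by position, so with `reversed`
# it compares "one"=="two", "two"=="one", "three"=="two" and finds nothing.
# (R also warns because 3 is not a multiple of 2.)
recycled <- suppressWarnings(a[, "lab"] == reversed)

# %in% tests every label for membership, regardless of order.
keep <- a[, "lab"] %in% reversed

# drop = FALSE keeps the result a matrix even if only one line matches.
res <- a[keep, , drop = FALSE]
print(res)
```

With `selected.lines` in its original order, == happens to give the right rows here, which is exactly why this kind of bug is easy to miss on small test data.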

Related

How to compute percentage change between corresponding elements in two dataframes r

I have two dataframes let's say A
col1 col2
4 7
5 8
and B
col1 col2
2 5
1 4
Now, I want to compute the percentage change between each corresponding element in the two dataframes. So, the percentage change between element 1,1 in A and B, between element 2,1 in A and B and so on. I want to store these percentage changes in a 2 by 2 dataframe as well. Does anyone know how to do this without looping over the dataframes?
As these are equal-sized data.frames, simply subtracting one from the other and dividing by one of them gives the output
(A - B)/A
You can just use R's element-wise arithmetic.
If you do A/B, the division is performed element by element. So the complete formula for the change relative to A would be (A-B)/A; multiply by 100 to express it as a percentage.
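A minimal sketch with the two data frames from the question. It follows the answers' convention of measuring the change relative to A; if you want the change going from A to B instead, the sign flips and you would use (B - A) / A. The name `pct` is just for illustration.

```r
A <- data.frame(col1 = c(4, 5), col2 = c(7, 8))
B <- data.frame(col1 = c(2, 1), col2 = c(5, 4))

# Arithmetic on equal-sized data frames works element by element
# and returns a data frame with the same shape and column names.
pct <- (A - B) / A * 100
print(pct)
```

The result is again a 2 by 2 data frame, so no loop or apply call is needed.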

Create vector selecting values from two different vectors

Currently, this code works to do what I want to do where dx$res is a vector selecting values from dx$val1 or dx$val2 depending on value of dx$x0.
x0<-c(1,2,1,2,2,1)
val1<-c(8,6,4,5,3,2)
val2<-c(4,8,6,7,9,5)
dx<-data.frame(x0,val1,val2)
dx$res<-(dx$x0==1)*dx$val1+(dx$x0==2)*dx$val2
I would like to know if there are more elegant ways to do this, for example with an apply function.
One option is model.matrix with rowSums. It is also more general for 'n' number of distinct elements in the 'x0' column.
dx$res <- rowSums(dx[-1]*model.matrix(~ factor(x0) - 1 , dx))
dx$res
#[1] 8 8 4 7 9 2
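Two other base R options, sketched here as alternatives rather than as part of the answer above: ifelse() for the two-value case, and matrix indexing for the general case. The matrix-indexing version assumes that the values in x0 are exactly the positions (1, 2, ...) of the value columns. The column names `res_ifelse` and `res_index` are made up for this example.

```r
x0   <- c(1, 2, 1, 2, 2, 1)
val1 <- c(8, 6, 4, 5, 3, 2)
val2 <- c(4, 8, 6, 7, 9, 5)
dx   <- data.frame(x0, val1, val2)

# Two-value case: take val1 where x0 is 1, otherwise val2.
dx$res_ifelse <- ifelse(dx$x0 == 1, dx$val1, dx$val2)

# General case: index a matrix with (row, column) pairs.
# Assumes x0 holds the column position within `vals`.
vals <- as.matrix(dx[c("val1", "val2")])
dx$res_index <- vals[cbind(seq_len(nrow(dx)), dx$x0)]

print(dx)
```

Both columns should match the #[1] 8 8 4 7 9 2 shown above.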

Extract the first, second and last row that meets a criterion

I would like to know how to extract the last row that meets a criterion. I have seen the solution for getting the first one with the function "duplicated" in the linked question How do I select the first row in an R data frame that meets certain criteria?.
However, is it possible to get the second or last row that meets a criterion?
I would like to loop over each Class (here I only show two) and select the first, second and last row that meet the criterion Weight >= 10. If no row meets the criterion, I want to get NA.
Finally I want to store the three values (first, second, and last row) in a list containing the values for each class.
Class Weight
1 A 20
2 A 15
3 B 10
4 B 23
5 A 11
6 B 12
7 B 11
8 A 25
9 A 7
10 B 3
data.table can help with this.
This is an edit based on David's comment, moved into the answers because his approach is the correct way to do this.
library(data.table)
DT <- as.data.table(db)
DT[Weight >= 10][, .SD[c(1, 2, .N)], by = Class]
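As a faster alternative, also from David, look at
indx <- DT[, .I[Weight >= 10][c(1, 2, sum(Weight >= 10))], by = Class]$V1
DT[indx]
which creates the wanted index using .I and then subsets DT based on those rows. Note that .I counts rows of the table it is evaluated on, so the filter has to happen inside the grouped call; computing .I on DT[Weight >= 10] would give positions in the filtered table, not in DT.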
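If you would rather stay in base R and get the list with NA padding that the question asks for, something along these lines might work. This is not the data.table approach from the answer; the helper name `first_second_last` and its `min_weight` argument are made up for this sketch, and it returns the weights rather than whole rows.

```r
db <- data.frame(
  Class  = c("A", "A", "B", "B", "A", "B", "B", "A", "A", "B"),
  Weight = c(20, 15, 10, 23, 11, 12, 11, 25, 7, 3)
)

# Hypothetical helper: first, second and last weight meeting the criterion.
first_second_last <- function(w, min_weight = 10) {
  hits <- w[w >= min_weight]
  n <- length(hits)
  # Indexing past the end of a vector gives NA, which covers the
  # "fewer than two hits" cases; the last element needs an explicit check.
  c(first  = hits[1],
    second = hits[2],
    last   = if (n > 0) hits[n] else NA_real_)
}

# One named vector per class, collected in a list.
res <- lapply(split(db$Weight, db$Class), first_second_last)
print(res)
```

A class with no qualifying row ends up with three NA values instead of being dropped.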

R select multiple rows by conditional row number

I have a R dataframe like this one:
a<-c(1,2,3,4,5)
b<-c(6,7,8,9,10)
df<-data.frame(a,b)
colnames(df)<-c("a","b")
df
a b
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I would like to get the 1st, 2nd, 3rd AND 5th row of the column a, so 1 2 3 5, by selecting rows by their number.
I have tried df$a[1:3,5] but I get Error in df$a[1:3, 5] : incorrect number of dimensions.
What DOES work is c(df$a[1:3],df$a[5]) but I was wondering if there was an easier way to achieve this with R?
Your data frame has two dimensions (rows and columns). When you use the square brackets to extract values, R expects everything before the comma to indicate the rows desired, and everything after the comma to indicate the columns desired (see: ?"["). Hence, df[1:3,5] means rows 1 through 3, from column 5. To select your desired rows in one go, you need to concatenate them into a single vector (i.e., c(1:3,5)). That all goes before the comma; the column indicator, 1 or "a", goes after the comma. Thus, df[c(1:3,5), 1] is what you need.
As an alternative (that might be more appropriate for a data frame with many more columns), df[c(1:3, 5), "a"] as suggested by @Mamoun Benghezal would also get it done!
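Put together, a quick sketch of the variants on the question's data. The variable names (`rows`, `by_position`, `by_name`, `via_dollar`) are only for illustration; the last line is an extra option that indexes the column vector directly, which also works because df$a has a single dimension.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5), b = c(6, 7, 8, 9, 10))

# Build the row selection once: rows 1 to 3 plus row 5.
rows <- c(1:3, 5)

by_position <- df[rows, 1]    # column chosen by position
by_name     <- df[rows, "a"]  # column chosen by name
via_dollar  <- df$a[rows]     # index the column vector itself

print(by_name)
```

Selecting by name is usually the safer choice, since it keeps working if columns are reordered.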

R - Updating a Dataframe Column

I have a data frame with 2 columns that contain two different types of text.
The first column contains codes that are strings of the form DD-HI-HO (DD being the code).
Column 2 is free text which anyone can insert.
I am trying to populate a third column based on three rules, using the logic below, to give a single vector column of 1 or 0.
I don't seem to be able to update a vector column to incorporate all three rules. Below is pseudo code.
Basic info:
Codes is a vector (basically a reference table with one column)
Fuzzy is a vector (basically another reference table with one column)
#----CHECK SEQUENCES----
# Check if code is applied in column 1
Data$Has.Code <- grepl(pattern = "(HC|HD|HE|HK|HM|HH|HY|HL)", Data.Raw$Col1)
# Check if string contains relevant text in col 2
Data$Has.DG <- if(length(intersect(Codes, Data$Contents)) > 0) {1}
# Check how closely the strings are related. Take the highest match; if it's over 45%, set the flag to 1
levenshteinSim(Fuzzy ,Data$Contents)
-------Added Table with sample data
Col1, Col2, Col3
1.HC-IE, Ice-cream, 1
2.IE-GB, Volvo, 0
3.IE-DE, Iced_Lollipop, 1
Record 1
Rule number 1 would catch "HC" in Col1 and so set Col3 to 1 (boolean).
Rule number 2 would also catch something in Col2 for record 1, as the vector Codes contains "Ice" as an element. It wouldn't execute in any case because rule one supersedes it.
Record 2
None of the rules would return anything for the second item, so Col3 is set to 0.
Record 3
A bit of a daft example, but the Levenshtein similarity between Col2 and one of the elements in the vector Fuzzy comes out at 75%. This is above our stated threshold, so Col3 is set to 1.
Can anyone help?
Thank you for your help.
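One possible way to combine the three rules into a single column, sketched in base R on the sample rows above. Several things here are assumptions: the `Fuzzy` entry is invented for illustration (the question does not list its contents), rule 2 is read as "Col2 contains any element of Codes as a substring", and the similarity is computed with utils::adist as 1 minus the edit distance divided by the longer string's length, which may not match what levenshteinSim returns for your data.

```r
Data <- data.frame(
  Col1 = c("HC-IE", "IE-GB", "IE-DE"),
  Col2 = c("Ice-cream", "Volvo", "Iced_Lollipop"),
  stringsAsFactors = FALSE
)
Codes <- c("Ice")         # the question says Codes contains "Ice"
Fuzzy <- c("Iced_Lolly")  # hypothetical entry, not from the question

# Rule 1: one of the listed codes appears in Col1.
has_code <- grepl("HC|HD|HE|HK|HM|HH|HY|HL", Data$Col1)

# Rule 2: Col2 contains any element of Codes as a literal substring.
has_dg <- Reduce(`|`, lapply(Codes, grepl, x = Data$Col2, fixed = TRUE))

# Rule 3: best similarity against Fuzzy is above 45%.
d       <- adist(Data$Col2, Fuzzy)                         # edit distances
longest <- outer(nchar(Data$Col2), nchar(Fuzzy), pmax)     # longer length per pair
sim     <- 1 - d / longest
is_close <- apply(sim, 1, max) > 0.45

# A row is flagged if any rule fires, so rule precedence does not matter.
Data$Col3 <- as.integer(has_code | has_dg | is_close)
print(Data)
```

Because the flag is a logical OR of three vectors, there is no need for if() here; if() only looks at a single condition, which is likely why the original attempt could not fill a whole column.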
