I have a data frame with two columns that contain two different types of text.
The first column contains codes, which are strings of the form DD-HI-HO (DD being the code).
The second column is free text that anyone can enter.
I am trying to populate a third column based on three rules, using the logic below, to give a single vector column of 1 or 0.
I don't seem to be able to update a vector column so that it incorporates all three rules. Below is pseudocode.
Basic info:
Codes is a vector (basically a reference table with one column)
Fuzzy is a vector (basically another reference table with one column)
#----CHECK SEQUENCES----
# Check if code is applied in column 1
Data$Has.Code <- grepl(pattern = "(HC|HD|HE|HK|HM|HH|HY|HL)", Data.Raw$Col1)
# Check if string contains relevant text in col 2
Data$Has.DG <- if(length(intersect(Codes, Data$Contents)) > 0) {1}
# Check how closely the strings are related. Take the highest match; if it is over 45%, set the flag to 1
levenshteinSim(Fuzzy ,Data$Contents)
Added table with sample data:
Record, Col1, Col2, Col3
1, HC-IE, Ice-cream, 1
2, IE-GB, Volvo, 0
3, IE-DE, Iced_Lollipop, 1
Record 1
Rule 1 would catch "HC" in Col1 and so set Col3 to 1 (boolean).
Rule 2 would also catch something in Col2 for record 1, as the vector Codes contains "Ice" as an element. It wouldn't execute in any case, because rule 1 supersedes it.
Record 2
None of the rules would return anything for the second record, so Col3 is set to 0.
Record 3
A bit of a daft example, but the Levenshtein comparison gives a 75% similarity between Col2 and one of the elements in the vector Fuzzy. This is above our stated threshold, so Col3 is set to 1.
Can anyone help?
Thank you for your help
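One way the three rules might be combined is sketched below, under stated assumptions: the sample rows come from the table in the question, the contents of Codes and Fuzzy are made up for illustration, rule 2 is read as "any element of Codes occurs as a substring of Col2", and base R's adist() stands in for levenshteinSim(), with similarity taken as 1 minus the edit distance divided by the length of the longer string.

```r
# Sample rows from the question; Codes and Fuzzy contents are hypothetical
Data <- data.frame(
  Col1 = c("HC-IE", "IE-GB", "IE-DE"),
  Col2 = c("Ice-cream", "Volvo", "Iced_Lollipop"),
  stringsAsFactors = FALSE
)
Codes <- c("Ice")
Fuzzy <- c("Iced_Lolly")

# Rule 1: one of the listed codes appears in Col1
has_code <- grepl("(HC|HD|HE|HK|HM|HH|HY|HL)", Data$Col1)

# Rule 2 (assumed reading): any element of Codes is a substring of Col2
has_dg <- vapply(
  Data$Col2,
  function(txt) any(vapply(Codes, function(cd) grepl(cd, txt, fixed = TRUE), logical(1))),
  logical(1),
  USE.NAMES = FALSE
)

# Rule 3: best similarity against Fuzzy exceeds 45%
d <- adist(Data$Col2, Fuzzy)                              # rows: Col2, columns: Fuzzy
max_len <- outer(nchar(Data$Col2), nchar(Fuzzy), pmax)   # longer string of each pair
sim <- 1 - d / max_len
is_fuzzy <- apply(sim, 1, max) > 0.45

# A row is flagged when any of the three rules fires
Data$Col3 <- as.integer(has_code | has_dg | is_fuzzy)
Data
```

Because the final flag is a logical OR, the order in which the rules are evaluated does not change the result; computing each rule as its own logical vector first also makes it easy to inspect which rule fired for a given row.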
I would like to use table() in R to count how many times a value occurs in a column. However, some cells contain more than one value, separated by colons. An example is below:
example <- data.frame(c("A","B","A:::B"))
table(example)
the result is:
A A:::B B
1 1 1
but I want something like this:
A B
2 2
I tried duplicating the rows with this characteristic, but the dataset is already very large and duplicating rows makes it impossible to use. How can I do this?
Thanks
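We can split the column values on ::: and tabulate the result. The as.character() call guards against the column being a factor, which was the data.frame default before R 4.0.0.
table(unlist(strsplit(as.character(example[[1]]), "\\:+")))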
# A B
# 2 2
I have a matrix called a:
lab col1 col2 col3
one 1 4 7
two 2 5 8
three 3 6 9
and I want to select only the rows where "lab" is "one" or "two".
In fact my matrix is much bigger and I want to select many different values from the column "lab".
I tried using a vector:
selected.lines=c("one","two")
a=a[a$lab==selected.lines,]
but it doesn't work, I guess because R tries to select the rows from the column "lab" whose value equals "one" AND "two"?
Any help would be appreciated.
We need to use %in% when there is more than one value to compare against, because == recycles the elements of 'selected.lines': the first element of 'lab' is compared with the first element of 'selected.lines', the second with the second, then the third element of 'lab' is compared with the first element of 'selected.lines' again, and so on until the end of the 'lab' column. Also, with a matrix, we use [ for subsetting instead of $.
a[a[,"lab"] %in% selected.lines,]
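The difference between the two operators can be seen in a small sketch. The data here is a made-up four-row variant of the question's table, with the labels reordered so that recycling does not line up by accident.

```r
# Hypothetical data: same labels as the question, but in a different order
a <- data.frame(
  lab  = c("two", "one", "three", "one"),
  col1 = 1:4,
  stringsAsFactors = FALSE
)
selected.lines <- c("one", "two")

# == recycles selected.lines to "one", "two", "one", "two" and compares
# position by position, so a row is kept only if the order happens to line up
recycled <- a$lab == selected.lines

# %in% asks, for every row, whether its label occurs anywhere in selected.lines
member <- a$lab %in% selected.lines

a[member, ]

# The same idea for a character matrix, which needs [ , "lab"] instead of $
m <- as.matrix(a)
m[m[, "lab"] %in% selected.lines, ]
```

With the original row order ("one", "two", "three") the == comparison would appear to work, which is what makes this mistake easy to miss.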
Assuming my data frame has one column, I wish to add another column that indicates whether the ith element is unique within the first i elements. The result I want is:
c1 c2
1 1
2 1
3 1
2 0
1 0
For example, 1 is unique in {1}, 2 is unique in {1,2}, 3 is unique in {1,2,3}, 2 is not unique in {1,2,3,2}, 1 is not unique in {1,2,3,2,1}.
Here is my code, but it runs extremely slowly given that I have nearly 1 million rows.
for (i in 1:nrow(df)) {
  # count how often the ith value occurs among the first i values
  k <- sum(df$c1[1:i] == df$c1[i])
  if (k > 1) {
    df[i, "c2"] <- 0
  } else {
    df[i, "c2"] <- 1
  }
}
Is there a quicker way of achieving this?
The following works:
x$c2 = as.numeric(! duplicated(x$c1))
Or, if you prefer more explicit code (I do, but it’s slower in this case):
x$c2 = ifelse(duplicated(x$c1), 0, 1)
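As a quick sanity check, the sketch below applies the duplicated() approach to the five values from the question and compares it with a loop equivalent to the original one. The data frame x is built here purely for illustration.

```r
# The five example values from the question
x <- data.frame(c1 = c(1, 2, 3, 2, 1))

# Vectorised version: duplicated() is TRUE from the second occurrence onwards
x$c2 <- as.numeric(!duplicated(x$c1))

# Loop version equivalent to the question's code, kept only as a cross-check
loop_c2 <- numeric(nrow(x))
for (i in seq_len(nrow(x))) {
  loop_c2[i] <- if (sum(x$c1[1:i] == x$c1[i]) > 1) 0 else 1
}

identical(x$c2, loop_c2)
```

The loop rescans the first i elements on every iteration, so its cost grows roughly quadratically with the number of rows, whereas duplicated() makes a single pass.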
I have a R dataframe like this one:
a<-c(1,2,3,4,5)
b<-c(6,7,8,9,10)
df<-data.frame(a,b)
colnames(df)<-c("a","b")
df
a b
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I would like to get the 1st, 2nd, 3rd AND 5th row of the column a, so 1 2 3 5, by selecting rows by their number.
I have tried df$a[1:3,5] but I get Error in df$a[1:3, 5] : incorrect number of dimensions.
What DOES work is c(df$a[1:3],df$a[5]) but I was wondering if there was an easier way to achieve this with R?
Your data frame has two dimensions (rows and columns). When you use square brackets to extract values from it, R expects everything before the comma to indicate the desired rows and everything after the comma to indicate the desired columns (see ?`[`). Hence df[1:3,5] means rows 1 through 3 of column 5. Note that df$a is already a plain vector, so it takes a single index with no comma, which is why df$a[1:3,5] fails with "incorrect number of dimensions". To turn your desired rows into a single index vector, you need to concatenate them (i.e., c(1:3,5)). That goes before the comma; the column indicator, 1 or "a", goes after the comma. Thus, df[c(1:3,5), 1] is what you need.
As an alternative (which might be more appropriate for a data frame with many more columns), df[c(1:3, 5), "a"], as suggested by @Mamoun Benghezal, would also get it done!
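The indexing forms discussed here can be compared side by side. This is a minimal sketch using the data frame from the question; the variable names rows, by_position, by_name and from_vector are introduced only for illustration.

```r
a <- c(1, 2, 3, 4, 5)
b <- c(6, 7, 8, 9, 10)
df <- data.frame(a, b)

# Build the row index once: rows 1 to 3 plus row 5
rows <- c(1:3, 5)

by_position <- df[rows, 1]     # column chosen by position
by_name     <- df[rows, "a"]   # column chosen by name
from_vector <- df$a[rows]      # df$a is a vector, so a single index and no comma

by_position
```

Selecting the column by name tends to be more robust than selecting by position if columns are later added or reordered.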
Basic question, but I'm a beginner, sorry :-) And I still struggle with all these different data types etc. So I have a table with different variable names in column 1. In column 2 these variables have certain values. I now want to extract the value for a certain variable.
VarNames<-read.table(paste("O:/Daten/RatsDaten/CodesandDescription/VarNamesDir.asc"), sep="", skip=0,header=FALSE)
And the table looks something like this:
Test1 5
Test2 7
Test3 1
So how do I access these Test variable values by their names? VarNames["Test1",2] didn't work, and neither did any other option I've tried. Are there better data type options for this, or how would I do it with a convenient data frame?
You should be in one of these two situations. Either
the Testxx values are row names of VarNames (you can check this using rownames(VarNames)), in which case you should do:
VarNames["Test1",1]
Or the Testxx values are entries of a column, in which case you should do something like this:
VarNames[VarNames$v =='Test1',2]
For the first option:
m <- matrix(1:3,ncol=1,dimnames=list(paste0('Test',1:3),NULL))
m['Test1',]
Test1
1
For the second option:
m1 <- data.frame(v=paste0('Test',1:3),b=1:3)
m1[m1$v=='Test1',]
v b
1 Test1 1
As your example is not reproducible, it is unclear whether the first column denotes row names or a variable with values TestX.
In case it is a variable, your table actually looks like this:
V1 V2
Test1 5
Test2 7
Test3 1
So you can get value of Test2 by calling VarNames[VarNames$V1 == "Test2",] for the whole row or VarNames[VarNames$V1 == "Test2",2] for the value only. You specify 2 since it is the second column.
If the first column denotes row names, the call is VarNames["Test2",] for the whole row or, as @agstudy answered, VarNames["Test2",1] for the value alone. You specify 1 because, when Test2 is a row name, it is not stored in a column, so the value sits in the first column.
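Both situations can be reproduced without the original file. The sketch below is an illustration only: it feeds the three sample lines to read.table() through its text argument instead of the file path, and the names txt, VarNamesRn, value_col, value_rn and lookup are hypothetical.

```r
# The three sample lines from the question, supplied inline instead of from a file
txt <- "Test1 5\nTest2 7\nTest3 1"

# Case 1: the names are an ordinary column (read.table calls the columns V1, V2)
VarNames <- read.table(text = txt, header = FALSE, stringsAsFactors = FALSE)
value_col <- VarNames[VarNames$V1 == "Test2", 2]

# Case 2: the names are row names, so the values sit in the first (only) column
VarNamesRn <- read.table(text = txt, header = FALSE, row.names = 1)
value_rn <- VarNamesRn["Test2", 1]

# A named vector is another option that gives direct lookup by name
lookup <- setNames(VarNames$V2, VarNames$V1)
lookup[["Test1"]]
```

Running rownames(VarNames) on the real data shows which case applies: default row names "1", "2", "3" indicate the first case.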