Count multiple Data in a string cell - r

I would like to use table() in R to count how many times each value occurs in a column. However, some cells contain several values separated by colons. An example:
example <- data.frame(c("A","B","A:::B"))
table(example)
The result is:
A A:::B B
1 1 1
But I want something like this:
A B
2 2
I tried duplicating the rows that contain several values, but the dataset is already very large and the duplicated rows make it impossible to use. How can I do this?
Thanks

We can split the column values on ::: and tabulate the result:
table(unlist(strsplit(example[[1]], "\\:+")))
# A B
# 2 2
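A slightly more defensive variant of the same idea, as a minimal sketch. On R versions before 4.0, data.frame() converts strings to factors by default and strsplit() rejects factors, so the column is converted with as.character() first. The column name x and the choice to split on the literal ":::" (fixed = TRUE) are assumptions made for illustration.

```r
# Build the example with an explicit column name; stringsAsFactors = TRUE
# mimics the pre-4.0 default so the as.character() call is exercised.
example <- data.frame(x = c("A", "B", "A:::B"), stringsAsFactors = TRUE)

# Split each cell on the literal separator ":::" (no regex involved)
parts <- strsplit(as.character(example$x), ":::", fixed = TRUE)

# Flatten the list of pieces and count each value
counts <- table(unlist(parts))
print(counts)
```

No rows are duplicated here; only the split values are held in memory while counting.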

Related

Split string in sliding window in a dataframe

I have previously asked for a method to split a string every 3 characters and save the results in a dataframe. Now I want to do the same thing, but with a sliding window of size n instead.
This question differs from the marked duplicate because the results here should be output in a dataframe. The mapply function given there would require quite some extra work to combine the results into a new dataframe and to add the positions as column names, as explained at the top of my previous question.
Example data
df <- data.frame(id = 1:2, seq = c('ABCDEF', 'XYZZZY'))
Looks like this:
id seq
1 1 ABCDEF
2 2 XYZZZY
Splitting into 3-character windows that slide by n = 1 position should give:
id 1 2 3 4
1 ABC BCD CDE DEF
2 XYZ YZZ ZZZ ZZY
I tried to do this using the separate function, as answered on my previous post. However, as far as I can find, it can only split at fixed split points rather than over a range.
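One possible base R approach, as a minimal sketch rather than a definitive answer. It assumes every sequence has the same length, and the names width, starts, windows and out are illustrative only.

```r
df <- data.frame(id = 1:2, seq = c("ABCDEF", "XYZZZY"),
                 stringsAsFactors = FALSE)

width  <- 3                                        # window size (assumed)
starts <- seq_len(nchar(df$seq[1]) - width + 1)    # start positions 1..4

# For each sequence, cut out the window beginning at every start position.
# vapply() returns one column per sequence, so transpose to one row each.
windows <- t(vapply(
  df$seq,
  function(s) substring(s, starts, starts + width - 1),
  character(length(starts)),
  USE.NAMES = FALSE
))

# Combine with the ids and use the start positions as column names
out <- data.frame(id = df$id, windows, stringsAsFactors = FALSE)
names(out) <- c("id", starts)
print(out)
```

Changing the step is a matter of thinning starts, for example with seq(1, nchar(df$seq[1]) - width + 1, by = 2).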

Using list of row numbers as criteria to populate field

I have a list of row numbers that represent rows containing outliers in a data set. I would like to add an "outlier" column to the original data set that flags the rows containing outliers, but I can't figure out how to use row numbers as criteria in R.
Example:
I have a dataframe like this:
id <-c("a","b","c","d")
values <-c(10,11,22,33)
df<-data.frame(id,values)
id values
1 a 10
2 b 11
3 c 22
4 d 33
And a list like this containing row numbers (more correctly "row names"):
outliers <-c(2,4)
I'd like to find a way to use the list of row numbers as criteria in something like:
df$outlier_test<-ifelse( if row number is on my list, "outlier","")
to produce something like this:
id values outlier_test
1 a 10
2 b 11 outlier
3 c 22
4 d 33 outlier
Spent quite a while trying to puzzle this out and had inspiration as soon as I posted the question. For anyone else who comes here with this question:
First:
df$rownumber<- row.names(df)
then:
df$outlier_test<- ifelse(df$rownumber %in% outliers,"outlier","")
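The same flag can also be built without the helper column, as a small sketch. It assumes the entries in outliers are row positions (1, 2, 3, ...), which matches the default row names of the example but would not hold if the data frame had been subset or given custom row names.

```r
id <- c("a", "b", "c", "d")
values <- c(10, 11, 22, 33)
df <- data.frame(id, values)

outliers <- c(2, 4)

# seq_len(nrow(df)) yields the row positions 1..4; %in% tests each one
# against the outlier list and ifelse() turns TRUE/FALSE into labels.
df$outlier_test <- ifelse(seq_len(nrow(df)) %in% outliers, "outlier", "")
print(df)
```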

R - Updating a Dataframe Column

I have a data frame with 2 columns that contain two different types of text.
The first column contains codes that are strings in the form DD-HI-HO (DD being the code).
Column 2 is free text which anyone can insert.
I am trying to populate a third column based on three rules, which use the logic below to give a single vector column of 1 or 0.
I don't seem to be able to update a vector column to incorporate all three rules. Below is pseudocode.
Basic info:
Codes is a vector (basically a reference table with one column)
Fuzzy is a vector (basically another reference table with one column)
#----CHECK SEQUENCES----
# Check if code is applied in column 1
Data$Has.Code <- grepl(pattern = "(HC|HD|HE|HK|HM|HH|HY|HL)", Data.Raw$Col1)
# Check if string contains relevant text in col 2
Data$Has.DG <- if(length(intersect(Codes, Data$Contents)) > 0) {1}
# Check how closely Strings are related. Take the highest match If its over 45% then set flag as 1
levenshteinSim(Fuzzy ,Data$Contents)
Added table with sample data:
Col1, Col2, Col3
1. HC-IE, Ice-cream, 1
2. IE-GB, Volvo, 0
3. IE-DE, Iced_Lollipop, 1
Record 1:
Rule number 1 would catch "HC" in Col1 and so set Col3 to 1 (boolean).
Rule number 2 would also catch something in Col2 for record 1, as the vector Codes contains "Ice" as an element. It wouldn't execute in any case, because rule 1 supersedes it.
Record 2:
None of the rules would return anything for the second item, so Col3 is set to 0.
Record 3:
A bit of a daft example, but the Levenshtein similarity between Col2 and one of the elements in the vector Fuzzy comes out at 75%. This is above our stated threshold, so Col3 is set to 1.
Can anyone help?
Thank you for your help
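One way the three rules could be combined into a single 0/1 column, as a rough sketch under several assumptions. The contents of Codes and Fuzzy are hypothetical, rule 2 is read as "Col2 contains any element of Codes", and base R's adist() (from utils) is swapped in for levenshteinSim(), with the similarity approximated as 1 minus the edit distance divided by the longer string's length.

```r
Data <- data.frame(
  Col1 = c("HC-IE", "IE-GB", "IE-DE"),
  Col2 = c("Ice-cream", "Volvo", "Iced_Lollipop"),
  stringsAsFactors = FALSE
)
Codes <- c("Ice")         # hypothetical reference vector
Fuzzy <- c("Iced_Lolly")  # hypothetical reference vector

# Rule 1: one of the listed codes appears in Col1
has_code <- grepl("(HC|HD|HE|HK|HM|HH|HY|HL)", Data$Col1)

# Rule 2: Col2 contains at least one element of Codes (literal match)
has_text <- Reduce(`|`, lapply(Codes, grepl, x = Data$Col2, fixed = TRUE))

# Rule 3: best similarity against Fuzzy exceeds the 45% threshold.
# adist() returns a matrix with one row per Col2 value, one column per Fuzzy value.
dist_mat <- adist(Data$Col2, Fuzzy)
max_len  <- outer(nchar(Data$Col2), nchar(Fuzzy), pmax)
best_sim <- apply(1 - dist_mat / max_len, 1, max)
is_fuzzy <- best_sim > 0.45

# A row is flagged when any rule fires, so rule order does not matter
Data$Col3 <- as.integer(has_code | has_text | is_fuzzy)
print(Data)
```

Because each rule is computed as a full logical vector, the rules can be combined with | in one assignment instead of updating the column three times.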

R Dataframe: Get column 2 where column 1 value = x?

Basic question, but I'm a beginner, sorry :-) I still struggle with all these different data types. I have a table with different variable names in column 1 and their values in column 2. I now want to extract the value for a certain variable.
VarNames<-read.table(paste("O:/Daten/RatsDaten/CodesandDescription/VarNamesDir.asc"), sep="", skip=0,header=FALSE)
And the table looks something like this:
Test1 5
Test2 7
Test3 1
So how do I access these Test variable values by their names? VarNames["Test1",2] didn't work, and neither did any other option I tried. Are there better data type options for this, or how would I do it conveniently with a data frame?
You should be in one of these two situations. Either
Testxx are row names of VarNames (you can check this with rownames(VarNames)), in which case you should do:
VarNames["Test1",1]
Or Testxx are values in a column, in which case you should do something like this:
VarNames[VarNames$v =='Test1',2]
For the first option:
m <- matrix(1:3,ncol=1,dimnames=list(paste0('Test',1:3),NULL))
m['Test1',]
Test1
1
For the second option:
m1 <- data.frame(v=paste0('Test',1:3),b=1:3)
m1[m1$v=='Test1',]
v b
1 Test1 1
As your example is not reproducible, it is unclear whether the first column denotes row names or a variable with values TestX.
In case it is a variable, your table actually looks like this:
V1 V2
Test1 5
Test2 7
Test3 1
So you can get value of Test2 by calling VarNames[VarNames$V1 == "Test2",] for the whole row or VarNames[VarNames$V1 == "Test2",2] for the value only. You specify 2 since it is the second column.
If the first column denotes row names, the call is VarNames["Test2",] for the whole row, or, as @agstudy answered, VarNames["Test2",1] for the value alone. You specify 1 because Test2 is then a row name and not contained in a column, so the values are in the first column.
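Both situations can be reproduced without the original file, as a small sketch. The inline text stands in for VarNamesDir.asc, whose real contents are not known, and the names txt, by_column, by_rowname and lookup are illustrative. The last lines show a named vector as one possible alternative data type for name-based lookup.

```r
txt <- "Test1 5
Test2 7
Test3 1"

# Situation 1: the names are an ordinary column (read.table calls it V1)
VarNames <- read.table(text = txt, header = FALSE, stringsAsFactors = FALSE)
by_column <- VarNames[VarNames$V1 == "Test2", 2]

# Situation 2: the names are row names, so the values sit in column 1
with_rownames <- read.table(text = txt, header = FALSE, row.names = 1)
by_rowname <- with_rownames["Test2", 1]

# Alternative: a named vector gives direct lookup by name
lookup <- setNames(VarNames$V2, VarNames$V1)
lookup[["Test2"]]
```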

How to build a new column (/data.frame) from a table, and assign corresponding values to the rows

I printed out the summary of a column variable like this:
summary(document$subject)
A,B,C,D,E,F,.. are the subjects belonging to a column of a data.frame where A,B,C,...appear many times in the column, and the summary above shows the number of times (frequency) these subjects have appeared in the file. Also, the term "OTHER" refers to those subjects which have appeared only once in the file, I also need to assign "1" to these subjects.
There are so many different subjects that it's difficult to list out all of them if we use command "c".
I want to build up a new column (or data.frame) and then assign these corresponding numbers (scores) to the subjects. Ideally, it will become this in the file:
A 198
B 113
C 96
D 69
A 198
E 65
F 62
A 198
C 96
BZ 21
BC 1
CJ 1
...
I wonder what command I should use to take the scores/values from the summary table and then build a new column to assign these values to the corresponding subjects in the file.
Plus, since it's a summary table printed by R, I don't know how to build it into a table in a file, or take out the values and subject names from the table. I also wonder how I could find out the subject names which appeared only once in the file, so that the summary table added them up into "OTHER".
Your question is hard to interpret without a reproducible example. Please take a look at this thread for tips on how to make one:
How to make a great R reproducible example?
Having said that, here is how I interpret your question. You have two data frames, one with a score per subject and another with the subjects multiple times in a column:
Sum <- data.frame(subject=c("A","B"),score=c(1,2))
foo <- data.frame(subject=c("A","B","A"))
> Sum
subject score
1 A 1
2 B 2
> foo
subject
1 A
2 B
3 A
You can then use match() to match the subjects in one data frame to the other and create the new variable in the second data frame:
foo$score <- Sum$score[match(foo$subject, Sum$subject)]
> foo
subject score
1 A 1
2 B 2
3 A 1
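If the scores are simply the frequencies of each subject, they can also be computed directly from the column with table() instead of being copied from the printed summary. This is a minimal sketch under that assumption; the data in foo and the names counts and once are made up for illustration.

```r
foo <- data.frame(subject = c("A", "B", "A", "C", "A", "B"),
                  stringsAsFactors = FALSE)

# Frequency of every subject, with no levels lumped together
counts <- table(foo$subject)

# Look up each row's subject in the frequency table by name
foo$score <- as.integer(counts[foo$subject])

# Subjects that occur exactly once in the column
once <- names(counts)[counts == 1]

print(foo)
print(once)
```

The counts object can be turned into a two-column data frame with as.data.frame(counts) if it needs to be written to a file.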
