R: count word occurrence by row and create variable - r

New to R. I am looking to create a function that counts the number of rows in which a column contains one or more of the following words: "foo", "x", "y".
I then want to label that row with a variable, such as "1".
I have a data frame that looks like this:
a->
id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"
The correct output should be:
count: 3
new data frame
a2 ->
id text time username keywordtag
1 "hello x" 10 "me" 1
2 "foo and y" 5 "you" 1
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know" 1
Any hints on how to do this would be appreciated!

Here are 2 approaches with base and qdap:
a <- read.table(text='id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"', header=TRUE)
# Base
a$keywordtag <- as.numeric(grepl("\\bfoo\\b|\\bx\\b|\\by\\b", a$text))
a
# qdap
library(qdap)
terms <- termco(gsub("(,)([^ ])", "\\1 \\2", a$text),
id(a), list(terms = c(" foo ", " x ", " y ")))
a$keywordtag <- as.numeric(counts(terms)[[3]] > 0)
a
# output
## id text time username keywordtag
## 1 1 hello x 10 me 1
## 2 2 foo and y 5 you 1
## 3 3 nothing 15 everyone 0
## 4 4 x,y,foo 0 know 1
The base approach is by far the more elegant and simple of the two.
# EDIT (borrowing from Richard; I believe this is the most generalizable and understandable):
words <- c("foo", "x", "y")
regex <- paste(sprintf("\\b%s\\b", words), collapse="|")
within(a,{
keywordtag = as.numeric(grepl(regex, a$text))
})
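The snippets above tag the rows but do not print the row count the question asks for (count: 3). A minimal sketch, assuming the same four rows and a word-boundary pattern built from the keyword vector; the names pattern and n_matching_rows are illustrative and do not come from the answer:

```r
# Rebuild the example data so the sketch is self-contained.
a <- data.frame(
  id = 1:4,
  text = c("hello x", "foo and y", "nothing", "x,y,foo"),
  stringsAsFactors = FALSE
)

words <- c("foo", "x", "y")

# One alternation wrapped in word boundaries: \b(foo|x|y)\b
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# 1 if the row contains at least one keyword as a whole word, else 0.
a$keywordtag <- as.integer(grepl(pattern, a$text))

# Number of rows that contain at least one keyword.
n_matching_rows <- sum(a$keywordtag)
n_matching_rows
```

Summing the 0/1 tag column gives the count directly, so no separate pass over the text is needed.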

This is probably much safer than my previous answer.
> string <- c("foo", "x", "y")
> a$keywordtag <-
(1:nrow(a) %in% unlist(sapply(string, grep, a$text, fixed = TRUE)))+0
> a
# id text time username keywordtag
# 1 1 hello x 10 me 1
# 2 2 foo and y 5 you 1
# 3 3 nothing 15 everyone 0
# 4 4 x,y,foo 0 know 1

Your question boils down to splitting a vector of strings on multiple delimiters and checking if any of the tokens are in your set of desired words. You can split on multiple delimiters using strsplit (I'll use comma and whitespace, since your question doesn't specify the full set of delimiters for your problem), and I'll use intersect to check if it contains any word in your set:
m <- c("foo", "x", "y")
a$keywordtag <- as.numeric(unlist(lapply(strsplit(as.character(a$text), ",|\\s"),
function(x) length(intersect(x, m)) > 0)))
a
# id text time username keywordtag
# 1 1 hello x 10 me 1
# 2 2 foo and y 5 you 1
# 3 3 exciting 15 everyone 0
# 4 4 x,y,foo 0 know 1
I've included "exciting", which is a word that contains "x" but that isn't listed as a match by this approach.
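To make the difference concrete, here is a small sketch comparing plain substring matching with the token-based check on a few strings. The vectors texts and keywords are made up for illustration:

```r
texts <- c("hello x", "exciting", "x,y,foo")
keywords <- c("foo", "x", "y")

# Substring matching: "exciting" is flagged because it contains the letter x.
substring_hit <- grepl("foo|x|y", texts)

# Token matching: split on commas and whitespace, then test set membership.
tokens <- strsplit(texts, "[,[:space:]]+")
token_hit <- vapply(tokens,
                    function(tok) any(tok %in% keywords),
                    logical(1))

data.frame(texts, substring_hit, token_hit)
```

Only the token-based column leaves "exciting" untagged, at the cost of having to list every delimiter that can occur in the text.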

A variation on Tyler Rinker's answer (note that without word boundaries this also matches substrings):
within(a,{keywordtag = as.numeric(grepl("foo|x|y", fixed = FALSE, a$text))})

Related

Obtaining the index of elements that belong to common group in R

I have the following data frame,
>df
Label
0 control1
1 control1
2 control2
3 control2
4 control1
To get the index of the elements with label control1 and control2, I do the following
Index1 <- grep("control1",df[,1])
Index2 <- grep("control2",df[,1])
In the above syntax, the labels control1 and control2 are explicitly mentioned in the command.
Is there a way to find the labels automatically? The reason is that the contents of the data frame, df, are parsed from different input files.
For instance, I could have another data frame that reads
>df2
Label
0 trol1
1 trol1
2 trol2
3 trol3
4 trol2
Is there a way to create a list of unique labels present in the column of df?
We can use split to get a list of row indices for each unique Label:
split(1:nrow(df), df$Label)
#$control1
#[1] 1 2 5
#$control2
#[1] 3 4
With df2
split(1:nrow(df2), df2$Label)
#$trol1
#[1] 1 2
#$trol2
#[1] 3 5
#$trol3
#[1] 4
Using unique and which you can do:
df <- data.frame(Label = c("trol1", "trol1", "trol2", "trol3", "trol2"), stringsAsFactors=FALSE)
label_idx = list()
for(lbl in unique(df$Label)){
label_idx[[lbl]] = which(df$Label == lbl)
}
label_idx
$`trol1`
[1] 1 2
$trol2
[1] 3 5
$trol3
[1] 4
You can also try:
lapply(unique(df$Label), function(x) which(df$Label%in% x))
#with df
[[1]]
[1] 1 2 5
[[2]]
[1] 3 4
lapply(unique(df2$Label), function(x) which(df2$Label%in% x))
#with df2
[[1]]
[1] 1 2
[[2]]
[1] 3 5
[[3]]
[1] 4
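The answers above can be combined into one sketch that returns both the unique labels and a named list of row indices, keeping the labels in order of first appearance. The names labels and index_by_label are illustrative only:

```r
df2 <- data.frame(
  Label = c("trol1", "trol1", "trol2", "trol3", "trol2"),
  stringsAsFactors = FALSE
)

# Labels in order of first appearance.
labels <- unique(df2$Label)

# Fixing the factor levels keeps that order in the names of the result.
index_by_label <- split(seq_len(nrow(df2)),
                        factor(df2$Label, levels = labels))

labels
index_by_label
```

Because the list is named, index_by_label[["trol2"]] retrieves the rows for a label without any call to grep.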

Getting index of vector with delimited parts

I have vectors that looks like these variations:
cn1 <- c("Probe","Genes","foo","bar","Probe","Genes","foo","bar")
# 0 1 2 3 4 5 6 7
cn2 <- c("Probe","Genes","foo","bar","qux","Probe","Genes","foo","bar","qux")
# 0 1 2 3 4 5 6 7 8 9
Note that each vector above consists of two parts, each introduced by the separator "Probe","Genes".
What I want to do is to get the indexes of the first part of the entry in between that separator. Yielding
cn1_id ------> [2,3]
cn2_id ------> [2,3,4]
How can I achieve that in R?
I tried this but it doesn't do what I want:
> split(cn1,c("Probe","Genes"))
$Genes
[1] "Genes" "bar" "Genes" "bar"
$Probe
[1] "Probe" "foo" "Probe" "foo"
Here's a function that you can use. Note that R vectors are 1-based so counting starts at 1 rather than 0.
findidx <- function(x) {
idx <- which(x=="Probe" & c(tail(x,-1),NA)=="Genes")
if (length(idx)>1) {
(idx[1]+2):(idx[2]-1)
} else {
NA # what to return if no match found
}
}
findidx(cn1)
# [1] 3 4
findidx(cn2)
# [1] 3 4 5
You could try between from data.table:
library(data.table)
indx <- between(cn1, 'Genes', 'Probe')
indx2 <- between(cn2, 'Genes', 'Probe')
which(cumsum(indx)==2)[-1]-1
#[1] 2 3
which(cumsum(indx2)==2)[-1]-1
#[1] 2 3 4
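A more general sketch of the same idea, assuming the separator may be any fixed sequence of strings. between_separators is a hypothetical helper written for this illustration, not a function from base R or data.table:

```r
# Hypothetical helper: 1-based positions lying between the first and the
# second occurrence of the separator sequence `sep` in `x`.
between_separators <- function(x, sep = c("Probe", "Genes")) {
  k <- length(sep)
  candidates <- seq_len(max(0L, length(x) - k + 1L))
  # TRUE where the k elements starting at position i equal the separator.
  is_start <- vapply(candidates,
                     function(i) identical(x[i:(i + k - 1L)], sep),
                     logical(1))
  starts <- candidates[is_start]
  if (length(starts) < 2L) return(integer(0))
  first <- starts[1] + k
  last <- starts[2] - 1L
  if (first > last) return(integer(0))
  first:last
}

cn1 <- c("Probe", "Genes", "foo", "bar", "Probe", "Genes", "foo", "bar")
cn2 <- c("Probe", "Genes", "foo", "bar", "qux",
         "Probe", "Genes", "foo", "bar", "qux")

cn1_id <- between_separators(cn1)
cn2_id <- between_separators(cn2)

# Subtract 1 to get the 0-based positions used in the question.
cn1_id - 1L
cn2_id - 1L
```

Unlike a comparison on string order, this version does not depend on how the locale sorts the values.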

How to avoid strings being converted to numbers automatically in R

I was trying to save some strings into a matrix, but they were automatically changed to numbers (levels). How can I avoid this?
Here is the table:
trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b
I want to save it to a matrix like:
J0001 a ab ab b
But what I get is:
J0001 1 2 2 3
How can I avoid this?
Your M column is defined as a factor. You can save it as-is by wrapping it with as.character
> dat <- read.table(header = TRUE, stringsAsFactors = TRUE, text = "trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b")
> as.numeric(dat$M)
# [1] 1 2 2 3
> as.character(dat$M)
# [1] "a" "ab" "ab" "b"
You can avoid this in the first place by passing stringsAsFactors = FALSE when you read the data into R (this has been the default since R 4.0.0), or by using the colClasses argument of the read-in functions.
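A short sketch of both routes, assuming the goal is a one-row character matrix named J0001 as in the question. The object names m, codes, labels and out are illustrative:

```r
# A factor stores integer codes plus a set of levels.
m <- factor(c("a", "ab", "ab", "b"))
codes <- as.integer(m)     # the underlying level codes
labels <- as.character(m)  # the original strings

# Convert to character before building the matrix.
out <- matrix(labels, nrow = 1, dimnames = list("J0001", NULL))
out

# Reading the column as character avoids the factor altogether.
dat <- read.table(header = TRUE, stringsAsFactors = FALSE,
                  text = "trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b")
is.character(dat$M)
```

Converting with as.character before the data reaches matrix() matters because a matrix can hold only one type, and a factor contributes its integer codes.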

Matching without replacement by id in R

In R, I can easily match unique identifiers using the match function:
match(c(1,2,3,4),c(2,3,4,1))
# [1] 4 1 2 3
When I try to match non-unique identifiers, I get the following result:
match(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 3
Is there a way to match the indices "without replacement", that is, each index appearing only once?
othermatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4 # note the 4 where there was a 3 at the end
You're looking for pmatch:
pmatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4
A more naive approach -
library(data.table)
a <- data.table(p = c(1,2,3,1))
a[,indexa := .I]
b <- data.table(q = c(2,3,1,1))
b[,indexb := .I]
setkey(a,p)
setkey(b,q)
# since the two vectors are permutations of each other, cbind-ing the sorted tables gives ab with ab[, p] == ab[, q]
ab <- cbind(a,b)
setkey(ab,indexa)
ab[,indexb]
#[1] 3 1 2 4
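One caveat with pmatch is that it also accepts partial matches on the character form of its arguments. A base R sketch that matches exactly and uses each position only once is to number the repeated values on both sides and match on the (value, occurrence) pair. match_once is a hypothetical helper, and it assumes the values contain no tab character:

```r
# Hypothetical helper: like match(), but the n-th occurrence of a value in x
# is matched to the n-th occurrence of that value in table.
match_once <- function(x, table) {
  occurrence <- function(v) ave(seq_along(v), v, FUN = seq_along)
  x_key <- paste(x, occurrence(x), sep = "\t")
  table_key <- paste(table, occurrence(table), sep = "\t")
  match(x_key, table_key)
}

res <- match_once(c(1, 2, 3, 1), c(2, 3, 1, 1))
res

# No partial matching: 1 is not matched to 10.
match_once(1, 10)
```

If x contains a value more often than table does, the surplus occurrences come back as NA, just as unmatched values do with match.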

Renaming duplicate strings in R

I have an R dataframe that has two columns of strings. In one of the columns (say, Column1) there are duplicate values. I need to relabel that column so that it would have the duplicated strings renamed with ordered suffixes, like in the Column1.new
Column1 Column2 Column1.new
1 A 1_1
1 B 1_2
2 C 2_1
2 D 2_2
3 E 3
4 F 4
Any ideas of how to do this would be appreciated.
Cheers,
Antti
Let's say your data (ordered by Column1) is within an object called tab. First create a run length object
c1.rle <- rle(tab$Column1)
c1.rle
##lengths: int [1:4] 2 2 1 1
##values : int [1:4] 1 2 3 4
That gives you the values of Column1 and the corresponding number of appearances of each element. Then use that information to create the new column with unique identifiers:
tab$Column1.new <- paste0(rep(c1.rle$values, times = c1.rle$lengths), "_",
unlist(lapply(c1.rle$lengths, seq_len)))
I'm not sure whether this is appropriate in your situation, but you could also just paste together Column1 and Column2 to create a unique identifier.
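As written, the rle approach appends a suffix to every value, so the unrepeated values come out as 3_1 and 4_1 rather than 3 and 4. A minimal sketch of one way to leave singletons untouched, assuming tab is sorted by Column1; the names run_len and pos are illustrative:

```r
tab <- data.frame(Column1 = c(1, 1, 2, 2, 3, 4),
                  Column2 = c("A", "B", "C", "D", "E", "F"),
                  stringsAsFactors = FALSE)

runs <- rle(tab$Column1)

# Length of the run each row belongs to, and the row's position in that run.
run_len <- rep(runs$lengths, times = runs$lengths)
pos <- sequence(runs$lengths)

# Add the suffix only where the value is repeated.
tab$Column1.new <- ifelse(run_len > 1,
                          paste(tab$Column1, pos, sep = "_"),
                          as.character(tab$Column1))
tab
```

sequence() builds the within-run counters in a single call, which replaces the unlist(lapply(..., seq_len)) construction.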
This may be a little more of a workaround, but parts of it may be more useful and simpler for someone with not quite the same needs. make.names with the argument unique=T adds a dot and a number to names that are repeated:
x <- make.names(tab$Column1,unique=T)
> print(x)
[1] "X1" "X1.1" "X2" "X2.1" "X3" "X4"
This might be enough for some folks. Here you can then grab the first entries of elements that are repeated, but not elements that are not repeated, then add a .0 to the end.
library(stringr)  # str_replace and str_match come from stringr
y <- rle(tab$Column1)
tmp <- !duplicated(tab$Column1) & (tab$Column1 %in% y$values[y$lengths>1])
x[tmp] <- str_replace(x[tmp],"$","\\.0")
> print(x)
[1] "X1.0" "X1.1" "X2.0" "X2.1" "X3" "X4"
Replace the dots and remove the X
x <- str_replace(x,"X","")
x <- str_replace(x,"\\.","_")
> print(x)
[1] "1_0" "1_1" "2_0" "2_1" "3" "4"
Might be good enough for you. But if you want the indexing to start at 1, grab the numbers, add one then put them back.
z <- str_match(x,"_([0-9]*)$")[,2]
z <- as.character(as.numeric(z)+1)
x <- str_replace(x,"_([0-9]*)$",paste0("_",z))
> print(x)
[1] "1_1" "1_2" "2_1" "2_2" "3" "4"
Like I said, more of a workaround here, but gives some options.
d <- read.table(text='Column1 Column2
1 A
1 B
2 C
2 D
3 E
4 F', header=TRUE)
transform(d,
Column1.new = ifelse(duplicated(Column1) | duplicated(Column1, fromLast=TRUE),
paste(Column1, ave(Column1, Column1, FUN=seq_along), sep='_'),
Column1))
# Column1 Column2 Column1.new
# 1 1 A 1_1
# 2 1 B 1_2
# 3 2 C 2_1
# 4 2 D 2_2
# 5 3 E 3
# 6 4 F 4
Cão's answer, using only base R:
x=read.table(text="
Column1 Column2 #Column1.new
1 A #1_1
1 B #1_2
2 C #2_1
2 D #2_2
3 E #3
4 F #4", stringsAsFactors=F, header=T)
string<-x$Column1
mstring <- make.unique(as.character(string) )
mstring<-sub("(.*)(\\.)([0-9]+)","\\1_\\3",mstring)
y <- rle(string)
tmp <- !duplicated(string) & (string %in% y$values[y$lengths>1])
mstring[tmp]<-gsub("(.*)","\\1_0", mstring[tmp])
end <- sub(".*_([0-9]+)","\\1",grep("_([0-9]*)$",mstring,value=T) )
beg <- sub("(.*_)[0-9]+","\\1",grep("_([0-9]*)$",mstring,value=T) )
newend <- as.numeric(end)+1
mstring[grep("_([0-9]*)$",mstring)]<-paste0(beg,newend)
x$Column1New<-mstring
x
It's a very old post, and I am probably missing something obvious, but what is wrong with(?):
tab$Column1 <- make.unique(as.character(tab$Column1), sep="_")
Albeit I believe this requires character input.
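A quick sketch of what make.unique returns on the example values, which also answers the "what is wrong with" question: the first occurrence keeps its plain name and only the later duplicates get a suffix, so the result differs from the requested 1_1, 1_2 labelling. The vector col stands in for tab$Column1:

```r
col <- c(1, 1, 2, 2, 3, 4)

# make.unique needs a character vector, hence the conversion.
labels <- make.unique(as.character(col), sep = "_")
labels

# Passing the numeric vector directly raises an error.
numeric_fails <- inherits(try(make.unique(col), silent = TRUE), "try-error")
numeric_fails
```

This is fine when any unique label will do, but it needs post-processing if the first duplicate must also carry a suffix.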
