I have a data frame with two columns, like this:
USER ID   text
1         "..."
2         "..."
...       ...
100       "..."
Let's say there are 100 users and each user has a text.
I want to compute the proportion of texts that contain a question mark.
For example, if only 20 texts contain question marks, the value I should get is 20/100 (I don't care how many question marks are within each text).
I tried to use str_count() and build a loop for it:
for (i in 1:length(data_frame$text)) {
str_count(data_frame$text[i], pattern = "\\?")}
but it's just not working; it doesn't even produce an error.
If you want to find if there is a question mark in the string (dichotomize as 1/0) you could do this in base R:
df <- data.frame(id = 1:10,
text = c(LETTERS[1:5], paste0(LETTERS[1:5],"?")))
df$question_mark <- grepl("\\?", df$text)*1
You can find the proportion by:
sum(df$question_mark) / nrow(df)
You may want to use stringr::str_detect() and you do not need a for loop.
Most of the str_* functions are vectorized, which is one of R's core strengths. (It is still a hidden for loop, of course, but it is implemented in C++, so it's much faster as well as easier to write.)
Consider:
df <- data.frame(test = c("asa", "asa?", "asa??", "asa???", "asa??"))
result <- paste0(sum(stringr::str_detect(df$test, "\\?")), "/", length(df$test))
print(result)
# [1] "4/5"
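As a small sketch of the same idea in base R (the `texts` vector here is made-up sample data, not the asker's): the mean of a logical vector is the share of TRUE values, so the proportion comes out directly. It also shows why the original loop seemed to do nothing: values computed inside a for loop are not auto-printed, so they have to be stored or passed to print().

```r
# Made-up sample data standing in for data_frame$text
texts <- c("asa", "asa?", "asa??", "asa???", "asa??")

# fixed = TRUE treats "?" as a literal character, so no escaping is needed
has_question <- grepl("?", texts, fixed = TRUE)

# The mean of a logical vector is the proportion of TRUE values
proportion <- mean(has_question)
print(proportion)

# Inside a loop, results must be printed (or stored) explicitly
for (i in seq_along(texts)) {
  print(has_question[i])
}
```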
I have a question about how to write a loop in R that checks whether a certain expression occurs in a string. I want to check whether the expression "i-sty" occurs in my variable for each i in 1:200 and, if it does, return the corresponding i.
For example, if we have "4-sty" the loop should give me 4, and if there is no "i-sty" in the variable it should give me . for that observation.
I used
for (i in 1:200){
datafram$height <- ifelse(grepl("i-sty", dataframe$Description), i, ".")
}
But it did not work. I literally only receive points. Attached I show a picture of the string variable.
"i-sty" is just a string with the letter i in it. To use a regex pattern with your variable i, you need to paste together a string, e.g., grepl(paste0(i, "-sty"), ...). I'd also recommend using NA rather than "." for the "else" result; that way the resulting height variable can be numeric.
for (i in 1:200){
dataframe$height <- ifelse(grepl(paste0(i, "-sty"), dataframe$Description), i, NA)
}
The above works syntactically, but not logically. You also have a problem that you are overwriting height each time through the loop - when i is 2, you erase the results from when i is 1, when i is 3, you erase the results from when i is 2... I think a better approach would be to extract the match, which is easy using stringr (but also possible in base). As a benefit, with the right pattern we can skip the loop entirely:
library(stringr)
# the parentheses capture just the digits; column 2 of the result holds that group
dataframe$height = str_match(string = dataframe$Description, pattern = "([0-9]+)-sty")[, 2]
# might want to wrap in `as.numeric`
You use both datafram and dataframe. I've assumed dataframe is correct.
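For the base R route mentioned above, here is a minimal sketch. The `description` vector is invented for illustration, and it assumes each entry contains at most one "N-sty" token. regexpr() finds the first match per element and regmatches() returns only the matched pieces, so they are written back into an NA-filled vector.

```r
# Invented sample data standing in for dataframe$Description
description <- c("4-sty building", "no storeys mentioned", "12-sty tower")

# Lookahead (?=-sty) matches the digits only when "-sty" follows them
m <- regexpr("[0-9]+(?=-sty)", description, perl = TRUE)

# Start with NA everywhere, then fill in the positions that matched
height <- rep(NA_real_, length(description))
height[m != -1] <- as.numeric(regmatches(description, m))

print(height)
```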
I have a 10 x ~15,000 data frame with salaries in column 9 and I'm trying to remove the $ from the start of each entry in that column.
This is the best version of what I have. I am new to R and far more familiar with other languages. Ideally there would be a way to run an operation on each element of a data frame (like cellfun in Matlab, or a list comprehension in Python); that would make this far easier.
Based on my debugging attempts it seems like gsub just isn't doing anything, even outside a loop. Any suggestions from a more experienced user would be appreciated.
Thanks.
bbdat <- read.csv("C:/Users/musta/Downloads/BBs1.csv", header=TRUE, sep=",", dec=".", stringsAsFactors=FALSE)
i <- 0
for (val in bbdat[,9])
{
i = i+1
bbdat[i,9]<- gsub("$","",val)
}
The $ is a regex metacharacter that matches the end of the string. To match it literally, either use fixed = TRUE (it is FALSE by default), put it inside square brackets ("[$]"), or escape it ("\\$"). As gsub/sub are vectorized, looping is not required:
bbdat[,9] <- gsub("$", "", bbdat[,9], fixed = TRUE)
If there is only a single instance of $ in each element, use sub instead of gsub (gsub stands for global substitution).
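To make the three options concrete, here is a short sketch on a made-up `salary` vector (not the asker's data). All three literal forms give the same result, while the unescaped pattern leaves the strings untouched because it only matches the empty position at the end of each string.

```r
# Made-up sample values standing in for bbdat[, 9]
salary <- c("$1,200", "$350", "980")

by_fixed   <- sub("$", "", salary, fixed = TRUE)  # literal match
by_bracket <- sub("[$]", "", salary)              # character class
by_escape  <- sub("\\$", "", salary)              # escaped metacharacter

# Unescaped "$" is the end-of-string anchor, so nothing visible changes
unchanged <- gsub("$", "", salary)

print(by_fixed)
print(unchanged)
```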
I have already tried to find a solution on the internet for my problem, and I have the feeling I know all the small pieces but I am unable to put them together. I'm quite new to programming, so please be patient :D...
I have a (in reality much larger) text string which look like this:
string <- "Test test [438] test. Test 299, test [82]."
Now I want to replace the numbers in square brackets using a lookup table and get a new string back. There are other numbers in the text but I only want to change those in brackets and need to have them back in brackets.
lookup <- read.table(text = "
Number orderedNbr
1 270 1
2 299 2
3 82 3
4 314 4
5 438 5", header = TRUE)
I have made a pattern to find the square brackets using regular expressions
pattern <- "\\[(\\d+)\\]"
Now I looked all around and tried sub/gsub, lapply, merge, str_replace, but I find myself unable to make it work... I don't know how to tell R! to look what's inside the brackets, to look for that same argument in the lookup table and give out what's standing in the next column.
I hope you can help me, and that it's not a really stupid question. Thx
We can use a regex look around to match only numbers that are inside a square bracket
library(gsubfn)
gsubfn("(?<=\\[)(\\d+)(?=\\])", setNames(as.list(lookup$orderedNbr),
lookup$Number), string, perl = TRUE)
#[1] "Test test [5] test. Test 299, test [3]."
Or without regex lookaround, by pasting the square brackets onto each column of 'lookup':
gsubfn("(\\[\\d+\\])", setNames(as.list(paste0("[", lookup$orderedNbr,
"]")), paste0("[", lookup$Number, "]")), string)
Read your table of keys and values (a two-column table) into a data frame. If your source information is a flat text file, you can easily use read.csv to obtain a data frame. In the example below, I hard-code a data frame with just two entries. Then I iterate over it and make replacements in the input string.
df <- data.frame(keys=c(438, 82), values=c(5, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
string <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"), df$values[i], string, perl=TRUE)
}
string
[1] "Test test [5] test. Test [3]."
Note: As @Frank wisely pointed out, my solution would fail if your number markers (e.g. [438]) have replacements that are numbers also appearing as other markers. That is, if replacing a key with a value results in yet another key, there could be problems. If this is a possibility, I would suggest using markers for which this cannot happen. For example, you could remove the brackets after each replacement.
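The failure mode in the note can be reproduced with a small sketch. The key/value pairs here are hypothetical and were chosen so that one replacement value (82) is also a key. The loop rewrites the already-replaced marker a second time, whereas a single pass that finds all matches first and then substitutes them avoids the chain.

```r
# Hypothetical lookup in which the value 82 is also a key
lookup <- data.frame(keys = c(438, 82), values = c(82, 3))
string <- "Test [438] test [82]."

# Sequential replacement: [438] becomes [82], which is then replaced again
looped <- string
for (i in seq_len(nrow(lookup))) {
  looped <- gsub(paste0("(?<=\\[)", lookup$keys[i], "(?=\\])"),
                 as.character(lookup$values[i]), looped, perl = TRUE)
}

# Single pass: locate every bracketed number once, then substitute all at once
m <- gregexpr("(?<=\\[)\\d+(?=\\])", string, perl = TRUE)
found <- as.numeric(regmatches(string, m)[[1]])
single_pass <- string
regmatches(single_pass, m) <-
  list(as.character(lookup$values[match(found, lookup$keys)]))

print(looped)
print(single_pass)
```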
You can use regmatches<- with a pattern containing lookahead/lookbehind:
patt = "(?<=\\[)\\d+(?=\\])"
m = gregexpr(patt, string, perl=TRUE)
v = as.integer(unlist(regmatches(string, m)))
`regmatches<-`(string, m, value = list(lookup$orderedNbr[match(v, lookup$Number)]))
# [1] "Test test [5] test. Test 299, test [3]."
Or to modify the string directly, change the last line to the more readable...
regmatches(string, m) <- list(lookup$orderedNbr[match(v, lookup$Number)])
I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way is, to get rid of the rest of the name and renaming all files. For this I tried using strsplit (plyr).
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank you for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, regardless of how many digits they have.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")
Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.
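The same steps can be sketched on the asker's newer example, where one middle number has a single digit. This variant uses vapply() rather than unlist(lapply()), which is only a stylistic choice here; the variable names `middle` and `sorted` are mine.

```r
# The "new example dataset" from the question
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
       "Ad_1049_9_79.txt", "Ad_10531_77_79.txt")

# Split on "_" and take the third piece of each file name as a number
middle <- as.numeric(vapply(strsplit(f, "_", fixed = TRUE),
                            `[[`, character(1), 3))

# Sort numerically, so 9 comes before 75
sorted <- f[order(middle)]
print(sorted)
```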
I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
If a numeric index is required, simply convert character to numeric:
mydb$index <- as.numeric(mydb$index)
I have millions of keywords in a column labeled Keyword.text. Each factor level (keyword) can contain multiple words (or, shall we say, tokens). Here is an example with 4 keywords:
Keyword.text
The quick brown fox the
.8 .crazy lazy dog
dog
jumps over+the 9
I'd like to count the number of tokens in each Keyword, so as to obtain:
Keyword.length
5
4
1
4
I installed the tau package but I haven't gotten very far...
textcnt(Mydf$Keyword.text, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
returns an error I don't understand. Maybe it's due to having factors; it worked fine when practicing with a string.
I know how to do it in excel, but it doesn't work for the last line. If A2 has the keywords then: =LEN(TRIM(A2))-LEN(SUBSTITUTE(A2," ",""))+1 would do
Edit: For a data frame and the total number of keywords, just use strsplit. There's no need to use textcnt if you're not interested in the counts per keyword. That's where I got you wrong:
tt <- data.frame(
a=rnorm(3),
b=rnorm(3),
c=c("the quick fox lazy","rbrown+fr even","what what goes & around"),
stringsAsFactors=F
)
sapply(tt$c, function(n){
length(strsplit(n, split = "[[:space:][:punct:]]+")[[1]])
})
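One caveat when applying this to the question's sample data, sketched below: strsplit() returns a leading empty string when the text starts with a delimiter, as ".8 .crazy lazy dog" does, so the plain length overcounts by one. Counting only the non-empty pieces gives the numbers the asker expected. The helper name `count_tokens` is mine, not from any package.

```r
# The four keywords from the question
keywords <- c("The quick brown fox the", ".8 .crazy lazy dog",
              "dog", "jumps over+the 9")

# Hypothetical helper: split on whitespace/punctuation, ignore empty pieces
count_tokens <- function(x) {
  pieces <- strsplit(as.character(x), "[[:space:][:punct:]]+")
  vapply(pieces, function(p) sum(nzchar(p)), integer(1))
}

# Plain lengths count the leading "" in the second keyword
raw_lengths <- lengths(strsplit(keywords, "[[:space:][:punct:]]+"))

print(raw_lengths)
print(count_tokens(keywords))
```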
To read the data, take also a look at ?readLines and/or ?scan. This preserves the string format and allows you to process the file line by line (or row per row). If you use a file connection, you can even load the file in parts, which helps you when you hit memory limits.
A simple example using readLines :
con <- textConnection("
The lazy fog+fog fog
never ended for fog jumping over the
fog whatever . $ plus.
")
# For a real file, use con <- file("myfile.txt") instead
Text <- readLines(con)
sapply(Text,textcnt, split = "[[:space:][:punct:]]+", method = "string", n = 1L)
On a side note, using the option Dirk mentioned (stringsAsFactors=F) won't slow down performance compared to the usual read.table command. On the contrary, actually. You should use the sapply as mentioned above, but replace Text with as.character(Mydf$Keyword.text) (or use the stringsAsFactors=F option and drop the as.character()).
Please show the error.
Also try:
require(tau)
textcnt(as.character(Mydf$Keyword.text), split, ....)
... to force character mode.
Or load your data with stringsAsFactors=FALSE -- the same question has come up here before.
What about a nice little function that also lets us decide which kind of words we would like to count, and which works on whole vectors as well?
require(stringr)
nwords <- function(string, pseudo = FALSE) {
  # pseudo = TRUE counts any run of non-whitespace; otherwise only runs of letters
  pattern <- if (pseudo) "\\S+" else "[[:alpha:]]+"
  str_count(string, pattern)
}
nwords("one, two three 4,,,, 5 6")
# 3
nwords("one, two three 4,,,, 5 6", pseudo=T)
# 6