average number of words in a character vector in R

I'm trying to get the average number of words in my character vector in R.
one <- c(9, 23, 43)
two <- c("this is a new york times article.", "short article.", "he went outside to smoke a cigarette.")
mydf <- data.frame(one, two)
mydf
# one two
# 1 9 this is a new york times article.
# 2 23 short article.
# 3 43 he went outside to smoke a cigarette.
I'm looking for a function that gives me the average number of words in the character vector "two".
The output here should be 5.3333 (= (7+2+7)/3).

Here's a possibility with the qdap package:
library(qdap)
wc(mydf$two, FALSE)/nrow(mydf)
## [1] 5.333333
This is overkill but you could also do:
word_stats(mydf$two)
## all n.sent n.words n.char n.syl n.poly wps cps sps psps cpw spw pspw n.state proDF2 n.hapax n.dis grow.rate prop.dis
## 1 all 3 16 68 23 3 5.333 22.667 7.667 1 4.250 1.438 .188 3 1 12 2 .750 .125
The wps column is words per sentence.

Or use gregexpr() to count the spaces in each element and add one:
mean(sapply(mydf$two,function(x)length(unlist(gregexpr(" ",x)))+1))
[1] 5.333333
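One caveat with counting spaces: when a string contains no space, gregexpr() returns -1 in a vector of length 1, so a single word is counted as 2, and repeated spaces inflate the count. A minimal sketch that counts runs of non-whitespace characters instead; count_words is a made-up helper name, not part of any package:

```r
# Hypothetical helper: count runs of non-whitespace characters per element.
count_words <- function(x) {
  x <- as.character(x)  # also handles factor columns
  lengths(regmatches(x, gregexpr("[^[:space:]]+", x)))
}

two <- c("this is a new york times article.", "short article.",
         "he went outside to smoke a cigarette.")
word_counts <- count_words(two)
mean(word_counts)

# Edge cases: a single word, a double space, an empty string
edge_counts <- count_words(c("word", "two  words", ""))
```

This needs R 3.2.0 or later for lengths().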

Hadley Wickham's stringr package provides possibly the easiest way for this:
library(stringr)
foo <- str_split(two, " ") # split each element of your vector at the spaces
sapply(foo,length) # just a quick test: how many words has each element?
sum(sapply(foo,length))/length(foo) # calculate sum and divide it by the length of your original object
[1] 5.333333

I'm sure there are more elaborate methods available, but you can use strsplit to split your strings at whitespace into character vectors and count the number of elements in each.
mean(sapply(strsplit(as.character(mydf$two), "[[:space:]]+"), length))
# [1] 5.3333

Related

Replace apply function with lapply

I am creating a data set to compute the aggregate values for different combinations of words using regex. Each row has a unique regex value which I want to check against another dataset and find the number of times it appeared in it.
The first dataset (df1) looks like this :
word1 word2 pattern
air 10 (^|\\s)air(\\s.*)?\\s10($|\\s)
airport 20 (^|\\s)airport(\\s.*)?\\s20($|\\s)
car 30 (^|\\s)car(\\s.*)?\\s30($|\\s)
The other dataset (df2) from which I want to match this looks like
sl_no query
1 air 10
2 airport 20
3 airport 20
3 airport 20
3 car 30
The final output I want should look like
word1 word2 total_occ
air 10 1
airport 20 3
car 30 1
I am able to do this by using apply in R
process <-
function(x)
{
length(grep(x[["pattern"]], df2$query))
}
df1$total_occ=apply(df1,1,process)
but it is time-consuming since my dataset is pretty big.
I found out that the "mclapply" function of the "parallel" package can be used to run such things on multiple cores, so I am trying to get lapply working first. It's giving me an error:
lapply(df,process)
Error in x[, "pattern"] : incorrect number of dimensions
Please let me know what changes I should make to run lapply correctly.

Why not just lapply() over the pattern?
Here I've just pulled out your pattern but this could just as easily be df$pattern
pattern <- c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
"(^|\\s)airport(\\s.*)?\\s20($|\\s)",
"(^|\\s)car(\\s.*)?\\s30($|\\s)")
Using your data for df2
txt <- "sl_no query
1 'air 10'
2 'airport 20'
3 'airport 20'
3 'airport 20'
3 'car 30'"
df2 <- read.table(text = txt, header = TRUE)
Just iterate on pattern directly
> lapply(pattern, grep, x = df2$query)
[[1]]
[1] 1
[[2]]
[1] 2 3 4
[[3]]
[1] 5
If you want more compact output as suggested in your question, you'll need to run lengths() over the output returned (thanks to @Frank for pointing out the new function lengths()). E.g.
lengths(lapply(pattern, grep, x = df2$query))
which gives
> lengths(lapply(pattern, grep, x = df2$query))
[1] 1 3 1
You can add this to the original data via
dfnew <- cbind(df1[, 1:2],
Count = lengths(lapply(pattern, grep, x = df2$query)))
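As background for the error in the question: lapply() over a data frame iterates over its columns, not its rows, so process() receives a plain vector and indexing it like a row fails. Iterating over the pattern column avoids that. The sketch below rebuilds the example data and shows a vapply() variant next to the parallel::mclapply() form the question was heading towards; count_matches is a name made up here, and mc.cores = 1L is only a placeholder (values above 1 work on Unix-alikes, not on Windows).

```r
df1 <- data.frame(
  word1 = c("air", "airport", "car"),
  word2 = c(10, 20, 30),
  pattern = c("(^|\\s)air(\\s.*)?\\s10($|\\s)",
              "(^|\\s)airport(\\s.*)?\\s20($|\\s)",
              "(^|\\s)car(\\s.*)?\\s30($|\\s)"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  sl_no = c(1, 2, 3, 3, 3),
  query = c("air 10", "airport 20", "airport 20", "airport 20", "car 30"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: number of queries matching one pattern
count_matches <- function(p, queries) sum(grepl(p, queries))

# Serial version with a type-checked result
df1$total_occ <- vapply(df1$pattern, count_matches, integer(1),
                        queries = df2$query, USE.NAMES = FALSE)

# Same call shape with mclapply; raise mc.cores on Unix-alikes
par_counts <- unlist(parallel::mclapply(df1$pattern, count_matches,
                                        queries = df2$query, mc.cores = 1L))
df1[, c("word1", "word2", "total_occ")]
```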

R: Reorder columns from dcast output numerically instead of lexicographically

This is about ordering column names that contain both numbers and text. I have a data frame which resulted from dcast and has 200 rows. I have a problem with the ordering.
The column names are in the following format:
names(DF) <- c('Testname1.1', 'Testname1.100', 'Testname1.11', 'Testname1.2', ..., 'Testname2.99')
Edit: I would like to have the columns ordered as:
names(DF) <- c('Testname1.1', 'Testname1.2', 'Testname1.3', ..., 'Testname1.100', 'Testname2.1', ..., 'Testname2.100')
The original input has a column which specifies the day, but it is not being used when I 'cast' the data. Is there a way to specify the 'dcast' function to order combined column names numerically?
What would be the easiest way to get the columns ordered as I need to in R?
Thanks a lot!

I think you need to split the column before you can use it to order the data frame:
library("reshape2") ## for colsplit()
library("gtools")
Construct test data:
dat <- data.frame(matrix(1:25,5))
names(dat) <- c('Testname1.1', 'Testname1.100',
'Testname1.11','Testname1.2','Testname2.99')
Split and order:
cdat <- colsplit(names(dat),"\\.",c("name","num"))
dat[,order(mixedorder(cdat$name),cdat$num)]
## Testname1.1 Testname1.2 Testname1.11 Testname1.100 Testname2.99
## 1 1 16 11 6 21
## 2 2 17 12 7 22
## 3 3 18 13 8 23
## 4 4 19 14 9 24
## 5 5 20 15 10 25
The mixedorder() above (borrowed from @BondedDust's answer) is not really necessary for this example, but would be needed if the first (Testnamexx) component had more than 9 elements, so that Testname1, Testname2, and Testname10 would come in the proper order.

The mixedorder and mixedsort functions of pkg:gtools sometimes do what is desired, but in this case I think the period separator is messing things up because it is treated as part of a numeric value, although it was clearly intended to be a separator rather than a decimal point. Try
nvec <- c('Testname1.1', 'Testname1.100', 'Testname1.11', 'Testname1.2', 'Testname2.99')
library(gtools)
myvec <- nvec[order( mixedorder( sapply(strsplit(nvec, "\\."), "[[", 1)),
as.numeric(sapply(strsplit(nvec, "\\."), "[[", 2)) )
]

One way would be:
library(gtools) #use gtools library
library(NCmisc) #use NCmisc library for pad.left()
myvec <- c('Testname1.1', 'Testname1.100','Testname1.11','Testname1.2','Testname2.99') #construct your vector
myvec[mixedorder( paste(substring(myvec,1,9), pad.left(substring(myvec,11,100),'0') , sep='') ) ]
[1] "Testname1.1" "Testname1.2" "Testname1.11" "Testname1.100" "Testname2.99"
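If pulling in extra packages is not desirable, the same ordering can be sketched in base R by extracting the two numeric parts and passing both to order(). This assumes every name has the form text, digits, period, digits, as in the example; the variable names are made up for illustration.

```r
nms <- c('Testname1.1', 'Testname1.100', 'Testname1.11',
         'Testname1.2', 'Testname2.99')

# Number between the text prefix and the period
prefix_num <- as.numeric(sub("^[^0-9]*([0-9]+)\\.[0-9]+$", "\\1", nms))
# Number after the period
suffix_num <- as.numeric(sub("^.*\\.([0-9]+)$", "\\1", nms))

# Sort by the first number, then by the second
ordered_names <- nms[order(prefix_num, suffix_num)]
ordered_names
```

For a data frame, DF[, order(prefix_num, suffix_num)] with the two vectors computed from names(DF) would reorder the columns the same way.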

find all possible sums in vector (R)

I have a vector of dollar values like this (vec):
[1] 460.08 3220.56 1506.20 1363.76 1838.00 1838.00 3684.94 2352.66 1606.02
[10] 1840.05 518.98 1603.53 1556.94 347.32 253.16 12.95 1828.81 1896.32
[19] 4962.60 426.33 3237.04 1601.40 2004.57 183.80 1570.75 3622.96 230.04
[28] 426.33 3237.04 1601.40 2004.57 183.80
If I have a charge that resulted from some sum of these numbers, how could I find it? For example, if the charge was 6747.81, then it must have resulted from 1506.20 + 3237.04 + 2004.57 (the 3rd, 29th and 31st vector elements). How could I solve for these vector elements given the sum?
I would imagine finding all possible sums is the answer then matching it to the vector elements that led to it.
I have played with using combn(vec, 3) to find all combinations of 3, but this doesn't quite give what I want.

You'll want to use colSums (or apply) after combn to get the sums.
set.seed(100)
# Generate fake data
vec <- rpois(10, 20)
# Get all combinations of 3 elements
combs <- combn(vec, 3)
# Find the resulting sums
out <- colSums(combs)
# Making up a value to search for
val <- vec[2]+vec[6]+vec[8]
# Find which combinations lead to that value
id <- which(out == val)
# Pull out those combinations
combs[,id]
Some output to show the results for this example
> vec
[1] 17 12 23 20 21 17 21 18 22 22
> val
[1] 47
> combs[,id]
[,1] [,2]
[1,] 17 12
[2,] 12 17
[3,] 18 18
Edit: Just saw that there isn't necessarily a restriction to use 3 items. One could generalize this just by doing it for every possible sample size but I don't have time to do that right now. It would also be fairly slow for even moderately sized problems.
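A rough sketch of that generalization, looping over subset sizes and collecting the index sets whose sum matches. find_subsets, max_size and tol are names invented here; the tolerance is an assumption added because dollar amounts stored as doubles rarely compare exactly equal with ==. The number of subsets grows as 2^n, so this is only practical for short vectors or a small max_size.

```r
# Hypothetical helper: indices of every subset (up to max_size elements)
# whose sum matches target within tol.
find_subsets <- function(vec, target, max_size = length(vec), tol = 1e-6) {
  hits <- list()
  for (k in seq_len(max_size)) {
    idx <- utils::combn(length(vec), k)          # one column per index combination
    sums <- colSums(matrix(vec[idx], nrow = k))  # restore the shape lost by vec[idx]
    for (j in which(abs(sums - target) < tol)) {
      hits[[length(hits) + 1]] <- idx[, j]
    }
  }
  hits
}

# A few of the values from the question
vec <- c(460.08, 3220.56, 1506.20, 1363.76, 1838.00, 3684.94, 3237.04, 2004.57)
hits <- find_subsets(vec, target = 6747.81, max_size = 3)
lapply(hits, function(i) vec[i])  # the matching values for each hit
```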

Arithmetic Progression series in R

I am new to this forum. I guess something like this has been asked before but, I am not really sure if that is what I want.
I have a sequence like this,
1 2 3 4 5 8 9 10 12 14 15 17 18 19
So, what I wish to do is this: get all the numbers which form a series, i.e. the numbers belonging to a set should all have a constant difference from the previous element, and the set should contain at least 3 elements.
i.e., I can see that (1,2,3,4,5) forms one such series in which numbers appear after an interval of 1 and the total size of this set is 5 which satisfies the minimum threshold criteria.
(1,3,5) forms one such a pattern in which the numbers appear after an interval of 2.
(8,10,12,14) forms another such pattern with an interval of 2. So, as you can see, the interval of repetition can be anything.
Also, for a particular set, I want only the maximal one. I don't want (8,10,12) as output (although it satisfies the minimum threshold of 3 and the constant difference), only the one of maximal length, i.e. (8,10,12,14).
Similarly, for (1,2,3,4,5), I don't want (1,2,3) or (2,3,4,5) as output, only the maximal-length one, i.e. (1,2,3,4,5).
How can I do this in R?
Edit: That is, I want any set which forms a basic AP series with any difference; however, the number of elements in the series should be at least 3 and the series should be maximal.
Edit2: I have tried using rle and acf in R, but that doesn't entirely solve my problem.
Edit3: When I did acf, it basically gave me the maximum peak difference that I could have used. However, I want all possible differences. Also, rle is quite different: it gave me the longest continuous sequence of identical numbers, which does not occur in my case.

If you are looking for sequences of consecutive numbers, then cgwtools::seqle will find them for you in the same way rle finds a sequence of repeated values.
In the general case of basically any subset of your data which form such a sequence, such as the 8,10,12,14 case you cite, your criteria are so general as to be very difficult to satisfy. You'd have to start at each element of your series and do a forward-looking search for x[j] +1, x[j]+2, x[j]+3 ... ad infinitum. This suggests using some tree-based algorithms.
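For small inputs, a brute-force version of that forward-looking search is manageable. The sketch below is one possible reading of the question, not a tuned algorithm: it assumes integer-valued data (the membership test uses exact comparison), treats the input as a set, and only starts a progression at an element that cannot be extended to the left with the same difference, which is what makes each result maximal. maximal_aps and min_len are hypothetical names.

```r
# Hypothetical helper: all maximal arithmetic progressions with at least
# min_len elements that can be formed from the values in x.
maximal_aps <- function(x, min_len = 3) {
  x <- sort(unique(x))
  out <- list()
  n <- length(x)
  if (n < 2) return(out)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d <- x[j] - x[i]                 # candidate common difference
      if ((x[i] - d) %in% x) next      # extendable to the left, so not a start
      s <- x[i]
      nxt <- x[i] + d
      while (nxt %in% x) {             # extend to the right as far as possible
        s <- c(s, nxt)
        nxt <- nxt + d
      }
      if (length(s) >= min_len) out[[length(out) + 1]] <- s
    }
  }
  out
}

x <- c(1, 2, 3, 4, 5, 8, 9, 10, 12, 14, 15, 17, 18, 19)
res <- maximal_aps(x)
```

The result for the example includes (1,2,3,4,5), (1,3,5) and (8,10,12,14) but not sub-runs such as (8,10,12). The nested loops make this roughly quadratic in the number of distinct values, times the length of each progression.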

Here's a potential solution - albeit a very ugly, sloppy one:
##
arithSeq <- function(x=nSeq, minSize=4){
##
dx <- diff(x,lag=1)
Runs <- rle(diff(x))
##
rLens <- Runs[[1]]
rVals <- Runs[[2]]
pStart <- c(
rep(1,rLens[1]),
rep(cumsum(1+rLens[-length(rLens)]),times=rLens[-1])
)
pEnd <- pStart + c(
rep(rLens[1]-1, rLens[1]),
rep(rLens[-1],times=rLens[-1])
)
pGrp <- rep(1:length(rLens),times=rLens)
pLen <- rep(rLens, times=rLens)
dAll <- data.frame(
pStart=pStart,
pEnd=pEnd,
pGrp=pGrp,
pLen=pLen,
runVal=rep(rVals,rLens)
)
##
dSub <- subset(dAll, pLen >= minSize - 1)
##
uVals <- unique(dSub$runVal)
##
maxSub <- subset(dSub, runVal==uVals[1])
maxLen <- max(maxSub$pLen)
maxSub <- subset(maxSub, pLen==maxLen)
##
if(length(uVals) > 1){
for(i in 2:length(uVals)){
iSub <- subset(dSub, runVal==uVals[i])
iMaxLen <- max(iSub$pLen)
iSub <- subset(iSub, pLen==iMaxLen)
maxSub <- rbind(
maxSub,
iSub)
maxSub
}
##
}
##
deDup <- maxSub[!duplicated(maxSub),]
seqStarts <- as.numeric(rownames(deDup))
outList <- list(NULL); length(outList) <- nrow(deDup)
for(i in 1:nrow(deDup)){
outList[[i]] <- list(
Sequence = x[seqStarts[i]:(seqStarts[i]+deDup[i,"pLen"])],
Length=deDup[i,"pLen"]+1,
StartPosition=seqStarts[i],
EndPosition=seqStarts[i]+deDup[i,"pLen"])
outList
}
##
return(outList)
##
}
##
So there are things that can definitely be improved in this function - for instance I made a mistake somewhere in the calculation of pStart and pEnd, the start and end indices of a given arithmetic sequence, but it just so happened that the true start positions of such sequences are given as the rownumbers of one of the intermediate data.frames, so that was a hacky sort of solution. Anyways, it accepts a numeric vector x and a minimum length parameter, minSize. It will return a list containing information about sequences meeting the criteria you outlined above.
set.seed(1234)
lSeq <- sample(1:25,100000,replace=TRUE)
nSeq <- c(1:10,12,33,13:17,16:26)
##
> arithSeq(nSeq)
[[1]]
[[1]]$Sequence
[1] 16 17 18 19 20 21 22 23 24 25 26
[[1]]$Length
[1] 11
[[1]]$StartPosition
[1] 18
[[1]]$EndPosition
[1] 28
##
> arithSeq(x=lSeq,minSize=5)
[[1]]
[[1]]$Sequence
[1] 13 16 19 22 25
[[1]]$Length
[1] 5
[[1]]$StartPosition
[1] 12760
[[1]]$EndPosition
[1] 12764
[[2]]
[[2]]$Sequence
[1] 11 13 15 17 19
[[2]]$Length
[1] 5
[[2]]$StartPosition
[1] 37988
[[2]]$EndPosition
[1] 37992
Like I said, it's sloppy and inelegant, but it should get you started.

What happens when var() is applied to a data frame row in R?

Newbie R question. Sorry to ask: I'm sure it's been answered, but it's one that's hard to search, apparently. I've read the man page for var (variance), but I don't understand it. Checked books, web pages (OK, only two books). I'll wait for someone to point me to an existing answer ....
> df
first second
1 1 3
2 2 5
3 3 7
> df[,2]
[1] 3 5 7
> var(df[,2])
[1] 4
OK, so far, so good.
> df[1,]
first second
1 1 3
> var(df[1,])
first second
first NA NA
second NA NA
Huh??
Thanks in advance.

The first issue is that you get a different class of object when you select a row from a data.frame, than when you select a column:
df = data.frame(first=c(1, 2, 3), second=c(3, 5, 7))
class(df[, 2])
[1] "numeric"
class(df[1, ])
[1] "data.frame"
# But you can explicitly convert the row to a plain vector with as.numeric().
var(as.numeric(df[1, ]))
# [1] 2
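The second issue is that var() treats a data.frame quite differently. It treats each column as a variable and computes a matrix of variances and covariances by comparing each column to every other column. With a single row, each column has only one observation, so every variance and covariance is NA, which is the output you got. With more rows it looks like this: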
# Create a data frame with some random data.
dat = data.frame(first=rnorm(20), second=rnorm(20), third=rnorm(20))
var(dat)
# first second third
# first 0.98363062 -0.2453755 0.04255154
# second -0.24537550 1.1177863 -0.16445768
# third 0.04255154 -0.1644577 0.58928970
var(dat$third)
# [1] 0.5892897
cov(dat$first, dat$second)
# [1] -0.2453755
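To tie this back to the data frame in the question, here is a small sketch comparing the three calls on the 3 x 2 example; the object names cov_all, cov_row and row_var are made up for illustration.

```r
df <- data.frame(first = c(1, 2, 3), second = c(3, 5, 7))

# All rows: a 2 x 2 variance-covariance matrix of the columns
cov_all <- var(df)

# One row: each column has a single observation, so every entry is NA
cov_row <- var(df[1, ])

# Flatten the row to a plain numeric vector to get the variance across it
row_var <- var(unlist(df[1, ]))

cov_all
cov_row
row_var
```

Here unlist() is used instead of as.numeric() only because it keeps the column names on the resulting vector; either works for an all-numeric row.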

If you know that a data.frame is all numeric and want it to be available for both row and column operations, then convert it to a matrix:
dat = data.frame(first=rnorm(20), second=rnorm(20), third=rnorm(20))
dm <- data.matrix(dat)
var(dm[1,])
# the same value as the first element of apply(dat, 1, var)
(The same effect occurs when you use apply(): the list structure is lost and each row is coerced to a single common type.)
> apply(dat, 1, var)
[1] 0.45998066 1.51241166 0.13634927 0.49981030 0.04440448 1.21224067 0.28113135 0.57968597
[9] 0.26102036 0.41999510 1.01237100 0.17304770 0.50572223 1.17825272 1.39342510 2.94125062
[17] 1.18640684 2.15245595 3.06482195 0.96396008
