How to grep a word exactly - r

I'd like to grep for "nitrogen" in the following character vector and want to get
back only the entry which is containing "nitrogen" and nothing of the rest (e.g. nitrogen fixation):
varnames=c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation", "total dissolved nitrogen", "total nitrogen")
I tried something like this:
grepl(pattern= "![[:space:]]nitrogen![[:space:]]", varnames)
But this doesn't work.

Although Dason's answer is easier, you could do an exact match using grep via:
varnames=c("nitrogen", "dissolved organic nitrogen", "nitrogen fixation", "total dissolved nitrogen", "total nitrogen")
grep("^nitrogen$",varnames,value=TRUE)
[1] "nitrogen"
grep("^nitrogen$",varnames)
[1] 1

To get the indices that are exactly equal to "nitrogen" you could use
which(varnames == "nitrogen")
Depending on what you want to do you might not even need the 'which' as varnames == "nitrogen" gives a logical vector of TRUE/FALSE. If you just want to do something like replace all of the occurances of "nitrogen" with "oxygen" this should suffice
varnames[varnames == "nitrogen"] <- "oxygen"

Or use fixed = TRUE if you want to match actual string (regexlessly):
v <- sample(c("nitrogen", "potassium", "hidrogen"), size = 100, replace = TRUE, prob = c(.8, .1, .1))
grep("nitrogen", v, fixed = TRUE)
# [1] 3 4 5 6 7 8 9 11 12 13 14 16 19 20 21 22 23 24 25
# [20] 26 27 29 31 32 35 36 38 39 40 41 43 44 46 47 48 49 50 51
# [39] 52 53 54 56 57 60 61 62 65 66 67 69 70 71 72 73 74 75 76
# [58] 78 79 80 81 82 83 84 85 86 87 88 89 91 92 93 94 95 96 97
# [77] 98 99 100
Dunno about the speed issues, I like to test stuff and claim that approach A is faster than approach B, but in theory, at least from my experience, indexing/binary operators should be the fastest, so I vote for #Dason's approach. Also note that regexes are always slower than fixed = TRUE greping.
A little proof is attached bellow. Note that this is a lame test, and system.time should be put inside replicate to get (more) accurate differences, you should take outliers into an account, etc. But surely this one proves that you should use which! =)
(a0 <- system.time(replicate(1e5, grep("^nitrogen$", v))))
# user system elapsed
# 5.700 0.023 5.724
(a1 <- system.time(replicate(1e5, grep("nitrogen", v, fixed = TRUE))))
# user system elapsed
# 1.147 0.020 1.168
(a2 <- system.time(replicate(1e5, which(v == "nitrogen"))))
# user system elapsed
# 1.013 0.020 1.033

Related

How to use characters in variables summing in R?

I have some dataframe. Here is a small expample:
a <- rnorm(100, 5, 2)
b <- rnorm(100, 10, 3)
c <- rnorm(100, 15, 4)
df <- data.frame(a, b, c)
And I have a character variable vect <- "c('a','b')"
When I try to calculate sum of vars using command
df$d <- df[vect]
which must be an equivalent of
df$d <- df[c('a','b')]
But, as a reslut I have got an error
[.data.frame(df, vect) :undefined columns selected
You're assumption that
vect <- "c('a','b')"
df$d <- df[vect]
is equivalent to
df$d <- df[c('a','b')]
is incorrect.
As #Karthik points out, you should remove the quotation marks in the assignment to vect
However, from your question it sounds like you want to then sum the elements specified in vect and then assign to d. To do this you need to slightly change your code
vect <- c('a','b')
df$d <- apply(X = df[vect], MARGIN = 1, FUN = sum)
This does elementwise sum on the columns in df specified by vect. The MARGIN = 1 specifies that we want to apply the sum rowise rather than columnwise.
EDIT:
As #ThomasIsCoding points out below, if for some reason vect has to be a string, you can parse a string to an R expression using str2lang
vect <- "c('a','b')"
parsed_vect <- eval(str2lang(vect))
df$d <- apply(X = df[parsed_vect], MARGIN = 1, FUN = sum)
Perhaps you can try
> df[eval(str2lang(vect))]
a b
1 8.1588519 9.0617818
2 3.9361214 13.2752377
3 5.5370983 8.8739725
4 8.4542050 8.5704234
5 3.9044461 13.2642793
6 5.6679639 12.9529061
7 4.0183808 6.4746806
8 3.6415608 11.0308990
9 4.5237453 7.3255129
10 6.9379168 9.4594150
11 5.1557935 11.6776181
12 2.3829337 3.5170335
13 4.3556430 7.9706624
14 7.3274615 8.1852829
15 -0.5650641 2.8109197
16 7.1742283 6.8161200
17 3.3412044 11.6298940
18 2.5388981 10.1289533
19 3.8845686 14.1517643
20 2.4431608 6.8374837
21 4.8731053 12.7258259
22 6.9534912 6.5069513
23 4.4394807 14.5320225
24 2.0427553 12.1786148
25 7.1563978 11.9671603
26 2.4231207 6.1801862
27 6.5830372 0.9814878
28 2.5443326 9.8774632
29 1.1260322 9.4804636
30 4.0078436 12.9909014
31 9.3599808 12.2178596
32 3.5362245 8.6758910
33 4.6462337 8.6647953
34 2.0698037 7.2750532
35 7.0727970 8.9386798
36 4.8465248 8.0565347
37 5.6084462 7.5676308
38 6.7617479 9.5357666
39 5.2138482 13.6822924
40 3.6259103 13.8659939
41 5.8586547 6.5087016
42 4.3490281 9.5367522
43 7.5130701 8.1699117
44 3.7933813 9.3241308
45 4.9466813 9.4432584
46 -0.3730035 6.4695187
47 2.0646458 10.6511916
48 4.6027309 4.9207746
49 5.9919348 7.1946723
50 6.0148330 13.4702419
51 5.5354452 9.0193366
52 5.2621651 12.8856488
53 6.8580210 6.3526151
54 8.0812166 14.4659778
55 3.6039030 5.9857886
56 9.8548553 15.9081336
57 3.3675037 14.7207681
58 3.9935336 14.3186175
59 3.4308085 10.6024579
60 3.9609624 6.6595521
61 4.2358603 10.6600581
62 5.1791856 9.3241118
63 4.6976289 13.2833055
64 5.1868906 7.1323826
65 3.1810915 12.8402472
66 6.0258287 9.3805249
67 5.3768112 6.3805096
68 5.7072092 7.1130150
69 6.5789349 8.0092541
70 5.3175820 17.3377234
71 9.7706112 10.8648956
72 5.2332127 12.3418373
73 4.7626124 13.8816910
74 3.9395911 6.5270785
75 6.4394724 10.6344965
76 2.6803695 10.4501753
77 3.5577834 8.2323369
78 5.8431140 7.7932460
79 2.8596818 8.9581837
80 2.7365174 10.2902512
81 4.7560973 6.4555758
82 4.6519084 8.9786777
83 4.9467471 11.2818536
84 5.6167284 5.2641380
85 9.4700525 2.9904731
86 4.7392906 11.3572521
87 3.1221908 6.3881556
88 5.6949432 7.4518023
89 5.1435241 10.8912283
90 2.1628966 10.5080671
91 3.6380837 15.0594135
92 5.3434709 7.4034042
93 -0.1298439 0.4832707
94 7.8759390 2.7411723
95 2.0898649 9.7687250
96 4.2131549 9.3175228
97 5.0648105 11.3943350
98 7.7225193 11.4180456
99 3.1018895 12.8890257
100 4.4166832 10.4901303

Select a range of rows from every n rows from a data frame

I have 2880 observations in my data.frame. I have to create a new data.frame in which, I have to select rows from 25-77 from every 96 selected rows.
df.new = df[seq(25, nrow(df), 77), ] # extract from 25 to 77
The above code extracts only row number 25 to 77 but I want every row from 25 to 77 in every 96 rows.
One option is to create a vector of indeces with which subset the dataframe.
idx <- rep(25:77, times = nrow(df)/96) + 96*rep(0:29, each = 77-25+1)
df[idx, ]
You can use recycling technique to extract these rows :
from = 25
to = 77
n = 96
df.new <- df[rep(c(FALSE, TRUE, FALSE), c(from - 1, to - from + 1, n - to))), ]
To explain for this example it will work as :
length(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19))) #returns
#[1] 96
In these 96 values, value 25-77 are TRUE and rest of them are FALSE which we can verify by :
which(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19)))
# [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#[23] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#[45] 69 70 71 72 73 74 75 76 77
Now this vector is recycled for all the remaining rows in the dataframe.
First, define a Group variable, with values 1 to 30, each value repeating 96 times. Then define RowWithinGroup and filter as required. Finally, undo the changes introduced to do the filtering.
df <- tibble(X=rnorm(2880)) %>%
add_column(Group=rep(1:96, each=30)) %>%
group_by(Group) %>%
mutate(RowWithinGroup=row_number()) %>%
filter(RowWithinGroup >= 25 & RowWithinGroup <= 77) %>%
select(-Group, -RowWithinGroup) %>%
ungroup()
Welcome to SO. This question may not have been asked in this exact form before, but the proinciples required have been rerefenced in many, many questions and answers,
A one-liner base solution.
lapply(split(df, cut(1:nrow(df), nrow(df)/96, F)), `[`, 25:77, )
Note: Nothing after the last comma
The code above returns a list. To combine all data together, just pass the result above into
do.call(rbind, ...)

R - apriori() not recognising lhs from numerical transaction

I am having real trouble getting my data to produce any rules using the arules package. I have managed to get 100000 rows of transaction data and in SAS the rules are shown. I cannot get it to work in R.
[5] {19,29,40,119,134}
[6] {24,40,45,67,141}
[7] {17,18,57,74,412}
[8] {16,79,90,150,498}
[9] {18,57,111,161,267}
[10] {11,75,131,427,429}
[11] {57,99,111,143,236}
The transactions data looks like this and originally came from a table where all the numbers were separate.
arules <- read.transactions('tid.csv', format = c("basket", "single"),
sep=",")
rules <- apriori(arules,parameter = list(supp = 0.1, conf = 0.1, target =
"rules"))
summary(rules)
For reference the supports and confidence settings make no difference. Sometimes I get this when I inspect the rules.
lhs rhs support confidence lift count
[1] {} => {8,11,96,112,432} 9.710623e-06 9.710623e-06 1 1
[2] {} => {62,134,222,254,412} 9.710623e-06 9.710623e-06 1 1
Any idea why apriori can't separate the items in the transaction? Does this need to be recast into long format and if so how would I do that form this data frame?
V2 V3 V4 V5 V6
8 11 96 112 432
10 35 39 76 119
18 38 68 141 267
29 36 57 61 63
19 29 40 119 134
24 40 45 67 141
17 18 57 74 412
If I understood you correctly then you should try this and let us know if it helped.
library(arules)
library(arulesViz)
#sample data
df <- read.table(text="V2 V3 V4 V5 V6
8 11 96 112 432
10 35 39 76 119
18 38 68 141 267
29 36 57 61 63
19 29 40 119 134
24 40 45 67 141
17 18 57 74 412", header=T)
write.csv(df, "apriori_demo.csv", row.names = F)
#convert sample data into transactions format for apriori algorithm
trx <- read.transactions("apriori_demo.csv", format="basket", sep=",", skip=1)
#apriori rules
apriori_rule <- apriori(trx, parameter = list(supp = 0.1, conf = 0.1))
#obviously you need to have better parameters compared to the one you have used in your post!
inspect(apriori_rule)
plot(apriori_rule, method="graph")

assign objects to dynamic lists in r

I have a nested loops which produce outputs that I want to store in list objects with dynamic names. A toy example of this would look as follows:
set.seed(8020)
names<-sample(LETTERS,5,replace = F)
for(n in names)
{
#Create the list
assign(paste0("examples_",n),list())
#Poulate the list
get(paste0("examples_",n))[[1]]<-sample(100,10)
get(paste0("examples_",n))[[2]]<-sample(100,10)
get(paste0("examples_",n))[[3]]<-sample(100,10)
}
Unfortunately I keep getting the error:
Error in get(paste0("examples_", n))[[1]] <- sample(100, 10) :
target of assignment expands to non-language object
I have tried all kind of assign, eval, get type of functions to parse the object, but haven't had any luck
Expanding on my comment with a worked example:
examples <- vector(mode="list", length=length(names) )
names(examples) <- names # please change that to mynames
# or almost anything other than `names`
examples <- lapply( examples, function(L) {L[[1]] <- sample(100,10)
L[[2]] <- sample(100,10)
L[[3]] <- sample(100,10); L} )
# Top of the output:
> examples
$P
$P[[1]]
[1] 34 49 6 55 19 28 72 42 14 92
$P[[2]]
[1] 97 71 63 59 66 50 27 45 76 58
$P[[3]]
[1] 94 39 77 44 73 15 51 78 97 53
$F
$F[[1]]
[1] 12 21 89 26 16 93 4 13 62 45
$F[[2]]
[1] 83 21 68 74 32 86 52 49 16 13
$F[[3]]
[1] 14 45 40 46 64 85 88 28 53 42
This mode of programming does become more natural over time. It gets you out of writing clunky for-loops all the time. Develop your algorithms for a single list-node at a time and then use sapply or lapply to iterate the processing.

Transpose with multiple variables and more than one metrics in R

I'm previously a SAS user - since I don't have SAS anymore I need to learn to use R for work.
The dataset has the following column:
market date sitename impression clicks
I want to transpose it into:
market date sitename-impression sitename-clicks
I think in SAS I used to do:
Proc Transpose
by market date;
id sitename;
var impression clicks;
run;
I do have a book on R and googled a lot, but couldn't find the solution that works...
Would really appreciate if anyone can help.
Thanks in advance!!!
Let me start by saying welcome to stackoverflow. Glad to have anew user. When you ask a question it's helpful and encouraged for you to provide the code you're using and a reproducible data set that looks like the original. This is called a minimal reproducible example. To get a data set into here you can use several options, here are two: use dput() around the object name and cut and paste what is displayed in the console or just post the dataframe directly. For the code provide all the code necessary to replicate your problem. I hope you find this helpful for future questions you'll ask.
I may not fully understand but I think you want to transform, not transpose, the data.
dat <- data.frame(market=rnorm(10), date=rnorm(10), #let's create a data set
sitename=rnorm(10), impression=rnorm(10), clicks=rnorm(10))
dat #look at it (I pasted it below)
# > dat
# market date sitename impression clicks
# 1 -0.9593797 -0.08411994 1.6079129 -0.5204772 -0.31633966
# 2 -0.5088689 1.78799500 -0.2469315 1.3476964 -0.04344779
# 3 -0.1527465 0.81673996 1.7824969 -1.5531260 -1.28304384
# 4 -0.7026194 0.52072913 -0.1174356 0.5722210 -1.20474443
# 5 -0.4537490 -0.69139062 1.1124277 -0.2452974 -0.33025320
# 6 0.7466588 0.36318337 -0.4623319 -0.9036768 -0.65754302
# 7 0.8007612 2.59588554 0.1820732 0.4318629 -0.36308748
# 8 1.0781715 -1.01512734 0.2297475 0.9219439 -1.15687902
# 9 0.3731450 -0.19004572 0.5190749 -1.4020371 -0.97370295
# 10 0.7724259 1.76528303 0.5781786 -0.5490849 -0.83819036
#now to create the new columns (I think this is what you want)
#the easiest way is to use transform. ?tranform for more
dat.new <- transform(dat, sitename.clicks=sitename-clicks,
impression.clicks=impression-clicks)
dat.new #here's the new data set. Notice it has the new and old columns.
#To get rid of the old columns you can use indexing and specify the columns you want.
dat.new[, c(1:2, 6:7)]
#We could have also done:
dat.new[, c(1,2,6,7)]
#or said the columns not wanted with negative indexing:
dat.new[, -c(3:5)]
EDIT In looking at Brian's comments and the variables I would think that a long to wide transformation is what the poster desires. I would likely approach it using Wickham's reshape2 package as well, as this method is easier for me to work with and I imagine it would be easier for an R beginner as well. However, here is a base way to do the long to wide format using the same data set Brian provided:
wide <- reshape(DF, v.names=c("impression", "clicks"), idvar=c("market", "date"),
timevar="sitename", direction="wide")
reshape(wide)
The reshape function is very flexible but takes some getting used to to use appropriately. I'm leaving my previous response up as well to keep the history of this post though I now believe this is not the posters intent. It serves as a reminder that a reproducible example is very helpful in providing clarity to your query.
Example data, as Tyler said, is important. I interpreted your question differently because I thought your data was different. I didn't take the - as a literal subtraction of numerics, but a combination of variables.
DF <- expand.grid(market = LETTERS[1:5],
date = Sys.Date()+(0:5),
sitename = letters[1:2])
n <- nrow(DF)
DF$impression <- sample(100, n, replace=TRUE)
DF$clicks <- sample(100, n, replace=TRUE)
I find the reshape2 package useful for these sort of transpositions/transformations/rearrangements.
library("reshape2")
dcast(melt(DF, id.vars=c("market","date","sitename")),
market+date~sitename+variable)
gives
market date a_impression a_clicks b_impression b_clicks
1 A 2012-02-28 74 97 11 71
2 A 2012-02-29 34 30 88 35
3 A 2012-03-01 40 85 40 49
4 A 2012-03-02 46 12 99 20
5 A 2012-03-03 6 95 85 56
6 A 2012-03-04 61 61 42 64
7 B 2012-02-28 4 53 74 9
8 B 2012-02-29 43 27 92 59
9 B 2012-03-01 34 26 86 43
10 B 2012-03-02 81 47 84 35
11 B 2012-03-03 3 5 91 48
12 B 2012-03-04 19 26 99 21
13 C 2012-02-28 22 31 100 53
14 C 2012-02-29 40 83 95 27
15 C 2012-03-01 78 89 81 29
16 C 2012-03-02 57 55 79 87
17 C 2012-03-03 37 61 3 97
18 C 2012-03-04 83 61 41 77
19 D 2012-02-28 81 18 47 3
20 D 2012-02-29 90 100 17 83
21 D 2012-03-01 12 40 35 93
22 D 2012-03-02 85 14 63 67
23 D 2012-03-03 63 53 29 58
24 D 2012-03-04 40 79 56 70
25 E 2012-02-28 97 62 68 31
26 E 2012-02-29 24 84 17 63
27 E 2012-03-01 94 93 32 2
28 E 2012-03-02 6 26 86 26
29 E 2012-03-03 100 34 37 80
30 E 2012-03-04 89 87 72 11
The column names have a _ between them rather than a -, but you can change that if you want. I wouldn't recommend it, though, because then you will have problems later referencing the column since the - will be taken as subtraction (you would need to quote the name).

Resources