Selecting rows in data.frame based on character strings - r

I've a data.frame with row.names as in test.
test <-
c("Env_1990:trait_KPS", "Env_1990:trait_SPSM", "Env_1990:trait_TKW",
"Env_1990:trait_Yield", "Env_1991:trait_KPS", "Env_1991:trait_SPSM",
"Env_1991:trait_TKW", "Env_1991:trait_Yield", "Env_1992:trait_KPS",
"Env_1992:trait_SPSM", "Env_1992:trait_TKW", "Env_1992:trait_Yield",
"Env_1993:trait_KPS", "Env_1993:trait_SPSM", "Env_1993:trait_TKW",
"Env_1993:trait_Yield", "Env_1994:trait_KPS", "Env_1994:trait_SPSM",
"Env_1994:trait_TKW", "Env_1994:trait_Yield", "Env_1995:trait_KPS",
"Env_1995:trait_SPSM", "Env_1995:trait_TKW", "Env_1995:trait_Yield",
"Gen_B88:Env_1990:trait_KPS", "Gen_B88:Env_1990:trait_SPSM",
"Gen_B88:Env_1990:trait_TKW", "Gen_B88:Env_1990:trait_Yield",
"Gen_B88:Env_1991:trait_KPS", "Gen_B88:Env_1991:trait_SPSM",
"Gen_B88:Env_1991:trait_TKW", "Gen_B88:Env_1991:trait_Yield",
"Gen_B88:Env_1992:trait_KPS", "Gen_B88:Env_1992:trait_SPSM",
"Gen_B88:Env_1992:trait_TKW", "Gen_B88:Env_1992:trait_Yield",
"Gen_B88:Env_1993:trait_KPS", "Gen_B88:Env_1993:trait_SPSM",
"Gen_B88:Env_1993:trait_TKW", "Gen_B88:Env_1993:trait_Yield")
I want to select only those rows that start with Env_. I tried this code in R:
grep(pattern="[Env_]", x=test)
This code returns all rows because Env_ appears in every row name. How can I select only the rows that start with Env_? Thanks in advance for your help.

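You want to add the ^ character, which anchors the match at the beginning of the string. Note also that [Env_] is a character class that matches any single one of the characters E, n, v or _, so drop the square brackets: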
> grep("^Env_", test)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
> grep("^Env_", test, value = TRUE)
[1] "Env_1990:trait_KPS" "Env_1990:trait_SPSM" "Env_1990:trait_TKW"
[4] "Env_1990:trait_Yield" "Env_1991:trait_KPS" "Env_1991:trait_SPSM"
[7] "Env_1991:trait_TKW" "Env_1991:trait_Yield" "Env_1992:trait_KPS"
[10] "Env_1992:trait_SPSM" "Env_1992:trait_TKW" "Env_1992:trait_Yield"
[13] "Env_1993:trait_KPS" "Env_1993:trait_SPSM" "Env_1993:trait_TKW"
[16] "Env_1993:trait_Yield" "Env_1994:trait_KPS" "Env_1994:trait_SPSM"
[19] "Env_1994:trait_TKW" "Env_1994:trait_Yield" "Env_1995:trait_KPS"
[22] "Env_1995:trait_SPSM" "Env_1995:trait_TKW" "Env_1995:trait_Yield"
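To go from matching names to the rows themselves, the same pattern can be applied to the row names directly. This is a minimal sketch, assuming a small hypothetical data frame df (the value column is made up for illustration) whose row names follow the same scheme as test:

```r
# Hypothetical data frame; the 'value' column is made up for illustration
rn <- c("Env_1990:trait_KPS", "Env_1991:trait_TKW",
        "Gen_B88:Env_1990:trait_KPS", "Gen_B88:Env_1991:trait_TKW")
df <- data.frame(value = c(1.2, 3.4, 5.6, 7.8), row.names = rn)

# "[Env_]" is a character class (any one of E, n, v, _), so every name matches
all_match <- grepl("[Env_]", rownames(df))

# "^Env_" only matches names that begin with the literal text Env_
keep <- grepl("^Env_", rownames(df))
env_rows <- df[keep, , drop = FALSE]  # drop = FALSE keeps a data.frame

# startsWith() (base R >= 3.3.0) does the same test without a regular expression
keep2 <- startsWith(rownames(df), "Env_")

print(env_rows)
```

grepl returns a logical vector, which is convenient for row indexing; grep(..., value = FALSE) returns integer positions and works equally well inside df[ , ].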

Related

compare the freq with the threshold and store characters of corresponding frequency [duplicate]

Let's call the data frame below df. I want to store in a vector the names of Factory whose freq is greater than 15 (the threshold).
Factory freq
1 F63F5C2CC9ADEC78 93
2 437D11819C8F3086 73
3 BCCFA6F2C54A964B 72
4 0C1DFC7996E98A98 60
5 4DBE085C274FC0D2 32
6 A8FCA1AD604D3A61 31
7 B33691F8279D733C 28
8 001DD6C2202E54F1 25
9 BBBC5737EFE9C6F5 25
10 09FDC29D7442958A 21
11 4A61DE171F2743E7 19
12 62131A16C832AB49 18
13 73DF23BF482EE5FE 18
14 793C792AE6E71D33 16
15 5F3A38C49F3C3296 6
16 923963E76AF1360D 6
17 D7055DCB51E1297A 6
18 1F4D81F7A9BC7031 4
19 898C2388F2312392 2
20 CAD1A7D01E482069 2
vec <- with(df, Factory[freq > 15])
vec
[1] "F63F5C2CC9ADEC78" "437D11819C8F3086" "BCCFA6F2C54A964B" "0C1DFC7996E98A98" "4DBE085C274FC0D2" "A8FCA1AD604D3A61"
[7] "B33691F8279D733C" "001DD6C2202E54F1" "BBBC5737EFE9C6F5" "09FDC29D7442958A" "4A61DE171F2743E7" "62131A16C832AB49"
[13] "73DF23BF482EE5FE" "793C792AE6E71D33"
Another easy option could be:
> v <- df[which(df$freq > 15), "Factory"]
> v
[1] "F63F5C2CC9ADEC78" "437D11819C8F3086" "BCCFA6F2C54A964B" "0C1DFC7996E98A98"
[5] "4DBE085C274FC0D2" "A8FCA1AD604D3A61" "B33691F8279D733C" "001DD6C2202E54F1"
[9] "BBBC5737EFE9C6F5" "09FDC29D7442958A" "4A61DE171F2743E7" "62131A16C832AB49"
[13] "73DF23BF482EE5FE" "793C792AE6E71D33"
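The two approaches above agree on this data, but they can differ when freq contains missing values. A small sketch, assuming a hypothetical four-row data frame with one NA in freq (these rows are made up, not taken from the question):

```r
# Hypothetical data with a missing frequency, for illustration only
df <- data.frame(Factory = c("A1", "B2", "C3", "D4"),
                 freq = c(93, NA, 16, 6),
                 stringsAsFactors = FALSE)
threshold <- 15

# Plain logical indexing keeps an NA entry where the comparison is NA
by_logical <- df$Factory[df$freq > threshold]

# which() keeps only positions where the comparison is TRUE, dropping NA
by_which <- df[which(df$freq > threshold), "Factory"]

print(by_logical)
print(by_which)
```

If freq can never be NA, either form is fine; which() is the safer default when that is not guaranteed.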

Data Manipulation in R for Apriori

I have part of a data set, shown below, in CSV form; the real data has more rows and columns than shown. I want to run apriori on this data set. Say I have this:
Maths Science C++ Java DC
[1] 75 44 55 56 88
[2] 56 88 54 78 44
The original data set has 30 columns (subjects) and 24 rows (students).
DATASET:link
I want to convert this data set into the form shown below:
[1] {Maths,DC}
[2] {Science,Java}
i.e. a list of lists (I think that is what it is called) containing the column names. The list for each student shows the subjects in which he/she scored 75 marks or more; the rest of the subjects are dropped (this is the only condition of the problem).
e.g. the first student scored 75+ marks in DC and Maths, so his list includes only DC and Maths.
I am sorry for posting this, but I searched a lot on Stack Overflow and found a few working suggestions, yet couldn't reach the final goal.
My goal is to get a form like this:
[9834] {semi-finished bread,
bottled water,
soda,
bottled beer}
[9835] {chicken,
tropical fruit,
other vegetables,
vinegar,
shopping bags}
As given by:
library(arules)
inspect(Groceries)
Alternatively, I would appreciate it if anyone can suggest a way to represent the data in another form that apriori can understand, but it should satisfy the condition stated above.
(Sorry for the long post. I hope converting my data set to this format will help me study the patterns in the student-subject data. Thanks a ton for all the help.)
library(plyr)
library(arules)
df <- read.table(text =
" 75 44 55 56 88
56 88 54 78 44")
names(df) <- c("Maths", "Science", "C++", "Java", "DC")
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {DC,Maths} 1
# [2] {Java,Science} 2
Edit: It works with your example dataset, too:
library(plyr)
library(arules)
df <- read.csv(file = url("https://drive.google.com/uc?export=download&id=0B3kdblyHw4qLR0dpT24xWUZGcGs"))
transactions <- as(alply(df, 1, function(x) names(x)[x >= 75]), "transactions")
inspect(transactions)
# items transactionID
# [1] {CD,CG,CN,DA,Data.Struc} 1
# [2] {CD,CG,CO,ML,OS} 2
# [3] {CN,Data.Struc,DC,DM,DMS} 3
# [4] {CHE,DD,DM,EC,EE} 4
# [5] {CHE,CN,MATHS,PHY} 5
# [6] {Data.Science,DM,DMS,ML,OS} 6
# [7] {CD,DA,Data.Struc,EC,MATHS} 7
# [8] {CG,CHE,CN,CO,OS} 8
# [9] {CN,CO,Data.Science,DC,DMS} 9
# [10] {DC,DD,EC,EE,PHY} 10
# [11] {CHE,DD,DMS,MATHS,PHY} 11
# [12] {CN,Data.Science,DM,MATHS,ML} 12
# [13] {CD,CG,DA,Data.Science,Data.Struc} 13
# [14] {CG,CO,EE,MATHS,OS} 14
# [15] {CN,CO,DC,DMS,PHY} 15
# [16] {CN,CO,DD,EC,EE} 16
# [17] {CHE,DA,EE,MATHS,PHY} 17
# [18] {Data.Science,DD,DM,ML,PHY} 18
# [19] {CD,CO,DA,Data.Struc,DC} 19
# [20] {CG,CO,DD,DM,OS} 20
# [21] {CG,CN,DA,DC,DMS} 21
# [22] {DD,EC,EE,ML,OS} 22
# [23] {CHE,CN,Data.Struc,MATHS,PHY} 23
# [24] {CG,Data.Science,DM,EE,ML} 24
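The per-student filtering step can also be written in base R, without plyr. This is only a sketch of that one step, using the two-row example from the question; the variable name items is an assumption for illustration:

```r
# Two-row example from the question; check.names = FALSE keeps the name "C++"
df <- data.frame(Maths = c(75, 56), Science = c(44, 88), `C++` = c(55, 54),
                 Java = c(56, 78), DC = c(88, 44), check.names = FALSE)

# For each row, keep the names of the columns whose mark is at least 75
items <- lapply(seq_len(nrow(df)), function(i) {
  names(df)[unlist(df[i, ]) >= 75]
})

str(items)
# With arules installed, as(items, "transactions") would coerce this list,
# as in the answer above.
```

Using lapply over row indices always returns a list, even when every student happens to pass the same number of subjects.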

How can I tell a for loop in R to regenerate a sample if the sample contains a certain pair of species?

I am creating 1000 random communities (vectors) from a species pool of 128, applying certain operations to each community and storing the result in a new vector. For simplicity, I have been practicing with 10 random communities from a species pool of 20. The problem is that there are a couple of pairs of species such that, if both members of a pair appear in a random community, I need that community to be thrown out and a new one generated. I have been able to code it so that a community (vector) containing such a pair is labeled NA. I also know how to tell the loop to skip that vector using the "next" command. But with both of these options, I do not get all of the communities that I need.
Here is my code using the NA option, but again that ends up shorting me communities.
C<-c(1:20)
D<-numeric(10)
X<- numeric(5)
for(i in 1:10){
X<-sample(C, size=5, replace = FALSE)
if("10" %in% X & "11" %in% X) X=NA else X=X
if("1" %in% X & "2" %in% X) X=NA else X=X
print(X)
D[i]<-sum(X)
}
print(D)
This is what my result looks like.
[1] 5 1 7 3 14
[1] 20 8 3 18 17
[1] NA
[1] NA
[1] 4 7 1 5 3
[1] 16 1 11 3 12
[1] 14 3 8 10 15
[1] 7 6 18 3 17
[1] 6 5 7 3 20
[1] 16 14 17 7 9
> print(D)
[1] 30 66 NA NA 20 43 50 51 41 63
Thanks so much!
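One possible way to get the full number of communities is to resample inside the loop until a community without a forbidden pair turns up. A minimal sketch, assuming the two forbidden pairs from the code above (10 with 11, and 1 with 2); the names pool, forbidden and has_forbidden are made up for illustration, and the seed is only there to make the run reproducible:

```r
set.seed(42)  # assumption: fixed seed only so the sketch is reproducible

pool <- 1:20
forbidden <- list(c(10, 11), c(1, 2))  # pairs that must not occur together

# TRUE if both members of any forbidden pair are present in x
has_forbidden <- function(x) {
  any(vapply(forbidden, function(p) all(p %in% x), logical(1)))
}

D <- numeric(10)
for (i in seq_along(D)) {
  repeat {
    X <- sample(pool, size = 5, replace = FALSE)
    if (!has_forbidden(X)) break  # keep this community, stop resampling
  }
  D[i] <- sum(X)
}
print(D)
```

Because the rejected draws are simply discarded, every slot of D is filled and no NA values appear.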

R creating dataframe

I have a sequence:
seq <- seq (5, 10)
and a vector of floats like:
values<-runif(20,0,15)
> values
[1] 3.9826299 5.5818585 8.5928005 13.6231168 3.0252290 13.4758453
[7] 14.1701290 9.9119669 9.4367107 0.9267941 3.0896186 2.6483513
[13] 10.3053427 5.7615558 11.5476213 7.4654886 10.7642776 14.8785914
[19] 5.7005277 11.6616783
I need to create a data frame whose first column contains the sequence and whose second column contains the count of numbers in values that are greater than the sequence number,
like
seq sum
1 5 15
2 6 12
3 7 12
4 8 11
5 9 10
6 10 8
If I understand correctly, something like this:
> set.seed(1)
> seq<-5:10
> values<-runif(20,0,15)
> values
[1] 3.9826299 5.5818585 8.5928005 13.6231168 3.0252290 13.4758453
[7] 14.1701290 9.9119669 9.4367107 0.9267941 3.0896186 2.6483513
[13] 10.3053427 5.7615558 11.5476213 7.4654886 10.7642776 14.8785914
[19] 5.7005277 11.6616783
> data.frame(seq,sum=sapply(seq,function(x)sum(values[values>x])))
seq sum
1 5 152.8775
2 6 135.8336
3 7 135.8336
4 8 128.3681
5 9 119.7753
6 10 100.4266
Edit: from your comment, it looks like you actually want this:
> data.frame(seq,sum=sapply(seq,function(x)sum(values>x)))
seq sum
1 5 15
2 6 12
3 7 12
4 8 11
5 9 10
6 10 8
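The same counts can be obtained without an explicit function, by comparing every value against every threshold at once. A sketch, assuming the values printed above are typed in directly (so it does not depend on the random number generator) and using the name thresholds instead of seq to avoid masking the base function:

```r
# Values copied from the output above
values <- c(3.9826299, 5.5818585, 8.5928005, 13.6231168, 3.0252290, 13.4758453,
            14.1701290, 9.9119669, 9.4367107, 0.9267941, 3.0896186, 2.6483513,
            10.3053427, 5.7615558, 11.5476213, 7.4654886, 10.7642776, 14.8785914,
            5.7005277, 11.6616783)
thresholds <- 5:10

# outer() builds a 20 x 6 logical matrix: element [i, j] is values[i] > thresholds[j]
# colSums() then counts the TRUE entries for each threshold
counts <- colSums(outer(values, thresholds, ">"))

result <- data.frame(seq = thresholds, sum = counts)
print(result)
```

For long vectors the intermediate matrix can get large, in which case the sapply version is the more memory-friendly choice.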

Merge values of a factor column

Column data$form contains 170 unique values (numbers from 1 to ~800).
I would like to merge some values (e.g. with a radius/step of 10).
I need to do this in order to use:
colors = rainbow(length(unique(data$form)))
In a plot and provide a better visual result.
Thank you in advance for your help.
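You can use %/% to group them, mean to combine them, and normalize to scale them (normalize is not a base R function; here it is assumed to be a helper that rescales its input to the range 0 to 1).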
# if you want specifically 20 groups:
groups <- sort(form) %/% (800/20)
x <- c(by(sort(form), groups, mean))
x <- normalize(x, TRUE) * 19 + 1
0 1 2 3 4
1.000000 1.971781 2.957476 4.103704 4.948560
5 6 7 8 9
5.950617 7.175309 7.996914 8.953086 9.952263
10 11 12 13 14
10.800705 11.901235 12.888889 13.772291 14.888889
15 16 17 18 19
15.927984 16.864198 17.918519 18.860082 20.000000
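For the fixed step of 10 mentioned in the question, the integer-division idea can be sketched as follows. The short form vector here is made up for illustration, and only base R functions are used:

```r
# Made-up values standing in for data$form
form <- c(3, 12, 17, 25, 31, 38, 799)
step <- 10

# Integer division maps each value to the lower edge of its bin of width 'step'
bin <- (form %/% step) * step

# One merged value per bin: here the mean of the values that fall into it
merged <- tapply(form, bin, mean)

print(bin)
print(merged)
```

Values 12 and 17 both land in bin 10 and are merged to 14.5, while isolated values such as 799 keep a bin of their own.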
You could also use cut. If you use the argument labels=FALSE, you get an integer value:
form <- runif(170, min=1,max=800)
> cut(form, breaks=20)
[1] (518,558] (280,320] (240,280] (121,160] (757,797]
[6] (160,200] (320,359] (598,638] (80.8,121] (359,399]
[11] (121,160] (200,240] ...
20 Levels: (1.18,41] (41,80.8] (80.8,121] (121,160] (160,200] (200,240] (240,280] (280,320] (320,359] (359,399] (399,439] ... (757,797]
> cut(form, breaks=20, labels=FALSE)
[1] 14 8 7 4 20 5 9 16 3 10 4 6 5 18 18 6 2 12
[19] 2 19 13 11 13 11 14 12 17 5 ...
On a side note, I would encourage you to reconsider plotting with rainbow colours, as they distort how the data is read, cf. Rainbow Color Map (Still) Considered Harmful.
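Putting the binning and the colour choice together, the integer bins from cut can index directly into a palette. A rough sketch, assuming R 3.6.0 or later for hcl.colors and using its "viridis" palette as one possible non-rainbow choice; the seed and the simulated form vector are only for illustration:

```r
set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
form <- runif(170, min = 1, max = 800)

n_bins <- 20
# labels = FALSE returns the bin number (1..n_bins) for each value
bins <- cut(form, breaks = n_bins, labels = FALSE)

# One colour per bin from a perceptually ordered palette (R >= 3.6.0)
pal <- hcl.colors(n_bins, palette = "viridis")
cols <- pal[bins]

# plot(form, col = cols, pch = 19)  # uncomment to draw the points
print(head(cols))
```

Swapping pal for rainbow(n_bins) reproduces the colouring from the question with the reduced number of groups.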
