In a named vector, I want the entries whose names start with "good", along with their frequencies. I can get the matching names but not the frequencies.
v <- c(10,20,30,40,50)
names(v) <- c("good afternoon", "hi", "this","good morning","what")
v
# does not work: grep searches the values of v, not its names, so nothing matches
grep("^good",v,value = TRUE)
# this works but returns only the names, not the frequencies
grep("^good",names(v),value = TRUE)
I'm not entirely clear what you're asking.
You could stack the vector to give a data.frame with two columns: values, holding what I take to be your frequencies, and ind, holding the expressions.
stack(v)
# values ind
#1 10 good afternoon
#2 20 hi
#3 30 this
#4 40 good morning
#5 50 what
Then to get the frequency and expression that matches your regexp you could do
stack(v)[grep("^good", stack(v)$ind), ]
# values ind
#1 10 good afternoon
#4 40 good morning
In response to your comment, is this what you're after?
v[grep("^good", names(v))]
#good afternoon good morning
# 10 40
The returned object is again a named vector: its entries are the frequencies and its names are the matching expressions.
If you want the number of hits:
length(grep("^good",names(v)))
# [1] 2
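The subsetting above can also be written with a logical index instead of grep. This is only a minimal sketch, assuming R 3.3.0 or later for startsWith(); the name hits is illustrative and not from the original posts.

```r
v <- c(10, 20, 30, 40, 50)
names(v) <- c("good afternoon", "hi", "this", "good morning", "what")

# Logical index: TRUE where the name begins with "good"
hits <- startsWith(names(v), "good")

# Subsetting with the logical index keeps both names and values
v[hits]

# Number of matching names
sum(hits)
```

A logical index has the same length as v, so it can be combined with other conditions using & or |.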
I have the following data with two observations per subject:
SUBJECT <- c(8,8,10,10,11,11,15,15)
POSITION <- c("H","L","H","L","H","L","H","L")
TIME <- c(90,90,30,30,30,30,90,90)
RESPONSE <- c(5.6,5.2,0,0,4.8,4.9,1.2,.9)
DATA <- data.frame(SUBJECT,POSITION,TIME,RESPONSE)
I want the rows of DATA which have SUBJECT numbers that are in a vector, V:
V <- c(8,10,10)
How can I obtain both observations from DATA whose SUBJECT number is in V and have those observations repeated the same number of times as the corresponding SUBJECT number appears in V?
Desired result:
SUBJECT <- c(8,8,10,10,10,10)
POSITION <- c("H","L","H","L","H","L")
TIME <- c(90,90,30,30,30,30)
RESPONSE <- c(5.6,5.2,0,0,0,0)
OUT <- data.frame(SUBJECT,POSITION,TIME,RESPONSE)
I thought some variation of the %in% operator would do the trick but it does not account for repeated subject numbers in V. Even though a subject number is listed twice in V, I only get one copy of the corresponding rows in DATA.
I could also create a loop and append matching observations but this piece is inside a bootstrap sampler and this option would dramatically increase computation time.
merge is your friend:
merge(list(SUBJECT=V), DATA)
# SUBJECT POSITION TIME RESPONSE
#1 8 H 90 5.6
#2 8 L 90 5.2
#3 10 H 30 0.0
#4 10 L 30 0.0
#5 10 H 30 0.0
#6 10 L 30 0.0
As @Frank implies, this logic can be translated to data.table, dplyr, SQL, or anything else that can handle a left join.
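Since the question mentions a bootstrap sampler, an index-based variant may also be worth trying. The sketch below is an alternative to merge, not the answerer's method; rows_by_subject and idx are illustrative names. It assumes every value in V occurs in DATA$SUBJECT.

```r
DATA <- data.frame(
  SUBJECT  = c(8, 8, 10, 10, 11, 11, 15, 15),
  POSITION = c("H", "L", "H", "L", "H", "L", "H", "L"),
  TIME     = c(90, 90, 30, 30, 30, 30, 90, 90),
  RESPONSE = c(5.6, 5.2, 0, 0, 4.8, 4.9, 1.2, 0.9)
)
V <- c(8, 10, 10)

# Row numbers of DATA grouped by subject (a list named by subject)
rows_by_subject <- split(seq_len(nrow(DATA)), DATA$SUBJECT)

# Look up each element of V in turn, so repeats in V repeat the rows
idx <- unlist(rows_by_subject[as.character(V)], use.names = FALSE)

OUT <- DATA[idx, ]
rownames(OUT) <- NULL
OUT
```

The split step depends only on DATA, so inside a resampling loop it could be computed once and reused for every new V.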
I'm working in R and I have a data frame containing epigenetic information. I have 300,000 rows containing genomic locations and 15 columns each of which identifies a transcription factor motif that may or may not occur at each locus.
I'm trying to use regular expressions to count how many times each transcription factor occurs at each genomic locus. Individual motifs can occur > 15 times at any one locus, so I'd like the output to be a matrix/data frame containing motif counts for each individual cell of the data frame.
A typical single occurrence of a motif in a cell could be:
2212(AATTGCCCCACA,-,0.00)
Whereas if there were multiple occurrences of a motif, these would exist in the cell as a continuous string each entry separated by a comma, for example for two entries:
144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)
Here is some toy data:
df <-data.frame(NAMES = c('LOC_A', 'LOC_B', 'LOC_C', 'LOC_D'),
TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)", "2(TGTGAGTCAC,+,0.00)", "0", "0"),
TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),", "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "0"),
stringsAsFactors = FALSE)
I'd be looking for the output in the following format:
NAMES TFM1 TFM2
LOC_A 2 2
LOC_B 1 1
LOC_C 0 1
LOC_D 0 0
If possible, I'd like to avoid for loops, but if loops are required, so be it. To get row counts for this data frame I used the following code, kindly recommended by @akrun:
library(stringr) # provides str_extract_all
df$MotifCount <- Reduce(`+`, lapply(df[-1],
function(x) lengths(str_extract_all(x, "\\d+\\("))))
Notice that the unique identifier for the motifs used here is "\\d+\\(" to pick up the number and opening bracket at the start of each motif identification string. This would have to be included in any solution code. Something similar which worked across the whole data frame to provide individual cell counts would be ideal.
Many Thanks
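We don't need the Reduce part. Apply the count to each column and keep the result as a data frame (str_extract_all comes from the stringr package):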
data.frame(c(df[1],lapply(df[-1], function(x) lengths(str_extract_all(x, "\\d+\\(")))) )
# NAMES TFM1 TFM2
#1 LOC_A 2 2
#2 LOC_B 1 1
#3 LOC_C 0 1
#4 LOC_D 0 0
This will also work:
cbind.data.frame(df[1],sapply(lapply(df[-1], function(x) str_extract_all(x, "\\d+\\(")), function(x) lapply(x, length)))
# NAMES TFM1 TFM2
#1 LOC_A 2 2
#2 LOC_B 1 1
#3 LOC_C 0 1
#4 LOC_D 0 0
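If adding the stringr dependency is undesirable, the same per-cell count can be sketched in base R with gregexpr and regmatches. This is an illustrative alternative, not the answerers' code; count_motifs and counts are hypothetical names. It uses the same "\\d+\\(" identifier as the question.

```r
df <- data.frame(
  NAMES = c("LOC_A", "LOC_B", "LOC_C", "LOC_D"),
  TFM1 = c("144(TGTGAGTCAC,+,0.00),145(GTGAGTCACT,-,0.00)",
           "2(TGTGAGTCAC,+,0.00)", "0", "0"),
  TFM2 = c("632(TAAAGAGTCAC,-,0.00),60(GTCCCTCACT,-,0.00),",
           "7(TGTGAGTCAC,+,0.00)", "7(TGTGAGTCAC,+,0.00)", "0"),
  stringsAsFactors = FALSE
)

# Count occurrences of "digits followed by (" in each string of a vector
count_motifs <- function(x) {
  lengths(regmatches(x, gregexpr("\\d+\\(", x)))
}

# Replace every motif column by its per-cell counts, keeping NAMES as is
counts <- df
counts[-1] <- lapply(df[-1], count_motifs)
counts
```

Cells with no motif (here "0") produce an empty match list, so their count is 0.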
I load the supportdata variable as follows.
supportdata=aggregate(scoredata$Support, list(Topic = scoredata$Topic), sum)
slices <- supportdata[2]
lbls <- supportdata[1]
typeof(slices)
3D Exploded Pie Chart Below
pie3D(slices,labels=lbls,explode=0.1,main="Year wise scores for topic 1")
and I get the below error:
Error in pie3D(slices, labels = lbls, explode = 0.1, main = "Year wise
scores for topic 1") :pie3D: x values must be positive numbers
The supportdata variable contains the following information. It is generated with the aggregate function, which sums the scores in the second column.
# supportdata
#
# Topic x
#
# 1 c 14
# 2 c# 80
# 3 c++ 15
# 4 css 4
# 5 html 3
# 6 .net 3
# 7 php 0
# 8 sql 0
How do I get rid of this error? I tried searching but couldn't find a solution to this problem. I tried casting with as.numeric and as.integer, but R says the list cannot be coerced to type double or integer.
Your problem is indexing with [ rather than [[. Single brackets return a one-column data frame (a list), whereas double brackets return the column itself as a numeric vector.
library("plotrix")
pie3D(supportdata[[2]],labels=supportdata[[1]],
explode=0.1,main="Year wise scores for topic 1")
works fine, as does
with(supportdata,pie3D(x,labels=Topic,
explode=0.1,main="Year wise scores for topic 1"))
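The difference between the two kinds of indexing can be checked without plotting anything. The sketch below rebuilds a small part of supportdata by hand (an assumption, since the original scoredata is not shown); one_col and vec are illustrative names.

```r
supportdata <- data.frame(
  Topic = c("c", "c#", "c++", "css"),
  x     = c(14, 80, 15, 4)
)

one_col <- supportdata[2]    # still a data frame, i.e. a list with one column
vec     <- supportdata[[2]]  # the column itself, a numeric vector

class(one_col)       # "data.frame"
is.numeric(one_col)  # FALSE
is.numeric(vec)      # TRUE
```

This is also why as.numeric(slices) failed in the question: slices was a list-like data frame, not a vector.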
Besides the solution provided by Ben, the following also works:
slices <- t(supportdata[2])
lbls <- t(supportdata[1])
pie3D(slices,labels=lbls,explode=0.1,main="Pie Diagram for Support")
I'm trying to do something very simple: to run a loop through a vector of names and use those names in my code.
geo = c(rep("AT",3),rep("BE",3))
time = c(rep(c("1990Q1","1990Q2","1990Q3"),2))
value = c(1:6)
Data <- data.frame(geo,time,value)
My real dataset has 14 countries and 75 time periods. I would like to find a function which for example loops through the countries, then subsets them so I have the single datasets such as:
data_AT <- subset(Data, (Data$geo=="AT"))
data_BE <- subset(Data, (Data$geo=="BE"))
but with a loop and ideally with a solution I can apply to other functions as well :-)
In my mind, this should look something like this:
codes <- unique(Data$geo)
for (i in 1:length(codes))
{k <- codes[i]
data_(k) <- subset(Data, (Data$geo==k))}
however subset doesn't work like this, neither do other functions. I think my problem is that I don't know how to address the respective name which "k" has taken (e.g. "AT") as part of my code. If at all possible, I would very much appreciate an answer with a general solution of how I can run a function through a vector containing text and use each element of that vector in my code. Maybe in the direction of the apply functions? Though I'm not getting very far with that either...
Any help would be very much appreciated!
I use loops for similar purposes too, for example when saving plots for different subsets. It may not be the fastest way, but at least I understand it.
There is no need to loop over the length of the vector; you can loop over the vector itself. To convert a string to a variable name, you can use assign.
geo = c(rep("AT",3),rep("BE",3))
time = c(rep(c("1990Q1","1990Q2","1990Q3"),2))
value = c(1:6)
Data <- data.frame(geo,time,value)
codes <- sort(unique(Data$geo))
for (k in codes) {
name<-paste("data", k, sep="_")
assign(name, subset(Data, (Data$geo==k)))
}
BTW, filter from the dplyr package is often faster than subset on large data.
In R, you would typically do this with a list of data.frames instead of several separate data.frames:
lst <- split(Data, Data$geo)
lst
#$AT
# geo time value
#1 AT 1990Q1 1
#2 AT 1990Q2 2
#3 AT 1990Q3 3
#
#$BE
# geo time value
#4 BE 1990Q1 4
#5 BE 1990Q2 5
#6 BE 1990Q3 6
Now you can access each element (which is a data.frame) by typing:
lst[["AT"]]
# geo time value
#1 AT 1990Q1 1
#2 AT 1990Q2 2
#3 AT 1990Q3 3
If you have a vector of country names for which you want to add +1 to the value column, you can do it like this:
cntrs <- c("BE", "AT")
lst[cntrs] <- lapply(lst[cntrs], function(x) {x$value <- x$value + 1; return(x)} )
#$BE
# geo time value
#4 BE 1990Q1 5
#5 BE 1990Q2 6
#6 BE 1990Q3 7
#
#$AT
# geo time value
#1 AT 1990Q1 2
#2 AT 1990Q2 3
#3 AT 1990Q3 4
Edit: if you really want to stick with a for loop, I recommend not splitting the data into several separate data.frames but running the loop on the whole data set, for example like this:
cntrs <- "BE"
for(i in cntrs){
Data$value[Data$geo == i] <- Data$value[Data$geo == i] + 1
}
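To connect the list approach back to the original request for a general way of running a function over a vector of names, here is a small sketch. It reuses the toy Data from the question; totals is an illustrative name.

```r
geo   <- c(rep("AT", 3), rep("BE", 3))
time  <- rep(c("1990Q1", "1990Q2", "1990Q3"), 2)
value <- 1:6
Data  <- data.frame(geo, time, value)

lst <- split(Data, Data$geo)

# Loop over the country codes and use each one to pick its data frame
for (k in names(lst)) {
  cat(k, "has", nrow(lst[[k]]), "rows\n")
}

# Or apply a function to every country at once; the result is named by code
totals <- sapply(lst, function(d) sum(d$value))
totals
```

Because the names of lst are the country codes, the string held in k is enough to reach the matching data set, and no separately named objects such as data_AT are needed.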
I am trying to modify a R script but I have only basic experience with R:
question 1:
In the line for (i in 1:nrow(x)), what does the integer 1 actually do? Changing the value to 2 or higher seems to have a big effect on the output.
question 2:
I have been getting the message:
"Error in if (p[2] > a + b * p[1]) { :
missing value where TRUE/FALSE needed"
. In general, what might be causing this?
Any help is much appreciated!
question edited:
Say I have a dataframe for plotting scatterplot. The dataframe would be organized in the following fashion (in CSV format):
name ABC EFG
1 32 45
2 56 67
to, say 200 000 entries
I am going to first do a scatterplot, after which I am going to subset a portion of the dataset into A using alphahull and export them as XYZ. The script for doing this:
# plot first plot containing all data
plot(x = X$ABC,
y = X$EFG,
pch=20
)
#subset data using ahull. choose 4 points on the plot
A <- ahull(locator(4, type="p", pch=20), alpha=10000)
#exporting subset
XYZ <- NULL
for (i in 1:nrow(X)) { if (inahull(A, c(X$ABC[i],X$EFG[i]))) XYZ <- rbind(XYZ,X[i,])}
I get the following message if the number of data points in the subset that I choose is too large: Error in if (p[2] > a + b * p[1]) { :
missing value where TRUE/FALSE needed
Question 1 - this is a for loop. It executes once for each row of the matrix or data frame x (I'm not sure what x is here exactly), with i taking the values 1, 2, and so on up to nrow(x). Changing the 1 to 2 starts the loop at row 2, so the first row is skipped and the loop runs one time fewer. Without the rest of the code I can't say much else.
Question 2 - can you post the whole code? The if statement needs its condition to evaluate to TRUE or FALSE, and apparently one or more of the values involved is missing (NA).
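The error in question 2 can be reproduced in isolation. The sketch below uses made-up values for p, a and b (they only mirror the names in the error message and are not taken from alphahull) to show how an NA in the condition triggers it and how isTRUE can guard against it.

```r
p <- c(1, NA)  # hypothetical point whose second coordinate is missing
a <- 0
b <- 1

# The comparison yields NA, and if() cannot choose a branch for NA
result <- tryCatch(
  if (p[2] > a + b * p[1]) "above" else "below",
  error = function(e) conditionMessage(e)
)
result  # holds the error message instead of "above" or "below"

# isTRUE() turns NA into FALSE, so the condition is always usable
safe <- isTRUE(p[2] > a + b * p[1])
safe
```

Whether a guard like this is appropriate depends on why the value is missing; checking the input with anyNA() first is usually the better starting point.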
Say you have data x
set.seed(123) # for reproducibility
x<-as.data.frame(rnorm(10)) # generate 10 random numbers and store them as a data frame
k<-2 # assign k as 2
for (i in (1:nrow(x))){
cat("this is row",i,"\n")
show (k)
k<-k+i
}
show (k)
this is row 1
[1] 2
this is row 2
[1] 3
this is row 3
[1] 5
this is row 4
[1] 8
this is row 5
[1] 12
this is row 6
[1] 17
this is row 7
[1] 23
this is row 8
[1] 30
this is row 9
[1] 38
this is row 10
[1] 47
> show (k)
[1] 57
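To see directly what the starting value in 1:nrow(x) does, one can record which rows each loop visits. This is a small illustrative sketch; the data frame and the names visited_from_1, visited_from_2 and empty are made up for the example.

```r
x <- data.frame(val = c(10, 20, 30))

# Starting at 1 visits every row
visited_from_1 <- integer(0)
for (i in 1:nrow(x)) visited_from_1 <- c(visited_from_1, i)

# Starting at 2 skips the first row
visited_from_2 <- integer(0)
for (i in 2:nrow(x)) visited_from_2 <- c(visited_from_2, i)

visited_from_1
visited_from_2

# With zero rows, 1:nrow() counts down (1, 0) while seq_len() is empty
empty <- x[0, , drop = FALSE]
length(1:nrow(empty))
length(seq_len(nrow(empty)))
```

For that reason seq_len(nrow(x)) is often preferred over 1:nrow(x) when the input might be empty.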