I have a string like this one:
0|294|314|20|314|SC49TST57ASG75A|1428.0
Using R, I want to extract only the data between two | characters (for example, SC49TST57ASG75A), and then count only the numbers that are greater than 20. In this case the numbers are 49, 57 and 75, so the code should return 3.
I want to apply it on a column in a data frame.
Eventually, I want a new column that specifies, for each row, how many numbers greater than 20 there are inside the |....|.
Thanks!
You can try strsplit with split = '\\|'. If you only want to count the values between two pipes, you should also exclude the first and the last elements. Since you want elements greater than 20, the solution uses the > operator.
I am assuming here that every row has the same structure as the example given in your question.
st <- '0|294|314|20|314|SC5GSC12ASG266T|1428.0'
Solution:
lapply(strsplit(st, '\\|'), function(x)sum(as.numeric(x[2:(length(x)-1)]) > 20, na.rm=TRUE))
I am not sure if this is what you are looking for; if not, please tell me your expected result.
cnt <- Map(function(x) sum(as.numeric(x) > 20),
           regmatches(r <- unlist(regmatches(s, gregexpr("(?<=\\|).*?(?=\\|)", s, perl = TRUE))),
                      gregexpr("\\d+\\.?\\d+?", r)))
such that
> cnt
[[1]]
[1] 1
[[2]]
[1] 1
[[3]]
[1] 0
[[4]]
[1] 1
[[5]]
[1] 1
DATA
s <- "0|294|314|20|314|SC5GSC12ASG266T|1428.0"
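For the data frame case described in the question, one possible sketch is below. It assumes the field of interest is always the sixth pipe-delimited field and that the strings live in a column called raw. The column name, the second sample row, and the helper count_big are made up for illustration and are not part of the answers above.

```r
# Hypothetical data frame; the column name `raw` is an assumption.
df <- data.frame(
  raw = c("0|294|314|20|314|SC49TST57ASG75A|1428.0",
          "0|294|314|20|314|SC5GSC12ASG266T|1428.0"),
  stringsAsFactors = FALSE
)

# Count the numbers greater than `threshold` inside one pipe-delimited field.
count_big <- function(x, field = 6, threshold = 20) {
  parts <- strsplit(x, "|", fixed = TRUE)
  vapply(parts, function(p) {
    # Pull every run of digits out of the chosen field.
    digits <- regmatches(p[field], gregexpr("[0-9]+", p[field]))[[1]]
    sum(as.numeric(digits) > threshold)
  }, integer(1))
}

df$n_big <- count_big(df$raw)
df$n_big
# [1] 3 1
```

The first row contains 49, 57 and 75, so it yields 3. The second contains 5, 12 and 266, so it yields 1.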
I am simply extracting a single row from a data.frame. Consider for example
d=data.frame(a=1:3,b=1:3)
d[1,] # returns a data.frame
# a b
# 1 1 1
The output matched my expectation. The result was not as I expected though when dealing with a data.frame that contains a single column.
d=data.frame(a=1:3)
d[1,] # returns an integer
# [1] 1
Indeed, here the extracted data is not a data.frame anymore but an integer! To me, it seems a little strange that the same function on the same data type returns different data types. One of the issues with this conversion is the loss of the column name.
To solve the issue, I did
extractRow = function(d,index)
{
if (ncol(d) > 1)
{
return(d[index,])
} else
{
d2 = as.data.frame(d[index,])
names(d2) = names(d)
return(d2)
}
}
d=data.frame(a=1:3,b=1:3)
extractRow(d,1)
# a b
# 1 1 1
d=data.frame(a=1:3)
extractRow(d,1)
# a
# 1 1
But it seems unnecessarily cumbersome. Is there a better solution?
Just subset with the drop = FALSE option:
extractRow = function(d, index) {
return(d[index, , drop=FALSE])
}
R tries to simplify data.frame subsets by default. The same thing happens with columns:
d[, "a"]
# [1] 1 2 3
Alternatives are:
d[1, , drop = FALSE]
tibble::tibble which has drop = FALSE by default
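A small base R sketch of the difference, using the single-column data frame from the question. The variable names row_vec and row_df are just for illustration.

```r
d <- data.frame(a = 1:3)

# Default behaviour: a one-column result is simplified to a plain vector.
row_vec <- d[1, ]
class(row_vec)
# [1] "integer"

# With drop = FALSE the result stays a data.frame and keeps its column name.
row_df <- d[1, , drop = FALSE]
class(row_df)
# [1] "data.frame"
names(row_df)
# [1] "a"
```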
I can't tell you why that happens; it seems weird. One workaround would be to use slice from dplyr (although using a library seems unnecessary for such a simple task).
library(dplyr)
slice(d, 1)
a
1 1
data.frames will simplify to vectors or scalars with base subsetting [,].
If you want to avoid that, you can use tibbles instead:
> tibble(a=1:2)[1,]
# A tibble: 1 x 1
a
<int>
1 1
tibble(a=1:2)[1,] %>% class
[1] "tbl_df" "tbl" "data.frame"
I have a data frame where I need to compare two columns and find the number of matching characters between two elements.
For example, x and y are two elements to be compared, which look like this:
x<- "1/2"
y<-"2/3"
I split them by '/' and unlisted the result, as below:
unlist(strsplit(x,"/"))->a
unlist(strsplit(y,"/"))->b
Then I used pmatch:
pmatch(a,b,nomatch =0)
[1] 0 1
I used sum() to count how many characters match:
sum(pmatch(a,b,nomatch =0))
[1] 1
However, when the comparison is done the other way:
pmatch(b,a,nomatch = 0)
[1] 2 0
Since there is only one match between the two strings, why is it showing 2? It could be an index. But I need to get how many characters are the same between the strings, irrespective of whether the comparison is a vs b or b vs a.
Could someone help me with how to get this?
Per ?pmatch, pmatch seeks matches for the elements of its first argument among those of its second.
For example, "2" in the first vector matches the second element of the second vector, so the result contains the index 2.
> pmatch(c("2", "1"),c("3","2"),nomatch =0)
# [1] 2 0
One way to count how many elements were matched is to count the non-zero entries:
sum(pmatch(c("2", "1"),c("3","2"),nomatch =0) != 0)
# [1] 1
Both
sum(pmatch(b, a, nomatch = 0) != 0) # 1
sum(pmatch(a, b, nomatch = 0) != 0) # 1
return the same value.
Another option could be
sum(b %in% a)
[1] 1
sum(a %in% b)
[1] 1
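Since the question mentions comparing two columns of a data frame, here is a rough sketch of a row-wise version. The data frame, its column names x and y, and the helper count_shared are assumptions for illustration. It uses intersect(), so repeated elements within one string are counted only once.

```r
# Hypothetical data; column names `x` and `y` are assumptions.
df <- data.frame(
  x = c("1/2", "1/2", "3/4"),
  y = c("2/3", "1/2", "5/6"),
  stringsAsFactors = FALSE
)

# For each row, split both strings on "/" and count the shared elements.
count_shared <- function(x, y) {
  a <- strsplit(x, "/", fixed = TRUE)
  b <- strsplit(y, "/", fixed = TRUE)
  mapply(function(p, q) length(intersect(p, q)), a, b)
}

df$shared <- count_shared(df$x, df$y)
df$shared
# [1] 1 2 0
```

Because intersect() is symmetric, count_shared(df$y, df$x) gives the same counts.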
I've figured out that if I use as.character(df[x,y]) or as.<whatever>(df[x,y]) I can get/coerce what I need, every time, from my data frames.
What I can't seem to find/figure out is why. Details below.
When I access df[1,1] (or anything in column 1) I get
df[1,1]
[1] a
Levels: a b c
but when I access 1,3 it works fine
> df[1,3]
[1] 10
but then when I use as.character() it works.
> as.character(df[1,1])
[1] "a"
The data frame was built using this line
df = data.frame(names = c("a","b","c"), size = c(1,2,3),num = c(10,20,30) )
> df
names size num
1 a 1 10
2 b 2 20
3 c 3 30
But in this data frame
imp2met = read.csv('tomet.csv', header = TRUE, sep=",",dec='.')
> imp2met
unit mult ret
1 (yd) 0.9100 (m)
2 (in) 2.5200 (cm)
3 .....
I get these results for 1,3
> imp2met[1,3]
[1] (m)
Levels: (c) (cm) (cm^2) ....
>
> as.character(imp2met[1,3])
[1] "(m)"
So why the "random" results? Why do I need as.<whatever>() but only some of the time?
In R versions before 4.0.0, data.frame (and read.csv) converts character vectors to factors by default. You can change this with the argument stringsAsFactors=FALSE.
Also, when you subset a data frame using [, you can add the drop=FALSE argument to keep the result as a data frame instead of having it simplified to a vector.
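A short sketch of the difference, reusing the data from the question. The names df_fct and df_chr are made up here, and stringsAsFactors is set explicitly in both calls so the result does not depend on the R version.

```r
# Character column stored as a factor.
df_fct <- data.frame(names = c("a", "b", "c"), num = c(10, 20, 30),
                     stringsAsFactors = TRUE)
class(df_fct[1, 1])
# [1] "factor"
as.character(df_fct[1, 1])
# [1] "a"

# Character column kept as character; no as.character() needed.
df_chr <- data.frame(names = c("a", "b", "c"), num = c(10, 20, 30),
                     stringsAsFactors = FALSE)
class(df_chr[1, 1])
# [1] "character"

# The numeric column is unaffected either way.
class(df_fct[1, 2])
# [1] "numeric"
```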
Suppose we have a vector:
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
Expected output:
v_index <- c(5,6,7)
v always starts and ends with 0. There is only one possible cluster of zeros between two 1s.
Seems simple enough, but I can't get my head around it...
I think this will do
which(cumsum(v == 1L) == 1L)[-1L]
## [1] 5 6 7
The idea here is to use cumsum to give every element a running group number that increases at each "one", select the first group (where the running count equals 1), and drop its first element, which is the "one" itself (because you only want the zeroes).
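To make the intermediate steps visible, the one-liner can be unrolled as in the sketch below. The names grp and in_first are only for illustration.

```r
v <- c(0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0)

# Running count of ones seen so far.
grp <- cumsum(v == 1L)
grp
# [1] 0 0 0 1 1 1 1 2 3 4 4 4

# Positions where exactly one "one" has been seen: the first 1 and the zeros after it.
in_first <- which(grp == 1L)
in_first
# [1] 4 5 6 7

# Drop the position of the 1 itself to keep only the zeros.
v_index <- in_first[-1L]
v_index
# [1] 5 6 7
```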
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
v_index<-seq(which(v!=0)[1]+1,which(v!=0)[2]-1,1)
> v_index
[1] 5 6 7
Explanation: I ask which indices are not equal to 0:
which(v!=0)
then I take the first and second index from that vector and create a sequence out of it.
This is probably one of the simplest answers out there. Find which items are equal to one, then produce a sequence using the first two indexes, incrementing the first and decrementing the other.
block <- which(v == 1)
start <- block[1] + 1
end <- block[2] - 1
v_index <- start:end
v_index
[1] 5 6 7
Newbie R question. Sorry to ask: I'm sure it's been answered, but it's one that's hard to search, apparently. I've read the man page for var (variance), but I don't understand it. Checked books, web pages (OK, only two books). I'll wait for someone to point me to an existing answer ....
> df
first second
1 1 3
2 2 5
3 3 7
> df[,2]
[1] 3 5 7
> var(df[,2])
[1] 4
OK, so far, so good.
> df[1,]
first second
1 1 3
> var(df[1,])
first second
first NA NA
second NA NA
Huh??
Thanks in advance.
The first issue is that you get a different class of object when you select a row from a data.frame than when you select a column:
df = data.frame(first=c(1, 2, 3), second=c(3, 5, 7))
class(df[, 2])
[1] "numeric"
class(df[1, ])
[1] "data.frame"
# But you can explicitly convert the row to a plain vector with as.numeric.
var(as.numeric(df[1, ]))
# [1] 2
The second issue is that var() treats a data.frame quite differently. It treats each column as a variable and computes a matrix of variances and covariances by comparing each column to every other column:
# Create a data frame with some random data.
dat = data.frame(first=rnorm(20), second=rnorm(20), third=rnorm(20))
var(dat)
# first second third
# first 0.98363062 -0.2453755 0.04255154
# second -0.24537550 1.1177863 -0.16445768
# third 0.04255154 -0.1644577 0.58928970
var(dat$third)
# [1] 0.5892897
cov(dat$first, dat$second)
# [1] -0.2453755
If you know that a data.frame is all numeric and want it to be available for both row and column operations, then convert it to a matrix:
dat = data.frame(first=rnorm(20), second=rnorm(20), third=rnorm(20))
dm <- data.matrix(df)  # df is the data frame from the question
var(dm[1,])
#[1] 2
(The same conversion happens when you use apply(): the list structure is lost and each row is coerced to a single common type.)
> apply(dat, 1, var)
[1] 0.45998066 1.51241166 0.13634927 0.49981030 0.04440448 1.21224067 0.28113135 0.57968597
[9] 0.26102036 0.41999510 1.01237100 0.17304770 0.50572223 1.17825272 1.39342510 2.94125062
[17] 1.18640684 2.15245595 3.06482195 0.96396008
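With the small data frame from the question, the row-wise result can be checked by hand. This is only a sketch; the name row_vars is made up for illustration.

```r
df <- data.frame(first = c(1, 2, 3), second = c(3, 5, 7))

# On the whole data frame, var() returns a 2 x 2 covariance matrix of the columns.
dim(var(df))
# [1] 2 2

# apply() converts the data frame to a matrix, then computes var() for each row.
row_vars <- unname(apply(df, 1, var))
row_vars
# [1] 2.0 4.5 8.0
```

The first value matches var(c(1, 3)), which is 2.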