search match and replace in R - r

I have data with two columns. What I want is for column block, search until you reach the first block-1, then replace all following T with block-1, keep doing this for blocks 2 and 3 and replace if they have any T. I could not get my head around how to do this in R
What I want is

Related

Remove the numbers and the period after numbers in a dataframe in R?

I have a mock-up dataframe representing some of the confidential data I have and it looks like this:
Name Value
1. AaaaBaCCCaaa.x 1
2. AbbAbbKalllNBN.y 2
3. CCCdddEfffFg.x 8
4. ZZZtTThGGtGGGG.y 1
...
9. AAAHHHhhhhIIIIII.x 2
10. RRRRmmmmJJJJJJJ.y 3
11. MMMMMnnnnNNNNrrrr.x 4
...
What's important to notice here is that the Name variable contains ordinal numbers (e.g. 1. 2., 10.) at the beginning of the string and either .x or .y at the end of the string. Also, length of the Name variable is not the same in each row.
How can I remove the number from the beginning of the each string in the Name variable along with the period and the space that come after it? It's very important for me to get rid of them because I need to use the separate function on this data afterwards to separate into x and y from the end of the string. If I will still have that period after the number on the beginning of the string, separate will fail.
I wanted to use substr but I didn't know how to do it since, for example, 10. is longer than 9. and I don't know which values I would put into the start and stop arguments.

Problems with DataFrame to generate a table of 15000 rows x 37 columns in Julia

I'm trying to generate a table with 15000 rows and 16 columns, however Julia loses or omits some variables.
I have tried different ways to run DataFrame but I get the following results:
df = DataFrame(periods=15000, households=5000, giniY=giniY)
15,000 rows × 3 columns
However, when I run with the 16 variables I get the following result
df = DataFrame(periods=15000, households=5000, gamma=gamma, delta=delta,
betta=betta, alfa=alfa, miz=miz, roz=roz,
phi=phi, rok=rok, mie=mie, roe=roe,
roez=roez)
1×13 DataFrame. Omitted printing of 3 columns
Your second df variable has 13 columns (I have aligned the code in my edit 4 variables per line so that it is clearly visible). Julia omits printing all columns if they would not fit the screen (imagine what would happen if you had a data frame with 10 000 columns and always tried to print them all).
In REPL Julia omits printing columns if they do not fit the screen unless you pass allcols=true keyword argument to show or create a custom IOContext that you pass to show that defines a non-standard output width. All this is explained in show documentation for DataFrame.
In Jupyter Notebook a similar thing happens, but by default the width of the output is governed by "COLUMNS" environment variable. The details how you can set it are explained at the beginning of the DataFrames.jl manual here.

How to find the length of a list based on a condition in R

The problem
I would like to find a length of a list.
The expected output
I would like to find the length based on a condition.
Example
Suppose that I have a list of 4 elements as follows:
myve <–list(1,2,3,0)
Here I have 4 elements, one of them is zero. How can I find the length by extracting the zero values? Then, if the length is > 1I would like to substruct one. That is:
If the length is 4 then, I would like to have 4-1=3. So, the output should be 3.
Note
Please note that I am working with a problem where the zero values may be changed from one case to another. For example, For the first list may I have only one 0 value, while for the second list may I have 2 or 3 zero values.
The values are always positive or zero.
You just need to apply the condition to each element. This will produce a list of boolean, then you sum it to get the number of True elements (i.e. validation your condition).
In your case:
sum(myve != 0)
In a more complex case, where the confition is expressed by a function f:
sapply(myve, f)
Use sapply to extract the ones different to zeros and sum to count them
sum(sapply(myve, function(x) x!=0))

R commands for finding mode in R seem to be wrong

I watched video on YouTube re finding mode in R from list of numerics. When I enter commands they do not work. R does not even give an error message. The vector is
X <- c(1,2,2,2,3,4,5,6,7,8,9)
Then instructor says use
temp <- table(as.vector(x))
to basically sort all unique values in list. R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given. Then he says to use command,
names(temp)[temp--max(temp)]
which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list. I would like to stay with these commands as far as is possible as the instructor explains them in detail. Am I doing a typo or something?
You're kind of confused.
X <- c(1,2,2,2,3,4,5,6,7,8,9) ## define vector
temp <- table(as.vector(X))
to basically sort all unique values in list.
That's not exactly what this command does (sort(unique(X)) would give a sorted vector of the unique values; note that in R, lists and vectors are different kinds of objects, it's best not to use the words interchangeably). What table() does is to count the number of instances of each unique value (in sorted order); also, as.vector() is redundant.
R should give me from this command 1,2,3,4,5,6,7,8,9 but nothing happens except when the instructor does it this list is given.
If you assign results to a variable, R doesn't print anything. If you want to see the value of a variable, type the variable's name by itself:
temp
you should see
1 2 3 4 5 6 7 8 9
1 3 1 1 1 1 1 1 1
the first row is the labels (unique values), the second is the counts.
Then he says to use command, names(temp)[temp--max(temp)] which basically should give me this: 1,3,1,1,1,1,1,1,1 where 3 shows that the mode is 2 because it is repeated 3 times in list.
No. You already have the sequence of counts stored in temp. You should have typed
names(temp)[temp==max(temp)]
(note =, not -) which should print
[1] "2"
i.e., this is the mode. The logic here is that temp==max(temp) gives you a logical vector (a vector of TRUE and FALSE values) that's only TRUE for the elements of temp that are equal to the maximum value; names(temp)[temp==max(temp)] selects the elements of the names vector (the first row shown in the printout of temp above) that correspond to TRUE values ...

Counting specific characters in a string, across a data frame. sapply

I have found similar problems to this here:
Count the number of words in a string in R?
and here
Faster way to split a string and count characters using R?
but I can't get either to work in my example.
I have quite a large dataframe. One of the columns has genomic locations for features and the entries are formatted as follows:
[hg19:2:224840068-224840089:-]
[hg19:17:37092945-37092969:-]
[hg19:20:3904018-3904040:+]
[hg19:16:67000244-67000248,67000628-67000647:+]
I am splitting out these elements into thier individual elements to get the following (i,e, for the first entry):
hg19 2 224840068 224840089 -
But in the case of the fourth entry, I would like to pase this into two seperate locations.
i.e
hg19:16:67000244-67000248,67000628-67000647:+]
becomes
hg19 16 67000244 67000248 +
hg19 16 67000628 67000647 +
(with all the associated data in the adjacent columns filled in from the original)
An easy way for me to identify which rows need this action is to simply count the rows with commas ',' as they don't appear in any other text in any other columns, except where there are multiple genomic locations for the feature.
However I am failing at the first hurdle because the sapply command incorrectly returns '1' for every entry.
testdat$multiple <- sapply(gregexpr(",", testdat$genome_coordinates), length)
(or)
testdat$multiple <- sapply(gregexpr("\\,", testdat$genome_coordinates), length)
table(testdat$multiple)
1
4
Using the example I have posted above, I would expect the output to be
testdat$multiple
0
0
0
1
Actually doing
grep -c
on the same data in the command line shows I have 10 entries containing ','.
Using the example I have posted above, I would expect the output to be
So initially I would like to get this working but also I am a bit stumped for ideas as to how to then extract the two (or more) locations and put them on thier own rows, filling in the adjacent data.
Actually what I intended to to was to stick to something I know (on the command line) grepping the rows with ','out, duplicate the file and split and awk selected columns (1st and second location in respective files) then cat and sort them. If there is a niftier way for me to do this in R then I would love a pointer.
gregexpr does in fact return an object of length 1. If you want to find the rows which have a match vs the ones which don't, then you need to look at the returned value , not the length. A match failure returns -1 .
Try foo<-sapply(testdat$genome, function(x) gregexpr(',',x)); as.logical(foo) to get the rows with a comma.

Resources