Find pattern matches in every column in Excel, output as number of times repeating - R

I have an Excel file full of the letters "a", "b", "c" only. There are 20 columns and 300 rows in total. I want to find pattern matches across all columns. For example, if I search for patterns of 4 rows such as "aabc", "bcdb", etc., the output should be how many times each pattern repeats in the CSV file. If I search for patterns of 5 rows such as "abcaa", "bbaca", etc., the output should again be how many times each pattern repeats. The matches do not have to be in the same rows; if a pattern occurs anywhere else in the file, it should be counted too. The output can go in the next sheet. I have tried VBA and R using regex, but I only managed to count within a single cell. Any advice on how to find the pattern matches in the Excel file would be greatly appreciated. Thanks in advance.
Excel file:
| Row | A | B | C | D |
| 1 | a | a | a | b |
| 2 | a | a | a | c |
| 3 | b | b | b | a |
| 4 | c | c | c | c |
| 5 | d | b | d | b |
| 6 | b | a | b | c |
| 7 | b | b | b | b |
| 8 | a | a | a | c |
| 9 | a | c | a | a |
| 10 | c | c | c | c |
| 11 | c | a | c | c |
| 12 | a | a | a | a |
| 13 | b | b | a | b |
| 14 | b | b | b | a |
| 15 | c | c | c | c |
Output example:
If searching for patterns of 4 rows:
aabc 3
dbba 2
baac 2
and so on...
If searching for patterns of 5 rows:
aabcd 2
aacca 3
and so on...

In E1 enter:
=TEXTJOIN("",TRUE,A1:D1)
and copy down. Then copy column E and Paste Special > Values into column F. Then apply Remove Duplicates to column F. Then in G1 enter:
=COUNTIF(E:E,F1)
and copy down.
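Since the question also asks about R, here is a minimal sketch of the vertical-window reading of the problem. It assumes the sheet is saved as a headerless CSV (the file name patterns.csv is a placeholder) and that a "pattern of n rows" means n consecutive cells down a single column, counted across all 20 columns:

count_patterns <- function(file, n) {
  # Read the 20 x 300 sheet exported as CSV, no header, letters as characters
  df <- read.csv(file, header = FALSE, stringsAsFactors = FALSE)
  # Slide a window of n consecutive rows down every column and paste the letters
  windows <- unlist(lapply(df, function(col) {
    sapply(seq_len(length(col) - n + 1), function(i) {
      paste(col[i:(i + n - 1)], collapse = "")
    })
  }))
  # Count how often each n-letter pattern occurs anywhere in the file
  sort(table(windows), decreasing = TRUE)
}

count_patterns("patterns.csv", 4)   # patterns of 4 rows, e.g. aabc 3, dbba 2, ...
count_patterns("patterns.csv", 5)   # patterns of 5 rows

The resulting counts could then be written to another sheet with a package such as openxlsx, if needed.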

Related

Find substrings that have a correlation greater than k in R

I have a big problem in R :(. We have a data frame named "hcmut" that shows the answers of students on a half-term test, like this:
hcmut
Code | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
2011 | B | D | A | A | C | B | A | B | C | C
2012 | A | D | AC | B | D | B | A | B | C | C
2013 | A | D | A | A | C | D | C | B | D | D
2014 | A | B | A | C | BC | D | D | D | D | D
Question 1: find the substrings that have a correlation greater than k.
I think k is in the range 0 to 1.
Question 2: find the substring that has the greatest correlation and show that substring (like "ABCD"...).
Could you help me with this problem in R?
:( :(

Is there a way in R to create a column based on the order of multiple values in another column in a dataframe? [duplicate]

This question already has answers here:
Aggregating all unique values of each column of data frame (2 answers)
Collapse / concatenate / aggregate a column to a single comma separated string within each group (6 answers)
Closed 1 year ago.
I would like to create a column in my R data frame based on the order in which multiple values occur in one column.
For example, my data frame has an id column and an item type column, and the values of the order column are what I would like to add. Is there a way to tell R to look at the order of values in the item column so that it can produce "ABCD" or "ADCB" (or any other order) as the cell value in the third column?
| id | item | order |
| 11 | A | ABCD |
| 11 | A | ABCD |
| 11 | B | ABCD |
| 11 | B | ABCD |
| 11 | C | ABCD |
| 11 | C | ABCD |
| 11 | D | ABCD |
| 11 | D | ABCD |
| 12 | A | ADCB |
| 12 | A | ADCB |
| 12 | D | ADCB |
| 12 | D | ADCB |
| 12 | C | ADCB |
| 12 | C | ADCB |
| 12 | B | ADCB |
| 12 | B | ADCB |
...
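One possible sketch with dplyr, assuming the data frame is called df with columns id and item, and that the order string is simply the distinct items in order of first appearance within each id (the data-frame name is an assumption, not from an accepted answer):

library(dplyr)

df <- df %>%
  group_by(id) %>%
  # unique() keeps the first-appearance order, so this yields "ABCD", "ADCB", ...
  mutate(order = paste(unique(item), collapse = "")) %>%
  ungroup()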

How do you assign groups to larger groups in dplyr

I would like to assign groups to larger groups in order to assign them to cores for processing. I have 16 cores. This is what I have so far:
test <- data_extract %>% group_by(group_id) %>% sample_n(16, replace = TRUE)
This takes samples of 16 from each group.
This is an example of what I would like the final product to look like (with two clusters). All I really want is for the same group_id to belong to the same cluster, with a set number of clusters:
| balance | group_id | cluster |
| 454452 | a | 1 |
| 5450441 | a | 1 |
| 5444531 | b | 1 |
| 5404051 | b | 1 |
| 5404501 | b | 1 |
| 5404041 | b | 1 |
| 544251 | b | 1 |
| 254252 | b | 1 |
| 541254 | c | 2 |
| 54123254 | d | 1 |
| 542541 | d | 1 |
| 5442341 | e | 2 |
| 541 | f | 1 |
test <- data %>% group_by(group_id) %>% mutate(group = sample(1:16, 1))
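As a deterministic alternative to the random assignment above (a sketch only; the name n_clusters is an assumption), the distinct group_ids can be spread round-robin over the 16 clusters, which keeps every row of a group in the same cluster and balances the number of groups per cluster:

library(dplyr)

n_clusters <- 16
test <- data_extract %>%
  # match() gives each group_id the index of its first appearance, so the
  # modulo assigns groups to clusters 1..16, 1..16, ... in order of appearance
  mutate(cluster = (match(group_id, unique(group_id)) - 1) %% n_clusters + 1)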

Populate table column by relative row values of another column in R

I'm trying to populate a table column based on the relative row values of another column in R. I have a table with two data columns (Data1, Data2) and two point value columns (P1, P2). Data1 is populated; Data2 is not. I want Data2 to be populated with the value from either P1 or P2, based on the relative value of Data1. In a given row, if the previous value of Data1 is higher than its current value, the Data2 cell is populated with the value in P1. If the previous value of Data1 is lower than its current value, the Data2 cell is populated with the value in P2. To illustrate, I've provided two sample tables: the first is what I have (Data2 not populated), and the second is the desired outcome.
Table1 (What I have)
+-----+----+----+-------+-------+
| FID | P1 | P2 | Data1 | Data2 |
+-----+----+----+-------+-------+
| 1 | A | B | 50 | |
| 2 | C | D | 40 | |
| 3 | E | F | 60 | |
| 4 | G | H | 70 | |
| 5 | I | J | 65 | |
Table2 (Desired Outcome)
+-----+----+----+-------+-------+
| FID | P1 | P2 | Data1 | Data2 |
+-----+----+----+-------+-------+
| 1 | A | B | 50 | NA |
| 2 | C | D | 40 | C |
| 3 | E | F | 60 | F |
| 4 | G | H | 70 | H |
| 5 | I | J | 65 | I |
+-----+----+----+-------+-------+
Is there a built in function in R to accomplish this? If not, any advice on how to create one?
A solution using dplyr could be:
df %>%
mutate(Data2 = ifelse(lag(Data1) > Data1, paste0(P1), paste0(P2)))
FID P1 P2 Data1 Data2
1 1 A B 50 <NA>
2 2 C D 40 C
3 3 E F 60 F
4 4 G H 70 H
5 5 I J 65 I
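For completeness, a self-contained sketch of the same idea; the data-frame construction below is assumed from the sample tables, not given in the question:

library(dplyr)

df <- data.frame(
  FID   = 1:5,
  P1    = c("A", "C", "E", "G", "I"),
  P2    = c("B", "D", "F", "H", "J"),
  Data1 = c(50, 40, 60, 70, 65),
  stringsAsFactors = FALSE
)

df %>%
  # row 1 has no previous value, so lag() yields NA and Data2 stays NA
  mutate(Data2 = ifelse(lag(Data1) > Data1, P1, P2))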

Creation of Panel Data set in R

Programmers,
I have some difficulties in structuring my panel data set.
My panel data set, for the moment, has the following structure:
Shown here with only T = 2 and N = 3. (My real data set, however, is of size T = 6 and N = 20 000 000.)
Panel data structure 1:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
1 | 2 | C | ... | D |
1 | 3 | E | ... | F |
2 | 1 | G | ... | H |
2 | 2 | I | ... | J |
2 | 3 | K | ... | L |
The desired structure is:
Panel data structure 2:
Year | ID | Variable_1 | ... | Variable_k |
1 | 1 | A | ... | B |
2 | 1 | G | ... | H |
1 | 2 | C | ... | D |
2 | 2 | I | ... | J |
1 | 3 | E | ... | F |
2 | 3 | K | ... | L |
This data structure represents the classic panel data structure, where the yearly observations over the whole period are structured for all individuals block by block.
My question: is there a simple and efficient R solution that changes the data structure from Table 1 to Table 2 for very large data sets (data.frame)?
Thank you very much for all responses in advance!
Enrico
You can reorder the rows of your dataframe using order():
df <- df[order(df$ID, df$Year), ]
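With N = 20 000 000 rows, a data.table sort by reference may be faster and lighter on memory; a sketch, assuming the data frame is named df:

library(data.table)

setDT(df)               # convert to data.table in place, without copying
setorder(df, ID, Year)  # reorder rows by ID, then Year, by reference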
