REGEXP_SIMILAR in Teradata

I am trying to pick out the names in a table that contain no letters or digits. In the sample below, Flag is 1 for those names and 0 otherwise.
Name Flag
Abc 0
124 0
£% 1
AB÷ 0
17 0
*& 1
Which function should I use in Teradata to achieve the desired result?
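To pin down the rule implied by the Flag column, here is a minimal Python sketch (not Teradata SQL). It assumes "characters or numbers" means ASCII letters and digits, which is consistent with AB÷ getting Flag 0; the names ALNUM and flag are made up for illustration. A similar character class could presumably be tried with REGEXP_SIMILAR, but that is untested here.

```python
import re

# Assumption: "characters or numbers" means ASCII letters and digits.
# Symbols such as "÷", "£" or "%" do not count.
ALNUM = re.compile(r"[A-Za-z0-9]")

def flag(name: str) -> int:
    """Return 1 when the name contains no ASCII letter or digit, else 0."""
    return 0 if ALNUM.search(name) else 1

# Sample names taken from the question.
names = ["Abc", "124", "£%", "AB÷", "17", "*&"]
flags = [flag(n) for n in names]

for n, f in zip(names, flags):
    print(n, f)
```

The search looks for at least one letter or digit anywhere in the name, so a name with a mix of letters and symbols is still flagged 0.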

Related

How do I make a selected table confined to a matrix, rather than a running list?

My earlier lines of code for building tables from column names produced short, dense matrices that let me readily compare two survey questions (2nd example).
However, when I use the same line of code on another column (1st example), I don't get that compact matrix. I end up with a long list of unlinked tables, which I do not want. Perhaps it's because the new column contains only 0s and 1s, whereas the others have more than two distinct values.
[Please forgive my formatting issues (StackOverflow status: newbie). Many thanks in advance to anyone who looks at and answers my question!]
>table(select(data_final, `Relationship 2Affected Individual`, Satisfied_Treatments))
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, , 1 = 1, Response = 10679308122
0
Relationship 2Affected Individual 1
1 0
2 0
3 0
6 0
Other (please specify) 0
, ,
...
> table(select(data_final, `Relationship 2Affected Individual`, Indirect_Benefits))
Indirect_Benefits
Relationship 2Affected Individual 0 1 2 3
1 4 1 0 0
2 42 17 9 3
3 12 1 1 0
6 5 2 2 0
Other (please specify) 1 0 0 0
>#rstudioapi::versionInfo()
>#packageVersion("dplyr")
table(data_final$`Relationship 2Affected Individual`, data_final$Satisfied_Treatments)
Problem solved with the line above.

Frequency Distribution Plot of Document Term Matrix

I have created a document term matrix that looks something like this:
inspect(dtm[1:4,1:6])
allowed allowing almost alone companyunder companywide
Doc1.txt 1 1 1 0 1 0
Doc2.txt 0 1 1 0 1 1
Doc3.txt 0 0 0 1 0 1
Doc4.txt 1 0 1 0 1 1
Taking its column sums gives me:
colSums(dtm)
allowed 2
allowing 2
almost 3
alone 1
companyunder 3
companywide 3
Each value is the number of documents in which the word appears (for example, allowed 2 means that "allowed" is found in two documents).
I'm having difficulty creating a frequency distribution plot with the document number on the x-axis and the number of words the document contains on the y-axis.
Is this what you're looking for?
dtm = array(c(1,0,0,1,1,1,0,0,1,1,0,1,0,0,1,0,1,1,0,1,0,1,1,1),dim=c(4,6))
dimnames(dtm) = list(c("Doc1","Doc2","Doc3","Doc4"),c("allowed","allowing","almost","alone","companyunder","companywide"))
print(dtm)
plot(rowSums(dtm))
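For comparison, the same row-sum idea as a small Python sketch using the 4x6 sample from the question. The variable names are illustrative, and a text bar stands in for the plot so that no plotting library is needed.

```python
# Sample document-term matrix from the question (rows = documents).
terms = ["allowed", "allowing", "almost", "alone", "companyunder", "companywide"]
dtm = {
    "Doc1": [1, 1, 1, 0, 1, 0],
    "Doc2": [0, 1, 1, 0, 1, 1],
    "Doc3": [0, 0, 0, 1, 0, 1],
    "Doc4": [1, 0, 1, 0, 1, 1],
}

# Row sums: number of terms counted in each document (what rowSums gives in R).
words_per_doc = {doc: sum(counts) for doc, counts in dtm.items()}

# Column sums: number of documents containing each term (what colSums gives in R).
docs_per_term = [sum(col) for col in zip(*dtm.values())]

# Crude text "plot": one bar per document.
for doc, total in words_per_doc.items():
    print(f"{doc:<5} {'#' * total} {total}")
```

Row sums answer "how many words per document", while the column sums in the question answer "how many documents per word".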

extracting data from excel spreadsheet that is not organized in columns but repeats every x number of rows

I'm trying to extract information from an Excel spreadsheet that is organized by rows rather than by columns. Key points:
The Excel spreadsheet was converted to CSV, resulting in 2023 rows and 5 columns.
I read this file into a data.frame called "test".
I attempted to build a new data.frame with two loops.
Result:
There were 50 or more warnings (use warnings() to see the first 50)
warning(extractor)
Error in FUN(X[[1L]], ...) :
cannot coerce type 'closure' to vector of type 'character'
I would very much appreciate your help.
extractor<-function(test){
  ##x<-data.frame(matrix(NA,nrow=920,ncol=3))
  x<-data.frame(name=character(920),date=numeric(920),ton=numeric(920))
  for (i in 1:920){
    m<-11*i-9
    for(j in 1:5) {
      x$name[i]=test[m,][1]
      x$date[i]=test[m+j+2,][1]
      x$ton[i]=test[m+j+2,][3]
    }
  }
}
test.csv looks like this:
1 XXXX-XXX-LHS-P1
2 XXXX-XXX-BHS-P1
3 Date blasted BLASTED (T) MUCKED (T) REM'G (T)
4 BLAST #1 0 0
5 BLAST #2 0.00 0
6 BLAST #3 0 0
7 BLAST #4 0 0
8 BLAST #5 0 0
9 TOTAL 0 0
10 % Mucked to Date 0 0 of design
11 REM'G TO BLAST 25419
12 XXXX-XXX-LHS-P1
13 XXXX-XXX-BHS-P1 10069 Ready? 0
14 Date blasted BLASTED (T) MUCKED (T) REM'G (T)
15 41556 BLAST #1 10069 10069
16 BLAST #2 0 0
17 BLAST #3 0 0
18 BLAST #4 0 0
19 BLAST #5 0 0
20 TOTAL 10069 9656 413
21 % Mucked to Date 0.958983017
22 REM'G TO BLAST 0
...
I'm not sure this will address all of the warnings, but try adding the argument stringsAsFactors=FALSE to the end of the line where you create the data.frame.
In R versions before 4.0.0, data.frame() converts character columns to factors by default, and a factor element can't be set to a new value with a simple assignment. Your command should read x<-data.frame(name=character(920),date=numeric(920),ton=numeric(920),stringsAsFactors=FALSE).
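Separately from the warnings, the "repeats every 11 rows" reshaping can be sketched in Python. This is only an illustration under assumptions: the name sits in the first column of the second row of each block, the five BLAST rows are rows 4 to 8 of the block, the date is in column 1 and the blasted tonnage in column 3. The constants and the function name extract_blasts are hypothetical, and the placement of values within the five columns of sample_block is guessed from the listing.

```python
from typing import Iterator, List, Tuple

BLOCK_SIZE = 11            # rows per repeating block, as in the sample
NAME_ROW = 1               # assumed 0-based offset of the name row in a block
BLAST_ROWS = range(3, 8)   # assumed 0-based offsets of BLAST #1..#5 in a block

def extract_blasts(rows: List[List[str]]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, date, blasted_tons) for each BLAST row of each complete block."""
    for start in range(0, len(rows) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = rows[start:start + BLOCK_SIZE]
        name = block[NAME_ROW][0]
        for offset in BLAST_ROWS:
            date, _label, blasted = block[offset][:3]
            yield name, date, blasted

# Rows 12-22 of the sample, with assumed column positions.
sample_block = [
    ["XXXX-XXX-LHS-P1", "", "", "", ""],
    ["XXXX-XXX-BHS-P1", "", "10069", "Ready?", "0"],
    ["Date blasted", "", "BLASTED (T)", "MUCKED (T)", "REM'G (T)"],
    ["41556", "BLAST #1", "10069", "10069", ""],
    ["", "BLAST #2", "0", "0", ""],
    ["", "BLAST #3", "0", "0", ""],
    ["", "BLAST #4", "0", "0", ""],
    ["", "BLAST #5", "0", "0", ""],
    ["", "TOTAL", "10069", "9656", "413"],
    ["", "% Mucked to Date", "0.958983017", "", ""],
    ["", "REM'G TO BLAST", "0", "", ""],
]

records = list(extract_blasts(sample_block))
for record in records:
    print(record)
```

Unlike the loop in the question, which keeps one row per block because each pass of the inner loop overwrites the previous one, this yields one record per BLAST row.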

How to perform a repeated G.test in R?

I downloaded the R package RVAideMemoire in order to use the G.test.
> head(bio)
Date Trt Treated Control Dead DeadinC AliveinC
1 23Ap citol 1 3 1 0 13
2 23Ap cital 1 5 3 1 6
3 23Ap gerol 0 3 0 0 9
4 23Ap mix 0 5 0 0 8
5 23Ap cital 0 5 1 0 13
6 23Ap cella 0 5 0 1 4
So, I make subsets of the data to look at each treatment, because the G.test result will need to be pooled for each one.
datamix<-subset(bio, Trt=="mix")
head(datamix)
Date Trt Treated Control Dead DeadinC AliveinC
4 23Ap mix 0 5 0 0 8
8 23Ap mix 0 5 1 0 8
10 23Ap mix 0 2 3 0 5
20 23Ap mix 0 0 0 0 18
25 23Ap mix 0 2 1 0 15
28 23Ap mix 0 1 0 0 12
For G.test(x) to work when x is a matrix, the matrix must consist of 2 numeric columns, with 1 row per population. If I use the apply() function, I can run G.test on each row, provided my data set contains only two columns of numbers. I want to look only at Treated and Control, for example, but I'm not sure how to omit the other columns so that G.test ignores them and the headers. I've tried the following but I get an error:
apply(datamix, 1, G.test)
Error in match.fun(FUN) : object 'G.test' not found
I have also thought about trying to use something like this rather than creating subsets.
by(bio, Trt, rowG.test)
G.test prints this when you compare two numbers:
G-test for given probabilities
data: counts
G = 0.6796, df = 1, p-value = 0.4097
My other question: is there some way to add up all the df and G values that I get for each row (once I'm able to get them) for each treatment? Is there also a way to have R report the G, df and p-values in a table, so they can be summed, rather than printing them separately for each row as above?
Any help is hugely appreciated.
You're really close. This seems to work (hard to tell with such a small sample though).
by(bio,bio$Trt,function(x)G.test(as.matrix(x[,3:4])))
So first, the indices argument to by(...) (the second argument) is not evaluated in the context of bio, so you have to specify bio$Trt instead of just Trt.
Second, this will pass all the columns of bio, for each unique value of bio$Trt, to the function specified in the third argument. You need to extract only the two columns you want (columns 3 and 4).
Third, and this is a bit subtle, passing x[,3:4] to G.test(...) causes it to fail with an unintelligible error. Looking at the code, G.test(...) requires a matrix as its first argument, whereas x[,3:4] in the code above is a data.frame. So you need to convert it with as.matrix(...).
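On the second part of the question (a table of G, df and p per row, plus their sums), here is a rough Python sketch of the arithmetic. It assumes the plain, uncorrected G statistic for two counts against equal expected proportions, G = 2 * sum(O * ln(O / E)), so the numbers may differ from RVAideMemoire's if it applies a correction. The function names are made up; the counts are the Treated and Control columns from head(datamix).

```python
import math
from typing import Tuple

def chi2_sf_df1(g: float) -> float:
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(g / 2))

def g_test_two_counts(a: int, b: int) -> Tuple[float, int, float]:
    """Uncorrected G-test of two counts against equal expected proportions."""
    expected = (a + b) / 2
    # A zero count contributes nothing to the sum (limit of x*ln(x) as x -> 0).
    g = 2 * sum(o * math.log(o / expected) for o in (a, b) if o > 0)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    return g, 1, chi2_sf_df1(g)

# (Treated, Control) pairs from head(datamix) in the question.
rows = [(0, 5), (0, 5), (0, 2), (0, 0), (0, 2), (0, 1)]
table = [g_test_two_counts(t, c) for t, c in rows]

print(f"{'Treated':>7} {'Control':>7} {'G':>8} {'df':>3} {'p':>8}")
for (t, c), (g, df, p) in zip(rows, table):
    print(f"{t:>7} {c:>7} {g:>8.4f} {df:>3} {p:>8.4f}")

total_g = sum(g for g, _, _ in table)
total_df = sum(df for _, df, _ in table)
print("total G =", round(total_g, 4), "total df =", total_df)
```

A p-value for the summed G would need the chi-square distribution with the summed df, which is left out here because the closed form above only covers df = 1.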

How do I create/ sort a Table containing a list of matched terms with their corresponding counts

I am having problems trying to create a table containing a master list of names that have been matched and counted in two separate groups.
The Input_list.txt contains a master list of names and looks like this:
-5S_rRNA
-7SK
-ABCA8
-AC002480.4
-AC002978.1
-RP11-129B22.2
These names have been grep'd and counted in two separate data groups, group1_data.txt and group2_data.txt, which look like this:
group1_data.txt
-5S_rRNA 20
-7SK 25
-AC002480.4 1
-AC002978.1 2
group2_data.txt
-5S_rRNA 1
-ABCA8 1
I would like to create a table that contains the master Input_list.txt and the two data.txt files, with the matched names and their corresponding counts. Where there isn't a match, I would like a value of 0, so that the table looks like this:
Input group1 group2
5S_rRNA 20 1
7SK 25 0
ABCA8 0 1
AC002480.4 1 0
AC002978.1 2 0
The number of matched names differs between Input_list.txt and the two data.txt files.
I've tried sort but I'm really stuck. Any suggestions would be great!
Using join (the input files must already be sorted on the name field):
join -e 0 -a 1 -o '1.1 2.2' Input_list.txt group1_data.txt | \
join -a 1 -e 0 -o '1.1 1.2 2.2' - group2_data.txt | \
sed '/ 0 0$/d'
Prints:
-5S_rRNA 20 1
-7SK 25 0
-ABCA8 0 1
-AC002480.4 1 0
-AC002978.1 2 0
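The same left join with a default of 0 can also be sketched in Python with dictionaries, which does not depend on the inputs being sorted. This is an illustration only: the file contents are inlined as strings copied from the question, the function names are made up, and the leading "-" is stripped to match the table the asker wants.

```python
from typing import Dict, List, Tuple

def parse_counts(text: str) -> Dict[str, int]:
    """Parse 'name count' lines into a dict."""
    counts = {}
    for line in text.splitlines():
        if line.strip():
            name, count = line.split()
            counts[name] = int(count)
    return counts

def merge_counts(names: List[str], group1: Dict[str, int],
                 group2: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Left-join both groups onto the master list, using 0 for a missing count."""
    rows = []
    for name in names:
        c1, c2 = group1.get(name, 0), group2.get(name, 0)
        if c1 == 0 and c2 == 0:
            continue  # same effect as the sed step: drop names matched in neither group
        rows.append((name.lstrip("-"), c1, c2))
    return rows

# File contents inlined from the question.
input_list = "-5S_rRNA\n-7SK\n-ABCA8\n-AC002480.4\n-AC002978.1\n-RP11-129B22.2\n"
group1_text = "-5S_rRNA 20\n-7SK 25\n-AC002480.4 1\n-AC002978.1 2\n"
group2_text = "-5S_rRNA 1\n-ABCA8 1\n"

rows = merge_counts(input_list.split(), parse_counts(group1_text), parse_counts(group2_text))
print("Input group1 group2")
for name, c1, c2 in rows:
    print(name, c1, c2)
```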
