SQLite, how to apply function to each cell in a column and then perform IN operation on transformed values? - sqlite

I'm trying to write a query that applies a function to each cell value in a column before performing an IN check. More precisely, I want to match rows between two tables where the values in one table's column are substrings of the values in the other table's column. Both columns hold strings.
I need something like turning
'A' IN ('A|B', 'C|D') into 'A' IN ('A', 'B', 'C', 'D').
What it comes down to is the ability to say whether A from Table 1 is in Table 2.
Table 1    Table 2
-------    -------
A          A|B
B          C|D
C
D
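One possible approach, sketched here through Python's built-in sqlite3 module only so that it can be run as-is: rather than splitting the values in Table 2, wrap both sides in the '|' delimiter and test for containment with SQLite's instr() function. The table names (table1, table2) and column names (val, vals) are assumptions, since the question does not give them, and the extra row 'E' is added purely to show a value that does not match.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (val TEXT);
    CREATE TABLE table2 (vals TEXT);
    -- 'E' is not in the question's data; it is here to show a non-match.
    INSERT INTO table1 (val) VALUES ('A'), ('B'), ('C'), ('D'), ('E');
    INSERT INTO table2 (vals) VALUES ('A|B'), ('C|D');
    """
)

# Pad both sides with the delimiter so that 'A' matches 'A|B'
# but would not match a longer token such as 'AA|B'.
query = """
    SELECT t1.val, t2.vals
    FROM table1 AS t1
    JOIN table2 AS t2
      ON instr('|' || t2.vals || '|', '|' || t1.val || '|') > 0
    ORDER BY t1.val
"""
matches = conn.execute(query).fetchall()

for val, vals in matches:
    print(val, "is in", vals)

conn.close()
```

Rows A to D each pair up with the Table 2 row that contains them, and 'E' is dropped because no row in table2 contains it. instr() is used instead of LIKE so that '%' or '_' characters in the data are not treated as wildcards.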

Related

Retrieve correspondent rows when startsWith == TRUE as numeric values

I am trying to write a script for a complete automation of my data.
Here, I am trying to implement a loop that searches for all the values starting with "Blank" in the Name column of my data.frame.
How can I print all the corresponding rows as a vector?
For example, row 5 has the value Blank C in the Name column. I want a vector with the values from that same row (5) in the Name column and in data columns 3:6. I need the actual values, but my current output is all NA.
for (val in ms.data$Name) {
  if (startsWith(val, "Blank")) {
    print(list(ms.data[val, 3:6]))
  }
}
Example data
Blank 1 05-Apr-17 7:04 PM 5.771899218 4.922906441 219.0199184 7.779938257
Blank 2 05-Apr-17 7:15 PM 4.913695034 4.071889653 2.161167065 2.567102283
Thank you

how to efficiently match two data tables in R

Situation:
I have a CSV file A with two columns Customer ID and Entry date.
A contains about 1,500,000 observations.
I have another CSV file B with a single column Customer ID.
B is a smaller subset of A.
Goal:
Since the info about their entry date is missing in table B, I would like to get that info from table A and write it all into a new table C.
Current Progress:
I've created 10 subsets S1,...,S10 from A and taken the maximum customer ID of each subset. In a for loop, I run through all entries of B and check which subset each entry falls into (by comparing its customer ID with the maximum customer ID of each subset). Once I've found the subset that should contain the customer ID, I use the function which to look for that element of B in A.
This is awfully slow.
Isn't there a quicker way?
Also, which R objects would be best for holding the CSV files? Currently, A is a data frame and B is a large integer vector.
I would use data.table. It is trivially easy to do this (see last command!), and very fast using what is known as a keyed join. Basically you look up entries from b in a using their common key (in your case "Customer ID"). As an example:
library(data.table)
# Dates chosen so that the printed output below is reproducible
a <- data.table(id=1:10, date=as.Date("2016-01-28") + 1:10)
setkey(a, id)
b <- data.table(id=4:6)
setkey(b, id)
a[b]
#    id       date
# 1:  4 2016-02-01
# 2:  5 2016-02-02
# 3:  6 2016-02-03
With your given example you would type this, to read in your data and do a keyed join to get the entry date for each person in table b:
a <- fread("A.csv")
setkey(a, "Customer ID")
b <- fread("B.csv")
setkey(b, "Customer ID")  # key b (not a again) so that a[b] joins on the shared column
c <- a[b]
Use data.table's fread to read in your CSV files:
library(data.table);
table_a <- fread("A.csv"); # Defaults are probably fine
table_b <- fread("B.csv");
Use merge to use the index in B to create C:
# Assuming the column name has an underscore instead of space
setkey(table_a, Customer_ID);
setkey(table_b, Customer_ID);
table_c <- merge(x=table_b, y=table_a, by="Customer_ID", all.x=TRUE);
Write the new table to CSV if desired:
write.csv(x=table_c, file="C.csv");

R - Updating a Dataframe Column

I have a data frame with 2 columns that contain two different types of text.
The first column contains codes that are strings of the form DD-HI-HO (DD being the code).
Column 2 is free text which anyone can insert.
I am trying to populate a third column based on three rules, using the logic below, to give a single vector column of 1 or 0.
I don't seem to be able to update a vector column to incorporate all three rules. Below is pseudocode.
Basic info:
Codes is a vector (basically a reference table with one column)
Fuzzy is a vector (basically another reference table with one column)
#----CHECK SEQUENCES----
# Check if code is applied in column 1
Data$Has.Code <- grepl(pattern = "(HC|HD|HE|HK|HM|HH|HY|HL)", Data.Raw$Col1)
# Check if string contains relevant text in col 2
Data$Has.DG <- if(length(intersect(Codes, Data$Contents)) > 0) {1}
# Check how closely Strings are related. Take the highest match If its over 45% then set flag as 1
levenshteinSim(Fuzzy ,Data$Contents)
Added table with sample data:
   Col1   Col2           Col3
1  HC-IE  Ice-cream      1
2  IE-GB  Volvo          0
3  IE-DE  Iced_Lollipop  1
Record 1:
Rule number 1 would catch "HC" in Col1 and so set Col3 to 1 (boolean).
Rule number 2 would also catch something in Col2 for record 1, as the vector Codes contains "Ice" as an element. It wouldn't execute in any case, because rule 1 supersedes it.
Record 2:
None of the rules would return anything for the second item, so Col3 is set to 0.
Record 3:
A bit of a daft example, but the Levenshtein similarity between Col2 and one of the elements in the vector Fuzzy comes out at 75%. This is above our stated threshold, so Col3 is set to 1.
Can anyone help?
Thank you for your help.

Problems with using subset in r

I need to subset my data frame, but I do not know what condition to use.
df2<-subset(df, condition )
A part of the dataframe, `df`:
state value
a 1
b 2
c 3
a 1
b 4
c 5
I count the sum of the value column for each state using: table(df$state)
I need to create a data frame that shows just the rows where the sum of the value column is greater than a given value x.
If x is 3, the new data frame should contain just the rows whose "state" column equals b or c.
What should I replace "condition" with? How can I use table(df$state) in the condition?
It is not clear what you are trying to do.
table(df$state) counts the occurrences of each state in your data, not the sum of the variable "value" for each "state". You should instead use something like this:
vv <- tapply(df$value, df$state, sum)
vv
a b c
2 6 8
Now you can use the result within subset to keep the rows whose state has a summed value greater than a given x. For example, with x = 3:
subset(df, state %in% names(vv)[vv > 3])
or without using subset (more efficient):
df[df$state %in% names(vv)[vv > 3], ]

R Dataframe: Get column 2 where column 1 value = x?

Basic question, but I'm a beginner, sorry :-) I still struggle with all these different data types. I have a table with different variable names in column 1. In column 2, these variables have certain values. I now want to extract the value for a certain variable.
VarNames<-read.table(paste("O:/Daten/RatsDaten/CodesandDescription/VarNamesDir.asc"), sep="", skip=0,header=FALSE)
The table looks something like this:
Test1 5
Test2 7
Test3 1
So how do I access these Test variable values by name? VarNames["Test1",2] didn't work, and neither did any other option I've tried. Are there better data types for this, or how would I do it with a plain data frame?
You should be in one of these two situations. Either
Testxx are the row names of VarNames (you can check this using rownames(VarNames)), in which case you should do:
VarNames["Test1",1]
Or Testxx are the values of a column, and you should do something like this:
VarNames[VarNames$v =='Test1',2]
For the first option:
m <- matrix(1:3,ncol=1,dimnames=list(paste0('Test',1:3),NULL))
m['Test1',]
Test1
1
For the second option:
m1 <- data.frame(v=paste0('Test',1:3),b=1:3)
m1[m1$v=='Test1',]
v b
1 Test1 1
As your example is not reproducible, it is unclear whether the first column denotes row names or a variable with values TestX.
In case it is a variable, your table actually looks like this:
V1 V2
Test1 5
Test2 7
Test3 1
So you can get value of Test2 by calling VarNames[VarNames$V1 == "Test2",] for the whole row or VarNames[VarNames$V1 == "Test2",2] for the value only. You specify 2 since it is the second column.
If the first column denotes row names, the call is VarNames["Test2",] for the whole row, or as @agstudy answered, VarNames["Test2",1] for the value alone. You specify 1 because, when Test2 is a row name, it is not stored in a column, so the value sits in the first column.
