Mapping tibble columns specified by substrings - r

I have a tibble with a list of stocks, each with a SectorId; each SectorId is a string of 8 characters (a level 4 GICS sector, https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard):
tabl <- tibble(Stock=c("A","B","C","D"), SectorId=c("30101010", "30101010", "20103015", "55102010"))
I also have a tibble that maps a SectorId to a ClusterId:
map_tabl <- tibble(ClusterId=c("C1","C1", "C2","C3"), SectorId=c("3010", "3020", "201030", "551020"))
Note that the cluster mapping mixes sectors defined at the 4 different levels (see https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard): the first 2 characters correspond to Level 1, the first 4 to Level 2, the first 6 to Level 3, and all 8 to Level 4, i.e. sector "3010" contains sector "30101010". So, for example, "30101010" belongs to the higher-level sector "3010", which is in ClusterId="C1". Note that "30101010" is not specified at all in map_tabl, so I probably need a function that looks at substrings, like grepl.
The resulting tibble should be:
tibble(Stock=c("A","B","C","D"), SectorId=c("30101010", "30101010", "20103015", "55102010"), ClusterId=c("C1", "C1", "C2", "C3"))

I think we can use a regex (fuzzy) join for this:
library(dplyr)
library(fuzzyjoin) # regex_left_join

map_tabl %>%
  mutate(SectorId = paste0("^", SectorId)) %>%
  regex_left_join(tabl, ., by = "SectorId")
# # A tibble: 4 x 4
# Stock SectorId.x ClusterId SectorId.y
# <chr> <chr> <chr> <chr>
# 1 A 30101010 C1 ^3010
# 2 B 30101010 C1 ^3010
# 3 C 20103015 C2 ^201030
# 4 D 55102010 C3 ^551020
fuzzyjoin always keeps both versions of the join variables around; it's easy enough to clean up afterwards with mutate(SectorId = SectorId.x, SectorId.x = NULL, SectorId.y = NULL) or similar (a select()/rename() combination also works).
Prepending the ^ to SectorId ensures that matches can only occur at the beginning of the string.
This does not attempt to limit the number of matches, so if multiple rows in map_tabl can match an entry (e.g., SectorId=c("3010", "301010")), you will need a clear rule for choosing which match to retain (one option is sketched below). For this, I assume Stock is a unique ID of sorts; if it isn't, add one yourself so that you end the operation with the same rows (no dupes) as before the join.
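If you do need such a rule, here is a sketch of one option (an assumption on my part, not something the question specifies): when several prefixes match, keep the longest, i.e. most specific, one per Stock.
library(dplyr)
library(fuzzyjoin)
map_tabl %>%
  mutate(SectorId = paste0("^", SectorId)) %>%
  regex_left_join(tabl, ., by = "SectorId") %>%
  # keep the most specific (longest) matching prefix per stock
  group_by(Stock) %>%
  slice_max(nchar(SectorId.y), n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  mutate(SectorId = SectorId.x, SectorId.x = NULL, SectorId.y = NULL)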

Related

splitting a data frame column -- using the second capital letter as a delimiter

I have a list of labels like this.
Label = c("ProjectCrash", "ProjectNoCrash", "TreatmentFed", "TreatmentPre",
"TreatmentStarve")
And the labels are in a data frame/tibble:
myTibble <- tibble(Label = Label)
myTibble
I'd like to create two more columns, delimiting on the second capital letter.
So the first column would contain either "Project" or "Treatment".
The second would contain one of Crash, NoCrash, Fed, Pre or Starve.
You can use tidyr's extract by providing a regex to pull the data into different columns.
tidyr::extract(myTibble, Label, c('First', 'Second'),
               '([A-Z][a-z]+)([A-Z].*)', remove = FALSE)
# A tibble: 5 x 3
# Label First Second
# <chr> <chr> <chr>
#1 ProjectCrash Project Crash
#2 ProjectNoCrash Project NoCrash
#3 TreatmentFed Treatment Fed
#4 TreatmentPre Treatment Pre
#5 TreatmentStarve Treatment Starve
You could also use separate from tidyr:
tidyr::separate(myTibble, Label, c("a", "b"), "(?<=[a-z])(?=[A-Z])",
                extra = "merge", remove = FALSE)
# A tibble: 5 x 3
Label a b
<chr> <chr> <chr>
1 ProjectCrash Project Crash
2 ProjectNoCrash Project NoCrash
3 TreatmentFed Treatment Fed
4 TreatmentPre Treatment Pre
5 TreatmentStarve Treatment Starve
In base R, you could do:
transform(myTibble, a = strcapture('([A-Z][a-z]+)(\\w+)', Label,
                                   data.frame(a = character(), b = character())))
Label a.a a.b
1 ProjectCrash Project Crash
2 ProjectNoCrash Project NoCrash
3 TreatmentFed Treatment Fed
4 TreatmentPre Treatment Pre
5 TreatmentStarve Treatment Starve
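One caveat with the transform() version: strcapture() returns a data.frame, so transform() prefixes the new columns (hence a.a and a.b above). If you'd rather keep clean names, a cbind() variant should work:
# bind the captured columns directly; names come from the prototype
cbind(myTibble,
      strcapture('([A-Z][a-z]+)(\\w+)', myTibble$Label,
                 data.frame(First = character(), Second = character())))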

How can I identify and count unique pairs at every level in a list of lists in R?

I have a list of lists that looks like this:
> class(cladelist)
[1] "list"
cladelist <- list( `46` = scan(text=' "KbFk2" "PeHa3" "PeHa51" "EeBi27" "EeBi17" "PeHa23" "PeHa44" "EeBi4" "EeBi26" "PeHa8" "PeHa26" "EeBi24" "EeBi3"
"EeBi20" "KbFk5" "PeHa15" "PeHa43" "PeHa11" "PeHa12" "PeHa49" "PeHa67" "PeHa17" "PeHa59" "KbFk4" "PeHa10" "PeHa55"
"PeHa73" "EeBi23" "PeHa78" "PeHa81" "EeBi11" "PeHa45" "EeBi6" "EeBi34" "PeHa25" "PeHa52" "PeHa62" "PeHa31" "PeHa65"
"PeHa47" "PeHa50" "PeHa34" "PeHa54" "PeHa22" "PeHa30"', what=""),
`47`= scan(text='
"KbFk2" "EeBi27" "EeBi17" "EeBi4" "EeBi26" "EeBi3" "EeBi20" "KbFk5" "KbFk4" "EeBi6" "EeBi34"', what=""),
`48`= scan(text=' "PeHa3" "PeHa51" "PeHa23" "PeHa44" "PeHa8" "PeHa26" "EeBi24" "PeHa15" "PeHa43" "PeHa11" "PeHa12" "PeHa49" "PeHa67"
"PeHa17" "PeHa59" "PeHa10" "PeHa55" "PeHa73" "EeBi23" "PeHa78" "PeHa81" "EeBi11" "PeHa45" "PeHa25" "PeHa52" "PeHa62"
"PeHa31" "PeHa65" "PeHa47" "PeHa50" "PeHa34" "PeHa54" "PeHa22" "PeHa30"', what=""),
`49`= scan(text=' "PeHa51" "PeHa23" "PeHa44" "PeHa8" "PeHa26" "EeBi24" "PeHa15" "PeHa43" "PeHa11" "PeHa12" "PeHa49" "PeHa67" "PeHa17"
"PeHa59" "PeHa10" "PeHa55" "PeHa73" "EeBi23" "PeHa78" "PeHa81" "EeBi11" "PeHa45" "PeHa25" "PeHa52" "PeHa62" "PeHa31"
"PeHa65" "PeHa47" "PeHa50" "PeHa34" "PeHa54" "PeHa22" "PeHa30"', what=""),
`50`= scan(text=' "EeBi27" "EeBi17" "EeBi4" "EeBi26" "EeBi3" "EeBi20" "KbFk5" "KbFk4" "EeBi6" "EeBi34"', what="") )
Each of these sublists (i.e. "46", "47", etc.) represents a clade in a dendrogram that I've extracted using:
> cladelist <- clade.members.list(VB.phy, tips = FALSE, tip.labels = TRUE, include.nodes=FALSE)
I'm trying to find each unique pair within each sublist and count the number of times it appears across all sublists (clades).
The ideal output would be a data frame like this, where Count is the number of times the pair was found across all sublists (clades):
Pair Count
PeHa1/PeHa2 2
PeHa1/PeHa3 4
PeHa1/PeHa4 7
PeHa1/PeHa5 3
What sort of functions am I looking for?
Background for the question (just for interest; it doesn't add that much to the question):
The idea is that I have a data set of 121 of these elements (PeHa1, KbFk3, etc.). They are artifacts (I'm an archaeologist) that I'm evaluating using 3D geometric morphometrics. The problem is that these artifacts are not all complete; they are damaged or degraded and thus provide an inconsistent amount of data. So I've had to reduce what data I use per artifact to keep a reasonable, yet still inconsistent, sample size. By selecting certain variables to evaluate, I can get useful information, but it requires that I test every combination of variables. One of my analyses gives me the dendrograms, using divisive hierarchical clustering.
Counting the frequency of each pair across clades should measure the strength of the relationship of each pair of artifacts. I will then divide that count by the total number of clades in order to standardize it for the following step. Once I've done this for X number of dendrograms, I will combine all these values for each pair and divide them by the number of dendrograms in which the pair appeared (if it shows up in 2 dendrograms, I divide by 2), because each pair will not appear in every one of my tests and I have to standardize so that more complete artifacts, which appear more often in my tests, don't get too much more weight. This should let me evaluate which pairs have the strongest relationships.
This falls into the family of association problems, for which I find the widyr package super useful, since it does pairwise counts and correlations. (The stack() call just converts the list into a data frame so the rest can flow.)
I couldn't check against your sample output, but for an example like "PeHa23/PeHa51", the output shows they are paired together in 3 different clades.
This currently doesn't include zero counts to exhaust all possible pairs, but those could be shown as well using complete(); a sketch follows the output below.
UPDATE: Made the package references clearer (e.g. for dplyr), and filtered so that counts are non-directional (item1-item2 is the same as item2-item1 and can be filtered out).
library(tidyverse)
library(widyr)
df <- stack(cladelist) %>%
  dplyr::rename(clade = "ind", artifact = "values")

df %>%
  widyr::pairwise_count(feature = clade, item = artifact) %>%
  filter(item1 > item2) %>%
  mutate(Pair = paste(item1, item2, sep = "/")) %>%
  dplyr::select(Pair, Count = n)
#> # A tibble: 990 x 2
#> Pair Count
#> <chr> <dbl>
#> 1 PeHa3/KbFk2 1
#> 2 PeHa51/KbFk2 1
#> 3 PeHa23/KbFk2 1
#> 4 PeHa44/KbFk2 1
#> 5 PeHa8/KbFk2 1
#> 6 PeHa26/KbFk2 1
#> 7 KbFk5/KbFk2 2
#> 8 PeHa15/KbFk2 1
#> 9 PeHa43/KbFk2 1
#> 10 PeHa11/KbFk2 1
#> # … with 980 more rows
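And here is a sketch of the zero-count version mentioned above, using tidyr::complete() to expand to every pair of artifacts before filtering (the fill argument supplies the zeros; the item1 > item2 filter also drops the self-pairs that complete() creates):
df %>%
  widyr::pairwise_count(feature = clade, item = artifact) %>%
  tidyr::complete(item1, item2, fill = list(n = 0)) %>%
  filter(item1 > item2) %>%
  mutate(Pair = paste(item1, item2, sep = "/")) %>%
  dplyr::select(Pair, Count = n)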

Data Cleaning in R: remove test customer names

I am handling customer data that has customer first and last names. I want to clean the names of any random-keystroke entries. Test accounts are jumbled into the data set and have junk names. For example, in the data below I want to remove customers 2, 5, 9, 10, 12, etc. I would appreciate your help.
Customer Id FirstName LastName
1 MARY MEYER
2 GFRTYUIO UHBVYY
3 CHARLES BEAL
4 MARNI MONTANEZ
5 GDTDTTD DTTHDTHTHTHD
6 TIFFANY BAYLESS
7 CATHRYN JONES
8 TINA CUNNINGHAM
9 FGCYFCGCGFC FGCGFCHGHG
10 ADDHJSDLG DHGAHG
11 WALTER FINN
12 GFCTFCGCFGC CG GFCGFCGFCGF
13 ASDASDASD AASDASDASD
14 TYKTYKYTKTY YTKTYKTYK
15 HFHFHF HAVE
16 REBECCA CROSSWHITE
17 GHSGHG HGASGH
18 JESSICA TREMBLEY
19 GFRTYUIO UHBVYY
20 HUBHGBUHBUH YTVYVFYVYFFV
21 HEATHER WYRICK
22 JASON SPLICHAL
23 RUSTY OWENS
24 DUSTIN WILLIAMS
25 GFCGFCFGCGFC GRCGFXFGDGF
26 QWQWQW QWQWWW
27 LIWNDVLIHWDV LIAENVLIHEAV
28 DARLENE SHORTRIDGE
29 BETH HDHDHDH
30 ROBERT SHIELDS
31 GHERDHBXFH DFHFDHDFH
32 ACE TESSSSSRT
33 ALLISON AWTREY
34 UYGUGVHGVGHVG HGHGVUYYU
35 HCJHV FHJSEFHSIEHF
The problem is that you'd need a solid definition of an improbable name, and that is not really an R question. Anyway, I suggest you go by the first names and remove all those that are not plausible. As a source of plausible first names (a positive list), you could use e.g. the SSA Baby Name Database. This should work reasonably well for filtering English first names. If you have more location-specific needs, look online for other baby-name databases and try to scrape them as a positive list.
Once you have them in a vector named positiveNames, keep only the rows whose first name is on the list:
data_new <- data_original[data_original$firstName %in% positiveNames,]
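For instance, a rough sketch of building positiveNames from the unzipped SSA files (this assumes the yobNNNN.txt files, which have no header row, sit in a folder called names):
files <- list.files("names", pattern = "^yob.*\\.txt$", full.names = TRUE)
# first column of each file is the name; uppercase to match the customer data
positiveNames <- unique(toupper(unlist(
  lapply(files, function(f) read.csv(f, header = FALSE)[[1]])
)))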
My approach is the following:
1) Merge FirstName and LastName into a single string, strname, and count the number of letters in each strname.
2) At this point we find that a real name like "MARNIMONTANEZ" is composed of two 'M', two 'A', one 'R', three 'N', one 'I', one 'O', one 'T', one 'E' and one 'Z', while a fake name like "GFCTFCGCFGCCGGFCGFCGFCGF" is composed of eight 'G', seven 'F', eight 'C' and one 'T'.
3) The pattern distinguishing real names from fake names becomes clear:
real names are characterized by a greater variety of letters. We can measure this by creating a variable check_real computed as: number of unique letters / total string length.
fake names are characterized by a few letters repeated many times. We can measure this by creating a variable check_fake computed as: average frequency of each letter.
4) Finally, we just have to define a threshold on each variable that signals an anomaly. When a threshold is triggered, the corresponding flag_real or flag_fake is raised.
if flag_real == 1 & flag_fake == 0, the name is real
if flag_real == 0 & flag_fake == 1, the name is fake
In the rare cases where the two flags agree (i.e. flag_real == 1 & flag_fake == 1), you have to inspect the record manually and tune the thresholds. A sketch of this approach follows.
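A minimal sketch of these computations (assuming a data frame df with FirstName and LastName columns; the 0.4 and 3 thresholds are illustrative and should be tuned on your data):
library(dplyr)
flagged <- df %>%
  mutate(
    strname    = toupper(gsub(" ", "", paste0(FirstName, LastName))),
    # share of distinct letters: high for real names
    check_real = sapply(strname, function(s)
      length(unique(strsplit(s, "")[[1]])) / nchar(s)),
    # average repetitions per letter: high for fake names
    check_fake = sapply(strname, function(s)
      mean(table(strsplit(s, "")[[1]]))),
    flag_real  = as.integer(check_real >= 0.4),
    flag_fake  = as.integer(check_fake >= 3)
  )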
You can calculate the variability strength of the full name (FirstName and LastName combined) as the number of unique letters in the full name divided by its total number of characters, and then remove the names with low variability strength. Names made of a few repeated random keystrokes have a high frequency of the same letters and therefore low variability strength.
I did this with the charToRaw function, because it is very fast, together with the dplyr library, as below:
# Building test data
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7),
                 FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD",
                               "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
                 LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD",
                              "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"),
                 stringsAsFactors = FALSE)
#test data: df
# CustomerId FirstName LastName
#1 1 MARY MEYER
#2 2 FGCYFCGCGFC FGCGFCHGHG
#3 3 GFCTFCGCFGC GFCGFCGFCGF
#4 4 ASDASDASD AASDASDASD
#5 5 GDTDTTD DTTHDTHTHTHD
#6 6 WALTER FINN
#7 7 GFCTFCGCFGC CG GFCGFCGFCGF
library(dplyr)
df %>%
  ## Combine FirstName and LastName
  mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
  group_by(FullName) %>%
  ## Calculate variability strength for each full name
  mutate(Variability = length(unique(as.integer(charToRaw(FullName)))) / nchar(FullName)) %>%
  ## Keep full names with variability strength >= 0.40 (you can change this cutoff)
  filter(Variability >= 0.40)
# A tibble: 2 x 5
# Groups: FullName [2]
# CustomerId FirstName LastName FullName Variability
# <dbl> <chr> <chr> <chr> <dbl>
#1 1 MARY MEYER MARY MEYER 0.6000000
#2 6 WALTER FINN WALTER FINN 0.9090909
I tried to combine the suggestions in the code below. Thanks everyone for the help.
# load required libraries
library(hunspell)
library(dplyr)

# read data into dataframe df
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7, 8),
                 FirstName = c("MARY", " ALBERT SAM", "FGCYFCGCGFC", "GFCTFCGCFGC",
                               "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
                 LastName = c("MEYER", "TEST", "FGCGFCHGHG", "GFCGFCGFCGF",
                              "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"),
                 stringsAsFactors = FALSE)

# keep unique names
df <- distinct(df, FirstName, LastName, .keep_all = TRUE)

# spell check using hunspell
df$flag <- hunspell_check(df$FirstName) | hunspell_check(as.character(df$LastName))

# remove middle names
df$FirstNameOnly <- gsub(" .*", "", df$FirstName)

# SSA name data from https://www.ssa.gov/oact/babynames/names.zip,
# unzipped into a folder named "names" (the yobNNNN.txt files have no header)
files <- list.files("names", pattern = "\\.txt$", full.names = TRUE)
ssa_names <- do.call(rbind, lapply(files, function(x)
  read.csv(x, header = FALSE, col.names = c("Name", "Gender", "Frequency"),
           stringsAsFactors = FALSE)))

# change SSA names to uppercase
ssa_names$Name <- toupper(ssa_names$Name)

# flag first names that appear in the SSA list
df$flag_SSA <- df$FirstNameOnly %in% ssa_names$Name
rm(ssa_names)

# remove spaces and concatenate first name and last name
df$strname <- gsub(" ", "", paste(df$FirstName, df$LastName, sep = ""))

# name string length
df$len <- nchar(df$strname)

# unique string length
for (n in 1:nrow(df)) {
  df$ulen[n] <- length(unique(strsplit(df$strname[n], "")[[1]]))
}

# ratio of unique string length to total string length
df$ratio <- ifelse(df$len == 0, 0, df$ulen / df$len)

# histogram to determine the cutoff ratio
hist(df$ratio)
test <- df[df$ratio < .4 & df$flag_SSA == FALSE & df$flag == FALSE, ]

Subset rows for each group based on a character in a column and order of occurrence in a data frame

I have data similar to this:
B <- data.frame(State = c(rep("Arizona", 8), rep("California", 8), rep("Texas", 8)),
                Account = rep(c("Balance", "Balance", "In the Bimester", "In the Bimester",
                                "Expenses", "Expenses", "In the Bimester", "In the Bimester"), 3),
                Value = runif(24))
You can see that Account has 4 occurrences of "In the Bimester" per state: two "chunks" of two elements each, with "Expenses" between them.
The order matters here, because the first chunk does not refer to the same thing as the second chunk.
My actual data is more complex: it has a 4th variable indicating what each row of Account means, and the number of rows per Account element can change. For example, in some state the first "chunk" of "In the Bimester" may have 6 rows and the second 7, and I cannot tell the chunks apart by this 4th variable.
Desired: I'd like to subset my data, splitting the two "In the Bimester" chunks within each state and keeping only the rows of the first chunks (or only the second ones).
I have a solution using the data.table package, but I find it kind of poor. Any thoughts?
library(data.table)
B <- as.data.table(B)
B <- B[, .(Account, Value, index = 1:.N), by = .(State)]
x <- B[Account == "Expenses", .(min_ind = min(index)), by = .(State)]
B <- merge(B, x, by = "State")
B <- B[index < min_ind & Account == "In the Bimester", .(Value), by = .(State)]
You can use the dplyr package:
library(dplyr)
B %>%
  mutate(helper = data.table::rleid(Account)) %>%
  filter(Account == "In the Bimester") %>%
  group_by(State) %>%
  filter(helper == min(helper)) %>%
  select(-helper)
# # A tibble: 6 x 3
# # Groups: State [3]
# State Account Value
# <fctr> <fctr> <dbl>
# 1 Arizona In the Bimester 0.17730148
# 2 Arizona In the Bimester 0.05695585
# 3 California In the Bimester 0.29089678
# 4 California In the Bimester 0.86952723
# 5 Texas In the Bimester 0.54076144
# 6 Texas In the Bimester 0.59168138
If you use max instead of min you'll get the last occurrences of "In the Bimester" for each State. You can also exclude the Account column by changing the last pipe to select(-helper, -Account).
P.S. If you don't want to use rleid from data.table and prefer pure dplyr functions, take a look at this thread; a dplyr-only sketch is also shown below.
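For reference, here is a dplyr-only sketch that rebuilds the run-length id with cumsum() instead of data.table::rleid():
library(dplyr)
B %>%
  # a new run starts whenever Account differs from the previous row
  mutate(helper = cumsum(Account != lag(Account, default = first(Account)))) %>%
  filter(Account == "In the Bimester") %>%
  group_by(State) %>%
  filter(helper == min(helper)) %>%
  select(-helper)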

Order R dataframe columns by using second dataframe as a reference

I am developing a statistical program in R that accepts two data frames. The first carries patients' demographic information and the second their clinical information. The key column in the demographics data frame is patientID, while in the clinical data frame each patientID is a column. I wish to sort my demographics data frame by patientID, based on the order of the patientID columns in the clinical data frame. The IDs could be numeric, alphanumeric, or just some alphabetic sequence. I was able to write some code, but would appreciate guidance on a better way to sort the columns irrespective of their data type (character, factor, numeric, etc.).
demogr <- read.csv(mydemoFile, header = TRUE, stringsAsFactors = TRUE,
                   colClasses = c('factor', 'factor', 'factor', 'factor', 'factor'))
demogr <- demogr[order(as.numeric(demogr$Patient_ID)), ]

myClinicalFrame <- fread(myInputFile, header = TRUE, data.table = FALSE, sep = ",")
rowNames <- myClinicalFrame[, 1]
myClinicalFrame[, 1] <- NULL
rownames(myClinicalFrame) <- rowNames
names(myClinicalFrame) <- sort(names(myClinicalFrame))
The above works for certain types but fails for others, e.g. Patient_ID in demogr is sorted numerically above, but in some situations R turns a Patient_ID like 109999345554545465 into 1.09e+18, which then no longer matches the second data frame.
Thanks
Let's start by creating two example data frames:
patientID = c(123456789012345, 1234, 1234567890, 123)
state = c("FL", "NJ", "CA", "TX")
demog = data.frame(ID = patientID, state = state)

clinical = data.frame(col1 = c(1, 2, 3),
                      col2 = c(3, 4, 5),
                      col3 = c(1, 7, 9),
                      col4 = c(6, 4, 2))
colnames(clinical) = c("1234567890", "123", "123456789012345", "1234")
This gives us:
> demog
ID state
1 1.234568e+14 FL
2 1.234000e+03 NJ
3 1.234568e+09 CA
4 1.230000e+02 TX
and
> clinical
1234567890 123 123456789012345 1234
1 1 3 1 6
2 2 4 7 4
3 3 5 9 2
As you can see the rows in demog are in a different order than the columns in clinical.
To sort the rows in demog do:
rownames(demog) = demog$ID
demog = demog[colnames(clinical),]
This works even for IDs that are factors or characters, because rownames() will convert them to character.
Result:
> demog
ID state
1234567890 1.234568e+09 CA
123 1.230000e+02 TX
123456789012345 1.234568e+14 FL
1234 1.234000e+03 NJ
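As an aside, the 1.09e+18 problem in the question comes from reading long IDs as numeric. Here is a sketch of the same reordering with match(), assuming Patient_ID is read as character in the first place (e.g. colClasses = "character" in read.csv), so precision is never lost:
# keep IDs as character end-to-end to avoid scientific notation
demog$ID <- as.character(demog$ID)
demog <- demog[match(colnames(clinical), demog$ID), ]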
