Matching patterns between 2 data frames in R

I have two data frames that contain strings that are slightly different, let's say:
Name Size
Company.1 Inc. 234
Company.2 LLC 164
Company.3 INC 231
On the other hand, my second data frame is:
Name State
Company.1 INC MA
Company.2 NY
Company.3 inc. CA
Do you know a tool that could, for example, match the first 6 characters and merge the result into a new table (or at least show me the options if there are multiple matches)?
I tried grep and sapply, but they do not work because I need to compare all Name values of the first data frame to all the Name values of the second one.
Thanks for your help!

It seems like all you need here is to use match on the first 9 characters of Name in both data frames, something like this (I'm assuming here that df1 is your first data set and df2 is the second):
indx <- match(substr(df1$Name, 1, 9), substr(df2$Name, 1, 9))
df1["State"] <- df2$State[indx]
df1
# Name Size State
# 1 Company.1 Inc. 234 MA
# 2 Company.2 LLC 164 NY
# 3 Company.3 INC 231 CA
Or use a fast join from the data.table package:
library(data.table)
setkey(setDT(df1)[, Name := substr(Name, 1, 9)], Name)
setDT(df2)[, Name := substr(Name, 1, 9)]
df1[df2]
# Name Size State
# 1: Company.1 234 MA
# 2: Company.2 164 NY
# 3: Company.3 231 CA
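The question also asks to see the options when there are multiple matches and mixes upper and lower case ("Inc.", "INC", "inc."). A minimal base R sketch of that variant, assuming a helper column named key (a hypothetical name, not in the original) and the same 9-character prefix as above:

```r
# Recreate the two example data frames from the question
df1 <- data.frame(Name = c("Company.1 Inc.", "Company.2 LLC", "Company.3 INC"),
                  Size = c(234, 164, 231), stringsAsFactors = FALSE)
df2 <- data.frame(Name = c("Company.1 INC", "Company.2", "Company.3 inc."),
                  State = c("MA", "NY", "CA"), stringsAsFactors = FALSE)

# Helper column (hypothetical name): lower-cased first 9 characters of Name
df1$key <- tolower(substr(df1$Name, 1, 9))
df2$key <- tolower(substr(df2$Name, 1, 9))

# merge() returns one row per matching pair, so a key that matches
# several rows of df2 shows up as several rows in the result
merged <- merge(df1, df2, by = "key", all.x = TRUE,
                suffixes = c(".df1", ".df2"))
merged

# Keys that appear more than once in the merged result (none in this example)
multi <- names(which(table(merged$key) > 1))
multi
```

Unlike match, which returns only the first hit, merge keeps every candidate pair, so ambiguous prefixes can be inspected by hand.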

Related

Separate character variable into two columns

I have scraped some data from a url to analyse cycling results. Unfortunately the name column consists of the rider's name and the name of the team in one field. I would like to separate these from each other. Here's the code (the last part doesn't work):
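library(rvest) # read_html(), html_nodes(), html_text() and the %>% pipe
library(tidyr) # separate()
#get url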
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020%>%
html_nodes("td")%>%
html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = F)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!
I think your request is wrongly formulated. You want to remove team from name.
That's how you should do it in my opinion:
results_stradebianchi_2020 %>%
mutate(name = stringr::str_remove(name, team))
Write this instead of your line with separate.
In this case separate is not an optimal solution for you because the separation character is not clearly defined.
Also, I would advise you to remove the initial blanks from name with stringr::str_trim(name)
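One caveat: a pattern passed to str_remove is interpreted as a regular expression, so a team name containing characters such as "(" or "+" may not match literally. A hedged base R sketch of a literal removal, using a small made-up data frame (the third row is a hypothetical team name, and the new column rider is an illustrative name only):

```r
# Toy data in the shape of the scraped table; the third row is invented
# purely to show a team name with regex metacharacters
res <- data.frame(
  name = c("van Aert Wout Team Jumbo-Visma",
           "Formolo Davide UAE-Team Emirates",
           "Doe John Team (A+B)"),
  team = c("Team Jumbo-Visma", "UAE-Team Emirates", "Team (A+B)"),
  stringsAsFactors = FALSE
)

# Remove the first literal occurrence of each row's team from that row's name
# (fixed = TRUE turns off regular-expression interpretation), then trim spaces
res$rider <- trimws(mapply(
  function(n, t) sub(t, "", n, fixed = TRUE),
  res$name, res$team, USE.NAMES = FALSE
))
res$rider
```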
You could do this in base R with gsub and replace in the name column the pattern of team column with "", i.e. nothing. We use apply() with MARGIN=1 to go through the data frame row by row. Finally we use trimws to clean from whitespace (where we change to whitespace="[\\h\\v]" for better matching the spaces).
res <- transform(results_stradebianchi_2020,
name=trimws(apply(results_stradebianchi_2020, 1, function(x)
gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59

Data Cleaning in R: remove test customer names

I am handling customer data that has customer first and last name. I want to clean the names of any random keystrokes. Test accounts are jumbled in the data-set and have junk names. For example in the below data I want to remove customers 2,5,9,10,12 etc. I would appreciate your help.
Customer Id FirstName LastName
1 MARY MEYER
2 GFRTYUIO UHBVYY
3 CHARLES BEAL
4 MARNI MONTANEZ
5 GDTDTTD DTTHDTHTHTHD
6 TIFFANY BAYLESS
7 CATHRYN JONES
8 TINA CUNNINGHAM
9 FGCYFCGCGFC FGCGFCHGHG
10 ADDHJSDLG DHGAHG
11 WALTER FINN
12 GFCTFCGCFGC CG GFCGFCGFCGF
13 ASDASDASD AASDASDASD
14 TYKTYKYTKTY YTKTYKTYK
15 HFHFHF HAVE
16 REBECCA CROSSWHITE
17 GHSGHG HGASGH
18 JESSICA TREMBLEY
19 GFRTYUIO UHBVYY
20 HUBHGBUHBUH YTVYVFYVYFFV
21 HEATHER WYRICK
22 JASON SPLICHAL
23 RUSTY OWENS
24 DUSTIN WILLIAMS
25 GFCGFCFGCGFC GRCGFXFGDGF
26 QWQWQW QWQWWW
27 LIWNDVLIHWDV LIAENVLIHEAV
28 DARLENE SHORTRIDGE
29 BETH HDHDHDH
30 ROBERT SHIELDS
31 GHERDHBXFH DFHFDHDFH
32 ACE TESSSSSRT
33 ALLISON AWTREY
34 UYGUGVHGVGHVG HGHGVUYYU
35 HCJHV FHJSEFHSIEHF
The problem seems to be that you'd need a solid definition of improbable names, and that is not really related to R. Anyway, I suggest you go by the first names and remove all those names that are not plausible. As a source of plausible first names or positive list, you could use e.g. SSA Baby Name Database. This should work reasonably well to filter out English first names. If you have more location specific needs for first names, just look online for other baby name databases and try to scrape them as a positive list.
Once you have them in a vector named positiveNames, filter out all non-positive names like this:
data_new <- data_original[data_original$FirstName %in% positiveNames,]
My approach is the following:
1) Merge FirstName and LastName into a single string, strname.
Then, count the number of letters for each strname.
2) At this point, we find that real names, like "MARNIMONTANEZ", are composed of two 'M'; two 'A'; one 'R'; three 'N'; one 'I'; one 'O'; one 'T'; one 'E'; one 'Z'.
And we find that fake names, like "GFCTFCGCFGCCGGFCGFCGFCGF", are composed of eight 'G'; seven 'F'; eight 'C'; one 'T'.
3) The pattern to distinguish real names from fake names becomes clear:
real names are characterized by a greater variety of letters. We can measure this by creating a variable check_real computed as: number of unique letters / total string length
fake names are characterized by few letters repeated several times. We can measure this by creating a variable check_fake computed as: average frequency of each letter
4) Finally, we just have to define a threshold to identify an anomaly for both variables. In the cases where these thresholds are triggered, a flag_real and a flag_fake appear.
if flag_real == 1 & flag_fake == 0, the name is real
if flag_real == 0 & flag_fake == 1, the name is fake
In the rare cases when the two flags agree (i.e. flag_real == 1 & flag_fake == 1), you have to investigate the record manually to optimize the thresholds.
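The steps above can be sketched in base R as follows. The helper name letter_stats and the two thresholds (0.4 and 3) are assumptions chosen only for illustration; the answer does not give concrete values.

```r
# For one name string: share of unique letters (check_real) and
# average number of occurrences per distinct letter (check_fake)
letter_stats <- function(s) {
  chars  <- strsplit(s, "")[[1]]
  counts <- as.vector(table(chars))
  c(check_real = length(counts) / length(chars),
    check_fake = mean(counts))
}

# The two example strings from the answer (first and last name already merged)
strname <- c("MARNIMONTANEZ", "GFCTFCGCFGCCGGFCGFCGFCGF")
stats <- as.data.frame(t(sapply(strname, letter_stats)))

# Hypothetical thresholds, to be tuned on the real data
stats$flag_real <- as.integer(stats$check_real >= 0.4)
stats$flag_fake <- as.integer(stats$check_fake >= 3)
stats
```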
You can calculate the variability strength of the full name (FirstName and LastName combined) as the number of unique characters in the full name divided by the total number of characters in the full name. Then just remove the names that have low variability strength. This means that you are removing the names that have a high frequency of the same random keystrokes, which results in low variability strength.
I did this using the charToRaw function, because it is very fast, and the dplyr library, as below:
# Building Test Data
df <- data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7),
FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
#test data: df
# CustomerId FirstName LastName
#1 1 MARY MEYER
#2 2 FGCYFCGCGFC FGCGFCHGHG
#3 3 GFCTFCGCFGC GFCGFCGFCGF
#4 4 ASDASDASD AASDASDASD
#5 5 GDTDTTD DTTHDTHTHTHD
#6 6 WALTER FINN
#7 7 GFCTFCGCFGC CG GFCGFCGFCGF
library(dplyr)
df %>%
## Combining FirstName and LastName
mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
group_by(FullName) %>%
## Calculating variability strength for each full name
mutate(Variability = length(unique(as.integer(charToRaw(FullName))))/nchar(FullName))%>%
## Filtering full name, I set above or equal to 0.4 (You can change this)
## Meaning we are keeping full name that has variability strength greater than or equal to 0.40
filter(Variability >= 0.40)
# A tibble: 2 x 5
# Groups: FullName [2]
# CustomerId FirstName LastName FullName Variability
# <dbl> <chr> <chr> <chr> <dbl>
#1 1 MARY MEYER MARY MEYER 0.6000000
#2 6 WALTER FINN WALTER FINN 0.9090909
I tried to combine the suggestions in the below code. Thanks everyone for the help.
# load required libraries
library(hunspell)
library(dplyr)
# read data in dataframe df
df<-data.frame(CustomerId = c(1, 2, 3, 4, 5, 6, 7,8),
FirstName = c("MARY"," ALBERT SAM", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
LastName = c("MEYER","TEST", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"), stringsAsFactors = FALSE)
# Keep unique names
df<-distinct(df,FirstName, LastName, .keep_all = TRUE)
# Spell check using hunspell
df$flag <- hunspell_check(df$FirstName) | hunspell_check(as.character(df$LastName))
# remove middle names
df$FirstNameOnly<-gsub(" .*","",df$FirstName)
# SSA name data using https://www.ssa.gov/oact/babynames/names.zip
# unzip files in folder named names
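# full.names = TRUE so that read.csv receives complete paths;
# header = FALSE because the SSA files have no header row
files<-list.files("/names",pattern="\\.txt$",full.names=TRUE)
ssa_names<- do.call(rbind, lapply(files, function(x) read.csv(x, header=FALSE,
col.names = c("Name","Gender","Frequency"),stringsAsFactors = FALSE)))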
# Change SSA names to uppercase
ssa_names$Name <- toupper(ssa_names$Name)
# Flag for SSA names
df$flag_SSA<-ifelse(df$FirstNameOnly %in% ssa_names$Name,TRUE,FALSE)
rm(ssa_names)
# remove spaces and concatenate first name and last name
df$strname<-gsub(" ","",paste(df$FirstName,df$LastName, sep = ""))
# Name string length
df$len<-nchar(df$strname)
# Unique string length
for(n in 1:nrow(df))
{
df$ulen[n]<-length(unique(strsplit(df$strname[n], "")[[1]]))
}
# Ratio variable for unique string length over total string length
df$ratio<-ifelse(df$len==0,0,df$ulen/df$len)
# Histogram to determine cutoff ratio
hist(df$ratio)
test<-df[df$ratio<.4 & df$flag_SSA==FALSE & df$flag==FALSE,]

How to iteratively take random sample from R datatable until different column values equal sample size in R?

I have an inventory dataframe that is like:
set.seed(5)
library(data.table)
#replicated data
invntry <- data.table(
warehouse = sample(c("NY", "NJ"), 1000, replace = TRUE),
intid = c(rep(1,150), rep(2,100), rep(3,210), rep(4,50), rep(5,80), rep(6,70), rep(7,140), rep(8,90), rep(9,90), rep(10,20)),
placement = c(1:150, 1:100, 1:210, 1:50, 1:80, 1:70, 1:140, 1:90, 1:90, 1:20),
container = sample(1:100,1000, replace = TRUE),
inventory = c(rep(3242,150), rep(9076,100), rep(5876,210), rep(9572,50), rep(3369,80), rep(4845,70), rep(8643,140), rep(4567,90), rep(7658,90), rep(1211,20)),
stock = c(rep(3200,150), rep(10000,100), rep(6656,210), rep(9871,50), rep(3443,80), rep(5321,70), rep(8659,140), rep(4567,90), rep(7650,90), rep(1298,20)),
risk = runif(100)
)
setnames(invntry, c("warehouse", "intid", "placement", "container", "inventory", "stock", "risk"))
invntry[ , ticket := 1:.N, by=c("intid", "warehouse")]
invntry$ticket[invntry$warehouse=="NJ"] <- 0
#ensuring some same brands are same container
invntry$container[27:32] <- 6
invntry$container[790:810] <- 71
invntry[790:820,]
There are more variables in the actual data that I want to use to compare the same items (intid) that are in different containers. So I would like to conduct multiple trials for a given range of sample sizes n for each item, such that I keep randomly selecting rows for an item until I have n rows from different containers, while keeping the duplicates that were already selected. So for a sample size of 6 for item 8, it might take 7 tries to get 6 different containers:
warehouse intid placement container inventory stock risk ticket
21: NY 8 10 71 4567 4567 0.38404806 5
22: NY 8 11 96 4567 4567 0.64665968 6
23: NJ 8 12 15 4567 4567 0.68265602 0
24: NY 8 13 19 4567 4567 0.84437586 7
21: NY 8 10 71 4567 4567 0.38404806 5
26: NY 8 15 34 4567 4567 0.69580270 8
28: NY 8 17 78 4567 4567 0.25352370 9
I searched this site but couldn't find anything for the above, nor anything that accommodates computing some values for each trial and sample size from the columns of that trial's rows, so I think I have to use a for loop so that I can distinguish each trial for each sample size. To summarize, two goals:
1) conduct random sampling for each intid until n unique containers are selected, cumulatively keeping the rows already selected
2) be able to do calculations on variables for each trial for each sample size for each item
Any ideas?
*doesn't have to involve data.table, that's just how it got started
(I think it's essentially the basic probability example of continuing to draw marbles from the urn until you have a sample size of all different colors-but even realizing that didn't help me find a solution!)
I'm not positive, but isn't this equivalent to grouping by intid and then sampling n values with replacement, where n is some integer? If so, then here's a way to do that using tidyverse functions. The code below groups by intid and samples 6 through 10 values with replacement from each group. The column Sample_Size identifies each n-sample group for each intid:
library(tidyverse)
invntry.sampled = map_df(setNames(6:10, 6:10),
~ invntry %>%
group_by(intid) %>%
sample_n(.x, replace=TRUE),
.id="Sample_Size")
And here's a data.table approach, using code adapted from this SO answer. I've wrapped the data.table code in lapply to cycle through the different sample sizes, as my data.table skills are limited. There may be a way to do this within the data.table code itself.
invntry.sampled = do.call(rbind,
lapply(6:10, function(n) invntry[ , .SD[sample(.N, n, replace=TRUE)], by=intid]))
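If the requirement really is to keep drawing until n different containers have been seen (so the number of draws varies from trial to trial), sampling a fixed n with replacement is not quite the same thing. A rough sketch of the "draw until n distinct" idea in base R, on a small made-up data frame; the function name draw_until_unique and the toy data are hypothetical:

```r
# Draw rows one at a time (with replacement) until n distinct containers
# have been seen; every drawn row is kept, including repeats
draw_until_unique <- function(dat, n) {
  stopifnot(length(unique(dat$container)) >= n)  # otherwise the loop never ends
  picked <- integer(0)
  while (length(unique(dat$container[picked])) < n) {
    picked <- c(picked, sample.int(nrow(dat), 1))
  }
  dat[picked, ]
}

# Toy rows for a single item with 6 distinct containers (71 appears 3 times)
toy <- data.frame(intid = 8,
                  placement = 1:8,
                  container = c(71, 96, 15, 19, 71, 34, 78, 71))

set.seed(1)
# Three trials for sample size 6, each tagged with its trial number
trials <- lapply(1:3, function(trial) {
  cbind(trial = trial, draw_until_unique(toy, 6))
})
sapply(trials, nrow)  # number of draws each trial needed
```

To run this per item, one option would be to split the full table by intid and apply the function to each piece; per-trial calculations can then be done on each element of the resulting list.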

I need to give NAs in a column a value based on value in another column

My question above doesn't fully explain the issue I am facing.
Just a disclaimer - I am very, very new with R, and I am teaching myself (or rather Google is teaching me) so apologies if my questions are really naive.
I have a household level data which I converted to individual level. The long and the short of it is that created lots of NAs. The data looks something like this:
snapshot of data
I want the households with the same code to have the same province and region, not NA. The data is like this because there is more than one individual in one household (obviously). The actual data is much bigger than this.
Would appreciate any help! I can give more info as required.
Best,
Asma
You can try this looping method:
# initialize a new data frame
data2 = NULL
codes = unique(data$hhcode)
for(i in seq_along(codes)){
# subset data by hhcode
data1 = data[data$hhcode == codes[i],]
# as long as you only have one unique region per code
# you can pull out the unique non-NA value and then set the
# region variable for all rows with that code
region = unique(data1$region[!is.na(data1$region)])
data1$region = region
# do the same for province
province = unique(data1$province[!is.na(data1$province)])
data1$province = province
#bind data to a new data frame
data2 = rbind(data2,data1)
}
head(data2)
data2[1:30,]
You want something like:
dataframe$Z <- ifelse(is.na(dataframe$X), dataframe$Y, dataframe$X)
Where dataframe is the data-frame in question; X is the column containing some NA values; Y is the column to fall-back to; and Z is the column containing the coalesced result
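A tiny worked example of that line, using made-up values for X and Y, shows what the coalesced column looks like:

```r
# Made-up data: X has a gap, Y is the fall-back column
dataframe <- data.frame(X = c("a", NA, "c"),
                        Y = c("x", "y", "z"),
                        stringsAsFactors = FALSE)

# Take X where it is present, otherwise fall back to Y
dataframe$Z <- ifelse(is.na(dataframe$X), dataframe$Y, dataframe$X)
dataframe$Z
```

Note that this fills an NA from another column of the same row; it does not by itself look up a value from another row with the same household code.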
As a beginner you could use a simple for loop. Later on, it is better to use something from the apply() family.
Step 1)
Create the dataset.
"sdgfsdh" is right: it would have been better if the OP had provided the output of dput(head(dataframe, 10)). However, for the convenience of an R rookie, I recreate it here.
Recreate the dataset:
df = data.frame(hhcode = c(rep(101010101, 5), rep(101010102, 5), rep(101010103, 5)),
province = c(rep(c(rep(NA, 4), "punjab"), 2), c(rep(NA, 4), "sindh")),
region = rep(c(rep(NA, 4), "urban"), 3))
Step 2)
Replace the NAs.
For each row we want to replace the second and third column. In other words: we want to replace every column except the first one. We can exclude columns by writing a minus in front of the index: df[, -1].
Now we want to replace the NAs with the values from the rows that a) do not contain NAs for "region" and "province", but b) share the same hhcode value.
a) How do we identify the rows that do not contain NAs? Use na.omit(df).
b) Let's say the current code is stored in a variable called hhcode; then we want the rows where df$hhcode is equal to hhcode --> df$hhcode == hhcode. (Note that which() gives us the indices of the TRUE cases in df$hhcode == hhcode.)
Finally, we want to repeat that for every unique hhcode that exists. The important words in this sentence are "for" and "unique".
In your dataset I can identify groups that share the same "hhcode". We can access the hhcode column with df$hhcode. To get all unique hhcode values we use unique(df$hhcode).
So we loop through every element in unique(df$hhcode) and replace the NAs =).
for(hhcode in unique(df$hhcode)){
df[which(df$hhcode == hhcode), -1] = na.omit(df)[na.omit(df)$hhcode == hhcode, -1]
}
df
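As a possible loop-free variant of the same idea, base R's ave() applies a function within each hhcode group. This is only a sketch; the helper name fill_group is made up, and it assumes each household has at least one non-NA value per column:

```r
# Same toy dataset as above, kept as character columns
df <- data.frame(hhcode = c(rep(101010101, 5), rep(101010102, 5), rep(101010103, 5)),
                 province = c(rep(c(rep(NA, 4), "punjab"), 2), c(rep(NA, 4), "sindh")),
                 region = rep(c(rep(NA, 4), "urban"), 3),
                 stringsAsFactors = FALSE)

# Hypothetical helper: repeat the first non-NA value of a group for every row
fill_group <- function(x) rep(x[!is.na(x)][1], length(x))

# ave() splits the column by hhcode, applies the helper, and puts the
# results back in the original row order
df$province <- ave(df$province, df$hhcode, FUN = fill_group)
df$region   <- ave(df$region,   df$hhcode, FUN = fill_group)
df
```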
df = data.frame(hhcode = c(rep(101010101, 5), rep(101010102, 5), rep(101010103, 5)),
province = c(rep(c(rep(NA, 4), "punjab"), 2), c(rep(NA, 4), "sindh")),
region = rep(c(rep(NA, 4), "urban"), 3))
First you generate a data.frame df_complete123 consisting only of complete cases in the first three columns (no NAs)
df_complete123 <- df[!is.na(df$province) & !is.na(df$region),]
It will look like this
hhcode province region
101010101 punjab urban
101010102 punjab urban
101010103 sindh urban
Next you will use this as some kind of look-up table. First
indices <- match(df$hhcode, df_complete123$hhcode)
which will give you this
1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
This means that the entries of the first column (hhcode) in df are found in the 1, 1, 1, 1, 1, 2, 2, ... -th row in df_complete123.
You use this to replace the values in 2nd and 3rd columns in df by those of df_complete123:
df$province <- df_complete123$province[indices]
df$region <- df_complete123$region[indices]
This results in
hhcode province region
1 101010101 punjab urban
2 101010101 punjab urban
3 101010101 punjab urban
4 101010101 punjab urban
5 101010101 punjab urban
6 101010102 punjab urban
7 101010102 punjab urban
8 101010102 punjab urban
9 101010102 punjab urban
10 101010102 punjab urban
11 101010103 sindh urban
12 101010103 sindh urban
13 101010103 sindh urban
14 101010103 sindh urban
15 101010103 sindh urban
Good Luck!

R - Is the result of tapply always in alphabetical order

I work with the dataframe df
Name = c("Albert", "Caeser", "Albert", "Frank")
Earnings = c(1000,2000,1000,5000)
df = data.frame(Name, Earnings)
Name Earnings
Albert 1000
Caeser 2000
Albert 1000
Frank 5000
If I use the tapply function
result <- tapply(df$Earnings, df$Name, sum)
I get this table result
Albert 2000
Caeser 2000
Frank 5000
Are there any circumstances, under which the table "result" would not be ordered alphabetically, if I use the tapply function as described above?
When I tried to find an answer, I changed the order of the rows:
Name Earnings
Frank 5000
Caeser 2000
Albert 1000
Albert 1000
but still get the same result.
I use multiple functions where I calculate with the output of tapply calculations and I have to be absolutely sure, that the output is always delivered in the same order.
Normally the output is ordered alphabetically, but you can come up with examples where it is not, for example if Name is a factor whose levels are not in alphabetical order (the output follows the order of the levels).
df <- data.frame(Name = factor(c('Ben', 'Al'), levels = c('Ben', 'Al')),
Earnings = c(1, 4))
tapply(df$Earnings, df$Name, sum)
## Ben Al
## 1 4
In that case you can either use as.character or (probably safer) order the result afterwards.
tapply(df$Earnings, as.character(df$Name), sum)
## Al Ben
## 4 1
result <- tapply(df$Earnings, df$Name, sum)
result[order(names(result))]
## Al Ben
## 4 1
Another possible problem can be leading spaces:
df <- data.frame(Name = c(' Ben', 'Al'),
Earnings = c(1, 4))
tapply(df$Earnings, df$Name, sum)
## Ben Al
## 1 4
In that case, just remove all leading spaces to get results ordered.
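If downstream code depends on a particular order, one defensive option is to set the factor levels explicitly before calling tapply rather than relying on the default. A small sketch using the question's data in the reordered form (the variable names lv and result_fixed are illustrative only):

```r
Name     <- c("Frank", "Caeser", "Albert", "Albert")
Earnings <- c(5000, 2000, 1000, 1000)

# Strip surrounding whitespace, then fix the level order explicitly
clean <- trimws(Name)
lv    <- sort(unique(clean))

# The result follows the order of the factor levels, whatever the row order
result_fixed <- tapply(Earnings, factor(clean, levels = lv), sum)
result_fixed
```

Looking values up by name, e.g. result_fixed["Frank"], avoids depending on position at all.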
You can order tapply output as you would order any array in R, using the sort command.
> result
Albert Caeser Frank
2000 2000 5000
> sort(result,decreasing=TRUE)
Frank Albert Caeser
5000 2000 2000
Depending on what you want to order by, you can either sort the values as shown above (leaving decreasing at its default of FALSE, i.e. sort(result), gives the values in increasing order), or sort by the names:
This will deliver the results by name in reverse alphabetical order
result[sort(names(result),decreasing=TRUE)]
Frank Caeser Albert
5000 2000 2000
What else would you like to sort and order by?
