formula in R name string - r

I have a data frame of the following form:
country company hitid
1 Switzerland CH1 <NA>
2 Switzerland CH2 <NA>
3 Switzerland CH3 <NA>
4 Sweden SU1 <NA>
5 Sweden SU2 <NA>
6 Sweden SU3 <NA>
in the hitid collumn, I would like to fill in automatically results of a loop I have run before. The results are given in the form d$COUNTRY$hitid, where for each country, I have got another hitid that I would like to fill in.
EDIT:
my loop output is of the following form:
$Switzerland
HITTypeId HITId Valid
1 1010 123 TRUE
$Sweden
HITTypeId HITId Valid
1 1010 456 TRUE
Is there any way that one can use a formula inside of a name string? That i could construct something like:
hitid=d$"formula to look up country"$hitid
Or any ideas how to construct this problem more elegant?
Basically I just want to extract the HITId for each country out of the loop and into the existing datfile.

Here a plyr solution.
library(plyr)
ddply(dat,.(country),transform,
hitid= d[[unique(country)]]$hitid)
Where I assume that :
d <- list(Switzerland=list(hitid=1),
Sweden=list(hitid=2))

This makes some assumptions about your data, i.e., that DF$country is a character column and that d is a list.
DF <- read.table(text=" country company hitid
1 Switzerland CH1 <NA>
2 Switzerland CH2 <NA>
3 Switzerland CH3 <NA>
4 Sweden SU1 <NA>
5 Sweden SU2 <NA>
6 Sweden SU3 <NA>",header=TRUE,stringsAsFactors=FALSE)
d <- list(Switzerland=list(hitid=123),Sweden=list(hitid=456))
fun <- function(x) d[[x]][["hitid"]]
DF$hitid <- sapply(DF$country,fun)
# country company hitid
# 1 Switzerland CH1 123
# 2 Switzerland CH2 123
# 3 Switzerland CH3 123
# 4 Sweden SU1 456
# 5 Sweden SU2 456
# 6 Sweden SU3 456

Related

Replace all partial string entries with NA

I have a data frame similar to:
df<-as.data.frame(cbind(rep("Canada",6),
c(rep("Alberta",3), rep("Manitoba",2),rep("Unknown_province",1)),
c("Edmonton", "Unknown_city","Unknown_city","Brandon","Unknown_city","Unknown_city")))
colnames(df)<- c("Country","Province","City")
I would like to substitute all entries that contain "Unknown" with NA.
I have tried using grepl, but it removes all entries for that variable if one entry matches, I would like to only replace individual cells.
df[grepl("Unknown", df, ignore.case=TRUE)] <- NA
df1 <- df # This is to ensure that we can refert back to df incase there is an issue
Then you could use any of the following:
is.na(df1) <- array(grepl('Unknown', as.matrix(df1)), dim(df1))
df1
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
or even:
df1[] <- sub("Unknown.*", NA, as.matrix(df1), ignore.case = TRUE)
df1
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
Note that grepl and even sub are vectorized hence no need to use the *aply family or even for loops
Here is one possible way to solve your problem:
df[] <- lapply(df, function(x) ifelse(grepl("Unknown", x, TRUE), NA, x))
df
# Country Province City
# 1 Canada Alberta Edmonton
# 2 Canada Alberta <NA>
# 3 Canada Alberta <NA>
# 4 Canada Manitoba Brandon
# 5 Canada Manitoba <NA>
# 6 Canada <NA> <NA>
Using dplyr
library(dplyr)
library(stringr)
df %>%
mutate(across(everything(),
~ case_when(str_detect(., 'Unknown', negate = TRUE) ~ .)))
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
I like to use replace() in such cases in which values in a vector are replaced or left as is, depending on a condition :
library(dplyr)
library(stringr)
df%>%mutate(across(everything(), ~replace(.x, str_detect(.x, 'Unknown'), NA)))
Country Province City
1 Canada Alberta Edmonton
2 Canada Alberta <NA>
3 Canada Alberta <NA>
4 Canada Manitoba Brandon
5 Canada Manitoba <NA>
6 Canada <NA> <NA>
df[]<- lapply(df, gsub, pattern = "Unknown", replacement = NA, fixed = TRUE)

Having trouble merging/joining two datasets on two variables in R

I realize there have already been many asked and answered questions about merging datasets here, but I've been unable to find one that addresses my issue.
What I'm trying to do is merge to datasets using two variables and keeping all data from each. I've tried merge and all of the join operations from dplyr, as well as cbind and have not gotten the result I want. Usually what happens is that one column from one of the datasets gets overwritten with NAs. Another thing that will happen, as when I do full_join in dplyr or all = TRUE in merge is that I get double the number of rows.
Here's my data:
Primary_State Primary_County n
<fctr> <fctr> <int>
1 AK 12
2 AK Aleutians West 1
3 AK Anchorage 961
4 AK Bethel 1
5 AK Fairbanks North Star 124
6 AK Haines 1
Primary_County Primary_State Population
1 Autauga AL 55416
2 Baldwin AL 208563
3 Barbour AL 25965
4 Bibb AL 22643
5 Blount AL 57704
6 Bullock AL 10362
So I want to merge or join based on Primary_State and Primary_County, which is necessary because there are a lot of duplicate county names in the U.S. and retain the data from both n and Population. From there I can then divide the Population by n and get a per capita figure for each county. I just can't figure out how to do it and keep all of the data, so any help would be appreciated. Thanks in advance!
EDIT: Adding code examples of what I've already described above.
This code (as well as left_join):
countyPerCap <- merge(countyLicense, countyPops, all.x = TRUE)
Produces this:
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians West 1 NA
3 AK Anchorage 961 NA
4 AK Bethel 1 NA
5 AK Fairbanks North Star 124 NA
6 AK Haines 1 NA
This code:
countyPerCap <- right_join(countyLicense, countyPops)
Produces this:
Primary_State Primary_County n Population
<chr> <chr> <int> <int>
1 AL Autauga NA 55416
2 AL Baldwin NA 208563
3 AL Barbour NA 25965
4 AL Bibb NA 22643
5 AL Blount NA 57704
6 AL Bullock NA 10362
Hope that's helpful.
EDIT: This is what happens with the following code:
countyPerCap <- merge(countyLicense, countyPops, all = TRUE)
Primary_State Primary_County n Population
1 AK 12 NA
2 AK Aleutians East NA 3296
3 AK Aleutians West 1 NA
4 AK Aleutians West NA 5647
5 AK Anchorage 961 NA
6 AK Anchorage NA 298192
It duplicates state and county and then adds n to one record and Population in another. Is there a way to deduplicate the dataset and remove the NAs?
We can give column names in merge by mentioning "by" in merge statement
merge(x,y, by=c(col1, col2 names))
in merge statement
I figured it out. There were trailing whitespaces in the Census data's county names, so they weren't matching with the other dataset's county names. (Note to self: Always check that factors match when trying to merge datasets!)
trim.trailing <- function (x) sub("\\s+$", "", x)
countyPops$Primary_County <- trim.trailing(countyPops$Primary_County)
countyPerCap <- full_join(countyLicense, countyPops,
by=c("Primary_State", "Primary_County"), copy=TRUE)
Those three lines did the trick. Thanks everyone!

From monadic to dyadic data in R

For the sake of simplicity, let's say I have a dataset at the country-year level, that lists organizations that received aid from a government, how much money was that, and the type of project. The data frame has "space" for 10 organizations each year, but not every government subsidizes so many organizations each year, so there are a lot a blank spaces. Moreover, they do not follow any order: one organization can be in the first spot one year, and the next year be coded in the second spot. The data looks like this:
> State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2 Org3 Aid3 Proj3 Org4 Aid4 Proj4 ...
Italy 2000 A 1000 Arts B 500 Arts C 300 Social
Italy 2001 B 700 Social A 1000 Envir
Italy 2002 A 1000 Arts C 300 Envir
UK 2000
UK 2001 Z 2000 Social
UK 2002 Z 2000 Social
...
I'm trying to transform this into dyadic data, which would look like this:
> State Org Year Aid Proj
Italy A 2000 1000 Arts
Italy A 2001 1000 Envir
Italy A 2002 1000 Arts
Italy B 2000 500 Arts
Italy B 2001 700 Social
Italy C 2000 300 Social
Italy C 2002 300 Envir
UK Z 2001 2000 Social
...
I'm using R, and the best way I could find was building a pre-defined possible set of dyads —using something like expand.grid(unique(State), unique(Org))— and then looping through the data, finding the corresponding column and filling the data frame. But I don't thing this is the most effective method, so I was wondering whether there would be a better way. I thought about dplyror reshape but can't find a solution.
I know this is a recurring question, but couldn't really find an answer. The most similar question is this one, but it's not exactly the same.
Thanks a lot in advance.
Since you did not use dput, I will try and make some data that resemble yours:
dat = data.frame(State = rep(c("Italy", "UK"), 3),
Year = rep(c(2014, 2015, 2016), 2),
Org1 = letters[1:6],
Aid1 = sample(800:1000, 6),
Proj1 = rep(c("A", "B"), 3),
Org2 = letters[7:12],
Aid2 = sample(600:700, 6),
Proj2 = rep(c("C", "D"), 3),
stringsAsFactors = FALSE)
dat
# State Year Org1 Aid1 Proj1 Org2 Aid2 Proj2
# 1 Italy 2014 a 910 A g 658 C
# 2 UK 2015 b 926 B h 681 D
# 3 Italy 2016 c 834 A i 625 C
# 4 UK 2014 d 858 B j 620 D
# 5 Italy 2015 e 831 A k 650 C
# 6 UK 2016 f 821 B l 687 D
Next I gather the data and then use extract to make 2 new columns and then spread it all again:
library(tidyr)
library(dplyr)
dat %>%
gather(key, value, -c(State, Year)) %>%
extract(key, into = c("key", "num"), "([A-Za-z]+)([0-9]+)") %>%
spread(key, value) %>%
select(-num)
# State Year Aid Org Proj
# 1 Italy 2014 910 a A
# 2 Italy 2014 658 g C
# 3 Italy 2015 831 e A
# 4 Italy 2015 650 k C
# 5 Italy 2016 834 c A
# 6 Italy 2016 625 i C
# 7 UK 2014 858 d B
# 8 UK 2014 620 j D
# 9 UK 2015 926 b B
# 10 UK 2015 681 h D
# 11 UK 2016 821 f B
# 12 UK 2016 687 l D
Is this the desired output?

Extracting parts of data.frame

I have an issue while extracting and creating a new data.frame on the basis of previous one.
So we have:
> head(data.raw)
date id contacted contacted_again region
1 2015-11-29 234 CHAT EMAIL APAC
2 2015-11-29 234 EMAIL EMAIL APAC
3 2015-11-27 257 PHONE PHONE EMEA
4 2015-11-27 278 PHONE EMAIL APAC
5 2015-11-27 293 CHAT EMAIL EMEA
6 2015-11-27 243 EMAIL EMAIL EMEA
market
1 AU/NZ
2 SE Asia (English)
3 Spain
4 China Mainland
5 DACH
6 DACH
However, one I write
data.ru <- data.raw[data.raw$market=="Russia",]
I receive the following mess:
date id contacted contacted_again region market
67 2015-11-25 334 CHAT EMAIL EMEA Russia
NA <NA> <NA> <NA> <NA> <NA> <NA>
NA.1 <NA> <NA> <NA> <NA> <NA> <NA>
NA.2 <NA> <NA> <NA> <NA> <NA> <NA>
NA.3 <NA> <NA> <NA> <NA> <NA> <NA>
NA.4 <NA> <NA> <NA> <NA> <NA> <NA>
How should I write a command to receive just a normal data.frame with all rows that $market=="Russia" without any NAs?
I would just use the subset function.
test <- data.frame(x = c("USA", "USA", "USA", "Russia", "Russia", NA), y = c("Orlando", "Boston", "Memphis", NA, "St. Petersburg", "Mexico City"))
print(test)
x y
1 USA Orlando
2 USA Boston
3 USA Memphis
4 Russia <NA>
5 Russia St. Petersburg
6 <NA> Mexico City
subset(test, x == "Russia")
x y
4 Russia <NA>
5 Russia St. Petersburg
You may want to try: data.ru <- data.raw[data.raw$market %in% "Russia",]
Explanation: I am assuming you have empty lines in your dataset, which are read as NAs (missing value). Since R cannot know if a given NA is equal to "Russia" or not, the generated data frame includes them.
Illustration in code:
# create sample dataset
example.df <- data.frame(market=c(NA, "Russia", NA), outcome = c(1,2,3))
# match market using ==
example.df$market == "Russia"
example.df[example.df$market == "Russia",]
# match market using %in%
example.df$market %in% "Russia"
example.df[example.df$market %in% "Russia",]

Clustering / Matching Over Many Dimensions in R

I have a very large and complex data set with many observations of companies. Some of the observations of the companies are redundant and I need to make a key to map the redundant observations to a single one. However the only way to tell if they are actually representing the same company is through the similarity of a variety of variables. I think the appropriate approach is a kind of clustering based on a variety of conditions or perhaps even some kind of propensity score matching. Perhaps I just need flexible tools for making a complex kind of similarity matrix.
Unfortunately, I am not quite sure how to go about that in R. Most of the tools I've seen for clustering and categorizing seem to do so with either numerical distance or categorical data, but don't seem to allow multiple conditions or user specified conditions.
Below I've tried to create a smaller, public example of the kind of data I am working with and the result I am trying to produce. There are some conditions that must apply, for example, the location must be the same. There are some features that may associate one with another, for example var1 and var2. Then there are some features that may associate one with another, but they must not conflict, such as var3.
An additional layer of complexity is that the kind of association I am trying to use to map the redundant observation varies. For example, id1 and id2 are the same company redundantly entered into the data twice. In one place its name is "apples" and another "red apples". They share the same location, var1 value and var3 (after adjusting for formatting). Similarly ids 3, 5 and 6, are also really just one company, though much of the input for each is different. Some clusters would identify multiple observations, others would only have one. Ideally I would like to find a way to categorize or associate the observations based on several conditions, for example:
1. Test that the location is the same
2. Test whether var3 is different
3. Test whether the names is a substring of others
4. Test the edit distance of names
5. Test the similarity of var1 and var2 between observations
Anyways, hopefully there are better, more flexible tools for this than what I am finding or someone has experience with this kind of data work in R. Any and all suggestions and advice are much appreciated!
Data
id name location var1 var2 var3
1 apples US 1 abc 12345
2 red apples US 1 NA 12-345
3 green apples Mexico 2 def 235-92
4 bananas Brazil 2 abc NA
5 oranges Mexico 2 NA 23592
6 green apple Mexico NA def NA
7 tangerines Honduras NA abc 3498
8 mango Honduras 1 NA NA
9 strawberries Honduras NA abcd 3498
10 strawberry Honduras NA abc 3498
11 blueberry Brazil 1 abcd 2348
12 blueberry Brazil 3 abc NA
13 blueberry Mexico NA def 1859
14 bananas Brazil 1 def 2348
15 blackberries Honduras NA abc NA
16 grapes Mexico 6 qrs NA
17 grapefruits Brazil 1 NA 1379
18 grapefruit Brazil 2 bcd 1379
19 mango Brazil 3 efaq NA
20 fuji apples US 4 NA 189-35
Result
id name location var1 var2 var3 Result
1 apples US 1 abc 12345 1
2 red apples US 1 NA 12-345 1
3 green apples Mexico 2 def 235-92 3
4 bananas Brazil 2 abc NA 4
5 oranges Mexico 2 NA 23592 3
6 green apple Mexico NA def NA 3
7 tangerines Honduras NA abc 3498 7
8 mango Honduras 1 NA NA 8
9 strawberries Honduras NA abcd 3498 7
10 strawberry Honduras NA abc 3498 7
11 blueberry Brazil 1 abcd 2348 11
12 blueberry Brazil 3 abc NA 11
13 blueberry Mexico NA def 1859 13
14 bananas Brazil 1 def 2348 11
15 blackberries Honduras NA abc NA 15
16 grapes Mexico 6 qrs NA 16
17 grapefruits Brazil 1 NA 1379 17
18 grapefruit Brazil 2 bcd 1379 17
19 mango Brazil 3 efaq NA 19
20 fuji apples US 4 NA 189-35 20
Thanks in advance for your time and help!
library(stringdist)
getMatches <- function(df, tolerance=6){
out <- integer(nrow(df))
for(row in 1:nrow(df)){
dists <- numeric(nrow(df))
for(col in 1:ncol(df)){
tempDist <- stringdist(df[row, col], df[ , col], method="lv")
# WARNING: Matches NA perfectly.
tempDist[is.na(tempDist)] <- 0
dists <- dists + tempDist
}
dists[row] <- Inf
min_dist <- min(dists)
if(min_dist < tolerance){
out[row] <- which.min(dists)
}
else{
out[row] <- row
}
}
return(out)
}
test$Result <- getMatches(test[, -1])
Where test is your data. This probably definitely needs some refining and certainly needs some postprocessing. This creates a column with the index of the closest match. If it can't find a match within the given tolerance, it returns the index of itself.
EDIT: I will attempt some more later.

Resources