Additional rows of NAs showing in result output - R

I'm learning R and there is one issue I am facing while running my code. I wrote code to get the data for NY (New York), but some additional rows consisting entirely of NAs are showing up. Please help.
Dummy Data:
ID Name Industry Inception Employees State City Revenue Expenses
1 Over-Hex Software 2006 25 TN Franklin 9,684,527 1,130,700
2 Unimattax IT 2009 36 NY New York 14,016,543 804,035
3 Greenfax Retail 2012 NA SC Greenville 9,746,272 1,044,375
4 Blacklane IT 2011 66 NY New York 15,359,369 4,631,808
Result output:
2 Unimattax IT 2009 36 NY New York 14,016,543 804,035
NA <NA> <NA> NA NA <NA> <NA> NA NA
4 Blacklane IT 2011 66 NY New York 15,359,369 4,631,808
NA <NA> <NA> NA NA <NA> <NA> NA NA
fin[fin$State == "NY",] # fin is the table name

It is likely an issue with NA elements in 'State': comparing NA with == returns NA rather than TRUE or FALSE, and indexing rows with NA produces the all-NA rows you see. To avoid that, create an & expression with !is.na to turn those NA elements into FALSE.
fin[fin$State == "NY" & !is.na(fin$State),]
Another option is %in%, which generates FALSE for NA:
fin[fin$State %in% "NY",]
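To see why the == version misbehaves, here is a minimal sketch (st is a hypothetical State column with one NA, for illustration): comparing NA with == yields NA rather than FALSE, and those NA positions are what become the all-NA rows when you index the data frame.
st <- c("TN", "NY", NA, "NY")  # hypothetical State column containing an NA
st == "NY"
# [1] FALSE  TRUE    NA  TRUE
st %in% "NY"
# [1] FALSE  TRUE FALSE  TRUE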

Related

Calculating CAGR with differing time-based data. Issue with variable column referencing in R

I have a table of Calpers Private Equity Fund performance from several years. I cleaned and joined all the data into a large table with 186 entries for individual fund investments. Some of these funds have data for 5 years, most for 4 or fewer. I would like to calculate the CAGR for each fund using the earliest value and the latest value in the formula:
CAGR = (Latest/First)^(1/n) - 1
The columns with the data are named 2017, 2018, 2019, 2020, 2021, so the formula in R will look something like this (calper is the table with all the data; one fund per row):
idx<- which(startsWith(names(calperMV),"2")) # locate columns with data needed for CAGR calc
idx <- rev(idx) # match to NCOL_NA order ...
The values of idx here are (6, 5, 4, 3, 2), which are the column numbers for 2021, 2020, 2019, 2018, 2017.
The indx column was formed by counting the number of NAs in each row; all the NAs run left to right, so each row's count serves as a reference into idx and thus selects the correct column.
I use !!sym(as.String()) with names()[idx[indx]] to pull out the column names symbolically:
calperMV %>% rowwise() %>%
mutate(CAGR=`2021`/!!sym((colnames(.)[idx[indx]])^(1/(5-indx))-1))))
The problem is that the referencing either does not work correctly or produces this error:
"Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems?"
I've tried creating test code which shows the addressing is working:
calper %>% rowwise() %>% mutate(test = (names(.)[idx[indx]]),
test1= !!sym(as.String(names(.)[idx[1]])),
test2= !!sym(as.String(names(.)[idx[2]])),
test3= !!sym(as.String(names(.)[idx[3]])),
test4= !!sym(as.String(names(.)[idx[4]])),
test5= !!sym(as.String(names(.)[idx[5]])))
But when I do the full CAGR calc I get that recursive error. Here's a tibble of the test data for reference:
Input data:
Security Name 2017 2018 2019 2020 2021 NA_cols indx
ASIA ALT NA NA NA 6,256,876.00 7,687,037.00 3 2
ASIA ALT NA NA NA 32,549,704.00 34,813,844.00 3 2
AVATAR NA NA NA NA 700,088.00 - 3 2
AVENUE FUND VI (A) NA NA NA 10,561,674.00 19,145,496.00 3 2
BDC III C NA 48,098,429.00 85,808,280.00 100,933,699.00 146,420,669.00 1 4
BIRCH HILL NA NA NA 6,488,941.00 9,348,941.00 3 2
BLACKSTONE NA NA NA 4,011,072.00 2,406,075.00 3 2
BLACKSTONE IV NA NA NA 4,923,625.00 3,101,081.00 3 2
BLACKSTONE V NA NA NA 18,456,472.00 17,796,711.00 3 2
BLACKSTONE VI NA NA NA 245,269,656.00 310,576,064.00 3 2
BLACKSTONE VII NA NA NA 465,415,036.00 607,172,062.00 3 2
Results: the indexing selects the proper string and also selects the proper number from the column, but fails when I operate on the selected variable:
selYR test1 test2 test3 test4 test5
2020 7,687,037.00 6,256,876.00 NA NA NA
2020 34,813,844.00 32,549,704.00 NA NA NA
2020 - 700,088.00 NA NA NA
2020 19,145,496.00 10,561,674.00 NA NA NA
2018 146,420,669.00 100,933,699.00 85,808,280.00 48,098,429.00 NA
2020 9,348,941.00 6,488,941.00 NA NA NA
2020 2,406,075.00 4,011,072.00 NA NA NA
2020 3,101,081.00 4,923,625.00 NA NA NA
2020 17,796,711.00 18,456,472.00 NA NA NA
2020 310,576,064.00 245,269,656.00 NA NA NA
2020 607,172,062.00 465,415,036.00 NA NA NA
(Sorry ... I don't know how to put these into proper columns :( )
I never learned all those fancy tidystuff techniques. Here's a base R approach:
First and second: use read.delim to bring in the tab-delimited data, and note that your data has (yeccch) commas in the numbers, so those need to be stripped before conversion to numeric.
(Ignore the warnings; they are correct and you do want the NAs.)
calpDat <- read.delim(text=calpTab)
calpDat[2:6] <- lapply(calpDat[2:6], function(x) as.numeric(gsub("[,]", "",x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion
Note that lapply in this case returns a list of numeric vectors, which can be assigned back into the original dataframe to overwrite the original character values. Alternatively, you could have created new columns, which would then get the same treatment as below. Now that the data is in, you can count the number of valid numbers and then calculate the CAGR for each row, using apply on the numeric columns in rowwise fashion:
calpDat$CAGR <- apply(calpDat[2:6], 1, function(rw) {
  n <- length(na.omit(rw))        # number of valid (non-NA) values in the row
  (rw[5] / rw[6 - n])^(1/n) - 1   # latest year over the earliest valid year
})
calpDat
#----------------
Security.Name X2017 X2018 X2019 X2020 X2021 NA_cols indx CAGR
1 ASIA ALT NA NA NA 6256876 7687037 3 2 0.10841071
2 ASIA ALT NA NA NA 32549704 34813844 3 2 0.03419508
3 AVATAR NA NA NA NA 700088 NA 3 2 NA
4 AVENUE FUND VI (A) NA NA NA 10561674 19145496 3 2 0.34637777
5 BDC III C NA 48098429 85808280 100933699 146420669 1 4 0.32089372
6 BIRCH HILL NA NA NA 6488941 9348941 3 2 0.20031241
7 BLACKSTONE NA NA NA 4011072 2406075 3 2 -0.22549478
8 BLACKSTONE IV NA NA NA 4923625 3101081 3 2 -0.20637732
9 BLACKSTONE V NA NA NA 18456472 17796711 3 2 -0.01803608
10 BLACKSTONE VI NA NA NA 245269656 310576064 3 2 0.12528383
11 BLACKSTONE VII NA NA NA 465415036 607172062 3 2 0.14218298
Problems remaining: funds that did not have a value in the most recent year, and funds that might have had discontinuous reporting. You need to say how these should be handled, and provide example data, if you want tested solutions.
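As a starting point for the first of those problems, here is one sketch (untested against your full data): instead of hard-coding rw[5] as the latest value, index by the positions of the non-NA entries, so a fund whose last report is not 2021 still gets a CAGR from its first and last valid values.
calpDat$CAGR <- apply(calpDat[2:6], 1, function(rw) {
  ok <- which(!is.na(rw))                   # positions of valid values
  if (length(ok) < 2) return(NA_real_)      # fewer than two data points: no CAGR
  n <- length(ok)
  (rw[max(ok)] / rw[min(ok)])^(1/n) - 1     # last valid value over first valid value
})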

Populating data based on column/row names with unequal row numbers [duplicate]

This question already has answers here:
Match values in data frame with values in another data frame and replace former with a corresponding pattern from the other data frame
(3 answers)
Closed 4 years ago.
I need to populate an empty data frame with values based on the values in the first column (or alternatively row names; either works for me in this case). Here are three objects:
set.seed(11)
empty_df=data.frame(cities=c("New York","London","Rome","Vienna","Amsterdam"),
col.a=rep(NA,5),
col.b=rep(NA,5),
col.c=rep(NA,5))
values=rnorm(4,0,1)
to_fill=data.frame(cities=c("New York","London","Vienna","Amsterdam"),
col.a=values)
desired_output=data.frame(cities=c("New York","London","Rome","Vienna","Amsterdam"),
col.a=c(values[1],values[2],NA,values[3],values[4]),
col.b=rep(NA,5),
col.c=rep(NA,5))
The first column (which can be converted to row names; a solution using either row names or the city-name column is fine) contains some cities I would like to visit; the other columns hold as-yet-unspecified values. The first object is the empty df I want to fill, and it prints as:
cities col.a col.b col.c
1 New York NA NA NA
2 London NA NA NA
3 Rome NA NA NA
4 Vienna NA NA NA
5 Amsterdam NA NA NA
The second is the object I want to put INTO the empty df; as you can see, it is missing one row ("Rome"):
cities col.a
1 New York 0.55213218
2 London 0.98907729
3 Vienna 1.11703741
4 Amsterdam -0.04616725
So now I want to put this inside the empty df, leaving NA in the row which does not match:
cities col.a col.b col.c
1 New York -0.62731870 NA NA
2 London -1.80206612 NA NA
3 Rome NA NA NA
4 Vienna -1.73446286 NA NA
5 Amsterdam -0.05709419 NA NA
I tried the simplest merge solution, merge(empty_df, to_fill, by="cities"), which gives:
cities col.a.x col.b col.c col.a.y
1 Amsterdam NA NA NA -0.05709419
2 London NA NA NA -1.80206612
3 New York NA NA NA -0.62731870
4 Vienna NA NA NA -1.73446286
And when I tried desired_output$col.a = merge(empty_df, to_fill, by="cities"), an error occurred (replacement has 4 rows, data has 5). Is there any simple solution that can be put in a for loop or apply?
We can use match:
empty_df$col.a <- to_fill$col.a[match(empty_df$cities, to_fill$cities)]
empty_df
# cities col.a col.b col.c
#1 New York 1.5567564 NA NA
#2 London -0.6969401 NA NA
#3 Rome NA NA NA
#4 Vienna 1.3336636 NA NA
#5 Amsterdam 0.7329989 NA NA
We fill col.a of empty_df with col.a values from to_fill by matching cities from empty_df with cities from to_fill.
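To see what match is doing, run it on its own: it returns, for each city in empty_df, the position of that city in to_fill, with NA where there is no match, which is exactly what indexes the NA for Rome.
match(empty_df$cities, to_fill$cities)
# [1]  1  2 NA  3  4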

Split text string into column based on variable

I have a dataframe with a text column that I would like to split into multiple columns, since the text string contains multiple variables, such as location, education, distance, etc.
Dataframe:
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
df = data.frame(text.string)
df
text.string
1 &location=NY&distance=30&education=University
2 &location=CA&distance=30&education=Highschool&education=University
3 &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business
I can split this using cSplit: cSplit(df, 'text.string', sep = "&"):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1: NA location=NY distance=30 education=University NA NA
2: NA location=CA distance=30 education=Highschool education=University NA
3: NA location=MN distance=10 industry=Healthcare NA NA
4: NA location=VT distance=30 education=University industry=IT industry=Business
The problem is that a text string may contain multiples of the same variable, or may be missing a certain variable. With cSplit, the grouping of the variables per column becomes all mixed up. I would like to avoid this and group them together.
So it would look similar to this (education and industry no longer appear in multiple columns):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1 NA location=NY distance=30 education=University <NA> NA
2 NA location=CA distance=30 education=Highschool education=University <NA> NA
3 NA location=MN distance=10 <NA> industry=Healthcare NA
4 NA location=VT distance=30 education=University industry=IT industry=Business NA
Taking into account @NicE's comment:
This is one way, following your example:
library(data.table)
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
clean <- strsplit(text.string, "&|=")
out <- lapply(clean, function(x){
  ma <- data.table(matrix(x[x != ""], nrow = 2, byrow = FALSE)) # row 1: names, row 2: values
  setnames(ma, as.character(ma[1, ])) # promote the first row to column names
  ma[-1, ] # drop the name row, keeping only the values
})
out <- rbindlist(out, fill = TRUE)
out
location distance education education industry industry
1: NY 30 University NA NA NA
2: CA 30 Highschool University NA NA
3: MN 10 NA NA Healthcare NA
4: VT 30 University NA IT Business
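For reference, this is what the intermediate strsplit step produces for the first string; the leading empty string (from the string starting with "&") is what the x[x != ""] filter removes:
strsplit(text.string[1], "&|=")[[1]]
# [1] ""           "location"   "NY"         "distance"   "30"         "education"  "University"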

Clustering / Matching Over Many Dimensions in R

I have a very large and complex data set with many observations of companies. Some of the observations of the companies are redundant and I need to make a key to map the redundant observations to a single one. However the only way to tell if they are actually representing the same company is through the similarity of a variety of variables. I think the appropriate approach is a kind of clustering based on a variety of conditions or perhaps even some kind of propensity score matching. Perhaps I just need flexible tools for making a complex kind of similarity matrix.
Unfortunately, I am not quite sure how to go about that in R. Most of the tools I've seen for clustering and categorizing seem to do so with either numerical distance or categorical data, but don't seem to allow multiple conditions or user specified conditions.
Below I've tried to create a smaller, public example of the kind of data I am working with and the result I am trying to produce. There are some conditions that must apply, for example, the location must be the same. There are some features that may associate one with another, for example var1 and var2. Then there are some features that may associate one with another, but they must not conflict, such as var3.
An additional layer of complexity is that the kind of association I am trying to use to map the redundant observations varies. For example, id1 and id2 are the same company redundantly entered into the data twice: in one place its name is "apples" and in another "red apples". They share the same location, var1 value, and var3 (after adjusting for formatting). Similarly, ids 3, 5 and 6 are really just one company, though much of the input for each is different. Some clusters would contain multiple observations, others only one. Ideally I would like to find a way to categorize or associate the observations based on several conditions, for example:
1. Test that the location is the same
2. Test whether var3 is different
3. Test whether the name is a substring of other names
4. Test the edit distance of names
5. Test the similarity of var1 and var2 between observations
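To make a couple of these concrete, here are rough base R sketches of what I mean (using values from the data below):
# Condition 3: is one name a substring of another?
grepl("green apple", "green apples", fixed = TRUE)
# [1] TRUE
# Condition 2: does var3 match once formatting is stripped?
gsub("-", "", "235-92") == "23592"
# [1] TRUE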
Anyways, hopefully there are better, more flexible tools for this than what I am finding or someone has experience with this kind of data work in R. Any and all suggestions and advice are much appreciated!
Data
id name location var1 var2 var3
1 apples US 1 abc 12345
2 red apples US 1 NA 12-345
3 green apples Mexico 2 def 235-92
4 bananas Brazil 2 abc NA
5 oranges Mexico 2 NA 23592
6 green apple Mexico NA def NA
7 tangerines Honduras NA abc 3498
8 mango Honduras 1 NA NA
9 strawberries Honduras NA abcd 3498
10 strawberry Honduras NA abc 3498
11 blueberry Brazil 1 abcd 2348
12 blueberry Brazil 3 abc NA
13 blueberry Mexico NA def 1859
14 bananas Brazil 1 def 2348
15 blackberries Honduras NA abc NA
16 grapes Mexico 6 qrs NA
17 grapefruits Brazil 1 NA 1379
18 grapefruit Brazil 2 bcd 1379
19 mango Brazil 3 efaq NA
20 fuji apples US 4 NA 189-35
Result
id name location var1 var2 var3 Result
1 apples US 1 abc 12345 1
2 red apples US 1 NA 12-345 1
3 green apples Mexico 2 def 235-92 3
4 bananas Brazil 2 abc NA 4
5 oranges Mexico 2 NA 23592 3
6 green apple Mexico NA def NA 3
7 tangerines Honduras NA abc 3498 7
8 mango Honduras 1 NA NA 8
9 strawberries Honduras NA abcd 3498 7
10 strawberry Honduras NA abc 3498 7
11 blueberry Brazil 1 abcd 2348 11
12 blueberry Brazil 3 abc NA 11
13 blueberry Mexico NA def 1859 13
14 bananas Brazil 1 def 2348 11
15 blackberries Honduras NA abc NA 15
16 grapes Mexico 6 qrs NA 16
17 grapefruits Brazil 1 NA 1379 17
18 grapefruit Brazil 2 bcd 1379 17
19 mango Brazil 3 efaq NA 19
20 fuji apples US 4 NA 189-35 20
Thanks in advance for your time and help!
library(stringdist)
getMatches <- function(df, tolerance = 6){
  out <- integer(nrow(df))
  for(row in 1:nrow(df)){
    dists <- numeric(nrow(df))
    for(col in 1:ncol(df)){
      # Edit distance between this row's value and every row's value in the column
      tempDist <- stringdist(df[row, col], df[ , col], method = "lv")
      # WARNING: Matches NA perfectly.
      tempDist[is.na(tempDist)] <- 0
      dists <- dists + tempDist # accumulate distances across all columns
    }
    dists[row] <- Inf # never match a row to itself
    min_dist <- min(dists)
    if(min_dist < tolerance){
      out[row] <- which.min(dists)
    } else {
      out[row] <- row # no close match: map the row to itself
    }
  }
  return(out)
}
test$Result <- getMatches(test[, -1])
Where test is your data. This definitely needs some refining and certainly needs some postprocessing. It creates a column with the index of the closest match; if no match is found within the given tolerance, it returns the row's own index.
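For intuition about what is being summed: stringdist with method = "lv" is the Levenshtein edit distance, the number of single-character insertions, deletions, and substitutions needed to turn one string into the other. For example:
stringdist("apples", "red apples", method = "lv")
# [1] 4
stringdist("strawberries", "strawberry", method = "lv")
# [1] 3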
EDIT: I will attempt some more later.

Searching for greater/less than values with NAs

I have a dataframe for which I've calculated and added a difftime column:
name amount 1st_date 2nd_date days_out
JEAN 318.5 1971-02-16 1972-11-27 650 days
GREGORY 1518.5 <NA> <NA> NA days
JOHN 318.5 <NA> <NA> NA days
EDWARD 318.5 <NA> <NA> NA days
WALTER 518.5 1971-07-06 1975-03-14 1347 days
BARRY 1518.5 1971-11-09 1972-02-09 92 days
LARRY 518.5 1971-09-08 1972-02-09 154 days
HARRY 318.5 1971-09-16 1972-02-09 146 days
GARRY 1018.5 1971-10-26 1972-02-09 106 days
I want to break it out and take subtotals where days_out is 0-60, 61-90, 91-120, 121-180.
For some reason I can't even reliably write bracket notation. I would expect
members[members$days_out<=120, ] to show just Barry and Garry, but I get a whole lot of lines like:
NA.1095 <NA> NA <NA> <NA> NA days
NA.1096 <NA> NA <NA> <NA> NA days
NA.1097 <NA> NA <NA> <NA> NA days
Those don't exist in the original data. There's no one without a name. What am I doing wrong here?
This is standard behavior for < and other relational operators: when asked to evaluate whether NA is less than (or greater than, or equal to, or ...) some other number, they return NA, rather than TRUE or FALSE.
Here's an example that should make clear what is going on and point to a simple fix.
x <- c(1, 2, NA, 4, 5)
x[x < 3]
# [1] 1 2 NA
x[x < 3 & !is.na(x)]
# [1] 1 2
To see why all of those rows indexed by NA's have row.names like NA.1095, NA.1096, and so on, try this:
data.frame(a=1:2, b=1:2)[rep(NA, 5),]
# a b
# NA NA NA
# NA.1 NA NA
# NA.2 NA NA
# NA.3 NA NA
# NA.4 NA NA
If you are working at the console, the subset function does not have that annoying 'feature', which is actually due to the behavior of [ more than to the relational operators:
subset(members, days_out <= 120)
If you are programming, then you can use which, or Josh's conjunction with & !is.na(.), which is what which() does behind the scenes:
members[ which(members$days_out <= 120), ]
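which() works because it returns only the positions where the condition is TRUE, silently dropping the NAs:
x <- c(1, 2, NA, 4, 5)
which(x < 3)
# [1] 1 2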
