How to subset variables which have NA in the values? - r

I have an IMDb dataset in which I would like to fill in the missing values for Budget and Box_Office_Gross, and I think multiple imputation would be a suitable way to do that.
To separate the numeric columns from the rest of the dataset before imputing, I tried to subset the variables:
> NBCU_Limited <- subset(NBCU_dataLaurel_Modified, select = c(NBCU_dataLaurel_Modified$imdb_votes, NBCU_dataLaurel_Modified$runtime_min, NBCU_dataLaurel_Modified$Budget, NBCU_dataLaurel_Modified$Box_Office_Gross))
Error: NA column indexes not supported
But I get an error because the variables contain NA values. I cannot exclude the character columns by negation instead, because they also contain NAs and I get the same error.
How do I extract only these four variables into a new data frame so that I can perform multiple imputation on them?
Sample Dataset
Below is the data,
> dput(Sample)
structure(list(imdbid = c("tt6256056", "tt0085450", "tt5050772",
"tt5069876", "tt0083791", "tt0083929"), title = c("Una Famiglia",
"Doctor Detroit", "Honeytrap", "Maniac 8.2.8", "The Dark Crystal",
"Fast Times at Ridgemont High"), plot = c("N/A", "A timid college professor, conned into posing as a flamboyant pimp, finds himself enjoying his new occupation on the streets.",
"Simeon's evening goes horribly wrong when a young woman tries to pick him up.",
"Maniac: a person afflicted with mania. Mania: A manifestation of bipolar disorder, characterized by profuse and rapidly changing ideas, exaggerated sexuality, gaiety, or irritability, decreased sleep and violent abnormal behavior.",
"On another planet in the distant past, a Gelfling embarks on a quest to find the missing shard of a magical crystal, and so restore order to his world.",
"A group of Southern California high school students are enjoying their most important subjects: sex, drugs and rock n' roll."
), rating = c("N/A", "R", "N/A", "N/A", "PG", "R"), imdb_rating = c(NA,
5.1, NA, NA, 7.2, 7.2), metacritic = c(NA, NA, NA, NA, NA, 67
), dvd_release = structure(c(NA, 1126569600, NA, NA, 939081600,
1099353600), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
production = c("N/A", "Universal", "Array Releasing", "N/A",
"Sony Pictures Home Entertainment", "Universal Pictures"),
actors = c("Patrick Bruel, Fortunato Cerlino, Matilda De Angelis, Ennio Fantastichini",
"Dan Aykroyd, Howard Hesseman, Donna Dixon, Lydia Lei", "Jennifer Nelson, Daemian Greaves, Polina Vasileva, Becki Lloyd",
"Dimitra Aggelou, Giorgos Efthimiou, Stavroula Kontopoulou, Maria-Antouanetta Tatsi",
"Jim Henson, Kathryn Mullen, Frank Oz, Dave Goelz", "Sean Penn, Jennifer Jason Leigh, Judge Reinhold, Robert Romanus"
), imdb_votes = c(NA, 4492, NA, NA, 44862, 76980), poster = c("N/A",
"https://images-na.ssl-images-amazon.com/images/M/MV5BMjhjY2Q4NWEtYTUzZC00YjE2LTk0ZjktNzUyZjIwNmQ0YTkyXkEyXkFqcGdeQXVyMTQxNzMzNDI#._V1_SX300.jpg",
"N/A", "https://images-na.ssl-images-amazon.com/images/M/MV5BZjdmZTRhYzgtOGY4MS00OGM5LWJlNmItYzJiYjZiNmVmYjhkXkEyXkFqcGdeQXVyNDA2NjM2ODk#._V1_SX300.jpg",
"https://images-na.ssl-images-amazon.com/images/M/MV5BMWZlZjk1MGEtYWMzOC00N2EyLWFkOTUtZDM4NGNlY2M0YjVmXkEyXkFqcGdeQXVyNTAyODkwOQ##._V1_SX300.jpg",
"https://images-na.ssl-images-amazon.com/images/M/MV5BYzBlZjE1MDctYjZmZC00ZTJmLWFkOWEtYjdmZDZkODBkZmI2XkEyXkFqcGdeQXVyNjQ2MjQ5NzM#._V1_SX300.jpg"
), director = c("Sebastiano Riso", "Michael Pressman", "Nick Archer",
"Giorgos Efthimiou", "Jim Henson, Frank Oz", "Amy Heckerling"
), release_date = structure(c(1493596800, 421027200, 1448928000,
1431734400, 408931200, 398044800), class = c("POSIXct", "POSIXt"
), tzone = "UTC"), Year = c(2017, 1983, 2015, 2015, 1982,
1982), Year_Groups = c("2010-2020", "1980-1989", "2010-2020",
"2010-2020", "1980-1989", "1980-1989"), Month = c("May",
"May", "December", "May", "December", "August"), runtime_min = c(97,
89, NA, 15, 93, 90), genre = c("Drama", "Comedy", "Short, Thriller",
"Short, Horror", "Adventure, Family, Fantasy", "Comedy, Drama"
), awards = c("N/A", "N/A", "N/A", "1 win.", "Nominated for 1 BAFTA Film Award. Another 2 wins & 4 nominations.",
"1 win & 1 nomination."), keywords = c(NA, "pimp|college-professor|voyeurism|voyeur|blue-panties|panties|red-dress|blonde|female-frontal-nudity|female-nudity|nude-girl|nude|bare-breasts|breasts|topless-female-nudity|scantily-clad-female|cleavage|two-word-title|reference-to-joe-frazier|reference-to-yul-brynner|mother-son-relationship|f-word|place-name-in-title|city-name-in-title|dual-identity|prostitution|independent-film|title-spoken-by-character|character-name-in-title",
NA, NA, "mystic|magical-crystal|crystal-shard|sword-and-sorcery|puppetry|crystal|shard|quest|evil|monster|feeding-on-energy|hidden-entrance|giant-crystal|actor-voicing-multiple-characters|planetary-alignment|reunification|three-word-title|dark-fantasy|slow-motion-scene|vampire|surrealism|christ-allegory|cult-film|sorceress|relic|race-against-time|muppet|mission|magic|kingdom|creature|good-versus-evil|directed-by-star|epic|multiple-monsters|invented-language|slavery|orrery|puppet|mutation|darkness|destiny",
"high-school|title-directed-by-female|females-talking-about-sex|unwanted-pregnancy|fired-from-the-job|teacher-student-relationship|irreverence|sexual-awakening|innocence-lost|ensemble-film|coming-of-age|teen-movie|high-school-teacher|advice|ticket-scalping|shopping-mall|loss-of-virginity|female-nudity|brother-sister-relationship|caught-masturbating|california|surfer|teacher|break-up|rock-'n'-roll|virgin|teenager|friendship|drugs|date|surfer-dude|blond-boy|redheaded-boy|generation-x|f-rated|vomiting|sex-scene|cult-film|breasts|jeans|hawaiian-shirt|payphone|teenage-girl|teen-sex-comedy|scantily-clad-female|reference-to-led-zeppelin|dream-girl|underage-girl|jailbait|trophy-wife|voyeur|sexual-promiscuity|sexual-desire|sexual-attraction|lust|sex-on-couch|female-rear-nudity|female-frontal-nudity|panties|cheerleader-uniform|female-removes-her-clothes|cleavage|marijuana|drug-use|teen-angst|surfing|school-life|pregnancy|masturbation|football-player|first-love|employment|bikini|stoner|rock-m... <truncated>
), Budget = c(NA, 10375893, NA, NA, 1.5e+07, 4500000), Box_Office_Gross = c(2.48,
70, 70, 124, 140, 140)), .Names = c("imdbid", "title", "plot",
"rating", "imdb_rating", "metacritic", "dvd_release", "production",
"actors", "imdb_votes", "poster", "director", "release_date",
"Year", "Year_Groups", "Month", "runtime_min", "genre", "awards",
"keywords", "Budget", "Box_Office_Gross"), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))

Update: The error occurs because I prefix each column with the data frame name inside the subset call. If I give only the bare variable names, without the data frame, I do not get this error. I am not sure why, so it is probably just a mistake in my code. Thanks @Tung for pointing this out.
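A possible reading of the update: `NBCU_dataLaurel_Modified$Budget` evaluates to the column's values, so `select` receives data values (NAs included) where it expects column names or positions. The sketch below uses a small made-up data frame (`movies` and its values are assumptions, and it is a plain data.frame rather than the tibble in the question, so the error text differs) to compare the failing call with two ways of selecting by name.

```r
# Made-up stand-in for the real data (values are illustrative only)
movies <- data.frame(
  title = c("A", "B", "C"),
  imdb_votes = c(NA, 4492, 44862),
  runtime_min = c(97, 89, NA),
  Budget = c(NA, 10375893, 1.5e7),
  Box_Office_Gross = c(2.48, 70, 140),
  stringsAsFactors = FALSE
)

# Passing column *values* to select: they are treated as column indexes
bad <- tryCatch(
  subset(movies, select = c(movies$imdb_votes, movies$Budget)),
  error = function(e) e
)
print(inherits(bad, "error"))

# Selecting by bare name inside subset(), or by a character vector with [
num_cols <- c("imdb_votes", "runtime_min", "Budget", "Box_Office_Gross")
limited <- subset(movies, select = c(imdb_votes, runtime_min, Budget, Box_Office_Gross))
limited_by_name <- movies[, num_cols]
print(limited)
```

Both selections keep the NAs in place, which is what an imputation routine expects as input.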

Related

Why would script call round_any despite not being explicitly called?

I've been struggling with this script for the past month and still haven't been able to answer this question. I know round_any is a function in the plyr package, but I never load plyr. I checked all my other packages using ls("package: ") and none of them has this function. Nothing I have found online points me in the right direction. In browser() I can see that the column's type is double [4] (S3: integer64). Am I better off just changing the class from integer64, or should I find out how to remove the round_any call?
Error in `mutate()`:
! Problem while computing `..2 = across(...)`.
Caused by error in `across()`:
! Problem while computing column `Property Count`.
Caused by error in `UseMethod()`:
! no applicable method for 'round_any' applied to an object of class "integer64"
Edit:
This .r file contains all the functions and I have another .r file that calls them.
Argument/Call
market_stats_table = key_market_stats(historical_stats, historical_stats_by_class, report_quarter)
Function
total_stats = stats_combined %>%
  rename(`Net Absorption` = `Net Absorption QTD - Total`,
         `Net Absorption YTD` = `Net Absorption YTD - Total`,
         `Construction Deliveries` = `Construction Deliveries QTD`) %>%
  filter(Submarket %in% submarket_order) %>%
  mutate(Submarket = factor(Submarket, levels = submarket_order, ordered = TRUE)) %>%
  arrange(Submarket) %>%
  # glimpse()
  mutate(across(c(`Direct Vacancy Rate`, `Overall Vacancy Rate`, `Overall Availability Rate`), scales::percent, accuracy = .1),
         across(any_of(sum_vars),
                scales::dollar, accuracy = 1, style_negative = "parens", prefix = ""),
         across(any_of(c("Full Service Gross Asking Rate", "Lease Rate")),
                scales::dollar)) %>%
  select(Submarket, all_of(stat_order))
market_stats_table(total_stats,
                   cell_width = if_else(property_type == "Office", 1.05, .9),
                   cell_height = if_else(property_type == "Office", .33, .27),
                   submarket_order,
                   totals)
}
structure
structure(list(Market = c("Los Angeles", "0.4", NA, "0.5", "0.3",
"New York"), `Property Count` = c("New York", "0.3", NA, "0.2",
"0.9", "New York"), C = c("Chicago", "0.1", NA, "0.4", "0.3",
"DC"), D = c("DC", "0.7", NA, "0", "0.2", "DC"), e = c("Miami",
"0.8", NA, "0.2", "0.1", "Los Angeles")), row.names = c(NA, 6L
), class = "data.frame")
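The error text points at S3 dispatch failing inside some function the script calls indirectly, rather than at a direct call to plyr. The sketch below reproduces that mechanism in base R. Everything in it is a stand-in: `round_any_demo` is a hypothetical generic, and the integer64 object is faked with `structure()` instead of being created by bit64. It also assumes, without confirmation from the post, that the real `round_any` is reached through the scales formatters passed to `across()`.

```r
# Hypothetical generic that only has a method for plain numerics
round_any_demo <- function(x, accuracy) UseMethod("round_any_demo")
round_any_demo.numeric <- function(x, accuracy) round(x / accuracy) * accuracy

# Fake "integer64": a double vector carrying that class attribute
# (a real one would come from the bit64 package)
fake_int64 <- structure(c(12, 37), class = "integer64")

# Dispatch follows the class attribute, finds no integer64 method, and fails
msg <- tryCatch(
  round_any_demo(fake_int64, 5),
  error = function(e) conditionMessage(e)
)
print(msg)

# Converting to a plain double first lets the numeric method run.
# With a real integer64 column this conversion relies on bit64's own
# as.double method, so bit64 must be available in the session.
plain <- as.numeric(fake_int64)
print(round_any_demo(plain, 5))
```

If that assumption holds, converting the affected columns to double before the formatting step would sidestep the missing method, at the cost of exactness for values beyond 2^53.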

for loop in R to compute the yearly evolution of a variable

Here is the structure of my dataset, for reproducibility:
structure(list(numero = c("133", "62", "75", "76", "86", "281"
), tranche_age = c("20-30", "20-30", "20-30", "20-30", "20-30",
"20-30"), tranche_anciennete = c("5 ans et moins", "5 à 10 ans",
"5 ans et moins", "5 ans et moins", "5 à 10 ans", "5 à 10 ans"
), code_statut = c("C", "E", "E", "E", "E", "E"), code_contrat = c("A",
"A", "A", "A", "A", "A"), taux_demploi_mois = c(100, 100, 100,
100, 100, 100), echelon = c("E1", NA, NA, NA, NA, NA), niveau = c("N7",
NA, NA, NA, NA, NA), brut_mensuel = c(NA, 786.13, 1156.95, 1156.95,
904.79, 904.79), estimation_annuelle = c(NA, 10219.69, 15040.35,
15040.35, 11762.27, 11762.27), annee = c(2017, 2017, 2017, 2017,
2017, 2017), primes_en_montant = c(0, 0, 0, 0, 0, 0), primes_en_pourcentage = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), brut_mensuel_ETP = c(NA,
786.13, 1156.95, 1156.95, 904.79, 904.79)), row.names = c(NA,
-6L), class = c("tbl_df",
"tbl",
"data.frame"))
Each worker is identified by a number ("numero") that does not change from year to year. I would like to add a new variable to this data frame representing the year-over-year evolution of "estimation_annuelle" (the yearly wage) for each worker from 2017 to 2021, and then the average annual growth rate over the 5 years. Then I would like to look at the workers who received a raise of less than 2% in a given year (2017-2018, for example) and see whether it was caught up in the following years. That is, if a worker's wage increased by less than 2% between 2017 and 2018, did the increase between 2018 and 2019 compensate for the insufficient raise of the previous period, and by how much?
I have tried the following code to compute the year-to-year evolution, but it doesn't work:
test <- liste_complete %>%
group_by(annee, numero) %>%
select(numero, annee, estimation_annuelle)%>%
data.frame()
for(i in 1:length(test$estimation_annuelle)) {
print((test[i+1,] - test[i,])/test[i,])
}
I have also not found a way to compute the average annual growth rate (the formula is here: https://investinganswers.com/dictionary/a/average-annual-growth-rate-aagr), nor to determine whether an insufficient increase was made up for in the following years.
Could anyone help?
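One way to approach the first two steps without a loop is sketched below in base R. The `wages` data frame is made up for illustration (two workers, three years), and the column names `evolution`, `below_2pct` and `aagr` are assumptions. The average annual growth rate is taken here as the arithmetic mean of the yearly growth rates, following the linked definition.

```r
# Made-up data: two workers observed over three years
wages <- data.frame(
  numero = rep(c("62", "75"), each = 3),
  annee = rep(2017:2019, times = 2),
  estimation_annuelle = c(10000, 10100, 10500, 15000, 15600, 15900),
  stringsAsFactors = FALSE
)

# Sort so that consecutive rows of a worker are consecutive years
wages <- wages[order(wages$numero, wages$annee), ]

# Year-over-year growth within each worker; the first year has no predecessor
wages$evolution <- ave(
  wages$estimation_annuelle, wages$numero,
  FUN = function(v) c(NA, diff(v) / head(v, -1))
)

# Flag yearly raises below 2%
wages$below_2pct <- !is.na(wages$evolution) & wages$evolution < 0.02

# Average annual growth rate per worker (rows with NA growth are dropped)
aagr <- aggregate(evolution ~ numero, data = wages, FUN = mean)
names(aagr)[2] <- "aagr"

print(wages)
print(aagr)
```

Checking whether a low raise was caught up could then compare the growth over two consecutive years with a chosen threshold, which this sketch leaves open.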

Using gsub for removing unwanted characters : facing issues

df$Claim_Value <- gsub("Rs.", "", df$`Total Amount Claimed`)
I checked class(df$`Total Amount Claimed`) and it shows numeric.
Will gsub work on a numeric column?
Here df$`Total Amount Claimed` is a column that holds amounts prefixed with the text Rs., for example Rs.200000. I am trying to remove Rs. from this column, so I used gsub. It works, but the amounts are shown in thousands and not in lakhs.
How can I show the amounts in lakhs?
structure(list(Approver = c("Amarjeet Singh", "Amit Barot", "Amit Barot",
"Amit Barot", "Amit Barot", "Amit Barot"), `Assigned To` = c("SOLUTIONS.MOHALI#PALENGINEERS.IN",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM"), `Resolution Date` = structure(c(1609341652,
1574165400, 1591818814, 1592327216, 1592397052, 1592496000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), `Allow Submit Till Date` = structure(c(NA,
1589414400, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`Amt App by CSS (Without Tax)` = c(NA, NA, NA, NA, NA, NA
), `ESN/Alternator No.` = c("AAG8045S126087", "84846607",
"22321621", "191014875", "25452001", "78939252"), `Auto Approved` = c("No",
"No", "No", "No", "No", "No"), BIS = c("N", "N", "N", "Y",
"Y", "N"), `Batch Amount` = c(NA, NA, NA, NA, NA, NA), `Batch Date` = c(NA,
NA, NA, NA, NA, NA), `Batch Number` = c(NA, NA, NA, NA, NA,
NA), `Category Of Service` = c("Maintenance or repair service",
"Maintenance or repair service", "Maintenance or repair service",
"Maintenance or repair service", "Maintenance or repair service",
"Maintenance or repair service"), `Claim Scope` = c("In Scope",
"In Scope", "In Scope", "In Scope", "In Scope", "In Scope"
), `Claim Type` = c("WARRANTY", "WARRANTY", "WARRANTY", "WARRANTY",
"WARRANTY", "WARRANTY"), `Customer Name` = c("SUDHIR POWER LIMITED",
"BHARGAV EARTH MOVERS", "WAGAD INFRA PROJECT PVT LTD", "SUDHIR POWER LIMITED",
"SUDHIR POWER LIMITED", "CORE MULTI SERVICE"), `Final Amount Approved...16` = c(NA,
NA, NA, NA, NA, NA), `Division Name` = c("Pal Engineers - Mohali",
"Sudhir (Ahmedabad) - Rajkot", "Sudhir Sales & Services Limited, Ahmedabad",
"Sudhir Sales & Services Limited, Ahmedabad", "Sudhir Sales & Services Limited, Ahmedabad",
"Sudhir Sales & Services Limited, Ahmedabad"), `Failure Type` = c("Warranty Failure",
"Warranty Failure", "Warranty Failure", "Warranty Failure",
"Warranty Failure", "Warranty Failure"), `GIEA Agreement Name` = c(NA,
NA, NA, NA, NA, NA), `Cummins Invoice Num` = c(NA, NA, NA,
NA, NA, NA), Agreement = c(NA, NA, NA, NA, NA, NA), `Problem Summary` = c("Electrical issue / PCC Controller issues / Starter / alternator issue / Battery",
"Engine not starting / Tripping / Not stopping", "Maintenance / General Check",
"Engine not starting / Tripping / Not stopping", "Engine not starting / Tripping / Not stopping",
"Leakages - Oil/ Fuel/ Coolant / Air"), `Resolution Summary` = c("After recharge battery tested and failed on load test replaced battery warranty",
"REPAIRED THE FUEL PUMP TAKEN TRAIL ALL PARAMETER LIMIT",
"Last service done by 23/12/2019 at 724 hours qt this time change air filter also.today service done at 974 hours.in between customer says top up oil 2.5 ltr then start the engine running ok now all parameters within limits.",
"attend site check & found starter loose connection then correct it & Suggested to customer requests load balances and require proper ventilation for dg set suction and discharge air .",
"ATTEND THE SITE OBSERVE ENGINE FOUND FAULT SHUTDOWN ERROR NEED TO VISIT OEM SIDE",
"ATTEND SITE CHECK & FOUND FUEL LEAKAGE FROM BLEIND PLUG THEN REMOVED IT & FITMENT GAIN & START ENGINE & FOUND ENGINE RUNNING WITHIN LIMIT.."
), `Claim Rejected` = c("Y", "Y", "Y", "Y", "Y", "Y"), `SR Number` = c("SR-PE-MO-2021-006884",
"SR-SU-RJ-1920-002793", "SR-SU-AH-2021-000683", "SR-SU-AH-2021-000857",
"SR-SU-AH-2021-000865", "SR-SU-AH-2021-000913"), `Service Type` = c(NA,
NA, NA, NA, NA, NA), `Sub Type` = c(NA, NA, NA, NA, NA, NA
), `Amount Claimed By Dealer` = c("Rs.5,721.00", "Rs.19,087.00",
"Rs.1,166.00", "Rs.836.00", "Rs.1,034.00", "Rs.2,057.00"),
`Processed By...29` = c("SOLUTIONS.MOHALI#PALENGINEERS.IN",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM"), `Claim #` = c("1-5W4ZZVR",
"1-5PWNEAT", "1-5QQ4Z4J", "1-5QWC86P", "1-5QXPYU1", "1-5QXU7VN"
), `Claim Category` = c("STANDARD", "STANDARD", "STANDARD",
"STANDARD", "STANDARD", "STANDARD"), `Claim Creation Date` = structure(c(1609844392,
1588360803, 1591890038, 1592481430, 1592577627, 1592582659
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), `Created By` = c("1-5LS00O1",
"1-2CD07UT", "1-2CD07UT", "1-2CD07UT", "1-2CD07UT", "1-2CD07UT"
), `Currency Code` = c("INR", "INR", "INR", "INR", "INR",
"INR"), Partner = c(NA, NA, NA, NA, NA, NA), `Final Amount Approved...36` = c("Rs.0.00",
"Rs.0.00", "Rs.0.00", "Rs.0.00", "Rs.0.00", "Rs.0.00"), `Fund Req Category` = c(NA,
NA, NA, NA, NA, NA), Comments = c(NA, NA, NA, NA, NA, NA),
`Claim Name` = c("CLM-PE-MO-2021-002442", "CLM-SU-RJ-2021-000055",
"CLM-SU-AH-2021-000527", "CLM-SU-AH-2021-000627", "CLM-SU-AH-2021-000641",
"CLM-SU-AH-2021-000643"), `Organization Name` = c("Pal Engineers, Jammu",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD"),
Period = c(NA, NA, NA, NA, NA, NA), `Pre-Approval #` = c(NA,
NA, NA, NA, NA, NA), `Processed By...43` = c("SOLUTIONS.MOHALI#PALENGINEERS.IN",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM"), `Program Account Name` = c(NA,
NA, NA, NA, NA, NA), `Program Name` = c(NA, NA, NA, NA, NA,
NA), `Promotion Name` = c("BTRY_CHANDIGARH", "CIL_20000",
"CIC_5000", "Warranty_BIS_RECON", "Warranty_BIS_RECON", "CIC_5000"
), Description = c(NA, NA, NA, NA, NA, NA), Status = c("Pending",
"Pending", "Pending", "Pending", "Pending", "Pending"), `Final Approval Date` = c(NA,
NA, NA, NA, NA, NA), `Submitted By` = c("SOLUTIONS.MOHALI#PALENGINEERS.IN",
"WARRANTY.AHD#SUDHIRGROUP.COM", "WARRANTY.AHD#SUDHIRGROUP.COM",
"WARRANTY.AHD#SUDHIRGROUP.COM", "WARRANTY.AHD#SUDHIRGROUP.COM",
"WARRANTY.AHD#SUDHIRGROUP.COM"), `Total Amount Approved` = c(0,
0, 0, 0, 0, 0), `Total Amount Claimed` = c("Rs.5,721.00",
"Rs.19,087.00", "Rs.1,166.00", "Rs.836.00", "Rs.1,034.00",
"Rs.2,057.00"), `Total Participation Amount` = c(NA, NA,
NA, NA, NA, NA), Updated = structure(c(1610113437, 1589227258,
1591896091, 1592491326, 1592645576, 1592839702), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), `Updated By` = c("1-4YO9LU", "1-2QTU4R",
"1-SDR5", "1-SDRU", "1-SDRU", "1-SDR5"), `Resolved By FSL` = c("Y",
"N", "Y", "Y", "N", "N"), `Parts Warranty Claim` = c(NA,
NA, NA, NA, NA, NA), `Inbox Last Updated` = structure(c(1610113437,
1589227258, 1591896091, 1592491326, 1592645576, 1592839702
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), `Claim Submitted Date` = structure(c(1609939043,
NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`Claim Rejection Reason` = c("Incorrect/Missing Commercial Bills",
"Incorrect/Missing Technical Documents", "HCS or KAM Approval Required",
"Incorrect/Missing Technical Documents", "Incorrect/Missing Technical Documents",
"Incorrect/Missing Technical Documents"), `Claim Categorization Reason` = c(NA,
NA, NA, NA, NA, NA), Aging = c(2.4278125, 244.16599537037,
213.276724537037, 206.387430555556, 204.60212962963, 202.355300925926
), AgeGroup = structure(c(2L, 6L, 6L, 6L, 6L, 6L), .Label = c("0-1 Days",
"2-4 Days", "5-7 Days", "8-15 Days", "16-30 Days", ">30 Days"
), class = "factor"), Zones = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), Approver.y = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), Claim_Value = c(NA,
NA, NA, 836L, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
The following should work fine:
as.numeric(gsub("Rs\\.", "", "Rs.2000"))
provided the df$`Total Amount Claimed` column is of character type and not a factor.
To show the amounts in lakhs and not in exponential format, use the option
options("scipen"=100, "digits"=4)
You can turn the values into numeric by using gsub in the following way:
df$`Total Amount Claimed`
#[1] "Rs.5,721.00" "Rs.19,087.00" "Rs.1,166.00" "Rs.836.00" "Rs.1,034.00" "Rs.2,057.00"
df$Claim_Value <- as.numeric(gsub('Rs\\.|,', '', df$`Total Amount Claimed`))
df$Claim_Value
#[1] 5721 19087 1166 836 1034 2057
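To illustrate why the comma matters, here is a small sketch with three of the values from the question. The variable names are made up, and it assumes that "in lakhs" means dividing the rupee amount by 100,000. Stripping only the prefix leaves the thousands separator in place, and as.numeric then returns NA for those entries, which matches the NA values visible in the posted Claim_Value column.

```r
claimed <- c("Rs.5,721.00", "Rs.19,087.00", "Rs.836.00")

# Removing only the literal "Rs." keeps the commas
only_prefix <- gsub("Rs.", "", claimed, fixed = TRUE)
partial <- suppressWarnings(as.numeric(only_prefix))
print(partial)

# Removing the prefix and the commas gives clean numbers
value <- as.numeric(gsub("Rs\\.|,", "", claimed))
print(value)

# Assumed meaning of "in lakhs": 1 lakh = 100,000
in_lakhs <- value / 1e5
print(in_lakhs)
```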

Using mutate and a lookup/calc function

I wrote a function to which I pass a company name; it looks up a set of records in a second table, calculates a complicated result, and returns it.
I want to process all companies and add the result as a value on each company's record.
I am using the following code:
`aa <- mutate(companies,newcol=sum_rounds(companies$company_name))`
But I get the following warning:
Warning message:
In c("Bwom", "Symple", "TravelTriangle", "Ark Biosciences", "Artizan Biosciences", :
longer object length is not a multiple of shorter object length
(each of these is a company name)
The companies data frame gets a new column, but every value is FALSE, whereas there should be a mix of TRUE and FALSE.
Any advice for a newbie would be welcome.
The function follows:
sum_rounds<-function(co_name) {
#get records from rounds for the company name passed to the function
#remove NAs from column roundtype too
outval<- rounds %>%
filter(company_name.x==co_name & !is.na(roundtype)) %>%
#sort by date round is announced
arrange(announced_on) %>%
select(roundtype) %>%
#create a string of all round types in order
apply(2,paste,collapse="")
#the values from mixed to "M", venture to "V" and pureangel to "A"
# now see if it is of the form aaaaa (and #) followed by m or v
# in grep: ^ is start of a line and + is for at least one copy
# [mv] is either m or v
# nice summary is here: http://www.endmemo.com/program/R/gsub.php
#is angel2vc?
angel2vc<-grepl("^a+[mv]+",outval)
#return(list("roundcodes"=outval,"angel2vc"=angel2vc))
return(angel2vc)
}
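A likely cause of the warning is that the function is written for a single company name, while mutate hands it the whole column at once, so the `==` comparison recycles the vector of names against the rounds table. The sketch below calls a one-name version once per company with vapply. The `rounds` and `companies` tables here are tiny made-up stand-ins, and `sum_rounds_one` is a hypothetical base R rewrite of the function above, not the original.

```r
# Made-up stand-ins for the two tables
rounds <- data.frame(
  company_name.x = c("Bwom", "Bwom", "Symple", "Symple"),
  roundtype = c("a", "v", "v", "a"),
  announced_on = as.Date(c("2015-01-01", "2016-01-01", "2015-06-01", "2016-06-01")),
  stringsAsFactors = FALSE
)
companies <- data.frame(
  company_name = c("Bwom", "Symple", "Audiense"),
  stringsAsFactors = FALSE
)

# Works on exactly one company name and returns one logical value
sum_rounds_one <- function(co_name) {
  r <- rounds[rounds$company_name.x == co_name & !is.na(rounds$roundtype), ]
  r <- r[order(r$announced_on), ]
  codes <- paste(r$roundtype, collapse = "")
  grepl("^a+[mv]+", codes)
}

# Apply the function once per company instead of once for the whole column
companies$angel2vc <- vapply(
  companies$company_name, sum_rounds_one, logical(1), USE.NAMES = FALSE
)
print(companies)
```

The same per-row call could be placed inside mutate, since vapply returns one value per company.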
DPUT from Companies table Follows:
structure(list(company_name = c("Bwom", "Symple", "TravelTriangle",
"Ark Biosciences", "Artizan Biosciences", "Audiense"), domain = c("b-wom.com",
"getsymple.com", "traveltriangle.com", "arkbiosciences.com",
NA, "audiense.com"), country_code = c("ESP", "USA", "USA", "CHN",
"USA", "GBR"), state_code = c(NA, "CA", "VA", NA, "NC", NA),
region = c("Barcelona", "SF Bay Area", "Washington, D.C.",
"Shanghai", "Raleigh", "London"), city = c("Barcelona", "San Francisco",
"Charlottesville", "Shanghai", "Durham", "London"), status = c("operating",
"operating", "operating", "operating", "operating", "operating"
), short_description = c("Bwom is a tool that offers a test and personalized exercises for women's intimate health.",
"Symple is the cloud platform for all your business payments. Pay, get paid, connect.",
"TravelTriangle enables travel enthusiasts to reserve a personalized holiday plan with a local travel agent.",
"Ark Biosciences is a biopharmaceutical company that is dedicated to the discovery and development",
"Artizan Biosciences", "SaaS developer delivering unique consumer insight and engagement capabilities to many of the world’s biggest brands and agencies."
), category_list = c("health care", "cloud computing|machine learning|mobile apps|mobile payments|retail technology",
"e-commerce|personalization|tourism|travel", "health care",
"biopharma", "analytics|apps|marketing|market research|social crm|social media|social media marketing"
), category_group_list = c("health care", "apps|commerce and shopping|data and analytics|financial services|hardware|internet services|mobile|payments|software",
"commerce and shopping|travel and tourism", "health care",
"biotechnology|health care|science and engineering", "apps|data and analytics|design|information technology|internet services|media and entertainment|sales and marketing|software"
), employee_count = c("1 to 10", "11 to 50", "101 to 250",
NA, "1 to 10", "51 to 100"), funding_rounds = c(2L, 1L, 4L,
2L, 2L, 5L), funding_total_usd = c(1075791, 120000, 19900000,
NA, 3e+06, 8013391), founded_on = structure(c(16555, 16770,
15156, 16071, NA, 14975), class = "Date"), first_funding_on = structure(c(16526,
17204, 15492, 16532, 17091, 15294), class = "Date"), last_funding_on = structure(c(17204,
17204, 17204, 17203, 17203, 17203), class = "Date"), closed_on = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), email = c("hello#b-wom.com", "info#getsymple.com",
"admin#traveltriangle.com", "info#arkbiosciences.com", NA,
"moreinfo#audiense.com"), phone = c(NA, NA, "'+91 98 99 120408",
"###############################################################################################################################################################################################################################################################",
NA, "###############################################################################################################################################################################################################################################################"
), cb_url = c("https://www.crunchbase.com/organization/bwom",
"https://www.crunchbase.com/organization/symple-2", "https://www.crunchbase.com/organization/traveltriangle-com",
"https://www.crunchbase.com/organization/ark-biosciences",
"https://www.crunchbase.com/organization/artizan-biosciences",
"https://www.crunchbase.com/organization/socialbro"), twitter_url = c("https://www.twitter.com/hellobwom",
NA, "https://www.twitter.com/traveltriangle", NA, NA, "https://www.twitter.com/socialbro"
), facebook_url = c("https://www.facebook.com/hellobwom/?fref=ts",
NA, "http://www.facebook.com/traveltriangle", NA, NA, "http://www.facebook.com/socialbro"
), uuid = c("e6096d58-3454-d982-0dbe-7de9b06cd493", "fd0ab78f-0dc4-1f18-21d1-7ce9ff7a173b",
"742043c1-c17a-4526-4ed0-e911e6e9555b", "8e27eb22-ce03-a2af-58ba-53f0f458f49c",
"ed07ac9e-1071-fca0-46d9-42035c2da505", "fed333e5-2754-7413-1e3d-5939d70541d2"
), isbio = c("other", "other", "other", "other", "bio", "other"
), co_type = c("m", "m", "m", "v", "v", "m")), .Names = c("company_name",
"domain", "country_code", "state_code", "region", "city", "status",
"short_description", "category_list", "category_group_list",
"employee_count", "funding_rounds", "funding_total_usd", "founded_on",
"first_funding_on", "last_funding_on", "closed_on", "email",
"phone", "cb_url", "twitter_url", "facebook_url", "uuid", "isbio",
"co_type"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))

Extract specific columns from dataset, create column of NAs if it doesn't exist

Data frame df has 57 columns. I later read in other csv files, each of which may have the same 57 columns but more likely has more or fewer. I take the column names of the original file as:
df = read.csv(...)
str = colnames(df)
I know I can take subsets of a data frame as:
file = read.csv(...)
file = file[, str]
If file contains all of the original 57 columns, possibly along with extras, this works fine and the extra columns are simply dropped. However, if file is missing any of the original 57 columns, the following error arises:
Error in `[.data.frame`(file, , str) : undefined columns selected
Is there a way to take this same approach, but create columns of NA if the column does not exist in file?
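One possible approach, sketched in base R, is to add the missing columns as NA before subsetting. The names `template_cols` and `missing_cols` and the toy data are assumptions standing in for the 57 original column names and a newly read file.

```r
# Stand-in for colnames(df) from the original file
template_cols <- c("WellName", "County", "State", "pH")

# Stand-in for a later file: lacks County and pH, has one extra column
file <- data.frame(
  WellName = c("w1", "w2"),
  State = c("WY", "WY"),
  Extra = c(1, 2),
  stringsAsFactors = FALSE
)

# Columns expected by the template but absent from this file
missing_cols <- setdiff(template_cols, names(file))

# Create each missing column filled with NA (no-op when nothing is missing)
for (nm in missing_cols) {
  file[[nm]] <- NA
}

# Now every template column exists, so the usual subset works
file <- file[, template_cols]
print(file)
```

The added columns are logical NA; if a specific type is needed, NA_character_ or NA_real_ could be assigned instead.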
EDIT: Including dput output for @akrun. I'm not familiar with dput, so I hope this is what you were asking for:
File 1 example:
`structure(list(ObservationURI = c("http://resources.usgin.org/uri-gin/wygs/bhtemp/49-037-20341_182_12296/",
"http://resources.usgin.org/uri-gin/wygs/bhtemp/49-037-20341_215_14316/",
"http://resources.usgin.org/uri-gin/wygs/bhtemp/49-037-20341_236_16496/"
), WellName = c("1 BRADY UNIT ANADARKO E&P COMPANY LP", "1 BRADY UNIT ANADARKO E&P COMPANY LP",
"1 BRADY UNIT ANADARKO E&P COMPANY LP"), APINo = c("49-037-20341",
"49-037-20341", "49-037-20341"), HeaderURI = c("http://resources.usgin.org/uri-gin/wygs/well/3720341/",
"http://resources.usgin.org/uri-gin/wygs/well/3720341/", "http://resources.usgin.org/uri-gin/wygs/well/3720341/"
), OtherID = c(3720341, 3720341, 3720341), OtherName = c(NA,
NA, NA), BoreholeName = c(NA, NA, NA), Label = c("Temperature observation for well 3720341",
"Temperature observation for well 3720341", "Temperature observation for well 3720341"
), Operator = c("", "", ""), LeaseName = c("", "", ""), LeaseOwner = c("",
"", ""), LeaseNo = c("", "", ""), SpudDate = c("1900-01-01T00:00",
"1900-01-01T00:00", "1900-01-01T00:00"), EndedDrillingDate = c("",
"", ""), WellType = c("Oil", "Oil", "Oil"), Status = c("Producing Oil Well",
"Producing Oil Well", "Producing Oil Well"), CommodityOfInterest = c("",
"", ""), StatusDate = c("1973-05-03T00:00:00", "1973-05-03T00:00:00",
"1973-05-03T00:00:00"), Function = c(NA, NA, NA), Production = c(NA,
NA, NA), ProducingInterval = c(NA, NA, NA), ReleaseDate = c(NA,
NA, NA), Field = c("", "", ""), OtherLocationName = c("Great Divide Basin",
"Great Divide Basin", "Great Divide Basin"), County = c("Sweetwater",
"Sweetwater", "Sweetwater"), State = c("WY", "WY", "WY"), PLSS_Meridians = c(NA,
NA, NA), TWP = c("16N", "16N", "16N"), RGE = c("101W", "101W",
"101W"), Section_ = c(11, 11, 11), SectionPart = c("NENW", "NENW",
"NENW"), Parcel = c(NA, NA, NA), UTM_E = c(NA, NA, NA), UTM_N = c(NA,
NA, NA), UTMDatumZone = c(NA, NA, NA), LatDegree = c(41.38696,
41.38696, 41.38696), LongDegree = c(-108.75009, -108.75009, -108.75009
), SRS = c("EPSG:4326", "EPSG:4326", "EPSG:4326"), LocationUncertaintyStatement = c("nil:missing",
"nil:missing", "nil:missing"), LocationUncertaintyCode = c(NA,
NA, NA), LocationUncertaintyRadius = c(NA, NA, NA), DrillerTotalDepth = c(NA_real_,
NA_real_, NA_real_), DepthReferencePoint = c(NA, NA, NA), LengthUnits = c("ft",
"ft", "ft"), WellBoreShape = c(NA, NA, NA), TrueVerticalDepth = c(NA,
NA, NA), ElevationKB = c(7135, 7135, 7135), ElevationDF = c(7106,
7106, 7106), ElevationGL = c(0, 0, 0), FormationTD = c("", "",
""), BitDiameterCollar = c(NA, NA, NA), BitDiameterTD = c(NA_real_,
NA_real_, NA_real_), DiameterUnits = c("", "", ""), Notes = c("Depth of measurement assumed to be equal to driller total depth (CRC-AZGS, 2013).",
"Depth of measurement assumed to be equal to driller total depth (CRC-AZGS, 2013).",
"Depth of measurement assumed to be equal to driller total depth (CRC-AZGS, 2013)."
), MaximumRecordedTemperature = c(NA_real_, NA_real_, NA_real_
), MeasuredTemperature = c(182, 215, 236), CorrectedTemperature = c(NA_real_,
NA_real_, NA_real_), TemperatureUnits = c(FALSE, FALSE, FALSE
), TimeSinceCirculation = c(NA_real_, NA_real_, NA_real_), CirculationDuration = c(11,
12, 12), MeasurementProcedure = c("Well log", "Well log", "Well log"
), CorrectionType = c(NA, NA, NA), DepthOfMeasurement = c(-99999,
-99999, -99999), MeasurementDateTime = c("", "", ""), MeasurementFormation = c("",
"", ""), MeasurementSource = c("Richard W. Davis: Deriving geothermal parameters from bottom-hole temperatures in Wyoming\" AAPG bulletin, V. 96, No. 8 (August 2012), pp. 1579-1592",
"Richard W. Davis: Deriving geothermal parameters from bottom-hole temperatures in Wyoming\" AAPG bulletin, V. 96, No. 8 (August 2012), pp. 1579-1592",
"Richard W. Davis: Deriving geothermal parameters from bottom-hole temperatures in Wyoming\" AAPG bulletin, V. 96, No. 8 (August 2012), pp. 1579-1592"
), RelatedResource = c(NA, NA, NA), CasingLogger = c(NA, NA,
NA), CasingBottomDepthDriller = c(NA, NA, NA), CasingTopDepth = c(NA_real_,
NA_real_, NA_real_), CasingPipeDiameter = c(NA, NA, NA), CasingWeight = c(NA,
NA, NA), CasingWeightUnits = c(NA, NA, NA), CasingThickness = c(NA,
NA, NA), DrillingFluid = c("", "", ""), Salinity = c(NA_real_,
NA_real_, NA_real_), MudResistivity = c(NA_real_, NA_real_, NA_real_
), Density = c(NA_real_, NA_real_, NA_real_), FluidLevel = c(NA_real_,
NA_real_, NA_real_), pH = c(NA_real_, NA_real_, NA_real_), Viscosity = c(NA_real_,
NA_real_, NA_real_), FluidLoss = c(NA_real_, NA_real_, NA_real_
), MeasurementNotes = c(NA, NA, NA), InformationSource = c("Wyoming State Geological Survey",
"Wyoming State Geological Survey", "Wyoming State Geological Survey"
)), .Names = c("ObservationURI", "WellName", "APINo", "HeaderURI",
"OtherID", "OtherName", "BoreholeName", "Label", "Operator",
"LeaseName", "LeaseOwner", "LeaseNo", "SpudDate", "EndedDrillingDate",
"WellType", "Status", "CommodityOfInterest", "StatusDate", "Function",
"Production", "ProducingInterval", "ReleaseDate", "Field", "OtherLocationName",
"County", "State", "PLSS_Meridians", "TWP", "RGE", "Section_",
"SectionPart", "Parcel", "UTM_E", "UTM_N", "UTMDatumZone", "LatDegree",
"LongDegree", "SRS", "LocationUncertaintyStatement", "LocationUncertaintyCode",
"LocationUncertaintyRadius", "DrillerTotalDepth", "DepthReferencePoint",
"LengthUnits", "WellBoreShape", "TrueVerticalDepth", "ElevationKB",
"ElevationDF", "ElevationGL", "FormationTD", "BitDiameterCollar",
"BitDiameterTD", "DiameterUnits", "Notes", "MaximumRecordedTemperature",
"MeasuredTemperature", "CorrectedTemperature", "TemperatureUnits",
"TimeSinceCirculation", "CirculationDuration", "MeasurementProcedure",
"CorrectionType", "DepthOfMeasurement", "MeasurementDateTime",
"MeasurementFormation", "MeasurementSource", "RelatedResource",
"CasingLogger", "CasingBottomDepthDriller", "CasingTopDepth",
"CasingPipeDiameter", "CasingWeight", "CasingWeightUnits", "CasingThickness",
"DrillingFluid", "Salinity", "MudResistivity", "Density", "FluidLevel",
"pH", "Viscosity", "FluidLoss", "MeasurementNotes", "InformationSource"
), row.names = c(NA, 3L), class = "data.frame")`
File 2 example:
`structure(list(ObservationURI = c("http://resources.usgin.org/uri-gin/mags/bhtemp/UM:MA-Weston47-422036N0711640.1/",
"http://resources.usgin.org/uri-gin/mags/bhtemp/UM:MA-Dover20-421431N0711752.1/",
"http://resources.usgin.org/uri-gin/mags/bhtemp/UM:MA-Lincoln13-422440N0711815.1/"
), WellName = c("Weston47-USGS HDR19", "Dover20-USGS HDR19",
"Lincoln13-USGS HDR19"), APINo = c(NA, NA, NA), HeaderURI = c("http://resources.usgin.org/uri-gin/mags/well/Weston47-USGS_HDR19/",
"http://resources.usgin.org/uri-gin/mags/well/Dover20-USGS_HDR19/",
"http://resources.usgin.org/uri-gin/mags/well/Lincoln13-USGS_HDR19/"
), OtherID = c("", "", ""), OtherName = c("", "", ""), BoreholeName = c(NA,
NA, NA), Operator = c(NA, NA, NA), LeaseOwner = c(NA, NA, NA),
LeaseNo = c(NA, NA, NA), SpudDate = c(NA, NA, NA), EndedDrillingDate = c("",
"", ""), WellType = c("temporarily abandoned", "observation",
"observation"), Status = c("Idle", "Idle", "Idle"), CommodityOfInterest = c("Water",
"Water", "Water"), StatusDate = c("", "", ""), Function = c("production",
"monitoring", "monitoring"), Production = c(NA, NA, NA),
Field = c(NA, NA, NA), County = c("Middlesex", "Norfolk",
"Middlesex"), State = c("MA", "MA", "MA"), PLSS_Meridians = c(NA,
NA, NA), TWP = c(NA, NA, NA), RGE = c(NA, NA, NA), Section_ = c(NA,
NA, NA), SectionPart = c(NA, NA, NA), Parcel = c(NA, NA,
NA), UTM_E = c(NA, NA, NA), UTM_N = c(NA, NA, NA), LatDegree = c(42.3147771183,
42.2417748607, 42.4110851252), LongDegree = c(-71.3257301787,
-71.2975422044, -71.3034583949), SRS = c("EPSG:4326", "EPSG:4326",
"EPSG:4326"), LocationUncertaintyStatement = c("Field located on topographic map",
"Field located on topographic map", "Field located on topographic map"
), DrillerTotalDepth = c(29, 22, 20), LengthUnits = c("ft",
"ft", "ft"), WellBoreShape = c("Vertical", "Vertical", "Vertical"
), TrueVerticalDepth = c(NA, NA, NA), ElevationGL = c(140,
150, 180), BitDiameterTD = c(72, 48, 42), DiameterUnits = c("in",
"in", "in"), Notes = c("", "", ""), MeasuredTemperature = c(8,
9, 8.5), CorrectedTemperature = c(NA, NA, NA), TemperatureUnits = c("C",
"C", "C"), TimeSinceCirculation = c(NA, NA, NA), CirculationDuration = c(NA,
NA, NA), MeasurementProcedure = c("Samples collected from spigot or faucet nearest to well. Water run until temperature, pH or specific conductance stablized. Temperature measured with a mercury thermometer to nearest half degree in degrees F. Converted to degrees C for table.",
"Samples collected from spigot or faucet nearest to well. Water run until temperature, pH or specific conductance stablized. Temperature measured with a mercury thermometer to nearest half degree in degrees F. Converted to degrees C for table.",
"Samples collected from spigot or faucet nearest to well. Water run until temperature, pH or specific conductance stablized. Temperature measured with a mercury thermometer to nearest half degree in degrees F. Converted to degrees C for table."
), CorrectionType = c(NA, NA, NA), DepthOfMeasurement = c(NA,
NA, NA), MeasurementDateTime = c(NA, NA, NA), MeasurementFormation = c(NA,
NA, NA), MeasurementSource = c("Walker, Eugene H., William W. Caswell, and S. William Wandle, Jr. Hydrologic Data of the Charles River Basin",
"Walker, Eugene H., William W. Caswell, and S. William Wandle, Jr. Hydrologic Data of the Charles River Basin",
"Walker, Eugene H., William W. Caswell, and S. William Wandle, Jr. Hydrologic Data of the Charles River Basin"
), CasingLogger = c(" Massachusetts\". USGS Massachusetts Hydrologic-Data Report No. 19 (1977): 1-57. Print. ftp://eclogite.geo.umass.edu/pub/stategeologist/Products/Geothermal/BoreholeTemperatureData/DataReport19.pdf\"",
" Massachusetts\". USGS Massachusetts Hydrologic-Data Report No. 19 (1977): 1-57. Print. ftp://eclogite.geo.umass.edu/pub/stategeologist/Products/Geothermal/BoreholeTemperatureData/DataReport19.pdf\"",
" Massachusetts\". USGS Massachusetts Hydrologic-Data Report No. 19 (1977): 1-57. Print. ftp://eclogite.geo.umass.edu/pub/stategeologist/Products/Geothermal/BoreholeTemperatureData/DataReport19.pdf\""
), CasingDepthDriller = c("", "", ""), CasingPipeDiameter = c("",
"", ""), CasingWeight = c(NA, NA, NA), CasingWeightUnits = c(NA,
NA, NA), CasingThickness = c(NA, NA, NA), DrillingFluid = c(NA,
NA, NA), Salinity = c(NA, NA, NA), MudResisitivity = c(NA,
NA, NA), Density = c(NA, NA, NA), FluidLevel = c(NA, NA,
NA), pH = c(NA, NA, NA), Viscosity = c(NA, NA, NA), FluidLoss = c(NA,
NA, NA), Unnamed..66 = c(NA, NA, NA), BitDiameterCollar = c(72,
48, 42), Unnamed..68 = c(NA, NA, NA), InformationSource = c("Stephen Mabee, MA State Geologist, University of Massachusetts, 611 North Pleasant Street, Amherst MA 01003 413-545-2285",
"Stephen Mabee, MA State Geologist, University of Massachusetts, 611 North Pleasant Street, Amherst MA 01003 413-545-2285",
"Stephen Mabee, MA State Geologist, University of Massachusetts, 611 North Pleasant Street, Amherst MA 01003 413-545-2285"
)), .Names = c("ObservationURI", "WellName", "APINo", "HeaderURI",
"OtherID", "OtherName", "BoreholeName", "Operator", "LeaseOwner",
"LeaseNo", "SpudDate", "EndedDrillingDate", "WellType", "Status",
"CommodityOfInterest", "StatusDate", "Function", "Production",
"Field", "County", "State", "PLSS_Meridians", "TWP", "RGE", "Section_",
"SectionPart", "Parcel", "UTM_E", "UTM_N", "LatDegree", "LongDegree",
"SRS", "LocationUncertaintyStatement", "DrillerTotalDepth", "LengthUnits",
"WellBoreShape", "TrueVerticalDepth", "ElevationGL", "BitDiameterTD",
"DiameterUnits", "Notes", "MeasuredTemperature", "CorrectedTemperature",
"TemperatureUnits", "TimeSinceCirculation", "CirculationDuration",
"MeasurementProcedure", "CorrectionType", "DepthOfMeasurement",
"MeasurementDateTime", "MeasurementFormation", "MeasurementSource",
"CasingLogger", "CasingDepthDriller", "CasingPipeDiameter", "CasingWeight",
"CasingWeightUnits", "CasingThickness", "DrillingFluid", "Salinity",
"MudResisitivity", "Density", "FluidLevel", "pH", "Viscosity",
"FluidLoss", "Unnamed..66", "BitDiameterCollar", "Unnamed..68",
"InformationSource"), row.names = c(NA, 3L), class = "data.frame")`
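We can read the datasets into a list with fread and then combine them with rbindlist from data.table, using fill = TRUE and the idcol argument, to create a single data.table object. fill = TRUE ensures that any column missing from a dataset is filled with NA, so files with fewer columns can still be bound to the others. Columns are matched by name, so names that differ between files (for example MudResistivity vs. MudResisitivity in the samples above) end up as separate columns.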
library(data.table)
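# get the CSV files from the working directory
# (pattern is a regular expression, so escape the dot and anchor the extension)
files <- list.files(pattern = "\\.csv$")
# read each file with fread, then row-bind the resulting data.tables;
# idcol = "grp" adds a column recording which file each row came from
rbindlist(lapply(files, fread), fill = TRUE, idcol = "grp")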
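
To see what fill = TRUE and idcol do without any files on disk, here is a minimal sketch using two small, made-up data.tables that stand in for the CSV files. The object names (dt1, dt2, combined) and the list names (file1, file2) are hypothetical; only a few of the column names are borrowed from the samples above.

```r
library(data.table)

# Two hypothetical tables standing in for the files; they share
# WellName and State but each has one column the other lacks.
dt1 <- data.table(WellName = c("A", "B"),
                  State = c("WY", "WY"),
                  ElevationKB = c(7135, 7135))
dt2 <- data.table(WellName = "C",
                  State = "MA",
                  ElevationGL = 140)

# Naming the list elements makes idcol record those names
# instead of a plain integer index.
combined <- rbindlist(list(file1 = dt1, file2 = dt2),
                      fill = TRUE, idcol = "grp")

print(combined)
```

Rows from dt1 get NA in ElevationGL and the row from dt2 gets NA in ElevationKB. For real files, the same naming can be applied with something like setNames(lapply(files, fread), files), so that grp holds the file name for each row.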

Resources