Removing punctuation and all capitalization in newly generated columns (RStudio) - r

I am new to R, and while I do know some of the basics, I've been unable to figure out how to add new columns to a table (preferably using the mutate() function) whose values lack any punctuation or capitalization.
I exported around 20,000 observations from the citizen science network iNaturalist in an effort to determine which species are most commonly misidentified. To accomplish this, my goal is to have R compare the value for each observation in the species_guess column (which consists of variably punctuated and capitalized common and scientific names) to the corresponding name in either the taxon_species_name column (standardized, uniform scientific names) or the common_name column (standardized, uniform common names). For each observation, I'd like TRUE or FALSE recorded in a new column, correct_identification, depending on whether species_guess matches one of the latter two columns.
I expect that accomplishing this would require the following:
the creation of three new columns which are the same as species_guess, taxon_species_name, and common_name but are all lowercase and have no punctuation.
the creation of a correct_identification column which reads TRUE or FALSE depending on whether the new species_guess matches taxon_species_name or common_name. I think I can do this step myself.
Please don't hesitate to ask clarifying questions as needed. I am happy to provide more code samples. As requested, the output from the dput function (specifically using the code provided by @IRTFM) has been pasted at the bottom.
I found information on grep() and tolower(), but I really have no idea how to use them to create a new column. There's a lot on removing punctuation from a string, but I'm not sure how those methods would be applicable to an entire column in a dataset.
Thanks!
structure(list(id = c(99512L, 190432L, 207211L, 276566L, 298366L,
380464L), observed_on_string = c("Fri Jul 06 2012 14:35:33 GMT-0400 (EDT)",
"2009-09-19", "2012-06-13", "6/23/2010", "2013-06-13", "2013-08-27"
), observed_on = c("2012-07-06", "2009-09-19", "2012-06-13",
"2010-06-23", "2013-06-13", "2013-08-27"), time_observed_at = c("2012-07-06 18:35:33 UTC",
NA, NA, NA, NA, NA), time_zone = c("Eastern Time (US & Canada)",
"Eastern Time (US & Canada)", "Eastern Time (US & Canada)", "Eastern Time (US & Canada)",
"Eastern Time (US & Canada)", "Eastern Time (US & Canada)"),
user_id = c(2179L, 12610L, 13594L, 12035L, 12610L, 13406L
), user_login = c("charlie", "susanelliott", "bheitzman",
"sfaccio", "susanelliott", "hobiecat"), user_name = c("Charlie Hohn",
"Susan Elliott", "Bob Heitzman", "Steve Faccio", "Susan Elliott",
NA), created_at = c("2012-07-07 19:56:36 UTC", "2013-02-02 16:19:29 UTC",
"2013-03-01 02:00:25 UTC", "2013-05-23 19:32:44 UTC", "2013-06-13 18:57:38 UTC",
"2013-08-28 03:04:18 UTC"), updated_at = c("2019-01-08 21:22:48 UTC",
"2020-02-13 19:16:34 UTC", "2021-06-27 23:36:32 UTC", "2016-09-20 02:53:33 UTC",
"2017-09-26 01:21:35 UTC", "2020-02-12 01:23:48 UTC"), quality_grade = c("research",
"research", "research", "research", "research", "research"
), license = c("CC0", "CC-BY-NC", "CC-BY-NC", NA, "CC-BY-NC",
"CC-BY-NC"), url = c("http://www.inaturalist.org/observations/99512",
"http://www.inaturalist.org/observations/190432", "http://www.inaturalist.org/observations/207211",
"http://www.inaturalist.org/observations/276566", "http://www.inaturalist.org/observations/298366",
"http://www.inaturalist.org/observations/380464"), image_url = c("https://inaturalist-open-data.s3.amazonaws.com/photos/144232/medium.jpg",
"https://inaturalist-open-data.s3.amazonaws.com/photos/244969/medium.jpg",
"https://inaturalist-open-data.s3.amazonaws.com/photos/262914/medium.JPG",
"http://static.inaturalist.org/photos/342086/medium.JPG",
"https://inaturalist-open-data.s3.amazonaws.com/photos/369424/medium.jpg",
"https://inaturalist-open-data.s3.amazonaws.com/photos/475664/medium.jpg"
), sound_url = c(NA, NA, NA, NA, NA, NA), tag_list = c(NA,
"Spiranthes, ladies tresses, plant", "Spiranthes, lucida, orchid, Vermont",
NA, NA, NA), description = c(NA, NA, "S. lucida can be found in heavily scoured sections of the river banks, generally on the downstream side of boulders, where they are protected during floods. Very hardy, stout plants, with distinctive thick leaf whorls.\nFlower spikes are distinctive in mid-June, with 6-20 blossoms in a spiral.",
"Many blooming around pond edge.", NA, "Ladies' Tresses "
), num_identification_agreements = c(2L, 0L, 2L, 1L, 1L,
1L), num_identification_disagreements = c(0L, 0L, 0L, 0L,
0L, 0L), captive_cultivated = c("false", "false", "false",
"false", "false", "false"), oauth_application_id = c(NA_integer_,
NA_integer_, NA_integer_, NA_integer_, NA_integer_, NA_integer_
), place_guess = c("United States", "Vermont, US", "Vermont, US",
"Vermont, US", "Vermont, US", "Grand Isle, VT"), latitude = c(43.6243306384,
44.7147801982, 43.6528495032, 43.9558655593, 43.8546044617,
44.75182), longitude = c(-73.2028825367, -71.933891759, -72.2231645845,
-72.5525452841, -73.1619811058, -73.30593), positional_accuracy = c(5L,
NA, NA, NA, 166L, NA), private_place_guess = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), private_latitude = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_), private_longitude = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), public_positional_accuracy = c(27443L,
27285L, 27443L, 27396L, 27396L, NA), geoprivacy = c("obscured",
"obscured", "obscured", NA, NA, NA), taxon_geoprivacy = c("obscured",
NA, "obscured", "obscured", "obscured", NA), coordinates_obscured = c("true",
"true", "true", "true", "true", "false"), positioning_method = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), positioning_device = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), place_town_name = c(NA, NA, NA, NA, NA, "Grand Isle"),
place_county_name = c("Rutland", "Essex", "Windsor", "Orange",
"Addison", "Grand Isle"), place_state_name = c("Vermont",
"Vermont", "Vermont", "Vermont", "Vermont", "Vermont"), species_guess = c("Northern Slender Ladies'-tresses",
"Sphinx ladies’ tresses", "Spiranthes lucida", "Shining Ladies' Tresses",
"Shining Ladies' Tresses", "Sphinx ladies’ tresses"), scientific_name = c("Spiranthes lacera lacera",
"Spiranthes incurva", "Spiranthes lucida", "Spiranthes lucida",
"Spiranthes lucida", "Spiranthes incurva"), common_name = c("Northern Slender Ladies'-tresses",
"Sphinx ladies’ tresses", "Shining Ladies' Tresses", "Shining Ladies' Tresses",
"Shining Ladies' Tresses", "Sphinx ladies’ tresses"), iconic_taxon_name = c("Plantae",
"Plantae", "Plantae", "Plantae", "Plantae", "Plantae"), taxon_id = c(243059L,
773387L, 62254L, 62254L, 62254L, 773387L), taxon_subfamily_name = c("Orchidoideae",
"Orchidoideae", "Orchidoideae", "Orchidoideae", "Orchidoideae",
"Orchidoideae"), taxon_tribe_name = c("Cranichideae", "Cranichideae",
"Cranichideae", "Cranichideae", "Cranichideae", "Cranichideae"
), taxon_subtribe_name = c("Spiranthinae", "Spiranthinae",
"Spiranthinae", "Spiranthinae", "Spiranthinae", "Spiranthinae"
), taxon_genus_name = c("Spiranthes", "Spiranthes", "Spiranthes",
"Spiranthes", "Spiranthes", "Spiranthes"), taxon_species_name = c("Spiranthes lacera",
"Spiranthes incurva", "Spiranthes lucida", "Spiranthes lucida",
"Spiranthes lucida", "Spiranthes incurva"), taxon_hybrid_name = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), taxon_variety_name = c("Spiranthes lacera lacera",
NA, NA, NA, NA, NA)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
UPDATE: found a solution!
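library(dplyr)
# Lower-case species_guess, then collapse each run of punctuation/spaces into one space
spiranthes <- spiranthes %>%
  mutate(standardized_species_guess = gsub('[[:punct:] ]+', ' ', tolower(species_guess)))
View(spiranthes)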
Hopefully this helps anyone else who may be struggling with the same thing.
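For the remaining step, here is a minimal sketch of how the cleaned columns and correct_identification could be built. It uses base R on a small made-up table (the fourth row is an invented wrong guess), and normalize_name is a hypothetical helper name, not something from the original post. Unlike the gsub call above, it also trims leading and trailing spaces.

```r
# Hypothetical helper: lower-case, collapse runs of punctuation/whitespace
# into a single space, and trim the ends.
normalize_name <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:][:space:]]+", " ", x)
  trimws(x)
}

# Toy data modelled on the question's columns (row 4 is invented)
obs <- data.frame(
  species_guess = c("Northern Slender Ladies'-tresses", "Spiranthes lucida",
                    "Shining Ladies' Tresses", "Nodding Ladies' Tresses"),
  taxon_species_name = c("Spiranthes lacera", "Spiranthes lucida",
                         "Spiranthes lucida", "Spiranthes lucida"),
  common_name = c("Northern Slender Ladies'-tresses", "Shining Ladies' Tresses",
                  "Shining Ladies' Tresses", "Shining Ladies' Tresses"),
  stringsAsFactors = FALSE
)

# A guess counts as correct if it matches either standardized name
guess <- normalize_name(obs$species_guess)
obs$correct_identification <-
  guess == normalize_name(obs$taxon_species_name) |
  guess == normalize_name(obs$common_name)
```

The same helper can be called inside mutate(), for example mutate(standardized_species_guess = normalize_name(species_guess)). Rows where species_guess is NA come out as NA rather than FALSE, so they may need separate handling.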

Related

R Matrix Row.names adding a number at the end of every repeated string

I'm using the row names function to track the production capacity of power-producing facilities based on the fuel they use. When I create a bar plot of the data, instead of a plot with one bar for each of the 6 fuel types I'm interested in, I get a plot with far more bars than that (image: bad bar plot).
When I reviewed my matrix, I found that a number had been added to the end of every repeated row name.
Does anyone know how I can group this dataset to fix my bar plot?
Code used:
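# The second positional argument of install.packages() is the library path,
# so pass package names as one vector; tidyverse already includes ggplot2
install.packages('tidyverse')
library('tidyverse')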
Power_Facilities<- read.csv('powerplants (global) - global_power_plants.csv')
drop<-c("secondary.fuel", "other_fuel2", "other_fuel3", "geolocation_source")
PF<-Power_Facilities[,!(names(Power_Facilities) %in% drop)]
PF<-subset(PF,PF$capacity.in.MW>2000)
PF$generated <-(ifelse(is.na (PF$generation_gwh_2021), paste(PF$estimated_generation_gwh_2021), PF$generation_gwh_2021))
PF$generated <-as.numeric(PF$generated)
#PF<- PF [!((PF$generated == "NA") | PF$generated==""), ]
#PF<- PF [!((PF$generated >1)),]
#PF<- PF [!((PF$capacity.in.MW<20)), ]
head(sort(PF$capacity.in.MW, decreasing = TRUE))
tail(sort(PF$capacity.in.MW, decreasing = TRUE))
head(sort(PF$generated, decreasing = TRUE))
tail(sort(PF$generated, decreasing = TRUE))
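# Assumed intent: one row per fuel type with total capacity and generation
PF2 <- PF %>%
  group_by(primary_fuel) %>%
  summarize(capacity.in.MW = sum(capacity.in.MW),
            generated = sum(generated, na.rm = TRUE))
barplot(PF2$capacity.in.MW, names.arg = PF2$primary_fuel)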
barplot(t(power_matrix), beside = T, las=2, legend.text =T, col = c("blue", "grey"), ylim=c(0, 1000000))
summary(power_matrix)
structure(list(country.code = c("AUS", "AUS", "AUS", "AZE", "BHR",
"BLR", "BEL", "BEL", "BRA", "BRA"), country_long = c("Australia",
"Australia", "Australia", "Azerbaijan", "Bahrain", "Belarus",
"Belgium", "Belgium", "Brazil", "Brazil"), name.of.powerplant = c("Bayswater",
"Liddell", "Loy Yang A", "Azerbaijan TPP", "Alba Power Station",
"Lukoml Thermal Power Plant Belarus", "DOEL 4", "TIHANGE 3",
"Belo Monte", "Ilha Solteira"), capacity.in.MW = c(2640, 2200,
2180, 2400, 2204, 2460, 2910, 2053.8, 3327.45544, 3444), latitude = c(-32.3953,
-32.3713, -38.2536, 40.78, 26.0945, 54.6803, 51.3254, 50.5342,
-3.1264, -20.3822), longitude = c(150.9491, 150.9776, 146.5746,
46.9901, 50.6008, 29.1341, 4.2597, 5.2751, -51.775, -51.3636),
primary_fuel = c("Coal", "Coal", "Coal", "Oil", "Gas", "Gas",
"Nuclear", "Nuclear", "Hydro", "Hydro"), start.date = c(NA,
NA, NA, NA, NA, NA, 1985, 1985, 2016, 1973), owner.of.plant = c("Macquarie Generation",
"Macquarie Generation", "GEAC Great Energy Alliance Corporation",
"AzerEnerji", "Aluminum Bahrain", "", "", "", "", ""), generation_gwh_2021 = c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_), estimated_generation_gwh_2021 = c(NA,
NA, NA, NA, NA, NA, NA, NA, 17396.84, 6318.07), generated = c(NA,
NA, NA, NA, NA, NA, NA, NA, 17396.84, 6318.07)), row.names = c(356L,
565L, 573L, 927L, 942L, 1017L, 1044L, 1083L, 1386L, 2164L), class = "data.frame")
I'd pivot your data to long format and use ggplot2:
library(tidyr)
library(ggplot2)
PF2_long = PF2 %>%
  pivot_longer(cols = c(generated, capacity.in.MW), names_to = "measure")
ggplot(PF2_long, aes(x = primary_fuel, y = value, fill = measure)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c("blue", "grey60")) +
  labs(
    x = "Primary fuel",
    y = "MW",
    fill = ""
  ) +
  theme_bw()
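The plot above assumes a table with one row per fuel type. As a minimal base-R sketch of that grouping step, the snippet below totals capacity per fuel using only the fuel and capacity columns from the question's sample data; capacity_by_fuel is a name chosen for illustration.

```r
# Subset of the question's sample data (fuel type and capacity only)
PF <- data.frame(
  primary_fuel = c("Coal", "Coal", "Coal", "Oil", "Gas", "Gas",
                   "Nuclear", "Nuclear", "Hydro", "Hydro"),
  capacity.in.MW = c(2640, 2200, 2180, 2400, 2204, 2460,
                     2910, 2053.8, 3327.45544, 3444),
  stringsAsFactors = FALSE
)

# One total per fuel type; the result is named by fuel
capacity_by_fuel <- tapply(PF$capacity.in.MW, PF$primary_fuel, sum)

# barplot() takes the bar labels from the names, so each fuel appears once
barplot(capacity_by_fuel, las = 2, ylab = "Capacity (MW)")
```

Because the totals are computed before plotting, there are no repeated row names for R to make unique by appending numbers.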

Using gsub for removing unwanted characters : facing issues

df$Claim_Value <- gsub("Rs.", "", df$`Total Amount Claimed`)
I checked class(df$`Total Amount Claimed`), and it shows numeric.
Will gsub work on a numeric column?
Here df$`Total Amount Claimed` is a column that holds amounts prefixed with the text Rs., for example Rs.200000. I am trying to remove Rs. from this column, so I used gsub. It works, but the amount is shown in thousands and not in lakhs.
How can I show the amount in lakhs?
structure(list(Approver = c("Amarjeet Singh", "Amit Barot", "Amit Barot",
"Amit Barot", "Amit Barot", "Amit Barot"), `Assigned To` = c("SOLUTIONS.MOHALI#PALENGINEERS.IN",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM"), `Resolution Date` = structure(c(1609341652,
1574165400, 1591818814, 1592327216, 1592397052, 1592496000), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), `Allow Submit Till Date` = structure(c(NA,
1589414400, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`Amt App by CSS (Without Tax)` = c(NA, NA, NA, NA, NA, NA
), `ESN/Alternator No.` = c("AAG8045S126087", "84846607",
"22321621", "191014875", "25452001", "78939252"), `Auto Approved` = c("No",
"No", "No", "No", "No", "No"), BIS = c("N", "N", "N", "Y",
"Y", "N"), `Batch Amount` = c(NA, NA, NA, NA, NA, NA), `Batch Date` = c(NA,
NA, NA, NA, NA, NA), `Batch Number` = c(NA, NA, NA, NA, NA,
NA), `Category Of Service` = c("Maintenance or repair service",
"Maintenance or repair service", "Maintenance or repair service",
"Maintenance or repair service", "Maintenance or repair service",
"Maintenance or repair service"), `Claim Scope` = c("In Scope",
"In Scope", "In Scope", "In Scope", "In Scope", "In Scope"
), `Claim Type` = c("WARRANTY", "WARRANTY", "WARRANTY", "WARRANTY",
"WARRANTY", "WARRANTY"), `Customer Name` = c("SUDHIR POWER LIMITED",
"BHARGAV EARTH MOVERS", "WAGAD INFRA PROJECT PVT LTD", "SUDHIR POWER LIMITED",
"SUDHIR POWER LIMITED", "CORE MULTI SERVICE"), `Final Amount Approved...16` = c(NA,
NA, NA, NA, NA, NA), `Division Name` = c("Pal Engineers - Mohali",
"Sudhir (Ahmedabad) - Rajkot", "Sudhir Sales & Services Limited, Ahmedabad",
"Sudhir Sales & Services Limited, Ahmedabad", "Sudhir Sales & Services Limited, Ahmedabad",
"Sudhir Sales & Services Limited, Ahmedabad"), `Failure Type` = c("Warranty Failure",
"Warranty Failure", "Warranty Failure", "Warranty Failure",
"Warranty Failure", "Warranty Failure"), `GIEA Agreement Name` = c(NA,
NA, NA, NA, NA, NA), `Cummins Invoice Num` = c(NA, NA, NA,
NA, NA, NA), Agreement = c(NA, NA, NA, NA, NA, NA), `Problem Summary` = c("Electrical issue / PCC Controller issues / Starter / alternator issue / Battery",
"Engine not starting / Tripping / Not stopping", "Maintenance / General Check",
"Engine not starting / Tripping / Not stopping", "Engine not starting / Tripping / Not stopping",
"Leakages - Oil/ Fuel/ Coolant / Air"), `Resolution Summary` = c("After recharge battery tested and failed on load test replaced battery warranty",
"REPAIRED THE FUEL PUMP TAKEN TRAIL ALL PARAMETER LIMIT",
"Last service done by 23/12/2019 at 724 hours qt this time change air filter also.today service done at 974 hours.in between customer says top up oil 2.5 ltr then start the engine running ok now all parameters within limits.",
"attend site check & found starter loose connection then correct it & Suggested to customer requests load balances and require proper ventilation for dg set suction and discharge air .",
"ATTEND THE SITE OBSERVE ENGINE FOUND FAULT SHUTDOWN ERROR NEED TO VISIT OEM SIDE",
"ATTEND SITE CHECK & FOUND FUEL LEAKAGE FROM BLEIND PLUG THEN REMOVED IT & FITMENT GAIN & START ENGINE & FOUND ENGINE RUNNING WITHIN LIMIT.."
), `Claim Rejected` = c("Y", "Y", "Y", "Y", "Y", "Y"), `SR Number` = c("SR-PE-MO-2021-006884",
"SR-SU-RJ-1920-002793", "SR-SU-AH-2021-000683", "SR-SU-AH-2021-000857",
"SR-SU-AH-2021-000865", "SR-SU-AH-2021-000913"), `Service Type` = c(NA,
NA, NA, NA, NA, NA), `Sub Type` = c(NA, NA, NA, NA, NA, NA
), `Amount Claimed By Dealer` = c("Rs.5,721.00", "Rs.19,087.00",
"Rs.1,166.00", "Rs.836.00", "Rs.1,034.00", "Rs.2,057.00"),
`Processed By...29` = c("SOLUTIONS.MOHALI#PALENGINEERS.IN",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM"), `Claim #` = c("1-5W4ZZVR",
"1-5PWNEAT", "1-5QQ4Z4J", "1-5QWC86P", "1-5QXPYU1", "1-5QXU7VN"
), `Claim Category` = c("STANDARD", "STANDARD", "STANDARD",
"STANDARD", "STANDARD", "STANDARD"), `Claim Creation Date` = structure(c(1609844392,
1588360803, 1591890038, 1592481430, 1592577627, 1592582659
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), `Created By` = c("1-5LS00O1",
"1-2CD07UT", "1-2CD07UT", "1-2CD07UT", "1-2CD07UT", "1-2CD07UT"
), `Currency Code` = c("INR", "INR", "INR", "INR", "INR",
"INR"), Partner = c(NA, NA, NA, NA, NA, NA), `Final Amount Approved...36` = c("Rs.0.00",
"Rs.0.00", "Rs.0.00", "Rs.0.00", "Rs.0.00", "Rs.0.00"), `Fund Req Category` = c(NA,
NA, NA, NA, NA, NA), Comments = c(NA, NA, NA, NA, NA, NA),
`Claim Name` = c("CLM-PE-MO-2021-002442", "CLM-SU-RJ-2021-000055",
"CLM-SU-AH-2021-000527", "CLM-SU-AH-2021-000627", "CLM-SU-AH-2021-000641",
"CLM-SU-AH-2021-000643"), `Organization Name` = c("Pal Engineers, Jammu",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD",
"Sudhir Sales & Services Limited, Ahmedabad, AHMEDABAD"),
Period = c(NA, NA, NA, NA, NA, NA), `Pre-Approval #` = c(NA,
NA, NA, NA, NA, NA), `Processed By...43` = c("SOLUTIONS.MOHALI#PALENGINEERS.IN",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM",
"CAMC2#SUDHIRGROUP.COM", "CAMC2#SUDHIRGROUP.COM"), `Program Account Name` = c(NA,
NA, NA, NA, NA, NA), `Program Name` = c(NA, NA, NA, NA, NA,
NA), `Promotion Name` = c("BTRY_CHANDIGARH", "CIL_20000",
"CIC_5000", "Warranty_BIS_RECON", "Warranty_BIS_RECON", "CIC_5000"
), Description = c(NA, NA, NA, NA, NA, NA), Status = c("Pending",
"Pending", "Pending", "Pending", "Pending", "Pending"), `Final Approval Date` = c(NA,
NA, NA, NA, NA, NA), `Submitted By` = c("SOLUTIONS.MOHALI#PALENGINEERS.IN",
"WARRANTY.AHD#SUDHIRGROUP.COM", "WARRANTY.AHD#SUDHIRGROUP.COM",
"WARRANTY.AHD#SUDHIRGROUP.COM", "WARRANTY.AHD#SUDHIRGROUP.COM",
"WARRANTY.AHD#SUDHIRGROUP.COM"), `Total Amount Approved` = c(0,
0, 0, 0, 0, 0), `Total Amount Claimed` = c("Rs.5,721.00",
"Rs.19,087.00", "Rs.1,166.00", "Rs.836.00", "Rs.1,034.00",
"Rs.2,057.00"), `Total Participation Amount` = c(NA, NA,
NA, NA, NA, NA), Updated = structure(c(1610113437, 1589227258,
1591896091, 1592491326, 1592645576, 1592839702), class = c("POSIXct",
"POSIXt"), tzone = "UTC"), `Updated By` = c("1-4YO9LU", "1-2QTU4R",
"1-SDR5", "1-SDRU", "1-SDRU", "1-SDR5"), `Resolved By FSL` = c("Y",
"N", "Y", "Y", "N", "N"), `Parts Warranty Claim` = c(NA,
NA, NA, NA, NA, NA), `Inbox Last Updated` = structure(c(1610113437,
1589227258, 1591896091, 1592491326, 1592645576, 1592839702
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), `Claim Submitted Date` = structure(c(1609939043,
NA, NA, NA, NA, NA), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`Claim Rejection Reason` = c("Incorrect/Missing Commercial Bills",
"Incorrect/Missing Technical Documents", "HCS or KAM Approval Required",
"Incorrect/Missing Technical Documents", "Incorrect/Missing Technical Documents",
"Incorrect/Missing Technical Documents"), `Claim Categorization Reason` = c(NA,
NA, NA, NA, NA, NA), Aging = c(2.4278125, 244.16599537037,
213.276724537037, 206.387430555556, 204.60212962963, 202.355300925926
), AgeGroup = structure(c(2L, 6L, 6L, 6L, 6L, 6L), .Label = c("0-1 Days",
"2-4 Days", "5-7 Days", "8-15 Days", "16-30 Days", ">30 Days"
), class = "factor"), Zones = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
), Approver.y = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_), Claim_Value = c(NA,
NA, NA, 836L, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
The following should work fine.
as.numeric(gsub("Rs.", "", "Rs 2000"))
This works provided the df$`Total Amount Claimed` column is of character type and not a factor.
To show amounts in lakhs in full rather than in exponential format, set the option
options("scipen"=100, "digits"=4)
You can convert the values to numeric using gsub in the following way:
df$`Total Amount Claimed`
#[1] "Rs.5,721.00" "Rs.19,087.00" "Rs.1,166.00" "Rs.836.00" "Rs.1,034.00" "Rs.2,057.00"
df$Claim_Value <- as.numeric(gsub('Rs\\.|,', '', df$`Total Amount Claimed`))
df$Claim_Value
#[1] 5721 19087 1166 836 1034 2057
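A short sketch of why both the prefix and the commas need to go, using three of the sample strings. Removing only "Rs." leaves values such as "5,721.00", which as.numeric() cannot parse; that may explain the NA entries in the Claim_Value column of the sample data. The variable names here are illustrative only.

```r
amounts <- c("Rs.5,721.00", "Rs.19,087.00", "Rs.836.00")

# Removing only "Rs." leaves the thousands separator behind, and
# as.numeric() returns NA (with a warning) for strings like "5,721.00"
partly_cleaned <- suppressWarnings(as.numeric(gsub("Rs.", "", amounts)))

# Strip the literal prefix (escaped dot) and the commas in one pass
claim_value <- as.numeric(gsub("^Rs\\.|,", "", amounts))

# 1 lakh = 100,000, so divide to express the amounts in lakhs
claim_lakhs <- claim_value / 1e5

# Print a large value in full instead of scientific notation
format(200000, big.mark = ",", scientific = FALSE)
```

Using format() changes only how one value is displayed, whereas options(scipen = ...) changes printing for the whole session.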

How to create a for loop based on unique user IDs and specific event types

I have two data frames: users and events.
Both data frames contain a field that links events to users.
How can I create a for loop that matches every user's unique ID against events of a particular type and then stores the number of occurrences in a new column within users (users$conversation_started, users$conversation_missed, etc.)?
In short, it is a conditional for loop.
So far I have this but it is wrong:
for(i in users$id){
users$conversation_started <- nrow(event[event$type = "conversation-started"])
}
An example of how to do this would be ideal.
The idea is:
for(each user)
find the matching user ID in events
count the number of event types == "conversation-started"
assign count value to user$conversation_started
end for
Important note:
The type field can contain one of five values so I will need to be able to effectively filter on each type for each associate:
> events$type %>% table %>% as.matrix
                               [,1]
conversation-accepted          3120
conversation-already-accepted 19673
conversation-declined            27
conversation-missed             831
conversation-request          23427
Data frames (note that these are reduced versions as confidential information has been removed):
users <- structure(list(`_id` = c("JTuXhdI4Ai", "iGIeCEXyVE", "6XFtOJh0bD",
"mNN986oQv9", "9NI71KBMX9", "x1jH7t0Cmy"), language = c("en",
"en", "en", "en", "en", "en"), registering = c(TRUE, TRUE, FALSE,
FALSE, FALSE, NA), `_created_at` = structure(c(1485995043.131,
1488898839.838, 1480461193.146, 1481407887.979, 1489942757.189,
1491311381.916), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`_updated_at` = structure(c(1521039527.236, 1488898864.834,
1527618624.877, 1481407959.116, 1490043838.561, 1491320333.09
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), lastOnlineTimestamp = c(1521039526.90314,
NA, 1480461472, 1481407959, 1490043838, NA), isAgent = c(FALSE,
NA, FALSE, FALSE, FALSE, NA), lastAvailableTime = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct",
"POSIXt"), tzone = ""), available = c(NA, NA, NA, NA, NA,
NA), busy = c(NA, NA, NA, NA, NA, NA), joinedTeam = structure(c(NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct",
"POSIXt"), tzone = ""), timezone = c(NA_character_, NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_
)), row.names = c("list.1", "list.2", "list.3", "list.4",
"list.5", "list.6"), class = "data.frame")
and
events <- structure(list(`_id` = c("JKY8ZwkM1S", "CG7Xj8dAsA", "pUkFFxoahy",
"yJVJ34rUCl", "XxXelkIFh7", "GCOsENVSz6"), expirationTime = structure(c(1527261147.873,
NA, 1527262121.332, NA, 1527263411.619, 1527263411.619), class = c("POSIXct",
"POSIXt"), tzone = ""), partId = c("d22bfddc-cd51-489f-aec8-5ab9225c0dd5",
"d22bfddc-cd51-489f-aec8-5ab9225c0dd5", "cf4356da-b63e-4e4d-8e7b-fb63035801d8",
"cf4356da-b63e-4e4d-8e7b-fb63035801d8", "a720185e-c300-47c0-b30d-64e1f272d482",
"a720185e-c300-47c0-b30d-64e1f272d482"), type = c("conversation-request",
"conversation-accepted", "conversation-request", "conversation-accepted",
"conversation-request", "conversation-request"), `_p_conversation` = c("Conversation$6nSaLeWqs7",
"Conversation$6nSaLeWqs7", "Conversation$6nSaLeWqs7", "Conversation$6nSaLeWqs7",
"Conversation$bDuAYSZgen", "Conversation$bDuAYSZgen"), `_p_merchant` = c("Merchant$0A2UYADe5x",
"Merchant$0A2UYADe5x", "Merchant$0A2UYADe5x", "Merchant$0A2UYADe5x",
"Merchant$0A2UYADe5x", "Merchant$0A2UYADe5x"), `_p_associate` = c("D9ihQOWrXC",
"D9ihQOWrXC", "D9ihQOWrXC", "D9ihQOWrXC", "D9ihQOWrXC", "D9ihQOWrXC"
), `_wperm` = list(list(), list(), list(), list(), list(), list()),
`_rperm` = list("*", "*", "*", "*", "*", "*"), `_created_at` = structure(c(1527264657.998,
1527264662.043, 1527265661.846, 1527265669.435, 1527266922.056,
1527266922.059), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
`_updated_at` = structure(c(1527264657.998, 1527264662.043,
1527265661.846, 1527265669.435, 1527266922.056, 1527266922.059
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), read = c(TRUE,
NA, TRUE, NA, NA, NA), data.customerName = c("Shopper 109339",
NA, "Shopper 109339", NA, "Shopper 109364", "Shopper 109364"
), data.departmentName = c("Personal advisors", NA, "Personal advisors",
NA, "Personal advisors", "Personal advisors"), data.recurring = c(FALSE,
NA, TRUE, NA, FALSE, FALSE), data.new = c(TRUE, NA, FALSE,
NA, TRUE, TRUE), data.missed = c(0L, NA, 0L, NA, 0L, 0L),
data.customerId = c("84uOFRLmLd", "84uOFRLmLd", "84uOFRLmLd",
"84uOFRLmLd", "5Dw4iax3Tj", "5Dw4iax3Tj"), data.claimingTime = c(NA,
4L, NA, 7L, NA, NA), data.lead = c(NA, NA, FALSE, NA, NA,
NA), data.maxMissed = c(NA, NA, NA, NA, NA, NA), data.associateName = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), data.maxDecline = c(NA, NA, NA, NA, NA, NA
), data.goUnavailable = c(NA, NA, NA, NA, NA, NA)), row.names = c("list.1",
"list.2", "list.3", "list.4", "list.5", "list.6"), class = "data.frame")
Update: 21st September 2018
This solution now produces a data frame containing only NA values at the end of the function. When it is written to a .csv, the new columns come out empty (naturally, Excel displays NA values as blank values).
My data source has not changed, nor has my script.
What might be causing this?
My guess is that an unforeseen case has occurred in which there were 0 hits at each step. If so, is there a way to record 0 for the cases where there weren't any hits, rather than NA/blank values?
Is there a way to avoid this?
New solution based on the provided data.
Note: As your data had no overlap in _id, I changed the events$_id to be the same as in users.
Simplified example data:
users <- structure(list(`_id` = structure(c(4L, 3L, 1L, 5L, 2L, 6L),
.Label = c("6XFtOJh0bD", "9NI71KBMX9", "iGIeCEXyVE",
"JTuXhdI4Ai", "mNN986oQv9", "x1jH7t0Cmy"),
class = "factor")), .Names = "_id",
row.names = c(NA, -6L), class = "data.frame")
events <- structure(list(`_id` = c("JKY8ZwkM1S", "CG7Xj8dAsA", "pUkFFxoahy",
"yJVJ34rUCl", "XxXelkIFh7", "GCOsENVSz6"),
type = c("conversation-request", "conversation-accepted",
"conversation-request", "conversation-accepted",
"conversation-request", "conversation-request")),
.Names = c("_id", "type"), class = "data.frame",
row.names = c("list.1", "list.2", "list.3", "list.4", "list.5", "list.6"))
events$`_id` <- users$`_id`
> users
         _id
1 JTuXhdI4Ai
2 iGIeCEXyVE
3 6XFtOJh0bD
4 mNN986oQv9
5 9NI71KBMX9
6 x1jH7t0Cmy
> events
              _id                  type
list.1 JTuXhdI4Ai  conversation-request
list.2 iGIeCEXyVE conversation-accepted
list.3 6XFtOJh0bD  conversation-request
list.4 mNN986oQv9 conversation-accepted
list.5 9NI71KBMX9  conversation-request
list.6 x1jH7t0Cmy  conversation-request
We can use the same approach I suggested before, just enhance it a bit.
First we loop over unique(events$type) to store a table() of every type of event per id in a list:
test <- lapply(unique(events$type), function(x) table(events$`_id`, events$type == x))
Then we store the specific type as the name of the respective table in the list:
names(test) <- unique(events$type)
Now we use a simple for-loop to match() users$`_id` with the row names of each table and store the counts in a new column named after the event type:
for(i in names(test)){
  users[, i] <- test[[i]][, 2][match(users$`_id`, rownames(test[[i]]))]
}
Result:
> users
         _id conversation-request conversation-accepted
1 JTuXhdI4Ai                    1                     0
2 iGIeCEXyVE                    0                     1
3 6XFtOJh0bD                    1                     0
4 mNN986oQv9                    0                     1
5 9NI71KBMX9                    1                     0
6 x1jH7t0Cmy                    1                     0
Hope this helps!
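On the question in the update about getting 0 instead of NA, one possible sketch is to fix the factor levels before calling table(), so that users and event types with no hits are still counted as 0. The toy data and the column names id and user_id are assumptions for illustration; in the real data the linking field would take their place.

```r
# Toy data; `id` and `user_id` are hypothetical linking columns
users <- data.frame(id = c("u1", "u2", "u3"), stringsAsFactors = FALSE)
events <- data.frame(
  user_id = c("u1", "u1", "u2", "u1"),
  type = c("conversation-request", "conversation-accepted",
           "conversation-request", "conversation-request"),
  stringsAsFactors = FALSE
)

# List every type of interest, including ones with no events in the data
event_types <- c("conversation-request", "conversation-accepted",
                 "conversation-missed")

# With explicit levels, table() reports 0 for users or types with no events
counts <- table(factor(events$user_id, levels = users$id),
                factor(events$type, levels = event_types))

# One new column per event type; rows are already in the order of users$id
for (type in event_types) {
  users[[type]] <- as.integer(counts[, type])
}
```

Here user u3 has no events and conversation-missed never occurs, yet both appear in the result with counts of 0 rather than NA.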

Using mutate and a lookup/calc function

I wrote a function that takes a company name, looks up a set of records in a second table, calculates a complicated result, and returns it.
I want to process all companies and add the result to each company's record as a new value.
I am using the following code:
aa <- mutate(companies,newcol=sum_rounds(companies$company_name))
But I get the following warning:
Warning message:
In c("Bwom", "Symple", "TravelTriangle", "Ark Biosciences", "Artizan Biosciences", :
longer object length is not a multiple of shorter object length
(each of these is a company name)
The companies data frame gets a new column, but all of its values are FALSE, where there should actually be both TRUE and FALSE values.
Any advice would be welcome to a newbie.
Function follows:
sum_rounds<-function(co_name) {
#get records from rounds for the company name passed to the function
#remove NAs from column roundtype too
outval<- rounds %>%
filter(company_name.x==co_name & !is.na(roundtype)) %>%
#sort by date round is announced
arrange(announced_on) %>%
select(roundtype) %>%
#create a string of all round types in order
apply(2,paste,collapse="")
#the values from mixed to "M", venture to "V" and pureangel to "A"
# now see if it is of the form aaaaa (and #) followed by m or v
# in grep: ^ is start of a line and + is for at least one copy
# [mv] is either m or v
# nice summary is here: http://www.endmemo.com/program/R/gsub.php
#is angel2vc?
angel2vc<-grepl("^a+[mv]+",outval)
#return(list("roundcodes"=outval,"angel2vc"=angel2vc))
return(angel2vc)
}
DPUT from Companies table Follows:
structure(list(company_name = c("Bwom", "Symple", "TravelTriangle",
"Ark Biosciences", "Artizan Biosciences", "Audiense"), domain = c("b-wom.com",
"getsymple.com", "traveltriangle.com", "arkbiosciences.com",
NA, "audiense.com"), country_code = c("ESP", "USA", "USA", "CHN",
"USA", "GBR"), state_code = c(NA, "CA", "VA", NA, "NC", NA),
region = c("Barcelona", "SF Bay Area", "Washington, D.C.",
"Shanghai", "Raleigh", "London"), city = c("Barcelona", "San Francisco",
"Charlottesville", "Shanghai", "Durham", "London"), status = c("operating",
"operating", "operating", "operating", "operating", "operating"
), short_description = c("Bwom is a tool that offers a test and personalized exercises for women's intimate health.",
"Symple is the cloud platform for all your business payments. Pay, get paid, connect.",
"TravelTriangle enables travel enthusiasts to reserve a personalized holiday plan with a local travel agent.",
"Ark Biosciences is a biopharmaceutical company that is dedicated to the discovery and development",
"Artizan Biosciences", "SaaS developer delivering unique consumer insight and engagement capabilities to many of the world’s biggest brands and agencies."
), category_list = c("health care", "cloud computing|machine learning|mobile apps|mobile payments|retail technology",
"e-commerce|personalization|tourism|travel", "health care",
"biopharma", "analytics|apps|marketing|market research|social crm|social media|social media marketing"
), category_group_list = c("health care", "apps|commerce and shopping|data and analytics|financial services|hardware|internet services|mobile|payments|software",
"commerce and shopping|travel and tourism", "health care",
"biotechnology|health care|science and engineering", "apps|data and analytics|design|information technology|internet services|media and entertainment|sales and marketing|software"
), employee_count = c("1 to 10", "11 to 50", "101 to 250",
NA, "1 to 10", "51 to 100"), funding_rounds = c(2L, 1L, 4L,
2L, 2L, 5L), funding_total_usd = c(1075791, 120000, 19900000,
NA, 3e+06, 8013391), founded_on = structure(c(16555, 16770,
15156, 16071, NA, 14975), class = "Date"), first_funding_on = structure(c(16526,
17204, 15492, 16532, 17091, 15294), class = "Date"), last_funding_on = structure(c(17204,
17204, 17204, 17203, 17203, 17203), class = "Date"), closed_on = c(NA_character_,
NA_character_, NA_character_, NA_character_, NA_character_,
NA_character_), email = c("hello#b-wom.com", "info#getsymple.com",
"admin#traveltriangle.com", "info#arkbiosciences.com", NA,
"moreinfo#audiense.com"), phone = c(NA, NA, "'+91 98 99 120408",
"###############################################################################################################################################################################################################################################################",
NA, "###############################################################################################################################################################################################################################################################"
), cb_url = c("https://www.crunchbase.com/organization/bwom",
"https://www.crunchbase.com/organization/symple-2", "https://www.crunchbase.com/organization/traveltriangle-com",
"https://www.crunchbase.com/organization/ark-biosciences",
"https://www.crunchbase.com/organization/artizan-biosciences",
"https://www.crunchbase.com/organization/socialbro"), twitter_url = c("https://www.twitter.com/hellobwom",
NA, "https://www.twitter.com/traveltriangle", NA, NA, "https://www.twitter.com/socialbro"
), facebook_url = c("https://www.facebook.com/hellobwom/?fref=ts",
NA, "http://www.facebook.com/traveltriangle", NA, NA, "http://www.facebook.com/socialbro"
), uuid = c("e6096d58-3454-d982-0dbe-7de9b06cd493", "fd0ab78f-0dc4-1f18-21d1-7ce9ff7a173b",
"742043c1-c17a-4526-4ed0-e911e6e9555b", "8e27eb22-ce03-a2af-58ba-53f0f458f49c",
"ed07ac9e-1071-fca0-46d9-42035c2da505", "fed333e5-2754-7413-1e3d-5939d70541d2"
), isbio = c("other", "other", "other", "other", "bio", "other"
), co_type = c("m", "m", "m", "v", "v", "m")), .Names = c("company_name",
"domain", "country_code", "state_code", "region", "city", "status",
"short_description", "category_list", "category_group_list",
"employee_count", "funding_rounds", "funding_total_usd", "founded_on",
"first_funding_on", "last_funding_on", "closed_on", "email",
"phone", "cb_url", "twitter_url", "facebook_url", "uuid", "isbio",
"co_type"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))

Extract specific columns from dataset, create column of NAs if it doesn't exist

Data frame df has 57 columns. I later read in other CSV files, each of which may have the same 57 columns but more likely has more or fewer. I store the column names of the original file as:
df = read.csv(...)
str = colnames(df)
I know I can take subsets of a data frame as:
file = read.csv(...)
file = file[, str]
If file contains all of the original 57 columns, this works fine: any extra columns are simply dropped. However, if file is missing any of the original 57 columns, the following error arises:
Error in `[.data.frame`(file, , str) : undefined columns selected
Is there a way to take this same approach, but create columns of NA if the column does not exist in file?
EDIT: Including dput output for @akrun. I'm not familiar with dput, so I hope this is what you were asking for:
File 1 example:
`structure(list(ObservationURI = c("http://resources.usgin.org/uri-gin/wygs/bhtemp/49-037-20341_182_12296/",
"http://resources.usgin.org/uri-gin/wygs/bhtemp/49-037-20341_215_14316/",
"http://resources.usgin.org/uri-gin/wygs/bhtemp/49-037-20341_236_16496/"
), WellName = c("1 BRADY UNIT ANADARKO E&P COMPANY LP", "1 BRADY UNIT ANADARKO E&P COMPANY LP",
"1 BRADY UNIT ANADARKO E&P COMPANY LP"), APINo = c("49-037-20341",
"49-037-20341", "49-037-20341"), HeaderURI = c("http://resources.usgin.org/uri-gin/wygs/well/3720341/",
"http://resources.usgin.org/uri-gin/wygs/well/3720341/", "http://resources.usgin.org/uri-gin/wygs/well/3720341/"
), OtherID = c(3720341, 3720341, 3720341), OtherName = c(NA,
NA, NA), BoreholeName = c(NA, NA, NA), Label = c("Temperature observation for well 3720341",
"Temperature observation for well 3720341", "Temperature observation for well 3720341"
), Operator = c("", "", ""), LeaseName = c("", "", ""), LeaseOwner = c("",
"", ""), LeaseNo = c("", "", ""), SpudDate = c("1900-01-01T00:00",
"1900-01-01T00:00", "1900-01-01T00:00"), EndedDrillingDate = c("",
"", ""), WellType = c("Oil", "Oil", "Oil"), Status = c("Producing Oil Well",
"Producing Oil Well", "Producing Oil Well"), CommodityOfInterest = c("",
"", ""), StatusDate = c("1973-05-03T00:00:00", "1973-05-03T00:00:00",
"1973-05-03T00:00:00"), Function = c(NA, NA, NA), Production = c(NA,
NA, NA), ProducingInterval = c(NA, NA, NA), ReleaseDate = c(NA,
NA, NA), Field = c("", "", ""), OtherLocationName = c("Great Divide Basin",
"Great Divide Basin", "Great Divide Basin"), County = c("Sweetwater",
"Sweetwater", "Sweetwater"), State = c("WY", "WY", "WY"), PLSS_Meridians = c(NA,
NA, NA), TWP = c("16N", "16N", "16N"), RGE = c("101W", "101W",
"101W"), Section_ = c(11, 11, 11), SectionPart = c("NENW", "NENW",
"NENW"), Parcel = c(NA, NA, NA), UTM_E = c(NA, NA, NA), UTM_N = c(NA,
NA, NA), UTMDatumZone = c(NA, NA, NA), LatDegree = c(41.38696,
41.38696, 41.38696), LongDegree = c(-108.75009, -108.75009, -108.75009
), SRS = c("EPSG:4326", "EPSG:4326", "EPSG:4326"), LocationUncertaintyStatement = c("nil:missing",
"nil:missing", "nil:missing"), LocationUncertaintyCode = c(NA,
NA, NA), LocationUncertaintyRadius = c(NA, NA, NA), DrillerTotalDepth = c(NA_real_,
NA_real_, NA_real_), DepthReferencePoint = c(NA, NA, NA), LengthUnits = c("ft",
"ft", "ft"), WellBoreShape = c(NA, NA, NA), TrueVerticalDepth = c(NA,
NA, NA), ElevationKB = c(7135, 7135, 7135), ElevationDF = c(7106,
7106, 7106), ElevationGL = c(0, 0, 0), FormationTD = c("", "",
""), BitDiameterCollar = c(NA, NA, NA), BitDiameterTD = c(NA_real_,
NA_real_, NA_real_), DiameterUnits = c("", "", ""), Notes = c("Depth of measurement assumed to be equal to driller total depth (CRC-AZGS, 2013).",
"Depth of measurement assumed to be equal to driller total depth (CRC-AZGS, 2013).",
"Depth of measurement assumed to be equal to driller total depth (CRC-AZGS, 2013)."
), MaximumRecordedTemperature = c(NA_real_, NA_real_, NA_real_
), MeasuredTemperature = c(182, 215, 236), CorrectedTemperature = c(NA_real_,
NA_real_, NA_real_), TemperatureUnits = c(FALSE, FALSE, FALSE
), TimeSinceCirculation = c(NA_real_, NA_real_, NA_real_), CirculationDuration = c(11,
12, 12), MeasurementProcedure = c("Well log", "Well log", "Well log"
), CorrectionType = c(NA, NA, NA), DepthOfMeasurement = c(-99999,
-99999, -99999), MeasurementDateTime = c("", "", ""), MeasurementFormation = c("",
"", ""), MeasurementSource = c("Richard W. Davis: Deriving geothermal parameters from bottom-hole temperatures in Wyoming\" AAPG bulletin, V. 96, No. 8 (August 2012), pp. 1579-1592",
"Richard W. Davis: Deriving geothermal parameters from bottom-hole temperatures in Wyoming\" AAPG bulletin, V. 96, No. 8 (August 2012), pp. 1579-1592",
"Richard W. Davis: Deriving geothermal parameters from bottom-hole temperatures in Wyoming\" AAPG bulletin, V. 96, No. 8 (August 2012), pp. 1579-1592"
), RelatedResource = c(NA, NA, NA), CasingLogger = c(NA, NA,
NA), CasingBottomDepthDriller = c(NA, NA, NA), CasingTopDepth = c(NA_real_,
NA_real_, NA_real_), CasingPipeDiameter = c(NA, NA, NA), CasingWeight = c(NA,
NA, NA), CasingWeightUnits = c(NA, NA, NA), CasingThickness = c(NA,
NA, NA), DrillingFluid = c("", "", ""), Salinity = c(NA_real_,
NA_real_, NA_real_), MudResistivity = c(NA_real_, NA_real_, NA_real_
), Density = c(NA_real_, NA_real_, NA_real_), FluidLevel = c(NA_real_,
NA_real_, NA_real_), pH = c(NA_real_, NA_real_, NA_real_), Viscosity = c(NA_real_,
NA_real_, NA_real_), FluidLoss = c(NA_real_, NA_real_, NA_real_
), MeasurementNotes = c(NA, NA, NA), InformationSource = c("Wyoming State Geological Survey",
"Wyoming State Geological Survey", "Wyoming State Geological Survey"
)), .Names = c("ObservationURI", "WellName", "APINo", "HeaderURI",
"OtherID", "OtherName", "BoreholeName", "Label", "Operator",
"LeaseName", "LeaseOwner", "LeaseNo", "SpudDate", "EndedDrillingDate",
"WellType", "Status", "CommodityOfInterest", "StatusDate", "Function",
"Production", "ProducingInterval", "ReleaseDate", "Field", "OtherLocationName",
"County", "State", "PLSS_Meridians", "TWP", "RGE", "Section_",
"SectionPart", "Parcel", "UTM_E", "UTM_N", "UTMDatumZone", "LatDegree",
"LongDegree", "SRS", "LocationUncertaintyStatement", "LocationUncertaintyCode",
"LocationUncertaintyRadius", "DrillerTotalDepth", "DepthReferencePoint",
"LengthUnits", "WellBoreShape", "TrueVerticalDepth", "ElevationKB",
"ElevationDF", "ElevationGL", "FormationTD", "BitDiameterCollar",
"BitDiameterTD", "DiameterUnits", "Notes", "MaximumRecordedTemperature",
"MeasuredTemperature", "CorrectedTemperature", "TemperatureUnits",
"TimeSinceCirculation", "CirculationDuration", "MeasurementProcedure",
"CorrectionType", "DepthOfMeasurement", "MeasurementDateTime",
"MeasurementFormation", "MeasurementSource", "RelatedResource",
"CasingLogger", "CasingBottomDepthDriller", "CasingTopDepth",
"CasingPipeDiameter", "CasingWeight", "CasingWeightUnits", "CasingThickness",
"DrillingFluid", "Salinity", "MudResistivity", "Density", "FluidLevel",
"pH", "Viscosity", "FluidLoss", "MeasurementNotes", "InformationSource"
), row.names = c(NA, 3L), class = "data.frame")`
File 2 example:
`structure(list(ObservationURI = c("http://resources.usgin.org/uri-gin/mags/bhtemp/UM:MA-Weston47-422036N0711640.1/",
"http://resources.usgin.org/uri-gin/mags/bhtemp/UM:MA-Dover20-421431N0711752.1/",
"http://resources.usgin.org/uri-gin/mags/bhtemp/UM:MA-Lincoln13-422440N0711815.1/"
), WellName = c("Weston47-USGS HDR19", "Dover20-USGS HDR19",
"Lincoln13-USGS HDR19"), APINo = c(NA, NA, NA), HeaderURI = c("http://resources.usgin.org/uri-gin/mags/well/Weston47-USGS_HDR19/",
"http://resources.usgin.org/uri-gin/mags/well/Dover20-USGS_HDR19/",
"http://resources.usgin.org/uri-gin/mags/well/Lincoln13-USGS_HDR19/"
), OtherID = c("", "", ""), OtherName = c("", "", ""), BoreholeName = c(NA,
NA, NA), Operator = c(NA, NA, NA), LeaseOwner = c(NA, NA, NA),
LeaseNo = c(NA, NA, NA), SpudDate = c(NA, NA, NA), EndedDrillingDate = c("",
"", ""), WellType = c("temporarily abandoned", "observation",
"observation"), Status = c("Idle", "Idle", "Idle"), CommodityOfInterest = c("Water",
"Water", "Water"), StatusDate = c("", "", ""), Function = c("production",
"monitoring", "monitoring"), Production = c(NA, NA, NA),
Field = c(NA, NA, NA), County = c("Middlesex", "Norfolk",
"Middlesex"), State = c("MA", "MA", "MA"), PLSS_Meridians = c(NA,
NA, NA), TWP = c(NA, NA, NA), RGE = c(NA, NA, NA), Section_ = c(NA,
NA, NA), SectionPart = c(NA, NA, NA), Parcel = c(NA, NA,
NA), UTM_E = c(NA, NA, NA), UTM_N = c(NA, NA, NA), LatDegree = c(42.3147771183,
42.2417748607, 42.4110851252), LongDegree = c(-71.3257301787,
-71.2975422044, -71.3034583949), SRS = c("EPSG:4326", "EPSG:4326",
"EPSG:4326"), LocationUncertaintyStatement = c("Field located on topographic map",
"Field located on topographic map", "Field located on topographic map"
), DrillerTotalDepth = c(29, 22, 20), LengthUnits = c("ft",
"ft", "ft"), WellBoreShape = c("Vertical", "Vertical", "Vertical"
), TrueVerticalDepth = c(NA, NA, NA), ElevationGL = c(140,
150, 180), BitDiameterTD = c(72, 48, 42), DiameterUnits = c("in",
"in", "in"), Notes = c("", "", ""), MeasuredTemperature = c(8,
9, 8.5), CorrectedTemperature = c(NA, NA, NA), TemperatureUnits = c("C",
"C", "C"), TimeSinceCirculation = c(NA, NA, NA), CirculationDuration = c(NA,
NA, NA), MeasurementProcedure = c("Samples collected from spigot or faucet nearest to well. Water run until temperature, pH or specific conductance stablized. Temperature measured with a mercury thermometer to nearest half degree in degrees F. Converted to degrees C for table.",
"Samples collected from spigot or faucet nearest to well. Water run until temperature, pH or specific conductance stablized. Temperature measured with a mercury thermometer to nearest half degree in degrees F. Converted to degrees C for table.",
"Samples collected from spigot or faucet nearest to well. Water run until temperature, pH or specific conductance stablized. Temperature measured with a mercury thermometer to nearest half degree in degrees F. Converted to degrees C for table."
), CorrectionType = c(NA, NA, NA), DepthOfMeasurement = c(NA,
NA, NA), MeasurementDateTime = c(NA, NA, NA), MeasurementFormation = c(NA,
NA, NA), MeasurementSource = c("Walker, Eugene H., William W. Caswell, and S. William Wandle, Jr. Hydrologic Data of the Charles River Basin",
"Walker, Eugene H., William W. Caswell, and S. William Wandle, Jr. Hydrologic Data of the Charles River Basin",
"Walker, Eugene H., William W. Caswell, and S. William Wandle, Jr. Hydrologic Data of the Charles River Basin"
), CasingLogger = c(" Massachusetts\". USGS Massachusetts Hydrologic-Data Report No. 19 (1977): 1-57. Print. ftp://eclogite.geo.umass.edu/pub/stategeologist/Products/Geothermal/BoreholeTemperatureData/DataReport19.pdf\"",
" Massachusetts\". USGS Massachusetts Hydrologic-Data Report No. 19 (1977): 1-57. Print. ftp://eclogite.geo.umass.edu/pub/stategeologist/Products/Geothermal/BoreholeTemperatureData/DataReport19.pdf\"",
" Massachusetts\". USGS Massachusetts Hydrologic-Data Report No. 19 (1977): 1-57. Print. ftp://eclogite.geo.umass.edu/pub/stategeologist/Products/Geothermal/BoreholeTemperatureData/DataReport19.pdf\""
), CasingDepthDriller = c("", "", ""), CasingPipeDiameter = c("",
"", ""), CasingWeight = c(NA, NA, NA), CasingWeightUnits = c(NA,
NA, NA), CasingThickness = c(NA, NA, NA), DrillingFluid = c(NA,
NA, NA), Salinity = c(NA, NA, NA), MudResisitivity = c(NA,
NA, NA), Density = c(NA, NA, NA), FluidLevel = c(NA, NA,
NA), pH = c(NA, NA, NA), Viscosity = c(NA, NA, NA), FluidLoss = c(NA,
NA, NA), Unnamed..66 = c(NA, NA, NA), BitDiameterCollar = c(72,
48, 42), Unnamed..68 = c(NA, NA, NA), InformationSource = c("Stephen Mabee, MA State Geologist, University of Massachusetts, 611 North Pleasant Street, Amherst MA 01003 413-545-2285",
"Stephen Mabee, MA State Geologist, University of Massachusetts, 611 North Pleasant Street, Amherst MA 01003 413-545-2285",
"Stephen Mabee, MA State Geologist, University of Massachusetts, 611 North Pleasant Street, Amherst MA 01003 413-545-2285"
)), .Names = c("ObservationURI", "WellName", "APINo", "HeaderURI",
"OtherID", "OtherName", "BoreholeName", "Operator", "LeaseOwner",
"LeaseNo", "SpudDate", "EndedDrillingDate", "WellType", "Status",
"CommodityOfInterest", "StatusDate", "Function", "Production",
"Field", "County", "State", "PLSS_Meridians", "TWP", "RGE", "Section_",
"SectionPart", "Parcel", "UTM_E", "UTM_N", "LatDegree", "LongDegree",
"SRS", "LocationUncertaintyStatement", "DrillerTotalDepth", "LengthUnits",
"WellBoreShape", "TrueVerticalDepth", "ElevationGL", "BitDiameterTD",
"DiameterUnits", "Notes", "MeasuredTemperature", "CorrectedTemperature",
"TemperatureUnits", "TimeSinceCirculation", "CirculationDuration",
"MeasurementProcedure", "CorrectionType", "DepthOfMeasurement",
"MeasurementDateTime", "MeasurementFormation", "MeasurementSource",
"CasingLogger", "CasingDepthDriller", "CasingPipeDiameter", "CasingWeight",
"CasingWeightUnits", "CasingThickness", "DrillingFluid", "Salinity",
"MudResisitivity", "Density", "FluidLevel", "pH", "Viscosity",
"FluidLoss", "Unnamed..66", "BitDiameterCollar", "Unnamed..68",
"InformationSource"), row.names = c(NA, 3L), class = "data.frame")`
We can read the datasets into a list with fread and use rbindlist from data.table with fill = TRUE and the idcol argument to create a single data.table object. Setting fill = TRUE ensures that NA values are created for the datasets that have fewer columns.
library(data.table)
# get the CSV files from the working directory (pattern is a regular expression)
files <- list.files(pattern = "\\.csv$")
# read each file with fread, then row-bind the resulting data.tables
rbindlist(lapply(files, fread), fill = TRUE, idcol = "grp")
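If you would rather keep the file[, str] approach from the question, one possible base R sketch is to add the missing columns as NA first and then subset. The helper name align_columns and the toy data frames below are made up for illustration; they are not part of the original answer.

```r
# Hypothetical helper (an assumption, not from the original answer): align a
# data frame to a reference set of column names, adding all-NA columns for
# any reference column that is absent.
align_columns <- function(file, ref_cols) {
  missing_cols <- setdiff(ref_cols, names(file))
  for (col in missing_cols) {
    # One NA per row, so this also works for a data frame with zero rows
    file[[col]] <- rep(NA, nrow(file))
  }
  # Keep only the reference columns, in the reference order;
  # extra columns in file are dropped
  file[, ref_cols, drop = FALSE]
}

# Toy data standing in for the original file and a later file
df   <- data.frame(WellName = c("a", "b"), County = c("x", "y"), pH = c(7.1, 6.8))
file <- data.frame(County = c("z", "w"), Extra = c(1, 2), WellName = c("c", "d"))

ref_cols <- colnames(df)
aligned  <- align_columns(file, ref_cols)
print(aligned)
```

Note that this matches columns by exact name only. In the two sample files above, names such as MudResistivity / MudResisitivity and CasingBottomDepthDriller / CasingDepthDriller differ, so they would be treated as different columns (the same is true for rbindlist with fill = TRUE) unless they are renamed first.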
