R: changing all string nominal columns to integers

I have a dataset on which I'm planning to use ubRacing from the unbalanced package. But ubRacing only accepts numeric columns. Is there any way I can convert all the chr columns to numeric in R?
Thanks
'data.frame': 31000 obs. of 22 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : int 56 57 37 40 56 45 59 41 24 25 ...
$ job : chr "housemaid" "services" "services" "admin." ...
$ marital : chr "married" "married" "married" "married" ...
$ education : chr "basic.4y" "high.school" "high.school" "basic.6y" ...
$ default : chr "no" "unknown" "no" "no" ...
$ housing : chr "no" "no" "yes" "no" ...
$ loan : chr "no" "no" "no" "no" ...
$ contact : chr "telephone" "telephone" "telephone" "telephone" ...
$ month : chr "may" "may" "may" "may" ...
$ day_of_week : chr "mon" "mon" "mon" "mon" ...

It is not clear how the character columns should be converted to numeric. One possible option is to convert each character column to a factor and then coerce it to numeric, which gives the integer code of each level. We loop through the columns of the dataset with lapply.
df1[] <- lapply(df1, function(x) if(is.character(x)) as.numeric(factor(x))
else (x))
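As a small illustration of what that conversion produces, here is a sketch on made-up data (the `toy` data frame and its values are assumptions, not the question's dataset). Factor levels are sorted alphabetically and each string is replaced by the position of its level, so the resulting numbers are arbitrary codes rather than a meaningful ordering.

```r
# Toy data standing in for the question's data frame (values are made up).
toy <- data.frame(
  age     = c(56L, 57L, 37L),
  job     = c("services", "admin.", "services"),
  housing = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Replace every character column by the numeric code of its factor level;
# non-character columns are returned unchanged.
toy[] <- lapply(toy, function(x) {
  if (is.character(x)) as.numeric(factor(x)) else x
})

str(toy)
```

Whether such codes are appropriate depends on the model: for nominal variables, treating the codes as magnitudes may not be what you want.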

Related

Reading a dropbox file as data frame

I am trying to read a CSV file from a Dropbox link into a data frame with this call:
df <- read.csv("https://www.dropbox.com/s/vta51y5wyzu86m1/FY_2008.csv?dl=0", stringsAsFactors = FALSE)
However I receive this error:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
duplicate 'row.names' are not allowed
Can anyone help me figure out why this error occurs?
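Change the dl=0 at the end of the URL to dl=1. With dl=0, Dropbox returns an HTML preview page instead of the file itself, so read.csv() ends up parsing HTML; dl=1 requests the raw file.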
For an abbreviated demonstration, I'll limit to just the top 10 rows:
df <- read.csv("https://www.dropbox.com/s/vta51y5wyzu86m1/FY_2008.csv?dl=1", nrows=10)
str(df)
# 'data.frame': 10 obs. of 65 variables:
# $ contract_transaction_unique_key : chr "9700_9700_0000_0_W91QUZ07D0011_0" "9700_9700_0001_0_DAJA6196A0004_0" "6940_6940_0001_1_DTNH2208D00115_0" "9700_9700_0001_17_F0470001D0020_0" ...
# $ contract_award_unique_key : chr "CONT_AWD_0000_9700_W91QUZ07D0011_9700" "CONT_AWD_0001_9700_DAJA6196A0004_9700" "CONT_AWD_0001_6940_DTNH2208D00115_6940" "CONT_AWD_0001_9700_F0470001D0020_9700" ...
# $ award_id_piid : int 0 1 1 1 1 1 1 1 1 1
# $ modification_number : int 0 0 1 17 2 0 0 0 1 1
# $ transaction_number : int 0 0 0 0 0 0 0 0 0 0
# $ parent_award_agency_id : int 9700 9700 6940 9700 9700 9700 9700 9700 9700 9700
# $ parent_award_agency_name : chr "" "DEPT OF DEFENSE" "NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION" "" ...
# $ parent_award_id_piid : chr "W91QUZ07D0011" "DAJA6196A0004" "DTNH2208D00115" "F0470001D0020" ...
# $ parent_award_modification_number : chr "0" "0" "0" "P00013" ...
# $ federal_action_obligation : num 1082099 1104 0 -15741 -15927 ...
# $ total_dollars_obligated : num NA 1104 NA NA NA ...
# $ current_total_value_of_award : num NA 1104 NA NA NA ...
# $ potential_total_value_of_award : num NA 1104 NA NA NA ...
# $ disaster_emergency_fund_codes_for_overall_award : logi NA NA NA NA NA NA ...
# $ outlayed_amount_funded_by_COVID.19_supplementals_for_overall_aw: logi NA NA NA NA NA NA ...
# $ obligated_amount_funded_by_COVID.19_supplementals_for_overall_a: logi NA NA NA NA NA NA ...
# $ action_date : chr "2008-09-30" "2008-09-30" "2008-09-30" "2008-09-30" ...
# $ action_date_fiscal_year : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008
# $ period_of_performance_start_date : chr "2008-09-30 00:00:00" "2008-09-30 00:00:00" "2008-09-30 00:00:00" "2008-09-30 00:00:00" ...
# $ period_of_performance_current_end_date : chr "2009-09-29 00:00:00" "2008-09-30 00:00:00" "2009-12-18 00:00:00" "2003-11-30 00:00:00" ...
# $ period_of_performance_potential_end_date : chr "2009-09-29 00:00:00" "2008-09-30 00:00:00" "2009-12-18 00:00:00" "2003-11-30 00:00:00" ...
# $ awarding_agency_code : int 97 97 69 97 97 97 97 97 97 97
# $ awarding_agency_name : chr "DEPARTMENT OF DEFENSE (DOD)" "DEPARTMENT OF DEFENSE (DOD)" "DEPARTMENT OF TRANSPORTATION (DOT)" "DEPARTMENT OF DEFENSE (DOD)" ...
# $ awarding_sub_agency_code : int 2100 2100 6940 5700 5700 5700 5700 5700 5700 5700
# $ awarding_sub_agency_name : chr "DEPT OF THE ARMY" "DEPT OF THE ARMY" "NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION" "DEPT OF THE AIR FORCE" ...
# $ awarding_office_code : chr "W911W4" "W912PA" "00022" "FA9301" ...
# $ awarding_office_name : chr "W00Y CONTR OFC DODAAC" "ECC PARC EUROPE REGIONAL CONTRACTIN" "DEPT OF TRANS/NAT HIGHWAY TRAFFIC SAFETY ADM" "FA9301 AFTC PZIO" ...
# $ recipient_duns : int 614948396 123456787 49508120 848288408 92440044 52220485 144606436 132004701 122474104 57579807
# $ recipient_name : chr "WORLD WIDE TECHNOLOGY, INC." "MISCELLANEOUS FOREIGN AWARDEES" "WESTAT, INC." "ACCENT SERVICE COMPANY INC" ...
# $ recipient_doing_business_as_name : logi NA NA NA NA NA NA ...
# $ recipient_parent_duns : int 131784451 123456787 49508120 848288408 92440044 52220485 144606436 132004701 122474104 57579807
# $ recipient_parent_name : chr "WORLD WIDE TECHNOLOGY HOLDING CO. INC." "MISCELLANEOUS FOREIGN CONTRACTORS" "WESTAT INC." "ACCENT SERVICE COMPANY INC" ...
# $ recipient_country_code : chr "USA" "USA" "UNITED STATES" "UNITED STATES" ...
# $ recipient_country_name : chr "UNITED STATES OF AMERICA" "UNITED STATES" "" "" ...
# $ recipient_address_line_1 : chr "60 WELDON PKWY" "1800 F ST NW" "1650 RESEARCH BLVD RM RE164" "2001 LEMNOS DR" ...
# $ recipient_address_line_2 : logi NA NA NA NA NA NA ...
# $ recipient_city_name : chr "MARYLAND HEIGHTS" "WASHINGTON" "ROCKVILLE" "COSTA MESA" ...
# $ recipient_county_name : chr "ST. LOUIS" "DISTRICT OF COLUMBIA" "" "" ...
# $ recipient_state_code : chr "MO" "DC" "MD" "CA" ...
# $ recipient_state_name : chr "MISSOURI" "DISTRICT OF COLUMBIA" "" "" ...
# $ recipient_zip_4_code : int 63043 204050001 208503195 926263535 92408 329205818 769047833 223031802 782584092 782073102
# $ primary_place_of_performance_country_name : chr "UNITED STATES OF AMERICA" "GERMANY" "UNITED STATES" "UNITED STATES" ...
# $ primary_place_of_performance_city_name : chr "FORT BELVOIR" "" "ROCKVILLE" "EDWARDS" ...
# $ primary_place_of_performance_county_name : chr "FAIRFAX" "" "MONTGOMERY" "KERN" ...
# $ primary_place_of_performance_state_code : chr "VA" "" "MD" "CA" ...
# $ primary_place_of_performance_state_name : chr "VIRGINIA" "" "MARYLAND" "CALIFORNIA" ...
# $ award_or_idv_flag : chr "AWARD" "AWARD" "AWARD" "AWARD" ...
# $ award_type_code : chr "C" "C" "C" "C" ...
# $ award_type : chr "DO" "DELIVERY ORDER" "DO" "DO" ...
# $ type_of_contract_pricing_code : chr "J" "J" "3" "S" ...
# $ type_of_contract_pricing : chr "FIXED PRICE" "FIXED PRICE" "OTHER (NONE OF THE ABOVE)" "COST NO FEE" ...
# $ award_description : chr "PURCHASE OF ROUTERS, SERVERS, AND ANCILLARY EQUIPMENT. USED WORLD-WIDE IN SUPPORT OF MISSION." "LOCKSMITH SUPPLIES" "RFP FOR IDIQ CONTRACT - MULTIPLE AWARD" "BASIC CLEANING SERVICES" ...
# $ product_or_service_code : chr "7490" "4510" "R405" "S201" ...
# $ product_or_service_code_description : chr "MISCELLANEOUS OFFICE MACHINES" "PLUMBING FIXTURES AND ACCESSORIES" "OPERATIONS RESEARCH & QUANTITATIVE" "CUSTODIAL JANITORIAL SERVICES" ...
# $ naics_description : chr "WIRED TELECOMMUNICATIONS CARRIERS" "OTHER SUPPORT ACTIVITIES FOR ROAD TRANSPORTATION" "ENGINEERING SERVICES" "JANITORIAL SERVICES" ...
# $ domestic_or_foreign_entity : logi NA NA NA NA NA NA ...
# $ country_of_product_or_service_origin_code : chr "USA" "DEU" "NAN" "USA" ...
# $ extent_competed_code : chr "A" "A" "" "D" ...
# $ extent_competed : chr "FULL AND OPEN COMPETITION" "FULL AND OPEN COMPETITION" "" "FULL AND OPEN COMPETITION AFTER EXCLUSION OF SOURCES" ...
# $ parent_award_type_code : chr "" "B" "" "" ...
# $ parent_award_type : chr "" "IDC" "" "" ...
# $ cost_or_pricing_data_code : chr "N" "N" "" "N" ...
# $ cost_or_pricing_data : chr "NO" "NO" "" "NO" ...
# $ multi_year_contract_code : chr "N" "N" "N" "N" ...
# $ multi_year_contract : chr "NO" "NO" "NO" "NO" ...
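If you have several such links, the substitution can be scripted. This is only a sketch: `as_direct_download` is a hypothetical helper name, not part of any package, and it assumes the link carries a `dl=0` query parameter as in the question.

```r
# Hypothetical helper: switch the dl query parameter from 0 to 1 so that
# Dropbox serves the raw file instead of its preview page.
as_direct_download <- function(url) {
  sub("([?&])dl=0", "\\1dl=1", url)
}

url <- "https://www.dropbox.com/s/vta51y5wyzu86m1/FY_2008.csv?dl=0"
direct_url <- as_direct_download(url)
direct_url
```

The result can then be passed to read.csv() as in the answer above.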

Removing all "$" from an entire data frame

I have a df with several columns that have dollar values preceded by the "$" like so:
> str(data)
Classes ‘data.table’ and 'data.frame': 196879 obs. of 32 variables:
$ City : chr "" "" "" "" ...
$ Company_Goal : chr "" "" "" "" ...
$ Company_Name : chr "" "" "" "" ...
$ Event_Date : chr "5/14/2016" "9/26/2015" "9/12/2015" "6/3/2017" ...
$ Event_Year : chr "FY 2016" "FY 2016" "FY 2016" "FY 2017" ...
$ Fundraising_Goal : chr "$250" "$200" "$350" "$0" ...
$ Name : chr "Heart Walk 2015-2016 St. Louis MO" "Heart Walk 2015-2016 Canton, OH" "Heart Walk 2015-2016 Dallas, TX" "FDA HW 2016-2017 Albany, NY WO-65355" ...
$ Participant_Id : chr "2323216" "2273391" "2419569" "4088558" ...
$ State : chr "" "OH" "TX" "" ...
$ Street : chr "" "" "" "" ...
$ Team_Average : chr "$176" "$123" "$306" "$47" ...
$ Team_Captain : chr "No" "No" "Yes" "No" ...
$ Team_Count : chr "7" "6" "4" "46" ...
$ Team_Id : chr "152788" "127127" "45273" "179207" ...
$ Team_Member_Goal : chr "$0" "$0" "$0" "$0" ...
$ Team_Name : chr "Team Clayton" "Cardiac Crusaders" "BIS - Team Myers" "Independent Walkers" ...
$ Team_Total_Gifts : chr "$1,230 " "$738" "$1,225 " "$2,145 " ...
$ Zip : chr "" "" "" "" ...
$ Gifts_Count : chr "2" "1" "2" "1" ...
$ Registration_Gift: chr "No" "No" "No" "No" ...
$ Participant_Gifts: chr "$236" "$218" "$225" "$0" ...
$ Personal_Gift : chr "$0" "$0" "$0" "$250" ...
$ Total_Gifts : chr "$236" "$218" "$225" "$250" ...
$ MATCH_CODE : chr "UX000" "UX000" "UX000" "UX000" ...
$ TAP_LEVEL : chr "X" "X" "X" "X" ...
$ TAP_DESC : chr "" "" "" "" ...
$ TAP_LIFED : chr "" "" "" "" ...
$ MEDAGE_CY : chr "0" "0" "0" "0" ...
$ DIVINDX_CY : chr "0" "0" "0" "0" ...
$ MEDHINC_CY : chr "0" "0" "0" "0" ...
$ MEDDI_CY : chr "0" "0" "0" "0" ...
$ MEDNW_CY : chr "0" "0" "0" "0" ...
- attr(*, ".internal.selfref")=<externalptr>
I am trying to remove all of the "$" characters, but I have been unable to do so. I have tried the suggestions provided in this post as well as this one, but in both cases the data remains unchanged.
Help?
The dollar sign is a reserved character in regular expressions: it anchors the end of the string (see here for more info). The gsub() function treats the pattern as a regex by default, so a bare "$" never matches the literal character.
You have to escape the dollar sign using backslashes (\\$) to match a literal $.
#sample data
df = data.frame(Team_Average = c("$176", "$123", "$306"),
Name = c("Heart Walk 2015-2016 St. Louis MO",
"Heart Walk 2015-2016 Canton, OH",
"Heart Walk 2015-2016 Dallas, TX"),
stringsAsFactors = FALSE)
df[] = lapply(df, gsub, pattern="\\$", replacement="")
Alternatively you can use gsub's option of fixed=TRUE to match the pattern literally.
df[] = lapply(df, gsub, pattern="$", replacement="", fixed=TRUE)
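The difference between the three patterns is easy to check on a short vector. This is a minimal sketch; the vector `x` is made up from values shown in the question.

```r
x <- c("$176", "$1,230 ")

# Unescaped "$" is the end-of-string anchor, so nothing visible is removed.
unescaped <- gsub("$", "", x)

# Escaped "\\$" matches the literal dollar sign.
escaped <- gsub("\\$", "", x)

# fixed = TRUE turns off regex interpretation altogether.
literal <- gsub("$", "", x, fixed = TRUE)

print(unescaped)
print(escaped)
print(literal)
```

The first call returns its input unchanged, which would explain why the data appeared not to change.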
The other answers work nicely on the example provided. However, if the data set contained any numeric columns, then running gsub() or stringr::str_replace_all() via lapply() would coerce the numeric columns to character:
library(stringr)
library(dplyr)
d <- tibble(
x = c("$200", "$191.40", "80.12"),
y = c("$test", "column", "$foo"),
z = 1:3
)
d[] <- lapply(d, gsub, pattern = "\\$", replacement = "")
# A tibble: 3 x 3
x y z
<chr> <chr> <chr>
1 200 test 1
2 191.40 column 2
3 80.12 foo 3
Note the class of z above.
Here is a tidyverse approach to removing $ from all character columns:
d %>%
  mutate_if(
    is.character,
    ~ str_replace_all(.x, "\\$", "") # funs() is deprecated; a formula works instead
  )
# A tibble: 3 x 3
x y z
<chr> <chr> <int>
1 200 test 1
2 191.40 column 2
3 80.12 foo 3
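The same idea works in base R without extra packages: select the character columns first and apply gsub() only to those. This is a sketch on made-up data; `is_chr` is just an illustrative variable name.

```r
d <- data.frame(
  x = c("$200", "$191.40", "80.12"),
  z = 1:3,
  stringsAsFactors = FALSE
)

# Logical vector flagging the character columns.
is_chr <- vapply(d, is.character, logical(1))

# Strip the literal "$" from those columns only; z is left untouched.
d[is_chr] <- lapply(d[is_chr], gsub, pattern = "$", replacement = "", fixed = TRUE)

str(d)
```

Because z is never passed through gsub(), it keeps its integer class.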

Why does write.csv() convert numeric columns to strings?

My data frame contains five character columns and one numeric column. When I export the data frame all columns convert to strings including the numeric column. How do I avoid this?
The structure of my data frame:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3194 obs. of 6 variables:
$ State_FIPS_Code : chr "00" "01" "01" "01" ...
$ County_FIPS_Code : chr "000" "000" "001" "003" ...
$ Postal_Code : chr "US" "AL" "AL" "AL" ...
$ Name : chr "United States" "Alabama" "Autauga County" "Baldwin County" ...
$ Poverty_Percent_All_Ages: num 14.7 18.5 12.7 12.9 32 22.2 14.7 39.6 25.8 20 ...
$ geoid : chr "00000" "01000" "01001" "01003" ...
Export code:
write.csv(dfGeoid, file = "MyData.csv", row.names = TRUE)

Make calculations into a variable within a list

I would like to make some specific calculation within a large dataset.
This is my MWE using an API call (takes 3-4 sec ONLY to Download)
devtools::install_github('mingjerli/IMFData')
library(IMFData)
fdi_asst <- c("BFDA_BP6_USD","BFDAD_BP6_USD","BFDAE_BP6_USD")
databaseID <- "BOP"
startdate <- "1980-01-01"
enddate <- "2016-12-31"
checkquery <- FALSE
FDI_ASSETS <- as.data.frame(CompactDataMethod(databaseID, list(CL_FREA = "Q", CL_AREA_BOP = "", CL_INDICATOR_BOP= fdi_asst), startdate, enddate, checkquery))
my dataframe 'FDI_ASSETS' looks like this (I provide a picture instead of head() for convenience)
the last column is a list and contains three more variables:
head(FDI_ASSETS$Obs)
[[1]]
#TIME_PERIOD #OBS_VALUE #OBS_STATUS
1 1980-Q1 30.0318922812441 <NA>
2 1980-Q2 23.8926174547104 <NA>
3 1980-Q3 26.599634375058 <NA>
4 1980-Q4 32.7522451203517 <NA>
5 1981-Q1 44.124979234001 <NA>
6 1981-Q2 35.9907120805994 <NA>
MY SCOPE
I want to do the following:
if/when the "#UNIT_MULT == 6" then divide the "#OBS_VALUE" in FDI_ASSETS$Obs by 1000
if/when the "#UNIT_MULT == 3" then divide the "#OBS_VALUE" in FDI_ASSETS$Obs by 1000000
UPDATE
Structure of FDI_ASSETS looks like this:
str(FDI_ASSETS)
'data.frame': 375 obs. of 6 variables:
$ #FREQ : chr "Q" "Q" "Q" "Q" ...
$ #REF_AREA : chr "FI" "MX" "MX" "TO" ...
$ #INDICATOR : chr "BFDAE_BP6_USD" "BFDAD_BP6_USD" "BFDAE_BP6_USD" "BFDAD_BP6_USD" ...
$ #UNIT_MULT : chr "6" "6" "6" "3" ...
$ #TIME_FORMAT: chr "P3M" "P3M" "P3M" "P3M" ...
$ Obs :List of 375
..$ :'data.frame': 147 obs. of 3 variables:
.. ..$ #TIME_PERIOD: chr "1980-Q1" "1980-Q2" "1980-Q3" "1980-Q4" ...
.. ..$ #OBS_VALUE : chr "30.0318922812441" "23.8926174547104" "26.599634375058" "32.7522451203517" ...
.. ..$ #OBS_STATUS : chr NA NA NA NA ...
..$ :'data.frame': 60 obs. of 2 variables:
.. ..$ #TIME_PERIOD: chr "2001-Q1" "2001-Q3" "2002-Q1" "2002-Q2" ...
.. ..$ #OBS_VALUE : chr "9.99999999748979E-05" "9.99999997475243E-05" "9.8999999998739E-05" "-9.90000000342661E-05" ...
..$ :'data.frame': 63 obs. of 2 variables:
.. ..$ #TIME_PERIOD: chr "2001-Q1" "2001-Q2" "2001-Q3" "2001-Q4" ...
.. ..$ #OBS_VALUE : chr "130.0149" "189.627" "3453.8319" "630.483" ...
..$ :'data.frame': 17 obs. of 2 variables:
I downloaded your data and it is quite complicated. I have removed my wrong answer so that you can get it answered by @akrun or someone similar :) I don't have the time to parse through it right now.
I found the following solution
list_assets <- list(FDI_ASSETS = FDI_ASSETS, Portfolio_ASSETS = Portfolio_ASSETS, other_invest_ASSETS = other_invest_ASSETS, fin_der_ASSETS = fin_der_ASSETS, Reserves = Reserves)
# Loop over the names so that each modified data frame can be stored back;
# `for (df in list_assets)` would only change a temporary copy.
for (nm in names(list_assets)) {
  df <- list_assets[[nm]]
  for (i in seq_along(df$"#UNIT_MULT")) {
    if (df$"#UNIT_MULT"[i] == "6") {
      df$Obs[[i]]$"#OBS_VALUE" <- as.numeric(df$Obs[[i]]$"#OBS_VALUE")
      df$Obs[[i]]$"#OBS_VALUE" <- df$Obs[[i]]$"#OBS_VALUE" / 1000
    } else if (df$"#UNIT_MULT"[i] == "3") {
      df$Obs[[i]]$"#OBS_VALUE" <- as.numeric(df$Obs[[i]]$"#OBS_VALUE")
      df$Obs[[i]]$"#OBS_VALUE" <- df$Obs[[i]]$"#OBS_VALUE" / 1000000
    }
  }
  list_assets[[nm]] <- df # write the result back into the list
}
Please let me know how I can modify the code in order to make it more efficient and avoid these loops.
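One way to drop the explicit loops is to pair each element of Obs with its unit code using Map(). The following is only a sketch under assumptions: `rescale_obs`, `divisors` and the `toy` data are hypothetical names and made-up values, and the column names are copied as they appear in the str() output above.

```r
# Divisor for each unit code, following the rule stated in the question.
divisors <- c("6" = 1e3, "3" = 1e6)

# Hypothetical helper: rescale one Obs data frame according to its unit code.
# Series with any other code are returned unchanged.
rescale_obs <- function(obs, unit_mult) {
  divisor <- unname(divisors[unit_mult])
  if (is.na(divisor)) return(obs)
  obs[["#OBS_VALUE"]] <- as.numeric(obs[["#OBS_VALUE"]]) / divisor
  obs
}

# Made-up stand-in for FDI_ASSETS with the same shape: one unit code per row
# and a list column of data frames.
toy <- data.frame("#UNIT_MULT" = c("6", "3", "0"),
                  check.names = FALSE, stringsAsFactors = FALSE)
toy$Obs <- list(
  data.frame("#OBS_VALUE" = c("1000", "2000"), check.names = FALSE, stringsAsFactors = FALSE),
  data.frame("#OBS_VALUE" = "5000000", check.names = FALSE, stringsAsFactors = FALSE),
  data.frame("#OBS_VALUE" = "42", check.names = FALSE, stringsAsFactors = FALSE)
)

# Apply the helper to each (Obs, unit code) pair.
toy$Obs <- Map(rescale_obs, toy$Obs, toy[["#UNIT_MULT"]])
str(toy$Obs)
```

To cover several data frames at once, the Map() step could be wrapped in a function and applied with lapply() over a named list, assigning the result back to that list.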

Transform to numeric a column with "NULL" values

I've imported a dataset into R in which a column that should contain numeric values also contains the string "NULL". This makes R set the column class to character or factor, depending on whether you use the stringsAsFactors argument.
To give you an idea, this is the structure of the dataset:
> str(data)
'data.frame': 1016 obs. of 10 variables:
$ Date : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ Name : chr "Chi" "Chi" "Chi" "Chi" ...
$ Impressions: chr "229097" "3323" "70171" "1359" ...
$ Revenue : num 533.78 11.62 346.16 3.36 1282.28 ...
$ Clicks : num 472 13 369 1 963 161 1 7 317 21 ...
$ CTR : chr "0.21" "0.39" "0.53" "0.07" ...
$ PCC : chr "32" "2" "18" "0" ...
$ PCOV : chr "3470.52" "94.97" "2176.95" "0" ...
$ PCROI : chr "6.5" "8.17" "6.29" "NULL" ...
$ Dimension : Factor w/ 11 levels "100x72","1200x627",..: 1 3 4 5 7 8 9 10 11 1 ...
I would like to convert the PCROI column to numeric, but the NULLs make this harder.
I've tried to get around the issue by setting the value to 0 for all observations where the current value is NULL, but I got the following error message:
> data$PCROI[which(data$PCROI == "NULL"), ] <- 0
Error in data$PCROI[which(data$PCROI == "NULL"), ] <- 0 :
incorrect number of subscripts on matrix
My idea was to change all the NULL observations to 0 and afterwards convert the whole column to numeric using the as.numeric function.
You have an indexing error: data$PCROI is a vector, so it takes a single subscript and the trailing comma must go:
data$PCROI[which(data$PCROI == "NULL"), ] <- 0 # will not work
data$PCROI[which(data$PCROI == "NULL")] <- 0 # will work
By the way, you can also just write:
data$PCROI = as.numeric(data$PCROI)
Since PCROI is a character column, this converts every "NULL" to NA automatically (with a coercion warning).
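A short sketch of both routes on made-up values (the vector `pcroi` and the inline CSV text are assumptions for illustration): marking "NULL" as missing before converting avoids the coercion warning, and the na.strings argument of read.csv() does the same at import time.

```r
pcroi <- c("6.5", "8.17", "6.29", "NULL")

# Mark "NULL" as missing explicitly, then convert; no coercion warning is raised.
pcroi[pcroi == "NULL"] <- NA
pcroi_num <- as.numeric(pcroi)

# At import time, na.strings treats "NULL" as missing in every column,
# so the column is read as numeric directly.
csv_text <- "PCROI,Clicks\n6.5,472\nNULL,13\n"
imported <- read.csv(text = csv_text, na.strings = "NULL")

str(pcroi_num)
str(imported)
```

If zeros are really wanted instead of NA, they can be filled in afterwards, though NA keeps missing values distinguishable from true zeros.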
