Number of columns not matching - r

I'm trying to use the rbind function to create some data for a matching process, but I'm getting this error:
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
I've checked and adjusted the order so they definitely match, but still getting the error. Any idea why?
This is my code:
# Match on the data from the year before treatment, matching on counttotal and countbrown
matchData <-
rbind(treat_firms_1year_prior[, -c(
grep("year_int_tx", colnames(treat_firms_1year_prior)),
grep("matchingyear", colnames(treat_firms_1year_prior)),
grep("flag", colnames(treat_firms_1year_prior))
)],
control_firms_year_int_tx)
These are the column names:
> colnames(treat_firms_1year_prior)
[1] "investor" "dealyear" "totalUSD" "counttotal" "greenUSD" "countgreen"
[7] "brownUSD" "countbrown" "signatory" "treatment" "firsttreat" "matchingyear"
[13] "country" "region" "yearest" "strategy" "capsources" "historicfunds"
[19] "eligible_treat_firm" "year_int_tx" "flag"
> colnames(control_firms_year_int_tx)
[1] "investor" "dealyear" "totalUSD" "counttotal" "greenUSD" "countgreen"
[7] "brownUSD" "countbrown" "signatory" "treatment" "firsttreat" "matchingyear"
[13] "country" "region" "yearest" "strategy" "capsources" "historicfunds"
[19] "eligible_treat_firm" "year_int_tx" "flag"

rbind requires the 2 dataframes to have the same number of columns.
Yours do not have the same columns as treat_firms_1year_prior is dropping 3 columns ("year_int_tx", "matchingyear", flag") and control_firms_year_int_tx is not.
You'll need to also drop them in control_firms_year_int_tx, or keep them in treat_firms_1year_prior.
matchData <-
rbind(treat_firms_1year_prior[, -c(
grep("year_int_tx", colnames(treat_firms_1year_prior)),
grep("matchingyear", colnames(treat_firms_1year_prior)),
grep("flag", colnames(treat_firms_1year_prior))
)],
control_firms_year_int_tx[, -c(
grep("year_int_tx", colnames(treat_firms_1year_prior)),
grep("matchingyear", colnames(treat_firms_1year_prior)),
grep("flag", colnames(treat_firms_1year_prior))
)])
Or
excludeColumns <- c("year_int_tx", "matchingyear", "flag")
matchData <-
rbind(
treat_firms_1year_prior[ , !(names(treat_firms_1year_prior) %in% excludeColumns)],
control_firms_year_int_tx[ , !(names(control_firms_year_int_tx) %in% excludeColumns)]
)
Or
matchData <-
rbind(treat_firms_1year_prior, control_firms_year_int_tx)

Related

rbind error while performing a for-loop: duplicate 'row.names' are not allowed

The enclosed code is an attempt to extract data from an api, but when I try to paginate and bind the rows, the row index duplicates posing the below error:
**Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed**
**In addition: Warning message: non-unique values when setting 'row.names':**
The code is:
df = tibble()
for (i in seq(from = 0, to = 620, by = 24)) {
linky = paste0("https://www.rightmove.co.uk/api/_search?locationIdentifier=REGION%5E94405&numberOfPropertiesPerPage=24&radius=0.0&sortType=2&index=",i,"&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false")
pge <- jsonlite::fromJSON(linky)
props <- pge$properties
print(linky)
Sys.sleep(runif(1, 2.34, 6.19))
df = rbind(df, tibble(props))
print(paste("Page:", i))
}
HA_area_ <- df
As the error indicates due to different column names the dataframes can't be bound together. Below are the column names for first two dataframes.
[[1]]
[1] "id" "bedrooms" "bathrooms" "numberOfImages"
[5] "numberOfFloorplans" "numberOfVirtualTours" "summary" "displayAddress"
[9] "countryCode" "location" "propertyImages" "propertySubType"
[13] "listingUpdate" "premiumListing" "featuredProperty" "price"
[17] "customer" "distance" "transactionType" "productLabel"
[21] "commercial" "development" "residential" "students"
[25] "auction" "feesApply" "feesApplyText" "displaySize"
[29] "showOnMap" "propertyUrl" "contactUrl" "staticMapUrl"
[33] "channel" "firstVisibleDate" "keywords" "keywordMatchType"
[37] "saved" "hidden" "onlineViewingsAvailable" "lozengeModel"
[41] "hasBrandPlus" "propertyTypeFullDescription" "addedOrReduced" "formattedDistance"
[45] "heading" "enhancedListing" "displayStatus" "formattedBranchName"
[49] "isRecent"
[[2]]
[1] "id" "bedrooms" "bathrooms" "numberOfImages"
[5] "numberOfFloorplans" "numberOfVirtualTours" "summary" "displayAddress"
[9] "countryCode" "location" "propertyImages" "propertySubType"
[13] "listingUpdate" "premiumListing" "featuredProperty" "price"
[17] "customer" "distance" "transactionType" "productLabel"
[21] "commercial" "development" "residential" "students"
[25] "auction" "feesApply" "feesApplyText" "displaySize"
[29] "showOnMap" "propertyUrl" "contactUrl" "staticMapUrl"
[33] "channel" "firstVisibleDate" "keywords" "keywordMatchType"
[37] "saved" "hidden" "onlineViewingsAvailable" "lozengeModel"
[41] "hasBrandPlus" "displayStatus" "formattedBranchName" "addedOrReduced"
[45] "isRecent" "formattedDistance" "propertyTypeFullDescription" "enhancedListing"
[49] "heading"
You can see different names of column at certain positions.
Instead of rbind we can use lapply and store results in a list.
Wee shall create function f1 to get the dataframe required and then use possibly to skip any errors.
f1 = function(x){
linky = paste0("https://www.rightmove.co.uk/api/_search?locationIdentifier=REGION%5E94405&numberOfPropertiesPerPage=24&radius=0.0&sortType=2&index=",x,"&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false")
pge <- jsonlite::fromJSON(linky)
props <- pge$properties
print(linky)
Sys.sleep(runif(1, 2.34, 6.19))
print(paste("Page:", x))
return(props)
}
x = seq(from = 0, to = 620, by = 24)
df = lapply(x, possibly(f1, NA))
library(data.table)
dt <- lapply(seq(from = 0, to = 620, by = 24), function(i) {
uri <- paste0("https://www.rightmove.co.uk/api/_search?locationIdentifier=REGION%5E94405&numberOfPropertiesPerPage=24&radius=0.0&sortType=2&index=", i,"&includeSSTC=false&viewType=LIST&channel=BUY&areaSizeUnit=sqft&currencyCode=GBP&isFetching=false")
as.data.table(jsonlite::fromJSON(uri)$properties)
})
dt <- rbindlist(dt, fill = T)
Strangely I changed from rbind() to bind_rows() and for some reason it worked. Although there was the added complication of unnesting some columns. It would not allow me to save the data as a CSV without unnesting the nested columns. Thank you for the answers too

How to create and save subset dataframes for sequence of year-month

I would like to filter from a dataframe observations for a given year-month and then save it as a separate dataframe and name it with the respective year-month.
I would be grateful if someone could suggest a more efficient code than the one below. Also, this code is not filtering correctely the observations.
data <- data.frame(year = c(rep(2012,12),rep(2013,12),rep(2014,12),rep(2015,12),rep(2016,12)),
month = rep(1:12,5),
info = seq(60)*100)
years <- 2012:2016
months <- 1:12
for(year in years){
for(month in months){
data_sel <- data %>%
filter(year==year & month==month)
if(month<10){
month_alt <- paste0("0",month) # months 1-9 should show up as 01-09
}
Newname <- paste0(year,month_alt,'_','data_sel')
assign(Newname, data_sel)
}
}
The output I am looking to get is below (separate objects containing data from a given year-month):
> ls()
[1] "201201_data_sel" "201202_data_sel" "201203_data_sel" "201204_data_sel"
[5] "201205_data_sel" "201206_data_sel" "201207_data_sel" "201208_data_sel"
[9] "201209_data_sel" "201301_data_sel" "201302_data_sel" "201303_data_sel"
[13] "201304_data_sel" "201305_data_sel" "201306_data_sel" "201307_data_sel"
[17] "201308_data_sel" "201309_data_sel" "201401_data_sel" "201402_data_sel"
[21] "201403_data_sel" "201404_data_sel" "201405_data_sel" "201406_data_sel"
[25] "201407_data_sel" "201408_data_sel" "201409_data_sel" "201501_data_sel"
[29] "201502_data_sel" "201503_data_sel" "201504_data_sel" "201505_data_sel"
[33] "201506_data_sel" "201507_data_sel" "201508_data_sel" "201509_data_sel"
[37] "201601_data_sel" "201602_data_sel" "201603_data_sel" "201604_data_sel"
[41] "201605_data_sel" "201606_data_sel" "201607_data_sel" "201608_data_sel"
[45] "201609_data_sel" "data" "data_sel" "month"
[49] "month_alt" "months" "Newname" "year"
[53] "years"
You could do:
library(dplyr)
g <- data %>%
mutate(month = sprintf("%02d", month)) %>%
group_by(year, month)
setNames(group_split(g), with(group_keys(g), paste0("data_sel_", year, month))) %>%
list2env(envir = .GlobalEnv)
Starting an object name with a digit is not allowed in R, so in paste0 "data_sel_" is first.
As mentioned in the comments it might be better to not pipe to list2env and store the output as a list with named elements.

Summarize columns where names have a specific pattern in data.table

I have a very large data.table, which I want to summarise columns by group, where the column names starts with a certain pattern.
The columns I am interested in always have the same format, namely: f<X>_<Y>, m<X>_<Y>, f<X>, m<X>.
This is the list of all possible column names:
ageColsPossible <- c("m0_9", "m10_19", "m20_29", "m30_39", "m40_49", "m50_59", "m60_69",
"f0_9", "f10_19", "f20_29", "f30_39", "f40_49", "f50_59", "f60_69")
if there is not enough data available, my data.table will only have some of these columns. I would like to get a vector with the column names that are available in the data:
> names(myData)
[1] "clientID" "policyID" "startYear" "product" "NOplans" "grp"
[7] "policyid" "personid" "age" "gender" "dependant" "location"
[13] "region" "exposure" "startMonth" "cover_effective_date" "endexposuredate" "fromdate"
[19] "enddate" "planHistSufficiency" "productRank" "claim10month" "claim11month" "claim12month"
[25] "claim9month" "NA20_29" "NA30_39" "NA40_49" "NA50_59" "f0_9"
[31] "f10_19" "f20_29" "f30_39" "f40_49" "f50_59" "f60_69"
[37] "m0_9" "m10_19" "m20_29" "m30_39" "m40_49" "m50_59"
[43] "m60_69" "u0_9" "u10_19" "u20_29" "u30_39" "u40_49"
[49] "u50_59" "u60_69" "uNA"
I know of regrex and was thinking something along the line: regex = "(m|f)(\\d+)_?(\\d+)?", but i have also seen patern() function somewhere. Unfortunately i can no longer find it.
any ideas?
something like this will most likely do the trick.. assuming you only need one summary-function? (median() in this example)...
DT[, lapply( .SD, median), by=.(group), .SDcols = patterns( "^[mf]\\d+" ) ]

Change order of multiple optional substrings

That's a bit like this question, but I have multiple substrings that may or may not occur.
The substrings code for two different dimensions, in my example "test" and "eye". They can occur in any imaginable order.
The variables can be coded in different ways - in my example, "method|test" would be two ways to code for "test", as well as "r|re|l|le" different ways to code for eyes.
I found a convoluted solution, which is using a chain of seven (!) gsub calls, and I wondered if there is a more concise way.
x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old",
"test_l_old", "re_test_new","new_le_method", "new_r_test")
x
#> [1] "id" "r_test" "l_method" "test_re"
#> [5] "method_le" "test_r_old" "test_l_old" "re_test_new"
#> [9] "new_le_method" "new_r_test"
Desired output
#> [1] "id" "r_test" "l_test" "r_test" "l_test"
#> [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
How I got there (convoluted)
## Unify codes for variables, I use the underscores to make it more unique for future regex
clean_test<- gsub("(?<![a-z])(test|method)(?![a-z])", "_test_", tolower(x), perl = TRUE)
clean_r <- gsub("(?<![a-z])(r|re)(?![a-z])", "_r_", tolower(clean_test), perl = TRUE)
clean_l <- gsub("(?<![a-z])(l|le)(?![a-z])", "_l_", tolower(clean_r), perl = TRUE)
## Now sort, one after the other
sort_eye <- gsub("(.*)(_r_|_l_)(.*)", "\\2\\1\\3", clean_l, perl = TRUE)
sort_test <- gsub("(_r_|_l_)(.*)(_test_)(.*)", "\\1\\3\\2\\4", sort_eye, perl = TRUE)
## Remove underscores
clean_underscore_mult <- gsub("_{2,}", "_", sort_test)
clean_underscore_ends <- gsub("^_|_$", "", clean_underscore_mult)
clean_underscore_ends
#> [1] "id" "r_test" "l_test" "r_test" "l_test"
#> [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
I'd be already very very grateful for a suggestion how to better proceed from ## Now sort, one after the other downwards...
How about tokenizing the string and using lookup tables instead? I'll use data.table to assist but the idea fits naturally with other data grammars as well
library(data.table)
# build into a table, keeping track of an ID
# to say which element it came from originally
l = strsplit(x, '_', fixed=TRUE)
DT = data.table(id = rep(seq_along(l), lengths(l)), token = unlist(l))
Now build a lookup table:
# defined using fread to make it easier to see
# token & match side-by-side; only define tokens
# that actually need to be changed here
lookups = fread('
token,match
le,l
re,r
method,test
')
Now combine:
# default value is the token itself
DT[ , match := token]
# replace anything matched
DT[lookups, match := i.match, on = 'token']
Next use factor ordering to get the tokens in the right order:
# the more general [where you don't have an exact list of all the possible
# tokens ready at hand] is a bit messier -- you might do something
# similar to setdiff(unique(match), lookups$match)
DT[ , match := factor(match, levels = c('id', 'r', 'l', 'test', 'old', 'new'))]
# sort to this new order
setorder(DT, id, match)
Finally combine again (an aggregation) to get the output:
DT[ , paste(match, collapse='_'), by = id]$V1
# [1] "id" "r_test" "l_test" "r_test" "l_test"
# [6] "r_test_old" "l_test_old" "r_test_new" "l_test_new" "r_test_new"
Here's a one-liner with nested sub that transforms x without any intermediary steps:
sub("^(\\w+)_(r|re|l|le)", "\\2_\\1",
sub("method", "test",
sub("(l|r)e", "\\1",
sub("(^new)_(\\w+_\\w+)$", "\\2_\\1", x))))
# [1] "id" "r_test" "l_test" "r_test" "l_test" "r_test_old"
# [7] "l_test_old" "r_test_new" "l_test_new" "r_test_new"
Data:
x <- c("id", "r_test", "l_method", "test_re", "method_le", "test_r_old",
"test_l_old", "re_test_new","new_le_method", "new_r_test")
Much inspired and building on user MichaelChirico's answer, this is a function using base R only, which (in theory) should work with any number of substrings to sort. The list defines the sort (by its elements), and it specifies all ways to code for the default tokens (the list names).
## I've added some more ways to code for right and left eyes, as well as different further strings that are not known.
x <- c("id", "r_random_test_old", "r_test", "r_test_else", "l_method", "test_re", "method_le", "test_od_old",
"test_os_old", "re_mth_new","new_le_method", "new_r_test_random")
x
#> [1] "id" "r_random_test_old" "r_test"
#> [4] "r_test_else" "l_method" "test_re"
#> [7] "method_le" "test_od_old" "test_os_old"
#> [10] "re_mth_new" "new_le_method" "new_r_test_random"
sort_substr(x, list(r = c("od","re"), l = c("os","le"), test = c("method", "mth"), time = c("old","new")))
#> [1] "id" "r_test_time_random" "r_test"
#> [4] "r_test_else" "l_test" "r_test"
#> [7] "l_test" "r_test_time" "l_test_time"
#> [10] "r_test_time" "l_test_time" "r_test_time_random"
sort_substr
sort_substr <- function(x, list_substr) {
lookups <- data.frame(match = rep(names(list_substr), lengths(list_substr)),
token = unlist(list_substr))
l <- strsplit(x, "_", fixed = TRUE)
DF <- data.frame(id = rep(seq_along(l), lengths(l)), token = unlist(l))
match_token <- lookups$match[match(DF$token, lookups$token)]
DF$match <- ifelse(is.na(match_token), DF$token, match_token)
rest_token <- base::setdiff(DF$match, names(list_substr))
DF$match <- factor(DF$match, levels = c(names(list_substr), rest_token))
DF <- DF[with(DF, order(id, match)), ]
out <- vapply(split(DF$match, DF$id),
paste, collapse = "_",
FUN.VALUE = character(1),
USE.NAMES = FALSE)
out
}

Error with R dplyr left_join

So I've been trying to use left_join to get the columns of a new dataset onto my main dataset (called employee)
I've double checked the vector names and the cleaning that I've don't and nothing seems to work. Here is my code. Would appreciate any help.
job_codes <- read_csv("Quest_UMMS_JobCodes.csv")
job_codes <- job_codes %>%
clean_names() %>%
select(job_code, pos_desc = pos_des_desc)
job_codes$is_nurse <- str_detect(tolower(job_codes$pos_desc), "nurse")
employee <- employee %>%
left_join(job_codes, by = "job_code")
The error I keep getting:Error in eval(substitute(expr), envir, enclos) :
'job_code' column not found in rhs, cannot join
here are the results of
names(job_code)
> names(job_codes)
[1] "job_code" "pos_desc" "is_nurse"
names(employee)
> names(employee)
[1] "REC_NUM" "ZIP" "STATE"
[4] "SEX" "EEO_CLASS" "BIRTH_YEAR"
[7] "EMP_STATUS" "PROCESS_LEVEL" "DEPARTMENT"
[10] "JOB_CODE" "UNION_CODE" "SUPERVISOR"
[13] "DATE_HIRED" "R_SHIFT" "SALARY_CLASS"
[16] "EXEMPT_EMP" "PAY_RATE" "ADJ_HIRE_DATE"
[19] "ANNIVERS_DATE" "TERM_DATE" "NBR_FTE"
[22] "PENSION_PLAN" "PAY_GRADE" "SCHEDULE"
[25] "OT_PLAN_CODE" "DECEASED" "POSITION"
[28] "WORK_SCHED" "SUPERVISOR_IND" "FTE_TOTAL"
[31] "PRO_RATE_TOTAL" "PRO_RATE_A_SAL" "NEW_HIRE_DATE"
[34] "COUNTY" "FST_DAY_WORKED" "date_hired"
[37] "date_hired_adj" "term_date" "employment_duration"
[40] "current" "age" "emp_duration_years"
[43] "DESCRIPTION.x" "PAY_STATUS.x" "DESCRIPTION.y"
[46] "PAY_STATUS.y"
Now, after the OP has added the column names of both tables in the Q, it is evident that the columns to join on are written in different ways (upper vs lower case).
If the column names are different, help("left_join") suggests:
To join by different variables on x and y use a named vector. For example, by = c("a" = "b") will match x.a to y.b.
So, in this case it should read
employee <- employee %>% left_join(job_codes, by = c("JOB_CODE" = "job_code"))

Resources