left_join says column is not present even though it is present - r

I would like to join two data frames on two differently named key columns. I get an error saying it cannot find the variable in the second data frame, but when I run colnames(), the column name shows up. Why is this the case?
df_new <- left_join(master_settlement_current_month, master_settlement, by = c("D.settlecounty", "NAMECOUNTY"))
Error: Join columns must be present in data.
x Problem with `NAMECOUNTY`.
Run `rlang::last_error()` to see where the error occurred.
colnames(master_settlement_current_month)[1:5]
[1] "month" "D.info_state" "D.info_county" "D.info_settlement" "D.settlecounty"
colnames(master_settlement)
[1] "NAME" "NAMEJOIN" "NAMECOUNTY" "COUNTYJOIN" "DATE" "DATA_SOURC" "IMG_VERIFD"
[8] "X" "Y" "kobo_label" "X.3" "X.2" "X.1" "INDEX"
[15] "P_CODE" "aok_sett_id" "name_county_low" "ALT_NAME1" "ALT_NAME2" "ALT_NAME3" "ALT_NAME4"
[22] "FUNC_CLASS" "CONF_SCORE" "SRC_VERIFD" "num_dup" "check_coord_v38"

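The syntax of your by = argument is slightly off. An unnamed vector, c("D.settlecounty", "NAMECOUNTY"), asks dplyr to find both columns in both data frames. To match a column in the first data frame to a differently named column in the second, use a named vector: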
library(dplyr)
df_new <- left_join(master_settlement_current_month, master_settlement, by = c("D.settlecounty" = "NAMECOUNTY"))
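For readers without dplyr at hand, the same idea (key columns with different names in each table) can be sketched with base R's merge(). This is a swapped-in technique, not the dplyr call from the answer. The two small data frames and their values are made up for illustration; only the column names come from the question.

```r
# Hypothetical miniature versions of the two data frames in the question
current <- data.frame(
  month = c("Jan", "Jan", "Feb"),
  D.settlecounty = c("aweil_north", "bor", "juba"),
  stringsAsFactors = FALSE
)
master <- data.frame(
  NAMECOUNTY = c("bor", "juba"),
  P_CODE = c("SS01", "SS02"),
  stringsAsFactors = FALSE
)

# by.x / by.y name the key column in each table; all.x = TRUE keeps every
# row of the first table, which is what a left join does
joined <- merge(current, master,
                by.x = "D.settlecounty", by.y = "NAMECOUNTY",
                all.x = TRUE)
print(joined)
```

Unlike left_join(), merge() sorts the result by the key column by default, so the row order may differ from the input.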

Summarize columns where names have a specific pattern in data.table

I have a very large data.table in which I want to summarise, by group, the columns whose names start with a certain pattern.
The columns I am interested in always have the same format, namely: f<X>_<Y>, m<X>_<Y>, f<X>, m<X>.
This is the list of all possible column names:
ageColsPossible <- c("m0_9", "m10_19", "m20_29", "m30_39", "m40_49", "m50_59", "m60_69",
"f0_9", "f10_19", "f20_29", "f30_39", "f40_49", "f50_59", "f60_69")
If there is not enough data available, my data.table will only have some of these columns. I would like to get a vector with the column names that are available in the data:
> names(myData)
[1] "clientID" "policyID" "startYear" "product" "NOplans" "grp"
[7] "policyid" "personid" "age" "gender" "dependant" "location"
[13] "region" "exposure" "startMonth" "cover_effective_date" "endexposuredate" "fromdate"
[19] "enddate" "planHistSufficiency" "productRank" "claim10month" "claim11month" "claim12month"
[25] "claim9month" "NA20_29" "NA30_39" "NA40_49" "NA50_59" "f0_9"
[31] "f10_19" "f20_29" "f30_39" "f40_49" "f50_59" "f60_69"
[37] "m0_9" "m10_19" "m20_29" "m30_39" "m40_49" "m50_59"
[43] "m60_69" "u0_9" "u10_19" "u20_29" "u30_39" "u40_49"
[49] "u50_59" "u60_69" "uNA"
I know about regex and was thinking of something along the lines of regex = "(m|f)(\\d+)_?(\\d+)?", but I have also seen a pattern() function somewhere. Unfortunately I can no longer find it.
Any ideas?
The function you are thinking of is probably data.table's patterns(), which can be used in .SDcols. Something like this will most likely do the trick, assuming you only need one summary function (median() in this example):
DT[, lapply( .SD, median), by=.(group), .SDcols = patterns( "^[mf]\\d+" ) ]
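The question also asks for a plain vector of the age columns that are actually present. A minimal base R sketch of two ways to get it follows, assuming a shortened, hypothetical set of column names standing in for names(myData); the regular expression is a stricter variant of the one in the answer.

```r
# Hypothetical, shortened stand-in for names(myData)
cols <- c("clientID", "age", "NA20_29", "f0_9", "f10_19",
          "m0_9", "m60_69", "u0_9", "uNA")

# All column names that could exist, as listed in the question
ageColsPossible <- c("m0_9", "m10_19", "m20_29", "m30_39", "m40_49", "m50_59", "m60_69",
                     "f0_9", "f10_19", "f20_29", "f30_39", "f40_49", "f50_59", "f60_69")

# Option 1: regex; m or f, digits, then an optional _digits suffix
available_by_regex <- grep("^[mf][0-9]+(_[0-9]+)?$", cols, value = TRUE)

# Option 2: keep only the known names that are present in the data
available_by_lookup <- intersect(ageColsPossible, cols)

print(available_by_regex)
print(available_by_lookup)
```

Either vector can then be passed to .SDcols directly instead of patterns().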

Why can I not get a vector class?

I have extracted this dataframe:
> df<-as.data.frame(model_rf$variable.importance)
> df
Importance
DayOfWeek 3.763932e+11
Customers 1.364059e+12
Open 6.345289e+11
Promo 2.617495e+11
StateHoliday 5.196666e+09
SchoolHoliday 6.522969e+09
DateYear 7.035399e+09
DateMonth 2.013482e+10
DateDay 3.763177e+10
DateWeek 3.283496e+10
StoreType 3.156843e+10
Assortment 2.025741e+10
CompetitionDistance 1.118476e+11
CompetitionOpenSinceMonth 4.633220e+10
CompetitionOpenSinceYear 4.554890e+10
Promo2 0.000000e+00
Promo2SinceWeek 5.066674e+10
Promo2SinceYear 4.096407e+10
CompetitionOpen 3.992745e+10
PromoOpen 2.831936e+10
IspromoinSales 2.844220e+09
Then I want to extract the values into one vector:
> v<-as.vector(model_rf$variable.importance$Importance)
> v
[1] 3.763932e+11 1.364059e+12 6.345289e+11 2.617495e+11 5.196666e+09 6.522969e+09 7.035399e+09 2.013482e+10 3.763177e+10
[10] 3.283496e+10 3.156843e+10 2.025741e+10 1.118476e+11 4.633220e+10 4.554890e+10 0.000000e+00 5.066674e+10 4.096407e+10
[19] 3.992745e+10 2.831936e+10 2.844220e+09
And the names of each row into another vector:
> w<-(as.vector((row.names(df))))
> w
[1] "DayOfWeek" "Customers" "Open" "Promo"
[5] "StateHoliday" "SchoolHoliday" "DateYear" "DateMonth"
[9] "DateDay" "DateWeek" "StoreType" "Assortment"
[13] "CompetitionDistance" "CompetitionOpenSinceMonth" "CompetitionOpenSinceYear" "Promo2"
[17] "Promo2SinceWeek" "Promo2SinceYear" "CompetitionOpen" "PromoOpen"
[21] "IspromoinSales"
Then I need to build a data frame from the two vectors above:
> DF<-as.data.frame(w,v)
Warning message:
In as.data.frame.vector(x, ..., nm = nm) :
'row.names' is not a character vector of length 21 -- omitting it. Will be an error!
In fact, it seems that w is not converted to the vector class even though I used as.vector. It is still of class character.
> class(w)
[1] "character"
How do you explain this please?
Try this code (note that cbind() first builds a character matrix, so v is stored as text rather than as numbers):
DF<-as.data.frame(cbind(w,v))
If you look at the documentation of as.data.frame, you will see that its second argument is row.names, which is expected to be a character vector of row names.
In your case, you supplied the names first and the values second, so the numeric vector v was taken as the row names, leading to the warning above.
You can either use
as.data.frame(v,w)
or
data.frame(w,v)
to get your desired result.
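To make the difference between these options concrete, here is a small sketch using the first three names and values from the question. The column names name and importance are chosen for illustration only. It also shows that a character vector is already a vector: class() reports "character" because there is no separate "vector" class.

```r
# First three entries from the question's output
w <- c("DayOfWeek", "Customers", "Open")
v <- c(3.763932e+11, 1.364059e+12, 6.345289e+11)

# w is a vector; "character" is simply the class of a character vector
print(is.vector(w))
print(class(w))

# Two columns, with v kept numeric
DF <- data.frame(name = w, importance = v, stringsAsFactors = FALSE)

# One numeric column, with w used as the row names
DF2 <- data.frame(importance = v, row.names = w)

# cbind() builds a character matrix first, so v is no longer numeric here
DF3 <- as.data.frame(cbind(w, v), stringsAsFactors = FALSE)

str(DF)
str(DF3)
```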

Remove specific columns from data frame

I am trying to remove a group of columns from a data frame (followed this) but I get an error in return.
Specifically, size of the data frame (NNF.data) is 34233 rows with 147 columns:
[118] "NNF.2015.03.EUR" "NNF.2015.04.EUR" "NNF.2015.05.EUR"
[121] "NNF.2015.06.EUR" "NNF.2015.07.EUR" "NNF.2015.08.EUR"
[124] "NNF.2015.09.EUR" "NNF.2015.10.EUR" "NNF.2015.11.EUR"
[127] "NNF.2015.12.EUR" "NNF.2016.01.EUR" "NNF.2016.02.EUR"
[130] "NNF.2016.03.EUR" "NNF.2016.04.EUR" "NNF.2016.05.EUR"
[133] "NNF.2016.06.EUR" "NNF.2016.07.EUR" "NNF.2016.08.EUR"
[136] "YTD.NNF.Year2005.EUR" "YTD.NNF.Year2006.EUR" "YTD.NNF.Year2007.EUR"
[139] "YTD.NNF.Year2008.EUR" "YTD.NNF.Year2009.EUR" "YTD.NNF.Year2010.EUR"
[142] "YTD.NNF.Year2011.EUR" "YTD.NNF.Year2012.EUR" "YTD.NNF.Year2013.EUR"
[145] "YTD.NNF.Year2014.EUR" "YTD.NNF.Year2015.EUR" "YTD.NNF.Year2016.EUR"
What I want to do is to remove the columns from 136-147, or the ones that contain YTD in their name.
I tried to use
NNF.data[, grep("YTD", names(NNF.data)):= NULL]
but I get the error:
Error in `[.data.frame`(NNF.data, , `:=`(grep("YTD", names(NNF.data)), :
could not find function ":="
Similarly, I tried
NNF.data[, which(grepl("YTD", colnames(NNF.data))):=NULL]
but again, I get
Error in `[.data.frame`(NNF.data, , `:=`(which(grepl("YTD", colnames(NNF.data))), :
could not find function ":="
Any suggestions please?
I made sure that NNF.data is a data frame
> is.data.frame(NNF.data)
[1] TRUE
:= only works for data.table objects. If you are working with a data.frame you can try this:
df = data.frame(First = c(1,2,3), AVSecond = c(3,4,5), ThirdAV = c(6,7,8), Fourth = c(10,22,2))
df = df[-c(grep("AV", colnames(df)), 4)]
This will remove the columns with 'AV' in it and the Fourth column. Output:
First
1 1
2 2
3 3
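# Toy data frame with one YTD column
df = data.frame(YTD.NNF.Year2009.EUR=c(1,2,3),NNF.2016.06.EUR=c(3,4,5),HJK=c(6,7,8))
nm = colnames(df)
# TRUE for every column whose name contains the word YTD
numb = grepl("\\bYTD\\b", nm)
# grepl() returns a logical vector, so negate it with ! (using - would turn TRUE into -1 and only ever drop column 1)
df = df[, !numb, drop = FALSE]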
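One pitfall with the negative-index approach is worth a short sketch: if grep() finds no match, it returns an empty vector, and indexing with the negation of an empty vector drops every column instead of none. A logical mask from grepl() does not have this problem. The data frame is a made-up four-column stand-in for NNF.data.

```r
# Hypothetical four-column stand-in for NNF.data
df <- data.frame(
  NNF.2016.07.EUR = 1:3,
  NNF.2016.08.EUR = 4:6,
  YTD.NNF.Year2015.EUR = 7:9,
  YTD.NNF.Year2016.EUR = 10:12
)

# Logical mask: TRUE for columns to keep
keep <- !grepl("YTD", names(df), fixed = TRUE)
df_clean <- df[, keep, drop = FALSE]

# Pitfall: no column matches "XYZ", so grep() returns integer(0),
# and df[-integer(0)] selects zero columns
no_match <- df[-grep("XYZ", names(df))]

# The logical mask keeps all columns when nothing matches
safe <- df[, !grepl("XYZ", names(df)), drop = FALSE]

print(names(df_clean))
print(ncol(no_match))
print(ncol(safe))
```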

R - get values from multiple variables in the environment

I have some variables in my current R environment:
ls()
[1] "clt.list" "commands.list" "dirs.list" "eq" "hurs.list" "mlist" "prec.list" "temp.list" "vars"
[10] "vars.list" "wind.list"
where each one of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" is a (huge) list of strings.
For example:
clt.list[1:20]
[1] "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc" "clt_Amon_ACCESS1-3_historical_r1i1p1_185001-200512.nc"
[3] "clt_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc" "clt_Amon_bcc-csm1-1-m_historical_r1i1p1_185001-201212.nc"
[5] "clt_Amon_BNU-ESM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc"
[7] "clt_Amon_CCSM4_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-BGC_historical_r1i1p1_185001-200512.nc"
[9] "clt_Amon_CESM1-CAM5_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-CAM5-1-FV2_historical_r1i1p1_185001-200512.nc"
[11] "clt_Amon_CESM1-FASTCHEM_historical_r1i1p1_185001-200512.nc" "clt_Amon_CESM1-WACCM_historical_r1i1p1_185001-200512.nc"
[13] "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-190412.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_190001-200512.nc"
[15] "clt_Amon_CMCC-CESM_historical_r1i1p1_190501-190912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_191001-191412.nc"
[17] "clt_Amon_CMCC-CESM_historical_r1i1p1_191501-191912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_192001-192412.nc"
[19] "clt_Amon_CMCC-CESM_historical_r1i1p1_192501-192912.nc" "clt_Amon_CMCC-CESM_historical_r1i1p1_193001-193412.nc"
What I need to do is extract the subset of the string that is between "Amon_" and "_historical".
I can do this for a single variable, as shown here:
levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", clt.list[1:20])))
[1] "ACCESS1-0" "ACCESS1-3" "bcc-csm1-1" "bcc-csm1-1-m" "BNU-ESM" "CanESM2" "CCSM4"
[8] "CESM1-BGC" "CESM1-CAM5" "CESM1-CAM5-1-FV2" "CESM1-FASTCHEM" "CESM1-WACCM" "CMCC-CESM"
However, what I'd like to do is run the command above for all five variables at once. Instead of using just "clt.list" as the argument in the command above, I'd like to use all of the variables "clt.list", "hurs.list", "prec.list", "temp.list" and "wind.list" at once.
How can I do that?
Many thanks in advance!
You can put your operation into a function and then iterate over it:
get_my_substr <- function(vecname)
  levels(as.factor(sub(".*?Amon_(.*?)_historical.*", "\\1", get(vecname))))
lapply(my_vecnames,get_my_substr)
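lapply acts like a loop. You can create the vector of variable names with the line below. Note that this pattern also matches commands.list, dirs.list and vars.list from your ls() output, so list the five names explicitly if you only want those.
my_vecnames <- ls(pattern="\\.list$")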
It is generally good practice to post a reproducible example in your question. Since none was provided here, I tested this approach with...
# example-maker
prestr <- "grr_Amon_"
posstr <- "_historical_zzz"
make_ex <- function()
  replicate(
    sample(10,1),
    paste0(prestr,paste0(sample(LETTERS,sample(5,1)),collapse=""),posstr)
  )
# make a couple examples
set.seed(1)
m01 <- make_ex()
m02 <- make_ex()
# test result
lapply(ls(pattern="^m[0-9][0-9]$"),get_my_substr)
One solution would be to create a vector containing the names of the variables that you want to extract the data from, for example:
var.names <- c("clt.list", "commands.list", "dirs.list")
Then to access the value of each variable from the name:
for (var.name in var.names) {
var.value <- as.list(environment())[[var.name]]
# Do something with var.value
}
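As an alternative to looping with get() or as.list(environment()), base R's mget() fetches several variables by name in one call and returns a named list. The sketch below assumes a separate environment env so that it does not depend on what is in the workspace. The hurs file name is hypothetical, since the question only shows clt files.

```r
# Put example variables in their own environment (an assumption made so
# the sketch does not touch the global workspace)
env <- new.env()
env$clt.list <- c(
  "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc",
  "clt_Amon_CanESM2_historical_r1i1p1_185001-200512.nc",
  "clt_Amon_ACCESS1-0_historical_r1i1p1_185001-200512.nc"
)
# Hypothetical file name following the same naming scheme
env$hurs.list <- "hurs_Amon_CCSM4_historical_r1i1p1_185001-200512.nc"

var_names <- c("clt.list", "hurs.list")

# Same regular expression as in the question, reduced to unique model names
extract_model <- function(x) {
  sort(unique(sub(".*?Amon_(.*?)_historical.*", "\\1", x, perl = TRUE)))
}

# mget() returns a list named after the variables; lapply() keeps the names
models <- lapply(mget(var_names, envir = env), extract_model)
print(models)
```

Because the result is named, models$clt.list or models[["hurs.list"]] gives the model names for one variable.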

store summary output in a list of tables or matrix

How can I read the following character vector "c" into a list of tables? Which is the shortest way: read.table or strsplit? For example, I can't see how to read the table in c[4:6] in one command.
require(car)
m<-matrix(rnorm(16),4,4,byrow=T)
a<-Anova(lm(m~1),type=3,idata=data.frame(treatment=factor(1:4)),idesign=~treatment)
c<-capture.output(summary(a,multivariate=F))
c
This returns lines 4:6
c[4:6]
Now if you wanted to parse this I would do it in two steps. First on the column values from rows 5:6 and then add back the names.
> vals <- read.table(text=c[5:6])
> txt <- " \t SS\t num Df\t Error SS\t den Df\t F\t Pr(>F)"
> names(vals) <- names(read.delim(text=txt))
> vals
X SS num.Df Error.SS den.Df F Pr..F.
1 (Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614
2 treatment 1.85936442 3 8.2899759 9 0.67287 0.58996
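The two steps above can also be folded into one call by passing the column names to read.table() directly. This is only a sketch: the two lines are copied from the output shown above instead of being captured from a model, and the column names are assumptions chosen to be valid R names.

```r
# The two data lines as they appear in the output above
lines <- c(
  "(Intercept) 0.57613392 1 0.4219563 3 4.09616 0.13614",
  "treatment 1.85936442 3 8.2899759 9 0.67287 0.58996"
)

# col.names supplies the header in the same call (names are illustrative)
vals <- read.table(
  text = lines,
  col.names = c("term", "SS", "num_Df", "Error_SS", "den_Df", "F", "p_value"),
  stringsAsFactors = FALSE
)

str(vals)
```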
EDIT --
You could look at the source code of the summary function and calculate the required quantities yourself:
getAnywhere(summary.Anova.mlm)
The original idea seems not to work.
c2 <- summary(a)
# find out what 'properties' the summary object has
# turns out, it is just the Anova object
class(c2) <- "list"
names(c2)
This returns
[1] "SSP" "SSPE" "P" "df" "error.df"
[6] "terms" "repeated" "type" "test" "idata"
[11] "idesign" "icontrasts" "imatrix" "singular"
and we can access them:
c2$SSP
c2$SSPE
It is not a good idea to use the name of R's built-in c function as a variable name.
