Remove specific columns from data frame - r

I am trying to remove a group of columns from a data frame (followed this) but I get an error in return.
Specifically, size of the data frame (NNF.data) is 34233 rows with 147 columns:
[118] "NNF.2015.03.EUR" "NNF.2015.04.EUR" "NNF.2015.05.EUR"
[121] "NNF.2015.06.EUR" "NNF.2015.07.EUR" "NNF.2015.08.EUR"
[124] "NNF.2015.09.EUR" "NNF.2015.10.EUR" "NNF.2015.11.EUR"
[127] "NNF.2015.12.EUR" "NNF.2016.01.EUR" "NNF.2016.02.EUR"
[130] "NNF.2016.03.EUR" "NNF.2016.04.EUR" "NNF.2016.05.EUR"
[133] "NNF.2016.06.EUR" "NNF.2016.07.EUR" "NNF.2016.08.EUR"
[136] "YTD.NNF.Year2005.EUR" "YTD.NNF.Year2006.EUR" "YTD.NNF.Year2007.EUR"
[139] "YTD.NNF.Year2008.EUR" "YTD.NNF.Year2009.EUR" "YTD.NNF.Year2010.EUR"
[142] "YTD.NNF.Year2011.EUR" "YTD.NNF.Year2012.EUR" "YTD.NNF.Year2013.EUR"
[145] "YTD.NNF.Year2014.EUR" "YTD.NNF.Year2015.EUR" "YTD.NNF.Year2016.EUR"
What I want to do is to remove the columns from 136-147, or the ones that contain YTD in their name.
I tried to use
NNF.data[, grep("YTD", names(NNF.data)):= NULL]
but I get the error:
Error in `[.data.frame`(NNF.data, , `:=`(grep("YTD", names(NNF.data)), :
could not find function ":="
Similarly, I tried
NNF.data[, which(grepl("YTD", colnames(NNF.data))):=NULL]
but again, I get
Error in `[.data.frame`(NNF.data, , `:=`(which(grepl("YTD", colnames(NNF.data))), :
could not find function ":="
Any suggestions please?
I made sure that NNF.data is a data frame
> is.data.frame(NNF.data)
[1] TRUE

:= only works for data.table objects. If you are working with a data.frame you can try this:
df = data.frame(First = c(1,2,3), AVSecond = c(3,4,5), ThirdAV = c(6,7,8), Fourth = c(10,22,2))
df = df[-c(grep("AV", colnames(df)), 4)]
This will remove the columns with 'AV' in it and the Fourth column. Output:
First
1 1
2 2
3 3
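For the data in the question, the same idea can also be written with a logical mask. This stays safe when no column matches, whereas negative indexing with an empty grep() result would drop every column. A minimal sketch, assuming a small stand-in for NNF.data with made-up values (NNF.trimmed and is_ytd are names chosen here for illustration):

```r
# Toy stand-in for NNF.data; the values are made up for illustration
NNF.data <- data.frame(
  NNF.2016.07.EUR      = c(1, 2, 3),
  NNF.2016.08.EUR      = c(4, 5, 6),
  YTD.NNF.Year2015.EUR = c(7, 8, 9),
  YTD.NNF.Year2016.EUR = c(10, 11, 12)
)

# TRUE for every column whose name contains the literal text "YTD"
is_ytd <- grepl("YTD", names(NNF.data), fixed = TRUE)

# Keep the remaining columns; drop = FALSE keeps a data frame
# even when only one column is left
NNF.trimmed <- NNF.data[, !is_ytd, drop = FALSE]

names(NNF.trimmed)
```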

df = data.frame(YTD.NNF.Year2009.EUR=c(1,2,3),NNF.2016.06.EUR=c(3,4,5),HJK=c(6,7,8))
nm = colnames(df)
numb = grepl("\\bYTD\\b", nm)
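# numb is a logical vector, so negate it with ! (using - would turn it into 0/-1 indices)
df = df[, !numb, drop = FALSE]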

Related

R: How to get column names for columns that contain a certain word AND their associated index number?

I want to create a list of column names that contain the word "arrest" AND their associated index number. I do not want all the columns, so I DO NOT want to subset the arrest columns into a new data frame. I merely want to see the list of names and their index numbers so I can delete the ones I don't want from the original data frame.
I tried getting the column names and their associated index numbers by using the below codes, but they only gave one or the other.
This gives me their names only
colnames(x2009_2014)[grepl("arrest",colnames(x2009_2014))]
[1] "poss_cannabis_tot_arrests" "poss_drug_total_tot_arrests"
[3] "poss_heroin_coke_tot_arrests" "poss_other_drug_tot_arrests"
[5] "poss_synth_narc_tot_arrests" "sale_cannabis_tot_arrests"
[7] "sale_drug_total_tot_arrests" "sale_heroin_coke_tot_arrests"
[9] "sale_other_drug_tot_arrests" "sale_synth_narc_tot_arrests"
[11] "total_drug_tot_arrests"
This gives me their index numbers only
grep("arrest", colnames(x2009_2014))
[1] 93 168 243 318 393 468 543 618 693 768 843
But I want their name AND index number so that it looks something like this
[93] "poss_cannabis_tot_arrests"
[168] "poss_drug_total_tot_arrests"
[243] "poss_heroin_coke_tot_arrests"
[318] "poss_other_drug_tot_arrests"
[393] "poss_synth_narc_tot_arrests"
[468] "sale_cannabis_tot_arrests"
[543] "sale_drug_total_tot_arrests"
[618] "sale_heroin_coke_tot_arrests"
[693] "sale_other_drug_tot_arrests"
[768] "sale_synth_narc_tot_arrests"
[843] "total_drug_tot_arrests"
Lastly, using advice here, I used the below code, but it did not work.
K=sapply(x2009_2014,function(x)any(grepl("arrest",x)))
which(K)
named integer(0)
The person who provided the advice in the above link used
K=sapply(df,function(x)any(grepl("\\D+",x)))
names(df)[K]
Zo.A Zo.B
which(K)
Zo.A Zo.B
2 4
I'd prefer the list I showed in the third block of code, but the code this person used provides a structure I can work with. It just did not work for me when I tried using it.
Hacky as a one-liner because I really dislike using <- inside a function call, but this should work:
setNames(
nm = matches <- grep("arrest", colnames(x2009_2014)),
colnames(x2009_2014)[matches]
)
Reproducible example:
setNames(nm = x <- grep("b|c", letters), letters[x])
# 2 3
# "b" "c"
Or write your own function that does it. Here I put it in a data frame, which seems nicer than a named vector:
grep_ind_value = function(pattern, x, ...) {
index = grep(pattern, x, ...)
value = x[index]
data.frame(index, value)
}
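As a side note, the sapply() attempt in the question returned named integer(0) because it searched the values inside each column rather than the column names. A minimal sketch of the index-plus-name lookup done on the names themselves, followed by dropping one of the matches; toy and its columns are hypothetical stand-ins for x2009_2014:

```r
# Hypothetical stand-in for x2009_2014
toy <- data.frame(
  county                    = c("A", "B"),
  poss_cannabis_tot_arrests = c(10, 20),
  population                = c(1000, 2000),
  sale_cannabis_tot_arrests = c(1, 2)
)

# Positions of the columns whose names contain "arrest"
idx <- grep("arrest", names(toy))

# Pair each position with its column name for inspection
arrest_cols <- data.frame(index = idx, name = names(toy)[idx])
arrest_cols

# After inspecting the table, drop the unwanted ones by position,
# here the second match
toy_small <- toy[, -arrest_cols$index[2], drop = FALSE]
names(toy_small)
```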

left_join says column is not present even though it is present

I would like to join two data frames on columns that have different names in each frame. I get an error saying it cannot find the variable in the second data frame, but when I run colnames(), the column name shows up. Why is this the case?
df_new <- left_join(master_settlement_current_month, master_settlement, by = c("D.settlecounty", "NAMECOUNTY"))
Error: Join columns must be present in data.
x Problem with `NAMECOUNTY`.
Run `rlang::last_error()` to see where the error occurred.
colnames(master_settlement_current_month)[1:5]
[1] "month" "D.info_state" "D.info_county" "D.info_settlement" "D.settlecounty"
colnames(master_settlement)
[1] "NAME" "NAMEJOIN" "NAMECOUNTY" "COUNTYJOIN" "DATE" "DATA_SOURC" "IMG_VERIFD"
[8] "X" "Y" "kobo_label" "X.3" "X.2" "X.1" "INDEX"
[15] "P_CODE" "aok_sett_id" "name_county_low" "ALT_NAME1" "ALT_NAME2" "ALT_NAME3" "ALT_NAME4"
[22] "FUNC_CLASS" "CONF_SCORE" "SRC_VERIFD" "num_dup" "check_coord_v38"
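I think your syntax in the by = statement may be a little off. An unnamed vector such as c("D.settlecounty", "NAMECOUNTY") asks for both columns to exist in both data frames. To match a column in the first data frame to a differently named column in the second, use a named vector: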
library(dplyr)
df_new <- left_join(master_settlement_current_month, master_settlement, by = c("D.settlecounty" = "NAMECOUNTY"))
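A small reproducible sketch of the named-vector form, assuming dplyr is installed; left_tbl, right_tbl and their values are made up for illustration and only borrow the two key column names from the question:

```r
library(dplyr)

# Hypothetical miniature versions of the two data frames
left_tbl <- data.frame(
  D.settlecounty = c("a_x", "b_y", "c_z"),
  month          = c("2020-01", "2020-01", "2020-01")
)
right_tbl <- data.frame(
  NAMECOUNTY = c("a_x", "b_y"),
  X          = c(30.1, 31.2),
  Y          = c(4.5, 5.6)
)

# Left-hand name comes from the first data frame, right-hand name from the second
joined <- left_join(left_tbl, right_tbl, by = c("D.settlecounty" = "NAMECOUNTY"))

# The key keeps the name from the first data frame; unmatched rows get NA
names(joined)
joined
```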

Access R Dataframe Values Rather than Tibble

I'm an experienced Pandas user and am having trouble plugging values from my R frame into a function.
The following function works with hard coded values
>seq.Date(as.Date('2018-01-01'), as.Date('2018-01-31'), 'days')
[1] "2018-01-01" "2018-01-02" "2018-01-03" "2018-01-04" "2018-01-05" "2018-01-06" "2018-01-07"
[8] "2018-01-08" "2018-01-09" "2018-01-10" "2018-01-11" "2018-01-12" "2018-01-13" "2018-01-14"
[15] "2018-01-15" "2018-01-16" "2018-01-17" "2018-01-18" "2018-01-19" "2018-01-20" "2018-01-21"
[22] "2018-01-22" "2018-01-23" "2018-01-24" "2018-01-25" "2018-01-26" "2018-01-27" "2018-01-28"
[29] "2018-01-29" "2018-01-30" "2018-01-31"
Here is an extract from a dataframe I'm using
>df[1,1:2]
# A tibble: 1 x 2
start_time end_time
<date> <date>
1 2017-04-27 2017-05-11
When plugging these values into the 'seq.Date' function I get an error
> seq.Date(from=df[1,1], to=df[1,2], 'days')
Error in seq.Date(from = df[1, 1], to = df[1, 2], "days") :
'from' must be a "Date" object
I suspect this is because subsetting using df[x,y] returns a tibble rather than the specific value
data.class(df[1,1])
[1] "tbl_df"
What I'm hoping to derive is a sequence of dates. I need to be able to point this at various places around the dataframe.
Many thanks for any help!
Just use double brackets:
seq.Date(from=df[[1,1]], to=df[[1,2]], 'days')
Single-bracket extraction on a tibble returns a one-column tibble rather than a vector. Use dplyr::pull to extract a column as a vector, as in this answer: Extract a dplyr tbl column as a vector
Another option is to set the drop argument in the `[` function to TRUE.
If TRUE the result is coerced to the lowest possible dimension
seq.Date(from = df[1, 1, drop = TRUE], to = df[1, 2, drop = TRUE], 'days')
# [1] "2017-04-27" "2017-04-28" "2017-04-29" "2017-04-30" "2017-05-01" "2017-05-02" "2017-05-03" "2017-05-04" "2017-05-05" "2017-05-06"
#[11] "2017-05-07" "2017-05-08" "2017-05-09" "2017-05-10" "2017-05-11"
data
df <- tibble(start_time = as.Date('2017-04-27'),
end_time = as.Date('2017-05-11'))
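Since the question mentions pointing this at various places around the data frame, here is a minimal sketch that builds one date sequence per row. It assumes the tibble package is installed; spans and its second row are made up for illustration:

```r
library(tibble)

# Hypothetical two-row version of the data in the question
spans <- tibble(
  start_time = as.Date(c("2017-04-27", "2018-01-01")),
  end_time   = as.Date(c("2017-05-11", "2018-01-03"))
)

class(spans[1, 1])    # still a tibble, which seq.Date() rejects
class(spans[[1, 1]])  # a plain Date value

# One sequence of days per row, pulling each value out of its column vector
date_seqs <- lapply(seq_len(nrow(spans)), function(i) {
  seq(spans$start_time[i], spans$end_time[i], by = "day")
})

lengths(date_seqs)
```

Indexing the column first (spans$start_time[i]) sidesteps the tibble subsetting rules entirely, because the column itself is an ordinary Date vector.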

cSplit_e not returning a binary data frame

I have a data frame with a Genre column that has rows like Action,Romance. I want to split those values and create a binary vector. If Action,Romance,Drama are all the possible genres, then the above mentioned row would be 1,1,0 in the output data frame.
I found this and this SO posts, and this CRAN doc covering cSplit_e, but when I use it I'm not getting a binary dataframe output, I'm getting the original data frame with a few values scrambled.
a = cSplit_e(df4, "Genre", sep = ",", mode = "binary", type = "character", drop=TRUE, fixed=TRUE,fill = 0)
Edit: The issue appears to be that it's adding the new columns to the old data frame, instead of creating a new frame. How can I get the Genres into their own frame?
> names(a)
[1] "Title" "Year" "Rated" "Released" "Runtime" "Genre" "Director" "Writer" "Actors"
[10] "Plot" "Language" "Country" "Awards" "Poster" "Metascore" "imdbRating" "imdbVotes" "imdbID"
[19] "Type" "tomatoMeter" "tomatoImage" "tomatoRating" "tomatoReviews" "tomatoFresh" "tomatoRotten" "tomatoConsensus" "tomatoUserMeter"
[28] "tomatoUserRating" "tomatoUserReviews" "tomatoURL" "DVD" "BoxOffice" "Production" "Website" "Response" "Budget"
[37] "Domestic_Gross" "Gross" "Date" "Genre_Action" "Genre_Adult" "Genre_Adventure" "Genre_Animation" "Genre_Biography" "Genre_Comedy"
[46] "Genre_Crime" "Genre_Documentary" "Genre_Drama" "Genre_Family" "Genre_Fantasy" "Genre_Film-Noir" "Genre_Game-Show" "Genre_History" "Genre_Horror"
[55] "Genre_Music" "Genre_Musical" "Genre_Mystery" "Genre_N/A" "Genre_News" "Genre_Reality-TV" "Genre_Romance" "Genre_Sci-Fi" "Genre_Short"
[64] "Genre_Sport" "Genre_Talk-Show" "Genre_Thriller" "Genre_War" "Genre_Western"
The drop argument only applies to the column being split, not all of the other columns in the data.frame. Thus, to subsequently extract just the split columns, use the original column name and extract just those columns.
Example:
> a <- cSplit_e(df4, "Genre", ",", mode = "binary", type = "character", fill = 0, drop = TRUE)
> a
id Genre_Action Genre_Drama Genre_Romance
1 1 1 0 1
2 2 1 1 1
> a[startsWith(names(a), "Genre")]
Genre_Action Genre_Drama Genre_Romance
1 1 0 1
2 1 1 1
Where:
df4 <- structure(list(Genre = c("Action,Romance", "Action,Romance,Drama"), id = 1:2),
.Names = c("Genre", "id"), row.names = 1:2, class = "data.frame")
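If the goal is only a standalone binary frame, the same result can be sketched in base R without cSplit_e. This is a swapped-in technique, not what splitstackshape does internally, and parts, genres and genre_df are names chosen here for illustration:

```r
df4 <- data.frame(
  Genre = c("Action,Romance", "Action,Romance,Drama"),
  id    = 1:2,
  stringsAsFactors = FALSE
)

# Split each row's genre string into its individual genres
parts <- strsplit(df4$Genre, ",", fixed = TRUE)

# Every genre that occurs anywhere, in alphabetical order
genres <- sort(unique(unlist(parts)))

# One row per record, one 0/1 column per genre
flags <- t(vapply(parts, function(p) as.integer(genres %in% p),
                  integer(length(genres))))
colnames(flags) <- paste0("Genre_", genres)

genre_df <- as.data.frame(flags)
genre_df
```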

How to get dataframe value?

I have a dataframe df. I took the column headers in another dataframe as I want to run a loop on it.
This is the output of the header dataframe
df.header
# [1] ISBAD MA_KH MA_CN Ngay.xep.hang So.ho.so SEQ Primary_Key Nganh Mo.hinh.xep.hang
# [10] Loai.hinh.DN NAM_TAI_CHINH Liq1 Liq4 Liq5 Per1 Per2 Per3 Per4
# [19] Per5 Lev1 Lev2 Lev3 Lev4 Lev5 Prof1 Prof2 Prof3
# [28] Prof4 Prof5 Gro14 Gro15 Gro16 Gro17 Gro18 Gro19 Gro20
# [37] Struc1 Cov1 Liq6 Prof6 Struc2 Lev6 Lev7 Lev8 Struc3
# [46] Struc4 Struc5 Prof8 Struc6 Liq7 Lev9 Cov.24 Cov2 Liq9
# [55] Cov4 Prof9 Struc7 Cov6 Prof10 Prof13 Prof16 Prof18 Prof19
# [64] Prof22 Per6 Per7 Per8 Prof23 Cov7 Prof24 Lev10 Struc8
# [73] Struc9 Lev11 Struc10 Liq10 cov3 Cov9 Cov10 Liq11 Cov11
# [82] Prof29 Prof30 Per9 Per10 Liq12 Cov12 Cov13 Liq13 Cov15
# [91] Per11 Per12 Per13 Per14 Cov16 Cov17 Cov18 Gro21 Gro22
#[100] Cov19 Cov20 Liq14 Liq15 Liq16 Liq17 Liq18 Struc11 Struc15
#[109] Liq19 Prof34 Prof35 Prof38 Prof40 Prof41 Prof42 Prof43 Struc12
#[118] Struc13 Struc14 Cov21 Cov22 Prof44 Gro23 Liq20 Cov23 Liq21
#[127] Liq22 Lev12 Prof31
Now when I put in the following code in the loop
liststring <- toString(df.header[2])
I got the output
liststring
# [1] "integer(0)"
Instead of MA_KH
I also tried toString(df.header[2],) and got the same result.
Not sure where I'm going wrong here
With the command df.header <- head(df, 0) you don't get a data frame of column headers, but an empty (zero-row) copy of your original data frame.
To get just the names of a data frame, use names(df).
Maybe you can post the purpose of this new dataframe. Iterating over variables of a dataframe can be done using lapply without creating a new dataframe.
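A minimal sketch of both suggestions, assuming a tiny made-up data frame that borrows three column names from the question (the values are invented):

```r
# Hypothetical stand-in for df
df <- data.frame(
  ISBAD = c(0L, 1L),
  MA_KH = c(101L, 102L),
  Liq1  = c(0.5, 0.7)
)

# names() returns a plain character vector, so [2] gives the header itself
col_names <- names(df)
col_names[2]

# Iterate over the column names without building a header data frame
col_classes <- vapply(col_names, function(nm) class(df[[nm]])[1], character(1))

# Or iterate over the columns directly
col_means <- vapply(df, mean, numeric(1))
col_means
```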
