Use grep to delete any string containing year less than 2014 - r

Edited to add more context and data 5/12/2017
Using R version 3 on Windows
I have a data frame data2:
'data.frame': 1504 obs. of 14 variables:
$ Member.Name : chr "A" "B" "C"...
$ MSTATUS : Factor w/ 14 levels "","ACTIVE","ACTIVE;CHANGEDROLES;NONQUALIF",..: 13 2 2 2 2 4 13 13 2 13 ...
$ MCAT : Factor w/ 9 levels "","EDNEWCLASS",..: 5 4 9 6 6 6 9 9 4 4 ...
$ SALUTATION : Factor w/ 822 levels "","Aaron","Abigail",..: 285 2 2 2 4 4 4 4 5 5 ...
$ MEM_SUBCATEGORY : Factor w/ 22 levels "","AGENCYCEO",..: 22 6 8 15 8 6 8 1 6 6 ...
$ MEM_SUBTYPE : Factor w/ 25 levels "","AGENCY","AGENCYCEO",..: 24 6 6 20 6 6 6 6 6 6 ...
$ COUNTRY : Factor w/ 33 levels "","AE","AT","AU",..: 33 33 33 33 7 33 33 33 33 33 ...
$ F500 : Factor w/ 243 levels "","#1406 on Forbes Global 2000 ($11B)",..: 1 1 96 1 242 1 147 1 1 76 ...
$ OPT_LINE : Factor w/ 1467 levels "","(Formerly) Condé Nast",..: 1 1170 609 1333 251 1427 444 258 814 1207 ...
$ FLAGS : chr "2014PAGEJAMPARTICIPANT, \nPHOTO" "" "PUFOUNDINGMEMBER" "2014FLESPEAKER" ...
$ FLAGS_DESCR : chr "2014 Page Jam Participant, \nPhoto on File" "" "Page Up Founding Member" "2014 Future Leaders Experience Speaker" ...
$ Enroll.Date : Date, format: "2012-12-04" "2010-08-24" "2013-09-20" "2013-05-06" ...
$ Expiration.Date : Date, format: "2014-12-31" "2017-12-31" "2017-12-31" "2017-12-31" ...
$ Sponsorship.Amount: num 0 0 0 0 0 0 0 0 0 0 ...
For the FLAGS variable, I'd like to remove all row elements that contain a year less than 2014.
head(data2$FLAGS, n=3)
[1] "2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL" ""
[3] "2012COI"
So that FLAGS will look like:
head(data2$FLAGS, n=3)
[1] "\n2016CHAIRCOUNCIL" ""
[3] ""
The rows with no values can either be blank or NA, BUT if a row contains both an event with a year >= 2014 and an event with a year < 2014, then just delete the events before 2014 and keep the other events in the row.

This regex works for your example. The idea is to match the leading characters of the years to exclude (200x and 2010 to 2013) and drop the elements that contain them.
FLAGS[-grep("20(0|1[0123])", FLAGS)]
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "\n2014PAGEJAMPARTICIPANT" "\n2014PUSPONSOR, \nPHOTO"
or, using invert, you'd have
FLAGS[grep("20(0|1[0123])", FLAGS, invert=TRUE)]
Note that it won't catch pre-2000s and you should be cautious if there are other "numeric" values in the vector.
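One more caution, sketched on a made-up vector (flags_new is hypothetical, not the OP's data): when no element matches, grep returns integer(0), and negative indexing with it selects nothing rather than everything. A logical index from grepl avoids this.

```r
# Hypothetical vector in which no element has a year before 2014
flags_new <- c("2014PAGEJAMPARTICIPANT", "2015PUSPONSOR")
pattern <- "20(0|1[0123])"

# grep() finds no match, so this is flags_new[-integer(0)], which is empty
dropped_all <- flags_new[-grep(pattern, flags_new)]

# A logical index keeps every element when nothing matches
kept_all <- flags_new[!grepl(pattern, flags_new)]

print(length(dropped_all))
print(kept_all)
```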
To return a vector of the same length, with NAs replacing the earlier years, you could use is.na<- and grepl like this
is.na(FLAGS) <- grepl("20(0|1[0123])", FLAGS)
original data
FLAGS<-c("2014PAGEJAMPARTICIPANT, \nPHOTO", "2001ANNUALCONFERENCECOMM",
"\n2011GOVERNANCE", "\n2014PAGEJAMPARTICIPANT", "2013NEWMEMBERNOMINATOR",
"\n2014PUSPONSOR, \nPHOTO")
Given the OP's second question, the following more or less works:
sapply(strsplit(FLAGS, ","),
function(x) paste(gsub("(\\n)?20(0|1[0123]).*?(, |$)", "", trimws(x)), collapse=" "))
[1] " 2016CHAIRCOUNCIL" "" ""
Note that the "\n" is missing at the beginning and there is an additional (set of) space(s) at the beginning of the first element. The "\n" is removed by trimws, which makes the string a bit easier to work with. The additional spaces can be removed by wrapping the above expression in trimws, for example, trimws(sapply(strsplit(...))).
additional data
FLAGS <- c("2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL", "", "2012COI")
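As an alternative sketch of the same event-by-event idea, the year can be compared numerically instead of by regex, which also covers years before 2000. The helper name drop_old_events and its cutoff argument are made up for this illustration, and it assumes each event starts with its four-digit year once whitespace is trimmed.

```r
# Hypothetical helper: drop events whose leading 4-digit year is below a cutoff
drop_old_events <- function(flags, cutoff = 2014) {
  vapply(strsplit(flags, ",", fixed = TRUE), function(events) {
    events <- trimws(events)  # strips spaces and the leading "\n"
    # Events that do not start with four digits give NA here
    year <- suppressWarnings(as.integer(substr(events, 1, 4)))
    keep <- is.na(year) | year >= cutoff  # events without a year are kept
    paste(events[keep], collapse = ", ")
  }, character(1))
}

FLAGS <- c("2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL", "", "2012COI")
cleaned <- drop_old_events(FLAGS)
print(cleaned)
```

Rows that end up with no events come back as empty strings; they could be turned into NA afterwards with cleaned[cleaned == ""] <- NA if that is preferred.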

Here is one solution using stringr package:
library(stringr)
FLAGS[sapply(str_extract_all(FLAGS, '[0-9]{4}'),
function(x) !any(as.integer(x) < 2014))]
This solution assumes you may have more than one year in each value. If that is not the case, you can do something simpler like:
FLAGS[as.integer(str_extract(FLAGS, '[0-9]{4}')) >= 2014]
Assuming FLAGS is as follows:
FLAGS
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "2001ANNUALCONFERENCECOMM"
[3] "\n2011GOVERNANCE" "\n2014PAGEJAMPARTICIPANT"
[5] "2013NEWMEMBERNOMINATOR" "\n2014PUSPONSOR, \nPHOTO"
You get result as:
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "\n2014PAGEJAMPARTICIPANT"
[3] "\n2014PUSPONSOR, \nPHOTO"
EDITING ANSWER BASED ON QUESTION EDIT ABOVE
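You can keep only values with 2014 or above and fill with NAs otherwise as follows. Note that this compares only the first run of digits in each row, so a row whose first event is before 2014 becomes NA even if it also contains later events: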
data2$FLAGS <- ifelse(as.integer(str_extract(data2$FLAGS, '\\d+')) >= 2014,
data2$FLAGS, NA)
Result is as follows:
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" NA
[3] NA "\n2014PAGEJAMPARTICIPANT"
[5] NA "\n2014PUSPONSOR, \nPHOTO"
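For readers without stringr, the multi-year check can be sketched in base R with regmatches and gregexpr. This is an illustrative equivalent under the same assumption (any four-digit run is a year), not part of the original answer; the names has_old, kept and filled are made up here.

```r
FLAGS <- c("2014PAGEJAMPARTICIPANT, \nPHOTO", "2001ANNUALCONFERENCECOMM",
           "\n2011GOVERNANCE", "\n2014PAGEJAMPARTICIPANT",
           "2013NEWMEMBERNOMINATOR", "\n2014PUSPONSOR, \nPHOTO")

# All four-digit runs per element, as a list of character vectors
years <- regmatches(FLAGS, gregexpr("[0-9]{4}", FLAGS))

# TRUE when an element contains at least one year before 2014
has_old <- vapply(years, function(y) any(as.integer(y) < 2014), logical(1))

kept <- FLAGS[!has_old]                # shorter vector, old rows dropped
filled <- ifelse(has_old, NA, FLAGS)   # same length, old rows set to NA

print(kept)
print(filled)
```

Unlike the single str_extract version, elements with no digits at all are kept, because any() over an empty comparison is FALSE.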

unable to write to the csv file [duplicate]

I am trying to write a dataframe in R to a text file, but it returns the following error:
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) :
missing value where TRUE/FALSE needed
I used the following command for the export:
write.table(df, file ='dfname.txt', sep='\t' )
I have no idea what the problem could stem from. As far as "missing data where TRUE/FALSE is needed", I have only one column which contains TRUE/FALSE values, and none of these values are missing.
Contents of the dataframe:
> str(df)
'data.frame': 776 obs. of 15 variables:
$ Age : Factor w/ 4 levels "","A","J","SA": 2 2 2 2 2 2 2 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 2 2 ...
$ Rep : Factor w/ 11 levels "L","NR","NRF",..: 1 1 4 4 2 2 2 2 2 2 ...
$ FA : num 61.5 62.5 60.5 61 59.5 59.5 59.1 59.2 59.8 59.9 ...
$ Mass : num 20 19 16.5 17.5 NA 14 NA 23 19 18.5 ...
$ Vir1 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir2 : num 999 999 999 999 999 999 999 999 999 999 ...
$ Vir3 : num 40 999 999 999 999 999 999 999 999 999 ...
$ Location : Factor w/ 4 levels "Loc1",..: 4 4 4 4 4 4 2 2 2 2 ...
$ Site : Factor w/ 6 levels "A","B","C",..: 5 5 5 5 5 5 3 3 3 3 ...
$ Date : Date, format: "2010-08-30" "2010-08-30" ...
$ Record : int 35 34 39 49 69 38 145 112 125 140 ...
$ SampleID : Factor w/ 776 levels "AT1-A-F1","AT1-A-F10",..: 525 524 527 528 529 526 111 78 88 110 ...
$ Vir1Inc : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
$ Month :'data.frame': 776 obs. of 2 variables:
..$ Dates: Date, format: "2010-08-30" "2010-08-30" ...
..$ Month: Factor w/ 19 levels "Apr-2011","Aug-2010",..: 2 2 2 2 2 2 18 18 18 18 ...
I hope I've given enough/the right information ...
Many thanks,
Heather
An example to reproduce the error. I create a nested data.frame:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
str(dd)
'data.frame': 15 obs. of 2 variables:
$ Age : int 1 2 3 4 5 6 7 8 9 10 ...
$ Month:'data.frame': 15 obs. of 2 variables:
..$ Dates: Date, format: "2003-02-02" "2003-02-03" "2003-02-04" ...
..$ Month: Factor w/ 12 levels "1","2","3","4",..: 1 1 2 2 3 3 4 4 5 5 ...
Now I try to save it, and I reproduce the error:
write.table(dd)
Error in if (inherits(X[[j]], "data.frame") && ncol(xj) > 1L)
X[[j]] <- as.matrix(X[[j]]) : missing value where TRUE/FALSE needed
Without investigating further, one option is to unnest the data.frame before writing:
write.table(data.frame(subset(dd,select=-c(Month)),unclass(dd$Month)))
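A related base R sketch, not taken from the answer above: passing the columns of dd to data.frame() via do.call expands a nested data.frame column into ordinary columns without naming it. This is shown for one level of nesting only, and the output goes to a temporary file purely for illustration.

```r
# Rebuild the nested data.frame from the example above
Month <- data.frame(Dates = as.Date("2003-02-01") + 1:15,
                    Month = gl(12, 2, 15))
dd <- data.frame(Age = 1:15)
dd$Month <- Month

# data.frame() expands a data.frame argument into its columns,
# prefixing them with the argument name
flat <- do.call(data.frame, dd)
print(names(flat))

# Writing now succeeds because no column is itself a data.frame
out_file <- tempfile(fileext = ".txt")
write.table(flat, file = out_file, sep = "\t")
```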
The solution by agstudy provides a great quick fix, but there is a simple alternative/general solution for which you do not have to specify the element(s) in your data.frame that was(were) nested:
The following bit is just copied from agstudy's solution to obtain the nested data.frame dd:
Month=data.frame(Dates= as.Date("2003-02-01") + 1:15,
Month=gl(12,2,15))
dd <- data.frame(Age=1:15)
dd$Month <- Month
You can use akhilsbehl's LinearizeNestedList() function (which mrdwab made available as a GitHub gist, loaded below with source_gist) to flatten (or linearize) the nested levels:
library(devtools)
source_gist(4205477) #loads the function
ddf <- LinearizeNestedList(dd, LinearizeDataFrames = TRUE)
# ddf is now a list with two elements (Age and Month)
ddf <- LinearizeNestedList(ddf, LinearizeDataFrames = TRUE)
# ddf is now a list with 3 elements (Age, `Month/Dates` and `Month/Month`)
ddf <- as.data.frame.list(ddf)
# transforms the flattened/linearized list into a data.frame
ddf is now a data.frame without nesting. However, its column names still reflect the nested structure:
names(ddf)
[1] "Age" "Month.Dates" "Month.Month"
If you want to change this (in this case it seems redundant to have Month. written before Dates, for example) you can use gsub and a regular expression that I copied from Sacha Epskamp to remove all text in the column names up to and including the last dot:
names(ddf) <- gsub(".*\\.","",names(ddf))
names(ddf)
[1] "Age" "Dates" "Month"
The only thing left now is exporting the data.frame as usual:
write.table(ddf, file="test.txt")
Alternatively, you could use the "flatten" function from the jsonlite package to flatten the dataframe before export. It achieves the same result as the other functions mentioned and is much easier to implement.
jsonlite::flatten
https://rdrr.io/cran/jsonlite/man/flatten.html
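A minimal usage sketch of that suggestion, assuming the jsonlite package is installed (the code is skipped otherwise). It reuses the dd example from the earlier answer and writes to a temporary file for illustration.

```r
# Rebuild the nested data.frame from the example above
Month <- data.frame(Dates = as.Date("2003-02-01") + 1:15,
                    Month = gl(12, 2, 15))
dd <- data.frame(Age = 1:15)
dd$Month <- Month

# Only runs if jsonlite is available
if (requireNamespace("jsonlite", quietly = TRUE)) {
  flat <- jsonlite::flatten(dd)   # nested data.frame columns become plain columns
  print(names(flat))
  out_file <- tempfile(fileext = ".txt")
  write.table(flat, file = out_file, sep = "\t")
}
```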

How to remove $ from all values in a data frame column in a character vector?

I have a data frame in R that has information about NBA players, including salary information. All the data in the salary column have a "$" before the value and I want to convert the character data to numeric for the purpose of analysis. So I need to remove the "$" in this column. However, I am unable to subset or parse any of the values in this column. It seems that each value is a vector of 1. I've included below the structure of the data and what I have tried in my attempt at removing the "$".
> str(combined)
'data.frame': 588 obs. of 9 variables:
$ Player: chr "Aaron Brooks" "Aaron Gordon" "Aaron Gray" "Aaron Harrison" ...
$ Tm : Factor w/ 30 levels "ATL","BOS","BRK",..: 4 22 9 5 9 18 1 5 25 30 ...
$ Pos : Factor w/ 5 levels "C","PF","PG",..: 3 2 NA 5 NA 2 1 1 4 5 ...
$ Age : num 31 20 NA 21 NA 24 29 31 25 33 ...
$ G : num 69 78 NA 21 NA 52 82 47 82 13 ...
$ MP : num 1108 1863 NA 93 NA ...
$ PER : num 11.8 17 NA 4.3 NA 5.6 19.4 18.2 12.7 9.2 ...
$ WS : num 0.9 5.4 NA 0 NA -0.5 9.4 2.8 4 0.3 ...
$ Salary: chr "$2000000" "$4171680" "$452059" "$525093" ...
combined[, "Salary"] <- gsub("$", "", combined[, "Salary"])
The last line of code above runs successfully but it doesn't change the "Salary" column.
I am able to change it by running the code listed below, but I need to find a way to automate the replacement for the whole data set instead of doing it row by row.
combined[, "Salary"] <- gsub("$2000000", "2000000", combined[, "Salary"])
How can I subset the character vectors in this column to remove the "$"? Apologies for any formatting faux pas ahead of time, this is my first time asking a question. Cheers,
The $ is a regex metacharacter that matches the end of the string. So, we need to either escape it (\\$), place it in square brackets ("[$]"), or use fixed = TRUE in sub. We don't need gsub as there seems to be only a single $ character in each string.
combined[, "Salary"] <- as.numeric(sub("$", "", combined[, "Salary"], fixed=TRUE))
Or, as @gung mentioned in the comments, using substr would be faster:
as.numeric(substr(combined$Salary, 2, nchar(combined$Salary)))
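The difference between the variants can be checked on a small vector. This is a sketch using the first three salary strings from the str output above; the object names are made up for the illustration.

```r
salary <- c("$2000000", "$4171680", "$452059")

# As a regex, "$" matches the empty position at the end of the string,
# so nothing visible is replaced
unchanged <- gsub("$", "", salary)

# Three ways to match a literal dollar sign
by_fixed   <- as.numeric(sub("$", "", salary, fixed = TRUE))
by_escape  <- as.numeric(sub("\\$", "", salary))
by_bracket <- as.numeric(sub("[$]", "", salary))

# Dropping the first character, which assumes every value starts with "$"
by_substr  <- as.numeric(substr(salary, 2, nchar(salary)))

print(unchanged)
print(by_fixed)
```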

Raw data: reading attributes with varied number of spaces in R

I am trying to read in the Baylor dataset but I can't use read.csv since the spaces are not consistent.
I do have the column numbers so I was thinking read.fwf would help fix my issue but that means I have to review more than 100 attributes and check the line widths.
Is there an easier way to read the data?
baylor <- read.csv('C:/Users/Documents/baylor-religion-survey-data-2007.txt', header=F)
Column Numbers
Baylor Religion 2007 Survey Data
I haven't tested carefully, but I think this does it:
Define URLs:
lnum_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-column-numbers.txt"
survey_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-data-2007.txt"
Read file with column info:
nums <- read.table(url(lnum_url),as.is=TRUE,header=TRUE)
Extract starting column for each field:
startcol <- as.numeric( ## convert to numeric
sapply(
strsplit(nums[,3],"-"), ## split strings on dashes
"[",1)) ## select first element of each result
## sapply(z,"[",1) == sapply(z,function(x) x[1])
Field widths are differences (assume last field is length 1):
w <- c(diff(startcol),1)
Read fixed width:
r <- read.fwf(url(survey_url),widths=w)
Assign field names:
names(r) <- gsub(":","",nums$COL)
Some quick checks:
str(r[,1:8])
## 'data.frame': 1648 obs. of 8 variables:
## $ ID : num 1.1e+09 1.1e+09 1.1e+09 1.1e+09 1.1e+09 ...
## $ WEIGHT : num 0.822 0.312 1.604 1.184 1.35 ...
## $ REGION : int 3 3 4 3 2 2 2 4 2 2 ...
## $ RELIG1 : int 12 12 46 45 14 31 16 33 16 16 ...
## $ RELIG2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ DENOM : Factor w/ 301 levels " ",..: 231 231 1 1 1 1 83 113 1 23 ...
## $ RELGIOUS: int 3 4 1 3 3 4 4 4 3 4 ...
## $ ATTEND : int 5 8 0 8 3 0 8 7 1 8 ...
tail(sort(levels(r$DENOM)))
## [1] " RIVER OF LIFE EVANGELICAL FREE OF ELK RIVER"
## [2] " ELCA - EVANGELICAL LUTHERAN CHURCH OF AMERICA"
## [3] " WASHBURN CHRISTIAN CHURCH DISCIPLES OF CHRIST"
## [4] " THE CHURCH OF JESUS CHRIST OF LATTER DAY SAINTS"
## [5] " GENERAL ASSOCIATION OF REGULAR BAPTISTS CHURCHES"
## [6] "CONGREGATIONAL/METHODIST UNITED CHURCHES OF DURHAM,"
Some more processing (e.g. stripping white space in the denominations) might be in order, and I would certainly further check these results, but this should get you most of the way there.
For future reference it might be worth downloading the data from the original download site and checking cross-tabulations against the code book ...
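The width calculation can be tried offline on a toy example. This is a sketch only: the column specification, field names and data lines below are invented, and unlike the code above it takes the last width from the end column rather than assuming it is 1.

```r
# Hypothetical "start-end" column specification, in the same style as the
# third column of the column-numbers file
col_spec <- c("1-3", "4-5", "6-10")
parts <- strsplit(col_spec, "-")
startcol <- as.numeric(sapply(parts, "[", 1))
endcol   <- as.numeric(sapply(parts, "[", 2))

# Widths are the gaps between start columns; the last one uses its end column
n <- length(startcol)
w <- c(diff(startcol), endcol[n] - startcol[n] + 1)

# Two made-up fixed-width records, 10 characters each
lines <- c("001 1 12.5",
           "002 2  7.0")
r <- read.fwf(textConnection(lines), widths = w,
              col.names = c("ID", "GROUP", "SCORE"))
print(w)
print(r)
```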

R How to update a column in data.frame using values from another data.frame

New to R.
I have a data.frame
'data.frame': 2070 obs. of 5 variables:
$ id : int 16625062 16711130 16625064 16668358 16625066 16711227 16711290 16668746 16711502 16625494 ...
$ subj : Factor w/ 3 levels "L","M","S": 1 1 1 1 1 1 1 1 1 1 ...
$ grade: int 4 6 4 5 4 6 6 5 6 4 ...
$ score: int 225 225 0 225 225 375 375 125 225 125 ...
$ level: logi NA NA NA NA NA NA ...
and a list of named numbers called lookup
Named num [1:12] 12 19 20 26 31 32 49 67 72 73 ...
- attr(*, "names")= chr [1:12] "0" "50" "100" "125" ...
I'd like to find a way to update the data frame "level" column by looking up values in the lookup list, matching the data frame "score" column with the name of the number in the lookup list. In other words, the score values in the data frame are used to lookup the number (that will go in the level column) in the lookup list.
So... if anyone understands what I mean... please help.
Thanks Robn
You should be able to do this with (assuming your data frame is called d):
d$level = as.numeric(lookup[as.character(d$score)])
For example:
lookup = list(1, 2, 3, 4)
names(lookup) = c("0", "50", "100", "150")
d = data.frame(score=c(50, 150, 0, 0), level=NA)
d$level = as.numeric(lookup[as.character(d$score)])
print(d)
# score level
# 1 50 2
# 2 150 4
# 3 0 1
# 4 0 1
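Since the lookup in the question is a named numeric vector rather than a list, here is a sketch for that case. The first four names and values are taken from the str output in the question; the scores in d are made up, including one (999) that has no entry in lookup.

```r
# First four entries of the named numeric vector shown in the question
lookup <- c("0" = 12, "50" = 19, "100" = 20, "125" = 26)

# Made-up scores; 999 is deliberately absent from lookup
d <- data.frame(score = c(125, 0, 50, 999), level = NA)

# Index by name; unname() drops the names carried over from lookup
d$level <- unname(lookup[as.character(d$score)])
print(d)
```

With a named vector, a score that has no entry simply yields NA in level. One thing to watch: as.character() can produce scientific notation for large round numbers (for example, 1e5 becomes "1e+05"), so the names in lookup need to match however the scores are rendered.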
