Raw data: reading attributes with a varying number of spaces in R

I am trying to read in the Baylor dataset, but I can't use read.csv since the spaces are not consistent.
I do have the column numbers, so I was thinking read.fwf would fix my issue, but that would mean reviewing more than 100 attributes and checking the field widths.
Is there an easier way to read the data?
baylor <- read.csv('C:/Users/Documents/baylor-religion-survey-data-2007.txt', header=F)
Links: Column Numbers, Baylor Religion 2007 Survey Data

I haven't tested carefully, but I think this does it:
Define URLs:
lnum_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-column-numbers.txt"
survey_url <- "http://facweb.cdm.depaul.edu/sjost/csc433/projects/baylor-religion-survey-data-2007.txt"
Read file with column info:
nums <- read.table(url(lnum_url),as.is=TRUE,header=TRUE)
Extract starting column for each field:
startcol <- as.numeric(          ## convert to numeric
    sapply(
        strsplit(nums[,3], "-"), ## split strings on dashes
        "[", 1))                 ## select first element of each result
## sapply(z, "[", 1) == sapply(z, function(x) x[1])
Field widths are differences (assume last field is length 1):
w <- c(diff(startcol),1)
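As a toy illustration of that step (the start positions here are made up): three fields starting at columns 1, 5 and 8 give widths 4 and 3 from the gaps between starts, plus the assumed width of 1 for the last field.

```r
# Hypothetical start columns for three fields
startcol <- c(1, 5, 8)
# Widths of all but the last field are the gaps between consecutive starts;
# the last width is assumed to be 1
w <- c(diff(startcol), 1)
w
## [1] 4 3 1
```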
Read fixed width:
r <- read.fwf(url(survey_url),widths=w)
Assign field names:
names(r) <- gsub(":","",nums$COL)
Some quick checks:
str(r[,1:8])
## 'data.frame': 1648 obs. of 8 variables:
## $ ID : num 1.1e+09 1.1e+09 1.1e+09 1.1e+09 1.1e+09 ...
## $ WEIGHT : num 0.822 0.312 1.604 1.184 1.35 ...
## $ REGION : int 3 3 4 3 2 2 2 4 2 2 ...
## $ RELIG1 : int 12 12 46 45 14 31 16 33 16 16 ...
## $ RELIG2 : int NA NA NA NA NA NA NA NA NA NA ...
## $ DENOM : Factor w/ 301 levels " ",..: 231 231 1 1 1 1 83 113 1 23 ...
## $ RELGIOUS: int 3 4 1 3 3 4 4 4 3 4 ...
## $ ATTEND : int 5 8 0 8 3 0 8 7 1 8 ...
tail(sort(levels(r$DENOM)))
## [1] " RIVER OF LIFE EVANGELICAL FREE OF ELK RIVER"
## [2] " ELCA - EVANGELICAL LUTHERAN CHURCH OF AMERICA"
## [3] " WASHBURN CHRISTIAN CHURCH DISCIPLES OF CHRIST"
## [4] " THE CHURCH OF JESUS CHRIST OF LATTER DAY SAINTS"
## [5] " GENERAL ASSOCIATION OF REGULAR BAPTISTS CHURCHES"
## [6] "CONGREGATIONAL/METHODIST UNITED CHURCHES OF DURHAM,"
Some more processing (e.g. stripping white space in the denominations) might be in order, and I would certainly further check these results, but this should get you most of the way there.
For future reference it might be worth downloading the data from the original download site and checking cross-tabulations against the code book ...


Use grep to delete any string containing year less than 2014

Edited to add more context and data 5/12/2017
Using R version 3 on Windows
I have a data frame data2:
'data.frame': 1504 obs. of 14 variables:
$ Member.Name : chr "A" "B" "C"...
$ MSTATUS : Factor w/ 14 levels "","ACTIVE","ACTIVE;CHANGEDROLES;NONQUALIF",..: 13 2 2 2 2 4 13 13 2 13 ...
$ MCAT : Factor w/ 9 levels "","EDNEWCLASS",..: 5 4 9 6 6 6 9 9 4 4 ...
$ SALUTATION : Factor w/ 822 levels "","Aaron","Abigail",..: 285 2 2 2 4 4 4 4 5 5 ...
$ MEM_SUBCATEGORY : Factor w/ 22 levels "","AGENCYCEO",..: 22 6 8 15 8 6 8 1 6 6 ...
$ MEM_SUBTYPE : Factor w/ 25 levels "","AGENCY","AGENCYCEO",..: 24 6 6 20 6 6 6 6 6 6 ...
$ COUNTRY : Factor w/ 33 levels "","AE","AT","AU",..: 33 33 33 33 7 33 33 33 33 33 ...
$ F500 : Factor w/ 243 levels "","#1406 on Forbes Global 2000 ($11B)",..: 1 1 96 1 242 1 147 1 1 76 ...
$ OPT_LINE : Factor w/ 1467 levels "","(Formerly) Condé Nast",..: 1 1170 609 1333 251 1427 444 258 814 1207 ...
$ FLAGS : chr "2014PAGEJAMPARTICIPANT, \nPHOTO" "" "PUFOUNDINGMEMBER" "2014FLESPEAKER" ...
$ FLAGS_DESCR : chr "2014 Page Jam Participant, \nPhoto on File" "" "Page Up Founding Member" "2014 Future Leaders Experience Speaker" ...
$ Enroll.Date : Date, format: "2012-12-04" "2010-08-24" "2013-09-20" "2013-05-06" ...
$ Expiration.Date : Date, format: "2014-12-31" "2017-12-31" "2017-12-31" "2017-12-31" ...
$ Sponsorship.Amount: num 0 0 0 0 0 0 0 0 0 0 ...
For the FLAGS variable, I'd like to remove all row elements that contain a year less than 2014.
head(data2$FLAGS, n=3)
[1] "2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL" ""
[3] "2012COI"
So that FLAGS will look like:
head(data2$FLAGS, n=3)
[1] "\n2016CHAIRCOUNCIL" ""
[3] ""
The rows with no values can either be blank or NA. BUT if a row contains both an event with a year >= 2014 and an event with a year < 2014, then delete only the pre-2014 event and keep the other events in the row.
This regex works for your example. The idea is to match the leading year digits (2000 through 2013) in the elements that should be dropped, and drop those elements.
FLAGS[-grep("20(0|1[0123])", FLAGS)]
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "\n2014PAGEJAMPARTICIPANT" "\n2014PUSPONSOR, \nPHOTO"
or, using invert, you'd have
FLAGS[grep("20(0|1[0123])", FLAGS, invert=TRUE)]
Note that it won't catch pre-2000s and you should be cautious if there are other "numeric" values in the vector.
To return a vector of the same length, with NAs replacing the earlier years, you could use is.na<- and grepl like this
is.na(FLAGS) <- grepl("20(0|1[0123])", FLAGS)
original data
FLAGS<-c("2014PAGEJAMPARTICIPANT, \nPHOTO", "2001ANNUALCONFERENCECOMM",
"\n2011GOVERNANCE", "\n2014PAGEJAMPARTICIPANT", "2013NEWMEMBERNOMINATOR",
"\n2014PUSPONSOR, \nPHOTO")
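Applied to that sample data, the is.na<- approach keeps the vector length and blanks out the pre-2014 elements (a quick sketch):

```r
FLAGS <- c("2014PAGEJAMPARTICIPANT, \nPHOTO", "2001ANNUALCONFERENCECOMM",
           "\n2011GOVERNANCE", "\n2014PAGEJAMPARTICIPANT", "2013NEWMEMBERNOMINATOR",
           "\n2014PUSPONSOR, \nPHOTO")
# Mark elements containing a year 2000-2013 as NA, preserving length
is.na(FLAGS) <- grepl("20(0|1[0123])", FLAGS)
which(is.na(FLAGS))
## [1] 2 3 5
```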
Given the OP's second question, the following more or less works:
sapply(strsplit(FLAGS, ","),
function(x) paste(gsub("(\\n)?20(0|1[0123]).*?(, |$)", "", trimws(x)), collapse=" "))
[1] " 2016CHAIRCOUNCIL" "" ""
Note that the leading "\n" is gone and there is an extra run of spaces at the beginning of the first element. The "\n" is removed by trimws, which makes the string a bit easier to work with. The extra spaces can be removed by wrapping the above expression in trimws, for example, trimws(sapply(strsplit(...))).
additional data
FLAGS <- c("2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL", "", "2012COI")
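Putting the pieces together, wrapping the whole expression in trimws cleans up the leftover spaces (a sketch on the data above):

```r
FLAGS <- c("2011PRESIDENTS, \n2012CHAIRMANSCOUNCIL, \n2016CHAIRCOUNCIL", "", "2012COI")
# Split on commas, drop pre-2014 events, rejoin, then trim stray whitespace
trimws(sapply(strsplit(FLAGS, ","),
              function(x) paste(gsub("(\\n)?20(0|1[0123]).*?(, |$)", "",
                                     trimws(x)),
                                collapse = " ")))
## [1] "2016CHAIRCOUNCIL" ""                 ""
```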
Here is one solution using stringr package:
library(stringr)
FLAGS[sapply(str_extract_all(FLAGS, '[0-9]{4}'),
function(x) !any(as.integer(x) < 2014))]
This solution assumes you may have more than one year in each value. If that is not the case, you can do something more simple like:
FLAGS[as.integer(str_extract(FLAGS, '[0-9]{4}')) >= 2014]
Assuming FLAGS is as follows:
FLAGS
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "2001ANNUALCONFERENCECOMM"
[3] "\n2011GOVERNANCE" "\n2014PAGEJAMPARTICIPANT"
[5] "2013NEWMEMBERNOMINATOR" "\n2014PUSPONSOR, \nPHOTO"
You get result as:
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" "\n2014PAGEJAMPARTICIPANT"
[3] "\n2014PUSPONSOR, \nPHOTO"
EDITING ANSWER BASED ON QUESTION EDIT ABOVE
You can keep only values with 2014 or above and fill with NAs otherwise as follows:
data2$FLAGS <- ifelse(as.integer(str_extract(data2$FLAGS, '\\d+')) >= 2014,
data2$FLAGS, NA)
Result is as follows:
[1] "2014PAGEJAMPARTICIPANT, \nPHOTO" NA
[3] NA "\n2014PAGEJAMPARTICIPANT"
[5] NA "\n2014PUSPONSOR, \nPHOTO"

How can I store a value in a name?

I use the neotoma package, where I get data from a geographical site that is identified by an ID. What I want to do is "store" the number in a name, like Sitenum, so that I only need to write the ID down once and can then reuse it.
What I did:
Site<-get_download(20131, verbose = TRUE)
taxa<-as.vector(Site$'20131'$taxon.list$taxon.name)
What I want to do:
Sitenum <-20131
Site<-get_download(Sitenum, verbose = TRUE) # this obv. works
taxa<-as.vector(Site$Sitenum$taxon.list$taxon.name) # this doesn't work
The structure of the dataset:
str(Site)
List of 1
$ 20131:List of 6
..$ taxon.list :'data.frame': 84 obs. of 6 variables:
.. ..$ taxon.name : Factor w/ 84 levels "Alnus","Amaranthaceae",..: 1 2 3 4 5 6 7 8 9 10 ...
I constructed an object that mimics yours as follows:
Site <- list("2043"=list(other=data.frame(that=1:10)))
Note that the structure is essentially identical.
str(Site)
List of 1
$ 2043:List of 1
..$ other:'data.frame': 10 obs. of 1 variable:
.. ..$ that: int [1:10] 1 2 3 4 5 6 7 8 9 10
Now, I save the value of the first term:
temp <- 2043
Then use the code in my comment to access the inner vector:
Site[[as.character(temp)]]$other$that
[1] 1 2 3 4 5 6 7 8 9 10
I could also use recursive referencing like this
Site[[c(temp,"other", "that")]]
[1] 1 2 3 4 5 6 7 8 9 10
because c will coerce temp to be a character vector in the presence of "other" and "that" character vectors.
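The coercion is easy to verify: mixing a number with character strings in c() yields a character vector, so the numeric ID becomes a valid name for [[ (a quick check on the mimic object):

```r
# c() coerces the numeric 2043 to "2043" alongside the character elements
c(2043, "other", "that")
## [1] "2043"  "other" "that"

Site <- list("2043" = list(other = data.frame(that = 1:10)))
temp <- 2043
# Recursive [[ indexing with the coerced vector matches direct access
identical(Site[[c(temp, "other", "that")]], Site[["2043"]]$other$that)
## [1] TRUE
```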

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicate rows (some entries appear twice):
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left, right) : '-' not meaningful for factors
I have then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick; however, it converts the user names to numeric.
I am new to R, and I have tried looking at previous posts on this topic (including the one below), but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
As @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted by many functions, including read.csv, read.table, and most data.frame functions such as as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and your need to remove duplicates works:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.
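For example, writing and re-reading a small file (the file contents here are made up) shows the column coming back as character, after which duplicated/unique behaves as expected:

```r
# Write a tiny example CSV, then read it back without factor conversion
tmp <- tempfile(fileext = ".csv")
writeLines(c("imagelike,user", "3,frenchfries", "27,test_00", "3,frenchfries"), tmp)
data1 <- read.csv(tmp, stringsAsFactors = FALSE)
class(data1$user)
## [1] "character"
nrow(unique(data1))   # the duplicated row is dropped
## [1] 2
```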

How to remove $ from all values in a data frame column in a character vector?

I have a data frame in R that has information about NBA players, including salary information. All the data in the salary column have a "$" before the value and I want to convert the character data to numeric for the purpose of analysis. So I need to remove the "$" in this column. However, I am unable to subset or parse any of the values in this column. It seems that each value is a vector of 1. I've included below the structure of the data and what I have tried in my attempt at removing the "$".
> str(combined)
'data.frame': 588 obs. of 9 variables:
$ Player: chr "Aaron Brooks" "Aaron Gordon" "Aaron Gray" "Aaron Harrison" ...
$ Tm : Factor w/ 30 levels "ATL","BOS","BRK",..: 4 22 9 5 9 18 1 5 25 30 ...
$ Pos : Factor w/ 5 levels "C","PF","PG",..: 3 2 NA 5 NA 2 1 1 4 5 ...
$ Age : num 31 20 NA 21 NA 24 29 31 25 33 ...
$ G : num 69 78 NA 21 NA 52 82 47 82 13 ...
$ MP : num 1108 1863 NA 93 NA ...
$ PER : num 11.8 17 NA 4.3 NA 5.6 19.4 18.2 12.7 9.2 ...
$ WS : num 0.9 5.4 NA 0 NA -0.5 9.4 2.8 4 0.3 ...
$ Salary: chr "$2000000" "$4171680" "$452059" "$525093" ...
combined[, "Salary"] <- gsub("$", "", combined[, "Salary"])
The last line of code above runs successfully, but it doesn't change the Salary column.
I can change a value by running the code listed below, but I need a way to automate the replacement for the whole data set instead of doing it value by value.
combined[, "Salary"] <- gsub("$2000000", "2000000", combined[, "Salary"])
How can I subset the character vectors in this column to remove the "$"? Apologies for any formatting faux pas ahead of time, this is my first time asking a question. Cheers,
The $ is a metacharacter that matches the end of the string. So we need to either escape it (\\$), place it in square brackets ("[$]"), or use fixed = TRUE in sub. We don't need gsub, as there seems to be only a single $ character in each string.
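All three ways of treating the $ literally give the same result here (a quick sketch with made-up salaries):

```r
x <- c("$2000000", "$4171680")
as.numeric(sub("\\$", "", x))              # escaped metacharacter
## [1] 2000000 4171680
as.numeric(sub("[$]", "", x))              # character class
## [1] 2000000 4171680
as.numeric(sub("$", "", x, fixed = TRUE))  # fixed-string match
## [1] 2000000 4171680
```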
combined[, "Salary"] <- as.numeric(sub("$", "", combined[, "Salary"], fixed=TRUE))
Or as #gung mentioned in the comments, using substr would be faster
as.numeric(substr(combined$Salary, 2, nchar(combined$Salary)))

Wrong R data type or bad data?

I'm having trouble running simple functions on a data frame and am unsure whether the cause is the data type of the columns or bad data in the data frame.
I exported a SQL query into a CSV file, then loaded it into a data frame, then attached it.
df <-read.csv("~/Desktop/orders.csv")
attach(df)
When I am done, and run str(df), here is what I get:
$ AccountID: Factor w/ 18093 levels "(819947 row(s) affected)",..: 10 97 167 207 207 299 299 309 352 573 ...
$ OrderID : int 1874197767 1874197860 1874196789 1874206918 1874209100 1874207018 1874209111 1874233050 1874196791 1875081598 ...
$ OrderDate : Factor w/ 280 levels "","2010-09-24",..: 2 2 2 2 2 2 2 2 2 2 ...
$ NumofProducts : int 16 6 4 6 10 4 2 4 6 40 ...
$ OrderTotal : num 20.3 13.8 12.5 13.8 16.4 ...
$ SpecialOrder : int 1 1 1 1 1 1 1 1 1 1 ...
Trying to run the following functions, here is what I get:
> length(OrderID)
[1] 0
> min(OrderTotal)
[1] NA
> min(OrderTotal, na.rm=TRUE)
[1] 5.00
> mean(NumofProducts)
[1] NA
> mean(NumofProducts, na.rm=TRUE)
[1] 3.462902
I have two questions related to this data frame:
Do I have the right data types for the columns? Nums versus integers versus decimals.
Is there a way to review the data set to find the rows that are driving the need to use na.rm=TRUE to make the function work? I'd like to know how many there are, etc.
The difference between num and int is pretty irrelevant at this stage.
See help(is.na) for starters on NA handling. Do things like:
sum(is.na(foo))
to see how many foo's are NA values. Then things like:
df[is.na(df$foo),]
to see the rows of df where foo is NA.
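complete.cases complements this by flagging rows with no NAs anywhere, so its negation counts and locates the problem rows in one go (toy data frame; column names made up to match the question):

```r
df <- data.frame(OrderID = 1:4, OrderTotal = c(20.3, NA, 12.5, NA))
sum(!complete.cases(df))   # how many rows contain at least one NA
## [1] 2
colSums(is.na(df))         # NA count per column
df[!complete.cases(df), ]  # the offending rows themselves
```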
