Why does substr() return an empty string? - r

I don't know if I have a logical error, but my substr() call keeps returning an empty string. My initial thought was that I was cutting the string wrong; however, even with a lower starting value, I get an empty string in my dataframe column.
I have looked at this PHP question to get some information, but didn't work: PHP: substr returns empty string
Reproducible example-
#Original Data set str() output
#'data.frame': 9245 obs. of 3 variables:
#$ Latitude : num 29.7 29.7 29.7 29.6 29.7 ...
#$ Longitude : num -82.3 -82.4 -82.3 -82.4 -82.3 ...
#$ Census Code: chr "120010011003032" "120010010004035" "120010002003009" "120010015213000" ...
#For example, even if I do this:
base::substr("120010011003032", 6, 1)
#Output : ""
#Desired output: 001100
I needed to cut census codes to generate tract information, and the tract information is usually first two being the state, next three the county, and the following six the tract.

You need base::substr("120010011003032", 1, 6) ! (EDIT: or 6, 11 per your comment)
The arguments are start, stop not length, start; see the doc or type ?base::substr. Tip: always triple-check the R doc first. And also try copy-pasting the working examples it gives you, then see if/how they differ from yours. Or just try various argument values.
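As a minimal sketch of the start/stop convention, the three pieces described in the question (state, county, tract) can be cut from one code with three calls. The variable names are only illustrative, and the positions assume the 2 + 3 + 6 layout the question describes.

```r
code <- "120010011003032"

# substr(x, start, stop): both positions are 1-based and inclusive
state  <- substr(code, 1, 2)    # first two characters
county <- substr(code, 3, 5)    # next three characters
tract  <- substr(code, 6, 11)   # following six characters

# When start is greater than stop, substr() returns "" instead of an error
empty <- substr(code, 6, 1)
```

substr() is vectorised over its first argument, so the same calls work on a whole character column at once.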

base::substr("120010011003032", 6, 11)
"001100"

Related

How to balance a dataset using SCUT and the scutr-package in R

I was given a machine learning project in R by a colleague who can no longer work on it. I am currently trying to balance the used dataset with the SCUT function in the scutr package and I keep running into the following Problem:
The project I am working with contains the base dataset, formatted as a standard dataframe that contains different information on different YouTube channels (URL, name, description, etc.) and also a classification of 4 classes (hkgeschlecht). The classification is numerical, some of the other information as well, but the channel description for example is a text:
'data.frame': 199 obs. of 6 variables:
$ ctitle : chr "Gaming Kati" "EinfallsReich" "Frank / Generation - E" "Gladiator Glubschi" ...
$ cdescr : chr "Dieser Kanal ist einfach ein Kanal von einem Mädel, welches einfach im Animal Crossing hype ist <U+0001F61D><U+"| __truncated__ "Kurze und EinfallsReiche Fakten Videos mit folgenden Themen:\n\n "| __truncated__ "Ich bin Frank aus Hamburg...\n\n...glücklicher Ehemann und Vater von zwei fantastischen Jungs. \n\nZu meinen gr"| __truncated__ "Gladiator Glubschi\n- Ein Glubschi\n- Zwei krasse Kanäle\n- Drei Unterhaltung!\n\nUnd damit erstmal danke fürs "| __truncated__ ...
$ cthumbnailurl: chr "https://yt3.ggpht.com/a/AATXAJwsWCPoVZ6g-uk_9UbMU3NqOU-QuoQyunPoYg=s240-c-k-c0xffffffff-no-rj-mo" "https://yt3.ggpht.com/a/AATXAJxunaT5qD2CbS7AQodCYq-HDOVee87NYBnRnw=s240-c-k-c0xffffffff-no-rj-mo" "https://yt3.ggpht.com/a/AATXAJzaeY6aZJuWpCsa8ul1CXHmQ1bC6reTWk9mTw=s240-c-k-c0xffffffff-no-rj-mo" "https://yt3.ggpht.com/a/AATXAJx0pmglui0v3YZblGuT1yOdNTm33qVP7mLXxQ=s240-c-k-c0xffffffff-no-rj-mo" ...
$ cviews : int 1348087 2764 229744 15556 1884 1077314 158044 113570 25495 2364116 ...
$ csubscriber : int 13000 0 1140 320 0 7940 623 823 406 34700 ...
$ hkgeschlecht : num 2 99 1 1 1 2 1 1 1 2 ...
The project uses a Naive Bayes classifier, so the channel description (cdescr) in the dataframe is transformed into a document-feature matrix (dfm), which is then split into a training dataset and a test dataset. This all works fine and the model gives me decent predictions.
However, the main dataset is unbalanced, as one class is much more dominant than the others. I now want to balance this dataset using the SCUT method so that the prediction of the minority classes improves. I had planned on using the scutr package and its SCUT function since it seems fairly straightforward.
Now my problem is, if I apply the function to main dataset like this:
ret <- SCUT(mldata, "hkgeschlecht", oversample = oversample_smote, undersample = undersample_hclust,)
I get this error:
Error in get.knnx(data, query, k, algorithm) : Data non-numeric
I assume that is due to the differently formatted variables in the dataframe.
But if I try to apply it only to the training dataset like this:
ret <- SCUT(testdfm1, testdfm1@docvars$docvars, oversample = oversample_smote, undersample = undersample_hclust,)
I get this error:
Error in validate_dataset(data, cls_col) :
Column not found in data: 22211299112111312111122333311133211111111
Which I assume is due to the SCUT function needing a dataframe format and not a document feature matrix.
My question thus is: How can I apply the SCUT method in this case? Is there a way to make the function work with a document-feature matrix, say by getting it to recognize the column with the classification? Would that even make sense? Or do I have to go about it in a completely different way?

Converting a date to numeric in R

I have data where I have the dates in YYYY-MM-DD format in one column and another column is num.
packages:
library(forecast)
library(ggplot2)
library(readr)
Running str(my_data) produces the following:
spec_tbl_df [261 x 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:261], format: "2017-01-01" "2017-01-08" ...
$ popularity: num [1:261] 100 81 79 75 80 80 71 85 79 81 ...
- attr(*, "spec")=
.. cols(
.. date = col_date(format = ""),
.. popularity = col_double()
.. )
- attr(*, "problems")=<externalptr>
I would like to do some time series analysis on this. When running the first line of code for this decomp <- stl(log(my_data), s.window="periodic")
I keep running into the following error:
Error in Math.data.frame(my_data) :
non-numeric-alike variable(s) in data frame: date
Originally my dates were in MM/DD/YYYY format, so I feel like I'm... barely closer. I'm learning R again, but it's been a while since I took a formal course in it. I did a cursory search here, but could not find anything that I could identify as helpful (I'm just an amateur.)
You currently have a data.frame (or tibble variant thereof). That is not yet time aware. You can do things like
library(ggplot2)
ggplot(data=df) + aes(x=date, y=popularity) + geom_line()
to get a basic line plot properly indexed by date.
You will have to look more closely at package forecast and the examples of functions you want to use to predict or model. Packages like xts can help you, i.e.
library(xts)
x <- xts(df$popularity, order.by=df$date)
plot(x) # plot xts object
Besides plotting, you get time- and date-aware lags, leads, and subsetting. The rest depends more on what you want to do ... which you have not told us much about.
Lastly, if you want to convert your dates to numbers (days since Jan 1, 1970), a quick as.numeric(df$date) will do; but using time-aware operations is often better (though it has the learning curve you see now...)
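A rough sketch of both routes, using made-up weekly data in place of my_data (the values, the seed, and the choice of frequency = 52 are assumptions for illustration only). stl() expects a univariate ts object with a seasonal frequency, not a data frame, so the popularity column is wrapped in ts() before taking the log.

```r
# Hypothetical stand-in for my_data: 261 weekly observations from 2017-01-01
set.seed(1)
my_data <- data.frame(
  date       = seq(as.Date("2017-01-01"), by = "week", length.out = 261),
  popularity = 80 + 10 * sin(2 * pi * (1:261) / 52) + rnorm(261)
)

# Route 1: plain numbers, i.e. days since 1970-01-01
date_num <- as.numeric(my_data$date)

# Route 2: a time-aware object; frequency = 52 treats the data as weekly
# with a yearly season (an assumption about the data, not a requirement)
pop_ts <- ts(my_data$popularity, start = c(2017, 1), frequency = 52)
decomp <- stl(log(pop_ts), s.window = "periodic")
```

Only the numeric column goes into log() and stl(); the date information is carried by the start and frequency arguments of ts().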

Confusion pointing to lists for values in functions in R

I'm having a problem understanding how R functions interact with variable names. If you pass a variable name into a function, it seems to behave differently than if you pass the variable value to the function, which confuses me.
I have tried searching the forums, but would appreciate some clarification, as I think there is something fundamentally wrong with my understanding of R.
The following code produces the desired effect:
library(MASS)
hist(Boston$crim,xlab='Crime Rate',ylab='Frequency', main='Frequency plot of Crime Rate')
[Image: expected behaviour]
The histogram titles and labels are all as defined in the function.
The problem arises, when I try to do it in a loop and do multiple plots, by calling the labels and plots using variables in lists. It seems calling the strings by pointing to a value in a list doesn't pass through to the histogram function.
sectors =c('crim','tax','ptratio')
xlabels =c('Crime Rate','Property Tax Rate', 'Pupil Teacher Ratio')
titles =c('Frequency plot of Crime Rate', 'Frequency plot of Tax Rate', 'Frequency Plot of Pupil:Teacher')
hist(Boston[sectors[1]],ylab='Frequency',xlab=as.character(xlabels[1]),main=as.character(titles[1]))
This produces the wrong image, where as you can see the titles and labels are wrong.
[Image: not expected behaviour]
I'm not observing any error messages, and I'm not entirely sure what to call this effect to google it correctly. I apologize if this has been answered before and would appreciate any and all help.
Thanks in advance
Notice the difference:
str(Boston[sectors[1]])
# 'data.frame': 506 obs. of 1 variable:
# $ crim: num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
str(Boston[, sectors[1]])
# num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
str(Boston[[sectors[1]]])
# num [1:506] 0.00632 0.02731 0.02729 0.03237 0.06905 ...
A data frame is a special kind of list so it can get confusing. The first example extracts the first element of the list (which here is also a column) and returns it as a list (a data frame). You should get an error message because the hist function does not take a list/data frame but a vector. The second uses the matrix/data frame way of extracting a column so you get a numeric vector. The third example treats Boston as a list and extracts just the first element and returns it without wrapping it in a list. The ?Extract manual page talks about this but it can take multiple readings to begin to figure it out. Also using str() can help you figure out what you are getting.
Also, your vectors are character so you do not need as.character():
i <- 3
hist(Boston[, sectors[i]], ylab='Frequency', xlab=xlabels[i], main=titles[i])
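Putting the pieces together, the loop from the question might look like the sketch below. It reuses the vectors from the question and extracts each column with [[ ]], so hist() receives a numeric vector rather than a one-column data frame.

```r
library(MASS)  # provides the Boston data set

sectors <- c("crim", "tax", "ptratio")
xlabels <- c("Crime Rate", "Property Tax Rate", "Pupil Teacher Ratio")
titles  <- c("Frequency plot of Crime Rate",
             "Frequency plot of Tax Rate",
             "Frequency Plot of Pupil:Teacher")

for (i in seq_along(sectors)) {
  # [[ ]] returns the column itself; [ ] would return a data frame
  hist(Boston[[sectors[i]]],
       ylab = "Frequency", xlab = xlabels[i], main = titles[i])
}
```

seq_along() is used instead of 1:3 so the loop keeps working if more column names are added to the vectors.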

Read cell values without formatting into R with googlesheets

Would like to be able to read Google Sheets cell values into R with googlesheets package, but without any cell formatting applied (e.g. comma separators, percentage conversion, etc.).
Have tried gs_read() without specifying a range, which uses gs_read_csv(), which will "request the data from the Sheets API via the exportcsv link". Can't find a way to tell it to provide underlying cell value without formatting applied.
Similarly, tried gs_read() and specifying a range, which uses gs_read_cellfeed(). But can't find a way to indicate that I want un-formatted cell values.
Note: I'm not after the formulas in any cells, just the values without any formatting applied.
Example:
(looks like I'm not able to post images)
Here's a screenshot of an example Google Sheet:
https://www.dropbox.com/s/qff05u8nn3do33n/Screenshot%202015-07-26%2008.42.58.png?dl=0
First and third columns are numeric with no formatting applied, 2nd column applies comma separators for thousands, 4th column applies percentage formatting.
Reading this sheet with the following code:
library(googlesheets)
gs <- gs_title("GoogleSheets Test")
ws <- gs_read(gs, ws = "Sheet1")
yields:
> str(ws)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 4 variables:
$ Number : int 123456 123457 123458
$ Number_wFormat : chr "123,456" "123,457" "123,458"
$ Percent : num 0.123 0.234 0.346
$ Percent_wFormat: chr "12.34%" "23.45%" "34.56%"
Would like to be able to read a worksheet that has formatting applied (ala columns 2 and 4), but read the unformatted values (ala columns 1 and 3).
At this point, I think your best bet is to fix the imported data like so:
> ws$Number_fixed <- type.convert(gsub(',', '', ws$Number_wFormat))
> ws$Percent_fixed <- type.convert(gsub('%', '', ws$Percent_wFormat)) / 100
> str(ws)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 6 variables:
$ Number : int 123456 123457 123458
$ Number_wFormat : chr "123,456" "123,457" "123,458"
$ Percent : num 0.123 0.234 0.346
$ Percent_wFormat: chr "12.34%" "23.45%" "34.56%"
$ Number_fixed : int 123456 123457 123458
$ Percent_fixed : num 0.123 0.234 0.346
I had some hope that post-processing with functions from readr would be a decent answer, but it looks like percentages and "currency" style numbers are open issues there too.
I have opened an issue to solve this better in googlesheets, one way or another.
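If several columns need the same clean-up, the fix above could be wrapped in a small helper. This is only a sketch in base R: strip_format() is a hypothetical name, the sample values are copied from the str() output in the question, and the helper assumes commas are thousands separators and a trailing % means "divide by 100".

```r
# Stand-in for the formatted columns read from the sheet
ws <- data.frame(
  Number_wFormat  = c("123,456", "123,457", "123,458"),
  Percent_wFormat = c("12.34%", "23.45%", "34.56%"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop commas and percent signs, rescale percentages
strip_format <- function(x) {
  is_pct <- grepl("%$", x)                   # remember which values were percentages
  value  <- as.numeric(gsub("[,%]", "", x))  # remove formatting characters
  ifelse(is_pct, value / 100, value)
}

ws$Number_fixed  <- strip_format(ws$Number_wFormat)
ws$Percent_fixed <- strip_format(ws$Percent_wFormat)
```

Unlike type.convert(), this always returns doubles, so Number_fixed comes back as num rather than int.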

data.table::fread's stringsAsFactors=TRUE argument doesn't convert character columns to factor type - what's the workaround?

I know this issue has been raised in several places and I have been trying to find out a possible good solution for hours but failed. That's why I'm asking this.
So, I have a huge data file (~5GB) and I used fread() to read this
library(data.table)
df<- fread('output.txt', sep = "|", stringsAsFactors = TRUE)
head(df, 5)
       age            income homeowner_status_desc marital_status_cd gender
1:         $35,000 - $49,999
2: 35 - 44 $35,000 - $49,999                  Rent            Single      F
3:         $35,000 - $49,999
4:
5:         $50,000 - $74,999
str(df)
Classes ‘data.table’ and 'data.frame': 999 obs. of 5 variables:
$ age : chr "" "35 - 44" "" "" ...
$ income : chr "$35,000 - $49,999" "$35,000 - $49,999" "$35,000 - $49,999" "" ...
$ homeowner_status_desc: chr "" "Rent" "" "" ...
$ marital_status_cd : chr "" "Single" "" "" ...
$ gender : chr "" "F" "" "" ...
- attr(*, ".internal.selfref")=<externalptr>
There are missing data (where it's blank). The original data has lots of columns, so I need a way to convert columns to factor whenever they contain strings. Could anyone suggest the best practice for this? I was considering converting it to a data frame first. But is it possible to do this while it's still a data.table?
Just implemented stringsAsFactors argument for fread in v 1.9.6+
From NEWS:
Implemented stringsAsFactors argument for fread(). When TRUE, character columns are converted to factors. Default is FALSE. Thanks to Artem Klevtsov for filing #501, and to @hmi2015 for this SO post.
This is basically a comment, but it's long, so here goes.
You may want to use colClasses to specify which columns are factors.
If you've got a lot of columns, something I've done to simplify is use the following function I wrote:
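abbr_to_colClass <- function(inits, counts) {
  # Split both strings into single characters, e.g. "cfi" -> "c", "f", "i"
  x <- strsplit(inits, "")[[1]]
  n <- as.integer(strsplit(counts, "")[[1]])
  # Map each abbreviation to a class name; anything unrecognised becomes numeric
  types <- ifelse(x == "c", "character",
           ifelse(x == "f", "factor",
           ifelse(x == "i", "integer", "numeric")))
  rep(types, n)
}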
Say you've got a .csv with columns of classes:
character 3
factor 2
integer 1
numeric 5
character 6
Then you could use my function to set
colClasses=abbr_to_colClass("cfinc","32156")
This will particularly save space if you have long strings of one type consecutively.
(I know it's not the most robust function, but it's served me quite well many times when there are a lot of fields to be read in)
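For comparison, the same vector can be built without a helper by calling rep() directly. This is just a sketch of what the abbreviations expand to for the layout listed above; unlike the one-digit-per-type shorthand, the times argument also allows runs of ten or more columns.

```r
# One class name per run of columns, repeated by the run length
col_classes <- rep(
  c("character", "factor", "integer", "numeric", "character"),
  times = c(3, 2, 1, 5, 6)
)

# One entry per column in the file
n_cols <- length(col_classes)
```

The resulting character vector is what would be passed as colClasses.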
I made a little csv file and I can confirm the same behavior where stringsAsFactors=TRUE doesn't result in factor columns. Additionally specifying colClasses as factor doesn't seem to work either.
If you run this after fread it'll convert all your character columns to factors
for (j in which(sapply(df, class)=='character')) set(df, i=NULL, j=j, value=as.factor(df[[j]]))
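The same conversion can also be written with := and .SDcols, which some people find easier to read than the set() loop. A minimal sketch on a small made-up table (the column values are invented for illustration):

```r
library(data.table)

# Small stand-in for the data read by fread()
dt <- data.table(
  age    = c("", "35 - 44", ""),
  income = c("$35,000 - $49,999", "$35,000 - $49,999", "$50,000 - $74,999"),
  n      = 1:3
)

# Names of all character columns
char_cols <- names(dt)[sapply(dt, is.character)]

# Convert them to factors by reference; other columns are left untouched
dt[, (char_cols) := lapply(.SD, factor), .SDcols = char_cols]
```

Both this and the set() loop modify the data.table in place, so no copy of a multi-gigabyte table is needed.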
Try the new readr package, it has been optimized to be 10x faster and not leak memory. Instead of stringsAsFactors, you can now specify col_types argument where you can specify a collector (a custom parser function). Look at the documentation, esp. col_factor/parse_factor.
require(readr)
read_csv(..., col_types=...)
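To make the col_types idea concrete, here is a small sketch that writes a pipe-delimited temporary file and reads it back with factor columns. The file contents are invented, and read_delim() is used instead of read_csv() only because the file in the question is pipe-separated; passing levels = NULL asks readr to take the factor levels from the data.

```r
library(readr)

# Write a tiny pipe-delimited file to a temporary location
tmp <- tempfile(fileext = ".txt")
writeLines(c("age|gender", "35 - 44|F", "45 - 54|M", "35 - 44|F"), tmp)

# Declare both columns as factors while reading
dat <- read_delim(
  tmp, delim = "|",
  col_types = cols(
    age    = col_factor(levels = NULL),
    gender = col_factor(levels = NULL)
  )
)
```

With many columns, cols() also accepts a .default argument so that only the exceptions need to be listed.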
