So I have this code where I am trying to unite separate columns, Gradeprek through Grade12, into one column called Grade. I have employed the tidyr package and used this line of code to perform said task:
unite(dta, "Grade",
c(Gradeprek,
dta$Gradek, dta$Grade1, dta$Grade2,
dta$Grade3, dta$Grade4, dta$Grade5,
dta$Grade6, dta$Grade7, dta$Grade8,
dta$Grade9, dta$Grade10, dta$Grade11,
dta$Grade12),
sep="")
However, I have been getting an error saying this:
error: All select() inputs must resolve to integer column positions.
The following do not: * c(Gradeprek, dta$Gradek, dta$Grade1, dta$Grade2, dta$Grade3, dta$Grade4, dta$Grade5, dta$Grade6, ...
Penny for your thoughts on how I can resolve the situation.
You are mixing and matching the two syntax options for unite() and unite_() - you need to pick one and stick with it. In both cases, do not use dta$column: both functions take a data argument, so you don't need to re-specify which data frame your columns come from.
Option 1: NSE. The default non-standard evaluation expects bare column names: no quotes, and no c().
unite(dta, Grade, Gradeprek, Gradek, Grade1, Grade2, Grade3, ...,
Grade12, sep = "")
There are tricks you can do with this. For example, if all your Grade columns are in this order next to each other in your data frame, you could do
unite(dta, Grade, Gradeprek:Grade12, sep = "")
You could also use starts_with("Grade") to get all columns that begin with that string. See ?unite and its link to ?select for more details.
Option 2: Standard evaluation. You can use unite_() as a standard-evaluation alternative that expects column names in a character vector. In this case that has the advantage of letting you use paste0() to build the column names in the order you want:
unite_(dta, col = "Grade", c("Gradeprek", "Gradek", paste0("Grade", 1:12)), sep = "")
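For a concrete check, here is a minimal sketch of the NSE form on a two-column toy data frame (the real dta has many more Grade columns; this just mimics the pattern):

```r
library(tidyr)

# Toy stand-in for dta, with only two grade columns for brevity
toy <- data.frame(Gradeprek = c("A", ""), Gradek = c("", "B"),
                  stringsAsFactors = FALSE)

# Bare column names, no quotes, no c() -- the NSE form of unite()
united <- unite(toy, Grade, Gradeprek, Gradek, sep = "")
united$Grade  # "A" "B"
```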
I am pretty new to R, about 3 months in. When I was trying to run a regression, R gave me this error: Error: unexpected input in "reg1 <- lm(_". The variable I use has an underscore in its name, as do some other variables; this is the first time I have had a variable with an underscore, and I don't know whether R supports underscores in a regression. If it doesn't, how can I change the name?
As good practice, always begin variable/column names with letters (this is not explicitly the rule - you can technically start with a period - but it will save hassle). When dealing with data imported into R with predefined column names (or with data frames in general), you can rename columns in a data frame df as follows:
names(df)[names(df) == 'OldName'] <- 'NewName'
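For example, on a throwaway data frame (the names here are purely illustrative):

```r
# Toy data frame with an awkward imported column name
df <- data.frame(OldName = 1:3, x = 4:6)

# Rename by matching on the existing name
names(df)[names(df) == 'OldName'] <- 'NewName'
names(df)  # "NewName" "x"
```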
If you really need to, you can protect 'illegal' names with back-quotes (although I agree with other answers/comments that this is not good practice ...)
dd <- data.frame(`_y`=rnorm(10), x = 1:10, check.names=FALSE)
names(dd)
## [1] "_y" "x"
lm(`_y` ~ x, data = dd)
df <- read.csv(
text = '"2019-Jan","2019-Feb",
"3","1"',
check.names = FALSE
)
OK, so I use check.names = FALSE and now my column names are not syntactically valid. What are the practical consequences?
df
#> 2019-Jan 2019-Feb
#> 1 3 1 NA
And why is this NA appearing in my data frame? I didn't put that in my code. Or did I?
Here's the check.names man page for reference:
check.names logical. If TRUE then the names of the variables in the
data frame are checked to ensure that they are syntactically valid
variable names. If necessary they are adjusted (by make.names) so that
they are, and also to ensure that there are no duplicates.
The only consequence is that you need to escape or quote the names to work with them. You either string-quote and use standard evaluation with the [[ column subsetting operator:
df[['2019-Jan']]
… or you escape the identifier name with backticks (R confusingly also calls this quoting), and use the $ subsetting:
df$`2019-Jan`
Both work, and can be used freely (as long as they don’t lead to exceedingly unreadable code).
To make matters more confusing, R allows using '…' and "…" instead of `…` in certain contexts:
df$'2019-Jan'
Here, '2019-Jan' is not a character string as far as R is concerned! It’s an escaped identifier name.1
This last one is a really bad idea because it confuses names2 with character strings, which are fundamentally different. The R documentation advises against this. Personally I’d go further: writing 'foo' instead of `foo` to refer to a name should become a syntax error in future versions of R.
1 Kind of. The R parser treats it as a character string. In particular, both ' and " can be used, and are treated identically. But during the subsequent evaluation of the expression, it is treated as a name.
2 “Names”, or “symbols”, in R refer to identifiers in code that denote a variable or function parameter. As such, a name is either (a) a function name, (b) a non-function variable name, (c) a parameter name in a function declaration, or (d) an argument name in a function call.
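Both access forms can be verified on the data frame from the question (reconstructed here without the trailing comma):

```r
df <- read.csv(
  text = '"2019-Jan","2019-Feb"
"3","1"',
  check.names = FALSE
)

# String-quoted name with [[ ...
a <- df[['2019-Jan']]
# ... and backtick-escaped name with $
b <- df$`2019-Jan`
identical(a, b)  # TRUE
```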
The NA issue is unrelated to the names. read.csv expects no comma after the last column. Because your header line has a trailing comma, read.csv reads the empty string after "2019-Feb" as the name of a third column. There is no data for this column, so it is filled with NA.
Remove the extra comma and it reads properly. Of course, it may be easier to just remove the last column after using read.csv.
df <- read.csv(
text = '"2019-Jan","2019-Feb"
"3","1"',
check.names = FALSE
)
df
# 2019-Jan 2019-Feb
# 1 3 1
Consider df$foo, where foo is a column name: if the name is not syntactically valid, that form will not work without backticks.
As for the NA, it's a consequence of there being three columns in your first line and only two in your second.
I have about 30 problem columns within a data frame of over 100 columns. The file I am reading in stores its numbers as characters; in other words, 1300 is stored as 1,300 and R thinks it is a character.
I am trying to fix that by replacing the "," with nothing and turning each field into an integer. I do not want to use gsub on each affected column one at a time; I would rather store the affected column names in a vector and fix them all with one function call or loop.
I have tried using lapply, but am not sure what to put as the "x" variable.
Here is my function with the error below it
ItemStats_2014[intColList] <- lapply(ItemStats_2014[intColList],
as.integer(gsub(",", "", ItemStats_2014[intColList])) )
Error in [.data.table(ItemStats_2014, intColList) : When i is a
data.table (or character vector), the columns to join by must be
specified either using 'on=' argument (see ?data.table) or by keying x
(i.e. sorted, and, marked as sorted, see ?setkey). Keyed joins might
have further speed benefits on very large data due to x being sorted
in RAM.
The file I am reading in stores its numbers as characters [with commas as decimal separator]
Just directly read those columns in as decimal, not as string:
data.table::fread() understands alternative decimal separators via its dec argument, e.g. dec=','.
You might need to play with the fread(..., colClasses = ...) argument a bit to specify the integer columns:
myColClasses <- rep('character', 100)  # 'character', not 'string'; for example...
myColClasses[intColList] <- 'integer'
# ...any other colClass fixup as needed...
ItemStats_2014 <- fread('your.csv', colClasses=myColClasses)
This approach is simpler and faster and uses much less memory than reading as string, then converting later.
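A minimal sketch of the colClasses idea on inline text (the file name and intColList above are the asker's; this toy has only three columns, and data.table is assumed to be installed):

```r
library(data.table)

# Inline stand-in for the CSV; column b holds integers stored as text
dt <- fread(text = "a,b,c\nx,10,1.5\ny,20,2.5",
            colClasses = list(character = "a", integer = "b"))
class(dt$b)  # "integer"
```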
Try using dplyr::mutate_at() to select multiple columns and apply a transformation to them.
ItemStats_2014 <- ItemStats_2014 %>%
mutate_at(intColList, funs(as.integer(gsub(',', '', .))))
mutate_at selects columns from a list or using a dplyr selector function (see ?select_helpers) then applies one or more functions to each column. The . in gsub refers to each selected column that mutate_at passes to it. You can think of it as the x in function(x) ....
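Here is a runnable sketch on a toy data frame (intColList is stood in by a literal vector; newer dplyr deprecates funs(), so this uses the ~ lambda form, which mutate_at also accepts):

```r
library(dplyr)

# Toy data: numbers stored as characters with thousands separators
items <- data.frame(a = c("1,300", "2,450"), b = c("10", "20"),
                    keep = c("x", "y"), stringsAsFactors = FALSE)
intColList <- c("a", "b")

# Strip the commas, then convert each selected column to integer
items <- items %>%
  mutate_at(intColList, ~ as.integer(gsub(",", "", .)))
items$a  # 1300 2450
```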
I have a dataframe, wineSA, with two columns. One of these columns is populated with a character string, like so:
> summary(wineSA$description)
Length Class Mode
129971 character character
An example of a typical entry would be:
review <- "Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity."
I also have a function, that when applied to a string returns a sentiment score, like so:
> Getting_Sentimental(review)
[1] 0.4317412
I want to apply this function to every element in the wineSA$description column and add the sentiment score, as a separate column, to the data frame wineSA.
I have tried the following method, which uses apply(), but I get this message:
> wineSA$reviewSentiment <- apply(wineSA$description, FUN = Getting_Sentimental)
Error in apply(wineSA$description, FUN = Getting_Sentimental) :
dim(X) must have a positive length
I'm not sure apply() is appropriate here, but when I use either sapply() or lapply() it populates the new column with the same value for every row.
Is there a special way of handling functions on string characters? Is there anything I'm missing?
Thanks
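For what it's worth, element-wise application over a character vector usually goes through sapply() or vapply(); here is a sketch with a stand-in scoring function, since Getting_Sentimental itself is not shown:

```r
# Stand-in for Getting_Sentimental (the real function is not shown);
# it just has to return one number per input string
Getting_Sentimental <- function(s) nchar(s) / 100

wineSA <- data.frame(description = c("crisp and bright", "soft tannins"),
                     stringsAsFactors = FALSE)

# vapply applies the function to each element and guarantees one numeric each
wineSA$reviewSentiment <- vapply(wineSA$description, Getting_Sentimental,
                                 numeric(1), USE.NAMES = FALSE)
wineSA$reviewSentiment  # 0.16 0.12
```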
This is what my text file looks like:
1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12
All these values run together without delimiters, and each group of characters encodes a field. I am unable to figure out how to separate the values into columns and give each of the new columns a heading.
I used this code, however it does not seem to work.
birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth"))
Any help would be appreciated.
Assuming that you have a clear definition for every column, you can use regular expressions to solve this in no time.
From your column names and example data, I guess that the regular expression that matches each field is:
ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}
We can put all this together in a regular expression that captures each of these fields
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
Note: In R, we need to escape "\", which is why we write \\d instead of \d.
That said, here comes the code to solve the problem.
First, you need to read your strings
lines <- readLines("birthweighthw1.txt")
Now, we define our regular expression and use the function str_match from the package stringr to get your data into character matrix.
require(stringr)
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
captured <- str_match(string= lines, pattern= reg.expression)
You can check that the first column of the matrix contains the full matched text and the remaining columns contain the captured groups, so we can drop the first column
captured <- captured[,-1]
and transform it into a data.frame with appropriate column names
result <- as.data.frame(captured,stringsAsFactors = FALSE)
names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")
Since every column in result is of type character, you may want to convert each of them to a more appropriate type. For example:
require(dplyr)
result <- result %>% mutate(ethnic=as.factor(ethnic),
age=as.integer(age),
smoke=as.factor(smoke),
preweight=as.numeric(preweight),
delweight=as.numeric(delweight),
breastfed=as.factor(breastfed),
brthwght=as.integer(brthwght),
brthlngth=as.numeric(brthlngth)
)
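As a quick end-to-end check, the same pipeline on the first two sample lines (stringr assumed installed):

```r
library(stringr)

lines <- c("1241105.41129.97Y317052.03", "2282165.61187.63N364051.40")
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"

# Drop the first column (the full match), keep the eight captured groups
captured <- str_match(string = lines, pattern = reg.expression)[, -1]
result <- as.data.frame(captured, stringsAsFactors = FALSE)
names(result) <- c("ethnic", "age", "smoke", "preweight", "delweight",
                   "breastfed", "brthwght", "brthlngth")
result$age  # "24" "28"
```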