Assign names to list elements without tilted quotes (backticks) - r

I want to assign names to list elements. To do so, I run the following code:
file_names <- gsub("\\..*", "", doc_csv_names)
print(file_names)
"201409" "201412" "201504" "201507" "201510" "201511" "201604" "201707"
names(docs_data) <- file_names
In this case the name of the list element has to be written with backticks:
docs_data$`201409`
However, in this case the name of the list element appears in the following way:
names(docs_data) <- paste("name", 1:8, sep = "")
docs_data$name1
How can I convert the gsub() result so that I get the latter naming pattern, without the backticks?
gsub() and paste() seem to produce objects of the same class(). What is the difference?

Both gsub and paste return character objects. They are different because they are completely different functions, which you seem to know based on their usage (gsub replaces instances of your pattern with a desired output in a string of characters, while paste just... pastes).
As for why you get the backticks, that has nothing to do with gsub and everything to do with the fact that you are naming list elements with numbers. Indeed, try
names(docs_data) <- paste(1:8)
and you'll see the same problem when you use that naming pattern. R does not want to be confused about whether 1 is the number 1 or the name of a variable (how could 1 refer to both?), so a name that starts with a digit is not a syntactically valid name and has to be quoted whenever you type it in code. The name itself is stored as the plain character string "1"; the backticks only appear when the name is typed or printed. For example, note that
> 1 <- 3
Error in 1 <- 3 : invalid (do_set) left-hand side to assignment
> "1" <- 3 #no problem!
So the backticks are just R's way of quoting such a name. This is not a problem when the name starts with a letter. Finally, an easy fix: just add a letter or a word in front of the numbers in your naming pattern, and you'll be able to refer to the elements without the backticks. For example:
file_names <- paste("file_", gsub("\\..*", "", doc_csv_names), sep = "")
This should do the trick (you can change "file_" to whatever you want as long as it starts with a letter; if it is empty you are left with just the numbers and the same problem)!
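A minimal sketch of the difference, using made-up stand-ins for doc_csv_names and docs_data (the real objects are not shown in the question). It also illustrates that `[[` accepts the numeric-looking name as an ordinary string, so backticks are only needed with `$`:

```r
# Hypothetical stand-ins for the objects in the question
doc_csv_names <- c("201409.csv", "201412.csv", "201504.csv")
docs_data <- list(1:3, 4:6, 7:9)

# Names that start with a digit: `$` needs backticks, `[[` does not
names(docs_data) <- gsub("\\..*", "", doc_csv_names)
by_dollar  <- docs_data$`201409`
by_bracket <- docs_data[["201409"]]

# Prefixing a letter makes the names syntactic, so `$` works bare
names(docs_data) <- paste0("file_", names(docs_data))
by_prefix <- docs_data$file_201409

print(names(docs_data))
print(identical(by_dollar, by_bracket) && identical(by_bracket, by_prefix))
```

All three ways of accessing the element return the same object; only the way the name has to be typed changes.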

Related

String Manipulation in R data frames

I just learnt R and am trying to clean the Amount_USD column of a table for analysis, using the string manipulation code below. I could not work out why no changes were made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one is the two unneeded backslashes in your matching string. This also leads to a counting error in your first str_sub(), where you should be taking the first 8 characters, not 10. Finally, you should take the substring starting at the character right after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
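As a quick check of that last step, here is a small sketch on toy data. It assumes the prefix really is the literal text \xc2\xa0 as displayed in the question; if the file instead contains an actual non-breaking space character, the pattern would need to match that character rather than the escaped text. The Amount_num column name is only for illustration.

```r
# Toy data: literal "\xc2\xa0" text prefix on some rows (assumed layout)
csv_file2 <- data.frame(
  Amount_USD = c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "19,350,000"),
  stringsAsFactors = FALSE
)

# Drop the literal prefix and the thousands separators, then convert
stripped <- gsub("(\\\\xc2\\\\xa0)|,", "", csv_file2$Amount_USD)
csv_file2$Amount_num <- as.numeric(stripped)

# Rows that failed to convert would show up as NA and deserve a look
bad_rows <- which(is.na(csv_file2$Amount_num))

print(csv_file2$Amount_num)
print(length(bad_rows))
```

Writing the result to a new column first makes it easy to compare against the original strings before overwriting them.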

R: Loop should return numeric element from string

I have a question about how to write a loop in R that checks whether a certain expression occurs in a string. I want to check whether the expression “i-sty” occurs in my variable for each i in 1:200 and, if it does, return the corresponding i.
For example, if we have “4-sty” the loop should give me 4, and if there is no “i-sty” in the variable it should give me . for that observation.
I used
for (i in 1:200){
datafram$height <- ifelse(grepl("i-sty", dataframe$Description), i, ".")
}
But it did not work: I only get dots. Attached is a picture of the string variable.
"i-sty" is just a string with the letter i in it. To use a regex pattern with your variable i, you need to paste together a string, e.g., grepl(paste0(i, "-sty"), ...). I'd also recommend using NA rather than "." for the "else" result - that way the resulting height variable can be numeric.
for (i in 1:200){
  dataframe$height <- ifelse(grepl(paste0(i, "-sty"), dataframe$Description), i, NA)
}
The above works syntactically, but not logically. You also have a problem that you are overwriting height each time through the loop - when i is 2, you erase the results from when i is 1, when i is 3, you erase the results from when i is 2... I think a better approach would be to extract the match, which is easy using stringr (but also possible in base). As a benefit, with the right pattern we can skip the loop entirely:
library(stringr)
dataframe$height = str_match(string = dataframe$Description, pattern = "([0-9]+)-sty")[, 2]
# might want to wrap in `as.numeric`
You use both datafram and dataframe. I've assumed dataframe is correct.
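For the base R route mentioned above, a minimal sketch might look like the following. The Description values are invented for illustration, and the pattern assumes the number sits directly in front of "-sty":

```r
# Toy data; the Description values are made up for illustration
dataframe <- data.frame(
  Description = c("Nice 4-sty building", "12-sty tower", "no height given"),
  stringsAsFactors = FALSE
)

pattern <- "([0-9]+)-sty"
has_height <- grepl(pattern, dataframe$Description)

# Start with NA everywhere, then fill in only the rows that matched
dataframe$height <- NA_real_
dataframe$height[has_height] <- as.numeric(
  sub(paste0(".*?", pattern, ".*"), "\\1",
      dataframe$Description[has_height], perl = TRUE)
)

print(dataframe$height)
```

Filling only the matching rows avoids the coercion warnings that as.numeric would raise on descriptions with no "-sty" in them.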

Replacing Column Names without typing the actual column names

Simple but frustrating problem here:
I've imported xls data into R, which unfortunately is the only current way to get the data - no csv option or direct DB query.
Anyway, I'm looking to do quite a bit of manipulation on this data set, but the variable names are extraordinarily messy, e.g. col2 = "\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ" (you get the gist). Each column header is as messy as this example, and there are typically >15 columns per spreadsheet.
Ideally I'd like to program a name manipulation solution via R to avoid manually changing the names in xls prior to importing. But I can't seem to find the right solution, since every R function I try/check requires the column name be spelled out and set to a new variable. Spelling out the entire column name is tedious and impractical and plus the special characters seem to break R's functions anyways.
Does anyone know how to do a global replace of all names, or a global rename by column number rather than by name?
I've tried replace(), for loops, and lapply().
Remove non-printing characters in the first gsub. Then trim whitespace off the ends using trimws and replace consecutive strings of the same character with just one of them in the second gsub. No packages are used.
# test input
d <- data.frame("\r\r\r\r\r\n\n\n\n\n\n XXXX YYYY ZZZZ" = 0, check.names = FALSE)
names(d) <- trimws(gsub("[^[:print:]]", "", names(d)))
names(d) <- gsub("(.)\\1+", "\\1", names(d))
d
##   X Y Z
## 1     0
With R 3.6 or later you could consider replacing the first gsub line with this trimws line:
names(d) <- trimws(names(d), "both", "\\s")
If you want syntactic names add this after the above code:
names(d) <- make.names(names(d))
d
##   X.Y.Z
## 1     0
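The same idea can be wrapped in a small helper and applied to every header at once, and a single column can be renamed by position. This is only a sketch: clean_names is a hypothetical helper name, the headers are invented, and it collapses whitespace rather than repeated letters.

```r
# Hypothetical messy headers, similar to the ones described in the question
d <- data.frame(1, 2, 3)
names(d) <- c("\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ", " second\tcol ", "third")

# Hypothetical helper: clean every header in one pass
clean_names <- function(x) {
  x <- gsub("[^[:print:]]", " ", x)      # control characters -> space
  x <- gsub("[[:space:]]+", " ", x)      # collapse runs of whitespace
  make.names(trimws(x), unique = TRUE)   # trim, then make syntactic
}

names(d) <- clean_names(names(d))

# Renaming by position needs no knowledge of the old name
names(d)[3] <- "col3"

print(names(d))
```

Because names(d) is just a character vector, any vectorised string function can be applied to all headers without spelling out a single old name.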

How to separate a text file into columns

This is what my text file looks like:
1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12
All these values run together with no separators, and each character indicates something. I cannot figure out how to separate the values into columns and give each of the new columns a heading.
I used this code, but it does not seem to work.
birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth"))
Any help would be appreciated.
Assuming that you have a clear definition for every column, you can use regular expressions to solve this in no time.
From your column names and example data, I guess that the regular expression that matches each field is:
ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}
We can put all this together in a regular expression that captures each of these fields
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
Note: In R strings we need to escape "\", which is why we write \\d instead of \d.
That said, here comes the code to solve the problem.
First, you need to read your strings
lines <- readLines("birthweighthw1.txt")
Now, we define our regular expression and use the function str_match from the package stringr to get your data into a character matrix.
require(stringr)
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
captured <- str_match(string= lines, pattern= reg.expression)
You can check that the first column in the matrix contains the text matched, and the following columns the data captured. So, we can get rid of the first column
captured <- captured[,-1]
and transform it into a data.frame with appropriate column names
result <- as.data.frame(captured,stringsAsFactors = FALSE)
names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")
Now every column in result is of type character; you can convert each of them to another type. For example:
require(dplyr)
result <- result %>% mutate(ethnic = as.factor(ethnic),
                            age = as.integer(age),
                            smoke = as.factor(smoke),
                            preweight = as.numeric(preweight),
                            delweight = as.numeric(delweight),
                            breastfed = as.factor(breastfed),
                            brthwght = as.integer(brthwght),
                            brthlngth = as.numeric(brthlngth)
)
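The capture step can also be tried on a few sample lines without any packages. This sketch uses regexec() and regmatches() as the base R counterpart of str_match, with three lines copied from the question instead of reading the file. The third line has only two digits before its first decimal point, so the pattern as written still matches but splits the leading digits differently from the other rows, which may not be what the file intends. The single-digit-age check is just one assumed way of flagging such rows.

```r
lines <- c("1241105.41129.97Y317052.03",
           "2282165.61187.63N364051.40",
           "328192.89117.68Y295050.51")   # shorter line from the sample

reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"

# regexec() + regmatches() is the base R counterpart of str_match()
m <- regmatches(lines, regexec(reg.expression, lines, perl = TRUE))
stopifnot(all(lengths(m) == 9))          # full match + 8 captured fields

result <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                        stringsAsFactors = FALSE)
names(result) <- c("ethnic", "age", "smoke", "preweight", "delweight",
                   "breastfed", "brthwght", "brthlngth")

# Flag rows whose age came out as a single digit for manual review
suspicious <- which(nchar(result$age) < 2)

print(result)
print(suspicious)
```

Checking a handful of rows this way before converting types helps confirm that the guessed field widths really fit every line.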

R encoding ASCII backtick

The names in my list are printed with backticks, as shown below. Earlier lists did not have them.
$`1KG_1_14106394`
[1] "PRDM2"
$`1KG_20_16729654`
[1] "OTOR"
I found out that this is an 'ASCII grave accent' and read the R page on encoding types. However, what should I do about it? I am not clear whether this will affect some functions (such as matching on list names), or whether it is OK to leave it as is.
Encoding help page: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Encoding.html
Thanks!
My understanding (and I could be wrong) is that the backticks are just a means of escaping a list name which otherwise could not be used if left unescaped. One example of using backticks to refer to a list name is the case of a name containing spaces:
lst <- list(1, 2, 3)
names(lst) <- c("one", "after one", "two")
If you wanted to refer to the list element containing the number two, you could do this using:
lst[["after one"]]
But if you want to use the dollar sign notation you will need to use backticks:
lst$`after one`
Update:
I just poked around on SO and found a post that discusses a question similar to yours. Backticks in variable names are necessary whenever a variable name would be forbidden otherwise. A name containing spaces is one example, but so is using a reserved keyword as a variable name.
if <- 3 # forbidden because if is a keyword
`if` <- 3 # allowed, because we use backticks
In your case:
Your list has an element whose name begins with a number. The rules for variable names in R are pretty lax, but names cannot begin with a number, hence:
1KG_1_14106394 <- 3 # fails, variable name starts with a number
KG_1_14106394 <- 3 # allowed, starts with a letter
`1KG_1_14106394` <- 3 # also allowed, since escaped in backticks
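On the question of whether matching on list names is affected, a small sketch with the two names from the question suggests it is not: the backticks are not part of the stored names. The variable names below are only for illustration.

```r
lst <- list("PRDM2", "OTOR")
names(lst) <- c("1KG_1_14106394", "1KG_20_16729654")

# The stored names are plain strings with no backticks in them
stored <- names(lst)
has_backtick <- grepl("`", stored, fixed = TRUE)

# Matching on names and `[[` both work with the plain strings
found <- "1KG_20_16729654" %in% names(lst)
gene  <- lst[["1KG_20_16729654"]]

# make.names() shows what a syntactic version of each name would look like
syntactic <- make.names(stored)

print(has_backtick)
print(found)
print(gene)
print(syntactic)
```

So the names can be left as they are; converting them with make.names() is only worthwhile if typing the backticks with `$` becomes a nuisance.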
