I have a 10 x ~15,000 data frame with salaries in column 9 and I'm trying to remove the $ from the start of each entry in that column.
This is the best version of what I have. I am new to R and far more familiar with other languages. If there is a way to run an operation on each element of a data frame (like cellfun in MATLAB, or a list comprehension in Python), that would make this far easier.
Based on my debugging attempts it seems like gsub just isn't doing anything, even outside a loop. Any suggestions from a more experienced user would be appreciated.
Thanks.
bbdat <- read.csv("C:/Users/musta/Downloads/BBs1.csv", header=TRUE, sep=",", dec=".", stringsAsFactors=FALSE)
i <- 0
for (val in bbdat[,9])
{
i = i+1
bbdat[i,9]<- gsub("$","",val)
}
The $ is a regex metacharacter that anchors the end of the string, so gsub("$", "", val) matches the empty position at the end of the string and changes nothing. To match it literally, either use fixed = TRUE (the default is FALSE), put it inside a character class ("[$]"), or escape it ("\\$"). As gsub/sub are vectorized, looping is not required:
bbdat[,9] <- gsub("$", "", bbdat[,9], fixed = TRUE)
If there is only a single instance of $ in each element, use sub instead of gsub (gsub stands for global substitution).
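A minimal sketch of the three options, using a made-up salaries vector rather than the asker's data. It shows that each option treats $ literally, while the plain "$" pattern leaves the input untouched:

```r
# Hypothetical sample values standing in for column 9 of bbdat
salaries <- c("$50,000", "$1,200,000", "75000")

# "$" as a regex anchors the end of the string, so nothing is removed
no_change <- gsub("$", "", salaries)

# Three ways to match a literal dollar sign
with_fixed <- sub("$", "", salaries, fixed = TRUE)  # pattern taken as a literal string
with_class <- sub("[$]", "", salaries)              # "$" inside a character class
with_esc   <- sub("\\$", "", salaries)              # escaped "$"

print(with_fixed)
```

All three calls work on the whole vector at once, so the same line can be applied directly to a data frame column.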
I just learned R and was trying to clean data for analysis with string manipulation, using the code below on the Amount_USD column of a table. I could not find out why the changes were not made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([0-9,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): a negative lookahead that makes sure the match does not start with 0; without it, the match would begin at the trailing 0 of \\xa0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
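To see what the lookahead contributes, here is a small sketch in base R (using perl = TRUE instead of stringr, so no package is needed). The helper name extract_first is made up for this illustration:

```r
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# Hypothetical helper: return the first match of `pattern` in each element.
# Every element of vec matches both patterns below, so the result has length 3.
extract_first <- function(pattern, x) {
  regmatches(x, regexpr(pattern, x, perl = TRUE))
}

# Without the lookahead the match also swallows the trailing 0 of "\xa0"
without_lookahead <- extract_first("[\\d,]+$", vec)

# With the lookahead the match cannot start at that 0
with_lookahead <- extract_first("(?!0)[\\d,]+$", vec)

print(without_lookahead)
print(with_lookahead)
```

Note that this relies on the amounts themselves never starting with 0.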
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one is the two unneeded backslashes in the string you compare against. This also leads to a counting error in your first str_sub(), where you should take the first 8 characters, not 10. Finally, the substring you keep should start at the character right after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
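As a rough sketch of the same idea on a standalone vector (the amounts vector below is made up to mirror the values in the question), this compares the regex form with a fixed = TRUE form that needs less escaping, and then converts to numeric:

```r
# Hypothetical values mirroring the question: a literal backslash-x prefix
amounts <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")

# Regex version: each literal backslash is written "\\\\" in the R string
stripped_regex <- sub("\\\\xc2\\\\xa0", "", amounts)

# fixed = TRUE version: only R's own string escaping is needed
stripped_fixed <- sub("\\xc2\\xa0", "", amounts, fixed = TRUE)

# Drop the thousands separators, then convert to numeric
as_number <- as.numeric(gsub(",", "", stripped_fixed, fixed = TRUE))

print(as_number)
```

Both versions assume the prefix really is the eight literal characters shown in the question, not an actual non-breaking space.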
I have a question about how to write a loop in R which checks whether a certain expression occurs in a string. I want to check if the expression "i-sty" occurs in my variable for each i in 1:200 and, if it does, return the corresponding i.
For example if we have “4-sty” the loop should give me 4 and if there is no “i-sty” in the variable it should give me . for the observation.
I used
for (i in 1:200){
datafram$height <- ifelse(grepl("i-sty", dataframe$Description), i, ".")
}
But it did not work. I literally only receive points. Attached I show a picture of the string variable.
"i-sty" is just a string with the letter i in it. To use a regex pattern with your variable i, you need to paste together a string, e.g., grepl(paste0(i, "-sty"), ...). I'd also recommend using NA rather than "." for the "else" result, so that the resulting height variable can be numeric.
for (i in 1:200){
dataframe$height <- ifelse(grepl(paste0(i, "-sty"), dataframe$Description), i, NA)
}
The above works syntactically, but not logically. You also have a problem that you are overwriting height each time through the loop - when i is 2, you erase the results from when i is 1, when i is 3, you erase the results from when i is 2... I think a better approach would be to extract the match, which is easy using stringr (but also possible in base). As a benefit, with the right pattern we can skip the loop entirely:
library(stringr)
dataframe$height = str_match(string = dataframe$Description, pattern = "([0-9]+)-sty")[, 2]
# might want to wrap in `as.numeric`
You use both datafram and dataframe. I've assumed dataframe is correct.
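For the "also possible in base" route mentioned above, a minimal sketch might look like this. The description vector is invented for illustration and stands in for dataframe$Description:

```r
# Hypothetical descriptions standing in for dataframe$Description
description <- c("nice 4-sty building", "no height given", "a 12-sty tower")

# regexec() reports the full match plus each capture group
matches <- regmatches(description, regexec("([0-9]+)-sty", description))

# Keep the first capture group, or NA when the pattern did not match
height <- as.numeric(vapply(
  matches,
  function(m) if (length(m) >= 2) m[2] else NA_character_,
  character(1)
))

print(height)
```

Because the digits are captured directly, no loop over 1:200 is needed and nothing is overwritten between iterations.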
Simple but frustrating problem here:
I've imported xls data into R, which unfortunately is the only current way to get the data - no csv option or direct DB query.
Anyway, I'm looking to do quite a bit of manipulation on this data set, but the variable names are extraordinarily messy, e.g. col2 = "\r\n\r\n\r\n\r\r XXXXXX YYYYY ZZZZZZ". Each column head has a name as messy as this example, and there are typically more than 15 columns per spreadsheet.
Ideally I'd like to clean up the names programmatically in R to avoid manually changing them in xls prior to importing. But I can't seem to find the right solution, since every R function I try requires the column name to be spelled out and set to a new variable. Spelling out the entire column name is tedious and impractical, and the special characters seem to break R's functions anyway.
Does anyone know how to do a global replace all names or a global rename by column number rather than name?
I've tried
replace()
for loops
lapply()
The first gsub removes non-printing characters, and trimws then trims whitespace off the ends. The second gsub collapses each run of a repeated character into a single character. No packages are used.
# test input
d <- data.frame("\r\r\r\r\r\n\n\n\n\n\n XXXX YYYY ZZZZ" = 0, check.names = FALSE)
names(d) <- trimws(gsub("[^[:print:]]", "", names(d)))
names(d) <- gsub("(.)\\1+", "\\1", names(d))
d
##   X Y Z
## 1     0
With R 3.6 or later you could consider replacing the first gsub line with this trimws line:
names(d) <- trimws(names(d), "both", "\\s")
If you want syntactic names add this after the above code:
names(d) <- make.names(names(d))
d
##   X.Y.Z
## 1     0
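The question also asks about renaming by column number. A small sketch of that, plus a variant that only collapses whitespace (the second gsub above would also shorten real names with doubled letters, e.g. "Address"), could look like this. The data frame and the helper name clean_names are made up for illustration:

```r
# Hypothetical data frame with one clean and one messy column name
d <- data.frame(id = 1, "\r\n\r\n\r\n\r\r XXXXXX  YYYYY ZZZZZZ" = 2,
                check.names = FALSE)

# Rename a single column by position, without spelling out the old name
by_position <- d
names(by_position)[2] <- "col2"

# Hypothetical helper: squeeze any run of whitespace (including \r and \n)
# into one space, then trim the ends
clean_names <- function(x) trimws(gsub("[[:space:]]+", " ", x))
names(d) <- clean_names(names(d))

print(names(by_position))
print(names(d))
```

Since names(d) is just a character vector, any vectorized string function can be applied to all column names in one assignment.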
I am interested to assign names to list elements. To do so I execute the following code:
file_names <- gsub("\\..*", "", doc_csv_names)
print(file_names)
"201409" "201412" "201504" "201507" "201510" "201511" "201604" "201707"
names(docs_data) <- file_names
In this case the name of the list element appears with ``.
docs_data$`201409`
However, in this case the name of the list element appears in the following way:
names(docs_data) <- paste("name", 1:8, sep = "")
docs_data$name1
How can I convert the gsub() result to receive the latter naming pattern without quotes?
gsub() and paste() seem to produce the same class() of object. What is the difference?
Both gsub and paste return character objects. They are different because they are completely different functions, which you seem to know based on their usage (gsub replaces instances of your pattern with a desired output in a string of characters, while paste just... pastes).
As for why you get the quotations, that has nothing to do with gsub and everything to do with the fact that you are naming variables/columns with numbers. Indeed, try
names(docs_data) <- paste(1:8)
and you'll see the same problem when using that naming pattern. Names are always stored as character strings, but a name made only of digits is not a syntactically valid R name: R has to be able to tell the number 1 from a variable called 1, so after $ such a name must be wrapped in backticks. For example, note that
> 1 <- 3
Error in 1 <- 3 : invalid (do_set) left-hand side to assignment
> "1" <- 3 #no problem!
So R is basically correcting that for you! This is not a problem when you name something using characters. Finally, an easy fix: just add a character in front of the numbers of your naming pattern, and you'll be able to invoke them without the quotations. For example:
file_names <- paste("file_",gsub("\\..*", "", doc_csv_names),sep="")
That should do the trick (or just change the "file_" into whatever you want, as long as it's not empty, because then you just have numbers left and the same problem)!
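A short sketch of the options, using a made-up two-element list in place of docs_data. It shows that digit-only names still work with [[ ]] or backticks, and what make.names() would turn them into:

```r
# Hypothetical stand-in for docs_data
docs_data <- list("first file", "second file")
names(docs_data) <- c("201409", "201412")

# Digit-only names are not syntactic, but they are still usable
via_brackets  <- docs_data[["201409"]]
via_backticks <- docs_data$`201409`

# make.names() shows what R considers a syntactic version of the name
syntactic <- make.names("201409")

# Prefixing the names makes them usable after $ without backticks
names(docs_data) <- paste0("file_", names(docs_data))
via_dollar <- docs_data$file_201409

print(syntactic)
```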
I have already tried to find a solution on the internet for my problem, and I have the feeling I know all the small pieces but I am unable to put them together. I'm quite new at programming so please be patient :D...
I have a (in reality much larger) text string which look like this:
string <- "Test test [438] test. Test 299, test [82]."
Now I want to replace the numbers in square brackets using a lookup table and get a new string back. There are other numbers in the text but I only want to change those in brackets and need to have them back in brackets.
lookup <- read.table(text = "
Number orderedNbr
1 270 1
2 299 2
3 82 3
4 314 4
5 438 5", header = TRUE)
I have made a pattern to find the square brackets using regular expressions
pattern <- "\\[(\\d+)\\]"
Now I looked all around and tried sub/gsub, lapply, merge, str_replace, but I find myself unable to make it work... I don't know how to tell R! to look what's inside the brackets, to look for that same argument in the lookup table and give out what's standing in the next column.
I hope you can help me, and that it's not a really stupid question. Thx
We can use a regex lookaround to match only numbers that are inside square brackets:
library(gsubfn)
gsubfn("(?<=\\[)(\\d+)(?=\\])", setNames(as.list(lookup$orderedNbr),
lookup$Number), string, perl = TRUE)
#[1] "Test test [5] test. Test 299, test [3]."
Or without regex lookaround, by pasting the square brackets onto each column of 'lookup':
gsubfn("(\\[\\d+\\])", setNames(as.list(paste0("[", lookup$orderedNbr,
"]")), paste0("[", lookup$Number, "]")), string)
Read your table of keys and values (a two-column table) into a data frame. If your source is a flat text file, you can easily use read.csv to obtain a data frame. In the example below, I hard-code a data frame with just two entries. Then I iterate over it and make the replacements in the input string.
df <- data.frame(keys=c(438, 82), values=c(5, 3))
string <- "Test test [438] test. Test [82]."
for (i in 1:nrow(df)) {
string <- gsub(paste0("(?<=\\[)", df$keys[i], "(?=\\])"), df$values[i], string, perl=TRUE)
}
string
[1] "Test test [5] test. Test [3]."
Note: As @Frank wisely pointed out, my solution would fail if the replacement for one marker (e.g. [438]) is itself a number that appears as another marker. That is, if replacing a key with a value produces yet another key, a later iteration can replace it again. If this is a possibility, I would suggest using markers for which this cannot happen. For example, you could remove the brackets after each replacement.
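To make that caveat concrete, here is a small sketch with a made-up lookup in which one replacement value (82) is also a key. Running the same kind of loop replaces the first marker twice:

```r
# Hypothetical lookup: the value for 438 collides with the key 82
df <- data.frame(keys = c(438, 82), values = c(82, 3))
string <- "Test [438] and [82]."

looped <- string
for (i in seq_len(nrow(df))) {
  pattern <- paste0("(?<=\\[)", df$keys[i], "(?=\\])")
  looped <- gsub(pattern, as.character(df$values[i]), looped, perl = TRUE)
}

# Intended result: "Test [82] and [3]."
# Actual result: [438] became [82] in pass 1 and then [3] in pass 2
print(looped)
```

Any approach that looks up all matches in a single pass avoids this, because replaced text is never scanned again.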
You can use regmatches<- with a pattern containing lookahead/lookbehind:
patt = "(?<=\\[)\\d+(?=\\])"
m = gregexpr(patt, string, perl=TRUE)
v = as.integer(unlist(regmatches(string, m)))
`regmatches<-`(string, m, value = list(lookup$orderedNbr[match(v, lookup$Number)]))
# [1] "Test test [5] test. Test 299, test [3]."
Or to modify the string directly, change the last line to the more readable...
regmatches(string, m) <- list(lookup$orderedNbr[match(v, lookup$Number)])
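The same idea can be wrapped into a small function. This is only a sketch: the name replace_bracketed and the choice to leave unknown numbers unchanged are assumptions, not part of the answer above. It works on a whole character vector and does the lookup in one pass per string:

```r
lookup <- data.frame(Number     = c(270, 299, 82, 314, 438),
                     orderedNbr = c(1, 2, 3, 4, 5))

# Hypothetical helper: replace each bracketed number found in `keys`
# with the matching entry of `values`; unknown numbers are left as they are
replace_bracketed <- function(text, keys, values) {
  m <- gregexpr("(?<=\\[)\\d+(?=\\])", text, perl = TRUE)
  found <- regmatches(text, m)
  regmatches(text, m) <- lapply(found, function(v) {
    new <- as.character(values[match(as.integer(v), keys)])
    missing_key <- is.na(new)
    new[missing_key] <- v[missing_key]  # keep numbers that have no lookup entry
    new
  })
  text
}

result <- replace_bracketed(
  c("Test test [438] test. Test 299, test [82].", "no markers", "[999]"),
  lookup$Number, lookup$orderedNbr
)
print(result)
```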