R - identify which columns contain currency data $ - r

I have a very large dataset with some columns formatted as currency, some numeric, some character. When reading in the data all currency columns are identified as factor and I need to convert them to numeric. The dataset it too wide to manually identify the columns. I am trying to find a programmatic way to identify if a column contains currency data (ex. starts with '$') and then pass that list of columns to be cleaned.
name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)
clean <- function(ttt){
as.numeric(gsub('[^a-zA-z0-9.]','', ttt))
}
sapply(emp_data, clean)
The issue in this example is that this sapply works on all columns resulting in the name column being replaced with NA. I need a way to programmatically identify just the columns that the clean function needs to be applied to.. in this example salary.

Using dplyr and stringr packages, you can use mutate_if to identify columns that have any string starting with a $ and then change the accordingly.
library(dplyr)
library(stringr)
emp_data %>%
mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
~as.numeric(str_replace_all(., '[$,]', '')))

Taking advantage of the powerful parsers the readr package offers out of the box:
my_parser <- function(col) {
# Try first with parse_number that handles currencies automatically quite well
res <- suppressWarnings(readr::parse_number(col))
if (is.null(attr(res, "problems", exact = TRUE))) {
res
} else {
# If parse_number fails, fall back on parse_guess
readr::parse_guess(col)
# Alternatively, we could simply return col without further parsing attempt
}
}
library(dplyr)
emp_data %>%
mutate(foo = "USD13.4",
bar = "£37") %>%
mutate_all(my_parser)
# name salary foo bar
# 1 john 23456.33 13.4 37
# 2 carl 45677.43 13.4 37
# 3 hank 76234.88 13.4 37

A base R option is to use startsWith to detect the dollar columns, and gsub to remove "$" and "," from the columns.
doll_cols <- sapply(emp_data, function(x) any(startsWith(as.character(x), '$')))
emp_data[doll_cols] <- lapply(emp_data[doll_cols],
function(x) as.numeric(gsub('\\$|,', '', x)))

Related

Create for each match an own csv

I've alreaedy looked for a similar question to mine but I couldn't found it.
Whenever you find one, let me know.
I have a df that looks like (in reality this df has three columns and more than 1000 rows):
Name, Value
Markus, 2
Markus, 4
Markus, 1
Caesar, 77
Caesar, 70
Brutus, 3
Nero, 4
Nero, 9
Nero, 10
Nero, 19
How can I create for each match (depending on Name) an own csv file?
I don't know how to approach this.
In this case the end result should be four csv files with the name form the Name column:
Markus.csv
Caesar.csv
Brutus.csv
Nero.csv
I'm thankful for any advice.
We can split your df by Name to create a list, iterate over it and write a csv for each group. File names are created using paste0
lst <- split(df, df$Name)
lapply(1:length(lst), function(i){
write.csv(lst[[i]], paste0(names(lst)[i], ".csv"), row.names = FALSE)
})
Similar idea to u/Jilber Urbina, you can use dplyr and tidyr to filter the dataset. The basic principle for both answers is to iterate the write.csv() function over the list/vector of names via lapply() function.
library(dplyr)
library(tidyr)
lapply(unique(df$Name), function(i) {
write.csv(
x = df %>% dplyr::filter(Name==i) %>% # filter dataset by Name
select(-Name), # if you want to remove the Name column from output
file = paste0(i,'.csv'),
row.names = FALSE
)
})

Renaming column but capturing number

I would like to rename columns that have the following pattern:
x1_test_thing
x2_test_thing
into:
test_thing_1
test_thing_2
Essentially moving the number to the end while removing the string (x) before it.
If a solution using dplyr and using rename_at() could be suggested that would be great.
If there is a better way to do it i'd definitely love to see it.
Thanks!
Using dplyr::rename_at function to rename the name of columns:
first parameter is your datafame.
second parameter is selecting the columns matching your requirements.
third parameter is choosing the function to processing the name of columns, and the parameters of function to processing strings put after comma.
For example, gsub is a function to processing strings. Originally, the usage of the function is gsub(x=c("x1_test_thing","x2_test_thing"),pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1"), but the correct usage is gsub,pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1" when you use this function at dplyr::rename_at.
pattern = "^.(.)_(test_thing)" means using the pair of parentheses to capture the second character, such as "1", and the characters after underline to the end of string, such as "test_thing" ,from the name of columns.
replacement = "\\2_\\1" means concatenating strings at the second pair of parentheses (test_thing) ,such as "test_thing", a underline"_" ,with strings at the first pair of parentheses (.), such as "1", to get desired output ,and finally replace the name of columns with the string processed.
library(dplyr)
# using test data for example
test <- data.frame(x1_test_thing=c(0),x2_test_thing=c(0))
rename_at(test, vars(contains("test_thing")),gsub,pattern = "^.(.)_(test_thing)",replacement = "\\2_\\1")
We can use readr::parse_number to extract the number from the string.
library(dplyr)
df <- data.frame(x1_test_thing= 1:5, x2_test_thing= 5:1)
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)))
# test_thing_1 test_thing_2
#1 1 5
#2 2 4
#3 3 3
#4 4 2
#5 5 1
To rename only those column that have 'test_thing' in them -
df %>%
rename_with(~paste0('test_thing_', readr::parse_number(.)),
contains('test_thing'))
In base R,
names(df) <- sub('x(\\d+)_.*', 'test_thing_\\1', names(df))
df

How do I make the name of a list the name of its column?

I have a set of 270 RNA-seq samples, and I have already subsetted out their expected counts using the following code:
for (i in 1:length(sample_ID_list)) {
assign(sample_ID_list[i], subset(get(sample_file_list[i]), select = expected_count))
}
Where sample_ID_list is a character list of each sample ID (e.g., 313312664) and sample_file_list is a character list of the file names for each sample already in my environment to be subsetted (e.g., s313312664).
Now, the head of one of those subsetted samples looks like this:
> head(`308087571`)
# A tibble: 6 x 1
expected_count
<dbl>
1 129
2 8
3 137
4 6230.
5 1165.
6 0
The problem is I want to paste all of these lists together to make a counts dataframe, but I will not be able to differentiate between columns without their sample ID as the column name instead of expected_count.
Does anyone know of a good way to go about this? Please let me know if you need any more details!
You can use:
dplyr::bind_rows(mget(sample_ID_list), .id = name)
If we want to name the list, loop over the list, extract the first element of 'expected_count' ('nm1') and use that to assign the names of the list
nm1 <- sapply(sample_file_list, function(x) x$expected_count[1])
names(sample_file_list) <- nm1
Or from sample_ID_list
do.call(rbind, Map(cbind, mget(sample_ID_list), name = sample_ID_list))
Update
Based on the comments, we can loop over the 'sample_file_list and 'sample_ID_list' with Map and rename the 'expected_count' column with the corresponding value from 'sample_ID_list'
sample_file_list2 <- Map(function(dat, nm) {
names(dat)[match('expected_count', names(dat))] <- nm
dat
}, sample_file_list, sample_ID_list)
Or if we need a package solution,
library(data.table)
rbindlist(mget(sample_ID_list), idcol = name)
Update:
Thank you all so much for your help. I had to update my for loop as follows:
for (i in 1:length(sample_ID_list)) {
assign(sample_ID_list[i], subset(get(sample_file_list[i]), select = expected_count))
data<- get(sample_ID_list[i])
colnames(data)<- sample_ID_list[i]
assign(sample_ID_list[i],data)
}
and was able to successfully reassign the names!

use dplyr to combine columns of data.frame when column names are not known

Given a tibble:
library(tibble)
myTibble <- tibble(a = letters[1:3], b = c(T, F, T), c = 1:3)
I can use transmute to paste the columns, separated by '.':
> library(dplyr)
> transmute(myTibble, concat = paste(a, b, c, sep = "."))
# A tibble: 3 x 1
concat
<chr>
1 a.TRUE.1
2 b.FALSE.2
3 c.TRUE.3
If I want to use the above transmute statement in a function that receives a tibble, I won't know the names of the tibble or the number of columns ahead of time. What dplyr syntax would allow me to paste all columns in a tibble separated by a '.'?
Please note, I can do this with something like:
> apply(myTibble, 1, paste, collapse = ".")
[1] "a.TRUE.1" "b.FALSE.2" "c.TRUE.3"
but I am trying to understand dplyr better. So, yes, this is a specific problem I am trying to solve, but I am also stumped as to why I can't solve it with dplyr, which means there is something key about dplyr column selection I don't yet understand, and I'd like to learn, so that is why I'm asking specifically about a dplyr solution.
With a little trial and error:
colNames_as_symbols <- syms(names(myTibble))
transmute(myTibble, concat = paste(!!!colNames_as_symbols, sep = '.'))
Here was the hint that put me on to the solution... From the documentation for !!!:
The big-bang operator !!! forces-splice a list of objects. The
elements of the list are spliced in place, meaning that they each
become one single argument.
vars <- syms(c("height", "mass"))
Force-splicing is equivalent to supplying the elements separately:
starwars %>% select(!!!vars)
starwars %>% select(height, mass)
In fact, the entire documentation entitled "Force parts of an expression" is fascinating reading. It can be accessed by issuing ?qq_show

Removing Whitespace From a Whole Data Frame in R

I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1gb) and has multiple columns that contains white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If i understood you correctly then you want to remove all the white spaces from entire data frame, i guess the code which you are using is good for removing spaces in the column names.I think you should try this:
apply(myData, 2, function(x)gsub('\\s+', '',x))
Hope this works.
This will return a matrix however, if you want to change it to data frame then do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply and trimws function with both=TRUE can remove leading and trailing spaces but not inside it.Since there was no input data provided by OP, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
val1 = c("klm ", "gdfs", "123"),
num = 1:3,
num1 = 2:4,
stringsAsFactors = FALSE)
#situation: 1 (Using Base R), when we want to remove spaces only at the leading and trailing ends NOT inside the string values, we can use trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# situation: 2 (Using Base R) , when we want to remove spaces at every place in the dataframe in character columns (inside of a string as well as at the leading and trailing ends).
(This was the initial solution proposed using apply, please note a solution using apply seems to work but would be very slow, also the with the question its apparently not very clear if OP really wanted to remove leading/trailing blank or every blank in the data)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation1:
val val1 num num1
1: abc klm 1 2
2: kl m gdfs 2 3
3: dfsd 123 3 4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation2:
val val1 num num1
1: abc klm 1 2
2: klm gdfs 2 3
3: dfsd 123 3 4
Note the difference between the outputs of both situation, In row number 2: you can see that, with trimws we can remove leading and trailing blanks, but with regex solution we are able to remove every blank(s).
I hope this helps , Thanks
One possibility involving just dplyr could be:
data %>%
mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if(class(x)=="character") trimws(x) else(x)), stringsAsFactors=F)
It works for mixed numeric/charactert dataframes manipulates only character-columns.
You could use trimws function in R 3.2 on all the columns.
myData[,c(1)]=trimws(myData[,c(1)])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
If you're dealing with large data sets like this, you could really benefit form the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) set(df, j = j, value = df[[trimws(j)]])
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such file size. However have 2 options :
Use ffdply and ff base
Use ff and ffbase packages:
library(ff)
library(ffabse)
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
first.rows=1e4, next.rows=5e4)
x$split = as.ff(rep(seq(splits),each=nrow(x)/splits))
ffdfdply( x, x$split , BATCHBYTES=0,function(myData)
apply(myData,2,function(x)gsub('\\s+', '',x))
Use sed (my preference)
sed -ir "s/(\S)\s+(/S)/\1\2/g;s/^\s+//;s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame - you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes):
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
I think that a simple approach with sapply, also works, given a df like:
dat<-data.frame(S=LETTERS[1:10],
M=LETTERS[11:20],
X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
N=c(1:3,'4 ','5 ',6:10),
stringsAsFactors = FALSE)
You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N))
To get rid of the spaces on the numeic column simply convert to numeric with as.numeric or as.integer.
dat$N<-as.numeric(dat$N)
If you want to remove all the spaces, do:
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
And again use as.numeric on col N (ause sapply will convert it to character)
dat.b$N<-as.numeric(dat.b$N)

Resources