Reading tables with strings containing the separator in R

I have a text file with data I want to read, but one of the columns is a messy "code" field that contains the same character used as the separator. Take the following set as an example:
number:string
1:abc?][
2:def:{+
The second data line would split into 3 fields, yet there are only 2 column names.
Is there any strategy to read this dataset?

Read the file a line at a time, split each line into two parts on the first ":", and bind the pieces into a data frame. The column names get lost, but you can put them back on again easily enough. You need the stringr and readr packages:
> do.call(rbind.data.frame, stringr::str_split(readr::read_lines("seps.csv", skip = 1), ":", 2))
  c..1....2.. c..abc.......def.....
1           1                abc?][
2           2                def:{+
Here is the same thing with stringr and readr attached for readability, and with the names fixed:
> library(stringr)
> library(readr)
> d = do.call(rbind.data.frame, str_split(read_lines("seps.csv", skip = 1), ":", 2))
> names(d) = str_split(read_lines("seps.csv", n_max = 1), ":", 2)[[1]]
> d
  number string
1      1 abc?][
2      2 def:{+
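If you would rather avoid packages, the same split-on-the-first-separator idea works in base R. A minimal sketch (not from the answer above), using sub() to peel off the text before and after the first ":":
lines <- readLines("seps.csv")[-1]                  # drop the header row
d <- data.frame(number = sub(":.*$", "", lines),    # everything before the first ":"
                string = sub("^[^:]*:", "", lines)) # everything after the first ":"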

Good old regular expressions should help you with this.
Read the txt file:
df <- read.table("pathToFile/fileName.txt", header = TRUE)
Because the data lines contain no whitespace, the data.frame will have a single column, which we then split based on a pattern.
Create the columns:
df$number <- sub("([0-9]+):.*", "\\1", df[, 1])  # keep the digits before the first ":"
df$string <- sub("[0-9]+:(.*)", "\\1", df[, 1])  # keep everything after the first ":"
df <- df[, c("number", "string")]
View(df)
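Another base-R sketch along the same lines (not from either answer): since sub() replaces only the first match, you can turn just the first ":" on each line into a tab and then let read.table handle the rest.
lines <- readLines("seps.csv")
read.table(text = sub(":", "\t", lines), sep = "\t", header = TRUE)
#   number string
# 1      1 abc?][
# 2      2 def:{+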

Related

Replacing commas with periods in all columns in R

I am trying to write a function to change the numbers in a column so that each ',' becomes a '.'.
I then want to convert that column into a double.
For example, here is some data to copy/paste:
df <- data.frame(a=c("1,1", "2,2"))
First do some pattern matching and replacement with gsub, then coerce the column to class double.
df <- data.frame(a=c("1,1", "2,2"))
df$b <- as.double(gsub(",", ".", df$a))
df
    a   b
1 1,1 1.1
2 2,2 2.2
To make this a function:
my_function <- function(x) {
  as.double(gsub(",", ".", x))
}
my_function(df$a)
[1] 1.1 2.2
Since your data are in csv format, just read the file in with the decimal character set to a comma:
read.csv("my_file.csv", dec = ",")
No function needed. Check out ?read.table for details. (Note that if the fields themselves are separated by commas, a comma decimal is ambiguous; files with comma decimals usually use ";" as the field separator, which read.csv2 assumes by default.)
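A self-contained sketch of that idea (the file contents here are illustrative), using read.csv2, which defaults to sep = ";" and dec = ",":
tmp <- tempfile(fileext = ".csv")
writeLines(c("a;b", "1,1;x", "2,2;y"), tmp)
read.csv2(tmp)
#     a b
# 1 1.1 x
# 2 2.2 y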

How to transform long names into shorter (two-part) names

I have a character vector of long names, each consisting of several words joined by a dot as the delimiter.
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
"Betula.nana.L.",
"Salix.glauca.L.",
"Salix.jenisseensis..F..Schmidt..Flod.",
"Vaccinium.minus..Lodd...Worosch")
The names vary in length, but only the first two words of each name matter.
My goal is to get names of up to 7 characters: the first 3 characters of each of the first two words, with a dot as the separator between them.
These examples come close to what I need, but I do not know how to adapt the code to my case:
R How to remove characters from long column names in a data frame and
how to append names to "column names" of the output data frame in R?
What should I do to get the output names to look like this?
x <- c("Dus.fru",
"Bet.nan",
"Sal.gla",
"Sal.jen",
"Vac.min")
Any help would be appreciated.
You can do the following:
gsub("(\\w{1,3})[^\\.]*\\.(\\w{1,3}).*", "\\1.\\2", x)
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
First we capture up to 3 word characters ((\\w{1,3})), then skip anything that is not a dot ([^\\.]*), match a dot (\\.), and capture up to 3 word characters again ((\\w{1,3})). Finally, .* consumes whatever comes after. We then keep only the two captured groups, separated by a dot: \\1.\\2.
Split on dot, substring 3 characters, then paste back together:
sapply(strsplit(x, ".", fixed = TRUE), function(i) {
  paste(substr(i[1], 1, 3), substr(i[2], 1, 3), sep = ".")
})
# [1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"
Here is a less elegant solution than kath's, but one that is a bit easier to read if you are not a regex expert.
# Your data
x <- c("Duschekia.fruticosa..Rupr...Pouzar",
       "Betula.nana.L.",
       "Salix.glauca.L.",
       "Salix.jenisseensis..F..Schmidt..Flod.",
       "Vaccinium.minus..Lodd...Worosch")
# A function that takes three characters from the first two words and merges them
cleaner_fun <- function(ugly_string) {
  words <- strsplit(ugly_string, "\\.")[[1]]
  short_words <- substr(words, 1, 3)
  new_name <- paste(short_words[1:2], collapse = ".")
  return(new_name)
}
# Testing the function (USE.NAMES = FALSE drops the long input names from the result)
sapply(x, cleaner_fun, USE.NAMES = FALSE)
[1] "Dus.fru" "Bet.nan" "Sal.gla" "Sal.jen" "Vac.min"

Recalculate a column in data.frame

Coming from Python + Pandas, I tried to convert a column in an R data.frame.
In Python/Pandas I'd do it like so: df[['weight']] = df[['weight']] / 1000
In R I came up with this:
convertWeight <- function(w) {
return(w/1000)
}
df$weight <- lapply(df$weight, convertWeight)
I know of the library dplyr which has the function mutate. That would allow me to transform columns as well.
Is there a different way to mutate a column without using the dplyr library? Something that comes close to Pandas way of doing it?
EDIT
Inspecting the values in df$weight I see this:
> df[1,]
          date   weight
1 1.552954e+12 84500.01
> typeof(df[1,1])
[1] "list"
> df$weight[1]
[[1]]
[1] 84500.01
> typeof(df$weight[1])
[1] "list"
This is neither a number, nor a char. Why list?
Btw: I have the data from a json import like so:
library(rjson)
data <- fromJSON(file = "~/Desktop/weight.json")
# convert json data to data.frame
df <- as.data.frame(do.call("rbind", data))
# select only two columns
df <- df[c("date", "weight")]
# now I converted the `date` value from posix to date
# and the weight value from milligrams to kg
# ...
Obviously I have a lot to learn about R.
df$weight = as.numeric(df$weight)
df$weight = df$weight / 1000
# If you would like to eliminate the display of scientific notation:
options(scipen = 999)
# If having difficulty still because of the list, try
df = as.data.frame(df)
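To answer the "Why list?" part directly: do.call("rbind", ...) on a list of lists produces a data.frame whose columns are themselves lists, which is why typeof() reports "list". A minimal sketch of flattening those list-columns first, after which ordinary vectorized arithmetic just works:
df[] <- lapply(df, function(col) unlist(col, use.names = FALSE))  # flatten list-columns
df$weight <- df$weight / 1000                                     # plain vectorized division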

R - identify which columns contain currency data $

I have a very large dataset with some columns formatted as currency, some numeric, some character. When reading in the data, all currency columns are identified as factor, and I need to convert them to numeric. The dataset is too wide to identify the columns manually. I am trying to find a programmatic way to identify whether a column contains currency data (e.g. starts with '$') and then pass that list of columns on to be cleaned.
name <- c('john','carl', 'hank')
salary <- c('$23,456.33','$45,677.43','$76,234.88')
emp_data <- data.frame(name,salary)
clean <- function(ttt){
  as.numeric(gsub('[^A-Za-z0-9.]', '', ttt))
}
sapply(emp_data, clean)
The issue in this example is that sapply applies clean to all columns, so the name column is replaced with NA. I need a way to programmatically identify just the columns that the clean function should be applied to; in this example, salary.
Using the dplyr and stringr packages, you can use mutate_if to identify columns in which any string starts with a $ and then change them accordingly.
library(dplyr)
library(stringr)
emp_data %>%
  mutate_if(~any(str_detect(., '^\\$'), na.rm = TRUE),
            ~as.numeric(str_replace_all(., '[$,]', '')))
Taking advantage of the powerful parsers the readr package offers out of the box:
my_parser <- function(col) {
  # Try parse_number first; it handles currencies automatically quite well
  res <- suppressWarnings(readr::parse_number(col))
  if (is.null(attr(res, "problems", exact = TRUE))) {
    res
  } else {
    # If parse_number fails, fall back on parse_guess
    readr::parse_guess(col)
    # Alternatively, we could simply return col without further parsing attempts
  }
}
library(dplyr)
emp_data %>%
  mutate(foo = "USD13.4",
         bar = "£37") %>%
  mutate_all(my_parser)
#   name   salary  foo bar
# 1 john 23456.33 13.4  37
# 2 carl 45677.43 13.4  37
# 3 hank 76234.88 13.4  37
A base R option is to use startsWith to detect the dollar columns, and gsub to remove "$" and "," from the columns.
doll_cols <- sapply(emp_data, function(x) any(startsWith(as.character(x), '$')))
emp_data[doll_cols] <- lapply(emp_data[doll_cols],
                              function(x) as.numeric(gsub('\\$|,', '', x)))
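The same detection vector also lets you reuse the clean() helper from the question, applying it only where needed:
emp_data[doll_cols] <- lapply(emp_data[doll_cols], clean)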

Removing Whitespace From a Whole Data Frame in R

I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1 GB) and has multiple columns that contain white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
  mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
  mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
  mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If I understood you correctly, you want to remove all the white space from the entire data frame; I suspect the code you are using is better suited to removing spaces from the column names. I think you should try this:
apply(myData, 2, function(x) gsub('\\s+', '', x))
Hope this works.
This will return a matrix, however; if you want to turn it back into a data frame, do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply with the trimws function (whose default is which = "both") removes leading and trailing spaces, but not spaces inside a string. Since the OP provided no input data, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
                 val1 = c("klm ", "gdfs", "123"),
                 num = 1:3,
                 num1 = 2:4,
                 stringsAsFactors = FALSE)
# Situation 1 (base R): remove spaces only at the leading and trailing ends,
# NOT inside the string values, using trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# Situation 2 (base R): remove spaces everywhere in the character columns
# (inside the strings as well as at the leading and trailing ends)
(This was the solution initially proposed using apply; note that it seems to work but would be very slow. It is also not entirely clear from the question whether the OP wanted to remove only leading/trailing blanks or every blank in the data.)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
                                     function(x) gsub('\\s+', '', x))
## Situation 1 (data.table): removing only leading and trailing blanks
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from Situation 1:
    val val1 num num1
1:  abc  klm   1    2
2: kl m gdfs   2    3
3: dfsd  123   3    4
## Situation 2 (data.table): removing every blank, inside the strings as well as leading/trailing
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from Situation 2:
    val val1 num num1
1:  abc  klm   1    2
2:  klm gdfs   2    3
3: dfsd  123   3    4
Note the difference between the two outputs in row 2: trimws removes only the leading and trailing blanks, while the regex solution removes every blank.
I hope this helps, thanks!
One possibility involving just dplyr could be:
data %>%
  mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
  mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
  mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
  mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted, this changes all types to character. In my work, I first determine the types present in the original and the conversions required; after trimming, I re-apply the types needed, as sketched below.
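A minimal sketch of that round trip (assuming only numeric, integer, and factor columns need restoring):
classes <- vapply(df, function(x) class(x)[1], character(1))  # record original types
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
for (j in names(df)) {
  df[[j]] <- switch(classes[j],
                    numeric = as.numeric(df[[j]]),
                    integer = as.integer(df[[j]]),
                    factor  = factor(df[[j]]),
                    df[[j]])  # otherwise leave as character
}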
If your original types are OK, apply the solution from MarkusN: https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x), stringsAsFactors = FALSE)
It works for mixed numeric/character data frames and manipulates only the character columns.
You could use the trimws function, available since R 3.2, on all the columns:
myData[, 1] <- trimws(myData[, 1])
You can loop this over all the columns in your dataset, as sketched below. It has good performance with large datasets as well.
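A sketch of that loop over every column, trimming only the character ones:
for (j in seq_along(myData)) {
  if (is.character(myData[[j]])) myData[[j]] <- trimws(myData[[j]])
}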
If you're dealing with large data sets like this, you could really benefit from the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) set(df, j = j, value = trimws(df[[j]]))
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such file sizes. However, you have 2 options:
Use ff and ffbase:
library(ff)
library(ffbase)
x <- read.csv.ffdf(file = your_file, header = TRUE, VERBOSE = TRUE,
                   first.rows = 1e4, next.rows = 5e4)
splits <- 10  # number of chunks; the value here is illustrative
x$split <- as.ff(rep(seq(splits), each = nrow(x) / splits))
ffdfdply(x, x$split, BATCHBYTES = 0, FUN = function(myData)
  apply(myData, 2, function(x) gsub('\\s+', '', x)))
Use sed (my preference):
sed -i -r "s/(\S)\s+(\S)/\1\2/g; s/^\s+//; s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame, be aware that using apply will clobber them, because it outputs a matrix in which all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk, you can loop through the columns of your data.frame and trim the white space off only the columns of class factor or character (maintaining your data classes):
for (i in names(mydata)) {
  if (class(mydata[, i]) %in% c("factor", "character")) {
    mydata[, i] <- trimws(mydata[, i])
  }
}
I think that a simple approach with sapply also works, given a df like:
dat <- data.frame(S = LETTERS[1:10],
                  M = LETTERS[11:20],
                  X = c(rep("A:A", 3), "?", "A:A ", rep("G:G", 5)),
                  Y = c(rep("T:T", 4), "T:T ", rep("C:C", 5)),
                  Z = c(rep("T:T", 4), "T:T ", rep("C:C", 5)),
                  N = c(1:3, '4 ', '5 ', 6:10),
                  stringsAsFactors = FALSE)
You will notice that dat$N becomes class character because of '4 ' and '5 ' (you can check with class(dat$N)).
To get rid of the spaces in the numeric column, simply convert it with as.numeric or as.integer:
dat$N <- as.numeric(dat$N)
If you want to remove all the spaces, do:
dat.b <- as.data.frame(sapply(dat, trimws), stringsAsFactors = FALSE)
And again use as.numeric on column N (because sapply will have converted it to character):
dat.b$N <- as.numeric(dat.b$N)
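Alternatively (a sketch, not from the original answer), utils::type.convert can re-derive all the column types in one pass after trimming, instead of converting columns one by one:
dat.b[] <- lapply(dat.b, type.convert, as.is = TRUE)  # N is re-inferred as integer; the rest stay character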
