I have a CSV datafile called test_20171122
Often, datasets that I work with were originally in Accounting or Currency format in Excel and later converted to a CSV file.
I am looking into the optimal way to clean data from an accounting format "$##,###" to a number "####" in R using gsub().
My trouble is in the iteration of gsub() across all columns of a dataset. My first instinct run gsub() on the whole dataframe (below) but it seems to alter the data in a counterproductive way.
gsub("\\$", "", test_20171122)
The following code is a for loop that seems to get the job done.
for (i in 1:length(test_20171122)){
clean1 <- gsub("\\$","",test_20171122[[1]])
clean2 <- gsub("\\,","",clean1)
test_20171122[,i] <- clean2
i = i + 1
}
I am trying to figure out the optimal way of cleaning a dataframe using gsub(). I feel like sapply() would work but it seems to break the structure of the dataframe when I run the following code:
test_20171122 <- sapply(test_20171122,function(x) gsub("\\$","",x))
test_20171122 <- sapply(test_20171122,function(x) gsub("\\,","",x))
You can use the following pattern in gsub: "[$,]"
Example:
df <- data.frame(
V1 = c("$1,234.56", " $ 23,456.70"),
V2 = c("$89,101,124", "15,234")
)
df
# V1 V2
# 1 $1,234.56 $89,101,124
# 2 $ 23,456.70 15,234
df[] <- lapply(df, function(x) as.numeric(gsub("[$,]", "", x)))
df
# V1 V2
# 1 1234.56 89101124
# 2 23456.70 15234
A solution using the purrr function map_df :
clean_df <- map_df(test_20171122, ~ gsub("[$,]", "", .x))
Related
Coming from Python + Pandas, I tried to convert a column in an R data.frame.
In Python/Pandas I'd do it like so: df[['weight']] = df[['weight']] / 1000
In R I came up with this:
convertWeight <- function(w) {
return(w/1000)
}
df$weight <- lapply(df$weight, convertWeight)
I know of the library dplyr which has the function mutate. That would allow me to transform columns as well.
Is there a different way to mutate a column without using the dplyr library? Something that comes close to Pandas way of doing it?
EDIT
Inspecting the values in df$weight I see this:
> df[1,]
date weight
1 1.552954e+12 84500.01
> typeof(df[1,1])
[1] "list"
> df$weight[1]
[[1]]
[1] 84500.01
> typeof(df$weight[1])
[1] "list"
This is neither a number, nor a char. Why list?
Btw: I have the data from a json import like so:
library(rjson)
data <- fromJSON(file = "~/Desktop/weight.json")
# convert json data to data.frame
df <- as.data.frame(do.call("rbind", json_data))
# select only two columns
df <- df[c("date", "weight")]
# now I converted the `date` value from posix to date
# and the weight value from milli grams to kg
# ...
Obviously I have a lot to learn about R.
df$weight = as.numeric(df$weight)
df$weight = df$weight / 1000
# If you would like to eliminate the display of scientific notation:
options(scipen = 999)
# If having difficulty still because of the list, try
df = as.data.frame(df)
I am a beginner in R and while trying to make some exercises I got stuck in one of them. My data.frame is as follow:
LanguageWorkedNow LanguageNextYear
Java; PHP Java; C++; SQL
C;C++;JavaScript; JavaScript; C; SQL
And I need to know the variables which are in LanguageNextYear and are not in LanguageWorkedNow, to set a list with the different ones.
Sorry if the question is duplicated, I'm quite new here and tried to find it, but with no success.
base R
Idea: mapply setdiff on strsplitted NextYear and WorkedNow, and then paste it using collapse=";":
df$New <- with(df, {
a <- mapply(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";"), SIMPLIFY = FALSE)
sapply(a, paste, collapse=";")
})
# SIMPLIFY = FALSE is needed in a general case, it doesn't
# affect the output in the example case
# Or if you use Map instead of mapply, that is the default, so
# it could also be...
df$New <- with(df,
sapply(Map(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";")),
paste, collapse=";"))
data
df <- read.table(text = "WorkedNow NextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=TRUE, stringsAsFactors=FALSE)
Here's a solution using purrr package:
df = read.table(text = "
LanguageWorkedNow LanguageNextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=T, stringsAsFactors=F)
library(purrr)
df$New = map2_chr(df$LanguageWorkedNow,
df$LanguageNextYear,
~{x1 = unlist(strsplit(.x, split=";"))
x2 = unlist(strsplit(.y, split=";"))
paste0(x2[!x2%in%x1], collapse = ";")})
df
# LanguageWorkedNow LanguageNextYear New
# 1 Java;PHP Java;C++;SQL C++;SQL
# 2 C;C++;JavaScript JavaScript;C;SQL SQL
For each row you get your columns and you create vectors of values (separated by ;). Then you check which values of NextYear vector don't exist in WorkedNow vector and you create a string based on / combining those values.
The map function family will help you apply your logic / function to each row. In our case we use map2_chr as we have two inputs (your two columns) and we excpet a string / character output.
One of the strings in my vector (df$location1) is the following:
Potomac, MD 20854\n(39.038266, -77.203413)
Rest of the data in the vector follow same pattern. I want to separate each component of the string into a separate data element and put it in new columns like: df$city, df$state, etc.
So far I have been able to isolate the lat. long. data into a separate column by doing the following:
df$lat.long <- gsub('.*\\\n\\\((.*)\\\)','\\\1',df$location1)
I was able to make it work by looking at other codes online but I don't fully understand it. I understand the regex pattern but don't understand the "\\1" part. Since I don't understand it in full I have been unable to use it to subset other parts of this same string.
What's the best way to subset data like this?
Is using regex a good way to do this? What other ways should I be looking into?
I have looked into splitting the string after a comma, subset using regex, using scan() function and to many other variations. Now I am all confused. Thx
We can also use the separate function from the tidyr package (part of the tidyverse package).
library(tidyverse)
# Create example data frame
dat <- data.frame(Data = "Potomac, MD 20854\n(39.038266, -77.203413)",
stringsAsFactors = FALSE)
dat
# Data
# 1 Potomac, MD 20854\n(39.038266, -77.203413)
# Separate the Data column
dat2 <- dat %>%
separate(Data, into = c("City", "State", "Zip", "Latitude", "Longitude"),
sep = ", |\\\n\\(|\\)|[[:space:]]")
dat2
# City State Zip Latitude Longitude
# 1 Potomac MD 20854 39.038266 -77.203413
You can try strsplit or data.table::tstrsplit(strsplit + transpose):
> x <- 'Potomac, MD 20854\n(39.038266, -77.203413)'
> data.table::tstrsplit(x, ', |\\n\\(|\\)')
[[1]]
[1] "Potomac"
[[2]]
[1] "MD 20854"
[[3]]
[1] "39.038266"
[[4]]
[1] "-77.203413"
More generally, you can do this:
library(data.table)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
The pattern ', |\\n\\(|\\)' tells tstrsplit to split by ", ", "\n(" or ")".
In case you want to sperate state and zip and cite names may contain spaces, You can try a two-step way:
# original split (keep city names with space intact)
df[c('city', 'state', 'lat', 'long')] <- tstrsplit(df$location1, ', |\\n\\(|\\)')
# split state and zip
df[c('state', 'zip')] <- tstrsplit(df$state, ' ')
Here is an option using base R
read.table(text= trimws(gsub(",+", " ", gsub("[, \n()]", ",", dat$Data))),
header = FALSE, col.names = c("City", "State", "Zip", "Latitude", "Longitude"),
stringsAsFactors = FALSE)
# City State Zip Latitude Longitude
#1 Potomac MD 20854 39.03827 -77.20341
So this process might be a little longer, but for me it makes things clear. As opposed to using breaks, below I identify values by using a specific regex for each value I want. I make a vector of regex to extract each value, a vector for the variable names, then use a loop to extract and create the dataframe from those vectors.
library(stringi)
library(dplyr)
library(purrr)
rgexVec <- c("[\\w\\s-]+(?=,)",
"[A-Z]{2}",
"\\d+(?=\\n)",
"[\\d-\\.]+(?=,)",
"[\\d-\\.]+(?=\\))")
varNames <- c("city",
"state",
"zip",
"lat",
"long")
map2_dfc(varNames, rgexVec, function(vn, rg) {
extractedVal <- stri_extract_first_regex(value, rg) %>% as.list()
names(extractedVal) <- vn
extractedVal %>% as_tibble()
})
\\1 is a back reference in regex. It is similar to a wildcard (*) that will grab all instances of your search term, not just the first one it finds.
I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1gb) and has multiple columns that contains white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If i understood you correctly then you want to remove all the white spaces from entire data frame, i guess the code which you are using is good for removing spaces in the column names.I think you should try this:
apply(myData, 2, function(x)gsub('\\s+', '',x))
Hope this works.
This will return a matrix however, if you want to change it to data frame then do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply and trimws function with both=TRUE can remove leading and trailing spaces but not inside it.Since there was no input data provided by OP, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
val1 = c("klm ", "gdfs", "123"),
num = 1:3,
num1 = 2:4,
stringsAsFactors = FALSE)
#situation: 1 (Using Base R), when we want to remove spaces only at the leading and trailing ends NOT inside the string values, we can use trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# situation: 2 (Using Base R) , when we want to remove spaces at every place in the dataframe in character columns (inside of a string as well as at the leading and trailing ends).
(This was the initial solution proposed using apply, please note a solution using apply seems to work but would be very slow, also the with the question its apparently not very clear if OP really wanted to remove leading/trailing blank or every blank in the data)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation1:
val val1 num num1
1: abc klm 1 2
2: kl m gdfs 2 3
3: dfsd 123 3 4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation2:
val val1 num num1
1: abc klm 1 2
2: klm gdfs 2 3
3: dfsd 123 3 4
Note the difference between the outputs of both situation, In row number 2: you can see that, with trimws we can remove leading and trailing blanks, but with regex solution we are able to remove every blank(s).
I hope this helps , Thanks
One possibility involving just dplyr could be:
data %>%
mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if(class(x)=="character") trimws(x) else(x)), stringsAsFactors=F)
It works for mixed numeric/charactert dataframes manipulates only character-columns.
You could use trimws function in R 3.2 on all the columns.
myData[,c(1)]=trimws(myData[,c(1)])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
If you're dealing with large data sets like this, you could really benefit form the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) set(df, j = j, value = df[[trimws(j)]])
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such file size. However have 2 options :
Use ffdply and ff base
Use ff and ffbase packages:
library(ff)
library(ffabse)
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
first.rows=1e4, next.rows=5e4)
x$split = as.ff(rep(seq(splits),each=nrow(x)/splits))
ffdfdply( x, x$split , BATCHBYTES=0,function(myData)
apply(myData,2,function(x)gsub('\\s+', '',x))
Use sed (my preference)
sed -ir "s/(\S)\s+(/S)/\1\2/g;s/^\s+//;s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame - you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes):
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
I think that a simple approach with sapply, also works, given a df like:
dat<-data.frame(S=LETTERS[1:10],
M=LETTERS[11:20],
X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
N=c(1:3,'4 ','5 ',6:10),
stringsAsFactors = FALSE)
You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N))
To get rid of the spaces on the numeic column simply convert to numeric with as.numeric or as.integer.
dat$N<-as.numeric(dat$N)
If you want to remove all the spaces, do:
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
And again use as.numeric on col N (ause sapply will convert it to character)
dat.b$N<-as.numeric(dat.b$N)
I have a tab-delimited file that looks like this:
AG-AG AG-CA AT-AA AT-AC AT-AG ...
0.0142180094786 0.009478672985781 0.0142180094786 0.4218009478672 ...
When I read this into R using read.table i get:
nc.tab <- read.table("./percent_splice_pair.tab", sep="\t", header=TRUE)
AG.AG AG.CA AT.AA AT.AC AT.AG ...
1 0.01421801 0.009478673 0.01421801 0.4218009 0.03317536 ...
This feels somewhat awkward for me, because I am much more used to working with data if its like this:
splice.pair counts
AG.AG 0.01421801
AG.CA 0.009478673
AT.AA 0.01421801
AT.AG 0.03317536
... ...
so far, my attempts at trying to coerce the table into a data frame like this (using data.frame()) have caused very odd results. I can't work out how to get each row of the table I have as a simple list, which I can then use as columns for the data frame. colnames(nc.tab) works for the headers but things like nc.tab[1,] just give me the table + headers again. Am I missing something obvious?
--edit--
Whilst #Andrie's answer gave me the data.frame I needed, I had to do a bit of extra work to coerse the counts values into numeric values so that they would work correctly in ggplot:
nc.tab <- read.table("./percent_splice_pair.tab", header=FALSE, sep="\t")
nc.mat <- t(as.matrix(nc.tab))
sp <- as.character(nc.tab[,2])
c <- as.numeric(as.character(nc.tab[,2]))
nc.dat <- data.frame(Splice.Pair=sp, count=c)
Splice.Pair count
1 AG-AG 0.014218009
2 AG-CA 0.009478673
3 AT-AA 0.014218009
4 AT-AC 0.421800948
5 AT-AG 0.033175355
You need the following to read and reshape your data in the way you want:
use read.table with the parameter header=FALSE
then transpose the data with the function t()
rename the columns
Here is the code:
x <- read.table(..., header=FALSE)
df <- as.data.frame(t(x))
names(df) <- c("splice.pair", "counts")
df
splice.pair counts
V1 AG-AG 0.0142180094786
V2 AG-CA 0.009478672985781
V3 AT-AA 0.0142180094786
V4 AT-AC 0.4218009478672