Related
I have few columns which I need to convert to factors
for cols in ['col1','col2']:
df$cols<-as.factor(as.character(df$cols))
Error
for cols in ['col1','col2']:
Error: unexpected symbol in "for cols"
> df$cols<-as.factor(as.character(df$cols))
Error in `$<-.data.frame`(`*tmp*`, cols, value = integer(0)) :
replacement has 0 rows, data has 942
The syntax showed also use the python for loop and python list. Instead it would be a vector of strings in `R
for (col in c('col1','col2')) {
df[[col]] <- factor(df[[col]])
}
NOTE: here we use [[ instead of $ and the braces {}. The factor can be directly applied instead of as.character wrapping
Or with lapply where it can be done easily (without using any packages)
df[c('col1', 'col2')] <- lapply(df[c('col1', 'col2')], factor)
Or in dplyr, where it can be done more easily
library(dplyr)
df <- df %>%
mutate_at(vars(col1, col2), factor)
In complement to #akrun solution, with data.table, this can be done easily:
library(data.table)
setDT(df)
df[,c("col1","col2") := lapply(.SD, function(c) as.factor(as.character(c))), .SDcols = c("col1","col2")]
Note that df is updated by reference (:=) so no need for reassignment
I am trying to loop over columns in data.table package in R. I have been having trouble trying to get the for loop to accurately input the column when I subset the datatable.
The goal of my task is to get the number of rows for each subset of the datatable when the column condition of "==1" is met.
Here is my code:
data <- data.table(va=c(1,0,1), vb=c(1,0,0), vc=c(1,1,1))
names <- c("va", "vc")
for (col in names) {
print(nrow(data[col == 1,]))
print(col)
}
Here is the output I get
[1] 0
[1] "va"
[1] 0
[1] "vc"
Is there something I am missing or a better way of doing this?
You can use colSums, which is much simpler and faster than looping.
dt <- data.table(va=c(1,0,1), vb=c(1,0,0), vc=c(1,1,1))
col.names <- c("va", "vc")
dt[, colSums(.SD==1), .SDcols = col.names]
# va vc
# 2 3
Note: I changed your object names to dt and col.names because it is not good practice to use base functions as names.
If you really want to use a for loop (I don't recommend it, but for educational purposes...) you can fix it using get to use the values of the column rather than the column name itself
for (col in col.names) {
dt[get(col) == 1, print(.N)]
}
I am a beginner in R and while trying to make some exercises I got stuck in one of them. My data.frame is as follow:
LanguageWorkedNow LanguageNextYear
Java; PHP Java; C++; SQL
C;C++;JavaScript; JavaScript; C; SQL
And I need to know the variables which are in LanguageNextYear and are not in LanguageWorkedNow, to set a list with the different ones.
Sorry if the question is duplicated, I'm quite new here and tried to find it, but with no success.
base R
Idea: mapply setdiff on strsplitted NextYear and WorkedNow, and then paste it using collapse=";":
df$New <- with(df, {
a <- mapply(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";"), SIMPLIFY = FALSE)
sapply(a, paste, collapse=";")
})
# SIMPLIFY = FALSE is needed in a general case, it doesn't
# affect the output in the example case
# Or if you use Map instead of mapply, that is the default, so
# it could also be...
df$New <- with(df,
sapply(Map(setdiff, strsplit(NextYear, ";"), strsplit(WorkedNow, ";")),
paste, collapse=";"))
data
df <- read.table(text = "WorkedNow NextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=TRUE, stringsAsFactors=FALSE)
Here's a solution using purrr package:
df = read.table(text = "
LanguageWorkedNow LanguageNextYear
Java;PHP Java;C++;SQL
C;C++;JavaScript JavaScript;C;SQL
", header=T, stringsAsFactors=F)
library(purrr)
df$New = map2_chr(df$LanguageWorkedNow,
df$LanguageNextYear,
~{x1 = unlist(strsplit(.x, split=";"))
x2 = unlist(strsplit(.y, split=";"))
paste0(x2[!x2%in%x1], collapse = ";")})
df
# LanguageWorkedNow LanguageNextYear New
# 1 Java;PHP Java;C++;SQL C++;SQL
# 2 C;C++;JavaScript JavaScript;C;SQL SQL
For each row you get your columns and you create vectors of values (separated by ;). Then you check which values of NextYear vector don't exist in WorkedNow vector and you create a string based on / combining those values.
The map function family will help you apply your logic / function to each row. In our case we use map2_chr as we have two inputs (your two columns) and we excpet a string / character output.
I'm having issues removing just the right amount of information from the following data:
18,14,17,2,9,8
17,17,17,14
18,14,17,2,1,1,1,1,9,8,1,1,1
I'm applying !duplicate in order to remove the duplicates.
SplitFunction <- function(x) {
b <- unlist(strsplit(x, '[,]'))
c <- b[!duplicated(b)]
return(paste(c, collapse=","))
}
I'm having issues removing only consecutive duplicates. The result below is what I'm getting.
18,14,17,2,9,8
17,14
18,14,17,2,1,9,8
The data below is what I want to obtain.
18,14,17,2,9,8
17,14
18,14,17,2,1,9,8,1
Can you suggest a way to perform this? Ideally a vectorized approach...
Thanks,
Miguel
you can use rle function to sovle this question.
xx <- c("18,14,17,2,9,8","17,17,17,14","18,14,17,2,1,1,1,1,9,8,1,1,1")
zz <- strsplit(xx,",")
sapply(zz,function(x) rle(x)$value)
And you can refer to this link.
How to remove/collapse consecutive duplicate values in sequence in R?
We can use rle
sapply(strsplit(x, ','), function(x) paste(inverse.rle(within.list(rle(x),
lengths <- rep(1, length(lengths)))), collapse=","))
#[1] "18,14,17,2,9,8" "17,14" "18,14,17,2,1,9,8,1"
data
x <- c('18,14,17,2,9,8', '17,17,17,14', '18,14,17,2,1,1,1,1,9,8,1,1,1')
Great rle-answers. This is just to add an alternative without rle. This gives a list of numeric vectors but can of course easily expanded to return strings:
numbers <- c("18,14,17,2,9,8", "17,17,17,14", "14,17,18,2,9,8,1", "18,14,17,11,8,9,8,8,22,13,6", "14,17,2,9,8", "18,14,17,2,1,1,1,1,1,1,1,1,9,8,1,1,1,1")
result <- sapply(strsplit(numbers, ","), function(x) x[x!=c(x[-1],Inf)])
print(result)
I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1gb) and has multiple columns that contains white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If i understood you correctly then you want to remove all the white spaces from entire data frame, i guess the code which you are using is good for removing spaces in the column names.I think you should try this:
apply(myData, 2, function(x)gsub('\\s+', '',x))
Hope this works.
This will return a matrix however, if you want to change it to data frame then do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply and trimws function with both=TRUE can remove leading and trailing spaces but not inside it.Since there was no input data provided by OP, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
val1 = c("klm ", "gdfs", "123"),
num = 1:3,
num1 = 2:4,
stringsAsFactors = FALSE)
#situation: 1 (Using Base R), when we want to remove spaces only at the leading and trailing ends NOT inside the string values, we can use trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# situation: 2 (Using Base R) , when we want to remove spaces at every place in the dataframe in character columns (inside of a string as well as at the leading and trailing ends).
(This was the initial solution proposed using apply, please note a solution using apply seems to work but would be very slow, also the with the question its apparently not very clear if OP really wanted to remove leading/trailing blank or every blank in the data)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation1:
val val1 num num1
1: abc klm 1 2
2: kl m gdfs 2 3
3: dfsd 123 3 4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation2:
val val1 num num1
1: abc klm 1 2
2: klm gdfs 2 3
3: dfsd 123 3 4
Note the difference between the outputs of both situation, In row number 2: you can see that, with trimws we can remove leading and trailing blanks, but with regex solution we are able to remove every blank(s).
I hope this helps , Thanks
One possibility involving just dplyr could be:
data %>%
mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if(class(x)=="character") trimws(x) else(x)), stringsAsFactors=F)
It works for mixed numeric/charactert dataframes manipulates only character-columns.
You could use trimws function in R 3.2 on all the columns.
myData[,c(1)]=trimws(myData[,c(1)])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
If you're dealing with large data sets like this, you could really benefit form the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) set(df, j = j, value = df[[trimws(j)]])
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such file size. However have 2 options :
Use ffdply and ff base
Use ff and ffbase packages:
library(ff)
library(ffabse)
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
first.rows=1e4, next.rows=5e4)
x$split = as.ff(rep(seq(splits),each=nrow(x)/splits))
ffdfdply( x, x$split , BATCHBYTES=0,function(myData)
apply(myData,2,function(x)gsub('\\s+', '',x))
Use sed (my preference)
sed -ir "s/(\S)\s+(/S)/\1\2/g;s/^\s+//;s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame - you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes):
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
I think that a simple approach with sapply, also works, given a df like:
dat<-data.frame(S=LETTERS[1:10],
M=LETTERS[11:20],
X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
N=c(1:3,'4 ','5 ',6:10),
stringsAsFactors = FALSE)
You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N))
To get rid of the spaces on the numeic column simply convert to numeric with as.numeric or as.integer.
dat$N<-as.numeric(dat$N)
If you want to remove all the spaces, do:
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
And again use as.numeric on col N (ause sapply will convert it to character)
dat.b$N<-as.numeric(dat.b$N)