I am trying to use gsub on every column of a dataframe to remove some characters. I have tried using apply to do this, without success:
data<-apply(data,2, function(x) gsub("£","",data[x]))
returns error
Error in `[.data.frame`(data, x) : undefined columns selected
I know it works if I do
for(i in 1: length(data)){data[,i]<-gsub("£","",data[,i]) }
But why doesn't the apply call work?
The apply call fails because inside your anonymous function x is already the column's values, so data[x] tries to select columns named after those values, hence "undefined columns selected". Here's the closest reproducible example I could come up with. There might be a better / faster (vectorized) way if I thought a little harder, but since you asked for apply:
# mtcars was just the first dataset that came to mind;
# as.character() is needed here because mtcars' columns are numeric,
# but it should not be necessary for your (character) data
ds <- mtcars
ds[] <- sapply(ds, function(x) gsub("\\.", ",", as.character(x)))
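Applied to your actual problem, the same pattern would look something like this (a sketch, assuming data is a plain data frame of character columns):
# data[] <- ... keeps the data.frame structure; lapply hands each
# column's values straight to gsub, with no extra indexing
data[] <- lapply(data, function(x) gsub("£", "", x))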
Quick question.
I have a table (Data) with a series of columns having similar colnames
colnames(Data):
[Sum_Flag_x_30,Sum_Flag_y_60,...Sum_Flag_z_n].
Ideally I want to write a simple bit of code to convert all of these to numeric format (as.numeric).
Tried with:
Data[,grep("Sum_Flag",colnames(Data),value=T)] <- as.numeric(Data[,grep("Sum_Flag",colnames(Data),value=T)])
but it's not working and I get the following error:
Error in `[<-.data.table`(`*tmp*`, , grep("Sum_Flag", colnames(Data),
  : Supplied 25 items to be assigned to 55057 items of column
  'Sum_Flag_x_30'. If you wish to 'recycle' the RHS please use rep() to
  make this intent clear to readers of your code.
Any clue about this?
Thanks guys
Ciao
All you have to do is use apply: apply as.numeric() to every column by doing:
# margin = 2 means that you're applying a function column-wise.
new_data <- apply(Data, MARGIN = 2, as.numeric)
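Note that apply() returns a matrix rather than a data frame, and the error message shows Data is actually a data.table. An alternative sketch (assuming the data.table package is loaded) converts only the matching columns in place:
library(data.table)
# convert just the Sum_Flag_* columns, keeping Data a data.table
flag_cols <- grep("Sum_Flag", colnames(Data), value = TRUE)
Data[, (flag_cols) := lapply(.SD, as.numeric), .SDcols = flag_cols]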
I'm working with some data that has hundreds of covariates, so I decided to write some functions to make pre-processing much faster and cleaner (like scaling certain numeric variables). An important part of all of these functions is type-checking the columns before I apply a particular function to them.
Here is my function for scaling continuous columns:
# rm (vector): names of columns not to be scaled
scale.continuous <- function(df, rm = NULL) {
  cols <- setdiff(colnames(df), rm)
  for (col in cols) {
    if (is.numeric(df[, col])) {
      df[, col] <- as.numeric(scale(df[, col]))
    }
  }
  df
}
This works perfectly fine if I load the data frame using read.csv(), but the data I have is huge, so the speed boost of using read_csv() from readr/tidyverse is significant. Unfortunately, if I load my data using read_csv(), all of my functions break.
I narrowed down the issue to the type-checking, specifically when type-checking a column I am accessing by a string of its column name. Here's some code to demonstrate what I mean:
# When using read.csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] TRUE
# When using read_csv()
> is.numeric(df$col)
[1] TRUE
> is.numeric(df[,"col"])
[1] FALSE
I realized the issue here was that indexing the dataframe with a string the way I do above returns a tibble instead of a plain vector the way other methods of indexing do. What I don't understand is why this behavior exists, why is.numeric() (or any type-check) does not work on a one-column tibble, and, in general, why there is this difference in the way default and tidyverse dataframes are constructed. Also, it would be nice to know if there is a parameter I can change in read_csv() that will make this type of indexing behave the same as with a default dataframe.
I should mention that I realize there are probably better ways of writing this code (for example, just using df$"col" to index fixes the issue), but I still don't understand the root of the issue with my first approach. I am now working with much larger data sets that require much more involved pre-processing than I have been used to in the past, so I want as complete an understanding of the data structures I am using as possible.
Tibbles have a slightly different default behaviour than regular data frames when using the [ extraction operator, which can be a bit of a gotcha. Specifically, df[,"col"] on a tibble returns a one-column tibble, whereas on a regular data frame it returns a vector. So you need to use:
df[["col"]]
Or explicitly state that you want to coerce to the lowest dimension and do:
df[, "col", drop = TRUE]
From the documentation:
df[, j] returns a tibble; it does not automatically extract the column
inside. df[, j, drop = FALSE] is the default.
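Applied to the scale.continuous() function from the question, one tibble-safe rewrite (a sketch; [[ extracts a plain vector from both data frames and tibbles) would be:
scale.continuous <- function(df, rm = NULL) {
  cols <- setdiff(colnames(df), rm)
  for (col in cols) {
    # df[[col]] is always a vector, so the type-check works for tibbles too
    if (is.numeric(df[[col]])) {
      df[[col]] <- as.numeric(scale(df[[col]]))
    }
  }
  df
}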
I am having an issue with subsetting my Spark DataFrame.
I have a DataFrame called nfe, which contains a column called ITEM_PRODUTO that is formatted as a string. I would like to subset this DataFrame based on whether the item column contains the word "AREIA". I can easily subset the data based on an exact phrase:
nfe.subset1 <- subset(nfe, nfe$ITEM_PRODUTO == "AREIA LAVADA FINA")
nfe.subset2 <- subset(nfe, nfe$ITEM_PRODUTO %in% "AREIA")
However, what I would like is a subset of all rows that contain the word "AREIA" in the ITEM_PRODUTO column. When I try to use grep, though, I receive an error message:
nfe.subset3 <- subset(nfe, grep("AREIA", nfe$ITEM_PRODUTO))
# Error in as.character.default(x) :
# no method for coercing this S4 class to a vector
I've tried multiple iterations of syntax, and tried grepl as well, but nothing seems to work. It's probably a syntax error, but could anyone help me out?
Thanks!
Standard R functions cannot be applied to a SparkDataFrame. Use either like:
where(nfe, like(nfe$ITEM_PRODUTO, "%AREIA%"))
or rlike:
where(nfe, rlike(nfe$ITEM_PRODUTO, ".*AREIA.*"))
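For example, to build the subset the question asks for (a sketch, assuming SparkR is attached and nfe is a SparkDataFrame):
# rows whose ITEM_PRODUTO contains the substring "AREIA"
nfe.subset3 <- where(nfe, like(nfe$ITEM_PRODUTO, "%AREIA%"))
head(nfe.subset3)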
Hear me out. Consider an arbitrary case where the new column's elements do not require any information from other columns (which frustrates base $ assignment and mutate), and not every element in the new column is the same. Here is what I've tried:
df$rand<-rep(sample(1:100,1),nrow(df))
unique(df$rand)
[1] 58
and rest assured, nrow(df)>1. I think the correct solution might have to do with an apply function?
Your code repeats one single random number nrow(df) times. Try instead:
df$rand<-sample(1:100, nrow(df))
This samples without replacement from 1:100 nrow(df) times. Now this would give you an error if nrow(df)>100 because you would run out of numbers from 1:100 to sample. To make sure you don't get this error, you can instead sample with replacement:
df$rand<-sample(1:100, nrow(df), replace = TRUE)
If, however, you don't want any random numbers to repeat but would also like to prevent the error, you can do something like this:
df$rand<-sample(1:nrow(df), nrow(df))
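As a quick sanity check (a toy sketch with made-up data):
df <- data.frame(x = 1:5)
set.seed(42)                        # for reproducibility
df$rand <- sample(1:100, nrow(df))  # five distinct draws, not one value repeated
unique(df$rand)                     # length 5 now, not 1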
If I understand this correctly, I think this is pretty easily doable in dplyr or data.table.
For example, a dplyr solution on iris:
library(dplyr)
iris %>% mutate(rand = sample(n()))
I have converted a bunch of my columns from factor to numeric, but the code was very cumbersome. I had to individually convert each column, which ended up taking more time than it should. This is the code I used (only a short sample - I actually have many more columns):
city1$NY <-as.numeric(levels(city1$NY))[city1$NY]
city1$CHI<-as.numeric(levels(city1$CHI))[city1$CHI]
city1$LA <-as.numeric(levels(city1$LA))[city1$LA]
city1$ATL<-as.numeric(levels(city1$ATL))[city1$ATL]
city1$MIA<-as.numeric(levels(city1$MIA))[city1$MIA]
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
Where CityNames is just all of the columns for the data that I would like to convert.. But that doesn't work, as I get:
Error in as.numeric(levels(city1[, CityNames]))[city1[, CityNames]] :
invalid subscript type 'list'
Can anyone tell what I am doing wrong? Or is there just simply no easier way to do this task other than my long, annoying first method?
I was almost positive that instead of doing all of that, I could've just done:
city1[,CityNames]<-as.numeric(levels(city1[,CityNames]))[city1[,CityNames]]
So, a small change is needed:
city1[,CityNames] <- lapply(city1[,CityNames], function(x) as.numeric(levels(x))[x] )
The original approach didn't work because
levels are vector-specific, so it's not clear what myvec = levels(city1[,CityNames]) is.
myvec[ city1[,CityNames] ] throws an error because city1[,CityNames] is a data.frame and cannot be used to subset in this way.
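To see why as.numeric(levels(x))[x] works on a single factor, here is a toy example: a factor used as an index subscripts by its underlying integer codes.
x <- factor(c("10", "20", "10"))
levels(x)                  # "10" "20"
as.integer(x)              # 1 2 1    (the underlying level codes)
as.numeric(levels(x))[x]   # 10 20 10 (the codes index the numeric levels)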
This is typically what I do when I want to convert many columns in a data.frame to a different data type:
convNames <- c("NY", "CHI", "LA", "ATL", "MIA")
for(name in convNames) { city1[, name] <- as.numeric(as.character(city1[, name])) }
It's a nice two lines, and to bring another column into the coercing loop you just add its name to the convNames vector.
EDIT: Due to a factor issue, use the lapply method above.
I'm not sure if it is faster, but it may be, since the lookups may be what is slowing you down. Try as.numeric(as.character(x)) on each column, e.g. city1[CityNames] <- lapply(city1[CityNames], function(x) as.numeric(as.character(x))). The as.character() converts to the level values and then as.numeric() interprets those strings as their numeric equivalents. It may be significantly faster since it does not have to do any lookups into the levels vector for each value.
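A toy example of why the as.character() step matters (a sketch):
f <- factor(c("1.5", "2.5"))
as.numeric(f)                # 1 2     -- the underlying level codes
as.numeric(as.character(f))  # 1.5 2.5 -- the actual values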