Using gsub() in a data.table - r

I have a big data table (about 20,000 rows). One of its columns contains integers from 1 to 6.
I also have a character vector of car models (6 models).
I'm trying to replace each integer with the corresponding car model (just 2 in this example):
gsub("1",paste0(labels[1]),Models)
gsub("2",paste0(labels[2]),Models)
...
"Models" is the name of a column.
labels <- c("Altima","Maxima")
After fighting with it for 12+ hours, gsub() still isn't working.
Sample data:
mydata<-data.table(replicate(1,sample(1:6,10000,rep=TRUE)))
labels<-c("altima","maxima","sentra","is","gs","ls")

I don't think you need gsub here. What you are describing is a factor variable.
If your data is
mydata <- data.table(replicate(1,sample(1:6,1000,rep=TRUE)))
models <- c("altima","maxima","sentra","is","gs","ls")
you could just do
mydata[[1]] <- factor(mydata[[1]], levels=seq_along(models), labels=models)
If you really wanted a character rather than a factor, then
mydata[[1]] <- models[ mydata[[1]] ]
would also do the trick. Both of these require that the numbers are consecutive integers starting at 1.
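If the codes are not consecutive integers starting at 1, a lookup through match() is one way around that restriction. This is a minimal sketch in base R; the codes 10, 20, 30 and the keys vector are made up for illustration and are not part of the question's data.

```r
# Hypothetical codes that do not run from 1 to n
codes  <- c(10, 30, 20, 10, 30)
keys   <- c(10, 20, 30)                      # assumed code values, in label order
models <- c("altima", "maxima", "sentra")

# Factor version: levels are the codes, labels are the model names
as_factor <- factor(codes, levels = keys, labels = models)

# Character version: match() gives the position of each code in keys
as_char <- models[match(codes, keys)]
```

Codes that do not appear in keys come back as NA in both versions, which makes unexpected values easy to spot.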

You could try using factor() in the following way; it worked for me on your test data. Assuming that the name of the first column in mydata is V1 (the default):
mydata$V1 <- factor(mydata$V1, labels=models)

Related

How to convert all factor variables into numeric variables (in multiple data frames at once)?

I have n data frames, each corresponding to data from a city.
There are 3 variables per data frame and currently they are all factor variables.
I want to transform all of them into numeric variables.
I have started by creating a vector with the names of all the data frames in order to use in a for loop.
cities <- as.vector(objects())
for ( i in cities){
i <- as.data.frame(lapply(i, function(x) as.numeric(levels(x))[x]))
}
Although the code runs and I get no error, I don't see any changes to my data frames: all three variables remain factor variables.
The strangest thing is that when doing them one by one (as below) it works:
df <- as.data.frame(lapply(df, function(x) as.numeric(levels(x))[x]))
What you're essentially trying to do is modify the type of the field if it is a factor (to a numeric type). One approach using purrr would be:
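library(purrr)
# cities holds object names, so fetch the data frames themselves with mget().
# as.numeric() on a factor returns its integer codes, so convert via character.
map(mget(cities), ~ modify_if(.x, is.factor, function(col) as.numeric(as.character(col))))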
Note that modify() in itself is like lapply() but it doesn't change the underlying data structure of the objects you are modifying (in this case, dataframes). modify_if() simply takes a predicate as an additional argument.
For anyone who's interested in my question, I worked out the answer:
for ( i in cities){
assign(i, as.data.frame(lapply(get(i), function(x) as.numeric(levels(x))[x])))
}
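The original loop changed nothing because i held only the name of each data frame (a character string), and assigning to i never touches the object with that name. An alternative to assign()/get() is to keep the data frames in a named list. The following is a sketch under assumptions: the city names and values are hypothetical, and factor_to_numeric is a helper introduced here for illustration.

```r
# Hypothetical city data frames whose factor columns hold numeric text
city_data <- list(
  paris = data.frame(a = factor(c("1.5", "2.5")), b = factor(c("10", "20"))),
  rome  = data.frame(a = factor(c("3", "4")),     b = factor(c("30", "40")))
)

# Convert a factor to the numbers its labels spell out; leave other types alone
factor_to_numeric <- function(x) {
  if (is.factor(x)) as.numeric(levels(x))[x] else x
}

# d[] <- ... keeps each element a data frame while replacing its columns
city_data <- lapply(city_data, function(d) {
  d[] <- lapply(d, factor_to_numeric)
  d
})
```

With the data in a list, each city stays reachable as city_data$paris and so on, and no objects in the workspace are overwritten by name.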

R - Removing rows in data frame by list of column values

I have two data frames, one containing the predictors and one containing the different categories I want to predict. Both of the data frames contain a column named geoid. Some of the rows of my predictors contain NA values, and I need to remove these.
After extracting the geoid value of the rows containing NA values, and removing them from the predictors data frame I need to remove the corresponding rows from the categories data frame as well.
It seems like a rather basic operation but the code won't work.
categories <- as.data.frame(read.csv("files/cat_df.csv"))
predictors <- as.data.frame(read.csv("files/radius_100.csv"))
NA_rows <- predictors[!complete.cases(predictors),]
geoids <- NA_rows['geoid']
clean_categories <- categories[!(categories$geoid %in% geoids),]
None of the rows in categories/clean_categories are removed.
A typical geoid value is US06140231. typeof(categories$geoid) returns integer.
I can't say for sure this is it, but a very basic typo would stop the line from doing what you want. Try this correction:
clean_categories <- categories[!(categories$geoid %in% geoids),]
Almost certainly this is what you meant to happen in that line. You want to negate the result of the %in% operator. You don't include a reproducible example so I can't say whether the whole thing will do as you want.
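One more thing worth checking in the question's code: NA_rows['geoid'] returns a one-column data frame rather than a vector, so %in% generally does not compare against the individual geoid values. Extracting the column with $ (or [[ ]]) gives a plain vector. A minimal sketch, with small made-up data frames standing in for the CSV files:

```r
# Hypothetical stand-ins for the two CSV files
predictors <- data.frame(geoid = c("US01", "US02", "US03"),
                         x = c(1, NA, 3),
                         stringsAsFactors = FALSE)
categories <- data.frame(geoid = c("US01", "US02", "US03"),
                         label = c("a", "b", "c"),
                         stringsAsFactors = FALSE)

# Rows of predictors with at least one NA
na_rows <- predictors[!complete.cases(predictors), ]

# $ extracts a vector; na_rows["geoid"] would be a data frame
geoids <- na_rows$geoid

clean_predictors <- predictors[complete.cases(predictors), ]
clean_categories <- categories[!(categories$geoid %in% geoids), ]
```

Both cleaned data frames then contain the same set of geoid values.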

Find data by multiple variable names in R

I have a question regarding variables names in R.
In my dataset I have a list of 70 variable names as characters and I want to find the corresponding data (including the header) in the data.
For example, I used the dataset iris. I don't want to select each variable individually with iris$Sepal.Length, since the dataset I actually use has 70 variables. In my code I can print the data, but I am struggling with saving it as a data frame with the corresponding header names. Does anybody have any thoughts?
iris
head(iris)
colnames(iris)
b <- list("Sepal.Length","Petal.Length")
i=1
for (i in 1:length(b)){
#print(b[[i]])
print(iris[,c(b[[i]])])
c[,i]<-(iris[,c(b[[i]])])
}
It sounds like you're trying to get a subset of 70 columns from a data.frame or matrix. The 70 columns you have are stored in a list. R will let you get columns named by a character vector, but not by a list. So, you can just use unlist.
b <- list("Sepal.Length","Petal.Length")
newTable <- iris[,unlist(b)]
I find dplyr the best for this. If you turn iris into a tibble
library(dplyr)
iris <- as_tibble(iris)
You can then use the dplyr::select function either selecting by name (no quotes) or by position. You can even use the 1:5 notation selecting columns 1 to 5. A great place to start is: http://r4ds.had.co.nz
Are you looking for this?
b <- c("Sepal.Length","Petal.Length")
New_iris=iris[,b]
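With 70 names it is easy for one of them not to match a column, in which case indexing by name stops with an "undefined columns selected" error. The sketch below checks the names first; the name "Not.A.Column" is deliberately hypothetical, and drop = FALSE is used so the result stays a data frame even if only one column survives.

```r
# Names to look up; the last one is a made-up name that iris does not have
wanted <- c("Sepal.Length", "Petal.Length", "Not.A.Column")

# Report names that are absent instead of failing on them
missing_cols <- setdiff(wanted, colnames(iris))

# Keep only the names that exist, in the requested order
new_iris <- iris[, intersect(wanted, colnames(iris)), drop = FALSE]
```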

Determining how to quickly classify columns in huge datasets as factors

I have no good example here, since the datasets I am working with are huge.
But if I have a dataset with 200 or 300-odd columns, I want some sort of rule to quickly classify and convert some of these columns to factors. Is there some quick R code to do it?
The reason is that I don't have time to go column by column to completely understand or interpret the data, but if I see there are just 4 unique values out of 5000 rows, I assume that this is categorical data.
Does anyone have any quick code snippets or ways to go about doing this?
Assuming that df refers to your dataframe:
## Find all columns with fewer than 5 unique values
## (sapply works column by column without coercing df to a matrix)
cols <- sapply(df, function(x) length(unique(x))) < 5
## Convert those columns to factors
df[cols] <- lapply(df[cols], factor)
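To see the rule in action on something small, here is a sketch with a made-up data frame; the column names and the threshold of 5 are assumptions for illustration only.

```r
set.seed(1)

# Hypothetical data: an id, a low-cardinality group, and a continuous value
df <- data.frame(id = 1:20,
                 group = rep(c("a", "b", "c", "d"), 5),
                 value = rnorm(20),
                 stringsAsFactors = FALSE)

# Count unique values per column, then flag columns below the threshold
n_unique <- vapply(df, function(x) length(unique(x)), integer(1))
cols <- n_unique < 5

# Only the flagged columns are converted
df[cols] <- lapply(df[cols], factor)
```

Inspecting n_unique before converting is a cheap way to choose the threshold, since a fixed cut-off can also catch genuinely numeric columns that happen to have few distinct values.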

Applying a function to a dataframe to trim empty columns within a list environment R

I am a naive user of R and am attempting to come to terms with the 'apply' series of functions which I now need to use due to the complexity of the data sets.
I have a large, ragged data frame that I wish to reshape before conducting a sequence of regression analyses. It is further complicated by having interlaced rows of descriptive data (characters).
My approach to date has been to use a factor to split the data frame into sets with equal row lengths (i.e. a list), then attempt to remove the trailing empty columns, make two new, matching lists, one of data and one of chars and then use reshape to produce a common column number, then recombine the sets in each list. e.g. a simplified example:
myDF <- as.data.frame(rbind(c("v1",as.character(1:10)),
c("v1",letters[1:10]),
c("v2",c(as.character(1:6),rep("",4))),
c("v2",c(letters[1:6], rep("",4)))))
myDF[,1] <- as.factor(myDF[,1])
myList <- split(myDF, myDF[,1])
myList[[1]]
I can remove the empty columns for an individual set and can split the data frame into two sets from the interlacing rows but have been stumped with the syntax in writing a function to apply the following function to the list - though 'lapply' with 'seq_along' should do it?
Thus for the individual set:
DF <- myList[[2]]
DF <- DF[,!sapply(DF, function(x) all(x==""))]
DF
(from an earlier answer to a similar, but simpler example on this site). I have a large data set and would like an elegant solution (I could use a loop but that would not use the capabilities of R effectively). Once I have done that I ought to be able to use the same rationale to reshape the frames and then recombine them.
regards
jac
Try
lapply(split(myDF, myDF$V1), function(x) x[!colSums(x=='')])
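Note that the one-liner above drops a column as soon as any of its cells is empty, whereas the per-set code in the question drops a column only when all of its cells are empty. The two agree on the example data but could differ on real data. A sketch that keeps the "all empty" rule, with drop_empty_cols as a helper name introduced here for illustration:

```r
# The question's example data, kept as character columns
myDF <- as.data.frame(rbind(c("v1", as.character(1:10)),
                            c("v1", letters[1:10]),
                            c("v2", c(as.character(1:6), rep("", 4))),
                            c("v2", c(letters[1:6], rep("", 4)))),
                      stringsAsFactors = FALSE)

# Drop only the columns in which every cell is the empty string
drop_empty_cols <- function(d) {
  d[, !sapply(d, function(col) all(col == "")), drop = FALSE]
}

# Split by the first column, then trim each piece independently
trimmed <- lapply(split(myDF, myDF$V1), drop_empty_cols)
```

Each element of trimmed is still a data frame, so the same lapply pattern can be reused for the later reshaping steps.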
