How to formulate a for loop here - R

I have a CSV file of car information (price, model, color, and more).
I have loaded it into R with read.csv().
Some variables are text-based categorical variables, such as Model, Color, and Fuel Type.
I came up with a for loop to find these text-based categorical variables:
for (i in 1:dim(car)[2]) {
  if (is.character(car[, i])) {
    print(names(car)[i])
  }
}
# car is the name of the data frame
Now I want the loop to also report the index of the column. For example, the Model column is column 2, but how should I integrate that into this loop? Below is what I have so far, but the output is integer(0).
for (i in 1:dim(car)[2]) {
  if (is.character(car[, i])) {
    print(which(i == colnames(car)))
  }
}

dim(car)[2] is the number of columns of car. (ncol() is a more common way to get this number for a data frame.)
1:dim(car)[2] is therefore 1, 2, 3, ... up to the number of columns.
So for(i in ...) means i will be 1, then i will be 2, ... up to the number of columns.
When your if statement is TRUE, the current value of i is the column number. So you want print(i) inside your if() statement.
Your attempt, print(which(i == colnames(car))), fails because colnames(car) are the names of the columns, and i is the number of the column. Names and numbers are different, so nothing matches and which() returns integer(0).
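Putting that together, a minimal sketch of the corrected loop (your original loop with print(i) added):
for (i in 1:ncol(car)) {
  if (is.character(car[, i])) {
    print(names(car)[i])  # the column name
    print(i)              # the column index
  }
}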
A more R-like way to do this would be to use sapply instead of a loop. Something like this:
char_cols <- sapply(car, is.character)
char_cols             # named logical vector saying whether each column is character
char_cols[char_cols]  # look only at the character columns
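If you also want the column positions from this approach, which() on the logical vector returns them directly (a small follow-up sketch):
which(char_cols)         # named integer vector: the character columns and their positions
names(which(char_cols))  # just the column names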

"which" function can still be used. From the response from Gregor Thomas there is a way to modify there is a way to modify for loop
for (i in 1:ncol(car)) {
  if (is.character(car[, i])) {
    print(names(car)[i])
    print(which(names(car)[i] == colnames(car)))
  }
}
We first print the actual column name with print(names(car)[i]).
Then we ask R for the position at which that name matches the column names of the car dataset, which is exactly the column index we wanted.
Once again, thank you to Mr. Gregor Thomas.

A slight variation of Gregor Thomas' smart recommendation is to use sapply with the typeof function to type every column and then the which function to get the character column numbers:
x <- sapply(car, typeof)      # the type of every column
y <- which(x == 'character')  # positions (and names) of the character columns
Also note that you can see which columns are character from a visual inspection of the data frame's structure with str(car).
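For reference, this is what those two objects hold (a sketch; the actual names and types depend on your file):
x         # named character vector of column types, e.g. Price = "double", Model = "character"
y         # named integer vector with the positions of the character columns
names(y)  # just the names of the character columns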

Related

Improving specific code efficiency - base R alternative to a for() loop solution

Looking for a vectorized base R solution for my own edification. I'm assigning a value to a column in a data frame based on a value in another column in the data frame.
My solution creates a named vector of possible codes, looks up the code in the original column, subsets the named vector by the value found, and assigns the resulting name to the new column. I'm sure there's a way to do exactly this using the named vector I created that doesn't need a for loop; is it some version of apply?
dplyr is great and useful and I'm not looking for a solution that uses it.
# reference vector for assigning more readable text to this table
tempAssessmentCodes <- setNames(
  c(600, 301, 302, 601, 303, 304, 602, 305, 306, 603, 307, 308, 604, 309, 310, 605, 311, 312, 606, 699),
  c("base", "3m", "6m", "6m", "9m", "12m", "12m", "15m", "18m", "18m", "21m", "24m", "24m", "27m", "30m", "30m",
    "33m", "36m", "36m", "disch")
)
for (i in 1:nrow(rawDisp)) {
  rawDisp$assessText[i] <- names(tempAssessmentCodes)[tempAssessmentCodes == rawDisp$assessment[i]]
}
The standard way is to use match():
rawDisp$assessText <- names(tempAssessmentCodes)[match(rawDisp$assessment, tempAssessmentCodes)]
For each element of x, match(x, y) finds the index of its first match in y. Here that means match(rawDisp$assessment, tempAssessmentCodes) returns, for every assessment code, its position in tempAssessmentCodes, and indexing names(tempAssessmentCodes) by those positions replaces each value with its name.
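A small worked example of that behaviour, using the tempAssessmentCodes vector from the question:
match(c(302, 600, 699), tempAssessmentCodes)
# [1]  3  1 20
names(tempAssessmentCodes)[match(c(302, 600, 699), tempAssessmentCodes)]
# [1] "6m"    "base"  "disch"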
Personally, I do it the opposite way - make tempAssessmentCodes have names that correspond to the old codes, and values that correspond to the new codes:
codes <- setNames(names(tempAssessmentCodes), tempAssessmentCodes)
Then simply select elements from the new codes using the names (old codes):
rawDisp$assessText <- codes[as.character(rawDisp$assessment)]
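Continuing the same toy values, the reversed lookup then works like this:
codes[as.character(c(302, 600, 699))]
#    302    600    699
#   "6m" "base" "disch"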

Retaining a value in an R dataset if it's present in another dataset

I am currently working on code that applies to various datasets from an experiment looking at a wide range of variables, not all of which are present in every repetition. My first step is to create an empty dataset with all the possible variables, and then write a function which retains the columns that are present in the input dataset and deletes the rest. Here is an example of how I want to achieve this:
x<-c("a","b","c","d","e","f","g")
y<-c("c","f","g")
Is there a way of removing elements of x that aren't present in y and/or retaining values of x that are present in y?
For your first question: "My first step is to create an empty dataset with all the possible variables", I would use factor on the concatenation of all the vectors, for example:
all_vect = c(x, y)
possible = levels(factor(all_vect))
Then, for the second part " write a function which retains columns that are in the dataset being inputted and delete the rest", I would write:
df[, names(df) %in% possible]
As akrun wrote, use intersect(x,y) or
x[x %in% y]
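With the vectors from the question, both give the same result:
intersect(x, y)
# [1] "c" "f" "g"
x[x %in% y]
# [1] "c" "f" "g"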

How to code this if else clause in R?

I have a function that outputs a list containing strings. Now, I want to check whether this list contains only strings that are all 0's, or whether there is at least one string (possibly more) that is not all 0's.
I have a large dataset, and I am going to execute my function on each of its rows.
Basically:
for each row of the dataset:
  mylst <- func(row[i])
  if (mylst contains only strings made up entirely of 0's)
    process the next row of the dataset
  else
    execute some other code
Now, I can code the if-else clause but I am not able to code the part where I have to check the list for all 0's. How can I do this in R?
Thanks!
You can use this for loop:
for (i in seq(nrow(dat))) {
  if (!any(grepl("^0+$", dat[i, ]))) {
    # execute some other code
  }
}
where dat is the name of your data frame.
Here, the regex "^0+$" matches a string that consists of 0s only.
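To see the regex on its own (a quick sketch with made-up strings):
grepl("^0+$", c("000", "0", "010", "abc"))
# [1]  TRUE  TRUE FALSE FALSE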
I'd like to suggest a solution that avoids the use of an explicit for loop.
For a given data set df, one can build a logical vector that indicates which rows consist entirely of zeros:
all.zeros <- apply(df, 1, function(s) all(grepl("^0+$", s)))  # grepl() taken from Sven's solution
With this logical vector, it is easy to subset df to remove all-zero rows:
df[!all.zeros,]
and use it for any subsequent transformations.
'Toy' dataset
df <- data.frame(V1=c('00','01','00'),V2=c('000','010','020'))
UPDATE
If you'd like to apply the function to each row first and then analyze the resulting strings, you should slightly modify the all.zeros expression:
all.zeros <- apply(df,1,function(s) all(grepl('^0+$',func(s))))

Editing columns of data frames in lists in R

I want to do some statistical analysis in R on a column with the same name but different lengths, originating from several data frames. I created a list:
my.list <- list(df1, df2, df3, df4)
Now, as some elements of the column of interest (say: my.col) contain the word "FAILED" instead of numbers, I replace them with 'NA':
for (i in 1:length(my.list)) {
  for (j in 1:length(my.list[[i]]$my.col)) {
    if (my.list[[i]]$my.col[j] %in% c("FAILED")) {
      my.list[[i]]$my.col[j] <- 'NA'
    }
  }
}
I am pretty sure that this is not the best solution for the problem, but at least it works. I have to say, though, that it causes warnings saying that in another column (not my.col) there are invalid factor levels that have been replaced by NA. I have no idea why it considers columns other than my.col at all. Suggestions for improvement are highly appreciated.
Now, the remaining numbers contain a decimal comma instead of a point. Although I tried to deal with this while importing the .csv file with dec = ",", that does not work out for columns that contain anything other than numbers (e.g. "FAILED"). So I have to replace the comma with a point, and this is what doesn't work for me. I tried:
for (i in 1:length(my.list)) {
  as.numeric(gsub(",", ".", my.list[[i]]$my.col))
}
This doesn't give any errors, but it also doesn't change anything, although if I type in e.g.
as.numeric(gsub(",", ".", my.list[[4]]$my.col))
it does what I want to do for the 4th element of the list. From my point of view, both should be the same. What's the problem with this?
Btw, I prefer not to delete the other columns from the data frames because I might need them in future for other analysis.
You can do this efficiently using the plyr package.
Note that in the example, I use the built-in iris data.
Instead of replacing "FAILED" with NA, I replaced values of "versicolor".
Instead of replacing a comma with a period, I replaced an s with a w.
my.list <- list(iris, iris)

library(plyr)
my.list <- llply(.data = my.list,
                 function(x) {
                   x$Species <- as.character(x$Species)
                   x$Species[x$Species == "versicolor"] <- "NA"
                   x$Species <- gsub(pattern = "s",
                                     replacement = "w",
                                     x = x$Species)
                   x$Species <- as.factor(x$Species)
                   return(x)
                 })
The as.character() call was added as an example of a way to circumvent problems with adding a level to a factor. The as.factor() call ensures the column is returned as a factor with the new levels.
This will also give you the flexibility to convert from list to data.frame. Simply replace llply with ldply.
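If you would rather stay in base R, a rough equivalent with lapply could look like this (just a sketch, assuming my.col is the column of interest as in the question; it stores a real NA and converts the column to numeric in one pass):
my.list <- lapply(my.list, function(x) {
  x$my.col <- as.character(x$my.col)                # avoid invalid-factor-level warnings
  x$my.col[x$my.col == "FAILED"] <- NA              # a real NA rather than the string "NA"
  x$my.col <- as.numeric(gsub(",", ".", x$my.col))  # decimal commas -> numeric
  x
})
Assigning the result back to my.list is also the step the original gsub() loop was missing: as.numeric(gsub(...)) computes the new values but never stores them anywhere.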

Evaluating dataframe and storing the result

My data frame (m*n) has a few hundred columns. I need to compare each column with all the other columns (contingency tables), perform a chi-squared test, and save the results for each column in a different variable.
It works for one column at a time, like this:
s <- function(x) {
  a <- table(x, data[, 1])
  b <- chisq.test(a)
}
c1 <- apply(data, 2, s)
The results for column 1 are stored in c1, but how do I loop this over all columns and save the result for each column for further analysis?
If you're sure you want to do this (I wouldn't, thinking about the multiple-testing problem), work with lists:
Data <- data.frame(
  x = sample(letters[1:3], 20, TRUE),
  y = sample(letters[1:3], 20, TRUE),
  z = sample(letters[1:3], 20, TRUE)
)
# make a nice list of index pairs
ids <- combn(names(Data), 2, simplify = FALSE)
# use the appropriate apply
my.results <- lapply(ids,
                     function(z) chisq.test(table(Data[, z]))
)
# use some paste voodoo to give the results the names of the column indices
names(my.results) <- sapply(ids, paste, collapse = "-")
# select all results involving y:
my.results[grep("y", names(my.results))]
Not harder than that. As I show you in the last line, you can easily get all tests for a specific column, so there is no need to make a list for each column. That just takes longer and uses more space, but gives the same information. You can write a small convenience function to extract the data you need:
extract <- function(col, l) {
  l[grep(col, names(l))]
}
extract("^y$", my.results)
This means you can even loop over different column names of your data frame and get a list of lists returned:
lapply(names(Data),extract,my.results)
I strongly suggest you get yourself acquainted with working with lists; they're one of the most powerful and clean ways of doing things in R.
PS: Be aware that you are saving the whole chisq.test object in your list. If you only need the chi-squared statistic or the p-value, select it first.
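For example, to keep only the p-values from the list above (a small sketch building on my.results):
p.values <- sapply(my.results, function(res) res$p.value)
p.values  # named numeric vector, one p-value per column pair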
Fundamentally, you have a few problems here:
1. You're relying heavily on global arguments rather than local ones. This makes the double usage of "data" confusing.
2. Similarly, you rely on a hard-coded value (column 1) instead of passing it as an argument to the function.
3. You're not extracting the one value you need from chisq.test(). This means your result gets returned as a list.
You didn't provide any example data, so here's some:
m <- 10
n <- 4
mytable <- matrix(runif(m * n), nrow = m, ncol = n)
Once you fix the above problems, simply run a loop over various columns (since you've now avoided hard-coding the column) and store the result.
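A sketch of what that might look like (the function and object names here are just illustrative), passing the data and the reference column as arguments and keeping only the p-value:
chisq_p <- function(dat, ref) {
  # p-value of each column of dat tested against column `ref`
  apply(dat, 2, function(x) chisq.test(table(x, dat[, ref]))$p.value)
}
# loop over every column as the reference column and store the results in a list
results <- lapply(seq_len(ncol(mytable)), function(j) chisq_p(mytable, j))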
