How to use a %in% condition in the R which function? - r

I have a simple task that I can do with many lines of individual code, but I would like to simplify it because that approach will take a long time in the future.
My task is to transform hundreds of columns of a data frame into factors and relabel them accordingly.
Working with just a subset of my data, I tried to create a list of variables, as the 12 variables have a different prefix at each wave (year of collection). The code I ended up using was:
ghq <-c("scghqa", "scghqb", "scghqc", "scghqd", "scghqe", "scghqf", "scghqg",
"scghqh", "scghqi", "scghqj", "scghqk", "scghql")
waves <- c("a", "b", "c", "d", "e")
ghqa <- paste0(waves[1], sep = "_", ghq[1:12])
ghqb <- paste0(waves[2], sep = "_", ghq[1:12])
ghqc <- paste0(waves[3], sep = "_", ghq[1:12])
ghqd <- paste0(waves[4], sep = "_", ghq[1:12])
ghqe <- paste0(waves[5], sep = "_", ghq[1:12])
ghqv <- c(ghqa, ghqb, ghqc, ghqd, ghqe)
I tried this in a for loop, but I could not get it to produce the output as a list or character vector (only a matrix seemed to work). The code for that is at the bottom of this question, if you are curious.
From here, to be able to use apply, I need to know the positions of these columns in the data frame:
apply(data[c(indexes of cols)], 2, lfactor(c(values in the factor), levels = c(levels they will correspond to), labels = c(text labels to be attached to each level)))
NOTE: I put this here because perhaps I am going the wrong way about things by trying to use apply.
So, to identify the columns I want from the data, I used:
head(dat[colnames(dat) %in% ghqv]) # produced the data for the 60 columns I want
length(dat[colnames(dat) %in% ghqv]) # 60 (as expected)
so I tried:
which(dat[colnames(dat) %in% ghqv])
Error in which(dat[colnames(dat) %in% ghqv]) :
argument to 'which' is not logical
How can I transform this to a logical, please? Any time I use == with %in%, it does not seem to recognise it.
To help simplify this and avoid the silly variable names, I reproduced the same issue with the mtcars data set:
cars <- mtcars
vars <- c("mpg", "qsec")
head(cars[colnames(cars) %in% vars])
which(cars[colnames(cars) %in% vars])
Error in which(cars[colnames(cars) %in% vars]) :
argument to 'which' is not logical
Any assistance would be very welcome, thank you.
Just as an aside, this is the for loop that I couldn't change to create a single appended vector:
vars <- data.frame(matrix(nrow = 12, ncol = 5)) # we will create a container
colnames(vars) <- c("wave1", "wave2", "wave3", "wave4", "wave5")
rownames(vars) <- c("ghq1", "ghq2", "ghq3", "ghq4", "ghq5",
"ghq6", "ghq7", "ghq8", "ghq9", "ghq10",
"ghq11", "ghq12")
for(i in 1:5){
a <- paste(waves[i], ghq[1:12], sep = "_") # use the unprefixed names in ghq; ghqv already carries a wave prefix
vars[,i] <- a
print(a) # we print it to see in console
}

You're passing an entire data frame to which().
which(cars[colnames(cars) %in% vars]) runs which on cars[colnames(cars) %in% vars], which is a subset of the cars data frame, not a logical vector. (Incidentally, cars[colnames(cars) %in% vars] selects the same columns as cars[vars], although cars[vars] returns them in the order given in vars.)
If you just want the indices of the matching columns, run:
which(colnames(cars) %in% vars)
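As a minimal sketch of the difference, the snippet below keeps the logical vector and the column positions as separate steps. The names is_wanted, idx and idx_by_name are only illustrative; match() is shown as an alternative that returns positions in the order of vars.

```r
cars <- mtcars
vars <- c("mpg", "qsec")

# %in% returns one TRUE/FALSE per column name: this is what which() expects
is_wanted <- colnames(cars) %in% vars

# which() turns the logical vector into integer positions
idx <- which(is_wanted)
print(idx)

# match() gives the position of each name in vars, in the order of vars
idx_by_name <- match(vars, colnames(cars))
print(idx_by_name)

# Either index vector can be used to subset the data frame
head(cars[idx])
```

With mtcars both calls give positions 1 and 7, because mpg is the first column and qsec the seventh.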
There's probably a better way to do what you want to do.
I would run:
library(dplyr)
mutate(cars, across(all_of(vars), factor)) %>%
rename_with(some_function_that_renames_columns, all_of(vars))
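For the original task, a base R sketch might look like the following. It builds all 60 column names in one call and converts the matching columns with lapply(); apply() is avoided because it first coerces the data frame to a matrix, so the result would not be factor columns. The toy data, the codes 1 to 4 and the label texts are assumptions made purely for illustration.

```r
ghq   <- paste0("scghq", letters[1:12])
waves <- c("a", "b", "c", "d", "e")

# One vector of 60 names: a_scghqa ... a_scghql, b_scghqa ... e_scghql
ghqv <- paste(rep(waves, each = length(ghq)), ghq, sep = "_")

# Toy data (hypothetical): three rows of codes 1-4 plus an id column
set.seed(1)
dat <- as.data.frame(matrix(sample(1:4, 3 * length(ghqv), replace = TRUE),
                            nrow = 3, dimnames = list(NULL, ghqv)))
dat$id <- 1:3

# Hypothetical labels, one per code
ghq_labels <- c("Not at all", "No more than usual", "Rather more", "Much more")

# Convert only the columns that are present, leaving the others untouched
cols <- intersect(ghqv, colnames(dat))
dat[cols] <- lapply(dat[cols], factor, levels = 1:4, labels = ghq_labels)

str(dat[, c("a_scghqa", "e_scghql", "id")])
```

Assigning back into dat[cols] keeps the result a data frame, and selecting by name means the column positions are never needed.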

Related

Iteratively adding a row containing characters and numbers to a dataframe

I have a list containing named elements. I am iterating over the list names, performing the computation for each corresponding element, "encapsulating" the results and the name in a vector and finally adding the vector to a table. The row or vector after each iteration contains a mix of characters and numbers.
The first row is getting added but from the second row onwards there is a problem.
In this example, there is supposed to be one column (first) containing alphanumeric names. All rows after the first one contain NAs.
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s))
}
df <- as.data.frame(df)
I know there are possibly more efficient ways but for the moment this is more intuitive for me as it is assuring that each computation is associated with a particular name. There can be several columns and rows and the names are extremely helpful to join tables, query, compare etc. They make it easier to trace back results to a particular element in my original list.
Additionally, I would be glad to know other ways in which the element names are always retained while transforming.
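Thank you!
You have to set stringsAsFactors = FALSE in rbind. With stringsAsFactors = TRUE, the first iteration of the loop converts the string values into factors (with the values as the factor levels), so values added in later rows that are not among those levels become NA. Note also that c(name, m, s) is a character vector, so m and s are stored as text.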
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame()
for(name in names(x))
{
tmp <- x[[name]]
m <- mean(tmp)
s <- sum(tmp)
df <- rbind(df, c(name,m,s), stringsAsFactors = FALSE)
}
An easier solution would be to utilize sapply().
x <- list(a_1=c(1,2,3), b_2=c(3,4,5), c_3=c(5,1,9))
df <- data.frame(name = names(x), m = sapply(x, mean), s = sapply(x, sum))
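If the loop-like structure is preferred, one possible variant is to build a one-row data frame per element and bind the rows once at the end. This is only a sketch; the name rows is illustrative. Because each row is a data frame rather than a c() vector, m and s stay numeric.

```r
x <- list(a_1 = c(1, 2, 3), b_2 = c(3, 4, 5), c_3 = c(5, 1, 9))

# One single-row data frame per list element; the name travels with its results
rows <- lapply(names(x), function(name) {
  tmp <- x[[name]]
  data.frame(name = name, m = mean(tmp), s = sum(tmp),
             stringsAsFactors = FALSE)
})

# Bind all rows in a single step instead of growing df inside the loop
df <- do.call(rbind, rows)
str(df)
```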

Passing variable names with condition inside FOR loop in R

I'm a newbie in R programming. I have a requirement in mind and trying to work it out with for loop. I have a data frame with 14 variables which has empty values for some rows and columns. My requirement is to list the number of empty values in each variable (column).
My code below to achieve it:
for (x in names(df)){
cat(paste("No of rows with empty value for", x, " variable:",
nrow(df[df$x == '', ])))
}
nrow(df[df$x=='',])
From the above nrow command, the x value is not getting substituted for df$x == ''.
Need some expert help to fix it.
Thanks in advance,
Regards,
Vin
You can use sapply though to make your code cleaner.
sapply(df, FUN=function(x) sum(x == ''))
I slightly altered your for loop and added a line break at the end. It is easier to sum the logical values produced by the comparison than to count the rows.
##Create some fake data
df <- data.frame(
first_var = c(rep("",10),1:10),
second_var = c(rep("",9), 1:11),
third_var = c(rep("", 8), 1:12),
fourth_Var = c(rep("", 7), 1:13)
)
for(i in names(df)){
cat(paste0("No of rows with empty value for ",i, " variable:",sum(df[,i] == ""),"\n"))
}
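The reason the original loop fails is that df$x looks for a column literally named x; the $ operator does not evaluate the loop variable. A small sketch with made-up data shows the difference, using df[[x]] to look the column up by the value of x.

```r
# Made-up data for illustration
df <- data.frame(first_var  = c("", "a", ""),
                 second_var = c("b", "", "c"),
                 stringsAsFactors = FALSE)

# df$x is NULL here because there is no column called "x"
print(is.null(df$x))

for (x in names(df)) {
  # [[ ]] evaluates x, so this picks the column whose name is stored in x
  n_empty <- sum(df[[x]] == "")
  cat("No of rows with empty value for", x, "variable:", n_empty, "\n")
}

# The same counts as a named integer vector; na.rm guards against NA values
counts <- vapply(df, function(col) sum(col == "", na.rm = TRUE), integer(1))
print(counts)
```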

How to read and use the dataframes with the different names in a loop?

I'm struggling with the following issue: I have many data frames with different names (For instance, Beverage, Construction, Electronic etc., dim. 540x1000). I need to clean each of them, calculate and save as zoo object and R data file. Cleaning is the same for all of them - deleting the empty columns and the columns with some specific names.
For example:
Beverages <- Beverages[,colSums(is.na(Beverages))<nrow(Beverages)] #removing empty columns
Beverages_OK <- Beverages %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
Beverages_OK[, 1] <- NULL #dropping the first column
Beverages_OK <- cbind(data[1], Beverages_OK) # adding a date column
Beverages_zoo <- read.zoo(Beverages_OK, header = FALSE, format = "%Y-%m-%d")
save (Beverages_OK, file = "StatisticsInRFormat/Beverages.RData")
I tried to use the 'lapply' function like this:
list <- ls() # the list of all the dataframes
lapply(list, function(X) {
temp <- X
temp <- temp [,colSums(is.na(temp))< nrow(temp)] #removing empty columns
temp <- temp %>% select (-starts_with ("X.ERROR")) # dropping X.ERROR column
temp[, 1] <- NULL
temp <- cbind(data[1], temp)
X_zoo <- read.zoo(X, header = FALSE, format = "%Y-%m-%d") # I don't know how to have the same name as X has.
save (X, file = "StatisticsInRFormat/X.RData")
})
but it doesn't work. Is there any way to do such a job? Is there any R package that facilitates it?
Thanks a lot.
If you are sure that you have only the needed data frames in the environment, this should get you started:
df1 <- mtcars
df2 <- mtcars
df3 <- mtcars
list <- ls()
lapply(list, function(x) {
tmp <- get(x)
})
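Building on that idea, a possible sketch is to keep the data frames in a named list, so the names travel with the data and can be reused for file names. The toy frames, clean_frame() and the use of tempdir() with saveRDS() are assumptions for illustration; the zoo conversion from the question is left out here.

```r
# Toy frames (hypothetical) standing in for the real data
Beverages    <- data.frame(Date = 1:3, X.ERROR = NA,
                           price = c(1.5, 2, 2.5), empty = NA)
Construction <- data.frame(Date = 1:3, price = c(10, 11, 12),
                           X.ERROR.1 = c("e", "e", "e"))

# A named list; mget(ls()) would build one from every object in the workspace
frames <- list(Beverages = Beverages, Construction = Construction)

clean_frame <- function(d) {
  # drop columns that contain only NA
  d <- d[, colSums(is.na(d)) < nrow(d), drop = FALSE]
  # drop columns whose name starts with "X.ERROR"
  d[, !startsWith(names(d), "X.ERROR"), drop = FALSE]
}

cleaned <- lapply(frames, clean_frame)

# Save each cleaned frame under its own name
out_dir <- tempdir()
for (nm in names(cleaned)) {
  saveRDS(cleaned[[nm]], file.path(out_dir, paste0(nm, ".rds")))
}
```

lapply() keeps the list names, so cleaned$Beverages is the cleaned version of Beverages.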

dynamic subsetting in r

I have a data set that is something like the following, but with many more columns and rows:
a<-c("Fred","John","Mindy","Mike","Sally","Fred","Alex","Sam")
b<-c("M","M","F","M","F","M","M","F")
c<-c(40,35,25,50,25,40,35,40)
d<-c(9,7,8,10,10,9,5,8)
df<-data.frame(a,b,c,d)
colnames(df)<-c("Name", "Gender", "Age", "Score")
I need to create a function that will let me sum the scores for selected subsets of the data. However, the subsets selected may have different numbers of variables each time. One subset could be Name=="Fred" and another could be Gender == "M" & Age == 40. In my actual data set, there could be up to 20 columns used in a selected subset, so I need to make this as general as possible.
I tried using a sapply command that included eval(parse(text=...), but it takes a long time with only a sample of 20,000 or so records. I'm sure there's a much faster way, and I'd appreciate any help in finding it.
There are several ways to represent these two variables. One way is as two distinct objects, another is as two elements in a list.
However, using a named list might be the easiest:
# df is a function for the F distribution. Avoid using "df" as a variable name
DF <- df
example1 <- list(Name = c("Fred")) # c() not needed, used for emphasis
example2 <- list(Gender = c("M"), Age=c(40, 50))
## notice that the key portion is `DF[[nm]] %in% ll[[nm]]`
subByNmList <- function(ll, DF, colsToSum=c("Score")) {
ret <- vector("list", length(ll))
names(ret) <- names(ll)
for (nm in names(ll))
ret[[nm]] <- colSums(DF[DF[[nm]] %in% ll[[nm]] , colsToSum, drop=FALSE])
# optional
if (length(ret) == 1)
return(unlist(ret, use.names=FALSE))
return(ret)
}
subByNmList(example1, DF)
subByNmList(example2, DF)
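Note that subByNmList sums each condition separately. If the conditions should instead apply together (Gender == "M" & Age == 40), one possible sketch is to build a logical vector per column with %in% and combine them with Reduce(). The function name sum_where is hypothetical.

```r
DF <- data.frame(
  Name   = c("Fred", "John", "Mindy", "Mike", "Sally", "Fred", "Alex", "Sam"),
  Gender = c("M", "M", "F", "M", "F", "M", "M", "F"),
  Age    = c(40, 35, 25, 50, 25, 40, 35, 40),
  Score  = c(9, 7, 8, 10, 10, 9, 5, 8),
  stringsAsFactors = FALSE
)

# conditions: named list, column name -> allowed values
sum_where <- function(DF, conditions, col = "Score") {
  # one logical vector per condition
  tests <- Map(function(nm, allowed) DF[[nm]] %in% allowed,
               names(conditions), conditions)
  # AND them together, starting from "keep every row"
  keep <- Reduce(`&`, tests, rep(TRUE, nrow(DF)))
  sum(DF[[col]][keep])
}

sum_where(DF, list(Name = "Fred"))
sum_where(DF, list(Gender = "M", Age = 40))
sum_where(DF, list(Gender = "M", Age = c(40, 50)))
```

This avoids eval(parse(text = ...)) entirely, since the conditions are plain data rather than code strings.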
lapply( subset( df, Gender == "M" & Age == 40, select=Score), sum)
#$Score
#[1] 18
I could have written just:
sum( subset( df, Gender == "M" & Age == 40, select=Score) )
But that would not generalize very well.

merge data frames based on non-identical values in R

I have two data frames. First one looks like
dat <- data.frame(matrix(nrow=2,ncol=3))
names(dat) <- c("Locus", "Pos", "NVAR")
dat[1,] <- c("ACTC1-001_1", "chr15:35087734..35087734", "1" )
dat[2,] <- c("ACTC1-001_2 ", "chr15:35086890..35086919", "2")
where chr15:35086890..35086919 indicates all the numbers within this range.
The second looks like:
dat2 <- data.frame(matrix(nrow=2,ncol=3))
names(dat2) <- c("VAR", "REF.ALT", "FUNC")
dat2[1,] <- c("chr1:116242719", "T/A", "intergenic" )
dat2[2,] <- c("chr1:116242855", "A/G", "intergenic")
I want to merge these by the values in dat$Pos and dat2$VAR. If the single number in a cell in dat2$VAR is contained within the range of a cell in dat$Pos, I want to merge those rows. If this occurs more than once (dat2$VAR falls in more than one range in dat$Pos), I want it merged each time. What's the easiest way to do this?
Here is a solution, quite short but not particularly efficient so I would not recommend it for large data. However, you seemed to indicate your data was not that large so give it a try and let me know:
library(plyr)
exploded.dat <- adply(dat, 1, function(x){
parts <- strsplit(x$Pos, ":")[[1]]
chr <- parts[1]
range <- strsplit(parts[2], "..", fixed = TRUE)[[1]]
start <- as.numeric(range[1]) # convert explicitly so seq() works on numbers
end <- as.numeric(range[2])
data.frame(VAR = paste(chr, seq(from = start, to = end), sep = ":"), x)
})
merge(dat2, exploded.dat, by = "VAR")
If it is too slow or uses too much memory for your needs, you'll have to implement something a bit more complex and this other question looks like a good starting point: Merge by Range in R - Applying Loops.
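One way to avoid expanding every range into individual positions is to parse the chromosome, start and end into columns, join on the chromosome, and then keep the rows where the position lies inside the range. The sketch below uses small made-up data (the Locus, Pos, VAR and FUNC values are hypothetical) so that some rows actually match; it is not tuned for very large inputs.

```r
# Made-up data: two ranges and three variant positions
ranges <- data.frame(
  Locus = c("L1", "L2"),
  Pos   = c("chr15:100..100", "chr15:200..210"),
  stringsAsFactors = FALSE
)
variants <- data.frame(
  VAR  = c("chr15:205", "chr1:205", "chr15:100"),
  FUNC = c("intergenic", "intergenic", "exonic"),
  stringsAsFactors = FALSE
)

# Split "chr:start..end" into chromosome, start and end
pos_parts <- do.call(rbind, strsplit(ranges$Pos, ":|\\.\\."))
ranges$chr   <- pos_parts[, 1]
ranges$start <- as.numeric(pos_parts[, 2])
ranges$end   <- as.numeric(pos_parts[, 3])

# Split "chr:position" into chromosome and position
var_parts <- do.call(rbind, strsplit(variants$VAR, ":", fixed = TRUE))
variants$chr <- var_parts[, 1]
variants$pos <- as.numeric(var_parts[, 2])

# Pair every range with every variant on the same chromosome,
# then keep only the pairs where the position is inside the range
candidates <- merge(ranges, variants, by = "chr")
hits <- candidates[candidates$pos >= candidates$start &
                   candidates$pos <= candidates$end, ]
print(hits[, c("Locus", "Pos", "VAR", "FUNC")])
```

A variant that falls in several ranges appears once per matching range, which is the behaviour asked for in the question.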
Please try this out and let us know how it works. Without a larger data set it is a bit hard to trouble shoot. If for whatever reason it does not work, please share a few more rows from your data tables (specifically ones that would match)
SPLICE THE DATA
range.strings <- do.call(rbind, strsplit(dat$Pos, ":"))[, 2]
range.strings <- do.call(rbind, strsplit(range.strings, "\\.\\."))
mins <- as.numeric(range.strings[,1])
maxs <- as.numeric(range.strings[,2])
d2.vars <- as.numeric(do.call(rbind, strsplit(dat2$VAR, ":"))[,2])
names(d2.vars) <- seq(d2.vars)
FIND THE MATCHES
# row number is the row in dat
# col number is the row in dat2
# use <= on both sides so that single-position ranges (start == end) can match
matches <- sapply(d2.vars, function(v) mins <= v & v <= maxs)
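MERGE
# each TRUE cell in matches pairs a row of dat (row index) with a row of dat2 (col index)
# note: only positions are compared here, not chromosome names
idx <- which(matches, arr.ind = TRUE)
# bind the paired rows side by side; a dat2 row that falls in several ranges appears once per range
merged <- cbind(dat[idx[, "row"], , drop = FALSE],
dat2[idx[, "col"], , drop = FALSE])
merged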

Resources