Passing variable names with condition inside FOR loop in R

I'm a newbie in R programming. I have a requirement in mind and am trying to work it out with a for loop. I have a data frame with 14 variables, with empty values in some rows and columns. My requirement is to list the number of empty values in each variable (column).
My code to achieve it is below:
for (x in names(df)){
cat(paste("No of rows with empty value for", x, " variable:",
nrow(df[df$x == '', ])))
}
nrow(df[df$x=='',])
In the nrow command above, the value of x is not being substituted into df$x == ''.
Need some expert help to fix it.
Thanks in advance,
Regards,
Vin

You can use sapply to make your code cleaner.
sapply(df, FUN=function(x) sum(x == ''))

I slightly altered your for loop and added a line break at the end. It is easier to sum over the logical values the comparison creates than to count rows.
##Create some fake data
df <- data.frame(
first_var = c(rep("",10),1:10),
second_var = c(rep("",9), 1:11),
third_var = c(rep("", 8), 1:12),
fourth_Var = c(rep("", 7), 1:13)
)
for(i in names(df)){
cat(paste0("No of rows with empty value for ",i, " variable:",sum(df[,i] == ""),"\n"))
}
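For what it's worth, the original loop fails because df$x looks for a column literally named "x" rather than using the value stored in x, so the comparison never sees the real column. A minimal sketch of the same loop with [[ ]] indexing, assuming a small made-up data frame (the column names and values are placeholders, not the asker's data):

```r
# Hypothetical toy data, not the asker's real data frame
df <- data.frame(
  first_var  = c("", "a", "", NA),
  second_var = c("b", "", "c", "d"),
  stringsAsFactors = FALSE
)

counts <- integer(0)
for (x in names(df)) {
  # df[[x]] uses the column whose name is stored in x;
  # df$x would look for a column literally called "x" and return NULL
  counts[x] <- sum(df[[x]] == "", na.rm = TRUE)
  cat("No of rows with empty value for", x, "variable:", counts[x], "\n")
}
```

The na.rm = TRUE is an added assumption here, so that NA entries are skipped instead of turning the whole sum into NA.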

Related

Attempting to loop through list of dataframes and perform operation on one column in each dataframe

I'm attempting to loop through a list of dataframes and, for the same column in each dataframe, sum that column, divide it by the number of rows in that dataframe, and print the result. I don't want to add a row/column to a new dataframe; I just want the result printed for each one. I also want to print the number of rows in each dataframe separately.
I created this list of dataframes by using this for loop:
Coverages <- list('Cover 0', 'Cover 1', 'Cover 2', 'Cover 3')
DoublePostsLeftDFs <- c()
for (x in Coverages) {
assign(paste("DoublePostsLeft", str_replace_all(x, " ", ""), sep=""), DoublePostsLeft %>% filter(CoverageScheme == x))
name <- paste("DoublePostsLeft", str_replace_all(x, " ", ""), sep="")
DoublePostsLeftDFs <- append(DoublePostsLeftDFs, name)
}
This successfully creates all the dataframes I need, but I didn't know a better way to make a list of what they were all named which is where I suspect my problem is coming from. Here is what I've attempted to do so far:
for (x in DoublePostsLeftDFs) {
row_number <- nrow(x)
average <- sum(x$desired_column)/nrow(x)
print(row_number)
print(average)
}
When I use that I get the error: Error: $ operator is invalid for atomic vectors
So then I tried this:
for (x in DoublePostsLeftDFs) {
new <- as.data.frame(x)
row_number <- nrow(new)
average <- sum(new$desired_column)/nrow(new)
print(row_number)
print(average)
}
And all it did was print out:
[1] 1
[1] 0
for each dataframe in the list. I suspect it has something to do with how I created the list of the dataframes? Any help would be appreciated.

I don't think there is a need to create a list of dataframes here. Is this what you want?
library(dplyr)
result <- DoublePostsLeft %>%
group_by(CoverageScheme) %>%
summarise(nrow = n(),
average = mean(desired_column, na.rm = TRUE))
result
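As for why the loops in the question misbehave: DoublePostsLeftDFs holds the names of the data frames as character strings, not the data frames themselves, so x$desired_column is applied to a string and as.data.frame(x) builds a one-row data frame from that string. If a loop is still wanted, one option is to keep the pieces in a named list with split(). This is only a sketch on made-up data; the values below are placeholders and only the column names come from the question:

```r
# Hypothetical stand-in for DoublePostsLeft
DoublePostsLeft <- data.frame(
  CoverageScheme = c("Cover 0", "Cover 0", "Cover 1",
                     "Cover 2", "Cover 2", "Cover 2"),
  desired_column = c(1, 3, 5, 2, 4, 6)
)

# split() returns a named list with one data frame per CoverageScheme value
by_cover <- split(DoublePostsLeft, DoublePostsLeft$CoverageScheme)

for (nm in names(by_cover)) {
  d <- by_cover[[nm]]  # an actual data frame, not its name
  cat(nm, "rows:", nrow(d), "average:", mean(d$desired_column), "\n")
}
```

Compared with assign(), the list keeps every piece reachable by name without creating separate variables in the workspace.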

How to OR Loop in R

I have a data set with 100 values and want to pick only specific items from that data set. That's how I do it right now:
df.match <- subset(df.raw.csv, value == "UC9d" | value == "UCenoM")
It works, but I want to solve it with a loop. I tried this, but I only get one match, although I know both values are in the data set.
for (ID in c("UC9d" , "UCenoM")){df.match <- subset(df.raw.csv, value == ID)}
Any suggestions?

My suggestion would be not to use loops in R:
library(dplyr)
mydata <- mutate(mydata, TOBEINCL = 0) #rename according to your data
Create a list of patterns for the match of mydata$ID (^ and $ are for exact matching):
toMatch <- c("^UC9d$", "^UCenoM$")
Use pattern matching from base R:
mydata$TOBEINCL[grep(paste(toMatch,collapse="|"), mydata$ID, ignore.case = FALSE)] <- 1
Select data:
mydataINCL <- mydata[(mydata$TOBEINCL==1) , ]
mydataINCL$ID <- factor(mydataINCL$ID) #sometimes R sticks with the old values

An option:
df.match <- subset(df.raw.csv, value %in% c("UC9d", "UCenoM"))
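The loop in the question returns only one match because df.match is overwritten on every iteration, so only the rows for the last ID survive. A rough sketch of a loop that keeps every piece, next to the %in% version, on made-up data (the column n and the row values are placeholders):

```r
# Hypothetical stand-in for df.raw.csv
df.raw.csv <- data.frame(
  value = c("UC9d", "other", "UCenoM", "UC9d"),
  n     = 1:4,
  stringsAsFactors = FALSE
)
ids <- c("UC9d", "UCenoM")

# Loop version: store one subset per ID, then bind them together
pieces <- list()
for (ID in ids) {
  pieces[[ID]] <- df.raw.csv[df.raw.csv$value == ID, ]
}
df.loop <- do.call(rbind, pieces)

# Vectorised version: same rows, kept in their original order
df.match <- df.raw.csv[df.raw.csv$value %in% ids, ]
```

Both versions select the same rows; the loop version groups them by ID, while %in% keeps the original row order.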

How to use a %in% condition in the R which function?

I have a simple task which I can do with loads of lines of individual code, but I would like to simplify it, as that approach will take a long time in the future.
My task is to transform hundreds of columns of a data frame into factors and relabel them accordingly.
With just a subset of my data, I tried to create a list of variables, as the 12 variables have different prefixes at each wave (year of collection). The code I ended up using was:
ghq <-c("scghqa", "scghqb", "scghqc", "scghqd", "scghqe", "scghqf", "scghqg",
"scghqh", "scghqi", "scghqj", "scghqk", "scghql")
waves <- c("a", "b", "c", "d", "e")
ghqa <- paste0(waves[1], sep = "_", ghq[1:12])
ghqb <- paste0(waves[2], sep = "_", ghq[1:12])
ghqc <- paste0(waves[3], sep = "_", ghq[1:12])
ghqd <- paste0(waves[4], sep = "_", ghq[1:12])
ghqe <- paste0(waves[5], sep = "_", ghq[1:12])
ghqv <- c(ghqa, ghqb, ghqc, ghqd, ghqe)
I tried this in a for loop, but I could not get it to produce the output in a list or character vector (only a matrix seemed to work), see the code for that at the bottom of this question, if you are curious.
From here to be able to use apply, I need to know the positions of these columns in the dataframe
apply(data[c(indexes of cols), 2, lfactor(c(values in the factor), levels =c(levels they will correspond to), labels=c(text labels to be attached to each level))
NOTE: I put this here because perhaps I am going the wrong way about things by trying to use apply.
So to identify the columns I want from the data I used
head(dat[colnames(dat) %in% ghqv]) # produced the data for the 60 columns I want
length(dat[colnames(dat) %in% ghqv]) # 60 (as expected)
so I tried:
which(dat[colnames(dat) %in% ghqv])
Error in which(dat[colnames(dat) %in% ghqv]) :
argument to 'which' is not logical
How can I transform this to a logical, please? Any time I use == with %in% it does not seem to recognise it.
To help simplify this, given the silly variable names, I reproduced the same issue in the mtcars data set:
cars <- mtcars
vars <- c("mpg", "qsec")
head(cars[colnames(cars) %in% vars])
which(cars[colnames(cars) %in% vars])
Error in which(cars[colnames(cars) %in% vars]) :
argument to 'which' is not logical
Any assistance would be very welcomed, thank you
Just as an aside, this is the for loop that I couldn't change to produce a single appended vector:
vars <- data.frame(matrix(nrow = 12, ncol = 5)) # we will create a container
colnames(vars) <- c("wave1", "wave2", "wave3", "wave4", "wave5")
rownames(vars) <- c("ghq1", "ghq2", "ghq3", "ghq4", "ghq5",
"ghq6", "ghq7", "ghq8", "ghq9", "ghq10",
"ghq11", "ghq12")
for(i in 1:5){
a <- paste(waves[i], ghqv[1:12], sep = "_")
vars[,i] <- a
print(a) # we print it to see in console
}

You're passing an entire data frame to which().
which(cars[colnames(cars) %in% vars]) runs which on cars[colnames(cars) %in% vars], which is a subset of the cars data.frame (incidentally, cars[colnames(cars) %in% vars] selects the same columns as cars[vars]).
If you just want the indices of the matching columns, run:
which(colnames(cars) %in% vars)
There's probably a better way to do what you want to do.
I would run
require(dplyr)
mutate(cars, across(all_of(vars), factor)) %>%
rename_at(vars, some_function_that_renames_columns)
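To make the which() point concrete, here is a small base R sketch on the mtcars example from the question. The names is_wanted and idx are just illustrative, and plain factor() stands in for whatever levels and labels the real data needs:

```r
cars <- mtcars
vars <- c("mpg", "qsec")

# %in% on the column names gives one TRUE/FALSE per column
is_wanted <- colnames(cars) %in% vars

# which() turns that logical vector into column positions
idx <- which(is_wanted)

# Convert only those columns to factors; all other columns stay as they are
cars[idx] <- lapply(cars[idx], factor)
```

Extra arguments such as levels and labels can be passed through lapply to factor, for example lapply(cars[idx], factor, levels = ..., labels = ...), when every selected column shares the same coding.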

How to paste factor labels conditionally in R

Hope someone can help me on this one for which I have just found a lousy solution on my own: I would like to aggregate (or paste) the labels of four columns (A to D) into a fifth (dream) but conditionally, that is only if its numeric value is 2.
Here is my database df:
id= c(1:12)
A = c(2,NA,NA,2,NA,1,1,1,1,1,NA,2)
B = c(2,1,1,1,2,NA,1,1,1,1,2,1)
C = c(2,1,1,1,2,NA,1,1,1,1,NA,1)
D = c(2,1,1,1,1,1,2,1,1,NA,2,1)
df = data.frame(id,A,B,C,D) ; df
df$A=factor(df$A, labels=c("no", "i saw"))
df$B=factor(df$B, labels=c("no", "someone"))
df$C=factor(df$C, labels=c("no", "sitting"))
df$D=factor(df$D, labels=c("no", "on a cloud")) ; df
Below is the solution I found, but it is not so satisfying...
df$dream = ifelse(as.numeric(df$A)!=2, NA, as.character(df$A)) ; df
df$dream = ifelse(as.numeric(df$B)!=2, df$dream, paste(df$dream, as.character(df$B))) ; df
df$dream = ifelse(as.numeric(df$C)!=2, df$dream, paste(df$dream, as.character(df$C))) ; df
df$dream = ifelse(as.numeric(df$D)!=2, df$dream, paste(df$dream, as.character(df$D))) ; df
I am sure there is a straightforward way to do this; in addition, my code doesn't even seem to work this way. Could someone help me? Thanks

This solution will work, but you have to declare the vector of values you want to paste for each factor.
# init empty result vector
dream <- character(nrow(df))
# values from each column (A-D) you want to paste
values <- c("i saw","someone","sitting", "on a cloud")
# iterate over each row
for(i in seq_len(nrow(df))){
#paste values from each row
dream[i] <- paste(values[which(as.numeric(df[i,-1]) == 2)], collapse=" ")
}
# attach the result to the data frame
df$dream <- dream
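A loop-free variant of the same idea, as a rough sketch. It assumes the columns still hold the numeric codes 1 and 2 from the question (that is, before the factor() calls) and reuses the same values vector; codes and hit are illustrative names:

```r
# The question's data, with the numeric codes (before factor())
df <- data.frame(
  id = 1:12,
  A = c(2, NA, NA, 2, NA, 1, 1, 1, 1, 1, NA, 2),
  B = c(2, 1, 1, 1, 2, NA, 1, 1, 1, 1, 2, 1),
  C = c(2, 1, 1, 1, 2, NA, 1, 1, 1, 1, NA, 1),
  D = c(2, 1, 1, 1, 1, 1, 2, 1, 1, NA, 2, 1)
)
values <- c("i saw", "someone", "sitting", "on a cloud")

# Logical matrix: TRUE where a cell equals 2, with NA treated as FALSE
codes <- as.matrix(df[, c("A", "B", "C", "D")])
hit <- !is.na(codes) & codes == 2

# For each row, paste the labels of the columns that are TRUE
df$dream <- apply(hit, 1, function(r) paste(values[r], collapse = " "))
```

Rows with no 2 at all end up with an empty string; whether that should be NA instead is a choice left open here.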

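I think it would be easier if you convert your data.frame to a data.table.
For column B you can use the following (this assumes B still holds the numeric codes 1 and 2, i.e. before the factor() calls in the question):
library(data.table)
dt <- as.data.table(df)
dt[,dream:=ifelse(B==2,"someone",ifelse(B==1,"no",NA))]
Then repeat the same for the other three columns. I hope this helps.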
UPDATE
Or maybe you could try this (mapvalues comes from the plyr package):
dt$dream.A <- plyr::mapvalues(dt$A,c(1,2),c("no", "i saw"))

Drop columns per row based on a separate column value

Given a dummy data frame that looks like this:
Data1<-rnorm(20, mean=20)
Data2<-rnorm(20, mean=21)
Data3<-rnorm(20, mean=22)
Data4<-rnorm(20, mean=19)
Data5<-rnorm(20, mean=20)
Data6<-rnorm(20, mean=23)
Data7<-rnorm(20, mean=21)
Data8<-rnorm(20, mean=25)
Index<-rnorm(20,mean=5)
DF<-data.frame(Data1,Data2,Data3,Data4,Data5,Data6,Data7,Data8,Index)
What I'd like to do is remove (make NA) certain columns per row based on the Index column. I took the long way and did this to give you an idea of what I'm trying to do:
DF[DF$Index>5.0,8]<-NA
DF[DF$Index>=4.5 & DF$Index<=5.0,7:8]<-NA
DF[DF$Index>=4.0 & DF$Index<=4.5,6:8]<-NA
DF[DF$Index>=3.5 & DF$Index<=4.0,5:8]<-NA
DF[DF$Index>=3.0 & DF$Index<=3.5,4:8]<-NA
DF[DF$Index>=2.5 & DF$Index<=3.0,3:8]<-NA
DF[DF$Index>=2.0 & DF$Index<=2.5,2:8]<-NA
DF[DF$Index<=2.0,1:8]<-NA
This works fine as is, but is not very adaptable. If the number of columns changes, or I need to tweak the conditional statements, it's a pain to rewrite the entire code (the actual data set is much larger).
What I would like to do is be able to define a few variables, and then run some sort of loop or apply to do exactly what the lines of code above do.
As an example, in order to replicate my long code, something along the lines of this kind of logic:
NumCol<-8
Max<-5
Min<-2.0
if index > Max, then drop NumCol
if index >= (Max-0.5) & <=Max, then drop (NumCol-1):NumCol
repeat until reach Min
I don't know if that's the most logical line of reasoning in R, and I'm pretty bad with Looping and apply, so I'm open to any line of thought that can replicate the above long lines of code with the ability to adjust the above variables.

If you don't mind changing your data.frame to a matrix, here is a solution that uses indexing by a matrix. The building of the two-column matrix of indices to drop is a nice review of the apply family of functions:
Seq <- seq(Min, Max, by = 0.5)
col.idx <- lapply(findInterval(DF$Index, Seq) + 1, seq, to = NumCol)
row.idx <- mapply(rep, seq_along(col.idx), sapply(col.idx, length))
drop.idx <- as.matrix(data.frame(unlist(row.idx), unlist(col.idx)))
M <- as.matrix(DF)
M[drop.idx] <- NA
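A shorter variant of the same findInterval idea, offered as a sketch and not as a drop-in replacement: compare each cell's column number with the first column to drop for its row. The seed and the matrix-based construction of DF are assumptions made only to keep the example reproducible, and first_drop is an illustrative name:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
NumCol <- 8
Max <- 5
Min <- 2.0

# Toy data shaped like the question's DF: 8 data columns plus Index
DF <- as.data.frame(matrix(rnorm(20 * NumCol, mean = 20), ncol = NumCol))
names(DF) <- paste0("Data", seq_len(NumCol))
DF$Index <- rnorm(20, mean = 5)

# For each row, the first column to blank out (1 = all, NumCol = last only)
first_drop <- findInterval(DF$Index, seq(Min, Max, by = 0.5)) + 1

# col(M) holds each cell's column number; first_drop is recycled down the rows
M <- as.matrix(DF[seq_len(NumCol)])
M[col(M) >= first_drop] <- NA
DF[seq_len(NumCol)] <- as.data.frame(M)
```

Index values that fall exactly on a boundary go to the higher band with findInterval, which may differ from the overlapping <= / >= conditions in the question.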

Here is a memory-efficient (but I can't claim elegant) data.table solution.
It uses the very useful function findInterval to replace your less than / greater than conditions.
#
library(data.table)
DT <- data.table(DF)
# create an index column whose values 1:8 encode your greater than / less than bands
DT[,IND := findInterval(Index, c(-Inf, seq(2,5,by =0.5 ), Inf))]
# the columns you want to change
changing <- names(DT)[1:8]
setkey(DT, IND)
# loop through the indexes and alter by reference
for(.ind in DT[,unique(IND)]){
  # the columns to blank out: IND = 1 (Index < 2) drops all of them,
  # IND = 8 (Index >= 5) drops only the last one
  .which <- changing[.ind:length(changing)]
  # create a call to `:=`(a = as(NA, class(a)), b = as(NA, class(b)))
  pairlist <- mapply(sprintf, .which, .which, MoreArgs = list(fmt = '%s = as(NA,class(%s))'))
  char_exp <- sprintf('`:=`( %s )',paste(pairlist, collapse = ','))
  .e <- parse(text = char_exp)
  DT[J(.ind), eval(.e)]
}