Replace NA with 0 in R using a loop on a dataframe

I would like to run through specific columns in a dataframe and replace all NAs with 0s using a loop.
extract = read.csv("2013-09 Data extract.csv")
extract$Premium1[is.na(extract$Premium1)] <- 0
extract$Premium1
gives me the required result for Premium1 in dataframe extract, but I would like to loop through all 27 columns of premiums, so what I am trying is
extract = read.csv("2013-09 Data extract.csv")
for(i in 1:27) {
thispremium <- get(paste("extract$Premium", i, sep=""))
thispremium[is.na(thispremium)] <- 0
}
which gives
Error in get(paste("extract$Premium", i, sep = "")) :
object 'extract$Premium1' not found
Any idea on what is causing the error?

Do you need the loop because of other requirements? Because it works just fine without one:
extract[is.na(extract)] <- 0
If you want to do the replacement for some columns only, select those columns first, perform the replacement, and substitute the columns back into the original set:
first5 <- extract[, 1 : 5]
first5[is.na(first5)] <- 0
extract[, 1 : 5] <- first5
More generally, loops can (and usually should) be avoided in R, especially when manipulating data frames. Operations often vectorise automatically (as above); when they don't, functions of the apply family can be used, as in the sketch below.
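For instance, here is a minimal sketch (assuming, as in the example above, that the first five columns are the numeric premium columns) that zeroes out NAs without an explicit loop:
# Replace NAs with 0 in the first five columns via lapply, no explicit loop
# (assumes those columns are numeric; assigning the list back into
# extract[, 1:5] keeps the data frame structure)
extract[, 1:5] <- lapply(extract[, 1:5], function(col) replace(col, is.na(col), 0))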

How about
for (colname in names(extract))
extract[[colname]][is.na(extract[[colname]])] <- 0
(or even extract[is.na(extract)] <- 0)
Or, if you are not doing it to all the columns (I think I misread your question):
for(i in 1:27) {
colname <- paste0("Premium",i)
extract[[colname]][is.na(extract[[colname]])] <- 0
}
Alternatively, you don't really need to know the number of such columns:
premium <- grep("^Premium[0-9]*$",names(extract))
extract[premium][is.na(extract[premium])] <- 0
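For example, on a small made-up data frame (hypothetical values, just to illustrate the pattern):
# Made-up illustration of the grep() approach
extract <- data.frame(Premium1 = c(1, NA, 3), Premium2 = c(NA, 5, NA))
premium <- grep("^Premium[0-9]*$", names(extract))
extract[premium][is.na(extract[premium])] <- 0
extract
#   Premium1 Premium2
# 1        1        0
# 2        0        5
# 3        3        0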

Related

How to run Chisq test for multiple rows FASTER in R?

I have managed to do the chisq-test using a loop in R, but it is very slow for large data, and I wonder if you could help me do it faster with something like dplyr. I've tried with dplyr, but I ended up getting an error every time, and I am not sure of the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is run a chisq-test between each row of df and cs, and get back the statistics and p-values as well as the row names.
Here is my code for the loop:
value = matrix(nrow = ncol(df), ncol = 3)
for (i in 1:ncol(df)) {
  tst <- chisq.test(df[i, ], cs)
  value[i, 1] <- tst$p.value
  value[i, 2] <- tst$statistic
  value[i, 3] <- rownames(df)[i]
}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val = NA, stat = NA, exprs = rownames(df))
for (i in 1:ncol(df)) {
  # tbl <- table(df[i, ], cs)  ### No use seen for this
  # I changed the indexing in the next line to compare columns to the standard `cs`.
  tst <- chisq.test(df[, i], cs)  # chisq.test is not vectorized; some sort of loop is needed
  value[i, 1:2] <- tst[c('p.value', 'statistic')]  # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name, since there is also a df function) to Biobase::exprs(PANCAN_w).
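If you would rather not assign into value row by row, here is a sketch (assuming the same df and cs as above, and keeping the same per-column chisq.test call) that collects the results with lapply instead:
# Run chisq.test per column and bind the pieces into one data frame
res <- do.call(rbind, lapply(seq_len(ncol(df)), function(i) {
  tst <- chisq.test(df[, i], cs)
  data.frame(p_val = tst$p.value, stat = unname(tst$statistic))
}))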

Looping through rows in an R data frame?

I'm working with multiple big data frames in R and I'm trying to write functions that can modify each of them (given a set of common parameters). One function is giving me trouble (shown below).
RawData <- function(x)
{
for(i in 1:nrow(x))
{
if(grep(".DERIVED", x[i,]) >= 1)
{
x <- x[-i,]
}
}
for(i in 1:ncol(x))
{
if(is.numeric(x[,i]) != TRUE)
{
x <- x[,-i]
}
}
return(x)
}
The objective of this function is twofold: first, to remove any rows that contain a ".DERIVED" string in any one of their cells (using grep), and second, to remove any columns that are non-numeric (using is.numeric). I get an error on the following condition:
if(grep(".DERIVED", x[i,]) >= 1)
The error states "argument is of zero length", which I believe is usually associated with NULL values in a vector. However, I've used is.null on the entire data frame that is giving me errors, and it confirmed that there are no null values in the DF. I'm sure I'm missing something relatively simple here. Any advice would be greatly appreciated.
If you can use non-base-R functions, this should address your issue. df is the data.frame in question here. It will also be faster than looping over rows (generally not advised if avoidable).
library(dplyr)
library(stringr)
df %>%
  filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
  select_if(is.numeric)
You can make it a function just as you would anything else:
mattsFunction <- function(dat){
  dat %>%
    filter_all(all_vars(!str_detect(., '\\.DERIVED'))) %>%
    select_if(is.numeric)
}
you should probably give it a better name though
The error is from the line
if(grep(".DERIVED", x[i,]) >= 1)
When grep doesn't find the term ".DERIVED", it returns something of zero length (integer(0)), so your inequality doesn't return TRUE or FALSE but rather logical(0). The error is telling you that the if statement cannot evaluate whether logical(0) >= 1.
A simple example:
if(grep(".DERIVED", "1234.DERIVEDabcdefg") >= 1) {print("it works")} # Works nicely, since the inequality can be evaluated
if(grep(".DERIVED", "1234abcdefg") > 1) {print("no dice")}
You can replace that line with if(length(grep(".DERIVED", x[i,])) != 0)
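An equivalent and arguably clearer test (a small aside, not part of the original answer) is any(grepl(...)), which always returns a single TRUE or FALSE:
# any(grepl(...)) yields exactly one TRUE/FALSE, so it is safe inside if()
if (any(grepl("\\.DERIVED", x[i, ]))) {
  # this row contains ".DERIVED" somewhere
}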
There's something else you haven't noticed yet, which is that you're removing rows/columns in a loop. Say you remove the 5th column: on the next loop iteration (when i = 6) you will be handling what was originally the 7th column! (This will eventually end in an error along the lines of Error in `[.data.frame`(x, , i) : undefined columns selected.)
I prefer using dplyr, but if you need to use base R functions there are ways to do this without if statements.
Notice that you should consider using the escaped regex "\\.DERIVED" rather than ".DERIVED", which would mean "any character followed by DERIVED".
I don't have example data or output, so here's my best go...
# Made up data
test <- data.frame(a = c("data","data.DERIVED","data","data","data.DERIVED"),
b = (c(1,2,3,4,5)),
c = c("A","B","C","D","E"),
d = c(2,5,6,8,9),
stringsAsFactors = FALSE)
# Note: The following code assumes that the column class is numeric, because the
# example code provided assumed that the column class was numeric. It will not
# detect a column that is full of numbers stored as character values.
# Using the base subset command
test2 <- subset(test,
subset = !grepl("\\.DERIVED",test$a),
select = sapply(test,is.numeric))
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
# Trying to use []. Note: If only 1 column is numeric this will return a vector
# instead of a data.frame
test2 <- test[!grepl("\\.DERIVED",test$a),]
test2 <- test2[,sapply(test,is.numeric)]
# > test2
# b d
# 1 1 2
# 3 3 6
# 4 4 8
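To guard against that vector case (a small usage note), drop = FALSE keeps the result a data.frame even when only one column is selected:
# drop = FALSE prevents a single selected column from collapsing to a vector
test2 <- test2[, sapply(test, is.numeric), drop = FALSE]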

For loop for multiple indices

I know that in R for loops should be avoided and vectorized operations should be used instead.
I want to solve this with a for loop and then try to use the apply family, then also in Rcpp.
I load a dataset containing one column of passwords (alphanumeric).
Once loaded (a sample, for speed), I want to create new columns with (0,1) values based on conditions such as "contains lower-case characters", "contains numbers", and so on.
Here is what I tried to do, but it doesn't work: each column I create ends up with the same value.
library(tidyverse)
set.seed(123)
# load dataset from url, skip the first 16 rows
df <- read.csv('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
sample_frac(.001) %>%
rename(password = V1)
patterns = c("[a-z]","[A-Z]","[0-9]+")
df$has_lower <- 0
df$has_upper <- 0
df$has_numeric <- 0
for(i in 1:nrow(df)){
for(j in patterns){
n <- ifelse(grepl(j, df$password[i]),1,0)
}
df$has_lower[i] <- n
df$has_upper[i] <- n
df$has_numeric[i] <- n
}
Output I have in mind is:
password has_lower has_upper has_numeric
Bigmaccas 1 1 0
0127515559 0 0 1
dbqky73p 1 0 1
We can simplify things if we just name your pattern vector. For example
patterns = c(has_lower="[a-z]",
has_upper="[A-Z]",
has_numeric="[0-9]+")
for(pattern in names(patterns)) {
df[, pattern] = as.numeric(grepl(patterns[pattern], df$password))
}
Basically we just loop through each of the names, grab the regular expression corresponding to that name, then do the matching and add the column.
A data frame is above all a list.
So, you can simply do:
df[c("has_lower", "has_upper", "has_numeric")] <-
lapply(patterns, function(pattern) grepl(pattern, df$password) + 0)
Use + 0L instead of + 0 if you want integers instead of doubles (I would recommend doing nothing and keeping logicals).
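As a quick check, running that lapply version on a small made-up data frame reproduces the output from the question:
# Made-up check data: the three passwords shown in the question
df <- data.frame(password = c("Bigmaccas", "0127515559", "dbqky73p"))
patterns <- c("[a-z]", "[A-Z]", "[0-9]+")
df[c("has_lower", "has_upper", "has_numeric")] <-
  lapply(patterns, function(pattern) grepl(pattern, df$password) + 0)
df
#     password has_lower has_upper has_numeric
# 1  Bigmaccas         1         1           0
# 2 0127515559         0         0           1
# 3   dbqky73p         1         0           1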
First you need to update has_lower, has_upper and has_numeric within the j loop, otherwise your n stays the same across the 3 cases. To do so you need to be able to loop over the names of the columns has_lower, has_upper and has_numeric:
names <- c("has_lower","has_upper","has_numeric")
for(i in 1:nrow(df)){
for(j in 1:length(patterns)){
df[i,(names[j])] <- as.numeric(grepl(j, df$password[i]))
}
}
A quicker, nicer, more compact alternative using lapply and the fact that grepl is already vectorized (note the := syntax requires df to be a data.table, e.g. as created by fread below):
df[, c("has_lower", "has_upper", "has_numeric") := lapply(patterns, function(x) grepl(x, df$password))]
Note (nothing to do with your question):
I advise you to use the fread function to read your dataset, since it is quite large.
library(data.table)
df = fread('http://datashaping.com/passwords.txt', header = F, skip = 16) %>%
  sample_frac(.001) %>%
  rename(password = V1)

How to Create Dynamic Variable Names in a For Loop in R

I have a dataset/table (called behavioural) with data from 24 participants - these are named in the format: 's1' to 's24'.
The first 6 rows in the table/dataset:
head(behavioural)[c(1,17)]
subj recognition_order
1 s1 2
2 s1 6
3 s1 7
4 s1 8
5 s1 9
6 s1 10
I want to create a subset for each participant and order each of these subsets by the variable recognition_order
I have created a loop to do this:
behavioural <- read.table("*my file path*\behavioral.txt", header = TRUE)
subj_counter <- 1
for(i in 1:24) {
subject <- paste("s", subj_counter, sep = "")
subset_name <- paste(subject, "_subset", sep="")
[subset_name] <- behavioural[which(behavioural$subj == subject), ]
[subset_name] <- subset_name[order(subset_name$recognition_order),]
subj_counter = subj_counter + 1
print(subset_name)
print(subj_counter)
}
And I'm pretty sure the logic is solid, except when I run the loop, it does not create 24 subsets; it just creates one, s24_subset.
What do I need to do to the bit before "<-" in these 2 lines of code?
[subset_name] <- behavioural[which(behavioural$subj == subject), ]
[subset_name] <- subset_name[order(subset_name$recognition_order),]
Because [subset_name] isn't working.
I want the [subset_name] to be dynamic - i.e. each time the loop runs, its value changes and it creates a new subset/variable each time.
I have seen things online about the assign() function but I'm not quite sure how to implement this into my loop?
Thank you in advance!
If you want to order the items inside the results of a split, then just use lapply to pass the needed function call, ordering a single dataframe at a time (the pieces are re-bundled together by lapply after the ordering):
my_split_list <- split(behavioural, behavioural$subj)
ord.list <- lapply( my_split_list, function(d){
  d[ order(d[['recognition_order']]) , ] })
This is a common paradigm called "split-apply-combine": "The Split-Apply-Combine Strategy for Data Analysis" https://www.jstatsoft.org/article/view/v040i01/v40i01.pdf
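If a single re-ordered data frame is then wanted rather than a list (a small follow-up sketch), the per-subject pieces can be stacked back together:
# Re-combine the per-subject pieces into one data frame
ordered_all <- do.call(rbind, ord.list)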
for(i in 1:5) {
assign(paste0("test_",i),i)
}
test_list <- mget(ls(pattern = "test_"))
I hope you get a good answer. A good pattern is to create the variables with assign() and then collect them with mget(). I've written up a number of questions about dynamically generating variables in R and Python; see the following link:
- Creating Variables Dynamically (R, Python)
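Applied to the question (a sketch, assuming the behavioural data frame and subj values described above), the same assign()/mget() pattern would look like this:
# Create one ordered subset per participant, then collect them into a list
for (i in 1:24) {
  subject <- paste0("s", i)
  subset_i <- behavioural[behavioural$subj == subject, ]
  assign(paste0(subject, "_subset"),
         subset_i[order(subset_i$recognition_order), ])
}
subset_list <- mget(paste0("s", 1:24, "_subset"))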
You can accomplish this with eval() and parse(), pasting the dynamic name into both sides of the assignment (the line that first creates the subset needs the same treatment):
eval(parse(text = paste0(subset_name, " <- ", subset_name, "[order(", subset_name, "$recognition_order), ]")))

Recursive function in R to find unique rows of a list of data tables

I am working on a function that takes a list of data tables with the same column names as an input and returns a single data table that has the unique rows from each data frame combined using successive rbind as shown below.
The function would be applied to a "very" large data.table (tens of millions of rows), which is why I had to split it up into several smaller data tables and assign them into a list to use recursion. At each step, depending upon whether the length of the list of data tables is odd or even, I find the unique rows of the data.table at that list index and of the data table at list index x - 1, rbind the two, assign the result to list index x - 1, and remove list index x.
I must be missing something obvious, because although I can produce the final de-duplicated data.table when I print it (e.g. print(listelement[[1]])), when I return(listelement[[1]]) I get NULL. It would help if someone could spot what I am missing, or suggest whether there is perhaps a more efficient way to perform this.
Also, instead of having to add each data.table to a list, can I add them as "references" in the list? I believe doing something like list(datatable1, datatable2 ...) would actually copy them?
## CODE
returnUnique2 <- function (alist) {
if (length(alist) == 1) {
z <- (alist[[1]])
print (class(z))
print (z) ### This is the issue, if I change to return (z), I get NULL (?)
}
if (length(alist) %% 2 == 0) {
alist[[length(alist) - 1]] <- unique(rbind(unique(alist[[length(alist)]]), unique(alist[[length(alist) - 1]])))
alist[[length(alist)]] <- NULL
returnUnique2(alist)
}
if (length(alist) %% 2 == 1 && length(alist) > 2) {
alist[[length(alist) - 1]] <- unique(rbind(unique(alist[[length(alist)]]), unique(alist[[length(alist) - 1]])))
alist[[length(alist)]] <- NULL
returnUnique2(alist)
}
}
## OUTPUT with print statement
t1 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
t2 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
t3 <- data.table(col1=rep("a",10), col2=round(runif(10,1,10)))
tempList <- list(t1, t2, t3)
returnUnique2(tempList)
[1] "list"
[[1]]
col1 col2
1: a 3
2: a 2
3: a 5
4: a 9
5: a 10
6: a 7
7: a 1
8: a 8
9: a 4
10: a 6
Changing the following,
print (z) ### This is the issue, if I change to return (z), I get NULL (?)
to read
return(z)
returns NULL
Thanks in advance.
Please correct me if I misunderstand what you're doing, but it sounds like you have one big data.table and are trying to split it up, run some function on the pieces, then combine everything back and run a unique on the result. The data.table way of doing that would be to use by, e.g.
fn = function(d) {
# do whatever to the subset and return the resulting data.table
# in this case, do nothing
d
}
N = 10 # number of pieces you like
# split the rows into N roughly equal chunks, run fn on each chunk, then drop the helper grouping column
dt[, fn(.SD), by = (seq_len(nrow(dt)) - 1) %/% (nrow(dt)/N)][, seq_len := NULL]
dt = dt[!duplicated(dt)]
Seems like this could be a good use case for a for loop. With many rows the overhead of using a for loop should be relatively small compared to the computation time. I would try combining the data.tables into a list (called ll in my example), then for each one remove duplicated rows, rbind it to the previous data.table of unique rows, and subset by unique rows again.
If you have many duplicated rows in each chunk then this might save some time, overall I'm not sure how effective it will be, but worth a shot?
# Create empty data.table for results (I have columns x and y in this case)
res <- data.table( x= numeric(0),y=numeric(0))
# loop over all data.tables in a list called 'll'
for( i in 1:length(ll) ){
# rbind the unique rows from the current list element to the results from all previous iterations
res <- rbind( res , ll[[i]][ ! duplicated(ll[[i]]) , ] )
# Keep only unique records at each iteration
res <- res[ ! duplicated(res) , ]
}
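As an aside (a sketch, not part of the original answer), data.table's rbindlist() can combine the de-duplicated chunks in one call, which avoids growing res inside the loop:
# De-duplicate each chunk, combine once, then de-duplicate the combined result
library(data.table)
res <- unique(rbindlist(lapply(ll, unique)))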
On another note, have you looked at the documentation for data.table? It explicitly states:
"Because data.tables are usually sorted by key, tests for duplication are especially quick."
So you might just be better off running on the entire data.table?
DT[ ! duplicated(DT) , ]
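Equivalently (a small aside), data.table provides a unique() method for data.tables, so the same de-duplication can be written as:
# unique() on a data.table is equivalent to DT[!duplicated(DT), ]
DT <- unique(DT)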
Add an id column to each data.table
t1$id=1
t2$id=2
t3$id=3
then combine them all at once and do a unique using by=.
If the data.tables are huge you could use setkey(...) to create an index on id before calling unique.
tall = rbind(t1, t2, t3)
tall[, unique(.SD), by = id]  # unique rows of col1/col2 within each id (.SD holds the non-by columns)
