I am trying to apply a series of functions to every column between two variables. I first need to impute the missing values and then normalize the variables. To impute, I use the following code.
for(i in (train$B365A:train$BSA)){
data[i][is.na(data[i])] <- round(mean(data[i], na.rm = TRUE))
}
With the loop above I am trying to impute the missing values in the columns from B365A to BSA; there are approximately 20 variables between them.
I have also come up with this to convert columns to numeric, but it does not change the cells.
convert_num <- function(i) {
i <- as.numeric(i)
}
for (i in c(1:3)){
convert_num(i)
}
The data looks similar to the following
hope coal kite
3 4 5
2 1 5
Right now the columns have a different class, but they need to be numeric. The data has over 20 variables and 18k rows.
If I understand correctly, the solution to your problem would be the following.
data <- data.frame(c1 = c(rbinom(10,5,0.5)),
c2 = c(rbinom(10,5,0.5)),
c3 = c(rbinom(10,5,0.5)))
data[2:4,1] <- rep(NA,3);data[c(6,8),2] <- rep(NA,2);data[10,3] <- NA
data
# impute data in columns c1:c3 with the rounded column mean
for(i in 1:3){
data[i][is.na(data[i])] <- round(mean(data[,i], na.rm = TRUE))
}
data
data[] <- lapply(data,as.numeric) # transform to numeric
sapply(data,class)
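The question refers to the columns by name (B365A to BSA) rather than by position. A minimal sketch of the same imputation driven by column names, followed by the normalization step the question mentions, might look like this. The toy data, the column names c1 to c3, and the use of scale() for normalization are assumptions for illustration.

```r
set.seed(1)
data <- data.frame(c1 = rbinom(10, 5, 0.5),
                   c2 = rbinom(10, 5, 0.5),
                   c3 = rbinom(10, 5, 0.5))
data[2:4, 1] <- NA
data[c(6, 8), 2] <- NA
data[10, 3] <- NA

# Positions of the first and last column of the range, looked up by name
cols <- match("c1", names(data)):match("c3", names(data))

# Impute each column in the range with its rounded mean
data[cols] <- lapply(data[cols], function(col) {
  col[is.na(col)] <- round(mean(col, na.rm = TRUE))
  col
})

# Normalize: centre to mean 0 and scale to standard deviation 1
data[cols] <- lapply(data[cols], function(col) as.numeric(scale(col)))
data
```

Replacing "c1" and "c3" with the real first and last column names should cover every variable in between, provided the columns are contiguous in the data frame.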
I am trying to populate the output of a for loop into a data frame. The loop is repeating across the columns of a dataset called "data". The output is to be put into a new dataset called "data2". I specified an empty data frame with 4 columns (i.e. ncol=4). However, the output generates only the first two columns. I also get a warning message: "In matrix(value, n, p) : data length [2403] is not a sub-multiple or multiple of the number of columns [2]"
Why does the dataframe called "data2" have 2 columns, when I have specified 4 columns? This is my code:
a <- 0
b <- 0
GM <- 0
GSD <- 0
data2 <- data.frame(ncol=4, nrow=33)
for (i in 1:ncol(data))
{
if (i==34) {break}
a[i] <- colnames(data[i])
b <- data$cycle
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
If you look at the ?data.frame() help page, you'll see that it does not take arguments nrow and ncol--those are arguments for the matrix() function.
This is how you initialize data2, and you can see it starts with 2 columns, one column is named ncol, the second column is named nrow.
data2 <- data.frame(ncol=4, nrow=33)
data2
# ncol nrow
# 1 4 33
Instead you could try data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33)), though if you share a small sample of data and your expected result there may be more efficient ways than explicit loops to get this job done.
Generally, if you do loop, you want to do as much outside of the loop as possible. This is just guesswork without sample data, but these changes seem like a start at improving your code.
a <- colnames(data)
b <- data$cycle ## this never changes, no need to redefine every iteration
GM <- numeric(ncol(data)) ## better to initialize vectors to the correct length
GSD <- numeric(ncol(data))
data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33))
for (i in 1:ncol(data))
{
if (i==34) {break}
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
## it's weird to assign a row of data.frame at once...
## maybe you should keep it as a matrix?
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
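If the goal is one row of summary statistics per column, the table can also be built without a loop. The sketch below is only a guess at the intended result: it uses a small made-up data frame, and it computes the geometric mean and geometric standard deviation by hand as exp(mean(log(x))) and exp(sd(log(x))) rather than calling geoMean() and geoSD(), so that it runs without extra packages. The helper names geo_mean and geo_sd are hypothetical.

```r
set.seed(42)
data <- data.frame(v1 = rlnorm(20), v2 = rlnorm(20), v3 = rlnorm(20))

# Hand-written stand-ins for geoMean()/geoSD(); valid for positive values only
geo_mean <- function(x) exp(mean(log(x), na.rm = TRUE))
geo_sd   <- function(x) exp(sd(log(x), na.rm = TRUE))

# One row per column of `data`, built in a single call
data2 <- data.frame(variable = names(data),
                    GM  = sapply(data, geo_mean),
                    GSD = sapply(data, geo_sd),
                    row.names = NULL)
data2
```

Because each column of data2 is created as a whole vector, the column types stay correct (character for the names, numeric for the statistics), which row-by-row assignment with c() would not preserve.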
I have written the following very simple while loop in R.
i=1
while (i <= 5) {
print(10*i)
i = i+1
}
I would like to save the results to a dataframe that will be a single column of data. How can this be done?
You may try this (if you want to keep the while loop):
df1 <- c()
i=1
while (i <= 5) {
print(10*i)
df1 <- c(df1, 10*i)
i = i+1
}
as.data.frame(df1)
#   df1
# 1  10
# 2  20
# 3  30
# 4  40
# 5  50
Or
df1 <- data.frame()
i=1
while (i <= 5) {
df1[i,1] <- 10 * i
i = i+1
}
df1
If you already have a data frame (let's call it dat), you can create a new, empty column in the data frame, and then assign each value to that column by its row number:
# Make a data frame with column `x`
n <- 5
dat <- data.frame(x = 1:n)
# Fill the column `y` with the "missing value" `NA`
dat$y <- NA
# Run your loop, assigning values back to `y`
i <- 1
while (i <= 5) {
result <- 10*i
print(result)
dat$y[i] <- result
i <- i+1
}
Of course, in R we rarely need to write loops like this. Normally, we use vectorized operations to carry out tasks like this faster and more succinctly:
n <- 5
dat <- data.frame(x = 1:n)
# Same result as your loop
dat$y <- 10 * (1:n)
Also note that, if you really did need a loop instead of a vectorized operation, that particular while loop could also be expressed as a for loop.
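For completeness, a minimal sketch of the for-loop version, reusing the dat, n and y names from above:

```r
n <- 5
dat <- data.frame(x = 1:n)
dat$y <- NA_real_  # numeric placeholder column

# seq_len(n) yields 1, 2, ..., n and is safe even when n is 0
for (i in seq_len(n)) {
  dat$y[i] <- 10 * i
}
dat
```

The for loop removes the manual counter bookkeeping (i <- 1 and i <- i + 1) that the while loop needs.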
I recommend consulting an introductory book or other guide to data manipulation in R. Data frames are very powerful and their use is a necessary and essential part of programming in R.
I have a data set containing 526 rows and 560 columns. I want to run a PCA on each consecutive block of 16 columns in a loop and save the PCA scores for each row. I tried the code below, but it did not work. I would be happy to get your advice.
Thanks in advance for your help.
for(i in 1:ncol(df)) {
df[ , i:(i+15)] <- prcomp(df[, i:(i+15)], scale. = TRUE, center = T)
}
Here is a way with a lapply loop. Create a vector f of consecutive integers, each repeated 16 times. Then split the data.frame names by this vector and lapply function prcomp to each subset. Finally, extract the scores.
f <- c(1, rep(0, 15))
f <- rep(f, length(names(df1))/16)
f <- cumsum(f)
nms <- split(names(df1), f)
pca_list <- lapply(nms, function(x){
prcomp(df1[x], center = TRUE, scale. = TRUE)
})
scores_list <- lapply(pca_list, '[[', 'x')
Test data creation code
set.seed(2021)
df1 <- replicate(560, rnorm(526))
df1 <- as.data.frame(df1)
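To keep the scores in a single data frame with one row per observation, the per-block score matrices can be bound side by side. This is a sketch on smaller made-up data (50 rows, 32 columns, so two blocks of 16); the "block1_PC1" naming scheme is an assumption, chosen only so that component names from different blocks do not clash.

```r
set.seed(2021)
df1 <- as.data.frame(replicate(32, rnorm(50)))  # 50 rows, 2 blocks of 16 columns

# Block index for every column: 1 repeated 16 times, then 2, ...
f <- rep(seq_len(ncol(df1) / 16), each = 16)
nms <- split(names(df1), f)

pca_list <- lapply(nms, function(x) prcomp(df1[x], center = TRUE, scale. = TRUE))
scores_list <- lapply(pca_list, function(p) p$x)

# Prefix the component names with the block number, then bind side by side
for (k in names(scores_list)) {
  colnames(scores_list[[k]]) <- paste0("block", k, "_", colnames(scores_list[[k]]))
}
scores <- as.data.frame(do.call(cbind, scores_list))
dim(scores)
```

The same steps should carry over to the 526 by 560 case as long as the number of columns is a multiple of 16.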
x1=c(55,60,75,80)
x2=c(30,20,15,23)
x3=c(4,3,2,6)
x=data.frame(x1,x2,x3)
Given this function:
NAins=function(x,alpha=0.3){
x.n=NULL
for (i in 1:ncol(x)){
S= sort(x[,i], decreasing=TRUE)
N= S[ceiling(alpha*nrow(x))]
x.n= ifelse(x[,i]>N, NA, x[,i])
print(x.n) }
}
How can I save the final result as a data frame that looks like the original dataset? I tried data.frame(x.nmar), but that did not work.
And how do I get the result out of the loop?
Better to use lapply here to avoid side effect of the for-loop:
NAins <- function(x,alpha=0.3){
Nr <- nrow(x)
lapply(x,function(col){
S <- sort(col, decreasing=TRUE)
N <- S[ceiling(alpha*Nr)]
ifelse(col>N, NA, col)
})
}
Then you can coerce the result to a data.frame:
as.data.frame(NAins(x))
Converting the comment to answer
If you want to achieve this the loop way, you will need to predefine a matrix or a data frame and then fill it up. (In your case you can just use your original x data.frame, because the function will not update the original data set in the global environment.) After the loop ends, you will need to return the result, because all the variables you created within the function are removed when it exits. The output of print is not saved anywhere either. Also, running ceiling(alpha*nrow(x)) inside the loop doesn't make sense, as its value never changes. Nor is ifelse needed if you only have a single alternative each time. See below:
NAins=function(x, alpha = 0.3){
N <- ceiling(alpha * nrow(x)) ## Run this only once (take out of the loop)
for(i in 1:ncol(x)){
S <- sort(x[, i], decreasing = TRUE)
x[x[, i] > S[N], i] <- NA # no need for `ifelse`, you are only inserting one value
}
x # return the result after the loop ends
}
Test
NAins(x)
# x1 x2 x3
# 1 55 NA 4
# 2 60 20 3
# 3 75 15 2
# 4 NA 23 NA
users,
I have data.frames which are NULL in my results, but I don't want them to be NULL. I want them to be the same as the beginning (unchanged). I'm working on a list of files and the aim of my code is to fill all the NA with data from my other data.frames (according to the best correlation coefficient). Here's a small example:
Imagine these are my 3 input data frames (10 rows each):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,NA,NA,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
The aim is that, for example, if there are NAs in ST1, they must be filled with data from the file best correlated with ST1 (either ST2 or ST3 in this example).
As ST3 has no data here, I cannot have any correlation coefficient. So NAs from ST3 cannot be filled, and ST3 cannot also be used to fill another file. So ST3 has no use if you want. Nevertheless I want to keep ST3 unchanged during all my code.
So the problem in my code comes from data.frames with no data and so with only NAs.
For the moment my code would give this for "refill" (end of my code) (filled NA in my data.frames):
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- NULL
But actually, I want for results in "refill" this:
ST1 <- data.frame(x1=c(1:10))
ST2 <- data.frame(x2=c(1:5,6,7,8:10))
ST3 <- data.frame(x3=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA))
So for data.frames with only NAs, I don't want them to be NULL in "refill", but I want them to be identical as in input. I need this to have the same dimensions of data.frames between inputs and outputs.
If they are NULL (as they are at the moment, which I don't understand and want to change), the data.frame has 0 rows instead of 10 rows like the other data.frames.
So I think there's something wrong in my code in function "process.all" or "na.fill" or maybe "lst".
Here's my code; it is a reproducible example so you can see my error (you'll see in head(refill) that ST2 is set as NULL).
Sorry if it is a bit long, but my error depends on other functions used earlier. I hope you've understood my problem and what I'm trying to do. Thanks for your help!
(For information, in function "process.all" and "na.fill": x is the data.frame I want to fill, and y is the file which will be used to fill x (so the best correlated file with x)).
Geoffrey
# my data for example
DF1 <- data.frame(x1=c(NA,NA,rnorm(3:20)),x2=c(31:50))
write.table(DF1,"ST001_2008.csv",sep=";")
DF2 <- data.frame(x1=c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,rnorm(1:10)),x2=c(1:20))
write.table(DF2,"ST002_2008.csv",sep=";")
DF3 <- data.frame(x1=rnorm(81:100),x2=NA)
write.table(DF3,"ST003_2008.csv",sep=";")
DF4 <- data.frame(x1=c(21:40),x2=rnorm(1:20))
write.table(DF4,"ST004_2008.csv",sep=";")
# Correlation table
corhiver2008capt1 <- read.table(text=" ST001 ST002 ST003 ST004
ST001 1.0000000 NA -0.4350665 0.3393549
ST002 NA NA NA NA
ST003 -0.4350665 NA 1.0000000 -0.4992513
ST004 0.3393549 NA -0.4992513 1.0000000",header=T)
lst <- lapply(list.files(pattern="\\_2008.csv$"), read.table,sep=";", header=TRUE, stringsAsFactors=FALSE)
Stations <-c("ST001","ST002","ST003","ST004")
names(lst) <- Stations
# searching the highest correlation for each data.Frame
get.max.cor <- function(station, mat){
mat[row(mat) == col(mat)] <- -Inf
m <- max(mat[station, ],na.rm=TRUE)
if (is.finite(m)) {return(which( mat[station, ] == m ))}
else {return(NA)}
}
# fill the data.frame with the data.frame which has the highest correlation coefficient
na.fill <- function(x, y){
if(all(!is.finite(y[1:10,1]))) return(y)
i <- is.na(x[1:10,1])
xx <- y[1:10,1]
new <- data.frame(xx=xx)
x[1:10,1][i] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude),new)[i]
x
}
process.all <- function(df.list, mat){
f <- function(station)
na.fill(df.list[[ station ]], df.list[[ max.cor[station] ]])
g <- function(station){
x <- df.list[[station]]
if(any(!is.finite(x[1:10,1]))){
mat[row(mat) == col(mat)] <- -Inf
nas <- which(is.na(x[1:10,1]))
ord <- order(mat[station, ], decreasing = TRUE)[-c(1, ncol(mat))]
for(y in ord){
if(all(!is.na(df.list[[y]][1:10,1][nas]))){
xx <- df.list[[y]][1:10,1]
new <- data.frame(xx=xx)
x[1:10,1][nas] <- predict(lm(x[1:10,1]~xx, na.action=na.exclude), new)[nas]
break
}
}
}
x
}
n <- length(df.list)
nms <- names(df.list)
max.cor <- sapply(seq.int(n), get.max.cor, corhiver2008capt1)
df.list <- lapply(seq.int(n), f)
df.list <- lapply(seq.int(n), g)
names(df.list) <- nms
df.list
}
refill <- process.all(lst, corhiver2008capt1)
refill <- as.data.frame(refill) ########## HERE IS THE PROBLEM ######
refill
How about
if(sum(!is.na(ST3)) == 0) {
# skip whatever you would normally do and go on to the next vector
}
This assumes, of course, that you don't have any problems with, say, a vector of 1999 NAs and one numerical value.
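A minimal sketch of that guard applied over a list of data frames, using the ST1 to ST3 example from the question. The names all_missing and fill_step are hypothetical, and fill_step is only a placeholder (it fills NAs with the column mean) standing in for the real correlation-based filling.

```r
ST1 <- data.frame(x1 = 1:10)
ST2 <- data.frame(x2 = c(1:5, NA, NA, 8:10))
ST3 <- data.frame(x3 = rep(NA_real_, 10))
stations <- list(ST1 = ST1, ST2 = ST2, ST3 = ST3)

# Hypothetical helper: TRUE when the chosen column holds no usable values
all_missing <- function(df, col = 1) all(is.na(df[[col]]))

# Placeholder for the real filling step: replaces NAs with the column mean
fill_step <- function(df) {
  df[[1]][is.na(df[[1]])] <- mean(df[[1]], na.rm = TRUE)
  df
}

refill <- lapply(stations, function(df) {
  if (all_missing(df)) return(df)  # keep all-NA data frames unchanged
  fill_step(df)
})
sapply(refill, nrow)
```

Returning the input data frame itself, rather than the candidate donor or NULL, is what keeps every element of the result at the same number of rows. In the posted code, na.fill() returns y instead of x in its early exit, and y may be NULL when no finite correlation exists for a station, which could be where the NULL entries come from.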