I have a data frame named df with 200+ variables and 300,000+ observations (200+ columns, 300,000+ rows).
The end goal of my R code is to find the outliers in each column and replace them with a certain value, say NA. If the value is already NA, skip it and move on to the next iteration.
for (j in 1:ncol(df)){
outnumtext <- paste0('out_value <- boxplot.stats(df$',colnames(df[j]),')$out')
eval(parse(text=outnumtext))
for (k in 1:nrow(df)){
replacetext <- paste0('
if ((df[',k,',',j,'] %in% out_value) & !(is.na(df[',k,',',j,']))) {
df[',k,',',j,'] <- NA
} else if (is.na(df[',k,',',j,'])) {
next
} else {
next
}')
eval(parse(text=replacetext))
}
}
I discovered that using a for loop in R and looping through every row of every column slows the run down considerably. Are there any alternatives to this?
Thank you very much in advance!
Edit P/S: The real code does not just replace outliers with NA; it handles them in several ways depending on several conditions (the if and else if branches run accordingly). My goal, however, is to find an alternative that reduces the running time, so I simplified my original code as much as possible to get to the main point.
You don't want to use loops for this. You could try dplyr::mutate_all().
It will still be slow over 300K+ rows, but should be better than the loop.
library(dplyr)
df <- df %>%
  mutate_all(~ ifelse(. %in% boxplot.stats(.)$out, NA, .))  # formula form; funs() is deprecated
Example:
exdata <- structure(list(x = c(200, 6, 8, 2, 7, 1, 4, 9, 3, 5, 1000),
y = c(300, 1, 18, 3, 2, 16, 14, 9, 11, 6, 100)),
row.names = c(NA, -11L),
class = "data.frame")
exdata
x y
1 200 300
2 6 1
3 8 18
4 2 3
5 7 2
6 1 16
7 4 14
8 9 9
9 3 11
10 5 6
11 1000 100
exdata %>%
  mutate_all(~ ifelse(. %in% boxplot.stats(.)$out, NA, .))
x y
1 NA NA
2 6 1
3 8 18
4 2 3
5 7 2
6 1 16
7 4 14
8 9 9
9 3 11
10 5 6
11 NA NA
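If you would rather stay in base R, the same idea can be sketched with lapply, which processes one whole column per call instead of one cell at a time. This is only a sketch: replace_outliers is a made-up helper name, and it assumes every column is numeric.

```r
# Hypothetical helper: set the boxplot outliers of one numeric vector to NA
replace_outliers <- function(col) {
  out_value <- boxplot.stats(col)$out   # outliers by the 1.5 * IQR rule
  col[col %in% out_value] <- NA         # vectorised replacement, no row loop
  col
}

exdata <- data.frame(x = c(200, 6, 8, 2, 7, 1, 4, 9, 3, 5, 1000),
                     y = c(300, 1, 18, 3, 2, 16, 14, 9, 11, 6, 100))

# exdata[] <- keeps the data frame structure while replacing every column
exdata[] <- lapply(exdata, replace_outliers)
exdata
```

Values that are already NA stay NA, because boxplot.stats leaves missing values out of its outlier list. Any extra conditions from the real code could go inside the helper, as long as they are written as vectorised tests on col.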
I have a dataset with 54285 observations. What I need is to assign randomly 50% of the rows into another dataframe, 30% into another dataset, and the rest (20%) into another one. This should be done without duplicates.
This is an example:
data<-data.frame(numbers=c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
data
1
2
3
4
5
6
7
8
9
10
What I expect would be:
df1
5
3
8
1
7
df2
2
4
9
df3
6
10
Multiply each ratio by the number of rows in the dataset to get the group sizes, then split the data into separate data frames.
set.seed(123)
result <- split(data, sample(rep(1:3, nrow(data) * c(0.5, 0.3, 0.2))))
names(result) <- paste0('df', seq_along(result))
list2env(result, .GlobalEnv)
df1
# numbers
#1 1
#3 3
#7 7
#9 9
#10 10
df2
# numbers
#4 4
#5 5
#8 8
df3
# numbers
#2 2
#6 6
For large data frames, using sample with the prob argument should work as well. Note, however, that this might not give you exactly the number of rows you expect, unlike the rep approach above.
result <- split(data, sample(1:3, nrow(data), replace = TRUE, prob = c(0.5, 0.3, 0.2)))
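When the number of rows times a ratio is not a whole number (as with 54285 rows), the group sizes have to be rounded somehow. One possible way, sketched below, is to round down and give the leftover rows to the first group. The data frame here is a stand-in for the real one, and the choice of which group receives the remainder is an assumption.

```r
set.seed(123)
data <- data.frame(numbers = seq_len(54285))  # stand-in for the real dataset

props <- c(0.5, 0.3, 0.2)
n <- nrow(data)

# Round down, then hand the leftover rows to the first group so the sizes sum to n
sizes <- floor(n * props)
sizes[1] <- sizes[1] + (n - sum(sizes))

# One shuffled group label per row, so every row lands in exactly one group
groups <- sample(rep(seq_along(props), times = sizes))
result <- split(data, groups)
names(result) <- paste0("df", seq_along(result))

sapply(result, nrow)  # group sizes; they add up to nrow(data)
```

Because each row receives exactly one label, no row can appear in two of the resulting data frames.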
I have two variables in my data. The first is vein size (it contains NAs).
The second is procedure site (values 1, 2, 3, 4).
I want to impute a different value for vein size depending on the procedure site. I tried if else, but without success. For example: if procedure site is 1 or 2, impute 3; if procedure site is 3, impute 4; if procedure site is 4, impute 5. I am new to this field. Any help is much appreciated!
vein.size<-c(3,3,3,NA,NA,NA)
procedure.site<-c(1,2,2,3,4,4)
df<-cbind(vein.size,procedure.site)
My expected output is:
vein.size<-c(3,3,3,4,5,5)
Thank you
You can use a chain of ifelse statements or try case_when from dplyr:
library(dplyr)
df <- df %>%
mutate(output = case_when(is.na(vein.size) & procedure.site %in% 1:2 ~ 3,
is.na(vein.size) & procedure.site == 3 ~ 4,
is.na(vein.size) & procedure.site == 4 ~ 5,
TRUE ~ vein.size))
# vein.size procedure.site output
#1 3 1 3
#2 3 2 3
#3 3 2 3
#4 NA 3 4
#5 NA 4 5
#6 NA 4 5
data
vein.size<-c(3,3,3,NA,NA,NA)
procedure.site<-c(1,2,2,3,4,4)
df<-data.frame(vein.size,procedure.site)
You could use a lookup table and then merge:
# your data
vein.size <- c(3,3,3,NA,NA,NA)
procedure.site <- c(1,2,2,3,4,4)
your_df <- data.frame(vein.size = vein.size,
procedure.site = procedure.site)
# the lookup table
lookup_df <- data.frame(
procedure.site = c(1, 2, 3, 4),
imputation = c(3, 3, 4, 5)
)
# result
merge(your_df, lookup_df, by='procedure.site')
Which gives:
procedure.site vein.size imputation
1 1 3 3
2 2 3 3
3 2 3 3
4 3 NA 4
5 4 NA 5
6 4 NA 5
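The merge only puts the imputation value next to each row; the missing vein sizes still have to be filled in from it. A minimal sketch of that last step, reusing the names from the answer above (merged is a name introduced here for illustration):

```r
vein.size <- c(3, 3, 3, NA, NA, NA)
procedure.site <- c(1, 2, 2, 3, 4, 4)
your_df <- data.frame(vein.size = vein.size,
                      procedure.site = procedure.site)

lookup_df <- data.frame(procedure.site = c(1, 2, 3, 4),
                        imputation = c(3, 3, 4, 5))

merged <- merge(your_df, lookup_df, by = "procedure.site")

# Use the looked-up value only where vein.size is missing
merged$vein.size <- ifelse(is.na(merged$vein.size),
                           merged$imputation,
                           merged$vein.size)
merged$imputation <- NULL  # drop the helper column
merged
```

Keep in mind that merge sorts the result by the key column, so the row order can differ from the original data frame.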
I am trying to automate a process to complete missing values on a sequence of variables using an ifelse statement and mutate_all function. The problem involves a dataframe with many variable names, for example, ax1, bx1, ...zx1, ax2, bx2, ...zx2, ax3, bx3, ...zx3. The following data give a small scenario:
df<-data.frame(
"id" = c(1:5),
"ax1" = c(1, "NA", 8, "NA", 17),
"bx1" = c(2, 7, "NA", 11, 12),
"ax2" = c(2, 1, 8, 15, 17),
"bx2" = c(2, 6, 4, 13, 11))
The process is to replace the missing values in the variables ending in "x1" with the corresponding values from the variables ending in "x2". That is, if ax1 is missing it is replaced by ax2, any missing value in bx1 is replaced by bx2, and so on. Since there are many more variables than in the scenario presented here, I am looking for a way to automate this process. I have tried the following code
library(dplyr)
df <- df %>%
mutate_all(vars(ends_with("x1", "x2")), function(x,y)
ifelse(is.na(x), y, x)))
but it does not work. I greatly appreciate any help on this.
The expected output is
id ax1 bx1 ax2 bx2
1 1 2 2 2
2 1 7 1 6
3 8 4 8 4
4 15 11 15 13
5 17 12 17 11
In base R, we can replace the NA values in the x1 columns with the corresponding values from the x2 columns using Map.
x1_cols <- grep('x1$', names(df))
x2_cols <- grep('x2$', names(df))
df[x1_cols] <- Map(function(x, y) {x[is.na(x)] <- y[is.na(x)];x},
df[x1_cols], df[x2_cols])
df
# id ax1 bx1 ax2 bx2
#1 1 1 2 2 2
#2 2 1 7 1 6
#3 3 8 4 8 4
#4 4 15 11 15 13
#5 5 17 12 17 11
We can use the same logic and use purrr::map2
df[x1_cols] <- purrr::map2(df[x1_cols], df[x2_cols],
~{.x[is.na(.x)] <- .y[is.na(.x)];.x})
data
I modified the data a bit so that the NAs are actual NA values and not the string "NA", which was turning the columns into factors.
df<-data.frame(id=c(1:5),
ax1=c(1,NA,8,NA,17),
bx1=c(2,7,NA,11,12),
ax2=c(2,1,8,15,17),
bx2=c(2,6,4,13,11))
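If the real data arrives with the string "NA", as in the question, the columns can be converted before the Map step. The sketch below shows one way to do that in base R; to_numeric is a hypothetical helper name, and it assumes every column is meant to be numeric.

```r
# Data as posted in the question: "NA" is a string, so ax1 and bx1 are not numeric
df <- data.frame(id  = 1:5,
                 ax1 = c(1, "NA", 8, "NA", 17),
                 bx1 = c(2, 7, "NA", 11, 12),
                 ax2 = c(2, 1, 8, 15, 17),
                 bx2 = c(2, 6, 4, 13, 11))

# Hypothetical helper: turn the string "NA" into a real NA, then convert to numeric
to_numeric <- function(x) {
  x <- as.character(x)   # also covers factor columns
  x[x %in% "NA"] <- NA
  as.numeric(x)
}
df[] <- lapply(df, to_numeric)

# Same Map logic as above: fill NAs in each x1 column from its x2 partner
x1_cols <- grep("x1$", names(df))
x2_cols <- grep("x2$", names(df))
df[x1_cols] <- Map(function(x, y) { x[is.na(x)] <- y[is.na(x)]; x },
                   df[x1_cols], df[x2_cols])
df
```

This relies on the x1 and x2 columns appearing in the same order, so that grep pairs ax1 with ax2 and bx1 with bx2.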
I need to calculate a lag or lead mean between two sequential values in a table and then output the means to a new column. I can write a for loop for this operation, but would prefer to avoid this so that the code is more flexible. Is it possible to do this operation in dplyr and tidyr? Below is an example data set and the desired result. Thanks in advance.
DATA = data.frame(POO = c(2, 4, 6, 8, 10 , 20))
RESULTS = data.frame(POO = c(2, 4, 6, 8, 10 , 20), YEY = c(0,3,5,7,9,15))
Use filter:
DATA$YEY <- filter(DATA$POO, c(1, 1)/2, sides = 1)
# POO YEY
#1 2 NA
#2 4 3
#3 6 5
#4 8 7
#5 10 9
#6 20 15
You can then substitute NA with 0, but I don't understand the logic behind that.
Note that filter is unfortunately masked by dplyr. You might need to use stats::filter if you have attached dplyr.
There's also a way in dplyr:
DATA %>%
mutate(YEY = (POO + lag(POO)) / 2)
This also has NA in the first row, which you could fix afterwards if you need to.
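For comparison, the same pairwise mean can be sketched in plain base R without any package, by averaging the vector with a shifted copy of itself. Putting 0 in the first row is an assumption taken from the desired RESULTS in the question.

```r
DATA <- data.frame(POO = c(2, 4, 6, 8, 10, 20))

# head(x, -1) drops the last value and tail(x, -1) drops the first,
# so their element-wise mean is the mean of each consecutive pair
pair_means <- (head(DATA$POO, -1) + tail(DATA$POO, -1)) / 2

# The first row has no previous value; use 0 there as in the desired output
DATA$YEY <- c(0, pair_means)
DATA
```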
df1<-structure(list(POO = c(2, 4, 6, 8, 10, 20)), .Names = "POO", row.names = c(NA,
-6L), class = "data.frame")
library(dplyr)
library(zoo) # for rollmean function
df1 %>% # df1 is your data frame
mutate(TEY=rollmean(POO,2,fill=0,align="right"))
POO TEY
1 2 0
2 4 3
3 6 5
4 8 7
5 10 9
6 20 15
I have 2 files, each with 3 columns and hundreds of rows. I want to compare the files and list the rows whose first two columns appear in both. To that list I then need to add the third column of the second file, that is, the values from the second file that correspond to the pairs of numbers common to both files.
For example, consider two files of 6 rows and 3 columns
First file -
1 2 3
2 3 4
4 6 7
3 8 9
11 10 5
19 6 14
second file -
1 4 1
2 1 4
4 6 10
3 7 2
11 10 3
19 6 5
As I said, I have to compare the first two columns and then add the third column of the second file to that list. Therefore, the output must be:
4 6 10
11 10 3
19 6 5
I have the following code, but it throws an "object not found" error, and I am also unable to add the third column. Please help :)
df2 holds the first file and df3 holds the second file. The code is in R.
s1 = 1
for(i in 1:nrow(df2)){
for(j in 1:nrow(df3)){
if(df2[i,1] == df3[j,1]){
if(df2[i,2] == df3[j,2]){
common.rows1[s1,1] <- df2[i,1]
common.rows1[s1,2] <- df2[i,2]
s1 = s1 + 1
}
}
}
You can use the %in% operator twice to subset your second data.frame (I call it df2):
df2[df2$V1 %in% df1$V1 & df2$V2 %in% df1$V2,]
# V1 V2 V3
#3 4 6 10
#5 11 10 3
#6 19 6 5
V1 and V2 in my example are the column names of df1 and df2.
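One caveat with this approach: the two %in% tests check each column on its own, so a row can pass even if its (V1, V2) pair never occurs together in the other table. It happens to work on the example data, but the small made-up tables below show the difference, along with a pairwise check that builds a combined key with paste.

```r
# Made-up tables: the values overlap column by column, but no (V1, V2) pair matches
df1 <- data.frame(V1 = c(1, 2), V2 = c(10, 20))
df2 <- data.frame(V1 = c(1, 2), V2 = c(20, 10), V3 = c(100, 200))

# Independent tests: both rows pass, although no pair is shared
loose <- df2[df2$V1 %in% df1$V1 & df2$V2 %in% df1$V2, ]
nrow(loose)   # 2

# Pairwise test: compare the two columns as one combined key
strict <- df2[paste(df2$V1, df2$V2) %in% paste(df1$V1, df1$V2), ]
nrow(strict)  # 0
```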
It seems that this is the perfect use-case for merge, e.g.
merge(d1[c('V1','V2')],d2)
results in:
V1 V2 V3
1 11 10 3
2 19 6 5
3 4 6 10
In which 'V1' and 'V2' are the column names of interest.
data.table proposal
library(data.table)
setDT(df1)
setDT(df2)
setkey(df1, V1, V2)
setkey(df2, V1, V2)
df2[df1[, -3, with = FALSE], nomatch = 0]
## V1 V2 V3
## 1: 4 6 10
## 2: 11 10 3
## 3: 19 6 5
If your two tables are d1 and d2,
d1<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(2, 3, 6, 8, 10, 6),
V3 = c(3, 4, 7, 9, 5, 14)
)
d2<-data.frame(
V1 = c(1, 2, 4, 3, 11, 19),
V2 = c(4, 1, 6, 7, 10, 6),
V3 = c(1, 4, 10, 2, 3, 5)
)
then you can subset d2 (in order to keep the third column) with
d2[interaction(d2$V1, d2$V2) %in% interaction(d1$V1, d1$V2),]
The interaction() treats the first two columns as a combined key.