I have a data frame with a combination of numeric and factor variables.
I am trying to recursively replace all outliers (more than 3 x SD from the mean) with NA, but I keep running into the following error:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
The code I was using is:
name = factor(c("A","B","NA","D","E","NA","G","H","H"))
height = c(120,NA,150,170,NA,146,132,210,NA)
age = c(10,20,0,30,40,50,60,NA,130)
mark = c(100,0.5,100,50,90,100,NA,50,210)
data = data.frame(name=name,mark=mark,age=age,height=height)
data
data[is.na(data)] <- 77777
data.scale <- scale(data)
data.scale[ abs(data.scale) > 3 ] <- NA
data <- data.scale
Any suggestions on how to get this working?
Here's one approach:
library(dplyr)
# take note of order for column names
data.names <- colnames(data)
# scale all numeric columns
data.numeric <- select_if(data, is.numeric) %>% # subset of numeric columns
  mutate_all(scale) # perform scale separately for each column
data.numeric[abs(data.numeric) > 3] <- NA # set values more than 3 SD from the mean, in either direction, to NA (none in this example)
# combine results with subset data frame of non-numeric columns
data <- data.frame(select_if(data, function(x) !is.numeric(x)),
data.numeric)
# restore columns to original order
data <- data[, data.names]
> data
name mark age height
1 A 0.20461856 -0.80009469 -1.0844636
2 B -1.43232992 -0.55391171 NA
3 NA 0.20461856 -1.04627767 -0.1459855
4 D -0.61796862 -0.30772873 0.4796666
5 E 0.04010112 -0.06154575 NA
6 NA 0.20461856 0.18463724 -0.2711159
7 G NA 0.43082022 -0.7090723
8 H -0.61796862 NA 1.7309707
9 H 2.01431035 2.15410109 NA
Note: the non-numeric (character / factor / etc) variables will be ordered before the numeric variables in this approach. Hence the last step restores the original order (if applicable).
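If you would rather keep the original units instead of the scaled values, the same idea can be sketched in base R. This is only a sketch: replace_outliers and the toy data frame are made-up names for illustration, and it assumes "outlier" means more than k standard deviations from the column mean.

```r
# Hypothetical helper: set values more than k SDs from the column mean to NA,
# leaving non-numeric columns (factors, characters) untouched.
replace_outliers <- function(df, k = 3) {
  num_cols <- vapply(df, is.numeric, logical(1))
  df[num_cols] <- lapply(df[num_cols], function(x) {
    z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)  # z-score per value
    x[!is.na(z) & abs(z) > k] <- NA                         # existing NAs stay NA
    x
  })
  df
}

# Toy data (an assumption, not the data from the question): one extreme value.
toy <- data.frame(
  id    = factor(c(letters[1:19], "z")),
  value = c(rep(10, 19), 1000)
)
cleaned <- replace_outliers(toy)
tail(cleaned, 3)
```

Because the z-scores are only used for the comparison, the surviving values are returned unscaled and the column order does not change.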
I have a folder with a couple hundred .csv files that I'd like to import and merge.
Each file contains two columns of data, but there are different numbers of rows, and the rows have different names. The columns don't have names (For this, let's say they're named x and y).
How can I merge these all together? I'd like to just stick the x columns together, side-by-side, rather than matching on any criteria so that the first row is matched across all data sets and empty rows are given NA.
I'd like column x to be dropped.
The rows, however, should stay in the order they had in the original csv.
Here's an example:
Data frame 112_c1.csv:
x y
1 -0.5604
3 -0.2301
4 1.5587
5 0.0705
6 0.1292
Dataframe 112_c2.csv:
x y
2 -0.83476
3 -0.82764
8 1.32225
9 0.36363
13 0.9373
42 -1.5567
50 -0.12237
51 -0.4837
Dataframe 113_c1.csv:
x y
5 1.5783
6 0.7736
9 0.28273
15 1.44565
23 0.999878
29 -0.223756
=
Desired result
112_c1.y 112_c2.y 113_c1.y
-0.5604 -0.83476 1.5783
-0.2301 -0.82764 0.7736
1.5587 1.32225 0.28273
0.0705 0.36363 1.44565
0.1292 0.9373 0.999878
NA -1.5567 -0.223756
NA -0.12237 NA
NA -0.4837 NA
I've tried a few things, and looked through many other threads. But code like the following simply produces NAs for any following columns:
df <- do.call(rbind.fill, lapply(list.files(pattern = "*.csv"), read.csv))
Plus, if I use rbind instead of rbind.fill I get the error that names do not match previous names and I'm unsure of how to circumvent this matching criteria.
Suggested solution using a function to calculate summary statistics right when loading data:
readCalc <- function(file_path) {
df <- read.csv(file_path)
return(data.frame(file=file_path,
column = names(df),
averages = apply(df, 2, mean),
N = apply(df, 2, length),
min = apply(df, 2, min),
stringsAsFactors = FALSE, row.names = NULL))
}
df <- do.call(rbind, lapply(list.files(pattern = "*.csv"), readCalc))
If we need the first or last value we could use dplyr::first, dplyr::last. We might even want to store the whole vector in a list somewhere, but if we only need the summary stats we might not even need it.
Here's a solution to read all your csv files from a folder called "data" and merge the y columns into a single dataframe. This assigns the file name as the column header.
library(tidyverse)
# store csv file paths
data_path <- "data" # path to the data
files <- dir(data_path, pattern = "*.csv") # get file names
files <- paste(data_path, '/', files, sep="")
# read csv files and combine into a single dataframe
compiled_data = tibble::tibble(File = files) %>% #create a tibble called compiled_data
tidyr::extract(File, "name", "(?<=data/)(.*)(?=[.]csv)", remove = FALSE) %>% #extract the file names
mutate(Data = lapply(File, readr::read_csv, col_names = F)) %>% #create a list column called Data that stores the contents of each file
tidyr::unnest(Data) %>% #unnest the Data column into multiple columns
select(-File) %>% #remove the File column
na.omit() %>% #remove the NA rows
spread(name, X2) %>% #reshape the dataframe from long to wide
select(-X1) %>% #remove the x column
mutate_all(funs(.[order(is.na(.))])) #reorganize dataframe to collapse the NA rows
Taken from here: cbind a dataframe with an empty dataframe - cbind.fill?
x <- c(1:6)
y <- c(1:3)
z <- c(1:10)
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
df <- as.data.frame(cbind.fill(x,y,z))
colnames(df) <- c("112_c1.y", "112_c2.y", "113_c1.y")
112_c1.y 112_c2.y 113_c1.y
1 1 1 1
2 2 2 2
3 3 3 3
4 4 NA 4
5 5 NA 5
6 6 NA 6
7 NA NA 7
8 NA NA 8
9 NA NA 9
10 NA NA 10
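To connect this back to the csv files in the question, here is a rough base R sketch of the same padding idea. It assumes the files have no header row and that y is the second column; the temporary folder and the write_demo helper exist only to make the example self-contained.

```r
# Write the three example files to a temporary folder (illustration only).
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
write_demo <- function(name, x, y) {
  write.table(data.frame(x, y), file.path(tmp, name),
              sep = ",", row.names = FALSE, col.names = FALSE)
}
write_demo("112_c1.csv", c(1, 3, 4, 5, 6),
           c(-0.5604, -0.2301, 1.5587, 0.0705, 0.1292))
write_demo("112_c2.csv", c(2, 3, 8, 9, 13, 42, 50, 51),
           c(-0.83476, -0.82764, 1.32225, 0.36363, 0.9373, -1.5567, -0.12237, -0.4837))
write_demo("113_c1.csv", c(5, 6, 9, 15, 23, 29),
           c(1.5783, 0.7736, 0.28273, 1.44565, 0.999878, -0.223756))

files <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
# Keep only the second (y) column of each header-less file, in file order.
ys <- lapply(files, function(f) read.csv(f, header = FALSE)[[2]])
n <- max(lengths(ys))
# Assigning a longer length pads a vector with NA at the end.
padded <- lapply(ys, function(y) { length(y) <- n; y })
names(padded) <- paste0(tools::file_path_sans_ext(basename(files)), ".y")
# Going through a matrix keeps column names that start with a digit.
merged <- as.data.frame(do.call(cbind, padded))
merged
```

Each column keeps its original row order and the shorter files are filled with NA at the bottom, which is the layout asked for in the question.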
I have a data frame that has three variables with the valid values of 1,2,3,4,5,6,7 for each variable. If there isn't a numeric value assigned to the variable, it will show NA. The data frame a looks like below:
ak_eth co_eth pa_eth
1 NA 1 NA
2 NA NA 1
3 NA NA NA
4 2 NA NA
5 NA NA 4
6 NA NA NA
Each row could have NA across all three variables or have only one value in one of the three variables. I want to create a new variable called recode that takes values from the existing three variables. If all three existing variables are NA, the new value is NA; if one of the three existing variables has a value, then take that value for the new variable.
I've tried this, but it didn't seem to work for me.
a$recode[is.na(a$ak_eth) & is.na(a$co_eth) & is.na(a$pa_eth)] <- "NA"
library(car)
a$recode <- recode(a$ak_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
a$recode <- recode(a$co_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
a$recode <- recode(a$pa_eth, "1=1;2=2;3=3;4=4;5=5;6=6;7=7")
Any suggestions will be appreciated. Thanks!
We can use pmax
a$Recode_Var <- do.call(pmax, c(a, na.rm = TRUE))
Or use pmin
a$Recode_Var <- do.call(pmin, c(a, na.rm = TRUE))
Or another option is rowSums
r1 <- rowSums(a, na.rm = TRUE)
a$Recode_Var <- replace(r1, r1==0, NA)
NOTE: This relies on the OP's statement that "each row could have NA across all three variables or have only one value in one of the three variables", so every approach above returns that single value (or NA when there is none).
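For reference, here is the pmax approach run on the data from the question, as a small sketch. The eth_cols vector is just a name introduced here so that re-running the line does not pick up the new column.

```r
# The six rows shown in the question.
a <- data.frame(
  ak_eth = c(NA, NA, NA, 2, NA, NA),
  co_eth = c(1, NA, NA, NA, NA, NA),
  pa_eth = c(NA, 1, NA, NA, 4, NA)
)

# Restrict the call to the three source columns.
eth_cols <- c("ak_eth", "co_eth", "pa_eth")
a$recode <- do.call(pmax, c(a[eth_cols], na.rm = TRUE))
a$recode
# [1]  1  1 NA  2  4 NA
```

Rows where all three variables are NA stay NA, because pmax with na.rm = TRUE only drops NAs when at least one value is present.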
How do I calculate column medians using the colMedians function? I get an error: Argument 'x' must be a matrix or a vector.
col_medians <- round(colMedians(impute_marks[,-1], na.rm=TRUE),0)
k <- which(is.na(impute_marks), arr.ind=TRUE)
impute_marks[k] <- col_medians[k[,-1]]
I need to do the operation below for every column in the data frame other than the first.
The code below works fine for a single column, but when I put it in a for loop I get an error about unknown courses.
impute_marks$c1[is.na(impute_marks$c1)] <- round(mean(impute_marks$c1[!is.na(impute_marks$c1)]),0)
Here, impute_marks is the name of the dataset and c1 is the column name.
Using the above operation I am able to find the mean and replace all NA values in column c1. But I have 30+ columns. How can I write this as a for loop that goes through each course and replaces the NA values with the column mean?
My function for the operation:
impute_marks$F27SA[is.na(impute_marks$F27SA)] <- round(mean(impute_marks$F27SA[!is.na(impute_marks$F27SA)]),0)
imputing_using_mean <- function()
{
courses <- names(impute_marks)[2:26]
for(i in seq_along(courses))
{
impute_marks$courses[[i]][is.na(impute_marks$courses[[i]])] <- round(mean(impute_marks$courses[[i]][!is.na(impute_marks$courses[[i]])]),0)
}
}
imputing_using_mean()
Essentially the same as the answer from @Aaron on "Replace NA values by row means", tweaked to account for the first column.
marks <- read.table(text="
a 1 NA 3
b 1 2 3
c NA NA NA
")
col_means <- round(colMeans(marks[,-1], na.rm=TRUE), 0)
k <- which(is.na(marks), arr.ind=TRUE)
marks[k] <- col_means[k[,2]-1]
# V1 V2 V3 V4
#1 a 1 2 3
#2 b 1 2 3
#3 c 1 2 3
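As for the loop in the question: impute_marks$courses looks for a column literally named "courses", and changes made inside a function are lost unless the data frame is returned. A minimal sketch of a working loop, using [[ ]] to index by name; the small impute_marks below is a stand-in with made-up values, and impute_with_mean is a hypothetical name.

```r
# Stand-in data (assumed layout): an id column followed by mark columns.
impute_marks <- data.frame(
  student = c("s1", "s2", "s3", "s4"),
  c1 = c(50, NA, 70, 61),
  c2 = c(NA, 80, 90, NA)
)

impute_with_mean <- function(df, cols) {
  for (col in cols) {
    missing <- is.na(df[[col]])
    # [[col]] looks the column up by the name stored in col
    df[[col]][missing] <- round(mean(df[[col]], na.rm = TRUE), 0)
  }
  df  # return the modified copy; the caller has to assign it
}

impute_marks <- impute_with_mean(impute_marks, names(impute_marks)[-1])
impute_marks
```

Passing the data frame in and assigning the result back avoids relying on a global variable being modified from inside the function.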
Below is a solution that calculates the median of each column and replaces every NA with the median of its own column. The same works for the mean with colMeans, except that the step converting to a matrix is not required.
library(matrixStats) # provides colMedians
# first convert the mark columns (everything except the first column) to a matrix
matrix_marks <- as.matrix(impute_marks[, -1])
# calculate the median for each column
col_medians <- round(colMedians(matrix_marks, na.rm=TRUE),0)
# get the row and column index of each NA value
k <- which(is.na(matrix_marks), arr.ind=TRUE)
# finally replace those values with the median of their column
matrix_marks[k] <- col_medians[k[, 2]]
In R, my data is:
a <- c('1','2','3','1','1')
b <- c('3','1','2','1','2')
j <- data.frame(a,b)
rowSums(j) #error
How can I calculate sum of the row?
In case you have real character vectors (not factors like in your example) you can use data.matrix in order to convert all the columns to numeric class
j <- data.frame(a, b, stringsAsFactors = FALSE)
rowSums(data.matrix(j))
## [1] 4 3 5 2 3
Otherwise (with factor columns), you will have to convert first to character and then to numeric so that the values, rather than the internal level codes, are summed
rowSums(sapply(j, function(x) as.numeric(as.character(x))))
## [1] 4 3 5 2 3
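To see why the detour through as.character matters, here is a small sketch with values that differ from their level codes. The data is made up for illustration (the question's values 1, 2, 3 happen to coincide with their codes).

```r
# Factor columns whose labels are numbers stored as text.
a <- factor(c("10", "20", "30", "10", "10"))
b <- factor(c("30", "10", "20", "10", "20"))
j <- data.frame(a, b)

as.numeric(j$a)                # level codes, not the values
# [1] 1 2 3 1 1
as.numeric(as.character(j$a))  # the actual values
# [1] 10 20 30 10 10

# Convert every column, then sum across rows.
j_num <- as.data.frame(lapply(j, function(x) as.numeric(as.character(x))))
rowSums(j_num)
```

The same conversion also works on plain character columns, so it is a safe default when you are not sure which type you have.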
How do I create a fixed size data frame of size [40 2], declare the first column with unique strings, and populate the other with specific values? Again, I want the first column to be the list of strings; I don't want a row of headers.
(Someone please give me some pointers. I haven't programmed in R for a while and my R skills are terrible to begin with.)
Two approaches:
# sequential strings
library(stringr)
df.1 <- data.frame(id=paste0("X",str_pad(1:40,2,"left","0")),value=NA)
head(df.1)
# id value
# 1 X01 NA
# 2 X02 NA
# 3 X03 NA
# 4 X04 NA
# 5 X05 NA
# 6 X06 NA
Second Approach:
# random strings
rstr <- function(n,k){
sapply(1:n,function(i){do.call(paste0,as.list(sample(letters,k,replace=T)))})
}
set.seed(1)
df.2 <- data.frame(id=rstr(40,5),value=NA)
head(df.2)
# id value
# 1 gjoxf NA
# 2 xyrqb NA
# 3 ferju NA
# 4 mszju NA
# 5 yfqdg NA
# 6 kajwi NA
The function rstr(n,k) produces a vector of length n with each element being a string of k random lowercase letters. rstr(...) does not guarantee that all strings are unique, but the probability of at least one duplicate is roughly n(n-1)/(2*26^k), i.e. O(n^2/26^k); for n = 40 and k = 5 that is below 1 in 10,000.
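If the ids must be unique with certainty, one option is to keep drawing until enough distinct strings have been collected. This is only a sketch: unique_rstr is a hypothetical helper, and rstr is rewritten here so the block runs on its own.

```r
# Random strings of k lowercase letters (same idea as rstr above).
rstr <- function(n, k) {
  vapply(seq_len(n),
         function(i) paste(sample(letters, k, replace = TRUE), collapse = ""),
         character(1))
}

# Hypothetical helper: top up with fresh draws until n distinct strings exist.
unique_rstr <- function(n, k) {
  stopifnot(n <= 26^k)  # more strings than possible combinations cannot work
  out <- character(0)
  while (length(out) < n) {
    out <- unique(c(out, rstr(n - length(out), k)))
  }
  out
}

set.seed(1)
ids <- unique_rstr(40, 5)
df.3 <- data.frame(id = ids, value = NA)
anyDuplicated(df.3$id)  # 0 means no duplicates
```

For n = 40 and k = 5 the loop will almost always finish on the first pass; the retry only matters when n gets close to 26^k.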
Create the data.frame and define its columns with the values.
The recycling rule repeats the strings to match the 40 rows defined by the second column.
df <- data.frame(x = c("unique_string 1", "unique_string 2"), y = rpois(40, 2))
# Change column names
names(df) <- c("string_col", "num_col")
I found this way of creating data frames in R extremely productive and easy:
create a raw vector of values, convert it into a matrix of the required dimensions, and finally name the columns and rows.
dataframe.values = c(value1, value2,.......)
dataframe = matrix(dataframe.values,nrow=number of rows ,byrow = T)
colnames(dataframe) = c("column1","column2",........)
row.names(dataframe) = c("row1", "row2",............)
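A concrete version of that template might look like the sketch below. The values and column names are invented for illustration. Note that a matrix holds a single type, so mixing strings and numbers turns everything into character, and the result is still a matrix until it is converted.

```r
# Values listed row by row: a label followed by a number.
values <- c("alpha", 1,
            "beta",  2,
            "gamma", 3)
m <- matrix(values, nrow = 3, byrow = TRUE)
colnames(m) <- c("label", "count")

# Convert the matrix to a data frame, then restore the numeric column.
df <- as.data.frame(m, stringsAsFactors = FALSE)
df$count <- as.numeric(df$count)
str(df)
```

If the columns have different types to begin with, building them directly with data.frame(), as in the next example, skips the conversion step.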
exampledf <- data.frame(columnofstrings=c("a string", "another", "yetanother"),
columnofvalues=c(2,3,5) )
gives
> exampledf
columnofstrings columnofvalues
1 a string 2
2 another 3
3 yetanother 5