I am trying to calculate the row-wise average of some columns and store it in another column, but I am getting an error.
I converted all the "N" strings to NA with na.strings = c("N"), but afterwards the class of the columns is still character.
So I now have NA in place of N in the data frame, but the class of the columns is still character.
df <- data.frame("T_1_1"= c(68,24,"N",105,58,"N",135,126,24),
"T_1_2"=c(26,105,"N",73,39,97,46,108,"N"),
"T_1_3"=c(93,32,73,103,149,"N",147,113,139),
"S_2_1"=c(69,67,94,"N",77,136,137,92,73),
"S_2_2"=c(87,67,47,120,85,122,"N",96,79),
"S_2_3"= c(150,"N",132,121,29,78,109,40,"N"),
"TS1_av"=c(68.5,45.5,94,105,67.5,136,136,109,48.5),
"TS2_av"=c(56.5,86,47,96.5,62,109.5,46,102,79),
"TS3_av"=c(121.5,32,102.5,112,89,78,128,76.5,139)
)
df$TS1_av <- rowMeans(df[,c(as.numeric(as.character("T_1_1","S_2_1")))], na.rm=TRUE)
You can use:
#Change 'N' to NA
df[df == 'N'] <- NA
#Change the type of columns
df <- type.convert(df, as.is = TRUE)
#Take mean of selected columns and add a new column
df$TS1_av <- rowMeans(df[,c("T_1_1","S_2_1")], na.rm=TRUE)
df
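For reference, here is a minimal sketch of why the original call fails and what the steps above do, on a small made-up data frame (toy and its columns a and b are hypothetical, not the asker's data):

```r
# Toy data frame (hypothetical); "N" marks a missing value, so both columns are character
toy <- data.frame(a = c("1", "N", "3"), b = c("4", "5", "N"),
                  stringsAsFactors = FALSE)

# as.character() only uses its first argument, so the second name is silently dropped;
# as.numeric() of a column name is then NA, which is not a usable column index
as.character("a", "b")

toy[toy == "N"] <- NA                    # recode the marker as a real NA
toy <- type.convert(toy, as.is = TRUE)   # re-detect the column types
sapply(toy, class)                       # a and b are no longer character
toy$av <- rowMeans(toy[, c("a", "b")], na.rm = TRUE)
toy$av
```

Selecting columns by name directly, as in df[, c("T_1_1", "S_2_1")], avoids the coercion problem entirely once the columns themselves are numeric.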
You could use readr::parse_number to extract the numbers and replace any string that can't be converted to numeric with NA.
The na argument lets you specify strings to be interpreted as NA (here 'N'). If you don't supply it, you get a warning for every string that can't be parsed, but those strings are still replaced by NA.
library(dplyr)
library(readr)
df <- df %>% mutate(across(where(is.character), ~ readr::parse_number(.x, na = 'N')))
df$TS1_av <- rowMeans(df[,c("T_1_1","S_2_1")], na.rm=TRUE)
df
Two base R solutions:
# Columns to subset out: cols => character vector
cols <- c("T_1_1", "S_2_1")
# Option 1: calculate the mean row-wise: TS1_av => numeric vector
df$TS1_av <- apply(df[,cols], 1, function(x){
mean(suppressWarnings(as.numeric(x)), na.rm = TRUE)
}
)
# Columns to subset out: cols => character vector
cols <- c("T_1_1", "S_2_1")
# Option 2: Coerce to numeric and calculate the row mean:
# TS1_av => numeric vector
df$TS1_av <- rowMeans(
suppressWarnings(
vapply(df[,cols], as.numeric, numeric(nrow(df)))
),
na.rm = TRUE
)
Related
I am trying to get the names and the positions of the columns that have "NA" in them. I have tried using this code
names(df)[sapply(df, anyNA)]
but it only gives me the column names and not the positions.
Any idea how to get both in the output?
We can convert the logical vector to an index with which, and subset names(df) with the logical vector to get the column names
i1 <- sapply(df, anyNA)
which(i1)
names(df)[i1]
We may not need names(df)[i1], though, because which already returns a named index vector, i.e.
which(sapply(df, anyNA))
is a one-liner that gives both the column names and their positions
Or with dplyr
library(dplyr)
df %>%
summarise(across(where(anyNA), ~ match(cur_column(), names(df))))
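As an illustration of the which approach above, here is a small sketch on a made-up data frame (toy and na_cols are hypothetical names) showing how the names, the positions and the count can all be read off the same named vector:

```r
# Toy data frame (hypothetical): columns x and z contain NA, y does not
toy <- data.frame(x = c(1, NA, 3),
                  y = c("a", "b", "c"),
                  z = c(NA, TRUE, FALSE))

# Named integer vector: names are the column names, values are the positions
na_cols <- which(sapply(toy, anyNA))
na_cols

names(na_cols)    # column names only
unname(na_cols)   # column positions only
length(na_cols)   # number of columns with at least one NA
```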
Using which with colSums of is.na:
which(colSums(is.na(iris_na)) > 0)
# Sepal.Length Petal.Length
# 1 3
Data:
iris_na <- iris
iris_na[c(1, 3)] <- lapply(iris_na[c(1, 3)], \(x) replace(x, sample(length(x), length(x)/10), NA_real_))
I would like to create a function to replace NA by the text "NR" in factor variables of a data frame.
I found the code below on the web, and it works perfectly:
i <- sapply(data_5, is.factor) # Identify all factor variables in your data
data_5[i] <- lapply(data_5[i], as.character) # Convert factors to character variables
data_5[is.na(data_5)] <- 0 # Replace NA with 0
data_5[i] <- lapply(data_5[i], as.factor) # Convert character columns back to factors
But I would like to turn this code into a function called "remove_na_factor". I tried the following:
remove_na_factor <- function(x){
i <- sapply(x, is.factor) # Identify all factor variables in your data
x[i] <- lapply(x[i], as.character) # Convert factors to character variables
x[is.na(x)] <- "NR" # Replace NA with NR
x[i] <- lapply(x[i], as.factor) # Convert character columns back to factors
}
When I run the function on a data frame with NA values, nothing happens ...
Thanks in advance for your help.
Just add return(x) at the end of your function:
remove_na_factor <- function(x){
#your function body
return(x)
}
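Spelled out in full, the fix might look like the sketch below. The function body is the one from the question; the toy data frame is made up for illustration. Two things matter: the function has to end by returning x (the last expression in the original is an assignment, whose value is returned invisibly and is not the data frame), and the caller has to assign the result, because R does not modify the argument in place.

```r
remove_na_factor <- function(x) {
  i <- sapply(x, is.factor)            # identify the factor columns
  x[i] <- lapply(x[i], as.character)   # factors -> character
  x[is.na(x)] <- "NR"                  # replaces NA in every column that has one
  x[i] <- lapply(x[i], as.factor)      # convert the same columns back to factor
  x                                    # return the modified copy
}

# Hypothetical example data: one factor column with an NA, one numeric column
toy <- data.frame(f = factor(c("a", NA, "b")), n = c(1, 2, 3))

toy <- remove_na_factor(toy)   # assign the result back
toy
```

Note that x[is.na(x)] <- "NR" is not restricted to factor columns, so a numeric column containing NA would become character; restrict the replacement to x[i] if that is not wanted.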
You can also get the same result using a tidyverse approach
library(tidyverse)
x %>%
mutate_if(is.factor, as.character) %>% # Convert factors to character variables
mutate_if(is.character, replace_na, "NR") %>% # Replace NA with NR
mutate_if(is.character, as.factor) # Convert character columns back to factors
Suppose I have a data.frame like THIS (or see my code below). As you can see, after every few consecutive rows there is a row consisting entirely of NAs.
I was wondering how I could split THIS data.frame at every row of NAs?
For example, in my code below, I want my original data.frame to be split into 3 smaller data.frames as there are 2 rows of NAs in the original data.frame.
Here is what I tried, with no success:
## The original data.frame:
DF <- read.csv("https://raw.githubusercontent.com/izeh/i/master/m.csv", header = T)
## the index number of rows with "NA"s; Here rows 7 and 14:
b <- as.numeric(rownames(DF[!complete.cases(DF), ]))
## split DF by rows that have "NA"s; that is rows 7 and 14:
split(DF, b)
If we also need to keep the NA rows, create a grouping variable with cumsum on the 'study.name' column, which is blank (or NA) in those rows
library(dplyr)
DF %>%
group_split(grp = cumsum(lag(study.name == "", default = FALSE)), .keep = FALSE)
Or with base R
split(DF, cumsum(c(FALSE, head(DF$study.name == "", -1))))
Or with NA
i1 <- rowSums(is.na(DF))== ncol(DF)
split(DF, cumsum(c(FALSE, head(i1, -1))))
Or based on 'b'
DF1 <- DF[setdiff(seq_len(nrow(DF)), b), ]
split(DF1, as.character(DF1$study.name))
You can mark where the row numbers in b occur in the sequence of rows of DF and use cumsum on that to create the groups.
split(DF, cumsum(seq_len(nrow(DF)) %in% b))
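To make the cumsum idea concrete, here is a small sketch on a made-up data frame (toy, all_na, grp and parts are hypothetical names). Unlike the answers above, this variant drops the separator rows before splitting, which is an assumption about what is wanted:

```r
# Toy data frame (hypothetical): rows 3 and 6 are entirely NA and act as separators
toy <- data.frame(id = c(1, 2, NA, 3, 4, NA, 5),
                  v  = c("a", "b", NA, "c", "d", NA, "e"))

all_na <- rowSums(is.na(toy)) == ncol(toy)   # TRUE for the separator rows
grp    <- cumsum(all_na)                     # counter that increases at each separator

# Drop the separator rows, then split the remaining rows by the counter
parts <- split(toy[!all_na, ], grp[!all_na])
length(parts)
parts
```

Two separator rows yield three pieces, matching the count described in the question.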
With the code below I read data from a website.
The problem is that it reads the data as character rather than numeric, especially columns such as "Enlem(N)" and "Boylam(E)".
How can I fix this?
library(rvest)
widths <- c(11,10,10,10,14,5,5,5,48,100)
dat <- "http://www.koeri.boun.edu.tr/scripts/lst5.asp" %>%
read_html %>%
html_nodes("pre") %>%
html_text %>%
textConnection %>%
read.fwf(widths = widths, stringsAsFactors = FALSE) %>%
setNames(nm = .[6,]) %>%
tail(-7) %>%
head(-2)
If you know which columns should be numeric, you can convert just those columns. If you do not, you can write a function that looks at the data and converts a column to numeric when a large enough share of its values parse as numbers. I have used the function below for this purpose:
NumericColumns <- function(x, AllowedPercentNumeric =.95, PreserveDate=TRUE, PreserveColumns){
# find the counts of NA values in input data frame's columns
param_originalNA <- apply(x, 2, function(z){sum(is.na(z))})
# blindly coerce data.frame to numeric
df_JustNumbers <- suppressWarnings(as.data.frame(lapply(x, as.numeric)))
# Percent Non-NA values in each column
PercentNumeric <- (apply(df_JustNumbers, 2, function(x)sum(!is.na(x))))/(nrow(x)-param_originalNA)
rm(param_originalNA)
# identify columns which have a greater than or equal percentage of numeric as specified
param_numeric <- names(PercentNumeric)[PercentNumeric >= AllowedPercentNumeric]
# Remove columns from list to convert to numeric that are specified as to preserve
if (!missing(PreserveColumns)){param_numeric <- setdiff(param_numeric, PreserveColumns)}
# Identify columns that are dates initially
IsDateColumns <- lapply(x, function(y)(is(y, "Date")|is(y, "POSIXct")))
param_dates <- names(IsDateColumns)[IsDateColumns==TRUE]
# Remove dates from list if specified to preserve dates
if (PreserveDate){param_numeric <- setdiff(param_numeric, param_dates)}
# returns column position of numeric columns in target data frame
param_numeric <- match(param_numeric, colnames(df_JustNumbers))
# removes NA's from column list
param_numeric <- param_numeric[complete.cases(param_numeric)]
# coerces columns in param_numeric to numeric and inserts numeric columns into target data.frame
if(length(param_numeric)==1){
suppressWarnings(x[, param_numeric] <- as.numeric(x[, param_numeric]))
}
if(length(param_numeric)>1){
suppressWarnings(x[, param_numeric] <- apply(x[, param_numeric],2, function(x)as.numeric(x)))
}
return(x)
}
Once the function is created, you can use it on your data like this:
# Use function to convert to numeric
dat <- NumericColumns(dat)
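For the simpler case mentioned at the start of this answer, where the numeric columns are known in advance, a minimal sketch could look like this. The data frame and the column names lat, lon and place are made up; substitute the real column names from dat:

```r
# Hypothetical data: every column was read in as character
toy <- data.frame(lat   = c("38.21", "37.90"),
                  lon   = c("27.10", "26.85"),
                  place = c("A", "B"),
                  stringsAsFactors = FALSE)

num_cols <- c("lat", "lon")                       # columns known to hold numbers
toy[num_cols] <- lapply(toy[num_cols], as.numeric)  # convert only those columns
str(toy)
```

Any value that cannot be parsed becomes NA with a warning, so it is worth checking anyNA() on the converted columns afterwards.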
I have a large data frame that I would like to cast into wide-form data using the dcast() function in the reshape2 package. However, the value column is a character column, but some of the values in it are numeric values in string format. I tried to create a custom aggregate function to deal with this, that will return the mean if there are numeric entries, but return the first entry if all entries are non-numeric. Although the function seems to work, it returns an error when used as fun.aggregate. Below is code with a smaller toy example to demonstrate. What I want is a 3x5 data frame with the first column the grouping variable, 3 columns of numeric values, and 1 column of character values.
mean_with_char <- function(x) {
xnum <- as.numeric(x)
if (any(!is.na(xnum))) mean(xnum, na.rm=TRUE) else x[1]
}
library(reshape2)
fakedata <- data.frame(grp1 = rep(letters[1:3],times=20), grp2 = rep(LETTERS[17:20],each=15), val=rnorm(60))
fakedata$val[46:60] <- rep(c('foo','bar','bla','bla','bla','bla'), length.out=15)
# This returns a 3x5 data frame with NA entries.
dcast(fakedata, grp1 ~ grp2, value.var='val', fun.aggregate=mean)
# This returns an error.
dcast(fakedata, grp1 ~ grp2, value.var='val', fun.aggregate=mean_with_char)
Error in vapply(indices, fun, .default) : values must be type
'character', but FUN(X[[1]]) result is type 'double'
Here is the workaround suggested by aosmith. The mean_with_char function returns only character output, and the numstring2num function converts numeric strings to numerics.
mean_with_char <- function(x) {
xnum <- as.numeric(x)
if (any(!is.na(xnum))) as.character(mean(xnum, na.rm=TRUE)) else x[1]
}
library(reshape2)
fakedata <- data.frame(grp1 = rep(letters[1:3],times=20), grp2 = rep(LETTERS[17:20],each=15), val=rnorm(60))
fakedata$val[46:60] <- rep(c('foo','bar','bla','bla','bla','bla'), length.out=15)
fakecast <- dcast(fakedata, grp1 ~ grp2, value.var='val', fun.aggregate=mean_with_char)
# Function to change columns in a df that only consist of numeric strings to numerics.
numstring2num <- function(x) {
xnum <- as.numeric(x)
if (!any(is.na(xnum)) && !is.factor(x)) xnum else x
}
fakecast[] <- lapply(fakecast[], numstring2num)
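The error message in the question comes from vapply, which requires every result to have the same type. The sketch below reproduces that constraint outside of dcast; mixed_type and always_char are hypothetical names for the original and the workaround helper, and groups is made-up input:

```r
# Original helper: returns a double for numeric groups and a character otherwise
mixed_type <- function(x) {
  xnum <- suppressWarnings(as.numeric(x))
  if (any(!is.na(xnum))) mean(xnum, na.rm = TRUE) else x[1]
}

# Workaround helper: always returns a character
always_char <- function(x) {
  xnum <- suppressWarnings(as.numeric(x))
  if (any(!is.na(xnum))) as.character(mean(xnum, na.rm = TRUE)) else x[1]
}

groups <- list(c("1", "2"), c("foo", "bar"))   # hypothetical groups of values

# vapply() insists on a single result type, so the mixed-type helper fails
res_mixed <- try(vapply(groups, mixed_type, character(1)), silent = TRUE)
inherits(res_mixed, "try-error")

# The character-only helper satisfies the type requirement
res_char <- vapply(groups, always_char, character(1))
res_char
```

This is why the workaround returns strings everywhere and converts the all-numeric columns back in a second pass.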