replace '?' in a dataset with 0 - r

I am working in R and have a dataset of 700 rows and 10 columns in which some of the values are '?'. I want to replace the '?' values with 0.
I am not sure if the is.na() function would work here, as the values are not NA. If I convert my dataset into a matrix, search for '?', and replace it with 0, would that help?
I tried this code:
datafile <- sapply(datafile, function(y){if (y=='?') 0 else y})
After this I saved the result as a text file, but the '?' values didn't go away.

You don't even need to convert to a matrix. As Ben Bolker said, your best option is to use na.strings when reading in the file.
If the data frame is not coming from a file, you can directly do:
df[df=="?"] <- 0
Remember, though, that any column containing character values might have been converted to a factor. If that's the case, you have to convert those factors to character first. Ben gives you a brute force option; here's a more gentle approach:
# check which variables are factors
isfactor <- sapply(df, is.factor)
# convert them to character
# I use lapply because it returns a list, and I use the
# list-like selection of "elements" (variables) to replace
# the variables
df[isfactor] <- lapply(df[isfactor], as.character)
So if you put everything together, you get:
df <- data.frame(
a = c(1,5,3,'?',4),
b = c(3,'?','?',3,2)
)
isfactor <- sapply(df, is.factor)
df[isfactor] <- lapply(df[isfactor], as.character)
df[df=="?"] <- 0
df

It depends whether you have other NA values in your data set. If not, almost certainly the easiest way to do this is to use the na.strings= argument to read.(table|csv|csv2|delim), i.e. read your data with something like dd <- read.csv(...,na.strings=c("?","NA")). Then
dd[is.na(dd)] <- 0
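As a minimal sketch of the na.strings route, the snippet below reads a small inline CSV instead of a real file. The csv_text contents and column names are made up for illustration; only read.csv, its text= and na.strings= arguments, and is.na come from base R.

```r
# Hypothetical inline CSV standing in for the real file
csv_text <- "a,b\n1,3\n5,?\n3,?\n?,3\n4,2"

# Treat both "?" and "NA" as missing while reading
dd <- read.csv(text = csv_text, na.strings = c("?", "NA"))

# The columns come in as numeric because "?" never reaches type detection
str(dd)

# Replace every missing value with 0
dd[is.na(dd)] <- 0
dd
```

Because the '?' entries are turned into NA at read time, no factor or character conversion is needed afterwards.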
If for some reason you don't have control of this part of the process (e.g. someone handed you a .rda file and you don't have the original CSV), then it's a bit more tedious -- you need
which.qmark <- which(x=="?")
x <- suppressWarnings(as.numeric(as.character(x)))
x[which.qmark] <- 0
(This version also works if you have both ? and other NA values in your data)
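The three steps above work on a single vector x. A possible way to apply them to every column of a data frame is to wrap them in a helper and use lapply. The helper name qmark_to_zero and the sample data are assumptions for illustration, and the sketch assumes every column is meant to be numeric.

```r
# Hypothetical data frame whose columns were read as character because of "?"
df <- data.frame(
  a = c("1", "5", "3", "?", "4"),
  b = c("3", "?", "?", "3", "2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: remember where "?" was, convert to numeric, then set 0
qmark_to_zero <- function(x) {
  which.qmark <- which(x == "?")
  x <- suppressWarnings(as.numeric(as.character(x)))
  x[which.qmark] <- 0
  x
}

# df[] keeps the data frame structure while replacing each column
df[] <- lapply(df, qmark_to_zero)
str(df)
```

Genuine NA values stay NA with this approach; only the positions that held '?' become 0. Columns that hold real text would be turned into NA, so restrict the lapply call to the relevant columns if the data is mixed.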

Related

R (Studio) Import xlsx/csv and set NA/missing on a per column basis

Probably a quick question that I just haven't found the right keywords to get an answer to.
I'm using R studio and importing a csv using readr (or an xlsx with readxl) of a large epidemiological data set (>40k rows, >200 variables) that was provided to me.
library (readr)
DF <- read_csv("com16_NA.csv", na = "999")
## OR ##
library(readxl)
DF <- read_excel("com16_NA.xlsx", na = "999")
I'm trying to set the missing values on import, however the creators have set missing placeholders as 99 for some variables, 999 for others (where 99 is a valid option such e.g. weight) and again 9999 for others (where 999 is possible).
Is there a way on import to set the missing values on a per column basis? Right now I can only see how to set a single value as missing for the entire data set (as per the code above).
Or is my best bet to convert all of the missing placeholders to NA in a spreadsheet before importing?
Thanks
I'd let the creators know it's bad practice to have missing value codes that apply to some columns but not others!
You can use the replace_with_na() function from the naniar package in this case:
library(readr)
library(naniar)
DF <- read_csv("com16_NA.csv") %>%
replace_with_na(replace = list(x = 99, y = 999))
where x is the name of the column whose missing values are coded as 99, and y the column where they are coded as 999.
Both read_csv and read_excel accept a character vector for the na argument, so you can enter:
DF <- read_csv("com16_NA.csv", na = c('', 'NA', '999'))
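You can include any other value you want to be read as NA. The default na argument is c('', 'NA') for read_csv and just '' for read_excel. Note that this applies to every column, so it cannot by itself distinguish columns where 99 or 999 is a valid value.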
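If adding a package is not an option, the per-column recoding can also be sketched in base R after import. The column names (age, weight, income) and the na_codes list are hypothetical; the idea is just a named list mapping each column to its own missing-value code.

```r
# Hypothetical data: 99 is missing for age, 999 for weight, 9999 for income
DF <- data.frame(
  age    = c(34, 99, 51),
  weight = c(99, 999, 72),    # 99 is a valid weight here
  income = c(999, 9999, 1200) # 999 is a valid income here
)

# One missing-value code (or vector of codes) per column
na_codes <- list(age = 99, weight = 999, income = 9999)

# Recode each listed column using only its own codes
for (col in names(na_codes)) {
  DF[[col]][DF[[col]] %in% na_codes[[col]]] <- NA
}
DF
```

Keeping the codes in one list makes it easy to check them against the data set's codebook before running the recode.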

Reading csv in r - numbers like "general"

First, I am new to R.
My csv has some numbers that are treated as "general", so I can't do math with the data. Is there any solution for this?
I have tried data <- as.numeric(as.character(data)) but it failed.
data <- read.csv(file="TC.csv", header=TRUE, sep=",")
data[ data == "?" ] <- NA
for(i in 1:ncol(data)) {
data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}
I get this message:
In mean.default(results) : argument is not numeric or logical: returning NA
I think the problem is related to numbers like the one in the yellow cell.
Sample input:
You shouldn't need to loop over the data set to remove rows. Also, I don't believe the highlighted rows are the root of the problem. To make it easier, I would convert the data to a data frame.
data <- as.data.frame(read.csv(file="TC.csv", header=TRUE, sep=","))
To remove the '?' character, you should be able to run the code below. I think it is easier to run the code below instead of converting it to NA and then dropping it.
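# fixed = TRUE matches a literal '?', which is otherwise a regex metacharacter
data <- data[!grepl('?', data$Column, fixed = TRUE), ]
# the column was read as text because of the '?', so convert it to numeric
data$Column <- as.numeric(as.character(data$Column))
mean(data$Column)
summary(data)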
In summary, you should convert it to a data frame, replace/drop the rows that have values that aren't numeric, and then perform your summary stats.
You are getting that warning because mean is being applied to a column that is not numeric: the '?' entries cause read.csv to read the column as character (or factor), and mean only operates on numeric or logical types.
In R, the usual way of dealing with multi-dimensional data is not to loop over it, but to use one of the various apply functions, which perform an operation on one dimension of your data. Here you are looking for the column mean, which you get by:
TC.csv
a_0,a_1,a_2,a_3,a_4
3030.93,1,1,1,1
3095.78,2,2,2,2
2932.61,3,3,?,3
3032.24,4,4,4,4
2946.25,5,5,5,5
3058.88,6,?,6,6
get_mean.R
data <- read.csv(file="TC.csv", header=TRUE, sep=",", na.strings="?")
# apply( data, dimension, function, function_args )
col_means <- apply( data, 2, mean, na.rm=TRUE )
See the help pages ?apply (Apply Functions Over Array Margins) and ?lapply (Apply a Function over a List or Vector).
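To finish what the question's loop was attempting (filling each missing value with its column mean), a sketch along these lines may help. It reads the TC.csv contents shown above from an inline string; the names csv_text and tc are placeholders, and colMeans is used as a base R shortcut for the apply call.

```r
# Inline copy of the TC.csv sample shown above
csv_text <- "a_0,a_1,a_2,a_3,a_4
3030.93,1,1,1,1
3095.78,2,2,2,2
2932.61,3,3,?,3
3032.24,4,4,4,4
2946.25,5,5,5,5
3058.88,6,?,6,6"

# "?" becomes NA at read time, so every column stays numeric
tc <- read.csv(text = csv_text, header = TRUE, na.strings = "?")

# Column means, ignoring the missing entries
col_means <- colMeans(tc, na.rm = TRUE)

# Fill each column's NA entries with that column's mean
for (i in seq_along(tc)) {
  tc[[i]][is.na(tc[[i]])] <- col_means[[i]]
}
tc
```

The loop here runs over columns only, which is cheap; the per-element work is still vectorised.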

How to write loop for creating multiple variables in R

I have a data set in R called data, and in this data set I have more than 600 variables. Among these variables I have 94 variables called data$sleep1,data$sleep2...data$sleep94, and another 94 variables called data$wakeup1,data$wakeup2...data$wakeup94.
I want to create new variables, data$total1-data$total94, each of which is the sum of sleep and wakeup for the same day.
For example, data$total64 <-data$sleep64 + data$wakeup64,data$total94<-data$sleep94+data$wakeup94.
Without a loop, I need to write this code 94 times. I hope someone could give me some tips on this. It doesn't have to be a loop, but an easier way to do this.
FYI, all of the variables are numeric and have about 30% missing values. The missing values are random and could be anywhere. A missing value is a blank, not 0.
I recommend storing your data in long form. To do this, use melt. I'll use data.table.
Sample data:
library(data.table)
set.seed(102943)
# each = 2 gives sleep1, wakeup1, sleep2, wakeup2, ... with no duplicate names
x <- setnames(as.data.table(matrix(runif(1880), nrow = 10)),
paste0(c("sleep", "wakeup"), rep(1:94, each = 2)))[ , id := 1:.N]
Melt:
long_data <-
melt(x, id.vars = "id",
measure.vars = list(paste0("sleep", 1:94),
paste0("wakeup", 1:94)))
#rename the output to look more familiar
#**note: this syntax only works in the development version;
# to install, follow instructions
# here: https://github.com/jtilly/install_github
# to install from https://github.com/Rdatatable/data.table
# (or, read ?setnames and figure out how to make the old version work)
setnames(long_data, -1L, c("day", "sleep", "wakeup"))
I hope you'll find it's much easier to work with the data in this form.
For example, your problem is now simple:
long_data[ , total := sleep + wakeup]
We could do this without a loop. Assuming that the columns are arranged in the sequence mentioned, we subset the 'sleep' columns and 'wakeup' columns separately using grep and then add the datasets together.
sleepDat <- data[grep('sleep', names(data))]
wakeDat <- data[grep('wakeup', names(data))]
nm1 <- paste0('total', 1:94)
data[nm1] <- sleepDat+wakeDat
If there are missing values and they are NA, we can replace the NA values with 0 and then add it together as before.
data[nm1] <- replace(sleepDat, is.na(sleepDat), 0) +
replace(wakeDat, is.na(wakeDat), 0)
If the missing value is '', then the columns would be either factor or character class (not clear from the OP's post). In that case, we may need to convert the dataset to numeric class so that the '' will be automatically converted to NA
sleepDat[] <- lapply(sleepDat, function(x)
as.numeric(as.character(x)))
wakeDat[] <- lapply(wakeDat, function(x)
as.numeric(as.character(x)))
and then proceed as before.
NOTE: If the columns are character, just omit the as.character step and use only as.numeric.
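The grep approach relies on the sleep and wakeup columns appearing in the same day order. A small sketch that selects the columns by constructed name instead, so the order in the data frame does not matter, could look like this. The three-row, two-day data set is invented for illustration.

```r
# Hypothetical miniature data set; note wakeup2 appears before sleep2
data <- data.frame(
  sleep1  = c(7, NA, 6),
  wakeup1 = c(1, 2, NA),
  wakeup2 = c(0.5, 1, 1.5),
  sleep2  = c(8, 7, 6)
)

days <- 1:2

# Build the column names explicitly so day i of sleep pairs with day i of wakeup
sleepDat <- data[paste0("sleep", days)]
wakeDat  <- data[paste0("wakeup", days)]

# Element-wise sum; NA in either input gives NA in the total
data[paste0("total", days)] <- sleepDat + wakeDat
data
```

With the real data, days would be 1:94. Whether a missing sleep or wakeup value should yield NA (as here) or be treated as 0 depends on what the totals are used for.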

Invalid factor level when setting rows of a data frame in R

This is a little confusing to me. Here are the steps I am taking:
Create a dataframe:
df <- data.frame(one=numeric(5), two=character(5))
This makes a dataframe with the columns "one" and "two" with 5 empty rows. Then I can try to assign a value to a row.
df[1,] <- list(1, "test")
This results in an error. If I do it with pure numeric values there is no issue. Also if I use list(1, "") that works as well.
List can handle different vector types and so can a dataframe so I am probably making a syntax mistake, but I can't seem to figure it out.
Thanks for helping a new learner =)
To avoid converting to factor class, you can set the options to stringsAsFactors=FALSE or specify it within the data.frame
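op <- options(stringsAsFactors=FALSE)
# create the data frame after setting the option so that 'two' stays character
df <- data.frame(one=numeric(5), two=character(5))
df[1,] <- list(1, "test")
options(op)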
df
# one two
#1 1 test
#2 0
#3 0
#4 0
#5 0
As mentioned in the comment, you should prevent converting strings to factors when constructing your data frame.
Full code:
df <- data.frame(one=numeric(5), two=character(5), stringsAsFactors=FALSE)
df[1,] <- list(1, "test")
should work fine.
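If the data frame already exists with a factor column, for example because it was handed over from elsewhere, a possible fix is to convert that column to character before assigning. This sketch forces the factor case explicitly with stringsAsFactors=TRUE so it behaves the same regardless of the session default.

```r
# Reproduce the problem case: 'two' is a factor whose only level is ""
df <- data.frame(one = numeric(5), two = character(5), stringsAsFactors = TRUE)
is.factor(df$two)

# Convert the factor column to character, then the assignment succeeds
df$two <- as.character(df$two)
df[1, ] <- list(1, "test")
df
```

The original error arises because "test" is not one of the factor's existing levels; a character column has no such restriction.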

Replace a number in dataframe

I have a dataframe in which I occasionally have -1s. I want to replace them with NA. I tried the apply function, but it returns a matrix of characters to me, which is no good:
apply(d,c(1,2), function(x){
if (x == -1){
return (NA)
}else{
return (x)
}
})
I am wrestling with by but I cannot seem to handle it properly. I have got this so far:
s <-by(d,d[,'Q1_I1'], function(x){
for(i in x)
print(i)
})
which if I understood correctly by() serves into x my dataframe row by row. And I can iterate through the each element of the row by the for function. I just don't know how to replace the value.
The reason that apply does not work is that it converts a data frame to a matrix and if your data frame has any factors then this will be a character matrix.
You can use lapply instead which will process the data frame one column at a time. This code works:
mydf <- data.frame( x=c(1:10, -1), y=c(-1, 10:1), g=sample(letters,11) )
mydf
mydf[] <- lapply(mydf, function(x) { x[x==-1] <- NA; x})
mydf
As @rawr mentions in the comments, this also works:
mydf[ mydf== -1 ] <- NA
but the documentation (?'[.data.frame') says that this is not recommended due to the conversions involved.
One big question is how the data frame is being created. If you are reading the data using read.table or related functions then you can just specify the na.strings argument and have the conversion done for you as the data is read in.
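A minimal sketch of that read-time conversion, using an inline CSV in place of a real file (the csv_text contents are invented for illustration):

```r
# Hypothetical inline CSV in which -1 marks a missing value
csv_text <- "x,y\n1,-1\n-1,10\n3,9"

# Fields that are exactly "-1" are read as NA
d <- read.csv(text = csv_text, na.strings = c("-1", "NA"))
d
```

na.strings is compared against the raw field text, so a value written as -1.0 would not be caught by "-1" and would need its own entry.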
You can do this fast and transparently with the data.table library.
library(data.table)
# take standard dataset and transform to data.table
mtcars = data.table(mtcars,keep.rownames = TRUE)
# select rows with 5 gear and set to NA
mtcars[gear==5,gear:= NA]
mtcars
