I am working with the R programming language. I have a dataset with both character and numeric variables - I am trying to replace all NA's and empty values in this data with "0". For a continuous variable, the NA/empty value should be replaced with a "numeric 0". For factor variables, the NA/empty value should be replaced with a "factor 0".
In the past, I used to use a standard command for replacing all NA's with 0 (in the below code, "df" represents the data frame containing the data):
df[df == NA] <- 0
I tried the above code on my data, but it did not replace the <NA> values in the factor variables with 0. The <NA> values are still present.
I tried several approaches:
1st Approach:
df[is.na(df)] <- 0
But this did not work:
Warning message:
In `[<-.factor`(`*tmp*`, thisvar, value = 0) :
  invalid factor level, NA generated
Second Approach: I tried for one of the factor variables
library(car)
df$some_factor_var <- recode(df$some_factor_var, "NA = 0")
But this replaced every value within "some_factor_var" with 0
Third Approach : I tried again for one of the factor variables
library(forcats)
fct_explicit_na(df$some_factor_var,0)
Error: Can't convert a double vector to a character vector
Can someone please show me how to fix this problem? Is there a way to replace ALL empty/missing/NA values for all variables at once?
Thanks
For factor variables you need to first include the new level (0) in the data if it is not already present.
See this example -
df <- data.frame(a = factor(c(1, NA, 2, 5)), b = 1:4,
c = c('a', 'b', 'c', NA), d = c(1, 2, NA, 1))
#Include 0 in the levels for "a" variable
levels(df$a) <- c(levels(df$a), 0)
#Replace NA to 0
df[is.na(df)] <- 0
df
# a b c d
#1 1 1 a 1
#2 0 2 b 2
#3 2 3 c 0
#4 5 4 0 1
str(df)
#'data.frame': 4 obs. of 4 variables:
# $ a: Factor w/ 4 levels "1","2","5","0": 1 4 2 3
# $ b: int 1 2 3 4
# $ c: chr "a" "b" "c" "0"
# $ d: num 1 2 0 1
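The same idea can be applied to every column at once, which is what the question asks for. The following is a minimal base R sketch, not a definitive implementation: the helper name replace_na_zero is hypothetical, and it assumes that empty strings in factor and character columns should be treated like NA.

```r
# Hypothetical helper: replace NA (and "" in text columns) with a
# type-appropriate zero.
replace_na_zero <- function(x) {
  if (is.factor(x)) {
    # Drop the "" level (its values become NA) and make sure "0" is a level
    lv <- setdiff(levels(x), "")
    x <- factor(x, levels = union(lv, "0"))
    x[is.na(x)] <- "0"
  } else if (is.character(x)) {
    x[is.na(x) | x == ""] <- "0"
  } else if (is.numeric(x)) {
    # Keep integer columns integer
    x[is.na(x)] <- if (is.integer(x)) 0L else 0
  }
  x
}

df <- data.frame(a = factor(c(1, NA, 2, 5)), b = 1:4,
                 c = c("a", "b", "c", NA), d = c(1, 2, NA, 1),
                 e = factor(c("x", "", "y", NA)),
                 stringsAsFactors = FALSE)

# df[] keeps the data frame structure while replacing every column
df[] <- lapply(df, replace_na_zero)
str(df)
```

Columns of other types (for example dates or logicals) are returned unchanged by this sketch.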
With tidyverse, try:
library(tidyverse)
df <-
tibble(var_numeric = c(1,2,3,NA),
var_factor = as.factor(c(4,5,6,NA)))
df %>%
replace_na(list(var_numeric = 0)) %>%
mutate(var_factor = fct_explicit_na(var_factor, "0"))
# A tibble: 4 x 2
#   var_numeric var_factor
#         <dbl> <fct>
# 1           1 4
# 2           2 5
# 3           3 6
# 4           0 0
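To handle all columns at once rather than naming each one, something like the following sketch may work. It assumes dplyr >= 1.0 (for across() and where()) and forcats >= 1.0.0, where fct_explicit_na() is deprecated in favour of fct_na_value_to_level(). The var_char column and the name df_filled are added here only for illustration.

```r
library(dplyr)
library(tidyr)
library(forcats)

df <- tibble(var_numeric = c(1, 2, 3, NA),
             var_factor  = as.factor(c(4, 5, 6, NA)),
             var_char    = c("a", NA, "c", "d"))  # illustrative extra column

df_filled <- df %>%
  mutate(
    # numeric columns: NA -> numeric 0
    across(where(is.numeric),   ~ replace_na(.x, 0)),
    # character columns: NA -> "0"
    across(where(is.character), ~ replace_na(.x, "0")),
    # factor columns: NA -> a new explicit level "0"
    across(where(is.factor),    ~ fct_na_value_to_level(.x, level = "0"))
  )

df_filled
```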
I have a df in which around 50 variables hold character values ranging from 1 to 4:
var
1
2
3
4
How can I "bulk" change the values reversing them such that I get:
var
4
3
2
1
So 4 becomes 1, 3 becomes 2, etc. It is like applying the formula (var = 5 - value) to each variable, but for character values.
As mentioned, this needs to be done for a long list of variables (~50).
You can try :
library(dplyr)
df %>% mutate(across(var1:var50, ~5 - as.numeric(.)))
OR in base R :
cols <- paste0('var', 1:50)
df[cols] <- lapply(df[cols], function(x) 5 - as.numeric(x))
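If the columns should stay character rather than become numeric, a named lookup vector is one possible alternative. This is only a sketch under assumed data: the two-column df and the name lookup are made up for illustration.

```r
# Toy data with character codes "1".."4" (assumed for illustration)
df <- data.frame(var1 = c("1", "2", "3", "4"),
                 var2 = c("4", "4", "1", "2"),
                 stringsAsFactors = FALSE)

# Named vector: names are the old codes, values are the reversed codes
lookup <- c("1" = "4", "2" = "3", "3" = "2", "4" = "1")

# Index the lookup by each column's values; unname() drops the old codes
df[] <- lapply(df, function(x) unname(lookup[x]))
df
```

Any value that is not a name in lookup becomes NA, which makes unexpected codes easy to spot.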
If you're just subtracting the data.frame from a value, as you indicate in your example, you should be able to just do this:
df[] <- 5 - data.matrix(df)
Here's an example:
df <- data.frame(var1 = as.character(c(1, 2, 3, 4)),
var2 = as.character(c(10, 20, 30, 40)),
stringsAsFactors = FALSE)
df[] <- 5 - data.matrix(df)
str(df)
# 'data.frame': 4 obs. of 2 variables:
# $ var1: num 4 3 2 1
# $ var2: num -5 -15 -25 -35
If you're just reversing the row order, then something like this should work:
df[nrow(df):1, ]
# var1 var2
# 4 4 40
# 3 3 30
# 2 2 20
# 1 1 10
You can use tidyverse's mutate_at() or mutate_all() (both are superseded by across() in dplyr 1.0.0 and later).
I am a basic R user.
I have 50 column pairs (example pair is: "pair_q1" and "pair_01_v_rde") per "id" in the same dataframe that I would like to collect data from and place it in a new corresponding variable e.g. "newvar_q1".
All the pair variable names have a pattern in their names that can be distilled to this ("pair_qX" and "pair_X_v_rde", where X = 1:50, and the final variables I would like to have are "newvar_qX", where X = 1:50)
Ideally only one member of the pair should contain data, but this is not the case.
Each of the variables can contain values from 1:5 or NA(missing).
Rules for collecting data from each pair based on "id" and what to place in their newly created corresponding variable are:
If one of the pairs has a value and the other is missing then place the value in their corresponding new variable. e.g. ("pair_q1" = 1 and "pair_01_v_rde" = NA then "newvar_q1" = 1)
If both pairs have the same value or both are missing then place that value/missing in their corresponding new variable e.g. ("pair_q50" = 1/NA and "pair_50_v_rde" = 1/NA then "newvar_q50" = 1/NA)
If both pairs have different values then ignore both values and assign their corresponding new variable 999 e.g. ("pair_q02" = 3 and "pair_02_v_rde" = 2 then "newvar_q02" = 999)
Can anyone show me how I can execute this in R please?
Thanks!
Nelly
# Create Toy dataset
id <- c(100, 101, 102)
pair_q1 <- c(1, NA, 1)
pair_01_v_rde <- c(NA, 2, 1)
pair_q2 <- c(1, 1, NA)
pair_02_v_rde <- c(2, NA, NA)
pair_q50 <- c(NA, 2, 4)
pair_50_v_rde <- c(4, 3, 1)
mydata <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde)
# The dataset
> mydata
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
1 100 1 NA 1 2 NA 4
2 101 NA 2 1 NA 2 3
3 102 1 1 NA NA 4 1
# Here I manually build what I would like to have in the dataset
newvar_q1 <- c(1, 2, 1)
newvar_q2 <- c(999, 1, NA)
newvar_q50 <- c(4, 999, 999)
mydata2 <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde, newvar_q1, newvar_q2, newvar_q50)
> mydata2
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde newvar_q1 newvar_q2 newvar_q50
1 100 1 NA 1 2 NA 4 1 999 4
2 101 NA 2 1 NA 2 3 2 1 999
3 102 1 1 NA NA 4 1 1 NA 999
A possible solution using the 'tidyverse' (use 'inner_join(mydata,.,by="id")' to get the new columns in the order you give in your question):
library(tidyverse) # provides dplyr, tidyr and stringr, all used below
mydata %>%
select(id,matches("^pair_q")) %>% # keeps only left part of pairs
gather(k,v1,-id) %>% # transforms into tuples (id,variable name,variable value)
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df1 # converts variable name into variable number
mydata %>%
select(id,matches("^pair_\\d")) %>% # same on right part of pairs
gather(k,v2,-id) %>%
mutate(n=as.integer(str_extract(k,"\\d+"))) -> df2
inner_join(df1,df2,by=c("id","n")) %>%
mutate(w=case_when(is.na(v1) ~ v2, # builds new variable value
is.na(v2) ~ v1, # from your rules
v1==v2 ~ v1,
TRUE ~999),
k=paste0("newvar_q",n)) %>% # builds new variable name from variable number
select(id,k,w) %>% # keeps only useful columns
spread(k,w) %>% # switches back from tuple view to wide view
inner_join(mydata,by="id") # and merges the new variables to the original data
# id newvar_q1 newvar_q2 newvar_q50 pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
#1 100 1 999 4 1 NA 1 2 NA 4
#2 101 2 1 999 NA 2 1 NA 2 3
#3 102 1 NA 999 1 1 NA NA 4 1
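The three rules can also be written as a small base R function applied in a loop over the pair numbers. This is a sketch, not the answer's method: the helper name combine_pair is hypothetical, and the right-hand column names are assumed to be zero-padded to two digits as in the toy data.

```r
# Toy data from the question
mydata <- data.frame(id = c(100, 101, 102),
                     pair_q1 = c(1, NA, 1),  pair_01_v_rde = c(NA, 2, 1),
                     pair_q2 = c(1, 1, NA),  pair_02_v_rde = c(2, NA, NA),
                     pair_q50 = c(NA, 2, 4), pair_50_v_rde = c(4, 3, 1))

# Hypothetical helper implementing the rules for one pair of columns
combine_pair <- function(a, b) {
  out <- ifelse(is.na(a), b, a)               # take whichever value is present
  conflict <- !is.na(a) & !is.na(b) & a != b  # both present but different
  out[conflict] <- 999
  out
}

# Use 1:50 on the full data; the toy data only has pairs 1, 2 and 50
for (n in c(1L, 2L, 50L)) {
  left  <- paste0("pair_q", n)
  right <- sprintf("pair_%02d_v_rde", n)      # assumes two-digit zero padding
  mydata[[paste0("newvar_q", n)]] <- combine_pair(mydata[[left]], mydata[[right]])
}
mydata
```

Because the new columns are appended in the loop, they end up after the original pair columns, in the order requested in the question.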
After reading in data and cleaning it, I ended up with factor columns that have levels that should no longer be there.
For example, d below has one blank cell in Excel. When it’s read in, the factor columns have a level "", which shouldn’t be part of the data.
d <- read.csv(header = TRUE, stringsAsFactors = TRUE, text=' # stringsAsFactors = TRUE is needed in R >= 4.0 to get factors
x,y,value
a,one,1
,,5
b,two,4
c,three,10
')
d
#> x y value
#> 1 a one 1
#> 2 5
#> 3 b two 4
#> 4 c three 10
str(d)
#> 'data.frame': 4 obs. of 3 variables:
#> $ x : Factor w/ 4 levels "","a","b","c": 2 1 3 4
#> $ y : Factor w/ 4 levels "","one","three",..: 2 1 4 3
#> $ value: int 1 5 4 10
How do I remove the "" level from the roughly 20 factor columns in the data frame without deleting every row that has an empty cell? Dropping those rows would reduce my sample from 299000 observations to just 7 (I have tried this).
One way would be to replace the '' with NA and use droplevels to remove the unused levels
d[1:2] <- lapply(d[1:2], function(x) droplevels(replace(x, x=="", NA)))
levels(d$x)
#[1] "a" "b" "c"
levels(d$y)
#[1] "one" "three" "two"
Another option, applied while reading the dataset (assuming the OP wants factor columns), would be
d <- read.csv("yourfile.csv", na.strings = "", stringsAsFactors = TRUE) # stringsAsFactors = TRUE is needed in R >= 4.0
This makes sure that '' is read as NA.
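A self-contained version of this approach, using the question's data inline instead of a file, might look like the sketch below. The stringsAsFactors = TRUE argument is an assumption: it is needed in R 4.0 and later, where read.csv no longer creates factors by default.

```r
# Read the question's data; empty fields become NA instead of a "" level
d <- read.csv(header = TRUE, na.strings = "", stringsAsFactors = TRUE,
              text = "x,y,value
a,one,1
,,5
b,two,4
c,three,10
")

levels(d$x)  # no "" level
levels(d$y)
nrow(d)      # all rows are kept
```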
Update
Suppose, there are numeric columns in between and we need to do the replace/droplevels only for the factor columns
d[] <- lapply(d, function(x) if(is.factor(x)) droplevels(replace(x, x== "", NA))
else x)
I have a seemingly simple task of adding a row to a data frame in R but I just can't do it!
I have a data frame with 50 rows and 100 columns. The data frame, which I would like to keep in the same format, has the first column as a factor, and all other columns as characters -- this is what lapply produced. I would simply like to append a 51st row, but I get warnings every time.
My added data is of the form Cat <- c("Cat", 1,NA,3,NA,5). (I have no clue where " or ' need to go - quite new to R!)
rbind shows "invalid factor levels" every time.
e.g.
df <- rbind(df,Cat)
I believe this is because of the factor/character divide.
The factor levels in your data.frame should also contain the values in your "Cat" object for the relevant factor column.
Here's a simple example:
df <- data.frame(v1 = c("a", "b"), v2 = 1:2, stringsAsFactors = TRUE) # factor column; this was the default before R 4.0
toAdd <- list("c", 3)
## Warnings...
rbind(df, toAdd)
# v1 v2
# 1 a 1
# 2 b 2
# 3 <NA> 3
# Warning message:
# In `[<-.factor`(`*tmp*`, ri, value = "c") :
# invalid factor level, NA generated
## Possible fix
df$v1 <- factor(df$v1, unique(c(levels(df$v1), toAdd[[1]])))
rbind(df, toAdd)
# v1 v2
# 1 a 1
# 2 b 2
# 3 c 3
Alternatively, consider rbindlist from "data.table", which should save you from having to convert the factor levels:
> library(data.table)
> df <- data.frame(v1 = c("a", "b"), v2 = 1:2, stringsAsFactors = TRUE)
> rbindlist(list(df, toAdd))
v1 v2
1: a 1
2: b 2
3: c 3
> str(.Last.value)
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ v1: Factor w/ 3 levels "a","b","c": 1 2 3
$ v2: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
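Another base R option is to pass the new row as a one-row data frame with matching column names instead of a list. When both arguments are data frames, rbind expands the factor levels as needed. A minimal sketch, with new_row and out as illustrative names:

```r
df <- data.frame(v1 = c("a", "b"), v2 = 1:2, stringsAsFactors = TRUE)

# New row as a one-row data frame with the same column names
new_row <- data.frame(v1 = "c", v2 = 3, stringsAsFactors = TRUE)

# rbind on two data frames merges the factor levels of v1
out <- rbind(df, new_row)
out
levels(out$v1)
```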
This does not work
> dfi=data.frame(v1=c(1,1),v2=c(2,2))
> dfi
v1 v2
1 1 2
2 1 2
> df$df=dfi
Error in `$<-.data.frame`(`*tmp*`, "df", value = list(v1 = c(1, 1), v2 = c(2, :
replacement has 2 rows, data has 0
df$df=I(dfi) has the same error. Please help.
Thank you.
Moved this from comments for formatting reasons:
What exactly are you trying to achieve? If you want the contents of dfi passed to df you can use this code:
df <- data.frame(matrix(vector(), 0, 2, dimnames=list(c(), c("V1", "V2"))), stringsAsFactors=FALSE)
df=dfi
As @joran says, it is unclear why you would ever want to do this. Nevertheless, it is possible.
One of the requirements of a data frame is that all the columns have the same number of rows. This is why you are getting the error. Something like this will work:
dfi <- data.frame(v1=c(1,1),v2=c(2,2)) # 2 rows
df <- data.frame(x=1:2) # also 2 rows
df$df <- dfi # works now
Printing would lead you to believe that df has three columns...
df
# x df.v1 df.v2
# 1 1 1 2
# 2 2 1 2
but it does not!
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ x : int 1 2
# $ df:'data.frame': 2 obs. of 2 variables:
# ..$ v1: num 1 1
# ..$ v2: num 2 2
Since df$df is a data frame
class(df$df)
# [1] "data.frame"
you can use the standard data frame accessors...
df$df$v1
# [1] 1 1
df$df[1,]
# v1 v2
# 1 1 2
Incidentally, RStudio has trouble displaying this type of data structure; View(df) gives an inaccurate display of the structure.
Finally, you are probably better off creating a list of data frames, rather than a data frame containing data frames:
df <- data.frame(grp=rep(LETTERS[1:3],each=5),x=rnorm(15),y=rpois(15,5))
df.lst <- split(df,df$grp) # creates a list of data frames
df.lst$A
# grp x y
# 1 A -1.3606420 10
# 2 A -0.4511408 5
# 3 A -1.1951950 4
# 4 A -0.8017765 5
# 5 A -0.2816298 9
df.lst$A$x
# [1] -1.3606420 -0.4511408 -1.1951950 -0.8017765 -0.2816298
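One benefit of the list form is that the same operation can be applied to every piece with lapply or sapply. A short sketch follows; set.seed and the names group_means and group_rows are added for illustration, so the random values will differ from the output above.

```r
set.seed(1)  # only to make the random data reproducible
df <- data.frame(grp = rep(LETTERS[1:3], each = 5),
                 x = rnorm(15), y = rpois(15, 5))
df.lst <- split(df, df$grp)  # list of data frames, one per group

# Apply a function to each data frame in the list
group_means <- sapply(df.lst, function(d) mean(d$x))
group_rows  <- sapply(df.lst, nrow)
group_means
group_rows
```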