How to change characters into NA? - r

I have a census dataset in which missing values are indicated with a ?.
When I check for incomplete cases in R, it reports none, because R treats the ? as a valid character value. Is there any way to change all the ? entries to NA? I would then like to run multiple imputation with the mice package to fill in the missing data.

For data frames, you can index on the value itself. You may need to adjust the quotation marks depending on how the data were read; I have not tested this against your file.
df[df == "?"] <- NA

Creating data frame df
df <- data.frame(A=c("?",1,2),B=c(2,3,"?"))
df
# A B
# 1 ? 2
# 2 1 3
# 3 2 ?
I. Using replace() function
replace(df,df == "?",NA)
# A B
# 1 <NA> 2
# 2 1 3
# 3 2 <NA>
II. While importing a file with ?
data <- read.table("xyz.csv", sep = ",", header = TRUE, na.strings = c("?", "NA"))
data
# A B
# 1 1 NA
# 2 2 3
# 3 3 4
# 4 NA NA
# 5 NA NA
# 6 4 5
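Putting the pieces together for the original question, here is a small base-R sketch (the columns and values are made up for illustration) that replaces every "?" with NA and then confirms that complete.cases() now sees the incomplete rows. The mice call at the end is the usual next step but is left commented out.

```r
# Toy census-like data: "?" stands in for missing values
df <- data.frame(age = c("39", "?", "52"),
                 workclass = c("Private", "State-gov", "?"),
                 stringsAsFactors = FALSE)

# Before the fix, every case looks complete
sum(!complete.cases(df))  # 0

# Replace "?" with NA everywhere, then re-check
df[df == "?"] <- NA
incomplete <- sum(!complete.cases(df))
incomplete  # 2

# Next step (not run here): impute with mice
# library(mice); imputed <- mice(df)
```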


Removing NA’s from a dataset in R

I want to remove all of the NAs from the selected variables; however, when I use na.omit(), for example:
na.omit(df$livharm)
it does not work and the NAs are still there. I have also tried an alternative way, for example:
married[is.na(livharm1)] <- NA
I have done this for each variable within the larger variable I am looking at, using code like:
df <- within(df, {
married <- as.numeric(livharm == 1)
...
married[is.na(livharm1)] <- NA
})
However, I'm not sure what I actually have to do. I would greatly appreciate any help!
Using complete.cases gives:
dat <- data.frame( a=c(1,2,3,4,5),b=c(1,NA,3,4,5) )
dat
a b
1 1 1
2 2 NA
3 3 3
4 4 4
5 5 5
complete.cases(dat)
[1] TRUE FALSE TRUE TRUE TRUE
# is.na equivalent has to be used on a vector for the same result:
!is.na(dat$b)
[1] TRUE FALSE TRUE TRUE TRUE
dat[complete.cases(dat),]
a b
1 1 1
3 3 3
4 4 4
5 5 5
Using na.omit gives the same result as subsetting with complete.cases, but instead of returning a logical vector it returns the filtered object itself.
na.omit(dat)
a b
1 1 1
3 3 3
4 4 4
5 5 5
When applied to a bare vector, this function returns a different kind of result: the values with "na.action" and "class" attributes attached, which ggplot2 probably does not handle correctly. It can be "rescued" by putting it back into a data frame; base plot works as intended, though.
na.omit(dat$b)
[1] 1 3 4 5
attr(,"na.action")
[1] 2
attr(,"class")
[1] "omit"
data.frame(b=na.omit(dat$b))
b
1 1
2 3
3 4
4 5
Plotting with ggplot2
ggplot(dat[complete.cases(dat),]) + geom_point( aes(a,b) )
# <plot>
# See warning when using original data set with NAs
ggplot(dat) + geom_point( aes(a,b) )
Warning message:
Removed 1 rows containing missing values (geom_point).
# <same plot as above>

r - dynamically detect Excel column names formatted as dates (without df slicing)

I am trying to detect date columns that arrive in Excel's serial-number format:
library(openxlsx)
df <- read.xlsx('path/df.xlsx', sheet=1, detectDates = T)
Which reads the data as follows:
# a b c 44197 44228 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
I tried specifying a fixed index slice and then transforming those specific columns as follows:
names(df)[4:5] <- format(as.Date(as.numeric(names(df)[4:5]),
origin = "1899-12-30"), "%m/%d/%Y")
This works well when the data frame is sliced at exactly those columns. Unfortunately, the column positions could change, say from names(df)[4:5] to names(df)[2:3], in which case the coercion returns NA values instead of dates.
data:
Note: by default data.frame() would mangle the numeric column name 44197 to X44197, so check.names = FALSE is added here to keep the names as read.xlsx() reads them:
df <- data.frame(a=rep(1:5), b=rep(1:5), c=NA, "44197"=rep(1:5), '44228'=rep(1:5), d=rep(1:5), check.names=FALSE)
Expected Output:
Note: this is the original excel format for these above columns:
# a b c 01/01/2021 01/02/2021 d
#1 1 1 NA 1 1 1
#2 2 2 NA 2 2 2
#3 3 3 NA 3 3 3
#4 4 4 NA 4 4 4
#5 5 5 NA 5 5 5
How could I detect directly these excel format and change it to date without having to slice the dataframe?
We only need to pick out the column names that are numbers (suppressWarnings() silences the expected coercion warning for the non-numeric names):
i1 <- !is.na(suppressWarnings(as.integer(names(df))))
and then use
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
origin = "1899-12-30"), "%m/%d/%Y")
Or with dplyr
library(dplyr)
df %>%
rename_with(~ format(as.Date(as.numeric(.),
origin = "1899-12-30"), "%m/%d/%Y"), matches('^\\d+$'))
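The whole pipeline can be checked end to end in base R. The sketch below uses a made-up data frame with serial-number names; 44197 is the Excel serial for 2021-01-01 under the 1899-12-30 origin that Windows Excel uses.

```r
# Made-up data with Excel serial numbers as column names
df <- data.frame(a = 1:5, "44197" = 1:5, "44228" = 1:5,
                 check.names = FALSE)

# Detect the all-numeric names, silencing the coercion warning
i1 <- !is.na(suppressWarnings(as.integer(names(df))))

# Convert only those names from Excel serials to formatted dates
names(df)[i1] <- format(as.Date(as.numeric(names(df)[i1]),
                                origin = "1899-12-30"), "%m/%d/%Y")
names(df)
# "a" "01/01/2021" "02/01/2021"
```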

Data.table: rbind a list of data tables with unequal columns [duplicate]

This question already has answers here:
rbindlist data.tables with different number of columns
(1 answer)
Rbind with new columns and data.table
(5 answers)
Closed 4 years ago.
I have a list of data tables that are of unequal lengths. Some of the data tables have 35 columns and others have 36.
I have this line of code, but it generates an error
> lst <- unlist(full_data.lst, recursive = FALSE)
> model_dat <- do.call("rbind", lst)
Error in rbindlist(l, use.names, fill, idcol) :
Item 1362 has 35 columns, inconsistent with item 1 which has 36 columns. If instead you need to fill missing columns, use set argument 'fill' to TRUE.
Any suggestions on how I can modify this so that it works properly?
Here's a minimal example of what you are trying to do. No need for any other package; just set fill = TRUE in rbindlist:
df1 <- data.table(m1 = c(1,2,3))
df2 <- data.table(m1 = c(1,2,3), m2=c(3,4,5))
df3 <- rbindlist(list(df1, df2), fill=TRUE)
print(df3)
m1 m2
1: 1 NA
2: 2 NA
3: 3 NA
4: 1 3
5: 2 4
6: 3 5
If I understood your question correctly, there are only two options for getting your data tables appended.
Option A: Drop the extra variable from the tables that have it:
dt$column_Name <- NULL
Option B: Create the variable, filled with NA, in each incomplete table (note that full_data.lst is a list, so the column must be added to the element, not the list):
full_data.lst[[i]]$column_Name <- NA
And then call rbind.
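Option B can be sketched in full with base R (plain data.frames here so the idea stays package-free; the same lapply works on data.tables, although rbindlist(fill = TRUE) makes it unnecessary there). The column names below are invented for illustration.

```r
df1 <- data.frame(m1 = 1:3)             # missing m2
df2 <- data.frame(m1 = 1:3, m2 = 3:5)
full_data.lst <- list(df1, df2)

# Add every missing column, filled with NA, to each list element
all_cols <- unique(unlist(lapply(full_data.lst, names)))
full_data.lst <- lapply(full_data.lst, function(d) {
  d[setdiff(all_cols, names(d))] <- NA
  d[all_cols]                           # same column order everywhere
})

combined <- do.call(rbind, full_data.lst)
combined$m2
# NA NA NA  3  4  5
```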
Try to use rbind.fill from package plyr:
Input data: 3 data frames with different numbers of columns
df1<-data.frame(a=c(1,2,3,4,5),b=c(1,2,3,4,5))
df2<-data.frame(a=c(1,2,3,4,5,6),b=c(1,2,3,4,5,6),c=c(1,2,3,4,5,6))
df3<-data.frame(a=c(1,2,3),d=c(1,2,3))
full_data.lst<-list(df1,df2,df3)
The solution
library("plyr")
rbind.fill(full_data.lst)
a b c d
1 1 1 NA NA
2 2 2 NA NA
3 3 3 NA NA
4 4 4 NA NA
5 5 5 NA NA
6 1 1 1 NA
7 2 2 2 NA
8 3 3 3 NA
9 4 4 4 NA
10 5 5 5 NA
11 6 6 6 NA
12 1 NA NA 1
13 2 NA NA 2
14 3 NA NA 3

creating an adjacency network matrix (or edge list) from a large csv dataset using igraph

I am trying to do network analysis in igraph but am having some issues transforming my dataset into an edge list (with weights), given the differing number of columns.
The dataset looks as follows (it is much larger, of course). The first column is the main operator id (a main operator can also appear as a partner and vice versa, so the ids stay the same in the adjacency). The challenge is that the number of partners varies (from 0 to 40).
IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4 .....
1 4 3 2 NA
2 3 1 NA NA
3 1 4 7 6
4 9 6 3 NA
.
.
My question is how to transform this into an undirected edge list with a weight column (the weight just expressing the number of interactions):
Id1 Id2 weight
1 2 2
1 3 2
1 4 1
2 3 1
3 4 2
. .
Does anyone have a tip what the best way to go is? Many thanks in advance!
This is a classic reshaping task. You can use the reshape2 package for this.
text <- "IdMain IdPartner1 IdPartner2 IdPartner3 IdPartner4
1 4 3 2 NA
2 3 NA NA NA
3 1 4 7 6
4 9 NA NA NA"
data <- read.delim(text = text, sep = "")
library(reshape2)
data_melt <- reshape2::melt(data, id.vars = "IdMain")
edgelist <- data_melt[!is.na(data_melt$value), c("IdMain", "value")]
head(edgelist, 4)
# IdMain value
# 1 1 4
# 2 2 3
# 3 3 1
# 4 4 9
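The melt above yields the edge list but not the weights the question asked for. A base-R continuation (assuming the weight is simply the number of times an unordered pair co-occurs, as the expected output suggests) is to sort each pair so direction is ignored, then count duplicates; the small edge list below is hand-built to match the melted example.

```r
edgelist <- data.frame(IdMain = c(1, 1, 1, 2, 3, 3, 3),
                       value  = c(4, 3, 2, 3, 1, 4, 7))

# Sort each pair so (1,4) and (4,1) count as the same undirected edge
pairs <- t(apply(edgelist, 1, sort))

# Count how often each unordered pair occurs
weighted <- aggregate(list(weight = rep(1, nrow(pairs))),
                      by = list(Id1 = pairs[, 1], Id2 = pairs[, 2]),
                      FUN = sum)
# e.g. the pair (1, 3) gets weight 2
```

The result can then be handed to igraph with graph_from_data_frame(weighted, directed = FALSE), which treats the third column as an edge attribute.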

Setting values to NA in a dataframe in R

Here is some reproducible code that shows the problem I am trying to solve in another dataset. Suppose I have a data frame df with some "NULL" values in it. I would like to replace these with NAs, as I attempt to do below. But when I print the result, the value comes out as <NA>. See the second data frame, which is the one I would like to produce from df, in which the NA is a regular old NA without the angle brackets.
> df = data.frame(a=c(1,2,3,"NULL"),b=c(1,5,4,6))
> df[4,1] = NA
> print(df)
a b
1 1 1
2 2 5
3 3 4
4 <NA> 6
>
> d = data.frame(a=c(1,2,3,NA),b=c(1,5,4,6))
> print(d)
a b
1 1 1
2 2 5
3 3 4
4 NA 6
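The <NA> display is a formatting cue, not a different kind of missing value: because column a originally held the string "NULL", it became a character (or, before R 4.0, factor) column, and print() renders missing values in non-numeric columns as <NA> to distinguish them from a literal "NA" string. Converting the column back to numeric makes it print as a plain NA. A minimal sketch, assuming the column is meant to be numeric:

```r
df <- data.frame(a = c(1, 2, 3, "NULL"), b = c(1, 5, 4, 6),
                 stringsAsFactors = FALSE)
df[df == "NULL"] <- NA

# a is still character, so its NA prints as <NA>
class(df$a)  # "character"

# Convert to numeric and the display matches the hand-built data frame
df$a <- as.numeric(df$a)
df
#    a b
# 1  1 1
# 2  2 5
# 3  3 4
# 4 NA 6
```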
