I am trying to combine two data frames with different numbers of columns and different column headers. However, after I combine them using rbind.fill(), the empty cells in the result are filled with NA.
This is very inconvenient since one of the columns has data that is also represented as "NA" (for North America), so when I export it to a CSV, the spreadsheet can't tell them apart.
Is there a way for me to:
Use the rbind.fill function without having it populate the empty cells with NA
or
Change the column to replace the NA values*
*I've scoured the blogs, and have tried the two most popular solutions:
df$col[is.na(df$col)] <- 0, #it does not work
df$col = ifelse(is.na(df$col), "X", df$col), #it changes all the characters to numbers, and ruins the column
Let me know if you have any advice! I (unfortunately) cannot share the df, but will be willing to answer any questions!
NA is not the same as "NA" to R, but might be interpreted as such by your favourite spreadsheet program. NA is a special value in R just like NaN (not a number). If I understand correctly, one of your solutions is to replace the "NA" values in the column representing North America with something else, in which case you should just be able to do...
df$col[ df$col == "NA" ] <- "NorthAmerica"
This assumes that your "NA" values are actually character strings. is.na() returns FALSE for the character string "NA", which is why df$col[ is.na(df$col) ] <- 0 leaves those values unchanged.
An example of the difference between NA and "NA":
x <- c( 1, 2, 3 , "NA" , 4 , 5 , NA )
> x[ !is.na(x) ]
[1] "1" "2" "3" "NA" "4" "5"
> x[ x == "NA" & !is.na(x) ]
[1] "NA"
Method to resolve this
I think you want to leave "NA" and any NAs as they are in the first df, but make all NA in the second df formed from rbind.fill() change to something like "NotAvailable". You can accomplish this like so...
library(plyr) # provides rbind.fill()
df1 <- data.frame( col = rep( "NA" , 6 ) , x = 1:6 , z = rep( 1 , 6 ) )
df2 <- data.frame( col = rep( "SA" , 2 ) , x = 1:2 , y = 5:6 )
df <- rbind.fill( df1 , df2 )
temp <- df [ (colnames(df) %in% colnames(df2)) ]
temp[ is.na( temp ) ] <- "NotAvailable"
res <- cbind( temp , df[ !( colnames(df) %in% colnames(df2) ) ] )
#df has real NA values in column z and column y. We just want to get rid of y's
df
# col x z y
# 1 NA 1 1 NA
# 2 NA 2 1 NA
# 3 NA 3 1 NA
# 4 NA 4 1 NA
# 5 NA 5 1 NA
# 6 NA 6 1 NA
# 7 SA 1 NA 5
# 8 SA 2 NA 6
#res has "NA" strings in col representing "North America" and NA values in z, whilst those in y have been removed
#More generally, any NA in df1 will be left 'as-is', whilst NA from df2 formed using rbind.fill will be converted to character string "NotAvailable"
res
# col x y z
# 1 NA 1 NotAvailable 1
# 2 NA 2 NotAvailable 1
# 3 NA 3 NotAvailable 1
# 4 NA 4 NotAvailable 1
# 5 NA 5 NotAvailable 1
# 6 NA 6 NotAvailable 1
# 7 SA 1 5 NA
# 8 SA 2 6 NA
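If the goal is simply to never get NA in the padded cells, the padding can also be done by hand before binding. The following is only a sketch in base R: rbind_fill_with() is a hypothetical helper written for this example, not a plyr function, and it pads columns missing from either data frame, which differs from the approach above.

```r
# Hypothetical helper, not part of plyr: row-bind two data frames and
# fill the columns that one of them lacks with `fill` instead of NA.
rbind_fill_with <- function(a, b, fill = "NotAvailable") {
  all_cols <- union(names(a), names(b))
  pad <- function(d) {
    for (nm in setdiff(all_cols, names(d))) {
      d[[nm]] <- rep(fill, nrow(d))
    }
    d[all_cols]  # same column order in both pieces
  }
  rbind(pad(a), pad(b))
}

df1 <- data.frame(col = rep("NA", 6), x = 1:6, z = rep(1, 6),
                  stringsAsFactors = FALSE)
df2 <- data.frame(col = rep("SA", 2), x = 1:2, y = 5:6,
                  stringsAsFactors = FALSE)
res <- rbind_fill_with(df1, df2)
res
```

Note that any column padded this way becomes a character column, because it now mixes numbers with the fill string. Real NA values that were already present in either input are left untouched.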
If you have a dataframe that contains NA's and you want to replace them all you can do something like:
df[is.na(df)] <- -999
This will take care of all NA's in one shot
If you only want to act on a single column you can do something like
df$col[which(is.na(df$col))] <- -999
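Since the original problem is that the spreadsheet cannot tell the two apart after export, another option is to leave the data alone and change how missing values are written. write.csv() accepts an na argument for the string used for missing values, and read.csv() accepts na.strings. A minimal sketch, with made-up column names and a temporary file:

```r
# Made-up data: "NA" is a real code (North America), NA is missing.
df <- data.frame(region = c("NA", "SA", NA), x = c(1, NA, 3),
                 stringsAsFactors = FALSE)

path <- tempfile(fileext = ".csv")

# Write missing values as empty cells instead of the default "NA".
write.csv(df, path, row.names = FALSE, na = "")

# Read back, treating only empty fields as missing.
back <- read.csv(path, na.strings = "", stringsAsFactors = FALSE)
back$region
```

With this, the North America rows still contain the text NA in the file, while genuinely missing cells are blank.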
I have some data that I am looking at in R. One particular column, titled "Height", contains a few rows of NA.
I am looking to subset my data-frame so that all Heights above a certain value are excluded from my analysis.
df2 <- subset ( df1 , Height < 40 )
However whenever I do this, R automatically removes all rows that contain NA values for Height. I do not want this. I have tried including arguments for na.rm
f1 <- function ( x , na.rm = FALSE ) {
df2 <- subset ( x , Height < 40 )
}
f1 ( df1 , na.rm = FALSE )
but this does not seem to do anything; the rows with NA still end up disappearing from my data-frame. Is there a way of subsetting my data as such, without losing the NA rows?
If we use the subset function, we need to watch out. From ?subset:
For ordinary vectors, the result is simply ‘x[subset & !is.na(subset)]’.
So rows where the condition evaluates to NA are dropped; only rows where it is TRUE are retained.
If you want to keep the NA cases, add a logical "or" condition to tell R not to drop them:
subset(df1, Height < 40 | is.na(Height))
# or `df1[df1$Height < 40 | is.na(df1$Height), ]`
Don't index with the bare condition (explained below):
df2 <- df1[df1$Height < 40, ]
Example
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)
subset(df1, Height < 40 | is.na(Height))
# Height y
#1 NA 1
#2 2 2
#3 4 3
#4 NA 4
df1[df1$Height < 40, ]
# Height y
#NA NA NA
#2 2 2
#3 4 3
#NA.1 NA NA
The reason the latter fails is that indexing by NA gives NA. Consider this simple example with a vector:
x <- 1:4
ind <- c(NA, TRUE, NA, FALSE)
x[ind]
# [1] NA 2 NA
We need to somehow replace those NA with TRUE. The most straightforward way is to add another "or" condition is.na(ind):
x[ind | is.na(ind)]
# [1] 1 2 3
This is exactly what happens in your situation. If Height contains NA, then the logical operation Height < 40 ends up as a mix of TRUE / FALSE / NA, so we need to replace NA by TRUE as above.
You could also do:
df2 <- df1[(df1$Height < 40 | is.na(df1$Height)),]
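The same idea can be spelled out step by step, which may make it easier to see what the "or" condition does. This is only a sketch using the example data from above; keep is a name introduced here for illustration.

```r
df1 <- data.frame(Height = c(NA, 2, 4, NA, 50, 60), y = 1:6)

# The comparison is NA wherever Height is NA.
keep <- df1$Height < 40      # NA TRUE TRUE NA FALSE FALSE

# Treat "unknown" as "keep" before indexing.
keep[is.na(keep)] <- TRUE

df2 <- df1[keep, ]
df2
```

The result has the same four rows as subset(df1, Height < 40 | is.na(Height)).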
For subsetting by character/factor variables, you can use %in% to keep NAs. Specify the data you wish to exclude.
# Create Dataset
library(data.table)
df=data.table(V1=c('Surface','Bottom',NA),V2=1:3)
df
# V1 V2
# 1: Surface 1
# 2: Bottom 2
# 3: <NA> 3
# Keep all but 'Bottom'
df[!V1 %in% c('Bottom')]
# V1 V2
# 1: Surface 1
# 2: <NA> 3
This works because %in% never returns an NA (see ?match)
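To see the difference without data.table, compare the two operators on a plain character vector. A small base R sketch; v is just an illustrative name:

```r
v <- c("Surface", "Bottom", NA)

# != propagates the missing value, so the third element is NA.
ne <- v != "Bottom"

# %in% is built on match(), which reports "no match" instead of NA.
not_in <- !(v %in% "Bottom")

ne       # TRUE FALSE NA
not_in   # TRUE FALSE TRUE
```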
As you can probably see from the age of my account, I'm new here.
I'm running into problems creating a function or loop that replaces single values in a row based on two or more conditions. Here is my sample dataset:
date timeslot volume lag1
1 2018-01-17 3 553 296
2 2018-01-17 4 NA 553
3 2018-01-18 1 NA NA
4 2018-01-18 2 NA NA
5 2018-01-18 3 NA NA
6 2018-01-18 4 NA NA
The types are: Date, int, num, num.
I want to create a function that replaces the NA in lag1 with the average of the last 5 similar timeslots. This value is calculated with:
w <- as.integer(mean(tail(data$volume[data$timeslot %in% c(1)],5), na.rm =TRUE ))
If I create an if or for loop, it returns "the condition has length > 1 and only the first element will be used".
So far I can only change all the lag1 values, or none.
The function should be something like this: if lag1 is NA and timeslot == 1, then change that row's value to w.
What I have tried so far:
for(i in data$lag1){
if(data$timeslot== '1'){
data$lag1[is.na(data$lag1)]<-w
}else(data$lag1<-data$lag1)
}
and also:
data$lag1<- ifelse(data$timeslot== "1", is.na(data$lag1)<-w, data$lag1 )
This runs, but it changes all the values at once. It should only change the one value in the same row as the matching timeslot.
Most of the time it returns the error above. I suspect it has something to do with the "timeslot" column.
I tried a few different things, but since I like a clean R environment, most of them have been deleted.
I can't seem to figure this one out. I hope you can point me in the right direction.
Overview
I created the ReplaceNALag1WithSimilarRecentTimeslots() function to replace NA df$lag1 values with the average of the last 5 df$lag1 values for each unique df$timeslot value.
The use of sapply() was helpful in using ReplaceNALag1WithSimilarRecentTimeslots() once, since it applies the logic to each element in X. In this case, X is a vector of unique df$timeslot values whose row also contains an NA df$lag1 value.
NaN values are introduced because the reproducible data contains no non-NA df$lag1 values for some timeslots, and the mean of an empty vector is NaN.
# create data frame
df <-
data.frame(
date = as.Date( x = c(
paste("2018"
, "01"
, rep( x = "17", times = 2 )
, sep = "-"
)
, paste( "2018"
, "01"
, rep( x = "18", times = 4 )
, sep = "-"
)
)
)
, timeslot = as.integer( c( 3, 4, 1, 2, 3, 4 ) )
, volume = c( 533, rep( x = NA, times = 5 ) )
, lag1 = c( 296, 553, rep( x = NA, times = 4 ) )
, stringsAsFactors = FALSE
)
# ensure that the data frame
# is ordered by date,
# so that rows with a date value closer to today
# appear at the end of the data frame
df <- df[ order( df$date ) , ]
# view results
df
# date timeslot volume lag1
# 1 2018-01-17 3 533 296
# 2 2018-01-17 4 NA 553
# 3 2018-01-18 1 NA NA
# 4 2018-01-18 2 NA NA
# 5 2018-01-18 3 NA NA
# 6 2018-01-18 4 NA NA
# create a function that
# replaces NA lag1 values
# with the average of the
# last 5 lag1 values for
# each unique timeslot value
ReplaceNALag1WithSimilarRecentTimeslots <- function( unique.timeslot.value ){
# create an index that
# pulls out non NAs from lag1 for a particular timeslot
# but that only gives us the 5 most recent values
# assuming that elements that appear at the end of vector
# are more recent than elements that appear near the beginning of the vector
non.na.lag1.condition.by.timeslot <-
tail(
x = which( !is.na( df$lag1 ) & df$timeslot == unique.timeslot.value )
, n = 5
)
# calculate the average lag1 value
# for those similar non NA lag1 values
# for that particular timeslot
mean( df$lag1[ non.na.lag1.condition.by.timeslot ] )
} # end of ReplaceNALag1WithSimilarRecentTimeslots() function
# create the NA lag1 condition
na.lag1.condition <- which( is.na( df$lag1 ) )
# use ReplaceNALag1WithSimilarRecentTimeslots()
# on those NA lag1 values
df$lag1[ na.lag1.condition ] <-
sapply( X = unique( df$timeslot[ na.lag1.condition ] )
, FUN = function( i ) ReplaceNALag1WithSimilarRecentTimeslots( i )
, simplify = TRUE
, USE.NAMES = TRUE
)
# View the results
df
# date timeslot volume lag1
# 1 2018-01-17 3 533 296
# 2 2018-01-17 4 NA 553
# 3 2018-01-18 1 NA NaN
# 4 2018-01-18 2 NA NaN
# 5 2018-01-18 3 NA 296
# 6 2018-01-18 4 NA 553
# end of script #
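The sapply() call above returns one value per unique timeslot, so it lines up with the NA rows only when each timeslot is missing once. Below is a sketch that combines both conditions (lag1 is NA and timeslot matches) into a single logical index, which also covers a timeslot that is missing several times. The data is made up for this example, and observed and same_slot are names introduced here.

```r
# Made-up data in the same shape as above, with timeslot 3 missing twice.
df <- data.frame(
  timeslot = c(3L, 4L, 3L, 1L, 3L, 4L),
  lag1     = c(296, 553, NA, NA, NA, NA)
)

# Take the averages from the original column, so that values filled in
# earlier in the loop do not feed into later averages.
observed <- df$lag1

for (ts in unique(df$timeslot)) {
  same_slot <- df$timeslot == ts
  # Up to 5 most recent non-NA values for this timeslot.
  recent <- tail(observed[same_slot & !is.na(observed)], 5)
  # One logical vector with both conditions: only NA rows of this
  # timeslot are overwritten, all other rows stay as they are.
  df$lag1[same_slot & is.na(df$lag1)] <- mean(recent)
}

df$lag1
```

Timeslot 1 has no observed values here, so its row becomes NaN, just as in the output above.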
I'm trying to calculate the mode for numeric columns. The columns which are not numeric, should have a "NA" as a placeholder in the vector. I would also need percentages according to a target. Some example data:
c1= c("A", "B", "C", "C", "B", "C", "C")
c2= factor(c(1, 1, 2, 2,1,2,1), labels = c("Y","N"))
d= as.Date(c("2015-02-01", "2015-02-03","2015-02-01","2015-02-05", "2015-02-03","2015-02-01", "2015-02-03"), format="%Y-%m-%d")
x= c(1,1,2,3,1,2,4)
y= c(1,2,2,6,2,3,1)
t= c(1,0,1,1,0,0,1)
df=data.frame(c1, c2, d, x, y,t)
df
c1 c2 d x y t
1 A Y 2015-02-01 1 1 1
2 B Y 2015-02-03 1 2 0
3 C N 2015-02-01 2 2 1
4 C N 2015-02-05 3 6 1
5 B Y 2015-02-03 1 2 0
6 C N 2015-02-01 2 3 0
7 C Y 2015-02-03 4 1 1
I would need the mode for each numeric column:
mode=as.numeric(c("NA","NA", "NA", 1,2,1))
mode
[1] NA NA NA 1 2 1
and a vector of percentages of rows with t==1, when value in column == mode
[1] NA NA NA 0.33 0.33
and a vector of percentages of rows with t==1, when value in column != mode
[1] NA NA NA 0.75 0.75
How could I calculate such vectors?
The best I have found for mode is:
library(plyr)
mode_fun <- function(x) {
mode0 <- names(which.max(table(x)))
if(is.numeric(x)) return(as.numeric(mode0))
mode0
}
kdf_mode=apply(kdf,2, numcolwise(mode_fun))
But it gives an error if there are any non numeric columns.
We can use sapply to loop over the columns of 'df', apply the mode_fun to get the output vector ('v1'). We use an if/else condition to return NA for non-numeric columns.
v1 <- unname(sapply(df, function(x) if(!is.numeric(x)) NA else mode_fun(x)))
v1
#[1] NA NA NA 1 2 1
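One thing to keep in mind with mode_fun is how it behaves on ties. table() sorts the distinct values and which.max() returns the first maximum, so the smallest of the tied values is reported. A short sketch, repeating the function from the question so that it runs on its own:

```r
mode_fun <- function(x) {
  mode0 <- names(which.max(table(x)))
  if (is.numeric(x)) return(as.numeric(mode0))
  mode0
}

# 1 and 2 both occur twice; the smaller value is returned.
tied <- mode_fun(c(2, 2, 1, 1))
tied

# Non-numeric input gives the mode as a character string.
chr <- mode_fun(c("b", "a", "b"))
chr
```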
For the second case (I guess we don't need the 6th column, i.e. 't'), we loop through the columns of 'df' with sapply and use the if/else condition. In the else branch, we check which column values are equal to the mode (mode_fun(x)==x). We use & to get the logical index of values that are equal to the mode and also have t==1. We take the sum of that and divide by sum(v1).
unname(sapply(df[-6], function(x) if(!is.numeric(x)) {
NA
} else {
v1 <- mode_fun(x)==x
sum(v1 & t==1)/sum(v1)
} ))
#[1] NA NA NA 0.3333333 0.3333333
For the third, we change the condition to get the logical index where the column is not equal to the mode. Do the same as in the previous case.
unname(sapply(df[-6], function(x) if(!is.numeric(x)){
NA
} else {
v1 <- mode_fun(x)!=x
sum(v1 & t==1)/sum(v1)
} ))
#[1] NA NA NA 0.75 0.75
After we calculate 'v1', this can also be done without looping with sapply. We create a logical index ('indx') that is TRUE where the column is numeric and the column name is not 't'.
indx <- sapply(df, is.numeric) & names(df)!='t'
We subset 'df' and 'v1' based on 'indx' (df[indx], v1[indx]) and make the lengths equal by replicating the vector using col. col gives the numeric column index of each cell in df[indx]. Then we check whether the subset dataset is equal to the replicated vector, which gives a logical matrix.
indx1 <- df[indx]==v1[indx][col(df[indx])]
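The col() trick may be easier to follow on a tiny example. This is only an illustration with made-up values; m and modes stand in for df[indx] and v1[indx].

```r
m <- data.frame(x = c(1, 1, 2), y = c(2, 3, 2))
modes <- c(1, 2)   # one mode per column of m

# col(m) holds the column number of every cell, so indexing `modes`
# with it repeats each mode once per row of its column.
expanded <- modes[col(m)]
expanded           # 1 1 1 2 2 2

# Comparing cell by cell gives a logical matrix, filled column-wise.
is_mode <- m == expanded
is_mode
```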
As in the previous code, we use & to check whether the TRUE values in 'indx1' also correspond to t==1. Do colSums, divide by the colSums of 'indx1', and concatenate (c) with the NA elements of 'v1'.
unname(c(v1[is.na(v1)], colSums(indx1& t==1)/colSums(indx1)))
#[1] NA NA NA 0.3333333 0.3333333
Similarly, we can create 'indx2' by changing the condition and then do colSums as before
indx2 <- df[indx]!=v1[indx][col(df[indx])]
unname(c(v1[is.na(v1)], colSums(indx2& t==1)/colSums(indx2)))
#[1] NA NA NA 0.75 0.75
I am using R for a project and I have a data frame in the following format:
A B C
1 1 0 0
2 0 1 1
I want to return a data frame that gives the Column Name when the value is 1.
i.e.
Impair1 Impair2
1 A NA
2 B C
Is there a way to do this for thousands of records? The max impairment number is 4.
Note: There are more than 3 columns. Only 3 were listed to make it easier.
You could loop through the rows of your data, returning the column names where the data is set with an appropriate number of NA values padded at the end:
`colnames<-`(t(apply(dat == 1, 1, function(x) c(colnames(dat)[x], rep(NA, 4-sum(x))))),
paste0("Impair", 1:4))
# Impair1 Impair2 Impair3 Impair4
# 1 "A" NA NA NA
# 2 "B" "C" NA NA
Using the apply family of functions, here is a general solution that should work for your larger dataset:
res <- apply(df, 1, function(x) {
out <- character(4) # create a length-4 vector of empty strings
tmp <- colnames(df)[which(x==1)] # store the column names in a tmp field
out[seq_along(tmp)] <- tmp # overwrite the relevant positions (safe when tmp is empty)
out
})
# transpose and turn it into a data.frame
> data.frame(t(res))
X1 X2 X3 X4
1 A
2 B C
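If the result should be a data frame with NA padding and the Impair1..Impair4 names from the question, the two ideas above can be combined. A sketch under the assumption that no row has more than 4 columns set (longer results would be cut off); dat, max_n and out are illustrative names.

```r
dat <- data.frame(A = c(1, 0), B = c(0, 1), C = c(0, 1))
max_n <- 4   # maximum number of impairments per row, as in the question

rows <- lapply(seq_len(nrow(dat)), function(i) {
  hits <- names(dat)[unlist(dat[i, ]) == 1]
  length(hits) <- max_n   # extending a vector pads it with NA
  hits
})

out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(out) <- paste0("Impair", seq_len(max_n))
out
```

Setting length() on the vector avoids computing the amount of padding by hand, and it also handles rows where no column is set.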