I have something like this:
dates <- seq(from = as.Date("2010-01-01"), as.Date("2017-12-01"), "1 day")
values = cumsum(rnorm(length(dates)))
df <- cbind(dates, values)
Which looks like:
dates values
1 14610 -0.3750827
2 14611 0.2068051
3 14612 0.1986609
4 14613 0.1793758
5 14614 1.1068358
6 14615 0.9621490
I would like to randomly insert NA values into the data, so that it looks like this:
dates values
1 14610 -0.3750827
2 NA NA
3 14612 0.1986609
4 14613 0.1793758
5 NA NA
6 14615 0.9621490
That is, some rows should consist entirely of NA values. I have found code that randomly adds NA values, but only to one column.
ind <- sample(df, 100)
df[ind] <- NA
This does not work for me.
To do it your way, you'd need a logical matrix with the same dimensions as df, filled with random TRUEs and FALSEs, so that you can replace the cells marked TRUE with NA. Here's a way:
ind <- matrix(sample(c(TRUE, FALSE), prod(dim(df)), replace = TRUE),
              nrow = nrow(df), ncol = ncol(df))
df[ind] <- NA
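The matrix approach above blanks individual cells independently of each other. If whole rows should go missing, as in the desired output in the question, one option is to sample row indices instead. This is a minimal sketch, assuming the data is held in a data frame rather than the cbind() matrix (so the dates keep their Date class); n_missing and the seed are arbitrary values chosen for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

dates  <- seq(from = as.Date("2010-01-01"), to = as.Date("2017-12-01"), by = "1 day")
values <- cumsum(rnorm(length(dates)))
df     <- data.frame(dates, values)

n_missing <- 100                      # hypothetical number of rows to blank out
ind <- sample(nrow(df), n_missing)    # random row indices, drawn without replacement
df[ind, ] <- NA                       # set every column in those rows to NA

sum(!complete.cases(df))              # number of rows that now contain NA
```

Sampling from nrow(df) rather than from df itself is what makes the index usable for row-wise assignment.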
I have this code written so far:
getSymbols(Symbols="SPY", from="2012-01-01", to= "2013-12-31")
SPY=data.frame(SPY)
Data=matrix(SPY$SPY.Adjusted, ncol=10, byrow=T)
df=data.frame(Data)
The matrix is filled completely with values, but I want the last 9 values to be NA. How can I stop matrix() from recycling the vector? I should end up with a 51x10 matrix with 10 values in each of the first 50 rows, and a last row with 1 value and 9 NAs.
One option is to use length<- to pad the vector with NA at the end before building the matrix:
library(quantmod)
Data <- matrix(`length<-`(SPY$SPY.Adjusted, 51 * 10), ncol = 10, byrow = TRUE)
Testing:
dim(Data)
#[1] 51 10
Data[51,]
#[1] 161.0724 NA NA NA NA NA NA NA NA NA
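The 51 * 10 above is hard-coded for this particular series. A minimal sketch of the same padding idea for a vector of any length follows; x is a stand-in for SPY$SPY.Adjusted (an assumption, so the example runs without downloading data), and n_col and n_row are names introduced here for illustration.

```r
x <- seq_len(501)                     # stand-in for SPY$SPY.Adjusted (hypothetical data)
n_col <- 10

n_row <- ceiling(length(x) / n_col)   # rows needed to hold every value
length(x) <- n_row * n_col            # extending a vector pads it with NA

Data <- matrix(x, ncol = n_col, byrow = TRUE)
dim(Data)
Data[n_row, ]
```

Because the vector is padded to an exact multiple of the column count, matrix() has nothing left to recycle.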
I am looking for a more efficient way (in terms of length of code) of converting a data.frame from:
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 1 2 3 NA NA NA NA NA NA
# 2 NA NA NA 3 2 1 NA NA NA
# 3 NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA 1 2 3
to
# [,1] [,2] [,3]
#[1,] 1 2 3
#[2,] 3 2 1
#[3,] NA NA NA
#[4,] NA NA NA
#[5,] 1 2 3
That is, I want to remove excess NAs but correctly represent rows with only NAs.
I wrote the following function which does the job, but I am sure there is a less lengthy way of achieving the same.
#Dummy data.frame
data <- matrix(c(1:3, rep(NA, 6),
rep(NA, 3), 3:1, rep(NA, 3),
rep(NA, 9),
rep(NA, 9),
rep(NA, 6), 1:3),
byrow=TRUE, ncol=9)
data <- as.data.frame(data)
sieve <- function(data) {
#get a list of all entries that are not NA
cond <- apply(data, 1, function(x) x[!is.na(x)])
#set integer(0) equal to NA
cond[sapply(cond, function(x) length(x)==0)] <- NA
#check how many items there are in non-empty rows
#(rows are either empty or contain the same number of items)
n <- max(sapply(cond, length))
#replace single NA with n NAs, where n = number of items
#first get an index of entries with single NAs
index <- (1:length(cond)) [sapply(cond, function(x) length(x)==1)]
#then replace each entry with n NAs
for (i in index) cond[[i]] <- rep(NA, n)
#turn list into a data.frame
cond <- matrix(unlist(cond), nrow=length(cond), byrow=TRUE)
cond
}
sieve(data)
My question resembles this question about extracting conditions to which participants are assigned (for which I received great answers). I tried expanding these answers to the current dummy data, but without success so far. Hence my rather lengthy custom function.
Edit: Additional info on why I am asking this question: The first data frame represents the raw output from an experiment in which I assigned participants to one of three conditions (using 3 here for simplicity). In each condition, participants read a different scenario, but then answered the same set of questions about the scenario they had read. Qualtrics recorded answers from participants in the first condition in columns V1 through V3, answers from participants in the second condition in columns V4 through V6, and answers from participants in the third condition in columns V7 through V9. (If this block had contained 4 questions, it would have been columns V1 through V4 for the first condition, V5 through V8 for the second condition ...).
You can try this if every row that isn't entirely NA has the same number of non-NA values (d below stands for the data frame from the question):
First, create a data frame with the appropriate (transposed) dimensions, and fill it with NAs.
d2 <- data.frame(
matrix(nrow = max(apply(d, 1, function(ii) sum(!is.na(ii)))),
ncol=nrow(d)))
Then use apply to fill that data frame column by column, and transpose the result to get your desired outcome:
d2[] <- apply(d, 1, function(ii) ii[!is.na(ii)])
t(d2)
# [,1] [,2] [,3]
#X1 1 2 3
#X2 3 2 1
#X3 NA NA NA
#X4 NA NA NA
#X5 1 2 3
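A shorter variant of the same idea is to pad each row's non-NA values to a common length inside apply, which avoids creating the intermediate data frame. This is a sketch under the same assumption (non-empty rows all hold the same number of values); n and res are names introduced here for illustration.

```r
# Dummy data from the question
data <- as.data.frame(matrix(c(1:3, rep(NA, 6),
                               rep(NA, 3), 3:1, rep(NA, 3),
                               rep(NA, 9),
                               rep(NA, 9),
                               rep(NA, 6), 1:3),
                             byrow = TRUE, ncol = 9))

n <- max(rowSums(!is.na(data)))   # number of values in a non-empty row

res <- t(apply(data, 1, function(x) {
  v <- unname(x[!is.na(x)])       # keep the recorded values, drop column names
  length(v) <- n                  # pad with NA (all-NA rows become n NAs)
  v
}))
res
```

apply returns one column per input row, so the final t() restores the original row order.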
Given a data frame like this:
A <- c(1,2,3,4,NA,6,7,8,9,10,11,12,13,14,15)
B <- c(NA,NA,NA,20,NA,NA,NA,15,NA,NA,NA,NA,11,NA,9)
DF <- data.frame(A, B)
I would like to calculate the mean for a range of values in column A, based on the value in column B. Specifically, every time there is a non-NA value in column B, I would like to calculate the mean of the range of rows 2 above and 2 below in column A.
For example, the first non-NA value in column B is 20. So I would like to calculate the mean of the two rows above (2, 3), the two rows below (NA, 6), and the row itself (4), ignoring the NA. So:
mean(c(2, 3, 4, NA, 6), na.rm = TRUE)
Similarly, the next non-NA value in column B is 15, which would be
mean(c(6, 7, 8, 9, 10))
So, the end result for the entire data frame would be a new column C
DF$C <- c(NA,NA,NA,3.75,NA,NA,NA,8,NA,NA,NA,NA,13,NA,14)
You could try the following.
nona <- !is.na(DF$B)
DF$C <- replace(
DF$B,
nona,
vapply(which(nona), function(i) {
ii <- (i-2):(i+2)
mean(DF$A[ii[ii > 0]], na.rm = TRUE)
}, 1)
)
Here we find the non-NA values in column B and use their positions to build the indices of the values in column A that we want to average, being careful to drop any zero or negative subscripts that would occur if the first one or two values of column B were not NA. Indices past the end of column A simply return NA, which na.rm = TRUE removes. The above code gives the following result for DF.
A B C
1 1 NA NA
2 2 NA NA
3 3 NA NA
4 4 20 3.75
5 NA NA NA
6 6 NA NA
7 7 NA NA
8 8 15 8.00
9 9 NA NA
10 10 NA NA
11 11 NA NA
12 12 NA NA
13 13 11 13.00
14 14 NA NA
15 15 9 14.00
Here is an approach with the zoo package:
library(zoo)
width <- 5 # the observation ± 2
DF$C <- rollapply(DF$A, width, mean, na.rm = TRUE, partial = TRUE)
# when DF$B is NA, assign NA to corresponding DF$C
DF$C[is.na(DF$B)] <- NA
partial = TRUE allows calculating the mean with a partial window at the leading and trailing parts of the DF$A vector where the whole window can't be accommodated (i.e. the first 2 and last 2 values of DF$A where a window of size 5 is not possible).
I am trying to merge 6+ datasets into one by ID. Right now, the duplication of IDs makes merge treat each as a new observation.
Example code:
combined <-Reduce(function(x,y) merge(x,y, all=TRUE), list(NRa,NRb,NRc,NRd,NRe,NRf,NRg,NRh))
Which gives me this:
ID Segment.h Segment.g Segment.f Segment.e Segment.d Segment.c
1 62729107 NA NA NA NA NA 1
2 62734839 NA 1 NA NA 1 NA
3 62734839 NA NA NA 1 NA NA
4 62737229 NA 1 NA NA NA NA
5 62737229 NA NA NA 1 1 NA
I would like each ID to have a single row:
ID Segment.h Segment.g Segment.f Segment.e Segment.d Segment.c
1 62729107 NA NA NA NA NA 1
2 62734839 NA 1 NA 1 1 NA
3 62737229 NA 1 NA 1 1 NA
Any help is appreciated. Thank you.
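Using the sqldf package to join the data frames on the ID column will leave you with one row per ID. Note that an inner join keeps only the IDs that appear in every table.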
Data1 <- data.frame(
X = sample(1:10),
Housing = sample(c("yes", "no"), 10, replace = TRUE)
)
Data2 <- data.frame(
X = sample(1:10),
Credit = sample(c("yes", "no"), 10, replace = TRUE)
)
Data3 <- data.frame(
X = sample(1:10),
OwnsCar = sample(c("yes", "no"), 10, replace = TRUE)
)
Data4 <- data.frame(
X = sample(1:10),
CollegeGrad = sample(c("yes", "no"), 10, replace = TRUE)
)
library(sqldf)
sqldf("Select Data1.X,Data1.Housing,Data2.Credit,Data3.OwnsCar,Data4.CollegeGrad from Data1
inner join Data2 on Data1.X = Data2.X
inner join Data3 on Data1.X = Data3.X
inner join Data4 on Data1.X = Data4.X
")
Why don't you try by = 'ID' in your merge() call? If that's not enough, try aggregate().
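Both suggestions can be sketched as follows. The input data frames are hypothetical toy versions, since the real NRa, NRb, ... are not shown in the question, and collapse is a helper name made up for this example. It assumes each Segment column holds only 1 or NA.

```r
# Hypothetical toy inputs standing in for the real data sets
NRc <- data.frame(ID = 62729107, Segment.c = 1)
NRd <- data.frame(ID = c(62734839, 62737229), Segment.d = c(1, 1))
NRe <- data.frame(ID = c(62734839, 62737229), Segment.e = c(1, 1))

# Merge on ID only, so no other shared column is used as a join key
combined <- Reduce(function(x, y) merge(x, y, by = "ID", all = TRUE),
                   list(NRc, NRd, NRe))

# If duplicated IDs already exist, collapse them afterwards:
# keep the value if any row has one, otherwise stay NA
collapse <- function(v) if (all(is.na(v))) NA_real_ else max(v, na.rm = TRUE)
dup <- data.frame(ID = c(1, 1, 2),
                  Segment.g = c(1, NA, NA),
                  Segment.e = c(NA, 1, 1))
out <- aggregate(dup[-1], by = list(ID = dup$ID), FUN = collapse)
out
```

Without by, merge() joins on every column name the inputs share, which is one way the same ID can end up on several rows.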
Your description of the problem is not entirely clear, and you don't provide data.
Assuming that all of your dataframes have the same dimensions, column names, column orders, ID entries, that the ID row orders match, that ID is the first column, that all other entries are either NA or 1 and that any cell in one dataframe featuring a 1 has NA values in that cell for all other data frames or that sums of numeric values are acceptable, and that you want the result as a data frame ...
An Old-School solution using the abind package:
consolidate <- function(lst) {
  stopifnot(require(abind))
  ## form 3D array, replace NA
  x <- abind(lst, along=3)
  x[is.na(x)] <- 0
  z <- x[,,1] ## data store
  ## sum array along 3rd dimension
  for (j in seq(2,ncol(x)))
    for (i in seq(nrow(x)))
      z[i,j] <- sum(x[i,j,])
  z[z==0] <- NA ## restore NA
  as.data.frame(z)
}
For dataframes (with the above caveats) a,b,c:
consolidate(list(a,b,c))
I am trying to combine two dataframes with different number of columns and column headers. However, after I combine them using rbind.fill(), the resulting file has filled the empty cells with NA.
This is very inconvenient since one of the columns has data that is also represented as "NA" (for North America), so when I export it to a CSV, the spreadsheet can't tell them apart.
Is there a way for me to:
Use the rbind.fill function without having it populate the empty cells with NA
or
Change the column to replace the NA values*
*I've scoured the blogs, and have tried the two most popular solutions:
df$col[is.na(df$col)] <- 0, #it does not work
df$col = ifelse(is.na(df$col), "X", df$col), #it changes all the characters to numbers, and ruins the column
Let me know if you have any advice! I (unfortunately) cannot share the df, but will be willing to answer any questions!
NA is not the same as "NA" to R, but might be interpreted as such by your favourite spreadsheet program. NA is a special value in R just like NaN (not a number). If I understand correctly, one of your solutions is to replace the "NA" values in the column representing North America with something else, in which case you should just be able to do...
df$col[ df$col == "NA" ] <- "NorthAmerica"
This assumes that your "NA" values are actually character strings. is.na() returns FALSE for the string "NA", so nothing is selected, which is why df$col[ is.na(df$col) ] <- 0 won't work.
An example of the difference between NA and "NA":
> x <- c( 1, 2, 3 , "NA" , 4 , 5 , NA )
> x[ !is.na(x) ]
[1] "1" "2" "3" "NA" "4" "5"
> x[ x == "NA" & !is.na(x) ]
[1] "NA"
Method to resolve this
I think you want to leave "NA" and any NAs as they are in the first df, but make all NA in the second df formed from rbind.fill() change to something like "NotAvailable". You can accomplish this like so...
library(plyr) # provides rbind.fill()
df1 <- data.frame( col = rep( "NA" , 6 ) , x = 1:6 , z = rep( 1 , 6 ) )
df2 <- data.frame( col = rep( "SA" , 2 ) , x = 1:2 , y = 5:6 )
df <- rbind.fill( df1 , df2 )
temp <- df [ (colnames(df) %in% colnames(df2)) ]
temp[ is.na( temp ) ] <- "NotAvailable"
res <- cbind( temp , df[ !( colnames(df) %in% colnames(df2) ) ] )
#df has real NA values in column z and column y. We just want to get rid of y's
df
# col x z y
# 1 NA 1 1 NA
# 2 NA 2 1 NA
# 3 NA 3 1 NA
# 4 NA 4 1 NA
# 5 NA 5 1 NA
# 6 NA 6 1 NA
# 7 SA 1 NA 5
# 8 SA 2 NA 6
#res has "NA" strings in col representing "North America" and NA values in z, whilst those in y have been removed
#More generally, any NA in df1 will be left 'as-is', whilst NA from df2 formed using rbind.fill will be converted to character string "NotAvailable"
res
# col x y z
# 1 NA 1 NotAvailable 1
# 2 NA 2 NotAvailable 1
# 3 NA 3 NotAvailable 1
# 4 NA 4 NotAvailable 1
# 5 NA 5 NotAvailable 1
# 6 NA 6 NotAvailable 1
# 7 SA 1 5 NA
# 8 SA 2 6 NA
If you have a data frame that contains NAs and you want to replace them all, you can do something like:
df[is.na(df)] <- -999
This will take care of all NAs in one shot.
If you only want to act on a single column, you can do something like:
df$col[which(is.na(df$col))] <- -999
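For the CSV problem raised in the question, another option is to leave the missing values alone in R and only change how they are written out. A minimal sketch, assuming the file is produced with write.csv(); the data frame, column names, and temporary file path are made up for illustration.

```r
df <- data.frame(region = c("NA", "SA", NA),   # "NA" = North America, NA = missing
                 x = c(1, NA, 3),
                 stringsAsFactors = FALSE)

out_file <- tempfile(fileext = ".csv")         # temporary path, for illustration only

# Write missing values as empty cells instead of the text NA
write.csv(df, out_file, na = "", row.names = FALSE)

# Reading back: treat only empty fields as missing, so "NA" stays a string
back <- read.csv(out_file, na.strings = "", stringsAsFactors = FALSE)
back$region
```

This keeps real NA values usable inside R while the spreadsheet sees blank cells for missing data and the literal text NA for North America.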