I have a dataframe with NA values peppered in that I want to interpolate.
Here is a reproducible example:
A <- as.data.frame(c(1:6))
A$b <- NA
A$c <- 2:7
library(zoo)
na.approx(A)
#expectation
A$b <- seq(1.5, 6.5, 1)
Obviously na.approx() isn't doing it for me; is there a function that will interpolate by row?
na.approx can also work column-wise on a matrix, so transpose, interpolate, and transpose back:
t(na.approx(t(A)))
How about this?
t(apply(A, 1, na.approx))
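If you want to keep the data.frame structure and column names with this approach, a minimal sketch (assuming, as in this example, that no row starts or ends with NA, since na.approx drops leading and trailing NAs by default):
library(zoo)

A <- as.data.frame(c(1:6))
A$b <- NA
A$c <- 2:7

# assigning into A[] keeps the data.frame class and the column names
A[] <- t(apply(A, 1, na.approx))
A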
Here is a solution that enables you to keep the original data type:
library(imputeTS)
as.data.frame(t(na.interpolation(t(A))))
Calculation-wise it does the same as the na.approx solutions mentioned above (but this way you'll still have a data.frame and retain your column names).
I have a data set with air quality data. The data frame is a matrix of 153 rows and 5 columns.
I want to find the mean of the first column in this data frame.
There are missing values in the column, so I want to exclude those while finding the mean.
And finally, I want to do that using control structures (for loops and if-else statements).
I have tried writing code as seen below. I have created 'y' instead of the actual Air Quality data set to have a reproducible example.
y <- c(1,2,3,NA,5,6,NA,NA,9,10,11,NA,13,NA,15)
x <- matrix(y,nrow=15)
for(i in 1:15){
  if(is.na(data.frame[i,1]) == FALSE){
    New.Vec <- c(x[i,1])
  }
}
print(mean(New.Vec))
I expected the output to be the mean, but the error I received is this:
Error: object 'New.Vec' not found
One line of code, no need for a for loop.
mean(data.frame$name_of_the_first_column, na.rm = TRUE)
Setting na.rm = TRUE makes the mean function ignore NAs.
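For example, applied to the reproducible data from the question (taking the first column of x):
y <- c(1, 2, 3, NA, 5, 6, NA, NA, 9, 10, 11, NA, 13, NA, 15)
x <- matrix(y, nrow = 15)
mean(x[, 1], na.rm = TRUE)  # 7.5, the mean of the ten non-missing values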
Here, we can make use of na.aggregate from zoo
library(zoo)
df1[] <- na.aggregate(df1)
This assumes that 'df1' is a data.frame with all numeric columns and that we want to fill the NA elements with the corresponding mean of that column. na.aggregate, by default, has fun.aggregate set to mean.
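A minimal sketch with a small made-up data.frame, just to show the fill:
library(zoo)

df1 <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 7))
df1[] <- na.aggregate(df1)
df1
#   a b
# 1 1 6
# 2 2 5
# 3 3 7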
I can't see your data, but probably something like this? The vector needs to be initialized first. It's better to avoid loops in R when you can...
myDataFrame <- read.csv("hw1_data.csv")
New.Vec <- c()
for(i in 1:153){
  if(!is.na(myDataFrame[i,1])){
    New.Vec <- c(New.Vec, myDataFrame[i,1])
  }
}
print(mean(New.Vec))
I want to use adist to calculate edit distance between the values of two columns in each row.
I am using it in more-or-less this way:
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df$dist <- adist(my_df$A, my_df$B, ignore.case = TRUE)
my_df <- my_df[order(my_df$dist), ]
The last two lines are the same as in my case, but the actual data frame looks a bit different - the columns of my original data frame are of character type, not factor. Also, the dist column seems to be returned as a 2-column matrix; I have no idea why that happens.
Update:
I have read a bit and found that I need to apply it over the rows, so my new code is the following:
apply(my_df, 1, function(d) adist(d[1], d[2]))
It works fine, but for my original dataset referring to columns by number is impractical; how can I refer to column names in this function?
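Something like the following sketch is what I am after (assuming the columns are still called A and B), but I am not sure it is the right way:
apply(my_df[, c("A", "B")], 1, function(d) adist(d["A"], d["B"]))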
Using a tidyverse approach, you may use the following code:
library(tidyverse)
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df %>%
  rowwise() %>%
  mutate(Lev_dist = adist(x = A, y = B, ignore.case = TRUE))
You can overcome that problem by using mapply, i.e.
mapply(adist, my_df$A, my_df$B)
#[1] 2 1
As per the adist function documentation, the x and y arguments should be character vectors. In your example the function returns a 2x2 matrix because it also compares the cross pairs: "mad" with "cat" and "car" with "mug".
Just look at the main diagonal of the matrix.
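For example, with the question's data (a small sketch):
diag(adist(my_df$A, my_df$B, ignore.case = TRUE))
# [1] 2 1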
I'm sure there is a very easy way to accomplish this task, but I cannot seem to figure it out. I have two data frames which have the exact same data but from two separate locations.
df1 <- data.frame(a=c(1,2,3,NA),b=c(1,5,4,6))
df2 <- data.frame(a=c(3,4,5,6),b=c(7,8,9,NA))
My desired output is to have new versions of df1 and df2 which are exactly the same, except that the bottom row contains only NA values. I.e. if there is an NA value in one data frame, I need that replicated in the corresponding cell of the other data frame...
df1[4,2] <- NA
df2[4,1] <- NA
I have seen very similar questions addressing the problem from the opposite perspective (e.g. Filling missing values in a data.frame from another data.frame) but I can't figure how to apply this to my own data. Thank you in advance.
We can create an index based on the occurrence of NA in either of the two datasets and multiply by it:
i1 <- NA^(is.na(df1)| is.na(df2))
df1 <- df1 * i1
df2 <- df2 * i1
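The reason this works is that raising NA to a logical power gives 1 for FALSE and NA for TRUE:
NA^FALSE  # 1, since anything to the power 0 is 1
NA^TRUE   # NA
# i1 is therefore 1 where both data frames have a value and NA where either is
# missing, so multiplying by i1 propagates those NAs into both data frames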
Here are some possibilities. (1) seems the cleanest and clearest in intent. (3) works but seems unnecessarily complex in terms of sorting out side effects.
1) replace Try replace.
df1new <- replace(df1, is.na(df2), NA)
df2new <- replace(df2, is.na(df1), NA)
This would continue to work if df1new and df2new were replaced directly with df1 and df2, although that adds complexity. In that case it might be better to compute df1new and df2new first and assign them afterwards (i.e. df1 <- df1new; df2 <- df2new) to avoid the complexity.
2) indexing It could alternately be written like this:
df1new <- df1
df1new[is.na(df2)] <- NA
df2new <- df2
df2new[is.na(df1)] <- NA
3) destructive indexing Not sure that this one is a good idea but it works here:
df1[is.na(df2)] <- df2[is.na(df1)] <- NA
How can I know how many values are NA in a dataset? Or whether there are any NAs or NaNs in the dataset?
This may also work fine
sum(is.na(df)) # For entire dataset
For a particular column in the dataset:
sum(is.na(df$col1))
Or, to check all the columns, as mentioned by @nicola:
colSums(is.na(df))
As @Roland noted, there are multiple functions for finding and dealing with missing values in R (see help("NA")).
Example:
Create a fake dataset with some NA's:
data <- matrix(1:300, ncol = 3)
data[sample(300, 40)] <- NA
Check if there are any missing values:
anyNA(data)
Columnwise check if there are any missing values:
apply(data, 2, anyNA)
Check percentages and counts of missing values in columns:
colMeans(is.na(data))*100
colSums(is.na(data))
For a dataframe it is:
sum(is.na(df))
where df is the dataframe.
Whereas for a particular column in the dataframe you can use:
sum(is.na(df$col))
or
cnt <- 0
for(i in df$col){
  if(is.na(i)){
    cnt <- cnt + 1
  }
}
cnt
Here cnt gives the number of NAs in the column.
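For example, with a hypothetical column containing two NAs, both approaches give the same count:
df <- data.frame(col = c(1, NA, 3, NA, 5))
sum(is.na(df$col))  # 2, the same value cnt ends up with after the loop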
You can also get the number of NAs in each column of the dataset simply by using summary().
For a vector x
summary(x)
For a data frame df
summary(df)
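For instance, with a hypothetical single-column data frame containing missing values:
df <- data.frame(col = c(1, NA, 3, NA))
summary(df)  # the summary for col includes an NA's count (2 here)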
I want to apply some operations to the values in a number of columns, and then sum the results of each row across columns. I can do this using:
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$a2 <- x$a^2
x$b2 <- x$b^2
x$result <- x$a2 + x$b2
but this will become arduous with many columns, and I'm wondering if anyone can suggest a simpler way. Note that the dataframe contains other columns that I do not want to include in the calculation (in this example, column sample is not to be included).
Many thanks!
I would simply subset the columns of interest and apply everything directly on the matrix using the rowSums function.
x <- data.frame(sample=1:3, a=4:6, b=7:9)
# put column indices and apply your function
x$result <- rowSums(x[,c(2,3)]^2)
This of course assumes your function is vectorized. If not, you would need to use some apply variation (which you are seeing many of). That said, you can still use rowSums if you find it useful, like so. Note that I use sapply, which also returns a matrix.
# random custom function
myfun <- function(x){
  return(x^2 + 3)
}
rowSums(sapply(x[,c(2,3)], myfun))
I would suggest to convert the data set into the 'long' format, group it by sample, and then calculate the result. Here is the solution using data.table:
library(data.table)
melt(setDT(x), id.vars = 'sample')[, sum(value^2), by = sample]
# sample V1
#1: 1 65
#2: 2 89
#3: 3 117
You can easily replace value^2 by any function you want.
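For instance, plugging in the custom x^2 + 3 function from another answer here (purely as an illustration):
myfun <- function(v) v^2 + 3
melt(setDT(x), id.vars = 'sample')[, sum(myfun(value)), by = sample]
# sample V1
#1: 1 71
#2: 2 95
#3: 3 123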
You can use the apply function and select the columns you need with c(i1, i2, ..., etc.):
apply(x[, c(2, 3)]^2, 1, sum)
If you want to apply a function named somefunction to some of the columns, whose indices or colnames are in the vector col_indices, and then sum the results, you can do:
# if somefunction can be vectorized:
x$results <- apply(x[, col_indices], 1, function(x) sum(somefunction(x)))
# if not:
x$results <- apply(x[, col_indices], 1, function(x) sum(sapply(x, somefunction)))
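A hypothetical usage with the example data, referring to the columns by name and using a placeholder function:
x <- data.frame(sample = 1:3, a = 4:6, b = 7:9)
col_indices <- c("a", "b")       # column names work just as well as indices
somefunction <- function(v) v^2  # placeholder function
x$results <- apply(x[, col_indices], 1, function(x) sum(somefunction(x)))
x$results
# [1]  65  89 117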
I want to come at this one from a "no extensions" R POV.
It's important to remember what kind of data structure you are working with. Data frames are actually lists of vectors: each column is itself a vector. So you can use the handy-dandy lapply function to apply a function to the desired columns in the list/data frame.
I'm going to define a function as the square, as you have above, but of course this can be any function of any complexity (so long as it takes a vector as input and returns a vector of the same length; if it doesn't, it won't fit into the original data.frame!).
The steps below are extra pedantic to show each little bit, but obviously they can be compressed into one or two steps. Note that I only retain the row sums of the squared columns, given that you might want to save space in memory if you are working with lots and lots of data.
1. Create the data; define the function.
2. Grab the columns you want as a separate (temporary) data.frame.
3. Apply the function to the data.frame/list you just created.
4. lapply returns a list, so if you intend to retain it separately, make it a temporary data.frame. This is not strictly necessary.
5. Calculate the sums of the rows of the temporary data.frame and append them as a new column in x.
6. Remove the temporary data.frame.
Code:
x <- data.frame(sample=1:3, a=4:6, b=7:9); square <- function(x) x^2 #step 1
x[2:3] #step 2
temp <- data.frame(lapply(x[2:3], square)) #step 3 and step 4
x$squareRowSums <- rowSums(temp) #step 5
rm(temp) #step 6
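For the example data, the appended column matches the row sums reported in the other answers:
x$squareRowSums
# [1]  65  89 117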
Here is another apply solution:
cols <- c("a", "b")
x <- data.frame(sample=1:3, a=4:6, b=7:9)
x$result <- apply(x[, cols], 1, function(x) sum(x^2))