R - Removing a specific row - r

I have a data frame called df and I want to remove a specific row that contains NA.

As commented before, you should provide a reproducible R example. If I understand correctly, you can simply use the subset function.
# Generating some fake data:
set.seed(101)
df <- data.frame("StudyID" = paste("Study", 1:100, sep = "_"),
                 "Column" = sample(c(1:30, NA), 100, replace = TRUE))
Use subset with !is.na() if your NA is a true missing value (Not Available):
newdf <- subset(df, !is.na(Column))
If your NA is a character:
# Numeric to character conversion
df$Column<- as.character(df$Column)
# Replace missing values with "NA"
df$Column[is.na(df$Column)] <- "NA"
Then subset on the string:
newdf <- subset(df, Column != "NA")
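The two cases above differ because is.na() only detects true missing values, not the text "NA". A minimal sketch of that difference, using made-up toy vectors (real_na, text_na, and toy are illustrative names, not from the question):

```r
# Toy vectors, invented for illustration only.
real_na <- c(1, NA, 3)        # contains a true missing value
text_na <- c("1", "NA", "3")  # contains the two-letter string "NA"

is.na(real_na)    # flags only the second element
is.na(text_na)    # flags nothing: "NA" here is ordinary text
text_na == "NA"   # a string comparison is what finds it

# Bracket indexing that does the same job as subset(df, !is.na(Column)).
toy <- data.frame(id = 1:3, Column = real_na)
kept <- toy[!is.na(toy$Column), ]
```

If you are unsure which case applies to your data, checking class(df$Column) and sum(is.na(df$Column)) first should tell you.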

Here is a solution using grepl from base R, treating NA as a character string.
pattern <- "NA"
df <- df[!grepl(pattern, df$Column), ]
If possible, share sample data so the structure of your data is clearer.
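One caveat with the grepl approach: the pattern "NA" matches anywhere inside a string, so values that merely contain those letters are dropped too. A small sketch on invented data (the toy data frame and its values are assumptions for illustration):

```r
# Invented example data: one literal "NA" and one value that only contains it.
toy <- data.frame(id = 1:4,
                  Column = c("NA", "NASA", "7", "12"),
                  stringsAsFactors = FALSE)

# Unanchored pattern: removes "NA" and also "NASA".
loose <- toy[!grepl("NA", toy$Column), ]

# Anchored pattern: removes only values that are exactly "NA".
anchored <- toy[!grepl("^NA$", toy$Column), ]

# Plain string comparison gives the same result as the anchored pattern here.
exact <- toy[toy$Column != "NA", ]
```

If your column can only ever hold numbers or "NA", the unanchored pattern is harmless; otherwise the anchored or exact form is probably safer.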

Related

How to replace several variables with several variables from another dataframe in R using a loop?

I would like to replace multiple variables with variables from a second dataframe in R.
df1$var1 <- df2$var1
df1$var2 <- df2$var2
# and so on ...
As you can see, the variable names are the same in both data frames, but the numeric values differ slightly. The correct version is in df2 but needs to be in df1. I need to do this for many, many variables in a complex data set and wonder whether someone could suggest a more efficient way to code this (possibly without using column references).
Here is some example data:
# dataframe 1
var1 <- c(1:10)
var2 <- c(1:10)
df1 <- data.frame(var1,var2)
# dataframe 2
var1 <- c(11:20)
var2 <- c(11:20)
df2 <- data.frame(var1,var2)
# assigning correct values
df1$var1 <- df2$var1
df1$var2 <- df2$var2
As Parfait has said, the current post seems a bit too simplified to give any immediate help but I will try and summarize what you may need for something like this to work.
If the assumption is that df1 and df2 have the same number of rows AND that their rows are already in matching order, then you can achieve this easily with the following subset notation:
df1[, c({column names df1})] <- df2[, c({column names df2}), drop = FALSE]
Let's say that df1 has columns a, b, and c, and you want to replace b and c with two columns of df2, whose columns are x, y, and z.
df1[, c("b", "c")] <- df2[, c("y", "z"), drop = FALSE]
Here we are replacing b with y and c with z. The drop argument on the right-hand side is just added protection: it ensures that subsetting the data.frame returns a data.frame and not a vector. It is left off the left-hand side because the data.frame assignment method does not take a drop argument.
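Since the question says the variable names are identical in both data frames, the column lists do not have to be typed out. A minimal sketch, assuming matching row order and using invented toy data (the id column and the name shared are illustrative):

```r
# Toy data: df1 has an extra id column that must stay untouched.
df1 <- data.frame(id = 1:3, var1 = 1:3, var2 = 4:6)
df2 <- data.frame(var1 = 11:13, var2 = 14:16)

# Columns present in both data frames, found by name.
shared <- intersect(names(df1), names(df2))

# Overwrite those columns in df1 with the versions from df2.
df1[shared] <- df2[shared]
```

This only makes sense when row i of df1 and row i of df2 describe the same observation.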
If you do NOT know the order is correct or one data frame may have a differing size than the other BUT there is a unique identifier between the two data.frames - then I would personally use a function that is designed for merging two data frames. Depending on your preference you can use merge from base or use *_join functions from the dplyr package (my preference).
library(dplyr)
#assuming a and x are unique identifiers that can be matched.
new_df <- left_join(df1, df2, by = c("a"="x"))
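If you would rather update df1 in place than build a joined data frame, base R's match() can align the rows by the identifier. This is only a sketch under the same assumption as above (a and x are unique identifiers); the values and the column names b and y are made up:

```r
# Toy data: df2 is in a different order and is missing one key.
df1 <- data.frame(a = c("s1", "s2", "s3"), b = c(1, 2, 3))
df2 <- data.frame(x = c("s3", "s1"), y = c(30, 10))

# For each row of df1, the position of its key in df2 (NA if absent).
idx <- match(df1$a, df2$x)
found <- !is.na(idx)

# Replace b only where a matching key exists; other rows keep their value.
df1$b[found] <- df2$y[idx[found]]
```

Rows without a match (here "s2") are left unchanged, which may or may not be what you want.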

using adist on two columns of data frame

I want to use adist to calculate edit distance between the values of two columns in each row.
I am using it in more-or-less this way:
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df$dist <- adist(my_df$A, my_df$B, ignore.case = TRUE)
my_df <- my_df[order(dist),]
The last two rows are the same as in my case, but the actual data frame looks a bit different - columns of my original data frame are character type, not factor. Also, the dist column seems to be returned as 2-column matrix, I have no idea why it happens.
Update:
I have read a bit and found that I need to apply it over the rows, so my new code is following:
apply(my_df, 1, function(d) adist(d[1], d[2]))
It works fine, but for my original dataset calling it by column numbers is impractical. How can I refer to column names in this function?
Using a tidyverse approach, you may use the following code:
library(tidyverse)
A <- c("mad","car")
B <- c("mug","cat")
my_df <- data.frame(A,B)
my_df %>%
  rowwise() %>%
  mutate(Lev_dist = adist(x = A, y = B, ignore.case = TRUE))
You can overcome that problem by using mapply, i.e.
mapply(adist, my_df$A, my_df$B)
#[1] 2 1
As per the adist function definition, the x and y arguments should be character vectors. In your example the function returns a 2x2 matrix because it also compares the cross pairs, "mad" with "cat" and "car" with "mug".
The row-wise distances are on the main diagonal of that matrix.
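To make the diagonal point concrete, here is a short base R sketch using the question's own example. The variable names all_pairs, row_wise, and per_row are mine, and the sort at the end is my reading of what the question's last line was trying to do:

```r
A <- c("mad", "car")
B <- c("mug", "cat")
my_df <- data.frame(A, B, stringsAsFactors = FALSE)

# adist compares every element of A with every element of B: a 2x2 matrix.
all_pairs <- utils::adist(my_df$A, my_df$B, ignore.case = TRUE)

# The row-wise distances (A[i] versus B[i]) sit on the main diagonal.
row_wise <- diag(all_pairs)

# Same result computed pair by pair, referring to columns by name.
per_row <- mapply(function(a, b) utils::adist(a, b, ignore.case = TRUE),
                  my_df$A, my_df$B)

# Store the distances and sort the data frame by them.
my_df$dist <- row_wise
my_df <- my_df[order(my_df$dist), ]
```

For large data frames the diagonal approach computes far more distances than needed, so the pair-by-pair version is likely the better choice there.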

Generate column names dynamically for a dataframe in R

So, I am converting a JSON into a data frame, and I'm successful in doing that. Below is my code:
df <- data.frame(t(sapply(json, c)))
colnames(df) <- gsub("X", "y",colnames(df))
So, it gives me column names like y1, y2, y3, etc. Is it possible to have these column names start from 0 instead, so that they look like y0, y1, y2, etc.?
From the comments:
df <- data.frame(t(sapply(json, c)))
colnames(df) <- paste0("y", 0:(ncol(df)-1))
Or if you want padded zeros
a <- seq(0,ncol(df)-1,1)
colnames(df) <- sprintf("y%02d",a)
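A self-contained sketch of both naming schemes, using a placeholder data frame in place of the JSON-derived one (the 12-column matrix and the names idx, plain, and padded are assumptions for illustration):

```r
# Placeholder for the data frame built from the JSON: 2 rows, 12 columns.
df <- as.data.frame(matrix(0, nrow = 2, ncol = 12))

# Zero-based index for each column.
idx <- seq_len(ncol(df)) - 1L

plain  <- paste0("y", idx)        # y0, y1, ..., y11
padded <- sprintf("y%02d", idx)   # y00, y01, ..., y11

colnames(df) <- plain
```

The padded form is mostly useful when the names will be sorted as text, since "y10" sorts before "y2" but "y02" sorts before "y10".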

Condensing code: Check that multiple columns follow a boolean in data frame in R

I have a data frame (df) that has some NA values. I wanted to extract the rows where there are NA values across multiple columns (in the example below, I am doing so for columns 12-20):
NArows = which(is.na(df[,20]) & is.na(df[,19]) & is.na(df[,18]) & is.na(df[,17]) & is.na(df[,16]) & is.na(df[,15]) & is.na(df[,14]) & is.na(df[,13]) & is.na(df[,12]))
Is there a more readable (and condensed) way to accomplish this, without putting each column condition surrounded by & sign?
Thank you for any help...
Try this:
cols <- 12:20
NArows <- which(apply(df[cols],1,function(y)sum(!is.na(y))==0))
It slices your df to just the columns in cols, then counts the non-NA values in each row; if a row has none, so every value in those columns is NA, its row number is added to NArows.
Or, based on David Arenburg's answer, this flags any rows with 2 or more NAs:
NArows <- which(rowSums(is.na(df[12:20])) > 1L)
Adapting it to more closely match your requirements, this flags only rows where all are NAs:
cols <- 12:20
NArows <- which(rowSums(is.na(df[cols])) == ncol(df[cols]))
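Here is the rowSums idea on a tiny invented data frame, so the "all NA" and "any NA" variants can be compared side by side (the data and the names all_na and any_na are illustrative assumptions):

```r
# Toy data: rows 1 and 2 are entirely NA in columns b and c.
df <- data.frame(a = c(1, NA, NA),
                 b = c(NA, NA, 2),
                 c = c(NA, NA, 3))
cols <- c("b", "c")

# Number of NAs per row within the chosen columns.
na_count <- rowSums(is.na(df[cols]))

all_na <- which(na_count == length(cols))  # every chosen column is NA
any_na <- which(na_count > 0)              # at least one chosen column is NA
```

Column names work just as well as column numbers in cols, which tends to be more robust if the column order ever changes.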

Replace a number in dataframe

I have a dataframe in which I occasionally have -1s. I want to replace them with NA. I tried the apply function, but it returns a matrix of characters to me, which is no good:
apply(d,c(1,2), function(x){
if (x == -1){
return (NA)
}else{
return (x)
}
})
I am wrestling with by but I cannot seem to handle it properly. I have got this so far:
s <-by(d,d[,'Q1_I1'], function(x){
for(i in x)
print(i)
})
If I understood correctly, by() passes my data frame into x row by row, and I can iterate through each element of the row with the for loop. I just don't know how to replace the value.
The reason that apply does not work is that it converts a data frame to a matrix and if your data frame has any factors then this will be a character matrix.
You can use lapply instead which will process the data frame one column at a time. This code works:
mydf <- data.frame( x=c(1:10, -1), y=c(-1, 10:1), g=sample(letters,11) )
mydf
mydf[] <- lapply(mydf, function(x) { x[x==-1] <- NA; x})
mydf
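A variation on the lapply approach, restricted to numeric columns so that factor or character columns are never compared with -1. This is a sketch on made-up data; the helper name is_num is mine:

```r
# Toy data with two numeric columns and one factor column.
mydf <- data.frame(x = c(1, -1, 3),
                   y = c(-1, 5, 6),
                   g = c("a", "b", "c"),
                   stringsAsFactors = TRUE)

# Logical vector marking which columns are numeric.
is_num <- vapply(mydf, is.numeric, logical(1))

# Replace -1 with NA in the numeric columns only.
mydf[is_num] <- lapply(mydf[is_num], function(col) {
  col[which(col == -1)] <- NA  # which() skips values that are already NA
  col
})
```

The factor column g keeps its class and levels, since it is never touched.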
As @rawr mentions in the comments, it also works to do:
mydf[ mydf== -1 ] <- NA
but the documentation (?'[.data.frame') says that this is not recommended due to the conversions.
One big question is how the data frame is being created. If you are reading the data using read.table or related functions then you can just specify the na.strings argument and have the conversion done for you as the data is read in.
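As a rough illustration of the na.strings route, here is a sketch that reads a small inline CSV; the column names Q1 and Q2 and the values are invented, and in practice you would pass a file path instead of text:

```r
# Inline CSV standing in for a real file; -1 is used as a missing-value code.
csv_text <- "Q1,Q2\n3,-1\n-1,5\n2,4"

# Treat both "NA" and "-1" as missing while reading.
d <- read.csv(text = csv_text, na.strings = c("NA", "-1"))
```

This keeps the columns numeric and avoids a separate clean-up step, but it only works if -1 is never a legitimate value in any column.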
You can do this quickly and transparently with the data.table package.
library(data.table)
# take a standard dataset and convert it to a data.table
mtcars = data.table(mtcars, keep.rownames = TRUE)
# select rows where gear is 5 and set gear to NA
mtcars[gear == 5, gear := NA]
mtcars

Resources