I want to extract specific elements, specifically ID, from rows that have NAs. Here is my df:
df
ID x
1-12 1
1-13 NA
1-14 3
2-12 20
3-11 NA
I want a dataframe that has the IDs of observations that are NA, like so:
df
ID x
1-13 NA
3-11 NA
I tried this, but it's giving me a dataframe with the row #s that have NAs (e.g., row 2, row 5), not the IDs.
df1 <- data.frame(which(is.na(df$x)))
Can someone please help?
This is a very basic subsetting question:
df[is.na(df$x),]
Good basic and free guides can be found on w3schools: https://www.w3schools.com/r/
Cheers
Hannes
Simply run the following line:
df[is.na(df$x),]
Another option is complete.cases
subset(df, !complete.cases(x))
Here is another base R option using na.omit
> df[!1:nrow(df) %in% row.names(na.omit(df)), ]
ID x
2 1-13 NA
5 3-11 NA
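To make the answers above concrete, here is a minimal sketch that rebuilds the sample data from the question and applies the logical-subsetting approach:

```r
# Rebuild the sample data frame from the question
df <- data.frame(ID = c("1-12", "1-13", "1-14", "2-12", "3-11"),
                 x  = c(1, NA, 3, 20, NA))

# Keep only the rows where x is NA
na_rows <- df[is.na(df$x), ]
na_rows$ID
# [1] "1-13" "3-11"
```

The same rows come back from `subset(df, is.na(x))` or `df[!complete.cases(df$x), ]`; all three are equivalent for a single column.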
Related
I am trying to replace NA values in multiple columns of dataframe x1 with the average of the values from dataframes x2 and x3, matched on the common attribute 'ID'.
All the dataframes (each dataframe is for a particular year) have the same column structure:
ID A B C .....
01 2 5 7 .....
02 NA NA NA .....
03 5 4 8 .....
I have found an answer to do it for 1 column at a time, thanks to this post.
x1$A[is.na(x1$A)] <- (x2$A[match(x1$ID[is.na(x1$A)],x2$ID)] + x3$A[match(x1$ID[is.na(x1$A)],x3$ID)])/2
But since I have about 100 columns to apply this to, I would really like a smarter way to do it.
I tried the suggestions from this post and also from here.
I came up with this code, but couldn't make it work.
x1[6:105] = as.data.frame(lapply(x1[6:105], function(x) ifelse(is.na(x), (x2$x[match(x1$ID, x2$ID)]+x3$x[match(x1$ID, x3$ID)])/2, x1$x)))
Got the following error:
Error in ifelse(is.na(x), (x2$x[match(x1$ID, x2$ID)] + x3$x[match(x1$ID, : replacement has length zero
I initially thought function(x) gave me the column name, but x is actually the full column of values, so x2$x looks up a column literally named "x", which doesn't exist and returns NULL, and that is why it won't work (the arithmetic on NULL yields a length-zero replacement).
I am a novice in R and would appreciate some guidance on where I am going wrong in applying the logic to multiple columns.
for (i in 1:ncol(x1)) {
  nas <- is.na(x1[, i])           # which rows of this column are NA
  if (sum(nas) == 0) next
  ids <- x1$ID[nas]               # IDs of the NA rows
  nam <- colnames(x1)[i]          # name of the current column
  x1[nas, i] <- (x2[match(ids, x2$ID), nam] + x3[match(ids, x3$ID), nam]) / 2
}
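To check the loop's logic, here is a small self-contained sketch with made-up toy data (three columns, deliberately shuffled IDs in x2), matching on the question's ID column:

```r
# Toy data: x1 has NAs to fill from the average of x2 and x3, matched by ID
x1 <- data.frame(ID = 1:3, A = c(2, NA, 5), B = c(5, NA, 4))
x2 <- data.frame(ID = 3:1, A = c(6, 3, 2), B = c(5, 7, 5))   # note: reversed ID order
x3 <- data.frame(ID = 1:3, A = c(2, 5, 4), B = c(5, 9, 3))

for (i in 2:ncol(x1)) {           # skip the ID column
  nas <- is.na(x1[, i])
  if (sum(nas) == 0) next
  ids <- x1$ID[nas]
  nam <- colnames(x1)[i]
  x1[nas, i] <- (x2[match(ids, x2$ID), nam] + x3[match(ids, x3$ID), nam]) / 2
}
x1$A   # NA for ID 2 replaced by (3 + 5) / 2 = 4
```

Because match aligns on ID rather than row position, the reversed order of x2 is handled correctly.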
I have the following dataframe: (this is just a small sample)
VALUE COUNT AREA n_dd-2000 n_dd-2001 n_dd-2002 n_dd-2003 n_dd-2004 n_dd-2005 n_dd-2006 n_dd-2007 n_dd-2008 n_dd-2009 n_dd-2010
2 16 2431 243100 NA NA NA NA NA NA 3.402293 3.606941 4.000461 3.666381 3.499614
3 16 2610 261000 3.805082 4.013435 3.98 3.490139 3.433857 3.27813 NA NA NA NA NA
4 16 35419 3541900 NA NA NA NA NA NA NA NA NA NA NA
and I would like to combine all three rows into one row replacing NA with the number that appears in each column (there's only one number per column). Just ignore the first three columns. I used this code:
bdep[4,4:9] <- bdep[3,4:9]
to replace NAs with numbers from another row, but I can't figure out how to repeat it for all the columns. Columns 4 and beyond hold, in each row, a sequence of six numbers followed by 20 NAs, so I've tried going down the road of using lapply() and seq() or for loops, but my efforts are failing.
I made a simple solution by replacing the NAs with zeroes and summing all rows per column. Does this work?
#data
bdep <- rbind(c(rep(NA, 6), 3.402293, 3.606941, 4.000461, 3.666381, 3.499614),
              c(3.805082, 4.013435, 3.98, 3.490139, 3.433857, 3.27813, rep(NA, 5)),
              c(rep(NA, 11)))
#solution
bdep2 <- ifelse(is.na(bdep), 0, bdep)
bdep3 <- apply(bdep2, 2, sum)
bdep3 #the row you want?
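For this particular shape of data (exactly one non-NA value per column), the same collapse can be done in one step with colSums and na.rm = TRUE, avoiding the zero-filled intermediate; a sketch on the same sample values:

```r
bdep <- rbind(c(rep(NA, 6), 3.402293, 3.606941, 4.000461, 3.666381, 3.499614),
              c(3.805082, 4.013435, 3.98, 3.490139, 3.433857, 3.27813, rep(NA, 5)),
              c(rep(NA, 11)))

# Drop NAs while summing; since each column has one real value, the sum IS that value
combined <- colSums(bdep, na.rm = TRUE)
```

One caveat for both versions: a column that is entirely NA comes back as 0, not NA, so this only works when every column has at least one real value (or when 0 is an acceptable stand-in).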
I finally came to a solution by patching together code I found in other posts (especially on sequencing and for loops). I think this would be considered messy coding, so I'd welcome other solutions. This should describe better what I was trying to do in the OP, where I was generalizing too much. Specifically, I have 17 variables measured over 14 years (that's 238 columns), and something happened while generating these data such that the first 6 years of a variable landed in one row and the following 8 years in another row. Rather than re-running the model, I just wanted to combine the two rows into one.
Below are some sample data, simplified from my real scenario.
Create the data frame:
df <- data.frame(
VALUE = c(16, 16, 16),
COUNT = c(2431, 2610, 35419),
AREA = c(243100, 261000, 3541900),
n_dd_2000 = c(NA, 3.805, NA),
n_dd_2001 = c(3.402, NA, NA)
)
The next two lines establish the column positions to copy: start gives where each run of columns begins (here column 4, stepping by 1, with two starts in the first line and one start in the second), and len gives how many consecutive columns each run covers:
info <- data.frame(start=seq(4, by=1, length.out=2), len=rep(1,2))
info2 <- data.frame(start=seq(5, by=1, length.out=1), len=rep(1,2))
This is the code from my real dataset, where I started at column 4, repeated the pattern every 14 columns, 17 times, and took the first 6, then the next 8 columns:
info <- data.frame(start=seq(4, by=14, length.out=17), len=rep(c(6,8),17))
The two for loops below write the specified values in the sequence from row 2 and row 1 to row 3, respectively:
foo = sequence(info$len) + rep(info$start-1, info$len)
foo2 = sequence(info2$len) + rep(info2$start-1, info2$len)
for(n in 1:length(foo)){
df[3,foo[n]] <- df[2,foo[n]]
}
for(n in 1:length(foo2)){
df[3,foo2[n]] <- df[1,foo2[n]]
}
Then I removed the first two rows that I got those values from, and I'm left with one complete row with no NAs:
df <- df[-(1:2),]
I am trying to be lazier than ever with R and was wondering whether it is possible to drop columns from a data.frame using a condition.
For instance, let's say my data.frame has 50 columns.
I want to drop all the columns whose mean is 0, i.e. every column where
mean(mydata$coli) = ... = mean(mydata$coln) = 0
How would you write this code in order to drop them all at once? Up to now I have been dropping columns with
mydata2 <- subset(mydata, select = c(vari, ..., varn))
Obviously that is not ideal, because it requires checking the data manually.
Thank you all!
Something similar to #akrun's answer, using lapply:
mydata <- data.frame(col1=0, col2=1:7, col3=0, col4=-3:3)
mydata[lapply(mydata, mean)!=0]
# col2
#1 1
#2 2
#3 3
#4 4
#5 5
#6 6
#7 7
We can use colMeans to get the mean of all the columns as a vector, convert that to a logical index (!=0) and subset the dataset.
mydata[colMeans(mydata)!=0]
Or use Filter with f = mean. If the mean of a column is 0 it is coerced to FALSE, and any nonzero mean to TRUE, so the zero-mean columns are filtered out.
Filter(mean, mydata)
data
mydata <- data.frame(col1=0, col2=1:7, col3=0, col4=-3:3)
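A quick check of the colMeans approach on the sample data above:

```r
mydata <- data.frame(col1 = 0, col2 = 1:7, col3 = 0, col4 = -3:3)

# col1 and col3 are all zeros; col4 = -3:3 also has mean 0; only col2 survives
kept <- mydata[colMeans(mydata) != 0]
names(kept)
# [1] "col2"
```

Note that colMeans requires every column to be numeric; for a data frame with mixed types, compute the means per column with sapply instead.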
How to select columns that don't contain any NA values in R? As long as a column contains at least one NA, I want to exclude it. What's the best way to do it? I am trying to use sum(is.na(x)) to achieve this, but haven't been successful.
Also, another R question. Is it possible to use commands to exclude columns that contain all same values? For example,
column1 column2
row1 a b
row2 a c
row3 a c
My purpose is to exclude column1 from my matrix so the final result is:
column2
row1 b
row2 c
row3 c
Remove columns from dataframe where ALL values are NA deals with the case where ALL values are NA
For a matrix, you can use colSums(is.na(x)) to find out which columns contain NA values
given a matrix x
x[, !colSums(is.na(x)), drop = FALSE]
will subset appropriately.
For a data.frame, it will be more efficient to use lapply or sapply and the function anyNA
xdf[, sapply(xdf, Negate(anyNA)), drop = FALSE]
Also, could do
new.df <- df[, colSums(is.na(df)) == 0 ]
This lets you subset based on the number of NA values in each column.
Also if 'mat1' is the matrix:
indx <- unique(which(is.na(mat1), arr.ind=TRUE)[,2])
subset(mat1, select=-indx)
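The second part of the question (excluding columns where every value is the same) isn't covered by the answers above, so here is a sketch that handles both parts on a small data frame built from the question's example (column names assumed):

```r
xdf <- data.frame(column1 = c("a", "a", "a"),
                  column2 = c("b", "c", "c"),
                  column3 = c(1, NA, 3))

# Part 1: keep only columns containing no NA at all
no_na <- xdf[, sapply(xdf, function(col) !anyNA(col)), drop = FALSE]

# Part 2: drop columns where every value is identical
varied <- no_na[, sapply(no_na, function(col) length(unique(col)) > 1), drop = FALSE]
names(varied)
# [1] "column2"
```

Both steps use sapply to build a logical vector over the columns, so they work for data frames with mixed column types, where colSums-based tricks would fail.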
I have two dataframes. I need to check each element of a column in one dataframe against each element of the second dataframe and, when there is a match, copy something from a different column in the second dataframe back to another column in the first dataframe.
Here is some fake data to play with:
df1 <-data.frame(c("267119002","257051033",NA,"267098003","267099020","267047006"))
names(df1)[1]<-"ID"
df2 <-data.frame(c("257051033","267098003","267119002","267047006","267099020"))
names(df2)[1]<-"ID"
df2$vals <-c(11,22,33,44,55)
Basically, what I want to do is: for each ID in df1, check for the corresponding matching row in df2 and copy the value of df2$vals back to df1. Merge is not really an option because in the real data I need to repeat this for many columns, and multiple merges would make df1 stupidly big. I need to keep it lean! And df1 may contain NAs, in which case I want to place NA in the new column instead of a value.
You can use match:
df2[match(df1$ID,df2$ID),]
ID vals
3 267119002 33
1 257051033 11
NA <NA> NA
2 267098003 22
5 267099020 55
4 267047006 44
And if you want to remove the NA rows:
df2[na.omit(match(df1$ID,df2$ID)),]
ID vals
3 267119002 33
1 257051033 11
2 267098003 22
5 267099020 55
4 267047006 44
Ok, so thanks to agstudy's answer I was able to figure this out myself. This does exactly what I want!
fetcher <- function(x){
  y <- df2$vals[which(match(df2$ID, x) == TRUE)]
  return(y)
}
sapply(df1$ID,function(x) fetcher(x))
Thanks for the inspiration agstudy
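For reference, the whole lookup can also be done in one vectorized step, without sapply or a helper function, by assigning the matched values directly; since df2$ID contains no NA, an NA ID in df1 matches nothing and correctly comes back as NA. A sketch on the thread's fake data:

```r
df1 <- data.frame(ID = c("267119002", "257051033", NA,
                         "267098003", "267099020", "267047006"))
df2 <- data.frame(ID = c("257051033", "267098003", "267119002",
                         "267047006", "267099020"),
                  vals = c(11, 22, 33, 44, 55))

# match() returns the position of each df1 ID inside df2 (NA where no match),
# and indexing df2$vals with those positions aligns the values to df1's order
df1$vals <- df2$vals[match(df1$ID, df2$ID)]
df1$vals
# [1] 33 11 NA 22 55 44
```

This keeps df1 lean, as the question asked: one new column per lookup, no intermediate merge.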