I have something like this in my dataset, and I only want to delete a row if it has only NAs, not if it has at least one value.
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] NA NA NA
[5,] 4 8 NA
In this example they were able to delete what I want, but when I try to do it the exact same way, it doesn't work.
I've already tried their example:
data[rowSums(is.na(data)) != ncol(data), ]
But my number of rows doesn't change the way this one's does:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] 4 8 NA
My NAs are not characters; if I ask for their class:
class(NA)
[1] "logical"
Do you know another way to do this, please?
_____UPDATE_____
Maybe I said it wrong.
My problem, and the reason their code
mymat[rowSums(is.na(mymat)) != ncol(mymat), ]
is not working, is that I have 3 columns with information, but after that everything is NA, like this:
Date Product Code protein fat
2016-01-01 aaa 0001 NA NA
2016-01-01 bbb 0003 NA NA
2016-02-01 ccc 0032 NA NA
So the row is not entirely NAs, only everything after the 3rd column... But I want to remove the entire row (1:5).
Thank you!
First, I would coerce the matrix to a data frame, because this is the typical ("tidy") format for storing variables and observations. Then you could use the remove_empty_rows() function from the sjmisc package:
library(sjmisc)
df <- data.frame(
a = c(1, 1, 4, NA, 4),
b = c(2, NA, 6, NA, 8),
c = c(3, 4, 7, NA, NA)
)
# get row numbers of empty rows
empty_rows(df)
## [1] 4
# remove empty rows
remove_empty_rows(df)
## A tibble: 4 × 3
## a b c
## * <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 1 NA 4
## 3 4 6 7
## 4 4 8 NA
There are also functions for columns: empty_cols() and remove_empty_cols().
If you just want to keep complete cases (rows), i.e. drop every row that contains any NA at all, use complete.cases():
df[complete.cases(df), ]
## a b c
## 1 1 2 3
## 3 4 6 7
Check whether this works with the updated explanation. It subsets the data.frame so that the information columns are ignored when checking for NA. I added some additional rows that contain a mix of numbers and NAs.
df1 <- data.frame(Date=c("2016-01-01", "2016-01-01", "2016-02-01", "2016-03-01", "2016-03-01"),
Product=c("aaa", "bbb", "ccc", "ddd", "eee"),
Code=c("0001", "0003", "0032", "0005", "0007"),
protein=c(NA, NA, NA, 5, NA),
fat=c(NA, NA, NA, NA, 4))
# place any columns you do not want to check for NA in names.info
names.info <- c("Date", "Product", "Code")
names.check <- setdiff(names(df1), names.info)
df1[rowSums(is.na(df1[, names.check])) != length(names.check), ]
Date Product Code protein fat
4 2016-03-01 ddd 0005 5 NA
5 2016-03-01 eee 0007 NA 4
You need to delete the as.integer.
mymat <- matrix(c(1:3, NA, 4:6, NA, rep(NA, 4)), ncol = 3)
Which translates to
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 2 5 NA
[3,] 3 6 NA
[4,] NA NA NA
mymat[as.integer(rowSums(is.na(mymat)) != ncol(mymat)), ]
Gives you
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 1 4 NA
[3,] 1 4 NA
But you want
mymat[rowSums(is.na(mymat)) != ncol(mymat), ]
To get
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 2 5 NA
[3,] 3 6 NA
Cheers,
Marc
I need to load social network data where each user has an unknown and potentially large number of friends, stored as a text file of the following format:
UserId: FriendId1, FriendId2, ...
1: 12, 33
2:
3: 4, 6, 10, 15, 16
into a two-column data.frame:
UserId FriendId
1 1 12
2 1 33
3 3 4
4 3 6
5 3 10
6 3 15
7 3 16
How would you do that in R?
Reading, filling, and then reshaping is inefficient, as it requires keeping in memory many columns full of NAs.
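For reference, here is a minimal base-R sketch that parses the lines directly and never materializes NA-filled columns (my addition; the file name and all variable names are assumptions, and it assumes the file contains only data lines like those shown above, with no header):
lines <- readLines("friends.txt")            # hypothetical file name
parts <- strsplit(lines, ":", fixed = TRUE)  # user id before the colon, friends after
ids <- trimws(vapply(parts, `[`, "", 1))
friends <- lapply(parts, function(p) {
  if (length(p) < 2 || !nzchar(trimws(p[2]))) return(character(0))
  trimws(strsplit(p[2], ",", fixed = TRUE)[[1]])
})
n <- lengths(friends)                        # friends per user; 0 drops the user
data.frame(UserId = as.numeric(rep(ids, n)),
           FriendId = as.numeric(unlist(friends)))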
If you really have a colon as a delimiter, then just use read.table with header = FALSE to get your data into R, then consider using cSplit from my "splitstackshape" package.
mydf <- read.table("test.txt", sep = ":", header = FALSE)
mydf
## V1 V2
## 1 1 12, 33
## 2 2
## 3 3 4, 6, 10, 15, 16
library(splitstackshape)
cSplit(mydf, "V2", ",", "long")
## V1 V2
## 1: 1 12
## 2: 1 33
## 3: 3 4
## 4: 3 6
## 5: 3 10
## 6: 3 15
## 7: 3 16
This reads the lines, then parses them one by one into two-column matrices. It produces character values (since lines of text are just characters), but it's trivial to coerce them to numeric:
rLines <- readLines("test.txt")  # read the raw lines first (file name as in the answer above)
do.call(rbind, sapply(rLines, function(L) {
  n <- sub(":.+", "", L)                              # the user id before the colon
  items <- scan(text = sub(".+:", "", L), sep = ",")  # the friend ids after it
  matrix(c(rep(n, length(items)), items), ncol = 2)   # empty friend lists give 0 rows
}))
#---------
[,1] [,2]
[1,] "1" "12"
[2,] "1" "33"
[3,] "3" "4"
[4,] "3" "6"
[5,] "3" "10"
[6,] "3" "15"
[7,] "3" "16"
If the path forward isn't obvious, see ?as.numeric and ?as.data.frame.
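For completeness, that coercion might look like this (a sketch; res is my name for the character matrix produced by the do.call(rbind, ...) call above):
df <- as.data.frame(res, stringsAsFactors = FALSE)
names(df) <- c("UserId", "FriendId")
df[] <- lapply(df, as.numeric)  # both columns from character to numeric
df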
How can I rewrite this function as a vectorized variant? As far as I know, using loops is not good practice in R:
# replaces rows that contain all NAs with the value from the previous row's k-th column
na.replace <- function(x, k) {
for (i in 2:nrow(x)) {
if (!all(is.na(x[i - 1, ])) && all(is.na(x[i, ]))) {
x[i, ] <- x[i - 1, k]
}
}
x
}
This is the input data and the data returned by the function:
m <- cbind(c(NA,NA,1,2,NA,NA,NA,6,7,8), c(NA,NA,2,3,NA,NA,NA,7,8,9))
m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] NA NA
[6,] NA NA
[7,] NA NA
[8,] 6 7
[9,] 7 8
[10,] 8 9
na.replace(m, 2)
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
Here is a solution using na.locf in the zoo package. row.na is a vector with one component per row of m such that a component is TRUE if the corresponding row of m is all NA and FALSE otherwise. We then set all elements of such rows to the result of applying na.locf to column 2.
At the expense of a bit of speed, the lines ending with ## could be replaced with row.na <- apply(is.na(m), 1, all), which is a bit more readable.
If we knew that whenever a row has an NA in column 2 the entire row is NA, as in the question, then the lines ending in ## could be reduced to just row.na <- is.na(m[, 2]).
library(zoo)
nr <- nrow(m) ##
nc <- ncol(m) ##
row.na <- .rowSums(is.na(m), nr, nc) == nc ##
m[row.na, ] <- na.locf(m[, 2], na.rm = FALSE)[row.na]
The result is:
> m
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 3 3
[6,] 3 3
[7,] 3 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
REVISED: Some revisions to improve speed, as in the comments below. Also added alternatives in the discussion.
Notice that, unless you have a pathological condition where the first row is all NA (in which case you're screwed anyway), you don't need to check whether all(is.na(x[i - 1, ])) is TRUE or FALSE, because on the previous pass through the loop you already "fixed" row i - 1.
Further, all you care about is that the designated k-th value is not NA. The rest of the row doesn't matter.
BUT: The k-th value always "falls through" from the top, so perhaps you should:
1) Treat the k-th column as a vector, e.g. c(NA, 1, NA, NA, 3, NA, 4, NA, NA), and "fill down" all numeric values. That's been done many times in SO questions.
2) Fill every row which is entirely NA except for column k with that same value.
I think that's still best done using either a loop or apply; see the sketch after this answer.
You probably need to clarify whether some rows have both numeric and NA values, which your example fails to include. If that's the case, then things get trickier.
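Here is a minimal sketch of that two-step idea (all names here are mine, not the answer's; the fill-down happens to be vectorized, but a loop or apply would do the same job):
fill_down <- function(v) {
  # index of the most recent non-NA entry at each position (0 = none yet)
  idx <- cummax((!is.na(v)) * seq_along(v))
  v[replace(idx, idx == 0, NA)]  # an NA index yields an NA result
}

na.replace2 <- function(x, k) {
  filled <- fill_down(x[, k])              # 1) fill down column k
  all.na <- rowSums(is.na(x)) == ncol(x)   # 2) rows that are entirely NA...
  x[all.na, ] <- filled[all.na]            #    ...get that filled k-th value
  x
}

na.replace2(m, 2)  # reproduces the expected output from the question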
The most important part of this answer is getting the grouping you want, which is:
groups = cumsum(rowSums(is.na(m)) != ncol(m))
groups
#[1] 0 0 1 2 2 2 2 3 4 5
Once you have that the rest is just doing your desired operation by group, e.g.:
library(data.table)
dt = as.data.table(m)
k = 2
cond = rowSums(is.na(m)) != ncol(m)          # TRUE for rows with at least one non-NA value
dt[, (k) := .SD[[k]][1], by = cumsum(cond)]  # fill column k within each run from its first row
dt[!cond, names(dt) := .SD[[k]]]             # in all-NA rows, copy column k across every column
dt
# V1 V2
# 1: NA NA
# 2: NA NA
# 3: 1 2
# 4: 2 3
# 5: 3 3
# 6: 3 3
# 7: 3 3
# 8: 6 7
# 9: 7 8
#10: 8 9
Here is another base only vectorized approach:
na.replace <- function(x, k) {
is.all.na <- rowSums(is.na(x)) == ncol(x)
ref.idx <- cummax((!is.all.na) * seq_len(nrow(x)))
ref.idx[ref.idx == 0] <- NA
x[is.all.na, ] <- x[ref.idx[is.all.na], k]
x
}
And for a fair comparison with @Eldar's solution, replace is.all.na with is.all.na <- is.na(x[, k]).
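As a quick check (my addition), calling it on the example matrix m from the question reproduces the expected output:
na.replace(m, 2)  # rows 5-7 become 3 3; rows 1-2 stay NA (no earlier value)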
Finally I arrived at my own version of a vectorized solution, and it works as expected. Any comments and suggestions are welcome :)
# Last Observation Move Forward
# works as na.locf but much faster and accepts only 1D structures
na.lomf <- function(object, na.rm = FALSE) {
idx <- which(!is.na(object))
if (!na.rm && is.na(object[1])) idx <- c(1, idx)
rep.int(object[idx], diff(c(idx, length(object) + 1)))
}
na.replace <- function(x, k) {
v <- x[, k]
i <- which(is.na(v))
r <- na.lomf(v)
x[i, ] <- r[i]
x
}
Here's a workaround with the na.locf function from zoo. Note that it carries the whole previous row forward, so rows 5-7 below become 2 3 rather than the 3 3 shown in the question:
m[na.locf(ifelse(apply(m, 1, function(x) all(is.na(x))), NA, 1:nrow(m)), na.rm = FALSE), ]
[,1] [,2]
[1,] NA NA
[2,] NA NA
[3,] 1 2
[4,] 2 3
[5,] 2 3
[6,] 2 3
[7,] 2 3
[8,] 6 7
[9,] 7 8
[10,] 8 9
I am looking for a more versatile way to get from a data.frame to a multidimensional array.
I would like to be able to create as many dimensions as needed from as many variables in the data frame as desired.
Currently, the method has to be tailored to each data.frame and requires subsetting to form a vector.
I would love something along the lines of the melt/cast methods in reshape2.
data<-data.frame(coord.name=rep(1:10, 2),
x=rnorm(20),
y=rnorm(20),
ID=rep(c("A","B"), each=10))
data.array<-array(dim=c(10, 2, length(unique(data$ID))))
for(i in 1:length(unique(data$ID))){
data.array[,1,i]<-data[data$ID==unique(data$ID)[i],"x"]
data.array[,2,i]<-data[data$ID==unique(data$ID)[i],"y"]
}
data.array
, , 1
[,1] [,2]
[1,] 1 1
[2,] 3 3
[3,] 5 5
[4,] 7 7
[5,] 9 9
[6,] 1 1
[7,] 3 3
[8,] 5 5
[9,] 7 7
[10,] 9 9
, , 2
[,1] [,2]
[1,] 2 2
[2,] 4 4
[3,] 6 6
[4,] 8 8
[5,] 10 10
[6,] 2 2
[7,] 4 4
[8,] 6 6
[9,] 8 8
[10,] 10 10
You may have had trouble applying the reshape2 functions for a somewhat subtle reason. The difficulty was that your data.frame has no column that can be used to direct how you want to arrange the elements along the first dimension of an output array.
Below, I explicitly add such a column, calling it "row". With it in place, you can use the expressive acast() or dcast() functions to reshape the data in any way you choose.
library(reshape2)
# Use this or some other method to add a column of row indices.
# (ID == ID is all TRUE, so cumsum() counts 1, 2, 3, ... within each ID group.)
data$row <- with(data, ave(ID==ID, ID, FUN = cumsum))
m <- melt(data, id.vars = c("row", "ID"))
a <- acast(m, row ~ variable ~ ID)
a[1:3, , ]
# , , A
#
# x y
# 1 1 1
# 2 3 3
# 3 5 5
#
# , , B
#
# x y
# 1 2 2
# 2 4 4
# 3 6 6
I think this is right:
array(unlist(lapply(split(data, data$ID),
                    function(x) as.matrix(x[, c("x", "y")]))),
      dim = c(10, 2, 2))
Say we have the following data frame:
> df
A B C
1 1 2 3
2 4 5 6
3 7 8 9
We can select column 'B' by its index:
> df[,2]
[1] 2 5 8
Is there a way to get the index (2) from the column label ('B')?
You can get the index via grep and colnames:
grep("B", colnames(df))
[1] 2
or use
grep("^B$", colnames(df))
[1] 2
to match only the columns named exactly "B", excluding those that merely contain a "B", e.g. "ABC".
The following will do it:
which(colnames(df)=="B")
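With the example df above, this returns:
[1] 2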
I wanted to see all the indices for the colnames because I needed to do a complicated column rearrangement, so I printed the colnames as a data frame. The row names are the indices.
as.data.frame(colnames(df))
  colnames(df)
1            A
2            B
3            C
Following on from chimeric's answer above:
To get ALL the column indices in the df, I used:
which(!names(df) %in% c())
or stored them:
indexLst <- which(!names(df) %in% c())
This seems to be an efficient way to list variables with their column numbers:
cbind(names(df))
Output:
[,1]
[1,] "A"
[2,] "B"
[3,] "C"
Sometimes I like to copy variables with their positions into my code, so I use this function:
varnums <- function(x) {
  w <- as.data.frame(c(1:length(colnames(x))), paste0('# ', colnames(x)))
  names(w) <- c("# Var/Pos")
  w
}
varnums(df)
Output:
# Var/Pos
# A 1
# B 2
# C 3
match("B", names(df))
It can also work if you have a vector of names.
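For example, with the df above (a small illustration I added):
match(c("A", "C"), names(df))
[1] 1 3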
To generalize @NPE's answer slightly:
which(colnames(dat) %in% var)
where var is a vector of the form
c("colname1", "colname2", ..., "colnamen")
This returns the indices of whichever column names one needs.
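One caveat worth noting (my addition): unlike match(), this returns the indices in the order the columns appear, not in the order given in var:
which(colnames(df) %in% c("C", "A"))
[1] 1 3
match(c("C", "A"), colnames(df))
[1] 3 1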
Use the t function:
t(colnames(df))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "var1" "var2" "var3" "var4" "var5" "var6"
Here is an answer that generalizes Henrik's answer.
df <- data.frame(A=rnorm(100), B=rnorm(100), C=rnorm(100))
numeric_columns <- c('A', 'B', 'C')
numeric_index <- sapply(1:length(numeric_columns), function(i)
  grep(numeric_columns[i], colnames(df)))
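One caveat (my addition): grep does partial matching, so a column named "AB" would also match the pattern "A". Anchoring the patterns restricts matches to whole column names:
numeric_index <- sapply(numeric_columns, function(nm)
  grep(paste0("^", nm, "$"), colnames(df)))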
#I wanted the column index instead of the column name. This line of code worked for me:
which(data.frame(colnames(datE)) == colnames(datE[c(1:15)]), arr.ind = TRUE)[, 1]
# with datE being a regular data frame with 15 columns (variables)
data.frame(colnames(datE))
#> colnames.datE.
#> 1 Ce
#> 2 Eu
#> 3 La
#> 4 Pr
#> 5 Nd
#> 6 Sm
#> 7 Gd
#> 8 Tb
#> 9 Dy
#> 10 Ho
#> 11 Er
#> 12 Y
#> 13 Tm
#> 14 Yb
#> 15 Lu
which(data.frame(colnames(datE))==colnames(datE[c(1:15)]),arr.ind=T)[,1]
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15