I have the following data (with a variable number of columns):
> df1<-data.frame(F1=c(1,5,"NA",9),F2=c(2,5,"a","NA"),F3=c(1,"NA","o","NA"))
> df1
F1 F2 F3
1 1 2 1
2 5 5 NA
3 NA a o
4 9 NA NA
and I want to remove the NA cells from each row, shifting the remaining values left so that each row keeps only the cells that contain information:
> df2
F1 F2 F3
1 1 2 1
2 5 5
3 a o
4 9
Thanks!
Firstly, you can use the following code to move all non-NA cells to the left:
df1 <- data.frame(F1=c(1,5,NA,9),F2=c(2,5,"a",NA),F3=c(1,NA,"o",NA))
df1 <- as.data.frame(t(apply(df1,1, function(x) { return(c(x[!is.na(x)],x[is.na(x)]) )} )))
colnames(df1) <- c("F1", "F2", "F3")
Output:
> print(df1)
F1 F2 F3
1 1 2 1
2 5 5 <NA>
3 a o <NA>
4 9 <NA> <NA>
Secondly, to display blank cells instead of the NA values, you could try:
df1 <- sapply(df1, as.character)
df1[is.na(df1)] <- " "
df1 <- as.data.frame(df1)
Output:
> print(df1)
F1 F2 F3
1 1 2 1
2 5 5
3 a o
4 9
Note: I changed your string "NA" to a real NA so the missing observations can be detected with is.na(). I'm not sure whether you actually want the missing values stored as strings.
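If your data really does arrive with the literal string "NA" (as in your original df1), a minimal sketch to turn those into real NAs first, assuming plain character columns, would be:
df1[df1 == "NA"] <- NA                            # replace the literal "NA" strings with real NA
df1[] <- lapply(df1, type.convert, as.is = TRUE)  # restore numeric columns where possible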
We can try the code below:
df1[] <- t(apply(
  df1,
  1,
  function(v) {
    v[order(v == "NA")]
  }
))
which gives
> df1
F1 F2 F3
1 1 2 1
2 5 5 NA
3 a o NA
4 9 NA NA
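This works because order() performs a stable sort and FALSE sorts before TRUE, so the non-"NA" entries keep their relative order while the "NA" entries are pushed to the end. A quick illustration:
v <- c("NA", "5", "NA", "9")
order(v == "NA")     # 2 4 1 3 -- non-"NA" positions first, original order kept
v[order(v == "NA")]  # "5" "9" "NA" "NA"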
Matching your desired output, I pad each row with "" after dropping the "NA" values, but you could just as well pad with real NA values instead (see the sketch after the output below):
library(dplyr)
library(purrr)
df1 %>%
  pmap_dfr(~ {x <- c(...)[c(...) != "NA"]
              setNames(c(x, rep("", ncol(df1) - length(x))),
                       names(df1))})
# A tibble: 4 x 3
F1 F2 F3
<chr> <chr> <chr>
1 1 "2" "1"
2 5 "5" ""
3 a "o" ""
4 9 "" ""
First convert your data frame to character (otherwise mixing numeric and character values is an issue), then apply over the rows, shift the values, and pad with "NA":
df2 <- sapply(df1, as.character)
t(
  sapply(1:nrow(df2), function(i) {
    tmp <- df2[i, df2[i, ] != "NA"]
    if (length(tmp) < ncol(df2)) {
      tmp <- c(tmp, rep("NA", ncol(df2) - length(tmp)))
    }
    tmp
  })
)
F1 F2 F3
[1,] "1" "2" "1"
[2,] "5" "5" "NA"
[3,] "a" "o" "NA"
[4,] "9" "NA" "NA"
I have a nested list of data frames. In those data frames I have NA variables (vectors now?). I want to remove those elements.
EDIT: actually I have NULL instead of NA.
df.ls <- list(list(id = NULL, x = 3, works = NULL),
              list(id = 2, x = 4, works = NULL),
              NULL)
I tried this code, but I don't know how to tell it which level to use.
df.ls[sapply(df.ls, is.null)] <- NULL
For NULL values we can do
l1 <- lapply(df.ls, function(x) x[lengths(x) > 0])
For NAs we can do
l1 <- lapply(df.ls, function(x) x[!is.na(x)])
l1
#[[1]]
#[[1]]$x
#[1] 3
#[[2]]
#[[2]]$id
#[1] 2
#[[2]]$x
#[1] 4
#[[3]]
#list()
If you want to remove the empty list, you can do
l1[lengths(l1) > 0]
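The same can be written with Filter(), using length as the predicate to keep only the non-empty elements:
Filter(length, l1)  # drops elements of length 0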
I am not sure what you are trying to do, since you say you have a list of data.frames but the example you provide is only a list of lists with elements of length one.
Let's assume you have a list of data.frames, which in turn contain vectors of length > 1, and you want to drop all columns that contain only NAs.
df.ls <- list(data.frame(id = c(NA,NA,NA),
                         x = c(NA,3,5),
                         works = c(4,5,NA)),
              data.frame(id = c("a","b","c"),
                         x = c(NA,3,5),
                         works = c(NA,NA,NA)),
              data.frame(id = c("e","d",NA),
                         x = c(NA,3,5),
                         works = c(4,5,NA)))
[[1]]
id x works
1 NA NA 4
2 NA 3 5
3 NA 5 NA
[[2]]
id x works
1 a NA NA
2 b 3 NA
3 c 5 NA
[[3]]
id x works
1 e NA 4
2 d 3 5
3 <NA> 5 NA
Then this approach will work:
library(dplyr)
library(purrr)
non_empty_col <- function(x) {
  sum(is.na(x)) != length(x)
}
map(df.ls, ~ .x %>% select_if(non_empty_col))
Which returns your list of data.frames without columns that contain only NA.
[[1]]
x works
1 NA 4
2 3 5
3 5 NA
[[2]]
id x
1 a NA
2 b 3
3 c 5
[[3]]
id x works
1 e NA 4
2 d 3 5
3 <NA> 5 NA
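As a side note, select_if() is superseded in dplyr 1.0.0 and later; assuming a recent dplyr, the same idea can be written with select() plus the where() helper:
library(dplyr)
library(purrr)
# keep only the columns that are not entirely NA
map(df.ls, function(d) select(d, where(function(col) !all(is.na(col)))))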
If, however, you prefer your list to have only complete cases in each data.frame (rows with no NAs), then the following code will work.
library(dplyr)
map(df.ls, ~ .x[complete.cases(.x), ])
With my example data, this leaves you with only row 2 of data.frame 3.
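For reference, tidyr::drop_na() with no column arguments does the same row-wise filtering:
library(purrr)
library(tidyr)
map(df.ls, drop_na)  # keeps only rows with no NA in any column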
To remove the NULL
discard(map(df.ls, ~ discard(.x, is.null)), is.null)
#[[1]]
#[[1]]$x
#[1] 3
#[[2]]
#[[2]]$id
#[1] 2
#[[2]]$x
#[1] 4
Or in base R with Filter and is.null
Filter(Negate(is.null), lapply(df.ls, function(x) Filter(Negate(is.null), x)))
Earlier version before the OP's update
library(purrr)
map(df.ls, ~ .x[!is.na(.x)])
#[[1]]
#[[1]]$x
#[1] 3
#[[2]]
#[[2]]$id
#[1] 2
#[[2]]$x
#[1] 4
#[[3]]
#list()
my data is like this:
> head(df)
ETDPAT04 ETDPAT06 ETDPAT08 ETDPAT12
1: 2 . 3 3
2: 12 12 . 14
3: 6 5 6 7
4: 1 1 1 1
5: 1 3 3 2
6: 3 3 2 4
...
How do I return all rows where the value in any of those columns is more than 61?
I tried to do this:
a=df[apply(df, 1, function(row) {any(row > 61)}),]
What I got does not satisfy the condition described above. I got this:
> head(a)
ETDPAT04 ETDPAT06 ETDPAT08 ETDPAT12
1: 6 5 6 7
2: 6 6 7 8
3: 8 3 6 4
...
There is no value in those columns of my dataframe that is more than 61, so I should get zero results.
colMax <- function(df) sapply(df, max, na.rm = TRUE)
colMax(df)
ETDPAT04 ETDPAT06 ETDPAT08 ETDPAT12
"9" "9" "9" "9"
Also:
> sapply(df, class)
ETDPAT04 ETDPAT06 ETDPAT08 ETDPAT12
"character" "character" "character" "character"
I got df from:
t=data.table::fread("phs000086.v3.pht000279.v1.DS-T1D-IRB.txt", header=TRUE,na.strings = ".")
colnames(t) <- as.character(t[1,])
t <- t[2:nrow(t),]
df=select(t, ETDPAT04, ETDPAT06,ETDPAT08,ETDPAT12)
df <- sapply( df, as.numeric )
a=df[apply(df, 1, function(row) {any(row > 61)}),]
dim(a)
44 4
head(a)
ETDPAT04 ETDPAT06 ETDPAT08 ETDPAT12
[1,] NA NA NA NA
[2,] NA NA NA NA
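I suspect the all-NA rows come from how any() and logical subscripting treat NA: after as.numeric(), the "." entries (and the leftover header row) become NA, any(row > 61) then returns NA rather than FALSE, and indexing with an NA subscript yields a row of NAs. A minimal illustration, plus two common fixes:
any(c(NA, 5) > 61)                  # NA, not FALSE
data.frame(x = 1:2)[c(NA, TRUE), ]  # first returned row is all NA
a <- df[apply(df, 1, function(row) any(row > 61, na.rm = TRUE)), ]  # ignore NAs
a <- df[which(apply(df, 1, function(row) any(row > 61))), ]         # which() drops NAs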
My original .txt data looks like this:
phv00033517.v1.p1.c1 phv00033518.v1.p1.c1 phv00033519.v1.p1.c1
1: PHASE AGE ADULT
2: 2 17 0
3: 2 29 1
4: 2 35 1
5: 2 14 0
I wanted to remove the first row and make the second row the header, so my column names become PHASE, AGE, ...
I also tried to do this in a more basic way, but still no solution:
library(dplyr)
d <- read.table("phs000086.v3.pht000279.v1.p1.c1.DCCT_ms2exprt.DS-T1D-IRB.txt", header = FALSE)
write.table(d,"phen2", quote=F,sep = " ",row.names = F,col.names=F)
d1=read.table("phen2", header=TRUE)
d2=select(d1,AGE, FEMALE,HBAEL,ETDPAT00, ETDPAT02, ETDPAT04, ETDPAT06, ETDPAT08, ETDPAT10, ETDPAT12)
d2[d2=="."]<-NA
asNumeric <- function(x) as.numeric(as.character(x))
factorsNumeric <- function(d) modifyList(d, lapply(d[, sapply(d, is.factor)],
                                                   asNumeric))
f <- factorsNumeric(d2)
f[4:9] <- lapply(f[4:9], as.integer)
a=f[apply(t(f[,4:10]>61),1, any), ]
I'm getting a dataframe a with 800-odd rows all filled with NA, while I am trying to find any column where any value is > 61.
The same happens if I look for any row where a value is > 61; I get 77 rows of all NAs:
a=f[apply(t(f[,4:10]>61),2, any), ]
sapply(f, class)
AGE FEMALE HBAEL ETDPAT00 ETDPAT02 ETDPAT04 ETDPAT06 ETDPAT08
"integer" "integer" "numeric" "integer" "integer" "integer" "integer" "integer"
ETDPAT10 ETDPAT12
"integer" "integer"
I am completely stuck here. Can anyone provide any help? Do I need to give more info about my data?
Assuming DT is as shown reproducibly in the Note at the end, make the first row the header and convert the columns to numeric. Then select the rows as indicated:
DT <- fread(paste(do.call("paste", DT), collapse = "\n"), na.strings = ".")
DT[apply(DT > 61, 1, any), ]
## Empty data.table (0 rows) of 4 cols: ETDPAT04,ETDPAT06,ETDPAT08,ETDPAT12
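An equivalent filter that is NA-safe for both data.frames and data.tables is a sketch along these lines, counting the offending entries per row with rowSums():
DT[rowSums(DT > 61, na.rm = TRUE) > 0, ]  # rows with at least one value > 61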
Note
Lines <- "
ETDPAT04 ETDPAT06 ETDPAT08 ETDPAT12
2 . 3 3
12 12 . 14
6 5 6 7
1 1 1 1
1 3 3 2
3 3 2 4"
library(data.table)
DT <- fread(Lines, colClasses = "character", header = FALSE)
I have a list of dataframes.
I would like to check every column name of the dataframes. If a column name is missing, I want to add that column to the dataframe and fill it with NA values.
Dummy data:
d1 <- data.frame(a=1:2, b=2:3, c=4:5)
d2 <- data.frame(a=1:2, b=2:3)
l<-list(d1, d2)
# Check the columns names of the dataframes
# If column is missing, add new column, add NA as values
lapply(l, function(x) if(!("c" %in% colnames(x)))
  {
    c <- rep(NA, nrow(x))
    cbind(x, c) # does not work!
  })
What I get:
[[1]]
NULL
[[2]]
a b c
1 1 2 NA
2 2 3 NA
What I want instead:
[[1]]
a b c
1 1 2 4
2 2 3 5
[[2]]
a b c
1 1 2 NA
2 2 3 NA
Thanks for your help!
You could use dplyr::mutate with an if/else (a vectorised ifelse() here would recycle only the first value of c):
library(dplyr)
lapply(l, function(x) mutate(x, c = if ("c" %in% names(x)) c else NA))
[[1]]
a b c
1 1 2 4
2 2 3 5
[[2]]
a b c
1 1 2 NA
2 2 3 NA
You have some good answers, but if you want to stick to base R:
lapply(l, function(x)
  if (!("c" %in% colnames(x))) {
    c <- rep(NA, nrow(x))
    cbind(x, c)
  } else {
    x
  }
)
Your code was returning NULL for the first df because you had no else statement to handle the case of c already existing (i.e. FALSE in the if statement).
One way is to use dplyr::bind_rows to bind data.frames in the list and fill entries from missing columns with NA, and then split the resulting data.frame again to produce a list of data.frames:
df <- dplyr::bind_rows(l, .id = "id")
lapply(split(df, df$id), function(x) x[, -1])
#$`1`
# a b c
#1 1 2 4
#2 2 3 5
#
#$`2`
# a b c
#3 1 2 NA
#4 2 3 NA
Or the same as a tidyverse/magrittr chain
bind_rows(l, .id = "id") %>% split(., .$id) %>% lapply(function(x) x[, -1])
library(purrr)
map(l, ~{if(!length(.x$c)) .x$c <- NA; .x})
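If more than one column can be missing, a more general base R sketch (the cols vector of expected names is an assumption here) is:
cols <- c("a", "b", "c")  # hypothetical full set of expected column names
lapply(l, function(x) {
  miss <- setdiff(cols, names(x))
  if (length(miss)) x[miss] <- NA  # add any missing columns, filled with NA
  x[cols]                          # return the columns in a consistent order
})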
I have a list of lists similar to this sample:
z <- list(list(num1=list((list(tab1=list(list(a=1, b=2, c=5), list(a=3, b=4), list(d=4,e=7)))))),list(num2=list((list(tab2=list(list(a=1, b=2), list(a=3, b=4)))))))
I would like to extract the names at the deepest level of the nested lists.
Desired output, either as a list (since the entries differ in length) or as a dataframe with one column per element of the main list:
[1] a b c a b d e
[2] a b a b
dataframe:
column1 column2
a a
b b
c a
a b
b ""
d ""
e ""
I have tried various combinations of sapply(z, "[[", c("a","b"...) but failed, since the sublist names vary.
EDIT: Sorry, I needed the actual values, not the names of the last node (the letters)! Additionally, each numeric value has a column name, which is not set in the example above; it is like this:
[[1]]$num1[[1]]$tab1[[1]]$a
Name
1
So the desired solution are values:
[1]
1 2 5 3 4 4 7
[2]
1 2 3 4
I would actually need the numeric values instead of the letters. If you could adjust your solution to this I would be grateful. Thanks.
Try
lapply(z, function(x) as.numeric(unlist(x)))
## [[1]]
## [1] 1 2 5 3 4 4 7
##
## [[2]]
## [1] 1 2 3 4
z1 <- lapply(z, function(x) names(unlist(x)))
z1 <- lapply(z1, function(x) gsub(".*\\.", "", x))
n <- max(sapply(z1, length))
z1 <- lapply(z1, `length<-`, value = n)
setNames(as.data.frame(z1), paste0("Column", seq_along(z1)))
# Column1 Column2
#1 a a
#2 b b
#3 c a
#4 a b
#5 b <NA>
#6 d <NA>
#7 e <NA>
A bit far-fetched and anything but elegant, here is a way to get what you want:
lista<-unlist(lapply(strsplit(names(unlist(z)),"\\."),function(vec) vec[3]))
names(lista)<-unlist(lapply(strsplit(names(unlist(z)),"\\."),function(vec) vec[1]))
uninames<-unique(names(lista))
res<-sapply(uninames,function(x,vec){vec[names(vec)==x]},lista)
> res
$num1
num1 num1 num1 num1 num1 num1 num1
"a" "b" "c" "a" "b" "d" "e"
$num2
num2 num2 num2 num2
"a" "b" "a" "b"
UPDATE
To get the numbers:
a <- unlist(z)
b <- sub("\\..*", "", names(a))  # keep only the top-level name, e.g. "num1"
res <- sapply(unique(b), function(name, vec, l_name) {vec[l_name == name]}, a, b)
> res
$num1
num1.tab1.a num1.tab1.b num1.tab1.c num1.tab1.a num1.tab1.b num1.tab1.d num1.tab1.e
1 2 5 3 4 4 7
$num2
num2.tab2.a num2.tab2.b num2.tab2.a num2.tab2.b
1 2 3 4
I am trying to combine two dataframes with different number of columns and column headers. However, after I combine them using rbind.fill(), the resulting file has filled the empty cells with NA.
This is very inconvenient since one of the columns has data that is also represented as "NA" (for North America), so when I export it to a csv, the spreadsheet can't tell them apart.
Is there a way for me to:
Use the rbind.fill function without having it populate the empty cells with NA
or
Change the column to replace the NA values*
*I've scoured the blogs, and have tried the two most popular solutions:
df$col[is.na(df$col)] <- 0                    # it does not work
df$col <- ifelse(is.na(df$col), "X", df$col)  # it changes all the characters to numbers, and ruins the column
Let me know if you have any advice! I (unfortunately) cannot share the df, but will be willing to answer any questions!
NA is not the same as "NA" to R, but might be interpreted as such by your favourite spreadsheet program. NA is a special value in R just like NaN (not a number). If I understand correctly, one of your solutions is to replace the "NA" values in the column representing North America with something else, in which case you should just be able to do...
df$col[ df$col == "NA" ] <- "NorthAmerica"
This is assuming that your "NA" values are actually character strings. is.na() returns FALSE for the string "NA", which is why df$col[ is.na(df$col) ] <- 0 won't work.
An example of the difference between NA and "NA":
x <- c( 1, 2, 3 , "NA" , 4 , 5 , NA )
> x[ !is.na(x) ]
[1] "1" "2" "3" "NA" "4" "5"
> x[ x == "NA" & !is.na(x) ]
[1] "NA"
Method to resolve this
I think you want to leave "NA" and any real NAs as they are in the first df, but change the NAs that rbind.fill() introduces in the second df's columns to something like "NotAvailable". You can accomplish this like so...
library(plyr)   # provides rbind.fill
df1 <- data.frame( col = rep( "NA" , 6 ) , x = 1:6 , z = rep( 1 , 6 ) )
df2 <- data.frame( col = rep( "SA" , 2 ) , x = 1:2 , y = 5:6 )
df <- rbind.fill( df1 , df2 )
temp <- df [ (colnames(df) %in% colnames(df2)) ]
temp[ is.na( temp ) ] <- "NotAvailable"
res <- cbind( temp , df[ !( colnames(df) %in% colnames(df2) ) ] )
#df has real NA values in column z and column y. We just want to get rid of y's
df
# col x z y
# 1 NA 1 1 NA
# 2 NA 2 1 NA
# 3 NA 3 1 NA
# 4 NA 4 1 NA
# 5 NA 5 1 NA
# 6 NA 6 1 NA
# 7 SA 1 NA 5
# 8 SA 2 NA 6
#res has "NA" strings in col representing "North America" and NA values in z, whilst those in y have been removed
#More generally, any NA in df1 will be left 'as-is', whilst NA from df2 formed using rbind.fill will be converted to character string "NotAvailable"
res
# col x y z
# 1 NA 1 NotAvailable 1
# 2 NA 2 NotAvailable 1
# 3 NA 3 NotAvailable 1
# 4 NA 4 NotAvailable 1
# 5 NA 5 NotAvailable 1
# 6 NA 6 NotAvailable 1
# 7 SA 1 5 NA
# 8 SA 2 6 NA
If you have a dataframe that contains NAs and you want to replace them all, you can do something like:
df[is.na(df)] <- -999
This will take care of all NAs in one shot.
If you only want to act on a single column, you can do something like:
df$col[which(is.na(df$col))] <- -999
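As for why the ifelse() attempt "changed all the characters to numbers": most likely the column is a factor, and ifelse() strips attributes, so factor values come back as their underlying integer codes. A sketch of the effect and the workaround, assuming df$col is a factor:
f <- factor(c("NA", "SA", NA))
ifelse(is.na(f), "X", f)        # the labels come back as codes "1" "2", plus "X"
df$col <- as.character(df$col)  # convert to character first...
df$col[is.na(df$col)] <- "X"    # ...so the replacement keeps the text values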