Drop Multiple Columns in R

I have a data set with 80k rows and 874 columns. Some of these columns are empty. I use sum(is.na()) in a for loop to find the indices of the empty columns: if the NA count of a column equals the number of rows, that column is entirely empty.
for (i in 1:ncol(loans)) {
  if (sum(is.na(loans[i])) == nrow(loans)) {
    print(i)
  }
}
Now that I know the indices of the empty columns, I want to drop them from the data. I thought about storing those indices in a vector and dropping them one by one in a loop, but I don't think that will work: each deletion shifts the remaining columns left, so the stored indices stop pointing at the right columns. How can I drop them?
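For reference, a minimal base R sketch that sidesteps the index-shifting problem by collecting the indices first and dropping them all in a single subsetting call (empty_idx is an illustrative name):

empty_idx <- which(sapply(loans, function(x) all(is.na(x))))  # indices of all-NA columns
if (length(empty_idx) > 0) loans <- loans[, -empty_idx, drop = FALSE]  # drop them in one call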

You should try to provide a toy dataset for your question.
loans <- data.frame(
  a = c(NA, NA, NA),
  b = c(1, 2, 3),
  c = c(1, 2, 3),
  d = c(1, 2, 3),
  e = c(NA, NA, NA)
)
loans[!sapply(loans, function(col) all(is.na(col)))]
sapply loops over the columns of loans and applies the anonymous function, which checks whether all elements of a column are NA. It then coerces the output to a vector, in this case a logical one; negating it keeps only the non-empty columns.
The tidyverse option:
loans[!purrr::map_lgl(loans, ~all(is.na(.x)))]
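Since dplyr 1.0.0 the more common tidyverse idiom is select() combined with where(); a sketch of the equivalent call:

library(dplyr)
loans %>% select(where(~ !all(is.na(.x))))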

Does this work:
df <- data.frame(col1 = rep(NA, 5),
                 col2 = 1:5,
                 col3 = rep(NA, 5),
                 col4 = 6:10)
df
col1 col2 col3 col4
1 NA 1 NA 6
2 NA 2 NA 7
3 NA 3 NA 8
4 NA 4 NA 9
5 NA 5 NA 10
df[, which(colSums(df, na.rm = TRUE) == 0)] <- NULL
df
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
Another approach, which tests each column directly for being all NA (so it is not fooled by a numeric column that happens to sum to zero, and it works for non-numeric columns too):
df[!apply(df, 2, function(x) all(is.na(x)))]
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10

A dplyr solution (select_if() is superseded by select(where()) in current dplyr, and the colSums() test shares the zero-sum caveat mentioned above):
df %>%
  select_if(!colSums(., na.rm = TRUE) == 0)

You can tackle almost any problem of this kind with fundamental tools like if/else and for loops, although a drawback is that they are slower. Note that the loop below walks the columns in reverse, so deleting a column does not shift the indices of the columns still to be checked.
# evaluate each column from right to left; if a column is all NA, remove it
for (i in rev(seq_along(loans))) {
  if (sum(is.na(loans[, i])) == nrow(loans)) {
    loans[, i] <- NULL
  }
}
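A loop-free base R alternative with the same effect is Filter(), which keeps only the columns for which the predicate returns TRUE:

loans <- Filter(function(x) !all(is.na(x)), loans)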

Related

Conditional count throughout each row using R

I tried every solution I could find but my problem is still there. I have a big df (20 rows × 400 cols); for each row I want to count how many columns have a value of more than 16.
The first column is a factor and the rest of the columns are integers.
my df:
col1 col2 col3 col4
abc 2 16 17
def 4 2 4
geh 50 60 73
desired output should be:
col1 col2 col3 col4 count
abc 2 16 17 1
def 4 2 4 0
geh 50 60 73 3
I tried df$morethan16 <- rowSums(df[,-1] > 16) but then I get NA in the count column.
We may need na.rm to take care of NA elements, as >, < and == return NA wherever there are NA elements:
df$morethan16 <- rowSums(df[,-1] > 16, na.rm = TRUE)
If we still get NA, check the class of the columns; the above code works only if the columns are numeric. type.convert converts columns to numeric automatically (based on the values of each column):
df <- type.convert(df, as.is = TRUE)
check the structure
str(df)
If it is still not numeric, some values in the column may be character elements that prevent the conversion to numeric. Force the columns to numeric with as.numeric; if they are factor columns, apply as.character first:
df[-1] <- lapply(df[-1], function(x) as.numeric(as.character(x)))
Here is another option using crossprod
df$count <- c(crossprod(rep(1, ncol(df[-1])), t(df[-1] > 16)))
which gives
col1 col2 col3 col4 count
1 abc 2 16 17 1
2 def 4 2 4 0
3 geh 50 60 73 3
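For completeness, a sketch of the same per-row count in dplyr (rowwise() and c_across() are available from dplyr 1.0.0):

library(dplyr)
df %>%
  rowwise() %>%
  mutate(count = sum(c_across(-col1) > 16, na.rm = TRUE)) %>%
  ungroup()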

Perform calculations on row depending on individual cells [duplicate]

This question already has answers here:
Sum rows in data.frame or matrix
(7 answers)
Closed 2 years ago.
I have a data frame in R that looks like:
1    3    NULL
2    NULL 5
NULL NULL 9
I want to iterate through each row and add the two numbers that are present. If there aren't two numbers present, I want to throw an error. How do I refer to specific rows and cells in R? To iterate through the rows I have a for loop:
for(i in 1:nrow(df))
Data:
df <- data.frame(
  v1 = c(1, 2, NA),
  v2 = c(3, NA, NA),
  v3 = c(NA, 5, 9)
)
Use rowSums:
df$sum <- rowSums(df, na.rm = TRUE)
Result:
df
v1 v2 v3 sum
1 1 3 NA 4
2 2 NA 5 7
3 NA NA 9 9
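The OP also asked for an error when a row does not hold exactly two numbers; a sketch of that validation, run on the original df before the sum column is added (the "exactly two" reading is an assumption):

if (any(rowSums(!is.na(df)) != 2)) stop("some rows do not contain exactly two values")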
If you do need a for loop:
for (i in 1:nrow(df)) {
  df$sum[i] <- rowSums(df[i, ], na.rm = TRUE)
}
If you have something with NULL values, you can still make it a data.frame, but the columns containing NULL become character vectors. You have to convert those to numeric, which then introduces NA for each NULL.
rowSums will then create the sum you want.
df <- read.table(text = "
a b c
1 3 NULL
2 NULL 5
NULL NULL 9
", header = TRUE)
# make the columns numeric; the "NULL" strings become NA
# (as.character guards against factor columns in older R versions)
df <- data.frame(lapply(df, function(x) as.numeric(as.character(x))))
cbind(df, sum = rowSums(df, na.rm = TRUE))
# a b c sum
# 1 1 3 NA 4
# 2 2 NA 5 7
# 3 NA NA 9 9

"summarize" multiple incomplete columns to 1 summary column [duplicate]

I have some columns in R where, for each row, there will only ever be a value in one of them; the rest will be NAs. I want to combine these into one column holding the non-NA value. Does anyone know an easy way of doing this? For example I could have the following:
data <- data.frame(a = c('A','B','C','D','E'),
                   x = c(1, 2, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA),
                   z = c(NA, NA, NA, 4, 5))
So I would have
'a' 'x' 'y' 'z'
A 1 NA NA
B 2 NA NA
C NA 3 NA
D NA NA 4
E NA NA 5
And I would like to get
'a' 'mycol'
A 1
B 2
C 3
D 4
E 5
The names of the NA-containing columns change depending on code earlier in the query, so I can't call the column names explicitly; however, I have them stored as a vector, e.g. cols <- c('x','y','z') in this example, so I can select those columns with data[, cols].
Any help would be appreciated.
Thanks
A dplyr::coalesce based solution could be:
data %>%
  mutate(mycol = coalesce(x, y, z)) %>%
  select(a, mycol)
# a mycol
# 1 A 1
# 2 B 2
# 3 C 3
# 4 D 4
# 5 E 5
Data
data <- data.frame(a = c('A','B','C','D','E'),
                   x = c(1, 2, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA),
                   z = c(NA, NA, NA, 4, 5))
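Since the OP only knows the column names at run time (cols <- c('x','y','z')), coalesce can also be applied programmatically; a sketch:

cols <- c('x', 'y', 'z')
data$mycol <- do.call(dplyr::coalesce, data[cols])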
You can use unlist to turn the columns into one vector, and then na.omit to remove the NAs. (Note that this relies on the non-NA entries coming back in row order after unlisting; that holds in this example but is not guaranteed in general.)
cbind(data[1], mycol = na.omit(unlist(data[-1])))
a mycol
x1 A 1
x2 B 2
y3 C 3
z4 D 4
z5 E 5
Here's a more general (but even simpler) solution which extends to all column types (factors, characters etc.) with non-ordered NAs. The strategy is simply to merge the non-NA values of the other columns into your merged column, using is.na for indexing:
data$mycol = data$x # your new merged column. Start with x
data$mycol[!is.na(data$y)] = data$y[!is.na(data$y)] # merge with y
data$mycol[!is.na(data$z)] = data$z[!is.na(data$z)] # merge with z
> data
a x y z mycol
1 A 1 NA NA 1
2 B 2 NA NA 2
3 C NA 3 NA 3
4 D NA NA 4 4
5 E NA NA 5 5
Note that this will overwrite existing values in mycol if there are several non-NA values in the same row. If you have a lot of columns you can automate this by looping over colnames(data), as sketched below.
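A minimal sketch of that loop, using the cols vector from the question to name the columns to merge:

cols <- c('x', 'y', 'z')           # columns to merge, known at run time
data$mycol <- data[[cols[1]]]      # start with the first column
for (cn in cols[-1]) {
  fill <- !is.na(data[[cn]])       # rows where this column has a value
  data$mycol[fill] <- data[[cn]][fill]
}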
I would use rowSums() with the na.rm = TRUE argument:
cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
which gives:
> cbind.data.frame(a=data$a, mycol = rowSums(data[, -1], na.rm = TRUE))
a mycol
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
You have to call the method directly (cbind.data.frame) as the first argument above is not a data frame.
Something like this?
data.frame(a = data$a, mycol = apply(data[, -1], 1, sum, na.rm = TRUE))
gives :
a mycol
1 A 1
2 B 2
3 C 3
4 D 4
5 E 5
max works too, and it also works on character vectors:
cbind(data[1], mycol = apply(data[-1], 1, max, na.rm = TRUE))
One possibility using dplyr and tidyr could be:
data %>%
  gather(variables, mycol, -1, na.rm = TRUE) %>%
  select(-variables)
a mycol
1 A 1
2 B 2
8 C 3
14 D 4
15 E 5
Here it transforms the data from wide to long format, excluding the first column from this operation and removing the NAs.
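gather() is superseded in current tidyr; a sketch of the equivalent with pivot_longer():

library(dplyr)
library(tidyr)
data %>%
  pivot_longer(-a, values_to = "mycol", values_drop_na = TRUE) %>%
  select(-name)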
In a related link (suppress NAs in paste()) I present a version of paste with an na.rm option (with the unfortunate name of paste5).
With this the code becomes
cols <- c("x", "y", "z")
cbind.data.frame(a = data$a, mycol = paste5(data[, cols], na.rm = TRUE))
The output of paste5 is a character, which works if you have character data otherwise you'll need to coerce to the type you want.
Though this is not the OP's case, some people seem to like the sum-based approach, so let's also think about mean and mode to make the answer more universal. This answer matches the title, which is what many people will find.
data <- data.frame(a = c('A','B','C','D','E'),
                   x = c(1, 2, NA, NA, 9),
                   y = c(NA, 6, 3, NA, 5),
                   z = c(NA, NA, NA, 4, 5))
splitdf <- split(data[, 2:4], seq(nrow(data)))
data$mean <- unlist(lapply(splitdf, function(x) mean(unlist(x), na.rm = TRUE)))
data$mode <- unlist(lapply(splitdf, function(x) {
  tab <- tabulate(match(x, na.omit(unique(unlist(x)))))
  paste(na.omit(unique(unlist(x)))[tab == max(tab)], collapse = ", ")
}))
data
a x y z mean mode
1 A 1 NA NA 1.000000 1
2 B 2 6 NA 4.000000 2, 6
3 C NA 3 NA 3.000000 3
4 D NA NA 4 4.000000 4
5 E 9 5 5 6.333333 5
If you want to stick with base R (note this string trick assumes the real values contain no commas):
data <- data.frame(a = c('A','B','C','D','E'),
                   x = c(1, 2, NA, NA, NA),
                   y = c(NA, NA, 3, NA, NA),
                   z = c(NA, NA, NA, 4, 5))
data[is.na(data)] <- ","
data$mycol <- paste0(data$x, data$y, data$z)
data$mycol <- gsub(',', '', data$mycol)

List index by number with some elements NULL, how to convert to data frame?

In an R program the list length is unknown; the list is generated from a for loop. For example:
ls <- list()
ls[[1]] <- 5
ls[[3]] <- "a"
ls[[6]] <- 8
....
Some indices (ordinal numbers) are undefined.
I want to convert to data frame, such as follows:
1 5
2 NA
3 a
4 NA
5 NA
6 8
...
Additional question: how to get the ordinal number range of this list?
A base R approach, assuming the list ls already exists in the environment:
Explanation:
We first iterate over all the elements with lapply; in the anonymous function we test each element for NULL and replace every NULL with NA. Once the NULLs are replaced, we bind the elements row-wise with rbind via do.call. For the index column we can use the seq function or the colon operator to create a sequence.
dfs <- data.frame(col1 = do.call('rbind', lapply(ls, function(x) ifelse(is.null(x), NA, x))),
                  col2 = seq(1, length(ls)),
                  stringsAsFactors = FALSE)
Alternative using unlist (instead of do.call and rbind):
dfs <- data.frame(col1 = unlist(lapply(ls, function(x) ifelse(is.null(x), NA, x))),
                  col2 = seq(1, length(ls)),
                  stringsAsFactors = FALSE)
Output:
> dfs
#   col1 col2
# 1    5    1
# 2 <NA>    2
# 3    a    3
# 4 <NA>    4
# 5 <NA>    5
# 6    8    6
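As for the additional question: assigning to ls[[6]] pads the list with NULL up to index 6, so length(ls) gives the highest ordinal used, and the full range is seq_along(ls):

seq_along(ls)
# [1] 1 2 3 4 5 6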

R - find columns where all values are either NA or single value (0 variance)

I'm dealing with a data frame containing several columns that are a single value or NA's. I know how to find columns that are one or the other:
df1 <- data.frame(col1 = 1:10, col2 = 0, col3 = seq(1, 20, 2))
df1[c(1, 4, 7), 'col2'] <- NA
names(df1)[sapply(df1, function(x) sum(is.na(x)) == length(x))]  # all NA
names(df1)[sapply(df1, function(x) length(unique(x)) == 1)]      # a single value
However, I can't think of a way to catch columns that mix NAs with a single value. In the above case col2 should be caught.
Any suggestions?
First you could check for the existence of NA within a column with:
any(is.na(df1$col2))
Then, to check whether all of a column's values are zero while ignoring the NA values, simply use:
all(df1$col2 == 0, na.rm = TRUE)
Using rowSums as alex2006 suggests can misfire: a set of numbers whose sum happens to be 0 would also flag that column.
If you're looking for columns where the variance is 0, you could try:
colvar0 <- apply(df1, 2, function(x) var(x, na.rm = TRUE) == 0)
colvar0
col1 col2 col3
FALSE TRUE FALSE
To get the column names:
names(df1)[colvar0]
Edit: if a column contains only NA, its entry in colvar0 is NA; you can retrieve all such column names with
names(df1)[colvar0 | is.na(colvar0)]
Maybe the following will do it.
sapply(df1, function(x){
  na <- is.na(x)
  any(na) && length(unique(x[!na])) == 1
})
# col1 col2 col3
#FALSE TRUE FALSE
inx <- sapply(df1, function(x){
  na <- is.na(x)
  any(na) && length(unique(x[!na])) == 1
})
df1[which(inx)]
# col2
#1 NA
#2 0
#3 0
#4 NA
#5 0
#6 0
#7 NA
#8 0
#9 0
#10 0
df1[which(!inx)]
# col1 col3
#1 1 1
#2 2 3
#3 3 5
#4 4 7
#5 5 9
#6 6 11
#7 7 13
#8 8 15
#9 9 17
#10 10 19
Note: if you just want the column names, names(df1)[inx] returns the flagged ones.
sapply(df1, function(x) length(unique(sort(x))) %in% 0:1) #sort removes NA
# col1 col2 col3
#FALSE TRUE FALSE
OR
sapply(df1, function(x) length(unique(x[!is.na(x)])) %in% 0:1)
# col1 col2 col3
#FALSE TRUE FALSE
If you instead want to retrieve the rows where this happens, I suggest the following:
which(is.na(rowSums(df1)) | rowSums(df1)==0)
