Is there a way to remove rows permanently in R data frames? - r

I'm looking for a function that permanently removes rows from an R data frame.
For example:
> df<-data.frame(AA=LETTERS[1:5],
NN=c(NA, 12, 21, NA, 11))
> df
# AA NN
#1 A NA
#2 B 12
#3 C 21
#4 D NA
#5 E 11
When I use complete.cases, R just drops the rows from the data frame (df). It does not create a new df with updated factor levels and row names, as shown below:
> df<-df[complete.cases(df),]
> df
# AA NN
#2 B 12
#3 C 21
#5 E 11
> levels(df$AA)
#[1] "A" "B" "C" "D" "E"
What I want is the following df:
>df
# AA NN
#1 B 12
#2 C 21
#3 E 11
> levels(df$AA)
#[1] "B" "C" "E"
Is there a way to do this in R?
Thanks!

We can wrap the subset in droplevels() to remove the unused levels after subsetting; setting row.names to NULL then renumbers the rows:
df <- droplevels(df[complete.cases(df),])
levels(df$AA)
#[1] "B" "C" "E"
row.names(df) <- NULL
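For reference, here is a minimal self-contained sketch of the whole sequence. It assumes AA is a factor; since R 4.0.0, data.frame() no longer converts strings to factors by default, so stringsAsFactors = TRUE is set explicitly here.

```r
# Rebuild the example with AA stored as a factor (assumption: factor column).
df <- data.frame(AA = LETTERS[1:5],
                 NN = c(NA, 12, 21, NA, 11),
                 stringsAsFactors = TRUE)

# Keep complete rows only, then drop the factor levels that no longer occur.
df <- droplevels(df[complete.cases(df), ])

# Reset the row names so they run from 1 to nrow(df) again.
row.names(df) <- NULL

levels(df$AA)
row.names(df)
```

Without stringsAsFactors = TRUE on a recent R version, AA is a character column and levels(df$AA) returns NULL, so only the row-name reset is needed.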

Related

Make column levels into new columns R

I have a data table[1000+, 4].
One of the columns has 5 levels and I want to make these 5 levels into 5 new columns.
Any suggestions?
We extract the levels from the column of interest ('ColN'), convert to list and assign that to five 'new' columns.
df1[paste0("new", 1:5)] <- as.list(levels(df1$ColN))
head(df1,2)
# ColN Col2 Col3 new1 new2 new3 new4 new5
#1 C 7 a A B C D E
#2 A 4 b A B C D E
levels(df1$ColN)
#[1] "A" "B" "C" "D" "E"
data
set.seed(48)
df1 <- data.frame(ColN= sample(LETTERS[1:5], 10,
replace=TRUE), Col2= sample(10), Col3= letters[1:10])

Change values in multiple columns of a dataframe using a lookup table

I am trying to change the value of a number of columns at once using a lookup table. They all use the same lookup table. I know how to do this for just one column -- I'd just use a merge, but am having trouble with multiple columns.
Below is an example dataframe and an example lookup table. My actual data is much larger (~10K columns with 8 rows).
example <- data.frame(a = seq(1,5), b = seq(5,1), c=c(1,4,3,2,5))
lookup <- data.frame(number = seq(1,5), letter = LETTERS[seq(1,5)])
Ideally, I would end up with a dataframe which looks like this:
example_of_ideal_output <- data.frame(a = LETTERS[seq(1,5)], b = LETTERS[seq(5,1)], c=LETTERS[c(1,4,3,2,5)])
Of course, in my actual data the dataframe is numbers, but the lookup table is a lot more complicated, so I can't just use a function like LETTERS to solve things.
Thank you in advance!
Here's a solution that works on each column successively using lapply():
as.data.frame(lapply(example,function(col) lookup$letter[match(col,lookup$number)]));
## a b c
## 1 A E A
## 2 B D D
## 3 C C C
## 4 D B B
## 5 E A E
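Because match() looks values up by key rather than by position, the same pattern should also work when the keys are not simply 1 to n. A small sketch with a made-up lookup table (lookup2, example2 and recoded are hypothetical names, not from the question):

```r
# Hypothetical lookup table whose keys are not consecutive integers.
lookup2 <- data.frame(number = c(10, 20, 30),
                      letter = c("x", "y", "z"),
                      stringsAsFactors = FALSE)
example2 <- data.frame(a = c(10, 30), b = c(20, 20))

# For each column, find the row of lookup2 that holds each value,
# then pull the corresponding letter.
recoded <- as.data.frame(
  lapply(example2, function(col) lookup2$letter[match(col, lookup2$number)]),
  stringsAsFactors = FALSE
)
recoded
```

Values that are missing from the lookup table come back as NA, since match() returns NA when it finds no match.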
Alternatively, if you don't mind switching over to a matrix, you can achieve a "more vectorized" solution, as a matrix will allow you to call match() and index lookup$letter just once for the entire input:
matrix(lookup$letter[match(as.matrix(example),lookup$number)],nrow(example));
## [,1] [,2] [,3]
## [1,] "A" "E" "A"
## [2,] "B" "D" "D"
## [3,] "C" "C" "C"
## [4,] "D" "B" "B"
## [5,] "E" "A" "E"
(And of course you can coerce back to data.frame via as.data.frame() afterward, although you'll have to restore the column names as well if you want them, which can be done with setNames(...,names(example)). But if you really want to stick with a data.frame, my first solution is probably preferable.)
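The round trip described in the parenthesis could look roughly like this. It is only a sketch: stringsAsFactors = FALSE is added so the result holds plain character columns regardless of R version, and m and result are names chosen for illustration.

```r
example <- data.frame(a = seq(1, 5), b = seq(5, 1), c = c(1, 4, 3, 2, 5))
lookup <- data.frame(number = seq(1, 5), letter = LETTERS[seq(1, 5)],
                     stringsAsFactors = FALSE)

# One match() call over the whole input, reshaped to the original row count.
m <- matrix(lookup$letter[match(as.matrix(example), lookup$number)],
            nrow(example))

# Back to a data frame, restoring the original column names.
result <- setNames(as.data.frame(m, stringsAsFactors = FALSE), names(example))
result
```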
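Using dplyr, with a helper that looks each value up by name rather than by position (so it also works when the keys are not 1 to n):
f <- function(x) unname(setNames(lookup$letter, lookup$number)[as.character(x)])
library(dplyr)
# across() needs dplyr >= 1.0.0; older versions used mutate_each(funs(f))
example %>%
mutate(across(everything(), f))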
# a b c
#1 A E A
#2 B D D
#3 C C C
#4 D B B
#5 E A E
Or with data.table
library(data.table)
setDT(example)[, lapply(.SD, f), ]
# a b c
#1: A E A
#2: B D D
#3: C C C
#4: D B B
#5: E A E

Combine a list of data frames into one preserving row names

I do know about the basics of combining a list of data frames into one as has been answered before. However, I am interested in smart ways to maintain row names. Suppose I have a list of data frames that are fairly equal and I keep them in a named list.
library(plyr)
library(dplyr)
library(data.table)
a = data.frame(x=1:3, row.names = letters[1:3])
b = data.frame(x=4:6, row.names = letters[4:6])
c = data.frame(x=7:9, row.names = letters[7:9])
l = list(A=a, B=b, C=c)
When I use do.call, the list names are combined with the row names:
> rownames(do.call("rbind", l))
[1] "A.a" "A.b" "A.c" "B.d" "B.e" "B.f" "C.g" "C.h" "C.i"
When I use any of rbind.fill, bind_rows or rbindlist the row names are replaced by a numeric range:
> rownames(rbind.fill(l))
> rownames(bind_rows(l))
> rownames(rbindlist(l))
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
When I remove the names from the list, do.call produces the desired output:
> names(l) = NULL
> rownames(do.call("rbind", l))
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
So is there a function that I'm missing that provides some finer control over the row names? I do need the names for a different context so removing them is sub-optimal.
To preserve rownames, you can simply do:
do.call(rbind, unname(l))
# x
#a 1
#b 2
#c 3
#d 4
#e 5
#f 6
#g 7
#h 8
#i 9
Or, as you pointed out, you can set the names of l to NULL, which can also be done without modifying l:
do.call(rbind, setNames(l, NULL))
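Since the list names are still needed elsewhere, one option is to keep them as an extra column while the row names stay untouched. This is only a sketch in base R; the column name source is made up for illustration.

```r
a <- data.frame(x = 1:3, row.names = letters[1:3])
b <- data.frame(x = 4:6, row.names = letters[4:6])
c <- data.frame(x = 7:9, row.names = letters[7:9])
l <- list(A = a, B = b, C = c)

# unname() works on a copy, so l keeps its names.
combined <- do.call(rbind, unname(l))

# Record which list element each row came from (hypothetical column name).
combined$source <- rep(names(l), vapply(l, nrow, integer(1)))

rownames(combined)
combined$source
names(l)
```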
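We can use add_rownames from the dplyr package before binding. Both functions below come from older dplyr releases; current versions provide bind_rows() and tibble::rownames_to_column() in their place: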
rbind_all(lapply(l, add_rownames))
# Source: local data frame [9 x 2]
#
# rowname x
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
# 5 e 5
# 6 f 6
# 7 g 7
# 8 h 8
# 9 i 9
Why not just use rbind:
rbind(l$A, l$B, l$C)
Here is another solution that I have just found. It works well (and efficiently) when you have a large list and therefore big data frames.
df <- data.table::rbindlist(l)
# add a column with the rownames
df[,Col := unlist(lapply(l, rownames))]
df <- df %>% dplyr::select(Col, everything())
> df
Col x
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
More information about rbindlist is available in its help page (?data.table::rbindlist).

Sorting elements of a tie

Given this table
df <- data.frame(col1 = c(letters[3:5], "b","a"),
col2 = c(2:3, 1,1,1))
How can I tell R to return "a"?
That is, from the three characters with a value of 1 (a tie for the lowest value), I want to select only the first in alphabetical order.
I think you want order
with(df, col1[order(col2, col1)][1])
# [1] a
# Levels: a b c d e
or
as.character(with(df, col1[order(col2, col1)][1]))
# [1] "a"
You can sort the whole data frame by col2, breaking ties by col1, with
df[with(df, order(col2, col1)),]
# col1 col2
# 5 a 1
# 4 b 1
# 3 e 1
# 1 c 2
# 2 d 3
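The same idea as a self-contained sketch, assuming col1 is stored as plain character (stringsAsFactors = FALSE is added here so no as.character() conversion is needed); sorted and first_tied are names chosen for illustration.

```r
df <- data.frame(col1 = c(letters[3:5], "b", "a"),
                 col2 = c(2:3, 1, 1, 1),
                 stringsAsFactors = FALSE)

# Sort by col2 first; ties in col2 are broken alphabetically by col1.
sorted <- df[order(df$col2, df$col1), ]

# The first row now holds the alphabetically first label among the
# rows that share the lowest col2 value.
first_tied <- sorted$col1[1]
first_tied
```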
Try:
> min(as.character(df[df$col2==min(df$col2),1]))
[1] "a"
For explanation:
# first find col1 list in rows with minimum of df$col2
> xx = df[df$col2==min(df$col2),1]
> xx
[1] e b a
Levels: a b c d e
# Now find the minimum amongst these after converting factor to character:
> min(as.character(xx))
[1] "a"

Grouping/recoding factors in the same data.frame

Let's say I have a data frame like this:
df <- data.frame(a=letters[1:26],1:26)
And I would like to "re" factor a, b, and c as "a".
How do I do that?
One option is the recode() function in package car:
require(car)
df <- data.frame(a=letters[1:26],1:26)
df2 <- within(df, a <- recode(a, 'c("a","b","c")="a"'))
> head(df2)
a X1.26
1 a 1
2 a 2
3 a 3
4 d 4
5 e 5
6 f 6
Example where a is not so simple and we recode several levels into one.
set.seed(123)
df3 <- data.frame(a = sample(letters[1:5], 100, replace = TRUE),
b = 1:100)
with(df3, head(a))
with(df3, table(a))
the last lines giving:
> with(df3, head(a))
[1] b d c e e a
Levels: a b c d e
> with(df3, table(a))
a
a b c d e
19 20 21 22 18
Now let's combine levels a and e into level Z using recode():
df4 <- within(df3, a <- recode(a, 'c("a","e")="Z"'))
with(df4, head(a))
with(df4, table(a))
which gives:
> with(df4, head(a))
[1] b d c Z Z Z
Levels: b c d Z
> with(df4, table(a))
a
b c d Z
20 21 22 37
Doing this without spelling out the levels to merge:
## Select the levels you want (here 'a' and 'e')
lev.want <- with(df3, levels(a)[c(1,5)])
## now paste together
lev.want <- paste(lev.want, collapse = "','")
## then bolt on the extra bit
codes <- paste("c('", lev.want, "')='Z'", sep = "")
## then use within recode()
df5 <- within(df3, a <- recode(a, codes))
with(df5, table(a))
Which gives us the same as df4 above:
> with(df5, table(a))
a
b c d Z
20 21 22 37
Has anyone tried using this simple method? It requires no special packages, just an understanding of how R treats factors.
Say you want to rename some levels of a factor. First get the levels:
data <- data.frame(a=letters[1:26],1:26)
lalpha <- levels(data$a)
In this example we want the indices of the levels 'e' and 'w':
ind <- c(which(lalpha == 'e'), which(lalpha == 'w'))
Now we can use these indices to replace the levels of the factor a:
levels(data$a)[ind] <- 'X'
If you now look at the factor a in the data frame, there will be an X wherever there was an e or a w.
I leave it to you to try the result.
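For anyone who wants to check, a short sketch of the steps above with the result inspected at the end. It assumes a is a factor, so stringsAsFactors = TRUE is set explicitly (needed on R 4.0.0 and later), and the second column is given the made-up name n.

```r
data <- data.frame(a = letters[1:26], n = 1:26, stringsAsFactors = TRUE)

# Positions of the levels to rename.
lalpha <- levels(data$a)
ind <- which(lalpha %in% c("e", "w"))

# Assigning the same label to both positions merges them into one level.
levels(data$a)[ind] <- "X"

nlevels(data$a)                  # one level fewer than before
as.character(data$a[c(5, 23)])   # the former e and w rows
```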
You could do something like:
df$a[df$a %in% c("a","b","c")] <- "a"
UPDATE: More complicated factors.
Data <- data.frame(a=sample(c("Less than $50,000","$50,000-$99,999",
"$100,000-$249,999", "$250,000-$500,000"),20,TRUE),n=1:20)
rows <- Data$a %in% c("$50,000-$99,999", "$100,000-$249,999")
Data$a[rows] <- "$250,000-$500,000"
There are two ways.
If you don't want to drop the unused levels, i.e. "b" and "c", Joshua's solution is probably best.
If you want to drop the unused levels, then
df$a<-factor(ifelse(df$a%in%c("a","b","c"),"a",as.character(df$a)))
or
levels(df$a)<-ifelse(levels(df$a)%in%c("a","b","c"),"a",levels(df$a))
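To see the difference between the two cases, here is a small sketch that applies both approaches to copies of the same factor. It assumes a is a factor (stringsAsFactors = TRUE is set explicitly); keep and merged are names chosen for illustration.

```r
df <- data.frame(a = letters[1:26], n = 1:26, stringsAsFactors = TRUE)

# Value assignment: the data change, but "b" and "c" stay as unused levels.
keep <- df$a
keep[keep %in% c("a", "b", "c")] <- "a"

# Level assignment: "a", "b" and "c" are merged into the single level "a".
merged <- df$a
levels(merged) <- ifelse(levels(merged) %in% c("a", "b", "c"),
                         "a", levels(merged))

nlevels(keep)
nlevels(merged)
```

Both vectors hold the same values; only the set of levels differs.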
This is a simplified version of the chosen answer:
I've found that the easiest way to deal with this is to look at the factor levels, note the positions of the ones to change, and overwrite them directly.
df <- data.frame(a=letters[1:26],1:26)
levels(df$a)
> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
"p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
levels(df$a)[c(1,2)] <- "c"
summary(df$a)
> c d e f g h i j k l m n o p q r s t u v w x y z
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
