I am trying to change the value of a number of columns at once using a lookup table. They all use the same lookup table. I know how to do this for just one column -- I'd just use a merge, but am having trouble with multiple columns.
Below is an example dataframe and an example lookup table. My actual data is much larger (~10K columns with 8 rows).
example <- data.frame(a = seq(1,5), b = seq(5,1), c=c(1,4,3,2,5))
lookup <- data.frame(number = seq(1,5), letter = LETTERS[seq(1,5)])
Ideally, I would end up with a dataframe which looks like this:
example_of_ideal_output <- data.frame(a = LETTERS[seq(1,5)], b = LETTERS[seq(5,1)], c=LETTERS[c(1,4,3,2,5)])
Of course, in my actual data the dataframe is numbers, but the lookup table is a lot more complicated, so I can't just use a function like LETTERS to solve things.
Thank you in advance!
Here's a solution that works on each column successively using lapply():
as.data.frame(lapply(example,function(col) lookup$letter[match(col,lookup$number)]));
## a b c
## 1 A E A
## 2 B D D
## 3 C C C
## 4 D B B
## 5 E A E
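If the result should stay a data.frame with its original column names, one common idiom is to assign into example[]. This is a minimal sketch on the question's example data; stringsAsFactors = FALSE is set explicitly (an assumption, not in the original) so the lookup values are plain character strings regardless of R version, and result is an illustrative name.

```r
example <- data.frame(a = seq(1, 5), b = seq(5, 1), c = c(1, 4, 3, 2, 5))
lookup <- data.frame(number = seq(1, 5), letter = LETTERS[seq(1, 5)],
                     stringsAsFactors = FALSE)

# Replace every column in place; assigning into result[] keeps the
# data.frame structure and the column names.
# Values that are missing from lookup$number would come back as NA.
result <- example
result[] <- lapply(example, function(col) lookup$letter[match(col, lookup$number)])
```

Because match() returns NA for keys that are not found, unmatched cells show up as NA rather than raising an error, which may or may not be what you want for real data.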
Alternatively, if you don't mind switching over to a matrix, you can achieve a "more vectorized" solution, as a matrix will allow you to call match() and index lookup$letter just once for the entire input:
matrix(lookup$letter[match(as.matrix(example),lookup$number)],nrow(example));
## [,1] [,2] [,3]
## [1,] "A" "E" "A"
## [2,] "B" "D" "D"
## [3,] "C" "C" "C"
## [4,] "D" "B" "B"
## [5,] "E" "A" "E"
(And of course you can coerce back to data.frame via as.data.frame() afterward, although you'll have to restore the column names as well if you want them, which can be done with setNames(...,names(example)). But if you really want to stick with a data.frame, my first solution is probably preferable.)
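The round trip through a matrix and back to a data.frame described above might look roughly like this. It is only a sketch on the example data; m and back are illustrative names, and stringsAsFactors = FALSE is an assumption so that the output columns are character rather than factor.

```r
example <- data.frame(a = seq(1, 5), b = seq(5, 1), c = c(1, 4, 3, 2, 5))
lookup <- data.frame(number = seq(1, 5), letter = LETTERS[seq(1, 5)],
                     stringsAsFactors = FALSE)

# One match() call over all cells, reshaped to the original number of rows
m <- matrix(lookup$letter[match(as.matrix(example), lookup$number)], nrow(example))

# Coerce back to a data.frame and restore the original column names
back <- setNames(as.data.frame(m, stringsAsFactors = FALSE), names(example))
```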
Using dplyr:
f <- function(x) setNames(lookup$letter, lookup$number)[x]
library(dplyr)
example %>%
  mutate(across(everything(), f))  # replaces mutate_each(funs(f)), which is deprecated in current dplyr
# a b c
#1 A E A
#2 B D D
#3 C C C
#4 D B B
#5 E A E
Or with data.table
library(data.table)
setDT(example)[, lapply(.SD, f), ]
# a b c
#1: A E A
#2: B D D
#3: C C C
#4: D B B
#5: E A E
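One caveat about the named-vector lookup: indexing a named vector with a numeric x selects by position, not by name. That gives the right answer here only because the keys are 1 to 5 in order. A sketch of a name-based variant follows, using a hypothetical lookup table (lookup2) with non-consecutive keys; f_safe and key are illustrative names.

```r
# Hypothetical lookup table whose keys are not 1..n
lookup2 <- data.frame(number = c(10, 20, 30), letter = c("x", "y", "z"),
                      stringsAsFactors = FALSE)

# Named vector: names are the keys (as strings), values are the replacements
key <- setNames(lookup2$letter, lookup2$number)

# Index by name rather than by position; drop the names from the result
f_safe <- function(x) unname(key[as.character(x)])

out <- f_safe(c(30, 10, 20))
```

The same f_safe could be passed to lapply(), mutate(across(...)) or data.table's lapply(.SD, ...) in place of f.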
I'm searching for a function with which I can remove rows permanently from an R data frame.
For example:
> df<-data.frame(AA=LETTERS[1:5],
NN=c(NA, 12, 21, NA, 11))
> df
# AA NN
#1 A NA
#2 B 12
#3 C 21
#4 D NA
#5 E 11
When I use complete.cases, R just drops the rows from the data frame (df); it does not create a new df with updated levels and row names, as shown below:
> df<-df[complete.cases(df),]
> df
# AA NN
#2 B 12
#3 C 21
#5 E 11
> levels(df$AA)
#[1] "A" "B" "C" "D" "E"
What I want is the following df:
>df
# AA NN
#1 B 12
#2 C 21
#3 E 11
> levels(df$AA)
#[1] "B" "C" "E"
Is there a way to do this in R?
Thanks!
We can wrap with droplevels to remove the unused levels after subsetting
df <- droplevels(df[complete.cases(df),])
levels(df$AA)
#[1] "B" "C" "E"
row.names(df) <- NULL
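Put together as one runnable sketch: note that from R 4.0 onward data.frame() no longer converts strings to factors by default, so the example below builds the factor explicitly with factor() (that call is an addition for reproducibility, not part of the original question).

```r
df <- data.frame(AA = factor(LETTERS[1:5]),
                 NN = c(NA, 12, 21, NA, 11))

# Keep complete rows, then drop factor levels that no longer occur
df <- droplevels(df[complete.cases(df), ])

# Reset row names to 1, 2, 3, ...
rownames(df) <- NULL
```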
I do know about the basics of combining a list of data frames into one as has been answered before. However, I am interested in smart ways to maintain row names. Suppose I have a list of data frames that are fairly equal and I keep them in a named list.
library(plyr)
library(dplyr)
library(data.table)
a = data.frame(x=1:3, row.names = letters[1:3])
b = data.frame(x=4:6, row.names = letters[4:6])
c = data.frame(x=7:9, row.names = letters[7:9])
l = list(A=a, B=b, C=c)
When I use do.call, the list names are combined with the row names:
> rownames(do.call("rbind", l))
[1] "A.a" "A.b" "A.c" "B.d" "B.e" "B.f" "C.g" "C.h" "C.i"
When I use any of rbind.fill, bind_rows or rbindlist the row names are replaced by a numeric range:
> rownames(rbind.fill(l))
> rownames(bind_rows(l))
> rownames(rbindlist(l))
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9"
When I remove the names from the list, do.call produces the desired output:
> names(l) = NULL
> rownames(do.call("rbind", l))
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i"
So is there a function that I'm missing that provides some finer control over the row names? I do need the names for a different context so removing them is sub-optimal.
To preserve rownames, you can simply do:
do.call(rbind, unname(l))
# x
#a 1
#b 2
#c 3
#d 4
#e 5
#f 6
#g 7
#h 8
#i 9
Or, as you pointed out, by setting the names of l to NULL, this can also be done with:
do.call(rbind, setNames(l, NULL))
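Since the list names are needed elsewhere, one option is to keep them in a separate column instead of letting rbind paste them onto the row names. A minimal base-R sketch on the question's data; the column name source and the variable combined are illustrative choices.

```r
a <- data.frame(x = 1:3, row.names = letters[1:3])
b <- data.frame(x = 4:6, row.names = letters[4:6])
c <- data.frame(x = 7:9, row.names = letters[7:9])
l <- list(A = a, B = b, C = c)

# unname() returns a copy, so l itself keeps its names
combined <- do.call(rbind, unname(l))

# Record which list element each row came from
combined$source <- rep(names(l), times = vapply(l, nrow, integer(1)))
```

This leaves l untouched, keeps the original row names, and still lets you trace each row back to its list element.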
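We can use add_rownames from the dplyr package before binding (in current versions, add_rownames and rbind_all have been superseded by tibble::rownames_to_column and bind_rows):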
rbind_all(lapply(l, add_rownames))
# Source: local data frame [9 x 2]
#
# rowname x
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
# 5 e 5
# 6 f 6
# 7 g 7
# 8 h 8
# 9 i 9
Why not just use rbind:
rbind(l$A, l$B, l$C)
Here is another solution that I have just found. It works well (and efficiently) when you have a large list and therefore big data frames.
df <- data.table::rbindlist(l)
# add a column with the rownames
df[,Col := unlist(lapply(l, rownames))]
df <- df %>% dplyr::select(Col, everything())
> df
Col x
1: a 1
2: b 2
3: c 3
4: d 4
5: e 5
6: f 6
7: g 7
8: h 8
9: i 9
More info about rbindlist is in its help page (?data.table::rbindlist).
I have a data.frame df as mentioned below
V1 V2
4 b c
14 g h
10 d g
6 b f
2 a e
5 b e
12 e f
1 a b
3 a f
9 c h
11 d h
7 c d
8 c g
13 f g
The first column holds the row names, so just ignore it. Now look at columns V1 and V2. I want to find the unique elements present in them. In V1 the unique elements are b,g,d,a,e,c,f and in V2 they are c,h,g,f,e,b,d. Some of these elements are common to V1 and V2, i.e. b,g,d,e,c and f.
So I need to make a new data.frame with one column that lists all unique elements across both V1 and V2. By unique elements I mean elements which are present in V1, in V2, or in both, but each should be listed only once in the new data.frame. The data.frame I want is shown below. It would be better if the list were sorted (alphabetically if the values are letters like a,b,c,d..., or in ascending order if they are numbers like 1,2,3,...).
UniqueValues
a
b
c
d
e
f
g
h
Suppose this new data.frame is called UV1, and I have a similar one-column data.frame called UV2, with the same number of rows or more or fewer (the result of a similar operation on columns of a different data.frame). Can I compare UV1 and UV2, find the values that appear in both and the values that do not, and save them in two different data.frames, say similarValuesdf and differentValuesdf?
I am a beginner, so I would prefer easier code rather than 5-10 operations performed in a single statement, as I have seen in other replies. I understand that saves time, and that the pros have worked hard to get everything into one or two lines of code, but I am just trying to learn, so I would really appreciate easier code.
Thanks in advance.
I think the neatest way to do this would be to join the columns into a single vector and then use the unique() function:
unique(c(DF$V1,DF$V2))
[1] "b" "g" "d" "a" "e" "c" "f" "h"
To arrange in alphabetical order sort() would be a good choice.
x <- unique(c(DF$V1, DF$V2))
sort(x)
[1] "a" "b" "c" "d" "e" "f" "g" "h"
Let's recreate your data:
DF <- read.table(text = " V1 V2
4 b c
14 g h
10 d g
6 b f
2 a e
5 b e
12 e f
1 a b
3 a f
9 c h
11 d h
7 c d
8 c g
13 f g", header = TRUE, stringsAsFactors = FALSE)
Unlist the two columns into one vector and find unique values in that vector:
u1 <- unique(unlist(DF[, c("V1", "V2")]))
sort(u1)
#[1] "a" "b" "c" "d" "e" "f" "g" "h"
A second vector:
u2 <- c("d", "e", "f")
Find the intersection:
intersect(u1, u2)
#[1] "d" "e" "f"
Find the set difference:
setdiff(u1, u2)
#[1] "b" "g" "a" "c" "h"
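To get the two data.frames the question asks for, the same set functions can be applied to the UniqueValues columns, one step at a time. This is a sketch under assumptions: the contents of UV2 are made up for illustration, and "different" is read here as values found in only one of the two data.frames (setdiff in both directions).

```r
UV1 <- data.frame(UniqueValues = c("a", "b", "c", "d", "e", "f", "g", "h"),
                  stringsAsFactors = FALSE)
# Hypothetical second data.frame
UV2 <- data.frame(UniqueValues = c("d", "e", "f", "x"),
                  stringsAsFactors = FALSE)

# Values present in both
both <- intersect(UV1$UniqueValues, UV2$UniqueValues)
similarValuesdf <- data.frame(UniqueValues = sort(both), stringsAsFactors = FALSE)

# Values present in only one of the two
only1 <- setdiff(UV1$UniqueValues, UV2$UniqueValues)
only2 <- setdiff(UV2$UniqueValues, UV1$UniqueValues)
differentValuesdf <- data.frame(UniqueValues = sort(c(only1, only2)),
                                stringsAsFactors = FALSE)
```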
To get your unique values:
UniqueValues = sort(union(unique(df$V1), unique(df$V2)))
To get the intersection of two data.frame you can try:
df1 = data.frame(col1=c(1,4,6,8))
df2 = data.frame(col1=c(6,4,8,9))
similarValuesdf = merge(df1, df2)
# col1
#1 4
#2 6
#3 8
df_new = unique(append(df$V1, df$V2, after = length(df$V1)))
I'm trying to implement a data.table for my relatively large datasets and I can't figure out how to operate a function over multiple columns in the same row. Specifically, I want to create a new column that contains a specifically-formatted tally of the values (i.e., a histogram) in a subset of columns. It is kind of like table() but that also includes 0 entries and is sorted--so, if you know of a better/faster method I'd appreciate that too!
Simplified test case:
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c"))
DT<-as.data.table(DF)
> DT
A B C D E
1: a b c a a
2: d a a b a
3: a a a c c
My clunky histogram function:
histo<-function(vec){
foo<-c("a"=0,"b"=0,"c"=0,"d"=0)
for(i in vec){foo[i]=foo[i]+1}
return(foo)}
>histo(unname(unlist(DF[1,])))
a b c d
3 1 1 0
>histo(unname(unlist(DF[2,])))
a b c d
3 1 0 1
>histo(unname(unlist(DF[3,])))
a b c d
3 0 2 0
Pseudocode of the desired function and output:
>DT[,his:=some_func_with_histo(A:E)]
>DT
A B C D E his
1: a b c a a (3,1,1,0)
2: d a a b a (3,1,0,1)
3: a a a c c (3,0,2,0)
df <- data.table(DF)
df$hist <- unlist(apply(df, 1, function(x) {
list(
sapply(letters[1:4], function(d) {
b <- sum(!is.na(grep(d,x)))
assign(d, b)
}))
}), recursive=FALSE)
Your df$hist column is a list, with each value named:
> df
A B C D E hist
1: a b c a a 3,1,2,0
2: d a a b a 3,1,1,1
3: a a a c c 3,0,3,0
> df$hist
[[1]]
a b c d
3 1 2 0
[[2]]
a b c d
3 1 1 1
[[3]]
a b c d
3 0 3 0
NOTE: Answer has been updated per the OP's request and mnel's comment
OK, how do you like that solution:
library(data.table)
DT <- data.table(A=c("a","d","a"),
B=c("b","a","a"),
C=c("c","a","a"),
D=c("a","b","c"),
E=c("a","a","c"))
fun <- function(vec, char) {
sum(vec==char)
}
DT[, Vec_Nr:= paste(Vectorize(fun, 'char')(.SD, letters[1:4]), collapse=","),
by=1:nrow(DT),
.SDcols=LETTERS[1:5]]
A B C D E Vec_Nr
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
I basically split up your problem into several steps:
First, I define a function fun that gives me the number of occurrences for one character. To see how that function works, just call
fun(c("a", "a", "b"), "b")
[1] 1
Next, I vectorize this function because you don't want to know that for only one character "b", but for many. To pass a vector of arguments to a function, use Vectorize. To see how that works, just type
Vectorize(fun, "char")(c("a", "a", "b"), c("a", "b"))
a b
2 1
Next, I collapse the results into one string and save that as a new column. Note that I deliberately used letters and LETTERS here to show you how to make this more dynamic.
EDIT (also see below): Provided you first convert column classes to character, e.g., with DT <- DT[,lapply(.SD,as.character)]...
By using factor, you can convert vec and pass the values (a,b,c,d) in one step:
histo2 <- function(x) table(factor(x,levels=letters[1:4]))
Then you can iterate over rows by passing by=1:nrow(DT).
DT[,as.list(histo2(.SD)),by=1:nrow(DT)]
This gives...
nrow a b c d
1: 1 3 1 1 0
2: 2 3 1 0 1
3: 3 3 0 2 0
Also, this iterates over columns. This works because .SD is a special variable holding the subset of data associated with the call to by. In this case, that subset is the data.table consisting of one of the rows. histo2(DT[1]) works the same way.
EDIT (responding to OP's comment): Oh, sorry, I instinctively replaced your first line with
DF<-data.frame("A"=c("a","d","a"),"B"=c("b","a","a"),"C"=c("c","a","a"),"D"=c("a","b","c"),"E"=c("a","a","c")
,stringsAsFactors=FALSE)
since I dislike using factors except when making tables. If you do not want to convert your factor columns to character columns in this way, this will work:
histo3 <- function(x) table(factor(sapply(x,as.character),levels=letters[1:4]))
To put the output into a single column, you use := as you suggested...
DT[,hist:=list(list(histo3(.SD))),by=1:nrow(DT)]
The list(list()) part is key; I always figure this out by trial-and-error. Now DT looks like this:
A B C D E hist
1: a b c a a 3,1,1,0
2: d a a b a 3,1,0,1
3: a a a c c 3,0,2,0
You might find that it's a pain to access the information directly from your new column. For example, to access the "a" column of the "histogram", I think the fastest route is...
DT[,hist[[1]][["a"]],by=1:nrow(DT)]
My initial suggestion created an auxiliary data.table with just the counts. I think it's cleaner to do whatever you want to do with the counts in that data.table and then cbind it back. If you choose to store it in a column, you can always create the auxiliary data.table later with
DT[,as.list(hist[[1]]),by=1:nrow(DT)]
You are correct about using .SDcols. For your example, ...
cols = c("A","C")
histname = paste(c("hist",cols),collapse="")
DT[,(histname):=list(list(histo3(.SD))),by=1:nrow(DT),.SDcols=cols]
This gives
A B C D E hist histAC
1: a b c a a 3,1,1,0 1,0,1,0
2: d a a b a 3,1,0,1 1,0,0,1
3: a a a c c 3,0,2,0 2,0,0,0
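For comparison, the same per-row counts can be computed in base R without looping over rows, by counting one letter at a time across the whole table. This is a sketch that assumes the columns are character (hence stringsAsFactors = FALSE); counts and his are illustrative names.

```r
DF <- data.frame(A = c("a", "d", "a"), B = c("b", "a", "a"), C = c("c", "a", "a"),
                 D = c("a", "b", "c"), E = c("a", "a", "c"),
                 stringsAsFactors = FALSE)

# For each letter, DF == ch is a logical matrix; rowSums() counts the TRUEs
# per row. The result is a matrix with one row per DF row and one column
# per letter, including letters that never occur (count 0).
counts <- sapply(letters[1:4], function(ch) rowSums(DF == ch))

# Collapse each row of counts into a single string such as "3,1,1,0"
his <- unname(apply(counts, 1, paste, collapse = ","))
```

This iterates over the four letters rather than over the rows, which tends to matter when there are many more rows than distinct values.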
Let's say I have a data frame like this:
df <- data.frame(a=letters[1:26],1:26)
And I would like to "re" factor a, b, and c as "a".
How do I do that?
One option is the recode() function in package car:
require(car)
df <- data.frame(a=letters[1:26],1:26)
df2 <- within(df, a <- recode(a, 'c("a","b","c")="a"'))
> head(df2)
a X1.26
1 a 1
2 a 2
3 a 3
4 d 4
5 e 5
6 f 6
Example where a is not so simple and we recode several levels into one.
set.seed(123)
df3 <- data.frame(a = sample(letters[1:5], 100, replace = TRUE),
b = 1:100)
with(df3, head(a))
with(df3, table(a))
the last lines giving:
> with(df3, head(a))
[1] b d c e e a
Levels: a b c d e
> with(df3, table(a))
a
a b c d e
19 20 21 22 18
Now let's combine levels a and e into level Z using recode()
df4 <- within(df3, a <- recode(a, 'c("a","e")="Z"'))
with(df4, head(a))
with(df4, table(a))
which gives:
> with(df4, head(a))
[1] b d c Z Z Z
Levels: b c d Z
> with(df4, table(a))
a
b c d Z
20 21 22 37
Doing this without spelling out the levels to merge:
## Select the levels you want (here 'a' and 'e')
lev.want <- with(df3, levels(a)[c(1,5)])
## now paste together
lev.want <- paste(lev.want, collapse = "','")
## then bolt on the extra bit
codes <- paste("c('", lev.want, "')='Z'", sep = "")
## then use within recode()
df5 <- within(df3, a <- recode(a, codes))
with(df5, table(a))
Which gives us the same as df4 above:
> with(df5, table(a))
a
b c d Z
20 21 22 37
Has anyone tried using this simple method? It requires no special packages, just an understanding of how R treats factors.
Say you want to rename some levels of a factor. First get the levels (stringsAsFactors = TRUE is needed in R >= 4.0, where data.frame() no longer creates factors by default):
data <- data.frame(a=letters[1:26],1:26, stringsAsFactors = TRUE)
lalpha <- levels(data$a)
In this example we imagine we want to know the indices for the levels 'e' and 'w':
ind <- c(which(lalpha == 'e'), which(lalpha == 'w'))
Now we can use this index to replace the levels of the factor 'a'
levels(data$a)[ind] <- 'X'
If you now look at the dataframe factor a there will be an X where there was an e and w
I leave it to you to try the result.
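For readers who want to check the result, here is a compact version of the same steps with the outcome inspected. The factor is created explicitly with factor() so the sketch does not depend on the R version; the column name n is an illustrative choice.

```r
data <- data.frame(a = factor(letters[1:26]), n = 1:26)

# Positions of the levels to rename
ind <- which(levels(data$a) %in% c("e", "w"))

# Assigning the same label to several levels merges them into one level
levels(data$a)[ind] <- "X"

# Rows that used to hold 'e' (row 5) and 'w' (row 23)
renamed <- as.character(data$a[c(5, 23)])
```

After this, the factor has 25 levels instead of 26, because 'e' and 'w' now share the single level 'X'.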
You could do something like:
df$a[df$a %in% c("a","b","c")] <- "a"
UPDATE: More complicated factors.
Data <- data.frame(a=sample(c("Less than $50,000","$50,000-$99,999",
"$100,000-$249,999", "$250,000-$500,000"),20,TRUE),n=1:20)
rows <- Data$a %in% c("$50,000-$99,999", "$100,000-$249,999")
Data$a[rows] <- "$250,000-$500,000"
There are two ways.
If you don't want to drop the unused levels, i.e. "b" and "c", Joshua's solution is probably best.
If you want to drop the unused levels, then
df$a<-factor(ifelse(df$a%in%c("a","b","c"),"a",as.character(df$a)))
or
levels(df$a)<-ifelse(levels(df$a)%in%c("a","b","c"),"a",levels(df$a))
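The difference between the two approaches can be seen side by side in a small sketch. The names kept and merged, and the column n, are illustrative, and the factor is built explicitly so the example does not depend on the R version.

```r
df <- data.frame(a = factor(letters[1:26]), n = 1:26)

# Approach 1: overwrite the values; levels "b" and "c" remain but are unused
kept <- df
kept$a[kept$a %in% c("a", "b", "c")] <- "a"

# Approach 2: relabel the levels; "b" and "c" are merged into "a" and disappear
merged <- df
levels(merged$a) <- ifelse(levels(merged$a) %in% c("a", "b", "c"), "a", levels(merged$a))
```

Both give the same values row by row; only the set of levels differs (26 for kept, 24 for merged).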
This is a simplified version of the chosen answer:
I've found that the easiest way to deal with this is to look at the factor levels, note the positions of the ones you want to change, and overwrite them by index.
df <- data.frame(a=letters[1:26],1:26, stringsAsFactors = TRUE)
levels(df$a)
> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
"p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
levels(df$a)[c(1,2)] <- "c"
summary(df$a)
> c d e f g h i j k l m n o p q r s t u v w x y z
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1