Make column levels into new columns in R

I have a data table with 1000+ rows and 4 columns.
One of the columns has 5 levels and I want to make these 5 levels into 5 new columns.
Any suggestions?

We extract the levels from the column of interest ('ColN'), convert them to a list, and assign that list to five 'new' columns.
df1[paste0("new", 1:5)] <- as.list(levels(df1$ColN))
head(df1,2)
# ColN Col2 Col3 new1 new2 new3 new4 new5
#1 C 7 a A B C D E
#2 A 4 b A B C D E
levels(df1$ColN)
#[1] "A" "B" "C" "D" "E"
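data
set.seed(48)
# stringsAsFactors=TRUE is needed in R >= 4.0.0 so that ColN is a factor with levels
df1 <- data.frame(ColN= sample(LETTERS[1:5], 10, replace=TRUE),
                  Col2= sample(10), Col3= letters[1:10],
                  stringsAsFactors=TRUE)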
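The question could also be read as asking for one indicator column per level. A minimal sketch of that reading, assuming the column is a factor; the toy data and the is_ prefix are made up for illustration and are not from the original post:

```r
# Toy data (hypothetical); ColN is created as a factor explicitly.
df <- data.frame(ColN = factor(c("C", "A", "B", "A"), levels = LETTERS[1:5]),
                 Col2 = 1:4)

lv <- levels(df$ColN)

# One 0/1 integer column per level: 1 where the row has that level, else 0.
df[paste0("is_", lv)] <- lapply(lv, function(l) as.integer(df$ColN == l))

df
```

Levels that never occur in the data (here D and E) still get a column, filled with zeros.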

Related

Can %in% be used in base R to match value pairs?

I'm familiar with %in% generally, and I'm looking for a base R solution, if one exists.
Suppose I want to know whether a particular combination of values from multiple fields in a data frame exists in another data frame. As a work-around, sometimes I concatenate all these values into a single field and match on the custom concatenation, but I'm wondering if there's a way to pass the value combinations to %in% directly.
I'm imagining syntax similar to deduplicating on unique combinations of values across multiple columns, whose syntax works like this, by way of a generic example:
df[!duplicated(df[,c("col1","col2","col3")]),]
I was sort of expecting something like this to work, but I see why it doesn't:
df1[df1[,c("col1","col2")] %in% df2[,c("col1","col2")],]
... above, I'm attempting to ask which value pairs in df1 also exist as value pairs in df2.
You can use mapply to create a logical matrix of matches and then use it to subset df1.
Test data.
set.seed(2022)
df1 <- data.frame(col1 = letters[1:10], col2 = 1:10, col3 = 11:20)
df2 <- data.frame(col1 = sample(letters[1:10], 4),
col2 = sample(1:10, 4), col3 = 11:14)
Here I start by putting the column names in a vector, which simplifies the code.
cols <- c("col1", "col2")
(i <- mapply(\(x, y) x %in% y, df1[cols], df2[cols]))
# col1 col2
# [1,] FALSE FALSE
# [2,] FALSE FALSE
# [3,] TRUE FALSE
# [4,] TRUE TRUE
# [5,] FALSE FALSE
# [6,] TRUE TRUE
# [7,] TRUE TRUE
# [8,] FALSE FALSE
# [9,] FALSE TRUE
#[10,] FALSE FALSE
Now subset. The question is not very clear on which of the following is asked for.
# at least one column match
j <- rowSums(i) > 0L
df1[j, ]
# col1 col2 col3
#3 c 3 13
#4 d 4 14
#6 f 6 16
#7 g 7 17
#9 i 9 19
# all columns match
k <- rowSums(i) == length(cols)
df1[k, ]
# col1 col2 col3
#4 d 4 14
#6 f 6 16
#7 g 7 17
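Note that the mapply() approach tests each column independently, so a row can be flagged even when its two values come from different rows of df2. If the values have to occur together in one row, one base R sketch is the concatenation workaround from the question, wrapped in do.call(paste, ...). The small data frames, the helper name key and the separator are assumptions chosen for illustration; the separator must not occur in the data.

```r
df1 <- data.frame(col1 = c("a", "b", "c"), col2 = 1:3)
df2 <- data.frame(col1 = c("a", "b", "c"), col2 = c(1L, 3L, 2L))
cols <- c("col1", "col2")

# Column-wise test: every value of df1 occurs somewhere in the same column of df2.
col_match <- rowSums(mapply(function(x, y) x %in% y, df1[cols], df2[cols])) == length(cols)

# Pair-wise test: build one key per row and compare the keys.
key <- function(d, cols) do.call(paste, c(d[cols], sep = "\r"))
pair_match <- key(df1, cols) %in% key(df2, cols)

col_match
pair_match
df1[pair_match, ]
```

Here the column-wise test flags all three rows, while only the pair ("a", 1) exists as a row in both data frames.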
I think just doing a merge() by the two columns of interest gets you what you need. You can then subset the merged output to just the columns from the original data.frame. This returns only the rows of your query data.frame where col1 and col2 match their cognate values in the reference data.frame. Please clarify if that's NOT your goal.
# simulate two DFs with some common values in col1 and col2
x <- data.frame(col1 = LETTERS[1:5],
col2 = 1:5,
col3 = runif(5))
y <- data.frame(col1 = LETTERS[4:8],
col2 = 4:8,
col3 = runif(5))
x
#> col1 col2 col3
#> 1 A 1 0.4306611
#> 2 B 2 0.7149893
#> 3 C 3 0.2808990
#> 4 D 4 0.4383580
#> 5 E 5 0.1372991
y
#> col1 col2 col3
#> 1 D 4 0.40191250
#> 2 E 5 0.94833538
#> 3 F 6 0.85608320
#> 4 G 7 0.05758958
#> 5 H 8 0.29011770
# merge without adding .x suffix to col3 from x
# then subset to only keep columns from x
merge(x, y,
by = c("col1", "col2"),
suffixes = c("", ".drop"))[,1:ncol(x)]
#> col1 col2 col3
#> 1 D 4 0.4383580
#> 2 E 5 0.1372991
Created on 2022-01-08 by the reprex package (v2.0.1)

Is there a way to remove rows permanently in R data frames?

I'm searching for a function with which I can permanently remove rows from an R data frame.
For example:
> df<-data.frame(AA=LETTERS[1:5],
NN=c(NA, 12, 21, NA, 11),
stringsAsFactors=TRUE) # needed in R >= 4.0.0 for AA to be a factor
> df
# AA NN
#1 A NA
#2 B 12
#3 C 21
#4 D NA
#5 E 11
When I use complete.cases, R drops the rows from the data frame (df) but keeps the old factor levels and row names instead of creating new ones, as shown below:
> df<-df[complete.cases(df),]
> df
# AA NN
#2 B 12
#3 C 21
#5 E 11
> levels(df$AA)
#[1] "A" "B" "C" "D" "E"
What I want is the following df:
> df
# AA NN
#1 B 12
#2 C 21
#3 E 11
> levels(df$AA)
#[1] "B" "C" "E"
Is there a way to do this in R?
Thanks!
We can wrap the subset in droplevels() to remove the unused levels, and then reset the row names:
df <- droplevels(df[complete.cases(df),])
levels(df$AA)
#[1] "B" "C" "E"
row.names(df) <- NULL

Sorting elements of a tie

Given this table
df <- data.frame(col1 = c(letters[3:5], "b","a"),
col2 = c(2:3, 1,1,1))
How can I tell R to return "a"?
That is, from the three characters with a value of 1 (a tie for the lowest value), I want to select only the first in alphabetical order.
I think you want order
with(df, col1[order(col2, col1)][1])
# [1] a
# Levels: a b c d e
or
as.character(with(df, col1[order(col2, col1)][1]))
# [1] "a"
You can order column 1 by the ordered values in column 2 with
df[with(df, order(col2, col1)),]
# col1 col2
# 5 a 1
# 4 b 1
# 3 e 1
# 1 c 2
# 2 d 3
Try:
> min(as.character(df[df$col2==min(df$col2),1]))
[1] "a"
For explanation:
# first find col1 list in rows with minimum of df$col2
> xx = df[df$col2==min(df$col2),1]
> xx
[1] e b a
Levels: a b c d e
# Now find the minimum amongst these after converting factor to character:
> min(as.character(xx))
[1] "a"

Match one column of a data.frame with all the columns in another data.frame

I have two data.frames:
DF1
Col1 Col2 ...... ...... Col2000
A H
c d
d e
n b
e A
b n
H c
DF2
A
b
c
d
e
n
H
I simply need to match the single column in DF2 against each column in DF1, because I need to know the exact rank of each match. I tried to write a function, but since I'm not an R expert something goes wrong in my code:
lapply(DF1, function(x) match(DF1[,i], DF2[,1]))
The anonymous function receives each column of DF1 as x, so match x itself rather than DF1[,i] (i is never defined):
lapply(DF1, function(x) match(x, DF2[,1]))
This does what you're trying to do. Take:
DF1 <- data.frame(
Col1 = c('A','c','d','n','e','b','H'),
Col2 = c('H','d','e','b','A','n','c')
)
DF2 <- data.frame(c('A','b','c','d','e','n','H'))
Then:
> lapply(DF1, function(x) match(x, DF2[,1]))
$Col1
[1] 1 3 4 6 5 2 7
$Col2
[1] 7 4 5 2 1 6 3
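If a single matrix of ranks is more convenient than a list, sapply() can be used the same way. A small sketch that rebuilds the example data; the column name ref is made up, and DF2[[1]] is just another way to take the only column of DF2:

```r
DF1 <- data.frame(
  Col1 = c("A", "c", "d", "n", "e", "b", "H"),
  Col2 = c("H", "d", "e", "b", "A", "n", "c")
)
DF2 <- data.frame(ref = c("A", "b", "c", "d", "e", "n", "H"))

# Each column of DF1 is passed to match() as its first argument;
# the lookup table stays fixed. The result has one column per DF1 column.
ranks <- sapply(DF1, match, table = DF2[[1]])
ranks
```

With 2000 columns this yields a 7 x 2000 integer matrix, where NA marks values that do not occur in DF2.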

Grouping/recoding factors in the same data.frame

Let's say I have a data frame like this:
df <- data.frame(a=letters[1:26],1:26)
And I would like to "re" factor a, b, and c as "a".
How do I do that?
One option is the recode() function in package car:
require(car)
df <- data.frame(a=letters[1:26],1:26)
df2 <- within(df, a <- recode(a, 'c("a","b","c")="a"'))
> head(df2)
a X1.26
1 a 1
2 a 2
3 a 3
4 d 4
5 e 5
6 f 6
Example where a is not so simple and we recode several levels into one.
set.seed(123)
df3 <- data.frame(a = sample(letters[1:5], 100, replace = TRUE),
b = 1:100)
with(df3, head(a))
with(df3, table(a))
the last lines giving:
> with(df3, head(a))
[1] b d c e e a
Levels: a b c d e
> with(df3, table(a))
a
a b c d e
19 20 21 22 18
Now let's combine levels a and e into level Z using recode():
df4 <- within(df3, a <- recode(a, 'c("a","e")="Z"'))
with(df4, head(a))
with(df4, table(a))
which gives:
> with(df4, head(a))
[1] b d c Z Z Z
Levels: b c d Z
> with(df4, table(a))
a
b c d Z
20 21 22 37
Doing this without spelling out the levels to merge:
## Select the levels you want (here 'a' and 'e')
lev.want <- with(df3, levels(a)[c(1,5)])
## now paste together
lev.want <- paste(lev.want, collapse = "','")
## then bolt on the extra bit
codes <- paste("c('", lev.want, "')='Z'", sep = "")
## then use within recode()
df5 <- within(df3, a <- recode(a, codes))
with(df5, table(a))
Which gives us the same as df4 above:
> with(df5, table(a))
a
b c d Z
20 21 22 37
Has anyone tried using this simple method? It requires no special packages, just an understanding of how R treats factors.
Say you want to rename some levels of a factor. First get the levels:
data <- data.frame(a=letters[1:26],1:26, stringsAsFactors=TRUE)
lalpha <- levels(data$a)
Then find the indices of the levels to change, in this example 'e' and 'w':
ind <- c(which(lalpha == 'e'), which(lalpha == 'w'))
Now we can use this index to replace the levels of the factor 'a'
levels(data$a)[ind] <- 'X'
If you now look at factor a in the data frame, there will be an X wherever there was an e or a w.
I leave it to you to try the result.
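For reference, a self-contained sketch of the steps above. It assumes the column is created as a factor explicitly, since data.frame() no longer converts strings to factors by default from R 4.0.0 on; the column name n is made up.

```r
data <- data.frame(a = factor(letters[1:26]), n = 1:26)

# Positions of the levels to rename.
ind <- which(levels(data$a) %in% c("e", "w"))

# Assigning the same label to both positions merges them into one level.
levels(data$a)[ind] <- "X"

as.character(data$a[c(5, 23)])  # "X" "X"
nlevels(data$a)                 # 25
```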
You could do something like:
df$a[df$a %in% c("a","b","c")] <- "a"
UPDATE: More complicated factors.
Data <- data.frame(a=sample(c("Less than $50,000","$50,000-$99,999",
"$100,000-$249,999", "$250,000-$500,000"),20,TRUE),n=1:20)
rows <- Data$a %in% c("$50,000-$99,999", "$100,000-$249,999")
Data$a[rows] <- "$250,000-$500,000"
There are two ways.
If you don't want to drop the unused levels, i.e. "b" and "c", Joshua's solution is probably best.
If you want to drop the unused levels, then
df$a<-factor(ifelse(df$a%in%c("a","b","c"),"a",as.character(df$a)))
or
levels(df$a)<-ifelse(levels(df$a)%in%c("a","b","c"),"a",levels(df$a))
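To see how keeping and dropping the unused levels differ, a minimal sketch; the data frame is built with an explicit factor and a made-up second column n, and the names grp, kept and merged are only for illustration:

```r
df <- data.frame(a = factor(letters[1:26]), n = 1:26)
grp <- c("a", "b", "c")

# Overwrite the values only: "b" and "c" remain as unused levels.
kept <- df
kept$a[kept$a %in% grp] <- "a"

# Rename the levels: the duplicated label "a" merges the three levels.
merged <- df
levels(merged$a) <- ifelse(levels(merged$a) %in% grp, "a", levels(merged$a))

nlevels(kept$a)    # 26
nlevels(merged$a)  # 24
```

Both versions hold the same values; they differ only in the level set attached to the factor.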
This is a simplified version of the chosen answer:
I've found that the easiest way to deal with this is to look at the factor levels, note the positions of the ones to change, and overwrite those positions.
df <- data.frame(a=letters[1:26],1:26, stringsAsFactors=TRUE)
levels(df$a)
> [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o"
"p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
levels(df$a)[c(1,2)] <- "c"
summary(df$a)
> c d e f g h i j k l m n o p q r s t u v w x y z
3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
