Generate dummy variable with multiple levels in R - r

My question involves how to generate a dummy-variable from a character variable with multiple repeated characters in R. The number of times
that a certain character is repeated varies.
There are several questions about this topic, but none of them seem to address my specific problem.
Below is a minimal example of the data:
df <- data.frame(ID=c("C/004","C/004","C/005","C/005","C/005","C/007",
"C/007", "C/007"))
The result I expect is as follows:
> df
ID newID
1 C/004 1
2 C/004 1
3 C/005 2
4 C/005 2
5 C/005 2
6 C/007 3
7 C/007 3
8 C/007 3
I would like newID to be numeric rather than a factor, so I would prefer not to use factor(.., levels=...):
it returns a factor, and it would also require me to supply the factor levels, of which there are too many.
Any assistance would be greatly appreciated.

You can do this in a few ways:
match(df$ID, unique(df$ID))
#[1] 1 1 2 2 2 3 3 3
Or
as.numeric(factor(df$ID))
#[1] 1 1 2 2 2 3 3 3
Or
cumsum(!duplicated(df$ID))
#[1] 1 1 2 2 2 3 3 3
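The three one-liners agree on the example data, but they are not interchangeable in general. A minimal sketch, using a made-up ids vector that is neither sorted nor grouped, suggests where they diverge:

```r
# Hypothetical IDs: unsorted, and "C/007" appears in two separate places
ids <- c("C/007", "C/004", "C/007", "C/005")

# Numbers each ID by order of first appearance
by_first_seen <- match(ids, unique(ids))

# Numbers each ID by the sorted order of the factor levels
by_sorted_level <- as.numeric(factor(ids))

# Increments at every first occurrence; a repeat that comes later
# simply inherits the running count at that point
by_running_count <- cumsum(!duplicated(ids))

print(by_first_seen)     # 1 2 1 3
print(by_sorted_level)   # 3 1 3 2
print(by_running_count)  # 1 2 2 3
```

In other words, cumsum(!duplicated(...)) is only safe when equal IDs sit in consecutive rows, as they do in the question.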

Factors are stored as integer codes underneath. Therefore, if you want a numeric, simply convert:
df$newID <- as.numeric(factor(df$ID))

Related

How to merge two data sets based on the result of matching two columns in R [duplicate]

This came up just in an answer to another question here. When you rbind two data frames, it matches columns by name rather than index, which can lead to unexpected behavior:
> df<-data.frame(x=1:2,y=3:4)
> df
x y
1 1 3
2 2 4
> rbind(df,df[,2:1])
x y
1 1 3
2 2 4
3 1 3
4 2 4
Of course, there are workarounds. For example:
rbind(df,rename(df[,2:1],names(df)))
data.frame(rbind(as.matrix(df),as.matrix(df[,2:1])))
On edit: rename from the plyr package doesn't actually work this way (although I thought I had it working when I originally wrote this...). The way to do this by renaming is to use SimonO101's solution:
rbind(df,setNames(df[,2:1],names(df)))
Also, maybe surprisingly,
data.frame(rbindlist(list(df,df[,2:1])))
works by index (and if we don't mind a data table, then it's pretty concise), so this is a difference between rbindlist (from the data.table package) and do.call(rbind).
The question is, what is the most concise way to rbind two data frames where the names don't match? I know this seems trivial, but this kind of thing can end up cluttering code. And I don't want to have to write a new function called rbindByIndex. Ideally it would be something like rbind(df,df[,2:1],byIndex=T).
You might find setNames handy here...
rbind(df, setNames(rev(df), names(df)))
# x y
#1 1 3
#2 2 4
#3 3 1
#4 4 2
I suspect your real use-case is somewhat more complex. You can of course reorder columns in the first argument of setNames as you wish, just use names(df) in the second argument, so that the names of the reordered columns match the original.
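The same idea extends to more than two data frames. As a rough sketch (the names frames and by_index are made up for illustration, and every frame is assumed to have the same number of columns), each frame can be renamed by position to the first frame's column names before binding:

```r
df  <- data.frame(x = 1:2, y = 3:4)
df2 <- df[, 2:1]   # same data with the columns swapped: y first, then x

frames <- list(df, df2)

# Rename every frame by position to names(df), then bind them all
by_index <- do.call(rbind, lapply(frames, setNames, names(df)))
print(by_index)
```

Because the renaming happens before rbind sees the frames, the name matching that rbind performs no longer reorders anything.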
This seems pretty easy:
mapply(c,df,df[,2:1])
x y
[1,] 1 3
[2,] 2 4
[3,] 3 1
[4,] 4 2
For this simple case, though, you have to turn it back into a dataframe (because mapply simplifies it to a matrix):
as.data.frame(mapply(c,df,df[,2:1]))
x y
1 1 3
2 2 4
3 3 1
4 4 2
Important note 1: There appears to be a downside of type coercion when your dataframe contains vectors of different types:
df<-data.frame(x=1:2,y=3:4,z=c('a','b'))
mapply(c,df,df[,c(2:1,3)])
x y z
[1,] 1 3 2
[2,] 2 4 1
[3,] 3 1 2
[4,] 4 2 1
Important note 2: It also is terrible if you have factors.
df<-data.frame(x=factor(1:2),y=factor(3:4))
mapply(c,df[,1:2],df[,2:1])
x y
[1,] 1 1
[2,] 2 2
[3,] 1 1
[4,] 2 2
So, as long as you have all numeric data, it's okay.

How to tidy up a character column?

What I have:
test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),label=c(1,1,1,1,2,2,2,2,2),alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"))
> test_df
isolate label alignment
1 1 1 --at
2 2 1 at--
3 3 1 --at
4 4 1 --at
5 1 2 a--
6 2 2 acg
7 3 2 a--
8 4 2 a--
9 5 2 agg
What I want:
I'd like to explode the alignment field into two columns, position and character:
> test_df
isolate label aln_pos aln_char
1 1 1 1 -
2 1 1 2 -
3 1 1 3 a
4 1 1 4 t
...
Not all alignments are the same length, but all alignments with the same label have the same length.
What I've tried:
I was thinking I could use separate to first give each position its own column, then use gather to turn those columns into key-value pairs. However, I haven't been able to get the separate part right.
Since you mentioned tidyr::gather, you could try this:
test_df <- data.frame(isolate=c(1,2,3,4,1,2,3,4,5),
label=c(1,1,1,1,2,2,2,2,2),
alignment=c("--at","at--","--at","--at","a--","acg","a--","a--", "agg"),
stringsAsFactors = FALSE)
library(tidyverse)
test_df %>%
mutate(alignment = strsplit(alignment,"")) %>%
unnest(alignment)
In base R, you can use indexing along with creation of a list with strsplit like this.
# make variable a character vector
test_df$alignment <- as.character(test_df$alignment)
# get list of individual characters
myList <- strsplit(test_df$alignment, split="")
Then build the data.frame:
# construct data.frame
final_df <- cbind(test_df[rep(seq_len(nrow(test_df)), lengths(myList)),
c("isolate", "label")],
aln_pos=sequence(lengths(myList)),
aln_char=unlist(myList))
Here, we take the first two columns of the original data.frame and repeat its rows using rep, whose second argument is a vector telling it how many times to repeat the corresponding value in its first argument. The number of times is calculated with lengths. The second argument of cbind is a call to sequence taking the same lengths output. This produces counts from 1 to the corresponding length. The third argument is the unlisted character values.
This returns
head(final_df, 10)
isolate label aln_pos aln_char
1 1 1 1 -
1.1 1 1 2 -
1.2 1 1 3 a
1.3 1 1 4 t
2 2 1 1 a
2.1 2 1 2 t
2.2 2 1 3 -
2.3 2 1 4 -
3 3 1 1 -
3.1 3 1 2 -
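The rep, lengths and sequence calls used in the base R answer may be easier to follow in isolation. A small sketch on a two-string toy input (the variable names here are illustrative only):

```r
# Toy input: two strings of different lengths
chars <- strsplit(c("--at", "a--"), split = "")
n <- lengths(chars)                # number of characters per string

row_index <- rep(seq_along(n), n)  # which original row each character came from
aln_pos   <- sequence(n)           # restarts at 1 for every string
aln_char  <- unlist(chars)         # all characters in one vector

print(data.frame(row_index, aln_pos, aln_char))
```

Indexing the original data.frame by row_index is what repeats each row once per character.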

How to remove columns of data from a data frame using a vector with a regular expression

I am trying to remove columns from a dataframe using a vector of numbers, with those numbers being just a part of the whole column header. What I'm looking to use is something like the wildcard "*" in unix, so that I can say that I want to remove columns with labels xxxx, xxkx, etc... To illustrate what I mean, if I have the following data:
data_test_read <- read.table("batch_1_8c9.structure-edit.tsv",sep="\t", header=TRUE)
data_test_read[1:5,1:5]
samp pop X12706_10 X14223_16 X14481_7
1 BayOfIslands_s088.fq 1 4 1 3
2 BayOfIslands_s088.fq 1 4 1 3
3 BayOfIslands_s089.fq 1 4 1 3
4 BayOfIslands_s089.fq 1 4 3 3
5 BayOfIslands_s090.fq 1 4 1 3
And I want to take out, for example, columns with headers (X12706_10, X14481_7), the following works
data_subs1=subset(data_test_read, select = -c(X12706_10, X14481_7))
data_subs1[1:4,1:4]
samp pop X14223_16 X15213_19
1 BayOfIslands_s088.fq 1 1 3
2 BayOfIslands_s088.fq 1 1 3
3 BayOfIslands_s089.fq 1 1 3
4 BayOfIslands_s089.fq 1 3 3
However, what I need is to be able to identify these columns by only the numbers, so, using (12706,14481). But, if I try this, I get the following
data_subs2=subset(data_test_read, select = -c(12706,14481))
data_subs2[1:4,1:4]
samp pop X12706_10 X14223_16
1 BayOfIslands_s088.fq 1 4 1
2 BayOfIslands_s088.fq 1 4 1
3 BayOfIslands_s089.fq 1 4 1
4 BayOfIslands_s089.fq 1 4 3
This is clearly because I haven't specified anything to do with the "x", or the "_" or what is after the underscore. I've read so many answers on using regular expressions, and I just can't seem to sort it out. Any thoughts, or pointers to what I might turn to would be appreciated.
First you can just extract the numbers from the headers
# for testing
col_names <- c("X12706_10","X14223_16","X14481_7")
# in practice, use
# col_names <- names(data_test_read)
samples <- gsub("X(\\d+)_.*","\\1",col_names)
Then find the indexes of the samples you want to drop.
samples_to_drop <- c(12706, 14481)
cols_to_drop <- match(samples_to_drop, samples)
Then you can use
data_subs2 <- subset(data_test_read, select = -cols_to_drop)
to actually get rid of those columns.
Perhaps put this all in a function to make it easier to use
sample_subset <- function(x, drop) {
samples <- gsub("X(\\d+)_.*","\\1", names(x))
subset(x, select = -match(drop, samples))
}
sample_subset(data_test_read, c(12706, 14481))
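One thing to watch with the function above: match() returns NA for an ID that has no matching column, which is awkward to combine with negative indexing. A possible variant, sketched here with a hypothetical helper drop_markers and a small stand-in data frame toy (neither is from the question), uses a logical mask instead and assumes marker columns are named like X<digits>_<anything>:

```r
# Hypothetical helper: drop marker columns whose numeric ID is in ids_to_drop.
# Assumes marker columns are named like "X<digits>_<anything>".
drop_markers <- function(x, ids_to_drop) {
  is_marker <- grepl("^X\\d+_", names(x))
  ids <- rep(NA_real_, ncol(x))
  ids[is_marker] <- as.numeric(sub("^X(\\d+)_.*$", "\\1", names(x)[is_marker]))
  # Non-marker columns have NA ids and are always kept;
  # IDs with no matching column are silently ignored
  x[, !(ids %in% ids_to_drop), drop = FALSE]
}

# Small stand-in for the data in the question
toy <- data.frame(samp = c("s088", "s089"), pop = c(1, 1),
                  X12706_10 = c(4, 4), X14223_16 = c(1, 3), X14481_7 = c(3, 3))

kept <- drop_markers(toy, c(12706, 14481, 99999))
print(names(kept))
```

Comparing the IDs as numbers rather than strings also sidesteps any surprises from how large numbers are converted to text.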

Determining congruence between rows in R, based on key variable

I have a few large data sets with many variables. There is a "key" variable that is the ID for the research participant. In these data sets, there are some IDs that are duplicated. I have written code to extract all data for duplicated IDs, but I would like a way to check if the remainder of the variables for those IDs are equal or not. Below is a simplistic example:
ID X Y Z
1 2 3 4
1 2 3 5
2 5 5 4
2 5 5 4
3 1 2 3
3 2 2 3
3 1 2 3
In this example, I would like to be able to identify that the rows for ID 1 and ID 3 are NOT all equal. Is there any way to do this in R?
You can use duplicated for this:
d <- read.table(text='ID X Y Z
1 2 3 4
1 2 3 5
2 5 5 4
2 5 5 4
3 1 2 3
3 2 2 3
3 1 2 3
4 1 1 1', header=TRUE)
tapply(duplicated(d), d[, 1], function(x) all(x[-1]))
## 1 2 3 4
## FALSE TRUE FALSE TRUE
duplicated returns a logical vector indicating, for each row of a dataframe, whether an identical row has been encountered earlier in the dataframe. We use tapply over this logical vector, splitting it into groups based on ID and applying a function to each of these groups. The function we apply is all(x[-1]), i.e. we ask whether all rows in the group, other than the first, are duplicates.
Note that I added a group with a single record to ensure that the solution works in these cases as well.
Alternatively, you can reduce the dataframe to unique records with unique, and then split by ID and check whether each split has only a single row:
sapply(split(unique(d), unique(d)[, 1]), nrow) == 1
## 1 2 3 4
## FALSE TRUE FALSE TRUE
(If it's a big dataframe it's worth calculating unique(d) in advance rather than calling it twice.)
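If the goal is to pull out the offending IDs rather than a TRUE/FALSE per group, a short sketch along the same lines (computing unique(d) once; the name conflicting_ids is illustrative) might look like this:

```r
# Same example data as above, built directly
d <- data.frame(ID = c(1, 1, 2, 2, 3, 3, 3, 4),
                X  = c(2, 2, 5, 5, 1, 2, 1, 1),
                Y  = c(3, 3, 5, 5, 2, 2, 2, 1),
                Z  = c(4, 5, 4, 4, 3, 3, 3, 1))

u <- unique(d)  # distinct rows, computed once

# An ID that still appears more than once in u has at least two different rows
conflicting_ids <- unique(u$ID[duplicated(u$ID)])
print(conflicting_ids)  # 1 3

# All original rows belonging to those IDs, for inspection
print(d[d$ID %in% conflicting_ids, ])
```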

R table function

If I have a vector numbers <- c(1,1,2,4,2,2,2,2,5,4,4,4), and I use 'table(numbers)', I get
names 1 2 4 5
counts 2 5 4 1
What if I want it to include 3 also or generally, all numbers from 1:max(numbers) even if they are not represented in numbers. Thus, how would I generate an output as such:
names 1 2 3 4 5
counts 2 5 0 4 1
If you want R to count values that aren't there, you should create a factor and explicitly set the levels. table will return a count for each level, including a zero for levels that never occur.
table(factor(numbers, levels=1:max(numbers)))
# 1 2 3 4 5
# 2 5 0 4 1
For this particular example (positive integers), tabulate would also work:
numbers <- c(1,1,2,4,2,2,2,2,5,4,4,4)
tabulate(numbers)
# [1] 2 5 0 4 1
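Both approaches can also cover a range that extends past max(numbers). A small sketch, where upper is a made-up bound chosen only for illustration:

```r
numbers <- c(1, 1, 2, 4, 2, 2, 2, 2, 5, 4, 4, 4)
upper <- 7  # hypothetical upper bound, larger than max(numbers)

# Named counts, one per factor level
counts_table <- table(factor(numbers, levels = 1:upper))

# Unnamed integer counts for bins 1..upper
counts_tabulate <- tabulate(numbers, nbins = upper)

print(counts_table)
print(counts_tabulate)  # 2 5 0 4 1 0 0
```

The table result carries the values as names, whereas tabulate returns a bare integer vector indexed by position.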
