I have a matrix whose last column contains characters:
A
B
B
A
...
I would like to replace A with 1 and B with 2 in R. The expected result should be:
1
2
2
1
...
If you are 100% confident that only "A" and "B" appear, nested gsub calls will do:
sample_data = c("A", "B", "B", "A")
sample_data
# [1] "A" "B" "B" "A"
as.numeric(gsub("A", 1, gsub("B", 2, sample_data)))
# [1] 1 2 2 1
Using factor or a simple lookup table would be much more flexible:
sample_data = c("A", "B", "B", "A")
Recommended (factor levels are sorted alphabetically by default, so "A" becomes 1 and "B" becomes 2):
as.numeric(factor(sample_data))
# [1] 1 2 2 1
Possible alternative:
as.numeric(c("A" = "1", "B" = "2")[sample_data])
# [1] 1 2 2 1
I'm trying to learn how to work with the sjlabelled package in R, and labelled data more generally. I'm trying to overwrite some values in a labelled column without losing its labels.
Here's the output for what I tried for a very simple example:
> library(dplyr)
> library(sjlabelled)
>
> df <- data.frame(col1 = c(1:3, 1:3),
+ col2 = c(seq(11, 33, 11), 11, 12, 13))
>
> df <- df %>%
+ set_labels(col1, labels = c("a" = 1, "b" = 2, "c" = 3)) %>%
+ set_labels(col2, labels = c("A" = 11, "B" = 12, "C" = 13, "D" = 22, "E" = 33))
>
> df
col1 col2
1 1 11
2 2 22
3 3 33
4 1 11
5 2 12
6 3 13
>
> get_labels(df)
$col1
[1] "a" "b" "c"
$col2
[1] "A" "B" "C" "D" "E"
>
> df <- df %>%
+ mutate(col1 = ifelse(col1 > 1 & col2 < 33, 2, as_labelled(col1)))
>
> df
col1 col2
1 1 11
2 2 22
3 3 33
4 1 11
5 2 12
6 2 13
>
> get_labels(df)
$col1
NULL
$col2
[1] "A" "B" "C" "D" "E"
I have used as_labelled to preserve labels in other situations, such as when using rbind for data frames with labelled data, but it isn't working here.
Is either of the following possible using sjlabelled or a similar approach:
a) overwriting values with other values which had already been assigned a label so that labels are preserved and the overwritten value has its label overwritten with the corresponding label (e.g. any values that are overwritten as '2' in the example will now be labelled as "b")?
b) overwriting values with values which weren't in the column (e.g. NA) without losing the labels from the values which were already labelled (e.g. with the example, overwriting '1' and '2', with NA but keeping '3' labelled as "c")?
Thank you.
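ifelse drops the attributes of its yes/no arguments, which is why the labels disappear. Assigning into the existing column by index keeps them: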
df$col1[with(df, col1 > 1 & col2 < 33)] <- 2
Checking:
> get_labels(df)
$col1
[1] "a" "b" "c"
$col2
[1] "A" "B" "C" "D" "E"
> df
col1 col2
1 1 11
2 2 22
3 3 33
4 1 11
5 2 12
6 2 13
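The difference between ifelse and indexed assignment can be seen without sjlabelled at all. The sketch below assumes the labels are stored as a plain "labels" attribute on the vector (my reading of what set_labels does, not something checked against the package); the variable names are made up for illustration.

```r
x <- c(1, 2, 3, 1, 2, 3)
attr(x, "labels") <- c(a = 1, b = 2, c = 3)  # stand-in for set_labels()
y <- c(11, 22, 33, 11, 12, 13)

# Plain logical condition (as.vector strips any attributes from it)
cond <- as.vector(x > 1 & y < 33)

# ifelse builds its result from the test, so the labels are gone
via_ifelse <- ifelse(cond, 2, x)
attr(via_ifelse, "labels")

# Indexed assignment modifies x in place and keeps its attributes (case a)
via_index <- x
via_index[cond] <- 2
attr(via_index, "labels")

# Overwriting with NA works the same way (case b)
with_na <- x
with_na[x %in% c(1, 2)] <- NA
attr(with_na, "labels")
```

Both via_index and with_na still carry the full label set, so the remaining 3s stay labelled as "c".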
How do I pull out unique values from each column in a data frame (both numeric and strings) and make into one column?
a = c("a", "b", "c", "d", "a")
b = c(1, 2, 3, 4, 3)
df <- cbind(a, b)
The preferred output would be:
variable Level
a a
a b
a c
a d
b 1
b 2
b 3
b 4
The sample data above is simple but the intent is to be able to use the answer for multiple data frame with different column names and data in them. Thank you.
Quick + scalable
Tidyr's gather and dplyr's distinct give you a quick way to get that structure. (I left the package prefixes on the function calls so you can see which function comes from which package, which I always forget.)
library(tidyverse)
a = c("a", "b", "c", "d", "a")
b = c(1, 2, 3, 4, 3)
data.frame(a,b) %>% tidyr::gather() %>% dplyr::distinct()
key value
1 a a
2 a b
3 a c
4 a d
5 b 1
6 b 2
7 b 3
8 b 4
We place the vectors in a list, get the unique elements of each, set the names with letters, and then stack the result into a data.frame:
d1 <- stack(setNames(lapply(list(a, b), unique), letters[1:2]))[2:1]
colnames(d1) <- c('variable', 'Level')
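Since the question asks for something reusable across data frames with different column names, the same stack idea can be wrapped in a small function. This is only a sketch; unique_levels is a made-up name, and every value is converted to character so that numeric and string columns can share one Level column.

```r
# Hypothetical helper: one row per (column, unique value) pair
unique_levels <- function(d) {
  # Unique values of each column, coerced to character
  vals <- lapply(d, function(col) as.character(unique(col)))
  out <- stack(vals)[2:1]              # columns: ind, values
  names(out) <- c("variable", "Level")
  out$variable <- as.character(out$variable)
  out
}

df <- data.frame(a = c("a", "b", "c", "d", "a"),
                 b = c(1, 2, 3, 4, 3))
res <- unique_levels(df)
res
```

Because the column names are taken from the data frame itself, nothing in the function has to change for other inputs.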
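Creating df (note that cbind on these vectors yields a character matrix, not a data frame):
a = c("a", "b", "c", "d", "a")
b = c(1, 2, 3, 4, 3)
df <- cbind(a, b)
Extracting the column names:
col_names <- colnames(df)
Extracting the data:
variable <- NULL
Level <- NULL
for (i in seq_along(col_names))
{
  # repeat the column name once per unique value in that column
  variable <- c(variable, rep(col_names[i], length(unique(df[, i]))))
  Level <- c(Level, unique(df[, i]))
}
The resulting output:
db <- cbind(variable, Level)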
db
variable Level
[1,] "a" "a"
[2,] "a" "b"
[3,] "a" "c"
[4,] "a" "d"
[5,] "b" "1"
[6,] "b" "2"
[7,] "b" "3"
[8,] "b" "4"
I have a data frame that contains a character column, col1, and a list column, col2.
myVector <- c("A","B","C","D")
myList <- list()
myList[[1]] <- c(1, 4, 6, 7)
myList[[2]] <- c(2, 7, 3)
myList[[3]] <- c(5, 5, 3, 9, 6)
myList[[4]] <- c(7, 9)
myDataFrame <- data.frame(row = c(1,2,3,4))
myDataFrame$col1 <- myVector
myDataFrame$col2 <- myList
myDataFrame
# row col1 col2
# 1 1 A 1, 4, 6, 7
# 2 2 B 2, 7, 3
# 3 3 C 5, 5, 3, 9, 6
# 4 4 D 7, 9
I want to unlist col2 while keeping, for each element of the vectors in the list, the corresponding value from col1. To phrase it differently, in commonly used data frame reshape terminology: the "wide" list column should be converted to a "long" format.
In the end I want two vectors, each of length length(unlist(myDataFrame$col2)). In code:
# unlist myList
unlist.col2 <- unlist(myDataFrame$col2)
unlist.col2
# [1] 1 4 6 7 2 7 3 5 5 3 9 6 7 9
# unlist myVector to obtain
# unlist.col1 <- ???
# unlist.col1
# [1] A A A A B B B C C C C C D D
I can't think of any straightforward way to get it.
You may also use unnest from package tidyr:
library(tidyr)
unnest(myDataFrame, col2)
# row col1 col2
# (dbl) (chr) (dbl)
# 1 1 A 1
# 2 1 A 4
# 3 1 A 6
# 4 1 A 7
# 5 2 B 2
# 6 2 B 7
# 7 2 B 3
# 8 3 C 5
# 9 3 C 5
# 10 3 C 3
# 11 3 C 9
# 12 3 C 6
# 13 4 D 7
# 14 4 D 9
You can use the "data.table" package to expand the whole data.frame and then extract the column of interest.
library(data.table)
## expand the entire data.frame (uncomment to see)
# as.data.table(myDataFrame)[, unlist(col2), by = list(row, col1)]
## expand and select the column of interest:
as.data.table(myDataFrame)[, unlist(col2), by = list(row, col1)]$col1
# [1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D"
In newer versions of R, you can now use the lengths function instead of the sapply(list, length) approach. The lengths function is considerably faster.
with(myDataFrame, rep(col1, lengths(col2)))
# [1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D"
Here, the idea is to first get the length of each list element using sapply and then use rep to replicate each value of col1 that many times:
l1 <- sapply(myDataFrame$col2, length)
unlist.col1 <- rep(myDataFrame$col1, l1)
unlist.col1
#[1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D"
Or, as suggested by @Ananda Mahto, the above could also be done with vapply:
with(myDataFrame, rep(col1, vapply(col2, length, 1L)))
#[1] "A" "A" "A" "A" "B" "B" "B" "C" "C" "C" "C" "C" "D" "D"
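If both vectors are wanted side by side, the rep/lengths idea can be combined with unlist to build the long data frame in base R. A minimal sketch, reusing the data from the question; the name long is made up for illustration.

```r
myDataFrame <- data.frame(row = c(1, 2, 3, 4))
myDataFrame$col1 <- c("A", "B", "C", "D")
myDataFrame$col2 <- list(c(1, 4, 6, 7), c(2, 7, 3), c(5, 5, 3, 9, 6), c(7, 9))

# Repeat each col1 value once per element of the matching list entry
long <- data.frame(
  col1 = rep(myDataFrame$col1, lengths(myDataFrame$col2)),
  col2 = unlist(myDataFrame$col2)
)
long
```

Other columns, such as row, can be carried along by repeating them with the same lengths(myDataFrame$col2) argument.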
I have the data.frame
df<-data.frame("Site.1" = c("A", "B", "C"),
"Site.2" = c("D", "B", "B"),
"Tsim" = c(2, 4, 7),
"Jaccard" = c(5, 7, 1))
# Site.1 Site.2 Tsim Jaccard
# 1 A D 2 5
# 2 B B 4 7
# 3 C B 7 1
I can get the unique levels for each column using
top.x<-unique(df[1:2,c("Site.1")])
top.x
# [1] A B
# Levels: A B C
top.y<-unique(df[1:2,c("Site.2")])
top.y
# [1] D B
# Levels: B D
How do I get the unique levels for both columns and turn them into a vector i.e:
v <- c("A", "B", "D")
v
# [1] "A" "B" "D"
top.xy <- unique(unlist(df[1:2, c("Site.1", "Site.2")]))
top.xy
# [1] A B D
# Levels: A B C D
Try union:
union(top.x, top.y)
# [1] "A" "B" "D"
union(unique(df[1:2, c("Site.1")]),
unique(df[1:2, c("Site.2")]))
# [1] "A" "B" "D"
You can get the unique levels for the first two columns:
de<- apply(df[,1:2],2,unique)
de
# $Site.1
# [1] "A" "B" "C"
# $Site.2
# [1] "D" "B"
Then you can take the symmetric difference of the two sets:
union(setdiff(de$Site.1,de$Site.2), setdiff(de$Site.2,de$Site.1))
# [1] "A" "C" "D"
If you're interested in just the first two rows (as in your example):
de<- apply(df[1:2,1:2],2,unique)
de
# Site.1 Site.2
# [1,] "A" "D"
# [2,] "B" "B"
union(de[,1],de[,2])
# [1] "A" "B" "D"
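Note that apply returns a matrix in the last example only because both columns happen to have the same number of unique values; otherwise it returns a list and de[,1] fails. A sketch that avoids this, and that works whether the columns are factors or character vectors, is shown below (site_cols is a made-up name).

```r
df <- data.frame(Site.1 = c("A", "B", "C"),
                 Site.2 = c("D", "B", "B"),
                 Tsim = c(2, 4, 7),
                 Jaccard = c(5, 7, 1))

site_cols <- c("Site.1", "Site.2")

# Convert each selected column to character, flatten, then deduplicate
v <- unique(unlist(lapply(df[1:2, site_cols], as.character),
                   use.names = FALSE))
v
```

Changing the row range or adding more names to site_cols needs no other changes.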
Suppose I have a list with observations:
foo <- list(c("C", "E", "A", "F"), c("B", "D", "B", "A", "C"), c("B",
"C", "C", "F", "A", "F"), c("D", "A", "A", "D", "D", "F", "B"
))
> foo
[[1]]
[1] "C" "E" "A" "F"
[[2]]
[1] "B" "D" "B" "A" "C"
[[3]]
[1] "B" "C" "C" "F" "A" "F"
[[4]]
[1] "D" "A" "A" "D" "D" "F" "B"
And a vector with each unique element:
vec <- LETTERS[1:6]
> vec
[1] "A" "B" "C" "D" "E" "F"
I want to obtain a data frame with the counts of each element of vec in each element of foo. I can do this with plyr in a very ugly unvectorized way:
> ldply(foo,function(x)sapply(vec,function(y)sum(y==x)))
A B C D E F
1 1 0 1 0 1 1
2 1 2 1 1 0 0
3 1 1 2 0 0 2
4 2 1 0 3 0 1
But that's obviously slow. How can this be done faster? I know of table() but haven't really figured out how to use it due to 0-counts in some of the elements of foo.
One solution (off the top of my head):
# convert foo to a list of factors
lfoo <- lapply(foo, factor, levels=LETTERS[1:6])
# apply table() to each list element
t(sapply(lfoo, table))
A B C D E F
[1,] 1 0 1 0 1 1
[2,] 1 2 1 1 0 0
[3,] 1 1 2 0 0 2
[4,] 2 1 0 3 0 1
or with the reshape package:
library(reshape)
cast(melt(foo), L1 ~ value, length)[-1]
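A single call to table can also do the counting, without looping over the list elements. The sketch below flattens foo, tags each value with the index of the list element it came from, and uses a factor with fixed levels so that zero counts are kept; the names group, counts and res are made up for illustration.

```r
foo <- list(c("C", "E", "A", "F"),
            c("B", "D", "B", "A", "C"),
            c("B", "C", "C", "F", "A", "F"),
            c("D", "A", "A", "D", "D", "F", "B"))
vec <- LETTERS[1:6]

# Index of the list element each flattened value belongs to
group <- rep(seq_along(foo), lengths(foo))

# Cross-tabulate group against value; fixed levels keep the 0-count columns
counts <- table(group, factor(unlist(foo), levels = vec))

# Convert the contingency table to a data frame of counts
res <- as.data.frame.matrix(counts)
res
```

I have not benchmarked this against the other approaches, so treat any speed advantage as something to measure on your own data.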