Mapping column values - r

I would like to transform the values of a given column using some mapping function. Example:
df <- data.frame(A = 1:5, B = sample(1:20, 10))
df
A B
1 1 17
2 2 5
3 3 3
4 4 11
5 5 19
6 1 16
7 2 4
8 3 7
9 4 6
10 5 9
My goal is to map all elements of column A as follows:
1 -> "tt"
2 -> "ff"
3 -> "ss"
4 -> "fs"
5 -> "sf"
I have written the following:
mappingList <- c("tt", "ff", "ss", "fs", "sf")
df$A <- unlist(lapply(df$A, function(x){replace(x, x>0, mappingList[x])}))
df
A B
1 tt 17
2 ff 5
3 ss 3
4 fs 11
5 sf 19
6 tt 16
7 ff 4
8 ss 7
9 fs 6
10 sf 9
The code above works fine.
Now let's assume another dataframe where column A is not made of integers 1,2,3,4,5 but rather any other 'generic' items, say:
df <- data.frame(A = paste("str",1:5,sep=""), B = sample(1:20, 10))
or
df <- data.frame(A = seq(5, 25, by=5), B = sample(1:20, 10))
Question: How would you write the mapping?

Did you look at factor?
df$A_2 <- factor(df$A, levels = 1:5, labels = c("tt", "ff", "ss", "fs", "sf"))
df
# A B A_2
# 1 1 17 tt
# 2 2 5 ff
# 3 3 3 ss
# 4 4 11 fs
# 5 5 19 sf
# 6 1 16 tt
# 7 2 4 ff
# 8 3 7 ss
# 9 4 6 fs
# 10 5 9 sf
Basically, your levels argument should have the original values to match, and your labels argument should have the replacement values.
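The same idea carries over to non-integer keys. Here is a minimal sketch, assuming the keys are the strings str1 to str5 from the question. The names keys and labs and the unmatched value "str9" are made up for illustration. Values that are not listed in levels come back as NA:

```r
# Hypothetical keys and replacement labels, given in matching order
keys <- paste0("str", 1:5)
labs <- c("tt", "ff", "ss", "fs", "sf")

# "str9" is not among the levels, so it is mapped to NA
A <- c("str3", "str1", "str9")
mapped <- as.character(factor(A, levels = keys, labels = labs))
mapped
```

Wrapping the result in as.character() is optional; drop it if a factor column is what you want.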
You could also create a look-up table with a named vector.
Example:
df <- data.frame(A = paste("str",1:5,sep=""), B = sample(1:20, 10))
# names are the original values, elements are the replacements
NamedVec <- setNames(c("tt", "ff", "ss", "fs", "sf"), paste("str",1:5,sep=""))
NamedVec
# str1 str2 str3 str4 str5
# "tt" "ff" "ss" "fs" "sf"
NamedVec[as.character(df$A)]
# str1 str2 str3 str4 str5 str1 str2 str3 str4 str5
# "tt" "ff" "ss" "fs" "sf" "tt" "ff" "ss" "fs" "sf"
unname(NamedVec[as.character(df$A)])
# [1] "tt" "ff" "ss" "fs" "sf" "tt" "ff" "ss" "fs" "sf"

Try:
mappingList[df$A]
#[1] "tt" "ff" "ss" "fs" "sf" "tt" "ff" "ss" "fs" "sf"
For the two other datasets:
df1 <- data.frame(A = paste("str",1:5,sep=""), B = sample(1:20, 10))
df2 <- data.frame(A = seq(5, 25, by=5), B = sample(1:20, 10))
mappingList[as.numeric(factor(df1$A))]
#[1] "tt" "ff" "ss" "fs" "sf" "tt" "ff" "ss" "fs" "sf"
mappingList[as.numeric(factor(df2$A))]
#[1] "tt" "ff" "ss" "fs" "sf" "tt" "ff" "ss" "fs" "sf"
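The factor-based indexing above relies on the sorted order of the distinct values lining up with mappingList. A variant that does not depend on sort order is to look each value up with match(). This is only a sketch; the vectors from and to are names introduced here for illustration and do not appear in the answers above:

```r
# Hypothetical lookup: from[i] is replaced by to[i]
from <- seq(5, 25, by = 5)
to   <- c("tt", "ff", "ss", "fs", "sf")

df2 <- data.frame(A = rep(from, 2), B = 1:10)

# match() returns the position of each A value in `from`;
# values absent from `from` give NA
df2$A_mapped <- to[match(df2$A, from)]
df2
```

The same two lines work unchanged when from holds strings such as "str1" instead of numbers.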

Related

How to read data

I am quite new to the R language. I want to read the following input, but I have no idea how to proceed:
m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
I wanted to read the following input into 6 categories
m
n
c(1,5,9,13)
c(2,6,10,14)
c(3,7,11,15)
c(4,8,12,16)
I tried the following code but it doesn't seem to work
f <- file("stdin")
r <- file("stdin")
data1 = scan(file = r,skip = 1)
data1 <- split(data1, " ")
data2 = scan(file = f ,nlines =1)
data2 <- split(data2, " ")
o1 = data2[1]
o2 = data2[2]
It always seems to give
"Read 0 items"
for data2.
Use read.table twice where Lines is given in the Note at the end.
mn <- read.table(text = Lines, nrows = 1, as.is = TRUE)
DF <- read.table(text = Lines, skip = 1)
giving:
mn
## V1 V2
## 1 m n
mn[[1]]
## [1] "m"
mn$V1 # same
## [1] "m"
DF
## V1 V2 V3 V4
## 1 1 2 3 4
## 2 5 6 7 8
## 3 9 10 11 12
## 4 13 14 15 16
DF[[1]]
## [1] 1 5 9 13
DF$V1 # same
## [1] 1 5 9 13
A list made up of the 6 components is:
unname( c(mn, DF) )
## [[1]]
## [1] "m"
##
## [[2]]
## [1] "n"
##
## [[3]]
## [1] 1 5 9 13
##
## [[4]]
## [1] 2 6 10 14
##
## [[5]]
## [1] 3 7 11 15
##
## [[6]]
## [1] 4 8 12 16
scan
If you prefer to use scan, as in the question, then assuming that the lines all have the same number of fields except for the first line, get the field counts, one per line, into counts and then use scan using those numbers:
counts <- count.fields(textConnection(Lines))
c( scan(text = Lines, what = "", nmax = counts[1], quiet = TRUE),
   scan(text = Lines, what = as.list(numeric(counts[2])), skip = 1, quiet = TRUE) )
## [[1]]
## [1] "m"
##
## [[2]]
## [1] "n"
##
## [[3]]
## [1] 1 5 9 13
##
## [[4]]
## [1] 2 6 10 14
##
## [[5]]
## [1] 3 7 11 15
##
## [[6]]
## [1] 4 8 12 16
Note
Assume the input is:
Lines <- "m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16"
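Another possible route is to read all the lines once and then parse the header and the body separately, which avoids opening the same input twice as in the question. The sketch below uses a text connection on Lines so that it can be run as is. Swapping in file("stdin") under Rscript is an assumption about the asker's setup that has not been verified here, and the variable names are made up:

```r
Lines <- "m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16"

# Read every line exactly once
con <- textConnection(Lines)
all_lines <- readLines(con)
close(con)

# First line: split on whitespace to get "m" and "n"
header <- strsplit(all_lines[1], " +")[[1]]

# Remaining lines: parse as a table, one column per field
body <- read.table(text = all_lines[-1])

# Combine into a list of 6 components
result <- c(as.list(header), unname(as.list(body)))
str(result)
```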

Match vectors in sequence

I have 2 vectors.
x=c("a", "b", "c", "d", "a", "b", "c")
y=structure(c(1, 2, 3, 4, 5, 6, 7, 8), .Names = c("a", "e", "b",
"c", "d", "a", "b", "c"))
I would like to match a to a, b to b in sequence accordingly, so that x[2] matches y[3] rather than y[7]; and x[5] matches y[6] rather than y[1], so on and so forth.
lapply(x, function(z) grep(z, names(y), fixed=T))
gives:
[[1]]
[1] 1 6
[[2]]
[1] 3 7
[[3]]
[1] 4 8
[[4]]
[1] 5
[[5]]
[1] 1 6
[[6]]
[1] 3 7
[[7]]
[1] 4 8
which matches all instances. How do I get this sequence:
1 3 4 5 6 7 8
So that elements in x can be mapped to the corresponding values in y accordingly?
You are actually looking for pmatch
pmatch(x,names(y))
[1] 1 3 4 5 6 7 8
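For reference, a small self-contained sketch of this approach. It relies on the default duplicates.ok = FALSE, under which an entry of the table is excluded once it has been matched. Keep in mind that pmatch also accepts unique partial matches, which may or may not be wanted; the second call, with its made-up inputs, only illustrates that behaviour:

```r
x <- c("a", "b", "c", "d", "a", "b", "c")
y <- c(a = 1, e = 2, b = 3, c = 4, d = 5, a = 6, b = 7, c = 8)

# Each name in y can be used at most once (duplicates.ok = FALSE)
idx <- pmatch(x, names(y))
idx

# Values of y in the matched order
y[idx]

# Caveat: a unique partial match also counts as a match
partial <- pmatch("med", c("mean", "median"))
partial
```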
You can change the names attributes according to the number of times each element appeared and then subset y:
x2 <- paste0(x, ave(x, x, FUN=seq_along))
#[1] "a1" "b1" "c1" "d1" "a2" "b2" "c2"
names(y) <- paste0(names(y), ave(names(y), names(y), FUN=seq_along))
y[x2]
#a1 b1 c1 d1 a2 b2 c2
# 1 3 4 5 6 7 8
Another option using Reduce
Reduce(function(v, k) y[-seq_len(v)][k],
x=x[-1L],
init=y[x[1L]],
accumulate=TRUE)
Well, I did it with a for-loop
# Initialise the result vector with the same length as x.
answer <- numeric(length(x))
for (i in seq_along(x)) {
  # Match the ith element of x against the names of y.
  answer[i] <- match(x[i], names(y))
  # Blank out the name at the matched position so that the next
  # occurrence of the same value picks up the following index.
  names(y)[answer[i]] <- ""
}
answer
#[1] 1 3 4 5 6 7 8
Another possibility:
l <- lapply(x, grep, x = names(y), fixed = TRUE)
i <- as.integer(ave(x, x, FUN = seq_along))
mapply(`[`, l, i)
which gives:
[1] 1 3 4 5 6 7 8
Similar solution to Ronak, but it does not persist changes to y
yFoo<-names(y)
sapply(x,function(u){res<-match(u,yFoo);yFoo[res]<<-"foo";return(res)})
Result
#a b c d a b c
#1 3 4 5 6 7 8
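The occurrence-counting idea from the ave answer can also be wrapped in a small helper that leaves y untouched and uses exact matching only. This is a sketch: the function name match_in_sequence is hypothetical, and it assumes the values never contain a tab character, which is used as a separator:

```r
# Hypothetical helper: the k-th occurrence of a value in x is matched
# to the k-th occurrence of that value in table (NA if there is none).
match_in_sequence <- function(x, table) {
  # Tag each value with its running occurrence number, e.g. "a\t1", "a\t2"
  tag <- function(v) paste(v, ave(seq_along(v), v, FUN = seq_along), sep = "\t")
  match(tag(x), tag(table))
}

x <- c("a", "b", "c", "d", "a", "b", "c")
y <- c(a = 1, e = 2, b = 3, c = 4, d = 5, a = 6, b = 7, c = 8)

res <- match_in_sequence(x, names(y))
res

# A third "a" has no third counterpart in names(y)
res_extra <- match_in_sequence(c("a", "a", "a"), names(y))
res_extra
```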

summarise information reported by a named vector

I want to summarise the information carried by a named character vector, see below:
X<- c("BB", "BB", "CC", "CC", "CC", "EE", "EE")
names(X) <- c(1, 2, 2, 2, 3, 3, 4)
The character vector is shown below:
X
1 2 2 2 3 3 4
"BB" "BB" "CC" "CC" "CC" "EE" "EE"
"CC" at position 2 occurs twice; this should be reported in the Times line.
Expected output:
1 2 2 3 3 4 # Position
1 1 2 1 1 1 # Times
"BB" "BB" "CC" "CC" "EE" "EE" # Character
Tried:
table (names(X))
data.frame(X)
We can use group by operation on the names of 'X' with the values of 'X' and get the frequency
library(data.table)
data.table(X, nm = names(X))[, .N, .(X, nm)]
# X nm N
#1: BB 1 1
#2: BB 2 1
#3: CC 2 2
#4: CC 3 1
#5: EE 3 1
#6: EE 4 1
Or similar option with tidyverse
library(dplyr)
tibble(X, nm = names(X)) %>%
count(X, nm)
Or with aggregate from base R
aggregate(cbind(n = rep(1, length(X))) ~ X + names(X), FUN = sum)
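A further base R possibility is to cross-tabulate the values against their names and keep only the combinations that actually occur. This is a sketch; the column names Character and Position are chosen here to mirror the expected output in the question:

```r
X <- c("BB", "BB", "CC", "CC", "CC", "EE", "EE")
names(X) <- c(1, 2, 2, 2, 3, 3, 4)

# Count every (value, name) combination
counts <- as.data.frame(table(Character = X, Position = names(X)),
                        stringsAsFactors = FALSE)

# Drop combinations that never occur, then order by position
counts <- counts[counts$Freq > 0, ]
counts <- counts[order(counts$Position, counts$Character), ]
counts
```

Note that Position is stored as character here because names are always strings; convert with as.integer() if numeric positions are needed.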

Recursively set dimnames on a list of matrices

On a list of matrices, I'd like to set only the colnames and leave the rownames as NULL. The matrices all have different dimensions. Unlike this example, the names are specific to each matrix.
provideDimnames gets me in the ballpark, but I'm having trouble telling it to ignore the NULL row names, and only set the column names. Here are my attempts.
> L <- list(matrix(1:6, 2), matrix(1:20, 5))
> dimnm <- list(list(NULL, letters[1:3]), list(NULL, letters[1:4]))
> lapply(L, provideDimnames, base = dimnm)
# Error in make.unique(base[[ii]][1L + (ss%%M[ii])], sep = sep) :
# 'names' must be a character vector
> lapply(L, provideDimnames, base = list(dimnm))
# Error in make.unique(base[[ii]][1L + (ss%%M[ii])], sep = sep) :
# 'names' must be a character vector
> lapply(L, provideDimnames, base = list(letters))
# [[1]]
# a b c
# a 1 3 5
# b 2 4 6
#
# [[2]]
# a b c d
# a 1 6 11 16
# b 2 7 12 17
# c 3 8 13 18
# d 4 9 14 19
# e 5 10 15 20
Almost, but I want [n,] for the row names. The desired result is:
> dimnames(L[[1]]) <- list(NULL, letters[1:3])
> dimnames(L[[2]]) <- list(NULL, letters[1:4])
> L
# [[1]]
# a b c
# [1,] 1 3 5
# [2,] 2 4 6
#
# [[2]]
# a b c d
# [1,] 1 6 11 16
# [2,] 2 7 12 17
# [3,] 3 8 13 18
# [4,] 4 9 14 19
# [5,] 5 10 15 20
> lapply(L, provideDimnames, base = list(NULL, letters))
# Error in make.unique(base[[ii]][1L + (ss%%M[ii])], sep = sep) :
# 'names' must be a character vector
> lapply(L, `colnames<-`, , letters)
# Error in FUN(X[[1L]], ...) :
# unused argument (c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k",
# "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"))
Is there a way to do this with provideDimnames()? setNames() wouldn't accept a list for the dim-names either.
How about something like this?
L <- list(matrix(1:6, 2), matrix(1:20, 5))
nms <- list(letters[1:3], letters[23:26])
mapply(function(X,Y) {colnames(X) <-Y; X}, L, nms, SIMPLIFY = FALSE)
[[1]]
a b c
[1,] 1 3 5
[2,] 2 4 6
[[2]]
w x y z
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
You can do this relatively easily but you are complicating it by trying to do both dimnames where really you just want to fiddle with the column names. I would go about it this way:
## different dimnames; list of only the colnames
dimnm <- list(letters[1:3], letters[1:4])
## function to lapply which does the change
cnames <- function(i, lmat, names) {
  colnames(lmat[[i]]) <- names[[i]]
  lmat[[i]]
}
## do the change
L2 <- lapply(seq_along(L), cnames, lmat = L, names = dimnm)
L2
Gives us:
> L2
[[1]]
a b c
[1,] 1 3 5
[2,] 2 4 6
[[2]]
a b c d
[1,] 1 6 11 16
[2,] 2 7 12 17
[3,] 3 8 13 18
[4,] 4 9 14 19
[5,] 5 10 15 20
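A more compact variant, close in spirit to the `colnames<-` attempt in the question, is to pair each matrix with its own vector of names using Map. This is a sketch, not one of the answers above; Map always returns a list, so there is no risk of the result being simplified:

```r
L <- list(matrix(1:6, 2), matrix(1:20, 5))
nms <- list(letters[1:3], letters[1:4])

# Calls `colnames<-`(L[[i]], nms[[i]]) for each i; row names stay NULL
L2 <- Map(`colnames<-`, L, nms)
L2
```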

Find names of columns which contain missing values

I want to find all the names of columns with NA or missing data and store these column names in a vector.
# create matrix
a <- c(1,2,3,4,5,NA,7,8,9,10,NA,12,13,14,NA,16,17,18,19,20)
cnames <- c("aa", "bb", "cc", "dd", "ee")
mymatrix <- matrix(a, nrow = 4, ncol = 5, byrow = TRUE)
colnames(mymatrix) <- cnames
mymatrix
# aa bb cc dd ee
# [1,] 1 2 3 4 5
# [2,] NA 7 8 9 10
# [3,] NA 12 13 14 NA
# [4,] 16 17 18 19 20
The desired result: columns "aa" and "ee".
My attempt:
bad <- character()
for (j in 1:4){
tmp <- which(colnames(mymatrix[j, ]) %in% c("", "NA"))
bad <- tmp
}
However, I keep getting integer(0) as my output. Any help is appreciated.
Like this?
colnames(mymatrix)[colSums(is.na(mymatrix)) > 0]
# [1] "aa" "ee"
Or as suggested by @thelatemail:
names(which(colSums(is.na(mymatrix)) > 0))
# [1] "aa" "ee"
R 3.1 introduced an anyNA function, which is more convenient and faster:
colnames(mymatrix)[ apply(mymatrix, 2, anyNA) ]
Old answer:
If it's a very long matrix, apply + any can short circuit and run a bit faster.
apply(is.na(mymatrix), 2, any)
# aa bb cc dd ee
# TRUE FALSE FALSE FALSE TRUE
colnames(mymatrix)[apply(is.na(mymatrix), 2, any)]
# [1] "aa" "ee"
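If you have a data frame with non-numeric columns, this solution is more general (building on previous answers). Note that sapply iterates over columns only for a data frame; on a matrix it iterates over the individual elements, so the matrix from the question is converted first:
R 3.1 +
names(which(sapply(as.data.frame(mymatrix), anyNA)))
or
names(which(sapply(as.data.frame(mymatrix), function(x) any(is.na(x)))))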
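The attempt in the question also checks for empty strings. A sketch of how that could be folded in for a data frame with mixed column types follows. The data frame mydf and the helper has_missing are made up for illustration, and treating "" as missing is an assumption about what the asker wants:

```r
# Hypothetical data frame with numeric and character columns
mydf <- data.frame(
  id    = 1:3,
  name  = c("x", NA, "z"),
  note  = c("", "ok", "ok"),
  score = c(1.5, 2, 3),
  stringsAsFactors = FALSE
)

# TRUE if the column has an NA or, for character columns, an empty string
has_missing <- function(col) {
  anyNA(col) || (is.character(col) && any(col == "", na.rm = TRUE))
}

# vapply guarantees one logical value per column
bad <- names(mydf)[vapply(mydf, has_missing, logical(1))]
bad
```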

Resources