Change character column into numeric - r

I have a data frame df with a column X that holds three different values, a, b and c, as characters. For example
df <- data.frame(X = c("a","a","a","b","b","c","c","c","c"), Y = ....)
I want to transform it into a = 1, b = 2 and c = 3 as numerics.
I first tried
df$X = as.factor(df$X)
transform(df, X = as.numeric(X))
where now I have a factor with three levels and a=1, b=2 and c=3. However the problem is that I need the column X as numeric. If I try
transform(df, X = as.numeric(as.character(X)))
or
transform(df, X = as.numeric(levels(X))[X])
I get NA for all the inputs (a, b, c).
How can I get the column X with numeric 1, 2, 3?

The solution by @jay.sf, which first encodes the characters as a factor, is quite elegant because it generalizes to arbitrary strings and not just single characters.
If the codes are single characters, there is another possible solution, which uses the builtin constant letters and returns the position therein:
sapply(df$X, function(x) {which(x == letters)})
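To make both routes concrete, here is a minimal sketch. It assumes the values are stored as quoted strings and leaves out Y, which is elided in the question. The new column names X_factor and X_letters are only for illustration. as.numeric(as.character(X)) returns NA because the labels "a", "b" and "c" are not numbers; as.numeric() applied directly to the factor returns the integer codes instead.

```r
# Hypothetical example data; Y is omitted because it is elided in the question
df <- data.frame(X = c("a", "a", "a", "b", "b", "c", "c", "c", "c"),
                 stringsAsFactors = FALSE)

# Factor route: levels are sorted alphabetically, so a = 1, b = 2, c = 3
df$X_factor <- as.numeric(factor(df$X))

# Lookup route: position of each value in the built-in constant `letters`
# (vectorised, so no sapply() is needed)
df$X_letters <- match(df$X, letters)

str(df)
```

Note that transform() returns a modified copy, so its result has to be assigned back (df <- transform(...)) for the change to persist.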

Related

R regexp for odd sorting of a char vector

I have several hundred files that need their columns sorted in a convoluted way. Imagine a character vector x which is the result of names(foo) where foo is a data.frame:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
I'd like to have it ordered according to the following rule: First, alphabetical for anything starting with "Component". Second, alphabetical for anything remaining starting with "C" and a number. Third anything remaining in alphabetical order.
For x that would be:
x[c(3,4,6,5,9,8,2,7,1)]
Is this a regexp kind of task? And does one use match? Each file will have a different number of columns (so x will be of varying lengths). Any tips appreciated.
You can achieve that with the function order from base-r:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
order(
  !startsWith(x, "Component"), # FALSE (sorts first) if it starts with "Component", TRUE otherwise
  !grepl("^C\\d", x),          # FALSE (sorts first) if it starts with C<NUMBER>, TRUE otherwise
  x                            # alphabetical within each group
)
# output: 3 4 6 5 9 8 2 7 1
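Since each file has a different set of columns, it may help to wrap the call in a small function and index the data frame with the result. This is only a sketch: component_order is a made-up helper name, and foo is a one-row stand-in for one of the real files.

```r
x <- c("x1", "i2", "Component.1", "Component.10", "Component.143", "Component.13",
       "r4", "A", "C16:1n-7")

# Hypothetical helper: returns the column order for any vector of names
component_order <- function(nms) {
  order(!startsWith(nms, "Component"),  # "Component..." columns first
        !grepl("^C\\d", nms),           # then C<NUMBER> columns
        nms)                            # alphabetical within each group
}

# Stand-in data frame with one row; names are assigned directly so they
# are not altered by name checking
foo <- as.data.frame(matrix(seq_along(x), nrow = 1))
names(foo) <- x

# Reorder the columns
foo <- foo[component_order(names(foo))]
names(foo)
```

The same function can then be applied to every file, for example inside lapply() over a list of data frames.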
A brute-force solution using only base R:
first = sort(x[grepl('^Component', x)])
second = sort(x[grepl('^C\\d', x)])
third = sort(setdiff(x, c(first, second)))
c(first, second, third)
We can split it into different elements and then use mixedsort from gtools:
v1 <- c(gtools::mixedsort(grep("Component", x, value = TRUE)),
gtools::mixedsort(grep("^C\\d+", x, value = TRUE)))
c(v1, gtools::mixedsort(x[!x %in% v1]))
#[1] "Component.1" "Component.10" "Component.13" "Component.143" "C16:1n-7" "A" "i2" "r4"
#[9] "x1"
Another option uses select from dplyr, assuming that these are the column names of the data.frame:
library(dplyr)
df1 %>%
select(mixedsort(starts_with('Component')),
mixedsort(names(.)[matches("^C\\d+")]),
gtools::mixedsort(names(.)[everything()]))
If it is just the order of occurrence
df1 %>%
select(starts_with('Component'), matches('^C\\d+'), sort(names(.)[everything()]))
data
set.seed(24)
df1 <- as.data.frame(matrix(rnorm(5 * 9), ncol = 9,
dimnames = list(NULL, x)))

convert/combine 2 column dataframe to 1 column dataframe- R

The script below builds X, Y data stored in two columns (note that rbind returns a character matrix here, not a data.frame):
a1 <- as.character(c(3456,2569))
a2 <- as.character(c(956,569))
a3 <- as.character(c(156,269))
mydf <- rbind(a1, a2, a3)
How can I store it in a data.frame with one column in the format “X,Y” and append decimal zeros to each X and Y (as characters)?
The output should be
"3456.000, 2569.000"
"956.000, 569.000"
"156.000, 269.000"
Something like this could work:
data.frame(col1 = apply(mydf, 1, function(x) paste(paste0(x, '.000'), collapse = ', ')))
# col1
#a1 3456.000, 2569.000
#a2 956.000, 569.000
#a3 156.000, 269.000
apply iterates per row of your matrix and firstly creates the number with the zeroes (that's paste0) and then merges everything in one comma separated string (that's paste).
Are all the numbers integers, or do some of them already have a decimal point? If it's the latter, you might want to do something like
sprintf("%.3f, %.3f", as.numeric(mydf[,1]), as.numeric(mydf[,2]))
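As a rough illustration of the difference, here is the sprintf route wrapped into a one-column data frame. The value "156.5" is a made-up input, added only to show what happens when a number already has a decimal part; pasting '.000' onto it would give "156.5.000".

```r
# Same shape as the question's matrix; "156.5" is a hypothetical value
mydf <- rbind(a1 = c("3456", "2569"),
              a2 = c("956", "569"),
              a3 = c("156.5", "269"))

# Convert to numeric, then format each value with exactly three decimals
out <- data.frame(
  col1 = sprintf("%.3f, %.3f", as.numeric(mydf[, 1]), as.numeric(mydf[, 2])),
  row.names = rownames(mydf),
  stringsAsFactors = FALSE
)
out
```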

r: subsetting with square brackets not working

I made data frame called x:
a b
1 2
3 NA
3 32
21 7
12 8
When I run
y <- x["a">2,]
The object y returned is identical to x. If I run
y <- x["a" == 1,]
y is an empty frame.
I made sure that the names of the x data frame have no white spaces (I named them myself with names()) and also that a and b are numeric.
PS: If I try
y <- x["a">2]
y is also returned as identical to x.
You're making an error in referencing the column of your data.frame x.
"a">2 means character a bigger than two, not variable a of data.frame x. You need to add either x$a or x["a"] to reference your data.frame column.
try
y <- x[x$a >2 ,]
or
y <- x[x["a"] >2 ,]
or, more clearly,
ix <- x["a"] > 2
y <- x[ix,]
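To see why the original calls behave the way they do, it helps to evaluate the index expressions on their own. The sketch below rebuilds x from the table in the question. When a string is compared with a number, R converts the number to a string first, so "a" > 2 is a single TRUE (recycled to select every row) and "a" == 1 is a single FALSE (selecting none).

```r
# Data frame from the question
x <- data.frame(a = c(1, 3, 3, 21, 12), b = c(2, NA, 32, 7, 8))

"a" > 2    # one logical value: 2 becomes "2", then the strings are compared
"a" == 1   # one logical value: "a" is compared with "1"

nrow(x["a" > 2, ])    # every row is kept
nrow(x["a" == 1, ])   # no rows are kept

# Referencing the column gives one logical value per row
x$a > 2
x[x$a > 2, ]
```

The same recycling explains the PS: x["a" > 2] is x[TRUE], which selects every column.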
A simple alternative would be using data.table
library(data.table)
setDT(x)
y <- x[ a > 2, ]
y <- x[ a == 1, ]

R subset vector when treated as strings

I have a large data frame where I've forced my vectors into strings (using lapply and toString) so they fit into a data frame, and now I can't check whether one column is a subset of the other. Is there a simple way to do this?
X <- data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
X
y z
1 ABC ABC
2 A A,B,C
all(X$y %in% X$z)
[1] FALSE
(X$y[1] %in% X$z[1])
[1] TRUE
(X$y[2] %in% X$z[2])
[1] FALSE
I need to treat each y and z string value as a vector (comma separated) again and then check if y is a subset of z.
In the above case, A is a subset of A,B,C. However, because I've treated both as strings, it doesn't work.
In the above, y holds a single value in each row, while z holds 1 and 3 values. The data frame I'll be testing has 10,000 rows; y will have 1-5 values per row and z 1-100 per row. It looks like the 1-5 values are always a subset of z, but I'd like to check.
df = data.frame(y=c("ABC","A"), z=c("ABC","A,B,C"))
apply(df, 1, function(x) {            # row-wise operation
  y <- unlist(strsplit(x[1], ","))    # split the y string, in case it contains ","
  z <- unlist(strsplit(x[2], ","))    # split the z string the same way
  all(y %in% z)                       # TRUE only if every element of y is present in z
})
#[1] TRUE TRUE
# Case 2: changed the data. You will have to define whether you want a perfect subset or not; the code can be updated accordingly.
df = data.frame(y=c("ABC","A,B,D"), z=c("ABC","A,B,C"))
#[1] TRUE FALSE
I think it might work better for you not to use your lapply and toString combination, but to store the lists in your data frame. For this purpose, I find the tbl_df (as found in the tibble package) more friendly, although I believe data.table objects can do this as well (someone correct me if I'm wrong).
library(tibble)
y_char <- list("ABC", "A")
z_char <- list("ABC", c("A", "B", "C"))
X <- tibble(y = y_char,   # tibble() replaces the deprecated data_frame()
            z = z_char)
Notice that when you print X now, your entries in each row of the tibble are entries from the list. Now we can use mapply to do pairwise comparison.
# All y in z
mapply(function(x, y) all(x %in% y),
X$y,
X$z)
# All z in y
mapply(function(x, y) all(y %in% x),
X$y,
X$z)
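If the strings have already been built and going back to list columns is not practical, the same pairwise check can be sketched in base R by splitting both columns first. The third row below is the extra case from the earlier answer, and the column name y_in_z is made up. The split pattern ",\\s*" is an assumption: it also accepts the ", " separator that toString produces.

```r
X <- data.frame(y = c("ABC", "A", "A,B,D"),
                z = c("ABC", "A,B,C", "A,B,C"),
                stringsAsFactors = FALSE)

# Split every string back into a character vector (one list element per row)
y_list <- strsplit(X$y, ",\\s*")
z_list <- strsplit(X$z, ",\\s*")

# Row by row: is every element of y contained in z?
X$y_in_z <- mapply(function(y, z) all(y %in% z), y_list, z_list)
X

# Overall check across all rows
all(X$y_in_z)
```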

apply treats numbers as characters

I couldn't find a solution for this problem online, as simple as it seems.
Here it is:
#Construct test dataframe
tf <- data.frame(1:3,4:6,c("A","A","A"))
#Try the apply function I'm trying to use
test <- apply(tf,2,function(x) if(is.numeric(x)) mean(x) else unique(x)[1])
#Look at the output--all columns treated as character columns...
test
#Look at the format of the original data--the first two columns are integers.
str(tf)
In general terms, I want to differentiate what function I apply over a row/column based on what type of data that row/column contains.
Here, I want a simple mean if the column is numeric and the first unique value if the column is a character column. As you can see, apply treats all columns as characters the way I've written this function.
Just write a specialised function and put it within sapply... don't use apply(dtf, 2, fun). Besides, your character ain't so characterish as you may think: before R 4.0.0, data.frame() converted strings to factors by default - run getOption("stringsAsFactors") and see for yourself.
sapply(tf, class)
X1.3 X4.6 c..A....A....A..
"integer" "integer" "factor"
sapply(tf, storage.mode)
X1.3 X4.6 c..A....A....A..
"integer" "integer" "integer"
EDIT
Or even better - use lapply:
fn <- function(x) {
  if (is.numeric(x) && !is.factor(x)) {  # && because the condition is a single value
    mean(x)
  } else if (is.character(x)) {
    unique(x)[1]
  } else if (is.factor(x)) {
    as.character(x)[1]
  }
}
dtf <- data.frame(a = 1:3, b = 4:6, c = rep("A", 3), stringsAsFactors = FALSE)
dtf2 <- data.frame(a = 1:3, b = 4:6, c = rep("A", 3), stringsAsFactors = TRUE)
as.data.frame(lapply(dtf, fn))
a b c
1 2 5 A
as.data.frame(lapply(dtf2, fn))
a b c
1 2 5 A
I find the numcolwise and catcolwise functions from the plyr package useful here, for a syntactically simple solution:
First let's load plyr and name the columns, to avoid ugly column names when doing the aggregation:
library(plyr)
tf <- data.frame(a = 1:3, b = 4:6, d = c("A","A","A"))
Then you get your desired result with this one-liner:
cbind(numcolwise(mean)(tf), catcolwise(function(z) unique(z)[1])(tf))
a b d
1 2 5 A
Explanation: numcolwise(f) converts its argument (in this case f is the mean function) into a function that takes a data frame and applies f only to its numeric columns. Similarly, catcolwise converts its function argument into a function that operates only on the categorical columns.
You want to use lapply() or sapply(), not apply(). A data.frame is a list under the hood, which apply will try to convert to a matrix before doing anything. Since at least one column in your data frame is character, every other column also gets coerced to character in forming that matrix.
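The coercion is easy to check directly. The sketch below uses named columns (a, b, d) rather than the auto-generated names from the question, and sets stringsAsFactors = FALSE so the third column is character on any R version.

```r
tf <- data.frame(a = 1:3, b = 4:6, d = c("A", "A", "A"),
                 stringsAsFactors = FALSE)

# What apply() sees: the data frame converted to a matrix
m <- as.matrix(tf)
typeof(m)                 # the whole matrix has one type

# So every column fails the numeric test inside apply()
apply(tf, 2, is.numeric)

# lapply() walks the list of columns and keeps each column's own type
res <- lapply(tf, function(col) {
  if (is.numeric(col)) mean(col) else unique(col)[1]
})
str(res)
```

lapply() returns a list, which is what you want here, because the results have mixed types; sapply() would simplify them to a single character vector.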
