How to split character value properly - r

I have a data frame which contains some composite information. I would like to split the vector a into the vectors "a" and "d", where "a" holds only the numeric IDs 898, 3467, 234, 222 and "d" holds the corresponding character values.
Data:
a<-c("898_Me","3467_You or ", "234_Hi-hi", "222_what")
b<-c(1,8,3,8)
c<-c(2,4,6,2)
df<-data.frame(a,b,c)
What I tried so far:
a<-str(df$a)
a<-strsplit(df$a, split)
But that just doesn't work out with my regular expression skills.
The required output table might have the form:
a d b c
898 Me 1 2
3467 You or 8 4
234 Hi-hi 3 6
222 what 8 2

library(tidyr)
a<-c("898_Me","3467_You or ", "234_Hi-hi", "222_what")
b<-c(1,8,3,8)
c<-c(2,4,6,2)
df <-data.frame(a,b,c)
final_df <- separate(df , a , c("a" , "d") , sep = "_")
# a d b c
#1 898 Me 1 2
#2 3467 You or 8 4
#3 234 Hi-hi 3 6
#4 222 what 8 2
final_df$d
# [1] "Me" "You or " "Hi-hi" "what"

strsplit is right, but you need to pass the character to split with:
do.call(rbind, strsplit(as.character(df$a), "_"))
# [,1] [,2]
# [1,] "898" "Me"
# [2,] "3467" "You or "
# [3,] "234" "Hi-hi"
# [4,] "222" "what"
Or
library(stringi)
stri_split_fixed(df$a, "_", simplify = TRUE)
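The split matrix still has to be attached to the other columns to reach the requested a, d, b, c layout. A minimal sketch in base R, assuming every value contains exactly one underscore (the names parts and out are only illustrative):

```r
a <- c("898_Me", "3467_You or ", "234_Hi-hi", "222_what")
b <- c(1, 8, 3, 8)
c <- c(2, 4, 6, 2)
df <- data.frame(a, b, c, stringsAsFactors = FALSE)

# Split on the literal underscore and stack the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(as.character(df$a), "_", fixed = TRUE))

# Rebuild the data frame: numeric ID, label, then the untouched columns b and c
out <- data.frame(a = as.numeric(parts[, 1]),
                  d = parts[, 2],
                  df[c("b", "c")],
                  stringsAsFactors = FALSE)
out
```

If a label could itself contain an underscore, strsplit would return more than two pieces for that row and rbind would no longer line up, so this only fits data shaped like the example.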

With your example, here is my solution in base R:
df$a2 <- gsub("[^0-9]", "", a)
df$d <- gsub("[0-9]", "", a)
That gives:
> df
a b c a2 d
1 898_Me 1 2 898 _Me
2 3467_You or 8 4 3467 _You or
3 234_Hi-hi 3 6 234 _Hi-hi
4 222_what 8 2 222 _what
Not elegant, but it preserves the original data and is easy to apply. Note that d still carries the leading underscore.
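A possible variant of the same idea, sketched with sub() so that the underscore is dropped and digits inside the label would survive. It assumes the ID is everything before the first underscore; the column name id is only illustrative:

```r
a <- c("898_Me", "3467_You or ", "234_Hi-hi", "222_what")
b <- c(1, 8, 3, 8)
c <- c(2, 4, 6, 2)
df <- data.frame(a, b, c, stringsAsFactors = FALSE)

# Drop the first underscore and everything after it, leaving the ID
df$id <- as.numeric(sub("_.*$", "", df$a))

# Drop everything up to and including the first underscore, leaving the label
df$d <- sub("^[^_]*_", "", df$a)
df
```

The original column a is left in place, as in the answer above.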

Related

Extract values from list of lists with R

I have list of lists similar to this sample:
z <- list(list(num1=list((list(tab1=list(list(a=1, b=2, c=5), list(a=3, b=4), list(d=4,e=7)))))),list(num2=list((list(tab2=list(list(a=1, b=2), list(a=3, b=4)))))))
I would like to extract the figures out of the last list of lists names:
Desired output as a list (since some list entries are shorter) or as a data frame with columns corresponding to the main list:
[1] a b c a b d e
[2] a b a b
dataframe:
column1 column2
a a
b b
c a
a b
b ""
d ""
e ""
I have tried various combinations of sapply(z, "[[", c("a","b"...) but failed, since the sublist names vary.
EDIT: Sorry, I needed the actual values, not the last node (letters)! Additionally, each numeric value has a column name, not set in the example above; it looks like this:
[[1]]$num1[[1]]$tab1[[1]]$a
Name
1
So the desired solution is the values:
[1]
1 2 5 3 4 4 7
[2]
1 2 3 4
I would actually need the numeric values instead of the letters. If you could adjust your solution to this I would be grateful. Thanks.
Try
lapply(z, function(x) as.numeric(unlist(x)))
## [[1]]
## [1] 1 2 5 3 4 4 7
##
## [[2]]
## [1] 1 2 3 4
z1 <- lapply(z, function(x) names(unlist(x)))
z1 <- lapply(z1, function(x) gsub(".*\\.", "", x))
n <- max(sapply(z1, length))
z1 <- lapply(z1, `length<-`, value = n)
setNames(as.data.frame(z1), paste0("Column", seq_along(z1)))
# Column1 Column2
#1 a a
#2 b b
#3 c a
#4 a b
#5 b <NA>
#6 d <NA>
#7 e <NA>
A bit far-fetched and anything but elegant, here is a way to get what you want:
lista<-unlist(lapply(strsplit(names(unlist(z)),"\\."),function(vec) vec[3]))
names(lista)<-unlist(lapply(strsplit(names(unlist(z)),"\\."),function(vec) vec[1]))
uninames<-unique(names(lista))
res<-sapply(uninames,function(x,vec){vec[names(vec)==x]},lista)
> res
$num1
num1 num1 num1 num1 num1 num1 num1
"a" "b" "c" "a" "b" "d" "e"
$num2
num2 num2 num2 num2
"a" "b" "a" "b"
UPDATE
To get the numbers :
a<-unlist(z)
b<-sub("\\..*","",names(a)) # top-level name (num1, num2) of each value
res<-sapply(unique(b),function(name,vec,l_name){vec[l_name==name]},a,b)
>res
$num1
num1.tab1.a num1.tab1.b num1.tab1.c num1.tab1.a num1.tab1.b num1.tab1.d num1.tab1.e
1 2 5 3 4 4 7
$num2
num2.tab2.a num2.tab2.b num2.tab2.a num2.tab2.b
1 2 3 4
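If the numeric values are wanted as data frame columns rather than a list, the padding trick used for the letters above should carry over. A minimal sketch, assuming the sample z from the question; the column names Column1 and Column2 are illustrative:

```r
z <- list(
  list(num1 = list(list(tab1 = list(
    list(a = 1, b = 2, c = 5), list(a = 3, b = 4), list(d = 4, e = 7)
  )))),
  list(num2 = list(list(tab2 = list(
    list(a = 1, b = 2), list(a = 3, b = 4)
  ))))
)

# Flatten each top-level entry to a plain numeric vector
vals <- lapply(z, function(x) as.numeric(unlist(x)))

# Pad the shorter vectors with NA so that all columns have the same length
n <- max(sapply(vals, length))
padded <- lapply(vals, `length<-`, n)
names(padded) <- paste0("Column", seq_along(padded))

res <- as.data.frame(padded)
res
```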

Swap (selected/subset) data frame columns in R

What is the simplest way to swap the order of a selected subset of columns in a data frame in R? The answers I have seen (Is it possible to swap columns around in a data frame using R?) use all indices / column names for this. If one has, say, 100 columns and needs either 1) to swap column 99 with column 1, or 2) to move column 99 before column 1 (keeping the old column 1 as column 2), the suggested approaches appear cumbersome. Funny that there is no small package around for this (Wickham's "reshape"?). Or can someone suggest some simple code?
If you really want a shortcut for this, you could write a couple of simple functions, such as the following.
To swap the position of two columns:
swapcols <- function(x, col1, col2) {
if(is.character(col1)) col1 <- match(col1, colnames(x))
if(is.character(col2)) col2 <- match(col2, colnames(x))
if(any(is.na(c(col1, col2)))) stop("One or both columns don't exist.")
i <- seq_len(ncol(x))
i[col1] <- col2
i[col2] <- col1
x[, i]
}
To move a column from one position to another:
movecol <- function(x, col, to.pos) {
if(is.character(col)) col <- match(col, colnames(x))
if(is.na(col)) stop("Column doesn't exist.")
if(to.pos > ncol(x) | to.pos < 1) stop("Invalid position.")
x[, append(seq_len(ncol(x))[-col], col, to.pos - 1)]
}
And here are examples of each:
(m <- matrix(1:12, ncol=4, dimnames=list(NULL, letters[1:4])))
# a b c d
# [1,] 1 4 7 10
# [2,] 2 5 8 11
# [3,] 3 6 9 12
swapcols(m, col1=1, col2=3) # using column indices
# c b a d
# [1,] 7 4 1 10
# [2,] 8 5 2 11
# [3,] 9 6 3 12
swapcols(m, 'd', 'a') # or using column names
# d b c a
# [1,] 10 4 7 1
# [2,] 11 5 8 2
# [3,] 12 6 9 3
movecol(m, col='a', to.pos=2)
# b a c d
# [1,] 4 1 7 10
# [2,] 5 2 8 11
# [3,] 6 3 9 12
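For the 100-column case from the question, the same index-vector idea can also be written inline without helper functions. A minimal sketch on a made-up data frame (wide and its V1 ... V100 columns are hypothetical):

```r
# Hypothetical wide data frame with 100 columns named V1 ... V100
wide <- as.data.frame(matrix(seq_len(200), nrow = 2, ncol = 100))

# 1) Swap columns 1 and 99: exchange the two entries of the index vector
idx <- seq_len(ncol(wide))
idx[c(1, 99)] <- idx[c(99, 1)]
swapped <- wide[, idx]

# 2) Move column 99 to the front; the remaining columns keep their order
moved <- wide[, c(99, setdiff(seq_len(ncol(wide)), 99))]

names(swapped)[c(1, 99)]
names(moved)[1:3]
```

Only the positions being changed are typed out; every other column index is generated.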

How to avoid strings changing to numbers automatically in R

I was trying to save some strings into a matrix, but they were automatically changed to numbers (levels). How can I avoid this?
Here is the table:
trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b
I want to save them to a matrix like:
J0001 a ab ab b
But what I get is:
J0001 1 2 2 3
How can I avoid this?
Your M column is defined as a factor. You can save it as-is by wrapping it with as.character:
> dat <- read.table(header = TRUE, text = "trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b")
> as.numeric(dat$M)
# [1] 1 2 2 3
> as.character(dat$M)
# [1] "a" "ab" "ab" "b"
You can avoid this in the first place by using stringsAsFactors = FALSE when you read the data into R, or take advantage of the colClasses argument in some of the read-in functions.
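A short sketch of the read-time fix, reusing the table from the question. The matrix layout and the row label J0001 are guesses at what the asker wants; in R 4.0.0 and later stringsAsFactors already defaults to FALSE, so the explicit argument mainly matters on older versions:

```r
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "trt means M
1 0 12.16673 a
2 111 11.86369 ab
3 125 11.74433 ab
4 14 11.54073 b")

# M is now a plain character vector, so no level codes can leak through
is.character(dat$M)

# One-row character matrix labelled like the desired output
m <- matrix(dat$M, nrow = 1, dimnames = list("J0001", NULL))
m

# If a column is already a factor, convert the labels explicitly
f <- factor(c("a", "ab", "ab", "b"))
as.integer(f)    # underlying level codes
as.character(f)  # the labels themselves
```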

Difference between as.data.frame(x) and data.frame(x)

What is the difference between as.data.frame(x) and data.frame(x)?
In the following example, the result is the same with the exception of the column names.
x <- matrix(data=rep(1,9),nrow=3,ncol=3)
> x
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 1 1 1
[3,] 1 1 1
> data.frame(x)
X1 X2 X3
1 1 1 1
2 1 1 1
3 1 1 1
> as.data.frame(x)
V1 V2 V3
1 1 1 1
2 1 1 1
3 1 1 1
As mentioned by Jaap, data.frame() calls as.data.frame() but there's a reason for it:
as.data.frame() is a method to coerce other objects to class data.frame. If you're writing your own package, you would store your method to convert an object of your_class under as.data.frame.your_class(). Here are just a few examples.
methods(as.data.frame)
[1] as.data.frame.AsIs as.data.frame.Date
[3] as.data.frame.POSIXct as.data.frame.POSIXlt
[5] as.data.frame.aovproj* as.data.frame.array
[7] as.data.frame.character as.data.frame.complex
[9] as.data.frame.data.frame as.data.frame.default
[11] as.data.frame.difftime as.data.frame.factor
[13] as.data.frame.ftable* as.data.frame.integer
[15] as.data.frame.list as.data.frame.logLik*
[17] as.data.frame.logical as.data.frame.matrix
[19] as.data.frame.model.matrix as.data.frame.numeric
[21] as.data.frame.numeric_version as.data.frame.ordered
[23] as.data.frame.raw as.data.frame.table
[25] as.data.frame.ts as.data.frame.vector
Non-visible functions are asterisked
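To illustrate how such a method might look, here is a minimal sketch for a made-up S3 class. The class name reading, its constructor new_reading and its fields are all hypothetical:

```r
# Hypothetical S3 class holding paired sensor names and values
new_reading <- function(sensor, value) {
  structure(list(sensor = sensor, value = value), class = "reading")
}

# Coercion method: as.data.frame() dispatches here for objects of class "reading"
as.data.frame.reading <- function(x, ...) {
  data.frame(sensor = x$sensor, value = x$value, stringsAsFactors = FALSE)
}

r <- new_reading(c("s1", "s2"), c(0.5, 1.5))
out <- as.data.frame(r)
out
```

Inside a package the method would additionally be registered in the NAMESPACE file; defined at top level in a script, ordinary S3 dispatch finds it.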
data.frame() can be used to build a data frame, while as.data.frame() can only be used to coerce other objects to a data frame.
For example:
# data.frame()
df1 <- data.frame(matrix(1:12,3,4),1:3)
# as.data.frame()
df2 <- as.data.frame(matrix(1:12,3,4),1:3)
df1
# X1 X2 X3 X4 X1.3
# 1 1 4 7 10 1
# 2 2 5 8 11 2
# 3 3 6 9 12 3
df2
# V1 V2 V3 V4
# 1 1 4 7 10
# 2 2 5 8 11
# 3 3 6 9 12
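The 1:3 seems to disappear from df2 because the second positional argument of as.data.frame() is row.names, not another column. A small sketch that names the arguments to make this visible (the labels r1 to r3 and the column name extra are illustrative):

```r
m <- matrix(1:12, 3, 4)

# data.frame(): every argument becomes one or more columns
df1 <- data.frame(m, extra = 1:3)

# as.data.frame(): the second argument supplies row names, not a column
df2 <- as.data.frame(m, row.names = c("r1", "r2", "r3"))

dim(df1)
dim(df2)
rownames(df2)
```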
As you noted, the result does differ slightly, and this means that they are not exactly equal:
identical(data.frame(x),as.data.frame(x))
[1] FALSE
So you might need to take care to be consistent in which one you use.
But it is also worth noting that as.data.frame is faster:
library(microbenchmark)
microbenchmark(data.frame(x),as.data.frame(x))
Unit: microseconds
expr min lq median uq max neval
data.frame(x) 71.446 73.616 74.80 78.9445 146.442 100
as.data.frame(x) 25.657 27.631 28.42 29.2100 93.155 100
y <- matrix(1:1e6,1000,1000)
microbenchmark(data.frame(y),as.data.frame(y))
Unit: milliseconds
expr min lq median uq max neval
data.frame(y) 17.23943 19.63163 23.60193 41.07898 130.66005 100
as.data.frame(y) 10.83469 12.56357 14.04929 34.68608 38.37435 100
The difference becomes clearer when you look at their main arguments:
as.data.frame(x, ...): check if object is a data frame, or coerce if possible. Here, "x" can be any R object.
data.frame(...): build a data frame. Here, "..." allows specifying all the components (i.e. the variables of the data frame).
So, the results by Ophelia are similar since both functions received a single matrix as argument: however, when these functions receive 2 (or more) vectors, the distinction becomes clearer:
> # Set seed for reproducibility
> set.seed(3)
> # Create one int vector
> IDs <- seq(1:10)
> IDs
[1] 1 2 3 4 5 6 7 8 9 10
> # Create one char vector
> types <- sample(c("A", "B"), 10, replace = TRUE)
> types
[1] "A" "B" "A" "A" "B" "B" "A" "A" "B" "B"
> # Try to use "as.data.frame" to coerce components into a dataframe
> dataframe_1 <- as.data.frame(IDs, types)
> # Look at the result
> dataframe_1
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
duplicate row.names: A, B
> # Inspect result with head
> head(dataframe_1, n = 10)
IDs
A 1
B 2
A.1 3
A.2 4
B.1 5
B.2 6
A.3 7
A.4 8
B.3 9
B.4 10
> # Check the structure
> str(dataframe_1)
'data.frame': 10 obs. of 1 variable:
$ IDs: int 1 2 3 4 5 6 7 8 9 10
> # Use instead "data.frame" to build a data frame starting from two components
> dataframe_2 <- data.frame(IDs, types)
> # Look at the result
> dataframe_2
IDs types
1 1 A
2 2 B
3 3 A
4 4 A
5 5 B
6 6 B
7 7 A
8 8 A
9 9 B
10 10 B
> # Inspect result with head
> head(dataframe_2, n = 10)
IDs types
1 1 A
2 2 B
3 3 A
4 4 A
5 5 B
6 6 B
7 7 A
8 8 A
9 9 B
10 10 B
> # Check the structure
> str(dataframe_2)
'data.frame': 10 obs. of 2 variables:
$ IDs : int 1 2 3 4 5 6 7 8 9 10
$ types: Factor w/ 2 levels "A","B": 1 2 1 1 2 2 1 1 2 2
As you see, "data.frame()" works fine, while "as.data.frame()" produces an error, because it treats the first argument as the object to be coerced and the second argument as row names.
To sum up, "as.data.frame()" should be used to convert/coerce one single R object into a data frame (as you correctly did using a matrix), while "data.frame()" to build a data frame from scratch.
Try
colnames(x) <- c("C1","C2","C3")
and then both will give the same result
identical(data.frame(x), as.data.frame(x))
What is more startling are things like the following:
list(x)
provides a one-element list, the element being the matrix x, whereas
as.list(x)
gives a list with 9 elements, one for each matrix entry.
Looking at the code, as.data.frame fails faster. data.frame will issue warnings, and do things like remove rownames if there are duplicates:
> x <- matrix(data=rep(1,9),nrow=3,ncol=3)
> rownames(x) <- c("a", "b", "b")
> data.frame(x)
X1 X2 X3
1 1 1 1
2 1 1 1
3 1 1 1
Warning message:
In data.row.names(row.names, rowsi, i) :
some row.names duplicated: 3 --> row.names NOT used
> as.data.frame(x)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names =
TRUE, :
duplicate row.names: b

R data frame format group

I have a data frame in this format:
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
How can I get it into this form:
ABC 2 4 6
DEF 10 20
I tried the aggregate function, but it needs functions like mean/sum as parameters. How can I just display the values directly in the row?
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20")
unstack(df, form=V2~V1)
# $ABC
# [1] 2 4 6
#
# $DEF
# [1] 10 20
unstack produces a list in this case as the columns don't have the same length. In case of the same length:
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
DEF 20")
t(unstack(df, form=V2~V1))
# [,1] [,2] [,3]
# ABC 2 4 6
# DEF 10 20 20
Well, what are the observations? Are they supposed to measure the same thing for each category?
You can't actually get a data frame exactly as you have posted, because the number of observations for each category is different. But you could do that if you add an "NA" to the "DEF".
Like this:
ABC 2 4 6
DEF 10 20 NA
If that is what you want, you could just use reshape2's dcast.
But you have to name the observations:
library(reshape2)
df <- data.frame(obs =c(1:3, 1:2),
categories = c(rep("ABC", 3), rep("DEF",2)),
values=c(2,4,6,10,20), stringsAsFactors=FALSE)
df2 <- dcast(df, categories~obs)
df2
# categories 1 2 3
# 1 ABC 2 4 6
# 2 DEF 10 20 NA
To add to your alternatives:
This seems to be a basic "long to wide" reshape problem, but it is missing a "time" variable. It's easy to recreate one by using ave:
ave(as.character(df$V1), df$V1, FUN = seq_along)
# [1] "1" "2" "3" "1" "2"
df$time <- ave(as.character(df$V1), df$V1, FUN = seq_along)
Once you have a "time" variable, using reshape is pretty straightforward:
reshape(df, idvar="V1", timevar="time", direction = "wide")
# V1 V2.1 V2.2 V2.3
# 1 ABC 2 4 6
# 4 DEF 10 20 NA
If, instead, you wanted a list, there is no need for the time variable. Just use split:
split(df$V2, df$V1)
# $ABC
# [1] 2 4 6
#
# $DEF
# [1] 10 20
#
Similarly, if your data were balanced, split plus rbind could get you what you need. Using the sample data from @lukeA:
df <- read.table(sep=" ", header=F, text="
ABC 2
ABC 4
ABC 6
DEF 10
DEF 20
DEF 20")
do.call(rbind, split(df$V2, df$V1))
# [,1] [,2] [,3]
# ABC 2 4 6
# DEF 10 20 20
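For the unbalanced data from the question, the split result can be padded with NA before stacking. A minimal sketch, assuming the two columns are named V1 and V2 as in the read.table examples above:

```r
df <- data.frame(V1 = c("ABC", "ABC", "ABC", "DEF", "DEF"),
                 V2 = c(2, 4, 6, 10, 20),
                 stringsAsFactors = FALSE)

# One numeric vector per group
groups <- split(df$V2, df$V1)

# Pad each group to the length of the longest one, then put groups in rows
n <- max(sapply(groups, length))
wide <- t(sapply(groups, `length<-`, n))
wide
```

The result is a numeric matrix with the group names as row names; wrap it in as.data.frame() if a data frame is needed.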
You want to obtain a sparse matrix? The two rows in your example have different lengths. Try a function producing a list:
mat<-cbind(
c("ABC","ABC","ABC","DEF","DEF"),
c(2,4,6,10,20)
)
count<-function(mat){
values<-unique(mat[,1])
outlist<-list()
for(v in values){
outlist[[v]]<-mat[mat[,1]==v,2]
}
return(outlist)
}
count(mat)
Which will give you this result:
$ABC
[1] "2" "4" "6"
$DEF
[1] "10" "20"

Resources