I need to manipulate the following data frame (data) so that the PATCH_CODE column is split into 2 resulting columns where the 1st column contains the letter of the string and the 2nd column contains the number as in the 2nd example dataframe below.
EDIT: PATCH_CODE is not always 2 characters; occasional cases have a single letter, in which case I need to force a 1 into the resulting CODE column.
initial data frame: head(data,4)
PATCH_CODE TERR PC1
A1 MENS_10 0.8629186
A3 MENS_10 -0.2703238
B1 MENS_10 0.9516067
B2 MENS_10 -0.1722446
resulting data frame:
PATCH CODE TERR PC1
A 1 MENS_10 0.8629186
A 3 MENS_10 -0.2703238
B 1 MENS_10 0.9516067
B 2 MENS_10 -0.1722446
I have seen examples of how to accomplish this when the column to be split has an identifiable text delimiter such as a comma by using colsplit in reshape but I have failed to find a solution for a structure like mine. Is this possible?
output of str(data)
'data.frame': 240 obs. of 3 variables:
$ PATCH_CODE: Factor w/ 42 levels "A","A1","A2",..: 2 3 4 7 8 12 13 16 17 18 ...
$ TERR : Factor w/ 19 levels "MENS_10","MENS_14",..: 1 1 1 1 1 1 1 1 1 1 ...
$ PC1 : num 0.548 1.228 0.273 5.548 3.853 ...
You can use strsplit. Passing an empty string as a delimiter results in a split at each letter.
a <- c("A1", "B1", "C2", "D5", "R3")
strsplit(a, "")
[[1]]
[1] "A" "1"
[[2]]
[1] "B" "1"
[[3]]
[1] "C" "2"
[[4]]
[1] "D" "5"
[[5]]
[1] "R" "3"
If you want to put that in a matrix
> do.call(rbind, strsplit(a, ""))
[,1] [,2]
[1,] "A" "1"
[2,] "B" "1"
[3,] "C" "2"
[4,] "D" "5"
[5,] "R" "3"
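One caveat with the empty-string split: it yields one piece per character, so the pieces line up into two columns only when every code is exactly one letter followed by one digit. A small sketch with made-up values (`codes` is a hypothetical vector, not taken from the question) shows how a two-digit code or a bare letter changes the number of pieces:

```r
# Hypothetical codes: a regular one, a two-digit one, and a bare letter
codes <- c("A1", "B12", "C")

# strsplit() needs a character vector; wrap a factor column in as.character()
parts <- strsplit(codes, "")

# Number of pieces per element; anything other than 2 will not fit two columns
n_pieces <- lengths(parts)
n_pieces
```

When the counts differ, do.call(rbind, ...) recycles the shorter vectors, at most with a warning, so it is worth checking the lengths before binding.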
By the sounds of your description, strsplit should work fine. If your data are a little more complicated, you can also look at a possible regex-based solution.
For this particular example, try:
do.call(rbind, strsplit(as.character(mydf$PATCH_CODE),  # PATCH_CODE is a factor; strsplit needs character
split = "(?<=[a-zA-Z])(?=[0-9])",
perl = TRUE))
# [,1] [,2]
# [1,] "A" "1"
# [2,] "A" "3"
# [3,] "B" "1"
# [4,] "B" "2"
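The edit in the question also asks for a 1 to be forced in when a code is a bare letter, which neither split above handles. A minimal base R sketch of one way to do it, assuming the codes are always letters followed by optional digits; `mydf` here is a small stand-in for the real data, and the fifth row (the bare "A" and its PC1 value) is made up for illustration:

```r
# Stand-in data: the four rows from the question plus one made-up bare letter
mydf <- data.frame(
  PATCH_CODE = factor(c("A1", "A3", "B1", "B2", "A")),
  TERR = "MENS_10",
  PC1 = c(0.8629186, -0.2703238, 0.9516067, -0.1722446, 0.5),
  stringsAsFactors = FALSE
)

code <- as.character(mydf$PATCH_CODE)   # factor -> character
patch <- sub("[0-9]+$", "", code)       # drop trailing digits, keep the letters
num <- sub("^[A-Za-z]+", "", code)      # drop leading letters, keep the digits
num[num == ""] <- "1"                   # bare letter: force a 1

result <- data.frame(PATCH = patch, CODE = as.integer(num),
                     mydf[c("TERR", "PC1")], stringsAsFactors = FALSE)
result
```

Using two sub() calls rather than a single split keeps the result aligned even when the digit part is missing or has more than one digit.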
I have a very big data frame. Let df below represent it:
df <-as.data.frame(rbind(c("a",1,1,1),c("a",1,1,1),c("a",1,1,1),c("b",2,2,2),c("b",2,2,2),c("b",2,2,2)))
[,1] [,2] [,3] [,4]
[1,] "a" "1" "1" "1"
[2,] "a" "1" "1" "1"
[3,] "a" "1" "1" "1"
[4,] "b" "2" "2" "2"
[5,] "b" "2" "2" "2"
[6,] "b" "2" "2" "2"
I want to create a dataframe like the one below out of it:
[,1] [,2] [,3] [,4]
[1,] "a" "3" "3" "3"
[2,] "b" "6" "6" "6"
I see several similar posts here, but the answers, although very useful, need a vector of all possible values in the first column and so on. My problem is that my dataset has about 3000 rows.
How can I get the result in R?
We could use group_by and summarise after using type.convert(as.is = TRUE):
library(dplyr)
df %>%
type.convert(as.is=TRUE) %>%
group_by(V1) %>%
summarise(across(V2:V4, sum))
V1 V2 V3 V4
<chr> <int> <int> <int>
1 a 3 3 3
2 b 6 6 6
We can use aggregate
df <- type.convert(df, as.is = TRUE)
aggregate(.~ V1, df, FUN = sum)
-output
V1 V2 V3 V4
1 a 3 3 3
2 b 6 6 6
NOTE: The OP created the data.frame from a matrix, and a matrix can hold only a single class, so every column arrives as text. Thus, do the type conversion first.
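To see why the conversion matters, the column classes can be inspected before and after. This is only a sketch on a cut-down version of the question's data; `df_num`, `before`, `after` and `totals` are names introduced here for illustration, and stringsAsFactors = FALSE is set explicitly so the behaviour does not depend on the R version:

```r
# Cut-down version of the question's data, built from a character matrix
df <- as.data.frame(rbind(c("a", 1, 1, 1), c("a", 1, 1, 1), c("b", 2, 2, 2)),
                    stringsAsFactors = FALSE)

before <- sapply(df, class)                 # every column is character
df_num <- type.convert(df, as.is = TRUE)    # guess a sensible type per column
after <- sapply(df_num, class)              # V1 stays character, the rest become integer

# With numeric columns in place, the grouped sum works directly
totals <- aggregate(. ~ V1, df_num, FUN = sum)
totals
```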
Another aggregate option
> aggregate(. ~ V1, df, function(x) sum(as.numeric(x)))
V1 V2 V3 V4
1 a 3 3 3
2 b 6 6 6
Struggling with string handling in R...
I've got a column of strings in an R data frame. Each one contains the "=" character once and only once. I'd like to know the position of the "=" character in each element of the column, as a step to splitting the column into two separate columns (one for the bit before the "=" and one for the bit after the "="). Can anyone help please? I'm sure it's simple but I'm struggling to find the answer.
For example, if I have:
x <- data.frame(string = c("aa=1", "aa=2", "aa=3", "b=1", "b=2", "abc=5"))
I'd like a bit of code to return
(3, 3, 3, 2, 2, 4)
Thank you.
To get the position of "=" you can use the regexpr function:
regexpr("=", x$string)
#[1] 3 3 3 2 2 4
#attr(,"match.length")
#[1] 1 1 1 1 1 1
#attr(,"useBytes")
#[1] TRUE
However, as @Michael stated, if your goal is to split the string you can use strsplit:
strsplit(x$string, "=")
#[[1]]
#[1] "aa" "1"
#
#[[2]]
#[1] "aa" "2"
#
#[[3]]
#[1] "aa" "3"
#
#[[4]]
#[1] "b" "1"
#
#[[5]]
#[1] "b" "2"
#
#[[6]]
#[1] "abc" "5"
Or combine it with do.call and rbind to collect the pieces into a two-column matrix:
do.call(rbind, strsplit(x$string, "="))
# [,1] [,2]
#[1,] "aa" "1"
#[2,] "aa" "2"
#[3,] "aa" "3"
#[4,] "b" "1"
#[5,] "b" "2"
#[6,] "abc" "5"
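If the positions themselves are wanted as the intermediate step, as the question describes, they can feed substr directly. A minimal sketch, assuming each string contains "=" exactly once; `pos`, `before` and `after` are names chosen here, and stringsAsFactors = FALSE keeps the column as character on older R versions:

```r
x <- data.frame(string = c("aa=1", "aa=2", "aa=3", "b=1", "b=2", "abc=5"),
                stringsAsFactors = FALSE)

# as.integer() drops the match.length attributes and leaves plain positions;
# fixed = TRUE treats "=" as a literal string rather than a regex
pos <- as.integer(regexpr("=", x$string, fixed = TRUE))

before <- substr(x$string, 1, pos - 1)                 # text left of "="
after <- substr(x$string, pos + 1, nchar(x$string))    # text right of "="

data.frame(before, after, stringsAsFactors = FALSE)
```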
Here's a way to do it:
library(stringr)
str_locate(x$string, "=")[,1]
You can use gregexpr:
unlist(lapply(gregexpr(pattern = '=', x$string), min))
[1] 3 3 3 2 2 4
In Base R you can do:
as.numeric(lapply(strsplit(as.character(x$string), ""), function(x) which(x == "=")))
[1] 3 3 3 2 2 4
Here is another solution to obtain a two-column result (a character matrix once transposed), the first column containing the characters before = and the second one containing the characters after =. You can do that without obtaining the positions of the = character.
library(stringr)
t(as.data.frame(strsplit(x$string, "=")))
# [,1] [,2]
#c..aa....1.. "aa" "1"
#c..aa....2.. "aa" "2"
#c..aa....3.. "aa" "3"
#c..b....1.. "b" "1"
#c..b....2.. "b" "2"
#c..abc....5.. "abc" "5"
Some may find this more readable
library(tidyverse)
x %>%
mutate(
number = string %>% str_extract('[:digit:]+'),
text = string %>% str_extract('[:alpha:]+')
) %>%
as_tibble()
# A tibble: 6 x 3
string number text
<fct> <chr> <chr>
1 aa=1 1 aa
2 aa=2 2 aa
3 aa=3 3 aa
4 b=1 1 b
5 b=2 2 b
6 abc=5 5 abc
I have a long list, whose elements are lists of length one containing a character vector. These vectors can have different lengths.
The elements of the vectors are character strings, but I would like to convert them to numeric, as they actually represent numbers.
I would like to create a matrix, or a data frame, whose rows are the vectors above, converted into numeric. Since they have different lengths, the "right ends" of each row could be filled with NA.
I am trying to use the function rbind.fill.matrix from the library {plyr}, but the only thing I could get is a long numeric 1-d array with all the numbers inside, instead of a matrix.
This is the best I could do to get a list of numeric (dat here is my original list):
dat<-sapply(sapply(dat,unlist),as.numeric)
How can I create the matrix now?
Thank you!
I would do something like:
library(stringi)
temp <- stri_list2matrix(dat, byrow = TRUE)
final <- `dim<-`(as.numeric(temp), dim(temp))
The basic idea is that stri_list2matrix will convert the list to a matrix, but it would still be a character matrix. as.numeric would remove the dimensional attributes of the matrix, so we add those back in with:
`dim<-` ## Yes, the backticks are required -- or at least quotes
POC:
dat <- list(1:2, 1:3, 1:2, 1:5, 1:6)
dat <- lapply(dat, as.character)
dat
# [[1]]
# [1] "1" "2"
#
# [[2]]
# [1] "1" "2" "3"
#
# [[3]]
# [1] "1" "2"
#
# [[4]]
# [1] "1" "2" "3" "4" "5"
#
# [[5]]
# [1] "1" "2" "3" "4" "5" "6"
library(stringi)
temp <- stri_list2matrix(dat, byrow = TRUE)
final <- `dim<-`(as.numeric(temp), dim(temp))
final
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 2 NA NA NA NA
# [2,] 1 2 3 NA NA NA
# [3,] 1 2 NA NA NA NA
# [4,] 1 2 3 4 5 NA
# [5,] 1 2 3 4 5 6
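For comparison, the same padding and conversion can be sketched in base R without stringi. This is an alternative to the approach above rather than a replacement; the sample list and the names `n`, `padded` and `m` are made up for illustration:

```r
# Made-up list of character vectors with different lengths
dat <- list(c("1", "2"), c("1", "2", "3"), "7")

n <- max(lengths(dat))                     # width of the widest row

# Assigning a longer length pads each vector on the right with NA
padded <- lapply(dat, function(v) { length(v) <- n; v })

m <- do.call(rbind, padded)                # character matrix, one row per vector
storage.mode(m) <- "double"                # convert in place; dim is preserved
m
```

Changing storage.mode converts the values while keeping the matrix attributes, which is why no `dim<-` step is needed here.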
I have a table
rawData <- as.data.frame(matrix(c(1,2,3,4,5,6,"a,b,c","d,e","f"),nrow=3,ncol=3))
1 4 a,b,c
2 5 d,e
3 6 f
I would like to convert to
1 2 3
4 5 6
a d f
b e
c
So far I can transpose and split the third column, but I'm lost as to how to reconstruct a new table in the format outlined above.
new = t(rawData)
for (e in 1:ncol(new)){
s<-strsplit(new[3:3,e], split=",")
print(s)
}
I tried creating new vectors for each iteration, but I'm not sure how to efficiently put each one back into a data frame. I would be grateful for any help. Thanks!
You can use stri_list2matrix from the stringi package:
library(stringi)
rawData <- as.data.frame(matrix(c(1,2,3,4,5,6,"a,b,c","d,e","f"),nrow=3,ncol=3),stringsAsFactors = FALSE)
d1 <- t(rawData[,1:2])
rownames(d1) <- NULL
d2 <- stri_list2matrix(strsplit(rawData$V3,split=','))
rbind(d1,d2)
# [,1] [,2] [,3]
# [1,] "1" "2" "3"
# [2,] "4" "5" "6"
# [3,] "a" "d" "f"
# [4,] "b" "e" NA
# [5,] "c" NA NA
You can also use cSplit from my "splitstackshape" package.
By default, it just creates additional columns after splitting the input:
library(splitstackshape)
cSplit(rawData, "V3")
# V1 V2 V3_1 V3_2 V3_3
# 1: 1 4 a b c
# 2: 2 5 d e NA
# 3: 3 6 f NA NA
You can just transpose that to get your desired output.
t(cSplit(rawData, "V3"))
# [,1] [,2] [,3]
# V1 "1" "2" "3"
# V2 "4" "5" "6"
# V3_1 "a" "d" "f"
# V3_2 "b" "e" NA
# V3_3 "c" NA NA
I am trying to populate a field in a table (or create a separate vector altogether, whichever is easier) with consecutive numbers from 1 to n, where n is the total number of records that share the same factor level, and then back to 1 for the next level, etc. That is, for a table like this
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
the result should be a new column (e.g. "sample") as follows:
sample<-c(1,2,3,4,1,2,3,1,2,3,4,1,2)
You can get it as follows, using ave:
data <- data.frame(data)
new <- ave(rep(1,nrow(data)),data$data,FUN=cumsum)
all.equal(new,sample) # check if it's right.
You can use the rle function together with lapply:
sample <- unlist(lapply(rle(data[,1])$lengths,FUN=function(x){1:x}))
data <- cbind(data,sample)
Or even better, you can combine rle and sequence in the following one-liner (thanks to @Arun's suggestion):
data <- cbind(data,sequence(rle(data[,1])$lengths))
> data
[,1] [,2]
[1,] "A" "1"
[2,] "A" "2"
[3,] "A" "3"
[4,] "A" "4"
[5,] "B" "1"
[6,] "B" "2"
[7,] "B" "3"
[8,] "C" "1"
[9,] "C" "2"
[10,] "C" "3"
[11,] "C" "4"
[12,] "D" "1"
[13,] "D" "2"
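One thing to keep in mind: rle counts runs of consecutive equal values, so the rle-based versions restart the counter whenever the level changes, which matches per-level numbering only if rows of the same level are adjacent. A small sketch on a made-up, unsorted vector (`g` is hypothetical) contrasts that with ave and seq_along, which numbers within each level regardless of order:

```r
# Made-up grouping vector in which level "A" appears in two separate runs
g <- c("A", "B", "A", "A", "B")

by_run <- sequence(rle(g)$lengths)                   # restarts at every run
by_group <- ave(seq_along(g), g, FUN = seq_along)    # counts within each level

rbind(by_run, by_group)
```

For data sorted by level, as in the question, both give the same numbering.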
There are lots of different ways of achieving this, but I prefer to use ddply() from plyr because the logic seems very consistent to me. I think it makes more sense to be working with a data.frame (your title talks about levels of a factor):
dat <- data.frame(ID = c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)))
library(plyr)
ddply(dat, .(ID), summarise, sample = 1:length(ID))
# ID sample
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 B 1
# 6 B 2
# 7 B 3
# 8 C 1
# 9 C 2
# 10 C 3
# 11 C 4
# 12 D 1
# 13 D 2
My answer:
sample <- unlist(lapply(levels(factor(data)), function(x)seq_len(sum(factor(data)==x))))
factors <- unique(data)
f1 <- length(which(data == factors[1]))
...
fn <- length(which(data == factors[length(factors)]))
You can use a for loop or the 'apply' family to automate that part instead of writing each count by hand.
Then,
sample <- c(1:f1, 1:f2, ..., 1:fn)
Once again you can use a for loop for that part. Here is the full script you can use:
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
factors <- unique(data)
f <- c()
for(i in 1:length(factors)) {
f[i] <- length(which(data == factors[i]))
}
sample <- c()
for(i in 1:length(f)) {
sample <- c(sample, 1:f[i])
}
> sample
[1] 1 2 3 4 1 2 3 1 2 3 4 1 2
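The two loops can also be written with the 'apply' family mentioned earlier. A sketch using vapply for the counts and sequence for the numbering; the result is called `sample_id` here (a name chosen for this example) so that it does not mask base R's sample() function, and like the loop version it assumes rows of the same level are adjacent:

```r
data <- matrix(c(rep('A', 4), rep('B', 3), rep('C', 4), rep('D', 2)), ncol = 1)

factors <- unique(data[, 1])   # levels in order of first appearance

# Count the rows for each level; vapply guarantees one integer per level
f <- vapply(factors, function(lv) sum(data[, 1] == lv), integer(1),
            USE.NAMES = FALSE)

# sequence() concatenates 1:f[1], 1:f[2], ... in one call
sample_id <- sequence(f)
sample_id
```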