R: Pulling data from one column to create new columns


I have data with sample names that need to be unpacked and created into new columns.
sample
P10.1
P11.2
S1.1
S3.3
Using the sample ID data, I need to make three new columns: tissue, plant, stage.
sample tissue plant stage
P10.1 P 10 1
P11.2 P 11 2
S1.1 S 1 1
S3.3 S 3 3
Is there a way to pull the data from the sample column to populate the three new columns?

Using dplyr and tidyr:
First we insert a "." after the first character of the sample code, then we separate sample into 3 columns on the dots. Note that the sample column itself now contains the inserted dot.
library(dplyr)
library(tidyr)
df %>%
  mutate(sample = paste0(substring(sample, 1, 1), ".", substring(sample, 2))) %>%
  separate(sample, into = c("tissue", "plant", "stage"), remove = FALSE)
sample tissue plant stage
1 P.10.1 P 10 1
2 P.11.2 P 11 2
3 S.1.1 S 1 1
4 S.3.3 S 3 3
data:
df <- structure(list(sample = c("P10.1", "P11.2", "S1.1", "S3.3")),
.Names = "sample",
class = "data.frame",
row.names = c(NA, -4L))

Similar to @phiver's answer, but uses regular expressions.
Within pattern:
The first pair of parentheses captures any single uppercase letter (for tissue)
The second pair of parentheses captures any one- or two-digit number (for plant)
The third pair of parentheses captures any one- or two-digit number (for stage)
The sub() function pulls out those capturing groups and places them in new variables.
library(magrittr)
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"
df %>%
dplyr::mutate(
tissue = sub(pattern, "\\1", sample),
plant = as.integer(sub(pattern, "\\2", sample)),
stage = as.integer(sub(pattern, "\\3", sample))
)
Result (displayed with str()):
'data.frame': 4 obs. of 4 variables:
$ sample: chr "P10.1" "P11.2" "S1.1" "S3.3"
$ tissue: chr "P" "P" "S" "S"
$ plant : int 10 11 1 3
$ stage : int 1 2 1 3
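The same capture groups can also be pulled out in a single pass with base R's regexec() and regmatches(), instead of calling sub() once per group. This is only a minimal sketch, assuming every sample matches the pattern; the names samples, matches, parts and result are illustrative and not part of the answer above.

```r
pattern <- "^([A-Z])(\\d{1,2})\\.(\\d{1,2})$"
samples <- c("P10.1", "P11.2", "S1.1", "S3.3")

# regexec() records the full match plus each capture group;
# regmatches() returns them as one character vector per sample.
matches <- regmatches(samples, regexec(pattern, samples))

# Stack into a matrix: column 1 is the full match, columns 2-4 are the groups.
# Assumption: every sample matched (a non-match would yield character(0)).
parts <- do.call(rbind, matches)

result <- data.frame(
  sample = samples,
  tissue = parts[, 2],
  plant  = as.integer(parts[, 3]),
  stage  = as.integer(parts[, 4]),
  stringsAsFactors = FALSE
)
result
```

If some sample IDs might not follow the pattern, it would be safer to check lengths(matches) before stacking.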

This is similar to phiver's answer, but uses separate twice. Notice that the sep argument accepts a position index as well as a regular expression.
library(tidyr)
dat2 <- dat %>%
separate(sample, into = c("tissue", "number"), sep = 1, remove = FALSE) %>%
separate(number, into = c("plant", "stage"), sep = "\\.", remove = TRUE, convert = TRUE)
dat2
# sample tissue plant stage
# 1 P10.1 P 10 1
# 2 P11.2 P 11 2
# 3 S1.1 S 1 1
# 4 S3.3 S 3 3
DATA
dat <- read.table(text = "sample
P10.1
P11.2
S1.1
S3.3",
header = TRUE, stringsAsFactors = FALSE)

Related

How to remove characters that don't match a string pattern from a column of a data frame

I have a column, rooms, in my data frame flat_cl_one.
I want to keep the values that match the pattern "\\d+Zimmer" and remove the values that consist only of digits, such as "9586" and "927".
I tried the following gsub call:
gsub("[^\\d+Zimmer]", "", flat_cl_one$rooms)
But it removes all the digits.
What regex can I use to get the correct result? Thank you in advance.

We can coerce the column to numeric: values that contain letters become NA (with a coercion warning), so the values that are not NA are the digit-only ones, which we replace with blanks.
library(dplyr)
library(stringr)
flat_cl_one %>%
  mutate(rooms = ifelse(!is.na(as.numeric(rooms)), "", rooms))
Or we can use str_detect from stringr:
flat_cl_one %>%
  mutate(rooms = ifelse(str_detect(rooms, "Zimmer", negate = TRUE), "", rooms))
Output
rooms
1 647Zimmer
2 394Zimmer
3
4
5 38210Zimmer
We could do the same thing with filter if you wanted to actually remove those rows.
flat_cl_one %>%
filter(is.na(as.numeric(rooms)))
# rooms
#1 647Zimmer
#2 394Zimmer
#3 38210Zimmer
Data
flat_cl_one <- structure(list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389",
"38210Zimmer")), class = "data.frame", row.names = c(NA, -5L))

Just replace strings that don't contain the word "Zimmer"
flat_cl_one$room[!grepl("Zimmer", flat_cl_one$room)] <- ""
flat_cl_one
#> room
#> 1 3Zimmer
#> 2 2Zimmer
#> 3 2Zimmer
#> 4 3Zimmer
#> 5
#> 6
#> 7 3Zimmer
#> 8 6Zimmer
#> 9 2Zimmer
#> 10 4Zimmer
Data
flat_cl_one <- data.frame(room = c("3Zimmer", "2Zimmer", "2Zimmer", "3Zimmer",
"9586", "927", "3Zimmer", "6Zimmer",
"2Zimmer", "4Zimmer"))

Another possible solution, using stringr::str_extract (I am using @AndrewGillreath-Brown's data, with thanks):
library(tidyverse)
df <- structure(
list(rooms = c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")),
class = "data.frame",
row.names = c(NA, -5L))
df %>%
mutate(rooms = str_extract(rooms, "\\d+Zimmer"))
#> rooms
#> 1 647Zimmer
#> 2 394Zimmer
#> 3 <NA>
#> 4 <NA>
#> 5 38210Zimmer

The pattern [^\\d+Zimmer] is a negated character class: it matches any single character that is not in the listed set, so it cannot express "digits followed by Zimmer".
Using gsub with perl = TRUE, you can assert with a negative lookahead (?!...) that the string does not start with the pattern ^\\d+Zimmer, and then match 1 or more digits if the assertion is true.
gsub("^(?!^\\d+Zimmer\\b)\\d+\\b", "", flat_cl_one$rooms, perl = TRUE)
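If the lookahead feels hard to read, the same result can be reached by testing each whole string against an anchored pattern and blanking the ones that fail. This is a minimal sketch using the sample data from the earlier answer; the names keep and cleaned are illustrative.

```r
rooms <- c("647Zimmer", "394Zimmer", "8796", "9389", "38210Zimmer")

# TRUE where the whole string is one or more digits followed by "Zimmer".
keep <- grepl("^\\d+Zimmer$", rooms)

# Keep matching values as they are and blank out everything else.
cleaned <- ifelse(keep, rooms, "")
cleaned
```

Anchoring with ^ and $ is what makes this a test of the whole value rather than of individual characters.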

sequential column names add 0

I have a dataset with column names
Col_a_b1 Col_a_b2 Col_a_b3 Col_a_b4 Col_a_b5 Col_a_b6 Col_a_b7 Col_a_b8 Col_a_b9 Col_a_b10 Col_a_b11 ... Col_a_b94
How do I add a leading 0 to the numbers in column names 1 to 9? Expected column names:
Col_a_b01 Col_a_b02 Col_a_b03 Col_a_b04 Col_a_b05 Col_a_b06 Col_a_b07 Col_a_b08 Col_a_b09 Col_a_b10 Col_a_b11 ... Col_a_b94
Any suggestions much appreciated. Thanks.

With a tidyverse approach:
library(tidyverse)
names <- c("Col_a_b1", "Col_a_b2", "Col_a_b3", "Col_a_b4", "Col_a_b5", "Col_a_b6", "Col_a_b7", "Col_a_b8", "Col_a_b9", "Col_a_b10", "Col_a_b11")
names %>%
str_split("(?<=Col_a_b)(?=\\d+)") %>%
map_chr(~ str_c(.x[1], str_pad(.x[2], width = 2, pad = "0")))
#> [1] "Col_a_b01" "Col_a_b02" "Col_a_b03" "Col_a_b04" "Col_a_b05" "Col_a_b06"
#> [7] "Col_a_b07" "Col_a_b08" "Col_a_b09" "Col_a_b10" "Col_a_b11"

For a given data frame called data (parse_number() comes from the readr package):
library(readr)
colnames(data) <- sprintf("Col_a_b%02d", parse_number(colnames(data)))
%02d formats a decimal integer at least 2 characters wide, left-padded with zeros.
Example
# Sample data
data = structure(list(Col_a_b1 = c("Name1", "Name2"), Col_a_b94 = c(1,
2)), class = "data.frame", row.names = c(NA, -2L))
> data
Col_a_b1 Col_a_b94
1 Name1 1
2 Name2 2
colnames(data) <- sprintf("Col_a_b%02d", parse_number(colnames(data)))
> data
Col_a_b01 Col_a_b94
1 Name1 1
2 Name2 2
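If readr is not available, the same padding can be done in base R. The sketch below assumes every column name starts with the fixed prefix "Col_a_b", as in the question; the names nms, prefix, padded and padded_regex are illustrative.

```r
nms <- c("Col_a_b1", "Col_a_b9", "Col_a_b10", "Col_a_b94")
prefix <- "Col_a_b"

# Strip the fixed prefix, convert the remainder to integer,
# then rebuild each name with a zero-padded two-digit number.
num <- as.integer(sub(prefix, "", nms, fixed = TRUE))
padded <- sprintf("%s%02d", prefix, num)

# Regex alternative: insert a 0 before a final digit that is not
# preceded by another digit (lookbehind requires perl = TRUE).
padded_regex <- sub("(?<!\\d)(\\d)$", "0\\1", nms, perl = TRUE)

padded
```

The sprintf() version generalizes more easily to wider padding (for example %03d), while the regex version leaves the prefix untouched whatever it is.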

One way to do it (parse_number() is from the readr package):
library(readr)
# column names
nam = c('Col_a_b1', 'Col_a_b2', 'Col_a_b3')
# extract the number
num = parse_number(nam)
# prepend a 0 to single-digit numbers
num = sub('(^[0-9]$)', '0\\1', num)
# remove the numbers from the names
nam = gsub('[0-9]+', '', nam)
# paste the padded numbers back on
mod_nam = paste0(nam, num)
mod_nam
[1] "Col_a_b01" "Col_a_b02" "Col_a_b03"

Cleaning Data: Multiple Misspelled Strings in R

I have over 100 strings that I want to change, for example:
Scheduled Caste, Schdeduled Caste, Schedulded Caste need to be changed to SC.
I have been doing it like this: Haryana3$Category[Haryana3$Category %in% "Scheduled Caste"] <- "SC"
Is there anything I can do that's more efficient?

Use gsub:
Haryana3$Category <- gsub("Scheduled Caste", "SC", Haryana3$Category)
Alternatively, you can use data.table:
library(data.table)
setDT(Haryana3)
Haryana3[, Category := gsub("Scheduled Caste", "SC", Category)]
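Since a single gsub() call only handles one exact spelling, one option for many variants is a named lookup vector that maps each known spelling to its replacement. This is a minimal base R sketch using the three spellings from the question; the vector names Category, lookup and matched are illustrative, and the lookup would need to be extended by hand for the other strings.

```r
Category <- c("Scheduled Caste", "Schdeduled Caste", "Schedulded Caste",
              "Scheduled Tribes", "Forward Caste")

# Names are the spellings found in the data, values are the replacements.
lookup <- c("Scheduled Caste"  = "SC",
            "Schdeduled Caste" = "SC",
            "Schedulded Caste" = "SC")

# Replace only the values that appear in the lookup; leave the rest untouched.
matched <- Category %in% names(lookup)
Category[matched] <- unname(lookup[Category[matched]])
Category
```

The same pattern applies to a data frame column by using Haryana3$Category in place of the standalone vector.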

I guess the rule is to combine the first letters of each word. If that is true, here is one idea.
library(tidyverse)
Haryana3 <- Haryana3 %>%
mutate(Category = strsplit(Category, split = " ")) %>%
mutate(Category = map_chr(Category, ~paste0(str_sub(.x, start = 1L, end = 1L), collapse = "")))
Haryana3
# ID Category
# 1 1 SC
# 2 2 SC
# 3 3 ST
# 4 4 ST
# 5 5 FC
DATA
Haryana3 <- read.table(text = "ID Category
1 'Scheduled Caste'
2 'Scheduled Caste'
3 'Scheduled Tribes'
4 'Scheduled Tribes'
5 'Forward Caste'", header = TRUE, stringsAsFactors = FALSE)

Identify elements from df1 in df2, then add column in df2 in those rows that were coincident using R

I have a dataframe with two columns (genome) and a dataframe with one column (list_SSNP).
What I am trying to do is to add a third and fourth columns in my Genome dataframe and add the value "1" for those positions in Genome that appear in list_SSNP and, separately, in list_SCPG.
I am trying to get an output dataframe that looks like this:
Gene_Symbol CHR SNP
A1BG 19q13.43
PDE1C 12p13.31 1
This is part of the content of Genome and I have included a reproducible example:
Genome <- c()
Genome$Gene_Symbol <- c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C")
Genome$CHR <- c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31")
Gene_Symbol CHR
1 A1BG 19q13.43
2 A1BG-AS1 19q13.43
3 A1CF 10q11.23
4 A2M 12p13.31
5 PDE1C 12p13.31
And this is part of the content of list_SSNP:
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")
Gene_Symbol
1 PDE1C
2 IMMP2L
3 ZCCHC14
4 NOS1AP
5 HARBI1
Starting with just one of the data frames (list_SSNP), I tried looping over the rows of Genome: for each row i, if element i of list_SSNP matches element i of Genome, write the number 1 to a third column. When I execute this code, nothing happens.
Full_genome <- read.table("FULL_GENOME.txt", header=TRUE, sep = "\t", dec = ',', na.strings=c("","NA"), fill=TRUE)
Genome <- Full_genome[,c(2,3)]
names(Genome) <- c("Gene_Symbol", "CHR")
list_SSNP <- as.data.frame(Gene_SSNP$Gene_Symbol)
for (i in 1: dim (Genome) [1]) {
if(list_SSNP[i] %in% Genome[i,1]){
Genome[i,3] <- 1
}
}
Just to further clarify, I have checked that all the elements from list_SSNP appear in Genome, so it is absolutely certain that it is not a matter of not finding any coincidences.
EDIT:
I have come to realize that my example does not specify that the entries in list_SSNP and Genome are unique and have no duplicates and that Genome has about 30k lines of entries, while list_SSNP has 49. I just want to add a column in Genome and a number 1 in those rows where the entry exists in both Genome and list_SSNP.

I believe this could help. You can try this code:
#Data
Genome <- data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
stringsAsFactors = F)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")
#Collapse
vecc <- paste0(list_SSNP,collapse = '|')
#Contrast
Genome$SNP <- as.numeric(grepl(pattern = vecc,x = Genome$Gene_Symbol))
Output:
Gene_Symbol CHR SNP
1 A1BG 19q13.43 0
2 A1BG-AS1 19q13.43 0
3 A1CF 10q11.23 0
4 A2M 12p13.31 0
5 PDE1C 12p13.31 1
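One caveat with a collapsed regex is that grepl() matches substrings, so a symbol such as "A1BG" would also flag "A1BG-AS1". Since the question states that entries are unique gene symbols, an exact comparison with %in% may be closer to what is wanted. A minimal sketch with the same sample data:

```r
Genome <- data.frame(
  Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
  CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
  stringsAsFactors = FALSE
)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")

# %in% compares whole strings, so partial matches are not flagged.
Genome$SNP <- as.integer(Genome$Gene_Symbol %in% list_SSNP)

# Illustration of the substring pitfall with grepl():
partial_hit <- grepl("A1BG", "A1BG-AS1")  # TRUE, although the symbols differ

Genome
```

A second list such as list_SCPG could be handled the same way with another column.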

I may be missing something important here, since the problem is formulated quite specifically for its domain, so in abstracting it I may have overlooked an issue with my proposed solution.
As I understand it, list_SSNP can contain a SNP entry multiple times. So first of all, you could create a list of unique SNPs with the count of their occurrences:
library(dplyr)
list_SSNP = data.frame(SNP = c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1"))
unique_SSNP = list_SSNP %>%
group_by(SNP) %>%
# the summarize() could be replaced by count I guess, but I usually use this for more control
summarize(count = n())
And now you use a left_join
Genome = data.frame(Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
stringsAsFactors = F)
Genome_extended = Genome %>%
left_join(unique_SSNP, by = c("Gene_Symbol" = "SNP"))
The count column in the extended dataframe would be NAs for non-existing SNPs and you could fill the NAs with a variety of commands from dplyr, tidyr or even base R.
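As one way to fill those NAs, here is a base R sketch. It swaps in merge(..., all.x = TRUE) for left_join() and table() for the grouped count, so that it runs without dplyr; note that merge() may reorder the rows by the key column. The name counts is illustrative.

```r
Genome <- data.frame(
  Gene_Symbol = c("A1BG", "A1BG-AS1", "A1CF", "A2M", "PDE1C"),
  CHR = c("19q13.43", "19q13.43", "10q11.23", "12p13.31", "12p13.31"),
  stringsAsFactors = FALSE
)
list_SSNP <- c("PDE1C", "IMMP2L", "ZCCHC14", "NOS1AP", "HARBI1")

# Count occurrences of each SNP (base R stand-in for group_by + summarize).
counts <- table(list_SSNP)
unique_SSNP <- data.frame(SNP = names(counts),
                          count = as.integer(counts),
                          stringsAsFactors = FALSE)

# all.x = TRUE keeps every Genome row, like a left join.
Genome_extended <- merge(Genome, unique_SSNP,
                         by.x = "Gene_Symbol", by.y = "SNP", all.x = TRUE)

# Genes absent from list_SSNP come back as NA; replace those with 0.
Genome_extended$count[is.na(Genome_extended$count)] <- 0L
Genome_extended
```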

How to add value base on specific character ,also fix with certain digits in R

The target format is xxxx.xxxxxx (4 digits before the "." and 6 digits after it).
I need to add "0"s wherever either side of the "." has too few digits.
Finding the location of "[.]" with regexpr and combining it with str_pad fixes the first 4 digits, but I don't know how to pad after a specific character to a fixed number of digits (I cannot find a function that counts the position relative to a given character).
Data like this
> df
Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402
Desired data
> df
Category
1 0300.030340
2 3400.040290
3 0700.070110
4 1700.090100
5 0700.070114
6 0700.079100
7 3600.050590
8 4400.040200
I am a beginner at coding and sometimes can't understand regex such as "[", etc. An explanation of these would be very helpful.
I also have a combination like this:
df$Category <- ifelse(regexpr("[.]", df$Category) == 4,
  paste("0", df$Category, sep = ""), df$Category)
df$Category <- str_pad(df$Category, 11, side = "right", pad = "0")
I would like to know whether there is a better way to do this, especially a way to count the position from the END of the string back to a specific character.

Using formatC:
df$Category <- formatC(as.numeric(df$Category), format = 'f', width = 11, flag = '0', digits = 6)
# > df
# Category
# 1 0300.030340
# 2 3400.040290
# 3 0700.070110
# 4 1700.090100
# 5 0700.070114
# 6 0700.079100
# 7 3600.050590
# 8 4400.040200
format = 'f': fixed-point notation for doubles;
width = 11: 4 digits before . + 1 . + 6 digits after .;
flag = '0': pads with leading zeros;
digits = 6: the desired number of digits after the decimal point (with format = "f").
The input df appears to be a character data.frame:
structure(list(Category = c("300.030340", "3400.040290", "700.07011",
"1700.0901", "700.070114", "700.0791", "3600.05059", "4400.0402"
)), .Names = "Category", row.names = c(NA, -8L), class = "data.frame")

We can use sprintf
df$Category <- sprintf("%011.6f", df$Category)
df
# Category
#1 0300.030340
#2 3400.040290
#3 0700.070110
#4 1700.090100
#5 0700.070114
#6 0700.079100
#7 3600.050590
#8 4400.040200
data
df <- structure(list(Category = c(300.03034, 3400.04029, 700.07011,
1700.0901, 700.070114, 700.0791, 3600.05059, 4400.0402)),
.Names = "Category", class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8"))

There are plenty of great tricks, functions, and shortcuts to be learned, and I would encourage you to explore them all! For example, if you're trying to win code golf, you will want to use @akrun's sprintf() approach. Since you stated you're a beginner, it might be more helpful to break the problem down into its component parts. One approach that is, in my opinion, transparent and easy to follow is to use the stringr package:
library(stringr)
location_of_dot <- str_locate(df$Category, "\\.")[, 1]
substring_left_of_dot <- str_sub(df$Category, end = location_of_dot - 1)
substring_right_of_dot <- str_sub(df$Category, start = location_of_dot + 1)
pad_left <- str_pad(substring_left_of_dot, 4, side = "left", pad = "0")
pad_right <- str_pad(substring_right_of_dot, 6, side = "right", pad = "0")
result <- paste0(pad_left, ".", pad_right)
result
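The same decomposition works in base R, and it also answers the two side questions: "[.]" is a character class containing only a literal dot (a bare "." in a regex means "any character"), and the number of characters from the end of the string back to the dot is nchar(x) minus the dot's position. A minimal sketch, assuming each value contains exactly one dot; all variable names are illustrative.

```r
x <- c("300.030340", "3400.040290", "700.07011", "1700.0901",
       "700.070114", "700.0791", "3600.05059", "4400.0402")

# fixed = TRUE treats "." as a literal dot, equivalent to the regex "[.]".
dot <- as.integer(regexpr(".", x, fixed = TRUE))

# Number of characters between the dot and the END of the string.
digits_after <- nchar(x) - dot

left  <- substr(x, 1, dot - 1)
right <- substr(x, dot + 1, nchar(x))

# Pad the left part to 4 and the right part to 6 characters with zeros;
# pmax() guards against negative repeat counts for overlong parts.
result <- paste0(strrep("0", pmax(0, 4 - nchar(left))), left,
                 ".",
                 right, strrep("0", pmax(0, 6 - nchar(right))))
result
```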

Use separate from tidyr to split Category on the decimal point. Then use str_pad from stringr to add zeros at the front or back, and paste the parts together.
library(tidyr) # separate(): split a column on the decimal point
library(dplyr) # mutate(), select() and the pipe
library(stringr) # str_pad()
input_data <- read.table(text =" Category
1 300.030340
2 3400.040290
3 700.07011
4 1700.0901
5 700.070114
6 700.0791
7 3600.05059
8 4400.0402", header = TRUE, stringsAsFactors = FALSE) %>%
separate(Category, into = c("col1", "col2")) %>%
mutate(col1 = str_pad(col1, width = 4, side= "left", pad ="0"),
col2 = str_pad(col2, width = 6, side= "right", pad ="0"),
Category = paste(col1, col2, sep = ".")) %>%
select(-col1, -col2)
