Splitting a row into columns using a delimiter in R - r

My data looks like this:
ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,
I would like to first separate this into different columns and then apply a function to each row to calculate the difference between the comma-separated values, e.g. 237 - 204.
I would like to do this without external packages.

Try the following. If the data is in a file, replace the readLines line with something like L <- readLines("myfile.csv"). The code replaces the colons with commas using gsub, reads the resulting text with read.table, and then adds the difference column with transform:
# test data
Lines <- "ID:10:237,204,
ID:11:257,239,
ID:12:309,291,
ID:13:310,272,
ID:14:3202,3184,
ID:15:404,388,"
L <- readLines(textConnection(Lines))
DF <- read.table(text = gsub(":", ",", L), sep = ",")
transform(DF, diff = V3 - V4)
giving:
V1 V2 V3 V4 V5 diff
1 ID 10 237 204 NA 33
2 ID 11 257 239 NA 18
3 ID 12 309 291 NA 18
4 ID 13 310 272 NA 38
5 ID 14 3202 3184 NA 18
6 ID 15 404 388 NA 16
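The same result can also be reached without rewriting the delimiters, by splitting each line on either ":" or "," with strsplit. This is only a sketch in base R; the names parts, m, a and b are made up for illustration and do not come from the answer above.

```r
# Same test data as above, one string per line
L <- c("ID:10:237,204,", "ID:11:257,239,", "ID:12:309,291,",
       "ID:13:310,272,", "ID:14:3202,3184,", "ID:15:404,388,")

# Split every line on ":" or ","; strsplit does not return an
# empty piece for the trailing comma
parts <- strsplit(L, "[:,]")

# Stack the pieces into a character matrix, one row per input line
m <- do.call(rbind, parts)

# Build the data frame and compute the difference of the two values
DF <- data.frame(id = as.integer(m[, 2]),
                 a  = as.integer(m[, 3]),
                 b  = as.integer(m[, 4]))
DF$diff <- DF$a - DF$b
DF
```

Compared with the gsub/read.table route, this avoids the all-NA column that the trailing comma produces, at the cost of converting the columns by hand.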

Related

writing out .dat file in r

I have a dataset that looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.1)
> data
ids scores.1 scores.1.1
1 111 0 0
2 12 1 1
3 134 0 0
4 14 1 1
5 155 1 1
6 16 2 2
7 17 0 0
8 18 1 1
9 19 1 1
10 20 1 1
ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in the number of digits, but scores always have one digit. I am trying to write the data out as a .dat file by computing the number of items and passing it to the write.fwf function from the gdata library.
item.count <- dim(data)[2] - 1 # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5,rep(1, item.count)),
colnames = FALSE, sep = "")
I would like to separate the student ids and the question responses with some spaces, so I reserve 5 characters for the student ids by specifying width = c(5, rep(1, item.count)) in write.fwf(). However, the output file looks like this, with the padding spaces on the left side of the student ids
  11100
   1211
  13400
   1411
  15511
   1622
   1700
   1811
   1911
   2011
rather than on the right side of the ids, like this:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
Any recommendations?
Thanks!
We can use unite to combine the 'scores' columns into a single one and then use write.csv
library(dplyr)
library(tidyr)
data %>%
unite(scores, starts_with('scores'), sep='')
With @akrun's help, this gives what I wanted:
library(dplyr)
library(tidyr)
library(gdata) # provides write.fwf
# assign the result so that write.fwf sees the united column
data <- data %>%
unite(scores, starts_with('scores'), sep='')
write.fwf(data, file = "data.dat",
width = c(5,item.count),
colnames = FALSE, sep = " ")
In the .dat file, the dataset looks like this:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
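For comparison, a fixed-width file with the ids padded on the right can also be produced in base R, without gdata or tidyr. This is a minimal sketch, not the approach used in the thread: the names scores, out_lines and out_file are hypothetical, and the file is written to a temporary directory rather than to data.dat.

```r
ids <- c(111, 12, 134, 14, 155, 16, 17, 18, 19, 20)
scores.1 <- c(0, 1, 0, 1, 1, 2, 0, 1, 1, 1)
# Same construction as in the question (scores.1 is used twice there)
data <- data.frame(ids, scores.1, scores.1)

# Paste all score columns together into one string per student
scores <- do.call(paste0, unname(as.list(data[-1])))

# "%-5s" left-justifies the id in a field of 5 characters,
# so the padding spaces end up on the right of the id
out_lines <- sprintf("%-5s%s", as.character(data$ids), scores)

# Write one record per line
out_file <- file.path(tempdir(), "data.dat")
writeLines(out_lines, out_file)
```

Here the field width lives in the sprintf format string, so widening the id field only means changing the 5.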

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns from the same data frame by comparing the first 5 letters of the column names: columns whose names share the same first 5 letters should be subset together and stored in a new variable.
Here is a small example of the required output.
Let's say the data frame is eatable:
fruits_area fruits_production vegetable_area vegetable_production
12 100 26 324
33 250 40 580
66 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly, but I am not sure how to match the column names of the dataset against each other.
Edit:- I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 1, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select() (select_() is deprecated in current dplyr):
library(tidyverse)
map(v, ~ select(eatable, matches(.x)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 1, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl, which anchors the match to the start of the column name.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want but there was a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you're subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
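If the goal is to get every prefix group at once rather than one prefix per call, base R's split.default can split the columns of a data frame by a grouping vector. The following is a sketch under that assumption; the names prefix and groups are illustrative only.

```r
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510),
                      c(26, 40, 43), c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production",
                    "vegetables_area", "vegetable_production")

# First 5 letters of every column name, used as the grouping key
prefix <- substr(names(eatable), 1, 5)

# split.default treats the data frame as a list of columns and
# returns one data frame per distinct prefix
groups <- split.default(eatable, prefix)

names(groups)
groups$fruit
```

The result is a named list, so each subset can be picked out by its prefix (groups$fruit, groups$veget) instead of by position.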

How to split data frame with multiple delimiter using str_split_fixed?

How can I split a column separated by multiple delimiters into separate columns in a data frame?
read.table(text = " Chr Nm1 Nm2 Nm3
chr10_100064111-100064134+Nfif 20 20 20
chr10_100064115-100064138-Kitl 30 19 40
chr10_100076865-100076888+Tert 60 440 18
chr10_100079974-100079997-Itg 50 11 23
chr10_100466221-100466244+Tmtc3 55 24 53", header = TRUE)
Chr gene Nm1 Nm2 Nm3
chr10_100064111-100064134 Nfif 20 20 20
chr10_100064115-100064138 Kitl 30 19 40
chr10_100076865-100076888 Tert 60 440 18
chr10_100079974-100079997 Itg 50 11 23
chr10_100466221-100466244 Tmtc3 55 24 53
I used
library(stringr)
df2 <- str_split_fixed(df1$name, "\\+", 2)
I would like to know how to include both + and - as delimiters.
If you're trying to split one column into multiple, tidyr::separate is handy:
library(tidyr)
dat %>% separate(Chr, into = paste0('Chr', 1:3), sep = '[+-]')
# Chr1 Chr2 Chr3 Nm1 Nm2 Nm3
# 1 chr10_100064111 100064134 Nfif 20 20 20
# 2 chr10_100064115 100064138 Kitl 30 19 40
# 3 chr10_100076865 100076888 Tert 60 440 18
# 4 chr10_100079974 100079997 Itg 50 11 23
# 5 chr10_100466221 100466244 Tmtc3 55 24 53
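This should work, where a is the character vector to split (for example the Chr column). Each value contains two delimiters, so ask for 3 pieces; with n = 2 the split would stop at the first "-":
str_split_fixed(a, "[-+]", 3)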
Here is a way to do this in base R with strsplit:
# split Chr into a list
tempList <- strsplit(as.character(df$Chr), split="[+-]")
# replace Chr with desired values
df$Chr <- sapply(tempList, function(i) paste(i[[1]], i[[2]], sep="-"))
# get Gene variable
df$gene <- sapply(tempList, "[[", 3)
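Since the desired output keeps the coordinate range intact and only peels off the gene name, another option is to split at the last + or - only. The sketch below uses sub with two regular expressions; the vector chr and the data frame parts are made-up names, and only two of the sample values are reproduced.

```r
chr <- c("chr10_100064111-100064134+Nfif",
         "chr10_100064115-100064138-Kitl")

parts <- data.frame(
  # Drop the last delimiter and everything after it
  Chr  = sub("[+-][^+-]*$", "", chr),
  # Drop everything up to and including the last delimiter
  gene = sub("^.*[+-]", "", chr),
  stringsAsFactors = FALSE
)
parts
```

The first pattern only matches a delimiter that is followed by no further + or - before the end of the string, which is why the hyphen inside the coordinate range is left alone.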

Creating empty R dataframe and adding data row-by-row

I'm new to R and this hurdle may be a case of me crossing my R and Python wires - I apologise if that's the case.
I have some data that is supplied as individual rows. I'd like to create an empty dataframe and add each row of data one at a time. I read several posts that recommend not doing this if possible but, in this case, I think it should be easier. I've read several posts giving solutions to the same problem and I think I've followed them. The code I have so far is:
# Create empty dataframe with 1 column for string and several integer columns:
df = data.frame(name=character(), int_a=integer(), int_b=integer(), int_c=integer(), int_d=integer(), int_e=integer(), stringsAsFactors=FALSE)
# Create a series of lists containing the data
r1 = list(name="Row1", int_a=13234, int_b=567, int_c=566, int_d=53, int_e=11)
r2 = list(name="Row2", int_a=34454, int_b=34, int_c=643, int_d=33, int_e=56)
r3 = list(name="Row3", int_a=73857, int_b=3, int_c=226, int_d=4, int_e=55)
r4 = list(name="Row4", int_a=86754, int_b=346, int_c=384, int_d=35, int_e=59)
r5 = list(name="Row5", int_a=33748, int_b=456, int_c=461, int_d=6, int_e=85)
r6 = list(name="Row6", int_a=97865, int_b=34654, int_c=65, int_d=35, int_e=148)
r7 = list(name="Row7", int_a=36475, int_b=3444, int_c=365, int_d=55, int_e=34)
r8 = list(name="Row8", int_a=84748, int_b=454, int_c=345, int_d=148, int_e=884)
r9 = list(name="Row9", int_a=94848, int_b=23454, int_c=6548, int_d=7, int_e=566)
# Add row by row:
df = rbind(df, r1)
df = rbind(df, r2)
df = rbind(df, r3)
df = rbind(df, r4)
df = rbind(df, r5)
df = rbind(df, r6)
df = rbind(df, r7)
df = rbind(df, r8)
df = rbind(df, r9)
The end result is almost right but there are some errors – it looks like this:
name int_a int_b int_c int_d int_e
2 Row1 13234 567 566 53 11
21 <NA> 34454 34 643 33 56
3 <NA> 73857 3 226 4 55
4 <NA> 86754 346 384 35 59
5 <NA> 33748 456 461 6 85
6 <NA> 97865 34654 65 35 148
7 <NA> 36475 3444 365 55 34
8 <NA> 84748 454 345 148 884
9 <NA> 94848 23454 6548 7 566
And a series of warnings is generated, of the form:
1: In `[<-.factor`(`*tmp*`, ri, value = "Row2") :
invalid factor level, NA generated
Can anyone explain why the strings are not being entered into the dataframe and why the row names are a bit odd?
Thanks in advance.
options(stringsAsFactors = FALSE)
your code ....
options(stringsAsFactors = TRUE)
This will work. I'm not sure why specifying it in the data.frame() call, as the OP did, is not enough; I would appreciate clarification on this as well.
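A likely explanation, offered as an assumption rather than something confirmed in this thread: in R versions before 4.0.0 the list passed to rbind is itself converted to a data frame using the then-default stringsAsFactors = TRUE, so name becomes a factor whose only level is "Row1" and later names turn into NA. One way to sidestep both the global option and the repeated rbind calls is to collect the rows in a list and bind them once. The sketch below uses a shortened, hypothetical version of the rows (three rows, two integer columns).

```r
# Rows collected in a plain list as they arrive
rows <- list(
  list(name = "Row1", int_a = 13234L, int_b = 567L),
  list(name = "Row2", int_a = 34454L, int_b = 34L),
  list(name = "Row3", int_a = 73857L, int_b = 3L)
)

# Convert each row to a one-row data frame, keeping strings as
# character, then bind all rows in a single call
df <- do.call(rbind, lapply(rows, function(r) {
  data.frame(r, stringsAsFactors = FALSE)
}))
df
```

Binding once at the end also avoids copying the growing data frame on every iteration, which is the usual reason the row-by-row pattern is discouraged.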

printing a list based on range met

I would like to generate a string output into a list if some values are met. I have a table that looks like this:
grp V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 1 go.1 142 144 132 134 0 31 11 F D T hy al qe 34 6 3
2: 2 go.1 313 315 303 305 0 31 11 q z t hr ye er 29 20 41
3: 3 go.1 316 318 306 308 0 31 11 f w y hu er es 64 43 19
4: 4 go.1 319 321 309 311 0 31 11 r a y ie uu qr 26 22 20
5: 5 go.1 322 324 312 314 0 31 11 g w y hp yu re 44 7 0
I'm using this function to generate a desired output:
library(IRanges); library(data.table)
rangeFinder = function(x){
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out = x[1,1],
start = start(x.ir)[max.idx],
end = end(x.ir)[max.idx])
return(ans)}
rangeFinder(x.out)
out start end
1: 1 313 324
I would also like to generate a list with the letters (from columns V9-V11) that fall between the start and end output from rangeFinder.
For example, the output should look like this.
out
[[go.1]]
[1] "qztfwyraygwy"
rangeFinder is looking at values in column V2 and V3 and printing the longest match of numbers. Notice how "FDT" is not included in the list output even though rangeFinder produced an output from 313-324 (and not from 142-324). How can I get the desired output?
reduce has an argument with.revmap to add a "metadata" column (accessible with mcols()) to the object. This associates with each reduced range the indexes of the original range that map to the reduced range, as an IntegerList class, basically a list where all elements are guaranteed to be integer vectors. So these are the rows you're interested in
ir <- with(x, IRanges(V2, V3))
r <- reduce(ir, with.revmap=TRUE)
i <- unlist(mcols(r)[which.max(width(r)), "revmap"])
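and the character data can be munged into a single string with something like the following (the transpose makes the letters come out row by row, and as.data.frame() lets the same indexing work when x is a data.table)
j <- paste0("V", 9:11)
paste0(t(as.matrix(as.data.frame(x)[i, j, drop=FALSE])), collapse="")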
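The order in which the letters are concatenated depends on how the matrix is laid out: R stores matrices column by column, so pasting a matrix directly walks down each column, while transposing first walks along each row. The small base-R sketch below illustrates this with only the three letter columns of the table from the question; the row index i is hard-coded here as an assumption standing in for the revmap result.

```r
# Letter columns V9-V11 from the question's table
x <- data.frame(V9  = c("F", "q", "f", "r", "g"),
                V10 = c("D", "z", "w", "a", "w"),
                V11 = c("T", "t", "y", "y", "y"),
                stringsAsFactors = FALSE)

i <- 2:5                      # rows assumed to fall in the widest reduced range
cols <- paste0("V", 9:11)
m <- as.matrix(x[i, cols, drop = FALSE])

by_col <- paste0(m, collapse = "")     # column-major order
by_row <- paste0(t(m), collapse = "")  # row by row
by_row
```

Here by_row matches the string asked for in the question, whereas by_col groups all V9 letters first, then V10, then V11.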
It's better to ask your questions about IRanges on the Bioconductor mailing list; no subscription required.
with.revmap is a convenience argument added relatively recently; I think
h = findOverlaps(ir, r)
i = queryHits(h)[subjectHits(h) == which.max(width(r))]
is a replacement.
