Splitting a character column into multiple columns [duplicate] - r

This question already has answers here:
Split data frame string column into multiple columns
(16 answers)
Closed 4 years ago.
y <- data.frame(x = c("63,98,131","75,109,145","66,104,139"))
I want to create three columns A, B, C from x by splitting on the comma:
A B C
63 98 131
75 109 145
66 104 139
I tried to use str_split
str_split(y$x, " , ")
[[1]]
[1] "63,98,131"
[[2]]
[1] "75,109,145"
[[3]]
[1] "66,104,139"
But this does not do the job. How can I fix it?

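> parts <- strsplit(as.character(y$x), ",")  # as.character() guards against a factor column
> dt <- as.data.frame(matrix(unlist(parts), ncol = lengths(parts)[1], byrow = TRUE))  # one column per piece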
> dt
V1 V2 V3
1 63 98 131
2 75 109 145
3 66 104 139
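The str_split attempt in the question returns the strings unsplit because the pattern " , " contains spaces, while the data has bare commas. A minimal base R sketch of the split, assuming every string has exactly three comma-separated fields (the names parts and abc are illustrative, not from the question):

```r
y <- data.frame(x = c("63,98,131", "75,109,145", "66,104,139"))

# Split on a literal comma; as.character() covers the case where x is a factor.
parts <- strsplit(as.character(y$x), ",", fixed = TRUE)

# Stack the pieces row by row, then name and convert the columns.
abc <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(abc) <- c("A", "B", "C")
abc[] <- lapply(abc, as.integer)
abc
```

If the rows could have differing numbers of fields, do.call(rbind, ...) would recycle values with a warning, so the field count should be checked first.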

Related

Replicate more than one column [duplicate]

This question already has answers here:
Sample random rows in dataframe
(13 answers)
Closed 3 years ago.
I am trying to sample two columns of a data frame, but the sample function lets me sample only one column, not both columns (Campaignid, CampaignName) at once.
Is there a way to sample whole rows, as in the expected output below?
camp.d <- data.frame(Campaignid=c(121,132,133,143,153),
CampaignName=c('a','b','c','d','e'))
#allows only one column
a <- sample(camp.d$Campaignid, 100, replace = TRUE)
Expected:
Campaignid CampaignName
121 a
121 a
133 c
132 b
132 b
...
I think you need this -
sampled_data <- camp.d[sample(nrow(camp.d), 100, replace = T), ]
head(sampled_data)
Campaignid CampaignName
2 132 b
5 153 e
3 133 c
3.1 133 c
2.1 132 b
4 143 d
You could use the sample call to slice the full data frame, sampling row positions rather than id values:
camp.d[sample(seq_len(nrow(camp.d)), 100, replace = TRUE), ]
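You can try the following, but note that it samples each column independently, so the Campaignid/CampaignName pairing from camp.d is not preserved (as the output shows):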
as.data.frame(lapply(camp.d, sample, size = 100, replace = TRUE))
Campaignid CampaignName
1 132 a
2 133 c
3 143 a
4 132 e
5 133 c
6 143 a
7 132 c
8 153 a
9 121 c
10 132 b
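The difference between the answers above comes down to sampling row indices versus sampling values. A small sketch, assuming the goal is to keep each Campaignid with its own CampaignName (set.seed, idx, sampled and key are illustrative additions):

```r
camp.d <- data.frame(Campaignid = c(121, 132, 133, 143, 153),
                     CampaignName = c("a", "b", "c", "d", "e"),
                     stringsAsFactors = FALSE)

set.seed(1)  # only to make the sketch reproducible

# Sample row positions with replacement, then index whole rows.
idx <- sample(nrow(camp.d), 100, replace = TRUE)
sampled <- camp.d[idx, ]
rownames(sampled) <- NULL  # drop the "3.1"-style row names

# Check that every sampled id still carries its original name.
key <- setNames(camp.d$CampaignName, camp.d$Campaignid)
all(sampled$CampaignName == key[as.character(sampled$Campaignid)])
```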

R - Data Frame is a list of columns?

Question
Is a data frame in R a list (a list is, in my understanding, a sequence of objects) of columns?
What is the design decision in R to have made a data frame a column-oriented (not row-oriented) structure?
Any reference to related design document or article of data structure design would be appreciated.
I am just used to row-as-a-unit/record and would like to know why it is column oriented. Or if I misunderstood something, kindly suggest.
Background
I had thought a data frame was a sequence of rows, such as (Ozone, Solar.R, Wind, Temp, Month, Day).
> c ## data frame created from read.csv()
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
> typeof(c)
[1] "list"
However when lapply() is applied against c to show each list element, it was a column.
> lapply(c, function(arg){ return(arg) })
$Ozone
[1] 41 36 12 18 23 19
$Solar.R
[1] 190 118 149 313 299 99
$Wind
[1] 7.4 8.0 12.6 11.5 8.6 13.8
$Temp
[1] 67 72 74 62 65 59
$Month
[1] 5 5 5 5 5 5
$Day
[1] 1 2 3 4 7 8
Whereas what I had expected was
[1] 41 190 7.4 67 5 1
[1] 36 118 8.0 72 5 2
…
1) Is a data frame in R a list of columns?
Yes.
df <- data.frame(a=c("the", "quick"), b=c("brown", "fox"), c=1:2)
is.list(df) # -> TRUE
attr(df, "names") # -> [1] "a" "b" "c"
df[[1]][2] # -> "quick"
2) What is the design decision in R to have made a data frame a column-oriented (not row-oriented) structure?
A data.frame is a list of column vectors.
is.atomic(df[[1]]) # -> TRUE
mode(df[[1]]) # -> [1] "character"
mode(df[[3]]) # -> [1] "numeric"
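Atomic vectors can only store one kind of object. A "row-oriented" data.frame would demand that data frames be composed of lists instead, one list per row, because a row mixes types. Now imagine what the performance of a column-wise operation like
mean(df[[3]])
would be in a list-based data frame: a column held as an atomic vector is one contiguous block of a single type, whereas a row-based layout would have to visit every row list and extract one element from each. (Random access itself is O(1) for both atomic vectors and R's generic lists; it is O(n) only for linked structures such as pairlists.)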
3) Any reference to related design document or article of data structure design would be appreciated.
http://adv-r.had.co.nz/Data-structures.html#data-frames
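A short sketch of what "list of columns" means in practice, reusing the df from the answer (stringsAsFactors = FALSE is added here so the character columns behave the same on older R versions; col_types and row1 are illustrative names):

```r
df <- data.frame(a = c("the", "quick"), b = c("brown", "fox"), c = 1:2,
                 stringsAsFactors = FALSE)

# The list length is the number of columns, not the number of rows.
length(df)

# Each list element is one column with a single storage type.
col_types <- vapply(df, typeof, character(1))
col_types

# A row mixes types, so extracting one gives a one-row data frame
# (still a list of columns), not an atomic vector.
row1 <- df[1, ]
is.data.frame(row1)
```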

Find multiple consecutive empty lines

I'm trying to chop up a text file into the articles it contains. Usually this is done by identifying a pattern each article begins with. Unfortunately the database I downloaded the articles from doesn't have that. The only pattern I can find is that after each article there are 3 empty lines.
How could I identify three consecutive empty lines?
I know that I can find empty lines with:
Beginnings <- grep('^$', Lines.i)
Beginnings then looks like
> Beginnings[1:50]
[1] 1 2 3 6 8 10 12 13 40 41 42 43 45 49 50 51 53 54 62 63 64 65 67
[24] 69 70 110 111 112 113 115 117 121 122 123 125 131 132 133 135 137 138 150 151 152 153 155
[47] 157 158 169 170
You can see that the first article starts after 1 2 3 and the next one after 41 42 43.
So my idea was to just add the newline expression to the pattern
Beginnings <- grep('^$\n^$\n^$\n', Lines.i)
But this does not work. I would be grateful for any suggestions!
You may try rle
which(inverse.rle(within.list(rle(!nzchar(v1)),
values[lengths<3 & values] <- FALSE)))
#[1] 3 4 5 9 10 11 12
data
v1 <- c('ard', 'b', '', '', '', 'rr', '', 'fr', '', '', '', '', 'gh', 'd')
Here's a solution for extracting the article lines only. Turned out much more complex and cryptic than I'd been hoping, but I'm pretty sure it works. Also, thanks to akrun for the test data.
v1 <- c('ard','b','','','','rr','','fr','','','','','gh','d');
ind <- with(rle(c(rep(FALSE,3),nzchar(v1),rep(FALSE,3))), {
  n <- length(lengths); pos <- cumsum(lengths[-n]);  ## run end positions in the padded vector
  data.frame(start=pos[values[-1]&!values[-n]&lengths[-n]>=3]-2,  ## text run preceded by >=3 blanks
             end=pos[values[-n]&!values[-1]&lengths[-1]>=3]-3)   ## text run followed by >=3 blanks
});
articles <- lapply(1:nrow(ind),function(r) v1[ind[r,'start']:ind[r,'end']]);
v1;
## [1] "ard" "b" "" "" "" "rr" "" "fr" "" "" "" "" "gh" "d"
ind;
## start end
## 1 1 2
## 2 6 8
## 3 13 14
articles;
## [[1]]
## [1] "ard" "b"
##
## [[2]]
## [1] "rr" "" "fr"
##
## [[3]]
## [1] "gh" "d"
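The grep attempt in the question cannot work because grep tests each element of Lines.i on its own, and a line read from a file contains no newline to match. A more step-by-step sketch of the rle idea, using the same test vector; the names blank, runs, is_sep, sep_starts and article_id are illustrative:

```r
v1 <- c('ard', 'b', '', '', '', 'rr', '', 'fr', '', '', '', '', 'gh', 'd')

blank <- !nzchar(v1)
runs <- rle(blank)

# First line number of every run, and which runs are separators
# (blank runs of length three or more).
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1
is_sep <- runs$values & runs$lengths >= 3
sep_starts <- starts[is_sep]

# Expand back to one flag per line, then number the articles:
# the counter goes up each time a separator run has just ended.
sep_line <- rep(is_sep, runs$lengths)
article_id <- cumsum(c(FALSE, diff(sep_line) == -1)) + 1
articles <- split(v1[!sep_line], article_id[!sep_line])
articles
```

Single blank lines inside an article (such as the one between 'rr' and 'fr') are kept, since only runs of at least three count as separators.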

Converting probe ids to entrez ids from a list of lists

The conversion of probe ids to entrez ids is quite straightforward:
i1<-c("246653_at", "246897_at", "251347_at", "252988_at", "255528_at", "256535_at", "257203_at", "257582_at", "258807_at", "261509_at", "265050_at", "265672_at")
library(ath1121501.db)  # Bioconductor annotation package providing the probe-to-gene mapping
select(ath1121501.db, i1, "ENTREZID", "PROBEID")
PROBEID ENTREZID
1 246653_at 833474
2 246897_at 832631
3 251347_at 825272
4 252988_at 829998
5 255528_at 827380
6 256535_at 840223
7 257203_at 821955
8 257582_at 841494
9 258807_at 819558
10 261509_at 843504
11 265050_at 841636
12 265672_at 817757
But I am unsure how to do this for a long list of lists resulting from a clustering, and how to store the result as a list of ENTREZ ids instead of probe ids:
For instance:
[[1]]
247964_at 248684_at 249126_at 249214_at 250223_at 253620_at 254907_at 259897_at 261256_at 267126_s_at
28 40 44 45 54 95 108 152 171 229
[[2]]
248230_at 250869_at 259765_at 265948_at 266221_at
33 64 151 216 221
[[3]]
245385_at 247282_at 248967_at 250180_at 250881_at 251073_at 53874_at 256093_at 257054_at 260007_at
5 22 42 52 65 67 101 117 125 155
261868_s_at 263136_at 267497_at
181 195 232
It should be something like
[[1]]
"835761","834904","834356","834281","831256","829175","826721","843479","837084","816891","816892"
and similarly for the other elements of the list.
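One possible approach is to apply the lookup to the names of each cluster with lapply. The sketch below is self-contained, so it replaces the annotation database with a small stand-in lookup vector (probe_map, with values copied from the select() output above) and uses two made-up clusters; both are assumptions for illustration only.

```r
# Stand-in for the PROBEID -> ENTREZID mapping returned by select().
probe_map <- c("246653_at" = "833474",
               "246897_at" = "832631",
               "251347_at" = "825272")

# Hypothetical clustering result: named vectors whose names are probe ids.
clusters <- list(c("246653_at" = 1, "251347_at" = 3),
                 c("246897_at" = 2))

# For each cluster, look up the ENTREZ id of every probe name.
entrez <- lapply(clusters, function(cl) unname(probe_map[names(cl)]))
entrez
```

With the real annotation package, the body of the function would presumably call select(ath1121501.db, names(cl), "ENTREZID", "PROBEID") and keep its ENTREZID column. A probe that maps to several genes then yields several rows, which may explain why the desired output lists more ids than the cluster has probes.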

Avoid quotation marks in column and row names when using write.table [duplicate]

This question already has an answer here:
Delete "" from csv values and change column names when writing to a CSV
(1 answer)
Closed 5 years ago.
I have the following data in a file called "data.txt":
pid 1 2 4 15 18 20
1_at 100 200 89 189 299 788
2_at 8 78 33 89 90 99
3_xt 300 45 53 234 89 34
4_dx 49 34 88 8 9 15
The data is separated by tabs.
Now I want to extract some columns from that table, based on the contents of a CSV file called "vector.csv", which contains the following data:
18,1,4,20
So I want to end up with a modified file "datamod.txt", separated by tabs, that would be:
pid 18 1 4 20
1_at 299 100 89 788
2_at 90 8 33 99
3_xt 89 300 53 34
4_dx 9 49 88 15
I have made, with some help, the following code:
fileName="vector.csv"
con=file(fileName,open="r")
controlfile<-readLines(con)
controls<-controlfile[1]
controlins<-controlfile[2]
test<-paste("pid",controlins,sep=",")
test2<-c(strsplit(test,","))
test3<-c(do.call("rbind",test2))
df<-read.table("data.txt",header=T,check.names=F)
CC <- sapply(df, class)
CC[!names(CC) %in% test3] <- "NULL"
df <- read.table("data.txt", header=T, colClasses=CC,check.names=F)
df<-df[,test3]
write.table(df,"datamod.txt",row.names=FALSE,sep="\t")
The problem that I got is that my resulting file has the following format:
"pid" "18" "1" "4" "20"
"1_at" 299 100 89 788
"2_at" 90 8 33 99
"3_xt" 89 300 53 34
"4_dx" 9 49 88 15
The question I have is how to avoid the quotation marks ("") that appear in my saved file, so that the data appears the way I want.
Any help?
Thanks
To quote from the help file for write.table:
quote
a logical value (TRUE or FALSE) or a numeric vector. If TRUE,
any character or factor columns will be surrounded by double quotes.
If a numeric vector, its elements are taken as the indices of columns
to quote. In both cases, row and column names are quoted if they are
written. If FALSE, nothing is quoted.
Therefore
write.table(df,"datamod.txt",row.names=FALSE,sep="\t", quote = FALSE)
should work nicely.
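A quick way to check the effect of quote = FALSE is to write a small table to a temporary file and read the raw lines back. The two-row data frame and the out path below are illustrative stand-ins for the real data:

```r
# Small stand-in for the reordered table; check.names = FALSE keeps the
# numeric column names as they are.
df <- data.frame(pid = c("1_at", "2_at"),
                 `18` = c(299, 90),
                 `1` = c(100, 8),
                 check.names = FALSE, stringsAsFactors = FALSE)

out <- tempfile(fileext = ".txt")
write.table(df, out, row.names = FALSE, sep = "\t", quote = FALSE)

# Neither the header nor the character column is wrapped in quotes.
written <- readLines(out)
written
```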
