Importing one long line of data with spaces into R

This question is a followup to my previous question, Importing one long line of data into R.
I have a large data file consisting of a single line of text. The format resembles
Cat 14 15 Horse 16
I'd eventually like to get it into a data.frame. In the above example I would end up with two variables, Animal and Number. The number of characters in each "line" is fixed, so in the example above each record contains 11 characters, with the animal in the first 7 and the number in the next 4.
So what I'd like is a data frame that looks like:
Animal Number
Cat 14
NA 15
Horse 16

You can read the file with read.fwf, specifying the column widths and the number of columns:
inp.fwf <- read.fwf("tmp.txt", widths = rep(c(7, 4), times = 3), as.is = TRUE)
Here the argument times = 3 works for your sample data; for your real file, you'll have to indicate how many pairs there are and change times accordingly. If you don't know how many entries you have, this might work:
inp.rl <- readLines("tmp.txt")
nchar(inp.rl)/11
This will give you a data.frame with one row and many columns. You need to break that into many rows and two columns:
inp.mat <- matrix(unlist(inp.fwf), byrow = TRUE, ncol = 2)
This will get you the correct shape for your data. Note that unlist is needed because inp.fwf is a data.frame, and that it coerces every value to character, so you'll want to convert the second column back with as.numeric and perhaps turn the animal names into factors. At this point all the data is in R, so you can easily tweak it.

Solution with vectorized substring function.
x <- readLines(textConnection("Cat 14 15 Horse 16 "))
idx <- seq.int(1,nchar(x),by=11)
vsubstr <- Vectorize(substr,vectorize.args=c("start","stop"))
dat <- data.frame(Animal= vsubstr(x,idx,idx+6),
Number= as.numeric(vsubstr(x,idx+7,idx+10)))
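As a side note, base R's substring() is already vectorized over its start and end positions, so the Vectorize wrapper is optional. The following is only a sketch: the sample line is built with sprintf under the assumption that animals are left-justified in 7 characters and numbers right-justified in 4, which the question does not state explicitly.

```r
# Hypothetical sample: each record is a 7-character animal field followed by
# a 4-character number field (the exact padding layout is an assumption).
records <- sprintf("%-7s%4d", c("Cat", "", "Horse"), c(14L, 15L, 16L))
line <- paste(records, collapse = "")

# Start position of every 11-character record
starts <- seq(1, nchar(line), by = 11)

# substring() is vectorized over 'first' and 'last'
animal <- trimws(substring(line, starts, starts + 6))
animal[animal == ""] <- NA   # blank animal fields become NA

dat <- data.frame(
  Animal = animal,
  Number = as.numeric(substring(line, starts + 7, starts + 10)),
  stringsAsFactors = FALSE
)
```

The blank animal field in the second record ends up as NA, matching the desired output in the question.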

I'm not sure what the 15 is about; from the way you described the data, it should be animal-space-count-space-animal...
Anyway, if the 15 should not be there, here is one approach.
list1<-"Cat 14 Horse 16"
x <- unlist(strsplit(list1, " "))
x <- as.data.frame(matrix(x, length(x)/2, 2, byrow = TRUE))
x[, 2] <- as.numeric(as.character(x[, 2]))
x[, 1] <- as.character(x[, 1])
names(x) <-c('animal', 'count')
x

Assume you have a text file, test.dat, with repeated Animal Number pairs.
x <- scan("test.dat", what=list("", 0))
my.df <- data.frame(Animal = x[[1]], Number = x[[2]])
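A self-contained variant of the same idea, reading from a string instead of a file so it can be run directly. The sample text is made up, and this sketch assumes every record has both an animal and a number (it would not handle the record with a blank animal from the question).

```r
# Hypothetical input: whitespace-separated Animal/Number pairs on one line
x <- scan(text = "Cat 14 Horse 16",
          what = list(Animal = "", Number = 0),
          quiet = TRUE)

# scan() returns a named list, which converts directly to a data frame
my.df <- as.data.frame(x, stringsAsFactors = FALSE)
```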

Tyler's use of read.fwf is perhaps cleaner, but here's another possible method.
x <- readLines(textConnection("Cat 14 15 Horse 16 "))
x <- matrix(strsplit(x, "")[[1]], nrow=11)
d <- data.frame(Animal = apply(x[1:7,], 2, paste, collapse=""),
Number = as.numeric(apply(x[8:11,], 2, paste, collapse="")))

Related

Separating a column by the first 3 characters

I have a set of data below and I would like to separate the first three characters from the bm_id column into a separate column with the rest of the characters in another column.
bm_id
1
popCL20TE
2
agrST20
3
agrST20-09SE
I have tried using solutions to a similar question asked on stack, however I end up making extra empty columns with my data remaining together.
bm_id[c('species', 'id')] <- tstrsplit(bm_id$bm_id, '(?<=.{3})', perl = TRUE)
same happens with this code
bm_id2 <- tidyr::separate(bm_id, bm_id, into = c("species", "id"), sep = 3)

How about substr?
df <- data.frame(vec= c("popCL20TE", "agrST20"))
df$first3 <- substr(df$vec, 1, 3)
df$last <- substr(df$vec, 4, nchar(df$vec))
df
vec first3 last
1 popCL20TE pop CL20TE
2 agrST20 agr ST20
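The same approach applied to all three values from the question might look like the sketch below. It uses substring(), whose end position defaults to the end of the string, so nchar() is not needed; the column names species and id are taken from the question's own attempt.

```r
bm_id <- data.frame(
  bm_id = c("popCL20TE", "agrST20", "agrST20-09SE"),
  stringsAsFactors = FALSE
)

# First three characters go to one column, the remainder to another
bm_id$species <- substr(bm_id$bm_id, 1, 3)
bm_id$id <- substring(bm_id$bm_id, 4)
```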

Why won't R recognize data frame column names within lists?

HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.

You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6

If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
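For the use case described in the question (writing many vectors of different lengths into data frame columns), a sketch along these lines may help. The names vars, mapping, colA and colB are hypothetical and only serve the illustration.

```r
# Keep the source vectors in a named list instead of as free-floating objects
vars <- list(varA = 1:6, varB = c(2.5, 3.5))

# Target data frame, pre-filled with NA
df <- data.frame(colA = rep(NA_real_, 6), colB = rep(NA_real_, 6))

# Hypothetical mapping from list element name to column name
mapping <- c(varA = "colA", varB = "colB")

for (v in names(mapping)) {
  values <- vars[[v]]                               # look up by name with [[
  df[[mapping[[v]]]][seq_along(values)] <- values   # shorter vectors leave NA
}

lengths(vars)   # length of every stored vector at once
```

Because both the list and the data frame are indexed by character strings, no get() or eval(parse()) is required.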

Need to find most common combination of letters

Let's say for simplicity that I have 10 rows of 5 characters where each character can be A-Z.
E.g.:
KJGXI
GDGQT
JZKDC
YOTQD
SSDIQ
PLUWC
TORHC
PFJSQ
IIZMO
BRPOJ
WLMDX
AZDIJ
ARNUA
JEXGA
VFPIP
GXOXM
VIZEM
TFVQJ
OFNOG
QFNJR
ZGUBZ
CCTMB
HZPGV
ORQTJ
I want to know which 3 letter combination is most common. However, the combination does not need to be in order, nor next to each other. E.g
ABCXY
CQDBA
=ABC
I could probably brute-force it with endless loops but I was wondering if there was a better way of doing it!

Here is a solution:
x <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC", "PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ",
"ARNUA", "JEXGA", "VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ", "CCTMB", "HZPGV", "ORQTJ")
temp <- do.call(cbind, lapply(strsplit(x, ""), combn, m = 3))
temp <- apply(temp, 2, sort)
temp <- apply(temp, 2, paste0, collapse = "")
sort(table(temp), decreasing = TRUE)
This returns the number of times each combination appears. You can then use names(which.max(sort(table(temp), decreasing = TRUE))) to get the most common combination (in this case, "FJQ").
In this case, two combinations appear 3 times, so you can do
result <- sort(table(temp), decreasing = TRUE)
names(which(result == max(result)))
# [1] "FJQ" "IMZ"
to get both of the combinations that appear most often.
The code works as follows:
split each element of x into its five letters, then generate every possible combination of 3 of those 5 letters
sort each of those combinations alphabetically
paste the 3 letters together
count the occurrences of each combination, and sort the result

I would split each string into letters, sort them, then use combn to get all combinations. Use paste0 to collapse these back into strings and count.
txt <- c("KJGXI", "GDGQT", "JZKDC", "YOTQD", "SSDIQ", "PLUWC", "TORHC",
"PFJSQ", "IIZMO", "BRPOJ", "WLMDX", "AZDIJ", "ARNUA", "JEXGA",
"VFPIP", "GXOXM", "VIZEM", "TFVQJ", "OFNOG", "QFNJR", "ZGUBZ",
"CCTMB", "HZPGV", "ORQTJ")
txt2 <- strsplit(txt, split = "")
txt2 <- lapply(txt2, sort)
txt3 <- lapply(txt2, combn, m = 3)
txt4 <- lapply(txt3, function(x){apply(x, 2, paste0, collapse = "")})
table(unlist(txt4))
Several steps here could be combined.
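One way the steps might be combined is sketched below. The function name most_common_combos is made up for this example; it returns every combination tied for the highest count.

```r
# Return the m-letter combinations (order-independent) that occur most often
most_common_combos <- function(strings, m = 3) {
  combos <- unlist(lapply(strsplit(strings, ""), function(chars) {
    # Sorting first makes "CBA" and "ABC" produce the same key
    combn(sort(chars), m, FUN = paste, collapse = "")
  }))
  counts <- table(combos)
  names(counts)[as.vector(counts) == max(counts)]
}

# The small example from the question
most_common_combos(c("ABCXY", "CQDBA"))
```

Note that a string with a repeated letter contributes the same combination more than once, exactly as in the step-by-step versions above.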

how can i read a csv file containing some additional text data

I need to read a csv file in R, but some rows of the file contain text information instead of comma-separated values, so I cannot read the file using read.csv(fileName).
The content of the file is as follows:
name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz
I need to store only the values belonging to each name/date pair as a data frame. How can I read the file to do that?
Actually my required output is
>dataFrame1
abc,2,saa
anan,3,ds
ama,ds,az
>dataFrame2
snans,32,asa
asa,2,saz

You can read the data with scan and use grep and sub functions to extract the important values.
The text:
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"
These commands generate a data frame with name and date values.
# read the text
lines <- scan(text = text, what = character())
# find strings starting with 'name' or 'date'
nameDate <- grep("^name|^date", lines, value = TRUE)
# extract the values
values <- sub("^name:|^date:", "", nameDate)
# create a data frame
dat <- as.data.frame(matrix(values, ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("name", "date"))))
The result:
> dat
name date
1 russel 21-2-1991
2 rus 23-3-1998
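If the dates are needed as real Date objects rather than strings, the header lines could be parsed along these lines. This is only a sketch; it assumes the dates are written day-month-year, which the two sample values suggest but the question does not state.

```r
headers <- c("name:russel date:21-2-1991", "name:rus date:23-3-1998")

meta <- data.frame(
  # text after 'name:' up to the first whitespace
  name = sub("^name:(\\S+).*$", "\\1", headers),
  # text after 'date:', parsed as day-month-year (assumed format)
  date = as.Date(sub("^.*date:(\\S+)$", "\\1", headers), format = "%d-%m-%Y"),
  stringsAsFactors = FALSE
)
```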
Update
To extract the values from the strings, which do not contain name and date information, the following commands can be used:
# read data
lines <- readLines(textConnection(text))
# split lines
splitted <- strsplit(lines, ",")
# find positions of 'name' lines
idx <- grep("^name", lines)[-1]
# create grouping variable
grp <- cut(seq_along(lines), c(0, idx, length(lines)))
# extract values
values <- tapply(splitted, grp, FUN = function(x)
lapply(x, function(y)
if (length(y) == 3) y))
# create a list of data frames
dat <- lapply(values, function(x) as.data.frame(matrix(unlist(x),
ncol = 3, byrow = TRUE)))
The result:
> dat
$`(0,7]`
V1 V2 V3
1 abc 2 saa
2 anan 3 ds
3 ama ds az
$`(7,9]`
V1 V2 V3
1 snans 32 asa
2 asa 2 saz

I would first read the entire file as a character vector, i.e. one string per line of the file; this can be done using readLines. Next you have to find the places where the data for a new date starts, i.e. look for the ",," lines (see grep for that). Then take the first entry of each data block, e.g. using str_extract from the stringr package. Finally, you need to split all the remaining data strings; see strsplit for that.
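The steps described above can be sketched in base R as follows. Instead of stringr and strsplit, this variant groups the lines by their preceding header with cumsum and lets read.csv parse each block; that substitution is a choice made for this example, not part of the answer above.

```r
text <- "name:russel date:21-2-1991
abc,2,saa
anan,3,ds
ama,ds,az
,,
name:rus date:23-3-1998
snans,32,asa
asa,2,saz"

lines <- strsplit(text, "\n")[[1]]

is_header <- grepl("^name:", lines)
block <- cumsum(is_header)                     # block number for every line
keep <- !is_header & !grepl("^,*$", lines)     # drop headers and empty ',,' rows

# One data frame per name/date block
frames <- lapply(split(lines[keep], block[keep]), function(rows) {
  read.csv(text = rows, header = FALSE, stringsAsFactors = FALSE)
})
```

With a real file, lines would come from readLines("yourfile.csv") instead of the inline string; the file name here is a placeholder.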

apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average of each row of the replicates in dat$a, dat$b and dat$c, and one that is the average of each row for dat$d, dat$e and dat$f?
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but is obviously no good for my much larger data frame...
data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.

This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. Here's a second, though still unsatisfying, method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
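Another way to build the column groups, sketched here as an alternative to the NA-padded index matrix, is a grouping vector combined with split.default, which splits a data frame by columns. The variable names group_size and groups are made up for this example.

```r
set.seed(1)
dat <- data.frame(matrix(rnorm(16 * 6), ncol = 6))

group_size <- 3
# 1 1 1 2 2 2 ...: group number for every column
groups <- ceiling(seq_along(dat) / group_size)

# split.default() splits the data frame column-wise; rowMeans runs per group
means <- sapply(split.default(dat, groups), rowMeans)
```

If the number of columns is not a multiple of group_size, the last group is simply smaller, so no padding or NA removal is needed.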

A similar question was asked here by @david: averaging every 16 columns in r (now closed), which I answered by adapting @TylerRinker's answer above, following a suggestion by @joran and @Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)

Mean of each row across vectors a, b, c:
rowMeans(dat[1:3])
Mean of each row across vectors d, e, f:
rowMeans(dat[4:6])
All in one call:
results <- cbind(rowMeans(dat[1:3]), rowMeans(dat[4:6]))
If you only know the names of the columns and not their order, you can use:
rowMeans(cbind(dat["a"], dat["b"], dat["c"]))
rowMeans(cbind(dat["d"], dat["e"], dat["f"]))
# I don't know how much this costs in speed, but it should still be quick

The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))

Inspired by @joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p cols to simulate a realistic data set (following @TylerRinker's answer above and unlike my poor example in the question)
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in the groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing)
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
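Both downsides can be avoided by keeping the grouping in a separate vector and summing with rowsum() on the transposed data. This is a sketch of an alternative, not what the approach above does; it assumes every group has exactly n columns.

```r
set.seed(42)
p <- 99   # how many columns?
n <- 3    # how many consecutive columns per group?
dat <- data.frame(matrix(rnorm(4 * p), ncol = p))

# Grouping kept separately, so the original column names are untouched
groups <- rep(seq_len(p / n), each = n)

# rowsum() adds up rows by group; transpose so columns become rows first
sums <- rowsum(t(as.matrix(dat)), group = groups)

# Divide by the group size to get means, then transpose back
dat.avs <- t(sums / n)
```

Because groups is numeric, the result columns come out in numeric order (1, 2, 3, ...) rather than in the order of sorted character strings.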
Further to @joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!

There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, a problem known from combinatorics.
combinations <- combn(colnames(df),2,function(x) rowMeans(df[x]))
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. combn calls the function once per combination, so no explicit loop is needed, although that does not by itself make it faster than the rowMeans-based answers above. If the order of the columns matters, then you need permutations rather than combinations, for example via combinat::permn.
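A small worked example, using a made-up three-column data frame, shows what the call returns: a matrix with one column per combination. Labelling the columns with the pair names is an addition for readability.

```r
# Hypothetical toy data
df <- data.frame(a = c(1, 2), b = c(3, 4), c = c(5, 6))

# Row means for every unordered pair of columns: (a,b), (a,c), (b,c)
pair_means <- combn(colnames(df), 2, function(cols) rowMeans(df[cols]))

# Label each result column with the pair it came from
pairs <- combn(colnames(df), 2)
colnames(pair_means) <- apply(pairs, 2, paste, collapse = "_")
```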
