Stacking columns with similar names in R - r

I have a CSV file whose awful format I cannot change (simplified here):
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"
My desired output is a new CSV containing:
inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"
Basically:
lowercase the headers
strip off header prefixes and preserve them by adding them to a new column
remove header repetitions in later rows
stack each column that shares the latter part of their names (e.g. a_One and b_One values should be merged into the same column).
During this process, preserve the Inc value from the original row (there may be more than one row like this in various places).
With caveats:
I don't know the column names ahead of time (many files, many different columns). These need to be parsed if they are to be used as logic for stripping the repetitious header rows.
There may or may not be more than one column with properties like Inc that need to be preserved when everything gets stacked. Generally, Inc represents any column that does not have a prefix like a_ or b_. I have a regex to strip out these prefixes already.
So far, I've accomplished this:
> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
4 Inc a_One a_Two a_Three b_One b_Two b_Three
5 3 9 9.5 15 Things 10 10.5 30 Things
> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4
> filwip <- rawwip[-skips,]
> filwip
V1 V2 V3 V4 V5 V6 V7
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
5 3 9 9.5 15 Things 10 10.5 30 Things
> rawwip[1,]
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
But then when I try to apply a tolower() to these strings, I get:
> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"
And this is quite unexpected.
So my questions are:
1) How can I gain access to the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulating functions?
2) Once I've done that, what's the most effective way to stack the columns with shared names while preserving the inc value for each row?
Bear in mind, there will be well over a thousand repetitious columns that can be filtered down to perhaps 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.

You can use the base reshape() function. For example with the input
dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
inc,a_one,a_two,a_three,b_one,b_two,b_three
3,9,9.5,"15 Things",10,10.5,"30 Things"')
you can do
dx <- reshape(subset(dd, Inc!="inc"),
varying=Map(function(x) paste(c("a","b"), x, sep="_"), c("One","Two","Three")),
v.names=c("One","Two","Three"),
idvar="Inc",
timevar="label",
times = c("a","b"),
direction="long")
dx
to get
Inc label One Two Three
1.a 1 a 1 1.5 5 Things
2.a 2 a 5 5.5 10 Things
3.a 3 a 9 9.5 15 Things
1.b 1 b 2 2.5 10 Things
2.b 2 b 6 6.5 20 Things
3.b 3 b 10 10.5 30 Things
Because your input data is messy (embedded headers), this creates everything as factors. You could try to convert to proper data types with
dx[]<-lapply(lapply(dx, as.character), type.convert)

I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.
Here's the approach. I'm assuming your data is stored in a file called "somedata.txt" in your working directory.
The packages we need:
library(splitstackshape) # for merged.stack
library(SOfun) # for read.mtable
First, grab a vector of the names. While we are at it, change the name structure from "a_one" to "one_a" -- it's a much more convenient format for both merged.stack and reshape.
theNames <- gsub("(.*)_(.*)", "\\2_\\1",
tolower(scan(what = "", sep = ",",
text = readLines("somefile.txt", n = 1))))
Second, use read.mtable to read the data in. We create the data chunks by identifying all the lines that start with letters. You can use a more specific regular expression if that doesn't match your actual data.
This will create a list of data.frames, so we use do.call(rbind, ...) to put it together in a single data.frame:
theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",")
theData <- setNames(do.call(rbind, theData), theNames)
This is what the data now look like:
theData
# inc one_a two_a three_a one_b two_b three_b
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1 1 1 1.5 5 Things 2 2.5 10 Things
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2 2 5 5.5 10 Things 6 6.5 20 Things
# inc,a_one,a_two,a_three,b_one,b_two,b_three 3 9 9.5 15 Things 10 10.5 30 Things
From here, you can use merged.stack from "splitstackshape"....
merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_")
# inc .time_1 one two three
# 1: 1 a 1 1.5 5 Things
# 2: 1 b 2 2.5 10 Things
# 3: 2 a 5 5.5 10 Things
# 4: 2 b 6 6.5 20 Things
# 5: 3 a 9 9.5 15 Things
# 6: 3 b 10 10.5 30 Things
... or reshape from base R:
reshape(theData, direction = "long", idvar = "inc",
varying = 2:ncol(theData), sep = "_")
# inc time one two three
# 1.a 1 a 1 1.5 5 Things
# 2.a 2 a 5 5.5 10 Things
# 3.a 3 a 9 9.5 15 Things
# 1.b 1 b 2 2.5 10 Things
# 2.b 2 b 6 6.5 20 Things
# 3.b 3 b 10 10.5 30 Things

Related

Sorting dataframe in R in reverse order - Column name as a variable [duplicate]

I've looked and looked and the answer either does not work for me, or it's far too complex and unnecessary.
I have data, it can be any data, here is an example
chickens <- read.table(textConnection("
feathers beaks
2 3
6 4
1 5
2 4
4 5
10 11
9 8
12 11
7 9
1 4
5 9
"), header = TRUE)
I need to, very simply, sort the data for the 1st column in descending order. It's pretty straightforward, but I have found two things below that both do not work and give me an error which says:
"Error in order(var) : Object 'var' not found.
They are:
chickens <- chickens[order(-feathers),]
and
chickens <- chickens[sort(-feathers),]
I'm not sure what I'm not doing, I can get it to work if I put the df name in front of the varname, but that won't work if I put an minus sign in front of the varname to imply descending sort.
I'd like to do this as simply as possible, i.e. no boolean logic variables, nothing like that. Something akin to SPSS's
SORT BY varname (D)
The answer is probably right in front of me, I apologize for the basic question.
Thank you!
You need to use dataframe name as prefix
chickens[order(chickens$feathers),]
To change the order, the function has decreasing argument
chickens[order(chickens$feathers, decreasing = TRUE),]
The syntax in base R, needs to use dataframe name as a prefix as #dmi3kno has shown. Or you can also use with to avoid using dataframe name and $ all the time as mentioned by #joran.
However, you can also do this with data.table :
library(data.table)
setDT(chickens)[order(-feathers)]
#Also
#setDT(chickens)[order(feathers, decreasing = TRUE)]
# feathers beaks
# 1: 12 11
# 2: 10 11
# 3: 9 8
# 4: 7 9
# 5: 6 4
# 6: 5 9
# 7: 4 5
# 8: 2 3
# 9: 2 4
#10: 1 5
#11: 1 4
and dplyr :
library(dplyr)
chickens %>% arrange(desc(feathers))

Adding a header to columns based on the values of rows

I have the following different dataframes:
df1:
Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10
Optimise Reduced A 5 5.8 3 12
df2:
Convert Reduced A 14 25
Configure Reduced A 14.7 6.8
Race Reduced A 2 6.3
df3:
Abstract Reduced A 8 7.5 9 8 4.5 11
Follower Reduced A 5.5 6 14 19 6 13.5
I would like to add a header for each of the dataframes where the column names are:
Class Technique Algorithm 1 2 3 ....
My issue is not with the first three columns but with the rest of the columns (integer values). As you see in the example, the number of columns for these integer values differs which makes it difficult to me how to name these columns (i.e., starting form 1 until the last value, for example, 4 in df1).
Can someone help me please in solving this issue?
Here is a function for you. The first argument, dat, is your data frame. The second argument, chr, is the vector names for your first few columns.
header_fun <- function(dat, chr = c("Class", "Technique", "Algorithm")){
dat2 <- setNames(dat, c(chr, 1:(ncol(dat) - length(chr))))
return(dat2)
}
The function will return a new data frame with the updated header.
header_fun(df1)
# Class Technique Algorithm C1 C2 C3 C4
# 1 Scribe Reduced A 5.0 2.5 3 10
# 2 Reader Reduced A 9.2 4.0 12 10
# 3 Optimise Reduced A 5.0 5.8 3 12
header_fun(df2)
# Class Technique Algorithm 1 2
# 1 Convert Reduced A 14.0 25.0
# 2 Configure Reduced A 14.7 6.8
# 3 Race Reduced A 2.0 6.3
header_fun(df3)
# Class Technique Algorithm 1 2 3 4 5 6
# 1 Abstract Reduced A 8.0 7.5 9 8 4.5 11.0
# 2 Follower Reduced A 5.5 6.0 14 19 6.0 13.5
DATA
df1 <- read.table(text = "Scribe Reduced A 5 2.5 3 10
Reader Reduced A 9.2 4 12 10
Optimise Reduced A 5 5.8 3 12",
header = FALSE, stringsAsFactors = FALSE)
df2 <- read.table(text = "Convert Reduced A 14 25
Configure Reduced A 14.7 6.8
Race Reduced A 2 6.3",
header = FALSE, stringsAsFactors = FALSE)
df3 <- read.table(text = "Abstract Reduced A 8 7.5 9 8 4.5 11
Follower Reduced A 5.5 6 14 19 6 13.5",
header = FALSE, stringsAsFactors = FALSE)

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lies between the start and stop values. If so I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match. For every value from possible that falls between I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match I'd like a 1, otherwise a 0.
Neither ID, not SNR are important. I'm not looking to match on ID. Instead I wand to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptulised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with; the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407. It was derived from the x and y coordinates for two datasets of points, one with 41 and another with 110,357,407 and the values of the rows represent the distances between these two sets of points (the distance of each point on list 1 to every single point on list 2). The first column is a list of points (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to change this matrix into just 3 columns, the first column would be similar to the first column of the matrix with the 110,357,407 rows, the second would be the 41 data points (each matched up with a distance each of the first points to all of the others) and the third would be the distance between those points. So it would look something like this
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between the back and all of the first value of pres are complete, pres will change to 2 and will eventually work its way up to 41)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
row = rep(rownames(output3), ncol(output3)),
value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have > 4.2 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each to a file when they're completed.

How to create a dataframe from a site which appears to store each row as a list?

Thanks in advance for the help. Essentially, I was testing obtaining data off websites, when I ran across this one: http://lib.stat.cmu.edu/datasets/sleep. I proceeded in the following fashion:
(A) Get a sense of the data (in R): I essentially typed the following
readLines("http://lib.stat.cmu.edu/datasets/sleep", n=100)
(B) I notice that the data I would want really starts on the 51st line, so I write this code:
sleep_table <- read.table("http://lib.stat.cmu.edu/datasets/sleep", header=FALSE, skip=50)
(C) I get the following error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 14 elements
Where I got the above approach was from another question on stack overflow (import dat file into R). However, this question deals with a .dat file and my question is with data at a particular URL. What I'd like to know is how do I get the data from line 51 down (if you used readLines) into a dataframe with no headers (I'll add those in later with a colnames(sleep_table) <- c("etc.", "etc2", "etc3"...).
Use the fact that the good lines end in a one digit field and that every field except the first is numeric:
URL <- "http://lib.stat.cmu.edu/datasets/sleep"
L <- readLines(URL)
# lines ending in a one digit field
good.lines <- grep(" \\d$", L, value = TRUE)
# insert commas before numeric fields
lines.csv <- gsub("( [-0-9.])", ",\\1", good.lines)
# re-read
DF <- read.table(text = lines.csv, sep = ",", as.is = TRUE, strip.white = TRUE,
na.strings = "-999.0")
If you are interested in the headings too here is some code for that. Omit the rest if you are not interested in headings.
# get headings - of the lines starting at left edge these are the ncol(DF) lines
# starting with the one containing "species"
headings0 <- grep("^[^ ]", L, value = TRUE)
i <- grep("species", headings0)
headings <- headings0[seq(i, length = ncol(DF))]
# The headings are a bit long so we shorten them to the first word
names(DF) <- sub(" .*$", "", headings)
This gives:
> head(DF)
species body brain slow paradoxical total maximum
1 African elephant 6654.000 5712.0 NA NA 3.3 38.6
2 African giant pouched rat 1.000 6.6 6.3 2.0 8.3 4.5
3 Arctic Fox 3.385 44.5 NA NA 12.5 14.0
4 Arctic ground squirrel 0.920 5.7 NA NA 16.5 NA
5 Asian elephant 2547.000 4603.0 2.1 1.8 3.9 69.0
6 Baboon 10.550 179.5 9.1 0.7 9.8 27.0
gestation predation sleep overall
1 645 3 5 3
2 42 3 1 3
3 60 1 1 1
4 25 5 2 3
5 624 3 5 4
6 180 4 4 4
UPDATE: minor simplification in white space trimming
UPDATE 2: shorten headings
UPDATE 3: added na.strings = "-999.0"
Since "Lesser short-tailed shrew" and "Pig" have unequal number of separator spaces, and the other fields are not tab-separated, read.table will not help. But luckily, this seems to be fixed space.
Note that the solution is not complete, because there are a few nasty lines at the end of the record, and you probably have to convert the characters to number, but that's left as an easy exercise.
# 123456789012345689012345678901234568901234567890123456890123456789012345689012345678901234568901234567890123456890
# African elephant 6654.000 5712.000 -999.0 -999.0 3.3 38.6 645.0 3 5 3
# African giant pouched rat 1.000 6.600 6.3 2.0 8.3 4.5 42.0 3 1 3
sleep_table <- read.fwf("http://lib.stat.cmu.edu/datasets/sleep", widths = c(25,rep(8,10)),
header=FALSE, skip=51)

Resources