Example data:
>data.frame("A" = c(20,40,53), "B" = c(40,11,60))
What's the easiest way in R to get from this
A B
1 20 40
2 40 11
3 53 60
to this?
A B
1 2.0 3.5
2 3.5 1.0
3 5.0 6.0
I couldn't find a way to make rank() or frank() work across multiple rows/columns at once. Googling things like "r rank dataframe" or "r rank multiple rows" only turned up questions on how to rank each row or column individually, which is odd, as I suspect this has been answered before.
Try rank() as shown below:
df[] <- rank(df)
or
df <- list2DF(relist(rank(df),skeleton = unclass(df)))
and you will get
> df
A B
1 2.0 3.5
2 3.5 1.0
3 5.0 6.0
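If passing a data frame straight to rank() feels fragile, the same table can be built by flattening the columns first. This is a minimal sketch, assuming every column is numeric; the names ranks and ranked are only illustrative.

```r
df <- data.frame(A = c(20, 40, 53), B = c(40, 11, 60))

# Flatten column by column, then rank all values together.
# Ties receive the average rank by default (ties.method = "average").
ranks <- rank(unlist(df, use.names = FALSE))

# Put the ranks back into the original shape and column names.
ranked <- as.data.frame(matrix(ranks, nrow = nrow(df),
                               dimnames = list(NULL, names(df))))
ranked
```

Because unlist() walks the columns in order, filling the matrix column-wise puts each rank back in the cell it came from.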
I have a CSV file whose awful format I cannot change (simplified here):
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"
My desired output is a new CSV containing:
inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"
Basically:
lowercase the headers
strip off header prefixes and preserve them by adding them to a new column
remove header repetitions in later rows
stack the columns that share the latter part of their names (e.g. a_One and b_One values should be merged into the same column).
During this process, preserve the Inc value from the original row (there may be more than one row like this in various places).
With caveats:
I don't know the column names ahead of time (many files, many different columns). These need to be parsed if they are to be used as logic for stripping the repetitious header rows.
There may or may not be more than one column with properties like Inc that need to be preserved when everything gets stacked. Generally, Inc represents any column that does not have a prefix like a_ or b_. I have a regex to strip out these prefixes already.
So far, I've accomplished this:
> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
4 Inc a_One a_Two a_Three b_One b_Two b_Three
5 3 9 9.5 15 Things 10 10.5 30 Things
> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4
> filwip <- rawwip[-skips,]
> filwip
V1 V2 V3 V4 V5 V6 V7
2 1 1 1.5 5 Things 2 2.5 10 Things
3 2 5 5.5 10 Things 6 6.5 20 Things
5 3 9 9.5 15 Things 10 10.5 30 Things
> rawwip[1,]
V1 V2 V3 V4 V5 V6 V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three
But then when I try to apply a tolower() to these strings, I get:
> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"
And this is quite unexpected.
So my questions are:
1) How can I gain access to the header strings in rawwip[1,] so that I can reformat them with tolower() and other string-manipulating functions?
2) Once I've done that, what's the most effective way to stack the columns with shared names while preserving the inc value for each row?
Bear in mind, there will be well over a thousand repetitious columns that can be filtered down to perhaps 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.
You can use the base reshape() function. For example with the input
dd<-read.csv(text='Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
inc,a_one,a_two,a_three,b_one,b_two,b_three
3,9,9.5,"15 Things",10,10.5,"30 Things"')
you can do
dx <- reshape(subset(dd, Inc != "inc"),
    varying = Map(function(x) paste(c("a", "b"), x, sep = "_"), c("One", "Two", "Three")),
    v.names = c("One", "Two", "Three"),
    idvar = "Inc",
    timevar = "label",
    times = c("a", "b"),
    direction = "long")
dx
to get
Inc label One Two Three
1.a 1 a 1 1.5 5 Things
2.a 2 a 5 5.5 10 Things
3.a 3 a 9 9.5 15 Things
1.b 1 b 2 2.5 10 Things
2.b 2 b 6 6.5 20 Things
3.b 3 b 10 10.5 30 Things
Because your input data is messy (embedded headers), every column is read as text (as factors in R versions before 4.0, where stringsAsFactors defaults to TRUE). You could try to convert to proper data types with
dx[] <- lapply(lapply(dx, as.character), type.convert, as.is = TRUE)
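On the first question, tolower(rawwip[1,]) returns "4" because each column was read as a factor, and converting a one-row data frame of factors to character yields the integer level codes rather than the labels. A minimal sketch of one way around this, assuming the file is read with header = FALSE and that repeated header rows match the first row in column 1; the names csv_text, raw, hdr and dat are only illustrative:

```r
csv_text <- 'Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"'

# Read everything as character so the header cells stay plain strings.
raw <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# unlist() turns the one-row data frame into a character vector.
hdr <- tolower(unlist(raw[1, ], use.names = FALSE))

# Drop every row that repeats the header, then apply the cleaned names.
is_header <- raw[[1]] == raw[1, 1]
dat <- raw[!is_header, ]
names(dat) <- hdr

# Recover numeric types now that the header text is gone.
dat[] <- lapply(dat, type.convert, as.is = TRUE)
rownames(dat) <- NULL
dat
```

From here, hdr is an ordinary character vector, so regular expressions for stripping prefixes can be applied to it directly.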
I would suggest a combination of read.mtable from my GitHub-only "SOfun" package and merged.stack from my "splitstackshape" package.
Here's the approach. I'm assuming your data is stored in a file called "somefile.txt" in your working directory.
The packages we need:
library(splitstackshape) # for merged.stack
library(SOfun) # for read.mtable
First, grab a vector of the names. While we are at it, change the name structure from "a_one" to "one_a" -- it's a much more convenient format for both merged.stack and reshape.
theNames <- gsub("(.*)_(.*)", "\\2_\\1",
                 tolower(scan(what = "", sep = ",",
                              text = readLines("somefile.txt", n = 1))))
Second, use read.mtable to read the data in. We create the data chunks by identifying all the lines that start with letters. You can use a more specific regular expression if that doesn't match your actual data.
This will create a list of data.frames, so we use do.call(rbind, ...) to put it together in a single data.frame:
theData <- read.mtable("somefile.txt", "^[A-Za-z]", header = FALSE, sep = ",")
theData <- setNames(do.call(rbind, theData), theNames)
This is what the data now look like:
theData
# inc one_a two_a three_a one_b two_b three_b
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.1 1 1 1.5 5 Things 2 2.5 10 Things
# Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three.2 2 5 5.5 10 Things 6 6.5 20 Things
# inc,a_one,a_two,a_three,b_one,b_two,b_three 3 9 9.5 15 Things 10 10.5 30 Things
From here, you can use merged.stack from "splitstackshape"....
merged.stack(theData, var.stubs = c("one", "two", "three"), sep = "_")
# inc .time_1 one two three
# 1: 1 a 1 1.5 5 Things
# 2: 1 b 2 2.5 10 Things
# 3: 2 a 5 5.5 10 Things
# 4: 2 b 6 6.5 20 Things
# 5: 3 a 9 9.5 15 Things
# 6: 3 b 10 10.5 30 Things
... or reshape from base R:
reshape(theData, direction = "long", idvar = "inc",
        varying = 2:ncol(theData), sep = "_")
# inc time one two three
# 1.a 1 a 1 1.5 5 Things
# 2.a 2 a 5 5.5 10 Things
# 3.a 3 a 9 9.5 15 Things
# 1.b 1 b 2 2.5 10 Things
# 2.b 2 b 6 6.5 20 Things
# 3.b 3 b 10 10.5 30 Things
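Since the column names are not known ahead of time, the prefixes and shared name parts can also be derived from the names themselves before calling reshape(). The following is a minimal base R sketch under stated assumptions: every stackable column looks like prefix_stub, columns to preserve contain no underscore, and every prefix carries the same set of stubs. The data frame wide and the helper names are hypothetical.

```r
wide <- data.frame(
  inc     = 1:3,
  a_one   = c(1, 5, 9),
  a_two   = c(1.5, 5.5, 9.5),
  a_three = c("5 Things", "10 Things", "15 Things"),
  b_one   = c(2, 6, 10),
  b_two   = c(2.5, 6.5, 10.5),
  b_three = c("10 Things", "20 Things", "30 Things"),
  stringsAsFactors = FALSE
)

stack_cols <- grep("_", names(wide), value = TRUE)  # columns to stack
id_cols    <- setdiff(names(wide), stack_cols)      # columns to preserve

prefixes <- unique(sub("_.*$", "", stack_cols))     # "a", "b"
stubs    <- unique(sub("^[^_]*_", "", stack_cols))  # "one", "two", "three"

# One list element per stub, holding that stub's column for each prefix,
# always in the same prefix order so values line up with `times`.
varying <- lapply(stubs, function(s) paste(prefixes, s, sep = "_"))

long <- reshape(wide,
                direction = "long",
                varying   = varying,
                v.names   = stubs,
                timevar   = "label",
                times     = prefixes,
                idvar     = id_cols)
rownames(long) <- NULL
long
```

If some prefixes lack some stubs, the missing columns would need to be added (for example filled with NA) before reshaping, since reshape() expects each element of varying to have one column per entry in times.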
Say I have a data frame with the contents:
Trial Person Time
1 John 1.2
2 John 1.3
3 John 1.1
1 Bill 2.3
2 Bill 2.5
3 Bill 2.7
and another data frame with the contents:
Person Offset
John 0.5
Bill 1.0
and I want to modify the original frame based on the appropriate value from the second (here, subtracting each person's Offset from Time). I could do this easily in any other language or in SQL, and I'm sure I could manage using for loops and whatnot, but given everything else I've seen in R, I'm guessing it has special syntax to do this as a one-liner. If so, how? And if not, could you show how it could be done using loops? I haven't actually got around to learning looping in R yet, since it has such amazing facilities for extracting and manipulating values directly.
For reference, the output would be:
Trial Person Time
1 John 0.7
2 John 0.8
3 John 0.6
1 Bill 1.3
2 Bill 1.5
3 Bill 1.7
There are many possibilities. Here is a simple one using merge() and a simple column-wise subtraction in the enlarged data.frame:
R> DF1 <- data.frame(trial=rep(1:3,2),
+                   Person=rep(c("John","Bill"), each=3),
+                   Time=c(1.2,1.3,1.1,2.3,2.5,2.7))
R> DF2 <- data.frame(Person=c("John","Bill"), Offset=c(0.5,1.0))
R> DF <- merge(DF1, DF2)
R> DF
Person trial Time Offset
1 Bill 1 2.3 1.0
2 Bill 2 2.5 1.0
3 Bill 3 2.7 1.0
4 John 1 1.2 0.5
5 John 2 1.3 0.5
6 John 3 1.1 0.5
R> DF$NewTime <- DF$Time - DF$Offset
R> DF
Person trial Time Offset NewTime
1 Bill 1 2.3 1.0 1.3
2 Bill 2 2.5 1.0 1.5
3 Bill 3 2.7 1.0 1.7
4 John 1 1.2 0.5 0.7
5 John 2 1.3 0.5 0.8
6 John 3 1.1 0.5 0.6
R>
One-liner, with d1 and d2 being the two data frames from the question:
transform(merge(d1,d2), Time=Time - Offset, Offset=NULL)
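Note that merge() sorts the result by the key column by default, so the rows come back in a different order than in the original frame. If the original row order matters, a lookup with match() is one alternative. A minimal sketch, assuming every Person in d1 appears exactly once in d2:

```r
d1 <- data.frame(Trial  = rep(1:3, 2),
                 Person = rep(c("John", "Bill"), each = 3),
                 Time   = c(1.2, 1.3, 1.1, 2.3, 2.5, 2.7))
d2 <- data.frame(Person = c("John", "Bill"),
                 Offset = c(0.5, 1.0))

# match() gives, for each row of d1, the row of d2 holding that person's offset.
idx <- match(d1$Person, d2$Person)

# Subtract the looked-up offsets in place; row order is untouched.
d1$Time <- d1$Time - d2$Offset[idx]
d1
```

A Person missing from d2 would produce NA in idx and therefore NA in Time, which makes unmatched rows easy to spot.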