order() using data frame column name containing spaces produces unexpected results - r

I was trying to order a data frame in R according to a column named 'credit card usage'. The name of the data frame is mydata. The following command without a comma gives an error
newdata = mydata[order('credit card usage')]
But the following command with a comma works absolutely fine
newdata = mydata[order('credit card usage'),]
I need to understand why we need the comma. Can someone please explain in simple language what's going on behind the scenes?
Also the following command
mydata[order('credit card usage'),]
gives only the first row and not the whole dataframe. Why?

Why mydata[order('credit card usage'),] returns only the first row is the tricky part.
A column name selects the column's values only when it is used as an index inside [ (for example, after the comma); used anywhere else, it is just a character string.
So order('credit card usage') is passed a character vector of length one; it sorts that vector and returns its only index, which is 1. Hence:
mydata[order('credit card usage'),] reduces to
mydata[1,]
which is the first row of mydata.
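One can see this directly at the console; order() applied to a bare character vector simply orders the strings themselves:
order('credit card usage')    # returns 1: a single element, so its only index
order(c("banana", "apple"))   # returns 2 1: the indices that sort the strings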

MKR's answer explains why the OP obtained the results described in the post. Here, we'll explain how to deal with the column named credit card usage to correctly sort the entire data frame.
Generally speaking, it's not advisable to use column names containing spaces in R because it often leads to unexpected results as experienced by the OP.
To use a column in a data frame whose name contains spaces, one can use the [[ form of the extract operator. We'll illustrate with some sample data...
set.seed(95014123)
mydata <- data.frame(matrix(round(runif(100)*1000,0),nrow=50))
names(mydata) <- c("credit card usage","value")
head(mydata)
head(mydata[order(mydata[["credit card usage"]]),])
...and the output:
> set.seed(95014123)
> mydata <- data.frame(matrix(round(runif(100)*1000,0),nrow=50))
> names(mydata) <- c("credit card usage","value")
> head(mydata)
credit card usage value
1 795 217
2 816 613
3 342 323
4 126 751
5 618 780
6 625 529
> head(mydata[order(mydata[["credit card usage"]]),])
credit card usage value
47 25 109
44 81 534
18 91 985
31 99 931
19 109 190
4 126 751
>
One can replace the spaces with underscores via the gsub() function, which will enable one to use the $ form of the extract operator in subsequent functions.
# replace spaces with underscores
names(mydata) <- gsub(" ","_",names(mydata))
head(mydata[order(mydata$credit_card_usage),])
...and the output:
> # replace spaces with underscores
> names(mydata) <- gsub(" ","_",names(mydata))
> head(mydata[order(mydata$credit_card_usage),])
credit_card_usage value
47 25 109
44 81 534
18 91 985
31 99 931
19 109 190
4 126 751
>
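As an aside, backticks also make the $ form usable with a space-containing name, without renaming the column:
head(mydata[order(mydata$`credit card usage`), ])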

Related

Replace id numbers in rows based on the match between two columns

I am dealing with data on club membership where each row represents the membership of one of the 10 student clubs, and the number of non-empty columns in a row represents the membership "size" of that club. Each non-empty cell of the data frame is filled with a "random number" denoting a student's membership in a club (random numbers were used to suppress their identities).
By default, each club has at least one member, but not all students are registered as club members (some have no involvement in any clubs). The data looks like this (only part of the data is shown below):
club_id mem1 mem2 mem3 mem4 mem5 mem6 mem7
1 339 520 58
2 700
3 80 434
4 516 811 471
5 20
6 211 80 439 516 305
I want to replace those random numbers with student ids (without revealing their real names) based on the match between the random numbers assigned to them and their student ids; however, only some of the student ids are matched to the random numbers assigned to those students.
I compiled them into a dataframe of 2 columns, which is available here and looks like
match <- read.csv("https://www.dropbox.com/s/nc98i784r91ugin/match.csv?dl=1")
head(match)
id rn
1 1 700
2 2 339
3 3 540
4 4 58
5 5 160
6 6 371
where column rn means random number.
So the tasks I am having trouble with are to
(1) match and replace the random numbers on the dataframe with their corresponding student ids
(2) set those unmatched random number as NA
It would be really appreciated if someone could enlighten me on this.
Not sure if I got the logic right. I replicated only a short version of your initial table and replaced the first number with 1000 (because that is a number that has no matching id).
club2 <- data.frame(club_id = 1:6, mem2 = c(1000, 700, 80, 516, 20, 211))
match <- read.csv("https://www.dropbox.com/s/nc98i784r91ugin/match.csv?dl=1")
Then, for the column mem2, I check if it exists in match$rn. If that is not the case, an NA is inserted. If that is the case, however, it inserts match$id - the one at the position where match$rn is equal to the number in mem2.
club2$mem2 <- ifelse(club2$mem2 %in% match$rn, match$id[match(club2$mem2, match$rn)], NA)
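If the full table (call it club here; the name is assumed, since the question only shows a fragment) should be processed in one pass, the same idea extends to every mem* column. Note that match() already returns NA for values with no match, so the ifelse() wrapper is optional:
# replace random numbers with student ids in every mem* column;
# unmatched values become NA automatically via match()
mem_cols <- grep("^mem", names(club), value = TRUE)
club[mem_cols] <- lapply(club[mem_cols], function(x) match$id[match(x, match$rn)])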

How to Interpret "Levels" in Random Forest using R/Rattle

I am brand new at using R/Rattle and am having difficulty understanding how to interpret the last line of this code output. Here is the function call along with its output:
> head(weatherRF$model$predicted, 10)
336 342 94 304 227 173 265 44 230 245
No No No No No No No No No No
Levels: No Yes
This code is implementing a weather data set in which we are trying to get predictions for "RainTomorrow". I understand that this function calls for the predictions for the first 10 observations of the data set. What I do NOT understand is what the last line ("Levels: No Yes") means in the output.
The predicted values are stored as a factor variable.
The Levels line lists the permitted values of the factor; here the values No and Yes are permitted.
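One can reproduce that last line with a toy factor at the console:
f <- factor(c("No", "No", "Yes"))
f
# [1] No  No  Yes
# Levels: No Yes
levels(f)    # the permitted values: "No" "Yes"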

Rolling subset of data frame within for loop in R

Big picture explanation is I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based off other biological factors) for two years (2014 and 2015) with one value of PAR per day. See below the few first lines of the data frame (data frame name is "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two week windows (14 rows) from start to finish sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14 and the second window would include rows 8 to 21 and so forth. After subsetting, the data needs to be flipped in structure (currently using the melt function in the reshape2 package) so that the values of the PAR data are in one column and the variable of par14 or par15 is in the other column. Then I need to get rid of the NaN data and finally perform a wilcox rank sum test on each window comparing PAR by the variable year (par14 or par15). Below is the code I wrote to prove the concept of what I wanted and for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
  par.sub=rollingpar[[i]:[i]+13, ]
  par.sub=melt(par.sub)
  par.sub=na.omit(par.sub)
  par.sub$variable=as.factor(par.sub$variable)
  save.sub=wilcox.test(value~variable, par.sub)
  for (j in 1:length(save.sub)){
    wilcoxvalues$p.value[j]=save.sub$p.value
  }
}
If anyone has a much better way to do this through a different package or function that I am unaware of I would love to be enlightened. I did try roll apply but ran into problems with finding a way to apply it to an entire data frame and not just one column. I have searched for assistance from the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated to a frustrated grad student :) and if I did not provide enough information please let me know.
Consider an lapply over a sequence stepping by 7 through 365 days of the year (the last day is not included, to avoid a single-day final grouping), returning a list of data frames of Wilcoxon test p-values with a Week indicator. Then row-bind each list item into a final, single data frame:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
  par.sub = rollingpar[i:(i+13), ]
  par.sub = melt(par.sub)
  par.sub = na.omit(par.sub)
  par.sub$variable = as.factor(par.sub$variable)
  data.frame(week = paste0("Week: ", i%/%7+1, "-", i%/%7+2),
             p.values = wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)
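Since the OP mentioned rollapply: the same windows can also be generated by rolling over the row indices rather than over a single column. A sketch, untested against the real data:
library(zoo)
library(reshape2)
# 14-row windows advancing 7 rows at a time over the row indices
pvals <- rollapply(seq_len(nrow(rollingpar)), width = 14, by = 7,
                   FUN = function(idx) {
                     par.sub <- na.omit(melt(rollingpar[idx, ]))
                     wilcox.test(value ~ variable, par.sub)$p.value
                   })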

Read text file with many 2D datasets in it using R

I have a data file which I'd like to read into R which is something like the following:
STARTOFDATA 2011-06-23 35
143 6456 23 646 123.53A 864.95 23B
343 634 24 545 65.3 235.2 94C
...
524 542 45 245.4 24 245A 45B
STARTOFDATA 2011-06-24 84
245 6532 24.4 624.2 542 23B 35A
241 4532 13.5 235.12 534.23 54 32B
etc...
As you can see, it's basically a 2D dataset (each of the columns between the header lines is a different variable) which is stored for a number of dates, specified by the STARTOFDATA lines, which split up the different days. The number at the end of the header line is the number of lines of data before the next header line. The A's, B's and C's etc are quality control information which can basically just be discarded - probably just as a gsub on the text I get from the file.
My question is: how should I go about reading this into R? Ideally I'd like to be able to read either the whole file, or a specified date (or date range). I should probably point out that the file is over 200,000 lines long!
I've done some thinking and researching about this, but can't seem to work out a sensible way to do it.
As far as I can see it, there are two questions:
How to read the file: Is there a way to move a pointer around within a file in R? Some other languages I've worked with have had that ability, in which case I could read the first line, read the date, see if I want that date or not, then if not skip the number of lines listed at the end of the header (preferably without reading them!) and read the next header line. I can't see anything in the documentation about a function that would let me do that without actually reading in the lines. It seems that if I create a connection object manually then that will keep track of where I am in the file, and I can use repeated calls to readLines (in a loop) to read in chunks of the file, discarding them once read if they're not needed.
How to store the data: Ideally I want to store the 2D dataset for each date in a dataframe, then I can continue to do any analysis on them fairly easily. However, how should I store loads of these 2D datasets? I'm thinking of a list of data-frames, but is that the best way to do it (in terms of being able to index the list sensibly)?
Any ideas or comments would be much appreciated.
Use readLines to read your data as a character vector and then manipulate this vector. Here is some code that splits your sample data into a list of blocks:
Use readLines to read the data:
x <- readLines(textConnection(
"STARTOFDATA 2011-06-23 35
143 6456 23 646 123.53A 864.95 23B
343 634 24 545 42 65.3 235.2 94C
...
524 542 45 245.4 24 542.54 245A 45B
STARTOFDATA 2011-06-24 84
245 6532 24.4 624.2 542 23B 35A
241 4532 13.5 235.12 534.23 54
etc..."))
Determine the positions of STARTOFDATA, then split into a list of blocks:
positions <- c(grep("STARTOFDATA", x), length(x) + 1)
lapply(head(seq_along(positions), -1),
       function(i) x[positions[i]:(positions[i+1] - 1)])
[[1]]
[1] "STARTOFDATA 2011-06-23 35"
[2] "143 6456 23 646 123.53A 864.95 23B"
[3] "343 634 24 545 42 65.3 235.2 94C"
[4] "..."
[5] "524 542 45 245.4 24 542.54 245A 45B"
[[2]]
[1] "STARTOFDATA 2011-06-24 84"
[2] "245 6532 24.4 624.2 542 23B 35A"
[3] "241 4532 13.5 235.12 534.23 54"
[4] "etc..."
Now each block of data is an element in a list, and you can process each block as required using a second lapply().
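From there, each block can be parsed into a data frame. A sketch, assuming the result of the lapply above has been saved as blocks and that the real file contains only data rows between headers (the gsub() discards the quality-control letters mentioned in the question):
parse_block <- function(block) {
  header <- strsplit(block[1], " +")[[1]]    # "STARTOFDATA", date, row count
  body <- gsub("[A-Z]", "", block[-1])       # drop quality-control letters
  df <- read.table(text = body)
  attr(df, "date") <- as.Date(header[2])
  df
}
datasets <- lapply(blocks, parse_block)
names(datasets) <- sapply(datasets, function(d) format(attr(d, "date")))    # index the list by date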

read.csv appends/modifies column headings with date values

I'm trying to read a csv file into R that has date values in some of the column headings.
As an example, the data file looks something like this:
ID Type 1/1/2001 2/1/2001 3/1/2001 4/1/2001
A Supply 25 35 45 55
B Demand 26 35 41 22
C Supply 25 35 44 85
D Supply 24 39 45 75
D Demand 26 35 41 22
...and my read.csv logic looks like this
dat10 <- read.csv("c:/data.csv", header=TRUE, sep=",", as.is=TRUE)
The read.csv works fine except it modifies the names of the columns containing dates as follows:
X1.1.2001 X2.1.2001 X3.1.2001 X4.1.2001
Is there a way to prevent this, or an easy way to correct it afterwards?
Set check.names=FALSE. But be aware that 1/1/2001 et al are syntactically invalid names, therefore they may cause you some headaches.
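A minimal sketch (assuming the same file as above):
dat10 <- read.csv("c:/data.csv", header = TRUE, check.names = FALSE)
head(dat10[["1/1/2001"]])    # [[ or backticks are needed for non-syntactic names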
You can always change the column names using the colnames function. For example,
colnames(dat10) = gsub("\\.", "/", colnames(dat10))
However, having slashes in your column names isn't a particularly good idea. You can always change them just before you print out the table or when you create a graph.