How to reference columns of a data.frame within a data.frame? - r

I have a data.frame called series_to_plot.df which I created by combining a number of other data.frames together (shown below). I now want to pull out just the .mm column from each of these, so I can plot them. So I want to pull out the 3rd column of each data.frame (e.g. p3c3.mm, p3c4.mm etc...), but I can't see how to do this for all data.frames in the object without looping through the name. Is this possible?
I can pull out just one set, e.g. series_to_plot.df[[3]], and another with series_to_plot.df[[10]] (so it is just a list of vectors), and I can reference one directly with series_to_plot.df$p3c3.mm. But is there a command to get all the mm columns from each data.frame at once? I was expecting an index like series_to_plot.df[,3[3]] to work, but it returns: Error in `[.data.frame`(series_to_plot.df, , 3[3]) : undefined columns selected
series_to_plot.df
p3c3.rd p3c3.day p3c3.mm p3c3.sd p3c3.n p3c3.noo p3c3.no_NAs
1 2010-01-04 0 0.1702531 0.04003364 7 1 0
2 2010-01-06 2 0.1790594 0.04696674 7 1 0
3 2010-01-09 5 0.1720404 0.03801756 8 0 0
p3c4.rd p3c4.day p3c4.mm p3c4.sd p3c4.n p3c4.noo p3c4.no_NAs
1 2010-01-04 0 0.1076581 0.006542157 6 2 0
2 2010-01-06 2 0.1393447 0.066758781 7 1 0
3 2010-01-09 5 0.2056846 0.047722862 7 1 0
p3c5.rd p3c5.day p3c5.mm p3c5.sd p3c5.n p3c5.noo p3c5.no_NAs
1 2010-01-04 0 0.07987147 0.006508766 7 1 0
2 2010-01-06 2 0.11496167 0.046478767 8 0 0
3 2010-01-09 5 0.40326471 0.210217097 7 1 0

To get all columns with specified name you could do:
names_with_mm <- grep("mm$", names(series_to_plot.df), value=TRUE)
series_to_plot.df[, names_with_mm]
But if your base data.frames all have the same structure then you can rbind them, something like:
series_to_plot.df <- rbind(
cbind(name="p3c3", p3c3),
cbind(name="p3c4", p3c4),
cbind(name="p3c5", p3c5)
)
Then the mm values are in one column and it's easier to plot.
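A minimal sketch of the stacked layout. It assumes each base data frame has unprefixed columns such as day and mm; the names p3c3 and p3c4 mirror the question, but the values are invented for illustration:

```r
# Toy stand-ins for the per-series data frames (values are invented).
p3c3 <- data.frame(day = c(0, 2, 5), mm = c(0.17, 0.18, 0.17))
p3c4 <- data.frame(day = c(0, 2, 5), mm = c(0.11, 0.14, 0.21))

# Stack them, tagging each row with the series it came from.
stacked <- rbind(
  cbind(name = "p3c3", p3c3),
  cbind(name = "p3c4", p3c4)
)

# All mm values now live in a single column.
all_mm <- stacked$mm

# Per-series values can still be recovered by filtering on name.
mm_p3c4 <- stacked$mm[stacked$name == "p3c4"]
```

With the data in this shape, the series name is an ordinary column that can be used for grouping or colouring in a plot.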

To add to the other answers, I don't think it is a good idea to have useful information encoded in variable names. Much better to rearrange your data so that all useful information is in the value of some variable. I don't know enough about your data set to suggest the right format, but it might be something like
p c rd day date mm sd ...
3 3 2010-10-04 ...
Once you have done this the answer to your question becomes the simple df$mm.
If you are getting the data in a less useful form from an external source, you can rearrange it in a more useful form like the above within R using the reshape function or functions from the reshape package.
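As a rough sketch of that rearrangement with base R's reshape(), assuming a wide data frame whose columns are prefixed with the series name as in the question (the obs column and all values are made up for the example):

```r
# Toy wide data frame with series-prefixed columns (values are invented).
wide <- data.frame(
  obs      = 1:3,
  p3c3.day = c(0, 2, 5), p3c3.mm = c(0.17, 0.18, 0.17),
  p3c4.day = c(0, 2, 5), p3c4.mm = c(0.11, 0.14, 0.21)
)

# Each element of `varying` lists the wide columns that become one long column.
long <- reshape(
  wide,
  direction = "long",
  varying   = list(c("p3c3.day", "p3c4.day"), c("p3c3.mm", "p3c4.mm")),
  v.names   = c("day", "mm"),
  timevar   = "series",
  times     = c("p3c3", "p3c4"),
  idvar     = "obs"
)

long$mm  # every mm value in one column
```

The series label ends up in its own column (here called series), so nothing useful is left encoded in the column names.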

The R Language Definition has a helpful section on indexing (sec. 3.4.1).
You can find the names that match a pattern with grep() and use the result as a column index, all in one line:
dataWithMM <- series_to_plot.df[, grep("\\.mm$", names(series_to_plot.df))]
To deconstruct it a little, this gets the positions of the columns whose names end in ".mm":
namesThatMatch <- grep("\\.mm$", names(series_to_plot.df))
Then we use those positions to select the columns we want:
dataWithMM <- series_to_plot.df[, namesThatMatch]
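A note on the pattern, sketched on a handful of column names. A bracketed pattern such as "[mm]" is a character class, so it matches any name containing an "m", whereas an anchored pattern only matches names ending in ".mm". The column p3c3.min is hypothetical and is included only to show the difference:

```r
# Column names in the style of the question; "p3c3.min" is a made-up extra.
cols <- c("p3c3.day", "p3c3.mm", "p3c3.min", "p3c4.mm")

# Character class: matches every name that contains the letter "m".
loose <- grep("[mm]", cols, value = TRUE)

# Anchored pattern: matches only names that end in ".mm".
strict <- grep("\\.mm$", cols, value = TRUE)

loose   # includes "p3c3.min"
strict  # only the .mm columns
```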

Related

Renaming dataframe without writing it to the global environment

I have written a loop that stores data frames in a list and would like to use strings stored in a vector as their names. This way, I could refer to the data frames stored in the list by their names without having to use indexes. I have searched the internet extensively for this issue but so far have not found a solution.
So far, I have used a workaround: I loop over a list of data frame names using read.csv(). In each iteration, I write the imported data frame to the global environment using assign(), which allows me to set a variable name. Using get() and a pattern-matching approach, I then fetch the data frames from the global environment and store them in a list.
This approach is quite cumbersome and only works when data frame names follow a shared pattern.
Preferably, I would like to rename data frames without having to use assign():
Name of imported data frame 1 <- First element of vector containing the data frame names
How could I achieve this?
I would highly appreciate any help!
My approach to this sort of problem is to use lapply to create the loop and then supply names for the elements of the resulting list. This gives a simple, two line solution once the "create a data frame" function has been written.
For example, generating a random data.frame rather than reading a csv file for easy reproduction:
createDataFrame <- function(x) {
data.frame(X=x, Y=rnorm(5))
}
beatles <- lapply(1:4, createDataFrame)
names(beatles) <- c("John", "Paul", "George", "Ringo")
beatles
$John
X Y
1 1 -1.1590175
2 1 0.6872888
3 1 -0.8868616
4 1 -0.3458603
5 1 1.1136297
$Paul
X Y
1 2 -0.3761409
2 2 -0.9059801
3 2 -0.7039736
4 2 -0.4490143
5 2 1.1337149
$George
X Y
1 3 -0.4804286
2 3 1.0573272
3 3 -1.9000426
4 3 0.8887967
5 3 0.6550380
$Ringo
X Y
1 4 -0.7539840
2 4 -0.3743590
3 4 -0.9748449
4 4 -1.1448570
5 4 -1.3277712
beatles$George
X Y
1 3 -0.4804286
2 3 1.0573272
3 3 -1.9000426
4 3 0.8887967
5 3 0.6550380
Make the obvious changes to createDataFrame for your actual use case.
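For the read.csv() case in the question, the same idea might look like the sketch below. The names alpha and beta and the temporary files are hypothetical and exist only so the example can run on its own:

```r
# Hypothetical names; in practice these would come from your own vector.
df_names <- c("alpha", "beta")

# Write two small CSV files to a temporary directory so the sketch is runnable.
paths <- file.path(tempdir(), paste0(df_names, ".csv"))
write.csv(data.frame(x = 1:3), paths[1], row.names = FALSE)
write.csv(data.frame(x = 4:6), paths[2], row.names = FALSE)

# Read every file into a list, then name the elements in one step.
frames <- setNames(lapply(paths, read.csv), df_names)

frames$beta        # access by name with $
frames[["alpha"]]  # or with [[ ]]
```

Nothing is written to the global environment apart from the list itself, so assign() and get() are not needed.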

R subset dataframe column with variable when column name is escaped

I am trying to select a column from a dataframe using a variable as a column name, with the problem that the column name is escaped. I have a couple of workarounds for doing it, which involve changing my code a bit too much, and anyway I've been looking around and I am curious if anybody knew the solution for this kind of weird case.
My dataset is actually a list of time series (which I construct after some operations), this would be a toy example.
df <- list(`01/19/17`=seq(1,10), `01/20/17`=seq(2,11))
> df
$`01/19/17`
[1] 1 2 3 4 5 6 7 8 9 10
$`01/20/17`
[1] 2 3 4 5 6 7 8 9 10 11
I don't put the escapes ` in the column names because I want to, but because they come as dates from the process I follow to construct the dataset.
If I know the column name I can access like this,
df$`01/19/17`
If I want to use a variable, looking around e.g. here I see I could rewrite it to something like this,
`$`(df, `01/19/17`)
But I cannot assign a variable like this,
> name1 <- `01/19/17`
Error: object '01/19/17' not found
and if assign it this other way I get a NULL,
> name1 <- "01/19/17"
> `$`(df, name1)
NULL
As I say there are workarounds like e.g. changing all the column names in the list of series, but I just would like to know. Thank you so much.
You can access with brackets rather than with $, even when the key is a string:
df <- list(`01/19/17`=seq(1,10), `01/20/17`=seq(2,11))
name1 <- "01/19/17"
df[[name1]]
# [1] 1 2 3 4 5 6 7 8 9 10
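A short sketch of why the two operators behave differently, using the toy list from the question under the name series (the variable names via_dollar, via_brackets and both are only for illustration):

```r
series <- list(`01/19/17` = 1:10, `01/20/17` = 2:11)
name1 <- "01/19/17"

# `$` takes its right-hand side literally, so this looks for an element
# called "name1" and finds nothing.
via_dollar <- `$`(series, name1)

# `[[` evaluates its argument first, so the string stored in name1 is used.
via_brackets <- series[[name1]]

# Single brackets accept several names at once and return a sub-list.
both <- series[c("01/19/17", "01/20/17")]
```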

How to change the names of confidence levels per variable in linear regression

I computed the confidence intervals for each variable in a linear regression. I wanted to use the results for sorting the variables, so I stored the result set as a data frame. However, when I called str() on one of the columns I got an error (shown below). How can I store the result set so that I'll be able to work with it?
df <- read.table(text = "target birds wolfs
1 9 7
1 8 4
0 2 8
1 2 3 3
0 1 2
1 7 1
0 1 5
1 9 7
1 8 7
0 2 7
0 2 3
1 6 3
0 1 1
0 3 9
0 1 1 ",header = TRUE)
model<-lm(target~birds+wolfs,data=df)
confint(model)
2.5 % 97.5 %
(Intercept) -0.23133823 0.36256052
birds 0.10102771 0.18768505
wolfs -0.09698902 0.00812353
s<-as.data.frame(confint(model))
str(s$2.5%)
Error: unexpected numeric constant in "str(s$2.5"
The expression behind the $ operator must be a valid R identifier. 2.5% isn’t a valid R identifier, but there’s a simple way of making it one: put it into backticks: `2.5%` [1]. In addition, you need to make sure that the column name matches exactly (or at least that its prefix does). In other words, you need to add a space before the %:
str(s$`2.5 %`)
In general, a$b is the same as a[['b']] (with some subtleties; refer to the documentation). So you can also write:
str(s[['2.5 %']])
Alternatively, you could provide different column names for the data.frame that are valid identifiers, by just assigning different column names. Beware of make.names though: it makes your strings into valid R names, but at the cost of mangling them in ways that are not always obvious. Relying on it risks confusing readers of the code, because previously undeclared identifiers suddenly appear in the code. In the same vein, you should always specify check.names = FALSE with data.frame, otherwise R once again mangles your column names.
[1] In fact, R also accepts single quotes here (s$'2.5 %'). However, I suggest you forget this immediately; it’s a historical accident of the R language, and treating identifiers and strings the same (especially since it’s done inconsistently) does more harm than good.
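A minimal sketch of the renaming route, assuming you are happy to pick your own column names. The mtcars model is only a stand-in for the one in the question, and lower and upper are arbitrary names chosen for the example:

```r
# mtcars ships with R; the model is only a stand-in for the one in the question.
model <- lm(mpg ~ wt + hp, data = mtcars)

ci <- as.data.frame(confint(model))

# Replace "2.5 %" and "97.5 %" with names that are valid identifiers.
names(ci) <- c("lower", "upper")

# The columns can now be used with $ directly, e.g. to sort by the lower bound.
ci_sorted <- ci[order(ci$lower), ]
str(ci$lower)
```

Assigning the names explicitly keeps them predictable, which is the point of the warning about make.names above.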

Stepwise fill dataframe

I'm using a for-loop to perform operations on specific subsets of my data. At the end of each iteration of the for loop, I have all the values that I need to fill a row of my dataframe.
So far I tried
df=NULL
for(...){
# stuff to calculate
newline=c(allthethingscalculated)
df=rbind(df,newline)
}
This results in the contents of the dataframe not being accessible using '$', because the rows are then atomic vectors.
I also tried appending the values I get at the end of each iteration to already existing vectors and, when the for loop ends, creating a dataframe from these vectors using x<-data.frame(a,b,c,d,...), but appending the values to the respective vectors didn't work; the values weren't added.
Any ideas on this?
Since my for loop iterates over IDs in my data, I realized I could do something like this:
uids=unique(data$id)
filler=c(1:length(uids))
df=data.frame(uids,filler,filler,filler,filler,filler,filler,filler,filler,filler)
for(i in uids){
...
df[i,]<-newline
}
I used filler to create a dataframe with the correct number of columns and rows so I don't get an error like 'replacement has length of 9, replacement has length of 1'
Is there a better way to do this? Using this approach I still have the values of filler in the respective row that I'd need to remove?
This should work. Can you show us your data?
R) x=data.frame(a=rep(1,3),b=rep(2,3),c=rep(3,3))
R) d=c(4,4,4)
R) rbind(x,d)
a b c
1 1 2 3
2 1 2 3
3 1 2 3
4 4 4 4
R) cbind(x,d)
a b c d
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4

Applying a function on each row of a data frame in R

I would like to apply some function on each row of a dataframe in R.
The function can return a single-row dataframe or nothing (I guess 'return ()' returns nothing?).
I would like to apply this function to each of the rows of a given dataframe, and get the resulting dataframe (which is possibly shorter, i.e. has fewer rows, than the original one).
For example, if the original dataframe is something like:
id size name
1 100 dave
2 200 sarah
3 50 ben
And the function I'm using gets a row in the dataframe (i.e. a single-row dataframe), returns it as-is if the name rhymes with "brave", otherwise returns NULL, then the result should be:
id size name
1 100 dave
This example is really about filtering a dataframe, and I would love to get both an answer specific to this kind of task and one for the more general case, where the result of the helper function (the one that operates on a single row) may be an arbitrary single-row data frame. Please note that even in the case of filtering, I would like to use some sophisticated logic: not something simple like $size>100, but a more complex condition that is checked by a function, let's say boo(single_row_df).
P.s.
What I have done so far in these cases is to use apply(df, MARGIN=1) and then do.call(rbind ...), but I think it gives me trouble when my dataframe has only a single row (I get Error in do.call(rbind, filterd) : second argument must be a list).
UPDATE
Following Stephen's reply I did the following:
ranges.filter <- function(ranges,boo) {
subset(x=ranges,subset=!any(boo[start:end]))
}
I then call ranges.filter with some ranges dataframe that looks like this:
start end
100 200
250 400
698 1520
1988 2147
...
and some boolean vector
(TRUE,FALSE,TRUE,TRUE,TRUE,...)
I want to filter out any ranges that contain a TRUE value from the boolean vector. For example, the first range 100 .. 200 will be left in the data frame iff the boolean vector is FALSE in positions 100 .. 200.
This seems to do the work, but I get a warning saying numerical expression has 53 elements: only the first used.
For the more general case of processing a dataframe, get the plyr package from CRAN and look at the ddply function, for example.
install.packages("plyr")
library(plyr)
help(ddply)
Does what you want without masses of fiddling.
For example...
> d
x y z xx
1 1 0.68434946 0.643786918 8
2 2 0.64429292 0.231382912 5
3 3 0.15106083 0.307459540 3
4 4 0.65725669 0.553340712 5
5 5 0.02981373 0.736611949 4
6 6 0.83895251 0.845043443 4
7 7 0.22788855 0.606439470 4
8 8 0.88663285 0.048965094 9
9 9 0.44768780 0.009275935 9
10 10 0.23954606 0.356021488 4
We want to compute the mean and sd of x within groups defined by "xx":
> ddply(d,"xx",function(r){data.frame(mean=mean(r$x),sd=sd(r$x))})
xx mean sd
1 3 3.0 NA
2 4 7.0 2.1602469
3 5 3.0 1.4142136
4 8 1.0 NA
5 9 8.5 0.7071068
And it gracefully handles all the nasty edge cases that sometimes catch you out.
You may have to use lapply instead of apply to force the result to be a list.
> rhymesWithBrave <- function(x) substring(x,nchar(x)-2) =="ave"
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) dfr[i,] else NULL,
+ dfr))
id size name
1 1 100 dave
But in this case, subset would be more appropriate:
> subset(dfr,rhymesWithBrave(name))
id size name
1 1 100 dave
If you want to perform additional transformations before returning the result, you can go back to the lapply approach above:
> add100tosize <- function(x) within(x,size <- size+100)
> do.call(rbind,lapply(1:nrow(dfr),function(i,dfr)
+ if(rhymesWithBrave(dfr[i,"name"])) add100tosize(dfr[i,])
+ else NULL,dfr))
id size name
1 1 200 dave
Or, in this simple case, apply the function to the output of subset.
> add100tosize(subset(dfr,rhymesWithBrave(name)))
id size name
1 1 200 dave
UPDATE:
To select rows that do not fall between start and end, you might construct a different function (note: when summing result of boolean/logical vectors, TRUE values are converted to 1s and FALSE values are converted to 0s)
test <- function(x)
rowSums(mapply(function(start,end,x) x >= start & x <= end,
start=c(100,250,698,1988),
end=c(200,400,1520,2147),
MoreArgs=list(x=x))) == 0
subset(dfr,test(size))
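For the ranges.filter case in the question's update, the warning appears because start:end receives whole columns and only uses their first elements. A sketch of one way around it, checking each (start, end) pair separately; the logical vector boo below is invented for the example:

```r
# Toy inputs: the ranges from the question and an invented logical vector.
ranges <- data.frame(start = c(100, 250, 698, 1988),
                     end   = c(200, 400, 1520, 2147))
boo <- rep(FALSE, 2500)
boo[c(300, 2000)] <- TRUE   # positions inside the 2nd and 4th range

# Check each (start, end) pair separately instead of passing whole columns to `:`.
keep <- mapply(function(s, e) !any(boo[s:e]), ranges$start, ranges$end)

# Keep only the ranges that contain no TRUE value.
filtered <- ranges[keep, ]
```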
It sounds like you want to use subset:
subset(orig.df,grepl("ave",name))
The second argument is a logical expression that determines which rows are kept. You can make this expression use values from as many columns as you want, e.g. grepl("ave",name) & size>50
