Getting column names from df and querying df - r

When I do:
var = names(df)[2]
df$var
I get NULL. I think that var is a string inside quotes and that is why this is happening. How could I get the columns in a dataframe and query them dynamically?
It has been suggested that I use df[var], but what if my dataframe has another dataframe within it? df[var][x] or df[var]$x won't work.

To get a column of a data frame (or an item in a list) by the value of a variable, use double brackets:
df[[var]]
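A minimal sketch of the difference, assuming a small made-up data frame (the columns a and b are hypothetical, not the asker's data):

```r
# Hypothetical data frame, used only for illustration
df <- data.frame(a = 1:3, b = 4:6)
var <- names(df)[2]   # var holds the string "b"

df$var      # NULL: `$` looks for a column literally named "var"
df[[var]]   # the column b as a plain vector
df[var]     # a one-column data frame containing b
```

The `$` operator does not evaluate its right-hand side, which is why df$var never sees the value stored in var.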

It's hard to know what error-inducing situation has been constructed without dput output for the offending dataframe. It's modestly difficult to get a column name as described (with actual quotes in the column name), but it's possible. First we can try and fail to get such a beast:
df2 <- data.frame("\"col1\""=1:10)
df2[["\"col1\""]]
#NULL
df2
# the data.frame function coerced it to a valid column name with no quotes
X.col1.
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
So we need to bypass the validity checks with check.names=FALSE. The quotes in the name still need escapes preceding them:
df2 <- data.frame("\"col1\""=1:10, check.names=FALSE)
> df2[["\"col1\""]]
[1] 1 2 3 4 5 6 7 8 9 10
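The renaming seen above comes from the name sanitizing that data.frame applies by default. A short sketch of that step on its own, using make.names directly (a 3-row frame is used here only to keep it brief):

```r
# make.names() is what turns the quoted name into a syntactically valid one
make.names("\"col1\"")
# [1] "X.col1."

# With check.names = FALSE the name is kept verbatim, quotes included
df2 <- data.frame("\"col1\"" = 1:3, check.names = FALSE)
names(df2)
df2[["\"col1\""]]
```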
If the df[[var]]$x approach worked for you, then the answer is more likely that df is not a dataframe but rather is an ordinary R named list and that it is x that is a dataframe. You should check this by doing:
str(df)
You could make such a structure very simply with:
> df3 <- list( item=data.frame(x=1:10, check.names=FALSE))
> var1 = "item"
> df3[[var1]]$x
[1] 1 2 3 4 5 6 7 8 9 10
> str(df3)
List of 1
$ item:'data.frame': 10 obs. of 1 variable:
..$ x: int [1:10] 1 2 3 4 5 6 7 8 9 10
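If both levels need to be selected by variables, the double brackets can be chained. This is only a sketch: the names nested, df4, outer_name and inner_name are made up for illustration, and the second case assumes the data frame really does hold another data frame as a column.

```r
# Case 1: a plain list that holds a data frame
nested <- list(item = data.frame(x = 1:10))
outer_name <- "item"
inner_name <- "x"
nested[[outer_name]][[inner_name]]

# Case 2: a data frame with a data-frame column
df4 <- data.frame(id = 1:3)
df4$inner <- data.frame(x = 4:6)
df4[["inner"]][[inner_name]]
```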

Related

Using Strings to Identify Sequence of Column Names in R

I am currently trying to use pre-defined strings in order to identify multiple column names in R.
To be more explicit, I am using the ave function to create identification variables for subgroups of a dataframe. The twist is that I want the identification variables to be flexible, in such a manner that I would just pass it as a generic string.
A sample code would be:
ids = with(df,ave(rep(1,nrow(df)),subcolumn1,subcolumn2,subcolumn3,FUN=seq_along))
I would like to run this code in the following fashion (code below does not work as expected):
subColumnsString = c("subcolumn1","subcolumn2","subcolumn3")
ids = with(df,ave(rep(1,nrow(df)),subColumnsString ,FUN=seq_along))
I tried something with eval, but it still did not work:
subColumnsString = c("subcolumn1","subcolumn2","subcolumn3")
ids = with(df,ave(rep(1,nrow(df)),eval(parse(text=subColumnsString)),FUN=seq_along))
Any ideas?
Thanks.
EDIT: Working code example of what I want:
df = mtcars
id_names = c("vs","am")
idDF_correct = transform(df,idItem = as.numeric(interaction(vs,am)))
idDF_wrong = cbind(df,ave(rep(1,nrow(df)),df[id_names],FUN=seq_along))
Note how in idDF_correct, the unique combinations are correctly mapped into unique values of idItem. In idDF_wrong this is not the case.
I think this achieves what you requested. Here I use the mtcars dataset that ships with R:
subColumnsString <- c("cyl","gear")
ids = with(mtcars, ave(rep(1,nrow(mtcars)), mtcars[subColumnsString], FUN=seq_along))
Just index your data.frame with the character vector of column names. This returns a list (a data.frame), which works naturally as the grouping argument of ave.
EDIT
ids = ave(rep(1,nrow(mtcars)), mtcars[subColumnsString], FUN=seq_along)
You can omit the with and just call plain ol' ave, as G. Grothendieck stated, and you should also use their answer as it is much more general.
This defines a function whose arguments are:
data, the input data frame
by, a character vector of column names in data
fun, a function to use in ave
Code--
Ave <- function(data, by, fun = seq_along) {
do.call(function(...) ave(rep(1, nrow(data)), ..., FUN = fun), data[by])
}
# test
Ave(CO2, c("Plant", "Treatment"), seq_along)
giving:
[1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3
[39] 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6
[77] 7 1 2 3 4 5 6 7
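Note that ave with seq_along gives a running counter within each group, not one identifier per group. If a per-group identifier like the idItem column in the question's edit is what is wanted, a possible sketch (assuming mtcars and the id_names vector from the question) passes the same indexed data frame to interaction:

```r
id_names <- c("vs", "am")

# Running counter within each vs/am combination (what ave + seq_along gives)
counter <- ave(rep(1, nrow(mtcars)), mtcars[id_names], FUN = seq_along)

# One numeric id per unique vs/am combination
group_id <- as.numeric(interaction(mtcars[id_names]))

head(data.frame(mtcars[id_names], counter, group_id))
```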

Aggregate command in R to combine rows based on unique ID - output data structure?

I'm sure there's a super-easy answer to this. I am trying to combine ratings on subjects based on their unique ID. Here is a test dataset (called Aggregate_Test) I created, where the ID is unique to the subject, and the StaticScore values were given by different raters:
ID StaticScore
1 6
2 7
1 5
2 6
3 7
4 8
3 4
4 5
After reading other posts carefully, I used aggregate to create the following dataset with new columns:
StaticAggregate<-aggregate(StaticScore ~ ID, Aggregate_Test, c)
> StaticAggregate
ID StaticScore.1 StaticScore.2
1 1 6 5
2 2 7 6
3 3 7 4
4 4 8 5
This data frame has the following str:
> str(StaticAggregate)
'data.frame': 4 obs. of 2 variables:
$ ID : num 1 2 3 4
$ StaticScore: num [1:4, 1:2] 6 7 7 8 5 6 4 5
If I try to create a new variable by subtracting StaticScore.1 from StaticScore.2, I get the following error:
Staticdiff<-StaticScore.1-StaticScore.2
Error: object 'StaticScore.1' not found
So, please help me - what is this data structure created by aggregate? A matrix? How could I convert StaticScore.1 and StaticScore.2 to separate variables, or barring that, what is the notation to subtract one from the other to create a new variable?
We can use dcast to reshape from long to wide format and then subtract those columns to create 'StaticDiff':
library(data.table)
dcast(setDT(Aggregate_Test), ID~paste0("StaticScore", rowid(ID)), value.var="StaticScore"
)[, StaticDiff := StaticScore1 - StaticScore2]
Regarding the specific question about the aggregate behavior, we are just concatenating (c) the 'StaticScore' by 'ID'. The default behavior is to create a matrix column in aggregate
StaticAggregate<-aggregate(StaticScore ~ ID, Aggregate_Test, c)
This can be checked by looking at the str(StaticAggregate)
str(StaticAggregate)
#'data.frame': 4 obs. of 2 variables:
#$ ID : int 1 2 3 4
#$ StaticScore: int [1:4, 1:2] 6 7 7 8 5 6 4 5
How do we change it to normal columns?
It can be done with do.call(data.frame, ...):
StaticAggregate <- do.call(data.frame, StaticAggregate)
Check the str again
str(StaticAggregate)
#'data.frame': 4 obs. of 3 variables:
# $ ID : int 1 2 3 4
# $ StaticScore.1: int 6 7 7 8
# $ StaticScore.2: int 5 6 4 5
Now we can do the calculation as shown in the OP's post:
StaticAggregate$Staticdiff <- with(StaticAggregate, StaticScore.1-StaticScore.2)
StaticAggregate
# ID StaticScore.1 StaticScore.2 Staticdiff
#1 1 6 5 1
#2 2 7 6 1
#3 3 7 4 3
#4 4 8 5 3
As the str output shown in the question indicates, StaticAggregate is a two column data.frame whose second column is a two column matrix, StaticScore. We can display the matrix like this:
StaticAggregate$StaticScore
## [,1] [,2]
## [1,] 6 5
## [2,] 7 6
## [3,] 7 4
## [4,] 8 5
To create a new column with the difference:
transform(StaticAggregate, diff = StaticScore[, 1] - StaticScore[, 2])
## ID StaticScore.1 StaticScore.2 diff
## 1 1 6 5 1
## 2 2 7 6 1
## 3 3 7 4 3
## 4 4 8 5 3
Note that there are no columns in StaticAggregate or in StaticAggregate$StaticScore named StaticScore.1 and StaticScore.2. StaticScore.1 in the heading of the data.frame print output just denotes the first column of the StaticScore matrix.
The reason that the matrix has no column names is that the aggregate function c does not produce them. If we change the original aggregate to this then they would have names:
StaticAggregate2 <- aggregate(StaticScore ~ ID, Aggregate_Test, setNames, c("A", "B"))
StaticAggregate2
## ID StaticScore.A StaticScore.B
## 1 1 6 5
## 2 2 7 6
## 3 3 7 4
## 4 4 8 5
Now we can write this using the column names of the matrix:
StaticAggregate2$StaticScore[, "A"]
## [1] 6 7 7 8
StaticAggregate2$StaticScore[, "B"]
## [1] 5 6 4 5
Note that there is a significant advantage to the way R's aggregate works, as it allows simpler access to the results: the kth column of the matrix is the kth result of the aggregate function. This is in contrast to having the (k+1)th column of the data.frame represent the kth result of the aggregate function. This may not seem like much of a simplification here, but for more complex problems it can be a significant one if you need to access the statistics matrix. Of course, you can always flatten it to 3 columns if you want:
do.call(data.frame, StaticAggregate)
but once you think about it for a while you may find that the structure it provides is actually more convenient.
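For anyone who wants to reproduce this, here is a self-contained sketch that rebuilds the test data from the question and checks both views of the result. The variable names agg and flat are just for illustration.

```r
# Test data as given in the question
Aggregate_Test <- data.frame(
  ID          = c(1, 2, 1, 2, 3, 4, 3, 4),
  StaticScore = c(6, 7, 5, 6, 7, 8, 4, 5)
)

agg <- aggregate(StaticScore ~ ID, Aggregate_Test, c)
ncol(agg)                    # 2: ID plus one matrix column
is.matrix(agg$StaticScore)   # TRUE

# Difference taken directly from the matrix column
agg$StaticScore[, 1] - agg$StaticScore[, 2]

# Or flatten first and use ordinary column names
flat <- do.call(data.frame, agg)
flat$Staticdiff <- flat$StaticScore.1 - flat$StaticScore.2
flat
```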

freq() renames columns during printing

I want to get a one-way frequency table for each column in my dataframe (a count of each unique value in each column). I am following this tutorial, which suggests using the count() function from the plyr package.
for (col in mtcars[c("gear","carb")]){
freq <- count(col)
write.table(freq, file='filename.txt')
}
I would expect the output to look like this:
gear freq
1 3 15
2 4 12
3 5 5
Instead the column name is replaced with 'x':
x freq
1 3 15
2 4 12
3 5 5
Why is this happening, and how can I modify my for loop so that it prints the column name instead of 'x'?
(There is probably a better, vectorized way to do this other than using a for loop, but I'm new to R and can't quite figure out the syntax.)
In a for loop:
for (col in c("gear","carb")){
print(plyr::count(mtcars, col))
}
Using lapply():
lapply(c("gear","carb"), function(col) plyr::count(mtcars, col))
To be clear, count is not renaming anything. In your loop it receives col which is a vector. A vector does not have column names, and so count does not know what name it should use. It uses x as a place holder.
This will also work (it takes the column names of the mtcars dataset as input and returns a list of data frames):
lapply(c("gear","carb"), function(x){df <- as.data.frame(table(mtcars[x])); names(df) <- c(x, 'freq'); df})
[[1]]
gear freq
1 3 15
2 4 12
3 5 5
[[2]]
carb freq
1 1 7
2 2 10
3 3 3
4 4 10
5 6 1
6 8 1
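A variation on the same idea that avoids plyr entirely and returns a list named by column, so each table can be looked up (or written out) by name. This is a sketch only; the name freqs is made up, and writing to files is left out.

```r
cols <- c("gear", "carb")

# setNames() makes lapply() return a list named after the columns
freqs <- lapply(setNames(cols, cols), function(col) {
  out <- as.data.frame(table(mtcars[[col]]), stringsAsFactors = FALSE)
  names(out) <- c(col, "freq")   # restore the column name lost by [[ ]]
  out
})

freqs$gear
freqs[["carb"]]
```

Looping over the names rather than over the columns themselves is what keeps the column name available inside the function.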

Extracting a specific type columns and specific named columns from a data frame-R

Suppose I have a data frame where some columns are of factor type and there is a column named "index" which is not a factor. I want to extract the columns
which are of factor type and
the "index" column.
For example let
df<-data.frame(a=runif(10),b=as.factor(sample(10)),index=as.numeric(1:10))
So df is:
a b index
0.16187501 5 1
0.75214741 8 2
0.08741729 3 3
0.58871514 2 4
0.18464752 9 5
0.98392420 1 6
0.73771960 10 7
0.97141474 6 8
0.15768011 7 9
0.10171931 4 10
The desired output is (let it be a data frame called df1):
df1:
b index
5 1
8 2
3 3
2 4
9 5
1 6
10 7
6 8
7 9
4 10
which consists of the factor column and the column named "index".
I use the following code:
vars<-apply(df,2,function(x) {(is.factor(x)) || (names(x)=="index")})
df1<-df[,vars]
However, this code does not work. How can I return df1 using apply types function in R? I will be very glad for any help. Thanks a lot.
You could do:
df[ , sapply(df, is.factor) | grepl("index", names(df))]
I think two things went wrong with your method: First, apply converts the data frame to a matrix, which doesn't store values as factors. Also, in a matrix, every value has to be of the same mode (character, numeric, etc.). In this case, everything gets coerced to character, so there's no factor to find.
Second, the column name isn't accessible within apply (AFAIK), so names(x) returns NULL and names(x)=="index" returns logical(0).
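A self-contained sketch of both points, using the example data from the question. It swaps grepl for an exact comparison, names(df) == "index", on the assumption that only a column called exactly "index" should be kept; grepl would also match names such as "index2".

```r
set.seed(1)  # only so the example is reproducible
df <- data.frame(a = runif(10), b = as.factor(sample(10)), index = as.numeric(1:10))

# sapply() walks the columns of the data frame itself, so factors stay factors
keep <- sapply(df, is.factor) | names(df) == "index"
df1 <- df[, keep, drop = FALSE]
names(df1)

# What apply() sees: the data frame coerced to a character matrix
is.character(as.matrix(df))
```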

Using loop variables

I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the data is stored in raw format, I cannot just access the correct column numbers by using data[[152]] if I want to interact with a specific column of data (because random questions are filtered completely out of the data due to being long answer comments), but I'd like to be able to access them by data$152. Additionally, approximately half the columns names in my data have loaded with class(data$152) = NULL but class(data[[152]]) = integer (and if I rename the data[[152]] file it appropriately allows me to see class(data$152) as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
@Ramnath points out very accurately that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
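Since the original headers are Q001, Q002, and so on, another option is to derive the question numbers from the existing names rather than renumbering by position. This is a sketch on a made-up three-column frame (the columns and the name num are hypothetical); it assumes every header is "Q" followed by digits.

```r
# Hypothetical subset of the data: some questions are missing
dat <- data.frame(Q001 = 1:3, Q003 = 4:6, Q152 = 7:9)

# Strip the leading "Q" to recover the question numbers
num <- as.integer(sub("^Q", "", names(dat)))

# Look up question 152 without renaming anything
dat[[match(152, num)]]

# Or rename; numeric names need [[ ]] with a string, or backticks with $
names(dat) <- num
dat[["152"]]
dat$`152`
```

Either way, question 152 is found by its number even though it is only the third column.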
