I want to add an extra row above what is row 1 of the following dataframe (i.e. above the labels a, b and Percent):
a<-c(1:5)
b<-c(4,3,2,1,1)
Percent<-c(40,30,20,10,10)
df1<-data.frame(a,b,Percent)
These dataframes represent questions in an interview analysis I am doing, and I want to include the question descriptor above the column headers so I can easily identify which dataframe belongs to which question (e.g. "Age"). I have been using rbind to add rows, but is it possible to use this command to add a row above the column headers?
Thanks.
If it is just meta-data, you can add it as an attribute to the data.frame.
> attr(df1, "Question") <- "Age"
> attributes(df1)
$names
[1] "a" "b" "Percent"
$row.names
[1] 1 2 3 4 5
$class
[1] "data.frame"
$Question
[1] "Age"
If you want the question to be printed above the data.frame,
you can define a Question class that extends data.frame
and override the print method.
class(df1) <- c( "Question", class(df1) )
print.Question <- function(x, ...) {
  if (!is.null(attr(x, "Question"))) {
    cat("Question:", attr(x, "Question"), "\n")
  }
  print.data.frame(x)
}
df1
But that looks like overkill: it may be simpler to just add a column.
> df1$Question <- "Age"
> df1
a b Percent Question
1 1 4 40 Age
2 2 3 30 Age
3 3 2 20 Age
4 4 1 10 Age
5 5 1 10 Age
I wish this were part of core R. I hacked up a solution with Jason Bryer's likert package, using attributes to store column names and having the likert function read these attributes and use them when plotting. It only works with that function, though. There is also a label function in Hmisc, but again, none of the other functions pay any attention to it (including the functions that display dataframes, etc.).
Here's a writeup of my hack http://reganmian.net/blog/2013/10/02/likert-graphs-in-r-embedding-metadata-for-easier-plotting/, with a link to the code.
rbind is really the only way to go, but everything would be coerced to a common atomic type (here, character). For example:
cols <- c("Age", "Age", "Age")
df1 <- rbind(cols,df1)
str(df1)
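To see that coercion in action, here is a minimal sketch reproducing the example above; after the rbind, every column has become character:

```r
# Recreate the example data frame
a <- 1:5
b <- c(4, 3, 2, 1, 1)
Percent <- c(40, 30, 20, 10, 10)
df1 <- data.frame(a, b, Percent)

# rbind a character row above the data: every column is coerced
cols <- c("Age", "Age", "Age")
df1 <- rbind(cols, df1)
sapply(df1, class)  # all three columns are now "character"
```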
Definitely agree with Vincent on this one; I do this quite frequently with survey data. If it's all in one data.frame, I generally set a comment attribute on each element of the data.frame(). It's also useful when you perform multiple operations and want to maintain reasonable colnames(df1). It's not good practice, but if this is for presentation you can always set check.names = FALSE when you create your data.frame().
a<-c(1:5)
b<-c(4,3,2,1,1)
Percent<-c(40,30,20,10,10)
df1<-data.frame(a,b,Percent)
comment(df1$a) <- "Q1a. This is a likert scale"
comment(df1$b) <- "Q1b. This is another likert scale"
comment(df1$Percent) <- "QPercent. This is some other question"
Then, if I "forget" what's in the columns, I can take a quick peek:
sapply(df1, comment)
Related
So I have a dataframe column userID that contains duplicates, and I was asked to find the userID that appears least frequently. What are the possible methods to achieve this, using only base R or the dplyr package?
Something like this
userID = c(1,1,1,1,2,2,1,1,4,4,4,4,3)
Expected Output would be 3 in this case.
If this is based on the lengths of same adjacent values
with(rle(userID), values[which.min(lengths)])
#[1] 3
Or if it is based on the full data values
names(which.min(table(userID)))
#[1] "3"
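To illustrate the distinction the answer draws (a sketch on the question's own data): rle() works on adjacent runs, while table() pools all occurrences of each value:

```r
userID <- c(1, 1, 1, 1, 2, 2, 1, 1, 4, 4, 4, 4, 3)

# rle() sees user 1 as two separate runs (lengths 4 and 2)
rle(userID)$lengths   # 4 2 2 4 1
rle(userID)$values    # 1 2 1 4 3

# table() counts total occurrences regardless of position
table(userID)         # 1:6  2:2  3:1  4:4
```

Both approaches happen to pick 3 here, but they answer different questions: the shortest run versus the rarest value overall.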
Another possibility is to define a Mode-style function that returns the least frequent value:
# example dataframe
df <- data.frame(userID = c(1,1,1,1,2,2,1,1,4,4,4,4,3))
# define Mode function
Mode <- function(x){
  a <- table(x)        # frequency table of the column
  return(a[which.min(a)])  # least frequent value, with its count
}
Mode(df$userID)
# Output:
3 #value
1 #count
Gives the value 3 and the count 1
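One caveat worth noting (my addition, not part of the original answer): which.min() returns only the first minimum, so if several values are tied for least frequent you see just one of them. A sketch of retrieving all of them:

```r
x <- c(1, 1, 2, 3)           # both 2 and 3 occur exactly once
tab <- table(x)
tab[which.min(tab)]          # returns only the first tie ("2")
names(tab)[tab == min(tab)]  # returns every least-frequent value: "2" "3"
```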
I have a large dataset (dataframe) where I want to find the number and the names of the categories in a column.
For example my df was like that:
A B
1 car
2 car
3 bus
4 car
5 plane
6 plane
7 plane
8 plane
9 plane
10 train
I would want to find the category names:
car
bus
plane
train
and the number of categories: 4
How would I do that?
categories <- unique(yourDataFrame$yourColumn)
numberOfCategories <- length(categories)
Pretty painless.
This gives unique, length of unique, and frequency:
table(df$B)
bus car plane train
1 3 5 1
length(table(df$B))
[1] 4
You can simply use unique:
x <- unique(df$B)
And it will extract the unique values in the column. You can use it with lapply to get them from each column too!
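For example, a per-column sketch on the question's data (column names A and B assumed from the example); lapply() returns the unique values of every column as a named list, which suits the fact that columns can have different numbers of unique values:

```r
df <- data.frame(
  A = 1:10,
  B = c("car", "car", "bus", "car", "plane",
        "plane", "plane", "plane", "plane", "train")
)
# unique values per column, in order of first appearance
lapply(df, unique)
# $A is 1..10, $B is "car" "bus" "plane" "train"
```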
I would recommend you use factors here, if you are not already. It's straightforward and simple.
levels() gives the unique categories and nlevels() gives the number of them. If we run droplevels() on the data first, we take care of any levels that may no longer be in the data.
with(droplevels(df), list(levels = levels(B), nlevels = nlevels(B)))
# $levels
# [1] "bus" "car" "plane" "train"
#
# $nlevels
# [1] 4
Additionally, to see sorted values you can use the following:
sort(table(df$B), decreasing = TRUE)
And you will see the values in the decreasing order.
First you must ensure that your column is the correct data type. Most probably R read it in as character ("chr"), which you can check with str(df).
For the data you have provided as an example, you will want to change this to a factor: df$column <- as.factor(df$column)
Once the data are in the correct format, you can then use levels(df$column) to get a summary of the levels you have in the dataset.
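Putting those steps together as a small sketch (the column name B is assumed from the example data):

```r
df <- data.frame(B = c("car", "car", "bus", "plane"),
                 stringsAsFactors = FALSE)
str(df)                   # shows B as chr
df$B <- as.factor(df$B)   # convert to factor
levels(df$B)              # "bus" "car" "plane" (sorted alphabetically)
nlevels(df$B)             # 3
```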
I've written an apply where I want to 'loop' over a subset of the columns in a dataframe and print some output. For the sake of an example I'm just transforming based on dividing one column by another (I know there are other ways to do this) so we have:
apply(df[c("a", "b", "c")], 2, function(x){
  z <- a/df[c("divisor")]
})
I'd like to print the column name currently being operated on, but colnames(x) (for example) doesn't work.
Then I want to save a new column, based on each colname (a.exp,b.exp or whatever) into the same df.
For example, take
df <- data.frame(a = 1:3, b = 11:13, c = 21:23)
I'd like to print the column name currently being operated on, but
colnames(x) (for example) doesn't work.
Use sapply with column indices:
sapply(seq_len(ncol(df)), function(x) names(df)[x])
# [1] "a" "b" "c"
I want to save a new column, based on each colname (a.exp,b.exp or
whatever) into the same df.
Here is one way to do it:
(df <- cbind(df, setNames(as.data.frame(apply(df, 2, "^", 2)), paste(names(df), "sqr", sep = "."))))
# a b c a.sqr b.sqr c.sqr
# 1 1 11 21 1 121 441
# 2 2 12 22 4 144 484
# 3 3 13 23 9 169 529
I think a lot of people will look for this same issue, so I'm answering my own question (having eventually found the answers). As below, there are other answers to both parts (thanks!), but none combining these issues (and some of the examples are more complex).
First, it seems the column name really isn't something you can get at from inside the function (seems weird to me!), so you 'loop' over the column names instead and, within the function, access the actual vectors by name.
Then the key thing is that to assign within an apply, to create your new columns, you use '<<-':
sapply(colnames(df[c("a", "b", "c")]), function(x) {
  z <- ChISEQCIS[c(paste0(x))]/ChISEQCIS[c("V1")]
  ChISEQCIS[c(paste0(x, "ind"))] <<- z
})
The <<- operator is discussed e.g. at https://stackoverflow.com/questions/2628621/how-do-you-use-scoping-assignment-in-r
I got confused because I only vaguely thought about wanting to save the outputs initially, and I figured I needed both the column (I incorrectly assumed apply worked like a loop, so I could use a counter as an index or something) and that there should be some way to get the name separately (e.g. colnames(x)).
There are a couple of related stack questions:
https://stackoverflow.com/questions/9624866/access-to-column-name-of-dataframe-with-apply-function
https://stackoverflow.com/questions/21512041/printing-a-column-name-inside-lapply-function
https://stackoverflow.com/questions/10956873/how-to-print-the-name-of-current-row-when-using-apply-in-r
https://stackoverflow.com/questions/7681013/apply-over-matrix-by-column-any-way-to-get-column-name (easiest to understand)
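As an aside (my sketch, not from the linked answers): a plain for loop over the column names does the same job without <<-, since you can assign into the data frame directly. The names df and divisor below are illustrative stand-ins for the ChISEQCIS data and its V1 column:

```r
df <- data.frame(a = 1:3, b = 11:13, c = 21:23, divisor = c(2, 4, 8))

# loop over column names; nm is the name, df[[nm]] is the column vector
for (nm in c("a", "b", "c")) {
  df[[paste0(nm, ".ind")]] <- df[[nm]] / df[["divisor"]]
}
names(df)  # "a" "b" "c" "divisor" "a.ind" "b.ind" "c.ind"
```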
So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump straight to a for loop, but I know that R isn't great with loops of that type, and in this case I have hundreds of thousands of rows of data to sort through, so I'm wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
# ensure character data
LOG_MESSAGE <- as.character(LOG_MESSAGE)
CURRENT_EVENT <- with(rle(LOG_MESSAGE), # list with 'values' and 'lengths'
rep(replace(values,
nchar(values)==0,
values[nchar(values) != 0]),
lengths))
})
# LOG_MESSAGE CURRENT_EVENT
# 1 FIRST_EVENT FIRST_EVENT
# 2 FIRST_EVENT
# 3 SECOND_EVENT SECOND_EVENT
# 4 SECOND_EVENT
# 5 SECOND_EVENT
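Another base-R fill-forward idiom, sketched here as an aside (it assumes the first entry is non-empty): index each position, zero out the empty ones, carry the largest index forward with cummax(), and subset:

```r
x <- c("FIRST_EVENT", "", "SECOND_EVENT", "", "")
idx <- cummax(seq_along(x) * (x != ""))  # 1 1 3 3 3
x[idx]
# "FIRST_EVENT" "FIRST_EVENT" "SECOND_EVENT" "SECOND_EVENT" "SECOND_EVENT"
```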
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34,56,78,98,234),
log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <-
transform(dat,
Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
"_"),
`[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two, and we want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[` and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
The dataset named data has both categorical and continuous variables. I would like to delete the categorical variables.
I tried:
data.1 <- data[,colnames(data)[[3L]]!=0]
No error is printed, but the categorical variables stay in data.1. Where is the problem?
The summary of "head(data)" is
id 1,2,3,4,...
age 45,32,54,23,...
status 0,1,0,0,...
...
(more variables like as I wrote above)
All variables are defined as "Factor".
What are you trying to do with that code? First of all, colnames(data) is not a list, so using [[ ]] doesn't make sense. Second, the only thing you test is whether the third column name is not equal to zero. As a column name can never start with a number, that's pretty much always true. So your code translates to:
data1 <- data[,TRUE]
Not what you intend to do.
I suppose you know the meaning of binomial. One way of doing this is to define your own function is.binomial(), like this:
is.binomial <- function(x, na.action=c('na.omit','na.fail','na.pass')) {
  FUN <- match.fun(match.arg(na.action))
  length(unique(FUN(x))) == 2
}
in case you want to take care of NAs. You can then apply this to your dataframe:
data.1 <- data[!sapply(data,is.binomial)]
This way you drop all binomial columns, i.e. columns with only two distinct values.
@Shimpei Morimoto,
I think you need a different approach.
Are the categorical variables defined in the dataframe as factors?
If so you can use:
data.1 <- data[, !sapply(data, is.factor)]
(sapply() is needed here rather than apply(), since apply() would first convert the data.frame to a matrix and lose the factor classes.)
The test you perform now is whether column name number 3L is not 0.
I think this is not the case.
Another approach is
data.1 <- data[, -3L]
but this works only if 3L is the number of the only column with categorical variables.
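A related one-liner (my sketch, on assumed example data): Filter() with Negate(is.factor) keeps exactly the non-factor columns:

```r
data <- data.frame(id = 1:3,
                   age = c(45, 32, 54),
                   status = factor(c(0, 1, 0)))
# Filter() tests each column with the predicate and keeps the TRUE ones;
# Negate() flips is.factor, so factor columns are dropped
data.1 <- Filter(Negate(is.factor), data)
names(data.1)  # "id" "age"
```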
I think you're getting there, with your last comment to @Mischa Vreeburg. It might make sense (as you suggest) to reformat your original data file, but you should also be able to solve the problem within R. I can't quite replicate the "undefined columns" error you got.
Construct some data that look as much like your data as possible:
X <- read.csv(textConnection(
"id,age,pre.treat,status
1,'27', 0,0
2,'35', 1,0
3,'22', 0,1
4,'24', 1,2
5,'55', 1,3
, ,yes(vs)no,"),
quote="\"'")
Take a look:
str(X)
'data.frame': 6 obs. of 4 variables:
$ id : int 1 2 3 4 5 NA
$ age : int 27 35 22 24 55 NA
$ pre.treat: Factor w/ 3 levels " 0"," 1","yes(vs)no": 1 2 1 2 2 3
$ status : int 0 0 1 2 3 NA
Define @Joris Meys' function:
is.binomial <- function(x,na.action=c('na.omit','na.fail','na.pass')) {
FUN <- match.fun(match.arg(na.action))
length(unique(FUN(x)))==2
}
Try it out: you'll see that it does not detect pre.treat as binomial, and keeps all the variables.
sapply(X,is.binomial)
X1 <- X[!sapply(X,is.binomial)]
names(X1)
## keeps everything
We can drop the last row and try again:
X2 <- X[-nrow(X),]
sapply(X2,is.binomial)
It is true in general that R does not expect "extraneous" information such as level IDs to be in the same column as the data themselves. On the one hand, you can do even better in the R world by simply leaving the data as their original, meaningful values ("no"/"yes", or "healthy"/"sick", rather than 0/1); on the other hand, the data take up slightly more space if stored as a text file, and, more importantly, it becomes harder to incorporate other meta-data such as units in the file along with the data ...