Suppose I have the following dataframe :
df <- data.frame(A=c(1,2,3),B=c("a","b","c"),C=c(2,1,3),D=c(1,2,3),E=c("a","b","c"),F=c(1,2,3))
> df
A B C D E F
1 1 a 2 1 a 1
2 2 b 1 2 b 2
3 3 c 3 3 c 3
I want to filter out the columns that are identical. I know that I can do it with
DuplCols <- df[duplicated(as.list(df))]
UniqueCols <- df[ ! duplicated(as.list(df))]
In the real world my dataframe has more than 500 columns. I do not know how many identical columns of the same kind I have, and I do not know the names of those columns. However, each column name is unique (as in df). My desired result is (optimally) a dataframe in which each row stores the column names of one kind of identical columns. The number of columns in the DesiredResult dataframe is the maximal number of identical columns of one kind in the original dataframe; where another kind has fewer identical columns, NA should be stored:
> DesiredResult
X1 X2 X3
1 A D F
2 B E NA
3 C NA NA
(With "identical column of the same kind" I mean the following: in df the columns A, D, F are identical columns of the same kind and B, E are identical columns of the same kind.)
You can use unique and then test with %in% where it matches to extract the colname.
tt <- lapply(unique(as.list(df)), function(x) {colnames(df)[as.list(df) %in% list(x)]})
tt
#[[1]]
#[1] "A" "D" "F"
#
#[[2]]
#[1] "B" "E"
#
#[[3]]
#[1] "C"
t(sapply(tt, "length<-", max(lengths(tt)))) # as a matrix; wrap in as.data.frame() for a data.frame
# [,1] [,2] [,3]
#[1,] "A" "D" "F"
#[2,] "B" "E" NA
#[3,] "C" NA NA
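Putting the steps above together, here is a minimal sketch that returns the result as a data frame with the X1, X2, X3 names from the question. It is a variation rather than the answer's exact code: it compares columns with identical() instead of %in%, and the names cols, groups, width and result are only illustrative.

```r
df <- data.frame(A = c(1, 2, 3), B = c("a", "b", "c"), C = c(2, 1, 3),
                 D = c(1, 2, 3), E = c("a", "b", "c"), F = c(1, 2, 3))

cols <- as.list(df)

# One character vector of column names per distinct column
groups <- lapply(unique(cols), function(x) {
  names(df)[vapply(cols, identical, logical(1), x)]
})

# Pad every group with NA up to the size of the largest group,
# then bind the groups together as rows
width <- max(lengths(groups))
padded <- do.call(rbind, lapply(groups, `length<-`, width))

result <- as.data.frame(padded, stringsAsFactors = FALSE)
names(result) <- paste0("X", seq_len(width))
result
```

Binding with rbind keeps one row per group even when every group has a single member, which the sapply and t combination does not guarantee.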
I have created a matrix out of two vectors
x<-c(1,118,3,220)
y<-c("A","B","C","D")
z<-c(x,y)
m<-matrix(z,ncol=2)
Now I want to order the rows by the first column, but it doesn't work properly.
I tried:
m[order(m[,1]),]
The order should be 1,3,118,220, but it shows 1,118,220,3
A matrix can only hold one class, which in this case is character, since you have "A","B","C","D".
So if you still want to order the rows of the matrix, you need to take the first column, convert it to numeric, pass it to order, and use the result to reorder the rows.
m[order(as.numeric(m[, 1])), ]
# [,1] [,2]
#[1,] "1" "A"
#[2,] "3" "C"
#[3,] "118" "B"
#[4,] "220" "D"
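To see the coercion for yourself, here is a small sketch using the same vectors as the question (the name sorted is only for illustration). Sorting the first column directly compares the values as text, which is why "118" ends up before "3".

```r
x <- c(1, 118, 3, 220)
y <- c("A", "B", "C", "D")
m <- matrix(c(x, y), ncol = 2)

typeof(m)       # "character": the numbers were coerced to strings
sort(m[, 1])    # sorted as text, not as numbers

# Convert back to numeric only for the purpose of ordering
sorted <- m[order(as.numeric(m[, 1])), ]
sorted
```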
Since your data has mixed types, why not store it in a data frame instead?
x<-c(1,118,3,220)
y<-c("A","B","C","D")
df <- data.frame(x,y)
df[order(df[,1]),]
# x y
#1 1 A
#3 3 C
#2 118 B
#4 220 D
I have a data.frame with multiple columns and first column being Year. I want to sort my data frame in descending order for each year. I have fifteen years of data and then over 3000 columns.
I illustrate as follows:
Year A B C D
2000 2 3 4 NA
2001 3 4 NA 1
Desired output (my data frame has NAs as well, but I cannot remove those):
Year C B A
2000 4 3 2
Year B A D
2001 4 3 1
And this version as well
Year
2000 C B A
2001 B A D
I have scripted this code
Asc <-order(df[-1], decreasing=True)
But I'm unable to obtain my desired output. I have looked at "R sort row data in ascending order", but it is still different from what I'm looking for.
Would appreciate your help in this regard.
We can use apply with MARGIN=1. We loop through the rows of the dataset (excluding the first column) with apply, get the index of non-NA elements ('i1'), order the non-NA values descendingly ('i2'), and use that to rearrange the column names of the dataset.
m1 <- t(apply(df1[-1], 1, function(x) {
i1 <- !is.na(x)
i2 <- order(-x[i1])
names(df1)[-1][i1][i2]}))
m1
# [,1] [,2] [,3]
#[1,] "C" "B" "A"
#[2,] "B" "A" "D"
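Since df1 is not defined in the post, here is a self-contained sketch that rebuilds the two example rows from the question and runs the same idea. The Year row labels are my addition. Note the assumption that every row has the same number of non-NA values; otherwise apply returns a list rather than a matrix.

```r
# Example data reconstructed from the question
df1 <- data.frame(Year = c(2000, 2001),
                  A = c(2, 3), B = c(3, 4), C = c(4, NA), D = c(NA, 1))

m1 <- t(apply(df1[-1], 1, function(x) {
  keep <- !is.na(x)                   # positions of the non-NA values
  names(x)[keep][order(-x[keep])]     # column names, largest value first
}))

# Label each row with its year (illustrative addition)
rownames(m1) <- df1$Year
m1
```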
If we need the values as well as the names, a list approach is more suitable because it keeps the values in their original class instead of coercing everything into one matrix
lst <- apply(df1[-1], 1, function(x){
i1 <- !is.na(x)
list(sort(x[i1],decreasing=TRUE))})
lst
#[[1]]
#[[1]][[1]]
#C B A
#4 3 2
#[[2]]
#[[2]][[1]]
#B A D
#4 3 1
We can extract the names or the elements from the 'lst'
do.call(rbind, do.call(`c`,rapply(lst, names,
how='list')))
# [,1] [,2] [,3]
#[1,] "C" "B" "A"
#[2,] "B" "A" "D"
Or
t(sapply(do.call(c, lst), names))
and the values as
t(simplify2array(do.call(c, lst)))
I have a dataframe with one column that I would like to split into several columns, but the number of splits is dynamic throughout the rows.
Var1
====
A/B
A/B/C
C/B
A/C/D/E
I have tried using colsplit(df$Var1,split="/",names=c("Var1","Var2","Var3","Var4")), but in rows with fewer than 4 values the values are repeated to fill the remaining columns.
From Hansi, the desired output would be:
Var1 Var2 Var3 Var4
[1,] "A" "B" NA NA
[2,] "A" "B" "C" NA
[3,] "C" "B" NA NA
[4,] "A" "C" "D" "E"
> read.table(text=as.character(df$Var1), sep="/", fill=TRUE)
V1 V2 V3 V4
1 A B
2 A B C
3 C B
4 A C D E
Leading zeros in digit only fields can be preserved with colClasses="character"
a <- data.frame(Var1=c("01/B","04/B/C","0098/B","8708/C/D/E"))
read.table(text=as.character(a$Var1), sep="/", fill=TRUE, colClasses="character")
V1 V2 V3 V4
1 01 B
2 04 B C
3 0098 B
4 8708 C D E
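The read.table output above leaves the padded cells as empty strings rather than NA. A possible follow-up, sketched under the assumption that empty strings never occur as real values in your data, is to replace them afterwards. Passing col.names also fixes the number of columns up front, because read.table otherwise infers it from the first five lines of input only. The names x, n and out are illustrative.

```r
x <- c("A/B", "A/B/C", "C/B", "A/C/D/E")

# Widest row decides how many columns are needed
n <- max(lengths(strsplit(x, "/", fixed = TRUE)))

out <- read.table(text = x, sep = "/", fill = TRUE,
                  colClasses = "character",
                  col.names = paste0("Var", seq_len(n)))

# Turn the padded empty cells into NA
out[!is.na(out) & out == ""] <- NA
out
```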
If I understood your objective correctly here is one possible solution, I'm sure there is a better way of doing it but this was the first that came to mind:
a <- data.frame(Var1=c("A/B","A/B/C","C/B","A/C/D/E"))
splitNames <- c("Var1","Var2","Var3","Var4")
# R> a
# Var1
# 1 A/B
# 2 A/B/C
# 3 C/B
# 4 A/C/D/E
b <- t(apply(a,1,function(x){
temp <- unlist(strsplit(x,"/"));
return(c(temp,rep(NA,max(0,length(splitNames)-length(temp)))))
}))
colnames(b) <- splitNames
# R> b
# Var1 Var2 Var3 Var4
# [1,] "A" "B" NA NA
# [2,] "A" "B" "C" NA
# [3,] "C" "B" NA NA
# [4,] "A" "C" "D" "E"
I do not know of a function that solves your problem, but you can achieve it easily with standard R commands:
# Here are your data
df <- data.frame(Var1=c("A/B", "A/B/C", "C/B", "A/C/D/E"), stringsAsFactors=FALSE)
# Split
rows <- strsplit(df$Var1, split="/")
# Maximum amount of columns
columnCount <- max(sapply(rows, length))
# Fill with NA
rows <- lapply(rows, `length<-`, columnCount)
# Coerce to data.frame
out <- as.data.frame(rows)
# Transpose
out <- t(out)
As it relies on strsplit, you may need to make some type conversion. See type.con
I know that to get a row from a data frame in R, we can do this:
data[row,]
where row is an integer. But that spits out an ugly-looking data structure where every value is labeled with its column name. How can I just get a row as a list of values?
Data frames created by importing data from an external source had their character data converted to factors by default in R versions before 4.0.0. If you do not want this, set stringsAsFactors=FALSE
In this case to extract a row or a column as a vector you need to do something like this:
as.numeric(as.vector(DF[1,]))
or like this
as.character(as.vector(DF[1,]))
You can't necessarily get it as a vector because each column might have a different mode. You might have numerics in one column and characters in the next.
If you know the mode of the whole row, or can convert to the same type, you can use the mode's conversion function (for example, as.numeric()) to convert to a vector. For example:
> state.x77[1,]
Population Income Illiteracy Life Exp Murder HS Grad Frost
3615.00 3624.00 2.10 69.05 15.10 41.30 20.00
Area
50708.00
> as.numeric(state.x77[1,])
[1] 3615.00 3624.00 2.10 69.05 15.10 41.30 20.00 50708.00
This would work even if some of the columns were integers, although they would be converted to numeric floating-point numbers.
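For a data frame rather than a matrix, one way to get the same kind of plain vector is unlist. This is a small sketch with made-up data (df, mixed, row_vec and mixed_vec are illustrative names), and it shows what happens when the columns do not share a type.

```r
# All columns numeric: the row becomes a plain numeric vector
df <- data.frame(a = c(1.5, 2.5), b = c(10L, 20L))
row_vec <- unlist(df[1, ], use.names = FALSE)
row_vec

# Mixed columns: everything is coerced to character
mixed <- data.frame(a = c(1, 2), s = c("x", "y"), stringsAsFactors = FALSE)
mixed_vec <- unlist(mixed[1, ], use.names = FALSE)
mixed_vec
```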
There is a problem with what you propose; namely that the components of data frames (what you call columns) can be of different data types. If you want a single row as a vector, that must contain only a single data type - they are atomic vectors!
Here is an example:
> set.seed(2)
> dat <- data.frame(A = 1:10, B = sample(LETTERS[1:4], 10, replace = TRUE))
> dat
A B
1 1 A
2 2 C
3 3 C
4 4 A
5 5 D
6 6 D
7 7 A
8 8 D
9 9 B
10 10 C
> dat[1, ]
A B
1 1 A
If we force R to drop the data frame structure with drop = TRUE, its only recourse is to return the row as a list, which preserves the disparate data types.
> dat[1, , drop = TRUE]
$A
[1] 1
$B
[1] A
Levels: A B C D
The only logical solution to this is to get the data frame into a common type by coercing it to a matrix. This is done via data.matrix(), for example:
> mat <- data.matrix(dat)
> mat[1,]
A B
1 1
data.matrix() converts factors to their internal numeric codes. The above allows the first row to be extracted as a vector.
However, if you have character data in the data frame, the only recourse is to create a character matrix, which may or may not be useful. data.matrix() can no longer be used; we need as.matrix() instead:
> dat$String <- LETTERS[1:10]
> str(dat)
'data.frame': 10 obs. of 3 variables:
$ A : int 1 2 3 4 5 6 7 8 9 10
$ B : Factor w/ 4 levels "A","B","C","D": 1 3 3 1 4 4 1 4 2 3
$ String: chr "A" "B" "C" "D" ...
> mat <- data.matrix(dat)
Warning message:
NAs introduced by coercion
> mat
A B String
[1,] 1 1 NA
[2,] 2 3 NA
[3,] 3 3 NA
[4,] 4 1 NA
[5,] 5 4 NA
[6,] 6 4 NA
[7,] 7 1 NA
[8,] 8 4 NA
[9,] 9 2 NA
[10,] 10 3 NA
> mat <- as.matrix(dat)
> mat
A B String
[1,] " 1" "A" "A"
[2,] " 2" "C" "B"
[3,] " 3" "C" "C"
[4,] " 4" "A" "D"
[5,] " 5" "D" "E"
[6,] " 6" "D" "F"
[7,] " 7" "A" "G"
[8,] " 8" "D" "H"
[9,] " 9" "B" "I"
[10,] "10" "C" "J"
> mat[1, ]
A B String
" 1" "A" "A"
> class(mat[1, ])
[1] "character"
How about this?
library(tidyverse)
dat <- as_tibble(iris)
pulled_row <- dat %>% slice(3) %>% flatten_chr()
If you know all the values are same type, then use flatten_xxx.
Otherwise, I think flatten_chr() is safer.
As user "Reinstate Monica" notes, this problem has two parts:
A data frame will often have different data types in each column that need to be coerced to character strings.
Even after coercing the columns to character format, the data.frame "shell" needs to be stripped off to create a vector via a command like unlist.
With a combination of dplyr and base R this can be done in two lines. First, mutate_all converts all columns to character format. Second, the unlist command extracts the vector from the data.frame structure.
My particular issue was that the second line of a csv included the actual column names. So, I wanted to extract the second row to a vector and use that to assign column names. The following worked to extract the row as a character vector:
library(dplyr)
data_col_names <- data[2, ] %>%
mutate_all(as.character) %>%
unlist(., use.names=FALSE)
# example of using extracted row to rename cols
names(data) <- data_col_names
# only for this example, you'd want to remove row 2
# data <- data[-2, ]
(Note: Using as.character() in place of unlist will work too but it's less intuitive to apply as.character twice.)
The shortest variant I can see is
c(t(data[row,]))
However, if at least one column in data is a column of strings, it will return a character vector.
Do the following function pairs generate exactly the same results?
Pair 1) names() & colnames()
Pair 2) rownames() & row.names()
As Oscar Wilde said
Consistency is the last refuge of the unimaginative.
R is more of an evolved rather than designed language, so these things happen. names() and colnames() work on a data.frame but names() does not work on a matrix:
R> DF <- data.frame(foo=1:3, bar=LETTERS[1:3])
R> names(DF)
[1] "foo" "bar"
R> colnames(DF)
[1] "foo" "bar"
R> M <- matrix(1:9, ncol=3, dimnames=list(1:3, c("alpha","beta","gamma")))
R> names(M)
NULL
R> colnames(M)
[1] "alpha" "beta" "gamma"
R>
Just to expand a little on Dirk's example:
It helps to think of a data frame as a list with equal length vectors. That's probably why names works with a data frame but not a matrix.
The other useful function is dimnames which returns the names for every dimension. You will notice that the rownames function actually just returns the first element from dimnames.
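A quick way to check these relationships yourself, reusing the DF and M objects from the earlier answer:

```r
DF <- data.frame(foo = 1:3, bar = LETTERS[1:3])
M <- matrix(1:9, ncol = 3, dimnames = list(1:3, c("alpha", "beta", "gamma")))

is.list(DF)                                # a data frame is a list of columns
identical(names(DF), colnames(DF))         # same result on a data frame
is.null(names(M))                          # the matrix has no names attribute

# rownames() and colnames() are the two elements of dimnames()
identical(rownames(M), dimnames(M)[[1]])
identical(colnames(M), dimnames(M)[[2]])
```

Each of these expressions should evaluate to TRUE.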
Regarding rownames and row.names: I can't tell the difference, although rownames uses dimnames while row.names was written outside of R. They both also seem to work with higher dimensional arrays:
> a <- array(1:5, 1:4)
> a[1,,,]
> rownames(a) <- "a"
> row.names(a)
[1] "a"
> a
, , 1, 1
[,1] [,2]
a 1 2
> dimnames(a)
[[1]]
[1] "a"
[[2]]
NULL
[[3]]
NULL
[[4]]
NULL
I think that using colnames and rownames makes the most sense; here's why.
Using names has several disadvantages. You have to remember that it means "column names", and it only works with data frames, so you'll need to call colnames whenever you use matrices. By calling colnames, you only have to remember one function. Finally, if you look at the code for colnames, you will see that it calls names in the case of a data frame anyway, so the output is identical.
rownames and row.names return the same values for data frames and matrices; the only difference that I have spotted is that where there aren't any names, rownames will print "NULL" (as does colnames), but row.names returns it invisibly. Since there isn't much to choose between the two functions, rownames wins on the grounds of aesthetics, since it pairs more prettily with colnames. (Also, for the lazy programmer, you save a character of typing.)
And another expansion:
# create dummy matrix
set.seed(10)
m <- matrix(round(runif(25, 1, 5)), 5)
d <- as.data.frame(m)
If you want to assign new column names, you can do the following on a data.frame:
# an identical effect can be achieved with colnames()
names(d) <- LETTERS[1:5]
> d
A B C D E
1 3 2 4 3 4
2 2 2 3 1 3
3 3 2 1 2 4
4 4 3 3 3 2
5 1 3 2 4 3
If, however, you run the previous command on a matrix, you'll mess things up:
names(m) <- LETTERS[1:5]
> m
[,1] [,2] [,3] [,4] [,5]
[1,] 3 2 4 3 4
[2,] 2 2 3 1 3
[3,] 3 2 1 2 4
[4,] 4 3 3 3 2
[5,] 1 3 2 4 3
attr(,"names")
[1] "A" "B" "C" "D" "E" NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[20] NA NA NA NA NA NA
Since a matrix can be regarded as a two-dimensional vector, you'll assign names only to the first five values (you don't want to do that, do you?). In this case, you should stick with colnames().
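For comparison, here is the same matrix with the names assigned through colnames(), which labels the columns and leaves the individual elements unnamed:

```r
set.seed(10)
m <- matrix(round(runif(25, 1, 5)), 5)

# Label the five columns rather than the first five elements
colnames(m) <- LETTERS[1:5]

colnames(m)
is.null(names(m))   # no element-wise names were created
m[, "C"]            # columns can now be selected by name
```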
So there...