I've got a data.frame which contains a character variable and multiple numeric variables, something like this:
sampleDF <- data.frame(a = c(1,2,3,"String"), b = c(1,2,3,4), c= c(5,6,7,8), stringsAsFactors = FALSE)
Which looks like this:
a b c
1 1 1 5
2 2 2 6
3 3 3 7
4 String 4 8
I'd like to transpose this data.frame and get it to look like this:
V1 V2 V3 V4
1 1 2 3 String
2 1 2 3 4
3 5 6 7 8
I tried
c<-t(sampleDF)
as well as
d<-transpose(sampleDF)
but both of these methods result in V1, V2 and V3 being of character type despite only holding numeric values.
I know that this has already been asked multiple times. However, I haven't found a suitable answer for why in this case V1, V2 and V3 are also being converted to character.
Is there any way to ensure that these columns stay numeric?
Thanks a lot, and apologies in advance for the duplicate nature of this question.
EDIT:
as.data.frame(t(sampleDF))
Does not solve the problem:
'data.frame': 3 obs. of 4 variables:
$ V1: Factor w/ 2 levels "1","5": 1 1 2
..- attr(*, "names")= chr "a" "b" "c"
$ V2: Factor w/ 2 levels "2","6": 1 1 2
..- attr(*, "names")= chr "a" "b" "c"
$ V3: Factor w/ 2 levels "3","7": 1 1 2
..- attr(*, "names")= chr "a" "b" "c"
$ V4: Factor w/ 3 levels "4","8","String": 3 1 2
..- attr(*, "names")= chr "a" "b" "c"
After transposing it, convert the columns to numeric with type.convert
out <- as.data.frame(t(sampleDF), stringsAsFactors = FALSE)
out[] <- lapply(out, type.convert, as.is = TRUE)
row.names(out) <- NULL
out
# V1 V2 V3 V4
#1 1 2 3 String
#2 1 2 3 4
#3 5 6 7 8
str(out)
#'data.frame': 3 obs. of 4 variables:
# $ V1: int 1 1 5
# $ V2: int 2 2 6
# $ V3: int 3 3 7
# $ V4: chr "String" "4" "8"
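As for why the numeric rows turn into character in the first place: t() on a data.frame goes through as.matrix(), and a matrix can only hold one type, so a single character column forces everything to character. The following is a minimal sketch of that, reusing sampleDF from the question; it only illustrates the coercion and the type.convert step shown above.

```r
# Rebuild the example from the question.
sampleDF <- data.frame(a = c(1, 2, 3, "String"), b = c(1, 2, 3, 4),
                       c = c(5, 6, 7, 8), stringsAsFactors = FALSE)

# t() on a data.frame coerces to a matrix first; a matrix has one type.
m <- t(sampleDF)
is.matrix(m)
typeof(m)  # "character"

# Converting back and re-guessing each column's type restores the numbers.
out <- as.data.frame(m, stringsAsFactors = FALSE)
out[] <- lapply(out, type.convert, as.is = TRUE)
sapply(out, class)
```

So the columns are not "losing" their type during the transpose itself; the type is already gone once the data.frame has become a matrix.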
Or rbind the first column converted to respective 'types' with the transposed other columns
rbind(lapply(sampleDF[,1], type.convert, as.is = TRUE),
as.data.frame(t(sampleDF[2:3])))
NOTE: The first method would be more efficient
Or another approach would be to paste the values together in each column and then read it again
read.table(text=paste(sapply(sampleDF, paste, collapse=" "),
collapse="\n"), header = FALSE, stringsAsFactors = FALSE)
# V1 V2 V3 V4
#1 1 2 3 String
#2 1 2 3 4
#3 5 6 7 8
Or we can convert the data.frame with data.matrix, which changes the character elements to NA, and then use is.na to find the positions of those NA elements and replace them with the original string values
m1 <- data.matrix(sampleDF)
out <- as.data.frame(t(m1))
out[is.na(out)] <- sampleDF[is.na(m1)]
Or another option is type_convert from readr
library(dplyr)
library(readr)
sampleDF %>%
t %>%
as_data_frame %>%
type_convert
# A tibble: 3 x 4
# V1 V2 V3 V4
# <int> <int> <int> <chr>
#1 1 2 3 String
#2 1 2 3 4
#3 5 6 7 8
Say we have a matrix M.
# n1 n2 n3 n4
# m1 1 4 7 10
# m2 2 5 8 11
# m3 3 6 9 12
In order to list the columns as "data.frames" we can do:
apply(M, 2, data.frame)
However, the listed data frames have weird and identical column names, e.g.:
# $n1
# newX...i.
# m1 1
# m2 2
# m3 3
Same thing with lapply:
lapply(data.frame(M), data.frame)
# $n1
# X..i..
# 1 1
# 2 2
# 3 3
The only way I have found to get my expected output so far is doing:
lapply(1:ncol(M), function(x) setNames(data.frame(M[,x]), colnames(M)[x]))
# [[1]]
# n1 ## <-- expected col names!
# m1 1
# m2 2
# m3 3
This turns out to be unexpectedly cumbersome. Have I maybe missed a simpler base function?
Data
M <- structure(1:12, .Dim = 3:4, .Dimnames = list(c("m1", "m2", "m3"
), c("n1", "n2", "n3", "n4")))
One option could be (a synthesis of my original post and the suggestion from #H 1):
split.default(data.frame(M), colnames(M))
It has the structure:
List of 4
$ n1:'data.frame': 3 obs. of 1 variable:
..$ n1: int [1:3] 1 2 3
$ n2:'data.frame': 3 obs. of 1 variable:
..$ n2: int [1:3] 4 5 6
$ n3:'data.frame': 3 obs. of 1 variable:
..$ n3: int [1:3] 7 8 9
$ n4:'data.frame': 3 obs. of 1 variable:
..$ n4: int [1:3] 10 11 12
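One thing to watch with this approach: split.default turns the grouping vector into a factor, so the list comes back in sorted level order rather than in column order. A minimal sketch of a workaround, assuming the column names are unique and syntactically valid (otherwise data.frame() may alter them); M2 is a hypothetical reordering of M used only to make the effect visible:

```r
M <- structure(1:12, .Dim = 3:4,
               .Dimnames = list(c("m1", "m2", "m3"), c("n1", "n2", "n3", "n4")))

# Hypothetical reordering so the column names are no longer alphabetical.
M2 <- M[, c("n3", "n1", "n4", "n2")]
cols <- colnames(M2)

# Fixing the factor levels to the column order keeps the list in that order.
res <- split.default(data.frame(M2), factor(cols, levels = cols))
names(res)
res$n3
```

Each element is still a one-column data.frame that carries the original row names and its own column name.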
I'm trying to rename the columns of a matrix that has no names in dplyr :
set.seed(1234)
v1 <- table(round(runif(50,0,10)))
v2 <- table(round(runif(50,0,10)))
library(dplyr)
bind_rows(v1,v2) %>%
t
[,1] [,2]
0 3 4
1 1 9
2 8 6
3 11 7
5 7 8
6 7 1
7 3 4
8 6 3
9 3 6
10 1 NA
4 NA 2
I usually use rename for that with the form rename(new_name=old_name) however because there is no old_name it doesn't work. I've tried:
rename("v1","v2")
rename(c("v1","v2")
rename(v1=1, v2=2)
rename(v1=[,1],v2=[,v2])
rename(v1="[,1]",v2="[,v2]")
rename_(.dots = c("v1","v2"))
setNames(c("v1","v2"))
none of these works.
I know the base R way to do it (colnames(obj) <- c("v1","v2")) but I'm specifically looking for a dplyr way to do it.
This one with magrittr:
library(dplyr)
bind_rows(v1,v2) %>%
t %>%
magrittr::set_colnames(c("new1", "new2"))
In order to use rename you need to have some sort of a list (like a data frame or a tibble). So you can do two things. You either convert to tibble and use rename or use colnames and leave the structure as is, i.e.
new_d <- bind_rows(v1,v2) %>%
t() %>%
as_tibble() %>%
rename('A' = 'V1', 'B' = 'V2')
#where
str(new_d)
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 11 obs. of 2 variables:
# $ A: int 3 1 8 11 7 7 3 6 3 1 ...
# $ B: int 4 9 6 7 8 1 4 3 6 NA ...
Or
new_d1 <- bind_rows(v1,v2) %>%
t() %>%
`colnames<-`(c('A', 'B'))
#where
str(new_d1)
# int [1:11, 1:2] 3 1 8 11 7 7 3 6 3 1 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:11] "0" "1" "2" "3" ...
# ..$ : chr [1:2] "A" "B"
When I create a column of counts using dplyr, it appears to be filled correctly, until I try to use the counts column on its own.
Example:
I create this dataframe:
V1 <- c("TEST", "test", "tEsT", "tesT", "TesTing", "testing","ME-TESTED", "re tested", "RE testing")
V2 <- c("othertest", "anothertest", "testing", "123", "random stuff", "irrelevant", "tested", "re-test", "tests")
V3 <- c("type1", "type2", "type1", "type2", "type3", "type2", "type2", "type2", "type1")
df <- data.frame(V1, V2, V3, stringsAsFactors = TRUE) # factors, as in the output below (R >= 4.0 defaults to FALSE)
Then, I use dplyr to create a column of counts:
df$counts <- df %>% group_by(V3) %>% mutate(count = n())
This gives the expected result:
> df
V1 V2 V3 counts.V1 counts.V2 counts.V3 counts.count
1 TEST othertest type1 TEST othertest type1 3
2 test anothertest type2 test anothertest type2 5
3 tEsT testing type1 tEsT testing type1 3
4 tesT 123 type2 tesT 123 type2 5
5 TesTing random stuff type3 TesTing random stuff type3 1
6 testing irrelevant type2 testing irrelevant type2 5
7 ME-TESTED tested type2 ME-TESTED tested type2 5
8 re tested re-test type2 re tested re-test type2 5
9 RE testing tests type1 RE testing tests type1 3
But, when I try to use the counts.count column in any way, the result is null:
> df$counts.count
NULL
Same result for the other columns created by dplyr.
But the rest of the data frame seems normal:
> df$V1
[1] TEST test tEsT tesT TesTing testing ME-TESTED re tested RE testing
Levels: ME-TESTED re tested RE testing test tesT tEsT TEST testing TesTing
I am totally confused about why printing the whole df gives me a different output than printing just the column of interest. What am I missing here?
If you rewind and recreate the dataframe and then don't do an assignment but just print the result to the screen you see this:
df %>% group_by(V3) %>% mutate(count = n())
Source: local data frame [9 x 4]
Groups: V3 [3]
V1 V2 V3 count
<fctr> <fctr> <fctr> <int>
1 TEST othertest type1 3
2 test anothertest type2 5
3 tEsT testing type1 3
4 tesT 123 type2 5
5 TesTing random stuff type3 1
6 testing irrelevant type2 5
7 ME-TESTED tested type2 5
8 re tested re-test type2 5
9 RE testing tests type1 3
If you now do the assignment, the structure becomes rather confused, and I think you might have gotten a more informative error if there had been fewer unique values of V1 or V2:
df$counts <- df %>% group_by(V3) %>% mutate(count = n())
# snipped what you already showed
str(df)
#-----
'data.frame': 9 obs. of 4 variables:
$ V1 : Factor w/ 9 levels "ME-TESTED","re tested",..: 7 4 6 5 9 8 1 2 3
$ V2 : Factor w/ 9 levels "123","anothertest",..: 4 2 8 1 5 3 7 6 9
$ V3 : Factor w/ 3 levels "type1","type2",..: 1 2 1 2 3 2 2 2 1
$ counts:Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 9 obs. of 4 variables:
..$ V1 : Factor w/ 9 levels "ME-TESTED","re tested",..: 7 4 6 5 9 8 1 2 3
..$ V2 : Factor w/ 9 levels "123","anothertest",..: 4 2 8 1 5 3 7 6 9
..$ V3 : Factor w/ 3 levels "type1","type2",..: 1 2 1 2 3 2 2 2 1
..$ count: int 3 5 3 5 1 5 5 5 3
..- attr(*, "vars")=List of 1
.. ..$ : symbol V3
..- attr(*, "labels")='data.frame': 3 obs. of 1 variable:
.. ..$ V3: Factor w/ 3 levels "type1","type2",..: 1 2 3
.. ..- attr(*, "vars")=List of 1
.. .. ..$ : symbol V3
.. ..- attr(*, "drop")= logi TRUE
..- attr(*, "indices")=List of 3
.. ..$ : int 0 2 8
.. ..$ : int 1 3 5 6 7
.. ..$ : int 4
..- attr(*, "drop")= logi TRUE
..- attr(*, "group_sizes")= int 3 5 1
..- attr(*, "biggest_group_size")= int 5
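The format you are seeing is how R prints a data frame (or matrix) that is stored as a single column of another data frame: the assignment put the whole grouped result, itself a data frame of four columns, into df$counts. The printed header counts.count is therefore not a real column name, which is why df$counts.count is NULL; the values live at df$counts$count.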
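If the goal is simply a plain count column, one way out is to assign either the whole result back to df or a single vector to the new column. Here is a minimal base R sketch using ave() (a swapped-in technique, not what the question used), with a trimmed-down copy of the example data; the dplyr line in the comment is the equivalent under the assumption that dplyr is attached.

```r
df <- data.frame(
  V1 = c("TEST", "test", "tEsT", "tesT", "TesTing", "testing",
         "ME-TESTED", "re tested", "RE testing"),
  V3 = c("type1", "type2", "type1", "type2", "type3", "type2",
         "type2", "type2", "type1")
)

# ave() returns one value per row, computed within each V3 group.
df$count <- ave(seq_along(df$V3), df$V3, FUN = length)
df$count

# dplyr equivalent: assign the whole result, not a single column.
# df <- df %>% group_by(V3) %>% mutate(count = n()) %>% ungroup()
```

Either way df$count is then an ordinary vector, so it can be used on its own.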
I want to convert variables into factors using apply():
a <- data.frame(x1 = rnorm(100),
x2 = sample(c("a","b"), 100, replace = T),
x3 = factor(c(rep("a",50) , rep("b",50))))
a2 <- apply(a, 2,as.factor)
apply(a2, 2,class)
results in:
x1 x2 x3
"character" "character" "character"
I don't understand why this results in character vectors instead of factor vectors.
apply converts your data.frame to a character matrix. Use lapply:
lapply(a, class)
# $x1
# [1] "numeric"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
In the second command, apply also converts the result to a character matrix. Using lapply instead:
a2 <- lapply(a, as.factor)
lapply(a2, class)
# $x1
# [1] "factor"
# $x2
# [1] "factor"
# $x3
# [1] "factor"
But for a quick look at the column types you could use str:
str(a)
# 'data.frame': 100 obs. of 3 variables:
# $ x1: num -1.79 -1.091 1.307 1.142 -0.972 ...
# $ x2: Factor w/ 2 levels "a","b": 2 1 1 1 2 1 1 1 1 2 ...
# $ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
Additional explanation according to comments:
Why does the lapply work while apply doesn't?
The first thing that apply does is to convert an argument to a matrix. So apply(a) is equivalent to apply(as.matrix(a)). As you can see str(as.matrix(a)) gives you:
chr [1:100, 1:3] " 0.075124364" "-1.608618269" "-1.487629526" ...
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:3] "x1" "x2" "x3"
There are no more factors, so class returns "character" for all columns.
lapply works on columns, so it gives you what you want (it does something like class(a$column_name) for each column).
The help page for apply explains why apply with as.factor doesn't work:
In all cases the result is coerced by
as.vector to one of the basic vector
types before the dimensions are set,
so that (for example) factor results
will be coerced to a character array.
The help page for sapply explains why sapply with as.factor doesn't work either:
Value (...) An atomic vector or matrix
or list of the same length as X (...)
If simplification occurs, the output
type is determined from the highest
type of the return values in the
hierarchy NULL < raw < logical <
integer < real < complex < character <
list < expression, after coercion of
pairlists to lists.
So you never get a matrix of factors or a data.frame.
How to convert output to data.frame?
Simple: use as.data.frame, as you wrote in your comment:
a2 <- as.data.frame(lapply(a, as.factor))
str(a2)
'data.frame': 100 obs. of 3 variables:
$ x1: Factor w/ 100 levels "-2.49629293159922",..: 60 6 7 63 45 93 56 98 40 61 ...
$ x2: Factor w/ 2 levels "a","b": 1 1 2 2 2 2 2 1 2 2 ...
$ x3: Factor w/ 2 levels "a","b": 1 1 1 1 1 1 1 1 1 1 ...
But if you want to replace selected character columns with factor there is a trick:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: chr "a" "b" "c" "d" ...
$ x2: chr "A" "B" "C" "D" ...
$ x3: chr "A" "B" "C" "D" ...
columns_to_change <- c("x1","x2")
a3[, columns_to_change] <- lapply(a3[, columns_to_change], as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: chr "A" "B" "C" "D" ...
You could use it to replace all columns using:
a3 <- data.frame(x1=letters, x2=LETTERS, x3=LETTERS, stringsAsFactors=FALSE)
a3[, ] <- lapply(a3, as.factor)
str(a3)
'data.frame': 26 obs. of 3 variables:
$ x1: Factor w/ 26 levels "a","b","c","d",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x2: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
$ x3: Factor w/ 26 levels "A","B","C","D",..: 1 2 3 4 5 6 7 8 9 10 ...
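If you would rather not list the columns by hand, the same replacement trick can be driven by a logical index. This is a minimal sketch, assuming you want every character column converted and everything else left alone; the integer column x2 is a hypothetical addition to show that non-character columns are untouched.

```r
a3 <- data.frame(x1 = letters, x2 = 1:26, x3 = LETTERS,
                 stringsAsFactors = FALSE)

# TRUE for each column that is currently character.
is_chr <- vapply(a3, is.character, logical(1))

# Replace only those columns; the data.frame structure is kept.
a3[is_chr] <- lapply(a3[is_chr], as.factor)
str(a3)
```

vapply is used instead of sapply so that the result is guaranteed to be a logical vector.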
Here's my dataframe df
I'm trying:
df=data.frame(rbind(c(1,"*","*"),c("*",3,"*")))
df2=as.data.frame(sapply(df,sub,pattern="*",replacement="NA"))
It doesn't work because of the asterisk but I'm getting mad trying to replace it.
If your data.frame only contains * on its own (that is, not inside values like ab*de), then you can do this without regex:
df[df == "*"] <- NA
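For background on why the sub() call in the question gets into trouble: in a regular expression, * is a quantifier rather than a literal character, so it has to be escaped or matched with fixed = TRUE. A minimal sketch on a small hypothetical vector x, followed by the exact-match replacement on the question's data (built with stringsAsFactors = FALSE here for clarity):

```r
x <- c("1", "*", "ab*de")

# Two ways to treat * literally inside sub().
sub("*", "NA", x, fixed = TRUE)
sub("\\*", "NA", x)

# Exact matching needs no regex at all and inserts a real NA, not the text "NA".
df <- data.frame(rbind(c(1, "*", "*"), c("*", 3, "*")),
                 stringsAsFactors = FALSE)
df[df == "*"] <- NA
df
```

Note that the sub() variants also change values such as "ab*de", whereas the == comparison only touches cells that are exactly "*".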
Both solutions here address an object already in your workspace. If possible (or at least in the future), you can make use of the na.strings argument in read.table. Notice that it is plural "strings", so you can specify more than one string to treat as NA values.
Here's an example: This just writes a file named "readmein.txt" to your current working directory and verifies that it is there.
cat("V1 V2 V3 V4 V5 V6 V7\n
2 * * * * * 2\n
1 2 * * * * 1\n", file = "readmein.txt")
list.files(pattern = "readme")
# [1] "readmein.txt"
Here's read.table with the na.strings argument in action.
read.table("readmein.txt", na.strings="*", header = TRUE)
# V1 V2 V3 V4 V5 V6 V7
# 1 2 NA NA NA NA NA 2
# 2 1 2 NA NA NA NA 1
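To illustrate the "plural" point without touching the file system, here is a minimal sketch that reads from an in-memory string via the text argument and passes two NA markers; the small table in txt is hypothetical.

```r
txt <- "V1 V2 V3
2 * .
1 2 3"

# Both "*" and "." are read as NA, so the columns can stay numeric.
dat <- read.table(text = txt, header = TRUE, na.strings = c("*", "."))
str(dat)
```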
Update: Objects already in your workspace
I see another problem with the other two answers: They both result in character (or rather factor) variables, even when the column should have possibly been numeric.
Here's an example. First, we create an example dataset. For fun, I've added another character to be treated as NA: ".".
temp <- data.frame(
V1 = c(1:3),
V2 = c(1, "*", 3),
V3 = c("a", "*", "c"),
V4 = c(".", "*", "3"),
stringsAsFactors = TRUE) # needed on R >= 4.0 to get the factors shown below
temp
# V1 V2 V3 V4
# 1 1 1 a .
# 2 2 * * *
# 3 3 3 c 3
str(temp)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: int 1 2 3
# $ V2: Factor w/ 3 levels "*","1","3": 2 1 3
# $ V3: Factor w/ 3 levels "*","a","c": 2 1 3
# $ V4: Factor w/ 3 levels ".","*","3": 1 2 3
Let's make a copy, and then solve this in what I would consider the most obvious "R" way:
temp1 <- temp
temp1[temp1 == "*"|temp1 == "."] <- NA
Looks OK...
temp1
# V1 V2 V3 V4
# 1 1 1 a <NA>
# 2 2 <NA> <NA> <NA>
# 3 3 3 c 3
... but I presume that V2 and V4 should have been numeric....
str(temp1)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: int 1 2 3
# $ V2: Factor w/ 3 levels "*","1","3": 2 NA 3
# $ V3: Factor w/ 3 levels "*","a","c": 2 NA 3
# $ V4: Factor w/ 3 levels ".","*","3": 1 NA 3
Here's a workaround:
temp2 <- read.table(text = capture.output(temp), na.strings = c("*", "."))
temp2
# V1 V2 V3 V4
# 1 1 1 a NA
# 2 2 NA <NA> NA
# 3 3 3 c 3
str(temp2)
# 'data.frame': 3 obs. of 4 variables:
# $ V1: int 1 2 3
# $ V2: int 1 NA 3
# $ V3: Factor w/ 2 levels "a","c": 1 NA 2
# $ V4: int NA NA 3
Update 2: (Yet another) alternative
It might be more appropriate to make use of type.convert which is described as a "helper function for read.table" on its help page. I haven't timed it, but my guess is that it would be faster than the workaround I mentioned above, with all the benefits.
data.frame(
lapply(temp, function(x) type.convert(
as.character(x), na.strings = c("*", "."))))
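For reference, a self-contained variant of the same idea that can be pasted into a fresh session. It is a sketch under two assumptions that differ from the snippet above: the example data is built with character columns, and as.is = TRUE is passed so that text columns come back as character rather than factor.

```r
temp <- data.frame(
  V1 = 1:3,
  V2 = c(1, "*", 3),
  V3 = c("a", "*", "c"),
  V4 = c(".", "*", "3"),
  stringsAsFactors = FALSE)

# Re-guess every column's type, treating "*" and "." as missing.
fixed <- data.frame(
  lapply(temp, function(x) type.convert(
    as.character(x), na.strings = c("*", "."), as.is = TRUE)),
  stringsAsFactors = FALSE)
str(fixed)
```

Drop as.is = TRUE if you prefer the non-numeric columns as factors.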
You should put up a full reproducible example, people will be more inclined to help when you make it easy for em. Anywho...
dat <- data.frame(a=c(1,2,'*',3,4), b=c('*',2,3,4,'*'))
> dat
a b
1 1 *
2 2 2
3 * 3
4 3 4
5 4 *
> as.data.frame(sapply(dat,sub,pattern='\\*',replacement=NA))
a b
1 1 <NA>
2 2 2
3 <NA> 3
4 3 4
5 4 <NA>
This could work (it's pretty flexible), but there are other great solutions already. Arun's solution is my typical approach, but I created replacer for new R users (those with little command-line experience). I wouldn't recommend replacer for anyone with even a bit of experience.
library(qdap)
replacer(dat, "*", NA)