I used the RNCEP package to get reanalysis data for temperature. My data looks something like this:
(DD1 <- array(1:12, dim = c(2, 3, 2),
dimnames = list(c("A", "B"),
c("a", "b", "c"),
c("First", "Second"))))
# , , First
#
# a b c
# A 1 3 5
# B 2 4 6
#
# , , Second
#
# a b c
# A 7 9 11
# B 8 10 12
str(DD1)
# int [1:2, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, "dimnames")=List of 3
# ..$ : chr [1:2] "A" "B"
# ..$ : chr [1:3] "a" "b" "c"
# ..$ : chr [1:2] "First" "Second"
I think this is tabular data?
I need to write the data to a CSV file that looks something like this:
y a a b b c c
x A B A B A B
1 2 3 4 5 6
7 8 9 10 11 12
But when I used write.csv I got this:
write.csv(DD1)
# "","a.First","b.First","c.First","a.Second","b.Second","c.Second"
# "A",1,3,5,7,9,11
# "B",2,4,6,8,10,12
I thought I had to transpose the data first. So I used this:
DD2 <- as.data.frame.table(DD1)
I also used t() but that also did not work.
The transpose function in R is t(), so hopefully this will work on the data frame you are trying to transpose.
DD3= t(DD2)
You were on the right track with as.data.frame.table(DD1). That would give you a "long" dataset, that can then be converted to a "wide" form that you can use write.csv on.
Note, however, that R only allows one row of headers, so you will have to combine what you show as "x" and "y" into a single header row.
Here's the approach I would suggest:
library(data.table)
(DD2 <- dcast(data.table(as.data.frame.table(DD1)),
Var3 ~ Var1 + Var2, value.var = "Freq"))
# Var3 A_a A_b A_c B_a B_b B_c
# 1: First 1 3 5 2 4 6
# 2: Second 7 9 11 8 10 12
You can then easily use write.csv on the "DD2" object.
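If you would rather stay in base R, the same wide layout can be sketched without data.table. This is only a minimal sketch, assuming the array from the question; the object name `wide` and the "_" separator in the combined header are my own choices.

```r
# Same array as in the question
DD1 <- array(1:12, dim = c(2, 3, 2),
             dimnames = list(c("A", "B"),
                             c("a", "b", "c"),
                             c("First", "Second")))

# Flatten each slice of the third dimension into one row
# (values within a slice are taken in column-major order)
wide <- t(apply(DD1, 3, as.vector))

# Build a single header row by pasting the first two sets of dimnames together,
# in the same column-major order as the values
colnames(wide) <- as.vector(outer(dimnames(DD1)[[1]], dimnames(DD1)[[2]],
                                  paste, sep = "_"))

# Prints the CSV to the console; pass file = "..." to write it to disk
write.csv(wide)
```

The columns come out in the order A_a, B_a, A_b, B_b, A_c, B_c, which is the order shown in the question's desired layout.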
Suppose I have the following dataframe with data (v) and a lookup dataframe (l):
v <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-05'), as.Date('2019-01-30'), as.Date('2019-02-02')), kind=c('a', 'b', 'c', 'a'), v1=c(1,2,3,4))
v
d kind v1
1 2019-01-01 a 1
2 2019-01-05 b 2
3 2019-01-30 c 3
4 2019-02-02 a 4
l <- data.frame(d = c(as.Date('2019-01-01'), as.Date('2019-01-04'), as.Date('2019-02-01')), kind=c('a','b','a'), l1=c(10,20,30))
l
d kind l1
1 2019-01-01 a 10
2 2019-01-04 b 20
3 2019-02-01 a 30
I would like to find the closest row in the l dataframe corresponding to each row in v using the columns: c("d", "kind"). Column kind needs to match exactly and maybe use findInterval(...) on d?
I would like my result to be:
d kind v1 l1
1 2019-01-01 a 1 10
2 2019-01-05 b 2 20
3 2019-01-30 c 3 NA
4 2019-02-02 a 4 30
NOTE: I would prefer a base-R implementation, but it would be interesting to see others.
I tried findInterval(...) but I don't know how to get it to work with multiple columns.
Here's a shot in base-R only. (I do believe that data.table will do this much more elegantly, but I appreciate your aversion to bringing in other packages.)
Split each frame into a list of frames, by kind:
v_spl <- split(v, v$kind)
l_spl <- split(l, l$kind)
str(v_spl)
# List of 3
# $ a:'data.frame': 2 obs. of 3 variables:
# ..$ d : Date[1:2], format: "2019-01-01" "2019-02-02"
# ..$ kind: Factor w/ 3 levels "a","b","c": 1 1
# ..$ v1 : num [1:2] 1 4
# $ b:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-05"
# ..$ kind: Factor w/ 3 levels "a","b","c": 2
# ..$ v1 : num 2
# $ c:'data.frame': 1 obs. of 3 variables:
# ..$ d : Date[1:1], format: "2019-01-30"
# ..$ kind: Factor w/ 3 levels "a","b","c": 3
# ..$ v1 : num 3
Now we determine the unique kind we have in common between the two, no need to try to join everything:
### this has the 'kind' in common
(nms <- intersect(names(v_spl), names(l_spl)))
# [1] "a" "b"
### this has the 'kind' we have to bring back in later
(miss_nms <- setdiff(names(v_spl), nms))
# [1] "c"
For the in-common kind, do an interval join:
joined <- Map(
v_spl[nms], l_spl[nms],
f = function(v0, l0) {
ind <- findInterval(v0$d, l0$d)
ind[ ind < 1 ] <- NA
v0$l1 <- l0$l1[ind]
v0
})
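To see what findInterval() is doing inside that function, here is a small standalone sketch. The dates and values are made up for illustration. Note that the second argument must be sorted in increasing order, and that an index of 0 means "earlier than every lookup date", which is why it is turned into NA before indexing.

```r
# Hypothetical lookup table: sorted dates and the value attached to each
lookup_dates <- as.Date(c("2019-01-01", "2019-02-01"))
lookup_vals  <- c(10, 30)

# Dates we want to look up
query_dates <- as.Date(c("2018-12-31", "2019-01-01", "2019-01-15", "2019-02-02"))

# For each query date, the index of the last lookup date that is <= it
idx <- findInterval(query_dates, lookup_dates)
idx
# [1] 0 1 1 2

# 0 means "before the first lookup date"; indexing with 0 would silently
# drop the element, so replace it with NA first
idx[idx < 1] <- NA
lookup_vals[idx]
# [1] NA 10 10 30
```

So this is a "most recent value at or before" lookup, not a nearest-in-either-direction match.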
Ultimately we will rbind things back together, but those in miss_nms will not have the new column(s). This is a generic way to capture exactly one row of the new columns with an appropriate NA value:
emptycols <- joined[[1]][, setdiff(colnames(joined[[1]]), colnames(v)),drop=FALSE][1,,drop=FALSE][NA,,drop=FALSE]
emptycols
# l1
# NA NA
And add that column(s) to the not-yet-found frames:
unjoined <- lapply(v_spl[miss_nms], cbind, emptycols)
unjoined
# $c
# d kind v1 l1
# 3 2019-01-30 c 3 NA
And finally bring everything back into a single frame:
do.call(rbind, c(joined, unjoined))
# d kind v1 l1
# a.1 2019-01-01 a 1 10
# a.4 2019-02-02 a 4 30
# b 2019-01-05 b 2 20
# c 2019-01-30 c 3 NA
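For small data, a row-by-row version may be easier to read, and it keeps the original row order. This is a rough sketch under the same assumption as above (take the most recent l row of the same kind dated on or before the v row); it will be slow on large frames because it subsets l once per row.

```r
v <- data.frame(d = as.Date(c("2019-01-01", "2019-01-05", "2019-01-30", "2019-02-02")),
                kind = c("a", "b", "c", "a"), v1 = c(1, 2, 3, 4))
l <- data.frame(d = as.Date(c("2019-01-01", "2019-01-04", "2019-02-01")),
                kind = c("a", "b", "a"), l1 = c(10, 20, 30))

v$l1 <- vapply(seq_len(nrow(v)), function(i) {
  # candidate lookup rows: same kind, dated on or before this row of v
  cand <- l[as.character(l$kind) == as.character(v$kind[i]) & l$d <= v$d[i], ]
  if (nrow(cand) == 0) return(NA_real_)
  # keep the value from the latest candidate date
  cand$l1[which.max(as.numeric(cand$d))]
}, numeric(1))

v
#            d kind v1 l1
# 1 2019-01-01    a  1 10
# 2 2019-01-05    b  2 20
# 3 2019-01-30    c  3 NA
# 4 2019-02-02    a  4 30
```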
If you want an exact match you would go:
vl <- merge(v, l, by = c("d","kind"))
For your purposes, you can transform d into additional variables for year, month or day and merge on those.
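One caveat with a plain merge() is that rows of v without an exact match are dropped. A minimal sketch, using the data from the question, of keeping them with all.x = TRUE:

```r
v <- data.frame(d = as.Date(c("2019-01-01", "2019-01-05", "2019-01-30", "2019-02-02")),
                kind = c("a", "b", "c", "a"), v1 = c(1, 2, 3, 4))
l <- data.frame(d = as.Date(c("2019-01-01", "2019-01-04", "2019-02-01")),
                kind = c("a", "b", "a"), l1 = c(10, 20, 30))

# all.x = TRUE keeps every row of v; unmatched rows get NA in l1
vl <- merge(v, l, by = c("d", "kind"), all.x = TRUE)
vl
```

With this data only the 2019-01-01 / "a" row matches exactly, so the other three rows get NA, which shows why an exact merge alone does not answer the "closest row" question.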
I have a sample data frame like below:
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
I want to know how I can select multiple columns and convert them to factors together. I usually do it one column at a time, like data$A = as.factor(data$A). But when the data frame is very large and contains lots of columns, this is very time consuming. Does anyone know of a better way to do it?
Choose some columns to coerce to factors:
cols <- c("A", "C", "D", "H")
Use lapply() to coerce and replace the chosen columns:
data[cols] <- lapply(data[cols], factor) ## as.factor() could also be used
Check the result:
sapply(data, class)
# A B C D E F G
# "factor" "integer" "factor" "factor" "integer" "integer" "integer"
# H I J
# "factor" "integer" "integer"
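One thing to keep in mind after converting numeric columns: as.integer() on a factor returns the internal level codes, not the original numbers. A small sketch of the round trip (set.seed(1) and the names original_A, codes and restored are only there for illustration):

```r
set.seed(1)
data <- data.frame(matrix(sample(1:40), 4, 10,
                          dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")

original_A <- data$A                      # keep a copy to compare against
data[cols] <- lapply(data[cols], factor)  # same conversion as above

codes    <- as.integer(data$A)                 # level codes (1..4), not the values
restored <- as.integer(as.character(data$A))   # goes through the labels instead

identical(restored, original_A)
# [1] TRUE
```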
Here is an option using dplyr. The %<>% operator from magrittr updates the lhs object with the resulting value.
library(magrittr)
library(dplyr)
cols <- c("A", "C", "D", "H")
data %<>%
mutate_each_(funs(factor(.)),cols)
str(data)
#'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "23","24","26",..: 1 2 3 4
# $ B: int 15 13 39 16
# $ C: Factor w/ 4 levels "3","5","18","37": 2 1 3 4
# $ D: Factor w/ 4 levels "2","6","28","38": 3 1 4 2
# $ E: int 14 4 22 20
# $ F: int 7 19 36 27
# $ G: int 35 40 21 10
# $ H: Factor w/ 4 levels "11","29","32",..: 1 4 3 2
# $ I: int 17 1 9 25
# $ J: int 12 30 8 33
Or if we are using data.table, either use a for loop with set
setDT(data)
for(j in cols){
set(data, i=NULL, j=j, value=factor(data[[j]]))
}
Or we can specify the 'cols' in .SDcols and assign (:=) the rhs to 'cols'
setDT(data)[, (cols):= lapply(.SD, factor), .SDcols=cols]
The more recent tidyverse way is to use the mutate_at function:
library(tidyverse)
library(magrittr)
set.seed(88)
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
cols <- c("A", "C", "D", "H")
data %<>% mutate_at(cols, factor)
str(data)
# $ A: Factor w/ 4 levels "5","17","18",..: 2 1 4 3
# $ B: int 36 35 2 26
# $ C: Factor w/ 4 levels "22","31","32",..: 1 2 4 3
# $ D: Factor w/ 4 levels "1","9","16","39": 3 4 1 2
# $ E: int 3 14 30 38
# $ F: int 27 15 28 37
# $ G: int 19 11 6 21
# $ H: Factor w/ 4 levels "7","12","20",..: 1 3 4 2
# $ I: int 23 24 13 8
# $ J: int 10 25 4 33
As of 2021 (still current in early 2023), the current tidyverse/dplyr approach would be to use across, and a <tidy-select> statement.
library(dplyr)
data %>% mutate(across(<tidy-select>, <function>))
across(<tidy-select>) allows very consistent and easy selection of columns to transform.
Some examples:
data %>% mutate(across(c(A, B, C, E), as.factor)) # select columns A to C, and E (by name)
data %>% mutate(across(where(is.character), as.factor)) # select character columns
data %>% mutate(across(1:5, as.factor)) # select first 5 columns (by index)
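For comparison, selecting columns by a predicate also works in base R. This is a minimal sketch with a made-up data frame (df, id, grp and lab are hypothetical names), roughly mirroring the where(is.character) example:

```r
df <- data.frame(id  = 1:3,
                 grp = c("x", "y", "x"),
                 lab = c("p", "q", "r"),
                 stringsAsFactors = FALSE)

# Logical vector: TRUE for each character column
is_chr <- vapply(df, is.character, logical(1))

# Convert only those columns, leaving the rest untouched
df[is_chr] <- lapply(df[is_chr], factor)

sapply(df, class)
#        id       grp       lab
# "integer"  "factor"  "factor"
```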
You can use mutate_if (dplyr).
For example, to coerce integer columns to factors:
mydata=structure(list(a = 1:10, b = 1:10, c = c("a", "a", "b", "b",
"c", "c", "c", "c", "c", "c")), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
# A tibble: 10 x 3
a b c
<int> <int> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
Use the function:
library(dplyr)
mydata%>%
mutate_if(is.integer,as.factor)
# A tibble: 10 x 3
a b c
<fct> <fct> <chr>
1 1 1 a
2 2 2 a
3 3 3 b
4 4 4 b
5 5 5 c
6 6 6 c
7 7 7 c
8 8 8 c
9 9 9 c
10 10 10 c
And, for completeness, since this question is also about converting string columns only, there's mutate_if:
data <- cbind(stringVar = sample(c("foo","bar"),10,replace=TRUE),
data.frame(matrix(sample(1:40), 10, 10, dimnames = list(1:10, LETTERS[1:10]))),stringsAsFactors=FALSE)
factoredData = data %>% mutate_if(is.character,funs(factor(.)))
Here is a data.table example. I used grep in this example because that's how I often select many columns by using partial matches to their names.
library(data.table)
data <- data.table(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
factorCols <- grep(pattern = "A|C|D|H", x = names(data), value = TRUE)
data[, (factorCols) := lapply(.SD, as.factor), .SDcols = factorCols]
A simple and updated solution
data <- data %>%
mutate_at(cols, list(~factor(.)))
If you instead want to pick the columns from the table itself (here, every character column) and then convert them, you can try the following with data.table:
### pre-processing: names of the character columns
ind <- names(which(sapply(bigm.train, is.character)))
### Convert multiple columns to factor
bigm.train[,(ind):=lapply(.SD,factor),.SDcols=ind]
This selects only the character columns and converts them to factors.
Here is another tidyverse approach using the modify_at() function from the purrr package.
library(purrr)
# Data frame with only integer columns
data <- data.frame(matrix(sample(1:40), 4, 10, dimnames = list(1:4, LETTERS[1:10])))
# Modify specified columns to a factor class
data_with_factors <- data %>%
purrr::modify_at(c("A", "C", "E"), factor)
# Check the results:
str(data_with_factors)
# 'data.frame': 4 obs. of 10 variables:
# $ A: Factor w/ 4 levels "8","12","33",..: 1 3 4 2
# $ B: int 25 32 2 19
# $ C: Factor w/ 4 levels "5","15","35",..: 1 3 4 2
# $ D: int 11 7 27 6
# $ E: Factor w/ 4 levels "1","4","16","20": 2 3 1 4
# $ F: int 21 23 39 18
# $ G: int 31 14 38 26
# $ H: int 17 24 34 10
# $ I: int 13 28 30 29
# $ J: int 3 22 37 9
It appears that using sapply() on a data.frame to convert variables to factors all at once does not work, because it produces a matrix/array. My approach is to use lapply() instead, as follows.
## let us create a data.frame here
class <- c("7", "6", "5", "3")
cash <- c(100, 200, 300, 150)
height <- c(170, 180, 150, 165)
people <- data.frame(class, cash, height)
class(people) ## This is a dataframe
## We now apply lapply to the data.frame as follows.
bb <- data.frame(lapply(people, as.factor)) ## no pipe needed, so no extra package has to be loaded
## The lapply part returns a list which we coerce back to a data.frame
class(bb) ## A data.frame
##Now let us check the classes of the variables
class(bb$class)
class(bb$height)
class(bb$cash) ## as expected, are all factors.
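The difference between the two is easy to check side by side. A small sketch, rebuilding the same people data frame (the names as_matrix and bb2 are mine):

```r
people <- data.frame(class  = c("7", "6", "5", "3"),
                     cash   = c(100, 200, 300, 150),
                     height = c(170, 180, 150, 165))

# sapply() simplifies its result, so we get a matrix back, not a data frame
as_matrix <- sapply(people, as.factor)
is.matrix(as_matrix)
is.data.frame(as_matrix)

# lapply() always returns a list, which data.frame() turns back into columns
bb2 <- data.frame(lapply(people, as.factor))
vapply(bb2, is.factor, logical(1))
```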
I am trying to transition some plyr code to dplyr, and getting stuck with the new functionality of rename() in dplyr. I'd like to be able to reuse a single rename() expression for a set of datasets with overlapping but not identical original names. For example,
sample1 <- data.frame(A=1:10, B=letters[1:10])
sample2 <- data.frame(B=11:20, C=letters[11:20])
And then,
rename(sample1, var1 = A, var2 = B, var3 = C)
I would like the result to be that variable A is renamed var1, and B is renamed var2, not adding a var3 in this case. Instead, I get
Error: Unknown variables: C.
In contrast, the plyr syntax would let me use
rename(sample1, c("A" = "var1", "B" = "var2", "C" = "var3"))
rename(sample2, c("A" = "var1", "B" = "var2", "C" = "var3"))
and not throw an error. Is there a way to get the same result in dplyr without getting the Unknown variables error?
Completely ignoring your actual request on how to do this with dplyr, I would like to suggest a different approach using a lookup table:
sample1 <- data.frame(A=1:10, B=letters[1:10])
sample2 <- data.frame(B=11:20, C=letters[11:20])
rename_map <- c("A"="var1",
"B"="var2",
"C"="var3")
names(sample1) <- rename_map[names(sample1)]
str(sample1)
names(sample2) <- rename_map[names(sample2)]
str(sample2)
Fundamentally the algorithm is simple:
1. Build a lookup table of current variable names to desired names.
2. Using the names() function, look up each current column name in the map and assign the mapped names back to the columns.
EDIT: As per Hadley's suggestion, I used a named vector instead of a list, makes life much easier. I always forget about named vectors :(
#no need to use rename
oldnames<-unique(c(names(sample1),names(sample2)))
newnames<-c("var1","var2","var3")
name_df<-data.frame(oldnames,newnames)
mydata<-list(sample1,sample2) # combined two datasets as a list
#one liner
finaldata <- lapply(mydata, function(i) {colnames(i)<-name_df[name_df[,1] %in% colnames(i),2]
return(i)})
> finaldata
[[1]]
var1 var2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
[[2]]
var2 var3
1 11 k
2 12 l
3 13 m
4 14 n
5 15 o
6 16 p
7 17 q
8 18 r
9 19 s
10 20 t
With dplyr, we can use a named vector with old names as values and new names as names, then unquote only the values in name_vec that matches names in your dataset. rename supports unquoting characters, so there is no need to convert them to sym beforehand:
library(dplyr)
name_vec <- c(var1 = "A", var2 = "B", var3 = "C")
sample1 %>%
rename(!!name_vec[name_vec %in% names(.)])
sample2 %>%
rename(!!name_vec[name_vec %in% names(.)])
Also, with setNames:
name_vec <- c(A = "var1", B = "var2", C = "var3")
sample1 %>%
setNames(name_vec[names(.)])
sample2 %>%
setNames(name_vec[names(.)])
Output:
var1 var2
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 7 g
8 8 h
9 9 i
10 10 j
var2 var3
1 11 k
2 12 l
3 13 m
4 14 n
5 15 o
6 16 p
7 17 q
8 18 r
9 19 s
10 20 t
I've used @earino's answer before myself, but discovered that it can be unsafe. If column names of the data frame are missing from the names of the named vector, those column names are silently replaced with NA, which is certainly not what you want.
d1 <- data.frame(A = 1:10, B = letters[1:10], stringsAsFactors = FALSE)
rename_vec <- c("B" = "var2", "C" = "var3")
names(d1) <- rename_vec[names(d1)]
str(d1)
#> 'data.frame': 10 obs. of 2 variables:
#> $ NA : int 1 2 3 4 5 6 7 8 9 10
#> $ var2: chr "a" "b" "c" "d" ...
The same can happen if you run names(d1) <- rename_vec[names(d1)] twice by accident, because the second time none of the colnames(d1) are in names(rename_vec).
names(d1) <- rename_vec[names(d1)]
str(d1)
#> 'data.frame': 10 obs. of 2 variables:
#> $ NA: int 1 2 3 4 5 6 7 8 9 10
#> $ NA: chr "a" "b" "c" "d" ...
We just need to select those columns that are in the data frame and in the rename vector.
d2 <- data.frame(B1 = 1:10, B = letters[1:10], stringsAsFactors = FALSE)
sel <- is.element(colnames(d2), names(rename_vec))
names(d2)[sel] <- rename_vec[names(d2)][sel]
str(d2)
#> 'data.frame': 10 obs. of 2 variables:
#> $ B1 : int 1 2 3 4 5 6 7 8 9 10
#> $ var2: chr "a" "b" "c" "d" ...
UPDATE: I initially had a solution here that involved string replacement, which turned out to be unsafe as well, because it allowed for partial matching. This one is better, I think.
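If you need this for several data frames, the selection step can be wrapped in a small helper. This is just a sketch; rename_known() is a hypothetical name, not a function from any package.

```r
# Rename only the columns that appear in names(map); leave the rest alone
rename_known <- function(df, map) {
  hit <- names(df) %in% names(map)
  names(df)[hit] <- unname(map[names(df)[hit]])
  df
}

rename_vec <- c("A" = "var1", "B" = "var2", "C" = "var3")
sample1 <- data.frame(A = 1:10, B = letters[1:10])
sample2 <- data.frame(B = 11:20, C = letters[11:20])

names(rename_known(sample1, rename_vec))
# [1] "var1" "var2"
names(rename_known(sample2, rename_vec))
# [1] "var2" "var3"

# Running it twice is harmless, because "var1" etc. are not in names(rename_vec)
names(rename_known(rename_known(sample1, rename_vec), rename_vec))
# [1] "var1" "var2"
```

It also works on a list of data frames, e.g. lapply(list(sample1, sample2), rename_known, map = rename_vec).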
I have the following data frame:
Group.1 V2
1 27562761 GO:0003676
2 27562765 c("GO:0004345", "GO:0050661", "GO:0006006", "GO:0055114")
3 27562775 GO:0016020
4 27562776 c("GO:0005525", "GO:0007264", "GO:0005622")
where the second column is a list. I tried to write the data frame into a text file using write.table, but it did not work. My desired output is the following one (file.txt):
27562761 GO:0003676
27562765 GO:0004345, GO:0050661, GO:0006006, GO:0055114
27562775 GO:0016020
27562776 GO:0005525, GO:0007264, GO:0005622
How could I obtain that?
You could look into sink, or you could use write.csv after flattening "V2" to a character string.
Try the following examples:
## recreating some data that is similar to your example
df <- data.frame(a = c(1, 1, 2, 2, 3), b = letters[1:5])
x <- aggregate(list(V2 = df$b), list(df$a), c)
x
# Group.1 V2
# 1 1 1, 2
# 2 2 3, 4
# 3 3 5
## V2 is a list, as you describe in your question
str(x)
# 'data.frame': 3 obs. of 2 variables:
# $ Group.1: num 1 2 3
# $ V2 :List of 3
# ..$ 1: int 1 2
# ..$ 3: int 3 4
# ..$ 5: int 5
sink(file = "somefile.txt")
x
sink()
## now open up "somefile.txt" from your working directory
x$V2 <- sapply(x$V2, paste, collapse = ", ")
write.csv(x, file = "somefile.csv")
## now open up "somefile.csv" from your working directory
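To get closer to the plain layout shown in the question (no header, no row names, no quotes), write.table() can be used on the flattened column. A minimal sketch with two rows retyped from the question; it writes to a temporary file so it does not touch your working directory.

```r
# Two rows in the same shape as the question: V2 is a list column
x <- data.frame(Group.1 = c(27562761L, 27562765L))
x$V2 <- list("GO:0003676", c("GO:0004345", "GO:0050661"))

# Collapse each list element to a single comma-separated string
x$V2 <- vapply(x$V2, paste, character(1), collapse = ", ")

out <- tempfile(fileext = ".txt")
write.table(x, file = out, quote = FALSE, sep = " ",
            row.names = FALSE, col.names = FALSE)

readLines(out)
# [1] "27562761 GO:0003676"
# [2] "27562765 GO:0004345, GO:0050661"
```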
I'm quite new to R, and there might be an easy answer to this, but still:
I have a data frame of the form
df <- data.frame(c("a", "b", "c", "d", "e"), 1:5, 7:11, stringsAsFactors=FALSE)
names(df) <- c("en", "to", "tre")
My dataset is way bigger than this, lots more rows and columns. But the basic idea is the same: I want to sort the n highest numeric values, independent of which column they appear in, and return a list with the values in decreasing order and their corresponding string in column "en".
Like this:
e 11
d 10
c 9
b 8
a 7
e 5
and so on.
How could I go about accomplishing this?
You can use the reshape2 package to melt your data and then sort the value column, like this:
require(reshape2)
df <- data.frame(c("a", "b", "c", "d", "e"), 1:5, 7:11, stringsAsFactors=FALSE)
names(df) <- c("en", "to", "tre")
df2 <- melt(df, id = "en")
## 'data.frame': 10 obs. of 3 variables:
## $ en : chr "a" "b" "c" "d" ...
## $ variable: Factor w/ 2 levels "to","tre": 1 1 1 1 1 2 2 2 2 2
## $ value : int 1 2 3 4 5 7 8 9 10 11
df2[order(df2$value, decreasing = TRUE), c("en", "value")]
## en value
## 10 e 11
## 9 d 10
## 8 c 9
## 7 b 8
## 6 a 7
## 5 e 5
## 4 d 4
## 3 c 3
## 2 b 2
## 1 a 1
But I'm sure there are other ways to do that!
Less elegant but without extra packages (it will work with any number of columns):
col1<-rep(df[,1],ncol(df)-1)
col2<-c()
for(i in 2:ncol(df)) {
col2<-c(col2,df[,i])
}
newdf<-data.frame(en=col1,value=col2)
newdf<-newdf[order(as.numeric(newdf[,2]),decreasing=TRUE),]
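The loop can also be replaced by unlist(), and head() then gives the n highest values asked for in the question. A short sketch, assuming the first column holds the labels and all other columns are numeric (long, top and n are my names):

```r
df <- data.frame(en = c("a", "b", "c", "d", "e"), to = 1:5, tre = 7:11,
                 stringsAsFactors = FALSE)

# Stack all value columns into one vector and repeat the labels to match
long <- data.frame(en    = rep(df$en, times = ncol(df) - 1),
                   value = unlist(df[-1], use.names = FALSE))

# Sort by value, largest first, and keep the top n rows
n <- 6
top <- head(long[order(long$value, decreasing = TRUE), ], n)
top
#    en value
# 10  e    11
# 9   d    10
# 8   c     9
# 7   b     8
# 6   a     7
# 5   e     5
```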