I have the following data frame:
Group.1 V2
1 27562761 GO:0003676
2 27562765 c("GO:0004345", "GO:0050661", "GO:0006006", "GO:0055114")
3 27562775 GO:0016020
4 27562776 c("GO:0005525", "GO:0007264", "GO:0005622")
where the second column is a list. I tried to write the data frame into a text file using write.table, but it did not work. My desired output is the following one (file.txt):
27562761 GO:0003676
27562765 GO:0004345, GO:0050661, GO:0006006, GO:0055114
27562775 GO:0016020
27562776 GO:0005525, GO:0007264, GO:0005622
How could I obtain that?
You could look into sink, or you could use write.csv after flattening "V2" to a character string.
Try the following examples:
## recreating some data that is similar to your example
df <- data.frame(a = c(1, 1, 2, 2, 3), b = letters[1:5])
x <- aggregate(list(V2 = df$b), list(df$a), c)
x
# Group.1 V2
# 1 1 1, 2
# 2 2 3, 4
# 3 3 5
## V2 is a list, as you describe in your question
str(x)
# 'data.frame': 3 obs. of 2 variables:
# $ Group.1: num 1 2 3
# $ V2 :List of 3
# ..$ 1: int 1 2
# ..$ 3: int 3 4
# ..$ 5: int 5
sink(file = "somefile.txt")
x
sink()
## now open up "somefile.txt" from your working directory
x$V2 <- sapply(x$V2, paste, collapse = ", ")
write.csv(x, file = "somefile.csv")
## now open up "somefile.csv" from your working directory
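To get the exact layout asked for in the question (no header, no row names, space-separated), write.table can be used after the same flattening step. This is a minimal sketch, assuming a small stand-in data frame built from two rows of the question; `out_file` is a hypothetical temporary path.

```r
# Small stand-in for the data frame in the question (values copied from it)
x <- data.frame(Group.1 = c(27562761L, 27562765L))
x$V2 <- list("GO:0003676", c("GO:0004345", "GO:0050661"))

# Collapse every list element into a single comma-separated string
x$V2 <- vapply(x$V2, paste, character(1), collapse = ", ")

# out_file is a hypothetical path; a temporary file keeps the example tidy
out_file <- tempfile(fileext = ".txt")
write.table(x, file = out_file, quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Read the file back to inspect what was written
result_lines <- readLines(out_file)
print(result_lines)
```

vapply is used here instead of sapply only so that the result is guaranteed to be a character vector with one string per row.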
I'm new to R so would be grateful for your help to understand what is going on!
I have a dataframe that is very big, the structure looks like this:
Var1 Var2
(chr) (int)
A , 1
A , 2
A , 3
B , 4
B , 5
C , 6
C , 7
C , 8
C , 9
...
I want to create a new dataframe that groups the Var1 categorical values together and collects all the corresponding integer Var2 values into numeric vectors.
I am hoping it looks like:
Var1 Var2_Combined
(chr) (int)
A , vector[1, 2, 3]
B , vector[4, 5]
C , vector[6, 7, 8, 9]
etc.
Because the dataset is large, I don't want to assign each vector manually; I want to do it through a function. I've tried the following, but it hasn't worked.
1. Convert to string
write.csv(aggregate(df$Var2 ~ df$Var1, FUN = toString), file = "Test_file")
but I couldn't convert the string back into usable numerics using as.numeric(), as.integer(), or any similar command.
2. Concatenate
I tried to do it with c():
write.csv(aggregate(df$Var2 ~ df$Var1, FUN = c), file = "Test_file")
While it matched up all the Var2 values to the unique values in Var1, it created a bunch of new columns rather than one column combining those values into vectors:
Var1 Var2 Var3 Var4 Var5 etc
(chr) (int) (int) (int) etc
A , 1 , 2 , 3 etc
B , 1 , 2 , 3 etc
3. a for loop
I tried to use unique() with a filter inside a 'for' loop, but it just returned unusable numbers:
Var1_Unique <- unique(df$Var1)
Var2_Combined <- numeric(length = length(Var1_Unique))
for (i in seq(1, length(Var1_Unique))) {
Var2_Combined[i] <- df %>% filter(Var2 == Var1_Unique[i])
}
I only have dplyr attached at the moment.
Thank you
There are two options:
1. Store the data in a list.
A. Using base R :
df1 <- aggregate(Var2~Var1, df, list)
df1
# Var1 Var2
#1 A 1, 2, 3
#2 B 4, 5
#3 C 6, 7, 8, 9
str(df1)
#'data.frame': 3 obs. of 2 variables:
# $ Var1: chr "A" "B" "C"
# $ Var2:List of 3
# ..$ : int 1 2 3
# ..$ : int 4 5
# ..$ : int 6 7 8 9
Now get the data back as original.
df2 <- transform(df1[rep(1:nrow(df1), lengths(df1$Var2)), ],
Var2 = unlist(df1$Var2))
str(df2)
#'data.frame': 9 obs. of 2 variables:
# $ Var1: chr "A" "A" "A" "B" ...
# $ Var2: int 1 2 3 4 5 6 7 8 9
B. Using tidyverse.
library(dplyr)
library(tidyr)
df1 <- df %>% group_by(Var1) %>% summarise(Var2 = list(Var2))
df2 <- df1 %>% unnest(Var2)
2. Store the data as a string.
A. Using base R
df1 <- aggregate(Var2~Var1, df, toString)
str(df1)
#'data.frame': 3 obs. of 2 variables:
# $ Var1: chr "A" "B" "C"
# $ Var2: chr "1, 2, 3" "4, 5" "6, 7, 8, 9"
Get it back to original format.
tmp <- strsplit(df1$Var2, ', ')
df2 <- transform(df1[rep(1:nrow(df1), lengths(tmp)),],
Var2 = as.numeric(unlist(tmp)))
str(df2)
#'data.frame': 9 obs. of 2 variables:
# $ Var1: chr "A" "A" "A" "B" ...
# $ Var2: num 1 2 3 4 5 6 7 8 9
B. Using tidyverse :
df1 <- df %>% group_by(Var1) %>% summarise(Var2 = toString(Var2))
df2 <- df1 %>% separate_rows(Var2, sep = ', ', convert = TRUE)
Both options work if you want to keep the data in R only. If you want to write the intermediate result df1 to a CSV file, you cannot use option 1, because write.csv cannot write list columns; in that case you need option 2.
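As a rough illustration of the CSV round trip with the string option, the sketch below writes the collapsed strings to a file and rebuilds the integer vectors after reading it back. It assumes a shortened version of the example data; `csv_path` is a hypothetical temporary file. Note that as.numeric("1, 2, 3") returns NA, which is why the string has to be split first.

```r
# Shortened version of the example data
df <- data.frame(Var1 = c("A", "A", "A", "B", "B"), Var2 = 1:5)

# Option 2: one comma-separated string per group
df1 <- aggregate(Var2 ~ Var1, df, toString)

# csv_path is a hypothetical location used only for this sketch
csv_path <- tempfile(fileext = ".csv")
write.csv(df1, csv_path, row.names = FALSE)

# Read everything back as text so the strings are not reinterpreted
back <- read.csv(csv_path, colClasses = "character")

# Split each string on ", " and convert the pieces to integers
back$Var2 <- lapply(strsplit(back$Var2, ", ", fixed = TRUE), as.integer)
str(back)
```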
Create a data.frame where a column is a list
^ this got me most of the way there, but there are significant hurdles I can't seem to clear.
Pretend frame is a data.frame of 3 columns with the middle column intended to be a list. This works:
frame[1,]$list_column <- list(1:4)
None of this works:
frame[1,] <- c(1, list(1:4), 3)
frame[1,] <- c(1, I(list(1:4)), 3)
frame[1,]$list_column <- list(1,3,5)
frame[1,]$list_column <- I(list(1,3,5))
In all cases R thinks I'm trying to add multiple things to a bucket that holds 1 thing and I don't know how to tell it otherwise. (And, btw, that last one is the thing I'd really like to do.)
The key is in creating your list correctly:
> list(1:4)
[[1]]
[1] 1 2 3 4
# Produces a list that contains a single vector
> list(1:4, 7:9)
[[1]]
[1] 1 2 3 4
[[2]]
[1] 7 8 9
# Produces a list that contains two separate vectors
> list(c(1:4, 7:9))
[[1]]
[1] 1 2 3 4 7 8 9
# Produces a list that contains a single vector
So you could do something like this:
frame <- data.frame(a=1:3)
frame$list_column <- NA
frame[1,]$list_column <- list(c(1, 3, 5))
frame[2,]$list_column <- list(1:5)
frame[3,]$list_column <- list(c(1:3, 5:9))
print(frame)
a list_column
1 1 1, 3, 5
2 2 1, 2, 3, 4, 5
3 3 1, 2, 3, 5, 6, 7, 8, 9
str(frame)
'data.frame': 3 obs. of 2 variables:
$ a : int 1 2 3
$ list_column:List of 3
..$ : num 1 3 5
..$ : int 1 2 3 4 5
..$ : int 1 2 3 5 6 7 8 9
Is that what you're after?
Update to address your other query:
frame <- data.frame(a=rep(NA, 3), b=NA, c=NA)
frame[1,] <- list(list(1), list(c(2,5,7)), list(3))
When you're getting unexpected results, have a look at the structure of the object you're dealing with:
> str(c(1, list(c(2,5,7)), 3))
List of 3
$ : num 1
$ : num [1:3] 2 5 7
$ : num 3
This shows that the second element in the list is a vector with 3 items. If you try to put that into a single data frame cell, you'll get a warning:
> frame <- data.frame(a=rep(NA, 3), b=NA, c=NA)
> frame[1,] <- c(1, list(c(2,5,7)), 3)
Warning message:
In `[<-.data.frame`(`*tmp*`, 1, , value = list(1, c(2, 5, 7), 3)) :
replacement element 2 has 3 rows to replace 1 rows
This is telling you that the number of elements doesn't match the number of slots in your data frame.
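Another way to avoid the row-count mismatch is to assign the whole list column in one step and then replace individual cells with `[[`. This is only a sketch of that idea, reusing the `frame` and `list_column` names from above; the cell values are made up for illustration.

```r
frame <- data.frame(a = 1:3)

# One list element per row, so the lengths line up with nrow(frame)
frame$list_column <- list(c(1, 3, 5), 1:5, c(1:3, 5:9))

# Replace a single cell; `[[` addresses exactly one element of the list,
# so the cell can even hold a list of three items
frame$list_column[[1]] <- list(1, 3, 5)

str(frame$list_column[[1]])
```

Because `[[<-` targets one element of the column, R never has to guess whether the three values are meant for one cell or three.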
I used the RNCEP package to get reanalysis data for temperature. My data looks something like this:
(DD1 <- array(1:12, dim = c(2, 3, 2),
dimnames = list(c("A", "B"),
c("a", "b", "c"),
c("First", "Second"))))
# , , First
#
# a b c
# A 1 3 5
# B 2 4 6
#
# , , Second
#
# a b c
# A 7 9 11
# B 8 10 12
str(DD1)
# int [1:2, 1:3, 1:2] 1 2 3 4 5 6 7 8 9 10 ...
# - attr(*, "dimnames")=List of 3
# ..$ : chr [1:2] "A" "B"
# ..$ : chr [1:3] "a" "b" "c"
# ..$ : chr [1:2] "First" "Second"
I think this is tabular data?
I need to write the data as csv file where I have something like this:
y a a b b c c
x A B A B A B
1 2 3 4 5 6
7 8 9 10 11 12
But when I used write.csv I got this:
write.csv(DD1)
# "","a.First","b.First","c.First","a.Second","b.Second","c.Second"
# "A",1,3,5,7,9,11
# "B",2,4,6,8,10,12
I thought I had to transpose the data first. So I used this:
DD2 <- as.data.frame.table(DD1)
I also used t() but that also did not work.
Transpose function in R is t(), so hopefully this will work on the dataframe you are trying to transpose.
DD3= t(DD2)
You were on the right track with as.data.frame.table(DD1). That would give you a "long" dataset, that can then be converted to a "wide" form that you can use write.csv on.
Note, however, that R only allows one row of headers, so you will have to combine what you show as "x" and "y" into a single header row.
Here's the approach I would suggest:
library(data.table)
(DD2 <- dcast(data.table(as.data.frame.table(DD1)),
Var3 ~ Var1 + Var2, value.var = "Freq"))
# Var3 A_a A_b A_c B_a B_b B_c
# 1: First 1 3 5 2 4 6
# 2: Second 7 9 11 8 10 12
You can then easily use write.csv on the "DD2" object.
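If data.table is not available, the same wide layout can be sketched in base R by flattening each slice of the array into one row. This assumes the column order shown in the question (A and B alternating within a, b, c); the combined header names and `csv_path` are illustrative choices, not part of the original code.

```r
DD1 <- array(1:12, dim = c(2, 3, 2),
             dimnames = list(c("A", "B"),
                             c("a", "b", "c"),
                             c("First", "Second")))

# Header built in the same column-major order as the array values:
# A_a, B_a, A_b, B_b, A_c, B_c
hdr <- as.vector(outer(dimnames(DD1)[[1]], dimnames(DD1)[[2]],
                       paste, sep = "_"))

# One row per slice of the third dimension
wide <- matrix(as.vector(DD1), nrow = dim(DD1)[3], byrow = TRUE,
               dimnames = list(dimnames(DD1)[[3]], hdr))

# csv_path is a hypothetical temporary file
csv_path <- tempfile(fileext = ".csv")
write.csv(wide, csv_path)
wide
```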
This does not work
> dfi=data.frame(v1=c(1,1),v2=c(2,2))
> dfi
v1 v2
1 1 2
2 1 2
> df$df=dfi
Error in `$<-.data.frame`(`*tmp*`, "df", value = list(v1 = c(1, 1), v2 = c(2, :
replacement has 2 rows, data has 0
df$df=I(dfi) has the same error. Please help.
Thank you.
Moved this from comments for formatting reasons:
What exactly are you trying to achieve? If you want the contents of dfi passed to df you can use this code:
df <- data.frame(matrix(vector(), 0, 2, dimnames=list(c(), c("V1", "V2"))), stringsAsFactors=F)
df=dfi
As @joran says, it is unclear why you would ever want to do this. Nevertheless, it is possible.
One of the requirements of a data frame is that all the columns have the same number of rows. This is why you are getting the error. Something like this will work:
dfi <- data.frame(v1=c(1,1),v2=c(2,2)) # 2 rows
df <- data.frame(x=1:2) # also 2 rows
df$df <- dfi # works now
Printing would lead you to believe that df has three columns...
df
# x df.v1 df.v2
# 1 1 1 2
# 2 2 1 2
but it does not!
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ x : int 1 2
# $ df:'data.frame': 2 obs. of 2 variables:
# ..$ v1: num 1 1
# ..$ v2: num 2 2
Since df$df is a data frame
class(df$df)
# [1] "data.frame"
you can use the standard data frame accessors...
df$df$v1
# [1] 1 1
df$df[1,]
# v1 v2
# 1 1 2
Incidentally, RStudio has trouble displaying this type of data structure; View(df) gives an inaccurate display of the structure.
Finally, you are probably better off creating a list of data frames, rather than a data frame containing data frames:
df <- data.frame(grp=rep(LETTERS[1:3],each=5),x=rnorm(15),y=rpois(15,5))
df.lst <- split(df,df$grp) # creates a list of data frames
df.lst$A
# grp x y
# 1 A -1.3606420 10
# 2 A -0.4511408 5
# 3 A -1.1951950 4
# 4 A -0.8017765 5
# 5 A -0.2816298 9
df.lst$A$x
# [1] -1.3606420 -0.4511408 -1.1951950 -0.8017765 -0.2816298
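If a data frame that already contains a nested data frame needs to become an ordinary flat one (for example before writing it to a file), one common idiom is to pass its columns back through data.frame(). A minimal sketch, reusing the small `dfi` example; `flat` is a name introduced here for illustration.

```r
dfi <- data.frame(v1 = c(1, 1), v2 = c(2, 2))
df <- data.frame(x = 1:2)
df$df <- dfi            # nested: df has 2 columns, one is a data frame

# data.frame() expands the nested data frame into separate columns
flat <- do.call(data.frame, df)

str(flat)
names(flat)
```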
I have a list of data.frames (d) that looks like this:
$ 1 :'data.frame': 1 obs. of 2 variables:
..$ index: int 2
..$ V1 : Factor w/ 125 levels "cgtsloqasmlkjybjlo,..:"
$ 2 :'data.frame': 1 obs. of 2 variables:
..$ index: int 2
..$ V1 : Factor w/ 125 levels "ponlohlofdctlo,..:"
and so on for 1000 data.frames. I have to count the number of unique letters occurring in "cgtsloqasmlkjybjlo,..:" as well as in "ponlohlofdctlo,..:" and in the other 1000 data.frames.
I tried writing a function, but I'm not an expert and it does not work.
I also tried to split the strings, but that does not work either:
chars = sapply(d, function(x) strsplit(as.character(d),""))
In addition, I have to count the number of occurrences of "lo" in "cgtsloqasmlkjybjlo,..:" as well as in "ponlohlofdctlo,..:" and in the other 1000.
Edit: the desired output will be a data.frame:
Seq length(unique_letters) lo_occurrences
cgtsloqasmlkjybjlo 13 2
ponlohlofdctlo 9 3
.............. ............ ............
dput output:
dput(d[1:3])
structure(list(1 = structure(1000L, .Label = c("jhgfilsouilohgucaksfiaaknajdauloadbayrzjdhad", "fjkhqurtglowqgbdahhmolovdethabvfdalo", "....", "V1"), class = "factor")), .Names = c("1", "2", "3"))
A way is this:
#simulating your list; I got an error trying to use your dput
d <- list(data.frame(index = 2, V1 = "cgtsloqasmlkjybjlo"),
data.frame(index = 2, V1 = "ponlohlofdctlo"))
d
#[[1]]
# index V1
#1 2 cgtsloqasmlkjybjlo
#[[2]]
# index V1
#1 2 ponlohlofdctlo
res <- do.call(rbind, lapply(d, function(x) data.frame(seq = as.character(x$V1),
length_uniques = length(unique(unlist(strsplit(as.character(x$V1), "")))),
#gregexpr returns -1 when there is no match, so count only positive positions
lo_counts = sum(unlist(gregexpr("lo", as.character(x$V1))) > 0))))
res
# seq length_uniques lo_counts
#1 cgtsloqasmlkjybjlo 13 2
#2 ponlohlofdctlo 9 3
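One edge case worth keeping in mind: gregexpr() returns -1 for a string with no match, so a count should test for positive positions rather than take the length of the result. The sketch below wraps both counts in small vectorised helpers; `count_pattern` and `count_unique` are hypothetical names, and the third test string is added only to show the no-match case.

```r
# Count non-overlapping occurrences of a fixed pattern in each string
count_pattern <- function(s, pattern) {
  matches <- gregexpr(pattern, s, fixed = TRUE)
  # positions are -1 when nothing matched, so keep only positive ones
  vapply(matches, function(pos) sum(pos > 0), integer(1))
}

# Count distinct characters in each string
count_unique <- function(s) {
  vapply(strsplit(s, ""), function(ch) length(unique(ch)), integer(1))
}

seqs <- c("cgtsloqasmlkjybjlo", "ponlohlofdctlo", "abc")
res2 <- data.frame(seq = seqs,
                   length_uniques = count_unique(seqs),
                   lo_counts = count_pattern(seqs, "lo"))
res2
```

With a list of data frames like `d`, the sequences could first be pulled out with something like `vapply(d, function(x) as.character(x$V1), character(1))`, assuming each data frame holds exactly one row.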