I have a huge data frame (call it huge) I would like to split up in two by row number. Though, I notice that the way I'd do it makes the resulting subsets large factors instead of data frames.
list1 <- huge[c(1:8175),]
list2 <- huge[c(8176:nrow(huge),]
class(list1)
[1] "factor"
Can someone explain to me why it is like that, and how do I prevent that?
It is likely that you subset a one-column data frame. Considering the following example.
# Create an example data frame
dt <- data.frame(a = 1:5, b = letters[1:5])
dt
# a b
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
# 5 5 e
str(dt)
# 'data.frame': 5 obs. of 2 variables:
# $ a: int 1 2 3 4 5
# $ b: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 4 5
# Subset the data frame
list1 <- dt[1:2, ]
list2 <- dt[3:nrow(dt), ]
class(list1)
# [1] "data.frame"
The code to subset dt works well. However, when I created a one-column data frame from dt and subset it, you can see that the output automatically becomes a vector.
# Create a one-column data frame
dt2 <- dt[, 2, drop = FALSE]
# Subset the data frame
list3 <- dt2[1:2, ]
list4 <- dt2[3:nrow(dt2), ]
class(list3)
# [1] "factor"
list3
# [1] a b
# Levels: a b c d e
The solution would be add drop = FALSE when subsetting the data frame to keep the output as a data frame.
# Subset the data frame
list5 <- dt2[1:2, , drop = FALSE]
class(list5)
# [1] "data.frame"
Related
Relatively new to R, but I have an issue combining columns that have the same name. I have a very large dataframe (~70 cols and 30k rows). Some of the columns have the same name. I wish to merge these columns and remove the NA's.
An example of what I would like is below (although on a much larger scale).
df <- data.frame(x = c(2,1,3,5,NA,12,"blah"),
x = c(NA,NA,NA,NA,9,NA,NA),
y = c(NA,5,12,"hop",NA,2,NA),
y = c(2,NA,NA,NA,8,NA,4),
z = c(9,5,NA,3,2,6,NA))
desired.result <- data.frame(x = c(2,1,3,5,9,12,"blah"),
y = c(2,5,12,"hop",8,2,4),
z = c(9,5,NA,3,2,6,NA))
I have tried a number of things including suggestions such as:
R: merging columns and the values if they have the same column name
Combine column to remove NA's
However, these solutions either require a numeric dataset (I need to keep the character information) or they require you to manually input the columns that are the same (which is too time consuming for the size of my dataset).
I have managed to solve the issue manually by creating new columns that are combinations:
df$x <- apply(df[,1:2], 1, function(x) x[!is.na(x)][1])
However I don't know how to get R to auto-identify where the columns have the same names and then apply something like the above such that I don't need to specify the index each time.
Thanks
here is a base R approach
#split into a named list, nased on colnames befote the .-character
L <- split.default(df, f = gsub("(.*)\\..*", "\\1", names(df)))
#get the first non-na value for each row in each chunk
L2 <- lapply(L, function(x) apply(x, 1, function(y) na.omit(y)[1]))
# result in a data.frame
as.data.frame(L2)
# x y z
# 1 2 2 9
# 2 1 5 5
# 3 3 12 NA
# 4 5 hop 3
# 5 9 8 2
# 6 12 2 6
# 7 blah 4 NA
# since you are using mixed formats, the columsn are not of the same class!!
str(as.data.frame(L2))
# 'data.frame': 7 obs. of 3 variables:
# $ x: chr "2" "1" "3" "5" ...
# $ y: chr " 2" "5" "12" "hop" ...
# $ z: num 9 5 NA 3 2 6 NA
I have scraped a large table from a web page using the rvest package, but it is reading it as a single vector:
foo<-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
that I need to deal with as a dataframe that looks like this:
bar<-as.data.frame(cbind(Animal=c("Dog","Cat","Goat"),A=c(1,4,7),B=c(2,5,8),C=c(3,6,9)))
This might be a simple dilemma but I'd appreciate the help.
you can create a matrix from your vector and turn it into a data frame:
foo<-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
foo <- c("Animal" , foo)
m <- matrix(foo , ncol = 4 , byrow = TRUE)
df <- as.data.frame(m[-1,] , stringsAsFactors = FALSE)
colnames(df) <- m[1,]
# I assume you want numerics for your A,B,C columns:
df[,2:4]<-apply(df[,2:4],2,as.numeric)
lapply(df,class)
$Animal
[1] "character"
$A
[1] "numeric"
$B
[1] "numeric"
$C
[1] "numeric"
Just split it into required number of rows and rbind it. I added "Animal" at the start of foo to make the elements equal in each row when splitting
foo = c("Animal", foo)
df = data.frame(do.call(rbind, split(foo, ceiling(seq_along(foo)/4))),
stringsAsFactors = FALSE)
colnames(df) = df[1,]
df = df[-1,]
df
# Animal A B C
#2 Dog 1 2 3
#3 Cat 4 5 6
#4 Goat 7 8 9
If you want the proper column types, you can try this. Split into a list, name the list, then convert the column types before coercing to data frame.
l <- setNames(split(tail(foo, -3), rep(1:4, 3)), c("Animal", foo[1:3]))
as.data.frame(lapply(l, type.convert)) ## stringsAsFactors=FALSE if desired
# Animal A B C
# 1 Dog 1 2 3
# 2 Cat 4 5 6
# 3 Goat 7 8 9
Here is a convenient tool to work with list,
seqList <-
function(character,by= 1,res=list()){
### sequence characters by
if (length(character)==0){
res
} else{
seqList(character[-c(1:by)],by=by,res=c(res,list(character[1:by])))
}
}
Once you convert your characters into lists it's easier to manipulate them for instance you can do.
options(stringsAsFactors=FALSE)
foo <-c("A","B","C","Dog","1","2","3","Cat","4","5","6","Goat","7","8","9")
foo <- c("Animal",foo)
df <- data.frame(t(do.call("rbind",
lapply(1:4,function(x) do.call("cbind",lapply(seqList(foo,4),"[[",x))))))
colnames(df) <- df[1,]
df <- df[-1,]
## > df
## Animal A B C
## 2 Dog 1 2 3
## 3 Cat 4 5 6
## 4 Goat 7 8 9
Note:
I haven't tested the efficiency of the function. It might not be very efficient for large amount of characters.
The use of matrices might a better tool for this job.
I have a seemingly simple task of adding a row to a data frame in R but I just can't do it!
I have a data frame with 50 rows and 100 columns. The data frame, which I would like to keep in the same format, has the first column as a factor, and all other columns as characters -- this is what lapply produced. I would simply like to add append a 51st row...but I incur warnings every time.
My added data is of the form Cat <- c("Cat", 1,NA,3,NA,5). (I have no clue where " or ' need to go - quite new to R!)
rbind shows "invalid factor levels" every time.
e.g.
df <- rbind(df,Cat)
I believe this is because of the factor/character divide.
The factor levels in your data.frame should also contain the values in your "Cat" object for the relevant factor column.
Here's a simple example:
df <- data.frame(v1 = c("a", "b"), v2 = 1:2)
toAdd <- list("c", 3)
## Warnings...
rbind(df, toAdd)
# v1 v2
# 1 a 1
# 2 b 2
# 3 <NA> 3
# Warning message:
# In `[<-.factor`(`*tmp*`, ri, value = "c") :
# invalid factor level, NA generated
## Possible fix
df$v1 <- factor(df$v1, unique(c(levels(df$v1), toAdd[[1]])))
rbind(df, toAdd)
# v1 v2
# 1 a 1
# 2 b 2
# 3 c 3
Alternatively, consider rbindlist from "data.table", which should save you from having to convert the factor levels:
> library(data.table)
> df <- data.frame(v1 = c("a", "b"), v2 = 1:2)
> rbindlist(list(df, toAdd))
v1 v2
1: a 1
2: b 2
3: c 3
> str(.Last.value)
Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
$ v1: Factor w/ 3 levels "a","b","c": 1 2 3
$ v2: num 1 2 3
- attr(*, ".internal.selfref")=<externalptr>
I have two data.frames:
word1=c("a","a","a","a","b","b","b")
word2=c("a","a","a","a","c","c","c")
values1 = c(1,2,3,4,5,6,7)
values2 = c(3,3,0,1,2,3,4)
df1 = data.frame(word1,values1)
df2 = data.frame(word2,values2)
df1:
word1 values1
1 a 1
2 a 2
3 a 3
4 a 4
5 b 5
6 b 6
7 b 7
df2:
word2 values2
1 a 3
2 a 3
3 a 0
4 a 1
5 c 2
6 c 3
7 c 4
I would like to split these dataframes by word*, and perform two sample t.tests in R.
For example, the word "a" is in both data.frames. What's the t.test between the data.frames for the word "a"? And do this for all the words that are in both data.frames.
The result is a data.frame(result):
word tvalues
1 a 0.4778035
Thanks
Find the words common to both dataframes, then loop over these words, subsetting both dataframes and performing the t.test on the subsets.
E.g.:
df1 <- data.frame(word=sample(letters[1:5], 30, replace=TRUE),
x=rnorm(30))
df2 <- data.frame(word=sample(letters[1:5], 30, replace=TRUE),
x=rnorm(30))
common_words <- sort(intersect(df1$word, df2$word))
setNames(lapply(common_words, function(w) {
t.test(subset(df1, word==w, x), subset(df2, word==w, x))
}), common_words)
This returns a list, where each element is the output of the t.test for one of the common words. setNames just names the list elements so you can see which words they correspond to.
Note I've created new example data here since your example data only have one word in common (a) and so don't really resemble your true problem.
If you just want a matrix of statistics, you can do something like:
t(sapply(common_words, function(w) {
test <- t.test(subset(df1, word==w, x), subset(df2, word==w, x))
c(test$statistic, test$parameter, p=test$p.value,
`2.5%`=test$conf.int[1], `97.5%`=test$conf.int[2])
}))
## t df p 2.5% 97.5%
## a 0.9141839 8.912307 0.38468553 -0.4808054 1.1313220
## b -0.2182582 7.589109 0.83298193 -1.1536056 0.9558315
## c -0.2927253 8.947689 0.77640684 -1.5340097 1.1827691
## d -2.7244728 12.389709 0.01800568 -2.5016301 -0.2826952
## e -0.3683153 7.872407 0.72234501 -1.9404345 1.4072499
This does not work
> dfi=data.frame(v1=c(1,1),v2=c(2,2))
> dfi
v1 v2
1 1 2
2 1 2
> df$df=dfi
Error in `$<-.data.frame`(`*tmp*`, "df", value = list(v1 = c(1, 1), v2 = c(2, :
replacement has 2 rows, data has 0
df$df=I(dfi) has the same error. Please help.
Thank you.
Moved this from comments for formatting reasons:
What exactly are you trying to achieve? If you want the contents of dfi passed to df you can use this code:
df <- data.frame(matrix(vector(), 0, 2, dimnames=list(c(), c("V1", "V2"))), stringsAsFactors=F)
df=dfi
As #joran says, it is unclear why you would ever want to do this. Nevertheless, it is possible.
One of the requirements of a data frame is that all the columns have the same number of rows. This is why you are getting the error. Something like this will work:
dfi <- data.frame(v1=c(1,1),v2=c(2,2)) # 2 rows
df <- data.frame(x=1:2) # also 2 rows
df$df <- dfi # works now
Printing would lead you to believe that df has three columns...
df
# x df.v1 df.v2
# 1 1 1 2
# 2 2 1 2
but it does not!
str(df)
# 'data.frame': 2 obs. of 2 variables:
# $ x : int 1 2
# $ df:'data.frame': 2 obs. of 2 variables:
# ..$ v1: num 1 1
# ..$ v2: num 2 2
Since df$df is a data frame
class(df$df)
# [1] "data.frame"
you can use the standard data frame accessors...
df$df$v1
# [1] 1 1
df$df[1,]
# v1 v2
# 1 1 2
Incidentally, RStudio has trouble displaying this type of data structure; view(df) gives an inaccurate display of the structure.
Finally, you are probably better off creating a list of data frames, rather than a data frame containing data frames:
df <- data.frame(grp=rep(LETTERS[1:3],each=5),x=rnorm(15),y=rpois(15,5))
df.lst <- split(df,df$grp) # creates a list of data frames
df.lst$A
# grp x y
# 1 A -1.3606420 10
# 2 A -0.4511408 5
# 3 A -1.1951950 4
# 4 A -0.8017765 5
# 5 A -0.2816298 9
df.lst$A$x
# [1] -1.3606420 -0.4511408 -1.1951950 -0.8017765 -0.2816298