Assign (or copy) column classes from a dataframe to another - r

I produced a large data frame (1700+ obs., 159 variables) with a function that collects info from a website. Usually the function finds numeric values for some columns, so those columns are numeric. Sometimes, however, it finds some text and converts the whole column to text.
I have one df whose column classes are correct, and I would like to "paste" those classes onto a new, incorrect df.
Say, for example:
dfCorrect<-data.frame(x=c(1,2,3,4),y=as.factor(c("a","b","c","d")),z=c("bar","foo","dat","dot"),stringsAsFactors = F)
str(dfCorrect)
'data.frame': 4 obs. of 3 variables:
$ x: num 1 2 3 4
$ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
$ z: chr "bar" "foo" "dat" "dot"
## now I have my "wrong" data frame:
dfWrong<-as.data.frame(sapply(dfCorrect,paste,sep=""),stringsAsFactors = TRUE)
str(dfWrong)
'data.frame': 4 obs. of 3 variables:
$ x: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
$ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
$ z: Factor w/ 4 levels "bar","dat","dot",..: 1 4 2 3
I wanted to copy the classes of each column of dfCorrect into dfWrong, but haven't found how to do it properly.
I've tested:
dfWrong1<-dfWrong
dfWrong1[0,]<-dfCorrect[0,]
str(dfWrong1) ## bad result
'data.frame': 4 obs. of 3 variables:
$ x: Factor w/ 4 levels "1","2","3","4": 1 2 3 4
$ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
$ z: Factor w/ 4 levels "bar","dat","dot",..: 1 4 2 3
dfWrong1<-dfWrong
str(dfWrong1)<-str(dfCorrect)
'data.frame': 4 obs. of 3 variables:
$ x: num 1 2 3 4
$ y: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
$ z: chr "bar" "foo" "dat" "dot"
Error in str(dfWrong1) <- str(dfCorrect) :
could not find function "str<-"
With this small data frame I could do it by hand, but what about larger ones? Is there a way to "copy" the classes from one df to another without having to know the individual classes (and indexes) of each column?
Expected final result (after properly "pasting" classes):
all.equal(sapply(dfCorrect,class),sapply(dfWrong,class))
[1] TRUE

You could try this:
dfWrong[] <- mapply(FUN = as,dfWrong,sapply(dfCorrect,class),SIMPLIFY = FALSE)
...although my first instinct is to agree with Oliver that if it were me I'd try to ensure the correct class at the point you're reading the data.
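One caveat with coercing through as(): a factor is converted to a number via its underlying integer codes rather than its labels, which only happens to match in the example above because the labels are 1 to 4. A minimal sketch of a more defensive variant follows. It assumes both data frames have the same column names, and `copy_classes` is a hypothetical helper name, not something from base R. It only handles factor and basic atomic classes.

```r
# Hypothetical helper: coerce each column of `target` to the class of the
# same-named column in `template`. Every column goes through character first,
# so factor columns are converted by label, not by integer code.
copy_classes <- function(target, template) {
  cols <- names(template)
  target[cols] <- Map(function(col, ref) {
    txt <- as.character(col)
    if (is.factor(ref)) {
      factor(txt, levels = levels(ref))   # reuse the template's levels
    } else {
      methods::as(txt, class(ref)[1])     # e.g. "numeric", "character"
    }
  }, target[cols], template)
  target
}

# Example data (values chosen so that labels differ from factor codes)
dfCorrect <- data.frame(x = c(10, 20, 30, 40),
                        y = factor(c("a", "b", "c", "d")),
                        z = c("bar", "foo", "dat", "dot"),
                        stringsAsFactors = FALSE)
dfWrong <- as.data.frame(sapply(dfCorrect, paste, sep = ""),
                         stringsAsFactors = TRUE)

dfFixed <- copy_classes(dfWrong, dfCorrect)
str(dfFixed)
```

Values that cannot be parsed (text in a numeric column, say) would come back as NA with a warning, which is probably the behaviour you want to notice anyway.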

Related

what is the difference between df[1] and df[,1] (in dataframes)

I've noticed they give the same result, except that df[1] returns the column as a data frame while df[,1] returns a vector. Also, I've noticed they give exactly the same result in tibbles. Is that all there is to it?
The "[" function has (at least) two different forms. When used with two arguments on a data frame (which is a special form of list), it returns the contents of the specified rows and columns. It has an optional argument "drop" whose default is TRUE; if it is set to FALSE, you get the subset as a data frame. When used with one argument, it returns the selected columns without loss of the "data.frame" class attribute. That result is still a list in its own right.
The other extraction function, "[[" also returns the contents only.
> dat <- data.frame(A=1:10,B=letters[1:10],stringsAsFactors=TRUE)
> str(dat[1:5,])
'data.frame': 5 obs. of 2 variables:
$ A: int 1 2 3 4 5
$ B: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5
> str(dat[1:5,1])
int [1:5] 1 2 3 4 5
> str(dat[1])
'data.frame': 10 obs. of 1 variable:
$ A: int 1 2 3 4 5 6 7 8 9 10
> str(dat[[1]])
int [1:10] 1 2 3 4 5 6 7 8 9 10
> str(dat[,1,drop=FALSE])
'data.frame': 10 obs. of 1 variable:
$ A: int 1 2 3 4 5 6 7 8 9 10
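On the tibble point in the question: as far as I know, `[` on a tibble does not drop a single column to a vector, which would explain why tb[1] and tb[, 1] look identical there. A minimal sketch, assuming the tibble package is installed (the names `dat` and `tb` are just for illustration):

```r
library(tibble)

dat <- data.frame(A = 1:10, B = letters[1:10])
tb  <- as_tibble(dat)

class(dat[1])     # one-column data frame
class(dat[, 1])   # plain integer vector, because drop = TRUE by default
class(tb[1])      # one-column tibble
class(tb[, 1])    # still a one-column tibble
class(tb[[1]])    # `[[` extracts the vector for both data frames and tibbles
```

So with tibbles, `[[` or `$` is the way to get the bare vector.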

How to convert list given in a data frame to factor/numbers in R Data frame?

mydf is provided for reproducibility. I have a data frame, mydf, and I want to convert its list columns to factors, but that throws an error.
mydf<-data.frame(col1=c("a","b"),col2=c("f","j"))
mydf$col1<-as.list(mydf$col1)
mydf$col2<-as.list(mydf$col2)
str(mydf)
This is the error I get when I try to change the lists to factor/numeric type:
mydf$col1<-as.factor(mydf$col1)
Error in order(y) : unimplemented type 'list' in 'orderVector1'
I want my data frame (mydf) to look like expected_df (a data frame with no list columns):
expected_df<-data.frame(col1=c("a","b"),col2=c("f","j"))
str(expected_df)
If you compare str(mydf) and str(expected_df), there is a difference, as I am unable to change the lists to factors in mydf. Is there any workaround to solve my issue?
str(mydf)
'data.frame': 2 obs. of 2 variables:
$ col1:List of 2
..$ : Factor w/ 2 levels "a","b": 1
..$ : Factor w/ 2 levels "a","b": 2
$ col2:List of 2
..$ : Factor w/ 2 levels "f","j": 1
..$ : Factor w/ 2 levels "f","j": 2
str(expected_df)
'data.frame': 2 obs. of 2 variables:
$ col1: Factor w/ 2 levels "a","b": 1 2
$ col2: Factor w/ 2 levels "f","j": 1 2
You can use stringsAsFactors = TRUE
> mydf <- data.frame(col1 = c("a", "b"), col2 = c("f", "j"), stringsAsFactors = TRUE)
> mydf
col1 col2
1 a f
2 b j
> mydf$col1
[1] a b
Levels: a b
> str(mydf)
'data.frame': 2 obs. of 2 variables:
$ col1: Factor w/ 2 levels "a","b": 1 2
$ col2: Factor w/ 2 levels "f","j": 1 2
Late to the party here, but I thought I would share my experience for future searches. I was also having the 'Error in order(y)' error when trying to convert a column to factors. The way I got round it was to explicitly label the factors. In your example it would be like so:
# instead of this:
# mydf$col1 <- as.factor(mydf$col1)
# using this:
mydf$col1 <- factor(mydf$col1, levels=c("a","b"))
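If the list columns already exist and the data frame cannot simply be rebuilt, one possible approach is to flatten each list column element by element before making the factor. A minimal sketch, assuming every list element holds exactly one value (`is_list_col` is just an illustrative name):

```r
# Recreate the problem: two list columns
mydf <- data.frame(col1 = c("a", "b"), col2 = c("f", "j"))
mydf$col1 <- as.list(mydf$col1)
mydf$col2 <- as.list(mydf$col2)

# Find the list columns, then turn each one into a factor.
# as.character() on each element works whether the element is a
# character string or a length-one factor.
is_list_col <- vapply(mydf, is.list, logical(1))
mydf[is_list_col] <- lapply(mydf[is_list_col], function(col) {
  factor(vapply(col, as.character, character(1)))
})

str(mydf)
```

vapply() will stop with an error if any element has a length other than one, which is a useful check that the column really can be flattened.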

cannot replace factor variable within a list r

I have the following list of dataframes structure:
str(mylist)
List of 2
$ L1 :'data.frame': 12471 obs. of 3 variables:
...$ colA : Date[1:12471], format: "2006-10-10" "2010-06-21" ...
...$ colB : int [1:12471], 62 42 55 12 78 ...
...$ colC : Factor w/ 3 levels "type1","type2","type3",..: 1 2 3 2 2 ...
I would like to replace type1 or type2 with a new factor type4.
I have tried:
mylist <- lapply(mylist, transform, colC =
replace(colC, colC == 'type1','type4'))
Warning message:
1: In `[<-.factor`(`*tmp*`, list, value = "type4") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, list, value = "type4") :
invalid factor level, NA generated
I do not want to read in my initial data with stringsAsFactors = FALSE, but I have tried adding type4 as a level in my initial dataset (before splitting into a list of data frames) using:
levels(mydf$colC) <- c(levels(mydf$colC), "type4")
but I still get the same warning when trying to replace.
How do I tell replace that type4 is to be treated as a factor level?
You can use the levels argument to redefine your factor, for example:
status <- factor(status, ordered=TRUE, levels=c("1", "3", "2",...))
Here c("1", "3", "2",...) is where your type4 level would go.
As you state, the crucial thing is to add the new factor level.
## Test data:
mydf <- data.frame(colC = factor(c("type1", "type2", "type3", "type2", "type2")))
mylist <- list(mydf, mydf)
Your data has three factor levels:
> str(mylist)
List of 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 3 levels "type1","type2",..: 1 2 3 2 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 3 levels "type1","type2",..: 1 2 3 2 2
Now add the fourth factor level, then your replace command should work:
## Change levels:
for (ii in seq(along = mylist)) levels(mylist[[ii]]$colC) <-
c(levels(mylist[[ii]]$colC), "type4")
## Replace level:
mylist <- lapply(mylist, transform, colC = replace(colC,
colC == 'type1','type4'))
The new data has four factor levels:
> str(mylist)
List of 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 4 levels "type1","type2",..: 4 2 3 2 2
$ :'data.frame': 5 obs. of 1 variable:
..$ colC: Factor w/ 4 levels "type1","type2",..: 4 2 3 2 2
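If the goal is to rename a level everywhere rather than to move only some rows to a new level, another option is to assign into levels() directly, which avoids the two-step add-then-replace. A minimal sketch using the same test data (this is a swapped-in alternative, not the approach above):

```r
## Test data, as above
mydf <- data.frame(colC = factor(c("type1", "type2", "type3", "type2", "type2")))
mylist <- list(mydf, mydf)

## Rename the level "type1" to "type4" in every data frame of the list
mylist <- lapply(mylist, function(d) {
  levels(d$colC)[levels(d$colC) == "type1"] <- "type4"
  d
})

str(mylist)
```

Note the difference in the result: here "type1" disappears from the levels entirely, whereas the replace() approach keeps "type1" as an (unused) level alongside "type4".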

How to group by for one column within a list of dataframes

I have a list of dataframes.
Each dataframe is named by person and each dataframe contains events (the row). The columns for each event are called 'Indication for event' and 'Number of biopsies' . I would like to create a summary dataframe (or matrix?) that tells me how many biopsies are taken for each Indication by each person.
List of 3
$ :'data.frame': 3 obs. of 2 variables:
..$ Indication: Factor w/ 2 levels "AbdoPain","Vomiting": 1 2 1
..$ NumOfBx : num [1:3] 2 3 1
$ :'data.frame': 4 obs. of 2 variables:
..$ Indication: Factor w/ 3 levels "AbdoPain","Anaemia",..: 2 2 1 3
..$ NumOfBx : num [1:4] 12 23 1 5
$ :'data.frame': 4 obs. of 2 variables:
..$ Indication: Factor w/ 3 levels "AbdoPain","Anaemia",..: 2 1 3 3
..$ NumOfBx : num [1:4] 1 2 3 7
The intended result:
dfMrBen dfJohn dfStuart
Abdo pain
Vomiting
Anaemia
I thought this was likely to be a split-apply-combine problem but I don't know how to combine to get the summary as above. At the moment I have:
ReportOp<-function(x){
#To extract the dataframe name
theName<-x
#To extract the dataframe data
x<-data.frame(Dxlst[[x]])
grp<-x%>% group_by(Indication) %>% summarise(mean=mean(NumOfBx))
}
lapply(names(Dxlst),ReportOp)
but this just gives me the summary for each dataframe. How do I combine (basically add) the dataframes together to get the intended result?
First combine the data into one big data frame (or do this after summarising) with
do.call(rbind, Dxlst)
or first add an id to each data frame in the list and then rbind them together like so:
Dxlst <- lapply(1:length(Dxlst),
function(x) cbind(Dxlst[[x]],
id = rep(x,nrow(Dxlst[[x]]))))
do.call(rbind, Dxlst)
This is not exactly what you are looking for, but it is close. It would also be simpler to combine the data frames first and then do the summary.
Create the data:
df1=data.frame(Indication=as.factor(sample(c(0,1), 10, replace = T)), Bx=sample(1:10, 10, replace = T))
df2=data.frame(Indication=as.factor(sample(c(0,1,2), 10, replace = T)), Bx=sample(1:10, 10, replace = T))
l=list(df1,df2)
then
l=lapply(l, function(x) aggregate( Bx ~ Indication, x, sum))
m=max(sapply(l, nrow))
n=lapply(l, function(x){ x <- x[seq_len(m),]; row.names(x) <- NULL; x})
do.call('cbind',n)
I get output like:
Indication Bx Indication Bx
1 0 18 0 9
2 1 28 1 35
3 <NA> NA 2 18
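Putting the two suggestions together (tag each data frame with its owner, stack, then summarise) gets closer to the Indication-by-person layout asked for. A minimal sketch in base R: the data values follow the str() output in the question, but the third Indication level is elided there, so "Vomiting" is an assumption, as are the list names dfMrBen, dfJohn and dfStuart taken from the intended-result header.

```r
# Assumed reconstruction of the question's list of data frames
Dxlst <- list(
  dfMrBen  = data.frame(Indication = c("AbdoPain", "Vomiting", "AbdoPain"),
                        NumOfBx = c(2, 3, 1)),
  dfJohn   = data.frame(Indication = c("Anaemia", "Anaemia", "AbdoPain", "Vomiting"),
                        NumOfBx = c(12, 23, 1, 5)),
  dfStuart = data.frame(Indication = c("Anaemia", "AbdoPain", "Vomiting", "Vomiting"),
                        NumOfBx = c(1, 2, 3, 7))
)

# Add a `person` column to each data frame, then stack them
stacked <- do.call(rbind, Map(function(d, nm) cbind(d, person = nm),
                              Dxlst, names(Dxlst)))

# Sum the biopsies for every Indication/person combination
summary_tab <- xtabs(NumOfBx ~ Indication + person, data = stacked)
summary_tab
```

Combinations that never occur (Anaemia for dfMrBen here) show up as 0 rather than NA; as.data.frame.matrix(summary_tab) turns the table into an ordinary data frame if that is easier to work with.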

How to put different size vectors in data.table column

I have implemented a simple group-by-operation with the ?stats::aggregate function. It collects elements per group in a vector. I would like to make it faster using the data.table package. However I'm not able to reproduce the wanted behaviour with data.table.
Sample dataset:
df <- data.frame(group = c("a","a","a","b","b","b","b","c","c"), val = c("A","B","C","A","B","C","D","A","B"), stringsAsFactors = TRUE)
Output to reproduce with data.table:
by_group_aggregate <- aggregate(x = df$val, by = list(df$group), FUN = c)
What I've tried:
data_t <- data.table(df)
# working, but not what I want
by_group_datatable <- data_t[,j = paste(val,collapse=","), by = group]
# no grouping done when using c or as.vector
by_group_datatable <- data_t[,j = c(val), by = group]
by_group_datatable <- data_t[,j = as.vector(val), by = group]
# grouping leads to error when using as.list
by_group_datatable <- data_t[,j = as.list(val), by = group]
Is it possible to have vectors of different size in a data.table column? If yes, how do I achieve it?
Here's one way:
data_t[, list(list(val)), by = group]
# group V1
#1: a A,B,C
#2: b A,B,C,D
#3: c A,B
The outer list() is needed because j has to return a list of result columns. The inner list() wraps each group's val vector so that it is stored as a single list element, giving one row per group.
To check the structure:
str(data_t[, list(list(val)), by = group])
#Classes ‘data.table’ and 'data.frame': 3 obs. of 2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ V1 :List of 3
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2
# - attr(*, ".internal.selfref")=<externalptr>
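Once the vectors sit in a list column, they can be pulled back out per group. A minimal sketch, assuming the data.table package is installed; the column name `vals` and the use of character rather than factor values are choices made here for illustration:

```r
library(data.table)

df <- data.frame(group = c("a","a","a","b","b","b","b","c","c"),
                 val   = c("A","B","C","A","B","C","D","A","B"),
                 stringsAsFactors = FALSE)
data_t <- as.data.table(df)

# .() is data.table's alias for list(); naming the column avoids the default V1
res <- data_t[, .(vals = list(val)), by = group]

# Number of elements stored for each group
lengths(res$vals)

# Extract the vector stored for group "b"
res[group == "b", vals[[1]]]
```

The list column is an ordinary R list, so lapply() and friends work on res$vals directly.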
Using dplyr, you could do the following:
library(dplyr)
df %>% group_by(group) %>% summarise(val = list(val))
#Source: local data frame [3 x 2]
#
# group val
# (fctr) (chr)
#1 a <S3:factor>
#2 b <S3:factor>
#3 c <S3:factor>
Check the structure:
df %>% group_by(group) %>% summarise(val = list(val)) %>% str
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 3 obs. of 2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ val :List of 3
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# ..$ : Factor w/ 4 levels "A","B","C","D": 1 2
Here is another option with dplyr/tidyr
library(dplyr)
library(tidyr)
res <- df %>%
nest(-group)
str(res)
#'data.frame': 3 obs. of 2 variables:
# $ group: Factor w/ 3 levels "a","b","c": 1 2 3
# $ data :List of 3
# ..$ :'data.frame': 3 obs. of 1 variable:
# .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2 3
# ..$ :'data.frame': 4 obs. of 1 variable:
# .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2 3 4
# ..$ :'data.frame': 2 obs. of 1 variable:
# .. ..$ val: Factor w/ 4 levels "A","B","C","D": 1 2

Resources