Melting a cast data frame gives incorrect output - r

I've encountered a strange behaviour in cast/melt from the reshape package. If I cast a data.frame, and then try to melt it, the melt comes out wrong. Manually unsetting the "cast_df" class from the cast data.frame lets it be melted properly.
Does anyone know if this is intended behaviour, and if so, what is the use case when you'd want it?
A small code example which shows the behaviour:
> df <- data.frame(type=c(1, 1, 2, 2, 3, 3), variable="n", value=c(71, 72, 68, 80, 21, 20))
> df
type variable value
1 1 n 71
2 1 n 72
3 2 n 68
4 2 n 80
5 3 n 21
6 3 n 20
> df.cast <- cast(df, type~., sum)
> names(df.cast)[2] <- "n"
> df.cast
type n
1 1 143
2 2 148
3 3 41
> class(df.cast)
[1] "cast_df" "data.frame"
> melt(df.cast, id="type", measure="n")
type value value
X.all. 1 143 (all)
X.all..1 2 148 (all)
X.all..2 3 41 (all)
> class(df.cast) <- "data.frame"
> class(df.cast)
[1] "data.frame"
> melt(df.cast, id="type", measure="n")
type variable value
1 1 n 143
2 2 n 148
3 3 n 41

I know this is an OLD question, and not likely to generate a lot of interest. I also can't quite figure out why you're doing what you demonstrate in your example. Nevertheless, to summarize the answer, either:
Wrap your df.cast in as.data.frame before "melting" again.
Ditch "reshape" and update to "reshape2". That wasn't applicable when you posted this question, since your question predates version 1 of "reshape2" by about half a year.
Here's a lengthier walkthrough:
First, we'll load "reshape" and "reshape2", perform your "casting", and rename your "n" variable. Obviously, objects appended with "R2" are those from "reshape2", and "R1", from "reshape".
library(reshape)
library(reshape2)
df.cast.R2 <- dcast(df, type~., sum)
df.cast.R1 <- cast(df, type~., sum)
names(df.cast.R1)[2] <- "n"
names(df.cast.R2)[2] <- "n"
Second, let's just have a quick look at what we've got now:
class(df.cast.R1)
# [1] "cast_df" "data.frame"
class(df.cast.R2)
# [1] "data.frame"
str(df.cast.R1)
# List of 2
# $ type: num [1:3] 1 2 3
# $ n : num [1:3] 143 148 41
# - attr(*, "row.names")= int [1:3] 1 2 3
# - attr(*, "idvars")= chr "type"
# - attr(*, "rdimnames")=List of 2
# ..$ :'data.frame': 3 obs. of 1 variable:
# .. ..$ type: num [1:3] 1 2 3
# ..$ :'data.frame': 1 obs. of 1 variable:
# .. ..$ value: Factor w/ 1 level "(all)": 1
str(df.cast.R2)
# 'data.frame': 3 obs. of 2 variables:
# $ type: num 1 2 3
# $ n : num 143 148 41
A few observations are obvious:
By looking at the output of class, you can guess that you won't have any problems doing what you're trying to do if you're using "reshape2"
Whoa. That output of str(df.cast.R1) is the strangest looking data.frame I've ever seen! It actually looks like there are two single variable data.frames in there.
With this new knowledge, and with the prerequisite that we do not want to change the class of your casted data.frame, let's proceed:
# You don't want this
melt(df.cast.R1, id="type", measure="n")
# type value value
# X.all. 1 143 (all)
# X.all..1 2 148 (all)
# X.all..2 3 41 (all)
# You *do* want this
melt(as.data.frame(df.cast.R1), id="type", measure="n")
# type variable value
# 1 1 n 143
# 2 2 n 148
# 3 3 n 41
# And the class has not been altered
class(df.cast.R1)
# [1] "cast_df" "data.frame"
# As predicted, this works too.
melt(df.cast.R2, id="type", measure="n")
# type variable value
# 1 1 n 143
# 2 2 n 148
# 3 3 n 41
If you're still working with cast from "reshape", consider upgrading to "reshape2", or write a convenience wrapper function around melt... perhaps melt2?
melt2 <- function(data, ...) {
  # Convert cast_df objects to plain data frames before melting
  if (inherits(data, "cast_df")) {
    data <- as.data.frame(data)
  }
  melt(data, ...)
}
Try it out on df.cast.R1:
melt2(df.cast.R1, id="type", measure="n")
# type variable value
# 1 1 n 143
# 2 2 n 148
# 3 3 n 41
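For what it's worth, the reason as.data.frame() changes the result appears to be ordinary S3 dispatch: melt is a generic, and "reshape" seems to provide a separate method for cast_df objects (I believe it is melt.cast_df) that relies on the extra attributes seen in the str() output above. The sketch below uses only base R and a made-up generic called describe (an assumption for illustration, not part of either package) to show how the leading class decides which method runs:

```r
# Hypothetical generic and methods, for illustration only
describe <- function(x, ...) UseMethod("describe")
describe.data.frame <- function(x, ...) "plain data.frame method"
describe.cast_df <- function(x, ...) "special cast_df method"

d <- data.frame(type = 1:3, n = c(143, 148, 41))
r1 <- describe(d)                       # "plain data.frame method"

# Put an extra class in front, as cast() does
class(d) <- c("cast_df", "data.frame")
r2 <- describe(d)                       # "special cast_df method"

# Resetting the class (what the question does) restores the plain method
class(d) <- "data.frame"
r3 <- describe(d)                       # "plain data.frame method"
```

Seen this way, the odd melt output is probably intended behaviour for objects that still carry the cast_df class, rather than a bug in melt itself.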

You need to melt the data frame before you cast it. Casting without melting first will yield all kinds of unexpected behavior, because reshape has to guess the structure of your data.
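A minimal sketch of that melt-first workflow, assuming "reshape2" is installed and starting from a wide version of the question's data (the object names wide, long, totals and back are made up for this example):

```r
library(reshape2)

# Wide input: one measured column per row
wide <- data.frame(type = c(1, 1, 2, 2, 3, 3),
                   n = c(71, 72, 68, 80, 21, 20))

# 1. Melt first, so the id and measure variables are explicit
long <- melt(wide, id.vars = "type", measure.vars = "n")

# 2. Cast the molten data, summing within each type
totals <- dcast(long, type ~ variable, fun.aggregate = sum, value.var = "value")

# 3. The result is a plain data.frame, so it melts again without surprises
back <- melt(totals, id.vars = "type")
```

Because the cast formula names the variable column, the result already has a column called n and no renaming is needed afterwards.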


Adding two columns in a tibble and saving the sum to third column is making the third column a dataframe

I am generating a report. When I tried to write a tibble with the xlsx package's write.xlsx, I got an error (even after wrapping the tibble in as.data.frame() inside the write.xlsx call).
Upon checking the tibble, I realized that when I added multiple columns and stored the result in another column in the tibble, the total column has become a dataframe.
Example:
> marks <- tibble(math = c(90,90,85,90),
+ physics = c(90,85,95,80),
+ Total = c(rep(NA,4)))
> marks
# A tibble: 4 x 3
math physics Total
<dbl> <dbl> <lgl>
1 90 90 NA
2 90 85 NA
3 85 95 NA
4 90 80 NA
> class(marks)
[1] "tbl_df" "tbl" "data.frame"
> str(marks)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ math : num 90 90 85 90
$ physics: num 90 85 95 80
$ Total : logi NA NA NA NA
> marks$Total <- marks[,1] + marks[,2]
> str(marks)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 4 obs. of 3 variables:
$ math : num 90 90 85 90
$ physics: num 90 85 95 80
$ Total :'data.frame': 4 obs. of 1 variable:
..$ math: num 180 175 180 170
>
As we can see above, I expected R's vectorized operations to apply, but the "Total" column became a data frame after I summed two columns and stored the result in it.
Could someone explain why this is happening, and how to perform this operation correctly?
Edited: OK seems like because tibble doesn't drop dimension, it was not like adding two vectors.
I think this happens because, by default, tibbles don't drop the 2nd dimension when you access part of them with [], whereas data frames do. Compare:
> marks[, 1]
# A tibble: 4 x 1
math
<dbl>
1 90
2 90
3 85
4 90
> marks_df = as.data.frame(marks)
> marks_df[ , 1]
[1] 90 90 85 90
So marks[, 1] + marks[, 2] is adding a tibble to a tibble and the result is a tibble.
To avoid this, you can either drop the 2nd dimension explicitly, or just use the column names:
marks$Total <- marks[,1, drop = TRUE] + marks[, 2, drop = TRUE]
marks$Total <- marks$math + marks$physics
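The same effect can be reproduced without tibble at all, which may make the cause easier to see. This is a small sketch in base R, where drop = FALSE is used to imitate what a tibble does by default (marks_df and kept are names chosen for this example):

```r
marks_df <- data.frame(math = c(90, 90, 85, 90),
                       physics = c(90, 85, 95, 80))

# drop = FALSE keeps each selection as a one-column data frame,
# which is how [, j] behaves on a tibble by default
kept <- marks_df[, 1, drop = FALSE] + marks_df[, 2, drop = FALSE]
class(kept)   # "data.frame"

# [[ extracts the column itself, for data frames and tibbles alike
marks_df$Total <- marks_df[["math"]] + marks_df[["physics"]]
str(marks_df$Total)
```

Using [[ or $ is usually the safer habit, since it behaves the same way whether the object is a data frame or a tibble.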

Choosing multiple columns and changing their classes using a lookup table in R?

Is it possible to use a lookup table to assign/change the classes of variables in a data frame in R? I have thousands of columns with messed up classes in one data frame (my_df), and a list of what they should be in another data frame (my_lt). As pseudocode, I was thinking of something like: take my_lt$variable_name, grep() it against colnames(my_df), and pass the matching columns through as.numeric if my_lt$variable_class == "numeric", with some form of if..else. Any help would be much appreciated!
input - my data frame (my_df)
my_df = data.frame(q1_hight_1=c(12,31,22,12),q1_hight_2=c(24,54,23,32),q1_hight_3=c(34,23,65,34),q2_shoe_size_1=c(2,2,3,4),q2_shoe_size_2=c(4,3,3,4))
input - my lookup table (my_lt)
my_lt = data.frame(variable_name=c("hight","shoe_size"),variable_class=c("numeric","integer"))
desired output (when checking classes)
$q1_hight_1
[1] "numeric"
$q1_hight_2
[1] "numeric"
$q1_hight_3
[1] "numeric"
$q2_shoe_size_1
[1] "integer"
$q2_shoe_size_2
[1] "integer"
This does the trick, given that there's no trap in the names you give to your variables (I use a very naïve grep).
library(dplyr)
library(purrr)
map2(as.character(my_lt$variable_name),
as.character(my_lt$variable_class),
function(nam,cl){ map(grep(nam,names(my_df)),function(i){class(my_df[[i]]) <<- cl})})
str(my_df)
# 'data.frame': 4 obs. of 5 variables:
# $ q1_hight_1 : num 12 31 22 12
# $ q1_hight_2 : num 24 54 23 32
# $ q1_hight_3 : num 34 23 65 34
# $ q2_shoe_size_1: int 2 2 3 4
# $ q2_shoe_size_2: int 4 3 3 4
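As an alternative, here is a sketch in base R that avoids <<- by assigning the converted columns back in a loop. It assumes that each variable_name matches the column names as a literal substring, and that each variable_class is a basic class that as() from the methods package can convert to; neither assumption is checked by the code.

```r
my_df <- data.frame(q1_hight_1 = c(12, 31, 22, 12),
                    q1_hight_2 = c(24, 54, 23, 32),
                    q1_hight_3 = c(34, 23, 65, 34),
                    q2_shoe_size_1 = c(2, 2, 3, 4),
                    q2_shoe_size_2 = c(4, 3, 3, 4))
my_lt <- data.frame(variable_name = c("hight", "shoe_size"),
                    variable_class = c("numeric", "integer"),
                    stringsAsFactors = FALSE)

for (k in seq_len(nrow(my_lt))) {
  # Columns whose name contains this lookup entry (literal match, no regex)
  cols <- grep(my_lt$variable_name[k], names(my_df), fixed = TRUE)
  # Convert each matching column to the class named in the lookup table
  my_df[cols] <- lapply(my_df[cols], methods::as, Class = my_lt$variable_class[k])
}

sapply(my_df, class)
```

With thousands of columns it may be worth checking afterwards which columns matched no lookup entry, or more than one.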

Remove duplicates in R without converting to numeric

I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates (for example, "- " appears twice):
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-"not meaningful for factors
I then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick - however - it converts the userNames to a numeric.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
Like @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted in many functions, including read.csv, read.table, and most data.frame functions including as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and your need to remove duplicates works:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.
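A small sketch of that suggestion, using read.csv's text argument in place of a real file (the CSV content and the csv_text name are made up for this example):

```r
# Stand-in for a CSV file on disk
csv_text <- "imageLikeCount,userName
3,testblabla
4,frenchfries
4,frenchfries
"

# Read user names as character strings rather than factors
data <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(data$userName)

# Fully duplicated rows are dropped; userName stays character
data <- unique(data)
```

In R 4.0.0 and later, stringsAsFactors already defaults to FALSE, so the argument mostly matters on older installations.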

Dplyr - Error: column '' has unsupported type

I have an odd issue when using dplyr on a data.frame to compute the number of missing observations for each group of a character variable. This creates the error "Error: column '' has unsupported type".
To replicate it I have created a subset. The subset rdata file is available here:
rdata file including dftest data.frame
First. Using the subset I have provided, the code:
dftest %>%
group_by(file) %>%
summarise(missings=sum(is.na(v131)))
Will create the error:
Error: column 'file' has unsupported type
The str(dftest) returns:
'data.frame': 756345 obs. of 2 variables:
$ file: atomic bjir31fl.dta bjir31fl.dta bjir31fl.dta bjir31fl.dta ...
..- attr(*, "levels")= chr
$ v131: Factor w/ 330 levels "not of benin",..: 6 6 6 6 1 1 1 9 9 9 ...
However, taking a subset of the subset, and running the dplyr command again, will create the expected output.
dftest <- dftest[1:756345,]
dftest %>%
group_by(file) %>%
summarise(missings=sum(is.na(v131)))
The str(dftest) now returns:
'data.frame': 756345 obs. of 2 variables:
$ file: chr "bjir31fl.dta" "bjir31fl.dta" "bjir31fl.dta" "bjir31fl.dta" ...
$ v131: Factor w/ 330 levels "not of benin",..: 6 6 6 6 1 1 1 9 9 9 ...
Does anyone have any suggestions about what might cause this error, and what to do about it? In my original file I have 300 variables, and dplyr states that most of these are of unsupported type.
Thanks.
This seems to be an issue with using filter when a column of the data frame has an attribute. For example,
> df = data.frame(x=1:10, y=1:10)
> filter(df, x==3) # Works
x y
1 3 3
Add an attribute to the x column. Notice that str(df) shows x as atomic now, and filter doesn't work:
> attr(df$x, 'width')='broad'
> str(df)
'data.frame': 10 obs. of 2 variables:
$ x: atomic 1 2 3 4 5 6 7 8 9 10
..- attr(*, "width")= chr "broad"
$ y: int 1 2 3 4 5 6 7 8 9 10
> filter(df, x==3)
Error: column 'x' has unsupported type
To make it work, remove the attribute:
> attr(df$x, 'width') = NULL
> filter(df, x==3)
x y
1 3 3
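With 300 variables, removing attributes one column at a time is tedious. Below is a rough sketch of a helper that does it for a whole data frame. The name drop_stray_attrs is made up, and the choice to touch only plain, unclassed vector columns (so factors, dates and similar are left alone) is an assumption about what is safe for your data.

```r
# Hypothetical helper: strip all attributes from plain atomic columns
drop_stray_attrs <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    # Skip classed columns (factor, Date, ...) and matrix columns
    if (is.atomic(col) && !is.object(col) && is.null(dim(col))) {
      attributes(col) <- NULL
      df[[nm]] <- col
    }
  }
  df
}

df <- data.frame(x = 1:10, y = 1:10)
attr(df$x, "width") <- "broad"
clean <- drop_stray_attrs(df)
str(clean)
```

Note that this also removes element names from a column, which is usually harmless inside a data frame.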

tapply function complains that args are unequal length yet they appear to match

Here is the failing call, error messages and some displays to show the lengths in question:
it <- tapply(molten, c(molten$Activity, molten$Subject, molten$variable), mean)
# Error in tapply(molten, c(molten$Activity, molten$Subject, molten$variable), :
# arguments must have same length
length(molten$Activity)
# [1] 679734
length(molten$Subject)
# [1] 679734
length(molten$variable)
# [1] 679734
dim(molten)
# [1] 679734 4
str(molten)
# 'data.frame': 679734 obs. of 4 variables:
# $ Activity: Factor w/ 6 levels "WALKING","WALKING_UPSTAIRS",..: 5 5 5 5 5 5 5 5 5 5 ...
# $ Subject : Factor w/ 30 levels "1","2","3","4",..: 2 2 2 2 2 2 2 2 2 2 ...
# $ variable: Factor w/ 66 levels "tBodyAcc-mean()-X",..: 1 1 1 1 1 1 1 1 1 1 ...
# $ value : num 0.257 0.286 0.275 0.27 0.275 ...
If you have a look at ?tapply you will see that X should be "an atomic object, typically a vector". You feed tapply with a data frame ("molten"), which is not an atomic object. See is.atomic, and try is.atomic(molten). Furthermore, your grouping variables should be provided as a list (see INDEX argument).
Something like this works:
tapply(X = warpbreaks$breaks, INDEX = list(warpbreaks$wool, warpbreaks$tension), mean)
# L M H
# A 44.55556 24.00000 24.55556
# B 28.22222 28.77778 18.77778
You need to have a single object for INDEX, but c() will string them all together, which is the source of the error, so use a list:
it <- tapply(molten$value, list(Act=molten$Activity, sub=molten$Subject, var=molten$variable), mean)
Better would be:
it <- with(molten , tapply(value, list(Act=Activity, Sub=Subject, var=variable), mean) )
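To see why c() triggers the length error, it can help to compare lengths directly. This sketch uses the built-in warpbreaks data rather than molten, and the names idx_c, idx_list and res are chosen for this example:

```r
n <- nrow(warpbreaks)

# c() concatenates the grouping columns end to end
idx_c <- c(warpbreaks$wool, warpbreaks$tension)
length(idx_c) == 2 * n   # TRUE: twice as long as the data

# list() keeps them as separate grouping variables of length n each
idx_list <- list(wool = warpbreaks$wool, tension = warpbreaks$tension)
lengths(idx_list)

# One cell per combination of wool and tension
res <- tapply(warpbreaks$breaks, idx_list, mean)
dim(res)
```

With three grouping variables, as in the question, the result of tapply is a three-dimensional array rather than a matrix.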
Did you ever get this solved? I had the same issue when reading in a CSV file, and could fix it by saving the original CSV file as CSV (delimiter separated) instead of CSV (delimiter separated, UTF-8). My dataset had German umlauts in it, though, so that might play a role as well.
