I have a Data Frame c1 as below:
str(c1)
#'data.frame': 2312 obs. of 6 variables:
# $ dt : Date, format: "2014-04-01" "2014-04-01" "2014-04-01" ...
# $ base : Factor w/ 2 levels "AA","AB": 1 1 1 2 2 2 2 1 1 1 ...
# $ curr : Factor w/ 5 levels "BA","BB","BC",..: 2 3 5 1 2 3 4 2 3 5 ...
# $ trans: int 72 176 4365 234 144 352 16762 61 160 4276 ...
# $ amt : num 2.18e+09 5.55e+09 9.99e+09 3.75e+08 4.37e+09 ...
# $ rate : num 1.11e-04 1.22e-02 1.26 3.94 5.65e+03 ...
d = "c1"
d
# [1] "c1"
Now when I use d instead of the actual data frame name, it does not work correctly:
i <- sapply( c1, is.factor)
i
# dt base curr trans amt rate
#FALSE TRUE TRUE FALSE FALSE FALSE
Correct!
i <- sapply( paste(d), is.factor)
i
# c1
#FALSE
Incorrect
i <- sapply( noquote(d), is.factor)
i
# c1
#FALSE
Incorrect
Is there a way to fix this?
Edit -
c1[i] <- lapply(c1[i], as.character)
Works
get(d)[i] <- lapply( get(d)[i], as.character)
Fails
for (j in 1:length(i)) { ifelse(is.factor(get(d)[j]),get(d)[i] <- as.character(get(d)[i])) }
Fails
Can get() be used everywhere, or are there only three or four ways to use it?
Thanks Again
If I understand correctly, you're looking for
xy <- data.frame(a = runif(3), b = letters[1:3])
sapply(get("xy"), is.factor)
Mind you, this is bad practice. If you're making up variable names on the fly, you should consider using another object, like a list, to store the data.frame(s).
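To make both suggestions concrete, here is a minimal sketch on a made-up stand-in for c1 (the toy data and the list name frames are assumptions, not from the question). paste(d) and noquote(d) still hand sapply() a character string, which is why is.factor returned a single FALSE, whereas get(d) returns the data frame itself. Because get(d)[i] <- ... is not accepted as an assignment target, the sketch fetches a copy, edits it, and writes it back with assign().

```r
# Toy stand-in for c1 (assumed data, much smaller than the real frame).
c1 <- data.frame(base = factor(c("AA", "AB")), trans = c(72L, 176L))
d <- "c1"

# Option 1: fetch by name, modify the copy, write it back under the same name.
tmp <- get(d)
i <- sapply(tmp, is.factor)
tmp[i] <- lapply(tmp[i], as.character)
assign(d, tmp)

# Option 2: keep the data frames in a named list and index it with the string.
frames <- list(c1 = data.frame(base = factor(c("AA", "AB")), trans = c(72L, 176L)))
fac <- sapply(frames[[d]], is.factor)
frames[[d]][fac] <- lapply(frames[[d]][fac], as.character)

str(c1)
str(frames[[d]])
```

The list version avoids get() and assign() entirely, which is usually easier to follow once there are several data frames to process.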
This works for now, although it's exceptionally hard to make sense of.
.eval <- function(evaltext,envir=sys.frame()) {
## evaluate a string as R code
eval(parse(text=evaltext), envir=envir)
}
.eval(paste( "i = sapply(",noquote(d),",is.factor)",sep=""))
.eval(paste( noquote(d),"[i] <- lapply(",noquote(d),"[i], as.character)",sep=""))
I am still looking for better alternatives. This is so bad that I cannot accept this as answer :-(
Thanks, Manish
Mutate in place is working fine as I set multiple dataframe columns blank if another dataframe column is blank. However, the mutated columns' types are changed. How to do this without changing column types?
Starting with data1:
I get data2:
Any ideas how to do this without changing any column types? Perhaps save all column types before the mutate and then set them back after the mutate?
Here's my code to create data1 and mutate to data2:
library(dplyr)
options(stringsAsFactors = FALSE)
col_1_ferment <- c(452,768,856,192,905,752) #numeric type
col_1_crutch <- c('15','34','56','49','28','37') #character type
col_1_grease <- c(TRUE,TRUE,FALSE,FALSE,TRUE,FALSE) #boolean type
col_1_pump <- as.factor(c("3","6","3","2","1","2")) #factor type
indicator_col <- c(2,NA,2,1,1,2) #numeric type
data1 <- data.frame(col_1_ferment, col_1_crutch, col_1_grease, col_1_pump, indicator_col, check.rows = TRUE)
data2 <- data1 %>% mutate(dplyr::across(starts_with("col_1_"), ~ ifelse(is.na(indicator_col), "", .x)))
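You can use NA instead of "". This keeps the numeric, character and logical columns as they are; note that ifelse() still turns the factor column into its integer codes, as the str() output below shows.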
data2 <- data1 %>% mutate(dplyr::across(starts_with("col_1_"), ~ ifelse(is.na(indicator_col), NA, .x)))
str(data2)
'data.frame': 6 obs. of 5 variables:
$ col_1_ferment: num 452 NA 856 192 905 752
$ col_1_crutch : chr "15" NA "56" "49" ...
$ col_1_grease : logi TRUE NA FALSE FALSE TRUE FALSE
$ col_1_pump : int 3 NA 3 2 1 2
$ indicator_col: num 2 NA 2 1 1 2
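If the factor column should stay a factor too, one option is to avoid ifelse(), which strips attributes, and blank the values in place with replace(). This is a minimal base R sketch on a reduced copy of data1, recreated here so the block runs on its own. The dplyr form in the comment is the same idea and is expected to behave the same way.

```r
data1 <- data.frame(
  col_1_ferment = c(452, 768, 856),
  col_1_crutch  = c("15", "34", "56"),
  col_1_grease  = c(TRUE, TRUE, FALSE),
  col_1_pump    = factor(c("3", "6", "3")),
  indicator_col = c(2, NA, 2),
  stringsAsFactors = FALSE
)

blank  <- is.na(data1$indicator_col)          # rows to wipe
target <- startsWith(names(data1), "col_1_")  # columns to wipe

data2 <- data1
# replace() assigns NA into the existing vector, so each column keeps its type.
data2[target] <- lapply(data1[target], function(x) replace(x, blank, NA))

# Same idea with dplyr:
# data1 %>% mutate(across(starts_with("col_1_"), ~ replace(.x, is.na(indicator_col), NA)))

sapply(data2, class)
```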
I have a dataset called marathon and I have tried to use lubridate and chron to convert the characters of marathon$Official.Time into time values in order to work on them. I would like the times to be shown in minutes (meaning that 2 hours are shown as 120 minutes).
'data.frame': 5616 obs. of 11 variables:
$ Overall.Position : int 1 2 3 4 5 6 7 8 9 10 ...
$ Gender.Position : int 1 2 3 4 5 6 7 8 9 10 ...
$ Category.Position: int 1 1 2 2 3 4 3 4 5 5 ...
$ Category : chr "MMS" "MMI" "MMI" "MMS" ...
$ Race.No : int 21080 14 2 21077 18 21 21078 21090 21084 12 ...
$ Country : chr "Kenya" "Kenya" "Ethiopia" "Kenya" ...
$ Official.Time : chr "2:12:12" "2:12:14" "2:12:20" "2:12:29" ...
I tried with:
library(lubridate)
times(marathon$Official.Time)
Or
library(chron)
chron(times=marathon$Official.Time)
as.difftime(marathon$Official.Time, units = "mins")
But I only get NA
You were almost there with difftime (which requires two times and gives you the difference). Instead, use as.difftime (which requires one "difference", i.e. the marathon time) and specify the format as hours:minutes:seconds.
> as.difftime("2:12:12", format="%H:%M:%S", units="mins")
Time difference of 132.2 mins
> as.numeric(as.difftime("2:12:12", format="%H:%M:%S", units="mins"))
[1] 132.2
No extra packages needed.
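The same call is vectorised, so it can be applied to a whole column. Here is a short sketch using a two-element vector as a stand-in for marathon$Official.Time (the name official is made up here). Note that %H only covers hours 0 to 23, so this assumes finishing times under 24 hours.

```r
official <- c("2:12:12", "2:12:14")   # stand-in for marathon$Official.Time

# Parse each "H:M:S" string as a duration and express it in minutes.
minutes <- as.numeric(as.difftime(official, format = "%H:%M:%S", units = "mins"))
minutes
```

Assigning the result back, for example to a new column of marathon, would follow the same pattern.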
NOTE: @mathemetical.coffee's solution is much better than these.
Pretty straightforward to kick it out manually:
library(stringi)
library(purrr)
df <- data.frame(Official.Time=c("2:12:12","2:12:14","2:12:20","2:12:29"),
stringsAsFactors=FALSE)
map(df$Official.Time, function(x) {
stri_split_fixed(x, ":")[[1]] %>%
as.numeric() %>%
`*`(c(60, 1, 1/60)) %>%
sum()
}) -> df$minutes
df
## Official.Time minutes
## 1 2:12:12 132.2
## 2 2:12:14 132.2333
## 3 2:12:20 132.3333
## 4 2:12:29 132.4833
You can also do it with just base R operations and w/o "piping":
df$minutes <- sapply(df$Official.Time, function(x) {
x <- strsplit(x, ":", TRUE)[[1]]
x <- as.numeric(x)
x <- x * (c(60, 1, 1/60))
sum(x)
}, USE.NAMES=FALSE)
If "stuck" with base R then I'd probably actually do:
vapply(df$Official.Time, function(x) {
x <- strsplit(x, ":", TRUE)[[1]]
x <- as.numeric(x)
x <- x * (c(60, 1, 1/60))
sum(x)
}, double(1), USE.NAMES=FALSE)
to ensure type safety.
But, chron can also be used:
library(chron)
60 * 24 * as.numeric(times(df$Official.Time))
NOTE that lubridate has no times() function.
I have 2 variables in a data frame with 300 observations.
$ imagelike: int 3 27 4 5370 ...
$ user: Factor w/ 24915 levels "\"0.1gr\"","\"008bla\"", ..
I then tried to remove the duplicates (for example, "- " appears 2 times):
testclean <- data1[!duplicated(data1), ]
This gives me the warning message:
In Ops.factor(left): "-" not meaningful for factors
I have then converted it to a matrix:
data2 <- data.matrix(data1)
testclean2 <- data2[!duplicated(data2), ]
This does the trick - however - it converts the userNames to a numeric.
=========================================================================
I am new but I have tried looking at previous posts on this topic (including the one below) but it did not work out:
Convert data.frame columns from factors to characters
Some sample data, from your image (please don't post images of data!):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : Factor w/ 5 levels "\"parmezan_pizza\"",..: 2 5 3 3 4 1
To fix the problem with factors as well as the embedded quotes:
data1$userName <- gsub('"', '', as.character(data1$userName))
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "testblabla" "test_00" "frenchfries" "frenchfries" ...
Like @DanielWinkler suggested, if you can change how the data is read in or defined, you might choose to include stringsAsFactors = FALSE (this argument is accepted in many functions, including read.csv, read.table, and most data.frame functions including as.data.frame and rbind):
data1 <- data.frame(imageLikeCount = c(3,27,4,4,16,103),
userName = c("\"testblabla\"", "test_00", "frenchfries", "frenchfries", "test.inc", "\"parmezan_pizza\""),
stringsAsFactors = FALSE)
str(data1)
# 'data.frame': 6 obs. of 2 variables:
# $ imageLikeCount: num 3 27 4 4 16 103
# $ userName : chr "\"testblabla\"" "test_00" "frenchfries" "frenchfries" ...
(Note that this still has embedded quotes, so you'll still need something like data1$userName <- gsub('"', '', data1$userName).)
Now, we have data that looks like this:
data1
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 4 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
and removing the duplicates now works:
data1[! duplicated(data1), ]
# imageLikeCount userName
# 1 3 testblabla
# 2 27 test_00
# 3 4 frenchfries
# 5 16 test.inc
# 6 103 parmezan_pizza
Try
data$userName <- as.character(data$userName)
And then
data<-unique(data)
You could also pass the argument stringsAsFactors = FALSE when reading the data. This is usually a good idea.
This seems like something that should be easy but I can't figure it out.
>d=data.table(x=1:5,y=11:15,z=letters[1:5])
>d
x y z
1: 1 11 a
2: 2 12 b
3: 3 13 c
4: 4 14 d
5: 5 15 e
Now, I have decided that row 3 is bad data. I want all of those set to NA.
d[3,]<-NA
Warning message:
In [<-.data.table(*tmp*, 3, , value = NA) :
Coerced 'logical' RHS to 'character' to match the column's type. Either change the target column to 'logical' first (by creating a new
'logical' vector length 5 (nrows of entire table) and assign that;
i.e. 'replace' column), or coerce RHS to 'character' (e.g. 1L,
NA_[real|integer]_, as.*, etc) to make your intent clear and for
speed. Or, set the column type correctly up front when you create the
table and stick to it, please.
Yet, it seems to work.
> d
x y z
1: 1 11 a
2: 2 12 b
3: NA NA NA
4: 4 14 d
5: 5 15 e
If I convert to data.frame, it also works but without the warning. But then I need to convert back which seems awkward. Is there a better way?
To set by reference.
DT[rownum, (names(DT)) := lapply(.SD, function(x) { .x <- x[1]; is.na(.x) <- 1L; .x})]
Or perhaps
DT[rownum, (names(DT)) := lapply(.SD[1,], function(x) { is.na(x) <- 1L; x})]
This will ensure that the correct NA type is created (factor and dates as well)
The second case only indexes once, this may be slightly faster if there are lots of columns in DT or rownum creates a large subgroup of rows.
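Why the is.na(x) <- 1L trick keeps the type can be seen without data.table. This is a small base R sketch (the helper name typed_na and the sample columns are made up for illustration): it takes one element of each column and marks it missing, so the class travels with it.

```r
typed_na <- function(x) {
  first <- x[1]        # keep one element so class and attributes survive
  is.na(first) <- 1L   # mark that element as missing without changing its type
  first
}

cols <- list(
  x    = 1:5,
  z    = letters[1:5],
  when = as.Date("2014-04-01") + 0:4,
  f    = factor(c("a", "b", "a", "b", "a"))
)

nas <- lapply(cols, typed_na)
vapply(nas, function(v) class(v)[1], character(1))
```

Each result is a length-one NA of the same class as its column, which is what lets the assignment go through without a coercion warning.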
You could also do the following (a variant on Roland's solution, but with no copying):
DT[rownum, (names(DT)) := .SD[NA]]
Use the explicit NA types:
d[3,] <- list(NA_integer_, NA_integer_, NA_character_)
Another possibility:
d[3,] <- d[3,lapply(.SD,function(x) x[NA])]
What about using ?set?
> d=data.table(x=1:5,y=11:15,z=letters[1:5])
> set(d, 3L, 1:3, NA_character_)
> d
x y z
1: 1 11 a
2: 2 12 b
3: NA NA NA
4: 4 14 d
5: 5 15 e
> str(d)
Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
$ x: int 1 2 NA 4 5
$ y: int 11 12 NA 14 15
$ z: chr "a" "b" NA "d" ...
- attr(*, ".internal.selfref")=<externalptr>
Or, simply:
> d=data.table(x=1:5,y=11:15,z=letters[1:5])
> d[3] <- NA_character_
> str(d)
Classes ‘data.table’ and 'data.frame': 5 obs. of 3 variables:
$ x: int 1 2 NA 4 5
$ y: int 11 12 NA 14 15
$ z: chr "a" "b" NA "d" ...
- attr(*, ".internal.selfref")=<externalptr>
[ From Matthew ] Yes, either set() is the way to go, or @mnel's answer is very neat:
DT[rownum, names(DT) := .SD[NA]]
On the presence or not of the coerce warning in the set approach, here's the internal code (modified here to convey the salient points). I seem to have had loss of precision (from double to integer) in mind when writing that, as well as inefficiency of coercing the RHS.
if( (isReal(RHS) && (TYPEOF(targetcol)==INTSXP || isLogical(targetcol))) ||
(TYPEOF(RHS)==INTSXP && isLogical(targetcol)) ||
(isString(targetcol))) {
if (isReal(RHS)) s3="; may have truncated precision"; else s3="";
warning("Coerced '%s' RHS to '%s' to match the column's type%s. ... <s3> ...
}
The full source of assign.c can be inspected here : https://r-forge.r-project.org/scm/viewvc.php/pkg/src/assign.c?view=markup&root=datatable
There is a very similar feature request to improve this :
FR#2551 Singleton := RHS no coerce warning if no precision lost
Have added a link there back to this question.
In general, where data.table is overcautious in warning you about potential problems or inefficiencies, as in this case where you want to set a set of columns of different types, wrapping the call in suppressWarnings() is another way.
Here is what I am doing now. OK, I guess, but still a little awkward.
na_datatable_row<-function(dtrow){
#take a row of data.table and return a row of the same table but
#with all values set to NA
#DT[rownum,]<-NA works but throws an annoying warning
#so instead, do DT[rownum,]<-na_datatable_row(DT[anyrow,])
#this preserves the right types
row=data.frame(dtrow)
row[1,]<-NA
return(data.table(row))
}
I've encountered a strange behaviour in cast/melt from the reshape package. If I cast a data.frame, and then try to melt it, the melt comes out wrong. Manually unsetting the "cast_df" class from the cast data.frame lets it be melted properly.
Does anyone know if this is intended behaviour, and if so, what is the use case when you'd want it?
A small code example which shows the behaviour:
> df <- data.frame(type=c(1, 1, 2, 2, 3, 3), variable="n", value=c(71, 72, 68, 80, 21, 20))
> df
type variable value
1 1 n 71
2 1 n 72
3 2 n 68
4 2 n 80
5 3 n 21
6 3 n 20
> df.cast <- cast(df, type~., sum)
> names(df.cast)[2] <- "n"
> df.cast
type n
1 1 143
2 2 148
3 3 41
> class(df.cast)
[1] "cast_df" "data.frame"
> melt(df.cast, id="type", measure="n")
type value value
X.all. 1 143 (all)
X.all..1 2 148 (all)
X.all..2 3 41 (all)
> class(df.cast) <- "data.frame"
> class(df.cast)
[1] "data.frame"
> melt(df.cast, id="type", measure="n")
type variable value
1 1 n 143
2 2 n 148
3 3 n 41
I know this is an OLD question, and not likely to generate a lot of interest. I also can't quite figure out why you're doing what you demonstrate in your example. Nevertheless, to summarize the answer, either:
Wrap your df.cast in as.data.frame before "melting" again.
Ditch "reshape" and update to "reshape2". That wasn't applicable when you posted this question, since your question predates version 1 of "reshape2" by about half a year.
Here's a lengthier walkthrough:
First, we'll load "reshape" and "reshape2", perform your "casting", and rename your "n" variable. Obviously, objects appended with "R2" are those from "reshape2", and "R1", from "reshape".
library(reshape)
library(reshape2)
df.cast.R2 <- dcast(df, type~., sum)
df.cast.R1 <- cast(df, type~., sum)
names(df.cast.R1)[2] <- "n"
names(df.cast.R2)[2] <- "n"
Second, let's just have a quick look at what we've got now:
class(df.cast.R1)
# [1] "cast_df" "data.frame"
class(df.cast.R2)
# [1] "data.frame"
str(df.cast.R1)
# List of 2
# $ type: num [1:3] 1 2 3
# $ n : num [1:3] 143 148 41
# - attr(*, "row.names")= int [1:3] 1 2 3
# - attr(*, "idvars")= chr "type"
# - attr(*, "rdimnames")=List of 2
# ..$ :'data.frame': 3 obs. of 1 variable:
# .. ..$ type: num [1:3] 1 2 3
# ..$ :'data.frame': 1 obs. of 1 variable:
# .. ..$ value: Factor w/ 1 level "(all)": 1
str(df.cast.R2)
# 'data.frame': 3 obs. of 2 variables:
# $ type: num 1 2 3
# $ n : num 143 148 41
A few observations are obvious:
By looking at the output of class, you can guess that you won't have any problems doing what you're trying to do if you're using "reshape2"
Whoa. That output of str(df.cast.R1) is the strangest looking data.frame I've ever seen! It actually looks like there are two single variable data.frames in there.
With this new knowledge, and with the prerequisite that we do not want to change the class of your casted data.frame, let's proceed:
# You don't want this
melt(df.cast.R1, id="type", measure="n")
# type value value
# X.all. 1 143 (all)
# X.all..1 2 148 (all)
# X.all..2 3 41 (all)
# You *do* want this
melt(as.data.frame(df.cast.R1), id="type", measure="n")
# type variable value
# 1 1 n 143
# 2 2 n 148
# 3 3 n 41
# And the class has not been altered
class(df.cast.R1)
# [1] "cast_df" "data.frame"
# As predicted, this works too.
melt(df.cast.R2, id="type", measure="n")
# type variable value
# 1 1 n 143
# 2 2 n 148
# 3 3 n 41
If you're still working with cast from "reshape", consider upgrading to "reshape2", or write a convenience wrapper function around melt... perhaps melt2?
melt2 <- function(data, ...) {
  # Drop the "cast_df" class so melt() treats the input as a plain data.frame
  if (inherits(data, "cast_df")) {
    data <- as.data.frame(data)
  }
  melt(data, ...)
}
Try it out on df.cast.R1:
melt2(df.cast.R1, id="type", measure="n")
# type variable value
# 1 1 n 143
# 2 2 n 148
# 3 3 n 41
You need to melt the data frame before you cast it. Casting without melting first will yield all kinds of unexpected behavior, because reshape has to guess the structure of your data.