Group and summarise character in R? - r

I this in R
> con_promedio_por_curso_transpuesta
A B C D
Description "Abc" "Bcd" "Cde" "Def"
mean(X7) "5.000000" "4.105263" "4.733333" "4.680000"
mean(X8) "5.000000" "3.736842" "4.400000" "4.760000"
mean(X9) "5.000000" "3.950000" "4.600000" "4.840000"
mean(X10) "5.000000" "4.210526" "4.333333" "4.560000"
I want to delete the first row
"Abc" "Bcd" "Cde" "Def"
I run the next instruction:
without_first_row <- con_promedio_por_curso_transpuesta[-c(1),])
Until now, everything is OK, but. If I want to group and summarise the without_first_row I get an error.
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "c('matrix', 'character')"
I run the type of data,
> typeof(con_promedio_por_curso_transpuesta)
[1] "character"
How I cast "character" to any type of data for grouping ?
Thanks.

You have an error in your syntax near where you create without_first_row, follow along below:
con_promedio_por_curso_transpuesta <-
data.frame(
row.names = c('Description','mean(X7)','mean(X8)','mean(X9)','mean(X10)'),
'A' = c("Abc","5.000000","5.000000","5.000000","5.000000"),
'B' = c("Bcd","4.105263","3.736842","3.950000","4.210526"),
'C' = c("Cde","4.733333","4.400000","4.600000","4.333333"),
'D' = c("Def","4.680000","4.760000","4.840000","4.560000"),
'ID' = c(NA, 1, 1, 2, 2) # added for this example
)
## without_first_row <- con_promedio_por_curso_transpuesta[-c(1), ]) <- this is your error, you added a ')' unnecessarily
without_first_row <- con_promedio_por_curso_transpuesta[-c(1), ] # with fixed syntax
> class(without_first_row)
[1] "data.frame"
# just to show you can group_by and summarise with data
without_first_row %>%
mutate_at(.vars = vars(c(A,B,C,D)), funs(as.numeric)) %>%
group_by(ID) %>%
summarise_all(mean)

Related

Create binary yes/no animal variable based on match with any term in a dictionary, "animal" in R

Continuing off this question: R: Create category column reflecting match between a dictionary and column in df
I have a big dataset, "df", of 30,000 rows, and two big dictionary dataframes: (1) animal, 600k rows; (2)nature, 300k rows.
I am simply trying to figure out how to create two simple binary variables, "df$content_animal" and "df$content_nature" based on whether each row in df$content had any matches with "animal" or "nature" dictionaries. (1=match, 0=no match).
Below are the data samples, it's impossible for me to include the entire datasets here:
df <- tibble(content= c("hello turkey feet blah blah blah", "i love rabbits haha", "wow this sunlight is amazing", "omg did u see the rainbow?!", "turtles like swimming in the water", "i love running across grassy lawns with my dog"))
animal=c("turkey", "rabbit", "turtle", "dog", "cat", "bear")
nature=c("sunlight", "water", "rainbow", "grass", "lawn", "mountain", "ice")
I have tried the following codes based on multiple-pattern matches, to no success - I suspect it is bc of the largeness of both my dataset and dictionary/pattern:
df$content_animal <- grepl(paste(animal,collapse="|"),df$content,ignore.case=TRUE)
df$content_nature <- grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)
which returns the error:
Error in grepl(paste(animal,collapse="|"), df$content, :
invalid regular expression, reason 'Out of memory' Error in grepl(paste(nature,collapse="|"), df$content, :
invalid regular expression, reason 'Out of memory'
I also tried:
df<-df %>%
mutate(
content_animal = case_when(grepl(animal, content) ~ "1")
)
df<-df %>%
mutate(
content_nature = case_when(grepl(nature, content) ~ "1")
)
which returns the error:
Problem with `mutate()` input `content_animal`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_animal` is `case_when(grepl(animal, content) ~ "1")`.argument 'pattern' has length > 1 and only the first element will be used
Problem with `mutate()` input `content_nature`.
ℹ argument 'pattern' has length > 1 and only the first element will be used
ℹ Input `content_nature` is `case_when(grepl(nature, content) ~ "1")`.argument 'pattern' has length > 1 and only the first element will be used
I ALSO tried
bench::mark(basic = mutate(df, content_animal = 1L*map_lgl(content, ~any(str_detect(.x, animal))),
content_nature = 1L*map_lgl(content, ~any(str_detect(.x, nature)))),
fixed = mutate(df, content_animal = 1L*map_lgl(content, ~any(str_detect(.x, fixed(animal)))),
content_nature = 1L*map_lgl(content, ~any(str_detect(.x, fixed(nature))))))
which ran for over two hours, without giving me any output.
I'm really at a loss here as to what I should do. Does anyone have any ideas? It there a better package or code to use for my big data purposes???
It may be better to loop with lapply and Reduce
Reduce(`|`, lapply(nature, function(x) grepl(x, df$content, ignore.case = TRUE)))
#[1] FALSE FALSE TRUE TRUE TRUE TRUE
which is the same as
grepl(paste(nature,collapse="|"),df$content,ignore.case=TRUE)
#[1] FALSE FALSE TRUE TRUE TRUE TRUE
Here's an approach with the quanteda package, which has built-in functions for doing exactly what you want. (I tried this only on the sample dataset; I'd be interested to hear what its performance is on the whole thing.)
library(quanteda)
c = corpus(df$content)
d = dictionary(list(animal = animal, nature = nature))
df = cbind(df, convert(dfm(c, dictionary = d), to = "data.frame")[,-1])

Error message using mstate::msprep ??bug?

I have had a problem with an error abend using mstate::msprep to prepare my data for a pretty classical 3 state problem. I can run the code from the mstate package vignette with no difficulty. My problem is entirely parallel to the vignette example. Subjects receive an islet transplant, then may achieve insulin independence. Whether they do or do not, they may have islet graft failure (or loss of insulin independence if it was achieved.) The vignette example works with included covariates (retained by the keep = parameter). My version works fine if I don't include the keep parameter but fails consistently if I use the keep parameter. Since my example works perfectly well without the keep variable, I very much doubt that there is a problem with my main data. It must be some problem with the “keep” data. See below for the session output.
Neither data set has any missing data. I tried the vignette data limiting it to three covariates -- one categorical, one continuous, and the third with one of the event-time variables, exactly parallel to my three covariates. The vignette still works perfectly, but mine doesn’t. Both covariate "keep" lists are character vectors. In sum, I can't imagine a more parallel "real" question to the vignette example.
I have tracked the problem to a subroutine of msprep "msprepEngine" at line 85 at the second time through the processing loop, but I haven't been able to figure out what the problem is. I suspect that it is a bug, but since I can't identify it, I can't be sure.
I would be very grateful for anyone that can help me with this issue. The vignette code is available with the package. Unfortunately I am not free to share my problem's data, but as I said above, the program works perfectly without the keep parameter. There must be something about my "keep" covariates that is giving the system indigestion.
Thanks in advance for any suggestions.
Larry Hunsicker
> library(magrittr)
> library(survival)
> library(mstate)
>
> #Three state tmat:
> data(ebmt3)
> names(msbmt)
[1] "id" "from" "to" "trans" "Tstart" "Tstop" "time" "status" "dissub" "age"
[11] "prtime"
> dim(msbmt)
[1] 5577 11
> tmat <- trans.illdeath(names = c("Tx", "PR", "RelDeath"))
> covs <- c('dissub', 'age', 'drmatch', 'tcd', 'prtime')
> class(covs)
[1] "character"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"),
+ data = ebmt3, trans = tmat, id = 'id', keep = covs)
>
> names(insfree3)
[1] "PatientID" "YrFree" "Free" "YrLossFail" "LossFail" "StudyID" "IEQ_kg"
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID')
>
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> covs <- c('StudyID', 'IEQ_kg', 'YrFree')
> class(covs)
[1] "character"
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID', keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I found the problem, and it is a bug. I just don't know whose bug it is. msprep() works when data is a data.frame, but not when it is a tibble. My repro example:
> library(survival)
> library(mstate)
> library(dplyr)
> data(ebmt3)
> class(ebmt3)
[1] "data.frame"
> tmat <- transMat(x = list(c(2, 3), c(3), c()), names = c("Tx",
+ "PR", "RelDeath"))
> ebmt3$prtime <- ebmt3$prtime/365.25
> ebmt3$rfstime <- ebmt3$rfstime/365.25
> covs <- c("dissub", "age", "drmatch", "tcd", "prtime")
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
> ebmt3 <- as_tibble(ebmt3)
> class(ebmt3)
[1] "tbl_df" "tbl" "data.frame"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I tracked the error down to line 157 in msprep()
ddcovs <- lapply(1:nkeep, function(i) rep(keep[, i], tbl))
When data is a data.frame, this line works. When it is a tibble, it abends with the above error message.
It was my impression that things that work with a data.frame should also work with a tibble, since a tibble is a data.frame. So I'm not sure whether this is a bug in msprep() or in the code for a tibble. But the way to avoid the error is to be sure that the data parameter in the call to msprep() is a data.frame, but not a tibble.
Larry Hunsicker

R: object with negative row.name value

I think I have the same issue with this: What's the difference between row.names() and attributes$row.names?
When I use dput now I get something like this:
-0.0120067403271522, -0.00712477902137182, -0.0105058179972997,
-0.0115956365572667, -0.00507521571067687, -0.013870827853567,
-0.0160501419238977, -0.00225243465241482, -0.0145865320678265,
-0.00118232647592066, -0.0190385732141539, 0.0108223868283294,
-0.0159300331503545, 0.0319315053338279, 0, 0.00315703437341087,
0.0368045045454188, -0.0276264287281491, -0.0101235678857984,
0.00486601316019395)), class = "data.frame", row.names = c(NA,
-11834L))
I discovered this while I was trying to force define rownames(var) <- c(list_of_row_names).
I get the error:
Error in .rowNamesDF<-(x, value = value) : invalid 'row.names'
length`
The thing is this object has values inside it. Anyone can tell me how I can rewind/fix this?
From my understanding, this happened bc R didnt know row names when this object was created?
The length of that variable list_of_row_names does not match with the nrow() of the data frame
See an example given below:
df <- data.frame(1:5)
list_of_row_names <- letters[1:4]
rownames(df) <- list_of_row_names
Error in row.names<-.data.frame(*tmp*, value = value) :
invalid 'row.names' length
nrow(df)
#[1] 5
length(list_of_row_names)
# [1] 4

Error: target of assignment expands to non-language object

I am trying to dynamically clean up some column names for a large number of tables and I get the above error.
I have a gut feeling that I should be using quo but I have no idea on how to do that.
Any ideas?
The apply_alias applies a set of business rules to clean the names.
apply_alias <- function(l){
which(l=="Geography")
l[which(l=="Geography")] <- "GEO"
toupper(l)
}
The cleanup_column_names_tbl applies the alias_function to a list of table
cleanup_column_names_tbl <- function(PID){
for(p in PID){
names(get(paste0("tbl_",p))) <- apply_alias(names(get(paste0("tbl_",p))))
}
}
cleanup_column_names_tbl("14100287")
When I try to run it i get the following error message:
> cleanup_column_names_tbl("14100287")
Error in names(get(paste0("tbl_", p))) <- apply_alias(names(get(paste0("tbl_", :
target of assignment expands to non-language object
Sample data:
> dput(tbl_14100287[1,])
structure(list(V1 = 0L, REF_DATE = "1976-01", GEO = "Canada",
DGUID = "2016A000011124", `Labour force characteristics` = "Population",
Sex = "Both sexes", `Age group` = "15 years and over", Statistics = "Estimate",
`Data type` = "Seasonally adjusted", UOM = "Persons", UOM_ID = 249L,
SCALAR_FACTOR = "thousands", SCALAR_ID = 3L, VECTOR = "v2062809",
COORDINATE = "1.1.1.1.1.1", VALUE = 16852.4, STATUS = "",
SYMBOL = NA, TERMINATED = NA, DECIMALS = 1L), class = c("data.table",
"data.frame"), row.names = c(NA, -1L), .internal.selfref = <pointer: 0x000002123cf21ef0>)
You cannot assign a value to get since there is no function get<-. The right way of doing it would be something like the following.
apply_alias <- function(l){
l[which(l == "Geography")] <- "GEO"
toupper(l)
}
cleanup_column_names_tbl <- function(PID, envir = .GlobalEnv){
pid_full <- paste0("tbl_", PID)
res <- lapply(pid_full, function(p){
nms <- apply_alias(names(get(p)))
DF <- get(p)
names(DF) <- nms
DF
})
names(res) <- pid_full
list2env(res, envir = envir)
invisible(NULL)
}
cleanup_column_names_tbl("14100287")
names(tbl_14100287)
# [1] "V1" "REF_DATE"
# [3] "GEO" "DGUID"
# [5] "LABOUR FORCE CHARACTERISTICS" "SEX"
# [7] "AGE GROUP" "STATISTICS"
# [9] "DATA TYPE" "UOM"
#[11] "UOM_ID" "SCALAR_FACTOR"
#[13] "SCALAR_ID" "VECTOR"
#[15] "COORDINATE" "VALUE"
#[17] "STATUS" "SYMBOL"
#[19] "TERMINATED" "DECIMALS"
My solution:
Creating and expression and evaluating it. It is short but somehow does not feel like it is the proper way of doing things since I am breaking away from the functional paradigm of R.
cleanup_column_names_tbl <- function(PID){
for(p in PID){
expr1 <- paste0("names(", paste0("tbl_",p), ") <- apply_alias(names(", paste0("tbl_",p),"))")
eval(rlang::parse_expr(expr1))
}
}
Edit:
A slightly different way that:
avoids using strings
a little more flexible since it allows the use of strings and symbol
library(rlang)
test_df <- data.frame(a=1:10,b=1:10)
test_df2 <- data.frame(a=1:10,b=1:10)
fix_names <- function(df){
x <- ensym(df)
expr1 <- expr(names(!!x) <- toupper(names(!!x)))
eval(expr1, envir = parent.env(environment()))
# expr1
}
fix_names(test_df)
fix_names("test_df2")
names(test_df)
#> [1] "A" "B"
names(test_df2)
#> [1] "A" "B"

R and Data Selection

I have a data table dt, as given below:
structure(list(IM = c(0.830088495575221, 0.681436210847976, 0.498810939357907,
0.47265400115141, 0.527908540685945, 0.580763582966226, 0.408069043807859,
0.467368671545006, 0.44662887412295, 0.0331974034502217, 0.0368210899219588,
0.0333698233772947, 0.0294312465832275, 0.578743426515361, 0.566950053134963,
0.808756701221038, 0.585507838980771, 0.61507839619537, 0.586388329979879,
0.794196637085474), CM = c(0.876991150442478, 0.996180290297937,
0.651605231866825, 0.824409902130109, 0.94418291862811, 0.961820851688693,
0.943861532396347, 1.10137922144883, 1.1524325077831, 0.128868067469359,
0.155932251596297, 0.159414951213752, 0.196968075413411, 1.19678937171326,
0.901168969181722, 3.42528220866977, 2.4377239516641, 2.0040870054458,
1.86099597585513, 1.51928615911568), RM = c(0.601769911504425,
0.495034377387319, 0.405469678953627, 0.368451352907311, 0.361802286482851,
0.320851688693098, 0.791548118347242, 0.816050925099649, 0.786622368849031,
0.545805622636092, 0.594370732740163, 0.594771872860171, 0.536043514857356,
0.617215610296153, 0.619287991498406, 0.602602774009141, 0.634069706132375,
0.596543561108693, 0.582203219315895, 0.695985131558462)), .Names = c("IM", "CM", "RM"), class = c("data.table", "data.frame"), row.names
= c(NA,
-20L), .internal.selfref = <pointer: 0x00000000003f0788>)
I have written a function as given below:
DSanity.markWinsorize <- function(dt, colnames)
{
PERnames <- unlist(lapply(colnames, function(x) paste0("PER",x)));
print(dt[,colnames])
if(length(colnames)>1)
{dt[,PERnames] <- sapply(dt[,colnames], Num.calPtile);}
else
{dt[,PERnames] <- Num.calPtile(dt[,colnames]);}
return(dt)
}
## Calculate Percentile score of a data vector
Num.calPtile <- function(x)
{
return((ecdf(x))(x))
}
The job of this function is to create new columns, calculating the percentile of each of the data points for the columns provided to the function markWinsorize.
Here I am trying to run the function markWinsorize:
colnames <- c('CM','AM','BM')
DSanity.markWinsorize(dt,colnames)
I get the following error:
> sdc1 <- DSanity.markWinsorize(sdc,colnames)
[1] "CM" "AM" "BM"
Show Traceback
Rerun with Debug
Error in approxfun(vals, cumsum(tabulate(match(x, vals)))/n, method = "constant", :
zero non-NA points In addition: Warning message:
In xy.coords(x, y) : NAs introduced by coercion
It would be great if some of you can help me out here. Thanks.
Your approach is quite unwieldy. I recommend a completely new approach.
library(dplyr)
colnames <- c("CM", "AM", "BM")
dt %>%
select_(.dots = colnames) %>%
mutate_each(funs(ntile(., 100)))
I think this gives what you want (perhaps with the addition of %>% bind_cols(dt)).

Resources