I have a tree called mytree that looks like the following:
In R, I have it stored as a list:
mytree <- list(left = structure(list(y = -10, x = 10, grad = -10.5, sim_score = 110.25,
value = -10.5, criterion = "x < 15"), row.names = 1L, class = "data.frame"),
right = list(left = list(left = structure(list(y = 7, x = 20,
grad = 6.5, sim_score = 42.25, value = 6.5, criterion = "x < 22.5"), row.names = 2L, class = "data.frame"),
right = structure(list(y = 8, x = 25, grad = 7.5, sim_score = 56.25,
value = 7.5, criterion = "x >= 22.5"), row.names = 3L, class = "data.frame"),
root = list(root = structure(list(y = c(7, 8), x = c(20,
25), grad = c(6.5, 7.5), sim_score = c(98, 98), value = c(7,
7), criterion = c("x < 30", "x < 30")), row.names = 2:3, class = "data.frame"),
gain = 0.5)), right = structure(list(y = -7, x = 35,
grad = -7.5, sim_score = 56.25, value = -7.5, criterion = "x >= 30"), row.names = 4L, class = "data.frame"),
root = list(root = structure(list(y = c(7, 8, -7), x = c(20,
25, 35), grad = c(6.5, 7.5, -7.5), sim_score = c(14.0833333333333,
14.0833333333333, 14.0833333333333), value = c(2.16666666666667,
2.16666666666667, 2.16666666666667), criterion = c("x >= 15",
"x >= 15", "x >= 15")), row.names = 2:4, class = "data.frame"),
gain = 140.166666666667)), root = list(root = structure(list(
y = c(-10, 7, 8, -7), x = c(10, 20, 25, 35), grad = c(-10.5,
6.5, 7.5, -7.5), sim_score = c(4, 4, 4, 4)), row.names = c(NA,
-4L), class = "data.frame"), gain = 120.333333333333))
which looks like this
$left
y x grad sim_score value criterion
1 -10 10 -10.5 110.25 -10.5 x < 15
$right
$right$left
$right$left$left
y x grad sim_score value criterion
2 7 20 6.5 42.25 6.5 x < 22.5
$right$left$right
y x grad sim_score value criterion
3 8 25 7.5 56.25 7.5 x >= 22.5
$right$left$root
$right$left$root$root
y x grad sim_score value criterion
2 7 20 6.5 98 7 x < 30
3 8 25 7.5 98 7 x < 30
$right$left$root$gain
[1] 0.5
$right$right
y x grad sim_score value criterion
4 -7 35 -7.5 56.25 -7.5 x >= 30
$right$root
$right$root$root
y x grad sim_score value criterion
2 7 20 6.5 14.08333 2.166667 x >= 15
3 8 25 7.5 14.08333 2.166667 x >= 15
4 -7 35 -7.5 14.08333 2.166667 x >= 15
$right$root$gain
[1] 140.1667
$root
$root$root
y x grad sim_score
1 -10 10 -10.5 4
2 7 20 6.5 4
3 8 25 7.5 4
4 -7 35 -7.5 4
$root$gain
[1] 120.3333
The splits are stored under criterion, and the leave values are stored under value.
Given a new data point, x = 5, I would like to query mytree and see which leaf node this instance falls under. For x = 5, my function should output a value of -10.5 because 5 < 15. Similarly, if x = 25, then it should end up in the leaf with the value 7.5. Here are some more examples of what I'd like my pred_tree function to output:
newdata <- data.frame(x = c(5, 19, 18, 30))
> pred_tree(tree = mytree, newdata = newdata)
[1] -10.5
[2] 6.5
[3] 6.5
[4] -7.5
Here's what I have so far:
pred_tree <- function(tree, newdata){
for(i in length(tree)){
# Check if this is a leaf
if(length(tree[[i]]) == 1){
# Check criterion
if(eval(parse(text=tree[[i]]$criterion))){
# Return value of leaf
return(tree[[i]]$value[1])
}
}else if(length(tree[[i]]) > 1){
for(j in 1:length(tree[[i]])){
if(length(tree[[i]][[j]]) == 1){
# Check criterion
if(eval(parse(text=tree[[i]][[j]]$criterion))){
# Return value of leaf
return(tree[[i]][[j]]$value[1])
}
}
}
}
}
}
pred_tree(tree, newdata = newdata)
Unfortunately, this function is not returning the correct output. Also, this is rather clunky and can be very slow if I have many queries to run. I'm guessing using a recursive algorithm would make more sense instead of using nested for loops. Can anyone point me in the right direction?
############# EDIT #############
mytree3 <- list(left = list(left = structure(list(y = -10, x = 10, grad = 0,
sim_score = 0, value = 0, criterion = "x < 15"), row.names = 1L, class = "data.frame"),
right = structure(list(y = 7, x = 20, grad = -0.5, sim_score = 0.25,
value = -0.5, criterion = "x >= 15"), row.names = 2L, class = "data.frame"),
root = list(root = structure(list(y = c(-10, 7), x = c(10,
20), grad = c(0, -0.5), sim_score = c(0.125, 0.125), value = c(-0.25,
-0.25), criterion = c("x < 22.5", "x < 22.5")), row.names = 1:2, class = "data.frame"),
gain = 0.125)), right = list(left = structure(list(y = 8,
x = 25, grad = 0.5, sim_score = 0.25, value = 0.5, criterion = "x < 30"), row.names = 3L, class = "data.frame"),
right = structure(list(y = -7, x = 35, grad = 0, sim_score = 0,
value = 0, criterion = "x >= 30"), row.names = 4L, class = "data.frame"),
root = list(root = structure(list(y = c(8, -7), x = c(25,
35), grad = c(0.5, 0), sim_score = c(0.125, 0.125), value = c(0.25,
0.25), criterion = c("x >= 22.5", "x >= 22.5")), row.names = 3:4, class = "data.frame"),
gain = 0.125)), root = list(root = structure(list(y = c(-10,
7, 8, -7), x = c(10, 20, 25, 35), grad = c(0, -0.5, 0.5, 0),
sim_score = c(0, 0, 0, 0), value = c(0, 0, 0, 0)), row.names = c(NA,
-4L), class = "data.frame"), gain = 0.25))
Running the following did not give the right output
pred_tree(tree = mytree3, newdata = newdata)
A simple recursion that you can do can be:
.pred <- function(x, tree)
{
#Ensure you pass in a list and not a dataframe
if(is.data.frame(tree)) tree <- list(tree)
#Reorder the list if necessary
if(!is.data.frame(tree[[1]])) tree <- tree[c(2, 1, 3)]
# Check whether the condition is met. If so return
if (eval(parse(text=tree[[1]][["criterion"]]),list(x = x))) tree[[1]][["value"]][1]
else .pred(x, tree[[2]])
}
pred_tree <- function(tree, newdata)
{
cbind(newdata,pred = Vectorize(.pred,"x")(x= newdata$x,tree))
}
Now ou can call your function:
pred_tree(mytree,data.frame(x=c(1,10,15,18,19,22,23,25,29,30,33,35,100)))
x pred
1 1 -10.5
2 10 -10.5
3 15 6.5
4 18 6.5
5 19 6.5
6 22 6.5
7 23 7.5
8 25 7.5
9 29 7.5
10 30 -7.5
11 33 -7.5
12 35 -7.5
13 100 -7.5
Related
I have two dataframes in R that I'm trying to join together, but one of the columns has values that are off by one or two (specifically the yardline_100 column in each). Below is the code that I'm using to join the two:
fin_df <- df1 %>%
left_join(df2,
by = c("posteam" = "posteam",
"qtr" = "qtr",
"down" = "down",
"yardline_100" = "yardline_100"))
Is there any way to make it so that they join even if that one column is off by one or two? You'll notice that the last two values rows have different numbers in that column. Below are samples of the dataframes:
df1 <- structure(list(play_id = c(4596, 4629, 4658, 4682, 4723, 4766,
4790, 4828, 4849, 4878, 4899, 4938), posteam = c("MIN", "MIN",
"MIN", "MIN", "MIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN",
"CIN"), qtr = c(5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), yardline_100 = c(63,
58, 55, 50, 38, 61, 55, 52, 52, 20, 15, 15), down = c(2, 1, 2,
3, 1, 1, 2, 3, 4, 1, 2, 3)), row.names = c(NA, -12L), class = c("nflverse_data",
"tbl_df", "tbl", "data.table", "data.frame"), nflverse_timestamp = structure(1659046255.35538, class = c("POSIXct",
"POSIXt")), nflverse_type = "play by play", nflfastR_version = structure(list(
c(4L, 3L, 0L, 9020L)), class = c("package_version", "numeric_version"
)), .internal.selfref = <pointer: 0x0000021967f81ef0>)
df2 <- structure(list(posteam = c("MIN", "MIN", "MIN", "MIN", "MIN",
"CIN", "CIN", "CIN", "CIN", "CIN", "CIN", "CIN"), qtr = c(5,
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5), yardline_100 = c(63, 58, 55,
50, 38, 61, 55, 53, 52, 20, 16, 16), down = c(2, 1, 2, 3, 1,
1, 2, 3, 4, 1, 2, 3), play_id_SR = c("a9f97fb0-1407-11ec-ae9a-d77d9ecb2022",
"d49d54d0-1407-11ec-ae9a-d77d9ecb2022", "e8f74ad0-1407-11ec-ae9a-d77d9ecb2022",
"0208ae60-1408-11ec-ae9a-d77d9ecb2022", "257fd030-1408-11ec-ae9a-d77d9ecb2022",
"fe058030-1408-11ec-ae9a-d77d9ecb2022", "0da68200-1409-11ec-ae9a-d77d9ecb2022",
"26a5bd20-1409-11ec-ae9a-d77d9ecb2022", "70eacce0-1409-11ec-ae9a-d77d9ecb2022",
"99e5fb10-1409-11ec-ae9a-d77d9ecb2022", "a7646b00-1409-11ec-ae9a-d77d9ecb2022",
"de2683d0-1409-11ec-ae9a-d77d9ecb2022")), row.names = c(NA, -12L
), class = c("tbl_df", "tbl", "data.frame"))
An option is to use fuzzyjoin.
library(fuzzyjoin)
df1 %>%
fuzzy_left_join(
df2,
by = c("posteam", "qtr", "down", "yardline_100"),
match_fun = list(`==`, `==`, `==`, function(x, y) abs(x - y) <= 2)) %>%
select(-matches("(posteam|qtr|down).y")) %>%
rename_with(~str_remove(.x, "(?<=(posteam|qtr|down)).x"))
## A tibble: 12 x 7
# play_id posteam qtr yardline_100.x down yardline_100.y play_id_SR
# <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
# 1 4596 MIN 5 63 2 63 a9f97fb0-1407-11ec-ae9a-d77d9ecb2022
# 2 4629 MIN 5 58 1 58 d49d54d0-1407-11ec-ae9a-d77d9ecb2022
# 3 4658 MIN 5 55 2 55 e8f74ad0-1407-11ec-ae9a-d77d9ecb2022
# 4 4682 MIN 5 50 3 50 0208ae60-1408-11ec-ae9a-d77d9ecb2022
# 5 4723 MIN 5 38 1 38 257fd030-1408-11ec-ae9a-d77d9ecb2022
# 6 4766 CIN 5 61 1 61 fe058030-1408-11ec-ae9a-d77d9ecb2022
# 7 4790 CIN 5 55 2 55 0da68200-1409-11ec-ae9a-d77d9ecb2022
# 8 4828 CIN 5 52 3 53 26a5bd20-1409-11ec-ae9a-d77d9ecb2022
# 9 4849 CIN 5 52 4 52 70eacce0-1409-11ec-ae9a-d77d9ecb2022
#10 4878 CIN 5 20 1 20 99e5fb10-1409-11ec-ae9a-d77d9ecb2022
#11 4899 CIN 5 15 2 16 a7646b00-1409-11ec-ae9a-d77d9ecb2022
#12 4938 CIN 5 15 3 16 de2683d0-1409-11ec-ae9a-d77d9ecb2022
Note the matching function function(x, y) abs(x - y) <= 2 for column "yardline_100".
The last two lines (select(...) and rename_with(...)) are necessary to remove the duplicate columns: fuzzyjoin seems to create duplicate (i.e. ".x" and ".y"-suffixed) columns even on exact matches; the last two commands remove these duplicate exact match columns.
I have a tree called mytree that looks like this:
I have it stored as a list:
mytree <- list(list(structure(list(y = c(-10, 7, 8, -7), x = c(10, 20,
25, 35), grad = c(-10.5, 6.5, 7.5, -7.5), sim_score = c(4, 4,
4, 4), value = c(-1, -1, -1, -1)), row.names = c(NA, -4L), class = "data.frame")),
list(structure(list(y = -10, x = 10, grad = -10.5, sim_score = 110.25,
value = -10.5, gain = 120.333333333333, criterion = "x < 15"), row.names = 1L, class = "data.frame"),
structure(list(y = c(7, 8, -7), x = c(20, 25, 35), grad = c(6.5,
7.5, -7.5), sim_score = c(14.0833333333333, 14.0833333333333,
14.0833333333333), value = c(2.16666666666667, 2.16666666666667,
2.16666666666667), gain = c(120.333333333333, 120.333333333333,
120.333333333333), criterion = c("x >= 15", "x >= 15",
"x >= 15")), row.names = 2:4, class = "data.frame")),
list(NULL, NULL, structure(list(y = c(7, 8), x = c(20, 25
), grad = c(6.5, 7.5), sim_score = c(98, 98), value = c(7,
7), gain = c(140.166666666667, 140.166666666667), criterion = c("x < 30",
"x < 30")), row.names = 2:3, class = "data.frame"), structure(list(
y = -7, x = 35, grad = -7.5, sim_score = 56.25, value = -7.5,
gain = 140.166666666667, criterion = "x >= 30"), row.names = 4L, class = "data.frame")),
list(NULL, NULL, NULL, NULL, structure(list(y = 7, x = 20,
grad = 6.5, sim_score = 42.25, value = 6.5, gain = 0.5,
criterion = "x < 22.5"), row.names = 2L, class = "data.frame"),
structure(list(y = 8, x = 25, grad = 7.5, sim_score = 56.25,
value = 7.5, gain = 0.5, criterion = "x >= 22.5"), row.names = 3L, class = "data.frame"),
NULL, NULL))
and it looks like this:
[[1]]
[[1]][[1]]
y x grad sim_score value
1 -10 10 -10.5 4 -1
2 7 20 6.5 4 -1
3 8 25 7.5 4 -1
4 -7 35 -7.5 4 -1
[[2]]
[[2]][[1]]
y x grad sim_score value gain criterion
1 -10 10 -10.5 110.25 -10.5 120.3333 x < 15
[[2]][[2]]
y x grad sim_score value gain criterion
2 7 20 6.5 14.08333 2.166667 120.3333 x >= 15
3 8 25 7.5 14.08333 2.166667 120.3333 x >= 15
4 -7 35 -7.5 14.08333 2.166667 120.3333 x >= 15
[[3]]
[[3]][[1]]
NULL
[[3]][[2]]
NULL
[[3]][[3]]
y x grad sim_score value gain criterion
2 7 20 6.5 98 7 140.1667 x < 30
3 8 25 7.5 98 7 140.1667 x < 30
[[3]][[4]]
y x grad sim_score value gain criterion
4 -7 35 -7.5 56.25 -7.5 140.1667 x >= 30
[[4]]
[[4]][[1]]
NULL
[[4]][[2]]
NULL
[[4]][[3]]
NULL
[[4]][[4]]
NULL
[[4]][[5]]
y x grad sim_score value gain criterion
2 7 20 6.5 42.25 6.5 0.5 x < 22.5
[[4]][[6]]
y x grad sim_score value gain criterion
3 8 25 7.5 56.25 7.5 0.5 x >= 22.5
[[4]][[7]]
NULL
[[4]][[8]]
NULL
The first index of the list, i.e. 1, 2, 3, 4, correspond to the level, or height of the tree. The second index corresponds to the index of the node in the given level. For example, mytree[[1]][[1]] contains the root, which has child nodes in mytree[[2]][[1]] and mytree[[2]][[2]].
Given a parent node stored in mytree[[i]][[j]], its children are stored in mytree[[i + 1]][[2 * j]] and mytree[[i + 1]][[2 * j -1]].
I want to write a function called eval_tree that when given a new instance x, it will check which leaf node x falls into by checking the criterion of the splits and then output the value of the leaf, which is stored under value. Here is an example of how I'd like eval_tree to work:
newdata <- data.frame(x = c(10, 20, 25, 35))
> eval_tree(tree = mytree, newdata = newdata)
[1] -10.5
[2] 6.5
[3] 7.5
[4] -7.5
Here is what I have so far. Unfortunately it's not working...and I think I may need to implement the function recursively so that it's more efficient. Can anyone point me in the right direction?
eval_tree <- function(tree, newdata){
if(length(tree) == 1){
# If tree only has a root, return value of root
return(tree[[1]][[1]]$value[1])
}else if(length(tree) > 1){
for (level in 2:length(tree)){
for(ind in 1:length(tree[[level]]))
if(eval(parse(text = tree[[level]][[ind]][["criterion"]]))){
# Criterion is true, then go to child node
# Check if there is child node
if(is.null(tree[[level + 1]][[ind * 2]]) && is.null(tree[[level + 1]][[ind * 2 - 1]])){
return(tree[[level]][[ind]]$value[1])
}else if(eval(parse(text = tree[[level + 1]][[ind * 2]][["criterion"]]))){
# Criterion is true, then go to childi node
# I think this is where recursion would be more appropriate than all these nested loops
}
}
}
}
}
you can try something like this:
index <- function(x,tree,e, i = 1, j = 1)
{
if(nrow((tree[[i]][[j]])) == 1)
{
if(eval(parse(text=tree[[i]][[j]]$crite), list(x = x))) {
if(is.null(e$a)){
e$a <- i
e$b <- tree[[i]][[j]]$val
}
else if(e$a > i)e$b <- tree[[i]][[j]]$val
TRUE
}
else FALSE
}
else index(x, tree, e,i + 1,2*j-1) | index(x, tree,e,i+1, 2*j)
}
pred_tree <- function( tree,newdata){
cbind(newdata,pred = sapply(newdata$x,function(x){
e <- new.env()
index(x,tree,e)
e$b
}))
}
pred_tree(mytree,data.frame(x = c(10,20,25,30,25)))
x pred
1 10 -10.5
2 20 6.5
3 25 7.5
4 30 -7.5
5 25 7.5
I am trying to make two donut plots to compare some metrics. The data frame is as below,
new_sum var `1` `2`
<dbl> <chr> <dbl> <dbl>
1 98.7 cnt_alerts 45.1 NA
2 98.7 cnt_incidents_total 15.6 NA
3 98.7 sum_of_events 100 NA
4 100 cnt_alerts NA 44.4
5 100 cnt_incidents_total NA 16.2
6 100 sum_of_events NA 100
So the two plots should represent the 1 and 2 columns, but the row sum_of_events should be taken from new_sum column. So in the end the two plots will look as follows (Trying to replicate with paint)
DATA
structure(list(new_sum = c(98.7093505166464, 98.7093505166464,
98.7093505166464, 100, 100, 100), var = c("cnt_alerts", "cnt_incidents_total",
"sum_of_events", "cnt_alerts", "cnt_incidents_total", "sum_of_events"
), `1` = c(45.0519047096481, 15.6423424701131, 100, NA, NA, NA
), `2` = c(NA, NA, NA, 44.4483592005942, 16.201786624667, 100
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-6L))
Something like this?
library(tidyverse)
df1 %>%
mutate(
id = rep(1:2, each = 3),
value = coalesce(`1`, `2`),
value = ifelse(var == "sum_of_events", new_sum, value)
) %>%
ggplot(aes(var)) +
geom_col(aes(y = 100), position = 'identity', fill = 'white', col = 1, width = 0.5) +
geom_col(aes(y = value), position = 'identity', fill = 'grey60', col = 1, width = 0.5) +
facet_grid(~id) +
coord_polar(theta = 'y') +
theme_minimal()
I'm having a very strange error in a script that used to work perfectly and I don't know what's the problem. I start creating a very long list with several data frames with the exact number of columns. The list is called lst. Then I want to do a summarise table with means and sd. Here is the script for that:
w1 <- lapply(lst, function(i) t(cbind(Mean = colMeans(i[, c(6,7,8,9)], na.rm = TRUE),
Sds = colSds(as.matrix(i[, c(6,7,8,9)]), na.rm = TRUE),
N = length(i[,2]),
len.max=max(i[,6]))))
The number of the columns are correct. However when I run the script first I get the Debug location and when I stopped I get this error message:
Error in t(cbind(Mean = colMeans(i[, c(6, 7, 8, 9)], na.rm = TRUE), Sds = colSds(as.matrix(i[, :
error in evaluating the argument 'x' in selecting a method for function 't': Error in `[.data.frame`(i, , c(6, 7, 8, 9)) : undefined columns selected
I dont know whats wrong with the function. I try to search in the internet and I saw something about change as,matrix for data.matrix. However this does not make the trick.
Indeed I get the same problem for another function very similar:
a1 <- lapply(lst, function(i) t(cbind(l1 = NROW(which(i[,6]>1)),
l1.05 = NROW(which(i[,6]<=1)) - NROW(which(i[,6]>0.5)),
l05.03 = NROW(which(i[,6]>0.3)) - NROW(which(i[,6]<=0.5)),
l03 = NROW(which(i[,6]<=0.3)))))
With the same outcome:
Error in t(cbind(l1 = NROW(which(i[, 6] > 1)), l1.05 = NROW(which(i[, :
error in evaluating the argument 'x' in selecting a method for function 't': Error in `[.data.frame`(i, , 6) : undefined columns selected
Can someone point me out what is the problem. Do you need some data? Thanks!
I'm working with the last RStudio and with the following packages:
plyr, matrixStats,dplyr
Here is an example of the list:
> lst
[[1]]
X Chr new pos1 pos2 len nsnp n.ind per.ind
1 1 1 1 12900000 13700000 0.9 284.7560 23.77778 7.952434
2 2 1 2 17000000 17300000 0.4 126.5582 16.00000 5.351171
3 3 1 3 21200000 21500000 0.4 126.5582 40.75000 13.628763
4 4 1 4 45300000 45700000 0.5 158.1978 23.20000 7.759197
5 5 1 5 45900000 46600000 0.8 253.1165 31.12500 10.409699
[[2]]
X Chr new pos1 pos2 len nsnp n.ind per.ind
1 1 1 1 12900000 13700000 0.9 312.90267 24.44444 4.288499
2 2 1 2 21200000 21500000 0.4 139.06785 38.00000 6.666667
3 3 1 3 32600000 33000000 0.5 173.83482 28.40000 4.982456
4 4 1 4 35800000 36100000 0.4 139.06785 37.25000 6.535088
5 5 1 5 36300000 36300000 0.1 34.76696 22.00000 3.859649
[[3]]
X Chr new pos1 pos2 len nsnp n.ind per.ind
1 1 1 1 35700000 36500000 0.9 287.4214 12.22222 11.42264
2 2 1 2 45900000 46600000 0.8 255.4857 12.50000 11.68224
3 3 1 3 49400000 50700000 1.4 447.1000 21.78571 20.36048
4 4 1 4 51000000 52000000 1.1 351.2929 16.00000 14.95327
5 5 1 5 52200000 53000000 0.9 287.4214 19.66667 18.38006
dput(lst[1:3])
list(structure(list(X = 1:5, Chr = c(1L, 1L, 1L, 1L, 1L), new = 1:5,
pos1 = c(12900000, 1.7e+07, 21200000, 45300000, 45900000),
pos2 = c(13700000, 17300000, 21500000, 45700000, 46600000
), len = c(0.9, 0.4, 0.4, 0.5, 0.8), nsnp = c(284.756031128405,
126.558236057069, 126.558236057069, 158.197795071336, 253.116472114137
), n.ind = c(23.7777777777778, 16, 40.75, 23.2, 31.125),
per.ind = c(7.95243403939056, 5.35117056856187, 13.628762541806,
7.75919732441472, 10.4096989966555)), .Names = c("X", "Chr",
"new", "pos1", "pos2", "len", "nsnp", "n.ind", "per.ind"), row.names = c(NA,
5L), class = "data.frame"), structure(list(X = 1:5, Chr = c(1L,
1L, 1L, 1L, 1L), new = 1:5, pos1 = c(12900000, 21200000, 32600000,
35800000, 36300000), pos2 = c(13700000, 21500000, 3.3e+07, 36100000,
36300000), len = c(0.9, 0.4, 0.5, 0.4, 0.1), nsnp = c(312.90267141585,
139.0678539626, 173.83481745325, 139.0678539626, 34.76696349065
), n.ind = c(24.4444444444444, 38, 28.4, 37.25, 22), per.ind = c(4.28849902534113,
6.66666666666667, 4.98245614035088, 6.53508771929825, 3.85964912280702
)), .Names = c("X", "Chr", "new", "pos1", "pos2", "len", "nsnp",
"n.ind", "per.ind"), row.names = c(NA, 5L), class = "data.frame"),
structure(list(X = 1:5, Chr = c(1L, 1L, 1L, 1L, 1L), new = 1:5,
pos1 = c(35700000, 45900000, 49400000, 5.1e+07, 52200000
), pos2 = c(36500000, 46600000, 50700000, 5.2e+07, 5.3e+07
), len = c(0.9, 0.8, 1.4, 1.1, 0.9), nsnp = c(287.421428571429,
255.485714285714, 447.1, 351.292857142857, 287.421428571429
), n.ind = c(12.2222222222222, 12.5, 21.7857142857143,
16, 19.6666666666667), per.ind = c(11.4226375908619,
11.6822429906542, 20.3604806408545, 14.9532710280374,
18.380062305296)), .Names = c("X", "Chr", "new", "pos1",
"pos2", "len", "nsnp", "n.ind", "per.ind"), row.names = c(NA,
5L), class = "data.frame"))
I'm trying to use dplyr to have the variables that are factors be represented by their values after importing a SPSS dataset using haven.
Two questions:
1) how can I loop over the columns in the dataframe containing labels over the imported dataset using dplyr?
u<-which(sapply(i,function(x) !is.null(attr(x,"labels"))))
n<-mutate_each(i,(as_factor),... = u)
2) how can I set the correct date after importing .sav file from SPSS. i$e3 is a date, but I'm uncertain how I can convert it to proper r-lingo.
Dataset:
> dput(i)
structure(list(e = structure(c(1, 1, 2, 2, 1), label = "Sex", class = c("labelled",
"numeric"), labels = structure(c(1, 2), .Names = c("Male", "Female"
))), e2 = structure(c(3, 3, 3, 3, 3), label = "The time from injury to surgery", class = c("labelled",
"numeric"), labels = structure(c(1, 2, 3), .Names = c("< 12 hours",
"12 to 24 hours", "> 24 hours"))), e3 = structure(c(13254624000,
13431139200, 13437360000, 13493174400, 13233369600), label = "Surgery Date")), .Names = c("e",
"e2", "e3"), row.names = c(NA, -5L), class = "data.frame")
I'm not sure how to adjust your dates properly (you can change the / 10 to / 100 or 1000). You could do this with base r:
i <- structure(list(e = structure(c(1, 1, 2, 2, 1), label = "Sex", class = c("labelled",
"numeric"), labels = structure(c(1, 2), .Names = c("Male", "Female"
))), e2 = structure(c(3, 3, 3, 3, 3), label = "The time from injury to surgery", class = c("labelled",
"numeric"), labels = structure(c(1, 2, 3), .Names = c("< 12 hours",
"12 to 24 hours", "> 24 hours"))), e3 = structure(c(13254624000,
13431139200, 13437360000, 13493174400, 13233369600), label = "Surgery Date")), .Names = c("e",
"e2", "e3"), row.names = c(NA, -5L), class = "data.frame")
i$e3 <- as.POSIXct(i$e3 / 10, origin = '1970-01-01')
# e e2 e3
# 1 1 3 2012-01-01 19:00:00
# 2 1 3 2012-07-24 03:12:00
# 3 2 3 2012-07-31 08:00:00
# 4 2 3 2012-10-03 22:24:00
# 5 1 3 2011-12-08 04:36:00
i <- setNames(i, sapply(i, function(x) attr(x, 'label')))
i[] <- lapply(i, function(x) {
if (!is.null(lab <- attr(x, 'labels')))
names(lab[x])
else x
})
# Sex The time from injury to surgery Surgery Date
# 1 Male > 24 hours 2012-01-01 19:00:00
# 2 Male > 24 hours 2012-07-24 03:12:00
# 3 Female > 24 hours 2012-07-31 08:00:00
# 4 Female > 24 hours 2012-10-03 22:24:00
# 5 Male > 24 hours 2011-12-08 04:36:00