Toy data:
Say I have this df:
df <- structure(list(x = structure(c(NA, 7L, NA, NA, 4L, 6L, 6L, 2L,
3L, 5L, 8L, 4L, 7L, 3L, 5L, 1L, 5L, 5L, 5L, NA), .Label = c("1",
"2", "3", "4", "5", "6", "7", "8"), class = "factor"), y = structure(c(NA,
2L, 3L, 2L, 2L, 2L, 2L, 1L, 3L, NA, 2L, 3L, 1L, 1L, 3L, 2L, 2L,
3L, 2L, 2L), .Label = c("1", "2", "3"), class = "factor"), z = structure(c(NA,
4L, 4L, 4L, 5L, 4L, 5L, 5L, 2L, NA, 4L, 1L, 1L, 3L, 2L, 5L, 2L,
2L, 4L, NA), .Label = c("1", "2", "3", "4", "5"), class = "factor"),
a = c(-32L, -51L, -22L, 44L, 55L, -24L, -50L, 67L, 1L, -47L,
66L, -98L, -91L, -42L, -89L, -31L, -8L, -33L, 38L, 61L),
b = c(46L, -19L, -37L, 47L, -28L, -48L, 14L, -10L, -13L,
-31L, 32L, 21L, -21L, 25L, -8L, 42L, -26L, -24L, 36L, -39L
)), row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"
))
df
# A tibble: 20 × 5
x y z a b
<fct> <fct> <fct> <int> <int>
1 NA NA NA -32 46
2 7 2 4 -51 -19
3 NA 3 4 -22 -37
4 NA 2 4 44 47
5 4 2 5 55 -28
6 6 2 4 -24 -48
7 6 2 5 -50 14
8 2 1 5 67 -10
9 3 3 2 1 -13
10 5 NA NA -47 -31
11 8 2 4 66 32
12 4 3 1 -98 21
13 7 1 1 -91 -21
14 3 1 3 -42 25
15 5 3 2 -89 -8
16 1 2 5 -31 42
17 5 2 2 -8 -26
18 5 3 2 -33 -24
19 5 2 4 38 36
20 NA 2 NA 61 -39
I want to normalize variables x, y, and z on a 0-1 scale, and then produce some summary stats on them. I can produce the summary stats just fine using the code below.
Code that works:
library(tidyverse)
vars <- c('x', 'y', 'z')
names(vars) <- vars
summary_stats <- function(data){
  tibble(
    n = sum(!is.na(data)),
    mean = round(mean(as.numeric(data), na.rm = TRUE), digits = 3),
    sd = round(sd(as.numeric(data), na.rm = TRUE), digits = 3),
    se = round(sd / sqrt(n), digits = 3)
  )
}
table <- map_df(
  df %>% dplyr::select(vars),
  summary_stats,
  .id = "covariate"
)
table
# A tibble: 3 × 5
covariate n mean sd se
<chr> <int> <dbl> <dbl> <dbl>
1 x 16 4.75 1.88 0.47
2 y 18 2.11 0.676 0.159
3 z 17 3.35 1.41 0.342
Code that doesn't work:
But I'm struggling to figure out how to normalize the variables. My latest attempt is this:
summary_stats <- function(data){
  data_norm <- drop_na(data) %>% dplyr::summarize(
    (as.numeric(data) - min(as.numeric(data))) /
      (max(as.numeric(data)) - min(as.numeric(data)))
  )
  tibble(
    n = sum(!is.na(data_norm)),
    mean = round(mean(as.numeric(data_norm), na.rm = TRUE), digits = 3),
    sd = round(sd(as.numeric(data_norm), na.rm = TRUE), digits = 3),
    se = round(sd / sqrt(n), digits = 3)
  )
}
table <- map_df(
  df %>% dplyr::select(vars),
  summary_stats,
  .id = "covariate"
)
Errors:
But this returns the error
Error in UseMethod("drop_na_") : no applicable method for 'drop_na_' applied to an object of class "factor"
If I convert it to a numeric on the fly, so I have data_norm <- drop_na(as.numeric(data)) etc., I then get a very similar error saying
Error in UseMethod("drop_na_") : no applicable method for 'drop_na_' applied to an object of class "c('double', 'numeric')"
However, if I do this outside of the function, it works fine:
df %>%
  drop_na(x) %>%
  summarise(std_mean = (as.numeric(x) - min(as.numeric(x))) /
              (max(as.numeric(x)) - min(as.numeric(x))))
# A tibble: 16 × 1
std_mean
<dbl>
1 0.857
2 0.429
3 0.714
4 0.714
5 0.143
6 0.286
7 0.571
....
I need to remove the NA values, or the normalized variable will be all NAs whenever there is at least one NA in that column. And if I apply drop_na() outside the function (to the master tibble I feed into map_df()), it drops every row that has at least one NA in any variable of the df, rather than just the NAs in that column.
Can anyone help here?
Update:
If I remove the drop_na() call from the function, I get the following error:
Error in UseMethod("summarise") :
no applicable method for 'summarise' applied to an object of class "c('double', 'numeric')"
This makes zero sense to me (I'm probably misunderstanding it), as summarise() definitely works with numeric variables...
Looks like what's happening is that you're trying to write a function that takes an entire data frame as an argument, but when you go to map it, you're actually only passing a single vector (e.g. df$x) to the function. That works fine for the first version of your function, but in the second version drop_na() fails because it takes an entire data frame as its argument. The same goes for summarize(), which is why you got a similar error. It also works outside of your function because there you pass the whole data frame and name the column.
So, what I did was swap out drop_na for na.omit, and also reorganized your code a bit.
First, let's just define a separate std_mean function so we don't have to deal with summarize:
std_mean <- function(x){
  x <- na.omit(x)
  (as.numeric(x) - min(as.numeric(x))) / (max(as.numeric(x)) - min(as.numeric(x)))
}
Now we can go back and fix your original function:
summary_stats <- function(vec){
  data_norm <- std_mean(vec)
  n <- length(data_norm)
  sd <- round(sd(as.numeric(data_norm), na.rm = TRUE), digits = 3)
  data.frame(
    n = n,
    mean = round(mean(as.numeric(data_norm), na.rm = TRUE), digits = 3),
    sd = sd,
    se = round(sd / sqrt(n), digits = 3)
  )
}
We have to define n and sd beforehand because they are used in the expressions for later columns, and base data.frame() doesn't let one column's definition refer to another column created in the same call.
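As an aside, tibble() does evaluate its arguments sequentially, which is why your original function could refer to n and sd when computing se. A minimal sketch of the difference:
library(tibble)
tibble(n = 16, se = 1 / sqrt(n))       # works: se can refer to the n just defined
# data.frame(n = 16, se = 1 / sqrt(n)) # fails unless n already exists in the caller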
And now we're ready to map:
map(df[vars], summary_stats)
$x
n mean sd se
1 16 0.536 0.269 0.067
$y
n mean sd se
1 18 0.556 0.338 0.08
$z
n mean sd se
1 17 0.588 0.353 0.086
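If you'd rather end up with a single table like your original summary, purrr's map_dfr() can row-bind the per-column results with the same .id argument; a minimal sketch:
map_dfr(df[vars], summary_stats, .id = "covariate")
#>   covariate  n  mean    sd    se
#> 1         x 16 0.536 0.269 0.067
#> 2         y 18 0.556 0.338 0.080
#> 3         z 17 0.588 0.353 0.086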
I'm trying to produce a correlation plot for my data, but I get an 'x must be numeric' error, and other fixes have not worked for my case. Do I have to change the month to numeric as well? Or is there a way of selecting only the numeric columns for my plot?
I tried converting everything to numeric, but it just changes back to factor automatically.
getwd()
myDF <- read.csv("qbase.csv")
head(myDF)
str(myDF)
cp <- cor(myDF)
head(round(cp, 2))
'data.frame': 12 obs. of 8 variables:
$ Month : Factor w/ 12 levels "18-Apr","18-Aug",..: 5 4 8 1 9 7 6 2 12 11 ...
$ Monthly.Recurring.Revenue: Factor w/ 2 levels "$25,000 ","$40,000 ": 1 1 1 1 1 2 2 2 2 2 ...
$ Price.per.Seat : Factor w/ 2 levels "$40 ","$50 ": 2 2 2 2 2 1 1 1 1 1 ...
$ Paid.Seats : int 500 500 500 500 500 1000 1000 1000 1000 1000 ...
$ Active.Users : int 10 50 50 100 450 550 800 900 950 800 ...
$ Support.Cases : int 0 0 1 5 35 155 100 75 50 45 ...
$ Users.Trained : int 1 5 0 50 100 300 50 30 0 100 ...
$ Features.Used : int 5 5 5 5 8 9 9 10 15 15 ...
The results of dput(myDF) are as follows:
dput( myDF)
structure(list(Month = structure(c(5L, 4L, 8L, 1L, 9L, 7L, 6L,
2L, 12L, 11L, 10L, 3L), .Label = c("18-Apr", "18-Aug", "18-Dec",
"18-Feb", "18-Jan", "18-Jul", "18-Jun", "18-Mar", "18-May", "18-Nov",
"18-Oct", "18-Sep"), class = "factor"), Monthly.Recurring.Revenue = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("$25,000 ",
"$40,000 "), class = "factor"), Price.per.Seat = structure(c(2L,
2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("$40 ",
"$50 "), class = "factor"), Paid.Seats = c(500L, 500L, 500L,
500L, 500L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L, 1000L),
Active.Users = c(10L, 50L, 50L, 100L, 450L, 550L, 800L, 900L,
950L, 800L, 700L, 600L), Support.Cases = c(0L, 0L, 1L, 5L,
35L, 155L, 100L, 75L, 50L, 45L, 10L, 5L), Users.Trained = c(1L,
5L, 0L, 50L, 100L, 300L, 50L, 30L, 0L, 100L, 50L, 0L), Features.Used = c(5L,
5L, 5L, 5L, 8L, 9L, 9L, 10L, 15L, 15L, 15L, 15L)), class = "data.frame", row.names = c(NA,
-12L))
You can convert dates to POSIXct and also remove the dollar sign to convert the second and third columns to numeric:
myDF$Month <- as.numeric(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
cp <- cor(myDF)
library(ggcorrplot)
ggcorrplot(cp)
You are trying to get a correlation between factor and numeric columns, which can't happen (cor() handles only numerics, hence the error). You can do:
library(data.table)
ir <- data.table(iris) # since you didn't produce a reproducible example
ir[, cor(.SD), .SDcols = names(ir)[(lapply(ir, class) == "numeric")]]
What's in there:
cor(.SD) calculates the correlation matrix for a new data frame composed of a subset data.table (.SD, see ?data.table).
.SDcols establishes which columns go into that subset data.table: only those whose class is numeric.
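If you'd rather avoid data.table, the same numeric-only selection can be sketched in base R with sapply():
num_cols <- sapply(iris, is.numeric)  # TRUE for the four measurement columns
cor(iris[, num_cols])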
You can remove the dollar sign and change the integer variables to numeric using sapply, then calculate the correlation.
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF[,2:8],as.numeric)
cor(newdf)
Edited:
If you want to use the month variable, install lubridate and use its month() function.
For example:
library(lubridate)
myDF$Month<- month(as.POSIXct(myDF$Month, format="%d-%b", tz="GMT"))
myDF[,c(2,3)] <- sapply(myDF[,c(2,3)], function(x) as.numeric(gsub("[\\$,]", "", x)))
newdf <- sapply(myDF,as.numeric)
cor(as.data.frame(newdf))
The way to convert those months to Date class:
myDF$MonDt <- as.Date( paste0(myDF$Month, "-15"), format="%y-%b-%d")
Could also have used zoo::as.yearmon. Either method would allow you to apply as.numeric to get a valid time-scaled value. The other answers are adequate for single-year data, but because they incorrectly assume that the leading two digits are the day of the month rather than the year, they will fail to deliver valid answers on any multi-year dataset, and they will not throw any warning about it.
with(myDF, cor(Active.Users, as.numeric(MonDt) ) )
[1] 0.8269705
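To see the multi-year point concretely, here is a small sketch (it assumes English month abbreviations, since %b is locale-dependent): with %y-%b-%d the leading digits are read as the year, so months from different years keep their proper order.
mons <- c("18-Dec", "19-Jan")
as.Date(paste0(mons, "-15"), format = "%y-%b-%d")
#> [1] "2018-12-15" "2019-01-15"
# a %d-%b format would instead read 18 and 19 as days of the month in the
# current year, silently scrambling the time scale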
As one of the other answers illustrated, removing the $ and commas is needed before as.numeric will succeed on currency-formatted text. Again, this is also factor data, so as.numeric could have yielded erroneous answers, although in this simple example it would not. A safe method would be:
myDF[2:3] <- lapply(myDF[2:3], function(x) as.numeric( gsub("[$,]", "", x)))
myDF
Month Monthly.Recurring.Revenue Price.per.Seat Paid.Seats Active.Users
1 18-Jan 25000 50 500 10
2 18-Feb 25000 50 500 50
3 18-Mar 25000 50 500 50
4 18-Apr 25000 50 500 100
5 18-May 25000 50 500 450
6 18-Jun 40000 40 1000 550
7 18-Jul 40000 40 1000 800
8 18-Aug 40000 40 1000 900
9 18-Sep 40000 40 1000 950
10 18-Oct 40000 40 1000 800
11 18-Nov 40000 40 1000 700
12 18-Dec 40000 40 1000 600
Support.Cases Users.Trained Features.Used MonDt
1 0 1 5 2018-01-15
2 0 5 5 2018-02-15
3 1 0 5 2018-03-15
4 5 50 5 2018-04-15
5 35 100 8 2018-05-15
6 155 300 9 2018-06-15
7 100 50 9 2018-07-15
8 75 30 10 2018-08-15
9 50 0 15 2018-09-15
10 45 100 15 2018-10-15
11 10 50 15 2018-11-15
12 5 0 15 2018-12-15
This question gets an answer that allows multiple correlation coefficients to be calculated and the two-way data associations plotted on one page:
How to add p values for correlation coefficients plotted using splom in lattice?
I have the following data representing sequences of a person's choices among four values (f1, f2, c1, c2):
df=structure(list(combi = structure(c(24L, 8L, 3L, 19L, 4L, 23L,
15L, 12L, 14L, 22L, 5L, 13L, 18L, 9L, 2L, 25L, 11L, 7L, 21L,
10L, 6L, 17L, 20L, 16L), .Label = c("", "c1-c2-f1-f2", "c1-c2-f2-f1",
"c1-f1-c2-f2", "c1-f1-f2-c2", "c1-f2-c2-f1", "c1-f2-f1-c2", "c2-c1-f1-f2",
"c2-c1-f2-f1", "c2-f1-c1-f2", "c2-f1-f2-c1", "c2-f2-c1-f1", "c2-f2-f1-c1",
"f1-c1-c2-f2", "f1-c1-f2-c2", "f1-c2-c1-f2", "f1-c2-f2-c1", "f1-f2-c1-c2",
"f1-f2-c2-c1", "f2-c1-c2-f1", "f2-c1-f1-c2", "f2-c2-c1-f1", "f2-c2-f1-c1",
"f2-f1-c1-c2", "f2-f1-c2-c1"), class = "factor"), nb = c(10L,
0L, 2L, 4L, 1L, 5L, 1L, 2L, 1L, 3L, 1L, 0L, 3L, 5L, 0L, 18L,
5L, 2L, 5L, 0L, 4L, 4L, 11L, 2L)), .Names = c("combi", "nb"), class = "data.frame", row.names = c(1L,
3L, 5L, 7L, 9L, 11L, 13L, 15L, 17L, 19L, 21L, 23L, 25L, 27L,
29L, 31L, 33L, 35L, 37L, 39L, 41L, 43L, 45L, 47L))
I'm wondering if there's a tree representation (or something similar) that could quantify the number of choices at each step, taking into account the sub-chains that are common. Example:
f2 (52) -f1 (28) -c1-c2 (10)
-c2-c1 (18)
f2 (52): there are 52 chains beginning with f2, and 28 chains beginning with f2-f1.
Thanks a lot.
If you read the combi values in (using as.character) you can expand those values to character columns:
df2 <- cbind(df, read.table(text=as.character(df$combi), sep="-",stringsAsFactors=FALSE) )
Then you can tabulate at whatever level you want:
xtabs(nb~V1, data=df2) # First level only
#V1
#c1 c2 f1 f2
#10 12 15 52
xtabs(nb~paste(V1,V2,sep="-"), data=df2) # first and second
#--
# paste(V1, V2, sep = "-")
#c1-c2 c1-f1 c1-f2 c2-c1 c2-f1 c2-f2 f1-c1 f1-c2 f1-f2 f2-c1 f2-c2 f2-f1
# 2 2 6 5 5 2 2 6 7 16 8 28
You can also deploy the addmargins function to compactly display the sub-totals for the two "most senior" positions:
addmargins( xtabs(nb~V1+V2, data=df2))
#=========
V2
V1 c1 c2 f1 f2 Sum
c1 0 2 2 6 10
c2 5 0 5 2 12
f1 2 6 0 7 15
f2 16 8 28 0 52
Sum 23 16 35 15 89
This could be "flattened" with ftable:
ftable( addmargins( xtabs(nb~V1+V2, data=df2)), row.vars=1:2)
V1 V2
c1 c1 0
c2 2
f1 2
f2 6
Sum 10
c2 c1 5
c2 0
f1 5
f2 2
Sum 12
f1 c1 2
c2 6
f1 0
f2 7
Sum 15
f2 c1 16
c2 8
f1 28
f2 0
Sum 52
Sum c1 23
c2 16
f1 35
f2 15
Sum 89
And the final tally would be:
xtabs(nb~paste(V1,V2,V3,V4,sep="-"), data=df2)
#-----
paste(V1, V2, V3, V4, sep = "-")
c1-c2-f1-f2 c1-c2-f2-f1 c1-f1-c2-f2 c1-f1-f2-c2 c1-f2-c2-f1 c1-f2-f1-c2 c2-c1-f1-f2 c2-c1-f2-f1
0 2 1 1 4 2 0 5
c2-f1-c1-f2 c2-f1-f2-c1 c2-f2-c1-f1 c2-f2-f1-c1 f1-c1-c2-f2 f1-c1-f2-c2 f1-c2-c1-f2 f1-c2-f2-c1
0 5 2 0 1 1 2 4
f1-f2-c1-c2 f1-f2-c2-c1 f2-c1-c2-f1 f2-c1-f1-c2 f2-c2-c1-f1 f2-c2-f1-c1 f2-f1-c1-c2 f2-f1-c2-c1
3 4 11 5 3 5 10 18
To see it all in a column:
as.matrix( xtabs(nb~paste(V1,V2,V3,V4,sep="-"), data=df2) )
#----------------
[,1]
c1-c2-f1-f2 0
c1-c2-f2-f1 2
c1-f1-c2-f2 1
c1-f1-f2-c2 1
c1-f2-c2-f1 4
c1-f2-f1-c2 2
c2-c1-f1-f2 0
c2-c1-f2-f1 5
c2-f1-c1-f2 0
c2-f1-f2-c1 5
c2-f2-c1-f1 2
c2-f2-f1-c1 0
f1-c1-c2-f2 1
f1-c1-f2-c2 1
f1-c2-c1-f2 2
f1-c2-f2-c1 4
f1-f2-c1-c2 3
f1-f2-c2-c1 4
f2-c1-c2-f1 11
f2-c1-f1-c2 5
f2-c2-c1-f1 3
f2-c2-f1-c1 5
f2-f1-c1-c2 10
f2-f1-c2-c1 18
I suppose a "final answer with all the subtotals might be:
ftable( addmargins( xtabs(nb~V1+V2+paste(V3,V4,sep="-"), data=df2)), row.vars=1:3)
However, that has so many zero entries that I hesitate to recommend it. You could strip out the zero rows:
my.ftable <- ftable( addmargins( xtabs(nb~V1+V2+paste(V3,V4,sep="-"), data=df2)), row.vars=1:3)
my.df.table <- as.data.frame(my.ftable)
names(my.df.table)[3] <- "3rd_4th"
my.df.table[ my.df.table$Freq > 0, ]
#---------
V1 V2 3rd_4th Freq
14 f2 f1 c1-c2 10
15 Sum f1 c1-c2 10
18 f1 f2 c1-c2 3
20 Sum f2 c1-c2 3
23 f1 Sum c1-c2 3
24 f2 Sum c1-c2 10
25 Sum Sum c1-c2 13
34 f2 c2 c1-f1 3
35 Sum c2 c1-f1 3
42 c2 f2 c1-f1 2
45 Sum f2 c1-f1 2
47 c2 Sum c1-f1 2
49 f2 Sum c1-f1 3
50 Sum Sum c1-f1 5
# and many more rows
#... until
321 c1 Sum Sum 10
322 c2 Sum Sum 12
323 f1 Sum Sum 15
324 f2 Sum Sum 52
325 Sum Sum Sum 89
The data.tree package specialises in tree representations. It is based on splitting variables in a hierarchical order, for example world -> continent -> country -> city. In your case, you've mentioned every ordering of c1, c2, f1 and f2, so you'd likely need four trees, e.g. one starting at c1 branching to c2, f1 or f2, each branch leading to the two unused values, and then plot them.
A basic example starting with c1, and then splitting off, and not including specific values:
library(data.tree)
c1 <- Node$new("c1") # 1st level chain, "c1"
c2 <- c1$AddChild("c2") # new 2nd level chain, "c2", off c1
f1 <- c2$AddChild("f1-f2") # new level off c2
f2 <- c2$AddChild("f2-f1") # new level off c2
f1 <- c1$AddChild("f1") # new 2nd level chain, "f1", off c1
c2 <- f1$AddChild("c2-f2") # new level off f1
f2 <- f1$AddChild("f2-c2") # new level off f1
f2 <- c1$AddChild("f2") # new 2nd level chain, "f2", off c1
c2 <- f2$AddChild("c2-f1") # new level off f2
f1 <- f2$AddChild("f1-c2") # new level off f2
print(c1)
levelName
1 c1
2 ¦--c2
3 ¦ ¦--f1-f2
4 ¦ °--f2-f1
5 ¦--f1
6 ¦ ¦--c2-f2
7 ¦ °--f2-c2
8 °--f2
9 ¦--c2-f1
10 °--f1-c2
plot(c1)
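Rather than adding every node by hand, the tree can also be built from the data itself; a sketch (assuming df from the question) that uses data.tree's pathString convention and its Aggregate() helper to roll the nb counts up from the leaves:
library(data.tree)
# one path per row, e.g. root/f2/f1/c1/c2
df$pathString <- paste("root", gsub("-", "/", as.character(df$combi)), sep = "/")
tree <- as.Node(df)
# sum nb upward so every internal node carries its subtotal
tree$Do(function(node) node$nb <- Aggregate(node, "nb", sum),
        traversal = "post-order")
print(tree, "nb")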
Maybe not exactly what you mean by a "tree structure", but this gives you the numbers in a table using base R. It should be easy to format the result as you like.
# df as defined in the question above
tmp <- sapply(as.character(df$combi), strsplit, split = "-")
tmp <- do.call(rbind, tmp)
colnames(tmp) <- paste0("str", 1:4)
rownames(tmp) <- NULL
tmp <- data.frame(df, tmp)
tmp$str3 <- paste(tmp$str3, tmp$str4, sep = "-")
str1 <- aggregate(list(nb_str1 = tmp[,"nb"]), tmp["str1"], sum)
str2 <- aggregate(list(nb_str2 = tmp[,"nb"]), tmp[c("str1", "str2")], sum)
str3 <- aggregate(list(nb_str3 = tmp[,"nb"]), tmp[c("str1", "str2", "str3")], sum)
tmp <- merge(str3, str1)
tmp <- merge(tmp, str2)
tmp <- tmp[, c("str1", "nb_str1", "str2", "nb_str2", "str3", "nb_str3")]
tmp
#> str1 nb_str1 str2 nb_str2 str3 nb_str3
#> 1 c1 10 c2 2 f1-f2 0
#> 2 c1 10 c2 2 f2-f1 2
#> 3 c1 10 f1 2 c2-f2 1
#> 4 c1 10 f1 2 f2-c2 1
#> 5 c1 10 f2 6 c2-f1 4
#> 6 c1 10 f2 6 f1-c2 2
#> 7 c2 12 c1 5 f1-f2 0
#> 8 c2 12 c1 5 f2-f1 5
#> 9 c2 12 f1 5 c1-f2 0
#> 10 c2 12 f1 5 f2-c1 5
#> 11 c2 12 f2 2 c1-f1 2
#> 12 c2 12 f2 2 f1-c1 0
#> 13 f1 15 c1 2 c2-f2 1
#> 14 f1 15 c1 2 f2-c2 1
#> 15 f1 15 c2 6 c1-f2 2
#> 16 f1 15 c2 6 f2-c1 4
#> 17 f1 15 f2 7 c1-c2 3
#> 18 f1 15 f2 7 c2-c1 4
#> 19 f2 52 c1 16 c2-f1 11
#> 20 f2 52 c1 16 f1-c2 5
#> 21 f2 52 c2 8 c1-f1 3
#> 22 f2 52 c2 8 f1-c1 5
#> 23 f2 52 f1 28 c1-c2 10
#> 24 f2 52 f1 28 c2-c1 18
Created on 2018-03-15 by the reprex package (v0.2.0).
I am trying to apply a regression function to each separate level of a factor (Subject). The idea is that for each Subject, I can get a predicted reading time based on their actual reading times (RT) and the length of the corresponding printed string (WordLen). A colleague helped me along with some code for applying the function to each level of another factor (Region) within Subject. However, neither the original code nor my attempted modification (applying the function across breaks of a single factor) works.
Here is an attempt at some sample data:
test0<-structure(list(Subject = c(101L, 101L, 101L, 101L, 101L, 101L,
101L, 101L, 101L, 101L, 102L, 102L, 102L, 102L, 102L, 102L, 102L,
102L, 102L, 102L, 103L, 103L, 103L, 103L, 103L, 103L, 103L, 103L,
103L, 103L), Region = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L), RT = c(294L, 241L, 346L, 339L, 332L, NA, 399L,
377L, 400L, 439L, 905L, 819L, 600L, 520L, 811L, 1021L, 508L,
550L, 1048L, 1246L, 470L, NA, 385L, 347L, 592L, 507L, 472L, 396L,
761L, 430L), WordLen = c(3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L,
9L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 5L, 7L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 5L, 7L, 3L)), .Names = c("Subject", "Region", "RT", "WordLen"
), class = "data.frame", row.names = c(NA, -30L))
The unfortunate thing is that this sample data produces a problem that I don't get with my full dataset:
"Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases"
Maybe this is because the sample data is too small?
Anyway, I am hoping that someone will see the issue with the code, despite my inability to provide working sample data...
This is the original code (does not work):
for(i in 1:length(levels(test0$Subject))) {
  for(j in 1:length(levels(test0$Region))) {
    tmp = predict(lm(RT ~ WordLen,
                     test0[test0$Subject == levels(test0$Subject)[i] &
                           test0$Region == levels(test0$Region)[j], ],
                     na.action = "na.exclude"))
    test0[names(tmp), "rt.predicted"] = tmp
  }
}
And this is the modified code (which, not surprisingly, also does not work):
for(i in 1:length(levels(test0$Subject))) {
  tmp = predict(lm(RT ~ WordLen,
                   test0[test0$Subject == levels(test0$Subject)[i], ],
                   na.action = "na.exclude"))
  test0[names(tmp), "rt.predicted"] = tmp
}
I would very much appreciate any suggestions.
You can achieve this with the function ddply() from the plyr package.
It splits the data frame according to Subject, calculates the prediction from the regression model, and then adds it as a new column to the data frame.
ddply(test0, .(Subject), transform,
      pred = predict(lm(RT ~ WordLen, na.action = "na.exclude")))
Subject Region RT WordLen pred
1 101 1 294 3 327.9778
......
4 101 1 339 3 327.9778
5 101 1 332 3 327.9778
6 101 2 NA 3 NA
7 101 2 399 5 363.8444
.......
13 102 1 600 3 785.4146
To split the data by Subject and Region, put both variables inside .():
ddply(test0, .(Subject, Region), transform,
      pred = predict(lm(RT ~ WordLen, na.action = "na.exclude")))
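For what it's worth, essentially the same thing can be written with dplyr instead of plyr; a sketch (keeping na.action = "na.exclude" so rows with a missing RT stay aligned):
library(dplyr)
test0 %>%
  group_by(Subject, Region) %>%
  mutate(pred = predict(lm(RT ~ WordLen, na.action = "na.exclude"))) %>%
  ungroup()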
The only problem in your test data is that Subject and Region are not factors.
test0$Subject <- factor(test0$Subject)
test0$Region <- factor(test0$Region)
for(i in 1:length(levels(test0$Subject))) {
  for(j in 1:length(levels(test0$Region))) {
    tmp = predict(lm(RT ~ WordLen,
                     test0[test0$Subject == levels(test0$Subject)[i] &
                           test0$Region == levels(test0$Region)[j], ],
                     na.action = "na.exclude"))
    test0[names(tmp), "rt.predicted"] = tmp
  }
}
# 26 27 28 29 30
# 442.25 442.25 560.50 678.75 442.25
The reason you were getting the error you were (0 non-NA cases) is that when you were subsetting, you were doing it on the levels of variables that were not factors. In your original dataset, try:
test0[test0$Subject==levels(test0$Subject)[1],]
You get:
# [1] Subject Region RT WordLen
# <0 rows> (or 0-length row.names)
That empty data frame is what lm() was trying to work with.
While your question seems to be asking for an explanation of the error, which others have answered (the data not being factors at all), here is a way to do it using just base packages:
test0$rt.predicted <- unlist(
  by(test0[, c("RT", "WordLen")], list(test0$Subject, test0$Region),
     FUN = function(x) predict(lm(RT ~ WordLen, x, na.action = "na.exclude"))))
test0
## Subject Region RT WordLen rt.predicted
## 1 101 1 294 3 310.4000
## 2 101 1 241 3 310.4000
## 3 101 1 346 3 310.4000
## 4 101 1 339 3 310.4000
## 5 101 1 332 3 310.4000
## 6 101 2 NA 3 731.0000
## 7 101 2 399 5 731.0000
## 8 101 2 377 7 731.0000
## 9 101 2 400 3 731.0000
## 10 101 2 439 9 731.0000
## 11 102 1 905 3 448.5000
## 12 102 1 819 3 NA
## 13 102 1 600 3 448.5000
## 14 102 1 520 3 448.5000
## 15 102 1 811 3 448.5000
## 16 102 2 1021 3 NA
## 17 102 2 508 3 399.0000
## 18 102 2 550 5 408.5000
## 19 102 2 1048 7 389.5000
## 20 102 2 1246 3 418.0000
## 21 103 1 470 3 870.4375
## 22 103 1 NA 3 870.4375
## 23 103 1 385 3 877.3750
## 24 103 1 347 3 884.3125
## 25 103 1 592 3 870.4375
## 26 103 2 507 3 442.2500
## 27 103 2 472 3 442.2500
## 28 103 2 396 5 560.5000
## 29 103 2 761 7 678.7500
## 30 103 2 430 3 442.2500
I would expect that this is caused by the fact that, for some combination of your two categorical variables, no data exists. What you could do is first extract the subset, check that it isn't empty, and only perform the lm() when there is data.
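A sketch of that guarded approach (assuming Subject and Region have been converted to factors as above): skip any Subject-by-Region cell that is empty or has no non-missing RT before fitting.
for (s in levels(test0$Subject)) {
  for (r in levels(test0$Region)) {
    sub <- test0[test0$Subject == s & test0$Region == r, ]
    # only fit when the cell has rows and at least one non-NA RT
    if (nrow(sub) > 0 && any(!is.na(sub$RT))) {
      tmp <- predict(lm(RT ~ WordLen, sub, na.action = "na.exclude"))
      test0[names(tmp), "rt.predicted"] <- tmp
    }
  }
}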
I would like to create a subset of the data consisting of Units that have a higher score in QTR 4 than in QTR 1 (an upward trend). It doesn't matter whether QTR 2 or 3 are present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time
If the dataframe is named "d", then this succeeds on your test set:
d[ which(d$Unit %in%
(sapply( split(d, d["Unit"]),
function(dd) dd[dd$QTR ==4, "Score"] - dd[dd$QTR ==1, "Score"]) > 0)) ,
]
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
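For comparison, a dplyr sketch of the same filter (again assuming the data frame is named d); the isTRUE() guard drops Units that lack a QTR 1 or QTR 4 row:
library(dplyr)
d %>%
  group_by(Unit) %>%
  filter(isTRUE(Score[QTR == 4][1] > Score[QTR == 1][1])) %>%
  ungroup()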
An alternative in two steps (assuming the question's data frame is named test):
result <- unlist(
  by(
    test,
    test$Unit,
    function(x) x$Score[x$QTR == 4] > x$Score[x$QTR == 1])
)
test[test$Unit %in% names(result[result == TRUE]), ]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
A solution using data.table (there are probably better versions than what I have at the moment).
Note: this assumes that a QTR value for a given Unit is unique.
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
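If you want the full rows back (with their QTR and Score columns) rather than an unnamed V1, a variant of the same idea subsets .SD within each group; a sketch under the same uniqueness assumption:
dt[, .SD[isTRUE(Score[QTR == 4][1] > Score[QTR == 1][1])], by = Unit]
#>    Unit QTR Score
#> 1:    1   1    22
#> 2:    1   2    34
#> 3:    1   3    70
#> 4:    1   4    89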