I am fairly new to R, and I have a dummy example of a bigger table below. I want to split the table by id (a, b, c, d) and run a simple linear regression on every subset:
x is my x variable, and columns 1:6 are the y variables, so that I get one model for each id and each of the columns 1:6. It would also be great if I could output the p-values of the slopes into a new data frame.
id x 1 2 3 4 5 6
1 a 74 18 19 NA 23 29 1
2 a 77 16 19 17 22 29 2
3 a 79 16 NA 19 23 29 3
4 a 81 17 20 18 23 29 4
5 b 74 19 20 19 23 28 11
6 b 76 15 19 18 26 28 12
7 b 79 19 21 20 24 28 NA
8 b 81 19 21 20 23 28 14
9 c 68 19 20 20 23 29 8
10 c 70 17 22 22 27 29 9
11 c 73 18 22 21 23 29 10
12 c 75 19 20 19 23 29 11
13 d 65 18 18 19 22 28 5
14 d 68 18 NA 18 20 29 6
15 d 70 18 19 18 23 28 7
16 d 72 19 17 19 22 28 8
I tried to use the plyr package, but it didn't work out:
regression = NULL
for ( i in 3:ncol(dumm)){
regression[i] <- dlply(dumm, .(id), function(z) lm(dumm[,i]~dumm$x, z))
}
coefs <- ldply(regression, coef)
Thanks in advance!
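One possible approach, sketched in base R rather than plyr: split the data by id, fit lm for each y column within each subset, and collect the slope and its p-value in one data frame. The data is the dummy table from the question; the names y_cols and results are only illustrative, and this is a sketch under those assumptions rather than the only way to do it.

```r
# Dummy data from the question. The header has one field fewer than the data
# rows, so read.table uses the first column as row names.
# check.names = FALSE keeps the column names "1".."6" as they are.
dumm <- read.table(header = TRUE, check.names = FALSE, text = "
id x 1 2 3 4 5 6
1 a 74 18 19 NA 23 29 1
2 a 77 16 19 17 22 29 2
3 a 79 16 NA 19 23 29 3
4 a 81 17 20 18 23 29 4
5 b 74 19 20 19 23 28 11
6 b 76 15 19 18 26 28 12
7 b 79 19 21 20 24 28 NA
8 b 81 19 21 20 23 28 14
9 c 68 19 20 20 23 29 8
10 c 70 17 22 22 27 29 9
11 c 73 18 22 21 23 29 10
12 c 75 19 20 19 23 29 11
13 d 65 18 18 19 22 28 5
14 d 68 18 NA 18 20 29 6
15 d 70 18 19 18 23 28 7
16 d 72 19 17 19 22 28 8
")

y_cols <- setdiff(names(dumm), c("id", "x"))  # the y columns "1".."6"

# One row per id and y column, holding the slope and its p-value.
results <- do.call(rbind, lapply(split(dumm, dumm$id), function(g) {
  do.call(rbind, lapply(y_cols, function(col) {
    # lm drops rows with NA in y by default (na.action = na.omit).
    fit <- lm(y ~ x, data = data.frame(x = g$x, y = g[[col]]))
    ct <- coef(summary(fit))
    data.frame(id = as.character(g$id[1]), y = col,
               slope = ct["x", "Estimate"], p_value = ct["x", "Pr(>|t|)"])
  }))
}))
rownames(results) <- NULL
results
```

Note that some y columns are constant within an id (for example column 5 for id a), so summary() may warn about an essentially perfect fit, and the p-value is not meaningful for those rows.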
I am trying to create a vector where I have 3 repetitions of the number 1, then 3 repetitions of the number 2, and so on up to, for instance, 3 repetitions of the number 36.
c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5...)
I have tried the following use of rep() but got the following error:
Error in rep(3, seq(1:36)) : argument 'times' incorrect
What formulation do I need to use to properly generate the vector I want?
sort(rep(1:36, 3))
Or even better, as @Wimpel mentioned in the comments, use the each argument of the rep function.
rep(1:36, each = 3)
output
# [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17 17 18 18 18 19 19 19 20 20 20 21 21 21 22
# [65] 22 22 23 23 23 24 24 24 25 25 25 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34 34 34 35 35 35 36 36 36
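As a small sketch of why the original call failed: in rep(x, times), x is the value to repeat, so rep(3, seq(1:36)) asks to repeat the single value 3 with 36 different repeat counts. As far as I know, a vector passed as times has to have either length 1 or the same length as x. The comparison below contrasts each with times; the variable names are only illustrative.

```r
by_each  <- rep(1:36, each = 3)    # 1 1 1 2 2 2 ... 36 36 36
by_times <- rep(1:36, times = 3)   # 1 2 ... 36 1 2 ... 36 1 2 ... 36

# A 'times' vector of the same length as x gives one repeat count per element,
# so this is another way to write the 'each' version.
per_element <- rep(1:36, times = rep(3, 36))

identical(by_each, per_element)      # TRUE
identical(by_each, sort(by_times))   # TRUE
```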
This should work, although it is probably not the most elegant solution.
reps = c()
n = 36
for(i in 1:n){
reps = append(reps, rep(i, 3))
}
reps
Alternatively, use the rep function properly (see the documentation, ?rep, for the argument each):
rep(1:36,each = 3)
The rep approach is preferable (see the existing answers).
Here are some other options:
> kronecker(1:36, rep(1, 3))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9
[26] 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17
[51] 17 18 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25
[76] 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34
[101] 34 34 35 35 35 36 36 36
> c(outer(rep(1, 3), 1:36))
[1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 9
[26] 9 9 10 10 10 11 11 11 12 12 12 13 13 13 14 14 14 15 15 15 16 16 16 17 17
[51] 17 18 18 18 19 19 19 20 20 20 21 21 21 22 22 22 23 23 23 24 24 24 25 25 25
[76] 26 26 26 27 27 27 28 28 28 29 29 29 30 30 30 31 31 31 32 32 32 33 33 33 34
[101] 34 34 35 35 35 36 36 36
I have a very large dataset (> 200000 lines) with 6 variables (only the first two shown)
>head(gt7)
ChromKey POS
1 2447 25
2 2447 183
3 26341 75
4 26341 2213
5 26341 2617
6 54011 1868
I have converted the ChromKey variable to a factor with > 55000 levels.
> gt7[1] <- lapply(gt7[1], factor)
> is.factor(gt7$ChromKey)
[1] TRUE
I can further make a table with counts of ChromKey levels
> table(gt7$ChromKey)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
88 88 44 33 11 11 33 22 121 11 22 11 11 11 22 11 33
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
22 22 44 55 22 11 22 66 11 11 11 22 11 11 11 187 77
35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51
77 11 44 11 11 11 11 11 11 22 66 11 22 11 44 22 22
... output cropped
Which I can save in table format
> table <- table(gt7$ChromKey)
> head(table)
1 2 3 4 5 6
88 88 44 33 11 11
I would like to know whether it is possible to get a table (and a histogram) of the number of levels that have a specific count. From the example above, I would expect
88 44 33 11
2 1 1 2
I would very much appreciate any hint.
We can apply table again to the output to get the frequency of each count:
table(table(gt7$ChromKey))
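To illustrate on something small enough to check by hand, here is a sketch with a made-up factor (key, counts and count_freq are hypothetical names, not part of gt7):

```r
# Hypothetical toy factor: level 1 and 2 occur three times, 3 twice, 4 once.
key <- factor(c(1, 1, 1, 2, 2, 2, 3, 3, 4))

counts <- table(key)          # how often each level occurs
count_freq <- table(counts)   # how many levels share each count
count_freq

# The same information as a data frame, if that is easier to work with.
as.data.frame(count_freq)
```

Here two levels occur three times, one level occurs twice, and one occurs once. For the plot, barplot(count_freq) should draw the corresponding bar chart.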
I have tried to change column names based on a vector as follows:
library(data.table)
df <- fread(
"radio1 radio2 radio3 radio4 radio5 radio6 radio7
8 12 18 32 40 36 32
6 12 18 24 30 36 30
8 16 18 24 30 36 18
4 12 12 24 30 36 24
6 16 24 32 40 48 24
8 12 18 24 30 36 30
8 12 18 24 30 36 18
8 16 24 32 40 48 40
8 16 24 24 30 48 48",
header = TRUE
)
var <- c("radio1","radio2","radio3","radio4","radio5", "radio6", "radio7")
recode <- c("A","B","C","D","E", "F", "G")
variables <- cbind(var, recode)
variables <- as.data.table(variables)
for (i in seq_len(ncol(df))) {
colnames(df[[i]]) <- variables$recode[match(names(df)[i], variables $var)]
}
However, I get the error:
Error in `colnames<-`(`*tmp*`, value = variables$recode[match(names(df)[i], :
attempt to set 'colnames' on an object with less than two dimensions
What am I doing wrong? Is there a better way to do this?
You can use match directly.
names(df) <- variables$recode[match(names(df), variables$var)]
df
# A B C D E F G
#1: 8 12 18 32 40 36 32
#2: 6 12 18 24 30 36 30
#3: 8 16 18 24 30 36 18
#4: 4 12 12 24 30 36 24
#5: 6 16 24 32 40 48 24
#6: 8 12 18 24 30 36 30
#7: 8 12 18 24 30 36 18
#8: 8 16 24 32 40 48 40
#9: 8 16 24 24 30 48 48
By changing colnames(df[[i]]) to colnames(df)[i], the loop works fine:
for (i in seq_len(ncol(df))) {
colnames(df)[i] <- variables$recode[match(names(df)[i], variables$var)] }
> df
A B C D E F G
1: 8 12 18 32 40 36 32
2: 6 12 18 24 30 36 30
3: 8 16 18 24 30 36 18
4: 4 12 12 24 30 36 24
5: 6 16 24 32 40 48 24
6: 8 12 18 24 30 36 30
7: 8 12 18 24 30 36 18
8: 8 16 24 32 40 48 40
9: 8 16 24 24 30 48 48
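A short sketch of why the original loop failed, plus a lookup-vector variant of the match approach. It uses a plain data.frame with three of the columns so that it runs without extra packages; lookup is an illustrative name. data.table also provides setnames(), which renames columns by reference and may be the more idiomatic choice there.

```r
df <- data.frame(radio1 = c(8, 6), radio2 = c(12, 12), radio3 = c(18, 18))

# df[[1]] extracts a single column as a plain vector. It has no dimensions,
# which is why colnames(df[[i]]) <- ... raises the error in the question.
is.null(dim(df[[1]]))   # TRUE

# Named vector: names are the old column names, values are the new ones.
lookup <- c(radio1 = "A", radio2 = "B", radio3 = "C")
names(df) <- unname(lookup[names(df)])
names(df)
```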
I have two vectors
a <- c(18,19,19,19,21,21,22,23,24,25,26,27,28,30,31,35,36,37)
b <- c(19,25,31,37)
I need a data frame in the following format:
a b
18 19
19 19
19 19
19 19
21 25
21 25
22 25
23 25
24 25
25 25
26 31
27 31
28 31
30 31
31 31
35 37
36 37
37 37
Here the value 19 in vector b is repeated up to the value 19 in vector a.
After that, 21 (in a) is greater than 19, so the next value in b, 25, is repeated until 25 is reached in a.
The rest of the data frame is constructed in the same way.
Thank you.
We can get the position index from findInterval and use it to build the times argument for rep:
i1 <- findInterval(b, a)
data.frame(a, b = rep(b, c(i1[1], diff(i1))))
# a b
#1 18 19
#2 19 19
#3 19 19
#4 19 19
#5 21 25
#6 21 25
#7 22 25
#8 23 25
#9 24 25
#10 25 25
#11 26 31
#12 27 31
#13 28 31
#14 30 31
#15 31 31
#16 35 37
#17 36 37
#18 37 37
Alternatively,
data.frame(a, b = sapply(a, function(x) b[x <= b][1]))
# a b
# 1 18 19
# 2 19 19
# 3 19 19
# 4 19 19
# 5 21 25
# 6 21 25
# 7 22 25
# 8 23 25
# 9 24 25
# 10 25 25
# 11 26 31
# 12 27 31
# 13 28 31
# 14 30 31
# 15 31 31
# 16 35 37
# 17 36 37
# 18 37 37
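To make the findInterval version easier to follow, here is the same computation broken into its intermediate steps (times and res are illustrative names). It assumes that a is sorted and that max(a) does not exceed max(b); otherwise the repeat counts would not add up to length(a).

```r
a <- c(18,19,19,19,21,21,22,23,24,25,26,27,28,30,31,35,36,37)
b <- c(19,25,31,37)

# For each value in b, the number of elements of a that are <= that value.
i1 <- findInterval(b, a)       # 4 10 15 18

# Differences between consecutive positions give the repeat count per b value.
times <- c(i1[1], diff(i1))    # 4 6 5 3

res <- data.frame(a, b = rep(b, times))

# Sanity check: every a is paired with a b that is at least as large.
all(res$a <= res$b)            # TRUE
```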
What's the name for the [1] below?
What is its significance?
Is it always only [1]? If not, then under what conditions is it something else? (example please)
> bb <- c(5,6,7)
> bb
[1] 5 6 7
It shows the index (position) of the first element printed on that line. In your case, the line starts with element 1:
bb <- c(5,6,7)
> bb
# [1] 5 6 7
Try,
c(1:50)
#[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
#[35] 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
You can also avoid displaying the index by using cat:
cat(c(1:50))
#1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
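Where the lines break, and therefore which indices show up, depends on the console width. A small sketch that narrows the width temporarily and captures the printed lines (old and out are illustrative names):

```r
# Temporarily use a narrow console width so the vector wraps onto several lines.
old <- options(width = 40)

# capture.output returns each printed line as a string; every line starts with
# the bracketed index of its first element.
out <- capture.output(print(101:130))
out

options(old)   # restore the previous width
```

Each element of out begins with a bracketed number, and that number is the position in the vector of the first value shown on the line, not the value itself.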