I have a list as follows:
id | value
----------
4 600
4 899
7 19
13 4930
13 300
: :
There are multiple ID repeats, and each one has a unique value. I want to turn this into something as follows:
id | list
----------
4 c(600, 899)
7 c(19)
13 c(4930, 300)
: :
Is there a vectorized method of accomplishing this?
EDIT: Extending the first question, is there a simple way to do the same thing for a generic MxN matrix? I.e., turning this:
id | value1 value2
-------------------
4 600 a
4 899 b
7 19 d
13 4930 e
13 300 a
: : :
into this:
id | list
----------
4 list(c(600, 899),c('a','b'))
7 list(c(19),c('d'))
13 list(c(4930, 300),c('e','a'))
: :
Thanks!
You could also use tapply if you want to stick with base functions:
tapply(dat$value,dat$id,c)
$`4`
[1] 600 899
$`7`
[1] 19
$`13`
[1] 4930 300
Edit:
For your edited problem, I would go with split and lapply:
x <- lapply(split(dat[2:3],dat$id),c,use.names=F)
dput(x)
structure(list(`4` = list(c(600, 899), c("a", "b")), `7` = list(
19, "d"), `13` = list(c(4930, 300), c("e", "a"))), .Names = c("4", "7", "13"))
The functions in package plyr should be of help here.
In the following example I assume your data is in the form of a data.frame - even if it really is a list, as you say, it should be straight-forward to convert to a data.frame:
dat <- data.frame(
id = c(4, 4, 7, 13, 13),
value = c(600, 899, 19, 4930, 300)
)
library(plyr)
dlply(dat, .(id), function(x)x$value)
The result is a list as you specified:
$`4`
[1] 600 899
$`7`
[1] 19
$`13`
[1] 4930 300
attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
id
1 4
2 7
3 13
I'd just split() the data:
d <- read.table(text = "id value
4 600
4 899
7 19
13 4930
13 300", header=T)
split(d$value, d$id)
$`4`
[1] 600 899
$`7`
[1] 19
$`13`
[1] 4930 300
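If the goal is a table with one row per id, as sketched in the question, the split() result can be stored as a list column. This is a minimal sketch, assuming the data frame is called d as above; the names grouped and out are only illustrative:

```r
d <- data.frame(id = c(4, 4, 7, 13, 13),
                value = c(600, 899, 19, 4930, 300))

# One vector of values per id; the list names are the ids as strings
grouped <- split(d$value, d$id)

# One row per id, with the vectors kept in a list column
out <- data.frame(id = as.numeric(names(grouped)))
out$list <- unname(grouped)
```

Each element of out$list is then the vector of values for the id in the same row, e.g. out$list[[1]] holds the values for id 4.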
I would like to make two variables ("a" and "b") that keep:
taking a random value that is ALWAYS less than their current value (i.e. a1 > a2 > a3 > ... > an, b1 > b2 > b3 > ... > bn ALWAYS)
until one of them is less than or equal to 0.
A demo is shown below:
#iteration 1
a1 = 100 - rnorm(1,5,10)
b1 = 100 -rnorm(1,5,10)
a2 = a1 - rnorm(1,5,10)
b2 = b1 -rnorm(1,5,10)
a3 = a2 - rnorm(1,5,10)
b3 = b2 -rnorm(1,5,10)
#etc.
I would then like to repeat this many times and collect all iterations in a single table.
Currently, I am doing this manually, and then using the bind_rows() command to "pile" each iteration on top of each other. Can someone please show me a faster way to do this?
Thank you!
You could write a small recursive function:
fun <- function(x){
if(any(x < 0)) x
else rbind(x, fun(x - abs(rnorm(length(x),5,10)) ))
}
Now for 1 draw of A and B:
set.seed(1)
fun(c(A=100, B=100))
A B
x 100.00000 100.000000
x 98.73546 93.163567
x 95.37918 72.210759
x 87.08410 69.006075
x 77.20981 56.622828
x 66.45199 54.676712
x 46.33418 45.778279
x 45.12178 28.631280
x 28.87247 24.080617
x 24.03437 9.642254
10.82216 -1.296759
We can wrap this in a function and use replicate to repeat it. I will stay in base R, although this can be simplified with the tidyverse:
random_seq <- function(n, start){
fun <- function(x){
if(any(x < 0)) c(x)
else rbind(x, fun(x - abs(rnorm(length(x),5,10)) ))
}
R <-replicate(n, data.frame(fun(start), row.names = NULL), simplify = FALSE)
S <- do.call(rbind, Map(cbind, id = seq(R), R))
U <-transform(S, time = ave(id, id, FUN = seq_along))
reshape(U, dir='wide', idvar = 'id', sep='')
}
set.seed(1)
random_seq(4, c(A=20,B=20))
id A1 B1 A2 B2 A3 B3 A4 B4
1 1 20 20 18.7354619 13.163567 15.379176 -7.789241 NA NA
4 2 20 20 11.7049223 16.795316 1.830632 4.412069 -8.927182 2.465953
8 3 20 20 -0.1178117 11.101568 NA NA NA NA
10 4 20 20 18.7875942 2.853001 2.538285 -1.697663 NA NA
BONUS:
If interested, fun can also generate the names directly:
fun <- function(x){
nms <- as.numeric(sub('\\D+', '',names(x))) + 1
names(x) <- paste0(sub("\\d+", '', names(x)), nms)
if(any(x < 0)) c(x)
else c(x, Recall(x - abs(rnorm(length(x),5,10)) ))
}
fun(c(A0=20, B0=30))
A1 B1 A2 B2 A3 B3
20.000000 30.000000 11.234808 23.323201 -9.611483 1.544311
Here's a function that runs a single start to 0, nicely configurable, and we can use replicate to run it as many times as needed, returning a list.
to_0 = function(start = 100, fun = runif, ..., n = 1000) {
if(start <= 0) stop("Must start greater than 0")
result = start - c(0, cumsum(fun(n, ...)))
if(all(result > 0)) stop("Didn't reach 0, set a higher n or check inputs.")
first_0 = match(TRUE, result < 0)
result[seq_len(first_0)]
}
I used runif as the default instead of your rnorm because you say you want the series to be strictly decreasing, but rnorm is sometimes positive and sometimes negative so it will sometimes lead to increases.
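If normal draws are still wanted, one option is to take their absolute value so that every step is a decrease. The following is only a sketch under that assumption (the variable names and the choice of 200 draws are illustrative, not part of the function above):

```r
set.seed(42)

# abs() turns every draw into a non-negative step, so the path can only go down
steps <- abs(rnorm(200, mean = 5, sd = 10))
path <- 100 - c(0, cumsum(steps))

# Keep everything up to and including the first value at or below 0
first_nonpos <- match(TRUE, path <= 0)
path <- path[seq_len(first_nonpos)]
```

Note that abs() changes the distribution of the steps (it is a folded normal), so the step sizes are no longer normally distributed.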
I cut off the series at the first negative value. Since the lengths of each run are different, a data.frame seems like a bad choice, keeping them in a list is better. We can use lengths() to see how long each vector in the list is.
The function is parametrized, so you can easily try out other distributions or custom functions, e.g., to_0(start = 100, fun = rexp, rate = 0.1). Below I demonstrate with the uniform distribution starting at 10.
set.seed(47)
race = replicate(n = 100, to_0(start = 10))
head(race)
# [[1]]
# [1] 10.00000000 9.02303800 8.64912196 7.88761993 7.06512831 6.49158390 5.80017147 5.41110962 4.94216364 4.39885390 3.47396185
# [12] 3.33516427 2.63317707 2.47098343 1.87167641 1.36564030 0.46366678 0.06316398 0.03221901 -0.03913915
#
# [[2]]
# [1] 10.00000000 9.27320918 8.54814801 7.77974923 7.34440424 7.27499236 6.76825217 6.75134855 6.20214287 5.43031741 4.56633348
# [12] 3.59288910 3.24547860 2.60269295 1.75639299 1.73279651 1.72371866 1.38211688 0.71933800 0.04916749 -0.40714758
#
# [[3]]
# [1] 10.00000000 9.08923490 9.06189460 8.69397353 8.30179409 8.11077841 7.96295850 7.49701585 6.52812608 6.26480567 5.34558158
# [12] 5.31801508 4.90573089 3.98774633 3.89046321 3.70358854 3.61482042 3.53824450 3.36900151 2.86522484 2.23295349 1.80544403
# [23] 0.82311022 0.73664857 -0.09385818
#
# [[4]]
# [1] 10.0000000 9.2172681 8.4175584 8.1672679 7.3683421 7.3373712 7.0319788 6.6512214 5.7210315 5.2732412 4.6817849 4.1065416
# [13] 3.9452541 3.4009742 2.5018050 1.5316136 0.7175295 0.4410275 -0.1859260
#
# [[5]]
# [1] 10.00000000 9.91914621 9.90238843 9.82993154 9.33156028 8.90827720 8.44160294 7.46348397 6.76539075 6.27298443 5.97401412
# [12] 5.03395592 4.55537992 3.75737919 2.82175869 2.75045000 2.70081885 2.67523320 2.20266408 2.12695183 1.25880525 0.57011279
# [23] 0.03173135 -0.79275633
#
# [[6]]
# [1] 10.0000000 9.9292630 9.6154147 9.0754730 8.7814754 8.5273701 7.6998567 6.8127609 5.9944598 5.6232599 5.1505038 4.8676191
# [13] 4.6337121 4.5868438 4.0435219 3.0981151 2.2621741 1.9925101 1.2104707 0.9334569 0.7574446 0.1643009 -0.5220925
lengths(race)
# [1] 20 21 25 19 24 23 21 24 23 22 25 24 19 19 23 17 19 23 25 21 24 25 18 22 24 25 19 19 23 22 19 26 20 23 24 24 22 21 25 23 21 28 19 20 16 20
# [47] 22 25 20 22 23 23 24 22 19 23 23 23 22 18 22 23 24 21 21 23 21 22 20 25 22 23 21 17 20 20 16 25 21 21 21 20 20 19 24 19 23 24 26 25 20 21
# [93] 23 17 27 18 30 24 21 23
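If a single table is needed after all, a list of runs with different lengths can be stacked into a long data frame with one row per step. A small sketch, using a short made-up list called runs as a stand-in for race (the column names id, step and value are assumptions):

```r
# Stand-in for the `race` list: three runs of different lengths
runs <- list(c(10, 6, 1, -2), c(10, 4, -1), c(10, 9, 5, 2, -3))

long <- data.frame(
  id    = rep(seq_along(runs), lengths(runs)),  # which run each row belongs to
  step  = sequence(lengths(runs)),              # position within the run
  value = unlist(runs)
)
```

With the real data, replacing runs by race should give the same layout, and nothing has to be padded with NA.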
I have a simple data table apple that has numerous instances of numbers shortened as 40.08B, 40.08M, 400.08K, etc. I need to remove these letters and replace them with the appropriate number of zeros (i.e. 400.08K becomes 400080), so I wrote the following code:
apple2 <- dplyr::case_when(
stringr::str_detect(apple[,-1], 'B') ~ readr::parse_number(as.character(apple[,-1]), na = c("", "NA")) * 1e9,
stringr::str_detect(apple[,-1], 'M') ~ readr::parse_number(as.character(apple[,-1]), na = c("", "NA")) * 1e6,
stringr::str_detect(apple[,-1], 'K') ~ readr::parse_number(as.character(apple[,-1]), na = c("", "NA")) * 1e3,
TRUE ~ parse_number(as.character(apple[,-1]), na = c("", "NA"), trim_ws = TRUE)
)
The code works as expected in finding and converting the strings into appropriate numbers, but it only runs on the first row of the data table. In addition, it removes the headers. The error message is the following:
argument is not an atomic vector; coercing
argument is not an atomic vector; coercing
argument is not an atomic vector; coercing
[1]
I've tried figuring this out for hours but to no avail - what am I doing wrong here? Thank you!
You are using case_when in a somewhat unorthodox way:
## some data:
d <- cbind.data.frame(
id = LETTERS,
matrix(
paste0(
ceiling(runif(26*5, max=999)),
sample( c("","B","K","M"), size=26*5, replace=T )
), nrow=26
)
)
library(dplyr) # needed for %>%, mutate, across and case_when
library(stringr)
library(readr)
d %>% mutate( across( -1,
~ case_when(
str_detect(., 'B') ~ parse_number(as.character(.), na = c("", "NA")) * 1e9,
str_detect(., 'M') ~ parse_number(as.character(.), na = c("", "NA")) * 1e6,
str_detect(., 'K') ~ parse_number(as.character(.), na = c("", "NA")) * 1e3,
TRUE ~ parse_number(as.character(.), na = c("", "NA"), trim_ws = TRUE)
)
))
Input data:
id 1 2 3 4 5
1 A 834 27B 250 881B 988
2 B 313M 506B 309 413 141K
3 C 197 77 824 161B 43K
4 D 845K 172K 745B 922M 145M
5 E 168M 959M 990B 250K 893
6 F 430 687K 368M 10M 824M
7 G 940B 403B 655M 818 777K
8 H 281 833K 86B 849B 16K
9 I 485B 508B 349M 643M 926M
10 J 235B 10B 206M 505K 347M
11 K 897B 727M 405K 987B 674M
12 L 588B 40M 860M 58 934B
13 M 727K 375 188M 728K 201B
14 N 280K 442M 43K 400 445
15 O 988B 388M 530B 702M 240B
16 P 177M 782 410K 254K 758K
17 Q 706K 262 520B 104K 34
18 R 390B 99K 677K 965 635M
19 S 819 115M 920M 580M 295K
20 T 573M 901K 360 7K 88B
21 U 333B 593M 504B 992 241B
22 V 674 192M 841B 644B 659
23 W 524M 581M 692M 41 133
24 X 626K 686M 712K 756M 136B
25 Y 295 468 932M 486B 35K
26 Z 526K 798K 229K 958B 700B
Output:
id 1 2 3 4 5
1 A 8.34e+02 2.70e+10 2.50e+02 8.81e+11 9.88e+02
2 B 3.13e+08 5.06e+11 3.09e+02 4.13e+02 1.41e+05
3 C 1.97e+02 7.70e+01 8.24e+02 1.61e+11 4.30e+04
4 D 8.45e+05 1.72e+05 7.45e+11 9.22e+08 1.45e+08
5 E 1.68e+08 9.59e+08 9.90e+11 2.50e+05 8.93e+02
6 F 4.30e+02 6.87e+05 3.68e+08 1.00e+07 8.24e+08
7 G 9.40e+11 4.03e+11 6.55e+08 8.18e+02 7.77e+05
8 H 2.81e+02 8.33e+05 8.60e+10 8.49e+11 1.60e+04
9 I 4.85e+11 5.08e+11 3.49e+08 6.43e+08 9.26e+08
10 J 2.35e+11 1.00e+10 2.06e+08 5.05e+05 3.47e+08
11 K 8.97e+11 7.27e+08 4.05e+05 9.87e+11 6.74e+08
12 L 5.88e+11 4.00e+07 8.60e+08 5.80e+01 9.34e+11
13 M 7.27e+05 3.75e+02 1.88e+08 7.28e+05 2.01e+11
14 N 2.80e+05 4.42e+08 4.30e+04 4.00e+02 4.45e+02
15 O 9.88e+11 3.88e+08 5.30e+11 7.02e+08 2.40e+11
16 P 1.77e+08 7.82e+02 4.10e+05 2.54e+05 7.58e+05
17 Q 7.06e+05 2.62e+02 5.20e+11 1.04e+05 3.40e+01
18 R 3.90e+11 9.90e+04 6.77e+05 9.65e+02 6.35e+08
19 S 8.19e+02 1.15e+08 9.20e+08 5.80e+08 2.95e+05
20 T 5.73e+08 9.01e+05 3.60e+02 7.00e+03 8.80e+10
21 U 3.33e+11 5.93e+08 5.04e+11 9.92e+02 2.41e+11
22 V 6.74e+02 1.92e+08 8.41e+11 6.44e+11 6.59e+02
23 W 5.24e+08 5.81e+08 6.92e+08 4.10e+01 1.33e+02
24 X 6.26e+05 6.86e+08 7.12e+05 7.56e+08 1.36e+11
25 Y 2.95e+02 4.68e+02 9.32e+08 4.86e+11 3.50e+04
26 Z 5.26e+05 7.98e+05 2.29e+05 9.58e+11 7.00e+11
See also other ways to convert the human readable byte number to a number, eg this or perhaps this
We could make use of str_replace_all instead of multiple str_detect. Match and replace the 'B', 'M', 'K' substring in the column with a named vector in str_replace_all, then separate the column, and do the multiplication based on the separated columns
library(stringr)
library(dplyr)
library(tidyr)
apple %>%
mutate(col1 = str_replace_all(col1, setNames(c(' 1e9', ' 1e6', ' 1e3'),
c('B', 'M', 'K')))) %>%
separate(col1, into = c('col1', 'col2'), convert = TRUE) %>%
transmute(col1 = col1 * col2)
-output
# col1
#1 4.0e+10
#2 2.0e+08
#3 2.0e+06
#4 4.0e+05
#5 3.6e+10
data
apple <- structure(list(col1 = c("40B", "200M", "2M", "400K", "36B")),
class = "data.frame", row.names = c(NA,
-5L))
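For comparison, the same conversion can be written in base R with a named vector of multipliers. This is a rough sketch, assuming the suffix is always the last character and is one of K, M or B; the helper name to_number is hypothetical:

```r
to_number <- function(x) {
  x <- trimws(as.character(x))
  mult <- c(K = 1e3, M = 1e6, B = 1e9)

  last <- substr(x, nchar(x), nchar(x))   # final character of each string
  has_suffix <- last %in% names(mult)

  # Drop the suffix where there is one, then scale by the matching multiplier
  digits <- ifelse(has_suffix, substr(x, 1, nchar(x) - 1), x)
  as.numeric(digits) * ifelse(has_suffix, mult[last], 1)
}

converted <- to_number(c("40.08B", "40.08M", "400.08K", "12"))
```

Because the function is vectorized over one column, it could be applied to every column except the first with apple[-1] <- lapply(apple[-1], to_number), which keeps the headers.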
I have a table with a character field that can have any of these patterns:
input
97 # a single number
210 foo # a number and a word
87 bar 89 # a number, a word, a number
21 23 # two numbers
123 2 fizzbuzz # two numbers, a word
12 fizz 34 buzz # a number, a word, a number, a word
I'd like to split each line into up to 4 parts, containing respectively the first number, the first word if it exists, the second number if it exists, and the second word if it exists. So my example would give:
input            nb_1  word_1  nb_2  word_2
97               97
210 foo          210   foo
87 bar 89        87    bar     89
21 23            21            23
123 2 fizzbuzz   123           2     fizzbuzz
12 fizz 34 buzz  12    fizz    34    buzz
Please note the case of two numbers followed by a word (the second-to-last example): it has nothing in word_1 as there is no word between the two numbers.
Is there a way to do this without a tedious if / if / else structure ?
If it can help, all the words belong to a list of 10 specific words. Also, if there are two words, they can be the same or different. Also, the numbers can be one, two or three digits long.
Thanks
Here is an idea using gsub and cSplit from splitstackshape package,
library(splitstackshape)
df$num <- gsub('\\D', ' ', df$V1)
df$wrds <- gsub('\\d', ' ', df$V1)
newdf <- cSplit(df, 2:3, ' ', 'wide')
newdf
# V1 num_1 num_2 wrds_1 wrds_2
#1: 97 97 NA NA NA
#2: 210 foo 210 NA foo NA
#3: 87 bar 89 87 89 bar NA
#4: 21 23 21 23 NA NA
#5: 123 2 fizzbuzz 123 2 fizzbuzz NA
#6: 12 fizz 34 buzz 12 34 fizz buzz
The only problem is row 5, which can be fixed as follows,
newdf$wrds_1 <- as.character(newdf$wrds_1)
newdf$wrds_2 <- as.character(newdf$wrds_2)
newdf$wrds_2[grep('[0-9]+\\s+[0-9]+\\s+[A-Za-z]', newdf$V1)] <- newdf$wrds_1[grep('[0-9]+\\s+[0-9]+\\s+[A-Za-z]', newdf$V1)]
newdf$wrds_1[grep('[0-9]+\\s+[0-9]+\\s+[A-Za-z]', newdf$V1)] <- NA
which finally gives,
newdf
# V1 num_1 num_2 wrds_1 wrds_2
#1: 97 97 NA NA NA
#2: 210 foo 210 NA foo NA
#3: 87 bar 89 87 89 bar NA
#4: 21 23 21 23 NA NA
#5: 123 2 fizzbuzz 123 2 NA fizzbuzz
#6: 12 fizz 34 buzz 12 34 fizz buzz
DATA
dput(df)
structure(list(V1 = c("97", " 210 foo", " 87 bar 89",
" 21 23", " 123 2 fizzbuzz",
" 12 fizz 34 buzz")), .Names = "V1", row.names = c(NA,
-6L), class = "data.frame")
Tried in a different way...
library(splitstackshape)
abc <- data.frame(a=c(97,"210 foo","87 bar 89","21 23","123 2 fizzbuzz","12 fizz 34 buzz"))
abc1 <- data.frame(cSplit(abc, "a", " ", stripWhite = FALSE))
abc <- cbind(abc,abc1)
names(abc) <- c("input","nb_1", "word_1", "nb_2","word_2")
abc[,1:5] <-apply(abc[,1:5] , 2, as.character)
for(i in 1:nrow(abc)){
abc$word_2[i] <- replace(abc$word_2[i] , is.na(abc$word_2[i]),abc$nb_2[grepl("[a-z]",abc$nb_2[i])][i])
abc$nb_2[i] <- replace(abc$nb_2[i] , is.na(abc$nb_2[i])|grepl("[a-z]",abc$nb_2[i]),abc$word_1[grepl("[0-9]",abc$word_1[i])][i])
}
abc$word_1 <- ifelse(grepl("[0-9]",abc$word_1),NA,abc$word_1)
abc[is.na(abc)] <- ""
print(abc)
            input nb_1 word_1 nb_2   word_2
1              97   97
2         210 foo  210    foo
3       87 bar 89   87    bar   89
4           21 23   21          23
5  123 2 fizzbuzz  123           2 fizzbuzz
6 12 fizz 34 buzz   12   fizz   34     buzz
This is a hacky function to do it... although you might have other cases that would break it.
f <- function(x){
string2 <- strsplit(x, " ")[[1]]
if (length(string2) < 2)
return(c(string2, NA, NA, NA))
arenums <- grepl("\\d", string2)
c(string2[which(arenums)[1]],
if (arenums[2]) NA else string2[which(!arenums)[1]],
string2[which(arenums)[2]],
if (arenums[2]) string2[which(!arenums)[1]] else string2[which(!arenums)[2]])
}
> f("97")
[1] "97" NA NA NA
> f("210 foo")
[1] "210" "foo" NA NA
> f("87 bar 89")
[1] "87" "bar" "89" NA
> f("21 23")
[1] "21" NA "23" NA
> f("123 2 fizzbuzz")
[1] "123" NA "2" "fizzbuzz"
> f("12 fizz 34 buzz")
[1] "12" "fizz" "34" "buzz"
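Another way to avoid a chain of if/else branches is to treat the four output columns as typed slots (number, word, number, word) and drop each token into the next slot of its type. The sketch below is one possible base R implementation; the names split_one and split_parts are made up for illustration, and it assumes tokens are separated by spaces and that numbers consist of digits only:

```r
# Slot layout of the output: number, word, number, word
slot_is_num <- c(TRUE, FALSE, TRUE, FALSE)

split_one <- function(s) {
  out <- rep(NA_character_, 4)
  pos <- 1
  for (tok in strsplit(trimws(s), " +")[[1]]) {
    is_num <- grepl("^[0-9]+$", tok)
    # Skip slots of the wrong type, leaving them as NA
    while (pos <= 4 && slot_is_num[pos] != is_num) pos <- pos + 1
    if (pos > 4) break
    out[pos] <- tok
    pos <- pos + 1
  }
  out
}

split_parts <- function(x) {
  parts <- t(vapply(x, split_one, character(4), USE.NAMES = FALSE))
  colnames(parts) <- c("nb_1", "word_1", "nb_2", "word_2")
  data.frame(input = x, parts, stringsAsFactors = FALSE)
}

input <- c("97", "210 foo", "87 bar 89", "21 23",
           "123 2 fizzbuzz", "12 fizz 34 buzz")
res <- split_parts(input)
```

For "123 2 fizzbuzz" the second number skips the word_1 slot, so word_1 stays NA and "fizzbuzz" lands in word_2, as the question requires.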
I have a data.frame
set.seed(100)
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)), re = c(rep(seq(1, 10, 1), 2)), age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to fit an lm for each row of the data set (h and h2 ~ age1 and age2).
I do it with a loop:
exp$modelh <- 0
for (i in 1:length(exp$exp)){
age = c(exp$age1[i], exp$age2[i])
h = c(exp$h[i], exp$h2[i])
model = lm(age ~ h)
exp$modelh[i] = coef(model)[1] + 100 * coef(model)[2]
}
and it works well but takes some time with very large files. I would be grateful for a faster solution, e.g. with dplyr.
Using dplyr, we can try with rowwise() and do. Inside the do, we concatenate (c) the 'age1', 'age2' to create 'age', likewise, we can create 'h', apply lm, extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), create a column 'rn' with the option keep.rownames=TRUE. We melt the dataset by specifying the patterns in the measure to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp') while removing the unwanted 'rn' column by assigning (:=) it to NULL.
library(data.table)#v1.9.5+
modelh <- melt(setDT(exp, keep.rownames=TRUE), measure=patterns('^age', '^h'),
value.name=c('age', 'h'))[, {model <- lm(age ~h)
coef(model)[1] + 100 * coef(model)[2]},rn]$V1
exp[, modelh:= modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from @akrun.
Just a suggestion for your future analysis as you mentioned "it's an example of a bigger problem". Obviously, if you are really interested in building models rowwise then you'll create more and more columns as your age and h observations increase. If you get N observations you'll have to use 2xN columns for those 2 variables only.
I'd suggest to use a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # how your first row (model building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
varying = list(c("age1","age2"),
c("h","h2")),
v.names = c("age_value","h_value"),
direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
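To show how the long format can then be used, here is a small sketch that reshapes a whole (made-up) data frame and fits one model per original row by splitting on the id column that reshape() adds. The data frame dat and the helper fit_one are illustrative assumptions:

```r
dat <- data.frame(age1 = c(10, 11, 12), age2 = c(30, 31, 32),
                  h    = c(19.2, 17.7, 26.6), h2 = c(46.7, 47.6, 46.7))

# Wide to long: two rows per original row, identified by `id`
long <- reshape(dat,
                varying = list(c("age1", "age2"), c("h", "h2")),
                v.names = c("age_value", "h_value"),
                direction = "long")

# Fit one model per id and return intercept + 100 * slope
fit_one <- function(d) {
  cf <- coef(lm(age_value ~ h_value, data = d))
  unname(cf[1] + 100 * cf[2])
}
modelh <- vapply(split(long, long$id), fit_one, numeric(1))
```

Adding a third observation per row would then mean adding rows to long rather than two more columns to dat.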
With base R, the function sprintf can help us create formulas. And lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply function the expression as.formula(x) is what converts the formulas created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1=sample(30, 1e4, T),
age2=sample(30:50, 1e4, T),
h=runif(1e4, 10, 40),
h2= 40 + runif(1e4,4,9))
microbenchmark(
plafort = {strings <- sprintf("c(%f,%f) ~ c(%f,%f)", big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
big.exp$modelh <- unlist(lst)},
akdplyr = {big.exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )}
,times=5)
Unit: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but its rounding may be an issue (sprintf's %f keeps only six decimal places). Improvements may be possible, perhaps by converting to a matrix and using apply to increase speed and efficiency.
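Since each model here is fitted to exactly two points, the fitted line passes through both of them, so the coefficients can also be computed directly with vectorized arithmetic and no call to lm. This is a sketch under that assumption (two observations per row, and h different from h2 in every row); the data frame dat is a small made-up example:

```r
dat <- data.frame(age1 = c(10, 11, 12), age2 = c(30, 31, 32),
                  h    = c(19.2, 17.7, 26.6), h2 = c(46.7, 47.6, 46.7))

# Line through (h, age1) and (h2, age2), computed for all rows at once
slope     <- (dat$age2 - dat$age1) / (dat$h2 - dat$h)
intercept <- dat$age1 - slope * dat$h

# Same quantity as coef(model)[1] + 100 * coef(model)[2] in the loop
dat$modelh <- intercept + 100 * slope
```

With more than two observations per row this shortcut no longer applies and a least-squares fit is needed again.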
I am new to R, and want to sort a data frame called "weights". Here are the details:
>str(weights)
'data.frame': 57 obs. of 1 variable:
$ attr_importance: num 0.04963 0.09069 0.09819 0.00712 0.12543 ...
> names(weights)
[1] "attr_importance"
> dim(weights)
[1] 57 1
> head(weights)
attr_importance
make 0.049630556
address 0.090686474
all 0.098185517
num3d 0.007122618
our 0.125433292
over 0.075182467
I want to sort by decreasing order of attr_importance BUT I want to preserve the corresponding row names also.
I tried:
> weights[order(-weights$attr_importance),]
but it gives me a "numeric" back.
I want a data frame back - which is sorted by attr_importance and has CORRESPONDING row names intact. How can I do this?
Thanks in advance.
Since your data.frame only has one column, you need to set drop=FALSE to prevent the dimensions from being dropped:
weights[order(-weights$attr_importance),,drop=FALSE]
# attr_importance
# our 0.125433292
# all 0.098185517
# address 0.090686474
# over 0.075182467
# make 0.049630556
# num3d 0.007122618
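To make the difference visible, here is a small sketch with a three-row stand-in for weights (the values are rounded from the question) that compares the result with and without drop = FALSE:

```r
weights <- data.frame(attr_importance = c(0.0496, 0.0907, 0.0982),
                      row.names = c("make", "address", "all"))

idx <- order(-weights$attr_importance)

as_vector <- weights[idx, ]                # single column is dropped to a numeric vector
as_frame  <- weights[idx, , drop = FALSE]  # stays a data.frame, row names preserved
```

The same ordering can also be written as order(weights$attr_importance, decreasing = TRUE).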
Here is the big comparison on data.frame sorting:
How to sort a dataframe by column(s)?
Using my now-preferred solution arrange:
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
levels = c("Low", "Med", "Hi"), ordered = TRUE),
x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
z = c(1, 1, 1, 2))
library(plyr)
arrange(dd,desc(z),b)
b x y z
1 Low C 9 2
2 Med D 3 1
3 Hi A 8 1
4 Hi A 9 1
rankdata.txt
regno name total maths science social cat
1 SUKUMARAN 400 78 89 73 S
2 SHYAMALA 432 65 79 87 S
3 MANOJ 500 90 129 78 C
4 MILYPAULOSE 383 59 88 65 G
5 ANSAL 278 39 77 60 O
6 HAZEENA 273 45 55 56 O
7 MANJUSHA 374 50 99 52 C
8 BILBU 408 81 97 72 S
9 JOSEPHROBIN 374 57 85 68 G
10 SHINY 381 70 79 70 S
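rankdata <- read.table("rankdata.txt", header = TRUE) # assumes the table above is saved as rankdata.txt
z <- data.frame(rankdata)
z[with(z, order(-total + maths)), ] # orders by the single key maths - total; use order(-total, maths) to sort by total (descending), then maths
z # z itself is unchanged, the sorted copy was only printed
z[with(z, order(name)), ] # sort on name
z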