I have a data frame that looks like this:
lhs1=c("A","D","C","B")
lhs2=c("B","A","C","I")
lhs3=c("I","B","A","D")
lhs4=c("A","C","B","D")
df <- data.frame(lhs1,lhs2,lhs3,lhs4)
lhs1 lhs2 lhs3 lhs4
1 A B I A
2 D A B C
3 C C A B
4 B I D D
And I want to add four more columns that show the sale for each letter, based on the values in this lookup table:
category <- c("A","B","C","D","E","I")
sale <- c(12,23,34,35,38,42)
look <- data.frame(category,sale)
category sale
A 12
B 23
C 34
D 35
E 38
I 42
So my data frame will look like this:
lhs1 lhs2 lhs3 lhs4 lhs1.sale lhs2.sale lhs3.sale lhs4.sale
A B I A 12 23 42 12
D A B C 35 12 23 34
C C A B 34 34 12 23
B I D D 23 42 35 35
Kindly help me create a loop that can perform multiple vlookups in R.
Try this
df[paste(names(df), "sale", sep = ".")] <- look$sale[match(unlist(df), look$category)]
df
# lhs1 lhs2 lhs3 lhs4 lhs1.sale lhs2.sale lhs3.sale lhs4.sale
# 1 A B I A 12 23 42 12
# 2 D A B C 35 12 23 34
# 3 C C A B 34 34 12 23
# 4 B I D D 23 42 35 35
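For readers unfamiliar with match(): it returns the position of each element of its first argument within its second, and indexing the sale vector by those positions is exactly a vlookup. A minimal standalone sketch (the query values here are made up for illustration):

```r
category <- c("A", "B", "C", "D", "E", "I")
sale <- c(12, 23, 34, 35, 38, 42)

# match() returns the position of each query value in 'category'
pos <- match(c("B", "I", "A"), category)
pos
# [1] 2 6 1

# indexing 'sale' by those positions performs the lookup
sale[pos]
# [1] 23 42 12
```

In the answer above, unlist(df) flattens the data frame column by column, so one match() call covers all four columns at once, and assigning the resulting vector to four new columns fills them back in the right (column-major) order.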
Here's a data.table solution.
library(data.table)
setkey(setDT(look),category) # convert look to data.table; index on category
cn <- paste0(names(df),".sales") # names for the new columns
setDT(df)[,c(cn):=lapply(.SD,function(col)look[col]$sale)]
df
# lhs1 lhs2 lhs3 lhs4 lhs1.sales lhs2.sales lhs3.sales lhs4.sales
# 1: A B I A 12 23 42 12
# 2: D A B C 35 12 23 34
# 3: C C A B 34 34 12 23
# 4: B I D D 23 42 35 35
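Another base-R idiom, shown here as an additional sketch rather than part of either answer above, is to use a named vector as the lookup table, which reads a little like a dictionary:

```r
lhs1 <- c("A", "D", "C", "B")
lhs2 <- c("B", "A", "C", "I")
lhs3 <- c("I", "B", "A", "D")
lhs4 <- c("A", "C", "B", "D")
df <- data.frame(lhs1, lhs2, lhs3, lhs4)

look <- data.frame(category = c("A", "B", "C", "D", "E", "I"),
                   sale = c(12, 23, 34, 35, 38, 42))

# build a named vector: "A" -> 12, "B" -> 23, ...
lookup <- setNames(look$sale, look$category)

# index it by each column's values; unname() drops the carried-over names
df[paste0(names(df), ".sale")] <- lapply(df, function(col)
  unname(lookup[as.character(col)]))
df
```

The as.character() call is defensive: it makes the lookup work whether the columns are character or (in older R versions) factors.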
This was difficult to articulate, so I have created an example. I have a set of vector variables which I'd like to combine into a data frame. These are examples of the vectors:
a <- 2:7
b <- 9:14
c <- 25:30
d <- 31:36
I have a character vector whose items reference the variable names above:
vars <- c("a","b","c","d")
I would like a way to combine those vectors using the names stored in vars. Is there a kind of data.frame(get(vars)) instruction that will join them all into a data frame?
The result would look like this
df_result <- data.frame(a= 2:7, b = 9:14, c = 25:30, d = 31:36 )
> df_result
a b c d
1 2 9 25 31
2 3 10 26 32
3 4 11 27 33
4 5 12 28 34
5 6 13 29 35
6 7 14 30 36
Maybe it can't be done with one instruction; if there are any workarounds, that would be great too.
Any help greatly appreciated. Many thanks.
Making use of mget you could do:
a <- 2:7
b <- 9:14
c <- 25:30
d <- 31:36
vars <- c("a","b","c","d")
vals <- mget(vars, envir = globalenv())  # fetch the vectors by name
vals <- do.call("cbind", vals)           # bind them into a matrix
data.frame(vals)
#> a b c d
#> 1 2 9 25 31
#> 2 3 10 26 32
#> 3 4 11 27 33
#> 4 5 12 28 34
#> 5 6 13 29 35
#> 6 7 14 30 36
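A small note on the answer above: since mget() already returns a named list and data.frame() accepts a named list directly, the do.call("cbind", ...) step can be skipped entirely:

```r
a <- 2:7
b <- 9:14
c <- 25:30
d <- 31:36
vars <- c("a", "b", "c", "d")

# data.frame() takes the named list from mget() directly
df_result <- data.frame(mget(vars))
df_result
```

An added benefit is that cbind() on vectors goes through a matrix and therefore coerces everything to one common type, while data.frame() on a list keeps each column's original type.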
I need to order a data.frame based on 2 columns and a given vector of variables.
Here is an example of my df:
df = data.frame(A = rnorm(45),
                B = rep(c('a', 'b', 'c'), each = 5, times = 3),
                C = rep(c(10, 20, 30), each = 15))
I need to change the order of col B from c('a', 'b', 'c') to c('c', 'a', 'b') while still keeping col C fixed to the 3 variables groups.
Here the first 30 rows of my desired output:
A B C
-0.11451485 c 10
-0.11860742 c 10
0.08156183 c 10
1.11850750 c 10
-0.79072556 c 10
1.24141030 a 10
0.88538811 a 10
-1.35548712 a 10
0.05723677 a 10
0.14660464 a 10
-0.28587107 b 10
0.59452832 b 10
1.00163605 b 10
1.15892322 b 10
-1.41771696 b 10
-2.05743546 c 20
-1.22835358 c 20
1.50060736 c 20
-0.14956114 c 20
-1.13126592 c 20
1.08571256 a 20
-1.04991699 a 20
-1.50655996 a 20
-0.63675392 a 20
-0.26485423 a 20
0.30509657 b 20
0.85471772 b 20
-0.54064736 b 20
0.24578056 b 20
0.14917900 b 20
Any help will be really appreciated,
thanks
The key is to change the level of the factor column. After that, we can use arrange from the dplyr package to sort multiple columns. Notice that in your original post, sorting column A is not a requirement. I just add column A to the arrange call to show it is easy to include more than two columns to the arrange function.
library(dplyr)
df2 <- df %>%
  # Change the levels of the factor
  mutate(B = factor(B, levels = c("c", "a", "b"))) %>%
  # Arrange the columns
  arrange(C, B, A)
df2
# A B C
# 1 -2.39317699 c 10
# 2 -1.48901928 c 10
# 3 -0.42562766 c 10
# 4 0.03383395 c 10
# 5 0.66362189 c 10
# 6 -0.65324997 a 10
# 7 -0.59408686 a 10
# 8 0.37012883 a 10
# 9 0.53238177 a 10
# 10 3.03972004 a 10
# 11 -2.03192274 b 10
# 12 -1.05138447 b 10
# 13 -0.80795342 b 10
# 14 1.74526091 b 10
# 15 2.07681466 b 10
# 16 -1.90573715 c 20
# 17 -0.72626244 c 20
# 18 -0.48017481 c 20
# 19 -0.42995920 c 20
# 20 0.17729002 c 20
# 21 -0.62947278 a 20
# 22 -0.40038152 a 20
# 23 -0.23368555 a 20
# 24 0.44218806 a 20
# 25 1.58561071 a 20
# 26 -0.66270426 b 20
# 27 -0.50256255 b 20
# 28 -0.19890974 b 20
# 29 0.26562533 b 20
# 30 1.84093124 b 20
# 31 -0.93702848 c 30
# 32 0.10804529 c 30
# 33 0.25758608 c 30
# 34 1.33084399 c 30
# 35 1.67204875 c 30
# 36 -1.88922564 a 30
# 37 -1.74551938 a 30
# 38 -1.32215854 a 30
# 39 -0.43743607 a 30
# 40 1.07554466 a 30
# 41 -0.38154167 b 30
# 42 0.53823057 b 30
# 43 0.83401316 b 30
# 44 1.04418363 b 30
# 45 2.45985490 b 30
You can first use factor() to order your B column by levels you define. With that, you can order your data frame by C and B to get your desired output.
Generating some data:
set.seed(10)
df = data.frame(A = rnorm(45),
                B = rep(c('a', 'b', 'c'), each = 5, times = 3),
                C = rep(c(10, 20, 30), each = 15))
And using levels to re-level your factor before ordering the data frame:
df$B <- factor(df$B, levels = c('c', 'a', 'b'))
df <- df[order(df$C, df$B), ]
Output (first 20 rows):
1.0177950 c 10
0.75578151 c 10
-0.23823356 c 10
0.98744470 c 10
0.74139013 c 10
0.01874617 a 10
-0.18425254 a 10
-1.37133055 a 10
-0.59916772 a 10
0.29454513 a 10
0.38979430 b 10
-1.20807618 b 10
-0.36367602 b 10
-1.62667268 b 10
-0.25647839 b 10
-0.37366156 c 20
-0.68755543 c 20
-0.87215883 c 20
-0.10176101 c 20
-0.25378053 c 20
I'm working with some data on repeated measures of subjects over time. The data is in this format:
Subject <- as.factor(c(rep("A", 20), rep("B", 35), rep("C", 13)))
variable.A <- rnorm(n = length(Subject), mean = 300, sd = 50)
dat <- data.frame(Subject, variable.A)
dat
Subject variable.A
1 A 334.6567
2 A 353.0988
3 A 244.0863
4 A 284.8918
5 A 302.6442
6 A 298.3162
7 A 271.4864
8 A 268.6848
9 A 262.3761
10 A 341.4224
11 A 190.4823
12 A 297.1981
13 A 319.8346
14 A 343.9855
15 A 332.5318
16 A 221.9502
17 A 412.9172
18 A 283.4206
19 A 310.9847
20 A 276.5423
21 B 181.5418
22 B 340.5812
23 B 348.5162
24 B 364.6962
25 B 312.2508
26 B 278.9855
27 B 242.8810
28 B 272.9585
29 B 239.2776
30 B 254.9140
31 B 253.8940
32 B 330.1918
33 B 300.7302
34 B 237.6511
35 B 314.4919
36 B 239.6195
37 B 282.7955
38 B 260.0943
39 B 396.5310
40 B 325.5422
41 B 374.8063
42 B 363.1897
43 B 258.0310
44 B 358.8605
45 B 251.8775
46 B 299.6995
47 B 303.4766
48 B 359.8955
49 B 299.7089
50 B 289.3128
51 B 401.7680
52 B 276.8078
53 B 441.4852
54 B 232.6222
55 B 305.1977
56 C 298.4580
57 C 210.5164
58 C 272.0228
59 C 282.0540
60 C 207.8797
61 C 263.3859
62 C 324.4417
63 C 273.5904
64 C 348.4389
65 C 174.2979
66 C 363.4353
67 C 260.8548
68 C 306.1833
I've used the seq_along() function and the dplyr package to create an index of each observation for every subject:
dat <- as.data.frame(dat %>%
group_by(Subject) %>%
mutate(index = seq_along(Subject)))
Subject variable.A index
1 A 334.6567 1
2 A 353.0988 2
3 A 244.0863 3
4 A 284.8918 4
5 A 302.6442 5
6 A 298.3162 6
7 A 271.4864 7
8 A 268.6848 8
9 A 262.3761 9
10 A 341.4224 10
11 A 190.4823 11
12 A 297.1981 12
13 A 319.8346 13
14 A 343.9855 14
15 A 332.5318 15
16 A 221.9502 16
17 A 412.9172 17
18 A 283.4206 18
19 A 310.9847 19
20 A 276.5423 20
21 B 181.5418 1
22 B 340.5812 2
23 B 348.5162 3
24 B 364.6962 4
25 B 312.2508 5
26 B 278.9855 6
27 B 242.8810 7
28 B 272.9585 8
29 B 239.2776 9
30 B 254.9140 10
31 B 253.8940 11
32 B 330.1918 12
33 B 300.7302 13
34 B 237.6511 14
35 B 314.4919 15
36 B 239.6195 16
37 B 282.7955 17
38 B 260.0943 18
39 B 396.5310 19
40 B 325.5422 20
41 B 374.8063 21
42 B 363.1897 22
43 B 258.0310 23
44 B 358.8605 24
45 B 251.8775 25
46 B 299.6995 26
47 B 303.4766 27
48 B 359.8955 28
49 B 299.7089 29
50 B 289.3128 30
51 B 401.7680 31
52 B 276.8078 32
53 B 441.4852 33
54 B 232.6222 34
55 B 305.1977 35
56 C 298.4580 1
57 C 210.5164 2
58 C 272.0228 3
59 C 282.0540 4
60 C 207.8797 5
61 C 263.3859 6
62 C 324.4417 7
63 C 273.5904 8
64 C 348.4389 9
65 C 174.2979 10
66 C 363.4353 11
67 C 260.8548 12
68 C 306.1833 13
What I'm now looking to do is set up an analysis that looks at every 10 observations, so I'd like to create another column that basically gives me a number for every 10 observations. For example, Subject A would have a sequence of ten "1's" followed by a sequence of ten "2's" (IE, two groupings of 10). I've tried to use the rep() function but the issue I'm running into is that the other subjects don't have a number of observations that is divisible by 10.
Is there a way for the rep() function to just assign the grouping the next number, even if it doesn't have 10 total observations? For example, Subject B would have ten "1's", ten "2's" and then five "3's" (representing the last group of observations)?
You can use integer division %/% to generate the ids:
dat %>%
  group_by(Subject) %>%
  mutate(chunk_id = (seq_along(Subject) - 1) %/% 10 + 1) -> dat1
table(dat1$Subject, dat1$chunk_id)
# 1 2 3 4
# A 10 10 0 0
# B 10 10 10 5
# C 10 3 0 0
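To see why the formula works: (i - 1) %/% 10 + 1 maps positions 1–10 to chunk 1, 11–20 to chunk 2, and so on, and a trailing partial group simply gets the next id. A standalone illustration, with ceiling() as an equivalent form:

```r
i <- 1:25

chunk <- (i - 1) %/% 10 + 1
chunk
# [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3

# ceiling() gives the same grouping
all(chunk == ceiling(i / 10))
# [1] TRUE
```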
For a plain vanilla base R solution, you also could try this:
dat$newcol <- 1
dat$index <- ave(dat$newcol, dat$Subject, FUN = cumsum)
dat$chunk_id <- (dat$index - 1) %/% 10 + 1
which, when you run the table command as above gives you
table(dat$Subject, dat$chunk_id)
1 2 3 4
A 10 10 0 0
B 10 10 10 5
C 10 3 0 0
If you don't want the extra 'newcol' column, just assign NULL to remove it:
dat$newcol <- NULL
I have a data.table in R say df.
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number, a, b)
df
row.number a b
1 1 A 14
2 2 A 59
3 3 A 39
4 4 A 22
5 5 A 75
6 6 A 89
7 7 A 11
8 8 A 88
9 9 A 22
10 10 A 6
11 11 B 37
12 12 B 42
13 13 B 39
14 14 B 8
15 15 B 74
16 16 B 67
17 17 B 18
18 18 B 12
19 19 B 56
20 20 B 21
I want to take n rows (say 10) from the middle after arranging the records in increasing order of column b.
Use setorder to sort and .N to filter:
setorder(df, b)[(.N/2 - 10/2):(.N/2 + 10/2 - 1), ]
row.number a b
1: 11 B 36
2: 5 A 38
3: 8 A 41
4: 18 B 43
5: 1 A 50
6: 12 B 51
7: 15 B 54
8: 3 A 55
9: 20 B 59
10: 4 A 60
You could use the following code
library(data.table)
set.seed(9876) # for reproducibility
# your data
row.number <- c(1:20)
a <- c(rep("A", 10), rep("B", 10))
b <- c(sample(c(0:100), 20, replace = TRUE))
df <- data.table(row.number,a,b)
df
# define how many to select and store in n
n <- 10
# calculate how many to cut off at start and end
n_not <- (nrow(df) - n )/2
# use data.tables setorder to arrange based on column b
setorder(df, b)
# select the rows wanted based on n
df[(n_not + 1):(nrow(df) - n_not), ]
Please let me know whether this is what you want.
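For comparison, here is a plain base-R sketch of the same cut, assuming nrow(df) - n is even so the trim is symmetric; the data and seed below are made up for the example:

```r
set.seed(1)
df <- data.frame(row.number = 1:20,
                 a = rep(c("A", "B"), each = 10),
                 b = sample(0:100, 20, replace = TRUE))

n <- 10
ord <- df[order(df$b), ]                 # sort by column b
start <- (nrow(ord) - n) %/% 2 + 1       # first row of the middle block
mid <- ord[start:(start + n - 1), ]      # the middle n rows
nrow(mid)
# [1] 10
```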
I have a data frame with 150000 lines in long format with multiple occurrences of the same id variable. I'm using reshape (from stats, rather than the reshape(2) packages) to convert this to wide format. I am generating a variable to count each occurrence of a given level of id to use as an index.
I've got this working with a small dataframe using plyr, but it is far too slow for my full df. Can I programme this more efficiently?
I've struggled doing this with the reshape package as I have around 30 other variables. It may be best to reshape only what I'm looking at (rather than the whole df) for each individual analysis.
> # u=id variable with three value variables
> u<-c(rep("a",4), rep("b", 3),rep("c", 6), rep("d", 5))
> u<-factor(u)
> v<-1:18
> w<-20:37
> x<-40:57
> df<-data.frame(u,v,w,x)
> df
u v w x
1 a 1 20 40
2 a 2 21 41
3 a 3 22 42
4 a 4 23 43
5 b 5 24 44
6 b 6 25 45
7 b 7 26 46
8 c 8 27 47
9 c 9 28 48
10 c 10 29 49
11 c 11 30 50
12 c 12 31 51
13 c 13 32 52
14 d 14 33 53
15 d 15 34 54
16 d 16 35 55
17 d 17 36 56
18 d 18 37 57
>
> library(plyr)
> df2<-ddply(df, .(u), transform, count=rank(u, ties.method="first"))
> df2
u v w x count
1 a 1 20 40 1
2 a 2 21 41 2
3 a 3 22 42 3
4 a 4 23 43 4
5 b 5 24 44 1
6 b 6 25 45 2
7 b 7 26 46 3
8 c 8 27 47 1
9 c 9 28 48 2
10 c 10 29 49 3
11 c 11 30 50 4
12 c 12 31 51 5
13 c 13 32 52 6
14 d 14 33 53 1
15 d 15 34 54 2
16 d 16 35 55 3
17 d 17 36 56 4
18 d 18 37 57 5
> reshape(df2, idvar="u", timevar="count", direction="wide")
u v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
1 a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
5 b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
8 c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
14 d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA
I still can't quite figure out why you would want to ultimately convert your dataset from long to wide, because to me, that seems like it would be an extremely unwieldy dataset to work with.
If you're looking to speed up the enumeration of your factor levels, you can consider using ave() in base R, or 1:.N by group from the "data.table" package. Considering that you are working with a lot of rows, you might want to consider the latter.
First, let's make up some data:
set.seed(1)
df <- data.frame(u = sample(letters[1:6], 150000, replace = TRUE),
                 v = runif(150000, 0, 10),
                 w = runif(150000, 0, 100),
                 x = runif(150000, 0, 1000))
list(head(df), tail(df))
# [[1]]
# u v w x
# 1 b 6.368412 10.52822 223.6556
# 2 c 6.579344 75.28534 450.7643
# 3 d 6.573822 36.87630 283.3083
# 4 f 9.711164 66.99525 681.0157
# 5 b 5.337487 54.30291 137.0383
# 6 f 9.587560 44.81581 831.4087
#
# [[2]]
# u v w x
# 149995 b 4.614894 52.77121 509.0054
# 149996 f 5.104273 87.43799 391.6819
# 149997 f 2.425936 60.06982 160.2324
# 149998 a 1.592130 66.76113 118.4327
# 149999 b 5.157081 36.90400 511.6446
# 150000 a 3.565323 92.33530 252.4982
table(df$u)
#
# a b c d e f
# 25332 24691 24993 24975 25114 24895
Load our required packages:
library(plyr)
library(data.table)
Create a "data.table" version of our dataset
DT <- data.table(df, key = "u")
DT # Notice that the data are now automatically sorted
# u v w x
# 1: a 6.2378578 96.098294 643.2433
# 2: a 5.0322400 46.806132 544.6883
# 3: a 9.6289786 87.915303 334.6726
# 4: a 4.3393403 1.994383 753.0628
# 5: a 6.2300123 72.810359 579.7548
# ---
# 149996: f 0.6268414 15.608049 669.3838
# 149997: f 2.3588955 40.380824 658.8667
# 149998: f 1.6383619 77.210309 250.7117
# 149999: f 5.1042725 87.437989 391.6819
# 150000: f 2.4259363 60.069820 160.2324
DT[, .N, by = key(DT)] # Like "table"
# u N
# 1: a 25332
# 2: b 24691
# 3: c 24993
# 4: d 24975
# 5: e 25114
# 6: f 24895
Now let's run a few basic tests. The results from ave() aren't sorted, but they are in "data.table" and "plyr", so we should also test the timing for sorting when using ave().
system.time(AVE <- within(df, {
count <- ave(as.numeric(u), u, FUN = seq_along)
}))
# user system elapsed
# 0.024 0.000 0.027
# Now time the sorting
system.time(AVE2 <- AVE[order(AVE$u, AVE$count), ])
# user system elapsed
# 0.264 0.000 0.262
system.time(DDPLY <- ddply(df, .(u), transform,
count=rank(u, ties.method="first")))
# user system elapsed
# 0.944 0.000 0.984
system.time(DT[, count := 1:.N, by = key(DT)])
# user system elapsed
# 0.008 0.000 0.004
all(DDPLY == AVE2)
# [1] TRUE
all(data.frame(DT) == AVE2)
# [1] TRUE
That syntax for "data.table" sure is compact, and its speed is blazing!
Using base R to create an empty matrix and then fill it in appropriately can often be significantly faster. In the code below I suspect the slow part would be converting the data frame to a matrix and transposing, as in the first two lines; if so, that could perhaps be avoided if it could be stored differently to start with.
g <- df$u   # the id column of the question's data frame
x <- t(as.matrix(df[,-1]))
k <- split(seq_along(g), g)
n <- max(sapply(k, length))
out <- matrix(ncol=n*nrow(x), nrow=length(k))
for(idx in seq_along(k)) {
out[idx, seq_len(length(k[[idx]])*nrow(x))] <- x[,k[[idx]]]
}
rownames(out) <- names(k)
colnames(out) <- paste(rep(rownames(x), n), rep(seq_len(n), each=nrow(x)), sep=".")
out
# v.1 w.1 x.1 v.2 w.2 x.2 v.3 w.3 x.3 v.4 w.4 x.4 v.5 w.5 x.5 v.6 w.6 x.6
# a 1 20 40 2 21 41 3 22 42 4 23 43 NA NA NA NA NA NA
# b 5 24 44 6 25 45 7 26 46 NA NA NA NA NA NA NA NA NA
# c 8 27 47 9 28 48 10 29 49 11 30 50 12 31 51 13 32 52
# d 14 33 53 15 34 54 16 35 55 17 36 56 18 37 57 NA NA NA