This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 6 years ago.
I'm cleaning a dataset, but its shape is not ideal and I have to reshape it, but I don't know how. The following is the original data frame:
Rater Rater ID Ratee1 Ratee2 Ratee3 Ratee1.item1 Ratee1.item2 Ratee2.item1 Ratee2.item2 Ratee3.item1 Ratee3.item2
A 12 701 702 800 1 2 3 4 5 6
B 23 45 46 49 3 3 3 3 3 3
C 24 80 81 28 2 3 4 5 6 9
I am wondering how to reshape it into the format below:
Rater Rater ID Ratee item1 item2
A 12 701 1 2
A 12 702 3 4
A 12 800 5 6
B 23 45 3 3
B 23 46 3 3
B 23 49 3 3
C 24 80 2 3
C 24 81 4 5
C 24 28 6 9
This reshaping is a little bit different from this one (Reshaping data.frame from wide to long format), as I have three parts in the original data.
The first part is the rater's ID (Rater and Rater ID).
The second is the ratees' IDs (Ratee1, Ratee2, Ratee3).
The third part is the rater's rating of each ratee (Ratee*.item1 and Ratee*.item2).
To make it clearer, let me briefly describe the data collection process.
First, a rater types in his own name and ID,
then nominates three persons (Ratee1 to Ratee3),
and then answers the questions regarding each ratee (for each ratee, there are two questions).
Does anyone know how to reshape this? Thanks!
We can use melt from data.table:
library(data.table)
# melt three sets of measure columns at once: the Ratee IDs,
# the item1 ratings, and the item2 ratings
melt(setDT(df1), measure = patterns("^Ratee\\d+$", "^Ratee\\d+\\.item1",
                                    "^Ratee\\d+\\.item2"),
     value.name = c("Ratee", "item1", "item2"))[,
  variable := NULL][order(Rater)]
# Rater RaterID Ratee item1 item2
#1: A 12 701 1 2
#2: A 12 702 3 4
#3: A 12 800 5 6
#4: B 23 45 3 3
#5: B 23 46 3 3
#6: B 23 49 3 3
#7: C 24 80 2 3
#8: C 24 81 4 5
#9: C 24 28 6 9
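Data (reconstructed from the table in the question; I'm assuming the rater ID column is named RaterID, as in the output above):
df1 <- data.frame(
  Rater = c("A", "B", "C"),
  RaterID = c(12L, 23L, 24L),
  Ratee1 = c(701L, 45L, 80L),
  Ratee2 = c(702L, 46L, 81L),
  Ratee3 = c(800L, 49L, 28L),
  Ratee1.item1 = c(1L, 3L, 2L),
  Ratee1.item2 = c(2L, 3L, 3L),
  Ratee2.item1 = c(3L, 3L, 4L),
  Ratee2.item2 = c(4L, 3L, 5L),
  Ratee3.item1 = c(5L, 3L, 6L),
  Ratee3.item2 = c(6L, 3L, 9L)
)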
I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new data set where, for each client ID, I have all the purchases in the last month, or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date, and then I only get a small percentage of all customers, or I choose a range and get multiple observations for certain customers.
(In this case, I wouldn't mind getting the earliest observation.)
An important note: I know how to create a for loop to solve this problem, but since the dataset has over 4 million observations, that isn't practical; it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
dd %>%
  group_by(ID) %>%
  mutate(seq = row_number()) %>%   # number the purchases within each client
  pivot_wider(id_cols = "ID", names_from = "seq",
              values_from = c("Date", "Sum"))
Where dd is your sample data frame above.
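By default pivot_wider names the new columns Date_1, Sum_1, and so on. If you want names like 1Date and 1Sum as in your example output, a small variation with names_glue should work (assuming tidyr >= 1.1):
dd %>%
  group_by(ID) %>%
  mutate(seq = row_number()) %>%
  pivot_wider(id_cols = "ID", names_from = "seq",
              values_from = c("Date", "Sum"),
              names_glue = "{seq}{.value}")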
I have a .txt file that consists of numbers separated by spaces. Each row has a different amount of numbers in it. I need to do market basket analysis on the data; however, I can't seem to load the data properly (especially because there is a different number of items in each 'basket'). What is the best way to store the data so I can find the frequent items and then check for frequent items in each basket?
Example of data:
1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54
You should be able to read the input with readLines and then use lapply to parse each line into numerics. Assume the data is in a file named txt.txt:
dat <- lapply( readLines("txt.txt"), function(Line) scan(text=Line) )
The reason I didn't suggest read.table with fill=TRUE (which would give you something similar to the other answer that has appeared) is that the column structure is not needed, unless there is information encoded in the position of those numbers. I'm also wondering whether there might be additional information encoded in the individual lines, such as regions or stores or some other entity as the source of particular numbered items; that would be a reason to keep the data in a list structure with an uneven count. You can get a global enumeration just with table:
table( unlist(dat) )
1 2 3 4 5 6 7 8 21 23 32 34 43 54 67 145 154 456
1 4 2 3 2 1 1 1 2 1 2 1 1 1 1 1 1 1
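If the goal is market basket analysis, this list can then be coerced to the transactions class used by the arules package. A minimal sketch, assuming arules is installed and each number is an item ID:
library(arules)
# transactions expects character item labels, so convert the numerics first
trans <- as(lapply(dat, as.character), "transactions")
itemFrequency(trans)   # relative frequency of each item across baskets
inspect(trans[1:2])    # look at the first two baskets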
Another option, if you do want a rectangular structure:
library(dplyr)
library(tidyr)
my_text = '1 2 4 3 67 43 154
4 5 3 21 2
2 4 5 32 145
2 6 7 8 23 456 32 21 34 54'
my_text2 <- strsplit(my_text, split = '\n')
my_text2 <- lapply(my_text2, trimws)
my_text2 %>%
do.call('rbind',.) %>%
t %>%
as.data.frame() %>%
separate(V1, sep = ' ',into = paste('col_', 1:10))
col_ 1 col_ 2 col_ 3 col_ 4 col_ 5 col_ 6 col_ 7 col_ 8 col_ 9 col_ 10
1 1 2 4 3 67 43 154 <NA> <NA> <NA>
2 4 5 3 21 2 <NA> <NA> <NA> <NA> <NA>
3 2 4 5 32 145 <NA> <NA> <NA> <NA> <NA>
4 2 6 7 8 23 456 32 21 34 54
I am trying to get all combinations of values per group. I want to prevent combination of values between different groups.
To create all combinations of values (no matter which group the value belongs to), I can use:
expand.grid(value, value)
The expected result should be a subset of the result of the previous command.
Example:
#base data
value = c(1,3,5, 1,5,7,9, 2)
group = c("a", "a", "a","b","b","b","b", "c")
base <- data.frame(value, group)
#creating ALL combinations of value
allComb <- expand.grid(base$value, base$value)
#expected result is a subset of allComb.
#Note: the first column shows the row number from allComb.
#Empty rows separate the combinations per group and are shown only for clarification.
Var1 Var2
1 1 1
2 3 1
3 5 1
11 1 3
12 3 3
13 5 3
21 1 5
22 3 5
23 5 5

34 1 1
35 5 1
36 7 1
37 9 1
44 1 5
45 5 5
46 7 5
47 9 5
54 1 7
55 5 7
56 7 7
57 9 7
64 1 9
65 5 9
66 7 9
67 9 9

78 2 2
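For reference, one way to get this (a minimal sketch, not from the original question): split value by group, run expand.grid within each group, and bind the pieces back together. Row names will differ from the allComb row numbers shown above.
groupComb <- do.call(rbind,
  lapply(split(base$value, base$group),
         function(v) expand.grid(Var1 = v, Var2 = v)))
groupComb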
This question already has answers here:
R Partial Reshape Data from Long to Wide
(2 answers)
Closed 6 years ago.
I am struggling to reshape this df into a different one. I have this:
ID task mean sd mode
1 0 2 10 1.5 223
2 0 2 21 2.4 213
3 0 2 24 4.3 232
4 1 3 26 2.2 121
5 1 3 29 1.3 433
6 1 3 12 2.3 456
7 2 4 45 4.3 422
8 2 4 67 5.3 443
9 2 4 34 2.1 432
and I would like to reshape it, discarding sd and mode and spreading the three means for each ID and task across columns, like this:
ID task mean mean1 mean2
1 0 2 10 21 24
2 1 3 26 29 12
3 2 4 45 67 34
Thanks a lot for your help in advance
You first need to create a new column by which we can pivot the mean values. Using data.table, this approach works:
library(data.table)
dt <- data.table(df) # Convert to data.table
dcast(dt[, nr := seq(task), by = .(ID)],  # nr numbers the rows within each ID
      ID + task ~ nr,
      value.var = "mean")
# ID task 1 2 3
#1: 0 2 10 21 24
#2: 1 3 26 29 12
#3: 2 4 45 67 34
Afterwards, you can rename the columns to whatever you want them to be called.
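For example, with data.table's setnames (res is just a name introduced here for the dcast result):
res <- dcast(dt[, nr := seq(task), by = .(ID)],
             ID + task ~ nr, value.var = "mean")
setnames(res, c("1", "2", "3"), c("mean", "mean1", "mean2"))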
A base R alternative with reshape:
reshape(cbind(df, time = ave(df$ID, df$ID, FUN = seq_along)),
        dir = 'wide', idvar = c('ID', 'task'),
        drop = c('sd', 'mode'), sep = '')
## ID task mean1 mean2 mean3
## 1 0 2 10 21 24
## 4 1 3 26 29 12
## 7 2 4 45 67 34
Data
df <- data.frame(
  ID = c(0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L),
  task = c(2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L),
  mean = c(10L, 21L, 24L, 26L, 29L, 12L, 45L, 67L, 34L),
  sd = c(1.5, 2.4, 4.3, 2.2, 1.3, 2.3, 4.3, 5.3, 2.1),
  mode = c(223L, 213L, 232L, 121L, 433L, 456L, 422L, 443L, 432L)
)
Here is a problem I am trying to solve. Say I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to connect the appropriate location to each measurement.
I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement location
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since you have millions of rows, you should give a data.table solution a try:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
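With recent data.table versions (1.9.8+), a non-equi join can avoid materializing the cartesian product first. A sketch of that idea (not from the original answer):
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
# keep OBS rows whose time falls strictly inside (time1, time2) for the same id
OBS[SAM, on = .(id, time > time1, time < time2), nomatch = 0L,
    .(id, time = x.time, measurement = x.measurement, location = i.location)]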