Reshape data into long format, repeating range of ids for every variable [closed] - r

Closed. This question needs debugging details. It is not currently accepting answers.
Closed 5 years ago.
I want to reshape my data into long format, but I would like the entire range of ids to be repeated for each variable in my data set, even for those ids on which the variable takes no value. At the moment I can only get long data with ids for the entries where a variable actually has a value.
Suppose my data has 15 variables and 20 possible ids. I want to create a long form of this data that is 15*20 rows in length (the range of ids, repeated for each variable): each repeated range of ids shows the values taken by variable1 for id1, id2, id3, etc. until the end of the range of ids is reached, then variable2 is displayed for id1, id2, id3, and so on.
I am unsure of how to do this in R; I am currently using the reshape package.

You can use the rep function:
v1 <- 1:5
v2 <- 1:6
rep(v1, each = 6)
# 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5
rep(v2, 5)
#1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6 1 2 3 4 5 6
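The two rep() calls can also be produced in one step with base R's expand.grid(), which crosses the vectors for you (a sketch with the same toy vectors; the column names id and variable are just illustrative):

```r
v1 <- 1:5
v2 <- 1:6

# expand.grid crosses the inputs; the first column varies fastest,
# so the full range of v2 is repeated for each value of v1
grid <- expand.grid(id = v2, variable = v1)
nrow(grid)  # 30 rows: 6 ids repeated for each of 5 variables
```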

Yeah, this is hard to work with, but you're looking for the melt function, I think:
library(reshape2)
melt(yourdata, id.vars = 'ID COLUMN')
This will return a 300 x 3 data set that looks like:
ID COLUMN variable value
1 col2 7
1 col3 8
.... .... ....
20 col14 99
20 col15 100
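To get the full id x variable grid even when some ids have no value for a variable (the original requirement), one option is to merge the long data against every id/variable combination. A base-R sketch with made-up toy data (the column names and values here are illustrative, not from the question):

```r
# Toy long data: ids 1:3 and variables "a", "b", but three
# id/variable pairs have no row at all
long <- data.frame(id = c(1, 1, 2),
                   variable = c("a", "b", "a"),
                   value = c(7, 8, 9))

# Merge against the complete grid; absent pairs come back as NA
full <- merge(expand.grid(id = 1:3, variable = c("a", "b")),
              long, all.x = TRUE)
nrow(full)  # 6 rows = 3 ids x 2 variables
```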


Go through a column and collect a running total in new column [duplicate]

This question already has answers here:
Creation of a specific vector without loop or recursion in R
(2 answers)
Split data.frame by value
(2 answers)
Closed 4 years ago.
I have a dataframe whose rows represent people. For a given family, the first row has the value 1 in column A, and all following rows contain members of the same family until another row in column A has the value 1. Then a new family starts.
I would like to assign IDs to all families in my dataset. In other words, I would like to take:
A
1
2
3
1
3
3
1
4
And turn it into:
A family_id
1 1
2 1
3 1
1 2
3 2
3 2
1 3
4 3
I'm playing with a dataframe of 3 million rows, so a simple for-loop solution I came up with falls short of necessary efficiency. Also, the family_id need not be sequential.
I'll take a dplyr solution.
data:
df <- data.frame(A = c(1:3,1,3,3,1,4))
code:
df$family_id <- cumsum(c(-1, diff(df$A)) < 0)
result:
# A family_id
#1 1 1
#2 2 1
#3 3 1
#4 1 2
#5 3 2
#6 3 2
#7 1 3
#8 4 3
Please note:
This solution starts a new group when a number occurs that is smaller than the previous one.
If it's 100% certain that a new group always begins with a 1, then Ronak's solution is perfect.
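If a new family is in fact always flagged by a 1 in column A, the id is just a running count of the 1s. A minimal sketch on the example data (the same one-liner works inside dplyr's mutate() for those who want the dplyr form):

```r
df <- data.frame(A = c(1:3, 1, 3, 3, 1, 4))

# Each 1 starts a new family, so a cumulative sum of the
# indicator A == 1 labels every row with its family number
df$family_id <- cumsum(df$A == 1)
```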

what is this function doing? replication [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 5 years ago.
library(dplyr)  # provides %>%, bind_rows, mutate, select, group_by

rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1) {
  rep_tbl <- replicate(reps,
                       tbl[sample(1:nrow(tbl), size, replace = replace), ],
                       simplify = FALSE) %>%
    bind_rows() %>%
    mutate(replicate = rep(1:reps, each = size)) %>%
    select(replicate, everything()) %>%
    group_by(replicate)
  return(rep_tbl)
}
Hey, can anyone help me here? What is this function doing? Is the first line setting the variables of the function? And then what is this "replicate" doing? Thanks!
This function replicates your data. Let's say we have a dataset of 10 observations. In order to come up with additional like-datasets of your current one, you can replicate it by introducing random sampling of your dataset.
You can check out the Wikipedia page on statistical replication if you're more curious.
Let's take a simple dataframe:
df <- data.frame(x = 1:10, y = 1:10)
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
If we want to take a random sample of this, we can use the function rep_sample_n, which takes two required arguments (tbl, size) and two optional arguments (replace = FALSE, reps = 1).
Here is an example of us just taking 4 randomly selected rows from our data.
rep_sample_n(df, 4)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 1 1
2 1 3 3
3 1 4 4
4 1 10 10
Now if we want to randomly sample 15 observations from a 10-observation dataset, it will throw an error. The replace = FALSE argument doesn't allow that, because each time a sample row is chosen it is removed from the pool for the next draw. In the example above, the 1st observation was chosen; when the 2nd draw was made (we asked for 4), only observations 2 through 10 were left, and the 3rd, 4th, and 10th were then picked from what remained. If we set replace = TRUE, every draw is made from the full dataset.
Notice how in this next example the 5th observation was chosen twice; that wouldn't happen with replace = FALSE.
rep_sample_n(df, 4, replace = TRUE)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 5 5
2 1 3 3
3 1 2 2
4 1 5 5
Lastly and most importantly, we have the reps argument, which is really the basis for this function. It allows you to randomly sample your dataset multiple times and then combine all those samples together.
Below, we have sampled our original 10-observation dataset by selecting 4 observations per sample and replicating that 5 times, so we have 5 different sample dataframes of 4 observations each, combined into one 20-observation dataframe. Each of the 5 samples is tagged with a replicate number, and the replicate column indicates which sample each observation belongs to.
rep_sample_n(df, 4, reps = 5)
# A tibble: 20 x 3
# Groups: replicate [5]
replicate x y
<int> <int> <int>
1 1 8 8
2 1 4 4
3 1 3 3
4 1 1 1
5 2 4 4
6 2 5 5
7 2 8 8
8 2 3 3
9 3 6 6
10 3 1 1
11 3 3 3
12 3 2 2
13 4 5 5
14 4 7 7
15 4 10 10
16 4 3 3
17 5 7 7
18 5 10 10
19 5 3 3
20 5 9 9
I hope this provided some clarity.
This function takes a data frame as input (and several input preferences). It takes a random sample of size rows from the table, with or without replacement as set by the replace input. It repeats that random sampling reps times.
Then, it binds all the samples together into a single data frame, adding a new column called "replicate" indicating which repetition of the sampling produced each row.
Finally, it "groups" the resulting table, preparing it for future group-wise operations with dplyr.
For general questions about specific functions, like "What is replicate doing?", you should look at the function's help page: type ?replicate or help("replicate") to get there. It includes a description of the function and examples of how to use it. If you read the description, run the examples, and are still confused, feel free to come back with a specific question and an example illustrating what you are confused by.
Similarly, for "Is the first line setting the variables of the function?", the arguments to function() are the inputs to the function. If you have basic questions about R like "How do functions work", have a look at An Introduction to R, or one of the other sources in the R Tag Wiki.
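To see replicate() on its own, outside the pipeline: it evaluates an expression a given number of times and collects the results (here as a list, via simplify = FALSE), which is exactly how rep_sample_n builds its repeated samples. A small sketch:

```r
set.seed(1)  # make the random draws reproducible

# Draw 4 of 10 row indices, three separate times
draws <- replicate(3, sample(1:10, 4), simplify = FALSE)
length(draws)    # 3 independent draws
lengths(draws)   # each draw has 4 indices
```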

Apply a maximum value to whole group [duplicate]

This question already has answers here:
Aggregate a dataframe on a given column and display another column
(8 answers)
Closed 6 years ago.
I have a df like this:
Id count
1 0
1 5
1 7
2 5
2 10
3 2
3 5
3 4
and I want to get the maximum count and apply that to the whole "group" based on ID, like this:
Id count max_count
1 0 7
1 5 7
1 7 7
2 5 10
2 10 10
3 2 5
3 5 5
3 4 5
I've tried pmax, slice, etc. I'm generally having trouble working with data that is in interval-specific form; if you could direct me to tools well-suited to that type of data, I would really appreciate it!
Figured it out with help from Gavin Simpson here: Aggregate a dataframe on a given column and display another column
maxcount <- aggregate(count ~ Id, data = df, FUN = max)
new_df<-merge(df, maxcount)
A better way:
df$max_count <- with(df, ave(count, Id, FUN = max))
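A self-contained version of the ave() one-liner on the question's data, for anyone who wants to run it directly:

```r
df <- data.frame(Id = c(1, 1, 1, 2, 2, 3, 3, 3),
                 count = c(0, 5, 7, 5, 10, 2, 5, 4))

# ave() computes FUN within each Id group and recycles the group
# result back to the original row order, so no merge step is needed
df$max_count <- with(df, ave(count, Id, FUN = max))
```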

Reshape a data frame in R but not with aggregated functions [duplicate]

This question already has answers here:
Reshape three column data frame to matrix ("long" to "wide" format) [duplicate]
(6 answers)
Closed 7 years ago.
I'm trying to build a pivot table from the data frame below. "VisitID" is the unique ID for a user who came to visit a website, "PageName" is the page they visited, and "Order" is the sequence of the pages they visited. For example, the first row of this data frame means "user 001 visited Homepage, which is the 1st page he/she visited".
VisitID PageName Order
001 Homepage 1
001 ContactUs 2
001 News 3
002 Homepage 1
002 Careers 2
002 News 3
The desired output should cast "VisitID" as rows and "Order" as columns, and fill the table with the "PageName":
1 2 3
001 Homepage ContactUs News
002 Homepage Careers News
I've thought about using reshape::cast to do the task, but I believe it only works when you give it an aggregation function. I might be wrong though. Thanks in advance to anyone who can offer help.
You don't need to aggregate. As long as there's only one row for each combination of columns in the casting formula, you'll get the value of value.var inserted in the output.
library(reshape2)
dcast(mydata, VisitID ~ Order, value.var="PageName")
Here's an example:
# Fake data
dat <- data.frame(group1 = rep(LETTERS[c(1, 1:3)], each = 2),
                  group2 = rep(letters[c(1, 1:3)]),
                  values = 1:8)
dat
group1 group2 values
1 A a 1
2 A a 2
3 A b 3
4 A c 4
5 B a 5
6 B a 6
7 C b 7
8 C c 8
Note that rows 1 and 2 have the same values of the group columns, as do rows 5 and 6. As a result, dcast aggregates by counting the number of values in each cell.
dcast(dat, group1 ~ group2, value.var="values")
Aggregation function missing: defaulting to length
group1 a b c
1 A 2 1 1
2 B 2 0 0
3 C 0 1 1
Now let's remove rows 1 and 5 to get rid of the duplicated group combinations. Since there's now only one value per cell, dcast returns the actual value rather than a count of the number of values.
dcast(dat[-c(1,5),], group1 ~ group2, value.var="values")
group1 a b c
1 A 2 3 4
2 B 6 NA NA
3 C NA 7 8
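The same long-to-wide cast can also be done without reshape2, using base R's reshape(); a sketch reproducing the de-duplicated example above:

```r
dat <- data.frame(group1 = rep(LETTERS[c(1, 1:3)], each = 2),
                  group2 = rep(letters[c(1, 1:3)]),
                  values = 1:8)

# Equivalent of dcast(dat[-c(1, 5), ], group1 ~ group2, value.var = "values"):
# idvar becomes the rows, timevar the columns, and missing cells are NA
wide <- reshape(dat[-c(1, 5), ], idvar = "group1", timevar = "group2",
                direction = "wide")
```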

How to find the first smaller value compared to the current row in subsequent rows? [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
Suppose this is the data:
data <- data.frame(number = c(4, 5, 3, 1, 0),
                   datetime = c(as.POSIXct("2015/06/12 12:10:25"),
                                as.POSIXct("2015/06/12 12:10:27"),
                                as.POSIXct("2015/06/12 12:10:32"),
                                as.POSIXct("2015/06/12 12:10:33"),
                                as.POSIXct("2015/06/12 12:10:35")))
number datetime
1 4 2015/06/12 12:10:25
2 5 2015/06/12 12:10:27
3 3 2015/06/12 12:10:32
4 1 2015/06/12 12:10:33
5 0 2015/06/12 12:10:35
I want to calculate the time between a row to the next smaller value. Desired output:
number next smaller time between
1 4 3 7
2 5 3 5
3 3 1 1
4 1 0 2
5 0 NA NA
Example: 3 is the first number in a subsequent row that is smaller than 4.
Any suggestions? Is there a package for this?
Well it's not pretty and probably not super efficient, but it seems to get the job done. Here we go ...
newcols <- lapply(seq_along(data$number), function(i) {
  # Position of the first later row whose number is smaller than row i's
  j <- which(data$number < data$number[i] & seq_along(data$number) > i)[1]
  c(data$number[j],
    as.numeric(difftime(data$datetime[j], data$datetime[i], units = "secs")))
})
setNames(
  cbind(data[1], do.call(rbind, newcols)),
  c("number", "nextsmallest", "timediff")
)
# number nextsmallest timediff
# 1 4 3 7
# 2 5 3 5
# 3 3 1 1
# 4 1 0 2
# 5 0 NA NA
If I understand what you're trying to do, I'd suggest starting by ordering your dataframe in ascending order by 'number'. Next, add a new column using a lag function to retrieve the time value from the previous row. Finally, calculate the difference.
I could provide code later if you need it, but hopefully that will give you something to start with.
