transpose groups of obs appearing more than once [duplicate] - r

This question already has answers here:
Transpose / reshape dataframe without "timevar" from long to wide format
(9 answers)
Reshape data frame by row [duplicate]
(4 answers)
Closed 4 years ago.
I have a dataframe like this one
familyid Year memberid count var
1 2000 1 2 5
1 2000 1 2 6
1 2000 2 1 8
2 2000 1 1 5
2 2000 2 1 4
3 2000 1 1 5
3 2000 2 2 7
3 2000 2 2 5
where the column count indicates how many times each observation appears in the dataframe. I want to transpose the dataframe only for the observations that appear more than once; in other words, I want to have
familyid Year memberid count var_1 var_2
1 2000 1 2 5 6
1 2000 2 1 8 NA
2 2000 1 1 5 NA
2 2000 2 1 4 NA
3 2000 1 1 5 NA
3 2000 2 2 7 5
What do you suggest to use for this purpose?
Thank you so much.
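One possible approach, sketched in base R only: number the observations within each familyid/Year/memberid group, then reshape to wide format on that index. The helper column obs and the result name wide are illustrative names, not part of the original data.

```r
# Rebuild the example data from the question.
df <- read.table(text = "
familyid Year memberid count var
1 2000 1 2 5
1 2000 1 2 6
1 2000 2 1 8
2 2000 1 1 5
2 2000 2 1 4
3 2000 1 1 5
3 2000 2 2 7
3 2000 2 2 5
", header = TRUE)

# Number the observations within each familyid/Year/memberid group (1, 2, ...).
# "obs" is a helper column introduced for this sketch.
df$obs <- ave(seq_along(df$var), df$familyid, df$Year, df$memberid,
              FUN = seq_along)

# Spread var into one column per within-group index (var_1, var_2, ...).
# Groups with fewer observations get NA in the unused columns.
wide <- reshape(df,
                idvar     = c("familyid", "Year", "memberid", "count"),
                timevar   = "obs",
                direction = "wide",
                sep       = "_")
rownames(wide) <- NULL
wide
```

Because the index is built per group, the same code should also handle groups with three or more repeats, producing var_3 and so on.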

Related

Keep rows with duplicated id in R [duplicate]

This question already has answers here:
Filtering a dataframe showing only duplicates
(4 answers)
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 3 months ago.
I want to select rows with duplicated id but keep both rows in the resulting dataset. Here is the original dataset:
dd <- data.frame(id=c(1,1,2,2,3,4,4,5,6,7,7),
coder=c(1,2,1,2,1,1,2,1,1,1,2)
)
dd
id coder
1 1
1 2
2 1
2 2
3 1
4 1
4 2
5 1
6 1
7 1
7 2
In the end, I want this:
id coder
1 1
1 2
2 1
2 2
4 1
4 2
7 1
7 2
I tried subset(dd, duplicated(id)) but it only kept one row:
id coder
1 2
2 2
4 2
7 2
How to achieve that?
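A minimal base R sketch of one way to do this: duplicated() only flags the second and later occurrences of a value, so scanning once from each end and combining the results flags every row whose id is repeated. The names dup_rows and res are illustrative.

```r
dd <- data.frame(id    = c(1, 1, 2, 2, 3, 4, 4, 5, 6, 7, 7),
                 coder = c(1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 2))

# Forward scan flags later occurrences; backward scan flags earlier ones.
dup_rows <- duplicated(dd$id) | duplicated(dd$id, fromLast = TRUE)
res <- dd[dup_rows, ]
res

# Alternative: keep rows whose id group contains more than one row.
res2 <- dd[ave(dd$id, dd$id, FUN = length) > 1, ]
```

Both variants keep the original row order and the original row names.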

Replacing NA with observed values? [duplicate]

This question already has answers here:
Filling missing value in group
(3 answers)
Replace NA with previous or next value, by group, using dplyr
(5 answers)
Closed 2 years ago.
I have a dataset that contains multiple observations per person. In some cases an individual will have their ethnicity recorded in some rows but missing in others. In R, how can I replace the NA's with the ethnicity stated in the other rows without having to manually change them?
Example:
PersonID Ethnicity
1 A
1 A
1 NA
1 NA
1 A
2 NA
2 B
2 NA
3 NA
3 NA
3 A
3 NA
Need:
PersonID Ethnicity
1 A
1 A
1 A
1 A
1 A
2 B
2 B
2 B
3 A
3 A
3 A
3 A
You could use fill from tidyr
library(dplyr)
library(tidyr)
df %>%
  group_by(PersonID) %>%
  fill(Ethnicity, .direction = "downup")
# A tibble: 12 x 2
# Groups: PersonID [3]
PersonID Ethnicity
<int> <fct>
1 1 A
2 1 A
3 1 A
4 1 A
5 1 A
6 2 B
7 2 B
8 2 B
9 3 A
10 3 A
11 3 A
12 3 A
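For comparison, a base R sketch without extra packages. It assumes each person has at most one distinct non-missing ethnicity, so it copies the first known value into every NA of that person; with conflicting values it would not behave like the fill approach.

```r
# Example data from the question; strings are kept as character, not factor.
df <- data.frame(
  PersonID  = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  Ethnicity = c("A", "A", NA, NA, "A", NA, "B", NA, NA, NA, "A", NA),
  stringsAsFactors = FALSE
)

# Within each person, replace NA with the first recorded ethnicity.
# Persons with no recorded value at all are left as NA.
df$Ethnicity <- ave(df$Ethnicity, df$PersonID, FUN = function(x) {
  known <- x[!is.na(x)]
  if (length(known) > 0) x[is.na(x)] <- known[1]
  x
})
df
```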

Generate data frame with parameters [duplicate]

This question already has answers here:
Fill missing dates by group
(3 answers)
Fastest way to add rows for missing time steps?
(4 answers)
Closed 3 years ago.
I have a data frame of ids with number column
df <- read.table(text="
id nr
1 1
2 1
1 2
3 1
1 3
", header=TRUE)
I'd like to create a new dataframe from it, where each id has every unique nr from the df dataframe. As you may notice, id 3 has only nr 1, but not 2 and 3. So the result should be:
result <- read.table(text="
id nr
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
", header=TRUE)
You can use expand.grid as:
library(dplyr)
result <- expand.grid(id = unique(df$id), nr = unique(df$nr)) %>%
arrange(id)
result
id nr
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3
We can do:
tidyr::expand(df,id,nr)
# A tibble: 9 x 2
id nr
<int> <int>
1 1 1
2 1 2
3 1 3
4 2 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 3

find first and last value within a sequence in r [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
How to get the maximum value by group
(5 answers)
How to select the rows with maximum values in each group with dplyr? [duplicate]
(6 answers)
Closed 5 years ago.
I have a sequence and then times that are recorded within each sequence. I am trying to find the max value of time that is recorded with its corresponding sequence. Example below:
Seq seconds
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 3 1
8 3 2
9 3 3
10 3 4
11 3 5
I would like a result that tells me the max time that was recorded in each sequence.
Seq Time
1 4
2 2
3 5
A solution using dplyr.
library(dplyr)
dt2 <- dt %>%
arrange(Seq, seconds) %>%
group_by(Seq) %>%
slice(n())
dt2
# A tibble: 3 x 2
# Groups: Seq [3]
Seq seconds
<int> <int>
1 1 4
2 2 2
3 3 5
DATA
dt <- read.table(text = " Seq seconds
1 1 1
2 1 2
3 1 3
4 1 4
5 2 1
6 2 2
7 3 1
8 3 2
9 3 3
10 3 4
11 3 5",
header = TRUE)
An option using data.table
library(data.table)
setDT(df1)[, .(Time = max(seconds)), Seq]
# Seq Time
#1: 1 4
#2: 2 2
#3: 3 5

how to create a column including the maximum value of another column in R? [duplicate]

This question already has answers here:
Calculate group mean, sum, or other summary stats. and assign column to original data
(4 answers)
Closed 5 years ago.
Using R, I would like to create a new column (MaxAct) holding the maximum value of another column (ActNo) within each group defined by two factors (HHID and PERID).
For example, I have this data set:
UID HHID PERID ActNo
1 1000 1 1
2 1000 1 2
3 1000 1 3
4 1000 2 1
5 1000 2 2
6 2000 1 1
7 2000 1 2
8 2000 1 3
9 2000 1 4
10 2000 2 1
11 2000 2 2
Then I want to add the new column (MaxAct) as follows:
UID HHID PERID ActNo MaxAct
1 1000 1 1 3
2 1000 1 2 3
3 1000 1 3 3
4 1000 2 1 2
5 1000 2 2 2
6 2000 1 1 4
7 2000 1 2 4
8 2000 1 3 4
9 2000 1 4 4
10 2000 2 1 2
11 2000 2 2 2
dat$MaxAct <- with(dat, ave(ActNo, HHID, PERID, FUN=max) )
For problems involving single vectors and grouping where you want the length of the result to equal the row count, ave is your function of choice. For more complicated problems, the lapply(split(dat, fac), FUN) approach may be needed or use do.call(rbind, by( ...))
If you have missing values:
dat$MaxAct <- with(dat, ave(ActNo, HHID, PERID, FUN=function(x) max(x, na.rm=TRUE) ) )
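The split/lapply route mentioned above could look roughly like this. It is only a sketch: the intermediate names pieces and out are illustrative, and dat is rebuilt from the question's example.

```r
dat <- read.table(text = "UID HHID PERID ActNo
1 1000 1 1
2 1000 1 2
3 1000 1 3
4 1000 2 1
5 1000 2 2
6 2000 1 1
7 2000 1 2
8 2000 1 3
9 2000 1 4
10 2000 2 1
11 2000 2 2", header = TRUE)

# One data frame per HHID/PERID combination; drop combinations with no rows.
pieces <- split(dat, list(dat$HHID, dat$PERID), drop = TRUE)

# Add the group maximum to every row of each piece.
pieces <- lapply(pieces, function(g) {
  g$MaxAct <- max(g$ActNo, na.rm = TRUE)
  g
})

# Stack the pieces again and restore the original row order.
out <- do.call(rbind, pieces)
out <- out[order(out$UID), ]
rownames(out) <- NULL
out
```

This is more verbose than ave, but the function applied to each piece can use several columns at once.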
This is standard fare for plyr with mutate or transform, base R ave, or data.table (which might be considered a sledgehammer to crack a peanut here).
The plyr and ave approaches have already been addressed, so here is the data.table version:
library(data.table)
DT <- data.table(DF)
DT[,MaxAct := max(ActNo), by = list(HHID, PERID)]
Given the size of the data the memory efficient and fast nature of data.table is perhaps not required.
Having read your previous question (How to Create a Column of Ranks While Grouping in R), we know that max(ActNo) is simply the number of rows in each group, so
DT[,MaxAct := .N, by = list(HHID, PERID)]
will work, and be marginally quicker.
There are several approaches in R to achieve this task. For me, the easiest way is to use the plyr package:
require(plyr)
ddply(dat, .(HHID, PERID), transform, MaxAct = max(ActNo))
UID HHID PERID ActNo MaxAct
1 1 1000 1 1 3
2 2 1000 1 2 3
3 3 1000 1 3 3
4 4 1000 2 1 2
5 5 1000 2 2 2
6 6 2000 1 1 4
7 7 2000 1 2 4
8 8 2000 1 3 4
9 9 2000 1 4 4
10 10 2000 2 1 2
11 11 2000 2 2 2
df <- read.table(textConnection("UID HHID PERID ActNo
1 1000 1 1
2 1000 1 2
3 1000 1 3
4 1000 2 1
5 1000 2 2
6 2000 1 1
7 2000 1 2
8 2000 1 3
9 2000 1 4
10 2000 2 1
11 2000 2 2"), header=T)
ddply(df, .(HHID, PERID), transform, MaxAct = length(unique(ActNo)) )
UID HHID PERID ActNo MaxAct
1 1 1000 1 1 3
2 2 1000 1 2 3
3 3 1000 1 3 3
4 4 1000 2 1 2
5 5 1000 2 2 2
6 6 2000 1 1 4
7 7 2000 1 2 4
8 8 2000 1 3 4
9 9 2000 1 4 4
10 10 2000 2 1 2
11 11 2000 2 2 2
