I am trying to reshape a dataframe in R. Here is the dataframe I have in dput:
dput(newdata)
structure(list(var1 = c(0L, 0L, 0L, 0L, 0L, 0L), var2 = c(0L,
0L, 0L, 0L, 0L, 0L), var3 = c(0L, 0L, 0L, 0L, 0L, 0L), Date = structure(c(15260,
15260, 15260, 15169, 15169, 15169), class = "Date"), Success = structure(c(2L,
1L, 1L, 2L, 1L, 1L), .Label = c("N", "Y"), class = "factor")), .Names = c("var1",
"var2", "var3", "Date", "Success"), row.names = c(NA, 6L), class = "data.frame")
Output I am looking for:
Variable Date N Y
var1 3/2/2012 0 1
var1 3/4/2012 0 1
var1 3/6/2012 0 1
var2 3/2/2012 1 0
var2 3/4/2012 1 0
var2 3/6/2012 1 0
var3 3/2/2012 0 1
var3 3/4/2012 0 1
var3 3/6/2012 0 1
I am fairly new to R. I have been trying to use the reshape() function but have been unsuccessful so far. Any insight would be hugely appreciated. Thank you.
Thank you for providing reproducible input and desired output. This helps a lot. Unfortunately, your input as presented is flawed: rows 2 and 3 in your data frame are identical, and so are rows 5 and 6. It would not be possible to perform your desired data transformation correctly on such data.
Assuming your duplicate rows are not relevant, you can accomplish your desired output via tidyr::spread() and tidyr::gather(). I call your data structure df:
library("dplyr")
library("tidyr")
# drop duplicate rows (duplicated() keeps the first occurrence of each)
wide <- df %>%
  filter(!duplicated(.)) %>%
  gather(Variable, value, starts_with("var")) %>%  # long format: one row per var/Date/Success
  spread(Success, value, fill = NA, drop = FALSE)  # one column per level of Success
wide
Date Variable N Y
1 2011-07-14 var1 0 0
2 2011-07-14 var2 0 0
3 2011-07-14 var3 0 0
4 2011-10-13 var1 0 0
5 2011-10-13 var2 0 0
6 2011-10-13 var3 0 0
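As an aside (my addition, not part of the original answer), gather() and spread() have since been superseded by pivot_longer() and pivot_wider() in newer versions of tidyr; a rough equivalent of the pipeline above would be:
library(dplyr)
library(tidyr)
wide <- df %>%
  distinct() %>%                                            # drop the duplicate rows
  pivot_longer(starts_with("var"), names_to = "Variable") %>%
  pivot_wider(names_from = Success, values_from = value)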
So as kgolyaev stated, you have duplicate rows, which means that spread() can't collapse the data down to one row per combination when spreading the columns. One way around this is to use mutate() with ifelse() instead of spreading. This works because Success only takes the values "N" and "Y". Had it been 12 unique values, a different approach would be needed (see the sketch after the output below).
We can gather the var columns into vars and num, and then use a pair of simple ifelse() statements to get the 1s and 0s. Then remove the unneeded columns and arrange by Date.
library(tidyverse)
df %>% gather("vars", "num", -c(Date, Success)) %>%
mutate(Y = ifelse(Success == "N", 0, 1),
N = ifelse(Success == "N", 1, 0)) %>%
select(-c(Success, num)) %>%
arrange(Date)
Date vars Y N
1 2011-07-14 var1 1 0
2 2011-07-14 var1 0 1
3 2011-07-14 var1 0 1
4 2011-07-14 var2 1 0
5 2011-07-14 var2 0 1
6 2011-07-14 var2 0 1
7 2011-07-14 var3 1 0
8 2011-07-14 var3 0 1
9 2011-07-14 var3 0 1
10 2011-10-13 var1 1 0
11 2011-10-13 var1 0 1
12 2011-10-13 var1 0 1
13 2011-10-13 var2 1 0
14 2011-10-13 var2 0 1
15 2011-10-13 var2 0 1
16 2011-10-13 var3 1 0
17 2011-10-13 var3 0 1
18 2011-10-13 var3 0 1
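For reference, here is a rough sketch (my addition, not from the original answers) of a more general approach that would also work if Success had more than two levels: tag each row so duplicates stay distinct, add an indicator column, and spread it. The helper column names row and flag are just illustrative.
library(dplyr)
library(tidyr)
df %>%
  gather("vars", "num", -c(Date, Success)) %>%
  mutate(row = row_number(),  # keep duplicate rows distinct
         flag = 1) %>%        # indicator for the observed Success level
  spread(Success, flag, fill = 0) %>%
  select(-num, -row) %>%
  arrange(Date)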
Related
Hi, I am looking to retain rows in a dataset similar to the one below:
ID Value1 Value2
A  1      0
A  0      1
A  1      1
A  0      1
A  0      0
A  0      0
A  1      0
A  1      1
A  0      1
I want to retain the rows where 'Value1' = 1 and 'Value2' in the row immediately below = 1. Under these conditions both rows should be retained; any other rows corresponding to ID 'A' should not be retained. Can anyone help with this please? In this example the output below should be returned:
ID Value1 Value2
A  1      0
A  0      1
A  1      1
A  0      1
A  1      0
A  1      1
A  0      1
The logic is to keep every pair of rows where the first row has Value1=1 and the row immediately after it has Value2=1. I've added a few rows to your data to check different scenarios.
df=structure(list(ID = c("A", "A", "A", "A", "A", "A", "A", "A",
"A"), Value1 = c(1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Value2 = c(0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-9L))
ID Value1 Value2
1 A 1 0
2 A 0 1
3 A 0 0
4 A 1 0
5 A 0 0
6 A 0 1
7 A 1 0
8 A 0 1
9 A 0 1
Edit: your edit requires distinguishing between 1s in the Value1 and Value2 columns. There are probably a number of options available here; one is to say that if Value1=1 then this starts a new pair, so the next row needs to have Value2=1 and Value1!=1.
# rows where Value1==1 and the following row has Value1!=1 & Value2==1
tmp = which((df$Value1 == 1) + c(tail(df$Value1 != 1 & df$Value2 == 1, -1), NA) == 2)
# keep each such row together with the row immediately after it
df[sort(c(tmp, tmp + 1)), ]
ID Value1 Value2
1 A 1 0
2 A 0 1
7 A 1 0
8 A 0 1
Note the row names/indices.
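As a side note (my addition, not part of the original answer), the same start-of-pair indices can be computed with dplyr::lead() instead of tail()/c():
library(dplyr)
# rows that start a pair: Value1==1 here, and the next row has Value1!=1 & Value2==1
starts <- which(df$Value1 == 1 & lead(df$Value1) != 1 & lead(df$Value2) == 1)
df[sort(c(starts, starts + 1)), ]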
You can try
library(dplyr)
# indices of rows where Value1 == 1 and the next row has Value2 == 1
inds <- df |> summarise(n = which(Value1 == 1 & c(Value2[2:n()], 0) == 1))
# keep each such row together with the row that follows it
df |> slice(unlist(Map(c, inds$n, inds$n + 1)))
Output
ID Value1 Value2
1 A 1 0
2 A 0 1
3 A 1 0
4 A 0 1
My dataframe has columns and rows like this
Id Date Col1 Col2 Col3 X1
1 1/1/22 NA 1 0
1 1/1/22 0 0 1 6
2 5/7/21 0 1 0
2 5/7/21 0 2 0
I would like to drop duplicate rows (same Id, same Date) where the value in column X1 is missing or empty. If both rows are missing X1 for that Id and Date, then don't drop either. Only when one is missing and the other is not should the missing row be dropped.
Expected output
Id Date Col1 Col2 Col3 X1
1 1/1/22 0 0 1 6
2 5/7/21 0 1 0
2 5/7/21 0 2 0
I tried this
library(tidyr)
df %>%
group_by(Id, Date) %>%
drop_na(X1)
This drops all rows with missing X1 and I am left with just one row, which is not what I want. Any suggestions much appreciated. Thanks.
We can create a condition in filter() that returns all the rows if there are only missing values in 'X1', and otherwise removes the rows with missing values.
library(dplyr)
df %>%
group_by(Id, Date) %>%
filter(if(all(is.na(X1))) TRUE else complete.cases(X1)) %>%
ungroup
-output
# A tibble: 3 × 6
Id Date Col1 Col2 Col3 X1
<int> <chr> <int> <int> <int> <int>
1 1 1/1/22 0 0 1 6
2 2 5/7/21 0 1 0 NA
3 2 5/7/21 0 2 0 NA
Or, without the if/else, combine the conditions with | and &:
df %>%
group_by(Id, Date) %>%
filter(any(complete.cases(X1)) & complete.cases(X1) |
all(is.na(X1))) %>%
ungroup
data
df <- structure(list(Id = c(1L, 1L, 2L, 2L), Date = c("1/1/22", "1/1/22",
"5/7/21", "5/7/21"), Col1 = c(NA, 0L, 0L, 0L), Col2 = c(1L, 0L,
1L, 2L), Col3 = c(0L, 1L, 0L, 0L), X1 = c(NA, 6L, NA, NA)),
class = "data.frame", row.names = c(NA,
-4L))
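For comparison (my addition, not part of the original answer), a base-R sketch of the same keep/drop rule, using the df defined above:
# keep a row if its X1 is not NA, or if every X1 in its Id/Date group is NA
keep <- with(df, !is.na(X1) | ave(is.na(X1), Id, Date, FUN = all))
df[keep, ]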
I have data of individuals grouped into households. I'm trying to create a household-level dummy variable indicating a household with children. I've created an individual-level Child variable based on the observation's age. I'd like to "spread" this value, if it's a 1, to all members of the household.
The data looks like this:
HHID Child
1 0
1 1
1 0
2 0
2 1
3 0
3 0
3 0
I'd like the data frame like this:
HHID Child HH_child
1 0 1
1 1 1
1 0 1
2 0 1
2 1 1
3 0 0
3 0 0
3 0 0
I think it could be done using sqldf, but I'd like to do it with the tidyverse. Thanks!
Here is a tidyverse/dplyr solution:
library(dplyr)
df %>%
group_by(HHID) %>%
mutate(HH_child = if_else(any(Child == 1),1,0))
This gives us:
# A tibble: 8 x 3
HHID Child HH_child
<int> <int> <dbl>
1 1 0 1
2 1 1 1
3 1 0 1
4 2 0 1
5 2 1 1
6 3 0 0
7 3 0 0
8 3 0 0
Data:
structure(list(HHID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L), Child = c(0L,
1L, 0L, 0L, 1L, 0L, 0L, 0L)), row.names = c(NA, -8L), class = c("tbl_df",
"tbl", "data.frame"))
Simply
library(dplyr)
df %>%
group_by(HHID) %>%
mutate(HH_child = max(Child))
We can also coerce to binary
library(dplyr)
df %>%
group_by(HHID) %>%
mutate(HH_child = +(1 %in% Child))
Or using base R
df$HH_child <- with(df, ave(Child == 1, HHID, FUN = any))
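Note (my addition, not part of the original answer): the base-R ave() version returns TRUE/FALSE; if you want a 0/1 dummy as in the desired output, coerce it with unary +:
df$HH_child <- with(df, +ave(Child == 1, HHID, FUN = any))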
So I have a dataset of the following form:
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
I would like to subset the dataframe and create a new dataframe containing only the rows after Var1 first reaches its group maximum (including the row where this happens), up to the row where Var2 becomes 1 for the first time (also including this row). So what I'd like to have should look like this:
ID Var1 Var2
1 12 0
1 11 1
2 8 0
2 7 0
2 6 1
The original dataset contains a number of NAs and the function should simply ignore those. Also, if Var2 never reaches 1 for a group, it should just add all of that group's rows to the new dataframe (of course only the ones after Var1 reaches its group maximum).
However, I cannot wrap my head around the programming. Can anyone help?
A dplyr solution with a cumsum-based filter will do what the question asks for.
library(dplyr)
df1 %>%
  group_by(ID) %>%
  # keep rows from the group maximum of Var1 (inclusive) up to the first Var2 == 1 (inclusive)
  filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
## A tibble: 5 x 3
## Groups: ID [2]
# ID Var1 Var2
# <int> <int> <int>
#1 1 12 0
#2 1 11 1
#3 2 8 0
#4 2 7 0
#5 2 6 1
Edit
Here is a solution that tries to answer the OP's comment and question edit.
library(tidyr)  # for replace_na()
df1 %>%
  group_by(ID) %>%
  mutate_at(vars(starts_with('Var')), ~ replace_na(., 0L)) %>%
  filter(cumsum(Var1 == max(Var1)) == 1, cumsum(Var2) <= 1)
Data
df1 <- read.table(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
Using data.table with .I
library(data.table)
setDT(df1)[df1[, .I[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]$V1]
# ID Var1 Var2
#1: 1 12 0
#2: 1 11 1
#3: 2 8 0
#4: 2 7 0
#5: 2 6 1
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
Var1 = c(2L, 8L, 12L, 11L, 10L, 5L, 8L, 7L, 6L, 5L), Var2 = c(0L,
0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame",
row.names = c(NA,
-10L))
Here is a data.table translation of Rui Barradas' working solution:
library(data.table)
dat <- fread(text = "
ID Var1 Var2
1 2 0
1 8 0
1 12 0
1 11 1
1 10 1
2 5 0
2 8 0
2 7 0
2 6 1
2 5 1
", header = TRUE)
dat[, .SD[cumsum(Var1 == max(Var1)) & cumsum(Var2) <= 1], by="ID"]
Consider the following data:
Var1 Var2 Target
A 0 no
A 250 no
A 0 si
A 0 si
B 0 no
B 0 no
B 0 no
B 250 no
C 0 no
C 250 no
C 0 si
C 250 no
Look at the variable called Target: I need to reproduce it with the same values.
The condition to obtain "si" or "no" is the following:
within the same level of Var1 (e.g. A), if Var2=250 and the following values are 0, then Target=si for those following rows
I made this code:
df$Target <- NA
for (i in unique(df$Var1)) {
  subset.data.frame(df, Var1 == i)
  for (n in 1:length(df$Var1)) {
    df$Target[n] <-
      ifelse(df$Var2[n] == 250 && df$Var2[n+1] == 0 && df$Var1[n+1] == df$Var1[n], "si", "no")
  }
}
But with this I only get Target=si when the row immediately after the 250 has Var2=0.
Instead, as shown in the dataset above, all observations with Var2=0 after a 250 should have Target=si.
Could you help me to solve the problem, please?
Thank you,
Andrea
Solution
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(Target = ifelse(cumsum(lag(Var2, default=0) == 250) > 0
& Var2 == 0, 'si', 'no'))
Result
# A tibble: 12 x 3
# Groups: Var1 [3]
Var1 Var2 Target
<fctr> <int> <chr>
1 A 0 no
2 A 250 no
3 A 0 si
4 A 0 si
5 B 0 no
6 B 0 no
7 B 0 no
8 B 250 no
9 C 0 no
10 C 250 no
11 C 0 si
12 C 250 no
Explanation
We use dplyr to group df by the levels of Var1. Within each group, cumsum(lag(Var2, default=0) == 250) > 0 tells us, for every row, whether any previous observation of Var2 in that group was 250, and Var2 == 0 tells us whether the current observation of Var2 is 0. If both of those conditions are TRUE, we code Target as "si"; otherwise we code it as "no".
Data
The data I started with for df are
structure(list(Var1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
Var2 = c(0L, 250L, 0L, 0L, 0L, 0L, 0L, 250L, 0L, 250L, 0L,
250L)), .Names = c("Var1", "Var2"), row.names = c(NA, -12L
), class = "data.frame")
Comparison to akrun's Solution
The output of akrun's solution is below so you can determine which approach is more appropriate for your problem.
# A tibble: 12 x 3
# Groups: Var1 [3]
Var1 Var2 Target
<fctr> <int> <chr>
1 A 0 si
2 A 250 no
3 A 0 no
4 A 0 no
5 B 0 no
6 B 0 no
7 B 0 si
8 B 250 no
9 C 0 si
10 C 250 no
11 C 0 si
12 C 250 no
We can use dplyr
library(dplyr)
df1 %>%
  group_by(Var1) %>%
  # replace() modifies an existing Target column, so df1 must already contain Target
  mutate(Target = replace(Target, Var2 == 0 & lead(Var2, default = Var2[n()]) == 250, 'si'))