I have a dataset with 2 columns. The first column is an ID and the 2nd column is the total number of quarters. If column B (Quarters) has the value 8, then 8 rows should be created, numbered from 1 to 8. The ID in column A should be the same for all of those rows. The dataset shown below is an example.
ID Quarters
A 5
B 2
C 1
Expected output
ID Quarters
A 1
A 2
A 3
A 4
A 5
B 1
B 2
C 1
Here is what I tried.
library(data.table)
setDT(df.WQuarter)[, (Quarters=1:Quarters), ID]
I get an error. Can you please help? I have been stuck on this for the whole day. I am just learning the basics of R.
In base R, we can repeat each 'ID' 'Quarters' times with rep and build the new 'Quarters' column with sequence, which returns 1:n for each n in that column.
with(df1, data.frame(ID= rep(ID, Quarters), Quarters = sequence(Quarters)))
# ID Quarters
#1 A 1
#2 A 2
#3 A 3
#4 A 4
#5 A 5
#6 B 1
#7 B 2
#8 C 1
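To see why this works, it can help to run the two building blocks on bare vectors. The sketch below uses throwaway vectors `ids` and `quarters` (names chosen here for illustration, not part of the original answer) that mirror the example data.

```r
# Illustrative vectors mirroring the example data (names are assumptions)
ids <- c("A", "B", "C")
quarters <- c(5L, 2L, 1L)

# rep() repeats each ID as many times as its quarter count
rep_ids <- rep(ids, quarters)
rep_ids
# "A" "A" "A" "A" "A" "B" "B" "C"

# sequence() concatenates 1:n for every n in the vector
seq_q <- sequence(quarters)
seq_q
# 1 2 3 4 5 1 2 1
```

Because both results have the same length and line up element by element, they can be bound into a data frame directly.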
If we are using data.table, convert the 'data.frame' to a 'data.table' by reference (setDT(df1)), group by 'ID', and take the sequence of 'Quarters' (plain seq(Quarters) also works when each 'ID' has a single row).
library(data.table)
setDT(df1)[, .(Quarters=sequence(Quarters)) , by = ID]
As @PierreLaFortune commented on the post, if there are NA values, we need to remove them first
setDT(df1)[, .(Quarters = sequence(Quarters[!is.na(Quarters)])), by = ID]
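The same NA handling can be sketched in base R. This is only an illustration under assumed data: `df_na` is a hypothetical variant of the example in which the count for ID "B" is missing, and rows with a missing count are simply dropped before expanding.

```r
# Hypothetical input with a missing quarter count for ID "B"
df_na <- data.frame(ID = c("A", "B", "C"), Quarters = c(2L, NA, 1L))

# Keep only the rows whose count is known, then expand as before
known <- df_na[!is.na(df_na$Quarters), ]
expanded <- with(known, data.frame(ID = rep(ID, Quarters),
                                   Quarters = sequence(Quarters)))
expanded
#   ID Quarters
# 1  A        1
# 2  A        2
# 3  C        1
```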
Or using dplyr/tidyr
library(dplyr)
library(tidyr)
df1 %>%
group_by(ID) %>%
mutate(Quarters = list(seq(Quarters))) %>%
ungroup() %>%
unnest(Quarters)
If the OP's "Quarters" column is non-numeric, it should be converted to 'numeric' before proceeding
df1$Quarters <- as.numeric(as.character(df1$Quarters))
The as.character step is needed when the column is a factor; if it is of class character, as.numeric alone is enough.
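A short illustration of why the intermediate as.character matters, using a made-up factor `q_factor` (an assumption for this sketch, not the OP's data):

```r
# Hypothetical factor column holding the quarter counts as labels
q_factor <- factor(c("5", "2", "1"))

# Direct conversion returns the internal level codes, not the labels
codes <- as.numeric(q_factor)
codes
# 3 2 1

# Going through character first recovers the intended numbers
values <- as.numeric(as.character(q_factor))
values
# 5 2 1
```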
data
df1 <- structure(list(ID = c("A", "B", "C"), Quarters = c(5L, 2L, 1L
)), .Names = c("ID", "Quarters"), class = "data.frame", row.names = c(NA,
-3L))
Suppose we start with this data frame myDF generated by the code immediately beneath:
> myDF
index
1 2
2 2
3 4
4 4
5 6
6 6
7 6
Generating code: myDF <- data.frame(index = c(2,2,4,4,6,6,6))
I'd like to add a column cumGrp to data frame myDF that provides a cumulative count of implicitly grouped elements, as illustrated below. Any suggestions of simple concise base R or dplyr code to do this?
> myDF
index cumGrp cumGrp explained
1 2 1 1st grouping of same index numbers (2) adjacent to each other
2 2 1 Same as above
3 4 2 2nd grouping of same index numbers (4) adjacent to each other
4 4 2 Same as above
5 6 3 3rd grouping of same index numbers (6) adjacent to each other
6 6 3 Same as above
7 6 3 Same as above
Many possible ways:
dplyr::cur_group_id
library(dplyr)
myDF %>%
group_by(index) %>%
mutate(cumGrp = cur_group_id())
cumsum
library(dplyr)
myDF %>%
mutate(cumGrp = cumsum(index != lag(index, default = 0)))
as.numeric + factor
myDF |>
transform(cumGrp = as.numeric(factor(index)))
data.table::.GRP
library(data.table)
setDT(myDF)[, cumGrp := .GRP, by = index]
match
myDF |>
transform(cumGrp = match(index, unique(index)))
collapse::group
library(collapse)
myDF |>
settransform(cumGrp = group(index))
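These options agree on the example data, but not necessarily on every input: they can differ once a value reappears later or the values are not sorted. A small base R sketch with a made-up vector `idx` (an assumption for illustration) shows three behaviours: numbering by order of first appearance, numbering by sorted value, and numbering by run of adjacent equal values.

```r
# Hypothetical input: unsorted, and 6 reappears after the 2s
idx <- c(6, 6, 2, 2, 6)

# Group number by order of first appearance
by_first_seen <- match(idx, unique(idx))
by_first_seen
# 1 1 2 2 1

# Group number by sorted value (factor levels are sorted)
by_sorted_value <- as.numeric(factor(idx))
by_sorted_value
# 2 2 1 1 2

# Group number by run: increment whenever the value changes
by_run <- cumsum(c(TRUE, diff(idx) != 0))
by_run
# 1 1 2 2 3
```

If "adjacent to each other" in the question is meant literally, the run-based form is the one that starts a new group when a value comes back after a gap.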
I want to convert the column names of an R data frame to rows, subject to these conditions:
If two column names share a common part after the '_' separator, such as x_a01 and y_a01, convert them to one row item per date, with the common part (a01) as its name.
Ex: x_a01, y_a01 -> a01; x_b01, y_b01 -> b01
Only row items with non-zero values should be kept.
Ex: x_c01 and y_c01 are both 0 in the 1st row, so they should be ignored when converting to row items.
The dataframe:
Convert the above dataframe to:
We can use pivot_longer to reshape the data into 'long' format and then use filter to drop the rows where both x and y are 0
library(dplyr)
library(tidyr)
df1 %>%
  pivot_longer(cols = -date, names_to = c(".value", "colname"),
               names_sep = "_", values_drop_na = TRUE) %>%
  filter(if_any(c(x, y), ~ . > 0))
Output
# A tibble: 5 x 4
# date colname x y
# <chr> <chr> <dbl> <dbl>
#1 01-01-2021 a01 1 2
#2 01-01-2021 b01 0 4
#3 01-01-2021 d01 3 4
#4 02-01-2021 b01 3.1 1.1
#5 02-01-2021 c01 4.5 6.2
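The key step in the pivot_longer call is how each column name is split at "_": the first piece is matched to ".value" and so becomes an output column (x or y), while the second piece lands in colname. The base R sketch below only mimics that name split for illustration; `name_map` and its column names are made up here and are not produced by tidyr.

```r
# Column names as they appear in the wide data
cols <- c("x_a01", "y_a01", "x_b01", "y_b01")

# Split each name at the underscore into two pieces
parts <- do.call(rbind, strsplit(cols, "_", fixed = TRUE))

# First piece -> output column, second piece -> value stored in colname
name_map <- data.frame(column = cols,
                       value_column = parts[, 1],
                       colname = parts[, 2])
name_map
#   column value_column colname
# 1  x_a01            x     a01
# 2  y_a01            y     a01
# 3  x_b01            x     b01
# 4  y_b01            y     b01
```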
data
df1 <- structure(list(date = c("01-01-2021", "02-01-2021"), x_a01 = c(1,
0), y_a01 = c(2, 0), x_b01 = c(0, 3.1), y_b01 = c(4, 1.1), x_c01 = c(0,
4.5), y_c01 = c(0, 6.2), x_d01 = c(3, 0), y_d01 = c(4, 0)),
class = "data.frame", row.names = c(NA,
-2L))
I want to join two data frames on index, keeping a match only when the Year on the RHS is 1-3 years after the Year on the LHS. For example, the data frame df_lhs is
A index Year
1 A 12/31/2012
3 B 12/31/2011
5 C 12/31/2009
and df_rhs is
B index Year
5 A 12/31/2001
6 B 12/31/2010
2 C 12/31/2011
I would like the resulting inner_join to contain:
A index Year_left Year_right
5 C 12/31/2009 12/31/2011
This is what I tried
df = inner_join(df_lhs, df_rhs, by = c('index','Year'), suffix = c(".left", ".right"))
The code doesn't work. Maybe I should not think about using inner_join at all?
library(dplyr)
library(tidyr)
df_lhs %>%
separate(Year, sep = "/", into = c("m", "d", "y"), remove = F) %>%
inner_join(., {df_rhs%>%
separate(Year, sep = "/", into = c("m", "d", "y"), remove = F)},
by = c('index','m', 'd'), suffix = c(".left", ".right")) %>%
filter((as.numeric(y.right) - as.numeric(y.left)) %in% 1:3) %>%
select(A, B, index, Year.left, Year.right)
#> A B index Year.left Year.right
#> 1 5 2 C 12/31/2009 12/31/2011
What you can do is a simple join/merge on index, and then keep only the rows that satisfy your condition (here, a gap of 1-3 years).
Below is the code for merging the two data frames on index. Year is left out of the key because the two Year values are expected to differ; the suffixes keep the two Year columns apart.
merge(df_lhs, df_rhs, by = "index", suffixes = c(".left", ".right"))
After this you will have a plain merge, which you can then filter on a condition such as the difference between the two dates being 1-3 years.
This is just a suggestion. I hope this helps.
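The merge-then-filter idea can be sketched end to end in base R. This is one possible reading, assuming the dates are strings in month/day/year form as in the question and that "1-3 years after" refers to the difference in calendar years; the helper `year_of` is a name introduced here for illustration.

```r
# Example data rebuilt from the question
df_lhs <- data.frame(A = c(1, 3, 5), index = c("A", "B", "C"),
                     Year = c("12/31/2012", "12/31/2011", "12/31/2009"))
df_rhs <- data.frame(B = c(5, 6, 2), index = c("A", "B", "C"),
                     Year = c("12/31/2001", "12/31/2010", "12/31/2011"))

# Join on index only; both Year columns survive with suffixes
joined <- merge(df_lhs, df_rhs, by = "index", suffixes = c(".left", ".right"))

# Hypothetical helper: extract the calendar year from a m/d/Y string
year_of <- function(x) as.integer(format(as.Date(x, format = "%m/%d/%Y"), "%Y"))

# Keep rows where the right-hand year is 1 to 3 years after the left-hand year
gap <- year_of(joined$Year.right) - year_of(joined$Year.left)
result <- joined[gap >= 1 & gap <= 3, ]
result
#   index A  Year.left B Year.right
# 3     C 5 12/31/2009 2 12/31/2011
```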
I want to group the data by ID and slice the first row, based on condition.
I have the following dataset:
head(data)
ID Cond1
A 10
A 10
B 20
B 30
Now, I want to slice the rows based on condition:
If value in Cond1 is unique for both rows, keep them;
If value in Cond1 is duplicate, keep top row.
Any ideas?
You can use the base R function ave like this:
datafr[!(ave(datafr$Cond1, datafr$ID, FUN=duplicated)),]
ID Cond1
1 A 10
3 B 20
4 B 30
ave returns a numeric vector, computed by ID, with a 1 if the element of Cond1 is duplicated within its group and a 0 if it is not. The ! plays two roles: first, it converts the resulting vector to a logical vector suitable for subsetting; second, it negates the result, keeping the non-duplicated elements.
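The intermediate values can be inspected step by step. The sketch below rebuilds the example data and stores each stage in its own variable; `dup_flag` and `keep` are names introduced here for illustration.

```r
# Example data from the question
datafr <- data.frame(ID = c("A", "A", "B", "B"),
                     Cond1 = c(10L, 10L, 20L, 30L))

# Per-ID duplicated() flags, coerced to 0/1 by ave()
dup_flag <- ave(datafr$Cond1, datafr$ID, FUN = duplicated)
dup_flag
# 0 1 0 0

# Negation turns the 0/1 flags into a logical "keep this row" vector
keep <- !dup_flag
keep
# TRUE FALSE TRUE TRUE

kept_rows <- datafr[keep, ]
```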
In data.table, you could subset with a logical vector computed by group.
library(data.table)
setDT(datafr)[datafr[, !duplicated(Cond1), by=ID]$V1]
ID Cond1
1: A 10
2: B 20
3: B 30
The inner data.table call returns, by ID, a logical that is TRUE for non-duplicated elements; it is pulled out as a vector via $V1. This logical vector is then fed to the original data.table to perform the subsetting.
data
datafr <-
structure(list(ID = c("A", "A", "B", "B"), Cond1 = c(10L, 10L,
20L, 30L)), .Names = c("ID", "Cond1"), row.names = c(NA, -4L), class = "data.frame")
We can use n_distinct to filter
library(dplyr)
data %>%
group_by(ID) %>%
filter(n_distinct(Cond1)==n()| row_number()==1)
Or just
data[!duplicated(data),]
# ID Cond1
#1 A 10
#3 B 20
#4 B 30
Based on the description in the OP's post, if we include another row with B 20, the first solution should give
data %>%
group_by(ID) %>%
filter(n_distinct(Cond1)==n()| row_number()==1)
# A tibble: 2 x 2
# Groups: ID [2]
# ID Cond1
# <chr> <int>
#1 A 10
#2 B 20
I'm wondering if there is an easy way to restructure some data I have. I currently have a data frame that looks like this...
Year Cat Number
2001 A 15
2001 B 2
2002 A 4
2002 B 12
But what I ultimately want is to have it in this shape...
Year Cat Number Cat Number
2001 A 15 B 2
2002 A 4 B 12
Is there a simple way to do this?
Thanks in advance
:)
One way would be to use dcast/melt from reshape2. In the code below, I first create a sequence of numbers (the indx column) for each Year by using transform and ave. Then I melt the transformed dataset, keeping Year and indx as id.var. The long format dataset is then reshaped to wide format using dcast. If you don't need the numeric suffix (_1, _2) in the column names, you can use gsub to remove it.
library(reshape2)
res <- dcast(melt(transform(df, indx=ave(seq_along(Year), Year, FUN=seq_along)),
id.var=c("Year", "indx")), Year~variable+indx, value.var="value")
colnames(res) <- gsub("\\_.*", "", colnames(res))
res
# Year Cat Cat Number Number
#1 2001 A B 15 2
#2 2002 A B 4 12
Or using dplyr/tidyr. Here, the idea is similar to the above. After grouping by the Year column, generate an indx column using mutate, reshape to long format with gather, unite two columns into a single column VarIndx, and then reshape back to wide format with spread. In the last step, mutate_each converts the columns whose names start with Number to numeric.
library(dplyr)
library(tidyr)
res1 <- df %>%
group_by(Year) %>%
mutate(indx=row_number()) %>%
gather("Var", "Val", Cat:Number) %>%
unite(VarIndx, Var, indx) %>%
spread(VarIndx, Val) %>%
mutate_each(funs(as.numeric), starts_with("Number"))
res1
# Source: local data frame [2 x 5]
# Year Cat_1 Cat_2 Number_1 Number_2
#1 2001 A B 15 2
#2 2002 A B 4 12
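In more recent tidyverse releases, gather and spread are superseded and mutate_each/funs are deprecated. Assuming tidyr 1.0.0 or later is installed, the same reshape can be sketched with pivot_wider; this is an alternative to the code above rather than what the answer originally used, and `res2` is a name chosen here.

```r
library(dplyr)
library(tidyr)

# Example data from the question
df <- data.frame(Year = c(2001L, 2001L, 2002L, 2002L),
                 Cat = c("A", "B", "A", "B"),
                 Number = c(15L, 2L, 4L, 12L))

res2 <- df %>%
  group_by(Year) %>%
  mutate(indx = row_number()) %>%   # position of each row within its Year
  ungroup() %>%
  pivot_wider(names_from = indx, values_from = c(Cat, Number))

names(res2)
# "Year" "Cat_1" "Cat_2" "Number_1" "Number_2"
```

Because Cat and Number are never stacked into one column here, Number keeps its integer type and no conversion step is needed at the end.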
Or you can create an indx variable .id using getanID from splitstackshape (from comments made by @Ananda Mahto, author of splitstackshape) and use reshape from base R
library(splitstackshape)
reshape(getanID(df, "Year"), direction="wide", idvar="Year", timevar=".id")
# Year Cat.1 Number.1 Cat.2 Number.2
#1: 2001 A 15 B 2
#2: 2002 A 4 B 12
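If installing splitstackshape is not an option, the within-Year counter can also be built with ave, as in the reshape2 approach, and passed to reshape. A minimal base R sketch, where the column name `id` and the object names `df_id` and `wide` are assumptions made for this example:

```r
# Example data from the question
df <- data.frame(Year = c(2001L, 2001L, 2002L, 2002L),
                 Cat = c("A", "B", "A", "B"),
                 Number = c(15L, 2L, 4L, 12L))

# Add a counter that restarts at 1 within each Year
df_id <- transform(df, id = ave(seq_along(Year), Year, FUN = seq_along))

# Spread Cat and Number across columns, one set per counter value
wide <- reshape(df_id, direction = "wide", idvar = "Year", timevar = "id")

names(wide)
# "Year" "Cat.1" "Number.1" "Cat.2" "Number.2"
```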
data
df <- structure(list(Year = c(2001L, 2001L, 2002L, 2002L), Cat = c("A",
"B", "A", "B"), Number = c(15L, 2L, 4L, 12L)), .Names = c("Year",
"Cat", "Number"), class = "data.frame", row.names = c(NA, -4L
))