I have a column like so. The column begins and ends with a ',' and each value is separated by ',,'.
col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,
How can I transform this column into the following?
col1_9 col1_101 col1_102 col1_200 col1_201
1 1 0 1 1
0 1 1 0 1
1 1 1 1 1
1 1 2 1 0
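library(dplyr)
library(tidyr)

df %>%
  mutate(rowid = row_number(), value = 1) %>%
  separate_rows(col1) %>%      # default separator splits on the commas
  filter(nzchar(col1)) %>%     # drop the empty pieces left by the outer commas
  pivot_wider(id_cols = rowid, names_from = col1, values_from = value,
              values_fn = sum, names_prefix = 'col1_',
              values_fill = 0)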
# A tibble: 4 x 6
rowid col1_101 col1_9 col1_201 col1_200 col1_102
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1 0
2 2 1 0 1 0 1
3 3 1 1 1 1 1
4 4 1 1 0 1 2
In base R:
a <- setNames(strsplit(trimws(df$col1, whitespace = ','), ',+'), seq(nrow(df)))
as.data.frame.matrix(t(table(stack(a))))
101 102 200 201 9
1 1 0 1 1 1
2 1 1 0 1 0
3 1 1 1 1 1
4 1 2 1 0 1
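The table above sorts its columns as strings (so 9 comes last) and has no col1_ prefix. If the exact layout from the question is needed, the following is a minimal base R sketch, not the answerer's code. It assumes the question's data in a data frame `df`; the names `vals`, `lev`, `counts` and `res` are only illustrative.

```r
df <- data.frame(col1 = c(",101,,9,,201,,200,",
                          ",201,,101,,102,",
                          ",9,,101,,102,,200,,201,",
                          ",101,,200,,9,,102,,102,"),
                 stringsAsFactors = FALSE)

# Split on runs of commas and drop the empty piece left by the leading comma
vals <- lapply(strsplit(df$col1, ",+"), function(v) v[nzchar(v)])

# Sort the distinct values numerically so that 9 comes before 101
lev <- sort(unique(as.numeric(unlist(vals))))

# Count every level in every row; factor() keeps a 0 for absent values
counts <- t(sapply(vals, function(v) table(factor(as.numeric(v), levels = lev))))

res <- as.data.frame(counts)
names(res) <- paste0("col1_", lev)
res
```

Fixing the factor levels up front is what keeps the zero counts and the numeric column order.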
An option could be:
First, remove the "," at the beginning and end using str_sub from stringr.
Tabulate the values in each row (a count-based one-hot encoding) using mtabulate from qdapTools and strsplit with a separator of ",,".
Order the columns numerically by name.
Finally, give the columns the "col1_" prefix using paste0.
This gives the following result:
df <- read.table(text = "col1
,101,,9,,201,,200,
,201,,101,,102,
,9,,101,,102,,200,,201,
,101,,200,,9,,102,,102,", header = TRUE)
library(stringr)
library(qdapTools)
df$col1 <- str_sub(df$col1, 2, -2)
df <- mtabulate(strsplit(df$col1, ",,"))
df <- df[, order(as.numeric(names(df)))]
names(df) <- paste0("col1_", names(df))
df
#> col1_9 col1_101 col1_102 col1_200 col1_201
#> 1 1 1 0 1 1
#> 2 0 1 1 0 1
#> 3 1 1 1 1 1
#> 4 1 1 2 1 0
Created on 2022-07-21 by the reprex package (v2.0.1)
I have the following data:
companyID status
1 1
1 1
1 0
1 2
2 1
2 1
2 1
3 1
3 0
3 2
3 2
3 2
I would like to keep only those companies (grouped by companyID) whose status takes all three values 0, 1, and 2 within the group. My preferred outcome would look like the following:
companyID status
1 1
1 1
1 0
1 2
3 1
3 0
3 2
3 2
3 2
Thank you in advance for any help!!
You can select groups where all the values from 0-2 are present in the group.
library(dplyr)
df %>% group_by(companyID) %>% filter(all(0:2 %in% status))
# companyID status
# <int> <int>
#1 1 1
#2 1 1
#3 1 0
#4 1 2
#5 3 1
#6 3 0
#7 3 2
#8 3 2
#9 3 2
In base R and data.table:
#Base R :
subset(df, as.logical(ave(status, companyID, FUN = function(x) all(0:2 %in% x))))
#data.table
library(data.table)
setDT(df)[, .SD[all(0:2 %in% status)], companyID]
We can also count how many of the values 0:2 occur in each group:
library(dplyr)
df %>%
group_by(companyID) %>%
filter(sum(0:2 %in% status) == 3)
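To see why these filters pick companies 1 and 3, it can help to look at what `0:2 %in% status` returns for each company. This is a minimal base R sketch that rebuilds the question's data as `df`; the names `present`, `complete` and `res` are only illustrative.

```r
df <- data.frame(companyID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
                 status    = c(1, 1, 0, 2, 1, 1, 1, 1, 0, 2, 2, 2))

# For each company: is 0 present? is 1 present? is 2 present?
present <- lapply(split(df$status, df$companyID), function(s) 0:2 %in% s)
present

# A company qualifies only if all three checks are TRUE
complete <- vapply(present, all, logical(1))

# Keep the rows of the qualifying companies
res <- df[df$companyID %in% names(complete)[complete], ]
res
```

Company 2 only ever has status 1, so two of its three checks are FALSE and it is dropped.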
I have groups, persons within each group, and an indicator. How can I count, for each group, the number of distinct persons whose indicator is 1?
group person ind
1 1 1
1 1 1
1 2 1
2 1 0
2 2 1
2 2 1
Expected output: in the first group, 2 persons have 1 in ind, and in the second group, one person does, so:
group person ind count
1 1 1 2
1 1 1 2
1 2 1 2
2 1 0 1
2 2 1 1
2 2 1 1
Could do:
library(dplyr)
df %>%
group_by(group) %>%
mutate(
count = n_distinct(person[ind == 1])
)
Output:
# A tibble: 6 x 4
# Groups: group [2]
group person ind count
<int> <int> <int> <int>
1 1 1 1 2
2 1 1 1 2
3 1 2 1 2
4 2 1 0 1
5 2 2 1 1
6 2 2 1 1
Or in data.table:
library(data.table)
setDT(df)[, count := uniqueN(person[ind == 1]), by = group]
An option using base R
df1$count <- with(df1, ave(ind* person, group, FUN =
function(x) length(unique(x[x!=0]))))
df1$count
#[1] 2 2 2 1 1 1
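Note that the `ind * person` product only works if no person is coded 0, because a 0 product is treated as "indicator not set". A sketch that avoids that assumption could look like this; it rebuilds the question's data as `df1`, and `n_by_group` is an illustrative name.

```r
df1 <- data.frame(group  = c(1, 1, 1, 2, 2, 2),
                  person = c(1, 1, 2, 1, 2, 2),
                  ind    = c(1, 1, 1, 0, 1, 1))

# Within each group, count the distinct persons that have ind == 1
n_by_group <- tapply(seq_len(nrow(df1)), df1$group, function(i)
  length(unique(df1$person[i][df1$ind[i] == 1])))

# Map the per-group count back onto every row of that group
df1$count <- as.vector(n_by_group[as.character(df1$group)])
df1
```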
I have panel data with the following structure:
ID Month Action
1 1 0
1 2 0
1 3 1
1 4 1
2 1 0
2 2 1
2 3 0
2 4 1
3 1 0
3 2 0
3 3 0
4 1 0
4 2 1
4 3 1
4 4 0
where each ID has one row for each month, and Action indicates whether this ID performed the action in that month (0 is no, 1 is yes).
I need to find the IDs that have continuously had Action=1 once they started the action (it does not matter in which month they started, but once started, Action should be 1 in all following months). I also wish to store all the rows that belong to these IDs in a new data frame.
How can I do this in R?
In my example, ID=1 has consistently had Action=1 since Month 3, so the final data frame I'm looking for should only contain the rows that belong to ID=1.
ID Month Action
1 1 0
1 2 0
1 3 1
1 4 1
You could do something like:
library(dplyr)
df %>%
group_by(ID) %>%
filter(all(diff(Action)>=0) & max(Action)>0) -> newDF
This newDF includes only the IDs where (a) the Action is never decreasing (i.e., no 1=>0) and (b) there is at least one Action==1.
ID Month Action
<int> <int> <int>
1 1 1 0
2 1 2 0
3 1 3 1
4 1 4 1
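To see how the two conditions behave for every ID, here is a minimal base R sketch that evaluates them separately. It rebuilds the question's data as `df` and assumes the rows are already sorted by Month within each ID; `checks`, `never_drops`, `ever_acts` and `keep` are illustrative names.

```r
df <- data.frame(ID     = c(1, 1, 1, 1,  2, 2, 2, 2,  3, 3, 3,  4, 4, 4, 4),
                 Month  = c(1, 2, 3, 4,  1, 2, 3, 4,  1, 2, 3,  1, 2, 3, 4),
                 Action = c(0, 0, 1, 1,  0, 1, 0, 1,  0, 0, 0,  0, 1, 1, 0))

# One column per ID, one row per condition
checks <- sapply(split(df$Action, df$ID), function(a)
  c(never_drops = all(diff(a) >= 0),   # no step from 1 back to 0
    ever_acts   = max(a) > 0))         # at least one month with Action == 1
checks

# Keep the IDs that satisfy both conditions
keep <- colnames(checks)[colSums(checks) == 2]
newDF <- df[df$ID %in% keep, ]
newDF
```

IDs 2 and 4 fail the first condition because they drop back to 0, and ID 3 fails the second because it never acts.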
A base R approach using ave, where we check whether all the values after the first occurrence of 1 are also 1. The additional any() condition removes IDs whose Action is all 0.
df[with(df, as.logical(ave(Action, ID, FUN = function(x) {
inds = cumsum(x)
any(inds > 0) & all(x[inds > 0] == 1)
}))), ]
# ID Month Action
#1 1 1 0
#2 1 2 0
#3 1 3 1
#4 1 4 1
Another option with the same logic, written a little more concisely:
df[with(df, ave(Action == 1, ID, FUN = function(x)
all(x[which.max(x):length(x)] == 1)
)), ]
# ID Month Action
#1 1 1 0
#2 1 2 0
#3 1 3 1
#4 1 4 1
I have more than 2,000 subjects. I would like to change the value of 'time2' to 0 in the first row of each subject. For instance, the subject with ID=2 has 1 for 'time2' in its first row. How can I change it to 0 across all 2k subjects?
ID time1 time2
1 0 0
1 0 1
1 1 5
2 0 1
2 1 3
2 3 5
3 ....
With dplyr, we can use ifelse with a logical condition on row_number():
library(dplyr)
df2 %>%
group_by(ID) %>%
mutate(time2 = ifelse(row_number()==1, 0, time2))
# A tibble: 6 x 3
# Groups: ID [2]
# ID time1 time2
# <int> <int> <dbl>
#1 1 0 0
#2 1 0 1
#3 1 1 5
#4 2 0 0
#5 2 1 3
#6 2 3 5
Or using data.table, create a row index (.I) grouped by 'ID' and assign (:=) 0 to the elements of 'time2' that correspond to that row index:
library(data.table)
setDT(df2)[df2[, .I[seq_len(.N)==1] , ID]$V1, time2 := 0][]
# ID time1 time2
#1: 1 0 0
#2: 1 0 1
#3: 1 1 5
#4: 2 0 0
#5: 2 1 3
#6: 2 3 5
Or a compact base R option: !duplicated(df$ID) marks the first row in which each ID appears, so the rows only need to be in the desired order within each ID.
df$time2[!duplicated(df$ID)] <- 0
df
# ID time1 time2
#1 1 0 0
#2 1 0 1
#3 1 1 5
#4 2 0 0
#5 2 1 3
#6 2 3 5
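A small sketch of what `!duplicated()` selects may make this clearer. The interleaved ID order and the values below are a made-up example, not data from the question, and `first_row` is an illustrative name.

```r
# Hypothetical data in which the IDs are not grouped together
df <- data.frame(ID    = c(2, 1, 2, 1, 3),
                 time2 = c(1, 4, 3, 5, 2))

# TRUE for the first row in which each ID appears
first_row <- !duplicated(df$ID)
first_row

# Overwrite time2 in those rows only
df$time2[first_row] <- 0
df
```

Because the logical vector covers the whole column at once, the same line should scale to a few thousand subjects without any explicit loop.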
You could also use dplyr in combination with replace:
df %>%
dplyr::group_by(ID) %>%
dplyr::mutate(time2 = replace(time2, 1, 0))
# Source: local data frame [6 x 3]
# Groups: ID [2]
#
# ID time1 time2
# <int> <int> <dbl>
# 1 1 0 0
# 2 1 0 1
# 3 1 1 5
# 4 2 0 0
# 5 2 1 3
# 6 2 3 5