I simplified the dataset to demonstrate what I want to do. I'm not used to dealing with multiple columns. Here is a simple dataset:
data<-data.frame(id=c(1,1,1,2,2,2,2),
title_1=c(65,58,47,NA,25,27,43),
title_2=c(NA,NA,32,35,12,NA,1))
My actual dataset has many more columns, but for now I have named them as above. My goal is to change the values of title_1 and title_2 by the following rule: if there is a number, change it to 1; if there is an NA value, change it to 0. In my actual dataset there are hundreds of columns named title_1, title_2, ... , title_100, ..., so I cannot type all the column names. For my simple data, I therefore want code that doesn't list the column names explicitly. My expected output is
data<-data.frame(id=c(1,1,1,2,2,2,2),
title_1=c(1,1,1,0,1,1,1),
title_2=c(0,0,1,1,1,0,1))
With dplyr we can use tidyselect syntax inside across() to select all variables starting with "title_" and then apply a function to every selected column:
data<-data.frame(id=c(1,1,1,2,2,2,2),
title_1=c(65,58,47,NA,25,27,43),
title_2=c(NA,NA,32,35,12,NA,1))
library(dplyr)
data %>%
mutate(across(starts_with("title_"), ~ ifelse(is.na(.x), 0, 1)))
#> id title_1 title_2
#> 1 1 1 0
#> 2 1 1 0
#> 3 1 1 1
#> 4 2 0 1
#> 5 2 1 1
#> 6 2 1 0
#> 7 2 1 1
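As a side note, since the rule is really just "is this cell non-missing?", the ifelse() can be replaced by a direct logical conversion; a minimal equivalent sketch:
data %>%
  mutate(across(starts_with("title_"), ~ as.integer(!is.na(.x))))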
In base R we would use grepl to select the column names, then assign those columns new values with lapply:
data<-data.frame(id=c(1,1,1,2,2,2,2),
title_1=c(65,58,47,NA,25,27,43),
title_2=c(NA,NA,32,35,12,NA,1))
mycols <- grepl("^title_", names(data))
data[mycols] <- lapply(data[mycols], \(x) ifelse(is.na(x), 0, 1))
data
#> id title_1 title_2
#> 1 1 1 0
#> 2 1 1 0
#> 3 1 1 1
#> 4 2 0 1
#> 5 2 1 1
#> 6 2 1 0
#> 7 2 1 1
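Since is.na() is vectorised over whole data frames, the lapply() step can also be collapsed into a single assignment; an equivalent sketch:
# +(!is.na(...)) turns the logical matrix into a 0/1 matrix in one step
data[mycols] <- +(!is.na(data[mycols]))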
Finally, we would select the columns with data.table similarly, but here we'd prefer the actual names, via grep(value = TRUE):
mycols <- grep("^title_", names(data), value = TRUE)
library(data.table)
data_tb <- as.data.table(data)
data_tb[,
get("mycols") := lapply(.SD, \(x) ifelse(is.na(x), 0, 1)),
.SDcols = mycols]
data_tb
#> id title_1 title_2
#> 1: 1 1 0
#> 2: 1 1 0
#> 3: 1 1 1
#> 4: 2 0 1
#> 5: 2 1 1
#> 6: 2 1 0
#> 7: 2 1 1
Created on 2022-07-26 by the reprex package (v2.0.1)
While an ifelse statement as presented by @TimTeaFan is the best solution,
here is an alternative approach using across twice:
library(dplyr)
library(tidyr)
data %>%
  mutate(across(-id, ~ . - . + 1),         # any number -> 1, NA stays NA
         across(-id, ~ replace_na(., 0)))  # NA -> 0
id title_1 title_2
1 1 1 0
2 1 1 0
3 1 1 1
4 2 0 1
5 2 1 1
6 2 1 0
7 2 1 1
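The two across() calls can also be collapsed into one, since replace_na() can wrap the arithmetic directly; an equivalent sketch:
data %>%
  mutate(across(-id, ~ replace_na(. - . + 1, 0)))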
I am doing my best to learn R, and this is my first post on this forum.
I currently have a data frame with a populated vector "x" and an unpopulated vector "counter" as follows:
x <- c(NA,1,0,0,0,0,1,1,1,1,0,1)
df <- data.frame("x" = x, "counter" = 0)
x counter
1 NA 0
2 1 0
3 0 0
4 0 0
5 0 0
6 0 0
7 1 0
8 1 0
9 1 0
10 1 0
11 0 0
12 1 0
I am having a surprisingly difficult time trying to write code that will simply populate counter so that counter sums the cumulative, sequential 1s in x, but reverts back to zero when x is zero. Accordingly, I would like counter to calculate as follows per the above example:
x counter
1 NA NA
2 1 1
3 0 0
4 0 0
5 0 0
6 0 0
7 1 1
8 1 2
9 1 3
10 1 4
11 0 0
12 1 1
I have tried using lag() and ifelse(), both with and without for loops, but seem to be getting further and further away from a workable solution (while lag() got me close, the figures were not calculating as expected; my ifelse() and for loops eventually ended up with length-1 vectors of NA_real_, NA or 1). I have also considered cumsum(), but I am not sure how to restrict the range to just the 1s. I have searched and reviewed similar posts, for example How to add value to previous row if condition is met; however, I still cannot figure out what I would expect to be a very simple task.
Admittedly, I am at a low point in my early R learning curve and greatly appreciate any help and constructive feedback anyone from the community can provide. Thank you.
You can use:
library(dplyr)
df %>%
group_by(x1 = cumsum(replace(x, is.na(x), 0) == 0)) %>%
mutate(counter = (row_number() - 1) * x) %>%
ungroup %>%
select(-x1)
# x counter
# <dbl> <dbl>
# 1 NA NA
# 2 1 1
# 3 0 0
# 4 0 0
# 5 0 0
# 6 0 0
# 7 1 1
# 8 1 2
# 9 1 3
#10 1 4
#11 0 0
#12 1 1
Explaining the steps:
Create a new column (x1): replace NA in x with 0 and increment the group value by 1 (using cumsum) whenever x = 0.
Within each group, subtract 1 from the row number and multiply the result by x. The multiplication is necessary because it keeps counter at 0 where x = 0 and at NA where x is NA.
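To make step 1 concrete, this is what the grouping column x1 looks like for the example x: every 0 (and the leading NA, once replaced) opens a new group, and the 1s that follow stay in it:
x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
cumsum(replace(x, is.na(x), 0) == 0)
#> [1] 1 1 2 3 4 5 5 5 5 5 6 6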
Welcome @cpanagakos.
In dplyr::lag() it's not possible to use a column that doesn't exist yet.
(It can't refer to itself.)
https://www.reddit.com/r/rstats/comments/a34n6b/dplyr_use_previous_row_from_a_column_thats_being/
For example:
library(tidyverse)
df <- tibble("x" = c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1))
# error: lag cannot refer to a column that still doesn't exist
df %>%
mutate(counter = case_when(is.na(x) ~ coalesce(lag(counter), 0),
x == 0 ~ 0,
x == 1 ~ lag(counter) + 1))
#> Error: Problem with `mutate()` input `counter`.
#> x object 'counter' not found
#> i Input `counter` is `case_when(...)`.
So, if you have a criterion that "resets" the counter, you need to write a formula that changes the group whenever a reset is needed, and then refer to row_number(), which restarts at 1 inside each group (as @Ronald Shah and others suggest):
Create sequential counter that restarts on a condition within panel data groups
df %>%
group_by(x1 = cumsum(!coalesce(x, 0))) %>%
mutate(counter = row_number() - 1) %>%
ungroup()
#> # A tibble: 12 x 3
#> x x1 counter
#> <dbl> <int> <dbl>
#> 1 NA 1 NA
#> 2 1 1 1
#> 3 0 2 0
#> 4 0 3 0
#> 5 0 4 0
#> 6 0 5 0
#> 7 1 5 1
#> 8 1 5 2
#> 9 1 5 3
#> 10 1 5 4
#> 11 0 6 0
#> 12 1 6 1
This would be one of the few cases where using a for loop in R could be justified: because the alternatives are conceptually harder to understand.
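For reference, a minimal sketch of such a loop (assuming the same x as above; treating a missing previous counter as 0 is an assumption about the desired reset behaviour):
x <- c(NA, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1)
counter <- x * 0                 # NA where x is NA, 0 elsewhere
for (i in seq_along(x)) {
  if (!is.na(x[i]) && x[i] == 1) {
    # fall back to 0 when there is no usable previous value
    prev <- if (i > 1 && !is.na(counter[i - 1])) counter[i - 1] else 0
    counter[i] <- prev + 1
  }
}
counter
#> [1] NA  1  0  0  0  0  1  2  3  4  0  1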
I am writing a script which identifies the interval that each number in a vector falls into, e.g. 0.3 falls in the first interval of the breakpoints 0.5, 0.8, 1. A simplified example of the code is below:
df <- data.frame(p1 = runif(10)/2, p2 = rep(-1,10), p3 = rep(1, 10));
df$p2 <- df$p1 + runif(10)/2;
r <- runif(10);
c(1,2,3)[apply(abs(outer(as.numeric(df[1,]), r, '-')),2, which.min)];
This works well when each value in r is applied to the same vector - in this case as.numeric(df[1,]). However, I now need to apply each value in r to its corresponding unique row in the dataset df. At the moment I am doing this in a loop, which seems inefficient, but I have been unable to find an efficient alternative to looping through each row:
a <- array(dim=10);
for(x in 1:10){
a[x] <- c(1,2,3)[apply(abs(outer(as.numeric(df[x,]), r[x], '-')),2, which.min)];
}
Is there a more efficient alternative than a loop?
Thanks,
James
As @Gregor mentioned, it's better to use findInterval(). You can use the mutate_xxx functions from the dplyr package to apply findInterval() to all columns:
library(tidyverse)
set.seed(1111)
df <- data.frame(p1 = runif(10)/2, p2 = rep(-1,10), p3 = rep(1, 10));
df$p2 <- df$p1 + runif(10)/2;
# define intervals
intv1 <- c(0.3, 0.5, 0.8, 1)
# columns start with `p`
df %>%
mutate_at(vars(starts_with("p")), funs(bin = findInterval(., intv1)))
#> p1 p2 p3 p1_bin p2_bin p3_bin
#> 1 0.23275132 0.2335964 1 0 0 4
#> 2 0.20646243 0.5809412 1 0 2 4
#> 3 0.45350161 0.8289807 1 1 3 4
#> 4 0.06855271 0.3852559 1 0 1 4
#> 5 0.36940842 0.8034923 1 1 3 4
#> 6 0.48816350 0.5678450 1 1 2 4
#> 7 0.43997997 0.8983940 1 1 3 4
#> 8 0.05839214 0.3368955 1 0 1 4
#> 9 0.27314439 0.7233537 1 0 2 4
#> 10 0.07005799 0.4530015 1 0 1 4
# selected columns only
col2select <- c("p1", "p2")
df %>%
mutate_at(col2select, funs(bin = findInterval(., intv1)))
#> p1 p2 p3 p1_bin p2_bin
#> 1 0.23275132 0.2335964 1 0 0
#> 2 0.20646243 0.5809412 1 0 2
#> 3 0.45350161 0.8289807 1 1 3
#> 4 0.06855271 0.3852559 1 0 1
#> 5 0.36940842 0.8034923 1 1 3
#> 6 0.48816350 0.5678450 1 1 2
#> 7 0.43997997 0.8983940 1 1 3
#> 8 0.05839214 0.3368955 1 0 1
#> 9 0.27314439 0.7233537 1 0 2
#> 10 0.07005799 0.4530015 1 0 1
# for all columns
df %>%
mutate_all(funs(bin = findInterval(., intv1)))
#> p1 p2 p3 p1_bin p2_bin p3_bin
#> 1 0.23275132 0.2335964 1 0 0 4
#> 2 0.20646243 0.5809412 1 0 2 4
#> 3 0.45350161 0.8289807 1 1 3 4
#> 4 0.06855271 0.3852559 1 0 1 4
#> 5 0.36940842 0.8034923 1 1 3 4
#> 6 0.48816350 0.5678450 1 1 2 4
#> 7 0.43997997 0.8983940 1 1 3 4
#> 8 0.05839214 0.3368955 1 0 1 4
#> 9 0.27314439 0.7233537 1 0 2 4
#> 10 0.07005799 0.4530015 1 0 1 4
Created on 2018-05-08 by the reprex package (v0.2.0).
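Since this answer was written, funs() has been deprecated and mutate_at()/mutate_all() superseded; with current dplyr the first example would presumably be written with across(), along these lines:
df %>%
  mutate(across(starts_with("p"),
                ~ findInterval(.x, intv1),
                .names = "{.col}_bin"))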
Thanks for the suggestions. I have used dplyr and come up with the following. It seems quicker than my original, but it still slows down as the dataset grows:
library(dplyr);
# Dummy data-set
nRows <- 1000
df <- data.frame(p1 = runif(nRows )/2, p2 = rep(-1,nRows ), p3 = rep(1, nRows ), r = runif(nRows))
df$p2 <- df$p1 + runif(nRows )/2
df %>% dplyr::rowwise() %>%
dplyr::mutate_at(vars(starts_with("r")), funs(bin = 1+findInterval(., c(p1,p2,p3))))
-- James
I have created a dataframe that contains ids and stringvalues:
mycols <- c('id','2')
ids <- c(1,1,2,3)
stringvalues <- c('a','a','b','c')
mydf <- data.frame(ids , stringvalues)
mydf contains :
ids stringvalues
1 1 a
2 1 a
3 2 b
4 3 c
I'm attempting to produce a new dataframe that contains the id and the corresponding counts for each string:
id, a , b , c
1 , 2 , 0 , 0
2 , 0 , 1 , 0
3 , 0 , 0 , 1
I'm trying to create multiple summarise implementations:
g1 <- group_by(mydf , ids)
s1 <- summarise(g1 , a = count('a'))
s2 <- summarise(g1 , b = count('b'))
s3 <- summarise(g1 , c = count('c'))
But this returns the error: Evaluation error: no applicable method for 'groups' applied to an object of class "character".
How can I create new columns that count the number of string entries in the column?
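A note on the error itself: dplyr::count() expects a data frame, not a character string, which is why summarise(g1, a = count('a')) fails. A minimal sketch of a direct fix, summing a logical comparison per group:
library(dplyr)
g1 <- group_by(mydf, ids)
summarise(g1,
          a = sum(stringvalues == 'a'),
          b = sum(stringvalues == 'b'),
          c = sum(stringvalues == 'c'))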
Does doing a dplyr::count followed by tidyr::spread work for you? (I'm only posting this as you mentioned you wanted to create a dataframe of this sort; otherwise it's much simpler to use table(mydf), as the other comments/answers suggest.)
library(dplyr)
library(tidyr)
mydf %>% count(ids, stringvalues) %>% spread(stringvalues, n, fill = 0)
#> # A tibble: 3 x 4
#> ids a b c
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 0 0
#> 2 2 0 1 0
#> 3 3 0 0 1
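spread() has since been superseded by tidyr::pivot_wider(); the same result should be obtainable with:
mydf %>%
  count(ids, stringvalues) %>%
  pivot_wider(names_from = stringvalues, values_from = n, values_fill = 0)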
You can use count directly. First,
count(mydf, ids,stringvalues)
gives
# A tibble: 3 x 3
ids stringvalues n
<dbl> <fctr> <int>
1 1 a 2
2 2 b 1
3 3 c 1
then reshape,
count(mydf, ids,stringvalues) %>% tidyr::spread(stringvalues, n)
gives
# A tibble: 3 x 4
ids a b c
* <dbl> <int> <int> <int>
1 1 2 NA NA
2 2 NA 1 NA
3 3 NA NA 1
then replace the NAs with something like res[is.na(res)] <- 0, where res is the object constructed above.
Here's a base-R solution:
data.frame(cbind(table(mydf)))
Output option 1 (row # = ID):
a b c
1 2 0 0
2 0 1 0
3 0 0 1
Output option 2 (with ID as column):
data.frame(cbind(id=unique(mydf$ids),table(mydf)))
id a b c
1 1 2 0 0
2 2 0 1 0
3 3 0 0 1
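A note on the cbind() wrapper: calling data.frame() directly on a table would produce the long counts format, so another option is to call the matrix method explicitly (a sketch):
# as.data.frame.matrix() keeps the wide contingency-table shape
as.data.frame.matrix(table(mydf))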
Is it possible to group and count instances of all other columns using R (dplyr)? For example, the following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
turns into this (note: y is the value being counted).
EDIT: to explain the transformation, x is what I'm grouping by; for each group I want to count how many times 0, 1 and 2 appeared. For the first row in the transformed dataframe, we counted how many times the value 0 (y) appeared for x = 1 in the other columns: 0 appeared once in column a, twice in column b and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here because the default aggregation function is length. If you don't supply one explicitly, you will get a warning about it:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the dataframe to a data.table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
Which allows you to use count to get the information regarding how often certain values occur in columns a to c. After that, you reformat the dataset to your required format using spread.
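For readers on current tidyr, gather() and spread() have been superseded by pivot_longer() and pivot_wider(); the same pipeline would presumably read:
df %>%
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  count(x, variable, value) %>%
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)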
I am trying to reshape the following dataset with reshape(), without much success.
The starting dataset is in "wide" form, with each id described through one row. The dataset is intended to be adopted for carry out Multistate analyses (a generalization of Survival Analysis).
Each person is recorded for a given overall time span. During this period the subject can experience a number of transitions among states (for simplicity let us fix to two the maximum number of distinct states that can be visited). The first visited state is s1 = 1, 2, 3, 4. The person stays within the state for dur1 time periods, and the same applies for the second visited state s2:
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3
The dataset in long format which I woud like to obtain is:
id cohort s
1 1 3
1 1 3
1 1 3
1 1 3
1 1 2
1 1 2
1 1 2
1 1 2
1 1 2
2 0 1
2 0 1
2 0 1
2 0 1
2 0 4
2 0 4
2 0 4
In practice, each id has dur1 + dur2 rows, and s1 and s2 are melted into a single variable s.
How would you do this transformation? And how would you come back to the original "wide" form of the dataset?
Many thanks!
dat <- cbind(id=c(1,2), cohort=c(1, 0), s1=c(3, 1), dur1=c(4, 4), s2=c(2, 4), dur2=c(5, 3))
You can use reshape() for the first step, but then you need to do some more work. Also, reshape() needs a data.frame() as its input, but your sample data is a matrix.
Here's how to proceed:
reshape() your data from wide to long:
dat2 <- reshape(data.frame(dat), direction = "long",
idvar = c("id", "cohort"),
varying = 3:ncol(dat), sep = "")
dat2
# id cohort time s dur
# 1.1.1 1 1 1 3 4
# 2.0.1 2 0 1 1 4
# 1.1.2 1 1 2 2 5
# 2.0.2 2 0 2 4 3
"Expand" the resulting data.frame using rep()
dat3 <- dat2[rep(seq_len(nrow(dat2)), dat2$dur), c("id", "cohort", "s")]
dat3[order(dat3$id), ]
# id cohort s
# 1.1.1 1 1 3
# 1.1.1.1 1 1 3
# 1.1.1.2 1 1 3
# 1.1.1.3 1 1 3
# 1.1.2 1 1 2
# 1.1.2.1 1 1 2
# 1.1.2.2 1 1 2
# 1.1.2.3 1 1 2
# 1.1.2.4 1 1 2
# 2.0.1 2 0 1
# 2.0.1.1 2 0 1
# 2.0.1.2 2 0 1
# 2.0.1.3 2 0 1
# 2.0.2 2 0 4
# 2.0.2.1 2 0 4
# 2.0.2.2 2 0 4
You can get rid of the funky row names too by using rownames(dat3) <- NULL.
Update: Retaining the ability to revert to the original form
In the example above, since we dropped the "time" and "dur" variables, it isn't possible to directly revert to the original dataset. If you feel this is something you would need to do, I suggest keeping those columns in and creating another data.frame with the subset of the columns that you need if required.
Here's how:
Use aggregate() to get back to "dat2" (this assumes dat3 was created keeping the "time" and "dur" columns, as suggested above):
aggregate(cbind(s, dur) ~ ., dat3, unique)
# id cohort time s dur
# 1 2 0 1 1 4
# 2 1 1 1 3 4
# 3 2 0 2 4 3
# 4 1 1 2 2 5
Wrap reshape() around that to get back to the original wide "dat". Here, in one step:
reshape(aggregate(cbind(s, dur) ~ ., dat3, unique),
direction = "wide", idvar = c("id", "cohort"))
# id cohort s.1 dur.1 s.2 dur.2
# 1 2 0 1 4 4 3
# 2 1 1 3 4 2 5
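As an aside, the whole forward transformation can also be expressed with tidyr, where uncount() handles the row replication; a sketch assuming the same dat as above:
library(dplyr)
library(tidyr)
data.frame(dat) %>%
  pivot_longer(-c(id, cohort),
               names_to = c(".value", "spell"),
               names_pattern = "([a-z]+)(\\d)") %>%
  uncount(dur) %>%                # replicate each row dur times
  select(id, cohort, s)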
There are probably better ways, but this might work.
df <- read.table(text = '
id cohort s1 dur1 s2 dur2
1 1 3 4 2 5
2 0 1 4 4 3',
header=TRUE)
hist <- matrix(0, nrow=2, ncol=9)
hist
for(i in 1:nrow(df)) {
hist[i,] <- c(rep(df[i,3], df[i,4]), rep(df[i,5], df[i,6]), rep(0, (9 - df[i,4] - df[i,6])))
}
hist
hist2 <- cbind(df[,1:2], hist)
colnames(hist2) <- c('id', 'cohort', paste0('x', 1:9))
library(reshape2)
hist3 <- melt(hist2, id.vars=c('id', 'cohort'), variable.name='x', value.name='state')
hist4 <- hist3[order(hist3$id, hist3$cohort),]
hist4
hist4 <- hist4[ , !names(hist4) %in% c("x")]
hist4 <- hist4[!(hist4[,2]==0 & hist4[,3]==0),]
Gives:
id cohort state
1 1 1 3
3 1 1 3
5 1 1 3
7 1 1 3
9 1 1 2
11 1 1 2
13 1 1 2
15 1 1 2
17 1 1 2
2 2 0 1
4 2 0 1
6 2 0 1
8 2 0 1
10 2 0 4
12 2 0 4
14 2 0 4
Of course, if you have more than two states per id then this would have to be modified (and it might have to be modified if you have more than two cohorts). For example, I suppose with 9 sample periods one person could be in the following sequence of states:
1 3 2 4 3 4 1 1 2