I'm having trouble with some data, and I think it's easy to solve. I have a subset like this:
data <- data.frame("treat" = 1:10, "value" = c(12,32,41,0,12,13,11,0,12,0))
What I need is a third column that contains the value "1" when the value in the second column is different from 0, and "0" when the value in the second column equals 0. Like this:
data$param <- c(1,1,1,0,1,1,1,0,1,0)
I tried to do this with if() and else(), but I couldn't get it to work.
You can try:
data$param <- ifelse(data$value != 0, 1, 0)
or you can use the dplyr library:
library(dplyr)
data %>%
  mutate(param = case_when(value != 0 ~ 1, TRUE ~ 0))
or
data$param <- +(data$value != 0)           # unary + coerces the logical to integer
data$param <- as.integer(data$value != 0)  # equivalent, more explicit
data
   treat value param
1      1    12     1
2      2    32     1
3      3    41     1
4      4     0     0
5      5    12     1
6      6    13     1
7      7    11     1
8      8     0     0
9      9    12     1
10    10     0     0
Here is another alternative using the cut() function (see the note after the output below):
library(dplyr)
data %>%
  mutate(param = cut(value, breaks = c(-Inf, 0, max(value)), labels = c(0, 1)))
#    treat value param
# 1      1    12     1
# 2      2    32     1
# 3      3    41     1
# 4      4     0     0
# 5      5    12     1
# 6      6    13     1
# 7      7    11     1
# 8      8     0     0
# 9      9    12     1
# 10    10     0     0
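One caveat with this approach: cut() returns a factor, so the new param column holds the labels "0"/"1" rather than numbers. A minimal sketch of converting it back, using the same data:
data$param <- as.integer(as.character(data$param))  # factor labels "0"/"1" -> integer 0/1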
A while ago I posted a question about how to convert a factor data.frame into a binary (one-hot encoded) data.frame here. Now I am trying to find the most efficient way to loop over trials (rows) and binarize a factor variable. A minimal example would look like this:
d = data.frame(
  Trial = c(1,2,3,4,5,6,7,8,9,10),
  Category = c('a','b','b','b','a','b','a','a','b','a')
)
d
   Trial Category
1      1        a
2      2        b
3      3        b
4      4        b
5      5        a
6      6        b
7      7        a
8      8        a
9      9        b
10    10        a
While I would like to get this:
   Trial a b
1      1 1 0
2      2 0 1
3      3 0 1
4      4 0 1
5      5 1 0
6      6 0 1
7      7 1 0
8      8 1 0
9      9 0 1
10    10 1 0
What would be the most efficient way of doing it?
Here is an option with pivot_wider(). Create a column of 1s and then apply pivot_wider() with names_from as 'Category' and values_from as the newly created column:
library(dplyr)
library(tidyr)
d %>%
  mutate(n = 1) %>%
  pivot_wider(names_from = Category, values_from = n, values_fill = list(n = 0))
# A tibble: 10 x 3
#    Trial     a     b
#    <dbl> <dbl> <dbl>
#  1     1     1     0
#  2     2     0     1
#  3     3     0     1
#  4     4     0     1
#  5     5     1     0
#  6     6     0     1
#  7     7     1     0
#  8     8     1     0
#  9     9     0     1
# 10    10     1     0
An efficient option would be data.table:
library(data.table)
dcast(setDT(d), Trial ~ Category, length)
It can also be done with base R
table(d)
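Note that table(d) returns a contingency table, not a data frame; if the Trial-column layout from above is needed, a small base R sketch (my addition, not from the original answer):
tab <- as.data.frame.matrix(table(d))          # rows are Trial levels, columns are a/b counts
cbind(Trial = as.integer(rownames(tab)), tab)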
I have two datasets, one at the individual level and one at the school level. I would like to calculate the proportion of fighting in each school using a loop (since I have >100 schools).
Current code:
for (i in levels(df$school_id)) {
  school <- subset(df, school_id == i)
  number_students <- nrow(school)
  prop <- sum(school$fight_binary, na.rm = TRUE) / number_students
  df$proportion_fight[df$school_id == i] <- prop
}
I tried initializing the new column first, but when I run this loop nothing happens at all.
Here's some sample data
INDIVIDUAL LEVEL:
student_id school_id ever_fight
         1         2          1
         2         3          0
         3         1          1
         4         1          1
         5         2          0
         6         2          0
         7         2          0
         8         2          0
         9         3          1
        10         1          0
        11         3          1
        12         3          1
        13         3          1
        14         3          1
        15         1          0
        16         2          0
        17         1          0
        18         1          0
        19         1          0
        20         1          0
SCHOOL LEVEL (need to fill the second column with data from above):
school_id proportion_fight
        1
        2
        3
We can use a group-by mean. (A likely reason your loop does nothing: levels() returns NULL when school_id is not a factor, so the loop body never runs; unique(df$school_id) would work instead.)
library(dplyr)
df1 %>%
  group_by(school_id) %>%
  summarise(proportion_fight = mean(ever_fight, na.rm = TRUE))
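For completeness, a base R sketch that computes the same proportions and merges them into the school-level table (assuming that table is named school_df; the question doesn't give its name):
props <- aggregate(ever_fight ~ school_id, data = df1, FUN = mean)
names(props)[2] <- "proportion_fight"                  # rename the aggregated column
school_df <- merge(school_df, props, by = "school_id", all.x = TRUE)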
I'm trying to use dplyr to take the first and last rows of repeated values by group. I'm doing this for efficiency reasons, particularly so that graphing is faster.
This is not a duplicate of Select first and last row from grouped data because I'm not asking for the strict first and last row in a group; I'm asking for the first and last row in a group by level (in my case 1's and 0's) that may appear in multiple chunks.
Here's an example. Say I want to remove all the redundant 1's and 0's from column C while keeping A and B intact.
df = data.frame(
  A = rep(c("a", "b"), each = 10),
  B = rep(c(1:10), 2),
  C = c(1,0,0,0,0,0,1,1,1,1,0,0,0,1,0,0,0,0,0,1))
A  B C
a  1 1
a  2 0
a  3 0
a  4 0
a  5 0
a  6 0
a  7 1
a  8 1
a  9 1
a 10 1
b  1 0
b  2 0
b  3 0
b  4 1
b  5 0
b  6 0
b  7 0
b  8 0
b  9 0
b 10 1
The end result should look like this:
A  B C
a  1 1
a  2 0
a  6 0
a  7 1
a 10 1
b  1 0
b  3 0
b  4 1
b  5 0
b  9 0
b 10 1
Using unique() will either not remove anything or keep just one of the 1s or 0s without retaining the start-and-end quality that I'm trying to achieve. Is there a way to do this without a loop, perhaps using dplyr or forcats?
I think that slice should get you close:
df %>%
  group_by(A, C) %>%
  slice(c(1, n()))
gives
  A         B     C
  <chr> <int> <dbl>
1 a         2     0
2 a         6     0
3 a         1     1
4 a        10     1
5 b         1     0
6 b         9     0
7 b         4     1
8 b        10     1
though this doesn't quite match your expected outcome. n() gives the last row in the group.
After your edit it is clear that you are not looking for the values within any group that is established (which is what my previous version did). You want to group by those runs of 1's or 0's. For that, you will need to create a column that checks whether or not the run of 1's/0's has changed and then one to identify the groups. Then, slice will work as described before. However, because some of your runs are only 1 row long, we need to only include n() if it is more than 1 (otherwise the 1 row shows up twice).
df %>%
  mutate(groupChanged = (C != lag(C, default = C[1]))
         , toCutBy = cumsum(groupChanged)
  ) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))
Gives
   A         B     C groupChanged toCutBy
   <chr> <int> <dbl> <lgl>          <int>
 1 a         1     1 FALSE              0
 2 a         2     0 TRUE               1
 3 a         6     0 FALSE              1
 4 a         7     1 TRUE               2
 5 a        10     1 FALSE              2
 6 b         1     0 TRUE               3
 7 b         3     0 FALSE              3
 8 b         4     1 TRUE               4
 9 b         5     0 TRUE               5
10 b         9     0 FALSE              5
11 b        10     1 TRUE               6
If the runs of 1 or 0 must stay within the level in column A, you also need to add a check for a change in column A to the call. In this example, it does not have an effect (so returns exactly the same values), but it may be desirable in other instances.
df %>%
  mutate(groupChanged = (C != lag(C, default = C[1]) |
                           A != lag(A, default = A[1]))
         , toCutBy = cumsum(groupChanged)
  ) %>%
  group_by(toCutBy) %>%
  slice(c(1, ifelse(n() == 1, NA, n())))
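The same run-based grouping can also be written with data.table's rleid(), which labels each run of identical A/C values directly. A sketch of this alternative (not part of the original answer):
library(data.table)
# first and last row per run; unique() avoids repeating one-row runs
setDT(df)[, .SD[unique(c(1, .N))], by = .(run = rleid(A, C))]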
One solution:
C_filter <- function(x) {
  # Keep row i unless it matches both its predecessor and its successor,
  # i.e. drop only the interior rows of each run. x[0] is zero-length and
  # x[length(x) + 1] is NA, so identical() is FALSE at the endpoints.
  !sapply(seq_along(x), function(i) {
    identical(x[i], x[i - 1])
  }) | !sapply(seq_along(x), function(i) {
    identical(x[i], x[i + 1])
  })
}
df %>% group_by(A) %>% filter(C_filter(C))
   A  B C
1  a  1 1
2  a  2 0
3  a  6 0
4  a  7 1
5  a 10 1
6  b  1 0
7  b  3 0
8  b  4 1
9  b  5 0
10 b  9 0
11 b 10 1
I'm trying to get the spread() function to work with duplicates in the key column. Yes, this has been covered before, but I can't seem to get it to work, and I've spent the better part of a day on it (I'm somewhat new to R).
I have two columns of data. The first column, 'snowday', represents the day of the winter season, with the corresponding snow depth in the 'depth' column. This is several years of data (~62 years), so there should be sixty-two years of first, second, third, etc. days in the snowday column, which produces duplicates in snowday:
snowday  row depth
      1    1     0
      1    2     0
      1    3     0
      1    4     0
      1    5     0
      1    6     0
...
     75 4633    24
     75 4634     4
     75 4635     6
     75 4636    20
     75 4637    29
     75 4638     1
I added a "row" column to make the data frame more transient (which I vaguely understand to be hones so 1:4638 rows is the total measurements taken over ~62 years at 75 days per year . Now i'd like to spread it wide:
wide <- spread(seasondata, key = snowday, value = depth, fill = 0)
and I get all zeros:
row 1 2 3 4 5 6 7 8 9 10 11 12 13 14
  1 0 0 0 0 0 0 0 0 0  0  0  0  0  0
  2 0 0 0 0 0 0 0 0 0  0  0  0  0  0
  3 0 0 0 0 0 0 0 0 0  0  0  0  0  0
What I want it to look like is something like this (the columns are defined by "snowday" and the row values are the various depths recorded for that particular day over the various years, e.g. days 1 through 14):
1 2 3 4 5 6 7 8 9 10 11 12 13 14
2 1 3 4 0 0 1 0 2  8  9 19  0  3
0 8 0 0 0 4 0 6 6  0  1  0  2  0
3 5 0 0 0 2 0 1 0  2  7  0 12  4
I think I'm fundamentally missing something here. I've tried working through drop = TRUE and convert = TRUE, and the output values are either all zeros or NAs depending on how I tinker. Also, all values in the data frame (seasondata) are integers. Any thoughts?
It seems to me that what you wish to do is split the depth column according to the values of snowday, and then bind all 75 columns together.
There is a complication, in that 62*75 is not 4638, so I assume we do not observe 75 snowdays in some years. That is, some of the 75 columns (snowdays) will not have 62 observations. We'll make sure all 75 columns are 62 entries long by filling short columns up with NAs.
I make some fake data as an example. We observe 3 "years" of data for snowdays 1 and 2, but only 2 "years" of data for snowdays 3 and 4.
set.seed(1)
seasondata <- data.frame(
  snowday = c(rep(1:2, each = 3), rep(3:4, each = 2)),
  depth = round(runif(10, 0, 10), 0))
#    snowday depth
# 1        1     3
# 2        1     4
# 3        1     6
# 4        2     9
# 5        2     2
# 6        2     9
# 7        3     9
# 8        3     7
# 9        4     6
# 10       4     1
We first figure out how long a column should be. In your case, m == 62. In my example, m == 3 (the years of data).
m <- max(table(seasondata$snowday))
Now, we use the by() function to split up depth by values of snowday, fill short columns with NAs, and finally cbind all the columns together:
out <- do.call(cbind,
  by(seasondata$depth, seasondata$snowday,
     function(x) {
       c(x, rep(NA, m - length(x)))
     }
  )
)
out
#      1 2  3  4
# [1,] 3 9  9  6
# [2,] 4 2  7  1
# [3,] 6 9 NA NA
Using spread:
You can use spread() if you wish. In this case, you have to define row correctly: row should be 1 for the first occurrence of a given snowday value, 2 for the second occurrence, and so on (so row restarts at 1 within every snowday).
seasondata$row <- unlist(sapply(rle(seasondata$snowday)$lengths, seq_len))  # 1, 2, ... within each run of snowday
seasondata
#    snowday depth row
# 1        1     3   1
# 2        1     4   2
# 3        1     6   3
# 4        2     9   1
# 5        2     2   2
# 6        2     9   3
# 7        3     9   1
# 8        3     7   2
# 9        4     6   1
# 10       4     1   2
Now we can use spread:
library(tidyr)
spread(seasondata, key = snowday, value = depth, fill = NA)
#   row 1 2  3  4
# 1   1 3 9  9  6
# 2   2 4 2  7  1
# 3   3 6 9 NA NA
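As an aside, spread() has since been superseded in tidyr by pivot_wider(); a sketch of the equivalent call, reusing the row column created above:
library(tidyr)
pivot_wider(seasondata, id_cols = row, names_from = snowday, values_from = depth)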
I want to add a column to a data frame based on the values in another column. I want a specific value only for the first time a value appears in the other column. For example:
s <- c(6,5,6,7,8,7,6,5)
i <- c(4,5,4,3,2,3,4,5)
t <- c(1,1,3,4,5,6,6,8)
df<- data.frame(t,s,i)
> df
  t s i
1 1 6 4
2 1 5 5
3 3 6 4
4 4 7 3
5 5 8 2
6 6 7 3
7 6 6 4
8 8 5 5
Now I want to add a column "mark" that contains a 1 for the first time t = 1 and the first time t = 6, so that I get: 1 0 0 0 0 1 0 0. I have this code:
for (i in 1:nrow(df)) {
  if (df$t[i] == 1 & df$t[i-1] != 1 | (df$t[i] == 6 & df$t[i-1] != 6)) {
    df$mark[i] <- 1
  } else {
    df$mark[i] <- 0
  }
}
This however gives the following error:
Error in if (df$t[i] == 1 & df$t[i - 1] != 1 | (df$t[i] == 6 & df$t[i - : argument is of length zero
Can anyone tell me what is going wrong?
Don't use loops; just do:
df$mark <- 0
df$mark[match(c(1, 6), df$t)] <- 1
From the ?match documentation:
match returns a vector of the positions of (first) matches of its
first argument in its second.
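With the example data, this returns the positions of the first 1 and the first 6:
match(c(1, 6), df$t)
# [1] 1 6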
The reason you are getting an error in your loop is that you are looping from 1 to nrow(df) but indexing df$t[i-1], which on the first iteration means df$t[0]: a zero-length vector, hence the "argument is of length zero" error.
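You can see the zero-length value directly:
df$t[0]
# numeric(0)
if (df$t[0] == 1) 1  # Error: argument is of length zero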
within(df, mark <- (c(1, diff(t %in% c(1, 6))) == 1) + 0)
#   t s i mark
# 1 1 6 4    1
# 2 1 5 5    0
# 3 3 6 4    0
# 4 4 7 3    0
# 5 5 8 2    0
# 6 6 7 3    1
# 7 6 6 4    0
# 8 8 5 5    0
Or
duplicated(df$t, fromLast = TRUE) + 0
# [1] 1 0 0 0 0 1 0 0