I want to create a sequence of numbers as a string. I have columns "start" and "end" indicating the start and end of the sequence. The desired output is a comma-separated string containing the sequence from start to end in steps of 1. See the example below.
df <- data.frame(ID=seq(1:5),
start=seq(2,10,by=2),
end=seq(5,13,by=2),
desired_output_aschar= c("2,3,4,5", "4,5,6,7", "6,7,8,9", "8,9,10,11", "10,11,12,13"))
View(df)
Thank you in advance...
The following solution needs only one *apply loop.
mapply(function(x, y) paste(x:y, collapse = ","), df$start, df$end)
#[1] "2,3,4,5" "4,5,6,7" "6,7,8,9" "8,9,10,11" "10,11,12,13"
With the \(x, y) lambda shorthand (R 4.1.0 and later), the output is the same.
mapply(\(x, y) paste(x:y, collapse = ","), df$start, df$end)
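To keep the result next to the inputs, the same call can be assigned to a new column. A minimal sketch, rebuilding the question's data without the desired-output column; the column name output is only illustrative:

```r
# Rebuild the example data from the question
df <- data.frame(ID = 1:5,
                 start = seq(2, 10, by = 2),
                 end = seq(5, 13, by = 2))

# Each call returns a single string, so mapply() simplifies to a character vector
df$output <- mapply(function(x, y) paste(x:y, collapse = ","), df$start, df$end)
df
```

Because each call returns one string, this also works when the ranges have different lengths.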
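mapply calls seq once per start/end pair, and sapply then pastes each resulting column into one string. Note that this relies on every sequence having the same length, so that mapply returns a matrix.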
sapply(
data.frame(mapply(seq,df$start,df$end)),
paste0,
collapse=","
)
X1 X2 X3 X4 X5
"2,3,4,5" "4,5,6,7" "6,7,8,9" "8,9,10,11" "10,11,12,13"
Using dplyr -
library(dplyr)
df %>%
rowwise() %>%
mutate(output = toString(start:end)) %>%
ungroup
# ID start end output
# <int> <dbl> <dbl> <chr>
#1 1 2 5 2, 3, 4, 5
#2 2 4 7 4, 5, 6, 7
#3 3 6 9 6, 7, 8, 9
#4 4 8 11 8, 9, 10, 11
#5 5 10 13 10, 11, 12, 13
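Note that toString() separates the elements with a comma and a space, whereas the desired output in the question has no spaces. A small sketch of the difference, in case the exact format matters:

```r
x <- 2:5

with_space <- toString(x)               # separator is ", "
no_space   <- paste(x, collapse = ",")  # separator is ","

with_space  # "2, 3, 4, 5"
no_space    # "2,3,4,5"
```

Replacing toString(start:end) with paste(start:end, collapse = ",") in the mutate call should give the format from the question.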
We could use map2 from purrr
library(dplyr)
library(purrr)
df %>%
mutate(output = map2_chr(start, end, ~ toString(.x:.y)))
ID start end desired_output_aschar output
1 1 2 5 2,3,4,5 2, 3, 4, 5
2 2 4 7 4,5,6,7 4, 5, 6, 7
3 3 6 9 6,7,8,9 6, 7, 8, 9
4 4 8 11 8,9,10,11 8, 9, 10, 11
5 5 10 13 10,11,12,13 10, 11, 12, 13
A data.table option
> setDT(df)[, out := toString(seq(start, end)), ID][]
ID start end desired_output_aschar out
1: 1 2 5 2,3,4,5 2, 3, 4, 5
2: 2 4 7 4,5,6,7 4, 5, 6, 7
3: 3 6 9 6,7,8,9 6, 7, 8, 9
4: 4 8 11 8,9,10,11 8, 9, 10, 11
5: 5 10 13 10,11,12,13 10, 11, 12, 13
This is the sample data and test results:
tta <- data.frame(v1=c(8, 6, 1, 3, 8, 3, 3, 4, 5, 5, 7, 3, 4, 2, 8, 2, 2, 2, 5, 8, 4, 5, 3, 5, 3),
v2=c(9, 5, 3, 5, 4, 4, 8, 3, 1, 3, 3, 7, 7, 7, 9, 3, 7, 3, 3, 8, 4, 6, 3, 7, 5),
group=c(rep(c(1:5), each=5)))
## not perfect: needs downstream analysis or a merge
resulta <- tta %>%
filter(v1<=6 & v2<=6) %>%
group_by(group) %>%
summarise(n=n(), frac=n/5)
## resulta
## group 3 is lost because none of its rows meet the criterion "v1<=6 & v2<=6"
##
## # A tibble: 4 × 3
## group n frac
## <int> <int> <dbl>
## 1 1 3 0.6
## 2 2 4 0.8
## 3 4 3 0.6
## 4 5 4 0.8
## expected results
##
## # A tibble: 5 × 3
## group n frac
## <int> <int> <dbl>
## 1 1 3 0.6
## 2 2 4 0.8
## 3 3 0 0.0
## 4 4 3 0.6
## 5 5 4 0.8
##
There are two problems:
Filtering first drops group 3, because none of its rows meet the criterion ("v1<=6 & v2<=6").
frac=n/5 hard-codes the group size, so the fraction is wrong whenever a group does not have exactly 5 rows.
Are there any solutions? Methods other than dplyr are also fine. Thanks for your help.
You may try,
tta %>%
mutate(key = as.numeric(v1<=6 & v2<=6)) %>%
group_by(group) %>%
summarize(n = sum(key), frac = n/n())
group n frac
<int> <dbl> <dbl>
1 1 3 0.6
2 2 4 0.8
3 3 0 0
4 4 3 0.6
5 5 4 0.8
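This works because TRUE counts as 1 and FALSE as 0: the sum of the condition is the number of matching rows and its mean is the matching fraction, whatever the group size. A base R sketch of the same idea using tapply (the names ok and res are only illustrative):

```r
tta <- data.frame(
  v1 = c(8, 6, 1, 3, 8, 3, 3, 4, 5, 5, 7, 3, 4, 2, 8, 2, 2, 2, 5, 8, 4, 5, 3, 5, 3),
  v2 = c(9, 5, 3, 5, 4, 4, 8, 3, 1, 3, 3, 7, 7, 7, 9, 3, 7, 3, 3, 8, 4, 6, 3, 7, 5),
  group = rep(1:5, each = 5)
)

# Logical flag per row; no rows are dropped, so empty groups survive
ok <- tta$v1 <= 6 & tta$v2 <= 6

n    <- tapply(ok, tta$group, sum)   # matching rows per group
frac <- tapply(ok, tta$group, mean)  # matching fraction per group

res <- data.frame(group = as.integer(names(n)),
                  n = as.vector(n),
                  frac = as.vector(frac))
res
```

Using mean() instead of n/5 avoids hard-coding the group size.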
I have been researching this question on SO thoroughly since this morning. The original dataset has more than 1000 rows. My overall goal is to extract particular columns to run an OLS regression.
I selected the columns I need and transformed them to wide format using pivot_wider. The transformed table has 5 columns representing the indicator names. The rows are respondents' ids and the values are their answers.
The problem is that after the transformation the values became nested objects (list-columns). I tried to resolve this on a sample dataset using unnest(cols = everything()), and it works fine:
examp_df <- tibble(
seance = rep(1:5, each = 5),
ind = rep(inds, 5),
ind_name = rep(inds_name, 5),
answer = list(rep(rnorm(5, 0.7, 1), 5))
)
examp_df_wide <- examp_df %>%
pivot_wider(id_cols = seance,
names_from = ind_name,
values_from = answer)
examp_df_wide <- examp_df_wide %>%
unnest(cols = everything())
But when I try it on my original dataset, I get an error about incompatible lengths, and I do not understand how unnest works.
Here's the dataset which I have problems with. How can I unnest the data?
The list of sources that I have researched:
Pivot wider produces nested object
R: Error: Incompatible lengths when using unnest in dplyr
Unnest or unchop dataframe containing lists of different lengths
https://tidyr.tidyverse.org/articles/nest.html
Original data is here.
The code for the original data is the following:
data_all <- data_all %>%
pivot_wider(id_cols = seance_id,
names_from = ind_name,
values_from = criteria_answ)
> data_all <- data_all %>%
+ unnest(cols = everything())
Error: Incompatible lengths: 4, 5.
Run `rlang::last_error()` to see where the error occurred.
If you want an unnested dataframe you can do:
library(tidyr)
pivot_wider(data_all, names_from = ind_name, values_from = criteria_answ)
# A tibble: 3,930 x 7
# seance_id criteria name2 name1 name3 name5 name4
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 3133688 ind3_2 7 NA NA NA NA
# 2 3133688 ind4_2 6 NA NA NA NA
# 3 3133688 ind3_3 NA 7 NA NA NA
# 4 3133688 ind3_4 NA NA 7 NA NA
# 5 3133688 ind4_3 NA 6 NA NA NA
# 6 3133688 ind4_4 NA NA 6 NA NA
# 7 3133688 nps NA NA NA 5 NA
# 8 3145092 ind1_1 NA NA NA NA 5
# 9 3145092 ind1_2 4 NA NA NA NA
#10 3145092 ind1_3 NA 5 NA NA NA
# … with 3,920 more rows
If you want every seance_id in one row, you need to decide how to represent columns that hold more than one value for a seance_id. For example, in the output above seance_id = 3133688 has two values in the name2 column. To collapse 3133688 into one row, how should these values be combined: as their sum, their mean, or one comma-separated string? You can pass a function to the values_fn argument of pivot_wider. For example, with toString:
pivot_wider(data_all, id_cols = seance_id, names_from = ind_name,
values_from = criteria_answ, values_fn = toString)
# A tibble: 422 x 6
# seance_id name2 name1 name3 name5 name4
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 3133688 7, 6 7, 6 7, 6 5 NA
# 2 3145092 4, 5, 5, 8 5, 5, 5, 9 5, 5, 6, 7 3 5, 6, 5, 8
# 3 3143656 10 10 10 10 10
# 4 3145088 9, 9, 9 10, 8, 8 9, 10, 7 8 9, 10, 10
# 5 3145117 6, 4, 7 7, 6, 9 7, 6, 9 6 8, 8, 9
# 6 3148589 10, 10, 7 10, 10, 5 8, 9, 5 9 10, 10, 7
# 7 3135731 10, 9, 7 9, 8, 6 8, 8, 8 8 10, 9, 7
# 8 3145111 7, 7, 7, 8, 9 10, 10, 9, 8, 9 9, 7, 7, 8, 9 4 9, 7, 9, 8, 9
# 9 3149981 8, 8, 8, 8 8, 8, 8, 9 8, 8, 8, 8 9 9, 8, 8, 9
#10 3150048 9 10 10 10 9
# … with 412 more rows
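As for the error itself: unnesting several list-columns at once requires that, within each row, the columns hold the same number of values (or a single value), which is presumably what fails in the original data with lengths 4 and 5. A minimal base R sketch for locating such rows, assuming a toy table with hypothetical ids and values (not the original data):

```r
# Toy wide table with two list-columns (hypothetical ids and values)
wide <- data.frame(seance_id = c("a", "b"), stringsAsFactors = FALSE)
wide$name1 <- list(c(7, 6), c(5, 5, 5, 9))
wide$name2 <- list(c(7, 6), c(4, 5, 5, 8, 1))

# Number of values stored in each cell of the list-columns
len <- sapply(wide[c("name1", "name2")], lengths)

# Rows whose list-columns disagree in length cannot be unnested together
bad <- apply(len, 1, function(r) length(unique(r)) > 1)
wide$seance_id[bad]
```

Here row "b" holds 4 values in name1 and 5 in name2, so it is the one that would block a joint unnest.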
I'm currently using data that has four columns that I want to sum by their unique ID into a new column. I'm new to R, so any help is appreciated! Thank you.
Example of the input columns and desired output:
Like this (using the iris dataset as an example)
iris$new_col <- iris$Sepal.Length + iris$Sepal.Width
For your example
df$Sum <- df$Pillar_1 + df$Pillar_2 + df$Pillar_3
This supposes your dataframe is called df
Using dplyr:
library(tidyverse)
df <- tibble(
opportunity = c(
639495, 303678, 629464, 297662, 302891
),
`Pillar 1` = c(
4, 3, 5, 3, 2
),
`Pillar 2` = c(
7, 8, 9, 4, 4
),
`Pillar 3` = c(
4, 6, 2, 5, 8
)
)
df %>% mutate(
Sum = `Pillar 1` + `Pillar 2` + `Pillar 3`
)
Yielding the output
# A tibble: 5 x 5
opportunity `Pillar 1` `Pillar 2` `Pillar 3` Sum
1 639495 4 7 4 15
2 303678 3 8 6 17
3 629464 5 9 2 16
4 297662 3 4 5 12
5 302891 2 4 8 14
df <- data.frame(
opportunity = c(
639495, 303678, 629464, 297662, 302891
),
Pillar1 = c(
4, 3, 5, 3, 2
),
Pillar2 = c(
7, 8, 9, 4, 4
),
Pillar3 = c(
4, 6, 2, 5, 8
)
)
df$Sum <- apply(df[,-1], 1, sum)
> df
opportunity Pillar1 Pillar2 Pillar3 Sum
1 639495 4 7 4 15
2 303678 3 8 6 17
3 629464 5 9 2 16
4 297662 3 4 5 12
5 302891 2 4 8 14
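apply(df[,-1], 1, sum) converts the selected columns to a matrix and calls sum once per row; rowSums is a vectorised equivalent. A sketch that selects the columns by name, so that any extra columns do not leak into the sum. The na.rm = TRUE setting is an assumption about how missing values should be treated:

```r
df <- data.frame(
  opportunity = c(639495, 303678, 629464, 297662, 302891),
  Pillar1 = c(4, 3, 5, 3, 2),
  Pillar2 = c(7, 8, 9, 4, 4),
  Pillar3 = c(4, 6, 2, 5, 8)
)

# Columns to add up, chosen by name rather than by position
pillars <- c("Pillar1", "Pillar2", "Pillar3")

# Row-wise sum; na.rm = TRUE ignores missing values instead of returning NA
df$Sum <- rowSums(df[pillars], na.rm = TRUE)
df
```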
With dplyr, you could also do:
library(dplyr)
df %>%
mutate(Sum = rowSums(select(., contains("Pillar"))))
Output:
opportunity Pillar1 Pillar2 Pillar3 Sum
1 639495 4 7 4 15
2 303678 3 8 6 17
3 629464 5 9 2 16
4 297662 3 4 5 12
5 302891 2 4 8 14
If you want to include columns in the sum whose names do not contain the string Pillar, you can also select by index:
df %>%
mutate(Sum = rowSums(select(., 2:4)))
Or use -1 instead of 2:4 if you want to sum all columns except the first one (as already indicated in one of the other answers).
Here is an option using tidyverse
library(tidyverse)
df %>%
mutate(Sum = select(., starts_with('Pillar')) %>%
reduce(`+`))
# opportunity Pillar1 Pillar2 Pillar3 Sum
#1 639495 4 7 4 15
#2 303678 3 8 6 17
#3 629464 5 9 2 16
#4 297662 3 4 5 12
#5 302891 2 4 8 14
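reduce(`+`) adds the selected columns pairwise, i.e. (Pillar1 + Pillar2) + Pillar3. The same idea can be sketched in base R with Reduce. The data below is hypothetical and contains a missing value to show one consequence: unlike rowSums(..., na.rm = TRUE), a missing value in any column makes that row's sum NA.

```r
# Hypothetical data with one missing value in the third row
pillars <- data.frame(
  Pillar1 = c(4, 3, NA),
  Pillar2 = c(7, 8, 9),
  Pillar3 = c(4, 6, 2)
)

# Reduce() walks over the columns: (Pillar1 + Pillar2) + Pillar3
total <- Reduce(`+`, pillars)
total  # 15 17 NA
```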
data
df <- structure(list(opportunity = c(639495, 303678, 629464, 297662,
302891), Pillar1 = c(4, 3, 5, 3, 2), Pillar2 = c(7, 8, 9, 4,
4), Pillar3 = c(4, 6, 2, 5, 8)), class = "data.frame",
row.names = c(NA, -5L))
I have a dataset similar to:
Name, Day, Score, Diff
Jain, 1, 8, 0
Jain, 2, 6, -2
Jain, 3, 8, 2
Jain, 4, 12, 4
Jain, 5, 13, 1
Jain, 6, 6, -7
Matt, 1,4, 0
Matt, 2, 10, 6
Matt, 3, 11, 1
Matt, 4, 12, 1
Matt, 5, 5, -7
Matt, 6, 6, 1
I want to add a new column which records "Off" when the score difference drops by 3 points or more, until there's a gain of 3 points or more, which then records "On" until the next drop.
Example:
Name, Day, Score, Diff, OnOff
Jain, 1, 8, 0, "Off"
Jain, 2, 6, -2, "Off"
Jain, 3, 8, 2, "Off"
Jain, 4, 12, 4, "On"
Jain, 5, 13, 1, "On"
Jain, 6, 6, -7, "Off"
Matt, 1,4, 0, "Off"
Matt, 2, 10, 6, "On"
Matt, 3, 11, 1, "On"
Matt, 4, 12, 1, "On"
Matt, 5, 5, -7, "Off"
Matt, 6, 6, 1, "Off"
Can't seem to figure out how to code this one. I've attempted with the following:
df$OnOff <- ifelse(df$Diff >= 3, "On", ifelse(df$Diff <= -3, "Off", ""))
df$OnOff <- ifelse(df$OnOff == "", lag(df$OnOff), df$OnOff)
Put in the changes, then use zoo::na.locf (or similar) to fill in the blanks. Calling your data dd:
dd$OnOff = NA
dd$OnOff[1] = "off"
dd$OnOff[dd$Diff >= 3] = "on"
dd$OnOff[dd$Diff <= -3] = "off"
dd$OnOff = zoo::na.locf(dd$OnOff)
dd
# Name Day Score Diff OnOff
# 1: Jain 1 8 0 off
# 2: Jain 2 6 -2 off
# 3: Jain 3 8 2 off
# 4: Jain 4 12 4 on
# 5: Jain 5 13 1 on
# 6: Jain 6 6 -7 off
# 7: Matt 1 4 0 off
# 8: Matt 2 10 6 on
# 9: Matt 3 11 1 on
# 10: Matt 4 12 1 on
# 11: Matt 5 5 -7 off
# 12: Matt 6 6 1 off
You don't mention grouping in the question, but you can use dplyr or data.table to do the locf by Name if needed.
To do things by name, you'll need to set the first row of each name to the default 'off'. See Melissa's solution for a dplyr method. With data.table it looks like this:
library(data.table)
setDT(dd)
dd[, OnOff := c('off', rep(NA, .N - 1)), by = Name]
dd[Diff >= 3, OnOff := "on"]
dd[Diff <= -3, OnOff := "off"]
dd[, OnOff := zoo::na.locf(OnOff), by = Name]
Using this data:
dd = data.table::fread("Name, Day, Score, Diff
Jain, 1, 8, 0
Jain, 2, 6, -2
Jain, 3, 8, 2
Jain, 4, 12, 4
Jain, 5, 13, 1
Jain, 6, 6, -7
Matt, 1,4, 0
Matt, 2, 10, 6
Matt, 3, 11, 1
Matt, 4, 12, 1
Matt, 5, 5, -7
Matt, 6, 6, 1")
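zoo::na.locf carries the last non-missing value forward. If adding zoo as a dependency is not desirable, the same fill can be sketched in base R. The helper locf below is hypothetical (not part of any package), and the state vector is made-up illustration data:

```r
# Hypothetical helper: last observation carried forward
locf <- function(x) {
  filled <- !is.na(x)
  # idx[i] = number of non-missing values at or before position i
  idx <- cumsum(filled)
  # Prepend NA so that leading missing values (idx == 0) stay missing
  c(NA, x[filled])[idx + 1]
}

state <- c("off", NA, NA, "on", NA, "off", NA)
locf(state)
# "off" "off" "off" "on" "on" "off" "off"
```

For the grouped version, this helper could be applied per Name in the same place where na.locf is called.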
Here's another tidyverse solution using fill:
library(tidyverse)
df %>%
mutate(
OnOff = case_when(
1:n() == 1 ~ 'Off',
Diff < -2 ~ "Off",
Diff >2 ~ "On",
TRUE ~ NA_character_)
) %>%
fill(OnOff)
Doing it by name:
df %>%
group_by(Name) %>%
mutate(
OnOff = case_when(
1:n() == 1 ~ 'Off',
Diff < -2 ~ "Off",
Diff >2 ~ "On",
TRUE ~ NA_character_)
) %>%
fill(OnOff)
One can write a simple function that traverses Diff and switches between On and Off whenever a threshold is crossed:
#Function to decide On/Off logic
getOnOff <- function(x){
lstVal <- "Off"
value <- rep(NA,length(x))
for(i in seq_along(x)){
if(x[i] >= 3){
lstVal = "On"
}else if(x[i] <= -3){
lstVal = "Off"
}
value[i] <- lstVal
}
value
}
# Now use the function with dplyr after grouping on Name
library(dplyr)
df %>% group_by(Name) %>%
mutate(OnOff = getOnOff(Diff))
# # A tibble: 12 x 5
# # Groups: Name [2]
# Name Day Score Diff OnOff
# <chr> <int> <int> <int> <chr>
# 1 Jain 1 8 0 Off
# 2 Jain 2 6 -2 Off
# 3 Jain 3 8 2 Off
# 4 Jain 4 12 4 On
# 5 Jain 5 13 1 On
# 6 Jain 6 6 -7 Off
# 7 Matt 1 4 0 Off
# 8 Matt 2 10 6 On
# 9 Matt 3 11 1 On
# 10 Matt 4 12 1 On
# 11 Matt 5 5 -7 Off
# 12 Matt 6 6 1 Off
Option #2: The OP probably did not mean to switch based on the running count of gains versus drops, but if that is needed, one can use cumsum with dplyr. Each occurrence of Diff >= 3 raises the count and each Diff <= -3 lowers it. The difference of the two cumulative sums gives a net count, and the state is On while that count is positive.
library(dplyr)
df %>% mutate(OnOff = ifelse(cumsum(Diff >= 3) - (cumsum(Diff<= -3))>0, "On","Off"))
# Name Day Score Diff OnOff
# 1 Jain 1 8 0 Off
# 2 Jain 2 6 -2 Off
# 3 Jain 3 8 2 Off
# 4 Jain 4 12 4 On
# 5 Jain 5 13 1 On
# 6 Jain 6 6 -7 Off
# 7 Matt 1 4 0 Off
# 8 Matt 2 10 6 On
# 9 Matt 3 11 1 On
# 10 Matt 4 12 1 On
# 11 Matt 5 5 -7 Off
# 12 Matt 6 6 1 Off
#
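The two options agree on the sample data because gains and drops alternate there. A small sketch of a case where they differ, using hypothetical Diff values with two gains followed by a single drop:

```r
d <- c(4, 5, -3, 0)  # hypothetical Diff values: two gains, then one drop

# Option 1: the state flips on every threshold crossing
toggle <- character(length(d))
state <- "Off"
for (i in seq_along(d)) {
  if (d[i] >= 3) {
    state <- "On"
  } else if (d[i] <= -3) {
    state <- "Off"
  }
  toggle[i] <- state
}

# Option 2: the state depends on the net count of gains minus drops
net <- cumsum(d >= 3) - cumsum(d <= -3)
counted <- ifelse(net > 0, "On", "Off")

toggle   # "On" "On" "Off" "Off"
counted  # "On" "On" "On"  "On"
```

With the counting approach, one drop only cancels one of the two earlier gains, so the state stays On.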
Data:
df <- read.table(text="
Name, Day, Score, Diff
Jain, 1, 8, 0
Jain, 2, 6, -2
Jain, 3, 8, 2
Jain, 4, 12, 4
Jain, 5, 13, 1
Jain, 6, 6, -7
Matt, 1,4, 0
Matt, 2, 10, 6
Matt, 3, 11, 1
Matt, 4, 12, 1
Matt, 5, 5, -7
Matt, 6, 6, 1",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
I'm using the method below to cast variables in a data frame from long to wide format. However, I'm looking for an alternative way, using another package.
Any help is much appreciated.
subject <- c(1:10, 1:10)
condition <- c(rep(1,10), rep(2,10))
value <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
rating <- c(1, 3, 5, 2, 3, 5, 6, 7, 5, 3, 5, 7, 3, 6, 3, 5, 6, 7, 7, 8)
df <- data.frame(subject, condition, value, rating)
library(data.table)
df_wide <- dcast(setDT(df), subject ~ condition, value.var=c("rating", "value"))
We can use tidyverse
library(tidyverse)
df %>%
gather(key, val, value:rating) %>%
unite(cond, key, condition) %>%
spread(cond, val)
# subject rating_1 rating_2 value_1 value_2
#1 1 1 5 1 1
#2 2 3 7 2 2
#3 3 5 3 3 3
#4 4 2 6 4 4
#5 5 3 3 5 5
#6 6 5 5 1 1
#7 7 6 6 2 2
#8 8 7 7 3 3
#9 9 5 7 4 4
#10 10 3 8 5 5
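Base R's reshape() is another alternative that needs no extra package. A minimal sketch on the question's data, rebuilt here as a plain data.frame; the sep = "_" setting is only there to mimic the column names above:

```r
df <- data.frame(
  subject   = c(1:10, 1:10),
  condition = c(rep(1, 10), rep(2, 10)),
  value     = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  rating    = c(1, 3, 5, 2, 3, 5, 6, 7, 5, 3, 5, 7, 3, 6, 3, 5, 6, 7, 7, 8)
)

# One row per subject; value and rating are spread across the conditions
df_wide <- reshape(df,
                   idvar = "subject",
                   timevar = "condition",
                   direction = "wide",
                   sep = "_")
df_wide
```

The result has one row per subject, with the value and rating columns suffixed by the condition.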