How to 'summarize' a variable that mixes 'numeric' and 'character' values - r

Here is a data.frame called data, shown below. How can I transform it into wished_data? Thanks!
library(tidyverse)
data <- data.frame(category=c('a','b','a','b','a'),
values=c(1,'A','2','4','B'))
# the code below doesn't work
data %>% group_by(category ) %>%
summarize(sum=if_else(is.numeric(values)>0,sum(is.numeric(values)),paste0(values)))
# below is the desired result
wished_data <- data.frame(category=c('a','a','b','b'),
values=c('3','B','A','4'))

Mixing numeric and character values in one column is not tidy. Consider giving each type its own column, for example:
data %>%
mutate(letters = str_extract(values, "[A-Z]"),
numbers = as.numeric(str_extract(values, "\\d"))) %>%
group_by(category) %>%
summarise(values = sum(numbers, na.rm = TRUE),
letters = na.omit(letters))
category values letters
<chr> <dbl> <chr>
1 a 3 B
2 b 4 A
In R, arithmetic on strings does not work: "1+1" is not "2", and is.numeric("1") returns FALSE. A workaround is to store the values in a list column, or to give each type its own column.
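To make the point about is.numeric() concrete, here is a minimal base-R sketch using the values column from the question. The names nums, is_num and total are illustrative and do not appear in the original answers.

```r
# The question's column: mixing 1 with letters makes the whole vector character.
values <- c(1, "A", "2", "4", "B")

is.numeric(values)   # FALSE: is.numeric() checks the vector's type, not its contents

# as.numeric() parses what it can and yields NA for the rest (with a warning,
# suppressed here because the NAs are expected).
nums <- suppressWarnings(as.numeric(values))
is_num <- !is.na(nums)

# Sum only the entries that parsed as numbers.
total <- sum(nums[is_num])
```

The logical vector is_num plays the same role as the num_check column used in one of the other answers: it separates the rows that can be summed from the rows that must be kept as text.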

I'd create a separate column to group numeric values in a category separately from characters.
data %>%
mutate(num_check = grepl("[0-9]", values)) %>%
group_by(category, num_check) %>%
summarize(sum = ifelse(
unique(num_check),
as.character(sum(as.numeric(values))),
unique(values)
), .groups = "drop")
#> # A tibble: 4 × 3
#> category num_check sum
#> <chr> <lgl> <chr>
#> 1 a FALSE B
#> 2 a TRUE 3
#> 3 b FALSE A
#> 4 b TRUE 4

Here is a somewhat messy answer:
library(dplyr)
bind_rows(data %>%
filter(is.na(as.numeric(values))),
data %>%
mutate(values = as.numeric(values)) %>%
group_by(category) %>%
summarise(values = as.character(sum(values, na.rm = TRUE)))) %>%
arrange(category)
category values
#1 a B
#2 a 3
#3 b A
#4 b 4

Related

How can I add individual summary values per participant or group to a long dataframe in R, when the replacement is shorter than the original variable?

I have a long dataset with about 6000 observations per participant. I would like to compute a count for one of my variables (max count is 12) and add this count into a new variable in the dataframe. However, there should be only one value entered per participant and the remaining cells may be filled with NA.
I have first attempted to create an empty variable and then tried the following mutation:
dfl$Hits <- NA
dfl$Hits <- dfl %>%
group_by(participant) %>%
filter(SpaceREsponseType == "Hit") %>%
count() %>%
mutate(id = cur_group_id()) %>%
mutate(id, na.rm = F)
I have also tried
dfl$Hits <- dfl %>%
group_by(participant) %>%
mutate(n = replace(rep(NA, n()), 1, sum(!is.na(SpaceREsponseType == "Hit")))) %>%
ungroup
However, this results in the following error message:
Error:
! Assigned data ... %>% count() must be compatible with existing data.
✖ Existing data has 66619 rows.
✖ Assigned data has 142 rows.
ℹ Only vectors of size 1 are recycled.
What do I need to add to make this work?
Thanks in advance and best wishes,
Jasmine
I have created a sample data frame.
The data are grouped by participant and Hit, and a row number is added.
With mutate(n = n()), the Hit and NoHit rows are counted per participant.
After making the data wider, the condition is applied with case_when.
Then the result is brought back into the original long format.
library(tidyverse)
df <- data.frame(
participant = sample(c("A", "B", "C"), replace = TRUE, 100),
Hit = sample(c("Hit", "NoHit"), replace = TRUE, 100)
)
df |>
group_by(participant, Hit) |>
mutate(rn = row_number()) |>
mutate(n = n()) |>
pivot_wider(names_from = Hit, values_from = n) |>
ungroup() |>
mutate(across(
ends_with("it"),
~ case_when(
rn == 1 ~ .x,
rn > 1 ~ NA_integer_
)
)) |>
pivot_longer(NoHit:Hit) |>
select(-rn)
#> # A tibble: 114 × 3
#> participant name value
#> <chr> <chr> <int>
#> 1 A NoHit 21
#> 2 A Hit 12
#> 3 B NoHit 17
#> 4 B Hit 17
#> 5 A NoHit NA
#> 6 A Hit NA
#> 7 C NoHit 19
#> 8 C Hit 14
#> 9 B NoHit NA
#> 10 B Hit NA
#> # … with 104 more rows
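The error in the question arises because the right-hand side is a whole summarised data frame (one row per participant) being assigned to a single column of the full-length data. A shorter sketch that stays in long format is to compute the count inside mutate() and keep it only on each participant's first row. The toy data and the "Miss" level are assumptions for illustration; only the column names participant and SpaceREsponseType come from the question.

```r
library(dplyr)

# Toy stand-in for the long data set (hypothetical values).
dfl <- data.frame(
  participant = c("A", "A", "A", "B", "B"),
  SpaceREsponseType = c("Hit", "Miss", "Hit", "Miss", "Hit")
)

dfl <- dfl %>%
  group_by(participant) %>%
  mutate(
    # Count the hits per participant, then keep the count on the first row only.
    Hits = if_else(
      row_number() == 1L,
      sum(SpaceREsponseType == "Hit", na.rm = TRUE),
      NA_integer_
    )
  ) %>%
  ungroup()
```

Because mutate() returns one value per existing row, the new column always has the same length as the data, which avoids the size mismatch reported in the error message.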

Extract digits from strings in R

I have a data frame which contains a text string like the one below, showing the ingredients and the proportion of each ingredient. What I would like to achieve is to extract the proportion of each ingredient as a separate variable:
What I have:
given <- tibble(
ingredients =c("1.5BZ+1FZ+2HT","2FZ","0.5HT+2BZ")
)
What I want to achieve:
to_achieve <- tibble(
ingredients =c("1.5BZ+1FZ+2HT","2FZ","0.5HT+2BZ"),
proportion_bz = c(1.5,0,2),
proportion_fz = c(1,2,0),
proportion_ht = c(2,0,0.5)
)
Please note there might be more than a dozen different ingredients and tidyverse methods are preferred.
Thanks in advance,
Felix
Making heavy use of tidyr you could first split your strings into rows per ingredient using separate_rows, afterwards extract the numeric proportion and the type of ingredient and finally use pivot_wider to reshape into your desired format:
library(dplyr)
library(tidyr)
given %>%
mutate(ingredients_split = ingredients) |>
tidyr::separate_rows(ingredients_split, sep = "\\+") |>
tidyr::extract(
ingredients_split,
into = c("proportion", "ingredient"),
regex = "^([\\d+\\.]+)(.*)$"
) |>
mutate(
proportion = as.numeric(proportion),
ingredient = tolower(ingredient)
) |>
pivot_wider(
names_from = ingredient,
names_prefix = "proportion_",
values_from = proportion,
values_fill = 0
)
#> # A tibble: 3 × 4
#> ingredients proportion_bz proportion_fz proportion_ht
#> <chr> <dbl> <dbl> <dbl>
#> 1 1.5BZ+1FZ+2HT 1.5 1 2
#> 2 2FZ 0 2 0
#> 3 0.5HT+2BZ 2 0 0.5
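As a rough illustration of what the splitting and the regular expression do to a single string, here is a base-R sketch. It uses a simplified pattern ([0-9.]+ for the number) rather than the exact one above, and the variable names are illustrative.

```r
# One string from the question, split on the literal "+".
parts <- strsplit("1.5BZ+1FZ+2HT", "+", fixed = TRUE)[[1]]

# Leading digits and dots form the proportion; whatever follows is the ingredient.
pattern <- "^([0-9.]+)(.*)$"
proportion <- as.numeric(sub(pattern, "\\1", parts))
ingredient <- tolower(sub(pattern, "\\2", parts))

# Named vector of proportions, one entry per ingredient.
result <- setNames(proportion, ingredient)
```

Each element of result corresponds to one cell of the wide table for that row; pivot_wider() with values_fill = 0 then supplies the zeros for ingredients that a given string does not mention.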
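An alternative using separate() and pivot_longer(); as written, it handles at most three ingredients per string:
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(janitor)
# SOLUTION -----
given %>%
separate(ingredients, into = c("a", "b", "c"), sep = "\\+", remove = FALSE) %>%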
pivot_longer(a:c) %>%
select(-name) %>%
mutate(name = str_remove_all(value, "[0-9]|\\."),
value = parse_number(value)) %>%
na.omit() %>%
pivot_wider(names_prefix = "proportion_", values_fill = 0) %>%
clean_names()
# OUTPUT ----
#># A tibble: 3 × 4
#> ingredients proportion_bz proportion_fz proportion_ht
#> <chr> <dbl> <dbl> <dbl>
#>1 1.5BZ+1FZ+2HT 1.5 1 2
#>2 2FZ 0 2 0
#>3 0.5HT+2BZ 2 0 0.5

How to pass column names that include spaces in R

Assume my column names are User ID and name.
How should I pass these column names to functions like the ones below?
df %>%
group_by(User ID) %>%
count(name)
Apparently, group_by() and similar functions do not accept column names that contain spaces.
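Wrap the column name in backticks. Note that data.frame() replaces the space with a dot by default (check.names = TRUE), so build the data with tibble() or pass check.names = FALSE to keep the original name: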
library(tidyverse)
df <- tibble(`User ID` = 1:2, x = 5:6)
df %>%
group_by(`User ID`) %>%
summarise(total = sum(x))
#> # A tibble: 2 × 2
#> `User ID` total
#> <int> <int>
#> 1 1 5
#> 2 2 6
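The backticks are the essential part; a plain data.frame works as well once the name actually contains the space. A minimal sketch, with made-up data, showing both the default renaming and the backtick call from the question:

```r
library(dplyr)

# By default, data.frame() makes names syntactic: "User ID" becomes "User.ID".
df_default <- data.frame(`User ID` = 1:2, x = 5:6)
default_names <- names(df_default)

# check.names = FALSE keeps the space (illustrative data, not from the question).
df_kept <- data.frame(
  `User ID` = c(1, 1, 2),
  name = c("a", "b", "a"),
  check.names = FALSE
)

# Backticks let dplyr verbs refer to the non-syntactic name.
counts <- df_kept %>%
  group_by(`User ID`) %>%
  count(name)
```

If the data were read from a file, the column may already have been renamed to User.ID on import, so checking names(df) first is worthwhile.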

applying function to each group using dplyr and return specified dataframe

I used group_map for the first time and think I am doing it correctly. This is my code:
library(REAT)
df <- data.frame(value = c(1,1,1, 1,0.5,0.1, 0,0,0,1), group = c(1,1,1, 2,2,2, 3,3,3,3))
haves <- df %>%
group_by(group) %>%
group_map(~gini(.x$value, coefnorm = TRUE))
The thing is that haves is a list rather than a data frame. What would I have to do to obtain this data frame?
wants <- data.frame(group = c(1,2,3), gini = c(0,0.5625,1))
group gini
1 0.0000
2 0.5625
3 1.0000
Thanks!
You can use dplyr::summarize:
df %>%
group_by(group) %>%
summarize(gini = gini(value, coefnorm = TRUE))
#> # A tibble: 3 x 2
#> group gini
#> <dbl> <dbl>
#> 1 1 0
#> 2 2 0.562
#> 3 3 1
According to the documentation, group_map always produces a list. group_modify is an alternative that produces a tibble if the function does, but gini just outputs a vector. So, you could do something like this...
df %>%
group_by(group) %>%
group_modify(~tibble(gini = gini(.x$value, coefnorm = TRUE)))
# A tibble: 3 x 2
# Groups: group [3]
group gini
<dbl> <dbl>
1 1 0
2 2 0.562
3 3 1
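The difference between the list returned by group_map() and the tibble returned by summarize() can be sketched without the REAT package. Here value_range() is a hypothetical stand-in for gini(); it is only meant to be some function that returns one number per group.

```r
library(dplyr)

df <- data.frame(
  value = c(1, 1, 1, 1, 0.5, 0.1, 0, 0, 0, 1),
  group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)
)

# Hypothetical stand-in for REAT::gini(): any function returning a single number.
value_range <- function(x) max(x) - min(x)

# group_map() returns a plain list with one element per group.
as_list <- df %>%
  group_by(group) %>%
  group_map(~ value_range(.x$value))

# summarize() returns a tibble with the grouping column attached.
as_tbl <- df %>%
  group_by(group) %>%
  summarize(stat = value_range(value))
```

When the function returns a single value per group, summarize() is usually the simpler choice because the group labels stay attached to the results.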
Using data.table
library(data.table)
setDT(df)[, .(gini = gini(value, coefnorm = TRUE)), group]
For grouped datasets, we can use the .data pronoun if we don't want to refer to columns by bare names:
library(dplyr)
df %>%
group_by(group) %>%
summarize(gini = gini(.data$value, coefnorm = TRUE))

How to gather then mutate a new column then spread to wide format again

Using tidyr/dplyr, I have some factor columns which I'd like to Z-score, and then mutate an average Z-score, whilst retaining the original data for reference.
I'd like to avoid using a for loop in tidyr/dplyr, thus I'm gathering my data and performing my calculation (Z-score) on a single column. However, I'm struggling with restoring the wide format.
Here is a MWE:
library(dplyr)
library(tidyr)
# Original Data
dfData <- data.frame(
Name = c("Steve","Jwan","Ashley"),
A = c(10,20,12),
B = c(0.2,0.3,0.5)
) %>% tbl_df()
# Gather to Z-score
dfLong <- dfData %>% gather("Factor","Value",A:B) %>%
mutate(FactorZ = paste0("Z_",Factor)) %>%
group_by(Factor) %>%
mutate(ValueZ = (Value - mean(Value,na.rm = TRUE))/sd(Value,na.rm = TRUE))
# Now go wide to do some mutations (e.g. Z_Avg = (Z_A + Z_B)/2)
# This does not work
dfWide <- dfLong %>%
spread(Factor,Value) %>%
spread(FactorZ,ValueZ)%>%
mutate(Z_Avg = (Z_A+Z_B)/2)
# This is the desired result
dfDesired <- dfData %>% mutate(Z_A = (A - mean(A,na.rm = TRUE))/sd(A,na.rm = TRUE)) %>% mutate(Z_B = (B - mean(B,na.rm = TRUE))/sd(B,na.rm = TRUE)) %>%
mutate(Z_Avg = (Z_A+Z_B)/2)
Thanks for any help/input!
Another approach using dplyr (version 0.5.0)
library(dplyr)
dfData %>%
mutate_each(funs(Z = scale(.)), -Name) %>%
mutate(Z_Avg = (A_Z+B_Z)/2)
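mutate_each() and funs() were deprecated in later dplyr releases. Assuming dplyr 1.0.0 or newer, a roughly equivalent sketch uses across(); the .names argument is chosen here so the new columns are called Z_A and Z_B, as in the desired result, and dfZ is an illustrative name.

```r
library(dplyr)

dfData <- data.frame(
  Name = c("Steve", "Jwan", "Ashley"),
  A = c(10, 20, 12),
  B = c(0.2, 0.3, 0.5)
)

dfZ <- dfData %>%
  # scale() returns a one-column matrix, so convert it back to a plain vector.
  mutate(across(c(A, B), ~ as.numeric(scale(.x)), .names = "Z_{.col}")) %>%
  mutate(Z_Avg = (Z_A + Z_B) / 2)
```

This keeps the original A and B columns next to the Z-scores, so no reshaping between long and wide formats is needed.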
means <- function(x) mean(x, na.rm = TRUE)
dfWide %>% group_by(Name) %>% summarise_each(funs(means)) %>% mutate(Z_Avg = (Z_A + Z_B)/2)
# A tibble: 3 x 6
Name A B Z_A Z_B Z_Avg
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Ashley 12 0.5 -0.3779645 1.0910895 0.3565625
2 Jwan 20 0.3 1.1338934 -0.2182179 0.4578378
3 Steve 10 0.2 -0.7559289 -0.8728716 -0.8144003
Here is one approach with long and wide format. For z-transformation, you can use the base function scale. Furthermore, this approach includes a join to combine the original data frame and the one including the new values.
dfLong <- dfData %>%
gather(Factor, Value, A:B) %>%
group_by(Factor) %>%
mutate(ValueZ = scale(Value))
# Name Factor Value ValueZ
# <fctr> <chr> <dbl> <dbl>
# 1 Steve A 10.0 -0.7559289
# 2 Jwan A 20.0 1.1338934
# 3 Ashley A 12.0 -0.3779645
# 4 Steve B 0.2 -0.8728716
# 5 Jwan B 0.3 -0.2182179
# 6 Ashley B 0.5 1.0910895
dfWide <- dfData %>% inner_join(dfLong %>%
ungroup %>%
select(-Value) %>%
mutate(Factor = paste0("Z_", Factor)) %>%
spread(Factor, ValueZ) %>%
mutate(Z_Avg = (Z_A + Z_B) / 2))
# Name A B Z_A Z_B Z_Avg
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Steve 10 0.2 -0.7559289 -0.8728716 -0.8144003
# 2 Jwan 20 0.3 1.1338934 -0.2182179 0.4578378
# 3 Ashley 12 0.5 -0.3779645 1.0910895 0.3565625
I would just do it all in wide format. No need to keep switching between the long and wide formats.
dfData %>%
mutate(Z_A=(A-mean(A))/sd(A),
Z_B=(B-mean(B))/sd(B)) %>%
mutate(Z_AVG=(Z_A+Z_B)/2)

Resources