Pivot_longer on integer and factor - r

I have a dataset that looks like the following.
# A tibble: 1 x 4
hhm1q001 hhm2q001 hhm1q002 hhm2q002
<chr> <chr> <int> <int>
1 blue red 30 50
I have been trying to transform it to long format using tidyr::pivot_longer.
My expected output looks like this:
hhm q001 q002
<int> <chr> <int>
1 1 blue 30
2 2 red 50
I have tried the following code
HHS_long <- pivot_longer(HHS_all,
cols= starts_with("hhm"), #identifies the column from which to go from wide to long
names_to = ("hhm"), #name(s) of new column(s) created from cols=
values_drop_na = FALSE
)
head(HHS_long)
Unfortunately, I get the following error:
Error: No common type for hhm1q101 <factor<b5064>> and hhm1q102 <integer>.
I am not sure how to get around this. I understand the columns are not the same class, but the dataset has quite a lot of variables, and they are definitely of different classes.
I hope this is the correct format for posting.
Thanks for any help.

I struggle a little bit when I use the new pivot_longer. Sometimes, I feel that renaming variable names before pivot_longer could make it much easier:
library(tidyverse)
HHS_all <- data.frame(hhm1q001 = "blue", hhm2q001 = "red", hhm1q002 = 30, hhm2q002 = 50)
df <- HHS_all %>%
rename(q001hhm1 = hhm1q001, q001hhm2 = hhm2q001,
q002hhm1 = hhm1q002, q002hhm2 = hhm2q002)
df %>%
pivot_longer(everything(), names_to = c(".value", "hhm"), names_sep = "hhm")
# A tibble: 2 x 3
hhm q001 q002
<chr> <fct> <dbl>
1 1 blue 30
2 2 red 50
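For what it's worth, the renaming step can probably be skipped: pivot_longer also accepts a names_pattern regular expression, so the household-member number and the question code can be pulled out of the original column names directly. A minimal sketch, assuming every column follows the hhm<digits>q<digits> naming from the question and that the text column is character rather than factor (the regular expression and the names_transform call are my own additions, not part of the answer above):

```r
library(tidyr)

# Toy data shaped like the question's example (character + integer columns)
HHS_all <- data.frame(hhm1q001 = "blue", hhm2q001 = "red",
                      hhm1q002 = 30L, hhm2q002 = 50L,
                      stringsAsFactors = FALSE)

long <- pivot_longer(
  HHS_all,
  cols = everything(),
  # first capture group -> the hhm column, second -> one output column per question
  names_to = c("hhm", ".value"),
  names_pattern = "hhm(\\d+)(q\\d+)",
  # convert the extracted member number from character to integer
  names_transform = list(hhm = as.integer)
)
long
```

Because ".value" sends each question to its own output column, columns of different types are never stacked together, which is what triggers the "No common type" error in the original attempt.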

I'm not familiar with the pivot_longer of tidyr and whether it would be able to do so. Using a combination of tidyr::separate and the melt and cast verbs from reshape2 I was able to produce your expected output:
df <- data.frame(hhm1q001 = "blue", hhm2q001 = "red", hhm1q002 = 30, hhm2q002= 50 )
df %>%
mutate(id = row_number()) %>%
reshape2::melt(id.vars = "id") %>%
tidyr::separate(variable, into = c("hhm", "q00"), sep = 4) %>%
tidyr::separate(hhm, into = c("prefix", "hhm"), sep = 3) %>%
select(-prefix, -id) %>%
reshape2::dcast(hhm ~ q00)

Related

Using pivot_longer from tidyr to create a long format data with one variable nested in another variable

This is the edited version of the question.
I need help to convert my wide data to long format data using the pivot_longer() function in R. The main problem is wanting to create long data with a variable nested in another variable.
For example, if I have wide data like this, where
fu1 and fu2 hold the follow-up times (in days) for the two follow-up events
cpass and is hold the results of two tests at each follow-up
IDno <- c(1,2)
Sex <- c("M","F")
fu1 <- c(13,15)
fu2 <- c(20,18)
cpass1 <- c(27, 85)
cpass2 <- c(33, 90)
is1 <- c(201, 400)
is2 <- c(220, 430)
mydata <- data.frame(IDno, Sex,
fu1, cpass1, is1,
fu2, cpass2, is2)
mydata
which looks like this
And now, I want to convert it to long format data, and it should look like this:
I have tried the code below, but it does not produce the data frame in the format that I want:
#renaming variables
mydata_wide <- mydata %>%
rename(fu1_day = fu1,
cp_one = cpass1,
is_one = is1,
fu2_day = fu2,
cp_two = cpass2,
is_two = is2)
#pivoting
mydata_wide %>%
pivot_longer(
cols = c(fu1_day, fu2_day),
names_to = c("fu", ".value"),
values_to = "day",
names_sep = "_") %>%
pivot_longer(
cols = c("cp_one", "is_one", "cp_two", "is_two"),
names_to = c("test", ".value"),
values_to = "value",
names_sep = "_")
The data frame, unfortunately, looks like this:
I have looked at some tutorials but have not found the best solution for this problem. Any help is very much appreciated.
library(tidyverse)
mydata %>% # the "nested" pivoting must be done within two calls
pivot_longer(cols=c(fu1,fu2),names_to = 'fu', values_to = 'day') %>%
pivot_longer(cols=c(starts_with('cpass'), starts_with('is')),
names_to = 'test', values_to = 'value') %>%
# with this filter check not mixing the tests and the follow-ups
filter(str_extract(fu,"\\d") == str_extract(test,"\\d")) %>%
mutate(test = gsub("\\d","",test)) # remove numbers in strings
Output:
# A tibble: 8 × 6
IDno Sex fu day test value
<dbl> <chr> <chr> <dbl> <chr> <dbl>
1 1 M fu1 13 cpass 27
2 1 M fu1 13 is 201
3 1 M fu2 20 cpass 33
4 1 M fu2 20 is 220
5 2 F fu1 15 cpass 85
6 2 F fu1 15 is 400
7 2 F fu2 18 cpass 90
8 2 F fu2 18 is 430
I'm not sure whether your example is your real expected output, because the first dataset and the output example that you describe do not show the same information.
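As a possible alternative to the filter step, the trailing digit can be split off the column names first, so tests and follow-ups are never cross-combined. A rough sketch on the question's original mydata, assuming every measurement column is letters followed by a follow-up number (the visit column name and the regular expression are my own choices, not from the question):

```r
library(dplyr)
library(tidyr)

mydata <- data.frame(
  IDno = c(1, 2), Sex = c("M", "F"),
  fu1 = c(13, 15), cpass1 = c(27, 85), is1 = c(201, 400),
  fu2 = c(20, 18), cpass2 = c(33, 90), is2 = c(220, 430)
)

long <- mydata %>%
  # one row per person and follow-up; fu, cpass and is stay as separate columns
  pivot_longer(
    cols = -c(IDno, Sex),
    names_to = c(".value", "visit"),
    names_pattern = "([a-z]+)(\\d+)"
  ) %>%
  rename(day = fu) %>%
  # then stack the two test columns within each follow-up
  pivot_longer(cols = all_of(c("cpass", "is")),
               names_to = "test", values_to = "value")
long
```

This should give the same eight rows as the output above, with the follow-up number in visit instead of fu.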
I took inspiration from a similar post, "How to reshape Panel / Longitudinal survey data from wide to long format using pivot_longer", and from the solution provided by RobertoT, and put together this code:
STEP 1: Generate wide data for simulation
IDno <- c(1,2)
Sex <- c("M","F")
fu1_day <- c(13,15)
fu2_day <- c(20,18)
fu1_cpass <- c(27, 85)
fu2_cpass <- c(33, 90)
fu1_is <- c(201, 400)
fu2_is <- c(220, 430)
mydata_wide <- data.frame(IDno, Sex,
fu1_day, fu1_cpass, fu1_is,
fu2_day, fu2_cpass, fu2_is)
mydata_wide
STEP 2: CONVERT THE DAY COLUMNS TO LONG DATA (out1)
out1 <- mydata_wide %>%
select(IDno, contains("day")) %>%
pivot_longer(cols = c(fu1_day, fu2_day),
names_to = c('fu', '.value'),
names_sep="_")
out1
STEP 3: CREATE ANOTHER LONG DATA AND JOIN WITH out1
mydata_wide %>%
select(-contains('day')) %>%
pivot_longer(cols = -c(IDno, Sex),
names_to = c('fu', 'test'),
names_sep="_") %>%
left_join(out1, by = c("IDno", "fu")) # state the join keys explicitly
The result looks like this

convert values in a column consists of text map into integers and aggregate them

I have a dataframe as following:
data.frame("id" = 1:2, "tag" = c("a,b,c","a,d"))
id tag
1 a,b,c
2 a,d
In tag, a and b count as "lan", while c and d count as "con". I want to count the number of lan and con values in each row, in separate columns.
I want to create two columns that aggregate the tags, like the following:
id tag lan_count con_count
1 a,b,c 2 1
2 a,d 1 1
Could you please advise me on how to do this?
You can also use the following code:
library(dplyr)
library(tidyr)
df <- data.frame("id" = 1:2, "tag" = c("a,b,c","a,d"))
df %>%
separate_rows(tag, sep = ",") %>%
group_by(id) %>%
add_count(tag) %>%
pivot_wider(id_cols = id, names_from = tag, values_from = n) %>% # name id_cols explicitly; recent tidyr does not accept it positionally
rowwise() %>%
mutate(lan_count = sum(c_across(a:b), na.rm = TRUE),
con_count = sum(c_across(c:d), na.rm = TRUE)) %>%
select(-c(a:d))
# A tibble: 2 x 3
# Rowwise: id
id lan_count con_count
<int> <int> <int>
1 1 2 1
2 2 1 1
The main issue here is that your data is untidy. So my solution is in two parts: first, tidy the data and then summarise it. Once the data is tidy, the summary is trivial.
library(tidyverse)
# Adjust to suit your real data
maxCols <- 10
d <- data.frame(id = 1:2, tag = c("a,b,c","a,d"))
d %>%
separate(
tag,
sep=",",
into=paste0("Element", 1:maxCols),
extra="drop",
fill="right",
remove=FALSE
) %>%
pivot_longer(
cols=starts_with("Element"),
values_to="Value",
names_prefix="Element"
) %>%
select(-name) %>%
# Remove unused Values
filter(!is.na(Value)) %>%
# At this point the data frame is tidy
group_by(tag) %>%
# Translate tags into "categories". Add more if required. or write a function
mutate(
lan=Value %in% c("a", "b"),
con=Value %in% c("c", "d")
) %>%
# Adjust the column specification if more categories are added.
# Or use a factor instead of binary indicators
summarise(across(lan:con, sum))
# A tibble: 2 x 3
tag lan con
* <fct> <int> <int>
1 a,b,c 2 1
2 a,d 1 1
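The comment above about translating tags into categories could also be handled with a small lookup vector, which avoids listing columns by hand when more tags are added. A minimal sketch, assuming the a/b to lan and c/d to con mapping from the question (tag_groups and tag_item are names I made up for illustration):

```r
library(dplyr)
library(tidyr)

df <- data.frame(id = 1:2, tag = c("a,b,c", "a,d"), stringsAsFactors = FALSE)

# Hypothetical lookup: tag -> category. Extend this vector for more tags.
tag_groups <- c(a = "lan", b = "lan", c = "con", d = "con")

counts <- df %>%
  mutate(tag_item = tag) %>%                  # keep the original tag string
  separate_rows(tag_item, sep = ",") %>%      # one row per individual tag
  mutate(group = unname(tag_groups[tag_item])) %>%
  group_by(id, tag) %>%
  summarise(lan_count = sum(group == "lan", na.rm = TRUE),
            con_count = sum(group == "con", na.rm = TRUE),
            .groups = "drop")
counts
```

Tags that are missing from the lookup get an NA category and are ignored by both counts because of na.rm = TRUE.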

String with values mapped from other data frame in R

I would like to make a string basing on ids from other columns where the real value sits in a dictionary.
Ideally, this would look like:
library(tidyverse)
region_dict <- tibble(
id = c("reg_id1", "reg_id2", "reg_id3"),
name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
id = c("col_id1", "col_id2", "col_id3"),
name = c("col_1", "col_2", "col_3")
)
tibble(
region = c("reg_id1", "reg_id2", "reg_id3"),
color = c("col_id1", "col_id2", "col_id3"),
my_string = str_c(
"xxx_",
region_name,
"_",
color_name
))
#> # A tibble: 3 x 3
#> region color my_string
#> <chr> <chr> <chr>
#> 1 reg_id1 col_id1 xxx_reg_1_col_1
#> 2 reg_id2 col_id2 xxx_reg_2_col_2
#> 3 reg_id3 col_id3 xxx_reg_3_col_3
Created on 2021-03-01 by the reprex package (v0.3.0)
I know of dplyr's recode() function but I can't think of a way to use it the way I want.
I also thought about first using left_join() and then concatenating the string from the new columns. This is what would work but doesn't seem pretty to me as I would get columns that I'd need to remove later. In the real dataset I have 5 variables.
I'll be glad to read your ideas.
This could also be solved with a fuzzyjoin, but because the ids in the two dictionaries share the same suffix, it makes sense to strip the prefix from the 'id' column of each dictionary, do a left_join on the remainder, and then create 'my_string' by pasting the columns together:
library(stringr)
library(dplyr)
region_dict %>%
mutate(id1 = str_remove(id, '.*_')) %>%
left_join(color_dict %>%
mutate(id1 = str_remove(id, '.*_')), by = 'id1') %>%
transmute(region = id.x, color = id.y,
my_string = str_c('xxx_', name.x, '_', name.y))
Output:
# A tibble: 3 x 3
# region color my_string
# <chr> <chr> <chr>
#1 reg_id1 col_id1 xxx_reg_1_col_1
#2 reg_id2 col_id2 xxx_reg_2_col_2
#3 reg_id3 col_id3 xxx_reg_3_col_3
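If the extra join columns are the main concern, a named-vector lookup might be enough, since it translates the ids inline without adding any columns. A small sketch, assuming every id in the data appears in its dictionary (region_lookup and color_lookup are hypothetical helper names, not from the question):

```r
library(dplyr)
library(stringr)

region_dict <- tibble(
  id = c("reg_id1", "reg_id2", "reg_id3"),
  name = c("reg_1", "reg_2", "reg_3")
)
color_dict <- tibble(
  id = c("col_id1", "col_id2", "col_id3"),
  name = c("col_1", "col_2", "col_3")
)

# Named vectors: the names are the ids, the values are the real names
region_lookup <- setNames(region_dict$name, region_dict$id)
color_lookup <- setNames(color_dict$name, color_dict$id)

result <- tibble(
  region = c("reg_id1", "reg_id2", "reg_id3"),
  color = c("col_id1", "col_id2", "col_id3")
) %>%
  mutate(my_string = str_c("xxx_",
                           unname(region_lookup[region]), "_",
                           unname(color_lookup[color])))
result
```

With five variables this needs one lookup vector per dictionary, but nothing has to be dropped afterwards. An id that is missing from its dictionary yields NA for the whole string.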

Indicate which corresponding columns have a TRUE indicator

I have the following dataset:
df<-data.frame(
identifer=c(1,2,3,4),
DF=c("Tablet","Powder","Suspension","System"),
DF_source1=c("Capsule","Powder,Metered","Tablet",NA),
DF_source2=c(NA,NA,"Tablet",NA),
DF_source3=c("Tablet, Extended Release","Liquid","Tablet",NA),
Route_source1=c("Oral","INHALATION","Oral",NA),
Route_source2=c(NA,"TOPICAL","Oral",NA),
Route_source3=c("Oral","IRRIGATION","oral",NA))
I want to know which DF_source matches DF, and additionally which associated Route I should take.
I want the output to look like this:
df_out<-data.frame(
identifer=c(1,2,3,4),
DF=c("Tablet","Powder","Suspension","System"),
DF_match=c("Tablet, Extended Release","Powder,Metered;Powder",NA,NA),
Route_match=c("Oral","INHALATION;TOPICAL",NA,NA),
DF_match_count=c(1,2,0,0),
DF_route_count=c(1,2,0,0))
I tried this, but I'm not sure how to pull the values for DF_match and Route_match:
df%>%mutate_at(vars(matches("(DF_source)")),
list(string_detect = ~str_detect(tolower(DF),tolower(str_replace_all(.,"/|,(\\s)?|(?<!,)\\s","|")))))
Any help would be appreciated, thanks!
I'm not entirely sure this is what you have in mind, but hope this might help.
Your end result appears not to match your example data (e.g. TOPICAL is missing).
This might be easier in a tidier form with pivot_longer.
Edit: If columns are factors, convert to character for str_detect in filter.
library(tidyverse)
library(stringr)
df %>%
mutate_if(is.factor, as.character) %>%
pivot_longer(cols = -c(identifer, DF), names_to = c(".value", "number"), names_pattern = "(\\w+)(\\d+)") %>%
filter(str_detect(DF_source, DF)) %>%
group_by(identifer) %>%
summarise(DF_match = paste(DF_source, collapse = ';'),
Route_match = paste(Route_source, collapse = ';'),
match_count = n()) %>%
right_join(df[,c("identifer", "DF")], by = "identifer") %>%
select(c(identifer, DF, DF_match, Route_match, match_count))
Output
# A tibble: 4 x 5
identifer DF DF_match Route_match match_count
<dbl> <chr> <chr> <chr> <int>
1 1 Tablet Tablet, Extended Release Oral 1
2 2 Powder Powder,Metered;Powder INHALATION;TOPICAL 2
3 3 Suspension NA NA NA
4 4 System NA NA NA

summarise function in R

I am trying to create an R database that includes some numerical variables.
While doing this, I made a typing mistake whose result looks weird to me, and I would like to understand why (I am surely missing something here).
I have looked around for a possible explanation but haven't found what I am looking for.
library("dplyr")
library("tidyr")
data <-
data.frame(FS = c(1), FS_name = c("Armenia"), Year = c(2015), class =
c("class190"), area_1000ha = c(66.447)) %>%
mutate(FS_name = as.character(FS_name)) %>%
mutate(Year = as.integer(Year)) %>%
mutate(class = as.character(class)) %>%
as_tibble() # tbl_df() is deprecated in current dplyr
data
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = TRUE)) %>%
ungroup()
As you can see, the mistake is
rm.na=
rather than
na.rm=
When I type it correctly, I get the right result for the area_1000ha variable (10.5).
If I don't, i.e. if I keep rm.na=, I get 11.5 instead (+1, in fact).
What am I missing?
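I think rm.na = TRUE is being added to the sum: sum() has no rm.na argument, so the value is passed along with the other values to be summed, and because TRUE is treated as 1, you get your initial sum plus 1.
If you change TRUE to 2, for example: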
x <- data %>%
group_by(FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = 2)) %>%
ungroup()
The result is
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 12.5
sum() has no argument named rm.na, so R treats it as just another value to be summed, and its value TRUE is coerced to 1.
Keep it as na.rm = TRUE and you will get the right result.
The same happens if you change the name of the argument:
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, tester = TRUE)) %>%
ungroup()
Here I have replaced rm.na with tester:
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 11.5
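The behaviour described in both answers can be checked in base R without dplyr. A short sketch using 10.5, the sum reported in the question (the variable names are only for illustration):

```r
# na.rm is a real argument of sum(): NA values are dropped before summing
correct <- sum(c(10.5, NA), na.rm = TRUE)    # 10.5

# rm.na is not an argument, so it falls into `...` and is summed as a value
with_typo <- sum(10.5, rm.na = TRUE)         # 10.5 + 1 = 11.5
with_two <- sum(10.5, rm.na = 2)             # 10.5 + 2 = 12.5

# The typo also means missing values are NOT removed
typo_with_na <- sum(c(10.5, NA), rm.na = TRUE)  # NA

print(c(correct, with_typo, with_two, typo_with_na))
```

The last line is the more dangerous consequence of the typo: besides adding 1, it silently leaves NA handling switched off.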
