I am working with 3D motion-capture data. This means I have 3 columns (X,Y,Z) of joint coordinates for several joints in the body (e.g. the three columns describing the position of the left knee joint center are: LKX,LKY,LKZ).
My end goal is to plot at least 9 joint centers, and I believe the only way to achieve this is to transform my wide format dataframe into a long one.
As you can tell, I am trying to transform many sets of joint centers whose names end with X, Y, or Z. Therefore, I try to use regular expressions within tidyr::extract, but I just can't get the code right.
df_wide <- data.frame(
  ID = rep(1:2, each = 10),
  JN = rep(1:2, each = 5),
  Frame = rep(1:5, 4),
  System = rep(1:2, 10),
  RKX = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  RKY = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  RKZ = rep(1:10 + rnorm(10, mean = 1, sd = 0.5), 2),
  LHeX = rep(1:10 - rnorm(10, mean = 1, sd = 0.5), 2),
  LHeY = rep(1:10 - rnorm(10, mean = 1, sd = 0.5), 2),
  LHeZ = rep(1:10 - rnorm(10, mean = 1, sd = 0.5), 2))
head(df_wide, 2)
ID JN Frame System RKX RKY RKZ LHeX LHeY LHeZ
1 1 1 1 1 1.332827 2.068720 2.295742 -0.02336031 -0.3011227 -1.212326
2 1 1 2 2 3.570076 3.306799 3.136177 2.08828231 1.9226740 2.106496
I wish to obtain this result:
ID JN Frame System joint X Y Z
1 1 1 1 1 RK 1.440103 2.221676 1.621871
2 1 1 1 1 LHe 3.537940 3.060948 2.856955
Here is my latest (of many) attempts. It has two problems: 1) extract only produces NA; 2) spread returns "Error: Duplicate identifiers for rows". I suspect the second problem is caused by the first.
df_3D <- df_wide %>%
  gather(keys, values, -ID, -JN, -Frame, -System) %>%
  extract(keys, c("X", "Y", "Z", "joint"), "(X$) (Y$) (Z$) ([A-Z].$)") %>%
  spread(X, values)
I have found several good questions and answers regarding the transformation, but none of them specifically target the use of regular expressions.
Your approach is a little off. Once you've gathered, each element of the keys column has the structure <Joint><Coord>, so you want something like:
library(dplyr)
library(tidyr)

df_wide %>%
  gather(keys, values, -ID, -JN, -Frame, -System) %>%
  extract(keys, c("Joint", "Coord"), "(.*)(X|Y|Z)$") %>%
  spread(Coord, values)
The regex I've used here captures anything into the first group (since I don't know all the possible joint names), then X, Y, or Z as the final character into the second group. There are lots of other regexes that would achieve the same thing.
Output:
ID JN Frame System Joint X Y Z
1 1 1 1 1 LHe 0.1344259 -0.2927277 0.05375166
2 1 1 1 1 RK 1.8083539 2.4053498 2.32899399
3 1 1 2 2 LHe 1.1777492 1.1780538 0.96549849
4 1 1 2 2 RK 3.2254236 2.4100235 2.79816371
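As an aside, on tidyr >= 1.0 the whole gather/extract/spread chain collapses into a single pivot_longer() call with the same regex (a sketch, not part of the answer above):
library(tidyr)

df_wide %>%
  pivot_longer(-c(ID, JN, Frame, System),
               names_pattern = "(.*)(X|Y|Z)$",  # same regex as above
               names_to = c("Joint", ".value")) # ".value" spreads X/Y/Z back out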
You'll need to gather your data into a super long format, then split out the dimension, then spread THAT data back out into your X, Y, and Z columns:
library(dplyr)
library(tidyr)
library(stringr)

df2 <- df_wide %>%
  # leave the other columns alone
  gather(jointid, position, -ID, -JN, -Frame, -System) %>%
  # insert a separator to make it easier to split the X/Y/Z from the joint name;
  # "\\0" in the replacement refers to the whole match
  mutate(jointid = str_replace(jointid, "(X|Y|Z)$", ";\\0")) %>%
  # split the joint name and the dimension apart
  separate(jointid, c("joint", "dim"), sep = ";") %>%
  # spread the dimension back out into X, Y, and Z position columns
  spread(dim, position)
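A variant that skips the mutate() step: separate() accepts a regular expression as sep, so a zero-width lookahead can split the name directly (a sketch under the same setup):
df_wide %>%
  gather(jointid, position, -ID, -JN, -Frame, -System) %>%
  # split just before a final X, Y, or Z; the lookahead consumes nothing
  separate(jointid, c("joint", "dim"), sep = "(?=[XYZ]$)") %>%
  spread(dim, position)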
I'm having a hard time making rounded percentages that add up to 100% within groups.
Consider the following example:
# Loading main library used
library(dplyr)
# Creating the basic data frame
df = data.frame(group = c('A','A','A','A','B','B','B','B'),
                categories = c('Cat1','Cat2','Cat3','Cat4','Cat1','Cat2','Cat3','Cat4'),
                values = c(2200,4700,3000,2000,2900,4400,2200,1000))
print(df)
# group categories values
# 1 A Cat1 2200
# 2 A Cat2 4700
# 3 A Cat3 3000
# 4 A Cat4 2000
# 5 B Cat1 2900
# 6 B Cat2 4400
# 7 B Cat3 2200
# 8 B Cat4 1000
df_with_shares = df %>%
  # Calculating group totals and adding them back to the main df
  left_join(df %>%
              group_by(group) %>%
              summarize(group_total = sum(values)),
            by = 'group') %>%
  # Calculating each category's share within the groups
  mutate(group_share = values / group_total,
         group_share_rounded = round(group_share, 2))
# Summing the rounded shares within groups
rounded_totals = df_with_shares %>%
  group_by(group) %>%
  summarize(total_share = sum(group_share_rounded))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <dbl>
# 1 A 0.99
# 2 B 1.01
# Note how the totals do not add up to 100% as expected
I am aware of a few generic solutions to the "rounding percentages to add up to 100%" problem, as explained in this SO post. I was even able to make a little R implementation of one of those approaches, as seen here. This is what it would look like if I just applied that R approach to this problem:
library(tidyr) # for replace_na()

df_with_rounded_shares = df %>%
  mutate(
    percs = values / sum(values),
    percs_cumsum = cumsum(percs),
    percs_cumsum_round = round(percs_cumsum, 2),
    percs_cumsum_round_offset = replace_na(lag(percs_cumsum_round, 1), 0),
    percs_rounded_final = percs_cumsum_round - percs_cumsum_round_offset)
However, the method I devised in the thread above does not work as I would like here. It just calculates the shares of the values column across the whole dataset. In other words, it does not take into consideration the grouping variable representing the multiple groups in the data, each of which needs its rounded values to add up to 100% independently of every other group.
What can I do to generate a column of rounded percentages that add up to 100% by group?
PS: While writing this question I actually found something that worked, so I'll answer my own question below. I know it's super simple, but I think it's still worth having a direct answer here on SO addressing this issue.
The method devised in your implementation (from here) just needs a few small tweaks to make it work.
First, include a group_by statement before calculating the new columns. Also, you need to use a summarize statement instead of the mutate statement you have now.
In essence, this is what it'll look like:
# Modified version of your implementation of the rounding procedure.
# The new procedure below accommodates for grouping variables.
df_with_rounded_shares_by_group = df %>%
  group_by(group) %>%
  summarize(
    group_share = values / sum(values),
    group_share_cumsum = cumsum(group_share),
    group_share_cumsum_round = round(group_share_cumsum, 2),
    group_share_cumsum_round_offset = replace_na(lag(group_share_cumsum_round, 1), 0),
    group_share_rounded_final = group_share_cumsum_round - group_share_cumsum_round_offset) %>%
  # Removing unnecessary temporary columns
  select(-group_share_cumsum, -group_share_cumsum_round, -group_share_cumsum_round_offset)
# Verifying if the results add up to 100% within each group
rounded_totals = df_with_rounded_shares_by_group %>%
  group_by(group) %>%
  summarize(total_share = sum(group_share_rounded_final))
print(rounded_totals)
# # A tibble: 2 x 2
# group total_share
# <chr> <dbl>
# 1 A 1
# 2 B 1
# Yep, they all add up to 100% as expected!
Btw, apologies for the ridiculously long column names. I just made them enormous to make it clear what each step was really doing.
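One caveat about the summarize() call above: dplyr 1.1+ deprecates summarize() results with more than one row per group, with reframe() as the intended replacement. A sketch of the same computation under that assumption (dplyr >= 1.1):
library(dplyr)

df_with_rounded_shares_by_group = df %>%
  group_by(group) %>%
  reframe(
    categories = categories, # carry the category labels along
    group_share_rounded_final = {
      cs <- round(cumsum(values / sum(values)), 2)
      cs - lag(cs, default = 0) # difference of rounded cumsums, as above
    })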
I have a huge dataset with over 3 million obs and 108 columns. There are 14 variables I'm interested in: DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF (they're in different positions). These variables contain ICD-10 codes.
I'm interested in counting how many times certain ICD-10 codes appear, and then sorting them from highest to lowest in a dataframe. Here's some reproducible data:
data <- data.frame(DIAG_PRINC = c("O200", "O200", "O230"),
                   DIAG_SECUN = c("O555", "O530", "O890"),
                   DIAGSEC1 = c("O766", "O876", "O899"),
                   DIAGSEC2 = c("O200", "I520", "O200"),
                   DIAGSEC3 = c("O233", "O200", "O620"),
                   DIAGSEC4 = c("O060", "O061", "O622"),
                   DIAGSEC5 = c("O540", "O123", "O344"),
                   DIAGSEC6 = c("O876", "Y321", "S333"),
                   DIAGSEC7 = c("O450", "X900", "O541"),
                   DIAGSEC8 = c("O222", "O111", "O123"),
                   DIAGSEC9 = c("O987", "O123", "O622"),
                   CID_MORTE = c("O066", "O699", "O555"),
                   CID_ASSO = c("O600", "O060", "O068"),
                   CID_NOTIF = c("O069", "O066", "O065"))
I also have a list of ICD-10 codes that I'm interested in counting.
GRUPO1 <- c("O00", "O000", "O001", "O002", "O008", "O009",
"O01", "O010", "O011", "O019",
"O02", "O020", "O021", "O028", "O029",
"O03", "O030", "O031", "O032", "O033", "O034", "O035", "O036", "O037",
"O038", "O039",
"O04", "O040", "O041", "O042", "O043", "O044", "O045", "O046", "O047",
"O048", "O049",
"O05", "O050", "O051", "O052", "O053", "O054", "O055", "O056", "O057",
"O058", "O059",
"O06", "O060", "O061", "O062", "O063", "O064", "O065", "O066", "O067",
"O068", "O069",
"O07", "O070", "O071", "O072", "O073", "O074", "O075", "O076", "O077",
"O078", "O079",
"O08", "O080", "O081", "O082", "O083", "O084", "O085", "O086", "O087",
"O088", "O089")
What I need is a dataframe counting how many times the ICD-10 codes from GRUPO1 appear in any row/column of the DIAG_PRINC, DIAG_SECUN, DIAGSEC1:DIAGSEC9, CID_ASSO, CID_MORTE and CID_NOTIF variables. For example, in my reproducible data the ICD-10 code "O066" appears twice.
Thank you in advance!
We can unlist the data into a vector, use %in% to subset the values found in 'GRUPO1', and get the frequency count with table() in base R:
v1 <- unlist(data)
out <- table(v1[v1 %in% GRUPO1])
out[order(-out)]
O060 O066 O061 O065 O068 O069
2 2 1 1 1 1
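Since the question asks for a dataframe, the sorted table converts directly (responseName is an argument of as.data.frame.table):
# Convert the sorted counts to a data frame
as.data.frame(out[order(-out)], responseName = "n")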
Here is a tidyverse solution using tidyr and dplyr:
library(tidyverse)
pivot_longer(data, everything()) %>%
  filter(value %in% GRUPO1) %>%
  count(value)
Output
value n
<chr> <int>
1 O060 2
2 O061 1
3 O065 1
4 O066 2
5 O068 1
6 O069 1
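To also get the highest-to-lowest ordering the question asks for, count() can sort directly via its sort argument:
pivot_longer(data, everything()) %>%
  filter(value %in% GRUPO1) %>%
  count(value, sort = TRUE) # sort = TRUE orders by descending n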
Consider data created here:
data <- data.frame(ID = sample(10000, 100), not.imp1 = rnorm(100),
                   not.imp2 = rnorm(100), not.imp3 = rnorm(100))
# Note that not all IDs have the same number of digits
We have data for 100 IDs, where each individual has a unique ID number. Columns not.imp1:3 are only relevant to show the structure of the dataframe.
We want to add a leading zero to the first 95 ID numbers. I am trying to do this using dplyr pipes, but can't figure out how to add the zeros.
Here is how I subset the data that I want to add the zeros to:
library(dplyr)
data %>%
  select(ID) %>%
  slice(1:95)
I have tried several things like adding %>% mutate(paste0("0", .)) to the pipe, but haven't gotten anything to work. What is the best way to do this?
Using sprintf() to pad to 3 digits:
data %>% mutate(ID = sprintf("%03d", ID))
You can change %03d to change how many leading zeros to add. Eg. %05d will ensure all IDs are at least 5 digits long.
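If the padding really should apply only to the first 95 rows, as in the question, sprintf() can be combined with row_number() — a sketch; note that ID becomes character either way:
library(dplyr)

data %>%
  mutate(ID = ifelse(row_number() <= 95,
                     sprintf("%05d", ID), # pad the first 95 IDs to 5 digits
                     as.character(ID)))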
You can try this approach
data2 <- data %>%
  mutate(ID = ifelse(row_number() <= 95, paste0("0", ID), ID))
head(data2)
# ID not.imp1 not.imp2 not.imp3
# 1 09449 -1.4297317 -2.2210106 0.1923912
# 2 07423 1.9010681 1.0825734 -0.8855694
# 3 06283 0.2508254 -0.5307967 2.1645044
# 4 05593 -2.2451267 0.1281156 -1.8528800
# 5 09194 -0.1677409 -0.7422480 -0.4237452
# 6 07270 -0.2536918 1.2289698 1.0083092
tail(data2)
# ID not.imp1 not.imp2 not.imp3
# 95 06538 1.0071791 0.1596557 -0.7099883
# 96 4829 0.2444440 0.8869954 -1.2938356
# 97 2571 -1.1012023 0.8343393 -0.6264487
# 98 150 0.2116460 -0.2146265 -1.8281045
# 99 3107 -1.2379193 0.3491078 1.4531531
# 100 9953 -0.9326725 1.1146032 -1.5542687
str_pad() from the stringr package is helpful here. Note that it pads but never truncates, so IDs already at least width characters long are left unchanged:
library(dplyr)
library(stringr)

data <- data.frame(ID = sample(10000, 100), not.imp1 = rnorm(100),
                   not.imp2 = rnorm(100), not.imp3 = rnorm(100)) %>%
  mutate(ID = str_pad(string = ID, width = 4, side = "left", pad = "0"))
This would be a quick and easy way to do it in base R. I didn't use dplyr or piping, but you could merge this idea with the code you already tried if you want to. Note that the padding strings must be quoted; a bare 00 is the number zero and pastes as a single "0".
data[which(nchar(data[,"ID"]) == 3), "ID"] <- paste0("0", data[which(nchar(data[,"ID"]) == 3), "ID"])
data[which(nchar(data[,"ID"]) == 2), "ID"] <- paste0("00", data[which(nchar(data[,"ID"]) == 2), "ID"])
Hello coding community,
I have a two-part question that is half answered:
Transpose, aka melt, the data frame to my liking - done
Add rows of data based on results found in the "removed" column, a column created in the transposing step - stuck here
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment run over 8 days. On certain days I remove data points, and I am only interested in those days (hence the na.rm = TRUE in the transposing step). I sometimes remove 1 data point, or 4 (but this could be any number really).
I would like the removed data points to be called "individuals", and for them to be counted in chronological order. Therefore, I first need to add a column called "individual":
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the results in the "removed" column.
Example: cage 2 had only 1 data point removed, on day_8, so I would add a 1 in its "individual" column. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points, aka 4 "individuals". For cage 4, then, starting with day_5 I would add a 1 in the "individual" column, and for day_7 I would create 3 rows of data and continue my individual count with 2, 3, 4. If day_8 had 3 more data points removed, the individual count would continue with 5, 6, 7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.
We can use complete to create a sequence from 1 to each removed count, grouped by cage and day. We then fill the NA values in the experiment and removed columns.
library(dplyr)
library(tidyr)
df_transformed %>%
  mutate(individual = removed) %>%
  group_by(cage, day) %>%
  complete(individual = seq_len(individual)) %>%
  fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To update individual based only on cage, we can do:
df_transformed %>%
  mutate(individual = removed) %>%
  group_by(cage, day) %>%
  complete(individual = seq_len(individual)) %>%
  group_by(cage) %>%
  mutate(individual = row_number()) %>%
  fill(experiment, removed, .direction = "up")
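As a quick sanity check (using the fact noted in the question), the completed data should have exactly sum(df_transformed$removed) rows:
library(dplyr)
library(tidyr)

completed <- df_transformed %>%
  mutate(individual = removed) %>%
  group_by(cage, day) %>%
  complete(individual = seq_len(individual)) %>%
  fill(experiment, removed, .direction = "up")
nrow(completed) == sum(df_transformed$removed) # should be TRUE (68 rows)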
I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which enables cleaner syntax. I have also used the newer pivot_longer function instead of gather. Then, grouping by cage and summing the removed column with summarize gives how many individuals were removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE)) %>%
pull(individual) %>%
sum()
#> [1] 68
The result is slightly different from your desired result. I am not 100% sure your desired result is actually correct... From your question, I understand that cage 4 should have 4 individuals, but in your desired_results it appears 4 times with values 1, 2, 3 and 4. The code I sent you generates a data frame where each cage appears in a single row.
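If one row per removed individual is what's wanted (matching desired_results more closely), tidyr::uncount() can expand the per-day counts — a sketch, assuming the removed column holds whole-number counts:
library(tidyverse)

read.table("https://pastebin.com/raw/NEPcUG01", header = T, sep = "\t") %>%
  pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
  drop_na(removed) %>%
  # repeat each row 'removed' times, keeping the removed column for reference
  uncount(removed, .remove = FALSE) %>%
  group_by(cage) %>%
  mutate(individual = row_number()) %>% # chronological count within each cage
  ungroup()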
I am migrating analysis from Excel to R, and would like some input on how best to perform something similar to Excel's COUNTIFS in R.
I have two data.frames, statedf and memberdf.
statedf = data.frame(state = c('MD','MD','MD','NY','NY','NY'), week = 5:7)
memberdf = data.frame(memID = 1:15,
                      state = c('MD','MD','NY','NY','MD'),
                      finalweek = c(3,3,5,3,3,5,3,5,3,5,6,5,2,3,5),
                      orders = c(1,2,3))
This data is for a subscription-based business. I would like to know the number of members who newly lapsed for each week/state combo in statedf, where newly lapsed is defined by statedf$week - 1 == memberdf$finalweek. Further, I would like separate counts for each orders value (1, 2, 3).
The desired output would look like
out <- data.frame(state = c('MD','MD','MD','NY','NY','NY'), week = 5:7,
                  oneorder = c(0,1,0,0,0,0),
                  twoorder = c(0,0,1,0,1,0),
                  threeorder = c(0,3,0,0,1,0))
I asked (and got a great response to) a simpler version of this question yesterday - the answers revolved around creating a new data.frame based on memberdf. However, I need to append the data to statedf, because statedf has state/week combos that don't exist in memberdf, and vice versa. If this were in Excel, I'd use COUNTIFS, but I am struggling for a solution in R.
Thanks.
Here is a solution with the dplyr and tidyr packages:
library(tidyr) ; library(dplyr)
counts <- memberdf %>%
  mutate(lapsedweek = finalweek + 1) %>%
  group_by(state, lapsedweek, orders) %>%
  tally()

counts <- counts %>% spread(orders, n, fill = 0)

out <- left_join(statedf, counts, by = c("state", "week" = "lapsedweek"))
out[is.na(out)] <- 0 # replace the NAs from unmatched weeks with 0s
names(out)[3:5] <- paste0("order", names(out)[3:5]) # rename the count columns
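As a side note, on newer tidyr (>= 1.1 assumed) the spread/rename steps can be folded into count() plus pivot_wider(), which names and zero-fills the columns in one go — a sketch, with order1/order2/order3 as the resulting column names rather than oneorder etc.:
library(dplyr)
library(tidyr)

counts <- memberdf %>%
  mutate(lapsedweek = finalweek + 1) %>%
  count(state, lapsedweek, orders) %>%
  pivot_wider(names_from = orders, values_from = n,
              names_prefix = "order", values_fill = 0)
out <- left_join(statedf, counts, by = c("state", "week" = "lapsedweek"))
out[is.na(out)] <- 0 # weeks with no lapses at all still need zero-filling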
We could create a new variable ('week1') in the 'statedf' dataset, merge 'memberdf' with 'statedf', and then reshape from 'long' to 'wide' format with dcast. I changed the 'orders' column to match the column names in 'out'.
statedf$week1 <- statedf$week - 1
df1 <- merge(memberdf[-1], statedf, by.x = c('state', 'finalweek'),
             by.y = c('state', 'week1'), all.y = TRUE)

lvls <- paste0(c('one', 'two', 'three'), 'order')
df1$orders <- factor(lvls[df1$orders], levels = lvls)

library(reshape2)
# dcast counts occurrences; [-6] drops the NA column from unmatched rows
out1 <- dcast(df1, state + week ~ orders, value.var = 'orders', length)[-6]
out1
# state week oneorder twoorder threeorder
#1 MD 5 0 0 0
#2 MD 6 1 0 3
#3 MD 7 0 1 0
#4 NY 5 0 0 0
#5 NY 6 0 1 1
#6 NY 7 0 0 0
all.equal(out, out1)
#[1] TRUE
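For completeness, COUNTIFS translates almost literally into base R: mapply can evaluate the same three conditions for every state/week pair. A sketch using statedf and memberdf from above (the countifs helper is mine, not part of either answer):
# COUNTIFS analogue: count members matching a state, a lapse week, and an order value
countifs <- function(st, wk, ord) {
  sum(memberdf$state == st & memberdf$finalweek == wk - 1 & memberdf$orders == ord)
}
out2 <- statedf
out2$oneorder   <- mapply(countifs, statedf$state, statedf$week, 1)
out2$twoorder   <- mapply(countifs, statedf$state, statedf$week, 2)
out2$threeorder <- mapply(countifs, statedf$state, statedf$week, 3)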