I have a very large df with a column that contains the file directory for each row's data.
Example: D:Mouse_2174/experiment/13/trialsummary.txt.1
I would like to create 2 new columns, one with only the mouse ID (2174) and one with the session number (13). There will be different IDs and session numbers based on the row.
I've used sub as recommended here (match part of names in data.frame to new column), but I can only get the subject column down to "D:Mouse_2174". With an additional line I can get it down to "D:Mous2174".
Is there a way to eliminate all chars before _ and after / to obtain mouse ID?
For session number, I'm not quite as sure what to do with multiple / in the directory name.
percent_correct_list$mouse_id <- sub("/.+", "", percent_correct_list$rn)
#gives me D:Mouse_2174
percent_correct_list$mouse_id <- sub("+._", "", percent_correct_list$mouse_id)
#gives me D:Mous2174
Here is sample code for the directories:
df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)
What I want:
rn    id    session
D:..  2174  9
D:..  2181  33
D:..  2183  107
D:..  2185  87
Maybe there's some way to do this earlier along in the process too (like when I import all the data into a df using lapply - but this is good as well)
This certainly isn't an elegant solution, and it only works if your ID and session are always numbers.
library(dplyr)
library(tidyr)

df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")) %>%
# Extract all numeric values from the string
mutate(allnums = regmatches(rn, gregexpr("[[:digit:]]+", rn))) %>%
# Separate them
separate(allnums, into = c("id", "session", "idk"), sep = "\\,") %>%
# Extract them individually
mutate(id = as.numeric(regmatches(id, gregexpr("[[:digit:]]+", id))),
session = as.numeric(regmatches(session, gregexpr("[[:digit:]]+", session)))) %>%
select(-idk)
Output:
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
Here's a somewhat long-winded solution, using tidyr::separate. Perhaps there is something more concise/elegant.
It does assume that all values of rn take the same format.
library(dplyr)
library(tidyr)
new_df <- df %>%
# separate on / into 4 new columns
separate(rn, into = c(paste0("item", 1:4)), sep = "/", remove = FALSE) %>%
# remove unwanted columns
select(-item2, -item4) %>%
# separate again on _ into 2 new columns
separate(item1, sep = "_", into = c("prefix", "id")) %>%
# retain and rename desired columns
select(rn, id, session = item3)
Result:
rn id session
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
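Since every path in the sample follows the same D:Mouse_<id>/<folder>/<session>/... layout, both values can also be pulled out with one sub() call each and a capture group. This is only a sketch in base R under that assumption. The column names id and session follow the desired output in the question, and a path that deviates from the layout would be returned unchanged by sub().

```r
df <- data.frame(
  rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
         "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
         "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
         "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)

# Keep only the digits between the first "_" and the first "/"
df$id <- sub("^[^_]*_([0-9]+)/.*$", "\\1", df$rn)

# Keep only the digits in the third "/"-separated field
df$session <- sub("^[^/]*/[^/]*/([0-9]+)/.*$", "\\1", df$rn)
```

Both new columns are character. Wrap the calls in as.integer() if numeric values are needed.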
I've created a dynamic column name with dplyr::mutate() based on this thread (Use dynamic variable names in `dplyr`), and now I want to sort by the new column, but I'm not passing the column name correctly.
library(glue)
library(dplyr)
# data
set.seed(123)
df <- data.frame(distance = sample(1:100, size = 10))
# custom function
multiply_function <- function(df, metric, multiplier){
df %>%
mutate(., "{{metric}}_x{{multiplier}}" := {{metric}} * multiplier) %>%
arrange(desc("{{metric}}_x{{multiplier}}")) # <--- this is not working
}
df %>%
multiply_function(., metric = distance, multiplier = 3)
distance distance_x3
1 31 93
2 79 237
3 51 153
4 14 42
5 67 201
6 42 126
7 50 150
8 43 129
9 97 291
10 25 75
Unfortunately I don't know of any way to use that nice glue syntax with anything that's not on the left side of a :=. That's where the magic happens. You can get something to work if you take care of the explicit conversion to a symbol yourself and do the string building manually. It's not pretty, but this works:
multiply_function <- function(df, metric, multiplier){
metric <- ensym(metric)
newname <- glue::glue("{rlang::as_string(metric)}_x{as.character(multiplier)}")
df %>%
mutate("{newname}" := !!metric * multiplier) %>%
arrange(desc(.data[[newname]]))
}
I'm not sure the arranging step belongs inside the function if the function is itself part of a pipe. This approach may also break if the function creates several variables, because setdiff() would then return more than one name. For one new variable:
multiply_function <- function(df, metric, multiplier){
df %>%
mutate("{{metric}}_x{{multiplier}}" := {{metric}} * multiplier) %>%
arrange(desc(!!rlang::sym(setdiff(names(.), names(df)))))
}
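Another way to avoid looking the new column up after the fact is to build its name as an ordinary string first and reuse that string in both mutate() and arrange(). The sketch below is one possible variant, not the method from either answer above: newname is a helper variable introduced here, and rlang::as_label() together with rlang::enquo() turns the unquoted metric argument into text.

```r
library(dplyr)

multiply_function <- function(df, metric, multiplier) {
  # Build the new column name once, e.g. "distance_x3"
  newname <- paste0(rlang::as_label(rlang::enquo(metric)), "_x", multiplier)
  df %>%
    # !!newname := ... injects the string as the column name
    mutate(!!newname := {{ metric }} * multiplier) %>%
    # .data[[newname]] looks the column up by that same string
    arrange(desc(.data[[newname]]))
}

out <- multiply_function(data.frame(distance = c(3, 1, 2)),
                         metric = distance, multiplier = 3)
```

Because the name is a plain string, the same lookup keeps working if the function later creates more than one column.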
Consider data created here:
data <- data.frame(ID = sample(10000,100), not.imp1 = rnorm(100), not.imp2 = rnorm(100), not.imp3 = rnorm(100))
#Note that not all IDs are the same length
We have data for 100 IDs, where each individual has a unique ID number. Columns not.imp1:3 are only relevant to show the structure of the dataframe.
We want to add a leading zero to the first 95 ID numbers. I am trying to do this using dplyr pipes, but can't figure out how to add the zeros.
Here is how I subset the data that I want to add the zeros to:
library(dplyr)
data%>%
select(ID)%>%
slice(1:95)
I have tried several things, like adding %>%mutate(paste0("0",.)) to the pipe, but haven't gotten anything to work. What is the best way to do this?
Using sprintf() to pad to 3 digits:
data %>% mutate(ID = sprintf("%03d", ID))
The number in the format string sets the minimum total width, not the number of zeros added. E.g. %05d ensures all IDs are at least 5 digits long, padding shorter ones with leading zeros.
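To make the width behaviour concrete, here is a small sketch with made-up IDs (the values are illustrative, not taken from the question's random data). Shorter IDs receive more zeros, and IDs that are already wide enough are left alone.

```r
ids <- c(7L, 150L, 9953L)

# Minimum width 3: only the one-digit ID is padded
padded3 <- sprintf("%03d", ids)

# Minimum width 5: every ID is padded to five characters
padded5 <- sprintf("%05d", ids)
```

Note that the result is a character vector, so the ID column stops being numeric after padding.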
You can try this approach
data2 <- data %>%
mutate(ID = ifelse(row_number()<= 95, paste0("0", ID), ID))
head(data2)
# ID not.imp1 not.imp2 not.imp3
# 1 09449 -1.4297317 -2.2210106 0.1923912
# 2 07423 1.9010681 1.0825734 -0.8855694
# 3 06283 0.2508254 -0.5307967 2.1645044
# 4 05593 -2.2451267 0.1281156 -1.8528800
# 5 09194 -0.1677409 -0.7422480 -0.4237452
# 6 07270 -0.2536918 1.2289698 1.0083092
tail(data2)
# ID not.imp1 not.imp2 not.imp3
# 95 06538 1.0071791 0.1596557 -0.7099883
# 96 4829 0.2444440 0.8869954 -1.2938356
# 97 2571 -1.1012023 0.8343393 -0.6264487
# 98 150 0.2116460 -0.2146265 -1.8281045
# 99 3107 -1.2379193 0.3491078 1.4531531
# 100 9953 -0.9326725 1.1146032 -1.5542687
str_pad from stringr is helpful here:
library(dplyr)
library(stringr)
data <- data.frame(ID = sample(10000,100), not.imp1 = rnorm(100), not.imp2 = rnorm(100), not.imp3 = rnorm(100)) %>%
mutate(ID = str_pad(string = ID, width = 4, side = "left", pad = "0"))
This would be a quick and easy way to do it. I didn't use dplyr or piping, but you could merge this idea with the code you already tried if you want to.
# pad 3-character IDs with one zero
data[which(nchar(data[,"ID"])==3),"ID"]<-paste0("0",data[which(nchar(data[,"ID"])==3),"ID"])
# pad 2-character IDs with two zeros ("00" must be quoted; a bare 00 is just the number 0)
data[which(nchar(data[,"ID"])==2),"ID"]<-paste0("00",data[which(nchar(data[,"ID"])==2),"ID"])
I have scraped some data from a URL to analyse cycling results. Unfortunately, the name column contains both the rider's name and the team name in one field. I would like to separate the two. Here's the code (the last part doesn't work):
#get url
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020%>%
html_nodes("td")%>%
html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = F)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!
I think your request is wrongly formulated. You want to remove team from name.
That's how you should do it in my opinion:
results_stradebianchi_2020 %>%
mutate(name = stringr::str_remove(name, team))
Write this instead of your line with separate.
In this case separate is not an optimal solution for you because the separation character is not clearly defined.
Also, I would advise you to remove the initial blanks from name with stringr::str_trim(name)
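One caveat with str_remove(name, team) is that team is interpreted as a regular expression, so a team name containing characters such as ( or + could fail to match or raise an error. A hedged variant wraps the pattern in stringr::fixed() so that it is matched literally. The two-row data frame below is made up for illustration and only mimics the shape of the scraped table; the second team name is hypothetical.

```r
library(stringr)

# Made-up rows: the name field ends with the team name
results <- data.frame(
  name = c("van Aert Wout Team Jumbo-Visma", "Rider Two Example Team (A+)"),
  team = c("Team Jumbo-Visma", "Example Team (A+)"),
  stringsAsFactors = FALSE
)

# Remove the literal team text row by row, then trim leftover whitespace
results$name <- str_trim(str_remove(results$name, fixed(results$team)))
```

str_remove() is vectorised over both arguments, so each name is matched against the team in the same row.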
You could do this in base R with gsub and replace in the name column the pattern of team column with "", i.e. nothing. We use apply() with MARGIN=1 to go through the data frame row by row. Finally we use trimws to clean from whitespace (where we change to whitespace="[\\h\\v]" for better matching the spaces).
res <- transform(results_stradebianchi_2020,
name=trimws(apply(results_stradebianchi_2020, 1, function(x)
gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59
I am currently working with clinical assessment data that is scored and output by a software package in a .txt file. My goal is to extract the data from the txt file into a long-format data frame with columns for participant # (which is included in the file name), subtest, Score, and T-score.
An example data file is available here:
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
I am running into a couple of roadblocks and could use some input on how to navigate them.
1) I only need the information that corresponds to each subtest; these rows all have a number before the subtest name. The rows that contain only one or two words (e.g. cognitive screen) are not necessary, and they seem to interfere with creating new data frames because the columns provided do not match the columns wanted.
Some additional quirks of the data:
1) the asterisks are NOT necessary
2) the cognitive TOTAL will never have a value
I am using the readtext package to import the data at the moment, and I am able to get a data frame with two columns. One is the file name (which includes the participant name), so that problem is solved. However, the other column is a giant character string containing the data points for both Score and T-Score. Presumably I would then need to split this into the columns of interest listed above.
The next problem: when I view the data, the T-scores are in the correct order, but the "score" data no longer matches the true values.
Here is what I have tried:
# install.packages("readtext")
library(readtext)
library(tidyr)
pathTofile <- path.expand("/Users/Brahma/Desktop/CAT TEXT FILES/")
data <- readtext(paste0(pathTofile, "CAToutput.txt"),
#docvarsfrom = "filenames",
dvsep = " ")
From here I do not know how to split the data. In my head I would do something like this:
data2 <- separate(data, text, sep = " ", into = c("subtest", "score", "t_score"))
This, of course, gives the correct column names but removes almost all the data I am actually interested in.
Any help would be appreciated whether a solution or a direction you might suggest I look for more answers.
Sincerely,
Alex
Here is a way of converting that text file to a dataframe that you can do analysis on
library(tidyverse)
input <- read_lines('c:/temp/scores.txt')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
mutate(row = row_number()) %>%
fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
table <- left_join(scores,
header,
by = 'row'
)
colnames(table) <- c('index', 'type', 'Score', 'T-Score', 'row', 'title')
head(table, 10)
# A tibble: 10 x 6
index type Score `T-Score` row title
<chr> <chr> <chr> <chr> <int> <chr>
1 "1. " Line Bisection 9 53 3 Subtest/Section
2 "2. " Semantic Memory 8 51 4 Subtest/Section
3 "3. " Word Fluency 1 56* 5 Subtest/Section
4 "4. " Recognition Memory 40 59 6 Subtest/Section
5 "5. " Gesture Object Use 2 68 7 Subtest/Section
6 "6. " Arithmetic 5 49 8 Subtest/Section
7 "7. " Spoken Words 17 45* 14 Spoken Language
8 "9. " Spoken Sentences 25 53* 15 Spoken Language
9 "11. " Spoken Paragraphs 4 60 16 Spoken Language
10 "8. " Written Words 14 45* 20 Written Language
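The core of the approach above is a single regular expression that matches only the numbered subtest rows. The sketch below shows that idea in base R on a few made-up lines. The exact layout of the real file is an assumption here (tab-separated, with an optional trailing asterisk on the T-score), and lines, pattern and scores are names introduced for illustration.

```r
# Made-up lines that mimic the file: a header, two subtests, a section label
lines <- c("Subtest/Section\tScore\tT-Score",
           "1. Line Bisection\t9\t53",
           "3. Word Fluency\t1\t56*",
           "Cognitive Screen")

# index, subtest name, score, T-score; a trailing asterisk is matched but not captured
pattern <- "^([0-9]+)[. ]+(.*?)\\s+([0-9]+)\\s+([0-9]+)\\*?$"
m <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# Non-matching lines give character(0); keep only full matches (5 pieces each)
m <- do.call(rbind, m[lengths(m) == 5])

scores <- data.frame(subtest = m[, 3],
                     score   = as.integer(m[, 4]),
                     t_score = as.integer(m[, 5]),
                     stringsAsFactors = FALSE)
```

Rows such as section labels drop out automatically because they do not start with a number, and the asterisks are discarded before the T-scores are converted to integers.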
What is the source for the code at the link provided?
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
This data is odd. I was able to successfully match patterns and manipulate most of the data, but two rows refused to oblige. Rows 17 and 20 refused to be matched. In addition, the data type / data structure are very unfamiliar.
This is what was accomplished before hitting a wall.
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, Extract = str_extract(df$V2, "[1-9]+\\s[1-9]+\\*+\\s?"))
df2 <- df1 %>% mutate(V2, Extract2 = str_extract(df1$V2, "[0-9]+.[0-9]+$"))
head(df2)
When the data was further explored, the second column, V2, included data types that are completely unfamiliar. These included: Arithmetic, Complex Words, Digit Strings, and Function Words.
If anything, it would be good to know something about those unfamiliar data types.
Took another look at this problem and found where it had gotten off track. Ignore my previous post. This solution works in Jupyter Lab using the data that was provided.
library(stringr)
library(dplyr)
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, "Score" = str_extract(df$V2, "\\d+") )
df2 <- df1 %>% mutate(V2, "T Score" = str_extract(df$V2, "\\d\\d\\*?$"))
df3 <- df2 %>% mutate(V2, "Subtest/Section" = str_remove_all(df2$V2, "\\\t+[0-9]+"))
df4 <- df3 %>% mutate(V1, "Sub-S" = str_extract(df3$V1, "\\s\\d\\d\\s*"))
df5 <- df4 %>% mutate(V1, "Sub-T" = str_extract(df4$V1,"\\d\\d\\*"))
df6 <- replace(df5, is.na(df5), "")
df7 <- df6 %>% mutate(V1, "Description" = str_remove_all(V1, "\\d\\d\\s\\d\\d\\**$")) # remove digits, new variable
df7$V1 <- NULL # remove variable
df7$V2 <- NULL # remove variable
df8 <- df7[, c(6,3,1,4,2,5)] # re-align variables
head(df8,15)
I am working with a data frame that has two columns, name and spouse. I am trying to calculate the interracial marriage frequency, but I need to remove duplicate records.
When a creature appears in the name column, I need to keep that record but remove the record where the same creature appears as the spouse. I have the following data sample:
   name                      spouse
15 Finarfin                  Eärwen
6  Tar-Vanimeldë             Herucalmo
17 Faramir                   Éowyn
8  Tar-Meneldur              Almarian
14 Finduilas of Dol Amroth   Denethor II
12 Finwë                     Míriel Serindë then ,Indis
9  Tar-Ancalimë              Hallacar
7  Tar-Míriel                Ar-Pharazôn
5  Tarannon Falastur         Berúthiel
21 Rufus Burrows             Asphodel Brandybuck
2  Angrod                    Eldalótë
4  Ar-Gimilzôr               Inzilbêth
19 Lobelia Sackville-Baggins Otho Sackville-Baggins
25 Mrs. Proudfoot            Odo Proudfoot
22 Rudigar Bolger            Belba Baggins
24 Odo Proudfoot             Mrs. Proudfoot
3  Ar-Pharazôn               Tar-Míriel
13 Fingolfin                 Anairë
18 Silmariën                 Elatan
23 Rowan Greenhand           Belba Baggins
20 Rían                      Huor
1  Adanel                    Belemir
16 Fastolph Bolger           Pansy Baggins
10 Morwen Steelsheen         Thengel
11 Tar-Aldarion              Erendis
25 Belemir                   Adanel
For example, line 1 has the name Adanel with Belemir as the spouse, so I need to keep line 1 but remove line 25 (Belemir, Adanel) to avoid duplicated data.
I have tried the following code:
interacialMariage <-data %>% filter(spouse != name) %>% select(name, spouse)
How can I remove the records in which a pair already appears with name and spouse swapped?
P.S.: I need the comparison to be case-insensitive (Belemir == belemir) so that I don't have problems in the future.
Thanks!
You could set up another vector with the row-wise alphabetically sorted names, and deduplicate using that...
sorted <- sapply(1:nrow(data),
function(i) paste(sort(c(trimws(tolower(data$name[i])),
trimws(tolower(data$spouse[i])))),
collapse=" "))
irM <- data[!duplicated(sorted),]
The trimws strips off any leading or trailing spaces before sorting and pasting, and tolower converts everything to lower case.
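The same order-independent key can also be built without the row-wise sapply(). The following is a sketch of that variant, not the answer's own code: pmin() and pmax() pick the alphabetically first and last name of each pair element-wise. The small data frame is the four-row sample used elsewhere in this thread, with the last name lower-cased here on purpose to show the case-insensitive match; a, b, key and deduped are names introduced for illustration.

```r
dat <- data.frame(
  name   = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "belemir"),
  spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel"),
  stringsAsFactors = FALSE
)

# Normalise case and whitespace before comparing
a <- trimws(tolower(dat$name))
b <- trimws(tolower(dat$spouse))

# Same key for (x, y) and (y, x): smaller name first, larger name second
key <- paste(pmin(a, b), pmax(a, b), sep = "|")

# Keep the first occurrence of each pair
deduped <- dat[!duplicated(key), ]
```

The first occurrence of each couple is kept with its original spelling, since the normalised strings are used only for the key.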
My attempt with tidyverse:
library(tidyverse)
dat %>%
mutate(id = 1:n()) %>% # add id to label the pairs
gather('key', 'name', -id) %>% # transform: key (name | spouse), name, id
group_by(name) %>% # group by unique name to find duplicated
top_n(-1, wt = id) %>% # if name > 1, take row with the lower id
spread(key, name) %>% # spread data to original format
select(-id) # remove id's
# # A tibble: 3 x 2
# name spouse
# <chr> <chr>
# 1 Adanel Belemir
# 2 Fastolph Bolger Pansy Baggins
# 3 Morwen Steelsheen Thengel
Data:
dat <- data.frame(
name = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "Belemir"),
spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel" ),
stringsAsFactors = F
)