I'm not sure how to approach this problem, and I'm stuck. I have a wide data frame, something like this:
Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16
I would like to export the data to a csv file with the following format
Date Amy John
X Y X Y
March 14 15 10.5 14.5
April 10 11 15 16
My first question is what the best approach to achieve this is. Should I separate Amy_X into Amy and X, then create a vector of repeated names (Amy, Amy, John, John) and use that as another header? What's the best solution for this scenario?
The question says to output the file to csv but the output shown is not comma-separated values (csv). We show both.
Using input data frame DF defined reproducibly in the Note at the end, create a data frame from the headers and use separate_rows on it and rbind that to DF. Then do any remaining fix ups. Write it out without the row and column names and without quotes. Replace stdout() with your file name.
library(dplyr)
library(tidyr)
DF2 <- DF %>%
names %>%
as.list %>%
as.data.frame %>%
separate_rows(everything()) %>%
setNames(names(DF)) %>%
rbind(DF)
DF2[2, 1] <- DF2[1, duplicated(unlist(DF2[1, ]))] <- ""
output <- capture.output(prmatrix(DF2, quote = FALSE,
rowlab = rep("", nrow(DF2)), collab = rep("", ncol(DF2))))[-1]
writeLines(output, stdout())
giving the following which reproduces the output shown in the question:
Date Amy John
X Y X Y
March 14 15 10.5 14.5
April 10 11 15 16
If you really do want csv, then use this in place of the capture.output and writeLines statements above:
write.table(DF2, stdout(), sep = ",", quote = FALSE, row.names = FALSE,
col.names = FALSE)
giving:
Date,Amy,,John,
,X,Y,X,Y
March,14,15,10.5,14.5
April,10,11,15,16
Note
Lines <- "Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE, as.is = TRUE)
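For comparison, here is a minimal base R sketch of the same idea without dplyr or tidyr. It assumes every column name other than Date has the form name_suffix; the names top, sub_header and csv_lines are illustrative and not part of the answer above.

```r
# Input mirroring the question's data (same values as DF in the Note)
DF <- data.frame(Date = c("March", "April"),
                 Amy_X = c(14, 10), Amy_Y = c(15, 11),
                 John_X = c(10.5, 15), John_Y = c(14.5, 16),
                 stringsAsFactors = FALSE)

# Split each column name on "_": "Amy_X" -> c("Amy", "X"), "Date" -> "Date"
parts <- strsplit(names(DF), "_", fixed = TRUE)
top <- vapply(parts, function(p) p[1], character(1))
sub_header <- vapply(parts, function(p) if (length(p) > 1) p[2] else "",
                     character(1))

# Blank out repeated names so each person appears once in the first header row
top[duplicated(top)] <- ""

# Build the two header lines and the data lines by hand
body <- do.call(paste, c(unname(as.list(DF)), sep = ","))
csv_lines <- c(paste(top, collapse = ","),
               paste(sub_header, collapse = ","),
               body)
writeLines(csv_lines)  # pass a file name as the second argument to write to disk
```

Building the lines as plain strings sidesteps the need to coerce the numeric columns to character before binding header rows onto the data.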
Related
I have a very large df with a column that contains the file directory for each row's data.
Example: D:Mouse_2174/experiment/13/trialsummary.txt.1
I would like to create 2 new columns, one with only the mouse ID (2174) and one with the session number (13). There will be different IDs and session numbers based on the row.
I've used sub as recommended here (match part of names in data.frame to new column), but I can only get the subject column to say "D:Mouse_2174". After adding an additional line I can get it down to "D:Mous2174".
Is there a way to eliminate all chars before _ and after / to obtain mouse ID?
For session number, I'm not quite as sure what to do with multiple / in the directory name.
percent_correct_list$mouse_id <- sub("/.+", "", percent_correct_list$rn)
#gives me D:Mouse_2174
percent_correct_list$mouse_id <- sub("+._", "", percent_correct_list$mouse_id)
#gives me D:Mous2174
Here is sample code for the directories:
df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")
)
What I want:
rn id session
D:.. 2174 9
D:.. 2181 33
D:.. 2183 107
D:.. 2185 87
Maybe there's some way to do this earlier along in the process too (like when I import all the data into a df using lapply - but this is good as well)
This certainly isn't an elegant solution, and it only works if your ID and session are always numbers.
library(dplyr)
library(tidyr)
df <- data.frame(
rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
"D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
"D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
"D:Mouse_2185/iti_intervals/87/trialsummary.txt.1")) %>%
# Extract all runs of digits from the string
mutate(allnums = regmatches(rn, gregexpr("[[:digit:]]+", rn)))%>%
# Separate them
separate(allnums, into = c("id", "session", "idk"), sep = "\\,") %>%
# Extract them individually
mutate(id = as.numeric(regmatches(id, gregexpr("[[:digit:]]+", id))),
session = as.numeric(regmatches(session, gregexpr("[[:digit:]]+", session)))) %>%
# Drop the trailing file-extension number
select(-idk)
Output:
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
Here's a somewhat long-winded solution, using tidyr::separate. Perhaps there is something more concise/elegant.
It does assume that all values of rn take the same format.
library(dplyr)
library(tidyr)
new_df <- df %>%
# separate on / into 4 new columns
separate(rn, into = c(paste0("item", 1:4)), sep = "/", remove = FALSE) %>%
# remove unwanted columns
select(-item2, -item4) %>%
# separate again on _ into 2 new columns
separate(item1, sep = "_", into = c("prefix", "id")) %>%
# retain and rename desired columns
select(rn, id, session = item3)
Result:
rn id session
1 D:Mouse_2174/iti_intervals/9/trialsummary.txt.1 2174 9
2 D:Mouse_2181/iti_intervals/33/trialsummary.txt.1 2181 33
3 D:Mouse_2183/iti_intervals/107/trialsummary.txt.2 2183 107
4 D:Mouse_2185/iti_intervals/87/trialsummary.txt.1 2185 87
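To answer the original question about dropping everything before the _ and after the /, a single regular expression with two capture groups may be enough. This is a minimal base R sketch, assuming every path has the shape prefix_ID/folder/session/file as in the sample data; the variable name pattern is illustrative.

```r
df <- data.frame(
  rn = c("D:Mouse_2174/iti_intervals/9/trialsummary.txt.1",
         "D:Mouse_2181/iti_intervals/33/trialsummary.txt.1",
         "D:Mouse_2183/iti_intervals/107/trialsummary.txt.2",
         "D:Mouse_2185/iti_intervals/87/trialsummary.txt.1"),
  stringsAsFactors = FALSE
)

# Group 1: digits between "_" and the first "/" (mouse ID)
# Group 2: digits that make up the third path component (session)
pattern <- "^[^/]*_([0-9]+)/[^/]+/([0-9]+)/.*$"

# sub() replaces the whole string with the content of one capture group
df$id <- as.integer(sub(pattern, "\\1", df$rn))
df$session <- as.integer(sub(pattern, "\\2", df$rn))
print(df[, c("id", "session")])
```

If a path does not follow the assumed shape, sub() returns it unchanged and as.integer() yields NA with a warning, which makes malformed rows easy to spot.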
I have a semicolon-separated file in which one of the character variables itself contains semicolons. The readr::read_csv2 function splits the contents of such values into extra columns, which breaks the structure of the table.
For example, when using read_csv2 to open the file below, Bill's age column will show jogging, not 41.
File:
name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32
Considering that the original file doesn't contain quotes around the character type variables, how can I import the file so that karate and jogging belong in the hobbies column?
read.csv()
You can use the read.csv() function, but it will produce some warning messages (wrap the read.csv() call in suppressWarnings() to silence them). If you wish to avoid the warnings altogether, use the scan() method in the next section.
library(dplyr)
read.csv("./path/to/your/file.csv", sep = ";",
col.names = c("name", "hobbies", "age", "X4")) %>%
mutate(hobbies = ifelse(is.na(X4), hobbies, paste0(hobbies, ";" ,age)),
age = ifelse(is.na(X4), age, X4)) %>%
select(-X4)
scan() file
You can first scan() the CSV file into a character vector, then split each string on ; and convert the result into a data frame. After that, use mutate() to rebuild the target column and remove the unnecessary column. Finally, use the first row as the column names.
library(tidyverse)
library(janitor)
semicolon_file <- scan(file = "./path/to/your/file.csv", character())
semicolon_df <- data.frame(str_split(semicolon_file, ";", simplify = T))
semicolon_df %>%
mutate(X4 = na_if(X4, ""),
X2 = ifelse(is.na(X4), X2, paste0(X2, ";" ,X3)),
X3 = ifelse(is.na(X4), X3, X4)) %>%
select(-X4) %>%
janitor::row_to_names(row_number = 1)
Output
name hobbies age
2 Jon cooking 38
3 Bill karate;jogging 41
4 Maria fishing 32
Assuming that you have the columns name and age with a single entry per observation and hobbies with possible multiple entries the following approach works:
read in the file line by line instead of treating it as a table:
tmp <- readLines(con <- file("table.csv"))
close(con)
Find the position of the separator in every row. The entry before the first separator is the name; the entry after the last one is the age:
separator_pos <- gregexpr(";", tmp)
name <- character(length(tmp) - 1)
age <- integer(length(tmp) - 1)
hobbies <- vector(length=length(tmp) - 1, "list")
fill the three elements using a for loop:
# the first line holds the column names
for(line in 2:length(tmp)){
# from the beginning of the row to the first ";"
name[line-1] <- strtrim(tmp[line], separator_pos[[line]][1] -1)
# between the first ";" and the last ";".
# Every ";" separates a different element of the list
hobbies[line-1] <- strsplit(substr(tmp[line], separator_pos[[line]][1] +1,
separator_pos[[line]][length(separator_pos[[line]])]-1),";")
#after the last ";", must be an integer
age[line-1] <- as.integer(substr(tmp[line],separator_pos[[line]][length(separator_pos[[line]])]+1,
nchar(tmp[line])))
}
Create a separate matrix to hold the hobbies and fill it rowwise:
hobbies_matrix <- matrix(NA_character_, nrow = length(hobbies), ncol = max(lengths(hobbies)))
for(line in 1:length(hobbies))
hobbies_matrix[line,1:length(hobbies[[line]])] <- hobbies[[line]]
Add all variables to a data.frame:
df <- data.frame(name = name, hobbies = hobbies_matrix, age = age)
> df
name hobbies.1 hobbies.2 age
1 Jon cooking <NA> 38
2 Bill karate jogging 41
3 Maria fishing <NA> 32
You could also do:
read.csv(text=gsub('(^[^;]+);|;([^;]+$)', '\\1,\\2', readLines('file.csv')))
name hobbies age
1 Jon cooking 38
2 Bill karate;jogging 41
3 Maria fishing 32
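To see what that one-liner does, it can help to look at the intermediate strings. The sketch below uses the question's lines as an in-memory vector instead of reading file.csv; the variable names lines, converted and parsed are illustrative.

```r
lines <- c("name;hobbies;age",
           "Jon;cooking;38",
           "Bill;karate;jogging;41",
           "Maria;fishing;32")

# The first alternative matches the first field plus its ";",
# the second matches the last ";" plus the final field.
# Both separators become ",", while inner ";" characters are left alone.
converted <- gsub("(^[^;]+);|;([^;]+$)", "\\1,\\2", lines)
print(converted)

# The result is an ordinary comma-separated table
parsed <- read.csv(text = converted, stringsAsFactors = FALSE)
print(parsed)
```

This relies on neither the name nor the age containing a semicolon, and on no field containing a comma.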
Ideally you'd ask whoever generated the file to do it properly next time :) but of course this is not always possible.
Easiest way is probably to read the lines from the file into a character vector, then clean up and make a data frame by string matching.
library(readr)
library(dplyr)
library(stringr)
# skip header, add it later
dataset <- read_lines("your_file.csv", skip = 1)
dataset_df <- data.frame(name = str_match(dataset, "^(.*?);")[, 2],
hobbies = str_match(dataset, ";(.*?);\\d")[, 2],
age = as.numeric(str_match(dataset, ";(\\d+)$")[, 2]))
Result:
name hobbies age
1 Jon cooking 38
2 Bill karate;jogging 41
3 Maria fishing 32
Using the file created in the Note at the end
1) read.pattern can read this by specifying the pattern as a regular expression with the portions within parentheses representing the fields.
library(gsubfn)
read.pattern("hobbies.csv", pattern = '^(.*?);(.*);(.*)$', header = TRUE)
## name hobbies age
## 1 Jon cooking 38
## 2 Bill karate;jogging 41
## 3 Maria fishing 32
2) Base R Using base R we can read in the lines, put quotes around the middle field and then read it in normally.
L <- "hobbies.csv" |>
readLines() |>
sub(pattern = ';(.*);', replacement = ';"\\1";')
read.csv2(text = L)
## name hobbies age
## 1 Jon cooking 38
## 2 Bill karate;jogging 41
## 3 Maria fishing 32
Note
Lines <- "name;hobbies;age
Jon;cooking;38
Bill;karate;jogging;41
Maria;fishing;32
"
cat(Lines, file = "hobbies.csv")
I have scraped some data from a URL to analyse cycling results. Unfortunately, the name column consists of the rider's name and the team name in one field. I would like to separate these from each other. Here's the code (the last part doesn't work):
library(rvest)
library(tidyr)
#get url
stradebianchi_2020 <- read_html("https://www.procyclingstats.com/race/strade-bianche/2020/result")
#scrape table
results_2020 <- stradebianchi_2020%>%
html_nodes("td")%>%
html_text()
#transpose scraped data into dataframe
results_stradebianchi_2020 <- as.data.frame(t(matrix(results_2020, 8, byrow = F)))
#rename
names(results_stradebianchi_2020) <- c("rank", "#", "name", "age", "team", "UCI point", "PCS points", "time")
#split rider from team
separate(data = results_stradebianchi_2020, col = name, into = c("left", "right"), sep = " ")
I think the best option is to get the team variable name and use that name to remove it from the 'name' column.
All suggestions are welcome!
I think your request is wrongly formulated. You want to remove team from name.
That's how you should do it in my opinion:
results_stradebianchi_2020 %>%
mutate(name = stringr::str_remove(name, team))
Write this instead of your line with separate.
In this case separate is not an optimal solution for you because the separation character is not clearly defined.
Also, I would advise you to remove the initial blanks from name with stringr::str_trim(name)
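One caveat: str_remove treats team as a regular expression, so a team name containing characters such as . or ( could match unexpectedly. Below is a minimal base R sketch that removes the team literally instead. The two rows are made-up stand-ins that assume the team text is appended verbatim to the rider's name, and riders is a hypothetical data frame name.

```r
# Hypothetical rows: name has the team name glued onto the end
riders <- data.frame(
  name = c(" van Aert WoutTeam Jumbo-Visma", " Formolo DavideUAE-Team Emirates"),
  team = c("Team Jumbo-Visma", "UAE-Team Emirates"),
  stringsAsFactors = FALSE
)

# Remove each row's team from that row's name.
# fixed = TRUE makes sub() treat the team as literal text, not a regex.
riders$name <- trimws(mapply(
  function(nm, tm) sub(tm, "", nm, fixed = TRUE),
  riders$name, riders$team,
  USE.NAMES = FALSE
))
print(riders$name)
```

With stringr, wrapping the pattern as stringr::fixed(team) should give the same literal matching.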
You could do this in base R with gsub and replace in the name column the pattern of team column with "", i.e. nothing. We use apply() with MARGIN=1 to go through the data frame row by row. Finally we use trimws to clean from whitespace (where we change to whitespace="[\\h\\v]" for better matching the spaces).
res <- transform(results_stradebianchi_2020,
name=trimws(apply(results_stradebianchi_2020, 1, function(x)
gsub(x["team"], "", x["name"])), whitespace="[\\h\\v]"))
head(res)
# rank X. name age team UCI.point PCS.points time
# 1 1 201 van Aert Wout 25 Team Jumbo-Visma 300 200 4:58:564:58:56
# 2 2 234 Formolo Davide 27 UAE-Team Emirates 250 150 0:300:30
# 3 3 87 Schachmann Maximilian 26 BORA - hansgrohe 215 120 0:320:32
# 4 4 111 Bettiol Alberto 26 EF Pro Cycling 175 100 1:311:31
# 5 5 44 Fuglsang Jakob 35 Astana Pro Team 120 90 2:552:55
# 6 6 7 Štybar Zdenek 34 Deceuninck - Quick Step 115 80 3:593:59
I am currently working with clinical assessment data that is scored and output by a software package in a .txt file. My goal is to extract the data from the .txt file into a long-format data frame with columns for participant number (which is included in the file name), subtest, score, and T-score.
An example data file is available here:
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
I am running into a couple of roadblocks and could use some input on how to navigate them.
1) I only need the information that corresponds to each subtest; these rows all have a number before the subtest name. The rows that contain only one or two words and are not needed (e.g. cognitive screen) seem to interfere with creating new data frames, because the columns provided and the columns wanted do not match.
Some additional quirks of the data:
1) the asterisks are NOT necessary
2) the cognitive TOTAL will never have a value
I am using the readtext package to import the data at the moment, and I am able to get a data frame with two columns. One is the file name (this includes the participant name), so that problem is solved. However, the other column is a giant character string containing the data points for both Score and T-Score. Presumably I would then need to split this into the columns of interest listed previously.
Next problem, when I view the data the T scores are in the correct order, however the "score" data no longer matches the true values.
Here is what I have tried:
# install.packages("readtext")
library(readtext)
library(tidyr)
pathTofile <- path.expand("/Users/Brahma/Desktop/CAT TEXT FILES/")
data <- readtext(paste0(pathTofile, "CAToutput.txt"),
#docvarsfrom = "filenames",
dvsep = " ")
From here I do not know how to split the data. In my head, I would do something like this:
data2 <- separate(data, text, sep = " ", into = c("subtest", "score", "t_score"))
This of course, gives the correct column names but removes almost all the data I actually am interested in.
Any help would be appreciated whether a solution or a direction you might suggest I look for more answers.
Sincerely,
Alex
Here is a way of converting that text file to a dataframe that you can do analysis on
library(tidyverse)
input <- read_lines('c:/temp/scores.txt')
# do the match and keep only the second column
header <- as_tibble(str_match(input, "^(.*?)\\s+Score.*")[, 2, drop = FALSE])
colnames(header) <- 'title'
# add index to the list so we can match the scores that come after
header <- header %>%
mutate(row = row_number()) %>%
fill(title) # copy title down
# pull off the scores on the numbered rows
scores <- str_match(input, "^([0-9]+[. ]+)(.*?)\\s+([0-9]+)\\s+([0-9*]+)$")
scores <- as_tibble(scores) %>%
mutate(row = row_number())
# keep only rows that are numbered and delete first column
scores <- scores[!is.na(scores[,1]), -1]
# merge the header with the scores to give each section
table <- left_join(scores,
header,
by = 'row'
)
colnames(table) <- c('index', 'type', 'Score', 'T-Score', 'row', 'title')
head(table, 10)
# A tibble: 10 x 6
index type Score `T-Score` row title
<chr> <chr> <chr> <chr> <int> <chr>
1 "1. " Line Bisection 9 53 3 Subtest/Section
2 "2. " Semantic Memory 8 51 4 Subtest/Section
3 "3. " Word Fluency 1 56* 5 Subtest/Section
4 "4. " Recognition Memory 40 59 6 Subtest/Section
5 "5. " Gesture Object Use 2 68 7 Subtest/Section
6 "6. " Arithmetic 5 49 8 Subtest/Section
7 "7. " Spoken Words 17 45* 14 Spoken Language
8 "9. " Spoken Sentences 25 53* 15 Spoken Language
9 "11. " Spoken Paragraphs 4 60 16 Spoken Language
10 "8. " Written Words 14 45* 20 Written Language
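The core of the approach above is a pattern that only matches the numbered rows, so header and total rows drop out on their own. Here is a minimal base R sketch of that step. The lines in raw_lines are invented to resemble the output shown and are not copied from the linked file, and the exact spacing of the real file is an assumption.

```r
# Hypothetical lines resembling the scored output (not the real file)
raw_lines <- c("Cognitive Screen      Score  T-Score",
               "1. Line Bisection     9      53",
               "3. Word Fluency       1      56*",
               "Cognitive TOTAL")

# number, dot, subtest name, score, T-score, optional trailing asterisk
pattern <- paste0("^([0-9]+)\\.[[:space:]]+(.*[^[:space:]])[[:space:]]+",
                  "([0-9]+)[[:space:]]+([0-9]+)\\*?[[:space:]]*$")

# regexec() returns the full match plus one entry per capture group;
# lines that do not match yield character(0) and are filtered out
m <- regmatches(raw_lines, regexec(pattern, raw_lines))
hits <- m[lengths(m) > 0]

scores <- data.frame(
  subtest = vapply(hits, function(h) h[3], character(1)),
  score   = as.integer(vapply(hits, function(h) h[4], character(1))),
  t_score = as.integer(vapply(hits, function(h) h[5], character(1))),
  stringsAsFactors = FALSE
)
print(scores)
```

Because the asterisk sits outside the last capture group, the T-scores come out as plain integers, which matches the requirement that the asterisks are not needed.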
What is the source for the code at the link provided?
https://github.com/AlexSwiderski/CatTextToData/blob/master/Example_data
This data is odd. I was able to successfully match patterns and manipulate most of the data, but two rows refused to oblige. Rows 17 and 20 refused to be matched. In addition, the data type / data structure are very unfamiliar.
This is what was accomplished before hitting a wall.
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, Extract = str_extract(df$V2, "[1-9]+\\s[1-9]+\\*+\\s?"))
df2 <- df1 %>% mutate(V2, Extract2 = str_extract(df1$V2, "[0-9]+.[0-9]+$"))
head(df2)
When the data was further explored, the second column, V2, included data types that are completely unfamiliar. These included: Arithmetic, Complex Words, Digit Strings, and Function Words.
If anything, it would be good to know something about those unfamiliar data types.
Took another look at this problem and found where it had gotten off track. Ignore my previous post. This solution works in Jupyter Lab using the data that was provided.
library(stringr)
library(dplyr)
df <- read.csv("test.txt", header = FALSE, sep = ".", skip = 1)
df1 <- df %>% mutate(V2, "Score" = str_extract(df$V2, "\\d+") )
df2 <- df1 %>% mutate(V2, "T Score" = str_extract(df$V2, "\\d\\d\\*?$"))
df3 <- df2 %>% mutate(V2, "Subtest/Section" = str_remove_all(df2$V2, "\\\t+[0-9]+"))
df4 <- df3 %>% mutate(V1, "Sub-S" = str_extract(df3$V1, "\\s\\d\\d\\s*"))
df5 <- df4 %>% mutate(V1, "Sub-T" = str_extract(df4$V1,"\\d\\d\\*"))
df6 <- replace(df5, is.na(df5), "")
df7 <- df6 %>% mutate(V1, "Description" = str_remove_all(V1, "\\d\\d\\s\\d\\d\\**$")) # remove digits, new variable
df7$V1 <- NULL # remove variable
df7$V2 <- NULL # remove variable
df8 <- df7[, c(6,3,1,4,2,5)] # re-align variables
head(df8,15)
I have the following example data (the real data contains other columns with both numeric and character variables):
structure(list(AM = structure(1:20, .Label = c("AMP_R", "AZI_R",
"CHL_R", "CIP_R", "COL_R", "ERY_R", "ETP_R", "F.C_R", "FEP_R",
"FOT_R", "FOX_R", "GEN_R", "IMI_R", "MERO_R", "NAL_R", "STR_R",
"SULFA_R", "T.C_R", "TAZ_R", "TET_R"), class = "factor")), .Names = "AM", row.names = c(NA,
-20L), class = "data.frame")
I tried to create a function that detects whether a column in a data frame contains values ending in "_R". If it does, the function removes this ending and then renames the values to their full names according to a conversion table. If the "_R" ending is not present, it just converts the names directly.
I have tried the following on the first part of the function:
library(dplyr)
convert_AM_names <- function(data, col) {
data %>%
mutate(col = gsub("(.*?)_R", "\\1", col))
}
I want to use it in a dplyr chain, like this:
AM <- AM %>%
rowwise() %>%
convert_AM_names(., AM)
However, when I do this, it gives the error "Error in mutate_impl(.data, dots): Column "col" must be length 1, not 20"
I saw that similar issues have been addressed here at SO, but for most of them the solution was to use rowwise(), which doesn't seem to work here. Any suggestions?
You can use an anchor for your regular expression that only matches when the _R is right at the end:
convert_AM_names <- function(col) {
gsub("(.*)_R$", "\\1", col)
}
library(dplyr)
df %>%
mutate(AM = convert_AM_names(AM))
Or directly - without the overhead of convert_AM_names():
df %>%
mutate(AM = gsub("(.*)_R$", "\\1", AM))
Both will yield:
AM
1 AMP
2 AZI
3 CHL
4 CIP
5 COL
6 ERY
7 ETP
8 F.C
9 FEP
10 FOT
11 FOX
12 GEN
13 IMI
14 MERO
15 NAL
16 STR
17 SULFA
18 T.C
19 TAZ
20 TET
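A small sketch of why the $ anchor matters. The code "X_RAY_R" is a made-up value, chosen only because it contains _R twice; none of the sample data behaves this way.

```r
code <- "X_RAY_R"  # hypothetical value with "_R" in the middle and at the end

# Without an anchor, gsub() removes every "_R" it finds
unanchored <- gsub("_R", "", code)

# With "$", only a trailing "_R" is removed
anchored <- sub("_R$", "", code)

print(c(unanchored, anchored))

# Values without the suffix pass through unchanged
cleaned <- sub("_R$", "", c("AMP_R", "F.C_R", "MERO"))
print(cleaned)
```

Since nothing from the match is kept, the capture group and the "\\1" replacement are optional here; replacing "_R$" with an empty string has the same effect.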
You can use mutate_at() which allows you to select a column and apply a function to it.
AM %>%
mutate_at(.vars = "AM",
.funs = gsub,
pattern = "(.*?)_R",
replacement = "\\1")
If you wanted, you could also rewrite your function:
convert_AM_names <- function(col) {
gsub("(.*?)_R", "\\1", col)
}
And use it in mutate_at():
AM %>%
mutate_at(.vars = "AM",
.funs = convert_AM_names)
In both cases, the result looks like this:
AM
1 AMP
2 AZI
3 CHL
4 CIP
5 COL
6 ERY
7 ETP
8 F.C
9 FEP
10 FOT
11 FOX
12 GEN
13 IMI
14 MERO
15 NAL
16 STR
17 SULFA
18 T.C
19 TAZ
20 TET
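The question also mentions converting the codes to full names with a conversion table, which none of the answers cover. One possible base R sketch uses a named character vector as the lookup. The table below is hypothetical, since the real conversion table is not shown, and to_full_names is an illustrative helper name.

```r
# Hypothetical conversion table: names are codes, values are full names
lookup <- c(AMP = "Ampicillin", CIP = "Ciprofloxacin", TET = "Tetracycline")

to_full_names <- function(codes, table) {
  # Strip a trailing "_R" if present, then look each code up by name
  stripped <- sub("_R$", "", as.character(codes))
  unname(table[stripped])
}

full_names <- to_full_names(c("AMP_R", "CIP_R", "TET"), lookup)
print(full_names)
```

Codes missing from the table come back as NA, so unmatched values are easy to find. The function takes a plain vector, so it can be called inside mutate() in the same way as convert_AM_names() above.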