Can anyone help arranging long data to wide data, but complicated by linked results, ie listed in wide format identified by Study Number this repetitive results listed in wide format after the SN (I've shown an abbreviated table there are more results per patient listed along the bottom with repetitive in columns LabTest, LabDate, Result, Lower, Upper)...I've tried melt and recast, and binding columns but can't seem to get it to work. Over 1000 results to reformat so can't input results manually need to reformat a long data excel document in wide format in R Thank you
Original data looks like this
SN LabTest LabDate Result Lower Upper
TD62 Creat 05/12/2004 22 30 90
TD62 AST 06/12/2004 652 6 45
TD58 Creat 26/05/2007 72 30 90
TD58 Albumin 26/05/2005 22 25 35
TD14 AST 28/02/2007 234 6 45
TD14 Albumin 26/02/2007 15 25 35
Formatted data should look like this
SN LabTCode LabDate Result Lower Upper LabCode LabDate Result Lower Upper
TD62 Creat 05/12/04 22 30 90 AST 06/12/04 652 6 45
TD58 Creat 26/05/05 72 30 90 Alb 26/05/05 22 25 35
TD14 AST 28/02/07 92 30 90 Alb 26/02/07 15 25 35
Formatted data looks like this
So far I have tried:
data_wide2 <- dcast(tdl, SN + LabDate ~ LabCode, value.var="Result")
and
melt(tdl, id = c("SN", "LabDate"), measured= c("Result", "Upper", + "Lower"))
Your issue is that R won't like the final table because it has duplicate column names. Maybe you need the data in that format but it's a bad way to store data because it would be difficult to put the columns back into rows again without a load of manual work.
That said, if you want to do it you'll need a new column to help you transpose the data.
I've used dplyr and tidyr below, which are worth looking at rather than reshape. They're by the same author but more modern and designed to fit together as part of the 'tidyverse'.
library(dplyr)
library(tidyr)
#Recreate your data (not doing this bit in your question is what got you downvoted)
df <- data.frame(
SN = c("TD62","TD62","TD58","TD58","TD14","TD14"),
LabTest = c("Creat","AST","Creat","Albumin","AST","Albumin"),
LabDate = c("05/12/2004","06/12/2004","26/05/2007","26/05/2005","28/02/2007","26/02/2007"),
Result = c(22,652,72,22,234,15),
Lower = c(30,6,30,25,6,25),
Upper = c(90,45,90,35,45,35),
stringsAsFactors = FALSE
)
output <- df %>%
group_by(SN) %>%
mutate(id_number = row_number()) %>% #create an id number to help with tracking the data as it's transposed
gather("key", "value", -SN, -id_number) %>% #flatten the data so that we can rename all the column headers
mutate(key = paste0("t",id_number, key)) %>% #add id_number to the column names. 't' for 'test' to start name with a letter.
select(-id_number) %>% #don't need id_number anymore
spread(key, value)
SN t1LabDate t1LabTest t1Lower t1Result t1Upper t2LabDate t2LabTest t2Lower t2Result t2Upper
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 TD14 28/02/2007 AST 6 234 45 26/02/2007 Albumin 25 15 35
2 TD58 26/05/2007 Creat 30 72 90 26/05/2005 Albumin 25 22 35
3 TD62 05/12/2004 Creat 30 22 90 06/12/2004 AST 6 652 45
And you're there, possibly with some sorting issues still to crack if you need the columns in a specific order.
Related
I have a set of csv files, 64 of them. all of those title are named like "LineNum_nnn.csv" (LineNum_101.csv, LineNum_107.csv, LineNum_501.csv, ...)
Each csv file has five columns:
Date / On / Off / Transfer / LineNum
2020-01-02 / 8874 / 7170 / 1886 / 211
2020-01-03 / 8928 / 7170 / 1886 / 211
... so on. All of them have about 800 rows.
I used lapply function to do two things:
1.import all the csv files in my working directory and 2.apply my pre-processing function(named function_merged) to all 67 dataframes in the list.
I imported csv files with this code. Works well.
wholeDataList = lapply(fileList, function(x) read.csv(x, encoding = "UTF-8"))
head(wholeDataList[[2]], 3)
A data.frame: 3 × 5
Date On Off Transfer LineNum
<chr> <int> <int> <int> <int>
1 2020-01-02 16830 14564 4536 102
2 2020-01-03 17440 14978 4614 102
3 2020-01-04 12579 10862 3011 102
applied my pre-processing function(function-Merged) to the dataframe list:
wholeDataList_merged = lapply(wholeDataList, function(x) function_Merged(x))
Also works well. Now I have a one-dimensional list with 67 processed dataframes in a row. for example:
head(wholeDataList_merged[[2]], 3)
gives me
A data.frame: 3 × 11
Date On Off Transfer LineNum Days Workdays On_RunMed NumericDate Loess_Fit Loess_SE
<date> <int> <int> <int> <int> <chr> <fct> <dbl> <dbl> <dbl> <dbl>
1 2020-01-02 16830 14564 4536 102 Thu TRUE 16709 18263 16628.28 261.7660
6 2020-01-07 15734 13311 4268 102 Tue TRUE 16709 18268 16919.24 187.8690
7 2020-01-08 16709 14375 4698 102 Wed TRUE 16830 18269 16960.49 175.9965
Then the major problem: as you can see these dataframes have a Date column, and I have to split all those 67 dataframes, each into three frames based on their dates: 2010-08-09 / 2020-11-17 / 2021-07-04.
for example, that "wholeDataList_merged[[2]]" (dataframe of line number 102. right above one) has to be trimmed into three dataframes with their date information:
I want Line102_Phase1(before 20-08-09), Line102_Phase2(btw 20-08-09 and 20-11-17), Line102_Phase3(btw 20/11/17 and 21/07/04).
... for all those 67 dataframes. (sigh)
I know R function cannot make multiple outputs. so I hope I can do this(splitting a dataframe into three sub-frames based on date value)
like,
function_name <- function(dataframe) {
temp_list <- list[]
df_Phase1 <-
dataframe["2020-02-18" <= dataframe$Date
& dataframe$Date <= "2020-08-09",]
temp_list <- append(temp_list, df_Phase1)
df_Phase2 <-
dataframe["2020-08-10" <= dataframe$Date
& dataframe$Date <= "2020-11-17",]
temp_list <- append(temp_list, df_Phase2)
df_Phase3 <-
dataframe["2020-11-18" <= dataframe$Date
& dataframe$Date <= "2021-07-04",]
temp_list <- append(temp_list, df_Phase3)
return(temp_list)
#and attatch those three splitted frames right next to the original dataframe(then it would be 67*4),
# or make a new 67*3 list with those splitted frames... whatever.
}
And most af all, these splitted dataframes have to contain 1. their line numbers(102, 104, ..) and 2. each phases(1, 2, 3) in the dataframe variable name. for example "Line104_Phase2".
How can I do this? Well do I have to unlist those frames and extract three of them by date from each dataframes with for loops and dynamic variables?
splitting... I think it could be done with lots of effort somehow, but I cannot even grasp anything with the variable names. Help me.
Since you didn't provide a reproducible examples, here are some data that hopefully replicate your problem.
set.seed(42)
dateIntervals<-as.Date(c("2010-08-09", "2020-11-17", "2021-07-04"))
possibleDates<-seq(dateIntervals[1]-1000, dateIntervals[3], by = "day")
genDF<-function() data.frame(Date = sample(possibleDates, 100), Value = runif(100))
listdf<-replicate(2, genDF(), simplify = FALSE)
Now listdf, which should play the role of your wholeDataList_merged, has only two elements and each element just two columns, but it shouldn't make any difference. Next, you can try:
lapply(listdf, function(x) split(x, findInterval(x$Date, dateIntervals)))
And you will see each element being split into three elements depending on the date.
I am using the pivot function from the lessR package, to create an Excel-like pivot table with two categorical variables that make up the vertical and horizontal categories, and a mean in each cell. (Hope this makes sense).
I followed the code that the documentation (https://cran.r-project.org/web/packages/lessR/vignettes/pivot.html) gives. Let's follow their example:
d <- Read("Employee")
a <- pivot(d, mean, Salary, Dept, Gender)
The data d is like this:
Years Gender Dept Salary JobSat Plan Pre Post
Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
Wu, James NA M SALE 94494.58 low 1 62 74
Hoang, Binh 15 M SALE 111074.86 low 3 96 97
Jones, Alissa 5 F <NA> 53772.58 <NA> 1 65 62
Downs, Deborah 7 F FINC 57139.90 high 2 90 86
Afshari, Anbar 6 F ADMN 69441.93 high 2 100 100
Knox, Michael 18 M MKTG 99062.66 med 3 81 84
Campagna, Justin 8 M SALE 72321.36 low 1 76 84
Kimball, Claire 8 F MKTG 61356.69 high 2 93 92
The pivottable a is a nice table, exactly as I want it to look in terms of cell contents, etc. It appears to be a knitr_kable.
Gender F M
Dept
------- --------- ---------
ACCT 63237.16 59626.20
ADMN 81434.00 80963.35
FINC 57139.90 72967.60
MKTG 64496.02 99062.66
SALE 64188.25 86150.97
Next, I would like to make a dataframe out of this, for easier manipulation in my code and for copying it to the clipboard. However, I don't know how to convert a knitr_kable to a dataframe. Here is my code and the error it results in:
as.data.frame(a)
Error in as.data.frame.default(a) :
cannot coerce class ‘"knitr_kable"’ to a data.frame
The knitr-documentation does not say anything about this conversion - it is only about converting a dataframe to a knitr_kable, which is the opposite of what I want.
I have also tried pivottabler, but this has similar issues: the resulting class cannot be coerced to a dataframe either.
Here are two potential answers:
Most direct: Wrangle the data yourself
If you're open to a tidyverse-style approach, it only takes a few lines to do the wrangling and summarising yourself. That will give you a datatable output that you can work with right away.
# load packages
library(lessR)
library(dplyr)
library(tidyr)
# load data
d <- Read("Employee")
# use tidyverse-style code to pivot and summarise the data yourself
d %>%
group_by(Gender, Dept) %>%
summarise(Salary_mean = mean(Salary)) %>%
pivot_wider(names_from= "Gender", values_from = "Salary_mean")
Read the knitr::kable() markdown output into a data frame
If you prefer to work backwards from a knitr::kable() output to a dataframe, this is addressed in this SO question: Markdown table to data frame in R
Looking for advice on refining my code and also trimming to a date range.
The spreadsheet itself is pulled from another system and so the structure of the excel cannot be changed. When you pull the data it basically starts at E2, with the first date column in F2, and the first item in E3. The data will continue to populate to the right for as long as it goes on for. I have replicated the structure below.
AndI want it to look like:
I have come up with the below, which works, but I was looking for advice on refining it down to fewer individual step by steps.
In the below code:
= extracting data
= pulling the dates out
= formatting from
excel number to an actual date
= grabbing the item names
= transposing data and skipping some parts
= adding in dates to the row names
#1
df <- data.frame(read_excel("C:/example.xlsx",
sheet = "Sheet1"))
#2
dfdate <- gtb[1, -c(1,2,3,4,5)]
#3
dfdate <- format(as.Date(as.numeric(dfdate),
origin = "1899-12-30"), "%d/%m/%Y")
#4
rownames(gtb) <- gtb[,1]
#5
gtb <- as.data.frame(t(gtb[, -c(1,2,3,4,5)]))
#6
rownames(gtb) <- dfdate
After the row names have been added the structure is such that I am happy to start creating the visuals where needed.
thanks for your advice
David
Here is one suggestion, I don't really have easy access to your data, but I am including code to remove those columns as you do, based on their names, which can be nicer than removing by index.
df <- read.table( text=
"Item_Code 01/01/2018 01/02/2018 01/03/2018 01/04/2018
Item 99 51 60 69
Item2 42 47 88 2
Item3 36 81 42 48
",header=TRUE, check.names=FALSE) %>%
rename( `Item Code` = Item_Code )
library(tibble)
library(lubridate)
x <- df %>% select( -matches("Code \\d|Internal Code") ) %>%
column_to_rownames("Item Code") %>%
t %>% as.data.frame %>%
rownames_to_column("Item Code") %>%
mutate( `Item Code` = dmy(`Item Code`) )
x
Output:
> x
Item Code Item Item2 Item3
1 2018-01-01 99 42 36
2 2018-02-01 51 47 81
3 2018-03-01 60 88 42
4 2018-04-01 69 2 48
I went a bit forth and back with this solution, but it can be nice to also showcase how to remove columns by a regex on their column names, since you are removing several similarly named columns.
The t trick, that you also use, works becuase there is really only one more column there that would cause problems with this, as others have commented, and this can be temporarily stowed away as rownames. If that weren't the case, you're looking at a more complex solution involving pivot_wider and pivot_longer or splitting the data.frame and transposing only one of the halves.
I got an XLSX with data from a questionnaire for my master thesis.
The questions and answers for an interviewee are in one row in the second column. The first column contains the date.
The data of the second column comes in a form like this:
"age":"52","height":"170","Gender":"Female",...and so on
I started with:
test12 <- read_xlsx("Testdaten.xlsx")
library(splitstackshape)
test13 <- concat.split(data = test12, split.col= "age", sep =",")
Then I got the questions and the answers as a column divided by a ":".
For e.g. column 1: "age":"52" and column2:"height":"170".
But the data is so messy that sometimes in the column of the age question and answer there is a height question and answer and for some questionnaires questions and answers double.
I would need the questions as variables and the answers as observations. But I have no clue how to get there. I could clean the data in excel first, but with the fact that columns are not constant and there are for e.g. some height questions in the age column I see no chance to do it as I will get new data regularly, formated the same way.
Here is an example of the data:
A tibble: 5 x 2
partner.createdAt partner.wphg.info
<chr> <chr>
1 2019-11-09T12:13:11.099Z "{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\""
2 2019-11-01T06:43:22.581Z "{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\""
3 2019-11-10T07:59:46.136Z "{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\""
4 2019-11-11T13:01:48.488Z "{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000~
5 2019-11-08T14:54:26.654Z "{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\""
Thank you so much for your time!
You can loop through each entry, splitting at , as you did. Then you can loop through them all again, splitting at :.
The result will be a bunch of variable/value pairings. This can be all done stacked. Then you just want to pivot back into columns.
data
Updated the data based on your edit.
data <- tribble(~partner.createdAt, ~partner.wphg.info,
'2019-11-09T12:13:11.099Z', '{\"age_years\":\"50\",\"job_des\":\"unemployed\",\"height_cm\":\"170\",\"Gender\":\"female\",\"born_in\":\"Italy\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"200000\"',
'2019-11-01T06:43:22.581Z', '{\"age_years\":\"34\",\"job_des\":\"self-employed\",\"height_cm\":\"158\",\"Gender\":\"male\",\"born_in\":\"Germany\",\"Alcoholic\":\"true\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"10000\"',
'2019-11-10T07:59:46.136Z', '{\"age_years\":\"24\",\"height_cm\":\"187\",\"Gender\":\"male\",\"born_in\":\"England\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"3\",\"total_wealth\":\"150000\"',
'2019-11-11T13:01:48.488Z', '{\"age_years\":\"59\",\"job_des\":\"employed\",\"height_cm\":\"167\",\"Gender\":\"female\",\"born_in\":\"United States\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"2\",\"total_wealth\":\"1000000\"',
'2019-11-08T14:54:26.654Z', '{\"age_years\":\"36\",\"height_cm\":\"180\",\"born_in\":\"Germany\",\"Alcoholic\":\"false\",\"knowledge_selfass\":\"5\",\"total_wealth\":\"170000\",\"job_des\":\"employed\",\"Gender\":\"male\"')
libraries
We need a few here. Or you can just call tidyverse.
library(stringr)
library(purrr)
library(dplyr)
library(tibble)
library(tidyr)
function
This function will create a data frame (or tibble) for each question. The first column is the date, the second is the variable, the third is the value.
clean_record <- function(date, text) {
clean_records <- str_split(text, pattern = ",", simplify = TRUE) %>%
str_remove_all(pattern = "\\\"") %>% # remove double quote
str_remove_all(pattern = "\\{|\\}") %>% # remove curly brackets
str_split(pattern = ":", simplify = TRUE)
tibble(date = as.Date(date), variable = clean_records[,1], value = clean_records[,2])
}
iteration
Now we use pmap_dfr from purrr to loop over the rows, outputting each row with an id variable named record.
This will stack the data as described in the function. The mutate() line converts all variable names to lowercase. The distinct() line will filter out rows that are exact duplicates.
What we do then is just pivot on the variable column. Of course, replace data with whatever you name your data frame.
data_clean <- pmap_dfr(data, ~ clean_record(..1, ..2), .id = "record") %>%
mutate(variable = tolower(variable)) %>%
distinct() %>%
pivot_wider(names_from = variable, values_from = value)
result
The result is something like this. Note how I had reordered some of the columns, but it still works. You are probably not done just yet. All columns are now of type character. You need to figure out the desired type for each and convert.
# A tibble: 5 x 10
record date age_years job_des height_cm gender born_in alcoholic knowledge_selfass total_wealth
<chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 1 2019-11-09 50 unemployed 170 female Italy false 5 200000
2 2 2019-11-01 34 self-employed 158 male Germany true 3 10000
3 3 2019-11-10 24 NA 187 male England false 3 150000
4 4 2019-11-11 59 employed 167 female United States false 2 1000000
5 5 2019-11-08 36 employed 180 male Germany false 5 170000
For example, convert age_years to numeric.
data_clean %>%
mutate(age_years = as.numeric(age_years))
I am sure you may run into other things, but this should be a start.
Seeing the huge possibilities by R to avoid endless Excel-Copy-and-paste-sessions I started exploring it, but now I face trouble formatting several (15+) .csv tables. They all have the same appearance:
Date;Value;Name
1.1.2011;30;GWM5
1.2.2011;31;GWM5
1.3.2011;35;GWM5
1.4.2011;31;GWM5
1.1.2011;23;GWM5
1.2.2011;24;GWM5
1.3.2011;21;GWM5
1.4.2011;22;GWM5
As you can see, the name is constant, the date is reoccurring, only the value is variable. I need to have it in a table like this:
Name;date;value1;value2
GWM5;1.1.2011;30;23
GWM5;1.2.2011;31;24
GWM5;1.3.2011;35;21
GWM5;1.4.2011;31;22
I already tried to order, transpose and determine duplicates. But transpose (even as a function: http://www.endmemo.com/program/R/transpose.php ) didn't gave me the right arrangement in individual cells (there were always 2 or more values in 1 cell) and determine duplicates only gave me single values, not rows.
Can you please help me? I would like to avoid putting them into the right order in Excel by hand.
The base R approach would be something like this:
mydf <- read.csv("your.file.txt", sep=";")
mydf$time <- with(mydf, ave(rep(1, nrow(mydf)), Date, FUN = seq_along))
reshape(mydf, direction = "wide", idvar=c("Name", "Date"), timevar="time")
# Date Name Value.1 Value.2
# 1 1.1.2011 GWM5 30 23
# 2 1.2.2011 GWM5 31 24
# 3 1.3.2011 GWM5 35 21
# 4 1.4.2011 GWM5 31 22
This seems about right, although there might be an easier way:
d <- read.csv2(text="
Date;Value;Name
1.1.2011;30;GWM5
1.2.2011;31;GWM5
1.3.2011;35;GWM5
1.4.2011;31;GWM5
1.1.2011;23;GWM5
1.2.2011;24;GWM5
1.3.2011;21;GWM5
1.4.2011;22;GWM5")
library(reshape2)
library(plyr)
d2 <- ddply(d,"Date",transform,n=paste0("Value",seq_along(Value)))
dcast(d2,Date+Name~n,value.var="Value")
## Date Name Value1 Value2
## 1 1.1.2011 GWM5 30 23
## 2 1.2.2011 GWM5 31 24
## 3 1.3.2011 GWM5 35 21
## 4 1.4.2011 GWM5 31 22