R- How to reshape Long to Wide with multiple variables/columns - r

I started off with the following subset of my data
UserID labelnospaces responses
1 Were you given any info? yes
1 By using this service..? yes
1 How satisfied are you? Very satisfied
2 Were you given any info? no
2 By using this service..? no
2 How satisfied are you? unsatisfied
By using the code below, I was able to get from long to wide perfectly
service_L_to_W<- reshape(data=service, idvar="UserID",
timevar = "labelnospaces",
direction = "wide")
Using the code above, I got (this is what I wanted)
UserID Were you given any info? By using this service..? How satisfied are you?
1 yes yes very satisfied
2 no no unsatisfied
My question is how do I edit my code so that I can convert my data (with the extra variables/columns) from long to wide:
UserID Full Name DOB EncounterID QuestionID Name Type labelnospaces responses
1 John Smith 1-1-90 13 505 Intro Check Were you given any info? yes
1 John Smith 1-1-90 13 506 Care Check By using this service.. yes
1 John Smith 1-1-90 13 507 Out Check How satisfied are you? vsat
2 Jane Doe 2-2-80 14 505 Intro Check Were you given any info? no
2 Jane Doe 2-2-80 14 506 Care Check By using this service.. no
2 Jane Doe 2-2-80 14 507 Out Check How satisfied are you? unsat

Some variables are can be better to together
df %>%
pivot_wider(id_cols = c(UserID, Full.Name, DOB, EncounterID), names_from = c(QuestionID, QName, labelnospaces), values_from = responses)
UserID Full.Name DOB EncounterID `505_Intro_Were you given any info?` `506_Care_By using this service..`
<int> <chr> <chr> <int> <chr> <chr>
1 1 John Smith 1-1-90 13 yes yes
2 2 Jane Doe 2-2-80 14 no no
`507_Out_How satisfied are you?`
<chr>
1 vsat
2 unsat

Related

R cleaning, change data format horizontal to vertical, repeating some data

Right now, my data looks like this:
Coder Bill Witness1name Witness1job Witness2name Witness2Job
Joe 123 Fred Plumber Bob Coach
Karen 122 Sally Barista Helen Translator
Harry 431 Lisa Swimmer N/A N/A
Frank 301 N/A N/A N/A N/A
But I want my data to look like this:
Coder Bill WitnessName WitnessJob
Joe 123 Fred Plumber
Joe 123 Bob Coach
Karen 122 Sally Barista
Karen 122 Helen Translator
Harry 431 Lisa Swimmer
Frank 301 N/A N/A
So I want to take it from the coder/bill level to the "witness" level. Some coder/bills have up to 10 witnesses in their rows. Some have no witnesses, but I do not want to completely drop them from the dataset (see Frank).
All help is appreciated! I am familiar with the tidyverse package.
For those interested, I figured it out.
I had to change all the column names like this:
Witness1Name to Witness1_Name
Witness1Job to Witness1_Job
etc.
Then I ran this:
cleandata <- pivot_longer(mddata, cols = -c(Coder, Bill),
names_to = c("Witness", ".value"),
names_pattern = 'Witness(\\d)_(.*)') %>%
drop_na(Name)
And it gave me this:
Coder Bill Witness Name Job
Joe 123 1 Fred Plumber
Joe 123 2 Bob Coach
Karen 122 1 Sally Barista
Karen 122 2 Helen Translator
Harry 431 1 Lisa Swimmer
Close enough to what I wanted

Reshaping a dataset of patients with different numbers of diagnosis from long to wide [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 3 years ago.
I am a beginner, confronted with a big task and all the typical long to wide reshaping tools I found using the search function did not really do the job for me. I would be glad if someone could help me.
I try to achieve the following:
I have patientdata in which every patient has a unique patient number but multiple stays in hospital have lead to multiple cases per person. I want to work with these cases. Problem is, I have all the diagnoses per case but not everybody has the same number of diagnosis and I don't know how to tell R to create a new dagnosis (and date of diagnosis) variable each time there is already a diagnosis. Every help is highly appreciated!
So, I have a huge dataset that looks roughly like that:
Patient Case Diagnosis DateOfDiagnosis
1 John Doe 1 A 2010-10-10
2 John Doe 1 B 2010-10-10
3 John Doe 1 C 2010-10-10
4 Peter Griffin 2 D 2010-10-11
5 Peter Griffin 2 E 2010-10-11
6 Homer Simpson 3 F 2010-10-12
7 Homer Simpson 4 G 2010-10-13
I need row by case and I need all the diagnosis and their dates in separate variables. This would be no problem but there is no pattern in the cases or diagnosis so some patients have only one case others 5 and some cases have 1 others 5 diagnoses with respective date.
So what I need looks like this:
Patient Case Diag1 DateOfDiag1 Diag2 DateOfDiag2 Diag3 DateOfDiag3 ....
1 John Doe 1 A 2010-10-10 B 2010-10-10 C 2010-10-10
2 Peter Grif 2 D 2010-10-11 E 2010-10-11 NA NA
3 Homer Simp 3 F 2010-10-12 NA NA NA NA
4 Homer Simp 4 G 2010-10-13 NA NA NA NA
The code for my example is:
Patient <- c('John Doe','John Doe','John Doe', 'Peter Griffin','Peter Griffin', 'Homer Simpson', 'Homer Simpson')
Case <- c(1,1,1,2,2,3,4)
Diagnosis <- c('A','B','C','D','E','F','G')
DateOfDiagnosis <- as.Date(c('2010-10-10','2010-10-10','2010-10-10','2010-10-11','2010-10-11','2010-10-12','2010-10-13'))
df<-data.frame(Patient, Case, Diagnosis, DateOfDiagnosis)
Every help is highly appreciated!
Kind regards,
Jan
You could use pivot_wider, after creating a unique column.
library(dplyr)
library(tidyr)
df %>%
group_by(Patient, Case) %>%
mutate(row = row_number()) %>%
pivot_wider(values_from = c(Diagnosis, DateOfDiagnosis), names_from = row)
# Patient Case Diagnosis_1 Diagnosis_2 Diagnosis_3 DateOfDiagnosis_1 DateOfDiagnosis_2 DateOfDiagnosis_3
# <fct> <dbl> <fct> <fct> <fct> <date> <date> <date>
#1 John Doe 1 A B C 2010-10-10 2010-10-10 2010-10-10
#2 Peter Griffin 2 D E NA 2010-10-11 2010-10-11 NA
#3 Homer Simpson 3 F NA NA 2010-10-12 NA NA
#4 Homer Simpson 4 G NA NA 2010-10-13 NA NA

wide to long multiple columns issue

I have something like this:
id role1 Approved by Role1 role2 Approved by Role2
1 Amy 1/1/2019 David 4/4/2019
2 Bob 2/2/2019 Sara 5/5/2019
3 Adam 3/3/2019 Rachel 6/6/2019
I want something like this:
id Name Role Approved
1 Amy role1 1/1/2019
2 Bob role1 2/2/2019
3 Adam role1 3/3/2019
1 David role2 4/4/2019
2 Sara role2 5/5/2019
3 Rachel role2 6/6/2019
I thought something like this would work
melt(df,id.vars= id,
measure.vars= list(c("role1", "role2"),c("Approved by Role1", "Approved by Role2")),
variable.name= c("Role","Approved"),
value.name= c("Name","Date"))
but i am getting Error: measure variables not found in data:c("role1", "role2"),c("Approved by Role1", "Approved by Role2")
I have tried replacing this with the number of the columns as well and haven't had any luck.
Any suggestions?? Thanks!
I really like the new tidyr::pivot_longer() function. It's still only available in the dev version of tidyr, but should be released shortly. First I'm going to clean up the column names slightly, so they have a consistent structure:
> df
# A tibble: 3 x 5
id name_role1 approved_role1 name_role2 approved_role2
<dbl> <chr> <chr> <chr> <chr>
1 1 Amy 1/1/2019 David 4/4/2019
2 2 Bob 2/2/2019 Sara 5/5/2019
3 3 Adam 3/3/2019 Rachel 6/6/2019
Then it's easy to convert to long format with pivot_longer():
library(tidyr)
df %>%
pivot_longer(
-id,
names_to = c(".value", "role"),
names_sep = "_"
)
Output:
id role name approved
<dbl> <chr> <chr> <chr>
1 1 role1 Amy 1/1/2019
2 1 role2 David 4/4/2019
3 2 role1 Bob 2/2/2019
4 2 role2 Sara 5/5/2019
5 3 role1 Adam 3/3/2019
6 3 role2 Rachel 6/6/2019

Replace multiple strings/values based on separate list

I have a data frame that looks similar to this:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 John Smith GROUP1 2015 1 John Smith 5 Adam Smith 12 Mike Smith 20 Sam Smith 7 Luke Smith 3 George Smith
Each row repeats for new logs, but the values in X.1 : Y.3 change often.
The ID's and the ID's present in X.1 : Y.3 have a numeric value and then the name ID, i.e., "1 John Smith" or "20 Sam Smith" will be the string.
I have an issue where in certain instances, the ID will remain as "1 John Smith" but in X.1 : Y.3 the number may change preceding "John Smith", so for example it might be "14 John Smith". The names will always be correct, it's just the number that sometimes gets mixed up.
I have a list of 200+ ID's that are impacted by this mismatch - what is the most efficient way to replace the values in X.1 : Y.3 so that they match the correct ID in column ID?
I won't know which column "14 John Smith" shows up in, it could be X.1, or Y.2, or Y.3 depending on the row.
I can use a replace function in a dplyr line of code, or gsub for each 200+ ID's and for each column effected, but it seems very inefficient. Is there a quicker way than repeated something like the below x times?
df%>%mutate(X.1=replace(X.1, grepl('John Smith', X.1), "1 John Smith"))%>%as.data.frame()
Sometimes it helps to temporarily reshape the data. That way we can operate on all the X and Y values without iterating over them.
library(stringr)
library(tidyr)
## some data to work with
exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
stringsAsFactors = FALSE)
## re-arrange to put X and Y columns into a single column
exd <- gather(exd, key = "var", value = "value", X.1, X.2, X.3, Y.1, Y.2, Y.3)
## find the X and Y values that contain the ID name
matches <- str_detect(exd$value, str_replace_all(exd$ID, "^\\d+ *", ""))
## replace X and Y values with the matching ID
exd[matches, "value"] <- exd$ID[matches]
## put it back in the original shape
exd <- spread(exd, key = "var", value = value)
exd
## EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
## 1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
## 2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
## 3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Not sure if you're set on dplyr and piping, but I think this is a plyr solution that does what you need. Given this example dataset:
> df
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 19 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 11 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 18 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
This adply function goes row by row and replaces any matching X:Y column values with the one from the ID column:
library(plyr)
adply(df, .margins = 1, function(x) {
idcol <- as.character(x$ID)
searchname <- trimws(gsub('[[:digit:]]+', "", idcol))
sapply(x[5:10], function(y) {
ifelse(grepl(searchname, y), idcol, as.character(y))
})
})
Output:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Data:
names <- c("EVENT","ID",'GROUP','YEAR', paste(rep(c("X.", "Y."), each = 3), 1:3, sep = ""))
first <- c("John", "Sam", "Adam", "Mike", "Luke", "George")
set.seed(2017)
randvals <- t(sapply(1:3, function(x) paste(sample(1:20, size = 6),
paste(sample(first, replace = FALSE, size = 6), "Smith"))))
df <- cbind(data.frame(1:3, paste(1:3, "John Smith"), "GROUP1", 2015), randvals)
names(df) <- names
I think that the most efficient way to accomplish this is by building a loop. The reason is that you will have to repeat the function to replace the names for every name in your ID list. With a loop, you can automate this.
I will make some assumptions first:
The ID list can be read as a character vector
You don't have any typos in the ID list or in your data.frame, including
different lowercase and uppercase letters in the names.
Your ID list does not contain the numbers. In case that it does contain numbers, you have to use gsub to erase them.
The example can work with a data.frame (DF) with the same structure that
you put in your question.
>
ID <- c("John Smith", "Adam Smith", "George Smith")
for(i in 1:length(ID)) {
DF[, 5:10][grep(ID[i], DF[, 5:10])] <- ID[i]
}
With each round this loop will:
Identify the positions in the columns X.1:Y.3 (columns 5 to 10 in your question) where the name "i" appears.
Then, it will change all those values to the one in the "i" position of the ID vector.
So, the first iteration will do: 1) Search for every position where the name "John Smith" appears in the data frame. 2) Replace all those "# John Smith" with "John Smith".
Note: If you simply want to delete the numbers, you can use gsub to replace them. Take into account that you probably want to erase the first space between the number and the name too. One way to do this is using gsub and a regular expression:
DF[, 5:10] <- gsub("[0-9]+ ", "", DF[, 5:10])

R count number of Team members based on Team name

I have a df where each row represents an individual and each column a characteristic of these individuals. One of the columns is TeamName, which is the name of the Team that individual belongs to. Multiple individuals belong to a Team.
I'd like a function in R that creates a new column with the number of team members for each Team.
So, for example I have:
df
Name Surname TeamName
John Smith Champions
Mary Osborne Socceroos
Mark Johnson Champions
Rory Bradon Champions
Jane Bryant Socceroos
Bruce Harper
I'd like to have
df1
Name Surname TeamName TeamNo
John Smith Champions 3
Mary Osborne Socceroos 2
Mark Johnson Champions 3
Rory Bradon Champions 3
Jane Bryant Socceroos 2
Bruce Harper 0
So as you can see the counting includes that individual too, and if someone (e.g. Bruce Harper) has no Team name, then he gets a 0.
How can I do that? Thanks!
This is a solution based on using data.table which perhaps is too much for what you need, but here it goes:
library(data.table)
dt=data.table(df)
# First, let's convert the factors of TeamName, to characters
dt[,TeamName:=as.character(TeamName)]
# Now, let find all the team numbers
dt[,TeamNo:=.N, by='TeamName']
# Let's exclude the special cases
dt[is.na(TeamName),TeamNo:=NA]
dt[TeamName=="",TeamNo:=NA]
It is clearly not the best solution, but I hope this helps
If you need to know the number of unique members in the first two columns based on the 'TeamName' column, one option is n_distinct from dplyr
library(dplyr)
library(tidyr)
df %>%
unite(Var, Name, Surname) %>% #paste the columns together
group_by(TeamName) %>% #group by TeamName
mutate(TeamNo= n_distinct(Var)) %>% #create the TeamNo column
separate(Var, into=c('Name', 'Surname')) #split the 'Var' column
Or if it just the number of rows per 'TeamName', we can group by 'TeamName', get the number of rows per group with n(), create the 'TeamNo' column with mutate based on that n(), and if needed an ifelse condition can be used to give NA for 'TeamName' that are '' or NA.
df %>%
group_by(TeamName) %>%
mutate(TeamNo = ifelse(is.na(TeamName)|TeamName=='', NA_integer_, n()))
# Name Surname TeamName TeamNo
#1 John Smith Champions 3
#2 Mary Osborne Socceroos 2
#3 Mark Johnson Champions 3
#4 Rory Bradon Champions 3
#5 Jane Bryant Socceroos 2
#6 Bruce Harper NA
Or you can use ave from base R. Suppose if there are '' and NA, I would first convert the '' to NA and then use ave to get the length of 'TeamNo' grouped by that column. It will give NA for `NA' values. For example.
v1 <- c(df$TeamName, NA)# appending an NA with the example to show the case
is.na(v1) <- v1=='' #convert the `'' to `NA`
as.numeric(ave(v1, v1, FUN=length))
#[1] 3 2 3 3 2 NA NA
Using sqldf:
library(sqldf)
sqldf("SELECT Name, Surname, TeamName, n
FROM df
LEFT JOIN
(SELECT TeamName, COUNT(Name) AS n
FROM df
WHERE NOT TeamName IS '' GROUP BY TeamName)
USING (TeamName)")
Output:
Name Surname TeamName n
1 John Smith Champions 3
2 Mary Osborne Socceroos 2
3 Mark Johnson Champions 3
4 Rory Bradon Champions 3
5 Jane Bryant Socceroos 2
6 Bruce Harper NA

Resources