Melting dataframe in R

I have the following R data frame:
foo <- data.frame("Department" = c('IT', 'IT', 'Sales'),
                  "Name.boy" = c('John', 'Mark', 'Louis'),
                  "Age.boy" = c(21, 23, 44),
                  "Name.girl" = c('Jane', 'Charlotte', 'Denise'),
                  "Age.girl" = c(16, 25, 32))
which looks like the following:
Department Name.boy Age.boy Name.girl Age.girl
IT John 21 Jane 16
IT Mark 23 Charlotte 25
Sales Louis 44 Denise 32
How do I 'melt' the data frame so that, for a given Department, I have three columns: Name, Age, and Sex?
Department Name Age Sex
IT John 21 Boy
IT Jane 16 Girl
IT Mark 23 Boy
IT Charlotte 25 Girl
Sales Louis 44 Boy
Sales Denise 32 Girl

We can use pivot_longer from tidyr. The ".value" sentinel in names_to keeps the part of each column name before the dot (Name, Age) as its own output column, while the part after the dot fills the new Sex column:
library(tidyr)
pivot_longer(foo, cols = -Department, names_to = c(".value", "Sex"),
             names_sep = "\\.")
# A tibble: 6 x 4
# Department Sex Name Age
# <chr> <chr> <chr> <dbl>
#1 IT boy John 21
#2 IT girl Jane 16
#3 IT boy Mark 23
#4 IT girl Charlotte 25
#5 Sales boy Louis 44
#6 Sales girl Denise 32

Using base R's reshape:
reshape(foo, direction = "long", varying = 2:5, timevar = "Sex")
Department Sex Name Age id
1.boy IT boy John 21 1
2.boy IT boy Mark 23 2
3.boy Sales boy Louis 44 3
1.girl IT girl Jane 16 1
2.girl IT girl Charlotte 25 2
3.girl Sales girl Denise 32 3
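The reshape() output keeps the lowercase boy/girl labels and an id column. If the exact layout from the question is needed, a short clean-up step can follow (a sketch building on the call above; tools::toTitleCase is one way to capitalise):
out <- reshape(foo, direction = "long", varying = 2:5, timevar = "Sex")
out$Sex <- tools::toTitleCase(out$Sex)   # "boy" -> "Boy"
out <- out[order(out$Department, out$id),
           c("Department", "Name", "Age", "Sex")]
rownames(out) <- NULL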

Related

wide to long multiple columns issue

I have something like this:
id role1 Approved by Role1 role2 Approved by Role2
1 Amy 1/1/2019 David 4/4/2019
2 Bob 2/2/2019 Sara 5/5/2019
3 Adam 3/3/2019 Rachel 6/6/2019
I want something like this:
id Name Role Approved
1 Amy role1 1/1/2019
2 Bob role1 2/2/2019
3 Adam role1 3/3/2019
1 David role2 4/4/2019
2 Sara role2 5/5/2019
3 Rachel role2 6/6/2019
I thought something like this would work
melt(df,id.vars= id,
measure.vars= list(c("role1", "role2"),c("Approved by Role1", "Approved by Role2")),
variable.name= c("Role","Approved"),
value.name= c("Name","Date"))
but I am getting: Error: measure variables not found in data: c("role1", "role2"), c("Approved by Role1", "Approved by Role2")
I have tried replacing this with the column numbers as well and haven't had any luck.
Any suggestions? Thanks!
I really like the new tidyr::pivot_longer() function. It's still only available in the dev version of tidyr, but should be released shortly. First I'm going to clean up the column names slightly, so they have a consistent structure:
> df
# A tibble: 3 x 5
id name_role1 approved_role1 name_role2 approved_role2
<dbl> <chr> <chr> <chr> <chr>
1 1 Amy 1/1/2019 David 4/4/2019
2 2 Bob 2/2/2019 Sara 5/5/2019
3 3 Adam 3/3/2019 Rachel 6/6/2019
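(The renaming step itself isn't shown; assuming the column names from the question, one way to get that consistent structure is:)
library(dplyr)
df <- df %>%
  rename(name_role1 = role1,
         approved_role1 = `Approved by Role1`,
         name_role2 = role2,
         approved_role2 = `Approved by Role2`)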
Then it's easy to convert to long format with pivot_longer():
library(tidyr)
df %>%
  pivot_longer(
    -id,
    names_to = c(".value", "role"),
    names_sep = "_"
  )
Output:
id role name approved
<dbl> <chr> <chr> <chr>
1 1 role1 Amy 1/1/2019
2 1 role2 David 4/4/2019
3 2 role1 Bob 2/2/2019
4 2 role2 Sara 5/5/2019
5 3 role1 Adam 3/3/2019
6 3 role2 Rachel 6/6/2019

Replace multiple strings/values based on separate list

I have a data frame that looks similar to this:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 John Smith GROUP1 2015 1 John Smith 5 Adam Smith 12 Mike Smith 20 Sam Smith 7 Luke Smith 3 George Smith
Each row repeats for new logs, but the values in X.1 : Y.3 change often.
The values in ID and in X.1:Y.3 consist of a number followed by a name, i.e., "1 John Smith" or "20 Sam Smith" will be the string.
I have an issue where in certain instances, the ID will remain as "1 John Smith" but in X.1 : Y.3 the number may change preceding "John Smith", so for example it might be "14 John Smith". The names will always be correct, it's just the number that sometimes gets mixed up.
I have a list of 200+ ID's that are impacted by this mismatch - what is the most efficient way to replace the values in X.1 : Y.3 so that they match the correct ID in column ID?
I won't know which column "14 John Smith" shows up in, it could be X.1, or Y.2, or Y.3 depending on the row.
I can use a replace function in a dplyr line of code, or gsub for each of the 200+ IDs and for each column affected, but that seems very inefficient. Is there a quicker way than repeating something like the below x times?
df%>%mutate(X.1=replace(X.1, grepl('John Smith', X.1), "1 John Smith"))%>%as.data.frame()
Sometimes it helps to temporarily reshape the data. That way we can operate on all the X and Y values without iterating over them.
library(stringr)
library(tidyr)
## some data to work with
exd <- read.csv(text = "EVENT,ID,GROUP,YEAR,X.1,X.2,X.3,Y.1,Y.2,Y.3
1,1 John Smith,GROUP1,2015,19 John Smith,11 Adam Smith,9 Sam Smith,5 George Smith,13 Mike Smith,12 Luke Smith
2,2 John Smith,GROUP1,2015,1 George Smith,9 Luke Smith,19 Adam Smith,7 Sam Smith,17 Mike Smith,11 John Smith
3,3 John Smith,GROUP1,2015,5 George Smith,18 John Smith,12 Sam Smith,6 Luke Smith,2 Mike Smith,4 Adam Smith",
stringsAsFactors = FALSE)
## re-arrange to put X and Y columns into a single column
exd <- gather(exd, key = "var", value = "value", X.1, X.2, X.3, Y.1, Y.2, Y.3)
## find the X and Y values that contain the ID name
matches <- str_detect(exd$value, str_replace_all(exd$ID, "^\\d+ *", ""))
## replace X and Y values with the matching ID
exd[matches, "value"] <- exd$ID[matches]
## put it back in the original shape
exd <- spread(exd, key = "var", value = value)
exd
## EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
## 1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
## 2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
## 3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Not sure if you're set on dplyr and piping, but here is a plyr solution that does what you need. Given this example dataset:
> df
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 19 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 11 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 18 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
This adply function goes row by row and replaces any matching X:Y column values with the one from the ID column:
library(plyr)
adply(df, .margins = 1, function(x) {
  idcol <- as.character(x$ID)
  searchname <- trimws(gsub('[[:digit:]]+', "", idcol))
  sapply(x[5:10], function(y) {
    ifelse(grepl(searchname, y), idcol, as.character(y))
  })
})
Output:
EVENT ID GROUP YEAR X.1 X.2 X.3 Y.1 Y.2 Y.3
1 1 1 John Smith GROUP1 2015 1 John Smith 11 Adam Smith 9 Sam Smith 5 George Smith 13 Mike Smith 12 Luke Smith
2 2 2 John Smith GROUP1 2015 1 George Smith 9 Luke Smith 19 Adam Smith 7 Sam Smith 17 Mike Smith 2 John Smith
3 3 3 John Smith GROUP1 2015 5 George Smith 3 John Smith 12 Sam Smith 6 Luke Smith 2 Mike Smith 4 Adam Smith
Data:
names <- c("EVENT","ID",'GROUP','YEAR', paste(rep(c("X.", "Y."), each = 3), 1:3, sep = ""))
first <- c("John", "Sam", "Adam", "Mike", "Luke", "George")
set.seed(2017)
randvals <- t(sapply(1:3, function(x) paste(sample(1:20, size = 6),
paste(sample(first, replace = FALSE, size = 6), "Smith"))))
df <- cbind(data.frame(1:3, paste(1:3, "John Smith"), "GROUP1", 2015), randvals)
names(df) <- names
I think that the most efficient way to accomplish this is by building a loop. The reason is that you will have to repeat the function to replace the names for every name in your ID list. With a loop, you can automate this.
I will make some assumptions first:
The ID list can be read as a character vector.
You don't have any typos in the ID list or in your data.frame, including differences in upper- and lowercase letters in the names.
Your ID list does not contain the numbers; if it does contain numbers, use gsub to erase them first.
The example works with a data.frame (DF) with the same structure as the one in your question.
ID <- c("John Smith", "Adam Smith", "George Smith")
for (i in seq_along(ID)) {
  ## replace every X/Y value containing the i-th name with that ID
  DF[, 5:10] <- lapply(DF[, 5:10], function(col)
    replace(as.character(col), grepl(ID[i], col), ID[i]))
}
With each round this loop will:
Identify the positions in the columns X.1:Y.3 (columns 5 to 10 in your question) where the name "i" appears.
Then, it will change all those values to the one in the "i" position of the ID vector.
So, the first iteration will do: 1) Search for every position where the name "John Smith" appears in the data frame. 2) Replace all those "# John Smith" with "John Smith".
Note: If you simply want to delete the numbers, you can use gsub to replace them. Take into account that you probably want to erase the space between the number and the name too. One way to do this, applied column by column with a regular expression:
DF[, 5:10] <- lapply(DF[, 5:10], function(col) gsub("^[0-9]+ ", "", col))

Best method to Merge two Datasets (Maybe if function?)

I have two data sets I am working with: TestA and TestB (below is how to create them in R).
Instructor <- c('Mr.A','Mr.A','Mr.B', 'Mr.C', 'Mr.D')
Class <- c('French','French','English', 'Math', 'Geometry')
Section <- c('1','2','3','5','5')
Time <- c('9:00-10:00','10:00-11:00','9:00-10:00','9:00-10:00','10:00-11:00')
Date <- c('MWF','MWF','TR','TR','MWF')
Enrollment <- c('30','40','24','29','40')
TestA <- data.frame(Instructor,Class,Section,Time,Date,Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment)
Student <- c("Frances","Cass","Fern","Pat","Peter","Kory","Cole")
ID <- c('123','121','101','151','456','789','314')
Instructor <- c('','','','','','','')
Time <- c('','','','','','','')
Date <- c('','','','','','','')
Enrollment <- c('','','','','','','')
Class <- c('French','French','French','French','English', 'Math', 'Geometry')
Section <- c('1','1','2','2','3','5','5')
TestB <- data.frame(Student, ID, Instructor, Class, Section, Time, Date, Enrollment)
rm(Instructor,Class,Section,Time,Date,Enrollment,ID,Student)
I would like to merge both datasets (if possible, without using merge()) so that all the columns of TestA are filled with the information provided by TestB, matched on Class and Section.
I tried using merge(TestA, TestB, by=c('Class','Section'), all.x=TRUE), but it adds observations to the original TestA. This is just a test; the datasets I am actually using have hundreds of observations. It worked with these smaller frames, but something goes wrong with the bigger set, which is why I'd like a merge alternative.
Any ideas on how to do this?
The output should look like this
Class Section Instructor Time Date Enrollment Student ID
English 3 Mr.B 9:00-10:00 TR 24 Peter 456
French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
I was once a big fan of merge() until I learned about dplyr's join functions.
Try this instead:
library(dplyr)
TestA %>%
  left_join(TestB, by = c("Class", "Section")) %>% # join by just the "Class" and "Section" columns of TestA and TestB
  select(Class,
         Section,
         Instructor = Instructor.x,
         Time = Time.x,
         Date = Date.x,
         Enrollment = Enrollment.x,
         Student,
         ID) %>%
  arrange(Class, Section) # added to match your output
The select statement is keeping only those columns that are specifically named and, in some cases, renaming them.
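Alternatively (a sketch, in the same spirit as the data.table answer below): drop TestB's empty duplicate columns before the join, so no .x/.y suffixes appear and no select() clean-up is needed:
TestA %>%
  left_join(select(TestB, Student, ID, Class, Section),
            by = c("Class", "Section")) %>%
  arrange(Class, Section)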
Output:
Class Section Instructor Time Date Enrollment Student ID
1 English 3 Mr.B 9:00-10:00 TR 24 Peter 456
2 French 1 Mr.A 9:00-10:00 MWF 30 Frances 123
3 French 1 Mr.A 9:00-10:00 MWF 30 Cass 121
4 French 2 Mr.A 10:00-11:00 MWF 40 Fern 101
5 French 2 Mr.A 10:00-11:00 MWF 40 Pat 151
6 Geometry 5 Mr.D 10:00-11:00 MWF 40 Cole 314
7 Math 5 Mr.C 9:00-10:00 TR 29 Kory 789
The key is to drop the empty but duplicate columns from TestB before merging / joining as shown by SymbolixAU.
Here is an implementation in data.table syntax:
library(data.table)
setDT(TestB)[, .(Student, ID, Class, Section)][setDT(TestA), on = .(Class, Section)]
Student ID Class Section Instructor Time Date Enrollment
1: Frances 123 French 1 Mr.A 9:00-10:00 MWF 30
2: Cass 121 French 1 Mr.A 9:00-10:00 MWF 30
3: Fern 101 French 2 Mr.A 10:00-11:00 MWF 40
4: Pat 151 French 2 Mr.A 10:00-11:00 MWF 40
5: Peter 456 English 3 Mr.B 9:00-10:00 TR 24
6: Kory 789 Math 5 Mr.C 9:00-10:00 TR 29
7: Cole 314 Geometry 5 Mr.D 10:00-11:00 MWF 40
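If the column order from the question matters, data.table's setcolorder() can rearrange the result in place (a sketch; res is assumed to hold the join result):
res <- setDT(TestB)[, .(Student, ID, Class, Section)][setDT(TestA), on = .(Class, Section)]
setcolorder(res, c("Class", "Section", "Instructor", "Time", "Date",
                   "Enrollment", "Student", "ID"))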

abind error - arg 'XXX' has dims=1912, 35, 1; but need dims=0, 35, X

I am trying to use abind to create a 3-D array out of a large 2D array. The source data is structured like this
Firstname Lastname Country City Measure Wk1 Wk2... Wkn
foo bar UK London Height 23 34 34
foo bar UK London Weight 67 67 67
foo bar UK London Fat 6 7 9
John doe US NY Height 546 776 978
John doe US NY Weight 123 656 989
John doe US NY Fat 34 45 67
There are 1912 rows per Measure and 25 weeks of data. I am trying to create a 3D array such that I can measure city-wise trends of the Measures (height, weight, etc.).
When I use abind(split(df, df$City), along = 3), it gives me the error:
abind error - arg 'XXX' has dims=1912, 35, 1; but need dims=0, 35, X
I have verified that there are 1912 rows per measure and that the number of columns is the same throughout. Any help will be greatly appreciated.
Are you sure that you want to use arrays to measure city trends?
Usually the right approach to analysing data like yours is to unpivot the weeks into long format.
I'll start by importing your data into R...
tc <- textConnection("Firstname Lastname Country City Measure Wk1 Wk2 Wk3
foo bar UK London Height 23 34 34
foo bar UK London Weight 67 67 67
foo bar UK London Fat 6 7 9
John doe US NY Height 546 776 978
John doe US NY Weight 123 656 989
John doe US NY Fat 34 45 67")
df <- read.table(tc, header = TRUE)
Then install and load a couple of useful packages.
install.packages("tidyr")
install.packages("dplyr")
library(tidyr)
library(dplyr)
Now to unpivot your data using the gather command from tidyr.
> long_df <- gather(df, Week, Value, -c(1:5))
> long_df
Firstname Lastname Country City Measure Week Value
1 foo bar UK London Height Wk1 23
2 foo bar UK London Weight Wk1 67
3 foo bar UK London Fat Wk1 6
4 John doe US NY Height Wk1 546
5 John doe US NY Weight Wk1 123
6 John doe US NY Fat Wk1 34
7 foo bar UK London Height Wk2 34
8 foo bar UK London Weight Wk2 67
9 foo bar UK London Fat Wk2 7
10 John doe US NY Height Wk2 776
11 John doe US NY Weight Wk2 656
12 John doe US NY Fat Wk2 45
13 foo bar UK London Height Wk3 34
14 foo bar UK London Weight Wk3 67
15 foo bar UK London Fat Wk3 9
16 John doe US NY Height Wk3 978
17 John doe US NY Weight Wk3 989
18 John doe US NY Fat Wk3 67
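(As an aside: gather() has since been superseded in tidyr; the same unpivot with pivot_longer would look roughly like this, assuming the week columns all start with "Wk":)
long_df <- pivot_longer(df, cols = starts_with("Wk"),
                        names_to = "Week", values_to = "Value")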
Now you can use dplyr to produce any summaries of the data that you please...
> long_df %>%
+ group_by(Country, City, Measure) %>%
+ summarise(mean_val = mean(Value))
Source: local data frame [6 x 4]
Groups: Country, City
Country City Measure mean_val
1 UK London Fat 7.333333
2 UK London Height 30.333333
3 UK London Weight 67.000000
4 US NY Fat 48.666667
5 US NY Height 766.666667
6 US NY Weight 589.333333
Or summaries by Country and Measure...
> long_df %>%
+ group_by(Country, Measure) %>%
+ summarise(mean_val = mean(Value), med_val = median(Value), count = n())
Source: local data frame [6 x 5]
Groups: Country
Country Measure mean_val med_val count
1 UK Fat 7.333333 7 3
2 UK Height 30.333333 34 3
3 UK Weight 67.000000 67 3
4 US Fat 48.666667 45 3
5 US Height 766.666667 776 3
6 US Weight 589.333333 656 3
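If a 3-D array really is required, note that abind() needs every piece to have identical dimensions; split() on a data frame returns data frames, and groups of unequal size will produce a dims mismatch like the error above. A sketch that converts each city's numeric week columns to a matrix first (this assumes every city has the same number of rows):
library(abind)
pieces <- lapply(split(df[grep("^Wk", names(df))], df$City), as.matrix)
arr <- abind(pieces, along = 3)   # rows x weeks x cities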

Remove observations from DF if duplicate in specific columns while other columns must differ

I have a large data frame with multiple columns and many rows (200k). I order the rows by a group variable, and each group can have one or more entries. The other columns for each group should have identical values, however in some cases they don't. It looks like this:
group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ
I want to delete all entries of a group if age or city is not identical for all rows of the group (indication of observation error). Otherwise, I want to keep all the entries.
The output I'm hoping for would be:
group name age color city
2 Martin 78 black LA
2 Martin 78 blue LA
4 Jake 33 blue NJ
The closest I have come is this:
dup <- df[ duplicated(df[,c("group","name","color")]) | duplicated(df[,c("group","name","color")],fromLast=TRUE) ,"group"]
df_nodup <- df[!(df$group %in% dup),]
However, this is far from doing everything that I need.
P.s.: I had the same question answered for py/pandas. I'd like to have a solution for R, as well however.
Edit: While Frank's answer was helpful for understanding the principle of a solution, and his second suggestion worked, it was very slow (took ~15 min on my data frame).
user20650's answer was harder to comprehend, but runs tremendously faster (~10 s).
A similar approach to Frank's: you can count the number of unique combinations of age and city within each group using ave, then subset your data to keep only the groups where there is exactly one unique combination.
# your data
df <- read.table(text="group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ ", header=T)
# calculate and subset
df[with(df, ave(paste(age, city), group, FUN=function(x) length(unique(x))))==1,]
# group name age color city
# 4 2 Martin 78 black LA
# 5 2 Martin 78 blue LA
# 8 4 Jake 33 blue NJ
Here is an approach:
temp <- tapply(df$group, list(df$name, df$age, df$city), unique)
temp[!is.na(temp)] <- 1
keepers <- names(which(apply(temp, 1, sum, na.rm=TRUE)==1))
df[df$name %in% keepers, ]
# group name age color city
#4 2 Martin 78 black LA
#5 2 Martin 78 blue LA
#8 4 Jake 33 blue NJ
Alternate, slightly simpler approach:
temp2 <- unique(df[,c('name','age','city')])
keepers2 <- names(which(tapply(temp2$name, temp2$name, length)==1))
df[df$name %in% keepers2, ]
# group name age color city
#4 2 Martin 78 black LA
#5 2 Martin 78 blue LA
#8 4 Jake 33 blue NJ
Here's an approach using dplyr:
df <- read.table(text = "
group name age color city
1 Anton 50 orange NY
1 Anton 21 red NY
1 Anton 21 red NJ
2 Martin 78 black LA
2 Martin 78 blue LA
3 Maria 29 red NC
3 Maria 29 pink LV
4 Jake 33 blue NJ
", header = TRUE)
library(dplyr)
df %>%
  group_by(group) %>%
  filter(n_distinct(age) == 1 & n_distinct(city) == 1)
I think it's pretty easy to see what's going on: you group, then filter to keep only the groups where there is a single distinct age and a single distinct city.
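Given the performance concerns in the question's edit, a data.table equivalent may also be worth trying (a sketch, not from the original answers; uniqueN counts distinct values per group):
library(data.table)
setDT(df)[, if (uniqueN(age) == 1 && uniqueN(city) == 1) .SD, by = group]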
