How to Restructure R Data Frame in R [duplicate]

This question already has answers here:
reshape wide to long with character suffixes instead of numeric suffixes
(3 answers)
Closed 5 years ago.
I have data in this format:
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
and I would like it in this format:
boss employee
1 wil james
2 wil andy
3 james dean
4 james bert
5 billy herb
6 billy collin
7 tony mike
8 tony david
I have searched the forums, but I have not yet found anything that helps. I have tried using dplyr and some others, but I am still pretty new to R.
If this question has been answered and you could give me a link that would be greatly appreciated.
Thanks,
Wil

Here is a solution that uses tidyr. Specifically, the gather function is used to combine the two employee columns. This also generates a column based on the column headers (employee1 and employee2), which is called key. We remove that with select from dplyr.
library(tidyr)
library(dplyr)
df <- read.table(
text = "boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david",
header = TRUE,
stringsAsFactors = FALSE
)
df2 <- df %>%
  gather(key, employee, -boss) %>%
  select(-key)
> df2
boss employee
1 wil james
2 james dean
3 billy herb
4 tony mike
5 wil andy
6 james bert
7 billy collin
8 tony david
I would be shocked if there isn't a slicker, base solution but this should work for you.
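In current tidyr releases gather is superseded by pivot_longer. A minimal sketch, assuming tidyr 1.0.0 or later is installed (the names long and key are only illustrative). Unlike gather, pivot_longer keeps each boss's employees on adjacent rows, which is the order the question asks for:

```r
library(tidyr)

df <- read.table(
  text = "boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david",
  header = TRUE,
  stringsAsFactors = FALSE
)

# Stack every column except boss; the old column names land in `key`.
long <- pivot_longer(df, cols = -boss, names_to = "key", values_to = "employee")

# Drop the helper column, leaving only boss and employee.
long$key <- NULL
long
```

The result is a tibble rather than a plain data frame; as.data.frame(long) converts it back if needed.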

Using base R:
df1 <- df[, 1:2]
df2 <- df[, c(1, 3)]
names(df1)[2] <- names(df2)[2] <- "employee"
rbind(df1, df2)
# boss employee
# 1 wil james
# 2 james dean
# 3 billy herb
# 4 tony mike
# 11 wil andy
# 21 james bert
# 31 billy collin
# 41 tony david
Using dplyr:
df %>%
  select(boss, employee1) %>%
  rename(employee = employee1) %>%
  bind_rows(
    df %>%
      select(boss, employee2) %>%
      rename(employee = employee2)
  )
# boss employee
# 1 wil james
# 2 james dean
# 3 billy herb
# 4 tony mike
# 5 wil andy
# 6 james bert
# 7 billy collin
# 8 tony david
Data:
df <- read.table(text = "
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
", header = TRUE, stringsAsFactors = FALSE)
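If the row order from the question matters (each boss's employees on adjacent rows), a base R sketch along these lines may help. It assumes every column other than boss holds an employee name; the name long is illustrative:

```r
df <- read.table(text = "
boss employee1 employee2
1 wil james andy
2 james dean bert
3 billy herb collin
4 tony mike david
", header = TRUE, stringsAsFactors = FALSE)

# t() turns the employee columns into a character matrix with one column per
# original row; c() then reads it column by column, i.e. row by row of df.
long <- data.frame(
  boss = rep(df$boss, each = ncol(df) - 1),
  employee = c(t(df[, -1])),
  stringsAsFactors = FALSE
)
long
```

Because rep uses each = ncol(df) - 1, the same code should also cope with more than two employee columns.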

Related

Using the %in% function for multiple columns to [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Find complement of a data frame (anti - join)
(7 answers)
Closed 1 year ago.
I am trying to use the %in% function to match the observations of one of my datasets to the observations of another dataset. Essentially, I would like to make two new datasets, one that contains the observations of the second dataset, and another which contains all other observations. Here is an example dataset:
Df
Last.Name First.Name Group
Williams Bob A
Williams Dan C
Miller Bob A
Smith Dan C
Williams Rick A
Smart Jeff C
Miller Bob A
Smith Dan C
Jones Bob A
Williams Buddy C
Miller Bob A
Hends Dan C
Williams Rick A
Smart Jeff C
Millers Bob A
Smith Danny C
Here is the dataset whose observations I am trying to match:
dfMatch
LastName FirstName
Williams Bob
Williams Buddy
Miller Bob
Smith Dan
Williams Rick
Smart Jeff
Miller Bob
Smith Dan
I tried various versions of the following code:
newdf<-Df[ Df$Last.Name %in% DfMatch$LastName & Df$First.Name %in% DfMatch$FirstName,]
and
newdf<-Df[ which(Df$Last.Name %in% DfMatch$LastName & Df$First.Name %in% DfMatch$FirstName),]
To get this new dataset:
newDF
Last.Name First.Name Group
Williams Bob A
Miller Bob A
Smith Dan C
Williams Rick A
Smart Jeff C
Miller Bob A
Smith Dan C
Williams Buddy C
However, this does not work.
I would also like to use similar code to build a dataset which includes all observations not listed in the dfMatch set, such as:
DfNoMatch
Last.Name First.Name Group
Williams Dan C
Jones Bob A
Miller Bob A
Hends Dan C
Williams Rick A
Smart Jeff C
Millers Bob A
Smith Danny C
By using code similar to:
DfNoMatch<-Df[ !Df$Last.Name %in% DfMatch$LastName & Df$First.Name %in% DfMatch$FirstName,]
and
DfNoMatch<-Df[! which(Df$Last.Name %in% DfMatch$LastName & Df$First.Name %in% DfMatch$FirstName),]
Thank you in advance and any help is greatly appreciated!
To actually match the observations, use the match function. The %in% operator only tells you whether there is a match; it doesn't tell you what is matched where.
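Testing each column separately with %in% accepts any row whose last name and first name each appear somewhere in dfMatch, even if they never appear together. A minimal sketch of matching on the pair instead, using a shortened version of the example data and assuming the separator "|" never occurs in a name (key, matchKey, idx and hit are illustrative names):

```r
Df <- data.frame(
  Last.Name  = c("Williams", "Williams", "Miller", "Smith", "Jones"),
  First.Name = c("Bob", "Dan", "Bob", "Dan", "Bob"),
  Group      = c("A", "C", "A", "C", "A"),
  stringsAsFactors = FALSE
)
dfMatch <- data.frame(
  LastName  = c("Williams", "Miller", "Smith"),
  FirstName = c("Bob", "Bob", "Dan"),
  stringsAsFactors = FALSE
)

# Build one key per row so that last and first name are compared together.
key      <- paste(Df$Last.Name, Df$First.Name, sep = "|")
matchKey <- paste(dfMatch$LastName, dfMatch$FirstName, sep = "|")

# match() gives the row of dfMatch that each row of Df corresponds to (NA if none).
idx <- match(key, matchKey)
hit <- !is.na(idx)          # same as key %in% matchKey

newDF     <- Df[hit, ]      # rows present in dfMatch
DfNoMatch <- Df[!hit, ]     # all other rows
```

Here "Williams Dan" ends up in DfNoMatch, whereas the column-by-column %in% test would have kept it, because "Williams" and "Dan" both occur in dfMatch on different rows.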

Using group_by to replace different strings by group

I'm trying to replace every instance of an author's name in a data.frame with a different string, but only when that author was the one speaking. For instance, if we have the data:
test <- data.frame(author = c("jon", "mike", "sam"), text = rep("jon and mike mike and sam sam sam", 3))
I'd like to replace every instance of "sam" with some other text when author=="sam".
I've tried using do and str_replace_all to do this, but haven't gotten it to work:
test %>% group_by(author) %>% do(mutate(., text2 = str_replace(.$text, eval(parse(text = .$author)), "yay")))
str_replace_all is vectorised over string, pattern and replacement (see ?str_replace_all), so you can just use the author column as the pattern:
library(dplyr)
library(stringr)
test %>% mutate(new_text = str_replace_all(text, author, 'yay'))
#  author                              text                          new_text
#1    jon jon and mike mike and sam sam sam yay and mike mike and sam sam sam
#2   mike jon and mike mike and sam sam sam   jon and yay yay and sam sam sam
#3    sam jon and mike mike and sam sam sam jon and mike mike and yay yay yay
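One caveat: str_replace_all treats the pattern as a regular expression by default, so an author name containing characters such as . or ( would need wrapping in stringr::fixed(). As an alternative without stringr, here is a base R sketch that assumes literal matching is what is wanted (new_text is the same illustrative column name as above):

```r
test <- data.frame(
  author = c("jon", "mike", "sam"),
  text = rep("jon and mike mike and sam sam sam", 3),
  stringsAsFactors = FALSE
)

# gsub() is not vectorised over the pattern, so pair each author with its own
# text via mapply(); fixed = TRUE matches the name literally, not as a regex.
test$new_text <- mapply(
  function(pattern, x) gsub(pattern, "yay", x, fixed = TRUE),
  test$author, test$text,
  USE.NAMES = FALSE
)
test
```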

Find Duplicates in R based on multiple characters

I can't seem to remember how to code this properly in R.
I want to remove duplicates within a csv file based on multiple entries: first name and last name, which are stored in separate columns.
I can code: file[!duplicated(file$First.Name), ]
but that only looks at the first name; I want it to look at the last name simultaneously.
If this is my starting file:
Steve Jones
Eric Brown
Sally Edwards
Steve Jones
Eric Davis
I want the output to be
Steve Jones
Eric Brown
Sally Edwards
Eric Davis
Only rows where both the first and the last name match an earlier row should be removed.
You can use
file[!duplicated(file[c("First.Name", "Last.Name")]), ]
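As a self-contained sketch of that one-liner on the sample data from the question (the column names First.Name and Last.Name are assumed, and keep and deduped are illustrative names):

```r
file <- data.frame(
  First.Name = c("Steve", "Eric", "Sally", "Steve", "Eric"),
  Last.Name  = c("Jones", "Brown", "Edwards", "Jones", "Davis"),
  stringsAsFactors = FALSE
)

# duplicated() on a data frame compares whole rows of the selected columns,
# so a row is flagged only when both names repeat an earlier row.
keep <- !duplicated(file[c("First.Name", "Last.Name")])
deduped <- file[keep, ]
deduped
```

Only the second "Steve Jones" is dropped; "Eric Brown" and "Eric Davis" both survive because their last names differ.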
Here is the solution for better performance (using data.table assuming First Name and Last Name are stored in separate columns):
> df <- read.table(text = 'Steve Jones
+ Eric Brown
+ Sally Edwards
+ Steve Jones
+ Eric Davis')
> colnames(df) <- c("First.Name","Last.Name")
> df
First.Name Last.Name
1 Steve Jones
2 Eric Brown
3 Sally Edwards
4 Steve Jones
5 Eric Davis
Here is where the data.table-specific code begins:
> library(data.table)
> dt <- setDT(df)
> unique(dt,by=c('First.Name','Last.Name'))
First.Name Last.Name
1: Steve Jones
2: Eric Brown
3: Sally Edwards
4: Eric Davis
If there is a single column, use sub to remove the substring (i.e. the first name) followed by a space, compute the logical vector !duplicated(..) on the result, and use it to subset the rows of the dataset.
df1[!duplicated(sub("\\w+\\s+", "", df1$Col1)),,drop=FALSE]
# Col1
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
If it is based on two columns and the dataset has only those two columns, just call duplicated directly on the dataset to get the logical vector, negate it, and subset the rows.
df1[!duplicated(df1), , drop=FALSE]
# first.name second.name
#1 Steve Jones
#2 Eric Brown
#3 Sally Edwards
#5 Eric Davis
Try:
File[!duplicated(paste(File$First.Name, File$Last.Name)), ]
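A paste-based key can merge distinct people when a name itself contains the separator. The made-up data below (not from the question) sketches the difference between pasting and comparing the columns directly:

```r
people <- data.frame(
  First.Name = c("Mary Ann", "Mary"),
  Last.Name  = c("Smith", "Ann Smith"),
  stringsAsFactors = FALSE
)

# Both rows paste to "Mary Ann Smith", so the second is flagged as a duplicate.
pasted <- duplicated(paste(people$First.Name, people$Last.Name))

# Comparing the columns themselves keeps the two people apart.
by_columns <- duplicated(people[c("First.Name", "Last.Name")])
```

If pasting is preferred anyway, choosing a sep that cannot occur in the data avoids the collision.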

cbind for multiple table() functions

I'm trying to count the frequency of multiple columns in a data.frame.
I used the table function on each column and bound them all by cbind, and was going to use the aggregate function after to calculate the means by my identifier.
Example:
df1
V1 V2 V3
George Mary Mary
George Mary Mary
George Mary George
Mary Mary George
Mary George George
Mary
Frequency<- as.data.frame(cbind(table(df1$V1), table(df1$V2), table(df1$V3)))
row.names V1
George 3
Mary 3
1
George 1
Mary 4
1
George 3
Mary 2
The result I get looks (visually) like a 2-column data frame, but when I check the dimensions of Frequency, the result implies that only the 2nd column exists.
This causes trouble when I try to rename the columns and run the aggregate function. The error I get for the rename:
colnames(Frequency) <- c("Name", "Frequency")
Error in names(Frequency) <- c("Name", "Frequency") :
'names' attribute [2] must be the same length as the vector [1]
The final purpose is to run an aggregate command and get the mean by name:
Name.Mean <- aggregate(Frequency$Frequency, list(Frequency$Name), mean)
Desired output:
Name Mean
George Value
Mary Value
Using mtabulate (data from @user3169080's post)
library(qdapTools)
d1 <- mtabulate(df1)
is.na(d1) <- d1==0
colMeans(d1, na.rm=TRUE)
# Alice George Mary
# 4.0 3.0 2.5
I hope this is what you were looking for:
> df1
V1 V2 V3
1 George George George
2 Mary Mary Alice
3 George George George
4 Mary Mary Alice
5 <NA> George George
6 <NA> Mary Alice
7 <NA> <NA> George
8 <NA> <NA> Alice
> ll=unlist(lapply(df1,table))
> nn=names(ll)
> nn1=sapply(nn,function(x) substr(x,4,nchar(x)))
> mm=data.frame(ll)
> mm$names=nn1
> tapply(mm$ll,mm$names,mean)
> Mean=tapply(mm$ll,mm$names,mean)
> data.frame(Mean)
Mean
Alice 4.0
George 3.0
Mary 2.5
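The same means can be reached in base R without parsing the names produced by unlist. A minimal sketch on the df1 shown in this answer, assuming that a name absent from a column should be ignored rather than counted as 0 (lv, counts and means are illustrative names):

```r
df1 <- data.frame(
  V1 = c("George", "Mary", "George", "Mary", NA, NA, NA, NA),
  V2 = c("George", "Mary", "George", "Mary", "George", "Mary", NA, NA),
  V3 = c("George", "Alice", "George", "Alice", "George", "Alice", "George", "Alice"),
  stringsAsFactors = FALSE
)

# All distinct names across the columns (sort() drops the NAs).
lv <- sort(unique(unlist(df1)))

# One column of counts per original column, one row per name.
counts <- vapply(
  df1,
  function(col) as.integer(table(factor(col, levels = lv))),
  integer(length(lv))
)
rownames(counts) <- lv

# Treat "name does not occur in this column" as missing, then average per name.
counts[counts == 0] <- NA
means <- rowMeans(counts, na.rm = TRUE)
means
```

Using a shared set of factor levels gives every per-column table the same length, which avoids the mismatched dimensions that cbind of separate table() calls can produce.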

Turn names into numbers in a dataframe based on the row index of the name in another dataframe

I have two dataframes. One is just the names of my Facebook friends and the other holds the links, with source and target columns. I want to turn the names in the links dataframe into numbers based on the row index of that name in the friends dataframe.
friends
name
1 Andrewt Thomas
2 Robbie McCord
3 Mohammad Mojadidi
4 Andrew John
5 Professor Owk
6 Joseph Charles
links
source target
1 Andrewt Thomas Andrew John
2 Andrewt Thomas James Zou
3 Robbie McCord Bz Benz
4 Robbie McCord Yousef AL-alawi
5 Robbie McCord Sherhan Asimov
6 Robbie McCord Aigerim Aig
Seems trivial, but I cannot figure it out. Thanks for help.
Just use a simple match
links$source <- match(links$source, friends$name)
links
# source target
# 1 1 Andrew John
# 2 1 James Zou
# 3 2 Bz Benz
# 4 2 Yousef AL-alawi
# 5 2 Sherhan Asimov
# 6 2 Aigerim Aig
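The same call works for the target column. A self-contained sketch on the data from the question, writing the indices to new columns (source_id and target_id are illustrative names) so the original names stay visible; names that are not in friends come back as NA:

```r
friends <- data.frame(
  name = c("Andrewt Thomas", "Robbie McCord", "Mohammad Mojadidi",
           "Andrew John", "Professor Owk", "Joseph Charles"),
  stringsAsFactors = FALSE
)
links <- data.frame(
  source = c("Andrewt Thomas", "Andrewt Thomas", rep("Robbie McCord", 4)),
  target = c("Andrew John", "James Zou", "Bz Benz",
             "Yousef AL-alawi", "Sherhan Asimov", "Aigerim Aig"),
  stringsAsFactors = FALSE
)

# match() returns, for each name, its row position in friends$name (or NA).
links$source_id <- match(links$source, friends$name)
links$target_id <- match(links$target, friends$name)
links
```

Only "Andrew John" among the targets appears in friends, so every other target_id is NA; whether those rows should be dropped or the names added to friends depends on the use case.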
Something like this?
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
Full example
links <- data.frame(source = c("John", "John", "Alice"), target = c("Jimmy", "Al", "Chris"))
links$source <- vapply(links$source, function(x) which(friends$name == x), integer(1))
links$source
[1] 3 3 2
