Pivot columns in Data Frame - r

I have the data frame below:
data<-data.frame(names= c("Bob","Bob", "Fred","Fred","Tom"), id =c(1,1,2,2,3),amount = c(100,200,400,500,700), status = c("Active","Not Active","Active","Retired","Active"))
data
names id amount status
1 Bob 1 100 Active
2 Bob 1 200 Not Active
3 Fred 2 400 Active
4 Fred 2 500 Retired
5 Tom 3 700 Active
I would like to pivot the "status" column so that the "amount" values appear under new columns, one per status, and the result looks like this:
names id Active Not Active Retired
Bob 1 100 200
Fred 2 400 500
Tom 3 700
Is this possible? What is the best way?

I am now compelled to turn a comment into an answer. Here's the Hadleyverse version (in current tidyr, spread() is superseded by pivot_wider(), though it still works):
library(tidyr)
spread(data, status, amount)
## names id Active Not Active Retired
## 1 Bob 1 100 200 NA
## 2 Fred 2 400 NA 500
## 3 Tom 3 700 NA NA

Here is a solution using dcast from the package reshape2:
library(reshape2)
dcast(data, names + id ~ status, value.var="amount")
# names id Active Not Active Retired
# 1 Bob 1 100 200 NA
# 2 Fred 2 400 NA 500
# 3 Tom 3 700 NA NA

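This would be the base method. Note that xtabs() returns a contingency table rather than a data frame, drops the id column, and shows 0 instead of NA for missing combinations: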
> xtabs(amount~names+status, data=data)
status
names Active Not Active Retired
Bob 100 200 0
Fred 400 0 500
Tom 700 0 0

Here is another base R option
reshape(data, idvar=c('names', 'id'), timevar='status', direction='wide')
# names id amount.Active amount.Not Active amount.Retired
#1 Bob 1 100 200 NA
#3 Fred 2 400 NA 500
#5 Tom 3 700 NA NA
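If the amount. prefix and the leftover row names in the reshape() output are unwanted, they can be cleaned up afterwards. A minimal base R sketch, assuming the data frame from the question; the name wide is only illustrative:

```r
# Data frame from the question
data <- data.frame(
  names  = c("Bob", "Bob", "Fred", "Fred", "Tom"),
  id     = c(1, 1, 2, 2, 3),
  amount = c(100, 200, 400, 500, 700),
  status = c("Active", "Not Active", "Active", "Retired", "Active")
)

# One row per names/id pair, one amount.* column per status
wide <- reshape(data, idvar = c("names", "id"), timevar = "status",
                direction = "wide")

# Strip the "amount." prefix so the columns are named after the statuses
names(wide) <- sub("^amount\\.", "", names(wide))

# Reset the row names (reshape keeps the originals: 1, 3, 5)
rownames(wide) <- NULL
wide
```

Combinations that do not occur in the data (for example Tom and Retired) stay NA, as in the desired output.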

Add row with group sum in new column at the end of group category

I have been searching this information since yesterday but so far I could not find a nice solution to my problem.
I have the following dataframe:
CODE CONCEPT P. NR. NAME DEPTO. PRICE
1 Lunch 11 John SALES 160
1 Lunch 11 John SALES 120
1 Lunch 11 John SALES 10
1 Lunch 13 Frank IT 200
2 Internet 13 Frank IT 120
and I want to add a column with the sum of rows by group, for instance, the total amount of concept: Lunch, code: 1 by name in order to get an output like this:
CODE CONCEPT P. NR. NAME DEPTO. PRICE TOTAL
1 Lunch 11 John SALES 160 NA
1 Lunch 11 John SALES 120 NA
1 Lunch 11 John SALES 10 290
1 Lunch 13 Frank IT 200 200
2 Internet 13 Frank IT 120 120
So far, I tried with:
aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
But this retrieves just the total of the concepts like this:
NAME CODE TOTAL
John 1 290
Frank 1 200
Frank 2 120
And not the table with the rest of the data as I would like to have it.
I also tried adding an extra column with NA but somehow I cannot paste the total in a specific row position.
Any suggestions? I would like to have something I can do in BaseR.
Thanks!!
In base R you can use ave to add a new column. We insert the group sum only on the last row of each group.
df$TOTAL <- with(df, ave(PRICE, CODE, CONCEPT, PNR, NAME, FUN = function(x)
ifelse(seq_along(x) == length(x), sum(x), NA)))
df
# CODE CONCEPT PNR NAME DEPTO. PRICE TOTAL
#1 1 Lunch 11 John SALES 160 NA
#2 1 Lunch 11 John SALES 120 NA
#3 1 Lunch 11 John SALES 10 290
#4 1 Lunch 13 Frank IT 200 200
#5 2 Internet 13 Frank IT 120 120
Similar logic using dplyr
library(dplyr)
df %>%
group_by(CODE, CONCEPT, PNR, NAME) %>%
mutate(TOTAL = ifelse(row_number() == n(), sum(PRICE) ,NA))
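For a base R option, you may try merging the original data frame with the aggregate. Note that this replaces PRICE with the group total on every row, rather than adding the total only on the last row of each group: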
df2 <- aggregate(PRICE~NAME+CODE, data = df, FUN = sum)
out <- merge(df[ , !(names(df) %in% c("PRICE"))], df2, by=c("NAME", "CODE"))
out[with(out, order(CODE, NAME)), ]
NAME CODE CONCEPT PNR DEPT PRICE
1 Frank 1 Lunch 13 IT 200
3 John 1 Lunch 11 SALES 290
4 John 1 Lunch 11 SALES 290
5 John 1 Lunch 11 SALES 290
2 Frank 2 Internet 13 IT 120
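To stay in base R and still show the total only on the last row of each group, one option is to compute the group sum with ave() and then blank every row that is not the last of its group. This is a sketch under assumptions: the column names PNR and DEPTO are simplified stand-ins for the original headers, and groups are defined by CODE and NAME as in the aggregate() call from the question.

```r
# Sample data from the question (column names simplified)
df <- data.frame(
  CODE    = c(1, 1, 1, 1, 2),
  CONCEPT = c("Lunch", "Lunch", "Lunch", "Lunch", "Internet"),
  PNR     = c(11, 11, 11, 13, 13),
  NAME    = c("John", "John", "John", "Frank", "Frank"),
  DEPTO   = c("SALES", "SALES", "SALES", "IT", "IT"),
  PRICE   = c(160, 120, 10, 200, 120)
)

# Group total repeated on every row of the group
df$TOTAL <- ave(df$PRICE, df$CODE, df$NAME, FUN = sum)

# A row is the last of its group if no later row has the same CODE/NAME
is_last <- !duplicated(df[c("CODE", "NAME")], fromLast = TRUE)

# Keep the total only on the last row of each group
df$TOTAL[!is_last] <- NA
df
```

Unlike the merge approach, this keeps the original PRICE column and the original row order.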

Find and tag a number between a range

I have two dfs as below
>codes1
Country State City Start No End No
IN Telangana Hyderabad 100 200
IN Maharashtra Pune (Bund Garden) 300 400
IN Haryana Gurgaon 500 600
IN Maharashtra Pune 700 800
IN Gujarat Ahmedabad (Vastrapur) 900 1000
I want to use these ranges to tag the numbers in a second table:
>codes2
ID No
1 157
2 346
3 389
4 453
5 562
6 9874
7 98745
I want to tag the numbers in the No column of codes2 according to the ranges given in codes1. The expected output is:
ID No Country State City
1 157 IN Telangana Hyderabad
2 346 IN Maharashtra Pune(Bund Garden)
.
.
.
Basically, I want to tag each No in codes2 with the codes1 row whose range (Start No to End No) it falls in.
The rows of codes2 may be in any order.
You could use the non-equi join capability of the data.table package for that (assuming the range columns are named StartNo and EndNo, without spaces):
library(data.table)
setDT(codes1)
setDT(codes2)
codes2[codes1, on = .(No >= StartNo, No <= EndNo), ## (1) inclusive bounds, so No == StartNo or No == EndNo also match
`:=`(cntry = Country, state = State, city = City)] ## (2)
(1) obtains matching row indices in codes2 corresponding to each row in codes1, while matching on the condition provided to the on argument.
(2) updates codes2 values for those matching rows for the columns specified directly by reference (i.e., you don't have to assign the result back to another variable).
This gives:
codes2
# ID No cntry state city
# 1: 1 157 IN Telangana Hyderabad
# 2: 2 346 IN Maharashtra Pune (Bund Garden)
# 3: 3 389 IN Maharashtra Pune (Bund Garden)
# 4: 4 453 NA NA NA
# 5: 5 562 IN Haryana Gurgaon
# 6: 6 9874 NA NA NA
# 7: 7 98745 NA NA NA
If you're comfortable writing SQL, you might consider using the sqldf package to do something like:
library('sqldf')
result <- sqldf('select * from codes2 left join codes1 on codes2.No between codes1.StartNo and codes1.EndNo')
You may have to remove special characters and spaces from the column names of your data frames beforehand.
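If neither package is available, the same range lookup can be sketched in base R. The following is only an illustration: it assumes the range columns have been renamed to StartNo and EndNo, treats both bounds as inclusive, and takes the first matching range if several overlap. The names idx and tagged are hypothetical.

```r
# Sample data from the question (range columns renamed without spaces)
codes1 <- data.frame(
  Country = "IN",
  State   = c("Telangana", "Maharashtra", "Haryana", "Maharashtra", "Gujarat"),
  City    = c("Hyderabad", "Pune (Bund Garden)", "Gurgaon", "Pune",
              "Ahmedabad (Vastrapur)"),
  StartNo = c(100, 300, 500, 700, 900),
  EndNo   = c(200, 400, 600, 800, 1000)
)
codes2 <- data.frame(ID = 1:7,
                     No = c(157, 346, 389, 453, 562, 9874, 98745))

# For each No, find the row of codes1 whose range contains it (NA if none)
idx <- vapply(codes2$No, function(n) {
  hit <- which(n >= codes1$StartNo & n <= codes1$EndNo)
  if (length(hit) > 0) hit[1] else NA_integer_
}, integer(1))

# Indexing with NA yields an all-NA row, so unmatched numbers stay untagged
tagged <- cbind(codes2, codes1[idx, c("Country", "State", "City")])
rownames(tagged) <- NULL
tagged
```

This scans every range for every number, which is fine for small tables; for large inputs the data.table non-equi join is likely the better choice.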

If conditions and copying values from different rows

I have the following data:
Data <- data.frame(Project=c(123,123,123,123,123,123,124,124,124,124,124,125,125,125),
Name=c("Harry","David","David","Harry","Peter","Peter","John","Alex","Alex","Mary","Mary","Dan","Joe","Joe"),
Value=c(1,4,7,3,8,9,8,3,2,5,6,2,2,1),
OldValue=c("","Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","","Open","In Progress"),
NewValue=c("Open","In Progress","Complete","Open","In Progress","Complete","Open","In Progress","System Declined","In Progress","Complete","Open","In Progress","Complete"))
The data should look like this
I want to create another column called EditedBy that applies the following logic.
If the project in row 1 equals the project in row 2 and the New Value in row 1 equals "Open", then take the name from row 2. If either of these two conditions is false, then stick with the name in the first row.
So the data should look like this
How can I do this?
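We can do this with data.table: within each Project, start a new group whenever NewValue is "Open" or the previous row's NewValue was "System Declined", then take the second Name in each group: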
library(data.table)
setDT(Data)[, EditedBy := Name[2L] ,.(Project, grp=cumsum(NewValue == "Open"|
shift(NewValue == "System Declined", fill=TRUE)))]
Data
# Project Name Value OldValue NewValue EditedBy
# 1: 123 Harry 1 Open David
# 2: 123 David 4 Open In Progress David
# 3: 123 David 7 In Progress Complete David
# 4: 123 Harry 3 Complete Open Peter
# 5: 123 Peter 8 Open In Progress Peter
# 6: 123 Peter 9 In Progress Complete Peter
# 7: 124 John 8 Complete Open Alex
# 8: 124 Alex 3 Open In Progress Alex
# 9: 124 Alex 2 In Progress System Declined Alex
#10: 124 Mary 5 System Declined In Progress Mary
#11: 124 Mary 6 In Progress Complete Mary
#12: 125 Dan 2 Open Joe
#13: 125 Joe 2 Open In Progress Joe
#14: 125 Joe 1 In Progress Complete Joe

Erasing duplicates with NA values

I have a data frame like this:
names <- c('Mike','Mike','Mike','John','John','John','David','David','David','David')
dates <- c('04-26','04-26','04-27','04-28','04-27','04-26','04-01','04-02','04-02','04-03')
values <- c(NA,1,2,4,5,6,1,2,NA,NA)
test <- data.frame(names,dates,values)
Which is:
names dates values
1 Mike 04-26 NA
2 Mike 04-26 1
3 Mike 04-27 2
4 John 04-28 4
5 John 04-27 5
6 John 04-26 6
7 David 04-01 1
8 David 04-02 2
9 David 04-02 NA
10 David 04-03 NA
I'd like to get rid of duplicates with NA values. So, in this case, I have a valid observation from Mike on 04-26 and also have a valid observation from David on 04-02, so rows 1 and 9 should be erased and I will end up with:
names dates values
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
I tried to use the duplicated function, something like this:
test[!duplicated(test[,c('names','dates')]),]
But that does not work since some NA values come before the valid value. Do you have any suggestions without trying things like merge or making another data frame?
Update: I'd like to keep rows with NA that are not duplicates.
What about this way?
library(dplyr)
test %>% group_by(names, dates) %>% filter((n()>=2 & !is.na(values)) | n()==1)
Source: local data frame [8 x 3]
Groups: names, dates [8]
names dates values
(fctr) (fctr) (dbl)
1 Mike 04-26 1
2 Mike 04-27 2
3 John 04-28 4
4 John 04-27 5
5 John 04-26 6
6 David 04-01 1
7 David 04-02 2
8 David 04-03 NA
Here is an attempt in data.table:
# set up
library(data.table)
setDT(test)
# construct condition
test[, dupes := max(duplicated(.SD)), .SDcols=c("names", "dates"), by=c("names", "dates")]
# print out result
test[dupes == 0 | !is.na(values),]
Here is a similar method using base R, except that the dupes variable is kept separately from the data.frame:
dupes <- duplicated(test[c("names", "dates")])
# this generates warnings, but works nonetheless
dupes <- ave(dupes, test$names, test$dates, FUN=max)
# print out result
test[dupes == 0 | !is.na(test$values),]
If there are duplicated rows where the values variable is NA, and these duplicates add nothing to the data, then you can drop them prior to running the code above:
testNoNADupes <- test[!(duplicated(test) & is.na(test$values)),]
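This should work based on your sample. Sorting by values puts the NA rows last, so an NA row is dropped only when the same names/dates pair has already appeared (note that the result stays sorted by values):
test <- test[order(test$values),]
test <- test[!(duplicated(test[c("names", "dates")]) & is.na(test$values)),]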
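Another way to express the requirement without sorting is: keep a row if its value is not NA, or if its names/dates group contains no non-NA value at all. A minimal base R sketch of that idea, using the sample data from the question; has_value, keep and result are illustrative names:

```r
# Sample data from the question
test <- data.frame(
  names  = c("Mike", "Mike", "Mike", "John", "John", "John",
             "David", "David", "David", "David"),
  dates  = c("04-26", "04-26", "04-27", "04-28", "04-27", "04-26",
             "04-01", "04-02", "04-02", "04-03"),
  values = c(NA, 1, 2, 4, 5, 6, 1, 2, NA, NA),
  stringsAsFactors = FALSE
)

# TRUE on every row whose names/dates group has at least one non-NA value
has_value <- ave(!is.na(test$values), test$names, test$dates, FUN = any)

# Keep valid rows, plus NA rows from groups that have nothing better
keep <- !is.na(test$values) | !has_value
result <- test[keep, ]
result
```

Rows 1 and 9 are dropped because their groups contain a valid observation, while row 10 (David, 04-03) is kept because NA is the only value for that pair. If a group consisted of several NA rows and nothing else, all of them would be kept.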

Adding a ranking column to a dataframe

This seems like it must be a very common task, but I can't find a solution in google or SO. I want to add a column called 'rank' to 'dat1' based on the sequence that 'order.scores' applies to 'dat'. I tried using row.names(), but the rownames are based on 'dat', not 'dat1'. I also tried 'dat$rank <-rank(dat1)', but this produces an error message.
fname<-c("Joe", "Bob", "Bill", "Tom", "Sue","Sam","Jane","Ruby")
score<-c(500, 490, 500, 750, 550, 500, 210, 320)
dat<-data.frame(fname,score)
order.scores<-order(dat$score,dat$fname)
dat1<-dat[order.scores,]
You can compute a ranking from an ordering as follows:
dat$rank <- NA
dat$rank[order.scores] <- 1:nrow(dat)
dat
# fname score rank
# 1 Joe 500 5
# 2 Bob 490 3
# 3 Bill 500 4
# 4 Tom 750 8
# 5 Sue 550 7
# 6 Sam 500 6
# 7 Jane 210 1
# 8 Ruby 320 2
Try:
## dat, dat1, and order.scores as defined
dat <- data.frame(fname=c("Joe", "Bob", "Bill", "Tom", "Sue","Sam","Jane","Ruby"),
score=c(500, 490, 500, 750, 550, 500, 210, 320))
order.scores <- order(dat$score, dat$fname)
dat1 <- dat[order.scores,]
dat1$rank <- rank(dat1$score)
dat1
## fname score rank
## 7 Jane 210 1
## 8 Ruby 320 2
## 2 Bob 490 3
## 3 Bill 500 5
## 1 Joe 500 5
## 6 Sam 500 5
## 5 Sue 550 7
## 4 Tom 750 8
This shows the ties in rank based on $score. If you don't want ties in $rank, then you might as well say dat1$rank <- 1:nrow(dat1) since they are already in order.
You can also use arrange and mutate from dplyr:
library(dplyr)
dat <- arrange(dat, desc(score)) %>%
mutate(rank = 1:nrow(dat))
dat
You can use:
dat$Rank <- rank(dat$score)
dat$Rank
You could also do:
dat$rank <- order(order.scores)
dat$rank
#[1] 5 3 4 8 7 6 1 2
For the given dataframe dat:
fname score
Joe 500
Bob 490
Bill 500
Tom 750
Sue 550
Sam 500
Jane 210
Ruby 320
We can also use dplyr as below. It assigns rank 1 to the smallest value, which is 210 in this case, and breaks ties by row order.
ranks = dat %>%
mutate(ranks = order(order(score)))
The output will be as below:
fname score ranks
Joe 500 4
Bob 490 3
Bill 500 5
Tom 750 8
Sue 550 7
Sam 500 6
Jane 210 1
Ruby 320 2
If the converse is required, i.e., rank 1 should be assigned to the highest value which is 750 in this case, then the code will be changed slightly as below:
ranks = dat %>%
mutate(ranks = order(order(score, decreasing = TRUE)))
The output in this case will be as below:
fname score ranks
Joe 500 3
Bob 490 6
Bill 500 4
Tom 750 1
Sue 550 2
Sam 500 5
Jane 210 8
Ruby 320 7
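The differences between the answers above mostly come down to how ties are handled. Base R's rank() exposes this through its ties.method argument; the sketch below, using the scores from the question, shows three of the options side by side. With ties.method = "first" the result matches the order(order(score)) idiom.

```r
score <- c(500, 490, 500, 750, 550, 500, 210, 320)

# Default: tied scores share the average of their positions (4, 5, 6 -> 5)
r_avg <- rank(score)

# "min": tied scores all get the lowest position, as in competition ranking
r_min <- rank(score, ties.method = "min")

# "first": ties are broken by order of appearance, so ranks are unique
r_first <- rank(score, ties.method = "first")

# Negating the scores ranks from highest to lowest
r_desc <- rank(-score, ties.method = "first")

data.frame(score, r_avg, r_min, r_first, r_desc)
```

Which variant is appropriate depends on whether tied scores should share a rank or each receive a distinct one.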
Generally, rank can be used to order the numeric values of a column from lowest to highest.
For example, if Salary is a column holding four- and five-digit salaries, the rank function gives the position of each salary among the others.
Note that the line below is Python (pandas), not R; with ascending = False the highest salary gets rank 1:
df['Salary'].rank(ascending = False).astype(int)