Error in data frame, replacement has xx, data has xx - r

I hope someone can help with this problem - I have been chewing over it for a few hours now!
I have a data frame called 'journeys' as follows which shows a customer ID, their date of travel, mode and journey start time:
ID | Date | Mode | Time
------ | --------- | ------- | -----
1234 | 12/10/16 | Bus | 120
1234 | 12/10/16 | Bus | 130
1234 | 12/10/16 | Bus | 290
1234 | 12/10/16 | Train | 310
1234 | 12/10/16 | Bus | 330
4567 | 12/10/16 | Bus | 220
4567 | 12/10/16 | Tram | 230
4567 | 13/10/16 | Bus | 290
4567 | 13/10/16 | Bus | 450
4567 | 14/10/16 | Train | 1000
So on 12/10, customer 1234 made 4 bus journeys and 1 train journey.
I have written a basic loop in R to create a 5th column which identifies whether the journey stages are linked, i.e. whether the 2nd journey is linked to the 1st, the 3rd to the 2nd, and so on (where 1 = linked, 0 = not linked), based on the following conditions:
the journeys are for the same person and take place on the same day
2 bus journeys / 2 tram journeys / a bus and a tram journey / a tram and a bus journey are within 60 mins of one another (so a bus and a train journey within 60 mins of one another would not be linked). The code is as follows:
df <- read.table("Journeys.txt", header = TRUE, sep = ",")
for (i in 2:dim(df)[1]) {
  if ((df$ID[i] == df$ID[i-1])
      & (df$Date[i] == df$Date[i-1])
      & ((df$Mode[i] == 'Bus'  & df$Mode[i-1] == 'Bus')  |
         (df$Mode[i] == 'Bus'  & df$Mode[i-1] == 'Tram') |
         (df$Mode[i] == 'Tram' & df$Mode[i-1] == 'Bus')  |
         (df$Mode[i] == 'Tram' & df$Mode[i-1] == 'Tram'))
      & (df$Time[i] - df$Time[i-1] < 60))
  {df$linked[i] <- 1}
  else {df$linked[i] <- 0}
}
This should give me the following output:
ID | Date | Mode | Time | Linked
------ | --------- | ------- | ----- | -----
1234 | 12/10/16 | Bus | 120 | 0
1234 | 12/10/16 | Bus | 130 | 1
1234 | 12/10/16 | Bus | 290 | 0
1234 | 12/10/16 | Train | 310 | 0
1234 | 12/10/16 | Bus | 330 | 0
4567 | 12/10/16 | Bus | 220 | 0
4567 | 12/10/16 | Tram | 230 | 1
4567 | 13/10/16 | Bus | 290 | 0
4567 | 13/10/16 | Bus | 450 | 0
4567 | 14/10/16 | Train | 1000 | 0
However, when I try to run this I keep getting the following error message:
Error in `$<-.data.frame`(`*tmp*`, "linked", value = c(NA, 1)) :
  replacement has 2 rows, data has 52231
When I ran this on a test dataset of about 150 rows, I didn't get this error message. I know it's related to the linked column, but I don't fully understand how to resolve it.

I used the same data as you, and your code worked when copied and pasted, except for the first row: you need to initialise the linked column first (df$linked <- 0 below).
Here is a better structure for the if conditions (faster to read and faster for R to process).
I have also added commented-out cat(i) calls; if you uncomment them, they are helpful for seeing what is happening in the loop.
One last thing: I think you are expecting a 0 and not a 1 for the 8th row, as it is not the same day...
df <- read.csv("train.csv", sep = ",")
df$linked <- 0
for (i in 2:dim(df)[1]) {
  if (df$ID[i] == df$ID[i-1]) {
    # cat(i)
    if (df$Date[i] == df$Date[i-1]) {
      # cat(i)
      if (df$Time[i] - df$Time[i-1] < 60) {
        # cat(i)
        if (df$Mode[i] == "Bus" & df$Mode[i-1] %in% c("Bus", "Tram")) {
          # cat(i)
          df$linked[i] <- 1
        } else {
          if (df$Mode[i] == "Tram" & df$Mode[i-1] %in% c("Bus", "Tram")) {
            df$linked[i] <- 1
            # cat(i)
          }
        }
      }
    }
  }
}
ID Date Mode Time linked
1 1234 12/10/2016 Bus 120 0
2 1234 12/10/2016 Bus 130 1
3 1234 12/10/2016 Bus 290 0
4 1234 12/10/2016 Train 310 0
5 1234 12/10/2016 Bus 330 0
6 4567 12/10/2016 Bus 220 0
7 4567 12/10/2016 Tram 230 1
8 4567 13/10/2016 Bus 290 0
9 4567 13/10/2016 Bus 450 0
10 4567 14/10/2016 Train 1000 0
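As for why the 150-row test "worked" while the full file errored: when df$linked does not exist yet, df$linked[i] <- 1 builds a replacement vector of length i (c(NA, 1) on the first iteration), and `$<-.data.frame` will only recycle that vector into the data frame when its length divides the number of rows. A minimal sketch with toy data (only the row counts matter: 150 is even, 52231 is odd):
df_odd <- data.frame(x = 1:3)    # odd row count, like the 52231-row file
df_odd$linked[2] <- 1
# Error in `$<-.data.frame`(`*tmp*`, "linked", value = c(NA, 1)) :
#   replacement has 2 rows, data has 3
df_even <- data.frame(x = 1:4)   # even row count, like the 150-row test
df_even$linked[2] <- 1           # silently recycles c(NA, 1) down the column
df_even$linked                   # [1] NA  1 NA  1
df_even$linked <- 0              # initialising first makes it a plain in-place assignment
df_even$linked[2] <- 1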


How to join dataframes by ID column? [duplicate]

I have 3 data frames, like the ones below, with IDs that may not necessarily occur in all of them.
DF1:
ID | Name | Phone# | Country | State | Amount_month1
0210 | John K. | 8942829725 | USA | PA | 1300
0215 | Peter | 8711234566 | USA | KS | 50
2312 | Steven | 9012341221 | USA | TX | 1000
0005 | Haris | 9167456363 | USA | NY | 1200
DF2:
ID | Name | Phone# | Country | State | Amount_month2
0210 | John K. | 8942829725 | USA | PA | 200
2312 | Steven | 9012341221 | USA | TX | 350
2112 | Jerry | 9817273794 | USA | CA | 100
DF3:
ID | Name | Phone# | Country | State | Amount_month3
0210 | John K. | 8942829725 | USA | PA | 300
0005 | Haris | 9167456363 | USA | NY | 1250
1212 | Jerry | 9817273794 | USA | CA | 1200
1210 | Drew | 8012341234 | USA | TX | 1400
I would like to join these 3 data frames by ID and add the varying amount columns as separate columns; the missing amount values can be either 0 or NA, such as:
ID | Name | Phone# | Country |State| Amount_month1 | Amount_month2 | Amount_month3
0210 | John K. | 8942829725 | USA | PA | 1300 | 200 | 300
0215 | Peter | 8711234566 | USA | KS | 50 | 0 | 0
2312 | Steven | 9012341221 | USA | TX | 1000 | 350 | 0
0005 | Haris | 9167456363 | USA | NY | 1200 | 0 | 1250
1212 | Jerry | 9817273794 | USA | CA | 0 | 100 | 1200
1210 | Drew | 8012341234 | USA | TX | 0 | 0 | 1400
It can be done in a single line using Reduce and merge:
Reduce(function(x, y) merge(x, y, all=TRUE), list(DF1, DF2, DF3))
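Since the desired output shows 0 rather than NA for the missing months, a small follow-up sketch (merged is just a name chosen here for the result): replace the NAs in the amount columns after merging.
merged <- Reduce(function(x, y) merge(x, y, all = TRUE), list(DF1, DF2, DF3))
# the question allows either NA or 0 for missing amounts; to use 0:
amount_cols <- grep("^Amount_month", names(merged))
merged[amount_cols][is.na(merged[amount_cols])] <- 0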
You can use left_join from the dplyr package, first joining the first two data frames, then joining that result with the third:
library(dplyr)
df_12 <- left_join(df1,df2, by = "ID")
df_123 <- left_join(df_12, df3, by = "ID")
Result:
df_123
ID Amount_month1 Amount_month2 Amount_month3
1 1 100 NA NA
2 2 200 50 NA
3 3 300 177 666
4 4 400 NA 77
Mock data:
df1 <- data.frame(
  ID = as.character(1:4),
  Amount_month1 = c(100, 200, 300, 400)
)
df2 <- data.frame(
  ID = as.character(2:3),
  Amount_month2 = c(50, 177)
)
df3 <- data.frame(
  ID = as.character(3:4),
  Amount_month3 = c(666, 77)
)
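Note that left_join keeps only the IDs present in df1, while the desired output also contains IDs that first appear in DF2 or DF3 (Jerry, Drew). If you need every ID, a full_join chain on the same mock data keeps them all:
library(dplyr)
df_123 <- df1 %>%
  full_join(df2, by = "ID") %>%
  full_join(df3, by = "ID")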

Removing duplicated data based on each group using R

I have a dataset which contains employee ID, name and bank account information. Some employees appear more than once under the same name, either with the same employee ID or with different IDs. Some of these duplicates share the same bank account number under the same name, while others have different bank account numbers under the same name. The aim is to find the employees who have the same name but different bank account numbers. Here's a sample of the data:
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 123 | Joan | 6758 |
| 134 | Karyn | 1244 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
| 235 | Larry | 5201 |
| 433 | Larry | 5201 |
| 231 | Larry | 5201 |
| 120 | Amy | 7890 |
| 135 | Amy | 7890 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
I had to find the employees who were duplicates based on their names, which I did successfully. Once that was done, I had to identify the employees with the same name but different bank account numbers. The issue is that my code is not grouping the employees by name before looking for different accounts. Instead, it compares account numbers across different individuals and, when it finds a match, removes one of the rows. For example, Chris and Cassy share the bank account number '1280', so it treats them as duplicates and automatically removes one of Chris's records (the row with account number 1280 in the output). The output I'm getting is shown below:
| Emp_id | Name | Bank Account |
|--------|:-----:|-------------:|
| 120 | Amy | 7890 |
| 900 | Cassy | 1280 |
| 900 | Cassy | 9873 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
This is the code that I have followed:
sample = data.frame(Id = c("123","134","143","143","235","433","231","120","135","150","150","900","900"),
                    Name = c("Joan","Karyn","Larry","Larry","Larry","Larry","Larry","Amy","Amy","Chris","Chris","Cassy","Cassy"),
                    Bank_Account = c("6758","1244","4900","5201","5201","5201","5201","7890","7890","1280","6565","1280","9873"))
n_occur <- data.frame(table(sample$Name))
n_occur=n_occur[n_occur$Freq > 1,]
Duplicates=sample[sample$Name %in% n_occur$Var1[n_occur$Freq > 1],]
Duplicates=Duplicates %>% arrange(Duplicates$Name, Duplicates$Name)
Duplicates=Duplicates[!duplicated(Duplicates$Bank_Account),]
The actual output, however, should consider the bank account numbers within each name. The output should look something like this:
| Emp_id | Name | Bank Account |
|--------|:-------:|-------------:|
| 900 | Cassy |1280 |
| 900 | Cassy |9873 |
| 150 | Chris | 1280 |
| 150 | Chris | 6565 |
| 143 | Larry | 4900 |
| 143 | Larry | 5201 |
Can someone please direct me towards the right code?
We can use n_distinct to filter:
library(dplyr)
sample %>%
  group_by(Name) %>%
  filter(n() > 1) %>%
  group_by(Id, add = TRUE) %>%
  filter(n_distinct(Bank_Account) > 1) %>%
  arrange(desc(Id))
# A tibble: 6 x 3
# Groups: Name, Id [3]
# Id Name Bank_Account
# <fct> <fct> <fct>
#1 900 Cassy 1280
#2 900 Cassy 9873
#3 150 Chris 1280
#4 150 Chris 6565
#5 143 Larry 4900
#6 143 Larry 5201
Step 1 - Identifying duplicate names:
step_1 <- sample %>%
  arrange(Name) %>%
  mutate(dup = duplicated(Name)) %>%
  filter(Name %in% unique(as.character(Name[dup == T])))
Step 2 - Identifying duplicate accounts for these names:
step_2 <- step_1 %>%
  group_by(Name, Bank_Account) %>%
  mutate(dup = duplicated(Bank_Account)) %>%
  filter(dup == F)
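As a footnote on the original attempt: the culprit is the final duplicated() call, which looks at Bank_Account alone across all names, which is why Chris's 1280 collided with Cassy's. Deduplicating on the (Name, Bank_Account) pair instead, and then keeping only names that still have more than one distinct account, fixes it in base R (a sketch against the sample data above):
n_occur <- data.frame(table(sample$Name))
dups <- sample[sample$Name %in% n_occur$Var1[n_occur$Freq > 1], ]
# dedupe on the Name/Bank_Account pair, not on Bank_Account alone
dups <- dups[!duplicated(dups[, c("Name", "Bank_Account")]), ]
# keep only names that still have more than one distinct account
dups[ave(seq_len(nrow(dups)), dups$Name, FUN = length) > 1, ]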

Loop/if else in R for data frame

I am really stuck on doing a loop in R. I have tried using ifelse too, but just can't seem to get a result.
I have a data frame as follows which shows a customer ID, their date of travel, mode and journey start time:
ID | Date | Mode | Time
------ | --------- | ------- | -----
1234 | 12/10/16 | Bus | 120
1234 | 12/10/16 | Bus | 130
1234 | 12/10/16 | Bus | 290
1234 | 12/10/16 | Train | 310
1234 | 12/10/16 | Bus | 330
4567 | 12/10/16 | Bus | 220
4567 | 12/10/16 | Bus | 230
4567 | 13/10/16 | Bus | 290
4567 | 13/10/16 | Bus | 450
4567 | 14/10/16 | Train | 1000
So on 12/10, customer 1234 made 4 bus journeys and 1 train journey.
I want to create a 5th column which identifies whether the journey stages are linked, i.e. is the 2nd journey linked to the 1st, is the 3rd linked to the 2nd, and so on (where 1 = linked, 0 = not linked).
The following conditions need to apply:
the journeys are for the same person and take place on the same day
2 bus journeys are within 60 mins of one another (so a bus and a train journey within 60 mins of one another would not be linked)
if the i+1th and the ith journey are linked, then the i+1th journey cannot be linked to the i+2th journey
I would like the output to be as follows:
ID | Date | Mode | Time | Linked
------ | --------- | ------- | ----- | -----
1234 | 12/10/16 | Bus | 120 | 0
1234 | 12/10/16 | Bus | 130 | 1
1234 | 12/10/16 | Bus | 290 | 0
1234 | 12/10/16 | Train | 310 | 0
1234 | 12/10/16 | Bus | 330 | 0
4567 | 12/10/16 | Bus | 220 | 0
4567 | 12/10/16 | Bus | 230 | 1
4567 | 13/10/16 | Bus | 290 | 0
4567 | 13/10/16 | Bus | 450 | 0
4567 | 14/10/16 | Train | 1000 | 0
Any help would be much appreciated!
1) ave. Try this:
transform(DF, linked = ave(Time, ID, Date, cumsum(c(FALSE, Mode[-1] != Mode[-nrow(DF)])),
                           FUN = function(x) c(0, diff(x) < 60)))
giving:
ID Date Mode Time linked
1 1234 12/10/16 Bus 120 0
2 1234 12/10/16 Bus 130 1
3 1234 12/10/16 Bus 290 0
4 1234 12/10/16 Train 310 0
5 1234 12/10/16 Bus 330 0
6 4567 12/10/16 Bus 220 0
7 4567 12/10/16 Bus 230 1
8 4567 13/10/16 Bus 290 0
9 4567 13/10/16 Bus 450 0
10 4567 14/10/16 Train 1000 0
2) sqldf. Here is a solution using sqldf.
library(sqldf)
sqldf("select a.*, coalesce(a.ID = b.ID and
a.Date = b.Date and
a.Mode = b.Mode and
a.Time < b.Time + 60, 0) linked
from DF a left join DF b on a.rowid = b.rowid + 1")
3) data.table. Note that data.table tends to be both fast and memory efficient and may be able to handle data sizes in memory that other approaches cannot.
library(data.table)
dt <- as.data.table(DF)
dt[, linked := (Time < shift(Time, fill = -60) + 60) *
               (Mode == shift(Mode, fill = Mode[1])), by = "ID,Date"]
4) dplyr
library(dplyr)
DF %>%
  group_by(ID, Date) %>%
  mutate(linked = (Time < lag(Time, default = -Inf) + 60) *
                  (Mode == lag(Mode, default = Mode[1]))) %>%
  ungroup()
giving a similar answer.
Note: The input DF in reproducible form is:
Lines <-
"ID | Date | Mode | Time
------ | --------- | ------- | -----
1234 | 12/10/16 | Bus | 120
1234 | 12/10/16 | Bus | 130
1234 | 12/10/16 | Bus | 290
1234 | 12/10/16 | Train | 310
1234 | 12/10/16 | Bus | 330
4567 | 12/10/16 | Bus | 220
4567 | 12/10/16 | Bus | 230
4567 | 13/10/16 | Bus | 290
4567 | 13/10/16 | Bus | 450
4567 | 14/10/16 | Train | 1000"
DF <- read.table(text = Lines, header = TRUE, sep = "|", strip.white = TRUE,
                 comment = "-", as.is = TRUE)
I like Grothendieck's answer, but it may not be as easy to interpret for someone new to R, so let's do it in a less programmatically efficient way that shows the steps to take. I'll use the same data frame as Grothendieck.
Let's determine whether the time between journeys is within 60 minutes. We'll loop through all rows of the data frame and, if a row is for the same person, on the same day, with the same mode of transport, and less than 60 minutes after the previous row, set linked to 1. Initialising the column to 0 first covers every other case (and ensures the column exists before the loop touches it):
df$linked <- 0                                    # every row starts unlinked
for (i in 2:dim(df)[1]) {
  if (df$ID[i] == df$ID[i-1]) {                   # same person
    if (df$Date[i] == df$Date[i-1]) {             # same day
      if (df$Mode[i] == df$Mode[i-1]) {           # same mode of transport
        if ((df$Time[i] - df$Time[i-1]) < 60) {   # within 60 minutes
          df$linked[i] <- 1
        }
      }
    }
  }
}
Using the dplyr package:
library(dplyr)
DF %>%
  # the journeys are for the same person, take place on the same day
  # and use the same mode of transport
  group_by(ID, Date, Mode) %>%
  # 2 bus journeys are within 60 mins of one another
  mutate(linked0 = c(Inf, diff(Time)) < 60,
         # if the i+1th and the ith journey are linked,
         # then the i+1th journey cannot be linked to the i+2th journey
         linkedsum = cumsum(linked0),
         linked = ifelse(linkedsum == 1, linked0, 0))
ID Date Mode Time linked0 linkedsum linked
<int> <chr> <chr> <int> <lgl> <int> <dbl>
1 1234 12/10/16 Bus 120 FALSE 0 0
2 1234 12/10/16 Bus 130 TRUE 1 1
3 1234 12/10/16 Bus 290 FALSE 1 0
4 1234 12/10/16 Train 310 FALSE 0 0
5 1234 12/10/16 Bus 330 TRUE 2 0
6 4567 12/10/16 Bus 220 FALSE 0 0
7 4567 12/10/16 Bus 230 TRUE 1 1
8 4567 13/10/16 Bus 290 FALSE 0 0
9 4567 13/10/16 Bus 450 FALSE 0 0
10 4567 14/10/16 Train 1000 FALSE 0 0
To perform this inside a database, see the dplyr database vignette.

copy command in cassandra execution order

I am copying a CSV file into Cassandra. I have the CSV file below, and the table is created as follows.
CREATE TABLE UCBAdmissions (
  id int PRIMARY KEY,
  admit text,
  dept text,
  freq int,
  gender text
);
When I use
copy UCBAdmissions from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.318 seconds.
cqlsh> select * from UCBAdmissions;
id | admit | dept | freq | gender
----+-------+------+------+--------
(0 rows)
copy UCBAdmissions (id, admit, gender, dept, freq) from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.364 seconds.
cqlsh> select * from UCBAdmissions;
id | admit | dept | freq | gender
----+----------+------+------+--------
23 | Admitted | F | 24 | Female
5 | Admitted | B | 353 | Male
10 | Rejected | C | 205 | Male
16 | Rejected | D | 244 | Female
13 | Admitted | D | 138 | Male
11 | Admitted | C | 202 | Female
1 | Admitted | A | 512 | Male
19 | Admitted | E | 94 | Female
8 | Rejected | B | 8 | Female
2 | Rejected | A | 313 | Male
4 | Rejected | A | 19 | Female
18 | Rejected | E | 138 | Male
15 | Admitted | D | 131 | Female
22 | Rejected | F | 351 | Male
20 | Rejected | E | 299 | Female
7 | Admitted | B | 17 | Female
6 | Rejected | B | 207 | Male
9 | Admitted | C | 120 | Male
14 | Rejected | D | 279 | Male
21 | Admitted | F | 22 | Male
17 | Admitted | E | 53 | Male
24 | Rejected | F | 317 | Female
12 | Rejected | C | 391 | Female
3 | Admitted | A | 89 | Female
UCBAdmissions.csv
"","Admit","Gender","Dept","Freq"
"1","Admitted","Male","A",512
"2","Rejected","Male","A",313
"3","Admitted","Female","A",89
"4","Rejected","Female","A",19
"5","Admitted","Male","B",353
"6","Rejected","Male","B",207
"7","Admitted","Female","B",17
"8","Rejected","Female","B",8
"9","Admitted","Male","C",120
"10","Rejected","Male","C",205
"11","Admitted","Female","C",202
"12","Rejected","Female","C",391
"13","Admitted","Male","D",138
"14","Rejected","Male","D",279
"15","Admitted","Female","D",131
"16","Rejected","Female","D",244
"17","Admitted","Male","E",53
"18","Rejected","Male","E",138
"19","Admitted","Female","E",94
"20","Rejected","Female","E",299
"21","Admitted","Male","F",22
"22","Rejected","Male","F",351
"23","Admitted","Female","F",24
"24","Rejected","Female","F",317
I see the output order changed from the CSV file, as seen above.
Question: what is the difference between the first and second COPY commands? Should the table's columns be created in the same order as the CSV file's?
Cassandra is designed to be distributed. To accomplish this, it takes the partition key of your table (id), hashes it with the cluster's partitioner (probably Murmur3Partitioner) to produce an integer (actually a long), and uses that integer to assign the row to a node in the ring.
What you're seeing are the results ordered by the resulting token, which is non-intuitive but not necessarily wrong. There is no straightforward way to do SELECT * FROM table ORDER BY primaryKey ASC in Cassandra; its distributed nature makes that difficult to do efficiently.
As for the difference between your two COPY commands: without an explicit column list, COPY expects the CSV fields to be in the table's own column order (the order cqlsh displays), which does not match your file, so the first attempt did not load what you expected. Listing the columns in the file's order maps each CSV field to the right column; the table's column order itself does not need to match the CSV.
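If you want to see that token ordering directly, CQL's built-in token() function can be selected alongside the partition key (a quick illustration against the table above):
SELECT token(id), id, admit FROM UCBAdmissions LIMIT 5;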

Tidy Data Layout - convert variables into factors

I have the following data table
| State | Prod. |Non-Prod.|
|-------|-------|---------|
| CA | 120 | 23 |
| GA | 123 | 34 |
| TX | 290 | 34 |
How can I convert this table to tidy data format in R, or in other software such as Excel?
|State | Class | # of EEs|
|------|----------|---------|
| CA | Prod. | 120 |
| CA | Non-Prod.| 23 |
| GA | Prod. | 123 |
| GA | Non-Prod.| 34 |
Try using reshape2:
library(reshape2)
melt(df,id.vars='State')
# State variable value
# 1 CA Prod 120
# 2 GA Prod 123
# 3 TX Prod 290
# 4 CA Non-Prod. 23
# 5 GA Non-Prod. 34
# 6 TX Non-Prod. 34
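If you prefer the tidyverse, the same reshape works with tidyr's pivot_longer, which supersedes melt for data frames. A sketch, assuming the table above is in a data frame df (the backticks and check.names = FALSE are needed because the column names contain periods and hyphens):
library(tidyr)

df <- data.frame(
  State = c("CA", "GA", "TX"),
  `Prod.` = c(120, 123, 290),
  `Non-Prod.` = c(23, 34, 34),
  check.names = FALSE
)

pivot_longer(df, cols = -State, names_to = "Class", values_to = "# of EEs")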
