Merging two data frames without duplicating metric values - r

I have two data frames and I want to merge them by leader values, so that I can see the total runs and walks for each group. Each leader can have multiple members in their team, but the problem I'm having is that when I merge them, the metrics also get duplicated onto the newly added rows.
Here is an example of the two data sets that I have:
Data set 1:
+-------------+-----------+------------+-------------+
| leader name | leader id | total runs | total walks |
+-------------+-----------+------------+-------------+
| ab | 11 | 4 | 9 |
| tg | 47 | 8 | 3 |
+-------------+-----------+------------+-------------+
Data set 2:
+-------------+-----------+--------------+-----------+
| leader name | leader id | member name | member id |
+-------------+-----------+--------------+-----------+
| ab | 11 | gfh | 589 |
| ab | 11 | tyu | 739 |
| tg | 47 | rtf | 745 |
| tg | 47 | jke | 996 |
+-------------+-----------+--------------+-----------+
I want to merge the two datasets so that they become like this:
+-------------+-----------+--------------+------------+------------+-------------+
| leader name | leader id | member name | member id | total runs | total walks |
+-------------+-----------+--------------+------------+------------+-------------+
| ab | 11 | gfh | 589 | 4 | 9 |
| ab | 11 | tyu | 739 | | |
| tg | 47 | rtf | 745 | 8 | 3 |
| tg | 47 | jke | 996 | | |
+-------------+-----------+--------------+------------+------------+-------------+
But right now I keep getting:
+-------------+-----------+--------------+------------+------------+-------------+
| leader name | leader id | member name | member id | total runs | total walks |
+-------------+-----------+--------------+------------+------------+-------------+
| ab | 11 | gfh | 589 | 4 | 9 |
| ab | 11 | tyu | 739 | 4 | 9 |
| tg | 47 | rtf | 745 | 8 | 3 |
| tg | 47 | jke | 996 | 8 | 3 |
+-------------+-----------+--------------+------------+------------+-------------+
It doesn't matter if they're blank, NAs, or 0s, as long as the values aren't duplicated. Is there a way to achieve this?

We can replace the values in those 'total' columns after a left_join:
library(dplyr)
left_join(df2, df1) %>%
  group_by(leadername) %>%
  mutate_at(vars(starts_with('total')), ~ replace(., row_number() > 1, NA))
# A tibble: 4 x 6
# Groups: leadername [2]
# leadername leaderid membername memberid totalruns totalwalks
# <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#1 ab 11 gfh 589 4 9
#2 ab 11 tyu 739 NA NA
#3 tg 47 rtf 745 8 3
#4 tg 47 jke 996 NA NA
Or without using the group_by
left_join(df2, df1) %>%
  mutate_at(vars(starts_with('total')), ~ replace(., duplicated(leadername), NA))
Or a base R option is
out <- merge(df2, df1, all.x = TRUE)
i1 <- duplicated(out$leadername)
out[i1, c("totalruns", "totalwalks")] <- NA
out
# leadername leaderid membername memberid totalruns totalwalks
#1 ab 11 gfh 589 4 9
#2 ab 11 tyu 739 NA NA
#3 tg 47 rtf 745 8 3
#4 tg 47 jke 996 NA NA
data
df1 <- structure(list(leadername = c("ab", "tg"), leaderid = c(11, 47
), totalruns = c(4, 8), totalwalks = c(9, 3)), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(leadername = c("ab", "ab", "tg", "tg"), leaderid = c(11,
11, 47, 47), membername = c("gfh", "tyu", "rtf", "jke"), memberid = c(589,
739, 745, 996)), class = "data.frame", row.names = c(NA, -4L))
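For readers coming from Python, the same left-join-then-blank idea can be sketched in pandas. This is an illustration, not part of the original answer; the frames below simply mirror df1/df2 from the data above:

```python
import pandas as pd

# Frames mirroring df1/df2 from the answer's data section.
df1 = pd.DataFrame({"leadername": ["ab", "tg"], "leaderid": [11, 47],
                    "totalruns": [4, 8], "totalwalks": [9, 3]})
df2 = pd.DataFrame({"leadername": ["ab", "ab", "tg", "tg"],
                    "leaderid": [11, 11, 47, 47],
                    "membername": ["gfh", "tyu", "rtf", "jke"],
                    "memberid": [589, 739, 745, 996]})

# Left join on the leader columns, as in left_join(df2, df1).
out = df2.merge(df1, on=["leadername", "leaderid"], how="left")

# Blank the totals on every repeated leader row, the same idea as
# replace(., duplicated(leadername), NA).
dup = out["leadername"].duplicated()
for col in ["totalruns", "totalwalks"]:
    out[col] = out[col].mask(dup)
```

pandas' `duplicated()`, like R's, flags every occurrence after the first, so only the first row per leader keeps the totals.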

Related

Convert a json to a long data frame in R

I want to convert a JSON object to a data frame using R.
The data I am working with looks like this:
{"A": [123, 234, 345]}
{"B": [1213, 132, 342, 1235]}
{"C": [132, 12]}
I want to convert this something like this:
| Name | Value |
| ---- | ----- |
| A | 123 |
| A | 234 |
| A | 345 |
| B | 1213 |
| B | 132 |
| B | 342 |
| B | 1235 |
| C | 132 |
| C | 12 |
The dataset is quite large (more than 1M entries) so it would be great if the method is scalable.
string <- c("{\"A\": [123, 234, 345]}", "{\"B\": [1213, 132, 342, 1235]}",
"{\"C\": [132, 12]}")
stack(sapply(string, rjson::fromJSON, USE.NAMES = FALSE))
values ind
1 123 A
2 234 A
3 345 A
4 1213 B
5 132 B
6 342 B
7 1235 B
8 132 C
9 12 C
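The same stack-the-parsed-lists idea can be written in Python with the standard json module, for comparison (an illustration, not part of the original answer; a flat comprehension like this scales linearly with the number of entries):

```python
import json
import pandas as pd

strings = ['{"A": [123, 234, 345]}',
           '{"B": [1213, 132, 342, 1235]}',
           '{"C": [132, 12]}']

# Parse each line and emit one (Name, Value) pair per list element.
rows = [(name, value)
        for s in strings
        for name, values in json.loads(s).items()
        for value in values]
long_df = pd.DataFrame(rows, columns=["Name", "Value"])
```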

How to read and process columns with sub columns from an excel/.csv/any file?

I tried reading an Excel file where I need to read sub-columns too, but I haven't found a way to resolve this.
The Excel file contains data like this:
| Sl No. | Sales 1 | Sales 2 | % Change |
| | 1 Qtr | % Qtr | 2 Qtr| % Qtr | |
| 1 | 134 | 67 | 175 | 74 | 12.5 |
After importing I can see the data as
| Sl No. |Sales 1| ...3 |Sales 2 | ...5 | % Change |
| NA | 1 Qtr | % Qtr | 2 Qtr | % Qtr | NA |
| 1 | 134 | 67 | 175 | 74 | 12.5 |
I tried several ways to merge "Sales 1 & ...3" and "Sales 2 & ...5" while keeping 1 Qtr, % Qtr, 2 Qtr, % Qtr as sub-columns, but was unable to do so.
I need it to be like,
| Sl No. | Sales 1 | Sales 2 | % Change |
| | 1 Qtr | % Qtr | 2 Qtr| % Qtr | |
| 1 | 134 | 67 | 175 | 74 | 12.5 |
Unfortunately, R doesn't support multiple header rows for a data frame. So probably the easiest thing you can do in base R is to combine the two header rows into single column names and then drop the first line.
library(openxlsx)
x <- read.xlsx("your_file.xlsx")
# Sl.No Sales.1 X3 Sales.2 X5 %Change
# 1 NA 1 Qtr %Qtr 2 Qtr %Qtr NA
# 2 1 134 67 175 74 12.5
colnames(x) <- paste0(colnames(x),ifelse(is.na(x[1,]),"",paste0(" - ", x[1,])))
x <- x[-1,]
# Sl.No Sales.1 - 1 Qtr X3 - %Qtr Sales.2 - 2 Qtr X5 - %Qtr %Change
# 2 1 134 67 175 74 12.5
colnames(x)
# [1] "Sl.No" "Sales.1 - 1 Qtr" "X3 - %Qtr" "Sales.2 - 2 Qtr" "X5 - %Qtr" "%Change"
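The same two-header-row trick translates to Python as well. In this sketch the frame is hard-coded to mimic what read.xlsx returns (sub-headers landing in the first data row); it is an illustration, not part of the original answer:

```python
import pandas as pd

# Mimics the imported sheet: the second header row sits in the first
# data row, with missing values where the top header already applies.
x = pd.DataFrame({"Sl.No":   [None,    "1"],
                  "Sales.1": ["1 Qtr", "134"],
                  "X3":      ["%Qtr",  "67"],
                  "Sales.2": ["2 Qtr", "175"],
                  "X5":      ["%Qtr",  "74"],
                  "%Change": [None,    "12.5"]})

sub = x.iloc[0]  # the sub-header row
# Append each sub-header to its column name, skipping missing ones.
x.columns = [c if pd.isna(s) else f"{c} - {s}" for c, s in zip(x.columns, sub)]
x = x.iloc[1:].reset_index(drop=True)  # drop the consumed header row
```

If the file is read with pandas instead of openxlsx, `pd.read_excel(path, header=[0, 1])` can build a MultiIndex from the two header rows directly.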

copy command in cassandra execution order

I am copying a CSV file into Cassandra. I have the CSV file below, and the table is created as follows.
CREATE TABLE UCBAdmissions(
  id int PRIMARY KEY,
  admit text,
  dept text,
  freq int,
  gender text
);
When I use
copy UCBAdmissions from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.318 seconds.
cqlsh> select * from UCBAdmissions;
id | admit | dept | freq | gender
----+-------+------+------+--------
(0 rows)
copy UCBAdmissions(id, admit, gender, dept, freq) from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.364 seconds.
cqlsh> select * from UCBAdmissions;
id | admit | dept | freq | gender
----+----------+------+------+--------
23 | Admitted | F | 24 | Female
5 | Admitted | B | 353 | Male
10 | Rejected | C | 205 | Male
16 | Rejected | D | 244 | Female
13 | Admitted | D | 138 | Male
11 | Admitted | C | 202 | Female
1 | Admitted | A | 512 | Male
19 | Admitted | E | 94 | Female
8 | Rejected | B | 8 | Female
2 | Rejected | A | 313 | Male
4 | Rejected | A | 19 | Female
18 | Rejected | E | 138 | Male
15 | Admitted | D | 131 | Female
22 | Rejected | F | 351 | Male
20 | Rejected | E | 299 | Female
7 | Admitted | B | 17 | Female
6 | Rejected | B | 207 | Male
9 | Admitted | C | 120 | Male
14 | Rejected | D | 279 | Male
21 | Admitted | F | 22 | Male
17 | Admitted | E | 53 | Male
24 | Rejected | F | 317 | Female
12 | Rejected | C | 391 | Female
3 | Admitted | A | 89 | Female
UCBAdmissions.csv
"","Admit","Gender","Dept","Freq"
"1","Admitted","Male","A",512
"2","Rejected","Male","A",313
"3","Admitted","Female","A",89
"4","Rejected","Female","A",19
"5","Admitted","Male","B",353
"6","Rejected","Male","B",207
"7","Admitted","Female","B",17
"8","Rejected","Female","B",8
"9","Admitted","Male","C",120
"10","Rejected","Male","C",205
"11","Admitted","Female","C",202
"12","Rejected","Female","C",391
"13","Admitted","Male","D",138
"14","Rejected","Male","D",279
"15","Admitted","Female","D",131
"16","Rejected","Female","D",244
"17","Admitted","Male","E",53
"18","Rejected","Male","E",138
"19","Admitted","Female","E",94
"20","Rejected","Female","E",299
"21","Admitted","Male","F",22
"22","Rejected","Male","F",351
"23","Admitted","Female","F",24
"24","Rejected","Female","F",317
I see that the output order has changed from the CSV file, as seen above.
Question: What is the difference between commands 1 and 2? Should we follow the same column order as the CSV file when creating the table in Cassandra?
Cassandra is designed to be distributed. To accomplish this, it takes the partition key of your table (id), hashes it with the cluster's partitioner (probably Murmur3Partitioner) to produce an integer (actually a Long), and uses that integer to assign the row to a node in the ring.
What you're seeing are the results ordered by the resulting token, which is non-intuitive but not necessarily wrong. There is no straightforward way to do a SELECT * FROM table ORDER BY primaryKey ASC in Cassandra; the distributed nature makes that difficult to do effectively.
As for the difference between your two commands: without an explicit column list, COPY likely maps the CSV columns onto the table's columns in their CQL order (partition key first, then the rest alphabetically: id, admit, dept, freq, gender), which does not match your file's order, so the values land in the wrong columns. Listing the columns explicitly, as in the second command, makes the mapping line up; the order used to create the table does not need to match the CSV.
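The token-ordering effect is easy to reproduce outside Cassandra. The Python sketch below uses hashlib's MD5 purely as a stand-in for the Murmur3 partitioner (Cassandra's actual hash differs, so this order won't match cqlsh's, but the shuffling effect is the same):

```python
import hashlib

ids = list(range(1, 25))  # the 24 partition keys from the example

# Stand-in for the partitioner: hash each key to a signed 64-bit token.
def token(pk: int) -> int:
    digest = hashlib.md5(str(pk).encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

# Iterating "in token order" returns the same rows, but scattered,
# which is what an unrestricted SELECT shows in cqlsh.
token_order = sorted(ids, key=token)
```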

Tidy Data Layout - convert variables into factors

I have the following data table
| State | Prod. |Non-Prod.|
|-------|-------|---------|
| CA | 120 | 23 |
| GA | 123 | 34 |
| TX | 290 | 34 |
How can I convert this table to tidy data format in R or any other software like Excel?
|State | Class | # of EEs|
|------|----------|---------|
| CA | Prod. | 120 |
| CA | Non-Prod.| 23 |
| GA | Prod. | 123 |
| GA | Non-Prod.| 34 |
Try using reshape2:
library(reshape2)
melt(df,id.vars='State')
# State variable value
# 1 CA Prod. 120
# 2 GA Prod. 123
# 3 TX Prod. 290
# 4 CA Non-Prod. 23
# 5 GA Non-Prod. 34
# 6 TX Non-Prod. 34
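For comparison, the same unpivot can be sketched in pandas (an illustration; the frame mirrors the example table):

```python
import pandas as pd

df = pd.DataFrame({"State":     ["CA", "GA", "TX"],
                   "Prod.":     [120, 123, 290],
                   "Non-Prod.": [23, 34, 34]})

# Wide to long: one row per (State, Class) pair.
tidy = df.melt(id_vars="State", var_name="Class", value_name="# of EEs")
```

In current R, tidyr's `pivot_longer(df, -State, names_to = "Class", values_to = "# of EEs")` supersedes melt for this.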

Why won't my column name change work in R?

This is part of a script I'm writing to merge the columns more fully after using merge().
If both datasets have a column with the same name, merge() gives you columns column.x and column.y. I have written a script to put this data together and to drop the unneeded columns (which would be column.y and column.x_error, a column I've added to give warnings in case dat$column.x != dat$column.y). I also want to rename column.x to column, to decrease unneeded manual actions in my dataset. I have not managed to rename column.x to column; see the code for more info.
dat is obtained by doing dat = merge(data1, data2, by = "ID", all.x = TRUE)
#obtain a list of double columns
dubbelkol = cbind()
sorted = sort(names(dat))
for(i in as.numeric(1:length(names(dat)))) {
if(grepl(".x",sorted[i])){
if (grepl(".y", sorted[i+1]) && (sub(".x","",sorted[i])==sub(".y","",sorted[i+1]))){
dubbelkol = cbind(dubbelkol,sorted[i],sorted[i+1])
}
}
}
#Check data, fill in NA in column.x from column.y if poss
temp = cbind()
for (p in as.numeric(1:(length(dubbelkol)-1))){
if(grepl(".x",dubbelkol[p])){
dat[dubbelkol[p]][is.na(dat[dubbelkol[p]])] = dat[dubbelkol[p+1]][is.na(dat[dubbelkol[p]])]
temp = (dat[dubbelkol[p]] != dat[dubbelkol[p+1]])
colnames(temp) = (paste(dubbelkol[p],"_error", sep=""))
dat[colnames(temp)] = temp
}
}
#If every value in "column.x_error" is TRUE or NA, delete "column.y" and "column.x_error"
#Rename "column.x" to "column"
#from here until next comment everything works
droplist= c()
for (k in as.numeric(1:length(names(dat)))) {
if (grepl(".x_error",colnames(dat[k]))) {
if (all(dat[k]==FALSE, na.rm = TRUE)) {
droplist = c(droplist,colnames(dat[k]), sub(".x_error",".y",colnames(dat[k])))
#the next line doesn't work; it's supposed to rename the .x column back to its base name before the .y and .x_error columns are dropped.
colnames(dat[sub(".x_error",".x",colnames(dat[k]))])= paste(sub(".x_error","",colnames(dat[k])))
}
}
}
dat = dat[,!names(dat) %in% droplist]
paste(sub(".x_error","",colnames(dat[k]))) will give me "BNR" just fine, but the colnames(...) = ... won't change the column name in dat.
Any idea what's going wrong?
data1
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | NA |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
data2
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
dat
+----+-------+-------+-----------+
| ID | BNR.x | BNR.y |BNR.x_error|
+----+-------+-------+-----------+
| 1 | 123 | NA |FALSE |
| 2 | 234 | 234 |FALSE |
| 3 | NA | 345 |FALSE |
| 4 | 456 | 456 |FALSE |
| 5 | 677 | 677 |FALSE |
| 6 | NA | NA |NA |
+----+-------+-------+-----------+
desired output
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
I suggest replacing:
sub(".x_error", ".x", colnames(dat[k]))
with:
sub("\\.x_error", "\\.x", colnames(dat[k]))
if you wish to match an actual .: you have to escape it as \\., because in regex a plain . matches any character.
Even better, since the . is the same in both the pattern and the replacement, you could just say:
sub("x_error", "x", colnames(dat[k]))
Or, if there is no _error other than x_error, simply:
sub("_error", "", colnames(dat[k]))
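The escaping pitfall is the same in any regex engine. A quick Python illustration with made-up strings:

```python
import re

# An unescaped '.' matches ANY character, so both strings are rewritten:
print(re.sub(r".x_error", ".x", "BNR.x_error"))   # BNR.x
print(re.sub(r".x_error", ".x", "BNRyx_error"))   # BNR.x  (probably not intended)

# An escaped '\.' matches only a literal dot:
print(re.sub(r"\.x_error", ".x", "BNRyx_error"))  # BNRyx_error (unchanged)
print(re.sub(r"\.x_error", ".x", "BNR.x_error"))  # BNR.x
```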
Edit: The problem seems to be that your data format is loading additional columns on the left and the right. You can select the columns you want first and then merge.
d1 <- read.table(textConnection("| ID | BNR |
| 1 | 123 |
| 2 | 234 |
| 3 | NA |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |"), sep = "|", header = TRUE, stringsAsFactors = FALSE)[,2:3]
d1$BNR <- as.numeric(d1$BNR)
d2 <- read.table(textConnection("| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |"), header = FALSE, sep = "|", stringsAsFactors = FALSE)[,2:3]
names(d2) <- c("ID", "BNR")
d2$BNR <- as.numeric(d2$BNR)
# > d1
# ID BNR
# 1 1 123
# 2 2 234
# 3 3 NA
# 4 4 456
# 5 5 677
# 6 6 NA
# > d2
# ID BNR
# 1 1 123
# 2 2 234
# 3 3 345
# 4 4 456
# 5 5 677
# 6 6 NA
dat <- merge(d1, d2, by="ID", all=T)
# > dat
# ID BNR.x BNR.y
# 1 1 123 123
# 2 2 234 234
# 3 3 NA 345
# 4 4 456 456
# 5 5 677 677
# 6 6 NA NA
# replace all NA values in x from y
dat$BNR.x <- ifelse(is.na(dat$BNR.x), dat$BNR.y, dat$BNR.x)
# now remove y
dat$BNR.y <- NULL
# and rename the remaining column to match the desired output
names(dat)[names(dat) == "BNR.x"] <- "BNR"
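The coalesce-and-drop ending maps directly onto pandas too, if that's your tool. A sketch using the same toy data (an illustration, not part of the original answer):

```python
import pandas as pd

d1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                   "BNR": [123, 234, None, 456, 677, None]})
d2 = pd.DataFrame({"ID": [1, 2, 3, 4, 5, 6],
                   "BNR": [123, 234, 345, 456, 677, None]})

# Merge produces BNR.x / BNR.y, matching R's merge() suffixes.
dat = d1.merge(d2, on="ID", how="outer", suffixes=(".x", ".y"))

# Fill missing .x values from .y, then drop the helper columns.
dat["BNR"] = dat["BNR.x"].fillna(dat["BNR.y"])
dat = dat.drop(columns=["BNR.x", "BNR.y"])
```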
