Convert JSON to a long data frame in R

I want to convert a JSON object to a data frame using R.
The data I am working with look like this:
{"A": [123, 234, 345]}
{"B": [1213, 132, 342, 1235]}
{"C": [132, 12]}
I want to convert this into something like this:
| Name | Value |
| ---- | ----- |
| A | 123 |
| A | 234 |
| A | 345 |
| B | 1213 |
| B | 132 |
| B | 342 |
| B | 1235 |
| C | 132 |
| C | 12 |
The dataset is quite large (more than 1M entries) so it would be great if the method is scalable.

string <- c("{\"A\": [123, 234, 345]}", "{\"B\": [1213, 132, 342, 1235]}",
"{\"C\": [132, 12]}")
stack(sapply(string, rjson::fromJSON, USE.NAMES = FALSE))
values ind
1 123 A
2 234 A
3 345 A
4 1213 B
5 132 B
6 342 B
7 1235 B
8 132 C
9 12 C
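For larger inputs, one possible variation on the approach above is to parse every line once and build the two columns directly. This is only a minimal sketch, assuming each JSON string holds exactly one key with a numeric array; the names parsed, keys, vals and long are illustrative, and it has not been benchmarked on the 1M-entry data.

```r
library(rjson)

string <- c("{\"A\": [123, 234, 345]}", "{\"B\": [1213, 132, 342, 1235]}",
            "{\"C\": [132, 12]}")

# Parse each line once; each element is a one-entry named list
parsed <- lapply(string, fromJSON)

# Assumption: exactly one key per JSON object
keys <- vapply(parsed, names, character(1))
vals <- lapply(parsed, `[[`, 1)

# Repeat each key once per value, then flatten the values
long <- data.frame(Name = rep(keys, lengths(vals)),
                   Value = unlist(vals))
long
```

Building the Name column with rep() and lengths() avoids growing a data frame row by row, which is usually what makes this kind of conversion slow.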

Related

Merging two data frames without duplicating metric values

I have two data frames and I want to merge them by leader values, so that I can see the total runs and walks for each group. Each leader can have multiple members in their team, but the problem I'm having is that when I merge them, the metrics also get duplicated onto the newly added rows.
Here is an example of the two data sets that I have:
Data set 1:
+-------------+-----------+------------+-------------+
| leader name | leader id | total runs | total walks |
+-------------+-----------+------------+-------------+
| ab | 11 | 4 | 9 |
| tg | 47 | 8 | 3 |
+-------------+-----------+------------+-------------+
Data set 2:
+-------------+-----------+--------------+-----------+
| leader name | leader id | member name | member id |
+-------------+-----------+--------------+-----------+
| ab | 11 | gfh | 589 |
| ab | 11 | tyu | 739 |
| tg | 47 | rtf | 745 |
| tg | 47 | jke | 996 |
+-------------+-----------+--------------+-----------+
I want to merge the two datasets so that they become like this:
+-------------+-----------+--------------+------------+------------+-------------+
| leader name | leader id | member name | member id | total runs | total walks |
+-------------+-----------+--------------+------------+------------+-------------+
| ab | 11 | gfh | 589 | 4 | 9 |
| ab | 11 | tyu | 739 | | |
| tg | 47 | rtf | 745 | 8 | 3 |
| tg | 47 | jke | 996 | | |
+-------------+-----------+--------------+------------+------------+-------------+
But right now I keep getting:
+-------------+-----------+--------------+------------+------------+-------------+
| leader name | leader id | member name | member id | total runs | total walks |
+-------------+-----------+--------------+------------+------------+-------------+
| ab | 11 | gfh | 589 | 4 | 9 |
| ab | 11 | tyu | 739 | 4 | 9 |
| tg | 47 | rtf | 745 | 8 | 3 |
| tg | 47 | jke | 996 | 8 | 3 |
+-------------+-----------+--------------+------------+------------+-------------+
It doesn't matter if they're blank, NA's or 0's, as long as the values aren't duplicating. Is there a way to achieve this?
We can replace the values in those 'total' columns after a left_join:
library(dplyr)
left_join(df2, df1 ) %>%
group_by(leadername) %>%
mutate_at(vars(starts_with('total')), ~ replace(., row_number() > 1, NA))
# A tibble: 4 x 6
# Groups: leadername [2]
# leadername leaderid membername memberid totalruns totalwalks
# <chr> <dbl> <chr> <dbl> <dbl> <dbl>
#1 ab 11 gfh 589 4 9
#2 ab 11 tyu 739 NA NA
#3 tg 47 rtf 745 8 3
#4 tg 47 jke 996 NA NA
Or without using the group_by
left_join(df2, df1 ) %>%
mutate_at(vars(starts_with('total')), ~
replace(., duplicated(leadername), NA))
Or a base R option is
out <- merge(df2, df1, all.x = TRUE)
i1 <- duplicated(out$leadername)
out[i1, c("totalruns", "totalwalks")] <- NA
out
# leadername leaderid membername memberid totalruns totalwalks
#1 ab 11 gfh 589 4 9
#2 ab 11 tyu 739 NA NA
#3 tg 47 rtf 745 8 3
#4 tg 47 jke 996 NA NA
data
df1 <- structure(list(leadername = c("ab", "tg"), leaderid = c(11, 47
), totalruns = c(4, 8), totalwalks = c(9, 3)), class = "data.frame", row.names = c(NA,
-2L))
df2 <- structure(list(leadername = c("ab", "ab", "tg", "tg"), leaderid = c(11,
11, 47, 47), membername = c("gfh", "tyu", "rtf", "jke"), memberid = c(589,
739, 745, 996)), class = "data.frame", row.names = c(NA, -4L))
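The same idea can also be written with across(), which newer dplyr releases offer as the replacement for mutate_at(). This is a sketch under the assumption that dplyr 1.0.0 or later is installed; the data is rebuilt with data.frame() so the block runs on its own, and the join keys are spelled out explicitly.

```r
library(dplyr)

df1 <- data.frame(leadername = c("ab", "tg"), leaderid = c(11, 47),
                  totalruns = c(4, 8), totalwalks = c(9, 3))
df2 <- data.frame(leadername = c("ab", "ab", "tg", "tg"),
                  leaderid = c(11, 11, 47, 47),
                  membername = c("gfh", "tyu", "rtf", "jke"),
                  memberid = c(589, 739, 745, 996))

# Join the totals onto every member row, then blank out the totals
# on every row after the first one for each leader
out <- left_join(df2, df1, by = c("leadername", "leaderid")) %>%
  mutate(across(starts_with("total"),
                ~ replace(.x, duplicated(leadername), NA)))
out
```

Like the duplicated() variants above, this relies on the rows of each leader being identified by leadername alone.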

Joining two datasets in R

I have two datasets that I want to combine.
Dataset_1:
id| value_1
1 | a
1 | b
1 | b
2 | a
2 | a
2 | b
...
Dataset_2:
id| value_2
1 | 123
1 | 433
1 | 234
2 | 222
2 | 333
2 | 333
...
and the result should look like:
id| value_1 | value_2
1 | a | 123
1 | b | 433
1 | b | 234
2 | a | 222
2 | a | 333
2 | b | 333
I've tried to use these functions:
inner_join(dataset_1,dataset_2,by="id")
and
full_join(dataset_1,dataset_2,by="id")
and
merge(dataset_1,dataset_2,by="id")
but I always get all possible combinations of the two datasets and not the combined one.
It should be simple but I can't figure out what I am doing wrong.
id is a double, value_1 is a chr and value_2 is an int.
Thanks for any help!
Your example calls for a bind, not a join.
Dataset_3 <- bind_cols(Dataset_1,Dataset_2[-1] )
What is happening is:
When a join finds a repeated id, it creates a row for every combination of matching rows.
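If a join is still preferred over a bind, one option is to make the key unique first. The sketch below assumes the rows of both datasets are already in matching order within each id; the helper column row is hypothetical and only exists to pair the n-th row of an id in one dataset with the n-th row of that id in the other.

```r
library(dplyr)

Dataset_1 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                        value_1 = c("a", "b", "b", "a", "a", "b"))
Dataset_2 <- data.frame(id = c(1, 1, 1, 2, 2, 2),
                        value_2 = c(123L, 433L, 234L, 222L, 333L, 333L))

# Number the rows within each id so that (id, row) is a unique key
d1 <- Dataset_1 %>% group_by(id) %>% mutate(row = row_number()) %>% ungroup()
d2 <- Dataset_2 %>% group_by(id) %>% mutate(row = row_number()) %>% ungroup()

# Joining on both columns pairs rows one-to-one instead of all-to-all
Dataset_3 <- inner_join(d1, d2, by = c("id", "row")) %>% select(-row)
Dataset_3
```

With the plain by = "id" join, each id appears three times on both sides, so every id yields 3 x 3 = 9 rows; with the extra key, each row matches exactly one partner.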

copy command in cassandra execution order

I am copying a CSV file to Cassandra. I have the CSV file shown below, and the table is created as follows.
CREATE TABLE UCBAdmissions(
id int PRIMARY KEY,
admit text,
dept text,
freq int,
gender text
)
When I use
copy UCBAdmissions from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.318 seconds.
cqlsh> select *from UCBAdmissions;
id | admit | dept | freq | gender
----+-------+------+------+--------
(0 rows)
copy UCBAdmissions(id,admit,gender, dept , freq )from 'UCBAdmissions.csv' WITH DELIMITER = ',' AND HEADER = TRUE;
The output is
24 rows imported in 0.364 seconds.
cqlsh> select *from UCBAdmissions;
id | admit | dept | freq | gender
----+----------+------+------+--------
23 | Admitted | F | 24 | Female
5 | Admitted | B | 353 | Male
10 | Rejected | C | 205 | Male
16 | Rejected | D | 244 | Female
13 | Admitted | D | 138 | Male
11 | Admitted | C | 202 | Female
1 | Admitted | A | 512 | Male
19 | Admitted | E | 94 | Female
8 | Rejected | B | 8 | Female
2 | Rejected | A | 313 | Male
4 | Rejected | A | 19 | Female
18 | Rejected | E | 138 | Male
15 | Admitted | D | 131 | Female
22 | Rejected | F | 351 | Male
20 | Rejected | E | 299 | Female
7 | Admitted | B | 17 | Female
6 | Rejected | B | 207 | Male
9 | Admitted | C | 120 | Male
14 | Rejected | D | 279 | Male
21 | Admitted | F | 22 | Male
17 | Admitted | E | 53 | Male
24 | Rejected | F | 317 | Female
12 | Rejected | C | 391 | Female
3 | Admitted | A | 89 | Female
UCBAdmissions.csv
"","Admit","Gender","Dept","Freq"
"1","Admitted","Male","A",512
"2","Rejected","Male","A",313
"3","Admitted","Female","A",89
"4","Rejected","Female","A",19
"5","Admitted","Male","B",353
"6","Rejected","Male","B",207
"7","Admitted","Female","B",17
"8","Rejected","Female","B",8
"9","Admitted","Male","C",120
"10","Rejected","Male","C",205
"11","Admitted","Female","C",202
"12","Rejected","Female","C",391
"13","Admitted","Male","D",138
"14","Rejected","Male","D",279
"15","Admitted","Female","D",131
"16","Rejected","Female","D",244
"17","Admitted","Male","E",53
"18","Rejected","Male","E",138
"19","Admitted","Female","E",94
"20","Rejected","Female","E",299
"21","Admitted","Male","F",22
"22","Rejected","Male","F",351
"23","Admitted","Female","F",24
"24","Rejected","Female","F",317
The output order differs from the order in the CSV file, as seen above.
Question: What is the difference between the first and the second COPY command? Should the table in Cassandra be created with its columns in the same order as the CSV file?
Cassandra is designed to be distributed - to accomplish this, it uses the partition key of your table (id) and hashes it using the cluster's partitioner (probably Murmur3Partitioner) to create an integer (actually a Long), and then uses that integer to assign it to a node in the ring.
What you're seeing are the results ordered by the resulting token, which is non-intuitive, but not wrong. There is no straightforward way to do a SELECT * FROM table ORDER BY primaryKey ASC in Cassandra - the distributed nature makes that difficult to do efficiently.

Tidy Data Layout - convert variables into factors

I have the following data table
| State | Prod. |Non-Prod.|
|-------|-------|---------|
| CA | 120 | 23 |
| GA | 123 | 34 |
| TX | 290 | 34 |
How can I convert this table to tidy data format in R or any other software like Excel?
|State | Class | # of EEs|
|------|----------|---------|
| CA | Prod. | 120 |
| CA | Non-Prod.| 23 |
| GA | Prod. | 123 |
| GA | Non-Prod.| 34 |
Try using reshape2:
library(reshape2)
melt(df,id.vars='State')
# State variable value
# 1 CA Prod 120
# 2 GA Prod 123
# 3 TX Prod 290
# 4 CA Non-Prod. 23
# 5 GA Non-Prod. 34
# 6 TX Non-Prod. 34
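An alternative sketch with tidyr's pivot_longer(), which also lets the new columns be named in the same call. It assumes tidyr 1.0.0 or later; the data frame is rebuilt from the table in the question, and the column name EEs stands in for the "# of EEs" header.

```r
library(tidyr)

# check.names = FALSE keeps "Prod." and "Non-Prod." as written
df <- data.frame(State = c("CA", "GA", "TX"),
                 `Prod.` = c(120, 123, 290),
                 `Non-Prod.` = c(23, 34, 34),
                 check.names = FALSE)

# Every column except State becomes a (Class, EEs) pair
long <- pivot_longer(df, cols = -State,
                     names_to = "Class", values_to = "EEs")
long
```

Unlike melt(), pivot_longer() keeps the rows of each State together, which matches the layout asked for in the question.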

Why won't my column name change work in R?

This is part of a script I'm writing to merge the columns more fully after using merge().
If both datasets have a column with the same name, merge() gives you the columns column.x and column.y. I have written a script to combine this data and to drop the unneeded columns (column.y and column.x_error, a column I've added to give warnings in case dat$column.x != dat$column.y). I also want to rename column.x to column, to reduce manual work on my dataset. I have not managed to rename column.x to column; see the code for more info.
dat is obtained from dat = merge(data1,data2, by= "ID", all.x=TRUE)
#obtain a list of double columns
dubbelkol = cbind()
sorted = sort(names(dat))
for(i in as.numeric(1:length(names(dat)))) {
if(grepl(".x",sorted[i])){
if (grepl(".y", sorted[i+1]) && (sub(".x","",sorted[i])==sub(".y","",sorted[i+1]))){
dubbelkol = cbind(dubbelkol,sorted[i],sorted[i+1])
}
}
}
#Check data, fill in NA in column.x from column.y if poss
temp = cbind()
for (p in as.numeric(1:(length(dubbelkol)-1))){
if(grepl(".x",dubbelkol[p])){
dat[dubbelkol[p]][is.na(dat[dubbelkol[p]])] = dat[dubbelkol[p+1]][is.na(dat[dubbelkol[p]])]
temp = (dat[dubbelkol[p]] != dat[dubbelkol[p+1]])
colnames(temp) = (paste(dubbelkol[p],"_error", sep=""))
dat[colnames(temp)] = temp
}
}
#If every value in "column.x_error" is TRUE or NA, delete "column.y" and "column.x_error"
#Rename "column.x" to "column"
#from here until next comment everything works
droplist= c()
for (k in as.numeric(1:length(names(dat)))) {
if (grepl(".x_error",colnames(dat[k]))) {
if (all(dat[k]==FALSE, na.rm = TRUE)) {
droplist = c(droplist,colnames(dat[k]), sub(".x_error",".y",colnames(dat[k])))
#the next line doesnt work, it's supposed to turn the .x column back to "" before the .y en .y_error columns are dropped.
colnames(dat[sub(".x_error",".x",colnames(dat[k]))])= paste(sub(".x_error","",colnames(dat[k])))
}
}
}
dat = dat[,!names(dat) %in% droplist]
paste(sub(".x_error","",colnames(dat[k]))) will give me "BNR" just fine, but the colnames(...) = ... won't change the column name in dat.
Any idea what's going wrong?
data1
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | NA |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
data2
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
dat
+----+-------+-------+-----------+
| ID | BNR.x | BNR.y |BNR.x_error|
+----+-------+-------+-----------+
| 1 | 123 | NA |FALSE |
| 2 | 234 | 234 |FALSE |
| 3 | NA | 345 |FALSE |
| 4 | 456 | 456 |FALSE |
| 5 | 677 | 677 |FALSE |
| 6 | NA | NA |NA |
+----+-------+-------+-----------+
desired output
+----+-------+
| ID | BNR |
+----+-------+
| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |
+----+-------+
I suggest replacing:
sub(".x_error",".x",colnames(dat[k]))]
with:
sub("\\.x_error", "\\.x", colnames(dat[k]))]
if you wish to match a literal "." character. You have to escape "." as "\\.", because an unescaped "." in a regex matches any character.
Even better, since you are replacing "." with ".", why not just say:
sub("x_error", "x", colnames(dat[k]))]
(or) if there is no other _error other than x_error, simply:
sub("_error", "", colnames(dat[k]))]
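The difference between an escaped and an unescaped "." can be seen on a small example. This is a sketch with a made-up column name, "max.x", chosen because it contains an "x" before the suffix; the variable names are illustrative.

```r
x <- "max.x"

# Unescaped: "." matches any character, so the first "<any char>x" ("ax") is removed
unescaped <- sub(".x", "", x)

# Escaped and anchored: only a literal ".x" at the end of the name is removed
escaped <- sub("\\.x$", "", x)

# fixed = TRUE treats the pattern as a plain string, so no escaping is needed
literal <- sub(".x", "", x, fixed = TRUE)

c(unescaped, escaped, literal)
```

The same applies to the grepl(".x", ...) calls in the question, which also match names that merely contain an "x" preceded by any character.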
Edit: The problem seems to be that your data format loads additional columns on the left and the right. You can select the columns you want first and then merge.
d1 <- read.table(textConnection("| ID | BNR |
| 1 | 123 |
| 2 | 234 |
| 3 | NA |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |"), sep = "|", header = TRUE, stringsAsFactors = FALSE)[,2:3]
d1$BNR <- as.numeric(d1$BNR)
d2 <- read.table(textConnection("| 1 | 123 |
| 2 | 234 |
| 3 | 345 |
| 4 | 456 |
| 5 | 677 |
| 6 | NA |"), header = FALSE, sep = "|", stringsAsFactors = FALSE)[,2:3]
names(d2) <- c("ID", "BNR")
d2$BNR <- as.numeric(d2$BNR)
# > d1
# ID BNR
# 1 1 123
# 2 2 234
# 3 3 NA
# 4 4 456
# 5 5 677
# 6 6 NA
# > d2
# ID BNR
# 1 1 123
# 2 2 234
# 3 3 345
# 4 4 456
# 5 5 677
# 6 6 NA
dat <- merge(d1, d2, by = "ID", all = TRUE)
# > dat
# ID BNR.x BNR.y
# 1 1 123 123
# 2 2 234 234
# 3 3 NA 345
# 4 4 456 456
# 5 5 677 677
# 6 6 NA NA
# replace all NA values in x from y
dat$BNR.x <- ifelse(is.na(dat$BNR.x), dat$BNR.y, dat$BNR.x)
# now remove y
dat$BNR.y <- NULL
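As for the rename itself, a likely explanation (an assumption based on how replacement functions behave on data frames, not something stated in the answer) is that colnames(dat[...]) = ... renames a temporary one-column copy, and assigning that copy back into dat[...] keeps the existing column name. A minimal sketch that renames through names(dat) instead, using a small made-up dat:

```r
# Small stand-in for the merged data
dat <- data.frame(ID = 1:3,
                  BNR.x = c(123, NA, 345),
                  BNR.y = c(123, 234, 345))

# This form leaves the names of dat untouched
colnames(dat["BNR.x"]) <- "BNR"
names_after_attempt <- names(dat)

# Fill the gaps in .x from .y, drop .y, then rename via names(dat)
dat$BNR.x <- ifelse(is.na(dat$BNR.x), dat$BNR.y, dat$BNR.x)
dat$BNR.y <- NULL
names(dat)[names(dat) == "BNR.x"] <- "BNR"
dat
```

Indexing names(dat) with a logical test modifies the names attribute of dat itself, so the change is visible afterwards.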
