Convert json list to data frame - r

I am having an issue converting a JSON file to a data frame.
I used jsonlite's fromJSON() function together with unlist(), but I cannot manage to get the data into the shape I want.
The JSON file is structured this way:
{"JOHN":["AZ","YZ","ZE","ZR","FZ"],"MARK":["FZ","JF","FS"],"LINDA":["FZ","RZ","QF"]}
And I would like to have a data frame similar to this:
NAME GROUP
JOHN AZ
JOHN YZ
JOHN ZE
JOHN ZR
JOHN FZ
MARK FZ
MARK JF
MARK FS
...
Thanks!

We can use fromJSON from jsonlite to get a list of key/value vectors, convert that to a two-column data.frame with stack, rearrange the columns, and rename them (if needed).
library(jsonlite)
setNames(stack(fromJSON(str1))[2:1], c("NAME", "GROUP"))
# NAME GROUP
#1 JOHN AZ
#2 JOHN YZ
#3 JOHN ZE
#4 JOHN ZR
#5 JOHN FZ
#6 MARK FZ
#7 MARK JF
#8 MARK FS
#9 LINDA FZ
#10 LINDA RZ
#11 LINDA QF
data
str1 <- '{"JOHN":["AZ","YZ","ZE","ZR","FZ"],"MARK":["FZ","JF","FS"],"LINDA":["FZ","RZ","QF"]}'
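For reference, the same reshaping can be done in base R without stack(): once the JSON is parsed it is just a named list of character vectors, so rep() over names() with lengths() builds the NAME column directly. A sketch, writing out the list literal in place of the parsed JSON:

```r
# equivalent of fromJSON(str1): a named list of character vectors
lst <- list(JOHN  = c("AZ", "YZ", "ZE", "ZR", "FZ"),
            MARK  = c("FZ", "JF", "FS"),
            LINDA = c("FZ", "RZ", "QF"))

# repeat each name once per element of its vector, then flatten the values
res <- data.frame(NAME  = rep(names(lst), lengths(lst)),
                  GROUP = unlist(lst, use.names = FALSE))
```

This avoids the column reordering step, since the columns are created in the desired order.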


Join of 2 dataframes [duplicate]

I have 2 dataframes and I want to join by name, but names are not written exactly the same:
Df1:
ID Name Age
1 Jose 13
2 M. Jose 12
3 Laura 8
4 Karol P 32
Df2:
Name Surname
José Hall
María José Perez
Laura Alza
Karol Smith
I need to join and get this:
ID Name Age Surname
1 Jose 13 Hall
2 M. Jose 12 Perez
3 Laura 8 Alza
4 Karol P 32 Smith
How can I account for names that are not written exactly the same before joining?
You can get close to your result using stringdist_left_join from fuzzyjoin
library(fuzzyjoin)
stringdist_left_join(df1, df2, by = "Name")
# ID Name.x Age Name.y Surname
#1 1 Jose 13 José Hall
#2 2 M. Jose 12 <NA> <NA>
#3 3 Laura 8 Laura Alza
#4 4 Karol P 32 Karol Smith
For the example shared it does not work for one entry, since it is difficult to match "María" with "M.". You could get a match for it by raising the max_dist argument (the default is 2); however, this will also produce unwanted matches elsewhere. If only a few NA entries remain after the join (as in the example shared), you could just match them by hand.
I would clean the data first (for example removing the accents, such as ´; in Excel this is an easy find-and-replace) and then use
new_df <- merge(df1, df2, by = "Name")
or you could try to assign an ID in df2 that coincides with df1, if that is possible.
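The accent cleanup can also be done in R itself with iconv() before merging. A minimal sketch (note: the //TRANSLIT target is platform-dependent, and abbreviations like "M." for "María" would still need manual handling):

```r
df1 <- data.frame(ID = 1:2, Name = c("Jose", "Laura"), Age = c(13, 8))
df2 <- data.frame(Name = c("José", "Laura"), Surname = c("Hall", "Alza"))

# transliterate accented characters so "José" becomes "Jose"
df2$Name <- iconv(df2$Name, from = "UTF-8", to = "ASCII//TRANSLIT")

merge(df1, df2, by = "Name")
```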

Create weight node and edges lists from a normal dataframe in R?

I'm trying to use visNetwork to create a node diagram. However, my data is not in the correct format and I haven't been able to find any help on this on the internet.
My current data frame looks similar to this:
name town car color age school
John Bringham Swift Red 22 Brighton
Sarah Bringham Corolla Red 33 Rustal
Beth Burb Swift Blue 43 Brighton
Joe Spring Polo Black 18 Riding
I want to use this to create nodes and edges lists that can be used to build a visNetwork diagram.
I know that the "nodes" list will be made from the unique values in the "name" column but I'm not sure how I would use the rest of the data to create the "edges" list?
I was thinking that it may be possible to group by each column and then read back the matches from this function but I am not sure how to implement this. The idea that I thought of is to weight the edges based on how many matches they detect in the various group by functions. I'm not sure how to actually implement this yet.
For example, Joe will not match with anyone because he shares no common columns with any of the others. John and Sarah will have a weight of 2 because they share two common columns.
Also open to solutions in python!
One option is to compare row by row, in order to calculate the number of common values.
For instance for John (first row) and Sarah (second row):
sum(df[1,] == df[2,])
# 2
Then use combn() from the utils package to enumerate in advance all the pairwise combinations you have to calculate:
nodes <- as.data.frame(matrix(combn(df$name, 2), ncol = 2, byrow = TRUE))
nodes$V1 <- as.character(nodes$V1)
nodes$V2 <- as.character(nodes$V2)
nodes$weight <- NA
nodes
# V1 V2 weight
#1 John Sarah NA
#2 John Beth NA
#3 John Joe NA
#4 Sarah Beth NA
#5 Sarah Joe NA
#6 Beth Joe NA
Finally, a loop to calculate the weight for each pair:
for(n in 1:nrow(nodes)){
name1 <- df[df$name == nodes$V1[n],]
name2 <- df[df$name == nodes$V2[n],]
nodes$weight[n] <- sum(name1 == name2)
}
# V1 V2 weight
#1 John Sarah 2
#2 John Beth 2
#3 John Joe 0
#4 Sarah Beth 0
#5 Sarah Joe 0
#6 Beth Joe 0
I think nodes will be the kind of data frame that you can use in visNetwork().
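The pair loop can also be collapsed into one pass over row indices with combn() and apply(). A sketch, using made-up data in the shape of the question:

```r
df <- data.frame(name   = c("John", "Sarah", "Beth", "Joe"),
                 town   = c("Bringham", "Bringham", "Burb", "Spring"),
                 car    = c("Swift", "Corolla", "Swift", "Polo"),
                 color  = c("Red", "Red", "Blue", "Black"),
                 age    = c(22, 33, 43, 18),
                 school = c("Brighton", "Rustal", "Brighton", "Riding"),
                 stringsAsFactors = FALSE)

# each column of `pairs` holds the row indices of one pair;
# for each pair, count how many attribute columns (all but name) match
pairs <- combn(nrow(df), 2)
edges <- data.frame(from   = df$name[pairs[1, ]],
                    to     = df$name[pairs[2, ]],
                    weight = apply(pairs, 2, function(p)
                                   sum(df[p[1], -1] == df[p[2], -1])))
```

This reproduces the weights above (John-Sarah 2, John-Beth 2, everything else 0) without building the NA column first.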

How to convert a n x 3 data frame into a square (ordered) matrix?

I need to reshape a table or (data frame) to be able to use an R package (NetworkRiskMetrics). Suppose I have a data frame of lenders, borrowers and loan values:
lender borrower loan_USD
John Mark 100
Mark Paul 45
Joe Paul 30
Dan Mark 120
How do I convert this data frame into:
John Mark Joe Dan Paul
John
Mark
Joe
Dan
Paul
(placing zeros in empty cells)?
Thanks.
Use reshape function
d <- data.frame(lander=c('a','b','c', 'a'), borower=c('m','p','m','p'), loan=c(10,20,15,12))
d
  lander borower loan
1      a       m   10
2      b       p   20
3      c       m   15
4      a       p   12
reshape(data = d, direction = 'wide', idvar = 'lander', timevar = 'borower')
  lander loan.m loan.p
1      a     10     12
2      b     NA     20
3      c     15     NA
The NAs can then be replaced with zeros, e.g. via is.na().
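For the square, zero-filled matrix the question asks for, a more direct route is xtabs() after giving both columns the same factor levels. A sketch using the lender/borrower data from the question:

```r
df <- data.frame(lender   = c("John", "Mark", "Joe", "Dan"),
                 borrower = c("Mark", "Paul", "Paul", "Dan"),
                 loan_USD = c(100, 45, 30, 120),
                 stringsAsFactors = FALSE)
df$borrower[4] <- "Mark"   # Dan lends to Mark, as in the question

# both dimensions need the same levels so the result is square
people <- union(df$lender, df$borrower)
df$lender   <- factor(df$lender,   levels = people)
df$borrower <- factor(df$borrower, levels = people)

m <- xtabs(loan_USD ~ lender + borrower, data = df)
```

xtabs() fills the cells with no observations with 0, so no extra step is needed.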

Splitting a string few characters after the delimiter

I have a large data set of names and states that I need to split. After splitting, I want to create a new row for each name and state. My data strings span multiple lines and look like this:
"Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ"
"Ralph Hogan, TX, Michael Johnson, FL"
I need the data to look like this
attr name state
1 Peter Johnson IN
2 Chet Charles TX
3 Ed Walsh AZ
4 Ralph Hogan TX
5 Michael Johnson FL
I can't figure out how to do this, perhaps split it somehow a few characters after the comma? Any help would be greatly appreciated.
If the input is multiple strings, we can create a delimiter with gsub, split the strings with strsplit, build a data.frame from each component of the resulting list, and rbind them together.
d1 <- do.call(rbind, lapply(strsplit(gsub("([A-Z]{2})(\\s+|,)", "\\1;", lines),
                                     "[,;]"), function(x) {
  x1 <- trimws(x)
  data.frame(name = x1[c(TRUE, FALSE)], state = x1[c(FALSE, TRUE)])
}))
cbind(attr = seq_len(nrow(d1)), d1)
# attr name state
#1 1 Peter Johnson IN
#2 2 Chet Charles TX
#3 3 Ed Walsh AZ
#4 4 Ralph Hogan TX
#5 5 Michael Johnson FL
Or this can be done in a compact way
library(data.table)
fread(paste(gsub("([A-Z]{2})(\\s+|,)", "\\1\n", lines), collapse="\n"),
col.names = c("names", "state"), header = FALSE)[, attr := 1:.N][]
# names state attr
#1: Peter Johnson IN 1
#2: Chet Charles TX 2
#3: Ed Walsh AZ 3
#4: Ralph Hogan TX 4
#5: Michael Johnson FL 5
data
lines <- readLines(textConnection("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ
Ralph Hogan, TX, Michael Johnson, FL"))
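An alternative sketch with base-R regmatches(): extract every "Name, ST" pair directly, then split each pair on the comma. The pattern assumes states are exactly two capital letters, as in the sample:

```r
lines <- c("Peter Johnson, IN Chet Charles, TX Ed Walsh, AZ",
           "Ralph Hogan, TX, Michael Johnson, FL")

# grab each "one or more words, two capital letters" chunk
pairs <- unlist(regmatches(lines,
                gregexpr("[A-Za-z.]+( [A-Za-z.]+)*, [A-Z]{2}", lines)))

# split each "Name, ST" pair into its two parts
parts <- do.call(rbind, strsplit(pairs, ", "))
d2 <- data.frame(attr = seq_along(pairs), name = parts[, 1], state = parts[, 2])
```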

Lookup values in a vectorized way

I keep reading about the importance of vectorized functionality so hopefully someone can help me out here.
Say I have a data frame with two columns: name and ID. I also have another data frame with name and birthplace; this data frame is much larger than the first and contains some, but not all, of the names from the first data frame. How can I add a third column to the first table, populated with birthplaces looked up in the second table?
What I have is now is:
corresponding.birthplaces <- sapply(table1$Name,
function(name){return(table2$Birthplace[table2$Name==name])})
This seems inefficient. Thoughts? Does anyone know of a good book/resource on using R 'properly'? I get the feeling that I generally think in the least computationally effective manner conceivable.
Thanks :)
See ?merge, which performs a database-style merge or join.
Here is an example:
set.seed(2)
d1 <- data.frame(ID = 1:5, Name = c("Bill","Bob","Jessica","Jennifer","Robyn"))
d2 <- data.frame(Name = c("Bill", "Gavin", "Bob", "Joris", "Jessica", "Andrie",
"Jennifer","Joshua","Robyn","Iterator"),
Birthplace = sample(c("London","New York",
"San Francisco", "Berlin",
"Tokyo", "Paris"), 10, rep = TRUE))
which gives:
> d1
ID Name
1 1 Bill
2 2 Bob
3 3 Jessica
4 4 Jennifer
5 5 Robyn
> d2
Name Birthplace
1 Bill New York
2 Gavin Tokyo
3 Bob Berlin
4 Joris New York
5 Jessica Paris
6 Andrie Paris
7 Jennifer London
8 Joshua Paris
9 Robyn San Francisco
10 Iterator Berlin
Then we use merge() to do the join:
> merge(d1, d2)
Name ID Birthplace
1 Bill 1 New York
2 Bob 2 Berlin
3 Jennifer 4 London
4 Jessica 3 Paris
5 Robyn 5 San Francisco
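If you want to keep the first table's row order and just add a column (closest in spirit to the sapply attempt, but vectorized), match() is the usual base-R idiom. A sketch with smaller made-up frames, where names missing from the lookup table simply come back as NA:

```r
d1 <- data.frame(ID = 1:3, Name = c("Bill", "Bob", "Ann"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(Name = c("Bob", "Bill"),
                 Birthplace = c("Berlin", "New York"),
                 stringsAsFactors = FALSE)

# match() gives, for each d1 name, its position in d2 (NA when absent),
# so a single indexing step performs the whole lookup
d1$Birthplace <- d2$Birthplace[match(d1$Name, d2$Name)]
```

Unlike merge(), this never reorders or drops rows of d1, which is often what you want when "adding a column".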
