Get the length of an element on nested list - r

I have a list (of lists) that came from JSON (jsonlite) like this one (dput below)
{
"1":["123", "131", "342"],
"2":["123", "131"],
"3":["123", "131", "352"],
"4":["31", "352"],
"5":["153", "131"],
"6":["153", "131", "382"]
}
structure(list(`1` = c("123", "131", "342"), `2` = c("123", "131" ), `3` = c("123", "131", "352"), `4` = c("31", "352"), `5` = c("153", "131"), `6` = c("153", "131", "382")), .Names = c("1", "2", "3", "4", "5", "6"))
Then, I'm trying to convert it to a data frame with the key and the length of the nested list, like
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 3 2 3 2 2 3
with that code:
a = (read_file("ghist.json") %>% fromJSON)$hist # Reads my list from a JSON file
dates = data.frame() #Creates an empty data frame
#Iterate my list element by element
for(i in 1:length(a)){
dates[1, i] = strtoi(names(a)[i]) #Writes the key from my list (index 'i'), as an integer, to row 1, column 'i' of my data frame
dates[2, i] = length(a[i]) #Here is my problem, it returns '1', not the real length of my list (index 'i')
}
print(dates) #Just debug
With the code above I'm getting
V1 V2 V3 V4 V5 V6
1 1 2 3 4 5 6
2 1 1 1 1 1 1
Note: I know the keys are just increasing numbers here, but they will become dates in ms

You can just use the built-in lengths function to construct your data frame. It returns the length of each list element, which is exactly what you want.
a <- structure(list(`1` = c("123", "131", "342"), `2` = c("123", "131"), `3` = c("123", "131", "352"), `4` = c("31", "352"), `5` = c("153", "131"), `6` = c("153", "131", "382")), .Names = c("1", "2", "3", "4", "5", "6"))
dates <- data.frame(
matrix(
data = c(names(a), lengths(a)),
ncol = length(a),
byrow = TRUE
)
)
dates
#> X1 X2 X3 X4 X5 X6
#> 1 1 2 3 4 5 6
#> 2 3 2 3 2 2 3
The bug in your code is minor, though I wouldn't recommend this approach: you need length(a[[i]]). I suggest you look at some resources on subsetting in R, but to illustrate, compare the two calls at the bottom of the code below. a[1] returns a list containing the desired element, whereas a[[1]] returns the element itself. The length of a[1] is therefore 1.
dates = data.frame()
for(i in 1:length(a)){
dates[1, i] = strtoi(names(a)[i])
dates[2, i] = length(a[[i]]) # changed here
}
dates
#> V1 V2 V3 V4 V5 V6
#> 1 1 2 3 4 5 6
#> 2 3 2 3 2 2 3
a[1]
#> $`1`
#> [1] "123" "131" "342"
a[[1]]
#> [1] "123" "131" "342"
Created on 2018-06-26 by the reprex package (v0.2.0).
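To see the two subsetting operators side by side, here is a minimal sketch using a shortened copy of the list from the question. The names a_small and dates_long are made up for this illustration. It also builds a long-format data frame (one row per key), which may be easier to work with once the keys become timestamps; that layout is a suggestion, not what the question asked for.

```r
a_small <- list(`1` = c("123", "131", "342"), `2` = c("123", "131"))

length(a_small[1])    # 1: a list holding one element
length(a_small[[1]])  # 3: the character vector itself
lengths(a_small)      # length of every element at once

# One row per key instead of one column per key
dates_long <- data.frame(
  key = strtoi(names(a_small)),
  n   = lengths(a_small),
  row.names = NULL
)
dates_long
```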

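Try length(a[[i]]) instead of length(a[i]) in your loop. Note that nrow() does not help here: a[i] is a list, not a data frame or matrix, so nrow(a[i]) returns NULL.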

Related

Merging 2 data sets with different number of rows, matched on a column, and creating NA values

I'm trying to accomplish something that allows me to merge two datasets with differing number of rows, match them on a common column and create NA values where there isn't matching data. For some reason, when I'm merging, the newly created data frame is auto filling values that should be NA and creating extra rows that I don't want. I'm trying to merge df_add (which has a total of 6 rows) into df_main (which has a total of 4 rows) and match the 2 on column "match_id" in df_main and "other_id" in df_add.
df_main <- data.frame (match_id = c("1", "1", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", "2006-09-13", "2006-09-13"),
type = c("Good", "Good", "Bad", "Bad")
)
df_add <- data.frame (other_id = c("1", "1", "1", "2", "2", "2"),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
This code is the closest I've gotten so far - it gives me the 6 rows that I want with the NA values but it doesn't match "match_id" and "other_id"
merge(df_main, df_add, by = 0, all = TRUE)[-1]
This is what I want my final merged data set to look like with only a total of 6 rows:
df_goal <- data.frame (match_id = c("1", "1", "1", "2", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", NA, "2006-09-13", "2006-09-13", NA),
type = c("Good", "Good", NA, "Bad", "Bad", NA),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
df_goal
Is there a way to accomplish this in r? Any help would be greatly appreciated!
This is really not a merge operation, mostly a cbind by-id.
ids <- unique(c(df_main$match_id, df_add$other_id))
ids
# [1] "1" "2"
mains <- split(df_main, df_main$match_id)
adds <- split(df_add, df_add$other_id)
do.call(rbind,
Map(function(x1, x2) {
nr <- max(nrow(x1), nrow(x2))
cbind(
rbind(x1, x1[0,][rep(NA, nr - nrow(x1)),]),
rbind(x2, x2[0,][rep(NA, nr - nrow(x2)),])
)
}, mains[ids], adds[ids])
)
# match_id index_date type other_id measure_date wt
# 1.1 1 2006-09-13 Good 1 2005-01-01 10
# 1.2 1 2006-09-13 Good 1 2005-03-13 11
# 1.NA <NA> <NA> <NA> 1 2005-04-19 15
# 2.3 2 2006-09-13 Bad 2 2005-06-22 60
# 2.4 2 2006-09-13 Bad 2 2005-09-29 42
# 2.NA <NA> <NA> <NA> 2 2005-11-03 33
The use of [ids] is solely to ensure that the _id variables are in the same order. This will run into problems if an id is in one and not the other, though if that's a possibility then it's possible to overcome that ...
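The padding idiom x[0, ][rep(NA, n), ] used in the answer above can look opaque, so here is a small sketch of it in isolation. The data frame x and the names pad and padded are invented for the illustration. Taking zero rows keeps the column names and types, and indexing that empty frame with n logical NAs yields n all-NA rows.

```r
x <- data.frame(id = c("1", "1"), v = c(10, 11))

# Zero-row copy of x, then one NA row with the same columns
pad <- x[0, ][rep(NA, 1), ]

# Append the NA row so x has three rows
padded <- rbind(x, pad)
padded
```

With rep(NA, 0) the index is empty and no rows are added, which is why the same expression works for the longer of the two groups as well.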
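Below is a solution with the package data.table. I have added the variable id_row to number the rows within each *_id group, and the join then uses this counter as a second key. Note that df_main[df_add] keeps every row of df_add (a right join), which is sufficient here because df_add has at least as many rows per id as df_main; for a full outer join you could use merge(df_main, df_add, by = c("match_id", "id_row"), all = TRUE) instead.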
library(data.table)
df_main <- data.frame (match_id = c("1", "1", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", "2006-09-13", "2006-09-13"),
type = c("Good", "Good", "Bad", "Bad")
)
df_add <- data.frame (other_id = c("1", "1", "1", "2", "2", "2"),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
df_goal <- data.frame (match_id = c("1", "1", "1", "2", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", NA, "2006-09-13", "2006-09-13", NA),
type = c("Good", "Good", NA, "Bad", "Bad", NA),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
# convert to data.table
setDT(df_main)
setDT(df_add)
# define a row counter within each match_id group and each other_id group
df_main[ , id_row := 1L:.N, by = match_id]
df_add[ , id_row := 1L:.N, by = other_id]
# rename other_id to match_id
setnames(df_add, "other_id", "match_id")
# set joining keys
setkey(df_main, match_id, id_row)
setkey(df_add, match_id, id_row)
# join on the keys; df_main[df_add] keeps every row of df_add (a right join)
out = df_main[ df_add ]
out
#> match_id index_date type id_row measure_date wt
#> 1: 1 2006-09-13 Good 1 2005-01-01 10
#> 2: 1 2006-09-13 Good 2 2005-03-13 11
#> 3: 1 <NA> <NA> 3 2005-04-19 15
#> 4: 2 2006-09-13 Bad 1 2005-06-22 60
#> 5: 2 2006-09-13 Bad 2 2005-09-29 42
#> 6: 2 <NA> <NA> 3 2005-11-03 33
Created on 2022-09-23 with reprex v2.0.2
You're missing a column to join by. We can create it and then slightly modify your code:
df_main$id2 <- ave(df_main$match_id, df_main$match_id, FUN = seq_along)
df_add$id2 <- ave(df_add$other_id, df_add$other_id, FUN = seq_along)
merge(df_main, df_add, by.x = c("match_id", "id2"), by.y = c("other_id", "id2"), all = TRUE)
#> match_id id2 index_date type measure_date wt
#> 1 1 1 2006-09-13 Good 2005-01-01 10
#> 2 1 2 2006-09-13 Good 2005-03-13 11
#> 3 1 3 <NA> <NA> 2005-04-19 15
#> 4 2 1 2006-09-13 Bad 2005-06-22 60
#> 5 2 2 2006-09-13 Bad 2005-09-29 42
#> 6 2 3 <NA> <NA> 2005-11-03 33
Created on 2022-09-27 by the reprex package (v2.0.1)
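As a rough illustration of what the ave(..., FUN = seq_along) step produces, the sketch below builds the within-group counter on cut-down versions of the two tables (only one payload column each, chosen here for brevity) and then checks the shape of the merged result. It passes seq_along(...) as the first argument so that the counter is an integer; in the answer above the counter inherits the character type of the id column.

```r
df_main <- data.frame(match_id = c("1", "1", "2", "2"),
                      type = c("Good", "Good", "Bad", "Bad"))
df_add <- data.frame(other_id = c("1", "1", "1", "2", "2", "2"),
                     wt = c(10, 11, 15, 60, 42, 33))

# Running row number within each id: 1 2 1 2 and 1 2 3 1 2 3
df_main$id2 <- ave(seq_along(df_main$match_id), df_main$match_id, FUN = seq_along)
df_add$id2  <- ave(seq_along(df_add$other_id), df_add$other_id, FUN = seq_along)

# Full outer join on id plus counter
out <- merge(df_main, df_add,
             by.x = c("match_id", "id2"),
             by.y = c("other_id", "id2"),
             all = TRUE)
out
```

The third row of each id exists only in df_add, so its type is NA in the result.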

Remove empty values without changing the rest of the data frame in r

I have this data frame (much bigger, this is an example):
     V1       V2        V3   V4   V5
row1                    1    2    3
row2 Row1name row1class 4    3    8
row3                    12   6    3
row4 row2name row2class 3    7    5
row5 row3name row3class <NA> <NA> <NA>
row6 row4name row4class <NA> <NA> <NA>
I want to fix the data so that I get the following:
V1 V2 V3 V4 V5
row1 Row1name row1class 1 2 3
row2 Row2name row2class 4 3 8
row3 Row3name row3class 12 6 3
row4 Row4name row4class 3 7 5
Any idea how to remove the empty spaces without changing V3-V5?
If every column has the same number of values left once the empty strings and NAs are dropped, you can use
purrr::map_df(df, ~.x[!(.x == '' | is.na(.x))])
# V1 V2 V3 V4 V5
# <chr> <chr> <chr> <chr> <chr>
#1 Row1name row1class 1 2 3
#2 row2name row2class 4 3 8
#3 row3name row3class 12 6 3
#4 row4name row4class 3 7 5
and similarly in base R :
do.call(cbind.data.frame, lapply(df, function(x) x[!(x == '' | is.na(x))]))
You may want to use type.convert after using the above to change the class of the columns to their respective classes.
data
df <- structure(list(V1 = c("", "Row1name", "", "row2name", "row3name",
"row4name"), V2 = c("", "row1class", "", "row2class", "row3class",
"row4class"), V3 = c("1", "4", "12", "3", NA, NA), V4 = c("2",
"3", "6", "7", NA, NA), V5 = c("3", "8", "3", "5", NA, NA)),
row.names = c("row1", "row2", "row3", "row4", "row5", "row6"),
class = "data.frame")
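To illustrate the type.convert step mentioned above, here is a small sketch on a two-column toy data frame (the names toy and cleaned are invented for this example). It uses as.data.frame(..., stringsAsFactors = FALSE) rather than cbind.data.frame so that the strings stay character on older R versions too; that substitution is my assumption, not part of the answer.

```r
toy <- data.frame(V1 = c("", "Row1name", "", "row2name"),
                  V3 = c("1", "4", NA, NA),
                  stringsAsFactors = FALSE)

# Drop empty strings and NAs column by column, then rebind
cleaned <- as.data.frame(
  lapply(toy, function(x) x[!(x == "" | is.na(x))]),
  stringsAsFactors = FALSE
)

# Let R pick a sensible class for each column; as.is = TRUE keeps strings as character
cleaned <- type.convert(cleaned, as.is = TRUE)
str(cleaned)
```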
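The <NA> entries indicate that columns V3 to V5 hold character data where numeric is needed, so convert them first (as.numeric turns the "<NA>" strings into real NA values, with a coercion warning).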
dat[3:5] <- lapply(dat[3:5], as.numeric)
rowSums returns NA for any row that contains an NA, so we can use it to drop those rows.
res <- dat[!is.na(rowSums(dat[3:5])), ]
Then rebuild the first two columns using paste0.
res <- transform(res,
V1=paste0(rownames(res), "name"),
V2=paste0(rownames(res), "Class"))
Result
res
# V1 V2 V3 V4 V5
# row1 row1name row1Class 1 2 3
# row2 row2name row2Class 4 3 8
# row3 row3name row3Class 12 6 3
# row4 row4name row4Class 3 7 5
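The row filter above relies on how rowSums propagates NA. A tiny sketch with made-up numbers (the names m and keep are hypothetical):

```r
m <- data.frame(a = c(1, NA, 5), b = c(2, 3, NA))

rowSums(m)                  # 3 NA NA: any NA in a row makes the sum NA
keep <- !is.na(rowSums(m))  # TRUE only for complete rows
m[keep, ]
```

Using complete.cases(m) would give the same logical vector for numeric columns like these.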
Data
dat <- structure(list(V1 = c("", "Row1name", "", "row2name", "row3name",
"row4name"), V2 = c("", "row1class", "", "row2class", "row3class",
"row4class"), V3 = c("1", "4", "12", "3", "<NA>", "<NA>"), V4 = c("2",
"3", "6", "7", "<NA>", "<NA>"), V5 = c("3", "8", "3", "5", "<NA>",
"<NA>")), row.names = c("row1", "row2", "row3", "row4", "row5",
"row6"), class = "data.frame")

Fill missing values in a data frame

Hi, I need to fill in the missing values of a data frame. The logic is simple: if there is a value in M[i, j + 1], use M[i, j + 1]; otherwise use M[i, j - 1]. The tricky part is that, for each row, I need to fill the missing values from the beginning of the row up to the column after the last non-NA value, not only the cells adjacent to non-empty cells.
Here is the data
library(dplyr) # provides %>%, group_by, do and mutate used below
a1 <- c('a',9,8,rep(NA,5))
a2 <- c('b',NA,NA,NA,NA,3,NA,4)
a3 <- c('c',11,6,7,NA,NA,NA,6)
M <- rbind(a1,a2,a3)
ind <- !is.na(M[,-1])
t <- tapply(M[,-1][ind], row(M[,-1])[ind], head, 1)
M <- M %>%
as.data.frame(stringsAsFactors = FALSE) %>%
group_by(V1) %>%
do(mutate(., last_non_na_col = max(apply(.,1,function(x) max(which(!is.na(x)))))))
for (i in 1:nrow(M)) {
for (j in 3:(M$last_non_na_col[i]+1)) {
if (is.na(M[i,j])) {
M[i,j] = ifelse(!is.na(M[i,j+1]),M[i,j+1],(ifelse(!is.na(M[i,j-1]),M[i,j-1],t[i])))
} }
for (j in 2) { M[i,j] = ifelse(is.na(M[i,j]), M[i,j+1], M[i,j])}
}
The raw data is like this
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
a1 "a" "9" "8" NA NA NA NA NA
a2 "b" NA NA NA NA "3" NA "4"
a3 "c" "11" "6" "7" NA NA NA "6"
The output of my code is the following, which is correct. Please notice that for cell M[3,5], the filled value should be 7 (the number before it), not 6 (the nearest number after it).
V1 V2 V3 V4 V5 V6 V7 V8 last_non_na_col
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <int>
1 a 9 8 8 NA NA NA NA 3
2 b 3 3 3 3 3 4 4 8
3 c 11 6 7 7 7 6 6 8
I did this with a for loop. Can anyone help me do this in the tidyverse?
Thanks,
Cathy
As we have a tbl_df, we could use tidyverse methods
library(tidyverse)
gather(M, key, val, -V1) %>%
group_by(V1) %>%
fill(val, .direction = 'up') %>%
mutate(val = replace(val, which(is.na(val))[1],
val[tail(which(!is.na(val)), 1)])) %>%
spread(key, val)
# A tibble: 3 x 8
# Groups: V1 [3]
# V1 V2 V3 V4 V5 V6 V7 V8
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 a 9 8 8 NA NA NA NA
#2 b 3 3 3 3 3 4 4
#3 c 11 6 7 5 5 6 6
In the OP's for loop, we could use na.locf (to fill up the NA elements by the adjacent non-NA elements - from zoo package)
library(zoo)
last_non_na_col <- c(3, 8, 8)
for (i in seq_len(nrow(M))) {
M[i, -1] <- na.locf(unlist(M[i, -1]), fromLast = TRUE, na.rm = FALSE)
for (j in 3:(pmin(ncol(M), last_non_na_col[i]+1))) {
if (is.na(M[i,j])) {
M[i,j] = ifelse(!is.na(M[i,j+1]), M[i,j+1], M[i,j-1])
}
}
}
M
# A tibble: 3 x 8
# Groups: V1 [3]
# V1 V2 V3 V4 V5 V6 V7 V8
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#1 a 9 8 8 NA NA NA NA
#2 b 3 3 3 3 3 4 4
#3 c 11 6 7 5 5 6 6
NOTE: Here, we created last_non_na_col as a vector instead of a separate column in the dataset for ease of indexing
data
M <- structure(list(V1 = c("a", "b", "c"), V2 = c("9", NA, "11"),
V3 = c("8", NA, "6"), V4 = c(NA, NA, "7"), V5 = c(NA_character_,
NA_character_, NA_character_), V6 = c(NA, "3", "5"), V7 = c(NA_character_,
NA_character_, NA_character_), V8 = c(NA, "4", "6")), .Names = c("V1",
"V2", "V3", "V4", "V5", "V6", "V7", "V8"), row.names = c(NA,
-3L), class = c("grouped_df", "tbl_df", "tbl", "data.frame"),
vars = "V1", drop = TRUE, indices = list(
0L, 1L, 2L), group_sizes = c(1L, 1L, 1L), biggest_group_size = 1L,
labels = structure(list(
V1 = c("a", "b", "c")), row.names = c(NA, -3L),
class = "data.frame", vars = "V1", drop = TRUE, .Names = "V1"))
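For comparison, the rule as stated in the question (take the cell to the right if it is filled, otherwise the cell to the left, and stop one column after the last non-NA value) can be written as a small base R helper. This is only a sketch: fill_row is a hypothetical name, it works on the value columns only (without the V1 label), and it is run here on the question's original rows rather than on the data block above.

```r
fill_row <- function(x) {
  ok <- which(!is.na(x))
  if (length(ok) == 0L) return(x)           # nothing to fill from
  first <- x[ok[1]]                         # fallback for leading NAs
  stop_at <- min(length(x), max(ok) + 1L)   # one past the last non-NA value
  for (j in seq_len(stop_at)) {
    if (is.na(x[j])) {
      nxt <- if (j < length(x)) x[j + 1L] else NA
      prv <- if (j > 1L) x[j - 1L] else NA
      x[j] <- if (!is.na(nxt)) nxt else if (!is.na(prv)) prv else first
    }
  }
  x
}

vals <- rbind(a = c("9", "8", NA, NA, NA, NA, NA),
              b = c(NA, NA, NA, NA, "3", NA, "4"),
              c = c("11", "6", "7", NA, NA, NA, "6"))

filled <- t(apply(vals, 1, fill_row))
filled
```

On row c this gives 7 7 6 for the three gaps, as the question requires, whereas a purely upward fill would give 6 6 6 for that row.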

r dataframe using rank

I would like to rank each row of a dataframe (with 30 columns) which has numerical values ranging from -inf to +inf.
This is what I have:
df <- structure(list(StockA = c("-5", "3", "6"),
StockB = c("2", "-1", "3"),
StockC = c("-3", "-4", "4")),
.Names = c( "StockA","StockB", "StockC"),
class = "data.frame", row.names = c(NA, -3L))
> df
StockA StockB StockC
1 -5 2 -3
2 3 -1 -4
3 6 3 4
This is what I would like to have:
> df_rank
StockA StockB StockC
1 3 1 2
2 1 2 3
3 1 3 2
I am using this command:
> rank(df[1,])
StockA StockB StockC
2 3 1
The resulting rank variables are not correct though as you can see.
rank() assigns the lowest rank to the smallest value.
So the short answer to your question is to use rank of the vector multiplied by -1:
rank (-c(-5, 2, -3) )
[1] 3 1 2
Here is the full code:
# data frame definition. The numbers should actually be integers as pointed out
# in comments, otherwise the rank command will sort them as strings
# So in the real world you should define them as integers,
# but to go with your data I will convert them to integers in the next step
df <- structure(list(StockA = c("-5", "3", "6"),
StockB = c("2", "-1", "3"),
StockC = c("-3", "-4", "4")),
.Names = c( "StockA","StockB", "StockC"),
class = "data.frame", row.names = c(NA, -3L))
# since you plan to rank them not as strings, but numbers, you need to convert
# them to integers:
df[] <- lapply(df,as.integer)
# apply will return a matrix or a list and you need to
# transpose the result and convert it back to a data.frame if needed
result <- as.data.frame(t( apply(df, 1, FUN=function(x){ return(rank(-x)) }) ))
result
# StockA StockB StockC
# 3 1 2
# 1 2 3
# 1 3 2
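If the values are already numeric, the same row-wise descending rank can be sketched on a matrix, which avoids the round trip through a data frame. The matrix m below simply restates the question's numbers; ties.method = "min" is an assumption about how equal values should be treated (they share the best rank), since the question does not say.

```r
m <- matrix(c(-5,  2, -3,
               3, -1, -4,
               6,  3,  4),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("StockA", "StockB", "StockC")))

# Negate so the largest value gets rank 1, rank each row, transpose back
ranks <- t(apply(-m, 1, rank, ties.method = "min"))
ranks
#      StockA StockB StockC
# [1,]      3      1      2
# [2,]      1      2      3
# [3,]      1      3      2
```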

Setting different values in duplicated observations to NA

I have a data frame (DF) that looks as follows:
structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE",
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1",
"1"), VALUE3 = c("2", "1", "1", "2", "1"), VALUE4 = c("2", "1",
"2", "1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3",
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")
Uniqueness in this case is defined by ID and REPORTER. So the DF above contains a duplicate for the ID 123 and REPORTER ONE and for the ID 789 and REPORTER THREE. Since I cannot tell which values of VALUE1 to VALUE4 are the correct ones, I would like to set to NA all values that differ within a duplicate.
This means I first have to identify the VALUE columns that contain different values. These are the ones to be set to NA. For the rest I would like to keep the data, since there I can tell the value is correct.
The expected output would look like this:
structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE",
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1",
"1"), VALUE3 = c(NA, NA, "1", NA, NA), VALUE4 = c(NA, NA, "2",
"1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3",
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")
The goal is to ensure data quality. I don't want to just remove the problem cases, since I can use the values that do not differ for analysis. But I also don't want to just use one of the rows, because this would lead to wrong conclusions if I had chosen the wrong values.
How can I do this?
I think this is what you are looking for:
library(reshape2)
DFL <- melt(cbind(rn = 1:nrow(DF), DF), id.vars=c("rn", "ID", "REPORTER"))
DFL$value2 <- ave(DFL$value, DFL[c("ID", "REPORTER", "variable")],
FUN = function(x) {
ifelse(length(unique(x)) > 1, NA, x)
})
dcast(DFL, rn + ID + REPORTER ~ variable, value.var = "value2")
# rn ID REPORTER VALUE1 VALUE3 VALUE4
# 1 1 123 ONE 1 <NA> <NA>
# 2 2 123 ONE 1 <NA> <NA>
# 3 3 456 TWO 2 1 2
# 4 4 789 THREE 1 <NA> 1
# 5 5 789 THREE 1 <NA> 1
As you can see, I had to add a dummy "rn" supplementary ID variable to make sure that dcast wouldn't just collapse all the values into one row per ID+REPORTER combination.
Update
This is actually also entirely doable with base R's reshape and the ave step described above:
DFL <- reshape(DF, direction = "long",
varying = grep("VALUE", names(DF)), sep = "")
DFL <- within(DFL, {
VALUE <- ave(VALUE, ID, REPORTER, time, FUN = function(x)
ifelse(length(unique(x)) > 1, NA, x))
})
reshape(DFL)
# ID REPORTER id VALUE1 VALUE3 VALUE4
# 1.1 123 ONE 1 1 <NA> <NA>
# 2.1 123 ONE 2 1 <NA> <NA>
# 3.1 456 TWO 3 2 1 2
# 4.1 789 THREE 4 1 <NA> 1
# 5.1 789 THREE 5 1 <NA> 1
In the last line above, the attributes from the original reshape statement make it so we don't have to even worry about what arguments we need to put in. :-)
I created a function replaceDifferent() that looks like this:
replaceDifferent <- function(vector){
max <- max(vector)
min <- min(vector)
test <- max == min
if (!test){
return(NA)
}
else{
return(min(vector))
}
}
Then I melted the DF with melt() from the reshape2 package:
DFmelt <- melt(DF, id = c("ID", "REPORTER"))
After that I was able to apply the new function to the melted data frame with ddply() from the plyr package:
DFres <- ddply(DFmelt, .(ID, REPORTER, variable), function(x){replaceDifferent(x$value)})
To get the result data frame with duplicates removed I called dcast() on DFres:
DFres <- dcast(DFres, ID+REPORTER ~ variable)
This produces a slightly different output than the one I asked for, but is better in the way that I do not have to deal with duplicates anymore.
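For readers who want the output originally asked for (all rows kept, differing values set to NA) without reshaping, here is a sketch that stays in wide format and uses only base R. It is a variation on the ave idea from the first answer, not the method of any answer above; value_cols and n_distinct are names made up for this example.

```r
DF <- data.frame(
  ID       = c("123", "123", "456", "789", "789"),
  REPORTER = c("ONE", "ONE", "TWO", "THREE", "THREE"),
  VALUE1   = c("1", "1", "2", "1", "1"),
  VALUE3   = c("2", "1", "1", "2", "1"),
  VALUE4   = c("2", "1", "2", "1", "1"),
  stringsAsFactors = FALSE
)

value_cols <- grep("^VALUE", names(DF), value = TRUE)

DF[value_cols] <- lapply(DF[value_cols], function(v) {
  # Number of distinct values of v within each ID/REPORTER group
  n_distinct <- ave(seq_along(v), DF$ID, DF$REPORTER,
                    FUN = function(i) length(unique(v[i])))
  v[n_distinct > 1] <- NA   # blank out groups that disagree
  v
})
DF
```

Passing row positions to ave and looking the values up inside FUN keeps the count an integer even though the value columns are character.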
