Why am I having issues with separating rows in a dataframe?

I'm having an issue with separating rows in a dataframe that I'm working in.
In my dataframe, there's a column called officialIndices that I want to separate the rows by. This column stores a list of numbers that act as indices indicating which rows contain the same data. For example, indices 2:3 mean that rows 2 and 3 contain the same data.
Here is the code that I am working with.
library(jsonlite)
library(tidyr)

offices_list <- data_google$offices
offices_JSON <- toJSON(offices_list)
offices_from_JSON <-
  separate_rows(fromJSON(offices_JSON), officialIndices, convert = TRUE)
This is what my offices_list frame looks like (screenshot omitted; see the dput() output below).
This is what it looks like after I try to separate the rows (screenshot omitted).
My code works fine on indices like 2:3, since there is a difference of 1. However, on indices like 7:10 it separates the rows into 7 and 10 instead of 7, 8, 9, 10, which is how I want it to be done. How can I get my code to separate the rows like this?
Output of dput(head(offices_list))
structure(list(position = c("President of the United States",
"Vice-President of the United States", "United States Senate",
"Governor", "Mayor", "Auditor"), divisionId = c("ocd-division/country:us",
"ocd-division/country:us", "ocd-division/country:us/state:or",
"ocd-division/country:us/state:or", "ocd-division/country:us/state:or/place:portland",
"ocd-division/country:us/state:or/place:portland"), levels = list(
"country", "country", "country", "administrativeArea1", NULL,
NULL), roles = list(c("headOfState", "headOfGovernment"),
"deputyHeadOfGovernment", "legislatorUpperBody", "headOfGovernment",
NULL, NULL), officialIndices = list(0L, 1L, 2:3, 4L, 5L,
6L)), row.names = c(NA, 6L), class = "data.frame")

This should work, and it also handles wider ranges: I tested it with officialIndices ranges spanning more than two rows. First extract the start and end indices and use their difference to determine how many rows are needed; tidyr::uncount() then adds that many copies of each row.
library(dplyr)
library(tidyr)

data_sep <- data %>%
  # fill = "right" leaves "end" NA when there is no ":" in the value
  separate(officialIndices, into = c("start", "end"), sep = ":", fill = "right") %>%
  # Use 1 row, and more if "end" is defined and larger than "start"
  mutate(rows = 1 + if_else(is.na(end), 0, as.numeric(end) - as.numeric(start))) %>%
  uncount(rows)
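If you also want each individual index on its own row rather than plain copies, here is a sketch of an alternative, assuming officialIndices arrives as a character column such as "7:10" or "4" (object and column names mirror the pipeline above):
library(dplyr)
library(tidyr)

data_sep <- data %>%
  separate(officialIndices, into = c("start", "end"), sep = ":",
           fill = "right", convert = TRUE) %>%
  mutate(end = coalesce(end, start)) %>%                # single indices have no "end"
  rowwise() %>%
  mutate(officialIndices = list(seq(start, end))) %>%   # expand the full range
  ungroup() %>%
  select(-start, -end) %>%
  unnest(officialIndices)                               # one row per index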

How to match list of characters with partial strings in R?

I am analysing IDs from the RePEc database. Each ID matches a unique publication and sometimes publications are linked because they are different versions of each other (e.g. a working paper that becomes a journal article). I have a database of about 250,000 entries that shows the main IDs in one column and then the previous or alternative IDs in another. It looks like this:
df$repec_id <- c("RePEc:cid:wgha:353", "RePEc:hgd:wpfacu:350", "RePEc:cpi:dynxce:050")
df$alt_repec_id <- c("RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019", "RePEc:tqu:vishdizi:d8z7-200x", "RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050", "RePEc:cid:wgha:353|RePEc:hgd:wpfacu:350")
I want to find out which IDs from the repec_id column are also present in the alt_repec_id column and create a dataframe that only has rows matching this condition. I tried to strsplit at "|" and use the %in% function like this:
df <- separate_rows(df, alt_repec_id, sep = "\\|")
df1 <- df[trimws(df$alt_repec_id) %in% trimws(df$repec_id), ]
df1<- data.frame(df1)
df1 <- na.omit(df1)
df1 <- df1[!duplicated(df1$repec_id),]
It works but I'm worried that by eliminating duplicate rows based on the values in the repec_id column, I'm randomly eliminating matches. Is that right?
Ultimately, I want a dataframe that only contains values in which strings in the repec_id column match the partial strings in the alt_repec_id column. Using the example above, I want the following result:
df$repec_id <- c("RePEc:cpi:dynxce:050")
df$alt_repec_id <- c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050")
Does anyone have a solution to my problem? Thanks in advance for your help!
Try using str_detect() from stringr to identify whether each repec_id exists in the larger alt_repec_id string, then filter() down to the rows where it was found. If this does not return what you expect, try looking at (and posting) a few examples where found_match == FALSE but a match was expected.
library(stringr)
library(dplyr)
df %>%
  mutate(found_match = str_detect(alt_repec_id, repec_id)) %>%
  filter(found_match == TRUE)
Here is a base R solution using grepl() + apply() + subset()
dfout <- subset(df, apply(df, 1, function(v) grepl(v[1], v[2])))
such that
> dfout
repec_id alt_repec_id
3 RePEc:cpi:dynxce:050 RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050
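Note that grepl() treats its pattern as a regular expression. The colons in these IDs are harmless, but if your real IDs can contain regex metacharacters, a literal-matching variant is safer; here is a sketch using base R's mapply() with fixed = TRUE:
dfout <- subset(df, mapply(function(id, alt) grepl(id, alt, fixed = TRUE),
                           as.character(df$repec_id), as.character(df$alt_repec_id)))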
DATA
df <- structure(list(repec_id = structure(c(1L, 3L, 2L), .Label = c("RePEc:cid:wgha:353",
"RePEc:cpi:dynxce:050", "RePEc:hgd:wpfacu:350"), class = "factor"),
alt_repec_id = structure(c(2L, 3L, 1L), .Label = c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050",
"RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019",
"RePEc:tqu:vishdizi:d8z7-200x"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))

How to avoid problems with more rows of data when using merge() in R?

I have gone through numerous posts for this problem and I haven't been able to produce the data frame that I want.
I have two data frames that I would like to merge. However, more rows of data were produced after using the merge function.
Ultimately there should be 6 rows (for this example), but all of the commands give 36 rows. Is it because there might be duplicates, since I am merging on 2 columns?
These are my data and here's what I have already tried.
a <- structure(list(month = c(1L, 1L, 1L, 1L, 1L, 1L), site = c("Port",
"Port", "Port", "Port", "Port", "Port"), max = c(17.1530908785179,
17.6490466820266, 19.8794824562496, 16.6000416246619, 15.8144630183894,
14.4950690162599)), row.names = c(NA, -6L), class = c("tbl_df",
"tbl", "data.frame"))
b <- structure(list(month = c(1, 1, 1, 1, 1, 1), site = c("Port",
"Port", "Port", "Port", "Port", "Port"), slope = c(0.189564181246092,
0.142842264473357, 0.135918209518515, 0.152899782597735, 0.223283613118016,
0.177886719032959)), row.names = c(NA, 6L), class = "data.frame")
What I've tried:
merge(a, b, by=c("month", "site"))
merge(a, b, by=c("month", "site"), all=TRUE)
unique(a) %>%
merge(b, by=c("month", "site"), all =TRUE)
left_join(a, b, by=c("month", "site"))
right_join(a, b, by=c("month", "site"))
I am not sure what I am missing. Any pointers on where the problem is and how to fix it would be really helpful. Thank you.
The problem is that you merge by month and site, which are "1" and "Port" for every entry in both data frames. merge() takes the first entry of data frame b and checks for matches on month and site in data frame a. Because every entry of a is a match (every value of site and month is the same), it merges the first entry of b with all six entries of a. It does the same for each of b's six entries, hence the 36-row result.
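You can see the Cartesian blow-up directly (a quick check using the a and b from the question):
# every one of b's 6 rows matches all 6 rows of a on (month, site):
nrow(merge(a, b, by = c("month", "site")))  # 36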
If you just want to slap the data frames together I would use cbind:
cbind(a, slope = b[, 3])
This is not a task for merge. "month" and "site" do not uniquely identify the observations in the data. Put differently, every value of the "slope" column in b matches each row of a equally well.
Just do cbind:
df <- cbind(a, b[, 3])
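The dplyr equivalent is a one-liner sketch like this, assuming (as with cbind) that the rows of a and b are already aligned one-to-one:
library(dplyr)
df <- bind_cols(a, select(b, slope))  # keep a's columns, append slope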

"unlist" a list stored in a variable

I have a dataframe with two odd variables. For one variable, each cell stores a list whose contents are simply a vector of two numbers. For the other variable, each cell stores a three-dimensional array (even though only two dimensions are necessary) of 8 numbers.
I want to simplify the dataset by breaking these odd variables out into separate variables. I figured out how to break all the data out using a for loop, but this is very slow. I know apply is supposed to be generally quicker, but I can't figure out how I would translate this to apply. Is it possible, or is there a better way to do this?
for (i in 1:nrow(df)) {
  if (length(df$coordinates.coordinates[[i]]) > 0) {
    df[i, "coordinates.lon"] <- df$coordinates.coordinates[[i]][1]
    df[i, "coordinates.lat"] <- df$coordinates.coordinates[[i]][2]
  }
  if (length(df$place.bounding_box.coordinates[[i]]) > 0) {
    df[i, "place.bounding_box.a.lon"] <- df$place.bounding_box.coordinates[[i]][1, 1, 1]
    df[i, "place.bounding_box.b.lon"] <- df$place.bounding_box.coordinates[[i]][1, 2, 1]
    df[i, "place.bounding_box.c.lon"] <- df$place.bounding_box.coordinates[[i]][1, 3, 1]
    df[i, "place.bounding_box.d.lon"] <- df$place.bounding_box.coordinates[[i]][1, 4, 1]
    df[i, "place.bounding_box.a.lat"] <- df$place.bounding_box.coordinates[[i]][1, 1, 2]
    df[i, "place.bounding_box.b.lat"] <- df$place.bounding_box.coordinates[[i]][1, 2, 2]
    df[i, "place.bounding_box.c.lat"] <- df$place.bounding_box.coordinates[[i]][1, 3, 2]
    df[i, "place.bounding_box.d.lat"] <- df$place.bounding_box.coordinates[[i]][1, 4, 2]
  }
}
EDIT
Here is an example dataframe with one case (via dput)
structure(list(coordinates.coordinates = list(c(112.088477, -7.227974
)), place.bounding_box.coordinates = list(structure(c(112.044456,
112.044456, 112.143242, 112.143242, -7.263067, -7.134563, -7.134563,
-7.263067), .Dim = c(1L, 4L, 2L)))), .Names = c("coordinates.coordinates",
"place.bounding_box.coordinates"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -1L))
In case it helps, this is the format you get when you read Twitter stream data with jsonlite's stream_in() function (with flatten = TRUE).
library(dplyr)

df = data_frame(
  coordinates.coordinates = list(c(0, 1), c(2, 3)),
  place.bounding_box.coordinates = list(
    array(0, dim = c(1, 4, 2)),
    array(1, dim = c(1, 4, 2))))

df %>%
  rowwise %>%
  do(with(., data_frame(
    longitude = coordinates.coordinates[1],
    latitude = coordinates.coordinates[2]) %>%
      bind_cols(
        place.bounding_box.coordinates %>%
          as.data.frame %>%
          setNames(c(
            "place.bounding_box.a.lon",
            "place.bounding_box.b.lon",
            "place.bounding_box.c.lon",
            "place.bounding_box.d.lon",
            "place.bounding_box.a.lat",
            "place.bounding_box.b.lat",
            "place.bounding_box.c.lat",
            "place.bounding_box.d.lat")))))

Using "apply" functions across multiple data frames

I'm having an issue using apply functions (which I assume is the right way to do the following) across multiple data frames.
Some example data (3 different data frames, but the problem I'm working on has upwards of 50):
biz <- data.frame(
country = c("england","canada","australia","usa"),
businesses = sample(1000:2500,4))
pop <- data.frame(
country = c("england","canada","australia","usa"),
population = sample(10000:20000,4))
restaurants <- data.frame(
country = c("england","canada","australia","usa"),
restaurants = sample(500:1000,4))
Here's what I ultimately want to do:
1) Sort each data frame from largest to smallest, according to the variable it contains
dataframe <- dataframe[order(dataframe$VARIABLE, decreasing = TRUE), ]
2) then create a vector variable that gives me the rank for each
dataframe$rank <- 1:nrow(dataframe)
3) Then create another data frame that has one column of the countries and the rank for each of the variables of interest as other columns. Something that would look like (rankings aren't real here):
country.rankings <- structure(list(country = structure(c(5L, 1L, 6L, 2L, 3L, 4L), .Label = c("brazil",
"canada", "england", "france", "ghana", "usa"), class = "factor"),
restaurants = 1:6, businesses = c(4L, 5L, 6L, 3L, 2L, 1L),
population = c(4L, 6L, 3L, 2L, 5L, 1L)), .Names = c("country",
"restaurants", "businesses", "population"), class = "data.frame", row.names = c(NA,
-6L))
So I'm guessing there's a way to put each of these data frames together into a list, something like:
lib <- c(biz, pop, restaurants)
And then do an lapply across that to 1) sort, 2)create the rank variable and 3) create the matrix or data frame of rankings for each variable (# of businesses, population size, # of restaurants) for each country. Problem I'm running into is that writing the lapply function to sort each data frame runs into issues when I try to order by the variable:
sort <- lapply(lib,
  function(x) {
    x <- x[order(x[, 2]), ]
  })
returns the error message:
Error in `[.default`(x, , 2) : incorrect number of dimensions
because I'm trying to apply column headings to a list. But how else would I tackle this problem when the variable names are different for every data frame (keeping in mind that the country names are consistent)?
(I would also love to know how to do this using plyr.)
Ideally I'd recommend data.table for this. However, here is a quick solution using data.frame.
Try this:
Step 1: Create a list of all data.frames (note that list() keeps each data frame intact, whereas c() splinters them into one flat list of columns, which is why your lapply failed):
varList <- list(biz, pop, restaurants)
Step 2: Combine all of them into one data.frame:
temp <- varList[[1]]
for (i in 2:length(varList)) temp <- merge(temp, varList[[i]], by = "country")
Step 3: Get the ranks:
cbind(temp, apply(temp[, -1], 2, rank))
You can remove the undesired columns if you want:
cbind(temp[, 1:2], apply(temp[, -1], 2, rank))[, -2]
Hope this helps!!
totaldatasets <- c('biz', 'pop', 'restaurants')

totaldatasetslist <- vector(mode = "list", length = length(totaldatasets))
for (i in seq(length(totaldatasets))) {
  totaldatasetslist[[i]] <- get(totaldatasets[i])
}

totaldatasetslist2 <- lapply(
  totaldatasetslist,
  function(x) {
    # rank the value column of each data frame, keeping the country key
    temp <- data.frame(
      country = x[, 1],
      countryrank = rank(x[, 2])
    )
    colnames(temp) <- c('country', colnames(x)[2])
    return(temp)
  }
)

Reduce(merge, totaldatasetslist2)
Output -
country businesses population restaurants
1 australia 3 3 3
2 canada 2 2 2
3 england 1 1 1
4 usa 4 4 4
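For comparison, the same merge-then-rank idea can be sketched with purrr (this assumes every data frame shares the key column "country"; use rank(desc(.x)) instead of rank if rank 1 should mean the largest value):
library(dplyr)
library(purrr)

list(biz, pop, restaurants) %>%
  map(~ mutate(.x, across(-country, rank))) %>%   # rank each value column
  reduce(merge, by = "country")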

Returning first row of group

I have a dataframe consisting of an ID (the same for each element in a group), two datetimes, and the time interval between the two. One of the datetime objects is my relevant time marker. Now I'd like to get a subset of the dataframe consisting of the earliest entry for each group. The entries (especially the time interval) need to stay untouched.
My first approach was to sort the frame according to 1. ID and 2. relevant datetime. However, I wasn't able to return the first entry for each new group.
I then looked at the aggregate() and ddply() functions, but I could not find an option in either that just returns the first entry without applying an aggregation function to the time interval value.
Is there an (easy) way to accomplish this?
ADDITION:
Maybe I was unclear by adding my aggregate() and ddply() notes. I do not necessarily need to aggregate. Given the fact that the dataframe is sorted in a way that the first row of each new group is the row I am looking for, it would suffice to just return a subset with each row that has a different ID than the one before (which is the start-row of each new group).
Example data:
structure(list(ID = c(1454L, 1322L, 1454L, 1454L, 1855L, 1669L,
1727L, 1727L, 1488L), Line = structure(c(2L, 1L, 3L, 1L, 1L,
1L, 1L, 1L, 1L), .Label = c("A", "B", "C"), class = "factor"),
Start = structure(c(1357038060, 1357221074, 1357369644, 1357834170,
1357913412, 1358151763, 1358691675, 1358789411, 1359538400
), class = c("POSIXct", "POSIXt"), tzone = ""), End = structure(c(1357110430,
1357365312, 1357564413, 1358230679, 1357978810, 1358674600,
1358853933, 1359531923, 1359568151), class = c("POSIXct",
"POSIXt"), tzone = ""), Interval = c(1206.16666666667, 2403.96666666667,
3246.15, 6608.48333333333, 1089.96666666667, 8713.95, 2704.3,
12375.2, 495.85)), .Names = c("ID", "Line", "Start", "End",
"Interval"), row.names = c(NA, -9L), class = "data.frame")
By reproducing the example data frame and testing it I found a way of getting the needed result:
Order data by relevant columns (ID, Start)
ordered_data <- data[order(data$ID, data$Start),]
Find the first row for each new ID
final <- ordered_data[!duplicated(ordered_data$ID),]
As you don't provide any data, here is an example using base R with a sample data frame :
df <- data.frame(group=c("a", "b"), value=1:8)
## Order the data frame with the variable of interest
df <- df[order(df$value),]
## Aggregate
aggregate(df, list(df$group), FUN=head, 1)
EDIT: As Ananda suggests in his comment, the following call to aggregate is better:
aggregate(.~group, df, FUN=head, 1)
If you prefer to use plyr, you can replace aggregate with ddply :
ddply(df, "group", head, 1)
Using ffirst from collapse
library(collapse)
ffirst(df, g = df$group)
data
df <- data.frame(group=c("a", "b"), value=1:8)
This can also be achieved with dplyr, using group_by() and the slice family of functions:
library(dplyr)

data %>%
  group_by(ID) %>%   # assumes rows are already ordered by Start within each ID;
  slice_head(n = 1)  # otherwise add arrange(Start) before grouping
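And since data.table came up earlier, here is a sketch of the same first-row-per-group operation (this assumes the example data above is in data; in data.table, the ordering in i happens before the grouping in by):
library(data.table)

dt <- as.data.table(data)
dt[order(Start), .SD[1], by = ID]  # earliest Start row per ID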
