I want to construct a data frame based on two data frames.
Here is an example:
#toy example
name <- c("Li", "Pedro", "Dave")
age <- c(20, 30, 40)
d1 <- cbind.data.frame(name, age)
name <- c("Pedro", "Dave", "Grace")
fav_col <- c("red", "blue", "yellow")
lastname <- c("Sanchez", "Stone", "Flint")
fav_food <- c("pizza", "hamburguers", "salad")
d2 <- cbind.data.frame(name, fav_col, lastname, fav_food)
d1$name <- as.character(d1$name)
d2$name <- as.character(d2$name)
cols <- c()
for(i in 1:nrow(d1)) {
some <- dplyr::filter(d2, name==d1$name[i])
cols <- rbind.data.frame(cols, data.frame(some$name, some$fav_col, some$fav_food))
}
Doing this, I obtain a data frame called "cols" that looks like this:
some.name some.fav_col some.fav_food
Pedro red pizza
Dave blue hamburguers
But what I want is
some.name some.fav_col some.fav_food
NA (or empty) NA (or empty) NA (or empty)
Pedro red pizza
Dave blue hamburguers
The first iteration (i = 1) should produce an empty row because there is no Li in the second data frame, and I want that empty row to appear in my data frame. Do you know how I could get this?
At the end I want to add the second and third columns of "cols" to "d1" to get:
name age fav_col fav_food
Li 20 NA (or empty) NA (or empty)
Pedro 30 red pizza
Dave 40 blue hamburguers
Also I don't want the empty spaces that the second data frame could produce like this:
name age fav_col fav_food
Li 20 NA NA
Pedro 30 red pizza
Dave 40 blue hamburguers
Grace NA yellow salad
I just want to merge the tables, keeping only the names from the first data frame and adding the two extra columns. I would appreciate any help.
You can use union_all from dplyr.
library(tidyr)
library(dplyr)
df <- union_all(d1, d2) %>%
  # The factor-to-character conversion is only required when your text columns
  # are stored as factor and not character, because max only works on
  # numeric or character columns.
  mutate_if(is.factor, as.character) %>%
  group_by(name) %>%
  summarise_all(max, na.rm = TRUE) %>%
  ungroup()
df
# # A tibble: 3 x 4
# name age fav_col lastname
# <chr> <dbl> <chr> <chr>
# 1 Dave 40. blue Stone
# 2 Li 20. NA NA
# 3 Pedro 30. red Sanchez
To get your desired output, you can add drop_na to the chain.
df %>% select(name, fav_col) %>% drop_na
# # A tibble: 2 x 2
# name fav_col
# <chr> <chr>
# 1 Dave blue
# 2 Pedro red
I used the following, and I added paste with collapse in case someone needs to combine results from several matching rows into a single cell.
f_add_col <- function(vec) {
add_col <- dplyr::filter(d2, name==vec[1])
return (paste(add_col$fav_col, collapse = "|"))
}
cbind.data.frame(d1, fav_col=apply(d1, 1, f_add_col))
Then I did the same for the column fav_food.
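For the table described in the question (every row of d1, plus fav_col and fav_food where a match exists), a lookup with base R's match() is one possible sketch. It rebuilds the toy data, assumes each name occurs at most once in d2, and uses the illustrative names idx and result, which are not from the original posts. left_join() from dplyr, applied to d1 and the wanted columns of d2, is a similar option.

```r
# Toy data from the question, with character (not factor) columns
d1 <- data.frame(name = c("Li", "Pedro", "Dave"),
                 age = c(20, 30, 40),
                 stringsAsFactors = FALSE)
d2 <- data.frame(name = c("Pedro", "Dave", "Grace"),
                 fav_col = c("red", "blue", "yellow"),
                 lastname = c("Sanchez", "Stone", "Flint"),
                 fav_food = c("pizza", "hamburguers", "salad"),
                 stringsAsFactors = FALSE)

# Position of each d1 name in d2; NA where d2 has no such name (here: Li)
idx <- match(d1$name, d2$name)

# Indexing rows with NA yields a row of NAs, so unmatched names stay in place
result <- cbind(d1, d2[idx, c("fav_col", "fav_food")])
rownames(result) <- NULL
result
```

Because the lookup is driven by d1, names that exist only in d2 (Grace) never enter the result.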
I have a dataframe df that looks like this
City TreeID Age Diameter
City_1 X 1 6
City_1 Y 2 5
City_2 Y 3 5
City_3 X 4 10
I have a variable nominal that can be "TreeID", "Age" or "Diameter", and another variable city that holds the name of a city as a string.
I want to pick up only the values of the column named by nominal for the rows that match the given city name.
For example, with city = "City_1" and nominal = "Age", I should get only the values 1 and 2.
I searched on here, but nothing I found fit my case: because I use variables, I don't know in advance which column I want to choose. I am lost, any help is appreciated. Thanks
Here's a function to do that.
return_value <- function(data, city, nominal) {
data[data$City == city, nominal]
}
return_value(df, 'City_1', 'Age')
#[1] 1 2
return_value(df, 'City_2', 'Diameter')
#[1] 5
You could use pull:
library(dplyr)
df %>%
filter(City == "City_1") %>%
pull(Age)
df %>%
filter(City == "City_1") %>%
pull(Diameter)
[1] 1 2
[1] 6 5
I am working with a data frame that has two columns, name and spouse. I am trying to calculate the interracial marriage frequency, but I need to remove repeated records.
When a creature appears in the name column, I need to keep that record in the data frame but remove the record where that same creature appears as the spouse. I have the following data sample:
   name                      spouse
15 Finarfin                  Eärwen
 6 Tar-Vanimeldë             Herucalmo
17 Faramir                   Éowyn
 8 Tar-Meneldur              Almarian
14 Finduilas of Dol Amroth   Denethor II
12 Finwë                     Míriel Serindë then ,Indis
 9 Tar-Ancalimë              Hallacar
 7 Tar-Míriel                Ar-Pharazôn
 5 Tarannon Falastur         Berúthiel
21 Rufus Burrows             Asphodel Brandybuck
 2 Angrod                    Eldalótë
 4 Ar-Gimilzôr               Inzilbêth
19 Lobelia Sackville-Baggins Otho Sackville-Baggins
25 Mrs. Proudfoot            Odo Proudfoot
22 Rudigar Bolger            Belba Baggins
24 Odo Proudfoot             Mrs. Proudfoot
 3 Ar-Pharazôn               Tar-Míriel
13 Fingolfin                 Anairë
18 Silmariën                 Elatan
23 Rowan Greenhand           Belba Baggins
20 Rían                      Huor
 1 Adanel                    Belemir
16 Fastolph Bolger           Pansy Baggins
10 Morwen Steelsheen         Thengel
11 Tar-Aldarion              Erendis
25 Belemir                   Adanel
For example, line 1 has the name Adanel with Belemir as the spouse, so I need to keep line 1 but remove line 25 (Belemir, Adanel), because it is the same couple and keeping both would duplicate the data.
I have tried this following code:
interacialMariage <-data %>% filter(spouse != name) %>% select(name, spouse)
How can I remove the records whose name already appears as the spouse in an earlier record?
P.S.: The comparison needs to be case-insensitive (Belemir == belemir) so that I don't have problems in the future.
Thanks!
You could set up another vector with the row-wise alphabetically sorted names, and deduplicate using that...
sorted <- sapply(1:nrow(data),
function(i) paste(sort(c(trimws(tolower(data$name[i])),
trimws(tolower(data$spouse[i])))),
collapse=" "))
irM <- data[!duplicated(sorted),]
The trimws strips off any leading or trailing spaces before sorting and pasting, and tolower converts everything to lower case.
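As a small check of this idea, the sketch below applies the same sorted-key approach to a four-row sample in which the last row repeats the first couple with different case and a stray leading space. The helper name pair_key and the sample values are illustrative assumptions, not part of the original data.

```r
# Small sample; row 4 is row 1 reversed, with different case and whitespace
dat <- data.frame(
  name   = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", " belemir"),
  spouse = c("Belemir", "Pansy Baggins", "Thengel", "ADANEL"),
  stringsAsFactors = FALSE
)

# Build an order-independent, case-insensitive key for one couple
pair_key <- function(a, b) {
  paste(sort(c(trimws(tolower(a)), trimws(tolower(b)))), collapse = "|")
}

keys <- mapply(pair_key, dat$name, dat$spouse, USE.NAMES = FALSE)

# duplicated() flags the second and later occurrences, so the first row wins
deduped <- dat[!duplicated(keys), ]
deduped
```

The original spelling of the kept rows is untouched, since the lower-cased key is used only for the comparison.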
My attempt with tidyverse:
library(tidyverse)
dat %>%
mutate(id = 1:n()) %>% # add id to label the pairs
gather('key', 'name', -id) %>% # transform: key (name | spouse), name, id
group_by(name) %>% # group by unique name to find duplicated
top_n(-1, wt = id) %>% # if name > 1, take row with the lower id
spread(key, name) %>% # spread data to original format
select(-id) # remove id's
# # A tibble: 3 x 2
# name spouse
# <chr> <chr>
# 1 Adanel Belemir
# 2 Fastolph Bolger Pansy Baggins
# 3 Morwen Steelsheen Thengel
Data:
dat <- data.frame(
name = c("Adanel", "Fastolph Bolger", "Morwen Steelsheen", "Belemir"),
spouse = c("Belemir", "Pansy Baggins", "Thengel", "Adanel" ),
stringsAsFactors = FALSE
)
I'm working with Census (CTPP) data, and the GEOID field is a long string that contains lots of geographic information. The format of this string changes for various Census tables, but they provide a code lookup. Here are a sample GEOID and format 'code'. (The parts I can already parse have been removed. This is the part of the GEOID I can't parse.)
geoid <- "0202000000126"
format <- "ssccczzzzzzzz"
This means that the first two characters ("02") signify the state (Alaska), the next three ("020") are the county, and the remaining characters are the zone.
I have a table of these geoid/format pairs, and the format can be different for each row.
s: state
c: county
p: place
z: zone
(others not used in this simple example)
df <- data.frame(
  geoid = c(
    "0224230",
    "0202000000126"
  ),
  format = c(
    "ssppppp",
    "ssccczzzzzzzz"
  ),
  # keep both columns as character so string functions work on them
  stringsAsFactors = FALSE
)
# A tibble: 2 x 2
geoid format
<chr> <chr>
1 0224230 ssppppp
2 0202000000126 ssccczzzzzzzz
What I'd like to do is break up the geoid column into columns for each geography like so:
# A tibble: 2 x 6
geoid format s p c z
<chr> <chr> <chr> <chr> <chr> <chr>
1 0224230 ssppppp 02 24230 NA NA
2 0202000000126 ssccczzzzzzzz 02 NA 020 00000126
I've looked at several approaches. extract() from tidyr looked promising. I'm also pretty sure I'll need a custom function that I mapply(?)/map over my data frame.
A base alternative:
geo_codes <- c("s", "c", "p", "z")
# get starting position and lengths of consecutive characters in 'format'
g <- gregexpr("(.)\\1+", df$format)
# use the result above to extract corresponding substrings from 'geoid'
geo <- regmatches(df$geoid, g)
# select first element in each run of 'format' and split
# used to name substrings from above
fmt <- strsplit(gsub("(.)\\1+", "\\1", df$format), "")
# for each element in 'geo' and 'fmt',
# 1. create a named vector
# 2. index the vector with 'geo_codes'
# 3. set names of the full length vector
t(mapply(function(geo, fmt){
setNames(setNames(geo, fmt)[geo_codes], geo_codes)},
geo, fmt))
# s c p z
# [1,] "02" NA "24230" NA
# [2,] "02" "020" NA "00000126"
Another alternative,
geo <- strsplit(df$geoid, "")
fmt <- strsplit(df$format, "")
t(mapply(function(geo, fmt) unlist(lapply(split(geo, factor(fmt, levels = geo_codes)), function(x){
if(length(x)) paste(x, collapse = "") else NA})), geo, fmt))
My first alternative is about 2 times faster than the second, benchmarked on 2e5 rows.
As is so often the case, writing up the question and the minimum example helped me simplify the problem and identify a solution. I'm sure there is a fancier solution out there, but this is what I came up with, and it's easy(ish) to get your head around.
While the formats vary, there are a limited number of unique characters. In the toy example in this problem, only s, c, p, z. So here's what I did:
First, I created a function that takes a single format string, a single geoid string, and a single subgeo character/code. The function determines which character positions in format match subgeo and then returns those positions from geoid.
extract_sub_geo <- function(format, geoid, subgeo) {
geoid_v <- unlist(strsplit(geoid, ""))
format_v <- unlist(strsplit(format, ""))
positions <- which(format_v == subgeo)
result <- paste(geoid_v[positions], collapse = "")
return(result)
}
extract_sub_geo("ssccczzzzzzzz", "0202000000126", "s")
[1] "02"
I then looped over each unique code and used pmap() to apply the function to my entire data frame.
library(dplyr)
library(purrr)
geo_codes <- c("s", "c", "p", "z")
for (code in geo_codes) {
  df <- df %>%
    mutate(
      # call extract_sub_geo once per row; the single code is recycled
      !!code := pmap_chr(list(format, geoid, !!(code)), extract_sub_geo)
    )
}
# A tibble: 2 x 6
geoid format s c p z
<chr> <chr> <chr> <chr> <chr> <chr>
1 0224230 ssppppp 02 "" 24230 ""
2 0202000000126 ssccczzzzzzzz 02 020 "" 00000126
Probably cleaner to do the loop in base R instead of dplyr.
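A base R version of that loop might look like the sketch below. It restates extract_sub_geo in compact form, rebuilds the toy df from the question, and additionally turns empty strings into NA to match the desired output; that last step and the temporary name values are assumptions added here.

```r
# Return the characters of `geoid` at the positions where `format` equals `subgeo`
extract_sub_geo <- function(format, geoid, subgeo) {
  positions <- which(strsplit(format, "")[[1]] == subgeo)
  paste(strsplit(geoid, "")[[1]][positions], collapse = "")
}

df <- data.frame(
  geoid  = c("0224230", "0202000000126"),
  format = c("ssppppp", "ssccczzzzzzzz"),
  stringsAsFactors = FALSE
)

geo_codes <- c("s", "c", "p", "z")
for (code in geo_codes) {
  # mapply walks format and geoid in parallel; subgeo is fixed per iteration
  values <- mapply(extract_sub_geo, df$format, df$geoid,
                   MoreArgs = list(subgeo = code), USE.NAMES = FALSE)
  values[values == ""] <- NA_character_  # a code absent from the format becomes NA
  df[[code]] <- values
}
df
```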
A tidyverse solution:
library(tidyverse)
create_new_code <- function(id, format, char) {
format %>%
str_locate_all(paste0(char, "*", char)) %>%
unlist() %>%
{substr(id, .[1], .[2])}
}
create_new_codes <- function(id, format) {
c("s", "p", "c", "z") %>%
set_names() %>%
map(create_new_code, id = id, format = format)
}
bind_cols(df,
with(df, map2_df(geoid, format, create_new_codes)))
# geoid format s p c z
#1 0224230 ssppppp 02 24230 <NA> <NA>
#2 0202000000126 ssccczzzzzzzz 02 <NA> 020 00000126
I have a data frame in R in which one column holds multiple values per row. Each value starts with ABC, DEF, or GHI, followed by a series of 6 digits (e.g. ABC052689, ABC062895, DEF045158).
For each row, I would like to pull one instance of ABC (the one with the largest number).
If the row has ABC052689, ABC062895, DEF045158, I would like it to pull out ABC062895 because it is greater than ABC052689.
I would then want to do the same for the value that starts with DEF######.
I have managed to filter the data to have rows where ABC is there and either DEF or GHI is there:
library(tidyverse)
data_with_ABC <- test %>%
filter(str_detect(car,"ABC"))
data_with_ABC_and_DEF_or_GHI <- data_with_ABC %>%
filter(str_detect(car, "DEF") | str_detect(car, "GHI"))
I don't know how to pull out let's say ABC with the greatest number
ABC052689, ABC062895, DEF045158 -> ABC062895
For a base R solution, we can try using lapply along with strsplit to identify the greatest ABC plate in each CSV string, in each row.
df <- data.frame(car=c("ABC052689,ABC062895,DEF045158"), id=c(1),
stringsAsFactors=FALSE)
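# sapply (rather than lapply) returns a character vector, not a list column
df$largest <- sapply(df$car, function(x) {
  cars <- strsplit(x, ",", fixed=TRUE)[[1]]
  # keep only the ABC plates
  cars <- cars[substr(cars, 1, 3) == "ABC"]
  # pick the plate whose six-digit part is largest
  largest <- cars[which.max(substr(cars, 4, 9))]
  return(largest)
}, USE.NAMES=FALSE)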
df
car id largest
1 ABC052689,ABC062895,DEF045158 1 ABC062895
Note that we don't need to worry about casting the substring of the plate number, because it is fixed width text. This means that it should sort properly even as text.
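To repeat this for DEF (or any other prefix) without copying the code, the prefix can become a parameter. The function below is a hypothetical sketch: largest_with_prefix is a name introduced here, the second sample row is made up for illustration, and rows with no matching code are assumed to yield NA.

```r
# For each comma-separated string, return the code with the given prefix
# that has the largest numeric part, or NA if no code has that prefix
largest_with_prefix <- function(x, prefix) {
  vapply(strsplit(x, ",", fixed = TRUE), function(codes) {
    codes <- trimws(codes)                      # tolerate "ABC1, ABC2" spacing
    codes <- codes[startsWith(codes, prefix)]
    if (length(codes) == 0) return(NA_character_)
    digits <- as.numeric(substring(codes, nchar(prefix) + 1))
    codes[which.max(digits)]
  }, character(1))
}

df <- data.frame(
  car = c("ABC052689, ABC062895, DEF045158", "DEF000001,GHI123456"),
  stringsAsFactors = FALSE
)
df$largest_ABC <- largest_with_prefix(df$car, "ABC")
df$largest_DEF <- largest_with_prefix(df$car, "DEF")
df
```

Using vapply with character(1) keeps each new column a plain character vector, even when some rows have no match.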
In addition to Tim's answer, if you want to handle all ABC/DEF prefixes at once, the following code using library(tidyverse) may help:
> df <- data.frame(car=c("ABC052689", "ABC062895", "DEF045158", "DEF192345"), stringsAsFactors=FALSE)
>
> df2 = df %>%
+ mutate(state = str_sub(car, 1, 3), plate = str_sub(car, 4, 9))
>
> df2
car state plate
1 ABC052689 ABC 052689
2 ABC062895 ABC 062895
3 DEF045158 DEF 045158
4 DEF192345 DEF 192345
>
> df2 %>%
+ group_by(state) %>%
+ summarise(maxplate = max(plate)) %>%
+ mutate(full = str_c(state, maxplate))
# A tibble: 2 x 3
state maxplate full
<chr> <chr> <chr>
1 ABC 062895 ABC062895
2 DEF 192345 DEF192345
The Problem
I am plotting a bunch of line plots on top of one another, but I only want to color about 10 of them, drawn after all the others, to visualize how my 'targets' traveled over time while still being able to see the mass of other lines behind them. For example, there might be 100 line graphs over time, of which I want to color 5 or 10 so I can discuss them with respect to the trend of the 90 or so grayscale ones.
The following post has a pretty good image that I want to replicate, but with slightly more meat on the bones. Except I want MANY lines behind those 3, all grayscale, while those 3 are my highlighted cities that I want to see in the foreground, per se.
My original data was in the following form:
# The unique identifier is a City-State combo,
# there can be the same cities in 1 state or many.
# Each state's year ranges from 1:35, but may not have
# all of the values available to us, but some are complete.
r1 <- c("city1" , "state1" , "year" , "population" , rnorm(11) , "2")
r2 <- c("city1" , "state2" , "year" , "population" , rnorm(11) , "3")
r3 <- c("city2" , "state1" , "year" , "population" , rnorm(11) , "2")
r4 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "1")
r5 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "7")
df <- data.frame(matrix(nrow = 5, ncol = 16))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- c("City", "State", "Year", "Population", 1:11, "Cluster")
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# City | State | Year | Population | ... 11 Variables ... | Cluster #
# ----------------------------------------------------------------------#
# Each row is a city instance with these features ... #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
But I thought it might be better to view the data differently, so I also have it in the following format. I am not sure which is better for this problem.
cols <- 1:35  # 35 year labels, one per column of the data frame built below
rows <- c("unique_city1", "unique_city2","unique_city3","unique_city4","unique_city5")
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- cols
row.names(df) <- rows
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Year1 Year2 .......... Year 35 #
# UniqueCityState1 VAL NA .......... VAL #
# UniqueCityState2 VAL VAL .......... NA #
# . #
# . #
# . #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
Prior Attempts
I have tried using melt to get the data into a format that is possible for ggplot to accept and plot each of these cities over time, but nothing has seemed to work. Also, I have tried creating my own functions to loop through each of my unique city-state combinations to stack ggplots which had some fair amount of research available on the topic, but nothing yet still. I am not sure how I could find each of these unique citystate pairs and plot them over time taking their cluster value or any numeric value for that matter. Or maybe what I am seeking is not possible, I am not sure.
Thoughts?
EDIT: More information about data structure
> head(df)
city state year population stat1 stat2 stat3 stat4 stat5
1 BESSEMER 1 1 31509 0.3808436 0 0.63473928 2.8563268 9.5528262
2 BIRMINGHAM 1 1 282081 0.3119671 0 0.97489728 6.0266377 9.1321287
3 MOUNTAIN BROOK 1 1 18221 0.0000000 0 0.05488173 0.2744086 0.4390538
4 FAIRFIELD 1 1 12978 0.1541069 0 0.46232085 3.0050855 9.8628448
5 GARDENDALE 1 1 7828 0.2554931 0 0.00000000 0.7664793 1.2774655
6 LEEDS 1 1 7865 0.2542912 0 0.12714558 1.5257470 13.3502861
stat6 stat6 stat7 stat8 stat9 cluster
1 26.976419 53.54026 5.712654 0 0.2856327 9
2 35.670605 65.49183 11.982374 0 0.4963113 9
3 6.311399 21.40387 1.426925 0 0.1097635 3
4 21.266759 68.11527 11.480968 0 1.0787487 9
5 6.770567 23.24987 3.960143 0 0.0000000 3
6 24.157661 39.79657 4.450095 0 1.5257470 15
agg
1 99.93970
2 130.08675
3 30.02031
4 115.42611
5 36.28002
6 85.18754
And ultimately I need it in the form of unique cities as row.names, 1:35 as col.names and the value inside each cell to be agg if that year was present or NA if it wasn't. Again I am sure this is possible, I just can't attain a good solution to it and my current way is unstable.
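One way to build that wide layout in base R is to fill an NA matrix by row and column name. The sketch below uses a tiny made-up long table (the agg values and the city/year combinations are invented for illustration) and assumes a city-state pair is identified by pasting the two columns together; the names long and wide are hypothetical.

```r
# Hypothetical long-format data: one row per city-state and year
long <- data.frame(
  city  = c("BESSEMER", "BESSEMER", "LEEDS"),
  state = c(1, 1, 1),
  year  = c(1, 3, 2),
  agg   = c(99.9, 101.2, 85.2),
  stringsAsFactors = FALSE
)

# Unique identifier for each city-state combination
long$id <- paste(long$city, long$state, sep = "_")
ids <- unique(long$id)

# Start from an all-NA matrix: one row per city-state, one column per year 1:35
wide <- matrix(NA_real_, nrow = length(ids), ncol = 35,
               dimnames = list(ids, as.character(1:35)))

# A two-column character matrix indexes cells by (row name, column name)
wide[cbind(long$id, as.character(long$year))] <- long$agg
wide[, 1:5]
```

Years that never occur for a city simply keep their initial NA.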
If I understand your question correctly, you want to plot all the lines in one color, and then plot a few lines with several different colors. You may use ggplot2, calling geom_line twice on two data frames. The first time plot all city data without mapping lines to color. The second time plot just the subset of your target city and mapping lines to color. You will need to re-organize your original data frame and subset the data frame for the target city. In the following code I used tidyr and dplyr to process the data frame.
### Set.seed to improve reproducibility
set.seed(123)
### Load package
library(tidyr)
library(dplyr)
library(ggplot2)
### Prepare example data frame
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- 1:35
df <- df %>% mutate(City = 1:5)
### Reorganize the data for plotting
df2 <- df %>%
gather(Year, Value, -City) %>%
mutate(Year = as.numeric(Year))
The gather function takes df as its first argument. It creates a key column called Year, which stores the year numbers; these are the names of every column in df except the City column. It also creates a column called Value, which stores the numeric values from every column in df except the City column. Finally, the City column should not be reshaped, so -City tells gather "do not transform the data from the City column".
### Subset df2, select the city of interest
df3 <- df2 %>%
# In this example, assuming that City 2 and City 3 are of interest
filter(City %in% c(2, 3))
### Plot the data
ggplot(data = df2, aes(x = Year, y = Value, group = factor(City))) +
# Plot all city data here in gray lines
geom_line(size = 1, color = "gray") +
# Plot target city data with colors
geom_line(data = df3,
aes(x = Year, y = Value, group = City, color = factor(City)),
size = 2)
The resulting plot can be seen here: https://dl.dropboxusercontent.com/u/23652366/example_plot.png