populating data based on column/rownames with unequal row number [duplicate] - r

I need to populate an empty data frame with values based on the values in its first column (or, alternatively, its row names; either is fine for me in this case). So here are three objects:
set.seed(11)
empty_df = data.frame(cities = c("New York", "London", "Rome", "Vienna", "Amsterdam"),
                      col.a = rep(NA, 5),
                      col.b = rep(NA, 5),
                      col.c = rep(NA, 5))
values = rnorm(4, 0, 1)
to_fill = data.frame(cities = c("New York", "London", "Vienna", "Amsterdam"),
                     col.a = values)
desired_output = data.frame(cities = c("New York", "London", "Rome", "Vienna", "Amsterdam"),
                            col.a = c(values[1], values[2], NA, values[3], values[4]),
                            col.b = rep(NA, 5),
                            col.c = rep(NA, 5))
The first column (it can be converted to row names; a solution using either row names or a city-name column is fine) contains some cities I'd like to visit; the other columns hold unspecified values. The first object is the data frame I want to fill, and it prints as:
cities col.a col.b col.c
1 New York NA NA NA
2 London NA NA NA
3 Rome NA NA NA
4 Vienna NA NA NA
5 Amsterdam NA NA NA
The second is the object I want to put INTO the empty data frame, and as you can see it is missing one row ("Rome"):
cities col.a
1 New York 0.55213218
2 London 0.98907729
3 Vienna 1.11703741
4 Amsterdam -0.04616725
So now I want to put this inside the empty data frame, leaving NA in the row that does not match:
cities col.a col.b col.c
1 New York -0.62731870 NA NA
2 London -1.80206612 NA NA
3 Rome NA NA NA
4 Vienna -1.73446286 NA NA
5 Amsterdam -0.05709419 NA NA
I was trying the simplest merge solution, merge(empty_df, to_fill, by = "cities"), which gives:
cities col.a.x col.b col.c col.a.y
1 Amsterdam NA NA NA -0.05709419
2 London NA NA NA -1.80206612
3 New York NA NA NA -0.62731870
4 Vienna NA NA NA -1.73446286
And when I tried desired_output$col.a = merge(empty_df, to_fill, by = "cities"), an error occurred (replacement has 4 rows, data has 5). Is there any simple solution to do this that can be put in a for loop or apply?

We can use match:
empty_df$col.a <- to_fill$col.a[match(empty_df$cities, to_fill$cities)]
empty_df
# cities col.a col.b col.c
#1 New York 1.5567564 NA NA
#2 London -0.6969401 NA NA
#3 Rome NA NA NA
#4 Vienna 1.3336636 NA NA
#5 Amsterdam 0.7329989 NA NA
We fill col.a of empty_df with col.a values from to_fill by matching cities from empty_df with cities from to_fill.
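For completeness, a hedged sketch of a merge-based alternative: the merge in the question dropped Rome because merge defaults to an inner join; all.x = TRUE keeps unmatched rows, and since merge also reorders rows by the key, the values are mapped back to the original city order:
# Sketch: all.x = TRUE keeps Rome (as NA), but merge sorts by the key,
# so we map the merged values back to empty_df's original city order.
m <- merge(empty_df["cities"], to_fill, by = "cities", all.x = TRUE)
empty_df$col.a <- m$col.a[match(empty_df$cities, m$cities)]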

Related

Calculating CAGR with differing time-based data. Issue with variable column referencing in R

I have a table of CalPERS private equity fund performance from several years. I cleaned and joined all the data into a large table with 186 entries for individual fund investments. Some of these funds have data for 5 years, most for 4 or fewer. I would like to calculate the CAGR for each fund using the earliest and latest values in the formula:
CAGR = (Latest/First)^(1/n) - 1 ...
The columns with the data are named 2017, 2018, 2019, 2020, 2021, so the formula in R will look something like this (calper is the table with all the data, one fund per row):
idx<- which(startsWith(names(calperMV),"2")) # locate columns with data needed for CAGR calc
idx <- rev(idx) # match to NCOL_NA order ...
The values here are (6, 5, 4, 3, 2), which are the column numbers for 2021-2020-2019-2018-2017.
The indx was formed by counting the number of NAs in each row; all the NAs run left to right, so indx should index into idx and thus select the correct columns.
I use !!sym(as.String()) with names()[idx[indx]] to pull out the column names symbolically:
calperMV %>% rowwise() %>%
mutate(CAGR=`2021`/!!sym((colnames(.)[idx[indx]])^(1/(5-indx))-1))))
The problem is that the referencing either does not work correctly or produces this error:
"Error in local_error_context(dots = dots, .index = i, mask = mask) :
promise already under evaluation: recursive default argument reference or earlier problems?"
I've tried creating test code which shows the addressing is working:
calper %>% rowwise() %>% mutate(test = (names(.)[idx[indx]]),
test1= !!sym(as.String(names(.)[idx[1]])),
test2= !!sym(as.String(names(.)[idx[2]])),
test3= !!sym(as.String(names(.)[idx[3]])),
test4= !!sym(as.String(names(.)[idx[4]])),
test5= !!sym(as.String(names(.)[idx[5]])))
But when I do the full CAGR calc I get that recursive error. Here's a tibble of the test data for reference:
Input data:
Security Name 2017 2018 2019 2020 2021 NA_cols indx
ASIA ALT NA NA NA 6,256,876.00 7,687,037.00 3 2
ASIA ALT NA NA NA 32,549,704.00 34,813,844.00 3 2
AVATAR NA NA NA NA 700,088.00 - 3 2
AVENUE FUND VI (A) NA NA NA 10,561,674.00 19,145,496.00 3 2
BDC III C NA 48,098,429.00 85,808,280.00 100,933,699.00 146,420,669.00 1 4
BIRCH HILL NA NA NA 6,488,941.00 9,348,941.00 3 2
BLACKSTONE NA NA NA 4,011,072.00 2,406,075.00 3 2
BLACKSTONE IV NA NA NA 4,923,625.00 3,101,081.00 3 2
BLACKSTONE V NA NA NA 18,456,472.00 17,796,711.00 3 2
BLACKSTONE VI NA NA NA 245,269,656.00 310,576,064.00 3 2
BLACKSTONE VII NA NA NA 465,415,036.00 607,172,062.00 3 2
Results: the indexing selects the proper string and also the proper number from the column, but the calculation fails when I operate on the selected variable:
selYR test1 test2 test3 test4 test5
2020 7,687,037.00 6,256,876.00 NA NA NA
2020 34,813,844.00 32,549,704.00 NA NA NA
2020 - 700,088.00 NA NA NA
2020 19,145,496.00 10,561,674.00 NA NA NA
2018 146,420,669.00 100,933,699.00 85,808,280.00 48,098,429.00 NA
2020 9,348,941.00 6,488,941.00 NA NA NA
2020 2,406,075.00 4,011,072.00 NA NA NA
2020 3,101,081.00 4,923,625.00 NA NA NA
2020 17,796,711.00 18,456,472.00 NA NA NA
2020 310,576,064.00 245,269,656.00 NA NA NA
2020 607,172,062.00 465,415,036.00 NA NA NA
(Sorry ... I don't know how to put these into proper columns :( )
I never learned all those fancy tidystuff techniques. Here's a base R approach:
First and second: use read.delim to bring in the tab data; note your data has (yeccch) commas in the numbers.
(Ignore the warnings; they are correct and you do want the NAs.)
calpDat <- read.delim(text=calpTab)
calpDat[2:6] <- lapply(calpDat[2:6], function(x) as.numeric(gsub("[,]", "",x)))
Warning messages:
1: In FUN(X[[i]], ...) : NAs introduced by coercion
2: In FUN(X[[i]], ...) : NAs introduced by coercion
3: In FUN(X[[i]], ...) : NAs introduced by coercion
4: In FUN(X[[i]], ...) : NAs introduced by coercion
Note that lapply in this case returns a list of numeric vectors, which can be assigned back into the original dataframe to overwrite the original character values. (Or you could have created new columns, which could then get the same treatment as below.) Now that the data is in, you can count the number of valid numbers and then calculate the CAGR for each row, using apply on the numeric columns in rowwise fashion:
calpDat$CAGR <- apply(calpDat[2:6], 1, function(rw) {
  n <- length(na.omit(rw))          # number of years with data
  (rw[5]/rw[6 - n])^(1/n) - 1       # latest value over earliest non-NA value
})
calpDat
#----------------
Security.Name X2017 X2018 X2019 X2020 X2021 NA_cols indx CAGR
1 ASIA ALT NA NA NA 6256876 7687037 3 2 0.10841071
2 ASIA ALT NA NA NA 32549704 34813844 3 2 0.03419508
3 AVATAR NA NA NA NA 700088 NA 3 2 NA
4 AVENUE FUND VI (A) NA NA NA 10561674 19145496 3 2 0.34637777
5 BDC III C NA 48098429 85808280 100933699 146420669 1 4 0.32089372
6 BIRCH HILL NA NA NA 6488941 9348941 3 2 0.20031241
7 BLACKSTONE NA NA NA 4011072 2406075 3 2 -0.22549478
8 BLACKSTONE IV NA NA NA 4923625 3101081 3 2 -0.20637732
9 BLACKSTONE V NA NA NA 18456472 17796711 3 2 -0.01803608
10 BLACKSTONE VI NA NA NA 245269656 310576064 3 2 0.12528383
11 BLACKSTONE VII NA NA NA 465415036 607172062 3 2 0.14218298
Problems remaining: funds that did not have a value in the most recent year, and funds that might have had discontinuous reporting. You need to say how these should be handled, and provide example data, if you want tested solutions.
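If you do want a tidyverse version, here is a hedged sketch of the same calculation (assuming calpDat as cleaned above; the row id is needed because fund names repeat):
# Sketch: same (latest/earliest)^(1/n) - 1 logic as the apply version,
# written with dplyr/tidyr; the row id keeps duplicate fund names apart.
library(dplyr)
library(tidyr)
calpDat %>%
  mutate(fund = row_number()) %>%
  pivot_longer(X2017:X2021, names_to = "year", values_to = "value") %>%
  filter(!is.na(value)) %>%
  arrange(fund, year) %>%
  group_by(fund, Security.Name) %>%
  summarise(CAGR = (last(value) / first(value))^(1 / n()) - 1, .groups = "drop")
Note the exponent 1/n mirrors the base R answer above; adjust to 1/(n-1) if you count compounding periods rather than observations.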

Box plot for one row of a (frequency) table

I have a data set as a .csv file (basically: people's wine choice in relation to the origin of the ambient music playing). Reading this as a dataframe results in a df looking like this:
Music Wine
1 French French
2 Italian French
3 None Italian
4 Italian Italian
5 French Other
...
As a table, it looks like this:
Wine
Music Other French Italian
French 35 39 1
None 43 30 11
Italian 35 30 19
Now I want to create a frequency diagram that ONLY plots the relative distribution of purchases made with Music == "None". So basically I'd get Other = 0.511904, French = 0.3571429, Italian = 0.1309524.
My problem now is that subsetting this table isn't working.
noMusic <- prop.table(table(data[data$Music == "None"]))
geenMuziekTabel <- prop.table(table(data[data$Music == "None"]))
Both result in this:
[1] 0.144032922 0.004115226 0.045267490 0.078189300 NA NA NA NA
[9] NA NA NA NA NA NA NA NA
[17] NA NA NA NA NA NA NA NA
[25] NA NA NA NA NA NA NA NA
[33] NA NA NA NA NA NA NA NA
[41] NA NA NA NA NA NA NA NA
[49] NA NA NA NA NA NA NA NA
[57] NA NA NA NA NA NA NA NA
[65] NA NA NA NA NA NA NA NA
[73] NA NA NA NA NA NA NA NA
[81] NA NA NA NA
I thought: maybe I should subset my dataframe FIRST and THEN make a proportional table out of it, but R seems to remember that there was other data and makes this table:
Wine
Music Other French Italian
French 0 0 0
None 43 30 11
Italian 0 0 0
I've tried a number of different things, too, but can't figure it out. Would anyone know what I'm doing wrong?
Edit: the solution, based on the accepted answer, is as follows:
noMusicTable <- prop.table(table(musicwine$Wine[musicwine$Music == "None"]))
#noMusicTable <- prop.table(table(subset(musicwine, Music == "None", select = Wine)))
noMusicDF <- as.data.frame(noMusicTable)
# need to declare x and y explicitly; use stat = 'identity' to map bars to y-variable
ggplot(noMusicDF, mapping = aes(x = Var1, y = Freq)) + geom_bar(stat = 'identity', fill='red')
Here are three ways to subset correctly:
dat <- read.table(text =
"Music Wine
French French
Italian French
None Italian
Italian Italian
French Other", header = TRUE)
# Two different ways to subset
prop.table(table(dat$Wine[dat$Music == "None"]))
prop.table(table(subset(dat, Music == "None", select = Wine)))
# With dplyr and piping
library(dplyr)
dat %>%
filter(Music == "None") %>%
select(Wine) %>%
table() %>%
prop.table()
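As to why the subset-first attempt showed zero counts for French and Italian: subsetting a data frame keeps unused factor levels, and table() tabulates every level. A hedged sketch with droplevels(), assuming the columns were read as factors (stringsAsFactors = TRUE, the pre-R-4.0 default):
# Dropping the unused levels first gives the same result as the subsets above
noneOnly <- droplevels(dat[dat$Music == "None", ])
prop.table(table(noneOnly$Wine))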

R: Rollapply, preserve Date column

I have a dataset that includes a date column; the other columns are daily index returns. I would like to get the rolling standard deviation for all indices but preserve the date in order to plot the results. The data looks as follows:
head(mydata)
Date GL1 GL2 US CN JP DE UK
1 1990-01-03 0.02460 NA -0.25889 NA NA 3.00128 1.20872
2 1990-01-04 0.33681 NA -0.86503 NA NA -1.82327 -0.49234
3 1990-01-05 -0.81943 NA -0.98041 NA -1.13817 -0.86874 -0.29003
4 1990-01-06 NA NA NA NA NA NA NA
5 1990-01-07 NA NA NA NA NA NA NA
When using the rollapply function, however, the Date column disappears and turns into NAs:
require(zoo)
Rolling = as.data.frame(rollapply(mydata,30,sd,na.rm=TRUE,align="right"))
head(Rolling)
Date GL1 GL2 US CN JP DE UK
1 NA 0.5527451 NA 1.033204 NA 0.9021960 1.486567 0.8421562
2 NA 0.5675608 NA 1.057156 NA 0.9318496 1.467637 0.7954081
3 NA 0.5681388 NA 1.077253 NA 0.9318496 1.438117 0.8123918
4 NA 0.5663124 NA 1.095049 NA 0.9264034 1.454327 0.8331727
5 NA 0.5623544 NA 1.075118 NA 0.9017324 1.443547 0.8123613
6 NA 0.5523878 NA 1.052310 NA 0.8797660 1.411220 0.8197624
I kept the date in as.Date format but cannot figure out how to keep showing the corresponding dates in the Date column.
The real problem here is using data frames to represent time series. If you use a time series representation then the entire problem goes away.
mydata_z <- read.zoo(mydata)
r <- rollapplyr(mydata_z, 30, sd, na.rm = TRUE, fill = NA)
Now you can use plot.zoo, xyplot.zoo, or autoplot.zoo to graph r using classic, lattice, or ggplot2 graphics respectively; if you need a data frame back, use fortify.zoo(r).
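A minimal usage sketch of those options (assuming mydata as posted, with Date in the first column):
library(zoo)
library(ggplot2)                        # only needed for autoplot.zoo
mydata_z <- read.zoo(mydata)            # the Date column becomes the index
r <- rollapplyr(mydata_z, 30, sd, na.rm = TRUE, fill = NA)
plot(r)                                 # classic graphics, one panel per series
autoplot(r)                             # ggplot2 graphics via zoo's autoplot method
df_out <- fortify.zoo(r)                # back to a data frame with an Index column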

Split text string into column based on variable

I have a dataframe with a text column that I would like to split into multiple columns, since the text string contains multiple variables, such as location, education, distance, etc.
Dataframe:
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
df = data.frame(text.string)
df
text.string
1 &location=NY&distance=30&education=University
2 &location=CA&distance=30&education=Highschool&education=University
3 &location=MN&distance=10&industry=Healthcare
4 &location=VT&distance=30&education=University&industry=IT&industry=Business
I can split this using cSplit from the splitstackshape package: cSplit(df, 'text.string', sep = "&"):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1: NA location=NY distance=30 education=University NA NA
2: NA location=CA distance=30 education=Highschool education=University NA
3: NA location=MN distance=10 industry=Healthcare NA NA
4: NA location=VT distance=30 education=University industry=IT industry=Business
The problem is that a text string may contain multiples of the same variable, or may be missing a certain variable. With cSplit, the grouping of the variables per column gets all mixed up. I would like to avoid this and keep them grouped together.
So it would look similar to this (education and industry no longer appear in multiple columns):
text.string_1 text.string_2 text.string_3 text.string_4 text.string_5 text.string_6
1 NA location=NY distance=30 education=University <NA> NA
2 NA location=CA distance=30 education=Highschool education=University <NA> NA
3 NA location=MN distance=10 <NA> industry=Healthcare NA
4 NA location=VT distance=30 education=University industry=IT industry=Business NA
Taking into account @NicE's comment, this is one way, following your example:
library(data.table)
text.string = c("&location=NY&distance=30&education=University",
"&location=CA&distance=30&education=Highschool&education=University",
"&location=MN&distance=10&industry=Healthcare",
"&location=VT&distance=30&education=University&industry=IT&industry=Business")
clean <- strsplit(text.string, "&|=")
out <- lapply(clean, function(x) {
  ma <- data.table(matrix(x[!x == ""], nrow = 2, byrow = FALSE))
  setnames(ma, as.character(ma[1, ]))
  ma[-1, ]
})
out <- rbindlist(out, fill = T)
out
location distance education education industry industry
1: NY 30 University NA NA NA
2: CA 30 Highschool University NA NA
3: MN 10 NA NA Healthcare NA
4: VT 30 University NA IT Business
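A hedged variant of the same idea that avoids the duplicate column names by disambiguating repeated keys with make.unique (education, education.1, ...):
# Sketch: split on & only, then split each key=value pair; make.unique
# renames repeated keys so rbindlist gets distinct column names.
library(data.table)
out2 <- rbindlist(lapply(strsplit(text.string, "&"), function(x) {
  x  <- x[x != ""]                                # drop the empty piece before the leading &
  kv <- strsplit(x, "=", fixed = TRUE)
  keys <- make.unique(vapply(kv, `[`, "", 1))     # education, education.1, ...
  vals <- vapply(kv, `[`, "", 2)
  setNames(as.list(vals), keys)
}), fill = TRUE)
out2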

Use zoo read and split a data frame over a column

I have a table containing observations on scores of restaurants (identified by ID). The variable mean is the mean rating of reviews received in a week-long window centered on each day (i.e. from 3 days before till 3 days later), and the variable count is the number of reviews received in the same window (see the code below for a dput of a randomly generated sample of my data frame).
I am interested in looking at those restaurants that show big spikes in either variable (e.g. their mean rating suddenly goes up by a lot, or drops suddenly). For those restaurants, I would like to investigate what's going on by plotting the distribution (I have lots of restaurants, so I can't do it manually and have to restrict my domain for semi-manual inspection).
Also, since my data is day-by-day, I would like it to be less granular. In particular, I want to average all the ratings or counts for a given month into a single value.
I think zoo should help me do this nicely: given the data frame in the example, I think I can convert it to a zoo time series that is aggregated the way I want and split the way I want by using:
z <- read.zoo(df, split = "restaurantID",
format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)
However, splitting on restaurantID does not yield the expected result. What I get instead is lots of NAs:
mean.1006054 count.1006054 mean.1006639 count.1006639 mean.1006704 count.1006704 mean.1007177 count.1007177
Lug 2004 NA NA NA NA NA NA NA NA
Ago 2004 NA NA NA NA NA NA NA NA
Nov 2004 NA NA NA NA NA NA NA NA
Gen 2005 NA NA NA NA NA NA NA NA
Feb 2005 NA NA NA NA NA NA NA NA
Mar 2005 NA NA NA NA NA NA NA NA
mean.1007296 count.1007296 mean.1007606 count.1007606 mean.1007850 count.1007850 mean.1008272 count.1008272
Lug 2004 NA NA NA NA NA NA NA NA
Ago 2004 NA NA NA NA NA NA NA NA
Nov 2004 NA NA NA NA NA NA NA NA
Gen 2005 NA NA NA NA NA NA NA NA
Feb 2005 NA NA NA NA NA NA NA NA
Mar 2005 NA NA NA NA NA NA NA NA
Note that it works if I don't split it on the restaurantID column.
df$website <- NULL
> z <- read.zoo(df, format = "%m/%d/%Y", index.column = 2, FUN = as.yearmon, aggregate = mean)
> head(z)
restaurantID mean count
Lug 2004 1418680 3.500000 1
Ago 2004 1370457 5.000000 1
Nov 2004 1324645 4.333333 1
Gen 2005 1425933 1.920000 1
Feb 2005 1315289 3.000000 1
Mar 2005 1400577 2.687500 1
Also, plot.zoo(z) works but of course the produced graph has no meaning for me.
My questions are:
1) How can I filter the restaurants that have the higher "month-month" spikes in either column?
2) How can I split on restaurantID and plot the time series of only such restaurants?
DATA HERE (wouldn't fit SO's word limit)
Try:
# libraries assumed below: plyr for ddply, reshape2 for melt, ggplot2 for plotting
library(plyr)
library(reshape2)
library(ggplot2)
# helper function to calculate change per time interval in a sequence
difflist <- function(v) {rr <- 0; for (i in 2:length(v)) {rr <- c(rr, v[i] - v[i-1])}; return(rr)}
# make center a Date
df$center <- as.Date(df$center, format = '%m/%d/%Y')
# sort data frame in time order
df <- df[order(df$restaurantID, df$center), ]
# now calculate the change in each column
deltas <- ddply(df, .(restaurantID), function(x) {
  cbind(center = x$center, delta_mean = difflist(x$mean), delta_count = difflist(x$count))
})
# keep only the big spikes
deltas_big <- subset(deltas, delta_mean > 2 | delta_count > 3)
# arrange the data for plotting
delta_melt <- melt(deltas_big, id.vars = c('restaurantID', 'center'))
# now plot by time
ggplot(delta_melt, aes(x = center, y = value, color = variable)) + geom_point()
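Side note: difflist is equivalent to base R's diff with a leading zero, and the one-liner also behaves sensibly for length-1 input:
difflist <- function(v) c(0, diff(v))   # same result as the loop version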
The robfilter R package was developed to filter time series data and pick out outliers, based on robust statistical methods for time series analysis. You can use its adore.filter function to fit a pattern to the data and then pick out the outliers that deviate far from the signal.
