I am trying to create a data frame (BOS.df) to explore the structure of a future analysis before I receive the actual data. In this scenario, let's say that there are 4 restaurants looking to run ad campaigns (the "Restaurant" variable). The total number of days that the campaign will last is cmp.lngth. I want random numbers for how much they are billing for the ads (ra.num). The ad campaigns start on StartDate. Ultimately, I want to create a data frame that cycles through each restaurant and adds a random billing number for each day of the ad campaign by adding rows.
#Create Data Placeholders
set.seed(123)
Restaurant <- c('B1', 'B2', 'B3', 'B4')
cmp.lngth <- 42
ra.num <- rnorm(cmp.lngth, mean = 100, sd = 10)
StartDate <- as.Date("2017-07-14")
BOS.df <- data.frame(matrix(NA, nrow =0, ncol = 3))
colnames(BOS.df) <- c("Restaurant", "Billings", "Date")
for(i in 1:length(Restaurant)){
for(z in 1:cmp.lngth){
BOS.row <- c(as.character(Restaurant[i]),ra.num[z],StartDate +
cmp.lngth[z]-1)
BOS.df <- rbind(BOS.df, BOS.row)
}
}
My code is not functioning correctly right now. The column names are incorrect, and the data is not being placed correctly if at all. The output comes through as follows:
X.B1. X.94.3952435344779. X.17402.
1 B1 94.3952435344779 17402
2 B1 <NA> <NA>
3 B1 <NA> <NA>
4 B1 <NA> <NA>
5 B1 <NA> <NA>
6 B1 <NA> <NA>
How can I obtain the correct output? Is there a more efficient way than using a for loop?
Using expand.grid:
cmp.lngth <- 2
StartDate <- as.Date("2017-07-14")
set.seed(1)
# subtract 1 so that the first campaign day is StartDate itself
df1 <- data.frame(expand.grid(Restaurant, StartDate + seq_len(cmp.lngth) - 1))
colnames(df1) <- c("Restaurant", "Date")
df1$Billings <- rnorm(nrow(df1), mean = 100, sd = 10)
df1 <- df1[ order(df1$Restaurant, df1$Date), ]
df1
# Restaurant Date Billings
# 1 B1 2017-07-14 93.73546
# 5 B1 2017-07-15 103.29508
# 2 B2 2017-07-14 101.83643
# 6 B2 2017-07-15 91.79532
# 3 B3 2017-07-14 91.64371
# 7 B3 2017-07-15 104.87429
# 4 B4 2017-07-14 115.95281
# 8 B4 2017-07-15 107.38325
You can use rbind, but this would be another way to do it.
Also, the length of the data frame should be cmp.lngth*length(Restaurant), not cmp.lngth.
#Create Data Placeholders
set.seed(123)
Restaurant <- c('B1', 'B2', 'B3', 'B4')
cmp.lngth <- 42
ra.num <- rnorm(cmp.lngth, mean = 100, sd = 10)
StartDate <- as.Date("2017-07-14")
BOS.df <- data.frame(matrix(NA, nrow = cmp.lngth*length(Restaurant), ncol = 3))
colnames(BOS.df) <- c("Restaurant", "Billings", "Date")
count <- 1
for(name in Restaurant){
for(z in 1:cmp.lngth){
BOS.row <- c(name, ra.num[z], as.character(StartDate + z - 1))
BOS.df[count,] <- BOS.row
count <- count + 1
}
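}
# c() coerces every value in BOS.row to character, so restore the column types afterwards
BOS.df$Billings <- as.numeric(BOS.df$Billings)
BOS.df$Date <- as.Date(BOS.df$Date)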
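The character columns come from c(), which can only hold one type. As a minimal sketch (the names mixed and typed_row are illustrative and not part of the original code), the snippet below shows that coercion in isolation. It also shows that cmp.lngth[z], as indexed in the question's loop, is NA for every z greater than 1, which probably contributes to the NA rows in the question's output.

```r
StartDate <- as.Date("2017-07-14")
cmp.lngth <- 42

# c() coerces everything to one type; with a string first, the result is character
mixed <- c("B1", 94.4, StartDate)
class(mixed)   # "character"
mixed[3]       # the Date survives only as its day count, stored as text

# cmp.lngth has length 1, so any index above 1 returns NA
cmp.lngth[2]
StartDate + cmp.lngth[2] - 1

# A one-row data frame keeps each column's own type
typed_row <- data.frame(Restaurant = "B1", Billings = 94.4, Date = StartDate)
sapply(typed_row, class)
```

Building rows as one-row data frames (or assigning column by column) avoids having to convert the types back afterwards.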
I would also recommend looking at the tidyverse package and using add_row with a tibble instead of a data frame. Here is some sample code:
library(tidyverse)
BOS.tb <- tibble(Restaurant = character(),
Billings = numeric(),
Date = character())
for(name in Restaurant){
for(z in 1:cmp.lngth){
BOS.tb <- add_row(BOS.tb,
Restaurant = name,
Billings = ra.num[z],
Date = as.character(StartDate + z - 1))
}
}
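Since the loops above give every restaurant the same ra.num series, the same table can probably be built without any loop. This is only a sketch under that assumption; BOS.vec is a hypothetical name chosen so that it does not overwrite BOS.df or BOS.tb.

```r
set.seed(123)
Restaurant <- c("B1", "B2", "B3", "B4")
cmp.lngth <- 42
ra.num <- rnorm(cmp.lngth, mean = 100, sd = 10)
StartDate <- as.Date("2017-07-14")

# One block of cmp.lngth rows per restaurant, built with rep() instead of a loop
BOS.vec <- data.frame(
  Restaurant = rep(Restaurant, each = cmp.lngth),
  Billings   = rep(ra.num, times = length(Restaurant)),
  Date       = rep(StartDate + seq_len(cmp.lngth) - 1, times = length(Restaurant)),
  stringsAsFactors = FALSE
)

str(BOS.vec)
```

Because no value passes through c() with mixed types, Billings stays numeric and Date stays a Date. If each restaurant should get its own random billings instead, replacing the Billings line with a single rnorm call of length cmp.lngth * length(Restaurant) would do that.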
I have a dataset with more than 300 features and a response column. I was wondering if it is possible to randomly create 4 subsets with different features using R.
For Example:
The dataset is in CSV format.
Thanks in advance.
If you just want random features, you can try something like this. Note this was based off of the posted image, but you mentioned your real data has 300 features. I don't know what your real dataset looks like, but you should be able to easily modify this to account for the actual dataset.
set.seed(123)
# create sample data
sample_data <- data.frame(id = 1:7,
F1 = sample(1:10,7),
F2 = sample(1:10,7),
F3 = sample(1:10,7),
F4 = sample(1:10,7),
F5 = sample(1:10,7),
F6 = sample(1:10,7),
F7 = sample(1:10,7),
F8 = sample(1:10,7),
label = sample(0:1, 7, replace = TRUE))
# sample two random feature columns for each subset, always keeping id (column 1) and label (column 10)
subset1 <- sample_data[, c(1,sample(2:9,2),10)]
subset2 <- sample_data[, c(1,sample(2:9,2),10)]
subset3 <- sample_data[, c(1,sample(2:9,2),10)]
subset4 <- sample_data[, c(1,sample(2:9,2),10)]
#subset1
# id F8 F6 label
#1 1 2 1 1
#2 2 8 3 1
#3 3 9 10 1
#4 4 10 7 1
#5 5 1 6 0
#6 6 6 5 0
#7 7 5 8 1
Thanks to @jpsmith's answer, I managed to solve the problem using the following code:
data <- read.csv("-----/file.csv")
label <- data$label
subset1 <- data[, c(sample(1:300,75))]
subset1$label <- label
data = data[,!(names(data) %in% colnames(subset1))]
subset2 <- data[, c(sample(1:225,75))]
subset2$label <- label
data = data[,!(names(data) %in% colnames(subset2))]
subset3 <- data[, c(sample(1:150,75))]
subset3$label <- label
data = data[,!(names(data) %in% colnames(subset3))]
subset4 <- data
subset4$label <- label
If there is a way to improve the solution please post it.
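One possible simplification, offered as a sketch rather than a definitive answer: shuffle the feature names once and split them into four groups, so the hard-coded counts (300, 225, 150) are no longer needed. The data frame below is a hypothetical stand-in for the CSV, and the column names F1 to F300 and label are assumptions.

```r
set.seed(42)

# Hypothetical stand-in for read.csv(): 300 feature columns plus a label column
n_obs <- 20
data <- as.data.frame(matrix(rnorm(n_obs * 300), nrow = n_obs))
names(data) <- paste0("F", 1:300)
data$label <- sample(0:1, n_obs, replace = TRUE)

# Shuffle the feature names once, then deal them into 4 disjoint groups
features <- setdiff(names(data), "label")
groups <- split(sample(features), rep(1:4, length.out = length(features)))

# Each subset holds its own features plus the shared label column
subsets <- lapply(groups, function(cols) data[, c(cols, "label")])

sapply(subsets, ncol)
```

The result is a list of four data frames (subsets[[1]] to subsets[[4]]) whose feature columns do not overlap, and the same code works unchanged if the number of features is not 300.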
I created a data frame with two variables: one character (teams) and one numeric. I'd like to draw a completely random sample of two teams and then draw again between those two to pick just one. Finally, I'd like to repeat this without the first two selected teams, and be able to replicate the whole procedure.
I have tried this code. However, in the second draw the selection is not made from the two selected teams, but from two other teams.
teams <- c('madrid','barcelona','psg','mancunited','mancity','juve')
mean <- c(14, 14.5, 13, 10, 13.4, 13.7)
df <- data.frame(teams, mean)
x <- 1:nrow(df)
a1 <- df[sample((x),2),]
y <- sample(c(a1[1,1], a1[2,1]), 1,
prob = c((a1[1,2]/(a1[1,2]+a1[2,2])), (a1[2,2]/(a1[1,2]+a1[2,2]))))
A1 <- df[y,]
A1
df <- df[!(df$teams==a1[1,1] | df$teams==a1[2,1]),]
x <- 1:nrow(df)
b1 <- df[sample((x),2),]
B1 <- df[sample(c(b1[1,1], b1[2,1]), 1,
prob = c((b1[1,2]/(b1[1,2]+b1[2,2])), (b1[2,2]/(b1[1,2]+b1[2,2])))),]
B1
You can use:
# Choose two teams
random_2_x <- sample(x, 2)
# Choose one out of the above two
random_2_1_x <- sample(random_2_x, 1)
# Choose two from the ones not in random_2_x
random_2_y <- sample(x[-random_2_x], 2)
# Choose one out of the above two
random_2_y_1 <- sample(random_2_y, 1)
You can use these indexes to subset the data frame:
df[random_2_x, ]
# teams mean
#4 mancunited 10.0
#6 juve 13.7
df[random_2_1_x, ]
# teams mean
#6 juve 13.7
df[random_2_y, ]
# teams mean
#1 madrid 14.0
#2 barcelona 14.5
df[random_2_y_1, ]
# teams mean
#2 barcelona 14.5
data
df <- data.frame(teams, mean)
If you want to use your stats column for weighting the probability of selection on the second draw (choosing 1 team from the 2 already selected), you can use the following function. The prob argument of sample can be a vector of probability weights. So you don't need to calculate actual proportions manually - just provide the stats column and R will do what you want.
game <- function(df){
x <- 1:nrow(df)
a1 <- df[sample((x),2),]
y1 <- sample(a1$teams, 1, prob = a1$stats)
df2 <- df[!(df$teams %in% a1$teams),]
x <- 1:nrow(df2)
b1 <- df2[sample(x,2),]
y2 <- sample(b1$teams, 1, prob = b1$stats)
c(y1, y2)
}
Here's your data:
teams <- c('madrid','barcelona','psg','mancunited','mancity','juve')
stats <- c(14, 14.5, 13, 10, 13.4, 13.7)
df <- data.frame(teams, stats) # R 4.0.0 no need to convert strings to factors.
Replicate 10,000 games:
games <- t(replicate(10000, game(df)))
head(games)
# [,1] [,2]
# [1,] "barcelona" "mancity"
# [2,] "madrid" "mancunited"
# [3,] "madrid" "psg"
# [4,] "juve" "psg"
# [5,] "mancity" "barcelona"
# [6,] "mancity" "juve"
You can see the proportion of times each team got selected in each of your phases.
sort(prop.table(table(games[,1])), decreasing = TRUE) # phase 1
# barcelona madrid psg juve mancity mancunited
# 0.1797 0.1787 0.1687 0.1677 0.1663 0.1389
sort(prop.table(table(games[,2])), decreasing = TRUE) # phase 2
# madrid barcelona juve mancity psg mancunited
# 0.1826 0.1755 0.1691 0.1670 0.1663 0.1395
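As a quick check of the point about prob accepting unnormalized weights, here is a small sketch with just two teams. The names draws_raw and draws_norm are illustrative, and both vectors should land close to the normalized weights.

```r
set.seed(1)
teams <- c("madrid", "mancunited")
stats <- c(14, 10)

# Raw weights versus weights normalized by hand
draws_raw  <- sample(teams, 1e5, replace = TRUE, prob = stats)
draws_norm <- sample(teams, 1e5, replace = TRUE, prob = stats / sum(stats))

prop.table(table(draws_raw))
prop.table(table(draws_norm))
stats / sum(stats)  # the proportions both draws should approximate
```

Both sets of draws approximate the same proportions, so passing the stats column directly is enough.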
I want to select the top 10 voted restaurants and plot them together.
So I want to create a plot that shows the restaurant names and their votes.
I used:
topTenVotes <- top_n(dataSet, 10, Votes)
and it showed me all the columns of the dataset for the 10 highest vote counts; however, I want just the number of votes and the restaurant names.
My question is: how do I select only the top 10 highest votes and their restaurant names, and plot them together?
Expected output:
Restaurant Names Votes
A 300
B 250
C 230
D 220
E 210
F 205
G 200
H 194
I 160
J 120
K 34
And then a bar plot that shows these restaurant names and their votes
Another simple approach with base functions creating another variable:
df <- data.frame(Names = LETTERS, Votes = sample(40:400, length(LETTERS)))
x <- df$Votes
names(x) <- df$Names # x <- setNames(df$Votes, df$Names) is another approach
barplot(sort(x, decreasing = TRUE)[1:10], xlab = "Restaurant Name", ylab = "Votes")
Or a one-line solution with base functions:
barplot(sort(xtabs(Votes ~ Names, df), decreasing = TRUE)[1:10], xlab = "Restaurant Names")
I'm not seeing a data set to use, so here's a minimal example to show how it might work:
library(tidyverse)
df <-
tibble(
restaurant = c("res1", "res2", "res3", "res4"),
votes = c(2, 5, 8, 6)
)
df %>%
arrange(-votes) %>%
head(3) %>%
ggplot(aes(x = reorder(restaurant, votes), y = votes)) +
geom_col() +
coord_flip()
The top_n command also works in this case but is designed for grouped data.
It's more efficient, though less readable, to use base functions:
#toy data
d <- data.frame(Names = sample(LETTERS, size = 15), value = rnorm(n = 15, mean = 25, sd = 10))
head(d)
Names value
1 D 25.592749
2 B 28.362303
3 H 1.576343
4 L 28.718517
5 S 27.648078
6 Y 29.364797
#reorder by, and retain, the top 10
newdata <- data.frame()
for (i in 1:10) {
newdata <- rbind(newdata,d[which(d$value == sort(d$value, decreasing = TRUE)[1:10][i]),])
}
newdata
Names value
8 W 45.11330
13 K 36.50623
14 P 31.33122
15 T 30.28397
6 Y 29.36480
7 Q 29.29337
4 L 28.71852
10 Z 28.62501
2 B 28.36230
5 S 27.64808
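The same selection can probably be written without a loop by indexing with order(), which also makes it easy to keep only the two columns the question asks for. This is a sketch on made-up data: dataSet here is a hypothetical stand-in, and the column names Name and City are assumptions (only Votes appears in the question).

```r
set.seed(10)

# Hypothetical stand-in for dataSet, with one extra column that should be dropped
dataSet <- data.frame(
  Name  = LETTERS[1:15],
  Votes = sample(40:400, 15),
  City  = "Boston",
  stringsAsFactors = FALSE
)

# order() gives the row positions from highest to lowest vote count;
# keep the first 10 rows and only the name and vote columns
top10 <- dataSet[order(dataSet$Votes, decreasing = TRUE)[1:10], c("Name", "Votes")]
top10
```

The resulting top10 can then be passed to barplot or ggplot in the same way as in the answers above.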
I have a list (selected_key_ratios) containing 4 data frames ($nestle; $unilever; $pepsico; $abf). Each data frame contains financial data of a company. All data frames have the same row index and almost the same columns (only the currency differs sometimes). Here is a screenshot of the list.
I'm trying to make a new list where each item would be a column of the data frames, grouped by company. Here is a graphical example:
And so on for each column of the data frames. I have tried things with lapply for hours now, but nothing produces the desired result.
Do you have any clues? Thanks a lot!
You could try something like this nested lapply:
# Recreation of your list of dataframes
w <- list(
abc = data.frame(
"eps_usd" = runif(10) * 10,
"eps_gbp" = runif(10) * 8
),
def = data.frame(
"eps_usd" = runif(10) * 15,
"eps_eur" = runif(10) * 13
),
ghi = data.frame(
"eps_gbp" = runif(10) * 35,
"eps_aud" = runif(10) * 19
),
jkl = data.frame(
"eps_usd" = runif(10) * 2,
"eps_aud" = runif(10) * 1.4
)
)
# Create a new dataframe with the year column
result <- data.frame(year = 2007:2016)
# Apply to each name in the list
lapply(names(w), function(tbl) {
# Apply to each colname of each df
lapply(colnames(w[[tbl]]), function(col) {
# Assign the corresponding column from the list of data frames to the result data frame
result[[paste0(tbl, "_", col)]] <<- w[[tbl]][[col]]
})
})
Output:
> result
year abc_eps_usd abc_eps_gbp def_eps_usd def_eps_eur ghi_eps_gbp ghi_eps_aud jkl_eps_usd jkl_eps_aud
1 2007 8.107360 3.419094 11.660133 9.9744151 3.801628 1.936746299 1.36976914 0.58472812
2 2008 7.527040 2.342307 11.407357 5.6755403 13.433364 8.595490269 0.31085568 0.06655984
3 2009 5.155562 4.272123 8.506886 8.5367400 20.305427 18.191703109 0.01993349 0.31829031
4 2010 2.947270 2.983519 5.686625 5.2630734 14.064397 9.049538589 0.92122668 0.55233980
5 2011 8.645507 2.657100 12.445061 6.9406141 5.056093 18.787235097 0.41227465 0.01664083
6 2012 7.192367 5.695391 3.620765 9.1173421 26.452499 0.002014068 1.84031115 0.38873530
7 2013 4.878473 1.527182 11.769227 9.6991108 16.232696 6.934076956 1.07328960 0.28808505
8 2014 1.766486 5.272151 12.656086 0.7318888 32.855694 15.643783443 1.33677381 1.09871196
9 2015 9.428541 6.462755 11.473938 4.3658361 7.547359 17.634770134 1.27743503 1.35510589
10 2016 6.047083 3.437785 13.845070 12.9766045 7.401827 18.032713128 1.73208881 0.03394082
Without a dataset, I made one up.
set.seed(5489)
n <- 20
df_list <- list(
nestle = data.frame(A = runif(n), B = runif(n), C = runif(n)),
unilever = data.frame(D = runif(n), E = runif(n), F = runif(n)),
abf = data.frame(G = runif(n), H = runif(n), I = runif(n))
)
The code that follows assumes that you want to extract the first column of each data frame, and that you want to name the columns of the result with a combination of the names of the original df's names and of those first columns.
result <- as.data.frame(do.call(cbind, lapply(df_list, `[[`, 1)))
names(result) <- paste(names(result), sapply(df_list, function(DF) names(DF)[1]))
row.names(result) <- row.names(df_list[[1]])
head(result)
# nestle A unilever D abf G
#1 0.2348625 0.007785561 0.6453142
#2 0.5951392 0.494773356 0.2167643
#3 0.3001674 0.381868381 0.7182713
#4 0.1745270 0.983473145 0.8829462
#5 0.3387269 0.178523104 0.6042962
#6 0.1103261 0.211874225 0.4545857
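To get the list the question describes (one item per column, each item grouped by company), the first-column approach above can be repeated for every column position. The following is a sketch that assumes all data frames have the same number of columns in the same order; by_position and the column_ prefix are hypothetical names.

```r
set.seed(5489)
n <- 20
df_list <- list(
  nestle   = data.frame(A = runif(n), B = runif(n), C = runif(n)),
  unilever = data.frame(D = runif(n), E = runif(n), F = runif(n)),
  abf      = data.frame(G = runif(n), H = runif(n), I = runif(n))
)

# For each column position j, pull column j from every company
# and bind the pieces into one data frame with one column per company
by_position <- lapply(seq_len(ncol(df_list[[1]])), function(j) {
  as.data.frame(lapply(df_list, `[[`, j))
})
names(by_position) <- paste0("column_", seq_along(by_position))

str(by_position, max.level = 1)
```

If the real data frames share column names rather than positions, the same pattern should work with a vector of names in place of seq_len(ncol(df_list[[1]])).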
The Problem
I am plotting a bunch of line plots on top of one another, but I only want to color 10 of them specifically after they are all plotted together (to visualize how my 'targets' traveled over time while still being able to see the mass of other lines behind them). An example would be 100 line graphs over time, of which I want to color 5 or 10 specifically so I can discuss them with respect to the trend of the other 90 grayscale ones.
The following post has a pretty good image that I want to replicate, but with slightly more meat on the bones. I want MANY grayscale lines behind those 3, with those 3 highlighted cities in the foreground, per se.
My original data was in the following form:
# The unique identifier is a City-State combo,
# there can be the same cities in 1 state or many.
# Each state's year ranges from 1:35, but may not have
# all of the values available to us, but some are complete.
r1 <- c("city1" , "state1" , "year" , "population" , rnorm(11) , "2")
r2 <- c("city1" , "state2" , "year" , "population" , rnorm(11) , "3")
r3 <- c("city2" , "state1" , "year" , "population" , rnorm(11) , "2")
r4 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "1")
r5 <- c("city3" , "state2" , "year" , "population" , rnorm(11) , "7")
df <- data.frame(matrix(nrow = 5, ncol = 16))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- c("City", "State", "Year", "Population", 1:11, "Cluster")
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# City | State | Year | Population | ... 11 Variables ... | Cluster #
# ----------------------------------------------------------------------#
# Each row is a city instance with these features ... #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
But I thought it might be better to view the data differently, so I also have it in the following format. I am not sure which is better for this problem.
cols <- 1:35 # one name per year column (35 columns)
rows <- c("unique_city1", "unique_city2","unique_city3","unique_city4","unique_city5")
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- cols
row.names(df) <- rows
head(df)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Year1 Year2 .......... Year 35 #
# UniqueCityState1 VAL NA .......... VAL #
# UniqueCityState2 VAL VAL .......... NA #
# . #
# . #
# . #
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
Prior Attempts
I have tried using melt to get the data into a format that ggplot can accept so I can plot each of these cities over time, but nothing has worked. I have also tried writing my own functions to loop through each unique city-state combination and stack ggplots, an approach for which there is a fair amount of material available, but still without success. I am not sure how to find each of these unique city-state pairs and plot them over time using their cluster value, or any numeric value for that matter. Or maybe what I am seeking is not possible; I am not sure.
Thoughts?
EDIT: More information about data structure
> head(df)
city state year population stat1 stat2 stat3 stat4 stat5
1 BESSEMER 1 1 31509 0.3808436 0 0.63473928 2.8563268 9.5528262
2 BIRMINGHAM 1 1 282081 0.3119671 0 0.97489728 6.0266377 9.1321287
3 MOUNTAIN BROOK 1 1 18221 0.0000000 0 0.05488173 0.2744086 0.4390538
4 FAIRFIELD 1 1 12978 0.1541069 0 0.46232085 3.0050855 9.8628448
5 GARDENDALE 1 1 7828 0.2554931 0 0.00000000 0.7664793 1.2774655
6 LEEDS 1 1 7865 0.2542912 0 0.12714558 1.5257470 13.3502861
stat6 stat6 stat7 stat8 stat9 cluster
1 26.976419 53.54026 5.712654 0 0.2856327 9
2 35.670605 65.49183 11.982374 0 0.4963113 9
3 6.311399 21.40387 1.426925 0 0.1097635 3
4 21.266759 68.11527 11.480968 0 1.0787487 9
5 6.770567 23.24987 3.960143 0 0.0000000 3
6 24.157661 39.79657 4.450095 0 1.5257470 15
agg
1 99.93970
2 130.08675
3 30.02031
4 115.42611
5 36.28002
6 85.18754
And ultimately I need it in the form of unique cities as row.names, 1:35 as col.names and the value inside each cell to be agg if that year was present or NA if it wasn't. Again I am sure this is possible, I just can't attain a good solution to it and my current way is unstable.
If I understand your question correctly, you want to plot all the lines in one color and then plot a few lines in several different colors. You can use ggplot2 and call geom_line twice on two data frames. The first call plots all city data without mapping lines to color. The second call plots just the subset for your target cities and maps lines to color. You will need to reorganize your original data frame and subset it for the target cities. In the following code I used tidyr and dplyr to process the data frame.
### Set the seed for reproducibility
set.seed(123)
### Load package
library(tidyr)
library(dplyr)
library(ggplot2)
### Prepare example data frame
r1 <- rnorm(35)
r2 <- rnorm(35)
r3 <- rnorm(35)
r4 <- rnorm(35)
r5 <- rnorm(35)
df <- data.frame(matrix(nrow = 5, ncol = 35))
df[1,] <- r1
df[2,] <- r2
df[3,] <- r3
df[4,] <- r4
df[5,] <- r5
names(df) <- 1:35
df <- df %>% mutate(City = 1:5)
### Reorganize the data for plotting
df2 <- df %>%
gather(Year, Value, -City) %>%
mutate(Year = as.numeric(Year))
The gather function takes df as its first argument. It creates a key column called Year, which stores the year numbers. The year numbers are the column names of every column in the df data frame except the City column. gather also creates a column called Value, which stores all the numeric values from every column in df except the City column. Finally, the City column is not involved in this process, so use -City to tell gather "do not transform the data from the City column".
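To make the reshaping step concrete, here is a base R sketch of the same wide-to-long transformation on a tiny two-city, two-year table. The names wide, year_cols, and long are illustrative; this is not how gather is implemented, only an equivalent result for this simple case.

```r
# Tiny wide table: one row per city, one column per year
wide <- data.frame(`1` = c(0.5, 1.5), `2` = c(2.5, 3.5), check.names = FALSE)
wide$City <- 1:2

# Every column except City is a year column
year_cols <- setdiff(names(wide), "City")

# Stack the year columns on top of each other:
# City repeats once per year column, Year repeats once per city
long <- data.frame(
  City  = rep(wide$City, times = length(year_cols)),
  Year  = as.numeric(rep(year_cols, each = nrow(wide))),
  Value = unlist(wide[year_cols], use.names = FALSE)
)
long
```

Each row of long is one city in one year, which is the shape ggplot needs to draw one line per city.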
### Subset df2, select the city of interest
df3 <- df2 %>%
# In this example, assuming that City 2 and City 3 are of interest
filter(City %in% c(2, 3))
### Plot the data
ggplot(data = df2, aes(x = Year, y = Value, group = factor(City))) +
# Plot all city data here in gray lines
geom_line(size = 1, color = "gray") +
# Plot target city data with colors
geom_line(data = df3,
aes(x = Year, y = Value, group = City, color = factor(City)),
size = 2)
The resulting plot can be seen here: https://dl.dropboxusercontent.com/u/23652366/example_plot.png
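The question also asks for a wide table with one row per unique city, columns 1 to 35, and agg in each cell (NA where a year is missing). One way to sketch that in base R is tapply with a year factor that lists all 35 levels. The rows below are made-up values, and the sketch assumes at most one agg value per city-state-year combination.

```r
# Made-up long data: BESSEMER has years 1 and 3, LEEDS only year 2
long <- data.frame(
  city  = c("BESSEMER", "BESSEMER", "LEEDS"),
  state = c(1, 1, 1),
  year  = c(1, 3, 2),
  agg   = c(99.9, 101.2, 85.2)
)

# City-state pairs identify a row; forcing levels 1:35 keeps every year column
key <- paste(long$city, long$state, sep = "_")
wide <- tapply(long$agg, list(key, factor(long$year, levels = 1:35)), FUN = sum)

dim(wide)
wide[, 1:5]
```

Cells with no observation are left as NA, and the resulting matrix can be turned into a data frame with as.data.frame if needed.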