My dataset has 34,000 rows and 353 columns. One column is location, and it has 11,000 unique values. I want to subset the dataset within a for loop. I can do this by creating a new data frame for each subset, but I want the subsets to form a single data frame. I have included a sample dataset below:
structure(list(X = structure(c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L,
3L), .Label = c("Car", "DOG", "House"), class = "factor"), Y = c(20L,
20L, 20L, 20L, 410L, 410L, 410L, 410L, 60L), Z = structure(c(1L,
3L, 8L, 1L, 7L, 5L, 2L, 4L, 6L), .Label = c("ARGENTINA", "BERLIN GERMANY",
"BUENOS AIRES ARGENTINA", "DUBLIN IRELAND", "FROM AUSTRIA", "GERMANY",
"IN TRANSIT FROM GERMANY", "RIVER PLATE ARGENTINA"), class = "factor"),
K = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor")),
.Names = c("X", "Y", "Z", "K"), class = "data.frame", row.names = c(NA, -9L))
I can use the following code to create new data frames
l=c("ARGENTINA","IRELAND")
for(i in l){
assign(paste("newdata",i,sep=""),
subset(TESTL[which(grepl(i,TESTL$Z)&
!grepl("IN TRANSIT",TESTL$Z)&!grepl("FROM",TESTL$Z)),],
select=c("X","Y","Z")))}
However I want to create a single new dataframe to hold all the subsets. I have tried the following code
d<-data.frame()
for(i in l){d<-rbind(d,c(
subset(TESTL[which(grepl(i,TESTL$Z) & !grepl("IN TRANSIT",TESTL$Z)
& !grepl("FROM",TESTL$Z)),],
select=c("X","Y","Z")))}
I get the following warnings:
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "DOG") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "DUBLIN IRELAND") :
invalid factor level, NA generated
I have attempted to convert the factors to characters with no success. Any help appreciated
I think you are making your life rather difficult by using assign here and trying to store the subsets in separate data frames. Try something more like this (dat is the data frame that the question calls TESTL):
l <- c("ARGENTINA","IRELAND")
res <- setNames(vector("list",length(l)),l)
for (i in seq_along(l)){
res[[i]] <- dat[grepl(l[i],dat$Z) & !grepl("IN TRANSIT",dat$Z) & !grepl("FROM",dat$Z),c("X","Y","Z")]
}
> res
$ARGENTINA
X Y Z
1 Car 20 ARGENTINA
2 Car 20 BUENOS AIRES ARGENTINA
3 Car 20 RIVER PLATE ARGENTINA
4 Car 20 ARGENTINA
$IRELAND
X Y Z
8 DOG 410 DUBLIN IRELAND
> do.call("rbind",res)
X Y Z
ARGENTINA.1 Car 20 ARGENTINA
ARGENTINA.2 Car 20 BUENOS AIRES ARGENTINA
ARGENTINA.3 Car 20 RIVER PLATE ARGENTINA
ARGENTINA.4 Car 20 ARGENTINA
IRELAND DOG 410 DUBLIN IRELAND
The warnings occur because the first iteration of the loop (ARGENTINA) creates the factor variables X and Z in d, and the second iteration (IRELAND) introduces values that are not among those factor levels. So:
First, you should change the class of the factor variables in TESTL:
for (i in names(TESTL) [grep ("factor", sapply (TESTL, class))]) {
TESTL[[i]] <- as.character (TESTL[[i]])
}
Then it will work with the next code:
d <- data.frame(stringsAsFactors=F)
for(i in l){d <- rbind(d,
TESTL [grepl(i,TESTL$Z) & !grepl("FROM|IN TRANSIT", TESTL$Z), c("X", "Y", "Z")])}
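If the loop is not needed for anything else, the same rows can probably be selected in one step by collapsing the patterns into a single regular expression. This is a minimal sketch, assuming the sample data from the question rebuilt with character columns; the names keep and d are only illustrative.

```r
# Sample data from the question, rebuilt with character columns
TESTL <- data.frame(
  X = c("Car", "Car", "Car", "Car", "House", "House", "House", "DOG", "House"),
  Y = c(20L, 20L, 20L, 20L, 410L, 410L, 410L, 410L, 60L),
  Z = c("ARGENTINA", "BUENOS AIRES ARGENTINA", "RIVER PLATE ARGENTINA",
        "ARGENTINA", "IN TRANSIT FROM GERMANY", "FROM AUSTRIA",
        "BERLIN GERMANY", "DUBLIN IRELAND", "GERMANY"),
  K = "A",
  stringsAsFactors = FALSE
)

l <- c("ARGENTINA", "IRELAND")

# One pattern that matches any of the wanted locations: "ARGENTINA|IRELAND"
wanted <- paste(l, collapse = "|")

# Keep rows that match a wanted location and are not "FROM" / "IN TRANSIT" rows
keep <- grepl(wanted, TESTL$Z) & !grepl("FROM|IN TRANSIT", TESTL$Z)
d <- TESTL[keep, c("X", "Y", "Z")]
d
```

One difference from the rbind loop: a row that matches more than one pattern appears only once here, whereas the loop would append it once per pattern.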
I have 2 data frames with multiple factor columns. One is the base data frame and the other is the final data frame. I want to update the levels of the base data frame using the final data frame.
Consider this example:
base <- data.frame(product=c("Business Call", "Business Transactional",
"Monthly Non-Compounding and Standard Non-Compounding",
"OCR based Call", "Offsale Call", "Offsale Savings",
"Offsale Transactional", "Out of Scope","Personal Call"))
base$product <- as.factor(base$product)
final <- data.frame(product=c("Business Call", "Business Transactional",
"Monthly Standard Non-Compounding", "OCR based Call",
"Offsale Call", "Offsale Savings","Offsale Transactional",
"Out of Scope","Personal Call", "You Money"))
final$product <- as.factor(final$product)
What I want is for final to end up with the same levels as base, dropping levels that have no counterpart at all, such as "You Money", while "Monthly Standard Non-Compounding" should be fuzzy matched to its counterpart in base.
Eg:
levels(base$var1)   # "a" "b" "c"
levels(final$var1)  # "Aa" "Bb" "Cc"
Is there a way to overwrite the levels in base data using the final data using some kind of fuzzy match?
Like I want the final levels for both data to be the same. i.e.
levels(base$var1)   # "Aa" "Bb" "Cc"
levels(final$var1)  # "Aa" "Bb" "Cc"
We could build our own fuzzyMatcher.
First, we'll need a (sort of) vectorized agrep function,
agrepv <- function(x, y) all(as.logical(sapply(x, agrep, y)))
on which we build our fuzzyMatcher.
fuzzyMatcher <- function(from, to) {
mc <- mapply(function(y)
which(mapply(function(x) agrepv(y, x), Map(levels, to))),
Map(levels, from))
# relabel the columns of `to` (the argument, not a global object)
return(Map(function(x, y) `levels<-`(x, y), to,
Map(levels, from)[mc]))
}
Applying the labels of final1 to base1 (note that I've swapped the columns to make it a little more challenging):
base1[] <- fuzzyMatcher(final1, base1)
# X1 X2
# 1 Aa Xx
# 2 Aa Xx
# 3 Aa Yy
# 4 Aa Yy
# 5 Bb Yy
# 6 Bb Zz
# 7 Bb Zz
# 8 Aa Xx
# 9 Cc Xx
# 10 Cc Zz
Update
Based on the new provided data above it'll make sense to use another vectorized agrepv2(), which, used with outer(), enables us to apply agrep on all combinations of the levels of both vectors. Hereafter colSums that equal zero give us non-matching levels and which.max the matching levels of the target data frame final. We can use these two resulting vectors on the one hand to delete unused rows of final, on the other hand to subset the desired levels of the base data frame in order to rebuild the factor column.
# add to mimic other columns in data frame
base$x <- seq(nrow(base))
final$x <- seq(nrow(final))
# some abbreviations for convenience
p1 <- levels(base$product)
p2 <- levels(final$product)
# agrep
agrepv2 <- Vectorize(function(x, y, ...) agrep(p2[x], p1[y])) # new vectorized agrep
out <- t(outer(seq(p2), seq(p1), agrepv2, max.distance=0.9)) # apply `agrepv2`
del.col <- grep(0, colSums(apply(out, 2, lengths))) # find negative matches
lvl <- unlist(apply(out, 2, which.max)) # find positive matches
lvl <- as.character(p2[lvl]) # get the labels
# delete "non-existing" rows and re-generate factor with new labels
transform(final[-del.col, ], product=factor(product, labels=lvl))
# product x
# 1 Business Call 1
# 2 Business Transactional 2
# 4 OCR based Call 4
# 5 Offsale Call 5
# 6 Offsale Savings 6
# 7 Offsale Transactional 7
# 8 Out of Scope 8
# 9 Personal Call 9
Data
base1 <- structure(list(X1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), X2 = structure(c(1L,
1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 3L), .Label = c("x", "y", "z"
), class = "factor")), row.names = c(NA, -10L), class = "data.frame")
final1 <- structure(list(X1 = structure(c(1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L,
2L, 2L, 3L, 3L, 2L, 2L, 2L), .Label = c("Xx", "Yy", "Zz"), class = "factor"),
X2 = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L), .Label = c("Aa", "Bb", "Cc"), class = "factor")), row.names = c(NA,
-15L), class = "data.frame")
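For comparison, a similar relabelling can be sketched with edit distances instead of agrep. This is a different technique from the fuzzyMatcher above, not a rewrite of it: it assumes that every level of the base factor has exactly one sensible counterpart in the final factor, and it uses utils::adist to pick the closest one. The objects base_f, final_f, d and nearest are made up for the illustration.

```r
# Toy factors with the levels from the question's small example
base_f  <- factor(c("a", "a", "b", "c"))
final_f <- factor(c("Aa", "Cc", "Bb", "Bb"))

# Edit distance between every base level (rows) and every final level (columns)
d <- utils::adist(levels(base_f), levels(final_f), ignore.case = TRUE)

# For each base level, the index of the closest final level
nearest <- apply(d, 1, which.min)

# Overwrite the base levels with their closest final levels
levels(base_f) <- levels(final_f)[nearest]
levels(base_f)
```

To drop levels that have no counterpart at all (such as "You Money"), a cut-off on the distances in d would be needed in addition; where to put that cut-off depends on the data.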
I have a survey dataset that I imported as an SAS file but it did not include the text labels that are associated with the numeric codes in the dataset.
I'm trying to apply the factor function to all variables and then have the respective levels and labels for each variable.
I have a main dataframe with the actual data, and then a second dataframe with the text labels corresponding to each value for each variable.
So, for example, the variable column names in the main dataset are A1, B1, C1, D1. The second dataframe with the labels is listed below with dummy text. And for each variable, there are varying numbers of values that need text labels.
labels_list <- structure(list(VariableName = c("A1", "A1", "A1", "B1", "B1",
"B1", "B1", "C1", "C1", "C1", "C1", "C1", "D1", "D1", "D1", "D1",
"D1", "D1"), Value = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = c("Red", "Blue", "Yellow",
"Up", "Down", "Left", "Right", "Boston", "Atlanta", "Dallas",
"New York", "Los Angeles", "John", "Jim", "Jake", "Bill", "Bob",
"Brian")), class = "data.frame", row.names = c(NA, -18L))
I'm trying to write a function to automatically label all the factor variables. The function first reduces both data frames so that they contain exactly the same variables in exactly the same order. I split the table above into a list using the split function, so that each variable name has its own list element, but I'm encountering an error when I try to subset the list in the for loop.
Below is the for loop I have written.
df = main dataset
labels_list = list with the value and text labels
for(i in 1:ncol(df)) {
for(j in labels_list) {
if(names(x[,i]) == names(ahs_split[[j]])) {
x[,i] <- factor(x[,i], levels = c(ahs_split[[j]][[2]]), labels = c(ahs_split[[j]][[3]]))
As I mentioned, my ultimate goal is to take this dataframe with the text labels and corresponding values for each variable and apply it to each one individually using the factor function. I've tried for almost a month now and am just very stuck so I could use any help. I'm not sure if anyone could possibly recommend a better approach or point me in the right direction. I would greatly appreciate any help.
If you don't mind some tidyverse verbs, you can reshape your data with tidyr::gather. Once it's in a long shape, you can join the data with the code lookup by variable name, and reshape it back into a wide format. This workflow scales for however many columns you need.
library(dplyr)
library(tidyr)
labels_list <- structure(list(Variable = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1",
"B1", "C1", "D1"), class = "factor"), Value = c(1L, 2L, 3L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = structure(c(15L,
3L, 18L, 17L, 8L, 12L, 16L, 5L, 1L, 7L, 14L, 13L, 11L, 10L, 9L,
2L, 4L, 6L), .Label = c("Atlanta", "Bill", "Blue", "Bob", "Boston",
"Brian", "Dallas", "Down", "Jake", "Jim", "John", "Left", "Los_Angeles",
"New_York", "Red", "Right", "Up", "Yellow"), class = "factor")), class = "data.frame", row.names = c(NA,
-18L))
df <- tibble(A1 = rep(1:3,2),
B1 = c(1:4, 1, 2),
C1 = c(1:5, 1),
D1 = 1:6
)
A row number iterated over Variable will be necessary to spread the data, but you can drop it after it's no longer needed.
df %>%
gather(key = Variable, value = Value) %>%
left_join(labels_list, by = c("Variable", "Value")) %>%
select(-Value) %>%
group_by(Variable) %>%
mutate(row = row_number()) %>%
spread(key = Variable, value = Label)
#> Warning: Column `Variable` joining character vector and factor, coercing
#> into character vector
#> # A tibble: 6 x 5
#> row A1 B1 C1 D1
#> <int> <fct> <fct> <fct> <fct>
#> 1 1 Red Up Boston John
#> 2 2 Blue Down Atlanta Jim
#> 3 3 Yellow Left Dallas Jake
#> 4 4 Red Right New_York Bill
#> 5 5 Blue Up Los_Angeles Bob
#> 6 6 Yellow Down Boston Brian
One way is to convert your labels_list into a list of lists:
library(dplyr) # just using dplyr for the pipe %>%, otherwise everything is in base R
# Convert df to list of key:value pairs
labels_list <- labels_list %>%
split(f = labels_list$VariableName) %>%
lapply(function(x) list(key = x$Value, value = x$Label))
e.g.:
$A1
$A1$key
[1] 1 2 3
$A1$value
[1] "Red" "Blue" "Yellow"
This can be mapped onto your df col-wise with apply. This is a bit hacky as I put the column name as the first item of the vector passed to the function.
# Map labels onto sample data with factor()
apply(rbind(names(df), df),
2,
function(x) factor(x[2:length(x)],
levels = labels_list[[x[1]]]$key,
labels = labels_list[[x[1]]]$value)) %>%
as.data.frame()
A1 B1 C1 D1
1 Blue Up Dallas Jake
2 Red Down New York Jake
3 Yellow Left Boston Jim
4 Yellow Right Boston John
5 Yellow Down Los Angeles Jake
6 Red Left Atlanta Jake
7 Blue Down New York John
8 Red Down Atlanta Brian
9 Blue Up New York Jim
10 Yellow Down Atlanta Bill
Sample Data
set.seed(1724)
df <- data.frame(A1 = floor(runif(10, 1, 4)),
B1 = floor(runif(10, 1, 5)),
C1 = floor(runif(10, 1, 6)),
D1 = floor(runif(10, 1, 7)))
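The same idea can also be written without apply, by looping over the column names that the data and the lookup table share. This is only a sketch in base R under the assumption that the lookup table has the columns VariableName, Value and Label as in the question; the small df and the names lookup and shared are invented for the example.

```r
# Shortened lookup table in the shape used in the question
labels_list <- data.frame(
  VariableName = c("A1", "A1", "A1", "B1", "B1", "B1", "B1"),
  Value        = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  Label        = c("Red", "Blue", "Yellow", "Up", "Down", "Left", "Right"),
  stringsAsFactors = FALSE
)

# Toy data with numeric codes
df <- data.frame(A1 = c(1, 2, 3, 1), B1 = c(4, 3, 2, 1))

# One lookup data frame per variable name
lookup <- split(labels_list, labels_list$VariableName)

# Only touch columns that have an entry in the lookup
shared <- intersect(names(df), names(lookup))

# Replace each shared column by a factor with the matching levels and labels
df[shared] <- lapply(shared, function(v) {
  factor(df[[v]], levels = lookup[[v]]$Value, labels = lookup[[v]]$Label)
})
df
```

Because the columns are addressed by name, the data and the lookup table do not have to be in the same order, and columns without labels are left as they are.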
I have a panel dataset which looks like the following
ID Model Month Country Activations avg_price
1 VW Golf 2012-01 NL 23 5000
1 VW Golf 2012-02 NL 2 5500
1 VW Golf 2012-01 FR 8 6000
1 VW Golf 2012-02 FR 34 7000
2 Audi TT 2012-01 NL 8 6900
Now, I want to take first differences for the Activations and avg_price variables. I do this using the diff(data$Activations) function from the plm package, but first I have to transform the data frame using pdata.frame(data). So:
data_fd = pdata.frame(data)
data_fd$Activations = diff(data_fd$Activations)
This returns the following error using the data above: duplicate couples (id-time) in resulting pdata.frame. This is because I have data on different countries and when I aggregate the data over all the countries (so total Activations and avg_price and only one id-month combination) this works fine. However, I want now to take the first differences also using the Country variable.
My dataframe should, then, look like:
ID Model Month Country Activations avg_price
1 VW Golf 2012-01 NL NA NA
1 VW Golf 2012-02 NL -21 500
1 VW Golf 2012-01 FR NA NA
1 VW Golf 2012-02 FR 26 1000
etc
Does anyone know how I can make this happen?
Have a look. Is this what you want?
lag_new <- structure(list(ID = c(1L, 1L, 1L, 1L, 2L), Model = structure(c(2L,
2L, 2L, 2L, 1L), .Label = c("Audi TT", "VW Golf"), class = "factor"),
Month = structure(c(1L, 2L, 1L, 2L, 1L), .Label = c("2012-01",
"2012-02"), class = "factor"), Country = structure(c(2L,
2L, 1L, 1L, 2L), .Label = c("FR", "NL"), class = "factor"),
Activations = c(23L, 2L, 8L, 34L, 8L), avg_price = c(5000L,
5500L, 6000L, 7000L, 6900L), Activations_new = c(NA, -21L,
6L, 26L, -26L), avg_price_new = c(NA, 500L, 500L, 1000L,
-100L)), row.names = c(NA, -5L), class = "data.frame")
# dplyr::lag() shifts a vector by one position; stats::lag() does not
lag_new$Activations_new <- lag_new$Activations - dplyr::lag(lag_new$Activations)
lag_new$avg_price_new <- lag_new$avg_price - dplyr::lag(lag_new$avg_price)
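To get NA at the start of every ID and Country combination, as in the desired output of the question, the difference has to be taken within groups. A minimal base R sketch, assuming the five sample rows from the question and that Month sorts correctly as text; panel, grp and first_diff are hypothetical names.

```r
# Sample panel from the question
panel <- data.frame(
  ID          = c(1, 1, 1, 1, 2),
  Model       = c("VW Golf", "VW Golf", "VW Golf", "VW Golf", "Audi TT"),
  Month       = c("2012-01", "2012-02", "2012-01", "2012-02", "2012-01"),
  Country     = c("NL", "NL", "FR", "FR", "NL"),
  Activations = c(23, 2, 8, 34, 8),
  avg_price   = c(5000, 5500, 6000, 7000, 6900),
  stringsAsFactors = FALSE
)

# Sort so that consecutive rows within a group are consecutive months
panel <- panel[order(panel$ID, panel$Country, panel$Month), ]

# One group per ID and Country combination
grp <- paste(panel$ID, panel$Country)

# First difference with a leading NA, so the length of the group is preserved
first_diff <- function(x) c(NA, diff(x))

panel$Activations_fd <- ave(panel$Activations, grp, FUN = first_diff)
panel$avg_price_fd   <- ave(panel$avg_price,   grp, FUN = first_diff)
panel
```

With the sample data this gives -21 and 500 for NL in 2012-02 and 26 and 1000 for FR in 2012-02, and NA for the first month of each group.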
I have a table like the following image and I'm trying to use a simple if statement to return the country name only in cases where food is "Oranges". The 3rd column is the desired outcome, the 4th column is what I get in R.
In excel the formula would be:
=IF(A2="Oranges",B2,"n/a")
I have used the following r code to generate the "oranges_country" variable:
table$oranges_country <- ifelse (Food == "Oranges", Country , "n/a")
[As per the image above] The code returns the number of the level (e.g. 6) in the levels list for 'Country' rather than 'Country' itself (e.g. "Spain"). I understand where this coming from (the position in the extract as below), but it's a pain particularly when using several nested if statements.
levels(Country)
[1] "California" "Ecuador" "France" "New Zealand" "Peru" "Spain" "UK"
There must be a simple way to change this???
As requested in a comment: dput(table) output as follows:
dput(table)
structure(list(Food = structure(c(1L, 1L, 3L, 1L, 1L, 3L, 3L,
2L, 2L), .Label = c("Apples", "Bananas", "Oranges"), class = "factor"),
Country = structure(c(3L, 7L, 6L, 4L, 7L, 6L, 1L, 5L, 2L), .Label = c("California",
"Ecuador", "France", "New Zealand", "Peru", "Spain", "UK"
), class = "factor"), Desired_If.Outcome = structure(c(2L,
2L, 3L, 2L, 2L, 3L, 1L, 2L, 2L), .Label = c("California",
"n/a", "Spain"), class = "factor"), oranges_country = c("n/a",
"n/a", "6", "n/a", "n/a", "6", "1", "n/a", "n/a"), desiredcolumn = c(NA,
NA, 6L, NA, NA, 6L, 1L, NA, NA)), .Names = c("Food", "Country",
"Desired_If.Outcome", "oranges_country", "desiredcolumn"), row.names = c(NA,
-9L), class = "data.frame")
Try ifelse(). First, convert table$Country to character:
table$Country<-as.character(table$Country)
table$desiredcolumn<-ifelse(table$Food == "Oranges", table$Country, NA)
Here is my version:
Food<-c("Ap","Ap","Or","Ap","Ap","Or","Or","Ba","Ba")
Country<-c("Fra","UK","Sp","Nz","UK","Sp","Cal","Per","Eq")
Table<-cbind(Food,Country)
Table<-data.frame(Table)
Table$Country<-as.character(Table$Country)
Table$DC<-ifelse(Table$Food=="Or", Table$Country, NA)
Table
Food Country DC
1 Ap Fra <NA>
2 Ap UK <NA>
3 Or Sp Sp
4 Ap Nz <NA>
5 Ap UK <NA>
6 Or Sp Sp
7 Or Cal Cal
8 Ba Per <NA>
9 Ba Eq <NA>
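The reason the conversion matters is that ifelse() keeps only the underlying values of its yes argument, so a factor comes back as its integer codes. A small sketch with invented data (tbl, codes and the four rows are not from the question) that shows both behaviours side by side:

```r
# Toy data with factor columns, as read.csv() produced by default before R 4.0
tbl <- data.frame(
  Food    = factor(c("Apples", "Oranges", "Bananas", "Oranges")),
  Country = factor(c("France", "Spain", "Peru", "California"))
)

# Passing the factor directly returns the integer level codes
codes <- ifelse(tbl$Food == "Oranges", tbl$Country, NA)

# Converting to character first returns the country names
tbl$oranges_country <- ifelse(tbl$Food == "Oranges",
                              as.character(tbl$Country), NA)

codes
tbl$oranges_country
```

Here the levels of Country are sorted alphabetically, so "Spain" has code 4 and "California" has code 1, which is what codes contains for the two "Oranges" rows.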
Try this (if your table is called table):
table[table$Food=="Oranges", ]
I'm trying to change the plotting order within facets of a faceted dotplot in ggplot2, but I can't get it to work. Here's my melted dataset:
> London.melt
country medal.type count
1 South Korea gold 13
2 Italy gold 8
3 France gold 11
4 Australia gold 7
5 Japan gold 7
6 Germany gold 11
7 Great Britain & N. Ireland gold 29
8 Russian Federation gold 24
9 China gold 38
10 United States gold 46
11 South Korea silver 8
12 Italy silver 9
13 France silver 11
14 Australia silver 16
15 Japan silver 14
16 Germany silver 19
17 Great Britain & N. Ireland silver 17
18 Russian Federation silver 26
19 China silver 27
20 United States silver 29
21 South Korea bronze 7
22 Italy bronze 11
23 France bronze 12
24 Australia bronze 12
25 Japan bronze 17
26 Germany bronze 14
27 Great Britain & N. Ireland bronze 19
28 Russian Federation bronze 32
29 China bronze 23
30 United States bronze 29
and here's my plot command:
qplot(x = count, y = country, data = London.melt, geom = "point", facets = medal.type ~.)
The result I get is as follows:
The facets themselves appear in the order I want in this plot. Within each facet, however, I'd like to sort by count. That is, for each type of medal, I'd like the country that won the greatest number of those medals on top, and so on. The procedure I have used successfully when there are no facets (say we're only looking at gold medals) is to use the reorder function on the factor country, sorting by count but this doesn't work in the present example.
I'd greatly appreciate any suggestions you might have.
Here is a solution using paste, free scales and some relabeling:
library(ggplot2)
London.melt$medal.type<-factor(London.melt$medal.type, levels = c("gold","silver","bronze"))
# Make every country unique
London.melt$country_l <- with(London.melt, paste(country, medal.type, sep = "_"))
# Reorder the unique countries
q <- qplot(x = count, y = reorder(country_l, count), data = London.melt, geom = "point") + facet_grid(medal.type ~., scales = "free_y")
# Rename the countries using the original names
q + scale_y_discrete("Country", breaks = London.melt$country_l, labels = London.melt$country)
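The reason the paste step helps can be seen without plotting: once every country and medal combination has its own level, reorder() can sort all of them by count, and with free y scales each facet then shows only its own levels in that order. A base R sketch with a six-row extract of the data; medals and labels_clean are illustrative names.

```r
# Six rows taken from London.melt
medals <- data.frame(
  country    = rep(c("China", "United States", "Japan"), 2),
  medal.type = rep(c("gold", "silver"), each = 3),
  count      = c(38, 46, 7, 27, 29, 14),
  stringsAsFactors = FALSE
)

# One unique key per country and medal type
medals$country_l <- paste(medals$country, medals$medal.type, sep = "_")

# Order the keys by count (each key occurs once, so the mean is the count)
medals$country_l <- reorder(medals$country_l, medals$count)
levels(medals$country_l)

# Strip the "_medal" suffix again to recover the axis labels
labels_clean <- sub("_[^_]+$", "", levels(medals$country_l))
labels_clean
```

The sub() call assumes that the separator "_" does not occur in the medal type, which holds for gold, silver and bronze.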
This is obviously quite late, and some of what I'm doing may have not been around 6 years ago, but I came across this question while doing a similar task. I'm always reluctant to set tick labels with a vector—it feels safer to use a function that can operate on the original labels.
To do that, I'm creating a factor ID column based on the country and the medal, with some delimiter character that doesn't already appear in either of those columns—in this case, _ works. Then with forcats::fct_reorder, I can order that column by count. The last few levels of this column are below, and should correspond to the country + medal combinations with the highest counts.
library(tidyverse)
London_ordered <- London.melt %>%
mutate(id = paste(country, medal.type, sep = "_") %>%
as_factor() %>%
fct_reorder(count, .fun = min))
levels(London_ordered$id) %>% tail()
#> [1] "Great Britain & N. Ireland_gold" "United States_silver"
#> [3] "United States_bronze" "Russian Federation_bronze"
#> [5] "China_gold" "United States_gold"
Then use this ID as your y-axis. On its own, you'd then have very long labels that include the medal type. Because of the unique delimiter, you can write an inline function for the y-axis labels that will remove the delimiter and any word characters that come after it, leaving you with just the countries. Moving the facet specification to a facet_wrap function lets you then set the free y-scale.
qplot(x = count, y = id, data = London_ordered, geom = "point") +
scale_y_discrete(labels = function(x) str_remove(x, "_\\w+$")) +
facet_wrap(~ medal.type, scales = "free_y", ncol = 1)
This is the best I can do with qplot. Not exactly what you asked for but closer. OOOPs I see you already figured that out.
q <- qplot(x = count, y = reorder(country, count), data = London.melt, geom = "point", facets = medal.type ~.)
Here's a dput version so others can improve:
dput(London.melt)
structure(list(country = structure(c(9L, 6L, 3L, 1L, 7L, 4L,
5L, 8L, 2L, 10L, 9L, 6L, 3L, 1L, 7L, 4L, 5L, 8L, 2L, 10L, 9L,
6L, 3L, 1L, 7L, 4L, 5L, 8L, 2L, 10L), .Label = c("Australia",
"China", "France", "Germany", "Great Britain & N. Ireland", "Italy",
"Japan", "Russian Federation", "South Korea", "United States"
), class = "factor"), medal.type = structure(c(2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("bronze",
"gold", "silver"), class = "factor"), count = c(13L, 8L, 11L,
7L, 7L, 11L, 29L, 24L, 38L, 46L, 8L, 9L, 11L, 16L, 14L, 19L,
17L, 26L, 27L, 29L, 7L, 11L, 12L, 12L, 17L, 14L, 19L, 32L, 23L,
29L)), .Names = c("country", "medal.type", "count"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30"))