I have a survey dataset that I imported as an SAS file but it did not include the text labels that are associated with the numeric codes in the dataset.
I'm trying to apply the factor function to all variables and then have the respective levels and labels for each variable.
I have a main dataframe with the actual data, and then a second dataframe with the text labels corresponding to each value for each variable.
So, for example, the variable column names in the main dataset are A1, B1, C1, D1. The second dataframe with the labels is listed below with dummy text. And for each variable, there are varying numbers of values that need text labels.
labels_list <- structure(list(VariableName = c("A1", "A1", "A1", "B1", "B1",
"B1", "B1", "C1", "C1", "C1", "C1", "C1", "D1", "D1", "D1", "D1",
"D1", "D1"), Value = c(1L, 2L, 3L, 1L, 2L, 3L, 4L, 1L, 2L, 3L,
4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = c("Red", "Blue", "Yellow",
"Up", "Down", "Left", "Right", "Boston", "Atlanta", "Dallas",
"New York", "Los Angeles", "John", "Jim", "Jake", "Bill", "Bob",
"Brian")), class = "data.frame", row.names = c(NA, -18L))
I'm trying to write a function to automatically label all the factor variables. The function reduces the data so that both data frames contain exactly the same variables in exactly the same order. I split the table above into a list using the split function, so each variable name above has its own list element, but I'm encountering an error when I try to subset the list in the for loop.
Below is the for loop I have written.
# df = main dataset
# labels_list = list with the value and text labels
for(i in 1:ncol(df)) {
  for(j in labels_list) {
    if(names(x[,i]) == names(ahs_split[[j]])) {
      x[,i] <- factor(x[,i], levels = c(ahs_split[[j]][[2]]), labels = c(ahs_split[[j]][[3]]))
    }
  }
}
As I mentioned, my ultimate goal is to take this dataframe with the text labels and corresponding values for each variable and apply it to each one individually using the factor function. I've tried for almost a month now and am just very stuck so I could use any help. I'm not sure if anyone could possibly recommend a better approach or point me in the right direction. I would greatly appreciate any help.
If you don't mind some tidyverse verbs, you can reshape your data with tidyr::gather. Once it's in a long shape, you can join the data with the code lookup by variable name, and reshape it back into a wide format. This workflow scales for however many columns you need.
library(dplyr)
library(tidyr)
labels_list <- structure(list(Variable = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("A1",
"B1", "C1", "D1"), class = "factor"), Value = c(1L, 2L, 3L, 1L,
2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L), Label = structure(c(15L,
3L, 18L, 17L, 8L, 12L, 16L, 5L, 1L, 7L, 14L, 13L, 11L, 10L, 9L,
2L, 4L, 6L), .Label = c("Atlanta", "Bill", "Blue", "Bob", "Boston",
"Brian", "Dallas", "Down", "Jake", "Jim", "John", "Left", "Los_Angeles",
"New_York", "Red", "Right", "Up", "Yellow"), class = "factor")), class = "data.frame", row.names = c(NA,
-18L))
df <- tibble(A1 = rep(1:3,2),
B1 = c(1:4, 1, 2),
C1 = c(1:5, 1),
D1 = 1:6
)
A row number iterated over Variable will be necessary to spread the data, but you can drop it after it's no longer needed.
df %>%
gather(key = Variable, value = Value) %>%
left_join(labels_list, by = c("Variable", "Value")) %>%
select(-Value) %>%
group_by(Variable) %>%
mutate(row = row_number()) %>%
spread(key = Variable, value = Label)
#> Warning: Column `Variable` joining character vector and factor, coercing
#> into character vector
#> # A tibble: 6 x 5
#> row A1 B1 C1 D1
#> <int> <fct> <fct> <fct> <fct>
#> 1 1 Red Up Boston John
#> 2 2 Blue Down Atlanta Jim
#> 3 3 Yellow Left Dallas Jake
#> 4 4 Red Right New_York Bill
#> 5 5 Blue Up Los_Angeles Bob
#> 6 6 Yellow Down Boston Brian
One way is to convert your labels_list into a list of lists:
library(dplyr) # just using dplyr for the pipe %>%, otherwise everything is in base R
# Convert df to list of key:value pairs
labels_list <- labels_list %>%
split(f = labels_list$VariableName) %>%
lapply(function(x) list(key = x$Value, value = x$Label))
e.g.:
$A1
$A1$key
[1] 1 2 3
$A1$value
[1] "Red" "Blue" "Yellow"
This can be mapped onto your df col-wise with apply. This is a bit hacky as I put the column name as the first item of the vector passed to the function.
# Map labels onto sample data with factor()
apply(rbind(names(df), df),
2,
function(x) factor(x[2:length(x)],
levels = labels_list[[x[1]]]$key,
labels = labels_list[[x[1]]]$value)) %>%
as.data.frame()
A1 B1 C1 D1
1 Blue Up Dallas Jake
2 Red Down New York Jake
3 Yellow Left Boston Jim
4 Yellow Right Boston John
5 Yellow Down Los Angeles Jake
6 Red Left Atlanta Jake
7 Blue Down New York John
8 Red Down Atlanta Brian
9 Blue Up New York Jim
10 Yellow Down Atlanta Bill
Sample Data
set.seed(1724)
df <- data.frame(A1 = floor(runif(10, 1, 4)),
B1 = floor(runif(10, 1, 5)),
C1 = floor(runif(10, 1, 6)),
D1 = floor(runif(10, 1, 7)))
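The column-name trick with apply() can be sidestepped by walking over the columns and their lookup tables in parallel. What follows is a minimal base R sketch, assuming the lookup has the VariableName / Value / Label columns from the question. The names lookup and shared, and the extra column Z9, are made up for illustration.

```r
# Lookup table in the shape given in the question (shortened to two variables)
labels_list <- data.frame(
  VariableName = c("A1", "A1", "A1", "B1", "B1", "B1", "B1"),
  Value        = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  Label        = c("Red", "Blue", "Yellow", "Up", "Down", "Left", "Right"),
  stringsAsFactors = FALSE
)

# Small stand-in for the main dataset; Z9 (hypothetical) has no labels
df <- data.frame(A1 = c(1, 2, 3, 1), B1 = c(4, 3, 2, 1), Z9 = c(10, 20, 30, 40))

# One lookup data frame per variable, named by variable
lookup <- split(labels_list, labels_list$VariableName)

# Only touch columns that exist in both the data and the lookup
shared <- intersect(names(df), names(lookup))

# Pair each column with its own lookup table and build the factor
df[shared] <- Map(
  function(column, labs) factor(column, levels = labs$Value, labels = labs$Label),
  df[shared],
  lookup[shared]
)

str(df)
```

Because each column is paired with its lookup by name, the two data frames do not need to be in the same order. Codes that are missing from the lookup come out as NA, which is worth checking for on real survey data.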
I have 2 data frames with multiple factor columns. One is the base data frame and the other is the final data frame. I want to update the levels of the base data frame using the final data frame.
Consider this example:
base <- data.frame(product=c("Business Call", "Business Transactional",
"Monthly Non-Compounding and Standard Non-Compounding",
"OCR based Call", "Offsale Call", "Offsale Savings",
"Offsale Transactional", "Out of Scope","Personal Call"))
base$product <- as.factor(base$product)
final <- data.frame(product=c("Business Call", "Business Transactional",
"Monthly Standard Non-Compounding", "OCR based Call",
"Offsale Call", "Offsale Savings","Offsale Transactional",
"Out of Scope","Personal Call", "You Money"))
final$product <- as.factor(final$product)
What I want is for the final data frame to have the same levels as base, with levels that have no counterpart at all, such as "You Money", removed, while "Monthly Standard Non-Compounding" should be fuzzy matched.
E.g.:
levels(base$var1)   # "a" "b" "c"
levels(final$var1)  # "Aa" "Bb" "Cc"
Is there a way to overwrite the levels in the base data with those from the final data using some kind of fuzzy match?
I want the final levels for both data frames to be the same, i.e.
levels(base$var1)   # "Aa" "Bb" "Cc"
levels(final$var1)  # "Aa" "Bb" "Cc"
We could build our own fuzzyMatcher.
First, we'll need a (sort of) vectorized agrep function,
agrepv <- function(x, y) all(as.logical(sapply(x, agrep, y)))
on which we build our fuzzyMatcher.
fuzzyMatcher <- function(from, to) {
  # for each column of `from`, find the column of `to` whose levels fuzzy-match
  mc <- mapply(function(y)
    which(mapply(function(x) agrepv(y, x), Map(levels, to))),
    Map(levels, from))
  # overwrite the levels of the `to` columns (not a global `base`)
  return(Map(function(x, y) `levels<-`(x, y), to,
             Map(levels, from)[mc]))
}
Final labels applied to base labels (note that I've shifted the columns to make it a little more sophisticated):
base1[] <- fuzzyMatcher(final1, base1)
# X1 X2
# 1 Aa Xx
# 2 Aa Xx
# 3 Aa Yy
# 4 Aa Yy
# 5 Bb Yy
# 6 Bb Zz
# 7 Bb Zz
# 8 Aa Xx
# 9 Cc Xx
# 10 Cc Zz
Update
Based on the newly provided data above, it makes sense to use another vectorized function, agrepv2(), which, used with outer(), lets us apply agrep to all combinations of the levels of both vectors. Columns whose colSums equal zero then give us the non-matching levels, and which.max gives the matching levels of the target data frame final. We can use these two resulting vectors to delete the unmatched rows of final on the one hand, and to subset the desired levels of the base data frame on the other, in order to rebuild the factor column.
# add to mimic other columns in data frame
base$x <- seq(nrow(base))
final$x <- seq(nrow(final))
# some abbreviations for convenience
p1 <- levels(base$product)
p2 <- levels(final$product)
# agrep
agrepv2 <- Vectorize(function(x, y, ...) agrep(p2[x], p1[y]))  # new vectorized agrep
out <- t(outer(seq(p2), seq(p1), agrepv2, max.distance=0.9))  # apply `agrepv2`
del.col <- grep(0, colSums(apply(out, 2, lengths))) # find negative matches
lvl <- unlist(apply(out, 2, which.max)) # find positive matches
lvl <- as.character(p2[lvl]) # get the labels
# delete "non-existing" rows and re-generate factor with new labels
transform(final[-del.col, ], product=factor(product, labels=lvl))
# product x
# 1 Business Call 1
# 2 Business Transactional 2
# 4 OCR based Call 4
# 5 Offsale Call 5
# 6 Offsale Savings 6
# 7 Offsale Transactional 7
# 8 Out of Scope 8
# 9 Personal Call 9
Data
base1 <- structure(list(X1 = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L,
3L, 3L), .Label = c("a", "b", "c"), class = "factor"), X2 = structure(c(1L,
1L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 3L), .Label = c("x", "y", "z"
), class = "factor")), row.names = c(NA, -10L), class = "data.frame")
final1 <- structure(list(X1 = structure(c(1L, 3L, 1L, 1L, 2L, 3L, 2L, 1L,
2L, 2L, 3L, 3L, 2L, 2L, 2L), .Label = c("Xx", "Yy", "Zz"), class = "factor"),
X2 = structure(c(2L, 1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 3L), .Label = c("Aa", "Bb", "Cc"), class = "factor")), row.names = c(NA,
-15L), class = "data.frame")
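As a cross-check on the agrep-based matching, the same idea can be sketched with adist() from the utils package, which returns the edit distance for every pair of levels. This is a different technique from the answer above, not a rewrite of it. The 0.5 cutoff and the names base_lv, final_lv, rel and mapping are assumptions made for this illustration.

```r
base_lv <- c("Business Call", "Business Transactional",
             "Monthly Non-Compounding and Standard Non-Compounding",
             "OCR based Call", "Offsale Call", "Offsale Savings",
             "Offsale Transactional", "Out of Scope", "Personal Call")
final_lv <- c("Business Call", "Business Transactional",
              "Monthly Standard Non-Compounding", "OCR based Call",
              "Offsale Call", "Offsale Savings", "Offsale Transactional",
              "Out of Scope", "Personal Call", "You Money")

# Edit distance between every final level (rows) and base level (columns)
d <- adist(final_lv, base_lv)

# Scale by the longer string so short and long labels are comparable
rel <- d / outer(nchar(final_lv), nchar(base_lv), pmax)

nearest <- apply(rel, 1, which.min)                  # closest base level per final level
best    <- rel[cbind(seq_along(final_lv), nearest)]  # its relative distance

cutoff  <- 0.5  # assumed threshold; tune it for real data
mapping <- ifelse(best <= cutoff, base_lv[nearest], NA_character_)
names(mapping) <- final_lv

# Drop rows without a counterpart, then recode onto the base levels
final <- data.frame(product = factor(final_lv))
final <- final[!is.na(mapping[as.character(final$product)]), , drop = FALSE]
final$product <- factor(unname(mapping[as.character(final$product)]),
                        levels = base_lv)

print(mapping)
```

Printing mapping before recoding makes it easy to eyeball which levels were matched fuzzily and which were dropped, which is worth doing before trusting any particular cutoff.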
I am trying to create new variables in my data set that are cumulative totals which restart based on other variables (using group by). I want these to be new columns in the data set, and this is the part I am struggling with.
Using the data below, I want to create cumulative Sale and Profit columns that will restart for every Product and Product_Cat grouping.
The code below partly gives me what I need, but the results are not new variables; instead it overwrites the existing Sale/Profit columns. What am I getting wrong? I imagine this is simple, but I haven't found anything.
Note: I'm using lapply as my real data set has 40+ variables that I need to create calculations for.
DT <- setDT(Data)[,lapply(.SD, cumsum), by = .(Product,Product_Cat) ]
Data for example:
Product <- c('A','A','A','B','B','B','C','C','C')
Product_Cat <- c('S1','S1','S2','C1','C1','C1','D1','E1','F1')
Sale <- c(10,15,5,20,15,10,5,5,5)
Profit <- c(2,4,2,6,8,2,4,6,8)
Sale_Cum <- c(10,25,5,20,35,45,5,5,5)
Profit_Cum <- c(2,6,2,6,14,16,4,6,8)
Data <- data.frame(Product,Product_Cat,Sale,Profit)
Desired_Data <- data.frame(Product,Product_Cat,Sale,Profit,Sale_Cum,Profit_Cum)
This doesn't use the group by per se but I think it achieves what you're looking for in that it is easily extensible to many columns:
D2 <- data.frame(lapply(Data[,c(3,4)], cumsum))
names(D2) <- gsub("$", "_cum", names(Data[,c(3,4)]))
Data <- cbind(Data, D2)
If you have 40+ columns just change the c(3,4) to include all the columns you're after.
EDIT:
I forgot that the OP wanted it to reset for each category. In that case, you can modify your original code:
DT <- setDT(Data)[,lapply(.SD, cumsum), by = .(Product,Product_Cat) ]
names(DT)[c(-1,-2)] <- gsub("$", "_cum", names(Data)[c(-1,-2)])
# assumes the rows of Data are already ordered by group, as in the example
cbind(Data, DT[,c(-1,-2)])
library(data.table)
setDT(Data)
cols <- names(Data)[3:4]
Data[, paste0(cols, '_cumsum') := lapply(.SD, cumsum)
, by = .(Product, Product_Cat)
, .SDcols = cols]
Data:
structure(list(Product = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Product_Cat = structure(c(5L,
5L, 6L, 1L, 1L, 1L, 2L, 3L, 4L), .Label = c("C1", "D1", "E1",
"F1", "S1", "S2"), class = "factor"), Sale = c(10L, 15L, 5L,
20L, 15L, 10L, 5L, 5L, 5L), Profit = c(2L, 4L, 2L, 6L, 8L, 2L,
4L, 6L, 8L), Sale_Cum = c(10, 25, 5, 20, 35, 45, 5, 5, 5), Profit_Cum = c(2,
6, 2, 6, 14, 16, 4, 6, 8)), .Names = c("Product", "Product_Cat",
"Sale", "Profit", "Sale_Cum", "Profit_Cum"), row.names = c(NA,
-9L), class = "data.frame")
We can iteratively slice the dataframe based on Product and Product_Cat and, for each slice, assign the output produced by cumsum() to Sale_Cum and Profit_Cum:
x <- as.data.frame(Data)  # x is assumed to be a plain data.frame copy of the example data
cols <- c('Sale', 'Profit')
for (column in cols){
x[, paste0(column, '_Cum')] <- 0
for(p in unique(x$Product)){
for (pc in unique(x$Product_Cat)){
x[x$Product == p & x$Product_Cat == pc, paste0(column, '_Cum')] <- cumsum(x[x$Product == p & x$Product_Cat == pc, column])
}
}
}
print(x)
# Product Product_Cat Sale Profit Sale_Cum Profit_Cum
# 1 A S1 10 2 10 2
# 2 A S1 15 4 25 6
# 3 A S2 5 2 5 2
# 4 B C1 20 6 20 6
# 5 B C1 15 8 35 14
# 6 B C1 10 2 45 16
# 7 C D1 5 4 5 4
# 8 C E1 5 6 5 6
# 9 C F1 5 8 5 8
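The nested loops above can also be written with ave() from base R, which applies a function within groups and hands the results back in the original row order. This is a small sketch on the question's example data. The _Cum suffix follows the question's desired output, and no extra packages are assumed.

```r
Data <- data.frame(
  Product     = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
  Product_Cat = c("S1", "S1", "S2", "C1", "C1", "C1", "D1", "E1", "F1"),
  Sale        = c(10, 15, 5, 20, 15, 10, 5, 5, 5),
  Profit      = c(2, 4, 2, 6, 8, 2, 4, 6, 8)
)

# Columns to accumulate; extend this vector for the 40+ real variables
cols <- c("Sale", "Profit")

# ave() runs cumsum within each Product/Product_Cat group and keeps row order,
# so the results can be added as new columns next to the originals
Data[paste0(cols, "_Cum")] <- lapply(
  Data[cols],
  function(v) ave(v, Data$Product, Data$Product_Cat, FUN = cumsum)
)

print(Data)
```

Since the original columns are left untouched, this sidesteps the overwriting problem from the question without storing and re-binding the old values.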
Here is some pretty poor code that does everything step by step:
library(data.table)
#sample data
d<-sample(1:10)
f<-sample(1:10)
p<-c("f","f","f","f","q","q","q","w","w","w")
pc<-c("c","c","d","d","d","v","v","v","b","b")
cc<-data.table(p,pc,d,f)
#storing the values that are overwritten first.
three<-cc[,3]
four<- cc[,4]
#applying your function
dt<-setDT(cc)[,lapply(.SD,cumsum), by=.(p,pc)]
#binding the stored values to your function and renaming everything.
x<-cbind(dt,three,four)
colnames(x)[5]<-"sale"
colnames(x)[6]<-"profit"
#column 3 is the cumulative sum of d (sale), column 4 of f (profit)
colnames(x)[3]<-"CumSale"
colnames(x)[4]<-"CumProfit"
#reordering the columns
xx<-x[,c("p","pc","profit","sale","CumSale","CumProfit")]
xx
I'm trying to get the data from column one that matches with column 2 but only on the "B" values. Need to somehow make the true values a list.
Need this to repeat for 50,000 rows. Around 37,000 of them are true.
I'm incredibly new to this so any help would be nice.
Data <- data.frame(
X = sample(1:10),
Y = sample(c("B", "W"), 10, replace = TRUE)
)
Count <- 1
If(data[count,2] == "B") {
List <- list(data[count,1]
Count <- count + 1
#I'm not sure what to use to repeat I just put
Repeat
} else {
Count <- count + 1
Repeat
}
End result should be a list() of only column one data.
In this if rows 1-5 had "B" I want the column one numbers from that.
Not sure if I understood correctly what you're looking for, but from the comments I would assume that this might help:
setNames(data.frame(Data[1][Data[2]=="B"]), "selected")
# selected
#1 2
#2 5
#3 7
#4 6
No loop needed.
data
Data <- structure(list(X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
Y = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 1L),
.Label = c("B", "W"), class = "factor")),
.Names = c("X", "Y"), row.names = c(NA, -10L),
class = "data.frame")
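To make the "no loop needed" point concrete, here is a short sketch of the same selection done with a logical vector, ending in the list() the question asked for. It reuses the example data above with Y stored as plain character. The names is_b, selected and selected_list are chosen only for this illustration.

```r
Data <- data.frame(
  X = c(10L, 4L, 9L, 8L, 3L, 2L, 5L, 1L, 7L, 6L),
  Y = c("W", "W", "W", "W", "W", "B", "B", "W", "B", "B"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose second column is "B"
is_b <- Data$Y == "B"

# Column-one values for those rows, as a plain vector
selected <- Data$X[is_b]

# The same values as a list, if a list is really required
selected_list <- as.list(selected)

print(selected)
```

The comparison is evaluated for all rows at once, so the same code should carry over to the 50,000-row case without a counter or a repeat.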
For a sample dataframe:
df <- structure(list(animal.1 = structure(c(1L, 1L, 2L, 2L, 2L, 4L,
4L, 3L, 1L, 1L), .Label = c("cat", "dog", "horse", "rabbit"), class = "factor"),
animal.2 = structure(c(1L, 2L, 2L, 2L, 4L, 4L, 1L, 1L, 3L,
1L), .Label = c("cat", "dog", "hamster", "rabbit"), class = "factor"),
number = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L)), .Names = c("animal.1",
"animal.2","number"), class = "data.frame", row.names = c(NA,
-10L))
... I wish to make a new df with 'animal' duplicates all added together. For example multiple rows with the same animal in columns 1 and 2 will be put together. So for example the dataframe above would read:
cat cat 16
dog dog 7
cat dog 3 etc. etc... (those with different animals would be left as they are). Importantly the sum of 'number' in both dataframes would be the same.
My real df is >400K observations, so anything that anyone could recommend could cope with a large dataset would be great!
Thanks in advance.
One option would be to use data.table. Convert the "data.frame" to a "data.table" (setDT()); if the "animal.1" rows are equal to "animal.2", replace "number" with the sum of "number" after grouping by the two columns, and finally get the unique rows.
library(data.table)
setDT(df)[as.character(animal.1)==as.character(animal.2),
number:=sum(number) ,.(animal.1, animal.2)]
unique(df)
# animal.1 animal.2 number
#1: cat cat 16
#2: cat dog 3
#3: dog dog 7
#4: dog rabbit 1
#5: rabbit rabbit 4
#6: rabbit cat 6
#7: horse cat 7
#8: cat hamster 1
Or an option with dplyr. The approach is similar to data.table: we group by "animal.1" and "animal.2", replace "number" with its sum only when "animal.1" is equal to "animal.2", and get the unique rows.
library(dplyr)
df %>%
group_by(animal.1, animal.2) %>%
mutate(number=replace(number,as.character(animal.1)==
as.character(animal.2),
sum(number))) %>%
unique()
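For comparison, the same rule (sum only the rows where both animals match, leave the rest as they are) can be sketched in base R by splitting the rows first. This is an illustrative alternative, not the answer's method, with the example data stored as character columns. The names same, summed and result are made up here.

```r
df <- data.frame(
  animal.1 = c("cat", "cat", "dog", "dog", "dog", "rabbit", "rabbit", "horse", "cat", "cat"),
  animal.2 = c("cat", "dog", "dog", "dog", "rabbit", "rabbit", "cat", "cat", "hamster", "cat"),
  number   = c(5L, 3L, 2L, 5L, 1L, 4L, 6L, 7L, 1L, 11L),
  stringsAsFactors = FALSE
)

# Rows where both columns name the same animal
same <- df$animal.1 == df$animal.2

# Collapse only those rows, summing number per animal pair
summed <- aggregate(number ~ animal.1 + animal.2, data = df[same, ], FUN = sum)

# Rows with different animals are appended unchanged
result <- rbind(summed, df[!same, ])
rownames(result) <- NULL

print(result)
```

Unlike unique(), this keeps rows with different animals even when two of them happen to be identical, so the total of number is preserved in that case as well.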
My dataset has 34,000 rows and 353 columns. One column is location and it has 11,000 unique values. I want to subset the dataset within a for loop. I can do this by creating a new data frame for each subset, but I want the subsets to form a single data frame. I have included a sample dataset below
structure(list(X = structure(c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L,
3L), .Label = c("Car", "DOG", "House"), class = "factor"), Y = c(20L,
20L, 20L, 20L, 410L, 410L, 410L, 410L, 60L), Z = structure(c(1L,
3L, 8L, 1L, 7L, 5L, 2L, 4L, 6L), .Label = c("ARGENTINA", "BERLIN GERMANY",
"BUENOS AIRES ARGENTINA", "DUBLIN IRELAND", "FROM AUSTRIA", "GERMANY",
"IN TRANSIT FROM GERMANY", "RIVER PLATE ARGENTINA"), class = "factor"),
K = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "A", class = "factor")),
.Names = c("X", "Y", "Z", "K"), class = "data.frame", row.names = c(NA, -9L))
I can use the following code to create new data frames
l=c("ARGENTINA","IRELAND")
for(i in l){
assign(paste("newdata",i,sep=""),
subset(TESTL[which(grepl(i,TESTL$Z)&
!grepl("IN TRANSIT",TESTL$Z)&!grepl("FROM",TESTL$Z)),],
select=c("X","Y","Z")))}
However I want to create a single new dataframe to hold all the subsets. I have tried the following code
d<-data.frame()
for(i in l){d<-rbind(d,c(
subset(TESTL[which(grepl(i,TESTL$Z) & !grepl("IN TRANSIT",TESTL$Z)
& !grepl("FROM",TESTL$Z)),],
select=c("X","Y","Z"))))}
I get the following warnings
Warning messages:
1: In `[<-.factor`(`*tmp*`, ri, value = "DOG") :
invalid factor level, NA generated
2: In `[<-.factor`(`*tmp*`, ri, value = "DUBLIN IRELAND") :
invalid factor level, NA generated
I have attempted to convert the factors to characters with no success. Any help appreciated
I think you are making your life rather difficult by using assign here and trying to store the subsets in separate data frames. Try something more like this:
l <- c("ARGENTINA","IRELAND")
res <- setNames(vector("list",length(l)),l)
for (i in seq_along(l)){
res[[i]] <- dat[grepl(l[i],dat$Z) & !grepl("IN TRANSIT",dat$Z) & !grepl("FROM",dat$Z),c("X","Y","Z")]
}
> res
$ARGENTINA
X Y Z
1 Car 20 ARGENTINA
2 Car 20 BUENOS AIRES ARGENTINA
3 Car 20 RIVER PLATE ARGENTINA
4 Car 20 ARGENTINA
$IRELAND
X Y Z
8 DOG 410 DUBLIN IRELAND
> do.call("rbind",res)
X Y Z
ARGENTINA.1 Car 20 ARGENTINA
ARGENTINA.2 Car 20 BUENOS AIRES ARGENTINA
ARGENTINA.3 Car 20 RIVER PLATE ARGENTINA
ARGENTINA.4 Car 20 ARGENTINA
IRELAND DOG 410 DUBLIN IRELAND
The warnings occur because the first iteration of the loop (ARGENTINA) creates X and Z as factor variables, and the second iteration (IRELAND) introduces values that are not among those factor levels. So:
First, you should change the classes of your variables in TESTL:
for (i in names(TESTL) [grep ("factor", sapply (TESTL, class))]) {
TESTL[[i]] <- as.character (TESTL[[i]])
}
Then it will work with the next code:
d <- data.frame(stringsAsFactors=FALSE)
for(i in l){d <- rbind(d,
TESTL [grepl(i,TESTL$Z) & !grepl("FROM|IN TRANSIT", TESTL$Z), c("X", "Y", "Z")])}
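If the goal is just one data frame, the loop can be dropped altogether by collapsing the search terms into a single pattern. The sketch below rebuilds the sample data with character columns and assumes the terms in l contain no regular-expression metacharacters. The names wanted and excluded are introduced only for this example.

```r
TESTL <- data.frame(
  X = c("Car", "Car", "Car", "Car", "House", "House", "House", "DOG", "House"),
  Y = c(20L, 20L, 20L, 20L, 410L, 410L, 410L, 410L, 60L),
  Z = c("ARGENTINA", "BUENOS AIRES ARGENTINA", "RIVER PLATE ARGENTINA",
        "ARGENTINA", "IN TRANSIT FROM GERMANY", "FROM AUSTRIA",
        "BERLIN GERMANY", "DUBLIN IRELAND", "GERMANY"),
  stringsAsFactors = FALSE
)

l <- c("ARGENTINA", "IRELAND")

# One pattern that matches any of the locations: "ARGENTINA|IRELAND"
wanted   <- grepl(paste(l, collapse = "|"), TESTL$Z)
excluded <- grepl("FROM|IN TRANSIT", TESTL$Z)

# Single subset, no rbind and no growing data frame
d <- TESTL[wanted & !excluded, c("X", "Y", "Z")]

print(d)
```

One difference from the loop: a row that matches several terms appears once here, whereas the rbind loop would add it once per matching term.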