Related
It is related to this question and this other one, although on a larger scale.
I have two data.tables:
The first one with market research data, containing answers stored as integers;
The second one being what can be called a dictionary, with category labels associated with the integers mentioned above.
See the reproducible example:
EDIT: Addition of a new variable to include the '0' case.
EDIT 2: Modification of 'age_group' variable to include cases where all unique levels of a factor do not appear in data.
library(data.table)
library(magrittr)
# Table with survey data :
# - each observation contains the answers of a person
# - variables describe the sample population characteristics (gender, age...)
# - numeric variables (like age) are also stored as character vectors
repex_DT <- data.table (
country = as.character(c(1,3,4,2,NA,1,2,2,2,4,NA,2,1,1,3,4,4,4,NA,1)),
gender = as.character(c(NA,2,2,NA,1,1,1,2,2,1,NA,2,1,1,1,2,2,1,2,NA)),
age = as.character(c(18,40,50,NA,NA,22,30,52,64,24,NA,38,16,20,30,40,41,33,59,NA)),
age_group = as.character(c(2,2,2,NA,NA,2,2,2,2,2,NA,2,2,2,2,2,2,2,2,NA)),
status = as.character(c(1,NA,2,9,2,1,9,2,2,1,9,2,1,1,NA,2,2,1,2,9)),
children = as.character(c(0,2,3,1,6,1,4,2,4,NA,NA,2,1,1,NA,NA,3,5,2,1))
)
# Table of the labels associated to categorical variables, plus 'label_id' to match the values
labels_DT <- data.table (
label_id = as.character(c(1:9)),
country = as.character(c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4",NA,NA,NA,NA,NA)),
gender = as.character(c("Male","Female",NA,NA,NA,NA,NA,NA,NA)),
age_group = as.character(c("Less than 35","35 and more",NA,NA,NA,NA,NA,NA,NA)),
status = as.character(c("Employed","Unemployed",NA,NA,NA,NA,NA,NA,"Do not want to say")),
children = as.character(c("0","1","2","3","4","5 and more",NA,NA,NA))
)
# Identification of the variable nature (numeric or character)
var_type <- c("character","character","numeric","character","character","character")
# Identification of the categorical variable names
categorical_var <- names(repex_DT)[which(var_type == "character")]
You can see that the dictionary table is smaller than the survey data table; this is expected.
Also, despite all variables being stored as character, some are true numeric variables like age, and consequently do not appear in the dictionary table.
My objective is to replace the values of every variable in the first data.table that has a matching name in the dictionary table with the corresponding labels.
I have actually achieved it using a loop, like the one below:
result_DT1 <- copy(repex_DT)
for (x in categorical_var){
if(length(which(repex_DT[[x]]=="0"))==0){
values_vector <- labels_DT$label_id
labels_vector <- labels_DT[[x]]
}else{
values_vector <- c("0",labels_DT$label_id)
labels_vector <- c(labels_DT[[x]][1:(length(labels_DT[[x]])-1)], NA, labels_DT[[x]][length(labels_DT[[x]])])}
result_DT1[, (c(x)) := plyr::mapvalues(x=get(x), from=values_vector, to=labels_vector, warn_missing = F)]
}
What I want is a faster method (the fastest, if one exists), since I have thousands of variables to label for tens of thousands of records.
Any performance improvements would be more than welcome. I battled with stringi but could not get the function to run without errors unless I used hard-coded variable names. See example:
test_stringi <- copy(repex_DT) %>%
.[, (c("country")) := lapply(.SD, function(x) stringi::stri_replace_all_fixed(
str=x, pattern=unique(labels_DT$label_id)[!is.na(labels_DT[["country"]])],
replacement=unique(na.omit(labels_DT[["country"]])), vectorize_all=FALSE)),
.SDcols = c("country")]
The columns of your second data.table are just lookup vectors:
same_cols <- intersect(names(repex_DT), names(labels_DT))
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[as.integer(x)],
repex_DT[, same_cols, with = FALSE],
labels_DT[, same_cols, with = FALSE],
SIMPLIFY = FALSE
)
]
Edit:
You can add NA in the first position of each column of labels_DT (similar to what you did for the other missing values) or, better yet, keep the labels in a list:
labels_list <- list(
country = c("COUNTRY 1","COUNTRY 2","COUNTRY 3","COUNTRY 4"),
gender = c("Male","Female"),
age_group = c("Less than 35","35 and more"),
status = c("Employed","Unemployed","Do not want to say"),
children = c("0","1","2","3","4","5 and more")
)
same_cols <- names(labels_list)
repex_DT[
,
(same_cols) := mapply(
function(x, y) y[factor(as.integer(x))],
repex_DT[, same_cols, with = FALSE],
labels_list,
SIMPLIFY = FALSE
)
]
Note that with this approach it is necessary to convert to factor first, because the values in repex_DT are not necessarily the sequence 1, 2, 3...
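To make the factor step concrete, here is a minimal sketch on hand-made vectors (the names `x`, `y`, `age_x` and `lookup` are illustrative and not part of the answer above). It also shows a caveat: the factor codes come from the values actually present in the data, so a column in which some level never occurs (the 'age_group' case from EDIT 2) can end up mislabelled. Indexing a lookup vector by name is one possible way around that, offered here only as an assumption-laden alternative.

```r
# Labels for 'status', stored without NA padding (as in labels_list)
y <- c("Employed", "Unemployed", "Do not want to say")
x <- c("1", NA, "2", "9")

# Plain integer indexing asks for positions 1, NA, 2, 9; position 9 does not exist
direct <- y[as.integer(x)]

# factor() maps the sorted unique values 1, 2, 9 to the codes 1, 2, 3
via_factor <- y[factor(as.integer(x))]

# Caveat: if only the value "2" occurs in the data, its factor code is 1
age_labels <- c("Less than 35", "35 and more")
age_x <- c("2", "2", NA)
mislabelled <- age_labels[factor(as.integer(age_x))]

# Hypothetical alternative: a lookup vector named by label id
lookup <- c("1" = "Less than 35", "2" = "35 and more")
by_name <- unname(lookup[age_x])
```

Here `via_factor` recovers "Do not want to say" for the value 9, whereas `direct` returns NA for it; `mislabelled` shows the wrong label for the age group, and `by_name` the intended one.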
A very computationally efficient way would be to melt your tables first, match them, and cast the result back:
repex_DT[, idx:= .I] # Create an index used for melting
# Melt
repex_melt <- melt(repex_DT, id.vars = "idx")
labels_melt <- melt(labels_DT, id.vars = "label_id")
# Match variables and value/label_id
repex_melt[labels_melt, value2:= i.value, on= c("variable", "value==label_id")]
# Put the data back into its original shape
result <- dcast(repex_melt, idx~variable, value.var = "value2")
I finally found time to work on an answer to this matter.
I changed my approach and used fastmatch::fmatch to identify labels to update.
As pointed out by @det, it is not possible to handle variables with a starting '0' label in the same loop as the other standard categorical variables, so the instruction is basically repeated twice.
Still, this is much faster than my initial for loop approach.
The answer below:
library(data.table)
library(magrittr)
library(stringi)
library(fastmatch)
#Selection of variable names depending on the presence of '0' labels
same_cols_with0 <- intersect(names(repex_DT), names(labels_DT))[
which(intersect(names(repex_DT), names(labels_DT)) %fin%
names(repex_DT)[which(unlist(lapply(repex_DT, function(x)
sum(stri_detect_regex(x, pattern="^0$", negate=FALSE), na.rm=TRUE)),
use.names=FALSE)>=1)])]
same_cols_standard <- intersect(names(repex_DT), names(labels_DT))[
which(!(intersect(names(repex_DT), names(labels_DT)) %fin% same_cols_with0))]
labels_std <- labels_DT[, same_cols_standard, with=FALSE]
labels_0 <- labels_DT[, same_cols_with0, with=FALSE]
levels_id <- as.integer(labels_DT$label_id)
#Update joins via matching IDs (credit to @det for mapply syntax).
result_DT <- data.table::copy(repex_DT) %>%
.[, (same_cols_standard) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=levels_id, nomatch=NA)],
repex_DT[, same_cols_standard, with=FALSE], labels_std, SIMPLIFY=FALSE)] %>%
.[, (same_cols_with0) := mapply(
function(x, y) y[fastmatch::fmatch(x=as.integer(x), table=(levels_id - 1), nomatch=NA)],
repex_DT[, same_cols_with0, with=FALSE], labels_0, SIMPLIFY=FALSE)]
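The offset used in the second mapply call can be checked in isolation. The sketch below uses base `match()`, which `fastmatch::fmatch()` is designed to mirror; the vectors `ids`, `labels` and `x` are made-up illustrations rather than the real tables.

```r
ids <- 1:9
labels <- c("0", "1", "2", "3", "4", "5 and more", NA, NA, NA)
x <- c("0", "2", "5", "6", NA)

# Without the shift, the value 0 finds no id and every other value is off by one
unshifted <- labels[match(as.integer(x), ids)]

# Shifting the ids by one lets the value 0 hit the first label
shifted <- labels[match(as.integer(x), ids - 1)]
```

With these inputs, `shifted` maps "5" to "5 and more" and leaves "6" as NA, because the seventh label is NA in the dictionary.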
I am coding in R and I have a dataframe for region such as:
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"),
Group = c(1,2,2,1), stringsAsFactors = F)
The regions have been clubbed together to make groups, and the Group column tells which group each region belongs to. Given a group, how can I find the regions that constitute it? Any help is really appreciated.
Most importantly and for future posts please
include sample data in a reproducible and copy&paste-able format using e.g. dput
refrain from adding superfluous statements like "This one is super urgent!"
As to your question, first I'll generate some sample data
set.seed(2018)
df <- data.frame(
Region = sample(letters, 10),
Group = sample(1:3, 10, replace = T))
I recommend summarising/aggregating data by Group, which will make it easy to extract information for specific Groups.
For example in base R you can aggregate the data based on Group and concatenate all Regions per Group
aggregate(Region ~ Group, data = df, FUN = toString)
# Group Region
#1 1 m
#2 2 i, l, g, c
#3 3 b, e, k, r, j
Or alternatively you can store all Regions per Group in a list
aggregate(Region ~ Group, data = df, FUN = list)
# Group Region
#1 1 m
#2 2 i, l, g, c
#3 3 b, e, k, r, j
Note that while the output looks identical, toString creates a character string, while list stores the Regions in a list. The latter might be a better format for downstream processing.
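As a small sketch of that difference, using the four-region data from the question (the object names `as_string`, `as_list` and `group1` are only illustrative):

```r
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"),
                   Group = c(1, 2, 2, 1), stringsAsFactors = FALSE)

# One comma-separated string per group
as_string <- aggregate(Region ~ Group, data = data, FUN = toString)
# One character vector per group, held in a list column
as_list <- aggregate(Region ~ Group, data = data, FUN = list)

# With the list column, the regions of a group come back as separate elements
group1 <- as_list$Region[[which(as_list$Group == 1)]]
```

With the string version, the regions of group 1 would first have to be split on ", " again before they could be used individually.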
Similar outputs can be achieved using dplyr
library(dplyr)
df %>%
group_by(Group) %>%
summarise(Region = toString(Region))
So with a small, reproducible example,
data <- data.frame(Region = c("Cali", "NYC", "LA", "Vegas"), Group = c(1,2,2,1),stringsAsFactors=F)
we see the following results, say we want all from group 1
group.number = 1
data[data$Group == group.number,"Region"]
[1] "Cali"  "Vegas"
Or using dplyr
library(dplyr)
group.number = 1
data %>%
filter(Group == group.number)%>%
.$Region
Or from Jilber Urbina (Much more readable)
subset(data, Group==1)$Region
I am creating a summary data.table to be inserted in a knitr report using xtable. I would like to check each row value in each column if is.numeric() == TRUE and if it is, format the number, then revert it back to a character. If is.numeric() == FALSE then return the value. The actual data.table may have many columns.
Here's what I have below, with the desired output at the bottom:
library(data.table)
library(magrittr)
dt <- data.table(A = c("apples",
"bananas",
1000000.999),
B = c("red",
5000000.999,
0.99))
dt
a <- dt[, lapply(.SD,
function(x) {
if (is.na(is.numeric(x))) {
prettyNum(as.numeric(x), digits = 0, big.mark = ",")
} else {
x
}
})]
a
b <- dt[, A := ifelse(is.na(is.numeric(A)),
format(as.numeric(A), digits = 0, big.mark = ","),
A)] %>%
.[, B := ifelse(is.na(is.numeric(B)),
format(as.numeric(B), digits = 0, big.mark = ","),
B)]
b
desired <- data.table(A = c("apples",
"bananas",
"1,000,000"),
B = c("red",
"5,000,000",
"1"))
desired
From my understanding lapply in the j argument of data.table syntax operates on the vector, so it can be used for functions like mean(), sum(), na.approx(), etc. and wouldn't necessarily work here. But I would like to loop over each column in the data.table without specifying each column name since there could be many columns and naming them would be cumbersome. It's kind of like I know the circle doesn't go in the square but I really want it to!
I tried the := ifelse() approach which I thought should work, but it seems to be returning the first element. On a different data.table where the column is entirely numeric, employing the same approach yields all NA.
Thanks for any help!
We can use data.table's set() together with number() from the scales package. Loop over the columns with a for loop, find the indices of the elements that consist only of digits and '.' ('i1'), use those as the i argument of set(), convert those elements to numeric, and apply number() to format them:
library(scales)
library(data.table)
for(j in seq_along(dt)) {
i1 <- grep("^[0-9.]+$", dt[[j]])
set(dt, i = i1, j = j, value = number(as.numeric(dt[[j]][i1]), big.mark = ","))
}
dt
# A B
#1: apples red
#2: bananas 5,000,001
#3: 1,000,001 1
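For comparison, lapply over .SD can also do this once the test is applied per element rather than per column. The following is only a sketch without the scales dependency: the helper `fmt_numeric_strings` is a hypothetical name, and it assumes that any element `as.numeric()` can parse should be rounded and formatted.

```r
library(data.table)

dt <- data.table(A = c("apples", "bananas", "1000000.999"),
                 B = c("red", "5000000.999", "0.99"))

# Hypothetical helper: format the elements that parse as numbers, keep the rest
fmt_numeric_strings <- function(x) {
  num <- suppressWarnings(as.numeric(x))  # non-numeric strings become NA
  is_num <- !is.na(num)
  x[is_num] <- formatC(round(num[is_num]), format = "d", big.mark = ",")
  x
}

# Apply the helper to every column without naming the columns one by one
cols <- names(dt)
dt[, (cols) := lapply(.SD, fmt_numeric_strings), .SDcols = cols]
```

The per-element check is the part the `is.numeric()` attempts were missing: `is.numeric()` answers once for the whole column, and a character column is never numeric.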
I am looking for a quick way to do a 'not join' (i.e. keep the rows that didn't merge, the inverse of an inner join). The way I've been doing it is to use data.table for X and Y, then set the key. For example:
require(data.table)
X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7))
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7))
XY <- merge(X,Y,by='category')
> XY
category val1 val2
1: B 0.3 2
2: C 0.8 3
3: D 0.7 5
But I need the inverse of this, so I have to do:
XY_All <- merge(X,Y,by='category',all=TRUE)
setkey(XY,category)
setkey(XY_All,category)
notXY <- XY_All[!XY] #data.table not join (finally)
> notXY
category val1 val2
1: A 0.2 NA
2: E NA 7
I feel like this is quite long winded (especially from data.frame). Am I missing something?
EDIT: I got this after thinking more about not joins
X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7),key = "category")
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7), key = "category")
notXY <- merge(X[!Y],Y[!X],all=TRUE)
But WheresTheAnyKey's answer below is clearer. One last hurdle is presetting the data.table keys; it would be nice not to have to do that.
EDIT: To clarify, the accepted solution is:
merge(anti_join(X, Y, by = 'category'),anti_join(Y, X, by = 'category'), by = 'category', all = TRUE)
require(dplyr)
bind_rows(anti_join(X, Y, by = "category"), anti_join(Y, X, by = "category"))
EDIT:
Since someone asked for some explanation, here's what is happening:
The first anti_join() call returns the rows from X that have no matching row in Y, with the match determined by the column(s) being joined on. The second does the reverse. bind_rows(), which supersedes the older rbind_list(), stacks its inputs into a single table containing all the observations from each, filling missing variables with NA.
setkey(X,category)
setkey(Y,category)
rbind(X[!Y], Y[!X], fill = TRUE)
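If presetting keys is unwanted, the same two not-joins can be sketched with data.table's `on` argument instead of setkey(). This is a variation on the snippet above, not a different method:

```r
library(data.table)

X <- data.table(category = c("A", "B", "C", "D"), val1 = c(0.2, 0.3, 0.8, 0.7))
Y <- data.table(category = c("B", "C", "D", "E"), val2 = c(2, 3, 5, 7))

# Not-joins on an ad hoc column, so neither table needs a key
notXY <- rbind(X[!Y, on = "category"], Y[!X, on = "category"], fill = TRUE)
```

Here `notXY` holds the rows for categories A and E, with NA filled in for the value column that each row lacks.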
You can make it more concise like this:
X <- data.table(category = c('A','B','C','D'), val1 = c(0.2,0.3,0.8,0.7),key = "category")
Y <- data.table(category = c('B','C','D','E'), val2 = c(2,3,5,7), key = "category")
notXY <- merge(X,Y,all = TRUE)[!merge(X,Y)]
Try this.
First, merge with "all" set to "TRUE". Then take out all complete cases:
XY_All <- merge(X,Y,by='category',all=TRUE)
notXY <- XY_All[!complete.cases(XY_All),]
require(dplyr)
notXY <- merge(X[!X$category %in% Y$category, ], Y[!Y$category %in% X$category, ],
               by = "category", all = TRUE)
One way to look at an Anti-Join is that you need observations from X not in Y and observations from Y not in X concatenated together. This can be achieved in one step as shown above.
Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.
> mydata
id name marks gender
1 a1 56 female
2 a2 37 male
I want to divide the students into groups based on the marks obtained, so that difference between marks in each group should be more than 10. I tried to use the function table, which gives the number of students in each range, say 20-30 or 30-40, but I want it to pick the students that have marks in a given range and put all their information together in a group. Any help is appreciated.
I am not sure what you mean by "put all their information together in a group", but here is a way to split your original data frame into a list of data frames, where each element contains the students within a mark range of 10:
mydata <- data.frame(
id = 1:100,
name = paste0("a",1:100),
marks = sample(20:100,100,TRUE),
gender = sample(c("female","male"),100,TRUE))
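# include.lowest = TRUE keeps students with exactly 20 marks in the first group
split(mydata, cut(mydata$marks, seq(20, 100, by = 10), include.lowest = TRUE))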
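How cut() treats the boundaries matters here, because split() drops rows whose group is NA. A short sketch with a few hand-picked marks (the vector `marks` is illustrative only):

```r
marks <- c(20, 30, 31, 100)
breaks <- seq(20, 100, by = 10)

# Default intervals are (a, b], so the lowest break itself falls outside
default_bins <- cut(marks, breaks)
# include.lowest = TRUE closes the first interval on the left: [20, 30]
closed_bins <- cut(marks, breaks, include.lowest = TRUE)

data.frame(marks, default_bins, closed_bins)
```

With the default settings a student scoring exactly 20 gets NA as a group, while a score of 30 belongs to the lower interval in both cases.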
I think that @Sacha's answer should suffice for what you need to do, even if you have more than one set.
You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't explicitly said whether you plan to "merge" these first (rbind would suffice, as recommended in the comment).
So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.
First, here's some sample data.
# Two data.frames (myData1, and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20,
name = paste("a", 1:20, sep = ""),
marks = sample(20:100, 20, replace = TRUE),
gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
name = paste("b", 1:17, sep = ""),
marks = sample(30:100, 17, replace = TRUE),
gender = sample(c("F", "M"), 17, replace = TRUE))
Second, different options for "grouping".
Option 1: Return (in a list) the values from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) x[x$marks >= 30 & x$marks <= 50, ])
Option 2: Return (in a list) each dataset split into two, one for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, creates four groups. For this example, you'll end up with a nested list with two list items, each with two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, x$marks >= 30 & x$marks <= 50))
Option 3: More flexible than the first. This is essentially @Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
function(x) split(x, cut(x$marks,
breaks = c(0, 30, 50, 75, 100),
include.lowest = TRUE)))
Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only values which match the given condition.
# Combine the data. Assumes the column names are the same in both sets
myDataALL <- rbind(myData1, myData2)
# Extract just the group of scores you're interested in
myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.
split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50)
I hope one of these options serves your needs!
I had the same kind of issue, and after researching some answers on Stack Overflow I came up with the following solution:
Step 1 : Define range
Step 2 : Find the elements that fall in the range
Step 3 : Plot
Sample code is shown below:
# Step 1: define the bin boundaries (bins of width 2000)
range <- seq(0, max(all$downlink), 2000)
# Step 2: count the elements that fall in each bin [range[i], range[i + 1])
counts <- numeric(length(range) - 1)
for (i in seq_along(counts)) {
  counts[i] <- sum(all$downlink >= range[i] & all$downlink < range[i + 1], na.rm = TRUE)
}
# Step 3: plot, with the y-axis limit rounded to the nearest 1000
countmax <- max(counts)
a <- round(countmax / 1000) * 1000
barplot(counts, col = rainbow(16), ylim = c(0, a))
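The counting step could also be sketched with cut() and table() instead of an explicit loop. The data below is randomly generated purely for illustration (the vector `downlink` stands in for `all$downlink`), and `right = FALSE` is chosen so that the intervals are [a, b), matching the `>=` and `<` comparisons in the loop:

```r
set.seed(42)
downlink <- sample(0:9999, 500, replace = TRUE)  # hypothetical values

# Step 1: bin boundaries of width 2000
breaks <- seq(0, max(downlink), 2000)

# Step 2: left-closed intervals [a, b); values at or above the last break are dropped
counts <- table(cut(downlink, breaks = breaks, right = FALSE))

# Loop-style counts for comparison
manual <- vapply(seq_len(length(breaks) - 1), function(i) {
  sum(downlink >= breaks[i] & downlink < breaks[i + 1])
}, integer(1))
```

The resulting `counts` table can be passed straight to barplot(), which then also labels each bar with its interval.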