I have 2 dataframes, testx and testy
testx
testx <- structure(list(group = 1:2), .Names = "group", class = "data.frame", row.names = c(NA,
-2L))
testy
testy <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
time = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L), value = c(50L,
52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)), .Names = c("group",
"time", "value"), class = "data.frame", row.names = c(NA, -9L
))
Based on this topic, I add missing time values using the following code, which works perfectly.
data <- setDT(testy, key='time')[, .SD[J(min(time):max(time))], by = group]
Now I would like to only add these missing time values IF the value for group appears in testx. In this example, I thus only want to add missing time values for groups matching the values for group in the file testx.
data <- setDT(testy, key='time')[,if(testy[group %in% testx[, group]]) .SD[J(min(time):max(time))], by = group]
The error I get is "undefined columns selected". I looked here, here and here, but I don't see why my code isn't working. I am doing this on large datasets, which is why I prefer using data.table.
You don't need to refer to testy when you are inside testy[] and grouping with by; using group directly as a variable gives the correct result. You also need an extra else branch to return the rows whose group is not in testx, if you want to keep all records in testy:
testy[, {if(group %in% testx$group) .SD[J(min(time):max(time))] else .SD}, by = group]
# group time value
# 1: 1 1 50
# 2: 1 2 NA
# 3: 1 3 52
# 4: 1 4 10
# 5: 2 1 4
# 6: 2 2 NA
# 7: 2 3 NA
# 8: 2 4 84
# 9: 2 5 2
# 10: 3 1 25
# 11: 3 5 67
# 12: 3 7 37
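The same conditional fill can also be written without relying on a key, using an explicit `on =` join inside each group. This is only a sketch under the assumption that the data are the testx and testy objects from the question; the name `res` is made up for illustration.

```r
library(data.table)

# Data from the question, rebuilt compactly
testx <- data.frame(group = 1:2)
testy <- data.table(
  group = rep(1:3, each = 3),
  time  = c(1L, 3L, 4L, 1L, 4L, 5L, 1L, 5L, 7L),
  value = c(50L, 52L, 10L, 4L, 84L, 2L, 25L, 67L, 37L)
)

# Inside j, `group` is the current group's value (length 1).
# Groups listed in testx are expanded to every time between min and max;
# all other groups are returned unchanged.
res <- testy[, if (group %in% testx$group) {
  .SD[.(time = min(time):max(time)), on = "time"]
} else {
  .SD
}, by = group]

res
```

Groups 1 and 2 gain rows with NA in value for the missing times, while group 3 keeps its three original rows.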
Related
I am currently working with a data set in R that contains four variables for a large set of individuals: pid, month, window, and agedays. I'm trying to create a loop that will output the min and max agedays of each group of combinations between month and window into a new data table that I can export as a csv.
Here's an example of the data:
pid agedays month window
1 22 2 1
2 35 3 2
3 33 3 2
4 55 3 2
1 66 2 1
2 55 4 2
3 80 4 2
4 90 4 2
I'd like for the new data table to contain the min and max agedays of each group within each combination of window and month as well as the count of each group within each combination. The range for month is 2-24 and the range for window is 0-2.
The data table should look something like this:
month window min max N
2 1 22 66 1
3 2 33 55 3
etc....
where N is the number of unique individuals (pids) within each group
After grouping by 'month' and 'window', get the min and max of 'agedays' and the number of distinct elements of 'pid' (n_distinct).
library(dplyr)
df1 %>%
group_by(month, window) %>%
summarise(min = min(agedays), max = max(agedays), N = n_distinct(pid))
# A tibble: 3 x 5
# Groups: month [3]
# month window min max N
# <int> <int> <int> <int> <int>
#1 2 1 22 66 1
#2 3 2 33 55 3
#3 4 2 55 90 3
We can also do this with data.table
library(data.table)
setDT(df1)[, .(min = min(agedays), max = max(agedays),
N = uniqueN(pid)), by = .(month, window)]
Or using split from base R
do.call(rbind, lapply(split(df1, df1[c('month', 'window')], drop = TRUE),
function(x) cbind(month = x$month[1], window = x$window[1], min = min(x$agedays), max = max(x$agedays),
N = length(unique(x$pid)))))
data
df1 <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)),
class = "data.frame", row.names = c(NA,
-8L))
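Using data.table, we can calculate the min and max of agedays along with the number of rows (.N) for each combination of month and window. Note that .N counts rows rather than distinct pids; use uniqueN(pid) if N should be the number of unique individuals.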
library(data.table)
setDT(df) #Convert to data.table if it is not already
df[, .(min_age = min(agedays, na.rm = TRUE),
max_age = max(agedays, na.rm = TRUE), N = .N), .(month, window)]
# month window min_age max_age N
#1: 2 1 22 66 2
#2: 3 2 33 55 3
#3: 4 2 55 90 3
data
df <- structure(list(pid = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), agedays = c(22L,
35L, 33L, 55L, 66L, 55L, 80L, 90L), month = c(2L, 3L, 3L, 3L,
2L, 4L, 4L, 4L), window = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)), class = "data.frame",
row.names = c(NA, -8L))
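The difference between counting rows and counting unique individuals can be seen side by side. The sketch below assumes the same example data; the column names `rows` and `N` and the object `counts` are chosen only for illustration.

```r
library(data.table)

# Example data from the question
df <- data.table(
  pid     = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L),
  agedays = c(22L, 35L, 33L, 55L, 66L, 55L, 80L, 90L),
  month   = c(2L, 3L, 3L, 3L, 2L, 4L, 4L, 4L),
  window  = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L)
)

# .N counts rows per group; uniqueN(pid) counts distinct individuals
counts <- df[, .(rows = .N, N = uniqueN(pid)), by = .(month, window)]
counts
```

For month 2 and window 1 the two counts differ (two rows, one individual), which matches the N = 1 the question expects.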
I'm trying to fix an NA problem and draw a dot plot of a data frame read from a .csv file.
I'm trying to get the mean, median, and 10% trimmed mean of the given data frame, but I'm getting an error. I have already tried previous suggestions, and they did not help. I also can't plot the dot chart from the data.
Code for mean, median, and 10% trimmed mean
data_val <- read.csv(file =
"~/502_repos_2019/502_Problems/health_regiment.csv", head=TRUE, sep
= " ")
as.numeric(unlist(data_val))
print(ncol(data_val))
print(nrow(data_val))
# I have used several logics but it's not helping to solve the problem
mean(data_val,data_val$cholesterol_level[data_val$Treatment_type ==
'Control_group'])
mean(data_val$cholesterol_level[data_val$Treatment_type ==
'Treatment_group'])
code for dot chart & dot plot
data_val <- read.csv(file =
"~/502_repos_2019/502_Problems/health_regiment.csv", head=TRUE, sep
= " ")
data_val
plot(data_val$Treatment_type ~ data_val$cholestrol_level, xlab =
"Health Unit Range", ylab = " ",
main = "Regiment_Health", type="p") #p for point chart
#dotchart(data_val, data_val$Treatment_type ~
data_val$cholestrol_leve, labels = row.names(data_val),
#cex = 0.6,xlab = "units")
Following is the error message
[2] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 7 3
-4 14 2 5 22 -7 9 5 -6 5 9 4 4 12 37 [38] 5 3 3 [2] 2 [2] 20 argument is not numeric or logical: returning NA[2] NA argument is
not numeric or logical: returning NA[2] NA
Instead of a point plot, I'm getting a bar chart, and the dotchart call is not working even though I believe the syntax is correct.
.csv data
Treatment_type cholestrol_level
Control_group 7
Control_group 3
Control_group -4
Control_group 14
Control_group 2
Control_group 5
Control_group 22
Control_group -7
Control_group 9
Control_group 5
Treatment_group -6
Treatment_group 5
Treatment_group 9
Treatment_group 4
Treatment_group 4
Treatment_group 12
Treatment_group 37
Treatment_group 5
Treatment_group 3
Treatment_group 3
Data in dput format.
data_val <-
structure(list(Treatment_type = structure(c(1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("Control_group", "Treatment_group"),
class = "factor"), cholestrol_level = c(7L, 3L, -4L, 14L,
2L, 5L, 22L, -7L, 9L, 5L, -6L, 5L, 9L, 4L, 4L, 12L, 37L,
5L, 3L, 3L)), class = "data.frame", row.names = c(NA, -20L))
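A minimal sketch of the per-group statistics and a dot chart, assuming the dput data above. Two things in the question's code are worth noting: the column is spelled cholestrol_level in the data while the first snippet refers to cholesterol_level, and mean() is given the whole data frame as its first argument, which produces the "argument is not numeric or logical" warning. The object name `stats` is made up for illustration.

```r
# Data equivalent to the dput output above
data_val <- data.frame(
  Treatment_type   = factor(rep(c("Control_group", "Treatment_group"), each = 10)),
  cholestrol_level = c(7L, 3L, -4L, 14L, 2L, 5L, 22L, -7L, 9L, 5L,
                       -6L, 5L, 9L, 4L, 4L, 12L, 37L, 5L, 3L, 3L)
)

# Mean, median and 10% trimmed mean for each treatment group
stats <- sapply(
  split(data_val$cholestrol_level, data_val$Treatment_type),
  function(x) c(mean = mean(x), median = median(x), trimmed = mean(x, trim = 0.1))
)
stats

# Dot chart of the individual values, grouped by treatment type
dotchart(data_val$cholestrol_level,
         groups = data_val$Treatment_type,
         xlab = "units", cex = 0.6)
```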
Consider the sample data
df <-
structure(
list(
id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
),
.Names = c("id", "A", "B"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id (stored in column 1) has a varying number of entries for columns A and B. In the example data, there are four observations with id = 1. I am looking for a way to subset this data in R so that there are at most 3 entries for each id, and then create another column (labelled C) which holds the order of the entries within each id. The expected output would look like:
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L),
A = c(20L, 12L, 13L, 11L, 21L, 17L),
B = c(1L, 1L, 0L, 1L, 0L, 0L),
C = c(1L, 2L, 3L, 1L, 2L, 1L)
),
.Names = c("id", "A", "B","C"),
class = "data.frame",
row.names = c(NA,-6L)
)
Your help is much appreciated.
Like this?
library(data.table)
dt <- as.data.table(df)
dt[, C := seq(.N), by = id]
dt <- dt[C <= 3,]
dt
# id A B C
# 1: 1 20 1 1
# 2: 1 12 1 2
# 3: 1 13 0 3
# 4: 2 11 1 1
# 5: 2 21 0 2
# 6: 3 17 0 1
Here is one option with dplyr that keeps the top 3 values based on A (following the comments of @Ronak Shah).
library(dplyr)
df %>%
group_by(id) %>%
top_n(n = 3, wt = A) %>% # top 3 values based on A
mutate(C = rank(id, ties.method = "first")) # C consists of the order of each id
# A tibble: 6 x 4
# Groups: id [3]
# id A B C
# <int> <int> <int> <int>
# 1 1 20 1 1
# 2 1 12 1 2
# 3 1 13 0 3
# 4 2 11 1 1
# 5 2 21 0 2
# 6 3 17 0 1
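For comparison, the "first three rows per id" reading of the question can also be sketched in base R with ave(). This assumes the sample data from the question and keeps rows in their original order; the name `out` is made up for illustration.

```r
# Sample data from the question
df <- data.frame(
  id = c(1L, 1L, 1L, 1L, 2L, 2L, 3L),
  A  = c(20L, 12L, 13L, 8L, 11L, 21L, 17L),
  B  = c(1L, 1L, 0L, 0L, 1L, 0L, 0L)
)

# Running counter within each id, in the original row order
df$C <- ave(seq_along(df$id), df$id, FUN = seq_along)

# Keep at most the first three entries per id
out <- df[df$C <= 3, ]
rownames(out) <- NULL
out
```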
I'm trying to replace NA cells with some value, but only in one column. I found another thread explaining how to proceed, but I don't understand how it works.
is.na(dt) returns a table with the same shape as dt in which every value is replaced by TRUE or FALSE, depending on whether the original cell is NA. But the first argument of a data.table is supposed to accept a logical vector that selects rows, not a whole table. And indeed dt[is.na(dt)] returns an error, yet dt[is.na(dt)] = 0 replaces all the NA values with 0. Why does adding = 0 suddenly make this call work? Is it a special feature or part of data.table's design?
The expression would work if it is a data.frame
dt[is.na(dt)]
#[1] NA NA NA NA NA
But in a data.table the syntax is different, and converting to a logical matrix is inefficient and not recommended (v1.10.0):
setDT(dt)[is.na(dt)]
Error in `[.data.table`(setDT(dt), is.na(dt)) : i is invalid type
(matrix). Perhaps in future a 2 column matrix could return a list of
elements of DT (in the spirit of A[B] in FAQ 2.14). Please let
datatable-help know if you'd like this, or add your
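The reason the two forms can behave differently is that R dispatches x[i] and x[i] <- value to two separate functions, `[` and `[<-`, and each class can define its own method for either one. The error above comes from the extraction method only. A minimal base R sketch with a plain data.frame shows the two calls; the object `df_demo` is made up for illustration.

```r
df_demo <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 3))

m <- is.na(df_demo)   # logical matrix with the same shape as df_demo

# Extraction: calls `[.data.frame`, returns the selected cells as a vector
vals <- df_demo[m]

# Assignment: calls `[<-.data.frame`, replaces the selected cells
df_demo[m] <- 0

vals
df_demo
```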
A better option is set(), which replaces values in place without copying:
for(j in seq_along(dt)) {
set(dt, i = which(is.na(dt[[j]])), j = j, value = 0)
}
dt
# a b c
# 1: 1 0 2
# 2: 2 2 2
# 3: 2 1 1
# 4: 2 0 1
# 5: 0 1 2
# 6: 2 0 5
# 7: 1 1 4
# 8: 1 1 0
# 9: 2 1 5
#10: 2 1 1
Or another version is
setDT(dt)[, lapply(.SD, function(x) replace(x, is.na(x), 0))]
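Since the question asks about a single column, the replacement can also be restricted to that column. The sketch below assumes the example data given under "data" and picks column b purely for illustration.

```r
library(data.table)

dt <- data.table(
  a = c(1L, 2L, 2L, 2L, NA, 2L, 1L, 1L, 2L, 2L),
  b = c(NA, 2L, 1L, NA, 1L, NA, 1L, 1L, 1L, 1L),
  c = c(2L, 2L, 1L, 1L, 2L, 5L, 4L, NA, 5L, 1L)
)

# Select the rows where b is NA and assign by reference to b only.
# 0L keeps the column as integer.
dt[is.na(b), b := 0L]

dt
```

Columns a and c keep their NA values, because only b appears on the left of :=.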
data
dt <- structure(list(a = c(1L, 2L, 2L, 2L, NA, 2L, 1L, 1L, 2L, 2L),
b = c(NA, 2L, 1L, NA, 1L, NA, 1L, 1L, 1L, 1L), c = c(2L,
2L, 1L, 1L, 2L, 5L, 4L, NA, 5L, 1L)), .Names = c("a", "b",
"c"), class = "data.frame", row.names = c(NA, -10L))
I want to roll up the data at the unique customer id level, with each observation transposed against it, as shown below.
Below is a snapshot of my data
basedata <- structure(list(customer = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L), .Label = c("a", "b", "d"), class = "factor"), obs = c(12L,
11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)), .Names = c("customer", "obs"
), class = "data.frame", row.names = c(NA, -9L))
Or
customer obs
a 12
a 11
a 12
a 10
b 3
b 5
b 7
d 8
d 1
I want to convert it in the following form
customer obs1 obs2 obs3 obs4
a 12 11 12 10
b 3 5 7 -
d 8 1 - -
I used the following code
basedata$shopping <- unlist(tapply(rawdata$customer, rawdata$customer,
function (x) seq(1, len = length(x))))
reshape(basedata, idvar = "customer", direction = "wide")
It gives the following error
Error in `[.data.frame`(data, , timevar) : undefined columns selected
How can I do it in R and Excel?
Thank You
x <- structure(list(customer = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 3L, 3L), .Label = c("a", "b", "d"), class = "factor"), obs = c(12L,
11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)), .Names = c("customer", "obs"
), class = "data.frame", row.names = c(NA, -9L))
I chose to use a couple of extra packages (plyr and reshape2) because I find them easier and more general to use than reshape from the base package.
library(plyr)
library(reshape2)
## add observation number
x2 <- ddply(x,"customer",transform,num=1:length(customer))
## reshape
dcast(x2,customer~num,value.var="obs")
A base R way, assuming dat is the data,
> s <- split(dat$obs, dat$customer)
> df <- data.frame(do.call(rbind, lapply(s, function(x){ length(x) <- 4; x })))
> names(df) <- paste0('obs', seq(df))
> df
# obs1 obs2 obs3 obs4
# a 12 11 12 10
# b 3 5 7 NA
# d 8 1 NA NA
If you want the unique customer ID to be a column,
> df2 <- cbind(customer = rownames(df), df)
> rownames(df2) <- seq(nrow(df2))
> df2
# customer obs1 obs2 obs3 obs4
# 1 a 12 11 12 10
# 2 b 3 5 7 NA
# 3 d 8 1 NA NA
I'm assuming that "basedata" and "rawdata" are supposed to be the same (or at least copies of each other). If that's the case, you simply haven't specified the timevar argument for reshape.
Continuing from where you left off:
rawdata$shopping <- unlist(tapply(rawdata$customer, rawdata$customer,
function (x) seq(1, len = length(x))))
## rawdata$shopping <- with(rawdata, ave(seq_along(customer), customer, FUN = seq_along))
Here's the actual reshaping step:
reshape(rawdata, idvar = "customer", timevar="shopping", direction = "wide")
# customer obs.1 obs.2 obs.3 obs.4
# 1 a 12 11 12 10
# 5 b 3 5 7 NA
# 8 d 8 1 NA NA
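Put together as one self-contained sketch, assuming rawdata holds the customer and obs columns from the question (customer is stored as character here for simplicity, and the name `wide` is made up for illustration):

```r
rawdata <- data.frame(
  customer = c("a", "a", "a", "a", "b", "b", "b", "d", "d"),
  obs      = c(12L, 11L, 12L, 10L, 3L, 5L, 7L, 8L, 1L)
)

# Observation counter within each customer; this becomes the timevar
rawdata$shopping <- ave(seq_along(rawdata$customer), rawdata$customer,
                        FUN = seq_along)

# Long to wide: one row per customer, one obs.<n> column per counter value
wide <- reshape(rawdata, idvar = "customer", timevar = "shopping",
                direction = "wide")
wide
```

Customers with fewer than four observations get NA in the remaining obs columns.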