How can I check unchanged row values in a dataframe (by grouping ID) - r

How can I dynamically check for unchanged rows in a data frame (grouped by ID)?
My data frame:
ID NAME GENDER AGE
1 muthu male 20
1 MUTHU MALE 20
2 NA male 28
3 jake male 30
3 jake male 31
4 jhon male 21
4 \n\rjhon\n\r \n\male\n\r 21
5 NA NA NA
5 NA NA NA
The expected result is:
Unchanged ID
1
2
4
5
dput of the data:
structure(list(ID = c(1, 1, 2, 3, 3, 4, 4, 5, 5), NAME = structure(c(4L,
5L, NA, 2L, 2L, 3L, 1L, NA, NA), .Label = c("\\n\\rjhon\\n\\r",
"jake", "jhon", "muthu", "MUTHU"), class = "factor"), GENDER = structure(c(2L,
3L, 2L, 2L, 2L, 2L, 1L, NA, NA), .Label = c("\\n\\male\\n\\r",
"male", "MALE"), class = "factor"), AGE = c(20, 20, 28, 30, 31,
21, 21, NA, NA)), .Names = c("ID", "NAME", "GENDER", "AGE"), row.names = c(NA,
-9L), class = "data.frame")

Here is an option using data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'ID', convert each column to lower case and remove any \n or \r, count the unique elements (uniqueN) and check whether that equals 1. Reduce the per-column results to a single logical column with `&`, then subset the 'ID' based on that logical column ('V1').
library(data.table)
setDT(df1)[, Reduce(`&`, lapply(.SD, function(x)
uniqueN(gsub("([\\]+)(n|r)|[\\]+", "", tolower(x)))==1)) , ID][(V1), .(ID)]
# ID
#1: 1
#2: 2
#3: 4
#4: 5

Here is a base R idea. We strip \\n and \\r from the values and convert them to lower case. After that, the unchanged rows are the ones with duplicates. The second condition is for groups with only one entry, which we handle with ave.
c(df$ID[duplicated(sapply(df, function(i) tolower(gsub('([\\]+)(n|r)|[\\]+', '', i))))],
  df$ID[with(df, ave(ID, ID, FUN = length)) == 1])
#[1] 1 4 5 2
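The same normalise-then-compare idea can also be packaged as a small base R helper (`unchanged_ids` is a name invented for this sketch, not from either answer):

```r
# Sketch: normalise each non-ID column (lower case, strip backslash escapes),
# then an ID is "unchanged" if its rows collapse to a single unique row.
# Single-row groups and all-NA duplicate rows pass this test automatically.
unchanged_ids <- function(df) {
  norm <- function(x) gsub("([\\]+)(n|r)|[\\]+", "", tolower(as.character(x)))
  ok <- tapply(seq_len(nrow(df)), df$ID, function(i)
    nrow(unique(as.data.frame(lapply(df[i, -1], norm)))) == 1)
  as.numeric(names(ok)[ok])
}
# unchanged_ids(df1)   # with the dput data above: 1 2 4 5
```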


How to order contingency table based on data order?

Given
Group ss
B male
B male
B female
A male
A female
X male
Then
tab <- table(res$Group, res$ss)
I want the Group column to be in the order B, A, X, as it is in the data. Currently it's in alphabetical order, which is not what I want. This is what I want:
MALE FEMALE
B 5 5
A 5 10
X 10 12
If you arrange the factor levels based on the order you want, you'll get the desired result.
res$Group <- factor(res$Group, levels = c('B', 'A', 'X'))
#If it is based on occurrence in Group column we can use
#res$Group <- factor(res$Group, levels = unique(res$Group))
table(res$Group, res$ss)
#Or just
#table(res)
# female male
# B 1 2
# A 1 1
# X 0 1
data
res <- structure(list(Group = structure(c(2L, 2L, 2L, 1L, 1L, 3L),
.Label = c("A", "B", "X"), class = "factor"), ss = structure(c(2L, 2L, 1L, 2L,
1L, 2L), .Label = c("female", "male"), class = "factor")),
class = "data.frame", row.names = c(NA, -6L))
unique returns the unique elements of a vector in the order they occur. A table can be reordered like any other structure by extracting its elements in the order you want. So if you pass the output of unique to [ , ], you'll get the table rows in the vector's order of occurrence.
tab <- table(res$Group, res$ss)[unique(res$Group),]

Vectorizing loop operation in R

I have a long-format balanced data frame (df1) that has 7 columns:
df1 <- structure(list(Product_ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,
3, 3, 3, 3), Product_Category = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1950, 1950, 1950, 1950, 1960,
1960, 1960, 1960, 1960, 1940, 1940, 1940, 1940, 1940), Control_Date = c(1961L,
1962L, 1963L, 1964L, 1965L, 1961L, 1962L, 1963L, 1964L, 1965L,
1961L, 1962L, 1963L, 1964L, 1965L), Country_Code = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Var1 = c(NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Var2 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)), row.names = c(NA,
15L), class = "data.frame")
Each Product_ID in this data set is linked with a unique Product_Category and Country_Code and Manufacture_Date, and is followed over time (Control_Date). Product_Category has two possible values (A or B); Country_Code and Manufacture_Date have 190 and 90 unique values, respectively. There are 400,000 unique Product_ID's, that are followed over a period of 50 years (Control_Date from 1961 to 2010). This means that df1 has 20,000,000 rows. The last two columns of this data frame are NA at the beginning and have to be filled using the data available in another data frame (df2):
df2 <- structure(list(Product_ID = 1:6, Product_Category = structure(c(1L,
2L, 1L, 1L, 1L, 2L), .Label = c("A", "B"), class = "factor"),
Manufacture_Date = c(1950, 1960, 1940, 1950, 1940, 2000),
Country_Code = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("ABC",
"DEF", "GHI"), class = "factor"), Year_1961 = c(5, NA, 10,
NA, 6, NA), Year_1962 = c(NA, NA, 4, 5, 3, NA), Year_1963 = c(8,
6, NA, 5, 6, NA), Year_1964 = c(NA, NA, 9, NA, 10, NA), Year_1965 = c(6,
NA, 7, 4, NA, NA)), row.names = c(NA, 6L), class = "data.frame")
This second data frame contains another type of information on the exact same 400,000 products, in wide-format. Each row represents a unique product (Product_ID) accompanied by its Product_Category, Manufacture_Date and Country_Code. There are 50 other columns (for each year from 1961 to 2010) that contain a measured value (or NA) for each product in each of those years.
Now what I would like to do is to fill in the Var1 & Var2 columns in the first data frame, by doing some calculation on the data available in the second data frame. More precisely, for each row in the first data frame (i.e. a product at Control_Date "t"), the last two columns are defined as follows:
Var1: total number of products in df2 with the same Product_Category, Manufacture_Date and Country_Code that have non-NA value in Year_t;
Var2: total number of products in df2 with different Product_Category but the same Manufacture_Date and Country_Code that have non-NA value in Year_t.
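For reference, the per-row definition above can be stated directly as a small base R helper (`count_nonNA` is a name invented for this sketch; far too slow to run over 20M rows, but it pins down the rule):

```r
# Sketch of the Var1/Var2 definitions for a single df1 row: count df2 products
# sharing (same = TRUE) or differing in (same = FALSE) Product_Category, with
# the same Manufacture_Date and Country_Code, that are non-NA in Year_t.
count_nonNA <- function(df2, cat, mdate, country, year, same = TRUE) {
  cat_match <- if (same) df2$Product_Category == cat else df2$Product_Category != cat
  rows <- cat_match & df2$Manufacture_Date == mdate & df2$Country_Code == country
  sum(!is.na(df2[rows, paste0("Year_", year)]))
}
# count_nonNA(df2, "A", 1940, "GHI", 1961)         # Var1 for Product_ID 3 in 1961
# count_nonNA(df2, "A", 1940, "GHI", 1961, FALSE)  # Var2 for the same row
```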
My initial solution with nested for-loops is as follows:
for (i in unique(df1$Product_ID)) {
  Category <- unique(df1[which(df1$Product_ID == i), "Product_Category"])
  Opposite_Category <- ifelse(Category == "A", "B", "A")
  Manufacture <- unique(df1[which(df1$Product_ID == i), "Manufacture_Date"])
  Country <- unique(df1[which(df1$Product_ID == i), "Country_Code"])
  ID_Similar_Product <- df2[which(df2$Product_Category == Category &
    df2$Manufacture_Date == Manufacture & df2$Country_Code == Country), "Product_ID"]
  ID_Quasi_Similar_Product <- df2[which(df2$Product_Category == Opposite_Category &
    df2$Manufacture_Date == Manufacture & df2$Country_Code == Country), "Product_ID"]
  for (j in unique(df1$Control_Date)) {
    df1[which(df1$Product_ID == i & df1$Control_Date == j), "Var1"] <-
      length(which(!is.na(df2[which(df2$Product_ID %in% ID_Similar_Product), paste0("Year_", j)])))
    df1[which(df1$Product_ID == i & df1$Control_Date == j), "Var2"] <-
      length(which(!is.na(df2[which(df2$Product_ID %in% ID_Quasi_Similar_Product), paste0("Year_", j)])))
  }
}
The problem with this approach is that it takes a lot of time to be run. So I would like to know if anybody could suggest a vectorized version that would do the job in less time.
See if this does what you want. I'm using the data.table package since you have a rather large (20M) dataset.
library(data.table)
setDT(df1)
setDT(df2)
# Set keys on the "triplet" to speed up everything
setkey(df1, Product_Category, Manufacture_Date, Country_Code)
setkey(df2, Product_Category, Manufacture_Date, Country_Code)
# Omit the Var1 and Var2 from df1
df1[, c("Var1", "Var2") := NULL]
# Reshape df2 to long form
df2.long <- melt(df2, measure=patterns("^Year_"))
# Split "variable" at the "_" to extract 4-digit year into "Control_Date" and delete leftovers.
df2.long[, c("variable","Control_Date") := tstrsplit(variable, "_", fixed=TRUE)][
, variable := NULL]
# Group by triplet, Var1=count non-NA in value, join with...
# (Group by doublet, N=count non-NA), update Var2=N-Var1.
df2_N <- df2.long[, .(Var1 = sum(!is.na(value))),
by=.(Product_Category, Manufacture_Date, Country_Code)][
df2.long[, .(N = sum(!is.na(value))),
by=.(Manufacture_Date, Country_Code)],
Var2 := N - Var1, on=c("Manufacture_Date", "Country_Code")]
# Update join: df1 with df2_N
df1[df2_N, c("Var1","Var2") := .(i.Var1, i.Var2),
on = .(Product_Category, Manufacture_Date, Country_Code)]
df1
Product_ID Product_Category Manufacture_Date Control_Date Country_Code Var1 Var2
1: 3 A 1940 1961 GHI 4 0
2: 3 A 1940 1962 GHI 4 0
3: 3 A 1940 1963 GHI 4 0
4: 3 A 1940 1964 GHI 4 0
5: 3 A 1940 1965 GHI 4 0
6: 1 A 1950 1961 ABC 6 0
7: 1 A 1950 1962 ABC 6 0
8: 1 A 1950 1963 ABC 6 0
9: 1 A 1950 1964 ABC 6 0
10: 1 A 1950 1965 ABC 6 0
11: 2 B 1960 1961 DEF NA NA
12: 2 B 1960 1962 DEF NA NA
13: 2 B 1960 1963 DEF NA NA
14: 2 B 1960 1964 DEF NA NA
15: 2 B 1960 1965 DEF NA NA
df2
Product_ID Product_Category Manufacture_Date Country_Code Year_1961 Year_1962 Year_1963 Year_1964 Year_1965
1: 5 A 1940 DEF 6 3 6 10 NA
2: 3 A 1940 GHI 10 4 NA 9 7
3: 1 A 1950 ABC 5 NA 8 NA 6
4: 4 A 1950 ABC NA 5 5 NA 4
5: 2 B 1940 DEF NA NA 6 NA NA
6: 6 B 2000 GHI NA NA NA NA NA

Mapping values across a dataframe

I have a large dataset. The example below is a much abbreviated version.
There are two dataframes, df1 and df2. I would like to map a derived value onto each row of df1, using conditions from df2 with arguments from df1. I hope the example below makes this clearer.
year <- rep(1996:1997, each=3)
age_group <- rep(c("20-24","25-29","30-34"),2)
df1 <- as.data.frame(cbind(year,age_group))
df1 is a database with all permutations of year and age group.
df2 <- as.data.frame(rbind(c(111,1997,"20-24"),c(222,1997,"30-34")))
names(df2) <- c("id","year","age.group")
df2 is a database where each row represents an individual at a particular year
I would like to use arguments from df1, conditional on values from df2, and map the result back onto df1. My attempt is as follows:
each_yr <- map(df1, function(year,age_group) case_when(
as.character(df1$year) == as.character(df2$year) & as.character(df1$age_group)
== as.character(df2$age.group)~ 0,
TRUE ~ 1))
The output I get is wrong, and is shown below:
structure(list(year = c(1, 1, 1, 1, 1, 0), age_group = c(1, 1,
1, 1, 1, 0)), .Names = c("year", "age_group"))
The output I would ideally like is something like this (shown as a dataframe, but I would be happy with a list):
structure(list(year = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("1996",
"1997"), class = "factor"), age_group = structure(c(1L, 2L, 3L,
1L, 2L, 3L), .Label = c("20-24", "25-29", "30-34"), class = "factor"),
v1 = structure(c(2L, 2L, 2L, 1L, 2L, 2L), .Label = c("0",
"1"), class = "factor"), v2 = structure(c(2L, 2L, 2L, 2L,
2L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("year",
"age_group", "v1", "v2"), row.names = c(NA, -6L), class = "data.frame")
I have used map before when 'df1' is a vector but in this scenario it is a dataframe where both columns are used as arguments. Can Map handle this?
In df3 the column v1 is the result of conditions based on df1 and df2 and then mapped to df1 for patient '111'. Likewise column v2 is the outcome for patient '222'.
Thanks in advance
Looks like some work for pmap instead. And a touch of tidyr to get the suggested result.
purrr::pmap(list(df2$id,as.character(df2$year),as.character(df2$age.group)),
function(id,x,y)
data.frame(df1,
key=paste0("v",id),
value=1-as.integer((x==df1$year)&(y==df1$age_group)),
stringsAsFactors=FALSE
)) %>%
replyr::replyr_bind_rows() %>% tidyr::spread(key,value)
# year age_group v1 v2
#1 1996 20-24 1 1
#2 1996 25-29 1 1
#3 1996 30-34 1 1
#4 1997 20-24 0 1
#5 1997 25-29 1 1
#6 1997 30-34 1 0
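Since the question asks whether Map can handle this: yes. Here is a base R sketch building one indicator column per row of df2 (`flag_matches` is an invented name; the v1/v2 column names follow the expected output, and df1/df2 are assumed to be as defined in the question):

```r
# One indicator column per df2 row: 0 where (year, age group) match that
# individual's record, 1 everywhere else.
flag_matches <- function(df1, df2) {
  cols <- Map(function(yr, ag) as.integer(!(df1$year == yr & df1$age_group == ag)),
              as.character(df2$year), as.character(df2$age.group))
  cbind(df1, setNames(as.data.frame(cols), paste0("v", seq_along(cols))))
}
# flag_matches(df1, df2)
```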
Within the tidyverse you can do it this way:
library(tidyverse)
#library(dplyr)
#library(tidyr)
df2 %>%
mutate(tmp = 0) %>%
spread(id, tmp, fill = 1, sep = "_") %>%
right_join(df1, by = c("year", "age.group" = "age_group")) %>%
mutate_at(vars(-c(1, 2)), coalesce, 1)
# year age.group id_111 id_222
# 1 1996 20-24 1 1
# 2 1996 25-29 1 1
# 3 1996 30-34 1 1
# 4 1997 20-24 0 1
# 5 1997 25-29 1 1
# 6 1997 30-34 1 0
#Warning messages:
# 1: Column `year` joining factors with different levels, coercing to character vector
# 2: Column `age.group`/`age_group` joining factors with different levels, coercing to
# character vector

Replace 0 when it is the first observation for a factor level in R

I have this sample:
data <- structure(list(mmsi = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), .Label = c("a", "b"), class = "factor"),
tr = c(1, 1, 1, 0, 2, 2, 0, 4, 4, 0, 5, 5)), .Names = c("mmsi",
"tr"), row.names = c(NA, -12L), class = "data.frame")
I want to replace each 0 in the column tr with the previous value of tr, for each mmsi.
This loop works well on the sample:
library(zoo)
for (i in levels(data$mmsi)) {
  data$test <- na.locf(with(data, { is.na(tr) <- tr == 0; tr }), fromLast = FALSE)
}
But when I try it on a bigger sample, an issue appears: if the first value is 0, I get an error (because there is no previous value to carry forward...).
For example if I edit the small sample with
data <- structure(list(mmsi = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L), .Label = c("a", "b"), class = "factor"),
tr = c(0, 1, 1, 0, 2, 2, 0, 4, 4, 0, 5, 5)), .Names = c("mmsi",
"tr"), row.names = c(NA, -12L), class = "data.frame")
The column tr now begins with 0 instead of the 1 in the previous sample. If I apply the same loop
for (i in levels(data$mmsi)) {
  data$test <- na.locf(with(data, { is.na(tr) <- tr == 0; tr }), fromLast = FALSE)
}
then of course I get the error
Error in `$<-.data.frame`(`*tmp*`, "test", value = c(1, 1, 1, 2, 2, 2, :
  replacement has 11 rows, data has 12
because the loop could not replace the value I changed (the first value in the column tr).
I guess my function needs one more step that first handles a 0 occurring at the start of a group, replacing it with the following non-zero value. Then the rest of the function is fine.
The output I am looking for this new column is:
data$test
[1] 1 1 1 1 2 2 2 4 4 4 5 5
Any idea how to get this?
We can do this with one of the group-by approaches. Convert the 'data.frame' to a 'data.table' (setDT(data)); grouped by 'mmsi', replace the 0 values with NA and apply na.locf (from zoo) with na.rm = FALSE; then do a second na.locf with fromLast = TRUE to replace a leading 0 (now NA) with the next value.
library(data.table)
library(zoo)
setDT(data)[, test := na.locf(na.locf(replace(tr, tr==0, NA),
na.rm=FALSE), fromLast=TRUE), by = mmsi]
data
# mmsi tr test
# 1: a 0 1
# 2: a 1 1
# 3: a 1 1
# 4: a 0 1
# 5: a 2 2
# 6: a 2 2
# 7: a 0 2
# 8: b 4 4
# 9: b 4 4
#10: b 0 4
#11: b 5 5
#12: b 5 5
We could also do this without na.locf. Here NA^!tr is NA where tr == 0 and 1 elsewhere, so (NA^!tr) * tr turns the zeros into NA; shift() lags that vector, pmax(tr, ..., na.rm = TRUE) then falls back to the previous value where tr is 0, and the outer pmax(..., 1) covers a leading zero with no previous value, which in this example should become 1.
setDT(data)[, test := pmax(pmax(tr, shift((NA^!tr) * tr), na.rm = TRUE), 1), mmsi]

Ordered values - select first instance of lowest value, then first instance of next lowest subsequent value and so on

I have a dataframe with many distinct UniqueIDs, ordered by date: each UniqueID is sorted from its oldest date to its newest. We also have a column called Step with values from 1 to 4.
The goal, for each UniqueID, is to find the oldest instance of the first step, then the oldest instance of the second step, and so on. Some steps may be missing; for instance, step 3 is missing for UniqueID = "B". In that case we skip over step 3 and move on to step 4.
Here is the original dataframe.
UniqueID Date Step
1 A 2015-07-03 2
2 A 2015-07-07 3
3 A 2015-07-09 1
4 A 2015-07-14 4
5 A 2015-07-17 1
6 A 2015-07-20 2
7 A 2015-07-23 2
8 A 2015-07-24 3
9 A 2015-07-29 3
10 B 2015-06-01 3
11 B 2015-06-15 2
12 B 2015-06-22 1
13 B 2015-06-29 4
14 B 2015-07-13 2
15 B 2015-06-22 2
16 B 2015-07-08 2
17 B 2015-07-27 4
The valid entries we want to select are observations 3, 6, 8, 12, 14 and 17, creating this dataframe:
UniqueID Date Step
3 A 2015-07-09 1
6 A 2015-07-20 2
8 A 2015-07-24 3
12 B 2015-06-22 1
14 B 2015-07-13 2
17 B 2015-07-27 4
I have the logic and some pseudo code but can't put it together. So in the example data frame for UniqueID = "A" we would first group the dataframe:
group_by(UniqueID)
Then find the lowest value for UniqueID = "A" and assign it to a variable:
v <- min(Step) # returns 1
Then take the index of this step:
i <- which.min(Step) # returns 3
We then want to find the smallest Step that is greater than the one just found, searching only the elements that occur after it. So now we look for values of Step > 1, and only from the position of the first match onward — in this case from observation 3. We keep repeating this for each UniqueID until we either reach the last observation, or can no longer find a greater value among the remaining elements.
Here is the dput for creating the example dataframe:
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A",
"B"), class = "factor"), Date = structure(c(16619, 16623, 16625,
16630, 16633, 16636, 16639, 16640, 16645, 16587, 16601, 16608,
16615, 16629, 16608, 16624, 16643), class = "Date"), Step = c(2,
3, 1, 4, 1, 2, 2, 3, 3, 3, 2, 1, 4, 2, 2, 2, 4)), .Names = c("UniqueID",
"Date", "Step"), row.names = c(NA, -17L), class = "data.frame")
Alternative dput which crashes using jeremycg's method.
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 8L, 8L, 9L, 9L, 10L, 11L), .Label = c("A","B",
"C","D","E","F","G","H","I","J","K"),
class = "factor"), Date = c("3/08/2015",
"21/07/2015", "7/07/2015", "7/07/2015", "29/07/2015", "29/07/2015",
"29/06/2015", "13/07/2015", "9/07/2015", "29/07/2015", "24/07/2015",
"2/07/2015", "16/07/2015", "18/06/2015", "8/07/2015", "29/07/2015",
"12/06/2015", "27/07/2015"), Step = c(1, 1, 4, 4, 4, 3,
5, 5, 1, 4, 1, 2, 2, 2, 3, 3, 2, 2)), .Names = c("UniqueID",
"Date", "Step"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -18L))
Edit: dput of UniqueID that continues to crash even using updated code from jeremycg :
structure(list(UniqueID = structure(c(1L, 1L, 1L, 1L, 1L, 1L ), .Label = c("A" ), class = "factor"), Date = structure(c(16619, 16623, 16625, 16630, 16633, 16636), class = "Date"), Step = c(1, 5, 5, 1, 1, 1)), .Names = c("UniqueID", "Date", "Step"), row.names = c(NA, -6L), class = "data.frame")
Pretty inefficient, but working.
First define a function:
myseq <- function(df) {
  if (which.min(df$Step) == nrow(df)) {
    return(list(df[nrow(df), ]))
  }
  store <- vector(mode = "list", length = nrow(df))
  i <- 1
  while (any(!is.na(df$Step))) {
    store[[i]] <- df[which.min(df$Step), ]
    df <- df[which.min(df$Step):nrow(df), ]
    # na.rm = TRUE is needed once NAs appear, or min() returns NA and the loop never ends
    df$Step[df$Step == min(df$Step, na.rm = TRUE)] <- NA
    i <- i + 1
  }
  store
}
Then apply it to each group using dplyr:
library(dplyr)
dta %>% group_by(UniqueID) %>%
do(do.call(rbind, myseq(.)))
Source: local data frame [6 x 3]
Groups: UniqueID
UniqueID Date Step
1 A 2015-07-09 1
2 A 2015-07-20 2
3 A 2015-07-24 3
4 B 2015-06-22 1
5 B 2015-07-13 2
6 B 2015-07-27 4
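The scan described in the question can also be written as a small base R helper (`pick_steps` is a name invented here) that walks each group once over the Step vector instead of repeatedly subsetting the data frame, which also avoids the infinite loop on inputs like Step = 1, 5, 5, 1, 1, 1:

```r
# For one group's Step vector: repeatedly take the first occurrence of the
# smallest value that is greater than the last picked value, searching only
# positions after the previous pick. Returns the picked row positions.
pick_steps <- function(step) {
  picks <- integer(0)
  pos <- 1
  last <- -Inf
  repeat {
    ok <- which(step > last)
    ok <- ok[ok >= pos]
    if (length(ok) == 0) break
    p <- ok[which.min(step[ok])]  # first occurrence of the smallest eligible step
    picks <- c(picks, p)
    pos <- p + 1
    last <- step[p]
  }
  picks
}

# Apply per UniqueID (assuming the example data is in df) and stitch together:
# do.call(rbind, lapply(split(df, df$UniqueID),
#                       function(g) g[pick_steps(g$Step), ]))
```

On the example data this selects observations 3, 6, 8, 12, 14 and 17, and on the problematic single-group dput (Step = 1, 5, 5, 1, 1, 1) it returns positions 1 and 2 rather than hanging.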
