Random row selection in R

I have this data frame:
id <- c(1, 1, 1, 2, 2, 3)
name <- c("A", "A", "A", "B", "B", "C")
value <- 7:12
df <- data.frame(id = id, name = name, value = value)
df
This function selects n random rows from it:
randomRows <- function(df, n) {
  df[sample(nrow(df), n), ]
}
For example:
randomRows(df, 1)
But I want to randomly select one row per 'name' (or per 'id', which amounts to the same thing) and bind the selected rows into a new table, so in this case three rows. This has to work on a data frame of 2000+ rows. How can I do this?

I think you can do this with the plyr package:
library("plyr")
ddply(df,.(name),randomRows,1)
which gives you for example:
id name value
1 1 A 8
2 2 B 11
3 3 C 12
Is this what you are looking for?

Here's one way of doing it in base R.
> df.split <- split(df, df$name)
> df.sample <- lapply(df.split, randomRows, 1)
> df.final <- do.call("rbind", df.sample)
> df.final
id name value
A 1 A 7
B 2 B 11
C 3 C 12
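For completeness, the same per-group sampling can be written with dplyr; a minimal sketch, assuming dplyr >= 1.0.0 (where slice_sample() was introduced):

```r
# Draw one random row per 'name' group with dplyr
library(dplyr)

df <- data.frame(id    = c(1, 1, 1, 2, 2, 3),
                 name  = c("A", "A", "A", "B", "B", "C"),
                 value = 7:12)

sampled <- df %>%
  group_by(name) %>%
  slice_sample(n = 1) %>%   # one random row from each group
  ungroup()
sampled   # three rows, one per name
```

This scales fine to a few thousand rows and avoids writing the split/apply/bind steps by hand.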

Related

looping within a variable in panel data using loop in R

I have panel data like id <- c(1,1,1,1,1,1,1,2,2,2,2,2), intm <- c(1,1,0,0,1,0,0,0,0,0,1,1). The data frame is
dta <- data.frame(cbind(id, intm)), which gives:
id intm
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 1 0
7 1 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 1
I would like to replace the subsequent values of "intm" variable by the first value within the ID variable. That is for ID=1, the first value is 1, so the intm should have all values as 1 and for ID=2, intm should have all values as 0. The data should be like
id <- c(1,1,1,1,1,1,1,2,2,2,2,2),intm <- c(1,1,1,1,1,1,1,0,0,0,0,0) with data frame
dta <- data.frame(cbind(id,intm))
How can I do this in R by looping or any other means? I have a big data set.
Consider the following code:
new_column <- c(); i <- 1  # new column to be created
# loop over the unique values of id
for (j in unique(dta$id)) {
  index <- which(dta$id == j)    # row indices where id == j
  value <- dta$intm[index[1]]    # intm value on the first of those rows
  # repeat that value for every row of this id
  new_column[i:tail(index, n = 1)] <- rep(value, nrow(dta[dta$id == j, ]))
  i <- tail(index, n = 1) + 1    # the next block starts after the last index
}
dta <- cbind(dta, new_column)
Alternatively, you can use the subset() function, i.e
rep(value,nrow(subset(dta,dta$id==j)))
You can use dplyr to do this.
There are other fancier ways to do this but I think dplyr is more graceful.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
df = data.frame(id, intm)
library(dplyr)
df2 <- df %>% group_by(id) %>% do({
  .$intm <- .$intm[1]  # the scalar is recycled across the group's column
  .
})
You can also try data.table library which shall be faster.
BTW: you do not need cbind to make a data.frame.
Consider ave + head
dta$intm <- with(dta, ave(intm, id, FUN=function(x) head(x, 1)))
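Since data.table was mentioned above without code, here is a minimal sketch of the same fill-with-first-value idea (assumes the data.table package is installed):

```r
# intm[1] (the first value within each id group) is recycled
# across all rows of that group, updating the column by reference
library(data.table)

dta <- data.table(id   = c(1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                  intm = c(1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1))
dta[, intm := intm[1], by = id]
dta$intm   # 1 1 1 1 1 1 1 0 0 0 0 0
```

Because `:=` modifies the table in place, this avoids copying and tends to be the fastest option on big data sets.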

How to get maximum value for a list in a data frame in R

I am trying to create a new column that holds the maximum value of a list column in a data frame. How can I create this column, maxvalue, from the df$value column, i.e. take the max over the list stored in each row?
x <- c( "000010011100011111001111111100", "011110", "0000000")
y<- c(1, 2,3)
df<- data.frame(x,y)
library(stringr)
df$value <- strsplit(as.character(df$x), "[^1]+", perl=TRUE)
# expected output ( I have tried the following)
df$maxvalue<- max(df$value)
df$maxvalue
8
4
0
this should do the trick
df$value <- lapply(lapply(strsplit(as.character(df$x),"[^1]+"), nchar),max)
output:
> df
x y value
1 000010011100011111001111111100 1 8
2 011110 2 4
3 0000000 3 0
Simplified version of #Daniel O's logic:
df$value <- sapply(strsplit(as.character(df$x),"[^1]+"), function(x){max(nchar(x))})
We can also use rle() together with charToRaw(): 0x31 is the raw code of the character "1", so the longest run of that byte is the longest run of 1s (note that a string with no 1s hits max() on an empty vector, which warns and returns -Inf):
sapply(as.character(df$x), function(x)
  with(rle(charToRaw(x)), max(lengths[as.character(values) == 31])))
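Another base R route, a sketch using gregexpr(): match each maximal run of 1s and take the longest match length. gregexpr() reports a match length of -1 when there is no match, so pmax(..., 0L) maps the all-zeros string to 0.

```r
x <- c("000010011100011111001111111100", "011110", "0000000")

# For each string, find all runs of 1s and take the longest run length
longest_run <- sapply(gregexpr("1+", x), function(m)
  max(pmax(attr(m, "match.length"), 0L)))
longest_run   # 8 4 0
```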

Removing rows in data.frame having columns subsumed in others

I am trying to achieve something similar to unique on a data.frame in which each element of one column is a vector. What I want: if the vector of items in one row is a subset of (or equal to) the items in another row, remove the row with the smaller number of elements. I can achieve this with a nested for loop, but since the data contains 400,000 rows the program is very inefficient.
Sample data
# Set the seed for reproducibility
set.seed(42)
# Create a random data frame
mydf <- data.frame(items = rep(letters[1:4], length.out = 20),
grps = sample(1:5, 20, replace = TRUE),
supergrp = sample(LETTERS[1:4], replace = TRUE))
# Aggregate items into a single column
temp <- aggregate(items ~ grps + supergrp, mydf, unique)
# Arrange by number of items for each grp and supergroup
indx <- order(lengths(temp$items), decreasing = T)
temp <- temp[indx, ,drop=FALSE]
Temp looks like
grps supergrp items
1 4 D a, c, d
2 3 D c, d
3 5 D a, d
4 1 A b
5 2 A b
6 3 A b
7 4 A b
8 5 A b
9 1 D d
10 2 D c
Now you can see that the supergrp and items combinations in the second and third rows are contained in the first row, so I want to delete the second and third rows from the result. Similarly, rows 5 to 8 are contained in row 4. Finally, rows 9 and 10 are also contained in the first row, so I want to delete rows 9 and 10 too.
Hence, my result would look like:
grps supergrp items
1 4 D a, c, d
4 1 A b
My implementation is as follows:
# Initialise the result data frame with the first row of the old data
newdf <- temp[1, ]
# For all rows in the original data
for (i in 1:nrow(temp)) {
  # Flag: is this row absent from the new data?
  indx <- TRUE
  # Check if the items in the original data appear in the new data
  for (j in 1:nrow(newdf)) {
    if (all(c(temp$supergrp[[i]], temp$items[[i]]) %in%
            c(newdf$supergrp[[j]], newdf$items[[j]]))) {
      # Set indx to FALSE if a row with the same items and supergroup
      # as the old data is found in the new data
      indx <- FALSE
    }
  }
  # If no row in the new data contains this row's items and supergroup, append it
  if (indx) {
    newdf <- rbind(newdf, temp[i, ])
  }
}
I believe there is an efficient way to implement this in R, maybe using the tidy framework and dplyr chains, but I am missing the trick. Apologies for a longish question. Any input would be highly appreciated.
I would try to get the items out of the list column and store them in a longer data frame. Here is my somewhat hacky solution (this also needs purrr, dplyr and tidyr, not just stringr):
library(stringr)
library(purrr)
library(dplyr)
library(tidyr)
items <- temp$items %>%
  map(~str_split(., ",")) %>%
  map_df(~data.frame(.))
out <- bind_cols(temp[, c("grps", "supergrp")], items)
out %>%
  gather(item_name, item, -grps, -supergrp) %>%
  select(-item_name, -grps) %>%
  unique() %>%
  filter(!is.na(item))
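For the subsumption step itself, here is a base R sketch on a small hand-built temp that mirrors the structure above (grps, supergrp, and a list column items; the table is assumed sorted by decreasing item count, as in the question). It checks supergrp equality directly rather than mixing it into the %in% test: row i is dropped when some other row j in the same supergrp contains all of row i's items and is strictly larger, or is the same set appearing earlier.

```r
# Hand-built stand-in for the aggregated 'temp' table from the question
temp <- data.frame(grps = c(4, 3, 1, 2),
                   supergrp = c("D", "D", "A", "A"),
                   stringsAsFactors = FALSE)
temp$items <- list(c("a", "c", "d"), c("c", "d"), "b", "b")

keep <- rep(TRUE, nrow(temp))
for (i in seq_len(nrow(temp))) {
  for (j in seq_len(nrow(temp))) {
    if (i == j) next
    same_grp <- temp$supergrp[j] == temp$supergrp[i]
    contains <- all(temp$items[[i]] %in% temp$items[[j]])
    # strictly larger set, or an identical set that appears earlier
    larger <- length(temp$items[[j]]) > length(temp$items[[i]]) ||
      (length(temp$items[[j]]) == length(temp$items[[i]]) && j < i)
    if (same_grp && contains && larger) keep[i] <- FALSE
  }
}
result <- temp[keep, ]   # keeps rows 1 (a,c,d / D) and 3 (b / A)
```

This is still quadratic in the number of rows, but it avoids growing newdf with rbind() inside the loop, which is usually the dominant cost.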

creating a summary data frame from long formated data

Following this worked example:
case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
ID <- c('aa','bb','zz','aa','cc','ee','ff','gg','kk','aa','kk','cc','dd')
score <- c(1,1,3,4,2,3,2,2,1,1,3,3,2)
df1 <- data.frame(case, ID, score)
identifier <- c('aa','bb','ff')
For each unique case (that is a, b, c, d, ...), I want to scan the ID column and count how often its values appear in identifier.
So for the 3 rows where case == a, how many times does ID match identifier? (Here, 2 times.)
For the 2 rows where case == b, we count the same way. (Here, 1 time.)
We do this for all unique cases.
I have used the following command, but it works on the whole sample, not separately per unique case:
df1$ID %in% identifier
And what I want as a end result is a table, with one column with each unique case and a second column with the number of times ID and identifier were equal.
So I want to loop/automate the process and return a similar output like:
data.frame(c('a','b','c','d','e'), c(2,1,1,1,0))
You can use tapply():
tapply(df1$ID, df1$case, FUN = function(id) sum(id %in% identifier))
a b c d e
2 1 1 1 0
but as #Jaap pointed out, you can use aggregate() to get a data.frame:
aggregate(ID ~ case, data = df1, FUN = function(id) sum(id %in% identifier))
case ID
1 a 2
2 b 1
3 c 1
4 d 1
5 e 0
And if you want more grouping you can do:
df <- aggregate(ID ~ case+(score>1), data = df1, FUN = function(id) sum(id %in% identifier))
df[df$`score > 1`,c(1,3)]
case ID
4 a 0
5 b 1
6 c 1
7 d 0
8 e 0
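The same per-case count can also be written with dplyr; a sketch, assuming the dplyr package is installed:

```r
library(dplyr)

case <- c('a','a','a','b','b','c','c','c','c','d','d','e','e')
ID <- c('aa','bb','zz','aa','cc','ee','ff','gg','kk','aa','kk','cc','dd')
df1 <- data.frame(case, ID)
identifier <- c('aa','bb','ff')

# Count, within each case, how many IDs appear in 'identifier'
counts <- df1 %>%
  group_by(case) %>%
  summarise(n = sum(ID %in% identifier))
counts   # a 2, b 1, c 1, d 1, e 0
```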

How to label ties when creating a variable capturing the most frequent occurence of a group?

In the following example, how do I ask R to identify a tie as "tie" when I want to determine the most frequent value within a group?
I am basically following on from a previous question that used which.max or which.is.max and a custom function (Create a variable capturing the most frequent occurence by group), but I want to acknowledge the ties as a tie. Any ideas?
df1 <- data.frame(
  id = c(rep(1, 3), rep(2, 3)),
  v1 = as.character(c("a", "b", "b", rep("c", 3)))
)
I want to create a third variable freq that contains the most frequent observation in v1 by id, but also identifies ties as "tie".
From previous answers, this code works to create the freq variable, but just doesn't deal with the ties:
myFun <- function(x) {
  tbl <- table(x$v1)
  x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  x
}
ddply(df1, .(id), .fun = myFun)
You could slightly modify your function by testing if the maximum count occurs more than once. This happens in sum(tbl == max(tbl)). Then proceed accordingly.
df1 <- data.frame(
  id = rep(1:2, each = 4),
  v1 = rep(letters[1:4], c(2, 2, 3, 1))
)
myFun <- function(x) {
  tbl <- table(x$v1)
  nmax <- sum(tbl == max(tbl))
  if (nmax == 1)
    x$freq <- rep(names(tbl)[which.max(tbl)], nrow(x))
  else
    x$freq <- "tie"
  x
}
ddply(df1, .(id), .fun = myFun)
id v1 freq
1 1 a tie
2 1 a tie
3 1 b tie
4 1 b tie
5 2 c c
6 2 c c
7 2 c c
8 2 d c
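The same tie-aware rule can be written without plyr; a base R sketch using ave() (the helper name mode_or_tie is my own, not from the question):

```r
df1 <- data.frame(id = rep(1:2, each = 4),
                  v1 = rep(letters[1:4], c(2, 2, 3, 1)),
                  stringsAsFactors = FALSE)

# Most frequent value of a vector, or "tie" when the max count is shared
mode_or_tie <- function(x) {
  tbl <- table(x)
  if (sum(tbl == max(tbl)) > 1) "tie" else names(tbl)[which.max(tbl)]
}

# ave() applies the function within each id group and recycles the result
df1$freq <- ave(df1$v1, df1$id,
                FUN = function(x) rep(mode_or_tie(x), length(x)))
df1$freq   # "tie" for id 1, "c" for id 2
```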
