How to Copy and Insert rows into same dataframe based on count - r

I have a data frame that looks like this (with far fewer variables than the original data I need to work with):
woe <- c('1:woe', '2:woe', '3:woe', '4:woe', '5:woe')
svi <- c('stated','verified','verified','stated','stated')
fico_avg <- ceiling(runif(5,750, 780))
count <- c(8,12,34,24,7)
df <- data.frame(cbind(woe,svi,fico_avg,count))
woe    svi       fico_avg  count
1:woe  stated    771       8
2:woe  verified  759       12
3:woe  verified  752       34
4:woe  stated    776       24
5:woe  stated    767       7
I would like to create a dataset in which the first row is repeated 8 times (filling the first 8 rows), the second row 12 times, the third row 34 times, and so on, depending on the value of the variable 'count'. I looked at the function InsertRow() in the DataCombine package, but InsertRow() requires RowNum as one of its arguments to insert a new row, and RowNum changes as I insert new rows into the frame. The basic idea is to extract each row from the original data frame, copy it x times (if count = x), and finally row-bind all those rows into one frame. Any help is appreciated. Thanks in advance.

If your dataset is large, this should probably be quicker:
df <- data.frame(woe,svi,fico_avg,count)
df[rep(seq.int(1,nrow(df)), df$count),]
Works.

Try:
outdf = df
outdf = outdf[-c(1:nrow(outdf)),]   # empty copy with the same columns
for(i in 1:nrow(df)){
  # append row i once for every unit of its count
  for(j in 1:df[i,]$count) outdf[nrow(outdf)+1,]= df[i,]
}
outdf
You should use:
df <- data.frame(woe,svi,fico_avg,count)
rather than
df <- data.frame(cbind(woe,svi,fico_avg,count))
There is no need for cbind here. cbind first coerces all four vectors into a single character matrix, so count ends up as a factor (or as a character vector in R 4.0 and later) instead of numeric.

Try this:
df_long <- df[rep(1:nrow(df), df$count), ]
Hope it helps
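A minimal sketch of the rep()-based approach from the answers above, using the values shown in the question (fico_avg is hard-coded here instead of drawn with runif, so the result is reproducible). When times is a vector of the same length as the index, rep() repeats each index that many times.

```r
# Build the data frame directly, without cbind(), so count stays numeric.
df <- data.frame(
  woe = c("1:woe", "2:woe", "3:woe", "4:woe", "5:woe"),
  svi = c("stated", "verified", "verified", "stated", "stated"),
  fico_avg = c(771, 759, 752, 776, 767),
  count = c(8, 12, 34, 24, 7),
  stringsAsFactors = FALSE
)

# Repeat row index i exactly df$count[i] times, then subset by those indices.
df_long <- df[rep(seq_len(nrow(df)), times = df$count), ]
rownames(df_long) <- NULL  # optional: replace names like "1.1", "1.2" with 1..n

nrow(df_long)        # 85, the sum of count
table(df_long$woe)   # each woe label appears count times
```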

Related

R conditionally bind rows in loop

I have a double for loop over a data frame and its rows. I'm applying some calculations for each row of the data frame (which represent different batteries and therefore all vary in their values). In the end I want to check if a row (e.g. a battery) fits the criteria. If it does, I want to put it in a new df gathering all batteries that fit the criteria.
df1 <- as.data.frame(matrix("values",nrow=24,ncol=19))
df2 <- as.data.frame(matrix("values",nrow=2976,ncol=22))
df3 <- df1[0,] #empty df of the same structure as df1
What I'm doing:
for(i in 1:nrow(df1)){
for(j in 1:nrow(df2)){
# some calculations giving me a result what the necessary capacity "nc" is
...
So far it works all right. What I want to do then is check whether the result for each row in df1 (e.g. the necessary battery capacity) is bigger than a condition "con":
...
con <- df1[i,4]
nc <- max(df2[[20]]) # defining the necessary capacity
if(con > nc){
newdf <- bind_rows(df3,df1[i,])
}
}
}
I expect newdf to have 0 to max 24 rows. According to the real data I should get 11 entries. What I got was 1 (that was the last row of df1) or some more than 30000 entries. So this is not working as expected. Any ideas? Thanks!
I think you forgot to update df3, which is why you only get one row: you always bind one row to the empty df3 data.frame, so newdf is overwritten on every match. Your code works otherwise. You should change the line
newdf <- bind_rows(df3,df1[i,])
into
df3 <- bind_rows(df3,df1[i,])
This might be very slow, however, so I suggest you use vectorization, for example as suggested in the comments by @Dave2i: newdf <- df[df[,4] > nc, ]
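To illustrate the difference, here is a small sketch on made-up data. The random values and the fixed threshold nc are assumptions for the example only (in the question nc is computed inside the loop), and base rbind stands in for dplyr's bind_rows so that the snippet has no dependencies.

```r
# Hypothetical data: 24 batteries, column 4 holds the value compared against nc.
set.seed(42)
df1 <- as.data.frame(matrix(runif(24 * 19, 0, 100), nrow = 24, ncol = 19))
nc <- 50  # assumed constant threshold, for illustration only

# Loop version: append to the SAME accumulator on every match.
acc <- df1[0, ]
for (i in seq_len(nrow(df1))) {
  if (df1[i, 4] > nc) {
    acc <- rbind(acc, df1[i, ])
  }
}

# Vectorized version: one logical index, no loop.
vec <- df1[df1[, 4] > nc, ]

# Both select the same rows.
nrow(acc) == nrow(vec)
all(acc[[4]] == vec[[4]])
```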

R: delete columns from data.frame if condition fulfilled

I have got a data.frame with approx. 20,000 columns. From this data.frame I want to remove the columns for which the following vector has a value of 1.
u.snp <- apply(an[25:19505], 2, mean)
I am sure there must be a straightforward way to accomplish this but can't see it right now. Any hints would be greatly appreciated. Thanks.
Update: Thanks for your help. Now I tried the following:
cm <- colMeans(an.mdr[25:19505])
tail(sort(cm), n=40)
With the tail function I see that 22 columns out of 19481 columns of an.mdr have mean=1. Next I remove these columns using the code as suggested.
an.mdr.s <- an.mdr
an.mdr.s[colMeans(an.mdr.s[25:19505])==1] <- NULL
As anticipated an.mdr.s has 22 columns less than an.mdr. But when I calculate the column means for all but the first 24 columns I again have 22 columns with column mean=1 in an.mdr.s.
cmm <- colMeans(an.mdr.s[25:19483])
tail(sort(cmm), n=40)
Honestly, I cannot see what is going on here right now.
That should be quite easily accomplished with the following command:
df[colMeans(df)==1] <- NULL
You can do it in two simple steps (df is your data frame):
# step 1 - calculate mean for all columns and filter with mean = 1
remove_columns <- sapply(df, mean)
remove_columns <- names(remove_columns[remove_columns == 1])
# alternate using filter (just for knowledge)
## remove_columns <- names(Filter(function(x) x == 1,sapply(df, mean)))
# step 2 - remove them
df_new <- df[,setdiff(names(df), remove_columns)]
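A likely explanation for the behaviour described in the update: the logical vector is computed on columns 25 onwards only, but it is then used to index the whole data frame, so position k of the vector refers to column k rather than column 24 + k. The right number of columns is removed, but not the right ones. The sketch below shows one way to keep the index aligned; the toy data frame and the name first are made up for the example.

```r
# Toy data: two leading ID columns, then data columns starting at position 3.
df <- data.frame(id = 1:3, grp = c(5, 6, 7),
                 a = c(1, 1, 1), b = c(0, 2, 4), c = c(1, 1, 1))
first <- 3  # hypothetical first data column (25 in the question)

# Means are computed on the data columns only, so this has length 3, not 5.
is_one <- colMeans(df[first:ncol(df)]) == 1

# Build a full-length keep vector so positions line up with df's columns.
keep <- rep(TRUE, ncol(df))
keep[first:ncol(df)] <- !is_one

df_new <- df[keep]
names(df_new)  # "id" "grp" "b"
```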

How to know if the value of a column is part of another column's value in R data.table

I have a data.table with a few customers, a day value and a pay_day value.
pay_day is a vector of length 5 for each customer and consists of day values.
I want to check, for each day value, whether that day is part of the pay_day vector.
Here is some dummy data (pardon the messy way of creating it; I could not think of a better way at the moment):
customers <- c("179288" ,"146506" ,"202287","16207","152979","14421","41395","199103","183467","151902")
mdays <- 1:31
set.seed(1)
data <- sort(rep(customers,100))
days <- sample(mdays,1000,replace=T)
xyz <- cbind(data,days)
x <- vector(length=1000L)
j <- 1
for( i in 1:10){
  set.seed(i) ## I wanted diff dates to be picked
  m <- sample(mdays,5)
  while(j <=100*i){
    x[j] <- paste(m,collapse = ",")
    j <- j+1
  }
}
xyz <- cbind(xyz,x)
require(data.table)
my_data <- setDT(as.data.frame(xyz))
setnames(my_data, c("cust","days","pay_days"))
my_data[,pay:=runif(1000,min = 0,max=10000)]
Now, for each cust, I want the vector of pay values that fall on pay_days.
I have tried various ways but can't seem to figure it out. My initial thought is to create a flag based on whether days is a subset of pay_days and then take the pay values according to the flag:
my_data[,ifelse(grepl(days,pay_days),1,0),cust]
This does not work as I expect it to. I don't want to use an explicit loop, as the actual data is huge.
Using tidyr to split the pay_days column into long form and then checking whether days is in pay_days:
library(tidyr)
library(dplyr)
# creating long-form data
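tidier <- my_data %>%
  mutate(pay_days = strsplit(as.character(pay_days), ",")) %>%
  unnest(pay_days)
# make sure the result is a data.table again (harmless if it already is)
setDT(tidier)
# casting as numeric to make factor & character columns comparable;
# go through as.character first so a factor yields its labels, not its level codes
tidier[, days := as.numeric(as.character(days))]
tidier[, pay_days := as.numeric(pay_days)]
tidier[days == pay_days, pay, by=cust]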
Not sure how this performs for large data, as you multiply your table length by the number of days in pay_days...
Side note: I can't comment yet, but to replicate your data one needs to add library(data.table) and initialize x with x <- vector(), which is otherwise not found, as Dee also points out.
Another one-liner approach using the data table:
my_data[,result:=sum(unlist(lapply(strsplit(as.character(pay_days),","),match,days)),na.rm=T)>0,by=1:nrow(my_data)]
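For reference, the grepl attempt in the question cannot work as written for two reasons: grepl uses only the first element of its pattern argument, and it matches substrings, so day 1 would also match "21". A dependency-free sketch of the membership test in base R follows; the four-row table is made up and only mirrors the column names from the question.

```r
# Small stand-in for my_data (hypothetical values, same column names).
my_data <- data.frame(
  cust = c("a", "a", "b", "b"),
  days = c(3, 7, 12, 5),
  pay_days = c("3,9,21", "3,9,21", "5,14,30", "5,14,30"),
  pay = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into its own vector of day values.
pay_list <- strsplit(my_data$pay_days, ",", fixed = TRUE)

# Row-wise exact membership test: is this row's day one of its pay days?
my_data$is_pay_day <- mapply(function(d, p) d %in% as.numeric(p),
                             my_data$days, pay_list)

# Pay values that fall on a pay day, grouped by customer.
pays_by_cust <- split(my_data$pay[my_data$is_pay_day],
                      my_data$cust[my_data$is_pay_day])
pays_by_cust
```

Unlike the long-form approach, this keeps one row per observation, at the cost of an R-level function call per row.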

sample replication

I have a data frame (d) composed of 640 observations of 55 variables.
I would like to randomly split this data frame into 10 sub data frames of 64
observations of 55 variables each. I don't want any observation to appear in
more than one sub data frame.
This code works for one sample:
d1 <- d[sample(nrow(d),64,replace=F),]
How can I repeat this treatment ten times ?
This one gives me a data frame of 10 variables (each one is one sample...):
d1 <- replicate(10,sample(nrow(d),64,replace = F))
Can anyone help me?
Here's a solution that returns the result in a list of data.frames:
d <- data.frame(A=1:640, B=sample(LETTERS, 640, replace=TRUE)) # an exemplary data.frame
idx <- sample(rep(1:10, length.out=nrow(d)))
res <- split(d, idx)
res[[1]] # first data frame
res[[10]] # last data frame
The only tricky part involves creating idx. idx[i] identifies the resulting data.frame, idx[i] in {1,...,10}, in which the ith row of d will occur. Such an approach assures us that no row will be put into more than 1 data.frame.
Also, note that sample here returns a random permutation of (1,2,...,10,1,2,...,10,...), i.e. of the labels 1 to 10 each repeated 64 times.
Another approach is to use:
apply(matrix(sample(nrow(d)), ncol=10), 2, function(idx) d[idx,])
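The replicate() attempt in the question draws each sample independently, so the same row can land in several samples. Shuffling one set of group labels avoids that. Here is a short sketch that repeats the split() approach on the exemplary two-column data frame and checks the two properties asked for; the seed is arbitrary.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
d <- data.frame(A = 1:640, B = sample(LETTERS, 640, replace = TRUE))

# One shuffled label per row; each label 1..10 occurs exactly 64 times.
idx <- sample(rep(1:10, length.out = nrow(d)))
res <- split(d, idx)

# Check 1: ten pieces of 64 rows each.
sizes <- vapply(res, nrow, integer(1))
all(sizes == 64)

# Check 2: no row appears in more than one piece.
all_rows <- unlist(lapply(res, function(x) x$A))
anyDuplicated(all_rows) == 0
```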

!is.na creates NAs in other columns

In the midst of merging several data sets, I'm trying to remove all rows of a data frame that have a missing value for one particular variable (I want to keep the NAs in some of the other columns for the time being). I used the following line:
data.frame <- data.frame[!is.na(data.frame$year),]
This successfully removes all rows with NAs for year, (and no others), but the other columns, which previously had data, are now entirely NAs. In other words, non-missing values are being converted to NA. Any ideas as to what's going on here? I've tried these alternatives and got the same outcome:
data.frame <- subset(data.frame, !is.na(year))
data.frame$x <- ifelse(is.na(data.frame$year) == T, 1, 0);
data.frame <- subset(data.frame, x == 0)
Am I using is.na incorrectly? Are there any alternatives to is.na in this scenario? Any help would be greatly appreciated!
Edit: Here is code that should reproduce the issue:
#data
tc <- read.csv("http://dl.dropbox.com/u/4115584/tc2008.csv")
frame <- read.csv("http://dl.dropbox.com/u/4115584/frame.csv")
#standardize NA codes
tc[tc == "."] <- NA
tc[tc == -9] <- NA
#standardize spatial units
colnames(frame)[1] <- "loser"
colnames(frame)[2] <- "gainer"
frame$dyad <- paste(frame$loser,frame$gainer,sep="")
tc$dyad <- paste(tc$loser,tc$gainer,sep="")
drops <- c("loser","gainer")
tc <- tc[,!names(tc) %in% drops]
frame <- frame[,!names(frame) %in% drops]
rm(drops)
#merge tc into frame
data <- merge(tc, frame, by.x = "year", by.y = "dyad", all.x=T, all.y=T) #year column is duplicated in this process. I haven't had this problem with nearly identical code using other data.
rm(tc,frame)
#the first column in the new data frame is the duplicate year, which does not actually contain years. I'll rename it.
colnames(data)[1] <- "double"
summary(data$year) #shows 833 NA's
summary(data$procedur) #note that at this point there are non-NA values
#later, I want to create 20 year windows following the events in the tc data. For simplicity, I want to remove cases with NA in the year column.
new.data <- data[!is.na(data$year),]
#now let's see what the above operation did
summary(new.data$year) #missing years were successfully removed
summary(new.data$procedur) #this variable is now entirely NA's
I think the actual problem is with your merge.
After you merge and have the data in data, if you do:
# > table(data$procedur, useNA="always")
#      1      2      3      4      5      6   <NA>
#    122    112    356     59     39     19 192258
You can see there are (122+112+...+19) non-NA values for data$procedur. But all of these values correspond to rows where data$year is NA.
> all(is.na(data$year[!is.na(data$procedur)]))
# [1] TRUE # every value of procedur occurs where year = NA
So, basically, all values of procedur are also removed because you removed those rows checking for NA in year.
To solve this problem, I think you should use merge as:
merge(tc, frame, all=T) # it'll automatically calculate common columns
# also this will not result in duplicated year column.
Check if this merge gives you the desired result.
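The effect can be reproduced on a toy example. The sketch below uses made-up tables and hypothetical column names (key_a, key_b), not the original files. When the key columns of an outer merge never match, every row comes from only one side, so the columns from the other side are NA, and filtering on one side's column discards all values of the other.

```r
# Hypothetical stand-ins for tc and frame with keys that never match.
events <- data.frame(key_a = c("AB", "CD"), procedur = c(1, 3),
                     stringsAsFactors = FALSE)
frame <- data.frame(key_b = c("AB1990", "CD1995"), year = c(1990, 1995),
                    stringsAsFactors = FALSE)

# Outer merge on mismatched keys: 4 rows, none of them matched.
m <- merge(events, frame, by.x = "key_a", by.y = "key_b", all = TRUE)

# Every non-NA procedur sits in a row where year is NA ...
all(is.na(m$year[!is.na(m$procedur)]))

# ... so dropping rows with missing year leaves procedur entirely NA.
kept <- m[!is.na(m$year), ]
all(is.na(kept$procedur))
```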
Try complete.cases:
data.frame.clean <- data.frame[complete.cases(data.frame$year),]
...though, as noted above, you may want to pick a more descriptive name.

Resources