R - interesting issue with either data frame or global variable

I am trying to write code that starts with a data frame containing only IDs and dynamically adds data. For example, let's say it starts with
ID
1 1
2 2
3 3
And then I make a call
addPair(1,"a",4); #sets the value of column "a" at row 1 to be the value 4
it would become
ID a
1 1 4
2 2 NA
3 3 NA
With the code below, the desired final value of total is:
ID a
1 1 4
2 2 NA
3 3 NA
But at the end, total is just
ID
1 1
2 2
3 3
Why is total not keeping what it adds? At the end of the function, total is correct, but after the call returns, total is back to just the IDs. Here is the code, with the output below.
# rm(list=ls()) # that code _should_ always be commented out
#get all the IDs
IDs = c("1","2","3")
N = length(IDs)
#the big data frame
total <- data.frame("ID"=IDs)
addPair = function(i,name,val) {
  total[,toString(name)] = rep(NA,N)
  total[,toString(name)][i] = val
  print("end")
  print(total)
}
addPair(1,"a",4)
print("after call")
print(total)
Here is the output:
[1] "end"
ID a
1 1 4
2 2 NA
3 3 NA
> print("after call")
[1] "after call"
> print(total)
ID
1 1
2 2
3 3
Why does total lose that column a after the method is over?

transform(total, a = {b=rep(NA,N); b[1] <- 4;b })
ID a
1 1 4
2 2 NA
3 3 NA
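This returns the desired object; however, it does not change total unless you assign the result back to that name. The same applies to your function: the assignments inside addPair modify a local copy of total, which is discarded when the function returns.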
addPair <- function(df,item,name, val) transform(
df, name={t=rep(NA,nrow(df)); t[item]=val;t} )
addPair(total, 1,"a",4)
ID name
1 1 4
2 2 NA
3 3 NA
> total <- addPair(total, 1,"a",4)
> total
ID name
1 1 4
2 2 NA
3 3 NA
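Unfortunately, as with the function with, transform is designed more for interactive console use than for programming, and it is not entirely safe in code. Note, for example, that the new column above is literally called name rather than a, because transform does not evaluate the argument name.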
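To make the point about assignment concrete, here is a minimal sketch in base R that avoids transform altogether. The name add_pair and its argument order are illustrative assumptions, not part of the original code: the function changes its own local copy of the data frame and returns it, and the caller assigns the result back to total.

```r
# Hypothetical helper (not from the original post): returns a modified copy
# of df in which column `name` holds `val` at row `i`.
add_pair <- function(df, i, name, val) {
  # Create the column, filled with NA, only if it does not exist yet
  if (!name %in% names(df)) {
    df[[name]] <- NA
  }
  df[[name]][i] <- val
  df  # the modified copy has to be returned ...
}

total <- data.frame(ID = c("1", "2", "3"))
total <- add_pair(total, 1, "a", 4)  # ... and assigned back by the caller
print(total)
```

Because the column is only created when it is missing, a later call such as total <- add_pair(total, 2, "a", 7) keeps the value already stored in row 1.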

Related

Filling (NA values) in the column based on its previous records and another column (with interval) in R

I want to fill the action column based on its own records and the time column. An NA in the action column should be filled based on the previous action record and a time interval. Let's say we set the time interval to 10, which means that if action is A and time is 1, all NAs in action should be A until time==11 (1+10).
Please note that if action or ID change, this process should be reset. For example (in row 3) we have B with time==11, I want to fill the next NAs with B until time==21, but we have C in time==16, so we continue filling NA with C until time==26.
df<-read.table(text="
id action time
1 A 1
1 NA 4
1 NA 9
1 B 11
1 NA 12
1 C 16
1 NA 19
1 NA 30
1 A 31
1 NA 32
2 NA 1
2 A 2
2 NA 6",header=T,stringsAsFactors = F)
Desired Result:
id action time
1 A 1
1 A 4
1 A 9
1 B 11
1 B 12
1 C 16
1 C 19
1 NA 30
1 A 31
1 A 32
2 NA 1
2 A 2
2 A 6
We can extract the non-NA rows to use as a reference for filling in values, then iterate through the data set and conditionally replace values based on if they meet the requirements of id and the time interval.
library(dplyr)
# Use row numbers as an index (unique Id)
df$idx <- seq_len(nrow(df))
# Find the non-NA rows to use as a reference for imputation
idx <- df %>%
  filter(!is.na(action))
The temporary data set idx is used as the reference and the column idx is our unique identifier. Let's first look at the logic for finding and filling in the missing values without worrying about the time interval, so that it's easier to read and understand:
# Ignoring the 'interval' limitation, we'd fill them in like this:
for (r in seq_len(nrow(df))) {
  if (is.na(df$action[r])) {
    df$action[r] <- dplyr::last(idx$action[idx$idx < df$idx[r] & idx$id == df$id[r]])
  }
}
If you're running this example code, make sure you re-create df and idx before proceeding, since df is modified by the previous code block.
The time interval requires us to do a logical test on the value of time and also another test to avoid trying to conduct the time comparison on NA values:
# Accounting for the max interval:
interval <- 10
for (r in seq_len(nrow(df))) {
  if (is.na(df$action[r])) {
    # Reference rows: earlier non-NA rows with the same id
    prev <- idx$idx < df$idx[r] & idx$id == df$id[r]
    last_time <- dplyr::last(idx$time[prev])
    # Fill only if a reference row exists and lies within the interval
    if (!is.na(last_time) && last_time + interval >= df$time[r]) {
      df$action[r] <- dplyr::last(idx$action[prev])
    }
  }
}
df
This gives us:
id action time idx
1 1 A 1 1
2 1 A 4 2
3 1 A 9 3
4 1 B 11 4
5 1 B 12 5
6 1 C 16 6
7 1 C 19 7
8 1 <NA> 30 8
9 1 A 31 9
10 1 A 32 10
11 2 <NA> 1 11
12 2 A 2 12
13 2 A 6 13
which matches your desired output.
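The same rule can also be sketched in base R as a single pass that remembers the most recent recorded action and its time. The function name fill_within_interval is a hypothetical choice for this illustration, and the sketch assumes the rows are already sorted by id and then by time, as in the example data.

```r
# Hypothetical helper: forward-fill NA actions that lie within `interval`
# time units of the most recent recorded action for the same id.
fill_within_interval <- function(df, interval = 10) {
  last_action <- NA_character_
  last_time <- NA_real_
  for (r in seq_len(nrow(df))) {
    # Reset the reference whenever a new id starts
    if (r == 1 || df$id[r] != df$id[r - 1]) {
      last_action <- NA_character_
      last_time <- NA_real_
    }
    if (!is.na(df$action[r])) {
      # A recorded action becomes the new reference
      last_action <- df$action[r]
      last_time <- df$time[r]
    } else if (!is.na(last_time) && last_time + interval >= df$time[r]) {
      df$action[r] <- last_action
    }
  }
  df
}

df <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2),
  action = c("A", NA, NA, "B", NA, "C", NA, NA, "A", NA, NA, "A", NA),
  time = c(1, 4, 9, 11, 12, 16, 19, 30, 31, 32, 1, 2, 6),
  stringsAsFactors = FALSE
)
filled <- fill_within_interval(df)
print(filled)
```

Rows 8 and 11 stay NA: row 8 is more than 10 time units after the last recorded action, and row 11 has no earlier record for its id.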

Subsetting columns by the names of rows of another dataframe

I need to subset the columns of a dataframe according to the rownames of another dataframe (in R).
I'm trying to select the representative species of the Brazilian Amazon by subsetting a large Brazilian database according to the percentage of representative locations, information which is stored in another dataframe.
> a <- data.frame("John" = c(2,1,1,2), "Dora" = c(1,1,3,2), "camilo" = c(1:4),"alex"=c(1,2,1,2))
> a
John Dora camilo alex
1 2 1 1 1
2 1 1 2 2
3 1 3 3 1
4 2 2 4 2
> b <- data.frame("SN" = 1:3, "Age" = c(15,31,2), "Name" = c("John","Dora","alex"))
> b
SN Age Name
1 1 15 John
2 2 31 Dora
3 3 2 alex
> result <- a[,rownames(b)[1:3]]
Error in `[.data.frame`(a, , rownames(b)[1:3]) :
undefined columns selected
I want to get this dataframe
John Dora alex
1 2 1 1
2 1 1 2
3 1 3 1
4 2 2 2
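Your attempt fails because rownames(b) are "1", "2" and "3", not the names; the names are in the column b$Name. In versions of R before 4.0.0, where data.frame() converts strings to factors by default, the simple a[,b$Name] does not work either, because b$Name is a factor. Be careful, because it won't throw an error but you will get the wrong answer!
But this is easy to fix by using a[,as.character(b$Name)] instead!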
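A small self-contained sketch of the fix, using the data from the question. The argument stringsAsFactors = TRUE is set here only to reproduce the older default in current R versions, and the variable names wanted and result are illustrative.

```r
a <- data.frame(John = c(2, 1, 1, 2), Dora = c(1, 1, 3, 2),
                camilo = 1:4, alex = c(1, 2, 1, 2))
# stringsAsFactors = TRUE reproduces the pre-4.0.0 default, so Name is a factor
b <- data.frame(SN = 1:3, Age = c(15, 31, 2),
                Name = c("John", "Dora", "alex"), stringsAsFactors = TRUE)

wanted <- as.character(b$Name)       # column names as plain strings
result <- a[, wanted, drop = FALSE]  # drop = FALSE keeps a data frame
print(result)
```

Converting with as.character is harmless when Name is already a character column, so the same line works in both old and new versions of R.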

R - Create a column with entries only for the first row of each subset

For instance if I have this data:
ID Value
1 2
1 2
1 3
1 4
1 10
2 9
2 9
2 12
2 13
And my goal is to find the smallest value for each ID subset, and I want the number to be in the first row of the ID group while leaving the other rows blank, such that:
ID Value Start
1 2 2
1 2
1 3
1 4
1 10
2 9 9
2 9
2 12
2 13
My first instinct is to create an index for the IDs using
A <- transform(A, INDEX=ave(ID, ID, FUN=seq_along)) ## A being the name of my data
Since I am a noob, I get stuck at this point. For each ID=n, I want to find the min(A$Value) for that ID subset and place that into the cell matching condition of ID=n and INDEX=1.
Any help is much appreciated! I am sorry that I keep asking questions :(
Here's a solution:
within(A, INDEX <- "is.na<-"(ave(Value, ID, FUN = min), c(FALSE, !diff(ID))))
ID Value INDEX
1 1 2 2
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 1 10 NA
6 2 9 9
7 2 9 NA
8 2 12 NA
9 2 13 NA
Update:
How does it work? The command ave(Value, ID, FUN = min) applies the function min to each subset of Value defined by the values of ID. For the example, it returns a vector of five 2s followed by four 9s. Since all values except the first in each subset should be NA, the function "is.na<-" sets to NA all values at the logical index defined by c(FALSE, !diff(ID)). This index is TRUE wherever an ID is identical to the preceding one, so it assumes that the rows of each ID are contiguous.
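The steps described above can be written out one at a time. This is only an unrolled sketch of the same idea, with the intermediate names group_min, repeated and start chosen for illustration and the result stored in a column called Start as in the question.

```r
A <- data.frame(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                Value = c(2, 2, 3, 4, 10, 9, 9, 12, 13))

group_min <- ave(A$Value, A$ID, FUN = min)  # minimum of each ID, repeated per row
repeated <- c(FALSE, diff(A$ID) == 0)       # TRUE where ID equals the previous ID

start <- group_min
start[repeated] <- NA                       # keep the minimum only in each first row
A$Start <- start
print(A)
```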
You're almost there. We just need to make a custom function instead of seq_along and to split value by ID (not ID by ID).
first_min <- function(x){
  nas <- rep(NA, length(x))
  # put the group minimum in the first position of the group
  nas[1] <- min(x, na.rm=TRUE)
  nas
}
This function makes a vector of NAs and replaces the first element with the minimum value of Value.
transform(A, INDEX=ave(Value, ID, FUN=first_min))
## ID Value INDEX
## 1 1 2 2
## 2 1 2 NA
## 3 1 3 NA
## 4 1 4 NA
## 5 1 10 NA
## 6 2 9 9
## 7 2 9 NA
## 8 2 12 NA
## 9 2 13 NA
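You can achieve this with a tapply one-liner. It assumes the rows are already sorted by ID, and because the blanks are empty strings, the Start column becomes character:
A$Start<-as.vector(unlist(tapply(A$Value,A$ID,FUN = function(x){ return (c(min(x),rep("",length(x)-1)))})))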
I keep going back to this question and the above answers helped me greatly.
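There is a basic solution for beginners too. It marks the first row of each ID with duplicated and fills in the group minimum there:
A$Start<-NA
first<-!duplicated(A$ID)
A$Start[first]<-ave(A$Value,A$ID,FUN=min)[first]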
Thanks.

Conditionally dropping duplicates from a data.frame

I am trying to figure out how to subset my dataset according to the repeated values of the variable s, also taking into account the id associated with each row.
Suppose my dataset is:
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2",
header=TRUE)
What I would like to do is, for each id, to keep only the first row for which s = 3. The result with dat would be:
id s
1 2
1 2
1 1
1 3
2 3
3 2
3 2
I have tried to use both duplicated() and which() and then subset() afterwards, but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of each s = 3 "block", because in some cases (as here between id = 1 and id = 2) the 3s run on from one id into the next. Which strategy would you adopt?
Like this:
subset(dat, s != 3 | s == 3 & !duplicated(dat))
# id s
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 3
# 7 2 3
# 9 3 2
# 10 3 2
Note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:
dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
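For a version that can be run on its own, the sketch below rebuilds the example data and applies the same condition with `[`. The only change is that duplicated is applied explicitly to the id and s columns, which is an assumption about what should count as a repeat if dat had further columns; the names keep and result are illustrative.

```r
dat <- data.frame(id = c(1, 1, 1, 1, 1, 1, 2, 2, 3, 3),
                  s  = c(2, 2, 1, 3, 3, 3, 3, 3, 2, 2))

# Keep every row with s != 3; among rows with s == 3, keep only the first
# occurrence of each (id, s) combination.
keep <- dat$s != 3 | !duplicated(dat[c("id", "s")])
result <- dat[keep, ]
print(result)
```

The s == 3 test from the original expression can be dropped here, because any row that fails s != 3 necessarily has s equal to 3.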

Calculating the occurrences of numbers in the subsets of a data.frame

I have a data frame in R that is similar to the one below. My real 'df' data frame is much bigger than this one, but I have simplified things as much as possible to avoid confusion.
So here’s the data frame.
id <-c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a <-c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b <-c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
c <-c(1,3,2,3,2,1,2,3,3,2,2,3,1,2,3,3,3,1,1,2,3,3,1,2,2,3,2,2,3,2)
d <-c(3,3,3,1,3,2,2,1,2,3,2,2,2,1,3,1,2,2,3,2,3,2,3,2,1,1,1,1,1,2)
e <-c(2,3,1,2,1,2,3,3,1,1,2,1,1,3,3,2,1,1,3,3,2,2,3,3,3,2,3,2,1,3)
df <-data.frame(id,a,b,c,d,e)
df
Basically what I would like to do is to get the occurrences of numbers for each column (a,b,c,d,e) and for each id group (1,2,3) (for this latter grouping see my column ’id’).
So, for column ’a’ and for id number ’1’ (for the latter see column ’id’) the code would be something like this:
as.numeric(table(df[1:10,2]))
##The results are:
[1] 3 7
Just to briefly explain my results: in column 'a' (considering only those records which have number '1' in column 'id'), number '1' occurred 3 times and number '3' occurred 7 times.
Again, just to show you another example. For column ’a’ and for id number ’2’ (for the latter grouping see again column ’id’):
as.numeric(table(df[11:20,2]))
##After running the codes the results are:
[1] 4 3 3
Let me explain a little again: in column 'a' (considering only those observations which have number '2' in column 'id'), number '1' occurred 4 times, number '2' occurred 3 times and number '3' occurred 3 times.
So this is what I would like to do: calculate the occurrences of numbers for each custom-defined subset, and then collect these values into a data frame. I know it is not a difficult task, but the PROBLEM is that I will have to change the input 'df' data frame on a regular basis, and hence both the overall number of rows and the number of columns might change over time.
What I have done so far is that I have separated the ’df’ dataframe by columns, like this:
for (z in (2:ncol(df))) assign(paste("df",z,sep="."),df[,z])
So df.2 will refer to df$a, df.3 will equal df$b, df.4 will equal df$c etc. But I’m really stuck now and I don’t know how to move forward…
Is there a proper, ”automatic” way to solve this problem?
How about -
> library(reshape)
> dftab <- table(melt(df,'id'))
> dftab
, , value = 1
variable
id a b c d e
1 3 8 2 2 4
2 4 6 3 2 4
3 4 2 1 5 1
, , value = 2
variable
id a b c d e
1 0 1 4 3 3
2 3 3 3 6 2
3 1 4 5 3 4
, , value = 3
variable
id a b c d e
1 7 1 4 5 3
2 3 1 4 2 4
3 5 4 4 2 5
So to get the number of '3's in column 'a' and group '1'
you could just do
> dftab[1,'a',3]
[1] 7
A combination of tapply and apply can create the data you want:
tapply(df$id,df$id,function(x) apply(df[df$id==x[1],-1],2,table))
However, when a group doesn't contain every value in some column (as with column a in group 1, which has no 2s), the result for that id will be a list rather than a tidy table (matrix).
$`1`
$`1`$a
1 3
3 7
$`1`$b
1 2 3
8 1 1
$`1`$c
1 2 3
2 4 4
$`1`$d
1 2 3
2 3 5
$`1`$e
1 2 3
4 3 3
$`2`
a b c d e
1 4 6 3 2 4
2 3 3 3 6 2
3 3 1 4 2 4
$`3`
a b c d e
1 4 2 1 5 1
2 1 4 5 3 4
3 5 4 4 2 5
I'm sure someone will have a more elegant solution than this, but you can cobble it together with a simple function and dlply from the plyr package.
library(plyr)
ColTables <- function(df) {
  counts <- list()
  # tabulate every column except the grouping column id
  for(a in names(df)[names(df) != "id"]) {
    counts[[a]] <- table(df[a])
  }
  return(counts)
}
results <- dlply(df, "id", ColTables)
This gets you back a list - the first "layer" of the list will be the id variable; the second the table results for each column for that id variable. For example:
> results[['2']]['a']
$a
1 2 3
4 3 3
For id variable = 2, column = a, per your above example.
A way to do it is using the aggregate function, but you have to add a column to your dataframe
> df$freq <- 0
> aggregate(freq~a+id,df,length)
a id freq
1 1 1 3
2 3 1 7
3 1 2 4
4 2 2 3
5 3 2 3
6 1 3 4
7 2 3 1
8 3 3 5
Of course you can write a function to do it, so it's easier to do it frequently, and you don't have to add a column to your actual data frame
> frequency <- function(df,groups) {
+ relevant <- df[,groups]
+ relevant$freq <- 0
+ aggregate(freq~.,relevant,length)
+ }
> frequency(df,c("b","id"))
b id freq
1 1 1 8
2 2 1 1
3 3 1 1
4 1 2 6
5 2 2 3
6 3 2 1
7 1 3 2
8 2 3 4
9 3 3 4
You didn't say how you'd like the data. The by function might give you the output you like.
by(df, df$id, function(x) lapply(x[,-1], table))
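Since the number of rows and columns is expected to change, it may help to see a base R sketch that loops over whatever columns are present. It is shown here on columns a and b only, to keep it short. The names value_cols, all_values and counts are illustrative, and fixing the factor levels is a design choice made here so that values missing from a group are reported as 0 instead of being dropped.

```r
id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3)
a  <- c(3,1,3,3,1,3,3,3,3,1,3,2,1,2,1,3,3,2,1,1,1,3,1,3,3,3,2,1,1,3)
b  <- c(3,2,1,1,1,1,1,1,1,1,1,2,1,3,2,1,1,1,2,1,3,1,2,2,1,3,3,2,3,2)
df <- data.frame(id, a, b)

# Every column except id is tabulated, however many there are
value_cols <- setdiff(names(df), "id")
# All values seen in any column, used as common factor levels
all_values <- sort(unique(unlist(df[value_cols])))

# One id-by-value table of counts per column
counts <- lapply(df[value_cols], function(col) {
  table(id = df$id, value = factor(col, levels = all_values))
})

print(counts$a)
counts$a["1", "3"]  # number of 3s in column a for id 1
```

Each element of counts is an id-by-value matrix, so as.data.frame(counts$a) gives the same information in long form if a data frame is preferred.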

Resources