Countif function (Excel) in R

I have a dataset "a" with a column "id" containing about 23,000 rows, all of which are unique within this data frame. I want to count how often each of these unique values appears in two other datasets, "b" and "c".
To do this, I tried the following code ("id" is the 45th column in the data frame "b"):
count1 <- as.data.frame(apply(a, 1, function(x) sum(b$id == x[45])))
a <- cbind(a, count1)
The code works for counting in "b", but when I tried the same approach for counting the frequency of "id" in data frame "c", where "id" is the 17th column:
count2 <- as.data.frame(apply(a, 1, function(x) sum(c$id == x[17])))
the frequencies of all the "id"s are counted as 0, which cannot be right. Could anyone suggest where the problem is or how to fix it?

We can actually do this in a way that might at first seem a little weird, but is relatively straightforward. Let's start by working with just data frames a and b, and let's simplify things a bit. Let's assume the id variables in a and b are the following:
a_id <- 1:5
b_id <- 1:5
In this simple example, a_id and b_id are exactly identical. What we want to know is how many times each of the values in a_id shows up in b_id. We obviously know the answer is one time each, but how do we get R to tell us that? That's where the table function can come in handy:
table(a_id, b_id)
# b_id
# a_id 1 2 3 4 5
# 1 1 0 0 0 0
# 2 0 1 0 0 0
# 3 0 0 1 0 0
# 4 0 0 0 1 0
# 5 0 0 0 0 1
That might look a little ugly, but you can see that we have our b_ids on the top (1-5) and our a_ids on the left-hand side. Down the diagonal, we see the counts for how many times each value of a_id shows up in b_id, and it's 1 each just like we already knew. So how do we get just that information? R has a nice function called diag that gets the main diagonal for us:
diag(table(a_id, b_id))
# 1 2 3 4 5
# 1 1 1 1 1
And there we have it. A vector with our "countif" values. But what if b_id doesn't have all of the values that are in a_id? If we try to do what we just did, we'll get an error because table doesn't like it when two vectors have different lengths. So we modify it a bit:
a_id <- 1:10
b_id <- 4:8
table(b_id[b_id %in% a_id])
# 4 5 6 7 8
# 1 1 1 1 1
A couple of new things here. The use of %in% just asks R to tell us whether a value exists in a vector. For example, 1 %in% 1:3 would return TRUE, but 4 %in% 1:3 would return FALSE. Next, you'll notice that we indexed b_id using [. This only returns the values of b_id where b_id %in% a_id is TRUE, which in this case is all of b_id.
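A quick, minimal illustration of both building blocks, using the a_id and b_id just defined:
1 %in% 1:3             # TRUE
4 %in% 1:3             # FALSE
b_id %in% a_id         # TRUE TRUE TRUE TRUE TRUE -- every b_id value appears in a_id
b_id[b_id %in% a_id]   # 4 5 6 7 8, i.e. all of b_id in this case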
So what does this look like if we expect more than one occurrence of each a_id in b_id, but not every a_id value to be present in b_id? Let's look at a more realistic example:
a_id <- 1:10
b_id <- sample(3:7, 1000, replace=TRUE)
table(b_id[b_id %in% a_id])
# 3 4 5 6 7
# 210 182 216 177 215
Like I said, it might seem a little weird at first, but it's relatively straightforward. Hopefully this helps you more than it confuses you.
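To map this back onto the original question, here is a sketch of one way to attach the counts to every row of a (untested on the real data; it assumes a$id, b$id and c$id are stored as the same type, e.g. all character or all numeric rather than factors with differing levels, and the count_b / count_c column names are just placeholders):
# counts of each a$id within b$id; ids of a that never appear in b get a 0
tab_b <- table(factor(b$id, levels = unique(a$id)))
a$count_b <- as.integer(tab_b[as.character(a$id)])
# same idea for data frame c
tab_c <- table(factor(c$id, levels = unique(a$id)))
a$count_c <- as.integer(tab_c[as.character(a$id)])
If all the counts for c still come out as 0, a type mismatch between a$id and c$id (for example factor versus character, or stray whitespace) is a common culprit, so comparing str(a$id) and str(c$id) is worth a look.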

Related

replace values in one dataset with values in another dataset R

I have a seemingly simple problem that has me stumped. I have a df, say:
x y z
0 1 2
3 5 4
1 0 5
0 5 0
and another:
x y z
1 5 6
2 4 5
4 5 7
5 8 5
I want to replace the zero values in df1 with the corresponding values in df2. E.g., cell 1 of df1 would be 1 instead of 0. I want this for all columns in the data frame. Can you help me with the code? I can't seem to figure it out. Thanks!
First, you can locate the indices of the zeros using which:
zero_locations <- which(df1 == 0, arr.ind=TRUE)
Then, you can use the locations to make the replacements:
df1[zero_locations] <- df2[zero_locations]
As David Arenburg pointed out in the comments, which isn't strictly necessary:
zero_locations <- df1 == 0
will work as well, since a logical matrix can also be used to index the data frame.
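As a quick runnable check, here is a sketch using the two small data frames from the question, typed in by hand:
df1 <- data.frame(x = c(0, 3, 1, 0), y = c(1, 5, 0, 5), z = c(2, 4, 5, 0))
df2 <- data.frame(x = c(1, 2, 4, 5), y = c(5, 4, 5, 8), z = c(6, 5, 7, 5))
zero_locations <- df1 == 0          # logical matrix, TRUE wherever df1 is 0
df1[zero_locations] <- df2[zero_locations]
df1
#   x y z
# 1 1 1 2
# 2 3 5 4
# 3 1 5 5
# 4 5 5 5
The logical-matrix form keeps the row/column pairing intact, so each 0 is replaced by the value in the same position of df2.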

Using table on a data frame by multiple variables

I have a data table that is in a "long" format, containing many entries for each unique ID. For example...
id <- c(1,1,1,2,2,2)
date <- c("A","A","B","C","C","C")
loc <- c("X", "X", "X", "X","Y","Z")
dfTest <- data.frame(id,date,loc)
This creates the following sample table:
id date loc
1 1 A X
2 1 A X
3 1 B X
4 2 C X
5 2 C Y
6 2 C Z
My goal is to create a table that looks like this.
id X Y Z
1 2 0 0
2 1 1 1
I would like to see how many times each location was visited on distinct dates. ID #1 visited X on day A and day B, giving a total of 2 unique visits. I approached this using reshape, thinking to turn this into a "wide" format, but I don't know how to factor in the second variable (the date). I'm trying to pull out the number of visits to each location on unique dates; the actual date itself does not otherwise matter, it only serves to identify the duplicate entries.
My current solution would be poor form in R (iterative loops that look at the locations found within each unique date). I was hoping reshape, apply, aggregate, or perhaps another package might be of more help. I've looked through a number of reshape guides, but am still a bit stuck on a clever way to do this.
By the sounds of it, you should be able to do what you need with the following (unique(dfTest) drops the duplicate rows, and [-2] removes the date column, so table() simply cross-tabulates id against loc):
table(unique(dfTest)[-2])
## loc
## id X Y Z
## 1 2 0 0
## 2 1 1 1
We can group by 'loc' and 'id', get the number of unique elements of 'date', and use dcast to get the expected output:
library(data.table)  # v1.9.6+
dcast(setDT(dfTest)[, uniqueN(date), .(loc, id)], id~loc, value.var='V1', fill=0)
# id X Y Z
#1: 1 2 0 0
#2: 2 1 1 1
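For what it's worth, the counting can also be pushed into dcast itself. A slightly more compact sketch of the same data.table idea (assuming a version where uniqueN can be passed as the aggregation function) would be:
dcast(setDT(dfTest), id ~ loc, value.var = "date", fun.aggregate = uniqueN, fill = 0)
#    id X Y Z
# 1:  1 2 0 0
# 2:  2 1 1 1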

How do I select distinct from a dataframe in R?

I have a data frame in R and I want to see what groups are in it. If this were a SQL database, I would do SELECT DISTINCT group FROM dataframe. Is there a way to perform a similar operation in R?
> head(orl.df)
long lat order hole piece group id
1 3710959 565672.3 1 FALSE 1 0.1 0
2 3710579 566171.1 2 FALSE 1 0.1 0
The unique() function should do the trick:
> dat <- data.frame(x=c(1,1,2),y=c(1,1,3))
> dat
x y
1 1 1
2 1 1
3 2 3
> unique(dat)
x y
1 1 1
3 2 3
Edit: for your example (I didn't see the group part at first):
unique(orl.df$group)
I think the table() function is also a good choice.
table(orl.df$group)
It also tells you the number of items in each group.
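For instance, with the small dat data frame from above, table() reports each distinct value together with how many times it occurs:
table(dat$x)
# 1 2
# 2 1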

transpose-like procedure in data.table

I asked a similar yet different question before (here)
Now I want to change this dataset:
dt <- data.table(a=c("A","A","A"),b=1:3,c=c(0,1,0))
dt
a b c
1: A 1 0
2: A 2 1
3: A 3 0
to this one:
a 1 2 3
1: A 0 1 0
So the values of column "b" should become columns, each holding the corresponding value of column "c". The values of "a" can be seen as participants (here just one person, "A"); the original dataset continues with more participants ("B" and so on). After the "transpose", column "a" should contain the unique participant values (e.g. A, B, C, etc.).
Any suggestions?
I think I can see where you're going with this.
The following should cope with varying unique items in b as well and line them up.
.u = unique(dt$b)                           # first, find the unique values of b (the future column names)
ans = dt[, as.list(c[match(.u, b)]), by=a]  # as.list is the way to make columns
setnames(ans, c("a", .u))                   # renaming once afterwards is faster
ans
a 1 2 3
1: A 0 1 0
Untested on more complex cases.
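As a side note, in data.table versions that provide dcast.data.table, the same reshape can be written more directly. A sketch (it assumes each a/b combination appears at most once, as in the example; otherwise an aggregation function would be needed):
dcast(dt, a ~ b, value.var = "c")
#    a 1 2 3
# 1: A 0 1 0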

Data frame "expand" procedure in R?

This is not a real statistical question, but rather a data preparation question before performing the actual statistical analysis. I have a data frame which consists of sparse data. I would like to "expand" this data to include zeroes for missing values, group by group.
Here is an example of the data (a and b are two factors defining the group, t is the sparse timestamp and x is the value):
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
Assuming I would like to expand the values between t=0 and t=9, this is the result I'm hoping for:
test.expanded <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2),
t=c(0,1,2,3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9),
x=c(1,0,2,1,2,0,0,2,0,0,0,0,0,1,1,0,2,1,1,3))
Zeroes have been inserted for all missing values of t. This makes it easier to use.
I have a quick and dirty implementation which sorts the data frame and loops through each of its rows, adding missing rows one at a time, but I'm not entirely satisfied with that solution. Is there a better way to do it?
For those who are familiar with SAS, it is similar to PROC EXPAND.
Thanks!
As you noted in a comment to the other answer, doing it by group is easy with plyr, which just leaves the question of how to "fill in" each group. My approach is to use merge:
library("plyr")
test.expanded <- ddply(test, c("a","b"), function(DF) {
  DF <- merge(data.frame(t=0:9), DF[,c("t","x")], all.x=TRUE)
  DF[is.na(DF$x), "x"] <- 0
  DF
})
merge with all.x=TRUE will make the missing values NA, so the second line of the function is needed to replace those NAs with 0's.
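As a quick sanity check against the expected output in the question (assuming the ddply call above has been run), the first group (a = 1, b = 1) comes out as hoped:
head(test.expanded, 10)
#    a b t x
# 1  1 1 0 1
# 2  1 1 1 0
# 3  1 1 2 2
# 4  1 1 3 1
# 5  1 1 4 2
# 6  1 1 5 0
# 7  1 1 6 0
# 8  1 1 7 2
# 9  1 1 8 0
# 10 1 1 9 0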
This is convoluted but works fine:
test <- data.frame(
a=c(1,1,1,1,1,1,1,1,1,1,1),
b=c(1,1,1,1,1,2,2,2,2,2,2),
t=c(0,2,3,4,7,3,4,6,7,8,9),
x=c(1,2,1,2,2,1,1,2,1,1,3))
my.seq <- seq(0,9)
not.t <- !(my.seq %in% test$t)
test[nrow(test)+seq(length(my.seq[not.t])),"t"] <- my.seq[not.t]
test
#------------
a b t x
1 1 1 0 1
2 1 1 2 2
3 1 1 3 1
4 1 1 4 2
5 1 1 7 2
6 1 2 3 1
7 1 2 4 1
8 1 2 6 2
9 1 2 7 1
10 1 2 8 1
11 1 2 9 3
12 NA NA 1 NA
13 NA NA 5 NA
Not sure if you want it sorted by t afterwards or not. If so, easy enough to do:
https://stackoverflow.com/a/6871968/636656
