Add a Count field to a data frame [duplicate] - r

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Closed 8 years ago.
I have the following dataframe dat:
> dat
subjectid variable
1 1234 12
2 1234 14
3 2143 19
4 3456 12
5 3456 14
6 3456 13
How do I add another column that shows the count of rows for each unique subjectid? My plyr attempt below doesn't do what I want:
ddply(dat, .(subjectid), summarize, quan_95 = quantile(variable, 0.95), uniq = count(unique(subjectid)))

Here is an approach via dplyr. First we group by subjectid, then use the function n() to count the number of rows in each group:
dat <- read.table(text="
subjectid variable
1 1234 12
2 1234 14
3 2143 19
4 3456 12
5 3456 14
6 3456 13")
library(dplyr)
dat %>%
group_by(subjectid) %>%
mutate(count = n())
subjectid variable count
1 1234 12 2
2 1234 14 2
3 2143 19 1
4 3456 12 3
5 3456 14 3
6 3456 13 3
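The data.table analogue (not among the original answers) is a one-liner: .N is the row count of the current group, and := adds the column by reference.

```r
library(data.table)

# same data as above
dat <- data.frame(subjectid = c(1234, 1234, 2143, 3456, 3456, 3456),
                  variable  = c(12, 14, 19, 12, 14, 13))

# .N is the number of rows in the current group; := adds count by reference
setDT(dat)[, count := .N, by = subjectid]
dat
```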

If dat is ordered by subjectid, a base-R option is table() plus rep():
tbl <- table(dat[,1])
transform(dat, count=rep(tbl, tbl))
# subjectid variable count
#1 1234 12 2
#2 1234 14 2
#3 2143 19 1
#4 3456 12 3
#5 3456 14 3
#6 3456 13 3
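If dat is not ordered, base R's ave() (which the next answer compares itself to) gives the same per-group count with results aligned to the original rows:

```r
# same data as above
dat <- data.frame(subjectid = c(1234, 1234, 2143, 3456, 3456, 3456),
                  variable  = c(12, 14, 19, 12, 14, 13))

# ave() applies FUN within each subjectid group and returns a vector
# aligned with dat's rows, so no ordering is required
dat$count <- ave(dat$variable, dat$subjectid, FUN = length)
```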

Similar to ave(), you may also use split/lapply/unsplit:
i = split(dat$variable, dat$subjectid)
count = unsplit(lapply(i, length), dat$subjectid)
Then graft the count variable back on using data.frame() or whichever method you prefer.
The split() function just creates a list of dat$variable values for each value of dat$subjectid. The count is found by using lapply() to apply the length() function over each index in the list (i) and unsplit() puts everything back in order.
unsplit() is pure magic and fairy dust. I didn't believe it the first 100 times.

Related

How to combine values with a reference table in R?

This question is very similar to these two posts: How to compare values with a reference table in R?
and Combine Table with reference to another table in R. However, mine is more complicated:
I have two data frames:
> df1
name value1 value2
1 applefromJapan 2 8
2 applesfromJapan 3 9
3 applenotfromUS 4 10
4 pearsgoxxJapan 5 11
5 bananaxxeeChina 6 12
> df2
name value1 value2
1 applefromJapan 33 1
2 watermeleonnotfromUS 34 2
3 applesfromJapan 35 3
4 pearfromChina 36 4
5 pearfromphina 37 5 # a one-letter difference should not cause a problem
and a reference table:
> ref.df
fruit country name seller
1 1:5 10:14 appleJapan John
2 1:5 10:14 pearsJapan Mike
3 1:6 11:15 applesJapan Nicole
4 1:6 11:12 bananaUS Amy
5 1:4 9:13 pearChina Jenny
6 1:5 13:14 appleUS Mark
7 1:10 18:22 watermeleonchina James
The reference table works as follows:
the name column in ref.df contains the canonical name, which df1 (or df2) can obtain by trimming its own name column down to the character ranges given in the fruit and country columns (e.g. for applefromJapan, characters 1:5 give apple and characters 10:14 give Japan, yielding appleJapan).
My desired output table is:
#output
name value1 value2 value3 value4
1 appleJapan 2 8 33 1 # ended up using the name from the ref.df
2 applesJapan 3 9 34 2 # merge df1 and df2
3 appleUS 4 10 NA NA # NA where no matching values exist
4 pearsJapan 5 11 NA NA
5 watermeleonUS NA NA 34 2
6 pearChina NA NA 73 9 # names differing by only one letter are treated as typos, so the values are summed
# bananaxxeeChina is not here because it is not referenced by ref.df
*There are thousands of rows in the real dataset (~12000 rows on average in each of the 22 dfs, ~200 rows in ref.df); this is only the first couple of rows (with some alteration). So I think it is better to compare each df to ref.df first, then combine the dfs. But how can I achieve this?
Code for producing the data:
https://codeshare.io/2p179V
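No answer is shown here, but the matching rule described in the question can be sketched. This is a minimal, hypothetical helper (the function name and the parsing of the "1:5"-style range strings are my assumptions, not from the post) that builds the canonical name for one raw name and one ref.df row:

```r
# Hypothetical helper: given a raw name and the "fruit" / "country" range
# strings from one ref.df row (e.g. "1:5" and "10:14"), extract those
# character ranges and paste them into the canonical name.
canonical_name <- function(raw, fruit_rng, country_rng) {
  f <- as.integer(strsplit(fruit_rng, ":", fixed = TRUE)[[1]])
  k <- as.integer(strsplit(country_rng, ":", fixed = TRUE)[[1]])
  paste0(substr(raw, f[1], f[2]), substr(raw, k[1], k[2]))
}

canonical_name("applefromJapan", "1:5", "10:14")  # "apple" + "Japan" -> "appleJapan"
canonical_name("pearsgoxxJapan", "1:5", "10:14")  # -> "pearsJapan"
```

A df name matches a ref.df row when canonical_name() reproduces ref.df$name; looping this over ~200 ref.df rows for each df should be cheap at the stated sizes.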

Search through range of 'education' variables and assign a number based on the highest qualification contained within these variables

I'm currently an R novice and have come across an issue while recoding variables in my dataset. I would really appreciate any advice you may have on this issue. I have several different "education_code" variables that contain information about a given individual's educational qualifications.
#create simulated data
df = data.frame(ID = c(1001,1002,1003,1004, 1005,1006,1007,1008,1009,1010,1011),
education_code_1 = c('1','2','1','1','NA', '5', '2', '3', 'NA','2','5'),
education_code_2 = c('2','4','3','4','5', '2','1','2','5','1','3'),
education_code_3 = c('3', '3','NA', '4','2', '1','NA','3','4','NA','2'))
Which looks like this:
ID education_code_1 education_code_2 education_code_3
1 1001 1 2 3
2 1002 2 4 3
3 1003 1 3 NA
4 1004 1 4 4
5 1005 NA 5 2
6 1006 5 2 1
7 1007 2 1 NA
8 1008 3 2 3
9 1009 NA 5 4
10 1010 2 1 NA
11 1011 5 3 2
Assuming that a higher value represents a higher educational level, I would like to create a new variable "Highest_degree_obtained" (below) that assigns a number based on the highest value contained within columns 2:4.
df$Highest_degree_obtained <- NA
Any suggestions on how to go about doing this?
You could just use apply():
df$Highest_degree_obtained <- apply(df[, -1], 1, function(x) {
  max(as.numeric(as.character(x)), na.rm = TRUE)
})
df$Highest_degree_obtained
[1] 3 4 3 4 5 5 2 3 5 2 5
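A row-wise maximum can also be taken without apply() via pmax(), which is vectorised over the columns. This is an alternative sketch, not from the original answers; the as.numeric(as.character(...)) step and suppressWarnings() are needed only because the simulated data codes the values (including "NA") as strings:

```r
# same simulated data as above
df <- data.frame(ID = 1001:1011,
                 education_code_1 = c('1','2','1','1','NA','5','2','3','NA','2','5'),
                 education_code_2 = c('2','4','3','4','5','2','1','2','5','1','3'),
                 education_code_3 = c('3','3','NA','4','2','1','NA','3','4','NA','2'))

# convert the character codes to numeric ("NA" strings become real NAs,
# which is what the suppressed coercion warnings are about)
num <- suppressWarnings(lapply(df[-1], function(x) as.numeric(as.character(x))))

# pmax() takes the element-wise (i.e. row-wise) maximum across the columns
df$Highest_degree_obtained <- do.call(pmax, c(num, na.rm = TRUE))
```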

How to merge tables and fill the empty cells at the same time in R?

Assume there are two tables a and b.
Table a:
ID AGE
1 20
2 empty
3 40
4 empty
Table b:
ID AGE
2 25
4 45
5 60
How can I merge the two tables in R so that the resulting table becomes:
ID AGE
1 20
2 25
3 40
4 45
You could try
library(data.table)
setkey(setDT(a), ID)[b, AGE:= i.AGE][]
# ID AGE
#1: 1 20
#2: 2 25
#3: 3 40
#4: 4 45
data
a <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
b <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
Assuming you have NA in every position in the first table where you want to use the second table's age values, you can use rbind() and na.omit().
Example
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60))
na.omit(rbind(x,y))
Results in what you're after (although unordered, and I assume you just forgot ID 5):
ID AGE
1 20
3 40
2 25
4 45
5 60
EDIT
If you want to merge two data.frames with different columns and keep those columns, it's a different matter. You can use merge() to achieve this.
Here are two data frames with different columns:
x <- data.frame(ID=c(1,2,3,4), AGE=c(20,NA,40,NA), COUNTY=c(1,2,3,4))
y <- data.frame(ID=c(2,4,5), AGE=c(25,45,60), STATE=c('CA','CA','IL'))
Add them together into one data.frame
res <- merge(x, y, by='ID', all=TRUE)
giving us
ID AGE.x COUNTY AGE.y STATE
1 20 1 NA <NA>
2 NA 2 25 CA
3 40 3 NA <NA>
4 NA 4 45 CA
5 NA NA 60 IL
Then massage it into the form we want
idx <- which(is.na(res$AGE.x)) # find missing rows in x
res$AGE.x[idx] <- res$AGE.y[idx] # replace them with y's values
names(res)[grep('AGE\\.x', names(res))] <- 'AGE' # rename merged column AGE.x to AGE
subset(res, select=-AGE.y) # dump the AGE.y column
Which gives us
ID AGE COUNTY STATE
1 20 1 <NA>
2 25 2 CA
3 40 3 <NA>
4 45 4 CA
5 60 NA IL
The package in the other answer will work. Here is a dirty hack if you don't want to use the package:
x$AGE[is.na(x$AGE)] <- y$AGE[y$ID %in% x$ID]
> x
ID AGE
1 1 20
2 2 25
3 3 40
4 4 45
But, I would use the package to avoid the clunky code.
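For a tidyverse take on the same idea (not among the original answers), left_join() plus coalesce() fills the gaps in one pipeline; coalesce() returns the first non-NA value position-wise across its arguments, and the AGE.x/AGE.y names come from the join's automatic suffixing:

```r
library(dplyr)

a <- data.frame(ID = c(1, 2, 3, 4), AGE = c(20, NA, 40, NA))
b <- data.frame(ID = c(2, 4, 5), AGE = c(25, 45, 60))

res <- left_join(a, b, by = "ID") %>%       # keeps only a's IDs, as in the desired output
  mutate(AGE = coalesce(AGE.x, AGE.y)) %>%  # prefer a's AGE, fall back to b's
  select(ID, AGE)
res
```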

Combining Rows - Summing Certain Columns and Not Others in R

I have a data set that has repeated names in column 1 and three other columns that are numeric.
I want to combine the rows of repeated names into one row, summing two of the columns while leaving the other alone. Is there a simple way to do this? I have been trying to figure it out with sapply and lapply, and have read a lot of the Q&As here, but can't seem to find a solution.
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff",
"Jeff", "Bill", "Mark")
data.Point.1 <- c(3,4,3,3,4,3,3,6,2)
data.Point.2 <- c(6,9,2,5,7,4,8,2,9)
data.Point.3 <- c(2,2,8,6,4,3,3,3,1)
data <- data.frame(Name, data.Point.1, data.Point.2, data.Point.3)
The data looks like this:
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 6 2
2 Hank 4 9 2
3 Tom 3 2 8
4 Jeff 3 5 6
5 Hank 4 7 4
6 Jeff 3 4 3
7 Jeff 3 8 3
8 Bill 6 2 3
9 Mark 2 9 1
I'd like to get it to look like this (summing columns 3 and 4 while leaving column 2 alone):
Name data.Point.1 data.Point.2 data.Point.3
1 Jeff 3 23 14
2 Hank 4 16 6
3 Tom 3 2 8
8 Bill 6 2 3
9 Mark 2 9 1
Any help would great. Thanks!
Another solution, which is a bit more straightforward, uses the dplyr library:
library(dplyr)
data <- data %>% group_by(Name, data.Point.1) %>% # group the columns you want to "leave alone"
summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) # sum columns 3 and 4
If you want to sum over all other columns except those you want to "leave alone", replace summarize(data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)) with summarise_each(funs(sum)).
I'd do it this way using data.table:
setDT(data)[, c(data.Point.1 = data.Point.1[1L],
                lapply(.SD, sum)), by = Name,
            .SDcols = -"data.Point.1"]
#    Name data.Point.1 data.Point.2 data.Point.3
# 1: Jeff            3           23           14
# 2: Hank            4           16            6
# 3:  Tom            3            2            8
# 4: Bill            6            2            3
# 5: Mark            2            9            1
We group by Name and, for each group, take the first element of data.Point.1; for the rest of the columns we compute the sum by using the base function lapply() looped over the columns of .SD, which stands for Subset of Data. The columns in .SD are determined by .SDcols, from which we remove data.Point.1 so that all the other columns go into .SD.
Check the HTML vignettes for detailed info.
You could try
library(data.table)
setDT(data)[, list(data.Point.1=data.Point.1[1L],
data.Point.2=sum(data.Point.2), data.Point.3=sum(data.Point.3)), by=Name]
# Name data.Point.1 data.Point.2 data.Point.3
#1: Jeff 3 23 14
#2: Hank 4 16 6
#3: Tom 3 2 8
#4: Bill 6 2 3
#5: Mark 2 9 1
or using base R
data$Name <- factor(data$Name, levels = unique(data$Name))
res <- do.call(rbind, lapply(split(data, data$Name), function(x) {
  x[3:4] <- colSums(x[3:4])
  x[1, ]
}))
Or using dplyr, you can use summarise_each to apply the function that needs to be applied on multiple columns, and cbind the output with the 'summarise' output for a single column
library(dplyr)
res1 <- data %>%
group_by(Name) %>%
summarise(data.Point.1=data.Point.1[1L])
res2 <- data %>%
group_by(Name) %>%
summarise_each(funs(sum), 3:4)
cbind(res1, res2[-1])
# Name data.Point.1 data.Point.2 data.Point.3
#1 Jeff 3 23 14
#2 Hank 4 16 6
#3 Tom 3 2 8
#4 Bill 6 2 3
#5 Mark 2 9 1
EDIT
The data created and the data shown initially differed in the original post. After the edit to the OP's post (by @dimitris_ps), you can get the expected result by replacing group_by(Name) with group_by(Name, data.Point.1) in the res2 <- .. code.
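A package-free sketch of the same split: aggregate() sums the two columns per Name, and match() pulls back the first data.Point.1 seen for each name. Note that aggregate() sorts the groups alphabetically rather than keeping first-appearance order:

```r
# same data as in the question
Name <- c("Jeff", "Hank", "Tom", "Jeff", "Hank", "Jeff", "Jeff", "Bill", "Mark")
data <- data.frame(Name,
                   data.Point.1 = c(3, 4, 3, 3, 4, 3, 3, 6, 2),
                   data.Point.2 = c(6, 9, 2, 5, 7, 4, 8, 2, 9),
                   data.Point.3 = c(2, 2, 8, 6, 4, 3, 3, 3, 1))

# sum columns 3 and 4 per Name (groups come out alphabetically sorted)
res <- aggregate(cbind(data.Point.2, data.Point.3) ~ Name, data = data, FUN = sum)

# pull back the first data.Point.1 for each Name, then restore column order
res$data.Point.1 <- data$data.Point.1[match(res$Name, data$Name)]
res[, c("Name", "data.Point.1", "data.Point.2", "data.Point.3")]
```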

Counting unique items in data frame

I want a simple count of the number of subjects in each condition of a study. The data look something like this:
subjectid cond obser variable
1234 1 1 12
1234 1 2 14
2143 2 1 19
3456 1 1 12
3456 1 2 14
3456 1 3 13
etc etc etc etc
This is a large dataset and it is not always obvious how many unique subjects contribute to each condition, etc.
I have this in a data.frame.
What I want is something like
cond ofSs
1 122
2 98
Where for each "condition" I get a count of the number of unique Ss contributing data to that condition. Seems like this should be painfully simple.
Use the ddply function from the plyr package:
require(plyr)
df <- data.frame(subjectid = sample(1:3, 7, TRUE),
                 cond = sample(1:2, 7, TRUE),
                 obser = sample(1:7))
> ddply(df, .(cond), summarize, NumSubs = length(unique(subjectid)))
cond NumSubs
1 1 1
2 2 2
The ddply function "splits" the data-frame by the cond variable, and produces a summary column NumSubs for each sub-data-frame.
Using your snippet of data that I loaded into object dat:
> dat
subjectid cond obser variable
1 1234 1 1 12
2 1234 1 2 14
3 2143 2 1 19
4 3456 1 1 12
5 3456 1 2 14
6 3456 1 3 13
Then one way to do this is to use aggregate() to count the unique subjectid values (assuming that is what you meant by "Ss"):
> aggregate(subjectid ~ cond, data = dat, FUN = function(x) length(unique(x)))
cond subjectid
1 1 2
2 2 1
or, if you like SQL and don't mind installing a package:
library(sqldf)
sqldf("select cond, count(distinct subjectid) from dat group by cond")
Just to give you even more choice, you could also use tapply
tapply(dat$subjectid, dat$cond, function(x) length(unique(x)))
1 2
2 1
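The modern dplyr equivalent (dplyr post-dates the accepted answer) uses n_distinct(), which is a faster shorthand for length(unique(x)):

```r
library(dplyr)

# same snippet of data as in the question
dat <- data.frame(subjectid = c(1234, 1234, 2143, 3456, 3456, 3456),
                  cond      = c(1, 1, 2, 1, 1, 1),
                  obser     = c(1, 2, 1, 1, 2, 3),
                  variable  = c(12, 14, 19, 12, 14, 13))

res <- dat %>%
  group_by(cond) %>%
  summarise(ofSs = n_distinct(subjectid))
res
```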
