Adding counts of a factor to a data frame [duplicate]

This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Closed 6 years ago.
I have a data frame where each row is an observation concerning a pupil. One of the vectors in the data frame is an id for the school. I have obtained a new vector with counts for each school as follows:
tbsch <- table(dt$school)
Now I want to add the relevant count value to each row in dt. I have done it using for(), looping through each row in dt, building a new vector containing the relevant count, and finally using cbind() to add it to dt, but I think this is very inefficient. Is there a smart/easy way to do that?

Using jmsigner's data you could do:
dt$count <- ave(dt$school, dt$school, FUN = length)
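One caveat: ave() returns a vector of the same class as its first argument, so if school were stored as character or factor the counts would come back coerced. A safer variant of the same idea counts over a sequence instead:
dt$count <- ave(seq_along(dt$school), dt$school, FUN = length)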

This is a lot easier in data.table v1.8.1. := now works by group. Groups don't have to be contiguous and it retains the original order. And it's just one line:
library(data.table)
# set up data
set.seed(2)
npupils <- rpois(10, 20)
pupil <- unlist(lapply(npupils, seq_len))
school <- rep(seq_along(npupils), npupils)
dt <- data.table(school = school, pupil = pupil) # Create a data.table
dt <- dt[sample(seq_len(nrow(dt)))] # Mix it up
dt
school pupil
1: 5 2
2: 6 13
3: 2 14
4: 5 3
5: 10 14
---
186: 3 11
187: 7 2
188: 8 12
189: 3 6
190: 7 10
(dt[, schoolSize := .N, by = school])
school pupil schoolSize
1: 5 2 16
2: 6 13 18
3: 2 14 15
4: 5 3 16
5: 10 14 24
---
186: 3 11 14
187: 7 2 28
188: 8 12 19
189: 3 6 14
190: 7 10 28
That has all the usual speed advantages of fast grouping, and assigns the new column by reference with no copy at all.
Edit: Deleted an answer that was only relevant for data.table prior to version 1.8.1 (thanks to Matthew for the update).

You could try something like this:
dt <- data.frame(p=1:20, school=sample(1:5, 20, replace=T))
tbsch <- table(dt$school)
tbsch <- data.frame(tbsch)
merge(dt, tbsch, by.x="school", by.y="Var1")
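One caveat: data.frame(tbsch) names its columns Var1 and Freq, and merge() re-sorts the result by the join key, so the original row order is lost. If the Freq name bothers you, rename it afterwards:
dt2 <- merge(dt, tbsch, by.x="school", by.y="Var1")
names(dt2)[names(dt2) == "Freq"] <- "count"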

You can also use plyr, which preserves the original order, with this one-liner:
library(plyr)
join(dt, count(dt, "school"))


Replace value from updated dataset based on number of instances it appears in a second dataset

I have a simple 2-column dataset containing the variables cluster_size and index. Originally all values of index were assigned a value of 1. Subsequently, I received a second dataset containing only a few clusters where index should be updated with different integer values.
I simply need to replace the index value from the updated dataset. My specific issue is that the values for cluster_size can repeat multiple times, but I only need to replace it for the number of instances it appears in the updated dataset. For instance, in the example data below, the cluster_size value of 34 appears three times, but only once in the updated data with an index of 6. This means that only one of these three rows should update to 6 (it doesn't matter which one).
Code to recreate a 20-row sample of the original data (have), updated subset (updated), and desired dataset (want) is below. The actual data has tens of thousands of rows. I've tried several merge and loop functions (all too pathetic to waste your time by posting here), but can't seem to find an elegant solution.
# Data with original index cases
set.seed(03151813)
have <- data.frame(clust_size=sample(1:50,20,replace=TRUE),index=rep(1,times=20))
have <- have[order(have$clust_size),]
# Updated data only contains clusters that need updating of index
updated <- data.frame(clust_size=c(30,34,42,44,44,46),
                      index=c(2,6,4,8,9,4))
# Desired dataset
want <- data.frame(clust_size=have$clust_size,
                   index=c(rep(1,times=9),2,1,6,
                           1,1,1,4,1,8,9,4))
Here is a base R approach. Add row numbers to have and updated for each clust_size. So the clust_size of 34 will have rows numbered consecutively 1, 2, and 3.
Then, you can merge the two together on both clust_size and row number. If you include all.x you will get all rows from the first data frame have.
Final step is to replace the missing NA values in your new index column with the original index.
have$rn <- with(have, ave(seq_along(clust_size), clust_size, FUN = seq_along))
updated$rn <- with(updated, ave(seq_along(clust_size), clust_size, FUN = seq_along))
want <- merge(have, updated, all.x = TRUE, by = c("clust_size", "rn"))
want$index.y <- ifelse(is.na(want$index.y), want$index.x, want$index.y)
want[, c("clust_size", "index.y")]
An alternate version using dplyr would be something like this:
library(dplyr)
have2 <- have %>%
  group_by(clust_size) %>%
  mutate(rn = row_number())
updated2 <- updated %>%
  group_by(clust_size) %>%
  mutate(rn = row_number())
left_join(have2, updated2, by = c("clust_size", "rn")) %>%
  mutate(index.y = coalesce(index.y, index.x))
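The join keeps both index columns; to match the two-column output shown below, the pipe can be extended with %>% ungroup() %>% select(clust_size, index.y).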
Output
clust_size index.y
1 1 1
2 5 1
3 8 1
4 10 1
5 16 1
6 20 1
7 22 1
8 27 1
9 29 1
10 30 2
11 30 1
12 34 6
13 34 1
14 34 1
15 35 1
16 42 4
17 43 1
18 44 8
19 44 9
20 46 4

Multiply various subsets of a data frame by different elements of a vector R

I have a data frame:
df <- data.frame(id=rep(1:10,each=10),
                 Room1=rnorm(100,0.4,0.5),
                 Room2=rnorm(100,0.3,0.5),
                 Room3=rnorm(100,0.7,0.5))
And a vector:
vals <- sample(7:100, 10)
I want to multiply cols Room1, Room2 and Room3 by a different element of the vector for every unique ID number and output a new data frame (df2).
I managed to multiply each column per id by EVERY element of the vector using the following:
samp_func <- function(x) {
  x*vals[i]
}
for (i in vals) {
  df2 <- df %>% mutate_at(c("Room1", "Room2", "Room3"), samp_func)
}
But the resulting df (df2) is each Room column multiplied by the same element of the vector (vals) for all of the different ids, when what I want is each Room column (per id) multiplied by a different element of vals. Sorry in advance if this is not clear; I am a beginner and still getting to grips with the terminology.
Thanks!
EDIT: The desired output should look like the below, where the columns for each ID have been multiplied by a different element of the vector vals.
id Room1 Room2 Room3
1 1 24.674826880 60.1942571 46.81276141
2 1 21.970270107 46.0461779 35.09928150
3 1 26.282357614 -3.5098880 38.68400541
4 1 29.614182061 -39.3025587 25.09146592
5 1 33.030886472 46.0354881 42.68209027
6 1 41.362699668 -23.6624632 26.93845129
7 1 5.429031042 26.7657577 37.49086963
8 1 18.733422977 -42.0620572 23.48992138
9 1 -17.144070723 9.9627315 55.43999326
10 1 45.392182468 20.3959968 -16.52166621
11 2 30.687978299 -11.7194020 27.67351631
12 2 -4.559185345 94.9256561 9.26738357
13 2 86.165076849 -1.2821515 29.36949423
14 2 -12.546711562 47.1763755 152.67588456
15 2 18.285856423 60.5679496 113.85971720
16 2 72.074929648 47.6509398 139.69051486
17 2 -12.332519694 67.8890324 20.73189965
18 2 80.889634991 69.5703581 98.84404415
19 2 87.991093995 -20.7918559 106.13610773
20 2 -2.685594148 71.0611693 47.40278949
21 3 4.764445589 -7.6155681 12.56546664
22 3 -1.293867841 -1.1092243 13.30775785
23 3 16.114831628 -5.4750642 8.58762550
24 3 -0.309470950 7.0656088 10.07624289
25 3 11.225609780 4.2121241 16.59168866
26 3 -3.762529113 6.4369973 15.82362705
27 3 -5.103277731 0.9215625 18.20823042
28 3 -10.623165177 -5.2896293 33.13656839
29 3 -0.002517872 5.0861361 -0.01966699
30 3 -2.183752881 24.4644310 13.55572730
This should solve your problem. You can use a new dataset of all id, value combinations to make sure you calculate each combination and merge on the Room values. Then use mutate to make new Room columns.
Also, in the future I'd recommend setting a seed when asking questions with random data as it's easier for someone to replicate your output.
library(dplyr)
set.seed(0)
df <- data.frame(id=rep(1:10,each=10),
                 Room1=rnorm(100,0.4,0.5),
                 Room2=rnorm(100,0.3,0.5),
                 Room3=rnorm(100,0.7,0.5))
vals <- sample(7:100, 10)
# one row per id, pairing each id with its own multiplier
other_df <- data.frame(id = 1:10,
                       val = vals)
df2 <- inner_join(other_df, df, by = "id")
df2 <- df2 %>%
  mutate(Room1 = Room1*val,
         Room2 = Room2*val,
         Room3 = Room3*val)
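Since the ids here are the consecutive integers 1 to 10, an alternative base R sketch (assuming the same df and vals) indexes vals directly, with no join needed:
# vals[df$id] gives one multiplier per row; each Room column is then
# multiplied element-wise by that vector
df2 <- df
df2[, c("Room1", "Room2", "Room3")] <- df[, c("Room1", "Room2", "Room3")] * vals[df$id]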

R Concatenate column in data frame with one value/string [duplicate]

This question already has answers here:
How to add leading zeros?
(8 answers)
Closed 4 years ago.
I am trying to concatenate some data in a column of a df, with "0000"
I tried to use paste() in a loop, but it becomes very performance-heavy, as I have more than 2,000,000 rows. Thus, it takes forever.
Is there a smart, less performance heavy way to do it?
#DF:
CUSTID VALUE
103 12
104 10
105 15
106 12
... ...
#Desired result:
#DF:
CUSTID VALUE
0000103 12
0000104 10
0000105 15
0000106 12
... ...
How can this be achieved?
paste is vectorized, so it will work with a vector of values (i.e. a column in a data frame). The following should work:
DF <- data.frame(
  CUSTID = 103:107,
  VALUE = 13:17
)
DF$CUSTID <- paste0('0000', DF$CUSTID)
Should give you
CUSTID VALUE
1 0000103 13
2 0000104 14
3 0000105 15
4 0000106 16
5 0000107 17
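Note that paste0() with a constant prefix only yields a fixed overall width when every CUSTID has the same number of digits. If the real goal is zero-padding to a fixed width (as in the linked duplicate), sprintf() is also vectorized; applied to the original integer column:
DF$CUSTID <- sprintf("%07d", DF$CUSTID) # pad with zeros to 7 characters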

efficient rbind alternative with applied function

To take a step back, my ultimate goal is to read in around 130,000 images into R with a pixel size of HxW and then to make a dataframe/datatable containing the rgb of each pixel of each image on a new row. So the output will be something like this:
> head(train_data, 10)
image_no r g b pixel_no
1: 00003e153.jpg 0.11764706 0.1921569 0.3098039 1
2: 00003e153.jpg 0.11372549 0.1882353 0.3058824 2
3: 00003e153.jpg 0.10980392 0.1843137 0.3019608 3
4: 00003e153.jpg 0.11764706 0.1921569 0.3098039 4
5: 00003e153.jpg 0.12941176 0.2039216 0.3215686 5
6: 00003e153.jpg 0.13333333 0.2078431 0.3254902 6
7: 00003e153.jpg 0.12549020 0.2000000 0.3176471 7
8: 00003e153.jpg 0.11764706 0.1921569 0.3098039 8
9: 00003e153.jpg 0.09803922 0.1725490 0.2901961 9
10: 00003e153.jpg 0.11372549 0.1882353 0.3058824 10
I currently have a piece of code to do this in which I apply a function to get the rgb for each pixel of a specified image, returning the result in a dataframe:
#function to get rgb from image file paths
library(jpeg) # for readJPEG()
get_rgb_table <- function(link){
  img <- readJPEG(toString(link))
  # Creating the data frame; H and W (image height and width) are defined elsewhere
  rgb_image <- data.frame(r = as.vector(img[1:H, 1:W, 1]),
                          g = as.vector(img[1:H, 1:W, 2]),
                          b = as.vector(img[1:H, 1:W, 3]))
  #add pixel id
  rgb_image$pixel_no <- row.names(rgb_image)
  #add image id
  train_rgb <- cbind(sub('.*/', '', link), rgb_image)
  colnames(train_rgb)[1] <- "image_no"
  return(train_rgb)
}
I call this function on another dataframe which contains the links to all the images:
train_files <- list.files(path="~/images/", pattern=".jpg",all.files=T, full.names=T, no.. = T)
train <- data.frame(matrix(unlist(train_files), nrow=length(train_files), byrow=T))
The train dataframe looks like this:
> head(train, 10)
link
1 C:/Documents/image/00003e153.jpg
2 C:/Documents/image/000155de5.jpg
3 C:/Documents/image/00021ddc3.jpg
4 C:/Documents/image/0002756f7.jpg
5 C:/Documents/image/0002d0f32.jpg
6 C:/Documents/image/000303d4d.jpg
7 C:/Documents/image/00031f145.jpg
8 C:/Documents/image/00053c6ba.jpg
9 C:/Documents/image/00057a50d.jpg
10 C:/Documents/image/0005d01c8.jpg
I finally get the result I want with the following loop:
for(i in 1:length(train[,1])){
  train_data <- rbind(train_data, get_rgb_table(train[i,1]))
}
However, this last bit of code is very inefficient. An optimization of how the function is applied and/or of the rbind would help. I think the function get_rgb_table() itself is quick, but the problem is with the loop and the rbind. I have tried using apply(), but can't manage to do this on each row and put the result in one data frame without running out of memory. Any help on this would be great. Thanks!
This is very difficult to answer given the vagueness of the question, but I'll make a reproducible example of what I think you're asking and will give a solution.
Say I have a function that returns a data frame:
MyFun <- function(x)randu[1:x,]
And I have a data frame df that will act as an input to the function (reconstructed here from the printout that follows):
df <- data.frame(a = 1:10, b = 21:30)
# a b
# 1 1 21
# 2 2 22
# 3 3 23
# 4 4 24
# 5 5 25
# 6 6 26
# 7 7 27
# 8 8 28
# 9 9 29
# 10 10 30
From your question, it looks like only one column will be used as input. So, I apply the function to each row of this data frame using lapply, then bind the results together using do.call and rbind like this:
do.call(rbind, lapply(df$a, MyFun))
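If the list is long, data.table's rbindlist() is usually much faster than do.call(rbind, ...); a sketch using the objects from the question (the train_files vector of paths and the get_rgb_table() function) might be:
library(data.table)
# rbindlist() binds a list of data frames in a single pass
train_data <- rbindlist(lapply(train_files, get_rgb_table))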

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange; my apologies, as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in possible called "between" and a column in the "truth" data frame called "match". For every value from possible that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important; I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table: melt the start/stop columns into a single column of times, then roll each candidate time back to the nearest boundary. A time whose nearest preceding boundary is a start lies inside an interval, while one whose nearest preceding boundary is a stop lies outside. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
  possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(ID = 1:100,
                    Start = sample(5:20, size = 100, replace = TRUE),
                    Stop = sample(21:50, size = 100, replace = TRUE))
possible <- data.frame(Times = sample(1:15, size = 15, replace = FALSE))
After getting sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through the 'possible' data frame
for (i in 1:nrow(possible)){
  #get a boolean vector showing which 'truth' rows are a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  #if any are true then update the match and between vectors
  if (any(match.vec)){
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
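On tens of thousands of rows the double loop above can get slow. A loop-free sketch using data.table non-equi joins (assuming the Start/Stop and Times columns as above; inrange() and non-equi joins need data.table >= 1.9.8):
library(data.table)
setDT(truth)
setDT(possible)
#flag possible times lying inside at least one [Start, Stop] interval
possible[, betweenAny := as.integer(inrange(Times, truth$Start, truth$Stop))]
#flag truth intervals containing at least one possible time: which = TRUE
#returns the first matching row of 'possible' per interval, or NA if none
truth[, anyMatch := as.integer(!is.na(
  possible[truth, on = .(Times >= Start, Times <= Stop), mult = "first", which = TRUE]
))]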
