I have a large file and need to match admin ids with users, something like this:
TABLE1          TABLE 2
INDEX V1 IDS    AdmID
1     A  30     30
2     U  3      123
3     U  25     60
4     U  4      .
5     U  5      .
6     A  123    .
7     U  7
8     U  8
9     U  9
10    A  60
11    U  26
12    U  2
.     .  .
.     .  .
.     .  .
I want something like this:
COMPLETE TABLE
INDEX V1 IDS ADMIN_ID
1 A 30 30
2 U 3 30
3 U 25 30
4 U 4 30
5 U 5 30
6 A 123 123
7 U 7 123
8 U 8 123
9 U 9 123
10 A 60 60
11 U 26 60
12 U 2 60
. . . .
. . . .
. . . .
So I wrote this loop, but it is taking forever to finish. Any idea how to use apply() in this situation?
ln <- 10000 # number of records in the Adm table
# TABLE2 = index of the adm ids
for (k in 1:ln){
w<-TABLE2$A_ID[k] #Ids of the adms
for(i in seq(from=AdmID[k], to=AdmID[k+1], by=1)){
TABLE1$ADMIN_ID[i]<-w
}
}
It is easier if the mapping records how many rows each admin id covers (admin$ind). Take the cumulative sums of those counts and reverse the mapping table (admin). The ids can then be filled in sequentially, working down from the largest cutoff: in your case 12, 9, 5.
df <- data.frame(index = c(1:12),
v1 = c("A","U","U","U","U","A","U","U","U","A","U","U"),
ids = 13:24,
admin = 0)
# need rule to assign ids - ind
admin <- data.frame(ind = c(5,4,3), id = c(30,123,60))
# get cumulative sum and reverse admin table
admin$cum <- cumsum(admin$ind)
admin <- admin[nrow(admin):1,]
admin
ind id cum
3 3 60 12
2 4 123 9
1 5 30 5
# ids will be subsequently updated - 12, 9, 5
for(i in 1:length(admin$cum)) {
df[as.numeric(row.names(df)) <= admin$cum[i], 4] <- admin$id[i]
}
df
index v1 ids admin
1 1 A 13 30
2 2 U 14 30
3 3 U 15 30
4 4 U 16 30
5 5 U 17 30
6 6 A 18 123
7 7 U 19 123
8 8 U 20 123
9 9 U 21 123
10 10 A 22 60
11 11 U 23 60
12 12 U 24 60
Below is another version that uses the individual counts (ind) directly rather than the cumulative sums.
df <- data.frame(index = c(1:12),
v1 = c("A","U","U","U","U","A","U","U","U","A","U","U"),
ids = 13:24)
# need rule to assign ids - ind
admin <- data.frame(ind = c(5,4,3), id = c(30,123,60))
df$admin <- do.call(c, lapply(1:length(admin$ind), function(x) {
rep(admin$id[x], sum(as.numeric(row.names(df)) <= admin$ind[x]))
}))
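If the admin rows can be recognised from v1 == "A", as in the question's TABLE1, a fully vectorised sketch avoids both loops. It assumes the first row is an admin row and that each admin's id sits in the ids column of its own "A" row. The ids values are copied from the question, and grp is a name introduced here for illustration.

```r
df <- data.frame(index = 1:12,
                 v1 = c("A","U","U","U","U","A","U","U","U","A","U","U"),
                 ids = c(30, 3, 25, 4, 5, 123, 7, 8, 9, 60, 26, 2))

# cumsum() over the "A" flags numbers the blocks: 1 1 1 1 1 2 2 2 2 3 3 3
grp <- cumsum(df$v1 == "A")

# the ids found on the "A" rows (30, 123, 60), looked up by block number
df$ADMIN_ID <- df$ids[df$v1 == "A"][grp]

# equivalent when the run lengths are already known, as in the admin table above
admin <- data.frame(ind = c(5, 4, 3), id = c(30, 123, 60))
alt <- rep(admin$id, times = admin$ind)
```

Both forms do a single pass over the rows, so they should scale to large files much better than the nested loop in the question.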
I have this data set in R:
first_variable = rexp(100,100)
second_variable = rexp(100,100)
n_obs = 1:100
question_data = data.frame(n_obs, first_variable, second_variable)
I want to add an id column to this dataset so that:
Rows 1-10 have ids 1,2,3,4,5,6,7,8,9,10
Rows 11-20 have ids 1,2,3,4,5,6,7,8,9,10
Rows 21-30 have ids 1,2,3,4,5,6,7,8,9,10
etc.
In other words, the ids 1-10 repeat for each set of 10 rows.
I found this code that I thought would work:
# here, n = 10 (a set of n = 10 rows)
bloc_len <- 10
question_data$id <-
rep(seq(1, 1 + nrow(question_data) %/% bloc_len), each = bloc_len, length.out = nrow(question_data))
But this is not working: it gives every row in a set of 10 the same id:
n_obs first_variable second_variable id
1 1 0.006223412 0.0258968583 1
2 2 0.004473815 0.0065543554 1
3 3 0.011745754 0.0005061101 1
4 4 0.005620351 0.0033549525 1
5 5 0.045860202 0.0132625822 1
6 6 0.002477348 0.0068517981 1
I would have wanted something like this:
n_obs first_variable second_variable id
1 1 0.0062234115 0.0258968583 1
2 2 0.0044738150 0.0065543554 2
3 3 0.0117457544 0.0005061101 3
4 4 0.0056203508 0.0033549525 4
5 5 0.0458602019 0.0132625822 5
6 6 0.0024773478 0.0068517981 6
7 7 0.0049527013 0.0047461094 7
8 8 0.0058581805 0.0108604478 8
9 9 0.0041171801 0.0002445268 9
10 10 0.0090667287 0.0019289691 10
11 11 0.0039002449 0.0135441919 1
12 12 0.0064558661 0.0230979415 2
13 13 0.0104993267 0.0005609776 3
14 14 0.0153162705 0.0038364012 4
15 15 0.0107109676 0.0183818539 5
16 16 0.0131620151 0.0029710189 6
17 17 0.0244441763 0.0095645480 7
18 18 0.0058112355 0.0125754349 8
19 19 0.0005022588 0.0156614272 9
20 20 0.0007572985 0.0049964333 10
21 21 0.0276024376 0.0024303513 1
Is this possible?
Thank you!
Instead of each, try using times:
question_data$id <-
rep(seq(bloc_len), times = nrow(question_data) %/% bloc_len, length.out = nrow(question_data))
As in the example shared, if the number of rows in the data (100) is exactly divisible by the number of ids (10), we can rely on R's recycling to repeat the ids.
bloc_len <- 10
question_data$id <- seq_len(bloc_len)
If it is not exactly divisible, we can use rep:
question_data$id <- rep(seq_len(bloc_len), length.out = nrow(question_data))
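To see why each and times behave differently, here is a small sketch with a row count that is deliberately not a multiple of the block length. The values n = 8 and bloc_len = 3 are made up for illustration.

```r
bloc_len <- 3
n <- 8

# each = repeats every element in place, so ids come out in runs
by_each <- rep(seq_len(bloc_len), each = bloc_len, length.out = n)
# 1 1 1 2 2 2 3 3

# without each, the whole sequence is cycled and then cut to length n
by_cycle <- rep(seq_len(bloc_len), length.out = n)
# 1 2 3 1 2 3 1 2

# the same cycle via modular arithmetic on the row number
by_mod <- (seq_len(n) - 1) %% bloc_len + 1
```

The modular form can be handy when the row number is already available as a column, but rep with length.out is usually easier to read.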
I have a data frame and a vector containing variable names. I want to create a new variable holding the row sums of the variables named in my vector, and I want the name of the new variable to be the concatenation of those variable names.
For example, I have this data:
> data
Name A B C D E
r1 1 5 12 21 15
r2 2 4 7 10 9
r3 5 15 6 9 6
r4 7 8 0 7 18
and this vector
>Vec
"A" , "C" , "D"
The result I want is the sum of variables A, C and D, and the name of the new variable is ACD.
Here's the result I want:
> data
Name A B C D ACD E
r1 1 5 12 21 34 15
r2 2 4 7 10 19 9
r3 5 15 6 9 20 6
r4 7 8 0 7 14 18
I tried this:
data <- cbind(data , as.data.frame(rowSums(data[,Vec]) ))
But I don't know how to set the name.
Here's the result I got:
>data
Name A B C D E rowSums(data[,Vec])
r1 1 5 12 21 15 34
r2 2 4 7 10 9 19
r3 5 15 6 9 6 20
r4 7 8 0 7 18 14
Note that I gave just a simple example to explain what I want to do.
I want to assign the result back to my data (so that it contains the new variable), as I did in my command above.
Edit 1: in my real program I don't know the elements of my vector (the variable names) in advance, so I cannot do data$ACD <- cbind(data , as.data.frame(rowSums(data[,Vec]) )) as suggested by Pax. In fact I have a for loop that generates my vectors, and each time I create a variable to hold the result I want (the sum of the variables in my vector), so I don't know how to assign the name without knowing the elements of the vectors.
Please tell me if you need any more clarification or information.
Thank you
It's not a one line solution but you can set the name on the subsequent line:
data <- data.frame(A = c(1, 2, 5, 7),
B = c(5, 4, 15, 8),
C = c(12, 7, 6, 0),
D = c(21, 10, 9, 7),
E = c(15, 9, 6, 18))
Vec <- c("A" , "C" , "D")
data <- cbind(data, rowSums(data[,Vec]))
# Add name
names(data)[ncol(data)] <- paste(Vec, collapse="")
# A B C D E ACD
# 1 1 5 12 21 15 34
# 2 2 4 7 10 9 19
# 3 5 15 6 9 6 20
# 4 7 8 0 7 18 14
Here is an option with the janitor package. You can use adorn_totals, which appends a totals row or column to a data.frame. Here the name argument gives the name of the new column, and the all_of(Vec) at the end selects the columns to total.
library(janitor)
adorn_totals(data, "col", fill = NA, na.rm = TRUE, name = paste(Vec, collapse = ""), all_of(Vec))
Output
A B C D E ACD
1 5 12 21 15 34
2 4 7 10 9 19
5 15 6 9 6 20
7 8 0 7 18 14
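Since the question's edit mentions vectors generated inside a for loop, here is a base R sketch that builds the column name and assigns the sum in one step with [[<-. The list vec_list and the second vector c("B", "E") are hypothetical stand-ins for whatever the real loop produces.

```r
data <- data.frame(A = c(1, 2, 5, 7),
                   B = c(5, 4, 15, 8),
                   C = c(12, 7, 6, 0),
                   D = c(21, 10, 9, 7),
                   E = c(15, 9, 6, 18))

# hypothetical vectors of column names, e.g. produced by an earlier loop
vec_list <- list(c("A", "C", "D"), c("B", "E"))

for (Vec in vec_list) {
  new_name <- paste(Vec, collapse = "")   # "ACD", then "BE"
  # drop = FALSE keeps a data frame even if Vec names a single column
  data[[new_name]] <- rowSums(data[, Vec, drop = FALSE])
}
```

Using data[[new_name]] means the name never has to be typed literally, which is the part that data$ACD cannot do.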
I need to create 10 bins with approximately equal frequency each. For this, I am using the function classIntervals from the classInt package with style 'quantile' to bin some data. This works for most columns, but when a column has one number repeated too many times, an error says that some breaks are not unique. That makes sense: if, say, the last 30% or more of the column is the same number, the function doesn't know how to split the bins.
What I would like is this: if a number accounts for more than 10% of the length of the column, treat it as its own bin; otherwise use the function as it is.
For example, let's assume we have this DF:
df <- read.table(text="
X
1 5
2 29
3 4
4 26
5 4
6 17
7 4
8 4
9 4
10 25
11 4
12 4
13 5
14 14
15 18
16 13
17 29
18 4
19 13
20 6
21 26
22 11
23 2
24 23
25 4
26 21
27 7
28 4
29 18
30 4",h=T,strin=F)
So in this case 10% of the length is 3, and a frequency table of the values looks like this:
2 1
4 11
5 2
6 1
7 1
11 1
13 2
14 1
17 1
18 2
21 1
23 1
25 1
26 2
29 2
With this info, first we should treat "4" as a unique bin.
So we have a final output more or less like this:
X Bins
1 5 [2,6)
2 29 [27,30)
3 4 [4]
4 26 [26,27)
5 4 [4]
6 17 [15,19)
7 4 [4]
8 4 [4]
9 4 [4]
10 25 [19,26)
11 4 [4]
12 4 [4]
13 5 [2,6)
14 14 [12,15)
15 18 [15,19)
16 13 [12,15)
17 29 [27,30)
18 4 [4]
19 13 [12,15)
20 6 [6,12)
21 26 [26,27)
22 11 [6,12)
23 2 [2,6)
24 23 [19,26)
25 4 [4]
26 21 [19,26)
27 7 [6,12)
28 4 [4]
29 18 [15,19)
30 4 [4]
Until now, my approach has been something like this:
Moda <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
Binner <- function(df) {
library(classInt)
#Input is a matrix that wants to be binned
for (c in 1:ncol(df)) {
if (sapply(df,class)[c]=="numeric") {
VectorTest <- df[,c]
# Here I get the 10% of the values
TenPer <- floor(length(VectorTest)/10)
while((sum(VectorTest == Moda(VectorTest)))>=TenPer) {
# in this loop I manage to remove the values that
# are repeated more than 10% but I still don't know how to add it as a special bin
VectorTest <- VectorTest[VectorTest!=Moda(VectorTest)]
Counter <- Counter +1
}
binsTest <- classIntervals(VectorTest_Fixed, 10- Counter, style = 'quantile')
binsBrakets <- cut(VectorTest, breaks = binsTest$brks)
df[ , paste0("Binned_", colnames(df)[c])] <- binsBrakets
}
}
return (df)
}
Can someone help me?
You could use cutr::smart_cut:
# devtools::install_github("moodymudskipper/cutr")
library(cutr)
df$Bins <- smart_cut(df$X,list(10,"balanced"),"g",simplify = F)
table(df$Bins)
#
# [2,4) [4,5) [5,6) [6,11) [11,14) [14,18) [18,21) [21,25) [25,29) [29,29]
# 1 11 2 2 3 2 2 2 3 2
more on cutr and smart_cut
You can create two data frames: one for the values that each make up at least 10% of the data (each of these gets its own bin) and one for the remaining values, binned with cut(). Then bind them together (make sure the bins are character strings).
library(magrittr)
#lets find the numbers that appear more than 10% of the time
large <- table(df$X) %>%
.[. >= length(df$X)/10] %>%
names()
#these numbers appear less than 10% of the time
left_over <- df$X[!df$X %in% large]
#we want a total of 10 bins, so we'll cut the data into 10 - the number of 10%
left_over_bins <- cut(left_over, 10 - length(large))
#Let's combine the information into a single data frame
numbers_bins <- rbind(
data.frame(
n = left_over,
bins = left_over_bins %>% as.character,
stringsAsFactors = F
),
data.frame(
n = df$X[df$X %in% large],
bins = df$X[df$X %in% large] %>% as.character,
stringsAsFactors = F
)
)
If you table the information you'll get something like this
table(numbers_bins$bins) %>% sort(T)
4 (1.97,5] (11,14] (23,26] (17,20]
11 3 3 3 2
(20,23] (26,29] (5,8] (14,17] (8,11]
2 2 2 1 1
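Note that cut() with a bin count gives equal-width bins, while the question asked for roughly equal frequencies. A base R sketch of the same split-then-bin idea, using quantile() for the breaks, might look like the following. The names heavy, rest and bins are made up here, and the "[4]" label format simply mimics the desired output in the question.

```r
x <- c(5, 29, 4, 26, 4, 17, 4, 4, 4, 25, 4, 4, 5, 14, 18, 13, 29, 4, 13, 6,
       26, 11, 2, 23, 4, 21, 7, 4, 18, 4)
n_bins <- 10

# values that account for at least 10% of the observations get their own bin
freq <- table(x)
heavy <- as.numeric(names(freq)[freq >= length(x) / 10])
is_heavy <- x %in% heavy

# quantile breaks for the remaining values; unique() guards against tied breaks
rest <- x[!is_heavy]
probs <- seq(0, 1, length.out = n_bins - length(heavy) + 1)
brks <- unique(quantile(rest, probs = probs))

bins <- character(length(x))
bins[is_heavy] <- paste0("[", x[is_heavy], "]")
bins[!is_heavy] <- as.character(cut(rest, breaks = brks, include.lowest = TRUE))

table(bins)
```

If unique() removes tied breaks, fewer than 10 bins come back, so this is a starting point rather than a guarantee of exactly 10.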
Seems simple but I can't figure it out.
I have a bunch of animal location data (217 individuals) as a single dataframe. I'm trying to randomly select X locations per individual for further analysis with the caveat that X is within the range of 6-156.
So I'm trying to set up a loop that first randomly selects a value within the range of 6-156 then use that value (say 56) to randomly extract 56 locations from the first individual animal and so on.
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
sub2<-i([sample(nrow(i),sub),])
}
This approach didn't seem to work so I tried tweaking it...
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
rand<-i[sample(1:nrow(i),sub,replace=FALSE),]
}
This did not work either.. Any suggestions or previous postings would be helpful!
Head of the datafile...ANIMALS is the name of the df, ID indicates unique individuals
> FID X Y MONTH DAY YEAR HOUR MINUTE SECOND ELKYR SOURCE ID animalid
1 0 510313 4813290 9 5 2008 22 30 0 342008 FG 1 1
2 1 510382 4813296 9 6 2008 1 30 0 342008 FG 1 1
3 2 510385 4813311 9 6 2008 2 0 0 342008 FG 1 1
4 3 510385 4813394 9 6 2008 3 30 0 342008 FG 1 1
5 4 510386 4813292 9 6 2008 2 30 0 342008 FG 1 1
6 5 510386 4813431 9 6 2008 4 1 0 342008 FG 1 1
Here's one way using mapply. This function takes two lists (or something that can be coerced into a list) and applies function FUN to corresponding elements.
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
# calculate number of samples for individual animal
num.samples.per.animal <- sample(3:6, length(unique(xy$animal)), replace = TRUE)
num.samples.per.animal
[1] 6 3 4 4 6 3 3 6 3 5
# subset random x number of rows from each animal
result <- do.call("rbind",
mapply(num.samples.per.animal, split(xy, f = xy$animal), FUN = function(x, y) {
y[sample(1:nrow(y), x),]
}, SIMPLIFY = FALSE)
)
result
animal loc
7 1 0.99483999
1 1 0.50951321
10 1 0.36505294
6 1 0.34058842
8 1 0.26489107
9 1 0.47418823
13 2 0.27213396
12 2 0.28087775
15 2 0.22130069
23 3 0.33646632
21 3 0.02395097
28 3 0.53079981
29 3 0.85287600
35 4 0.84534073
33 4 0.87370167
31 4 0.85646813
34 4 0.11642335
46 5 0.59624723
48 5 0.15379729
45 5 0.57046122
42 5 0.88799675
44 5 0.62171858
49 5 0.75014593
60 6 0.86915983
54 6 0.03152932
56 6 0.66128549
64 7 0.85420774
70 7 0.89262455
68 7 0.40829671
78 8 0.19073661
72 8 0.20648832
80 8 0.71778913
73 8 0.77883677
75 8 0.37647108
74 8 0.65339300
82 9 0.39957202
85 9 0.31188471
88 9 0.10900795
100 10 0.55282999
95 10 0.10145296
96 10 0.09713218
93 10 0.64900866
94 10 0.76099256
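To make the pairing that mapply does more concrete, here is a tiny standalone sketch. The names n_take and vecs are invented for illustration; the k-th call receives n_take[k] together with vecs[[k]].

```r
n_take <- c(1, 2, 3)
vecs <- list(1:3, 4:6, 7:9)

# call 1: n = 1, v = 1:3; call 2: n = 2, v = 4:6; call 3: n = 3, v = 7:9
firsts <- mapply(function(n, v) sum(head(v, n)), n_take, vecs)
firsts
# 1 9 24
```

In the answer above, the role of n_take is played by num.samples.per.animal and the role of vecs by the per-animal data frames from split().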
EDIT
Here is another (more straightforward) approach that also handles cases when number of rows is less than the number of samples that should be allocated.
set.seed(357)
result <- do.call("rbind",
by(xy, INDICES = xy$animal, FUN = function(x) {
avail.obs <- nrow(x)
num.rows <- sample(3:15, 1)
while (num.rows > avail.obs) {
message("Sample to be larger than available data points, repeating sampling.")
num.rows <- sample(3:15, 1)
}
x[sample(1:avail.obs, num.rows), ]
}))
result
I like Stack Overflow because I learn so much. @RomanLustrik provided a simple solution; mine is straightforward as well:
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
newVec <- NULL #Start empty; rbind() below builds up the data frame
for(i in unique(xy$animal)){
#Sample a number between 1 and 10 (or 6 and 156, if you need)
samp <- sample(1:10, 1)
#Determine which rows of data frame xy belong to animal i
rows <- which(xy$animal == i)
#From xy, draw samp rows (with replacement) from those belonging to animal i
newVec1 <- xy[sample(rows, samp, replace = TRUE), ]
#Append everything to the same new data frame
newVec <- rbind(newVec, newVec1)
}
}
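The same idea can be sketched with split() and lapply(), which avoids growing a data frame inside a loop. This version samples without replacement and caps the draw at the number of rows available, which is an assumption about what is wanted: capping changes the distribution of sample sizes compared with redrawing, as the while loop above does. The 3:15 range mirrors the simulated example, so swap in 6:156 for the real data.

```r
set.seed(1)
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))

pieces <- lapply(split(xy, xy$animal), function(d) {
  # never ask for more rows than this animal has
  n <- min(sample(3:15, 1), nrow(d))
  # sample row positions without replacement
  d[sample(nrow(d), n), , drop = FALSE]
})
result <- do.call(rbind, pieces)

table(result$animal)
```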
I have a data frame m with:
>m
id w y z
1 2 5 8
2 18 5 98
3 1 25 5
4 52 25 8
5 5 5 4
6 3 3 5
Below is a general function for normally transforming a variable that I need to apply to columns w,y,z.
y<-qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
For example, if I wanted to run this function on "column w" to get the output column appended to dataframe "m" then:
m$w_n<-qnorm((rank(m$w,na.last="keep")-0.5)/sum(!is.na(m$w)))
Can someone help me automate this to run on multiple columns in data frame m?
Ideally, I would want an output data frame with the following columns:
id w y z w_n y_n z_n
Note this is a sample data frame, the one I have is much larger and I have more letter columns to run this function on other than w, y,z.
Thanks!
There is probably a way to do it in a single step, but what about:
df <- data.frame(id = 1:6, w = sample(50, 6), z = sample(50, 6) )
df
id w z
1 1 39 40
2 2 20 26
3 3 43 11
4 4 4 37
5 5 36 24
6 6 27 14
transCols <- function(x) qnorm((rank(x,na.last="keep")-0.5)/sum(!is.na(x)))
tmpdf <- lapply(df[, -1], transCols)
names(tmpdf) <- paste0(names(tmpdf), "_n")
df_final <- cbind(df, tmpdf)
df_final
id w z w_n z_n
1 1 39 40 0.6744898 1.3829941
2 2 20 26 -0.6744898 0.2104284
3 3 43 11 1.3829941 -1.3829941
4 4 4 37 -1.3829941 0.6744898
5 5 36 24 0.2104284 -0.2104284
6 6 27 14 -0.2104284 -0.6744898
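For what it's worth, the single-step version can be sketched by assigning a list of new columns directly. This uses the data frame m from the question, and it assumes every column except id should be transformed. cols is a name introduced here.

```r
m <- data.frame(id = 1:6,
                w = c(2, 18, 1, 52, 5, 3),
                y = c(5, 5, 25, 25, 5, 3),
                z = c(8, 98, 5, 8, 4, 5))

transCols <- function(x) qnorm((rank(x, na.last = "keep") - 0.5) / sum(!is.na(x)))

# every column except id; replace with an explicit vector of names if needed
cols <- setdiff(names(m), "id")

# lapply() returns one transformed vector per column; the names on the left
# become the new column names w_n, y_n, z_n
m[paste0(cols, "_n")] <- lapply(m[cols], transCols)
names(m)
```

Ties (as in columns y and z) get averaged ranks under rank()'s default ties.method, so tied values receive the same transformed score.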