I have panel data like id <- c(1,1,1,1,1,1,1,2,2,2,2,2), intm <- c(1,1,0,0,1,0,0,0,0,0,1,1). The data frame is created with
dta <- data.frame(cbind(id,intm)), which gives:
id intm
1 1 1
2 1 1
3 1 0
4 1 0
5 1 1
6 1 0
7 1 0
8 2 0
9 2 0
10 2 0
11 2 1
12 2 1
I would like to replace the subsequent values of "intm" variable by the first value within the ID variable. That is for ID=1, the first value is 1, so the intm should have all values as 1 and for ID=2, intm should have all values as 0. The data should be like
id <- c(1,1,1,1,1,1,1,2,2,2,2,2),intm <- c(1,1,1,1,1,1,1,0,0,0,0,0) with data frame
dta <- data.frame(cbind(id,intm))
How can I do this in R by looping or any other means? I have a big data set.
Consider the following code:
new_column <- c(); i <- 1; # new column to be created
# loop
for (j in unique(dta$id)){ # let's separate the unique values of ID
index <- which(dta$id==j) # which row index satisfy id==1, or id==2, ...?
value <- dta$intm[index[1]] # which value of intm corresponds to the first value of the index?
new_column[i:tail(index,n = 1)] <- rep(value,nrow(dta[dta$id==j,])) # repeat this value once per row with this ID (dta$id, not a global id)
i <- tail(index,n = 1)+1 # the new_column component must start with its last value + 1
}
dta <- cbind(dta,new_column)
Alternatively, you can use the subset() function to count the rows, i.e.
rep(value,nrow(subset(dta,dta$id==j)))
You can use dplyr to do this.
There are other fancier ways to do this but I think dplyr is more graceful.
id <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
df = data.frame(id, intm)
library(dplyr)
df2 = df %>% group_by(id) %>% do({
.$intm = .$intm[1, drop = TRUE]
.
})
You can also try the data.table library, which should be faster.
BTW: you do not need cbind to make a data.frame.
Consider ave + head
dta$intm <- with(dta, ave(intm, id, FUN=function(x) head(x, 1)))
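Another vectorized option in base R, as a minimal sketch: match(id, id) returns, for every row, the position of the first row with the same id, so indexing intm by it copies the first value down each group. The column name intm_first is only for illustration, so that the original column is kept for comparison.

```r
id   <- c(1,1,1,1,1,1,1,2,2,2,2,2)
intm <- c(1,1,0,0,1,0,0,0,0,0,1,1)
dta  <- data.frame(id, intm)

# For each row, the index of the first row that has the same id
first_row <- match(dta$id, dta$id)

# Copy the first intm value of each id to every row of that id
# (intm_first is an illustrative name, not from the question)
dta$intm_first <- dta$intm[first_row]
dta$intm_first
# [1] 1 1 1 1 1 1 1 0 0 0 0 0
```

This does not need the rows of one id to be next to each other, and it avoids the explicit loop.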
In the below reproducible code, I would like to add a column for SumIfs using dplyr as detailed in the below image, whereby the Excel sumifs() formula in column H of the image has conditions with the tops of the specified ranges "anchored", for a "rolling" calculation as you move down row-wise. Any recommendations for how to do the same in dplyr? I'm sure it requires grouping but unsure of how to handle conditions. The blue below shows the current reproducible code output, the yellow shows what I would like to add, and the non-highlighted shows the underlying XLS formulas.
Now using my words: to derive Sumifs, for each row one-at-a-time rolling from top-to-bottom of the array sequentially, sum all values in column D from the top of the column D range to the current row in the Column D range that have a column C "Code1" value less than the current row column C "Code1" value. So for example in deriving the value of 3 in cell G6: add the 1 in cell D3 (because its Code1 of 0 (cell C3) is < Code1 of 3 (cell C6)) to the 2 in cell D5 (because its Code1 of 1 (cell C5) is < Code1 of 3 (cell C6)).
Reproducible code:
library(dplyr)
myData <-
data.frame(
Name = c("B","R","R","R","R","B","A","A","A"),
Group = c(0,1,1,2,2,0,0,0,0),
Code1 = c(0,1,1,3,3,4,-1,0,0),
Code2 = c(1,0,2,0,1,2,1,0,0)
)
CountIfs <- function(x,y) {
out <- integer(length(x))
for(i in seq_along(x)) {
cond1 <- y[1:i] > 0
cond2 <- x[1:i] == x[i]
out[i] <- sum(cond1*cond2)
}
out
}
myDataRender <-
myData %>%
mutate(CountIfs = CountIfs(Code1, Code2))
print.data.frame(myDataRender)
Here is Tsai's solution adapted for situations where both the top and the bottom of the Excel SUMIFS() ranges are anchored (fixed, not rolling), so that the first formula in the image would be =SUMIFS(D$3:D$11,C$3:C$11,"<"&C3). This is for those of us transitioning from Excel to R:
myData %>% mutate(SumIfs = sapply(1:n(), function(x) sum(Code2[1:n()][Code1[1:n()] < Code1[x]])))
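For comparison, a minimal base R sketch of the same anchored (whole-column) SUMIFS, without dplyr. The helper name AnchoredSumIfs is made up for this example; only Code1 and Code2 from the question's data are needed.

```r
myData <- data.frame(
  Code1 = c(0,1,1,3,3,4,-1,0,0),
  Code2 = c(1,0,2,0,1,2,1,0,0)
)

# Hypothetical helper: for each row, sum vals over ALL rows
# (fixed range) whose criterion is below the current row's criterion
AnchoredSumIfs <- function(crit, vals) {
  vapply(crit, function(v) sum(vals[crit < v]), numeric(1))
}

myData$SumIfs <- AnchoredSumIfs(myData$Code1, myData$Code2)
myData$SumIfs
# [1] 1 2 2 4 4 5 0 1 1
```

Because the range is fixed, rows further down also contribute, which is why these values differ from the rolling version.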
You could use map() or imap() from purrr:
library(dplyr)
library(purrr)
# (1)
myData %>%
mutate(SumIfs = map_dbl(1:n(), ~ sum(Code2[1:.x][Code1[1:.x] < Code1[.x]])))
# (2)
myData %>%
mutate(SumIfs = imap_dbl(Code1, ~ sum(Code2[1:.y][Code1[1:.y] < .x])))
# Name Group Code1 Code2 SumIfs
# 1 B 0 0 1 0
# 2 R 1 1 0 1
# 3 R 1 1 2 1
# 4 R 2 3 0 3
# 5 R 2 3 1 3
# 6 B 0 4 2 4
# 7 A 0 -1 1 0
# 8 A 0 0 0 1
# 9 A 0 0 0 1
If you don't want to rely on purrr, the map() solution can be adapted directly for the base sapply() version:
myData %>%
mutate(SumIfs = sapply(1:n(), \(x) sum(Code2[1:x][Code1[1:x] < Code1[x]])))
Here is another way using map2_dbl() with the row number.
library(dplyr)
library(purrr)
myData %>%
mutate(SumIfs = map2_dbl(Code1, row_number(),
~ sum(if_else(Code1 < .x & row_number() <= .y, Code2, 0))))
Also using base Map(), this will scale to as many criteria as you want.
library(dplyr)
myData %>%
mutate(SumIfs = unlist(Map(\(x, y) sum(if_else(Code1 < x & row_number() <= y, Code2, 0)),
Code1, row_number())))
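If you would rather stay in base R, the rolling version can also be written in the same style as the CountIfs() function from the question. This is only a sketch; RollingSumIfs is a name chosen here for illustration.

```r
myData <- data.frame(
  Code1 = c(0,1,1,3,3,4,-1,0,0),
  Code2 = c(1,0,2,0,1,2,1,0,0)
)

# Hypothetical helper: for row i, look only at rows 1..i (the rolling
# range) and sum vals where the criterion is below the one in row i
RollingSumIfs <- function(crit, vals) {
  vapply(seq_along(crit), function(i) {
    window <- seq_len(i)
    sum(vals[window][crit[window] < crit[i]])
  }, numeric(1))
}

myData$SumIfs <- RollingSumIfs(myData$Code1, myData$Code2)
myData$SumIfs
# [1] 0 1 1 3 3 4 0 1 1
```

The result matches the SumIfs column shown in the purrr output above.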
I am trying to generate a new column (variable) based on the value inside multiple columns.
I have over 60 columns in the dataset and I wanted to subset the columns that I want to loop through.
The columns I am using in my condition are all character, and when a certain pattern is matched I want to return a value of 1 in the new variable.
I am using case_when() because I need to run multiple conditions on each column to return a value.
CODE:
library(dplyr)
df <- read.csv("sample.csv")
# Generate new variable name
df$new_var <- 0
# For loop through columns 16 to 45
for (i in colnames(df[16:45])) {
df <- df %>%
mutate(new_var=
case_when(
grepl("I8501", df[[i]]) ~ 1
))
}
This does not work as when I table the results, I only get 1 value matched.
My other attempt was using:
for (i in colnames(df[16:45])) {
df <- df %>%
mutate(new_var=
case_when(
df[[i]] == "I8501" ~ 1
))
}
Are there other ways in R to run through multiple columns with multiple conditions and change the value of the variable accordingly?
If I'm understanding what you want, I think you just need to specify another case in your case_when() for keeping the existing values when things don't match "I8501". This is how I would do that:
df$new_var <- 0
for (index in (16:45)) {
df <- df %>%
mutate(
new_var = case_when(
grepl("I8501", df[[index]]) ~ 1,
TRUE ~ df$new_var
)
)
}
I think a better way to do this though would be to use the ever useful apply():
has_match = apply(df[, 16:45], 1, function(x) sum(grepl("I8501", x)) > 0)
df$new_var = ifelse(has_match, 1, 0)
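A column-wise variant of the same idea, sketched on a small made-up data frame (the columns dx1 to dx3 and their values are hypothetical stand-ins for columns 16 to 45 of the real file). grepl() runs once per column and the logical vectors are combined with |, which avoids the row-by-row apply().

```r
# Hypothetical stand-in for the real data
df <- data.frame(
  id  = 1:4,
  dx1 = c("I8501", "K219", "A419", "I10"),
  dx2 = c("E119", "I8501", "N179", "J189"),
  dx3 = c("Z9911", "I8501", "", "F329"),
  stringsAsFactors = FALSE
)
cols <- c("dx1", "dx2", "dx3")   # in the question this would be 16:45

# One logical vector per column, then OR them together row-wise
hits      <- lapply(df[cols], grepl, pattern = "I8501", fixed = TRUE)
has_match <- Reduce(`|`, hits)

df$new_var <- as.integer(has_match)
df$new_var
# [1] 1 1 0 0
```

More conditions can be handled by computing another has_match vector per pattern and combining them as needed.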
Kindly check if this works for your file.
Sample df:
df <- data.frame(C1=c('A','B','C','D'),C2=c(1,7,3,4),C3=c(5,6,7,8))
> df
C1 C2 C3
1 A 1 5
2 B 7 6
3 C 3 7
4 D 4 8
library(dplyr)
library(stringr) # needed for str_detect()
df %>%
rowwise() %>%
mutate(new_var = as.numeric(any(str_detect(c_across(2:last_col()), "7")))) # change the 2:last_col() to select your column range ex: 2:5
Output for finding "7" in any of the columns:
C1 C2 C3 new_var
<chr> <dbl> <dbl> <dbl>
1 A 1 5 0
2 B 7 6 1
3 C 3 7 1
4 D 4 8 0
I am trying to create a new column that gets me the maximum value for a list in a data frame. I was wondering how I can create this column called maxvalue from the df$value column i.e., I would like to get the max of that list in the column.
x <- c( "000010011100011111001111111100", "011110", "0000000")
y<- c(1, 2,3)
df<- data.frame(x,y)
library(stringr)
df$value <- strsplit(df$x, "[^1]+", perl=TRUE)
# expected output ( I have tried the following)
df$maxvalue<- max(df$value)
df$maxvalue
8
4
0
this should do the trick
df$value <- lapply(lapply(strsplit(as.character(df$x),"[^1]+"), nchar),max)
output:
> df
x y value
1 000010011100011111001111111100 1 8
2 011110 2 4
3 0000000 3 0
Simplified version of @Daniel O's logic:
df$value <- sapply(strsplit(as.character(df$x),"[^1]+"), function(x){max(nchar(x))})
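We can also use rle() on the raw bytes from charToRaw() (the byte for "1" is 31 in hex); max(0, ...) avoids -Inf when a string contains no 1s
sapply(as.character(df$x), function(x)
with(rle(charToRaw(x)), max(0, lengths[as.character(values) == "31"])))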
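Another way to look at it, as a sketch: instead of splitting on everything that is not a 1, extract the runs of 1s directly with gregexpr() and regmatches(), then take the longest one. max(0L, ...) covers strings with no 1s at all.

```r
x  <- c("000010011100011111001111111100", "011110", "0000000")
y  <- c(1, 2, 3)
df <- data.frame(x, y, stringsAsFactors = FALSE)

# All maximal runs of "1" in each string (character(0) if there are none)
runs <- regmatches(df$x, gregexpr("1+", df$x))

# Length of the longest run per string, 0 when no run exists
df$maxvalue <- vapply(runs, function(r) max(0L, nchar(r)), integer(1))
df$maxvalue
# [1] 8 4 0
```

Using vapply() keeps maxvalue an ordinary integer column rather than a list column.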
Suppose, you're given the following dataframe:
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"))
What I am looking for is to create the variables flag_1, ..., flag_5 in a, containing the information of how many times the respective values occur. For a, I would expect the following result:
var flag_1 flag_2 flag_3 flag_4 flag_5
",1,2,3," 1. 1. 1. 0. 0.
",2,3,5," 0. 1. 1. 0. 1.
",1,3,5,5," 1. 0. 1. 0. 2.
I managed to get the result using a nested for-loop and an if-condition but there must be a nicer (more aesthetic and better performing) solution.
One option would be to do strsplit, get the table and then cbind with original data
cbind(a, do.call(rbind, lapply(strsplit(as.character(a$var), ","),
function(x) table(factor(x[nzchar(x)], levels = 1:5, labels = paste0("flag_", 1:5))))))
# var flag_1 flag_2 flag_3 flag_4 flag_5
#1 ,1,2,3, 1 1 1 0 0
#2 ,2,3,5, 0 1 1 0 1
#3 ,1,3,5,5, 1 0 1 0 2
Another option is with tidyverse
library(tidyverse)
str_extract_all(a$var, "[0-9]") %>%
map(~ as.integer(.x) %>%
as_tibble) %>%
bind_rows(.id = 'grp') %>%
count(grp, value = factor(value, levels = min(value):max(value))) %>%
spread(value, n, drop = FALSE, fill = 0) %>%
select(-grp) %>%
bind_cols(a, .) %>%
rename_at(vars(matches("^[0-9]+$")), ~ paste0("flag_", .))
# var flag_1 flag_2 flag_3 flag_4 flag_5
#1 ,1,2,3, 1 1 1 0 0
#2 ,2,3,5, 0 1 1 0 1
#3 ,1,3,5,5, 1 0 1 0 2
First, don't make the strings into factors. Nothing good comes from that.
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"),
stringsAsFactors = FALSE)
Getting from the strings to your table is simple enough if we take it in small steps. Here, I've written (or renamed) a function per step and then gone through the steps using lapply one at a time. You can string it all together in a pipeline if you like, but it would be roughly these steps.
First, I extract the numbers from the strings. That involves splitting on commas and getting rid of empty strings. You have those because your strings begin and end with a comma; otherwise that step wouldn't be necessary. Then we need to translate the strings into numbers and count how often we see each (we can do that with the as.numeric and table functions, respectively). After that it is just a question of mapping the observed counts into a table that also includes the values we haven't observed.
pick_indices <- function(str) unlist(strsplit(str, split = ","))
remove_empty <- function(chrs) chrs[nchar(chrs) > 0]
get_indices <- as.numeric
to_counts <- table
to_flag_vect <- function(counts, len) {
vec <- rep(0, len)
names(vec) <- 1:len
vec[names(counts)] <- counts
vec
}
strings <- lapply(a$var, pick_indices)
cleaned <- lapply(strings, remove_empty)
indices <- lapply(cleaned, get_indices)
counts <- lapply(indices, to_counts)
flags <- lapply(counts, to_flag_vect, len = 5)
We now have the flag-counts in a list, so to make it into the table you want, with the column names you want, we simply do this:
tbl <- do.call(rbind, flags)
colnames(tbl) <- paste0("flag_", 1:5)
tbl
Done.
Split and unlist the values into a factor with appropriate levels
x = strsplit(a$var, ",")
xp = factor(unlist(x), levels = seq_len(5))
Create an index that maps the values of xp to the rows they came from
i = rep(seq_along(x), lengths(x))
use xtabs() to cross-tabulate the entries by row
xt = xtabs(~ i + xp)
and cbind() the matrix representation of the result to the original
> cbind(a, unclass(xt))
var 1 2 3 4 5
1 ,1,2,3, 1 1 1 0 0
2 ,2,3,5, 0 1 1 0 1
3 ,1,3,5,5, 1 0 1 0 2
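A compact base R sketch along the same lines uses tabulate(), which counts positive integers into a fixed number of bins, so values that never occur get 0 automatically. It assumes, as in the example, that the values are integers from 1 to 5.

```r
a <- data.frame(var = c(",1,2,3,", ",2,3,5,", ",1,3,5,5,"),
                stringsAsFactors = FALSE)

parts <- strsplit(a$var, ",", fixed = TRUE)

# One row of counts per string; empty pieces from the leading comma are dropped
counts <- t(vapply(parts,
                   function(p) tabulate(as.integer(p[nzchar(p)]), nbins = 5),
                   integer(5)))
colnames(counts) <- paste0("flag_", 1:5)

result <- cbind(a, counts)
result
```

Each row of counts holds how often 1, 2, ..., 5 occur in the corresponding string, and the columns already carry the flag_ names.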
I would like to use the vector:
time.int<-c(1,2,3,4,5) #vector to be used as a "guide"
and the database:
time<-c(1,1,1,1,5,5,5)
value<-c("s","s","s","t","d","d","d")
dat1<- as.data.frame(cbind(time,value))
to create the following vector, which I can then combine with the first vector "time.int" in a second data frame.
freq<-c(4,0,0,0,3) #wished result
This vector counts the events that fall in each time interval: there are four 1s in "time", so the first value is 4, and so on.
Ideally I would like to generalize it so that I can choose the interval width, for example counting the events in "time" for every 3 values of time.int.
EDIT for generalization
time.int<-c(1,2,3,4,5,6)
time<-c(1,1,1,2,5,5,5,6)
value<-c("s","s","s","t", "t","d","d","d")
dat1<- data.frame(time,value)
let's say I want it every 2 seconds (every 2 time.int)
freq<-c(4,0,4) #wished result
or every 3
freq<-c(4,4) #wished result
I know how to do that in excel, with a pivot table.
Sorry if this is a duplicate. I could not find a fitting question on this website, and I do not even know how to ask this or where to start.
The following will produce vector freq.
freq <- sapply(time.int, function(x) sum(x == time))
freq
[1] 4 0 0 0 3
BTW, don't use the construct as.data.frame(cbind(.)). Use instead
dat1 <- data.frame(time,value)
In order to generalize the code above to segments of time.int of any length, I believe the following function will do it. Note that since you've changed the data the output for n == 1 is not the same as above.
fun <- function(x, y, n){
inx <- lapply(seq_len(length(x) %/% n), function(m) seq_len(n) + n*(m - 1))
sapply(inx, function(i) sum(y %in% x[i]))
}
freq1 <- fun(time.int, time, 1)
freq1
[1] 3 1 0 0 3 1
freq2 <- fun(time.int, time, 2)
freq2
[1] 4 0 4
freq3 <- fun(time.int, time, 3)
freq3
[1] 4 4
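The same grouping can be sketched without building the index list: find each event's position in the guide vector with match(), turn the position into a block number with integer division, and count with tabulate(). The function name count_by_block is made up for this example. One difference from fun() above: this version uses ceiling(), so a leftover partial block at the end is kept rather than dropped.

```r
time.int <- c(1,2,3,4,5,6)
time     <- c(1,1,1,2,5,5,5,6)

# Hypothetical helper: count events per block of n consecutive guide values
count_by_block <- function(guide, events, n) {
  pos   <- match(events, guide)     # position of each event in the guide
  pos   <- pos[!is.na(pos)]         # ignore events not present in the guide
  block <- (pos - 1) %/% n + 1      # block number of each event
  tabulate(block, nbins = ceiling(length(guide) / n))
}

count_by_block(time.int, time, 1)
# [1] 3 1 0 0 3 1
count_by_block(time.int, time, 2)
# [1] 4 0 4
count_by_block(time.int, time, 3)
# [1] 4 4
```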
We can use the table function to count the event number and use merge to create a data frame summarizing the information. event_dat is the final output.
# Create example data
time.int <- c(1,2,3,4,5)
time <- c(1,1,1,1,5,5,5)
# Count the event using table and convert to a data frame
event <- as.data.frame(table(time))
# Convert the time.int to a data frame
time_dat <- data.frame(time = time.int)
# Merge the data
event_dat <- merge(time_dat, event, by = "time", all = TRUE)
# Replace NA with 0
event_dat[is.na(event_dat)] <- 0
# See the result
event_dat
time Freq
1 1 4
2 2 0
3 3 0
4 4 0
5 5 3
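If only the count vector is needed, a shorter sketch is to turn time into a factor whose levels are the guide values; table() then reports 0 for levels that never occur, so no merge or NA replacement is required.

```r
time.int <- c(1,2,3,4,5)
time     <- c(1,1,1,1,5,5,5)

# Levels come from the guide vector, so unobserved times are counted as 0
freq <- as.vector(table(factor(time, levels = time.int)))
freq
# [1] 4 0 0 0 3

# Combine with the guide vector into a data frame
event_dat <- data.frame(time = time.int, Freq = freq)
```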