R - dataframe - every x rows new number in other column - r

My question is:
I have a matrix with 200,000 rows and 3 columns (productID, week, order).
I want to fill the productID column with IDs starting at 1, creating 26 rows for each ID. Then I want to put the numbers 1-26 in the week column for every ID.
I know it's not that hard, but I keep making mistakes.
Thank you so much for your help!

Are you looking for something like this?
library(tidyverse)
tibble(productID = 1:4, week = 5:8, order = "Test") %>%
tidyr::complete(week = 1:26, productID = 1:4, fill = list(order = NA_character_))
# A tibble: 104 x 3
week productID order
<int> <int> <chr>
1 1 1 NA
2 1 2 NA
3 1 3 NA
4 1 4 NA
5 2 1 NA
6 2 2 NA
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 NA
# ... with 94 more rows
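If the goal is simply a fresh table with 26 consecutive rows per product, base R's rep() may be enough. The following is a minimal sketch, assuming 4 products and an empty order column; n_products and n_weeks are illustrative names, not taken from the question.

```r
# Assumed sizes for illustration; adjust to the real number of products
n_products <- 4
n_weeks <- 26

products <- data.frame(
  # each ID repeated 26 times in a row: 1,1,...,1,2,2,...
  productID = rep(seq_len(n_products), each = n_weeks),
  # the sequence 1..26 repeated once per product
  week = rep(seq_len(n_weeks), times = n_products),
  # placeholder column, to be filled in later
  order = NA_integer_
)

head(products, 3)
```

Unlike the complete() result above, which is sorted by week, the rows here are grouped by productID, so each ID occupies 26 consecutive rows.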

Related

Removing NAs from a large dataframe

I have a very large dataframe with 10,703,009 rows. I want to remove the NAs, but I get this error: 'Colloc couldnot allocate memory of 10703009 bytes'.
My input dataframe is 'a', and many of its rows contain NA:
IDs Codes
1 C493
1 NA
2 E348
3 NA
I need an output without the rows that contain NA:
IDs Codes
1 C493
2 E348
I tried both of the following, but each gives a memory error:
drop_na(a,Codes)
subset(a,Codes)
Please suggest a solution for this in R.
A data frame with 10,703,009 rows is no problem for R. See below. I generated a tibble with exactly that number of rows, in which the variable Codes is NA with probability probNA = 0.3.
library(tidyverse)
n <- 10703009
probNA <- 0.3
df <- tibble(
  IDs = 1:n,
  Codes = paste0(sample(LETTERS[1:10], n, replace = TRUE),
                 sample(100:999, n, replace = TRUE))
) %>%
  mutate(Codes = ifelse(sample(c(TRUE, FALSE), n, replace = TRUE,
                               prob = c(probNA, 1 - probNA)), NA, Codes))
df
output
# A tibble: 10,703,009 x 2
IDs Codes
<int> <chr>
1 1 I586
2 2 A188
3 3 H674
4 4 D641
5 5 A793
6 6 B455
7 7 B837
8 8 A805
9 9 NA
10 10 E380
# ... with 10,702,999 more rows
The size of such a tibble, as returned by object.size(df), is 128,941,096 bytes.
Now let's get rid of the rows with NA values.
df %>% filter(!is.na(Codes))
output
# A tibble: 7,490,809 x 2
IDs Codes
<int> <chr>
1 1 I586
2 2 A188
3 3 H674
4 4 D641
5 5 A793
6 6 B455
7 7 B837
8 8 A805
9 10 E380
10 11 C231
# ... with 7,490,799 more rows
Now let's replace all NA values with an empty string.
df %>% mutate(Codes = ifelse(is.na(Codes), "", Codes))
output
# A tibble: 10,703,009 x 2
IDs Codes
<int> <chr>
1 1 "I586"
2 2 "A188"
3 3 "H674"
4 4 "D641"
5 5 "A793"
6 6 "B455"
7 7 "B837"
8 8 "A805"
9 9 ""
10 10 "E380"
# ... with 10,702,999 more rows
As you can see, everything works smoothly and without any problems.
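For comparison, the same filtering can be done without any packages. Note that subset() expects a logical condition, so subset(a, Codes) as written in the question would not drop the NA rows. The sketch below rebuilds the small example frame from the question (the stringsAsFactors setting is my own assumption) and shows two base R forms.

```r
# Small example frame from the question
a <- data.frame(
  IDs = c(1, 1, 2, 3),
  Codes = c("C493", NA, "E348", NA),
  stringsAsFactors = FALSE
)

# Logical indexing: keep rows where Codes is not NA
res_index <- a[!is.na(a$Codes), ]

# subset() with an explicit logical condition gives the same rows
res_subset <- subset(a, !is.na(Codes))

res_index
```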

R: Duplicating a subset of row values, based on condition, across a whole dataframe

I have a dataframe df containing count data at different sites, across two days:
day site count
1 A 2
1 B 3
2 A 10
2 B 12
I would like to add a new column day1count that represents the count value at day 1, for each unique site. So, on rows where day==1, count and day1count would be identical. The new df would look like:
day site count day1count
1 A 2 2
1 B 3 3
2 A 10 2
2 B 12 3
So far I've created a new column that has duplicate values for day 1 rows, and NA for everything else:
df$day1count= ifelse(df$day==1, df$count, NA)
day site count day1count
1 A 2 2
1 B 3 3
2 A 10 NA
2 B 12 NA
How can I now replace the NA entries with values corresponding to each unique site from day 1?
I figured it out. It's not very elegant (and I invite others to submit a more efficient approach) but...
Do NOT create the new column with df$day1count= ifelse(df$day==1, df$count, NA) as I did in the original example. Instead, start by making a duplicate of df, but which only contains rows from day 1
tmpdf = df[df$day==1,]
Rename count as day1count (this is the plyr::rename syntax), and remove the day column
tmpdf = plyr::rename(tmpdf, c("count"="day1count"))
tmpdf$day = NULL
Merge the two dataframes by site
newdf = merge(x=df,y=tmpdf, by="site")
newdf
site day count day1count
1 A 1 2 2
2 A 2 10 2
3 B 1 3 3
4 B 2 12 3
With tidyverse you could do the following:
library(tidyverse)
df %>%
group_by(site) %>%
mutate(day1count = first(count))
Output
# A tibble: 4 x 4
# Groups: site [2]
day site count day1count
<int> <fct> <int> <int>
1 1 A 2 2
2 1 B 3 3
3 2 A 10 2
4 2 B 12 3
Data
df <- read.table(
text =
"day site count
1 A 2
1 B 3
2 A 10
2 B 12", header = T
)
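The first(count) approach relies on the day 1 row coming first within each site. If the row order is not guaranteed, a lookup with match() avoids that assumption. This is a minimal base R sketch on the same data, with the rows deliberately shuffled; the intermediate name day1 is my own.

```r
# Same data as above, but with day 2 rows placed first to show order independence
df <- data.frame(
  day = c(2, 2, 1, 1),
  site = c("A", "B", "A", "B"),
  count = c(10, 12, 2, 3),
  stringsAsFactors = FALSE
)

# Rows observed on day 1 (assumes exactly one day 1 row per site)
day1 <- df[df$day == 1, ]

# For every row, look up the day 1 count of its site
df$day1count <- day1$count[match(df$site, day1$site)]

df
```

A site with no day 1 row would simply get NA in day1count.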

R: reconstruct manager-employee data

I have a list of employee information with employee id and direct line manager id. I want to rearrange the data so that it lists all levels of managers for each employee.
I want to create a loop to find line managers repeatedly.
Here is the code to create a sample dataset.
employee_id = seq(1:10)
manager_id =c(1,1,2,3,4,2,3,1,4,5)
hr=data.frame(employee_id,manager_id)
Here is what I expect:
Using employee_id 4 as an example
employee_id managerL1 managerL2 managerL3
4 3 2 1
I should also mention that this is a simplified example. In the real data that I'm working with, manager and employee ids are not sequential. They are random numbers with prefixes. The ids themselves don't carry any information about managerial levels. The level is driven purely by the data.
Seems like this requires an iterative solution.
Start with the level 1 managers of our employees. The row index of the employee who is the manager of each employee is
i <- 1
idx = match(hr$manager_id, hr$employee_id)
The manager's manager is hr$manager_id[idx], and we can use the same match() approach iteratively. Record and repeat until there is just a single employee as manager
repeat {
idx = match(hr$manager_id[idx], hr$employee_id)
hr[[paste0("manager_", i)]] = hr$employee_id[idx]
if (length(unique(idx)) == 1)
break
i <- i + 1
}
A variant might allow for one or more top-level managers by using NA as their manager, and stopping appropriately
hr$manager_id[1] = NA # the boss has no manager; there could be several top-level managers...
i <- 1
idx = match(hr$manager_id, hr$employee_id)
repeat {
idx = match(hr$manager_id[idx], hr$employee_id)
hr[[paste0("manager_", i)]] = hr$employee_id[idx]
if (all(is.na(idx)))
break
i <- i + 1
}
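Because the lookup goes through match() rather than through the id values themselves, the same idea carries over to non-sequential, prefixed ids. Here is a minimal self-contained sketch under two assumptions of mine: top-level managers have NA as manager_id, and the hierarchy contains no cycles. The ids below are made up for illustration, and the managerL column names follow the layout requested in the question.

```r
# Hypothetical data with prefixed, non-sequential ids
hr <- data.frame(
  employee_id = c("E17", "E42", "E08", "E99"),
  manager_id  = c(NA,    "E17", "E42", "E08"),
  stringsAsFactors = FALSE
)

level <- 1
current <- hr$manager_id  # level 1: each employee's direct manager

# Stop when nobody has a manager at this level; the second condition is a
# safety guard so that a cyclic hierarchy cannot loop forever
while (!all(is.na(current)) && level <= nrow(hr)) {
  hr[[paste0("managerL", level)]] <- current
  # Move one step up: the manager of each current manager (NA stays NA)
  current <- hr$manager_id[match(current, hr$employee_id)]
  level <- level + 1
}

hr
```

Each pass adds one column, so the number of managerL columns equals the depth of the deepest reporting chain.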
Here is an option with tidyverse
library(tidyverse)
hr %>%
uncount(manager_id, .remove = FALSE) %>%
group_by(employee_id) %>%
mutate(new_id = row_number(), nm1 = str_c('manager_', new_id)) %>%
spread(nm1,new_id)
# A tibble: 10 x 7
# Groups: employee_id [10]
# employee_id manager_id manager_1 manager_2 manager_3 manager_4 manager_5
# <int> <dbl> <int> <int> <int> <int> <int>
# 1 1 1 1 NA NA NA NA
# 2 2 1 1 NA NA NA NA
# 3 3 2 1 2 NA NA NA
# 4 4 3 1 2 3 NA NA
# 5 5 4 1 2 3 4 NA
# 6 6 2 1 2 NA NA NA
# 7 7 3 1 2 3 NA NA
# 8 8 1 1 NA NA NA NA
# 9 9 4 1 2 3 4 NA
#10 10 5 1 2 3 4 5
Or with map and spread
hr %>%
mutate(new_id = map(manager_id, seq)) %>%
unnest(new_id) %>%
mutate(nm1 = str_c('manager_', new_id)) %>%
spread(nm1, new_id)

R - Replace values in a specific even column based on values from a odd specific column - Application to the whole dataframe

My data frame:
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
For A and qA (= quality of A): I want the values of A whose quality value is 1 or 3 to be replaced by NA.
The same goes for B and qB.
The final data should look like this:
desired_data <- data.frame(A = c("NA",5,6,"NA","NA"), qA = c(1,2,2,3,1), B = c(2,5,"NA","NA","NA"), qB = c(2,2,1,3,1))
My question is: how can I do that?
I have a big dataframe with about 90 columns, so I need code that doesn't depend on the column names to work properly.
To help, here is a piece of code that selects the columns starting with the letter "q":
data[,grep("^[q]", colnames(data))]
You could just do this...
data[,seq(1,ncol(data),2)][(data[,seq(2,ncol(data),2)]==1)|
(data[,seq(2,ncol(data),2)]==3)] <- NA
data
A qA B qB
1 NA 1 2 2
2 5 2 5 2
3 6 2 NA 1
4 NA 3 NA 3
5 NA 1 NA 1
One solution is to split the data into two tables and use vectorisation in base R
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1), B = c(2,5,6,8,4), qB = c(2,2,1,3,1))
data
#> A qA B qB
#> 1 1 1 2 2
#> 2 5 2 5 2
#> 3 6 2 6 1
#> 4 8 3 8 3
#> 5 7 1 4 1
quality <- data[,grep("^[q]", colnames(data))]
data2 <- data[,setdiff(colnames(data), names(quality))]
data2[quality == 1 | quality == 3] <- NA
data2
#> A B
#> 1 NA 2
#> 2 5 5
#> 3 6 NA
#> 4 NA NA
#> 5 NA NA
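Both answers pair value and quality columns by position. If the column order is not guaranteed, the pairs can instead be derived from the names, assuming every quality column is named "q" followed by the name of its value column (as in qA and A). A minimal sketch that modifies the data in place; the names qcols, vcols and bad are my own.

```r
data <- data.frame(A = c(1,5,6,8,7), qA = c(1,2,2,3,1),
                   B = c(2,5,6,8,4), qB = c(2,2,1,3,1))

qcols <- grep("^q", names(data), value = TRUE)  # quality columns: "qA", "qB"
vcols <- sub("^q", "", qcols)                   # matching value columns: "A", "B"
bad <- c(1, 3)                                  # quality codes to blank out

for (k in seq_along(qcols)) {
  # set the value to NA wherever its quality flag is one of the bad codes
  data[[vcols[k]]][data[[qcols[k]]] %in% bad] <- NA
}

data
```

The quality columns are left untouched, so the result keeps the same layout as the input.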

how to replace the NA in a data frame with the average number of this data frame

I have a data frame like this:
nums id
1233 1
3232 2
2334 3
3330 1
1445 3
3455 3
7632 2
NA 3
NA 1
I can get the average "nums" of each "id" by using:
id_avg <- aggregate(nums ~ id, data = dat, FUN = mean)
What I would like to do is replace each NA with the average of the corresponding id. For example, suppose the average "nums" for ids 1, 2 and 3 are 1000, 2000 and 3000, respectively. Then the NA where id == 3 will be replaced by 3000, and the last NA, where id == 1, will be replaced by 1000.
I tried the following code to achieve this:
temp <- dat[is.na(dat$nums),]$id
dat[is.na(dat$nums),]$nums <- id_avg[id_avg[,"id"] ==temp,]$nums
However, the second part
id_avg[id_avg[,"id"] ==temp,]$nums
is always NA, which means I always assign NA to the NAs I want to replace.
I don't know where I went wrong. Is there a better way to do this?
Thank you
Or you can fix it by:
dat[is.na(dat$nums),]$nums <- id_avg$nums[temp]
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
What you want is contained in the zoo package.
library(zoo)
na.aggregate.default(dat, by = dat$id)
nums id
1 1233.000 1
2 3232.000 2
3 2334.000 3
4 3330.000 1
5 1445.000 3
6 3455.000 3
7 7632.000 2
8 2411.333 3
9 2281.500 1
Here is a dplyr way:
library(dplyr)
dat %>%
group_by(id) %>%
mutate(nums = replace(nums, is.na(nums), as.integer(mean(nums, na.rm = TRUE))))
# Source: local data frame [9 x 2]
# Groups: id [3]
# nums id
# <int> <int>
# 1 1233 1
# 2 3232 2
# 3 2334 3
# 4 3330 1
# 5 1445 3
# 6 3455 3
# 7 7632 2
# 8 2411 3
# 9 2281 1
You essentially want to merge the id_avg back to the original data frame by the id column, so you can also use match to follow your original logic:
dat$nums[is.na(dat$nums)] <- id_avg$nums[match(dat$id[is.na(dat$nums)], id_avg$id)]
dat
# nums id
# 1: 1233.000 1
# 2: 3232.000 2
# 3: 2334.000 3
# 4: 3330.000 1
# 5: 1445.000 3
# 6: 3455.000 3
# 7: 7632.000 2
# 8: 2411.333 3
# 9: 2281.500 1
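Another base R option is ave(), which applies a function within each group and returns a vector aligned with the original rows, so no separate id_avg table is needed. A minimal sketch that rebuilds the data from the question:

```r
dat <- data.frame(
  nums = c(1233, 3232, 2334, 3330, 1445, 3455, 7632, NA, NA),
  id   = c(1, 2, 3, 1, 3, 3, 2, 3, 1)
)

# Within each id, replace NA by the mean of the non-missing values of that id
dat$nums <- ave(dat$nums, dat$id,
                FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

dat
```

An id whose values are all NA would get NaN here, since there is nothing to average.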
