I have a dataset that has a column called QTY in which most of the values are already summed, but a few are several integers separated by commas. How can I replace those rows with the sums of the values?
I have:
ID Name QTY
1 Abc 2
2 Bac 3
3 Cba 2, 4, 5, 8
4 Bcb 4, 1
Desired result:
ID Name QTY
1 Abc 2
2 Bac 3
3 Cba 19
4 Bcb 5
I've tried messing around with for loops a bit and using ifelse(), but I can't quite figure it out.
This looks a bit ugly but should work. Assuming column QTY is a character -
your_df$QTY_new <- sapply(strsplit(your_df$QTY, ", "), function(x) sum(as.numeric(x)))
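As a self-contained check, here is a minimal sketch of the same idea on the question's sample data. It assumes QTY is stored as character. The data frame is rebuilt here for illustration, and `trimws()` is an addition so the split does not depend on a space following every comma.

```r
# Sample data from the question; QTY is assumed to be a character column
your_df <- data.frame(ID = 1:4,
                      Name = c("Abc", "Bac", "Cba", "Bcb"),
                      QTY = c("2", "3", "2, 4, 5, 8", "4, 1"),
                      stringsAsFactors = FALSE)

# Split each entry on commas, giving one character vector per row
parts <- strsplit(your_df$QTY, ",", fixed = TRUE)

# Trim stray whitespace, convert to numeric, and sum within each row
your_df$QTY_new <- vapply(parts,
                          function(x) sum(as.numeric(trimws(x))),
                          numeric(1))
your_df$QTY_new
```

`vapply()` is used instead of `sapply()` only so that the result is guaranteed to be a numeric vector with one value per row.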
Using a for loop, it can be done this way:
library(data.table)

data <- data.table(ID = 1:4,
                   Name = c("Abc", "Bac", "Cba", "Bcb"),
                   QTY = c("2", "3", "2, 4, 5, 8", "4, 1"),
                   QTY2 = numeric(4))
# For each row, split QTY on ", ", convert the pieces to numeric, and sum them
for (i in seq_len(nrow(data))) {
  data$QTY2[i] <- sum(as.numeric(unlist(strsplit(as.character(data$QTY[i]), ", "))))
}
and the resulting DF is:
ID Name QTY QTY2
1: 1 Abc 2 2
2: 2 Bac 3 3
3: 3 Cba 2, 4, 5, 8 19
4: 4 Bcb 4, 1 5
I wrote a function that solves this. Here it is, followed by an explanation of how it works:
sumInRow = function(row_value, split = ",") {
# 1. split the values
row_value = strsplit(row_value, split = split)
# 2. Convert them to numeric and sum
row_sum = sapply(row_value, function(single_row) {
single_row = as.numeric(single_row)
return(sum(single_row))
})
return(row_sum)
}
The row_value, by default, will be a character because of the comma.
Then for each value we need to split them:
row_value = strsplit(row_value, split = split)
This returns a list containing the split pieces for every element of row_value. Don't worry, we'll use it in the next step.
row_sum = sapply(row_value, function(single_row) {
single_row = as.numeric(single_row)
return(sum(single_row))
})
The sapply function works as an iterator: for each element of the list, it applies the given function, which converts the pieces to numeric and returns their sum.
To use it, call:
sumInRow(<your data frame>$QTY)
I hope it helps you.
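To make the intermediate step concrete, the sketch below shows what `strsplit()` returns for the question's QTY values and how `sapply()` then reduces each list element to one sum. The vector `qty` is a stand-in for the QTY column and is not part of the original answer.

```r
# Stand-in for the QTY column from the question
qty <- c("2", "3", "2, 4, 5, 8", "4, 1")

# Step 1: strsplit() returns a list with one character vector per input element
pieces <- strsplit(qty, split = ",")
str(pieces)

# Step 2: convert each vector to numeric and sum it.
# as.numeric() tolerates the leading space left in pieces such as " 4".
sums <- sapply(pieces, function(single_row) sum(as.numeric(single_row)))
sums
```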
Here is one option with tidyverse. We split the 'QTY' column on the delimiter to expand the rows (separate_rows), group by 'ID' and 'Name', and take the sum of 'QTY'.
library(tidyverse)
df1 %>%
separate_rows(QTY, convert = TRUE) %>%
group_by(ID, Name) %>%
summarise(QTY = sum(QTY))
# A tibble: 4 x 3
# Groups: ID [4]
# ID Name QTY
# <int> <chr> <int>
#1 1 Abc 2
#2 2 Bac 3
#3 3 Cba 19
#4 4 Bcb 5
data
df1 <- structure(list(ID = 1:4, Name = c("Abc", "Bac", "Cba", "Bcb"),
QTY = c("2", "3", "2, 4, 5, 8", "4, 1")), class = "data.frame", row.names = c(NA,
-4L))
I've been trying to figure this out for a few hours and I'm hoping someone can point me in the right direction. I'm trying to get from the below data set named "current data"
current_data <-
tribble(
~ID, ~grade_Q1, ~points_Q1,
"1", c("D-", "C-", "C-", "C-"), c(1, 2, 2, 2),
"2", c("A", "B", "B+", "B+"), c(4, 3, 3, 3),
)
to the below dataset named "updated_data"
updated_data <-
tribble(
~ID, ~grade_Q1, ~points_Q1, ~n_grades,
"1", "D- C C- C-", "1 2 2 2", 4,
"2", "A B B+ B+ A", "4 3 3 3 4", 5
)
The "n_grades" column is just a count of the number of letter grades in the "grade_Q1" column. Does anyone have any ideas on how to proceed?
We can get the lengths of 'grade_Q1' to create the n_grades, then loop over the list columns with map, concatenate into a single string with str_c
library(dplyr)
library(stringr)
library(purrr)
current_data %>%
mutate(n_grades = lengths(grade_Q1),
grade_Q1 = map_chr(grade_Q1, str_c, collapse= ' '),
points_Q1 = map_chr(points_Q1, str_c, collapse = ' '))
-output
# A tibble: 2 x 4
# ID grade_Q1 points_Q1 n_grades
# <chr> <chr> <chr> <int>
#1 1 D- C- C- C- 1 2 2 2 4
#2 2 A B B+ B+ A 4 3 3 3 4 5
If there are many columns, it can be simplified with across
current_data %>%
mutate(n_grades = lengths(grade_Q1),
across(c(grade_Q1, points_Q1), ~ map_chr(., str_c, collapse= ' ')))
Or using base R
current_data$n_grades <- lengths(current_data$grade_Q1)
current_data[c("grade_Q1", "points_Q1")] <-
lapply(current_data[c("grade_Q1", "points_Q1")],
sapply, paste, collapse= ' ')
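The base R approach can be tried without any packages. The sketch below rebuilds the input as a plain data frame with list columns (an assumption made so the example is self-contained; the question uses a tibble) and uses `vapply()` in place of `sapply()` so each column is guaranteed to come back as a character vector. Note that with the input exactly as posted, both rows contain four grades.

```r
# Rebuild the question's input as a data frame with two list columns
current_data <- data.frame(ID = c("1", "2"), stringsAsFactors = FALSE)
current_data$grade_Q1  <- list(c("D-", "C-", "C-", "C-"), c("A", "B", "B+", "B+"))
current_data$points_Q1 <- list(c(1, 2, 2, 2), c(4, 3, 3, 3))

# Count the grades per row before the list column is collapsed
current_data$n_grades <- lengths(current_data$grade_Q1)

# Collapse every list element into a single space-separated string
cols <- c("grade_Q1", "points_Q1")
current_data[cols] <- lapply(current_data[cols], function(col) {
  vapply(col, paste, character(1), collapse = " ")
})
current_data
```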
A data.table option
setDT(current_data)[
,
c(
lapply(.SD, function(x) paste0(unlist(x), collapse = " ")),
n_grades = lengths(grade_Q1)
),
ID
][]
gives
ID grade_Q1 points_Q1 n_grades
1: 1 D- C- C- C- 1 2 2 2 4
2: 2 A B B+ B+ A 4 3 3 3 4 5
I tried several times and it does not work.
How can I split sentences contained in a cell into different rows maintaining the rest of the values?
Example:
Dataframe df has 20 columns.
Row j, Column i contains some comments which are separated by " | "
I want a new dataframe df2 in which the number of rows increases according to the number of sentences in that cell.
This means, if cell j,i has Sentence A | Sentence B
Row j, Column i has Sentence A
Row j+1, Column i has Sentence B
Columns 1 to i-1 and i+1 to 20 have the same value in rows j and j+1.
I do not know if this has an easy solution.
Thank you very much.
We could use cSplit from splitstackshape
library(splitstackshape)
cSplit(df, 'col3', sep="\\|", "long", fixed = FALSE)
# col1 col2 col3
#1: a 1 fitz
#2: a 1 buzz
#3: b 2 foo
#4: b 2 bar
#5: c 3 hello world
#6: c 3 today is Thursday
#7: c 3 its 2:00
#8: d 4 fitz
data
df <- structure(list(col1 = c("a", "b", "c", "d"), col2 = c(1, 2, 3,
4), col3 = c("fitz|buzz", "foo|bar", "hello world|today is Thursday | its 2:00",
"fitz")), class = "data.frame", row.names = c(NA, -4L))
Here is a solution using 3 tidyverse packages that accounts for an unknown maximum number of comments
library(dplyr)
library(tidyr)
library(stringr)
# Helper: find the maximum number of comments per observation in a column
# and return a vector of unique column names ("V01", "V02", ...)
cols <- function(x) {
  cmts <- str_count(x, "([|])")
  max_cmts <- max(cmts, na.rm = TRUE) + 1
  sprintf("V%02d", seq_len(max_cmts))
}
# Create the data
df1 <- data.frame(col1 = c("a", "b", "c", "d"),
col2 = c(1, 2, 3, 4),
col3 = c("fitz|buzz", NA,
"hello world|today is Thursday | its 2:00|another comment|and yet another comment", "fitz"),
stringsAsFactors = FALSE)
# Generate the desired output
df2 <- separate(df1, col3, into = cols(x = df1$col3),
sep = "([|])", extra = "merge", fill = "right") %>%
pivot_longer(cols = cols(x = df1$col3), values_to = "comments",
values_drop_na = TRUE) %>%
select(-name)
Which results in
df2
# A tibble: 8 x 3
col1 col2 comments
<chr> <dbl> <chr>
1 a 1 "fitz"
2 a 1 "buzz"
3 c 3 "hello world"
4 c 3 "today is Thursday "
5 c 3 " its 2:00"
6 c 3 "another comment"
7 c 3 "and yet another comment"
8 d 4 "fitz"
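For comparison, the same reshaping can be sketched in base R without extra packages: split the column, repeat each row once per piece, and overwrite the column with the pieces. This uses the first answer's sample data; `trimws()` is an optional addition that removes the spaces around " | ", and the names `pieces` and `df2` are illustrative.

```r
# Sample data from the cSplit answer
df <- data.frame(col1 = c("a", "b", "c", "d"),
                 col2 = c(1, 2, 3, 4),
                 col3 = c("fitz|buzz", "foo|bar",
                          "hello world|today is Thursday | its 2:00", "fitz"),
                 stringsAsFactors = FALSE)

# Split each cell on the literal "|" character
pieces <- strsplit(df$col3, "|", fixed = TRUE)

# Repeat every row once per piece, keeping all other columns unchanged
df2 <- df[rep(seq_len(nrow(df)), lengths(pieces)), ]

# Replace the split column with the individual sentences
df2$col3 <- trimws(unlist(pieces))
rownames(df2) <- NULL
df2
```

One caveat of this sketch: a cell containing an empty string splits into zero pieces, so its row would be dropped, whereas an NA cell is kept as a single NA row.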
I have two databases - old one and update one.
Both have same structures, with unique ID.
If record changes - there's new record with same ID and new data.
So after rbind(m1,m2) I have duplicated records.
I can't just remove duplicated ID's, since the data could be updated.
There's no way to tell which record is newer, other than whether it comes from the old file or the update file.
How can I merge two tables, and if there's row with duplicated ID, leave the one from newer file?
I know I could add column to both and just ifelse() this, but I'm looking for something more elegant, preferably oneliner.
It is hard to give a correct answer without sample data, but here is an approach that you can adjust to your data.
#sample data
library( data.table )
dt1 <- data.table( id = 2:3, value = c(2,4))
dt2 <- data.table( id = 1:2, value = c(2,6))
#dt1
# id value
# 1: 2 2
# 2: 3 4
#dt2
# id value
# 1: 1 2
# 2: 2 6
#rowbind...
DT <- rbindlist( list(dt1,dt2), use.names = TRUE )
# id value
# 1: 2 2
# 2: 3 4
# 3: 1 2
# 4: 2 6
#deselect duplicated id from the bottom up
# assuming the last file in the list contains the updated values
DT[ !duplicated(id, fromLast = TRUE), ]
# id value
# 1: 3 4
# 2: 1 2
# 3: 2 6
Say you have:
old <- data.frame(id = c(1,2,3,4,5), val = c(21,22,23,24,25))
new <- data.frame(id = c(1,4), val = c(21,27))
so the record with id 4 has changed in the new dataset and 1 is a pure duplicate.
You can use dplyr::anti_join to find old records not in the new dataset and then just use rbind to add the new ones on.
joined <- rbind(dplyr::anti_join(old, new, by = "id"), new)
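If dplyr is not available, the anti-join step can be approximated in base R with `%in%`. This is a sketch of the same idea using the sample data above, not the answer's original code; the final sort by id is an optional addition.

```r
old <- data.frame(id = c(1, 2, 3, 4, 5), val = c(21, 22, 23, 24, 25))
new <- data.frame(id = c(1, 4), val = c(21, 27))

# Keep only the old rows whose id does not appear in the update,
# then append every row from the update
joined <- rbind(old[!old$id %in% new$id, ], new)

# Optional: restore the id order and reset the row names
joined <- joined[order(joined$id), ]
rownames(joined) <- NULL
joined
```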
You could use dplyr:
library(dplyr)
df_new %>%
  full_join(df_old, by = "id") %>%
  transmute(id = id, value = coalesce(value.x, value.y))
returns
id value
1 1 0.03432355
2 2 0.28396359
3 3 0.01121692
4 4 0.57214035
5 5 0.67337745
6 6 0.67637187
7 7 0.69178855
8 8 0.83953140
9 9 0.55350251
10 10 0.27050363
11 11 0.28181032
12 12 0.84292569
given
df_new <- structure(list(id = 1:10, value = c(0.0343235526233912, 0.283963593421504,
0.011216921498999, 0.572140350239351, 0.673377452883869, 0.676371874753386,
0.691788548836485, 0.839531400706619, 0.553502510068938, 0.270503633422777
)), class = "data.frame", row.names = c(NA, -10L))
df_old <- structure(list(id = c(1, 4, 5, 3, 7, 9, 11, 12), value = c(0.111697669373825,
0.389851713553071, 0.252179590053856, 0.91874519130215, 0.504653975600377,
0.616259852424264, 0.281810319051147, 0.842925694771111)), class = "data.frame", row.names = c(NA,
-8L))
Follow-up question to Dynamically create value labels with haven::labelled, where akrun provided a good answer using deframe.
I am using haven::labelled to set value labels of a variable. The goal is to create a fully documented dataset I can export to SPSS.
Now, say I have a df value_labels of values and their value labels. I also have a df df_data with variables to which I want to allocate value labels.
value_labels <- tibble(
value = c(seq(1:6), seq(1:3), NA),
labels = c(paste0("value", 1:6),paste0("value", 1:3), NA),
name = c(rep("var1", 6), rep("var2", 3), "var3")
)
df_data <- tibble(
id = 1:10,
var1 = floor(runif(10, 1, 7)),
var2 = floor(runif(10, 1, 4)),
var3 = rep("string", 10)
)
Manually, I would create value labels for df_data$var1 and df_data$var2 like so:
df_data$var1 <- haven::labelled(df_data$var1, labels = c(values1 = 1, values2 = 2, values3 = 3, values4 = 4, values5 = 5, values6 = 6))
df_data$var2 <- haven::labelled(df_data$var2, labels = c(values1 = 1, values2 = 2, values3 = 3))
I need a more dynamic way of assigning the correct value labels to the correct variable in a large dataset. The solution also needs to ignore character vectors, since I don't want these to have value labels. For that reason, var3 in value_labels is listed as NA.
The solution does not need to work with multiple datasets in a list.
Here is one option: after removing the NA rows, split the vector of values (named by 'labels') by 'name', use the names of the resulting list to subset the columns of 'df_data', apply labelled to each, and assign the result back to the same columns.
lbls2 <- na.omit(value_labels)
lstLbls <- with(lbls2, split(setNames(value, labels), name))
df_data[names(lstLbls)] <- Map(haven::labelled,
df_data[names(lstLbls)], labels = lstLbls)
df_data
# A tibble: 10 x 4
# id var1 var2 var3
# <int> <dbl+lbl> <dbl+lbl> <chr>
# 1 1 2 [value2] 2 [value2] string
# 2 2 5 [value5] 2 [value2] string
# 3 3 4 [value4] 1 [value1] string
# 4 4 1 [value1] 2 [value2] string
# 5 5 1 [value1] 1 [value1] string
# 6 6 6 [value6] 2 [value2] string
# 7 7 1 [value1] 3 [value3] string
# 8 8 1 [value1] 1 [value1] string
# 9 9 3 [value3] 3 [value3] string
#10 10 6 [value6] 1 [value1] string
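The key step is the `split(setNames(value, labels), name)` call, which builds one named vector of labels per variable. The sketch below runs just that step in base R so the intermediate list can be inspected. It uses a plain data frame instead of a tibble, which is an assumption made to avoid package dependencies.

```r
# Same lookup table as in the question, as a base R data frame
value_labels <- data.frame(
  value  = c(1:6, 1:3, NA),
  labels = c(paste0("value", 1:6), paste0("value", 1:3), NA),
  name   = c(rep("var1", 6), rep("var2", 3), "var3"),
  stringsAsFactors = FALSE
)

# Drop the var3 row, which has no value or label
lbls2 <- na.omit(value_labels)

# Name each value by its label, then split the named vector by variable name
lstLbls <- with(lbls2, split(setNames(value, labels), name))
lstLbls
```

Each element of `lstLbls` has the shape that the `labels` argument of `haven::labelled` expects, and var3 is absent from the list, so the character column is left untouched.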
I would like to create a new identifier column in each data frame, with values taken from the name of the containing nested list.
parent <- list(
a = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))),
b = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))))
Therefore, the result for the first data frame in list a would look like this:
> foo
first second identifier
1 1 4 a
2 2 5 a
3 3 6 a
The first data frame in list b would look like this:
>foo
first second identifier
1 1 4 b
2 2 5 b
3 3 6 b
Seems like you might want something like this
Map(function(name, list) {
lapply(list, function(x) cbind(x, identifier=name))
}, names(parent), parent)
Here we use Map() and take the list and the names of the list and just cbind those identifiers into the data.frames.
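A reduced version of the example makes the result easy to verify. The list below is a shortened copy of `parent` with two data frames per sublist, and the argument name `inner` and the object `result` are illustrative.

```r
# Shortened version of the question's nested list
parent <- list(
  a = list(foo = data.frame(first = 1:3, second = 4:6),
           bar = data.frame(first = 1:3, second = 4:6)),
  b = list(foo = data.frame(first = 1:3, second = 4:6),
           bar = data.frame(first = 1:3, second = 4:6)))

# Walk over the outer names and the sublists in parallel,
# adding the outer name as a column to every inner data frame
result <- Map(function(name, inner) {
  lapply(inner, function(x) cbind(x, identifier = name))
}, names(parent), parent)

result$b$foo
```

Because the first argument passed to `Map()` is a character vector, the result keeps "a" and "b" as its names.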
We could use tidyverse. Loop through the outer list with imap, which passes both the values and the keys (the names of the list) as .x and .y. Then, with map2, loop through the inner list of data.frames and use mutate to create the column 'identifier' from .y, i.e. the name of the outer list.
library(tidyverse)
imap(parent, ~ map2(.x, .y, ~ .x %>%
mutate(identifier = .y)))
#$a
#$a$foo
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$a$bar
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$a$puppy
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$b
#$b$foo
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
#$b$bar
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
#$b$puppy
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
If we want the column to be based on the data.frame name instead, loop through the outer list elements with map, then use imap on the inner list to get its keys (the names of the inner list) and create the new column 'identifier'.
map(parent, ~ imap(.x, ~ .x %>%
mutate(identifier = .y)))
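The same variant can be sketched in base R by nesting `Map()` inside `lapply()`. The list is again a shortened copy of `parent`, and the names `inner`, `nm`, and `result` are illustrative.

```r
# Shortened version of the question's nested list
parent <- list(
  a = list(foo = data.frame(first = 1:3, second = 4:6),
           bar = data.frame(first = 1:3, second = 4:6)),
  b = list(foo = data.frame(first = 1:3, second = 4:6),
           bar = data.frame(first = 1:3, second = 4:6)))

# For every sublist, pair each data frame with its own name
# and add that name as the identifier column
result <- lapply(parent, function(inner) {
  Map(function(x, nm) cbind(x, identifier = nm), inner, names(inner))
})

result$a$bar
```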