I have data in which the education column takes values from 1 to 3, and the payment column also takes values from 1 to 3. I wanted to group them in pairs. How can I convert the table so that the payment is divided into 3 different columns?
I would like the table to look like this:
I tried to do this myself, but it gave me an error.
The pivot_wider() function from the tidyr package is the way to go. Note that the names_glue specification refers to the names_from column, so the placeholder is {Education}:
library(dplyr)
library(tidyr)
df %>%
  pivot_wider(names_from = Education,
              values_from = count,
              names_glue = "Education = {Education}")
# A tibble: 3 × 4
PaymentTier `Education = 1` `Education = 2` `Education = 3`
<dbl> <dbl> <dbl> <dbl>
1 1 1000 666 6543
2 2 33 2222 9999
3 3 455 1111 5234
Data:
df <- data.frame(
Education = c(1,1,1,2,2,2,3,3,3),
PaymentTier = c(1,2,3,1,2,3,1,2,3),
count = c(1000,33,455,666,2222,1111,6543,9999,5234)
)
I was working on the following problem. I've got monthly data from a survey; let's call it df1:
df1 = tibble(ID = c('1','2'), reported_value = c(1200, 31000), anchor_month = c(3,5))
ID reported_value anchor_month
1 1200 3
2 31000 5
So the first row was reported in March, but there's no way to know whether it reports March or February values, and it may also be an approximation of the real value. I've also got a table with actual values for each ID; let's call it df2:
df2 = tibble(ID = c('1', '2') %>% rep(4) %>% sort,
             real_value = c(1200, 1230, 11000, 10, 25000, 3100, 100, 31030),
             month = c(1, 2, 3, 4, 2, 3, 4, 5))
ID real_value month
1 1200 1
1 1230 2
1 11000 3
1 10 4
2 25000 2
2 3100 3
2 100 4
2 31030 5
So there are two challenges: first, I only care about each ID's anchor month or the month right before it; second, I want to match the reported value to the closest real value (sounds like a fuzzy join). My first challenge was to filter the second table so it only has the anchor month or the previous one, which I did as follows:
filter_aux = df1 %>%
  bind_rows(df1 %>% mutate(anchor_month = if_else(anchor_month == 1, 12, anchor_month - 1)))

df2 = df2 %>%
  inner_join(filter_aux, by = c('ID', 'month' = 'anchor_month')) %>%
  distinct(ID, real_value, month)
Reducing df2 to:
ID real_value month
1 1230 2
1 11000 3
2 100 4
2 31030 5
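(For reference, the same reduction can be written with semi_join(), which keeps matching rows without bringing over any columns, so no distinct() is needed afterwards; a sketch using the data above:)

```r
library(dplyr)

df1 <- tibble(ID = c('1', '2'), reported_value = c(1200, 31000), anchor_month = c(3, 5))
df2 <- tibble(ID = c('1', '2') %>% rep(4) %>% sort,
              real_value = c(1200, 1230, 11000, 10, 25000, 3100, 100, 31030),
              month = c(1, 2, 3, 4, 2, 3, 4, 5))

# The anchor month and the month before it (wrapping January back to December)
keep_months <- bind_rows(
  df1 %>% transmute(ID, month = anchor_month),
  df1 %>% transmute(ID, month = if_else(anchor_month == 1, 12, anchor_month - 1))
)

# semi_join keeps df2 rows whose (ID, month) pair appears in keep_months
df2_reduced <- df2 %>% semi_join(keep_months, by = c("ID", "month"))
df2_reduced
```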
Now I tried a difference_inner_join by ID and reported_value = real_value (df1 %>% difference_inner_join(df2, by = c('ID', 'reported_value' = 'real_value'))), but it throws a "non-numeric argument to binary operator" error. I'm guessing that's because ID is a string in my actual data, since difference joins compute a numeric distance on every by column, including ID. I'm no expert in fuzzy joins, so I guess I'm missing something.
My final dataframe would look like this:
ID reported_value anchor_month closest_value month
1 1200 3 1230 2
2 31000 5 31030 5
Thanks!
It was easier without fuzzyjoin:
df3 = df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  filter(dif == min(dif))
Output:
ID reported_value anchor_month real_value month dif
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1200 3 1230 2 30
2 2 31000 5 31030 5 30
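One caveat: filter(dif == min(dif)) returns every tied row when two months are equally close. If exactly one row per ID is needed, slice_min() (dplyr >= 1.0.0) is a drop-in replacement; a sketch using the reduced df2 from the question, renaming real_value to closest_value to match the desired output:

```r
library(dplyr)

df1 <- tibble(ID = c('1', '2'), reported_value = c(1200, 31000), anchor_month = c(3, 5))
# df2 after the anchor-month filtering step in the question
df2 <- tibble(ID = c('1', '1', '2', '2'),
              real_value = c(1230, 11000, 100, 31030),
              month = c(2, 3, 4, 5))

df3 <- df1 %>%
  left_join(df2, by = 'ID') %>%
  mutate(dif = abs(real_value - reported_value)) %>%
  group_by(ID) %>%
  slice_min(dif, n = 1, with_ties = FALSE) %>%  # exactly one row per ID, even on ties
  ungroup() %>%
  rename(closest_value = real_value)
df3
```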
I have a data frame like so...
df = tibble(id = c(64512, 64513, 64514, 64515),
            customer = c("a", "a", "b", "b"))
and want to join two further data frames by id like these...
uvp_new = tibble(id=c(64512, 64513, 64514), uvp=c(12, 14, 16))
uvp_old = tibble(id=c(64512, 64515), uvp=c(10, 18))
with the following logic: whenever there is an entry for uvp in uvp_new, I want to take that one (ignoring uvp_old); if there is no entry in uvp_new, I want to take the uvp from uvp_old.
Any help appreciated.
You can left_join() uvp_old and then use rows_update() with uvp_new:
library(dplyr)
df %>%
  left_join(uvp_old, by = "id") %>%
  rows_update(uvp_new, by = "id")
# A tibble: 4 x 3
id customer uvp
<dbl> <chr> <dbl>
1 64512 a 12
2 64513 a 14
3 64514 b 16
4 64515 b 18
Or, if there might be duplicated ids in df, it can be safer to rows_upsert() uvp_new into uvp_old first and then join the result to df:
uvp_old %>%
  rows_upsert(uvp_new, by = "id") %>%
  right_join(df, by = "id")
Here is a base R option using transform + merge:
transform(
  merge(merge(df, uvp_new, by = "id", all.x = TRUE), uvp_old, by = "id", all.x = TRUE),
  uvp = ifelse(is.na(uvp.x), uvp.y, uvp.x)
)[c("id", "customer", "uvp")]
which gives
id customer uvp
1 64512 a 12
2 64513 a 14
3 64514 b 16
4 64515 b 18
You can join the three together using two joins, keeping track of which data.frame the uvp column came from with suffixes. Then, you can select the first non-NA one with coalesce.
df %>%
  left_join(uvp_new, by = "id") %>%
  left_join(uvp_old, by = "id", suffix = c("_new", "_old")) %>%
  mutate(uvp = coalesce(uvp_new, uvp_old))
# id customer uvp_new uvp_old uvp
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 64512 a 12 10 12
# 2 64513 a 14 NA 14
# 3 64514 b 16 NA 16
# 4 64515 b NA 18 18
You can do a full join between uvp_new and uvp_old to get all the ids into one dataframe, then join the combined result with df and use coalesce to pick the new uvp value when present, falling back to the old one:
library(dplyr)
uvp_new %>%
  rename(uvp_n = uvp) %>%
  full_join(uvp_old %>% rename(uvp_o = uvp), by = 'id') %>%
  right_join(df, by = 'id') %>%
  mutate(uvp = coalesce(uvp_n, uvp_o))
# id uvp_n uvp_o customer uvp
# <dbl> <dbl> <dbl> <chr> <dbl>
#1 64512 12 10 a 12
#2 64513 14 NA a 14
#3 64514 16 NA b 16
#4 64515 NA 18 b 18
You can remove uvp_n and uvp_o columns if they are not needed.
After parsing xml files I have data looking like this:
example_df <-
tibble(id = "ABC",
wage_type = "salary",
name = c("Description","Code","Base",
"Description","Code","Base",
"Description","Code"),
value = c("wage_element_1","51B","600",
"wage_element_2","51C","740",
"wage_element_3","51D"))
example_df
# A tibble: 8 x 4
id wage_type name value
<chr> <chr> <chr> <chr>
1 ABC salary Description wage_element_1
2 ABC salary Code 51B
3 ABC salary Base 600
4 ABC salary Description wage_element_2
5 ABC salary Code 51C
6 ABC salary Base 740
7 ABC salary Description wage_element_3
8 ABC salary Code 51D
with roughly 1000 different ids, each having three possible values for wage_type.
I want to change the values in the name column into columns.
I have tried to use pivot_wider but am struggling to handle the resulting list-cols: since not every salary has a Base, the resulting list-cols are of different sizes, as below:
example_df <- example_df %>%
  pivot_wider(id_cols = c(id, wage_type),
              names_from = name,
              values_from = value)
example_df
# A tibble: 1 x 5
id wage_type Description Code Base
<chr> <chr> <list> <list> <list>
1 ABC salary <chr [3]> <chr [3]> <chr [2]>
So when I try to unnest the cols it throws an error:
example_df %>%
  unnest(cols = c(Description, Code, Base))
Error: Can't recycle `Description` (size 3) to match `Base` (size 2).
I understand that this is because tidyr functions do not recycle, but I could not find a way around it, or a base R solution to my problem. I also tried an unlist(strsplit(as.character(x))) approach, as per "how to split one row into multiple rows in R", but ran into the same column-length issue.
Desired output is as follows:
desired_df <-
tibble(
id=c("ABC","ABC","ABC"),
wage_type=c("salary","salary","salary"),
Description = c("wage_element_1","wage_element_2","wage_element_3"),
Code = c("51B","51C","51D"),
Base = c("600","740",NA))
desired_df
id wage_type Description Code Base
<chr> <chr> <chr> <chr> <chr>
1 ABC salary wage_element_1 51B 600
2 ABC salary wage_element_2 51C 740
3 ABC salary wage_element_3 51D NA
I would love a tidyr solution but any help would be appreciated. Thanks.
I would suggest this approach using tidyverse functions. The list columns appeared because pivot_wider() found duplicate id/wage_type/name combinations; creating an index variable like id2 makes each combination unique, so the reshaped data has no list columns:
library(tidyverse)
# Code
example_df %>%
  arrange(name) %>%
  group_by(id, wage_type, name) %>%
  mutate(id2 = 1:n()) %>%
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-id2)
Output:
# A tibble: 3 x 5
id wage_type Base Code Description
<chr> <chr> <chr> <chr> <chr>
1 ABC salary 600 51B wage_element_1
2 ABC salary 740 51C wage_element_2
3 ABC salary NA 51D wage_element_3
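An equivalent way to build that disambiguating index, shown here as a sketch: since every wage element starts with a Description row, cumsum(name == "Description") numbers the elements directly, and the original row order (and thus the Description/Code/Base column order) is preserved, so no arrange() is needed:

```r
library(tidyverse)

example_df <- tibble(id = "ABC",
                     wage_type = "salary",
                     name = c("Description", "Code", "Base",
                              "Description", "Code", "Base",
                              "Description", "Code"),
                     value = c("wage_element_1", "51B", "600",
                               "wage_element_2", "51C", "740",
                               "wage_element_3", "51D"))

res <- example_df %>%
  group_by(id, wage_type) %>%
  mutate(element = cumsum(name == "Description")) %>%  # 1,1,1, 2,2,2, 3,3
  ungroup() %>%
  pivot_wider(names_from = name, values_from = value) %>%
  select(-element)
res
```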
I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>%
  group_by(id) %>%
  summarise(meanFreq = mean(freq), minFreq = min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics, but if I merge by id I get redundant rows. I should only have one row per id, with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like, but my real data is messier than this neat setup:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2'), we can join with the distinct rows of the selected columns of the original data:
library(dplyr)
df2 %>%
  left_join(df1 %>%
              distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic in base R:
merge(df2, unique(df1[c(1:2, 4)]), by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>%
  group_by(id) %>%
  mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
  distinct(id, .keep_all = TRUE)
Now there are actually two possibilities: either id and species are essentially the same in your df (one is just a label for the other), or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species, and it is indeed the same as id, you could also just include it in the group_by:
df1 %>%
  group_by(id, species) %>%
  summarise(meanFreq = mean(freq), minFreq = min(freq))
This removes study and freq; if you need to keep them, you can again replace summarise with mutate and then use distinct with the .keep_all = TRUE argument.
I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with
data %>%
  group_by(grpA) %>%
  mutate(newCol = sum(valueC))
I get newCol <- c(70, 70, 70, 70, 70, 120, 120, 120).
How do I count each unique value of idB only once? Is there anything else I can use instead of group_by in the dplyr %>% pipe?
I can't use summarise, as I need to keep the values in otherD intact for later use.
The other option I have is to create newCol separately through SQL and then merge it with a left join, but I am looking for a better inline solution.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
We need unique() with match(), so that each distinct idB contributes its valueC to the sum only once:
data %>%
  group_by(grpA) %>%
  mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Or, another option: get the distinct rows by 'grpA' and 'idB', group by 'grpA', get the sum of 'valueC', and left_join with the original data:
data %>%
  distinct(grpA, idB, .keep_all = TRUE) %>%
  group_by(grpA) %>%
  summarise(newCol = sum(valueC)) %>%
  left_join(data, ., by = 'grpA')
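A shorter variant of the same idea, as a sketch: duplicated() flags repeated idB values within each group, so summing only the non-duplicated rows counts each idB exactly once, while keeping otherD intact:

```r
library(dplyr)

data <- data.frame(
  grpA = c(1, 1, 1, 1, 1, 2, 2, 2),
  idB = c(1, 1, 2, 2, 3, 4, 5, 6),
  valueC = c(10, 10, 20, 20, 10, 30, 40, 50),
  otherD = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# !duplicated(idB) is evaluated per group, keeping one row per distinct idB
res <- data %>%
  group_by(grpA) %>%
  mutate(newCol = sum(valueC[!duplicated(idB)])) %>%
  ungroup()
res$newCol
```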