I have a data frame in R with a column (sni) containing numbers and number ranges, which looks like this:
bransch         sni
name            15
name            15
name            16-18
somename        16-18
name            241-3
someothername   241-3
I need to create a new column with just one number per row, i.e. no intervals. Each individual value in an interval should get its own row, so the result should look like this:
bransch         sni
name            15
name            15
name            16
name            17
name            18
somename        16
somename        17
somename        18
name            241
name            242
name            243
someothername   241
someothername   242
someothername   243
I'm a bit unsure which function does this best, or whether someone has stumbled upon a similar problem or solution. So far I have tried to split the sni column at the "-" into two new columns, but then I get stuck, because one of the new columns has many rows without any value. The column is also of type character at the moment.
Any advice?
Sincerely,
TS
It took a while. Here is a tidyverse approach:
library(dplyr)
library(tidyr)
df %>%
separate(sni, c("x", "y")) %>%
as_tibble() %>%
mutate(y = ifelse(as.numeric(y)<=9, paste0(substr(x, 1, nchar(x)-1), y),
y)) %>%
mutate(id = row_number()) %>%
pivot_longer(c(x,y)) %>%
mutate(value = as.numeric(value)) %>%
group_by(col2 =as.integer(gl(n(),2,n()))) %>%
fill(value, .direction = "down") %>%
complete(value = seq(first(value), last(value), by=1)) %>%
fill(bransch, .direction = "down") %>%
select(bransch, sni=value) %>%
group_by(col2, sni) %>%
slice(1)
col2 bransch sni
<int> <chr> <dbl>
1 1 name 15
2 2 name 15
3 3 name 16
4 3 name 17
5 3 name 18
6 4 somename 16
7 4 somename 17
8 4 somename 18
9 5 name 241
10 5 name 242
11 5 name 243
12 6 someothername 241
13 6 someothername 242
14 6 someothername 243
Let's try this.
Assume that only three-digit intervals are abbreviated, i.e. written as 123-5 instead of 123-125. In the ifelse, we therefore expand this special pattern (e.g. 123-5) into the regular form (123-125), and then split the interval into individual integers with separate_rows.
We can then use complete to fill in the missing sequence in the interval.
library(tidyverse)
df %>%
group_by(sni,bransch) %>%
mutate(sni2 = ifelse(grepl("-", sni) & nchar(sub("-.*$", "", sni)) >= 3,
sub("^(\\d\\d)(.)-", "\\1\\2-\\1", sni),
sni)) %>%
separate_rows(sni2, convert = T) %>%
complete(sni2 = min(sni2):max(sni2)) %>%
ungroup() %>%
select(-sni)
# A tibble: 14 × 2
bransch sni2
<chr> <int>
1 name 15
2 name 15
3 name 16
4 name 17
5 name 18
6 somename 16
7 somename 17
8 somename 18
9 name 241
10 name 242
11 name 243
12 someothername 241
13 someothername 242
14 someothername 243
If I understood correctly
tmp=setNames(strsplit(df$sni,"-"),df$bransch)
tmp=unlist(
lapply(tmp,function(x){
x=as.numeric(x)
if (length(x)>1) {
if (x[1]<x[2]) {
seq(x[1],x[2],1)
} else {
seq(x[1],x[1]+x[2]-1,1)
}
} else {
x
}
})
)
data.frame(
"bransch"=names(tmp),
"sni"=tmp
)
bransch sni
1 name 15
2 name 15
3 name1 16
4 name2 17
5 name3 18
6 somename1 16
7 somename2 17
8 somename3 18
9 name1 241
10 name2 242
11 name3 243
12 someothername1 241
13 someothername2 242
14 someothername3 243
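Note that the else branch above treats the second number as a count (x[1]+x[2]-1), which only matches the question's 241-3 example because the start value ends in 1. Here is a minimal base R sketch under a different assumption: a short end value replaces the trailing digits of the start value, so "241-3" means 241:243 and "242-5" means 242:245. The helper name expand_sni is made up for illustration, and it builds the result with rep() so the bransch names do not get numeric suffixes.

```r
# Expand one sni string such as "15", "16-18" or "241-3" into integers.
# Assumption (not confirmed by the question): a short end value replaces
# the trailing digits of the start value, so "241-3" means 241:243.
expand_sni <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)[[1]]
  from <- parts[1]
  to <- if (length(parts) > 1) parts[2] else from
  if (nchar(to) < nchar(from)) {
    # Borrow the leading digits of the start value for the end value
    to <- paste0(substr(from, 1, nchar(from) - nchar(to)), to)
  }
  seq(as.integer(from), as.integer(to))
}

# Data frame rebuilt from the rows shown in the question
df <- data.frame(
  bransch = c("name", "name", "name", "somename", "name", "someothername"),
  sni = c("15", "15", "16-18", "16-18", "241-3", "241-3"),
  stringsAsFactors = FALSE
)

expanded <- lapply(df$sni, expand_sni)

# Repeat each bransch once per expanded value
result <- data.frame(
  bransch = rep(df$bransch, lengths(expanded)),
  sni = unlist(expanded)
)
result
```

If the real data never abbreviates the end value, the nchar() branch simply never fires and the function behaves like a plain from:to expansion.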
Use separate to get the start and end of each sequence; then we can map and unnest to get the result.
library(tidyverse)
data %>%
separate(
sni,
into = c("from", "to"),
fill = "right",
convert = TRUE) %>%
mutate(to = if_else(is.na(to), from, to)) %>%
transmute(
bransch,
sni = map2(from, to, `:`)) %>%
unnest_longer(sni)
# A tibble: 14 x 2
bransch sni
<chr> <int>
1 name 15
2 name 15
3 name 16
4 name 17
5 name 18
6 some name 16
7 some name 17
8 some name 18
9 name 241
10 name 242
11 name 243
12 someothername 241
13 someothername 242
14 someothername 243
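Data (ranges are written in full here, e.g. "241-243", rather than in the abbreviated "241-3" form from the question)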
data <- tibble(
bransch = c("name","name","name","some name","name","someothername"),
sni =c("15","15","16-18","16-18","241-243","241-243"))
Related
I have a dataset with staff information. I have a column that lists their current age and a column that lists their salary. I want to create an R data frame that has 3 columns: one to show all the unique ages, one to count the number of people who are that age and one to give me the median salary for each particular age. On top of this, I would like to group those who are under 21 and over 65. Ideally it would look like this:
age        number of people   median salary
Under 21   36                 26,300
22         15                 26,300
23         30                 27,020
24         41                 26,300
etc
Over 65    47                 39,100
The current dataset has hundreds of columns and thousands of rows but the columns that are of interest are like this:
ageyears   sal22
46         28,250
32         26,300
19         27,020
24         26,300
53         36,105
47         39,100
47         26,200
70         69,500
68         75,310
I'm a bit lost on the best way to do this but assume some sort of loop would work best? Thanks so much for any direction or help.
library(tidyverse)
sample_data <- tibble(
age = sample(17:70, 100, replace = TRUE) %>% as.character(),
salary = sample(20000:90000, 100, replace = TRUE)
)
# A tibble: 100 × 2
age salary
<chr> <int>
1 56 35130
2 56 44203
3 20 28701
4 47 66564
5 66 60823
6 54 36755
7 66 30731
8 68 21338
9 19 80875
10 61 44547
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
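sample_data %>%
# age is stored as character, so convert it before comparing numerically;
# note that <= and >= put ages 21 and 65 inside the grouped bands
mutate(age = case_when(as.numeric(age) <= 21 ~ "Under 21",
as.numeric(age) >= 65 ~ "Over 65",
TRUE ~ age)) %>%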
group_by(age) %>%
summarise(count = n(),
median_salary = median(salary))
# A tibble: 38 × 3
age count median_salary
<chr> <int> <dbl>
1 22 4 46284.
2 23 3 55171
3 25 3 74545
4 27 1 37052
5 28 3 66006
6 29 1 82877
7 30 2 40342.
8 31 2 27815
9 32 1 32282
10 33 3 64523
# … with 28 more rows
# ℹ Use `print(n = ...)` to see more rows
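For the columns actually named in the question, a minimal sketch could look like the one below. It assumes that ageyears is numeric and that sal22 is stored as text with thousands separators (as the "28,250" style suggests); the names staff, age_group, people and median_salary are made up for illustration. It also uses strict bounds, so 21 and 65 keep their own rows.

```r
library(dplyr)

# Hypothetical data frame built from the rows shown in the question;
# sal22 is assumed to be text with thousands separators.
staff <- data.frame(
  ageyears = c(46, 32, 19, 24, 53, 47, 47, 70, 68),
  sal22 = c("28,250", "26,300", "27,020", "26,300", "36,105",
            "39,100", "26,200", "69,500", "75,310")
)

age_summary <- staff %>%
  mutate(
    # Strip the commas before converting the salary to a number
    salary = as.numeric(gsub(",", "", sal22)),
    # Strictly under 21 and strictly over 65 are pooled
    age_group = case_when(
      ageyears < 21 ~ "Under 21",
      ageyears > 65 ~ "Over 65",
      TRUE ~ as.character(ageyears)
    )
  ) %>%
  group_by(age_group) %>%
  summarise(people = n(), median_salary = median(salary))

age_summary
```

No loop is needed: group_by plus summarise does the per-age counting and the median in one pass.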
There are some similar questions, however I haven't been able to find the solution for my data:
ID <- c(27,46,72)
Gest1 <- c(27,28,29)
Sys1 <- c(120,123,124)
Dia1 <- c(90,89,92)
Gest2 <- c(29,28,30)
Sys2 <- c(122,130,114)
Dia2 <- c(89,78,80)
Gest3 <- c(32,29,30)
Sys3 <- c(123,122,124)
Dia3 <- c(90,88,89)
Gest4 <- c(33,30,32)
Sys4 <- c(124,123,128)
Dia4 <- c(94,89,80)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3,
Dia3,Gest4,Sys4,Dia4)
df.1
What I need to do is identify where there are any cases of gestational age duplicates (variables beginning with Gest), and then find the mean of the associated Sys and Dia variables.
Once the mean has been calculated, I need to replace the duplicates with just 1 Gest variable, and the mean of the Sys variable and the mean of the Dia variable. Everything after those duplicates should then be moved up the dataframe.
Here is what it should look like:
df.2
My real data has 25 Gest variables with 25 associated Sys variables and 25 associated Dia variables.
Sorry if this is confusing! I've tried to write an ok question but it is my first time using stack overflow.
Thank you!!
This is easier to manage in long (and tidy) format.
Using tidyverse, you can use pivot_longer to put the data into long form. After grouping by ID and Gest, you can replace the Sys and Dia values with their mean. If there is more than one row with the same Gest for a given ID, this gives the average across those rows.
Then, you can keep that row of data with slice. After grouping by ID, you can renumber after combining those with common Gest values.
library(tidyverse)
df.1 %>%
pivot_longer(cols = -ID, names_to = c(".value", "number"), names_pattern = "(\\D+)(\\d+)") %>%
group_by(ID, Gest) %>%
mutate(across(c(Sys, Dia), mean)) %>%
slice(1) %>%
group_by(ID) %>%
mutate(number = row_number())
Output
ID number Gest Sys Dia
<dbl> <int> <dbl> <dbl> <dbl>
1 27 1 27 120 90
2 27 2 29 122 89
3 27 3 32 123 90
4 27 4 33 124 94
5 46 1 28 126. 83.5
6 46 2 29 122 88
7 46 3 30 123 89
8 72 1 29 124 92
9 72 2 30 119 84.5
10 72 3 32 128 80
Note - I would keep in long form - but if you wanted wide again, you can add:
pivot_wider(id_cols = ID, names_from = number, values_from = c(Gest, Sys, Dia))
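Since the real data has 25 numbered variables, it is worth checking how a candidate names_pattern splits a two-digit name before running the pivot. This is a small sketch using base R's sub() with perl = TRUE, on the assumption that the pattern splits names the same way inside pivot_longer; the name "Gest25" is a hypothetical column from the real data.

```r
# Hypothetical two-digit column name, as would occur with 25 Gest variables
nm <- "Gest25"

# "\\w" also matches digits, so the greedy first group swallows the "2"
prefix_w <- sub("^(\\w+)(\\d+)$", "\\1", nm, perl = TRUE)
suffix_w <- sub("^(\\w+)(\\d+)$", "\\2", nm, perl = TRUE)

# "\\D" matches non-digits only, so the split falls before the first digit
prefix_D <- sub("^(\\D+)(\\d+)$", "\\1", nm, perl = TRUE)
suffix_D <- sub("^(\\D+)(\\d+)$", "\\2", nm, perl = TRUE)

c(prefix_w, suffix_w)  # "Gest2" "5"
c(prefix_D, suffix_D)  # "Gest"  "25"
```

With single-digit suffixes both patterns give the same split, which is why the difference does not show up in the four-variable example.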
This involves changing the structure of the table into long format, averaging the duplicates, and then reshaping back into the desired wide table:
library(tidyr)
library(dplyr)
df.1 <- data.frame(ID,Gest1,Sys1,Dia1,Gest2,Sys2,Dia2,Gest3,Sys3, Dia3,Gest4,Sys4,Dia4)
#convert data to long format
longdf <- df.1 %>% pivot_longer(!ID, names_to = c(".value", "time"), names_pattern = "(\\D+)(\\d+)")
#average duplicate rows
temp<-longdf %>% group_by(ID, Gest) %>% summarize(Sys=mean(Sys), Dia=mean(Dia)) %>% mutate(time = row_number())
#convert back to wide format
answer<-temp %>% pivot_wider(id_cols = ID, names_from = time, values_from = c("Gest", "Sys", "Dia"), names_glue = "{.value}{time}")
#resort the columns
answer <-answer[ , names(df.1)]
answer
# A tibble: 3 × 13
# Groups: ID [3]
ID Gest1 Sys1 Dia1 Gest2 Sys2 Dia2 Gest3 Sys3 Dia3 Gest4 Sys4 Dia4
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 27 27 120 90 29 122 89 32 123 90 33 124 94
2 46 28 126. 83.5 29 122 88 30 123 89 NA NA NA
3 72 29 124 92 30 119 84.5 32 128 80 NA NA NA
Here is the sample code that I am using
library(dplyr)
naics <- c("000000","000000",123000,123000)
year <- c(2020,2021,2020,2021)
January <- c(250,251,6,9)
February <- c(252,253,7,16)
March <- c(254,255,8,20)
sample2 <- data.frame (naics, year, January, February, March)
Here is the intended result
Jan2020 Feb2020 March2020 Jan2021 Feb2021 March2021
000000 250 252 254 251 253 255
123000 6 7 8 9 16 20
Is this something that is done with pivot_wider or is it more complex?
Use pivot_wider with values_from set to the month columns and names_from set to 'year', then set the column name format with names_glue. If needed, convert 'naics' to row names with column_to_rownames (from tibble).
library(tidyr)
library(tibble)
pivot_wider(sample2, names_from = year, values_from = January:March,
names_glue = "{substr(.value, 1, 3)}{year}")%>%
column_to_rownames('naics')
-output
Jan2020 Jan2021 Feb2020 Feb2021 Mar2020 Mar2021
000000 250 251 252 253 254 255
123000 6 9 7 16 8 20
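The output above groups the columns by month, whereas the intended result in the question groups them by year. A sketch of one way to get the year-first order, assuming tidyr 1.2.0 or later (where pivot_wider has a names_vary argument), is shown below. The data frame is rebuilt from the question, and the object name wide is made up for illustration.

```r
library(tidyr)

# Sample data rebuilt from the question
sample2 <- data.frame(
  naics = c("000000", "000000", "123000", "123000"),
  year = c(2020, 2021, 2020, 2021),
  January = c(250, 251, 6, 9),
  February = c(252, 253, 7, 16),
  March = c(254, 255, 8, 20)
)

# names_vary = "slowest" keeps all months of one year together
wide <- pivot_wider(
  sample2,
  names_from = year,
  values_from = January:March,
  names_glue = "{substr(.value, 1, 3)}{year}",
  names_vary = "slowest"
)

names(wide)
```

On older tidyr versions the same order can be obtained by reordering the columns after pivoting.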
With the reshape function from base R:
reshape(sample2, dir = "wide", sep="",
idvar = "naics",
timevar = "year",
new.row.names = unique(naics))[,-1]
# January2020 February2020 March2020 January2021 February2021 March2021
# 000000 250 252 254 251 253 255
# 123000 6 7 8 9 16 20
This takes a longer route than @akrun's answer. I will leave it here in case it gives more intuition about the steps being taken. Otherwise, @akrun's answer is more resource-efficient.
sample2 %>%
tidyr::pivot_longer(-c(naics, year), names_to = "month",
values_to = "value") %>%
mutate(Month=paste0(month, year)) %>%
select(-year, - month) %>%
tidyr::pivot_wider(names_from = Month,values_from = value)
# A tibble: 2 x 7
naics January2020 February2020 March2020 January2021 February2021
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 000000 250 252 254 251 253
2 123000 6 7 8 9 16
# ... with 1 more variable: March2021 <dbl>
I found a way to make it work, but it seems clumsy. There has to be a better way...
The question I might try to answer is
If I wanted to find out how often a language was selected by country, how would I do that efficiently?
This works, what's better?
library(tidyverse)
data(SO_survey, package = "crunch")
# what does it look like now?
SO_survey %>% select(Country, WantWorkLanguage) %>% head()
It's set up like this
# Country WantWorkLanguage
# 4 United States Matlab; Python; R; SQL
# 11 United States C#; R; SQL
# 36 Italy JavaScript; Python; R
# 125 Denmark Groovy; Java; JavaScript; Lua; SQL; TypeScript
# 242 United States C++; Python
# 298 Dominican Republic C; C#; CoffeeScript; Go; Haskell; JavaScript; Perl; PHP; Python; R; Ruby; SQL
Made a unique list of languages
# extract unique languages
(wanted = SO_survey %>%
select(WantWorkLanguage) %>%
unlist() %>%
strsplit("; ", fixed = T) %>%
unlist() %>%
unique())
To extract the counts for one country
# how often did a respondent pick a particular language in the US?
SO_survey %>%
filter(Country == "United States") %>%
{strsplit(unlist(.$WantWorkLanguage),"; ",fixed = T)} %>%
unlist() %>%
table(. %in% wanted)
If I want it arranged/sorted
# If I want it sorted or arranged
SO_survey %>%
filter(Country == "United States") %>%
{strsplit(unlist(.$WantWorkLanguage),"; ",fixed = T)} %>%
unlist() %>%
table(. %in% wanted) %>%
data.frame() %>%
select(-Var2) %>%
arrange(-Freq)
The output:
# . Freq
# 1 Python 321
# 2 R 288
# 3 SQL 209
# 4 JavaScript 199
# 5 C++ 136
# 6 Java 115
# 7 Go 101
# 8 C# 100
# 9 Scala 81
# 10 C 74
# 11 Swift 57
# 12 Julia 56
# 13 TypeScript 55
# 14 Haskell 52
# 15 Rust 38
# 16 F# 36
# 17 PHP 32
# 18 Ruby 32
# 19 Assembly 29
# 20 Clojure 29
# 21 Matlab 29
# 22 Elixir 23
# 23 Perl 18
# 24 Objective-C 17
# 25 CoffeeScript 16
# 26 Erlang 16
# 27 Lua 13
# 28 Common Lisp 12
# 29 VBA 11
# 30 Groovy 7
# 31 Dart 5
# 32 Smalltalk 3
# 33 VB.NET 3
# 34 Hack 2
# 35 Visual Basic 6 1
Your tidyverse solution seems pretty good. For something more concise you could try base R or data.table:
library(data.table)
setDT(SO_survey)
setkey(SO_survey, Country)
SO_survey['United States', .(lang = unlist(strsplit(WantWorkLanguage, '; ')))
][, .N, keyby = lang][order(-N)]
# lang N
# 1: Python 321
# 2: R 288
# 3: SQL 209
# 4: JavaScript 199
# 5: C++ 136
# ...
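If you would rather stay in the tidyverse and get the counts for every country at once, a minimal sketch is to split the languages into rows and count them. The small survey data frame below is a hypothetical stand-in built from a few of the rows shown in the question, so the block runs without the crunch package.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for SO_survey, using rows shown in the question
survey <- data.frame(
  Country = c("United States", "United States", "Italy", "United States"),
  WantWorkLanguage = c("Matlab; Python; R; SQL", "C#; R; SQL",
                       "JavaScript; Python; R", "C++; Python")
)

lang_counts <- survey %>%
  # One row per respondent and language
  separate_rows(WantWorkLanguage, sep = "; ") %>%
  # Count each language within each country, most frequent first
  count(Country, WantWorkLanguage, sort = TRUE)

lang_counts
```

Adding filter(Country == "United States") before count() gives the single-country table from the question.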
I have a dataset with a million records that I need to aggregate after first subsetting the data. It is difficult to provide a good reproducible sample because in this case, the sample size would be rather large - but I will try anyway.
A random sample of the data that I am working with looks like this:
> df
auto_id user_id month
164537 7124 240249 10
151635 7358 226423 9
117288 7376 172463 9
177119 6085 199194 11
128904 7110 141608 9
157194 7143 241964 9
71303 6090 141646 7
72480 6808 175910 7
108705 6602 213098 8
97889 7379 185516 8
184906 6405 212580 12
37242 6057 197905 8
157284 6548 162928 9
17910 6885 194180 10
70660 7162 161827 7
8593 7375 207061 8
28712 6311 176373 10
144194 7324 142715 9
73106 7196 176153 7
67065 7392 171039 7
77954 7116 161489 7
59842 7107 162637 7
101819 5994 182973 9
183546 6427 142029 12
102881 6477 188129 8
In every month, there are many users who are the same, so first we subset by month and make a frequency table of the users and the number of trips taken (unfortunately, in the random sample above there is only one trip per user, but in the larger dataset this is not the case):
full_data <- full_data[full_data$month == 7,]
users <- as.data.frame(table(full_data$user_id))
head(users)
Var1 Freq
1 100231 10
2 100744 17
3 111281 1
4 111814 2
5 113716 3
6 117493 3
As we can see, in the full data set, in the month of July (month = 7), users have taken multiple trips. Now the important part: subset only the top 10% of these users (the top 10% in terms of Freq)
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
Now the new dataframe - topten - can be summed and we get the amount of trips taken by the top ten percent of users
sum(topten$Freq)
[1] 12147
In the end the output should look like this
> output
month trips
1 7 12147
2 8 ...
3 9 ...
4 10 ...
5 11 ...
6 12 ...
Is there a way to automate this process using dplyr - I mean specifically the subsetting by the top ten percent ? I have tried
output <- full_data %>%
group_by(month) %>%
summarise(n = n())
But this only aggregates total trips by month. Could someone suggest a way to integrate this part into the query in dplyr ? :
tenPercent = round(nrow(users)/10)
users <- users[order(-users$Freq),]
topten <- head(users, n = tenPercent)
The code below counts the number of rows for each user_id in each month, and then selects the 10% of users with the most rows in each month and sums them. Let me know if it solves your problem.
library(dplyr)
full_data %>% group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
UPDATE: Following up on your comment, let's do a check with some fake data. Below we have 30 different values of user_id and 10,000 total rows. I've also used the prob argument so that the probability of a user_id being selected is proportional to its value (i.e., user_id 1 is the least likely to be chosen and user_id 30 is the most likely to be chosen).
set.seed(3)
full_data = data.frame(user_id=sample(1:30,10000, replace=TRUE, prob=1:30),
month=sample(1:12, 10000, replace=TRUE))
Let's look at the number of rows for each user_id for month==1. The code below counts the number of rows for each user_id and sorts from most to least common. Note that the three most common values of user_id (28,29,26) comprise 171 rows (60+57+54). Since there are 30 different values of user_id, the top three users represent the top 10% of users:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
arrange(desc(n)) %>% as.data.frame
month user_id n
1 1 28 60
2 1 29 57
3 1 26 54
4 1 30 53
5 1 27 49
6 1 22 43
7 1 21 41
8 1 20 40
9 1 23 40
10 1 24 38
11 1 25 38
12 1 19 37
13 1 18 33
14 1 16 28
15 1 15 27
16 1 17 27
17 1 14 26
18 1 9 20
19 1 12 20
20 1 13 20
21 1 10 17
22 1 11 17
23 1 6 15
24 1 7 13
25 1 8 13
26 1 4 9
27 1 5 7
28 1 2 3
29 1 3 2
30 1 1 1
So now let's take the next step and select the top 10% of users. To answer the question in your comment, filter(percent_rank(n) >= 0.9) keeps only the top 10% of user_id, based on the value of n (which is the number of rows for each user_id). percent_rank is one of several ranking functions in dplyr that have different ways of dealing with ties (which may be the reason you're not getting the results you expect). See ?percent_rank for details:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9)
month user_id n
1 1 26 54
2 1 28 60
3 1 29 57
And the sum of n (the total number of trips for the top 10%) is:
full_data %>% filter(month==1) %>%
group_by(month, user_id) %>%
tally %>%
group_by(month) %>%
filter(percent_rank(n) >= 0.9) %>%
summarise(n_trips = sum(n))
month n_trips
1 1 171
So it looks like the code does what we'd naively expect, but maybe the issue is related to how ties are dealt with. Let me know if you're still getting anomalous results in your real data or if I've misunderstood what you're trying to accomplish.
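If ties are the cause, one option is to mirror the original base R steps exactly: sort users by trip count within each month and keep the first round(n / 10) of them. The sketch below does that on a small made-up data set (the months, user ids and trip counts are invented for illustration). Unlike percent_rank, it always keeps exactly round(n / 10) users per month and breaks ties by row order.

```r
library(dplyr)

# Made-up data: in month 7, users 1..10 take 1..10 trips;
# in month 8, users 101..120 take 1..20 trips.
full_data <- data.frame(
  month = c(rep(7, 55), rep(8, 210)),
  user_id = c(rep(1:10, times = 1:10), rep(101:120, times = 1:20))
)

output <- full_data %>%
  # Trips per user per month
  count(month, user_id, name = "trips") %>%
  group_by(month) %>%
  # Most active users first within each month
  arrange(desc(trips), .by_group = TRUE) %>%
  # Keep the first round(n / 10) users, as head(users, n = tenPercent) does
  filter(row_number() <= round(n() / 10)) %>%
  summarise(trips = sum(trips))

output
```

Here month 7 keeps one user (10 trips) and month 8 keeps two users (20 + 19 = 39 trips), which matches what the manual subsetting would give on the same data.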