Apologies in advance if this has already been asked elsewhere, but I've tried several approaches and nothing has worked so far.
In my data frame Mesure, I would like to split the values of the column Row.names into two new columns named Sample_type and Locality. I tried a tidyverse solution, but R tells me that the column must not be duplicated. How can I modify it? Also, is it possible to remove the "<"?
> head(Mesure)
Row.names mean_Mesure max_Mesure min_Mesure
1 Aquatic_moss.Paris.AG-110m.< 100 110 90
2 Aquatic_moss.Paris.BE-7. 123 177 53
3 Aquatic_moss.Paris.CO-57.< 40 60 20
4 Aquatic_moss.Paris.CO-58.< 40 50 30
5 Aquatic_moss.Paris.CO-60.< 50 70 30
6 Aquatic_moss.Paris.CS-134.< 200 300 100
>
> library(tidyverse)
> new_df <- Mesure %>%
+ rownames_to_column(var = "Row.names") %>%
+ separate(Row.names,sep = ".",into = c("Sample_type","Locality"))
Error: Column name `Row.names` must not be duplicated.
Run `rlang::last_error()` to see where the error occurred.
To split the column at the first dot only, you can use:
Mesure %>%
separate(Row.names, sep = "\\.", into = c("Sample_type", "Locality"), extra = "merge")
Explanation:
You don't need rownames_to_column(), because "Row.names" is already a column; calling it tries to create a second column with the same name, which is what triggers the error.
sep = "." is not enough, because sep is interpreted as a regular expression, in which . matches any character. Escape it as "\\." to match a literal dot.
There are several dots in the column, so you need to specify extra = "merge" to split only at the first one. If you would like to keep only "Paris" without AG-110m etc., specify extra = "drop" instead.
Result with extra = "merge":
Sample_type Locality mean_Mesure max_Mesure min_Mesure
1 Aquatic_moss Paris.AG-110m.< 100 110 90
2 Aquatic_moss Paris.BE-7. 123 177 53
3 Aquatic_moss Paris.CO-57.< 40 60 20
4 Aquatic_moss Paris.CO-58.< 40 50 30
5 Aquatic_moss Paris.CO-60.< 50 70 30
6 Aquatic_moss Paris.CS-134.< 200 300 100
Result with extra = "drop":
Sample_type Locality mean_Mesure max_Mesure min_Mesure
1 Aquatic_moss Paris 100 110 90
2 Aquatic_moss Paris 123 177 53
3 Aquatic_moss Paris 40 60 20
4 Aquatic_moss Paris 40 50 30
5 Aquatic_moss Paris 50 70 30
6 Aquatic_moss Paris 200 300 100
If you need to drop "<" at the end of the Locality column, assign the result of separate() back to Mesure and then run something like:
Mesure$Locality <- gsub("<$", "", Mesure$Locality)
where "<$" means "< at the end of the string".
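The two steps above can also be chained in one pipeline. The following is a minimal sketch on a two-row stand-in for Mesure (the toy data and the name new_df are assumptions for illustration); it uses stringr's str_remove() in place of gsub(), which is a swapped-in but equivalent way to strip a trailing "<".

```r
library(tidyverse)

# Toy stand-in for the Mesure data frame (assumed for illustration)
Mesure <- tibble(
  Row.names   = c("Aquatic_moss.Paris.AG-110m.<", "Aquatic_moss.Paris.BE-7."),
  mean_Mesure = c(100, 123)
)

new_df <- Mesure %>%
  # split at the first literal dot only; the rest stays in Locality
  separate(Row.names, sep = "\\.", into = c("Sample_type", "Locality"), extra = "merge") %>%
  # drop a "<" only when it is the last character of the string
  mutate(Locality = str_remove(Locality, "<$"))

new_df
```

Rows without a trailing "<" are left unchanged, so the same pipeline can be applied to the whole column.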
Apologies, I should have read your question properly. The answer to the second part of your question would be the following, with sep = "\\." so that the split happens only at dots and not at the underscore in Aquatic_moss:
d %>% separate(Row.names, into=c("Sample_type","Locality"), sep="\\.", extra="drop")
# A tibble: 6 x 6
  Sample_type  Locality mean_Mesure max_Mesure min_Mesure
  <chr>        <chr>          <dbl>      <dbl>      <dbl>
1 Aquatic_moss Paris            100        110         90
2 Aquatic_moss Paris            123        177         53
3 Aquatic_moss Paris             40         60         20
4 Aquatic_moss Paris             40         50         30
5 Aquatic_moss Paris             50         70         30
6 Aquatic_moss Paris            200        300        100
I can't help you with the first part because I don't know how you create the input data frame.
Related
I have a dataset with staff information. I have a column that lists their current age and a column that lists their salary. I want to create an R data frame that has 3 columns: one to show all the unique ages, one to count the number of people who are that age and one to give me the median salary for each particular age. On top of this, I would like to group those who are under 21 and over 65. Ideally it would look like this:
age        number of people   median salary
Under 21   36                 26,300
22         15                 26,300
23         30                 27,020
24         41                 26,300
etc
Over 65    47                 39,100
The current dataset has hundreds of columns and thousands of rows but the columns that are of interest are like this:
ageyears   sal22
46         28,250
32         26,300
19         27,020
24         26,300
53         36,105
47         39,100
47         26,200
70         69,500
68         75,310
I'm a bit lost on the best way to do this but assume some sort of loop would work best? Thanks so much for any direction or help.
library(tidyverse)
sample_data <- tibble(
age = sample(17:70, 100, replace = TRUE) %>% as.character(),
salary = sample(20000:90000, 100, replace = TRUE)
)
# A tibble: 100 × 2
age salary
<chr> <int>
1 56 35130
2 56 44203
3 20 28701
4 47 66564
5 66 60823
6 54 36755
7 66 30731
8 68 21338
9 19 80875
10 61 44547
# … with 90 more rows
# ℹ Use `print(n = ...)` to see more rows
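# age is stored as character here, so convert it back to a number for the
# comparisons; comparing strings with <= would not order ages reliably
sample_data %>%
  mutate(age = case_when(as.numeric(age) <= 21 ~ "Under 21",
                         as.numeric(age) >= 65 ~ "Over 65",
                         TRUE ~ age)) %>%
  group_by(age) %>%
  summarise(count = n(),
            median_salary = median(salary))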
# A tibble: 38 × 3
age count median_salary
<chr> <int> <dbl>
1 22 4 46284.
2 23 3 55171
3 25 3 74545
4 27 1 37052
5 28 3 66006
6 29 1 82877
7 30 2 40342.
8 31 2 27815
9 32 1 32282
10 33 3 64523
# … with 28 more rows
# ℹ Use `print(n = ...)` to see more rows
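A variation on the approach above, as a minimal sketch: it keeps the age numeric, uses the column names ageyears and sal22 from the question, and turns the group label into a factor so that "Under 21" sorts first and "Over 65" last. The strict cut-offs (< 21 and > 65), the object names staff, age_levels and summary_tbl, and the assumption that salaries are already numeric (no thousands separators) are illustrative choices, not something the question confirms.

```r
library(dplyr)

# Rows from the question, with salaries entered as plain numbers (an assumption)
staff <- tibble(
  ageyears = c(46, 32, 19, 24, 53, 47, 47, 70, 68),
  sal22    = c(28250, 26300, 27020, 26300, 36105, 39100, 26200, 69500, 75310)
)

# Desired display order of the groups (hypothetical helper vector)
age_levels <- c("Under 21", as.character(21:65), "Over 65")

summary_tbl <- staff %>%
  mutate(age_group = case_when(
    ageyears < 21 ~ "Under 21",
    ageyears > 65 ~ "Over 65",
    TRUE          ~ as.character(ageyears)
  )) %>%
  # a factor with explicit levels makes the summary sort in age order
  mutate(age_group = factor(age_group, levels = age_levels)) %>%
  group_by(age_group) %>%
  summarise(count = n(), median_salary = median(sal22))

summary_tbl
```

No loop is needed: group_by() followed by summarise() computes the count and the median once per group.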
Let's say I have a dataset where I have a list of names and their ages
Tom 65
Sam 40
Sue 88
Kay 4
Jon 25
Lia 85
Ian 39
Joe 10
Bea 17
Jan 43
Jen 17
Ike 24
Jay 35
Cam 77
Jin 12
Ron 1
Ray 45
Leo 29
Ken 98
Mel 56
Amy 49
Joy 67
Ivy 3
Noe 14
Max 31
Jax 61
Lee 19
Ace 28
Ben 5
Guy 74
I'm trying to divide the dataset into ten equal bins by descending order (Ex. the first bin will have Ken, Sue, and Lia and the last bin will have Ben, Ivy, and Ron) and I want to find the average age for each bin (So the average age for the first bin would be 90.33). I was able to do this on MS excel quite easily but I'm not exactly sure how to do this efficiently on R. Any suggestions?
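We can use cut to create a grouping variable and then summarise by taking the mean (here v1 holds the names and v2 the ages in df1). Note that cut(v2, breaks = 10) creates ten intervals of equal width over the range of v2, so the groups need not contain the same number of rows: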
library(dplyr)
df1 %>%
group_by(grp = cut(v2, breaks = 10)) %>%
summarise(v1 = list(v1), v2 = mean(v2))
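If the bins should each hold the same number of people, as in the question, one option is dplyr's ntile(), which assigns rows to equally sized rank buckets. The sketch below is an alternative to cut(), not the method from the answer above; the column names name and age and the object names people and binned are assumptions.

```r
library(dplyr)

# Data from the question
people <- tibble(
  name = c("Tom", "Sam", "Sue", "Kay", "Jon", "Lia", "Ian", "Joe", "Bea", "Jan",
           "Jen", "Ike", "Jay", "Cam", "Jin", "Ron", "Ray", "Leo", "Ken", "Mel",
           "Amy", "Joy", "Ivy", "Noe", "Max", "Jax", "Lee", "Ace", "Ben", "Guy"),
  age  = c(65, 40, 88, 4, 25, 85, 39, 10, 17, 43,
           17, 24, 35, 77, 12, 1, 45, 29, 98, 56,
           49, 67, 3, 14, 31, 61, 19, 28, 5, 74)
)

binned <- people %>%
  # rank by descending age and cut the ranking into 10 buckets of equal size
  mutate(bin = ntile(desc(age), 10)) %>%
  group_by(bin) %>%
  summarise(names = list(name), n = n(), mean_age = mean(age))

binned
```

With 30 rows and 10 buckets every bin holds three people; the first bin contains Ken, Sue and Lia, whose mean age is about 90.33, matching the figure in the question.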
I have a dataframe with crop names and their respective FAO codes. Unfortunately, some crop categories, such as 'other cereals', have multiple FAO codes, ranges of FAO codes or even worse - multiple ranges of FAO codes.
Snippet of the dataframe with the different formats for FAO codes.
> FAOCODE_crops
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68,71,75,89,92,94,97,101,103,108
27 other oil crops 260:310,312:339
31 other fibre crops 773:821
Using the following code successfully breaks down these numbers,
unlist(lapply(unlist(strsplit(FAOCODE_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
[1] 15 27 56 44 79 79 83 68 71 75 89 92 94 97 101 103 108
... but I fail to merge these numbers back into the dataframe, where every FAOCODE gets its own row.
> FAOCODE_crops$FAOCODE <- unlist(lapply(unlist(strsplit(MAPSPAM_crops$FAOCODE, ",")), function(x) eval(parse(text = x))))
Error in `$<-.data.frame`(`*tmp*`, FAOCODE, value = c(15, 27, 56, 44, :
replacement has 571 rows, data has 42
I fully understand why it doesn't merge successfully, but I can't figure out a way to fill the table with a new row for each FAOCODE as idealized below:
SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 other cereals 68
8 other cereals 71
8 other cereals 75
8 other cereals 89
And so on...
Any help is greatly appreciated!
We can use separate_rows to split the values at each comma. After that, we can loop through FAOCODE using map and ~eval(parse(text = .x)) to evaluate each number or range. Finally, we can use unnest to expand the data frame so that every code gets its own row.
library(tidyverse)
dat2 <- dat %>%
separate_rows(FAOCODE, sep = ",") %>%
mutate(FAOCODE = map(FAOCODE, ~eval(parse(text = .x)))) %>%
unnest(cols = FAOCODE)
dat2
# # A tibble: 140 x 2
# SPAM_full_name FAOCODE
# <chr> <dbl>
# 1 wheat 15
# 2 rice 27
# 3 other cereals 68
# 4 other cereals 71
# 5 other cereals 75
# 6 other cereals 89
# 7 other cereals 92
# 8 other cereals 94
# 9 other cereals 97
# 10 other cereals 101
# # ... with 130 more rows
DATA
dat <- read.table(text = " SPAM_full_name FAOCODE
1 wheat 15
2 rice 27
8 'other cereals' '68,71,75,89,92,94,97,101,103,108'
27 'other oil crops' '260:310,312:339'
31 'other fibre crops' '773:821'",
header = TRUE, stringsAsFactors = FALSE)
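If evaluating text with eval(parse()) feels too permissive (it would run any R expression that happens to be in the column), the ranges can also be expanded by splitting on ":" explicitly. This is a minimal sketch of that alternative; expand_code is a hypothetical helper, and the shortened toy data is an assumption for illustration.

```r
library(tidyverse)

# Shortened toy data in the same format as the question (assumed for illustration)
dat <- tibble(
  SPAM_full_name = c("wheat", "other oil crops"),
  FAOCODE        = c("15", "260:262,312:313")
)

# Hypothetical helper: turn "a:b" into the sequence a..b, and "a" into a
expand_code <- function(x) {
  bounds <- as.integer(strsplit(x, ":", fixed = TRUE)[[1]])
  seq(bounds[1], bounds[length(bounds)])
}

dat2 <- dat %>%
  separate_rows(FAOCODE, sep = ",") %>%        # one row per code or range
  mutate(FAOCODE = map(FAOCODE, expand_code)) %>%  # list column of integer vectors
  unnest(cols = FAOCODE)                       # one row per individual code

dat2
```

The helper assumes the column contains only single codes and simple "from:to" ranges; anything else would produce NA values from as.integer() and should be checked before unnesting.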
I have the following data:
ID AGE SEX RACE COUNTRY VISITNUM VSDTC VSTESTCD VSORRES
32320058 58 M WHITE UKRAINE 2 2016-04-28 DIABP 74
32320058 58 M WHITE UKRAINE 1 2016-04-21 HEIGHT 183
32320058 58 M WHITE UKRAINE 1 2016-04-21 SYSBP 116
32320058 58 M WHITE UKRAINE 2 2016-04-28 SYSBP 116
32320058 58 M WHITE UKRAINE 1 2016-04-21 WEIGHT 109
22080090 75 M WHITE MEXICO 1 2016-05-17 DIABP 81
22080090 75 M WHITE MEXICO 1 2016-05-17 HEIGHT 176
22080090 75 M WHITE MEXICO 1 2016-05-17 SYSBP 151
I would like to reshape the data using tidyr::spread to get the following output:
ID AGE SEX RACE COUNTRY VISITNUM VSDTC DIABP SYSBP WEIGHT HEIGHT
32320058 58 M WHITE UKRAINE 2 2016-04-28 74 116 NA NA
32320058 58 M WHITE UKRAINE 1 2016-04-21 NA 116 109 183
22080090 75 M WHITE MEXICO 1 2016-05-17 81 151 NA 176
I receive duplicate errors, although I don't have duplicates in my data!
df1=spread(df,VSTESTCD,VSORRES)
Error: Duplicate identifiers for rows (36282, 36283), (59176, 59177), (59179, 59180)
If I understand your question correctly:
# spread() needs every row to be uniquely identified by the remaining columns,
# so when many rows are otherwise identical we create a unique identifier column first
# Let's take the iris dataset as an example
# load the tidyverse (dplyr and tidyr are all that is needed here)
library(tidyverse)
# check the dataset (iris)
head(iris)
# gather all columns of iris except Species:
# create a unique identifier column and transform the wide data to long data as follows
iris_gather <- iris %>% dplyr::mutate(ID = dplyr::row_number()) %>% tidyr::gather(key = Type, value = my_value, 1:4)
# check first six rows
head(iris_gather)
# use spread() to spread the data back out; ID keeps otherwise identical rows apart
iris_spread <- iris_gather %>% tidyr::spread(key = Type, value = my_value) %>% dplyr::select(-ID)
# check first six rows of iris_spread
head(iris_spread)
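Applied to data shaped like the question's, the same idea can be sketched as follows: first list the key combinations that occur more than once, then number the repeats within each combination so that spread() sees unique identifiers. The toy values, the reduced set of columns and the names dups, rep and wide are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Toy data (assumed) with one repeated ID / visit / test combination
df <- tibble(
  ID       = c(1, 1, 1, 2),
  VISITNUM = c(1, 1, 1, 1),
  VSTESTCD = c("SYSBP", "SYSBP", "DIABP", "SYSBP"),
  VSORRES  = c(116, 118, 74, 151)
)

# Which key combinations appear more than once?
dups <- df %>%
  count(ID, VISITNUM, VSTESTCD) %>%
  filter(n > 1)

# Number repeated measurements within each key, then spread
wide <- df %>%
  group_by(ID, VISITNUM, VSTESTCD) %>%
  mutate(rep = row_number()) %>%
  ungroup() %>%
  spread(VSTESTCD, VSORRES)

wide
```

Looking at dups first is usually worthwhile: the "Duplicate identifiers" error means that at least two rows share the same values in every column other than the key and value columns, even if the full rows are not exact duplicates.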
Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done in base R reshape, if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I personally use the base reshape, so I should probably have shown this with reshape2 (Wickham) as well, but I forgot there was a reshape2 package. It is slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)
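For completeness, the same reshaping can be sketched with tidyr's pivot_wider(), which is not used in the answers above. The names_glue argument builds the column names in the construction.employment form asked for in the question; the shortened data and the object name wide are assumptions for illustration.

```r
library(tidyr)

# First two counties from the question's example data
data <- data.frame(
  state          = "A",
  county         = rep(c("a", "b"), each = 2),
  industry       = rep(c("construction", "manufacturing"), 2),
  employment     = c(146, 110, 121, 90),
  establishments = c(19, 20, 10, 27)
)

wide <- pivot_wider(
  data,
  id_cols     = c(state, county),                 # one row per state and county
  names_from  = industry,                         # industries become column prefixes
  values_from = c(employment, establishments),
  names_glue  = "{industry}.{.value}"             # e.g. construction.employment
)

wide
```

This scales to any number of industries without a loop, since every level of industry contributes its own pair of columns.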