Related
I have a data set with a variable 'education' which is coded differently in each of the three countries included, for example:
Code
Country 1
Country 2
Country 3
1
No education
No education
No education
2
Primary
Primary
Islamic education
3
Secondary
Secondary
Primary
4
NA
NA
Secondary
I need to apply factor levels, which are different for each country.
Below is my attempt, but it doesn't appear to work:
df <- data.frame(
Country = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
Education_1 = sample(1:4)
)
df$Education <-
if(df$Country == "Country1") {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Primary", "Secondary", "NA"))
} else if (df$Country == "Country2") {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Primary", "Secondary", "NA"))
} else {
factor(df$Education,
levels = c(1:4),
labels = c("No education", "Islamic education", "Primary", "Secondary")
)
}
Thanks
Perhaps this helps? This takes the data from the table mapping countries with the education code and the education category and converts it to long format.
Then use a left join to the two column dataframe with countries and education codes.
You could use the resulting column with education type as a string or the codes could be recoded to be consistent.
library(dplyr)
library(tidyr)
library(stringr)
df <- data.frame(
Country = sample(c("Country 1", "Country 2", "Country 3"), 100, replace = TRUE),
Education_1 = sample(1:4))
df_ed <- structure(list(Code = 1:4, Country.1 = c("No education", "Primary",
"Secondary", NA), Country.2 = c("No education", "Primary", "Secondary",
NA), Country.3 = c("No education", "Islamic education", "Primary",
"Secondary")), class = "data.frame", row.names = c(NA, -4L))
df_levels <-
df_ed %>%
pivot_longer(-Code) %>%
mutate(name = str_replace(name, "\\.", " "))
df1 <-
df %>%
left_join(df_levels, by = c("Country" = "name", "Education_1" = "Code"))
head(df1)
#> Country Education_1 value
#> 1 Country 1 3 Secondary
#> 2 Country 2 4 <NA>
#> 3 Country 3 1 No education
#> 4 Country 1 2 Primary
#> 5 Country 3 3 Primary
#> 6 Country 2 4 <NA>
Created on 2021-09-22 by the reprex package (v2.0.0)
We have the current data frame df as below
df <- data.frame(ID = c(1,2,3,4,5,6), Name = c("Chris", "J", "Kemp", "President,", "CEO & ", "Director", "Ashton", "K", "Christian", "Analyst"),
Font = c("Font A", "Font A", "Font A", "Font B", "Font B", "Font B", "Font A", "Font A", "Font A", "Font B"))
The expected Output is
final_df <- data.frame(Name = c("Chris J Kemp", "Ashton K Christian"), Designation = c("President, CEO & Director", "Analyst"))
So basically I want to add names until there's a certain font type in column Font and this is of course a sample of the huge data frame I'm dealing with. Thanks for the help in advance !
Sorry all your efforts earlier. This question has been re-edited a bit.
Data:
df <- data.frame(
ID = c(1:12),
Name = c("Chris", "J", "Kemp", "President,", "CEO & ", "Director",
"Bad", "D", "King", "Best,", "Teacher & ", "Friend"),
Font = c("Font A", "Font A", "Font A", "Font B", "Font B", "Font B",
"Font A", "Font A", "Font A", "Font B", "Font B", "Font B")
)
You can do:
df$group <- cumsum(c(TRUE, df$Font[-1] != df$Font[-length(df$Font)]))
final_df <- as.data.frame(matrix(lapply(split(df$Name, df$group), paste, collapse = " "), ncol = 2))
colnames(final_df) <- c("Name", "Designation")
A grouping row of consecutif font is created. Then split allows to have a list per font then you can reformat data using paste.
Output:
Name Designation
1 Chris J Kemp Bad D King
2 President, CEO & Director Best, Teacher & Friend
You can try -
library(dplyr)
library(tidyr)
df %>%
mutate(Font = recode(Font, 'Font A' = 'Name', 'Font B' = 'Designation'),
ID = data.table::rleid(Font)) %>%
group_by(ID, Font) %>%
summarise(Name = toString(Name), .groups = 'drop') %>%
mutate(ID = ceiling(ID/2)) %>%
pivot_wider(names_from = Font, values_from = Name) %>%
select(-ID)
# Name Designation
# <chr> <chr>
#1 Chris, J, Kemp President,, CEO & , Director
#2 Ashton, K, Christian Analyst
How about this method:
df_final = as.data.frame(matrix(unlist(lapply(unique(df$Font),function(i){paste(collapse = ' ', df[df$Font%in%i,"Name"])})), byrow=T,ncol = 2))
colnames(df_final)=c("names", "designation")
Let me know if it's ok
Here is a solution with dplyr:
library(dplyr)
df %>%
group_by(Font, fontnum) %>%
summarize(Tmp = paste(Name, collapse = " ")) %>%
mutate(ID = fontnum %/% 2) %>%
pivot_wider(id_cols = ID, names_from = Font, values_from = Tmp) %>%
transmute(Name = `Font A`, Designation = `Font B`)
where
df <- data.frame(
ID = 1:12,
Name = c("Chris", "J", "Kemp", "President,", "CEO & ", "Director",
"Bad", "D", "King", "Best,", "Teacher & ", "Friend"),
Font = c("Font A", "Font A", "Font A", "Font B", "Font B", "Font B",
"Font A", "Font A", "Font A", "Font B", "Font B", "Font B")
)
and
df$fontnum <- cumsum(c(0, abs(diff(as.numeric(factor(df$Font))))))
And the result will be
# A tibble: 2 × 2
Name Designation
<chr> <chr>
1 Chris J Kemp President, CEO & Director
2 Bad D King Best, Teacher & Friend
Here is an alternative strategy with dplyr: Data used from Clemsang (many thanks!)
group_by and divide ID by 3
summarise and collapse (bring rows to one row by group)
use group_split to split the groups (returns a list)
use bind_cols to get a dataframe
tweak names and select
library(dplyr)
df %>%
group_by(Font, ceiling(ID/3)) %>%
summarise(Name = paste0(Name, collapse = " ")) %>%
group_split(Font) %>%
bind_cols() %>%
select(Name = Name...3, Designation=Name...6)
Name Designation
<chr> <chr>
1 Chris J Kemp President, CEO & Director
2 Bad D King Best, Teacher & Friend
I have a variable in which there are several levels containing "games" as key word.I can not manually change the levels of the variables .Instead I want to change all levels containing games to a common level.
For example
Category
Games x
Games Y
Games Z
Entertainment
What I need is
Category
Games
Games
Games
Entertainment
Thanks
We can use sub to match one or more spaces (\\s+) followed by other characters, replace it with blank ("") in the 'Category' column
df1$Category <- sub("\\s+.*", "", df1$Category)
df1$Category
#[1] "Games" "Games" "Games" "Entertainment"
data
df1 <- structure(list(Category = c("Games x", "Games Y", "Games Z",
"Entertainment")), class = "data.frame", row.names = c(NA, -4L
))
Another way is with package forcats, a CRAN package to work with factors.
I will work with df2, a copy of the original data.frame.
library(forcats)
df2 <- df
new_games <- as.character(df$Category[grep("Games", df$Category)])
df2$Category <- fct_collapse(df$Category, Games = new_games)
df2
# Category
#1 Games
#2 Games
#3 Games
#4 Entertainment
Data.
df <-
structure(list(Category = structure(c(2L, 3L, 4L, 1L),
.Label = c("Entertainment", "Games x", "Games Y",
"Games Z"), class = "factor")),
class = "data.frame", row.names = c(NA, -4L))
Below is the structure of the dataframe
Village <- c("Location A" , "Location B", "Location C", "Location C", "Location A")
Farmers_Name <- c("Mary", "John", "Grace","Steph", "Richard")
Practiced_MinimumTillage <- c(0,1,1,0,1)
Practiced_Intercropping <- c(1,1,1,0,0)
Practiced_CropRotation <- c(1,1,1,1,0)
Practiced_ApplicationOfManure <- c(0,1,0,1,0)
farmers <- data.frame(Farmers_Name,Village,Practiced_MinimumTillage,Practiced_Intercropping,Practiced_CropRotation,Practiced_ApplicationOfManure)
The output of the dataframe farmers.
Farmers_Name Village Practiced_MinimumTillage Practiced_Intercropping Practiced_CropRotation Practiced_ApplicationOfManure
1 Mary Location A 0 1 1 0
2 John Location B 1 1 1 1
3 Grace Location C 1 1 1 0
4 Steph Location C 0 0 1 1
5 Richard Location A 1 0 0 0
Summarizing farm practices to get an understanding of the usage. A frequency table of the practices used by farmers in their farm.
practices <- select(farmers,Practiced_MinimumTillage,Practiced_Intercropping,Practiced_CropRotation,Practiced_ApplicationOfManure)
practices %>%
summarise_all(sum, na.rm=TRUE) %>%
gather(var,value) %>%
arrange(desc(value)) %>%
ggplot(aes(var, value)) + geom_bar(stat = "Identity") + coord_flip()
In the farmers dataframe, I'd like to use the column Village, for selectInput function. Whereby if a person selects "Location A" or "Location B" from the dropdown, above plot based on the frequency table is rendered in the output. How do I restructure the dataframe to suit this
using either dplyr or data.table?
It's pretty straightforward but comment if you have any questions -
Village <- c("Location A" , "Location B", "Location C", "Location C", "Location A")
Farmers_Name <- c("Mary", "John", "Grace","Steph", "Richard")
Practiced_MinimumTillage <- c(0,1,1,0,1)
Practiced_Intercropping <- c(1,1,1,0,0)
Practiced_CropRotation <- c(1,1,1,1,0)
Practiced_ApplicationOfManure <- c(0,1,0,1,0)
farmers <- data.frame(Farmers_Name,Village,Practiced_MinimumTillage,Practiced_Intercropping,
Practiced_CropRotation,Practiced_ApplicationOfManure)
shinyApp(
ui = fluidPage(
selectInput("village", "Select Village", choices = unique(farmers$Village)),
plotOutput("some_plot")
),
server = function(input, output, session) {
output$some_plot <- renderPlot({
filter(farmers, Village == input$village) %>%
select(Practiced_MinimumTillage,Practiced_Intercropping,Practiced_CropRotation,
Practiced_ApplicationOfManure) %>%
summarise_all(sum, na.rm=TRUE) %>%
gather(var,value) %>%
arrange(desc(value)) %>%
ggplot(aes(var, value)) + geom_bar(stat = "Identity") + coord_flip()
})
}
)
ex <- structure(list(group = c("group B", "group B", "group C", "group B","group C", "group B", "group B", "group A", "group C", "group C", "group C", "group B", "group A", "group A", "group A", "group B", "group A", "group A", "group B", "group C", "group B", "group A", "group C", "group C", "group C", "group C", "group B", "group A", "group A", "group C", "group B", "group A", "group A", "group B", "group C", "group C", "group A", "group C", "group C", "group A", "group B", "group B", "group A", "group B", "group C", "group C","group A", "group B", "group C", "group C"), A1 = c(0.765913072274998, 0.167720616329461, 0.282011203467846, 0.16467465297319, 0.407501850277185, 0.33958561392501, 0.117573569528759, 0.267871993361041, 0.930967768887058, 0.286146199563518, 0.741841563722119, 0.637853658990934, 0.137378493556753, 0.820813736645505, 0.249520575627685, 0.275153698632494, 0.916794545250013, 0.316050065914169, 0.393918378278613, 0.342175736324862, 0.0177193265408278, 0.178873546421528, 0.376545072998852, 0.411527326330543, 0.904074088903144, 0.487975180381909, 0.491365089081228, 0.591370195383206, 0.319207336986437, 0.98943907325156, 0.916014631278813, 0.0347612821497023, 0.323899461887777, 0.155270972754806, 0.436683354899287, 0.316902073565871, 0.734995431266725, 0.584133808733895, 0.515310257440433, 0.921727291075513, 0.0689518100116402, 0.659549278207123, 0.894137248862535, 0.00174906081520021, 0.873320956015959, 0.77207364118658, 0.637504813494161, 0.473099726485088, 0.557896945858374, 0.632965805241838), A2 = c(0.782154354499653, 0.718993512215093, 0.391234505455941, 0.337346265325323, 0.141482090810314, 0.587817938998342, 0.384924706770107, 0.0679492244962603, 0.0509498412720859, 0.786300176288933, 0.00685039279051125, 0.361857839627191, 0.851737944642082, 0.333896369440481, 0.521961389342323, 0.761324436869472, 0.486214824952185, 0.249763275263831, 0.536617708392441, 0.982582966331393, 0.879302836721763, 0.0212801641318947, 0.999207010492682, 0.661623647902161, 0.514440550701693, 0.748157452791929, 0.609151393873617, 0.581557413795963, 0.495366840157658, 0.595225095050409, 0.694380027009174, 0.419036868494004, 0.618371620541438, 0.406731882831082, 0.947823651600629, 0.182527825701982, 0.365398081485182, 0.307149735512212, 0.905119536910206, 0.657605888554826, 0.706386201782152, 0.461993521312252, 0.637554163113236, 0.280387100065127, 0.454221101710573, 0.0712104975245893, 0.914795317919925, 0.951028517214581, 0.645093881059438, 0.754043457563967), A3 = c(0.590488174697384, 0.876135899219662, 0.349565496202558, 0.365676332963631, 0.709230658365414, 0.584304825868458, 0.391973132034764, 0.464247716590762, 0.00831679091788828, 0.282901889178902, 0.842566592851654, 0.141866789199412, 0.278708242345601, 0.680587171344087, 0.256092368392274, 0.535304376389831, 0.803430012892932, 0.336343225324526, 0.320332229137421, 0.809689761372283, 0.588527292944491, 0.767302295425907, 0.124350237427279, 0.605355758452788, 0.619420127244666, 0.326774680987, 0.917224677512422, 0.710018905811012, 0.892817938234657, 0.149181636283174, 0.65066168922931, 0.433064805110916, 0.167979725869372, 0.809581968234852, 0.803237372776493, 0.703188817715272, 0.507392750121653, 0.372131450567394, 0.0688441153615713, 0.928956841118634, 0.960712827509269, 0.37454927386716, 0.753415656508878, 0.687665716046467, 0.05052674934268, 0.155349446227774, 0.806162646971643, 0.725155076943338, 0.537310504587367, 0.674253351520747), A4 = c(0.426875792676583, 0.168233293108642, 0.38692078506574, 0.673673333134502, 0.221049380488694, 0.142470651771873, 0.505352358799428, 0.579006788786501, 0.809476702939719, 0.343090934911743, 0.136329119792208, 0.881694708252326, 0.142607795307413, 0.658202062360942, 0.0624804550316185, 0.938871977152303, 0.477995269699022, 0.989794839406386, 0.307003591908142, 0.40553830191493, 0.0249065780080855, 0.321581491269171, 0.432656849268824, 0.578710418893024, 0.482647196389735, 0.72430428257212, 0.611029474530369, 0.748521578731015, 0.939656358910725, 0.803305297158659, 0.339922665851191, 0.919090943178162, 0.0926963407546282, 0.671128012472764, 0.634122629882768, 0.219061656622216, 0.376445228001103, 0.468331813113764, 0.131768246181309, 0.258267979836091, 0.651934198103845, 0.678243630565703, 0.663701833924279, 0.678762876661494, 0.524524878012016, 0.380242201732472, 0.433922954136506, 0.795754680642858, 0.383180371485651, 0.160383063135669)), .Names = c("group", "A1", "A2", "A3", "A4"), row.names = c(NA, -50L), class = c("tbl_df", "tbl", "data.frame"))
With above sample data I want to perform msClustering within groups. This clustering requires tuning parameter h so I define few values of it in column h.cand. Then I want to call msClustering with subsequent values of h and store the output in a list column. Theoretically, it should be feasible with purrr, but I think it requires nested map, and precisely speaking map inside map2. Here is my problem, I'm not sure how to refer for different list arguments. I have tried something like below:
ex %>%
group_by(group) %>%
nest() %>%
h.cand = map(data, ~quantile(dist(.x), seq(0.05, 0.40, by = 0.05))) %>%
mutate(cluster = map2(h.cand, data, ~map(.x, ~msClustering(
.y, # data (second argument of outter map2)
h = .x # h.cand element (first argument of inner map)
))))
and ended up with error:
Error: cannot allocate vector of size 1681.9 Gb
How should I refer to elements of outter and inner map in order to perform 8 (a length of h.cand vector) clusterings for each group?
For complicated anonymous functions, like this one, it's better if you use the function(x) instead of lambda/~ syntax for passing to map()'s .f argument.
Clean up the data:
map(ex, length)
# make element5 same length
ex[[5]] <- c(ex[[5]], runif(16))
# make into data frame
ex <- dplyr::bind_cols(ex)
Use function(x) instead of ~:
ex2 <- ex %>%
group_by(group) %>%
nest() %>%
mutate(h.cand = map(data,
~ quantile(dist(.), seq(0.05, 0.40, by = 0.05))),
cluster = map2(h.cand, data,
function(x, y) { map(x,
function(x2) { msClustering(y, x2) }) } ) )
Result check:
unnest(ex2, cluster)
# A tibble: 24 x 2
group cluster
<chr> <list>
1 group B <list [2]>
2 group B <list [2]>
3 group B <list [2]>
4 group B <list [2]>
5 group B <list [2]>
6 group B <list [2]>
7 group B <list [2]>
8 group B <list [2]>
9 group C <list [2]>
10 group C <list [2]>
# ... with 14 more rows