I have a data set with many observations for X individuals, and some individuals have more than one row. Each row has a classification (the variable clinical_significance) that takes three values, in order of priority: definite disease, possible, colonization. I would like to keep only one row per individual, carrying the "highest classification" found across that individual's rows: definite disease if present, otherwise possible, otherwise colonization. Any good suggestions on how to do this?
For instance, in the example below, I would like clinical_significance for ID 23 to be 'definite disease', as this outranks 'possible'.
id id_row number_of_samples species_ny clinical_significa…
18 1 2 MAC possible
18 2 2 MAC possible
20 1 2 scrofulaceum possible
20 2 2 scrofulaceum possible
23 1 2 MAC possible
23 2 2 MAC definite disease
Making a reproducible example:
df <- structure(
list(
id = c("18", "18", "20", "20", "23", "23"),
id_row = c("1","2", "1", "2", "1", "2"),
number_of_samples = c("2", "2", "2","2", "2", "2"),
species_ny = c("MAC", "MAC", "scrofulaceum", "scrofulaceum", "MAC", "MAC"),
clinical_significance = c("possible", "possible", "possible", "possible", "possible", "definite disease")
),
row.names = c(NA, -6L), class = c("data.frame")
)
The idea is to turn clinical significance into a factor, which is stored as an integer instead of character (i.e. 1 = definite, 2 = possible, 3 = colonization). Then, for each ID, take the row with lowest number.
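library(dplyr)

df_prio <- df |>
  mutate(
    # Level order encodes priority: 1 = definite disease, 2 = possible, 3 = colonization
    fct_clin_sig = factor(
      clinical_significance,
      levels = c("definite disease", "possible", "colonization")
    )
  ) |>
  group_by(id) |>
  # with_ties = FALSE keeps a single row per id even when several rows share the top level
  slice_min(fct_clin_sig, with_ties = FALSE) |>
  ungroup()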
I fixed it using
df <- df %>%
group_by(id) %>%
mutate(clinical_significance_new = ifelse(any(clinical_significance == "definite disease"), "definite disease", as.character(clinical_significance)))
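The ifelse() fix above only promotes 'definite disease'; a group that mixes 'possible' and 'colonization' rows would keep both values. A minimal base R sketch of the general case follows, assuming the same three-level priority order. The names priority, rank_per_row, best and result are illustrative, and df is trimmed to the two relevant columns.

```r
# Trimmed-down copy of the example data (only the columns needed here)
df <- data.frame(
  id = c("18", "18", "20", "20", "23", "23"),
  clinical_significance = c("possible", "possible", "possible",
                            "possible", "possible", "definite disease")
)

# Priority order: position 1 outranks position 2, which outranks position 3
priority <- c("definite disease", "possible", "colonization")

# Rank of each row's classification, then the best (lowest) rank per id
rank_per_row <- match(df$clinical_significance, priority)
best <- tapply(rank_per_row, df$id, min)

# One row per id, carrying the highest-priority classification
result <- data.frame(
  id = names(best),
  clinical_significance = priority[best]
)
result
```

Because the ranking is driven by the priority vector, adding a fourth level only requires extending that vector.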
I have a data frame of postcodes with a regional/metro classification assigned. In some instances, due to the datasource, the same postcode will occur with both a regional and metro classification.
POSTCODE REGION
1 3000 METRO
2 3000 REGIONAL
3 3256 METRO
4 3145 METRO
I am wondering how to remove the duplicate row and replace the region with "SPLIT" in these instances.
I have tried the code below; however, it reassigns the entire dataset to either "METRO" or "REGIONAL":
test <- within(PC_ACTM, REGION <- ifelse(duplicated("Postcode"), "SPLIT", REGION))
The desired output would be
POSTCODE REGION
1 3000 SPLIT
2 3256 METRO
3 3145 METRO
Example data:
dput(PC_ACTM)
structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L), REGION = c("METRO",
"REGIONAL", "METRO", "METRO")), class = "data.frame", row.names = c("1",
"2", "3", "4"))
Based on your title, you're looking for an ifelse() solution; perhaps this will suit?
PC_ACTM <- structure(list(POSTCODE = c(3000L, 3000L, 3256L, 3145L),
REGION = c("METRO", "REGIONAL", "METRO", "METRO")),
class = "data.frame",
row.names = c("1", "2", "3", "4"))
PC_ACTM$REGION <- ifelse(duplicated(PC_ACTM$POSTCODE), "SPLIT", PC_ACTM$REGION)
PC_ACTM[!duplicated(PC_ACTM$POSTCODE, fromLast = TRUE),]
#> POSTCODE REGION
#> 2 3000 SPLIT
#> 3 3256 METRO
#> 4 3145 METRO
Created on 2022-04-07 by the reprex package (v2.0.1)
Consider using ave to compute a sequential count and a total count by group, use ifelse to replace the value for any group with a count over 1, and then subset to the last row of each group. The code below uses the native pipe |>, available in base R 4.1.0+:
test <- within(
  PC_ACTM, {
    # seq_along(POSTCODE) indexes the rows of PC_ACTM (test does not exist yet at this point)
    PC_SEQ <- ave(seq_along(POSTCODE), POSTCODE, FUN=seq_along)
    PC_COUNT <- ave(seq_along(POSTCODE), POSTCODE, FUN=length)
    REGION <- ifelse(
      (PC_SEQ == PC_COUNT) & (PC_COUNT > 1), "SPLIT", REGION
    )
  }
) |> subset(
  subset = PC_SEQ == PC_COUNT, # SUBSET ROWS
  select = c(POSTCODE, REGION) # SELECT COLUMNS
) |> `row.names<-`(NULL) # RESET ROW NAMES
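Both answers above mark any repeated postcode as SPLIT, including one that appears twice with the same region. If that distinction matters, a base R sketch that flags only postcodes with more than one distinct region might look like the following. The name n_regions is illustrative, and the extra 3145 row is a hypothetical addition to show the same-region case.

```r
# Example data, plus one hypothetical repeated row (3145 / METRO twice)
PC_ACTM <- data.frame(
  POSTCODE = c(3000L, 3000L, 3256L, 3145L, 3145L),
  REGION   = c("METRO", "REGIONAL", "METRO", "METRO", "METRO")
)

# Number of distinct regions per postcode, repeated on every row of the group
n_regions <- ave(
  seq_along(PC_ACTM$POSTCODE), PC_ACTM$POSTCODE,
  FUN = function(i) length(unique(PC_ACTM$REGION[i]))
)

# Only postcodes with conflicting regions become SPLIT
PC_ACTM$REGION <- ifelse(n_regions > 1, "SPLIT", PC_ACTM$REGION)

# Rows within a postcode are now identical, so unique() collapses them
result <- unique(PC_ACTM)
rownames(result) <- NULL
result
```

Here 3000 becomes SPLIT, while 3145 collapses to a single METRO row.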
I don't understand spatial data at all. I have been studying it, but I'm missing something.
What I have: data.frame enterprises with the columns: id, parent_subsidiary, city_cod.
What I need: the mean and the max distance from the parent's city to the subsidiary cities.
Ex:
id | mean_dist | max_dist
1111 | 25km | 50km
232 | 110km | 180km
333 | 0km | 0km
What I did :
library("tidyverse")
library("sf")
# library("brazilmaps") not working anymore
library("geobr")
parent <- enterprises %>% filter(parent_subsidiary==1)
subsidiary <- enterprises %>% filter(parent_subsidiary==2)
# Cities - polygons
m_city_br <- read_municipality(code_muni="all", year=2019)
# or shp_city<- st_read("/BR_Municipios_2019.shp")
# data.frame with the column geom
map_parent <- left_join(parent, m_city_br, by=c("city_cod"="code_muni"))
map_subsidiary <- left_join(subsidiary, m_city_br, by=c("city_cod"="code_muni"))
st_distance(map_parent$geom[1],map_subsidiary$geom[2]) %>% units::set_units(km)
# it took a long time and the result is different from google.maps
# is it ok?!
# To do by ID -- I also got stuck here
distance_p_s <- data.frame(id=as.numeric(),subsidiar=as.numeric(),mean_dist=as.numeric(),max_dist=as.numeric())
id_v <- as.vector(parent$id)
for (i in 1:length(id_v)){
test_p <- map_parent %>% filter(id==id_v[i])
test_s <- map_subsidiary %>% filter(id==id_v[i])
total <- 0
value <- 0
max <- 0
l <- 0
l <- nrow(test_s)
for (j in 1:l){
value <- as.numeric(round(st_distance(test_p$geom[1],test_s$geom[j]) %>% units::set_units(km),2))
total <- total + value
ifelse(value>max,max<-value,NA)
}
mean_dist <- total/l
done <- data.frame(id=id_v[i],subsidiary=l,mean_dist=round(mean_dist,2),max_dist=max)
distance_p_s <- rbind(distance_p_s,done)
rm(done)
}
Is it right?
Can I calculate the centroid of the cities and then calculate the distance?
I noticed that the distance from code_muni==4111407 to code_muni==4110102 is 0, even though they are different cities (Imbituva, PR, Brasil and Ivaí, PR, Brasil). Why?
Data example: structure(list(id = c("1111", "1111", "1111", "1111", "232", "232", "232", "232", "3123", "3123", "4455", "4455", "686", "333", "333", "14112", "14112", "14112", "3633", "3633"), parent_subsidiary = c("1","2", "2", "2", "1", "2", "2", "2", "1", "2", "1", "2", "1", "2", "1", "1", "2", "2", "1", "2"), city_cod = c(4305801L,4202404L, 4314803L, 4314902L, 4318705L, 1303403L, 4304507L, 4314100L, 2408102L, 3144409L, 5208707L, 4205407L, 5210000L, 3203908L, 3518800L, 3118601L, 4217303L, 3118601L, 5003702L, 5205109L)), row.names = c(NA, 20L), class = "data.frame")
PS: these are Brazilian cities
https://github.com/ipeaGIT/geobr/tree/master/r-package
Great problem. I looked at it for a little while, then came back and looked some more after thinking about it. I did not calculate the mean; I only determined the distances from each parent to its subsidiaries.
The data were bound together: the cities data and the data frame data. The new df was then mutated to add the centroid data for each point on the surface.
The df was split by id, which resulted in a list of 8 df's. Each df contained one parent with its related subsidiaries (1:4, 1:3, 1:4, 1:2, ...).
A loop with a function cleaned up the 8 df's and calculated the distance from each parent to each subsidiary.
I checked the distances in the first df of the list against distances from a website. They were nearly identical.
The output is shown at [link]
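On the questions raised above: st_distance() between two polygons measures the distance between their nearest points, so two municipalities that touch return 0. That would explain the Imbituva and Ivaí result if the two share a border. Straight-line results will also differ from a route planner that reports road distance. A minimal base R sketch of the centroid idea follows, assuming the centroids are already available as lon/lat in degrees. The coordinates and ids are hypothetical, not real municipality data, and haversine_km is an illustrative helper, not an sf function.

```r
# Great-circle (haversine) distance in km between lon/lat points given in degrees
haversine_km <- function(lon1, lat1, lon2, lat2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Hypothetical centroids: 1 = parent, 2 = subsidiary
pts <- data.frame(
  id = c("A", "A", "A", "B", "B"),
  parent_subsidiary = c(1, 2, 2, 1, 2),
  lon = c(-51, -51, -50, -48, -48),
  lat = c(-30, -29, -30, -25, -25)
)

# Attach each subsidiary to its parent's centroid by id
parents <- pts[pts$parent_subsidiary == 1, ]
subs <- merge(pts[pts$parent_subsidiary == 2, ], parents,
              by = "id", suffixes = c("", "_parent"))
subs$dist_km <- haversine_km(subs$lon_parent, subs$lat_parent,
                             subs$lon, subs$lat)

# Mean and max distance per parent id
mean_dist <- tapply(subs$dist_km, subs$id, mean)
max_dist <- tapply(subs$dist_km, subs$id, max)
summary_df <- data.frame(
  id = names(mean_dist),
  mean_dist = as.numeric(mean_dist),
  max_dist = as.numeric(max_dist)
)
summary_df
```

A subsidiary in the same city as its parent (id B here) gets a distance of 0, which matches the 0km row in the desired output. The join-then-aggregate shape avoids the nested loops and the repeated rbind().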
I did something like this:
distance_p_s <- data.frame(id=character(),
                           qtd_subsidiary=numeric(),
                           dist_min=numeric(),
                           dist_media=numeric(),
                           dist_max=numeric())
# Named ids (not id) so that it is not masked by the id column inside filter()
ids <- as.vector(mparentid$id)
for (i in seq_along(ids)){
  message("Filtering id: ", ids[i], " (", i, " of ", length(ids), ")")
  teste_m <- mparentid %>% filter(id==ids[i]) %>% st_as_sf()
  teste_f <- msubsidiaryid %>% filter(id==ids[i]) %>% st_as_sf()
  # Reduce each city polygon to its centroid
  teste_f <- st_centroid(teste_f)
  teste_m <- st_centroid(teste_m)
  teste_f <- st_transform(teste_f, 4674)
  teste_m <- st_transform(teste_m, 4674)
  total <- 0
  min <- 0
  max <- 0
  l <- nrow(teste_f)
  for (j in seq_len(l)){
    message("Processing id: ", ids[i], " (", i, " of ", length(ids), "), subsidiary: ", j, " of ", l)
    # Distance in km from the parent centroid to the j-th subsidiary centroid
    value <- as.numeric(round(st_distance(teste_m$geom[1],teste_f$geom[j]) %>% units::set_units(km),2))
    total <- total + value
    if (value > max) max <- value
    if (j == 1 || value < min) min <- value
  }
  dist_med <- total/l
  done <- data.frame(id=ids[i],qtd_subsidiary=l,dist_min=min,dist_media=round(dist_med,2),dist_max=max)
  distance_p_s <- rbind(distance_p_s,done)
  message("Finished id: ", ids[i], " (", i, " of ", length(ids), "), subsidiaries: ", l)
  rm(done)
}
Probably this is not the best way, but it solved my problem for now.
I'm using quanteda to create dictionaries and look up terms.
Here is a reproducible example of my data:
dput(tweets[1:4, ])
structure(list(tweet_id = c("174457180812_10156824364270813",
"174457180812_10156824136360813", "174457180812_10156823535820813",
"174457180812_10156823868565813"), tweet_message = c("Climate change is a big issue",
"We should care about the environment", "Let's rethink environmental policies",
"#Davos WEF"
), date = c("2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000",
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"), group = c("1",
"2", "3", "4")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
Here is how I use my dictionary following a suggestion I got from this forum:
climate_corpus <- corpus(tweets, text_field = "tweet_message")
climatechange_dict <-
dictionary(list(climate = c("environment*", "climate change")))
groupeddfm <- tokens(climate_corpus) %>%
tokens_lookup(dictionary = climatechange_dict) %>%
dfm(groups = "group")
convert(groupeddfm, to = "data.frame")
What I need to do is to create a dummy in my original dataset "tweets" equal to 1 when tokens_lookup identifies a word included in my dictionary in one specific observation (tweet). Using my reproducible example, I would like to generate a dummy equal to 1 for the first three observations (they include dictionary words), and equal to 0 for the fourth one (no dictionary words).
I would really appreciate your help on this.
Many thanks!
library("quanteda")
## Package version: 2.0.1
tweets <- structure(
list(tweet_id = c(
"174457180812_10156824364270813",
"174457180812_10156824136360813", "174457180812_10156823535820813",
"174457180812_10156823868565813"
), tweet_message = c(
"Climate change is a big issue",
"We should care about the environment", "Let's rethink environmental policies",
"#Davos WEF"
), date = c(
"2019-03-25T23:03:56+0000", "2019-03-25T21:10:36+0000",
"2019-03-25T21:00:03+0000", "2019-03-25T20:00:03+0000"
), group = c(
"1",
"2", "3", "4"
)),
row.names = c(NA, -4L), class = c(
"tbl_df",
"tbl", "data.frame"
)
)
climate_corpus <- corpus(tweets, text_field = "tweet_message")
climatechange_dict <-
dictionary(list(climate = c("environment*", "climate change")))
groupeddfm <- tokens(climate_corpus) %>%
tokens_lookup(dictionary = climatechange_dict) %>%
dfm(groups = "group")
tweets$mentions_climate <- as.logical(groupeddfm[, "climate"])
tweets
## # A tibble: 4 x 5
## tweet_id tweet_message date group mentions_climate
## <chr> <chr> <chr> <chr> <lgl>
## 1 174457180812_1015… Climate change is a b… 2019-03-25T2… 1 TRUE
## 2 174457180812_1015… We should care about … 2019-03-25T2… 2 TRUE
## 3 174457180812_1015… Let's rethink environ… 2019-03-25T2… 3 TRUE
## 4 174457180812_1015… #Davos WEF 2019-03-25T2… 4 FALSE
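Note that the answer above lines up with the rows of tweets only because every tweet has its own group value. As a rough cross-check that does not rely on grouping, here is a base R sketch that builds a 0/1 dummy per tweet with a regular expression. This approximates the dictionary with a hand-written pattern and is not quanteda's tokens_lookup(); the pattern and the trimmed tweets data frame are assumptions for illustration.

```r
# Trimmed-down copy of the example data (message column only)
tweets <- data.frame(
  tweet_message = c(
    "Climate change is a big issue",
    "We should care about the environment",
    "Let's rethink environmental policies",
    "#Davos WEF"
  ),
  stringsAsFactors = FALSE
)

# Hand-written stand-in for the dictionary entries "environment*" and "climate change"
pattern <- "\\benvironment\\w*|\\bclimate change\\b"

# 1 if the tweet matches any dictionary-like pattern, 0 otherwise
tweets$mentions_climate <- as.integer(
  grepl(pattern, tweets$tweet_message, ignore.case = TRUE, perl = TRUE)
)
tweets
```

The first three tweets get 1 and the last gets 0, matching the TRUE/FALSE column in the answer above.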
I believe this is fairly simple, although I am new to R and to coding. I have a dataset with a single row for each rodent trap site. There were, however, 8 trapping occasions over 4 years. What I wish to do is expand the trap site data and append a number from 1 to 8 to each row.
I can then label the rows with the trap visit for a subsequent join with the trap data obtained.
I have managed to replicate the rows with the following code. However, while the row names are expanded to 1, 1.1 ... 1.7, 2, 2.1 ... 2.7 etc., I cannot figure out how to convert this into a usable column-based ID.
structure(list(TrapCode = c("IA1sA", "IA2sA", "IA3sA", "IA4sA",
"IA5sA"), Y = c(-12.1355987315, -12.1356879776, -12.1357664998,
-12.1358823313, -12.1359720852), X = c(-69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532)), row.names = c(NA,
5L), class = "data.frame")
gps_1 <- gps_1[rep(seq_len(nrow(gps_1)), 3), ]
gives
"IA5sA", "IA1sA", "IA2sA", "IA3sA", "IA4sA", "IA5sA", "IA1sA",
"IA2sA", "IA3sA", "IA4sA", "IA5sA"), Y = c(-12.1355987315, -12.1356879776,
-12.1357664998, -12.1358823313, -12.1359720852, -12.1355987315,
-12.1356879776, -12.1357664998, -12.1358823313, -12.1359720852,
-12.1355987315, -12.1356879776, -12.1357664998, -12.1358823313,
-12.1359720852), X = c(-69.1335789865, -69.1335225279, -69.1334668485,
-69.1333847769, -69.1333226532, -69.1335789865, -69.1335225279,
-69.1334668485, -69.1333847769, -69.1333226532, -69.1335789865,
-69.1335225279, -69.1334668485, -69.1333847769, -69.1333226532
)), row.names = c("1", "2", "3", "4", "5", "1.1", "2.1", "3.1",
"4.1", "5.1", "1.2", "2.2", "3.2", "4.2", "5.2"), class = "data.frame")
I have a column with Trap_ID currently being a unique identifier. I hope that after the replication I could append an iteration number to this to keep it as a unique ID.
For example:
Trap_ID
IA1sA.1
IA1sA.2
IA1sA.3
IA2sA.1
IA2sA.2
IA2sA.3
Simply use a cross join (i.e., join with no by columns to return a cartesian product of both sets):
mdf <- merge(data.frame(Trap_ID = 1:8), trap_side_df, by=NULL)
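To get from the cross join to the unique ID asked for in the question, the visit number can be pasted onto the trap code. A minimal sketch, assuming the site column is called TrapCode as in the dput output and using 3 visits and 2 sites to keep the output short. The names visits, Visit and expanded are illustrative.

```r
# Two of the trap sites from the example data
gps_1 <- data.frame(
  TrapCode = c("IA1sA", "IA2sA"),
  Y = c(-12.1355987315, -12.1356879776),
  X = c(-69.1335789865, -69.1335225279)
)

# One row per visit; use 1:8 for the real data
visits <- data.frame(Visit = 1:3)

# by = NULL gives the cartesian product: every site paired with every visit
expanded <- merge(gps_1, visits, by = NULL)

# Sort by site then visit, and build the unique ID
expanded <- expanded[order(expanded$TrapCode, expanded$Visit), ]
expanded$Trap_ID <- paste(expanded$TrapCode, expanded$Visit, sep = ".")
rownames(expanded) <- NULL
expanded
```

Keeping Visit as its own column as well makes the later join with the trap data straightforward.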
I need to manipulate the raw data (csv) to a wide format so that I can analyze in R or SPSS.
It looks something like this:
1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99
Ideally it would look like:
ID,age,race,scale_total, etc
1, 30, black, 35
2, 20, white, 99
I added values to the top row of the raw data (ID, Question, Response) and tried the cast function but I believe this aggregated data instead of just transforming it:
data_mod <- cast(raw.data2, ID~Question, value="Response")
Aggregation requires fun.aggregate: length used as default
You could use tidyr...
library(tidyr)
df<-read.csv(text="1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99", header=FALSE, stringsAsFactors=FALSE)
df %>% spread(key=V2,value=V3)
V1 age race scale_total
1 1 30 black 35
2 2 20 white 99
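If adding a package is not an option, the same reshaping can be sketched in base R with reshape(). The column names ID, Question and Response follow the headers mentioned in the question. The prefix clean-up and the type.convert() step are additions for illustration, the latter so that age and scale_total come out numeric rather than as text.

```r
long <- read.csv(text = "1,age,30
1,race,black
1,scale_total,35
2,age,20
2,race,white
2,scale_total,99",
  header = FALSE,
  col.names = c("ID", "Question", "Response"),
  stringsAsFactors = FALSE
)

# One row per ID, one column per Question
wide <- reshape(long, idvar = "ID", timevar = "Question", direction = "wide")

# reshape() prefixes the new columns with "Response."; strip that prefix
names(wide) <- sub("^Response\\.", "", names(wide))

# Response was read as text; convert columns that hold only numbers
wide <- type.convert(wide, as.is = TRUE)
rownames(wide) <- NULL
wide
```

Unlike cast() with its default aggregation, reshape() keeps the first value it finds for a duplicated ID/Question pair and issues a warning, so duplicates in the full data set are still worth checking first.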
We need to create a sequence column to take care of the duplicate rows, which by default cause dcast to aggregate with length:
library(data.table)
dcast(setDT(df1), ID + rowid(Question) ~ Question, value.var = 'Response')
NOTE: The example data works (giving the expected output) without the sequence column:
dcast(setDT(df1), ID ~ Question)
# ID age race scale_total
#1: 1 30 black 35
#2: 2 20 white 99
So the aggregation message only appears when the code is applied to the full dataset, which contains duplicate rows.
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Question = c("age",
"race", "scale_total", "age", "race", "scale_total"), Response = c("30",
"black ", "35", "20", "white", "99")), class = "data.frame",
row.names = c(NA, -6L))
For SPSS:
data list list/ID (f5) Question Response (2a20).
begin data
1 "age" "30"
1 "race" "black"
1 "scale_total" "35"
2 "age" "20"
2 "race" "white"
2 "scale_total" "99"
end data.
casestovars /id=id /index=question.
Note that the resulting variables age and scale_total will be string variables - you'll have to turn them into numbers before further transformations:
alter type age scale_total (f8).