I have a dataset containing a weight column, which I would like to subset while adjusting these weights to keep it representative of the original dataset.
Let's say I have the data frame:
df <- data.frame(Age = c(10, 20, 30, 25, 50, 60, 40),
                 Country = c("Germany", "Germany", "Germany", "China", "China", "China", "China"),
                 Class = c("A", "B", NA, NA, "B", "A", "A"),
                 Weight = c(1.1, 0.8, 1.2, 1.7, 0.7, 1.3, 0.9))
I would like to remove the rows where Class is NA, and update the Weight column so that the sample stays representative of the original dataset with respect to Age and Country. (The data frame above may be too small for such a question, but it is just for illustration.)
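A minimal post-stratification sketch in base R, under the assumption that matching the Country weight totals is sufficient (in a real analysis you would typically bin Age and include it in the cells as well, or use the survey package's rake()/postStratify()): drop the NA rows, then rescale the remaining weights within each cell so the cell's total matches the original.

```r
df <- data.frame(Age = c(10, 20, 30, 25, 50, 60, 40),
                 Country = c("Germany", "Germany", "Germany",
                             "China", "China", "China", "China"),
                 Class = c("A", "B", NA, NA, "B", "A", "A"),
                 Weight = c(1.1, 0.8, 1.2, 1.7, 0.7, 1.3, 0.9))

# Keep only rows with a known Class
sub <- df[!is.na(df$Class), ]

# Total weight per Country, before and after subsetting
orig_tot <- tapply(df$Weight, df$Country, sum)
sub_tot <- tapply(sub$Weight, sub$Country, sum)

# Rescale so each Country's weight total matches the original
sub$Weight <- as.numeric(sub$Weight * orig_tot[sub$Country] / sub_tot[sub$Country])
```

Since every Country cell keeps its original weight total, the overall total is preserved as well.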
If you want to fill NA values with the column means (or any other specific value) in a pandas DataFrame, you can simply run:
df.fillna(df.mean())
More generally, df.fillna(value) replaces NaN values in a pandas DataFrame with whatever value you pass.
I have data that includes four essential elements: ID, Time, exposure, outcome
I want a scatterplot of my exposure against my outcome, but the timepoint of interest for the exposure differs from the timepoint of interest for the outcome, so some IDs have no assessment at the outcome timepoint. What I want is a subset of the data with one row per ID, containing the exposure at time_1 and the outcome at time_3; if an ID has no assessment at time_3, it should still be included, with the value NA. The issue is that if a timepoint was not assessed, the corresponding row for that ID does not exist in the data in the first place.
Here is an example of the data:
ID <- c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4)
exposure <- c(1.2, 1.3, 1.4, 1.5, 2.1, 2.2, 3.2, 4.2, 5.2, 6.2)
outcome <- c(0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 2.1, 3.1)
Time <- c("time_1", "time_2", "time_1", "time_2", "time_3",
          "time_1", "time_2", "time_3", "time_1", "time_2")
data <- data.frame(ID, exposure, outcome, Time)
I am doing this because the scatter plot is cross-sectional: if I just plot per ID based on time, the plot would be empty, since no single row pairs an exposure at time_1 with an outcome at time_3. That is why I need to create a subset of the data and make the pairs myself.
I tried these codes so far:
# so you see the empty cells and the reason of getting an empty plot
df <- data |> pivot_wider(names_from = Time, values_from = c(exposure, outcome))
# subsetting the data to only my desired time points (this helps me see, in my
# actual data, which IDs do not have an assessed time point)
df1 <- data %>%
  group_by(ID) %>%
  filter(Time == "time_1" | Time == "time_3") %>%
  ungroup()
# And eventually subsetting the data based on different timepoints, to then
# merge them together
df2 <- filter(data, Time == "time_1")
df3 <- filter(data, Time == "time_3")
But with the last approach the two datasets have different sizes, and besides that, it is clinically important for me to show that, for instance, for ID = 1 the outcome is NA at time_3, so I don't want to subset only to those IDs with both values available.
So the dataset that I want to eventually have, needs to have the following structure:
ID exposure_time_1 outcome_time_3
----------------------------------
1 1.2 NA
2 1.4 0.4
3 2.2 0.1
4 5.2 NA
Does anyone have a solution for this?
You almost have it. Just select the columns after you pivot_wider.
df %>%
select(ID, exposure_time_1, outcome_time_3) %>%
filter(!is.na(exposure_time_1) | !is.na(outcome_time_3))
Your dataset didn't need it here, but I added the filter to make sure that at least one of the last two columns is non-missing. Perhaps you actually want filter(!is.na(outcome_time_3)) instead.
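For completeness, a self-contained version of the whole pipeline, rebuilding the example data inline; the result matches the four-row table in the question:

```r
library(dplyr)
library(tidyr)

data <- data.frame(
  ID = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4),
  exposure = c(1.2, 1.3, 1.4, 1.5, 2.1, 2.2, 3.2, 4.2, 5.2, 6.2),
  outcome = c(0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 2.1, 3.1),
  Time = c("time_1", "time_2", "time_1", "time_2", "time_3",
           "time_1", "time_2", "time_3", "time_1", "time_2")
)

# One row per ID; pivot_wider fills missing timepoints with NA automatically
res <- data %>%
  pivot_wider(names_from = Time, values_from = c(exposure, outcome)) %>%
  select(ID, exposure_time_1, outcome_time_3)
```

IDs 1 and 4 keep their rows, with NA in outcome_time_3, exactly as the question asked.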
I've got a dataset of species observations over time, and I am trying to calculate observation dates based on the max value of a criterion:
Df <- data.frame(Sp = c(1,1,2,2,3,3),
                 Site = c("A", "B", "C", "D"),
                 date = c('2021-1-1','2021-1-2','2021-1-3','2021-1-4',
                          '2021-1-5','2021-1-6',"2021-03-01","2021-03-05"),
                 N = c(2,5,9,4,14,7,3,11))
I want to create a new column, Dmax, showing the date on which the value of N for a Sp at a given Site was max, so the column would look something like this:
Dmax=c("2021-1-2", "2021-1-2", '2021-1-2', '2021-1-2', '2021-1-5', '2021-1-5', "2021-03-05","2021-03-05")
So Dmax would show that for Sp 1 in site A the date in which N was max was "2021-1-2" and so on.
I've tried grouping by Site, Sp, and date and using mutate together with which.max(N), but it didn't work. I'd like to keep all my rows.
Any help is welcome.
Thanks!
From your desired output, it seems like you want the max date regardless of Site, so just group by Sp. Also, your sample data only has 6 entries for Sp instead of 8, so I assumed a 4th Sp:
Df |>
group_by(Sp) |>
mutate(Dmax = date[which.max(N)])
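For a self-contained check, here is the same idea with the sample data filled out to 8 rows; the Site and date values for the 4th Sp are assumptions, following the answer's guess:

```r
library(dplyr)

Df <- data.frame(
  Sp = c(1, 1, 2, 2, 3, 3, 4, 4),
  Site = c("A", "A", "B", "B", "C", "C", "D", "D"),
  date = c("2021-1-1", "2021-1-2", "2021-1-3", "2021-1-4",
           "2021-1-5", "2021-1-6", "2021-03-01", "2021-03-05"),
  N = c(2, 5, 9, 4, 14, 7, 3, 11)
)

# Within each Sp, take the date of the row where N is largest;
# mutate() keeps all rows, repeating that date down the group
res <- Df %>%
  group_by(Sp) %>%
  mutate(Dmax = date[which.max(N)]) %>%
  ungroup()
```

Note that Sp 2 then gets Dmax "2021-1-3" (where N = 9), not "2021-1-2" as in the question's example vector, which looks like a typo there.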
I have a large data frame of over 122,000 rows and 60 columns; simplified, this is what it looks like:
structure(list(mz = c(40, 50, 60, 70, 80, 90),
               `sample 1` = c(NA, 51, NA, NA, 675, 12),
               `sample 2` = c(NA, 51, NA, NA, 2424, 5),
               `Sample 3` = c(NA, 51, NA, 300, 1241, NA),
               `Blank Average` = c(10, 20, 50, 78, NA, 0.00333333)),
          row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))
What I want to do: the function I am writing should create a new data frame in which a row is removed when ALL SAMPLE COLUMNS are NA.
I tried subsetting the entirety of sample columns first:
sample_cols <- grep("sample", names(dataframe),ignore.case = TRUE)
Then, in order to delete rows only when ALL of these subsetted sample columns are NA, I tried:
na.omit() — this does not work, as it deletes rows containing even a single NA, not only the rows where all sample values are NA.
I also tried:
Sample_cols_df <- dataframe[sample_cols]  # sample_cols are all the sample columns
Row_filtered <- Sample_cols_df[rowSums(is.na(Sample_cols_df)) != ncol(Sample_cols_df), ]
But I did not really understand this solution too well, as I'm unfamiliar with rowSums() and still new to R. I did end up with the right rows deleted with this code, BUT this method also removed the non-sample columns in the process of making it work.
**In short:** I need to subset the sample columns; when all of the sample columns are NA, that row should be filtered out.
When only PART of the sample values in a row are NA, the row should NOT be removed.
The columns other than the sample columns should not be removed in the process: I want to end up with exactly the same data frame layout, just with the rows where all sample columns are NA removed.
For reference: in my example data frame above, rows 1 and 3 should be removed, as all their sample values are NA, even though mz and Blank Average are not. Row 4, for example, should not be removed, as one of its sample values has a result rather than NA.
I already noticed a lot of topics on this on StackOverflow, but after a day of searching and trying, I can't seem to find a topic that exactly matches what I want to do. In case anyone has any ideas please let me know!
We can use rowSums() on the non-NA indicator to keep only the rows with at least one non-NA sample value:
df1[rowSums(!is.na(df1[sample_cols])) > 0, ]
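An equivalent dplyr sketch, if you prefer filter(): if_any() with matches() (case-insensitive by default) keeps a row when at least one sample column is non-NA, and all other columns stay untouched. The example tibble is rebuilt inline:

```r
library(dplyr)

df1 <- data.frame(
  mz = c(40, 50, 60, 70, 80, 90),
  `sample 1` = c(NA, 51, NA, NA, 675, 12),
  `sample 2` = c(NA, 51, NA, NA, 2424, 5),
  `Sample 3` = c(NA, 51, NA, 300, 1241, NA),
  `Blank Average` = c(10, 20, 50, 78, NA, 0.00333333),
  check.names = FALSE
)

# Keep rows where at least one column whose name starts with "sample"
# (any case) is non-NA; mz and Blank Average are untouched
res <- df1 %>% filter(if_any(matches("^sample"), ~ !is.na(.x)))
```

Rows 1 and 3 (mz 40 and 60) drop out; all five columns survive.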
Closed. This question needs debugging details. It is not currently accepting answers.
Closed 4 years ago.
I am trying to create a new data frame (or new columns) with averages based on groupings in another column. This will be best explained with some examples:
Data Example
So in the data example I have Ports 1-5 and three variables (V2_IV, V3_IV, R2).
I would like an average of these variables for each hour, based on groupings of ports: Ports 1 and 2 as one average (a), and Ports 3, 4, and 5 as another average (b).
So to get something like this:
Results
*Note: the variable numbers given in the results are just for example, not the actual averages.
First we recreate your data in R so we can work with it:
data <- data.frame(Year = 2014, Month = 8, Day = 26,
Hour = c(9,9,9,9,9,10,10,10,10,10,11,11,11,11,11),
Port = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
DoY = 238,
Date = "8/26/2014",
Season = "Summer",
V2_IV = c(19.361, 19.676, 21.831, 20.692, 19.405, 19.597, 19.8935, 22.5585, 21.321, 20.8605, 19.919, 20.4825, 23.401, 22.093, 21.7965),
V3_IV = c(.872, NA, .826, NA, .868, .872, NA, .829, NA, .8665, .8715, NA, .8285, NA, .867),
R2 = c(.998676, .998901, .9923, .994796, .992848, .997106, .996422, .972802, .995367, .996529, .995808, .998653, .988912, .996155, .987083))
The code below now assigns the ports to the groups that you mentioned. If you wish to scale this code to incorporate more groups then you can just assign more groups. The idea here is that you need a column that tells you what group each observation is assigned to. You provided two groups so I just used the binary assignment of an ifelse statement:
a <- c(1,2)
b <- c(3,4,5)
data$Group <- ifelse(data$Port %in% a, "a", "b")
Now we just need to calculate the averages for those three variables. You had some missing entries in the V3_IV column, which I chose to input as NAs; to handle those missing values in the summarise_at function you have to specify na.rm = TRUE. If you fill in these values, then that part is unnecessary.
library("dplyr")
avgs <- data %>% group_by(Group, Date, Hour) %>%
summarise_at(.vars = vars(V2_IV, V3_IV, R2), mean, na.rm = TRUE)
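summarise_at() still works but is superseded in current dplyr; an equivalent sketch with across() (the data is rebuilt inline so the block runs on its own):

```r
library(dplyr)

data <- data.frame(
  Year = 2014, Month = 8, Day = 26,
  Hour = c(9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11),
  Port = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
  DoY = 238, Date = "8/26/2014", Season = "Summer",
  V2_IV = c(19.361, 19.676, 21.831, 20.692, 19.405, 19.597, 19.8935,
            22.5585, 21.321, 20.8605, 19.919, 20.4825, 23.401, 22.093, 21.7965),
  V3_IV = c(.872, NA, .826, NA, .868, .872, NA, .829, NA, .8665,
            .8715, NA, .8285, NA, .867),
  R2 = c(.998676, .998901, .9923, .994796, .992848, .997106, .996422,
         .972802, .995367, .996529, .995808, .998653, .988912, .996155, .987083)
)
data$Group <- ifelse(data$Port %in% c(1, 2), "a", "b")

# across() replaces summarise_at(); .groups = "drop" returns an ungrouped result
avgs <- data %>%
  group_by(Group, Date, Hour) %>%
  summarise(across(c(V2_IV, V3_IV, R2), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")
```

With two groups and three hours on one date, this yields six rows of averages.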
I have a data set with survey scores. I am summing the scores by row as follows:
d$total.score <- rowSums(d[, c("a", "b", "c", "d")], na.rm=TRUE)
I need to create another variable: the average score. If some of the variables are NA in a row (e.g., 3 + 1 + 4 + NA = 8), the number I need to divide by will not be 4 but 2 or 3. What function can I use to calculate this divisor?
Thank you!
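rowSums(!is.na(...)) counts the non-missing answers per row, which is exactly the divisor described, and rowMeans(..., na.rm = TRUE) computes the average directly without it. A small sketch with made-up scores (column names follow the question):

```r
d <- data.frame(a = c(3, 2), b = c(1, NA), c = c(4, NA), d = c(NA, 5))
items <- c("a", "b", "c", "d")

d$total.score <- rowSums(d[, items], na.rm = TRUE)
d$n.answered <- rowSums(!is.na(d[, items]))      # divisor: non-NA count per row
d$avg.score <- rowMeans(d[, items], na.rm = TRUE) # same as total.score / n.answered
```

For the first row (3 + 1 + 4 + NA), the total is 8, the divisor is 3, and the average is 8/3.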