Merging Dataframes based on values - r

Apologies if I lack enough info in the question; this is my first time posting here.
I have two data frames: one with 12,000 rows (GPS) and a second with 196 rows (Details).
The GPS data frame has repeated values in its "names" column.
The Details data frame has a "position" column and a name column, with a different position value for each name.
I need the GPS df to gain a "position" column that pulls from Details$position, repeating the value each time a name appears.
I tried to do this by creating a list of the names and then using a combination of setDT and setDF, adapting a line of code given to me by someone attempting something similar:
Weigh_in_check <- setDF(setDT(Weigh_in_check)[setDT(Weight_first),
  Weight_initial := Weight_first$Weight, on = c("Name")])
However, I cannot change it around to work for my case:
Name_check <- setDF(setDT(Name_check)[setDT(GPSReview2),
  Position := PlayerDetails$Position, on = c("Player Name")])
New code following a comment by Flo.P:
GPSReview4[, "Position"] <- NA
GPSReview4$Position <- as.character(GPSReview4$Position)
GPSReview4$Position <- left_join(GPSReview4, PlayerDetails, by = "Position")
Which gives the following error:
Error in `$<-.data.frame`(`*tmp*`, Position, value = list(`Full session` = c("Yes", :
  replacement has 132235 rows, data has 26447
EDIT: These are the two data frames, GPSReview4 and Detail (screenshots omitted).
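No answer was posted, but the error above comes from assigning the entire result of left_join() (a whole data frame) to a single column. A minimal sketch of the likely fix, assuming the shared key column is called "Player Name" in both data frames as the question suggests:

```r
library(dplyr)

# Keep only the key column and the column to pull across, then join:
# every GPS row whose "Player Name" matches gets the Position repeated.
GPSReview4 <- left_join(GPSReview4,
                        PlayerDetails[, c("Player Name", "Position")],
                        by = "Player Name")
```

Joining on the name (rather than on "Position") is what makes the position value repeat for every occurrence of a name in the larger data frame.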

Related

change all columns based on the maximum value in each row

I have a large data frame which has multiple columns calculated from other columns. The issues come where there are values of 8888 and 9999 which constitute NA or refused to answer respectively. These values have been incorrectly used to calculate other columns (such as the value of pricepergram) as they have not been signaled as NA prior to calculation.
I'm not able to recalculate all the values, so instead I would like to find some code, which takes in as an argument each row of the dataframe. If the maximum value in the row is above 8887, then I would like it to return the row but with the value of all prices set to NA.
The solution needs to be applicable to a data frame of 250 columns; I need to apply the code across multiple columns, rather than just one.
I have confirmed that the only values above 8887 in the dataframe are indeed either 9999 or 8888 and therefore constitute values that we want to change.
I am not able to post the dataset due to data protection (apologies), but have given an example of minimum complexity to illustrate my point.
This would be the ideal output:
The rows with values above 8887 have had their price set to NA.
We can break this problem into two steps:
find out if there are any 8888 or 9999 codes in a row
set values in the row to NA
Step 1: The following code produces an indicator for whether a row contains any codes greater than 8887:
any_large_codes = apply(df, MARGIN = 1, function(row){any(row > 8887)})
It works as follows: apply treats the dataframe as a matrix. MARGIN = 1 means that the function is applied to each row of the matrix. function(row){any(row > 8887)} checks if any value in its input (each row) is larger than 8887.
I have not used dplyr for this as I am not aware of any row-wise operators in dplyr. This seems the best option. You can use dplyr to add it into the dataframe if you wish, but this is not necessary:
df = df %>% mutate(na_indicator = any_large_codes)
Step 2: The following code sets the values in a single column to NA where there are any large codes:
df = df %>%
mutate(this_one_column = ifelse(any_large_codes, NA, this_one_column))
If you want to handle multiple columns, I would suggest something like this:
all_columns_to_handle = c(
  "col1",
  "col2",
  "col3",
  ...
)
for (cc in all_columns_to_handle) {
  df = df %>%
    mutate(!!sym(cc) := ifelse(any_large_codes, NA, !!sym(cc)))
}
Where !!sym(cc) is a way to use the column name stored in cc and := is equivalent to = but allows us to use !!sym(cc) on the left-hand side. For other options to this approach see the programming with dplyr vignette.
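On dplyr 1.0 and later, the same multi-column replacement can also be written without the loop using across(); a minimal sketch with made-up column names:

```r
library(dplyr)

df <- data.frame(price1 = c(10, 8888, 20), price2 = c(5, 15, 9999))

# Indicator: does any value in the row exceed 8887?
any_large_codes <- apply(df, MARGIN = 1, function(row) any(row > 8887))

# Set the listed price columns to NA wherever the row contained a code.
df <- df %>%
  mutate(across(c(price1, price2), ~ ifelse(any_large_codes, NA, .x)))
```

With 250 columns, the vector of names can be passed to across() via all_of() instead of listing them.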

dplyr mutate grouped data without using exact column name

I'm trying to write a function to process multiple similar datasets. Here I want to subtract the score a subject obtained in the previous interview from the score the same subject obtained in the second interview. In all the datasets I want to process, the score of interest is stored in the second column. Writing this for one specific dataset is simple: just use the exact column name and everything goes fine.
d <- a %>%
arrange(by_group=interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = colname_2nd-lag(colname_2nd))
But since I need a generic function that can process multiple datasets, I cannot use an exact column name. So I tried 3 approaches; all of them only altered the last line.
Approach#1:
dplyr::mutate(score_change = dplyr::vars(2)-lag(dplyr::vars(2)))
Approach#2:
Second column name of interested dataset contains a same string ,so I tried
dplyr::mutate(score_change = dplyr::vars(matches('string'))-lag(dplyr::vars(matches('string'))))
Error messages of the above 2 approaches will be
Error in dplyr::vars(2) - lag(dplyr::vars(2)) :
non-numeric argument to binary operator
Approach#3:
dplyr::mutate(score_change = .[[2]]-lag(.[[2]]))
Error message:
Error: Column `score_change` must be length 2 (the group size) or one, not 10880
10880 is the row number of my sample dataset, so it look like group_by does not work in this approach
Does anyone know how to make the function perform in the desired way?
If you want to refer to columns by position, use cur_data()[[2]] to refer to the 2nd column of the data frame.
library(dplyr)
d <- a %>%
arrange(interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = cur_data()[[2]]-lag(cur_data()[[2]]))
Also note that cur_data() doesn't count the grouping columns, so if subjectkey is the first column in your data and colname_2nd is the second, you may need cur_data()[[1]] instead once you group_by.
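A small reproducible sketch of that answer (the data and column names are made up; on dplyr >= 1.1, pick(everything()) is the suggested replacement for the superseded cur_data()):

```r
library(dplyr)

a <- data.frame(
  subjectkey     = c("s1", "s1", "s2", "s2"),
  score          = c(10, 14, 20, 17),
  interview_date = as.Date(c("2020-01-01", "2020-06-01",
                             "2020-01-01", "2020-06-01"))
)

# After group_by(subjectkey), cur_data() drops the grouping column,
# so [[1]] is the score column (the 2nd column of the original data).
d <- a %>%
  arrange(interview_date) %>%
  group_by(subjectkey) %>%
  mutate(score_change = cur_data()[[1]] - lag(cur_data()[[1]])) %>%
  ungroup()
```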

Assigning Unnamed Columns To Another DataFrame

I'm in a very basic class that introduces R for genetic purposes. I'm encountering a rather peculiar problem in trying to follow the instructions given. Here is what I have along with the instructor's notes:
MangrovesRaw<-read.csv("C:/Users/esteb/Documents/PopGen/MangrovesSites.csv")
#i'm going to make a new dataframe now, with one column more than the mangrovesraw dataframe but the same number of rows.
View(MangrovesRaw)
Mangroves<-data.frame(matrix(nrow = 528, ncol = 23))
#next I want you to name the first column of Mangroves "pop"
colnames(Mangroves)<-c(col1="pop")
#i'm now assigning all values of that column to be 1
Mangroves$pop<-1
#assign the rest of the columns (2 to 23) to the entirety of the MangrovesRaw dataframe
#then change the names to match the mangroves raw names
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
I'm not really sure how to assign columns that haven't been named using the $ operator as we have in the past. A friend suggested I first run
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw
#X338 is the name of the first column from MangrovesRaw
But while this does transfer the data from MangrovesRaw, it comes at the cost of mangling my column names, with "X338." prepended to every subsequent column. In an attempt to fix this I found the following "fix":
colnames(Mangroves)[2:23]<-colnames(MangrovesRaw)
Mangroves$X338<-MangrovesRaw[,2]
#Mangroves$X338<-MangrovesRaw[,2:22]
#MangrovesRaw has 22 columns in total
While this transferred all the data I needed for the X338 column, it didn't transfer any data to the remaining 21 columns. The commented-out code just reproduces the problem of "X338." showing up in all my column names.
What am I doing wrong?
There are a few ways to solve this problem. It may be that your instructor wants it done a certain way, but here's one simple solution: just cbind() the pop column with the real data. The data and column names are then added in one step (naming the column in the call keeps it from being labelled "Mangroves$pop"):
Mangroves <- cbind(pop = Mangroves$pop, MangrovesRaw)
Here's another way:
Mangroves[, 2:23] <- MangrovesRaw
colnames(Mangroves)[2:23] <- colnames(MangrovesRaw)
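A toy sketch of the second approach, with made-up data and 3 raw columns standing in for the real 22:

```r
# Stand-in for MangrovesRaw (column names invented for illustration).
MangrovesRaw <- data.frame(X338 = 1:4, X340 = 5:8, X342 = 9:12)

# Empty frame with one extra column, same number of rows.
Mangroves <- data.frame(matrix(nrow = 4, ncol = 4))
colnames(Mangroves)[1] <- "pop"
Mangroves$pop <- 1

# Copy the raw data into columns 2 onward, then copy the names across.
Mangroves[, 2:4] <- MangrovesRaw
colnames(Mangroves)[2:4] <- colnames(MangrovesRaw)
```

Assigning a block of columns with `[, 2:4] <-` avoids the "X338." prefix problem, because only the values are copied, not the source data frame's structure.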

Fill columns with data via 2 defined parameters

I have a sample working data set (called df) which I have added columns to in R, and I would like to fill these columns with data according to very specific conditions.
I ran samples in a lab with 8 different variables, and always ran each sample with each variable twice (sample column). From this, I calculated an average result, called Cq_mean.
The columns I have added in R below refer to each variable name.
I would like to fill these columns with positive or negative based on 2 conditions :
Variable
Cq_mean
As you see with my code below, I am able to create positive or negative results based on Cq_mean, however this logically runs it over the entire dataset, not taking into account variable as well, and it fills in cells with data that I would like to remain empty. I am not sure how to ask R to take these two conditions into account at the same time.
positive: Cq_mean <= 37.1
negative: Cq_mean >= 37
Helpful information:
Under sample, the data is always separated by a dash (-) with sample number in front, and variable name after. Somehow I need to isolate what comes after the dash.
Please refer to my desired results table to visualize what I am aiming for.
df <- read.table("https://pastebin.com/raw/ZPJS9Vjg", header=T,sep="")
# add columns, one per variable name
df$TypA <- ""
df$TypB <- ""
df$TypC <- ""
df$RP49 <- ""
df$RPS5 <- ""
df$H20 <- ""
df$F1409B <-""
df$F1430A <- ""
# fill columns with data
df$TypA <- ifelse(df$Cq_mean>=37.1,"negative", 'positive')
df$TypB <- ifelse(df$Cq_mean>=37.1,"negative", 'positive')
# ...and continue through each variable
desired results (subset of entire dataset done by hand in excel):
desired_outcome <- read.table("https://pastebin.com/raw/P3PPbiwr", header = T, sep="\t")
Something like this will do the trick:
df$TypA[grepl('TypA', df$sample1)] <- ifelse(
  df$Cq_mean[grepl('TypA', df$sample1)] >= 37.1, 'neg', 'pos')
You'll need to do this once per new column you want.
The grepl will filter out only the rows where your string of choice (here TypA) is present in the sample variable.
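Since the same pattern repeats for every variable, it can be wrapped in a loop; a sketch assuming the new column names match the variable strings after the dash in `sample1`:

```r
vars <- c("TypA", "TypB", "TypC", "RP49", "RPS5", "H20", "F1409B", "F1430A")

for (v in vars) {
  rows <- grepl(v, df$sample1)            # rows whose sample contains this variable
  df[[v]][rows] <- ifelse(df$Cq_mean[rows] >= 37.1, "negative", "positive")
}
```

Rows that don't match a variable are left untouched, so those cells stay empty as required.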

R: replacement has [x] rows, data has [y] rows

I have a dataframe df with a column id, and a list file_list. I want to create a new column based on the position in the list given by the value in the id column.
df$new <- ""
for (i in df$id) {
  df$new[i] <- file_list[as.numeric(df$id[i])]
}
I am getting an error similar to "replacement has x rows, data has y rows". I searched and found the below, and initialized my new column, but am still getting the error message.
R Error - replacement has [x] rows, data has [y]
Alternatively, if I can simply replace the id column with the new value from file_list that would work as well.
I'm sure I'm missing something simple, it has been a few years since I touched R. Thanks in advance for pointers.
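No answer was posted; since df$id already gives the positions to take from file_list, the loop can be replaced by a single vectorised lookup. A sketch with made-up data, assuming every id is a valid position in file_list:

```r
file_list <- list("a.csv", "b.csv", "c.csv")   # stand-in for the real list
df <- data.frame(id = c(3, 1, 2))

# Index the list by id, then flatten the result to a character vector.
df$new <- unlist(file_list[as.numeric(df$id)])
```

The original loop fails because `for (i in df$id)` iterates over id *values*, not row positions, so `df$new[i]` and `df$id[i]` index the wrong rows (and a list element can't be assigned into a character column directly).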