I am currently working with a dataset of 551 observations and 141 variables. There are inevitably some mistakes made by the data entry operators, and I am now screening for and correcting those. The problem is that the ID numbers and the row numbers of the dataset do not correspond, and my screening only gives me the row number where the problematic data lies. Finding the matching ID number by hand is taking a lot of my time. Is there any way to get the ID number of the problematic data in one command?
Suppose the row number of ID B345 is #1, and for ID B346 the row is #2.
My dataset is presented like this-
ID S1 S2 S3 I30 I31 I34
B345 12 23 3 2 1 4
B346 15 4 4 3 2 4
I am using the following command on my original dataset and get the results below: row numbers 351 and 500, but their ID numbers are actually B456 and B643.
which(x$I30 == 0)
[1] 351 500
I am expecting to get the ID numbers in one command; that would be very helpful.
How about this?
x$ID[which(x$I30==0)]
We can just use the logical condition to subset the 'ID'
x$ID[x$I30 == 0]
I need to update all the values of a column, using another df as a reference.
The two dataframes have equal structures:
cod name dom_by
1 A 3
2 B 4
3 C 1
4 D 2
I tried to use the following line, but apparently it did not work:
df2$name[df2$dom_by==df1$cod] <- df1$name[df2$dom_by==df1$cod]
It keeps saying that the replacement has 92 rows but the data has 2 (df1 has 92 rows and df2 has 2).
Although it seems like a simple problem, I still cannot solve it, even after some searching.
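The error comes from recycling: the logical mask on the left selects a different number of elements than the vector on the right. One way around it is match(), which builds an index of the same length as df2. A minimal sketch with made-up small versions of the data, assuming df1 is the lookup table and df2$dom_by refers to df1$cod:

```r
# Hypothetical small versions: df1 is the lookup, df2 is the table to update
df1 <- data.frame(cod = 1:4, name = c("A", "B", "C", "D"))
df2 <- data.frame(cod = 5:6, name = c("X", "Y"), dom_by = c(3, 1))

# match() returns, for each df2$dom_by, the row position in df1$cod,
# so the replacement has exactly nrow(df2) elements -- no recycling error
df2$name <- df1$name[match(df2$dom_by, df1$cod)]
df2
```

After this, each df2$name is the df1$name whose cod equals df2$dom_by.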
I am trying to transform my data-frame from wide format to long format. I have seen many questions already posted here regarding this, but it is not quite what I am looking for / I do not see how to apply it to my problem.
The data-frames share some columns like Name, SharedVal, etc., but each also has columns the other does not have.
What I want to achieve:
Merge these two dataframes on UserID, but have as many rows per UserID as there are values of MeasureNo.
So if there have been two measurements for a user, there will be two rows with the same user id.
And the rows have the same length, but some columns have different entries/no entry at all.
Example:
Dataset1:
UserID Name MeasureNo SharedVal1 SpecificVal1
1 Anna 1 42 8
2 Alex 1 28 50
and
Dataset2:
UserID Name MeasureNo SharedVal1 DifferentVal1
1 Anna 2 15 99
2 Alex 2 33 45
And they should be merged into:
UserID Name MeasureNo SharedVal1 SpecificVal1 DifferentVal1
1 Anna 1 42 8 -
1 Anna 2 15 - 99
2 Alex 1 28 50 -
2 Alex 2 33 - 45
and so on...
The problem is, the dataset is huge, with a lot of rows and columns, so I thought that merging them on the ID and then reshaping would be the most generic approach. But I could not achieve the expected behaviour.
What I am trying to say programmatically is:
"Merge the two dataframes based on the UserID and create as many rows per UserID as there are different times of measurement (MeasureNo). Both rows obviously have the same number of columns, so in both rows, some values in certain columns cannot be filled."
Sorry, I am new to SO and this was my best approach to visualizing a table, with each row starting on a new line and Key:Val representing a column inside that row.
You can do an outer join:
new_df <- merge(df1, df2, all = TRUE)
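Applied to the example data above (a sketch; the two data frames are reconstructed from the question):

```r
df1 <- data.frame(UserID = 1:2, Name = c("Anna", "Alex"),
                  MeasureNo = 1, SharedVal1 = c(42, 28),
                  SpecificVal1 = c(8, 50))
df2 <- data.frame(UserID = 1:2, Name = c("Anna", "Alex"),
                  MeasureNo = 2, SharedVal1 = c(15, 33),
                  DifferentVal1 = c(99, 45))

# Full (outer) join: rows are matched on all shared columns
# (UserID, Name, MeasureNo, SharedVal1); columns unique to one
# data frame are filled with NA in rows coming from the other
new_df <- merge(df1, df2, all = TRUE)
new_df
```

This yields four rows, one per (UserID, MeasureNo) pair, with NA standing in for the "-" placeholders shown in the expected output.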
I have data (df) like this with 50 diagnosis codes (dx.1 through dx.50) per patient:
ID dx.1 dx.2 dx.50
1 150200 140650 250400
2 752802 851812 NA
3 441402 450220 NA
4 853406 853200 150404
5 250604 NA NA
I would like to select the rows that have any of the diagnosis codes starting with "250". So in the example, it would be ID 1 and 5.
After stumbling around for a while, I finally came up with this:
# Flag rows where any dx column contains a code starting with "250"
df$select <- rowSums(sapply(df[, 2:ncol(df)], function(x) grepl("\\<250", x)))
selected <- df[df$select > 0, ]
It's kind of clunky and takes a while since I'm running it on several thousand rows.
Is there a better/faster way to do this?
Is there an easy way to extend this to multiple search criteria?
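A vectorized alternative that skips the regex engine (a sketch; the data frame is reconstructed from the example, and substr() assumes all prefixes are 3 characters long — for mixed lengths, test each prefix with startsWith() instead):

```r
df <- data.frame(ID = 1:5,
                 dx.1 = c("150200", "752802", "441402", "853406", "250604"),
                 dx.2 = c("140650", "851812", "450220", "853200", NA),
                 dx.50 = c("250400", NA, NA, "150404", NA))

# Multiple search criteria are just a longer vector, e.g. c("250", "401")
prefixes <- c("250")

# For each dx column, test whether the first 3 characters match any prefix
# (NA codes simply fail the %in% test); Reduce() ORs the per-column
# logical vectors together into one row mask
hit <- Reduce(`|`, lapply(df[-1], function(x) substr(x, 1, 3) %in% prefixes))
selected <- df[hit, ]
selected$ID
```

On the example data this keeps IDs 1 and 5, and %in% against a prefix vector extends naturally to several criteria at once.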
I would like to implement the following solution:
I have datasets that are generated every 5 minutes. They have several columns and rows, where the rows are the variables and the columns are the metrics. The issue is that I would like to load each entire dataset and assign it the time at which it was loaded, applying that time to all of the variables and their exact metric values. Does that make any sense?
I thought of adding a column with the time as a metric for each one of the variables, but I would like to know if I could do it globally for a data frame in R, so that every time I access this data frame, the time of the extraction is kept in an internal variable that applies to each element of the data frame.
Thank you.
EDIT: Imagine the following table at time T1, which I load into a data frame called df1:
PlaneID FlightTime Passengers Cost
1 123 5 6
2 34 4 2
3 93 3 1
Now the very next day I receive a new data report, at time T2, and save it in df2:
PlaneID FlightTime Passengers Cost
1 33 10 16
2 134 2 1
3 393 3 6
Now, what I'd like to do (even though the reports live in different dataframes) would be, for example, to analyze the number of passengers for each plane during T1 and T2, and create a time series out of it.
I hope that helps.
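One generic approach is to stamp each data frame with its load time in an extra column, stack the data frames, and read the per-plane series off the combined table. A sketch with the example reports above (the two timestamps are made up to stand in for T1 and T2):

```r
df1 <- data.frame(PlaneID = 1:3, FlightTime = c(123, 34, 93),
                  Passengers = c(5, 4, 3), Cost = c(6, 2, 1))
df2 <- data.frame(PlaneID = 1:3, FlightTime = c(33, 134, 393),
                  Passengers = c(10, 2, 3), Cost = c(16, 1, 6))

# Record the load time once per data frame, as an extra column;
# in practice this could be Sys.time() at the moment of loading
df1$loaded_at <- as.POSIXct("2024-01-01 12:00:00", tz = "UTC")  # T1 (example)
df2$loaded_at <- as.POSIXct("2024-01-02 12:00:00", tz = "UTC")  # T2 (example)

# Stack the reports; every metric value now carries its extraction time
all_loads <- rbind(df1, df2)

# Passengers over time for plane 1 -- ready to turn into a time series
plane1 <- all_loads[all_loads$PlaneID == 1, c("loaded_at", "Passengers")]
plane1
```

From the stacked table, any metric for any plane can be pulled out as a (time, value) series.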
I printed out the summary of a column variables as such:
summary(document$subject)
A, B, C, D, E, F, ... are the subjects belonging to a column of a data.frame, where A, B, C, ... each appear many times in the column, and the summary above shows the number of times (the frequency) these subjects appear in the file. The term "OTHER" refers to subjects that appear only once in the file; I also need to assign "1" to those subjects.
There are so many different subjects that it is difficult to list them all with c().
I want to build a new column (or data.frame) and then assign these corresponding numbers (scores) to the subjects. Ideally, the file would become this:
A 198
B 113
C 96
D 69
A 198
E 65
F 62
A 198
C 113
BZ 21
BC 1
CJ 1
...
I wonder what command I should use to take the scores/values from the summary table and then build a new column that assigns these values to the corresponding subjects in the file.
Plus, since it is a summary table printed by R, I do not know how to turn it into a table in a file, or how to extract the values and subject names from it. I also wonder how I could find the subject names that appear only once in the file, which the summary table added up into "OTHER".
Your question is hard to interpret without a reproducible example. Please take a look at this thread for tips on how to create one:
How to make a great R reproducible example?
Having said that, here is how I interpret your question. You have two data frames, one with a score per subject and another with the subjects multiple times in a column:
Sum <- data.frame(subject=c("A","B"),score=c(1,2))
foo <- data.frame(subject=c("A","B","A"))
> Sum
subject score
1 A 1
2 B 2
> foo
subject
1 A
2 B
3 A
You can then use match() to match the subjects in one data frame to the other and create the new variable in the second data frame:
foo$score <- Sum$score[match(foo$subject, Sum$subject)]
> foo
subject score
1 A 1
2 B 2
3 A 1
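For the frequency part of the question, a sketch with a made-up subject column: table() gives the per-subject counts, indexing that table by the subject column maps the counts back onto every row, and subjects that appear only once naturally receive a score of 1, matching the "OTHER" rule.

```r
document <- data.frame(subject = c("A", "B", "C", "A", "A", "BC"))

# Frequency of each distinct subject
counts <- table(document$subject)

# Index the table by the subject column to assign each row its count;
# as.integer() drops the names the table indexing carries along
document$score <- as.integer(counts[document$subject])

# Subjects that appear only once -- these are what summary() lumps
# into "OTHER", and they get a score of 1 automatically
once <- names(counts)[counts == 1]
document
```

Here "A" gets 3 on each of its rows, while "B", "C", and "BC" each get 1, and `once` lists exactly those single-occurrence subjects.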