Special thanks to Fernando Pérez-García (UCL/KCL) for explaining PyTorch conventions and tensor permutations.
WE ARE HIRING - see job opportunities here if interested!
NCI IDC is a new component of the Cancer Research Data Commons (CRDC). The goal of IDC is to enable a broad spectrum of cancer researchers, with and without imaging expertise, to easily access and explore the value of de-identified imaging data and to support integrated analyses with non-imaging data. IDC maintains cancer imaging data collections in Google Cloud Platform, and is developing tools and examples to support cloud-based analysis of imaging data.
Some examples of what you can do with IDC:
In this project we would like to interact with Project Week participants to answer their questions about IDC, understand their needs, collect feedback and suggestions about the functionality users would like to see in IDC, and help users get started with the platform.
Free cloud credits are available to support the use of IDC for cancer imaging research.
The broad motivation for this experiment is to enrich the IDC data offering by improving the richness of the metadata that accompanies IDC content.
An experiment that can be completed within the Project Week is to implement a tool for tagging individual series within an MRI exam with their series type. The experiment will follow the categorization of individual series proposed in Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features.
Automatic tagging of individual series within a DICOM study is a valuable capability that is currently missing, and it is important for feeding data into subsequent analysis steps.
The idea for the experiment is to develop a tool that can tag individual series using, as needed, DICOM metadata and the content of the images, with the metadata table from the paper mentioned above serving as a source of inspiration, if not of training/testing data.
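As a rough illustration of what such a tool could start from (the function and keyword rules below are hypothetical and far simpler than the categorization in the paper), individual series could first be tagged from DICOM metadata alone, with image content used only when the metadata is ambiguous:

```python
# Hypothetical sketch (not the published approach): tag an MRI series using
# simple keyword rules on the SeriesDescription DICOM attribute.
import pydicom

def guess_series_type(dicom_file: str) -> str:
    ds = pydicom.dcmread(dicom_file, stop_before_pixels=True)
    desc = str(ds.get("SeriesDescription", "")).lower()
    if "flair" in desc:
        return "FLAIR"
    if "t2" in desc:
        return "T2"
    if "t1" in desc and any(k in desc for k in ("gd", "post", "+c")):
        return "T1-Gd"
    if "t1" in desc:
        return "T1"
    return "unknown"
```

In practice SeriesDescription is filled in inconsistently across institutions, which is exactly why the image content itself may be needed as an additional input.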
An additional, and probably key, feature of this experiment is that it is cloud native. This means that all resources and data stay within the cloud data center. This is expected to bring insights on efficient working setups that utilize the cloud infrastructure and to provide an update on the current barriers to entry for performing research on cloud resources.
Visit the “IDC-Bot” stream set up by Theodore under the Discord project channel to watch short demo videos about IDC.
The only setup requirement for utilizing the power of IDC is a Google Cloud account. This account has to be set up only once, and if the user already uses, or has used in the past, any Google Cloud products, everything is already in place.
Keep in mind that Google provides free credits to new users and IDC does the same for existing users (fill in the form here).
This experiment utilized the following APIs:
In real life you would probably want to add the following APIs to the mentioned ones:
The experiment utilized the free tools provided by Google to all of its users, to see if such research can be conducted without the cloud infrastructure “heavy lifting”. The main computation platform was the free version of Colab notebooks, which were stored in a Google Drive folder.
All the notebooks created for this experiment are available in the GitHub repository. Run them in Google Colab now:
001_IDC_&_ReferenceData_Exploration.ipynb
By default Colab provides instances with 2 cores and 12 GB of RAM. With an additional GPU that you can attach to the notebook, this is enough for most tasks. For comparison, the preprocessing was also done on a 12-core, 32 GB RAM instance to see if additional multiprocessing can boost performance.
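As a hypothetical illustration of that kind of multi-core preprocessing (the preprocess_series function and the ./data layout are assumptions, not code from the experiment), the per-series work could be spread over the available cores like this:

```python
# Hypothetical sketch: spread per-series preprocessing over the available CPU cores.
# preprocess_series is a placeholder; ./data/<class>/<series> is an assumed layout.
from multiprocessing import Pool
from pathlib import Path

def preprocess_series(series_dir: Path) -> str:
    # Placeholder for the real work (e.g. DICOM-to-volume conversion, resampling).
    return series_dir.name

if __name__ == "__main__":
    series_dirs = sorted(Path("./data").glob("*/*"))
    # 12 worker processes to match the 12-core VM mentioned above.
    with Pool(processes=12) as pool:
        results = pool.map(preprocess_series, series_dirs)
    print(f"Preprocessed {len(results)} series")
```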
The use of a dedicated VM can boost performance if the scripts enable multiprocessing for computation. Additionally, firing up multiple instances of the gsutil command can speed up data transfer. For example, during the experiment the command

cat "$TARGET_CLASS"_gcs_paths.txt | gsutil -u "$MY_PROJECT_ID" -m cp -I ./data/"$TARGET_CLASS"

was executed in 4 different screen sessions simultaneously to test the download speed. The results were 16 MBps when only one gsutil command was running and 8 MBps when 4 gsutil commands were running.
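For reference, a hypothetical way to reproduce that parallel-download setup is sketched below; the class names and the MY_PROJECT_ID value are assumptions, and each screen session runs one independent gsutil pipeline:

```bash
# Hypothetical sketch of the parallel-download setup described above:
# one detached screen session per series class (class names are assumptions).
MY_PROJECT_ID="my-gcp-project"   # replace with your own billing project
for TARGET_CLASS in t1 t1gd t2 flair; do
  mkdir -p "./data/${TARGET_CLASS}"
  screen -dmS "dl_${TARGET_CLASS}" bash -c \
    "cat ${TARGET_CLASS}_gcs_paths.txt | gsutil -u ${MY_PROJECT_ID} -m cp -I ./data/${TARGET_CLASS}"
done
```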
from google.colab import auth
auth.authenticate_user()
%%bigquery --project=$
SELECT *
FROM <my_cohort_BQ_table>
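The project argument in the cell above is truncated. A complete cell would look roughly like the following; the Python variable my_project_id (holding the Google Cloud project to bill the query to) is an assumption, while cohort_df is the destination DataFrame used in the next step:

```
%%bigquery cohort_df --project=$my_project_id
-- Assumed completion of the truncated cell above: my_project_id is a
-- hypothetical Python variable holding the billing project ID, and the query
-- result is stored in the cohort_df DataFrame manipulated below.
SELECT *
FROM <my_cohort_BQ_table>
```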
cohort_df = cohort_df.join(cohort_df["gcs_url"].str.split('#', n=1, expand=True).rename(columns={0: 'gcs_url_no_revision', 1: 'gcs_revision'}))
cohort_df["gcs_url_no_revision"].to_csv("gcs_paths.txt", header=False, index=False)
!mkdir downloaded_cohort
!cat gcs_paths.txt | gsutil -u
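The last command is cut off. Based on the per-class download command shown earlier, the complete cell presumably looked something like the following (the exact flags and the MY_PROJECT_ID variable are assumptions):

```
# Assumed completion of the truncated command above, mirroring the earlier
# per-class download: bill requests to the user's project and copy the listed
# objects into the downloaded_cohort directory created by the previous cell.
!cat gcs_paths.txt | gsutil -u "$MY_PROJECT_ID" -m cp -I ./downloaded_cohort
```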